[00:55:17] --- dev-zero@jabber.org has become available
[03:09:16] --- dev-zero@jabber.org has left
[04:34:29] --- haba has become available
[07:26:36] --- summatusmentis has left
[07:26:45] --- deason has become available
[08:10:53] --- dev-zero@jabber.org has become available
[08:12:59] --- haba has left
[08:50:27] change stats for last night: sxw 34 shadow 11 jaltman 3 matt 1 adeason 1
[09:26:02] --- asedeno has left
[09:26:02] --- asedeno has become available
[09:26:05] --- asedeno has left
[09:26:16] --- asedeno has become available
[09:32:10] --- mmeffie has become available
[10:06:47] --- dev-zero@jabber.org has left
[10:08:58] --- Russ has become available
[10:11:07] --- summatusmentis has become available
[13:22:29] --- dev-zero@jabber.org has become available
[13:40:03] --- mmeffie has left
[14:14:20] How do I resolve "CallPreamble: Couldn't get CPS. Too many lockers"?
[14:14:23] What should I be looking for?
[14:24:02] I think the "Couldn't get CPS" is misleading. It may mean that, or it may mean it never even got that far because there are too many outstanding calls from that host. Basically, it means h_FindClient_r returned NULL, which means h_GetHost_r did.
[14:26:20] We're seeing this happen on all of our servers, but particularly badly on one of them right now. That server is also, from the client perspective, very slow, with occasional "waiting for busy volume" messages, but no "lost contact with file server".
[14:26:24] 1.4.11
[14:26:58] Possibly related: we're seeing volumes randomly drop off-line during the nightly vos backupsys, generally one or two a night. The server suddenly decides they need to be salvaged and takes them off-line.
[14:30:02] --- dev-zero@jabber.org has left
[14:30:04] One client, or many clients? When a client host exceeds its thread quota, additional calls result in that message and the call failing with VBUSY.
[14:30:34] what OS?
[14:33:08] Many clients. Debian.
[14:33:34] Ah, interesting.
[14:33:41] * Russ is doing an rxdebug right now to get a count of connections by client.
[14:34:20] Mon Sep 28 14:33:39 2009 CallPostamble: null ahost for thost 171.67.216.6:7001 (712f8b08)
[14:34:23] Is that related?
[14:35:17] I'm sorry; I meant what server OS. Not sure if that's related or not; I'd have to reread the code.
[14:35:23] Debian is the server OS.
[14:35:32] It comes immediately after the Couldn't get CPS message.
[14:35:50] Hm, 33 connections from one host and 24 from another.
[14:36:04] I'd expect the Couldn't get CPS message to be the last one produced in any given RPC thread.
[14:36:08] Ah, those are the VLDB servers.
[14:36:20] rxdebug -onlyclient
[14:36:25] err... -onlyserevr
[14:36:31] ... and pretend I can spell
[14:36:33] 19, 15, 14 connections from www servers.
[14:36:43] 13 connections from some random cable modem client.
[14:36:53] Those don't seem like huge numbers to me, but I don't know how big a number causes problems.
[14:37:41] There are three clients with over ten connections other than the www servers.
[14:38:21] A large number of connections doesn't cause problems. A large number of calls might.
[14:38:22] 'null ahost for thost' is from the postamble; the other message is from the preamble that would cause it; same basic thing, I think.
[14:38:46] * Russ is moving volumes off of the server, since I don't really want to take an outage.
[14:38:56] But the volume moves are going to take a while.
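
A quick way to reproduce the per-host connection counts discussed above is to run rxdebug against the fileserver and tally the peers it reports. The following is a minimal sketch only, assuming rxdebug is in PATH, the fileserver answers on port 7000, and each server-side connection is reported on a line beginning "Connection from host <ip>, port <port>"; check your rxdebug build's actual output before trusting the regex.

    #!/usr/bin/env python
    # Illustrative sketch: count rx connections per client host as seen by a
    # fileserver, using rxdebug output. Assumes "rxdebug <server> 7000 -onlyserver"
    # prints one "Connection from host A.B.C.D, port N, ..." line per connection.
    import re
    import subprocess
    import sys
    from collections import Counter

    def connection_counts(server, port="7000"):
        out = subprocess.run(["rxdebug", server, port, "-onlyserver"],
                             capture_output=True, text=True, check=True).stdout
        hosts = re.findall(r"Connection from host (\d+\.\d+\.\d+\.\d+)", out)
        return Counter(hosts)

    if __name__ == "__main__":
        for host, n in connection_counts(sys.argv[1]).most_common():
            print("%5d  %s" % (n, host))

As noted in the conversation, counts in the tens are not by themselves a problem; it is simultaneous calls, not connections, that run into the per-host thread quota.
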
[14:39:11] If only a few clients are seeing the bad behavior, it's probably because they are pounding on the server somehow, and the message indicates the threadquota mechanism is successfully preventing them from making the server unavailable to anyone else.
[14:39:14] * Russ is opening a ticket in just a moment too.
[14:39:17] anything more specific on when the volumes go offline? or just "volume needs salvage"?
[14:39:20] The www servers are definitely pounding on the server.
[14:39:26] However, they should be.
[14:39:29] I need them to be able to.
[14:39:40] deason: Nope, nothing. Kind of frustrating.
[14:40:02] Salvage never finds anything wrong except for the normal old inode version crap and the regular sorts of things you get when you salvage any volume.
[14:40:20] --- stevenjenkins has left
[14:41:09] old inode version crap has actually been "the real error" before
[14:41:20] thread quota is a maximum of 5 threads waiting on a host lock, plus one holding it, at any given time, assuming your server has more than 64 threads total.
[14:41:32] I also wouldn't think this is solely from too much load, even from unresponsive volumes; didn't you post a segfault in the host package a little while ago?
[14:41:53] Yes. We think that's the leak that was fixed this weekend.
[14:42:00] Er, not leak, double-free.
[14:42:04] that seems plausible
[14:42:11] Would that be related to this?
[14:42:19] is that fix in what you're running?
[14:42:51] would which be?
[14:42:54] Good heavens, no.
[14:43:01] It takes us a month to upgrade file servers.
[14:43:12] --- stevenjenkins has become available
[14:43:15] I mean, good question, but yeah, no chance yet.
[14:43:36] shadow: The fixes that went in this weekend and this CPS too many lockers problem.
[14:44:22] the double free could be related to too many lockers but it seems unlikely. the cb_in should be a connection which is basically abortive, not locked
[14:44:34] to get there, you'd have to have 6 simultaneous calls, _and_ they'd all have to be in that same code path at once -- we don't keep that lock for long. That means that either the client is _really really_ beating on the fileserver, or trying to DoS it, or something is wrong (like the webmail thing that cmu and iastate saw)
[14:44:36] actually, maybe i'm wrong
[14:44:57] no, looking, pretty sure i'm not wrong
[14:45:01] Bear in mind that we get a lot of aborted connections from the www servers when people write CGI scripts that try to access files they're not allowed to.
[14:45:17] no rx aborts
[14:45:18] wait, double free? not gerrit 496/497?
[14:45:19] Aborted calls, rather.
[14:45:22] Not connections.
[14:45:25] 496/7 was a leak, I thought
[14:45:30] abortive calls in the sense of "do we need a host for this? no, we have one"
[14:45:37] The commit message says that it's a leak; the commit fixes a double-free.
[14:45:54] whether it's a double-free depends on the manifestation
[14:45:58] there are 2 bugs
[14:46:08] The crash we were getting was heap corruption.
[14:47:02] if you hang with cb_in waiting, you hang. you can't leak it while it's waiting, the code path hasn't returned yet to leak it
[14:47:56] Without more information, hanging clients and too many host lockers, combined with random unexpected volume offlining, _could_ be some kind of intermittent I/O failure.
[14:49:21] apropos of nothing, having a prepaid SIM with data for the country you happen to be in (and an unlocked iPhone) rocks
[14:49:57] If there's an I/O failure, the kernel doesn't know about it.
[14:50:01] dmesg is clean.
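
For reference, here is a toy model of the quota arithmetic described above (five threads waiting on a host lock plus one holding it, on servers with more than 64 threads). It only illustrates the behavior as described in this conversation and is not the OpenAFS code.

    # Toy model of the per-host thread quota described above; illustrative only,
    # not the OpenAFS implementation. With more than 64 server threads, at most
    # one thread may hold a given host's lock and five more may wait on it.
    HOST_THREAD_QUOTA = 5 + 1   # five waiters plus the lock holder

    def admit_call(active_calls_for_host):
        """Return True if another call from this host may proceed."""
        return active_calls_for_host < HOST_THREAD_QUOTA

    # Example: a host already has six calls holding/waiting on its host lock,
    # so a seventh simultaneous call is refused.
    if not admit_call(6):
        print("CallPreamble: Couldn't get CPS. Too many lockers  -> call fails VBUSY")

A seventh simultaneous call from the same host is the point at which h_GetHost_r gives up, h_FindClient_r returns NULL, the "Couldn't get CPS. Too many lockers" message is logged, and the call fails with VBUSY.
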
[14:50:39] when you are actively getting "couldn't get CPS", dumping the host/client/conn list (with kill -XCPU) to see what's up may help
[14:51:02] you'll also need tcpdump to confirm no (or slow) answers, for the clients in question
[14:51:06] I can just do that now on the running server that's having trouble?
[14:51:18] sure
[14:51:27] Ah, hm. This appears to actually be 1.4.10. That's interesting.
[14:51:33] That might be significant.
[14:51:43] In other news, WTF?
[14:51:58] Why is this server running 1.4.10?
[14:52:04] i blame andrew
[14:52:15] Were there relevant fixes between 1.4.10 and 1.4.11?
[14:52:43] well, there was at least one rx fix, which i don't promise isn't related
[14:53:00] (not deason, may)
[14:54:55] We are running 1.4.10, largely because you had problems with 1.4.11.
[14:55:07] Okay, definite correlation between servers that are seeing volumes go off-line and 1.4.10. And there *was* a fix for that -- that's that bogus I/O issue in the file server, yes?
[14:55:37] Where it misinterpreted a return status and marked a volume off-line incorrectly?
[14:55:40] I seem to recall that going through.
[14:56:12] We are seeing occasional CPS messages on the 1.4.11 servers as well.
[14:56:38] Yeah; if the kernel doesn't report any I/O problems, then the CPS and offline volume things are probably unrelated.
[14:56:52] Okay, in fact, most of our file servers are running 1.4.10.
[14:57:01] Great.
[14:57:29] But we are seeing CPS problems on the 1.4.11 ones as well.
[14:57:43] But not the CallPostamble stuff as much.
[15:02:01] hm, I've only glanced at it, but I don't immediately see a double free in 496
[15:02:16] Usually when the patch is to explicitly set a variable to NULL, it's so that you don't free it again later.
[15:02:17] since the old code just returned in those error conditions
[15:02:24] I could be misreading.
[15:02:31] yeah, but they're local, and we used to return
[15:02:37] We were definitely seeing heap corruption, though, so if it's just a leak, it's not our problem.
[15:02:42] Which means our problem still exists.
[15:03:05] However, Derrick said that he thought he fixed a leak as well, up above.
[15:03:10] that's what I was thinking, but I haven't spent time with 496
[15:07:45] yes, there was a 1.4.10 rx fix for bogus client errors
[15:07:59] and yes, it's possible that fix could help with excessive retries
[15:08:22] are any volumes replicated to both 1.4.10 and 1.4.11 servers?
[15:09:05] the leak was hosts. the double-free ISTR was capabilities
[15:09:31] we had verification of the host leak. not of the capabilities double-free. it just visually looked possible
[15:12:33] i had just hoofed it to a train on too little sleep and coping with slow network. ymmv.
[15:16:31] Yes, we definitely have volumes replicated to both 1.4.10 and 1.4.11 servers.
[15:17:24] On the 1.4.10 server that I'm currently evacuating, all of the CallPostamble: null ahost errors are for our www servers. The CallPreamble: CPS errors of course don't include the host. (Modifying them to include the host would be nice.)
[15:17:36] I haven't dumped a client table yet. I suppose I should do that.
[15:18:58] Okay, dumped the client table. What am I looking for?
[15:19:31] i gave tracy a patch to modify that. i wonder what i did with it
[15:19:56] you'd be looking for the flag on the host objects indicating a GetCPS call is active
[15:20:07] The client table seems very short. In particular, it doesn't seem to include any of our www servers.
[15:20:18] Oh, hosts.
[15:20:20] That's bigger.
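
For the kill -XCPU host/client dump suggested at 14:50:39, something like the sketch below can drive it. The pgrep lookup and the dump locations are assumptions, not confirmed by the conversation: the dumps typically land in the server "local" directory (e.g. /usr/afs/local, or /var/lib/openafs/local on Debian) under names like hosts.dump and client.dump. Adjust for the local installation and run it as a user allowed to signal the fileserver.

    #!/usr/bin/env python
    # Illustrative sketch: ask a running fileserver to dump its host/client
    # tables by sending SIGXCPU, then report where the dump files (assumed
    # names and paths) ended up. Requires privileges to signal the fileserver.
    import os
    import signal
    import subprocess

    DUMP_DIRS = ["/usr/afs/local", "/var/lib/openafs/local"]   # assumed locations

    def fileserver_pid():
        # Assumes a single fileserver process and that pgrep is available.
        return int(subprocess.check_output(["pgrep", "-x", "fileserver"]).split()[0])

    def request_dump():
        os.kill(fileserver_pid(), signal.SIGXCPU)
        for d in DUMP_DIRS:
            for name in ("hosts.dump", "client.dump"):
                path = os.path.join(d, name)
                if os.path.exists(path):
                    print("dump written (or updated) at", path)

    if __name__ == "__main__":
        request_dump()
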
[15:21:30] Typical entry:
[15:21:33] ip:171.67.216.87 port:7001 hidx:66 cbid:124374 lock:0 last:1254176272 active:1254176272 down:0 del:0 cons:0 cldel:0 hpfailed:0 hcpsCall:1254170639 hcps [ -10521 -320 -319] [ 171.67.216.87:7001] holds: 00010000000000000000 slot/bit: 0/1
[15:22:10] There are three entries where lock is not 0.
[15:22:14] Is that what I'm looking for?
[15:25:34] Crap. File server just restarted due to too many blocked connections.
[15:27:01] er, yes
[15:30:53] Well, we just lost all ability to get data, and apparently I'm taking that outage anyway.
[15:31:04] * Russ waits for the file server to finish releasing callbacks.
[15:33:08] you don't by chance see "CallPreamble: Couldn't get CPS. Reconnect to ptserver" anywhere in FileLog, do you?
[15:33:31] I wouldn't expect it to be that, but the uclient reinit bug could theoretically result in writing to freed mem, iirc
[15:35:36] No.
[15:35:42] None of those.
[15:52:25] --- deason has left
[17:05:27] --- deason has become available
[17:14:43] --- summatusmentis has left
[17:14:43] --- Rrrrred has left
[17:14:46] --- dwbotsch has left
[17:14:47] --- kula has left
[17:17:31] --- dwbotsch has become available
[17:18:54] --- kula has become available
[17:55:37] --- summatusmentis has become available
[18:11:15] --- Russ has left: Disconnected
[18:28:56] --- Russ has become available
[19:43:21] --- Jeffrey Altman has become available
[21:59:02] --- deason has left
[23:47:58] --- haba has become available
[23:58:13] --- haba has left
[23:58:17] --- haba has become available
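
Returning to the host-table entries shown at 15:21:33: the hosts worth a closer look are the ones with lock not equal to 0, as confirmed in the exchange above. A small sketch for pulling those out of a dump file follows; it assumes each entry is a single line in the ip:/port:/lock: format shown at 15:21:33 and is illustrative only.

    #!/usr/bin/env python
    # Illustrative sketch: scan a fileserver host-table dump for entries whose
    # host lock is held (lock != 0), using the "ip:... port:... lock:N ..." line
    # format shown above. The dump file path passed on the command line is an
    # assumption about where your installation writes it.
    import re
    import sys

    ENTRY = re.compile(r"ip:(?P<ip>\S+)\s+port:(?P<port>\d+).*?\block:(?P<lock>\d+)")

    def locked_hosts(path):
        with open(path) as f:
            for line in f:
                m = ENTRY.search(line)
                if m and int(m.group("lock")) != 0:
                    yield m.group("ip"), m.group("port"), m.group("lock")

    if __name__ == "__main__":
        # e.g. python find_locked_hosts.py /usr/afs/local/hosts.dump
        for ip, port, lock in locked_hosts(sys.argv[1]):
            print("host %s:%s lock=%s" % (ip, port, lock))
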