[00:46:46] --- kaj has become available
[01:05:10] --- Russ has left: Disconnected
[01:58:47] --- haba has become available
[02:11:18] --- Simon Wilkinson has become available
[02:17:30] --- Simon Wilkinson has left
[03:49:59] --- Jeffrey Altman has left: Replaced by new connection
[05:07:10] --- haba has left
[05:07:11] --- haba has become available
[05:28:08] --- jaltman has left: Disconnected
[06:20:36] --- jaltman has become available
[06:27:12] --- haba has left
[06:39:28] --- haba has become available
[06:48:11] --- haba has left
[06:48:15] --- haba has become available
[06:53:28] We had a situation where our console would log: "Tokens for user of AFS id 5575 for cell physto.se: rxkad error=19270408" several times a second. Shouldn't there be a brake somewhere in the code to prevent that? The node was effectively DoSed.
[06:57:04] not necessarily all from same server?
[06:57:27] This was on the client.
[06:57:57] * haba doesn't quite understand your question.
[07:00:55] yes. the error doesn't come from the client, so, i don't understand what you don't understand
[07:01:25] The error is in the client's kernel log and the client gets DoSed.
[07:01:51] being told your tokens are no good is not being DoSed.
[07:02:02] tcpdump and see what server returns the error
[07:02:13] The user is not getting the message, it's in the kernel log.
[07:03:01] the problem was discarding the tokens was wrong. not discarding the tokens is wrong. it all sucks.
[07:04:36] Ok, some other server says that the tokens are no good. Fine. It does not help to report that into the kernel log in a way that makes the computer crawl. So can we just NOT tell the kernel log?
[07:05:27] The only person who can do something about it, the user, is not allowed to read the kernel log anyway, nor will he look there in the first place.
[07:08:10] --- jaltman has left: Disconnected
[07:13:23]
Apr 22 21:21:17 a07c01n08 kernel: afs: Tokens for user of AFS id 5575 for cell physto.se: rxkad error=19270408
Apr 22 21:21:34 a07c01n08 last message repeated 1171 times
Apr 22 21:22:05 a07c01n08 last message repeated 2239 times
Apr 22 21:22:37 a07c01n08 last message repeated 2331 times
Apr 22 21:23:19 a07c01n08 last message repeated 1142 times
Apr 22 21:23:58 a07c01n08 last message repeated 2218 times
Apr 22 21:24:38 a07c01n08 last message repeated 2950 times
And these are only the "highlights".
[07:14:04] --- jaltman has become available
[07:16:42] In that state, there is very little CPU left for doing things like letting root log in. Exactly what the user did is not quite clear, but I suspect a cd /afs ; ls -laRt or something like that.
[07:21:37] --- deason has become available
[07:22:30] afs_warnuser() isn't useful on most platforms, sadly
[07:22:55] ls -l /afs shouldn't send one set of tokens to every cell
[07:23:29] * haba does not know what the user did, just guessing.
[07:23:52] I know from an lsof that there was an ls active, but not very much more.
[07:25:02] I am a bit puzzled because the code in afs_analyze.c around line 750 suggests that there is a counter.
[07:25:29] an ls -l might do it. what client version, anyway?
[07:25:36] 1.4.11
[07:25:42] oh. boring.
[07:25:50] fixed error?
[07:25:55] well, maybe
[07:26:11] but there were changes in that code, so it's not worth debugging code that doesn't look like that anymore
[07:26:33] As the server had an uptime > release date of 1.4.12 it would have been difficult to run 1.4.12 ;)
[07:26:56] but now we should upgrade, yes.
[07:26:57] linux?
[07:27:05] oh yes.
[07:27:17] not hard to run 1.4.12. stop afs. install new. start afs.
[07:27:22] or ksplice in a new one
[07:27:35] so, "hard"
[07:27:39] meh users will notice
[07:27:47] unlike now?
[07:27:57] the crash was the 22nd
[07:27:58] --- deason has left
[07:28:00] and ksplice, they won't
[07:28:12] --- deason has become available
[07:28:32] Hmmm
[07:28:44] The ksplice thingie works?!
[07:28:48] ??
[07:29:26] Or rather: Does the ksplice thingie actually work?
[07:29:48] wonder if any of the ksplice people are here
[07:30:35] --- Kevin Sumner has left
[07:30:50] --- Kevin Sumner has become available
[07:34:47] I think ksplice is a little bit too much futuristic music.
[07:37:00] --- jaltman has left: Replaced by new connection
[07:37:01] --- jaltman has become available
[07:49:59] I knew how to diff two versions in the cvs web thingie, but that git web baffles me.
[07:51:04] Somehow I want to get the diff of two versions (versions identified by tags) of a file, in this example afs_analyze.c.
[08:04:20] --- haba has left
[08:07:07] --- haba has become available
[08:08:27] 1.4.12:
757 if (serversleft) {
758     afs_warnuser
759         ("afs: Tokens for user of AFS id %d for cell %s: rxkad error=%d\n",
760          tu->vid, aconn->srvr->server->cell->cellName, acode);
761     shouldRetry = 1;
...
[08:08:49] shouldRetry = 1 infinitely?
[08:08:50] Something like: git diff openafs-stable-1_4_11 openafs-stable-1_4_12 -- ./src/afs/afs_analyze.c
[08:13:39] thanks (much easier in my local copy, doh!)
[08:14:05] Yeah, I'm not sure how to coax gitweb into doing that.
[08:14:24] Especially not by following links.
[08:15:09] I'm not looking at the code right now, but that 'shouldRetry = 1' should just be for when there are more servers/rosites to contact for the vol
[08:15:24] I thought
[08:17:13] i assume he's looking at the code which is biased by BlackListOnce
[08:17:24] i'm not looking this second either
[08:17:57] When the tokens expire afs_warnuser(.. have expired). When serversleft == 0 then "tokens are discarded" but what happens in the case in between is the question.
[08:18:50] you retry.
[08:19:30] like...
what happens if your admin is a jackass and left one non-krb5-token-capable db server up because... well, they were a jackass?
[08:19:34] there is this areq->tokenError counter hmmmm
[08:19:46] you could not retry, and then, hey, round robin chance of losing your tokens! yay!
[08:19:54] someone else's admin is a jackass
[08:20:37] someone else might be physto.se in this case
[08:21:02] I say might because we do not know what the user actually did.
[08:21:15] some user got tokens for the jackass' cell, then
[08:21:58] But I suppose it could be smart to look at areq->tokenError and do something if it gets BIG. Like 1000.
[08:23:54] Oh yes, we have users who come from cells that have other admins ;-)
[08:27:37] (130.237.205.36 and 130.237.205.72 run 1.4.6 AFS. All services. And client)
[08:46:13] --- kaj has left
[08:57:49] All DB servers in physto.se are 1.4.6 (there are 5 of them)
[08:58:23] it's still not really interesting with a 1.4.11 client
[08:58:55] me knows but I don't want to risk running into the same wall with 1.4.12.
[08:59:39] --- jaltman has left: Disconnected
[08:59:43] So do you think making a patch and bailing out if areq->tokenError is too big would help?
[08:59:59] not necessarily. you assume it's the same areq
[09:00:19] true
[09:00:33] I only suspect it is the same token.
[09:03:39] sure. so. a patch which helps you see if that's the issue, if there even is an issue, sure.
[09:12:35] * haba will probably do a patch that watches some counters and let everyone know if I get any smarter from that.
[09:12:58] Now it is EOW (End Of Work) in this TZ.
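The two ideas floated in the discussion above, a brake on the repeated warning and bailing out once areq->tokenError gets BIG, can be sketched in plain C. This is only an illustration under stated assumptions, not OpenAFS code: `struct request_sketch`, `should_log()`, `note_token_error()` and `TOKEN_ERROR_LIMIT` are hypothetical names, and only the `tokenError` field name and the threshold of 1000 come from the chat.

```c
#include <stdio.h>

/* Hypothetical stand-in for the per-request state. Only the tokenError
 * name is taken from the discussion; the real structure is different. */
struct request_sketch {
    int tokenError;   /* token errors seen so far for this request */
};

/* The "BIG, like 1000" threshold suggested in the chat. */
#define TOKEN_ERROR_LIMIT 1000

/* Decide whether the n-th occurrence (n >= 1) is worth a log line:
 * the first one, then only when n is a power of two. */
int should_log(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Record one token error for this request.
 * Returns 1 if the caller may retry, 0 once the limit is reached. */
int note_token_error(struct request_sketch *req, int acode)
{
    req->tokenError++;
    if (should_log((unsigned int)req->tokenError)) {
        fprintf(stderr, "sketch: rxkad error=%d (seen %d times)\n",
                acode, req->tokenError);
    }
    return req->tokenError < TOKEN_ERROR_LIMIT;
}
```

With this shape a request that keeps failing writes a handful of lines instead of thousands and stops retrying at the limit. As pointed out in the discussion, that only helps if the errors really accumulate on the same request; if every failure arrives on a fresh request, the counter never grows and a different key (for example per user and cell) would be needed.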
[09:14:27] --- haba has left
[09:50:56] --- kaj has become available
[10:18:44] --- jaltman has become available
[10:19:45] --- Russ has become available
[10:22:02] --- jaltman has left: Disconnected
[10:22:25] harald may be helped by 1851
[10:26:57] --- jaltman has become available
[11:15:51] --- jaltman has left: Disconnected
[11:37:50] --- jaltman has become available
[12:10:56] --- Simon Wilkinson has become available
[12:20:45] --- jaltman has left: Replaced by new connection
[12:20:46] --- jaltman has become available
[12:38:40] --- Simon Wilkinson has left
[13:31:34] --- Simon Wilkinson has become available
[13:37:24] --- Simon Wilkinson has left
[13:56:49] --- deason has left
[13:57:12] --- deason has become available
[14:02:25] --- jaltman has left: Disconnected
[14:39:08] --- jaltman has become available
[14:43:27] --- deason has left
[14:44:01] --- deason has become available
[14:44:13] --- deason has left
[14:44:39] --- deason has become available
[14:44:41] --- jaltman has left: Disconnected
[15:23:52] --- mdionne has become available
[15:47:38] --- deason has left
[16:10:32] --- Simon Wilkinson has become available
[16:13:33] --- deason has become available
[17:37:38] --- Russ has left: Disconnected
[18:12:17] --- Russ has become available
[18:35:36] --- mdionne has left
[20:18:11] --- jaltman has become available
[20:42:17] --- mho has left
[20:49:07] --- jaltman has left: Disconnected
[21:08:32] --- Born Fool has become available
[21:51:28] --- jaltman has become available
[21:52:30] --- deason has left
[22:04:18] --- Born Fool has left
[23:05:58] --- jaltman has left: Replaced by new connection
[23:05:59] --- jaltman has become available
[23:19:38] --- kaj has left