[01:21:42] --- Born Fool has left [02:25:59] --- Russ has left: Disconnected [06:45:34] --- jaltman has left: Replaced by new connection [06:45:35] --- jaltman has become available [06:46:55] --- meffie has become available [07:03:36] --- reuteras has left [07:43:00] --- jaltman has left: Disconnected [07:43:07] --- jaltman has become available [09:01:50] --- deason has become available [09:50:55] --- deason has left [09:52:20] --- deason has become available [10:20:26] --- rod has left: Disconnected [10:33:06] --- Kevin Sumner has left [10:43:00] --- rra has become available [11:02:48] --- deason has left [11:31:41] --- deason has become available [11:51:24] --- meffie has left [12:21:00] --- meffie has become available [12:35:15] --- deason has left [12:43:19] --- deason has become available [13:06:06] Argh. When afs_Analyze sees an authentication error, it calls afs_BlackListOnce and then if that function says there are more servers to try, tells the caller to retry instead of discarding the bad token. Unfortunately, when called with no FID (as for VLDB lookups), afs_BlackListOnce always returns true without doing anything. So, if you have bad tokens and try to talk to a vlserver, you spin forever. [13:10:21] well, that sucks [13:12:24] and looking at the code i remember one of the details; given that this function is only useful for something we have a fid for (e.g. something we have a list we can resort) it's not supposed to otherwise be called. a way of blacklisting a vlserver once would be nice but was beyond scope for this [13:13:58] Sure, so in that case it should either return 0, or it should not be called and the caller should behave as if it returned 0. Spinning forever and rapidly generating 2G of logs is not a good option. [13:14:06] no argument [13:14:42] Not to mention hammering vlservers. In this case, the ticket was "bad" because its client principal name is "host/keyme_2010@CS.CMU.EDU", and rxkad is too "smart" for anyone's good [13:15:02] yay [13:15:56] Argh. And my officemate left, so I can't do another test today. Argh. [13:17:11] http://gerrit.openafs.org/2473 [13:20:41] No. 1) no need to do both 2) with your change, the caller doesn't actually behave correctly in the vlserver case, because it doesn't set ShouldRetry=0 if !tvp (and I think you leave tvp uninitialized there, but I'm not sure) 3) that's not the call site that matters for this [13:22:07] though it doesn't matter for me right now; even if I were to rebuild the whole mess with a patched AFS, it wouldn't change the problem that the server really is rejecting the request because rxkad has decided that "host/keyme_2010@CS.CMU.EDU" isn't a valid client principal name. [13:22:48] because, you see, the second component doesn't contain any dots, and so it can't do the hardwired v5->v4 name conversion. shoot me now [13:23:09] do both? uh. lemme look. only the first hunk was meant to be pushed [13:24:47] corrected diff there now. [13:29:04] that should fix the case that I am actually seeing. I think the site patched by the hunk you removed will still do the wrong thing, but it won't be any worse than before. I'm not sure whether it matters. I think it results in retrying forever if all vlservers are doing the idle timeout thing. [13:30:11] you may be correct. still looking. the change that's therte should be correct as far as it goes [13:30:29] yes, I agree [13:30:59] still looking at the other case. [13:36:15] the problem is that strictly this wants a way to force resorting tcell->cellHosts just for this areq, and then recalling ConnByMHosts. i suppose we could assume a blacklist request for which there's no fid is for vlserver and mark skipserver on cellHosts, but that's rather invasive for right now [13:39:34] Except that assumption would be false. We also use no-fid for cases where we are talking to a _specific_ fileserver, dammit, such as when we are giving up a callback. [13:42:26] i'm not coding anything now. i'm not touching it beyond the current change. the implications of doing anything broader need much more testing [13:43:08] on the other hand, it looks like the simple change works fine for the case i can easily simulate [13:45:50] what i'd do today, honestly, was look at the port in the afs_conn. what i'd do to do it right would be to mark the afs_conn explicitly, so we can untie vl from port 7003 based on e.g. srv lookup [13:45:58] Hm. Actually... Note that the skipserver bits are _already_ on the request. So, it would be reasonable to... - have blacklist set skipserver bits based on cellhosts - have ConnByMHosts obey skipserver bits except for the call site in afs_FlushVCBs, which I suspect should just give up in case of a failure that results in blacklisting [13:46:42] well, you want to set skipserver bits based on cellHosts only for vl requests. but yeah [13:47:03] Oh, yes, you have a point. We can tell from the port that it's VL or not, which lets us distinguish the no-fid vlserver cases, where we should blacklist against cellhosts, from the no-fid flushcbs case, where we should just always return 0. [13:47:32] i will probably implement this in the future, but not until we have done 1.5.76 and branched [13:47:47] And yes, using the port is wrong, but consistent with existing code which will already need to be cleaned up someday, especially if we want SRV records with not port 7003 not to horribly break CM's [13:48:11] Yes, this is certainly a post-branch change, though I think it may be appropriate for 1.6 [13:48:29] sure. just not ready to fly immediately, and shouldn't block testing [13:48:35] agreed [13:49:24] --- andersk has left [13:51:22] --- andersk has become available [13:57:51] of course, the buildbot's marking it failed verification incorrectly blocks it from being submitted. yay DoS [13:58:39] the various buildbot work is clearly not ready for prime time [14:01:21] I don't think that actually blocks it if you verify, does it? [14:01:26] Also, have I mentioned "const sucks" ? [14:01:57] i verified. i have no "submit" button [14:01:57] I like const for new code, but retroactively introducing it into a source base is pain. [14:02:05] That's less than ideal. [14:02:16] agreed. i want +2 verified :) [14:04:20] The buildbot should not use -1 verified unless it actually _failed to build_, and in a configuration we care about (whatever that means, but not building on an obscure platform we don't support should not automatically block). And, it shouldn't do anything at all if it's not going to bother to post enough information to know what happened. [14:05:05] and yes, you need +2 verified [14:06:02] and I'm really nervous about buildbot stuff going without any kind of safe-to-build verification. though I suppose it's not my boxes at risk... [14:06:11] I think the buildbot is already configured the way that you would want except for the "not actually working" part. [14:06:53] "I failed to pull from git" shouldn't result in -1 fails [14:07:10] Oh, is that what's going on? [14:07:16] Yeah. [14:07:31] Uh. It used verified -1 on a build where the problem was something related to checking out the change in git. And on both that one and a "legitimate" build failure (again, const--), it failed to post any useful information. In the latter case, Jason sent it to me in email, but that's not a substitute [14:07:42] That should be a -1 on the buildbot, not on the patch. :) [14:08:01] * rra didn't realize it was just failing the Git pull portion. [14:08:10] I thought it was somehow messing up the build. [14:08:32] Yeah, the lack of information is a problem. [14:08:55] jason says the info will be made available later. i asked earlier [14:09:15] If you ask later, will it be available earlier? :) [14:10:05] On the present change, it claims the git step failed. On two other changes I started, it claims compile failed, but doesn't say why. I have the output from one of those, and in that case the build really did fail because of a problem with the patchset. Of course, the problem is that the patchset doesn't do %%s/\//g [14:10:45] Oh, wait, that's not enough percents. I don't just want the change on every file in the project. I want it on every file in the Universe. [14:12:16] buildbot is currently failing any change which is windows only. I have no idea why [14:13:55] buildbot failed the unix cm change, which doesn't touch windows. i wonder if it broke earlier and is trying to rebuild in the same sandbox, but now doesn't have an antecedant chain that is right to check out onto [14:14:33] It should really create a new Git branch for every change it tests, try to cherry-pick into that, and blow away the branch and switch back to master each time it finishes. [14:15:02] That's what my script to do patch verification does. [14:15:46] i'd hope it does. but... [14:58:30] the git step did fail. it tries a fetch on receiving email but the ref is not immediately fetchable [15:03:51] --- Simon Wilkinson has become available [15:07:54] If folk would like, I should be able to block the buildbot user from doing verified -1s [15:08:15] that would be very useful [15:08:43] verified -1 is fine, if that's actually what it is. [15:08:57] Well, verified only goes -1 to +1 [15:09:10] I don't think I can (easily) extend that scale. [15:09:10] my point is "git failed" is not -1 [15:09:14] Yeah. [15:09:41] But if the buildbot is stopping people from getting work done, then I can stop it from DoSing you. [15:09:49] until we have more confidence in buildbot on more platforms I would prefer the verification failures be advisory [15:10:02] so 0 to +1 [15:10:07] yes [15:11:42] is the verification being submitted by the buildbot slave or the master? [15:11:53] I have no idea how Jason has it configured. [15:12:15] I'm thinking that if it is the slave we may want more than one buildbot account [15:12:22] In the architecture I was considering, I wanted the master to collate results, and submit a single success or failure. [15:12:33] I don't want an email from every single platform we test build on. [15:12:49] I only want an e-mail from the ones that fail [15:12:52] it's apparently receiving email or something [15:13:01] and triggering builds that way [15:13:08] Really? [15:13:35] seemingly [15:13:45] So there's not really any gerrit integration there. [15:14:15] not in the sense that gerrit is pushing out requests to build to the buildbot master [15:14:18] That should be buildbot restricted to 0 to +1 [15:14:36] thank you [15:14:44] Yeah - just doing it based on objects appearing in the git repo isn't very interesting. [15:29:22] --- deason has left [15:56:35] Hmmm. Maybe that didn't work ... [15:56:43] snerk [15:58:35] I've tried a different way of doing it ... [16:12:31] --- jaltman has left: Disconnected [16:15:06] --- mattjsm has become available [16:24:02] --- mattjsm has left [16:30:48] --- Simon Wilkinson has left [17:32:13] --- deason has become available [18:01:18] --- jaltman has become available [19:56:53] --- meffie has left [20:54:24] --- kaduk@mit.edu/barnowl has left [20:55:11] --- kaduk@mit.edu/barnowl has become available [21:00:19] --- rra has left: Disconnected [21:46:05] --- deason has left [22:17:04] --- reuteras has become available [23:03:43] --- Simon Wilkinson has become available [23:22:51] --- rod has become available