[01:21:42] --- Born Fool has left
[02:25:59] --- Russ has left: Disconnected
[06:45:34] --- jaltman has left: Replaced by new connection
[06:45:35] --- jaltman has become available
[06:46:55] --- meffie has become available
[07:03:36] --- reuteras has left
[07:43:00] --- jaltman has left: Disconnected
[07:43:07] --- jaltman has become available
[09:01:50] --- deason has become available
[09:50:55] --- deason has left
[09:52:20] --- deason has become available
[10:20:26] --- rod has left: Disconnected
[10:33:06] --- Kevin Sumner has left
[10:43:00] --- rra has become available
[11:02:48] --- deason has left
[11:31:41] --- deason has become available
[11:51:24] --- meffie has left
[12:21:00] --- meffie has become available
[12:35:15] --- deason has left
[12:43:19] --- deason has become available
[13:06:06] <jhutz@jis.mit.edu/owl> Argh.
When afs_Analyze sees an authentication error, it calls afs_BlackListOnce
and then if that function says there are more servers to try, tells the
caller to retry instead of discarding the bad token.  Unfortunately,
when called with no FID (as for VLDB lookups), afs_BlackListOnce always
returns true without doing anything.  So, if you have bad tokens and
try to talk to a vlserver, you spin forever.
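
The failure mode described above can be modelled in a few lines of C. This is only a sketch under stated assumptions, not the OpenAFS code: `struct request`, `blacklist_once_buggy`, `blacklist_once_fixed`, and `attempts_with_bad_token` are hypothetical names invented for illustration, and the retry loop is capped so the model terminates. The "fixed" variant simply returns 0 when there is no FID, which is one of the two options raised later in the discussion.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a file identifier; a NULL pointer models "no FID". */
struct fid { int volume; };

/* Hypothetical request state: which FID (if any) and how many servers
 * for that FID have not been tried yet. */
struct request {
    const struct fid *fid;   /* NULL for VLDB lookups in this model */
    int servers_left;        /* untried servers for this FID */
};

/* Shared per-FID behaviour: mark one server as tried and report
 * whether any others remain. */
static bool blacklist_with_fid(struct request *req)
{
    if (req->servers_left > 0)
        req->servers_left--;
    return req->servers_left > 0;
}

/* Models the behaviour described above: with no FID nothing is
 * blacklisted, yet the caller is still told there is more to try. */
static bool blacklist_once_buggy(struct request *req)
{
    if (req->fid == NULL)
        return true;
    return blacklist_with_fid(req);
}

/* Models the suggested behaviour: with no FID, report "nothing left". */
static bool blacklist_once_fixed(struct request *req)
{
    if (req->fid == NULL)
        return false;
    return blacklist_with_fid(req);
}

typedef bool (*blacklist_fn)(struct request *);

/* Count the attempts made with a token that every server rejects.
 * max_attempts exists only so that the model stops. */
static int attempts_with_bad_token(struct request *req, blacklist_fn blacklist,
                                   int max_attempts)
{
    int attempts = 0;
    while (attempts < max_attempts) {
        attempts++;              /* the call fails with an auth error */
        if (!blacklist(req))
            break;               /* stop retrying; the bad token can be discarded */
    }
    return attempts;
}
```

In this model a no-FID request handled by `blacklist_once_buggy` only stops at the artificial cap, while `blacklist_once_fixed` stops after the first failure.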
[13:10:21] <shadow@gmail.com/owlECA78C6F> well, that sucks
[13:12:24] <shadow@gmail.com/owlECA78C6F> and looking at the code i remember one of the details; given that this
function is only useful for something we have a fid for (e.g. something
we have a list we can resort) it's not supposed to otherwise be called.

a way of blacklisting a vlserver once would be nice but was beyond
scope for this
[13:13:58] <jhutz@jis.mit.edu/owl> Sure, so in that case it should either return 0, or it should not be
called and the caller should behave as if it returned 0.  Spinning
forever and rapidly generating 2G of logs is not a good option.
[13:14:06] <shadow@gmail.com/owlECA78C6F> no argument
[13:14:42] <jhutz@jis.mit.edu/owl> Not to mention hammering vlservers.
In this case, the ticket was "bad" because its client principal
name is "host/keyme_2010@CS.CMU.EDU", and rxkad is too "smart"
for anyone's good
[13:15:02] <shadow@gmail.com/owlECA78C6F> yay
[13:15:56] <jhutz@jis.mit.edu/owl> Argh.  And my officemate left, so I can't do another test today.
Argh.
[13:17:11] <shadow@gmail.com/owlECA78C6F> http://gerrit.openafs.org/2473
[13:20:41] <jhutz@jis.mit.edu/owl> No.
1) no need to do both
2) with your change, the caller doesn't actually behave correctly in
   the vlserver case, because it doesn't set ShouldRetry=0 if !tvp
   (and I think you leave tvp uninitialized there, but I'm not sure)
3) that's not the call site that matters for this
[13:22:07] <jhutz@jis.mit.edu/owl> though it doesn't matter for me right now; even if I were to rebuild
the whole mess with a patched AFS, it wouldn't change the problem that
the server really is rejecting the request because rxkad has decided
that "host/keyme_2010@CS.CMU.EDU" isn't a valid client principal name.
[13:22:48] <jhutz@jis.mit.edu/owl> because, you see, the second component doesn't contain any dots,
and so it can't do the hardwired v5->v4 name conversion.
shoot me now
[13:23:09] <shadow@gmail.com/owlECA78C6F> do both? uh. lemme look.

only the first hunk was meant to be pushed
[13:24:47] <shadow@gmail.com/owlECA78C6F> corrected diff there now.
[13:29:04] <jhutz@jis.mit.edu/owl> that should fix the case that I am actually seeing.
I think the site patched by the hunk you removed will still do the
wrong thing, but it won't be any worse than before.  I'm not sure
whether it matters.  I think it results in retrying forever if all
vlservers are doing the idle timeout thing.
[13:30:11] <shadow@gmail.com/owlECA78C6F> you may be correct. still looking. the change that's there should be
correct as far as it goes
[13:30:29] <jhutz@jis.mit.edu/owl> yes, I agree
[13:30:59] <shadow@gmail.com/owlECA78C6F> still looking at the other case. 
[13:36:15] <shadow@gmail.com/owlECA78C6F> the problem is that strictly this wants a way to force resorting
tcell->cellHosts just for this areq, and then calling ConnByMHosts again.

i suppose we could assume a blacklist request for which there's no fid
is for vlserver and mark skipserver on cellHosts, but that's rather
invasive for right now
[13:39:34] <jhutz@jis.mit.edu/owl> Except that assumption would be false.  We also use no-fid for cases
where we are talking to a _specific_ fileserver, dammit, such as when
we are giving up a callback.
[13:42:26] <shadow@gmail.com/owlECA78C6F> i'm not coding anything now. i'm not touching it beyond the current
change. the implications of doing anything broader need much more
testing
[13:43:08] <shadow@gmail.com/owlECA78C6F> on the other hand, it looks like the simple change works fine for the
case i can easily simulate
[13:45:50] <shadow@gmail.com/owlECA78C6F> what i'd do today, honestly, is look at the port in the afs_conn. what
i'd do to do it right would be to mark the afs_conn explicitly, so we
can untie vl from port 7003 based on e.g. srv lookup
[13:45:58] <jhutz@jis.mit.edu/owl> Hm.  Actually...
Note that the skipserver bits are _already_ on the request.
So, it would be reasonable to...
- have blacklist set skipserver bits based on cellhosts
- have ConnByMHosts obey skipserver bits

except for the call site in afs_FlushVCBs, which I suspect should just
give up in case of a failure that results in blacklisting
[13:46:42] <shadow@gmail.com/owlECA78C6F> well, you want to set skipserver bits based on cellHosts only for vl
requests. but yeah
[13:47:03] <jhutz@jis.mit.edu/owl> Oh, yes, you have a point.  We can tell from the port that it's VL or
not, which lets us distinguish the no-fid vlserver cases, where we should
blacklist against cellhosts, from the no-fid flushcbs case, where we
should just always return 0.
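
The three cases distinguished here can be written down as a small decision function. This is a minimal sketch, not OpenAFS code: the enum, `choose_blacklist_action`, and the `VL_PORT` macro are hypothetical names, and the port is assumed to be in host byte order for simplicity. As noted in the discussion, keying on the port is a stopgap rather than the right long-term design.

```c
#include <stdbool.h>

/* vlserver port named in the discussion; treated as host byte order
 * here (an assumption of this sketch). */
#define VL_PORT 7003

/* Hypothetical labels for the three outcomes discussed above. */
enum blacklist_action {
    BLACKLIST_BY_FID,        /* request has a FID: use the per-FID server list */
    BLACKLIST_BY_CELLHOSTS,  /* no FID, vlserver: set skipserver bits from cell hosts */
    BLACKLIST_GIVE_UP        /* no FID, specific fileserver: always return 0 */
};

/* Pick how a blacklist request should be handled. */
static enum blacklist_action
choose_blacklist_action(bool has_fid, unsigned short port)
{
    if (has_fid)
        return BLACKLIST_BY_FID;
    if (port == VL_PORT)
        return BLACKLIST_BY_CELLHOSTS;
    /* No FID and not a vlserver, e.g. giving up callbacks on one
     * specific fileserver: there is nothing else to try. */
    return BLACKLIST_GIVE_UP;
}
```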
[13:47:32] <shadow@gmail.com/owlECA78C6F> i will probably implement this in the future, but not until we have
done 1.5.76 and branched
[13:47:47] <jhutz@jis.mit.edu/owl> And yes, using the port is wrong, but consistent with existing code
which will already need to be cleaned up someday, especially if we
want SRV records with a port other than 7003 not to horribly break CMs
[13:48:11] <jhutz@jis.mit.edu/owl> Yes, this is certainly a post-branch change, though I think it may be
appropriate for 1.6
[13:48:29] <shadow@gmail.com/owlECA78C6F> sure. just not ready to fly immediately, and shouldn't block testing
[13:48:35] <jhutz@jis.mit.edu/owl> agreed
[13:49:24] --- andersk has left
[13:51:22] --- andersk has become available
[13:57:51] <shadow@gmail.com/owlECA78C6F> of course, the buildbot's marking it failed verification incorrectly
blocks it from being submitted. yay DoS
[13:58:39] <jhutz@jis.mit.edu/owl> the various buildbot work is clearly not ready for prime time
[14:01:21] <rra> I don't think that actually blocks it if you verify, does it?
[14:01:26] <jhutz@jis.mit.edu/owl> Also, have I mentioned "const sucks" ?
[14:01:57] <shadow@gmail.com/owlECA78C6F> i verified. i have no "submit" button
[14:01:57] <rra> I like const for new code, but retroactively introducing it into a source base is pain.
[14:02:05] <rra> That's less than ideal.
[14:02:16] <shadow@gmail.com/owlECA78C6F> agreed. i want +2 verified :)
[14:04:20] <jhutz@jis.mit.edu/owl> The buildbot should not use -1 verified unless it actually _failed to
build_, and in a configuration we care about (whatever that means,
but not building on an obscure platform we don't support should not
automatically block).  And, it shouldn't do anything at all if it's
not going to bother to post enough information to know what happened.
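
The voting policy proposed here can be sketched as a single function. This is an illustration only, not the buildbot's or gerrit's actual configuration: `enum build_outcome`, `verified_vote`, and its parameters are hypothetical, a vote of 0 stands for "do not vote", and a +1 on success is an assumption of the sketch.

```c
#include <stdbool.h>

/* Hypothetical classification of what happened to a test build. */
enum build_outcome {
    OUTCOME_SUCCESS,         /* the change built */
    OUTCOME_COMPILE_FAILED,  /* the change itself failed to build */
    OUTCOME_FETCH_FAILED     /* infrastructure problem, e.g. the git step failed */
};

/* Return the "verified" vote to post: -1, 0 (no vote), or +1. */
static int verified_vote(enum build_outcome outcome, bool supported_config,
                         bool has_log)
{
    /* Without enough posted information to see what happened, stay silent. */
    if (!has_log)
        return 0;

    switch (outcome) {
    case OUTCOME_SUCCESS:
        return 1;
    case OUTCOME_COMPILE_FAILED:
        /* Only a real build failure on a configuration we care about blocks. */
        return supported_config ? -1 : 0;
    case OUTCOME_FETCH_FAILED:
    default:
        /* A failure to check out the change says nothing about the patch. */
        return 0;
    }
}
```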
[14:05:05] <jhutz@jis.mit.edu/owl> and yes, you need +2 verified
[14:06:02] <jhutz@jis.mit.edu/owl> and I'm really nervous about buildbot stuff going without any kind of
safe-to-build verification.  though I suppose it's not my boxes at risk...
[14:06:11] <rra> I think the buildbot is already configured the way that you would want except for the "not actually working" part.
[14:06:53] <deason> "I failed to pull from git" shouldn't result in -1 fails
[14:07:10] <rra> Oh, is that what's going on?
[14:07:16] <rra> Yeah.
[14:07:31] <jhutz@jis.mit.edu/owl> Uh.  It used verified -1 on a build where the problem was something
related to checking out the change in git.
And on both that one and a "legitimate" build failure (again, const--),
it failed to post any useful information.  In the latter case, Jason
sent it to me in email, but that's not a substitute
[14:07:42] <rra> That should be a -1 on the buildbot, not on the patch.  :)
[14:08:01] * rra didn't realize it was just failing the Git pull portion.
[14:08:10] <rra> I thought it was somehow messing up the build.
[14:08:32] <rra> Yeah, the lack of information is a problem.
[14:08:55] <shadow@gmail.com/owlECA78C6F> jason says the info will be made available later. i asked earlier
[14:09:15] <rra> If you ask later, will it be available earlier?  :)
[14:10:05] <jhutz@jis.mit.edu/owl> On the present change, it claims the git step failed.
On two other changes I started, it claims compile failed, but doesn't
say why.  I have the output from one of those, and in that case the
build really did fail because of a problem with the patchset.  Of
course, the problem is that the patchset doesn't do %%s/\<const\>//g
[14:10:45] <jhutz@jis.mit.edu/owl> Oh, wait, that's not enough percents.
I don't just want the change on every file in the project.
I want it on every file in the Universe.
[14:12:16] <jaltman> buildbot is currently failing any change which is windows only.  I have no idea why
[14:13:55] <shadow@gmail.com/owlECA78C6F> buildbot failed the unix cm change, which doesn't touch windows. i
wonder if it broke earlier and is trying to rebuild in the same
sandbox, but now doesn't have an antecedent chain that is right to
check out onto
[14:14:33] <rra> It should really create a new Git branch for every change it tests, try to cherry-pick into that, and blow away the branch and switch back to master each time it finishes.
[14:15:02] <rra> That's what my script to do patch verification does.
[14:15:46] <shadow@gmail.com/owlECA78C6F> i'd hope it does. but...
[14:58:30] <shadow@gmail.com/owlECA78C6F> the git step did fail. it tries a fetch on receiving email but the ref
is not immediately fetchable
[15:03:51] --- Simon Wilkinson has become available
[15:07:54] <Simon Wilkinson> If folk would like, I should be able to block the buildbot user from doing verified -1s
[15:08:15] <jaltman> that would be very useful
[15:08:43] <shadow@gmail.com/owlECA78C6F> verified -1 is fine, if that's actually what it is. 
[15:08:57] <Simon Wilkinson> Well, verified only goes -1 to +1
[15:09:10] <Simon Wilkinson> I don't think I can (easily) extend that scale.
[15:09:10] <shadow@gmail.com/owlECA78C6F> my point is "git failed" is not -1
[15:09:14] <Simon Wilkinson> Yeah.
[15:09:41] <Simon Wilkinson> But if the buildbot is stopping people from getting work done, then I can stop it from DoSing you.
[15:09:49] <jaltman> until we have more confidence in buildbot on more platforms I would prefer the verification failures be advisory
[15:10:02] <shadow@gmail.com/owlECA78C6F> so 0 to +1
[15:10:07] <jaltman> yes
[15:11:42] <jaltman> is the verification being submitted by the buildbot slave or the master?    
[15:11:53] <Simon Wilkinson> I have no idea how Jason has it configured.
[15:12:15] <jaltman> I'm thinking that if it is the slave we may want more than one buildbot account
[15:12:22] <Simon Wilkinson> In the architecture I was considering, I wanted the master to collate results, and submit a single success or failure.
[15:12:33] <Simon Wilkinson> I don't want an email from every single platform we test build on.
[15:12:49] <jaltman> I only want an e-mail from the ones that fail
[15:12:52] <shadow@gmail.com/owlECA78C6F> it's apparently receiving email or something
[15:13:01] <shadow@gmail.com/owlECA78C6F> and triggering builds that way
[15:13:08] <Simon Wilkinson> Really?
[15:13:35] <shadow@gmail.com/owlECA78C6F> seemingly
[15:13:45] <Simon Wilkinson> So there's not really any gerrit integration there.
[15:14:15] <jaltman> not in the sense that gerrit is pushing out requests to build to the buildbot master
[15:14:18] <Simon Wilkinson> That should be buildbot restricted to 0 to +1
[15:14:36] <jaltman> thank you
[15:14:44] <Simon Wilkinson> Yeah - just doing it based on objects appearing in the git repo isn't very interesting.
[15:29:22] --- deason has left
[15:56:35] <Simon Wilkinson> Hmmm. Maybe that didn't work ...
[15:56:43] <shadow@gmail.com/owlECA78C6F> snerk
[15:58:35] <Simon Wilkinson> I've tried a different way of doing it ...
[16:12:31] --- jaltman has left: Disconnected
[16:15:06] --- mattjsm has become available
[16:24:02] --- mattjsm has left
[16:30:48] --- Simon Wilkinson has left
[17:32:13] --- deason has become available
[18:01:18] --- jaltman has become available
[19:56:53] --- meffie has left
[20:54:24] --- kaduk@mit.edu/barnowl has left
[20:55:11] --- kaduk@mit.edu/barnowl has become available
[21:00:19] --- rra has left: Disconnected
[21:46:05] --- deason has left
[22:17:04] --- reuteras has become available
[23:03:43] --- Simon Wilkinson has become available
[23:22:51] --- rod has become available