- Wednesday, October 15, 2014

release-team@conference.openafs.org

GMT+0
[04:35:29] Jeffrey Altman leaves the room
[04:48:58] Jeffrey Altman joins the room
[07:51:50] shadow@gmail.com/barnowlE5B64A04 joins the room
[08:07:52] shadow@gmail.com/barnowlE5B64A04 leaves the room
[08:08:03] shadow@gmail.com/barnowlE5B64A04 joins the room
[08:53:57] shadow@gmail.com/barnowlE5B64A04 joins the room
[08:55:27] shadow@gmail.com/barnowlE5B64A04 leaves the room
[12:10:53] wiesand joins the room
[12:22:51] <wiesand> test
[12:31:20] meffie joins the room
[13:54:36] kaduk joins the room
[14:00:58] <wiesand> Hello
[14:01:11] <meffie> good afternoon.
[14:02:05] <kaduk> Hello
[14:03:16] <wiesand> Ok, let's start with the 1.6.10 mess...
[14:03:49] deason joins the room
[14:03:59] <shadow@gmail.com/barnowlE5B64A04> hi
[14:04:09] <wiesand> New content in RT #131780, since Sunday
[14:04:13] <shadow@gmail.com/barnowlE5B64A04> tag was pushed, so i hope you have no more changes
[14:04:21] <wiesand> Hi. Thanks for the tag.
[14:04:33] <wiesand> Well, Jeffrey was requesting changes...
[14:04:42] <kaduk> I filed a ticket to get it into freebsd.
Debian should be coming "soon".
[14:05:59] <kaduk> On the RT #131780 front, to make things more exciting, we got a handful of reports to our helpdesk that could be that issue, but on the 1.6.9 from the Ubuntu ppa.
[14:06:15] <kaduk> The reports are a bit scarce on details, unfortunately.
[14:06:22] <wiesand> Oh no.
[14:07:01] <wiesand> Jeffrey thinks that 11358 should have been reverted...
[14:07:21] <wiesand> Let me cite: "The other option is to tag 1.6.10 as-is but not announce it and
immediately release 1.6.11 with 11358 reverted, the 3.17 kernel patches,
and the aklog patch.   This route might be better as there are already
many people that have been using binary packages that Ben Kaduk and SNA
have been distributing to end users."
[14:07:32] <deason> 11358 fixes a real problem
[14:08:06] <kaduk> I don't remember seeing that text; maybe I missed an email?
[14:08:09] <wiesand> As I said last week, my assumption is that we have a choice between two real problems.
[14:08:14] <deason> we could revert just all of the d_revalidate changes recently, but I don't think reverting just that one is an option
[14:08:44] <wiesand> Ben: this was in another context. Nothing secret, just different ;-)
[14:08:50] <kaduk> Okay :)
[14:09:56] <wiesand> Right now, we have a tagged 1.6.10 which is already making it into the wild, with 11358 fixing a real problem but bringing back the getcwd() problem, which is real too.
[14:10:51] <wiesand> And I don't see a solution which would fix both anytime soon.
[14:11:22] <shadow@gmail.com/barnowlE5B64A04> fs choosewhichproblem.
[14:11:25] <deason> we could have one soon, if my guess as what's going on is correct
[14:11:26] <wiesand> To make things perfect, Jeffrey isn't here today..
[14:11:47] <deason> if we had to choose, I thought we'd go back to what the behavior was 'before'
[14:11:57] <deason> since that's what people should be more used to
[14:12:12] <wiesand> That ship has sailed a few minutes ago, as far as 1.6.10 is concerned.
[14:12:16] <kaduk> There were no cross-cell mounts involved in the one helpdesk case that had data.
[14:13:40] <deason> fakestat, fakestat-all, none?
[14:13:51] <kaduk> Just fakestat, I think.
[14:14:13] <wiesand> I propose we proceed with 1.6.10 as planned, but warn loudly about the possible regression in the announcement, on the web page etc.
[14:14:57] <wiesand> And then work on 1.6.11.
[14:15:09] <kaduk> That seems the most reasonable thing, given the tag exists
[14:15:57] <shadow@gmail.com/barnowlE5B64A04> i agree
[14:16:33] <shadow@gmail.com/barnowlE5B64A04> and honestly, pushing back 1.6.10 further seems silly, so the
other option imo would have been revert and leave the problem as it
was.
[14:16:38] <wiesand> Ok. Jeffrey can object later ;-)
[14:16:54] <wiesand> I agree.
[14:17:43] <wiesand> Fine, on to 1.6.11... Andrew, that "real solution" would be?
[14:18:32] <wiesand> Which is what we formerly called 1.6.10.1.
[14:19:01] <wiesand> Since Jeffrey says the issue is too big for ...1
[14:19:30] <kaduk> It only affects linux, though.  And possibly only certain kernels.
[14:20:07] Marc Dionne joins the room
[14:20:52] <wiesand> True, so it would meet the "platform specific client fixes only" criteria for those point releases. But he's the gatekeeper ;-) And I don't care that much...
[14:21:11] <kaduk> Yeah, I don't care that much, either ;)
[14:21:43] <wiesand> On the other hand, if we just put in the Linux 3.17 changes and revert 10358, and release that as 1.6.11, sites can choose their poison.
[14:21:59] <deason> well, I don't know what it would look like yet; trying to get the situation to happen :) but if we can, a fix should not be far off
[14:22:05] <wiesand> If you want getcwd() troubles, use 1.6.10.
[14:22:20] <wiesand> If you want the problem 10358 fixes, use 1.6.11.
[14:22:39] <wiesand> I'm only kind of joking here...
[14:23:04] <Marc Dionne> hi guys, sorry a bit late to the party
[14:23:21] <kaduk> We have too much evidence that people will just blindly take the highest number for me to be very enthusiastic about that plan.
[14:23:57] <deason> well, they would at first, but if someone says "this problem keeps happening what do I do" we at least have an answer
[14:24:00] <wiesand> At least that wouldn't have a regression w.r.t. 1.6.9.
[14:24:06] <deason> for everyone except people that are concerned about both problems :)
[14:24:41] <wiesand> Well, another answer is "use this patch, or ask your support company to do it".
[14:26:09] <wiesand> Or a 1.6.11 with "configure --with-getcwd-issue" ?
[14:26:17] <wiesand> [this *was* a joke]
[14:26:21] <deason> er, I was being a little too careful above, I think; the 'real fix' I was thinking of involves resolving a mountpoint if returned from the troublesome afs_lookup
[14:26:32] <deason> which I mentioned somewhere in the ticket
[14:27:48] <deason> it's just a little annoying to try to reproduce, since we try to avoid doing a lot of these checks, and avoid not-resolving a mtpt when it's "free"
[14:28:19] <Marc Dionne> it's fairly easy to reproduce getting an ENOENT, but not sure it's the same scenario that other people are seeing
[14:29:14] <Marc Dionne> just sitting in a directory (as cwd) and have that directory replaced by a new version remotely for instance
[14:29:37] <deason> you mean getting an ENOENT from getcwd?
[14:29:41] <Marc Dionne> yes
[14:29:52] <deason> well yeah, that would do it; the thing that's more of an issue is getting one when nothing changes
[14:31:44] <wiesand> % mkdir /tmp/a; cd /tmp/a; rmdir /tmp/a; /bin/pwd
/bin/pwd: couldn't find directory entry in `..' with matching i-node
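
The local-filesystem behaviour in the one-liner above can also be checked from a script. This is a minimal sketch, assuming a Linux-like system where removing the current working directory is permitted; the function name `getcwd_errno_after_rmdir` is made up for illustration and nothing here touches AFS itself.

```python
import errno
import os
import tempfile


def getcwd_errno_after_rmdir():
    """Return the errno raised by os.getcwd() after the cwd is removed, or None."""
    original = os.getcwd()
    doomed = tempfile.mkdtemp()
    try:
        os.chdir(doomed)
        os.rmdir(doomed)  # the cwd no longer has a directory entry
        try:
            os.getcwd()
        except OSError as exc:
            return exc.errno
        return None  # the platform still resolved the cwd
    finally:
        os.chdir(original)  # always restore the caller's cwd


if __name__ == "__main__":
    err = getcwd_errno_after_rmdir()
    print(errno.errorcode.get(err, err))
```

On a local Linux filesystem the call is expected to report ENOENT, which is the baseline the getcwd() discussion compares AFS against.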
[14:32:35] <wiesand> So AFS actually should do what Marc describes...
[14:34:26] <deason> marc may be describing 'mv'ing the dir out of the way and putting a new dir in its place; iirc other fs's do not enoent, but we do; but that's been true forever and I don't think is the issue here
[14:36:19] <Marc Dionne> that's pretty much the scenario, and there's no obvious good solution for that one
[14:37:09] <Marc Dionne> one thing i wonder about 10358 is if it ends up doing a d_drop for cases where it didn't previously (before all these related changes)
[14:40:46] <wiesand> Any more thoughts on 10358/getcwd?
[14:41:17] <wiesand> I guess this won't be sorted out here and today...
[14:41:26] <kaduk> It sounds like we're only speculating, until we can reproduce the failure case.
[14:43:04] <wiesand> Right :-(
[14:43:38] <wiesand> So let's go on to other 1.6.11 fodder - any news on 11455 and 11492?
[14:44:56] <Marc Dionne> btw there will also be a few 3.18 bits coming
[14:45:07] <wiesand> Great :-/
[14:45:27] <wiesand> 3.18 ETA is in ~6 weeks?
[14:45:54] <Marc Dionne> at least, still in merge window, and indications were that the window may be a bit longer than normal
[14:46:23] <wiesand> Ok.
[14:46:31] <Marc Dionne> so probably more like 8 weeks+
[14:46:47] <wiesand> Thanks for the first really good news today.
[14:46:57] <wiesand> What can be done to help 11455 and 11492 along?
[14:47:22] <wiesand> Daria, what would need to happen before you hit the submit button?
[14:49:26] <wiesand> And then Jeffrey says that 1.6.11 should ship with 11538 applied.
[14:49:54] <shadow@gmail.com/barnowlE5B64A04> uh. on 11455 and 11492? hang on
[14:50:23] <wiesand> NB Ben, that (11538) was the "different context".
[14:50:44] <shadow@gmail.com/barnowlE5B64A04> (the thing that would need to happen is for gerrit to properly handle
my openid, but apparently i need to kick my browser)
[14:51:58] <shadow@gmail.com/barnowlE5B64A04> 11537 probably merits some reviews, since it fixes an actual observed
problem
[14:52:20] <kaduk> Yeah, that's on my todo list.
[14:52:34] <kaduk> Yesterday had too much stuff drop on me for much of anything to get done.
[14:53:35] <wiesand> That's actually 11537, not 11538 + typo?
[14:54:25] <shadow@gmail.com/barnowlE5B64A04> it is
[14:54:57] <shadow@gmail.com/barnowlE5B64A04> like. a bulkstat of somewhere where exactly one vnode (which happened
to be the first in the list) got EACCES tainted the cache manager
[14:55:51] <wiesand> Forgive my stupidity, but what are the consequences?
[14:56:31] <Marc Dionne> you are denied access to a file that you should have access to
[14:56:58] <wiesand> Thanks :)
[14:57:17] <wiesand> Is it a recent regression?
[14:57:17] <shadow@gmail.com/barnowlE5B64A04> to possibly many files you should have access to. the cache manager
enforces EACCES on a file you can actually read if you just flush the
bogus stat cache entries
[14:57:31] <shadow@gmail.com/barnowlE5B64A04> it's a problem that has been one for a while
[14:57:51] <wiesand> Risk?
[14:59:13] <shadow@gmail.com/barnowlE5B64A04> well, technically we should analyze the result of each fid, not just
one, to decide if any need to be retried, and retry one at a time. but
this makes that no worse than it already was
[14:59:55] <shadow@gmail.com/barnowlE5B64A04> it decreases the number of edge cases that can populate an end client
cache with bogus state
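
The "analyze the result of each fid, and retry one at a time" idea can be sketched as follows. This is only an illustration under stated assumptions, not the cache manager's code or the content of 11537: `merge_bulk_results`, its arguments, and the `fetch_one` callback are all hypothetical names, and error codes are modelled as plain integers with 0 meaning success.

```python
import errno


def merge_bulk_results(fids, bulk_codes, fetch_one):
    """Decide per fid whether to trust a bulk reply or re-fetch that fid alone.

    fids       -- list of file identifiers (any hashable values)
    bulk_codes -- per-fid error codes from the bulk reply, 0 meaning success
    fetch_one  -- hypothetical callable(fid) -> error code for a single fetch
    """
    final = {}
    for fid, code in zip(fids, bulk_codes):
        if code == 0:
            final[fid] = 0
        else:
            # Never cache a failure taken from the bulk reply as-is;
            # ask again for this one fid so an error cannot spread.
            final[fid] = fetch_one(fid)
    return final


if __name__ == "__main__":
    # Only "a" is really unreadable; the bulk reply blamed all three.
    denied = {"a"}
    single = lambda fid: errno.EACCES if fid in denied else 0
    print(merge_bulk_results(["a", "b", "c"], [errno.EACCES] * 3, single))
```

The point of the design is that a failure is only recorded for a fid after that fid has been checked on its own, so one denied vnode cannot populate the cache with EACCES for its neighbours.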
[15:00:14] <kaduk> "The risk is only that the implementation is buggy."
[15:00:48] <wiesand> Ah, just that ;-)
[15:01:22] <wiesand> I'm just asking because I think we should try really hard to avoid opening another can of worms in 1.6.11.
[15:03:54] <wiesand> Any other candidates for 1.6.11?
[15:04:42] <meffie> i think i have a minor patch on master that i missed for 1.6.10. a logging improvement
[15:05:21] <meffie> gerrit is ...
[15:05:36] <kaduk> Do I need to kick it?
[15:05:58] <meffie> 10849 and 10850
[15:06:11] <meffie> (no, just kick me)
[15:08:39] <wiesand> I'm inclined to defer anything not really urgent to 1.6.12, but let's see.
[15:09:04] <meffie> that is fine.
[15:09:32] <wiesand> Ok. On to "1.8 branch" ?
[15:09:42] <wiesand> Which we can probably skip immediately?
[15:09:44] <kaduk> Daria just merged several more things :)
[15:10:21] <meffie> i built master yesterday, and now i have a libroken.so.2 :)
[15:11:06] <kaduk> Do we want to go pthreaded for 11472?  I haven't gotten to look at what exactly is affected.
[15:13:24] <wiesand> Progress :)
[15:13:59] <wiesand> Regarding 1.8, I haven't seen any reaction saying "it has to wait because..."
[15:14:23] <wiesand> 1_8_x branch, of course.
[15:14:42] <kaduk> I want to ask Andrew if he has more to say about the "cleanup our cleaning" thing with generated .pod files.
[15:15:22] <deason> nah, it was just a suggestion
[15:15:44] <deason> I'd rather not have another list to keep track of, but it's not important
[15:15:57] <kaduk> Okay.
[15:16:06] <kaduk> What's there works for now, and we can revisit later if we feel like it.
[15:16:10] <deason> I could look at pthreaded libuafs if you want, after I hit a dead end with getcwd
[15:16:30] <kaduk> (That's 11532, if Daria is still trigger-happy ;) )
[15:17:48] <kaduk> It "should" only be a handful of objects that are affected, the question is just which ones...some time with objdump on an existing libuafs.a might even be enough.
[15:21:11] <wiesand> I'm sorry, but I'm running out of time.
[15:21:24] <wiesand> Anything else we should discuss now?
[15:21:37] <kaduk> So be it.
[15:22:59] <wiesand> I'll check in again later though. Feel free to continue without me, but I have to run.
[15:23:29] <wiesand> Thanks a lot everyone!
[15:23:44] wiesand leaves the room
[15:24:05] <kaduk> So, do we want to fix "clean up the cleaning" and then branch?
[15:25:40] <kaduk> Hmm, nm thinks that my libuafs.a already has references to some pthread symbols.
[15:27:21] <kaduk> And there don't seem to be any references to LWP symbols.
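
The nm check described above can be scripted. The sketch below assumes the default `nm` text format, where an undefined symbol appears as a line of the form `U name` with no address column; the `pthread_` and `LWP_` prefixes are assumptions about how the two thread packages name their symbols, and `classify_undefined` is a made-up helper name.

```python
def classify_undefined(nm_output, prefixes=("pthread_", "LWP_")):
    """Group undefined symbols from `nm` output by name prefix."""
    found = {prefix: set() for prefix in prefixes}
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined references have no address: just the type "U" and the name.
        if len(parts) == 2 and parts[0] == "U":
            for prefix in prefixes:
                if parts[1].startswith(prefix):
                    found[prefix].add(parts[1])
    return {prefix: sorted(names) for prefix, names in found.items()}


if __name__ == "__main__":
    # Hand-written sample in nm's default format, not real libuafs.a output.
    sample = """\
example.o:
0000000000000000 T example_function
                 U pthread_mutex_lock
                 U pthread_mutex_unlock
                 U malloc
"""
    print(classify_undefined(sample))
```

In practice the text would come from running `nm` on the archive and passing its standard output to the function; an empty list for the LWP prefix would match the observation that no LWP references remain.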
[15:28:47] <kaduk> Ah, Jeff wants 11286 et seq to be addressed before branching.
[15:30:08] <deason> yes, the ukernel primitives should already be pthreads; the primitives it uses are implemented either via pthreads or 'netscape' threads
[15:30:31] <deason> (another cleanup could be to get rid of the netscape symbols since we threw that out, and get rid of the ukernel abstractions)
[15:30:33] <kaduk> Okay, so that should be a pretty trivial switch, then.
I can take a look.
[15:47:27] <kaduk> I should probably move libuafs talk to the other room ... but deason isn't there.
[16:02:49] meffie leaves the room
[16:12:48] Marc Dionne leaves the room
[22:00:00] kaduk leaves the room
[22:39:06] deason leaves the room