- Wednesday, December 3, 2014

release-team@conference.openafs.org

GMT+0
[10:00:02] Jeffrey Altman leaves the room
[10:01:00] Jeffrey Altman joins the room
[14:08:38] meffie joins the room
[14:18:35] kaduk joins the room
[14:19:58] wiesand joins the room
[15:00:06] <kaduk> $timeofday
[15:01:07] <wiesand> Good evening.
[15:01:25] <wiesand> (it’s almost dark here)
[15:02:02] <wiesand> We have quite an agenda today, thanks to your diligence...
[15:02:36] <kaduk> I will be pleasantly surprised if we can make it through it all...
[15:02:48] <kaduk> But we should do 1.6 first
[15:02:50] <wiesand> So let’s start. The first part is probably quick: Any Linux News?
[15:03:09] Marc Dionne joins the room
[15:03:22] <wiesand> Any 1.6.11pre1 testing results (besides those in RT #131967)?
[15:03:29] <wiesand> Hi Marc :)
[15:03:40] <Marc Dionne> Hi, good timing... :)
[15:03:42] <kaduk> Ubuntu took a 3.18 kernel, but rolled their own patches to support it instead of finding ours
[15:04:21] <wiesand> Yes we’re late. My bad.
[15:04:21] <Marc Dionne> ran some tests with 3.18-rc7+ with 1.6 head, looks fine
[15:04:50] <wiesand> Marc: Thanks. Chances are we’ll get away with it for 3.18 then.
[15:05:03] <Marc Dionne> very likely
[15:05:19] <wiesand> What about #131967? Must it be addressed in 1.6.11?
[15:06:06] <wiesand> The last entry suggests 11616 is at least not a complete solution?
[15:06:42] <kaduk> There may or may not be other bugs exposed by scripts changing from -fakestat-all to just -fakestat
[15:07:48] <wiesand> Right. Or bugs only exposed after applying 11616, because the bug that fixes would strike first?
[15:08:16] <kaduk> potentially.  I was rather distracted when I looked at the relevant gerrit changes.
[15:09:08] <wiesand> NB could gerrit 435 on your list be related to getcwd() issues?
[15:09:45] <kaduk> It seems plausible at first glance
[15:10:24] <wiesand> Added Anders to the list of reviewers
[15:11:46] <wiesand> That RT entry will need some more time. So let’s move on.
[15:12:23] <kaduk> > more time
As in, we don't expect to fix it for 1.6.11, or don't know yet, or ...?
[15:13:11] deason joins the room
[15:13:21] <wiesand> I think we don’t have enough info to take a decision today.
[15:14:55] <kaduk> So we just want more review on 11615 and 11326?
[15:15:10] <wiesand> And 11616?
[15:15:17] <kaduk> Er, right.
[15:15:47] <kaduk> Okay, I'll let you move on now ;)
[15:16:31] <wiesand> I was about to type: And we should probably cut right to “your” part of the agenda.
[15:16:48] <wiesand> Unless anyone has a strong desire to discuss 1.6.12 now.
[15:17:02] <deason> 11616 doesn't really solve the mit scripts issue; I think it's unrelated and should put a note saying that in the gerrit...
[15:17:43] <wiesand> If so, you should also remove the “FIXES”
[15:18:57] <wiesand> On to Ben’s list?
[15:18:59] <kaduk> Lacking any other guidance, I was mostly planning to just go down the list in order until we got too tired to go on.
[15:19:40] <kaduk> So, the first one is this chain starting at 10773, dealing with rx idle dead and busy handling
[15:20:24] <wiesand> Does it finish off “client side idledead processing”?
[15:22:00] <kaduk> I don't think it completely makes the client stop doing idle dead, but Andrew is right here
[15:23:13] <kaduk> My main point here is just that most of the reviews on that chain have been generally positive, and I only see a couple of -1s about minor issues
[15:23:55] <kaduk> So, we can go forward and fix those minor things and try to merge it, but if there are general issues with it, we should find out now.
[15:24:35] <kaduk> Andrew, do you want to say anything here?
[15:24:35] <wiesand> If there are any, I’ll find them once there’s a prerelease to test :-(
[15:24:53] <deason> the parts about 'busy' iirc could be broken out (there were separate issues fixed there, I think?); they just both touch a lot of the same parts of code
[15:25:17] <deason> no issues I am aware of, but nobody runs that code
[15:25:31] <kaduk> Do you think we should skip it for 1.8?
[15:26:25] <wiesand> 10789 is not part of that chain, right?
[15:27:19] <kaduk> correct
[15:28:31] <deason> I'd like it in as soon as possible, since it still causes visible problems
[15:29:22] <wiesand> I’d be surprised if it didn’t... still longing for the pre-1.4.8 state of affairs.
[15:29:31] <kaduk> Okay, I guess we should try to move forward with it, but not consider it a blocker
[15:29:54] <wiesand> Sounds reasonable.
[15:30:20] <kaduk> Next on my list is the volume header times.
I haven't gotten to digest jhutz's analysis yet, so maybe we should skip that, unless someone else wants to say something?
[15:31:13] <kaduk> 11604 is vldb mh entries vs. vos changeaddr, which has inspired a little controversy
[15:32:04] <kaduk> I think it's clear that we should not leave a codepath in which is an easy way to corrupt the vldb
[15:32:13] <kaduk> But do we deny the operation entirely, or fix it?
[15:32:16] <meffie> well, my first idea was to just return an error.
[15:32:53] <meffie> but the patch i submitted just changed the address.
[15:33:45] <deason> I don't agree with removing the operation, since it makes it impossible to modify such addresses
[15:33:47] <meffie> i can revisit my original idea and just return an error when the address is in an mh entry, and we should make vos changeaddr deprecated.
[15:33:57] <deason> it could be simply renamed/moved, though, to avoid people from running it
[15:34:07] <meffie> you can use vos setaddrs to set the addresses.
[15:34:23] <meffie> for a given uuid
[15:34:35] <kaduk> vos eatmydata?
[15:34:59] <meffie> well, the data is there, just the references are broken :)
[15:35:15] <wiesand> The 1.4 manpage seems to deprecate anything but -remove. I never used it for anything else. With my site admin hat on: axe it in 1.8.
[15:36:42] <meffie> the problem is people still run changeaddrs from time to time (new people) and this is harmful.
[15:37:04] <deason> I'm a little confused as to what that change does; it just updates the mh entry to point to the specified ip? it doesn't remove the mh entry?
[15:38:14] <meffie> currently there is a bug in changeaddr. the remove just happens to work, because we break out of a loop early. when done without the -remove, we orphan the mh entry.
[15:38:59] <deason> yes, but to 'fix' it should still cause a non-mh entry to exist, like I believe it does now; the command is for handling non-mh entries
[15:39:04] <meffie> the patch in gerrit fixes that bug, and then if we do find an address in an mh entry, it is changed.
[15:39:43] <deason> if we wanted to avoid people running it for mh entries, I don't see why we wouldn't just modify 'vos changeaddr' to refuse to run if it detects an mh entry
[15:39:44] <meffie> optionally, if -oldaddr is in an mh entry, we could return an error.
[15:39:58] <deason> (and give a '-force' flag or something)
[15:40:07] <meffie> yes, that was my original idea, and jeff's point
[15:40:39] <meffie> in any case, there is a bug in vlprocs.c
[15:40:48] <deason> jeff (a) said the vlserver should error our, which I completely disagree with
[15:41:03] <deason> if 'vos' errors out, that's okay and there should be a way to override it
[15:41:08] <deason> er, "error out"
[15:41:31] <meffie> still there is a bug in vlprocs.c that needs to be fixed, and we need to handle it.
[15:42:00] <meffie> either we handle it, or return an error in the vlserver.
[15:42:41] <deason> yes, and "handling it" means creating a non-mh entry for that ip, as we always have
[15:43:04] <deason> but just avoiding orphaning the other entries
[15:43:07] <meffie> oh, dear. and then, we need to remove the old mh entry.
[15:43:44] <meffie> i do not think that is the right thing to do.
[15:44:24] <meffie> why should we do that? just to make it bug compatible?
[15:45:41] <deason> right now it's the only way to make such entries that I am aware of
[15:46:20] <meffie> it is. but i think it was just a mistake :) but ok, such a change is the smallest thing we could do.
[15:46:25] <kaduk> This discussion is making me lean towards Jeff A's position, of just having the vlserver deny such requests...
[15:46:26] <deason> I think it would be simpler to just error out in vos and not care about the bug
[15:46:46] <kaduk> But people will still use old clients; we can't not fix the vlserver
[15:47:08] <meffie> i'll do 2 patches, one for vos, another for the vlserver
[15:47:41] <kaduk> What will the vlserver one do?
[15:47:51] <meffie> i'm not sure if we can reliably know from vos if it is a mh entry tho.
[15:48:19] <kaduk> Just do what Stephan says, and remove the changeaddrs subcommand entirely :)
[15:48:20] <deason> you get a 255.0.0.x address for it
[15:48:22] <meffie> i'll fix ChangeAddr() to not orphan mh entries
[15:48:58] <kaduk> I think that's okay by me.
Jeffrey, do you want to say anything more?
[15:49:23] <deason> or just make 'vos changeaddr' not run at all unless given a -force option or something
[15:49:49] <deason> that would probably handle over 99% of the calls to it
[15:50:22] <meffie> ok, and later we can have a vos remaddrs, and just say vos changeaddr, just dont use it?
[15:50:38] <deason> or like I said above, just rename the command
[15:50:56] <kaduk> > vos changeaddr, just don't use it
[15:51:00] <kaduk> right
[15:51:11] <meffie> just dont.
[15:51:12] <deason> "this command exists for legacy systems" or "only use this if you know what you're doing" etc
[15:51:35] <meffie> ok, thanks for the discussion.
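deason's proposal above (have 'vos changeaddr' refuse to touch a multi-homed entry unless given something like '-force') can be sketched roughly as follows. This is only an illustration, not the vos or vlserver code: the function names and the `force` parameter are hypothetical, and the test for a multi-homed reference assumes that a first octet of 255 is what marks one, going only on the "255.0.0.x" remark at 15:48:20.

```python
import ipaddress


def looks_like_mh_reference(addr: str) -> bool:
    """Return True if addr has the 255.x.x.x shape mentioned in the chat.

    Assumption: a first octet of 255 marks a reference to a multi-homed
    (mh) entry rather than a real server address.
    """
    return ipaddress.IPv4Address(addr).packed[0] == 255


def check_changeaddr(old_addr: str, force: bool = False) -> bool:
    """Hypothetical client-side guard: refuse mh entries unless forced."""
    if looks_like_mh_reference(old_addr) and not force:
        raise ValueError(
            f"{old_addr} refers to a multi-homed entry; "
            "refusing to change it without force"
        )
    return True
```

A caller would run `check_changeaddr(addr)` before issuing the change, and pass `force=True` only when the administrator has explicitly asked for the override.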
[15:51:38] <kaduk> Next up is 2591, byte-range locking lock order.
Apparently Simon has a super-awesome idea, but it sounds like nobody has actually implemented it.
[15:54:02] <deason> jeff h mentions that simon's description can be done separately from that change
[15:54:05] <kaduk> I thought it was a well-established rule when dealing with multiple locks that any given pair of locks has a strict hierarchy, wherein A is locked first and unlocked last
[15:54:23] <deason> that is, it's a perf improvement to do the checks like that, but the change in question is fixing correctness for the locking behavior
[15:55:27] <deason> for "normal" locks the unlocking order usually doesn't matter if there's nothing blocking the unlocks; you'll unlock both anyway, and if someone locks both, they'll wait for the unlock on both
[15:55:33] <kaduk> I haven't actually reviewed Hans-Werner's change, but it sounds like what Andrew is saying -- that this is a strict lock order correctness fix.
[15:56:29] <deason> here, the two locks are not equal (one is local, one remote), and we can't block on locking one once we've gotten the other
[15:57:51] <deason> er, maybe that last part is not right; I'm trying to understand what hans is saying in that comment :)
[15:58:35] <kaduk> Okay, so we should spend more time reviewing this not-during-the-meeting.
But have we convinced Jeffrey that there may be separable issues here?
[15:59:17] <deason> oh wait, that's weird, it's because the flock() handling does a non-blocking lock for the local one but the fcntl locking appears to not do that
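The lock-hierarchy rule kaduk states at 15:54:05, and the non-blocking attempt on the second lock that deason notices in the flock() path, can be sketched as below. This is a generic illustration with in-process locks, not the code under review in 2591; `local_lock`, `remote_lock` and both function names are hypothetical.

```python
import threading

# Two locks with a fixed hierarchy: local is always taken before remote.
local_lock = threading.Lock()
remote_lock = threading.Lock()


def with_both_locks(work):
    """Take local first, then remote; release in the reverse order."""
    with local_lock:
        with remote_lock:
            return work()


def try_both_locks(work):
    """Variant that never blocks on the second lock.

    Returns None if the remote lock is busy, so the caller can drop the
    local lock (done here by leaving the with block) and retry later.
    """
    with local_lock:
        if not remote_lock.acquire(blocking=False):
            return None
        try:
            return work()
        finally:
            remote_lock.release()
```

Because every caller acquires the pair in the same order, two threads can never each hold one lock while waiting for the other.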
[16:00:49] <kaduk> So, I'm inclined to move on to 9588, adjusting the RPC-L for some ptint.xg types to add a length restriction.
[16:02:17] <kaduk> Does anyone object to putting in a large limit, like 50k or 500k?
[16:02:30] <deason> no, I don't care much about what the value is
[16:04:31] <kaduk> I was hoping to hear from Jeffrey Altman
[16:05:09] <kaduk> Next up, hash table sizes: 9919, 10801, 10802.
[16:05:52] <kaduk> Looks like 9919 and 10801 are ~duplicates
[16:07:39] <kaduk> I assume there are no objections to the abstract idea of increasing the hash table sizes for vnodes, vcaches, and dcaches, but we will need to tweak the actual sizing algorithms and potentially the hash function used.
[16:08:11] <deason> I don't mean to object strongly to just having a hard-coded size like that; it's better than what we have now
[16:08:35] <deason> my only concern is any other bugs caused by changing the size, but if they've been getting exercised then it's probably fine
[16:08:42] <kaduk> Okay.
[16:08:59] <kaduk> "Any volunteers to convert things to using the jenkins hash?"
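For readers unfamiliar with the term, here is a minimal sketch of one of Bob Jenkins's hash functions (the "one-at-a-time" hash), used to pick a bucket in a table of a given size. It is chosen for brevity and is not necessarily the variant kaduk has in mind; `bucket_for` and its key layout (volume id followed by vnode number) are hypothetical.

```python
def jenkins_one_at_a_time(data: bytes) -> int:
    """Bob Jenkins's one-at-a-time hash, truncated to 32 bits."""
    h = 0
    for byte in data:
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    # Final avalanche steps.
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h


def bucket_for(volume_id: int, vnode: int, table_size: int) -> int:
    """Hypothetical helper: map a (volume, vnode) pair to a bucket index."""
    key = volume_id.to_bytes(4, "big") + vnode.to_bytes(4, "big")
    return jenkins_one_at_a_time(key) % table_size
```

Raising `table_size` alone shortens the chains, which is the near-zero-effort change deason argues for; a stronger hash mainly helps when keys cluster.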
[16:10:30] <kaduk> Next topic is "using encryption more places".  11349 is submitted, for encrypting volume releases, but we may also want to encrypt server-to-dbserver traffic, etc..
[16:10:53] <deason> I don't see why changing the hashing algorithm is necessary; the benefit for just increasing the constant is pretty obvious and the effort is near zero
[16:11:11] <kaduk> There would be performance concerns, but do we want to override those with claims of doing the responsible thing?
[16:12:05] <deason> I think you'd need a command-line arg, like chas says
[16:12:43] <deason> since it seems quite likely that there will be situations where crypt will be too slow; without an option, what can a site do?
[16:13:15] <kaduk> "patch the code" ;)
[16:13:36] <kaduk> But yes, there would need to be an option.
Are we willing to set the default to 'encrypt'?
[16:14:12] <wiesand> My fileservers tend to have plenty of idle cycles available, no matter what’s going on...
[16:14:33] <deason> default to encrypt is fine with me
[16:14:53] <kaduk> Yay
[16:15:10] <kaduk> That was the last 'HIGH' priority item.
[16:15:36] <kaduk> 11516, "vos: preserve cloneId and backupId when restoring"
[16:15:45] <kaduk> Is this actually a no-brainer, like it looks at first glance?
[16:16:47] <meffie> it is in use at one site.
[16:17:35] <kaduk> Such rousing approval...
[16:17:53] <kaduk> 435, "clear stat flag on renamed directories"
[16:17:54] <meffie> heh, it fixes an annoyance
[16:18:12] <kaduk> This has been sitting there for a really long time.  Anyone know why?
[16:20:41] <deason> I'm not really clear on why it's necessary/helpful to do that
[16:21:01] <deason> we shouldn't need to clear CStatd for a local op, usually, unless localhero says to
[16:21:27] <kaduk> localhero?
[16:22:40] <Jeffrey Altman> sorry folks.  I have a really sick dog and was up most of the night.   hence the reason I replied to jhutz's e-mail hours ago.    just waking up but need to check on the pooch
[16:22:58] <deason> when we do a rename operation, we get status information back for the src and dest directories
[16:23:03] <deason> which includes the dv and other metadata
[16:23:25] <kaduk> Ouch, sorry to hear about the dog.
[16:23:26] <deason> if that info is consistent with what we have in the cache, we can modify our local cache and say everything is still up-to-date
[16:23:34] <kaduk> I did wonder a little bit about the timing of that reply...
[16:23:49] <deason> making that decision is what afs_LocalHero does (later on in the rename function)
[16:24:07] <kaduk> Oops, I forgot -i to git grep
[16:24:40] <deason> so it's not clear what the issue is or whatnot, what it's solving; and of course it was 4 years ago :)
[16:25:00] <kaduk> Right.
[16:25:22] <kaduk> Maybe I will ask Anders to run one scripts server with it just for kicks, but otherwise it sounds like we should continue ignoring it.
[16:25:53] <kaduk> 7286, "libafs: trigger volume lookup on no conn or no server"
[16:27:33] <kaduk> It's not clear to me how this relates to RT 130714; Andrew, do you remember?
[16:28:56] <kaduk> Like, it seems like this change is really about a different issue
[16:32:40] <kaduk> I ... guess we should move on, if there's nothing to say.
[16:32:51] <deason> I don't think I was saying it fixed that situation, or anything, just that the code path is relevant in that situation
[16:33:04] <kaduk> okay
[16:33:06] <deason> that is, when we don't have any addrs to contact the server on, or something like that
[16:33:12] <meffie> is this the thing to deal with stale volume info when servers are moved / shutdown?
[16:33:15] <kaduk> Sure.
[16:33:20] <deason> but from what I said in the ticket, it seems like that problem went away and that change shouldn't be necessary
[16:33:27] <deason> so I don't think there's an issue here anymore
[16:33:44] <kaduk> Yeah, there was an explicit check for "the address is empty" added
[16:34:06] <kaduk> but maybe 7286 would help with the case meffie mentions, stale volume info
[16:35:19] <deason> the case mike mentioned is the scenario where any of this matters (7286, 7287, rt 130714)
[16:35:50] <kaduk> Do you think 7286 should be abandoned, then?
[16:35:51] <deason> in 7286, I mention we re-lookup things anyway, it looks like, I think?
[16:36:36] <deason> that's my impression, but I don't know if I'm remembering it correctly; I'm not aware of issues in this area ever coming up since then
[16:37:00] <deason> maybe daria has another opinion, but daria is the one that would need to abandon it anyway
[16:37:04] <kaduk> I think I had trouble figuring out all the comments in the tickets
[16:37:26] <kaduk> Well, Daria knows where to find us if there is more to say.
[16:37:39] <kaduk> Next up: 6895, "rx: race can lead to sending RX_PACKET_TYPE_BUSY"
[16:37:59] <kaduk> Apparently, Simon had a better approach.  Do we know if that approach ever got written down or implemented or anything?
[16:41:16] <kaduk> > Do we know
Sounds like "no"...maybe Jeffrey will mention something later when things are settled down for him.
[16:41:51] <kaduk> 3271, "DAFS: don't reference Volume* after its freed"
Somehow I thought this might be overtaken by events, but I'm not sure why
[16:44:04] <deason> the other gerrit mentioned makes that one unnecessary, I think
[16:44:17] <deason> 3272 makes that function not free the relevant structure
[16:44:38] <kaduk> Okay.
Daria, please abandon 3271 if you agree.
[16:44:52] <deason> this area of code has I think changed a bit since then anyway; I think it could be abandoned and if it needs to be there, something else could be created
[16:45:00] <kaduk> Sure
[16:45:10] <kaduk> Okay, 9123.  Did jhutz say something about this one in his reply?
[16:45:14] <kaduk> I'll go check...
[16:46:03] <kaduk> He did.  "vaguely remember having some concern"
[16:46:55] <kaduk> I'm not sure that that leaves us with much to discuss right now, so let's skip it for now.
[16:47:48] <kaduk> 8204, "DAFS: Free header on partially-attached vol salv"
[16:48:59] <kaduk> Andrew, how confident are you in the analysis in the commit message?
[16:51:34] <wiesand> “this is not urgent for 1.6; worst case should be a small memory leak, which is fixed upon successful salvage”
[16:52:32] <kaduk> I mean, it's not urgent, sure, but if we have code that we think is right, now is the time to ship it.
[16:52:57] <deason> yes, it's certainly not urgent... I believe it's still correct and relevant but I'm going to need a block of time to re-verify that
[16:53:20] <kaduk> Do you want us to defer it, then?
[16:54:30] <deason> do you mean punt to after 1.8? it should be included if possible, but as mentioned, it's not terribly important
[16:54:59] <kaduk> Okay, we'll pull it in if it works out, but not consider it a blocker.
[16:55:34] <kaduk> 10713, "Linux: avoid export_op_default if not exported"
[16:56:13] <kaduk> It has a +1 from Marc.
Maybe we should just go for it?
[16:57:50] <deason> it looks like this affects linux support for exactly 1 linux version, if I'm reading that right
[16:59:00] <deason> I do not like that it requires success for "new" behavior that it is testing for... but at first glance that seems necessary
[16:59:25] <kaduk> > exactly 1 linux version
That seems plausible, and would explain why there haven't been lots of people yelling and screaming
[17:00:09] <deason> I'm okay with it besides the one thing I noted in gerrit, I think
[17:00:19] <kaduk> Okay, thanks.
[17:00:28] <deason> but I'm not really familiar with the fhs, or any ramifications of constructing our own manually etc
[17:00:34] <kaduk> Next up, 11600, log rotation.
[17:01:31] <meffie> or, dont eat my logs.
[17:01:44] <kaduk> I think I'm okay with the scheme Mike presents in 11600, to only roll a log on server startup if the log is large enough; I think this is similar to many syslog schemes.
[17:02:10] <kaduk> We routinely have issues where someone has restarted things a few times since seeing a problem, or during debugging, and the precious log output is gone.
[17:02:32] <kaduk> I think there was also another change about rolling the bosserver log, which doesn't use this code.
[17:02:54] <kaduk> Right, 3347
[17:03:15] <kaduk> 3347 is not okay as-is, but as jhutz says, "now is the time to do it"
[17:03:26] <kaduk> Mike, do you think you'll have time to refresh 3347?
[17:04:35] <meffie> i can make some time for it
[17:04:48] <kaduk> Thanks.
[17:05:05] <meffie> since, i need the logs :)
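The scheme kaduk describes at 17:01:44 (roll the log at server startup only when it has grown large enough) can be sketched as follows. This is not the code in 11600: the function name, the ".old" suffix and the default threshold are all assumptions made for the illustration.

```python
import os


def roll_log_if_large(path: str, min_bytes: int = 1_000_000) -> bool:
    """Rename path to path + '.old' only if it is at least min_bytes long.

    Returns True if the log was rolled. A small or missing log is left
    alone, so a few quick restarts do not discard earlier output.
    """
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return False
    if size < min_bytes:
        return False
    os.replace(path, path + ".old")
    return True
```

A server would call this once at startup, before opening the log for append, so that output from before the restart survives unless the file has already grown past the threshold.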
[17:05:34] <kaduk> I'll mention 10338, "update 'now' when raising events", but I think the ball is in my court.
I'm happy to take suggestions/advice, but won't wait for any
[17:06:06] <kaduk> 11331, "Make all VLDB interactions use VLF/VLSF names"
[17:07:04] <kaduk> nwf put together this cleanup, and pushed an update earlier today.
It will probably be fine, I just wanted to point it out in case anyone had objections.
[17:07:31] <kaduk> 10978, 10979, ..., deal with README and similar files which may or may not be at the root of the repo.
[17:07:45] <kaduk> I'd really like to have Jeffrey here for this discussion, so probably we should skip this one too.
[17:08:20] <kaduk> That's the end of the 'MEDIUM' list, and we're already two hours in.
[17:08:34] <wiesand> [waves white flag]
[17:08:42] <kaduk> Probably, that means we should call it quits for today, and see if we can get more stuff done in email and gerrit.
[17:08:51] <kaduk> Thanks for sticking with us this long :)
[17:09:15] <wiesand> Thank you for all the work you’re doing to give us 1.8!
[17:09:28] <kaduk> I will try to do minutes.
[17:09:29] <meffie> nice work ben.
[17:09:51] <wiesand> I have to run. Thanks a lot everyone!
[17:10:04] wiesand leaves the room
[17:15:17] kaduk leaves the room
[17:19:06] meffie leaves the room
[17:19:38] Marc Dionne leaves the room
[17:23:24] deason leaves the room
[17:46:51] kaduk joins the room
[22:20:19] <Jeffrey Altman> sorry about this morning.  a sick dog is like a sick child except you are sure that s/he isn't faking to get out of school
[22:27:40] <kaduk> Yup.
Life happens; we can deal.
[22:39:47] <kaduk> Looking at 11313, I thought that even "196.168.1.1/16" still meant the whole /16 -- does the slash only apply to zero bits?
[22:46:59] <Jeffrey Altman> I don't understand CIDR well enough.
[22:47:15] <kaduk> I just posted a comment; Chas has been pretty active and should notice.
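On kaduk's CIDR question at 22:39:47, one way to see how a prefix with non-zero host bits is commonly treated is Python's standard `ipaddress` module. This shows only that library's behaviour and says nothing about what the code in 11313 does; `same_network` is a hypothetical helper.

```python
import ipaddress

# With strict=False the host bits are masked off, so the address below
# names the whole /16 that contains it.
net = ipaddress.ip_network("196.168.1.1/16", strict=False)
print(net)


def same_network(a: str, b: str) -> bool:
    """Hypothetical helper: do two CIDR strings name the same network?"""
    return (ipaddress.ip_network(a, strict=False)
            == ipaddress.ip_network(b, strict=False))
```

With the default `strict=True`, the same string is rejected with a `ValueError` because host bits are set, so the two readings kaduk contrasts both exist in practice, depending on how strict the parser is.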
[22:55:35] kaduk leaves the room