release-team@conference.openafs.org
Friday, May 4, 2018

GMT+0
[00:35:32] mvita joins the room
[03:11:45] mvita leaves the room
[11:51:51] Marcio Barbosa joins the room
[12:39:54] wiesand joins the room
[12:40:25] <wiesand> I may be a few minutes late today.
[12:40:30] wiesand leaves the room
[12:41:37] <kaduk@jabber.openafs.org/barnowl> Okay.
[12:55:45] meffie joins the room
[12:56:57] mvita joins the room
[12:57:50] <mvita> present
[13:00:30] <kaduk@jabber.openafs.org/barnowl> greetings
[13:00:44] <mvita> Hi!
[13:01:12] <kaduk@jabber.openafs.org/barnowl> If Stephan is not here, maybe we should not start with 1.6.
[13:01:34] <mvita> agreed
[13:01:48] <kaduk@jabber.openafs.org/barnowl> Any 1.8 success/failure reports worth mentioning?
[13:02:01] <kaduk@jabber.openafs.org/barnowl> (johnfg's network configuration does not count, I think)
[13:02:07] <meffie> good morning
[13:02:19] <kaduk@jabber.openafs.org/barnowl> Hi Mike!
[13:02:22] <meffie> yes, success reports
[13:02:27] <mvita> we are aware of a site that "accidentally" deployed 700 new clients
[13:02:36] <mvita> and so far, so good
[13:02:45] <kaduk@jabber.openafs.org/barnowl> heh :)  Ubuntu's fault, or otherwise?
[13:02:51] <mvita> you got it
[13:02:53] <meffie> i think it was about 300 or so
[13:03:03] <meffie> but yes it was ubuntu
[13:03:29] <mvita> I remember 700 - but it was hundreds
[13:03:57] <meffie> yeah, so far no issues on them.
[13:04:02] <kaduk@jabber.openafs.org/barnowl> Cool!
[13:04:49] <mvita> The "problem" report on #openafs may be discounted, I think.
[13:05:11] <mvita> ;-)
[13:05:15] <kaduk@jabber.openafs.org/barnowl> right :)
[13:06:46] <mvita> Ben, have you heard any other reports?
[13:07:30] <kaduk@jabber.openafs.org/barnowl> Nothing particularly interesting.  A handful of folks at MIT running
on their personal machines, but we haven't taken it for the
public-facing deployments yet.
[13:08:02] <kaduk@jabber.openafs.org/barnowl> And we had someone run into the freebsd client panic that 13014 is
supposed to fix
[13:09:16] <kaduk@jabber.openafs.org/barnowl> But I don't remember anyone else saying anything, good or bad
[13:10:17] <kaduk@jabber.openafs.org/barnowl> I guess we should start talking about what should go into 1.8.1,
though.  (Obviously I want 13014!)
[13:10:35] wiesand joins the room
[13:10:49] <mvita> <looking at 13014>
[13:11:42] <wiesand> [breathless] Hello
[13:11:59] <kaduk@jabber.openafs.org/barnowl> Hi Stephan!
[13:12:17] <mvita> Welcome!
[13:13:03] <wiesand> Homing in on 23pre1…
[13:13:25] <wiesand> Ben, any chance you merge those two kmodtool fixes on 1.8?
[13:13:40] <kaduk@jabber.openafs.org/barnowl> Sure, let me pull those up...
[13:13:59] <wiesand> They were pulled up
[13:14:27] <wiesand> https://gerrit.openafs.org/#/q/status:open+project:openafs+branch:openafs-stable-1_8_x+topic:kmodtool-fixes
[13:14:46] <wiesand> Just click submit ;-)
[13:15:00] <wiesand> The higher numbered one must go first
[13:15:31] <kaduk@jabber.openafs.org/barnowl> Er, "pull those webpages up in my browser", I mean.
[13:15:42] <wiesand> Ah
[13:15:50] <mvita> :O
[13:18:22] <kaduk@jabber.openafs.org/barnowl> (merged)
[13:18:55] <wiesand> Thanks. Pullups of those will complete my list of changes for pre1, except ctx - I haven't gotten around to looking at it again.
[13:19:10] <wiesand> Then NEWS, et voila
[13:19:48] <wiesand> We should really get that pre1 out next week now.
[13:19:49] <kaduk@jabber.openafs.org/barnowl> > except ctx
er, which?
[13:20:51] <wiesand> hang on...
[13:20:53] <meffie> i think he meant ctf-tools
[13:20:57] <meffie> 12902
[13:21:33] <meffie> i think those could wait since we want to add ctf-records to the server processes as well
[13:21:45] <wiesand> er, yes that one
[13:21:53] <meffie> so, there will be more changes to the makefiles for that.
[13:22:28] <wiesand> I don't consider it urgent. It's just on my list.
[13:22:35] <meffie> ok, thanks
[13:23:05] <wiesand> There just were those clashes with your docs reshuffling on master, and I hadn't made up my mind how to resolve them.
[13:23:58] <wiesand> Did I miss anything urgent for pre1?
[13:24:20] <kaduk@jabber.openafs.org/barnowl> (I haven't been paying enough attention, sorry.)
[13:24:25] <meffie> not that i know of.
[13:25:32] <wiesand> Good. Let's plan pre1 for next Friday then. Ben, will you be available for the meeting?
[13:25:48] <kaduk@jabber.openafs.org/barnowl> I believe so.
[13:26:02] <wiesand> Great.
[13:27:29] <wiesand> My last 1.6 related item for today is then that I downloaded EL6.10 beta today and will check whether we need to do anything "soon".
[13:27:53] <mvita> Thank you.
[13:27:58] <kaduk@jabber.openafs.org/barnowl> Yes, thank you!
[13:28:57] <wiesand> On to 1.8/master?
[13:29:59] <wiesand> It looks like there is quite a bit of activity plowing through the outstanding changes on master due to the 1.8 freeze. Which is great.
[13:30:18] <kaduk@jabber.openafs.org/barnowl> It is ... if I can keep up with it all :-/
[13:31:05] <meffie> we've been trying to review review review
[13:31:18] <kaduk@jabber.openafs.org/barnowl> thanks :)
[13:31:23] <mvita> did we overlook anything you would like to see go forward?
[13:32:01] <mvita> "No gerrit left behind" and so on?
[13:33:04] <kaduk@jabber.openafs.org/barnowl> I'd have to check...
[13:33:45] <wiesand> I'm still curious about 12290
[13:33:47] <meffie> more rxgk reviews are on the stack.
[13:34:51] <kaduk@jabber.openafs.org/barnowl> I'm not sure who feels like they own 12290 at this point.
[13:35:29] <kaduk@jabber.openafs.org/barnowl> But maybe we should just abandon it, per Marcio's last comment.  Hard
to say.
[13:38:00] <wiesand> 13044 still makes me nervous
[13:38:44] <meffie> 12290 seem harmless and useful (for debugging)
[13:39:09] <kaduk@jabber.openafs.org/barnowl> meffie: okay, please push an updated commit message for 12290, then ;)
[13:39:22] <meffie> ok
[13:39:48] <meffie> i see it
[13:40:24] <mvita> Stephan:  what about it makes you nervous?
[13:41:46] <wiesand> As Andrew pointed out (and Daria said the same in the past), it's not idledead itself causing issues but some other client side code not coping.
[13:42:43] <wiesand> In the case of "the client" these issues were really really bad.
[13:43:40] <meffie> btw, it seems the root cause for the ptserver hangs is fixed with marcio's ubik fixes
[13:43:46] <mvita> I'll have to look through the archives to get some specifics.
[13:44:35] <meffie> this timeout fix is just to make the fileserver more robust for other errors
[13:44:43] <wiesand> Now I don't know the difference between idledead and harddead, or if the fileserver as a pt client could be affected in similar ways.
[13:45:06] <meffie> yeah, idledead and harddead are different things.
[13:45:12] <mvita> The dead timeout is between the fileserver and ptserver here;   it's a definite improvement versus a wedged fileserver and a bunch of wedged client requests on that fileserver
[13:45:57] <mvita> The actual cause of the original problem turned out to be that one of the ptservers was deadlocked
[13:46:06] <wiesand> In early 1.6, idledead caused the exact problem it's meant to solve: clients stuck forever, really quickly, once a fileserver became slow (just slow).
[13:46:10] <wiesand> And worse.
[13:46:26] <mvita> so if fileserver requests to that ptserver could time out, then the fileserver could try a different ptserver instead of blocking FOREVER.
[13:46:28] <wiesand> The other issues still aren't public.
[13:46:29] <meffie> yeah, that was "meltdown" prevention.
[13:47:31] <meffie> if one ptserver hangs, this timeout allows the fileserver to stay up.
[13:47:58] <mvita> ptservers are not generally subject to heavy load like the fileservers may be
[13:48:01] <meffie> otherwise all the threads end up getting stuck.
[13:48:22] <mvita> so I really don't see the concern about either dead time in this situation
[13:48:45] <meffie> i think harddead is better, it's simpler for sure.
[13:49:47] <meffie> if a ptserver hasn't responded in N minutes, then that is effectively hung.
[13:49:47] <wiesand> Past experiences with those timeouts were that they caused exactly the problem they were supposed to solve - long before the server was actually wedged. That's what makes me nervous.
[13:50:52] <wiesand> Andrew calls it my "idledead foo", and he's right in that I really have no clue what's going on.
[13:50:56] <meffie> yeah, the meltdown heuristics were problematic if i recall.
[13:51:54] <wiesand> It's really like food making you sick as a child - you avoid it for the rest of your life.
[13:51:56] <meffie> yeah, in this case we are not trying to see if the ptserver is just going "slower" or having slow data throughput.
[13:52:25] <meffie> we just need to recover more gracefully if it is completely hung.
[13:53:38] <wiesand> Are the connections in question strictly for reading?
[13:53:44] <mvita> There are many harddead timeouts between fileserver and client _today_, on all releases.
[13:53:48] <meffie> obviously, it's better to fix the ubik bug. good thing the fixes are queued up in 1.6.23 :)
[13:53:52] <mvita> yes, strictly reads
[13:54:07] <mvita> (GetCPS, GetHostCPS, NameToId, IdToName)
[13:54:30] <mvita> TINY reads
[13:54:54] <meffie> yeah, that's a good point, these reads are likely only one packet.
[13:55:23] <wiesand> So at least the client side won't drop data on the floor but believe and claim it was written successfully?
[13:55:45] <meffie> so, if we did not get it after waiting a few minutes, we probably should try a different ptserver rather than hang the whole fileserver.
[13:56:18] <mvita> Oh, Stephan, I misunderstood your question.
[13:56:46] <mvita> So the requests we want to time out are from filesever to ptserver...
[13:57:22] <wiesand> *that* I got ;-)
[13:57:24] <mvita> but they are (almost) always on behalf of a client request to the fileserver.  And those client requests could be anything.
[13:58:28] <mvita> When the ptserver can't reply, the requests from the client _do_ timeout on the fileserver, but the client can't be informed of that until the ptserver request is freed.
[13:59:16] <mvita> client —— <fetch> ——>  fileserver —— <getcps> —>   ptserver
[13:59:36] <kaduk@jabber.openafs.org/barnowl> The idea here is that the fileserver will reply to the client that
things failed, and so the client and fileserver agree on the state of
things.  idledead from client to fileserver has/had problems because
the client and fileserver disagreed about what the state of things
was.
[14:00:58] <meffie> that's a good summary.
[14:02:33] <meffie> i guess in a way it's like when the fileserver is not able to read a file, it returns an error to the client.
[14:03:03] <wiesand> Yes, thanks. My fear is that this time it could be the fileserver (as a client) and the ptserver disagreeing in the state of things…
[14:03:17] <meffie> (i'm glad Stephan is nervous, we don't want to repeat past errors)
[14:03:33] <kaduk@jabber.openafs.org/barnowl> But the fileserver only makes read queries of the ptserver, so I don't
think there is potential for an issue of that nature.
[14:03:54] <kaduk@jabber.openafs.org/barnowl> Either the fileserver has the right answer, or no answer at all, and
with no answer at all it is supposed to fail safe.
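The fail-safe behavior being discussed could be sketched roughly like this (a hypothetical Python illustration of the idea, not the actual OpenAFS code; names such as `query_ptservers` and `PtTimeout` are invented):

```python
# Hypothetical sketch: a fileserver-side read-only query against
# replicated ptservers.  Each replica gets a bounded wait; on timeout
# we fail over to the next one.  Because the query is read-only, the
# caller either gets a real answer or an error -- there is no written
# state for the two sides to disagree about.

class PtTimeout(Exception):
    """Raised when no ptserver replica answered in time."""

def query_ptservers(replicas, request, timeout_s=120):
    last_error = None
    for server in replicas:
        try:
            # send() is assumed to block for at most timeout_s seconds.
            return server.send(request, timeout=timeout_s)
        except TimeoutError as err:
            last_error = err  # this replica looks hung; try the next
    # Fail safe: report failure rather than guess at authorization.
    raise PtTimeout("no ptserver replied") from last_error
```

With three DB servers, a single wedged ptserver then costs one timeout interval per query instead of leaving every fileserver thread blocked indefinitely.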
[14:04:09] <wiesand> Mike: Not that one, anyway :)
[14:04:22] <meffie> the downside is we could timeout too soon, so it would be a false alarm.
[14:04:44] <mvita> right, the actual timeout value is the tricky part of the issue
[14:04:51] <meffie> or we could swamp servers with too many requests if we have too many timeouts that are retried.
[14:04:59] <wiesand> Ben: thanks, this removes my worst worries
[14:05:13] <mvita> that's why I went with the relatively long 120s already in use in other places in the code.
[14:05:34] <mvita> really, even 10 minutes is an improvement over a deadlock
[14:05:45] <kaduk@jabber.openafs.org/barnowl> There could still be unfortunate behavior on clients if their requests
to the fileserver fail "unnecessarily", but in networked file systems,
you kind of have to be prepared to deal with errors back from the
network.
[14:06:22] <wiesand> I agree.
[14:07:05] <mvita> We usually have 3 DB servers for redundancy.  Implementing this timeout lets us actually make use of that redundancy in another way.
[14:07:07] <wiesand> But the error must be propagated back all the way to the user.
[14:07:37] <mvita> The user's request has _already_ timed out, with or without this fix.
[14:07:50] <mvita> but with this fix, the user will find out.
[14:08:50] <mvita> When I looked at this in a fileserver core ....
[14:09:06] <mvita> all the fileserver threads were waiting for ptserver responses that would never come
[14:09:29] <kaduk@jabber.openafs.org/barnowl> Is the user's request actually guaranteed to have timed out if their
client is using hardmount?
[14:09:39] <mvita> and the associated RXAFS_* RPCs were already timed out.
[14:10:03] <mvita> that's a good question, I need to look at the hardmount implications.
[14:11:01] <mvita> but at this point (when the fileserver is looking up something in the ptserver), no actual fileserver partition IO has happened yet.
[14:11:10] <mvita> we are just checking authorization.
[14:12:01] <wiesand> Ok, you're the wizards here and will have to decide. But thanks a lot for listening and giving me a little bit more insight.
[14:14:11] <mvita> oh, of course, hard mount code is all in the client
[14:15:29] <mvita> so the client would just retry the request when it times out
[14:15:57] <mvita> if hard-mount is in effect
[14:18:35] <kaduk@jabber.openafs.org/barnowl> Do we want to continue discussing 13044, or should we move on to
other topics (if there are any)?
[14:20:00] <mvita> I understand Stephan's concerns better now, so I don't have any more re: 13044
[14:21:21] <kaduk@jabber.openafs.org/barnowl> I guess people had a lot of time to think about anything they wanted
in 1.8.1 :)
[14:21:22] <wiesand> fine, let's move on
[14:21:26] <wiesand> thanks again
[14:21:36] <meffie> kaduk@jabber.openafs.org/barnowl: sorry, random question. do you know if there are debian packages for mod_waklog?
[14:22:21] <wiesand> 1.8.1: I can tell after testing on EL6.10 beta ;-)
[14:23:02] <mvita> heh
[14:23:56] <kaduk@jabber.openafs.org/barnowl> meffie: apt-cache search and
https://packages.debian.org/search?suite=default&section=all&arch=any&searchon=all&keywords=AFS
seem to imply that mod_waklog is not in debian
[14:24:13] <kaduk@jabber.openafs.org/barnowl> I guess it's possible that someone has their own private packaging
files, but I don't know of any.
[14:24:36] <meffie> ok, thanks.
[14:25:02] <wiesand> BTW I still have some hope to get 1.8.0 packages done in time for the SL7.5 release.
[14:26:02] <wiesand> And uploading Stephen's SRPM to the 1.8 space is on my to-do list too.
[14:26:25] <kaduk@jabber.openafs.org/barnowl> cool, thanks
[14:28:35] <wiesand> Any buildbot news? Those linux-daily/rc builds are sadly missed...
[14:29:20] <kaduk@jabber.openafs.org/barnowl> The build master wasn't running for a few days (I think), since it's
not configured to start on boot, and the host machine got meltdown
patches.
[14:29:27] <mvita> they seem to be lacking a new dependency: libssl
[14:29:32] <meffie> sorry just getting to that
[14:29:45] <kaduk@jabber.openafs.org/barnowl> But it's possible I somehow started with the wrong config or
something--
oh, or that
[14:29:58] <meffie> i suspect i need to jump to 18.04 on the linux-rc builders.
[14:30:02] <kaduk@jabber.openafs.org/barnowl> Huh, I wonder why we need libssl
[14:30:11] <mvita> I wonder why as well
[14:30:18] <mvita> but didn't take a look yet
[14:33:24] <wiesand> Anything else to discuss today?
[14:33:44] <mvita> nothing from me.
[14:33:57] <wiesand> (I guess the 1.8.x "handover" discussion can wait at least another week)
[14:33:59] <meffie> it looks like i'll be going to hepix, i can give a report
[14:34:13] <meffie> unless ben can do so :)
[14:34:34] <kaduk@jabber.openafs.org/barnowl> I'm still not quite decided; it would be a lot more travel than I'm
used to.
[14:34:58] <meffie> understood
[14:35:31] <kaduk@jabber.openafs.org/barnowl> so thank you for going
[14:35:41] <meffie> hope you can go, it would be great to see you again.
[14:35:46] <kaduk@jabber.openafs.org/barnowl> I guess the only other topic I would have is if there are things for
master that should get prioritized.
[14:36:04] <mvita> oh, yes, that's a good question
[14:36:12] <kaduk@jabber.openafs.org/barnowl> I've left some comments on a few things, so hopefully new patchsets
can arrive soon and they can get merged.
[14:36:15] <wiesand> Yes, thanks for going Mike, and thanks for considering Ben.
[14:36:50] <wiesand> And I do support an independent report from the project at HEPiX
[14:37:26] <meffie> ok, great, i can circulate a draft for input from the release team
[14:37:44] <wiesand> Good, thanks a lot.
[14:39:56] <wiesand> Re priorities for master: prioritize whatever is ready or close. Just cutting down on the list of open changes will save everyone's time.
[14:40:21] <mvita> will do
[14:40:23] <kaduk@jabber.openafs.org/barnowl> I guess I do also have the search results up for
branch:openafs-stable-1_8_x -- did anyone look at jaltman's rx serial
number changes?
[14:40:48] <wiesand> I pulled them up ;-)
[14:41:32] <mvita> I see.
[14:42:22] <kaduk@jabber.openafs.org/barnowl> (He had wanted them in 1.8.0, but I decided that it was not urgent
enough to justify the risk of adding that late in the release cycle.)
[14:42:43] <wiesand> Seems perfectly reasonable to me.
[14:46:12] <kaduk@jabber.openafs.org/barnowl> Thanks :)
[14:46:39] <meffie> motion to adjourn ?
[14:47:24] <kaduk@jabber.openafs.org/barnowl> accepted
[14:47:53] <wiesand> sustained
[14:48:00] <kaduk@jabber.openafs.org/barnowl> Thanks everybody!
[14:48:01] <wiesand> Thanks a lot everyone!
[14:48:22] wiesand leaves the room
[14:48:59] <meffie> (and notes sent this time ;)
[16:16:05] mvita leaves the room
[16:28:38] mvita joins the room
[16:51:56] meffie leaves the room
[17:39:04] meffie joins the room
[17:41:35] meffie leaves the room
[18:16:24] mvita leaves the room
[18:26:02] mvita joins the room
[18:44:05] meffie joins the room
[18:45:07] meffie leaves the room
[21:41:44] Marcio Barbosa leaves the room
[22:03:42] mvita leaves the room