release-team@conference.openafs.org
Friday, October 5, 2018

GMT+0
[04:08:48] mvita joins the room
[04:25:36] mvita leaves the room
[12:52:57] wiesand joins the room
[12:53:59] meffie joins the room
[12:55:00] <meffie> good morning
[12:55:09] <kaduk@jabber.openafs.org/barnowl> greetings
[12:57:14] <wiesand> hi
[13:01:44] <wiesand> Lots of churn in gerrit...
[13:01:50] <meffie> ?
[13:02:29] <meffie> you mean we are changing things back and forth?
[13:02:47] <wiesand> I mean you are working on a lot of changes
[13:03:07] mvita joins the room
[13:03:11] <mvita> hello
[13:03:30] <meffie> oh, ok. i take the word churn to mean we are not making progress.
[13:03:44] <mvita> yes, churn has some negative connotations
[13:03:54] <kaduk@jabber.openafs.org/barnowl> The list of open changes sorted by most recent activity is churning a
lot
[13:04:17] <meffie> also, to make butter from cream
[13:04:30] mbarbosa joins the room
[13:05:23] <wiesand> It's not easy to follow what's going on (for me).
[13:05:31] <meffie> ah, yes, gerrit sorts by last thing touched.
[13:06:04] <wiesand> Anyway, I have no news and no new plans w.r.t. last week
[13:06:51] <mvita> Stephan: agreed, I wish one could sort by number or other column
[13:07:10] <wiesand> Which of the stacks being worked on currently are targeted for a stable release?
[13:07:15] <kaduk@jabber.openafs.org/barnowl> I guess our gerrit is a bit out of date; maybe a newer version would
have that feature.
[13:09:14] <mvita> stephan:  zap-error-code-cleanup is a real problem, targeted for stable
[13:09:23] <mvita> also avoid-stalled-servers
[13:09:38] <kaduk@jabber.openafs.org/barnowl> Oh, then I should ask about zap-error-code-cleanup, I guess.
[13:10:14] <mvita> macos 10.14 (mojave) support too
[13:10:24] <kaduk@jabber.openafs.org/barnowl> In particular, I would like more context on the first commit about
avoiding the query for parent id -- I don't have much experience with
the on-disk format and when the parent id is/isn't relevant
[13:12:41] <meffie> wiesand: you make a good point, some gerrits are intended for stable releases, and others are meant for eventual releases. i wish there was an easier way for us to categorize the gerrits. i try to do that in the weekly notes, but it takes time and is not complete.
[13:13:41] <meffie> e.g. if gerrit had something like "topic" but maybe called "target"
[13:14:17] <mvita> The linux project has a way to target changes for stable, but I don't recall the details
[13:15:15] <wiesand> Avoid-stalled-servers makes me nervous…
[13:15:28] <kaduk@jabber.openafs.org/barnowl> The topic name like that makes me nervous, too :)
[13:16:07] <wiesand> How thoroughly has this been tested in scenarios where fileservers are simply overloaded?
[13:16:09] <meffie> yes, it's bad when the servers stall. :)
[13:16:42] <wiesand> And for introduction in a stable series, such a feature should default to off IMO.
[13:17:02] <meffie> that sounds fine.
[13:18:17] <wiesand> And: this is about reading only, right?
[13:18:42] <meffie> that particular change is only helpful if for some reason read-only file servers are failing to provide data. they are "demoted" in preference, but not removed from the list of sites for a volume
[13:19:37] <wiesand> what happens when all servers are "stalled"?
[13:20:03] <meffie> storage hardware issues from time to time
[13:21:53] <wiesand> i mean what happens when all servers are perfectly ok but just so busy that clients think they're "stalled"?
[13:22:09] <meffie> at that point the fileserver fetch-data call hangs for some time, then the call times out. but the server is not marked down, because it is still up and replying to rpcs
[13:22:15] <meffie> (non data rpcs)
[13:23:17] <meffie> the server is "blacklisted" for *that* request, and so the data is fetched from another server on that request.
[13:23:47] <meffie> but subsequent requests from that client or other clients keep hitting the stalled server over and over.
[13:24:28] <wiesand> and write requests are not affected at all?
[13:24:29] <meffie> since the server still has a low (i.e. preferred) rank, even though requests are timing out
[13:25:08] <mvita> the commit in question only modifies client behavior for problematic fetches
[13:25:09] <meffie> yes, this is just for fetching data from read only file servers that have more than one site
[13:26:44] <meffie> the patch in gerrit does not dynamically change the server prefs. instead it adds a second pass. in the first pass we look for the lowest rank server that has not recently had a call dead timeout.
[13:27:23] <meffie> if none found, we try the ones that did have the call dead timeout anyway, just in case they are working now
[13:27:44] <wiesand> I just want to be sure we won't repeat the idledead disaster…
[13:28:02] <meffie> so it just affects how we pick which server to read from when fetching data.
[13:28:43] <meffie> but we do not take any machines "out" of the rotation to pick from, just how we pick from the ones that have read-only sites for the volume.
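A rough sketch in C of the two-pass site selection described above. The structure and field names here are invented for illustration and are not the actual OpenAFS identifiers:

    /*
     * Illustrative sketch only: pick a read-only site for a fetch.
     * Pass 1 prefers the lowest-rank server that has not recently hit
     * a call-dead timeout; pass 2 falls back to the timed-out servers
     * in case they have recovered.  Names are invented for clarity.
     */
    struct ro_site {
        int rank;               /* server preference; lower is better */
        int recently_timed_out; /* nonzero if a recent call-dead timeout */
    };

    static struct ro_site *
    pick_ro_site(struct ro_site *sites, int nsites)
    {
        struct ro_site *best = NULL;
        int pass, i;

        for (pass = 0; pass < 2; pass++) {
            for (i = 0; i < nsites; i++) {
                /* Pass 1: skip servers that recently stalled on fetch-data. */
                if (pass == 0 && sites[i].recently_timed_out)
                    continue;
                if (best == NULL || sites[i].rank < best->rank)
                    best = &sites[i];
            }
            if (best != NULL)
                break;   /* found a non-stalled site; no need for pass 2 */
        }
        return best;     /* NULL only if the volume has no sites at all */
    }

Note that, as discussed below, no server is removed from the rotation; the second pass simply retries the sites that had recently timed out.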
[13:29:52] <wiesand> ok then - sorry for being a nuisance in this area
[13:30:27] <kaduk@jabber.openafs.org/barnowl> no need to apologize; it's an important question to ask
[13:30:50] <meffie> no worries. there are potential issues. we are changing the effective server rank dynamically. which could have some serious consequences.
[13:31:33] <meffie> so, yes, maybe we should set the default to zero to disable this for 1.8.x
[13:34:34] <meffie> yes, no need to apologize, this is tricky stuff. and i really don't want to create regressions.
[13:34:47] <wiesand> thanks. back to Ben's question regarding zap-error-code-cleanup then?
[13:35:05] <meffie> ok, looking
[13:36:08] <meffie> i don't recall the details, other than that andrew gave the suggestions on how to create this fix. i can ask andrew and comment on the gerrit :)
[13:36:37] <meffie> this is one of those dafs cleanup-around-the-edges sort of commits, i feel.
[13:38:01] <kaduk@jabber.openafs.org/barnowl> Oh, is this DAFS-only or does the traditional fileserver use it as
well?
[13:38:56] <meffie> from memory, i think this was introduced with dafs
[13:39:16] <kaduk@jabber.openafs.org/barnowl> Ah, then maybe I should take a look at those early commits/messages
[13:40:23] <meffie> also, it is related to the 'volume group cache' that andrew introduced.
[13:40:57] <kaduk@jabber.openafs.org/barnowl> *nods*
[13:41:09] <kaduk@jabber.openafs.org/barnowl> Thanks, that gives me a place to start at least
[13:41:50] <meffie> this was reported by a site running dafs that was using vos zap to clean up rogue volumes.
[13:44:19] <meffie> sometimes the volume group cache is not populated, so the parent id lookup fails. the salvager retries when this happens. but in this case (vos zap) we don't need the parent id, so we can remove the volume by name regardless of what the parent id would be.
[13:45:18] <kaduk@jabber.openafs.org/barnowl> The parent ID being the ID of the RW in this case, and the child
potentially the RO/BK/etc?
[13:45:18] <meffie> it is fairly easy to test. just start a fileserver with a lot of volumes and do vos zap -force in the first few moments.
[13:45:28] <meffie> yes
[13:45:40] <kaduk@jabber.openafs.org/barnowl> *nods*
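A minimal sketch of the idea behind zap-error-code-cleanup as described above. The helper functions and the zapping flag are hypothetical stand-ins, not the real OpenAFS volume-package interfaces:

    /*
     * Illustrative sketch only; the helpers below are hypothetical
     * stand-ins, not the real OpenAFS interfaces.
     */
    int lookup_parent_id(unsigned int volid, unsigned int *parent); /* may fail early after startup */
    int purge_volume_by_id(unsigned int volid);                     /* removes the on-disk volume */

    int
    delete_volume(unsigned int volid, int zapping)
    {
        unsigned int parent;
        int code;

        code = lookup_parent_id(volid, &parent);
        if (code != 0 && !zapping)
            return code;   /* normal path: report the error; the salvager will retry */

        /*
         * vos zap: the parent id is not actually needed to remove the
         * volume, so a failed lookup (e.g. the volume group cache not
         * yet populated) should not make the zap fail.
         */
        return purge_volume_by_id(volid);
    }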
[13:46:59] <meffie> thank you kaduk
[13:48:22] <meffie> wiesand: are the changes for the 1.6.24 NEWS file ok?
[13:48:40] <wiesand> sure, thank you!
[13:49:25] <meffie> ok, thank you. (gerrit 13330)
[13:49:48] <wiesand> so 1.6.24pre1 is simply blocked on me
[13:50:15] <wiesand> (I "slightly underestimated" the effort to move into my new home)
[13:50:45] <kaduk@jabber.openafs.org/barnowl> This is like how I always "slightly underestimate" how many boxes I'll
need to move my stuff.  Like, by a factor of three.
[13:51:24] <wiesand> Exactly…
[13:52:42] <wiesand> Any other topics to discuss today?
[13:52:56] <kaduk@jabber.openafs.org/barnowl> I don't have much news on master.
[13:53:08] <kaduk@jabber.openafs.org/barnowl> But maybe we should talk about the bug report on the mailing list
[13:53:32] <wiesand> The Ubuntu apparmor thing?
[13:53:37] <kaduk@jabber.openafs.org/barnowl> Yeah
[13:54:04] <kaduk@jabber.openafs.org/barnowl> And to be clear, it *is* a bug that we're using the wrong credentials
for cache reads/writes so that apparmor complains
[13:54:19] <kaduk@jabber.openafs.org/barnowl> But I suggested the permissive profile for expediency
[13:56:00] <wiesand> What would it take to fix it?
[13:58:13] <meffie> looking for the ticket in rt...
[13:58:15] <kaduk@jabber.openafs.org/barnowl> It's not entirely clear, but probably a good first step would be to
reproduce the issue with apparmor not liking a request (doesn't even
have to be in enforce mode) and trying to track down what sort of
kernel operations are responsible
[13:58:30] <kaduk@jabber.openafs.org/barnowl> (I wasn't sure we had a fully topical ticket in RT for the recent
report, at least.)
[13:58:53] <kaduk@jabber.openafs.org/barnowl> Most of the magic is usually in src/afs/<OS>/osi_file.c, where we
"raw_open" cache files on disk
[13:59:04] <meffie> oh, you said mail list... sorry.
[13:59:22] <kaduk@jabber.openafs.org/barnowl> So we will have comments in the source like:
/* Use stashed credentials - prevent selinux/apparmor problems  */
[14:00:32] <kaduk@jabber.openafs.org/barnowl> But of course on linux the way this stuff works has changed several
times over the years, and it's possible we missed a kernel change
where we need to have an explicit "set current creds" call or
something.
Or ubuntu has backported changes to their kernel in a way that doesn't
interact well with our configure tests.
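A simplified sketch of the "stashed credentials" pattern described above, using the Linux kernel cred API (override_creds/revert_creds). cache_creds and open_cache_file are stand-in names for illustration, not the actual osi_file.c code:

    /*
     * Sketch: perform cache-file I/O with credentials captured at afsd
     * startup, so that LSMs such as apparmor/selinux attribute the
     * access to the cache manager rather than to the user process that
     * triggered it.  Names are stand-ins, not the OpenAFS source.
     */
    #include <linux/cred.h>
    #include <linux/fs.h>

    static const struct cred *cache_creds;   /* captured when the cache is set up */

    struct file *
    open_cache_file(const char *path)
    {
        const struct cred *old;
        struct file *filp;

        old = override_creds(cache_creds);    /* temporarily assume the stashed identity */
        filp = filp_open(path, O_RDWR | O_LARGEFILE, 0);
        revert_creds(old);                    /* restore the caller's credentials */

        return filp;
    }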
[14:01:25] <meffie> (openafs-info list subject: problems with ubuntu 18.04 client)
[14:04:59] <kaduk@jabber.openafs.org/barnowl> Does anyone want to volunteer to try to even reproduce the issue?
[14:05:23] <kaduk@jabber.openafs.org/barnowl> On an unrelated note, buildbot is feeling super-snappy recently
[14:06:03] <mvita> kaduk:  I can give it a whirl
[14:06:30] <kaduk@jabber.openafs.org/barnowl> Thanks!
[14:06:31] <meffie> so, i've not read all the messages on that thread carefully yet. did putting apparmor in permissive mode work around the problem reported?
[14:06:42] <kaduk@jabber.openafs.org/barnowl> I don't think we've heard back yet.
[14:06:53] <meffie> ok, thanks.
[14:06:56] <kaduk@jabber.openafs.org/barnowl> There was a lot of weird speculation that ignored the apparmor errors
[14:09:44] <kaduk@jabber.openafs.org/barnowl> Anything else for today?
[14:10:51] <meffie> none here, just going to do some more work on openafs-robotest this afternoon. :)
[14:10:52] <mvita> I believe the PPC crash (from rt.central.org 134647)
[14:11:08] <mvita> is fixed by a commit that was merged last week.
[14:11:18] <mvita> I'm waiting to hear back
[14:11:55] <kaduk@jabber.openafs.org/barnowl> yay
[14:12:13] <meffie> woot
[14:12:14] <mvita> rx: Convert rxinit_status to rx_IsRunning()
[14:12:38] <mvita> if it does, we can remove all the unused and broken rx_atomic bitops
[14:13:01] <mvita> I'll submit a change for that today
[14:13:25] <meffie> yay, thanks mark
[14:13:34] <wiesand> That would replace the outstanding fix discussed last week?
[14:13:43] <mvita> yes
[14:14:02] <wiesand> sounds promising - crossing fingers
[14:14:03] <kaduk@jabber.openafs.org/barnowl> So I guess these rx atomics were broken in multiple different ways,
exciting
[14:14:25] <meffie> and not really needed.
[14:14:48] <meffie> (just the bitops)
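A sketch of what such a conversion could look like: a single atomic flag plus a predicate instead of ad-hoc bitops on an init-status word. The types and helpers here are simplified stand-ins, not the merged commit:

    /*
     * Simplified illustration of replacing bit operations on an
     * init-status word with one atomic flag and a predicate.
     */
    #include <stdatomic.h>

    static atomic_int rx_running = 0;

    int
    rx_IsRunning(void)
    {
        return atomic_load(&rx_running) != 0;
    }

    /* Returns 1 if this caller performed the transition to "running". */
    static int
    rx_SetRunning(void)
    {
        int expected = 0;
        return atomic_compare_exchange_strong(&rx_running, &expected, 1);
    }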
[14:16:31] <wiesand> Shall we call it a meeting?
[14:16:38] <kaduk@jabber.openafs.org/barnowl> Sounds like we should
[14:16:52] <wiesand> Let's adjourn then. Thanks a lot everybody!
[14:17:22] <meffie> andrew says we don't need bitops that don't work.
[14:17:22] <mvita> okay, bye
[14:17:22] <mvita> thank you!
[14:17:22] <meffie> thanks, have a good weekend
[14:17:33] meffie leaves the room
[14:17:40] wiesand leaves the room
[14:19:13] <kaduk@jabber.openafs.org/barnowl> Thanks everyone!
[16:52:57] mbarbosa leaves the room
[18:32:48] mbarbosa joins the room
[20:29:06] mvita leaves the room
[21:50:41] mbarbosa leaves the room