release-team@conference.openafs.org
Friday, June 29, 2018

GMT+0
[00:10:41] kaduk@jabber.openafs.org/barnowl leaves the room
[01:08:22] kaduk@jabber.openafs.org/barnowl joins the room
[11:56:17] Marcio Barbosa joins the room
[12:56:41] mvita joins the room
[13:00:18] <kaduk@jabber.openafs.org/barnowl> greetings
[13:00:22] wiesand joins the room
[13:00:37] <wiesand> good morning
[13:00:53] meffie joins the room
[13:00:55] <mvita> salutations
[13:01:04] <meffie> good morning
[13:01:17] <wiesand> Any news from the Linux-next front?
[13:01:29] <mvita> joe is banging the rocks together
[13:01:45] <mvita> he thinks he has a fix for the timespec64 stuff
[13:02:02] <mvita> and there is a Linux patch for the non-GPL stuff
[13:02:09] <wiesand> great
[13:02:17] <mvita> (that is, a patch against Linux, not OpenAFS)
[13:02:32] <kaduk@jabber.openafs.org/barnowl> excellent!
[13:02:41] <wiesand> yes, Mike mentioned that we can't do anything about it on our side
[13:02:46] <mvita> he's getting his test env together today so he can see what else remains
[13:03:27] <mvita> and so he can devise an autoconf test for the timespec64
[13:03:27] Marcio Barbosa leaves the room
[13:04:03] Marcio Barbosa joins the room
[13:04:23] <wiesand> sounds very promising
[13:04:41] <mvita> yes indeed
[13:05:05] <wiesand> with a little luck, we may have this in time for 1.6.23pre1
[13:05:31] <wiesand> which I made a little bit of progress with
[13:05:36] <meffie> yay!
[13:06:27] <wiesand> chances are I'll have to kill ~ an hour per day in a very nice location with good 4G network… that should help it along
[13:06:39] <wiesand> for the next week or so
[13:06:51] <kaduk@jabber.openafs.org/barnowl> heh
[13:07:05] <kaduk@jabber.openafs.org/barnowl> Such a terrible thing to have to do.
[13:08:18] <wiesand> And that's about what I have on 1.6 today. 1.8.1pre1 seems the hotter topic anyway?
[13:09:48] Marcio Barbosa leaves the room
[13:09:50] <kaduk@jabber.openafs.org/barnowl> I pushed a 1.8.1pre1 tag, but haven't made tarballs or worked on a web
change yet
[13:10:28] <kaduk@jabber.openafs.org/barnowl> Per discussion with Andrew, I did not revert the dslot-eintr change
(but thank you for having prepared the revert and sparking the
discussion)
[13:10:30] <mvita> Please don't build 1.8.1 pre1 yet, there is a newly discovered bug
[13:10:43] <mvita> that I was going to discuss today
[13:10:56] <wiesand> Shall I abandon the revert?
[13:11:12] Marcio Barbosa joins the room
[13:11:27] meffie leaves the room
[13:11:52] meffie joins the room
[13:11:53] <mvita> (sorry, I'll wait until revert discussion finishes)
[13:11:59] <kaduk@jabber.openafs.org/barnowl> I think it's safe to abandon the revert, yes.
[13:12:06] <wiesand> Ok, will do.
[13:12:58] <wiesand> Re pre1: the tag was pushed, and the repo is public, so we'll probably start out the 1.8.1 cycle with a pre2 "for technical reasons" if there is a bug we must fix before
[13:13:57] <mvita> we have reports of repeatable crashes in the fileserver - it is asserting in rxi_DestroyConnectionNoLock on the presence of delayedAbortEvent.   This should be impossible when the refcount on the conn is zero, so we have a refcount issue (or perhaps a race) in that commit that changed the event refcounting logic
[13:14:00] <wiesand> on to Mark's bug then?
[13:14:38] <mvita> 1.8.x   8ce3b5e Standardize rx_event usage
[13:15:32] <mvita> I've looked at that commit until I'm dizzy but I haven't found the problem yet.
[13:15:56] <mvita> but I'm pretty sure from the symptoms that it does have a problem.
[13:15:56] <kaduk@jabber.openafs.org/barnowl> Er, which assertion is firing?  The refCount == 0 one?
[13:16:22] <mvita> no, the conn->delayedAbortEvent == NULL
[13:16:39] <mvita> all the events are cleared except for that one
[13:16:47] <kaduk@jabber.openafs.org/barnowl> Ah, I was searching for the wrong thing.
[13:17:15] <mvita> so we either are missing an rxevent_Cancel, or the refcount has gone wrong
[13:17:30] <mvita> conn->refcount, that is
[13:18:27] <mvita> the conn itself has been idle for over 2h at the time of the crash, so it's not clear to me why it has a contemporary (recent) delayedAbortEvent
[13:19:00] <mvita> also, the error we are aborting for is token expiration
[13:19:26] <mvita> so that's quite helpful in narrowing down codepaths, but it's still a tough one to find
[13:19:42] <kaduk@jabber.openafs.org/barnowl> And to make looking at the code more confusing, we have both
call->delayedAbortEvent and conn->delayedAbortEvent
[13:19:56] <mvita> yes, grump grump grump
[13:20:04] <mvita> but it's only the conn we care about
[13:20:21] <kaduk@jabber.openafs.org/barnowl> Well, thanks for the report and for looking at it so hard already.
[13:20:25] <mvita> so don't bother looking at the call logic
[13:20:44] <mvita> I think I will probably find it today
[13:20:48] <kaduk@jabber.openafs.org/barnowl> It's not entirely clear that we need to hold pre1, though; some of
these fixes would be pretty nice to get out to people.
[13:21:12] <wiesand> It's not a regression in pre1, right?
[13:21:22] <kaduk@jabber.openafs.org/barnowl> But if we think we'll find it today, that would probably be faster
than I would get to the tarballs and announcement anyway :)
[13:21:24] <mvita> at the reporting site, this is crashing multiple fileservers several times a day
[13:21:34] <kaduk@jabber.openafs.org/barnowl> It was in 1.8.0 yeah
[13:21:50] <mvita> oh, I forgot that this happens in rxi_ReapConnections
[13:22:26] <mvita> so it's the background housekeeper event that checks for stale conns every 60s
[13:22:43] <kaduk@jabber.openafs.org/barnowl> Does that site have any interest in running debug binaries (or can you
repro in the lab)?
[13:22:45] <mvita> a conn is stale if it has no calls and has not sent anything for 700s
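[editor's note] The staleness rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions: `toy_conn`, `toy_conn_is_stale` and `TOY_STALE_SECS` are hypothetical names, the struct is not the real rx connection structure, and the 700-second figure is simply the one quoted in the chat. Whether the real check uses "at least 700s" or "more than 700s" is not stated here; the sketch assumes the former.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Assumed threshold, taken from the chat: 700s without sending means stale. */
#define TOY_STALE_SECS 700

/* Toy stand-in for a connection; NOT the real rx connection structure. */
struct toy_conn {
    int    active_calls;  /* number of calls currently using the conn */
    time_t last_send;     /* time at which the conn last sent anything */
};

/*
 * A conn is treated as stale if it has no calls and has not sent
 * anything for TOY_STALE_SECS seconds (assumption: boundary is inclusive).
 */
static bool toy_conn_is_stale(const struct toy_conn *conn, time_t now)
{
    assert(conn != NULL);
    if (conn->active_calls > 0)
        return false;
    return (now - conn->last_send) >= TOY_STALE_SECS;
}
```

A periodic housekeeper (every 60s in the description above) would apply such a predicate to each connection and destroy the ones that qualify.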
[13:23:00] <mvita> I'm going to ask them that very thing today
[13:23:16] <mvita> I think they have nothing to lose, and we'll get closer to a helpful assert
[13:23:46] <mvita> I have not tried to repro yet, I'm still characterizing and doing code inspection
[13:23:49] <kaduk@jabber.openafs.org/barnowl> rxi_SendDelayedConnAbort() only puts its reference if event ==
conn->delayedAbortEvent, and we had discussed somewhat whether we
wanted to also put the reference if event != NULL but is also !=
conn->delayedAbortEvent
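[editor's note] The "only put the reference if the firing event matches the conn's current handle" behavior described above can be modeled with a toy. Everything here is hypothetical (`demo_event`, `demo_conn`, `demo_event_put`, `demo_send_delayed_conn_abort`); it is not the OpenAFS code and does not claim to reproduce the bug under discussion. It only shows what the mismatch case leaves behind.

```c
#include <assert.h>
#include <stddef.h>

/* Toy refcounted event; NOT the real rx event structure. */
struct demo_event {
    int refcount;
};

/* Toy conn holding one reference to its pending delayed-abort event. */
struct demo_conn {
    struct demo_event *delayed_abort_event;
};

/* Drop one reference through a handle and clear the handle. */
static void demo_event_put(struct demo_event **handle)
{
    assert(handle != NULL && *handle != NULL);
    (*handle)->refcount--;
    *handle = NULL;
}

/*
 * Handler modeled on the description in the chat: the conn's reference
 * is put only when the firing event is the one the conn points at.
 * Returns 1 if the conn's handle was cleared, 0 otherwise.
 */
static int demo_send_delayed_conn_abort(struct demo_event *event,
                                        struct demo_conn *conn)
{
    assert(conn != NULL);
    if (event != NULL && event == conn->delayed_abort_event) {
        demo_event_put(&conn->delayed_abort_event);
        return 1;
    }
    /* Mismatch: the conn keeps whatever handle it currently has. */
    return 0;
}
```

In the mismatch case the conn still holds a non-NULL handle after the handler returns, which is the kind of state an assertion on a NULL delayed-abort handle would reject.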
[13:24:08] <mvita> btw, I learned from the site this morning that every crash is the same assert
[13:24:17] <kaduk@jabber.openafs.org/barnowl> Ah.
[13:24:27] <mvita> so that makes it more likely that we are missing an rxevent_Cancel
[13:24:50] <mvita> I think I know where that should be, so I will probably do some attempts to repro today
[13:26:52] <mvita> the cores were somewhat helpful, but I'm handicapped by the lack of debuginfo for OpenAFS
[13:27:11] <mvita> (from the site's build, that is)
[13:27:32] <kaduk@jabber.openafs.org/barnowl> Can you say if it's a RHEL system or something else?
[13:27:36] <mvita> I had to go old school and decode structs by hand
[13:27:40] <kaduk@jabber.openafs.org/barnowl> (solaris?)
[13:28:00] <mvita> RHEL 7, but I don't think that matters
[13:28:27] <kaduk@jabber.openafs.org/barnowl> Just wondering if we needed to hammer the rpm spec into keeping
debuginfo :)
[13:28:29] <mvita> oh, were you thinking about atomic issues?
[13:28:38] <mvita> oh, yeah, debuginfo
[13:28:58] <mvita> it's easy to make if your procedures take it into account
[13:29:19] <mvita> most people don't think about having them on hand when needed
[13:29:39] <mvita> and they build from SRPM
[13:29:59] <mvita> so it's not like they can grab a debuginfo package from somewhere and have it match
[13:30:20] <wiesand> our current spec should produce usable debuginfo packages
[13:30:37] <mvita> right, I plan to discuss that with them as well
[13:31:55] <mvita> I'm a developer, so I never build without them baked in ;-)
[13:32:34] <mvita> okay, that's all I had
[13:33:01] <mvita> Ben, if you could eyeball that commit again and see if you notice something wrong, that would help.
[13:33:03] <wiesand> wait, on EL 7.5 debuginfo requires 13036 (which would be in pre1 I think…)
[13:33:17] <kaduk@jabber.openafs.org/barnowl> I'm looking already :)
[13:33:39] <mvita> I ruled out the other two smaller commits in that stack
[13:34:05] <mvita> (the MUTEX_ASSERTs and the tiny mutex leak)
[13:36:21] <mvita> I'm thinking it might simply be that we don't rxevent_Cancel in rxi_SendConnectionAbort
[13:36:26] <wiesand> EL 7.5 *may* require 13036, I'm not quite sure but it's possible
[13:36:32] <mvita> after sending the abort
[13:36:47] <kaduk@jabber.openafs.org/barnowl> We do before we send, though?
[13:37:06] <mvita> oh, drr, I see it
[13:37:17] <kaduk@jabber.openafs.org/barnowl> But we do drop the lock
[13:37:28] <mvita> ah, that's the wrong place, hold on
[13:37:32] <mvita> (this code grr)
[13:37:48] <kaduk@jabber.openafs.org/barnowl> Anyway, before we continue drilling down on this one, any other topics
for today?
[13:38:10] <wiesand> I have none
[13:39:09] <kaduk@jabber.openafs.org/barnowl> I did a little bit with master, but there are a couple commits that
would be nice to merge that are waiting for new patchsets
[13:39:23] <mvita> we don't cancel it (or set it to null, at any rate) in rxi_SendDelayedConnAbort
[13:39:40] <mvita> we only rxevent_Put the refcount on the event
[13:39:51] <mvita> (k, sorry)
[13:40:14] <kaduk@jabber.openafs.org/barnowl> That is the event handler function :)
But we probably do call it directly sometimes, blech.
[13:40:57] <mvita> no, just as an event
[13:41:28] <kaduk@jabber.openafs.org/barnowl> If we're running the event handler, rxevent_Cancel() for that event
would fail
[13:41:44] <mvita> aye, for sure
[13:45:27] <kaduk@jabber.openafs.org/barnowl> If the reproducer is supposed to be token expiry, it may be a thing
where multiple calls try to (delayed) abort the connection simultaneously
and that triggers a locking bug or something; maybe we fire the event
handler but the event in the structure has changed, for example.
[13:45:38] <mvita> ahh, event->handled is 0, so it hasn't fired yet at the time of the core
[13:46:15] <mvita> yes, something like that
[13:46:18] <kaduk@jabber.openafs.org/barnowl> Anyway, sounds like we should call the meeting?
[13:46:24] <mvita> okay
[13:46:31] <kaduk@jabber.openafs.org/barnowl> Thanks, everyone!
[13:47:14] <wiesand> So we hold pre1 for that bug?
[13:47:31] <mvita> well, not my call - I just wanted to let you know
[13:47:32] <kaduk@jabber.openafs.org/barnowl> release policy would not dictate that we have to hold pre1
[13:47:41] <kaduk@jabber.openafs.org/barnowl> but it's definitely good to know about
[13:47:48] <meffie> agreed
[13:48:03] <wiesand> Sure.
[13:48:26] <meffie> i assume pre2 would have the fix
[13:48:48] <mvita> you assume much, sir!
[13:48:58] <mvita> but yeah, it's probably something simple
[13:49:08] <wiesand> So we go ahead with pre1 as is? Would you like me to help with tarballs, volumes, web change, ...?
[13:50:11] <kaduk@jabber.openafs.org/barnowl> I'm looking hard at the "only rxevent_Put() if the event firing
matches the current value for conn->delayedAbortEvent" logic (and if
you can get diagnostics on a case where that doesn't hold it would be
very nice)
[13:50:11] Marcio Barbosa leaves the room
[13:50:45] <kaduk@jabber.openafs.org/barnowl> I think we can wait a day before finishing off pre1, in case we find
this event issue quickly.
[13:50:47] <mvita> oh, refcount on the event is 2
[13:50:57] <wiesand> ben: ok
[13:51:09] <mvita> refcount on the conn is zero <sadness>
[13:51:10] Marcio Barbosa joins the room
[13:51:38] <kaduk@jabber.openafs.org/barnowl> (wiesand: I don't know how available you are to do such things this weekend)
[13:52:11] <kaduk@jabber.openafs.org/barnowl> refcount on the event of 2 should only happen when there's one for the
event tree and one for the conn's handle.
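[editor's note] The reference accounting behind "a refcount of 2 means one for the event tree and one for the conn's handle" can be sketched as follows. All names (`ref_event`, `ref_event_post`, `ref_event_fire`, `ref_event_put`, `ref_event_still_pending`) are hypothetical, and the model is an assumption built from the chat, not the actual rx event implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy event with the two fields mentioned in the chat. */
struct ref_event {
    int refcount;
    int handled;   /* nonzero once the event has fired */
};

/*
 * Posting an event: one reference belongs to the event tree and one
 * is handed back to the caller (here, the conn's handle).
 * Storage is caller-provided to keep the sketch allocation-free.
 */
static struct ref_event *ref_event_post(struct ref_event *storage)
{
    assert(storage != NULL);
    storage->refcount = 2;
    storage->handled = 0;
    return storage;
}

/* Firing: the event tree gives up its reference. */
static void ref_event_fire(struct ref_event *ev)
{
    assert(ev != NULL && !ev->handled);
    ev->handled = 1;
    ev->refcount--;
}

/* The holder drops its reference and clears its handle. */
static void ref_event_put(struct ref_event **handle)
{
    assert(handle != NULL && *handle != NULL);
    (*handle)->refcount--;
    *handle = NULL;
}

/* Refcount 2 with handled == 0: posted, not yet fired, handle still held. */
static int ref_event_still_pending(const struct ref_event *ev)
{
    assert(ev != NULL);
    return ev->refcount == 2 && ev->handled == 0;
}
```

Under this model, an event seen in a core with refcount 2 and handled == 0 is consistent with an event that was posted but had not fired yet.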
[13:52:19] <mvita> yes
[13:52:42] <mvita> more evidence that it didn't fire yet
[13:52:50] <wiesand> (ben: nor do i… drop me a mail if you'd like to have something done, and we'll see)
[13:53:18] <kaduk@jabber.openafs.org/barnowl> Okay, thanks.
[13:53:27] <kaduk@jabber.openafs.org/barnowl> I had best be off, now, though.
[13:53:49] <meffie> ok, notes to be posted shortly.
[13:53:52] <wiesand> Fine. Thanks a lot everybody!
[13:53:57] <mvita> okay, thanks everyone for the sounding board
[13:53:59] <kaduk@jabber.openafs.org/barnowl> Thanks!
[13:55:14] <meffie> :wq  bye
[13:56:13] meffie leaves the room
[13:56:14] mmeffie joins the room
[14:16:43] wiesand leaves the room
[14:30:05] mmeffie leaves the room
[15:21:44] mvita leaves the room
[15:39:14] mvita joins the room
[21:05:18] Marcio Barbosa leaves the room
[21:07:57] Marcio Barbosa joins the room
[22:08:02] Marcio Barbosa leaves the room
[22:13:38] mvita leaves the room