- Wednesday, March 2, 2016

release-team@conference.openafs.org

GMT+0
[10:17:36] Jeffrey Altman leaves the room
[10:19:59] Jeffrey Altman joins the room
[10:42:49] mvita leaves the room
[11:51:45] mvita joins the room
[12:41:58] mvita leaves the room
[14:01:42] Jeffrey Altman leaves the room
[14:03:33] mvita joins the room
[14:20:20] Jeffrey Altman joins the room
[14:31:57] Jeffrey Altman leaves the room
[14:31:57] Jeffrey Altman joins the room
[14:37:07] Jeffrey Altman leaves the room
[14:37:30] Jeffrey Altman joins the room
[14:40:42] Jeffrey Altman leaves the room
[14:40:43] Jeffrey Altman joins the room
[14:47:05] Jeffrey Altman leaves the room
[14:47:14] Jeffrey Altman joins the room
[14:59:10] wiesand joins the room
[15:00:01] meffie joins the room
[15:00:06] <mvita> good morning Stephan, Mike
[15:00:15] <wiesand> Hi
[15:00:15] <meffie> hello
[15:01:10] <mvita> is anyone else joining us?  or just lurking?
[15:02:19] <mvita> well, let's get started on 1.6.x
[15:02:32] <wiesand> For the record, I’m building a test build with these applied: 11775..78 11795 11780
12126 12129 12166 12188 1219{3,5,6,7,8}
12089 12127 12185 12178 12179 12085 12113
12191 11983
12205
[15:02:42] <mvita> excellent
[15:02:48] <meffie> nice!
[15:03:12] <mvita> do you want to lead us through any of those?
[15:03:14] <meffie> do you have a branch of that somewhere?
[15:03:25] <wiesand> not publicly
[15:03:37] <meffie> ok
[15:03:39] <mvita> I was unable to follow my own advice and look at the queue ahead of time
[15:04:22] <wiesand> I think it’s more or less the list above, plus... hang on...
[15:04:58] <wiesand> the el capitan stuff in 12072&3, but that doesn’t apply trivially
[15:05:26] <wiesand> the dcache-32-bit-overflow topic, maybe?
[15:06:02] <wiesand> and 11654 11599 once they’re in shape and merged on master
[15:06:20] <wiesand> (afs: shake harder and bozo: use the interface address)
[15:06:58] <wiesand> et voila, that would be 1.6.17 ;-)
[15:07:08] <mvita> wow
[15:07:10] <mvita> that's a lot
[15:07:31] <wiesand> all subject to discussion & review of course
[15:07:35] <meffie> i have an update for shake-loose vcache, it is very small and hopefully that will be ready then.
[15:07:47] <wiesand> it’s just the list of candidates I see
[15:08:31] <wiesand> Unfortunately, we have a gerrit problem.
[15:08:36] <meffie> ?
[15:09:02] <wiesand> It takes too long for new changes to be available as cherry-picks etc.
[15:09:07] <wiesand> Much too long.
[15:09:27] <wiesand> Consequently, buildbot always fails in the "update git" stage.
[15:09:51] <wiesand> There seem to be dead slaves too, since not even failure is reported at all.
[15:10:12] <mvita> not sure I follow the "update git" problem
[15:11:13] <mvita> looking at logs
[15:12:41] <wiesand> git fetch git://gerrit.openafs.org/openafs refs/changes/05/12205/1 && git format-patch -1 --stdout FETCH_HEAD
fatal: Couldn't find remote ref refs/changes/05/12205/1
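The ref in the fetch commands quoted here follows Gerrit's change-ref layout: refs/changes/<last two digits of the change number>/<change number>/<patch set>. A minimal Python sketch of building such a ref and the matching fetch command line follows. The function names are hypothetical, and the remote URL is simply the one quoted in the log; nothing here contacts a server.

```python
def gerrit_change_ref(change: int, patchset: int) -> str:
    """Build the Gerrit ref name for one patch set of a change.

    Gerrit shards change refs by the last two digits of the change
    number, zero-padded (change 12205 lands under "05").
    """
    shard = change % 100
    return f"refs/changes/{shard:02d}/{change}/{patchset}"


def fetch_command(remote: str, change: int, patchset: int) -> list[str]:
    """Return the argv for a `git fetch` of that ref (not executed here)."""
    return ["git", "fetch", remote, gerrit_change_ref(change, patchset)]


# Hypothetical usage with the remote and change number from the log.
cmd = fetch_command("git://gerrit.openafs.org/openafs", 12205, 1)
print(" ".join(cmd))
```

If the ref has not yet been replicated to the remote being asked, git fetch reports that it could not find the remote ref, which is the failure shown in the log.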
[15:13:54] <mvita> yes, I see.  but why can't it find it?   btw, I think I've seen this before, maybe a few weeks ago
[15:14:14] <wiesand> this works, but is dead slow: git fetch https://gerrit.openafs.org/openafs refs/changes/05/12205/1 && git format-patch -1 --stdout FETCH_HEAD
[15:14:55] <mvita> ahhh - the git update is maybe timing out and reporting as "couldn't find"?
[15:15:01] <wiesand> http rather than https times out...
[15:15:38] <wiesand> no, the git fetch returns fairly quickly
[15:15:54] <wiesand> and fetching stuff I pushed yesterday worked consistently
[15:16:38] <meffie> did you try to git gc your local repo?
[15:16:38] <wiesand> (today)
[15:17:22] <meffie> the commands you posted are pretty fast for me.
[15:18:12] <mvita> (looking at older builds to see what's different now)
[15:18:19] <wiesand> no change after git gc --aggressive
[15:19:12] <mvita> I know Ben made some changes recently - don't know what they were or if they're relevant to this
[15:20:18] <wiesand> I guess he’ll read this later
[15:20:58] <mvita> I wish I could be of more help
[15:21:42] <meffie> i'm confused. anyway, we can take this offline. i have access to the buildbot so maybe able to help.
[15:22:11] <wiesand> ok
[15:22:56] <meffie> i've been wanting to help more with the buildbot anyway :)
[15:24:08] <wiesand> NB URLs like http(s)://gerrit.openafs.org/12345 no longer work for me either
[15:24:18] <meffie> really?
[15:24:40] <wiesand> really :-(
[15:24:46] <meffie> those should redirect. they do for me :(
[15:25:25] <wiesand> Maybe it’s Safari. But I get redirected to the default front page unless I specify the /#/c/ stuff
[15:26:45] <meffie> which safari version?
[15:27:04] <wiesand> 9.0.3
[15:27:48] <wiesand> on yosemite, fully updated
[15:28:17] <meffie> ok, thanks.
[15:31:01] <Jeffrey Altman> hi
[15:31:05] <mvita> oops, sorry, noodling with gerrit
[15:31:11] <wiesand> Hi
[15:31:14] <mvita> hi jeffrey
[15:31:22] <Jeffrey Altman> Ben changed the URL that is used to communicate with gerrit from http to https
[15:32:00] <Jeffrey Altman> if the buildbot master rules for fetching were not updated to match, things will fail
[15:32:37] <wiesand> The build slave logs say they try "git fetch git://git.openafs.org"
[15:34:01] <wiesand> Which generally works. Except for changes that are younger than a couple of hours.
[15:34:06] <meffie> on monday, Ben wrote in gerrit 12192; "(gerrit failed to replicate this commit to the standalone git repo, so the buildbots couldn't fetch it.)"
[15:34:40] <wiesand> Match.
[15:35:30] <Jeffrey Altman> I think that "git://git.openafs.org" results in http fetch requests
[15:36:05] <Jeffrey Altman> and git doesn't support https for git urls
[15:37:02] <wiesand> It works for me for, say 11775. But not 12205.
[15:37:15] <Jeffrey Altman> the fetch url has to be explicitly changed to https://git.openafs.org, which I suspect will fail because the certificate is for gerrit.openafs.org and won't match the host name
[15:37:23] <wiesand> Buildbot verified PS3 for 12168 yesterday.
[15:38:11] <mvita> yes, 12205  ab4cc6729013b7afe8e2497211e9a95d1a5fe989 did not make it to git
[15:39:10] <mvita> https://git.openafs.org/?p=openafs.git;a=patches;h=refs/changes/05/12205/1    "404 unknown commit object"
[15:39:25] <Jeffrey Altman> I have no access so I can't check but I suspect the problem is the changes to force the use of https and the use of a non-wildcard certificate
[15:39:41] <mvita> same for https://git.openafs.org/?p=openafs.git;a=commit;h=ab4cc6729013b7afe8e2497211e9a95d1a5fe989
[15:39:46] <Jeffrey Altman> gerrit can't push to git
[15:40:13] <wiesand> Sounds plausible.
[15:42:14] <wiesand> Whatever, I’m confident Ben will deal with it eventually.
[15:43:04] <wiesand> Going through changes one by one today doesn’t seem expedient to me.
[15:45:28] <wiesand> Mark: it’s still your meeting though ;-)
[15:45:40] <mvita> ah.
[15:45:48] <mvita> sorry, keep getting distracted
[15:46:19] <mvita> well, I wanted to ask mike about the settime doc
[15:46:33] <meffie> settime doc?
[15:46:43] <wiesand> 12178
[15:46:44] <mvita> 12178
[15:46:59] <wiesand> they’re not quite obsolete on 1.6 yet
[15:47:05] <mvita> I was checking before the meeting on that
[15:47:10] <mvita> I agree
[15:47:18] <mvita> they still exist on 1.6
[15:47:32] <wiesand> they just don’t work and crash your client ;-)
[15:47:54] <mvita> so the pod could say they're deprecated, but not that they aren't present
[15:47:54] <meffie> yeah, this was triggered by the patch that removes -settime, if i recall.
[15:48:10] <mvita> I couldn't find that one in 1.6.x
[15:48:16] <mvita> but I ran out of time
[15:48:29] <wiesand> 11716, not yet merged
[15:49:13] <wiesand> And I doubt the stable branch is the right place for it.
[15:49:14] <meffie> yes, i thought it would be good to include a man page change when/if -settime was removed.
[15:49:40] <mvita> now I see.
[15:50:15] <meffie> yes, that's all.
[15:51:11] <mvita> well, if it's really removing bugs....then maybe it should go on 1.6.x
[15:51:47] <wiesand> Maybe it’s working for someone ;-) (Solaris?)
[15:52:18] <wiesand> It has been totally broken for a long time. No need to rush.
[15:52:39] <meffie> alternatively, just update the docs to say -settime is deprecated and should not be used.
[15:53:30] <mvita> yes.  that's what I thought 12178 was going to be before I started reading it
[15:54:12] <wiesand> A while ago we also had the idea to log a warning when -settime is used. Everybody liked it, but nothing happened. (I’m not blaming anyone).
[15:54:41] <wiesand> It’s not our most urgent problem. That’s Linux clients.
[15:55:09] <mvita> right
[15:55:20] <mvita> so I've been digging into that this week
[15:55:40] <wiesand> Ah
[15:55:45] <wiesand> Any hope?
[15:55:46] <mvita> my original aim was just to understand the problem well enough to explain it to others
[15:56:07] <mvita> meffie helped me draft an email to -devel to ask for help
[15:56:20] <mvita> but then I started digging in more and more
[15:56:29] <mvita> and never did send the email
[15:56:36] <mvita> so Ben beat me to it.
[15:57:03] <mvita> I see hope for a possible workaround, I am close to being ready to do proof of concept
[15:57:37] <mvita> if it works, it would NOT fix openafs use of splice but instead avoid it.
[15:58:08] <wiesand> I must admit I don’t know what splice does.
[15:59:04] <mvita> it's a facility of linux that allows device drivers and filesystems to get lower level control of reads and writes
[15:59:19] <mvita> essentially bypassing a set of buffer copies
[15:59:45] <wiesand> Ah, I see. Also just found the man page.
[15:59:47] <mvita> the "splice" is implemented as an internal pipe between your device/filesystem and the kernel
[16:01:13] <wiesand> Thanks.
[16:01:16] <mvita> it's been around for a long time, but the ERESTARTSYS applies (mostly) only to a more recent enhancement which openafs exploits
[16:02:15] <mvita> the idea of the ERESTARTSYS change is to solve a problem with some drivers that the splice can be so tightly coupled that it's difficult to cancel in a timely manner
[16:03:11] <Jeffrey Altman> avoiding splice() is not going to be sufficient.  it's not the only kernel api that openafs calls that performs signal handling
[16:04:01] <Jeffrey Altman> you can disable signal handling before the splice calls to similar effect but that leads to other problems.
[16:04:07] <mvita> it's the only one I've found so far that's affected by the change in question
[16:04:24] <Jeffrey Altman> side effect, side effects
[16:07:21] <mvita> still researching for just such things.
[16:08:59] <mvita> well anyway, that's where I am so far and i will keep the team up to date if I find anything that will work.
[16:09:08] <meffie> thanks
[16:09:10] <wiesand> Thanks.
[16:09:29] <wiesand> But long term it’s a losing battle.
[16:09:42] <mvita> only if we give up.
[16:10:27] <wiesand> If dealing with the current problem will take a lot of resources, it may be wiser to invest those into finishing the fuse client.
[16:10:49] <mvita> meffie and i have talked about that as well
[16:11:12] <wiesand> was there a conclusion?
[16:11:31] <mvita> well, we actually talked about kafs, not afsd.fuse
[16:12:09] <mvita> since kafs is in tree, it has some advantages
[16:12:19] <meffie> we talked about fuse as well. picking up where andrew left off to look into how to do pioctls.
[16:12:52] <meffie> a fuse client is only a workaround.
[16:13:22] <Jeffrey Altman> long term, if OpenAFS wants to use its unix cm for Linux:
1. someone needs to be paid to do for OpenAFS what AuriStor does for its products.  Nightly builds of Linus' tree must not only be built but tested with a test suite to identify problems before Linux kernel release candidates.    Simply building the tree is not sufficient.
2. the GLOCK has to go away and locks must not be held across RPCs
3. the manner in which mount points are represented as symlinks has to change.
[16:14:15] <mvita> yes.
[16:14:16] <Jeffrey Altman> FUSE is not a solution for the majority of end users.  It is an abstraction layer over an abstraction layer.  It cannot support PAGs.
[16:14:36] <meffie> yes
[16:15:28] <wiesand> FUSE = no PAGs is a showstopper, I agree.
[16:15:38] <Jeffrey Altman> The FUSE interface is static and it cannot be changed.   The interface is sub-optimal for AFS.   FUSE was designed as a light-weight interface for low-bandwidth file system operations like http dav.
[16:16:27] <Jeffrey Altman> Not that it is impossible to create a pioctl interface for FUSE but it will require creativity.
[16:16:46] <meffie> indeed.
[16:16:49] <wiesand> The gluster fuse client works pretty well. Performance is not bad at all, nor is CPU usage on a modern server.
[16:18:42] <Jeffrey Altman> Arla is much like FUSE in that it provides a kernel shim that calls to userland to do the hard work.   This is the same model that the Windows client uses.   The difference between Arla and FUSE is that the userland to kernel interface is owned by Arla so it can be modified and customized as necessary.
[16:18:45] <meffie> gluster workloads are probably significantly different.
[16:18:53] <mvita> well, we've drifted far into 2.0 land, let's get back to 1.6 for a bit
[16:19:36] <wiesand> mike: sure
[16:20:12] <wiesand> (just an example that fuse can work in cases it wasn’t designed for at all)
[16:20:37] <Jeffrey Altman> shall we discuss data corruption in 1.6 ?
[16:20:46] <meffie> btw, it's nice to see arla traffic recently.
[16:21:05] <wiesand> Jeffrey: w/o Linux 4.4?
[16:21:11] <Jeffrey Altman> vlserver
[16:21:29] <wiesand> oh no
[16:22:06] <Jeffrey Altman> Back in 2010 Andrew submitted and Daria accepted a number of changes to ubik and vlserver to permit reads while write transactions are in progress.
[16:23:05] <Jeffrey Altman> These changes are fundamentally flawed and can result in corrupt data being sent to clients for any request for a record that spans more than one ubik page.
[16:23:57] <mvita> ticket number?
[16:24:11] <Jeffrey Altman> ticket number?  I don't have access to SNA's ticketing system
[16:24:25] <mvita> rt.central.org
[16:24:42] <Jeffrey Altman> why would there be something in rt.central.org?
[16:25:15] <mvita> you are reporting a problem, I am looking for details
[16:25:55] <Jeffrey Altman> there are no details in rt.central.org.   the original problem report was from an SNA customer and addressed by changes submitted by Andrew Deason to gerrit.openafs.org
[16:26:29] <mvita> I'm talking about the reported problem with those changes.
[16:27:49] <Jeffrey Altman> there was no report to rt.central.org
[16:30:16] <wiesand> How can a record span more than one ubik page?
[16:30:17] <mvita> who discovered it, and are they willing to provide information to openafs.org?
[16:33:01] <Jeffrey Altman> AuriStor staff discovered it in our testing.   You can search for ubik_BeginTransReadAnyWrite() which was added to ubik on 1.6 in 6261
[16:33:25] <meffie> ok
[16:33:27] <mvita> thank you.
[16:33:31] <wiesand> I guess it’s 5c7297a ff?
[16:34:20] <Jeffrey Altman> I haven't found all of the vlserver changes on 1.6 but there are many subsequent ones on master.
[16:34:37] <meffie> thanks.
[16:35:03] <Jeffrey Altman> This is the master commit that starts the series https://gerrit.openafs.org/#/c/2103
[16:35:08] <meffie> there are several commits in this area.
[16:35:12] <Jeffrey Altman> 2104, 2105, 2106
[16:35:35] <Jeffrey Altman> and then there are later commits (including from meffie) to fix bugs in those commits
[16:35:50] <meffie> i hope i didnt make it worse :(
[16:37:01] <Jeffrey Altman> the underlying premise is flawed.  It is not safe to read from the database while a write transaction is in progress because ubik has neither a concept of two-phase commit nor a concept of copy-on-write.
[16:37:16] <Jeffrey Altman> meffie: your patches did not make it worse
[16:37:21] <meffie> whew.
[16:38:20] <Jeffrey Altman> this is another example of unintended consequences.  the locking within ubik is complicated.
[16:40:07] <Jeffrey Altman> Andrew believed that it would be ok to ignore the database locks for reads while the locks were held for writes because he would filter out dirty pages.  However, the transition of pages from dirty to not-dirty is not atomic and so must be performed under a lock.
[16:40:23] <mvita> sorry all but my time is up - please carry on without me and I will see you next week
[16:40:31] <Jeffrey Altman> But the lock does no good if the readers ignore it
[16:40:53] mvita leaves the room
[16:41:40] <Jeffrey Altman> For LWP there is a reduced chance of corrupt data being delivered to clients but it can still happen.  For pthreaded ubik, it turns out this is a gaping hole.
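The hazard described here, a reader that ignores the write lock while a record spanning more than one page is being updated, can be illustrated with a small deterministic Python sketch. This is not ubik code: the page size, the PagedStore class, and the page-at-a-time writer are all hypothetical, and the sketch leaves out dirty-page filtering entirely. It only shows how a multi-page record can be observed half old and half new when nothing stops the reader.

```python
PAGE_SIZE = 4  # hypothetical page size, chosen small for illustration


class PagedStore:
    """A toy database made of fixed-size pages (not ubik's layout)."""

    def __init__(self, data: bytes):
        self.pages = [bytearray(data[i:i + PAGE_SIZE])
                      for i in range(0, len(data), PAGE_SIZE)]

    def read(self, offset: int, length: int) -> bytes:
        """Read bytes without taking any lock."""
        return bytes(self.pages[pos // PAGE_SIZE][pos % PAGE_SIZE]
                     for pos in range(offset, offset + length))

    def write_page_by_page(self, offset: int, data: bytes):
        """Update one page per step, yielding between pages.

        Each yield marks a point where an unlocked reader could run.
        """
        pos, end = offset, offset + len(data)
        while pos < end:
            page, start = divmod(pos, PAGE_SIZE)
            n = min(PAGE_SIZE - start, end - pos)
            self.pages[page][start:start + n] = data[pos - offset:pos - offset + n]
            pos += n
            yield


store = PagedStore(b"AAAAAAAA")
# The record at offsets 2..5 spans pages 0 and 1.
writer = store.write_page_by_page(2, b"BBBB")
next(writer)                 # page 0 updated, page 1 still old
torn = store.read(2, 4)      # reader that ignores the write lock
for _ in writer:             # let the writer finish
    pass
final = store.read(2, 4)
print(torn, final)
```

A reader that honoured the lock would see either the old record or the new one, never the mixture held in torn.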
[16:42:23] <meffie> interesting, thank you
[16:45:15] <wiesand> Sounds like one of those "low hanging fruit quick-ish tasks" the foundation asked about?
[16:45:28] <meffie> not sure about that.
[16:45:41] <Jeffrey Altman> ripping out the code is quickish.  solving the problem, not so much
[16:47:00] <wiesand> That was my assumption. I meant re-disabling reads during writes.
[16:48:22] <wiesand> I’m running out of time too. Do you have more such good news?
[16:48:31] <Jeffrey Altman> I suspect SNA's customer that funded this work was concerned about the length of time that the ubik syncsite blocks reads while it writes to disk and communicates the changes to the replicas when one or more of the replica sites are down.   If a quorum is unable to receive the update, then the transaction is rolled back and the client that initiated the request is delivered a failure.
[16:49:40] <Jeffrey Altman> Those timeouts can be longer than the client timeout period for the volume location information request (at least for Windows) and can cause failures.
[16:50:06] <Jeffrey Altman> I'm not sure that unix cm times out volume location requests at all.
[16:50:51] <meffie> even if they don't time out, long vlserver requests make for unhappy users.
[16:51:01] <Jeffrey Altman> If you want good news, Sara's pregnancy is just about at term and all is well.   By next week this time I should be a dad.
[16:51:23] <meffie> Yay! Congratulations!
[16:51:46] <Jeffrey Altman> the stork should arrive any day now
[16:52:05] <wiesand> [crosses fingers]
[16:52:15] <meffie> Wishing you and Sara well!
[16:52:33] <Jeffrey Altman> Sara totally looks like someone stuffed with a pumpkin
[16:53:09] <meffie> with that, motion to adjourn?
[16:53:16] <Jeffrey Altman> so long
[16:53:21] <wiesand> Jeffrey, hope all goes well!
[16:53:24] <meffie> bye
[16:53:25] <Jeffrey Altman> thanks
[16:53:31] <wiesand> Bye.
[16:53:38] wiesand leaves the room
[16:54:11] meffie leaves the room
[17:31:21] mvita joins the room
[17:46:54] mvita leaves the room
[20:10:58] mvita joins the room
[21:38:00] <kadukoafs@gmail.com/barnowlD2481C74> Jeff said several incorrect things that I will cover later.
For now I've reverted the redirect permanent directives for the
port-80 gerrit and git vhosts.  (You'll probably need to restart your
browser to clear that state, unfortunately.)
[21:43:32] <Jeffrey Altman> regarding git? gerrit? ubik? debian? other?
[21:47:44] <kadukoafs@gmail.com/barnowlD2481C74> git and gerrit
[21:48:08] <Jeffrey Altman> git.openafs.org has its own cert.
[22:16:30] mvita leaves the room