Wednesday, February 11, 2015

release-team@conference.openafs.org

GMT+0
[13:20:00] shadow@gmail.com/barnowlE5B64A04 leaves the room
[13:20:09] shadow@gmail.com/barnowlE5B64A04 joins the room
[13:31:00] meffie joins the room
[14:25:34] wiesand joins the room
[14:47:12] <Jeffrey Altman> morning
[14:47:30] <wiesand> “morning”
[14:51:36] shadow@gmail.com/barnowlE5B64A04 leaves the room
[14:51:45] shadow@gmail.com/barnowlE5B64A04 joins the room
[14:55:09] <Jeffrey Altman> with the --settime change what happens when 11706 is applied?
[14:58:15] <wiesand> I just added a comment. The box still locked up, just not quite as quickly.
[15:01:48] <Jeffrey Altman> it doesn't panic but perhaps it deadlocks?
[15:02:22] <Jeffrey Altman> and it panics if 11706 is not present?
[15:03:14] <wiesand> It still stopped responding to pings. That’s all the details I have - the box I used has no console, doesn’t produce dumps, and hardly anything makes it into the logs.
[15:03:20] Marc Dionne joins the room
[15:03:42] <Jeffrey Altman> ok
[15:03:57] <wiesand> Ubuntu 14.04 w/ 1.6.7 locks up but doesn’t panic. I cannot easily test what would happen with 11706 applied.
[15:04:40] <wiesand> And frankly, it seems -settime has been broken for all of 1.6, and I don’t have any more time to sink into it.
[15:04:59] <Jeffrey Altman> I don't think any of us do.   Better to disable it
[15:05:04] <wiesand> I just wanted to make sure merging 11706 really makes sense, and I think it doesn’t.
[15:05:47] <wiesand> Right. At least log a warning, and maybe even ignore the switch then.
[15:06:02] <wiesand> But I see no need to sort this out in time for 1.6.11.
[15:06:15] <Jeffrey Altman> agreed
[15:06:38] <meffie> i dont know of anyone that actually uses -settime
[15:06:40] <wiesand> Thanks
[15:07:17] <Jeffrey Altman> its not safe to use settime when ntp is also running or if the client is in a vm
[15:07:19] <wiesand> There can’t possibly be anyone using it given the content of 11706 ;-)
[15:08:49] <wiesand> Frankly, I didn’t stop ntpd. Does it matter?
[15:10:20] <wiesand> I may try that, but after 1.6.11pre2 if at all.
[15:10:59] <Jeffrey Altman> 1.6 has been broken since e8d8a2240a57f9f4a11ee45b60c229d3f8447b86
[15:11:26] <Jeffrey Altman> settime on 1.6 that is
[15:12:25] <wiesand> Right.
[15:13:26] <wiesand> Regarding pre2:
[15:14:03] <wiesand> 11713 and 14 will probably be ok, but a few more +1s would be good.
[15:14:06] <Jeffrey Altman> broken since 1.6.0pre1 so there is no need to rush to fix it
[15:14:19] <wiesand> +1
[15:15:02] <wiesand> The 1.6.11 blocker is still 11694 IMO.
[15:15:25] <wiesand> In particular, I’m thinking about Anders’ comment in 11710.
[15:15:39] <Jeffrey Altman> I asked Marc to review 11710
[15:15:49] <Marc Dionne> i was going to comment there, but i can do so here
[15:16:00] <wiesand> Whatever you prefer ;-)
[15:16:03] <Jeffrey Altman> please comment in both places
[15:16:22] <Marc Dionne> i don't think there are many kernels affected, basically the early 3.17 ones
[15:16:39] <Marc Dionne> most distros that have kernels that recent have already moved on to 3.18
[15:17:08] <wiesand> If it’s true that defaulting to on will almost certainly do the right thing in practically all actual cases, that would be so much nicer than forcing the switch on every build.
[15:17:23] <Marc Dionne> so my question would be, how much of an extra burden is it to have to add a mandatory option for the kernels that don't have the issue
[15:19:10] <Marc Dionne> not sure that extra burden is worth preventing an issue on a few kernels that should be rarely used at this point
[15:19:38] <wiesand> That’s what I think.
[15:20:15] <Jeffrey Altman> then add a default but leave in the mechanism to override?
[15:20:27] <wiesand> If 11694 + the switch defaulting to on is right for all kernels < 3.17 and all >= 3.17.3, I really think that’s what we should do.
[15:20:37] <wiesand> Jeff: yes.
[15:20:40] <Marc Dionne> jeff, yes
[15:21:14] <wiesand> For those cases where  someone actually uses 3.17.2, or when a distro backports the change from 3.17 but not the fix.
[15:21:29] <wiesand> I hope Anders is right about that being unlikely to happen...
[15:22:56] <Jeffrey Altman> the question is how likely is that to happen?   we know that there have been some backports already.   perhaps only set defaults for < 3.17 and > 3.18.1 (or whatever mainline kernel got the fix)
[15:23:53] <Marc Dionne> but doing any tests based on version numbers is tricky.  who knows what code from rh might be called 3.17.xxx some day
[15:23:59] <wiesand> 3.17.3 and 3.18 got the fix AFAIK
[15:24:02] <Jeffrey Altman> the reason we added the switch is because we don't know what individual distributions might have done in any specific version number
[15:26:39] <Marc Dionne> but not sure that we should cater to distros introducing bugs in older kernels
[15:27:45] <wiesand> Well, they do. But providing the switch for those cases is sufficient IMO.
[15:29:36] <Jeffrey Altman> a line in the sand could be the default matches the mainline kernel and if your distribution did anything differently its on your head to fix it
[15:30:23] <wiesand> Yes.
[15:30:49] <wiesand> I’m figuring writing the release announcement...
[15:30:50] <Jeffrey Altman> go with that.   we have spent too much time on this
[15:31:26] <Marc Dionne> would be interesting to have a take from the people who build rpms or maintain the spec files, etc., whether adding a mandatory option is a big deal for them, or not
[15:32:19] <wiesand> I was already “looking forward” to providing an extra SRPM for a handful of already outdated Fedora kernels.
[15:32:31] <wiesand> I’d much rather declare those unsupported.
[15:33:50] <wiesand> And of course we’d need a change to src/packaging/redhat, coming through master...
[15:34:56] <wiesand> For my SL packages, no big deal if it’s unlikely that it will ever have to be changed.
[15:37:13] <wiesand> I’ll ask Ben whether he can modify 11710 the way Anders suggests. Unless Marc thinks that’s stupid.
[15:37:41] <Jeffrey Altman> my one fear is that, if we do not require the switch to be set explicitly, downstream distributions built by individuals who have little involvement with the openafs community will not read the announcement and will just ship the generated binaries if they build
[15:39:55] <wiesand> But those binaries would be ok in all realistic cases, unless I’m still misunderstanding the issue?
[15:40:41] <Jeffrey Altman> the underlying issue is that this is a behavior change that we cannot test for either at run time or at compile time
[15:40:53] <wiesand> Only if they use 3.17 or 3.17.1 or 3.17.2 or some earlier kernel with a backport of the change in 3.17 but not of the fix in 3.17.3/3.18rc2.
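
The mainline range named in this discussion (3.17 through 3.17.2 affected, fixed in 3.17.3 and 3.18) can be written down as a small check. This is only a sketch: the function names are hypothetical and are not part of the OpenAFS build system. As noted in the conversation, a version string says nothing about distribution backports, so a check like this can only describe mainline behavior.

```python
import re


def parse_release(release):
    """Return (major, minor, patch) from a string such as '3.17.2-200.fc20'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def in_affected_mainline_range(release):
    """True for mainline 3.17, 3.17.1 and 3.17.2, the versions named in the log.

    Assumption: only the numeric prefix is inspected, so a distribution
    kernel with a backported change (or a backported fix) is misclassified.
    """
    return (3, 17, 0) <= parse_release(release) < (3, 17, 3)
```

A default that "matches the mainline kernel" would treat every release outside this range the same way and leave the override switch for builders who know their kernel differs.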
[15:41:27] <Jeffrey Altman> the point is we cannot tell
[15:42:04] <wiesand> ... those binaries would leak inodes in error cases. That’s my understanding. May be wrong.
[15:42:13] <Jeffrey Altman> the code is going to build.  whether its behavior matches the kernel is unknown
[15:43:23] <Jeffrey Altman> either leaks references or removes references too soon
[15:44:08] <Jeffrey Altman> I wish I could say that I simply didn't care
[15:44:48] <Jeffrey Altman> hopefully if an end user gets burned it is the distribution that gets the blame and not openafs
[15:44:49] <Marc Dionne> the problem kernels leak an inode ref in the error case.  and because we have directory aliases, hitting that error case is not that rare
[15:46:43] <Jeffrey Altman> on the flip side we need to get 1.6.11 out the door so just set the defaults and move on
[15:47:50] <wiesand> That’s my current preference, no matter what hat I’m wearing (site admin, packager).
[15:49:08] <wiesand> Let’s go with that for pre2 at least.
[15:49:27] <Jeffrey Altman> done.  lets move on.
[15:49:56] <wiesand> And I think that’s all we have on 1.6.11 for today, unless I missed anything or Marc has some late breaking bad news regarding the final 3.19 kernel?
[15:50:43] <Jeffrey Altman> I want to discuss https://rt.central.org/rt/Ticket/Display.html?id=131997
[15:51:29] <Jeffrey Altman> not that I think that 1.6.11 should be blocked on it
[15:51:30] <Marc Dionne> 3.19 looks fine afaict
[15:52:16] <Jeffrey Altman> However, it is probably worth putting something in the release notes
[15:52:31] <wiesand> Marc: Thanks.
[15:55:28] <wiesand> I’ll need help with putting anything regarding 131997 in the release notes.
[15:55:56] <Jeffrey Altman> As a bit of background.  Andrew filed 131997 because an SNA customer experienced ubik database corruption.   Marc and I have done a lot of testing in the last few weeks and the scenario that generates the corruption is very easy to reproduce in a lab.   The corruption rate is nearly 100%
[15:56:49] <Jeffrey Altman> but worse than that I am now aware of at least three cells other than SNA's customer that have experienced database corruption.   Two of them have experienced total loss of the database
[15:57:18] <Jeffrey Altman> as in, all of the replica sites and the coordinator copies were replaced with damaged databases
[15:58:31] <wiesand> Ok, that’s bad.
[15:59:53] <Jeffrey Altman> In the current implementation of ubik, the coordinator is responsible for tracking the database versions of the replica sites.   If a replica site is restarted with a non-current database without the coordinator noticing then the coordinator will proceed to issue write transactions against the replica which will in turn produce a damaged database with the current version number.
[16:02:19] <Jeffrey Altman> The coordinator might not notice if the replica restarts in under BIGTIME seconds.  I think that is 75 seconds.  it might be slightly longer.    There are some scenarios that might extend that period.
[16:03:29] <Jeffrey Altman> Sites that deploy database servers using puppet, chef, salt, docker, or similar technologies are at risk because they tend to bring up a new service with a clean configuration and do so very quickly.
[16:05:41] <wiesand> Adding a DB server like this seems the most likely case?
[16:06:40] <Marc Dionne> or maybe replacing an existing one with a new one (with no db) at the same ip address
[16:06:50] <Jeffrey Altman> In a rolling upgrade of three servers (A (sync site), B, C) where C is restarted with a clean configuration in under BIGTIME seconds, the database can be corrupted on the first write transaction that follows (if it is issued before the next election cycle).   That database will then be marked as current.   If B is then restarted similarly with active writes taking place it too can become corrupted and be marked current.  When A is finally upgraded, it will restart, see it is not current and pull a corrupted db from either B or C.
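
The failure mode described above can be illustrated with a toy model. This is not ubik code: the classes, fields, and record contents are invented for illustration, and elections, quorum, and BIGTIME are left out entirely. It only shows how a write applied to a silently restarted replica yields a database that carries the current version label but not the current contents.

```python
class Site:
    """One database server: a set of records plus a version label."""

    def __init__(self, name):
        self.name = name
        self.records = {}
        self.version = 0

    def restart_clean(self):
        """Come back with an empty database, as a freshly provisioned host would."""
        self.records = {}
        self.version = 0


class Coordinator:
    """Toy sync site that trusts its own bookkeeping about the replicas."""

    def __init__(self, local, replicas):
        self.local = local
        self.replicas = list(replicas)
        # The coordinator's belief; a fast restart never updates it.
        self.believed_current = {r.name: True for r in self.replicas}

    def write(self, key, value):
        new_version = self.local.version + 1
        targets = [self.local] + [
            r for r in self.replicas if self.believed_current[r.name]
        ]
        for site in targets:
            # The write lands on top of whatever the site holds, and the
            # result is labelled with the current version.
            site.records[key] = value
            site.version = new_version


a, b, c = Site("A"), Site("B"), Site("C")
coordinator = Coordinator(a, [b, c])
coordinator.write("user1", "alice")

c.restart_clean()                  # the restart goes unnoticed
coordinator.write("user2", "bob")  # first write after the restart

same_version = c.version == a.version
same_contents = c.records == a.records
```

After the second write, `same_version` is true while `same_contents` is false: C is missing the first record yet is labelled current, so a later restart of A could pull C's copy.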
[16:07:47] <Jeffrey Altman> Adding an additional DB server to an existing set does not trigger the problem.   In that case the CellServDB on all of the servers must be replaced.
[16:08:35] <Jeffrey Altman> The guidance is that when shutting down and restarting a dbserver the server must be left off for a few minutes to ensure the coordinator notices.
[16:09:07] <Jeffrey Altman> Once the udebug against the coordinator indicates that the server being upgraded is in fact down, then it is safe to restart
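
The restart guidance amounts to a wait loop: keep the server off until the coordinator reports it as down, and only then start it again. A minimal sketch follows. The probe is a caller-supplied function (for example, something that inspects udebug output from the coordinator); how that probe is implemented is deliberately left open, and the timeout and interval values are assumptions rather than recommended settings.

```python
import time


def wait_until_seen_down(coordinator_sees_down, timeout=600.0, interval=15.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until the coordinator reports the stopped server as down.

    coordinator_sees_down: hypothetical callable returning True once the
    coordinator no longer considers the server up.
    Returns True when it is safe to restart, False if the timeout expired.
    """
    deadline = clock() + timeout
    while True:
        if coordinator_sees_down():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.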
[16:10:50] <Jeffrey Altman> I believe that http://gerrit.openafs.org/#q,status:open+project:openafs+branch:master+topic:prevent-ubik-db-corruption,n,z are sufficient to prevent the corruption but there are race conditions that can lead to failing write transactions for an extended period of time.
[16:11:51] <wiesand> But can that happen when the restarts are carried out according to the guidance above?
[16:13:15] <Jeffrey Altman> when following the guidance there will be extended periods where write transactions will fail.
[16:13:37] <Jeffrey Altman> but it is necessary to prevent corruption
[16:15:17] <Jeffrey Altman> as the code is written today the DISK_Begin RPC issued by the coordinator never fails.
[16:16:21] <Jeffrey Altman> with http://gerrit.openafs.org/#change,11689 DISK_Begin will fail if the database is not known to be correct.
[16:18:15] <Jeffrey Altman> The coordinator will then assume that there is something wrong with the replica server and stop talking to it until the recovery process is triggered.
[16:18:50] <Jeffrey Altman> Doing so does eventually result in a stable database.
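
The behavior described for 11689 can be sketched as a toy model as well. Again, this is not the actual patch: `begin` merely stands in for the start of a write transaction, the `known_good` flag is an assumed way of representing "not known to be correct", and quorum handling is omitted.

```python
class DatabaseNotCurrent(Exception):
    """Raised by a replica that cannot vouch for its own database."""


class Replica:
    def __init__(self, name):
        self.name = name
        self.records = {}
        self.version = 0
        self.known_good = True

    def restart_clean(self):
        self.records = {}
        self.version = 0
        self.known_good = False  # nothing has verified this copy yet

    def begin(self):
        """Stand-in for the start of a write transaction."""
        if not self.known_good:
            raise DatabaseNotCurrent(self.name)

    def apply(self, key, value, version):
        self.records[key] = value
        self.version = version

    def recover_from(self, records, version):
        """Stand-in for recovery: take a full copy from a good site."""
        self.records = dict(records)
        self.version = version
        self.known_good = True


class Coordinator:
    def __init__(self, replicas):
        self.records = {}
        self.version = 0
        self.replicas = list(replicas)
        self.skipped = set()  # replicas we stopped talking to

    def write(self, key, value):
        participants = []
        for replica in self.replicas:
            if replica.name in self.skipped:
                continue
            try:
                replica.begin()
            except DatabaseNotCurrent:
                self.skipped.add(replica.name)
                continue
            participants.append(replica)
        self.version += 1
        self.records[key] = value
        for replica in participants:
            replica.apply(key, value, self.version)

    def run_recovery(self):
        for replica in self.replicas:
            if replica.name in self.skipped:
                replica.recover_from(self.records, self.version)
        self.skipped.clear()


b, c = Replica("B"), Replica("C")
coordinator = Coordinator([b, c])
coordinator.write("user1", "alice")
c.restart_clean()
coordinator.write("user2", "bob")   # C refuses and is skipped
stale_before_recovery = (c.records, c.version)
coordinator.run_recovery()
```

In this model the restarted replica stays empty at version 0 instead of receiving a partial database with the current label, and it matches the coordinator again once recovery has run.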
[16:19:26] <Jeffrey Altman> 11689 and 11738 require a great deal of testing
[16:20:14] <wiesand> Sigh. Where would we get that?
[16:21:24] <Jeffrey Altman> not sure.
[16:22:02] <Jeffrey Altman> Marc and I have spent most of the last three weeks on this problem.   I do not have more cycles to expend on  it.
[16:22:38] <Jeffrey Altman> 1.6.11 should not block on it.
[16:22:46] <wiesand> I agree.
[16:22:57] <Jeffrey Altman> I spoke with Ben about it last night and he doesn't think that 1.8 should block on it either
[16:23:29] <wiesand> This must have been a problem since the transarc days?
[16:23:35] <Marc Dionne> without a good testing setup it's hard to be confident that those changes don't have other side effects..
[16:23:36] <Jeffrey Altman> I respect that position.  This is a problem with every release of ubik since the beginning of time
[16:24:36] <Jeffrey Altman> it goes back to cmu and is only now a problem because administrator expectations are changing and hardware / software is getting faster
[16:25:04] <wiesand> True.
[16:25:32] <wiesand> So it’s important to document.
[16:25:42] <Jeffrey Altman> it reminds me of 2003 when hyper threaded processors were introduced by Intel.  Suddenly there was a new set of problems because race conditions that had never before been exposed were suddenly exposed.
[16:26:33] <Jeffrey Altman> I'm planning on sending an announcement but we should document the issue somewhere
[16:27:06] <Jeffrey Altman> as part of 1.6.11
[16:27:16] <Jeffrey Altman> I have nothing else for today
[16:27:17] <wiesand> KNOWN_ISSUES ?
[16:27:56] <Jeffrey Altman> if you can make those upper case letters be bold and blink  and change colors that would be excellent
[16:28:58] <wiesand> Tricky...
[16:29:31] <wiesand> But such a text file alongside NEWS may make sense.
[16:30:35] <wiesand> But this can wait until 1.6.11 final.
[16:30:42] <Jeffrey Altman> yes
[16:31:04] <wiesand> I think we’re done for today then?
[16:31:18] <Jeffrey Altman> unless meffie has something
[16:31:30] <wiesand> thus the “?”
[16:31:55] <Jeffrey Altman> Mike?
[16:31:55] <meffie> question about the ubik bug. do sites delete the *.db0 file to hit this?
[16:32:38] <Jeffrey Altman> starting with an empty .db0 or rolling back to an older version (think a virtual machine rolling back to a snapshot)
[16:33:20] <meffie> thank you
[16:34:05] <meffie> i'll be sure andrew sees this as well, thank you.
[16:34:42] <Jeffrey Altman> one site that I am aware of was hit by this during the process of upgrading the OS to address GHOST.  They replaced a VM with an old OS with a VM with a new OS on the same IP address and a clean configuration.   Switching VMs took just a few seconds.
[16:35:03] <Jeffrey Altman> The idea was to prevent clients from seeing the outage.
[16:35:56] <wiesand> I always start with the current coordinator, and wait for the remaining servers to reach quorum before continuing.
[16:35:57] <meffie> grr. ghost
[16:38:10] <meffie> as you said, expectations are changing
[16:38:31] <wiesand> Fine. Thanks a lot for being here today, despite the lack of a proper and timely invitation once again.
[16:38:46] <Jeffrey Altman> its a standing meeting.  don't worry about the announcements
[16:39:00] <Jeffrey Altman> have a good day
[16:39:19] <wiesand> I have to run. Goodbye.
[16:39:27] wiesand leaves the room
[16:39:29] <Jeffrey Altman> bye
[16:39:36] Marc Dionne leaves the room
[16:39:36] <meffie> is this chat logged, publicly?
[16:40:02] <Jeffrey Altman> it is logged if you know where to look.   I'm not sure what the url is these days
[16:40:32] <meffie> ok, just wanted to share with andrew.
[16:40:56] <Jeffrey Altman> Andrew's RT issue was updated
[16:41:08] <Jeffrey Altman> the patches are in Gerrit
[16:55:45] <shadow@gmail.com/barnowlE5B64A04> http://conference.openafs.org/release-team@conference.openafs.org/2015-02-11.html
[16:56:14] <meffie> thank you
[18:55:11] shadow@gmail.com/barnowlE5B64A04 joins the room
[19:00:52] shadow@gmail.com/barnowlE5B64A04 leaves the room
[19:02:49] shadow@gmail.com/barnowlE5B64A04 leaves the room
[19:02:57] shadow@gmail.com/barnowlE5B64A04 joins the room
[19:20:45] shadow@gmail.com/barnowlE5B64A04 leaves the room
[19:20:53] shadow@gmail.com/barnowlE5B64A04 joins the room
[19:22:40] shadow@gmail.com/barnowlE5B64A04 leaves the room
[19:23:18] shadow@gmail.com/barnowlE5B64A04 joins the room
[20:37:48] meffie leaves the room
[20:59:41] kaduk joins the room
[21:04:22] meffie joins the room
[21:04:28] meffie leaves the room
[21:35:10] kaduk leaves the room
[21:47:47] kaduk joins the room