[01:44:44] --- haba has become available
[01:50:19] --- Russ has left: Disconnected
[06:29:43] --- Simon Wilkinson has left
[09:41:40] --- Russ has become available
[10:46:17] ... huh. The switch to using bsd.kmod.mk for libafs builds is what fixed my "symbol SHA1_Init undefined" error at load-time.
[11:43:42] --- mdionne has become available
[12:10:41] --- mdionne has left
[12:33:47] e88e369c92e8a0c4bedd136edadb21d99988d587 seems to be before the "all buffers locked" issue was introduced.
[13:00:42] --- haba has left
[13:50:58] what is the closest patchset you have tested that failed?
[13:53:14] The current range I'm considering is e88e369c92e..79aedab16c36, but since I'm using 'svn update' on a freebsd checkout as my test, I want to wait another day before declaring that the left side of that actually does work.
[13:53:31] we have to fix the freebsd82-amd64-builder and the aix-builder. at this point there is no benefit to having buildbot since both of those systems are failing everything on master
[13:54:01] freebsd82-amd64-builder should be fine, now?
[13:54:46] (Though I'm confused why freebsd81 didn't get bitten by the same typo.)
[14:36:00] --- Simon Wilkinson has become available
[14:38:50] --- mdionne has become available
[14:44:15] --- mdionne has left
[14:44:39] --- mdionne has become available
[14:56:35] There are some buffer fixes for master sitting in gerrit
[14:56:53] There's a double free, and a missing DRelease. Both of those are on pretty unusual error conditions, though.
[15:00:18] master currently fails to untar anything substantial on my test machine. the 1.6 branch is OK. the dir package seems like a likely suspect. ring a bell for anyone?
[15:01:17] what seems to be happening is that a directory creation succeeds at the server and locally, but a lookup a bit later for that dir gets ENOENT
[15:02:01] tar then proceeds to try to create the dir it thinks is missing, which fails at the server because EEXIST
[15:02:47] Hmmm. Doesn't ring any immediate bells, but that would be consistent with the dir packages having a sulk
[15:03:25] However, this has worked since the new dir package code went in, hasn't it - because you fixed a bug in the dir package just after it was committed.
[15:03:56] yeah but I don't remember if I ran this type of test
[15:04:50] I'll see if I can bisect on a different system - on this one I'd need to roll back to older kernels
[15:05:18] There's a fairly small number of changes you'd need to rip out to see if it is the dir package stuff.
[15:05:44] 0284e65f9786, mostly?
[15:05:46] Basically, just go through the commits to dir.c, and pull them one by one, across the tree.
[15:06:20] I'd pull d1946ffe9be0031a2daf907f5e96cf0ee7f5e15e and bb25bdfcb059fc54a57fd4733ce3184e231ca88d first (and their various fixes after those commits)
[15:06:52] If the problem's still there, then pull 0284e65f97861e888d95576f22a93cd681813c39, and everything that modified that commit
[15:14:15] If you're getting ENOENT, rather than blowing up, my suspicion would be an error in the verification code.
[15:18:02] ok, first test, reverting bb25bdfcb0, 0fb2e3a6dbfd (and d1946ffe9be0031a2) doesn't help
[15:18:42] Oh, that's boring. I'd almost convinced myself that there was an off-by-one error in that code.
[15:19:52] In fact, I'm pretty sure that there is.
[15:20:28] Oh no, it's fine...
[15:20:51] Try pulling the rest of the dir changes, and see what falls out ...
[15:24:33] will do, but a bit trickier to revert
[15:36:30] --- Jeffrey Altman has become available
[15:36:36] --- jaltman/FrogsLeap has left: Replaced by new connection
[15:36:37] --- jaltman/FrogsLeap has become available
[15:37:58] reverting 0284e65f9786 and a few of the later minor commits doesn't help
[15:38:21] Now, that's interesting ...
[15:39:16] 1.6 is fine, and master is broken?
[15:40:18] yes I can't reproduce the same failure on the head of the 1_6 branch
[15:42:02] The locking changed in afs_linux_readdir
[15:42:50] We added cells to inode number calculations
[15:43:40] Hmmm. We took the lock free dentry_revalidate change.
[15:45:52] readdir locking is not it
[15:46:44] GetUser now locks properly, so it's possible that there have been some big internal timing changes
[15:50:34] let me check a bit more but looks like the lock free revalidate might be it
[15:52:54] … connection clones, but that's not going to be it
[15:53:09] … cache bypass changes, but again, not likely
[15:55:27] Some changes to callback flushes, but if it's a single client showing this problem, not likely to be it either.
[15:56:24] Lots of other bits and pieces, but that's about it in the 1.6 -> master diff. So, yeah. dentry_revalidate would be the most likely candidate.
[15:56:42] reproduced it with the lock free revalidate reverted, although the test succeeded several times
[15:58:24] Rats. So that's not it either.
[15:59:24] It is possible that this is a race that we have always had, that some of the less relevant changes have exposed by changing operation timings, or relative orderings between threads. If so, it's not going to be fun to track down.
[16:00:14] --- mdionne has left
[16:01:34] --- mdionne has become available
[16:03:33] could be related to recent kernel changes, this is running a 3.1-rc kernel. Confirmed again that 1.6 is fine - ran a loop of 20 untars with no issues.
[16:05:09] When you pulled the dir package changes, you went back to effectively a 1.6 dir package, right?
[16:09:41] there's some other more minor looking fixes that I didn't pull, I suppose I can try that
[16:10:37] That's worth a look.
[16:11:14] After that, I wonder about removing the locking in GetUser and PutUser and see if the problem goes away. If it does, then it's a timing related thing
[16:37:47] It looks like I mis-reverted something earlier. 0284e65f9786 is looking guilty
[16:39:08] Ah, okay. I'm off to sleep now, but if you find anything more, let me know, otherwise I'll stare at that patch whilst I commute tomorrow, and see what drops out.
[16:40:06] ok, I'll see if I can pin it down a bit more, thanks
[16:40:12] Cool, ta!
[19:43:16] found the problem - fix coming soon to a gerrit near you...
[19:57:54] change 5666
[19:58:00] --- mdionne has left
[23:14:50] --- Russ has left: Disconnected
[23:44:54] Hmm, I suppose I could try a parallel build again ... but I wouldn't want to deal with the fallout if it failed.
[23:45:28] Try it, get a log if it fails, open a bug report?
[23:46:02] In other news, buildbot seems to have stalled again - it's stopped building without finishing off all of the latest patches in gerrit
[23:47:41] s/could try/could have tried/ Too sleepy for the subjunctive, perhaps I am.
[23:48:59] At this point, though, I think I'm just going to let my 4am cronjob run instead of manually testing it. So, we'll see how things do in my morning.
[23:55:48] --- Simon Wilkinson has left
[23:56:37] well, the build (and compilation during the 'install' target) are done. And I'm off to bed.
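
Note on the failure described at [15:01:17] and the repeated runs mentioned at [16:03:33]: the symptom (a directory is created successfully, then a lookup shortly afterwards returns ENOENT) could be probed without tar by a loop along the following lines. This is only a rough sketch under stated assumptions, not the test that was actually run in the log, which used untar. The function names `mkdir_then_lookup` and `run_loop`, the tree shape, and the default of 20 iterations are all hypothetical; `parent` is assumed to be a writable directory inside the filesystem under test.

```python
import errno
import os
import tempfile


def mkdir_then_lookup(root, depth=5, width=20):
    """Create a small directory tree under root, stat-ing each directory
    immediately after it is created.

    Returns the list of paths whose lookup failed with ENOENT even though
    the preceding mkdir succeeded (hypothetical helper, not from the log).
    """
    missing = []
    for i in range(width):
        path = root
        for j in range(depth):
            path = os.path.join(path, f"d{i}_{j}")
            os.mkdir(path)  # the creation itself is expected to succeed
            try:
                os.stat(path)  # the lookup "a bit later"
            except OSError as exc:
                if exc.errno != errno.ENOENT:
                    raise
                missing.append(path)
                break  # deeper mkdirs under a missing parent would fail too
    return missing


def run_loop(parent, iterations=20):
    """Repeat the check in fresh subdirectories of parent.

    Returns a dict mapping iteration number to the paths that went missing;
    an empty dict means no iteration showed the symptom.
    """
    failures = {}
    for n in range(iterations):
        root = tempfile.mkdtemp(prefix=f"run{n}_", dir=parent)
        missing = mkdir_then_lookup(root)
        if missing:
            failures[n] = missing
    return failures
```

Since the log notes that the test sometimes succeeded several times before failing, a clean run of a loop like this would not by itself show that the problem is gone.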