[00:04:01] --- sxw has become available
[00:31:28] --- kaj has become available
[00:41:23] --- sxw has left
[00:59:47] --- kaj has left
[00:59:54] --- kaj has become available
[01:24:29] --- haba has become available
[01:25:05] --- haba has left
[02:17:20] --- kaj has left
[02:17:25] --- kaj has become available
[02:30:24] --- kaj has left
[03:33:41] --- kaj has become available
[03:37:46] --- kaj has left
[03:40:27] --- kaj has become available
[04:53:16] --- jaltman has left: Replaced by new connection
[04:53:17] --- jaltman has become available
[04:53:44] --- Jeffrey Altman has become available
[05:23:40] --- jaltman has left: Disconnected
[05:23:48] --- jaltman has become available
[06:24:52] --- cudave has left
[06:24:56] --- dwbotsch has become available
[07:06:19] --- deason has become available
[07:56:48] --- sxw has become available
[07:58:22] --- sxw has left
[07:58:33] --- sxw has become available
[07:59:37] The pthread Ubik stuff is broken for out of tree builds, too. I'm working on a fix, but it is really very wrong, so it's taking a while.
[07:59:59] (tvlserver also requires that vlserver is built first, but doesn't express that dependency anywhere.)
[08:04:20] the addition of AFS_NORETURN has identified a number of Windows only source files that do not include afsconfig.h nor afs/param.h
[08:04:44] Sorry!
[08:06:07] or places elsewhere in the code where is included before those files
[08:06:18] the tvlserver and ptserver makefiles are... weird; I would have thought they'd be more similar to the ptserver and vlserver ones
[08:06:30] should we just copy the ptserver and vlserver ones and make the necessary changes?
[08:06:36] jaltman: I take it Windows doesn't have a native assert?
[08:06:40] deason: No.
[08:06:51] Windows does but we don't use it
[08:07:29] The problem is that tvlserver and tptserver also build their own copies of all of the subsidiary code - so stuff from util/ and so on. They probably shouldn't, but until we completely rework our build process so we have proper pthreaded libraries, they need to.
[08:08:01] jaltman: I think, as a coding style, everything should include
[08:08:01] #include <afsconfig.h>
    #include <afs/param.h>
[08:08:24] at the very beginning. We should probably just fix the files that don't do so. If you want to give me a list, I'm happy to do that.
[08:11:39] I will just commit the fix once I have a new build
[08:12:05] sxw: yes... but there are unrelated things that are different for no reason I can see, such as 'dest' and 'install'
[08:12:36] if you think it's easier to just fix the stuff in there, then okay
[08:12:42] Yeah. Those aren't breaking my build at the moment, so I care about them less.
[08:26:07] --- reuteras has left
[08:40:01] --- matt has become available
[09:07:22] --- kaj has left
[09:25:53] --- mattjsm has become available
[09:33:17] --- mattjsm has left
[09:34:59] --- mattjsm has become available
[09:42:19] --- sxw has left
[10:04:06] --- rra has become available
[10:55:44] matt: "don't know how to make .../root.client/bin/libafs.nonfs.o. Stop."
[10:56:02] Are you aware of anything that was taken out that would do this? I'm not getting errors during the make anymore...
[11:08:46] mattjsm: have you pushed your latest state?
[11:15:02] sec
[11:15:41] pushed
[11:21:08] I'll have to build it. take a bit
[11:21:24] kk
[11:22:36] --- Kevin Sumner has left
[11:23:06] --- Kevin Sumner has become available
[11:23:54] --- Kevin Sumner has left
[11:23:55] --- Kevin Sumner has become available
[11:27:41] My brain hurts.
    freebuild# cd /afs/zone.mit.edu/user/kaduk
    freebuild# fs flushv
    freebuild# ls
    freebuild hereiam kaduk kaduk.root src
    freebuild# cd ..
    freebuild# cd kaduk
    freebuild# fs flushv
    freebuild# ls
    freebuild hereiam kaduk kaduk.root src
    freebuild# rm freebuild
    freebuild# fs flushv
    db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
    vfs_badlock() at vfs_badlock+0x95
    assert_vop_locked() at assert_vop_locked+0x72
    vinvalbuf() at vinvalbuf+0x3a
    osi_VM_TryToSmush() at osi_VM_TryToSmush+0x78
    __func__.18827() at __func__.18827+0x65f
    vinvalbuf: 0xffffff000c5833c0 is not locked but should be
    KDB: enter: lock violation
    [ thread pid 1281 tid 100156 ]
    Stopped at kdb_enter+0x3d: movq $0,0x6ce940(%rip)
[11:37:38] --- kaj has become available
[11:38:00] kaduk: maybe you just got lucky the first times? does the vnode ever get locked in the flushv code path? it doesn't look like it but I'm skimming...
[11:39:27] I don't remember seeing it do so, but if I put a lock in TryToSmush, then I get a more reliable panic trying to recurse on the lock.
[11:43:29] afs_GetVCache code seems to imply that you need to check if it's locked, and lock if it's not
[11:43:54] (don't ask me why)
[11:44:40] Hm.
[11:47:11] (The trace for the recursing panic claims that the lock is first obtained in afs_vop_lookup, osi_vnodeops.c:527)
[12:08:05] mattjsm: my build fails with multiple defines of various supposedly static inline functions
[12:09:11] Hmm
[12:10:11] did you build the whole thing or just libafs?
[12:10:47] built from scratch. I had a solution to building -our- static inlines (see stds.h) at 4.0, but this didn't -seem- to be what the netbsd 5.0 kernel was doing.
[12:11:49] let me try something...
[12:23:30] matt: I've got to take off and deal with something asap. Yay real life. If you figure anything out, send me a mail. I'll try to be back on in an hour or so.
[12:23:48] --- mattjsm has left
[12:23:58] I think I have an unholy fix, but we'll need something blessed. talk later..
[12:26:49] unholy fix: change "$(LD) -r -o ..." to "$(LD) -r -z muldefs -o ..." in MakefileProto.NBSD.in. The real solution is to set a missing gcc attribute.
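For readers following the "multiple defines" problem above: the log does not say which gcc attribute was meant, so the following is only a minimal sketch of the general idea, under the assumption that the colliding helpers were meant to have internal linkage. A helper that really is `static inline` is private to each translation unit, so `$(LD) -r` has no duplicate global symbol to complain about and `-z muldefs` is not needed. The names `hyp_rotl32` and `hyp_mix` are hypothetical and come from neither OpenAFS nor NetBSD.

```c
#include <stdint.h>

/*
 * Hypothetical header-style helper. Because it is declared `static`,
 * every .c file that includes this definition gets its own private
 * copy, and the linker never sees two global symbols with this name.
 */
static inline uint32_t hyp_rotl32(uint32_t x, unsigned n)
{
    n &= 31u;                          /* keep the shift count in range */
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/*
 * An ordinary function with external linkage that uses the helper.
 * Only this symbol is visible to other object files.
 */
uint32_t hyp_mix(uint32_t x)
{
    return hyp_rotl32(x, 8) ^ x;
}
```

If a macro that is supposed to expand to `static inline` expands to plain `inline` (or to nothing) on some compiler, the helper becomes a global symbol in every object file, which would produce the kind of link error described in the log.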
[12:27:44] doing this, I do have a libafs.nonfs.o which might be loadable
[12:28:40] (need to regen Makefile after this)
[12:31:26] --- Kevin Sumner has left
[12:53:46] --- jaltman has left: Disconnected
[13:07:54] --- Kevin Sumner has become available
[13:28:44] Hm, I can't remember if I've mentioned this here before:
    +lock order reversal:
    + 1st 0xffffff007e992c60 call lock (call lock) @ /usr/ports/net/openafs-devel/work/openafs/src/rx/rx.c:5349
    + 2nd 0xffffff8000beb820 AFS global lock (AFS global lock) @ /usr/ports/net/openafs-devel/work/openafs/src/rx/rx_rdwr.c:241
    This is moderately annoying because WITNESS goes on to try and print a stack trace, but ends up thinking that there is an entire page (?) of _end()s. Printing them all to serial console takes a while.
[13:31:56] --- mattjsm has become available
[13:32:46] --- jaltman has become available
[13:34:10] --- Jeffrey Altman has left: Replaced by new connection
[13:34:11] --- Jeffrey Altman has become available
[13:43:43] --- asedeno has left
[13:55:19] --- asedeno has become available
[13:58:44] --- asedeno has left
[13:58:47] --- asedeno has become available
[14:09:09] matt: Any luck?
[14:09:16] yes, see above?
[14:09:27] I missed it
[14:09:59] ok, pasting (also, note web log)
[14:10:14] Yeah. Weird. Pidgin usually will pull down the logs
[14:10:32] I mean, there is an html log of the conferences
[14:10:37] k
[14:11:09] I assume, then, you got it past all the static inline errors?
[14:11:26] yes
[14:11:53] Nothing I need to deal with?
[14:12:45] that was all I noticed, but I was making in libafs, that -should- be it
[14:12:51] see if you can repeat it?
[14:15:29] if you can get a libafs.nonfs.o, you're ready to work on loading
[14:16:50] I got the same error. I wasn't getting the static inline errors, though.
[14:23:41] if you're "make dest"ing, that error from before looks like you have a libafs.nonfs.o somewhere; it's just the final step of copying it to the 'dest' location that's failing
[14:25:20] I'm suspicious tho. I did the following: pull your changes; add change above; sh regen.sh; ./configure; make -- and I have a module
[14:25:49] yeah, 'make' no 'make dest'
[14:25:56] "not", that is
[14:26:17] He's right. There's one hiding in src/libafs/MODLOAD.
[14:26:45] heh. that's where it hides
[14:26:53] yeah. i was looking in dest tho. :)
[14:35:17] the loading bit. the netbsd modload is going to want to load without exported symbols or something of that nature. the goal is to get to a state in which you can set and hit breakpoints on functions in libafs in ddb
[14:40:58] --- sxw has become available
[14:48:01] kaduk (and anyone who cares): I've put the lock order reversal issue into RT as #127440, as I suspect it will just get lost in the scrollback
[14:48:30] Thanks. I will in theory follow up fairly soon, but you know how these things go.
[14:49:10] I'll add you to the ticket, so you can - is kaduk@mit.edu the address you'd be sending from?
[14:49:20] Yup.
[14:53:15] --- mattjsm has left
[15:04:47] It would be good to get more comments on the risk/reward of 2161 (in gerrit) for 1.4.x - especially from those on the "1.4 means stable, dammit" side of the argument like jhutz.
[15:06:08] Basically the choices seem to be a) We crash when someone unmounts with files open in /afs on Solaris. b) We crash randomly on Solaris c) We take a relatively invasive change to vnode reference counting on all platforms.
[15:06:46] (a) has been the behavior of OpenAFS on Solaris for years, I think. Well, there may have been an interim point where it didn't happen, but I remember that behavior back when we were on Solaris.
[15:07:17] It's one of the reasons why several of us had init scripts that used lsof to find and kill processes with AFS open before shutting down the client.
[15:07:28] Yeah. A fix aimed at solving a) was pulled up to 1.4.x - but it causes the problems in b)
[15:07:43] Fixing those problems then requires changing the way that we reference count vnodes.
[15:08:11] (b) is definitely worse than (a).
[15:09:11] well yeah, I don't think anyone's suggesting (b)
[15:09:12] Yeah. I think the options are (a) or (c). Andrew obviously favours (c). I'm torn - it's good to fix bugs, but I do wonder about the risk/reward ratio in this case.
[15:09:38] Also particularly since we may be in a position to tell people to run 1.6 in a couple of months.
[15:09:48] well... I don't really care; the person who complained about this is getting the fix anyway
[15:09:54] I assume (c) is a backport of something already in master?
[15:09:59] Yes
[15:10:06] yes, but there's differences
[15:10:13] Yeah, there'd have to be.
[15:10:16] We diverged a lot.
[15:11:00] It's obviously great to have the fix for 1.6.
[15:11:32] the various vnode referencing interfaces seem.... very crufty and hacked-on, even in 1.5; I'm not too sure on what may be different on various platforms
[15:13:04] It looks to me like nobody really understood what the semantics of the various different interfaces were - and so just rolled their own for the bit of the code they were writing at the time.
[15:13:40] Dragos and I tried to understand it whilst doing disconnected, and ended up getting something that worked for the stuff we cared about, but couldn't get a clear enough idea of what was going on to tackle the rest of the code.
[15:13:54] Your patch is a definite step forwards in that regard.
[15:14:08] from looking around at the stuff in at least 1.5... the feeling I kept getting was that there were 3 ways to do any of this, and they were all effectively equivalent
[15:14:20] even if they looked like they did different things, or were intended to originally
[15:14:41] e.g. some time awhile ago we had the possibility of sleeping while trying to acquire a ref (osi_vnhold's second argument)
[15:15:14] but now that doesn't seem possible anywhere, so the "I can sleep to acquire a ref" and the "I cannot sleep to acquire a ref" functions now do the same thing
[15:15:55] Yeah. We should probably look at using a single interface.
[15:16:21] I was going to look at cleaning it up in 1.5 (or 1.9, whatever it comes to), but after this mess is dealt with
[15:16:28] Someone needs to pluck up the courage to make these kinds of changes though - I keep chickening out, because I'm not convinced I'd be able to test well enough to pick up any breakage.
[15:16:43] That would be great if you did.
[15:17:32] --- mmeffie has left
[15:18:27] are we even having another 1.4 release? I wasn't even sure to submit the backport, since I don't know if what's going on the 1_4_x branch will even be used aside from people cherry-picking things
[15:18:53] I suspect we'll have at least one, and most likely two, more releases from 1.4.x
[15:19:53] With the next release basically being a gathering of all of the bug fixes from 1.5.x, as usual, and anything that comes after that being a _very_ conservative selection, for people who aren't ready to move to 1.6
[15:20:22] --- mmeffie has become available
[15:20:29] Of course, that depends on whether we can get 1.6 out of the door before the next ice age.
[15:26:05] anyway, if the vnode hold changes are that contentious, I don't mind just reverting the solaris 1.4 stuff... I just didn't think those changes made much difference; I don't really see it possible to make a "wrong" choice for which hold macro to use, but I don't know
[15:27:53] I'm not sure that they're that contentious - I'm just concerned about the size of the change against the problem they're ultimately fixing. That's probably because the vnode ref macros have always been voodoo to me - if I understood that bit of code better, I'd probably be happier about the change.
[15:31:28] I don't really understand the purposes of them either, but my thinking is that the different ones are already used interchangeably afaict, so it's hard to see a problem with just substituting some for another
[15:31:32] but it's still change, so I can get that
[15:32:01] that probably wasn't clear: "so I can understand being uncomfortable with it"
[15:37:03] --- deason has left
[15:42:39] --- deason has become available
[15:58:40] 1.6 is supposed to branch incredibly soon, right?
[15:59:11] Yes.
[15:59:47] What I'm pushing for is for 1.6 to branch very very soon, and for master to immediately reopen for development work.
[16:00:12] Then 1.7 can branch from the 1.6 branch for the Windows redirector work.
[16:00:59] that would be highly desirable
[16:01:01] 1.7 would be a 'merge' branch, rather than a cherry pick one, which should reduce the overhead of having all of these branches on the go at once.
[16:01:36] Do you have stuff ready to fire at master once 1.6 is cut?
[16:02:02] I believe so. first, xcb. then things following on it.
[16:03:20] --- Kevin Sumner has left
[16:03:20] Cool. Hopefully we won't get flattened in the post-release patch avalanche ...
[16:03:36] yeah, I put xcb out there early to ease that
[16:03:37] hopefully gerrit won't get flattened
[16:03:48] Gerrit will be fine. It's the merging that scares me.
[16:03:54] --- mmeffie has left
[16:04:05] I have been thinking about doing openafs-next. I may well do so if 1.6 doesn't appear this month.
[16:04:11] I'm not remembering the last roadmap slides very well--rxk5 is on a post 1.8 master?
[16:04:38] --- Kevin Sumner has become available
[16:04:38] In my world, I think rxk5 could start landing as soon as 1.6 is cut.
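Returning to the earlier vnode hold discussion (15:14:41 to 15:15:55): the point that the "can sleep" and "cannot sleep" hold functions now behave identically, and could collapse into a single interface, can be sketched as follows. This is a toy model under stated assumptions, not OpenAFS code: `struct toy_vnode`, `toy_vn_hold`, `toy_vn_rele` and `toy_vn_hold_maysleep` are hypothetical names, and the real `osi_vnhold` and platform hold macros are not reproduced here.

```c
#include <assert.h>

/* Hypothetical stand-in for a vnode; only the reference count matters. */
struct toy_vnode {
    int refcount;
};

/* The single hold primitive: takes one reference and never sleeps. */
void toy_vn_hold(struct toy_vnode *vp)
{
    assert(vp->refcount >= 0);
    vp->refcount++;
}

/* Drop one reference; dropping below zero is a bug in the caller. */
void toy_vn_rele(struct toy_vnode *vp)
{
    assert(vp->refcount > 0);
    vp->refcount--;
}

/*
 * Legacy-style entry point with a "may sleep" flag. Since sleeping is
 * no longer possible in this model, the flag is accepted for source
 * compatibility and ignored, and the call forwards to the one primitive.
 */
void toy_vn_hold_maysleep(struct toy_vnode *vp, int can_sleep)
{
    (void)can_sleep;
    toy_vn_hold(vp);
}
```

Keeping the old entry points as thin wrappers is one way to make such a cleanup in small, reviewable steps: callers can be converted one at a time, and the wrappers removed once nothing uses them.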
[16:04:43] we could frankly send a steady stream of rxk5 at this point
[16:05:00] Marcus has been breaking it up aggressively.
[16:05:13] grumble. still want to have fileserver that works before we branch 1.6. so you'll get no buy-in from these quarters yet
[16:05:47] I'm not sure I see the point in not branching.
[16:05:59] well, frankly, why can't 1.6 branch now, and release when ready?
[16:06:01] It reduces the risk of regressions on the to-be-1.6 branch.
[16:06:07] like, we should have a release meeting around a week from now and invite all concerned and discuss it, really.
[16:06:12] er, what doesn't work yet?
[16:06:12] Sure.
[16:06:16] sure
[16:06:42] Matt: with regards to rxk5 - it would be good if it could go in after the new-tokens stuff from rxgk has landed.
[16:06:58] i have a dafs fileserver which discards its state pretty regularly. in the next day i should get back to it and have the answer.
[16:06:59] That gives a much cleaner way of handling multiple-mechanism tokens.
[16:07:13] well, that's basically derived from rxk5 tokens, and yes, we can't send the old...new...tokens since it's replaced
[16:07:33] so I'd presume you would send that as soon as possible
[16:07:53] As soon as I can, I'll push the stuff that's in github over.
[16:07:57] I don't see discarding state on restart as "broken", but eh, can be discussed later
[16:08:41] but, basically, rxk5 broken out has probably a lot of pieces that can start going to gerrit at the branch point, at any rate, some might be orthogonal enough to commit, etc
[16:08:48] well, it bitches on shutdown too, and i think it's uncovering a real issue. i'd instead prefer to, say, find and ideally fix the issue
[16:09:12] discarding state in and of itself, probably not broken
[16:09:30] it certainly discards state on every 3rd restart, over hear
[16:09:32] here
[16:09:33] --- mmeffie has become available
[16:10:03] yeah, for a while i was having nightly restarts due to the last bug i fixed (bosserver) and every one discarded its state, 100% of the time, for the server workload
[16:10:41] --- Kevin Sumner has left
[16:10:55] > the various vnode referencing interfaces seem.... very crufty
[16:11:08] --- Kevin Sumner has become available
[16:11:33] i understand it decently, the problem is at various points where what OSes needed changed no one consistently fixed the code up for it, and we got what ibm gave us
[16:12:40] and as far as it goes, i don't think c) is all that invasive.
[16:13:21] and as far as kaduk's local reversal issue, i'd really like more stack.
[16:13:37] yes, you absolutely need a real bt
[16:14:08] well, the references we got are... not all that useful
[16:14:10] Er, the lock reversal?
[16:14:19] the report from WITNESS
[16:14:28] the flushv one is real enough, but I'm pretty sure it's what always happens on 8 when we enter VOP_RECLAIM
[16:14:47] Sure. KDB is not terribly good about traversing the stack, so I think I'd need to get a dump.
[16:15:00] just having the backtrace would be a start
[16:15:01] get us something from ddb?
[16:15:07] i'd even settle for a jpg of bt
[16:16:00] I have a ddb patch that is extra helpful for getting backtraces of blocked threads
[16:16:31] So, this is what I have now; with some effort I could probably procure a crash dump.
    lock order reversal:
    1st 0xffffff000bf72a60 call lock (call lock) @ /usr/ports/net/openafs-devel/work/openafs/src/rx/rx.c:5239
    2nd 0xffffff8000be5000 AFS global lock (AFS global lock) @ /usr/ports/net/openafs-devel/work/openafs/src/rx/rx_rdwr.c:241
    KDB: stack backtrace:
    db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
    _witness_debugger() at _witness_debugger+0x2e
    witness_checkorder() at witness_checkorder+0x81e
    _mtx_lock_flags() at _mtx_lock_flags+0x78
    rxi_ReadProc() at rxi_ReadProc+0x187
    rx_freePacketQueue() at rx_freePacketQueue
    _end() at 0xffffff8000c17400
    _end() at 0xffffff8000c20700
    _end() at 0xffffff8000c1ac00
    _end() at 0xffffff8000c1cf00
    [...]
[16:16:36] and, further back... if we include afsconfig.h as a coding style, perhaps we should also install it. because, well, right now rxgen generates references to it... and make dest *doesn't* appear to install it
[16:17:04] yeah, that's not very useful. oh
[16:18:03] --- mmeffie has left
[16:18:04] --- deason has left
[16:18:17] --- deason has become available
[16:20:12] matt: do you have a link to this ddb patch?
[16:21:37] just to clarify, I don't think an openafs-next would be that helpful, I'm not sure I understand what entirely was meant by this, either, so...
[16:21:52] will send
[16:22:13] It's not large, just gives you the addresses of things you want to bt, from show allocks
[16:23:27] Hm, I wonder if manually breaking to the debugger while it's printing the _end() spew would be close enough in time to get useful data.
[16:23:52] um...ugh.
[16:24:25] (Otherwise I need to patch WITNESS to break to debugger on like the third LOR, which is just ugly, but doable on the dedicated test box.)
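For context on what a "lock order reversal" report means: a checker remembers the order in which pairs of lock classes have been taken, and complains when the same pair is later taken in the opposite order. The sketch below is a deliberately tiny model of that idea, assuming just two lock classes named after the ones in the report. It is not how FreeBSD's WITNESS is implemented, and every identifier in it (`order_checker`, `checker_acquire`, `LC_CALL_LOCK`, and so on) is hypothetical. The log also does not say which of the two orders is the intended one in OpenAFS, so the example order is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, named after the two locks in the report. */
enum lock_class { LC_CALL_LOCK, LC_AFS_GLOBAL, LC_COUNT };

struct order_checker {
    /* seen_before[a][b] is true once b was acquired while a was held. */
    bool seen_before[LC_COUNT][LC_COUNT];
    bool held[LC_COUNT];
};

void checker_init(struct order_checker *c)
{
    memset(c, 0, sizeof(*c));
}

/*
 * Record that lock class l is being acquired. Returns false if some
 * currently held class h was previously acquired while l was held,
 * i.e. the pair (h, l) has now been seen in both orders.
 */
bool checker_acquire(struct order_checker *c, enum lock_class l)
{
    bool ok = true;

    assert(l < LC_COUNT);
    for (int h = 0; h < LC_COUNT; h++) {
        if (!c->held[h] || h == (int)l)
            continue;
        if (c->seen_before[l][h])
            ok = false;               /* reversal: l-then-h seen earlier */
        c->seen_before[h][l] = true;  /* remember h-then-l */
    }
    c->held[l] = true;
    return ok;
}

void checker_release(struct order_checker *c, enum lock_class l)
{
    assert(l < LC_COUNT);
    c->held[l] = false;
}
```

In this model, a thread that takes the AFS global lock and then the call lock establishes one order; a later path that holds the call lock and then asks for the AFS global lock is flagged, which matches the "1st call lock, 2nd AFS global lock" shape of the report. A reversal is a warning about a possible deadlock, not proof that one occurred.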
[16:24:53] well, you'd have dropped to ddb, already
[16:28:04] --- mmeffie has become available
[16:41:34] ben: /afs/umich.edu/user/m/b/mbenj/Public/fbsd_allocks_verbosity.diff
[17:12:39] --- matt has left
[17:29:24] Jeffrey, one question about the Gets functionality in StrSafe.h - Do you know if the Gets functions consumes the entire line if the buffer is too small to hold it?
[17:30:29] Also, do you know if an STRSAFE_E_INSUFFICIENT_BUFFER be thrown if the input was too long - the documentation only specifies that it is thrown if the buffer is of size 1 or less, but there are hints elsewhere that it is thrown in other cases as well.
[17:32:28] Returned, not thrown. :)
[19:03:00] --- rra has left: Disconnected
[19:28:59] --- Russ has become available
[20:04:09] --- Born Fool has become available
[20:50:25] I go to test Andrew's patch, and find that (even without his patch):
    cd src && cd config && make all
    rm -f Makefile.version
    if [ -r /usr/ports/net/openafs-devel/work/openafs/src/CML/state ] ; then cp Makefile.version-CML Makefile.version ; else cp Makefile.version-NOCML Makefile.version ; fi
    make -f Makefile.version AFS_component_version_number.c
    echo 'char cml_version_number[]="@(#) OpenAFS 1.5.74.1 built ' `date +"%Y-%m-%d"` '";' >AFS_component_version_number.c
    echo 'char* AFSVersion = "openafs 1.5.74.1"; ' >>AFS_component_version_number.c
    cc -g -O -I/usr/ports/net/openafs-devel/work/openafs/include -O2 -pipe -fPIC -I. -c ./config.c
    cc -g -O -I/usr/ports/net/openafs-devel/work/openafs/include -O2 -pipe -fPIC -o config config.o mc.o
    cc -g -O -I/usr/ports/net/openafs-devel/work/openafs/include -O2 -pipe -fPIC -o mkvers ./mkvers.c
    ./mkvers.c:15:23: error: afsconfig.h: No such file or directory
    ./mkvers.c:16:23: error: afs/param.h: No such file or directory
[21:03:18] --- Born Fool has left
[21:03:30] Looks like jaltman's 9316f209e (Windows: ensure that afsconfig.h and afs/param.h are included) is the direct culprit, though I would easily believe that the real problem is more subtle.
[21:10:39] I think the stuff in config can't safely use those files since the stuff in config constructs those files, although I haven't looked at it in detail.
[21:13:39] I'm going to send mail; to: jaltman cc:openafs-cvs is okay?
[21:16:14] Or would openafs-devel be better than -cvs ?
[21:16:19] -devel's better.
[21:16:26] I suspect -cvs would bounce your message anyway.
[22:25:59] --- reuteras has become available
[22:42:52] --- deason has left
[22:47:36] --- kaj has left
[23:56:27] --- kaj has become available