[08:44:20] --- jaltman has left: Disconnected
[08:46:51] Ben: How difficult is it to reproduce?
[08:48:21] And um, maybe you could paste the actual warning?
[08:54:07] WARNING: afs_ufswr vp=ffffff8000ff2730, exOrW=0
I've seen it maybe five times (two distinct vp values, though). It popped up while I was doing mmap testing, so it is probably not too hard to reproduce.
[09:16:09] It happens only with mmap?
[09:18:43] Probably? Note that kib@freebsd.org had a suspicion about why mmap is not entirely functional for us.
[09:19:03] Which is?
[09:19:41] I've basically not even looked at mmap'd i/o on fbsd.
[09:19:52] [me: stuff is broken; ...]
> I also don't seem to be able to run executables from AFS:
> freebuild# ./my_mmap test4
> elf_load_section: truncated ELF file
> Abort
This sounds very much as missed vnode_pager_setsize() calls. VM tracks the file size as vnode vmobject size separately. I think this was done for historical reasons, but also it allows to not traverse the vop stack calling VOP_GETATTR each time when size is needed.
[09:21:08] Well, it started off by me saying 'okay, I can copy /usr/src into and out of /afs; let me try a buildworld'; buildworld died pretty quickly with a signal 11. With that and failing to run executables from afs, rwatson suggested some mmap unit tests.
[09:24:30] We appear to call vnode_pager_setsize only in osi_VMTruncate.
[09:26:20] Indeed.
[09:27:16] nfs_clbio.c:ncl_write calls it
[09:30:48] I'm forced to look at dfbsd, for the nonce. I see we should be calling it more, but in the common case, vfs must infer the right values.
[09:33:32] Incorrect execsOrWriters seems sortable.
[09:35:18] "sortable"? I don't understand your meaning.
[09:35:36] Tractable to sort out.
[09:36:14] Ah. Sure.
[09:42:28] Since we actually are AFS_VM_RDWR_ENV, then afs_vop_write should probably wrap its afs_write call with afs_FakeOpen and afs_FakeClose.
[09:47:36] Do you have some test mmap code you've been using?
[09:48:25] A dinky couple c files, yeah. Hang on.
[09:51:22] But I am pretty sure I misspoke, above. There's a missing afs_FakeOpen/Close pair, but not afs_vop_write.
[09:51:34] /afs/zone.mit.edu/user/kaduk/Public/my_{mmap,write}.c
test4 in that directory is a text file that I'd been using them on. I'd just been changing read sizes and locations in the source as desired.
[09:51:44] Cool, thanks.
[09:52:36] The mmap one just reads, and should succeed. (I managed to get in a state where it failed, at some point, but I believe a flushv fixed it.) The write one also tries to write, and claims success but does not actually change anything.
[09:53:50] Ok
[10:04:03] I can see you in sipb.mit.edu, not those files, and not zone.
[10:06:54] Hm. Try athena/user/k/a/kaduk/Public/ ?
[10:10:00] (Which might be easier spelled http://web.mit.edu/kaduk/Public/)
[10:10:30] Afs worked, eventually ;0
[10:11:37] Most of the MIT cells have been renumbered fairly recently as the servers are converted to VMs; I'm not sure if the g.c.o cellservdb has all the updates.
[10:13:22] I had reasonable luck from umich.edu. I do have an older CellServDB which failed on those. Need to update.
[10:16:00] --- deason has become available
[11:04:46] --- deason has left
[11:39:40] I ended up mangling the test programs, but post-mangling, I think basic mmap read and write do work.
[11:40:39] For the reasons we discussed earlier, after a file size change, that is perhaps unlikely to be true.
[11:41:49] Or in some cases, relative to the new length and the boundaries of the mapped region(s).
[11:41:51] Hm. The original my_write.c would change the first character of a file on ufs local disk, but not in afs. (Of course, my code could just have been broken.)
[11:44:10] Not necessarily, but at least working differently with the data gives more expected results.
[11:44:51] Fair enough. Care to post your mangled versions?
[11:45:39] Sure, just a sec
[11:47:24] /afs/umich.edu/user/m/b/mbenj/Public/ben
[12:39:17] --- Russ has become available
[13:35:13] --- mdionne has become available
[13:51:57] There Ben?
[13:52:52] A bit distracted, but yeah.
[13:53:32] I have a change that tries to call vnode_pager_setsize in the three unhandled cases. Wondered if you could make time to test at some point.
[13:54:20] Sure thing. Where's the patch?
[13:55:22] In my git checkout. I can just push it. Derrick won't commit it without beating on it, I presume.
[13:56:18] well, i am known to fuck up testing
[13:56:23] Well, he might. But that doesn't mean we shouldn't.
[13:57:47] I meant, based on prior experience, that's what would happen. I haven't provoked any bad behavior yet.
[13:59:06] Derrick: while you're here, should we be pushing these fbsd fixes to openafs-stable-1_6_x to get picked up for the release?
[14:00:20] probably. but before we do a release i will try to poll and see what looks like it ought to be pushed and make sure someone is on it, or deal myself
[14:03:41] --- mdionne has left
[14:16:39] I've reproduced the vn_lock cwd thing, using a gtar -cf - | gtar -xf - pipeline.
[14:18:38] Not exactly the simplest testcase, but something.
[14:21:44] Admittedly, but it appears commensurate with the behavior.
[14:24:06] Well, the vnode_pager_setsize didn't do anything for my afs_ufswr warning (not surprising). But I can run a binary from afs, now.
[14:24:22] Yes. The two are unrelated.
[14:24:54] the afs_ufswr warning is the exOrW undercount thing?
[14:25:01] Yeah.
[14:25:07] What he meant, yes.
[14:25:26] That's likely to do, again, with lack of an afs_FakeOpen somewhere.
[14:25:32] there's some discussion in the list archive
[14:25:33] yeah
[14:26:07] I think the issue is, in part, the FBSD usage is not exactly what AFS_VM_RDWR_ENV is intended to mean.
[14:26:58] So we don't do afs_FakeOpen.
I think it might be acceptable to omit FBSD (and DFBSD) in that condition, and thus call afs_FakeOpen in the write path. But I'm not sure.
[14:28:57] i think you're probably right
[14:29:09] i'm not sure that's the right way to approach it but it'd work for now
[14:30:15] ok
[14:32:18] Ha!
ar: fatal: Numeric user ID too large
[14:37:06] uh. that's special.
[14:39:22] Well, I am using a daemon principal.
[15:13:23] (Looks like libarchive wants to fit a uid in 6 characters, and mine is 8.)
[15:15:28] Well, that's someone's bug...
[15:16:50] Tim Kientzle, probably. I suppose I'll ask freebsd-current@ ...
[15:51:38] --- Simon Wilkinson has become available
[15:53:40] --- matt has left
[16:06:39] --- matt has become available
[16:36:16] --- Simon Wilkinson has left
[16:37:17] Well, we know there are some sleep/nosleep lock order reversals, and in general it's not an error per se.
[16:39:13] Mostly working by trial and error, also looking over bundled filesystems, I have a locking change that eliminates a recursive lock panic I can produce (i.e., probably freezes, perhaps not fatally, without WITNESS and/or INVARIANTS) with the above cwd gtar | gtar pipeline.
[16:41:28] Also, converts locks on child vnodes returned from afs_vop_lookup to shared. It's not clear to me that we need to do this in the general case at all, but I'm pretty sure this addresses the parallel ls issue that was reported.
[16:42:37] Wheee.
[17:03:25] Try to break it ;)
[17:42:53] --- matt has left
[18:10:18] panic: __lockmgr_args: recursing on non recursive lockmgr afs @ /usr/ports/net/openafs-devel/work/openafs/src/afs/FBSD/osi_vnodeops.c:1165
Oh, hey; maybe I should try it with your patch. :)
[18:29:51] Hm, having taken 2624, I panic very quickly:
panic: Lock afs not locked @ /usr/src/sys/kern/vfs_default.c:504
[18:35:01] --- matt has become available
[18:59:05] Nobody ever checks for LK_EXCLOTHER, it seems ...
[19:09:28] The parallel ls report is not what I thought it was.
[19:19:14] --- matt has left
[19:34:01] --- jaltman has become available
[20:01:35] --- matt has become available
[20:02:51] The first version of 2624 wouldn't panic, I think. But I abandoned that change, I'll resend it when I deal with the deadlock I've run into.
[20:30:51] --- matt has left
[20:47:08] Well, I pulled a dump for both panics, so I'll take a look at some point. Maybe not tonight, though.
[21:07:27] --- jaltman has left: Disconnected
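
Note on the mmap write test mentioned at [11:41:51]: the my_write.c file itself is not reproduced in this log, so the following is only a rough sketch of the kind of check described there (change the first character of a file through a shared mapping, then confirm the change with an ordinary read). The function names mmap_set_first_byte and read_first_byte are hypothetical and are not taken from the original test programs; only standard POSIX calls are used.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from my_write.c): map the file MAP_SHARED and
 * read-write, overwrite its first byte, and flush the page back with msync.
 * Returns 0 on success, -1 on any failure (including an empty file).
 */
int
mmap_set_first_byte(const char *path, char c)
{
    struct stat st;
    char *p;
    int fd, rc = -1;

    assert(path != NULL);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = c;                 /* the store goes through the mapping */
            rc = msync(p, (size_t)st.st_size, MS_SYNC);
            if (munmap(p, (size_t)st.st_size) != 0)
                rc = -1;
        }
    }
    close(fd);
    return rc;
}

/*
 * Hypothetical helper: re-read the first byte with plain read(2), so the
 * result does not depend on the mapping used for the write.
 * Returns the byte value (0..255), or -1 on failure.
 */
int
read_first_byte(const char *path)
{
    unsigned char b;
    ssize_t n;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, &b, 1);
    close(fd);
    return n == 1 ? (int)b : -1;
}
```

Calling mmap_set_first_byte(path, 'J') and then read_first_byte(path) on the same file, once on local disk and once under /afs, would be one way to compare the two behaviours; a mismatch after a reported success would correspond to the "claims success but does not actually change anything" symptom at [09:52:36].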