[00:28:22] --- Russ has left: Disconnected
[00:32:37] --- kula has left
[01:05:32] --- haba has become available
[02:45:19] --- jaltman has left: Replaced by new connection
[02:45:20] --- jaltman has become available
[02:59:17] --- dev-zero@jabber.org has become available
[02:59:26] --- dev-zero@jabber.org has left: offline
[03:00:28] --- kula has become available
[03:52:30] --- abo has become available
[03:54:30] --- abo has left
[03:54:53] --- abo has become available
[03:54:59] --- abo has left
[04:10:30] --- abo has become available
[04:40:33] --- Jeffrey Altman has become available
[04:40:56] --- Jeffrey Altman has left
[06:49:13] --- deason has become available
[07:01:58] --- meffie has become available
[08:07:51] --- reuteras has left
[08:53:35] --- abo has left
[08:54:24] --- abo has become available
[09:01:35] --- haba has left
[09:02:03] --- meffie has left
[09:03:25] --- abo has left
[09:03:58] --- abo has become available
[09:07:45] --- kaj has left
[09:08:40] --- deason has left
[09:14:32] --- deason has become available
[11:19:46] --- Russ has become available
[12:21:35] Well, looks like IRC is melting down ...
[12:21:42] that's one way to put it
[12:21:53] i set irc to filtered in owl
[12:44:15] --- jaltman has left: Replaced by new connection
[12:44:16] --- jaltman has become available
[13:26:13] --- bpoliakoff has become available
[13:27:12] --- meffie has become available
[13:27:46] --- meffie has left
[13:35:37] --- kaj has become available
[13:36:09] --- bpoliakoff has left
[13:42:55] --- kaj has left
[13:42:57] --- kaj has become available
[14:12:19] --- jaltman has left: Disconnected
[14:13:49] --- jaltman has become available
[14:57:00] Has anyone tested the Master cache manager recently - I've got a build that's giving me EIOs in afs_lookup (checkCode 19) on a dynroot /afs.
[14:57:05] Can't quite work out what's going on ...
[15:02:39] EIO is the error that was passed to afs_CheckCode, or the output?
[15:03:43] The output, although I suspect it's also the one on the way in too. Adding some instrumentation, and backing out some local changes at the moment to check what's actually on master.
[15:05:08] Are you using the tracing?
[15:05:41] Wherever you got 19 should also tell you the code on the way in
[15:06:53] > /* sub-block just to reduce stack usage */ If that has any effect, the optimizer sucks
[15:08:08] --- phalenor has left
[15:08:53] It may be something I've broken. That module has a rewritten pioctl interface.
[15:11:09] Tracing doesn't help hugely, sadly:
[15:11:10] time 878.864247, pid 15201: Access vp 0xd1754000 mode 0x40 len (0x0, 0x4000)
time 878.864266, pid 15201: Getdcache vp 0xd1754000 failed to find chunk 0x0
time 878.864270, pid 15201: GetdCache vp 0xd1754000 dcache 0xe0c4be78 dcache low-version 0xffffffff, vcache low-version 0xc4
time 878.864271, pid 15201: GetdCache tlen 0x0 flags 0x1 abyte (0x0, 0x0) Position (0x0, 0x0)
time 878.864301, pid 15201: FetchProc vp 0xd1754000 fid (1:1.1.1) pos (0x0, 0x0) size 0x3b9ac9ff
time 878.864313, pid 15201: Returning code 5 from 19
[15:14:07] --- kaj has left
[15:17:12] --- phalenor has become available
[15:21:16] "Returning code 5 from 19" means 5 was the code sent _in_ to afs_CheckCode. It does the trace before any transformations.
[15:21:59] I'm pretty sure the code is coming because GetDCache fails. But, beyond that, I'm bisecting to find where it broke...
[15:22:55] Yeah, I just came to that conclusion as well.
[15:27:25] Assuming afs_IsDynroot() is not broken, the only way for GetDCache to get as far as the FETCHPROC trace and then fail is for afs_CFileWrite to return something other than the size it was asked to write.
[15:27:58] Yeh. I've just put some debugging around there.
[15:28:57] --- deason has left
[15:29:23] And for those watching, the size in question is _after_ it's been clipped to the size of the dynroot directory, not the absurdly large size shown in the trace.
[15:30:25] Well, that's the problem. Not sure why yet, but we're definitely doing a short write.
[15:33:51] Which cache type?
[15:33:55] disk.
[15:34:10] We're getting a length of 0 back from CFileWrite, so it's off to that proc next
[15:34:20] That's really afs_osi_Write
[15:34:30] Yeh.
[15:34:54] linux?
[15:34:57] Indeed
[15:36:37] --- RedBear has left
[15:36:46] --- dwbotsch has become available
[15:41:29] Well, the I/O could actually be failing
[15:41:48] Yeh. Just drilling down to that point at the moment.
[15:42:03] My first thought was SELinux, but that's now disabled (and wasn't logging anything anyway)
[15:47:37] --- mdionne has become available
[15:50:28] Okay, so it's my mistake.
[15:50:39] how so?
[15:50:46] Change 7a5cee30cc5f0e6d5780387633ce2b46608fd5fb is lacking some very important braces.
[15:51:29] Basically means that if the filesystem has an llseek operator, then we will _never_ write data, and always return 0.
[15:51:41] Shame we shipped a 1.5 release 17 hours ago :(
[15:54:44] \me hangs head in shame.
[15:57:12] Oh. Oops. I totally missed that.
[15:57:31] Yeh. Having fixed that, we now oops seconds later.
[15:57:47] --- jaltman has left: Disconnected
[15:57:52] Well, at least it's a new failure. :-)
[15:58:07] Indeed.
[15:58:27] --- mdionne has left
[15:58:49] --- mdionne has become available
[16:02:11] works OK here after fixing that brace
[16:02:27] It may be that my test machine is just fried.
[16:02:48] Rebooting now to check that. I'll push a patch fixing the brace problem.
[16:04:21] I don't think we actually _need_ that llseek, given that the Linux read and write operations take a position.
[16:04:29] But that's a thought for later.
[16:06:20] 1194
[16:08:33] marked as verified - fixes the breakage here
[16:08:58] Cool. Thanks!
[16:09:22] While you're around - I don't know if you've seen the fallout from the last SELinux change - the one we made for AppArmor?
[16:11:23] Yeah I saw that thread - but haven't looked much into it yet. Would it be there even without the change we made for credentials?
[16:12:16] Without the change to use afsd's credentials, we're using the user's. On Fedora, at least, a user can read unlabelled files, so it's not a problem.
[16:13:41] I guess my question is, is an anonymous dentry in itself a problem, or is it only an issue in the case where we're not using the user's credentials, and we have a security module..
[16:13:57] The latter, I think.
[16:14:51] From a quick read of nfsd's code, I think the dcaches it ends up using are still anonymous, and still don't have security information in them.
[16:15:16] Ok, so that explains why we haven't noticed before.
[16:16:06] What I want to try at some point is to set up SELinux so that nfsd is running with limited privileges, and see what happens.
[16:17:02] Oh, and the other reason we haven't noticed, is that you have to stress the system enough that it purges dcaches from its cache - otherwise d_obtain_alias will return the current 'normal' dcache, rather than an anonymous one.
[16:17:52] my usual test system has tons of memory, so I probably rarely get to that point
[16:18:30] so you suspect nfsd may have the same issue?
[16:18:46] I think so. If it does, then I suspect we can get it fixed in Linux.
[16:19:34] Hmmm. Even with a completely fresh build, and a clean cache, I'm still segfaulting. Bah !
[16:19:57] BTW when I tested it a few years ago, opening files by path had issues with most filesystems other than ext2, ext3
[16:20:09] In terms of locking?
[16:20:35] --- abo has left
[16:20:41] I assume they were locking issues - ran OK under light load, would oops under load
[16:20:56] --- abo has become available
[16:21:10] It wouldn't surprise me, Linux has had issues for a long time with stackable file systems.
[16:22:06] so I think that going to opening by path would need some good testing
[16:22:14] Yes.
[16:22:28] The problem we have now is that we don't work reliably "out-of-the-box" on Fedora.
[16:25:09] Hmm, fedora might be OK with using the user's credentials for the cache files? Were all the bug reports coming from Ubuntu + App Armor?
[16:25:38] Yes. SELinux could have the same problem, it depends on how it's configured.
[16:25:57] but is it ok with the default policy
[16:26:13] Wait, what do we do now? Doesn't using the user's credentials have the problem that the user does not have access to the files?
[16:26:14] I don't know. SELinux policies give me brain freeze.
[16:26:39] jhutz: Because we come in at the dentry point, we bypass the normal 'open' checks.
[16:27:04] What we did historically was to use the credentials of the current process.
[16:27:15] I think the only problem cases were ubuntu users with apparmor policies on evince
[16:27:31] hm, ok
[16:28:00] I guess we could pull the patch from upstream, and ask Russ to just carry it on Debian until we have a better solution.
[16:28:26] or make it an option somehow
[16:34:42] --- deason has become available
[16:46:00] I guess we should speak to David, too, and see how fscache handles this problem.
[16:46:15] good idea
[16:52:55] Okay, we definitely have other problems on Linux.
[16:53:29] git master, with the brace fix, completely clean build, on a clean VM, segfaults whilst decoding an XDR response.
[16:55:48] Backtrace at http://pastebin.com/m2450b19c if anyone cares ...
[16:56:02] xdr_free
[16:56:20] no issues here BTW - completed a few kernel builds out of AFS with current master
[16:56:45] xdr_free isn't used in kernel though.
[16:57:29] Oh, no. It is
[16:57:37] afs_xdr_free is xdr_free
[16:58:41] Only on AFS_AMD64_LINUX24_ENV and Darwin.
[16:59:02] what are you guys on?
[16:59:11] i'm gonna guess one of you is 64 bit
[16:59:18] I'm not.
[16:59:24] (well, not on that VM)
[16:59:28] I suspect Marc is.
[16:59:38] I suspect we're using the kernel's xdr_free(), and not our own.
[16:59:47] well, remember i had that bizarre issue with xdr_free's odd prototype.
[17:00:02] Yeh, with the xdrproc_t thing.
[17:00:03] yeah. so where are we not getting xdr_free's redefinition that we should
[17:09:40] I wonder if this is a calling convention thing.
[17:10:46] yes I'm on 64-bit BTW
[17:11:15] In particular, we're defining xdrproc_t as taking a variable set of arguments, and then expecting to be able to call a normal function like that.
[17:11:37] Works in userspace, but given Linux's odd calling convention, I wonder if that's tripping us up.
[17:11:58] --- abo has left
[17:12:58] --- abo has become available
[17:13:57] In fact, if I were a betting man ...
[17:14:45] that seems plausible
[17:16:02] Well, it completely explains why xdr_char is getting 0 when called as a helper from xdr_vector
[17:16:13] Just waiting for my VM to recover...
[17:21:36] Yup. That fixed it. I'll think about what the correct patch should look like tomorrow.
[17:24:24] Could someone attack the website to suggest that people on Linux should perhaps avoid 1.5.71?
[17:37:21] done
[19:13:50] --- jaltman has become available
[19:18:53] --- mdionne has left
[20:14:39] revisiting the point about apparmor, could we do something hideous to call (security_)d_instantiate on anonymous dcaches if afsd -something happened, and let people take their own risk? i mean, we could, but should we...
[21:38:29] --- deason has left
[21:58:20] --- kaj has become available
[22:17:51] --- reuteras has become available
[22:34:49] --- kaj has left
[23:07:03] --- Russ has left: Disconnected
[23:26:26] --- haba has become available
[23:42:18] --- kaj has become available
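
Note on the 15:50 brace bug: the log does not quote the offending code, so the following is only a hypothetical sketch of how a missing pair of braces can produce the symptom described ("if the filesystem has an llseek operator, then we will _never_ write data, and always return 0"). Every name here (fake_file, fake_llseek, fake_write, write_at_buggy, write_at_fixed) is invented for illustration; none of it is taken from OpenAFS or the Linux kernel, and the real change may have looked quite different.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a cache file; not an OpenAFS or Linux type. */
struct fake_file {
    int has_llseek;   /* nonzero if the "filesystem" offers a seek operation */
    long pos;         /* current file position */
    char data[64];    /* backing storage */
    int seek_errors;  /* count of failed seeks */
};

/* Seek to pos; return the new position, or -1 if pos is out of range. */
static long fake_llseek(struct fake_file *f, long pos)
{
    if (pos < 0 || pos > (long)sizeof(f->data))
        return -1;
    f->pos = pos;
    return pos;
}

/* Write len bytes at the current position; return the bytes written. */
static long fake_write(struct fake_file *f, const char *buf, long len)
{
    if (len < 0 || f->pos < 0 || f->pos + len > (long)sizeof(f->data))
        return 0;
    memcpy(f->data + f->pos, buf, (size_t)len);
    f->pos += len;
    return len;
}

/* Buggy: the inner if has a two-statement body but no braces, so the
 * "return 0" runs whenever has_llseek is set, even when the seek succeeds. */
static long write_at_buggy(struct fake_file *f, const char *buf, long len, long pos)
{
    if (f->has_llseek) {
        if (fake_llseek(f, pos) != pos)
            f->seek_errors++;
            return 0;   /* misleading indentation: not guarded by the inner if */
    } else {
        f->pos = pos;
    }
    return fake_write(f, buf, len);
}

/* Fixed: braces make the early return depend on the seek actually failing. */
static long write_at_fixed(struct fake_file *f, const char *buf, long len, long pos)
{
    if (f->has_llseek) {
        if (fake_llseek(f, pos) != pos) {
            f->seek_errors++;
            return 0;
        }
    } else {
        f->pos = pos;
    }
    return fake_write(f, buf, len);
}
```

With has_llseek set, write_at_buggy reports a zero-length write without touching the data, which is the kind of short write the caller in the log tripped over; write_at_fixed writes the requested bytes. Recent GCC versions can flag this pattern with -Wmisleading-indentation.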