[03:50:33] --- dev-zero@jabber.org has left: Replaced by new connection [03:50:33] --- dev-zero@jabber.org has become available [04:15:19] /afs/inf.ed.ac.uk/user/s/sxw/openafs-disconnected-dflush.patch is the fix for the 'can't delete recently created files' problem. [05:34:33] --- dev-zero@jabber.org has left [06:18:05] --- Derrick Brashear has left [06:49:09] --- matt has become available [06:49:28] --- dev-zero@jabber.org has become available [07:08:37] --- SecureEndpoints has become available [07:10:23] /edgester: of course. [07:12:35] --- matt has left [07:17:52] --- matt has become available [09:34:13] --- edgester has become available [09:51:28] --- edgester has left: Replaced by new connection [09:51:28] --- edgester has become available [09:55:33] --- Russ has become available [10:13:20] --- dragos.tatulea has become available [10:29:48] --- edgester has left [13:02:52] --- Derrick Brashear has become available [13:15:57] --- dragos.tatulea has left [13:18:25] --- dragos.tatulea has become available [13:22:45] --- dragos.tatulea has left [14:30:42] --- Derrick Brashear has left [15:03:18] --- Derrick Brashear has become available [15:41:45] --- mdionne has become available [15:49:31] Looking at the 1.5 server - issue 1: GetVolumeSummary in vol-salvage.c doesn't do a chdir to the vice partition location. The result is volume header files in the log directory, and a lot of problems as a result. [15:53:07] Yuk. I wonder if that's the cause of the volume creation issue that's in RT. [15:53:43] I haven't looked at the 1.5 server. I'm trying to bend the combination of the Linux VFS layer and the disconnected operations code to my will at the moment. [15:54:09] it's part of it - found a few more things as well [15:54:36] Cool. It's great that someone's look at that... [15:54:41] hm. istr the volume creation issue predated the patch which i bet caused that [15:55:13] the problem with just reverting it, iirc, would be back to "salvager cores in each /vicep?" [15:55:45] Issue 2: VolumeAttach on the volume server sets the inUse flag - the file server then refuses to attach the volume. inUse is not set by the volume server in 1.4 code [15:56:57] can the code build the full path when it needs it, so it can work with the current dir = log dir ? [15:57:31] it doesn't know but i don't see why it can't; the partition struct has the partition path name in it [15:58:26] After fixing those 2 issues, volume creation works, but the vos client still gets a communication error - I see a marshalling error in RX for the end of transaction... even though the call returns success on the server side [16:00:49] conceivably a recent rx change, though before jeff applied them i ran a pthreaded test program [16:02:38] it's not all calls that fail - but AFSVolEndTrans seems to fail consistently [16:03:49] dafs on or off? [16:03:57] off [16:04:02] can you determine which rx routine the marshalling is failing in? [16:04:18] --- matt has left [16:04:41] i remember something weird in the voltrans code ~recently [16:05:14] yes... it's xdr_afs_int32 - marshalling error on the server side, unmarshalling error on the client side [16:05:54] once there's a marshalling error on one side, the other side should be doomed. i guess the call doesn't error fast enough [16:07:25] mdionne - could you be more specific about the volume header files in log directory issue? ie, what steps result in that? [16:09:13] i just tried creating a volume (which failed), then stops and restarts of the server. I noticed the .vol files in the log directory [16:09:30] did you open an RT ticket yet? [16:09:48] at that point the server can't be started [16:09:48] having just looked at rt, no [16:09:53] (I can have the person who produced the chdir() patch take a look at it tomorrow when he gets in) [16:10:01] no RT ticket - thought I'd discuss it here first [16:10:30] no problem, but if you can drop a ticket into place, I can aim the right person at it tomorrow am. [16:10:56] Ok, does the other issue ring a bell? (inUse) [16:11:24] no, but i am looking through cvs logs [16:11:26] iiuc, it's just a lower-level view of the known volume creation problem with 1.5 [16:11:43] (that's a question, not a statement, btw) [16:12:19] let me back on up #2..can you say what steps result in that? [16:12:37] ie, that way i can understand how you got into that situation..and that will help me understand if its a known vs new problem. [16:13:34] it shouldn't be happening from the code i'm reading. the only call to set something inUse appears to be when you're programType == fileServer [16:13:55] VOLINT_INFO_STORE(handle, inUse, 1); [16:14:30] shadow - I haven't looked at that recently, but I'd be surprised if a vos create, vos release, vos dump, vos restore does not set inUse [16:14:47] What about this: V_inUse(vp) = programType; (where programtype != fileserver). this is not in 1.4.8 [16:14:55] (I don't recall the sync mechanisms those use) [16:15:10] 1.5 has a very, very different volume pkg than 1.4.. [16:15:27] so there are lots of things in 1.5 that aren't in 1.4.8 [16:15:38] plausible. [16:16:51] When thsi line is in, the fileserver refuses to attach the volume, claiming it needs to be salvaged. [16:17:23] If i comment it out, the volume creation succeeds and the volume is attached. Only error is the RX error mentioned earlier. [16:17:54] All of this happens at volume creation time BTW. [16:18:06] that's going to be it. VCheckDetah uses it. it shoild presumably be being cleared before the volume is given back to the fileserver, or the detach process changed back [16:18:07] ok, tx. [16:19:48] i suspect dafs does or is supposed to do something to clear inUse when relinquishing control, and doesn't for non-dafs, or doesn't anywhere [16:20:23] mdionne: can you confirm that the marshalling error is taking place through a call to VDetachVolume_r? [16:21:37] posting the full stack trace into pastebin would be useful. [16:21:42] SecureEndpoints: it is [16:26:31] I don't see VDetachVolume_r changing the volume state when DAFS is off [16:27:21] Hmm, actually the marshalling error occurs in _AFSVolEndTrans [16:28:47] DeleteTrans will call VDetachVolume, so this may well still be related [16:29:12] that is the path I was following [16:29:29] Well DetachVolume is called shortly before the error, yes [16:30:08] nothing else work-wise is "done" in EndTrans. if VDetachVolume gets called, hm... [16:31:04] Everything that happens on the server side in EndTrans returns success as far as I can see. Up to the point where it tries to marshall the response. [16:31:36] ObSomethingCompletelyDifferent: should dentry_iput() really be invalidating the vcache? - it seems to get called whenever Linux feels that the dentry cache is too full... [16:43:56] (sorry - didn't mean to cut everyone off ... ) [16:49:19] that's ok, busy here getting the kids ready for bed... :) [16:50:38] BTW the marshalling error in _AFSVolEndTrans doesn't always happen. If I do a "vos rel", several succeed, but hte last one fails. [17:26:18] http://www.pastebin.ca/1312141 is a stack trace at the entry to the xdr_afs_int32() call that fails. Doesn't look very informative... [17:28:46] --- edgester has become available [17:30:50] that is not very informative [17:34:57] -g is your fiend [17:50:10] Hmm, should --enable-debug do the job? Thought I already had it set... [17:57:27] probably. but you can just look at src/config/Makefile.config and change any -O or -Os to -g [17:57:31] and remake [18:07:57] Ok I have symbols, just trying to stop it at the right point... seems like rx_WriteProc32 might be the interesting function to watch [18:11:16] http://www.pastebin.ca/1312173 - although looks like some things are optimized out [18:13:48] yeah for registers. sigh. [18:23:02] --- dev-zero@jabber.org has left [18:23:47] Looks like rxi_WriteProc returns early because this condition is not met: if ((conn->type == RX_SERVER_CONNECTION) && (call->mode == RX_MODE_RECEIVING)) [18:27:04] what is the call mode and type? [18:29:09] Can't print them - "value optimized out"... [18:29:36] remove the "register" declaration in rxi_WriteProc for "call" [18:29:56] or just remove all of them and recompile [18:37:20] ok, that's useful [18:37:48] call->mode = 3, conn->type = 1 [18:41:48] type == RX_SERVER_CONNECTION, mode == ERROR [18:42:04] in other words, something failed prior to this point [18:45:43] Ok, so we'd need to see where RX_MODE_ERROR is being set... I only see 2 spots where RX_MODE_ERROR is used [18:55:38] placing a break point at both locations should be sufficient [19:00:21] http://www.pastebin.ca/1312199 is where it gets set, error is -1. Trying to get symbols compiled for the volserver bits. [19:02:35] ah my old friend DeleteTrans [19:02:35] http://www.pastebin.ca/1312201 is the same, with more symbols [19:03:19] it's gonna be rxi_CallError(tt->rxCallPtr, RX_CALL_DEAD); i assume [19:04:09] is tt->volume non-Null and if so, what is the value of "error" ? [19:04:32] and confirm that line is what i said, in your source. it's return 0 in mine [19:05:02] (i mean, i can't see how it's something other than rxCallPtr CallError) [19:05:40] it is - line numbers don't match because I have debugging code [19:06:13] ok. just wanted to be sure [19:06:54] the deal will be that rxCallPtr is supposed to be cleared if we succeeded. somewhere we returned before it is [19:08:06] tt->volume is set after the call to VDetachVolume, error is 0 [19:08:42] rxCallPtr has a value [19:09:10] yes. like i said, it's not cleared because somewhere we exited with an error and left it set, or there's a bug. was this a vol create, or what? [19:09:25] yes, a vos create [19:09:25] ok. VDetachVolume should probably be clearing tt->rxCallPtr [19:09:39] no. [19:09:50] the thing which set it should be clearing it [19:09:55] ok [19:10:36] note that VolCreateVolume leaves it unset except for around one log message [19:10:54] so some other routine is guilty, and we should find that routine and determine what. and why. [19:11:16] it will be one that calls DeleteTrans [19:11:17] anything in the Log? [19:11:33] no. EndTrans doesn't set it [19:11:41] it will be some earlier routine. [19:12:03] which Log? most of it is my own trace messages [19:12:09] volserlog [19:12:14] should be the last line or so [19:13:05] what about VosSetFlags ? [19:13:12] VolSetFlags [19:13:21] i see it [19:13:31] VolSetFlags should tt->rxCallPtr = (struct rx_call *)0; [19:13:34] like in 1.4 [19:13:40] just before if (TRELE(tt) && !error) [19:14:06] that should fix it [19:14:11] Which file is this in? [19:14:19] volprocs.c [19:14:26] volser/volprocs.c [19:15:02] probably around like 1071 [19:15:13] or on it if my source matches what's in cvs now [19:15:18] Ok I see it. Will try that. [19:17:01] that is the only function that I see that is failing to reset rxCallPtr [19:17:24] Bingo! vos create and vos rel now work [19:17:30] excellent [19:17:45] derrick, do you want to commit the fix? [19:17:49] now i can delete those 50 test volumes... [19:19:31] i already commited the fix [19:19:54] i am not going to touch the chdir issue tonight, since whoever contributed the change is going to look at it [19:19:55] so you did [19:20:19] Thanks - BTW for volume creation I also had the inUse change in place to get it working. I created an RT ticket for that [19:20:22] mdionne: do you have a patch for the chdir issue? [19:21:07] I'm working on a real fix for the inUse flag issue right now. Setting inUse in volser is by-design. The idea is to catch and demand salvage volumes which were in-use by other daemons during a crash [19:21:16] No, I wasn't sure if just adding a chdir was the correct way to go. [19:21:28] k [19:22:28] I also created an RT ticket for the chdir issue - I can attach the patch that works for me. [19:23:06] please do [19:24:20] will do. thanks guys for the help - another baby step to getting 1.5 into shape [19:24:54] thank you for taking the time to track this down [19:25:09] these bugs have been sitting in cvs for far too long [19:27:06] sure, but no demand-salavge in non-dafs [19:29:42] it's broken for both dafs and non-dafs (just more visible on non-dafs). i'm working on a general fix that addresses both cases. if you want the whole inuse flagging for other server types conditionalized to dafs only, we could go back and do that [19:31:56] I would like to see the general fix that you will propose but without seeing the details I am leaning towards conditionalizing the inUse flagging to DAFS only [19:32:27] --- Derrick Brashear has left [19:32:40] My hunch is conditionalizing it is the lowest risk solution [19:33:03] I am sure that is the case [19:33:52] --- stevenjenkins has left [19:34:11] --- mdionne has left [19:37:46] --- stevenjenkins has become available [19:57:23] --- Derrick Brashear has become available [20:05:19] --- SecureEndpoints has left [20:15:53] --- SecureEndpoints has become available [20:29:26] --- edgester has left [20:41:48] --- dev-zero@jabber.org has become available [22:24:10] --- dev-zero@jabber.org has left [22:36:53] --- dragos.tatulea has become available [22:37:03] --- dragos.tatulea has left [23:12:33] --- dev-zero@jabber.org has become available [23:14:24] --- reuteras has become available