[00:48:11] --- Simon Wilkinson has left
[00:50:40] --- Russ has become available
[00:56:51] I found something strange in afs_BlackListOnce(): the calculation of "serversleft" never checks tvp->serverHost[i] != NULL, so I think the function will always return 1.
[00:57:38] afs_analyze.c line 496
[01:24:18] * haba compiling (trying to patch the above)
[01:42:40] But it did not help, so more is broken :(
[02:05:23] --- Russ has left: Disconnected
[05:20:13] well, don't worry. before pre3 hits i will either get a patch from someone, fix it myself (and verify in either case; though i've been running with this for months... uh) or pull it out (which would be a shame since the "never hang forever" code depends on it)
[05:28:12] --- mmeffie has become available
[05:29:52] shadow: I think problems only show up if your client is "fast enough".
[05:30:24] this client is plenty fast
[05:30:52] fstrace slows it down enough that it will not trigger the bug.
[05:31:49] i don't leave fstrace on
[05:32:12] (just as an example)
[05:32:14] so the odd thing is that no code review (and at least 4 people did one) caught any issues
[05:32:22] which, well, sigh.
[05:33:14] ehm, that afs_BlackListOnce() was definitely broken. Does my patch make it better? I think so :)
[05:33:50] well, no argument it was broken
[05:34:02] it just makes me sad that i fucked up and no one noticed
[05:36:09] Now I added an afs_warnuser("AFS BlackListOnce: Serversleft %d for busy volume %u in cell %s\n", serversleft, (afid ? afid->Fid.Volume : 0), tsp->cell->cellName); in afs_Analyze if (acode == RX_CALL_TIMEOUT).... line 679, but it never got printed .....
[05:36:25] So the "Connection timed out" must come from another code path.
[05:36:27] Sigh.
[05:37:46] (I added it just before the goto out)
[05:37:58] yes, i'd imagine so.
[05:38:45] hang on.
let me bring up the complete 1.4.7->1.4.8 diff in front of me
[05:38:46] And it is not hitting the other afs_BlackListOnce() invocation either, because there I would get afs: Tokens for user of AFS id %d for cell %s: rxkad error=%d.......
[05:41:02] I have to run, but probably online again in a few hours.
[05:41:08] --- haba has left
[05:41:41] ok
[05:41:54] i have a little bit, i look at it now
[06:03:12] --- summatusmentis has left
[06:07:28] --- haba has become available
[06:46:14] --- reuteras has left
[07:01:53] Anything I can test? I'm out of new leads now.
[07:02:20] not yet
[07:18:02] harald, still here?
[07:18:11] yes
[07:18:29] got "sob" somewhere useful for me to grab from?
[07:19:15] a review of the cm says the only change which could affect anything is the blacklist once change
[07:19:16] http://www.pdc.kth.se/~pek/sob/sob.c (gcc -O2 -o sob sob.c)
[07:19:29] invoke as?
[07:19:45] but i need to look at rx briefly also
[07:20:12] ./sob -n 30000 -o 1000 -s 1k -b 1k -w (it makes dir.0 to dir.29 containing files)
[07:20:48] ./sob -h
    Options :
      -b bs      Block size
      -h         This listing
      -R iter    Randomly read iter files
      -n nfiles  Number of files to read/write
      -o fanout  Number of files per directory
      -r         Read files
      -s size    File size
      -t         Terse, output ":" separated fields
      -v         Program version
      -V         Verify files
      -w         Write files
[07:21:00] yeah, i just wanted to know your usage case.
[07:21:47] suggested patch for you to try while i am doing this: remove the line in afs_conn.c which does: rx_SetConnIdleDeadTime(tc->id, afs_rx_idledead);
[07:21:52] i see nothing else relevant
[07:21:57] My guess is the more files you make and the smaller they are, the sooner it goes boom.
[07:21:59] (changed that could cause anything)
[07:22:16] in the meantime, i will update the module in my vm and run sob
[07:22:56] i don't anticipate it will change anything
[07:22:57] the folks in germany ran a recursive grep over many files. sob has a read flag, too.
[07:25:05] sob is the second version of a program called boom. son of boom = sob :)
[07:27:38] --- SecureEndpoints has left: Disconnected
[07:53:21] --- SecureEndpoints has become available
[07:55:04] Nope (without rx_SetConnIdleDeadTime()): Failed to create file testfile.27871 : Connection timed out
[07:57:09] ok.
[07:57:20] then hang on, i am reading code
[08:04:54] --- dev-zero@jabber.org has become available
[08:15:57] harald, your patch removes any use of serversleft in Analyze, but leaves the variable.
[08:16:41] oh, did I forget to remove the variable declaration?
[08:16:59] That should of course be gone, too.
[08:17:42] it also looks like, aside from the removal of the !, it does not do much.
[08:18:13] like, you restructured to eliminate a variable, and returned a number of servers instead of simply 1 (or 0).
[08:18:29] all of which is cosmetic
[08:18:43] No, the last loop is different.
[08:19:29] it doesn't break, and keeps counting. what other difference is there?
[08:19:57] oh, the tvp->serverHost[i] part
[08:20:09] ok. but why the extraneous editing beyond that?
[08:20:45] given that nothing cares if serversleft is 1 or N (other than N==0) why do extra loops, for instance?
[08:21:31] Say you have 3 hosts, zero to 2, which have status areq->skipserver[i] != 0; without my patch it will always set serversleft = 1 when getting to skipserver[3]
[08:22:35] I think 10 extra loop iterations are a small price to pay in that code path when the function returns the exact number, which can be useful in the future. It's not a critical code path, is it?
[08:22:48] not really. fine, fair enough
[08:23:18] anyway, let me tweak one more place in the code, and run this
[08:24:04] I first wanted to move serversleft into the inner braces to minimize confusion but then removed it completely from afs_Analyze.
[08:24:38] I have to be at a birthday party in ~30-40min, so I won't be online much longer.
[08:24:41] ok
[08:24:54] actually, i have one more thing for you to try
[08:25:31] in afs_callback.c change dataBuffP[0] = CAPABILITY_ERRORTRANS to dataBuffP[0] = 0
[08:27:50] won't make the test before I leave I'm afraid.
[08:28:14] damn
[08:28:31] i'd say make the test zwrite its result but i guess no
[08:28:33] not
[08:31:36] my build system is a bit slow :(
[08:32:04] But I'm trying to make a run
[08:39:06] test running
[08:39:23] Failed to create file testfile.7474 : Connection timed out
[08:39:34] OK, that was not it.
[08:39:35] ok.
[08:39:40] Gotta run!
[08:39:45] well, i have another patch. i will hope i can produce it. enjoy party
[10:22:18] --- summatusmentis has become available
[12:16:01] --- summatusmentis has left
[12:21:51] --- summatusmentis has become available
[13:43:19] --- summatusmentis has left
[17:29:15] --- Simon Wilkinson has become available
[17:59:17] --- Russ has become available
[18:37:03] --- Simon Wilkinson has left
[20:19:22] --- SecureEndpoints has left
[20:50:36] --- summatusmentis has become available
[21:13:06] --- mmeffie has left
[21:16:59] --- SecureEndpoints has become available
[21:56:27] --- dwbotsch has left
[21:56:42] --- RedZBear has left
[21:58:53] --- dwbotsch has become available
[22:26:17] --- summatusmentis has left
[22:57:34] --- reuteras has become available
[23:08:49] --- thomas.kula@gmail.com has left
[23:08:52] --- thomas.kula@gmail.com has become available
[23:10:55] --- thomas.kula@gmail.com has left
[23:19:21] --- thomas.kula@gmail.com has become available
[23:29:23] --- thomas.kula@gmail.com has left
[23:38:12] --- thomas.kula@gmail.com has become available
[23:59:38] --- Russ has left: Disconnected
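
Note on the serversleft discussion (00:56:51 and 08:15:57 to 08:22:35): the counting problem described there can be illustrated with a small standalone sketch in C. This is not the OpenAFS code. The struct names, field names, function names, and the table size SKETCH_MAXHOSTS are all hypothetical stand-ins chosen for illustration; only the idea (an empty host slot whose skip flag is 0 gets mistaken for a usable server) comes from the conversation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table size for this sketch only. */
#define SKETCH_MAXHOSTS 13

/* Hypothetical stand-ins for the server, volume and request records. */
struct sketch_server  { int id; };
struct sketch_volume  { struct sketch_server *server_host[SKETCH_MAXHOSTS]; };
struct sketch_request { unsigned char skip_server[SKETCH_MAXHOSTS]; };

/*
 * Unchecked variant: looks only at the skip flags. Any slot whose flag
 * is 0 counts as "a server is left", even if no host occupies the slot.
 * Returns 1 or 0.
 */
int servers_left_unchecked(const struct sketch_request *req)
{
    assert(req != NULL);
    for (int i = 0; i < SKETCH_MAXHOSTS; i++) {
        if (req->skip_server[i] == 0)
            return 1;
    }
    return 0;
}

/*
 * Checked variant: a slot counts only if it holds a host AND that host
 * is not marked to be skipped. Returns the exact number of such slots.
 */
int servers_left_checked(const struct sketch_volume *vol,
                         const struct sketch_request *req)
{
    assert(vol != NULL && req != NULL);
    int left = 0;
    for (int i = 0; i < SKETCH_MAXHOSTS; i++) {
        if (vol->server_host[i] != NULL && req->skip_server[i] == 0)
            left++;
    }
    return left;
}
```

With three hosts in slots 0 to 2 and all three marked as skipped, the unchecked variant still reports a server left because slot 3 is empty with a skip flag of 0, while the checked variant reports none. Returning the exact count rather than 1 or 0 costs a full pass over the table, which is the trade-off raised at 08:20:45 and 08:22:35.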