[00:23:13] --- kaj has become available
[01:11:41] --- Simon Wilkinson has become available
[02:00:41] --- haba has become available
[03:12:52] --- Simon Wilkinson has left
[04:06:11] --- abo has left
[05:08:10] --- haba has left
[05:08:11] --- haba has become available
[05:15:50] * haba uses wireshark and lets it suggest how to filter by clicking on the packet content.
[05:25:22] --- Simon Wilkinson has become available
[05:28:11] if you have an ebook reader and want to learn more about git, get this: http://progit.org/2010/05/17/progit-for-the-ipad.html
[05:28:26] (e.g. if you will be sending us patches, this can help)
[05:29:58] wireshark filters in two places. If the kernel is dropping packets, then its usermode filters (the clever ones) won't help you.
[06:03:46] --- abo has become available
[06:53:11] --- haba has left
[06:53:11] --- haba has become available
[06:55:10] --- deason has become available
[07:26:01] --- Simon Wilkinson has left
[07:32:30] And in fact, wireshark's usermode filters are clever enough to understand rx and can do things like show you only aborts. For that matter, you can click on a packet containing an RPC request and it will tell you where the reply to that RPC starts, or where the abort is.
[07:33:24] The kernel-mode filter, which is the same thing tcpdump uses, does not understand rx, but it does know how to match specific values of bitfields in specific places in the packet, and the protocol is simple enough that you could use that to write a filter that only matches aborts.
[07:45:46] --- matt has become available
[07:48:40] --- Simon Wilkinson has become available
[07:53:11] --- reuteras has left
[08:15:12] --- Simon Wilkinson has left
[08:30:22] --- shadow@gmail.com/owlD5B4E913 has left
[08:31:38] That sounds like effort, though. The zone cell is just sitting there asking to be used for testing.
[08:31:40] from an rx dump I have lying around...
looks like you're looking at byte 0x3e (offset 0x14 from the start of the RX packet); which should be 0x04 if it's an abort packet
[08:35:32] something like 'ether[0x3e] == 4' or 'udp[0x1c] == 4', I think
[08:39:20] --- shadow@gmail.com/owl536A9DCC has become available
[08:40:18] yeah, udp[0x1c] == 4 seems to work with tcpdump
[08:48:48] --- Simon Wilkinson has become available
[09:05:40] --- jaltman has left: Disconnected
[09:25:54] --- kaj has left
[09:52:33] --- meffie has become available
[10:03:59] --- Simon Wilkinson has left
[10:04:06] --- Simon Wilkinson has become available
[10:09:25] --- Russ has become available
[10:14:20] --- phalenor has left
[10:14:31] --- jaltman has become available
[10:17:20] --- phalenor has become available
[10:17:28] --- Simon Wilkinson has left
[10:17:49] --- Simon Wilkinson has become available
[10:18:01] --- phalenor has left: Lost connection
[10:18:01] --- kula has left: Lost connection
[10:18:01] --- dwbotsch has left: Lost connection
[10:18:01] --- steven.jenkins has left: Lost connection
[10:18:01] --- shadow@gmail.com/owl536A9DCC has left: Lost connection
[10:18:13] --- dwbotsch has become available
[10:19:20] --- steven.jenkins has become available
[10:19:33] --- kula has become available
[10:20:20] --- shadow@gmail.com/owl536A9DCC has become available
[10:20:22] --- phalenor has become available
[10:22:18] --- jaltman has left: Replaced by new connection
[10:22:19] --- jaltman has become available
[10:23:51] --- Simon Wilkinson has left
[10:31:51] http://bugs.debian.org/582111 is interesting and seems to imply that we're not exercising proper discipline in bosserver on detaching from our tty.
[10:33:28] Hm, but we're calling daemon().
[10:33:29] it calls daemon(1,0)?
[10:33:32] yeah
[10:33:44] Okay, I think this person is doing something weird.
[10:33:49] Let me ask.
[10:34:27] Although why is it killing a ton of processes -- oh, those are all the file server threads. Hm.
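Going back to the filter suggested at 08:35: the two offsets can be cross-checked with a little arithmetic. What follows is a minimal C sketch, assuming a 14-byte Ethernet header and a 20-byte IPv4 header without options, which is what makes ether[0x3e] and udp[0x1c] point at the same byte. The constant and function names are invented for illustration and are not part of rx or tcpdump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header sizes and offsets; the names are hypothetical, the values come
 * from the discussion above plus the usual Ethernet/IPv4/UDP header sizes. */
enum {
    ETHER_HEADER_LEN = 14,    /* assumed: plain Ethernet, no VLAN tag */
    IPV4_HEADER_LEN  = 20,    /* assumed: IPv4 without options */
    UDP_HEADER_LEN   = 8,     /* fixed UDP header size */
    RX_TYPE_OFFSET   = 0x14,  /* type byte, counted from the start of the rx packet */
    RX_TYPE_ABORT    = 4,     /* value quoted above for an abort packet */

    /* Where the type byte sits relative to each capture-filter base. */
    UDP_ABORT_OFFSET   = UDP_HEADER_LEN + RX_TYPE_OFFSET,
    ETHER_ABORT_OFFSET = ETHER_HEADER_LEN + IPV4_HEADER_LEN + UDP_ABORT_OFFSET
};

/*
 * Return 1 if a buffer that starts at the UDP header carries an rx abort,
 * 0 otherwise (including when the buffer is too short to hold the type
 * byte). This mirrors what the filter udp[0x1c] == 4 tests.
 */
int
udp_payload_is_rx_abort(const uint8_t *udp, size_t len)
{
    assert(udp != NULL);
    if (len <= (size_t)UDP_ABORT_OFFSET)
        return 0;
    return udp[UDP_ABORT_OFFSET] == RX_TYPE_ABORT;
}
```

With these assumptions UDP_ABORT_OFFSET works out to 0x1c and ETHER_ABORT_OFFSET to 0x3e, matching both filters. On a link with VLAN tags or IP options the ether[] form would point at the wrong byte, which is one reason to prefer the udp[] form.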
[10:34:30] also, I'm inferring that the bosserver didn't die there; just the fileserver
[10:34:34] Yeah.
[10:34:38] So is the file server doing something odd?
[10:34:43] Why would it have a tty open?
[10:35:26] it used it, it was a bad behavior it had. i thought we fixed it.
[10:35:38] it used fputs to send the license message there
[10:35:50] the if 0'd "government rights" thing. the code is still there
[10:35:59] Ah, indeed, my file server has /dev/console open.
[10:37:02] #ifndef AFS_QUIETFS_ENV
console = afs_fopen("/dev/console", "w");
#endif
[10:37:48] Ah, which is only defined for NT.
[10:38:00] it has console open to print the "fileserver has started at X"-like messages
[10:38:19] Yeah, this is the wrong tool.
[10:38:20] yes. we should probably *at least* provide a switch to turn that off. maybe to turn it *on*
[10:38:27] That's what syslog is for.
[10:38:46] sites that want it might want it. i'd suggest killing it in 1.5 and providing a switch to kill it in 1.4
[10:38:58] Yeah, that's probably the right move.
[10:39:43] and /dev/console can go to a tty that's killed by killing x?
[10:39:57] In fact will always do so on a non-server.
[10:40:13] If you start the file server from an xterm.
[10:40:26] Probably would be okay if it started at boot before X started, although it may depend on ordering.
[10:41:58] So the next question is: do we want to make this code go away completely, or replace it with openlog(LOG_DAEMON) and syslog calls? I'm leaning towards the latter.
[10:42:10] For 1.5.
[10:42:47] a patch implementing the former is in gerrit as 1986
[10:43:25] and.... what happens if you use syslog for logging in place of FileLog? won't trying to openlog twice be.... not good?
[10:43:34] Ah, we don't lose any information that we're not printing out elsewhere through ViceLog.
[10:43:34] see e.g. util/serverLog.c
[10:43:37] Yeah, kill it with fire.
[10:43:51] well, we do lose like one case. look at 1986
[10:45:07] Ah, well, sort of.
You lose the explicit "was started" message with a time, but we log a message to ViceLog immediately afterwards which will get timestamped, so we don't really lose anything.
[10:45:07] --- haba has left
[10:45:10] But anyway, that looks good.
[10:45:36] --- Kelli Ireland has become available
[10:47:01] snerk. turn that off and for macos use growl, for linux/freebsd use dbus
[10:51:56] Do we already have a dbus dependency?
[10:53:40] No.
[10:58:38] no dbus yet
[10:59:16] do we never free the events we allocate from afs_getevent / afs_addevent?
[10:59:43] dbus has some unfortunate problems, but it's probably the best there is right now.
[10:59:47] Hopefully someday someone will fix them.
[11:00:05] typically events are "forever"
[11:00:20] Mostly, everything using dbus tends to lose its brains if the dbus daemon has to be restarted (if, for instance, you upgraded the dbus package) and can't recover without restarting.
[11:00:28] yeah, but if we don't free on shutdown at least, I get yelled at for not all blocks being free
[11:01:15] yeah, true. we should have a function to do that there
[11:01:36] I thought something got freed, at least. Certainly there was an obvious place to destroy the event mutex when I wrote that patch.
[11:01:58] depends which platform, ben
[11:02:11] Ah.
[11:02:12] are the events similar enough on every platform to just walk the event hash and free them? or does it need to be written for each platform?
[11:02:31] afs_getevent is not global. each platform will at least need to be looked at
[11:03:09] presumably afs_evhasht[] gets walked and freed for solaris, irix, linux, fbsd, darwin, aix.
[11:03:19] not hpux or the other 2 bsds
[11:03:27] yeah, I was looking at them....
so far they all look very very similar
[11:03:31] ah, okay
[11:03:38] this code should be global and i was going to start merging at some point
[11:03:44] aix at least seems to use a different allocator, though
[11:03:48] --- abo has left
[11:03:58] so does linux, come to think of it....
[11:04:42] --- abo has become available
[11:05:15] --- Kelli Ireland has left
[11:05:52] but not urgent; I was just in the neighborhood so was looking around
[11:07:52] I note without comment that the event handling is the biggest outstanding issue with fbsd that I know about.
[11:15:40] ben, i doubt the event handling is even close to the top of freebsd's problems. ahem vnode locking ahem
[11:16:14] anyway, back in a bit. i need to take the bike cart downtown and pick up some fabulous prizes.
[11:17:57] Ooh, bike cart. I should get one of those.
[11:18:16] So you think it's vnode locking that makes us unusable on multiprocessor machines?
[11:58:26] --- jaltman has left: Disconnected
[12:16:46] --- jaltman has become available
[12:25:37] --- Kelli Ireland has become available
[12:28:05] > So you think it's vnode locking that makes us unusable on multiprocessor machines?
[12:28:25] no. just that events aren't the biggest problem we have
[12:31:23] and as to bike cart, mine's a castoff from when a friend's kids got too big to be carted around
[12:36:25] --- haba has become available
[12:52:26] --- haba has left
[13:25:43] http://gerrit.openafs.org/1988 cherry-picks “Linux: replace invalidate_inode_pages” onto openafs-stable-1_4_x. This is needed for building on kernel 2.6.34 (Maverick), so it should also go into the next Debian package.
[13:36:42] --- haba has become available
[13:37:55] Thanks, I was just looking at that.
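On the 11:02 question of walking the event hash and freeing it at shutdown: the idea can be sketched in userland C. This is only an illustration under stated assumptions. The struct, the table size, and the names ev_add and ev_free_all are hypothetical stand-ins, not the per-platform afs_getevent code, and it uses plain malloc/free where the real platforms reportedly differ in allocator (see 11:03:44).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EV_HASHSIZE 128            /* hypothetical table size */

struct ev {                        /* hypothetical stand-in for a platform event */
    struct ev *next;               /* next entry in the same hash chain */
    void *event;                   /* address that sleepers wait on */
    int refcount;                  /* threads currently using this entry */
};

static struct ev *ev_hasht[EV_HASHSIZE];

/* Map an address to a bucket index. */
static unsigned
ev_hash(void *event)
{
    return (unsigned)(((uintptr_t)event >> 2) % EV_HASHSIZE);
}

/* Allocate an entry for 'event' and push it onto its chain.
 * Returns NULL if the allocation fails. */
struct ev *
ev_add(void *event)
{
    struct ev *e = malloc(sizeof(*e));
    unsigned h = ev_hash(event);

    if (e == NULL)
        return NULL;
    e->event = event;
    e->refcount = 0;
    e->next = ev_hasht[h];
    ev_hasht[h] = e;
    return e;
}

/* Shutdown path: walk every chain, free every entry, and return how many
 * were freed. The caller must guarantee that nothing is still sleeping. */
int
ev_free_all(void)
{
    int freed = 0;
    unsigned i;

    for (i = 0; i < EV_HASHSIZE; i++) {
        struct ev *e = ev_hasht[i];

        while (e != NULL) {
            struct ev *next = e->next;   /* save before freeing */

            assert(e->refcount == 0);    /* no one may still hold it */
            free(e);
            e = next;
            freed++;
        }
        ev_hasht[i] = NULL;
    }
    return freed;
}
```

A real version would swap free() for whatever matches each platform's allocator and would also destroy any per-event mutex or condition variable before releasing the memory.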
[13:43:26] --- steven.jenkins has left
[13:43:51] --- steven.jenkins has become available
[13:47:13] --- Kelli Ireland has left
[13:51:50] --- haba has left
[14:46:33] --- meffie has left
[15:17:52] --- mdionne has become available
[15:39:47] --- deason has left
[15:46:42] --- mdionne has left
[15:47:15] --- mdionne has become available
[15:58:00] --- mdionne has left
[15:58:11] --- mdionne has become available
[16:07:26] --- kula has left
[16:12:19] --- kula has become available
[16:13:08] --- steven.jenkins has left
[16:13:33] --- steven.jenkins has become available
[16:21:33] --- matt has left
[16:58:24] --- kula has left
[17:25:40] --- deason has become available
[18:10:20] --- Russ has left: Disconnected
[18:30:23] --- Russ has become available
[18:41:43] andersk: thanks, I tend to forget to push things to 1.4. I rarely run or test it anymore.
[18:47:35] --- kula has become available
[18:58:37] --- mdionne has left
[19:52:09] --- phalenor has left
[19:59:40] --- phalenor has become available
[20:34:22] --- Born Fool has become available
[20:43:21] --- shadow@gmail.com/owl536A9DCC has left
[20:54:11] Any thoughts on how bad an idea it would be to lock Giant for GetDCache?
[21:09:22] --- Born Fool has left
[21:15:22] --- dwbotsch has left
[21:16:07] --- dwbotsch has become available
[21:25:52] --- phalenor has left
[21:34:06] --- kaj has become available
[21:35:53] --- phalenor has become available
[21:42:05] Sorry; lock what?
[21:44:32] The BKL-equivalent. An idle thought; almost certainly a bad idea.
[21:45:21] However, I think I am starting to make sense of what is going on here. ("here" being "reader hangs on multiprocessor system but not single-processor system") (more)
[21:45:27] Okay, so if I'm reading this right, in the multiprocessor case, the call to afs_GetDCache() in BPrefetch is never returning, because it is sleeping in UpgradeSToWLock(&tdc->lock, 609) and apparently never getting woken up.
Since GetDCache never gets past there, it never wakes up &tdc->validPos with the reader, and we stick there forever. Does that sound sane?
[21:45:39] That doesn't work the same way. The GLOCK is a kernel mutex that's held most of the time when in AFS code and not blocked. It's clearly already held when GetDCache is called, since that function calls ObtainWriteLock() which can only be called with the GLOCK held.
[21:48:21] AFS's locks are these data structures whose contents are manipulated with the GLOCK held. If operations on them block, that happens with GLOCK temporarily released.
[21:49:19] Sure.
[21:49:32] > sleeping in UpgradeSToWLock(&tdc->lock, 609)
then I'd look for what else has read locks on that dcache
[21:50:10] but there is probably context I don't have, since I haven't really been paying attention to whatever problem you're looking at.
[22:08:12] --- reuteras has become available
[22:09:10] So, I have a single kernel thread in GetDCache, and a single userland process trying to read from AFS. "user thread" is in:
674 ReleaseReadLock(&tdc->mflock);
675 ReleaseReadLock(&tdc->lock);
676 ReleaseReadLock(&avc->lock);
677 code = afs_osi_SleepSig(&tdc->validPos);
GetDCache is here:
2008 UpgradeSToWLock(&tdc->lock, 609);
and &tdc->validPos of that tdc is the same as in the other thread. So they appear to be waiting for each other to wake them up.
[22:12:16] wait; user thread is blocked in afs_osi_SleepSig having already called ReleaseReadLock(&tdc->lock) ? For the same dcache (tdc is the same address)?
[22:12:52] That's what it looks like.
[22:13:38] I'm not entirely sure if this core dump will be enough to figure out how I got here, though ...
[22:13:47] I'm going to guess that either (1) there is another reader on that dcache lock, or (2) the count on that lock is wrong such that the waiting thread didn't wake up when the lock was released.
[22:14:19] the contents of tdc->lock may be interesting.
[22:14:48] (kgdb) p tdc->lock
$29 = {wait_states = 0 '\0', excl_locked = 4 '\004', readers_reading = 0, num_waiting = 1, spare = 0, time_waiting = {tv_sec = 0, tv_usec = 0}, pid_last_reader = 0, pid_writer = 1838, src_indicator = 606}
[22:15:29] (pid 1838 is GetDCache)
[22:22:12] huh
[22:27:18] --- deason has left
[22:28:55] For lack of a better thought, I am tempted to change the osi_AllocSmallSpace in getevent() (FBSD/osi_sleep.c) to afs_osi_Alloc_NoSleep and see what happens.
[22:29:18] Since I guess that could open up a race window.
[22:38:42] And that certainly seems much happier.
[22:39:37] Now I am left guessing whether the lost contact messages during my stress test were real, and how I managed to smash my stack later on in the test.
[22:45:51] oh; I missed that this was FBSD, too
[22:46:31] so it's a distinct possibility that the problem is in the implementation of low-level things like locking
[22:47:31] Oh, yes. Basically everything I am working on will be FBSD for the foreseeable future.
[22:47:50] gerrit/1989
[22:51:32] --- kaj has left
[22:53:06] --- jaltman has left: Disconnected
[22:53:57] --- jaltman has become available
[23:47:16] --- kaj has become available
[23:48:43] --- Russ has left: Disconnected
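The kgdb dump at 22:14:48 can be read against the two guesses made at 22:13:47. Below is a small C sketch that does so mechanically. It assumes the lock-type bit values are read = 1, write = 2, shared = 4, which is believed to match the AFS lock header but should be verified there; the struct holds only the fields printed in the dump, and every name in it is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed lock-type bits; verify against the real lock header. */
#define SK_READ_LOCK   1
#define SK_WRITE_LOCK  2
#define SK_SHARED_LOCK 4

/* Hypothetical snapshot holding only the fields shown in the kgdb output. */
struct lock_snapshot {
    unsigned char wait_states;
    unsigned char excl_locked;
    unsigned short readers_reading;
    unsigned short num_waiting;
    int pid_writer;
    unsigned int src_indicator;
};

enum holder { HOLD_NONE, HOLD_READ, HOLD_SHARED, HOLD_WRITE };

/* Classify how the lock is currently held, strongest mode first. */
enum holder
holder_kind(const struct lock_snapshot *l)
{
    assert(l != NULL);
    if (l->excl_locked & SK_WRITE_LOCK)
        return HOLD_WRITE;
    if (l->excl_locked & SK_SHARED_LOCK)
        return HOLD_SHARED;
    if (l->readers_reading > 0)
        return HOLD_READ;
    return HOLD_NONE;
}

/* Flag a snapshot where a thread is counted as waiting although no
 * wait_states bit says what it is waiting for. */
int
waiter_without_wait_state(const struct lock_snapshot *l)
{
    assert(l != NULL);
    return l->num_waiting > 0 && l->wait_states == 0;
}
```

Fed the values from the dump (excl_locked = 4, readers_reading = 0, num_waiting = 1, wait_states = 0), holder_kind() reports a shared hold and the second helper returns 1. readers_reading = 0 argues against guess (1), another reader. The waiter-without-wait-state combination is flagged only as worth a second look; whether it means a missed wakeup depends on lock and sleep internals that are not shown in this log.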