[01:40:26] --- Russ has left: Disconnected
[02:21:35] --- haba has become available
[03:15:01] --- jaltman/FrogsLeap has left
[03:16:11] Then I will fill some new things into this room. Today's kernel messages are:
Oct 20 06:06:39 a07c01n02 kernel: BUG: soft lockup detected on CPU#6!
...
Oct 20 06:06:39 a07c01n02 kernel: [] :libafs:afs_linux_dentry_revalidate+0x40c/0x4b1
...
So we are somewhere in libafs and not happy about the cache? This is 1.6.0 on CentOS5. Is this known or should I email the whole log.....
[03:16:45] Whole log please
[03:16:53] Where do you want it?
[03:17:05] RT
[03:17:10] OK
[03:17:29] Also, can you run gdb against your kernel module and list *(afs_linux_dentry_revalidate+0x40c)
[03:17:47] (which will tell us the line number at which the soft lockup occurred)
[03:18:52] It's unlikely that a lockup in dentry_revalidate is related to the cache, though.
[03:26:37] RT has email
[03:26:52] * haba will have lunch
[04:38:00] Okay, so that soft lockup is actually coming from a spinlock in prune_dcache (RT is 130286 if others are interested)
[04:41:40] which means that it's probably the sb_lock that we're spinning on. Which is weird.
[05:09:25] --- jaltman/FrogsLeap has become available
[06:09:00] Simon: Do you want me to do something with that machine in its current state?
[06:12:53] 209 spin_lock(&dcache_lock);
[06:13:15] haba: In which file?
[06:13:50] 0x6056d is in afs_linux_dentry_revalidate (include/linux/dcache.h:209) (and copy to RT)
[06:14:25] I wonder what the other holder of the dcache_lock is.
[06:15:02] Sadly, without lock debugging we can't easily tell. However, could you do an alt-sysrq-t on the machine, and stick the resulting output into RT as well?
[06:15:13] After that, I think there's not much more we can do with it.
[06:15:28] * haba has to find a console through IPMI
[06:15:38] echo t > /proc/sysrq-trigger
[06:15:49] true
[06:16:44] That has probably crashed it
[06:16:53] $%^&*
[06:17:07] Bah. Oh well.
[06:17:13] What's the kernel version on that machine?
[06:17:34] see RT: 2.6.18-53.1.14.el5.centos.plus
[06:18:01] Now I have to find the console
[06:21:12] Hm. It was unresponsive for minutes, but came back.
[06:23:19] Yeah, that will be whilst it dumped the stack of every process in the system. If you look in your logs, you should see a very large amount of debugging output.
[06:23:39] Not that large ;-)
[06:25:20] I have call traces from a handful of rsh (naturally because that's how the scheduler starts stuff on other computers) and my own bash and telnetd.
[06:25:43] And also for a load of kernel processes, hopefully
[06:26:12] Nope
[06:28:36] I can attach what there is to the RT, but I am afraid that's not the droids....
[06:28:54] Yeah, I suspect that's not interesting to us, sadly.
[06:29:46] should I recompile the kernel module with more debugging and what do we want?
[06:30:35] Lock debugging is disabled as soon as you load a non-GPLd module.
[06:32:37] So, you _could_ rebuild the kernel with lock debugging enabled, but you'd then have to also rebuild the OpenAFS kernel module with a different MODULE_LICENSE field.
[06:57:34] --- Simon Wilkinson has left
[06:59:20] --- Simon Wilkinson has become available
[07:22:41] --- Simon has become available
[07:45:25] --- Simon has left
[07:54:21] --- deason has become available
[08:05:25] --- reuteras has left
[08:13:06] --- summatusmentis has become available
[09:16:09] --- haba has left: Lost connection
[09:16:09] --- abo has left: Lost connection
[09:49:05] --- Russ has become available
[10:02:10] --- Simon Wilkinson has left
[12:19:55] --- abo has become available
[12:22:15] --- haba has become available
[12:25:32] Unfortunately I'll have to rebuild the kernel as well as spinlockdebug is OFF in RHEL/CentOS5.
[12:25:43] Some other day
[13:44:24] --- mfelliott has become available
[14:02:55] The kernel of my laptop has gotten into a logging loop:
Oct 20 23:01:38 habanero kernel: [968271.306430] afs: Tokens for user of AFS id 0 for cell stacken.kth.se: rxkad error=19270403
[14:03:53] ... until I did a kdestroy for root.
[14:07:18] version?
[14:07:24] 1.4.14
[14:07:36] Linux
[14:14:02] loop possibly fixed by 4d4ce0986376675b05fbffbe96f8aac2bf3912b2, though I have no idea what caused that rxkad error in the first place
[14:15:35] Me neither, as I am not aware of having "done" anything in AFS from the PAG in question. But fixed by xxxx is good :)
[14:31:18] --- haba has left
[15:28:30] --- Simon Wilkinson has become available
[15:49:26] --- Simon Wilkinson has left
[16:11:43] --- deason has left
[18:46:02] --- deason has become available
[22:32:35] --- deason has left
[22:56:39] --- Russ has left: Disconnected
[23:30:22] --- haba has become available
[23:49:57] --- haba has left
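
For anyone following along: the `symbol+offset/size` frame in the soft lockup trace is what feeds the gdb `list` command suggested at 03:17:29. The following is a minimal sketch, assuming the frame format quoted in the kernel messages above; `parse_frame` and `gdb_list_command` are hypothetical helper names for illustration, not part of OpenAFS or gdb.

```python
import re

# Matches frames such as ":libafs:afs_linux_dentry_revalidate+0x40c/0x4b1".
# The ":module:" prefix is treated as optional (an assumption; frames for
# built-in kernel symbols may not carry one).
FRAME_RE = re.compile(
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>\w+)\+(?P<offset>0x[0-9a-fA-F]+)/(?P<size>0x[0-9a-fA-F]+)"
)


def parse_frame(line):
    """Return module, symbol, offset and size from a trace line, or None."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    return {
        "module": match.group("module"),
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
    }


def gdb_list_command(frame):
    """Build the gdb command that maps symbol+offset to a source line."""
    return "list *({symbol}+{offset:#x})".format(**frame)


if __name__ == "__main__":
    line = ("Oct 20 06:06:39 a07c01n02 kernel: [] "
            ":libafs:afs_linux_dentry_revalidate+0x40c/0x4b1")
    frame = parse_frame(line)
    print(gdb_list_command(frame))
```

The resulting command is meant to be typed into a gdb session opened on the kernel module that was actually loaded; the answer at 06:13:50 is the kind of output it produces.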