[01:40:26] --- Russ has left: Disconnected
[02:21:35] --- haba has become available
[03:15:01] --- jaltman/FrogsLeap has left
[03:16:11] <haba> Then I will fill some new things into this room. Today's kernel messages are:
Oct 20 06:06:39 a07c01n02 kernel: BUG: soft lockup detected on CPU#6!
...
Oct 20 06:06:39 a07c01n02 kernel:  [<ffffffff8837e56d>] :libafs:afs_linux_dentry_revalidate+0x40c/0x4b1
...
So we are somewhere in libafs and not happy about the cache? This is 1.6.0 on CentOS5. Is this known or should I email the whole log.....

[03:16:45] <Simon Wilkinson> Whole log please
[03:16:53] <haba> Where do you want it?
[03:17:05] <Simon Wilkinson> RT
[03:17:10] <haba> OK
[03:17:29] <Simon Wilkinson> Also, can you run gdb against your kernel module and list *(afs_linux_dentry_revalidate+0x40c)
[03:17:47] <Simon Wilkinson> (which will tell us the line number at which the soft lockup occurred)
[03:18:52] <Simon Wilkinson> It's unlikely that a lockup in dentry_revalidate is related to the cache, though.
[03:26:37] <haba> RT has email
[03:26:52] * haba will have lunch
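
The gdb request above takes its argument straight from the stack-trace entry: the symbol name and the offset before the slash. A minimal sketch of that extraction follows, assuming the 2.6-era x86_64 trace format seen in the log line above. The helper names `parse_frame` and `gdb_list_command` are hypothetical and not part of OpenAFS or gdb.

```python
import re

# Matches a stack-trace entry such as
#   [<ffffffff8837e56d>] :libafs:afs_linux_dentry_revalidate+0x40c/0x4b1
# The ":module:" prefix is optional because core-kernel symbols lack it.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_frame(line):
    """Return the parts of one trace entry, or None if the line has none."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "address": int(m.group("addr"), 16),   # runtime address of the frame
        "module": m.group("module"),           # None for core-kernel symbols
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),  # bytes into the function
        "size": int(m.group("size"), 16),      # total size of the function
    }


def gdb_list_command(frame):
    """Build the gdb command that maps symbol+offset to a source line."""
    return "list *(%s+%#x)" % (frame["symbol"], frame["offset"])


line = ("Oct 20 06:06:39 a07c01n02 kernel:  [<ffffffff8837e56d>] "
        ":libafs:afs_linux_dentry_revalidate+0x40c/0x4b1")
frame = parse_frame(line)
print(frame["module"], gdb_list_command(frame))
```

The resulting command is typed into a gdb session opened on the module object file, which needs to have been built with debug information for the line lookup to work.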
[04:38:00] <Simon Wilkinson> Okay, so that soft lockup is actually coming from a spinlock in prune_dcache (RT is 130286 if others are interested)
[04:41:40] <Simon Wilkinson> which means that it's probably the sb_lock that we're spinning on. Which is weird.
[05:09:25] --- jaltman/FrogsLeap has become available
[06:09:00] <haba> Simon: Do you want me to do something with that machine in its current state?
[06:12:53] <haba> 209             spin_lock(&dcache_lock);

[06:13:15] <Simon Wilkinson> haba: In which file?
[06:13:50] <haba> 0x6056d is in afs_linux_dentry_revalidate (include/linux/dcache.h:209) (and copy to RT)
[06:14:25] <Simon Wilkinson> I wonder what the other holder of the dcache_lock is.
[06:15:02] <Simon Wilkinson> Sadly, without lock debugging we can't easily tell. However, could you do an alt-sysrq-t on the machine, and stick the resulting output into RT as well.
[06:15:13] <Simon Wilkinson> After that, I think there's not much more we can do with it.
[06:15:28] * haba has to find a console through IPMI
[06:15:38] <Simon Wilkinson> echo t > /proc/sysrq-trigger
[06:15:49] <haba> true
[06:16:44] <haba> That has probably crashed it
[06:16:53] <haba> $%^&*
[06:17:07] <Simon Wilkinson> Bah. Oh well.
[06:17:13] <Simon Wilkinson> What's the kernel version on that machine?
[06:17:34] <haba> see RT: 2.6.18-53.1.14.el5.centos.plus
[06:18:01] <haba> Now I have to find the console
[06:21:12] <haba> Hm. It was unresponsive for minutes, but came back. 
[06:23:19] <Simon Wilkinson> Yeah, that will be whilst it dumped the stack of every process in the system. If you look in your logs, you should see a very large amount of debugging output.
[06:23:39] <haba> Not that large ;-)
[06:25:20] <haba> I have call traces from a handful of rsh (naturally because that's how the scheduler starts stuff on other computers) and my own bash and telnetd.
[06:25:43] <Simon Wilkinson> And also for a load of kernel processes, hopefully
[06:26:12] <haba> Nope
[06:28:36] <haba> I can attach what there is to the RT, but I am afraid that's not the droids....
[06:28:54] <Simon Wilkinson> Yeah, I suspect that's not interesting to us, sadly.
[06:29:46] <haba> should I recompile the kernel module with more debugging and what do we want?
[06:30:35] <Simon Wilkinson> Lock debugging is disabled as soon as you load a non-GPLd module.
[06:32:37] <Simon Wilkinson> So, you _could_ rebuild the kernel with lock debugging enabled, but you'd then have to also rebuild the OpenAFS kernel module with a different MODULE_LICENSE field.
[06:57:34] --- Simon Wilkinson has left
[06:59:20] --- Simon Wilkinson has become available
[07:22:41] --- Simon has become available
[07:45:25] --- Simon has left
[07:54:21] --- deason has become available
[08:05:25] --- reuteras has left
[08:13:06] --- summatusmentis has become available
[09:16:09] --- haba has left: Lost connection
[09:16:09] --- abo has left: Lost connection
[09:49:05] --- Russ has become available
[10:02:10] --- Simon Wilkinson has left
[12:19:55] --- abo has become available
[12:22:15] --- haba has become available
[12:25:32] <haba> Unfortunately I'll have to rebuild the kernel as well, since spinlockdebug is OFF in RHEL/CentOS5.
[12:25:43] <haba> Some other day
[13:44:24] --- mfelliott has become available
[14:02:55] <haba> The kernel of my laptop has gotten into a logging loop of: Oct 20 23:01:38 habanero kernel: [968271.306430] afs: Tokens for user of AFS id 0 for cell stacken.kth.se: rxkad error=19270403
[14:03:53] <haba> ... until I did a kdestroy for root.
[14:07:18] <deason> version?
[14:07:24] <haba> 1.4.14
[14:07:36] <haba> Linux
[14:14:02] <deason> loop possibly fixed by 4d4ce0986376675b05fbffbe96f8aac2bf3912b2 , though I have no idea what caused that rxkad error in the first place
[14:15:35] <haba> Me neither, as I am not aware of having "done" anything in AFS from the PAG in question. But fixed by xxxx is good :)
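
The rxkad error number in the log message above is a com_err style code: a table base plus a small index into that table. The sketch below splits 19270403 into those two parts and recovers the packed table name. It assumes the usual com_err encoding (8-bit index, table name packed at 6 bits per character); the function names are hypothetical. It does not say what caused the error, only which table entry to look up.

```python
# Assumed com_err layout: low 8 bits are the index within the table,
# the remaining bits encode the table name at 6 bits per character.
ERRCODE_RANGE = 8
BITS_PER_CHAR = 6
CHAR_SET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789_")


def split_com_err(code):
    """Split an error code into (table base, index within the table)."""
    index = code & ((1 << ERRCODE_RANGE) - 1)
    return code - index, index


def table_name(base):
    """Unpack the table name (up to five characters) from a table base."""
    num = base >> ERRCODE_RANGE
    chars = []
    for shift in range(4, -1, -1):
        ch = (num >> (BITS_PER_CHAR * shift)) & ((1 << BITS_PER_CHAR) - 1)
        if ch:  # zero means "no character in this position"
            chars.append(CHAR_SET[ch - 1])
    return "".join(chars)


base, index = split_com_err(19270403)
print(table_name(base), base, index)
```

With the table name and index in hand, the symbolic error name can be looked up in the matching error table in the OpenAFS source.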

[14:31:18] --- haba has left
[15:28:30] --- Simon Wilkinson has become available
[15:49:26] --- Simon Wilkinson has left
[16:11:43] --- deason has left
[18:46:02] --- deason has become available
[22:32:35] --- deason has left
[22:56:39] --- Russ has left: Disconnected
[23:30:22] --- haba has become available
[23:49:57] --- haba has left