[00:03:21] --- jaltman/FrogsLeap has become available
[00:57:55] --- lars.malinowsky has become available
[01:16:31] --- Simon Wilkinson has become available
[01:21:47] kaduk: I think someone said that I could add tools.git, but I
[01:21:59] never got round to it
[02:00:08] --- reuteras has become available
[02:00:22] --- reuteras has left
[02:23:17] --- lars.malinowsky has left
[05:51:49] --- Chris Garrison has become available
[06:12:45] --- reuteras has become available
[06:16:29] --- reuteras has left
[06:18:11] --- reuteras has become available
[06:39:50] --- mdionne has become available
[06:55:31] --- deason/gmail has become available
[07:06:24] --- rra has become available
[07:21:07] --- Jeffrey Altman has become available
[07:45:14] --- Jeffrey Altman has left
[07:45:28] --- reuteras has left
[08:14:54] --- Jeffrey Altman has become available
[08:15:45] --- Jeffrey Altman has left
[09:52:09] deason: I am curious if the logic in the commit message for 4847 seems reasonable to you.
[09:53:37] that's fine, but I think it's important that it should be temporary
[09:53:53] also a very brief comment saying why it's ifndef'd out would be nice
[09:56:09] A comment is easy enough to do, but I'm not really convinced that it should be temporary.
[09:58:28] If the OS is flushing our vcaches for us on unmount, then I'm not sure if FlushAllVCaches is entirely necessary.
[09:58:53] it's not "for us", we do it ourselves earlier in the code path
[09:59:43] but the fact that we're panicking there shows that there's a bug elsewhere, so I don't think it should always be removed, as it indicates those other bugs
[10:00:00] That's a good point.
[10:01:10] if this is because whatever other operation "fails" that's not allowed to fail, we should remove the vcache from the global list, since it's no longer really valid
[10:01:35] if we can't free it after that point or something.... I think leaking the memory or whatever is better than having a vcache on the list that can explode the machine
[10:02:59] > leaking ... better ... explode Sure, definitely. (We already leak some memory anyway, which I haven't started to look at.)
[10:04:02] > panicking ... there's a bug elsewhere Well, I don't want to ship builds that will panic end-user systems. Is there a debug macro that would be appropriate to wrap it in?
[10:04:46] We don't have any mechanism for disabling things on release builds, particularly.
[10:04:59] well sure, that's why I'm agreeing with disabling it for now
[10:05:08] disabling the call, that is
[10:20:07] --- jaltman/FrogsLeap has left: Disconnected
[10:21:53] The rt.central.org/rt instance is stupendously old
[10:22:13] Oh yes.
[10:22:20] It's also stupendously customised.
[10:22:24] lol
[10:22:36] customised in what way?
[10:22:38] (which is why it never gets upgraded, because all of the other baggage would need porting)
[10:22:39] We're running 3.4.4 at work which I thought was old...
[10:23:03] In ways that OpenAFS doesn't really see, AIUI. There's a load of other stuff done on that RT instance bar the openafs queues.
[10:23:09] we're on 3.8.8. upgrading RT has become stupid easy in new versions. couple commands and you're there
[10:23:36] Simon: is there any way for someone to at least comment on a ticket that is open and not theirs?
[10:23:45] We've talked for a while about maybe bringing up a new RT instance on the box that's at stanford, and looking at porting the existing tickets over. But nobody has ever had time to do the work.
[10:24:10] jblaine: I believe, not. But, if you tell me the ticket number, I can add you as a requestor, and then you'll be able to comment.
[10:24:13] I am considering crowdsourcing doc contributions by breaking them into tiny work chunks that anyone could do, but I want people to be able to note in a ticket that it is being worked on
[10:24:39] Ah. Our RT won't help you with that, sadly.
[10:24:41] I want to mail openafs-info, for example, with a list of tiny tasks that will all add up to a lot of work done
[10:24:45] poop
[10:24:50] And that sounds like a great plan.
[10:25:13] What I would do, for now, is make a page in the wiki in which people can record if they are working on a task.
[10:25:21] yeah, cause I sure as hell am not running every AFS command and comparing it to the man page, myself
[10:25:26] ahh, yes
[10:25:28] good idea
[10:26:00] I had a lot of ideas during the workshop that I hopefully won't forget before I get them out of my head
[10:26:09] in fact, let me at least go make quick notes right now
[10:26:20] But we _should_ be able to do this kind of thing in RT. I'm told it's just a case of making a list of the things we need to be able to do, permission wise, and asking the RT owner to set those.
[10:26:36] So, if you fancy coordinating such a list of changes, that would be another useful thing to do.
[11:17:23] What is "roken" ?
[11:18:34] a library for working around brokenness in libc stuff on various platforms
[11:18:47] like, providing strncpy if strncpy works weirdly on some system
[11:18:50] ah, libroken = broken
[11:18:51] cute
[11:18:55] yes :)
[11:19:23] isn't that what gnulib is too?
[11:19:59] Oh probably. But the clue there is in the name.
[11:20:05] GNU == of absolutely no use to us.
[11:20:21] ah
[11:20:41] We used to roll our own compatibility layer, which lived in "util" (a place where good code goes to die)
[11:21:28] It seemed rather foolish that we were doing so when Heimdal already had a really nice library to do that, which was available under a license we could use, so at the same time I pulled in hcrypto for the rxgk work, I also pulled in libroken.
[11:22:01] Is it just my naivety/ignorance, or is the cmd_ stuff really silly?
[11:22:11] The cmd_ stuff is insane.
[11:22:16] okay, thank you
[11:22:20] I made some changes recently to make it a little less insane
[11:22:30] So you can now supply your offset when creating a new parameter.
[11:22:48] I couldn't believe seeing hardcoded numeric index references littered through the code for it
[11:23:11] Yeah. Another thing you could do, if you have the bother, is convert them to use the new WithOffset function.
[11:23:37] I think Mike has already done that for vol-info, if you would like an example.
[11:24:16] I'll have a look after this doc "push"
[11:25:19] Simon: You seem to have a pretty hardcore systems programming background
[11:25:29] where did you "come from" ?
[11:25:46] Picked it all up as I went along really.
[11:25:57] I knew nothing of kernel programming until I started playing with OpenAFS.
[11:26:02] cool
[11:27:52] The big thing that really did it for me was doing disconnected mode. I spent a while floundering around writing the read-only side of things, and then I got a Summer of Code student to do r/w. He kept asking questions which meant I had to stay ahead of him in learning how the kernel module worked, and so I was forced into picking lots of it up then.
[11:28:50] you only need to stay one lesson ahead of the kid when teaching piano lessons :)
[11:30:26] heh
[11:30:43] My main problem is that I don't do this all day every day, so I find myself swapping out big chunks of knowledge that I then need to swap back in.
[11:31:08] yeah, that's hard
[11:31:13] That's kinda what happened with RX - I had loads of it in my head in November, then ended up doing something else, so had to cram it all back in there in the last couple of weeks.
[11:33:11] I do my best to take down pretty good documentation as I'm getting to truly comprehending something, just for that purpose - restarting later
[11:38:41] I'll buy someone a beer if they edit local.css for http://wiki.openafs.org/ and add: body { margin: 40px; }
[11:39:51] Let's see if I can figure out where that file lives.
[11:40:48] what kind of beer?
[11:40:54] your choice
[11:43:53] Looks like you are going to need Russ for that. rra - are you around?
[11:45:04] FYI, all of you running RT before 3.8.10 should be aware of a half-dozen security vulnerabilities you want to patch for. http://blog.bestpractical.com/2011/04/security-vulnerabilities-in-rt.html
[11:45:18] andersk: Yeah, I meant to mention that.
[11:45:39] andersk: in the prior discussion. I can't imagine how many other issues there are with 3.2.2
[11:46:22] If anyone is prepared to put the work into migrating our RT to a newer instance, I'm sure the gatekeepers would love to hear from them ...
[11:48:02] Who set it up? Who made the customizations? I'll agree to do it if I'm given the proper access/privs to do the work *and* a description of the customizations is provided (or it is agreed to abandon them).
[11:48:08] someone would need to list exactly what customizations are first
[11:49:25] Jeffrey Hutzleman is the current owner of the system.
[11:49:44] I think, for openafs, we would be happy to move over to a vanilla RT instance.
[11:49:56] But Jeffrey and Derrick would be the ones to talk to about this in detail.
[11:50:20] Our Stanford host is Debian. If we can use something that has Debian packages, we'll get the upgrades for free.
[11:51:26] bestpractical's shipwright system makes upgrades pretty straightforward. we have every dependency for RT above libc built and living in afs for 3 different platforms using shipwright.
[11:51:46] Sounds cool.
[11:52:39] If someone is prepared to take this on as a project, then it would be great to get it done. It comes up again and again, but nobody has yet had the time to actually do the work.
[11:53:31] Is there a list/alias for gatekeepers?
[11:53:51] request-tracker3.8 is in Debian. (3.8.8-7+squeeze1 in squeeze/updates and lenny-backports has the necessary patches. 3.8.10-1 is in wheezy.)
[11:54:06] openafs-gatekeepers@openafs.org
[11:54:33] andersk: You up for helping?
[11:54:38] (which is essentially Russ, Derek and Jeffrey A)
[11:54:44] Derrick, even.
[11:55:43] I could be.
[11:55:57] 4.0 would be even better
[11:56:08] eh
[11:56:26] I'm not one for the bleeding edge, no matter how tested they say 4.0 is
[11:56:43] I saw all of the 4.0 problem messages the first week on rt-users :|
[11:57:30] andersk: email address? I'll CC you?
[11:57:58] andersk@mit.edu
[11:58:58] maybe we could just ditch RT and start using prophet. then we could have bugs go through gerrit ;)
[11:59:24] er, prophet+SD
[11:59:33] Russ has talked a lot about having a git based bug tracker.
[11:59:50] All I want is a bug tracking system that isn't awful. When someone invents one, be sure to let me know ...
[11:59:59] http://syncwith.us/sd/
[12:00:36] Yes, but now you're talking about an altogether different project.
[12:00:41] Bear in mind that we do also have a private bug queue - openafs-security, which we wouldn't want to share.
[12:00:57] I know, just putting that out there.
[12:01:00] Yeah. I think we're going to be with RT for a while. If anything, what I really want is closer RT integration.
[12:01:16] Let's start with getting the upgrade done...
[12:01:19] So when a patch set is merged with gerrit, it should notify the related bugs (and possibly close them)
[12:01:43] that's simply a matter of the CommandByMail extension
[12:02:02] gerrit -> email -> ticket action taken
[12:02:19] Excellent. Can I have one please? :)
[12:02:28] eventually :)
[12:02:38] It's coming with OpenAFS 1.6...
[12:02:57] http://instantrimshot.com/
[12:03:29] * Simon Wilkinson is busy writing the RX fixes as we speak
[12:03:53] does that mean rx is broken somewhere?
[12:04:14] I take it you weren't "attending" the workshop :)
[12:04:20] no
[12:04:27] RX is broken in lots of places.
[12:04:36] simon said every thing is fine.
[12:04:40] The RX in 1.6 has no meaningful congestion control when packets are lost.
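For readers unfamiliar with the RTT and timeout terms that come up in this exchange, here is a rough sketch of a conventional smoothed round-trip-time estimator. It is the standard TCP calculation from RFC 6298, shown only for comparison; it is not RX's code, and nothing in the log says the RX fixes take this form. The class and parameter names are hypothetical.

```python
class RttEstimator:
    """Smoothed RTT and retransmission timeout, RFC 6298 style (TCP, not RX)."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance

    def __init__(self, min_rto=1.0, granularity=0.0):
        self.srtt = None      # smoothed round-trip time, in seconds
        self.rttvar = None    # round-trip time variation, in seconds
        self.min_rto = min_rto
        self.granularity = granularity  # clock granularity G

    def sample(self, rtt):
        """Feed one RTT measurement and return the new timeout.

        Measurements from retransmitted packets should not be fed in
        (Karn's algorithm), since it is ambiguous which send they answer.
        """
        if self.srtt is None:
            # First measurement: seed both values directly.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Variance is updated first, using the previous smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.rto()

    def rto(self):
        """Retransmission timeout: SRTT + max(G, 4 * RTTVAR), with a floor."""
        return max(self.min_rto, self.srtt + max(self.granularity, 4 * self.rttvar))
```

If the timeout is computed too tightly, the sender retransmits on a timer before duplicate acknowledgements can trigger a fast retransmit, which is the general shape of the problem being described.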
[12:04:51] Its RTT calculations are utterly bogus
[12:05:16] The way that it applies those packet calculations means that we never have an opportunity to do fast retransmit before a timeout occurs.
[12:05:39] There are other things too, but those are the things we need to fix for 1.6.
[12:05:59] hmm.
[12:06:24] If your links are good, and you don't drop packets, you _probably_ won't see any problems.
[12:06:30] so there will be at least a pre7 then?
[12:06:50] Either that, or those will go out in 1.6.0. Not sure, you'll need to ask a gatekeeper.
[12:07:18] .0 won't match the last prerelease anyway, as there is security stuff to go in.
[12:07:28] (nothing major, I hasten to add)
[12:07:36] I need to get pre6 rolled out on a few machines here to get it tested, I should do that today.
[12:08:02] 99.9% of our clients are on our own networks, and they're very stable
[12:08:59] If you're having problems, you'll either find things running a bit slowly, or see the resend count reported by rxdebug -rxstats go through the roof.
[12:09:16] good to know
[12:09:39] (In general, we save ourselves from ourselves, because whilst we end up deciding to resend far too often, by the time we wade through the lock contention, we discover that there's nothing there for us to resend)
[12:36:49] Unrelatedly, if something is in openafs-web.git, it is not continuing to get generated, right? (So I should submit a patch against openafs-web in addition to the patch to the script)
[12:38:41] Sadly, I don't really know how the web stuff hangs together.
[12:39:08] It's possible that the scripts get run locally, and then the changes they produce are committed to openafs-web
[12:44:15] The script in question "lives in" tools.git, which I think Derrick said he had local changes to, so probably not in this case, at least.
[12:44:42] Yeah, I need to link tools.git into gerrit.
[13:05:12] --- mdionne has left
[13:26:05] I have a sort of conceptual question. What types of things would cause an FSSYNC call to fail?
[13:26:10] jblaine: Do you want me to move those documentation bugs to the Docs queue?
[13:26:21] yes please
[13:26:28] which fssync call? they all do different types of things
[13:26:43] I'd have sent them into that if I knew the address
[13:26:46] I don't see it publicly
[13:26:48] same one you and I were talking about last night, lemme see if I can find it in the log
[13:27:17] I don't know what that one was; the log said it didn't understand the fssync code, so the call was unknown :)
[13:27:21] I don't know if there is a way to directly add to the docs queue.
[13:27:46] yeah, I mailed devel about it
[13:27:55] someone will answer hopefully
[13:28:26] deason: http://pastebin.com/uXYe3hMp that's what's in FileLog, what else am I looking for?
[13:29:08] that usually means the fssync client side process died or something
[13:29:20] that's unsettling
[13:29:35] but maybe it's from the unknown fssync codes; I'd up the debug level
[13:29:48] can you remind me how to do that?
[13:29:57] -d 125
[13:30:00] or send SIGTSTP
[13:30:52] sorry, the -d flag to fileserver?
[13:31:12] yeah
[13:31:23] (sorry I'm being brief; I'm on the phone doing something else atm :)
[13:31:42] no worries, finish your phone call if you want
[13:35:26] hrm, kicking the debug level way up suggests that something is being passed the wrong fd, which might actually be my code
[13:35:46] http://pastebin.com/R9xyymUM
[13:36:52] and they pay you with cards they deposit money into
[13:40:23] err... ignore that last line, wrong window
[13:41:08] (sorry I'll be a bit silent; I'll look at that in a bit)
[13:41:19] really, don't worry about it
[13:53:07] --- Chris Garrison has left
[13:57:27] deason: taking off briefly, feel free to respond, I'll see when I get back
[14:35:39] summatusmentis: the SYNC_BAD_COMMAND response is because it looks like that's a dafs salvager running against a non-dafs fileserver
[14:36:28] the failed read later and "error receiving" command may just be from the salvager exiting after it's done; but if there's anything from the salvager side of things around that time, that could say if it's something else
[14:52:38] I don't suppose there's anything like gitk for tty?
[14:53:13] ah, there is
[15:23:28] --- deason/gmail has left
[15:56:43] --- deason/gmail has become available
[16:36:05] OMG the stale bugs in RT
[16:36:08] 9 years and open?
[16:37:27] Yeah. They're not necessarily stale. Some of them are still bugs.
[16:59:07] I have a few that old that are still bugs, at least so far as I know.
[16:59:20] I'm pretty sure we can still get horribly confused by symlinks that ascend out of volumes, for example.
[16:59:26] In the Linux cache manager.
[16:59:40] Yes, yes we do.
[17:00:05] If you mount a volume in two different places, then the Linux dcache will get confused and your symlink traversal becomes interesting.
[17:00:23] I opened that bug during the very first AFS and Kerberos Best Practices Workshop. :)
[17:00:52] http://grand.central.org/rt/Ticket/Display.html?id=860 isn't going to go anywhere
[17:00:58] The only acceptable way to fix the bug is to stop lying about mountpoints, and make each volume mountpoint appear to Linux as a mountpoint.
[17:01:36] Oh, that's an entertaining bug.
[17:01:56] That actually may have subsequently been fixed in later versions of Solaris.
[17:02:03] 860? That's essentially "don't call into the kernel whilst holding AFS locks". Which is pretty much true everywhere.
[17:02:18] I think we'll go bang on Linux for much the same reason.
[17:02:43] Oh, hey, I fixed 124457.
[17:03:15] http://grand.central.org/rt/Ticket/Display.html?id=125470 should be closed
[17:03:17] Yay!
[17:03:28] fix is merged at least...
[17:03:47] ugh, I suppose there's no point to me reviewing the bug list in RT.
[17:03:55] that's not productive
[17:03:59] Yeah, sadly, I never look at RT.
[17:04:09] I spend too much time looking at RT.
[17:04:20] I have a few bugs open in RT to remind me what I should be working on.
[17:04:36] this one's all you, Russ :) http://grand.central.org/rt/Ticket/Display.html?id=129881
[17:04:39] Reviewing the bug list is a useful thing to do, but there's probably not much point to it given that our braindead access controls won't let you fix things.
[17:04:50] exactly, Simon
[17:05:01] Yeah, I know, Jeff asked me about that one.
[17:05:15] I unfortunately don't have time and now probably won't until Novemberish to do much work on AFS.
[17:05:24] Next time Derrick pops up, bug him to be added to the workers group in RT.
[17:05:29] Have to get WebAuth 4 out and then take vacation.
[17:05:36] Ouch, and yay!
[17:06:08] I've never seen http://grand.central.org/rt/Ticket/Display.html?id=17428
[17:06:13] After that one time.
[17:06:20] I don't know if we fixed it or not, though.
[17:07:06] I've closed 125470
[17:07:21] Perhaps a group RT ticket Planned Cleanup Hour is in order, but hey, I'm not in a position to suggest that :)
[17:08:24] I've tried for 11 years now at work to get people looking at their tickets (IT support) once per day.
[17:08:29] 17428 is one of the ones that got inadvertently deleted, that I found in a trawl through the spam tickets. You've got no idea how much fun that was.
[17:08:42] I gave up. I don't need to start another fruitless attempt at that outside of work.
[17:09:04] We generally do look at tickets. I know I do.
[17:09:17] It's just generally the tickets that get looked at are the ones at the top of the queue.
[17:09:25] My problem is that I just don't have enough time to do the things on AFS that I already know I need to do, let alone trawl through RT tickets that are often low signal-to-noise.
[17:09:37] * jblaine nods
[17:10:35] 1163 is still a bug, for example. And it's a bug that I actually know how to fix, if I had enough round tuits.
[17:10:41] And speaking of time, time to go home.
[17:10:54] I've sort of hit a dead-end a bit with it, as I don't have any knowledge of the commands (nor an environment making use of some of them) that need documentation. And my C knowledge is... we'll say... sub-par.
[17:11:28] We've got a list now, though, of the commands that are missing docs?
[17:11:56] doc/man-pages/README, near the end
[17:11:57] What I would do is try looking at who contributed that command, using the 'git blame' command
[17:12:19] And send them an email asking if they'd mind telling you what the command should do so that a man page can be written.
[17:12:37] Once a mailing list exists that gets tickets into the openafs-docs queue, I intend to submit the known problems as tickets there
[17:13:02] and clean out that section of the README, and link in the README to the RT queue
[17:13:09] Jason seems to think that openafs-docs@openafs.org goes there.
[17:13:16] I tried that
[17:13:45] after his email -- no response back, and I don't see my test ticket
[17:13:55] no num assigned, etc
[17:14:59] Hmmm. I don't have any access to that system to poke around, unfortunately.
[17:19:07] example (git blame) : "Kris Van Hees" kvanhees@sinenomine.net added -creation to vos.c for vos restore in 2004
[17:19:14] perhaps I'll track him down :)
[17:19:53] Ah, you're actually looking at missing options, rather than just missing commands.
[17:20:04] yup
[17:21:44] So, -creation takes d, k, or n (or dump, keep or new) as its sole argument.
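A minimal sketch of the -creation choice as it is described in this exchange (the 17:24:03 message spells out the three meanings): "dump" takes the date stored in the dump being restored, "keep" takes the date of the existing volume, and "new" uses the current time. The function and parameter names are hypothetical illustrations, not code from vos.c.

```python
from datetime import datetime, timezone

# Spellings accepted per the log: d, k, n or dump, keep, new.
_MODES = {
    "d": "dump", "dump": "dump",
    "k": "keep", "keep": "keep",
    "n": "new", "new": "new",
}


def parse_mode(arg):
    """Normalise a -creation style argument (hypothetical helper)."""
    try:
        return _MODES[arg.lower()]
    except KeyError:
        raise ValueError(f"expected d, k, n, dump, keep or new, got {arg!r}") from None


def pick_date(arg, dump_date, existing_date, now=None):
    """Choose the date a restored volume would be stamped with."""
    mode = parse_mode(arg)
    if mode == "dump":
        return dump_date          # date recorded in the dump being restored
    if mode == "keep":
        return existing_date      # date of the volume already on the server
    # "new": the time of the restore itself
    return now if now is not None else datetime.now(timezone.utc)
```

The log suggests -lastupdate takes the same three values and applies them to the volume's last-updated time instead, though the speakers only describe that as plausible.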
[17:23:02] same for -lastupdate, which Kris added as well
[17:24:03] It controls what the creationDate of the volume is set to when it is restored. "dump" is the date in the dump that's being restored, "keep" is the date of the existing version of the volume, and "new" is the time now.
[17:25:19] ahhh, I suspect the same for -lastupdate, since it takes the same args, but it applies to the last updated info for the volume
[17:25:28] That sounds plausible.
[17:25:43] * jblaine files git blame -e into his arsenal
[17:25:52] Thanks Simon
[17:26:11] no problem
[17:26:30] With recent stuff, if you find the commit, you may also find helpful stuff in the commit message.
[17:26:46] Sadly with stuff that landed in days of CVS, there's not likely to be much of use there.
[17:26:49] yup, saw that just now, well explained in the commit
[17:26:53] * jblaine nods
[17:27:30] Ah, there is useful stuff in 21592fe6. Cool.
[17:35:49] --- Russ has become available
[17:51:18] deason: so, I'm apparently way later than I expected to be. I don't see any reason there'd be a non-dafs fileserver, except perhaps that I didn't configure it that way. They're both 1.6pre
[17:52:25] and there's nothing in SalvLog that suggests anything
[17:52:56] You are remembering that to use dafileserver, you have to use da-everything else, too?
[17:53:11] except I didn't use defileserver
[17:53:53] dafileserver*
[17:53:57] or da-anything else
[17:55:02] I do hope that the non-dafs fileserver in 1.6 isn't broken again...
[17:56:26] I mean, it's serving files ok, and the salvage completed. Did you see the log I posted earlier?
[17:56:38] would that help indicate anything one way or the other?
[18:11:14] no, the fileserver doesn't seem broken; it got a FSYNC_VOL_QUERY_VOP request, and the non-dafs fileserver isn't supposed to understand that message
[18:11:24] so that's correct; what are you running for the salvage?
[18:12:54] /usr/afs/bin/salvager /vicepa 536870954 -xattr -showlog
[18:13:01] (-xattr is what I'm working on)
[18:16:20] I don't really know what I'm talking about here, I have almost no afs admin experience, so I appreciate your help/insight
[18:16:38] so, thanks, is what I'm saying
[18:19:03] can you run salvager under gdb, and set a breakpoint for FSYNC_VerifyCheckout ?
[18:20:07] er, and then see where it's being called from
[18:20:54] never hits the breakpoint with the -debug flag
[18:21:09] Fatal Rx error: assertion failed: pid != -1, file: vol-salvage.c, line: 4663
[18:21:11] without
[18:24:15] wait, hold on, that wouldn't be it anyway
[18:25:33] hm, I don't see anywhere in the vanilla code where we'd call VOL_QUERY_VOP with the FSYNC_SALVAGE reason code, but here's what you can do
[18:26:59] in fssync-server.c, function FSYNC_com, find this line: res.hdr.response = SYNC_BAD_COMMAND; which in my 1.6 repo is line 556, but may be different for you
[18:27:10] attach to the fileserver with gdb, and break on that line
[18:27:22] after the breakpoint triggers, attach to the salvager process in gdb and see where it is
[18:28:06] salvager and fileserver of course will be hanging while you're attached, so don't expect anything else to get done there
[18:31:45] oh no, wait, I'm being stupid
[18:32:00] that damn AskDAFS function; derrick added that, nevermind, hah
[18:32:37] so that won't work?
[18:34:06] what's in the log is "supposed" to happen
[18:34:14] hah, ok
[18:34:22] oh, because it's checking dafs
[18:34:38] I don't like how it's implemented, but I'm not changing something for 1.6 this late to fix what is essentially a cosmetic issue
[18:34:53] yeah, that's fair
[18:42:08] ok, so nothing is wrong with salvager, that's reassuring
[18:45:08] now to figure out this command flag stuff, thanks for all the help Andrew
[18:47:29] sorry that took awhile to realize what was actually happening
[18:53:47] not a big deal, story of my dev life :)
[21:12:28] --- andersk has left