[00:01:58] --- kaj has become available
[00:22:34] --- jaltman has left: Disconnected
[02:49:57] --- haba has become available
[03:50:56] --- Jeffrey Altman has left: Replaced by new connection
[03:50:57] --- Jeffrey Altman has become available
[05:06:17] --- meffie has become available
[05:27:35] --- jaltman has become available
[05:51:52] <jaltman> apparently the patch that broke the build was Simon's 8e27248698766e3d97e18363c5c4729a4e02add7 "Always include afsconfig.h" patch
[05:52:18] <jaltman> That patch is causing the vol package to break in very interesting ways
[05:55:14] --- Simon Wilkinson has become available
[05:55:45] <Simon Wilkinson> Define "interesting ways" ?
[05:56:07] <jaltman> things that are meant for inclusion on linux are getting built for windows
[05:56:52] <Simon Wilkinson> Well, the only bit of the vol package that change touches is src/vol/namei_map.c
[05:58:11] <Simon Wilkinson> My guess would be that you're actually getting bitten by 07098dc6708472cf5624b368d63efbdef7d409b8
[05:58:47] <Simon Wilkinson> And that windows has either ssize_t or sig_atomic_t and isn't defining the relevant HAVE_ macro in its afsconfig.h
[05:59:04] <Simon Wilkinson> But if you can provide compiler output showing the error, then I can take a proper look.
[05:59:30] <jaltman> I was already bitten by that and added HAVE_SSIZE_T to the param.*.h files on Windows when that hit the tree
[06:00:51] --- abo has left
[06:00:55] --- Kevin Sumner has left
[06:01:02] --- abo has become available
[06:01:43] --- Kevin Sumner has become available
[06:02:47] --- Simon Wilkinson has left
[06:02:49] --- Simon Wilkinson has become available
[06:03:19] --- kaj has left
[06:04:34] <haba> I seem to be able to trigger some not-so-good behaviour in AFS. The situation is as follows: Some hundred hosts suddenly disappear and reappear with a different UID (memcache). This probably leaves the AFS server with a lot of open callbacks, so when you want to do something in a dir which these hosts were using, every operation slows down to a crawl.
[06:05:17] * haba does not want to DoS himself ;)
[06:05:24] <jaltman> why did the hosts disappear and reappear?
[06:05:28] <haba> reboot
[06:05:37] <jaltman> so a clean shutdown
[06:06:19] <haba> not so clean. They are diskless and have / in AFS. I have not figured out how to pull the plug under myself nicely.
[06:06:33] <jaltman> if the file servers were fixed to not corrupt their own memory when processing RXAFS_ReleaseAllCallbacks (or whatever it is called) before shutdown
[06:06:53] <jaltman> then the clients could call it as part of the shutdown process
[06:07:54] <jaltman> but yes, if you have callbacks issued and the clients disappear and the callbacks have to be broken, the file server is going to be forced to track those callbacks that can't be delivered until the callbacks expire
[06:10:00] <haba> "... if the file servers were fixed ... " - What is fixed and what not?
[06:10:03] <shadow@gmail.com/owlD5B4E913> well, eventually it just forces you to initcallbackstate, but yes, it does delayed callback tracking to a point
[06:10:56] <jaltman> the clients with the new uuid will be initcallbackstate but for the old uuids to which the callbacks are registered, it must delay callback track them.
[06:10:58] <shadow@gmail.com/owlD5B4E913> and the hosts should be marked "down" on the first fail, and not tried until they come back
[06:11:22] <jaltman> they will, but they will time out on each host for the first callback attempt
[06:11:51] <jaltman> any current release file server has the bug fixed.
[06:11:52] <haba> I think the server might be a little bit too single-minded and try too hard and time out not in parallel?
[06:13:19] <haba> Would it harm if the hosts came back with the same UID?
[06:13:40] <shadow@gmail.com/owlD5B4E913> it should time out in parallel, breakcallbacks is done as a multi, but only in batches of 10. (MAX_CB_HOSTS)
[06:14:27] <haba> I think I want MAX_CB_HOSTS a little bit higher (we have clusters of 500-1350 hosts)
[06:14:34] --- Simon Wilkinson has left
[06:15:06] --- Simon Wilkinson has become available
[06:15:21] <haba> To come back with the same UID would mean to calculate the UID from something in the hardware, for example the MAC addr.
[06:15:50] <jaltman> to come back with the same uuid should mean saving it somewhere to disk and reloading it
[06:16:04] <haba> no disk
[06:16:35] <jaltman> how are you booting the machines?
[06:16:41] <haba> net+AFS
[06:16:51] --- Simon Wilkinson has left
[06:17:09] <haba> There is not even a USB stick
[06:17:18] <jaltman> whatever contains the per machine configuration should include a definition for an afs uuid.
[06:17:29] <haba> DHCP?
[06:18:37] <haba> (there is very little that actually differs between these machines which is part of the whole idea)
[06:18:43] <jaltman> if you can define a private dhcp type, then yes
[06:19:44] <haba> Something like "afsd -uid <UID> ....."?
[06:20:02] <haba> Hmm
[06:21:11] <haba> Is there anything more to the UID than being of a certain size and being unique?
[06:21:37] <jaltman> it is a globally unique value of a particular structure
[06:23:23] --- kaj has become available
[06:23:51] <haba> shadow: If the timeout is in parallel but in batches of 10, then (timeout * <number-of-hosts>/10) is the wait? If timeout is around 30sec that could fit what I am seeing.
[06:25:13] <haba> Is such a timeout tying up a thread or can I just set MAX_CB_HOSTS = 100 and get on with my life?
[06:27:53] <shadow@gmail.com/owlD5B4E913> timeout should be about a minute
[06:30:04] <shadow@gmail.com/owlD5B4E913> upping MAX_CB_HOSTS looks like it should be fine. you'll allocate some more memory, and have to search a linked list which is longer. meh
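haba's estimate of the wait can be worked through as a small calculation. This is only a sketch: it assumes the batches run one after another and that every host in every batch times out, which is the worst case. The function name and the 500-host figure (the low end of the cluster sizes haba mentions) are illustrative, not taken from the file server source.

```python
import math

def worst_case_wait_seconds(num_hosts, batch_size, timeout_seconds):
    """Upper bound on the wait if unreachable hosts are tried in serial batches.

    Assumes (as in haba's estimate) that the hosts within a batch time out
    in parallel and that the batches run one after another.
    """
    batches = math.ceil(num_hosts / batch_size)
    return batches * timeout_seconds

# 500 dead hosts, batches of 10 (MAX_CB_HOSTS), timeout of about a minute
print(worst_case_wait_seconds(500, 10, 60))   # 3000 seconds, i.e. 50 minutes
# Same hosts with the batch size raised to 100
print(worst_case_wait_seconds(500, 100, 60))  # 300 seconds, i.e. 5 minutes
```

Under these assumptions the wait scales inversely with the batch size, which is why raising MAX_CB_HOSTS looks attractive for clusters of this size.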
[06:30:05] <jaltman> what you might want to do as part of your restart process is to query every client machine for its uuid (TellMeAboutYourself) and then after shutdown and prior to restart send the file server a series of RXAFS_GiveUpAllCallbacks using a client that claims to own that UUID.
[06:31:53] <haba> I think I could fix a small program that does the surrender at the right time.
[06:38:15] <jaltman> my problem really looks like somewhere there is a #undef AFS_NAMEI_ENV that is being processed after the #include <afs/param.h> in every source file in src/vol.
[06:39:21] <shadow@gmail.com/owlD5B4E913> you mean like in afsconfig-windows.h?
[06:41:33] <jaltman> I just found that which makes me wonder how this source tree ever built
[06:42:24] <jaltman> since param.*_w2k.h define AFS_NAMEI_ENV and the vol package assumes that windows is a namei platform
[06:43:52] <jaltman> that explains why Simon's patch broke the windows build.
[06:44:19] <jaltman> but there were other things that should have been broken long before that
[07:15:21] --- deason has become available
[07:24:08] --- kaj has left
[07:26:15] --- matt has become available
[07:33:32] <deason> > RXAFS_ReleaseAllCallbacks
when are clients going to start calling that automatically?
[07:34:11] <shadow@gmail.com/owlD5B4E913> well, kind of dangerous. maybe we should have a server capability for "known safe"?
[07:36:14] <jaltman> both RXAFS_ReleaseAllCallbacks and the volsync structures provided with InlineBulkStatus RPCs should have capability bits, since they are not safe to use without an indication that they are.  If we don't want a capability bit we could just create a new RPC number.
[07:36:34] <jaltman> for InlineBulkStatus we will be doing so anyway
[07:37:52] --- reuteras has left
[07:37:57] <jaltman> what happened in the past was I added the ReleaseAllCallbacks call to the Windows clients for use when entering suspend, when the network was being turned off, and when the client was being shutdown.  It caused file servers to crash.
[07:40:44] --- kaj has become available
[07:45:52] <deason> a new rpc number or cap bit seems odd to me, since it seems like changing the protocol for an old implementation-specific defect
[07:45:59] <deason> not that I have any better suggestions
[07:50:35] <jaltman> when someone comes up with a solution for ensuring that everyone upgrades all of their servers to a current release, then we can avoid having a test for what a client is permitted to call
[07:51:21] <jaltman> We stopped calling RXAFS_ReleaseAllCallbacks because it was causing file servers to become unstable that otherwise would be stable.
[07:53:06] <jaltman> the same for the volsync data structure.  it is meant to be used as a method of indicating the volume version to the client.  With that knowledge we can perform some significant optimizations on .readonly volume objects in the cache.  Since the volsync data structure wasn't populated properly when InlineBulkStatus was implemented, we can't use the optimizations anywhere.
[07:57:13] <jaltman> OpenSSH maintains a version number based list in each build that indicates what bugs the older releases have and then works around them accordingly.  We don't have the ability to do that.
[08:01:46] <jaltman> it just doesn't feel right to me to begin using functionality that we know is going to cause otherwise stable deployments to become unstable.   There are clients that make use of the RPC.  Arla and kafs both use it.  As does Hartmut's afsio and I suspect several other one off tools do as well.
[08:13:28] --- kaj has left
[09:00:35] --- Russ has become available
[09:23:27] --- jaltman has left: Replaced by new connection
[09:23:28] --- jaltman has become available
[09:34:11] --- kaj has become available
[09:58:19] --- meffie has left
[10:00:30] --- kaj has left
[10:00:32] --- kaj has become available
[10:21:55] --- haba has left
[11:06:43] --- meffie has become available
[14:56:06] --- mdionne has become available
[15:10:07] <matt> cap bits:  it certainly seems reasonable to me to use caps to indicate 'really capable of x', and iirc, we have more or less unlimited bits, yes?  
[15:13:41] <jhutz@jis.mit.edu/owl> Sort of.  We have a total of 32 bits before we have to decide what representation to use in the rest of the vector.  Some choices include
(a) more simple bitmasks
(b) capability numbers
(c) some combination of the above
(d) some compressed form that lets us say something along the lines of
    "I support capabilities X-Y (or 1-Y), plus these others"
[15:16:06] <deason> when I think of it more like a bit for "we have fixed bug #X in implementation Y" it seems weird... but I'm not really objecting; if it's the only way to do this (which it looks like it is), then I'm all for it
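A "known safe" bit under jhutz's option (a) could look something like the sketch below. It treats the capability vector as a list of 32-bit words, with capability number n stored in bit n % 32 of word n // 32. The capability number and the helper names are hypothetical and do not come from any OpenAFS header; the point is only that a client would check for the advertised bit before issuing the RPC.

```python
BITS_PER_WORD = 32

# Hypothetical capability number; not taken from any real header.
CAP_SAFE_GIVE_UP_ALL_CALLBACKS = 5

def has_capability(cap_words, cap_number):
    """Return True if cap_number is set in a vector of 32-bit words."""
    word_index, bit_index = divmod(cap_number, BITS_PER_WORD)
    if word_index >= len(cap_words):
        # A shorter vector from an older server: treat missing words as zero.
        return False
    return bool(cap_words[word_index] & (1 << bit_index))

def may_give_up_all_callbacks(cap_words):
    """Client-side gate: only call the RPC if the server says it is safe."""
    return has_capability(cap_words, CAP_SAFE_GIVE_UP_ALL_CALLBACKS)

old_server = [0x00000001]
new_server = [0x00000001 | (1 << CAP_SAFE_GIVE_UP_ALL_CALLBACKS)]
print(may_give_up_all_callbacks(old_server))  # False
print(may_give_up_all_callbacks(new_server))  # True
```

A server that never sets the bit is simply treated as unsafe, so the client falls back to not calling the RPC at all.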
[15:32:41] --- deason has left
[16:02:59] --- deason has become available
[16:04:04] <deason> you also lose if there are any non-openafs servers that implement the RPC in question that never had the problem to begin with
[16:04:28] <deason> I assume there are none for GiveUpAllCallBacks, but in the general case...
[16:05:50] <jhutz@jis.mit.edu/owl> You don't lose any more badly than if you never use it.
[17:07:32] <jaltman> Since the VolSync parameter was never set properly in BulkStatus and InlineBulkStatus for all implementations of the file server, for that case at least I do not think we need to worry about an implementation that never got it wrong.
[17:08:23] <jaltman> the same is true for RXAFS_GiveUpAllCallbacks
[17:29:53] --- meffie has left
[18:02:31] --- matt has left
[19:30:04] --- mdionne has left
[20:09:45] --- cudave has left: Disconnected
[20:10:42] --- cudave has become available
[21:09:52] --- Born Fool has become available
[21:45:09] --- deason has left
[21:47:14] --- Born Fool has left
[22:06:22] --- Russ has left: Disconnected