[00:34:33] --- abo has become available [00:35:53] --- abo has left [01:13:33] --- Russ has left: Disconnected [02:25:13] --- Jeffrey Altman has become available [06:18:28] --- abo has become available [06:20:36] --- abo has left: Lost connection [06:29:17] --- abo has become available [06:35:18] --- abo has left: Lost connection [06:59:54] --- mho has become available [07:01:34] --- jaltman has left: Disconnected [07:02:08] --- mho has left: Lost connection [07:16:03] --- deason has become available [07:21:46] --- jaltman has become available [07:25:47] --- jaltman has left: Disconnected [07:25:55] --- jaltman has become available [07:27:24] --- mho has become available [07:32:36] --- mho has left: Lost connection [07:51:06] --- jaltman has left: Disconnected [07:55:19] --- jaltman has become available [08:09:06] --- Jeffrey Altman has left [08:16:42] --- geekosaur has left [08:19:31] --- geekosaur has become available [08:38:14] --- matt has become available [09:06:30] could anyone tell me what the 'atid' transaction id parameter in VOTE_Beacon is intended to be for? I think I'm going to send something to -devel shortly about a related issue... but if someone could just tell me what it is (what it's supposed to represent, or how it's supposed to be checked, etc) it could save a little time [09:06:45] --- Simon Wilkinson has left [09:07:34] --- mho has become available [09:07:48] the "return code is a timestamp so if you error, life sucks" issue? [09:08:01] none of the ubik papers/documents etc I see seem to reference such a thing in the beacon messages... 
but I'm not sure if I just don't see it [09:08:18] no, the issue related to trans id rollover [09:09:19] I can show in at least one way how that situation is broken by Beacon, but I'm not sure how to alter atid's usage or interpretation, since it's not really clear what it's there for in the first place [09:09:39] it should be "the current active ubik transaction on the sync site" [09:10:21] in theory it's to allow us to figure out if we missed anything, i think, and is residual. i think. [09:10:38] --- mho has left: Lost connection [09:12:19] you can beacon and have your message arrive at a non-sync site before a message with a write arrives, due to out of order packets, for instance [09:12:34] but a beacon is rather... unrelated to transactions in terms of timing; it seems like that could cause false errors or something [09:12:58] nothing actually uses the information provided, i think. lemme lock [09:13:00] look [09:13:23] it calls urecovery_CheckTid with the passed trans id, which can invalidate the current transaction, making it abort [09:14:51] in the event of a retransmit, yes, there is probably an issue there. of course, that is the *only* place it's used. [09:15:19] (or a delayed packet due to out of order delivery somewhere along the line, or....) [09:16:14] what happens if we just ignore it? I don't immediately see how a beacon could/should interact with the transaction foo or the data itself in general [09:16:21] (I mean, that seems like DISK's job, so...) [09:17:56] in this case, actually, if the beacon is sent, then the change is accepted by a majority of servers and a second occurs, and then a second change is sent, and after this the beacon is received at a given other site, it may cause a problem unless the beacon fails a time invariant. hm. basically the timing math is the question. if the timing math works there is no issue (yet). 
[09:18:18] --- rra has become available [09:21:12] I'm not as concerned as the false-positive race stuff for today... that's just fodder for calling into question the existence of the parameter: what is the problem if it doesn't exist or is ignored? [09:22:10] possibly matters if we have a new sync site and the old one went away before a change hit the majority of sites. i am trying to work out the mechanics of if it does [09:23:20] and given that this is applied "/* A new vote or a change in the vote or changed quorum */" ... [09:23:30] only way I can see it is if the new sync site somehow sends mid-transaction DISK_ messages with the old trans id.... [09:24:51] or , no, I see it [09:25:21] it wants to abort the trans immediately if we switch sites, since otherwise the new site may get an abort on the first trans it tries to do something [09:26:10] or no, that doesn't happen, SDISK_Begin checks for an existing trans; it should go away immediately when the new site tries to do something [09:26:42] but it just looks like an extra check, to abort the trans as fast as possible if we know it's wrong [09:26:53] that's reasonable enough to me [09:28:05] worse: SDISK_Begin calls checkTid and then does the same thing [09:28:52] syscalls.master fix hit FreeBSD HEAD this morning. [09:29:18] the call to checkTid is pointless if we immediately wipe anyway. [09:29:24] ben: excellent. [09:29:42] but SDISK_Begin should invalidate the trans; it seems like beacon can do so sometimes when the trans is fine [09:29:55] well, it's not pointless for UBIK_PAUSE apparently [09:30:00] though I don't know what that define means [09:31:02] marcus believed it was possible in a heavy-contention ubik environment for ubik to end up hung. sadly, this is a workaround and we got no diagnosis of the real issue. nor any other reports. [09:33:06] oh and no, it is the same for UBIK_PAUSE, CheckTid has that same conditional... "what" [09:34:00] and while I'm looking at that... 
is there a legit reason that the CheckTid test is testing the transid in a "greater than" test, not a "not-equal" test? [09:34:03] uh. look at... [09:34:11] 2628 [09:34:40] legit reason? doubt it. [09:39:11] trying to think about it, it seems really... weird; a trans id that arrives _after_ the "current" trans is bad, but if it's from before the "current" trans, it's okay? [09:48:39] Re syscalls: how rude would it be to make users recompile all userland apps for a module upgrade? (I'm kind of tempted to switch AFS_SYSCALL to 377 on FBSD90, since that's nominally ours and actually works, and switch it to one of the generic loadable syscalls on older FBSD.) [09:49:59] how many userland apps do we have? part of the problem is most afs system calls are pioctl, not afs, really. in spite of the fact that on most platforms we piggyback pioctl on afs [09:51:48] ben: what/ [09:51:48] ? [09:51:57] Our userland apps? [09:52:23] well, and anything which sets tokens or makes pags. [09:52:35] Yeah. [09:52:38] ben, actually, what arla apps are we compatible with that you'd sacrifice here [09:52:58] I know very little about arla. [09:54:38] I don't see a problem with saying that our userland has to match our kernel. It doesn't look like anyone actually relies on Arla compatibility, on FBSD? [09:54:54] At least, syscalls. [09:55:21] I think Arla only claims to work on FBSD 5? [09:56:07] And that's not the same thing as saying users have to compile it. They can get a dist release, if we can be counted on to make them. [09:56:32] well, anything which uses libkafs from not-us will expect the arla semantic [10:00:14] The alternative is for us to be able to use the same syscall as Arla, and as far as I know, there's no other operational difference. Is that correct? And if so, what was the exact obstacle to doing that? [10:01:30] I think we can. I haven't changed it for my testing. 
[10:02:03] There might be problems if we ever try to provide compatibility for 32-bit applications on 64-bit kernels. [10:02:11] How so? [10:03:12] in what sense? note that we do just that on macos 10 [10:03:22] and the code to "make right" our input exists [10:03:39] I'm a bit fuzzy on the details, but apparently the compat32 mechanism translates arguments from the 32-bit apps to the 64-bit syscall, which involves munging when types (e.g. pointers) change size. [10:03:51] (actually, worse, on macos, we have compatibility syscalls to run 64 bit code on 32 bit machines too) [10:04:05] yeah. that's fine. we do that for macos [10:04:15] I don't know whether this actually uses information from the prototype in syscalls.master or whether it is entirely in an argument-processing function that we provide. [10:04:48] But how does compat32's hackery relate to the syscall number specifically, except insofar as if Arla doesn't understand compat32, that might not work? [10:05:18] I'm not sure. [10:05:42] notable functions: copyin_afs_ioctl/afs_ioctl32_to_afs_ioctl, copyin_iparam, and the proc_is64bit case in afs_syscall [10:06:18] I attempted to put a prototype for nnpfs_syscall (i.e. arla) in syscalls.master for number 339, and a prototype for our afs3_syscall at number 377. But again, it's not clear that this information is actually used anywhere. [10:07:44] This just says we're not going to share the syscall with Arla. I don't care, but this is the sort of thing it's going to be people other than me, that care about it (if any). [10:09:43] > This just says we're not ... share ... arla Er, which is "this"? [10:09:55] > I attempted to put a prototype for nnpfs_syscall (i.e. arla) in syscalls.master for number 339, and a prototype for our afs3_syscall at number 377. [10:10:06] Maybe I misunderstood that. [10:10:15] If we're actually ABI-compatible with arla, we can continue to use 339 for those things. 
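The size-munging described above (32-bit arguments widened for a 64-bit syscall) can be pictured with a small sketch. It assumes a hypothetical pioctl-style argument block holding two pointers and two 16-bit lengths; the layouts and the name `widen_ioctl32` are made up for illustration and are not taken from the OpenAFS or FreeBSD sources.

```python
import struct

# Hypothetical argument block: two pointers followed by two 16-bit sizes.
# These layouts are assumptions for illustration, not the real kernel ABI.
IOCTL32 = struct.Struct("<IIhh")   # 32-bit caller: 4-byte pointers
IOCTL64 = struct.Struct("<QQhh")   # 64-bit kernel: 8-byte pointers


def widen_ioctl32(raw: bytes) -> bytes:
    """Re-pack a 32-bit argument block in the 64-bit layout."""
    in_ptr, out_ptr, in_size, out_size = IOCTL32.unpack(raw)
    # Pointers are zero-extended; the size fields keep their width.
    return IOCTL64.pack(in_ptr, out_ptr, in_size, out_size)


raw32 = IOCTL32.pack(0x0804A000, 0x0804B000, 64, 128)
raw64 = widen_ioctl32(raw32)
print(IOCTL32.size, IOCTL64.size)
```

In this toy model the syscall number plays no part in the translation. Whether that also holds for the real compat32 machinery is the question the log leaves open.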
[10:10:57] 339 was originally added/reserved for arla, and 377 was originally added/reserved for us. [10:11:14] At the same time, longish ago? [10:11:24] But we've been happily using 339 for a long time. [10:11:34] Different times, but yeah, long ago. [10:11:39] Ok. [10:12:27] (339 in 1999, 377 in 2001) [10:15:13] --- mho has become available [10:15:34] --- mho has left: Lost connection [10:18:07] If we feel like separating our pioctl from other stuff, it would be pretty reasonable to keep pioctl at 339 and move the other stuff to 377, I think. But if it's only our bundled utilities that get run, it doesn't really matter what the number is. [10:31:40] --- mho has become available [10:33:43] --- mho has left: Lost connection [13:13:17] I'm very wary of glibly making changes to Ubik. The bar here has to be that there is an actual problem _and_ the proposed change is necessary to fix it _and_ the proposed change can be shown not to make things work. "I tried it and it seems to still be OK" is not proof. [13:13:55] nothing has been done other than code deduplication. [13:16:08] I strongly suspect that paulaner is not going to move from Panix until I get back from California [13:16:50] Well, someone was asking "what is this parameter for?" and seemed to be suggesting omitting an existing transaction ID check in beacon processing [13:17:13] please read Andrew's e-mail to -devel. [13:17:25] I agree with your conservatism. [13:32:44] > _and_ the proposed change can be shown not to make things work. I assume that's s/work/break/ ? ;) [13:34:52] Those sound like pretty significant changes. I don't see any argument regarding why it's safe to change the way in which write transactions are counted. I'm not convinced that making the urecovery_CheckTid check more strict won't lead to spurious failures in the remote package. [13:35:00] Yes [13:35:59] What is a spurious failure? I assume that Andrew you have or would construct a [13:36:04] I think your analysis is wrong. 
writeTidCounter is the write transaction ID counter, which is different from the transaction ID. However, it _is_ the space of ubik_currentTrans, which is used by the remote package to track the current remotely-initiated write transaction. [13:36:13] range of scenarios by which to prove Ubik is correctly handling these? [13:37:01] A spurious failure is, for example, "oops, the transaction ID sent was one larger than the counter in ubik_CurrentTrans, so BOOM you lose a server". [13:37:21] Right, I'm not in the discussion of correctness, I do NOT know Ubik. [13:37:37] Where "lose a server" means the remote op returns failure, which causes the recovery engine to declare that server out of date, which means it doesn't get further transactions until it gets a complete database update. [13:37:57] But also the rollover cases seem important. [13:40:07] I'm not sure I'm following the writeTidCounter comments above... is writeTidCounter an entirely separate id-space than the regular tidCounter? [13:40:31] The point here is that empirical tests are not sufficient; I want a clear understanding that the proposed change does not break the properties of the distributed algorithm. I've spent a lot of time and effort and had plenty of problems due to people making changes willy-nilly to fileserver client-tracking code, and have been made very nervous by various proposed changes to code affecting the distributed cache consistency algorithm, mostly from people who refuse to admit that we even have such a thing. I will _not_ allow Ubik to be broken that way. [13:40:58] under UBIK_PAUSE it seems to imply that it just records a particular tidCounter as when the last write transaction was, but without UBIK_PAUSE it seems to imply a completely separate namespace [13:41:26] and when we call DISK_Begin, we always use tidCounter, which would imply that they are the same namespace [13:41:57] i should read the email. 
i will note the UBIK_PAUSE code is not code i would use to help me understand anything. [13:42:08] (This is off topic, but to whom does "people" in "changes to code affecting the distributed cache consistency algorithm, mostly from people who refuse to admit that we even have such a thing" refer?) [13:43:10] and there aren't even any empirical tests yet; I'm not even trying to show yet that any of the proposed ideas aren't broken [13:43:55] > is writeTidCounter an entirely separate id-space Yes, unless UBIK_PAUSE is defined, which it never is. [13:44:26] okay, that makes sense; so should the call to DISK_Begin base it's transaction id counter based on writeTidCounter ? [13:44:33] "base its" [13:44:51] Anyone who, in discussions relating to extended callbacks and other things, asserted to me that we don't actually have cache consistency and therefore it is OK to make changes that might not preserve it. [13:45:01] I think the consistency guarantee being made is sync on close. We do need to define cache consistency in terms of the data visible at clients, etc. [13:45:24] there is no sync on close on Windows. Sync on Close is a Unix specific implementation detail. [13:45:36] i can't stop the ranting, but it's rather distracting from the actual topic at hand. [13:45:42] Jaltman: if that is the case, you must be one of "people." [13:45:48] e.g. i'd like it if it went away [13:46:05] No, actually, I think the right answer is to never use writeTidCounter for anything. That is, calls to UVOTE_Beacon() should not use it either. The tid space of ubik_currentTrans is the same as the local tid space on the sync site. Except I'm not actually sure that's right; I'd need to study it more. [13:46:05] egad, if we're going to try and debate three things at once in here.... [13:46:07] (ok, shutting up now. we need to return to this topic in an organised way.) 
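The two-namespace question above can be made concrete with a toy model. It follows only what the log says: DISK_Begin always uses tidCounter, and writeTidCounter is either a free-running counter or, under UBIK_PAUSE, a record of the write transaction's own id. The class `ToyDbase`, its fields, and the assumption that read transactions also consume ids from tid_counter are hypothetical and not taken from the ubik source.

```python
class ToyDbase:
    """Toy model of the two counters; not the ubik data structure."""

    def __init__(self, pause=False):
        self.pause = pause            # stands in for the UBIK_PAUSE build option
        self.tid_counter = 0          # id handed to every transaction (assumed)
        self.write_tid_counter = 0    # value a beacon would carry

    def begin(self, write):
        """Start a transaction and return the id DISK_Begin would send."""
        self.tid_counter += 1
        if write:
            if self.pause:
                # Record the id of the write transaction itself.
                self.write_tid_counter = self.tid_counter
            else:
                # Free-running counter in its own namespace.
                self.write_tid_counter += 1
        return self.tid_counter


def beacon_vs_begin(pause):
    """Return (id carried by a beacon, id sent by DISK_Begin) after two reads and one write."""
    db = ToyDbase(pause)
    db.begin(write=False)
    db.begin(write=False)
    sent_by_begin = db.begin(write=True)
    return db.write_tid_counter, sent_by_begin


print(beacon_vs_begin(pause=False))
print(beacon_vs_begin(pause=True))
```

Under these assumptions the beacon and DISK_Begin carry different numbers for the same write transaction unless the counters are tied together, which is the mismatch the discussion is circling around.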
[13:47:01] well, okay, the current value/usage of writeTidCounter seems worthless for net communication then, which is fine, but.... [13:47:37] we need to give the correct transaction to VOTE_Beacon, which my current understanding says that we need to record the trans counter for the current in-flight transaction somewhere [13:48:08] so I either need a new var to record such a thing... unless writeTidCounter _is_ supposed to _be_ that thing already, and is just being used incorrectly [13:48:58] --- meffie has left [13:49:10] --- meffie has become available [13:52:26] Yeah, we do need to know the in-flight transaction ID. Hrm; that's annoying. [13:53:06] hence my question of why we actually need it... [13:53:26] No, we do need it, because we need to do the check. [13:55:03] I'll agree, provisionally, that we probably should be bumping writeTidCounter to be the same as the database transaction number of the write transaction, a la UBIK_PAUSE, rather than letting it be a separate free-roaming namespace. But I don't know what other implications there are of that. [13:56:05] writeTidCounter is only used when it is incremented, and by the VOTE_Beacon call I think; I can double-check that again, though [13:56:11] I suspect it's fine, since dbase->writeTidCounter appears to be used _only_ to communicate the write transaction ID from BeginTrans() to the beacon loop [13:56:51] I shall have to think on this a bit more. [13:57:22] > because we need to do the check. okay, but why? 
I'm not particularly interested in getting rid of it or even modifying it, but the only "documentation" I have on this check is the conversation I had with derrick earlier today [13:57:35] ...which consisted of me guessing [14:02:02] because that check is what causes a stale, in-progress transaction to be aborted if the sync site changes or if we receive a beacon or remote disk operation from the sync site after there have been intermediate changes [14:02:49] (recovery is triggered by the sync site having earlier noticed we are out of contact, and then when we reappear, checking our db version and sending us a new database if needed) [14:05:24] I suppose we could make the check stricter in certain cases, but that's incidental to the problem. [14:07:27] > because that check is what causes a stale, in-progress transaction it is one thing that causes that, but a DISK_Begin also aborts an in-progress transaction, no? [14:07:49] fyi, prior to afs3.4 there was no writeTidCounter. ubeacon_Interact() set ttid.counter to ubik_dbase->tidCounter in the DBWRITING case. [14:08:18] Yeah; "prior to afs 3.4" is only of historical interest, I think [14:08:50] can be helpful for intent, though [14:09:24] Sure but the hard part here is determining what the intent was when the code was written. What problem were the authors trying to solve by adding it? [14:09:42] what I thought happened is someone added writeTidCounter as it is now, and urecovery_CheckTid was changed after stuff started breaking semi-randomly [14:09:58] or urecovery_CheckTid was already the way it is now for some other reason [14:10:05] It can, but you need the check to happen on every remote operation, so you don't inappropriately apply remote changes to not the database they were intended for, and think everything is OK. [14:10:47] Anyway, I need to think about this for a while, and I also need to get other work done today. 
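The difference between the "greater than" test and a stricter "not-equal" test, as discussed above, can be sketched as follows. This is a toy comparison and not the body of urecovery_CheckTid. The `Tid` shape (epoch plus counter), the reading of an epoch change as a sync-site change, and both function names are assumptions for illustration.

```python
from collections import namedtuple

Tid = namedtuple("Tid", ["epoch", "counter"])  # assumed shape of a transaction id


def aborts_loose(current, incoming):
    """'Greater than' reading: abort on a new epoch or a later counter."""
    return incoming.epoch != current.epoch or incoming.counter > current.counter


def aborts_strict(current, incoming):
    """'Not-equal' reading: abort on any mismatch."""
    return incoming != current


current = Tid(epoch=100, counter=7)      # remote write transaction in progress
cases = {
    "same transaction": Tid(100, 7),
    "delayed beacon from an earlier trans": Tid(100, 6),
    "intermediate changes were missed": Tid(100, 9),
    "sync site changed": Tid(200, 1),
}
for name, tid in cases.items():
    print(name, aborts_loose(current, tid), aborts_strict(current, tid))
```

In this model the two tests differ only for an id older than the current transaction, such as a delayed or retransmitted beacon. Whether aborting in that case is a spurious failure or a required safety property is what the log leaves unresolved. A counter that wraps around would make an old id compare as greater, which may be related to the rollover concern raised earlier, though the log does not spell that out.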
[14:11:48] hmm, but could checking just the trans epoch be equivalent for checking the changed site? (for just SVOTE_Beacon, not urecovery_CheckTid in general) [14:12:04] and yeah, I'm not pushing for answers right this instant [14:12:47] changed site is not the only thing you need to check for. you also care about you losing contact for half an hour and missing a half dozen transactions. [14:13:18] It's very important that your stale in-progress transaction not be able to complete, because then you might relabel your database as being something that it isn't. [14:13:25] in 3.4 the only change to urecovery_CheckTid() was to change the comment from "there is a remote trans" to "there is a remote write trans". The rest of the code is identical to how it is today except that the UBIK_PAUSE case did not exist back then [14:13:42] I too much work on other deadlines [14:13:55] s/much/must [14:16:04] > not be able to complete yes; I'm not seeing how it could still complete even without the SVOTE_Beacon tid check at all.... but I don't want to prolong this [14:16:38] > I'm not seeing how Insufficient. Prove that it cannot complete without the check. [14:17:17] or, you know, don't. :-) [14:17:35] I'm not trying to _prove_ it's correct; I'm just wondering [14:17:47] "I'm not sure if it's safe or not" is fine, but it should be marked as such somewhere [14:18:05] instead of having it (implicitly) marked as "this is not safe, so we don't do it" [14:20:27] See, that's the thing. Conservatism here argues for assuming that _every_ change is unsafe until shown otherwise. [14:21:30] yes, but it should be known if it's unsafe or not, so if it seems desirable to change in the future, the pain of changing it can be gauged against the pain of keeping it the same. as in... 
[14:22:11] if it's known to be unsafe to do X, then you don't do X and you live with it; if doing X would make something else really easy you throw up your arms and say "alas" [14:22:32] but if it's not known if X is unsafe or not, then you can explore to see if you can prove X is safe or unsafe [14:23:21] right now I'm in the situation that "I don't know whether jhutz knows X is unsafe or if X is not-known-to-be-safe" [14:23:51] I'll take your word for it (in this case, at least ;) if you say it is known to be unsafe, but I haven't heard that yet [14:24:28] if it's unknown... then I'd like to write it down so I don't have to have this conversation again next year when I forget it [14:25:07] i missed some of this and should go read the log. however, what i'll say is it may be worth re-editing a copy of the kazar paper and shipping it [14:25:19] or, well, excerpting from it [14:25:39] which one? the "replicated servers made easy" one? [14:26:26] if that's the one which describes quorum completion, yes [14:26:44] (no document I've yet found relates ubik VOTE_* transactions to beacons, which includes that paper... but I don't trust myself to interpret academic papers correctly" [14:26:47] er, ) [14:27:21] the timing descriptions are part of the equation. [14:27:32] (and should be summarized) [14:28:47] okay, yeah, the "quorum completion" one is something else... the one with the Q(A,T) notation [14:40:03] I think Derrick is referring to this paper /afs/cs/academic/class/15612-s98/projects/Scraw/www/design/ubik.ps [14:45:41] That does appear to mostly be a copy of the paper, though without the references. [14:47:25] Unfortunately, that is far from all you need to understand the system. For example, ReadAny transactions (now used for nearly all read-only operations) are a more recent invention that modify some of the original assumptions. And the introduction of non-voting clones also has a subtle effect on recovery. 
[15:44:16] --- deason has left [16:03:26] --- matt has left [16:16:50] --- deason has become available [18:14:56] --- rra has left: Disconnected [18:33:54] --- Russ has become available [18:36:05] --- abo has become available [18:37:40] --- abo has left [18:38:11] --- abo has become available [18:45:08] --- abo has left: Lost connection [18:54:34] --- abo has become available [19:01:47] --- asedeno has left [19:03:55] --- andersk has left [19:04:56] --- andersk has become available [19:04:58] --- abo has left: Lost connection [19:07:45] --- andersk has left: Lost connection [19:07:45] --- kaduk@mit.edu/barnowl has left: Lost connection [19:47:51] --- jaltman has left: Disconnected [19:53:16] --- jaltman has become available [19:57:48] --- jaltman has left: Replaced by new connection [19:57:49] --- jaltman has become available [20:20:06] --- asedeno has become available [20:54:19] --- abo has become available [20:56:46] --- abo has left [20:56:46] --- abo has become available [21:18:55] --- abo has left: Lost connection [22:12:13] --- deason has left [22:30:44] --- jaltman has left: Disconnected [22:30:52] --- jaltman has become available