[00:34:02] --- dev-zero@jabber.org has left
[00:41:34] --- dev-zero@jabber.org has become available
[03:06:36] --- dev-zero@jabber.org has left
[03:09:13] --- dev-zero@jabber.org has become available
[04:25:37] --- dev-zero@jabber.org has left: Machine going to sleep
[05:05:00] --- meffie has become available
[05:16:13] --- meffie has left
[05:16:56] --- mmeffie has become available
[05:17:39] --- mmeffie has left
[05:17:40] --- mmeffie has become available
[06:02:12] --- mvitale has become available
[07:34:55] --- deason has become available
[07:38:40] natefoo: 'vos changeaddr -remove' if one of the servers doesn't have any volumes recorded for it
[07:39:57] --- dev-zero@jabber.org has become available
[07:41:26] one private ip had 0 volumes and i changed it to a "fake" ip based on some old transarc docs i found.
[07:41:45] here's what the VLLog on that fileserver was saying: https://gist.github.com/3837663
[07:41:54] docs: http://rzdocs.uni-hohenheim.de/afs_3.6/debug/admin/news/sysid_4.html
[07:42:15] it's okay now although the output of vos listaddrs -printuuid is still a bit odd, and the private ips are still in there.
[07:43:06] i added NetRestricts to both of the ec2 instances, and it did regenerate sysid, which seems good.
[09:06:11] deason: re gerrit 8207 and 8208, do we want all those lines like cherry-pick -x gives?
[09:07:07] yes
[09:07:09] er, well, I want them
[09:08:13] ok.
[09:09:57] do you not?
[09:21:58] i do now.
[09:27:28] --- dev-zero@jabber.org has left
[09:28:09] it makes it easier to identify where they come from, and the 'cherry picked from' format being identical makes it easier to script stuffo
[09:28:14] stuff
[09:30:30] yeah i see your point. it is history.
[09:38:38] deason: can you see any problems with the private ips being present in the vldb, or should i leave it as-is?
[10:37:52] mmeffie: the cherry-picked lines must match cherry-pick -x so that scripts can compare the 1.6 branch and master branch to determine what has not been cherry-picked
[10:42:44] ah, i see.
[10:42:47] natefoo: private ips being in the vldb will be given to clients. If a client attempts to access the private ip and cannot reach it, it will have to timeout before it fails over to the next one.
[10:43:14] --- mmeffie is now known as meffie
[10:43:50] If a patchset has to be backported because a straight application would not work, the commit message should explain what the differences are from the original patchset.
[10:43:58] jaltman: that's what i was afraid of.
[10:44:02] unless they are minor
[10:44:17] natefoo: that is what the NetRestrict file is for
[10:46:28] yeah, i have NetRestricts.
[10:47:05] so right now, one of the two fileservers is showing its volumes only via the public address (with vos listvldb -server ), i can presumably vos changeaddr -remove that one.
[10:47:09] what do i do about the other one?
[10:54:17] wait, you want to remove the public one?
[10:54:42] showing 'vos listaddr -printuuid' would make this less confusing, if you're willing to show that
[10:55:00] er, -printuuid -noresolve
[11:05:02] i want to remove the private one.
[11:06:04] one sec, i'll make a little gist with the relevant bits.
[11:07:44] sorry, I interpreted "that one" as the public one, didn't see that ""
[11:09:09] https://gist.github.com/3841408
[11:09:15] ahh.
[11:12:55] did you say you ran changeaddr before? do you have what exact commands you ran?
[11:17:27] i did a couple changeaddrs, i doubt i have the commands anymore but i'll look.
[11:17:46] one was trying to do changeaddr for one of them, i think delirium.
[11:18:06] iirc that never worked, it just resulted in saying they were the same address.
[11:19:51] --- dev-zero@jabber.org has become available
[11:21:48] okay, well, so you know, 'vos changeaddr' is really only for non-uuid servers; you don't run it for modern stuff, and it can screw up stuff in the vldb if you do
[11:22:04] I'm not sure how you get that result exactly
[11:22:06] , though
[11:23:33] well, for changing addrs for non-uuid servers, or trying to fix issues like this
[11:24:50] you might be able to changeaddr the 10.245.66.22 to something bogus , and then remove that; I'm not sure if that's what you tried to do?
[11:25:05] it looks like i tried that at some point.
[11:25:33] nate/admin@chouffe% vos changeaddr 10.245.66.22 10.1.1.1
[11:25:50] what happens when you do that?
[11:27:55] the vldb_check on master may be helpful to show whats going on
[11:29:21] nate/admin@chouffe% vos changeaddr 10.245.66.22 10.1.1.1
[11:29:21] Could not change server 10.245.66.22 to server 10.1.1.1
[11:29:21] VLDB: no such entry
[11:32:51] --- jaltman has become available
[11:33:13] --- jaltman has left: Disconnected
[11:33:41] well, I think it's clear what's going on; just two mh server entries with the same uuid
[11:34:04] I think that's happened before, but I don't remember what we did
[11:34:33] maybe it needs the 255.* address to switch it?
[11:38:19] or wait no, I think I'm thinking of the opposite, where the ips were dup'd but not the uuid
[11:38:30] you can't 'vos changeaddr 10.245.66.22 -remove' ?
[11:41:16] --- natefoo has left
[11:58:23] --- natefoo has become available
[11:59:43] nope: nate/admin@chouffe% vos changeaddr 10.245.66.22 -remove
[11:59:43] Could not remove server 10.245.66.22 from the VLDB
[11:59:44] vlserver does not support the remove flag or VLDB: no such entry
[12:06:01] i wonder if you remove the first ip, then the second, and let the fileserver reg the correct one again.
[12:06:55] do i run any risk of losing volumes with any of this?
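(Editor's sketch, not part of the log.) The diagnosis at 11:33:41, two multi-homed server entries sharing one uuid, can be checked mechanically from saved 'vos listaddrs -printuuid -noresolve' output. The sketch below assumes that output consists of blocks that start with a "UUID: ..." line followed by one address per line, with blank lines between blocks; that layout is an assumption, and the uuids and the 192.0.2.* addresses in SAMPLE are made-up placeholders, not data from the gists. The function names are hypothetical.

```python
from collections import defaultdict

# Made-up sample in the ASSUMED 'vos listaddrs -printuuid -noresolve' layout.
SAMPLE = """\
UUID: 00000000-0000-0000-00-00-000000000001
192.0.2.10

UUID: 00000000-0000-0000-00-00-000000000001
10.245.66.22

UUID: 00000000-0000-0000-00-00-000000000002
192.0.2.20
"""


def parse_listaddrs(text):
    """Return a list of (uuid, [addresses]) entries in output order."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("UUID:"):
            # A new server entry begins here.
            current = (line.split(":", 1)[1].strip(), [])
            entries.append(current)
        elif not line:
            # A blank line closes the current entry.
            current = None
        elif current is not None:
            # An address line belonging to the current entry.
            current[1].append(line)
        # Address lines outside any UUID block are ignored.
    return entries


def duplicate_uuids(entries):
    """Map each uuid that appears in more than one entry to its address lists."""
    by_uuid = defaultdict(list)
    for uuid, addrs in entries:
        by_uuid[uuid].append(addrs)
    return {u: lists for u, lists in by_uuid.items() if len(lists) > 1}
```

Calling duplicate_uuids(parse_listaddrs(text)) on real output would list any uuid registered more than once together with the addresses recorded under each copy. It only reports; it does not touch the vldb.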
[12:07:38] i could just remove all of the ips, delete sysid, and restart the fileservers with the correct NetInfo and NetRestrict in place.
[12:08:07] (losing volumes, corrupting vldb, or anything else similarly catastrophic).
[12:13:12] well, you can always 'fix' it by just removing the vldb entirely and syncing the volumes later
[12:13:45] I'm not sure I see why that doesn't work, though; the vlserver looks through the mh entries looking for a matching ip; it shouldn't matter that the uuid is dup'd
[12:14:44] does that seem familiar at all, meffie?
[12:18:45] last time i looked at this, i think i saw cases where vos remove can't fix duplicates in the vlserver. i made changes to vldb_check to clean up some situations.
[12:20:42] dup uuids, though? not addrs?
[12:21:09] maybe something else is going on... natefoo: can you provide 'vldb_check /usr/afs/db/vldb.DB0 -servers' (or wherever the vldb is)
[12:22:14] and what vlserver version?
[12:22:47] for all 3 vlservers?
[12:24:28] yes, duplicate uuids
[12:24:31] all three are 1.6.1-1 (ubuntu aka debian)
[12:25:30] https://gist.github.com/3841827
[12:26:38] sadly we dont display the uuid.
[12:28:12] that gist i posted last night had the server ids though.
[12:28:19] we can figure out the uuids
[12:28:28] he can't delete the stray one because the mh entry isn't referenced
[12:28:41] and we try to find it by scanning HostAddrs, so we never get to that mh
[12:29:33] yup, hence the change to remove unref'd entries.
[12:30:19] the changeaddr I assume set addr 2 to the bogus address, and didn't clear mh 3
[12:32:00] probably.
[12:34:30] that prevents a registeraddrs call, though? I don't see that
[12:34:39] it seems like it would skip over it in the same way, but maybe I'm misreading
[12:37:13] iirc, everything, except get-addrs-u, traverses the hostaddrs table
[12:37:27] --- Derrick Brashear has left
[12:37:32] --- Derrick Brashear has become available
[12:38:35] natefoo: when you restart the fileserver now, it still gives that error in VLLog?
[12:38:50] I'm not sure / don't remember if you've tried after trying to move/remove the address
[12:38:51] the one i had last night? no.
[12:39:21] i can restart it now, though.
[12:39:26] all of this is still in development.
[12:39:45] but you have the 10.* ip in netrestrict, right?
[12:41:02] sorry, I thought the problem was that the fs wasn't registering the addresses properly because of vlserver complaining about it overwriting some addresses
[12:41:05] yeah, i do.
[12:41:15] although at various times when i started it in the past, it wasn't.
[12:41:16] --- mvitale has left
[12:41:19] but if the 10.* addr is in netrestrict, then that vldb data makes sense
[12:41:25] last time i started it, the 10.* address was in there.
[12:41:31] I'm not sure what the issue now is?
[12:41:46] yeah, i had the problem there of not registering properly last night.
[12:42:00] i fixed that, but now i need to get rid of the 10.* addresses.
[12:42:19] because anything outside of ec2 is going to have problems accessing those volumes.
[12:42:53] okay, sorry, reading scrollback
[12:43:22] so, the "private ips being in the vldb" isn't a problem itself; reporting the private ips to clients can be a problem because they'll try to access it and timeout, etc, like jaltman said
[12:43:48] however, your 10.245.66.22 isn't being used to report to clients; it's a stray entry in the vldb that's not referenced by anything
[12:44:13] that isn't going to hurt anything, though it looks a bit confusing
[12:44:13] how about the other one? 10.211.157.137
[12:44:40] the 10.211.157.137 is being reported (according to the vos listaddrs -printuuid output); restarting the fileserver with that in netrestrict is supposed to make it go away
[12:44:54] okay, let me try that.
[12:45:06] and from the previous changeaddr usage, you have an extra server entry for 10.1.1.1, which you probably want to changeaddr -remove
[12:45:43] okay, that was successful.
[12:46:16] (which; the fileserver restart, or removing 10.1.1.1?)
[12:46:22] both!
[12:46:28] duvel disappeared, no idea why that would be.
[12:46:46] is there any way to eliminate 10.245.66.22 to satisfy my inner ocd?
[12:48:18] meffie wrote modifications to vldb_check to clean up unreferenced entries, but it's not in a release yet
[12:48:46] you can either build it from source with those modifications and run it with -fix (backup the vldb first), or give it to one of us and we can send back a fixed copy
[12:48:52] er, 'it' being the vldb
[12:49:12] i.e. vldb.DB0 ?
[12:49:27] yes
[12:49:50] if you'd be so kind...
[12:49:59] (thanks for all the help!)
[12:50:22] where should i send it?
[12:50:59] --- mvitale has become available
[12:56:03] adeason@sinenomine, mmeffie@sinenomine.net (or just put it up somewhere; it's not sensitive info)
[12:59:49] /afs/galaxyproject.org/user/nate/public/vldb.DB0
[13:07:54] /afs/sinenomine.net/user/adeason/public/natefoo_vldb.DB0.fixed
[13:08:54] you may want to run 'vldb_check -uheader -vheader -servers -entries' against both and diff the output, just to make sure nothing else changed
[13:12:01] okay.
[13:12:08] how do i safely install this?
[13:13:26] diff looks good.
[13:14:49] stop all the vlservers, put this file in place on all of them, and start them again
[13:14:57] okay.
[13:15:20] (maybe the _fix utilities should increase the ubik epoch or counter? currently it just looks like the same unchanged db)
[13:15:49] yeah, the header is not updated.
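(Editor's sketch, not part of the log.) The check suggested at 13:08:54, dumping both the original and the fixed vldb with vldb_check and diffing the output, can be scripted once the two dumps are saved as text. The sketch below assumes the dumps are already in hand as strings; changed_lines is a hypothetical helper name, and the sample lines used to exercise it are placeholders, not real vldb_check output.

```python
import difflib


def changed_lines(before, after):
    """Return only the removed ('-') and added ('+') lines between two dumps."""
    diff = list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="original",
            tofile="fixed",
            lineterm="",
            n=0,  # no context lines, only the changes themselves
        )
    )
    # The first two lines are the '---'/'+++' file headers (absent when the
    # inputs are identical); '@@' hunk markers are filtered out below.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

If the fix only dropped the unreferenced entry, the result should contain nothing but the removed lines describing that entry; any other '+' or '-' line is worth a closer look before the fixed file is installed.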
[13:15:57] er, to be more safe, you need to delete the DBSYS files, too, but I'm assuming you're not changing the vldb while doing this
[13:16:22] and if you are, of course installing this fixed file is going to undo any changes anyway :)
[13:18:54] success!!
[13:19:00] cool.
[13:19:18] deason, meffie: +beer
[13:20:03] i'm not sure if the vldb_check -fix should bump the epoch. i feel it should, but maybe only with a new switch?
[13:20:55] I'm not sure I see any danger in not updating it, but I'm not sure
[13:21:16] er, I mean, danger in updating it
[13:21:55] the only 'bad' scenario I can see is if doing so then entails the 'fix'ed vldb overwriting an old one when you didn't want it to overwrite
[13:21:57] in this case, it's easy because natefoo was able to shutdown all the vlservers.
[13:22:01] but in that case... why did you install it?
[13:22:53] I don't know, it's not like this comes up very often
[13:25:14] so, is there something we need to do to prevent the dupe uuid in the first place?
[13:25:36] wish i remember exactly how i did that.
[13:26:02] must be change-addr that made the unref'd entry.
[13:26:07] oh, it might've been vos setaddr
[13:27:10] nate/admin@chouffe% vos setaddrs 00496d26-2f51-106d-94-75-1642f50aaa7 delirium.galaxyproject.org
[13:27:13] there we go.
[13:28:11] gotta run, but i'll be back later tonight.
[13:28:14] thanks again for the help.
[13:28:29] that's possible, but my guess would be on the changeaddr or something else
[13:29:00] the setaddrs code for finding the uuid is pretty simple; I think the problem started when the dup mh entry became unreferenced
[14:05:51] --- meffie has left
[14:11:33] --- dev-zero@jabber.org has left
[14:11:35] --- dev-zero@jabber.org has become available
[14:19:30] --- deason has left
[14:19:30] --- deason has become available
[14:26:47] --- mvitale has left
[14:48:35] --- dev-zero@jabber.org has left
[14:48:58] --- dev-zero@jabber.org has become available
[15:35:33] --- mdionne has become available
[16:00:37] --- deason has left
[18:13:36] --- jaltman has become available
[18:17:46] --- jaltman has left: Disconnected
[18:45:06] --- jaltman has become available
[18:45:14] --- jaltman has left: Disconnected
[19:48:46] --- mdionne has left
[21:37:06] --- Derrick Brashear has left
[21:37:11] --- Derrick Brashear has become available
[22:37:12] --- natefoo has left
[22:54:02] --- kula has left