[00:34:02] --- dev-zero@jabber.org has left
[00:41:34] --- dev-zero@jabber.org has become available
[03:06:36] --- dev-zero@jabber.org has left
[03:09:13] --- dev-zero@jabber.org has become available
[04:25:37] --- dev-zero@jabber.org has left: Machine going to sleep
[05:05:00] --- meffie has become available
[05:16:13] --- meffie has left
[05:16:56] --- mmeffie has become available
[05:17:39] --- mmeffie has left
[05:17:40] --- mmeffie has become available
[06:02:12] --- mvitale has become available
[07:34:55] --- deason has become available
[07:38:40] <deason> natefoo: 'vos changeaddr <addr> -remove' if one of the servers doesn't have any volumes recorded for it
[07:39:57] --- dev-zero@jabber.org has become available
[07:41:26] <natefoo> one private ip had 0 volumes and i changed it to a "fake" ip based on some old transarc docs i found.
[07:41:45] <natefoo> here's what the VLLog on that fileserver was saying: https://gist.github.com/3837663
[07:41:54] <natefoo> docs: http://rzdocs.uni-hohenheim.de/afs_3.6/debug/admin/news/sysid_4.html
[07:42:15] <natefoo> it's okay now although the output of vos listaddrs -printuuid is still a bit odd, and the private ips are still in there.
[07:43:06] <natefoo> i added NetRestricts to both of the ec2 instances, and it did regenerate sysid, which seems good.
[09:06:11] <mmeffie> deason: re gerrit 8207 and 8208, do we want all those lines like cherry-pick -x gives?
[09:07:07] <deason> yes
[09:07:09] <deason> er, well, I want them
[09:08:13] <mmeffie> ok.
[09:09:57] <deason> do you not?
[09:21:58] <mmeffie> i do now.
[09:27:28] --- dev-zero@jabber.org has left
[09:28:09] <deason> it makes it easier to identify where they come from, and the 'cherry picked from' format being identical makes it easier to script stuff
[09:30:30] <mmeffie> yeah i see your point. it is history.
[09:38:38] <natefoo> deason: can you see any problems with the private ips being present in the vldb, or should i leave it as-is?
[10:37:52] <jaltman/FrogsLeap> mmeffie: the cherry-picked lines must match cherry-pick -x so that scripts can compare the 1.6 branch and master branch to determine what has not been cherry-picked
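A minimal sketch of the kind of comparison script mentioned here, assuming the trailer has the exact form "(cherry picked from commit <40 hex digits>)" that `git cherry-pick -x` writes. The function names are hypothetical and are not taken from any OpenAFS tooling; the inputs are plain lists so that the commit data can come from wherever it is convenient.

```python
import re

# Trailer line written by `git cherry-pick -x`; the 40-hex-digit form is assumed.
PICKED_RE = re.compile(r"^\(cherry picked from commit ([0-9a-f]{40})\)$", re.MULTILINE)


def picked_hashes(commit_messages):
    """Collect the original commit ids named in cherry-pick trailers."""
    found = set()
    for message in commit_messages:
        found.update(PICKED_RE.findall(message))
    return found


def not_yet_picked(master_hashes, stable_messages):
    """Return master commit ids (in order) that no stable-branch trailer names."""
    picked = picked_hashes(stable_messages)
    return [h for h in master_hashes if h not in picked]
```

A trailer that deviates from the standard wording (for example "cherry-picked from ...") is not recognized by the pattern, which is why an identical format matters for scripting.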
[10:42:44] <mmeffie> ah, i see.
[10:42:47] <jaltman/FrogsLeap> natefoo: private ips being in the vldb will be given to clients.  If a client attempts to access the private ip and cannot reach it, it will have to timeout before it fails over to the next one.
[10:43:14] --- mmeffie is now known as meffie
[10:43:50] <jaltman/FrogsLeap> If a patchset has to be backported because a straight application would not work, the commit message should explain what the differences are from the original patchset.
[10:43:58] <natefoo> jaltman: that's what i was afraid of.
[10:44:02] <jaltman/FrogsLeap> unless they are minor
[10:44:17] <jaltman/FrogsLeap> natefoo: that is what the NetRestrict file is for
[10:46:28] <natefoo> yeah, i have NetRestricts.
[10:47:05] <natefoo> so right now, one of the two fileservers is showing its volumes only via the public address (with vos listvldb -server <public>), i can presumably vos changeaddr <private> -remove that one.
[10:47:09] <natefoo> what do i do about the other one?
[10:54:17] <deason> wait, you want to remove the public one?
[10:54:42] <deason> showing 'vos listaddrs -printuuid -noresolve' would make this less confusing, if you're willing to show that
[11:05:02] <natefoo> i want to remove the private one.
[11:06:04] <natefoo> one sec, i'll make a little gist with the relevant bits.
[11:07:44] <deason> sorry, I interpreted "that one" as the public one, didn't see that "<private>"
[11:09:09] <natefoo> https://gist.github.com/3841408
[11:09:15] <natefoo> ahh.
[11:12:55] <deason> did you say you ran changeaddr before? do you have what exact commands you ran?
[11:17:27] <natefoo> i did a couple changeaddrs, i doubt i have the commands anymore but i'll look.
[11:17:46] <natefoo> one was trying to do changeaddr <private> <public> for one of them, i think delirium.
[11:18:06] <natefoo> iirc that never worked, it just resulted in saying they were the same address.
[11:19:51] --- dev-zero@jabber.org has become available
[11:21:48] <deason> okay, well, so you know, 'vos changeaddr' is really only for non-uuid servers; you don't run it for modern stuff, and it can screw up stuff in the vldb if you do
[11:22:04] <deason> I'm not sure how you get that result exactly, though
[11:23:33] <deason> well, for changing addrs for non-uuid servers, or trying to fix issues like this
[11:24:50] <deason> you might be able to changeaddr the 10.245.66.22 to something bogus, and then remove that; I'm not sure if that's what you tried to do?
[11:25:05] <natefoo> it looks like i tried that at some point.
[11:25:33] <natefoo> nate/admin@chouffe% vos changeaddr 10.245.66.22 10.1.1.1
[11:25:50] <deason> what happens when you do that?
[11:27:55] <meffie> the vldb_check on master may be helpful to show what's going on
[11:29:21] <natefoo> nate/admin@chouffe% vos changeaddr 10.245.66.22 10.1.1.1
[11:29:21] <natefoo> Could not change server 10.245.66.22 to server 10.1.1.1
[11:29:21] <natefoo> VLDB: no such entry
[11:32:51] --- jaltman has become available
[11:33:13] --- jaltman has left: Disconnected
[11:33:41] <deason> well, I think it's clear what's going on; just two mh server entries with the same uuid
[11:34:04] <deason> I think that's happened before, but I don't remember what we did
[11:34:33] <deason> maybe it needs the 255.* address to switch it?
[11:38:19] <deason> or wait no, I think I'm thinking of the opposite, where the ips were dup'd but not the uuid
[11:38:30] <deason> you can't 'vos changeaddr 10.245.66.22 -remove' ?
[11:41:16] --- natefoo has left
[11:58:23] --- natefoo has become available
[11:59:43] <natefoo> nope: nate/admin@chouffe% vos changeaddr 10.245.66.22 -remove
[11:59:43] <natefoo> Could not remove server 10.245.66.22 from the VLDB
[11:59:44] <natefoo> vlserver does not support the remove flag or VLDB: no such entry
[12:06:01] <meffie> i wonder if you remove the first ip, then the second, and let the fileserver reg the correct one again.
[12:06:55] <natefoo> do i run any risk of losing volumes with any of this?
[12:07:38] <natefoo> i could just remove all of the ips, delete sysid, and restart the fileservers with the correct NetInfo and NetRestrict in place.
[12:08:07] <natefoo> (losing volumes, corrupting vldb, or anything else similarly catastrophic).
[12:13:12] <deason> well, you can always 'fix' it by just removing the vldb entirely and syncing the volumes later
[12:13:45] <deason> I'm not sure I see why that doesn't work, though; the vlserver looks through the mh entries looking for a matching ip; it shouldn't matter that the uuid is dup'd
[12:14:44] <deason> does that seem familiar at all, meffie?
[12:18:45] <meffie> last time i looked at this, i think i saw cases where vos remove can't fix duplicates in the vlserver. i made changes to vldb_check to clean up some situations.
[12:20:42] <deason> dup uuids, though? not addrs?
[12:21:09] <deason> maybe something else is going on... natefoo: can you provide 'vldb_check /usr/afs/db/vldb.DB0 -servers' (or wherever the vldb is)?
[12:22:14] <deason> and what vlserver version?
[12:22:47] <natefoo> for all 3 vlservers?
[12:24:28] <meffie> yes, duplicate uuids
[12:24:31] <natefoo> all three are 1.6.1-1 (ubuntu aka debian)
[12:25:30] <natefoo> https://gist.github.com/3841827
[12:26:38] <meffie> sadly we dont display the uuid.
[12:28:12] <natefoo> that gist i posted last night had the server ids though.
[12:28:19] <deason> we can figure out the uuids
[12:28:28] <deason> he can't delete the stray one because the mh entry isn't referenced
[12:28:41] <deason> and we try to find it by scanning HostAddrs, so we never get to that mh
[12:29:33] <meffie> yup, hence the change to remove unref'd entries.
[12:30:19] <deason> the changeaddr I assume set addr 2 to the bogus address, and didn't clear mh 3
[12:32:00] <meffie> probably.
[12:34:30] <deason> that prevents a registeraddrs call, though? I don't see that
[12:34:39] <deason> it seems like it would skip over it in the same way, but maybe I'm misreading
[12:37:13] <meffie> iirc, everything, except get-addrs-u, traverses the hostaddrs table
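A toy model of the lookup problem described above, to show why an unreferenced mh entry cannot be found by address. The data structures, field names, and the 203.0.113.5 address are hypothetical illustrations and do not reflect the real vldb layout or vlserver code; the only behaviour taken from the discussion is that lookups walk the HostAddrs table and follow references from there.

```python
# Toy model only: host_addrs slots hold either a plain IP string or a
# ("mh", index) tuple that points at a multihomed (mh) entry.


def find_server(host_addrs, mh_entries, ip):
    """Look up an IP by walking host_addrs only, following mh references."""
    for slot, value in enumerate(host_addrs):
        if isinstance(value, tuple):  # reference to a multihomed entry
            _, mh_index = value
            if ip in mh_entries[mh_index]["addrs"]:
                return slot
        elif value == ip:
            return slot
    return None  # mh entries that no slot references are never examined


def unreferenced_mh(host_addrs, mh_entries):
    """List mh entry indexes that no host_addrs slot points at."""
    referenced = {value[1] for value in host_addrs if isinstance(value, tuple)}
    return [index for index in mh_entries if index not in referenced]


# Two mh entries share one uuid, but only entry 1 is referenced.
host_addrs = [("mh", 1), "10.1.1.1"]
mh_entries = {
    1: {"uuid": "A", "addrs": ["203.0.113.5"]},
    3: {"uuid": "A", "addrs": ["10.245.66.22"]},
}
```

In this model `find_server(host_addrs, mh_entries, "10.245.66.22")` returns `None` even though the address is stored in mh entry 3, which matches the "no such entry" errors seen earlier; `unreferenced_mh` is the sort of scan a cleanup tool would need instead.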
[12:37:27] --- Derrick Brashear has left
[12:37:32] --- Derrick Brashear has become available
[12:38:35] <deason> natefoo: when you restart the fileserver now, it still gives that error in VLLog?
[12:38:50] <deason> I'm not sure / don't remember if you've tried after trying to move/remove the address
[12:38:51] <natefoo> the one i had last night?  no.
[12:39:21] <natefoo> i can restart it now, though.
[12:39:26] <natefoo> all of this is still in development.
[12:39:45] <deason> but you have the 10.* ip in netrestrict, right?
[12:41:02] <deason> sorry, I thought the problem was that the fs wasn't registering the addresses properly because of vlserver complaining about it overwriting some addresses
[12:41:05] <natefoo> yeah, i do.
[12:41:15] <natefoo> although at various times when i started it in the past, it wasn't.
[12:41:16] --- mvitale has left
[12:41:19] <deason> but if the 10.* addr is in netrestrict, then that vldb data makes sense
[12:41:25] <natefoo> last time i started it, the 10.* address was in there.
[12:41:31] <deason> I'm not sure what the issue now is?
[12:41:46] <natefoo> yeah, i had the problem there of not registering properly last night.
[12:42:00] <natefoo> i fixed that, but now i need to get rid of the 10.* addresses.
[12:42:19] <natefoo> because anything outside of ec2 is going to have problems accessing those volumes.
[12:42:53] <deason> okay, sorry, reading scrollback
[12:43:22] <deason> so, the "private ips being in the vldb" isn't a problem itself; reporting the private ips to clients can be a problem because they'll try to access it and timeout, etc, like jaltman said
[12:43:48] <deason> however, your 10.245.66.22 isn't being used to report to clients; it's a stray entry in the vldb that's not referenced by anything
[12:44:13] <deason> that isn't going to hurt anything, though it looks a bit confusing
[12:44:13] <natefoo> how about the other one?  10.211.157.137
[12:44:40] <deason> the 10.211.157.137 is being reported (according to the vos listaddrs -printuuid output); restarting the fileserver with that in netrestrict is supposed to make it go away
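A minimal sketch of how a NetRestrict-style list filters the addresses a fileserver would register. It assumes the rule from the NetRestrict documentation as I understand it: one dotted-decimal address per line, with 255 acting as a wildcard for that field. The function names and the 203.0.113.7 address are made up for illustration; this is not the fileserver's actual code.

```python
def is_restricted(address, patterns):
    """True if address matches a pattern; a 255 field matches any value."""
    octets = address.split(".")
    if len(octets) != 4:
        return False
    for pattern in patterns:
        fields = pattern.split(".")
        if len(fields) != 4:
            continue  # ignore malformed lines in this sketch
        if all(f == "255" or f == o for f, o in zip(fields, octets)):
            return True
    return False


def addresses_to_register(interfaces, patterns):
    """Keep only the interface addresses that no pattern restricts."""
    return [addr for addr in interfaces if not is_restricted(addr, patterns)]
```

With a pattern list of `["10.255.255.255"]`, an interface list containing 10.211.157.137 and a public address would be reduced to the public address only, which is the outcome expected after the restart.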
[12:44:54] <natefoo> okay, let me try that.
[12:45:06] <deason> and from the previous changeaddr usage, you have an extra server entry for 10.1.1.1, which you probably want to changeaddr -remove
[12:45:43] <natefoo> okay, that was successful.
[12:46:16] <deason> (which; the fileserver restart, or removing 10.1.1.1?)
[12:46:22] <natefoo> both!
[12:46:28] <natefoo> duvel disappeared, no idea why that would be.
[12:46:46] <natefoo> is there any way to eliminate 10.245.66.22 to satisfy my inner ocd?
[12:48:18] <deason> meffie wrote modifications to vldb_check to clean up unreferenced entries, but it's not in a release yet
[12:48:46] <deason> you can either build it from source with those modifications and run it with -fix (back up the vldb first), or give the vldb to one of us and we can send back a fixed copy
[12:49:12] <natefoo> i.e. vldb.DB0 ?
[12:49:27] <deason> yes
[12:49:50] <natefoo> if you'd be so kind...
[12:49:59] <natefoo> (thanks for all the help!)
[12:50:22] <natefoo> where should i send it?
[12:50:59] --- mvitale has become available
[12:56:03] <deason> adeason@sinenomine, mmeffie@sinenomine.net (or just put it up somewhere; it's not sensitive info)
[12:59:49] <natefoo> /afs/galaxyproject.org/user/nate/public/vldb.DB0
[13:07:54] <deason>  /afs/sinenomine.net/user/adeason/public/natefoo_vldb.DB0.fixed
[13:08:54] <deason> you may want to run 'vldb_check -uheader -vheader -servers -entries' against both and diff the output, just to make sure nothing else changed
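A minimal sketch of the suggested comparison, assuming `vldb_check` is on the PATH and accepts the database path as its first argument, as in the commands quoted in this log. The helper names are hypothetical, and the sketch only captures standard output.

```python
import difflib
import subprocess

# Flags copied from the command quoted in the log.
CHECK_FLAGS = ["-uheader", "-vheader", "-servers", "-entries"]


def vldb_check_output(db_path):
    """Run vldb_check on one database file and return its text output."""
    result = subprocess.run(
        ["vldb_check", db_path, *CHECK_FLAGS],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def diff_reports(before, after):
    """Return unified-diff lines between two vldb_check reports."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        "original", "fixed", lineterm="",
    ))
```

Calling `diff_reports(vldb_check_output(original_path), vldb_check_output(fixed_path))` should then show only the removed stray entry; any other changed line is a reason to stop before installing the fixed file.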
[13:12:01] <natefoo> okay.
[13:12:08] <natefoo> how do i safely install this?
[13:13:26] <natefoo> diff looks good.
[13:14:49] <deason> stop all the vlservers, put this file in place on all of them, and start them again
[13:14:57] <natefoo> okay.
[13:15:20] <deason> (maybe the _fix utilities should increase the ubik epoch or counter? currently it just looks like the same unchanged db)
[13:15:49] <meffie> yeah, the header is not updated. 
[13:15:57] <deason> er, to be more safe, you need to delete the DBSYS files, too, but I'm assuming you're not changing the vldb while doing this
[13:16:22] <deason> and if you are, of course installing this fixed file is going to undo any changes anyway :)
[13:18:54] <natefoo> success!!
[13:19:00] <meffie> cool. 
[13:19:18] <natefoo> deason, meffie: +beer
[13:20:03] <meffie> i'm not sure if the vldb_check -fix should bump the epoch. i feel it should, but maybe only with a new switch?
[13:20:55] <deason> I don't think I see any danger in updating it, but I'm not sure
[13:21:55] <deason> the only 'bad' scenario I can see is if doing so then entails the 'fix'ed vldb overwriting an old one when you didn't want it to overwrite
[13:21:57] <meffie> in this case, it's easy because natefoo was able to shutdown all the vlservers.
[13:22:01] <deason> but in that case... why did you install it?
[13:22:53] <deason> I don't know, it's not like this comes up very often
[13:25:14] <meffie> so, is there something we need to do to prevent the dupe uuid in the first place?
[13:25:36] <natefoo> wish i remember exactly how i did that.
[13:26:02] <meffie> must be change-addr that made the unref'd entry.
[13:26:07] <natefoo> oh, it might've been vos setaddrs
[13:27:10] <natefoo> nate/admin@chouffe% vos setaddrs 00496d26-2f51-106d-94-75-1642f50aaa7 delirium.galaxyproject.org
[13:27:13] <natefoo> there we go.
[13:28:11] <natefoo> gotta run, but i'll be back later tonight.
[13:28:14] <natefoo> thanks again for the help.
[13:28:29] <deason> that's possible, but my guess would be on the changeaddr or something else
[13:29:00] <deason> the setaddrs code for finding the uuid is pretty simple; I think the problem started when the dup mh entry became unreferenced
[14:05:51] --- meffie has left
[14:11:33] --- dev-zero@jabber.org has left
[14:11:35] --- dev-zero@jabber.org has become available
[14:19:30] --- deason has left
[14:19:30] --- deason has become available
[14:26:47] --- mvitale has left
[14:48:35] --- dev-zero@jabber.org has left
[14:48:58] --- dev-zero@jabber.org has become available
[15:35:33] --- mdionne has become available
[16:00:37] --- deason has left
[18:13:36] --- jaltman has become available
[18:17:46] --- jaltman has left: Disconnected
[18:45:06] --- jaltman has become available
[18:45:14] --- jaltman has left: Disconnected
[19:48:46] --- mdionne has left
[21:37:06] --- Derrick Brashear has left
[21:37:11] --- Derrick Brashear has become available
[22:37:12] --- natefoo has left
[22:54:02] --- kula has left