*** vivek-eb_ has joined #openstack-lbaas | 00:07 | |
*** vivek-ebay has quit IRC | 00:07 | |
*** xgerman has quit IRC | 00:41 | |
*** vivek-eb_ has quit IRC | 00:57 | |
*** mwang2 has quit IRC | 01:00 | |
*** fnaval has quit IRC | 01:26 | |
*** vivek-ebay has joined #openstack-lbaas | 01:36 | |
*** vivek-ebay has quit IRC | 01:43 | |
*** amotoki has joined #openstack-lbaas | 01:59 | |
*** fnaval has joined #openstack-lbaas | 02:16 | |
*** sbfox has joined #openstack-lbaas | 02:19 | |
*** sbfox has quit IRC | 02:33 | |
*** fnaval has quit IRC | 03:00 | |
*** vivek-ebay has joined #openstack-lbaas | 03:00 | |
*** blogan_ has joined #openstack-lbaas | 03:15 | |
openstackgerrit | Brandon Logan proposed a change to stackforge/octavia: Doc detailing amphora lifecycle management https://review.openstack.org/130424 | 03:44 |
*** blogan_ has quit IRC | 03:45 | |
*** sbfox1 has joined #openstack-lbaas | 04:09 | |
*** kobis has joined #openstack-lbaas | 04:21 | |
*** kobis has quit IRC | 04:26 | |
*** sbfox1 has quit IRC | 04:48 | |
*** kobis has joined #openstack-lbaas | 04:48 | |
*** kobis has quit IRC | 04:56 | |
*** kobis has joined #openstack-lbaas | 05:00 | |
*** sbfox has joined #openstack-lbaas | 05:11 | |
*** blogan_ has joined #openstack-lbaas | 05:23 | |
rm_you | blogan: ooo, new spec | 05:23 |
*** kobis has quit IRC | 05:28 | |
blogan_ | aye | 05:28 |
blogan_ | well just an update to a WIP | 05:28 |
*** vivek-ebay has quit IRC | 05:48 | |
*** vivek-ebay has joined #openstack-lbaas | 06:07 | |
*** kobis has joined #openstack-lbaas | 06:21 | |
*** vivek-eb_ has joined #openstack-lbaas | 06:31 | |
*** vivek-ebay has quit IRC | 06:33 | |
openstackgerrit | Brandon Logan proposed a change to stackforge/octavia: Doc detailing amphora lifecycle management https://review.openstack.org/130424 | 06:37 |
*** blogan_ has quit IRC | 06:38 | |
*** mwang2 has joined #openstack-lbaas | 06:56 | |
*** mwang2 has quit IRC | 07:01 | |
*** vivek-eb_ has quit IRC | 07:15 | |
*** jschwarz has joined #openstack-lbaas | 07:15 | |
*** rm_you| has joined #openstack-lbaas | 07:20 | |
*** rm_you has quit IRC | 07:22 | |
*** woodster_ has quit IRC | 07:40 | |
*** sbfox has quit IRC | 07:51 | |
*** jschwarz has quit IRC | 07:59 | |
*** vivek-ebay has joined #openstack-lbaas | 10:16 | |
*** vivek-ebay has quit IRC | 10:21 | |
*** amotoki has quit IRC | 10:27 | |
*** vivek-ebay has joined #openstack-lbaas | 12:01 | |
*** woodster_ has joined #openstack-lbaas | 12:31 | |
*** vivek-ebay has quit IRC | 12:43 | |
*** Krast has joined #openstack-lbaas | 12:48 | |
*** mestery has quit IRC | 13:08 | |
*** amotoki has joined #openstack-lbaas | 13:09 | |
*** markmcclain has joined #openstack-lbaas | 13:09 | |
*** mestery has joined #openstack-lbaas | 13:11 | |
*** mestery has quit IRC | 13:13 | |
*** mestery has joined #openstack-lbaas | 13:13 | |
*** amotoki has quit IRC | 13:31 | |
*** Krast has quit IRC | 13:43 | |
*** markmcclain has quit IRC | 14:12 | |
*** xgerman has joined #openstack-lbaas | 14:57 | |
*** ptoohill_ has joined #openstack-lbaas | 15:02 | |
*** mlavalle has joined #openstack-lbaas | 15:38 | |
*** dboik has joined #openstack-lbaas | 15:40 | |
dougwig | morning | 15:47 |
ptoohill | Mornin' | 15:56 |
openstackgerrit | Trevor Vardeman proposed a change to stackforge/octavia: Defining interface for amphora drivers https://review.openstack.org/130352 | 16:02 |
openstackgerrit | Trevor Vardeman proposed a change to stackforge/octavia: Defining interface for amphora drivers https://review.openstack.org/130352 | 16:11 |
*** dboik has quit IRC | 16:13 | |
*** markmcclain has joined #openstack-lbaas | 16:18 | |
blogan | awesome, another patch gets merged into the feature branch | 16:23 |
blogan | now those monster patches are no more | 16:23 |
blogan | mestery: thanks! | 16:24 |
*** kobis has quit IRC | 16:25 | |
mestery | blogan: Yes sir! That last one should unlock you folks even more. | 16:25 |
*** kobis has joined #openstack-lbaas | 16:26 | |
*** vivek-ebay has joined #openstack-lbaas | 16:34 | |
*** fnaval has joined #openstack-lbaas | 16:35 | |
*** codekobe has quit IRC | 16:37 | |
*** sballe has quit IRC | 16:38 | |
*** ctracey has quit IRC | 16:38 | |
*** dboik has joined #openstack-lbaas | 16:46 | |
*** dboik has quit IRC | 16:51 | |
*** sbfox has joined #openstack-lbaas | 16:51 | |
blogan | dougwig ping | 17:01 |
dougwig | joining | 17:01 |
blogan | sam hasn't joined yet | 17:02 |
dougwig | it's late for him, right? | 17:02 |
blogan | i think its just 7 or 8 pm | 17:03 |
ptoohill | He set the date/time | 17:03 |
blogan | yeah its 7 pm there | 17:03 |
dougwig | hmm, then why aren't our group meetings at this time? :) | 17:03 |
blogan | lol | 17:03 |
blogan | i agree | 17:03 |
blogan | i thought israel was like 10 hours ahead of us, but turns out its only 7 | 17:04 |
ptoohill | Sam is running 5 mins late | 17:05 |
ptoohill | He posted this in the google slides | 17:05 |
rm_work | yeah WTB this time for our meetings T_T | 17:06 |
*** dboik has joined #openstack-lbaas | 17:06 | |
dougwig | there is an opening either at this time slot, or one hour earlier. | 17:08 |
rm_work | uhhh so what you're telling me | 17:10 |
rm_work | is that we could have our meetings at 11am CST. and we have instead been doing them at 9am. | 17:10 |
rm_work | wtf | 17:10 |
rm_work | or, 9am instead of 7am for the poor PST people | 17:11 |
*** dboik has quit IRC | 17:11 | |
*** dboik has joined #openstack-lbaas | 17:12 | |
*** sbalukoff has quit IRC | 17:17 | |
*** kobis has quit IRC | 17:28 | |
*** kobis has joined #openstack-lbaas | 17:28 | |
*** markmcclain has quit IRC | 17:31 | |
*** masteinhauser is now known as zz_mas | 17:41 | |
*** ctracey has joined #openstack-lbaas | 17:42 | |
*** sbfox has quit IRC | 17:42 | |
*** sbfox has joined #openstack-lbaas | 17:45 | |
*** codekobe has joined #openstack-lbaas | 17:48 | |
*** dboik has quit IRC | 17:59 | |
*** dboik has joined #openstack-lbaas | 18:00 | |
*** dboik has quit IRC | 18:01 | |
*** sballe has joined #openstack-lbaas | 18:02 | |
*** dboik has joined #openstack-lbaas | 18:03 | |
*** dboik has quit IRC | 18:03 | |
*** dboik has joined #openstack-lbaas | 18:04 | |
openstackgerrit | Trevor Vardeman proposed a change to stackforge/octavia: Defining interface for amphora drivers https://review.openstack.org/130352 | 18:05 |
*** dboik_ has joined #openstack-lbaas | 18:06 | |
*** dboik has quit IRC | 18:06 | |
*** markmcclain has joined #openstack-lbaas | 18:25 | |
*** sbalukoff has joined #openstack-lbaas | 18:32 | |
sbalukoff | Hey folks, if you haven't updated it yet, please don't forget to fill out the weekly stand-up etherpad: standup | 18:47 |
sbalukoff | D'oh! | 18:47 |
sbalukoff | https://etherpad.openstack.org/p/octavia-weekly-standup | 18:47 |
ptoohill | ;) | 18:53 |
sbalukoff | Also, if you've got anything else for the agenda, feel free to add it: https://wiki.openstack.org/wiki/Octavia/Weekly_Meeting_Agenda | 18:55 |
*** dboik_ has quit IRC | 18:58 | |
*** dboik has joined #openstack-lbaas | 18:58 | |
*** ptoohill_ has quit IRC | 19:00 | |
*** markmcclain1 has joined #openstack-lbaas | 19:01 | |
*** markmcclain2 has joined #openstack-lbaas | 19:02 | |
*** markmcclain has quit IRC | 19:03 | |
*** dboik has quit IRC | 19:03 | |
*** dboik has joined #openstack-lbaas | 19:04 | |
*** markmcclain1 has quit IRC | 19:06 | |
*** kobis has joined #openstack-lbaas | 19:22 | |
*** zz_mas is now known as masteinhauser | 19:28 | |
*** vivek-ebay has quit IRC | 19:29 | |
*** ptoohill_ has joined #openstack-lbaas | 19:30 | |
*** dboik has quit IRC | 19:32 | |
*** vivek-ebay has joined #openstack-lbaas | 19:33 | |
*** dboik has joined #openstack-lbaas | 19:36 | |
*** dboik has quit IRC | 19:37 | |
*** dboik has joined #openstack-lbaas | 19:39 | |
*** dboik has quit IRC | 19:41 | |
*** dboik has joined #openstack-lbaas | 19:42 | |
*** sbfox has quit IRC | 19:46 | |
*** sbfox has joined #openstack-lbaas | 19:48 | |
*** kobis has quit IRC | 19:49 | |
*** markmcclain2 has quit IRC | 19:50 | |
*** markmcclain has joined #openstack-lbaas | 19:51 | |
*** jamiem has joined #openstack-lbaas | 19:52 | |
*** barclaac|2 has joined #openstack-lbaas | 19:53 | |
*** barclaac has quit IRC | 19:53 | |
dougwig | https://www.irccloud.com/pastebin/AIEwr5av | 19:54 |
ptoohill | wooten! | 19:55 |
*** kobis has joined #openstack-lbaas | 19:56 | |
*** ajmiller has joined #openstack-lbaas | 19:58 | |
blogan | thanks to oleg on that one | 20:00 |
*** mwang2 has joined #openstack-lbaas | 20:02 | |
*** kobis has quit IRC | 20:04 | |
*** jorgem has joined #openstack-lbaas | 20:04 | |
*** sbfox has quit IRC | 20:09 | |
*** markmcclain has quit IRC | 20:13 | |
*** sbfox has joined #openstack-lbaas | 20:15 | |
openstackgerrit | Trevor Vardeman proposed a change to stackforge/octavia: Defining interface for amphora drivers https://review.openstack.org/130352 | 20:31 |
openstackgerrit | Trevor Vardeman proposed a change to stackforge/octavia: Defining interface for compute drivers https://review.openstack.org/130352 | 20:34 |
*** jamiem has quit IRC | 20:38 | |
xgerman | for a moment I read amphora drivers instead of compute drivers | 20:55 |
rm_work | yeah it was the other :P | 21:01 |
rm_work | we made him fix it | 21:01 |
rm_work | I was like | 21:01 |
sbalukoff | johnsom_: Do you have time to continue discussion here? | 21:01 |
rm_work | "wait, german is working on the same thing as you?!" | 21:01 |
rohara | someone remind me what we got cookin' for octavia at summit next week | 21:02 |
johnsom_ | Sure | 21:02 |
sbalukoff | Ok, so, I apologize: I don't mean to single you out on this | 21:02 |
johnsom_ | Thank you. It definitely seemed like it | 21:02 |
sbalukoff | I probably haven't been as clear as I should have been with regard to expectations for people participating in this project. | 21:03 |
johnsom_ | I don't mind heads up that people care about stuff I signed up for. I got a lot of "we will build our own image", so I was working offline on it and without top priority | 21:04 |
sbalukoff | What do you mean when you say "I got a lot of 'we will build our own image'"? | 21:04 |
johnsom_ | Two weeks ago I put together upstart with respawn for haproxy to go into these images. | 21:04 |
blogan | johnsom_: i think the disconnect is that we are waiting to build those images until we have something to work off of, and an actual working octavia with some default image, which would be yours | 21:04 |
blogan | sbalukoff: bc we (rax) mentioned we will end up probably having to do our own image | 21:05 |
sbalukoff | blogan: Aah. | 21:05 |
sbalukoff | Ok. | 21:05 |
johnsom_ | When I signed up for this it seemed like everyone was saying they were going to do their own spin anyway, i.e. not interested. Basically it would just be used for testing. | 21:05 |
sbalukoff | So, that's all really secondary to the discussion I want to have. | 21:05 |
rohara | we wil end up doing our own image as well | 21:06 |
johnsom_ | Ok, so I hear you that people need this. I will get working on code to check in under WIP. | 21:06 |
rohara | sbalukoff: i am making my schedule for next week. what are we doing for octavia? i know this came up, but i can't find it. | 21:07 |
sbalukoff | Ok, sure. | 21:07 |
sbalukoff | rohara: Right now, it's mostly ad-hoc meetings and coding work as people are able. | 21:07 |
rohara | sbalukoff: ok | 21:07 |
blogan | johnsom_: i don't mean to single anyone out, but I think this is a good time to adopt a policy of doing the WIPs in gerrit, and not waiting until you think it's done and then pushing | 21:07 |
blogan | i had to get out of that mentality myself | 21:07 |
sbalukoff | I would like to go over front-end and back-end topology stuff thoroughly with people so we can know what features we're going to need to add to Neutron to make this work. | 21:08 |
blogan | i still think we all need to work on reviewing and giving feedback on WIP reviews now | 21:08 |
rohara | sbalukoff: sign me up. i'll just wanted around looking at names on badges until i find you :) | 21:08 |
rohara | s/wanted/wander | 21:08 |
sbalukoff | rohara: I'm going to be stuck at our booth off and on during the first part of the week. | 21:08 |
blogan | are we going to do that groupme app again? | 21:09 |
johnsom_ | My personal approach has been to develop something that at least produces something useful before checking in, then iterating from there. I don't want to waste people's precious review time on WIP stuff that is very incomplete. | 21:09 |
rohara | sbalukoff: well i know where to find you then | 21:09 |
sbalukoff | blogan: It's more difficult, what with international cell phone rates. | 21:09 |
johnsom_ | Maybe that pendulum swung too far here | 21:09 |
blogan | johnsom_: i know that urge, but if people are interested in it, they are willing to look at incomplete work and give feedback | 21:09 |
sbalukoff | johnsom_: So, at last weeks meeting we discussed ideas around how to effectively collaborate. | 21:09 |
johnsom_ | Yep, which was the surprise of the day here.... | 21:10 |
sbalukoff | And again, the pithy phrase "If it's ain't in gerrit, it doesn't exist" is probably the best policy here. | 21:10 |
sbalukoff | johnsom_: Sorry, it's in the meeting minutes from last week. | 21:10 |
johnsom_ | So why the push to have people work on stuff folks have signed up for when we have unassigned blueprints? Just curious here | 21:11 |
sbalukoff | Again, the dilemma I'm facing is that I have people who are enthusiastic and want to work on Octavia, especially the bits that aren't blocked by anything else. | 21:11 |
sbalukoff | And I also have people who have signed themselves up for writing parts of this project who have either had their priorities re-arranged by their working circumstances, or otherwise are not making visible progress. | 21:11 |
sbalukoff | johnsom_: Many of those unassigned blueprints are either lower priority, or are blocked by other blueprints. | 21:12 |
johnsom_ | Sorry if you don't think I am enthusiastic about Octavia. Frankly I want it yesterday. | 21:12 |
sbalukoff | And I *know* we're all itching to have something real we can work with. | 21:12 |
sbalukoff | johnsom_: That's not what I'm saying. | 21:12 |
sbalukoff | And you ARE NOT under attack here. | 21:13 |
sbalukoff | Please don't take offense. | 21:13 |
sbalukoff | The point I'm getting at is: As PTL, I need to be able to make sure that 1) we are making progress on this project and 2) people who want to work on it have something to do. | 21:13 |
johnsom_ | Again, I didn't think base-image was a priority, so news to me. Good to know. | 21:14 |
johnsom_ | Yep, as I have mentioned, please, as PTL poke me and help me understand the priority. | 21:14 |
sbalukoff | Ok, I'll be more persistent. What is the best way to get a hold of you? | 21:15 |
sbalukoff | johnsom_: So, controller design is also a priority | 21:15 |
johnsom_ | There are eleven non-blocked blueprints, two of which are high. Can we get those folks fired up on one of those? | 21:16 |
sbalukoff | Most of those have people working on them with WIP reviews. | 21:16 |
johnsom_ | I am always in this room and at the meetings. | 21:16 |
sbalukoff | You weren't last week. ;P | 21:17 |
sbalukoff | But... whatever. | 21:17 |
johnsom_ | I was looking at those that are unassigned. Are we missing some assignments? | 21:17 |
blogan | yeah what blueprints aren't assigned that have WIPs? | 21:18 |
blogan | dont tell me its trevor | 21:18 |
johnsom_ | Really? I updated the standup last week. I'm pretty sure I was there | 21:18 |
sbalukoff | Then how did you miss the "if it ain't in gerrit, it doesn't exist" discussion? | 21:18 |
sbalukoff | Also, many of the lower priority blueprints are effectively blocked by other groundwork that needs to be done, even if they aren't listed as such. | 21:19 |
sbalukoff | (Again, I hate launchpad.) | 21:19 |
johnsom_ | launchpad has a blocked status | 21:20 |
blogan | i think there are two blueprints that are High that can be worked on, amphorae-scheduler and amphora-health-monitor | 21:20 |
sbalukoff | johnsom_: How close are you to having a spec to review on the controller? | 21:20 |
johnsom_ | Anyway, sbalukoff, are we good? | 21:21 |
johnsom_ | I'm somewhere between 50% to 75% done | 21:21 |
sbalukoff | johnsom_: Are you good with the "if it ain't in gerrit, it doesn't exist" idea? | 21:21 |
sbalukoff | Because if there's nothing in there that's marked WIP, I don't have any concrete evidence that progress is being made. | 21:22 |
johnsom_ | Well, I think we shouldn't be quick to disregard people that have signed up for work, published a spec, etc. It would be throwing away code/research that is progress by passing tasks around. | 21:23 |
xgerman | johnsom_ +1 | 21:23 |
xgerman | sbalukoff, when are you taking roll of gerrit? | 21:24 |
rm_work | Yeah I think I am going to upload my certmanager change today, even though it *can't* pass py26/py27 tests yet because python-barbicanclient v3.0 isn't released to pypi yet | 21:24 |
sbalukoff | johnsom_: Agreed. But I need a way to really know where you're at. The idea of writing a bunch of stuff on your own and then presenting something close to the finished product does not work well in this environment. | 21:24 |
sbalukoff | xgerman: I usually look through gerrit every day, even if I don't have time every day to review. | 21:24 |
xgerman | so if we slack for 10 days you will take it away or so? | 21:25 |
sbalukoff | Mostly, though, if I have someone asking me "what should I work on?" I look at both gerrit and launchpad. | 21:25 |
johnsom_ | So what I am hearing is you don't trust what people are reporting in the standups and meetings. If you look at the log from the Octavia meeting two weeks ago I stated that I was prioritizing the controller spec over the image work to get the spec out early. I also asked if that was a problem and offered to re-order | 21:26 |
johnsom_ | 20:41:07 <blogan> how is the base image coming along? | 21:26 |
johnsom_ | 20:42:15 <johnsom> I have done some experimental images and have a good line of sight on the code. I just prioritized a first draft of the controller spec. | 21:26 |
johnsom_ | 20:42:30 <blogan> okay cool, just wanted to get an idea | 21:26 |
johnsom_ | 20:42:38 <johnsom> If it is a blocker, I can re-order | 21:26 |
sbalukoff | Ah yes.... that was the week I was on vacation | 21:27 |
sbalukoff | And I should have been quicker to tell you it was more important than the controller spec. | 21:27 |
sbalukoff | Still, y'all seem to be missing the point: | 21:27 |
rm_work | commit early, commit often | 21:27 |
sbalukoff | rm_work: +1 | 21:28 |
rm_work | still working on blogan :P but yeah, gerrit WIP is great for iteration, use it the same way you would use a private git repo, that's essentially what it is | 21:28 |
rm_work | having lots of patchsets is not a sin | 21:28 |
rm_work | just means you're actively iterating | 21:29 |
blogan | yes it is! but it is the lesser of sins in this case | 21:29 |
sbalukoff | xgerman: If it appears you've stalled for 10 days, I will probably ask what's up. If someone else is champing at the bit to work on it, I'll probably let them. | 21:29 |
jorgem | I'm an angel then since I have no patchsets :) | 21:29 |
blogan | or a bloodsucking manager | 21:29 |
johnsom_ | I don't think I am. I get it. I just think you are jumping on me for some alternate motive. I see open, non-blocked, non-assigned, high blueprints available, so why you want to switch developers doesn't really make sense to me. It's not like I haven't posted anything. I wrote and posted the spec. | 21:29 |
xgerman | sbalukoff, I am actively working on it + I will get the WIP up today... | 21:30 |
sbalukoff | Again, no offense is meant by this! | 21:30 |
xgerman | (or better I am pair programming with Min) | 21:30 |
rm_work | sbalukoff: you commented about showing Octavia storing a copy of the user's cert -- remember we decided NOT to do that | 21:30 |
sbalukoff | I'm certainly not saying you lack enthusiasm or are lazy or some other such nonsense. | 21:30 |
sbalukoff | The point is we all have jobs that have us doing many things at once. | 21:30 |
rm_work | (on my TLS review) | 21:30 |
xgerman | yeah, having clear deadlines helps me focus my (and HP's) mind | 21:30 |
sbalukoff | And to keep progress on this project moving forward, we need to be flexible in shifting workloads around. | 21:31 |
rm_work | well, all of us at Rackspace except jorgem are full-time Octavia for the next ... indefinite period | 21:31 |
sbalukoff | rm_work: Will discuss that in a minute. | 21:31 |
xgerman | rm_work, cool | 21:32 |
sbalukoff | rm_work: That's great. That means I can throw more work at you. :) | 21:32 |
rm_work | heh | 21:32 |
rm_work | jorgem is soaking up all the non-Octavia stuff our team has to do :) | 21:32 |
jorgem | I want to be fulltime :( | 21:32 |
sbalukoff | xgerman and johnsom_: Am I being unreasonable about this? | 21:32 |
jorgem | *sniff* *sniff* | 21:32 |
blogan | dont forget neutron lbaas | 21:33 |
xgerman | sbalukoff, yes and no. We like to do the stuff we signed up for so if we know you need it by X we will get it done | 21:33 |
johnsom_ | I think the big gap is communication. Again, I was up front about what I was working on. People didn't seem to be excited about the base image and I knew the spec would spark excitement, so commit early, commit often.... | 21:33 |
sbalukoff | johnsom_: Then let's add frequent and early commits to gerrit as part of our communication strategy. | 21:34 |
johnsom_ | No one spoke up when I mentioned the trade off. | 21:34 |
johnsom_ | So, can I help you find something for these new folks to work on? | 21:34 |
sbalukoff | johnsom_: Again, I'll take the heat for that. | 21:34 |
sbalukoff | Sure | 21:35 |
xgerman | and more than one person can also work on one blueprint | 21:35 |
sbalukoff | xgerman: +1 | 21:35 |
johnsom_ | Yep. | 21:35 |
blogan | ive had trevor help me out on a few | 21:35 |
sbalukoff | johnsom_: I would love to see your WIP stuff on the controller. I see that becoming our next big hurdle once we have base images being built | 21:36 |
xgerman | yeah, and he worked closely with ajmiller | 21:36 |
sbalukoff | (And it's the glue between so many other components-- we all need to see how it's coming together.) | 21:36 |
xgerman | so it even works across companies :-) | 21:36 |
blogan | im sure the controller work will deal with the amphora lifecycle management doc i've been working on | 21:36 |
sbalukoff | And the Operator APi | 21:36 |
blogan | yeah | 21:37 |
rm_work | yeah a lot of stuff kinda needs to evolve in parallel | 21:37 |
xgerman | yep, I will try to absorb all of Michael's knowledge so we can discuss at the summit :-) | 21:37 |
sbalukoff | And... well.. all the other -interface(s) | 21:37 |
blogan | osmosis | 21:37 |
johnsom_ | I think the health monitoring is open and one I would like to see explored. I would like to see some proof of life via the monitoring url or something similar. It is not clear to me where this lives and what the tiers are, i.e. in the amp, controller, etc. | 21:37 |
sbalukoff | xgerman: I'd love to see something in gerrit that we can start talking about, even if it's only 50 - 75% done. | 21:37 |
xgerman | well, writing is a time consuming task :-) | 21:38 |
johnsom_ | Tell me about it | 21:38 |
sbalukoff | xgerman: It is. But ultimately, we don't have much concrete to go off of if it isn't written down. | 21:38 |
johnsom_ | We had a good conversation about the cert handling too | 21:38 |
blogan | what is time consuming is trying to perfect the seqdiag bc some of the options aren't documented and so you are forced to look at the source code to find out what it supports | 21:38 |
sbalukoff | Heck, even if 90% of the document is "to be determined" the remaining 10% at least gives us something to work off of. | 21:39 |
blogan | and it all feels like a waste of time, but ocd kicks in | 21:39 |
xgerman | lol | 21:39 |
blogan | another good reason to push to gerrit early and often is if by chance you do get pulled off the blueprint, then someone can easily get your code and start from there | 21:40 |
sbalukoff | codekobe and / or intr1nsic: Would y'all be willing to take on the amphora-health-monitor blueprint? | 21:40 |
sbalukoff | blogan: +1 | 21:40 |
sbalukoff | xgerman and johnsom_: Just to be clear, "commit early, commit often" is something I NEED to see happen. | 21:41 |
sbalukoff | I realize this might mean a change in how you're used to doing things. | 21:41 |
johnsom_ | I think it is a balance we all need to work on | 21:41 |
intr1nsic | sbalukoff I think so | 21:41 |
sbalukoff | We're not playing poker here, y'all. There's no need to hide your hand, eh. | 21:42 |
sbalukoff | And I don't care if the code is crap to begin with. | 21:42 |
johnsom_ | sbalukoff I think I have been very straight forward | 21:43 |
codekobe | amphora-health-monitor blueprint aye | 21:43 |
codekobe | when does this need to be done by | 21:43 |
sbalukoff | johnsom_: Ok, I agree. Are you willing to follow "commit early, commit often" with your actual specs and code going forward? | 21:43 |
codekobe | i can get with you on what has already been discussed | 21:43 |
codekobe | sbalukoff i am sure you already have some ideas of what the health check should include | 21:44 |
sbalukoff | codekobe: We'll need it when we plug it into the controller, when we have a reference amphora image being built, and when we have a driver to control it. | 21:44 |
codekobe | sbalukoff: intr1nsic and i can take a look at that | 21:44 |
sbalukoff | No specific date. | 21:45 |
johnsom_ | codekobe intr1nsic - awesome. This is the one area that ties the Amp to a controller. | 21:45 |
codekobe | are we looking for the amp to report in, or for controller to poll? | 21:45 |
sbalukoff | The health check will ultimately touch the amphora API, the amphora driver, and controller. | 21:45 |
codekobe | not sure what has been discussed already | 21:45 |
codekobe | so if it touches the amphora API that would tell me the controller is polling | 21:46 |
sbalukoff | codekobe: We're looking for the amphora to emit health checks at regular intervals. Initially this can be done via a RESTful API which lives on the back-side of the controller (yet to be defined), but ultimately it'll probably happen via HMAC-signed UDP messages from the amphora. | 21:46 |
codekobe | I see, so amphora reports in | 21:47 |
johnsom_ | sbalukoff I thought we had decided that the health messages are UDP | 21:47 |
blogan | sbalukoff: if it is done through a restful api doesn't that mean the controller needs to be running a web server? | 21:47 |
sbalukoff | codekobe: We also have a need for a "deep diagnostic" healthcheck (usually, shortly after the amphora is spun up), as well as regular light-weight check-ins after that. | 21:48 |
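A minimal sketch of the heartbeat mechanism sbalukoff describes above: the amphora emits periodic check-ins, eventually as HMAC-signed UDP datagrams that the controller verifies. Everything here (shared key, port, amphora id, message layout) is an illustrative assumption, not Octavia's actual wire format.

    # Hedged sketch of an HMAC-signed UDP heartbeat; all names and values are
    # assumptions for illustration, not the project's settled design.
    import hashlib
    import hmac
    import json
    import socket
    import time

    SHARED_KEY = b"per-amphora-secret"      # assumed: provisioned at amphora build time
    CONTROLLER_ADDR = ("192.0.2.10", 5555)  # assumed: controller address on the LB network
    AMPHORA_ID = "amp-0001"                 # assumed identifier

    def send_heartbeat(sock):
        """Emit one signed heartbeat datagram; a lost packet is simply missed."""
        body = json.dumps({"id": AMPHORA_ID, "status": "ACTIVE",
                           "ts": time.time()}).encode()
        sig = hmac.new(SHARED_KEY, body, hashlib.sha256).digest()
        sock.sendto(sig + body, CONTROLLER_ADDR)

    def verify_heartbeat(datagram):
        """Controller side: drop datagrams whose signature does not match."""
        sig, body = datagram[:32], datagram[32:]
        expected = hmac.new(SHARED_KEY, body, hashlib.sha256).digest()
        if not hmac.compare_digest(sig, expected):
            return None
        return json.loads(body)

    if __name__ == "__main__":
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            send_heartbeat(sock)
            time.sleep(60)   # "say, once a minute from each amphora"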
dougwig | whoa, you guys wrote a book. | 21:48 |
sbalukoff | blogan: Yes, it does. | 21:48 |
blogan | so every component is basically going to have an web server? | 21:48 |
sbalukoff | johnsom_: The talk was to do it over a REST interface at first because people thought that would be simpler to implement out of the gate. | 21:48 |
sbalukoff | Ultimately we do want it over UDP. | 21:48 |
blogan | i feel like we should just do it UDP from the beginning | 21:49 |
xgerman | REST simpler +1 | 21:49 |
xgerman | and the interface accounts for that ;-) | 21:49 |
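For contrast, a minimal sketch of the "REST first" compromise: a small HTTP endpoint on the controller's back side that accepts heartbeat POSTs from amphorae. Pure stdlib for illustration; the port, path, and payload are assumptions, since the real controller API is still to be specified.

    # Hedged sketch of a REST heartbeat receiver on the controller; port and
    # payload shape are assumptions.
    import json
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    last_seen = {}   # amphora id -> timestamp of last check-in

    class HeartbeatHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            last_seen[payload.get("id", "unknown")] = time.time()
            self.send_response(202)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 9443), HeartbeatHandler).serve_forever()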
sbalukoff | I think xgerman was the one primarily advocating the REST API for the same to begin with. | 21:50 |
xgerman | yes + sballe -- we both love REST!! | 21:50 |
*** openstackgerrit has quit IRC | 21:50 | |
blogan | i like rest too but it seems like overkill in this instance | 21:51 |
codekobe | sbalukoff: there seems to be a lot of similarity to Trove here | 21:51 |
sbalukoff | codekobe: There is! | 21:51 |
sbalukoff | So if we can totally steal their work, we should. | 21:51 |
intr1nsic | +1 | 21:51 |
sballe | sbalukoff: I love REST! | 21:52 |
codekobe | yeah, i can look into how they handle that | 21:52 |
xgerman | and trove is awesome, too, the more we can take from them the better ;-) | 21:52 |
codekobe | they have an api, a task controller, an agent | 21:52 |
sbalukoff | Let's see what trove is doing, and then decide where we go on the health checks. | 21:52 |
rm_work | guys, please give me a sec to catch up | 21:52 |
rm_work | :) | 21:52 |
rm_work | I will probably have comments | 21:53 |
codekobe | http://docs.openstack.org/developer/trove/dev/design.html | 21:53 |
codekobe | we can look that over | 21:53 |
* dougwig motions rm_work closer, and says, "RUN!" | 21:53 | |
codekobe | we dont have to copy, but we might be able to borrow where it makes sense | 21:53 |
rm_work | lol | 21:54 |
sbalukoff | Anyway, xgerman and johnsom_: Please be aware: Even if you're not willing to commit to "commit early, commit often" in so many words: Note that that's what I'm expecting to see. Weekly updates or ad-hoc asking of "where are you at?" are not going to be as informative as stuff that's in gerrit. | 21:55 |
rm_work | so, we are talking about... amphora health checks? not node health checks, right? | 21:55 |
xgerman | amphora healthchecks -- node healthcheacks are done by haproxy | 21:55 |
rm_work | right | 21:55 |
xgerman | sbalukoff we will amend our life to appease the great leader | 21:56 |
rm_work | so, if the amphora is announcing its healthy state via UDP ... | 21:56 |
sbalukoff | member healthchecks. | 21:56 |
sbalukoff | xgerman: Thank you | 21:56 |
rm_work | what is the threshold for "this amphora is no longer healthy" | 21:56 |
sbalukoff | rm_work: Going too long without a healthcheck. | 21:57 |
xgerman | well, right now we ask with REST and if it times out for X times or gets route not found we shoot it | 21:57 |
rm_work | sbalukoff: so, there is a process monitoring the latest update times for each amphora? | 21:57 |
sbalukoff | rm_work: There needs to be, yes. | 21:57 |
rm_work | ok, what can we expect the poll time on that to be? | 21:57 |
xgerman | there are some more nuances like if a controller in az1 can't reach any lbs in az2 -- this might mean az2 is down or az1 lost network connectivity, etc. | 21:57 |
rm_work | just example numbers -- would it be polling every second to see if an amphora hadn't responded within the last two seconds? or something like that? | 21:58 |
sbalukoff | xgerman: +1 | 21:58 |
xgerman | rm_work poll time, etc. all needs to be configurable | 21:58 |
sbalukoff | rm_work: Say, once a minute from each amphora | 21:58 |
rm_work | looking for some sense of scale | 21:58 |
rm_work | sbalukoff: so, the amphora announces once per minute? or we just check for "at least one announce in the last minute"? | 21:58 |
sbalukoff | rm_work: Again, control in v0.5 is not meant to be that scalable. | 21:58 |
rm_work | I meant more like, O(...) | 21:59 |
rm_work | seconds, minutes.... | 21:59 |
xgerman | yeah, in libra we allow 60 s downtime so polling like every 5s for 5-6 times would work | 21:59 |
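A minimal sketch of the staleness test described above ("going too long without a healthcheck"), using xgerman's example numbers as configurable assumptions: sweep every 5 seconds, allow roughly 60 seconds of silence before acting.

    # Controller-side sweep that flags amphorae that have gone quiet; the
    # intervals and names are illustrative assumptions.
    import time

    CHECK_INTERVAL = 5          # seconds between sweeps
    ALLOWED_DOWNTIME = 60       # seconds without a heartbeat before we act

    last_seen = {}              # amphora id -> timestamp of last heartbeat

    def record_heartbeat(amp_id):
        last_seen[amp_id] = time.time()

    def sweep():
        """Return the amphorae that have gone quiet for too long."""
        now = time.time()
        return [amp for amp, ts in last_seen.items()
                if now - ts > ALLOWED_DOWNTIME]

    def run():
        while True:
            for amp in sweep():
                # v0.5: the controller only cleans up / replaces; it does not
                # drive sub-second failover (that is the amphora topology's job).
                print("amphora %s missed its check-ins, scheduling replacement" % amp)
            time.sleep(CHECK_INTERVAL)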
rm_work | ok, I thought we were looking for subsecond failovers | 21:59 |
rm_work | architecting it this way does not work well for that | 22:00 |
sbalukoff | rm_work: That's what active-active or active-standby are supposed to accomplish. | 22:00 |
rm_work | or really even approaching that | 22:00 |
rm_work | alright, but even active-active will need to announce when one amp goes down, in a reasonable amount of time, right? | 22:00 |
sbalukoff | In active-standby, each amphora is probing its partner once per second. | 22:00 |
dougwig | how many amphora per controller are we supporting? | 22:00 |
johnsom_ | Does the health check info need to be persistent? | 22:00 |
rm_work | ok | 22:00 |
sbalukoff | dougwig: We don't have a specific number on that yet. | 22:01 |
xgerman | rm_work back to active-active | 22:01 |
codekobe | dougwig: do we have to support amphora per controller? | 22:01 |
xgerman | some component needs to fail over sub second and not send stuff to the broken lb; the controller still has more time to bring in a new lb | 22:01 |
sbalukoff | johnsom_: As in, stored in the database? Most likely yes-- the process receiving the health checks may not be the same one that checks for dead amphorae. | 22:01 |
xgerman | those are two different problems | 22:01 |
rm_work | xgerman: alright | 22:01 |
johnsom_ | I could see a controller fail over needing to have the near term health history since we are checking for missing check-in messages. | 22:01 |
codekobe | controller should be HA | 22:01 |
rm_work | so the failover mechanism will be unrelated entirely to health monitoring of the amps | 22:02 |
codekobe | it would be nice for controllers not to have to fail over | 22:02 |
sbalukoff | rm_work: If you want sub-second failover, yes. | 22:02 |
codekobe | but allow for multiple controllers | 22:02 |
sbalukoff | codekobe: That's Octavia v1.0 | 22:02 |
codekobe | i suppose that makes it difficult to determine which controller should take action on a failed node | 22:02 |
johnsom_ | That is what I thought. So are we considering something like Redis to handle that transaction rate? I think we would melt a transactional database | 22:02 |
sbalukoff | Octavia v0.5 is meant mostly to work the bugs out of amphora lifecycle, network connectivity, etc. It'll support many amphorae, but isn't yet concerned with scaling the control layer. | 22:03 |
codekobe | johnsom_: i think we would with the transactions. Not to mention we probably dont need to store health checks long term | 22:03 |
sbalukoff | johnsom_: That's not a bad idea. | 22:03 |
codekobe | a cache would seem more appropriate | 22:04 |
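One way the cache idea could look, assuming Redis via redis-py: every check-in refreshes a key whose TTL equals the allowed downtime, so an expired key is itself the "this amphora went quiet" signal. The key names and numbers are hypothetical.

    # Hedged sketch of a TTL-based health cache; assumes a reachable Redis
    # instance and illustrative key names.
    import redis

    ALLOWED_DOWNTIME = 60
    r = redis.Redis(host="localhost", port=6379)

    def record_heartbeat(amp_id, payload):
        # Overwrites the previous value and refreshes the TTL on every check-in.
        r.setex("amp_health:%s" % amp_id, ALLOWED_DOWNTIME, payload)

    def is_alive(amp_id):
        return r.exists("amp_health:%s" % amp_id) > 0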
rm_work | crazy idea: what about a queue? | 22:04 |
codekobe | haha | 22:04 |
sbalukoff | I will beat you, rm_work. | 22:04 |
sbalukoff | ;) | 22:04 |
johnsom_ | I think we have at least two levels of health check going on, one for failover in seconds and one for Amp "cluster" fail over | 22:04 |
rm_work | seriously tho | 22:04 |
rm_work | health announces go onto a queue, queue length is like, 1 | 22:04 |
sbalukoff | Queue seems pretty heavy for this. | 22:04 |
xgerman | +1 | 22:04 |
rm_work | and it doesn't have to be a hardened queue | 22:04 |
rm_work | that way only the latest health announce is ever on the queue | 22:05 |
codekobe | rm_work: i'm not sure how a queue would work, because you are tracking how long it has been since the last checkin, which implies we are keeping state, which is not what a queue is for | 22:05 |
rm_work | whenever the controller wants to check, whatever its polling period | 22:05 |
rm_work | the single message on the queue is the latest status | 22:05 |
sbalukoff | Also, I'd rather not expose amphorae to a queue directly. | 22:05 |
dougwig | we put rest *EVERYWHERE* and a queue is heavy?!??!? hahahaha. | 22:05 |
dougwig | sorry, i'm going to go insane now. | 22:05 |
rm_work | I mean, the amphora broadcasts to a queue | 22:05 |
johnsom_ | Anyway, just some thoughts. I think the health check thing is going to be interesting. I need to get back to work, as I have highlighted so many times in the last hour.... | 22:05 |
rm_work | not the other way around | 22:06 |
rm_work | amphora -> queue: hello, I am <ACTIVE>, <timestamp> | 22:06 |
sbalukoff | johnsom_: Thanks for your time today. | 22:06 |
rm_work | that happens (n) times | 22:06 |
rm_work | at whatever frequency | 22:06 |
blogan | dougwig: +1 | 22:06 |
rm_work | and the controller queries the queue at whatever frequency it wants, and gets the latest status and timestamp | 22:07 |
codekobe | rm_work: i could see that as a way to deliver the message | 22:07 |
sbalukoff | rm_work: Again, don't want a queue exposed directly to the amphora. I see the potential for that being too easily abused if the amphora gets hacked. | 22:07 |
rm_work | if the timestamp is old, then that's bad | 22:07 |
codekobe | although requires more than udp | 22:07 |
codekobe | so the controller would not just get the latest message | 22:07 |
rm_work | sbalukoff: what, so the hacked amphora could announce to a queue of size 1? | 22:07 |
codekobe | it would end up feeding however many messages were sent since last poll? | 22:07 |
rm_work | the latest message is all that matters | 22:07 |
rm_work | codekobe: queue size 1 | 22:07 |
rm_work | http://www.rabbitmq.com/maxlength.html | 22:08 |
rm_work | "Messages will be dropped or dead-lettered from the front of the queue to make room for new messages once the limit is reached." | 22:08 |
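A sketch of the single-message, non-durable queue rm_work proposes, assuming the pika client and RabbitMQ's x-max-length argument quoted above (old messages are dropped from the head to make room). The queue names and per-amphora layout are illustrative assumptions, not a settled design.

    # Hedged sketch of a one-message heartbeat queue per amphora using pika.
    import json
    import time
    import pika

    def open_channel():
        conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        return conn.channel()

    def declare_heartbeat_queue(channel, amp_id):
        # Non-durable, capped at one message: only the latest heartbeat survives.
        return channel.queue_declare(queue="amp_health.%s" % amp_id,
                                     durable=False,
                                     arguments={"x-max-length": 1})

    def publish_heartbeat(channel, amp_id):
        body = json.dumps({"id": amp_id, "status": "ACTIVE", "ts": time.time()})
        channel.basic_publish(exchange="",
                              routing_key="amp_health.%s" % amp_id,
                              body=body,
                              properties=pika.BasicProperties(delivery_mode=1))  # transient

    def latest_heartbeat(channel, amp_id):
        # Any controller in the cluster can poll this at its own frequency.
        method, _props, body = channel.basic_get(queue="amp_health.%s" % amp_id,
                                                 auto_ack=True)
        return json.loads(body) if method else None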
sbalukoff | rm_work: You now open up your undercloud to any vulnerabilities (which are probably well known) in the queue software. | 22:08 |
rm_work | and it doesn't have to be a "reliable" queue | 22:08 |
sbalukoff | I'd rather have something we write ourselves doing sanity-checks and whatnot on messages from the amphorae. | 22:09 |
rm_work | sbalukoff: if the amphora only ever writes to the queue (never reads), and the only thing that reads from the queue is only looking for a state and timestamp, and the only possible action to take from that is that it could kill the hacked amphora and put up a new one... | 22:09 |
rm_work | i don't see the problem | 22:09 |
codekobe | rm_work: does queue size 1 always keep the latest message, i would think it would just stop accepting messages after reaching that size | 22:09 |
rm_work | "Messages will be dropped or dead-lettered from the front of the queue to make room for new messages once the limit is reached." | 22:09 |
rm_work | per the rabbitmq docs | 22:10 |
codekobe | thanks^^ cant read apparently | 22:10 |
sbalukoff | rm_work: Unless, say, there's a vulnerability in rabbitmq that just requires having access to any queue to exploit. I mean, clearly that software has been well hardened</sarcasm> | 22:10 |
rm_work | sbalukoff: i mean, what could be exploited? | 22:10 |
rm_work | i guess being able to read from the queue, they could see which amphorae are up? | 22:10 |
codekobe | rm_work: i think this would also allow for ha controllers | 22:10 |
codekobe | right out of the gate | 22:11 |
rm_work | codekobe: right | 22:11 |
sbalukoff | rm_work: Are you expecting me to know all the ways in which someone might try exploiting rabbitmq? | 22:11 |
codekobe | whichever controller pulls the old message can take action | 22:11 |
sbalukoff | A cache would allow for HA controllers, too. | 22:11 |
rm_work | codekobe: correct, that was my thought | 22:11 |
codekobe | and the controller's job primarily is going to be to execute tasks from a queue | 22:11 |
rm_work | sbalukoff: the queue is a cache | 22:11 |
codekobe | that is all it is doing with api calls | 22:11 |
codekobe | api calls go to queue, controller picks up | 22:11 |
rm_work | sbalukoff: trying to imagine the worst possible thing that could happen | 22:12 |
rm_work | i guess, they take down the queue? | 22:12 |
rm_work | which would cause the controller to freak out | 22:12 |
rm_work | and not be able to get status from any amphora | 22:12 |
codekobe | so it does kind of make sense for amphora to write to 1 message queue, and if the timestamp is too old, the task takes action, otherwise it ignores | 22:12 |
rm_work | which could cause downtime... | 22:12 |
sbalukoff | rm_work: I look and see a bloated piece of software on whose access we wouldn't be able to do sanity checks. | 22:13 |
codekobe | so i imagine we are having a cluster of queues | 22:13 |
sbalukoff | I don't like it. | 22:13 |
rm_work | codekobe: at least you seem to understand what i'm getting at :P | 22:13 |
codekobe | and i also imagine that if the queue is down, our api is down | 22:13 |
rm_work | sbalukoff: we're using queues other places too... | 22:13 |
codekobe | because most configuration calls will be async | 22:13 |
rm_work | sbalukoff: do we not like queues anymore now? | 22:13 |
sbalukoff | rm_work: Nothing as exposed to the amphorae | 22:13 |
sbalukoff | Remember, amphorae are going to be the front-lines for attacks. | 22:14 |
sbalukoff | They *will* occasionally be exploited. | 22:14 |
rm_work | sbalukoff: well, I assume we won't have any remote access running on public ports | 22:14 |
sbalukoff | rm_work: Well, I hate rabbit. :P | 22:14 |
rm_work | so if they take over HAProxy... | 22:14 |
sbalukoff | Mostly because it's an unreliable piece of crap. ;) | 22:15 |
sbalukoff | But... eh... | 22:15 |
rm_work | well, let's even say they somehow manage to INSTALL SSH on it and get it running on a port, and log in to the amphora with root access | 22:15 |
sbalukoff | rm_work: If they take over the amphora, we need to minimize their ability to affect other amphorae or the undercloud. | 22:15 |
rm_work | at that point, they could go after the queue I guess, or the API | 22:15 |
sbalukoff | This is part of the reason amphorae live on an "LB Network" and not any undercloud network. | 22:15 |
codekobe | so they could still attack the api | 22:16 |
rm_work | sbalukoff: i mean, they ARE on the network with our API | 22:16 |
sbalukoff | Sure, but it's easier for us to limit the effectiveness of that attack since there will be only a few types of messages that get passed over it. | 22:16 |
codekobe | and now you are talking about having a shared secret between the amphorae and the controllers so that it can't affect other amphorae | 22:16 |
rm_work | there would only be ONE type of message on this queue | 22:16 |
sbalukoff | codekobe: Yes, indeed. | 22:16 |
codekobe | so | 22:17 |
codekobe | i have an idea | 22:17 |
codekobe | hear me out | 22:17 |
rm_work | so, you're expecting them to be able to... DOS the queue? | 22:17 |
sbalukoff | rm_work: There are going to be more than just healthchecks on this. | 22:17 |
sbalukoff | edge notifications will also happen over this mechanism (eventually) | 22:17 |
codekobe | ok | 22:17 |
rm_work | sbalukoff: i thought we were talking about the healthchecks mechanism :P | 22:17 |
codekobe | so if this is a 1 message queue | 22:17 |
sbalukoff | (BRB... really need to use the bathroom.) | 22:17 |
codekobe | then that means each amphora will have its own queue | 22:17 |
codekobe | which can also have its own creds for that queue | 22:18 |
rm_work | yeah, i wonder how rabbit can handle that scale | 22:18 |
codekobe | so you could isolate the attack there | 22:18 |
rm_work | codekobe: i think he is worried that rabbitmq might have a vulnerability that could allow a DoS? | 22:18 |
codekobe | because if the amp was compromised, it would only have access to its own queue | 22:18 |
rm_work | essentially it seems like he doesn't trust rabbitmq to be secure | 22:18 |
codekobe | well, you can sign messages.... | 22:19 |
rm_work | i mean really, it's just anything that speaks AMQP that we're talking about here | 22:19 |
codekobe | yes | 22:19 |
codekobe | not just rabbit | 22:19 |
codekobe | but rabbit is probably the most common implementation | 22:19 |
codekobe | but really we will HAVE to use oslo.messaging | 22:19 |
rm_work | hmm yes | 22:19 |
codekobe | so whatever amqp backend it supports | 22:19 |
codekobe | so i think the Dos attack angle is a moot point | 22:20 |
xgerman | codekobe each amphora has its own queue is a bad idea. Trust me! | 22:20 |
codekobe | if an amphorae gets compromised it can dos the controller rest-api just as much as a queue | 22:20 |
xgerman | Sorry, I am distracted (sbalukoff wants me to put stuff in gerrit) | 22:20 |
codekobe | lol^ | 22:20 |
codekobe | so this design will end up in gerrit here | 22:21 |
codekobe | but this is a good discussion to figure out what that spec will look like | 22:21 |
xgerman | no, it's some other things I promised him | 22:21 |
codekobe | gotcha | 22:21 |
xgerman | but amphora talking to queue is bad | 22:21 |
xgerman | it introduces a single point of error you don't have when you make the controller talk to REST IMHO | 22:21 |
codekobe | continuing to think it through | 22:21 |
sbalukoff | So yes, I don't trust rabbitmq to be secure. | 22:21 |
rm_work | xgerman: it also removes a single point of error, which is the controller | 22:22 |
rm_work | IE, a controller failover event | 22:22 |
codekobe | i dont follow the single point of error, as the queue is clustered, but i do think it will be annoying for every new amphora to require a 1 message rabbit queue to be made with creds | 22:22 |
rm_work | the queues essentially get created on the fly IIRC | 22:22 |
codekobe | yeah, with a queue, you do get HA controllers out of the box | 22:22 |
codekobe | with no failover | 22:23 |
sbalukoff | And it's not really about a DOS vulnerability-- I mean, it's actually not hard to DOS an amphora from the internet, since these things will usually be the front-end "webserver" for a given openstack cluster. | 22:23 |
rm_work | i've never had to "create" a queue | 22:23 |
codekobe | just a grid of controllers | 22:23 |
rm_work | you just define it and start using it, and it's there | 22:23 |
codekobe | but we will have to apply creds to it i imagine, but i guess that is tribial | 22:23 |
codekobe | *trivial | 22:23 |
codekobe | it is a weird pattern, but it does give you HA controllers | 22:24 |
intr1nsic | I think using rabbit is more secure than trying to re-invent something different | 22:24 |
codekobe | without having to worry about failover | 22:24 |
codekobe | and we arent going to avoid using rabbit (or favorite amqp here) | 22:24 |
rm_work | and it works more reliably than UDP and allows the amphora announce time to diverge from the controller poll time, and doesnt risk overloading the controller with health messages | 22:25 |
blogan | better question is if this has been tried before and what were the results | 22:25 |
codekobe | that is a good question | 22:25 |
blogan | how does trove do heart beats? | 22:25 |
codekobe | i still need to deep dive how trove is doing this | 22:25 |
xgerman | well codekobe we tried that before at HP and it doesn't work | 22:25 |
rm_work | xgerman: what did you try exactly? | 22:25 |
intr1nsic | Tried using the queue for heartbeats? | 22:25 |
rm_work | this is a very specific configuration | 22:25 |
rm_work | single-message queues, NOT in reliable mode | 22:26 |
codekobe | I am blaming this on rm_work, he brought it up first hehe | 22:26 |
rm_work | durable, whatever | 22:26 |
xgerman | yes, we tried using a queue for heartbeats -- | 22:26 |
rm_work | codekobe: heh, fine by me | 22:26 |
rm_work | i did lead with "this is a crazy idea" :P | 22:26 |
rm_work | but i like it | 22:26 |
rm_work | a lot | 22:26 |
codekobe | but if it works i would like to be known as an advocate | 22:26 |
rm_work | would need to do some testing to check feasibility at scale | 22:26 |
codekobe | yes, needs a load test probably | 22:27 |
rm_work | just because no one has done it before, doesn't mean it's a bad idea :P | 22:27 |
blogan | im sure someone has done it before | 22:27 |
codekobe | i mean, the queue should be able to handle it as well as an http api | 22:27 |
xgerman | well, I said we did that before | 22:27 |
sbalukoff | xgerman is saying they've done it before. | 22:27 |
codekobe | but we would need some test here | 22:27 |
rm_work | i am still not clear that they did EXACTLY this | 22:27 |
blogan | xgerman: what issues arose? load issues? | 22:27 |
rm_work | specifically, single-message queue length, and non-durable | 22:28 |
xgerman | well, none of the existing queue implementations are very good at HA in a neutron network | 22:28 |
rm_work | and yeah, i'm curious what the problem ended up being that prevented them from moving forward with it :) | 22:28 |
rm_work | err, wat | 22:28 |
rm_work | you couldn't get the queues to work HA? | 22:28 |
rm_work | i feel like that's a problem for using queues for anything, in general -- which is a problem because we use queues elsewhere in Octavia already! | 22:29 |
blogan | well i dont care what way we do it, as long as it works well, but i have to reiterate that I believe a REST API for just a heartbeat is a bit overkill, similar to how it was said a queue was heavy handed | 22:29 |
xgerman | well, hear me out. | 22:29 |
sbalukoff | blogan: I'm actually fine with going with a UDP-based heartbeat (and edge notifications) from the start. | 22:30 |
xgerman | So for RabbitMQ you can only run Active standby in a neutron network - which limits your scale | 22:30 |
sbalukoff | Agreeing to do it over REST was a compromise. | 22:30 |
rm_work | sbalukoff: well, i have a trove of concerns (lol, pun) about the UDP method | 22:30 |
sbalukoff | (over REST as well / or at first) | 22:30 |
blogan | sbalukoff never makes compromises, who is this imposter? | 22:30 |
sbalukoff | rm_work: I'm sure you do. | 22:30 |
codekobe | well lets define some requirements maybe? | 22:30 |
sbalukoff | codekobe: Probably a really good place to start. | 22:31 |
codekobe | because there is at least one thing we should think about before choosing implementation | 22:31 |
xgerman | also when you do a healthcheck over the queue you are testing that the VM can send messages to the queue but NOT if the VM is reachable from outside | 22:31 |
sbalukoff | That, and educating ourselves on how Trove does this, and what didn't work with the HP implementation. | 22:31 |
blogan | just fyi, the amphora lifecycle management doc will not go into detail about how the heartbeat will be accomplished, just that it is done | 22:31 |
codekobe | Do we want the controllers to be HA, or have to failover | 22:31 |
xgerman | we had vms which could send messages to the queue but were not accessible anymore from the outside (all because of Neutron) | 22:31 |
rm_work | xgerman: hmm, so the healthchecks would be from the amp to the controller API via the public interface? | 22:31 |
sbalukoff | blogan: Agreed. That doc is supposed to be relatively high-level anyway. | 22:31 |
blogan | xgerman: couldn't that same thing happen over a rest api? it can only send requests to the controller | 22:32 |
xgerman | rm_work I was thinking the controller would poll the Amp in .5 and then we emit udp | 22:32 |
xgerman | but that doesn't solve the wacky neutron problems we have seen at HP | 22:32 |
rm_work | if the controller polls the amp, just let the amp respond :/ | 22:32 |
sbalukoff | codekobe: Have a look at the design documents for v0.5, v1.0 and v2.0 to answer that question. :) | 22:32 |
rm_work | like, as the response to the poll :P | 22:32 |
codekobe | ahh | 22:32 |
codekobe | i wasnt around for those talks :( | 22:33 |
rm_work | why make a synchronous operation async and UDP? :P | 22:33 |
codekobe | i think it will be hard to switch controller from failover to HA | 22:33 |
xgerman | but a common problem for us health check wise is that the VM replies/emits heartbeats in the control network but lost contact to the public Internet | 22:33 |
codekobe | rather than try and design HA | 22:33 |
xgerman | codekobe controllers will be a cluster -- so intrinsic HA | 22:34 |
codekobe | ok, i'll have to read up on that | 22:34 |
sbalukoff | xgerman: So that wasn't a problem I was looking to solve with the UDP-emitted healthchecks (I envisioned them being just over the LB network), but that's good to know. | 22:34 |
intr1nsic | I'm interested in xgerman experiences but the neutron <-> public access sounds worse than just making sure the instance is up | 22:34 |
rm_work | xgerman: yeah i don't really understand what method you're advocating for | 22:34 |
codekobe | so the amphora should report whether it can hit some public interface? | 22:35 |
rm_work | or, rather, I don't understand the method you mentioned | 22:35 |
xgerman | well, I am just telling you about the problems we have with our queue based healthcheck | 22:35 |
xgerman | it's not like we solved them | 22:35 |
rm_work | anyway, UDP is a neat idea but impractical for healthchecks when the times get very low -- if you're trying to get down to just a few seconds, and it emits every second, and it misses a couple in a row -- that's not great | 22:36 |
xgerman | for what? | 22:36 |
sbalukoff | rm_work: Again, for sub-second failover we don't rely on this system. | 22:36 |
xgerman | I am still not convinced that the controller needs to initiate a failover -- it just needs to clean up | 22:36 |
sbalukoff | xgerman: +1 | 22:37 |
rm_work | sbalukoff: true, but it shouldn't be long, or you'll end up with cascade failures | 22:37 |
xgerman | in Octavia 2.0 I think the failover will be done by our ODS component | 22:37 |
*** fnaval has quit IRC | 22:37 | |
sbalukoff | rm_work: not long in this case is probably a minute or two. | 22:37 |
rm_work | if load is what takes down an amphora, then not replacing a downed amp quickly will just lead to an outage | 22:37 |
sbalukoff | ODS? | 22:37 |
xgerman | Open daylight | 22:38 |
xgerman | or SDN | 22:38 |
sbalukoff | rm_work: A cascading failure is inevitable in that case | 22:38 |
xgerman | yeah, what rm_work likes to do is scale up | 22:38 |
sbalukoff | You literally can't replace them fast enough if you can't scale horizontally. | 22:38 |
sbalukoff | Aah. | 22:38 |
rm_work | i mean, i wasn't really saying it now, since autoscaling is a long way down the road, but yeah eventually i'd like to see one amp goes down -> two replace it automatically | 22:39 |
rm_work | at least short-term | 22:39 |
codekobe | i could see that, if it is due to load | 22:39 |
rm_work | though if it's a real DDoS we'd need mitigation at the network layer | 22:39 |
sbalukoff | rm_work: Ok, so assuming autoscale is running at n+1 capacity, you still have time to "slowly" detect a failure and clean up. | 22:40 |
sbalukoff | If you're not running n+1 capacity, you're screwed in any case. | 22:40 |
rm_work | sbalukoff: I am assuming active-active | 22:40 |
rm_work | BTW | 22:40 |
sbalukoff | rm_work: So am I. | 22:40 |
rm_work | k | 22:40 |
xgerman | well, I am assuming like 20 active LBs on the same VIP | 22:40 |
sbalukoff | xgerman: Yep. | 22:41 |
rm_work | yeah I guess if you're at 20, losing one and replacing it within 60s isn't too bad | 22:41 |
sbalukoff | Especially if those 20 are doing the load of 19. | 22:41 |
sbalukoff | (ie. n+1) | 22:41 |
sbalukoff | (At most the load of 19) | 22:41 |
rm_work | yeah, predictive autoscaling is better than reactive | 22:41 |
sbalukoff | Though if we're going there, it should probably be a percentage of extra capacity rather than just raw amphorae. | 22:42 |
rm_work | getting off-topic though, i didn't really mean to bring that up | 22:42 |
sbalukoff | :) | 22:42 |
rm_work | i was just saying i'd like replacement time to not be long | 22:42 |
xgerman | no worries it's good to talk abpout those things | 22:42 |
sbalukoff | I can see why you did, I think. I was pointing out that we can't rely on the controller to initiate failovers. | 22:42 |
sbalukoff | Only clean up from them. | 22:42 |
codekobe | so is the spec i am to work on still going to cover healthcheck process? or is the entire lifecycle | 22:42 |
xgerman | and it was never my intention to make the controller do that | 22:43 |
sbalukoff | codekobe: blogan is working on the lifecycle spec. | 22:43 |
rm_work | yeah but getting onto "predictive autoscaling" at this point is like talking about what we're going to do when we land on a planet in a different galaxy. let's get to Mars first, plox | 22:43 |
sbalukoff | You can coordinate with him on that, eh. | 22:43 |
codekobe | awesome, so i still care how we implement the healthcheck then | 22:43 |
xgerman | so healthchecks have two components: | 22:43 |
xgerman | How do we measure the health of the lb? | 22:43 |
xgerman | - network reachability | 22:43 |
xgerman | is haproxy running (pid) | 22:43 |
xgerman | is it working (stats?) | 22:43 |
xgerman | do we see traffic? | 22:44 |
xgerman | ... | 22:44 |
xgerman | and then how do we get that info to the controller | 22:44 |
rm_work | yeah, and checking that can be initiated either by the controller or by the amp | 22:44 |
rm_work | and we're trying to solve for scalability | 22:44 |
sbalukoff | xgerman: So all of that can be handled via some localized health-check daemon on the amphora, and then reported to the controller somehow, right? | 22:44 |
rm_work | I think that was the reason for going to UDP | 22:44 |
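A sketch of the local checks xgerman lists above (is haproxy running, is it serving, is traffic flowing) as a health-check daemon on the amphora might gather them. The pid-file and stats-socket paths are assumptions; haproxy's "show stat" command over its UNIX stats socket is a real interface, but the parsing here is deliberately minimal.

    # Hedged sketch of amphora-local haproxy checks; paths are assumptions.
    import os
    import socket

    HAPROXY_PID_FILE = "/var/run/haproxy.pid"        # assumed location
    HAPROXY_STATS_SOCKET = "/var/run/haproxy.sock"   # assumed location

    def haproxy_running():
        try:
            with open(HAPROXY_PID_FILE) as f:
                pid = int(f.read().split()[0])
            os.kill(pid, 0)          # signal 0: existence check only
            return True
        except (IOError, OSError, ValueError):
            return False

    def haproxy_stats():
        """Ask the stats socket for the CSV stats dump ('show stat')."""
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(HAPROXY_STATS_SOCKET)
        s.sendall(b"show stat\n")
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
        s.close()
        return data.decode()

    def local_health():
        alive = haproxy_running()
        stats = haproxy_stats() if alive else ""
        return {"haproxy_running": alive,
                "has_listeners": bool(stats.strip())}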
codekobe | ok, so we still need to figure out the whole amqp vs udp thing | 22:44 |
codekobe | etc | 22:44 |
rm_work | but I think we also can't trade too much reliability | 22:44 |
sbalukoff | codekobe: Yes. | 22:45 |
rm_work | which is where the queue idea came in -- since it solves for both | 22:45 |
codekobe | i'm not sure if we made that decision or just got sidetracked earlier | 22:45 |
*** jorgem has quit IRC | 22:45 | |
xgerman | sbalukoff yes | 22:45 |
rm_work | we got sidetracked | 22:45 |
rm_work | UDP is unreliable, but closer to being a scalable option | 22:45 |
sbalukoff | And I would prefer to look at what Trove is doing, and find out if there were any queue-related problems from HP's experience doing this. | 22:45 |
rm_work | REST is reliable, but less scalable | 22:45 |
rm_work | (IMO) | 22:45 |
sbalukoff | rm_work: Agreed. | 22:46 |
rm_work | AMQP is reliable and scalable | 22:46 |
rm_work | (again, IMO) | 22:46 |
rm_work | so that's why I proposed a queue based solution | 22:46 |
sbalukoff | Part of the idea here is that it shouldn't matter if a health check is missed once in a while. Hence the reason I didn't worry too much about UDP. | 22:46 |
blogan | not reliable from xgerman's experience | 22:46 |
rm_work | yeah, so, revising to "reliable in theory" | 22:46 |
xgerman | not reliable for a lot of messages/clients | 22:46 |
sbalukoff | Also not from ours-- | 22:46 |
rm_work | so, we need to validate concerns | 22:47 |
sbalukoff | Our first load balancing product used rabbitmq for its messaging. | 22:47 |
sbalukoff | The problem we found was that client libraries were unreliable. | 22:47 |
blogan | well we will most likely be using rabbitmq for messaging correct? at least from the API to the controller | 22:47 |
sbalukoff | They didn't handle network events, or server restarts well. | 22:47 |
rm_work | yeah, but there is a huge difference between durable and non-durable queues, for one -- and the number of messages that go on the queue make a difference as well | 22:47 |
xgerman | yes, and we anticipate not having 10,000 controllers | 22:47 |
codekobe | network events they do not | 22:47 |
rm_work | we're talking about 100% non-durable for this | 22:47 |
sbalukoff | blogan: Fewer components will rely on it. :P | 22:48 |
rm_work | non-durable non-persistent queues should not have any of the problems you are mentioning | 22:48 |
blogan | good point | 22:48 |
codekobe | ahh | 22:48 |
blogan | i was just validating that we will be using it still | 22:48 |
codekobe | ok, well any way we can timeout on this topic? | 22:48 |
codekobe | we should dig into trove | 22:48 |
blogan | yeah im about to head out | 22:48 |
codekobe | or at least I should | 22:48 |
rm_work | k | 22:48 |
codekobe | and then report back | 22:48 |
codekobe | maybe tomorrow? | 22:48 |
sbalukoff | Sure | 22:48 |
xgerman | well, practically we run a cluster of queues with one queue server in each availability zone | 22:48 |
sbalukoff | (Also, again, I don't think rabbitmq is hardened enough for the environment we'll be putting it in.) | 22:49 |
xgerman | yep, most queues we have seen have issues if the network is less than pristine | 22:49 |
rm_work | hmm | 22:49 |
sbalukoff | xgerman: I didn't understand the last part of what you just typed. | 22:49 |
rm_work | i have used queues over the Internet and never had problems >_> | 22:50 |
rm_work | i was kinda hoping a local network wouldn't be a problem | 22:50 |
blogan | rm_work: have you used queues over neutron at scale? | 22:50 |
xgerman | ok, we have issues where the neutron network is dropping packets from time to time, which seems to make queue servers very angry | 22:50 |
rm_work | to be fair, no | 22:50 |
rm_work | but that's why i said we'd need to do some load testing | 22:50 |
rm_work | xgerman: like, for their cluster sync? | 22:51 |
blogan | but if xgerman has experiences in it already, thats good enough a test | 22:51 |
sbalukoff | xgerman: And UDP messaging would be OK with this, so long as packet loss isn't high. | 22:51 |
xgerman | rm_work -- yes, cluster sync is a no-go | 22:51 |
rm_work | blogan: well, i'm saying that what i proposed could be significantly different enough from what they tried to make it a different beast | 22:51 |
xgerman | since we can only run active-passive to avoid split brain | 22:51 |
rm_work | but, if the problems were with the basic infra, then maybe not | 22:51 |
*** dboik has quit IRC | 22:52 | |
rm_work | well, you do have to remember that we're talking about this as opposed to *UDP* | 22:52 |
rm_work | so reliability concerns are ... >_> | 22:52 |
blogan | might be different but if the problem is the queues and the unreliable neutron networks then it probably won't matter | 22:52 |
sbalukoff | Actually, they're better with UDP. | 22:52 |
blogan | anyway i gotta go | 22:52 |
blogan | ill talk to yall later | 22:52 |
xgerman | cool | 22:52 |
sbalukoff | If rabbit gets messed up from a few dropped packets, it doesn't recover gracefully. | 22:52 |
xgerman | no it doesn't | 22:52 |
rm_work | ok, so, maybe not rabbit <_< | 22:52 |
rm_work | but i'm talking about AMQP | 22:53 |
xgerman | well, gearman will hang, too | 22:53 |
sbalukoff | If we drop a few UDP packets, who cares? Things keep on chugging once the next packets come through. | 22:53 |
xgerman | yeah, same with REST | 22:53 |
sbalukoff | xgerman: +1 | 22:53 |
xgerman | also our problems get magnified since we run the queues with TLS | 22:53 |
sbalukoff | True, and we will need to, since the LB Network is not a trusted network. | 22:54 |
sbalukoff | Anyway, again, let's find out what Trove is doing. | 22:54 |
sbalukoff | And table this for now. | 22:54 |
sbalukoff | (Any objections to that?) | 22:54 |
rm_work | kk | 22:54 |
intr1nsic | +1 | 22:55 |
xgerman | I know the PTL, we can ask him in Paris ;-) | 22:55 |
sbalukoff | Nice! | 22:55 |
xgerman | yeah, he is HP... | 22:55 |
rm_work | heh | 22:56 |
rm_work | gs | 22:56 |
rm_work | err | 22:56 |
rm_work | just put up https://review.openstack.org/131889 | 22:57 |
sbalukoff | Ok, I'mma go get some lunch. I'll send my unreasonable demands (chat) at you later! | 22:59 |
rm_work | lol lunch | 22:59 |
rm_work | so, I went ahead and jumped the gun and implemented the TLS spec (partially) in that CR :) | 23:00 |
sbalukoff | Bastard! | 23:01 |
sbalukoff | Ok, I'll have a look when I get back. ;) | 23:01 |
ajmiller | trevorv blogan sbalukoff I just posted a new patch to https://review.openstack.org/#/c/130002/6 | 23:06 |
xgerman | rm_work how do we mark as work in progress? | 23:15 |
xgerman | anybody? | 23:16 |
xgerman | ok, figured it out | 23:18 |
*** ptoohil__ has joined #openstack-lbaas | 23:34 | |
*** ajmiller has quit IRC | 23:37 | |
*** ptoohill_ has quit IRC | 23:38 | |
*** barclaac|2 has quit IRC | 23:45 | |
rm_you| | xgerman: ah sorry, yeah | 23:50 |
xgerman | no worries | 23:50 |
rm_you| | review -> workflow -1 | 23:50 |
rm_you| | but you got it | 23:50 |
xgerman | yeah, I am pair programming with Min | 23:50 |
*** barclaac has joined #openstack-lbaas | 23:55 | |
*** rm_you| is now known as rm_you | 23:56 |