*** ChanServ changes topic to "OpenStack Bare Metal Provisioning | Docs: http://docs.openstack.org/developer/ironic/ | Bugs: https://bugs.launchpad.net/ironic | Status: https://etherpad.openstack.org/p/IronicWhiteBoard" | 00:02 | |
openstackgerrit | Jay Faulkner proposed a change to openstack/ironic-python-agent: Use docker import/export to make image smaller https://review.openstack.org/87819 | 00:19 |
openstackgerrit | Jay Faulkner proposed a change to openstack/ironic-python-agent: Use docker import/export to make image smaller https://review.openstack.org/87819 | 00:20 |
dwalleck_ | devananda: I almost mentioned this earlier when we were talking about Devstack instructions, but is getting Ironic docs up to api.openstack.org and other parts of OpenStack something that needs some help? I have nothing but time at this point :-) | 00:23 |
devananda | dwalleck_: yep! that'd be great | 00:24 |
dwalleck_ | Good deal. I'll poke around in my spare time and see if I can get something bootstrapped. I think it shouldn't be too bad | 00:26 |
*** matsuhashi has joined #openstack-ironic | 00:30 | |
*** zdiN0bot has joined #openstack-ironic | 00:32 | |
*** yongli has joined #openstack-ironic | 00:36 | |
*** zdiN0bot has quit IRC | 00:36 | |
dwalleck_ | oh my....they generate everything from wadls | 00:39 |
*** blamar has quit IRC | 00:41 | |
*** zdiN0bot has joined #openstack-ironic | 00:43 | |
*** eguz has quit IRC | 00:45 | |
*** zdiN0bot has quit IRC | 00:47 | |
russell_h | dwalleck_: :) | 00:47 |
dwalleck_ | This is what I get for trying to help =P | 00:48 |
*** blamar has joined #openstack-ironic | 00:59 | |
russell_h | dwalleck_: we're supposed to have people who know how that stuff works | 00:59 |
*** newell_ has quit IRC | 01:02 | |
*** ilives has joined #openstack-ironic | 01:16 | |
*** matsuhashi has quit IRC | 01:40 | |
*** matsuhas_ has joined #openstack-ironic | 01:43 | |
*** zdiN0bot has joined #openstack-ironic | 01:44 | |
*** zdiN0bot has quit IRC | 01:48 | |
*** coolsvap|afk is now known as coolsvap | 02:35 | |
*** harlowja is now known as harlowja_away | 02:35 | |
*** zdiN0bot has joined #openstack-ironic | 02:44 | |
*** zdiN0bot has quit IRC | 02:48 | |
*** matsuhas_ has quit IRC | 02:49 | |
*** matsuhashi has joined #openstack-ironic | 02:49 | |
*** matsuhas_ has joined #openstack-ironic | 02:52 | |
*** matsuhashi has quit IRC | 02:52 | |
*** matsuhas_ has quit IRC | 02:58 | |
*** nosnos has quit IRC | 03:35 | |
*** zdiN0bot has joined #openstack-ironic | 03:45 | |
*** zdiN0bot has quit IRC | 03:49 | |
*** pradipta` is now known as pradipta | 04:19 | |
*** nosnos has joined #openstack-ironic | 04:25 | |
*** saju_m has joined #openstack-ironic | 04:27 | |
*** saju_m has quit IRC | 04:27 | |
*** lazy_prince has joined #openstack-ironic | 04:34 | |
*** coolsvap is now known as coolsvap|afk | 04:35 | |
*** dwalleck_ has quit IRC | 04:36 | |
*** zdiN0bot has joined #openstack-ironic | 04:46 | |
*** zdiN0bot has quit IRC | 04:50 | |
*** coolsvap|afk is now known as coolsvap | 04:52 | |
*** Mikhail_D_ltp has joined #openstack-ironic | 04:54 | |
*** radsy has quit IRC | 04:56 | |
*** zdiN0bot has joined #openstack-ironic | 05:23 | |
*** zdiN0bot has quit IRC | 05:27 | |
*** pradipta is now known as pradipta_away | 05:40 | |
*** florentflament has quit IRC | 05:49 | |
*** sabah has joined #openstack-ironic | 05:56 | |
*** pradipta_away is now known as pradipta | 06:03 | |
*** zdiN0bot has joined #openstack-ironic | 06:23 | |
*** zdiN0bot has quit IRC | 06:28 | |
*** pradipta is now known as pradipta_away | 06:38 | |
*** Mikhail_D_ltp has quit IRC | 06:51 | |
openstackgerrit | Haomeng,Wang proposed a change to openstack/ironic: Implements send-data-to-ceilometer https://review.openstack.org/72538 | 06:51 |
*** romcheg1 has joined #openstack-ironic | 06:57 | |
*** foexle has joined #openstack-ironic | 07:03 | |
*** ilives has quit IRC | 07:11 | |
*** ilives has joined #openstack-ironic | 07:11 | |
*** zdiN0bot has joined #openstack-ironic | 07:24 | |
*** zdiN0bot has quit IRC | 07:29 | |
GheRivero | morning all | 07:36 |
Haomeng | GheRivero: morning:) | 07:36 |
Mikhail_D_wk | Good morning all! :) | 07:39 |
*** Haomeng has quit IRC | 07:44 | |
*** jistr has joined #openstack-ironic | 07:52 | |
*** mrda is now known as mrda_away | 08:00 | |
*** romcheg1 has left #openstack-ironic | 08:03 | |
*** lucasagomes has joined #openstack-ironic | 08:07 | |
*** yuriyz has joined #openstack-ironic | 08:12 | |
*** derekh has joined #openstack-ironic | 08:12 | |
dtantsur | Morning Ironic, morning GheRivero, Haomeng, Mikhail_D_wk :) | 08:16 |
Mikhail_D_wk | dtantsur: morning :) | 08:17 |
yuriyz | morning All | 08:17 |
lucasagomes | morning yuriyz dtantsur Mikhail_D_wk | 08:18 |
lucasagomes | lifeless, morning, ping re https://bugs.launchpad.net/ironic/+bug/1199665 and https://bugs.launchpad.net/ironic/+bug/1231351 | 08:18 |
Mikhail_D_wk | yuriyz, lucasagomes morning! :) | 08:18 |
lifeless | lucasagomes: hi | 08:18 |
*** athomas has joined #openstack-ironic | 08:18 | |
lucasagomes | lifeless, they contradict each other... I would like your opinion on that, should we be caching images at all? | 08:19 |
lifeless | lucasagomes: oh, so hmmm | 08:20 |
lifeless | lucasagomes: I don't think we want to grow storage without bound | 08:20 |
dtantsur | lucasagomes, lifeless morning :) | 08:21 |
lifeless | lucasagomes: is there a size limit on the cache? | 08:21 |
dtantsur | lifeless, not now, I can apply (+ timeouts as well) | 08:21 |
lucasagomes | lifeless, not in trunk, but there's a patch upstream by dtantsur that does limit the size of the cache | 08:21 |
yuriyz | dtantsur, looks like image will not be downloaded if exists even if it was changed in Glance https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/pxe.py#L330 | 08:22 |
dtantsur | yuriyz, yes, and my point is to change this as well, so that if checksum changes, image is redownloaded no matter what | 08:22 |
dtantsur | lifeless, ^^ | 08:22 |
lucasagomes | yeah there's no cache invalidation in that patch, but would be good to add | 08:23 |
lucasagomes | dtantsur, got the link handy there? | 08:23 |
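The invalidation dtantsur describes (re-download whenever the Glance checksum no longer matches the cached copy) could be sketched roughly like this. `fetch_image` and the use of MD5 here are hypothetical stand-ins for whatever helpers the real patch uses, not the actual ironic code:

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_image(path, glance_checksum, fetch_image):
    """Download the image unless a cached copy with a matching checksum exists.

    fetch_image(path) is a caller-supplied callable (hypothetical; stands in
    for the Glance download helper) that writes the image to `path`.
    Returns True if a download happened, False if the cache was reused.
    """
    if os.path.exists(path) and md5_of(path) == glance_checksum:
        return False      # cached copy still matches Glance
    fetch_image(path)     # missing or stale: (re)download
    return True
```

With this shape, a changed image in Glance simply changes the checksum the caller passes in, and the stale cache entry is overwritten on the next deploy.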
lifeless | so | 08:24 |
lifeless | I think: | 08:24 |
lifeless | - total size limit | 08:24 |
lifeless | - if exceeded we need to serialize things carefully (e.g. what if a single image is larger than cache size) | 08:24 |
lifeless | - cache size doesn't apply to kernel and initrd files | 08:24 |
lifeless | - delete images if not used for $time | 08:25 |
lifeless | sounds great to me | 08:25 |
*** zdiN0bot has joined #openstack-ironic | 08:25 | |
lucasagomes | lifeless, right that's a good list | 08:25 |
lifeless | and the bugs don't conflict :) | 08:25 |
lucasagomes | lifeless, and for the default option, by default should cache be disabled? | 08:25 |
lifeless | lucasagomes: enabled | 08:25 |
*** Haomeng has joined #openstack-ironic | 08:25 | |
lucasagomes | as we have that rule of having defaults ready for production | 08:25 |
lucasagomes | lifeless, ack | 08:25 |
lucasagomes | lifeless, cheers for the input | 08:26 |
lifeless | lucasagomes: nova BM takes nearly 2 hours to deploy to a rack | 08:26 |
lifeless | lucasagomes: ironic today takes 22m | 08:26 |
dtantsur | lifeless, could you leave this as a comment to my patch? | 08:26 |
lifeless | lucasagomes: :) | 08:26 |
lucasagomes | heh | 08:26 |
lucasagomes | w00t | 08:26 |
dtantsur | lifeless, wow! really wow | 08:26 |
lifeless | dtantsur: I need to pop out of here and do other stuff - just copy the IRC log :> | 08:26 |
dtantsur | lifeless, ack :) | 08:27 |
Haomeng | lucasagomes: morning | 08:27 |
lucasagomes | Haomeng, morning there :) | 08:27 |
Haomeng | lucasagomes: one quick question: do you know which python-keystoneclient version is used by Jenkins? I found it is not the latest one, but my local one is the latest python-keystoneclient, and our ironic refers to the keystoneclient conf files | 08:28 |
dtantsur | ifarkas, we reached some consensus on caching above ^^^ | 08:29 |
Haomeng | lucasagomes: so the output of "tools/config/generate_sample.sh" is difference with Jenkins, because we have different base version | 08:29 |
*** zdiN0bot has quit IRC | 08:29 | |
Haomeng | lucasagomes: where can we get the same python-keystoneclient version as Jenkins? | 08:30 |
lucasagomes | Haomeng, hmm Jenkins will download it from pip as we do when generating our venv | 08:30 |
lucasagomes | Haomeng, one possible problem would be if the new version of python-keystoneclient was just freshly released and infra mirrors the pip repo | 08:30 |
lucasagomes | it would be outdated | 08:30 |
Haomeng | lucasagomes: I think so, but not sure if pip source is same one | 08:30 |
Haomeng | maybe | 08:30 |
* ifarkas reads back | 08:31 | |
Haomeng | lucasagomes: :) | 08:31 |
lucasagomes | Haomeng, I dunno much about infra, I think they do have a mirror... I think it's worth asking at #openstack-infra | 08:31 |
Haomeng | lucasagomes: but it is difficult to get the same version as Jenkins, not sure about the pip repo | 08:31 |
Haomeng | used by jenkins | 08:31 |
Haomeng | lucasagomes: ok | 08:31 |
lucasagomes | :) | 08:32 |
Haomeng | lucasagomes: let me check with #openstack-infra irc, thank you:) | 08:32 |
lucasagomes | Haomeng, np :) | 08:32 |
Haomeng | lucasagomes: nice day:) | 08:32 |
Haomeng | lucasagomes: :) | 08:32 |
lucasagomes | Haomeng, you too buddy | 08:32 |
Haomeng | lucasagomes: :) | 08:32 |
*** dshulyak has joined #openstack-ironic | 08:34 | |
*** pradipta_away is now known as pradipta | 08:35 | |
lifeless | lucasagomes: actually let me go further | 08:44 |
lifeless | lucasagomes: ironic is fast because it was designed with shared state for images etc | 08:44 |
lifeless | lucasagomes: I would say the cache cannot be disabled, only tuned. | 08:44 |
lucasagomes | lifeless, wouldn't setting the total size limit to 0 be a way to "disable" it? | 08:44 |
lucasagomes | which is grand imo | 08:45 |
lifeless | no - consider the case of 'image bigger than cache' - we have to download the image; we should deploy - it would be ideal then if there are other deploys needing that image to reuse it before releasing whatever lock and having it deleted | 08:46 |
*** dtantsur is now known as dtantsur|lunch | 08:46 | |
lucasagomes | lifeless, ahh +1.. | 08:48 |
lucasagomes | lifeless, if the image is bigger than the cache AND we no longer have links to it, then we get rid of it | 08:48 |
*** eghobo has joined #openstack-ironic | 08:51 | |
lifeless | lucasagomes: right, though I'm not sure we ever have links for images we're going to dd... | 08:51 |
*** martyntaylor has joined #openstack-ironic | 08:53 | |
lucasagomes | right, some condition then... the cleanup function could check whether more nodes set to be deployed are going to use that image | 08:53 |
lifeless | something | 08:54 |
lifeless | we can iterate to it | 08:54 |
lifeless | key thing is that the user setting doesn't disable the cache codepaths | 08:54 |
lifeless | it just makes cleanup more aggressive | 08:54 |
lucasagomes | ack, sounds reasonable to me | 08:55 |
lucasagomes | I'm going to add a comment to the patch | 08:55 |
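lifeless's policy list above (total size limit, deletion of images unused for some time, kernel/initrd files exempt, and a limit of 0 meaning maximally aggressive cleanup rather than a disabled cache) could look roughly like this as a cleanup pass. The function name and the name-based exemption check are illustrative assumptions, not the code under review:

```python
import os
import time


def clean_up_cache(cache_dir, max_total_bytes, ttl_seconds,
                   exempt=('kernel', 'ramdisk')):
    """One cleanup pass over a flat image cache directory.

    - files whose name contains an exempt marker (kernel/initrd-style
      artifacts) are never touched;
    - anything not modified for ttl_seconds is deleted;
    - then the oldest remaining files are evicted until the total size
      fits under max_total_bytes.
    A max_total_bytes of 0 makes eviction maximally aggressive, but the
    caching codepath itself is never bypassed, as discussed above.
    """
    now = time.time()
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path) or any(m in name for m in exempt):
            continue
        st = os.stat(path)
        if now - st.st_mtime > ttl_seconds:
            os.unlink(path)  # unused for too long
        else:
            entries.append((st.st_mtime, st.st_size, path))
    total = sum(size for _, size, _ in entries)
    for _, size, path in sorted(entries):  # oldest first
        if total <= max_total_bytes:
            break
        os.unlink(path)
        total -= size
```

A real implementation would also need the locking lifeless mentions for the image-larger-than-cache case, so an in-use image is not deleted mid-deploy.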
openstackgerrit | Lucas Alvares Gomes proposed a change to openstack/ironic: Add DiskPartitioner https://review.openstack.org/83396 | 09:08 |
openstackgerrit | Lucas Alvares Gomes proposed a change to openstack/ironic: Use DiskPartitioner https://review.openstack.org/83399 | 09:08 |
openstackgerrit | Lucas Alvares Gomes proposed a change to openstack/ironic: Get rid of the swap partition https://review.openstack.org/83726 | 09:08 |
openstackgerrit | Lucas Alvares Gomes proposed a change to openstack/ironic: Use GB instead of MB for swap https://review.openstack.org/83788 | 09:09 |
*** overlayer has joined #openstack-ironic | 09:13 | |
*** max_lobur has joined #openstack-ironic | 09:25 | |
*** zdiN0bot has joined #openstack-ironic | 09:26 | |
*** zdiN0bot has quit IRC | 09:30 | |
*** mdickson has joined #openstack-ironic | 09:37 | |
*** athomas has quit IRC | 09:37 | |
*** athomas has joined #openstack-ironic | 09:38 | |
*** mdickson2 has quit IRC | 09:41 | |
*** dtantsur|lunch is now known as dtantsur | 09:53 | |
*** mdickson2 has joined #openstack-ironic | 09:53 | |
*** mdickson has quit IRC | 09:54 | |
*** nosnos has quit IRC | 10:07 | |
*** nosnos has joined #openstack-ironic | 10:12 | |
*** eghobo has quit IRC | 10:21 | |
*** zdiN0bot has joined #openstack-ironic | 10:26 | |
*** zdiN0bot has quit IRC | 10:31 | |
*** nosnos has quit IRC | 10:34 | |
*** nosnos has joined #openstack-ironic | 10:35 | |
openstackgerrit | Vladimir Kozhukalov proposed a change to openstack/ironic-python-agent: Added disk partitioner https://review.openstack.org/86163 | 10:38 |
*** nosnos has quit IRC | 10:39 | |
openstackgerrit | A change was merged to openstack/ironic: Some minor clean up of various doc pages https://review.openstack.org/87765 | 10:44 |
*** overlayer has quit IRC | 10:46 | |
*** Alexei_987 has joined #openstack-ironic | 10:51 | |
*** nosnos has joined #openstack-ironic | 11:06 | |
*** ifarkas has quit IRC | 11:10 | |
*** coolsvap is now known as coolsvap|afk | 11:15 | |
*** lucasagomes is now known as lucas-hungry | 11:16 | |
openstackgerrit | Andrey Kurilin proposed a change to openstack/python-ironicclient: Sync latest code and reuse exceptions from oslo https://review.openstack.org/71500 | 11:18 |
openstackgerrit | Andrey Kurilin proposed a change to openstack/python-ironicclient: Reuse module `cliutils` from common code https://review.openstack.org/72418 | 11:22 |
*** ifarkas has joined #openstack-ironic | 11:22 | |
*** mdickson has joined #openstack-ironic | 11:22 | |
*** mdickson2 has quit IRC | 11:22 | |
*** zdiN0bot has joined #openstack-ironic | 11:27 | |
*** zdiN0bot has quit IRC | 11:31 | |
*** romcheg1 has joined #openstack-ironic | 11:36 | |
openstackgerrit | Vladimir Kozhukalov proposed a change to openstack/ironic-python-agent: Added disk partitioner https://review.openstack.org/86163 | 11:52 |
openstackgerrit | Vladimir Kozhukalov proposed a change to openstack/ironic-python-agent: Added disk partitioner https://review.openstack.org/86163 | 11:55 |
*** sabah has quit IRC | 12:11 | |
Shrews | morning ironic. fyi, change your IRC passwords: http://blog.freenode.net/2014/04/heartbleed/ | 12:11 |
Shrews | it seems Nickserv was temporarily compromised | 12:12 |
dtantsur | ouch... | 12:13 |
*** pradipta is now known as pradipta_away | 12:14 | |
*** nosnos has quit IRC | 12:17 | |
*** dtantsur is now known as dtantsur|bbl | 12:17 | |
*** dtantsur|bbl has quit IRC | 12:19 | |
*** lucas-hungry is now known as lucasagomes | 12:24 | |
agordeev | morning Ironic | 12:27 |
*** zdiN0bot has joined #openstack-ironic | 12:28 | |
romcheg1 | Morning agordeev and everyone else | 12:28 |
agordeev | Shrews: should i change my pass if i haven't used ssl connection for IRC? | 12:31 |
agordeev | romcheg1: morning :) | 12:31 |
Shrews | agordeev: because Nickserv was compromised, if you happened to authenticate with Nickserv during that period, your password would have been exposed... ssl or not | 12:32 |
*** zdiN0bot has quit IRC | 12:33 | |
agordeev | Shrews: thanks, i got it. Glad to recall that i don't use IRC on weekends (mostly because of not having PC/'internet access' at home) :D | 12:37 |
*** jdob has joined #openstack-ironic | 12:38 | |
*** overlayer has joined #openstack-ironic | 12:39 | |
romcheg1 | lucasagomes: I'm looking at the IPMI console patch | 12:39 |
romcheg1 | It's still bound to using HTTP all the time | 12:39 |
romcheg1 | I don't remember, what was our decision last time we noticed that? | 12:40 |
lucasagomes | romcheg1, hey afternoon | 12:44 |
lucasagomes | hmmmm trying to remember | 12:44 |
lucasagomes | I think that it was about first getting console access implemented, even if it's only http for now | 12:45 |
lucasagomes | that's enough to have pair functionality with nova bm | 12:45 |
lucasagomes | and after that we can start adding more stuff on it | 12:45 |
romcheg1 | Makes sense | 12:45 |
lucasagomes | linggao is also changing a bit the api to make it more flexible to get the console information | 12:46 |
*** dtantsur has joined #openstack-ironic | 12:52 | |
*** jbjohnso has joined #openstack-ironic | 12:56 | |
NobodyCam | Good morning IRonic | 12:56 |
openstackgerrit | A change was merged to openstack/ironic: Fix message preventing overwrite the instance_uuid https://review.openstack.org/87731 | 13:05 |
Mikhail_D_wk | NobodyCam: morning :) | 13:06 |
NobodyCam | morning Mikhail_D_wk :) | 13:07 |
openstackgerrit | Vladimir Kozhukalov proposed a change to openstack/ironic-python-agent: Added disk utils https://review.openstack.org/86163 | 13:08 |
openstackgerrit | Vladimir Kozhukalov proposed a change to openstack/ironic-python-agent: Added disk utils https://review.openstack.org/86163 | 13:11 |
*** jgrimm has joined #openstack-ironic | 13:12 | |
lucasagomes | morning NobodyCam :) | 13:14 |
NobodyCam | brb... morning walkies | 13:14 |
NobodyCam | morning lucasagomes :) | 13:14 |
lucasagomes | NobodyCam, enjoy the walkies :) | 13:14 |
*** linggao has joined #openstack-ironic | 13:16 | |
linggao | ping lucasagomes | 13:18 |
lucasagomes | linggao, pong | 13:18 |
lucasagomes | linggao, morning :) | 13:18 |
linggao | good morning lucasagomes | 13:18 |
linggao | I am reviewing your patch https://review.openstack.org/#/c/86588/ | 13:19 |
linggao | I saw you have removed the VendorPassThrou Interface and replaced it with ManagementInterface | 13:20 |
lucasagomes | linggao, yeah, for ipmitool/native because the only method in the vendor_passthru was the set_boot_device | 13:20 |
lucasagomes | for the seamicro I kept, since they do have other stuff, attach_volume etc | 13:21 |
linggao | I see. that's okay. | 13:21 |
linggao | I saw you also have a patch https://review.openstack.org/#/c/85742/ | 13:21 |
linggao | to set the boot device persistent | 13:21 |
lucasagomes | ahn /me have forgot that hah | 13:21 |
lucasagomes | I probably could rebase it on top of the management interface one | 13:22 |
linggao | is that patch really needed? should the persistent be moved to the new patch? | 13:22 |
linggao | yes, that's what I meant. | 13:22 |
lucasagomes | hmm I could add it to the same one porting to the management interface | 13:23 |
lucasagomes | but I could argue that they are 2 diff things as well, 1 is a port the other one is adding a new functionality | 13:24 |
lucasagomes | like 2 diff changes | 13:24 |
lucasagomes | I'm fine doing it in the same patch or a separated one as is now | 13:24 |
linggao | It's okay with me either way. | 13:25 |
linggao | thanks. | 13:25 |
lucasagomes | linggao, ack, thank YOU for pointing that out to me... I had forgotten about the persistent one | 13:26 |
lucasagomes | marked as WIP for rebase or inclusion in the other patch | 13:26 |
linggao | lucasagomes, np. thanks for all the help you've given me. | 13:26 |
lucasagomes | np at all :) | 13:26 |
NobodyCam | good morning linggao | 13:27 |
linggao | Hey NobodyCam, good morning. Got coffee? | 13:28 |
NobodyCam | si | 13:28 |
NobodyCam | :) | 13:28 |
*** matty_dubs|gone is now known as matty_dubs | 13:29 | |
*** zdiN0bot has joined #openstack-ironic | 13:29 | |
* NobodyCam has decent intertubes too :) | 13:30 | |
*** zdiN0bot has quit IRC | 13:33 | |
NobodyCam | all: quick question on the info logging patch. There seem to be many questions about adding info logging to ssh.py. Several revisions ago it looked like this: https://review.openstack.org/#/c/85124/3/ironic/drivers/modules/ssh.py | 13:55 |
openstackgerrit | A change was merged to stackforge/pyghmi: Add optical and bios aliases for boot devices https://review.openstack.org/87682 | 13:57 |
NobodyCam | looking for any input on whether I should continue to add info logging to ssh or just remove it | 13:57 |
lucasagomes | NobodyCam, :( I think we should have logs... on that link ^... "Attempting..." it sounds more like a debug indeed | 13:59 |
lucasagomes | I would log the INFO after the state change happened | 13:59 |
lucasagomes | INFO("Node <uuid> is powered on") | 13:59 |
lucasagomes | if it fails to change the power state | 13:59 |
lucasagomes | ERROR("Failed to power on node <uuid>") | 13:59 |
lucasagomes | I see info as hmmm a storyline of the events that happened to that node | 14:00 |
lucasagomes | node is now active, node is powered on, node is powered off etc... | 14:00 |
lucasagomes | DEBUG for the attempts or logging commands "Attempting to power on node <uuid>: <cmd>" | 14:00 |
lucasagomes | things like that | 14:01 |
NobodyCam | ahh ok will rework | 14:01 |
lucasagomes | INFO for success, ERROR for failures, WARNING for failures that will be automatic retried or fallback somehow, CRITICAL for errors that compromise the whole service and not an operation | 14:01 |
lucasagomes | and DEBUG for the rest | 14:02 |
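lucasagomes's level convention above (DEBUG for attempts and commands, INFO for completed state changes, ERROR for failures) could be sketched like this; `power_on` and `run_cmd` are hypothetical stand-ins for the ssh.py driver methods, not the patch itself:

```python
import logging

LOG = logging.getLogger(__name__)


def power_on(node_uuid, run_cmd):
    """Illustrative only: run_cmd is a caller-supplied callable that
    issues the power command and raises on failure."""
    cmd = 'power on'
    # DEBUG: the attempt and the command being logged, per the convention
    LOG.debug("Attempting to power on node %s: %s", node_uuid, cmd)
    try:
        run_cmd(cmd)
    except Exception:
        # ERROR: the state change failed
        LOG.error("Failed to power on node %s", node_uuid)
        raise
    # INFO: part of the "storyline" of events that happened to the node
    LOG.info("Node %s is powered on", node_uuid)
```

WARNING would slot in for failures that will be automatically retried, and CRITICAL for errors that compromise the whole service rather than one operation.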
NobodyCam | brb | 14:07 |
*** lazy_prince has quit IRC | 14:10 | |
*** dwalleck has joined #openstack-ironic | 14:27 | |
*** overlayer has quit IRC | 14:29 | |
*** zdiN0bot has joined #openstack-ironic | 14:29 | |
*** zdiN0bot has quit IRC | 14:34 | |
*** martyntaylor has quit IRC | 14:34 | |
*** dwalleck has quit IRC | 14:36 | |
*** dwalleck has joined #openstack-ironic | 14:36 | |
*** dwalleck_ has joined #openstack-ironic | 14:39 | |
*** dwalleck has quit IRC | 14:42 | |
*** martyntaylor has joined #openstack-ironic | 14:48 | |
*** coolsvap|afk is now known as coolsvap | 14:51 | |
*** hemna_ has quit IRC | 14:52 | |
*** ilives has quit IRC | 14:54 | |
*** ilives has joined #openstack-ironic | 14:58 | |
Shrews | so, nova provides a default "rebuild" if the driver doesn't implement one. has anyone tried to see if this actually works with real bm? | 14:58 |
Shrews | since ironic doesn't provide one, i'm just curious if/how it works | 14:59 |
Shrews | devananda or NobodyCam? ^^^ | 15:00 |
*** dwalleck_ has quit IRC | 15:03 | |
NobodyCam | Shrews: I have not tested, but would assume that it does not work | 15:05 |
Shrews | it works with the vm's from devstack... but that's not actual h/w :) | 15:06 |
NobodyCam | oh then it may work... I'm not sure | 15:11 |
NobodyCam | bbiafm | 15:12 |
openstackgerrit | Chris Behrens proposed a change to openstack/ironic: Add create() and destroy() to Node https://review.openstack.org/84823 | 15:19 |
openstackgerrit | Chris Behrens proposed a change to openstack/ironic: Clean up calls to get_node() https://review.openstack.org/84573 | 15:19 |
openstackgerrit | Chris Behrens proposed a change to openstack/ironic: Make sync_power_states yield https://review.openstack.org/84862 | 15:19 |
openstackgerrit | Chris Behrens proposed a change to openstack/ironic: Refactor sync_power_states tests to not use DB https://review.openstack.org/87076 | 15:19 |
openstackgerrit | Chris Krelle proposed a change to openstack/ironic: Add Logging https://review.openstack.org/85124 | 15:22 |
NobodyCam | lucasagomes: Mikhail_D_wk: let me know how that version looks | 15:23 |
*** ilives has quit IRC | 15:24 | |
*** ilives has joined #openstack-ironic | 15:25 | |
*** zdiN0bot has joined #openstack-ironic | 15:30 | |
*** zdiN0bot has quit IRC | 15:34 | |
openstackgerrit | Chris Krelle proposed a change to openstack/ironic: Add Logging https://review.openstack.org/85124 | 15:42 |
*** vkozhukalov has joined #openstack-ironic | 15:46 | |
*** martyntaylor has quit IRC | 15:57 | |
jroll | good morning ironic | 15:59 |
NobodyCam | good morning jroll | 15:59 |
*** eghobo has joined #openstack-ironic | 16:03 | |
*** matty_dubs is now known as matty_dubs|lunch | 16:05 | |
NobodyCam | devananda: once you're up and going, looks like https://review.openstack.org/#/c/87396 needs a quick rebase | 16:09 |
*** max_lobur has quit IRC | 16:09 | |
linggao | lucasagomes, ping | 16:09 |
lucasagomes | linggao, pong | 16:11 |
lucasagomes | morning jroll | 16:11 |
linggao | This is regarding the console API. | 16:11 |
linggao | the GET is v1/nodes/<uuid>/console to return the console information. | 16:11 |
openstackgerrit | Russell Haering proposed a change to openstack/ironic: Fix an issue that left nodes permanently locked https://review.openstack.org/88017 | 16:12 |
linggao | The PUT v1/nodes/<uuid>/states/console to enable/disable the console. | 16:12 |
linggao | I agree with you that it is a little odd to have get/put in different places. | 16:13 |
linggao | but get gets more info than just the console enablement state. | 16:13 |
linggao | maybe change put to v1/nodes/<uuid>/states/console_enabled ? | 16:14 |
linggao | another thing is that v1/nodes/<uuid>/states returns all node states, including the console state. | 16:15 |
linggao | and v1/nodes/<uuid>/states/power turns node power on/off. | 16:15 |
lucasagomes | linggao, yeah hmmm /me thinking | 16:17 |
*** martyntaylor has joined #openstack-ironic | 16:17 | |
lucasagomes | the /states/... is meant to change all the states (power, provision, console) | 16:18 |
lucasagomes | idk maybe we shouldn't have console as a state? | 16:18 |
lucasagomes | PUT /nodes/<uuid>/console, to enable/disable it | 16:18 |
lucasagomes | GET /nodes/<uuid>/console to get the info about the console (if enabled or disable, url to access it) | 16:19 |
linggao | and remove the console state from the ALL state. | 16:19 |
lucasagomes | yeah | 16:19 |
linggao | I like it. | 16:20 |
linggao | this is more consistent. | 16:20 |
lucasagomes | so console will be a sub-element of a node and have it's own URI | 16:20 |
* NobodyCam brb making minor network change so he can access his test laptop | 16:20 | |
lucasagomes | yeah | 16:20 |
lucasagomes | I think it's better than having it divided | 16:21 |
lucasagomes | and leave states for power and provision states | 16:21 |
linggao | yes. | 16:21 |
lucasagomes | it's already a bit confusing that we do have 4 states for a node | 16:21 |
lucasagomes | [target_]power_state, [target_]provision_state | 16:21 |
lucasagomes | devananda, NobodyCam any objections ^ | 16:21 |
lucasagomes | s/objections/objections?/g | 16:22 |
linggao | agree. now with my patch, the GET console returns {'console_enabled': True, 'console_info': {'type': 'shellinaboxd', 'url': 'http://<hostname>:<port>'}} for the node with console enabled. {'console_enabled': False, 'console_info': None} for the node with console disabled. | 16:23 |
lucasagomes | yeah that's good, much easier to parse | 16:24 |
lucasagomes | check if enable, then get console_info | 16:24 |
*** rloo has joined #openstack-ironic | 16:25 | |
linggao | yes | 16:25 |
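A client consuming the GET v1/nodes/<uuid>/console document agreed above only needs the two-step check lucasagomes describes (check if enabled, then read console_info). This tiny helper is just an illustration of that contract, not ironic client code:

```python
def console_url(console_doc):
    """Given the proposed GET v1/nodes/<uuid>/console response body,
    return the console access URL, or None when the console is disabled.

    console_doc is the already-decoded JSON dict, e.g.
    {'console_enabled': True,
     'console_info': {'type': 'shellinaboxd', 'url': 'http://host:port'}}
    """
    if not console_doc.get('console_enabled'):
        return None  # disabled nodes carry console_info: None
    info = console_doc.get('console_info') or {}
    return info.get('url')
```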
*** zdiN0bot has joined #openstack-ironic | 16:31 | |
*** zdiN0bot has quit IRC | 16:35 | |
*** ilives has quit IRC | 16:36 | |
*** Mikhail_D_ltp has joined #openstack-ironic | 16:44 | |
*** newell_ has joined #openstack-ironic | 16:48 | |
*** mdickson has quit IRC | 16:48 | |
kevinbenton | devananda: ping | 16:48 |
*** mdickson has joined #openstack-ironic | 16:49 | |
*** harlowja_away is now known as harlowja | 16:51 | |
*** matty_dubs|lunch is now known as matty_dubs | 16:53 | |
devananda | morning, all | 16:57 |
devananda | Shrews: nova.virt.baremetal driver implements rebuild method, and afaik tripleo uses / tests it | 16:57 |
*** foexle has quit IRC | 16:58 | |
*** rloo has quit IRC | 16:58 | |
*** rloo has joined #openstack-ironic | 16:58 | |
*** wendar has quit IRC | 16:59 | |
devananda | lucasagomes: "INFO for success, ..." yes. I'm going to add that to the bug report (or open one if we dont already have one) | 16:59 |
*** rloo has quit IRC | 17:00 | |
*** vkozhukalov has quit IRC | 17:00 | |
lucasagomes | devananda, :) ack | 17:00 |
*** rloo has joined #openstack-ironic | 17:00 | |
NobodyCam | Good morning devananda :) | 17:00 |
devananda | and good morning :) | 17:00 |
lucasagomes | devananda, morning | 17:00 |
*** wendar has joined #openstack-ironic | 17:01 | |
Shrews | devananda: but that driver isn't being used. there is a default implementation: http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n2382 | 17:01 |
*** derekh has quit IRC | 17:02 | |
*** rloo has quit IRC | 17:04 | |
devananda | Shrews: http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n2551 | 17:04 |
devananda | http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/baremetal/driver.py#n308 | 17:04 |
*** rloo has joined #openstack-ironic | 17:04 | |
devananda | Shrews: that is being called. I tried calling it for nova.virt.ironic and got http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/driver.py#n271 | 17:05 |
Shrews | devananda: i call it on my devstack test machine and I get a rebuild | 17:05 |
devananda | with ironic?? | 17:05 |
Shrews | using nova.virt.ironic | 17:06 |
Shrews | yup | 17:06 |
devananda | huh. i got NotImplemented when i tried it yesterday | 17:06 |
*** rloo has quit IRC | 17:06 | |
Shrews | devananda: you ran 'nova rebuild' directly i guess? | 17:06 |
devananda | "nova rebuild --preserve-ephemeral" to be precise | 17:06 |
*** rloo has joined #openstack-ironic | 17:06 | |
Shrews | ah, with that option, you'll get an error | 17:06 |
*** max_lobur has joined #openstack-ironic | 17:07 | |
Shrews | devananda: http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n2391 | 17:07 |
Shrews | devananda: http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n2391 | 17:07 |
Shrews | oops, sorry | 17:07 |
Shrews | irc lag | 17:07 |
devananda | ah | 17:08 |
devananda | ok, so default impl is going to actually destroy() + spawn() | 17:08 |
*** Alexei_987 has quit IRC | 17:08 | |
devananda | which for ironic means releasing the node, then trying to reclaim it -- nice race condition | 17:08 |
devananda | someone else could claim it during that window | 17:08 |
openstackgerrit | Josh Gachnang proposed a change to openstack/ironic: Adding swift temp url support https://review.openstack.org/81391 | 17:10 |
*** dwalleck_ has joined #openstack-ironic | 17:11 | |
*** eghobo has quit IRC | 17:11 | |
*** eghobo has joined #openstack-ironic | 17:11 | |
*** rloo has quit IRC | 17:12 | |
*** rloo has joined #openstack-ironic | 17:13 | |
*** dwalleck has joined #openstack-ironic | 17:13 | |
openstackgerrit | Devananda van der Veen proposed a change to openstack/ironic: nova.virt.ironic passes ephemeral_gb to ironic https://review.openstack.org/87396 | 17:14 |
*** dwalleck_ has quit IRC | 17:16 | |
*** rloo has quit IRC | 17:18 | |
*** rloo has joined #openstack-ironic | 17:18 | |
linggao | morning devananda | 17:20 |
linggao | devananda, lucasagomes and I discussed the console API. we have a new proposal. | 17:21 |
linggao | devananda, could you scroll back and see if you have any objections to it? | 17:21 |
devananda | linggao: I skimmed the discussion - can you summarize? | 17:22 |
linggao | yes, devananda. GET v1/nodes/<uuid>/console will return the console information. {'console_enabled': True, 'console_info': {'type': 'shellinaboxd', 'url': 'http://<hostname>:<port>'}} for the node with console enabled. {'console_enabled': False, 'console_info': None} for the node with console disabled. | 17:24 |
linggao | PUT v1/nodes/<uuid>/console will enable/disable the console | 17:24 |
linggao | and remove console_enabled from GET v1/nodes/<uuid>/states because it already has a lot of states. | 17:25 |
linggao | this way makes console a sub-element itself under node. | 17:26 |
devananda | linggao: how will this expose an error? | 17:27 |
devananda | linggao: eg, if console can not be started, what will the API show? | 17:28 |
linggao | ?? | 17:28 |
linggao | the last_error in the details will show the error. | 17:29 |
linggao | devananda, I see what you mean. the last_error is in the v1/nodes/<uuid>/states | 17:31 |
*** zdiN0bot has joined #openstack-ironic | 17:32 | |
* linggao thinking... | 17:32 | |
*** rloo has quit IRC | 17:33 | |
*** rloo has joined #openstack-ironic | 17:33 | |
devananda | right | 17:34 |
linggao | devananda, lucasagomes, now the console and power and provision are sharing the last error. one will overwrite the other. | 17:36 |
devananda | linggao: yep, which is not very helpful | 17:36 |
*** zdiN0bot has quit IRC | 17:36 | |
devananda | eg, conductor.utils.node_power_action will clear node.last_error | 17:37 |
linggao | right. | 17:37 |
devananda | so even sync_power_states periodic task may clear the node.last_error in some cases | 17:37 |
devananda | and by no action from the user, they might "lose" the console error message | 17:37 |
devananda | the more independent states we add (power, provision, console...) the more each one really needs a separate target_* and error_* state | 17:38 |
lucasagomes | +1 | 17:38 |
NobodyCam | devananda: linggao lucasagomes if we are making the console an entire subclass could / should it not return its own (last) error? ie.. {'console_enabled': False, 'console_info': None, 'console_error': 'blah'} | 17:39 |
devananda | NobodyCam: exactly | 17:39 |
devananda | NobodyCam: except, right now, power and provision share a single error state | 17:40 |
NobodyCam | ya. | 17:40 |
linggao | NobodyCam, but we do not have a place in the db to store the error so far. | 17:40 |
devananda | if we all agree that's the right way to go (independent error states) then let's work on that | 17:40 |
openstackgerrit | Josh Gachnang proposed a change to openstack/ironic: Adding a reference driver for the agent https://review.openstack.org/84795 | 17:40 |
NobodyCam | devananda: ++ | 17:41 |
NobodyCam | ya | 17:41 |
linggao | +1 | 17:41 |
NobodyCam | I think that is the way we should shoot for | 17:41 |
devananda | lucasagomes: rloo: NobodyCam: side track. what do y'all think about having a required blueprint format? take a look at https://wiki.openstack.org/wiki/TroveBlueprint | 17:41 |
* NobodyCam clicks | 17:41 | |
* lucasagomes lemme read the scrollback | 17:41 | |
devananda | lucasagomes: it's a side track, lol. not directly related to scrollback :) | 17:42 |
lifeless | morning | 17:42 |
devananda | lifeless: g'morning! | 17:42 |
lucasagomes | devananda, yeah reading NobodyCam ping as well | 17:42 |
NobodyCam | morning lifeless :) | 17:42 |
NobodyCam | devananda: I like that! (bp template) | 17:43 |
lucasagomes | devananda, https://github.com/openstack/nova-specs/blob/master/specs/template.rst | 17:43 |
lucasagomes | so nova now has this nova-specs repo | 17:44 |
NobodyCam | makes people think about the overall impact of a BP | 17:44 |
devananda | lucasagomes: right. that's another option | 17:44 |
lucasagomes | devananda, yeah I see that neutron wants to adopt the nova way of doing it | 17:45 |
lucasagomes | I didn't dig much into it to see how it's really done | 17:45 |
matty_dubs | TripleO was just talking about the Neutron way yesterday as well | 17:45 |
lucasagomes | devananda, but I would +1 to have some standardization of bps | 17:45 |
matty_dubs | I haven't looked too deeply and don't have a strong opinion | 17:45 |
lucasagomes | idk if it's trove-way or nova-way | 17:45 |
lucasagomes | matty_dubs, nice, tripleo wants to have a -specs repo as well? | 17:46 |
linggao | devananda, but you do agree to move console GET/PUT from ../states/console to ../console, right? | 17:46 |
devananda | lifeless: opinions? we're discussing how to standardize BPs -- nova-specs way, or just a launchpad template. I think it'd be helpful if both tripleo and ironic use same approach / similar format | 17:46 |
lucasagomes | devananda, http://lists.openstack.org/pipermail/openstack-dev/2014-March/029232.html | 17:46 |
lucasagomes | that's the nova discussion | 17:46 |
* lucasagomes didn't read yet, will do | 17:47 | |
devananda | lucasagomes: thanks | 17:47 |
devananda | linggao: i haven't thought through the ramifications of that yet | 17:47 |
matty_dubs | lucasagomes: That was my understanding, though I wasn't familiar with the Nova background at the time so I didn't fully grasp it then. | 17:47 |
lucasagomes | matty_dubs, ack... at a first glance, being able to +2 blueprints seems good | 17:47 |
devananda | linggao: my initial reaction is to keep it in ./states/console for now, because moving console to ../console -- without also moving the other states -- is inconsistent | 17:48 |
devananda | linggao: for an API change like that, I'd like more time and some documentation (like a blueprint for it) | 17:48 |
rloo | devananda: yes to standardizing on something. If people think the nova-way is the way to go, that would be fine. I haven't been following it to have any opinion yet ;) | 17:49 |
devananda | heh | 17:49 |
devananda | this sentence: "The results of this have been that design review today is typically not | 17:49 |
devananda | happening on Blueprint approval, but is instead happening once the code | 17:50 |
devananda | shows up in the code review." | 17:50 |
rloo | devananda: is that before or after the new 'system'? Hopefully before! | 17:50 |
devananda | rloo: that's a comment from the nova discussion as to why they made the switch | 17:51 |
lifeless | devananda: http://lists.openstack.org/pipermail/openstack-dev/2014-April/032768.html | 17:51 |
rloo | devananda: ah. so yes, I also don't like the design being 'discussed' as one is code-reviewing. So +2 for anything that prevents that! | 17:51 |
devananda | rloo: exactly | 17:52 |
NobodyCam | lucasagomes: reading that link I also like the idea of iterating BP's in gerrit. but not sure about steps #1 (create bad blueprint) and #4 (once approved copy back the approved text into the blueprint) | 17:53 |
lifeless | I'm not sure the lp tracking item should be created until the spec has two x +2's | 17:54 |
linggao | devananda, lucasagomes and NobodyCam, we can add a console_error in the nodes table. It is just odd to have to go to ...states/console to get console information. I can write a blueprint for it. | 17:54 |
devananda | yea, i dont see the benefit to the LP artefact prior to the spec approval | 17:55 |
lucasagomes | NobodyCam, yeah, I'll read it more soon... I'm testing something so didn't want to stop to read | 17:55 |
devananda | seems like just more noise | 17:55 |
JayF | The thing I don't always get about the way blueprints are done in general in OpenStack is it assumes you know the right way to do something from the beginning, when some of the best designs I've seen were ones that were built one small piece at a time, where you know the first step, and you know, in a general sense, where you're going. | 17:56 |
devananda | JayF: what I immediately like about https://github.com/openstack/nova-specs/blob/master/specs/template.rst is that it requires the proposer to think about the impact of their changes | 17:57 |
lifeless | JayF: that aspect is terrible ;) | 17:58 |
lifeless | JayF: and why I generally don't go for blueprints at all - but its a scaling problem for new team members to get good feedback on ideas | 17:58 |
lifeless | JayF: and *that* aspect is valuable | 17:58 |
JayF | I mean, it's the one aspect I care a lot about? How can you decide how software should run until you /make/ it run. | 17:58 |
devananda | sure | 17:58 |
*** eguz has joined #openstack-ironic | 18:00 | |
*** eghobo has quit IRC | 18:04 | |
*** epim has joined #openstack-ironic | 18:05 | |
jroll | this looks like failure to RPC, yes? https://gist.github.com/jimrollenhagen/cc1970e2b28e875a735c | 18:09 |
jroll | my conductor node seems to die after this happens | 18:10 |
jroll | (I'm powering off a bunch of nodes in a loop) | 18:10 |
*** todd_dsm has joined #openstack-ironic | 18:14 | |
lucasagomes | jroll, maybe you hit this; https://review.openstack.org/#/c/88017/ | 18:15 |
jroll | lucasagomes: yes, but I think there's a bigger issue in the first traceback | 18:17 |
jroll | it makes it through a couple of other stuck nodes just fine | 18:17 |
jroll | unless... I think maybe I see what you mean | 18:17 |
jroll | we end up with no free workers? | 18:17 |
lucasagomes | jroll, no, I just saw that patch and thought it was related... lemme take a look at the traceback u have | 18:19 |
lucasagomes | hmmm no idea... scary that the conductor just died after it | 18:21 |
jroll | right | 18:21 |
lucasagomes | brb | 18:21 |
jroll | I'll dig in | 18:21 |
devananda | jroll: that doesn't look like it should cause a failure | 18:21 |
jroll | I wonder if our rabbit is somehow getting overloaded | 18:21 |
lucasagomes | ack, yeah I can't see why the conductor would die because of that error | 18:22 |
*** lucasagomes is now known as lucas-dinner | 18:22 | |
jroll | devananda: I agree, but right after that I get spammed with "No valid host was found. Reason: No conductor service registered which supports driver agent_ipmitool." | 18:22 |
devananda | interesting | 18:22 |
jroll | no errors in conductor log, and conductor process is still running | 18:22 |
devananda | http://git.openstack.org/cgit/openstack/ironic/tree/ironic/conductor/manager.py#n457 | 18:23 |
openstackgerrit | Chris Krelle proposed a change to openstack/ironic: Fix for tripleO undercloud gate tests DO NOT MERGE https://review.openstack.org/85529 | 18:23 |
jroll | restarting the conductor seems to make it happy | 18:23 |
devananda | jroll: add logging there and see if it's getting run | 18:23 |
devananda | jroll: it may be that something has starved the main thread and the keepalive didn't get sent | 18:24 |
devananda | jroll: also can you check the timestamp on the conductors table -- that could be another way to confirm this hypothesis | 18:24 |
jroll | yeah, that was my next thought | 18:24 |
devananda | jroll: if so, https://review.openstack.org/#/c/84862/10 is related, but not a complete fix | 18:25 |
devananda | jroll: since it sounds like this starvation is not coming from a periodic_task -- it's the main thread starving out a periodic_task | 18:25 |
devananda | s/main thread/RPC receivers/ | 18:25 |
* devananda needs more coffee | 18:26 | |
jroll | right | 18:26 |
*** coolsvap is now known as coolsvap|afk | 18:26 | |
jroll | heh | 18:27 |
jroll | you are right, sir | 18:27 |
devananda | heh. i'm not sure if that's a good thing | 18:28 |
jroll | devananda: we have touch_conductor running at 10s intervals, because we suspected it | 18:28 |
jroll | it goes to 3-5 *minutes* while we're pounding the conductor | 18:28 |
devananda | wow | 18:28 |
jroll | indeed. | 18:28 |
jroll | I'll file a bug right now | 18:28 |
devananda | ok, so we can easily DOS the conductor | 18:28 |
devananda | thanks | 18:28 |
jroll | and take a look after lunch | 18:28 |
jroll | heh, np | 18:28 |
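[Editor's note] The starvation jroll measured — `touch_conductor` slipping from a 10 s interval to minutes under load — is a generic property of cooperative schedulers. Eventlet is not assumed installed here, so this minimal sketch reproduces the effect with stdlib asyncio: work that never yields delays a periodic heartbeat far past its interval, which is exactly how the conductor looks dead while its process is still running.

```python
import asyncio
import time

async def heartbeat(stamps):
    # Periodic "touch_conductor": record a timestamp every 50 ms.
    for _ in range(4):
        stamps.append(time.monotonic())
        await asyncio.sleep(0.05)

async def greedy_worker():
    # Blocks the whole loop, the way un-yielding request handling
    # starves Ironic's periodic tasks: time.sleep() never yields.
    time.sleep(0.3)

async def main():
    stamps = []
    hb = asyncio.ensure_future(heartbeat(stamps))
    await asyncio.sleep(0)   # let the heartbeat record its first stamp
    await greedy_worker()    # starves the heartbeat for ~300 ms
    await hb
    # The gap between the first two heartbeats is ~0.3 s, not 0.05 s.
    return stamps[1] - stamps[0]

gap = asyncio.run(main())
```

Scale the 0.3 s block up to minutes of queued IPMI work and the keepalive misses its `heartbeat_timeout`, so the API reports "No conductor service registered".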
*** vkozhukalov has joined #openstack-ironic | 18:30 | |
*** zdiN0bot has joined #openstack-ironic | 18:32 | |
openstackgerrit | Chris Krelle proposed a change to openstack/ironic: Fix for tripleO undercloud gate tests DO NOT MERGE https://review.openstack.org/85529 | 18:33 |
*** todd_dsm has quit IRC | 18:33 | |
jroll | devananda: fyi https://bugs.launchpad.net/ironic/+bug/1308680 | 18:33 |
* jroll -> lunch | 18:34 | |
*** athomas has quit IRC | 18:34 | |
*** datajerk has quit IRC | 18:35 | |
NobodyCam | DOS our conductor.. ieek | 18:35 |
devananda | jroll: just curious - how many nodes are enrolled? how many conductor services are running? and is it hw or virt? | 18:36 |
*** zdiN0bot has quit IRC | 18:36 | |
*** epim_ has joined #openstack-ironic | 18:39 | |
*** epim has quit IRC | 18:42 | |
*** epim_ is now known as epim | 18:42 | |
*** datajerk has joined #openstack-ironic | 18:42 | |
comstud | devananda: would you mind reviewing the dependent patch for that power state sync yield? | 18:48 |
comstud | https://review.openstack.org/#/c/87076/ | 18:48 |
comstud | I'd like to get that to land.. it has potential for conflicting a lot due to moving a lot of tests in conductor test_manager | 18:49 |
devananda | comstud: started looking at it already. decided i needed food first :) | 18:49 |
comstud | haha np | 18:49 |
devananda | yea | 18:49 |
NobodyCam | :) | 18:49 |
*** russellb has quit IRC | 18:53 | |
*** russellb has joined #openstack-ironic | 18:55 | |
linggao | devananda, lucasagomes and NobodyCam, here is the blueprint https://blueprints.launchpad.net/ironic/+spec/update-console-api. | 18:56 |
linggao | I hope you guys can at least agree on the #1 item on the blueprint, I already have a patch on it for review. | 18:56 |
jroll | devananda: 72 nodes on bare metal; one conductor and one api, running on separate VMs. conductor craps out after 30-40 "power off" commands. | 19:05 |
devananda | jroll: awesome | 19:05 |
jroll | using ipmitool for power, if it matters | 19:05 |
devananda | jroll: i'd love to see the timing data from that | 19:05 |
devananda | yea, it does | 19:05 |
jroll | timing data on the power commands? | 19:06 |
rloo | can the Rally stuff be hooked in jroll? | 19:06 |
devananda | jroll: avg/min/max time for each power operation | 19:06 |
devananda | how long each worker thread is occupied | 19:06 |
jroll | rloo: I know nothing about rally. maybe? | 19:07 |
jroll | devananda: sure, I'll grab logs and see what I can dig up | 19:07 |
rloo | jroll, romcheg has been looking into it. maybe he'd have an idea as to how easy, blah blah. | 19:07 |
jroll | yeah, I'd like to hear about it | 19:07 |
romcheg1 | jroll, rloo: Morning, yeah, working on it | 19:08 |
romcheg1 | There are a few problems I experienced. Hopefully everything will get merged tonight :) | 19:08 |
jroll | cool :) | 19:08 |
openstackgerrit | Josh Gachnang proposed a change to openstack/ironic: Adding a reference driver for the agent https://review.openstack.org/84795 | 19:11 |
devananda | jroll: as for why 30 - 40 is the magic number, 30 is the default rpc worker pool | 19:13 |
jroll | aha | 19:13 |
devananda | jroll: you're filling up the threadpool faster than requests complete. IPMI is slow | 19:13 |
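[Editor's note] devananda's point can be made with back-of-the-envelope arithmetic: once power requests arrive faster than slow IPMI calls complete, a fixed pool saturates shortly after absorbing its first `pool_size` requests. Only the 30-worker default comes from the discussion; the arrival rate and service time below are illustrative assumptions, and the model is a rough fluid approximation, not a claim about Ironic's exact behavior.

```python
def requests_until_saturation(pool_size, arrival_rate, service_time):
    """Roughly how many requests arrive before the pool is full?

    pool_size:    concurrent worker threads (default rpc pool: 30)
    arrival_rate: incoming power requests per second (assumed)
    service_time: seconds a slow IPMI operation holds a worker (assumed)
    """
    completed_per_sec = pool_size / service_time
    if arrival_rate <= completed_per_sec:
        return None  # the pool keeps up and never saturates
    # Backlog grows at (arrival_rate - completed_per_sec) req/s;
    # the pool itself absorbs the first pool_size requests.
    seconds = pool_size / (arrival_rate - completed_per_sec)
    return int(arrival_rate * seconds)

# 30 workers, 2 requests/s, 30 s per timing-out IPMI call:
# saturation after ~60 requests -- the same order of magnitude
# as the 30-40 power commands jroll observed before failure.
n = requests_until_saturation(30, 2, 30)
```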
JayF | devananda: could we bump that worker pool to workaround our issue for now? | 19:13 |
devananda | yes | 19:13 |
JayF | Is that a reasonable default? | 19:13 |
devananda | i dont know :) | 19:14 |
jroll | JayF: I told you we needed more conductors :P | 19:14 |
openstackgerrit | Devananda van der Veen proposed a change to openstack/ironic: Better handling of missing drivers https://review.openstack.org/83572 | 19:14 |
JayF | I told you more conductors would hide the bug, that seems to remain valid ;) | 19:14 |
devananda | heh | 19:15 |
devananda | so | 19:15 |
devananda | power cycling 100 machines from a single conductor should be achievable -- it's mostly waiting on IPMI | 19:15 |
devananda | but | 19:15 |
*** jdob_ has joined #openstack-ironic | 19:15 | |
devananda | deploying to 100 machines in parallel? that is definitely not achievable today from a single conductor | 19:16 |
devananda | IO will bottleneck | 19:16 |
devananda | there's some amount of parallelization that's reasonable today, depending on hardware | 19:16 |
devananda | well | 19:16 |
devananda | that is, with the PXE driver. | 19:17 |
jroll | I'd say it's achievable in the agent model, at least to the point where you saturate glance :) | 19:17 |
devananda | with the Agent driver, ... right | 19:17 |
jroll | yeah | 19:17 |
JayF | I want to find those limits (in the agent_ipmitool driver), stretch it to it's limit, and file bugs and fix the bottlenecks | 19:17 |
devananda | you probably wont saturate tftp, even with 100 booting in parallel | 19:17 |
devananda | JayF: ++ | 19:17 |
JayF | because we'll want to boot a lot more than a hundred boxes on a single conductor | 19:17 |
devananda | indeed | 19:17 |
*** rloo has quit IRC | 19:18 | |
devananda | i bet i can duplicate that bug locally | 19:18 |
*** rloo has joined #openstack-ironic | 19:18 | |
devananda | jroll: any data you have on how long it is taking to power-off a single node via ipmitool would be helpful | 19:20 |
JayF | devananda: some # of these nodes, ipmi is timing out | 19:20 |
JayF | so that's almost certainly adding to the contention | 19:20 |
devananda | great | 19:20 |
devananda | that's unfortunately normal | 19:20 |
devananda | so we need to handle that w/o the conductor falling over | 19:20 |
JayF | I agree entirely | 19:21 |
*** rloo has quit IRC | 19:22 | |
*** rloo has joined #openstack-ironic | 19:22 | |
JayF | when log_file is set, what things would still be writing to stdout? | 19:25 |
JayF | we're seeing different output from stdout on an api server than what is going to a log, and I need the info from stdout to hit a log somewhere | 19:25 |
*** zdiN0bot has joined #openstack-ironic | 19:27 | |
devananda | what's a trivial way to block a greenthread? | 19:33 |
devananda | JayF: there are a set of logging options that should control what the api logs where | 19:33 |
devananda | JayF: part of oslo.logging | 19:33 |
JayF | it appears to be wsgiref | 19:33 |
JayF | that's doing the logging I'm referring to | 19:33 |
devananda | JayF: ah. fwiw, you can also use apache mod_wsgi. probably will perform better | 19:34 |
NobodyCam | brb | 19:34 |
*** dwalleck has quit IRC | 19:39 | |
*** jistr has quit IRC | 19:40 | |
*** dkehn_ has joined #openstack-ironic | 19:43 | |
*** dkehn_ has quit IRC | 19:46 | |
*** dkehnx has quit IRC | 19:46 | |
*** dkehn_ has joined #openstack-ironic | 19:46 | |
*** zdiN0bot1 has joined #openstack-ironic | 19:46 | |
*** zdiN0bot has quit IRC | 19:46 | |
comstud | devananda, jroll: so, eventlet will essentially never switch greenthreads if there's always stuff to process from the queue | 19:47 |
comstud | i mean | 19:47 |
comstud | if the read() from the rabbit socket never returns EAGAIN.. | 19:47 |
comstud | and you don't have any other I/O to cause greenthread switches | 19:47 |
*** zdiN0bot1 has quit IRC | 19:48 | |
comstud | (DB calls do not right now -- we put some explicit yield in all DB calls in nova, but it's ugly) | 19:48 |
*** zdiN0bot has joined #openstack-ironic | 19:48 | |
comstud | (and causes some other side effects) | 19:48 |
jroll | hrm. | 19:48 |
devananda | jroll: great. i can repro locally with a fake driver | 19:49 |
comstud | eventlet will only switch if it gets a EAGAIN on socket I/O | 19:49 |
jroll | devananda: cool | 19:50 |
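[Editor's note] comstud's rule of thumb — greenthreads only switch at yield points, such as a socket read returning EAGAIN or an explicit sleep — holds for any cooperative scheduler. Eventlet is not assumed installed here, so this sketch shows the same behavior with stdlib asyncio: a CPU-bound coroutine with no await starves its neighbors, and an explicit `sleep(0)` (the analogue of `eventlet.sleep()`) restores fairness.

```python
import asyncio

async def ticker(seen):
    # A neighboring task that should get a chance to run.
    seen.append('tick')

async def busy(cooperative):
    # CPU-bound work: no I/O, so no implicit yield point.
    total = 0
    for i in range(5):
        total += i
        if cooperative:
            await asyncio.sleep(0)  # explicit yield, like eventlet.sleep(0)
    return total

async def run(cooperative):
    seen = []
    t = asyncio.ensure_future(ticker(seen))
    await busy(cooperative)
    ran_during = bool(seen)  # did the ticker run before busy() finished?
    await t
    return ran_during

# Without a yield point the ticker is starved until busy() completes:
starved = not asyncio.run(run(False))
# With sleep(0) the scheduler gets control and the ticker runs early:
fair = asyncio.run(run(True))
```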
jroll | comstud: so, is there anything we can do at the ironic level? | 19:50 |
comstud | i have something for you to try | 19:51 |
comstud | sec | 19:51 |
comstud | this is ugly | 19:54 |
comstud | but try it | 19:54 |
comstud | http://paste.openstack.org/show/75966/ | 19:54 |
comstud | oops | 19:54 |
comstud | except use the correct decorator name | 19:54 |
* jroll tries | 19:55 | |
devananda | comstud: I think the solution is just to put an explicit yield in the ipmitool driver | 19:55 |
comstud | how is ipmitool implemented? | 19:56 |
comstud | is it something not wrapped by eventlet? | 19:56 |
comstud | (C code underneath) | 19:56 |
devananda | http://git.openstack.org/cgit/openstack/ironic/tree/ironic/drivers/modules/ipmitool.py#n135 | 19:56 |
comstud | ok, that should be fine | 19:57 |
comstud | that should cause a yield | 19:57 |
devananda | k | 19:57 |
comstud | assuming you're monkey patching os | 19:57 |
*** zdiN0bot has quit IRC | 19:57 | |
comstud | or subprocess i guess | 19:57 |
devananda | which we're not | 19:57 |
devananda | http://git.openstack.org/cgit/openstack/ironic/tree/ironic/__init__.py#n22 | 19:57 |
jroll | oh. welp. | 19:58 |
comstud | hm | 19:58 |
comstud | let me check if that wraps subprocess or not | 19:58 |
devananda | i dont recall if we had a reason for that | 19:58 |
jroll | devananda: is there an explicit reason for os=False? | 19:58 |
jroll | oh | 19:58 |
comstud | i think i've seen os=False in nova for some reason too | 19:58 |
comstud | maybe not | 19:58 |
*** jdob_ has quit IRC | 19:59 | |
jroll | yeah, comstud, yielding for the db didn't do anything | 20:01 |
comstud | ok | 20:01 |
*** romcheg1 has quit IRC | 20:02 | |
comstud | interesting | 20:02 |
devananda | comstud: yea, this is from the initial refactoring -- http://git.openstack.org/cgit/openstack/ironic/diff/ironic/cmd/__init__.py?id=0480834614476997e297187ec43d7ca500c8dcdb | 20:03 |
jroll | comstud: I don't think this is blocking on the db | 20:03 |
jroll | I think it's blocking on ipmi like deva said | 20:03 |
comstud | I don't see where eventlet patches subprocess at all | 20:04 |
comstud | unless you explicitly import eventlet.green.subprocess | 20:04 |
devananda | jroll: i think it's not actually blocking | 20:04 |
devananda | or it is? | 20:04 |
jroll | I don't really know | 20:04 |
comstud | KeyError: 'eventlet.green.subprocess' | 20:04 |
devananda | i mean, there's loopingcall | 20:04 |
devananda | we're explicitly tying up a greenthread while waiting for power on/off to succeed | 20:04 |
comstud | that module is not loaded at all | 20:04 |
comstud | even after calling monkey_patch | 20:05 |
devananda | so even wiuthout the greenthread blocking on db/io/foo | 20:05 |
devananda | we can still run out of threads from the pool, no? | 20:05 |
jroll | sure | 20:05 |
jroll | but I tested this with pool size of 256 and got the same thing | 20:05 |
jroll | so I don't think it's that it's running out of worker threads | 20:06 |
devananda | jroll: rpc_conn_pool_size? rpc_thread_pool_size? | 20:06 |
jroll | rpc_thread_pool_size=256; rpc_conn_pool_size=256 | 20:06 |
jroll | both | 20:06 |
devananda | hrm | 20:06 |
devananda | and you clearly have <256 nodes | 20:06 |
comstud | jroll: try adding: from eventlet.green import subprocess | 20:06 |
comstud | instead of import subprocess | 20:06 |
comstud | in utils.py or wherever it is | 20:06 |
jroll | comstud: yeah, I was just going to do that | 20:06 |
jroll | directly in the ipmitool code, yes? | 20:07 |
comstud | in utils.py | 20:07 |
comstud | wherever execute is | 20:07 |
jroll | oh right | 20:07 |
comstud | that calls subprocess.. i'm assuming it does | 20:07 |
comstud | So | 20:07 |
jroll | no, processutils | 20:07 |
* jroll dives into oslo | 20:07 | |
comstud | re: rpc pool size.. it's not really useful beyond a few threads, really. | 20:08 |
jroll | which uses eventlet.green.subprocess | 20:08 |
comstud | ah so it does | 20:08 |
comstud | lol ok | 20:08 |
comstud | well.. interesting. | 20:08 |
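[Editor's note] The import swap comstud suggests looks like the sketch below. Under eventlet, `eventlet.green.subprocess` provides a Popen whose waits yield to other greenthreads instead of blocking the whole process; the stdlib fallback here is an editorial hedge so the sketch runs even without eventlet installed.

```python
# Plain `import subprocess` gives a blocking wait(): under eventlet the
# whole process stalls until the child exits. The green module's wait()
# yields to the hub instead, letting other greenthreads run.
try:
    # The pattern comstud suggests for the execute() call path:
    from eventlet.green import subprocess   # cooperative Popen/wait
    GREEN = True
except ImportError:
    import subprocess                       # stdlib fallback (assumption)
    GREEN = False

def run_cmd(cmd):
    """Run a command; with the green module, waiting yields the hub."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out
```

As jroll then finds, oslo's `processutils` already uses `eventlet.green.subprocess`, which is why this alone didn't explain the hang.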
jroll | indeed. | 20:09 |
comstud | try adding explicit eventlet.sleep() somewhere in your call path that's being slammed | 20:09 |
comstud | but it almost sounds like something is blocking the whole process somewhere in the path | 20:10 |
*** zdiN0bot has joined #openstack-ironic | 20:10 | |
comstud | (besides DB calls) | 20:10 |
comstud | maybe you need the multiple-process-workers support for conductor. | 20:10 |
comstud | that would be useful no matter, I think | 20:10 |
devananda | quite possibly, ya | 20:11 |
comstud | or many more conductors. | 20:11 |
devananda | comstud: how complete is support for that in the various oslo utils? i haven't looked in at least a year | 20:11 |
comstud | unless it's a single node slamming | 20:11 |
jroll | comstud: you're talking about adding the sleep() call in the code path of the worker, yes? or the code that spawns the worker? | 20:11 |
comstud | devananda: I don't know, I haven't looked at oslo for it | 20:11 |
comstud | we certainly use it in nova.. but i dunno if it's coming from oslo or not! | 20:12 |
comstud | jroll: the side that's starved.. i assume conductor | 20:12 |
jroll | comstud: and that can be a sleep(0) yeah? | 20:12 |
comstud | yeah | 20:12 |
jroll | well, it's in the conductor for sure | 20:12 |
jroll | it's an RPC call to change_node_power_state | 20:12 |
jroll | which does self._spawn_worker(utils.node_power_action, ...) | 20:13 |
comstud | hm | 20:13 |
comstud | sec, looking | 20:13 |
jroll | ironic/conductor/manager.py | 20:13 |
comstud | right | 20:13 |
comstud | hm | 20:14 |
jroll | although | 20:14 |
comstud | i don't know there's much point to the extra worker pool | 20:14 |
jroll | oh | 20:14 |
comstud | over having the rpc pool | 20:14 |
comstud | but | 20:14 |
comstud | stick it in the worker | 20:14 |
*** zdiN0bot has quit IRC | 20:14 | |
jroll | ah ha | 20:14 |
jroll | well | 20:14 |
jroll | it does that power.validate() call first | 20:14 |
jroll | which is not in a worker | 20:15 |
jroll | and hits ipmi | 20:15 |
jroll | so it might be choking there | 20:15 |
JayF | and if it, say, hit a half dozen bad IPMI addresses | 20:15 |
comstud | it should still switch greenthreads | 20:15 |
jroll | JayF: nah, that part is synchronous | 20:15 |
comstud | oh, i wonder. | 20:15 |
openstackgerrit | Devananda van der Veen proposed a change to openstack/ironic: DO NOT MERGE - demonstration of reproducing 1308680 https://review.openstack.org/88076 | 20:16 |
*** jistr has joined #openstack-ironic | 20:17 | |
NobodyCam | rloo: you happen to be about? | 20:17 |
rloo | NobodyCam: I happen to be about. | 20:17 |
NobodyCam | :) | 20:17 |
NobodyCam | happen to have a second to rebase 85107 | 20:17 |
rloo | sure. | 20:17 |
devananda | jroll, comstud: simple steps to reproduce ^ | 20:18 |
jroll | devananda: yeah, I see that | 20:18 |
openstackgerrit | linggao proposed a change to openstack/ironic: Modify the get console API https://review.openstack.org/87760 | 20:18 |
*** max_lobur1 has joined #openstack-ironic | 20:18 | |
*** max_lobur has quit IRC | 20:18 | |
comstud | devananda: well yeah, all DB calls are going to block.. and there's nothing we can do about that | 20:19 |
comstud | if you sleep in all of your available threads, then boom. | 20:19 |
devananda | comstud: right -- that's fine. the point of that patch is to demonstrate the failure. not solve the db blocking issue | 20:19 |
comstud | I should say... 'nothing we can do about that right now' | 20:19 |
comstud | gotcha | 20:19 |
comstud | I'm just wondering if that's the same problem as jroll is seeing. | 20:19 |
comstud | or not. | 20:20 |
devananda | comstud: same result. different cause. | 20:20 |
jroll | well | 20:20 |
devananda | probably different cause | 20:20 |
jroll | I have 72 nodes, and 256 workers. | 20:20 |
jroll | so it's not all of the workers being used. | 20:20 |
devananda | jroll: what's the uptime on that machine? | 20:20 |
jroll | or being blocked | 20:20 |
comstud | well | 20:20 |
devananda | jroll: or the VM where ir-cond is running | 20:20 |
comstud | workers in this case would be... | 20:20 |
comstud | the process... | 20:20 |
comstud | i mis-spoke | 20:20 |
devananda | jroll: like -- what's the CPU utilization, context swaps, etc | 20:20 |
jroll | devananda: 34 days | 20:20 |
comstud | just 1 DB call doing a SLEEP 10 | 20:20 |
jroll | but CPU is never pegged | 20:21 |
comstud | is going to hang the process for 10 seconds | 20:21 |
devananda | jroll: vmstat 1 10 | 20:21 |
comstud | jroll: I have another thing for you to try | 20:21 |
comstud | even though it will probably cause other issues at some point | 20:22 |
comstud | but just to verify something.. | 20:22 |
comstud | sec | 20:22 |
jroll | devananda: https://gist.github.com/jimrollenhagen/001cefc6422e5f8f8f30 | 20:22 |
devananda | comstud: ahh. yea. damn. | 20:22 |
devananda | comstud: my patch is actually demonstrating something entirely different | 20:23 |
devananda | if a single DB query takes longer than heartbeat_timeout time, the conductor appears to be dead | 20:23 |
devananda | since the db query is blocking the whole process, not just the thread | 20:23 |
comstud | devananda: nod | 20:23 |
comstud | jroll: http://paste.openstack.org/show/75972/ | 20:24 |
comstud | try that just for shits | 20:24 |
devananda | greenthreads hurt my brain sometimes. I want real threads ... | 20:24 |
comstud | eventlet is not completely Thread safe here, but.. | 20:24 |
comstud | devananda: nod! | 20:24 |
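[Editor's note] comstud's paste isn't preserved, but the mechanism he points at (confirmed below when jroll reports "that tpool fix didn't help") is `eventlet.tpool`: pushing a blocking call into a real OS thread so the hub keeps scheduling greenthreads. Eventlet is not assumed installed here, so this sketch shows the same offloading idea with stdlib asyncio's `run_in_executor`; the fake DB call stands in for a query that would otherwise block the whole process.

```python
import asyncio
import time

def blocking_db_call():
    # Stand-in for a DB driver call that blocks the whole process
    # (like the simulated slow query in devananda's repro patch).
    time.sleep(0.2)
    return 'row'

async def heartbeat(beats):
    # Stand-in for the conductor's periodic keepalive task.
    for _ in range(3):
        beats.append(time.monotonic())
        await asyncio.sleep(0.05)

async def main():
    beats = []
    hb = asyncio.ensure_future(heartbeat(beats))
    loop = asyncio.get_running_loop()
    # tpool-style offload: run the blocking call in a worker thread so
    # the event loop stays free and the heartbeat keeps its cadence.
    result = await loop.run_in_executor(None, blocking_db_call)
    await hb
    worst_gap = max(b - a for a, b in zip(beats, beats[1:]))
    return result, worst_gap

result, worst_gap = asyncio.run(main())
```

With the offload, the worst heartbeat gap stays near the 50 ms interval instead of stretching to the full length of the blocking call.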
comstud | i've let my eventlet replacement sit over a month now | 20:24 |
comstud | I need to get back to it. | 20:24 |
jroll | comstud: running... | 20:25 |
comstud | good luck | 20:26 |
jroll | heh | 20:26 |
comstud | if it even works | 20:26 |
NobodyCam | humm.... http://paste.openstack.org/show/tBVlJZHhhxjUK7deqVgQ/ | 20:26 |
devananda | jroll: i'd also like to see the output of 'vmstat 1 10' while this is mid-run | 20:27 |
comstud | yeah, that'd be interesting also | 20:27 |
openstackgerrit | Ruby Loo proposed a change to openstack/python-ironicclient: Adds documentation for ironicclient API https://review.openstack.org/85107 | 20:27 |
*** zdiN0bot has joined #openstack-ironic | 20:28 | |
jroll | devananda: so when this occurs, things go smoothly for a bit, then the api hangs, then everything bursts through after api decides conductor doesn't exist | 20:29 |
jroll | but I did get one in | 20:29 |
jroll | during the hang | 20:29 |
*** zdiN0bot has quit IRC | 20:29 | |
jroll | devananda: https://gist.github.com/jimrollenhagen/82f7b6bf942f8215b35c | 20:29 |
*** zdiN0bot has joined #openstack-ironic | 20:29 | |
comstud | the api hangs too? | 20:29 |
*** pradipta_away has quit IRC | 20:29 | |
jroll | yeah | 20:30 |
comstud | for how long? | 20:30 |
jroll | as I understand it, the RPC call is synchronous | 20:30 |
comstud | putting into the queue and waiting for response will be asynch | 20:31 |
jroll | comstud: 34 seconds this round | 20:31 |
jroll | or at least, this call to the API hangs | 20:31 |
jroll | idk about other calls | 20:31 |
comstud | k | 20:31 |
jroll | but yeah, that tpool fix didn't help | 20:32 |
devananda | yea, i'm seeing the API hang too | 20:32 |
comstud | going to hop on the api node and look while you slam | 20:32 |
comstud | did you put it on the conductor side only? | 20:32 |
jroll | yes | 20:32 |
devananda | jroll: that vmstat run is from during the its-all-hung period? | 20:32 |
jroll | devananda: yes | 20:32 |
comstud | jroll: try also putting it on the api side | 20:33 |
jroll | devananda: on the conductor side | 20:33 |
devananda | huh | 20:33 |
jroll | comstud: ok | 20:33 |
comstud | devananda: yeah, something's fishy | 20:33 |
comstud | i have 1 idea | 20:33 |
* comstud checks something | 20:33 | |
*** zdiN0bot has quit IRC | 20:34 | |
*** vkozhukalov has quit IRC | 20:34 | |
comstud | could DB be waiting on a lock for something for a long period of time? | 20:34 |
jroll | I don't think so, I think it just kicks out here if the node is already locked | 20:34 |
comstud | and maybe a lock wait timeout we're hiding/not seeing? | 20:34 |
jroll | but I see where you're going with this | 20:34 |
*** pradipta_away has joined #openstack-ironic | 20:34 | |
comstud | well, i'm talking about an actual DB lock | 20:34 |
comstud | vs the reservation column | 20:34 |
jroll | ohhhh | 20:34 |
jroll | I doubt it? | 20:35 |
comstud | just an idea | 20:35 |
comstud | checking db for slow query log | 20:35 |
jroll | ok | 20:35 |
jroll | I just restarted services, going to start hitting it | 20:35 |
jroll | if you want to watch | 20:35 |
devananda | ok, something's fishy with just changing the power state | 20:36 |
devananda | my patch to add db(sleep) into the fake driver is causing the *client* to wait | 20:36 |
devananda | those are supposed to be async | 20:36 |
comstud | don't you wait to acquire reservation before returning rpc response? | 20:37 |
devananda | comstud: yes | 20:37 |
comstud | wouldn't that cause the client to wait? | 20:37 |
devananda | if that blocked | 20:38 |
comstud | if you're sleeping | 20:38 |
devananda | but i'm not running anything else on this node | 20:38 |
comstud | (i dunno where you inserted the sleep now) | 20:38 |
devananda | i added the sleep into drivers.modules.fake.FakePowerDriver.set_power_state | 20:38 |
comstud | ah | 20:38 |
devananda | https://review.openstack.org/#/c/88076/1/ironic/drivers/modules/fake.py | 20:38 |
comstud | ok, that seems bad | 20:38 |
devananda | hacky way to simulate ipmitool blocking | 20:39 |
comstud | nod | 20:39 |
devananda | while using the fake driver | 20:39 |
devananda | yea. wtf | 20:39 |
*** Mikhail_D_ltp has left #openstack-ironic | 20:40 | |
devananda | http://git.openstack.org/cgit/openstack/ironic/tree/ironic/conductor/manager.py#n245 | 20:40 |
devananda | yea, wtf | 20:41 |
devananda | that method exits but the RPC call isn't returning | 20:41 |
russell_h | devananda: I'm seeing ours returning _much_ later | 20:43 |
jroll | huh. | 20:43 |
russell_h | the RPC call times out in the API | 20:44 |
russell_h | then a long time later the API gets a response that it doesn't know what to do with | 20:44 |
russell_h | jroll: one of those should hit the API log soon | 20:44 |
devananda | russell_h: right. so the problem *there* is that the RPC call isn't returning right away | 20:44 |
devananda | which it should be | 20:44 |
comstud | Is the API blocking on something ? | 20:44 |
jroll | russell_h: one hit about 4 minutes ago | 20:45 |
comstud | and not getting the reply? | 20:45 |
devananda | manager.change_node_power_state should be spawning another thread, then returning right away | 20:45 |
russell_h | jroll: the timeout hit | 20:45 |
devananda | it's not | 20:45 |
russell_h | jroll: but the reply from the conductor should hit soon I think | 20:45 |
russell_h | unless you restarted it | 20:45 |
jroll | russell_h: I mean the "no calling thread" thing | 20:45 |
devananda | oi, i need to jump on a call | 20:45 |
russell_h | oh | 20:45 |
russell_h | totally right, I missed it | 20:45 |
jroll | 2014-04-16 20:40:32.232 24876 WARNING ironic.openstack.common.rpc.amqp [-] No calling threads waiting for msg_id : d1a3ba68fecd4587b4af7bbc9a2cfef3 | 20:45 |
jroll | ya | 20:45 |
jroll | devananda: have fun | 20:46 |
* jroll needs to take a walk | 20:46 | |
devananda | this http://git.openstack.org/cgit/openstack/ironic/tree/ironic/conductor/manager.py#n721 | 20:46 |
devananda | is allowing control flow to continue | 20:47 |
devananda | but preventing the RPC response from being sent back to the API | 20:47 |
devananda | until the worker_pool.spawn()'d thread finishes | 20:47 |
devananda | that's causing all kinds of problems | 20:47 |
jroll | wat | 20:48 |
devananda | like a simple "ironic node-set-power-state" is hanging for 10 seconds | 20:48 |
jroll | spawn() doesn't return immediately? | 20:48 |
devananda | no -- it does! | 20:48 |
devananda | execution continues | 20:48 |
jroll | what's preventing RPC response then? | 20:48 |
devananda | the RPC response that should happen when that thread exits IS NOT | 20:48 |
devananda | I dunno | 20:48 |
jroll | heh | 20:48 |
jroll | ok | 20:48 |
jroll | interesting | 20:48 |
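[editor's note] The point being circled here — `spawn()` returns immediately, but the spawned thread does not actually *run* until the spawning thread yields — can be shown in a few lines. A sketch with stdlib asyncio standing in for eventlet's `worker_pool.spawn()` (the scheduling semantics are the same; none of these names are Ironic's):

```python
import asyncio

order = []

async def worker():
    # Stands in for the spawned node_power_action work.
    order.append("worker ran")

async def main():
    asyncio.create_task(worker())   # like spawn(): returns at once
    order.append("spawn returned")  # the parent keeps running first
    await asyncio.sleep(0)          # explicit yield; only now does worker run
    order.append("after yield")

asyncio.run(main())
print(order)
```

So if the parent yields to the loop *before* its RPC reply is fully written, a blocking spawned worker can wedge the reply — which is the theory russell_h lays out below.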
*** eguz has quit IRC | 20:49 | |
devananda | jroll: http://paste.openstack.org/show/75977/ | 20:50 |
*** eghobo has joined #openstack-ironic | 20:50 | |
jroll | wtf indeed | 20:51 |
* jroll bbiab | 20:51 | |
*** romcheg1 has joined #openstack-ironic | 20:52 | |
*** jdob has quit IRC | 20:55 | |
russell_h | devananda: here's what I'm thinking | 20:58 |
openstackgerrit | linggao proposed a change to openstack/python-ironicclient: node-get-console command use the new API https://review.openstack.org/87769 | 20:59 |
russell_h | devananda: something probably yields to the event loop before the RPC response is fully written | 20:59 |
* devananda will bbiah | 20:59 | |
russell_h | devananda: at which point your long-running call blocks the event loop | 20:59 |
* russell_h source dives into oslo | 21:00 | |
russell_h | jroll: I knew we should have used twisted ;) | 21:00 |
openstackgerrit | linggao proposed a change to openstack/python-ironicclient: node-get-console incorporate the changes in API https://review.openstack.org/87769 | 21:01 |
*** romcheg1 has quit IRC | 21:03 | |
jroll | russell_h: I have a commit that says otherwise | 21:10 |
dividehex | is it possible to separate pools of baremetal nodes? for example if I wanted to allocate a baremetal instance from a pool of identical machines | 21:13 |
*** linggao has quit IRC | 21:14 | |
jroll | russell_h: so if you're right, eventlet.sleep(0) in utils.node_power_action should solve this, but... doesn't | 21:16 |
russell_h | jroll: not necessarily | 21:16 |
*** mrda_away is now known as mrda | 21:16 | |
jroll | well, sure | 21:16 |
*** max_lobur1 has quit IRC | 21:17 | |
jroll | sigh | 21:17 |
russell_h | jroll: try eventlet.sleep(1) | 21:17 |
jroll | heh, that was my next guess | 21:17 |
*** jbjohnso has quit IRC | 21:20 | |
mrda | russell_h: that would be sad if we needed eventlet.sleep(1) there in power action. It does take a float, so perhaps eventlet.sleep(0.1) might be sufficient? | 21:21 |
russell_h | mrda: if we have to sleep at all I'm going to rage | 21:22 |
mrda | right | 21:22 |
russell_h | mrda: I'm just thinking of how we could diagnose this issue | 21:22 |
russell_h | the goal would be to give rpc time to sort its shit out | 21:22 |
russell_h | before embarking on any blocking call | 21:22 |
russell_h | if that fixes it, I'd be more confident in my theory on this | 21:23 |
jroll | russell_h: fwiw sleep(1) didn't help anything | 21:23 |
russell_h | meh | 21:23 |
russell_h | thats in seconds right? | 21:23 |
jroll | I think the heartbeat is just not running | 21:23 |
jroll | hell if I know :) | 21:23 |
russell_h | jroll: I'm about to try something | 21:23 |
jroll | wait | 21:23 |
russell_h | k | 21:23 |
jroll | I'm doing things | 21:23 |
russell_h | when you're ready, restart that conductor | 21:23 |
russell_h | and brace yourself | 21:23 |
jroll | oh god | 21:23 |
* jroll looks at git diff | 21:24 | |
comstud | sorry, i got pulled into something in -nova | 21:24 |
russell_h | I don't even know if I did that right | 21:24 |
jroll | oh, this is gonna be fun | 21:25 |
comstud | so | 21:25 |
comstud | i missed a lot of stuff here | 21:25 |
comstud | but | 21:25 |
comstud | is that node_power_action call in greenthread worker in conductor? | 21:26 |
jroll | ok so increasing heartbeat timeout at least let me get my nodes powered off | 21:26 |
russell_h | comstud: yeah | 21:26 |
comstud | i don't think the worker should run until after the RPC call returns | 21:26 |
comstud | eventlet doesn't tend to schedule greenthreads immediately | 21:26 |
comstud | but i'd have to check how it works with pools | 21:26 |
comstud | but, for example, spawn() doesn't fire the greenthread immediately | 21:27 |
comstud | or switch to it, i mean | 21:27 |
russell_h | comstud: what I'm wondering though, is if you managed to yield to the event loop before your response is fully written | 21:27 |
comstud | yeah | 21:27 |
russell_h | like, an easy example would be if your TCP buffer filled up I'm guessing | 21:27 |
comstud | nod | 21:27 |
russell_h | but it could be something more likely in the AMQP library or something | 21:27 |
russell_h | a read-before-write or something | 21:27 |
comstud | i don't think anything else should cause a switch | 21:27 |
comstud | besides a write() and getting EAGAIN | 21:27 |
comstud | so, socket buffer full. | 21:28 |
comstud | setsockopt sndbuf to 128K | 21:28 |
comstud | :) | 21:28 |
comstud | or sysctl it | 21:28 |
comstud | hehe | 21:28 |
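[editor's note] comstud's suggestion is that the switch happens when a `write()` gets EAGAIN because the socket send buffer is full, so enlarging the buffer would make yielding mid-reply less likely. A sketch of setting `SO_SNDBUF` to 128K on a plain socket (in practice this would belong in the AMQP transport layer, or be done system-wide via sysctl as he says, not in application code):

```python
import socket

# Request a 128K send buffer so a reply-sized write is less likely to
# hit EAGAIN (which is what forces a greenthread switch mid-response).
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 128 * 1024)
effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
s.close()
print(effective >= 128 * 1024)  # Linux doubles the requested value
```

The sysctl equivalent would be raising `net.core.wmem_default` / `net.core.wmem_max`.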
jroll | russell_h: let me finish powering on these nodes and then I'll run your debug thing | 21:29 |
comstud | right, i suppose there could be a read before write | 21:29 |
jroll | power cycle everything was my goal to begin with | 21:29 |
comstud | but the other thing is... | 21:29 |
comstud | I had roll try wrapping the DB calls in Thread pools | 21:29 |
comstud | that would have eliminated the problem here | 21:29 |
*** eguz has joined #openstack-ironic | 21:29 | |
comstud | jroll even | 21:29 |
comstud | oh | 21:30 |
*** zdiN0bot has joined #openstack-ironic | 21:30 | |
comstud | actually, | 21:30 |
comstud | i wonder if you are using our patched eventlet or not | 21:30 |
comstud | there's another very bad thing in stock eventlet | 21:31 |
comstud | in that it likes to use set()s for things. | 21:31 |
jroll | it's probably stock eventlet | 21:31 |
comstud | so you don't get very fair scheduling | 21:31 |
jroll | ffs | 21:31 |
jroll | sigh | 21:31 |
comstud | but it mostly applies to Queue | 21:31 |
comstud | the event loop is a list (heapq) | 21:32 |
jroll | russell_h: I'm not seeing spam from eventlet | 21:32 |
russell_h | me either | 21:33 |
*** jistr has quit IRC | 21:33 | |
*** eghobo has quit IRC | 21:34 | |
*** zdiN0bot has quit IRC | 21:34 | |
*** zdiN0bot has joined #openstack-ironic | 21:43 | |
*** matty_dubs is now known as matty_dubs|gone | 22:00 | |
JayF | NobodyCam: is there any documentation or do you have a few moments to chat about your dev environment | 22:06 |
NobodyCam | humm | 22:23 |
*** dwalleck has joined #openstack-ironic | 22:23 | |
NobodyCam | otp right now | 22:23 |
JayF | NobodyCam: otp? | 22:24 |
NobodyCam | on the phone | 22:24 |
*** dwalleck_ has joined #openstack-ironic | 22:24 | |
JayF | ooh. I thought it was some new testing tech I hadn't heard of :P | 22:25 |
*** dwalleck has quit IRC | 22:28 | |
*** jgrimm has quit IRC | 22:32 | |
*** ilives has joined #openstack-ironic | 22:32 | |
*** ilives has quit IRC | 22:38 | |
NobodyCam | JayF: were they quick questions that you had? I ask only because I need to step away from the keyboard for a bit. | 22:38 |
JayF | I just wanted to gauge what you were doing, because it seems to work well for you, before thinking about what we should do | 22:39 |
JayF | it's not urgent, go on your walkabout and have some bagels ;) | 22:40 |
* devananda is back | 22:44 | |
devananda | jroll: make any progress diagnosing? I skimmed scrollback but things tapered off ~1hr ago | 22:44 |
jroll | devananda: didn't really find too much. if we set our heartbeat timeout to 6000 everything goes through before it dies | 22:47 |
*** dwalleck_ has quit IRC | 22:47 | |
NobodyCam | JayF: it's easy.. I run TripleO with USE_IRONIC=1 | 22:48 |
jroll | devananda: so I think it is the heartbeat getting starved | 22:48 |
NobodyCam | :( /me is actually out of bagels | 22:48 |
devananda | "before it dies" | 22:49 |
devananda | hah | 22:49 |
*** lnxnut_ has joined #openstack-ironic | 22:51 | |
*** lnxnut has quit IRC | 22:51 | |
JayF | NobodyCam: wonder how difficult it would be to have that work with the agent. I know very little about ooo | 22:51 |
*** lnxnut_ has quit IRC | 22:51 | |
*** lnxnut has joined #openstack-ironic | 22:52 | |
devananda | shouldn't be hard. change a few settings and maybe install different package / prereq | 22:52 |
devananda | JayF: i suspect the biggest obstacle will be that the agent isn't buildable with DIB | 22:52 |
JayF | For now, I'd probably just set it up that you'd provide the image to pxe boot | 22:53 |
JayF | if that's possible at all | 22:53 |
*** todd_dsm has joined #openstack-ironic | 22:57 | |
NobodyCam | JayF: if it all works like it should.. the only change should be with how you register the node and ofc having the deploy-ironic element actually use the agent | 22:58 |
*** todd_dsm has quit IRC | 23:09 | |
*** todd_dsm has joined #openstack-ironic | 23:10 | |
*** killer_prince has quit IRC | 23:14 | |
*** ifarkas has quit IRC | 23:14 | |
*** lucas-dinner has quit IRC | 23:20 | |
*** killer_prince has joined #openstack-ironic | 23:21 | |
devananda | jroll: so the last paste of mine, where it looked like the RPC response wasn't being sent, is because the "background" greenthread was executing after change_node_power_state returned but before the RPC message was sent | 23:28 |
devananda | thus blocking the whole process | 23:28 |
devananda | i switched the test to use utils.execute(*['sleep', '10']) and it works as expected -- RPC message is returned right away and background thread waits for 10 seconds | 23:29 |
devananda | so that also confirms that utils.execute is getting patched (which i think you guys also confirmed by digging in the code) | 23:29 |
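[editor's note] The distinction devananda confirms — `utils.execute(*['sleep', '10'])` returns the RPC reply immediately because the monkey-patched subprocess call cooperates with the hub, while an in-process `time.sleep` does not — is the whole story in miniature. A sketch using asyncio's subprocess support as a stand-in for a patched `utils.execute` (requires a POSIX `sleep` binary; not Ironic code):

```python
import asyncio

beats = []

async def heartbeat():
    # Stands in for the conductor's periodic heartbeat task.
    for _ in range(3):
        beats.append("beat")
        await asyncio.sleep(0.02)

async def main():
    hb = asyncio.create_task(heartbeat())
    # The equivalent of utils.execute under a patched subprocess module:
    # the child runs without freezing the event loop.
    proc = await asyncio.create_subprocess_exec("sleep", "0.2")
    await proc.wait()  # suspends only this task; the heartbeat keeps beating
    await hb
    print(len(beats))

asyncio.run(main())
```

The external `sleep` blocks only its own greenthread, so background work (and the RPC reply) proceeds on schedule.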
jroll | devananda: cool | 23:29 |
jroll | devananda: although, russell_h had that theory, we added eventlet.sleep(1) to the utils.node_power_action function | 23:31 |
jroll | and that didn't help at all | 23:31 |
devananda | right | 23:31 |
devananda | the way that i recreated the error is not the way you guys are encountering it | 23:32 |
devananda | *not the same cause | 23:32 |
devananda | so using utils.execute(...) i'm getting "lack of free conductor workers" | 23:33 |
devananda | but not seeing the conductor fall over | 23:33 |
devananda | jroll: do you know where (in the python code) the threads are mostly sitting? | 23:34 |
jroll | devananda: we are not running out of worker threads | 23:35 |
jroll | the problem is that heartbeat does not run fast enough while we're spamming power commands | 23:35 |
devananda | jroll: that's the symptom. what's causing the periodic task not to run? | 23:35 |
jroll | devananda: I'm not sure | 23:36 |
jroll | I'm off on other tangents atm | 23:36 |
devananda | i was able to reproduce that symptom with calls that block the whole process | 23:36 |
devananda | k | 23:36 |
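[editor's note] jroll's symptom — the AMQP heartbeat not running fast enough while power commands are being spammed — is what timer starvation looks like from the periodic task's side: its wakeups arrive late, and past the broker's timeout the connection dies. A small self-contained sketch (asyncio standing in for eventlet; the names and thresholds are illustrative, not from Ironic):

```python
import asyncio
import time

late = []

async def heartbeat(interval=0.02):
    # Periodic task like the conductor/AMQP heartbeat; records overdue wakeups.
    last = time.monotonic()
    for _ in range(3):
        await asyncio.sleep(interval)
        now = time.monotonic()
        if now - last > interval * 4:  # "missed" threshold, chosen arbitrarily
            late.append(round(now - last, 3))
        last = now

async def main():
    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # heartbeat arms its first timer
    time.sleep(0.2)         # a blocking "power action" starves that timer
    await hb

asyncio.run(main())
print(late)  # the first beat fires roughly 0.2s late
```

Raising the heartbeat timeout (as jroll did) hides the starvation rather than fixing the blocking call that causes it.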
*** todd_dsm has quit IRC | 23:38 | |
jroll | devananda: I may take a crack at it again tomorrow, but I wasn't getting very far :/ | 23:39 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!