Thursday, 2022-11-17

opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for ports node_uuid  https://review.opendev.org/c/openstack/ironic/+/86293300:01
TheJuliaJayF: ack ack00:10
hjensasJayF: added the perf test data, it's both good and bad. I.e the plain db_api.get_port_list is a lot slower, but the more complete "API" test is a lot faster.00:11
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for node chassis_uuid  https://review.opendev.org/c/openstack/ironic/+/86480200:11
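For context, the association_proxy pattern these patches use can be sketched like this (toy models for illustration, not ironic's actual schema):

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Node(Base):
    __tablename__ = "nodes"
    id = Column(Integer, primary_key=True)
    uuid = Column(String)

class Port(Base):
    __tablename__ = "ports"
    id = Column(Integer, primary_key=True)
    node_id = Column(Integer, ForeignKey("nodes.id"))
    node = relationship(Node)
    # surface the related node's uuid as a plain attribute on Port, so
    # callers no longer need a separate per-port node lookup
    node_uuid = association_proxy("node", "uuid")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Port(node=Node(uuid="node-1")))
    session.commit()
    port_node_uuid = session.query(Port).first().node_uuid
print(port_node_uuid)
```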
TheJuliahjensas: so, it is a lazy population... I *wonder* if you change that to selectinload, if that would change the numbers dramatically00:22
TheJuliasince lazy defers the data population, and might speed up the db_api method call00:22
TheJuliahmm lazy=joined00:27
hjensasTheJulia: I will run a test with selectinload tomorrow.00:27
TheJuliahttps://docs.sqlalchemy.org/en/20/orm/relationship_api.html#sqlalchemy.orm.relationship.params.lazy00:28
TheJuliaselect is the parameter00:28
TheJuliajoined is a deferred lazy load00:28
TheJuliaselect would cause two distinct queries00:28
TheJuliaso ports, then "hi db, give me this value for all of these id's kthxbai"00:28
TheJuliahjensas: have a wonderful sleep() :)00:30
hjensasTheJulia: that might help, I need to respin my devstack instance to test. But using select vs join sure changes the SQL generated - https://paste.opendev.org/show/be5fV2ynyF1W6cD30kjH/00:31
TheJuliayeah, there is a tax on lazy loading: a bit more overhead in the later population00:32
TheJuliabut it is one of those "we should check both to be sure" things00:33
JayFLazy uses more sqla magic, right?00:33
JayFIt's like opting in to some magic?00:33
TheJulialazy is01:15
TheJuliaso, "select" is a lazy load, "selectin" is the double query01:15
TheJuliajoined might have unexpected results.... 01:21
opendevreviewSteve Baker proposed openstack/bifrost master: Support PXE network boot with grub  https://review.opendev.org/c/openstack/bifrost/+/80722002:06
jandersgood morning Ironic o/07:10
jandersajya I finally had a chance to look at test improvements you suggested for the SettingsURI fix (thanks!) - left you a response: https://review.opendev.org/c/openstack/sushy/+/856597/comments/e04fa10b_2b4ac1e9 - let me know what you think, TY!07:11
opendevreviewkamlesh chauvhan proposed openstack/sushy master: Fix volume deletion on newer iDRACs  https://review.opendev.org/c/openstack/sushy/+/86484507:47
kubajjGood morning Ironic 07:52
arne_wiebalckGood morning kubajj and Ironic!07:55
ajyaMorning, janders, kubajj, arne_wiebalck and Ironic08:13
ajyajanders: thanks, I left a comment08:13
*** akahat|ruck is now known as akahat|ruck|lunch08:14
arne_wiebalckhey ajya o/08:25
rpittaugood morning ironic! o/08:35
opendevreviewVanou Ishii proposed openstack/python-ironicclient master: Expand boot device choice  https://review.opendev.org/c/openstack/python-ironicclient/+/86484708:42
*** akahat|ruck|lunch is now known as akahat|ruck09:22
jandersajya  WDYT about this approach to improving the tests in question: https://paste.openstack.org/show/bedD9rBJV08zDsuyhZ4R/ (L24-25). This is trying to ensure the GET request for eTag is hitting the expected path (System or Settings)10:04
ajyajanders: I guess this adds some extra checks, but might not be so obvious what it's testing. I'm not pushing hard for this, wait for other reviewers input.10:25
jandersajya TY10:25
jandersdtantsur if you have time it would be great to hear your thoughts on this part - would be great to get this thing to merge :)10:26
jandersyeah I'm not super confident about using those GETs, if you folks have any pointers how to do this nicely with side_effects I can have a go tomorrow10:44
janders(I mean using asserts to look for GET calls)10:45
iurygregorymorning Ironic11:32
dtantsurjanders: absolutely no context, sorry12:40
dtantsurwhich patch are we talking about?12:40
TheJuliaGood morning!13:14
jrosseris there anywhere in the docs where the 'lifecycle' of whatever is providing the dhcp lease for an ipmi port is described?13:35
jrosserright now i have inspector offering dhcp leases to anything that wants one, but those seem ephemeral and i can't see how those become persistent once a node is created/inspected13:36
jrosser*dnsmasq adjacent to inspector....13:36
TheJuliaso you want to manage your BMCs using dhcp?13:40
jrosserwell, pretty much by accident, that is what I'm doing now :)13:41
TheJuliaokay!13:41
TheJuliaGenerally we recommend static addressing of BMCs since if you lose your dhcp server, you'll be in a huge world of hurt13:41
jrosserand what happened was we did some firmware updates that involved taking the network ports down, and the leases expired13:41
jrosserand now everything is $randomised!13:41
TheJuliauhh13:41
TheJuliathat is exactly why we generally recommend that13:42
jrosserso i was wondering if i just misunderstood and there was a step needed to convert the inspector/dnsmasq leases into neutron ports13:42
jandersdtantsur apologies. We were talking about this comment in the Lenovo - fixing SettingsURI patch : https://review.opendev.org/c/openstack/sushy/+/856597/comments/e04fa10b_2b4ac1e913:42
TheJuliathe inspector dnsmasq has always been for the connected interfaces for netbooting/hardware discovery, not the BMCs themselves13:43
jandersdtantsur if you have any inputs on the need to improve these tests (and even better hints how to go about it) that would be very much appreciated13:43
jrosserah ok i think we have struggled to find a reference architecture for this13:43
jrosserand getting the bmc leases from inspector dnsmasq has indeed made it work pretty much by accident13:44
TheJuliajrosser: not a step per se, more disjointed usage since most operators when using something like auto-discovery just discover the machine on a base vlan, and then use network switch management via an ml2 driver, or human port management to move the base network interfaces to a more appropriate network for the long term operation of the physical node13:44
TheJuliayeah, we deadlocked on reference architectures because there are many ways this can all be plugged together, and that doesn't translate to workflows easily either :(13:45
TheJuliamost operators I know run static for the bmcs on an entirely separate isolated management network13:45
TheJuliabut yeah, it is easy to stumble into such a config :\13:45
TheJuliaAlso, we can't really just know what the bmc's IP is all the time. If ipmi, the bmc, if memory serves, says it is 0.0.0.013:46
TheJuliaDo you have a physical asset inventory?13:46
jrosserat the moment we have some ansible that creates the baremetal nodes13:47
jrosserso "yes" in as much as the nodes are described in ansible group vars13:47
TheJuliaokay.. I think you're going to need to get into the bmc of each node, and reconcile the addressing back out by hand and/or set it up either with a static lease provided by some dhcp service or just a static address, and then ensure that matches what ironic is configured for to perform power/boot management13:48
TheJuliaWhich is kind of why i was thinking... do you have an asset inventory because typically accountants need serial numbers and sometimes those lists have mac addresses13:49
TheJuliathis is one of those things physical hardware access makes a ton easier, fwiw, since service tags are generally a thing13:49
jrosseri think that we were seeing something looking really close to auto discovery / auto inspect and thinking how neat that was13:50
jrosserlike zero touch nearly13:50
TheJuliamost zero-touch things I've seen completely ignore the long term persistent management since they are just the first go at the machine's deployment13:51
jrosserfwiw i am heavily revisiting the code and documentation in openstack-ansible as a side effect of deploying ironic for $job so i'd like to make sure what i do and document is reasonable13:52
TheJuliajrosser: awesome, if there is any thoughts/guidance/opinion/info we can help leverage please just let us know13:52
TheJuliafwiw, I know several operators who have pipelined ironic into zero touch of their stuff, but they have a hardware inventory to run with which helps advance setting of dhcp reservations for bmcs and supplies them passwords with just a little glue to fit their business process and vendor process on each side of getting the hardware in on the loading dock.13:59
TheJuliaor static addresses pre-programmed by the factory14:00
jrosserright - everything shipped with unique bmc passwords these days means some process is needed there14:00
TheJuliayeah, for individual off the shelf orders that is a pain14:00
jrosseralso completely different angle, what is the situation with openbmc and ARM systems?14:00
jrosseri am about to get some Ampere stuff which is both of those14:01
TheJuliafor bulk like whole rack orders, most vendors will provide a list of the passwords upfront14:01
TheJuliaoh good, because Ampere uses ironic in their lab14:01
TheJuliaw/r/t openbmc, they implemented enough ipmi that things kind of work as I understand it, and redfish apparently also just works, but I personally can't speak to it14:02
TheJuliaArm wise, you can set the bootloader you want, I think ampere went with ipxe and had to build their own ipxe binary because of a bug or something14:02
TheJulia(and distributors don't ship arm ipxe)14:03
jrosserit's great to hear they are using ironic internally - they've been very helpful so far so I can just ask our local FAE14:03
TheJuliayou could likely use pre-built grub if you really wanted to, it is just not as flexible and doesn't contain fallback/retry logic14:03
TheJuliaalso you can't just fall into introspection easily on grub14:04
TheJuliaanyone remember what irc handle morgan is using these days?14:07
johnthetubaguyJayF: TheJulia: as a heads up, I think I found a way to finally fix that placement race when nova instances are deleted, then the ironic node goes into automatic cleaning, but there is a window when we keep handing out that ironic node as a valid candidate: https://review.opendev.org/c/openstack/nova/+/864773 and a related fix on retrying when placement gets out of sync has also just got approved:14:16
johnthetubaguyhttps://review.opendev.org/c/openstack/nova/+/842478 Hopefully a bunch of failed builds that can be avoided, particularly when you are trying to patch supercomputers powered by OpenStack ;)14:16
TheJuliajohnthetubaguy: that is *awesome* news14:17
TheJuliajohnthetubaguy: Is there a path to backport it?14:19
johnthetubaguythe general "feeling" was yes for both patches, I think14:21
TheJulianice14:23
TheJuliahjensas: sorry, I looked at the wrong field, it was like 2500%, not 3500%14:23
TheJuliaimprovement that is14:23
hjensasTheJulia: no problem, still a lot. (I should've calculated myself before posting on commit msg.)14:56
opendevreviewkamlesh chauvhan proposed openstack/sushy master: Retry on iDRAC SYS518 errors for all requests  https://review.opendev.org/c/openstack/sushy/+/86491115:10
dtantsurjanders: I fell asleep, apologies.. will take a look16:05
* dtantsur suspects he needs a large break pretty soon16:05
JayFI feel that :) I'm off W/Th/Fri next week to visit family for American Thanksgiving and it's much needed16:46
TheJuliaJayF: fwiw, nobodycam mentioned to me that we should expect traction on https://review.opendev.org/c/openstack/ironic-python-agent/+/566544 after the new year17:12
JayFI mean, it's not something I'm tracking or worried about :D 17:12
JayFI'm sure they are the most motivated to support their own hardware :D 17:13
rpittaugood night! o/17:17
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for ports node_uuid  https://review.opendev.org/c/openstack/ironic/+/86293318:00
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for port groups node_uuid  https://review.opendev.org/c/openstack/ironic/+/86478118:00
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for node chassis_uuid  https://review.opendev.org/c/openstack/ironic/+/86480218:00
hjensasTheJulia: I changed lazy to "selectin" - with "select" I end up with an error related to no active session.18:02
hjensasTheJulia: selectin seems to be marginally better than "joined" and "joined" + "innerjoin=True" https://paste.opendev.org/show/bVCbtyXmPDd44OVskrea/18:02
TheJuliaoh, right... yeah you can't chain anymore18:06
TheJuliasorry, nest18:06
TheJuliayeah maybe 200ms which is still surprising, I guess that means there was some result set deduplication occurring18:07
TheJuliaso selectin it is, I guess18:08
hjensasadd_port_filter_node_uuid_is_not_none <- removing that shaves ~500ms18:13
TheJuliagot a link to that method?18:16
hjensashttps://review.opendev.org/c/openstack/ironic/+/862933/6/ironic/db/sqlalchemy/api.py#21418:17
hjensasThat check for not none is to "compensate" for removing what was added for https://launchpad.net/bugs/1748893 - i.e https://review.opendev.org/c/openstack/ironic/+/862933/6/ironic/api/controllers/v1/port.py#b16918:19
TheJuliaoh! Interesting18:20
hjensasI kind of feel that should not be an issue anymore.18:20
TheJuliaOkay, yeah18:20
TheJuliaI was thinking it shouldn't be either18:20
TheJuliasince there is no background query occurring anymore, the join result failure should just not have a value.... or might not show in the list18:21
TheJuliaoh18:21
TheJuliait is selectin, so I think it would just be None18:21
TheJuliaso yeah, that really shouldn't be an issue18:21
JayFyou could possibly write a test to validate that as well, if it's a concern18:21
hjensasso, ports should always be gone before the node is gone - https://opendev.org/openstack/ironic/src/branch/master/ironic/db/sqlalchemy/api.py#L801-L86718:31
* JayF wonders if replicated read slaves are guaranteed to update in order18:31
JayFyes, at least for mysql18:31
hjensasunless there is some replica that did not sync yet.18:31
TheJuliayeah, its a weird edge case18:34
TheJuliaand largely revolves around the results being generated and being iterated through18:34
TheJuliaclosing the time span gap also fixes it18:34
TheJuliabecause you're not going back and re-consulting the db after the initial list was created 3-4 moments back18:35
hjensaswonder if it would actually be faster to just continue if port.node_uuid is None in the api controller?18:35
TheJuliaI seem to remember ring repl could get out of order.... but I don't think you can do that anymore18:35
TheJuliamaybe18:36
TheJuliaalthough generally closing the window greatly reduces the possibility18:36
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for ports node_uuid  https://review.opendev.org/c/openstack/ironic/+/86293319:13
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for port groups node_uuid  https://review.opendev.org/c/openstack/ironic/+/86478119:13
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for node chassis_uuid  https://review.opendev.org/c/openstack/ironic/+/86480219:13
hjensasI moved the node_uuid is None check to the API controller, afaict its overhead is far less invasive when done there.19:15
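A hypothetical sketch of the approach described: rather than adding a "node_uuid IS NOT NULL" filter in the DB layer (the ~500ms cost noted above), drop ports whose node has vanished mid-listing while building the API response. Names here are illustrative, not ironic's actual controller code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Port:
    uuid: str
    node_uuid: Optional[str]

def serialize_ports(ports):
    # an orphaned port (node deleted between listing and serialization)
    # simply disappears from the response instead of failing the request
    return [{"uuid": p.uuid, "node_uuid": p.node_uuid}
            for p in ports if p.node_uuid is not None]

ports = [Port("p1", "n1"), Port("p2", None)]
print(serialize_ports(ports))  # only p1 survives
```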
stevebaker[m]good morning19:33
TheJuliagood morning stevebaker[m] 20:17
arne_wiebalckTheJulia: I left some thoughts on https://review.opendev.org/c/openstack/ironic-specs/+/86180320:39
arne_wiebalckGood morning stevebaker[m] o/20:39
stevebaker[m]\o20:40
arne_wiebalck.... aaaaand goodbye everyone, see you tmrw o/ :)20:40
stevebaker[m]good night arne_wiebalck20:40
JayFarne_wiebalck: I wrote you a novel in there for when you're back tomorrow :D 21:39
JayFtl;dr: conductors and computes have different scaling semantics and tying them together is a bad idea as a result21:40
TheJuliabrraaaaaains22:35
TheJuliaJayF: I think you kind of explained it nicely, fwiw22:47
TheJuliait is so drastically different, and given it works, just not quickly... kind of becomes deceiving22:47
JayFit's not even that; like, in some common deployment types they won't, as a coincidence, have similar scaling sizes22:48
JayFbut if you have someone who, e.g. is managing large numbers of nodes but mainly only doing 1-5 provision/clean cycles during a multi-year node lifetime22:48
JayFyou don't need quite so many conductors :D 22:48
JayFand I'm sure there are examples which go the other direction; but with the iscsi driver dead I don't have an easy one to pick on lol22:48
TheJuliaTrue22:51
TheJuliaheh22:51
TheJuliawe also changed some of the logic to be more preferential/smarter22:51
TheJulialike... always starting power sync with the oldest22:51
TheJuliaAnd everything is tunable22:52
TheJuliaAnd... we're fairly laid back about power sync just being a long running "thing"22:52
JayFwe have config to disable it, yeah?22:52
TheJuliaof course22:52
JayFI can't remember if that was a downstream patch or config22:52
JayFok good22:53
TheJuliayou set the interval to 022:53
TheJuliaand it stops22:53
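If memory serves, the knob being described is the conductor's power-state sync interval; a hedged sketch of the config, assuming the option is `sync_power_state_interval` (check the config reference for your ironic release):

```ini
# Hedged sketch: assumed option name, verify against your release's docs.
[conductor]
# seconds between power-state sync loops; 0 disables the periodic sync
sync_power_state_interval = 0
```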
JayFfyi my office hours today are starting in ~7 minutes22:53
JayFgoing to be working on periodic task for the shards stuff22:53
JayFof course folks are welcome to come stop by and request reviews or similar, too22:54
TheJulialink?22:55
JayFyoutube.com/jayofdoom22:56
JayFgoing live at the top of the hourish22:56
TheJuliacool cool22:56
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for port groups node_uuid  https://review.opendev.org/c/openstack/ironic/+/86478123:57
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for node chassis_uuid  https://review.opendev.org/c/openstack/ironic/+/86480223:57

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!