Tuesday, 2025-07-08

opendevreviewJulia Kreger proposed openstack/ironic master: ci: stabilize ironic-standalone-redfish  https://review.opendev.org/c/openstack/ironic/+/954303 01:20
rpittaugood morning ironic! o/06:32
opendevreviewJacob Anders proposed openstack/ironic master: [WIP] Skip initial reboot to IPA when updating firmware out-of-band  https://review.opendev.org/c/openstack/ironic/+/954311 06:32
queensly[m]Good morning07:07
dtantsurTheJulia: good question. Initial boot logging can be pretty useful, e.g. when debugging something around NetworkManager07:18
ContinuityMorning all. Has anyone had any experience with fast-track-deploy09:27
dtantsurContinuity: we use it by default in metal3, it's also used in bifrost. If your question is in the context of Nova though, there are way fewer people who tried it.11:26
Continuitydtantsur: thanks, I guess I'll be off to do some work on the bleeding edge... and you know why they call it the bleeding edge... 11:55
ContinuityFrom all the cuts you get :D11:55
opendevreviewTakashi Kajinami proposed openstack/networking-generic-switch master: etcd: Discover api version automatically  https://review.opendev.org/c/openstack/networking-generic-switch/+/954331 12:01
iurygregorygood morning ironic12:10
iurygregoryCI is sad today? <eyes>12:16
iurygregorymetal3-integration seems unhappy, but seems like a centos issue because nothing provides the versions we want12:19
rpittauthat's unfortunately still centos repos shenanigans impacting us12:19
iurygregory=(12:21
rpittauand well others too :/12:21
iurygregorypep8 is going crazy also lol12:21
iurygregoryERROR: No matching distribution found for pbr>=2.0.0 12:22
rpittaummm that's weird12:22
iurygregoryhttps://zuul.opendev.org/t/openstack/build/e388682264654e3992b6b3aeb5f7cd70 a lot12:22
iurygregorybefore the recheck the job was green XD12:23
dtantsurDoes anyone understand the logic that informed this ordering? https://opendev.org/openstack/ironic/src/branch/master/ironic/conductor/steps.py#L34-L41 12:24
dtantsurIt means, among other things, that in-band steps from IPA override BIOS interface steps but do not override firmware interface ones.12:24
rpittauiurygregory: that's the error, but the root cause looks like an issue contacting pypi, or better the pypi mirror 12:25
iurygregoryrpittau, yeah12:25
dtantsur(in case these ever clash, which is a bad idea on its own)12:25
iurygregorydtantsur, wait what? :eyes:12:25
dtantsuriurygregory: steps are collected in the order of these priorities12:25
iurygregoryyup12:25
iurygregorybut we don't have in-band steps for firmware interface in IPA, so yeah it wouldn't override 12:26
dtantsurThe more I think about it, the more I'm convinced that I lied to JayF back at the PTG, and overriding steps does not actually work reliably12:27
dtantsuriurygregory: I'm being hypothetical here. The context is the change that janders is working on.12:27
iurygregoryoh ok!12:27
dtantsurIf overriding steps does not actually work for servicing/cleaning, we're safe to make certain assumptions.12:27
iurygregoryI was missing some context <sorry>12:27
dtantsurNo worries, I'm struggling to put thoughts into words today12:28
iurygregorysame, I only had 5hrs of sleep, so I'm a bit slow12:28
* dtantsur is now struggling to understand how software RAID works at all12:30
iurygregorytldr; magic :D12:30
* iurygregory needs more coffee12:30
dtantsurWe might be relying on accident here..12:31
opendevreviewMerged openstack/bifrost master: bug: drop baremetal introspection mention  https://review.opendev.org/c/openstack/bifrost/+/946726 12:48
TheJuliagood morning13:04
TheJuliadtantsur: magic in the agent :)13:04
guilhermesphello team! Question: is it possible to use ubuntu-signed DIB element for ipa so I can clean a system with secure boot enabled? 13:07
iurygregoryI think it should be possible, but I never used ubuntu-signed ..13:09
rpittauguilhermesp: you'll have to include the ubuntu-signed DIB element when using ipa-builder13:17
rpittauguilhermesp: there's a specific option called "extra-elements"13:17
rpittauerrr "extra-element"13:18
guilhermesprpittau: thanks! trying to get the idea here... because it seems ubuntu-signed is already cooked? https://docs.openstack.org/diskimage-builder/latest/elements/ubuntu-signed/README.html 13:23
guilhermesptried that: DIB_REPOLOCATION_ironic_python_agent=/root/dib/IPA-path/ironic-python-agent DIB_REPOREF_ironic_python_agent=master DIB_RELEASE=ubuntu ironic-python-agent-builder -o test-ipa-ubuntu-signed ubuntu-signed ( because I have to add a pr by dtantsur for software raid ) 13:23
rpittauguilhermesp: you need the -e option of ipa-builder13:25
rpittauhere -> https://docs.openstack.org/ironic-python-agent-builder/latest/admin/dib.html 13:26
TheJuliaguilhermesp: are you using PXE or virtual media?13:51
guilhermesppxe TheJulia 13:52
guilhermespi tried injecting those 13:53
guilhermesphttps://www.irccloud.com/pastebin/d8wgGZxl/ 13:53
guilhermespbut somehow the clean process failed and i could not see logs from it, then the node rebooted and could not pxe boot anymore since secureboot was enabled 13:54
dtantsurTheJulia: morning! Do we have a way for a user to provide agent token? And if no, is it something you'd find acceptable?13:55
dtantsurContext: I have frequent cases of "something happened, and now the token is invalid", which are hard to debug. If we could embed the token into the ISO, it would hopefully be better.13:56
dtantsurAnd we cannot use Ironic's embedding process since it does not know how to work with CoreOS images13:57
TheJuliaguilhermesp: are you using a signed PXE loader, i.e. grub, or self signed ipxe binary?13:59
TheJuliawith your own key management13:59
TheJuliadtantsur: eek... uhhhhhh13:59
TheJuliadtantsur: uhhhhhhhh13:59
TheJuliadtantsur: give me a few to ponder this one13:59
TheJulia(make more coffee)14:00
dtantsursure14:00
dtantsurSeparately, I need to pick your brain on threads, periodics and long-running tasks in the post-eventlet world. Whenever you have time and spoons.14:01
TheJuliaI should have time after the board meeting, but that means in like 3 hours14:02
dtantsurMay be a bit late. I can drop you an email instead and copy cid (if I know your email) and whoever is interested in eventlet.14:16
TheJuliaI also have time tomorrow morning, fwiw14:16
dtantsurTheJulia: Wednesdays are usually pretty busy for me. I assume 1500 UTC is a bit early?14:17
dtantsurIn any case, an email should give you food for thought14:17
TheJuliacool cool14:18
TheJuliauhh14:18
TheJuliaI already have someone who has asked for 1500 UTC14:18
TheJuliaunfortunately14:18
dtantsurNo worries, let's do it asynchronously14:19
dtantsurTheJulia: these are my expanded thoughts about the agent token: https://bugs.launchpad.net/ironic/+bug/2116188 14:26
TheJuliaOkay, I'll try to take a look later today14:27
opendevreviewTakashi Kajinami proposed openstack/networking-generic-switch master: etcd: Discover api version automatically  https://review.opendev.org/c/openstack/networking-generic-switch/+/954331 14:33
dtantsurThought dump sent too (you'll probably be happy I was not doing the same on a call - it's messy)14:38
kubajjHello, we recently came across an issue with the efibootmgr (it took us a while to pinpoint it there). In order to shorten the time to solve the issue, we are planning to include a debug logging of the efibootmgr output before and after the efi_utils._run_efibootmgr runs. I have 2 questions: 1. should I commit this upstream? 2. Is there a convention on how to log blobs of text spanning multiple lines? (I could loop over the 14:44
kubajjlines, but am not sure)14:44
dtantsurkubajj: how long is it? If it's like 10 lines, our convention is to just log it.14:47
dtantsur(we actually log pretty long things in IPA - check any CI output)14:47
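A minimal sketch of the "just log it" convention described above, assuming Python's standard `logging` module: the whole multi-line blob goes into a single DEBUG record instead of one record per line. The helper name `log_command_output` and the sample text are hypothetical and are not taken from IPA.

```python
import logging

LOG = logging.getLogger(__name__)


def log_command_output(label, output):
    """Log a multi-line blob as one DEBUG record and return the logged text.

    Hypothetical helper for illustration; not part of ironic-python-agent.
    """
    # Drop trailing newlines so the record does not end in blank lines.
    text = output.rstrip("\n")
    # Lazy %-style arguments: formatting only happens if DEBUG is enabled.
    LOG.debug("%s output:\n%s", label, text)
    return text


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    # Made-up sample text standing in for real command output.
    log_command_output("example command", "first line\nsecond line\n")
```

Keeping the blob in one record means the lines stay together in the log even when other threads log at the same time.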
guilhermespTheJulia: it's an ipxe self-signed binary, I believe. 14:49
kubajjdtantsur: I think on a reasonable deployment, it could fit into 10 lines (I tested it on a node we use for QA - i.e. deploy every hour, and it was ~50 lines :D )14:49
dtantsurkubajj: do you need it in runtime? Another option is to add it to the logs collection.14:49
kubajjdtantsur: nope, just for debugging after14:50
dtantsurkubajj: then add your command to https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L3447-L3461 14:50
dtantsurtry to enable the maximum verbosity: the output will be in a separate file14:51
dtantsurI'd consider it a very useful addition for upstream IPA btw14:51
kubajjdtantsur: ah, that's neat, where does it store these logs?14:52
dtantsurkubajj: IPA sends them to Ironic, which stores them according to https://opendev.org/openstack/ironic/src/branch/master/ironic/conf/agent.py#L63-L99 14:53
dtantsurdocs: https://docs.openstack.org/ironic/latest/admin/troubleshooting.html#retrieving-logs-from-the-deploy-ramdisk 14:54
kubajjdtantsur: ok, thanks, I'll test it out and then create a change14:54
dtantsurcool14:54
kubajjdtantsur: I understand now, TIL how the other files other than the journal get created14:57
opendevreviewTakashi Kajinami proposed openstack/networking-generic-switch master: Document implicit requirement of etcd3gw backend  https://review.opendev.org/c/openstack/networking-generic-switch/+/954351 15:06
opendevreviewVerification of a change to openstack/ironic master failed: Handle unresponsive BMC during Firmware Updates  https://review.opendev.org/c/openstack/ironic/+/938108 15:17
iurygregoryI'm wondering for how long the centos repo will be causing problems =(15:47
Sandzwerg[m]<TheJulia> "sandzwerg: based on what you'..." <- I made some progress. It seems the ubuntu would boot after all it took just like reaaaaally long (in the order of multiple minutes). I can provide a complete log from the sol with grub debug enabled but I saw a ubuntu login screen at the end. That would be really cool, just trying to recreate it. The deployment itself failed still because of a timeout but I didn't watch and have16:13
Sandzwerg[m]currently no way to access the IPA to see what's going on so if secure boot works that would be next. If booting the kernel takes so long that would be odd and something I'd need to get fixed before I can roll it out for productive use cases. What are other people using who are working with secure boot? CentOS? Something completely different?16:13
dtantsurRHEL/CoreOS for us16:14
Sandzwerg[m]Which version? And do you build that with DIB or something else?16:15
dtantsurSandzwerg[m]: we use official (downstream) CoreOS images for both IPA and final instances. Versions tend to be quite recent and match OpenShift versions we're working with. We don't choose versions ourselves.16:19
Sandzwerg[m]dtantsur are these publicly available?16:26
dtantsurSandzwerg[m]: I don't think so :(16:26
Sandzwerg[m]<Sandzwerg[m]> "I made some progress. It seems..." <- boot time overall from "server create" to "I see ubuntu kernel messages" is about 17 mins, and for the "node powers on and does virtualmedia" part it's probably around 2-3 mins or so?16:30
stephenfincid: The ironic gating job in SDK is permafailing due to a failing inspection rule test. I assume something changed here recently?16:30
stephenfinhttps://zuul.opendev.org/t/openstack/builds?job_name=openstacksdk-functional-devstack-ironic&project=openstack/openstacksdk 16:30
Sandzwerg[m]dtantsur: too bad 16:31
stephenfinAny chance you'd be able to put together a fix if so?16:31
* TheJulia exits board meeting16:36
opendevreviewIury Gregory Melo Ferreira proposed openstack/ironic master: Redfish Firmware Interface - NIC Support  https://review.opendev.org/c/openstack/ironic/+/953394 16:36
TheJuliadtantsur: yeah, I only scratched the surface of your email and I'll try to reply today.16:37
dtantsurthx!16:37
TheJuliakubajj: I *thought* we already log that, but logging around that is worthwhile if there is an issue. Often UEFI troubleshooting is a pain so all for more logging16:38
cidstephenfin, looking...16:40
TheJuliaguilhermesp: okay, so I don't know about ipxe in signed mode, I also don't know if it does next step enforcement or not. I'm not sure it does, actually. One thing which is super interesting is some vendors have explicitly given up on Secure Booting PXE operations because PXE is super insecure anyhow, so internally you may apply settings, then when PXE is requested, the vendor might temporarily disable and then re-enable. 16:40
TheJuliaWe don't have good insight into that, unfortunately.16:40
stephenfincid: tl;dr: the API was returning 'loop' before but no longer appears to be doing so16:40
cidAhhhh, yep.16:41
TheJuliaguilhermesp: Sounds like if it is still in PXE boot post reboot but then failing, that your vendor is doing something along those lines.16:41
TheJuliaiurygregory: speaking of vendors and run, what's the latest on idrac10? Trying to figure out what (if anything) I should plan on us doing.16:41
iurygregoryTheJulia, let me grab the latest info I have 1min16:44
TheJuliaSandzwerg[m]: WOW, that is crazy. Turning off "nofb" might help. It sounds like it might be some sort of network bandwidth issue in the load process. You said vmedia, so I'd actually try to extend the timeout. A thing to double check is the BMC link rate/speed; whether it's set to 100M or 1G is a huge variable.16:45
iurygregoryTheJulia, https://paste.opendev.org/show/bGtLlkKmFPqucHacchLV/ this was the latest error we saw when QE was testing, dtantsur gave the idea to include 404 as a retry code in https://opendev.org/openstack/sushy/src/branch/master/sushy/oem/dell/asynchronous.py#L46  , I'm getting back to this today to see if I can have a downstream build for QE with this change, to see if helps16:49
TheJuliaSandzwerg[m]: What you're describing sounds very much like the link rate is constrained and there may also be additional latency. Might be worth investigating that path for any issues/constraints. When I was working with etingof early on, we did some early tests with the image backhauled across vpn connections and ended up getting to watch the boot process crawl along for like 45 minutes. A larger kernel/ramdisk that you'd 16:54
TheJuliaend up with will take ~1 minute in perfect conditions to transfer over the wire to even be in ram for it to boot.16:54
TheJuliaiurygregory: I'll want to see what I can get backported, I just don't entirely understand what is broken where at this point and what we need to do project wise at the moment.16:55
iurygregoryTheJulia, so far the fixes are all in sushy16:55
iurygregorywe will need them in sushy-oem-idrac for old versions of ironic16:56
iurygregoryhttps://review.opendev.org/c/openstack/sushy/+/950694 16:56
iurygregoryafter this fix we are hitting the problem I've shared above that can be related to the asynchronous.py16:57
TheJuliaiurygregory: okay, if we can detail it out, or at least have it detailed out, it would be super helpful (also because I'm context switching a lot right now)16:57
iurygregoryI will add to the etherpad about idrac10 issues16:58
TheJuliacid: I guess stephenfin's item is related to the fix which got merged 1-2 weeks ago?!17:04
dtantsurTheJulia: re replacing local calls for rpc_transport==None: how bad will it be to use JSON RPC on localhost (with a temporary password)?17:05
dtantsurI've considered unix sockets, but the client side of them is quite sad in both python-requests and keystoneauth17:05
TheJuliadtantsur: I'd be cool with that, I was sort of suspecting that might be the path anyhow17:06
dtantsurright17:06
TheJulia.... I'd likely consider a random... on startup password if configured17:06
cidTheJulia, stephenfin. Yeah. The fix should be up in a jiffy :D17:06
TheJuliawhich could then be shared/established early on and run with17:06
dtantsurTheJulia: yeah, I want to generate a long random password on start-up.17:06
TheJuliadtantsur: sgtm17:07
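A minimal sketch of generating a long random password on start-up, assuming Python's standard `secrets` module. The function name `generate_startup_password` and the 32-byte default are illustrative assumptions, not what the eventual Ironic change does.

```python
import secrets


def generate_startup_password(num_bytes=32):
    """Return a URL-safe random string built from num_bytes of entropy.

    Hypothetical helper for illustration; the name and default length are
    assumptions rather than anything taken from Ironic.
    """
    # token_urlsafe draws from the operating system's CSPRNG and encodes
    # the bytes as URL-safe base64 text.
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Generated once per process start and kept only in memory, so the
    # localhost RPC server and its client share it without a config file.
    password = generate_startup_password()
    print(len(password))
```

Because the value is regenerated on every start, nothing has to be persisted or rotated, which fits the "temporary password" idea discussed above.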
dtantsurTheJulia: is it safe to assume I don't need TLS in this case or am I being naive? :)17:07
TheJuliadtantsur: if you're binding to localhost, it stays inside the host, so... It's likely fine17:07
TheJuliaanyone who does anything like "iptables -t nat -I PREROUTING -j DNAT --to 127.0.0.1:8080" deserves what they get...17:08
dtantsur:D17:08
TheJuliaThat sounds worse than I was thinking, but it gets the point across17:08
TheJuliawe can't be expected to guard rail that17:08
TheJuliasince it is such an unexpected/unreasonable thing to really do17:09
TheJuliaso as long as the config loading locks it in, it should be okay17:09
TheJulia*one* thought I had was order of operations wise, I expect we *may* need to break the metal3-integration and bifrost jobs for a relatively short period of time, so I suspect whatever order *works* for us, we could just stack everything, get to the confidence level we want and potentially toggle jobs (with upfront communication) and all17:10
dtantsurTheJulia: let's see when we get there. I hope we can have some sort of a smooth upgrade, but it may be wishful thinking17:11
TheJuliaI *fully* expect once you do this, we might see some fun with locking and sqlite3 (as we've discussed) as well, so we might want to just get everything stacked up and just figure out the path to merge the things in whatever path makes sense17:11
dtantsurright17:11
TheJuliayeah, I think we can get to smooth, but at a minimum we may have a day or two where we know a job is going to be very broken17:11
dtantsurMmmm, googling confirms my suspicion that tcpdump will work on loopback17:12
TheJuliaone step at a time17:12
TheJuliaoh, yes, it very much does :)17:12
TheJulia*and* the MTU is 64k17:12
dtantsurMmm, okay, I think I'll generate a temporary TLS certificate as well. You're never too safe, right?17:13
TheJuliaWell... hmm... errr....17:13
TheJuliahttps://tenor.com/view/castle-nathan-fillion-richard-castle-talk-talking-gif-5154792 <-- exactly what I'm doing17:13
dtantsurI was imagining a cat (speaking tongues as they do), but this works as well17:14
Sandzwerg[m]<TheJulia> "Sandzwerg: What your describing..." <- That's a good idea, I'll investigate that tomorrow. I think it's 1G but I'm not sure. And apparently swift is quite slow for this and the ubuntu image is ~700mb so much much bigger than our older images ( we used a second stage there to circumvent some of the issues). maybe also rebuilding it as ubuntu-minimal might help. Thanks for the pointers17:17
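Rough arithmetic for the numbers mentioned here, as a sketch: the ideal time to move a ~700 MB image over a 100 Mbit/s link versus a 1 Gbit/s link. It assumes decimal megabytes and ignores protocol overhead, latency, and any Swift or BMC throughput limits, so real transfers will be slower. The function name is hypothetical.

```python
def ideal_transfer_seconds(size_megabytes, link_megabits_per_second):
    """Best-case transfer time: size in bits divided by link rate.

    Hypothetical helper; assumes 1 MB = 8 Mbit and a fully used link.
    """
    return size_megabytes * 8 / link_megabits_per_second


if __name__ == "__main__":
    print(ideal_transfer_seconds(700, 100))   # 56.0 seconds at 100 Mbit/s
    print(ideal_transfer_seconds(700, 1000))  # 5.6 seconds at 1 Gbit/s
```

The tenfold gap between the two link rates is why the BMC link speed is worth checking first.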
TheJuliaSandzwerg[m]: the other thing is firmware blobs which get included. Even upstream right now with centos, our compressed image size is like 506MB for just the ramdisk, which uncompresses to like 833MB, and of that ~250MB is just firmware blobs for various hardware. Like... in centos, the Mellanox Spectrum series of images are present in the ramdisk and that alone is 90MB and if anyone is installing centos on switches, I 17:19
TheJuliareally want to meet them.17:19
Sandzwerg[m]Ha, centos on switches would be interesting indeed. On the other hand I was so surprised when I noticed that arista switches are based on an ancient fedora version (I think it was 18 or something?)17:20
opendevreviewDmitry Tantsur proposed openstack/ironic master: Stop short-cutting to local manager with all-in-one processes  https://review.opendev.org/c/openstack/ironic/+/954363 17:21
* TheJulia twitches17:21
TheJulia(that was w/r/t arista)17:21
Sandzwerg[m]Apparently by now it's only veeeery loosely based on that, but they apparently started by forking fedora. It was quite a WTF moment when I found the version file when poking on a switch17:22
Sandzwerg[m]*they17:22
TheJuliawow17:23
TheJuliaI'm not really that surprised actually17:23
TheJuliaA good example is SONiC is based on a fork of debian17:23
Sandzwerg[m]TIL17:27
Sandzwerg[m]https://en.wikipedia.org/wiki/Arista_Networks#Extensible_Operating_System it's even in the wikipedia17:28
TheJuliaTIL!17:29
opendevreviewChris Krelle proposed openstack/ironic master: update Jinja2 to address CVE-2024-2383  https://review.opendev.org/c/openstack/ironic/+/953902 18:36
NobodyCamsorry that took so long to update.18:38
guilhermespthanks TheJulia yeah... in this case im dealing with some lenovo gear here and ended up getting a https://pubs.lenovo.com/sr650/FQXSFPU0023G after enabling secure boot which is not so flexible to automatically switch things... i got this warning there even disabling secure boot and apparently lenovo needs to reset secureboot keys, so yeah kinda of cumbersome 18:58
TheJuliaNobodyCam: no worries18:58
TheJuliaguilhermesp: Sigh, yeah, lenovo.18:59
TheJuliaguilhermesp: I even forgot they will do that. SR650s in particular are... weird.18:59
TheJuliaA bunch of our EFI record management logic is due to the SR650s because you get only *one* write per boot, otherwise it reverts UEFI records/settings to last known good.18:59
cidTalking about taking long to update. TheJulia, stephenfin, I bet the fix to https://zuul.opendev.org/t/openstack/builds?job_name=openstacksdk-functional-devstack-ironic&project=openstack/openstacksdk is likely very tiny. But I have been staring at it for a while now and I don't see how the last patch is failing that test.19:00
TheJuliaI, too, was looking at that error oddly. I was wondering if the error message was odd, and if the logged text was swapped in terms of what was received and expected19:02
cidIt's like it's working in reverse. The field is now optional and tests, you want to fail now?19:03
cidThere's nowhere the loop field is added in the functional test, and it seems the API is returning correctly too.19:06
cidI will just look at it tomorrow.19:07
TheJuliaI think the test is codified in the sdk19:07
TheJuliaand that is part of it19:07
TheJuliaI too, looked at it and was like "what?!" and then realized it was something in the sdk19:08
guilhermespTheJulia: er i figure that would be complicated. All fleet will be composed by those sr650s... but there is just one specific use case which users will need secure boot, so im just thinking of, besides getting a signed OS to deploy, just ask ppl to run redfish commands to enable secure boot one node is ACTIVE and before releasing it, disable secure boot lol not very practical but i guess  it is what it is 19:23
guilhermesps/one node/when a node/ 19:24
TheJuliaiurygregory: much appreciated20:32
* TheJulia blinks20:32
TheJuliawow the log jumped20:32
TheJuliaguilhermesp: well, do the existing knobs not work for setting it by default? Because it sounds like they disable it for PXE, but maybe they don't re-enable it?20:33
TheJuliaDepending on your version, maybe a service step could be whipped up (like whipped cream) to apply it after the fact.20:34
opendevreviewVerification of a change to openstack/ironic master failed: Handle unresponsive BMC during Firmware Updates  https://review.opendev.org/c/openstack/ironic/+/938108 20:45
TheJuliacid: so I sort of got on this subject (and this doesn't need an immediate reply, I'm only mentioning it here because I mentioned it in the email back to yourself and dtantsur). I'm wondering how we might want to load test our concurrency. I'm thinking the easy path may be to spin up a bunch of fake sushy services, something local, so we can explicitly stress a few aspects under "normal" circumstances. I'm 20:58
TheJuliajust not sure of any other good ways. Have you managed to put any time into that subject yet? also, if anyone else has any thoughts in this area, it would be good to get them out into discussion20:58
opendevreviewVerification of a change to openstack/ironic master failed: Handle unresponsive BMC during Firmware Updates  https://review.opendev.org/c/openstack/ironic/+/938108 23:45
