opendevreview | Merged openstack/ironic master: Fix auth_protocol and priv_protocol for SNMP v3 https://review.opendev.org/c/openstack/ironic/+/875323 | 00:00 |
opendevreview | Merged openstack/ironic bugfix/21.3: Wipe Agent Token when cleaning timeout occurs https://review.opendev.org/c/openstack/ironic/+/877402 | 00:00 |
opendevreview | Merged openstack/networking-baremetal master: [CI] Explicitly disable port security https://review.opendev.org/c/openstack/networking-baremetal/+/874939 | 00:22 |
opendevreview | Merged openstack/ironic bugfix/21.3: Clean out agent token even if power is already off https://review.opendev.org/c/openstack/ironic/+/877406 | 00:58 |
opendevreview | Nisha Agarwal proposed openstack/ironic master: Enables boot modes switching with Anaconda deploy for ilo driver https://review.opendev.org/c/openstack/ironic/+/860821 | 04:30 |
opendevreview | Nisha Agarwal proposed openstack/ironic master: Enables boot modes switching with Anaconda deploy for ilo driver https://review.opendev.org/c/openstack/ironic/+/860821 | 06:17 |
rpittau | good morning ironic! happy friday! o/ | 08:09 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Use main branch of metal3-dev-env to run metal3 integration job https://review.opendev.org/c/openstack/ironic/+/877600 | 08:48 |
vanou | JayF: About the vulnerability doc. I think putting 2 things in the doc is enough: if the Ironic community is asked by the owner of an unofficial library, 1) the Ironic community is open and willing to collaborate to solve such rare vulnerabilities 2) the Ironic community is willing to collaborate in a reasonable manner, which means following good practice for handling vulnerabilities (e.g. crafting the fix in private until it is published). | 09:03 |
vanou | What do you think about this? | 09:03 |
vanou | TheJulia: If possible please put comment on https://review.opendev.org/c/openstack/ironic/+/870881 | 09:03 |
kubajj | Good morning Ironic o/ | 09:51 |
iurygregory | morning Ironic, happy friday | 11:33 |
kaloyank | hello ironic o/ | 12:00 |
kaloyank | I'm struggling with rescuing a bare-metal instance. I boot the IPA via a rescue command, but it cannot find the node via the /v1/lookup endpoint using MAC addresses. | 12:02 |
kaloyank | The particular machine had a NIC inserted after it was provisioned. I'm trying to salvage the contents of the local disk via the rescue image but the creation of the rescue user and setting a password fails | 12:03 |
kaloyank | I manually added the ports and their respective MACs to the node, but the IPA still gets a 404 from /v1/lookup | 12:03 |
kaloyank | do I need to run an inspection? | 12:04 |
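For context, the lookup being discussed is roughly the following call, sketched here with an illustrative endpoint and MAC addresses (the real IPA builds this from its collected NIC data):

```python
# A minimal sketch of the lookup IPA performs on boot; the endpoint URL and
# MAC addresses are illustrative assumptions, not values from this log.
import requests

IRONIC_API = 'http://ironic-api.example:6385'
macs = ['52:54:00:12:34:56', '52:54:00:ab:cd:ef']

resp = requests.get(
    IRONIC_API + '/v1/lookup',
    params={'addresses': ','.join(macs)},
    # the lookup endpoint requires API microversion 1.22 or later
    headers={'X-OpenStack-Ironic-API-Version': '1.22'})
print(resp.status_code)  # 404 means no lookup-eligible node matched these MACs
```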
* dtantsur is trying to remember how rescue works | 12:08 |
dtantsur | kaloyank: checked for any logs in ironic API and conductor? | 12:09 |
dtantsur | running inspection should be no different from creating ports manually | 12:10 |
kaloyank | dtantsur: The Ironic API just returns 404 | 12:10 |
kaloyank | so the IPA is stuck in a loop, retrying to look up which node it was booted on | 12:10 |
kaloyank | I tracked it down to this method: https://github.com/openstack/ironic/blob/stable/yoga/ironic/api/controllers/v1/ramdisk.py#L132 | 12:12 |
kaloyank | which calls this one: https://github.com/openstack/ironic/blob/stable/yoga/ironic/db/sqlalchemy/api.py#L1384 | 12:13 |
kaloyank | however, I don't know how I can reassemble the SQL to execute it against the DB and see the result :/ | 12:14 |
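One way to see the SQL is to have SQLAlchemy print the compiled statement. A minimal, self-contained sketch, using simplified stand-ins for ironic's nodes and ports tables rather than the real models:

```python
# Simplified stand-ins for ironic's tables, for illustration only.
import sqlalchemy as sa

metadata = sa.MetaData()
nodes = sa.Table('nodes', metadata,
                 sa.Column('id', sa.Integer, primary_key=True),
                 sa.Column('uuid', sa.String(36)))
ports = sa.Table('ports', metadata,
                 sa.Column('id', sa.Integer, primary_key=True),
                 sa.Column('node_id', sa.Integer, sa.ForeignKey('nodes.id')),
                 sa.Column('address', sa.String(255)))

addresses = ['52:54:00:12:34:56']  # the MACs the ramdisk reported
query = (sa.select(nodes.c.uuid)
         .select_from(nodes.join(ports, ports.c.node_id == nodes.c.id))
         .where(ports.c.address.in_(addresses)))

# literal_binds inlines the parameters so the output can be pasted into a DB shell.
print(query.compile(compile_kwargs={'literal_binds': True}))
```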
dtantsur | https://opendev.org/openstack/ironic/src/branch/master/ironic/db/sqlalchemy/api.py#L1487-L1489 | 12:16 |
kaloyank | dtantsur: I'm running yoga | 12:17 |
dtantsur | kaloyank: is it possible that you have MAC addresses overlapping? | 12:17 |
dtantsur | i.e. some node has a port with a MAC address from the list? | 12:17 |
kaloyank | hmm, I haven't thought of that, checking | 12:17 |
kaloyank | I don't think there's overlapping | 12:18 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Add error logging on lookup failures in the API https://review.opendev.org/c/openstack/ironic/+/877786 | 12:20 |
dtantsur | kaloyank: may help ^^ | 12:20 |
kaloyank | dtantsur: will apply it and see how it goes | 12:22 |
kaloyank | Mar 17 12:38:01 os-controller.lab.storpool.local ironic-api[652843]: 2023-03-17 12:38:01.225 652843 ERROR ironic.api.controllers.v1.ramdisk [req-0d942d09-54dc-4493-b601-cbd5e6be5ece - - - - -] Lookup is not allowed for node 0e227233-b285-40e5-b19a-fa920a06f122 in the provision state rescue | 12:38 |
kaloyank | why so :( | 12:38 |
dtantsur | kaloyank: hmmm, I'd expect the state to be 'rescue wait' | 12:39 |
dtantsur | 'rescue' is the successful stable state | 12:39 |
dtantsur | unless our state machine is doing something crazy.. | 12:39 |
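The error at 12:38 comes from an allow-list check in the lookup controller; a rough sketch of its shape (state names simplified here — the authoritative list lives in ramdisk.py on the branch in question and varies by release):

```python
# Hedged sketch of the allow-list behind the error above; the real code uses
# constants from ironic.common.states rather than these literal strings.
LOOKUP_ALLOWED_STATES = frozenset([
    'deploying', 'deploy wait',
    'cleaning', 'clean wait',
    'rescuing', 'rescue wait',
    'inspecting',
])

def check_lookup_allowed(node):
    if node.provision_state not in LOOKUP_ALLOWED_STATES:
        # 'rescue' (the stable, finished state) is not in the list, hence the 404
        raise LookupError('Lookup is not allowed for node %s in the provision '
                          'state %s' % (node.uuid, node.provision_state))
```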
kaloyank | dtantsur: logs -> https://pastebin.com/vDBRNscE | 12:52 |
opendevreview | Verification of a change to openstack/bifrost master failed: Upgrade from 2023.1 and use Jammy for the upgrade job https://review.opendev.org/c/openstack/bifrost/+/877515 | 12:59 |
dtantsur | kaloyank: oh, okay, so the rescue preparation is actually finished | 13:07 |
dtantsur | then you need to check why the password was not applied | 13:08 |
TheJulia | yeah, the only way to really figure that out is to burn credentials into the ramdisk image and access the running ramdisk directly to poke around | 13:09 |
TheJulia | since we shut the agent down in rescue cases | 13:09 |
TheJulia | and further logs don't get shipped to the conductor | 13:09 |
dtantsur | good morning TheJulia | 13:10 |
TheJulia | good morning! | 13:10 |
dtantsur | I left some comments on the service steps spec, you may not like them :) | 13:10 |
dtantsur | it's possible that I'm missing a lot of context on rescue-vs-service | 13:11 |
TheJulia | does it boil down to a "never going to happen"? | 13:11 |
TheJulia | oh, yeah | 13:11 |
dtantsur | not at all | 13:11 |
TheJulia | I suspected that might be the case at some point | 13:11 |
dtantsur | I'm all for the part that involves service_steps | 13:11 |
dtantsur | I'm very confused by the part that is essentially rescue in a trenchcoat | 13:11 |
TheJulia | heh | 13:11 |
TheJulia | I like how you describe that | 13:11 |
dtantsur | at your service :D | 13:12 |
TheJulia | the tl;dr is that today rescue is an end-user-oriented feature which gets... misused by operators to do $other_nefarious_things, which in part overlap with this functionality. The underlying issue, in relation to kaloyank's problem, is that we stop the agent in rescue and cannot perform additional actions, because we can no longer trust the ramdisk interactions once it is handed over to an end user. Things can get | 13:13 |
TheJulia | extracted from memory, URLs discovered, etc. | 13:13 |
dtantsur | To be fair, anyone with root in the ramdisk can just restart the agent... | 13:14 |
TheJulia | In a sense, the idea is much more an "administrative side of things" sort of function | 13:14 |
TheJulia | but we won't talk to it anymore | 13:14 |
dtantsur | Will it help to update the rescue API with a new flag keep_agent that is not exposed to Nova? | 13:14 |
dtantsur | and possibly covered by a different policy? | 13:15 |
TheJulia | but does rescue appropriately cover "someone is doing things"? | 13:15 |
dtantsur | you mean, the name or the state itself? | 13:15 |
TheJulia | I'm not really sure just adding a new policy covered flag would do it for rescue | 13:15 |
TheJulia | the name of the state itself being an end user oriented state | 13:16 |
dtantsur | The difference between user and admin in case of in-band actions is quite subtle | 13:17 |
TheJulia | true, I was not thinking of granting ssh access in this new state flow | 13:17 |
TheJulia | I was purely thinking "if someone wants to do something with this different ramdisk, then so be it" | 13:18 |
TheJulia | they can sort that part out without an ssh password and flow | 13:18 |
* TheJulia remembers the pain of improving the auth flow for that | 13:18 |
TheJulia | For some reason that auth flow fix and agent token were the two most painful breaks we've ever done in IPA | 13:19 |
dtantsur | I'm trying to avoid 2 things: 1) copy-paste of rescue code in the style of iscsi-vs-direct deploy, 2) ops confusion around "which one do I use and why" | 13:19 |
TheJulia | ... both for good reason | 13:19 |
dtantsur | I also don't quite like that the "service" verb will be used for two very different things (in your spec) | 13:19 |
TheJulia | I think the 2nd is perception- and docs-related. The first I've not prototyped yet, but I don't think it would be rescue-specific code; I only framed it as similar, and maybe we end up with some common helpers around networking | 13:20 |
dtantsur | To me, what you describe is still closer to rescue than to the other scenario you propose (service_steps) | 13:20 |
TheJulia | well, to be fair, it started as modify/modify_steps | 13:20 |
TheJulia | and then the battle over names began :) | 13:20 |
dtantsur | edge case: a bug in automation results in empty service_steps, and the operator is wondering why it never finishes | 13:21 |
dtantsur | (I'm not talking about names here, more about actual semantics) | 13:21 |
TheJulia | (well, to be fair, they are tied together and service was much more a compromise from modify) | 13:22 |
TheJulia | without a list, as I believe I wrote the text today, it would land in the stable state | 13:22 |
dtantsur | Right. Which is going to be confusing. | 13:22 |
TheJulia | and then an operator could just call the verb again with their corrected list | 13:22 |
dtantsur | Unless you apply some GPT4, the automation will not understand that it made a mistake :) | 13:23 |
TheJulia | no ai please. We don't need the conductors deciding to launch drones to remind us to do code review | 13:23 |
TheJulia | :) | 13:23 |
dtantsur | So in my head: "rescue" = hold this machine and give me SSH access, "service state" = hold this machine, "service_steps" = execute these actions, within a ramdisk or out-of-band. | 13:24 |
dtantsur | So for me, the 1st two have more logical overlap | 13:24 |
dtantsur | (do drones speak Redfish?) | 13:24 |
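To make the service_steps part of that breakdown concrete, a purely hypothetical sketch of invoking the proposed verb, with field names taken from the draft spec under discussion (none of this is a merged API, and all of it could change):

```python
# Hypothetical sketch based on the draft service-steps spec; not a merged API.
import requests

node = '0e227233-b285-40e5-b19a-fa920a06f122'
body = {
    'target': 'service',
    'service_steps': [
        # each step names an interface, a step, and its arguments,
        # mirroring how clean steps are expressed today
        {'interface': 'management', 'step': 'update_firmware',
         'args': {'firmware_images': []}},
    ],
}
resp = requests.put(
    'http://ironic-api.example:6385/v1/nodes/%s/states/provision' % node,
    json=body,
    # placeholder microversion: whatever release eventually ships the verb
    headers={'X-OpenStack-Ironic-API-Version': '1.99'})
```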
TheJulia | I guess the conundrum is what if you *need* to do both sorts of things, apply steps and then hold | 13:25 |
TheJulia | and then apply more steps, and possibly hold again | 13:25 |
* dtantsur reads the proposed state machine again | 13:25 |
dtantsur | mmm, right, I missed that transition | 13:26 |
TheJulia | sorry, I'll try to make that a bit more clear | 13:26 |
dtantsur | TheJulia: do you plan to shut down the agent in-between? i.e. is IPA running in "service"? | 13:26 |
TheJulia | I guess I should spell out some of the use cases a bit more clearly too | 13:26 |
TheJulia | no plan to shut the agent down | 13:26 |
dtantsur | TheJulia: nah, I just had a pretty shallow read this morning. Just to make sure I understand it. | 13:26 |
TheJulia | ipa would continue to run awaiting its commands | 13:27 |
dtantsur | A provocative question then: can we make rescue just a service step? | 13:27 |
dtantsur | (I even suggested making rebuild a service step in my comments!) | 13:27 |
TheJulia | I think the idea might end up being "I want to clone /dev/sda" "oh, I need to clone /dev/sdb now" "wait, what, there is a /dev/cciss/c0p0, gah clone it!" | 13:28 |
TheJulia | (rescue a service step... very likely, but not out of the gate please ;) | 13:28 |
TheJulia | ) | 13:28 |
dtantsur | fair | 13:28 |
* TheJulia has a few big picture ideas floating around that intertwine | 13:28 | |
TheJulia | mix service with "oh, I need to update the DPU firmware and then flash a new DPU OS on the flash as well, and then tweak some bios settings before I go back to my previously deployed workload" as a big picture hope | 13:29 |
* jssfr eyes the discussion | 13:30 | |
* TheJulia suspects magic words cause eyebrows to raise | 13:31 | |
TheJulia | ... which is why I've been trying to spend time on the specs before the PTG | 13:31 |
jssfr | I was only idly following along until you spelled out the use cases explicitly and that makes it interesting | 13:31 |
dtantsur | right | 13:31 |
dtantsur | For us in Metal3, upgrading firmware and BIOS settings is of interest. So a simpler case. | 13:32 |
TheJulia | I've had people look at service steps and also go "oh, I could make a snapshot hardware manager and snap my machines!" | 13:32 |
dtantsur | TheJulia: a new section in your spec "Service vs Rescue vs Rebuild" with what we discuss may resolve my objection | 13:32 |
TheJulia | and then also "I need to look for the evil hackers now! run all the security tools!" | 13:32 |
TheJulia | it does open a huge awesome framework door, in the sense of unlocking a lot of the operations that we've avoided due to the whole pets vs cattle debate of the ages | 13:33 |
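The "snapshot hardware manager" idea mentioned above builds on an extension point IPA already has. A hedged sketch of what such a custom hardware manager looks like today, exposing a manual clean step (the class and step names are invented; the snapshot logic is left as a stub):

```python
# Sketch of a custom IPA hardware manager; the names are invented for
# illustration, but the HardwareManager API itself is real.
from ironic_python_agent import hardware

class SnapshotHardwareManager(hardware.HardwareManager):
    HARDWARE_MANAGER_NAME = 'SnapshotHardwareManager'
    HARDWARE_MANAGER_VERSION = '1.0'

    def evaluate_hardware_support(self):
        # claim this hardware more specifically than the generic manager
        return hardware.HardwareSupport.SERVICE_PROVIDER

    def get_clean_steps(self, node, ports):
        return [{'step': 'snapshot_disks',
                 'priority': 0,  # 0 = manual step, only runs when requested
                 'interface': 'deploy',
                 'reboot_requested': False,
                 'abortable': True}]

    def snapshot_disks(self, node, ports):
        pass  # the actual disk-imaging logic would live here
```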
dtantsur | TheJulia: I've found the source of my confusion. In one place in the beginning, you're saying that SERVICING->ACTIVE will happen automatically. | 13:34 |
* TheJulia notes, most actual metal falls under the petcattle moniker | 13:34 | |
dtantsur | without stopping at SERVICE | 13:34 |
TheJulia | okay, so better delineation there. I think I did that after the fact which could definitely lead to a little more confusion | 13:34 |
TheJulia | jssfr: do you happen to have a use case which aligns? | 13:35 |
* TheJulia feels like explicit use case layout is also needed to really frame things | 13:35 | |
jssfr | TheJulia, firmware upgrades are certainly an interesting use case. | 13:35 |
TheJulia | jssfr: what perspective does that come from, and by asking that I mean is that with an end user hat on, an ops hat on behalf of the end user, pure operations/systems management, or security operations? | 13:37 |
dtantsur | speaking of which, how many people will shout at me for proposing a firmware_interface and some explicit APIs? | 13:38 |
dtantsur | I just ran this idea by iurygregory, and I don't think he wants to kill me :D | 13:38 |
TheJulia | dtantsur: most vendors have it bolted on to their management interface already... I guess the question I then have is why does it need to be separate and different from the management interface | 13:39 |
TheJulia | well, most vendors + the redfish driver | 13:39 |
TheJulia | (at least, I *think* it is on management as steps) | 13:39 |
dtantsur | I'm happy to elaborate if we're done with service steps. | 13:39 |
dtantsur | I don't want to deprive jssfr of the possibility to voice the use case. | 13:39 |
TheJulia | dtantsur: ack :) | 13:39 |
* TheJulia wonders where the promised payload of logs from a misbehaving ironic deployment, which she was expecting this morning, is | 13:40 |
* TheJulia suspects that means today might be a quiet day | 13:40 | |
kaloyank | so... am I misusing it, or? | 13:50 |
dtantsur | kaloyank: I think you're doing an okay thing with the instruments you've got. | 13:51 |
dtantsur | the discussed service steps are just a plan | 13:51 |
iurygregory | :D it's a good idea I would say | 13:52 |
kaloyank | I'll try to check this again in a few days | 13:52 |
dtantsur | TheJulia, iurygregory I've put my reasoning at the bottom of https://etherpad.opendev.org/p/ironic-bobcat-ptg | 13:55 |
dtantsur | since I'd like to discuss it at the PTG anyway | 13:55 |
iurygregory | ++ | 13:56 |
TheJulia | speaking of | 14:06 |
TheJulia | Adding keystone cross-project thingie for oauth2 as well | 14:06 |
dtantsur | neat! I assume it will still require keystone to be deployed? or not? | 14:06 |
TheJulia | not at all | 14:07 |
TheJulia | at least, that was the idea floated | 14:07 |
dtantsur | double neat!! | 14:07 |
TheJulia | it is what projects should sort of strive for, actually. | 14:07 |
TheJulia | make the thing completely optional or critical, and let the infrastructure operator decide | 14:08 |
TheJulia | folks, when would we be up for meeting with keystone folks on oauth2?! | 14:10 |
TheJulia | (email on the mailing list this morning asking) | 14:10 |
dtantsur | I'm probably not the right person.. | 14:13 |
TheJulia | I might be, but asking for others | 14:13 |
* dtantsur is trying to get an inspector-less in-band inspection PoC up before the PTG | 14:14 | |
TheJulia | oooh ahh | 14:14 |
TheJulia | I'm not prototyping any of the items right now, although I'm hoping that if I can get ~1 month heads down, most of the ideas I'm working on should be able to move forward | 14:15 |
dtantsur | by the way, would be cool to have your feedback on https://review.opendev.org/c/openstack/ironic/+/877470 and the discussion in the comments | 14:16 |
TheJulia | dtantsur: commented | 14:20 |
TheJulia | I think the time we did that, it was a StrOpt | 14:20 |
TheJulia | but I am fairly sure it will work with a BoolOpt as well :) | 14:21 |
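For reference, mutability in oslo.config is a per-option flag that BoolOpt supports just like StrOpt. A minimal sketch, with an invented option name:

```python
# Minimal oslo.config sketch; the option name is hypothetical, but the
# mutable=True mechanics are the real oslo.config API.
from oslo_config import cfg

opts = [
    cfg.BoolOpt('example_toggle',
                default=False,
                mutable=True,  # picked up on config reload (SIGHUP) without restart
                help='Illustrative boolean option that can change at runtime.'),
]

CONF = cfg.CONF
CONF.register_opts(opts, group='conductor')
```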
TheJulia | emailed the NTT contributor back w/r/t oauth2 | 14:29 |
TheJulia | asking for how long they feel it might take to discuss | 14:30 |
iurygregory | I assume this can go from 30 to 45min | 14:42 |
TheJulia | I was kind of thinking the same, I suspect some might want/need an oauth2 primer | 15:14 |
jssfr | TheJulia, dtantsur, sorry, I went afk. This is not a concrete thing, it just resonated with things we might want to do in the future. | 15:16 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: [WIP] Very basic in-band inspection with a new "agent" interface https://review.opendev.org/c/openstack/ironic/+/877814 | 15:45 |
dtantsur | the promised proof of concept ^^ | 15:45 |
rpittau | bye everyone, have a great weekend, see you on Tuesday! o/ | 15:46 |
opendevreview | Merged openstack/bifrost master: Upgrade from 2023.1 and use Jammy for the upgrade job https://review.opendev.org/c/openstack/bifrost/+/877515 | 15:54 |
arne_wiebalck | we're seeing an issue with IPAB building images, the same issue we saw some 6 months ago (where it went away before we understood the cause): `Problem: cannot install both python3-policycoreutils-2.9-21.1.el8.noarch and python3-policycoreutils-2.9-24.el8.noarch` ... conflicting requirements somewhere, it seems (to the best of our knowledge we did not change anything since yesterday) ... any suggestions where to look? | 16:01 |
dtantsur | can be transient | 16:19 |
arne_wiebalck | wait and see? | 16:21 |
TheJulia | conflicting rpm requirements ? | 16:24 |
arne_wiebalck | yeah, looks like it, but not from us downstream | 16:25 |
arne_wiebalck | I *think* | 16:25 |
TheJulia | are you running your own package pipelines? | 16:28 |
TheJulia | or are you doing a blend, or... ? | 16:28 |
TheJulia | is it a static mirror? | 16:28 |
TheJulia | or maybe the stars aligned for a little bit ? :( | 16:28 |
dtantsur | have a nice weekend folks o/ | 16:34 |
arne_wiebalck | we have IPAB run from a gitlab pipeline, yes | 16:36 |
arne_wiebalck | IPAB is a static mirror which we have only touched when there were issues | 16:36 |
arne_wiebalck | like now :-/ | 16:37 |
arne_wiebalck | we have not updated it in a while | 16:37 |
arne_wiebalck | probably time to do it ... it would have been better if we had done this when I did not need to create an image :) | 16:38 |
arne_wiebalck | anyway, not this week | 16:38 |
arne_wiebalck | have a good week-end everyone o/ | 16:39 |
JayF | > I think you're doing an okay thing with the instruments you've got. | 17:10 |
JayF | kaloyank: you're one in a long line of us who have done the same thing :) | 17:10 |
JayF | kaloyank: almost every feature in Ironic was "made to work with what we had" downstream SOMEWHERE before it went upstream :D | 17:10 |
TheJulia | well, not everything, but some major features yes :) | 17:14 |
JayF | https://review.opendev.org/c/openstack/ironic/+/860821/3 trivial -- had 2x +2 but had to update the release note, so it's just got my vote | 17:28 |
JayF | TheJulia: the entire default deploy method. all of rescue, all of cleaning, arguably all of raid/bios interfaces | 17:28 |
JayF | TheJulia: the list is pretty beefy | 17:28 |
JayF | the anaconda driver too :) | 17:28 |
TheJulia | JayF: bios was entirely upstream developed | 17:29 |
JayF | I take extra joy in meeting random folks who are running stuff that used to take an entire team of engineers to crowbar into Ironic back when I worked at Rackspace :D | 17:29 |
TheJulia | driver composition as well | 17:29 |
JayF | TheJulia: after many people -- me included -- did bios management using the pre-existing tools e.g. hardware managers | 17:29 |
JayF | "made to work with what we had" being the key term there :D | 17:29 |
JayF | in many cases it wasn't pretty | 17:30 |
TheJulia | yeah | 17:30 |
JayF | I bricked 10% of the fleet the first time we did it downstream | 17:30 |
TheJulia | did you spend lots of time with jtags? | 17:30 |
JayF | So there was basically a bios-rescue ability built into the hardware we were using, but we had ... conflicting docs from the vendors on how it worked | 17:30 |
TheJulia | doh | 17:31 |
JayF | so we sent one of our best ops engineers to the datacenter with USB keys and time; they figured out the process (something like a vfat-formatted USB key with UPDATE.BIN on it, case sensitive) | 17:31 |
JayF | so they had to go server-to-server recovering the bios | 17:31 |
TheJulia | yeouch | 17:31 |
JayF | we ended up recovering every machine bricked in that manner | 17:31 |
JayF | so 1 engineer-week and travel expenses for them | 17:31 |
JayF | realistically a very small price to pay to recover a couple hundred machines | 17:31 |
TheJulia | yup | 17:32 |
JayF | that was madasi who went, if you remember him from way back in the day | 17:32 |
JayF | he wasn't much upstream though | 17:32 |
TheJulia | the name is familiar, just don't think I can put a face to it | 17:32 |
JayF | I actually think it's a really good reflection on Ironic that we built it to be so accessible, so it can be used to do what you need to do | 17:33 |
TheJulia | it is the repeating theme | 17:33 |
JayF | Heads up: I'm going to be turning into a pumpkin in very short order. If you need something urgently from me today you should ask soon. | 17:34 |
* JayF has been working or travelling for ~9 days straight | 17:35 |
TheJulia | ack | 17:36 |
TheJulia | I'm semi-pumpkiny myself, going to try and hammer on specs more to get thoughts out | 17:36 |
JayF | yeah, I'm likely going to write up something for a couple of CfPs over the weekend | 17:37 |
JayF | I'm going to put in a talk to upstream.io and All Things Open 2023 (in Raleigh!) | 17:37 |
JayF | basically my "Invisible work in OpenStack" blog series, except s/OpenStack/Open Source/ and finding more examples from around, talking about why doing the quiet work is important, etc | 17:38 |
* JayF thinks we have some flakiness in tests | 17:40 | |
JayF | third time in two days I've seen this failure: | 17:40 |
JayF | > AssertionError: Expected 'update_port_dhcp_opts' to be called once. Called 0 times. | 17:40 |
JayF | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_930/860820/4/check/openstack-tox-py310/93088f0/job-output.txt is the one I'm looking at right now | 17:40 |
TheJulia | oh, we absolutely do | 17:47 |
TheJulia | we likely just need to break that test into a few different classes | 17:47 |
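For anyone chasing the flake, the failing assertion is the standard mock pattern below (an illustrative reconstruction, not the actual ironic test); it fails with "Called 0 times" whenever the code path that should make the call was never triggered, which in flaky runs often points at ordering or setup races rather than the assertion itself:

```python
# Illustrative sketch of the failing assertion style from the log above.
from unittest import mock

class DHCPProvider:
    def update_port_dhcp_opts(self, port_id, opts):
        pass  # a real provider would update the port in neutron

def test_dhcp_opts_updated():
    provider = DHCPProvider()
    with mock.patch.object(provider, 'update_port_dhcp_opts') as mock_update:
        provider.update_port_dhcp_opts(
            'port-1', [{'opt_name': 'bootfile-name', 'opt_value': 'ipxe.efi'}])
        # Raises "Expected 'update_port_dhcp_opts' to be called once.
        # Called 0 times." if the call above never happens.
        mock_update.assert_called_once()
```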
* JayF o/ | 17:55 | |
opendevreview | Julia Kreger proposed openstack/ironic-specs master: WIP: cross-conductor rpc https://review.opendev.org/c/openstack/ironic-specs/+/873662 | 18:36 |