Friday, 2023-03-17

opendevreviewMerged openstack/ironic master: Fix auth_protocol and priv_protocol for SNMP v3  https://review.opendev.org/c/openstack/ironic/+/87532300:00
opendevreviewMerged openstack/ironic bugfix/21.3: Wipe Agent Token when cleaning timeout occcurs  https://review.opendev.org/c/openstack/ironic/+/87740200:00
opendevreviewMerged openstack/networking-baremetal master: [CI] Explicitly disable port security  https://review.opendev.org/c/openstack/networking-baremetal/+/87493900:22
opendevreviewMerged openstack/ironic bugfix/21.3: Clean out agent token even if power is already off  https://review.opendev.org/c/openstack/ironic/+/87740600:58
opendevreviewNisha Agarwal proposed openstack/ironic master: Enables boot modes switching with Anaconda deploy for ilo driver  https://review.opendev.org/c/openstack/ironic/+/86082104:30
opendevreviewNisha Agarwal proposed openstack/ironic master: Enables boot modes switching with Anaconda deploy for ilo driver  https://review.opendev.org/c/openstack/ironic/+/86082106:17
rpittaugood morning ironic! happy friday! o/08:09
opendevreviewRiccardo Pittau proposed openstack/ironic master: Use main branch of metal3-dev-env to run metal3 integration job  https://review.opendev.org/c/openstack/ironic/+/87760008:48
vanouJayF: About the vul doc. I think putting two things in the doc is enough: if the Ironic community is asked by the owner of an unofficial library, 1) the Ironic community is open and willing to collaborate to solve such a rare vul, and 2) the Ironic community is willing to collaborate in a reasonable manner, which means following good practice for handling vulnerabilities (e.g. crafting the fix in private until it is published).09:03
vanouWhat do you think about this?09:03
vanouTheJulia: If possible, please comment on https://review.opendev.org/c/openstack/ironic/+/87088109:03
kubajjGood morning Ironic o/09:51
iurygregorymorning Ironic, happy friday11:33
kaloyankhello ironic o/12:00
kaloyankI'm struggling with rescuing a bare-metal instance. I boot the IPA via a rescue command but it cannot find the node via the /v1/lookup endpoint using MAC addresses. 12:02
kaloyankThe particular machine had a NIC inserted after it was provisioned. I'm trying to salvage the contents of the local disk via the rescue image but the creation of the rescue user and setting a password fails12:03
kaloyankI manually added the ports and their respective MACs to the node but the IPA still gets a 404 from /v1/lookup12:03
kaloyankdo I need to run an inspection?12:04
* dtantsur is trying to remember how rescue works12:08
dtantsurkaloyank: checked for any logs in ironic API and conductor?12:09
dtantsurrunning inspection should be no different from creating ports manually12:10
kaloyankdtantsur: The Ironic API just returns 40412:10
kaloyankso the IPA is stuck in a loop retrying to get which node it was booted on12:10
kaloyankI tracked it down to this method: https://github.com/openstack/ironic/blob/stable/yoga/ironic/api/controllers/v1/ramdisk.py#L13212:12
kaloyankwhich calls this one: https://github.com/openstack/ironic/blob/stable/yoga/ironic/db/sqlalchemy/api.py#L138412:13
kaloyankhowever, I don't know how I can reassemble the SQL to execute it against the DB and see the result :/12:14
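(Rather than reassembling the SQL, the lookup itself is easy to reproduce by hand: it is a plain GET on /v1/lookup with the MAC addresses as a comma-separated query parameter, which is essentially what IPA sends. A minimal sketch, assuming the API endpoint is reachable from where you run it and that the deployment allows the ramdisk lookup without a token; the endpoint URL and MACs below are placeholders.)

    # Rough reproduction of the lookup request IPA makes; values are placeholders.
    import requests

    IRONIC_API = "http://ironic.example:6385"          # placeholder endpoint
    MACS = ["aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02"]  # the node's port MACs

    resp = requests.get(
        f"{IRONIC_API}/v1/lookup",
        params={"addresses": ",".join(MACS)},
        # the ramdisk lookup API appeared in microversion 1.22
        headers={"X-OpenStack-Ironic-API-Version": "1.22"},
        timeout=10,
    )
    print(resp.status_code)
    print(resp.text)  # on a 404, compare the MACs above with the node's ports

(If this also returns 404, the problem is on the ironic side — ports or provision state — rather than in the ramdisk.)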
dtantsurhttps://opendev.org/openstack/ironic/src/branch/master/ironic/db/sqlalchemy/api.py#L1487-L148912:16
kaloyankdtantsur: I'm running yoga12:17
dtantsurkaloyank: is it possible that you have MAC addresses overlapping?12:17
dtantsuri.e. some node has a port with a MAC address from the list?12:17
kaloyankhmm, I haven't thought of that, checking12:17
kaloyankI don't think there's overlapping12:18
opendevreviewDmitry Tantsur proposed openstack/ironic master: Add error logging on lookup failures in the API  https://review.opendev.org/c/openstack/ironic/+/87778612:20
dtantsurkaloyank: may help ^^12:20
kaloyankdtantsur: will apply it and see how it goes12:22
kaloyankMar 17 12:38:01 os-controller.lab.storpool.local ironic-api[652843]: 2023-03-17 12:38:01.225 652843 ERROR ironic.api.controllers.v1.ramdisk [req-0d942d09-54dc-4493-b601-cbd5e6be5ece - - - - -] Lookup is not allowed for node 0e227233-b285-40e5-b19a-fa920a06f122 in the provision state rescue12:38
kaloyankwhy so :(12:38
dtantsurkaloyank: hmmm, I'd expect the state to be 'rescue wait'12:39
dtantsur'rescue' is the successful stable state12:39
dtantsurunless our state machine is doing something crazy..12:39
kaloyankdtantsur: logs -> https://pastebin.com/vDBRNscE12:52
opendevreviewVerification of a change to openstack/bifrost master failed: Upgrade from 2023.1 and use Jammy for the upgrade job  https://review.opendev.org/c/openstack/bifrost/+/87751512:59
dtantsurkaloyank: oh, okay, so the rescue preparation is actually finished13:07
dtantsurthen you need to check why the password was not applied13:08
TheJuliayeah, the only way to really figure that out is to burn credentials into the ramdisk image and access the running ramdisk directly to poke around13:09
TheJuliasince we shut the agent down in rescue cases13:09
TheJuliaand further logs don't get shipped to the conductor13:09
dtantsurgood morning TheJulia 13:10
TheJuliagood morning!13:10
dtantsurI left some comments on the service steps spec, you may not like them :)13:10
dtantsurit's possible that I'm missing a lot of context on rescue-vs-service13:11
TheJuliadoes it boil down to a "never going to happen" ?13:11
TheJuliaoh, yeah13:11
dtantsurnot at all13:11
TheJuliaI suspected that might be the case at some point13:11
dtantsurI'm all for the part that involves service_steps13:11
dtantsurI'm very confused by the part that is essentially rescue in a trenchcoat 13:11
TheJuliaheh13:11
TheJuliaI like how you describe that13:11
dtantsurat your service :D13:12
TheJuliathe tl;dr is that today rescue is an end-user-oriented feature which gets... misused by operators to do $other_nefarious_things, which in part overlap with this functionality. The underlying issue, in relation to kaloyank's issue, is that we stop the agent in rescue and can't do additional actions, because we can no longer trust the ramdisk interactions once it is handed over to an end user. Things can get extracted from memory, URLs discovered, etc.13:13
dtantsurTo be fair, anyone with root in the ramdisk can just restart the agent...13:14
TheJuliaIn a sense, the idea is much more a "administrative side" of things sort of function13:14
TheJuliabut we won't talk to it anymore13:14
dtantsurWill it help to update the rescue API with a new flag keep_agent that is not exposed to Nova?13:14
dtantsurand possibly covered by a different policy?13:15
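(For the mechanics of "covered by a different policy": ironic's API policies are ordinary oslo.policy rules, so a hypothetical keep_agent variant would just register one more rule. Everything below, the flag and the rule name included, is invented purely to illustrate the suggestion.)

    # Illustrative only: a hypothetical oslo.policy rule that could gate a
    # keep_agent-style rescue flag. Neither the flag nor this rule name exists.
    from oslo_policy import policy

    keep_agent_rescue = policy.DocumentedRuleDefault(
        name='baremetal:node:set_provision_state:rescue_keep_agent',  # invented
        check_str='role:admin',
        description='Hypothetical: rescue a node while keeping the agent running.',
        operations=[{'path': '/nodes/{node_ident}/states/provision',
                     'method': 'PUT'}],
    )


    def list_rules():
        # the usual entry-point style used to register policy rules
        return [keep_agent_rescue]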
TheJuliabut does rescue appropriately cover "someone is doing things" ?13:15
dtantsuryou mean, the name or the state itself?13:15
TheJuliaI'm not really sure just adding a new policy covered flag would do it for rescue13:15
TheJuliathe name of the state itself being an end user oriented state13:16
dtantsurThe difference between user and admin in case of in-band actions is quite subtle13:17
TheJuliatrue, I was not thinking of granting ssh access in this new state flow13:17
TheJuliaI was purely thinking "if someone wants to do something with this different ramdisk, then so be it13:18
TheJuliathey can sort that part out without an ssh password and flow13:18
* TheJulia remembers the pain of improving the auth flow for that13:18
TheJuliaFor some reason that auth flow fix and agent token were the two most painful breaks we've ever done in IPA13:19
dtantsurI'm trying to avoid 2 things: 1) copy-paste of rescue code in the style of iscsi-vs-direct deploy, 2) ops confusion around "which one do I use and why"13:19
TheJulia... both for good reason13:19
dtantsurI also don't quite like that the "service" verb will be used for two very different things (in your spec)13:19
TheJuliaI think the 2nd is perception- and docs-related; the first I've not prototyped yet, but I don't think it would be rescue-specific code. I only framed it as similar, and maybe we end up with some common helpers around networking13:20
dtantsurTo me, what you describe is closer to rescue still than to the other scenario you propose (service_steps)13:20
TheJuliawell, to be fair, it started as modify/modify_steps13:20
TheJuliaand then the battle over names began :)13:20
dtantsuredge case: a bug in automation results in empty service_steps, and the operator is wondering why it never finishes13:21
dtantsur(I'm not talking about names here, more about actual semantics)13:21
TheJulia(well, to be fair, they are tied together and service was much more a compromise from modify)13:22
TheJuliawithout a list, as I believe I wrote the text today, it would land in the stable state13:22
dtantsurRight. Which is going to be confusing.13:22
TheJuliaand then an operator could just call the verb again with their corrected list13:22
dtantsurUnless you apply some GPT4, the automation will not understand that it made a mistake :)13:23
TheJuliano ai please. We don't need the conductors deciding to launch drones to remind us to do code review13:23
TheJulia:)13:23
dtantsurSo in my head: "rescue" = hold this machine and give me SSH access, "service state" = hold this machine, "service_steps" = execute these actions, within a ramdisk or out-of-band.13:24
dtantsurSo for me, the 1st two have more logical overlap13:24
dtantsur(do drones speak Redfish?)13:24
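(To make the comparison concrete, this is roughly the request shape the service-steps proposal implies, modeled on today's manual-clean step format. The "service" target, the service_steps field and the whole flow are still just the spec under discussion, so treat this as a sketch of the idea, not an existing API.)

    # Hypothetical illustration of the proposal under discussion: a
    # provision-state change to a "service" target carrying steps shaped
    # like today's clean steps.
    service_request_body = {
        "target": "service",
        "service_steps": [
            {
                "interface": "bios",
                "step": "apply_configuration",
                "args": {"settings": [{"name": "ProcTurboMode",
                                       "value": "Disabled"}]},
            },
        ],
    }
    # Conceptually a PUT to /v1/nodes/<node>/states/provision with this body,
    # after which the node would either return to ACTIVE or hold in SERVICE,
    # depending on how the spec settles the state machine.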
TheJuliaI guess the conundrum is what if you *need* to do both sorts of things, apply steps and then hold13:25
TheJuliaand then apply more steps, and possibly hold again13:25
* dtantsur reads the proposed state machine again13:25
dtantsurmmm, right, I missed that transition13:26
TheJuliasorry, I'll try to make that a bit more clear13:26
dtantsurTheJulia: do you plan to shut down the agent in-between? i.e. is IPA running in "service"?13:26
TheJuliaI guess I should also better spell out some of the use cases a bit more clearly too13:26
TheJuliano plan to shut the agent down13:26
dtantsurTheJulia: nah, I just had a pretty shallow read this morning. Just to make sure I understand it.13:26
TheJuliaipa would continue to run awaiting its commands13:27
dtantsurA provocative question then: can we make rescue just a service step?13:27
dtantsur(I even suggested making rebuild a service step in my comments!)13:27
TheJuliaI think the idea might end up being "I want to clone /dev/sda" "oh, I need to clone /dev/sdb now" "wait, what, there is a /dev/cciss/c0p0, gah clone it!"13:28
TheJulia(rescue a service step... very likely, but not out of the gate please ;)13:28
TheJulia)13:28
dtantsurfair13:28
* TheJulia has a few big picture ideas floating around that intertwine13:28
TheJuliamix service with "oh, I need to update the DPU firmware and then flash a new DPU OS on the flash as well, and then tweak some bios settings before I go back to my previously deployed workload" as a big picture hope13:29
* jssfr eyes the discussion13:30
* TheJulia suspects magic words cause eyebrows to raise13:31
TheJulia... which is why I've been trying to spend time on the specs before the PTG13:31
jssfrI was only idly following along until you spelled out the use cases explicitly and that makes it interesting13:31
dtantsurright13:31
dtantsurFor us in Metal3, upgrading firmware and BIOS settings is of interest. So a simpler case.13:32
TheJuliaI've had people look at service steps and also go "oh, I could make a snapshot hardware manager and snap my machines!"13:32
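(For anyone wondering what a "snapshot hardware manager" would look like: IPA lets deployers ship custom hardware managers that advertise extra steps. A minimal sketch of that pattern follows; the class name and the snapshot_disk step are invented, only the HardwareManager hooks are real IPA API.)

    # Sketch of a custom IPA hardware manager exposing one extra, opt-in step.
    # The manager is registered via the 'ironic_python_agent.hardware_managers'
    # entry point of the package that ships it.
    from ironic_python_agent import hardware


    class ExampleSnapshotHardwareManager(hardware.HardwareManager):
        HARDWARE_MANAGER_NAME = 'ExampleSnapshotHardwareManager'
        HARDWARE_MANAGER_VERSION = '1.0'

        def evaluate_hardware_support(self):
            # Run alongside the generic manager instead of replacing it.
            return hardware.HardwareSupport.SERVICE_PROVIDER

        def get_clean_steps(self, node, ports):
            # priority=0: only runs when explicitly requested (e.g. manual clean).
            return [{
                'step': 'snapshot_disk',
                'priority': 0,
                'interface': 'deploy',
                'reboot_requested': False,
                'abortable': True,
            }]

        def snapshot_disk(self, node, ports):
            # Placeholder: stream the disk contents somewhere safe.
            pass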
dtantsurTheJulia: a new section in your spec "Service vs Rescue vs Rebuild" with what we discuss may resolve my objection13:32
TheJuliaand then also "I need to look for the evil hackers now! run all the security tools!"13:32
TheJuliait does open a huge awesome framework door in the sense of unlocking a lot of the operations that we've avoided due to the whole pets vs cattle debate of the ages13:33
dtantsurTheJulia: I've found the source of my confusion. In one place in the beginning, you're saying that SERVICING->ACTIVE will happen automatically.13:34
* TheJulia notes, most actual metal falls under the petcattle moniker13:34
dtantsurwithout stopping at SERVICE13:34
TheJuliaokay, so better delineation there. I think I did that after the fact which could definitely lead to a little more confusion13:34
TheJuliajssfr: do you happen to have a use case which aligns?13:35
* TheJulia feels like explicit use case layout is also needed to really frame things13:35
jssfrTheJulia, firmware upgrades are certainly an interesting use case.13:35
TheJuliajssfr: what perspective does that come from, and by asking that I mean is that with an end user hat on, an ops hat on behalf of the end user, pure operations/systems management, or security operations?13:37
dtantsurspeaking of which, how many people will shout at me for proposing a firmware_interface and some explicit APIs?13:38
dtantsurI just ran this idea by iurygregory, and I don't think he wants to kill me :D13:38
TheJuliadtantsur: most vendors have it bolted on to their management interface already... I guess the question I then have is why does it need to be separate and different from the management interface13:39
TheJuliawell, most vendors + the redfish driver13:39
TheJulia(at least, I *think* it is on management as steps)13:39
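(As a concrete reference for the "bolted onto the management interface" point: with the redfish driver, firmware updates are already exposed as a management-interface clean step. A rough sketch of the step payload follows; the image URL is a placeholder and the accepted argument keys can vary by release.)

    # Rough sketch of requesting the redfish management interface's firmware
    # update via manual cleaning; the URL below is a placeholder.
    clean_steps = [
        {
            "interface": "management",
            "step": "update_firmware",
            "args": {
                "firmware_images": [
                    {"url": "http://images.example/bmc-firmware-1.2.3.bin"},
                ],
            },
        },
    ]
    # These steps would be handed to a manual clean, e.g.
    # "baremetal node clean <node> --clean-steps <file containing this as JSON>".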
dtantsurI'm happy to elaborate if we're done with service steps.13:39
dtantsurI don't want to deprive jssfr of the possibility to voice the use case.13:39
TheJuliadtantsur: ack :)13:39
* TheJulia wonders where the promised payload of logs from a misbehaving ironic deployment, which she was expecting this morning, is13:40
* TheJulia suspects that means today might be a quiet day13:40
kaloyankso... am I misusing it, or?13:50
dtantsurkaloyank: I think you're doing an okay thing with the instruments you've got.13:51
dtantsurthe discussed service steps are just a plan13:51
iurygregory:D it's a good idea I would say13:52
kaloyankI'll try to check this again in a few days13:52
dtantsurTheJulia, iurygregory  I've put my reasoning in the bottom of https://etherpad.opendev.org/p/ironic-bobcat-ptg13:55
dtantsursince I'd like to discuss it at the PTG anyway13:55
iurygregory++13:56
TheJuliaspeaking of14:06
TheJuliaAdding keystone cross-project thingie for oauth2 as well14:06
dtantsurneat! I assume it will still require keystone to be deployed? or not?14:06
TheJulianot at all14:07
TheJuliaat least, that was the idea floated14:07
dtantsurdouble neat!!14:07
TheJuliait is what projects should sort of strive for, actually.14:07
TheJuliamake the thing completely optional or critical, and let the infrastructure operator decide14:08
TheJuliafolks, when would we be up for meeting with keystone folks on oauth2?!14:10
TheJulia(email on the mailing list this morning asking)14:10
dtantsurI'm probably not the right person..14:13
TheJuliaI might be, but asking for others14:13
* dtantsur is trying to get an inspector-less in-band inspection PoC up before the PTG14:14
TheJuliaoooh ahh14:14
TheJuliaI'm not prototyping any of the items right now, although I'm hoping if I can get ~1 month heads down most of the ideas I'm working on should be able to move forward14:15
dtantsurby the way, would be cool to have your feedback on https://review.opendev.org/c/openstack/ironic/+/877470 and the discussion in the comments14:16
TheJuliadtantsur: commented14:20
TheJuliaI think the time we did that, it was a StrOpt14:20
TheJuliabut I am fairly sure it will work with a BoolOpt as well :)14:21
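(For reference, the mechanism being discussed: oslo.config lets an option keep honouring an older name via deprecated_name / DeprecatedOpt, and that machinery is the same for a BoolOpt as for a StrOpt. A tiny sketch with invented option names:)

    # Minimal oslo.config sketch: a BoolOpt that still accepts an older option
    # name. All option and group names here are invented for illustration.
    from oslo_config import cfg

    opts = [
        cfg.BoolOpt('verify_thing',
                    default=True,
                    deprecated_name='thing_verification',
                    help='Illustrative boolean option replacing an older, '
                         'differently named one.'),
    ]

    CONF = cfg.CONF
    CONF.register_opts(opts, group='agent')

    if __name__ == '__main__':
        CONF([], project='example')
        print(CONF.agent.verify_thing)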
TheJuliaemailed the NTT contributor back w/r/t oauth214:29
TheJuliaasking for how long they feel it might take to discuss14:30
iurygregoryI assume this can go from 30 to 45min 14:42
TheJuliaI was kind of thinking the same, I suspect some might want/need an oauth2 primer15:14
jssfrTheJulia, dtantsur, sorry, I went afk. This is not a concrete thing, it just resonated with things we might want to do in the future.15:16
opendevreviewDmitry Tantsur proposed openstack/ironic master: [WIP] Very basic in-band inspection with a new "agent" interface  https://review.opendev.org/c/openstack/ironic/+/87781415:45
dtantsurthe promised proof of concept ^^15:45
rpittaubye everyone, have a great weekend, see you on Tuesday! o/15:46
opendevreviewMerged openstack/bifrost master: Upgrade from 2023.1 and use Jammy for the upgrade job  https://review.opendev.org/c/openstack/bifrost/+/87751515:54
arne_wiebalckwe see some issue with IPAB building images, same issue we saw some 6 months ago (where it went away before we understood the cause): `Problem: cannot install both python3-policycoreutils-2.9-21.1.el8.noarch and python3-policycoreutils-2.9-24.el8.noarch` ... conflicting requirements somewhere it seems (to the best of our knowledge we did not change anything since yesterday) ... any suggestions where to look?16:01
dtantsurcan be transient16:19
arne_wiebalckwait and see?16:21
TheJuliaconflicting rpm requirements ?16:24
arne_wiebalckyeah, looks like it, but not from us downstream16:25
arne_wiebalckI *think*16:25
TheJuliaare you running your own package pipelines?16:28
TheJuliaor are you doing a blend, or... ?16:28
TheJuliais it a static mirror?16:28
TheJuliaor maybe the stars aligned for a little bit ? :(16:28
dtantsurhave a nice weekend folks o/16:34
arne_wiebalckwe have IPAB run from a gitlab pipeline, yes16:36
arne_wiebalckIPAB is a static mirror which we have only touched when there were issues16:36
arne_wiebalcklike now :-/16:37
arne_wiebalckwe have not updated it in a while16:37
arne_wiebalckprobably time to do it ... would have been better if we did this when I did not need to create an image :)16:38
arne_wiebalckanyway, not this week16:38
arne_wiebalckhave a good week-end everyone o/16:39
JayF> I think you're doing an okay thing with the instruments you've got. 17:10
JayFkaloyank: you're one in a long line of us who have done the same thing :)17:10
JayFkaloyank: almost every feature in Ironic was "made to work with what we had" downstream SOMEWHERE before it went upstream :D17:10
TheJuliawell, not everything, but some major features yes :)17:14
JayFhttps://review.opendev.org/c/openstack/ironic/+/860821/3 trivial -- had 2x+2 but had to update the release note so it's just got my vote17:28
JayFTheJulia: the entire default deploy method. all of rescue, all of cleaning, arguably all of raid/bios interfaces17:28
JayFTheJulia: the list is pretty beefy17:28
JayFthe anaconda driver too :) 17:28
TheJuliaJayF: bios was entirely upstream developed17:29
JayFI take extra joy in meeting random folks who are running stuff that used to take an entire team of engineers to crowbar into Ironic back when I worked at Rackspace :D 17:29
TheJuliadriver composition as well17:29
JayFTheJulia: after many people -- me included -- did bios management using the pre-existing tools e.g. hardware managers17:29
JayF"made to work with what we had" being the key term there :D 17:29
JayFin many cases it wasn't pretty17:30
TheJuliayeah17:30
JayFI bricked 10% of the fleet the first time we did it downstream 17:30
TheJuliadid you spend lots of time with jtags?17:30
JayFSo there was basically a bios-rescue ability built into the hardware we were using, but we had ... conflicting docs from the vendors on how it worked17:30
TheJuliadoh17:31
JayFso we sent one of our best ops engineers to the datacenter with USB keys and time, they figured out the process (something like a USB key with vfat with UPDATE.BIN on it or something like that, case sensitive)17:31
JayFso they had to go server-to-server recovering the bios17:31
TheJuliayeouch17:31
JayFwe ended up recovering every machine bricked in that manner17:31
JayFso 1 engineer-week and travel expenses for them17:31
JayFrealistically a very small price to pay to recover a couple hundred machines17:31
TheJuliayuyp17:32
JayFthat was madasi who went, if you remember him from way back in the day17:32
JayFhe wasn't much upstream though17:32
TheJuliathe name is familiar, just don't think I can put a face to it17:32
JayFI actually think it's a really good reflection on Ironic that we built it to be so ... accessible, able to be used to do what you need to do17:33
TheJuliait is the repeating theme17:33
JayFHeads up: I'm going to be turning into a pumpkin in very short order. If you need something urgently from me today you should ask soon.17:34
* JayF has been working or travelling for ~9 days straight17:35
TheJuliaack17:36
TheJuliaI'm semi-pumpkiny myself, going to try and hammer on specs more to get thoughts out17:36
JayFyeah, I'm likely going to write up something for a couple of CfPs over the weekend17:37
JayFI'm going to put in a talk to upstream.io and All Things Open 2023 (in Raleigh!)17:37
JayFbasically my "Invisible work in OpenStack" blog series, except s/OpenStack/Open Source/ and finding more examples from around, talking about why doing the quiet work is important, etc17:38
* JayF thinks we have some flakiness in tests17:40
JayFthird time in two days I've seen this failure:17:40
JayF> AssertionError: Expected 'update_port_dhcp_opts' to be called once. Called 0 times.17:40
JayFhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_930/860820/4/check/openstack-tox-py310/93088f0/job-output.txt is the one I'm looking at right now17:40
TheJuliaoh, we absolutely do17:47
TheJuliawe likely just need to break that test into a few different classes17:47
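(For readers who haven't seen that failure mode: the message comes from the standard unittest.mock assertion style used throughout ironic's unit tests. A self-contained toy version of the pattern, not the actual ironic test, looks like this.)

    # Toy illustration of the assertion behind "Expected ... to be called once.
    # Called 0 times." -- not ironic's real test code.
    import unittest
    from unittest import mock


    class FakeDHCPApi:
        def update_port_dhcp_opts(self, port_id, dhcp_options):
            pass


    class ExampleTest(unittest.TestCase):
        @mock.patch.object(FakeDHCPApi, 'update_port_dhcp_opts', autospec=True)
        def test_update_called(self, mock_update):
            FakeDHCPApi().update_port_dhcp_opts('port-1',
                                                [{'opt_name': 'bootfile-name'}])
            # In the flaky job this assertion intermittently sees zero calls,
            # typically because the code path short-circuits or shares state
            # with another test in the same class.
            mock_update.assert_called_once()


    if __name__ == '__main__':
        unittest.main()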
* JayF o/17:55
opendevreviewJulia Kreger proposed openstack/ironic-specs master: WIP: cross-conductor rpc  https://review.opendev.org/c/openstack/ironic-specs/+/87366218:36
