Monday, 2024-05-20

*** srelf_ is now known as Continuity09:33
iurygregorygood morning Ironic11:07
mohammed_good morning / afternoon, where can I add a topic to today's weekly agenda? 11:25
iurygregoryhey mohammed_ here is the link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting 11:27
mohammed_Thanks ! I was there but I could not quickly find where to add inputs I will read it again 11:43
adam-metal3mohammed, if you are logged in, there are small [edit] buttons and if you click on those then you can edit the relevant section12:27
iurygregory^ this, if you can add let me know and I can update12:46
TheJuliagood morning13:07
iurygregorygood morning TheJulia o/13:15
iurygregorydo you want me to run the meeting today? just double checking :D13:15
opendevreviewJulia Kreger proposed openstack/ironic bugfix/22.1: Remove SQLAlchemy tips jobs  https://review.opendev.org/c/openstack/ironic/+/919901 13:18
TheJuliaI guess I can13:18
iurygregoryack13:22
JayFIf there's an ironic contributor with some free time this week, I'd like to give my BM SIG talk a run through with someone who knows Ironic well14:02
JayFif anyone is available/interested let me know14:02
TheJuliaI might be available, tomorrow maybe14:09
iurygregorydoes anyone remember the redfish bug with HPE where sometimes it failed to power on nodes after cleaning?14:15
iurygregoryseems like I'm facing the same issue again =( 14:16
iurygregoryJayF, happy to join depending on the day and time..14:16
JayFI mean, you tell me a working day/time :) 14:16
JayFyou'd be the one doing me a favor14:17
iurygregorytomorrow afternoon or wed afternoon ie after 15 UTC =) 14:18
iurygregoryfriday I'm available also 14:18
JayFI can do 1500 UTC tomorrow 14:19
TheJuliaiurygregory: is it the one where we try to power the machine back on and it fails to power on because it is still booting?14:20
TheJuliaI'm not going to be around on Friday14:20
TheJuliatomorrow would likely be easiest for me14:20
opendevreviewKaifeng Wang proposed openstack/python-ironicclient master: Support traits configuration on baremetal create CLI  https://review.opendev.org/c/openstack/python-ironicclient/+/919432 14:21
JayFTheJulia: 1500 utc / 8am PDT work for you?14:21
TheJulia9:30 AM UTC would work14:22
TheJuliaerr14:22
TheJulia9:30 AM US Pacifici14:22
TheJuliapacific14:22
* TheJulia tosses the keyboard14:22
JayF9:30a-10:30a (1630-1730 UTC) works for me tomorrow, too14:22
iurygregoryTheJulia, humm not sure, it's via OCP https://paste.opendev.org/show/b6DXtU8HM9dUyduwS3mV/ this is the error in the ironic logs after the firmware update was executed (successfully..) 14:25
TheJuliawe likely need to do a couple different things there14:29
iurygregoryyeah =(14:29
iurygregorythe update worked fine, ironic has the new information in the DB14:29
TheJuliaStarting out, the 120 seconds limit on the post reboot wait is likely *way* too short14:29
iurygregoryI had the wait param as option to increase the time14:30
iurygregorybut we didn't add that in metal3 =X14:30
iurygregoryI think there is a config option that we can change also right?14:31
TheJuliaso we likely need to wait/check before claiming we're done with the firmware upgrade14:31
TheJuliaiurygregory: do we know where "Dynamic backoff interval looping call 'ironic.conductor.utils.node_wait_for_power_state.<locals>._wait'" is being triggered at ?14:34
iurygregorylet me look at the code i did for updates 14:34
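As an aside, the "Dynamic backoff interval looping call" named in that log line is an exponential-backoff poll. A simplified, hypothetical Python sketch of what such a wait does (a stand-in with assumed defaults, not ironic's actual `node_wait_for_power_state`):

```python
def wait_for_power_state(get_power_state, target, timeout=120.0,
                         initial=1.0, backoff=2.0, max_interval=30.0,
                         sleep=lambda s: None):
    """Poll get_power_state() with exponentially growing intervals until
    `target` is seen or the cumulative wait exceeds `timeout`.

    Simplified stand-in for ironic's dynamic backoff looping call; the
    120s default mirrors the limit discussed in this log. `sleep` is
    injectable so the loop can be exercised without real delays.
    """
    waited = 0.0
    interval = initial
    while True:
        if get_power_state() == target:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
        interval = min(interval * backoff, max_interval)
```

With a 120s cap and firmware updates that keep the BMC busy for minutes, the loop gives up before the node ever reports power on, which matches the failure described here.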
TheJuliaI think: tunable longer run time for the post upgrade wait. ... Apparently Oneview waits 45 minutes by *default*14:37
iurygregoryjesus O.o14:38
TheJuliawell, I was about to say I think we need something else14:38
TheJuliabut it looks like the step itself fails because of the inability to power back up14:38
iurygregoryyeah14:38
TheJulia 2024-05-20 12:57:57.885 1 ERROR ironic.conductor.utils [None req-d65c5bdf-1567-45d9-81df-7c7d6701d5ef - - - - - -] Node f497fb20-6852-45fe-a21f-8afc8782eed9 failed step14:39
iurygregoryso, reboot_to_finish_step receives the timeout14:39
iurygregoryin deploy_utils from what i remember14:39
TheJuliaI think we need a bigger timeout14:39
TheJuliahttps://imgflip.com/i/8qnszp 14:40
mohammed_iurygregory can you add this to the agenda for me: Seek review for https://review.opendev.org/c/openstack/sushy-tools/+/875366 14:41
iurygregoryTheJulia, yup!14:43
iurygregoryTheJulia, CONF.conductor.power_state_change_timeout this is the config with default value we use? 14:43
TheJuliaso for firmware upgrades, we likely need that to be... at *least* 5 minutes14:44
iurygregoryI think that's how I was using (by sending wait 300)14:44
iurygregorybut the CRD doesn't have the concept of the wait ...14:45
TheJuliaits only waiting 120 though14:45
TheJuliawell, post firmware upgrades, we always need to wait14:45
TheJulialonger14:45
TheJuliabecause it has to complete them14:45
iurygregoryshall we consider sending 300 by default if no value was set for wait? <thinking..>14:46
TheJuliafor all state changes, I would doubt it14:47
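The knob being discussed above, CONF.conductor.power_state_change_timeout, maps to an ironic.conf override. A sketch of a deployment-side bump (the 300s value mirrors the wait=300 mentioned above; verify the section/option against your ironic release before relying on it):

```ini
[conductor]
# Give firmware-updated nodes longer to come back online after reboot;
# 300s matches the wait=300 value discussed above. The stock default is
# much shorter than firmware-update reboots typically need.
power_state_change_timeout = 300
```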
TheJuliaI must have udp packet loss this morning, dns lookups keep randomly failing14:47
iurygregoryouch >.<14:48
TheJuliamohammed_: what portion of the meeting would you like that item brought up during?14:48
iurygregoryI was about to add in open discussion 14:48
TheJuliathat works14:49
JayFmohammed_: I'm generally curious what the use case is? I always viewed sushy-tools as being about helping Ironic fake VMs, if we stub even further through doesn't it remove some of the value?14:49
JayFmohammed_: I've avoided putting this review on your change because I don't wanna stop it if there's a good use case (sushy-tools is not high on the list of things I'm super concerned about) but I'm curious generally what the goal is14:50
TheJuliamohammed_: So I have the exact same question as JayF, fwiw.14:51
TheJulia#startmeeting ironic15:00
opendevmeetMeeting started Mon May 20 15:00:03 2024 UTC and is due to finish in 60 minutes.  The chair is TheJulia. Information about MeetBot at http://wiki.debian.org/MeetBot. 15:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.15:00
opendevmeetThe meeting name has been set to 'ironic'15:00
TheJuliao/15:00
mohammed_o/15:00
TheJuliaGood morning everyone in (UGT) Universal Greeting Time!15:00
TheJuliaToday's ironic meeting can be found on the wiki15:00
JayFo/15:01
TheJulia#link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_May_20.2C_2024 15:01
TheJulia... do we have quorum today?15:02
* TheJulia puts a cup of coffee in front of iurygregory 15:02
TheJulia:)15:02
iurygregoryo/15:02
* iurygregory was looking at logs and lost track of time15:02
TheJulia#topic Announcements / Reminders15:02
TheJuliaYour weekly standing reminder to review patches tagged with ironic-week-prio15:03
TheJulia#link https://tinyurl.com/ironic-weekly-prio-dash 15:03
TheJuliaThe work items for 2024.2 have been merged.15:03
TheJulia#link https://review.opendev.org/c/openstack/ironic-specs/+/916295 15:03
TheJuliaAnd last, but not least, the release schedule for 2024.215:03
TheJulia#link https://releases.openstack.org/dalmatian/schedule.html 15:03
TheJuliaThis does mean this week is R-1915:04
TheJulia#topic Review Ironic CI Status15:04
TheJuliaso, how is CI? I've noticed some more tempest test unhappiness which I'm working on trying to fix15:04
TheJulia#link https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/919762 15:05
iurygregoryinteresting ^15:06
TheJuliaYeah, i think part of it is the whole "oh, there is no way we're doing a whole OS on this cloud provider" logic path15:06
TheJuliaand our continued reliance on tinycore is sort of... biting us15:06
TheJuliaLike an unhappy cat.15:06
iurygregory++15:06
TheJuliaAnyway, moving on!15:06
iurygregorywill keep on my radar15:07
TheJuliaIt looks like a timeout, fwiw15:07
TheJuliabut you actually have to look at the ipxe text15:07
TheJuliaMoving past Discussion topics, to Bug Deputy Updates15:07
TheJulia#topic Bug Deputy15:07
TheJuliaLast week we got down to 137 open bugs against ironic itself. A number of them are wishlist items, we should really take a look at open in progress items because I found a few which were already fixed last week.15:08
iurygregoryI will be the deputy this week o/15:08
TheJulia#action iurygregory to be the Bug Deputy for this week.15:08
iurygregorygreat job TheJulia 15:08
TheJuliaSince we have no RFE topics, we will move to Open Discussion.15:09
TheJulia#topic Open Discussion15:09
JayFI have an item for this, when mohammed_ is done15:09
TheJulia\o/ as do I!15:09
TheJuliamohammed_ is seeking reviews of https://review.opendev.org/c/openstack/sushy-tools/+/875366 15:09
mohammed_I would like to take this forward; it seems to have been hanging there for a long time. I want to know if something is blocking it. I have already got +1 from dmitry 15:09
JayFI'm generally curious what the use case is? I always viewed sushy-tools as being about helping Ironic fake VMs, if we stub even further through doesn't it remove some of the value?15:09
JayFI've avoided putting this review on your change because I don't wanna stop it if there's a good use case (sushy-tools is not high on the list of things I'm super concerned about) but I'm curious generally what the goal is15:09
TheJuliaI think a better understanding of the goal will help reviewers review it, to be honest15:10
JayF++15:10
mohammed_we mainly need it for testing ironic scalability, but I'm open to discussing the need for such fake IPAs in sushy-tools on the changes :) 15:10
JayFWhat does IPA have to do with sushy?15:10
JayFI guess is my root problem15:11
JayFsushy is a library for redfish access; sushy-tools uses that library to present fake VMs15:11
JayFwhy not fake IPA in virtualbmc? or virtualpdu?15:11
JayFthere's a base level misalignment that I can't figure out15:11
mohammed_sushy-tools has a fake system faking the VM, and ironic expects an IPA agent to be running on these VMs; that is what we are trying to add there15:11
mohammed_Fake driver ^15:12
TheJuliais the intention or modeling at present such that triggering a fake run causes the fake IPA to fire up?15:13
TheJuliaI guess that sort of makes sense then15:14
mohammed_the current model has the fake IPA and sushy-tools running in different containers 15:14
TheJuliaokay15:14
TheJuliaSo my final question, will results be published at all?15:14
TheJulia:)15:14
mohammed_Did not get you :/ 15:14
JayFI'm still confused enough about the whole end-to-end I doubt I'll review that change, but honestly sushy{,-tools} is not my strongest project anyway15:14
JayFmohammed_: I think TheJulia is saying, if you land this and do performance testing, share the results with us :D 15:15
TheJuliaSo you're doing the development work to bench test Ironic; are those results available for review?15:15
mohammed_we are syncing together on these on metal3 side with dmitry 15:15
TheJuliaOkay, cool15:16
TheJuliaJayF: You had an open discussion item?15:16
JayFYes15:16
JayFJune 5th is the BM SIG @ CERN.15:16
JayFIt's been noted that we can enable remote participation if desired15:16
TheJuliaThat would be nice15:17
iurygregory:O15:17
iurygregory++15:17
JayFThen we need to discuss time. Right now the proposed start time is 10am, which is 1am pdt15:17
JayFobviously not ideal but we can change it; I'm a little nervous about doing it because we already have some published15:17
JayFbut I didn't want to make the decision in a vacuum15:18
JayFI doubt that there's a time we could pick that would be 1) close to the original published time and 2) not terrible for PDT15:18
JayFwe list an 8am start right now on the webpage15:18
JayF#link https://indico.cern.ch/event/1378171/ 15:19
TheJuliaJayF: I will do my best to join in remotely based upon the local schedule, start based upon the people there 15:19
JayFyeah, that's what I was thinking -- which would be a 10am start for presentations, I think15:19
TheJulia8 AM is a bit early if one needs to train in and all15:19
TheJuliayeah, that seems reasonable15:19
iurygregoryI think it would be complicated to change the time now, I had no idea  it could have remote participation, I will just try to join 15:20
JayFI'll note the agenda currently includes an introduction to metal3, an ~45-60m talk by me on "ironic features" 15:20
JayFand then hackathon/discussion afterwards15:20
TheJuliaSounds good15:20
JayFthe goal is to hopefully inspire lots of "how does feature X work?" conversations after the presentations :D 15:20
JayFthe other, related item for OD for me15:21
JayFI will be in Europe next week and the week after. I'll be very rarely on IRC as a result to focus on my in-person team in UK and all15:21
JayFSo if you need me, email/sms/etc15:21
TheJuliaJayF: cool cool, enjoy15:21
TheJuliaso my one last item!15:21
TheJuliaihrachys has raised some questions/comments on https://review.opendev.org/c/openstack/ironic-specs/+/916126. JayF and I have both responded. If there are any more thoughts/responses please try and post it in the next day or so and I'll rev the spec.15:22
TheJuliaAnything else today?15:22
TheJuliaI'll note next Monday is a holiday in the US.15:23
JayFPlease review the self-serve runbooks spec :D 15:23
TheJulia++15:23
iurygregoryack o/15:23
JayFthat is on hold pending community review right now15:23
iurygregorythis week I have more "free" time15:24
iurygregoryso I will double check it15:24
TheJuliawhat is "free" time ;)15:24
iurygregorya time I will not spend testing firmware update or on escalation :D15:24
* JayF hears a new jira ticket being filed15:24
JayFrun iurygregory!15:24
iurygregorywell I got an escalation from a bug created today15:25
iurygregoryI'm not surprised because it is Monday15:25
TheJuliain accordance with the prophecy15:25
iurygregoryyup!15:25
TheJuliaiurygregory: if you need another brain, I can defer some stuff I was planning on doing this morning until tomorrow15:26
TheJuliaanyway! Thanks everyone!15:26
TheJuliaHave a great week!15:26
mohammed_o/ 15:26
TheJulia#endmeeting15:26
opendevmeetMeeting ended Mon May 20 15:26:35 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)15:26
opendevmeetMinutes:        https://meetings.opendev.org/meetings/ironic/2024/ironic.2024-05-20-15.00.html 15:26
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/ironic/2024/ironic.2024-05-20-15.00.txt 15:26
opendevmeetLog:            https://meetings.opendev.org/meetings/ironic/2024/ironic.2024-05-20-15.00.log.html 15:26
iurygregoryo/ tks15:26
* iurygregory is happy, he found the 120sec config \o/ https://github.com/openshift/ironic-image/blob/master/ironic-config/ironic.conf.j2#L98 15:26
iurygregoryok, so either I will increase this downstream again, or try to add something in the ironic code .-.15:27
opendevreviewEbbex proposed openstack/bifrost master: Consolidate centos/fedora/redhat required_defaults  https://review.opendev.org/c/openstack/bifrost/+/888447 15:27
JayF120s timeout is very low15:27
cidquick question, the weekly meetings, is it happening solely on the IRC chat? Or there's another side to it.15:28
JayFbut you can bump it to find a better default15:28
JayFcid: it's solely an IRC thing for most projects :)15:28
JayFcid: basically just an opportunity to ensure many core contributors are in the same place for an hour to help make decisions15:28
TheJuliaso 120 seconds makes semi-sense for some of the hardware I've been seeing15:28
iurygregoryJayF, it was ok before since we didn't have to deal with the firmware update XD15:28
TheJuliaI think ironic is 30 seconds today, Ironic itself likely needs the value increased15:28
cidJayF: Got it.15:28
TheJulia.... I think we have a bug about specific configurations on supermicro gear taking 5 minutes15:29
iurygregoryI think this 120 was because of supermicro...15:29
TheJuliayeouch, no wonder I'm having issues, I'm seeing 4% packet loss15:29
TheJuliawind variable but some of it is also 2.4ghz band noise15:32
* iurygregory increased to 360.. fingers crossed15:33
TheJuliadoes it make sense to consider raising the default in ironic?15:34
JayFiurygregory: TheJulia: rewinding about an hour: tomorrow; 1630-1730 aka 9:30-1030am PDT for a run thru of that talk?15:34
TheJuliaworks for me15:34
iurygregoryworks for me15:35
JayFinvites going out15:35
iurygregorytks!15:35
iurygregorygoing to grab lunch to have more energy to debug things...15:35
opendevreviewJulia Kreger proposed openstack/ironic-tempest-plugin master: Exclude ramdisk tests with tinycore in uefi mode  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/919762 15:52
TheJuliaThat should fix it, I got the structure wrong15:52
cidAny additional suggestions/ideas on this: https://review.opendev.org/c/openstack/ironic/+/917229 16:12
TheJuliacid: I just left a comment on that16:20
TheJuliaYeah, you don't want to try and modify the runtime environment16:24
TheJuliabut I proposed a solution16:24
cidYeah, I will go ahead and do that.16:25
opendevreviewDmitry Tantsur proposed openstack/bifrost master: [WIP] Experimental support for ARM64 hosts on Debian  https://review.opendev.org/c/openstack/bifrost/+/920057 16:25
cidAlso, Since two reviewers have suggested making the environment approach default with no objections, I will do that too, with the option for temp file 16:26
TheJuliayeah, there really is not a huge difference, the fundamental issue is a lack of understanding around the mechanism properties of the temp file, but *shrugs*16:27
JayFI am a little -0 to changing the default16:29
JayFjust because the value of changing it, and risking a break on one of our older drivers16:29
JayFmakes me a little nervous16:29
cidRight! that change will touch on many existing "stuff"16:30
dtantsurIf we're worried about this change, do we even understand it?16:31
cidI'm thinking, ...16:31
cidwhat's the point of the change if there's no "added" benefit 16:32
* dtantsur installs bifrost on raspberry pi16:32
JayFdtantsur: I'm not worried about this change in general, but when it's a minor trade off between two different fairly secure ways of doing things, I would default to "don't churn the default"16:33
JayFcid: well, it's choosing one set of tradeoffs over another. For most people, a file on disk is secure and trustable, some people have different security requirements. 16:33
dtantsurAny time we add an option, we should think how to explain operators which option to choose16:33
JayF"choose this option if your security person freaks out over passwords in a file on disk and won't listen to reason"16:34
TheJulia"in a file, umasked such that it is only for the running user16:34
cidif you say it like that :D16:34
TheJulia"16:34
opendevreviewJulia Kreger proposed openstack/ironic master: Handle Power On/Off for child node cases  https://review.opendev.org/c/openstack/ironic/+/896570 16:34
TheJulia"and the security person has never heard of /proc/<pid>/environ :)16:34
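TheJulia's point can be demonstrated directly: on Linux, a process's initial environment is readable from /proc/&lt;pid&gt;/environ as NUL-separated KEY=VALUE entries, and the kernel restricts access to the owning user and root. A small sketch (the helper name is ours, not from any library):

```python
def read_environ(pid='self'):
    """Parse /proc/<pid>/environ (Linux) into a dict.

    Entries are NUL-separated KEY=VALUE pairs reflecting the environment
    the process was started with. The kernel only lets the owning user
    (and root) read this file, which is the security property discussed
    above -- but it means env vars are no more secret than a 0600 file.
    """
    with open(f'/proc/{pid}/environ', 'rb') as f:
        raw = f.read()
    pairs = (e.split(b'=', 1) for e in raw.split(b'\x00') if b'=' in e)
    return {k.decode(): v.decode(errors='replace') for k, v in pairs}
```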
JayFyes, exactly16:35
JayFthis is a workaround for a bad security requirements16:35
JayFwhich usually I'd bristle against ... except I've been in that persons' shoes, so I wanna give them the option16:35
dtantsurYou're not making me more convinced to merge this patch16:35
dtantsurbut I guess as long as it's an ironic.conf, not driver_info, option, I can live with that..16:35
TheJuliasame basic issue, the only advantage for environ is the kernel enforce it so only root and the user can see it16:35
JayFI mean, it's the truth of the situation as described16:35
JayFto be clear: gr-oss isn't even the downstream for this change16:36
JayFit was an operator bug that cid picked up to help pare down the backlog16:36
TheJuliait's a "good deed" change16:36
JayFthat's such a good way of describing it16:36
TheJuliaI get it, and I think it makes sense, there is value in doing those sorts of things16:36
JayFwe're making someone elses' life easier, removing a barrier to them running ironic, because we can16:36
TheJuliaexactly16:37
JayF"we" meaning c.i.d. in this case really lol16:37
cidlol16:37
JayFif it was API surface, or in a driver that wasn't already on the far side of its lifecycle, I might care enough to have a philosophical argument about it16:37
JayFmoving the setting outta driver info is probably the right call for ensuring 100% it's not api surface16:38
JayFand will die with ipmi driver16:38
dtantsur++16:38
dtantsur"libvirtError: unsupported configuration: BIOS serial console only supported on x86 architectures" oh well16:38
cid^^ sounds so familiar :D16:39
TheJuliasuper super super familiar16:39
dtantsurcid: I need to start borrowing hacks from your devstack-on-arm patch16:39
dtantsuryeah :D16:39
cidlol, I mean, I have conquered this phase, it would seem.16:41
cidSo, yea16:41
JayFdid you get to the point of any output in the log files?16:41
JayFlast I saw it was creating them but they were blank16:41
TheJulia... or capturing the output16:41
cidI have outputs now, that's the only good news16:42
dtantsur:D16:42
cidI have non-empty outputs :D16:42
opendevreviewDmitry Tantsur proposed openstack/bifrost master: [WIP] Experimental support for ARM64 hosts on Debian  https://review.opendev.org/c/openstack/bifrost/+/920057 16:42
cidhttps://zuul.opendev.org/t/openstack/build/5a2a3cbd254747b784f866aa2244e923/log/controller/logs/ironic-bm-logs/node-1_console_2024-05-16-17:53:19_log.txt 16:42
JayFso your ipxe rom -- snponly.efi -- likely doesn't work on arm16:43
JayF> BdsDxe: failed to load Boot0002 "UEFI PXEv4 (MAC:5254005DBB88)" from PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(5254005DBB88,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Unsupported16:43
dtantsurat least Debian seems to ship x86 iPXE image even on ARM16:44
dtantsurmy bifrost patch will download the image instead (see above)16:44
dtantsur"ps2 is not supported by this QEMU binary" why do we even use ps2...16:45
* TheJulia blinks16:46
* JayF wonders if dtantsur is provisioning a playstation (?)16:46
JayFwhat is PS2 in this context? :D 16:46
dtantsurJayF: bifrost seriously configures a mouse for its test VMs16:46
dtantsurI think we can live without it...16:47
JayFjust make sure you install gpm16:47
TheJuliaremove it!16:47
JayFlol16:47
TheJuliaI think we only had it because devstack test vms had it16:47
opendevreviewDmitry Tantsur proposed openstack/bifrost master: [WIP] Experimental support for ARM64 hosts on Debian  https://review.opendev.org/c/openstack/bifrost/+/920057 16:47
dtantsurokay, what else will break..16:47
TheJuliaeverything?16:47
dtantsuron a related note, I should have probably gone with active cooling, this thing is pretty warm already16:48
JayFif you've taken the configure-vm.py changes from cid's change, you can at least get to the point of VM up with output16:48
JayFdtantsur: are you trying to deploy an rpi?16:48
dtantsurJayF: bifrost testenv on rpi 5, correct16:48
JayFTIL there is an rpi 516:48
JayFwait, does that mean it actually does pxe!?16:49
dtantsurit's a virt environment, I just wanted to have real ARM to play with16:49
dtantsurI also suspect it can actually network boot16:49
TheJulia... i guess it should be able to, if they didn't try to use their own firmware16:50
dtantsurat the very least, I could flash an iPXE image on its SD card probably..16:51
JayFthat's what you had to do with older rpis16:51
dtantsur$ sudo virsh list --all16:51
dtantsur Id   Name      State16:51
dtantsur--------------------------16:51
dtantsur -    testvm1   shut off16:51
dtantsurgetting somewhere! :)16:51
dtantsurNew ones are 64-bit and have KVM support16:52
dtantsurand 8G RAM16:52
dtantsuralso, yes, it seems like its firmware is capable of PXE https://linuxhit.com/raspberry-pi-pxe-boot-netbooting-a-pi-4-without-an-sd-card/ 16:54
JayFoh, that's a pi 4! I have one of those16:54
dtantsuryeah, pi 5 is too new to have a lot of guides available16:55
JayFI even have a pi 400 in my closet, those are pretty nice (the pi4 baked into a keyboard so it's more like a "plug into tv and go" computer)16:55
dtantsurso, https://review.opendev.org/c/openstack/bifrost/+/920057 is enough to pass the installation phase and enroll a node. I'll probably leave it here and go eat something :)17:04
TheJuliaJayF: I'm +1 on https://review.opendev.org/c/openstack/ironic-specs/+/890164, borderline +2, I have one comment which could definitely be a follow-up, just to set more context around the security concern17:13
JayFI always assume our security policy checks are configurable17:15
JayFI'm not sure I know how to write them in a way that's not overridable17:15
TheJuliayou don't17:15
TheJuliabut17:15
TheJuliaI think you're focusing on the wrong concern17:15
TheJuliamy concern is the expression of doom17:15
TheJuliagetting line number17:15
JayFmy primary concern/fear is "project X creates a runbook. project Y tries to use runbook. project X changes runbook" -- that's the scenario that mostly has been what I've structured the rbac stuff around17:16
TheJuliaoh, I see what you mean re 16417:16
TheJuliait is if they replace the ramdisk17:16
JayFyes, exactly17:16
TheJuliaI was thinking a step to do a naughty thing17:16
JayFalthough to be fair, replacing the ramdisk is probably the whole ball game17:17
TheJuliaI mean, they shouldn't be able to, but if someone were to allow members to change driver_info, yeah17:17
TheJuliayeah17:17
TheJuliait is17:17
dtantsurlooking for a 2nd +2 on https://review.opendev.org/c/openstack/bifrost/+/888447, it has been around for quite a while17:20
JayFlanding17:21
JayFI mean, trade you for a review on self-serve runbooks :D 17:21
JayF(because a spec review and reviewing a minor change completely equal effort levels lol)17:21
dtantsurI'm definitely going to, it's just 7:30pm on a public holiday right now17:21
dtantsur:D17:21
TheJuliagoodnight!17:22
dtantsuro/17:22
JayFo/17:23
TheJuliathe tempest runs just hate me these days17:30
TheJuliaJayF: any chance I can get an independent look at a CI log to see if you have any thoughts in particular direction as to the rot cause18:04
JayFI'm not sure if you meant to leave that extra o out or not lol18:04
TheJulialol18:04
JayFassuming you mean https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/919762 18:04
TheJuliawell, I've found another instance in another change18:05
TheJuliahttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_467/918001/2/gate/ironic-standalone-redfish-2023.2/4676c9f/controller/logs/screen-ir-cond.txt 18:05
TheJuliatake a look for "Read timed out.", the 15th or so instance of it is super interesting18:05
JayFthis looks like the old double-query of the agent bug18:07
JayFwhere we sent a second query to the agent while the first was still running18:07
JayFis fast track enabled on this test, perhaps?18:07
JayFbecause it ran get deploy steps successfully18:08
JayFbut the timeout is on get deploy steps18:08
JayFso I imagine there's either a race where a heartbeat while we're executing get deploy steps causes pain OR we're double requesting18:08
JayFTheJulia: can you get me to the root zuul that log came from?18:09
JayFI wanna see the agnet it was querying18:09
TheJuliahttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_467/918001/2/gate/ironic-standalone-redfish-2023.2/4676c9f/controller/logs/ironic-bm-logs/node-2_console_2024-05-16-17%3A31%3A25_log.txt 18:10
TheJuliaseems to be something which likes to happen with 2023.218:13
JayFthis looks like it's thinking it has to wait for md0 to come up on the agent18:13
TheJuliawe never get to the point of creating it18:14
JayFwhich it allows up to 10 times to wait for disk on retry18:14
TheJuliawe fail even before we get/process get_deploy_steps18:14
JayFso you're saying the busyloop is a symptom, not the cause?18:14
TheJuliaI suspect so18:14
JayFI think get_deploy_steps has a side effect18:14
JayFand that is breaking things18:14
JayFI am going to look into that code18:14
TheJulia++18:14
TheJuliai mean, if it is sitting there spinning the disk detection, sure18:15
TheJuliabut we seem to think on the agent side that we respond promptly18:15
JayFwell that's what messes with me18:15
JayFit should be returning there while it can't find disks, should it?18:15
TheJuliahence why I'm sort of asking for another set of eyes18:15
JayF2023.2, yeah?18:15
TheJuliayes, it should be18:15
TheJuliaunless ironic is wedged, because it seems to think it never gets a reply and that the socket just times out18:16
JayFremember that get_deploy_steps is a command, ironic has to call ipa api to get the command result, right?18:16
TheJuliayes18:16
TheJuliaand it *looks* like the call attempts are hanging18:17
JayFI wonder if it's possible for us to get to the point where this is happening18:17
TheJuliabut also other things are hanging conductor side at the same time18:17
JayFbut IPA API hasn't spawned yet18:17
* TheJulia wonders if the cloud provider is throttling CPU18:22
JayFbingo TheJulia I have it18:23
JayFso look at agent.process_lookup_data18:24
JayFit calls hardware.cache_node18:24
JayFwhich calls wait for disks18:24
JayFwhich waits for the root device hinted device (md0) to come online18:24
JayFwhich takes 10 retries, doesn't return timely18:24
JayFbut API server doesn't start until after process_lookup_data is complete18:25
JayFLOG.info('Cached node %s, waiting for its root device to appear', node['uuid'])18:26
JayFlook familiar? it's what we print before dispatching a wait to disks to all managers18:26
JayFso in a case where we are trying to find a nonexistent disk (I suspect raid is making this even more complex), Ironic races IPA 18:26
TheJuliahmmm18:27
JayFthis is 100% the case from your tempest logs18:27
TheJuliathat would be the most simple explination18:27
JayFI think there's an assumption baked into hardware.cache_node that it wouldn't be called on lookup18:27
JayFI suspect the simplest fix is adding a kwarg to hardware.cache_node, defaulting to false, but when called from lookup we pass as "true" to make it not wait for disks18:28
JayFalthough honestly waiting for disks in cache node just seems wrong to me, but I haven't done the looking to see why it would be done18:28
TheJuliaI'm sort of thinking that as well18:29
TheJuliathe non-existent disk we're going to create in a little bit18:29
TheJuliaits a chicken/egg issue18:29
JayFeven beyond that18:29
JayFwait_for_disks is never called in agent.py until hardware.cache_node18:29
JayFif it /needs/ to be called, why is it a side effect of node caching?18:29
TheJuliadunno18:29
JayFthere is a piece here I don't understand that we're hacking to enable, and I dunno what it is18:29
JayFI have a hunch it might be fast track18:29
JayFso if you get a new node in fast track, you would requery disks18:30
JayFbut that is 100% guessing18:30
TheJuliaThis node is freshly booting up on a non-fast track scenario18:30
TheJuliajust never quite gets there18:30
JayFI'm not talking about your case, I'm trying to figure out a case of *why we would ever want* this weird behavior18:30
JayFbasically asking the question "who does this bugfix need to worry about breaking" more or less :D18:31
TheJuliaoh, not sure18:31
JayFwe've been waiting for disks on node caching since 201718:31
TheJuliaugh, packet loss, why you hate me today18:33
TheJuliaahh18:36
TheJuliaI see18:36
JayFhmmm so wait18:37
JayFthe REAL bug is we are likely heartbeating18:37
JayFbefore we bring up the API18:37
JayFall this stuff is A-OK as long as we aren't heartbeating until *after* we are done with this work18:37
JayFthat's the real bug: you can't heartbeat until you're ready to put the API up, in general18:38
JayFnot saying there's not more to fix, too, but that's generally gross :)18:38
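The ordering invariant being described, no heartbeats before the API is listening, can be modeled in a few lines (hypothetical stand-ins, not IPA's real classes):

```python
import threading

class AgentSketch:
    """Toy model of the startup-ordering invariant discussed above."""

    def __init__(self):
        self._api_up = threading.Event()
        self.heartbeats = 0

    def start_api(self):
        # Real code would bind and serve the agent's REST API here;
        # only then is it safe for the conductor to call back in.
        self._api_up.set()

    def heartbeat(self):
        # Invariant: never advertise ourselves to the conductor before
        # the API can actually answer, or callbacks just time out.
        if not self._api_up.is_set():
            raise RuntimeError("heartbeat before API is up")
        self.heartbeats += 1
```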
JayFthe plot thickens: it does18:39
JayFdid it heartbeat in the logs? if so, why?18:39
JayFit spent from 17:24 - 17:27 looking for that raid device18:43
JayFbut the get_deploy_steps was called /after/ that18:44
JayFso I think it's a red herring, unless that extra 3 minutes is enough to cause things to timeout18:44
TheJuliaI wonder why we keep searching for the /dev/md018:47
opendevreviewJulia Kreger proposed openstack/ironic-python-agent master: DNM: test: skip on initial cache  https://review.opendev.org/c/openstack/ironic-python-agent/+/920061 18:48
JayFwe also call wait_for_disks on every evaluate_hardware_support18:48
JayFwe are just ... spamming the hell outta the world18:49
TheJuliayup18:49
JayFTheJulia: I think my original hypothesis is wrong, fwiw18:49
TheJuliayeah, I figured that18:49
JayFTheJulia: I reread the logs, it DOES spin for about 3 minutes, but *then* it starts heartbeating and comes online18:49
TheJuliayeah, still bothers me it does that18:49
JayFthe thing that's really cooking my noodle right now tbh18:51
JayFis it almost looks like we return get_deploy_steps then *AFTER THAT* logs print for stuff running in evaluate_hardware_support18:51
TheJuliadid you by chance walk the get_deploy_steps code path18:51
TheJuliaI don't think it should be taking that long, but this is bizarre18:52
JayFyes18:52
TheJulia... it also likely shouldn't be calling evaluate_hardware_support, but I guess it might be18:52
JayFcan you get on a call?18:52
JayFit is18:52
JayF10000% 18:52
TheJuliaoh noes18:52
TheJuliauhh, in like 5 minutes18:52
JayFevaluate_hardware_support is not cached everywhere, it seems18:52
JayFbecause we're running it multiple times18:52
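[Editor's note: the "not cached everywhere" observation suggests memoizing `evaluate_hardware_support` per manager instance. A hedged sketch, assuming long-lived singleton managers; `ExampleHardwareManager` and the support constants are stand-ins mirroring IPA's `HardwareSupport` values, not the real classes.]

```python
import functools


class HardwareSupport:
    """Stand-in for IPA's HardwareSupport constants."""
    NONE = 0
    GENERIC = 1
    MAINLINE = 2


class ExampleHardwareManager:
    """Toy manager whose hardware probe is expensive; counts invocations."""
    probe_calls = 0

    # lru_cache keyed on self is fine here because hardware managers are
    # long-lived singletons; the cache keeps a reference to the instance.
    @functools.lru_cache(maxsize=None)
    def evaluate_hardware_support(self):
        # Imagine an expensive PCI/udev probe here.
        type(self).probe_calls += 1
        return HardwareSupport.GENERIC
```

Repeated dispatches then hit the cache instead of re-probing hardware each time.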
JayFTheJulia: ima order and pick up some lunch, lets talk through it after that18:53
TheJuliaok18:55
opendevreviewcid proposed openstack/ironic master: wip: Provision aarch64 fake-bare-metal-vms  https://review.opendev.org/c/openstack/ironic/+/91544118:57
opendevreviewcid proposed openstack/ironic master: wip: Provision aarch64 fake-bare-metal-vms  https://review.opendev.org/c/openstack/ironic/+/91544118:58
opendevreviewMerged openstack/bifrost master: Consolidate centos/fedora/redhat required_defaults  https://review.opendev.org/c/openstack/bifrost/+/88844719:02
opendevreviewcid proposed openstack/ironic master: wip: Provision aarch64 fake-bare-metal-vms  https://review.opendev.org/c/openstack/ironic/+/91544119:04
TheJuliait is almost like we need to have enough smarts to go "oh, this is not a thing yet"19:16
JayFnoting that one concrete suggestion out of our chat19:31
JayFis to log here https://opendev.org/openstack/ironic-python-agent/src/branch/stable/2023.2/ironic_python_agent/hardware.py#L3160 so we know if the managers were cached or not19:31
JayFand we can get a feel for whether we're dealing with the same HWM classes or if we're instantiating them more than once19:31
JayFalso we suggested caching evaluate_hardware_support and wait_for_disks output for our default/generic implementations so we don't spend fifteen years looking for disks we'll never find19:32
JayFTheJulia: ^ I think that's the concrete notes from our meet19:32
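[Editor's note: the logging suggestion above could look roughly like this. Hypothetical names throughout; `get_managers` and `factory` are invented for the sketch and do not match IPA's actual `get_managers` signature.]

```python
import logging

LOG = logging.getLogger(__name__)
_MANAGER_CACHE = []


def get_managers(factory):
    """Instantiate hardware managers once, logging when that happens, so
    the logs show whether later dispatches reuse the same instances or a
    fresh set was built."""
    if not _MANAGER_CACHE:
        _MANAGER_CACHE.extend(factory())
        LOG.info('Instantiated hardware managers: %s',
                 [type(m).__name__ for m in _MANAGER_CACHE])
    else:
        LOG.debug('Reusing cached hardware managers')
    return _MANAGER_CACHE
```

If "Instantiated hardware managers" appears more than once in an agent's log, the cache is being bypassed somewhere.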
TheJuliaonly 15 years?!?19:58
TheJulia;)19:58
TheJuliaack, my takeaway was basically the same, I'll try to look at it tomorrow19:58
opendevreviewcid proposed openstack/ironic master: wip: Provision aarch64 fake-bare-metal-vms  https://review.opendev.org/c/openstack/ironic/+/91544120:13
JayFTheJulia: noting here that dispatch_to_all_managers (which is called for get_deploy_steps) appears to be bugged (well, at least it behaves differently than I would expect) 20:14
JayFhttps://opendev.org/openstack/ironic-python-agent/src/branch/stable/2023.2/ironic_python_agent/hardware.py#L3186 it unilaterally includes all steps, even if hardware isn't supported20:14
JayFeven in dispatch_to_all_managers I *think* we should be filtering out evaluate_hardware_support ==0 20:14
JayFor else we're advertising deploy steps for a node which can never run on it20:14
JayFlooking at that code more in depth (and to leave myself a note before I leave), we need to make dispatch_to_all_managers take a flag to say to exclude unsupported HWMs or not. We have one case (get_versions) where we likely want HWMs to get called even if hardware isn't supported22:56
JayFbut for get_X_steps and collect_system_logs, we certainly do *not* want steps/logs from a hardware manager that doesn't even represent hardware on the system ... (right?)22:57
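[Editor's note: the `exclude_unsupported` flag JayF proposes might be sketched as below. This is an illustrative reimplementation, not the real `dispatch_to_all_managers` from `ironic_python_agent/hardware.py`; the toy manager classes are invented.]

```python
def dispatch_to_all_managers(managers, method, *args, exclude_unsupported=True):
    """Call `method` on every manager, optionally skipping managers whose
    evaluate_hardware_support() is 0 (no hardware on this system)."""
    results = {}
    for mgr in managers:
        if exclude_unsupported and mgr.evaluate_hardware_support() == 0:
            continue  # this HWM claims no hardware here; don't take its steps
        results[type(mgr).__name__] = getattr(mgr, method)(*args)
    return results


class GenericHWM:
    def evaluate_hardware_support(self):
        return 1

    def get_versions(self):
        return '1.0'

    def get_deploy_steps(self):
        return ['erase_devices']


class VendorHWM:
    def evaluate_hardware_support(self):
        return 0  # hardware not present on this system

    def get_versions(self):
        return '2.3'

    def get_deploy_steps(self):
        return ['vendor_firmware_update']
```

`get_deploy_steps` and `collect_system_logs` would use the default (skip unsupported), while `get_versions` would pass `exclude_unsupported=False` to reach every manager.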

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!