*** srelf_ is now known as Continuity | 09:33 | |
iurygregory | good morning Ironic | 11:07 |
mohammed_ | good morning / afternoon, where can I add a topic to today's weekly agenda? | 11:25 |
iurygregory | hey mohammed_ here is the link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting | 11:27 |
mohammed_ | Thanks! I was there but I could not quickly find where to add inputs, I will read it again | 11:43 |
adam-metal3 | mohammed, if you are logged in, there are small [edit] buttons and if you click on those then you can edit the relevant section | 12:27 |
iurygregory | ^ this, if you can add let me know and I can update | 12:46 |
TheJulia | good morning | 13:07 |
iurygregory | good morning TheJulia o/ | 13:15 |
iurygregory | do you want me to run the meeting today? just double checking :D | 13:15 |
opendevreview | Julia Kreger proposed openstack/ironic bugfix/22.1: Remove SQLAlchemy tips jobs https://review.opendev.org/c/openstack/ironic/+/919901 | 13:18 |
TheJulia | I guess I can | 13:18 |
iurygregory | ack | 13:22 |
JayF | If there's an ironic contributor with some free time this week, I'd like to give my BM SIG talk a run through with someone who knows Ironic well | 14:02 |
JayF | if anyone is available/interested let me know | 14:02 |
TheJulia | I might be available, tomorrow maybe | 14:09 |
iurygregory | does anyone remember the redfish bug with HPE where sometimes it failed to power on nodes after cleaning? | 14:15 |
iurygregory | seems like I'm facing the same issue again =( | 14:16 |
iurygregory | JayF, happy to join depending on the day and time.. | 14:16 |
JayF | I mean, you tell me a working day/time :) | 14:16 |
JayF | you'd be the one doing me a favor | 14:17 |
iurygregory | tomorrow afternoon or wed afternoon ie after 15 UTC =) | 14:18 |
iurygregory | friday I'm available also | 14:18 |
JayF | I can do 1500 UTC tomorrow | 14:19 |
TheJulia | iurygregory: is it the one where we try to power the machine back on and it fails to power on because it is still booting? | 14:20 |
TheJulia | I'm not going to be around on Friday | 14:20 |
TheJulia | tomorrow would likely be easiest for me | 14:20 |
opendevreview | Kaifeng Wang proposed openstack/python-ironicclient master: Support traits configuration on baremetal create CLI https://review.opendev.org/c/openstack/python-ironicclient/+/919432 | 14:21 |
JayF | TheJulia: 1500 utc / 8am PDT work for you? | 14:21 |
TheJulia | 9:30 AM UTC would work | 14:22 |
TheJulia | err | 14:22 |
TheJulia | 9:30 AM US Pacifici | 14:22 |
TheJulia | pacific | 14:22 |
* TheJulia tosses the keyboard | 14:22 | |
JayF | 9:30a-10:30a (1630-1730 UTC) works for me tomorrow, too | 14:22 |
iurygregory | TheJulia, humm not sure, it's via OCP https://paste.opendev.org/show/b6DXtU8HM9dUyduwS3mV/ this is the error in the ironic logs after the firmware update was executed (successfully..) | 14:25 |
TheJulia | we likely need to do a couple different things there | 14:29 |
iurygregory | yeah =( | 14:29 |
iurygregory | the update worked fine, ironic has the new information in the DB | 14:29 |
TheJulia | Starting out, the 120 seconds limit on the post reboot wait is likely *way* too short | 14:29 |
iurygregory | I had the wait param as an option to increase the time | 14:30 |
iurygregory | but we didn't add that in metal3 =X | 14:30 |
iurygregory | I think there is a config option that we can change also right? | 14:31 |
TheJulia | so we likely need to wait/check before claiming we're done with the firmware upgrade | 14:31 |
TheJulia | iurygregory: do we know where "Dynamic backoff interval looping call 'ironic.conductor.utils.node_wait_for_power_state.<locals>._wait'" is being triggered at ? | 14:34 |
iurygregory | let me look at the code i did for updates | 14:34 |
TheJulia | I think: tunable longer run time for the post upgrade wait. ... Apparently OneView waits 45 minutes by *default* | 14:37 |
iurygregory | jesus O.o | 14:38 |
TheJulia | well, I was about to say I think we need something else | 14:38 |
TheJulia | but it looks like the step itself fails because of the inability to power back up | 14:38 |
iurygregory | yeah | 14:38 |
TheJulia | 2024-05-20 12:57:57.885 1 ERROR ironic.conductor.utils [None req-d65c5bdf-1567-45d9-81df-7c7d6701d5ef - - - - - -] Node f497fb20-6852-45fe-a21f-8afc8782eed9 failed step | 14:39 |
iurygregory | so, reboot_to_finish_step receives the timeout | 14:39 |
iurygregory | in deploy_utils from what i remember | 14:39 |
TheJulia | I think we need a bigger timeout | 14:39 |
TheJulia | https://imgflip.com/i/8qnszp | 14:40 |
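For anyone tracing this later: the "Dynamic backoff interval looping call" in the paste comes from the oslo.service backoff pattern that ironic.conductor.utils.node_wait_for_power_state is built on. A minimal sketch of the shape of that loop, with a placeholder get_power_state helper (illustrative only, not Ironic's actual code):

```python
from oslo_service import loopingcall

def wait_for_power_state(node, target_state, timeout=120):
    """Poll until the node reports target_state or the timeout expires."""

    def _wait():
        if get_power_state(node) == target_state:  # placeholder helper
            raise loopingcall.LoopingCallDone()
        # Falling through lets BackoffLoopingCall sleep and retry with a
        # progressively longer, jittered interval.

    timer = loopingcall.BackoffLoopingCall(_wait)
    try:
        # The timeout here is the 120-second limit under discussion; once
        # it is exceeded the step fails, even if the BMC is still booting.
        timer.start(initial_delay=1, timeout=timeout).wait()
    except loopingcall.LoopingCallTimeOut:
        raise TimeoutError('%s did not reach %s within %s seconds'
                           % (node, target_state, timeout))
```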
mohammed_ | iurygregory can you add this for me in the agenda: Seek review for https://review.opendev.org/c/openstack/sushy-tools/+/875366 | 14:41 |
iurygregory | TheJulia, yup! | 14:43 |
iurygregory | TheJulia, CONF.conductor.power_state_change_timeout, is this the config with the default value we use? | 14:43 |
TheJulia | so for firmware upgrades, we likely need that to be... at *least* 5 minutes | 14:44 |
iurygregory | I think that's how I was using it (by sending wait=300) | 14:44 |
iurygregory | but the CRD doesn't have the concept of the wait ... | 14:45 |
TheJulia | it's only waiting 120 though | 14:45 |
TheJulia | well, post firmware upgrades, we always need to wait | 14:45 |
TheJulia | longer | 14:45 |
TheJulia | because it has to complete them | 14:45 |
iurygregory | shall we consider sending 300 by default if no value was set for wait? <thinking..> | 14:46 |
TheJulia | for all state changes, I would doubt it | 14:47 |
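For reference, the knob being discussed lives in the [conductor] section of ironic.conf. A sketch of the override being considered (300 is the value floated above as a floor for firmware workflows, not a blessed upstream default):

```ini
[conductor]
# How long (in seconds) to wait for a node to reach the requested power
# state before failing the operation. The downstream image discussed
# here ships 120, which is too short for hardware that applies firmware
# updates during the post-step reboot.
power_state_change_timeout = 300
```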
TheJulia | I must have udp packet loss this morning, dns lookups keep randomly failing | 14:47 |
iurygregory | ouch >.< | 14:48 |
TheJulia | mohammed_: what portion of the meeting would you like that item brought up during? | 14:48 |
iurygregory | I was about to add in open discussion | 14:48 |
TheJulia | that works | 14:49 |
JayF | mohammed_: I'm generally curious what the use case is? I always viewed sushy-tools as being about helping Ironic fake VMs, if we stub even further through doesn't it remove some of the value? | 14:49 |
JayF | mohammed_: I've avoided putting this review on your change because I don't wanna stop it if there's a good use case (sushy-tools is not high on the list of things I'm super concerned about) but I'm curious generally what the goal is | 14:50 |
TheJulia | mohammed_: So I have the exact same question as JayF, fwiw. | 14:51 |
TheJulia | #startmeeting ironic | 15:00 |
opendevmeet | Meeting started Mon May 20 15:00:03 2024 UTC and is due to finish in 60 minutes. The chair is TheJulia. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:00 |
opendevmeet | The meeting name has been set to 'ironic' | 15:00 |
TheJulia | o/ | 15:00 |
mohammed_ | o/ | 15:00 |
TheJulia | Good morning everyone in (UGT) Universal Greeting Time! | 15:00 |
TheJulia | Today's ironic meeting can be found on the wiki | 15:00 |
JayF | o/ | 15:01 |
TheJulia | #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_May_20.2C_2024 | 15:01 |
TheJulia | ... do we have quorum today? | 15:02 |
* TheJulia puts a cup of coffee in front of iurygregory | 15:02 | |
TheJulia | :) | 15:02 |
iurygregory | o/ | 15:02 |
* iurygregory was looking at logs and lost track of time | 15:02 | |
TheJulia | #topic Announcements / Reminders | 15:02 |
TheJulia | Your weekly standing reminder to review patches tagged with ironic-week-prio | 15:03 |
TheJulia | #link https://tinyurl.com/ironic-weekly-prio-dash | 15:03 |
TheJulia | The work items for 2024.2 have been merged. | 15:03 |
TheJulia | #link https://review.opendev.org/c/openstack/ironic-specs/+/916295 | 15:03 |
TheJulia | And last, but not least, the release schedule for 2024.2 | 15:03 |
TheJulia | #link https://releases.openstack.org/dalmatian/schedule.html | 15:03 |
TheJulia | This does mean this week is R-19 | 15:04 |
TheJulia | #topic Review Ironic CI Status | 15:04 |
TheJulia | so, how is CI? I've noticed some more tempest test unhappiness which I'm working on trying to fix | 15:04 |
TheJulia | #link https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/919762 | 15:05 |
iurygregory | interesting ^ | 15:06 |
TheJulia | Yeah, I think part of it is the whole "oh, there is no way we're doing a whole OS on this cloud provider" logic path | 15:06 |
TheJulia | and our continued reliance on tinycore is sort of... biting us | 15:06 |
TheJulia | Like an unhappy cat. | 15:06 |
iurygregory | ++ | 15:06 |
TheJulia | Anyway, moving on! | 15:06 |
iurygregory | will keep on my radar | 15:07 |
TheJulia | It looks like a timeout, fwiw | 15:07 |
TheJulia | but you actually have to look at the ipxe text | 15:07 |
TheJulia | Moving past Discussion topics, to Bug Deputy Updates | 15:07 |
TheJulia | #topic Bug Deputy | 15:07 |
TheJulia | Last week we got down to 137 open bugs against ironic itself. A number of them are wishlist items; we should really take a look at the open in-progress items, because I found a few which were already fixed last week. | 15:08 |
iurygregory | I will be the deputy this week o/ | 15:08 |
TheJulia | #action iurygregory to be the Bug Deputy for this week. | 15:08 |
iurygregory | great job TheJulia | 15:08 |
TheJulia | Since we have no RFE topics, we will move to Open Discussion. | 15:09 |
TheJulia | #topic Open Discussion | 15:09 |
JayF | I have an item for this, when mohammed_ is done | 15:09 |
TheJulia | \o/ as do I! | 15:09 |
TheJulia | mohammed_ is seeking reviews of https://review.opendev.org/c/openstack/sushy-tools/+/875366 | 15:09 |
mohammed_ | I would like to take this forward, it seems to have been hanging there for a long time. I want to know if something is blocking it; I have already got a +1 from dmitry | 15:09 |
JayF | I'm generally curious what the use case is? I always viewed sushy-tools as being about helping Ironic fake VMs, if we stub even further through doesn't it remove some of the value? | 15:09 |
JayF | I've avoided putting this review on your change because I don't wanna stop it if there's a good use case (sushy-tools is not high on the list of things I'm super concerned about) but I'm curious generally what the goal is | 15:09 |
TheJulia | I think a better understanding of the goal will help reviewers review it, to be honest | 15:10 |
JayF | ++ | 15:10 |
mohammed_ | we mainly need it for testing ironic scalability, but I'm open to discussing the need for such fake IPAs in sushy-tools on the changes :) | 15:10 |
JayF | What does IPA have to do with sushy? | 15:10 |
JayF | I guess is my root problem | 15:11 |
JayF | sushy is a library for redfish access; sushy-tools uses that library to present fake VMs | 15:11 |
JayF | why not fake IPA in virtualbmc? or virtualpdu? | 15:11 |
JayF | there's a base level misalignment that I can't figure out | 15:11 |
mohammed_ | sushy has a fake system faking the VM, and ironic expects an IPA agent to be running on these VMs; that is what we are trying to add there | 15:11 |
mohammed_ | Fake driver ^ | 15:12 |
TheJulia | is the intention or modeling at present such that triggering a fake run causes the fake IPA to fire up? | 15:13 |
TheJulia | I guess that sort of makes sense then | 15:14 |
mohammed_ | the current model has the fake IPA and sushy-tools running in different containers | 15:14 |
TheJulia | okay | 15:14 |
TheJulia | So my final question, will results be published at all? | 15:14 |
TheJulia | :) | 15:14 |
mohammed_ | Did not get you :/ | 15:14 |
JayF | I'm still confused enough about the whole end-to-end I doubt I'll review that change, but honestly sushy{,-tools} is not my strongest project anyway | 15:14 |
JayF | mohammed_: I think TheJulia is saying, if you land this and do performance testing, share the results with us :D | 15:15 |
TheJulia | So you're doing the development work to bench test Ironic, are those results available for review? | 15:15 |
mohammed_ | we are syncing together on these on the metal3 side with dmitry | 15:15 |
TheJulia | Okay, cool | 15:16 |
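For readers unfamiliar with the flow being faked: Ironic expects an in-band agent to announce itself through the bare metal API's heartbeat endpoint, and that announcement is essentially all a "fake IPA" has to produce. A hedged sketch of it (the /v1/heartbeat endpoint is Ironic's real API; the URLs, UUID, version string, and microversion here are placeholders):

```python
import requests

IRONIC_API = 'http://ironic.example.com:6385'  # placeholder
NODE_UUID = '<node-uuid>'                      # placeholder

def fake_heartbeat(callback_url, agent_token=None):
    """Pretend to be an agent: tell Ironic we are alive and reachable."""
    body = {
        'callback_url': callback_url,  # where Ironic sends agent commands
        'agent_version': '9.99.0',     # fake version string
    }
    if agent_token:
        body['agent_token'] = agent_token
    resp = requests.post(
        '%s/v1/heartbeat/%s' % (IRONIC_API, NODE_UUID),
        json=body,
        headers={'X-OpenStack-Ironic-API-Version': '1.68'})
    resp.raise_for_status()
```

In the model described above, sushy-tools fakes the BMC side while a component like this fakes the in-band agent, which is why the two can run in separate containers.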
TheJulia | JayF: You had an open discussion item? | 15:16 |
JayF | Yes | 15:16 |
JayF | June 5th is the BM SIG @ CERN. | 15:16 |
JayF | It's been noted that we can enable remote participation if desired | 15:16 |
TheJulia | That would be nice | 15:17 |
iurygregory | :O | 15:17 |
iurygregory | ++ | 15:17 |
JayF | Then we need to discuss time. Right now the proposed start time is 10am, which is 1am pdt | 15:17 |
JayF | obviously not ideal but we can change it; I'm a little nervous about doing it because we already have some published | 15:17 |
JayF | but I didn't want to make the decision in a vacuum | 15:18 |
JayF | I doubt that there's a time we could pick that would be 1) close to the original published time and 2) not terrible for PDT | 15:18 |
JayF | we list an 8am start right now on the webpage | 15:18 |
JayF | #link https://indico.cern.ch/event/1378171/ | 15:19 |
TheJulia | JayF: I will do my best to join in remotely based upon the local schedule, start based upon the people there | 15:19 |
JayF | yeah, that's what I was thinking -- which would be a 10am start for presentations, I think | 15:19 |
TheJulia | 8 AM is a bit early if one needs to train in and all | 15:19 |
TheJulia | yeah, that seems reasonable | 15:19 |
iurygregory | I think it would be complicated to change the time now, I had no idea it could have remote participation, I will just try to join | 15:20 |
JayF | I'll note the agenda currently includes an introduction to metal3, an ~45-60m talk by me on "ironic features" | 15:20 |
JayF | and then hackathon/discussion afterwards | 15:20 |
TheJulia | Sounds good | 15:20 |
JayF | the goal is to hopefully inspire lots of "how does feature X work?" conversations after the presentations :D | 15:20 |
JayF | the other, related item for OD for me | 15:21 |
JayF | I will be in Europe next week and the week after. I'll be very rarely on IRC as a result to focus on my in-person team in UK and all | 15:21 |
JayF | So if you need me, email/sms/etc | 15:21 |
TheJulia | JayF: cool cool, enjoy | 15:21 |
TheJulia | so my one last item! | 15:21 |
TheJulia | ihrachys has raised some questions/comments on https://review.opendev.org/c/openstack/ironic-specs/+/916126. JayF and I have both responded. If there are any more thoughts/responses please try and post it in the next day or so and I'll rev the spec. | 15:22 |
TheJulia | Anything else today? | 15:22 |
TheJulia | I'll note next Monday is a holiday in the US. | 15:23 |
JayF | Please review the self-serve runbooks spec :D | 15:23 |
TheJulia | ++ | 15:23 |
iurygregory | ack o/ | 15:23 |
JayF | that is on hold pending community review right now | 15:23 |
iurygregory | this week I have more "free" time | 15:24 |
iurygregory | so I will double check it | 15:24 |
TheJulia | what is "free" time ;) | 15:24 |
iurygregory | a time I will not spend testing firmware update or on escalation :D | 15:24 |
* JayF hears a new jira ticket being filed | 15:24 | |
JayF | run iurygregory! | 15:24 |
iurygregory | well I got an escalation from a bug created today | 15:25 |
iurygregory | I'm not surprised because it's Monday | 15:25 |
TheJulia | in accordance with the prophecy | 15:25 |
iurygregory | yup! | 15:25 |
TheJulia | iurygregory: if you need another brain, I can defer some stuff I was planning on doing this morning until tomorrow | 15:26 |
TheJulia | anyway! Thanks everyone! | 15:26 |
TheJulia | Have a great week! | 15:26 |
mohammed_ | o/ | 15:26 |
TheJulia | #endmeeting | 15:26 |
opendevmeet | Meeting ended Mon May 20 15:26:35 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 15:26 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/ironic/2024/ironic.2024-05-20-15.00.html | 15:26 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/ironic/2024/ironic.2024-05-20-15.00.txt | 15:26 |
opendevmeet | Log: https://meetings.opendev.org/meetings/ironic/2024/ironic.2024-05-20-15.00.log.html | 15:26 |
iurygregory | o/ tks | 15:26 |
* iurygregory is happy, he found the 120sec config \o/ https://github.com/openshift/ironic-image/blob/master/ironic-config/ironic.conf.j2#L98 | 15:26 | |
iurygregory | ok, so either I will increase this downstream again, or try to add something in the ironic code .-. | 15:27 |
opendevreview | Ebbex proposed openstack/bifrost master: Consolidate centos/fedora/redhat required_defaults https://review.opendev.org/c/openstack/bifrost/+/888447 | 15:27 |
JayF | 120s timeout is very low | 15:27 |
cid | quick question, the weekly meetings, is it happening solely on the IRC chat? Or there's another side to it. | 15:28 |
JayF | but you can bump it to find a better default | 15:28 |
JayF | cid: it's solely an IRC thing for most projects :) | 15:28 |
JayF | cid: basically just an opportunity to ensure many core contributors are in the same place for an hour to help make decisions | 15:28 |
TheJulia | so 120 seconds makes semi-sense for some of the hardware I've been seeing | 15:28 |
iurygregory | JayF, it was ok before since we didn't have to deal with the firmware update XD | 15:28 |
TheJulia | I think ironic is 30 seconds today, Ironic itself likely needs the value increased | 15:28 |
cid | JayF: Got it. | 15:28 |
TheJulia | .... I think we have a bug about specific configurations on supermicro gear taking 5 minutes | 15:29 |
iurygregory | I think this 120 was because of supermicro... | 15:29 |
TheJulia | yeouch, no wonder I'm having issues, I'm seeing 4% packet loss | 15:29 |
TheJulia | wind is variable but some of it is also 2.4GHz band noise | 15:32 |
* iurygregory increased to 360.. fingers crossed | 15:33 | |
TheJulia | does it make sense to consider raising the default in ironic? | 15:34 |
JayF | iurygregory: TheJulia: rewinding about an hour: tomorrow; 1630-1730 aka 9:30-1030am PDT for a run thru of that talk? | 15:34 |
TheJulia | works for me | 15:34 |
iurygregory | works for me | 15:35 |
JayF | invites going out | 15:35 |
iurygregory | tks! | 15:35 |
iurygregory | going to grab lunch to have more energy to debug things... | 15:35 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: Exclude ramdisk tests with tinycore in uefi mode https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/919762 | 15:52 |
TheJulia | That should fix it, I got the structure wrong | 15:52 |
cid | Any additional suggestions/ideas on this: https://review.opendev.org/c/openstack/ironic/+/917229 | 16:12 |
TheJulia | cid: I just left a comment on that | 16:20 |
TheJulia | Yeah, you don't want to try and modify the runtime environment | 16:24 |
TheJulia | but I proposed a solution | 16:24 |
cid | Yeah, I will go ahead and do that. | 16:25 |
opendevreview | Dmitry Tantsur proposed openstack/bifrost master: [WIP] Experimental support for ARM64 hosts on Debian https://review.opendev.org/c/openstack/bifrost/+/920057 | 16:25 |
cid | Also, since two reviewers have suggested making the environment approach the default, with no objections, I will do that too, with the option for a temp file | 16:26 |
TheJulia | yeah, there really is not a huge difference, the fundamental issue is a lack of understanding around the mechanism properties of the temp file, but *shrugs* | 16:27 |
JayF | I am a little -0 to changing the default | 16:29 |
JayF | just because the value of changing it is low, and risking a break on one of our older drivers | 16:29 |
JayF | makes me a little nervous | 16:29 |
cid | Right! that change will touch a lot of existing "stuff" | 16:30 |
dtantsur | If we're worried about this change, do we even understand it? | 16:31 |
cid | I'm thinking, ... | 16:31 |
cid | what's the point of the change if there's no "added" benefit | 16:32 |
* dtantsur installs bifrost on raspberry pi | 16:32 | |
JayF | dtantsur: I'm not worried about this change in general, but when it's a minor trade off between two different fairly secure ways of doing things, I would default to "don't churn the default" | 16:33 |
JayF | cid: well, it's choosing one set of tradeoffs over another. For most people, a file on disk is secure and trustable, some people have different security requirements. | 16:33 |
dtantsur | Any time we add an option, we should think how to explain operators which option to choose | 16:33 |
JayF | "choose this option if your security person freaks out over passwords in a file on disk and won't listen to reason" | 16:34 |
TheJulia | "in a file, umasked such that it is only for the running user | 16:34 |
cid | if you say it like that :D | 16:34 |
TheJulia | " | 16:34 |
opendevreview | Julia Kreger proposed openstack/ironic master: Handle Power On/Off for child node cases https://review.opendev.org/c/openstack/ironic/+/896570 | 16:34 |
TheJulia | "and the security person has never heard of /proc/<pid>/environ :) | 16:34 |
JayF | yes, exactly | 16:35 |
JayF | this is a workaround for bad security requirements | 16:35 |
JayF | which usually I'd bristle against ... except I've been in that person's shoes, so I wanna give them the option | 16:35 |
dtantsur | You're not making me more convinced to merge this patch | 16:35 |
dtantsur | but I guess as long as it's an ironic.conf, not driver_info, option, I can live with that.. | 16:35 |
TheJulia | same basic issue, the only advantage for environ is the kernel enforces it so only root and the user can see it | 16:35 |
JayF | I mean, it's the truth of the situation as described | 16:35 |
JayF | to be clear: gr-oss isn't even the downstream for this change | 16:36 |
JayF | it was an operator bug that cid picked up to help pare down the backlog | 16:36 |
TheJulia | its a "good deed" change | 16:36 |
JayF | that's such a good way of describing it | 16:36 |
TheJulia | I get it, and I think it makes sense, there is value in doing those sorts of things | 16:36 |
JayF | we're making someone elses' life easier, removing a barrier to them running ironic, because we can | 16:36 |
TheJulia | exactly | 16:37 |
JayF | "we" meaning c.i.d. in this case really lol | 16:37 |
cid | lol | 16:37 |
JayF | if it was API surface, or in a driver that wasn't already on the far side of its lifecycle, I might care enough to have a philosophical argument about it | 16:37 |
JayF | moving the setting outta driver info is probably the right call for ensuring 100% it's not api surface | 16:38 |
JayF | and will die with ipmi driver | 16:38 |
dtantsur | ++ | 16:38 |
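To make the tradeoff concrete, a sketch of the two hand-off styles being compared (illustrative only, not the actual patch; ipmitool reads a password file via -f and the IPMI_PASSWORD environment variable via -E):

```python
import os
import tempfile

def password_via_file(password):
    """Hand the secret to ipmitool through a file (ipmitool -f <path>)."""
    # mkstemp creates the file with mode 0600, so only the running user
    # (and root) can read it: the umask point made above.
    fd, path = tempfile.mkstemp(prefix='ipmi-pw-')
    with os.fdopen(fd, 'w') as f:
        f.write(password)
    return path

def password_via_env(password):
    """Hand the secret over via the child environment (ipmitool -E)."""
    env = os.environ.copy()
    # Still visible in /proc/<pid>/environ, but the kernel restricts that
    # file to root and the owning user: the same basic exposure.
    env['IPMI_PASSWORD'] = password
    return env
```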
dtantsur | "libvirtError: unsupported configuration: BIOS serial console only supported on x86 architectures" oh well | 16:38 |
cid | ^^ sounds so familiar :D | 16:39 |
TheJulia | super super super familiar | 16:39 |
dtantsur | cid: I need to start borrowing hacks from your devstack-on-arm patch | 16:39 |
dtantsur | yeah :D | 16:39 |
cid | lol, I mean, I have conquered this phase it would seem. | 16:41 |
cid | So, yea | 16:41 |
JayF | did you get to the point of any output in the log files? | 16:41 |
JayF | last I saw it was creating them but they were blank | 16:41 |
TheJulia | ... or capturing the output | 16:41 |
cid | I have outputs now, that's the only good news | 16:42 |
dtantsur | :D | 16:42 |
cid | I have non-empty outputs :D | 16:42 |
opendevreview | Dmitry Tantsur proposed openstack/bifrost master: [WIP] Experimental support for ARM64 hosts on Debian https://review.opendev.org/c/openstack/bifrost/+/920057 | 16:42 |
cid | https://zuul.opendev.org/t/openstack/build/5a2a3cbd254747b784f866aa2244e923/log/controller/logs/ironic-bm-logs/node-1_console_2024-05-16-17:53:19_log.txt | 16:42 |
JayF | so your ipxe rom -- snponly.efi -- likely doesn't work on arm | 16:43 |
JayF | > BdsDxe: failed to load Boot0002 "UEFI PXEv4 (MAC:5254005DBB88)" from PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(5254005DBB88,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Unsupported | 16:43 |
dtantsur | at least Debian seems to ship an x86 iPXE image even on ARM | 16:44 |
dtantsur | my bifrost patch will download the image instead (see above) | 16:44 |
dtantsur | "ps2 is not supported by this QEMU binary" why do we even use ps2... | 16:45 |
* TheJulia blinks | 16:46 | |
* JayF wonders if dtantsur is provisioning a playstation (?) | 16:46 | |
JayF | what is PS2 in this context? :D | 16:46 |
dtantsur | JayF: bifrost seriously configures a mouse for its test VMs | 16:46 |
dtantsur | I think we can live without it... | 16:47 |
JayF | just make sure you install gpm | 16:47 |
TheJulia | remove it! | 16:47 |
JayF | lol | 16:47 |
TheJulia | I think we only had it because devstack test vms had it | 16:47 |
opendevreview | Dmitry Tantsur proposed openstack/bifrost master: [WIP] Experimental support for ARM64 hosts on Debian https://review.opendev.org/c/openstack/bifrost/+/920057 | 16:47 |
dtantsur | okay, what else will break.. | 16:47 |
TheJulia | everything? | 16:47 |
dtantsur | on a related note, I should have probably gone with active cooling, this thing is pretty warm already | 16:48 |
JayF | if you've taken the configure-vm.py changes from cid's change, you can at least get to the point of VM up with output | 16:48 |
JayF | dtantsur: are you trying to deploy an rpi? | 16:48 |
dtantsur | JayF: bifrost testenv on rpi 5, correct | 16:48 |
JayF | TIL there is an rpi 5 | 16:48 |
JayF | wait, does that mean it actually does pxe!? | 16:49 |
dtantsur | it's a virt environment, I just wanted to have real ARM to play with | 16:49 |
dtantsur | I also suspect it can actually network boot | 16:49 |
TheJulia | ... i guess it should be able to, if they didn't try to use their own firmware | 16:50 |
dtantsur | at the very least, I could flash an iPXE image on its SD card probably.. | 16:51 |
JayF | that's what you had to do with older rpis | 16:51 |
dtantsur | $ sudo virsh list --all | 16:51 |
dtantsur | Id Name State | 16:51 |
dtantsur | -------------------------- | 16:51 |
dtantsur | - testvm1 shut off | 16:51 |
dtantsur | getting somewhere! :) | 16:51 |
dtantsur | New ones are 64-bit and have KVM support | 16:52 |
dtantsur | and 8G RAM | 16:52 |
dtantsur | also, yes, it seems like its firmware is capable of PXE https://linuxhit.com/raspberry-pi-pxe-boot-netbooting-a-pi-4-without-an-sd-card/ | 16:54 |
JayF | oh, that's a pi 4! I have one of those | 16:54 |
dtantsur | yeah, pi 5 is too new to have a lot of guides available | 16:55 |
JayF | I even have a pi 400 in my closet, those are pretty nice (the pi4 baked into a keyboard so it's more like a "plug into tv and go" computer) | 16:55 |
dtantsur | so, https://review.opendev.org/c/openstack/bifrost/+/920057 is enough to pass the installation phase and enroll a node. I'll probably leave it here and go eat something :) | 17:04 |
TheJulia | JayF: I'm +1 on https://review.opendev.org/c/openstack/ironic-specs/+/890164, borderline +2, I have one comment which could definitely be a follow-up, just to set more context around the security concern | 17:13 |
JayF | I always assume our security policy checks are configurable | 17:15 |
JayF | I'm not sure I know how to write them in a way that's not overridable | 17:15 |
TheJulia | you don't | 17:15 |
TheJulia | but | 17:15 |
TheJulia | I think you're focusing on the wrong concern | 17:15 |
TheJulia | my concern is the expression of doom | 17:15 |
TheJulia | getting line number | 17:15 |
JayF | my primary concern/fear is "project X creates a runbook. project Y tries to use runbook. project X changes runbook" -- that's the scenario I've mostly structured the rbac stuff around | 17:16 |
TheJulia | oh, I see what you mean re 164 | 17:16 |
TheJulia | it is if they replace the ramdisk | 17:16 |
JayF | yes, exactly | 17:16 |
TheJulia | I was thinking a step to do a naughty thing | 17:16 |
JayF | although to be fair, replacing the ramdisk is probably the whole ball game | 17:17 |
TheJulia | I mean, they shouldn't be able to, but if someone were to allow members to change driver_info, yeah | 17:17 |
TheJulia | yeah | 17:17 |
TheJulia | it is | 17:17 |
dtantsur | looking for a 2nd +2 on https://review.opendev.org/c/openstack/bifrost/+/888447, it has been around for quite a while | 17:20 |
JayF | landing | 17:21 |
JayF | I mean, trade you for a review on self-serve runbooks :D | 17:21 |
JayF | (because a spec review and reviewing a minor change completely equal effort levels lol) | 17:21 |
dtantsur | I'm definitely going to, it's just 7:30pm on a public holiday right now | 17:21 |
dtantsur | :D | 17:21 |
TheJulia | goodnight! | 17:22 |
dtantsur | o/ | 17:22 |
JayF | o/ | 17:23 |
TheJulia | the tempest runs just hate me these days | 17:30 |
TheJulia | JayF: any chance I can get an independent look at a CI log to see if you have any thoughts in particular direction as to the rot cause | 18:04 |
JayF | I'm not sure if you meant to leave that extra o out or not lol | 18:04 |
TheJulia | lol | 18:04 |
JayF | assuming you mean https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/919762 | 18:04 |
TheJulia | well, I've found another instance in another change | 18:05 |
TheJulia | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_467/918001/2/gate/ironic-standalone-redfish-2023.2/4676c9f/controller/logs/screen-ir-cond.txt | 18:05 |
TheJulia | take a look for "Read timed out.", the 15th or so instance of it is super interesting | 18:05 |
JayF | this looks like the old double-query of the agent bug | 18:07 |
JayF | where we sent a second query to the agent while the first was still running | 18:07 |
JayF | is fast track enabled on this test, perhaps? | 18:07 |
JayF | because it ran get deploy steps successfully | 18:08 |
JayF | but the timeout is on get deploy steps | 18:08 |
JayF | so I imagine there's either a race where a heartbeat while we're executing get deploy steps causes pain OR we're double requesting | 18:08 |
JayF | TheJulia: can you get me to the root zuul that log came from? | 18:09 |
JayF | I wanna see the agent it was querying | 18:09 |
TheJulia | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_467/918001/2/gate/ironic-standalone-redfish-2023.2/4676c9f/controller/logs/ironic-bm-logs/node-2_console_2024-05-16-17%3A31%3A25_log.txt | 18:10 |
TheJulia | seems to be something which likes to happen with 2023.2 | 18:13 |
JayF | this looks like it's thinking it has to wait for md0 to come up on the agent | 18:13 |
TheJulia | we never get to the point of creating it | 18:14 |
JayF | which allows up to 10 retries waiting for the disk | 18:14 |
TheJulia | we fail even before we get/process get_deploy_steps | 18:14 |
JayF | so you're saying the busyloop is a symptom, not the cause? | 18:14 |
TheJulia | I suspect so | 18:14 |
JayF | I think get_deploy_steps has a side effect | 18:14 |
JayF | and that is breaking things | 18:14 |
JayF | I am going to look into that code | 18:14 |
TheJulia | ++ | 18:14 |
TheJulia | I mean, if it is sitting there spinning the disk detection, sure | 18:15 |
TheJulia | but we seem to think on the agent side that we respond promptly | 18:15 |
JayF | well that's what messes with me | 18:15 |
JayF | it should be returning there while it can't find disks, shouldn't it? | 18:15 |
TheJulia | hence why I'm sort of asking for another set of eyes | 18:15 |
JayF | 2023.2, yeah? | 18:15 |
TheJulia | yes, it should be | 18:15 |
TheJulia | unless ironic is wedged, because it seems to think it never gets a reply and that the socket just times out | 18:16 |
JayF | remember that get_deploy_steps is a command, ironic has to call ipa api to get the command result, right? | 18:16 |
TheJulia | yes | 18:16 |
TheJulia | and it *looks* like the call attempts are hanging | 18:17 |
JayF | I wonder if it's possible for us to get to the point where this is happening | 18:17 |
TheJulia | but also other things are hanging conductor side at the same time | 18:17 |
JayF | but IPA API hasn't spawned yet | 18:17 |
* TheJulia wonders if the cloud provider is throttling CPU | 18:22 | |
JayF | bingo TheJulia I have it | 18:23 |
JayF | so look at agent.process_lookup_data | 18:24 |
JayF | it calls hardware.cache_node | 18:24 |
JayF | which calls wait for disks | 18:24 |
JayF | which waits for the root device hinted device (md0) to come online | 18:24 |
JayF | which takes 10 retries, doesn't return timely | 18:24 |
JayF | but API server doesn't start until after process_lookup_data is complete | 18:25 |
JayF | LOG.info('Cached node %s, waiting for its root device to appear', node['uuid']) | 18:26 |
JayF | look familiar? it's what we print before dispatching a wait for disks to all managers | 18:26 |
JayF | so in a case where we are trying to find a nonexistent disk (I suspect raid is making this even more complex), Ironic races IPA | 18:26 |
TheJulia | hmmm | 18:27 |
JayF | this is 100% the case from your tempest logs | 18:27 |
TheJulia | that would be the simplest explanation | 18:27 |
JayF | I think there's an assumption baked into hardware.cache_node that it wouldn't be called on lookup | 18:27 |
JayF | I suspect the simplest fix is adding a kwarg to hardware.cache_node, defaulting to false, but when called from lookup we pass as "true" to make it not wait for disks | 18:28 |
JayF | although honestly waiting for disks in cache node just seems wrong to me, but I haven't done the looking to see why it would be done | 18:28 |
TheJulia | I'm sort of thinking that as well | 18:29 |
TheJulia | the non-existent disk we're going to create in a little bit | 18:29 |
TheJulia | it's a chicken/egg issue | 18:29 |
JayF | even beyond that | 18:29 |
JayF | wait_for_disks is never called in agent.py until hardware.cache_node | 18:29 |
JayF | if it /needs/ to be called, why is it a side effect of node caching? | 18:29 |
TheJulia | dunno | 18:29 |
JayF | there is a piece here I don't understand that we're hacking to enable, and I dunno what it is | 18:29 |
JayF | I have a hunch it might be fast track | 18:29 |
JayF | so if you get a new node in fast track, you would requery disks | 18:30 |
JayF | but that is 100% guessing | 18:30 |
TheJulia | This node is freshly booting up on a non-fast track scenario | 18:30 |
TheJulia | just never quite gets there | 18:30 |
JayF | I'm not talking about your case, I'm trying to figure out a case of *why we would ever want* this weird behavior | 18:30 |
JayF | basically asking the question "who does this bugfix need to worry about breaking" more or less :D | 18:31 |
TheJulia | oh, not sure | 18:31 |
JayF | we've been waiting for disks on node caching since 2017 | 18:31 |
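A minimal sketch of the kwarg fix floated above (cache_node and dispatch_to_managers exist in IPA's hardware.py, but the kwarg and the simplified body here are hypothetical):

```python
import logging

LOG = logging.getLogger(__name__)
NODE = None

def cache_node(node, skip_wait_for_disks=False):
    """Cache the node object; optionally skip settling root device hints.

    The lookup path would pass skip_wait_for_disks=True so that a root
    device hint pointing at a device that does not exist yet (the
    /dev/md0 we only assemble later) cannot stall startup and delay
    bringing the agent API up.
    """
    global NODE
    NODE = node
    if not skip_wait_for_disks:
        LOG.info('Cached node %s, waiting for its root device to appear',
                 node['uuid'])
        dispatch_to_managers('wait_for_disks')  # existing hardware.py helper
```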
TheJulia | ugh, packet loss, why you hate me today | 18:33 |
TheJulia | ahh | 18:36 |
TheJulia | I see | 18:36 |
JayF | hmmm so wait | 18:37 |
JayF | the REAL bug is we are likely heartbeating | 18:37 |
JayF | before we bring up the API | 18:37 |
JayF | all this stuff is A-OK as long as we aren't heartbeating until *after* we are done with this work | 18:37 |
JayF | that's the real bug: you can't heartbeat until you're ready to put the API up, in general | 18:38 |
JayF | not saying there's not more to fix, too, but that's generally gross :) | 18:38 |
JayF | the plot thickens: it does | 18:39 |
JayF | did it heartbeat in the logs? if so, why? | 18:39 |
JayF | it spent from 17:24 - 17:27 looking for that raid device | 18:43 |
JayF | but the get_deploy_steps was called /after/ that | 18:44 |
JayF | so I think it's a red herring, unless that extra 3 minutes is enough to cause things to timeout | 18:44 |
TheJulia | I wonder why we keep searching for the /dev/md0 | 18:47 |
opendevreview | Julia Kreger proposed openstack/ironic-python-agent master: DNM: test: skip on initial cache https://review.opendev.org/c/openstack/ironic-python-agent/+/920061 | 18:48 |
JayF | we also call wait_for_disks on every evaluate_hardware_support | 18:48 |
JayF | we are just ... spamming the hell outta the world | 18:49 |
TheJulia | yup | 18:49 |
JayF | TheJulia: I think my original hypothesis is wrong, fwiw | 18:49 |
TheJulia | yeah, I figured that | 18:49 |
JayF | TheJulia: I reread the logs, it DOES spin for about 3 minutes, but *then* it starts heartbeating and comes online | 18:49 |
TheJulia | yeah, still bothers me it does that | 18:49 |
JayF | the thing that's really cooking my noodle right now tbh | 18:51 |
JayF | it almost looks like we return get_deploy_steps then *AFTER THAT* logs print for stuff running in evaluate_hardware_support | 18:51 |
TheJulia | did you by chance walk the get_deploy_steps code path | 18:51 |
TheJulia | I don't think it should be taking that long, but this is bizarre | 18:52 |
JayF | yes | 18:52 |
TheJulia | ... it also likely shouldn't be calling evaluate_hardware_support, but I guess it might be | 18:52 |
JayF | can you get on a call? | 18:52 |
JayF | it is | 18:52 |
JayF | 10000% | 18:52 |
TheJulia | oh noes | 18:52 |
TheJulia | uhh, in like 5 minutes | 18:52 |
JayF | evaluate_hardware_support is not cached everywhere, it seems | 18:52 |
JayF | because we're running it multiple times | 18:52 |
JayF | TheJulia: ima order and pick up some lunch, let's talk through it after that | 18:53 |
TheJulia | ok | 18:55 |
opendevreview | cid proposed openstack/ironic master: wip: Provision aarch64 fake-bare-metal-vms https://review.opendev.org/c/openstack/ironic/+/915441 | 18:57 |
opendevreview | cid proposed openstack/ironic master: wip: Provision aarch64 fake-bare-metal-vms https://review.opendev.org/c/openstack/ironic/+/915441 | 18:58 |
opendevreview | Merged openstack/bifrost master: Consolidate centos/fedora/redhat required_defaults https://review.opendev.org/c/openstack/bifrost/+/888447 | 19:02 |
opendevreview | cid proposed openstack/ironic master: wip: Provision aarch64 fake-bare-metal-vms https://review.opendev.org/c/openstack/ironic/+/915441 | 19:04 |
TheJulia | it is almost like we need to have enough smarts to go "oh, this is not a thing yet" | 19:16 |
JayF | noting that one concrete suggestion out of our chat | 19:31 |
JayF | is to log here https://opendev.org/openstack/ironic-python-agent/src/branch/stable/2023.2/ironic_python_agent/hardware.py#L3160 so we know if the managers were cached or not | 19:31 |
JayF | and we can get a feel for whether we're dealing with the same HWM classes or if we're instantiating them more than once | 19:31 |
JayF | also we suggested caching evaluate_hardware_support and wait_for_disks output for our default/generic implementations so we don't spend fifteen years looking for disks we'll never find | 19:32 |
JayF | TheJulia: ^ I think that's the concrete notes from our meet | 19:32 |
TheJulia | only 15 years?!? | 19:58 |
TheJulia | ;) | 19:58 |
TheJulia | ack, my takeaway was basically the same, I'll try to look at it tomorrow | 19:58 |
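A sketch of the caching suggestion, memoizing the expensive evaluation per manager so repeated dispatches stop re-scanning for devices (a hypothetical helper; the real fix would live in hardware.py's dispatch plumbing):

```python
_HW_SUPPORT_CACHE = {}

def cached_hardware_support(manager):
    """Run evaluate_hardware_support() once per manager class and reuse it.

    Without this, every dispatch can re-trigger the evaluation (and side
    effects like wait_for_disks), which is the repeated device scanning
    visible in the logs above.
    """
    key = manager.__class__.__name__
    if key not in _HW_SUPPORT_CACHE:
        _HW_SUPPORT_CACHE[key] = manager.evaluate_hardware_support()
    return _HW_SUPPORT_CACHE[key]
```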
opendevreview | cid proposed openstack/ironic master: wip: Provision aarch64 fake-bare-metal-vms https://review.opendev.org/c/openstack/ironic/+/915441 | 20:13 |
JayF | TheJulia: noting here that dispatch_to_all_managers (which is called for get_deploy_steps) appears to be bugged (well, at least it behaves differently than I would expect) | 20:14 |
JayF | https://opendev.org/openstack/ironic-python-agent/src/branch/stable/2023.2/ironic_python_agent/hardware.py#L3186 it unilaterally includes all steps, even if hardware isn't supported | 20:14 |
JayF | even in dispatch_to_all_managers I *think* we should be filtering out evaluate_hardware_support ==0 | 20:14 |
JayF | or else we're advertising deploy steps for a node which can never run on it | 20:14 |
JayF | looking at that code more in depth (and to leave myself a note before I leave), we need to make dispatch_to_all_managers take a flag saying whether to exclude unsupported HWMs or not. We have one case (get_versions) where we likely want HWMs to get called even if hardware isn't supported | 22:56 |
JayF | but for get_X_steps and collect_system_logs, we certainly do *not* want steps/logs from a hardware manager that doesn't even represent hardware on the system ... (right?) | 22:57 |
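A sketch of what that flag could look like (simplified and hypothetical; the real dispatch_to_all_managers also handles method resolution, error handling, and ordering):

```python
def dispatch_to_all_managers(method, *args, skip_unsupported=False, **kwargs):
    """Call `method` on every hardware manager, collecting the results.

    With skip_unsupported=True, managers whose evaluate_hardware_support()
    returned 0 (HardwareSupport.NONE) are left out, so get_*_steps and
    collect_system_logs stop advertising steps/logs from managers that do
    not service this hardware; get_versions would keep the default False.
    """
    results = {}
    for manager in get_managers():  # existing helper in hardware.py
        if skip_unsupported and manager.evaluate_hardware_support() == 0:
            continue
        if hasattr(manager, method):
            results[manager.__class__.__name__] = getattr(
                manager, method)(*args, **kwargs)
    return results
```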