Monday, 2023-05-08

vanougood morning ironic01:31
Nisha_Agarwalmorning Ironic!!!06:41
opendevreviewVerification of a change to openstack/ironic master failed: Make rbac enforced test non-voting for the time being  https://review.opendev.org/c/openstack/ironic/+/88245107:20
opendevreviewRiccardo Pittau proposed openstack/ironic-tempest-plugin master: Fix rbac tests, take 2  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/88245207:42
opendevreviewRiccardo Pittau proposed openstack/ironic-tempest-plugin master: Add RBAC specific tempest jobs to gate plugin  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/88231207:42
opendevreviewMerged openstack/ironic master: Make rbac enforced test non-voting for the time being  https://review.opendev.org/c/openstack/ironic/+/88245109:49
iurygregorygood morning Ironic11:03
*** iurygregory_ is now known as iurygregory11:18
TheJuliagood morning13:23
dtantsurmorning iurygregory, TheJulia 13:26
* TheJulia tries to wake up13:37
arne_wiebalckGood morning, Ironic!14:10
TheJuliagood morning14:10
arne_wiebalck(and it is really morning for me: at a conference in the U.S. talking about Ironic :))14:11
dtantsur\o/14:14
TheJulia\o/14:15
TheJuliaOut of curiosity, if you're willing to share, what is the conference?14:15
arne_wiebalckhttps://indico.jlab.org/event/459/contributions/11646/14:16
arne_wiebalckComputing in High Energy Physics14:16
TheJulianice!14:17
* TheJulia is still trying to wake up14:20
JayFI would've gone, arne_wiebalck, but I'm only an expert in Medium Energy Physics /s 14:20
JayFlol14:20
arne_wiebalckJayF: LOL14:20
JayFNorfolk, VA is nice. You should try to head over to Virginia Beach while you're in town14:28
JayFthat's a nice area of the states14:28
opendevreviewJulia Kreger proposed openstack/ironic master: DNM: Attempting to troubleshoot grenade  https://review.opendev.org/c/openstack/ironic/+/88234714:40
opendevreviewJulia Kreger proposed openstack/ironic master: DPU modeling - parent_node DB/Model/API  https://review.opendev.org/c/openstack/ironic/+/88011414:57
JayFGood morning. Meeting time!15:00
JayF#startmeeting ironic15:00
opendevmeetMeeting started Mon May  8 15:00:09 2023 UTC and is due to finish in 60 minutes.  The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot.15:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.15:00
opendevmeetThe meeting name has been set to 'ironic'15:00
iurygregoryo/15:00
JayFWho all is here this morning?15:00
matfechnero/15:00
dtantsuro/15:00
JayFWe have a very simple agenda so unless folks have something for open discussion I'm going to try and make it quick.15:00
TheJuliao/15:00
JayF#topic announcements/reminder15:00
JayF#note Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio: https://tinyurl.com/ironic-weekly-prio-dash15:00
JayF#note Please avoid running a `recheck` command bare, without any other comments. Obviously, best case is to troubleshoot and fix an issue in CI, but in case of some ephemeral failure, please just note that -- e.g. `recheck jobname failed`. This is tracked at an OpenStack level (https://etherpad.opendev.org/p/recheck-weekly-summary) and I've noticed more contributors doing15:01
JayFbare rechecks.15:01
JayFAlthough I recently confirmed those messages are not aggregated, it can be useful for others coming to a patch later.15:01
JayFThose are the two announcements/reminders.15:01
JayFWe had no action items last meeting; skipping agenda item.15:02
JayF#topic Review Ironic CI status15:02
JayFTheJulia: did you ever get that grenade-with-debug job to fail?15:02
JayFThat's the only outstanding gate thing right now, yeah? Grenade?15:02
TheJuliayeah, trying to figure out if nova-api is just dying15:03
dtantsurthe rbac job has been turned non-voting15:03
TheJuliaI think we likely need to consider doing the same thing to grenade15:03
dtantsurle sigh, but yeah15:03
JayFOnly for a short time :/ we have another grenade job (skip-level) to get running, too.15:03
JayFfor SLURP15:04
TheJuliaWe likely need to reconsider nova being central to our upgrade testing15:04
TheJulia... As well.15:04
JayFThat sounds like a pretty brutal decision given we have a big nova-compute migration incoming15:04
JayFwhen sharding is complete15:04
dtantsurWe have a bifrost upgrade job, but I wonder if the Nova-Ironic link is actually the weak point that we want to test15:05
dtantsuryep15:05
TheJuliawell, the requirement has always been "however it is done", not mandating the tool precisely15:05
TheJuliaunless slurp codified grenade15:05
TheJuliabut without actual grenade maintenance, I can't see that as being reasonable15:05
TheJuliaI'll put it this way15:06
JayFWhat is the maintenance work we need from grenade?15:06
TheJuliaus running grenade is a convenience upgrade test15:06
dtantsuryou definitely cannot expect me to miss grenade :D15:06
dtantsurwe trapped ourselves with devstack15:07
JayFThe story I see developing, as told by our CI reliability, is that we have a good handle on testing standalone ironic, and a less-good one on testing ironic+nova, generally, with grenade being a pain point15:07
JayFIt's hard for me to think it's grenade/devstack as a tooling problem versus actually implying functional issues without more data15:08
opendevreviewMerged openstack/ironic master: Handle MissingAttributeError when using OOB inspections to fetch MACs  https://review.opendev.org/c/openstack/ironic/+/88057515:08
TheJuliaafaik grenade is basically as needed only, and locks us in to a nova-centric model. Maybe that is fine. dunno.15:08
JayFI don't see us as locked-in to a nova-centric model so much as, two basic columns of integration testing: Ironic "standalone" and openstack integrated15:09
JayFbut either way, we need to make patches mergeable and keep upgrades testable15:09
JayFwe should keep down the path of identifying if the nova-api is dying15:09
JayFbecause if it is, hopefully we can get some help from those associated teams15:09
TheJuliathat is what I'm doing15:09
JayF++ that's what I thought just getting it into the meeting log15:09
JayFGoing to move on15:10
dtantsurWell, among many things that we did right with bifrost, is this one: the production tool is easy to use for testing15:10
TheJuliathe underlying issue is basically people are ignoring the issue or going "it is not my problem"15:10
TheJuliaso... at some point, we're going to be completely blocked unless we can pin it down15:10
JayFTheJulia: I'd request if you're not getting good results in IRC, let's make an explicit request on the list so if there is indifference, it's more visible at a project level15:10
TheJuliaack15:10
JayF#topic Review ongoing 2023.2 workstreams15:11
JayF#link https://etherpad.opendev.org/p/IronicWorkstreams2023.215:11
TheJuliaif nova-api is dying will be a major data point, so time will tell15:11
JayFPlease view and update workstream status. Thank you for putting the DPU information in there, I'll review those patches today15:11
JayF#topic Open Discussion15:12
JayFAnything for open discussion?15:12
JayFI'll note, Ironic cores please check your email and respond to the one I sent last week at some point :)15:12
JayFSounds like nothing for open discussion; closing the meeting. Have a good week all o/15:14
JayF#endmeeting15:14
opendevmeetMeeting ended Mon May  8 15:14:09 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)15:14
opendevmeetMinutes:        https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-05-08-15.00.html15:14
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-05-08-15.00.txt15:14
opendevmeetLog:            https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-05-08-15.00.log.html15:14
JayFTheJulia: lmk if there's something specific I can help out w/the grenade troubleshooting15:15
TheJuliajust waiting at this point, I wasn't able to focus on it on friday due to a multi-hour meeting15:16
sschmitt_I'm seeing this error constantly in my nova-compute-ironic service on a fresh Zed install: https://bugs.launchpad.net/nova/+bug/1910504 Is it safe to ignore?15:16
TheJulia... unrelated, looks like they added a thing and defaulted it to an exception for other cases.15:18
sschmitt_yeah that's what I assumed. It's just filling my logs and I would have assumed it might have for others as well15:19
TheJuliathat method has been there for a very long time too15:20
TheJuliaI don't think I've ever seen that recorded before. :\15:21
sschmitt_I wonder why it's hitting it now15:22
JayFit looks like that method is only called explicitly15:23
JayFwhich makes me wonder what is calling it15:23
JayFlike, AFAICT, nothing is calling get_host_uptime from inside nova except in response to a req for host uptime15:23
TheJuliaAdditionally, for bare metal usage it doesn't exactly map over data wise15:24
TheJulialooks like the present grenade run, with uwsgi turned off passed resource creation15:26
TheJulia... I did so hoping we might see more process logging wise15:26
JayFheisenbug, away!15:28
JayF:| 15:28
sschmitt_hmmm maybe the openstack prometheus exporter is triggering that method15:32
NobodyCamGood morning and Happy Monday Ironic Folks15:36
* TheJulia wonders good morning15:40
TheJuliaJayF: If you're referring to the TV show... I've made it to the candy lady to get some of the "rock" (candy), but have never made it more than a few episodes into the show15:41
JayFI'm referring to https://en.wikipedia.org/wiki/Heisenbug15:42
JayF> In computer programming jargon, a heisenbug is a software bug that seems to disappear or alter its behavior when one attempts to study it.[1] The term is a pun on the name of Werner Heisenberg, the physicist who first asserted the observer effect of quantum mechanics, which states that the act of observing a system inevitably alters its state. In electronics, the15:42
JayFtraditional term is probe effect, where attaching a test probe to a device changes its behavior. 15:42
TheJuliaheh15:42
TheJuliaI was unaware of that reference15:42
sschmitt_Yep, stopping the Prometheus OpenStack Exporter stops those error messages related to get_host_uptime.15:45
JayFsschmitt_: looks like you may need to configure/patch that service to not call that method for Ironic :/15:46
sschmitt_womp womp15:46
JayFI've been thinking about if we could implement that15:46
JayFbut there's really nothing at all we could take to put there15:46
JayFthat would even remotely make sense in context of what it'd mean for actual-hypervisors15:46
sschmitt_Maybe just something so it doesn't return an exception? But yeah something like time since node creation doesn't really make sense either15:49
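A minimal sketch of the "don't call it for Ironic" idea from the consumer side. Everything here is hypothetical: the two driver classes, `collect_uptimes`, and the host names are illustration only and are not nova or exporter code. It also assumes the unsupported driver signals the gap with `NotImplementedError`; the log only says the method raises an exception.

```python
# Hypothetical stand-ins for virt drivers; NOT real nova classes.
class LibvirtLikeDriver:
    name = "kvm-host"

    def get_host_uptime(self):
        # A real hypervisor can report uptime of the host it runs on.
        return "up 3 days"


class IronicLikeDriver:
    name = "ironic-host"

    def get_host_uptime(self):
        # Assumption: bare metal has no meaningful hypervisor uptime,
        # so the driver refuses instead of inventing a value.
        raise NotImplementedError()


def collect_uptimes(drivers):
    """Poll every driver, recording None where uptime is unsupported."""
    uptimes = {}
    for driver in drivers:
        try:
            uptimes[driver.name] = driver.get_host_uptime()
        except NotImplementedError:
            # Skip quietly rather than logging an error on every poll.
            uptimes[driver.name] = None
    return uptimes


result = collect_uptimes([LibvirtLikeDriver(), IronicLikeDriver()])
print(result)
```

Recording `None` keeps the gap visible to whoever reads the metric without substituting a number that means something different from hypervisor uptime.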
TheJuliayeah15:50
TheJuliaThat was where my mind immediately went15:51
sschmitt_or time since last successful node deployment15:51
TheJuliaThat would actually make sense for nova-compute to keep in memory15:52
JayFhow does that map to hypervisor uptime?15:52
JayFif anything; it seems like lifetime of nova-compute service might be closer15:52
TheJuliapossibly, but even with baremetal that loses meaning really.15:54
TheJuliatime since last deploy op *does* sort of make sense, is the proxy working or getting workload15:54
TheJuliaI rechecked grenade, time to see again.15:54
* JayF adds a /v1/uptime to Ironic which returns the first time an ironic service using that DB was ever spun up15:54
JayFlol15:54
TheJuliahehehehe15:54
TheJuliaAdd an rbac policy around that!15:55
dtantsurI think we have registration time for conductors?15:57
dtantsurWe can just expose it if not already :)15:57
TheJuliaI don't think we expose it via the object->api15:59
sschmitt_Anyone who does a vanilla kolla-ansible deployment with both ironic and prometheus enabled will hit this it seems15:59
TheJuliarpittau: Body: b'{"error_message": "{\\"faultcode\\": \\"Server\\", \\"faultstring\\": \\"\\\\\\"baremetal:node:delete:self_owned_node\\\\\\": \\\\\\"role:admin and project_id:%(node.owner)s\\\\\\" requires a scope of [\'project\'], request was made with system scope.\\", \\"debuginfo\\": null}"}' is a bug I think16:09
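The body quoted above is hard to read because `error_message` is itself a JSON document stored as a string, so it has to be decoded twice. A small sketch, assuming only that shape: `decode_fault` is a hypothetical helper, and the sample body is rebuilt with `json.dumps` from the fault text in the log rather than copied with its escaping.

```python
import json


def decode_fault(body):
    """Decode an error body whose error_message is JSON text in a string."""
    outer = json.loads(body)
    message = outer["error_message"]
    try:
        inner = json.loads(message)
    except (TypeError, ValueError):
        inner = None
    if not isinstance(inner, dict):
        # Plain-text message: wrap it so callers always get a dict.
        inner = {"faultstring": message}
    return inner


# Fault text taken from the log line above; the wrapping is reconstructed.
fault_text = (
    '"baremetal:node:delete:self_owned_node": '
    '"role:admin and project_id:%(node.owner)s" requires a scope of '
    "['project'], request was made with system scope."
)
inner_doc = {"faultcode": "Server", "faultstring": fault_text, "debuginfo": None}
body = json.dumps({"error_message": json.dumps(inner_doc)}).encode()

fault = decode_fault(body)
print(fault["faultcode"])
print(fault["faultstring"])
```

Decoded this way, the message reads as a policy rule that requires project scope being checked against a system-scoped request.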
TheJuliarunning tests on fix now16:13
opendevreviewJulia Kreger proposed openstack/ironic master: Fix self_owned_node policy check  https://review.opendev.org/c/openstack/ironic/+/88259716:20
TheJuliaheh, well... looks like grenade is going to pass as uwsgi, again16:58
TheJuliaerr, without uwsgi16:58
TheJuliao/ Nisha_Agarwal 17:00
Nisha_AgarwalTheJulia, o/17:02
opendevreviewJulia Kreger proposed openstack/ironic master: DNM: Attempting to troubleshoot grenade  https://review.opendev.org/c/openstack/ironic/+/88234717:04
opendevreviewJulia Kreger proposed openstack/ironic master: DNM: Change nova-api to not use uwsgi for grenade  https://review.opendev.org/c/openstack/ironic/+/88260717:06
opendevreviewJulia Kreger proposed openstack/ironic master: CI: Change nova-api to not use uwsgi for grenade  https://review.opendev.org/c/openstack/ironic/+/88260717:07
TheJuliaif it passes CI, might as well merge it17:07
TheJulia2x passes in the troubleshooting run, reproposed that without it17:07
TheJuliaif it passes on that change, it will be at a better pass rate than CI has been for us all week17:08
JayFdansmith: I wonder, do you all run nova in uwsgi mode for the !ironic grenade run?17:08
dansmithuwsgi everywhere, AFAIK17:08
TheJuliait is the overall default17:08
JayFthis behavior implies some kind of bug in the uwsgi support, yeah?17:08
TheJuliatoo early to know for sure, we can see cases where neutron responds and it looks like the api grinds to a halt, but we need to see if the API is still actually running in the uwsgi case17:09
TheJulia(even then, I think we run 2 workers by default in that config)17:09
dansmitheach service runs in its own uwsgi though, I'm sure you know17:09
dansmithah I misread I thought you were saying neither nova nor neutron responds17:10
TheJuliawe don't know if nova has stopped responding, but we can see where neutron responds17:11
TheJuliain the logs, the logs end for nova-api17:11
TheJuliaI believe "weird" is an understatement17:12
dansmithI'd certainly be skeptical of running nova-api in a different-from-everywhere-else mode as a solution here of course17:13
JayFdansmith: I think the idea is more, flip the switch as a last-ditch before we mark it non-voting and can hopefully troubleshoot past that17:14
TheJuliaof course17:14
JayFbut I think we're rapidly approaching the end of the rope17:14
JayFgiven TheJulia's last debugging attempt appears to have fixed it17:14
JayF"fixed" lolsob17:14
dansmithis ironic's api running in uwsgi proxied through apache as well?17:14
TheJuliagood question17:15
dansmithjust wondering what else could be causing a difference in behavior (like if apache is getting stuck somewhere and having an additional component is adding pressure)17:15
TheJuliawell, interestingly enough, apache behaves as if there is just a raw timeout17:16
dansmithdid this start when jobs got converted to jammy by chance?17:17
TheJuliaso, Ironic's grenade job *does* run its api in wsgi17:18
TheJuliaso, I think we saw it before, but it has clearly gotten worse17:18
dansmithjust wondering if zed->2023.1 grenade jobs show the problem (on focal) or not.. if not that'd be a data point17:19
TheJuliabefore it was more a timeout behavior with getting the test resource instance booted and occasionally something like this where we never see it get past validate_networks17:19
TheJuliaindeed17:20
dansmithI just can't think of any reason why you all would be the only ones hitting this if it wasn't something related to your job configuration (i.e. having ironic in the mix, or your use of a different/larger image, IIRC)17:20
TheJuliait is cirros in the case of the resource creation17:20
TheJuliait is indeed bizarre17:21
JayFit is nested virt, we might be under more memory pressure17:21
JayFthat should only matter if at the point of breakage we were majorly slow, e.g. significant swapping17:21
JayFwhich seems extremely unlikely17:21
dansmithnested virt? are you using the nested virt nodes with kvm enabled17:22
dansmith?17:22
JayFwhen I say nested virt, I just mean we run Ironic fake-BM-VMs inside our grenade VM17:22
TheJuliawe have no control over if the VMs are enabled with nested virt17:22
JayFI don't mean to imply a specific hardware config17:22
dansmithTheJulia: yes you do17:22
TheJuliawell, we can constrain it17:22
dansmithokay because we have nested virt nodes, and that configuration would be different than what everyone else is on, so just wondering17:23
TheJuliadansmith: I believe only via restricting providers for the jobs17:23
dansmithJayF: typically memory pressure manifests as an OOM, which we hit in lots of other places17:23
JayFyeah, that's my expectation. Just mainly thinking out loud17:23
dansmithTheJulia: no, there's a nodepool label that lets you select nodes that can do it, and flags to enable/allow it in devstack17:23
TheJuliawe've got timeout adjustments embedded for where the devstack vm is fully emulated virt17:23
TheJuliait is a red herring here though17:24
TheJuliabecause we're not even to that point17:24
TheJulianova-compute's spawn is never triggered17:24
dansmiththe devstack vm is not emulated, but the instances created within are17:24
TheJuliaindeed17:24
dansmithyeah17:24
TheJuliadansmith: even on rax?17:25
dansmithyeah devstack in emulated mode would be incredibly slow ;)17:25
TheJuliawell, yes, it is17:25
clarkbrax are xen VMs (pvhvm iirc). Then on top of that you get qemu emulation for the workload cloud's resources17:26
dansmithright17:26
dansmithpvhvm being equivalent to a first-level kvm instance.. hardware virt support with a userspace device model17:27
dansmithdifferent from the emulated guest where it's all fully emulated in userspace17:27
TheJuliaugh, looks like we're also seeing the old dhcp test which was once an absolute rare blue moon sort of thing becoming way more frequent under jammy17:30
* TheJulia goes and works on that while waiting on cI17:30
TheJuliaCI17:30
dansmithfwiw, we (nova) have seen some differing test and service behavior under jammy, but from what I can see, it's likely because py3.10 is so much faster17:32
dansmithlike we hit some race conditions more often than we used to17:32
clarkbit also made thread scheduling more deterministic so if you rely on proper threads a lot you could be exposed to more race conditions due to them existing under the order it ends up scheduling them17:32
TheJuliaThat would align with the unit test in particular that dislikes us :)17:32
opendevreviewJulia Kreger proposed openstack/ironic master: CI: Modify dhcp client ID fail  https://review.opendev.org/c/openstack/ironic/+/88260918:08
TheJuliaIf anyone would like to offer thoughts on extending the step schema -> https://review.opendev.org/c/openstack/ironic/+/880545/4/ironic/api/controllers/v1/node.py it would be appreciated18:19
opendevreviewJulia Kreger proposed openstack/ironic-python-agent master: Fix Bandit errors  https://review.opendev.org/c/openstack/ironic-python-agent/+/87991218:45
opendevreviewJulia Kreger proposed openstack/ironic-python-agent master: Fix nvidia hardware manager url parser to permit https  https://review.opendev.org/c/openstack/ironic-python-agent/+/88141018:45
opendevreviewVerification of a change to openstack/ironic master failed: Remove autocommit, again.  https://review.opendev.org/c/openstack/ironic/+/86283218:58
opendevreviewJulia Kreger proposed openstack/metalsmith master: Update MD5 checksum references  https://review.opendev.org/c/openstack/metalsmith/+/88217019:14
opendevreviewMerged openstack/ironic-lib master: Add jsonrpc client port capability  https://review.opendev.org/c/openstack/ironic-lib/+/87921119:19
*** dmellado9 is now known as dmellado19:43
opendevreviewMerged openstack/python-ironicclient master: Allow several nodes for most node actions  https://review.opendev.org/c/openstack/python-ironicclient/+/87575620:11
TheJuliastevebaker[m]: https://review.opendev.org/c/openstack/ironic/+/88260920:30
JayFstevebaker[m]: I ninja'd in my +2 before you landed it anyway ;)20:34
stevebaker[m]Dueling ninja +2s20:35
opendevreviewMerged openstack/ironic master: CI: Modify dhcp client ID fail  https://review.opendev.org/c/openstack/ironic/+/88260920:48
TheJuliahmm, looks like another test needs to be fixed as well21:03
opendevreviewJulia Kreger proposed openstack/ironic master: CI: Fix another network test  https://review.opendev.org/c/openstack/ironic/+/88261821:11
opendevreviewVerification of a change to openstack/ironic master failed: Remove autocommit, again.  https://review.opendev.org/c/openstack/ironic/+/86283221:31
sschmitt_Is there a reason a node wouldn't get its DHCP record switched away from ignore so that it would deploy? have a node getting to wait-callback and seeing dhcp-discover packets hitting the controller, but dnsmasq still has its ignore record so it's ignoring them21:40
TheJuliasschmitt_: neutron dhcp should be responding, are the packets getting in to the neutron dhcp namespace?21:42
sschmitt_well this is my first time deploying with OVN so that's it21:44
TheJuliado you have neutron-dhcp-agent setup?21:44
TheJuliaor at least present and running?21:44
sschmitt_I see the dhcpdiscover coming into the interface I have set as the interface in the dnsmasq config for inspector21:44
sschmitt_I thought with the newer OVN you didn't need the dhcp agent for ipv4 baremetal nodes, so I don't21:45
opendevreviewHarald Jensås proposed openstack/ironic-tempest-plugin master: Fix rbac indicator tests  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/88261921:45
TheJuliasschmitt_: so, we don't have a job to test that with OVN, because OVN doesn't support ipv6 DHCP configurations needed for Ironic to function, so we generally still point folks to using a neutron-dhcp-agent service21:47
TheJuliain theory, yes, ovn should work with dhcp for ipv421:48
TheJuliaWhat that means configuration wise, I personally don't know.21:48
sschmitt_Ok, I'll try and dive in and see where the disconnect is with ovn and can deploy the dhcp agent as a last resort21:49
TheJuliaokay, ironic-grenade has passed 10x in a row, this seems... odd21:55
TheJuliasschmitt_: any information you can share once you figure things out, would be helpful to aid our collective knowledge21:55
opendevreviewHarald Jensås proposed openstack/ironic-tempest-plugin master: Fix rbac indicator tests  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/88261922:04
opendevreviewHarald Jensås proposed openstack/ironic-tempest-plugin master: Fix rbac indicator tests  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/88261922:05
opendevreviewVerification of a change to openstack/ironic master failed: Remove autocommit, again.  https://review.opendev.org/c/openstack/ironic/+/86283222:07
opendevreviewHarald Jensås proposed openstack/ironic master: Fix api-ref v1-indicators  https://review.opendev.org/c/openstack/ironic/+/88262022:26
JayFTheJulia: in or out of wsgi?22:40
TheJuliaacross the board22:41
JayFStrange. I'll make a note tomorrow morning to "diff" a couple of jobs from today/last week and look at commit logs, see if I can identify a win22:43
TheJulianothing seems inherently different, but it could just be the odds. hopefully things will get better with the other unit test fixes as well22:47
opendevreviewMerged openstack/ironic master: CI: Fix another network test  https://review.opendev.org/c/openstack/ironic/+/88261823:12
