Monday, 2023-09-18

iurygregorygood morning Ironic10:59
TheJuliagood morning13:30
TheJuliaWe should get https://review.opendev.org/c/openstack/ironic-inspector/+/895164/ sorted13:38
TheJuliadtantsur: you around today?14:32
opendevreviewJake Hutchinson proposed openstack/bifrost master: Bifrost NTP configuration  https://review.opendev.org/c/openstack/bifrost/+/89569114:37
ravlewGood morning ironic14:43
ravlewI'm getting an error in stable/yoga CI in bifrost-integration-redfish-vmedia-uefi-centos-814:44
ravlew"/home/zuul/src/opendev.org/openstack/bifrost/scripts/collect-test-info.sh: line 95: /home/zuul/openrc: No such file or directory"14:44
dtantsurTheJulia, I am indeed14:45
ravlewcould anyone help with that :) ?14:45
dtantsurravlew, that's probably not what makes the job fail, look before that14:45
TheJuliadtantsur: ++14:46
TheJuliadtantsur: asking because of review feedback on 89516414:46
dtantsuryeah, I was going to get back to it, but we had some fire-fighting downstream14:47
JayFReleases team put up PRs over the weekend requesting to cut stable/2023.2 branches by no later than Friday.14:47
JayFI think we technically have a little more time than that, but I don't want us to overflow and cause them crunch time if we can help it14:47
JayFSo whatever the priorities are, we need to cut releases soon so we need to resolve that patch one way or another14:48
TheJuliayeah, they will eventually just change devstack and we will have no choice if we don't do it before they force our hand14:48
* JayF notes he's not super keen on just ignoring inspector grenade failures and hiding them :/14:48
dtantsurI'm by no means keen on that, just don't realistically have time/energy to debug it this week14:49
JayFI am going to propose at the meeting we cut the release very soon and backport redfish firmware as a pseudo-FFE (we don't do FF, so FFE doesn't make sense) if it's done before 2023.2 release date14:49
dtantsurmakes sense to me14:49
JayFdtantsur: TheJulia: Any hints on Inspector? 14:49
JayFI'm not super experienced with those jobs or the service in general, but I can try to fix the grenade job for a little bit today 14:50
dtantsurI don't believe the grenade failure is caused by the current Ironic work, but I can try double-checking14:50
JayFLet me put it this way; it's my personal belief unless someone tells me otherwise that we have not manually tested upgrades of inspector14:50
JayFso that means with the gate job not working we'd be applying liberal quantities of "hope" and "assumption" to that upgrade working which seems not-great for purposes of our users14:51
JayFespecially upstream end-users who may not have another layer of QA between them and a release14:51
TheJuliablowing out on nova resource creation14:51
JayFdnsmasq startup explodes, something else listening on 5314:53
TheJuliano, it is expecting cirros to be sitting around14:53
ravlewthanks dtantsur I'll check it out14:53
TheJuliaand changes got made there in devstack at some point14:54
JayFI think your brain is about a mile ahead of me rn :)14:54
JayFmakes sense that > 2023-09-14 15:19:07.382906 | controller | Sep 14 15:19:07 np0035253989 dnsmasq[57991]: dnsmasq: failed to create listening socket for 127.0.0.1: Address already in use 14:54
dtantsurhmm, I thought I fixed cirros as part of https://review.opendev.org/c/openstack/ironic-inspector/+/89516414:54
JayFis OK because we appear to start it as devstack@ironic-inspector-dhcp14:55
TheJuliaoh, okay14:55
dtantsuryeah, I'd assume it's fine14:55
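As an aside, the "failed to create listening socket ... Address already in use" condition above is easy to probe in isolation. A small sketch (a hypothetical helper, not part of any devstack job) that reports whether a TCP port is already bound, hitting the same EADDRINUSE dnsmasq logs:

```python
import errno
import socket

def port_free(port, host="127.0.0.1"):
    """Return True if (host, port) can be bound, False on EADDRINUSE."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False  # another listener (e.g. a resolver) owns the port
        raise
    finally:
        s.close()
```

Against the log above, port_free(53) run as root on the node would return False while the devstack@ironic-inspector-dhcp unit holds the port, which is why the second dnsmasq start explodes.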
JayFa lot of > RC_DIR: unbound variable # which I'm assuming is probably OK?14:56
TheJuliaso we need https://review.opendev.org/c/openstack/ironic-inspector/+/895164 to not post_fail basically14:57
TheJuliaand look at the inspector log to understand what is going on14:57
dtantsuryeah, I'm looking at the previous run14:57
JayFI'm reading the output from the original job right now, trying to get some kind of a baseline since this might be one of the first inspector grenade jobs I've looked at14:58
JayFI'll resume after my morning meetings (in 2 minutes, then a chat with kubajj after)14:58
dtantsurCMD "lshw -quiet -json" returned: 0 in 23.572s14:59
dtantsurthe ramdisk logs just stop at some point, interesting..15:00
JayF#startmeeting ironic15:00
opendevmeetMeeting started Mon Sep 18 15:00:08 2023 UTC and is due to finish in 60 minutes.  The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot.15:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.15:00
opendevmeetThe meeting name has been set to 'ironic'15:00
JayFWelcome to the Ironic team meeting! This meeting is held under OpenInfra Code of Conduct available at: https://openinfra.dev/legal/code-of-conduct15:00
JayFAgenda is available at https://wiki.openstack.org/wiki/Meetings/Ironic15:00
dtantsuro/15:00
kubajjo/15:00
iurygregoryo/15:00
JayF#topic Announcements/Reminder15:00
JayF#note Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio: https://tinyurl.com/ironic-weekly-prio-dash15:00
JayFWe will be listing patches later which we want landed before release, please ensure you look and help with those, too.15:00
JayF#note PTG will take place virtually October 23-27, 2023! https://openinfra.dev/ptg/15:01
JayF#link https://etherpad.opendev.org/p/ironic-ptg-october-202315:01
JayFAfter release activities are complete, I'll setup a short mailing list thread and maybe sync call for us to pare down that list into items we think we should chat about, so please get your items in there.15:01
JayFThat's all our standing announcements; release is incoming but I made that a separate topic15:02
JayFNo action items outstanding; skipping agenda item.15:02
JayF#topic Bobcat 2023.2 Release15:02
JayFSo essentially, my intention is to cut stable/2023.2 releases this week if possible. The sooner the better just because I don't want other teams waiting on us if we can help it.15:03
JayFThere are two items we've already identified as wanting to land before release:15:03
JayF* Redfish Firmware Interface https://review.opendev.org/c/openstack/ironic/+/88542515:03
JayF* IPA support for Service Steps https://review.opendev.org/c/openstack/ironic-python-agent/+/89086415:03
JayFIPA Service Steps landed late Friday; so that's good news \o/15:03
JayFRedfish firmware I believe still needs a revision from iurygregory 15:04
JayFplus the inspector-related CI shenanigans talked about before the meeting15:04
JayFAre there any other pending changes we'd like to ensure make Bobcat?15:04
iurygregoryI'm making changes and re-testing things to make sure they are working (still trying to figure out the DB part, almost there I think)15:04
JayFIf there are no other pending changes for Bobcat; I'd like to propose that we begin to cut releases (starting with !Ironic first, to give more time), and if we cut Ironic stable/2023.2 before iurygregory's changes land, that we permit them to be backported as long as they are completed before release finalizes (even though that may mean we can't cut a release with them until15:05
JayFa week after or so)15:05
JayFWe generally don't practice FF here, so calling it an FFE doesn't make sense; but I don't want the whole release hanging on a change when it might hold up other bits of the stack.15:06
JayFWDYT?15:06
TheJuliaother bits of the stack?15:06
JayFlike releases team work15:07
JayFrequirements et15:07
JayFI don't want us to hold up any of the common work that has to happen15:07
TheJuliawell, service projects don't go into requirements15:07
JayFI don't think it's crazy of me to suggest that we have the final Ironic release done a week before the marketing-deadline for said release?15:07
TheJuliaI don't either, but it feels like you're pushing for now() as opposed to in a few days15:08
JayFTheJulia: I'm saying I'll go create PRs for stuff that's done now, and start walking down the list. We have dozens of these and I usually manually review the changes for the final release.15:08
TheJuliawe can absolutely cut the !ironic things and then cut ironic later in the week15:09
JayFTheJulia: so I'm not just like, automating git sha readouts into yaml files15:09
JayFI don't wanna get that process started too late, which is why I want to start now and have the freedom to do ironic e.g. Thurs or Friday15:09
TheJuliaI understand that15:09
JayFless now() and more max(week) 15:09
TheJuliaso what is the big deal then? lets do the needful and enable15:09
TheJuliaI'd say EOD Wednesday15:10
TheJuliabecause release team won't push button on friday15:10
JayF++ that sounds pretty much exactly like what I had in mind15:10
JayFwanted it done before Friday-europe15:10
TheJuliafor Ironic unless we have full certainty that we can solve it early thursday morning before release team disappears15:10
JayF++15:10
JayF#agreed Ironic projects will begin having stable/2023.2 releases cut. Projects with pending changes (Ironic + Inspector) have at least until Wednesday EOD to land them.15:11
JayFAnything else related to 2023.2 release?15:12
JayF#topic Review Ironic CI Status15:13
JayFAFAICT, things look stable-ish. Just have that POST-FAILURE for Inspector grenade to figure out.15:13
dtantsurI think the POSTFAILURE itself is less of a problem15:14
JayFYeah, the breakage is earlier/in our code which is different than a postfailure usually indicates15:14
JayFbut either way it's the only outstanding CI issue I'm aware of15:15
TheJuliayeah, read timeouts against the api surface15:15
TheJuliacould entirely be environmental15:15
TheJuliawe just need more logs to confirm that or not15:16
JayFwe'll figure it out, if there's nothing else I'll move on so I can get back to helping with that :D 15:16
JayF#topic Branch Retirement to resolve zuul-config-errors15:16
JayFI'm going to execute on this, probably today if inspector CI doesn't eat the day -> https://lists.openstack.org/pipermail/openstack-discuss/2023-August/034854.html15:16
JayFtake notice15:16
JayF#topic RFE Review15:17
JayFThere was an RFE spotted earlier this week, Julia and I discussed in channel, I already tagged it as approved15:17
JayFposting here for documentation/awareness15:17
JayF#link https://bugs.launchpad.net/ironic/+bug/2034953 -- adding two fields to local_link_connection schema to allow physical switch integrations with OVS15:17
TheJuliahjensas: you might find ^ interesting15:18
JayFBasically it seems adding two fields to our local_link_connection gets us the win of supporting OVN-native switches15:18
JayFwhich is a wonderful effort:value ratio15:18
dtantsurIndeed. How popular are these?15:19
JayFI learned they exist when I read bug 2034953 ;)15:19
dtantsurSame :D15:19
JayFtwo fields to get free support for something from neutron sounds great though15:20
JayFand is the exact kind of good stuff we get from being stacked sometimes :D 15:20
TheJuliaI don't know... There was some OVS enabled for OpenFlow switches years ago, whitebox sort of gear AIUI, I'm guessing this might just be an evolution15:20
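For context, local_link_connection is a dict on the Ironic port object; the three keys below are the long-standing documented ones (the two additional keys proposed in bug 2034953 are intentionally not guessed at here, and the values are illustrative):

```python
# Sketch of an Ironic port's existing local_link_connection schema.
# switch_id, port_id and switch_info are the real, documented fields;
# the RFE discussed above adds two more keys for OVN-native switches.
port = {
    "local_link_connection": {
        "switch_id": "0a:1b:2c:3d:4e:5f",  # switch identifier (often a MAC)
        "port_id": "Ethernet3/1",          # identifier of the switch port
        "switch_info": "switch1",          # free-form switch description
    }
}
```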
JayFI don't hear any objection; so I'm going to consider this one to remain approved. Probably a low-hanging fruit for someone to knock out (I have an MLH fellow starting in a couple of weeks, if we want to save this I can use it as an onboarding task)15:21
JayF#topic Open Discussion15:21
JayFAgenda is done; anything else15:21
JayFLast chance?15:23
JayF#endmeeting15:23
opendevmeetMeeting ended Mon Sep 18 15:23:47 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)15:23
opendevmeetMinutes:        https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-09-18-15.00.html15:23
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-09-18-15.00.txt15:23
opendevmeetLog:            https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-09-18-15.00.log.html15:23
TheJuliaHas anyone seen "Incompatible openstacksdk library found: Version MUST be >=1.0 and <=None, but 0.101.0 is smaller than minimum version 1.0." before?15:43
dtantsurTheJulia, ansible modules?15:45
TheJuliaYeah, I guess :\15:46
TheJuliaI'm guessing we're getting too new ansible on stable branches15:46
dtantsuryep, they have two branches, 2.0 is only compatible with the (future?) openstacksdk 1.015:46
TheJuliayeah, that gets kicked out when metalsmith's deployment task triggers15:50
TheJuliadtantsur: was that change somewhere in the zed -> 2023.1 timeframe?16:03
JayFI believe so16:05
TheJuliahmm so why is this breaking on zed then16:05
JayFOSA versions not locked?16:06
JayFOSDK version not locked?16:06
TheJuliano OSA16:07
frickleractually sdk >=0.99 should work for the openstack collection, but maybe they're being extra safe16:07
TheJuliaOSDK seems appropriate16:07
TheJuliaso yeah, it is unlocked ansible16:09
TheJulia2.15.3 on a backport which was released mid august16:09
TheJuliaand we use the host ansible16:10
TheJuliaMy guess is the newer collection gets pulled in but with the older sdk, and then things explode16:21
TheJuliawould pinning back ansible be reasonable?16:21
TheJuliaI have no idea if that would work16:21
frickleriiuc you'd need sdk < 0.99 for that16:40
TheJuliaso 0.101.0 is incompatible in general?16:46
opendevreviewJulia Kreger proposed openstack/metalsmith stable/zed: DNM: Test constrainting ansible version  https://review.opendev.org/c/openstack/metalsmith/+/89570316:51
jrosserthere is a description of the compatibility here https://galaxy.ansible.com/openstack/cloud16:57
frickler0.99.0 should work with the newer collection, but it seems ansible decided to place the bar at 1.0 instead, which is not unreasonable. and the choice of 0.99.0 was a bad decision in retrospect in which I do have some responsibility myself, so sorry for that17:00
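frickler's point can be seen mechanically: dotted version components compare numerically from the left, so 0.101.0 sits below 1.0 despite the large-looking minor number. A stand-alone sketch (plain tuple comparison, not openstacksdk's or ansible's actual version parser):

```python
# Minimal sketch: dotted versions compare component by component,
# numerically, left to right.
def parse(version):
    return tuple(int(part) for part in version.split("."))

# 0 < 1 on the major component, so the rest is irrelevant:
assert parse("0.101.0") < parse("1.0")
# ...while within the 0.x series, 101 > 99:
assert parse("0.99.0") < parse("0.101.0")
```

This is exactly why the collection's ">=1.0" bar rejects the 0.101.0 pinned in upper-constraints.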
TheJuliaexcept we have 0.101.0 in upper-constraints on zed18:01
TheJulia*so* the inspector issue is basically we can't find the record in the db18:01
JayFAs emailed on the list; branch EOLs requested to clean up our zuul config errors18:11
frickler\o/18:12
JayFTheJulia: can I help? I don't want to dupe effort if you're actively looking at anything18:12
JayFTheJulia: it looks like tempest and grenade might be failing in similar ways?18:26
TheJuliamaybe, I'll check in a moment18:26
opendevreviewJulia Kreger proposed openstack/ironic-inspector master: DNM: Collect additional failure information  https://review.opendev.org/c/openstack/ironic-inspector/+/89572718:27
JayFTheJulia: I'll note; the change in progress pins cirros to 0.6.2 and ironic pins to 0.6.118:29
JayFunsure if related but it's suspicious18:29
TheJuliahttps://review.opendev.org/c/openstack/metalsmith/+/895703/1/test-requirements.txt <-- seems to work for metalsmith \o/18:29
* JayF has been going the route of looking at CI-related commits in Ironic looking for things that coule break inspector or needs to be updated for inspector to work18:29
JayF\o/18:29
TheJuliathe version pin doesn't matter in this case18:29
JayF> 2023-09-18 15:42:34.321739 | controller | Details: Fault: {'code': 500, 'created': '2023-09-18T15:42:32Z', 'message': 'Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 4d27add4-8a9d-4209-a4ea-746f58e86e8a.'}. Request ID of server operation performed before checking the server status18:30
JayFreq-92fe8a05-b6a1-4061-a152-fa5fed5e0646.18:30
JayFTheJulia: I'll note: sharding is still mid-revert in Nova; I don't think this should impact inspector jobs but it looks like some of them are landed, some are not18:33
TheJuliaso grenade blows up because for some reason we don't seem to have record of the actual node18:33
JayFhttps://review.opendev.org/c/openstack/nova/+/894946 is in the gate now18:33
TheJuliaI think that *might* actually be a break18:33
JayFI'm looking at the tempest job18:34
JayFtrying to figure that piece out under the hope(?) it's related but simpler18:34
JayFTheJulia: > Sep 18 15:41:43.807251 np0035284146 ironic-conductor[106327]: DEBUG ironic.drivers.modules.agent_client [req-9ed5070c-733d-49fc-86c7-a7d670fbde21 req-1809a5fb-b49d-46a4-b517-ebc6f7f575f3 None None] Status of agent commands for node dc7624b9-0cf9-4a24-ad4b-ddfa7a28039d: get_deploy_steps: result "{'deploy_steps': {'GenericHardwareManager': [{'step':18:36
JayF'erase_devices_metadata', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False}, {'step': 'apply_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'argsinfo': {'raid_config': {'description': 'The RAID configuration to apply.', 'required': True}, 'delete_existing': {'description': "Setting this to 'True' indicates to delete existing18:36
JayFRAID configuration prior to creating the new configuration. Default value is 'True'.", 'required': False}}}, {'step': 'write_image', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False}, {'step': 'inject_files', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'argsinfo': {'files': {'description': "Files to inject, a list of file structures with18:36
JayFkeys: 'path' (path to the file), 'partition' (partition specifier), 'content' (base64 encoded string), 'mode' (new file mode) and 'dirmode' (mode for the leaf directory, if created). Merged with the values from node.properties[inject_files].", 'required': False}, 'verify_ca': {'description': 'Whether to verify TLS certificates. Global agent options are used by default.',18:36
JayF'required': False}}}]}, 'hardware_manager_version': {'generic_hardware_manager': '1.2'}}", error "None"; execute_deploy_step: result "{'deploy_result': {'result': 'prepare_image: image (23902a7f-65c7-4e87-80de-6a993829398a) written to device /dev/vda root_uuid=2d360c20-c65e-4d5c-b8ce-c9196635667a'}, 'deploy_step': {'interface': 'deploy', 'step': 'write_image', 'args':18:36
JayF{'image_info': {'id': '23902a7f-65c7-4e87-80de-6a993829398a', 'urls': ['https://173.231.255.172:8080/v1/AUTH_03382fe70b254cb88c398793986c28b3/glance/23902a7f-65c7-4e87-80de-6a993829398a?temp_url_sig=5366935da2ec9e8511e69501943eb08decccd113986c0c196665520b26cb9448&temp_url_expires=1695054998'], 'disk_format': 'raw', 'container_format': 'bare', 'stream_raw_images': True,18:36
JayF'checksum': 'b0d4ad249188e59ffbdfa5e10054fea0', 'os_hash_algo': 'sha512', 'os_hash_value': '1913006d9f40852f615542d15ae1eda2bf5fe1b067940ea8b02bdd4c6e2e5f04b077462f003c4ce436b7f6dc8720b80effcc54d6cef3425f2792b4dd3e8c2aa9', 'node_uuid': 'dc7624b9-0cf9-4a24-ad4b-ddfa7a28039d', 'kernel': None, 'ramdisk': None, 'root_gb': '4', 'root_mb': 4096, 'swap_mb': 0, 'ephemeral_mb': 0,18:36
JayF'ephemeral_format': None, 'configdrive': '***', 'preserve_ephemeral': False, 'image_type': 'partition', 'deploy_boot_mode': 'bios', 'boot_option': 'local'}, 'configdrive': '***'}}}", error "None"; get_partition_uuids: result "{'partitions': {'configdrive': '***', 'root': '/dev/vda2'}, 'root uuid': '2d360c20-c65e-4d5c-b8ce-c9196635667a', 'efi system partition uuid': None}",18:36
JayFerror "None"; install_bootloader: result "None", error "{'type': 'CommandExecutionError', 'code': 500, 'message': 'Command execution failed', 'details': 'Installing GRUB2 boot loader to device /dev/vda failed with Unexpected error while running command.\nCommand: chroot /tmp/tmpl2skmrue /bin/sh -c "mount -a -t vfat"\nExit code: 127\nStdout: \'\'\nStderr: "chroot: failed to18:36
JayFrun command \'/bin/sh\': No such file or directory\\n".'}" {{(pid=106327) get_commands_status /opt/stack/ironic/ironic/drivers/modules/agent_client.py:346}}18:36
TheJuliadude...18:36
JayFwow I had no idea that was that long18:36
TheJuliaseriously, paste18:36
JayFputting it in a pastebin18:36
JayFsorry about that18:36
JayFmy client warns me about >1 line18:37
JayFapparently it doesn't warn about 1 line that wraps to 10 lines18:37
JayFso I wasn't careful :( 18:37
JayFhttps://gist.github.com/jayofdoom/225fafa59106f71f085701a7b3c0c16f18:37
JayFthat looks like the provisioning failure that is breaking tempest18:37
JayFIDK if it's related to the grenade fail, but there it is18:38
JayFI'll note that I wonder how https://github.com/openstack/ironic-inspector/blob/master/devstack/plugin.sh#L150 interacts with https://github.com/openstack/ironic/blob/88fd22de796b8b936287ee0e39fed6a0bcf3b604/devstack/lib/ironic#L3351 in terms of ordering18:42
TheJuliaso looks like ironic/inspector on the non-standalone tempest job self-aborts the inspection18:43
JayFdue to the error I spammed across IRC, yeah?18:43
JayFwell, that's happening during imaging18:44
TheJuliauhhh shouldn't be anywhere near that18:44
JayFbut it implies an environmental problem that could be similarly impacting18:44
TheJuliathat error you pasted is typical cirros18:44
JayFoh, really? 18:44
TheJuliacirros has no actual bootloader or contents18:44
TheJuliaso bootloader deployment will *always* fail18:44
TheJuliaunless it is made to be present there18:45
JayFIt looks like that failure piped all the way back thru to Ironic, which is why I thought it was meaningful18:45
JayFbut I trust you have more context on this than I do18:45
JayFis there value in us getting on a call or something? maybe make sure we're on the same wavelength?18:45
TheJuliaI'm not sure we are18:45
TheJuliagive me a few to keep digging18:45
JayFI am almost 10000% certain we aren't :D 18:45
TheJuliaokay, different node than I was looking at for the failing test, i was looking at the other test18:46
TheJuliaoh, so in the non-standalone one, it is deploying a node18:47
JayFI'll note there's a power control failure in https://1e3a584a444c8ece91d9-a7e38d5d296143dfa7c720fa849f5cad.ssl.cf5.rackcdn.com/895164/2/experimental/ironic-inspector-tempest-managed-non-standalone/a0836f7/controller/logs/screen-ir-cond.txt too18:48
TheJuliahttps://paste.opendev.org/show/blwPDRI2U06pUgQ3zxZj/ right ?18:50
JayFthat's the original error I spammed into the channel18:50
JayF15:42:08.16598118:50
JayFis the power control failure18:50
TheJuliaso yeah18:51
TheJuliathe failure I'm seeing is we're trying to deploy a cirros partition image18:51
TheJuliawhich *is* empty by default18:51
TheJuliaand thus the CI job fails18:51
JayFthat matches what I saw, too18:51
JayFwhich leads to the question: how the hell did that ever work? did older cirros not have empty there?18:52
TheJuliahttps://github.com/openstack/ironic/blob/master/devstack/tools/ironic/scripts/cirros-partition.sh18:55
opendevreviewMerged openstack/ironic-prometheus-exporter master: CI: Remove ubuntu focal job  https://review.opendev.org/c/openstack/ironic-prometheus-exporter/+/89401618:55
opendevreviewMerged openstack/ironic-prometheus-exporter master: tox: Remove basepython  https://review.opendev.org/c/openstack/ironic-prometheus-exporter/+/89031418:55
JayFooh18:55
TheJuliaso lets see18:57
TheJuliathat got uploaded as cirros-0.6.1-x86_64-partition18:57
TheJuliawhich is the image18:58
TheJuliaso... only guess I have is the packing fails18:58
JayFagain I note, we're pinning to 0.6.2 in inspector in that change18:59
TheJuliathat independently grabs cirros19:00
JayFand it uploads a full disk 0.8.219:00
JayFs/8/6/19:00
JayFyeah I see19:00
JayFdamn19:01
JayFaha19:01
TheJuliaThat, I don't think is anything related to ironic changes with inspector merge19:01
JayFthat is where the "RC_DIR: unbound variable" errors pop out19:01
TheJuliathe tempest job, sure looks like it is failing in super unexpected ways19:01
JayFwhich I suspect I'll go look at a passing Ironic job and see the same /me verifies19:01
TheJuliaerr19:01
TheJuliagrenade tempest job19:01
JayFyep, angry RC_DIR errors in passing Ironic jobs, so likely unrelated19:03
JayFI'm going to change lanes to the grenade job19:04
JayFbut tempest being so broken implies to me there might be a base level environmental thing going on? IDK19:04
TheJuliayou'd need to hold a node to check at this point19:05
TheJuliabecause fundamentally, it looks like we're getting a bogus disk image19:06
JayFTheJulia: so that script, I believe, is pulling in 0.6.2 even though it's labelled 0.6.1 https://github.com/openstack/ironic/blob/master/devstack/tools/ironic/scripts/cirros-partition.sh respects CIRROS_VERSION19:07
JayFTheJulia: so I think the most obvious change to move forward with is s/0.6.2/0.6.1/g in that pin, just to eliminate a variable19:08
TheJuliathe logs I have say 0.6.119:08
JayFthe name is set to 0.6.119:08
JayFregardless of what the script uses19:08
JayFbased on my reading of the devstack logs, the partition script, and the outputs/inputs19:08
TheJuliagets worse, 0.6.2 and 0.6.1 19:09
TheJuliawhat a mess19:09
JayFthe output name is passed directly to cirros-partition.sh so we have it wired up to lie to us19:09
JayFlets make that aligned and see if a more clear err pops out19:09
JayFI can make the edit if you're +1 just don't wanna trample changes or running CI jobs?19:09
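The label/content drift JayF describes reduces to this (illustrative values; the real ones live in the inspector devstack settings and cirros-partition.sh):

```python
# Sketch of the mismatch discussed above: one variable drives what the
# script downloads and repacks, while the upload name is hardcoded, so
# the image label can lie about the image contents.
CIRROS_VERSION = "0.6.2"                        # what the script fetches
UPLOAD_NAME = "cirros-0.6.1-x86_64-partition"   # hardcoded label

labelled = UPLOAD_NAME.split("-")[1]
mismatch = labelled != CIRROS_VERSION
assert mismatch  # aligning the two is the "eliminate a variable" fix
```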
opendevreviewHarald Jensås proposed openstack/ironic master: redfish_address - wrap_ipv6 address  https://review.opendev.org/c/openstack/ironic/+/89572919:09
TheJuliago ahead and make a change, since I'm bouncing between several different things and I have a meeting in a few minutes19:09
JayFI'm going to do this change, then I have an important meeting with my local friendly assassin, who is increasingly threatening me if I don't provide kibble ;)19:10
JayFI think it was consistent19:11
JayFit was inconsistent *between the tempest and the grenade job*19:11
JayFbut it was consistent within the job19:11
opendevreviewJay Faulkner proposed openstack/ironic-inspector master: Update the project status and move broken jobs to experimental  https://review.opendev.org/c/openstack/ironic-inspector/+/89516419:12
JayFI'm going to let that run and get lunch for myself and cats; I will be pointing my brain in this direction until I'm outta steam or hours in the day today19:13
opendevreviewJulia Kreger proposed openstack/metalsmith stable/zed: Constrain the upper Ansible version  https://review.opendev.org/c/openstack/metalsmith/+/89570319:18
opendevreviewJulia Kreger proposed openstack/metalsmith stable/zed: stable-only: Constrain the upper Ansible version  https://review.opendev.org/c/openstack/metalsmith/+/89570319:18
TheJuliaJayF: ^^^ needed so I can unbrick the ipa stable branches19:19
TheJuliasince they are wedged due to the too new openstacksdk/ansible issues19:19
JayF+2a19:20
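The shape of the fix is an upper bound in metalsmith's test-requirements.txt; a hedged sketch follows (the exact cap is whatever review 895703 settles on and is not reproduced here; the ansible 8.x series is the one shipping ansible-core 2.15, flagged above as too new for the old sdk):

```text
# test-requirements.txt (hypothetical sketch; see review 895703 for the
# real bound). Cap ansible below the series whose openstack.cloud
# collection requires openstacksdk>=1.0.
ansible<8.0
```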
opendevreviewHarald Jensås proposed openstack/ironic-inspector master: Handle bracketed IPv6 redfish_address  https://review.opendev.org/c/openstack/ironic-inspector/+/89573419:20
opendevreviewMerged openstack/metalsmith stable/zed: stable-only: Constrain the upper Ansible version  https://review.opendev.org/c/openstack/metalsmith/+/89570319:37
TheJuliawoot20:31
opendevreviewJulia Kreger proposed openstack/metalsmith stable/yoga: stable-only: Constrain the upper Ansible version  https://review.opendev.org/c/openstack/metalsmith/+/89567120:31
opendevreviewJulia Kreger proposed openstack/metalsmith stable/xena: stable-only: Constrain the upper Ansible version  https://review.opendev.org/c/openstack/metalsmith/+/89567220:31
opendevreviewJulia Kreger proposed openstack/metalsmith stable/xena: stable-only: Constrain the upper Ansible version  https://review.opendev.org/c/openstack/metalsmith/+/89567220:33
opendevreviewJulia Kreger proposed openstack/metalsmith stable/wallaby: stable-only: Constrain the upper Ansible version  https://review.opendev.org/c/openstack/metalsmith/+/89567320:33
opendevreviewJulia Kreger proposed openstack/ironic-inspector master: DNM: Collect additional failure information  https://review.opendev.org/c/openstack/ironic-inspector/+/89572720:44
JayFI'm looking at any differences between Ironic and inspector grenade20:46
JayFironic has ipa in required-projects20:47
JayFI don't think that matters tho b/c build_ramdisk is false20:47
JayF          INSTANCE_WAIT: 12020:48
JayF          MYSQL_GATHER_PERFORMANCE: False20:48
JayFboth missing from inspector grenade as well20:48
JayFthose actually make me ponder if they could be impacting20:48
JayFwill add those in the next update for inspector after I get experimental results20:48
JayFI guess if tempest is failing too, it implies that the problem is deeper20:51
JayFwell, integrated tempest20:51
iurygregoryfinally the power outage is over...21:12
iurygregoryI'm back21:12
TheJuliaiurygregory: welcome back21:35
iurygregoryTheJulia, tks!21:36
iurygregorynow time to work :D 21:37
iurygregorythe funny thing is that I couldn't participate in a meeting about the customer case I've been working on since Friday 21:37
TheJuliadoh21:39
JayFiurygregory: IDK if you were here for it, but we set EOD Wednesday as a tentative deadline for cutting Ironic releases.21:49
iurygregoryack21:50
JayFiurygregory: of course, I'm now racing that deadline, too, to try and get passing CI on inspector so hey :) we can form a club21:50
iurygregoryI wasn't, I lost connection while we were talking about it, haven't checked logs yet21:50
iurygregoryJayF, perfect :D21:50
opendevreviewIury Gregory Melo Ferreira proposed openstack/ironic master: RedfishFirmware Interface  https://review.opendev.org/c/openstack/ironic/+/88542521:52
iurygregoryone more round to test :D21:52
* iurygregory updates bifrost to get the latest patchset 21:52
JayFlooks like the same failure mode re: inexplicably broken image for inspector with my change22:08
JayFI have infra holding the next failure so I can look first thing tomorrow22:08
TheJuliaso grenade didn't fail the same way22:22
TheJulialooks like keystone went on vacation22:22
JayFTheJulia: looking at the conductor logs, we get the same partition-image-based error22:23
JayF(or was that in the tempest job)22:23
JayF> Sep 18 21:14:35.428023 np0035285949 ironic-conductor[220906]: ERROR ironic.drivers.modules.inspector.interface [None req-a2e5dec5-9f04-4a09-9356-e791f2979ce1 None None] Inspection failed for node 67265dca-e1f5-4a7f-b2cd-dcde31799b96 with error: Introspection timeout22:25
TheJulialets turn off the db stats stuff22:27
JayFoh, this actually maybe fits22:27
JayFI have a question for you22:27
JayFI've been looking for devstack-plugin-evidence that we are getting right dnsmasq version in inspector22:27
TheJuliaok22:27
JayFsince the newer one was crashy in ironic jobs22:27
JayFI dumped that method of thinking because I assumed the standalone jobs would be blowing up too22:28
JayFbut it occurs to me that we might have different mechanisms of interacting with dhcp that are only breaky when it's under neutron22:28
JayFtl;dr: do we need to ensure downgrade_dnsmasq runs on inspector devstack22:28
TheJulia....22:29
TheJuliaI'm struggling to grok how your getting to think dnsmasq is the root cause22:29
TheJuliaWould talking through it be helpful?22:30
JayFIt is a leftover from a troubleshooting technique I applied earlier: Try to find fixes landed this cycle to Ironic CI that might apply to inspector CI22:30
JayFsince the grenade job does not inherit from ironic jobs like some of the tempest jobs do22:30
TheJuliaso on grenade, I can see inspector did seemingly go out for vacation, and I remembered we saw similar pauses on the upgrade stuffs22:31
TheJuliawhich is why we disabled it on the ironic grenade job22:31
JayFdo we want to apply INSTANCE_WAIT: 120 as well?22:31
* JayF JFDI22:33
opendevreviewJay Faulkner proposed openstack/ironic-inspector master: Update the project status and move broken jobs to experimental  https://review.opendev.org/c/openstack/ironic-inspector/+/89516422:33
* TheJulia shrugs on the instance wait22:33
JayFI did it, figured I'd rather have a passing job and remove something22:33
JayFoooh TheJulia  I might have found something22:37
TheJulia?22:37
JayFhttps://zuul.opendev.org/t/openstack/build/3e8021686f884ae8b4c5e2b248138158/log/controller/logs/ironic-bm-logs/node-3_console_log.txt#333122:37
JayFTheJulia: almost pasted the line but apparently I am capable of learning and improvement :P 22:37
TheJuliahttps://paste.opendev.org/show/b052qhfXQ8PL9zrryTsb/ <-- grenade job did literally pause22:38
JayFholy cow22:39
JayFthat is nontrivial22:39
TheJuliabut look further down at 22:09:08 OSError22:39
JayFBRB in 8 minutes22:39
JayFthat is extremely strange22:39
JayF(the BRB was a joke related to the pause if it's not clear)22:39
JayFdid the FS literally go R/O during the run? Are we filling up the disk?22:40
JayFheh that's silly, a full disk we wouldn't get the log written about it22:41
TheJuliano, ironic was doing some stuff22:41
TheJuliathat seems like thread locked, guessing related to the db since db counter is the last thing to do anything22:41
TheJuliawhich is the same issue we saw on ironic22:41
TheJuliamy logs are from the change I put up to get some more debug logging22:42
JayFhopefully check experimental with my change passes22:42
TheJuliayours is the patch before22:42
JayFyeah, I just added the bits to the yaml to turn off perf counters22:42
JayFI have a meta-question about this: assuming we get grenade/integrated tempest job passing; should we leave them in the queue or put them back in exp?22:43
JayFI think if we get it passing it doesn't hurt to keep it on the change but IDK22:43
TheJuliaOh, I think I see what is going on there22:44
TheJuliaerr, maybe not22:45
TheJuliait might also be slow CI nodes, looks like the inspection timed out right before the post of the payload to inspector22:46
JayFthe mysql performance counters bump will help for that22:46
TheJuliabut that shouldn't result in the failure to find the node22:46
TheJuliasince it is not constrained22:46
TheJuliaas for leaving in the queue dunno, my impression right now is we're struggling to prove we didn't break ironic/inspector integration22:47
JayFyeah that's my impression, too22:47
JayFand it almost slipped off the radar because that service has been ignored in favor of getting it into ironic22:48
TheJuliaand if we didn't, then... we're chasing red herrings22:48
TheJuliaWell, that has been the case for years22:48
JayFwe're literally paying the price for us not completing that migration, in hours :(22:48
TheJuliawe basically had to do major DB API updates this cycle *because* of a lack of attention22:48
JayFyeah22:48
TheJuliacould be maybe we got list_nodes_by_attributes wrong too22:48
TheJuliadunno22:48
JayFI think I'm mostly done with digging logs on this for the day22:48
JayFI'll look at a node tomorrow and that'll help22:48
TheJuliaokay, I guess I did the db stuffs last cycle22:51
TheJuliayup I did22:51
opendevreviewVerification of a change to openstack/ironic-python-agent stable/zed failed: Handle the node being locked  https://review.opendev.org/c/openstack/ironic-python-agent/+/89259423:41

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!