opendevreview | Adam McArthur proposed openstack/ironic-tempest-plugin master: Microversion Test Generator https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/936293 | 05:14 |
opendevreview | Adam McArthur proposed openstack/ironic-tempest-plugin master: Microversion Test Generator https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/936293 | 05:36 |
opendevreview | Adam McArthur proposed openstack/ironic-tempest-plugin master: Microversion Test Generator https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/936293 | 06:16 |
opendevreview | Adam McArthur proposed openstack/ironic-tempest-plugin master: Microversion Test Generator https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/936293 | 06:43 |
opendevreview | Adam McArthur proposed openstack/ironic-tempest-plugin master: Microversion Test Generator https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/936293 | 06:52 |
opendevreview | Adam McArthur proposed openstack/ironic-tempest-plugin master: Microversion Test Generator https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/936293 | 07:29 |
rpittau | good morning ironic! o/ | 08:25 |
adam-metal3 | Hello ironic. Was there a baremetal networking working group meeting held on the 20th of November or any other day? | 09:21 |
adam-metal3 | I was on PTO and couldn't really check and looking at the e-mail thread I am not sure | 09:22 |
rpittau | adam-metal3: I believe so, last notes are here https://etherpad.opendev.org/p/ironic-networking | 09:33 |
adam-metal3 | rpittau, thanks! | 09:33 |
rpittau | np :) | 09:33 |
iurygregory | good morning Ironic | 10:52 |
opendevreview | Merged openstack/ironic master: Allow setting of disable_power_off via API https://review.opendev.org/c/openstack/ironic/+/934740 | 12:25 |
dtantsur | derekh__: yay ^^ | 12:27 |
iurygregory | \o/ | 12:43 |
derekh__ | nice :-) | 12:46 |
opendevreview | Verification of a change to openstack/metalsmith master failed: CI: Remove metalsmith legacy jobs https://review.opendev.org/c/openstack/metalsmith/+/933154 | 13:13 |
opendevreview | Verification of a change to openstack/ironic stable/2024.2 failed: Use specific fix-commit from dnsmasq https://review.opendev.org/c/openstack/ironic/+/936205 | 14:11 |
rpittau | #startmeeting ironic | 15:00 |
opendevmeet | Meeting started Mon Dec 2 15:00:06 2024 UTC and is due to finish in 60 minutes. The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:00 |
opendevmeet | The meeting name has been set to 'ironic' | 15:00 |
rpittau | mmm I wonder if we'll have quorum today | 15:00 |
rpittau | anyway | 15:00 |
rpittau | Hello everyone! | 15:00 |
rpittau | Welcome to our weekly meeting! | 15:00 |
rpittau | The meeting agenda can be found here: | 15:00 |
rpittau | https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_December_02.2C_2024 | 15:00 |
rpittau | let's give it a couple of minutes for people to join | 15:00 |
iurygregory | o/ | 15:01 |
TheJulia | o/ | 15:01 |
TheJulia | We likely need to figure out our holiday meeting schedule | 15:01 |
rpittau | yeah, I was thinking the same | 15:01 |
kubajj | o/ | 15:02 |
cid | o/ | 15:02 |
rpittau | ok let's start | 15:02 |
rpittau | #topic Announcements/Reminders | 15:02 |
rpittau | #topic Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio: | 15:02 |
rpittau | #link https://tinyurl.com/ironic-weekly-prio-dash | 15:02 |
rpittau | there are some patches needing +W when any approver has a moment | 15:03 |
adam-metal3 | o/ | 15:03 |
rpittau | #topic 2025.1 Epoxy Release Schedule | 15:04 |
rpittau | #link https://releases.openstack.org/epoxy/schedule.html | 15:04 |
rpittau | we're at R-17, nothing to mention except I'm wondering if we need to do some releases | 15:04 |
rpittau | we had ironic and ipa last week, I will go through the other repos and see where we are | 15:04 |
TheJulia | I expect, given holidays and all, the next few weeks will largely be mainly focus time for myself | 15:05 |
rpittau | and I have one more thing, I won't be available for the meeting next week as I'm traveling, any volunteer to run the meeting? | 15:05 |
iurygregory | I can't since I'm also traveling | 15:06 |
TheJulia | I might be able to | 15:06 |
JayF | My calendar is clear if you want me to | 15:06 |
TheJulia | I guess a question might be, how many folks will be available next monday? | 15:06 |
rpittau | TheJulia, JayF, thanks, either of you is great :) | 15:07 |
rpittau | oh yeah | 15:07 |
JayF | I'll make a reminder to run it next Monday. Why wouldn't we expect many people around? | 15:07 |
rpittau | I guess it will be at least 3 fewer people | 15:07 |
JayF | Oh that's a good point. But I wonder if it's our last chance to have a meeting before the holiday, and I think we're technically supposed to have at least one a month | 15:07 |
rpittau | JayF: dtantsur, iurygregory and myself are traveling | 15:07 |
rpittau | we can have a last meeting the week after | 15:07 |
rpittau | then I guess we skip 2 meetings | 15:08 |
rpittau | the 23rd and the 30th | 15:08 |
rpittau | and we get back to the 6th | 15:08 |
TheJulia | I'll note, while next week will be the ?9th?, the following week will be the 16th, and I might not be around | 15:08 |
JayF | I will personally be out ... For exactly those two meetings | 15:08 |
dtantsur | I'll be here on 23rd and 30th if anyone needs me | 15:08 |
dtantsur | but not the next 2 weeks | 15:08 |
TheJulia | Safe travels! | 15:09 |
rpittau | thanks :) | 15:09 |
rpittau | so tentative last meeting the 16th ? | 15:09 |
rpittau | or the 23rd? I may be able to make it | 15:09 |
TheJulia | Lets do the 16th | 15:10 |
TheJulia | I may partial week it, I dunno | 15:10 |
rpittau | perfect | 15:11 |
rpittau | I'll send an email out also as reminder/announcement | 15:11 |
JayF | Yeah I like the idea of just saying the 16th is our only remaining meeting of the month. +1 | 15:11 |
rpittau | cool :) | 15:11 |
TheJulia | I really like that idea, skip next week, meet on the 16th, take over world | 15:11 |
TheJulia | etc | 15:11 |
TheJulia | Also, enables time for folks to focus on feature/work items they need to move forward | 15:12 |
rpittau | alright moving on | 15:13 |
rpittau | #topic Discussion topics | 15:13 |
rpittau | I have only one for today | 15:13 |
rpittau | which is more an announcement | 15:13 |
rpittau | #info CI migration to ubuntu noble has been completed | 15:13 |
rpittau | so far so good :D | 15:14 |
rpittau | anything else to discuss today? | 15:14 |
janders | I've got one item, if there is time/interest | 15:14 |
janders | servicing related | 15:14 |
janders | (we also have a good crowd for this topic) | 15:14 |
rpittau | janders: please go ahead :) | 15:14 |
janders | great :) | 15:15 |
janders | (in EMEA this week so easier to join this meeting) | 15:15 |
TheJulia | o/ janders | 15:15 |
janders | so - iurygregory and I ran into some issues with firmware updates during servicing | 15:15 |
janders | the kind of issues I wanted to talk about is related to BMC responsiveness issues during/immediately after | 15:15 |
TheJulia | Okay, what sort of issues? | 15:16 |
TheJulia | and what sort of firmware update? | 15:16 |
* iurygregory thanks HPE for saying servicing failed because the bmc wasn't accessible | 15:16 |
janders | HTTP error codes in responses (400s, 500s, generally things making no sense) | 15:16 |
janders | I think BMC firmware update was the more problematic case (which makes sense) | 15:17 |
TheJulia | i know idracs can start spewing 500s if the FQDN is not set properly | 15:17 |
janders | but then what happens is the update succeeds but Ironic thinks it failed because it got a 400/500 response while the BMC was booting up and talking garbage in the process | 15:17 |
janders | (if it remained silent and not responding it would have been OK) | 15:17 |
iurygregory | https://paste.opendev.org/show/bdrsgYzFECwvq5O3hQPb/ | 15:18 |
iurygregory | this was the error in case someone is interested =) | 15:18 |
janders | but TL;DR I wonder if we should have some logic saying "during/after BMC firmware upgrade, disregard any 'bad' BMC responses for X seconds" | 15:18 |
TheJulia | There is sort of a weird similar issue NobodyCam has encountered with his downstream where after we power cycle, the BMCs sometimes also just seem to pack up and go on vacation for a minute or two | 15:18 |
iurygregory | in this case it was about 3min for me | 15:19 |
TheJulia | Step-wise, we likely need to... either do it implicitly or have an explicit step which is "hey, we're going to get garbage responses, let's hold off on the current action until the $thing is ready" | 15:19 |
iurygregory | but yeah, seems similar | 15:19 |
janders | it's not an entirely new problem but the impact of such BMC (mis)behaviour is way worse in day2 ops than day1 | 15:19 |
janders | it is annoying when it happens on a new node being provisioned | 15:19 |
TheJulia | or in service | 15:19 |
adam-metal3 | I have seen similar related to checking power states | 15:20 |
TheJulia | because these are known workflows and... odd things happening are the beginning of the derailment | 15:20 |
janders | it is disruptive if someone has prod nodes in scheduled downtime (and overshoots the scheduled downtime due to this) | 15:20 |
TheJulia | we almost need a "okay, I've got a thing going on", give the node some grace or something flag | 15:20 |
TheJulia | or "don't take new actions, or... dunno" | 15:20 |
janders | TheJulia++ | 15:21 |
TheJulia | I guess I'm semi-struggling to figure out how we would fit it into the model and avoid consuming a locking task, but maybe the answer *is* to lock it | 15:21 |
TheJulia | and hold a task | 15:21 |
janders | let me re-read the error Iury posted to see what Conductor was exactly trying to do when it crapped out | 15:21 |
TheJulia | we almost need a "it is all okay" once "xyz state is achieved" | 15:22 |
janders | OK so in this case it seems like the call to BMC came from within the step it seems | 15:22 |
TheJulia | nobodycam's power issue makes me want to hold a lock, and have a countdown timer of sorts | 15:22 |
janders | but I wonder if it is possible that we hit issues with a periodic task or something | 15:22 |
TheJulia | Well, if the task holds a lock the entire time, the task can't run. | 15:22 |
janders | TheJulia I probably need to give it some more thought but this makes sense to me | 15:23 |
TheJulia | until the lock releases it | 15:23 |
janders | iurygregory dtantsur WDYT? | 15:23 |
TheJulia | it can't really be a background periodic short of adding a bunch more interlocking complexity | 15:23 |
TheJulia | because then step flows need to resume | 15:23 |
TheJulia | we kind of need to actually block in these "we're doing a thing" cases | 15:23 |
TheJulia | and in nobodycam's case we could just figure out some middle ground which could be turned on for power actions | 15:24 |
janders | yeah it doesn't sound unreasonable | 15:24 |
janders | to do this | 15:24 |
TheJulia | I *think* his issue is post-cleaning or just after deployment, like the very very very last step | 15:24 |
iurygregory | I think it makes sense | 15:24 |
TheJulia | I've got a bug in launchpad which lays out that issue | 15:24 |
TheJulia | but I think someone triaged it as incomplete and it expired | 15:24 |
iurygregory | oh =( | 15:25 |
janders | I think this time we'll need to get to the bottom of it cause when people start using servicing in anger (and also through metal3) this is going to cause real damage | 15:25 |
janders | (overshooting maintenance windows for already-deployed nodes is first scenario that comes to mind but there will likely be others) | 15:26 |
TheJulia | https://bugs.launchpad.net/ironic/+bug/2069074 | 15:26 |
TheJulia | Overshooting maintenance windows is inevitable | 15:26 |
TheJulia | the key is to keep the train of process from derailing | 15:26 |
TheJulia | That way it is not the train which is the root cause | 15:27 |
janders | "if a ironic is unable to connect to a nodes power source" - power source == BMC in this case? | 15:27 |
TheJulia | yes | 15:27 |
TheJulia | I *think* | 15:27 |
janders | this rings a bell, I think this is what crapped out inside the service_step when iurygregory and I were looking at it | 15:27 |
TheJulia | they also have SNMP PDUs in that environment, aiui | 15:27 |
TheJulia | oh, so basically same type of issue | 15:28 |
janders | (this level of detail is hidden under that last_error) | 15:28 |
janders | yeah | 15:28 |
iurygregory | not during service, but cleaning in an HPE | 15:28 |
iurygregory | but yeah same type of issue indeed | 15:28 |
janders | thank you for clarifying iurygregory | 15:28 |
iurygregory | np =) | 15:28 |
TheJulia | yeah, I think I semi-pinned down the issue I was thinking of | 15:28 |
janders | yeah it feels like we're missing the "don't depend on responses from BMC while mucking around with its firmware" bit | 15:29 |
janders | in a few different scenarios | 15:29 |
TheJulia | well, or in cases where the bmc might also be taking a chance to reset/reboot itself | 15:29 |
TheJulia | at which point, it is no longer a stable entity until it returns to stability | 15:29 |
janders | ok so from our discussion today it feels 1) the issue is real and 2) holding a lock could be a possible solution - am I right here? | 15:30 |
TheJulia | Well, holding a lock prevents things from moving forward | 15:30 |
TheJulia | and prevents others from making state assumptions | 15:30 |
TheJulia | or other conflicting instructions coming in | 15:30 |
janders | yeah | 15:31 |
TheJulia | iurygregory: was https://paste.opendev.org/show/bdrsgYzFECwvq5O3hQPb/'s 400 a result of power state checking? | 15:31 |
TheJulia | post-action | 15:31 |
TheJulia | ? | 15:31 |
iurygregory | Need to double check | 15:32 |
iurygregory | I can re-run things later and gather some logs | 15:32 |
janders | so the lock would be requested by code inside the step performing the firmware operation in this case (regardless of whether day1 or day2), and if the BMC doesn't resume returning valid responses after X seconds we fail the step and release the lock? | 15:33 |
TheJulia | Yeah, I think this comes down to "we're in some action like a power state change in a workflow, we should be able to hold, or let the caller know we need to wait until we're getting a stable response" | 15:33 |
TheJulia | janders: the task triggering the step would already hold a lock (the node.reservation field) through the task. | 15:33 |
dtantsur | I think we do a very similar thing with boot mode / secure boot | 15:34 |
TheJulia | Yeah, if the BMC never returns from "lunch" we eventually fail | 15:34 |
janders | dtantsur are you thinking about the code we fixed together in sushy-oem-idrac? | 15:34 |
dtantsur | yep | 15:34 |
janders | or is this in the more generic part? | 15:34 |
janders | OK, I understand, thank you | 15:35 |
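To illustrate the grace-period idea discussed above (hold the task's lock, tolerate garbage BMC responses for a bounded time, then fail the step), here is a minimal sketch. The helper name, its interface, and the defaults are assumptions for illustration, not existing Ironic code.

```python
import time


class BMCNotReady(Exception):
    """BMC kept returning garbage past the grace period."""


def call_with_bmc_grace(call, grace=300, interval=15):
    """Retry a BMC call while the BMC resets after a firmware update.

    ``call`` is any callable returning an object with a ``status_code``
    attribute. 4xx/5xx responses and connection errors are treated as
    "BMC still rebooting" until ``grace`` seconds have elapsed, at which
    point we give up so the step (and its lock) can fail cleanly.
    """
    deadline = time.monotonic() + grace
    while True:
        try:
            response = call()
            if response.status_code < 400:
                return response
        except (ConnectionError, OSError):
            pass  # a silent BMC is fine, keep waiting
        if time.monotonic() >= deadline:
            raise BMCNotReady(
                'BMC did not return sane responses within %s seconds '
                'after the firmware update' % grace)
        time.sleep(interval)
```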
rpittau | great :) | 15:36 |
rpittau | anything more on the topic? or other topics to discuss? | 15:37 |
dtantsur | iurygregory and I could use some ideas | 15:37 |
janders | I think this discussion really helped my understanding of this challenge and gives me some ideas going forward, thank you! | 15:37 |
janders | dtantsur yeah | 15:37 |
dtantsur | what on Earth could make IPA take 90 seconds to return any response for any API (including the static root) | 15:37 |
iurygregory | yeah, I'm racking my brain trying to figure out this one | 15:38 |
dtantsur | even on localhost! | 15:38 |
janders | hmm it's always the DNS right? :) | 15:39 |
dtantsur | it could be DNS.. | 15:39 |
TheJulia | dns for logging? | 15:39 |
dtantsur | yeah, I recall this problem | 15:40 |
janders | saying that tongue-in-cheek since you said localhost but hey | 15:40 |
janders | maybe we're onto something | 15:40 |
TheJulia | address of the caller :) | 15:40 |
janders | what would be default timeout on the DNS client in question? | 15:41 |
TheJulia | This was also a thing which was "fixed" at one point ages ago by monkeypatching eventlet | 15:41 |
TheJulia | err, using eventlet monkeypatching | 15:41 |
iurygregory | I asked them to check if the response time was the same using the name and ip, and the problem always repeats, I also remember someone said some requests taking 120sec =) | 15:41 |
JayF | That problem was more or less completely excised, the one that was fixed with more monkey patching | 15:42 |
JayF | I really like the hypothesis of inconsistent or non-working DNS. There might even be some differences in behavior between what distribution you're using for the ramdisk in those cases. | 15:43 |
dtantsur | It's a RHEL container inside CoreOS | 15:43 |
janders | could tcpdump help confirm this hypothesis? | 15:43 |
TheJulia | janders: likely | 15:43 |
janders | (see if there are DNS queries on the wire) | 15:43 |
TheJulia | janders: at a minimum, worth a try | 15:43 |
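Besides tcpdump, one quick way to test the DNS theory from inside the ramdisk is to time a reverse lookup directly. A rough sketch, assuming the slow path is reverse resolution of the caller's address for access logging; this script is illustrative and not part of IPA:

```python
import socket
import time


def time_reverse_dns(ip, timeout=5.0):
    """Time a reverse DNS lookup for ``ip``.

    A multi-second (or timeout-long) result would support the
    "it's always DNS" theory for slow IPA API responses.
    """
    socket.setdefaulttimeout(timeout)
    start = time.monotonic()
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:
        name = None
    return name, time.monotonic() - start


if __name__ == '__main__':
    name, elapsed = time_reverse_dns('127.0.0.1')
    print('resolved to %r in %.2fs' % (name, elapsed))
```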
rpittau | anything else to discuss? :) | 15:46 |
adam-metal3 | I have a question if I may | 15:46 |
rpittau | adam-metal3: sure thing | 15:46 |
rpittau | we still have some time | 15:46 |
adam-metal3 | I have noticed an interesting behaviour with ProLiant DL360 Gen10 Plus servers; as you know, IPA registers a UEFI boot record under the name of ironic-<somenumber> by default | 15:47 |
TheJulia | Unless there is a hint file, yeah | 15:48 |
TheJulia | whats going on? | 15:48 |
adam-metal3 | On the machine type I have mentioned this record gets saved during all the deployments so if you deploy and clean 50 times you have 50 of these boot devices visible | 15:48 |
TheJulia | oh | 15:48 |
TheJulia | heh | 15:48 |
TheJulia | uhhhhh | 15:48 |
TheJulia | Steve wrote a thing for this | 15:49 |
adam-metal3 | as far as tests done by downstream folks indicate, there is no serious issue | 15:49 |
adam-metal3 | but it was confusing a lot of my downstream folks | 15:49 |
* iurygregory needs to drop, lunch just arrived | 15:50 |
TheJulia | https://review.opendev.org/c/openstack/ironic-python-agent/+/914563 | 15:50 |
adam-metal3 | Okay so I assume then it is a known issue, that is good! | 15:50 |
TheJulia | Yeah, so... Ideally the image you're deploying has a loader hint | 15:51 |
TheJulia | in that case, the image can say what to use, because some shim loaders will try to do record injection as well | 15:51 |
TheJulia | and at one point, that was a super bad bug on some intel hardware | 15:51 |
TheJulia | or, triggered a bug... is the best way to describe it | 15:51 |
TheJulia | Ironic, I *think*, should be trying to clean those entries up in general, but I guess it would help to better understand what you're seeing, and compare to a deployment log since the code is *supposed* to dedupe those entries if memory serves | 15:52 |
TheJulia | adam-metal3: we can continue to discuss more as time permits | 15:52 |
adam-metal3 | sure | 15:53 |
TheJulia | we don't need to hold the meeting for this, I can also send you a pointer to the hint file | 15:53 |
adam-metal3 | thanks! | 15:53 |
rpittau | we have a couple of minutes left, anything else to discuss today? :( | 15:53 |
rpittau | errr | 15:53 |
rpittau | :) | 15:53 |
rpittau | alright I guess we can close here | 15:55 |
rpittau | thanks everyone! | 15:55 |
rpittau | #endmeeting | 15:55 |
opendevmeet | Meeting ended Mon Dec 2 15:55:32 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 15:55 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/ironic/2024/ironic.2024-12-02-15.00.html | 15:55 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/ironic/2024/ironic.2024-12-02-15.00.txt | 15:55 |
opendevmeet | Log: https://meetings.opendev.org/meetings/ironic/2024/ironic.2024-12-02-15.00.log.html | 15:55 |
janders | thank you all o/ | 15:55 |
janders | great to be able to join the meeting in real time for a change | 15:55 |
janders | (and sorry for being few minutes late) | 15:55 |
dtantsur | sooo, folks. When running get_deploy_steps, we somehow end up running some real code in IPA. That involves 'udevadm settle'. That consistently takes 2 minutes on their machine. | 15:56 |
dtantsur | what. the. hell. | 15:56 |
rpittau | udevadm settle takes 2 minutes? wow | 15:57 |
janders | crazy - but an awesome find | 15:57 |
janders | gotta drop for a bit again, back a bit later | 15:57 |
TheJulia | adam-metal3: so shim, by default, looks for a BOOTX64.CSV file as a hint, I think it is expected in the folder shim is in, so on a Fedora machine it is /boot/efi/EFI/fedora/BOOTX64.CSV, and IPA will look for a file like this and use it as the basis for the records to set, replacing the ironic-$num behavior | 15:58 |
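For context, a rough sketch of what reading such a hint file could look like. The UTF-16 encoding and the loader/label/options field order follow shim's BOOT.CSV convention as I understand it; this is not IPA's actual parser:

```python
import csv
import io


def read_boot_csv(path):
    """Parse a shim BOOTX64.CSV hint file into boot entry hints.

    Assumes shim's convention: UTF-16 encoded text, one entry per line,
    with fields for loader filename, boot label, and loader options.
    """
    with open(path, 'rb') as f:
        text = f.read().decode('utf-16')
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row or not row[0].strip():
            continue
        entries.append({
            'loader': row[0].strip(),
            'label': row[1].strip() if len(row) > 1 else '',
            'options': row[2].strip() if len(row) > 2 else '',
        })
    return entries
```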
rpittau | does that mean that udevd is still syncing devices? | 15:58 |
TheJulia | syncing and waiting for settled device state | 15:58 |
TheJulia | which might inform hardware managers what steps are actually available | 15:59 |
rpittau | yep | 15:59 |
rpittau | dtantsur: probably need systemd-udevd logs to see what's taking that long | 16:01 |
adam-metal3 | TheJulia: thanks, I will check if we have that file and in general how this process works, so far I have only checked the uefi tooling that IPA uses to set the record | 16:02 |
dtantsur | also caching hardware managers does not seem to work.. | 16:02 |
adam-metal3 | the strange thing for me is that in my case it is always ironic-1, but 50 times; I find it strange that the same name can be saved any number of times | 16:03 |
TheJulia | adam-metal3: that sounds like it is saving or sits regarding a delete | 16:06 |
TheJulia | adam-metal3: we have seen a thing on some Lenovo hardware where if changes are made in a very particular order, the machine reverts back to the last known UEFI boot variables | 16:09 |
TheJulia | We had to move the delete before the save in that case because originally the code was add then cleanup | 16:10 |
TheJulia | dtantsur: … that sounds like what was an old bug is now a new bug again | 16:11 |
adam-metal3 | TheJulia: interesting, I will need to ask around whether I can get my hands on a machine that exhibits these symptoms, otherwise I am not sure how else to play around with the UEFI | 16:24 |
TheJulia | adam-metal3: I suspect that is the only path, it does sound like something is going "off the rails". If you can get us agent logs and the efibootmgr -v output after deployment, it should be easier to wrap our heads around this | 16:30 |
rpittau | good night! o/ | 16:45 |
dtantsur | aahhhh, JayF's recent patch fixes the reason why evaluate_hardware_support is called on each get_deploy_steps: https://review.opendev.org/c/openstack/ironic-python-agent/+/920153/7/ironic_python_agent/hardware.py#3504 | 16:54 |
dtantsur | if we pull that downstream, things will no longer break because of udevadm settle | 16:54 |
JayF | Uh | 16:55 |
JayF | you know before that change we called it like 5 times instead of 1 iirc | 16:55 |
dtantsur | not just that, we also call it every time get_deploy_steps is called | 16:56 |
dtantsur | which is.. rather often | 16:56 |
JayF | are you 10000% sure that's not cached? | 16:56 |
JayF | I think it is. | 16:56 |
dtantsur | After your patch, it is cached. | 16:56 |
JayF | get_managers_detail() is cached | 16:56 |
dtantsur | See the link, before your patch it was a naked call to dispatch_to_all_managers('evaluate_hardware_support') | 16:57 |
JayF | so we should /not/ be running it on each call to get_deploy_steps() unless that's the bug | 16:57 |
JayF | oh, oh, oh | 16:57 |
dtantsur | Yep, except that we don't have your patch in that version of OpenShift | 16:57 |
JayF | I thought you were saying *the opposite* | 16:57 |
JayF | which is why I was so confused | 16:57 |
dtantsur | :) | 16:57 |
JayF | yes, this is a gross bug, and it's nice to see it had more real world impact | 16:57 |
JayF | that's also why we don't have so much crap logging from IPA anymore :) | 16:57 |
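For anyone following along, a simplified sketch of why the caching matters; the function bodies here are stand-ins, not the actual IPA code:

```python
import functools
import time


def evaluate_hardware_support():
    """Stand-in for the expensive per-manager evaluation; in IPA this is
    the path that can end up waiting on 'udevadm settle'."""
    time.sleep(2)  # pretend this is udevadm settle taking its time
    return {'GenericHardwareManager': 1}


@functools.lru_cache(maxsize=1)
def get_managers_detail():
    """Evaluate hardware support once and reuse the result.

    Repeated get_deploy_steps calls should hit this cache instead of
    re-running the evaluation (and udevadm settle) every time.
    """
    return evaluate_hardware_support()
```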
TheJulia | dtantsur: rpittau: going back to the question from last week, then I think what I'm sort of thinking of works. Ironic should prefer qcow2 images. IPA, should it get a URL directly, could prefer applehv, but that is a bridge to try and cross when we get there. | 16:58 |
dtantsur | applehv == raw? | 17:00 |
TheJulia | very much appears to be the case | 17:04 |
JayF | It really weirds me out how much of this is just "yeah it looks raw" rather than there being a documented standard :| | 17:12 |
JayF | I know that's not our fault, but it feels like something that's going to eventually cause headaches (if not already causing it) | 17:12 |
TheJulia | I think we should refine it and make "raw" our standard ;) | 17:20 |
TheJulia | tbh | 17:20 |
TheJulia | but, one step at a time | 17:20 |
TheJulia | we first must crawl, then walk, then run | 17:20 |
dtantsur | I think the crawl would be 1 layer (image), detect the type using our new shiny detector | 17:24 |
dtantsur | It feels like many conversations around this effort are happening because we're trying to do the next step already (payload containing different image types per architecture, etc) | 17:25 |
JayF | TheJulia: I honestly like the idea of calling what we usually call "raw" "gpt" similar to what the glance as defender spec lays out | 17:26 |
dtantsur | Layers also have a MIME content type, I'm curious why podman did not use it.. | 17:27 |
JayF | https://review.opendev.org/c/openstack/glance-specs/+/925111 | 17:27 |
TheJulia | dtantsur: because OCI spec mandates specific types | 17:27 |
TheJulia | and so you cannot make assumptions and doing some sanity checking upfront allows for "hey, you gave us bad data" instead of just falling over being unable to deploy | 17:28 |
dtantsur | which part of the OCI spec do you mean? | 17:29 |
TheJulia | OCI image spec | 17:30 |
dtantsur | I don't think the OCI spec has any explicit mentions of qcow2/applehv (or does it?) | 17:31 |
TheJulia | it does not | 17:31 |
dtantsur | Then it cannot mandate them? | 17:31 |
TheJulia | But it does, if my memory is serving me, explicitly note all attached manifest data layers are treated as layers with the mandated data types | 17:31 |
TheJulia | it's a structural aspect which mandates layer modeling | 17:32 |
dtantsur | sorry, I don't get it | 17:32 |
dtantsur | it's up to a tool how to treat a certain layer | 17:32 |
dtantsur | aha, the spec even shows an explicit "artifactType", I did not see it initially | 17:35 |
dtantsur | so, looks like we could have a layer with "artifactType": "application/x-qemu-disk" | 17:35 |
* dtantsur still curious why podman did not use all this | 17:35 | |
TheJulia | https://github.com/opencontainers/image-spec/blob/main/image-layout.md#indexjson-file <-- I think that is the start of it | 17:35 |
* TheJulia looks up artifactType | 17:35 | |
TheJulia | https://github.com/opencontainers/image-spec/blob/main/artifacts-guidance.md <-- uses should | 17:36 |
TheJulia | hmmm I like ArtifactType | 17:37 |
dtantsur | It sounds like we could use mediaType as well | 17:37 |
dtantsur | for those following along: https://github.com/opencontainers/image-spec/commit/749ea9a27d1eb44b5369ee7e8e296c7e99e3d2e5 | 17:38 |
dtantsur | Ah, there are two different things called mediaType. Thank you, not confusing at all. | 17:38 |
TheJulia | It *looks* like there might be a lower level where we might be able to note/annotate it, but I suspect they did it one level up so they didn't have to walk all the way down and then back up | 17:39 |
TheJulia | at least, I suspect | 17:39 |
TheJulia | "they" in that guess being podman | 17:40 |
dtantsur | Okay, I finally got it. On the top level of the index.json, mediaType is its own media type (a constant), artifactType is an optional type of the contained artifact. | 17:42 |
TheJulia | yup | 17:42 |
dtantsur | Then each manifest can have its own mediaType, which can be, well, anything. application/x-qemu-disk | 17:42 |
TheJulia | Each manifest after you make a decision right? | 17:42 |
TheJulia | so not second level, but third level down right? | 17:43 |
TheJulia | because second level is where you have all of the varying types and just the pointers to the final manifests | 17:43 |
dtantsur | even first? | 17:43 |
dtantsur | what is preventing me from having https://paste.opendev.org/show/btSiycm241tCqJDkXUJt/ ? | 17:43 |
dtantsur | ("no existing tooling can product that" is a plausible answer, but let's leave it aside just for a minute) | 17:44 |
dtantsur | Imagine, I have an image with this index.json and exactly one blob | 17:44 |
dtantsur | Am I missing something? | 17:44 |
TheJulia | because under existing data structures, that would be a lower layer artifact, and if we're going to be alongside containers which may also be bootable, we ideally want to be mindful of fitting in with other aspects instead of trying to create something entirely different in the same upper level modeling | 17:45 |
dtantsur | there may be more manifests with other types | 17:46 |
TheJulia | If we're going to try and carve an entirely new path here, I might as well stop and punt on this, to be entirely honest | 17:46 |
TheJulia | yes, but they can't be index.json at that point, they would need to be other containers | 17:46 |
dtantsur | Mmm, I'm trying NOT to carve a new path | 17:46 |
TheJulia | index.json is top level, the whole of the representation of a container | 17:46 |
TheJulia | My whole driver here is to have a streamlined path so I could eventually have a single container which has a qcow2 file attached, and a bootable container | 17:47 |
dtantsur | I'm looking at https://github.com/opencontainers/image-spec/blob/main/image-layout.md#index-example, and in this example they have two "normal" images as well as some AppStream XML | 17:47 |
TheJulia | and the user could then just choose based upon the deploy interface | 17:47 |
dtantsur | In fact, it's a root index that points at another index, a simpler manifests and just an artifact, right? | 17:49 |
TheJulia | yeah, the third entry in that example is a definite standalone file | 17:49 |
TheJulia | second is a manifest reference pointer | 17:49 |
dtantsur | So that could be our qcow2 alongside the proper container stuffs? | 17:49 |
TheJulia | the first is another index reference | 17:50 |
TheJulia | potentially, the question is the platform field and whether it can exist at that level | 17:52 |
TheJulia | the spec doc walks through what podman did, so it is a top layer primary manifest pointer for the container itself, and then an index | 17:53 |
TheJulia | inside that index, each manifest entry has platform and annotations to signify what the files are | 17:53 |
dtantsur | https://github.com/opencontainers/image-spec/blob/main/image-index.md#image-index-property-descriptions lists platform | 17:54 |
TheJulia | They point to a separate manifest file | 17:54 |
TheJulia | which then lists the contents as a single layer (why did they do that?!) | 17:54 |
dtantsur | Yeah, I'm also curious about this indirection | 17:55 |
dtantsur | nothing in the spec is telling me that I cannot have top-level artifacts with different architectures | 17:55 |
dtantsur | maybe simply because of tooling support? | 17:56 |
TheJulia | so it could be since I think all layers are expected to be z-streamed | 17:56 |
TheJulia | maybe that is why?! | 17:56 |
dtantsur | i.e. they could create layers with podman easily but they could not create what I describe? | 17:56 |
TheJulia | so sort of a tooling convenience and extra transparent compression? | 17:56 |
dtantsur | yeaah | 17:56 |
TheJulia | yeah | 17:56 |
TheJulia | I kind of suspect some of it is that compression, some of it was explicit modeling of indirection and also trying to not have to walk all the way down | 17:57 |
dtantsur | The cost they're paying, though, is the non-standard "disktype" annotations | 17:57 |
TheJulia | yup | 17:57 |
dtantsur | I guess the key question is whether we want to mimic that (keeping in mind that they themselves may pivot from it one day) | 17:57 |
dtantsur | Need to do some exercising now. Sorry, I'm afraid I caused more confusion than I solved.. | 17:59 |
TheJulia | I think they expect to have to if docker decides to do anything else. Perhaps a question back to ?Arron? would be: why not do it at a top-ish level (assuming the tools support it)? | 17:59 |
TheJulia | the whole thing that made me raise an eyebrow is the container reference being expected | 17:59 |
TheJulia | I bet that is a compatibility aspect on index.json. | 17:59 |
dtantsur | The spec is sometimes vague on what is required and what is not | 17:59 |
TheJulia | which drives the mid-level index to manifest linking | 17:59 |
TheJulia | what is then also weird is that the lower level index for machine-os also circles back and points at the same container manifest as well | 18:00 |
dtantsur | The case described in https://github.com/opencontainers/image-spec/commit/749ea9a27d1eb44b5369ee7e8e296c7e99e3d2e5 is remotely similar to ours, and they accepted it as a valid case, so there is hope | 18:00 |
TheJulia | from 2023, I wonder if that was the original focus and they maybe pivoted? | 18:01 |
TheJulia | we're making lots of guesses | 18:01 |
dtantsur | the author does not seem to be from podman | 18:01 |
dtantsur | yeah | 18:01 |
TheJulia | oh, hmm | 18:01 |
TheJulia | on a plus side, I've not written any code at this layer. Still trying to wire in the overall higher level changes | 18:03 |
dtantsur | ++ | 18:04 |
* dtantsur leaves for real o/ | 18:04 | |
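To make the layout from this discussion concrete, here is a hypothetical top-level index.json (written as a Python dict) carrying a normal container image manifest plus a qcow2 artifact referenced straight from the index. The digests, sizes, and the assumption that existing tooling would accept this shape are illustrative only:

```python
# Hypothetical index.json for the layout discussed above: one regular
# container image manifest plus a qcow2 disk referenced directly from
# the top-level index. Digests and sizes are placeholders.
hypothetical_index = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {   # the "normal" (possibly bootable) container image
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:aaaa...",
            "size": 7143,
            "platform": {"architecture": "amd64", "os": "linux"},
        },
        {   # the disk image as a standalone artifact
            "mediaType": "application/x-qemu-disk",
            "artifactType": "application/x-qemu-disk",
            "digest": "sha256:bbbb...",
            "size": 2147483648,
            "platform": {"architecture": "amd64", "os": "linux"},
            "annotations": {
                "org.opencontainers.image.title": "disk.qcow2",
            },
        },
    ],
}
```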
opendevreview | cid proposed openstack/ironic master: [WIP] Save ``configdrive`` in an auxiliary table https://review.opendev.org/c/openstack/ironic/+/933622 | 18:14 |
cardoe | Peek at how helm and other tools like ORAS use OCI for storage. | 19:30 |
iurygregory | time to setup bifrost in a fresh OS to double check that I'm not going crazy with firmware updates "not working" on stable/2023.2 =( | 20:27 |
-opendevstatus- NOTICE: Gerrit will have a short outage while we update to the latest 3.9 release in preparation for our 3.10 upgrade on Friday | 21:31 |
TheJulia | cardoe: do you happen to have a specific link to aid us in this :) | 21:52 |
JayF | I had given up ever seeing these. https://usercontent.irccloud-cdn.com/file/b4gwar2p/PXL_20241202_223226218.jpg | 22:39 |
JayF | Five nanoKVMs, ready for action :D | 22:39 |
JayF | Took about 2 months, and I had given up on getting anything for my money, but they are here. Hopefully they work! | 22:39 |
JayF | Likely will be what I use to take a first stab at redfish console behavior | 22:43 |
JayF | (they also support IPMI, gross) | 22:43 |
iurygregory | nice! | 23:11 |