iurygregory | I give up, 3 KP when trying to boot the iDRAC 10 is enough for one day =( | 02:23 |
iurygregory | I'm wondering if I should pass root device hints to see if it would help, but I will figure it out tomorrow | 02:24 |
TheJulia | iurygregory: are you able to capture much of the kernel panic? are you able to reproduce it? | 03:11 |
*** jroll04 is now known as jroll0 | 07:17 | |
opendevreview | Nicolas Belouin proposed openstack/ironic-python-agent stable/2025.1: netutils: Use ethtool ioctl to get permanent mac address https://review.opendev.org/c/openstack/ironic-python-agent/+/950489 | 07:41 |
stephenfin | TheJulia: When you're about, I wonder if you'd be able to take a look over a failure we're seeing in Gophercloud CI? https://github.com/gophercloud/gophercloud/actions/runs/15144029062/job/42575162709?pr=3108 | 09:50 |
stephenfin | That seems to be coming from code related to the network simulator stuff you've added since we branched. I haven't been able to reproduce on a local Ubuntu 24.04 host though, so I'm hoping you'll see something obvious | 09:52 |
Sandzwerg[m] | Morning ironic. I'd like to test secure boot. Toggling it doesn't seem to work so far (I have the impression the toggling is not happening), and using our own IPA & ESP I get a secure boot error, so our IPA doesn't support secure boot. While I try to figure out what package or config is missing: is there an IPA & ESP that I can use which should support secure boot out of the box? Or a way I can build something fast that should work? | 10:15 |
masghar | iurygregory: also wondering if the disks ever came back on the UI, how curious | 10:25 |
dtantsur | iurygregory: I can assure you that root device hints have no effect when the ramdisk is booting (and why would they?) | 10:42 |
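(For context: root device hints are consumed at deploy time to pick the disk that gets imaged; they play no part in how the ramdisk itself boots. A typical hint set as a node property might look like the following; the size threshold is illustrative.)

```shell
# Illustrative only: ask the deploy to pick a disk of at least 60 GiB as
# the root device. This has no influence on ramdisk booting.
openstack baremetal node set <node> --property root_device='{"size": ">= 60"}'
```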
dtantsur | Sandzwerg[m]: what's your boot method and how do you build IPA? | 10:44 |
*** sfinucan is now known as stephenfin | 11:19 | |
opendevreview | Merged openstack/ironic unmaintained/xena: [stable-only] Fix errors building docs https://review.opendev.org/c/openstack/ironic/+/949260 | 11:30 |
Sandzwerg[m] | <dtantsur> "Sandzwerg: what's your boot..." <- In this case idrac-redfish. We build our IPA with mkosi and use Fedora 40 as the basis. We needed that some years ago because we had hardware issues with what we were using before. We could probably switch to something else like DIB by now, but before I invest the time I'd like to get something running to make sure it works at all and there isn't something else blocking it | 11:41 |
dtantsur | Sandzwerg[m]: ah yeah. I recall building a secure boot capable ISO being quite annoying, trying to remember any details | 11:46 |
dtantsur | Sandzwerg[m]: this is what I did for metal3 back in the days: https://github.com/metal3-io/ironic-image/commit/f12f205 | 11:48 |
Sandzwerg[m] | currently the ISO is built on the fly with the ESP and the ramdisk & kernel, and I rebuilt the ESP on Fedora 40 so it's the same as the IPA image itself; there was a note in the documentation that that was required, but it still failed. So I guess something in our Fedora image is missing. That's why I'd like to get a "known good" one, and if that works we can even switch. The main reason for our customized deployment is gone | 11:48 |
dtantsur | tl;dr is to be careful what you put into the ESP and also configure Ironic to match that | 11:49 |
Sandzwerg[m] | For the ESP we basically followed https://docs.openstack.org/ironic/latest/install/configure-esp.html; we only needed to adjust the size, as the hardcoded value was too small for us | 11:51 |
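(For reference, the recipe in that document boils down to roughly the following; the Fedora shim/grub source paths and the enlarged image size are assumptions based on the discussion here, not verbatim from the doc.)

```shell
# A sketch of building an ESP image per configure-esp.html; paths assume a
# Fedora build host, and the image size is bumped beyond the doc's default.
DEST=/tmp/esp.img
dd if=/dev/zero of="$DEST" bs=4096 count=2048
mkfs.fat -s 4 -r 512 -S 4096 "$DEST"
sudo mount "$DEST" /mnt
sudo mkdir -p /mnt/EFI/BOOT
# shim, signed with the Microsoft key, becomes the default boot entry
sudo cp /boot/efi/EFI/fedora/shimx64.efi /mnt/EFI/BOOT/BOOTX64.efi
# grub, signed with the distro key that shim trusts
sudo cp /boot/efi/EFI/fedora/grubx64.efi /mnt/EFI/BOOT/GRUBX64.efi
sudo umount /mnt
```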
Sandzwerg[m] | That's why we rebuilt it with the matching distribution. Or could it even be that a change in the package leads to an error? | 11:52 |
dtantsur | do you configure the matching grub_config_path? | 11:56 |
dtantsur | It should work this way.. | 11:57 |
dtantsur | Although, granted, I've only tried secure boot on RHEL CoreOS | 11:57 |
Sandzwerg[m] | yes, we adjusted the grub/shim paths and the dest path, and that was all we did. But there is no ESP available that matches, for example, the CentOS-based IPA images that are published? | 12:01 |
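(A sketch of the matching ironic.conf, assuming the ESP layout sketched above; both values below are assumptions and must match the image contents, with grub_config_path pointing wherever the signed grub binary actually searches for its config, often under EFI/fedora/ for Fedora-signed builds.)

```ini
# Sketch only -- values must match the contents of the ESP image.
[conductor]
bootloader = file:///var/lib/ironic/esp.img

[DEFAULT]
# assumed value: where the signed grub looks for its configuration
grub_config_path = EFI/fedora/grub.cfg
```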
dtantsur | Sandzwerg[m]: we don't publish one, and someone told me that CentOS Stream is not properly signed by the Microsoft's key | 12:02 |
Sandzwerg[m] | hmpf. Okay is there a recommendation for what to use if one wants to do secure boot? | 12:04 |
dtantsur | I think Fedora could be the right path. Debian might work too. I haven't dealt with this topic for ages, sorry. | 12:05 |
dtantsur | (maybe TheJulia has more recent experience?) | 12:05 |
Sandzwerg[m] | The issue we have with Fedora is the frequent upgrades and changes. We're still on 40 because it's the last one that doesn't have Python 3.12 as the default; maybe that would work now, but back then it broke IPA, I think. We might be able to circumvent that with uv or similar tools, but we haven't looked into that yet | 12:09 |
* TheJulia waves from an uncaffeinated state | 12:57 | |
TheJulia | so from my experience, Centos Stream's shim loader *is* signed by msft | 13:00 |
TheJulia | specifically, we had a bug appear in shim ages ago and it made its way into RHEL, because getting the shim binary re-signed is a brutal process | 13:03 |
TheJulia | (which I've been copied on for that bug too....) | 13:03 |
Sandzwerg[m] | Alright, I'll try that then. Thanks | 13:04 |
dtantsur | okay, must have misunderstood something.. | 13:04 |
TheJulia | stephenfin: so, I'm wondering if it is newgrp which is causing the passwd prompt trigger, or if it is sudo. I guess we could "sudo newgrp"? For what it is worth, recent neutron devstack changes have torpedoed our gate, so we're sort of dead in the water at the moment while we try to figure that out as well | 13:22 |
rpittau | TheJulia, dtantsur, JayF, cid, I think I found the "issue" with sushy/sushy-tools auth loop -> https://opendev.org/openstack/ironic/commit/5f7c7dcd041e95a7f1283ab12e9d708844fd0974 | 13:25 |
rpittau | we're now calling ironic.drivers.modules.redfish.utils and it does not detect redfish cached session, causing the loop | 13:26 |
rpittau | this -> https://pastebin.com/1zckr36J | 13:27 |
rpittau | we should revert that change and look into sushy/sushy-tools to avoid the loop | 13:27 |
rpittau | at least I could not find anything else :/ | 13:29 |
TheJulia | rpittau: doesn't detect the unique session url already on hand so it then tries again | 13:29 |
rpittau | yeah, I mean that's the only kind of related change that I can see | 13:32 |
rpittau | although it did get merged a week ago, it seems issues started later, so not 100% sure | 13:32 |
rpittau | oh wait | 13:33 |
rpittau | just checking the actual patch, it did pass the first time on metal3 integration too, so I wonder if it's a race then | 13:34 |
rpittau | no nvm it never passed on metal3 | 13:36 |
rpittau | and the Python version does not make the difference | 13:36 |
opendevreview | Queensly Kyerewaa Acheampongmaa proposed openstack/sushy master: WIP: Add DateTime and DateTimeLocalOffset support to Manager resource https://review.opendev.org/c/openstack/sushy/+/950539 | 13:37 |
rpittau | I'll do a revert patch just to try | 13:37 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Revert "Fix redfish driver URL parsing" https://review.opendev.org/c/openstack/ironic/+/950540 | 13:38 |
rpittau | I will do another test in parallel in metal3 | 13:43 |
TheJulia | k | 13:46 |
TheJulia | trying to figure out why our most advanced networking jobs are dead now :( | 13:47 |
TheJulia | okay, so NGS is just not working now | 13:56 |
TheJulia | that is why the jobs are failing with JayF's patch | 13:56 |
JayF | Did they do the eventlet removal | 14:03 |
JayF | if so, we probably need a patch similar to my NBM patch in NGS | 14:03 |
TheJulia | who is "they" in that statement? It seems like maybe we're in an odd setup state | 14:04 |
JayF | neutron landed eventlet migration for l2 agents | 14:04 |
JayF | which makes me wonder if we are breaky downstream for that reason | 14:04 |
JayF | since NGS/NBM are plugins | 14:04 |
JayF | https://opendev.org/openstack/neutron/commit/9dc0d0fd2f44e348705804f1f99403086c138010 hmm not as dramatic as I thought | 14:05 |
JayF | timing doesn't match anyway | 14:05 |
TheJulia | oh, I think I see what is going on | 14:08 |
TheJulia | so, we have configuration loaded in the files | 14:08 |
TheJulia | but not in the running neutron API instance which is where ml2 plugins launch from | 14:08 |
TheJulia | if you compare old to new | 14:09 |
JayF | it'd be interesting to understand where the restart was dropped, since my statement before about that code being dead is still true | 14:09 |
TheJulia | old, we restart neutron 2x | 14:09 |
TheJulia | and it gets the configuration | 14:09 |
JayF | that's what I've been struggling with; the diff is so small in devstack | 14:09 |
TheJulia | in new, it never gets restarted | 14:09 |
JayF | hm | 14:09 |
TheJulia | yeah | 14:11 |
TheJulia | the way it gets *registered* only ever sets the config files parameter | 14:11 |
TheJulia | https://opendev.org/openstack/devstack/src/branch/master/lib/neutron#L1048 | 14:11 |
TheJulia | dog is demanding to go out, bbiab | 14:12 |
TheJulia | okay, in working, the genericswitch ini file is on the initial start | 14:33 |
TheJulia | in the non-working, it never gets added/loaded | 14:34 |
TheJulia | i see the issue | 15:04 |
TheJulia | when you use the wsgi launcher, the existing configuration modeling does *not* load up or respect the classical configuration patterns for neutron services | 15:05 |
TheJulia | instead, neutron looks for an environment variable to source the list | 15:05 |
TheJulia | https://github.com/openstack/neutron/blob/master/neutron/server/__init__.py | 15:06 |
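(A hypothetical sketch of the pattern being described; the names below are illustrative, not neutron's actual identifiers — see the linked neutron/server/__init__.py for the real code.)

```python
# Hypothetical illustration: under a WSGI launcher there is no command line,
# so the application factory builds its config-file list from the environment
# instead of from --config-file arguments.
import os

from oslo_config import cfg


def _config_files_from_env(env=None):
    env = os.environ if env is None else env
    # illustrative variable name, not neutron's real one
    conf_dir = env.get('SOME_NEUTRON_CONFIG_DIR', '/etc/neutron')
    return [os.path.join(conf_dir, 'neutron.conf')]


def application_factory():
    cfg.CONF([], project='neutron',
             default_config_files=_config_files_from_env())
    # ml2 plugin config (e.g. the genericswitch ini devstack writes) never
    # makes this list unless the environment points at it -- the gap
    # described above
```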
TheJulia | I've raised it in the neutron channel | 15:13 |
TheJulia | I'm guessing we're sort of shit out of luck | 15:13 |
opendevreview | Queensly Kyerewaa Acheampongmaa proposed openstack/sushy master: Add DateTime and DateTimeLocalOffset support to Manager resource https://review.opendev.org/c/openstack/sushy/+/950539 | 15:37 |
opendevreview | Verification of a change to openstack/bifrost master failed: Add support for downloading CentOS Stream 10 image https://review.opendev.org/c/openstack/bifrost/+/950286 | 15:41 |
opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: Workaround neutorn's move to uwsgi only https://review.opendev.org/c/openstack/networking-generic-switch/+/950559 | 15:49 |
opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: Workaround neutron's move to uwsgi only https://review.opendev.org/c/openstack/networking-generic-switch/+/950559 | 15:50 |
TheJulia | I *think* ^^ might work, but we might need to disable ironic jobs first, and then land it | 15:50 |
TheJulia | if we can test it in | 15:50 |
TheJulia | time will tell | 15:50 |
* JayF just filed RFE https://bugs.launchpad.net/ironic/+bug/2111438 | 15:53 | |
JayF | I'm not 100% sure I have the shape of the solution right, but I wanted to document the need/ask | 15:53 |
* JayF looks at Julia's change | 15:54 | |
JayF | TheJulia: I think start_neutron_service_and_check does some configuration stuff too | 15:54 |
* JayF double checks | 15:55 | |
JayF | yeah I suspect there'll be an issue around stop/start and hitting the right set of services | 15:56 |
JayF | but I am not certain enough to suggest a fix before seeing output | 15:56 |
TheJulia | it happens after the ini file is written | 15:56 |
TheJulia | https://e3fa69918ab3893f89a3-76ad47885070581f857a540cadaa6a6d.ssl.cf1.rackcdn.com/openstack/55cf2727b4c54f06b897353cf71ea0a3/controller/logs/etc/neutron/neutron-api-uwsgi.ini is what we get today | 15:56 |
JayF | I mainly am wondering if we need to hit start_neutron too | 15:57 |
JayF | so the agents come back up | 15:57 |
JayF | I'm ... mostly sure we don't need to? | 15:57 |
JayF | either way, you have the science sciencing, we'll see if there's cake at the end | 15:57 |
TheJulia | yeah, I'm curious what neutron folks will say... if they respond at all | 15:58 |
opendevreview | Merged openstack/ironic-python-agent master: Remove TinyIPA jobs https://review.opendev.org/c/openstack/ironic-python-agent/+/950236 | 16:06 |
JayF | I guess IPA doesn't have any neutron jobs | 16:06 |
JayF | in a surprising victory of sensibility in our CI | 16:06 |
JayF | massive irony about neutron devstack jobs being broken: cid and I have time on calendar today to work on step 0 of dynamic networking (contributor guide docs for complex networking devstack setups) | 16:07 |
JayF | so I guess that gets pushed a week lol | 16:07 |
FreemanBoss[m] | rpittau: please your review is required... (full message at <https://matrix.org/oftc/media/v1/media/download/AXmEi5gLrV1hye0GHOujXjSDi9m2WU7JWH3Y5IJ7dhg1bw_sPHJ485qws7uKRCKVkHL3c3c566IQ_GZSpRiPHdFCeXO7NK8gAG1hdHJpeC5vcmcvZFpVR09sZGdhdk9nVUVKYWJ6VlhHclVl>) | 16:08 |
masghar | FreemanBoss: For changes ready for review, you can add the hashtag 'ironic-week-prio' | 16:33 |
masghar | Will get more eyes on your change | 16:33 |
TheJulia | JayF: So, I think we need to disable voting on the jobs on your patch | 16:42 |
TheJulia | merge that, then try to get the n-g-s jobs fixed | 16:42 |
TheJulia | The n-g-s job failed deep in the config, so hopefully that is the shortest path | 16:42 |
TheJulia | JayF: I can revise your patch if you want, or you can do so to mark the failing jobs as non-voting | 16:43 |
JayF | I'm OK with that, but can we achieve the same result for science with Depends-On on the NGS patch? | 16:43 |
TheJulia | but we can't merge it | 16:43 |
TheJulia | and we would be blocked from doing so | 16:43 |
TheJulia | And regardless, we would need to make something non-voting someplace to merge a fix | 16:43 |
JayF | like I said I'm OK with it | 16:44 |
JayF | you can update patch or I will in a few minutes when I reach the end of my current train of thought | 16:45 |
TheJulia | I'm on it | 16:45 |
opendevreview | Julia Kreger proposed openstack/ironic master: ci: Remove code which has been long-dead https://review.opendev.org/c/openstack/ironic/+/950461 | 16:47 |
opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: Workaround neutron's move to uwsgi only https://review.opendev.org/c/openstack/networking-generic-switch/+/950559 | 16:49 |
TheJulia | okay, that will allow us to semi-unblock and begin sciencing further | 16:51 |
TheJulia | SCIENCE! | 16:51 |
TheJulia | so, regarding https://bugs.launchpad.net/ironic/+bug/2111438, is the idea some sort of "power priority", and then to sort the power sync/status/etc stuff via the priority? | 16:54 |
JayF | Well, in my particular case that won't do the trick | 16:57 |
JayF | primarily because it's not specific *nodes* it's specific *instances* | 16:58 |
JayF | notice how the example script is keying on instance_info | 16:58 |
JayF | that's the reason I leaned towards an offline tool, because it fits into a DR recovery plan more sanely; 1) get DB up 2) set power priorities 3) online ironic and let it execute 4) use API clients to spin up the rest as needed | 16:58 |
JayF | I could also potentially accept "just power all of them off" as an option, then API to turn on the ones we want after | 16:59 |
TheJulia | hmm, fair | 17:08 |
TheJulia | I guess there are a few competing challenges: | 17:16 |
TheJulia | 1) Conductor should be powering everything up *anyhow* | 17:16 |
TheJulia | 2) Some sort of priority would make a lot of sense, so as not to overload breakers with inrush current. I've never popped a distribution breaker, but Ironic has successfully popped some breakers in its history ;) | 17:16 |
TheJulia | 3) I guess it is fair to be able to "go turn on key nodes", and to be able to key that off something which is instance-provided. | 17:16 |
TheJulia | Perhaps a power_priority of 0 could be "go reference some config which could reach into or look at something", 1-98 could be explicit ordering, and 99 could be the default knob | 17:17 |
TheJulia | The whole thing about a DR plan recovery, does sort of make sense, if you had everything powered down | 17:19 |
TheJulia | but ironic is going to try to return to the prior state | 17:19 |
TheJulia | So, if you did power everything off, then you have to power everything back up | 17:19 |
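(A hypothetical sketch of the knob being floated here — none of this exists in Ironic; it just illustrates ordering the startup power sync by a per-node priority, with 0 deferring to an external lookup and 99 as the default.)

```python
# Hypothetical sketch only -- not existing Ironic code.
DEFAULT_PRIORITY = 99  # nodes without an explicit priority go last


def power_priority(node):
    prio = int(node.driver_info.get('power_priority', DEFAULT_PRIORITY))
    if prio == 0:
        # 0 = defer to some external source (config file, DCIM query, ...)
        prio = external_priority_lookup(node)  # hypothetical helper
    return prio


def startup_power_sync(nodes):
    # bring nodes back to their saved power state, highest priority first
    for node in sorted(nodes, key=power_priority):
        restore_power_state(node)  # hypothetical helper
```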
*** darmach0 is now known as darmach | 17:27 | |
-opendevstatus- NOTICE: Gerrit is being updated to the latest 3.10 bugfix release as part of early prep work for an eventual 3.11 upgrade. Gerrit will be offline momentarily while it restarts on the new version. | 17:34 | |
adamcarthur5 | TheJulia Hey, I am not sure about a priority because it's really about instances AND nodes, which you kind of mention. I think just a way of saying "power on the nodes as you see fit, but these instances need to come online first" | 18:09 |
TheJulia | then I think if we could have a priority which somehow explains "go look at this"... maybe!? | 18:11 |
adamcarthur5 | Ah okay, I think I have misinterpreted what level you meant "priority" at. I agree, I think "go look at this" is the difficult thing because: | 18:12 |
adamcarthur5 | 1) It needs to be agnostic to how you create instances (i.e. support more than just nova) | 18:13 |
adamcarthur5 | 2) It needs to live entirely in Ironic | 18:13 |
TheJulia | yeah, it doesn't solve the case, though, if you explicitly shut everything down first | 18:14 |
TheJulia | because then the saved power state supersedes | 18:14 |
adamcarthur5 | Is that about your point 1? | 18:19 |
TheJulia | not really, it's more about how you get into a disaster in the first place | 18:20 |
TheJulia | "oh no, the nuclear power plant is melting down" is entirely different from "data center is burning down" | 18:21 |
TheJulia | i.e. is this a sudden disaster, or a slow rolling disaster | 18:21 |
TheJulia | are you coming back from nothing, or just a hard outage | 18:21 |
TheJulia | was that outage planned, or not | 18:21 |
adamcarthur5 | Are you mentioning this because we need to have a feature that covers many scenarios, or because you don't understand where this bug desc is coming from? | 18:23 |
TheJulia | there are many scenarios | 18:23 |
TheJulia | For example, I ran a DR scenario once which was literally "The power plant nearby is melting down, we have to leave, the servers will keep running for an undeterminable amount of time" and an opposite test, "a tornado hit the data center, we're rebuilding from scratch" | 18:24 |
TheJulia | there is a whole spectrum in there | 18:24 |
TheJulia | so assuming the disaster is "Electrical Room Fire", then the prior power state is power_on when your conductor is back online | 18:25 |
adamcarthur5 | Yeah okay, I mean, is it acceptable for us to specifically only think about issues like ours? So "external factors knocked everything offline, for a temporary period", i.e. a power outage | 18:25 |
TheJulia | but assuming your disaster is uhhh... "UPS is in bypass, and we need to cut the slab to replace it", then you can shut down the workloads and your state is "power off" | 18:26 |
adamcarthur5 | Your worry is getting everything to the power off state? | 18:27 |
TheJulia | no, my worry is that we can't recover the power state because we start in a power state of power off if you did a staged shutdown of the data center | 18:28 |
TheJulia | so the starting place is a little weird | 18:29 |
TheJulia | the bug, seems to request the idea of "hey, explicitly power on" which is totally different as well | 18:29 |
adamcarthur5 | I think Jay is purely using that as an example | 18:30 |
adamcarthur5 | I.e. explicit power off is not a requirement | 18:30 |
adamcarthur5 | (before starting) | 18:30 |
TheJulia | so we need to "turn on a fleet", and it is almost like we need a tri-value field | 18:34 |
JayF | well | 18:34 |
JayF | the power_state field on those nodes is likely power_on | 18:34 |
JayF | even though the datacenter, in this case, went boom and they are all off | 18:34 |
JayF | so the idea is just to try and get conductor to spin up, say, the 5% of nodes who (in my downstream case) have an instance name that indicates it's a controller node | 18:34 |
JayF | note that it's **instance** name | 18:35 |
TheJulia | oh yeah | 18:35 |
JayF | we don't dedicate nodes for controllers | 18:35 |
JayF | think about it if you have a leader/followers kind of model with a dedicated leader, in any sorta app | 18:41 |
JayF | we mainly wanna ensure the leaders start first | 18:41 |
TheJulia | so funny thing is, we have a similar requirement/need which has been articulated by customers, but they've never been able to really articulate what is the driving force and what is the entry state | 18:41 |
TheJulia | oh, absolutely | 18:41 |
JayF | GR wins again as the model upstream customer :D | 18:41 |
JayF | lol | 18:41 |
TheJulia | so if I was restarting a conductor, and forcing that initial power sync | 18:45 |
JayF | 1) accidentally a whole datacenter | 18:48 |
TheJulia | That power sync could have a static priority check, and ... could we just have an ability to do some sort of query or suggestion from the config file?!? | 18:48 |
JayF | 2) manually power on an absolute minimal set of ironic control plane to get bootstrapped | 18:50 |
JayF | 3) use that smaller setup to get things booted in a proper order | 18:50 |
adamcarthur5 | I mean, how "nasty" can the query search be? How many places could the useful information, about which instances on which nodes you care about, actually live? | 18:56 |
JayF | and the requirement I have that makes it suck: base those decisions on *instance* info not *node* info | 18:57 |
JayF | sorry ^ that never got hit enter on | 18:57 |
adamcarthur5 | We probably want to support more than just display_name | 18:57 |
TheJulia | yeah | 18:57 |
JayF | yeah | 18:57 |
adamcarthur5 | Is it too nasty to say, convert the entire node object into a __dict__ and allow regexing on the whole thing 😅, I assume so? | 18:58 |
JayF | too many secrets for that :D | 18:58 |
JayF | "why did my server with 'critical' in the middle of its ipmi_password get caught" /s | 18:58 |
JayF | It really depends on how it's oriented: if we do some kinda offline tool which mainly pokes the DB, we can do more nasty stuff | 18:59 |
JayF | if we try to arrange a more api-centric way, that is probably not good | 18:59 |
JayF | but I am also skeptical of any solution that starts from "your ironic is working" as a starting point, because even getting *to* that point is nontrivial | 18:59 |
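(A hypothetical sketch of the offline-tool angle: with the conductor still down, read the DB directly and pick out instances whose display_name marks them as controllers. The table and column names match ironic's schema, where nodes.instance_info is JSON text, but the matching rule, connection string, and everything else is illustrative.)

```python
# Hypothetical offline sketch: ironic is NOT running; we only read its DB.
import json

import sqlalchemy as sa

# connection string is an assumption for illustration
engine = sa.create_engine('mysql+pymysql://ironic:secret@localhost/ironic')
with engine.connect() as conn:
    rows = conn.execute(sa.text('SELECT uuid, instance_info FROM nodes'))
    for uuid, instance_info in rows:
        info = json.loads(instance_info or '{}')
        # key on *instance* data, per the requirement above
        if 'controller' in (info.get('display_name') or ''):
            print(uuid)  # power these on first; everything else can wait
```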
TheJulia | I'm largely thinking the right thing feels like "get your ironic conductor restarted" and it starts taking over, since ideally it should be | 18:59 |
TheJulia | the only challenge is if you powered anything off.... | 18:59 |
TheJulia | anyway, stepping away | 19:00 |
TheJulia | bbiab | 19:00 |
JayF | "if you powered anything off" <-- we're talking DR-level recovery, from a full outage of a DC or computer room | 19:00 |
adamcarthur5 | Yeah JayF I'm right in saying the "manual powered off" isn't a requirement right? Because that is what the script in the bug does | 19:00 |
JayF | the script in the bug /tells ironic to power the node off/ | 19:01 |
adamcarthur5 | And I think Julia is questioning whether that is a hard-requirement | 19:01 |
JayF | which is a noop when Ironic sees the node is already powered off | 19:01 |
adamcarthur5 | noop? | 19:01 |
JayF | the entire point of that script is to circumvent conductor powering them on before we are ready | 19:01 |
JayF | no-op, as in, ironic sees it's already off and does nothing | 19:01 |
JayF | just updates the DB to be correct | 19:01 |
adamcarthur5 | But what about Julia's idea of messing with the conductor? I.e. get it online first and go from there | 19:01 |
JayF | the whole crux of this is ironic thinks node.power_state=on, when the actual server is powered off | 19:01 |
JayF | (and it might not be ideal to have an entire datacenter of power hungry servers coming online all at once for physics/electric power reasons) | 19:02 |
JayF | adamcarthur5: I don't dislike that idea, but I struggle with thinking of a way to model this where it works based on nova instance metadata for deployed images rather than a node-centric orientation | 19:02 |
JayF | At some places I worked, they had like, a set of servers that were "core" and they were always the same hardware, sometimes in a separate room/cage, etc | 19:03 |
JayF | in that model; something on the node to mark those as special is trivial | 19:03 |
JayF | in the model where what makes the node special is /some property inherent to the software installed on it/ (hence instance_info.display_name), it gets more complex | 19:03 |
adamcarthur5 | I like the idea of editing conductor behaviour (we can handle the whole "entire data center trying to power up at once" problem too) | 19:06 |
adamcarthur5 | And then I don't think getting instance information is impossible from there right? | 19:06 |
adamcarthur5 | It seems better than powering them all off with one script, and then needing another script to bring it back in a certain order? (I.e. if it's not a script, where would it live if you took this path?) | 19:07 |
JayF | maybe | 19:10 |
JayF | honestly I think about stuff like this in terms of API interfaces | 19:10 |
JayF | and I'm having trouble visualizing how you'd configure behavior like this | 19:10 |
JayF | if you were able to articulate a config grammar (even if a separate yaml file like what we proposed for dynamic networking), it might be easier to understand | 19:10 |
JayF | but also may be too complex for a minor feature? IDK. | 19:11 |
* TheJulia reappears from taking a break | 19:27 | |
TheJulia | so, I think IRC is doing a disservice to this discussion | 19:28 |
TheJulia | that being said, I think it has value, so I propose we jump on a call and talk through it because disasters also take many different shapes, and that is where I'm coming from. I'd like the conductor to be able to handle recoveries in general through a simple method. | 19:30 |
JayF | sure, when do you wanna have that chat? cc: adamcarthur5 | 19:30 |
JayF | I was about to grab lunch but can delay if all parties are here now() | 19:31 |
jph | I have a deployment with both Redfish and iLO5 hardware and was wondering how the conductors should be configured to handle this. I have encountered errors where the conductors do not start because no default power interface satisfies both hardware types. Which leads me to believe that I need two separate conductor groups, one for each hardware type. Is this correct? | 19:31 |
adamcarthur5 | I can call now | 19:32 |
adamcarthur5 | JayF | 19:33 |
JayF | so one conductor can handle pretty much any hardware type you want | 19:33 |
JayF | let me find the conf you need | 19:33 |
JayF | TheJulia: you wanna have that chat now or some later point? | 19:33 |
TheJulia | Let me make a cup of coffee and then I can chat | 19:34 |
TheJulia | coffee: brewing | 19:35 |
JayF | https://docs.openstack.org/ironic/rocky/configuration/sample-config.html jph see enabled_hardware_types | 19:35 |
JayF | you can set that to a list, comma separated | 19:36 |
JayF | oh screw that rocky link | 19:36 |
JayF | silly google | 19:36 |
TheJulia | heh | 19:36 |
JayF | https://docs.openstack.org/ironic/latest/configuration/sample-config.html | 19:36 |
JayF | this example even has two in it! jph ^^^ | 19:36 |
TheJulia | And generally have the same enabled interfaces across your conductors | 19:36 |
JayF | I have seen environments that used conductor groups to separate devices of different types | 19:36 |
JayF | but I wouldn't do it that way | 19:36 |
jph | Okay, thanks. I will reconfigure ironic again and see if it is any different. | 19:37 |
TheJulia | Coffee: Acquired | 19:38 |
JayF | jph: also note enabled_*_interfaces and default_*_interface. Those may also be useful in your case. | 19:39 |
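(A minimal ironic.conf sketch for jph's mixed fleet, assuming the stock redfish and ilo5 hardware types; note that no default_power_interface is forced, since no single default can satisfy both types — which is what kept the conductor from starting.)

```ini
[DEFAULT]
enabled_hardware_types = ilo5,redfish
enabled_power_interfaces = ilo,redfish
# deliberately no default_power_interface: each hardware type then falls
# back to its own supported default
```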
JayF | TheJulia: you wanna do a meet? or should I zoomzoom | 19:39 |
TheJulia | https://meet.google.com/sui-uuhe-kyz | 19:39 |
JayF | jph: to avoid having to explain it all again, https://www.youtube.com/watch?v=FUGB2e3XP0g#t=6m30s (6:30 timestamp) explains some of this | 19:40 |
TheJulia | a meeting we shall go, a meeting we shall go... | 19:40 |
* TheJulia loses all remaining sanity | 19:40 | |
JayF | adamcarthur5: ^ | 19:41 |
jph | Thanks JayF and TheJulia. I dropped the default_*_interfaces from the conductor configuration, leaving it to the defaults, and the conductor is now up and running with both hardware types. | 20:10 |
TheJulia | JayF: by resource class, shard, owner, lessee, conductor group, and then just do everything for on or off | 20:50 |
TheJulia | Recovery power on delay is a huge variable too | 20:52 |
JayF | adamcarthur5: ^ | 20:52 |
JayF | if you could ping adam in on these too it'd be awesome | 20:52 |
TheJulia | Also, we’d need to be able to signal a soft off | 20:55 |
JayF | I thought most power offs these days were soft->(poll)->(timeout)->hard | 20:56 |
JayF | that is what I had in mind in any event | 20:56 |
opendevreview | Verification of a change to openstack/ironic master failed: ci: Remove code which has been long-dead https://review.opendev.org/c/openstack/ironic/+/950461 | 20:59 |
JayF | urgh | 21:00 |
JayF | the job was -nv'd but is still in gate, fixing | 21:00 |
opendevreview | Jay Faulkner proposed openstack/ironic master: ci: Remove code which has been long-dead https://review.opendev.org/c/openstack/ironic/+/950461 | 21:02 |
TheJulia | Power off is hard, you have to explicitly say soft if you want to be soft | 21:02 |
* JayF notes that in etherpad | 21:02 | |
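(For reference, the explicit-soft flow Julia means: the states API accepts a "soft power off" target, and the CLI exposes it as a flag; the timeout value here is illustrative.)

```shell
# REST equivalent: PUT /v1/nodes/<node>/states/power
#   {"target": "soft power off", "timeout": 600}
openstack baremetal node power off --soft --power-timeout 600 <node>
```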
TheJulia | Explicitly discussing power-on recovery: it seems like less of a big hammer is needed, but avoiding hot-spotting is super hard | 21:06 |
JayF | yeah, i think the recovery side of it is a little fuzzier | 21:11 |
JayF | I'll ponder it overnight and get more input and we'll see what squeezes out | 21:11 |
TheJulia | so in talking with another operator, they are *super* concerned about hotspotting for resumption and almost feel they will need to query external DCIM tooling data to figure out their preferred ordered lists | 21:14 |
TheJulia | which makes it a little easier if recovery is also just explicitly slower too | 21:14 |
TheJulia | The key thing they wanted to note is they have some machines which are at peak for ~2-3 minutes | 21:15 |
TheJulia | and only then can they trigger the next one in the rack. | 21:15 |
TheJulia | I floated "EMERGENCY_OFF" as like a power state, and they loved it; they envisioned it as "oh, I can then turn off all my active and powered-on nodes", that way they know the nodes they need to get back online to be back at the same place | 21:18 |
TheJulia | lots of concern about being able to send soft though, so whatever flow there gets a little weird | 21:18 |
TheJulia | JayF: doh, sorry about that regarding the -nv | 21:21 |
JayF | np it happens | 21:31 |
opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: Workaround neutron's move to uwsgi only https://review.opendev.org/c/openstack/networking-generic-switch/+/950559 | 21:45 |
TheJulia | doh, yay typos. | 21:54 |
opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: ci: workaround neutron's move to uwsgi only https://review.opendev.org/c/openstack/networking-generic-switch/+/950559 | 22:54 |
opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: ci: workaround neutron's move to uwsgi only https://review.opendev.org/c/openstack/networking-generic-switch/+/950559 | 23:32 |