rpittau | good morning ironic! o/ | 07:39 |
---|---|---|
dtantsur | TheJulia, JayF, in-band clean steps are orthogonal to drivers, so I'm a bit puzzled by the discussion last night. If there is a step in IPA, you can use it today. | 09:20 |
dtantsur | Or did you mean out-of-band really? | 09:20 |
dtantsur | JayF: re automated clean template, we just need to expose https://docs.openstack.org/ironic/latest/configuration/config.html#conductor.clean_step_priority_override as a Node field | 09:22 |
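For reference, the option dtantsur links is a multi-valued conductor setting; a minimal ironic.conf sketch of what a per-node field would mirror (step names and priorities are illustrative only):

```ini
[conductor]
# Format is <interface>.<step_name>:<priority>; may be given multiple times.
# A priority of 0 effectively disables the step during automated cleaning.
clean_step_priority_override = deploy.erase_devices_metadata:120
clean_step_priority_override = deploy.erase_devices:0
```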
TheJulia | Jay was thinking of taking driver/vendor-specific feature steps and making them usable across the board by a generic driver, so like use generic redfish but be able to always invoke a special ilo driver management step. | 09:24 |
* TheJulia tries to go back to sleep | 09:24 | |
TheJulia | I think the hope was to decouple perception of need to maintain a whole driver to get useful individual features | 09:25 |
dtantsur | I'm not sure what is bad about it (we literally coined the notion of "hardware types"). Maybe we should just make the driver composition more easily composable. | 09:49 |
dtantsur | (so, it is indeed about out-of-band steps, not in-band, right?) | 09:50 |
TheJulia | Nothing bad, but practical challenges | 09:50 |
dtantsur | Crossing the hardware type boundary will also cause practical challenges (like, the iLO one will expect ilo_address, which is not even the same format as redfish_address) | 09:51 |
TheJulia | Yup | 09:51 |
opendevreview | Mahnoor Asghar proposed openstack/ironic master: Add inspection hooks https://review.opendev.org/c/openstack/ironic/+/892661 | 11:36 |
iurygregory | good morning Ironic | 11:41 |
masghar | Morning! | 11:59 |
rpittau | masghar: hi! is that the last of the "hooks" patches? or we're missing something else? I lost count :D | 12:27 |
masghar | rpittau: It's the last one :D | 12:37 |
rpittau | \o/ | 12:37 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Generic API for attaching/detaching virtual media https://review.opendev.org/c/openstack/ironic/+/894918 | 12:53 |
drannou | dtantsur: crossing hardware type could be a good idea, for us for example we would like to be "platform" agnostic. | 13:08 |
drannou | for the moment we are using HPe hosts, but I don't want to be vendor lock | 13:09 |
TheJulia | drannou: but are you using specific driver features which would only ever work/exist on hpe hardware? | 13:16 |
dtantsur | drannou: I don't think what we discuss with TheJulia is related to the vendor lock problem | 13:23 |
dtantsur | it's really a question of which driver you set to the node | 13:23 |
dtantsur | If we want something in Redfish, we can ask DMTF for it. Jacob and I are going through such a proposal process right now for a different thing. | 13:23 |
TheJulia | I think there are some small steps we can take to simplify some matters, which could also go to the standards bodies if needed, but vendors also traditionally try to drive differentiator features in the form of "value-add" for their customers... and to sell more hardware/license fees by locking some of those features behind upgraded use licenses | 13:25 |
dtantsur | Well, licenses are something quite orthogonal here. They can easily hide parts of standard Redfish behind a paywall. | 13:26 |
TheJulia | And there *is* already a way for us to do cross cutting features, it is just not the fastest/easiest path if a vendor wants to drive it | 13:26 |
TheJulia | oh yes, absolutely | 13:26 |
TheJulia | example, Supermicro | 13:26 |
TheJulia | and VMedia | 13:26 |
drannou | TheJulia: in fact, nothing for the moment, but we will need to have a license management, and firmware upgrade. | 13:30 |
drannou | The problem is that I don't want to switch completely to ILO for just that | 13:31 |
dtantsur | Redfish firmware upgrade has been implemented recently | 13:31 |
dtantsur | License management is something we haven't looked at yet. It's probably highly vendor-specific. | 13:31 |
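For context, the recently added Redfish firmware update is exposed as a clean step; a hedged sketch of what invoking it manually might look like (URL and checksum are placeholders, and the exact interface/argument names should be checked against the Ironic docs for the release in use):

```json
[
  {
    "interface": "management",
    "step": "update_firmware",
    "args": {
      "firmware_images": [
        {"url": "http://example.com/bmc-firmware.bin", "checksum": "<sha256>"}
      ]
    }
  }
]
```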
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Generic API for attaching/detaching virtual media https://review.opendev.org/c/openstack/ironic/+/894918 | 13:36 |
TheJulia | so, there *is* a v1.1.0 LicenseService definition off the root in the standard | 13:51 |
dtantsur | Nice! | 13:53 |
TheJulia | Looks disjointed from the system entirely, as a payload to upload, but also embraces the Oem field as well | 13:53 |
dtantsur | At some point, we may want to make an inventory of Redfish features we eventually want to support | 13:53 |
dtantsur | Custom TLS certificates are probably at the top of this list, together with UEFI HTTP boot | 13:54 |
TheJulia | There is also a license list resource | 13:55 |
TheJulia | Looks like this stuff got added back in 2021.1 | 13:56 |
TheJulia | but is not in the mockups | 13:56 |
TheJulia | at least, the couple I clicked into | 13:57 |
TheJulia | Speaking of UEFI HTTP Boot, I need to get back on that | 13:59 |
TheJulia | looks like I broke something on the emulator schema :\ | 13:59 |
* TheJulia tries to grok a thread and wonders if they are advocating nuking PLDM from the face of device interactions and replacing everything with redfish | 14:17 | |
JayF | dtantsur: the idea would be to think about a world where there would be no ilo hardware type, but instead you'd have something like a redfish hardware type and a bunch of plugins which just call custom redfish hardware APIs. but the world isn't lollipops and rainbows and all of the bonus stuff isn't in the standards-compliant places so this is a solution for a world that | 14:17 |
JayF | doesn't exist | 14:17 |
dtantsur | heh, I see | 14:18 |
dtantsur | My less ambitious vision is for both ilo and idrac drivers to be built on top of the redfish one | 14:18 |
TheJulia | I'm wary of unicorns, but Legends of Tomorrow Season 4... Episode 4 streamed last night in the living room. | 14:18 |
JayF | dtantsur: that kinda sucks operationally, to be honest, because then we still have more room for divergence over time | 14:19 |
dtantsur | Only if the reality diverges. Which is something we cannot prevent. | 14:19 |
JayF | dtantsur: my hope would be for an operator to know+understand one driver, hopefully, and not have to figure out the incantation for each different vendor's hardware they buy | 14:19 |
JayF | :( | 14:19 |
TheJulia | Divergence is always going to occur on some level, the value add driver. | 14:19 |
dtantsur | I mean... knowing that "ilo" is for HPE hardware is the least complex fact about Ironic you can learn | 14:19 |
JayF | we have a PTG session about ^ | 14:20 |
JayF | it's not simple | 14:20 |
TheJulia | ... not all HPE hardware though | 14:20 |
dtantsur | nor is our redfish driver for all redfish hardware | 14:20 |
TheJulia | yup | 14:20 |
dtantsur | #sadbuttrue | 14:20 |
JayF | that is the piece that hurts me in my hurtin' place LOL | 14:20 |
dtantsur | you're not alone with your hurtin' place's hurt :D | 14:21 |
JayF | the 'redfish driver is not for all redfish hardware' ... is that a case we can detect? | 14:22 |
JayF | like are most failure modes for that "oh, it's missing support for X, Y, Z" or is it "yeah support for this is advertised but broken" (or "yes") | 14:23 |
dtantsur | We sometimes even try (see idrac-redfish-virtual-media in the recent past) | 14:23 |
dtantsur | The huawei driver is a counter-example, I think | 14:23 |
TheJulia | both very good examples | 14:23 |
dtantsur | And now we're going to have a Fujitsu's Redfish variant | 14:23 |
JayF | huawei driver is not called that upstream is it (?) | 14:23 |
TheJulia | the former generally works, but we've found cases where firmware updates break it; the latter has four different field names to possibly account for power state. | 14:23 |
dtantsur | JayF: ibmc? | 14:23 |
JayF | ack | 14:24 |
JayF | I never knew what h/w that was for lol | 14:24 |
dtantsur | :D | 14:24 |
JayF | never worked a place that would've gotten their gear | 14:24 |
TheJulia | ... on the plus side, the ibmc contributor indicated they were aware of the noncompliance and intended to fix it | 14:24 |
dtantsur | You would need to move to Russia for that, which is something I'd not recommend | 14:24 |
dtantsur | (or China, obviously, which I cannot recommend either) | 14:24 |
TheJulia | China != Russia, but they both get cast in the same light often times | 14:25 |
TheJulia | Different cultures and all | 14:25 |
* JayF puts on his best william wallace face and yells something about freedom /s | 14:25 | |
dtantsur | Well, if you don't want to get into details, and are lucky enough to not have to care about them, you can throw them in the same bucket | 14:25 |
JayF | ^ is pretty much where I am | 14:26 |
TheJulia | dtantsur: this is true | 14:26 |
* dtantsur has started the very slow and complicated process of gaining German citizenship, keep your fingers crossed | 14:26 | |
TheJulia | I wouldn't mind at some point visiting china on a tourist visa, but the odds of that happening are slim | 14:26 |
TheJulia | dtantsur: \o/ | 14:26 |
masghar | I heard from an ex-CEO-in-China recently that the people on the ground are super nice, but there are difficulties in running a business etc | 14:51 |
TheJulia | Also the movement of funds in and out | 14:55 |
TheJulia | As a business traveler, China is a difficult country to visit. | 14:55 |
TheJulia | oh wow, reading about PLDM and the internal modeling explains a LOT of how BMCs have typically viewed and provided networking details | 14:58 |
dtantsur | PLDM? | 15:08 |
TheJulia | oh, wow, and the interaction explains why it is often slow | 15:08 |
TheJulia | Platform Level Data Model | 15:08 |
TheJulia | how devices on the motherboard are supposed to talk to the BMC | 15:08 |
dtantsur | AKA dark magic reference :D | 15:10 |
iurygregory | hey TheJulia I was talking with dtantsur about a bug downstream involving multipath, scenario: a machine has 84 devices with 4 paths each, Dmitry had an interesting idea, that maybe we could cache the output of https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L233 and avoid doing two more calls in our code (if I recall correctly they are reaching timeouts when trying to provision the | 15:10 |
iurygregory | node), do you think this would be safe to cache? | 15:10 |
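A minimal sketch of the caching idea discussed above, assuming a hypothetical helper around the expensive multipath listing; a real change would live in ironic_python_agent/hardware.py and would need the invalidation path mentioned below:

```python
import time

_MULTIPATH_CACHE = {'data': None, 'ts': 0.0}
_CACHE_TTL = 300  # seconds; illustrative value

def _list_multipath_devices():
    """Placeholder for the real (slow) multipath enumeration call."""
    raise NotImplementedError

def get_multipath_devices(use_cache=True):
    """Return multipath info, reusing a recent result when allowed."""
    now = time.time()
    if (use_cache and _MULTIPATH_CACHE['data'] is not None
            and now - _MULTIPATH_CACHE['ts'] < _CACHE_TTL):
        return _MULTIPATH_CACHE['data']
    data = _list_multipath_devices()
    _MULTIPATH_CACHE.update(data=data, ts=now)
    return data

def invalidate_multipath_cache():
    """Drop the cached result, e.g. after an I/O error or topology change."""
    _MULTIPATH_CACHE.update(data=None, ts=0.0)
```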
TheJulia | In the MCU, it might be titled "The Darkhold" | 15:10 |
iurygregory | dark magic reference lol :D | 15:11 |
TheJulia | Cache makes sense until we encounter an IO error and feel the need to reset said cache, fwiw | 15:11 |
TheJulia | The *odds* are that an IO error due to changes comes from something like a cable getting unplugged | 15:12 |
JayF | if we're dealing with multipath, remote IO | 15:12 |
JayF | and we lose a device/topology changes | 15:12 |
TheJulia | yeah | 15:12 |
JayF | shouldn't that be a provisioning-aborting event? | 15:12 |
JayF | or does that sorta intentionally go against the HA nature of the multipathing? | 15:13 |
dtantsur | If that's not a root device, we can survive it | 15:13 |
TheJulia | sort of yes, but via the cache we also know the multipath device name from the kernel, so we can continue to operate/resolve | 15:13 |
JayF | I am unsure if in this case we should be using the HA (?) | 15:13 |
TheJulia | oh, the kernel will make sure you keep working if you're interacting with the mpath device :) | 15:13 |
JayF | oh so the idea is, I got an I/O error using this path, I'll try this one | 15:13 |
* TheJulia has pulled many a fiber cable | 15:13 | |
JayF | it's like you have ... multi[ple] path[s]! | 15:13 |
TheJulia | JayF: the kernel does it for us :) | 15:13 |
JayF | does this mean I'm a storage engineer now? Where can I pick up my LUNs? /s | 15:14 |
TheJulia | You can pick up your LUNs once you rack/stack/cable a 500TB-1PB storage system | 15:14 |
TheJulia | oh, and configure the SAN controllers | 15:14 |
JayF | I don't like SAN, it's rough and sticky and gets everywhere | 15:15 |
* dtantsur has realized he can no longer list all possible hardware interfaces by heart and is a bit upset | 15:16 | |
iurygregory | in my mind the easy path was to use skip_block_devices XD "hey ironic, don't look at these devices please" lol | 15:16 |
TheJulia | Now remember https://www.etsy.com/listing/1361107953/torch-and-fire-extinguisher-holder makes things *much* better with SANs | 15:16 |
dtantsur | iurygregory: that's not a terrible idea, but I'm afraid we'll anyway try to list them first | 15:16 |
dtantsur | and then you need to add Metal3 API to pass this skip_block_devices... | 15:16 |
iurygregory | yep | 15:16 |
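For reference, a hedged example of how the skip_block_devices idea is typically set, as a node property holding device hints (the hint value is a placeholder; check the cleaning docs for the exact hint syntax your IPA version supports):

```shell
# Tell IPA to ignore matching block devices; the vendor hint is illustrative.
baremetal node set <node> \
    --property skip_block_devices='[{"vendor": "ExampleVendor"}]'
```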
TheJulia | enumeration is dynamic guys | 15:16 |
JayF | dtantsur: you can never list the ones that I wrote that are locked up in some private repo in rackerlabs/ github repo somewhere ;) | 15:17 |
TheJulia | depending on SAN state and FC login query | 15:17 |
TheJulia | and SAN response speed | 15:17 |
JayF | dtantsur: oh, you mean interface, not manager, just kidding | 15:17 |
TheJulia | so don't expect stability *at all* unless you're matching WWNs | 15:17 |
dtantsur | JayF: you got me scared for a minute: new downstream hardware interfaces, that would be.. something | 15:17 |
JayF | you're talking about agent | 15:17 |
JayF | and rescue | 15:17 |
* dtantsur suspects someone may have a graphical console interface somewhere in production | 15:17 | |
JayF | you know that, right? LOL | 15:17 |
dtantsur | wellllllll :D | 15:18 |
iurygregory | i don't even think it's a SAN, to me it's a JBOD they are using hehe | 15:18 |
dtantsur | TheJulia: I keep telling people that, but people INSIST on using device names... anyway | 15:18 |
TheJulia | iurygregory: is it showing more than one path per device? | 15:18 |
dtantsur | TheJulia: 4 paths for each of the 84 devices | 15:19 |
TheJulia | iurygregory: whiskey | 15:19 |
TheJulia | ?$? | 15:19 |
dtantsur | hundreds of block devices per machine | 15:19 |
iurygregory | yeah | 15:19 |
TheJulia | wtaf | 15:19 |
dtantsur | fun right? | 15:19 |
TheJulia | I've seen 2 paths to a JBOD array, never 4 before | 15:19 |
dtantsur | It takes IPA a few minutes to loop through them.. and then it starts again.. and again... and again...... | 15:19 |
TheJulia | ... are we sure it is not a JBOD array via FC? | 15:19 |
dtantsur | I'm personally not sure about anything in that setup | 15:19 |
* TheJulia makes nervous laughter as she logs into the case management system to see what the customer uploaded | 15:20 | |
dtantsur | Let's hope it's cat photos | 15:20 |
iurygregory | yes please | 15:20 |
iurygregory | and corgi photos also! | 15:20 |
*** dking is now known as Guest3726 | 15:21 | |
TheJulia | *anyway* caching the multipath command output is stable enough for IPA's execution | 15:21 |
iurygregory | ok, I will try to think on how to do this after grabbing lunch | 15:21 |
iurygregory | and coffee | 15:21 |
iurygregory | brb | 15:21 |
dtantsur | Does anyone know a usable way to alternate between two shared screens? | 15:23 |
JayF | on linux? | 15:23 |
JayF | and do you need audio? | 15:23 |
dtantsur | Like, slides and a console. But without turning the sharing on/off or sharing the whole screen. | 15:23 |
dtantsur | JayF: on linux, audio not required | 15:24 |
JayF | I have a great answer for you if both of those are yes | 15:24 |
JayF | perfect | 15:24 |
TheJulia | yeah, the list forces a scan, so it is a steady enough state | 15:24 |
JayF | dtantsur: obs has a plugin which will output to a v4l loopback device | 15:24 |
JayF | dtantsur: instead of using share screen function; use webcam function to share the v4l loopback device | 15:24 |
JayF | dtantsur: now you have OBS studio fully functional to swap out what you're sharing on your screen between slides/console | 15:24 |
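A rough sketch of the setup JayF describes, assuming the v4l2loopback kernel module is installed (device number and label are arbitrary):

```shell
# Create a virtual video device for OBS to output to
sudo modprobe v4l2loopback devices=1 video_nr=10 \
    card_label="OBS Virtual Camera" exclusive_caps=1

# In OBS, use Start Virtual Camera (or the older obs-v4l2sink plugin),
# then select "OBS Virtual Camera" as the webcam in the meeting client.
```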
TheJulia | Awww, customer redacted the pathing details enough that I can't tell if there is more than one storage controller port in the mix | 15:25 |
dtantsur | JayF: nice! I need to rehearse that, but sounds like it could work. | 15:26 |
dtantsur | I suspect I can even display my small face in the corner | 15:26 |
TheJulia | iurygregory: so, I'm inclined to think *not* a jbod, but a proper SAN, because it looks like 3 different devices are in the mix locally based upon bus id markers, but they are also not sequential so ... dunno | 15:27 |
TheJulia | dtantsur: I did that once with an f-key to toggle the source, it worked really well | 15:27 |
TheJulia | plan 1-2 hours for learning curve and all | 15:28 |
JayF | dtantsur: exactly, it's a pretty good model, and even if you can't make it work via a webcam, you can open something like gvc or cheese and share that as your screen | 15:28 |
opendevreview | Takashi Kajinami proposed openstack/ironic-inspector master: Fix python shebang https://review.opendev.org/c/openstack/ironic-inspector/+/898587 | 15:28 |
dtantsur | JayF++ TheJulia++ | 15:28 |
JayF | oh and also | 15:28 |
JayF | test the hell outta it | 15:28 |
TheJulia | +++++++++ | 15:28 |
JayF | OBS on linux is flakey as hell | 15:29 |
JayF | in my world, it was always audio so no audio should help a lot | 15:29 |
dtantsur | I have some experience with OBS already, but only with regular recording. | 15:29 |
TheJulia | Also, if you're doing anything like green screening with your background | 15:29 |
JayF | I tried to do streaming to twitch for a while from Linux; I had something like a 10% incidence rate of the sound just ... dying mid-stream | 15:29 |
JayF | no errors from OBS, no indicators in the logs, it just ... stopped | 15:29 |
dtantsur | If I share that in a call, the sound will go through the regular means | 15:30 |
dtantsur | (which also has a small chance of just not working) | 15:30 |
JayF | yeah so I think you'll be fine, I'm just making the point that it's not the most rock solid platform on linux at this point | 15:30 |
dtantsur | Since when is Linux rock solid? That's one of the things we love about it :D | 15:30 |
JayF | also note that I'm doing all this on gentoo so there's a nonzero chance it was just broken because that point release of ffmpeg, plus that point release of obs, plus a full moon and my cflags or some craziness lol | 15:30 |
dtantsur | LOL | 15:31 |
JayF | dtantsur: dead serious; I run linux these days on things other than my gaming machine because I got so dang tired of feeling like my own devices were trying to sell me crap | 15:31 |
TheJulia | Remember, do not taunt the darkhold of professional video tools :) | 15:31 |
dtantsur | hehe | 15:31 |
dtantsur | I tried making a video from fragments.. I've seen some horrors... | 15:31 |
*** Guest3726 is now known as dking | 15:32 | |
TheJulia | I think I understand PLDM enough to hold down a conversation | 15:32 |
TheJulia | I may have had to push some python bits out of my brain for this | 15:32 |
dtantsur | JayF: ikr.. I'm afraid our ancient phones will need replacements soon, so I'm going to choose between the android crap and the apple crap... | 15:32 |
clarkb | I would probably do that using an xmonad tiling mode that emulates workspaces on a single workspace | 15:33 |
dtantsur | TheJulia: is there a short(ish) overview to read? | 15:33 |
dtantsur | clarkb: but will your browser be able to share only one such workspace? | 15:34 |
dking | Is anybody here familiar with troubleshooting SQLAlchemy errors "QueuePool limit of size X overflow Y reached, connection timed out..."? I'm seeing those in my Ironic logs, and I can connect directly to MariaDB, and I don't know how to troubleshoot SQLAlchemy outside of adjusting the code. | 15:34 |
TheJulia | dtantsur: https://www.dmtf.org/sites/default/files/standards/documents/DSP0240_1.0.0.pdf | 15:34 |
dtantsur | thx | 15:34 |
TheJulia | the rest of the related docs get into the nitty gritty of PLDM, but the base model *appears* to be rather flexible but command/response driven | 15:34 |
dtantsur | dking: the first time I hear about it. is it linked with some sort of activity? | 15:34 |
TheJulia | so "give me your state" -> "Here is my state" after you ask "what do you support" "here is what I support" | 15:35 |
clarkb | dtantsur: ya I would share the single workspace then use window management to swap windows aroudn appropriate on the workspace | 15:35 |
dking | dtantsur: I'm not certain. I don't know of any special activity going on. We're not running that many nodes. It's showing up with: ERROR futurist.periodics [-] Failed to call periodic 'ironic.conductor.manager.ConductorManager._sync_power_states' | 15:35 |
dtantsur | clarkb: Right, makes sense. I'm on MATE nowadays though | 15:36 |
TheJulia | ... So if memory serves the connection pool is like 8 connections to start | 15:36 |
TheJulia | dking: how much concurrency are you running for power state sync? | 15:37 |
dking | Here, I'm seeing "limit of size 5 overflow 50 reached". I'll have to count, but I didn't think we had 50 nodes in. | 15:37 |
TheJulia | dking: is the version of mariadb on sqlalchemy's list of supported versions? | 15:38 |
dtantsur | TheJulia: ouch, even the glossary is quite a text to read | 15:38 |
TheJulia | dking: that sort of sounds like mariadb is locking or has a load balancer dropping the connection | 15:38 |
dking | TheJulia: I can check. We're using Metal3, so it's likely whatever comes stock with that. | 15:39 |
TheJulia | dking: as a starting point in db troubleshooting, try making a backup on the server side | 15:39 |
TheJulia | oh | 15:39 |
TheJulia | hmm | 15:39 |
JayF | If you're using metal3, there is also potentially value in asking in their slack if we're not able to figure it | 15:40 |
TheJulia | so no load balancer | 15:40 |
dking | TheJulia: I thought about that, but it's not really using a load balancer. I think it might just be 1 replica, and I can connect to the DB just fine from within the mariadb container. But those are good things to check. | 15:40 |
TheJulia | is the db in a separate pod? | 15:40 |
JayF | of course half the people are in both channels so.... | 15:40 |
dking | JayF: Yeah, the best people for Metal3 are in here, too. :) | 15:40 |
dtantsur | dking, TheJulia, a useful thing to know about Metal3 is that its MariaDB image is minimized on purpose | 15:40 |
dking | TheJulia: No, it should be the same pod. | 15:40 |
TheJulia | dking: there could be a transaction deadlock... but that should get detected... there is still a chance a held lock is not detected | 15:41 |
dtantsur | e.g. https://github.com/metal3-io/mariadb-image/blob/main/runmariadb#L22 used to be even lower than this | 15:41 |
dtantsur | https://github.com/metal3-io/mariadb-image/blob/main/runmariadb#L46-L48 is a suspect too | 15:41 |
TheJulia | which is stupidly weird | 15:41 |
dtantsur | dking: I'd start with playing with these values ^^^ | 15:41 |
TheJulia | as are max connections, since that can scale based upon what is going on | 15:42 |
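For reference, the numbers in that QueuePool error correspond to standard oslo.db options in ironic.conf (the defaults are what produce "size 5 overflow 50"); the values below are illustrative only, and MariaDB's max_connections from the image linked above would need to be raised to match:

```ini
[database]
max_pool_size = 10
max_overflow = 60
pool_timeout = 60
```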
TheJulia | if you can't dump the db or it hangs, then lock is likely it | 15:42 |
TheJulia | but dump *through* the server, not from the files (i think that got removed in mariadb...) | 15:43 |
* TheJulia has a rather dusty DBA hat someplace | 15:45 | |
* TheJulia also had the arcane book of databases someplace.... | 15:45 | |
dtantsur | (side note: "the same pod" thingy is something I'd like to fix eventually) | 15:45 |
dking | Well, I'm currently wondering if I can troubleshoot the issue live, as I'm suspecting that it won't replicate (or at least not quickly) if Ironic gets restarted. | 15:48 |
TheJulia | you can try and get a feeling from the DB, but yeah, any restart is going to change things | 15:49 |
dking | And it didn't happen right away, so it may take a while to show up again, if ever. I'm going to check the logs to see how long it's been going on. At least I have some configs that I can play with. | 15:51 |
dtantsur | dking: the recommendation about slack is not too bad, because the Ericsson folks do use MariaDB (unlike us in OpenShift) | 15:52 |
dking | So, for now, the consensus is that something hung, we don't really know where, but restart, and if it comes back, attempt to increase the limits mentioned above? | 15:52 |
dtantsur | that's the only thing that comes to my mind | 15:53 |
TheJulia | likewise | 15:53 |
dking | Okay, thank you very much! | 15:53 |
JayF | Is metal3+mariadb a common deployment (with the built-in, slimmed mariadb container?) | 15:53 |
dtantsur | JayF: yes. I don't know the proportion, but both sqlite and mariadb are common (and mariadb was there first) | 15:54 |
JayF | ack | 15:55 |
opendevreview | Takashi Kajinami proposed openstack/ironic-inspector master: Fix python shebang https://review.opendev.org/c/openstack/ironic-inspector/+/898587 | 16:02 |
rpittau | good night! o/ | 16:16 |
TheJulia | o/ | 16:19 |
* iurygregory is back | 16:26 | |
iurygregory | TheJulia, gotcha, I will look at caching now to try to improve this (fingers crossed) | 16:28 |
TheJulia | okay, cool! | 16:28 |
iurygregory | TheJulia, newbie question, but do you think changing ironic.conf [agent]command_timeout ( https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/agent_client.py#L206 ) could help with their issue? since ironic logs show a lot of https://paste.opendev.org/show/bXAGwIusnE3JdjWTWE43/ | 16:50 |
TheJulia | I would think unlikely | 16:51 |
TheJulia | and that feels like a workaround for whatever state the agent is in when the call hits | 16:51 |
iurygregory | yeah, it's more a workaround just to see if it helps them | 16:52 |
iurygregory | till we figure out the cache and hope this would help them | 16:52 |
iurygregory | the first problem they had was during inspection, so increasing the timeout helped them | 16:53 |
dtantsur | well, "helped" | 16:53 |
dtantsur | after you walk them through raising all possible timeouts everywhere (and you'll need to), the final question will be "please make it so that the process does not take 6 hours" :D | 16:54 |
TheJulia | I guess the only way to know for sure is to correlate agent logs | 16:54 |
TheJulia | dtantsur: ++ | 16:54 |
iurygregory | dtantsur, yeah | 16:54 |
dtantsur | Agent timeouts make me think about insufficient eventlet monkey patching | 16:55 |
dtantsur | Otherwise why is IPA not responsive? | 16:55 |
TheJulia | or, is it transient and unrelated? | 16:55 |
TheJulia | there are a few different variables, you almost need a timeline of interaction drawn from both sides | 16:55 |
JayF | this is the same issue that started with multipath, yeah? | 16:56 |
dtantsur | yep | 16:56 |
JayF | I'll note that IO is one of the places that can lock up in a way that python can do nothing whatsoever about it | 16:56 |
TheJulia | at 12:01:31.0311 the agent started waving a white flag saying "please, cache the data" | 16:56 |
TheJulia | The conductor continued to think it was dancing until 12:02:31 | 16:56 |
*** dking is now known as Guest3741 | 17:20 | |
*** Guest3741 is now known as dking | 17:20 | |
dking | dtantsur: Are you still around? | 17:40 |
dtantsur | dking: I'm not far from the computer | 17:40 |
dking | I asked a Metal3 question on slack, but it seems slow there. I ended up having to restart my pod, and the Ironic DB got cleared. I believe that BMO populates Ironic, and it did put my inspecting node back. In the past, I thought that it put back the other nodes, but it hasn't done so yet. Is there a way to force it to populate Ironic with the nodes it has provisioned? | 17:43 |
dtantsur | dking: it should eventually, but for nodes that are in a stable state there will be some delay until the reconciler gets to them | 17:43 |
dtantsur | (Ericsson folks are also in Europe, so you'll need to get back in the morning) | 17:43 |
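If waiting on the reconciler, one generic controller-runtime trick (not a documented BMO API) is to touch the BareMetalHost so a watch event fires and the host is reconciled right away; the annotation key here is made up for illustration:

```shell
# Any metadata change triggers a watch event and a prompt reconcile.
kubectl annotate baremetalhost <host-name> -n <namespace> \
    example.org/poke="$(date +%s)" --overwrite
```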
dking | dtantsur: That's good to know. It's been about 45 minutes, so I was starting to worry. Do you know roughly how long before I should start worrying? | 17:45 |
dtantsur | hmm, 45 minutes IS quite long to my taste. Anything happening in the baremetal-operator logs? | 17:45 |
dking | dtantsur: Hmm. I see: {"level":"info","ts":"2023-10-17T17:46:35Z","logger":"controllers.HostFirmwareSettings","msg":"provisioner returns error","hostfirmwaresettings":{"name":"***","namespace":"machines"},"Error":"could not get node for BIOS settings: Host not registered","RequeueAfter":30} | 17:47 |
dtantsur | well, that's fair - the node does not exist | 17:47 |
masghar | What is the RIBCL business with HPE servers...and Redfish being its alternative in iLO - do we enable RIBCL in the server explicitly? | 17:48 |
dking | dtantsur: Not in Ironic. I was thinking that BMO would re-populate an empty Ironic. Would it only be able to update existing Ironic nodes? | 17:48 |
dtantsur | dking: no, it (should) always create nodes. I'm not yet sure why it does not - the logs may have a clue (but you may need to find it between other logging) | 17:49 |
JayF | masghar: I'm not sure; we have HPE downstream driver devs joining us for a PTG session to talk about the future of HPE drivers in Ironic and how those kinds of changes impact it | 17:49 |
dtantsur | masghar: RIBCL is their previous protocol. They've been moving to Redfish for a long time already (they were among the founders of Redfish) | 17:50 |
masghar | JayF: and dtantsur: I see. I'm trying to 'manage' an HPE server and it's failing with RIBCL, and Redfish is trying to use https when http is actually what should be used | 17:51 |
dtantsur | masghar: I can take a look tomorrow (nearly 8pm, c'mon! :) | 17:52 |
dtantsur | Make sure you're using the Redfish driver, I have no experience with the iLO one | 17:52 |
masghar | Ah yes, of course! Thank you | 17:52 |
dtantsur | (and check the firmware version against our docs - that's what we tested) | 17:52 |
masghar | I will switch to redfish and see, and check the firmware version too | 17:53 |
dtantsur | masghar: you should call it a day too, it's just as 8pm for you ;) | 17:53 |
dtantsur | (I'm actually watching youtube with IRC open on another screen) | 17:53 |
masghar | I started pretty late so it's alright ^-^ | 17:53 |
JayF | dtantsur: that used to be more common for me | 17:53 |
JayF | dtantsur: note my lack of commentary in here late-PST since starfield released ;) | 17:53 |
dtantsur | :D | 17:54 |
dtantsur | masghar: for us in OCP, the interesting drivers are idrac+idrac-redfish* for Dell, just redfish for everything else (FJ folks are testing their stuff themselves) | 17:56 |
dtantsur | (we only have best effort support for the ilo driver and only for iLO 4) | 17:56 |
JayF | ilo driver does work pretty much outta the box for ilo4/5 stuff, but the things you need in driver_info are different | 17:56 |
JayF | for upstream ironic sake | 17:56 |
masghar | dtantsur: noted, thanks! One thing in the HPE docs section is a bit outdated, I did note | 17:57 |
dking | dtantsur: Nevermind. It must have just taken a bit. They seem to be back now. | 17:57 |
dtantsur | JayF: right, we've made a decision to focus much more on Redfish in the Metal3 world | 17:57 |
masghar | JayF: alright | 17:57 |
JayF | dking: I prescribe you a walk and a cup of coffee next time before troubleshooting ;) | 17:57 |
JayF | dtantsur: makes sense, and fwiw I think that's the correct decision too | 17:58 |
dtantsur | dking: eventual consistency, amiright? | 17:58 |
JayF | eventual consistency makes operators nervous (at least it made me nervous) | 17:58 |
JayF | because our pagers go off until "eventual" comes about :P | 17:58 |
dking | JayF: I just got worried because it was about an hour without nodes in Ironic and I wasn't completely sure that BMO was supposed to even populate provisioning nodes into Ironic. | 17:58 |
JayF | honestly that's a metal3/bmo question more than an Ironic one | 17:59 |
dtantsur | an hour is wild, but I'm not sure how often the reconciler runs | 17:59 |
dtantsur | yeah | 17:59 |
JayF | and probably is more related to how slimmed down things are in those containers | 17:59 |
dking | I'm just not used to "eventually" being so long. :) But it looks like it was fine in the end. | 17:59 |
JayF | it's impressive how much Ironic can scale up and down these days | 17:59 |
dtantsur | dking: it's something we've discussed internally already: when BMO does not know that Ironic was gone, it cannot run an out-of-order reconcile | 17:59 |
dtantsur | still, an hour.. wow | 17:59 |
dking | Yeah, and only 40 nodes. I've run instances with over 200, so that seems a bit long for such a low number. | 18:00 |
dtantsur | Could be something to experiment with whenever you have some time. To see if it's a reproducible behavior. | 18:01 |
dtantsur | (Our folks scale-test with ~ 3500 nodes, but I don't think they try to kill Ironic in the process) | 18:03 |
TheJulia | 3500 sounds... sadistic. | 19:12 |
JayF | when we had 300 node clusters at $oldJob, and 3 reschedules set up | 19:12 |
JayF | and the racey-as-hell clustered compute manager | 19:12 |
JayF | pquerna used to load test us without warning by issuing 100 build requests | 19:12 |
JayF | we usually got about 85% eventual success rate, which was pretty great (most of the 15% failures were losing the CCM race 3 times in a row) | 19:13 |
JayF | that was spicy for the age of ironic at the time and the immaturity of the ironic/nova driver and all that | 19:13 |
TheJulia | if memory serves, it really expected the ccm to do that and one to always sort of fail | 19:14 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: WIP: Add test for dhcp-less vmedia based deployment https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/898006 | 20:45 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: WIP: Add test for dhcp-less vmedia based deployment https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/898006 | 23:00 |