rpittau | good morning ironic! o/ | 07:39 |
---|---|---|
dtantsur | TheJulia, JayF, in-band clean steps are orthogonal to drivers, so I'm a bit puzzled by the discussion last night. If there is a step in IPA, you can use it today. | 09:20 |
dtantsur | Or did you mean out-of-band really? | 09:20 |
dtantsur | JayF: re automated clean template, we just need to expose https://docs.openstack.org/ironic/latest/configuration/config.html#conductor.clean_step_priority_override as a Node field | 09:22 |
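For reference, the option dtantsur links is a multi-valued conductor setting; a minimal ironic.conf sketch of what a per-node field would mirror (step names and priorities are illustrative only):

```ini
[conductor]
# Format is <interface>.<step_name>:<priority>; may be given multiple times.
# A priority of 0 effectively disables the step during automated cleaning.
clean_step_priority_override = deploy.erase_devices_metadata:120
clean_step_priority_override = deploy.erase_devices:0
```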
TheJulia | Jay was thinking of taking driver/vendor-specific feature steps and making them usable across the board by a generic driver, so like use generic redfish but be able to always invoke a special ilo driver management step. | 09:24 |
* TheJulia tries to go back to sleep | 09:24 | |
TheJulia | I think the hope was to decouple perception of need to maintain a whole driver to get useful individual features | 09:25 |
dtantsur | I'm not sure what is bad about it (we literally coined the notion of "hardware types"). Maybe we should just make the driver composition more easily composable. | 09:49 |
dtantsur | (so, it is indeed about out-of-band steps, not in-band, right?) | 09:50 |
TheJulia | Nothing bad, but practical challenges | 09:50 |
dtantsur | Crossing the hardware type boundary will also cause practical challenges (like, the iLO one will expect ilo_address, which is not even the same format as redfish_address) | 09:51 |
TheJulia | Yup | 09:51 |
opendevreview | Mahnoor Asghar proposed openstack/ironic master: Add inspection hooks https://review.opendev.org/c/openstack/ironic/+/892661 | 11:36 |
iurygregory | good morning Ironic | 11:41 |
masghar | Morning! | 11:59 |
rpittau | masghar: hi! is that the last of the "hooks" patches? or we're missing something else? I lost count :D | 12:27 |
masghar | rpittau: It's the last one :D | 12:37 |
rpittau | \o/ | 12:37 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Generic API for attaching/detaching virtual media https://review.opendev.org/c/openstack/ironic/+/894918 | 12:53 |
drannou | dtantsur: crossing hardware type could be a good idea, for us for example we would like to be "platform" agnostic. | 13:08 |
drannou | for the moment we are using HPe hosts, but I don't want to be vendor lock | 13:09 |
TheJulia | drannou: but are you using specific driver features which would only ever work/exist on hpe hardware? | 13:16 |
dtantsur | drannou: I don't think what we discuss with TheJulia is related to the vendor lock problem | 13:23 |
dtantsur | it's really a question of which driver you set to the node | 13:23 |
dtantsur | If we want something in Redfish, we can ask DMTF for it. Jacob and I are going through such a proposal process right now for a different thing. | 13:23 |
TheJulia | I think there are some small steps we can take to simplify some matters, which could also go to the standards bodies if needed, but vendors also traditionally try to drive differentiator features in the form of "value-add" for their customers... and to sell more hardware/license fees by locking some of those features behind upgraded use licenses | 13:25 |
dtantsur | Well, licenses are something quite orthogonal here. They can easily hide parts of standard Redfish behind a paywall. | 13:26 |
TheJulia | And there *is* already a way for us to do cross cutting features, it is just not the fastest/easiest path if a vendor wants to drive it | 13:26 |
TheJulia | oh yes, absolutely | 13:26 |
TheJulia | example, Supermicro | 13:26 |
TheJulia | and VMedia | 13:26 |
drannou | TheJulia: in fact, nothing for the moment, but we will need to have a license management, and firmware upgrade. | 13:30 |
drannou | The problem is that I don't want to switch completely to ILO for just that | 13:31 |
dtantsur | Redfish firmware upgrade has been implemented recently | 13:31 |
dtantsur | License management is something we haven't looked at yet. It's probably highly vendor-specific. | 13:31 |
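For context, the recently added Redfish firmware update is exposed as a clean step; a hedged sketch of what invoking it manually might look like (URL and checksum are placeholders, and the exact interface/argument names should be checked against the Ironic docs for the release in use):

```json
[
  {
    "interface": "management",
    "step": "update_firmware",
    "args": {
      "firmware_images": [
        {"url": "http://example.com/bmc-firmware.bin", "checksum": "<sha256>"}
      ]
    }
  }
]
```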
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Generic API for attaching/detaching virtual media https://review.opendev.org/c/openstack/ironic/+/894918 | 13:36 |
TheJulia | so, there *is* a v1.1.0 LicenseService definition off the root in the standard | 13:51 |
dtantsur | Nice! | 13:53 |
TheJulia | Looks disjointed from the system entirely, as a payload to upload, but also embraces the Oem field as well | 13:53 |
dtantsur | At some point, we may want to make an inventory of Redfish features we eventually want to support | 13:53 |
dtantsur | Custom TLS certificates are probably at the top of this list, together with UEFI HTTP boot | 13:54 |
TheJulia | There is also a license list resource | 13:55 |
TheJulia | Looks like this stuff got added back in 2021.1 | 13:56 |
TheJulia | but is not in the mockups | 13:56 |
TheJulia | at least, the couple I clicked into | 13:57 |
TheJulia | Speaking of UEFI HTTP Boot, I need to get back on that | 13:59 |
TheJulia | looks like I broke something on the emulator schema :\ | 13:59 |
* TheJulia tries to grok a thread and wonders if they are advocating nuking PLDM from the face of device interactions and replacing everything with redfish | 14:17 | |
JayF | dtantsur: the idea would be to think about a world where there would be no ilo hardware type, but instead you'd have something like a redfish hardware type and a bunch of plugins which just call custom redfish hardware APIs. but the world isn't lollipops and rainbows and all of the bonus stuff isn't in the standards-compliant places so this is a solution for a world that | 14:17 |
JayF | doesn't exist | 14:17 |
dtantsur | heh, I see | 14:18 |
dtantsur | My less ambitious vision is for both ilo and idrac drivers to be built on top of the redfish one | 14:18 |
TheJulia | I'm wary of unicorns, but Legends of Tomorrow Season 4... Episode 4 streamed last night in the living room. | 14:18 |
JayF | dtantsur: that kinda sucks operationally, to be honest, because then we still have more room for divergence over time | 14:19 |
dtantsur | Only if the reality diverges. Which is something we cannot prevent. | 14:19 |
JayF | dtantsur: my hope would be for an operator to know+understand one driver, hopefully, and not have to figure out the incantation for each different vendor's hardware they buy | 14:19 |
JayF | :( | 14:19 |
TheJulia | Divergence is always going to occur on some level, the value add driver. | 14:19 |
dtantsur | I mean... knowing that "ilo" is for HPE hardware is the least complex fact about Ironic you can learn | 14:19 |
JayF | we have a PTG session about ^ | 14:20 |
JayF | it's not simple | 14:20 |
TheJulia | ... not all HPE hardware though | 14:20 |
dtantsur | nor is our redfish driver for all redfish hardware | 14:20 |
TheJulia | yup | 14:20 |
dtantsur | #sadbuttrue | 14:20 |
JayF | that is the piece that hurts me in my hurtin' place LOL | 14:20 |
dtantsur | you're not alone with your hurtin' place's hurt :D | 14:21 |
JayF | the 'redfish driver is not for all redfish hardware' ... is that a case we can detect? | 14:22 |
JayF | like are most failure modes for that "oh, it's missing support for X, Y, Z" or is it "yeah support for this is advertised but broken" (or "yes") | 14:23 |
dtantsur | We sometimes even try (see idrac-redfish-virtual-media in the recent past) | 14:23 |
dtantsur | The huawei driver is a counter-example, I think | 14:23 |
TheJulia | both very good examples | 14:23 |
dtantsur | And now we're going to have a Fujitsu's Redfish variant | 14:23 |
JayF | huawei driver is not called that upstream is it (?) | 14:23 |
TheJulia | the former generally works, but we've found cases where firmware updates break it; the latter has four different field names to possibly account for power state. | 14:23 |
dtantsur | JayF: ibmc? | 14:23 |
JayF | ack | 14:24 |
JayF | I never knew what h/w that was for lol | 14:24 |
dtantsur | :D | 14:24 |
JayF | never worked a place that would've gotten their gear | 14:24 |
TheJulia | ... on the plus side, the ibmc contributor indicated they were aware of the noncompliance and intended to fix it | 14:24 |
dtantsur | You would need to move to Russia for that, which is something I'd not recommend | 14:24 |
dtantsur | (or China, obviously, which I cannot recommend either) | 14:24 |
TheJulia | China != Russia, but they both get cast in the same light often times | 14:25 |
TheJulia | Different cultures and all | 14:25 |
* JayF puts on his best william wallace face and yells something about freedom /s | 14:25 | |
dtantsur | Well, if you don't want to get into details, and are lucky enough to not have to care about them, you can throw them in the same bucket | 14:25 |
JayF | ^ is pretty much where I am | 14:26 |
TheJulia | dtantsur: this is true | 14:26 |
* dtantsur has started the very slow and complicated process of gaining German citizenship, keep your fingers crossed | 14:26 | |
TheJulia | I wouldn't mind at some point visiting china on a tourist visa, but the odds of that happening are slim | 14:26 |
TheJulia | dtantsur: \o/ | 14:26 |
masghar | I heard from an ex-CEO-in-China recently that the people on the ground are super nice, but there are difficulties in running a business etc | 14:51 |
TheJulia | Also the movement of funds in and out | 14:55 |
TheJulia | As a business traveler, China is a difficult country to visit. | 14:55 |
TheJulia | oh wow, reading about PLDM and the internal modeling explains a LOT of how BMCs have typically viewed and provided networking details | 14:58 |
dtantsur | PLDM? | 15:08 |
TheJulia | oh, wow, and the interaction explains why it is often slow | 15:08 |
TheJulia | Platform Level Data Model | 15:08 |
TheJulia | how devices on the motherboard are supposed to talk to the BMC | 15:08 |
dtantsur | AKA dark magic reference :D | 15:10 |
iurygregory | hey TheJulia I was talking with dtantsur about a bug downstream involving multipath, scenario: a machine has 84 devices with 4 paths each, Dmitry had an interesting idea, that maybe we could cache the output of https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L233 and avoid doing two more calls in our code (if I recall correctly they are reaching timeouts when trying to provision the | 15:10 |
iurygregory | node), do you think this would be safe to cache? | 15:10 |
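A minimal sketch of the caching idea discussed above, assuming a hypothetical helper around the expensive multipath listing; a real change would live in ironic_python_agent/hardware.py and would need the invalidation path mentioned below:

```python
import time

_MULTIPATH_CACHE = {'data': None, 'ts': 0.0}
_CACHE_TTL = 300  # seconds; illustrative value

def _list_multipath_devices():
    """Placeholder for the real (slow) multipath enumeration call."""
    raise NotImplementedError

def get_multipath_devices(use_cache=True):
    """Return multipath info, reusing a recent result when allowed."""
    now = time.time()
    if (use_cache and _MULTIPATH_CACHE['data'] is not None
            and now - _MULTIPATH_CACHE['ts'] < _CACHE_TTL):
        return _MULTIPATH_CACHE['data']
    data = _list_multipath_devices()
    _MULTIPATH_CACHE.update(data=data, ts=now)
    return data

def invalidate_multipath_cache():
    """Drop the cached result, e.g. after an I/O error or topology change."""
    _MULTIPATH_CACHE.update(data=None, ts=0.0)
```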
TheJulia | In the MCU, it might be titled "The Darkhold" | 15:10 |
iurygregory | dark magic reference lol :D | 15:11 |
TheJulia | Cache makes sense until we encounter an IO error and feel the need to reset said cache, fwiw | 15:11 |
TheJulia | The *odds* are that an IO error due to changes comes from something like a cable getting unplugged | 15:12 |
JayF | if we're dealing with multipath, remote IO | 15:12 |
JayF | and we lose a device/topology changes | 15:12 |
TheJulia | yeah | 15:12 |
JayF | shouldn't that be a provisioning-aborting event? | 15:12 |
JayF | or does that sorta intentionally go against the HA nature of the multipathing? | 15:13 |
dtantsur | If that's not a root device, we can survive it | 15:13 |
TheJulia | sort of yes, but via the cache we also know the multipath device name from the kernel, so we can continue to operate/resolve | 15:13 |
JayF | I am unsure if in this case we should be using the HA (?) | 15:13 |
TheJulia | oh, the kernel will make sure you keep working if you're interacting with the mpath device :) | 15:13 |
JayF | oh so the idea is, I got an I/O error using this path, I'll try this one | 15:13 |
* TheJulia has pulled many a fiber cable | 15:13 | |
JayF | it's like you have ... multi[ple] path[s]! | 15:13 |
TheJulia | JayF: the kernel does it for us :) | 15:13 |
JayF | does this mean I'm a storage engineer now? Where can I pick up my LUNs? /s | 15:14 |
TheJulia | You can pick up your LUNs once you rack/stack/cable a 500TB-1PB storage system | 15:14 |
TheJulia | oh, and configure the SAN controllers | 15:14 |
JayF | I don't like SAN, it's rough and sticky and gets everywhere | 15:15 |
* dtantsur has realized he can no longer list all possible hardware interfaces by heart and is a bit upset | 15:16 | |
iurygregory | in my mind the easy path was to use skip_block_devices XD "hey ironic, don't look at these devices please" lol | 15:16 |
TheJulia | Now remember https://www.etsy.com/listing/1361107953/torch-and-fire-extinguisher-holder makes things *much* better with SANs | 15:16 |
dtantsur | iurygregory: that's not a terrible idea, but I'm afraid we'll anyway try to list them first | 15:16 |
dtantsur | and then you need to add Metal3 API to pass this skip_block_devices... | 15:16 |
iurygregory | yep | 15:16 |
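For reference, a hedged example of how the skip_block_devices idea is typically set, as a node property holding device hints (the hint value is a placeholder; check the cleaning docs for the exact hint syntax your IPA version supports):

```shell
# Tell IPA to ignore matching block devices; the vendor hint is illustrative.
baremetal node set <node> \
    --property skip_block_devices='[{"vendor": "ExampleVendor"}]'
```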
TheJulia | enumeration is dynamic guys | 15:16 |
JayF | dtantsur: you can never list the ones that I wrote that are locked up in some private repo in rackerlabs/ github repo somewhere ;) | 15:17 |
TheJulia | depending on SAN state and FC login query | 15:17 |
TheJulia | and SAN response speed | 15:17 |
JayF | dtantsur: oh, you mean interface, not manager, just kidding | 15:17 |
TheJulia | so don't expect stability *at all* unless you're matching WWNs | 15:17 |
dtantsur | JayF: you got me scared for a minute: new downstream hardware interfaces, that would be.. something | 15:17 |
JayF | you're talking about agent | 15:17 |
JayF | and rescue | 15:17 |
* dtantsur suspects someone may have a graphical console interface somewhere in production | 15:17 | |
JayF | you know that, right? LOL | 15:17 |
dtantsur | wellllllll :D | 15:18 |
iurygregory | i don't even think it's a SAN, to me it's a JBOD they are using hehe | 15:18 |
dtantsur | TheJulia: I keep telling people that, but people INSIST on using device names... anyway | 15:18 |
TheJulia | iurygregory: is it showing more than one path per device? | 15:18 |
dtantsur | TheJulia: 4 paths for each of the 84 devices | 15:19 |
TheJulia | iurygregory: whiskey | 15:19 |
TheJulia | ?$? | 15:19 |
dtantsur | hundreds of block devices per machine | 15:19 |
iurygregory | yeah | 15:19 |
TheJulia | wtaf | 15:19 |
dtantsur | fun right? | 15:19 |
TheJulia | I've seen 2 paths to a JBOD array, never 4 before | 15:19 |
dtantsur | It takes IPA a few minutes to loop through them.. and then it starts again.. and again... and again...... | 15:19 |
TheJulia | ... are we sure it is not a JBOD array via FC? | 15:19 |
dtantsur | I'm personally not sure about anything in that setup | 15:19 |
* TheJulia makes nervous laughter as she logs into the case management system to see what the customer uploaded | 15:20 | |
dtantsur | Let's hope it's cat photos | 15:20 |
iurygregory | yes please | 15:20 |
iurygregory | and corgi photos also! | 15:20 |
*** dking is now known as Guest3726 | 15:21 | |
TheJulia | *anyway* caching the multipath command output is stable enough for IPA's execution | 15:21 |
iurygregory | ok, I will try to think on how to do this after grabbing lunch | 15:21 |
iurygregory | and coffee | 15:21 |
iurygregory | brb | 15:21 |
dtantsur | Does anyone know a usable way to alternate between two shared screens? | 15:23 |
JayF | on linux? | 15:23 |
JayF | and do you need audio? | 15:23 |
dtantsur | Like, slides and a console. But without turning the sharing on/off or sharing the whole screen. | 15:23 |
dtantsur | JayF: on linux, audio not required | 15:24 |
JayF | I have a great answer for you if both of those are yes | 15:24 |
JayF | perfect | 15:24 |
TheJulia | yeah, the list forces a scan, so it is a steady enough state | 15:24 |
JayF | dtantsur: obs has a plugin which will output to a v4l loopback device | 15:24 |
JayF | dtantsur: instead of using share screen function; use webcam function to share the v4l loopback device | 15:24 |
JayF | dtantsur: now you have OBS studio fully functional to swap out what you're sharing on your screen between slides/console | 15:24 |
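A rough sketch of the setup JayF describes, assuming the v4l2loopback kernel module is installed (device number and label are arbitrary):

```shell
# Create a virtual video device for OBS to output to
sudo modprobe v4l2loopback devices=1 video_nr=10 \
    card_label="OBS Virtual Camera" exclusive_caps=1

# In OBS, use Start Virtual Camera (or the older obs-v4l2sink plugin),
# then select "OBS Virtual Camera" as the webcam in the meeting client.
```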
TheJulia | Awww, customer redacted the pathing details enough that I can't tell if there is more than one storage controller port in the mix | 15:25 |
dtantsur | JayF: nice! I need to rehearse that, but sounds like it could work. | 15:26 |
dtantsur | I suspect I can even display my small face in the corner | 15:26 |
TheJulia | iurygregory: so, I'm inclined to think *not* a jbod, but a proper SAN, because it looks like 3 different devices are in the mix locally based upon bus id markers, but they are also not sequential so ... dunno | 15:27 |
TheJulia | dtantsur: I did that once with an f-key to toggle the source, it worked really well | 15:27 |
TheJulia | plan 1-2 hours for learning curve and all | 15:28 |
JayF | dtantsur: exactly, it's a pretty good model, and even if you can't make it work via a webcam, you can open something like gvc or cheese and share that as your screen | 15:28 |
opendevreview | Takashi Kajinami proposed openstack/ironic-inspector master: Fix python shebang https://review.opendev.org/c/openstack/ironic-inspector/+/898587 | 15:28 |
dtantsur | JayF++ TheJulia++ | 15:28 |
JayF | oh and also | 15:28 |
JayF | test the hell outta it | 15:28 |
TheJulia | +++++++++ | 15:28 |
JayF | OBS on linux is flakey as hell | 15:29 |
JayF | in my world, it was always audio so no audio should help a lot | 15:29 |
dtantsur | I have some experience with OBS already, but only with regular recording. | 15:29 |
TheJulia | Also, if you're doing anything like green screening with your background | 15:29 |
JayF | I tried to do streaming to twitch for a while from Linux; I had something like a 10% incidence rate of the sound just ... dying mid-stream | 15:29 |
JayF | no errors from OBS, no indicators in the logs, it just ... stopped | 15:29 |
dtantsur | If I share that in a call, the sound will go through the regular means | 15:30 |
dtantsur | (which also has a small chance of just not working) | 15:30 |
JayF | yeah so I think you'll be fine, I'm just making the point that it's not the most rock solid platform on linux at this point | 15:30 |
dtantsur | Since when is Linux rock solid? That's one of the things we love about it :D | 15:30 |
JayF | also note that I'm doing all this on gentoo so there's a nonzero chance it was just broken because that point release of ffmpeg, plus that point release of obs, plus a full moon and my cflags or some craziness lol | 15:30 |
dtantsur | LOL | 15:31 |
JayF | dtantsur: dead serious; I run linux these days on things other than my gaming machine because I got so dang tired of feeling like my own devices were trying to sell me crap | 15:31 |
TheJulia | Remember, do not taunt the darkhold of professional video tools :) | 15:31 |
dtantsur | hehe | 15:31 |
dtantsur | I tried making a video from fragments.. I've seen some horrors... | 15:31 |
*** Guest3726 is now known as dking | 15:32 | |
TheJulia | I think I understand PLDM enough to hold down a conversation | 15:32 |
TheJulia | I may have had to push some python bits out of my brain for this | 15:32 |
dtantsur | JayF: ikr.. I'm afraid our ancient phones will need replacements soon, so I'm going to choose between the android crap and the apple crap... | 15:32 |
clarkb | I would probably do that using an xmonad tiling mode that emulates workspaces on a single workspace | 15:33 |
dtantsur | TheJulia: is there a short(ish) overview to read? | 15:33 |
dtantsur | clarkb: but will your browser be able to share only one such workspace? | 15:34 |
dking | Is anybody here familiar with troubleshooting SQLAlchemy errors "QueuePool limit of size X overflow Y reached, connection timed out..."? I'm seeing those in my Ironic logs, and I can connect directly to MariaDB, and I don't know how to troubleshoot SQLAlchemy outside of adjusting the code. | 15:34 |
TheJulia | dtantsur: https://www.dmtf.org/sites/default/files/standards/documents/DSP0240_1.0.0.pdf | 15:34 |
dtantsur | thx | 15:34 |
TheJulia | the rest of the related docs get into the nitty gritty of PLDM, but the base model *appears* to be rather flexible but command/response driven | 15:34 |
dtantsur | dking: the first time I hear about it. is it linked with some sort of activity? | 15:34 |
TheJulia | so "give me your state" -> "Here is my state" after you ask "what do you support" "here is what I support" | 15:35 |
clarkb | dtantsur: ya I would share the single workspace then use window management to swap windows aroudn appropriate on the workspace | 15:35 |
dking | dtantsur: I'm not certain. I don't know of any special activity going on. We're not running that many nodes. It's showing up with: ERROR futurist.periodics [-] Failed to call periodic 'ironic.conductor.manager.ConductorManager._sync_power_states' | 15:35 |
dtantsur | clarkb: Right, makes sense. I'm on MATE nowadays though | 15:36 |
TheJulia | ... So if memory serves the connection pool is like 8 connections to start | 15:36 |
TheJulia | dking: how much concurrency are you running for power state sync? | 15:37 |
dking | Here, I'm seeing "limit of size 5 overflow 50 reached". I'll have to count, but I didn't think we had 50 nodes in. | 15:37 |
TheJulia | dking: is the version of mariadb on sqlalchemy's list of supported versions? | 15:38 |
dtantsur | TheJulia: ouch, even the glossary is quite a text to read | 15:38 |
TheJulia | dking: that sort of sounds like mariadb is locking or has a load balancer dropping the connection | 15:38 |
dking | TheJulia: I can check. We're using Metal3, so it's likely whatever comes stock with that. | 15:39 |
TheJulia | dking: as a starting point in db troubleshooting, try making a backup on the server side | 15:39 |
TheJulia | oh | 15:39 |
TheJulia | hmm | 15:39 |
JayF | If you're using metal3, there is also potentially value in asking in their slack if we're not able to figure it | 15:40 |
TheJulia | so no load balancer | 15:40 |
dking | TheJulia: I thought about that, but it's not really using a load balancer. I think it might just be 1 replica, and I can connect to the DB just fine from within the mariadb container. But those are good things to check. | 15:40 |
TheJulia | is the db in a separate pod? | 15:40 |
JayF | of course half the people are in both channels so.... | 15:40 |
dking | JayF: Yeah, the best people for Metal3 are in here, too. :) | 15:40 |
dtantsur | dking, TheJulia, a useful thing to know about Metal3 is that its MariaDB image is minimized on purpose | 15:40 |
dking | TheJulia: No, it should be the same pod. | 15:40 |
TheJulia | dking: there could be a transaction deadlock... but that should get detected... there is still a chance a held lock is not detected | 15:41 |
dtantsur | e.g. https://github.com/metal3-io/mariadb-image/blob/main/runmariadb#L22 used to be even lower than this | 15:41 |
dtantsur | https://github.com/metal3-io/mariadb-image/blob/main/runmariadb#L46-L48 is a suspect too | 15:41 |
TheJulia | which is stupidly weird | 15:41 |
dtantsur | dking: I'd start with playing with these values ^^^ | 15:41 |
TheJulia | as are max connections, since that can scale based upon what is going on | 15:42 |
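For reference, the numbers in that QueuePool error correspond to standard oslo.db options in ironic.conf (the defaults are what produce "size 5 overflow 50"); the values below are illustrative only, and MariaDB's max_connections from the image linked above would need to be raised to match:

```ini
[database]
max_pool_size = 10
max_overflow = 60
pool_timeout = 60
```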
TheJulia | if you can't dump the db or it hangs, then lock is likely it | 15:42 |
TheJulia | but dump *through* the server, not from the files (i think that got removed in mariadb...) | 15:43 |
* TheJulia has a rather dusty DBA hat someplace | 15:45 | |
* TheJulia also had the arcane book of databases someplace.... | 15:45 | |
dtantsur | (side note: "the same pod" thingy is something I'd like to fix eventually) | 15:45 |
dking | Well, I'm currently wondering if I can troubleshoot the issue live, as I'm suspecting that it won't replicate (or at least not quickly) if Ironic gets restarted. | 15:48 |
TheJulia | you can try and get a feeling from the DB, but yeah, any restart is going to change things | 15:49 |
dking | And it didn't happen right away, so it may take a while to show up again, if ever. I'm going to check the logs to see how long it's been going on. At least I have some configs that I can play with. | 15:51 |
dtantsur | dking: the recommendation about slack is not too bad, because the Ericsson folks do use MariaDB (unlike us in OpenShift) | 15:52 |
dking | So, for now, the consensus is that something hung, we don't really know where, but restart, and if it comes back, attempt to increase the limits mentioned above? | 15:52 |
dtantsur | that's the only thing that comes to my mind | 15:53 |
TheJulia | likewise | 15:53 |
dking | Okay, thank you very much! | 15:53 |
JayF | Is metal3+mariadb a common deployment (with the built-in, slimmed mariadb container?) | 15:53 |
dtantsur | JayF: yes. I don't know the proportion, but both sqlite and mariadb are common (and mariadb was there first) | 15:54 |
JayF | ack | 15:55 |
opendevreview | Takashi Kajinami proposed openstack/ironic-inspector master: Fix python shebang https://review.opendev.org/c/openstack/ironic-inspector/+/898587 | 16:02 |
rpittau | good night! o/ | 16:16 |
TheJulia | o/ | 16:19 |
* iurygregory is back | 16:26 | |
iurygregory | TheJulia, gotcha, I will look at caching now to try to improve this (fingers crossed) | 16:28 |
TheJulia | okay, cool! | 16:28 |
iurygregory | TheJulia, newbie question, but do you think changing ironic.conf [agent]command_timeout ( https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/agent_client.py#L206 ) could help with their issue? since ironic logs show a lot of https://paste.opendev.org/show/bXAGwIusnE3JdjWTWE43/ | 16:50 |
TheJulia | I would think unlikely | 16:51 |
TheJulia | and that feels like a workaround for whatever state the agent is in when the call hits | 16:51 |
iurygregory | yeah, it's more a workaround just to see if it helps them | 16:52 |
iurygregory | till we figure out the cache and hope this would help them | 16:52 |
iurygregory | the first problem they had was during inspection, so increasing the timeout helped them | 16:53 |
dtantsur | well, "helped" | 16:53 |
dtantsur | after you walk them through raising all possible timeouts everywhere (and you'll need to), the final question will be "please make it so that the process does not take 6 hours" :D | 16:54 |
TheJulia | I guess the only way to know for sure is to correlate agent logs | 16:54 |
TheJulia | dtantsur: ++ | 16:54 |
iurygregory | dtantsur, yeah | 16:54 |
dtantsur | Agent timeouts make me think about insufficient eventlet monkey patching | 16:55 |
dtantsur | Otherwise why is IPA not responsive? | 16:55 |
TheJulia | or, is it transient and unrelated? | 16:55 |
TheJulia | there are a few different variables, you almost need a timeline of interaction drawn from both sides | 16:55 |
JayF | this is the same issue that started with multipath, yeah? | 16:56 |
dtantsur | yep | 16:56 |
JayF | I'll note that IO is one of the places that can lock up in a way that python can do nothing whatsoever about it | 16:56 |
TheJulia | at 12:01:31.0311 the agent started waving a white flag saying "please, cache the data" | 16:56 |
TheJulia | The conductor continued to think it was dancing until 12:02:31 | 16:56 |
*** dking is now known as Guest3741 | 17:20 | |
*** Guest3741 is now known as dking | 17:20 | |
dking | dtantsur: Are you still around? | 17:40 |
dtantsur | dking: I'm not far from the computer | 17:40 |
dking | I asked a Metal3 question on slack, but it seems slow there. I ended up having to restart my pod, and the Ironic DB got cleared. I believe that BMO populates Ironic, and it did put my inspecting node back. In the past, I thought that it put back the other nodes, but it hasn't done so yet. Is there a way to force it to populate Ironic with the nodes it has provisioned? | 17:43 |
dtantsur | dking: it should eventually, but for nodes that are in a stable state there will be some delay until the reconciler gets to them | 17:43 |
dtantsur | (Ericsson folks are also in Europe, so you'll need to get back in the morning) | 17:43 |
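If waiting on the reconciler, one generic controller-runtime trick (not a documented BMO API) is to touch the BareMetalHost so a watch event fires and the host is reconciled right away; the annotation key here is made up for illustration:

```shell
# Any metadata change triggers a watch event and a prompt reconcile.
kubectl annotate baremetalhost <host-name> -n <namespace> \
    example.org/poke="$(date +%s)" --overwrite
```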
dking | dtantsur: That's good to know. It's been about 45 minutes, so I was starting to worry. Do you know roughly how long before I should start worrying? | 17:45 |
dtantsur | hmm, 45 minutes IS quite long to my taste. Anything happening in the baremetal-operator logs? | 17:45 |
dking | dtantsur: Hmm. I see: {"level":"info","ts":"2023-10-17T17:46:35Z","logger":"controllers.HostFirmwareSettings","msg":"provisioner returns error","hostfirmwaresettings":{"name":"***","namespace":"machines"},"Error":"could not get node for BIOS settings: Host not registered","RequeueAfter":30} | 17:47 |
dtantsur | well, that's fair - the node does not exist | 17:47 |
masghar | What is the RIBCL business with HPE servers...and Redfish being its alternative in iLO - do we enable RIBCL in the server explicitly? | 17:48 |
dking | dtantsur: Not in Ironic. I was thinking that BMO would re-populate an empty Ironic. Would it only be able to update existing Ironic nodes? | 17:48 |
dtantsur | dking: no, it (should) always create nodes. I'm not yet sure why it does not - the logs may have a clue (but you may need to find it between other logging) | 17:49 |
JayF | masghar: I'm not sure; we have HPE downstream driver devs joining us for a PTG session to talk about the future of HPE drivers in Ironic and how those kinds of changes impact it | 17:49 |
dtantsur | masghar: RIBCL is their previous protocol. They've been moving to Redfish for a long time already (they were among the founders of Redfish) | 17:50 |
masghar | JayF: and dtantsur: I see. I'm trying to 'manage' an HPE server and it's failing with RIBCL, and Redfish is trying to use https when http is actually what should be used | 17:51 |
dtantsur | masghar: I can take a look tomorrow (nearly 8pm, c'mon! :) | 17:52 |
dtantsur | Make sure you're using the Redfish driver, I have no experience with the iLO one | 17:52 |
masghar | Ah yes, of course! Thank you | 17:52 |
dtantsur | (and check the firmware version against our docs - that's what we tested) | 17:52 |
masghar | I will switch to redfish and see, and check the firmware version too | 17:53 |
dtantsur | masghar: you should call it a day too, it's just as 8pm for you ;) | 17:53 |
dtantsur | (I'm actually watching youtube with IRC open on another screen) | 17:53 |
masghar | I started pretty late so it's alright ^-^ | 17:53 |
JayF | dtantsur: that used to be more common for me | 17:53 |
JayF | dtantsur: note my lack of commentary in here late-PST since starfield released ;) | 17:53 |
dtantsur | :D | 17:54 |
dtantsur | masghar: for us in OCP, the interesting drivers are idrac+idrac-redfish* for Dell, just redfish for everything else (FJ folks are testing their stuff themselves) | 17:56 |
dtantsur | (we only have best effort support for the ilo driver and only for iLO 4) | 17:56 |
JayF | ilo driver does work pretty much outta the box for ilo4/5 stuff, but the things you need in driver_info are different | 17:56 |
JayF | for upstream ironic sake | 17:56 |
masghar | dtantsur: noted, thanks! One thing in the HPE docs section is a bit outdated, I did note | 17:57 |
dking | dtantsur: Nevermind. It must have just taken a bit. They seem to be back now. | 17:57 |
dtantsur | JayF: right, we've made a decision to focus much more on Redfish in the Metal3 world | 17:57 |
masghar | JayF: alright | 17:57 |
JayF | dking: I prescribe you a walk and a cup of coffee next time before troubleshooting ;) | 17:57 |
JayF | dtantsur: makes sense, and fwiw I think that's the correct decision too | 17:58 |
dtantsur | dking: eventual consistency, amiright? | 17:58 |
JayF | eventual consistency makes operators nervous (at least it made me nervous) | 17:58 |
JayF | because our pagers go off until "eventual" comes about :P | 17:58 |
dking | JayF: I just got worried because it was about an hour without nodes in Ironic and I wasn't completely sure that BMO was supposed to even populate provisioning nodes into Ironic. | 17:58 |
JayF | honestly that's a metal3/bmo question more than an Ironic one | 17:59 |
dtantsur | an hour is wild, but I'm not sure how often the reconciler runs | 17:59 |
dtantsur | yeah | 17:59 |
JayF | and probably is more related to how slimmed down things are in those containers | 17:59 |
dking | I'm just not used to "eventually" being so long. :) But it looks like it was fine in the end. | 17:59 |
JayF | it's impressive how much Ironic can scale up and down these days | 17:59 |
dtantsur | dking: it's something we've discussed internally already: when BMO does not know that Ironic was gone, it cannot run an out-of-order reconcile | 17:59 |
dtantsur | still, an hour.. wow | 17:59 |
dking | Yeah, and only 40 nodes. I've run instances with over 200, so that seems a bit long for such a low number. | 18:00 |
dtantsur | Could be something to experiment with whenever you have some time. To see if it's a reproducible behavior. | 18:01 |
dtantsur | (Our folks scale-test with ~ 3500 nodes, but I don't think they try to kill Ironic in the process) | 18:03 |
TheJulia | 3500 sounds... sadistic. | 19:12 |
JayF | when we had 300 node clusters at $oldJob, and 3 reschedules set up | 19:12 |
JayF | and the racey-as-hell clustered compute manager | 19:12 |
JayF | pquerna used to load test us without warning by issuing 100 build requests | 19:12 |
JayF | we usually got about 85% eventual success rate, which was pretty great (most of the 15% failures were losing the CCM race 3 times in a row) | 19:13 |
JayF | that was spicy for the age of ironic at the time and the immaturity of the ironic/nova driver and all that | 19:13 |
TheJulia | if memory serves, it really expected the ccm to do that and one to always sort of fail | 19:14 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: WIP: Add test for dhcp-less vmedia based deployment https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/898006 | 20:45 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: WIP: Add test for dhcp-less vmedia based deployment https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/898006 | 23:00 |