Wednesday, 2024-03-06

opendevreviewMerged openstack/ironic-tempest-plugin master: Test multiple boot interfaces as part of one CI job  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/90217103:27
rpittaugood morning ironic! o/07:55
masgharGood morning!09:14
opendevreviewMerged openstack/ironic stable/wallaby: stable only/ci: pin CI to dnsmasq 2.85/pin proliantutils  https://review.opendev.org/c/openstack/ironic/+/91116009:25
iurygregorygood morning Ironic11:07
Sandzwerg[m]moin ironic12:10
Sandzwerg[m]Has anyone ever heard of dell nodes getting frequently powered off shortly after they got powered on? It does not happen all the time and we're unsure if ironic/metal³ or something else is involved12:15
dtantsurSandzwerg[m]: what do you use to power them on? Ironic tends to force whatever it thinks is the right state.12:20
Sandzwerg[m]metal³ ironic12:26
Sandzwerg[m]and yeah it should force them to come on, but something forces them off when ironic forces them to PXE/on just before. 12:29
Sandzwerg[m]Feels a bit like ironic is confused, but not entirely sure. Haven't found which user triggers the shutdown in the remote board log12:30
dtantsurIf Ironic does it, it has to log something about it13:31
drannouHello13:38
drannouIf you have a moment to review https://review.opendev.org/c/openstack/ironic-python-agent/+/90276913:39
dkingDoes anybody know off the top of their head in which part of the cleaning process, perhaps in hardware.py, LVM VGs are cleared?13:42
dtantsurdking: erase_devices/erase_device_metadata clean steps. For the latter, it propagates all the way into https://opendev.org/openstack/ironic-lib/src/branch/master/ironic_lib/disk_utils.py#L51513:43
dkingdtantsur: Thank you very much!13:45
*** tosky_ is now known as tosky13:57
TheJuliaSandzwerg[m]: I'd check the console to see if reconfiguration jobs are running, that will cause a reboot and a dell node to appear on, go off, and then come back on.14:54
TheJuliaJayF: https://review.opendev.org/c/openstack/ironic/+/911158 won't pass until tempest tests are fixed14:57
JayFI'm assuming since this pointed at me that must be a failure caused by the sharding test or something?14:58
JayFI can look at it, I'm finally called up today14:58
JayF**caught up14:58
TheJuliayes :)14:58
TheJuliaThat would be awesome if you could, I'm buried under piles of things right now and I need to work on a side deck14:59
opendevreviewJulia Kreger proposed openstack/ironic-tempest-plugin master: Invoke tests with fake interfaces  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/90993914:59
TheJuliaummm hmmm15:00
*** dansmith_ is now known as dansmith15:02
opendevreviewJulia Kreger proposed openstack/ironic-tempest-plugin master: Invoke tests with fake interfaces  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/90993915:08
dtantsurTheJulia, hjensas, hi folks, would be great if you could put https://review.opendev.org/c/openstack/ironic/+/907991 (and thus https://review.opendev.org/c/openstack/ironic/+/910251) in your review queue.15:36
dtantsurI'm actually the last person who needs that working :)15:36
JayFTheJulia: I'm confused, I thought https://opendev.org/openstack/ironic-tempest-plugin/src/branch/master/ironic_tempest_plugin/tests/api/admin/test_shards.py#L24 was all that was needed to keep those tests from running15:38
JayFhttps://opendev.org/openstack/ironic-tempest-plugin/src/branch/master/ironic_tempest_plugin/tests/api/base.py#L7715:39
JayFI'm assuming those branches are not configured properly with their max microversion?15:40
TheJuliathere are no branches for tempest15:41
TheJuliaI honestly don't know why they are trying to run15:41
TheJuliait is clear it knows the difference from the errors I linked out15:41
JayF3142:        iniset $TEMPEST_CONFIG baremetal max_microversion $TEMPEST_BAREMETAL_MAX_MICROVERSION15:41
JayFwe have to set that15:42
JayFon older branches15:42
TheJuliaahh!15:42
JayFI'll add it to the patch in question15:42
TheJuliaso config on stable/zed then15:42
JayFaye15:42
TheJuliaI guess that code path doesn't have real detection-based skipping, unlike scenario jobs15:42
JayFdoes that imply I only need to configure it on functional test jobs?15:43
JayFI suspect it's also possible for scenario jobs ... we just haven't added any that use new APIs15:43
JayFso it's a moot point and not breaky15:43
TheJulialikely just the default in devstack/lib/ironic15:43
opendevreviewJay Faulkner proposed openstack/ironic stable/zed: stable only/ci: pin CI to dnsmasq 2.85/pin proliantutils/scciclient  https://review.opendev.org/c/openstack/ironic/+/91115815:48
*** dking is now known as Guest203115:55
*** Guest2031 is now known as dking16:07
dkingIn the Ironic-Python-Agent, was ProtectedDeviceFound replaced by errors.ProtectedDeviceError? I see that ProtectedDeviceFound is only referenced in some docstrings, and I cannot find it implemented, but errors.ProtectedDeviceError exists and seems to be used similarly.16:20
JayFyou wanna link to those invalid docstrings, I'll clean em up16:28
JayFeh, ripgrep can do that part I guess16:28
JayFdking: yep, you are right16:32
opendevreviewJay Faulkner proposed openstack/ironic-python-agent master: Correct invalid docstrings; s/Found/Error/  https://review.opendev.org/c/openstack/ironic-python-agent/+/91159816:33
JayFdking: ^ fyi it'll be fixed when that lands16:33
dkingJayF: Thanks! So, that can help clean up the code. I'm just wondering about it a little: the new error is thrown when looking at devices, but the places where it's mentioned in the docstrings are methods handling the node itself. The error requires a device. Is that still the intention?16:36
JayFLets back up a step16:36
JayFwhat's your overall thing you're trying to accomplish?16:37
dkingFor instance, if I wanted to override erase_devices() for an entire node for some reason, I would make my own erase_devices() method in my hardware manager and send an error up saying that this node is protected. Would ProtectedDeviceFound still be the best way to do that?16:37
dkingIn my specific use case, I can adjust the code to grab the particular device, so I'll probably do that anyway, but it just made me wonder if there should be something more generic.16:39
dking(In my use case, I'm looking for some specific volume group names which happen to belong to a ceph cluster and I'm planning to make those be removed only upon a manual clean step. So, I'm also happy to try something else if there's a better way.)16:40
JayFProtectedDeviceError would be the way do that /on master branch/ which is all I've looked at16:42
dkingJayF: Okay, that's good, then. So, if somebody for some reason did decide that they wanted their clean step to fail after checking the whole node, not a check for a specific device, they would throw that error and perhaps fill in device with just any text?16:45
JayFDo you *care* that it's a ProtectedDeviceError16:45
JayFor do you just want to stop cleaning?16:46
dkingMore the latter. So, perhaps just a generic CleaningError?16:46
JayFhttps://opendev.org/openstack/ironic-python-agent/src/branch/master/examples/business-logic/example_business_logic.py#L9416:47
JayFyep16:47
JayFyou're basically implementing the business logic pattern16:47
JayFhttps://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/errors.py#L362 ProtectedDeviceError *is* a CleaningError, which a preset exception message :)16:48
JayF**with a preset exception message16:48
JayFSo either one of those works, just a question of if you want the premade message or you wanna write one yourself16:49
dkingGreat. Then, that answers my question. I was wondering for a moment if the docstrings should be updated instead to the more generic CleaningError, but as those particular methods do specifically only raise ProtectedDeviceError, that seems appropriate.16:49
dkingYeah, I see that. I might still use ProtectedDeviceError in my case also.16:50
JayFI suspect that error name was changed and just got missed being updated in the docstring16:50
rpittaugood night! o/16:50
JayFo/16:50
dkingBTW, I see that after about 4 years, you just updated the hardware manager examples to fix some spelling issues. I'm sure there's tons of cobwebs growing on them.16:51
JayFThe interface itself hasn't changed in .... ever?16:51
dkingrpittau: Good night16:51
JayFWe did one api change on hardware managers, back in like, 2014 or 201516:51
JayFand it's been a stable API since16:51
dkingIt might be helpful to add a deploy step example? I don't know if people are doing them much, though.16:52
JayFI'd +1 such a change, but it's not something that's urgently on my list 16:52
dkingI'm adding in my first locally because we're not ready to update BMO yet and so we're stuck with the old version that doesn't have an option for a "by_path" root device hint.16:52
TheJuliadnsmasq gives me a migraine16:53
JayFMore breakage? Or you trying to fix the issue in C?16:53
TheJuliaoh, I know of a way, it just masks the actual problem16:54
*** clarkb1 is now known as clarkb16:55
opendevreviewVerification of a change to openstack/ironic master failed: Split conductor-specific RPCService  https://review.opendev.org/c/openstack/ironic/+/91025116:56
* TheJulia wonders if I just found the root cause17:12
dtantsurdking: we have downstream deploy steps if you're curious: https://github.com/openshift/ironic-agent-image/blob/main/hardware_manager/ironic_coreos_install.py17:26
JayFI was reviewing a change for cid and got a little confused: https://review.opendev.org/c/openstack/ironic/+/91097317:26
JayFre: if we can use mysql types in alembic migrations17:26
dtantsurdking: but if you have issues with hints, won't it make more sense to override get_os_install_device in a downstream hardware manager?17:27
TheJuliaJayF: mysql has a nuance I'm trying to remember17:27
TheJuliaI think postgres doesn't actually enforce string field lengths17:28
TheJuliait does it by types if memory serves17:28
TheJuliaeasy enough to test in migration test code17:28
dkingdtantsur: Perhaps. I hadn't looked at that, but at the moment, I'm just trying to mimic what BMO is doing so that it will work the same way once we update our version.17:31
dtantsurdking: I haven't put a ton of thought into this suggestion either, but it does sound like you need to override get_os_install_device17:32
TheJuliagah, I have dnsmasq in a cpu consuming loop17:32
dtantsurnom-nom17:32
TheJulia88.6% om nom!17:32
dtantsur\o/17:32
TheJulia... how in the world did I do that17:32
clarkbif you give a dhcp server a cookie...17:34
opendevreviewVerification of a change to openstack/ironic master failed: Split conductor-specific RPCService  https://review.opendev.org/c/openstack/ironic/+/91025117:35
clarkbI'm curious, have you checked if neutron runs into similar problems? And if not, maybe you can isolate what triggers these things (you probably have; it just seems odd that it's a consistent problem for ironic but apparently not for neutron)17:35
TheJuliaI remember I could see evidence in some of the non-ironic jobs that dnsmasq was getting restarted17:36
TheJuliabut that was a while back17:36
TheJuliaI haven't looked recently because it doesn't seem anything bare metal specific17:37
clarkbok I did wonder if maybe pxe boot flags could be the problem for example17:38
clarkbsince neutron in normal VM operation wouldn't be setting those17:38
TheJuliait is any option handling it seems17:38
TheJuliaand all ports get options inherently to do the base matching17:39
clarkbya neutron will also set dns servers and other flags too iirc17:39
TheJuliayup17:39
TheJuliaI'm sort of at the end of my ability to figure out what is going on; I don't really understand the inner workings well enough to do more than get a feeling that it's most likely in option response building17:59
JayFTheJulia: can you make sure your research is reflected in that dnsmasq ubuntu bug?18:10
JayFTheJulia: and I will see if I can pull a C expert outta the hat18:10
JayFlooks like you already did that18:11
JayFscore18:11
TheJuliaJayF: I did like an hour ago18:12
TheJuliayeah18:13
TheJuliawaiting on hopefully another message from petr, but even he is unsure of a next step in debugging18:13
jrosserwould you have a link to the bug out of interest?18:14
JayFhttps://bugs.launchpad.net/dnsmasq/+bug/202675718:14
JayFI put out a bat-signal in GR-OSS downstream slack, this is basically how I started the eventlet stuff and got Itamar's help there. No promises it'll yield anything, but I am trying :D18:15
jrossera trivial reproducer might be valuable, then the barrier to debugging is lowered18:23
TheJuliaI've not been able to figure one out really18:24
TheJuliait is not just about HUP operations, it is about that *and* dhcp options processing being triggered, which appears to work just as expected, and upon the next HUP it crashes18:25
*** awb_ is now known as awb19:05
opendevreviewMerged openstack/ironic master: Split conductor-specific RPCService  https://review.opendev.org/c/openstack/ironic/+/91025120:53
adam__metal3Hello Ironic, just wondering if you have experienced kernel issues on Dell servers with BOSS-N1 raid controllers when running centos IPA? I am having "fun" (not really) with new hw combinations and I am wondering whether that device is known to cause issues for IPA? Not sure it causes the problem for me ....22:02
JayFWhat do you mean by kernel issues, and what centos version?22:03
adam__metal3latest upstream centos-9-stream, kernel panic22:03
JayFMost of these are just ... $distro issues, but some stuff around how we build the ramdisk can cause headaches22:03
JayFwhat is the panic?22:03
JayFyou have a screenshot/log?22:03
adam__metal3yeah I can make a screenshot, where should I share it? It's most likely not the BOSS one; I have built a new network driver for some intel E810-XXV cards and then the issue changed to just getting stuck at the penguins on the boot screen, but I was wondering if BOSS might be doing something extra weird, because the same cards on a different machine work 22:06
JayFBOSS?22:07
JayFah, BOSS-N1 raid22:07
JayFI see22:07
JayFI don't have a good spot, any random image sharing site22:08
TheJuliaso BOSS cards have always been a little weird, but generally they've just worked. The only sort of known issue is they can very much cause weirdness with the device ordering in the OS, making /dev/sda, /dev/sdb, etc. unreliable across reboots/kernel upgrades, but that is not a problem unique to those devices, and is why hinting is preferred22:12
adam__metal3https://imgur.com/a/6E4ljs122:12
adam__metal3TheJulia, thanks good to know!22:12
TheJuliayou could try changing the iommu options on the command line22:12
TheJuliato debug that sort of issue, you're going to need to attach to a serial port and debug out to that to get as much of the kernel boot/initialization as possible22:13
TheJuliabut yeah, you're deeeeeeeep inside of the kernel22:13
adam__metal3yup and ofc in a global organization there is no living soul who can go there with a serial to usb adapter :D I think this specific hw is on another continent22:14
JayFLike, have you reproduced this on another raid card/system?22:14
TheJuliaheh22:14
TheJuliaadam__metal3: ipmitool sol?!22:14
JayFI mainly am asking just because this sorta issue is triggering my "are you SURE all ram is good?" spidey sense22:15
TheJuliayeah, in device attach for iommu is... interesting22:15
opendevreviewSteve Baker proposed openstack/sushy-tools master: Add virtual-media-boot to openstack driver  https://review.opendev.org/c/openstack/sushy-tools/+/90676822:15
adam__metal3so here comes the even weirder part: there are 3 identical blade servers, dell R660s. The issue happens with all of them, but if you run inspection for a night, 2 out of the 3 eventually reboot enough times that they get inspected, and one of them never does :D also as I mentioned the same network cards work just fine in a different dell server with the same IPA 22:16
TheJuliahttps://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt is a good starting place for options, fwiw22:17
adam__metal3great, I will check this kernel doc, thanks. This iommu tip is the best so far :D22:17
JayFadam__metal3: I will say, if I were in your shoes: this is call-the-vendor territory22:18
TheJulia... that is *weird*22:18
TheJuliayeah22:18
TheJulia++22:18
JayFvendor being RH or hardware-vendor22:18
JayFyou can launder the failure from centos into redhat if you have a contract22:18
TheJuliaI wonder if the blades have some sort of weird/random initialization thing going on22:19
adam__metal3these dell machines are all around weird imo... iDRAC has this weird queue/job system; when we change something it takes 15 minutes to modify any meaningful option.... slow as hell, I would expect better for servers that cost 12K...22:20
TheJuliawow22:20
TheJuliaso.. a long long time ago I was actually at dell's offices in one of the labs, and I had a couple identical systems, one of which acted sort of like that22:21
TheJuliaI ended up resetting the idrac and bios firmware back to defaults and it became a bit more consistent22:22
TheJulia... I also remember flashing firmware at one point22:22
TheJuliaIt was a blur couple of days22:22
adam__metal3we did the same, it got a bit more stable, but still slow, though a bit more consistent indeed22:22
adam__metal3we also went through a few network card firmwares and as I mentioned I even switched the centos mainline ice driver to the official intel one22:23
TheJuliaHmmm22:23
TheJuliaI've never been a fan of dell's blades, tbh22:23
TheJuliaHas the overall chassis manager firmware been updated?22:24
adam__metal3that I don't know, because we're actually not touching that; we have a <redacted> between agents like Ironic and the BMCs :D so I am not even sure our platform folks have access to the chassis. I will ask tomorrow22:25
adam__metal3but in any case I can ask around, this is also a good tip22:26
TheJuliaI did have some dell blades when I worked for a place in Atlanta, which had a version mismatch and exhibited some really weird behavior until the versions were all aligned22:26
TheJuliathat was a very long time ago, but still sort of similar22:27
adam__metal3JayF, TheJulia, I will go to sleep now, thanks for the tips and discussion. I will bring these points up downstream22:29
TheJuliagoodnight!22:30
JayFo/22:30
TheJuliahttps://bugs.launchpad.net/ironic/+bug/2056248 is a good one :)22:32
TheJuliaI'll note https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/909939 could use a review or two. It fixes one of the pains with running our tempest suite against an ironic deployment22:33
JayFTheJulia: that bug sounds like someone running a rabbitmq that is underscaled and/or has a failover timeout set high enough to screw up their environment22:34
JayF(I don't know the actual rabbitmq word for it, but I've seen clusters that took longer to failover in failure cases than the default conductor timeouts)22:35
TheJuliarabbit is not the actual arbiter22:38
TheJuliamysql is22:38
TheJuliaso rabbit is entirely independent in that case22:38
JayFI was about to ask you "was that the case back in $release_before_you_started_stacking" 22:39
JayFlol22:39
JayFHeads up: I'm going on some PTO the first week of April, going to completely disconnect that week.22:44
