Wednesday, 2024-03-06

opendevreviewMerged openstack/ironic-tempest-plugin master: Test multiple boot interfaces as part of one CI job  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/90217103:27
rpittaugood morning ironic! o/07:55
masgharGood morning!09:14
opendevreviewMerged openstack/ironic stable/wallaby: stable only/ci: pin CI to dnsmasq 2.85/pin proliantutils  https://review.opendev.org/c/openstack/ironic/+/91116009:25
iurygregorygood morning Ironic11:07
Sandzwerg[m]moin ironic12:10
Sandzwerg[m]Has anyone ever heard of dell nodes getting frequently powered off shortly after they got powered on? It does not happen all the time and we're unsure if ironic/metal³ or something else is involved12:15
dtantsurSandzwerg[m]: what do you use to power them on? Ironic tends to force whatever it thinks is the right state.12:20
Sandzwerg[m]metal³ ironic12:26
Sandzwerg[m]and yeah it should force them to come on, but something forces them off when ironic forces them to PXE/on just before. 12:29
Sandzwerg[m]Feels a bit like ironic is confused, but not entirely sure. Haven't found which user triggers the shutdown in the remote board log12:30
dtantsurIf Ironic does it, it has to log something about it13:31
drannouHello13:38
drannouIf you have a moment to review https://review.opendev.org/c/openstack/ironic-python-agent/+/90276913:39
dkingDoes anybody know off the top of their head in which part of the cleaning process, perhaps in hardware.py, LVM VGs are cleared?13:42
dtantsurdking: erase_devices/erase_device_metadata clean steps. For the latter, it propagates all the way into https://opendev.org/openstack/ironic-lib/src/branch/master/ironic_lib/disk_utils.py#L51513:43
dkingdtantsur: Thank you very much!13:45
*** tosky_ is now known as tosky13:57
TheJuliaSandzwerg[m]: I'd check the console to see if reconfiguration jobs are running, that will cause a reboot and a dell node to appear on, go off, and then come back on.14:54
TheJuliaJayF: https://review.opendev.org/c/openstack/ironic/+/911158 won't pass until tempest tests are fixed14:57
JayFI'm assuming since this pointed at me that must be a failure caused by the sharding test or something?14:58
JayFI can look at it, I'm finally called up today14:58
JayF**caught up14:58
TheJuliayes :)14:58
TheJuliaThat would be awesome if you could, I'm buried under piles of things right now and I need to work on a side deck14:59
opendevreviewJulia Kreger proposed openstack/ironic-tempest-plugin master: Invoke tests with fake interfaces  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/90993914:59
TheJuliaummm hmmm15:00
*** dansmith_ is now known as dansmith15:02
opendevreviewJulia Kreger proposed openstack/ironic-tempest-plugin master: Invoke tests with fake interfaces  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/90993915:08
dtantsurTheJulia, hjensas, hi folks, would be great if you could put https://review.opendev.org/c/openstack/ironic/+/907991 (and thus https://review.opendev.org/c/openstack/ironic/+/910251) in your review queue.15:36
dtantsurI'm actually the last person who needs that working :)15:36
JayFTheJulia: I'm confused, I thought https://opendev.org/openstack/ironic-tempest-plugin/src/branch/master/ironic_tempest_plugin/tests/api/admin/test_shards.py#L24 was all that was needed to keep those tests from running15:38
JayFhttps://opendev.org/openstack/ironic-tempest-plugin/src/branch/master/ironic_tempest_plugin/tests/api/base.py#L7715:39
JayFI'm assuming those branches are not configured properly with their max microversion?15:40
TheJuliathere are no branches for tempest15:41
TheJuliaI honestly don't know why they are trying to run15:41
TheJuliait is clear it knows the difference from the errors I linked out15:41
JayF3142:        iniset $TEMPEST_CONFIG baremetal max_microversion $TEMPEST_BAREMETAL_MAX_MICROVERSION15:41
JayFwe have to set that15:42
JayFon older branches15:42
TheJuliaahh!15:42
JayFI'll add it to the patch in question15:42
TheJuliaso config on stable/zed then15:42
JayFaye15:42
TheJuliaI guess that code path doesn't have real detection-based skipping, unlike scenario jobs15:42
JayFdoes that imply I only need to configure it on functional test jobs?15:43
JayFI suspect it's also possible for scenario jobs ... we just haven't added any that use new APIs15:43
JayFso it's a moot point and not breaky15:43
TheJulialikely just the default in devstack/lib/ironic15:43
opendevreviewJay Faulkner proposed openstack/ironic stable/zed: stable only/ci: pin CI to dnsmasq 2.85/pin proliantutils/scciclient  https://review.opendev.org/c/openstack/ironic/+/91115815:48
*** dking is now known as Guest203115:55
*** Guest2031 is now known as dking16:07
dkingIn the Ironic-Python-Agent, was ProtectedDeviceFound replaced by errors.ProtectedDeviceError? I see that ProtectedDeviceFound is only referenced in some docstrings, and I cannot find it implemented, but errors.ProtectedDeviceError exists and seems to be used similarly.16:20
JayFyou wanna link to those invalid docstrings, I'll clean em up16:28
JayFeh, ripgrep can do that part I guess16:28
JayFdking: yep, you are right16:32
opendevreviewJay Faulkner proposed openstack/ironic-python-agent master: Correct invalid docstrings; s/Found/Error/  https://review.opendev.org/c/openstack/ironic-python-agent/+/91159816:33
JayFdking: ^ fyi it'll be fixed when that lands16:33
dkingJayF: Thanks! So, that can help clean up the code. I'm just wondering about it a little: the new error is thrown when looking at devices, but the places where it's mentioned in the docstrings are methods handling the node itself. The error requires a device. Is that still the intention?16:36
JayFLets back up a step16:36
JayFwhat's your overall thing you're trying to accomplish?16:37
dkingFor instance, if I wanted to override erase_devices() for an entire node for some reason, I would make my own erase_devices() method in my hardware manager and send an error up saying that this node is protected. Would ProtectedDeviceFound still be the best way to do that?16:37
dkingIn my specific use case, I can adjust the code to grab the particular device, so I'll probably do that anyway, but it just made me wonder if there should be something more generic.16:39
dking(In my use case, I'm looking for some specific volume group names which happen to belong to a ceph cluster and I'm planning to make those be removed only upon a manual clean step. So, I'm also happy to try something else if there's a better way.)16:40
JayFProtectedDeviceError would be the way do that /on master branch/ which is all I've looked at16:42
dkingJayF: Okay, that's good, then. So, if somebody for some reason did decide that they wanted their clean step to fail after checking the whole node, not a check for a specific device, they would throw that error and perhaps fill in device with just any text?16:45
JayFDo you *care* that it's a ProtectedDeviceError16:45
JayFor do you just want to stop cleaning?16:46
dkingMore the latter. So, perhaps just a generic CleaningError?16:46
JayFhttps://opendev.org/openstack/ironic-python-agent/src/branch/master/examples/business-logic/example_business_logic.py#L9416:47
JayFyep16:47
JayFyou're basically implementing the business logic pattern16:47
JayFhttps://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/errors.py#L362 ProtectedDeviceError *is* a CleaningError, which a preset exception message :)16:48
JayF**with a preset exception message16:48
JayFSo either one of those works, just a question of if you want the premade message or you wanna write one yourself16:49
dkingGreat. Then, that answers my question. I was wondering for a moment if the docstrings should be updated instead to the more generic CleaningError, but as those particular methods do specifically only raise ProtectedDeviceError, that seems appropriate.16:49
dkingYeah, I see that. I might still use ProtectedDeviceError in my case also.16:50
JayFI suspect that error name was changed and just got missed being updated in the docstring16:50
rpittaugood night! o/16:50
JayFo/16:50
dkingBTW, I see that after about 4 years, you just updated the hardware manager examples to fix some spelling issues. I'm sure there's tons of cobwebs growing on them.16:51
JayFThe interface itself hasn't changed in .... ever?16:51
dkingrpittau: Good night16:51
JayFWe did one api change on hardware managers, back in like, 2014 or 201516:51
JayFand it's been a stable API since16:51
dkingIt might be helpful to add a deploy step example? I don't know if people are doing them much, though.16:52
JayFI'd +1 such a change, but it's not something that's urgently on my list 16:52
dkingI'm adding in my first locally because we're not ready to update BMO yet and so we're stuck with the old version that doesn't have an option for a "by_path" root device hint.16:52
TheJuliadnsmasq gives me a migraine16:53
JayFMore breakage? Or you trying to fix the issue in C?16:53
TheJuliaoh, I know of a way, it just masks the actual problem16:54
*** clarkb1 is now known as clarkb16:55
opendevreviewVerification of a change to openstack/ironic master failed: Split conductor-specific RPCService  https://review.opendev.org/c/openstack/ironic/+/91025116:56
* TheJulia wonders if I just found the root cause17:12
dtantsurdking: we have downstream deploy steps if you're curious: https://github.com/openshift/ironic-agent-image/blob/main/hardware_manager/ironic_coreos_install.py17:26
JayFI was reviewing a change for cid and got a little confused: https://review.opendev.org/c/openstack/ironic/+/91097317:26
JayFre: if we can use mysql types in alembic migrations17:26
dtantsurdking: but if you have issues with hints, won't it make more sense to override get_os_install_device in a downstream hardware manager?17:27
TheJuliaJayF: mysql has a nuance I'm trying to remember17:27
TheJuliaI think postgres doesn't actually enforce string field lengths17:28
TheJuliait does it by types if memory serves17:28
TheJuliaeasy enough to test in migration test code17:28
dkingdtantsur: Perhaps. I hadn't looked at that, but at the moment, I'm just trying to mimic what BMO is doing so that it will work the same way once we update our version.17:31
dtantsurdking: I haven't put a ton of thought into this suggestion either, but it does sound like you need to override get_os_install_device17:32
TheJuliagah, I have dnsmasq in a cpu consuming loop17:32
dtantsurnom-nom17:32
TheJulia88.6% om nom!17:32
dtantsur\o/17:32
TheJulia... how in the world did I do that17:32
clarkbif you give a dhcp server a cookie...17:34
opendevreviewVerification of a change to openstack/ironic master failed: Split conductor-specific RPCService  https://review.opendev.org/c/openstack/ironic/+/91025117:35
clarkbI'm curious, have you checked if neutron runs into similar problems? And if not, maybe you can isolate what triggers these things (you probably have; it just seems odd that it's a consistent problem for ironic but apparently not for neutron)17:35
TheJuliaI remember I could see evidence in some of the non-ironic jobs that dnsmasq was getting restarted17:36
TheJuliabut that was a while back17:36
TheJuliaI haven't looked recently because it doesn't seem anything bare metal specific17:37
clarkbok I did wonder if maybe pxe boot flags could be the problem for example17:38
clarkbsince neutron in normal VM operation wouldn't be setting those17:38
TheJuliait is any option handling it seems17:38
TheJuliaand all ports get options inherently to do the base matching17:39
clarkbya neutron will also set dns servers and other flags too iirc17:39
TheJuliayup17:39
TheJuliaI'm sort of at the end of my ability to figure out what is going on; I don't really understand the inner workings well enough to do more than get a feeling that it's most likely in option response building17:59
JayFTheJulia: can you make sure your research is reflected in that dnsmasq ubuntu bug?18:10
JayFTheJulia: and I will see if I can pull a C expert outta the hat18:10
JayFlooks like you already did that18:11
JayFscore18:11
TheJuliaJayF: I did like an hour ago18:12
TheJuliayeah18:13
TheJuliawaiting on hopefully another message from petr, but even he is unsure of a next step in debugging18:13
jrosserwould you have a link to the bug out of interest?18:14
JayFhttps://bugs.launchpad.net/dnsmasq/+bug/202675718:14
JayFI put out a bat-signal in GR-OSS downstream slack, this is basically how I started the eventlet stuff and got Itamar's help there. No promises it'll yield anything, but I am trying :D18:15
jrossera trivial reproducer might be valuable, then the barrier to debugging is lowered18:23
TheJuliaI've not been able to figure one out really18:24
TheJuliait is not just about HUP operations, it is about that *and* dhcp options processing being triggered, which appears to work just as expected, and upon the next HUP it crashes18:25
*** awb_ is now known as awb19:05
opendevreviewMerged openstack/ironic master: Split conductor-specific RPCService  https://review.opendev.org/c/openstack/ironic/+/91025120:53
adam__metal3Hello Ironic, just wondering if you have experienced kernel issues on Dell servers with BOSS-N1 raid controllers when running centos IPA? I am having "fun" (not really) with new hw combinations and I am wondering whether that device is known to cause issues for IPA? Not sure it causes the problem for me ....22:02
JayFWhat do you mean by kernel issues, and what centos version?22:03
adam__metal3latest upstream centos-9-stream, kernel panic22:03
JayFMost of these are just ... $distro issues, but some stuff around how we build the ramdisk can cause headaches22:03
JayFwhat is the panic?22:03
JayFyou have a screenshot/log?22:03
adam__metal3yeah I can make a screenshot, where should I share it? It's most likely not the BOSS one; I have built a new network driver for some intel E810-XXV cards and then the issue changed to just getting stuck at the penguins on the boot screen, but I was wondering if BOSS might be doing something extra weird, because the same cards on a different machine work 22:06
JayFBOSS?22:07
JayFah, BOSS-N1 raid22:07
JayFI see22:07
JayFI don't have a good spot, any random image sharing site22:08
TheJuliaso BOSS cards have always been a little weird, but generally they've just worked. The only sort of known issue is they can very much cause weirdness with the device ordering in the OS, making /dev/sda, /dev/sdb, etc. unreliable across reboots/kernel upgrades, but that is not a problem unique to those devices, and is why hinting is preferred22:12
adam__metal3https://imgur.com/a/6E4ljs122:12
adam__metal3TheJulia, thanks good to know!22:12
TheJuliayou could try changing the iommu options on the command line22:12
TheJuliato debug that sort of issue, you're going to need to attach to a serial port and debug out to that to get as much of the kernel boot/initialization as possible22:13
TheJuliabut yeah, you're deeeeeeeep inside of the kernel22:13
adam__metal3yup and ofc in a global organization there is no living soul who can go there with a serial to usb adapter :D I think this specific hw is on another continent22:14
JayFLike, have you reproduced this on another raid card/system?22:14
TheJuliaheh22:14
TheJuliaadam__metal3: ipmitool sol?!22:14
JayFI mainly am asking just because this sorta issue is triggering my "are you SURE all ram is good?" spidey sense22:15
TheJuliayeah, in device attach for iommu is... interesting22:15
opendevreviewSteve Baker proposed openstack/sushy-tools master: Add virtual-media-boot to openstack driver  https://review.opendev.org/c/openstack/sushy-tools/+/90676822:15
adam__metal3so here comes the even weirder part: there are 3 identical blade servers, dell R660s. The issue happens with all of them, but if you run inspection for a night, 2 out of the 3 eventually reboot enough times that they get inspected, and one of them never does :D also as I mentioned the same network cards work just fine in a different dell server with the same IPA 22:16
TheJuliahttps://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt is a good starting place for options, fwiw22:17
adam__metal3great, I will check this kernel doc, thanks. This iommu tip is the best so far :D22:17
JayFadam__metal3: I will say, if I were in your shoes: this is call-the-vendor territory22:18
TheJulia... that is *weird*22:18
TheJuliayeah22:18
TheJulia++22:18
JayFvendor being RH or hardware-vendor22:18
JayFyou can launder the failure from centos into redhat if you have a contract22:18
TheJuliaI wonder if the blades have some sort of weird/random initialization thing going on22:19
adam__metal3these dell machines are all around weird imo... iDRAC has this weird queue/job system; when we change something it takes 15 minutes to modify any meaningful option.... slow as hell, I would expect better for servers that cost 12K...22:20
TheJuliawow22:20
TheJuliaso.. a long long time ago I was actually at dell's offices in one of the labs, and I had a couple identical systems, one of which acted sort of like that22:21
TheJuliaI ended up resetting the idrac and bios firmware back to defaults and it became a bit more consistent22:22
TheJulia... I also remember flashing firmware at one point22:22
TheJuliaIt was a blur couple of days22:22
adam__metal3we did the same, it got a bit more stable, but still slow, though a bit more consistent indeed22:22
adam__metal3we also went through a few network card firmwares and as I mentioned I even switched the centos mainline ice driver to the official intel one22:23
TheJuliaHmmm22:23
TheJuliaI've never been a fan of dell's blades, tbh22:23
TheJuliaHas the overall chassis manager firmware been updated?22:24
adam__metal3that I don't know, because we're actually not touching that; we have a <redacted> between agents like Ironic and the BMCs :D so I am not even sure our platform folks have access to the chassis. I will ask tomorrow22:25
adam__metal3but in any case I can ask around, this is also a good tip22:26
TheJuliaI did have some dell blades when I worked for a place in Atlanta, which had a version mismatch and exhibited some really weird behavior until the versions were all aligned22:26
TheJuliathat was a very long time ago, but still sort of similar22:27
adam__metal3JayF, TheJulia, I will go to sleep now, thanks for the tips and discussion. I will bring these points up downstream22:29
TheJuliagoodnight!22:30
JayFo/22:30
TheJuliahttps://bugs.launchpad.net/ironic/+bug/2056248 is a good one :)22:32
TheJuliaI'll note https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/909939 could use a review or two. It fixes one of the pains with running our tempest suite against an ironic deployment22:33
JayFTheJulia: that bug sounds like someone running a rabbitmq that is underscaled and/or has a failover timeout set high enough to screw up their environment22:34
JayF(I don't know the actual rabbitmq word for it, but I've seen clusters that took longer to failover in failure cases than the default conductor timeouts)22:35
TheJuliarabbit is not the actual arbiter22:38
TheJuliamysql is22:38
TheJuliaso rabbit is entirely independent in that case22:38
JayFI was about to ask you "was that the case back in $release_before_you_started_stacking" 22:39
JayFlol22:39
JayFHeads up: I'm going on some PTO the first week of April, going to completely disconnect that week.22:44
