Thursday, 2023-06-08

08:19 <rpittau> good morning ironic! o/
10:17 <rpittau> JayF, clarkb, TheJulia, about the timeout/failures in unit tests, I'm leaning more and more towards some DB shenanigans, as clarkb suspected
10:17 <rpittau> if you look at my patch (besides the weird failure), the migration tests have very high runtimes, which are absolutely unjustified
10:17 <rpittau> I've observed the migration test runtimes going from 12-13 secs to 42 secs completely randomly, and I suspect at some point the DB hangs and the timeout happens
10:24 <rpittau> btw I think we should merge https://review.opendev.org/c/openstack/ironic/+/885372?usp=dashboard anyway :)
10:28 <rpittau> JayF: if you want to test with an older version of tox, I think you can use the requires option in the [tox] section of tox.ini
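For reference, that pin only needs a requires entry in the [tox] section; tox re-provisions itself into a virtualenv that satisfies the constraint before running any environments. A minimal sketch, with the version bound as an example only:

    [tox]
    # example pin only: pick the last tox release known to behave
    requires =
        tox<4.6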
11:35 <opendevreview> Merged openstack/python-ironicclient master: Add support for node inventory  https://review.opendev.org/c/openstack/python-ironicclient/+/883833
14:02 <JayF> rpittau: tox was completely a guess that has mostly shaken out as a wrong guess
14:03 <rpittau> good news everyone! I was able to reproduce the timeout locally
14:04 <rpittau> "good" is relative in this case
14:04 <JayF> it's worthy of celebration for sure
14:05 <JayF> did you do anything magical to repro it?
14:06 <rpittau> nothing special, I tried the KISS approach and it worked
14:06 <rpittau> created a VM and reproduced the workflow
14:06 <rpittau> on the second attempt tox just stopped at some point
14:06 <rpittau> no error message, nothing
14:06 <rpittau> just defunct
14:07 <rpittau> I don't really see anything wrong so far in the VM, it's not stuck. only the terminal with tox is not moving anymore
14:09 <rpittau> lots of zombie processes
14:09 <rpittau> was there a stestr upgrade recently?
14:13 <JayF> no
14:13 <JayF> I'd suggest patching oslo.db in that venv as clarkb suggested yesterday
14:13 <JayF> since we know that could cause things to stop
14:14 <JayF> no new releases or updates to requirements that map to the timeline of unit tests beginning to fail
14:14 <JayF> the tox release was on the same day, but I think we've mostly ruled that out
14:15 <rpittau> ok, let me do another test first
14:23 <rpittau> I can reproduce it pretty much every time now
14:25 <opendevreview> OpenStack Release Bot proposed openstack/ironic bugfix/22.0: Update .gitreview for bugfix/22.0  https://review.opendev.org/c/openstack/ironic/+/885588
15:24 <clarkb> cool, having a reproduction case is very helpful
15:24 <clarkb> rpittau: what I would do is grab the list of tests fed to the stestr workers, then see what tests ran and which ones are up next
15:25 <clarkb> rpittau: the tests that are up next would be highly suspicious since in theory they are/were the ones running (depending on the state things end up in)
15:25 <rpittau> JayF, clarkb, I patched oslo.db, ran the tests twice and it didn't time out, but the database tests took over a minute to complete
15:25 <rpittau> each of them
15:26 <clarkb> ya, I mean they can be slow, that doesn't necessarily mean they are the problem causing the entire job to timeout (since you have a lot of headroom on the job timeout)
15:26 <clarkb> rpittau: what patch did you apply?
15:26 <clarkb> extra logging?
15:27 <rpittau> considering the number of db tests we have, I think the jobs time out because they effectively take too long, so they pass the 40 minute timeout value
15:27 <clarkb> being able to reproduce it is the main thing. Hopefully you're able to track it down pretty quickly now
15:27 <clarkb> rpittau: I don't think that is the case looking at the job logs
15:27 <clarkb> rpittau: if you look at the job logs you see that the last test case completes like half an hour before the job is timed out
15:27 <clarkb> something in the test suite is getting stuck and not progressing
15:28 <clarkb> it may still be that it is the db tests getting stuck, but I don't think it is that they take too long generally. More like a deadlock etc
15:29 <rpittau> mmm I'm seeing tox stuck again, I'll leave it be for a while
15:29 <clarkb> rpittau: I would definitely try to determine which specific test cases are running when it gets stuck
15:30 <clarkb> and then you can sort out why they are stuck etc
15:31 <rpittau> yeah, I'm trying to understand that
15:35 <clarkb> rpittau: if you look at ps -elf | grep stestr or similar there should be stestr processes with --load-list arguments. I think each of those load-list arguments is a file containing a set of tests in the order to run them
15:36 <rpittau> yep, looking at stestr options to help with that
15:36 <clarkb> rpittau: if you then look in .stestr/tmp* for the current run you should see the log showing what tests have been run for each worker, and you can cross check against the load list files to see what is currently running
15:36 <JayF> I set up a local mysql, trying to get in on the reproduction game
15:36 <clarkb> it might also be possible to issue a sigterm/^C to cause the test run to raise an exception wherever it is stuck, and then you'd see in the traceback where it was at
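Roughly, the workflow being described (the load-list path below is a placeholder for whatever path the worker process actually shows):

    ps -elf | grep stestr            # each worker shows a --load-list <file> argument
    cat /path/to/load-list-file      # ordered list of tests assigned to that worker
    tail .stestr/tmp*                # per-run worker logs: tests that already finished
    # compare the two to find the test a stuck worker is currently on; a Ctrl-C /
    # SIGINT to the run should also dump a traceback showing where it is blocked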
15:39 <JayF> rpittau: are you doing anything other than setting up mysql? also using postgresql?
15:39 <JayF> rpittau: what python version?
15:40 <rpittau> JayF: just installing bindep deps, then running tools/test-setup.sh, using py 3.10
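In other words, the reproduction recipe is roughly the following (the tox environment name is an assumption, not checked against ironic's tox.ini):

    # on a fresh Ubuntu VM, from an ironic checkout
    python3 -m pip install bindep tox
    sudo apt-get install -y $(bindep -b)   # distro packages listed in bindep.txt
    sudo ./tools/test-setup.sh             # sets up the MySQL/PostgreSQL test databases
    tox -e py3                             # run the unit tests until a run hangs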
15:40 <JayF> ack; so you have a running postgres and mysql server?
15:40 <JayF> I will set up postgres too
15:40 <yde> hi, i'm trying to make ironic+neutron work with a networking-generic-switch driver to put servers into neutron managed networks (vlans), but there's something i really don't get in the workflow
15:40 <rpittau> JayF: both, yeah
15:41 <yde> which "dhcp+tftp" service is supposed to answer PXE requests?
15:42 <yde> during enrollment/inspection, the ironic dnsmasq + tftp servers do the job, boot the IPA, and the inspector reports what it found
15:42 <yde> but then it auto-adds all the MACs to a deny list
15:42 <yde> and when i boot an instance, neutron+ironic create a port in the neutron managed "provisioning" network
15:43 <yde> which i also created with a dhcp agent
15:43 <yde> but this dnsmasq doesn't have all the necessary stuff to answer PXE requests, right?
15:45 <rpittau> clarkb: I don't see any --load-list in the stestr process
15:46 <rpittau> it's using the tests dir directly
15:47 <TheJulia> yde: you need to configure/establish a specific/dedicated endpoint for tftp
15:47 <yde> somehow, the enrolled+available baremetal host is properly discovered and inspected. then when i boot an instance through nova, neutron binds it a port in the admin network (which has a dhcp agent running in its ip netns), and the ports on the switch are properly set to the correct vlan id
15:48 <yde> and the server reboots, sends its DHCP requests, and nobody replies
15:48 <JayF> rpittau: you're using mariadb, not mysql, yeah?
15:48 * JayF trying to get this to reproduce and looking for differences
15:48 <rpittau> I'm using mysql
15:48 <JayF> dang, so am I
15:48 <rpittau> just using what's in CI
15:48 <JayF> is it happening every time for you?
15:48 <rpittau> in bindep
15:48 <rpittau> yes
15:49 <rpittau> so now it failed and I think we're onto something
15:49 <opendevreview> Merged openstack/ironic bugfix/22.0: Update .gitreview for bugfix/22.0  https://review.opendev.org/c/openstack/ironic/+/885588
15:49 <yde> TheJulia: thanks. i have one, the one set up by kolla-ansible. but where do i have to set up the ipxe script?
15:49 <JayF> if there's a way I can help other than by working independently to try and get a thing to fail, let me know
15:50 <rpittau> JayF: posting the error I found
15:50 <yde> i mean, if it's not the ironic-dnsmasq agent (which has an inspector.ipxe script set up), where is the ipxe script for deployment? and who shall serve it?
15:50 <rpittau> JayF, clarkb: https://paste.openstack.org/show/bn06iH0TXNfw4bAB78ns/
15:50 <JayF> those are served from the conductor server, generally
15:51 <yde> which dnsmasq? the one in the qdhcp namespace of the provisioning network, or the ironic-dnsmasq agent?
15:55 <JayF> rpittau: ...I feel like that error is disappointingly less useful than I hoped
15:56 <JayF> rpittau: did you ID anything interesting in there I might have missed?
15:57 <rpittau> mmm no, but I have the processes with the lists now
16:03 <clarkb> rpittau, JayF: it is in eventlet
16:03 <TheJulia> yde: the dnsmasq launched in the neutron namespace can't take tftp. The config for ironic-conductor should include a tftp address, which it puts in the dhcp response payload
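The conductor-side knobs for that usually live in the [pxe] section of ironic.conf (the address below is a placeholder):

    [pxe]
    # IP the conductor advertises as the next-server / TFTP target in the DHCP reply
    tftp_server = 192.0.2.10
    tftp_root = /tftpboot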
16:04 <clarkb> waiting to read bytes. I think if there is a lock on the db side, the test side might be waiting for the lock to release
16:04 <clarkb> rpittau: maybe next is to check the db state when it is stuck and see if there are locks preventing work from happening in the db
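One way to look at that while a run is wedged, assuming the default openstack_citest credentials that tools/test-setup.sh creates:

    mysql -u openstack_citest -popenstack_citest -e "SHOW FULL PROCESSLIST"
    mysql -u openstack_citest -popenstack_citest -e "SHOW ENGINE INNODB STATUS\G"
    # long-idle sessions in "Waiting for table metadata lock" or a transaction
    # holding locks with no activity would point at the stuck test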
16:04 <rpittau> clarkb: I can see it's very slow at the moment
16:04 <JayF> what OS do we use in the gate?
16:05 <JayF> jammy, yeah
16:05 <rpittau> ubuntu focal and jammy
16:07 <JayF> the thing I keep rebounding back to
16:08 <JayF> is why now
16:08 <JayF> why did it break 6/5
16:09 <JayF> https://github.com/eventlet/eventlet/issues/798
16:09 <JayF> I think that's not impacting us here, because the switches complete successfully in your traceback
16:16 <rpittau> so it was stuck for a while, then it started again briefly and it looks like it's running a migration
16:16 <rpittau> I think it's stuck on ironic.tests.unit.db.sqlalchemy.test_migrations.TestWalkVersions.test_walk_versions_all_false
16:16 <rpittau> no nvm, that one completed
16:17 <rpittau> that's the last migration test that completed
16:19 <rpittau> I'll leave it running for the rest of the evening
16:19 <rpittau> need to split now
16:19 <rpittau> good night! o/
16:20 <JayF> o/ I'm going to keep trying to get a local reproducer :(
16:25 <rpittau> I think I found the culprit: {1} ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsMySQL.test_upgrade_twice [1792.906012s] ... ok
16:25 <rpittau> this takes 30 minutes to complete
16:26 <rpittau> in the end it completes, but of course it hits the timeout in CI
16:26 * rpittau runs away screaming in SQL
16:27 <yde> TheJulia: thanks a lot. i already have a tftp server up and running and set up in the [pxe] section of the conductor
16:27 <clarkb> in https://zuul.opendev.org/t/openstack/build/00d4e534d78448a99b7a284baf6a869a/log/job-output.txt#57860 it only took 16s
16:27 <clarkb> rpittau: ^ so maybe it is only slow sometimes? But that would certainly do it
16:29 <yde> TheJulia: what i don't get is how the tftp info is pushed to the server, as it doesn't seem to be set up in the dnsmasq conf of the qdhcp neutron network
16:29 <yde> TheJulia: whereas it is, in the ironic-dnsmasq service
16:30 <TheJulia> yde: when a port attachment occurs in the lifecycle, the neutron port gets updated with the parameters
16:30 <TheJulia> So dnsmasq gets told by neutron
16:30 <TheJulia> And it is dhcp record/lease specific as a result
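A quick way to confirm that during provisioning, given the UUID of the port neutron bound for the node (the exact option names vary by boot mode, but they should point at the conductor):

    openstack port show <provisioning-port-uuid> -c extra_dhcp_opts
    # expect entries along the lines of tftp-server / server-ip-address set to the
    # conductor address and bootfile-name pointing at the PXE/iPXE binary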
16:31 <yde> TheJulia: ok thanks
16:31 <yde> i'll check that
16:46 <JayF> rpittau, clarkb: I'm going to propose a change to temporarily skip that test, and look and see if we can make it behave better.
16:53 <JayF> that test looks bad on the face of it
16:53 <JayF> I'm going to see if it's a simple fix
17:00 <JayF> https://bugs.launchpad.net/ironic/+bug/2023316 to track this, since I'm going to propose disabling a test
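For context, temporarily disabling a single test is typically just a skip decorator on it; an illustrative sketch only, the actual change may do it differently:

    import unittest

    class TestMigrationsMySQL(unittest.TestCase):
        @unittest.skip("Temporarily disabled while the slowdown is investigated; "
                       "see https://bugs.launchpad.net/ironic/+bug/2023316")
        def test_upgrade_twice(self):
            self.fail("never runs while the skip is in place")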
17:02 <opendevreview> Jay Faulkner proposed openstack/ironic master: DNM: Testing to see if disabling this test lets things pass  https://review.opendev.org/c/openstack/ironic/+/885600
17:02 <JayF> pushed it up with integration tests disabled because I'm still sus, if it passes I'll make it mergeable
17:17 <JayF> it's passed
17:19 <opendevreview> Jay Faulkner proposed openstack/ironic master: Disabling test_upgrade_twice temporarily for CI fix  https://review.opendev.org/c/openstack/ironic/+/885600
17:19 <JayF> TheJulia, rpittau, others: ^ workaround for unit test failures
17:19 <JayF> after my afternoon meetings (and lunch) I'll start digging to see if there is a real fix, but it only causes us harm to have CI broken
17:21 <opendevreview> Jay Faulkner proposed openstack/ironic master: Disabling test_upgrade_twice temporarily for CI fix  https://review.opendev.org/c/openstack/ironic/+/885600
17:57 <JayF> https://review.opendev.org/c/openstack/ironic/+/885600 is V+1 and seemed to run a lot quicker, too
21:09 <JayF> rpittau: saw your comment. That the tests ran so fast is heartening, I think
21:35 <JayF> Heads up: when looking at the unit test failures and digging through some issues, I saw something that's likely to break oslo.service's SSL support in eventlet for py 3.12
21:35 <JayF> so we'll have to be on the lookout for that when 3.12 comes down the pipe
21:36 <TheJulia> Fun :(
21:37 <JayF> if we're lucky, it'll just be differential eventlet versions depending on py version
21:37 <JayF> I also think we only use that in IPA, not Ironic
21:45 <opendevreview> Jay Faulkner proposed openstack/ironic master: DNM: Ensure upgrade twice test uses two engines  https://review.opendev.org/c/openstack/ironic/+/885661
21:50 <TheJulia> yde: is dnsmasq logging anything?
21:50 <TheJulia> yde: never mind, i opened an old irccloud window which I guess had out of date logs
21:59 <opendevreview> Jay Faulkner proposed openstack/ironic master: Ensure upgrade twice test uses two engines  https://review.opendev.org/c/openstack/ironic/+/885661
21:59 <JayF> the most wishy-washy, unsure commit message I've ever applied
21:59 <JayF> but it passed unit tests the first time
22:08 <clarkb> oh interesting, fwiw it wouldn't surprise me if reusing things keeps some stale state around that can create problems
22:13 <JayF> The thing I can't quite figure out is what is being reused, exactly.
22:14 <JayF> The engine passed in that context manager is an engine-getter
22:14 <JayF> so I wonder if this fixing things would still indicate we have a place where a connection/session is being left dangling
22:14 <JayF> because I don't think they should step on each other
22:14 <JayF> but it was an obvious thing to try given the symptoms
22:15 <JayF> that thing has already passed
22:15 <JayF> I now have a voice in the back of my head suggesting that it disappeared as quickly as it appeared, and I should re-validate the original bug...
22:16 <JayF> but I guess rpittau had it repro'd locally, he can test if 885661 fixes it for him, too
22:22 <JayF> rpittau: pointing my brain towards something else, so trying to document status for you in the morning: https://review.opendev.org/c/openstack/ironic/+/885661 may be a real fix. If it doesn't survive additional testing (or your local testing) we should land https://review.opendev.org/c/openstack/ironic/+/885600 to unblock the gate
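The shape of the idea behind 885661, going by its commit title: give each upgrade walk its own engine instead of sharing one, so no pooled connection or session state carries over between the two passes. A rough sketch only; run_upgrade and the DB URL are stand-ins, not ironic's actual test code:

    from sqlalchemy import create_engine

    DB_URL = "mysql+pymysql://openstack_citest:openstack_citest@localhost/openstack_citest"

    def upgrade_twice(run_upgrade):
        first = create_engine(DB_URL)
        try:
            run_upgrade(first)       # first walk of the migrations
        finally:
            first.dispose()          # drop pooled connections before the second pass

        second = create_engine(DB_URL)   # fresh engine: nothing reused from the first walk
        try:
            run_upgrade(second)
        finally:
            second.dispose()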
22:51 <TheJulia> https://review.opendev.org/c/openstack/ironic/+/885661/ fixing it makes me wonder about our last migration
22:52 <opendevreview> Julia Kreger proposed openstack/ironic master: WIP: service steps  https://review.opendev.org/c/openstack/ironic/+/884783
22:58 <TheJulia> eh, looks fine
22:58 <TheJulia> still funky
23:03 <TheJulia> JayF: so.... w/r/t hold steps, I suspect they could be combined into one handler
23:04 <TheJulia> Just with two distinctly different approaches
23:04 <TheJulia> one being a "translation lookup, or 'reserved name'" set of logic
23:04 <TheJulia> likely reserved names, then the translation lookup.
23:04 * TheJulia steps away
