Monday, 2023-06-26

rpittaugood morning ironic! o/06:47
rpittauwow, I see we're still struggling with the unit tests timeouts :/07:58
rpittauJayF, TheJulia, do we want to increase the MIGRATIONS_TIMEOUT value for the time being?08:01
rpittaualso, I think we should change the gentle value for the timeout to False for all the tests, not just the migration ones08:07
opendevreviewMerged openstack/ironic-python-agent master: Fix Bandit errors  https://review.opendev.org/c/openstack/ironic-python-agent/+/87991209:25
opendevreviewlikui proposed openstack/python-ironicclient master: These should be indented 4 spaces to match the other things in this block  https://review.opendev.org/c/openstack/python-ironicclient/+/79469109:58
opendevreviewMerged openstack/ironic-python-agent master: Fix nvidia hardware manager url parser to permit https  https://review.opendev.org/c/openstack/ironic-python-agent/+/88141010:11
Nisha_Agarwalscottsol, hi10:26
iurygregorygood morning Ironic11:09
iurygregoryrpittau, re unit tests, I think TheJulia was able to reproduce things locally I seem to remember 4 patches that maybe would be necessary to fix https://review.opendev.org/c/openstack/ironic/+/885797 https://review.opendev.org/c/openstack/ironic/+/886865 https://review.opendev.org/c/openstack/ironic/+/886871 https://review.opendev.org/c/openstack/ironic/+/886881 11:13
TheJuliarpittau: what is the present setting?13:07
iurygregoryTheJulia, MIGRATIONS_TIMEOUT is set to 6013:12
iurygregorygood morning =)13:12
iurygregoryand the set for other tests is OS_TEST_TIMEOUT is 3013:14
TheJulia60 seconds?13:18
TheJuliathe second migration test takes consistently 70 seconds on my desktop13:18
iurygregoryhummm13:18
iurygregoryinteresting13:18
iurygregorymaybe we should change the value13:19
TheJuliayeah13:54
opendevreviewJulia Kreger proposed openstack/ironic master: Fix test_migrations with firmware information.  https://review.opendev.org/c/openstack/ironic/+/88688113:58
JayFwe did cross a threshold then, eh14:06
JayFjust a threshold of doing the wrong thing more than the number of conns we had or somehting like14:06
JayFthat14:06
opendevreviewJulia Kreger proposed openstack/ironic master: CI: Change migrations timeout to be >60 seoncds  https://review.opendev.org/c/openstack/ironic/+/88698514:08
TheJuliayeah, it might explain why test_migrations has always sort of been a little more likely to fail14:26
JayFI'm adding a unit tests section to the meeting agenda14:45
JayFJust in case I missed a patch, you wanna check up? https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting14:51
iurygregoryI think we also need https://review.opendev.org/c/openstack/ironic/+/885797 and https://review.opendev.org/c/openstack/ironic/+/88687114:54
JayFI'll make SQL2.0 tests a separate line item14:54
JayFbut it's a good  callout14:54
iurygregorymakes sense14:54
iurygregory++14:54
JayF#startmeeting ironic15:01
opendevmeetMeeting started Mon Jun 26 15:01:06 2023 UTC and is due to finish in 60 minutes.  The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot.15:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.15:01
opendevmeetThe meeting name has been set to 'ironic'15:01
JayFGooood morning Ironic'ers! Who all is here this morning?15:01
iurygregoryo/15:01
rpittauo/15:01
matfechnero/15:01
TheJuliao/15:01
JayF#topic Announcements/Reminder15:02
JayFThis is where I'd usually note to review things wiht #ironic-week-prio and to mark things as #ironic-week-prio if you want it reviewed15:02
JayFbut frankly, unit tests are a mess (to be addressed later) and we should not approve anything until they are fixes15:02
JayFplease point review love at the priority order of: unit tests; ironic-week-prio ;) 15:02
JayFThere were no action items out of our 6/19 meeting, so I'm skipping that agenda item.15:03
JayF#topic Reivew Ironic CI status15:03
JayFyou know, I put all the commenst in open discussion, but I think we can talk about this here15:03
JayFUnit tests are broken! At least, they are until more things merge15:03
JayFthere are at least three separate things that we need to get unit tests back in good shape15:04
JayFFirst: Failures when running migrations15:04
JayFthe changes we believe will fix this, and undo any attempts to mitigate the damage since it's been busted are15:04
JayF#link https://review.opendev.org/c/openstack/ironic/+/88698515:04
JayF#link https://review.opendev.org/c/openstack/ironic/+/88688115:04
JayF#link https://review.opendev.org/c/openstack/ironic/+/88619615:04
JayFSecond: Failures in TestNeutronVifPortMixing15:05
JayFthis one, we don't have fixed, but it's not really OK for it to fail intermittantly all the time either15:05
JayFI proposed a change to disable it, and a bug to track re-enabling15:05
JayF#link https://review.opendev.org/c/openstack/ironic/+/88691115:05
JayF#link https://bugs.launchpad.net/ironic/+bug/202499415:05
JayFIf you don't think this is the right approach; please comment in the bug and/or the change, I'm open to other solutions but not really Ok with us just suffering under a test that fails 5-10% of the time for any longer :D 15:06
JayFand finally15:06
JayFAdding tests to ensure SQLA2.0 compat is not broken15:06
JayF#link https://review.opendev.org/c/openstack/ironic/+/885797 to fix tests under sqla 2.015:06
JayF#link https://review.opendev.org/c/openstack/ironic/+/886020 adds a sqla 2.0 job (we need to rebase this once tests are fixed)15:06
JayF#link https://review.opendev.org/c/openstack/ironic/+/886871 Fixes an SAWarning around DB schema -- I'm not sure this is really testing related, but I think it's another change we need to land to make CI fully happy15:07
JayFAny thoughts on this? Did I miss anything? Any discussion we wanna have around this?15:07
TheJuliato at least... remove a warning that will become an error at some point15:07
rpittauJayF: thanks, I believe that's all, and a lot :)15:08
TheJuliaNo issues, I believe the first few are already in route to CI15:08
JayFThis all just makes me wonder if there's something we can look at from a higher level to help in the future15:08
JayFlike writing something in migrations tests to ensure you never pass a SQLA object in the data dict 15:09
JayFI don't really know, but it seems obvious we need more guardrails given the age of some of the miscoded tests fixed in 88688115:09
TheJuliaIt is not the most widely known aspect of sqlalchemy, fwiw15:10
JayFand it's going to crop up more as people do sqla migrations15:10
TheJuliaso definitely not "easy"15:10
JayFPart of me wonders if there's a guardrail to put in for oslo db15:11
JayFto try and "save" the smaller projects from tha pain we just felt over the last two weeks15:11
TheJuliaThat can be a whole debate15:11
JayFwell I might spike on having the discussion, if it gets to code then I can wail against that lol15:12
TheJuliaheh15:12
JayFbut I look at this this way: if it took us two weeks-ish, as a team that has more sqla strength than most15:12
opendevreviewMerged openstack/ironic master: Fix test_migrations with firmware information.  https://review.opendev.org/c/openstack/ironic/+/88688115:12
JayFthen a smaller project with a similar bug has very little chance15:12
JayFbut that's not an ironic problem; but I'll still try and spread the love of our lessons learned15:12
JayFgoing to move on now unless there's further?15:13
TheJuliano objection here15:13
JayF#topic Review ongoing 2023.2 workstreams15:13
JayF#link https://etherpad.opendev.org/p/IronicWorkstreams2023.215:13
JayFI've been behind on reviewing some of this stuff because I've been pointing all my brainpower at tests; going to try and correct that this week.15:14
JayFno comments on workstreams, moving on15:16
JayF#topic Open Discussion15:16
JayFI have a note here on PTL availability15:16
JayF#note JayF will be mostly unavailable upstream the week of 7/10.15:16
TheJuliado you need someone to run the meeting that week?15:17
JayFCan someone volunteer to run the 7/10 and also 7/17 meetings? I should be here week of 7/17 just being careful15:17
JayFThat week I'll be visiting my coworkers in the UK, so I'll be in a different timezone even if I get to point my brain at upstream :D15:17
* TheJulia checks calendars15:18
iurygregoryI can't =(15:18
iurygregory7/10 I'm on vacation15:18
JayFWorse case, if I put this call out next week and get some "no"s, we can cancel the meeting.15:19
iurygregory7/17 I'm traveling back .-.15:19
JayFThis has been the most substantive one in a while just because of the unit tests :D15:19
TheJuliaI'm available15:19
JayFfor both?15:19
TheJuliayeah15:19
rpittauI'm also available15:19
JayFwho wants what?15:19
TheJuliarpittau: if you'd like to, the gavel is yours15:19
JayFYou tell me and I'll put it in an action so we can have it marked15:19
JayFas it is voluntold, so shall it be written :P15:19
JayF#action rpittau to run Ironic meeting 7/10 and 7/1715:20
TheJuliahehehe15:20
TheJulialol15:20
TheJuliaevil man, evil :)15:20
JayFAnything else for Open Discussion?15:20
rpittauyay (?)15:20
TheJuliarpittau: worst comes to worst, let lmk :)15:20
JayFrpittau: it'll be your turn for PTL soon ;) 15:20
TheJuliaheh15:20
JayFrpittau: just embarce the gavel15:20
* rpittau hides15:20
* TheJulia hears evil laughter from the background15:20
* TheJulia notes in our line of work, evil is good.15:20
iurygregory++15:21
JayFAnything non-philosophical for the rest of open discussion? hah15:21
JayFAight, I think that does it then.15:22
JayF#endmeeting15:22
opendevmeetMeeting ended Mon Jun 26 15:22:07 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)15:22
opendevmeetMinutes:        https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-06-26-15.01.html15:22
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-06-26-15.01.txt15:22
opendevmeetLog:            https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-06-26-15.01.log.html15:22
johnthetubaguyJayF: as a heads up, I got the first two shard related nova patches up, in a very draft form: https://review.opendev.org/c/openstack/nova/+/88698015:25
JayFcool, I'll log that in https://etherpad.opendev.org/p/IronicWorkstreams2023.215:25
TheJuliacool15:25
JayFexciting but it also means I gotta get on with helping on the documentationing soon15:25
JayFso in the past ~4 working hours15:26
JayFunit tests are fixed15:26
JayFsharding is drafted15:26
JayFI'm going to go get some breakfast and lottery tickets :D 15:27
johnthetubaguy:)15:28
TheJulialol15:29
rpittauJayF: https://review.opendev.org/c/openstack/ironic/+/886911 needs some pep8 love :)15:29
opendevreviewJay Faulkner proposed openstack/ironic master: Skip tests that fail occassionally in CI  https://review.opendev.org/c/openstack/ironic/+/88691115:30
JayFI was running a locally cached copy of pep8 between my ears when I edited that on my windows box yesterday LOL15:31
opendevreviewVerification of a change to openstack/ironic master failed: CI: Change migrations timeout to be >60 seoncds  https://review.opendev.org/c/openstack/ironic/+/88698515:36
JayF> openstack-tox-py39 https://zuul.opendev.org/t/openstack/build/7d9d11500f1c43b39987388adc072c0e : TIMED_OUT in 41m 24s15:36
JayFoh no15:36
JayF2023-06-26 15:20:43.696512 | ubuntu-focal | {2} ironic.tests.unit.objects.test_volume_target.TestVolumeTargetObject.test_save [0.226305s] ... ok15:36
JayF2023-06-26 15:35:13.828453 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml@master]15:36
TheJulia.... wut15:37
rpittaummm we may need to set gentle to False as I wrote this morning15:37
* TheJulia wonders if it actually pulled with the parent15:37
JayF...wouldn't that be an incredibly severe bug if it didn't?15:38
TheJuliawell, I think it comes down to depends-on behavior15:39
TheJuliasince it was tagged, not rebased upon15:39
JayFit wasn't tagged depends-on15:40
JayFit was on top  of your change15:40
JayFrelationally15:40
TheJuliaIt didn't look like it earlier15:40
* TheJulia will be able to look again in a few minutes15:40
JayFhttps://review.opendev.org/c/openstack/ironic/+/886985 looking at relation chain in top right15:40
JayFI'm going to recheck because what the hell else am I going to do 15:40
JayFbut I'll be screaming in anguish as I type it it :{15:40
iurygregory=(15:43
iurygregorylay down, try to not cry, cry a lot15:43
TheJulia2f8ee2cf4..d9a000020  master -> master <-- that is what i would epxect15:43
TheJuliale sigh15:43
TheJuliaexpect15:44
JayFis it possible the timeout fixture itself is somehow causing an issue?15:44
JayFhttps://review.opendev.org/c/openstack/ironic/+/886196 will be the ... non-change test15:44
rpittauJayF: I can't exclude it 1005, but it would be really weird15:44
JayFbecause we're just enabling an additional test that would cause it to be more likely to fail if broken in this way15:45
rpittauI can override the value of gentle and see if we can get an actual failure15:46
rpittauthe problem is that now if a test goes in timeout we won't see it because the exception will be caught, with gentle=False it will just throw the exception and stop the execution15:46
JayFso the code I just rechecked is ~useless in it's current form?15:46
JayFyou wanna flip the bool in the change so it does the useful?15:46
rpittauehm... yeah :D15:46
JayFmy change to re-enable that test is landing smoothly15:47
JayFall jobs in gate and done except docs15:48
rpittauyep15:48
rpittauI'll do it in a separate patch, don't want to mix stuff where we are now15:48
JayFI wouldn't hate the mix because if it fails, we'll know why more clearly15:48
JayFand I'm here and will kick it back into the gate immedaitely15:48
rpittaualright then15:49
opendevreviewMerged openstack/ironic master: Revert "Disabling test_upgrade_twice temporarily for CI fix"  https://review.opendev.org/c/openstack/ironic/+/88619615:49
* TheJulia wonders if we have a different problem on postgres15:50
TheJuliaI was testing with just mysql locally15:50
TheJuliaI didn't have the same setup for postgres15:50
JayFwe can see, but rpittau's change with the gentle=False will shed more light on that15:52
JayFand in the meantime we seem to have at least closed the window some15:52
JayFI think the only 100% never-false-negative test would be something like def test(): return True15:52
JayFwe just keep making progress and getting it better15:52
opendevreviewRiccardo Pittau proposed openstack/ironic master: CI: Change migrations timeout to be >60 seoncds  https://review.opendev.org/c/openstack/ironic/+/88698515:57
rpittaushould've probably changed the commit msg :/15:57
JayFdo it in an edit real quick before I land it?15:57
opendevreviewRiccardo Pittau proposed openstack/ironic master: CI: Change migrations timeout to be >60 seoncds  https://review.opendev.org/c/openstack/ironic/+/88698515:58
opendevreviewJay Faulkner proposed openstack/ironic master: CI: Change migrations timeout to be >60 seconds  https://review.opendev.org/c/openstack/ironic/+/88698515:58
JayFI fixed the typo, too :D it's approved15:59
rpittau!15:59
JayFIt's OK, easier to spell when you only know one language when everyone else has to know 2+ :D 15:59
rpittaunot all typos :)16:01
JayFlol I fixed the one that I saw everytime the patchset bumped16:01
JayFc'mon, we know the truth as software devs: the first line of our commit message is likely the only parts that'll be read after today :D 16:01
opendevreviewRiccardo Pittau proposed openstack/ironic master: CI: Change migrations timeout to be >60 seoncds  https://review.opendev.org/c/openstack/ironic/+/88698516:02
rpittauok done16:02
rpittauthere was an error in the new variable, it would've caused issues16:02
JayFhttps://zuul.opendev.org/t/openstack/build/6dd34d94f5c64cfca0b8eedec2de4e3b16:06
JayF2023-06-26 15:41:55.457564 | ubuntu-jammy | {4} ironic.tests.unit.test_base.DontBlockExecuteTestCase.test_no_exception_raised_for_execute [0.040523s] ... ok16:06
JayF2023-06-26 16:02:49.795961 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml@master]16:06
JayFWHY16:06
JayFTheJulia: rpittau: ^ gate check for my patch failed on unit tests with driver libs16:06
JayFfroze up for 20 minutes16:06
rpittauoO16:10
iurygregoryO.O16:11
JayFhopefuly rpittau's patch will land and shed light on WTF16:15
JayFwhoa16:30
JayFThe following tests exited without returning a status16:30
JayFand likely segfaulted or crashed Python:16:30
JayF * ironic.tests.unit.drivers.modules.irmc.test_boot.IRMCVirtualMediaBootTestCase.test_clean_up_instance_with_secure_boot16:30
JayF * ironic.tests.unit.drivers.modules.irmc.test_boot.IRMCVirtualMediaBootTestCase.test_prepare_instance_with_secure_booty16:30
JayFthat's on rpittau's change16:30
JayFand is /also/ in the with-drivers category16:30
JayFI bet we have a bad test in there, too16:30
TheJuliaooooooh16:30
TheJuliaso!16:30
JayFhttps://zuul.opendev.org/t/openstack/build/cc73bf5c96064178af9de32686d31dd216:30
TheJuliaspeaking of! I believe vanou pushed an edit, I wonder if python-irmcclient got updated16:31
JayFwell nevermind ... this is hung https://zuul.opendev.org/t/openstack/stream/958243d5237249549fdaa121425604d5?logfile=console.log 16:31
TheJuliaspecifically to virtual media16:31
JayFand the migrations timeout doesn't look like it's firing16:31
JayFso we still have the hang, and we have another issue now, too? 16:31
TheJuliadunno, I just know i wasn't able to reproduce it on friday16:31
TheJuliaat least, with the patch when I could before16:31
JayFwell, we got a *lot* of passes, too16:33
JayFI think at least 6/7 runs without seeing the timeout16:33
JayFwhich might just mean we flipped the coin 7 times and got heads every time16:33
TheJuliaindeedly possible16:35
TheJuliaI was basically failing every run locally before, I think I had one pass before the patch on friday16:35
TheJuliaon that specific machine16:35
JayFwas there any other change you had dangling in that checkout that could've made a difference?16:37
JayFthe object layer delete piece, maybe?16:37
JayFmy original proposed fix that proved to at least be not enough16:37
TheJuliaI was rebasing the hold steps stuff16:38
TheJuliaand then took a clean copy, reproduced, fix there and then fixed on my local branch and had both passing16:38
JayFdid the clean copy ever fail?16:38
TheJuliayes16:38
JayFdamn16:38
JayFthere is something in play here we still haven't sussed out then, yeah16:39
JayFor adding int(node[id]) in those two places somehow made it worse, not better16:39
TheJulianode.id16:39
TheJuliathey aren't json dicts16:39
JayFsure, I'm just opening the code now so was going by memory16:39
TheJuliaahh, ok16:39
JayFI'm going to reread that file of tests again16:40
JayFand see if there's anything I can figure out16:40
JayFtempted to push a patch with nothing but enabling sqla debug logging16:42
JayFif I did that would probably need to limit the tests to only test_migrations.py16:42
JayFnot even sure that'd tell us what we need though16:42
JayFsomething weird about rpittau's change16:44
JayFwe're calling self.useFixture on an object class16:44
JayFeven though at usage time, it's mixed into classes that have that attr16:44
JayFI don't think that's a potential cause, but saying it out loud16:45
JayFwe do the same in WalkVersionsMixin since forever16:47
rpittauJayF: all the tests call self.useFixture because they're based on oslo test tests class, we're just overriding it16:53
JayFwell, they aren't in terms of actual inheretance16:53
JayFmy use of typed python in personal projects is showing here 16:54
JayFwe're instantiating it as an object with intent to be mixed into an oslo test class16:54
JayFit all works, was just something I noticed and when I saw we already had it (in WalkVersionsMixin) I discarded it as a "new" breakage16:54
JayFI am curious about _check_487deb87cc9d16:55
JayFit's weird to see patterns of using the engine in db_utils.get_table(engine, 'x') 16:55
JayFthen a few lines down we are using with engine.begin() as connection16:55
JayFand get_table looks like the kind of sqla magic we've been trying to get rid of16:56
rpittaugoing offline now, see ya tomorrow! o/16:56
JayFo/16:57
JayFyeah that's probably fine, reading the sqla code16:59
JayFTheJulia: would it be a confirmation this bug is kinda what we think it is, if we put the autocommit back in and it started passing again?17:09
TheJuliapossibly, got a link?17:09
JayFI can make that patch if you think it's worth a shot17:10
TheJuliak, on a call at the moment17:11
opendevreviewJay Faulkner proposed openstack/ironic master: DNM: Revert "Remove autocommit, again."  https://review.opendev.org/c/openstack/ironic/+/88699517:12
JayFoooh I found a thing, I think17:15
JayFhttps://opendev.org/openstack/ironic/src/branch/master/ironic/tests/unit/db/sqlalchemy/test_migrations.py#L54417:16
JayFI wonder if we need to str() that row['network_interface']17:16
JayFor really just move the asserts up a level is a smaller change17:20
* JayF puts that change in his local branch and keeps looking17:21
JayFfound another case similar to that17:22
TheJuliayou can access items as a dict or as a object17:33
TheJuliaat least, dependin gon pattern17:33
opendevreviewJay Faulkner proposed openstack/ironic master: Ensure all sqla objects descoped before ending txn  https://review.opendev.org/c/openstack/ironic/+/88700017:33
JayFso if it's accessed as a dict, it's likely safe, accessed like an object, it's likely not17:33
JayFso I think that makes ^^^ a noop then17:34
JayFgoing to let CI finish just to see but yeah17:34
TheJuliayeah, I think that is right17:34
TheJuliaso 544 is not doing anything that would be returned outside and keep a handler open17:35
TheJuliaat least, the way I'm interpretting it17:35
JayFack17:35
JayFthen I'm basically out of ideas17:35
JayFother than a ton of "print statement debugging" in the gate17:35
JayFwhich might heisenbug us17:36
NobodyCamGood Morning Ironic folks17:40
JayFyou're half right17:41
JayFwell little more than half17:41
JayFit is morning UGT17:41
JayFand we are ironic folks17:41
JayFlol17:41
NobodyCamLoL17:41
NobodyCamhehehee17:42
NobodyCamhey hey JayF 17:42
JayFgnomes stole my happiness shortly after the Ironic meeting, and I'm hunting for it again :P 17:42
NobodyCamhappen to know off the top of your head if we emit a message when entering / exiting rescue mode?17:44
JayFwe emit an oslo.notification for sure17:45
JayFusually that means a log too, or you can certainly configure notifications to log17:46
JayFbut I'd have to check the code to be sure17:46
opendevreviewJay Faulkner proposed openstack/ironic master: Ensure all sqla objects descoped before ending txn  https://review.opendev.org/c/openstack/ironic/+/88700017:47
JayFwell, whether we think so or not, this one passed17:47
JayFfixed the pep8 and will try it again17:47
JayFinteresting17:48
JayFon the autocommit-enabling CI job17:49
JayF2023-06-26 17:27:38.962571 | ubuntu-jammy |  * ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsMySQL.test_upgrade_and_version and ironic.tests.unit.db.sqlalchemy.test_migrations.ModelsMigrationsSyncMysql.test_models_sync bailed17:49
JayFso I guess we know it's still in mysql, and it made the migrations timoue work better(?) to have autocommit on(?)17:49
TheJuliauuggggghhhhh17:54
JayFhmm test_models_sync17:55
JayFI really wanna put the object deleter back in17:55
* JayF goes and reads it again17:55
JayFhmm test_models_sync adds a cleanup to drop all objects18:00
JayFwe are dangling a DB connection somewhere18:00
JayFTheJulia: I'm pretty sure I'm seeing that even the connection.execute() that returns a dict-like object is still a sqlalchemy dict cursor18:04
TheJuliathe result set class?18:05
TheJuliadepends on how you ask for the entry, if memory serves18:05
JayFno, a ResultProxy18:05
TheJuliaahhhhhhhhh18:05
JayFhttps://docs.sqlalchemy.org/en/13/core/connections.html#sqlalchemy.engine.Connection.execute18:05
TheJuliayeah, the resultproxy is evil18:05
JayFthat is 13, I should check 1418:05
TheJulia1.4 depends on style of invocation18:05
JayFa CursorResult 18:05
TheJuliain 2.0, we want tuples18:05
JayFheh18:06
TheJuliaORMey things generally get a cursor result, if you ask directly, you get the result tuple set of data18:06
JayFit depends on /version of sqla at invocation/ if I'm reading this right18:06
TheJuliacorrect18:06
JayFif in a 1.x context; it returns the CursorResult which is a result proxy, yeah18:06
JayFand my patch is about to succeed in CI again18:07
JayF> This object returns rows as LegacyRow objects, which maintains Python mapping (i.e. dictionary) like behaviors upon the object itself. Going forward, the Row._mapping attribute should be used for dictionary behaviors.18:09
JayFI think my patch is the real fix, and I think this fits the whole story we've been telling the whole time18:09
JayFI just can't answer the question of why this passed for so long18:09
* TheJulia glares at workday18:10
TheJuliaJayF: I think it was failing every so often18:10
TheJuliajust as you suggested, we passed a threshold18:10
TheJuliaat some point, all of the issues colided and there was extreme sadness18:10
JayFI still have extreme sadness 18:10
JayFlol18:11
JayFironic-cross-sushy seems to be frozen now18:12
JayFin my "hey I fixed it, see my explanation" PR18:12
JayFhttps://zuul.opendev.org/t/openstack/stream/fc7c761fb23f4c148edd36ddbf83320c?logfile=console.log18:13
JayFfungi: if I have a job that's currently running right now18:16
JayFfungi: can you hold the node?18:16
JayFfungi: https://zuul.opendev.org/t/openstack/stream/fc7c761fb23f4c148edd36ddbf83320c?logfile=console.log I'd LOVE to get ahold of this one18:16
clarkbI can do it. I'm not sure fungi is around18:16
clarkbcan you share the info we need to hold it, project, change and job name?18:16
JayFthis is openstack/ironic ironic-cross-sushy18:17
JayFhttps://review.opendev.org/c/openstack/ironic/+/887000 this change18:17
TheJuliale sigh18:18
clarkbok hold is in place18:18
JayFthank you18:18
TheJuliawhew, and it just timed out it looks like18:19
JayFit didn't catch it, I think18:20
JayFprobably because it TIMED_OUT?18:20
clarkbor I was too slow18:20
TheJuliaugh18:20
clarkbI think I caught it18:20
clarkbI can ssh into the node at least. Maybe it is trying to delte and hasn't yet18:21
JayFyeah I see it in https://zuul.opendev.org/t/openstack/nodes as deleting18:21
JayFat least the one I think it is lol18:21
clarkb0034460245 is the node its held according ot nodepool18:21
clarkbhave an ssh pubkey I can add to the node?18:21
JayFssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILR/OLTS/VWHzE3vgBFCaTNBg2+MRCENOmDr9oEMxzhZ jay@jvf.cc18:22
clarkbJayF: root@158.69.66.8818:23
JayFthis is way more interesting than yours, I think18:25
JayFhttps://gist.github.com/jayofdoom/d0f38dce541ed4f2600e6d5d5cc34ace18:26
JayFlooks like we're trying to select from conductors while also trying to alter conductors and allocations table18:26
JayFit's like two tests are stepping on each other or something18:26
JayFI can confirm two separate python processes are accessing and deadlocked on the same DB18:28
JayFbut I'm not sure if that's meaningful, but there's absolutely multiple processes here talking to the same DB (and I don't mean DB host; I mean *the same DB string*)18:28
TheJuliaso here is one for you18:29
TheJuliaI consistently had 8 hanging connections per db last week18:29
TheJuliaso we're orphaning stuff someplace18:29
TheJuliastill18:29
JayFbut like, is it ever sane that multiple different python processes would be touching the DB?18:29
JayFat the same time?18:30
JayFshouldn't each test run in a single process, and each test have it's own BD?18:30
TheJuliaoh, still the same, but two different dbs at once18:30
JayF**DB18:30
TheJuliaeach runner gets its own db18:30
JayFyeah this is the problem18:30
TheJuliaAIUI18:30
JayFlook at my output18:30
JayFat the source port18:30
JayFthe source port via netstat on the host maps back to two different python pids18:30
TheJuliayeah18:30
JayFthose pids are running different testing load lists18:32
TheJuliayup18:32
JayF+ironic.tests.unit.db.sqlalchemy.test_migrations.ModelsMigrationsSyncMysql.test_models_sync # scheduled to one18:32
JayFhttps://gist.github.com/jayofdoom/d0f38dce541ed4f2600e6d5d5cc34ace?permalink_comment_id=4611051#gistcomment-461105118:33
JayFso somehow they ended up running these tests against the same db at the same time18:33
JayFwhich is exceedingly wrong18:33
clarkbthis looks similar to the problem we hit in zuul's db migrations where we had nested transactions. The outer one would hold a lock for some table and then the inner one couldn't update it18:34
clarkbthe fix was to flatten the transactions. But this was only a problem with sqla 2.0 not 1.418:34
JayFbut in this case, clarkb, we have two different python test runners talking to *the same DB on the same DB host*18:34
JayFso the deadlock is coming from that conflict, not from a single test if I understand the flow correctly18:34
JayFit's very possible I don't understand the flow correctly18:35
JayFbut until I hear otherwise, or convince myself, I'm working under the assumption that two different python PIDs, running different sets of tests, connecting to the same DB on the same DB host is a big nono18:35
JayFI gotta find where those test names are created18:36
JayFer, db names18:36
clarkbwhere do you see multiple processes talking to the same db?18:36
TheJulia+1 to clarkb's question18:36
JayFso look at https://gist.github.com/jayofdoom/d0f38dce541ed4f2600e6d5d5cc34ace18:36
clarkbthere are multiple tcp connections but it may be from the same process18:36
JayFthen I feed that into18:36
JayFnetstat -nptu18:36
JayFgrep out the source port and map it back to two different python pids18:36
clarkbnote there are two dbs in the db listing too18:37
JayFyes, I know18:37
JayFthank you for asking for reciepts18:38
JayFhelped me realize I tranposed the port number18:38
JayFboth those connections to setsuipeap that are deadlocked *are* from the same process18:38
JayFI just fat fingered a number/copy+pasted one row off, something like that18:38
TheJuliac18:40
TheJuliaerr18:40
TheJuliac18:40
TheJuliawow I cannot type this morning18:40
TheJuliac'est la vie18:40
JayFI wonder if like, oslotest started doing cleanups async or something18:43
JayFor unittest18:43
JayFadd_allocations_table seems to be where it blew up18:48
JayFoh18:48
JayFOH!!!!18:48
JayFTheJulia: ^^^18:48
JayFthis is your fix18:48
JayFthe sawarning18:49
JayFwe should see if that is exhibiting this same stuff if rebased on master18:49
JayFthis is frozen trying to remove the fk that the sawarning is complaining about18:49
JayFthe case I'm looking at rn18:50
TheJuliawaiting, it is blocked atm18:50
JayFblocked on what? CI?18:50
TheJuliaone of the other threads18:50
TheJulias/threads/connections/18:50
TheJuliaso the ones in Sleep() state are still open, they need to close out18:50
TheJuliaso I think the answer is we need to find any and all open places18:51
JayFI have an error in the log saying it can't drop conductors table18:51
JayFbceause the fk to allocations still exists18:51
JayFdeep in the output of the frozen sushy test18:51
JayFwhich is what sent me down this path18:51
TheJuliaThat is just bizzar18:51
JayFthis is the random piece we've been looking for too -- the sawarning says it can't sort tables 18:51
TheJulia... wonder if it "can't sort"... "to delete"18:52
JayFthat's sorta where I'm at18:52
JayFwe might need to stack your sawarning change with my chnge https://review.opendev.org/c/openstack/ironic/+/88700018:52
JayFTheJulia: maybe approve ^, see if we can get it in, then it'll make landing the sawarning fix easier18:55
JayFit's a clear improvement even if it's not the only fix18:55
JayFalternatively I can merge those two changes and try to get a clean-clean run18:55
TheJuliadone18:55
JayFmerge as in, make them a single change18:55
TheJulia+a'ed18:56
JayFTheJulia: I wonder if it's like... metadata.drop_all() fails if there are any capital-C-Connections, not if there are any connections to the db18:57
JayFso the sleepy empty connections are probably OK; the outstanding queries would be a capital-C-Connection18:57
JayFfrom sqlalchemy perspective18:57
TheJuliaEh, they really need to get clsoed out19:00
JayFbut aren't we using a connection pool in the engine?19:01
TheJuliayeah, but we can't do things if they have open result sets19:01
TheJuliaand we can't see that19:01
JayFhmm19:01
JayFTrying to think of a case where you could have `show full processlist;` be nothing but sleepy conns and things broken19:06
JayFand then I realized, I don't know what things look like if I'm mid txn19:06
JayFif this path doesn't work, I might science seeing what > begin(); insert stuff; # shows up like on the full processlist19:06
JayFaight, going to feed the cats and myself (in that order per their demand), and see how much red is on the zuul status panel when I return19:07
opendevreviewMerged openstack/ironic master: Ensure all sqla objects descoped before ending txn  https://review.opendev.org/c/openstack/ironic/+/88700019:13
JayFheeeey19:15
TheJuliahuh?19:16
JayFit passed the gate19:16
JayFthat was nice19:16
JayFjust a celebratory heyyy19:16
JayFhttps://review.opendev.org/c/openstack/ironic/+/886871 has already failed19:16
JayFif I recheck, it'll rebase to master yeah?19:16
JayFI would usually just hit the button but don't want to break your chain19:17
TheJuliai guess you could19:17
TheJuliahttps://review.opendev.org/c/openstack/ironic/+/885797 is at least mostly unrelated19:18
JayFhttps://862c1d9848efa4012e00-65cf29ccd095bbc5984e6f01063fc964.ssl.cf1.rackcdn.com/886871/2/check/openstack-tox-py38/27a357f/job-output.txt has what looks like real failures19:18
JayFunrealted to the stuff I just fixed, maybe19:18
JayF2023-06-26 18:58:06.634120 is the string to search19:18
* JayF off to pickup his lunch19:18
clarkbyes zuul always merges against the target branch19:19
TheJuliaall with postgres19:20
TheJuliaoh noes19:25
TheJuliait looks like that breaks on postgres19:25
opendevreviewJulia Kreger proposed openstack/ironic master: Handle SAWarning around allocations FK Constratins  https://review.opendev.org/c/openstack/ironic/+/88687119:30
JayFthat's why I pointed it out19:37
JayFTheJulia: pep8 unhappy on that change, you want me to fix it or you got it?19:47
TheJuliai'll fix, i forgot since I had to jump on a call19:47
JayFack; np19:48
opendevreviewJulia Kreger proposed openstack/ironic master: Handle SAWarning around allocations FK Constratins  https://review.opendev.org/c/openstack/ironic/+/88687119:50
JayFit timed out20:06
JayFand the timeout worked on hte migrations test20:06
JayFwtf20:06
JayFlike, zuul status FAILURE20:06
JayFtimed out via the fixture20:06
JayF2023-06-26 20:02:51.941686 in https://671fe52e0206ebfe4cbb-1ebf823a6670729b8775b6721553bbcb.ssl.cf5.rackcdn.com/886871/4/check/openstack-tox-py38/01a1ba0/job-output.txt20:07
JayFMIGRATIONS_TIMEOUT just too low?20:07
JayFand py39 / py310  have both been hung now for 8 minuts20:09
JayF*minutes20:09
JayFback to the drawing board20:10
JayFironic-cross-sushy hung too20:14
JayFdid we somehow make it more reproducable? wtf20:14
JayFgoing to try to locally repro again20:16
JayFjust because if I can get it to repro, I might be able to debug it or something20:16
* JayF wonders if the hang being more reliable could be beacuse of the nature of that change20:41
JayFI'm going to recheck on a less invasive change20:41
TheJuliak20:41
TheJuliaoff calls, need to run to the post office real quick, I think mysql and postgres are next for me20:42
JayFI even am running same sql version20:46
TheJuliaare you executing the "ye olde youtube video in HD in the background" while it is running?20:48
JayFyou saying I need to upgrade my gentoo while I run tests?20:48
JayFis that what you are suggesting?20:49
TheJuliaI always have a youtube video running late in the day, and some of the unit test race issues we've seen in the past, have been reproduced under host load20:49
* JayF emerges the world20:50
JayFlucky gentoo comes with a built-in cpu cycle tester: the package manager20:51
JayFlol20:51
JayFTheJulia: is it possible that something like self.assertX() is holding onto a handle?20:58
JayFTheJulia: like maybe in some kind of results collector or something20:58
JayFso even in-scope for asserts we need to dupe the object?20:58
TheJulia... I wouldn't think so20:58
JayFme neither, but that's implemented in C stdlib so I can't go read the code20:58
JayFwell, I have access to the code, I can't read it lol20:58
JayFyeah doesn't look like it21:01
JayFand it's python https://github.com/python/cpython/blob/3.8/Lib/unittest/case.py#L76121:01
* TheJulia watches the slowest download ever21:05
JayFTheJulia: was pondering if the `if engine.dialect.name == 'mysql'` needed to include sqlite, I talked myself out of it21:05
JayFbecause we don't support migrations on sqlite generally, yeah21:05
JayFand I don't think we test against that either21:05
JayFso it's not really likely to be an issue21:05
TheJuliayeah, we've some some cases where advanced ops behave in not fun ways with differnces between postgres and mysql21:06
TheJuliaand *really* the only case we're goin gto break is on the teardown... really21:06
JayFTheJulia: it'd be interesting science to see if your change worked with postgres version of these tests commented out; I suspect it won't though21:07
* JayF is onboard to drop psql support we find it's the culprit21:07
JayFjust wonder about if it counts that it was deprecated forever and ever ago21:07
JayFI have (former racker/ironicer) clif on a zoom with me now21:09
JayFwe're going to rubber duck at this21:09
TheJuliahttps://671fe52e0206ebfe4cbb-1ebf823a6670729b8775b6721553bbcb.ssl.cf5.rackcdn.com/886871/4/check/openstack-tox-py38/01a1ba0/testr_results.html21:10
TheJuliaso... that is odd....21:10
JayFthat is what I was looking at and commenting on earlier21:18
TheJuliahmm21:26
TheJuliadifferent errors \o/21:26
opendevreviewJulia Kreger proposed openstack/ironic master: Disable WAL Pragma for Unit Testing  https://review.opendev.org/c/openstack/ironic/+/88686521:29
opendevreviewJulia Kreger proposed openstack/ironic master: Handle SAWarning around allocations FK Constratins  https://review.opendev.org/c/openstack/ironic/+/88687121:29
opendevreviewJay Faulkner proposed openstack/ironic master: Fully ensure counts are out of scope of cxtmgr  https://review.opendev.org/c/openstack/ironic/+/88700821:29
JayFTheJulia: ^^^^21:29
JayFTheJulia: clif helped me quack at it21:29
JayFTheJulia: I moved it in but didn't change where it was declared21:30
JayFthat is 100% another instance of that, not sure it's the last one, but it certainly *is* one21:30
TheJuliaso I restacked my changes, they cleaned up failure errors in the unit test runs for me locally, fwiw21:30
JayFnice21:30
JayFwdyt about that I just pushed21:30
JayFit's 100% an issue21:30
JayFwell, it's 100% a case of the thing I fixed before21:30
JayFasusming those things are an issue which I'm fairly certain of21:31
TheJuliacould do it, this feels like a thousand cuts21:31
JayFI counted, it's 9001 cuts exactly21:31
TheJuliaheh21:32
JayFTheJulia: so I just had an epiphany -- we couldn't repro this in the job over the weekend b/c the clouds were underused21:46
JayFTheJulia: so it was too fast to trigger the race21:47
JayFwhich is why the 'fix' seemed so perfect compared to now21:47
TheJuliaheh21:48
JayFnow my change is passed everything but tox-py38 and tox-docs21:48
JayFwhich just are queued and have been for a long time21:48
JayFhash-tag better than a db hang21:48
iurygregoryI'm trying to catch up the whole conversation, tl;dr is that we would need this 3 changes to have unit tests happy?21:54
JayFtl;dr we know nothing21:54
JayFWe hope https://review.opendev.org/c/openstack/ironic/+/886871 https://review.opendev.org/c/openstack/ironic/+/887008 will help21:54
JayFall hope no knowledge21:54
JayFlol21:54
iurygregoryJayF, perfect!21:54
iurygregoryI'm wondering about the drop of constraints in allocation, would we need to backport that? 21:55
JayFI really don't know21:56
JayFI'm laser focused on green tests21:57
TheJuliaI wouldn't worry about backporting21:57
TheJuliaunless someone tries to pull in sqlalchemy 2.0 before we officially support it21:58
TheJuliaooooh ahh, i postgres tests running locally21:58
JayFclarkb: you can unhold that node for me, I squeezed all useful data outta it22:01
JayFclarkb: if you want, you can also remove the other autohold with my name on it for now22:03
JayFpass incoming for https://review.opendev.org/c/openstack/ironic/+/887008 if TheJulia iurygregory want to land it22:12
JayFI see a +2, going to workflow it myself in a few minutes if nobody else does22:12
JayFall voting jobs green on Julia's change too, we might be on the path22:16
* JayF landed that as a trivial change/ci change with one core approval22:19
TheJuliaso, https://review.opendev.org/c/openstack/ironic/+/886865 also sort of fixes a bug, without it I get a nice error about the migration test trying to run that with mysql22:21
TheJuliawhich of course, detonates22:22
TheJuliathat is when I explicitly ask for them to run22:22
JayFyeah I got it too22:22
JayFbut that one is failing tests right now22:22
JayFin the check queue22:22
TheJuliagah!!!!22:27
opendevreviewMerged openstack/ironic master: Fully ensure counts are out of scope of cxtmgr  https://review.opendev.org/c/openstack/ironic/+/88700822:31
JayFwell, it passed multiple times in a row22:31
JayFthat's a good sign22:31
JayFthat change never once failed unit tests in the gate22:31
JayF*knock on wood*22:31
JayFgoing to rebase and recheck my "uncomment the super-breaky-test"22:32
JayFoh, that landed22:32
JayFso we passed that too! awesome22:32
TheJuliaoh, so here is a interesting aspect22:32
opendevreviewJay Faulkner proposed openstack/ironic master: Skip tests that fail occassionally in CI  https://review.opendev.org/c/openstack/ironic/+/88691122:33
TheJulialocally, the migration test takes a fraction of the time it does in ci22:33
iurygregoryI've +2 https://review.opendev.org/c/openstack/ironic/+/885797/22:33
JayFlanding it22:33
TheJuliaI guess so many direct io syncs out, that in a test VM it grinds and my ye olde code farm desktop, it is just nowhere near as fast22:33
JayFgood call out22:33
JayFyep22:33
iurygregoryTheJulia, this would be expected *I think* CI we won't have a lot of resources22:33
TheJuliabut ci right now test_walk_version is 44 seconds22:34
JayFClif's first question when he started rubber ducking this with me was "are the CI nodes well resourced?"22:34
iurygregoryhummm22:34
TheJuliayeah, just an example22:34
JayFafter my three hours of laughing we started troubleshooting22:34
JayFwe are our own noisy neighbor22:34
iurygregory*well resourced*22:34
iurygregory:D22:34
JayFthe first devoppy thing I ever did was setup a full LAMP stack in a 32mb linode22:34
JayFnow I can't even test software in eight gigabytes :P22:35
TheJuliaahh, but that lamp stack took 26 mb and used swap ;)22:35
JayFI actually had it not swapping22:36
JayFbut DB caches were near-universally disabled so it killed I/O anyway22:36
JayFlinode back in those days would actually monitor swap usage and rate limit it at the hypervisor level22:36
JayFso if you had a regularly swapping VM, eventually it'd run out of rate limits and I/O would slow to a crawl22:36
JayFpretty cool thing TBH; but those were user mode linux VMs, and were very slow :P22:36
JayFI sure hope 886865,2 didn't have my fix in it (I don't think it did), it has a TIMED OUT unit test22:44
JayFTheJulia: with my fix > 2023-06-26 22:43:51.725855 | ubuntu-focal | {6} ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsMySQL.test_walk_versions [17.414814s] ... ok22:50
JayFthat's on py38, assumedly the worst case22:50
JayF33.96s on py31022:51
JayFmaybe it is just more environmental than anything else22:51
JayFstill not as bad as 44s22:51
JayFhttps://review.opendev.org/c/openstack/ironic/+/886911 passed check, to disable the other flakey tests22:52
iurygregoryack22:55
* JayF ==> EOD22:59
clarkbJayF: done I've aksed zuul to clean them up23:18
NobodyCamcrazy question. anyone know if meta-data service (169.254.169.254) should work in rescue mode? I have it working in normal deployed image,  but not in rescue mode23:29
TheJulia... I guess it mgiht23:45
TheJuliawellit should23:45
TheJuliawell23:45
TheJuliaIt depends on *how* your configured I guess23:45
TheJuliashould it, likely not but there is no requirement there23:46

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!