Tuesday, 2023-06-06

07:11 <opendevreview> daniel.pawlik proposed openstack/ci-log-processing master: Change provided directory path in remove_old_dir function  https://review.opendev.org/c/openstack/ci-log-processing/+/885340
07:42 <opendevreview> Merged openstack/ci-log-processing master: Change provided directory path in remove_old_dir function  https://review.opendev.org/c/openstack/ci-log-processing/+/885340
15:16 <iurygregory> Hi all, anyone aware of py3 jobs reaching timed_out?
15:17 <iurygregory> I noticed it in some ironic patches, and they are failing randomly .-.
15:17 <clarkb> iurygregory: I always encourage people to open up their job logs and see where the time is going before anything else
15:17 <clarkb> the job logs should have good timestamp information to see where it is going
15:18 <clarkb> it could be pip's dep solver being slow, or unittests taking longer than expected, or dkms compilation of deps, etc etc
15:18 <fungi> or some dependency lacking prebuilt wheels for a recent release
15:22 <iurygregory> yeah, we are looking at the logs trying to understand what is happening, because it seems random (on some patches it fails py38, on others py39, on others py310)
15:24 <fungi> basically look at the timestamps and see if there are any large gaps in time
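
[Editor's note: a minimal sketch of the "look for large gaps" approach fungi and clarkb describe, assuming the Zuul job-output.txt lines begin with a "YYYY-MM-DD HH:MM:SS.ffffff" timestamp as in the excerpts pasted later in this log; the script name and 60-second threshold are made up for illustration.]

    # find_gaps.py - hypothetical helper: print adjacent log lines whose
    # timestamps are more than GAP_SECONDS apart, to spot where a job stalls.
    import sys
    from datetime import datetime

    GAP_SECONDS = 60  # arbitrary threshold for a "large gap"

    def parse_ts(line):
        # job-output.txt lines start like "2023-06-06 12:45:28.590470 | ..."
        try:
            return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")
        except ValueError:
            return None

    prev_ts, prev_line = None, None
    for line in open(sys.argv[1], errors="replace"):
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev_ts and (ts - prev_ts).total_seconds() > GAP_SECONDS:
            print(f"gap of {(ts - prev_ts).total_seconds():.0f}s:")
            print("  " + prev_line.rstrip())
            print("  " + line.rstrip())
        prev_ts, prev_line = ts, line

Run as "python3 find_gaps.py job-output.txt"; each pair of adjacent lines separated by more than a minute points at where the job was sitting idle.
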
15:24 <clarkb> also, py310 and newer are much quicker than older pythons, at least for zuul
15:24 <clarkb> py310/py311 are almost half the runtime of the py38 jobs
15:25 <clarkb> so there may be some tuning of timeout values that needs to happen if you see a similar spread
15:37 <rpittau> clarkb: the issue is happening at random points during the tests and with different python versions, py3.8 and py3.10
15:37 <rpittau> the tests just stop at one point and then the job reaches timeout after 40 minutes
15:37 <rpittau> for example here https://c7cc7615691360b259ca-b4831d08abece0047714b6befcdb357a.ssl.cf1.rackcdn.com/885276/2/check/openstack-tox-cover/8dfaf99/job-output.txt
15:39 <fungi> 2023-06-06 12:45:28.590470 | ubuntu-jammy | INFO  [alembic.runtime.migration] Running upgrade c0455649680c -> ac00b586ab95, Adds indexes to important and commonly matched columns.
15:39 <fungi> 2023-06-06 13:22:31.946089 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml@master]
15:39 <rpittau> yep
15:39 <fungi> so it looks like that one got stuck during a db migration?
15:39 <rpittau> it is not always the same place, for example here it's just after a random test https://c81b6237185e127ea93f-fe98a86d691c6300a22f04f47f232829.ssl.cf1.rackcdn.com/885276/2/check/openstack-tox-py38/889a3ab/job-output.txt
15:40 <fungi> any commonality in the node type? like using nested-virt or something?
15:41 <rpittau> for now they all look like they're from ovh
15:41 <rpittau> 2023-06-06 12:32:23.021846 | localhost | Provider: ovh-bhs1
15:42 <rpittau> heh nvm, I found a different one: Provider: inmotion-iad3
15:43 <fungi> stable branches, master, both? any idea when it started/got worse?
15:45 <rpittau> at the moment it looks like it's only on master, started today; we haven't run much today on stable branches
15:46 <clarkb> given that evidence I'd start looking at the database
15:46 <clarkb> in particular, newer sqla and databases are pickier about transactions and locks
15:46 <rpittau> clarkb: which evidence? :)
15:46 <clarkb> rpittau: that things get stuck performing database operations
15:46 <rpittau> that's just one
15:46 <clarkb> yes, I understand
15:46 <rpittau> the others fail at different points
15:46 <clarkb> but it's the one you've shared, so it's what I can work off of
15:47 <clarkb> it is entirely possible there are 30 different issues at play, but we have to start somewhere
15:47 <clarkb> and sqla locking and transactions are known to have changed
15:47 <clarkb> I personally debugged and fixed a database issue in the zuul unittests that resulted in similar symptoms: tests hanging because locks were held, preventing new schemas for additional tests from being created
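
[Editor's note: the class of fix clarkb describes usually comes down to guaranteeing every test rolls back and closes its database session and transaction so no lock survives into the next test's schema setup. A minimal sketch assuming SQLAlchemy and unittest; the class and attribute names are illustrative, not the actual zuul or ironic fixture code.]

    # db_base.py - illustrative only: a base test class that guarantees the
    # session/transaction opened for a test is rolled back and closed, so no
    # lock is left held when the next test tries to create its schema.
    import unittest

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker


    class DBTestCase(unittest.TestCase):
        def setUp(self):
            super().setUp()
            self.engine = create_engine("sqlite://")  # stand-in DB URL
            self.connection = self.engine.connect()
            self.trans = self.connection.begin()
            self.session = sessionmaker(bind=self.connection)()
            # Cleanup runs even if the test fails mid-transaction.
            self.addCleanup(self._cleanup_db)

        def _cleanup_db(self):
            self.session.close()
            if self.trans.is_active:
                self.trans.rollback()
            self.connection.close()
            self.engine.dispose()
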
15:48 <clarkb> basically, start with what we do know and work from there
15:49 <clarkb> I've also seen mariadb be flakier than mysql for whatever reason. Haven't been able to run that down, so zuul testing still uses mysql, not mariadb
15:50 <fungi> maybe it's time to resurrect drizzle
15:51 <rpittau> clarkb: from what I can see the database migration error is just one out of at least 4, and the others are different; I'm not sure that's the problem, and it seems the time taken by those tests hasn't changed
15:52 <clarkb> rpittau: ok, something is deadlocking and not making further progress. I don't know what. I'm just making suggestions based on what was shared. You will need to figure out where and why things are stopping in your unittests. Another approach may be to run things locally in a loop to see if you can reproduce. Or add additional debugging, etc etc
15:54 <clarkb> looking at https://c81b6237185e127ea93f-fe98a86d691c6300a22f04f47f232829.ssl.cf1.rackcdn.com/885276/2/check/openstack-tox-py38/889a3ab/tmpgr5suvjz worker 7 seems to run fewer tests than the other 7 workers. The tests should be balanced roughly equally, so that might be a clue (worker-7 is getting stuck and not completing)
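
[Editor's note: one quick way to eyeball the worker balance clarkb points at, assuming the captured stestr/subunit-trace output prefixes each test result with a "{N}" worker tag; the script and file format assumption are hypothetical.]

    # count_workers.py - hypothetical: tally tests per stestr worker from a
    # captured console log, to spot a worker that ran far fewer tests.
    import collections
    import re
    import sys

    counts = collections.Counter()
    for line in open(sys.argv[1], errors="replace"):
        # assumes lines like "{7} ironic.tests...test_name [0.123456s] ... ok"
        m = re.match(r"\{(\d+)\}\s", line)
        if m:
            counts[int(m.group(1))] += 1

    for worker, count in sorted(counts.items()):
        print(f"worker {worker}: {count} tests")
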
15:54 <clarkb> do your unittests run with an individual test timeout? That can help break things forcefully, giving you a clue as to where they are getting stuck
15:55 <fungi> rpittau: also, in an effort not to duplicate work already going on, you might take a look at the #openstack-qa logs from friday where dansmith brought up some similar concerns
15:55 <clarkb> I do not see a timeout in ironic/tests/base.py, so that may be the first thing to do
15:56 <clarkb> many projects do this for this very reason. If your tests lock up and the job dies, you get very little info. If you set a timeout 1-5 minutes long to forcefully break the test case, then you get a traceback of where things are stuck
15:56 <rpittau> fungi, clarkb, thanks for the advice, let's see if we can come up with something
15:58 <fungi> rpittau: the timeouts raised in the qa channel were related to tempest jobs, but if something has gotten systematically slower in openstack, the impact could extend to unit testing as well
15:59 <rpittau> fungi: yeah, I'm reading through it, I'll try to raise the issue there, thanks
16:00 <clarkb> 04eaebe701fa70bc3d483fe9281a2252c78ea14c says the oslo test base test classes may do the timeout for you
16:00 <clarkb> I would double check that assumption
16:02 <clarkb> ya, that assumption is bad: you have to set OS_TEST_TIMEOUT (which doesn't appear to be set in tox.ini), and if you don't then it falls back to DEFAULT_TIMEOUT, which is 0 unless you override it in your class, which means no timeout
16:03 <clarkb> so ya, update ironic/tests/base.py to set DEFAULT_TIMEOUT, or set OS_TEST_TIMEOUT in tox.ini to something like 60 or 120 seconds, and see what happens
16:03 <rpittau> would that value be for all the unit tests?
16:04 <clarkb> it's based on your test class tree
16:04 <rpittau> oh right :/
16:04 <rpittau> ok, thanks, I'll check that
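
[Editor's note: a minimal sketch of the per-test timeout discussed above, following the common oslotest-style pattern of reading OS_TEST_TIMEOUT and arming a fixtures.Timeout; the class layout is illustrative, not the actual ironic/tests/base.py. The tox.ini alternative clarkb mentions would be a setenv entry such as OS_TEST_TIMEOUT=120 under [testenv].]

    # base.py - illustrative sketch of the per-test timeout clarkb suggests:
    # honor OS_TEST_TIMEOUT (seconds) and fall back to a class-level default,
    # so a hung test raises a traceback instead of stalling the whole job.
    import os

    import fixtures
    import testtools


    class BaseTestCase(testtools.TestCase):
        DEFAULT_TIMEOUT = 120  # seconds; 0 would mean "no timeout"

        def setUp(self):
            super().setUp()
            try:
                timeout = int(os.environ.get("OS_TEST_TIMEOUT", 0))
            except ValueError:
                timeout = 0
            timeout = timeout or self.DEFAULT_TIMEOUT
            if timeout > 0:
                # gentle=True raises an exception in the test rather than
                # killing the whole process.
                self.useFixture(fixtures.Timeout(timeout, gentle=True))

With a timeout armed, a hung test fails with a timeout exception and a traceback showing where it was blocked, instead of the whole job ending in RESULT_TIMED_OUT.
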
