Wednesday, 2022-07-06

@chgans:matrix.org^^^ Looks like it's testing time for me! :) Will try to find time tomorrow. Thanks corvus for the fix!10:13
@westphahl:matrix.orgClark corvus : could you look again at https://review.opendev.org/c/zuul/zuul/+/848234 ?12:27
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 848132: zuul-web: return errors as JSON https://review.opendev.org/c/zuul/zuul/+/84813214:24
@fungicide:matrix.orgis anyone aware of nondeterminism in the circular deps tests? i think https://zuul.opendev.org/t/zuul/build/3cb2b13fbbd94e4aa9b2aa52e71e5c1d failed test_job_deduplication_failed_job because one of the builds it expected to succeed was aborted, but it's not clear to me why that happened. did it race the failure of a build for the change ahead of it? looks like that caused the running build to be cancelled15:27
-@gerrit:opendev.org- Jonathan Rosser proposed: [zuul/zuul-jobs] 848880: [DNM] test 848027 https://review.opendev.org/c/zuul/zuul-jobs/+/84888015:38
@clarkb:matrix.org> <@westphahl:matrix.org> Clark corvus : could you look again at https://review.opendev.org/c/zuul/zuul/+/848234 ?15:49
I think my last major questions is the set(None) case as pointed out by my comment there.
-@gerrit:opendev.org- Jonathan Rosser proposed wip: [zuul/zuul-jobs] 848880: [DNM] test 848027 https://review.opendev.org/c/zuul/zuul-jobs/+/84888015:49
@clarkb:matrix.org> <@westphahl:matrix.org> Clark corvus : could you look again at https://review.opendev.org/c/zuul/zuul/+/848234 ?15:49
* I think my last major question is the set(None) case as pointed out by my comment there.
@westphahl:matrix.orgClark: the second call to `getProjectBranches()` should never return None. This case will lead to an early return after the first call already15:51
@westphahl:matrix.orgsorry, that probably wasn't clear in my response to your comment15:51
@clarkb:matrix.orgI see so we only have to handle that situation on the first pass whcih was done15:52
@westphahl:matrix.orgyes, unless I'm missing something (which might be the case)15:54
@jim:acmegating.comfungi: that's new to me, but i suppose some new nondetermistic tests are likely to pop up with the new py versions.  it's not immediately apparent to me what the issue is.16:06
@clarkb:matrix.orgswest: I left a super minor nit but +2'd. I think you can probably self approve and decide if you want to followup or not for that minor item. (Just not approving so that you have a chance to see the note)16:07
@westphahl:matrix.orgClark: replied to your comment16:10
@clarkb:matrix.orgOh right None being overloaded. I think you are correct16:11
@clarkb:matrix.orgApproved thanks16:12
@jim:acmegating.comClark: while you're at it, can you look at the followup test for that? https://review.opendev.org/84852416:13
@clarkb:matrix.orgwill do16:15
-@gerrit:opendev.org- Jonathan Rosser proposed wip: [zuul/zuul-jobs] 848880: [DNM] test 848027 and 848883 https://review.opendev.org/c/zuul/zuul-jobs/+/84888016:40
@fungicide:matrix.orghttps://zuul.opendev.org/t/zuul/build/b08bfd5c47a14cbeafc1d85966d294bc is another identical failure in test_job_deduplication_failed_job17:31
@jim:acmegating.comguess it's time to dig into it :)17:32
@fungicide:matrix.orgstarting to wonder if it's become suddenly persistent17:32
@fungicide:matrix.orgnow i'm curious how we were able to add the py310 job successfully17:32
@fungicide:matrix.orglooking through the build history, i think it may have a 50% success rate. checking some failures for other changes now to see if they're failing the same way17:34
@fungicide:matrix.orgyeah, https://zuul.opendev.org/t/zuul/build/a8ef2a83bc614a19895b7e3e2ac61eec is for a different change and also failed only test_job_deduplication_failed_job though the synthetic build outcomes were slightly different (two aborts instead of one)17:37
@fungicide:matrix.orgis there a way to enforce build completion order in a test?17:37
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 848234: Fix read-only branches error in zuul-web https://review.opendev.org/c/zuul/zuul/+/84823417:38
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 848524: Add branch cache fetch failure tests https://review.opendev.org/c/zuul/zuul/+/84852417:38
@fungicide:matrix.orgthe commonality seems to be that at least one build is getting aborted (sometimes more) due to failure of the change ahead17:38
@fungicide:matrix.orgwhich i think is the correct behavior if the first change has a failure result for a build before all the builds for the second change complete17:39
@fungicide:matrix.orgbut since this also involves circular dependencies and deduplication of an extra result, i'm not entirely sure17:40
@fungicide:matrix.orgis the deduplication behavior itself deterministic as to which result it "keeps" and which it "deduplicates"?17:41
@jim:acmegating.comwe can enforce build order by holding jobs in build and then releasing them selectively; but yeah, i don't know if that's what we need to do here or if it's something else we need to accomodate in the test.17:46
@jim:acmegating.comthe build it uses should just be the for the first item in the pipeline17:47
@fungicide:matrix.orgso in that case we shouldn't see an abort bleeding through from the same job in second change?17:48
@fungicide:matrix.orgi'll read back through the log again, since i'm probably misinterpreting what it's trying to tell me17:49
@fungicide:matrix.orgah, okay, common-job is the one common to both changes and is getting deduplicated with a failure result consistently across all the examples i've looked at17:52
@fungicide:matrix.orgbut project1-job and project2-job seem to randomly report either success or aborted17:53
@fungicide:matrix.orgalso the build history is being returned in a seemingly random order, while the test assumes a fixed order17:58
@fungicide:matrix.orgoh, never mind that second bit. the assert does specify ordered=False so i guess that's irrelevant17:58
@jim:acmegating.comoh... i think i may be starting to understand18:05
@jim:acmegating.comi think the race is between whether the project*-job jobs finish before common-job fails.  if they finish, then we get the success result.  if they aren't done yet, then the fact that change #2 is complete and part of the bundle but failed means that zuul decides to abort jobs on change #1 18:07
@jim:acmegating.comin other words, we either need to make sure the project jobs finish before the common job fails, or we need to not care whether the result is aborted or not18:07
@jim:acmegating.comfungi: i have a fix; i'll push it up in a second18:14
@fungicide:matrix.orgoh, yes that makes sense, okay18:15
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 848887: Fix race in test_job_deduplication_failed_job https://review.opendev.org/c/zuul/zuul/+/84888718:24
@jim:acmegating.comfungi: ^18:24
@clarkb:matrix.orgcorvus: I haven't looked at that too closely but shouldn't zuul handle that without strict ordering in the test so that rael world usage functions?18:29
@jim:acmegating.comClark: i think real world usage is we don't care18:30
@clarkb:matrix.orgI guess in the real world an abort is maybe fine when a job fails like that but we're trying to assert stable behavior so need it to be deterministic?18:30
@jim:acmegating.comyep exactly that18:30
@clarkb:matrix.orgya ok that makes sense. I left a note about something else inline too18:31
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 848887: Fix race in test_job_deduplication_failed_job https://review.opendev.org/c/zuul/zuul/+/84888718:32
@jim:acmegating.comthanks fixed :)18:32
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 848890: Handle non default loopvars in Ansible callback stream plugin https://review.opendev.org/c/zuul/zuul/+/84889018:51
@clarkb:matrix.orgSo we don't forget ^ but that lacks a test currently18:51
@clarkb:matrix.orgI'm not sure I'll get to adding a test today as I feel like I have a million other things to do first18:51
@clarkb:matrix.orgAt the very least that should check we don't regress?18:51
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul] 848014: Report timing info for POST_FAILURE and TIMED_OUT builds https://review.opendev.org/c/zuul/zuul/+/84801418:53
@jim:acmegating.comClark: on my workstation, zuul unit tests are 2x faster under py310 than py3920:32
@jim:acmegating.com * Clark: on my workstation, zuul unit tests take 50% of the time under py310 compared to py3920:33
@jim:acmegating.compy39:20:33
Ran: 1325 tests in 747.7551 sec.
Sum of execute time for each test: 14749.2265 sec.
py310:
Ran: 1328 tests in 384.7824 sec.
Sum of execute time for each test: 7170.3903 sec.
@jim:acmegating.comi tend to get very consistent times on my test runs too, typically within 10 seconds variation in the wall-clock time.20:34
@clarkb:matrix.orgnice. And python3.11 is expected to be 10-60% quicker too20:37
@clarkb:matrix.orgquicker than 3.10 I mean20:37
@jim:acmegating.comhow soon can we upgrade? ;)20:38
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 848890: Handle non default loopvars in Ansible callback stream plugin https://review.opendev.org/c/zuul/zuul/+/84889020:38
@clarkb:matrix.org3.11 is expected to release early october. I suspect we will be able to update not long after that? Thats my hope anyway. Then also maybe taking some time to refactor bits of zuul to be more consistent in their shape so that python3.11 can do its best to run it quickly20:39
@clarkb:matrix.orgI added a test in 848890 but haven't run it locally yet20:39
@clarkb:matrix.orgbut I think I got the general shape correct20:40
@clarkb:matrix.orgwe might want to add more cases to the testing to handle the other bits that I updated rather than just the instance we noticed with the random selection20:40
@jim:acmegating.comopendev's cpu use is not high, so if there's going to be a difference, we might have to squint to see it.  so far the difference on zuul01 from july 5 -> july 6 looks like it's within the error bars (eg, about the same as jun 30 -> july 1)20:41
@jim:acmegating.comi'm not sure how else we could quantify it.  maybe aggregate loop runtime, but network ops are the bulk of that time so again it may be lost in the noise.20:41
@clarkb:matrix.orgcorvus: ya and some things that zuul does a lot of like json (de)serialization is on the lwoer end of the speedups. But they do show speedups just not as impressively20:42
@jim:acmegating.com++20:42
@clarkb:matrix.orgregex and json ops according ot benchmarks do improve but not as crazily as some of the more mathy benchmarks20:42
@clarkb:matrix.orgI guess regexes are very mathy, but you probably know what I mean20:42
@jim:acmegating.comhttps://grafana.opendev.org/d/2e89fb78e5/zuul-performance-metrics?orgId=1&from=1656535509194&to=1657140309194&var-tenant=openstack&viewPanel=1720:45
@jim:acmegating.comooh there's a discontinuity20:45
@jim:acmegating.comlike maybe a 20% decrease in layout generation time for the openstack tenant20:45
@clarkb:matrix.orgI think for 3.10 one of the changes they made was to thread scheduling20:48
@clarkb:matrix.orgwhich may explain why testing is so much quicker?20:49
@jim:acmegating.comcould be -- though testing only has a couple more threads than production (since each test runs in its own process)20:50
@clarkb:matrix.orghttps://bugs.python.org/issue42093 is apparently a big python3.10 performance change20:57
@clarkb:matrix.orgUpstream definitely seems interested in performance these days which is great. Hopefully 3.12 can keep the trned going :)20:57
@fungicide:matrix.orgyeah, there's been a slew of peps for speedups in various parts of the interpreter. from what i can see a lot of that is being driven by enterprise users who chose python and then decided things were running too slowly so have been looking for anywhere to improve that safely21:39
@fungicide:matrix.orge.g. the cinder (no not that cinder) work meta has started to open up21:40
@jim:acmegating.com> <@fungicide:matrix.org> yeah, there's been a slew of peps for speedups in various parts of the interpreter. from what i can see a lot of that is being driven by enterprise users who chose python and then decided things were running too slowly so have been looking for anywhere to improve that safely21:50
i think every python dev can relate to that :)
@clarkb:matrix.orgwow 848890 passes testing. I'll let reviewers chime in on whether or not they think we need to cover more of the cases of non default loop vars22:52
@jim:acmegating.comClark: that lgtm but i left a nit i suspect you may want to fix :)22:55
@jim:acmegating.comClark: and fwiw, i think we should include the fallback so that we don't make any more assumptions than necessary about ansible input22:56
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 848910: Apply timer trigger jitter to project-branches https://review.opendev.org/c/zuul/zuul/+/84891022:57
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 848890: Handle non default loopvars in Ansible callback stream plugin https://review.opendev.org/c/zuul/zuul/+/84889022:57
@clarkb:matrix.orgcorvus: should I remove the todo then?22:57
@clarkb:matrix.orgI can update it to make a note about not assuming things about ansible apis22:58
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 848890: Handle non default loopvars in Ansible callback stream plugin https://review.opendev.org/c/zuul/zuul/+/84889022:59

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!