Wednesday, 2025-04-02

-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 946035: emit-job-header: add instance and region information https://review.opendev.org/c/zuul/zuul-jobs/+/94603507:28
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 946035: emit-job-header: add instance and region information https://review.opendev.org/c/zuul/zuul-jobs/+/94603508:53
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [zuul/zuul-jobs] 940158: ensure-python: Skip "t" versions in pyenv https://review.opendev.org/c/zuul/zuul-jobs/+/94015810:32
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [zuul/nodepool] 946136: Allow any (and multiple) whitespace separator for glean https://review.opendev.org/c/zuul/nodepool/+/94613611:46
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [zuul/nodepool] 946143: Remove nonlocal assignment from tests https://review.opendev.org/c/zuul/nodepool/+/94614312:22
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [zuul/nodepool] 946136: Allow any (and multiple) whitespace separator for glean https://review.opendev.org/c/zuul/nodepool/+/94613612:22
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 946148: Annotate "reporting start|enqueue|dequeue" log messages with event UUID https://review.opendev.org/c/zuul/zuul/+/94614812:56
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:15:12
- [zuul/zuul] 946166: Use eatmydata with zookeeper in tests https://review.opendev.org/c/zuul/zuul/+/946166
- [zuul/zuul] 946167: DNM: eatmydata zk test battery https://review.opendev.org/c/zuul/zuul/+/946167
@jim:acmegating.comClark: the big config object stack without large nodes did very poorly.  many memory-related timeouts.  so i think we either need the large nodes or eatmydata ^; let's see how that does15:13
@clarkb:matrix.orgack15:13
@clarkb:matrix.orgcorvus: any indication if the end state does better again? Curious if we're returning to a lower memory requirement as expected at the end of the stack15:15
@jim:acmegating.comit does not15:16
@jim:acmegating.comthe more i think about it though -- i'm not sure we should expect to see a big difference in tests anyway.  so i'm not sure if that stack is making anything significantly worse, or if it's just a little bit different and we're already right on the edge.15:18
we do know that we are already right on the edge, because we've seen plenty of timeouts from low memory before that stack.
@clarkb:matrix.orgya I would agree we are right there15:19
@clarkb:matrix.orgcorvus: quickly checking some of the failures with the eatmydata update I suspect those are existing races similar to the one where I had to shutdown zk caches to avoid their watchers attempting to use a client connection that was stopped16:50
@jim:acmegating.comyeah i was doing the exact same thing and came to the same conclusion16:50
@clarkb:matrix.orgIf I had to guess these are shaking out due to timing differences with eatmydata vs tmpfs but not actually an indication of memroy pressure problems?16:50
@jim:acmegating.comthe mechanics of the fix are a little different though, so it's taking a minute to come up with a solution (there's no "stop" for the change cache)16:51
@jim:acmegating.comi agree with the assessment as it relates to memory16:51
@jim:acmegating.comand the jobs that passed did so with typical runtimes, so we didn't slow it down16:51
@clarkb:matrix.orgack I'll let you sort out the fix then I can review it16:51
@clarkb:matrix.orgthe dstat graph looks pretty good too. I need to look at the csv more closely though as that is a bit more accurate16:52
@jim:acmegating.comyeah, clearly the actual memory numbers are wrong, but they seem to have some relation to the memory :)16:53
@clarkb:matrix.orglooking at the csv values I think we peak at ~6.3GB used16:56
@clarkb:matrix.orgwhich if we assume ~740MB freed from zookeeper storage that is a fair bit of extra headroom. This may be a good workaround16:56
@clarkb:matrix.organd does the paging in and out values account for swap? If so I don't think we ever swapped but I'm not confident of that assessment17:02
@clarkb:matrix.orgtail, cut and sort say "6342564" is the largest used memory value17:08
@clarkb:matrix.orgfrom this build https://zuul.opendev.org/t/zuul/build/cb69fc18b7ba473a87dd6bad03af6bae/logs17:08
@jim:acmegating.comi do think this specific test failure indicates a difference in zk response time.  i don't know if it's faster or slower, but from the test, my guess would be slower -- i think we're getting watch callback after the end of the test.  it's not a problem -- it's just a test race after all.  but it's a clue as to how the system is different.17:10
@clarkb:matrix.orgthe total job runtimes seemed to be in the ballpark at least. So we're not making things terribly slow if it is slower17:10
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946176: Fix change cache test races https://review.opendev.org/c/zuul/zuul/+/94617617:12
@jim:acmegating.comagree.  just micro changes.  there's hopefully a fix ^17:12
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946176: Fix change cache test races https://review.opendev.org/c/zuul/zuul/+/94617617:14
@jim:acmegating.combased on the results, i think i'll try stacking race fix, eatmydata, then the config object stack17:15
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:17:17
- [zuul/zuul] 946166: Use eatmydata with zookeeper in tests https://review.opendev.org/c/zuul/zuul/+/946166
- [zuul/zuul] 945743: Remove tenant references from project parsing https://review.opendev.org/c/zuul/zuul/+/945743
- [zuul/zuul] 945744: Stop copying ProjectConfig objects when adding to layout https://review.opendev.org/c/zuul/zuul/+/945744
- [zuul/zuul] 945745: Check for nodeset label perms in layout https://review.opendev.org/c/zuul/zuul/+/945745
- [zuul/zuul] 945746: Check for job timeouts in layout https://review.opendev.org/c/zuul/zuul/+/945746
- [zuul/zuul] 945747: Resolve required-projects in layout https://review.opendev.org/c/zuul/zuul/+/945747
- [zuul/zuul] 945748: Resolve include-vars projects in layout https://review.opendev.org/c/zuul/zuul/+/945748
- [zuul/zuul] 945749: Check allowed-projects in layout https://review.opendev.org/c/zuul/zuul/+/945749
- [zuul/zuul] 945750: Move ignore_allowed_projects to job freeze https://review.opendev.org/c/zuul/zuul/+/945750
- [zuul/zuul] 945751: Move post_review setting to job freezing https://review.opendev.org/c/zuul/zuul/+/945751
- [zuul/zuul] 945752: Move base job check to layout https://review.opendev.org/c/zuul/zuul/+/945752
- [zuul/zuul] 945753: Resolve role references when freezing https://review.opendev.org/c/zuul/zuul/+/945753
- [zuul/zuul] 945754: Set playbook trusted flag when freezing https://review.opendev.org/c/zuul/zuul/+/945754
- [zuul/zuul] 945755: Determine job branch matchers when freezing https://review.opendev.org/c/zuul/zuul/+/945755
- [zuul/zuul] 945756: Remove trusted flag from source context https://review.opendev.org/c/zuul/zuul/+/945756
- [zuul/zuul] 945757: Move some pipeline attributes to manager https://review.opendev.org/c/zuul/zuul/+/945757
- [zuul/zuul] 945758: Move some manager attributes to pipeline https://review.opendev.org/c/zuul/zuul/+/945758
- [zuul/zuul] 945759: Move queue-related methods to PipelineState https://review.opendev.org/c/zuul/zuul/+/945759
- [zuul/zuul] 945760: Remove the manager attribute from pipelines https://review.opendev.org/c/zuul/zuul/+/945760
- [zuul/zuul] 945761: Remove layout.pipelines https://review.opendev.org/c/zuul/zuul/+/945761
- [zuul/zuul] 945762: Remove tenant attribute from pipeline https://review.opendev.org/c/zuul/zuul/+/945762
- [zuul/zuul] 945763: Move formatStatusJSON to manager https://review.opendev.org/c/zuul/zuul/+/945763
- [zuul/zuul] 945764: Move allowed triggers/reporter checks to layout https://review.opendev.org/c/zuul/zuul/+/945764
- [zuul/zuul] 945765: Remove tenant from configloader parse context https://review.opendev.org/c/zuul/zuul/+/945765
- [zuul/zuul] 945766: Refactor unprotected branch tests https://review.opendev.org/c/zuul/zuul/+/945766
- [zuul/zuul] 945767: Add error list to error accumulator https://review.opendev.org/c/zuul/zuul/+/945767
- [zuul/zuul] 945768: Store object errors separately https://review.opendev.org/c/zuul/zuul/+/945768
- [zuul/zuul] 945769: Add ConfigObjectCache https://review.opendev.org/c/zuul/zuul/+/945769
- [zuul/zuul] 945770: Reuse configuration objects https://review.opendev.org/c/zuul/zuul/+/945770
- [zuul/zuul] 945985: Remove unecessary code from manager._postConfig https://review.opendev.org/c/zuul/zuul/+/945985
@clarkb:matrix.orgcorvus: a few questions on the cache race fix17:17
@jim:acmegating.comClark: thx, replied17:25
@clarkb:matrix.org+2 thanks17:41
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946186: Fix race in test_ref_updated_reconfig https://review.opendev.org/c/zuul/zuul/+/94618618:18
@clarkb:matrix.orgcorvus: it seems like jobs weren't hitting memory limits (which often manifest as timeouts iirc) but instead maybe more related race conditions?19:40
@jim:acmegating.comClark: well... some timeouts have happened recently...20:08
@clarkb:matrix.orgya I just noticed https://zuul.opendev.org/t/zuul/build/15f999d213044a2dafdb88aa86d4cb83/logs and if you look at dstat it uses more memroy too20:08
@clarkb:matrix.orgthat job's used memory was as high as 776294020:09
@clarkb:matrix.orgthat is about 1.4gb more than the previous example I looked at20:10
@fungicide:matrix.orgmaybe it consumes more ram the longer it runs?20:10
@clarkb:matrix.orgI don't think so. If you look at the dstat graphs it uses more very quickly20:11
@fungicide:matrix.orgmmm20:11
@clarkb:matrix.orgalmost 50% of the runtime is over 7GB used memory20:12
@clarkb:matrix.orgIts likely this is the higher memory demand from the changes in question20:12
@clarkb:matrix.orgnow that may still be excessive and there is a bug maybe? I'm not sure20:12
@jim:acmegating.comhrm okay, there may be smoke and fire there.  that change was the turning point last time for when most jobs timed out20:14
@jim:acmegating.comi mean... there's an lru_cache which is an obvious candidate... but it's a cache on an object (the tenant) which should frequently be deleted.20:15
@jim:acmegating.comi'll dig into that change a bit and see if i can get some evidence one way or the other.20:16
@clarkb:matrix.orgcorvus: I wonder if that lru cache effectively leaks because we're allocating the cache for each callable against a tenant object?20:57
@clarkb:matrix.organd maybe preallocation is expensive there?20:57
@clarkb:matrix.orgcorvus: https://docs.python.org/3/faq/programming.html#faq-cache-method-calls I think this may answer some questions. Specifically we may have to clear the cache to cause the entries (which include self) to be GCable21:02
@jim:acmegating.comhrm... but a LRUCache is just a handful of objects... and even if we leaked every Tenant in the unit tests, that's what, like 10,000 tenant objects total?21:02
@clarkb:matrix.org"If a method is cached, the self instance argument is included in the cache. "21:03
@clarkb:matrix.orgya we typically only have one tenant per test21:03
@clarkb:matrix.organd maybe a handful of layouts per test so like 1 * 3 * 1800 for a rough estimate of object count total?21:04
@clarkb:matrix.orgmaybe tenants are larger than we realize since we so rarely have many of them?21:04
@jim:acmegating.comokay i think i just confirmed that we are leaking tenants21:06
@jim:acmegating.comtook a few tries to find the right approach for detection21:07
@clarkb:matrix.orghttps://bugs.python.org/issue44310 here is some additional discussion of it21:23
@clarkb:matrix.orgthe example at the end of that issue might be applicable here?21:23
@clarkb:matrix.orgoh and even more discussion here https://discuss.python.org/t/memoizing-methods-considered-harmful/2469121:25
@jim:acmegating.comyeah, we already have a pattern to resolve this issue in the cloud drivers: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/aws/awsendpoint.py#L388-L39021:25
@jim:acmegating.comi'm just applying that pattern to this one21:25
@clarkb:matrix.orgreading that discuss thread I wonder if maybe setting maxsize is sufficient21:26
@jim:acmegating.comtried that, not so much21:26
@clarkb:matrix.orgsince most zuuls have a small number of tenants we can probably set that to a reasonable number like 256 and call it a day?21:26
@clarkb:matrix.orgah21:26
@jim:acmegating.comi'm doing the constructor pattern from the adaptor and also setting maxsize to None because it should be okay to cache all of the tenants.  it's going to be a tiny cache.21:28
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:21:29
- [zuul/zuul] 945751: Move post_review setting to job freezing https://review.opendev.org/c/zuul/zuul/+/945751
- [zuul/zuul] 945752: Move base job check to layout https://review.opendev.org/c/zuul/zuul/+/945752
- [zuul/zuul] 945753: Resolve role references when freezing https://review.opendev.org/c/zuul/zuul/+/945753
- [zuul/zuul] 945754: Set playbook trusted flag when freezing https://review.opendev.org/c/zuul/zuul/+/945754
- [zuul/zuul] 945755: Determine job branch matchers when freezing https://review.opendev.org/c/zuul/zuul/+/945755
- [zuul/zuul] 945756: Remove trusted flag from source context https://review.opendev.org/c/zuul/zuul/+/945756
- [zuul/zuul] 945757: Move some pipeline attributes to manager https://review.opendev.org/c/zuul/zuul/+/945757
- [zuul/zuul] 945758: Move some manager attributes to pipeline https://review.opendev.org/c/zuul/zuul/+/945758
- [zuul/zuul] 945759: Move queue-related methods to PipelineState https://review.opendev.org/c/zuul/zuul/+/945759
- [zuul/zuul] 945760: Remove the manager attribute from pipelines https://review.opendev.org/c/zuul/zuul/+/945760
- [zuul/zuul] 945761: Remove layout.pipelines https://review.opendev.org/c/zuul/zuul/+/945761
- [zuul/zuul] 945762: Remove tenant attribute from pipeline https://review.opendev.org/c/zuul/zuul/+/945762
- [zuul/zuul] 945763: Move formatStatusJSON to manager https://review.opendev.org/c/zuul/zuul/+/945763
- [zuul/zuul] 945764: Move allowed triggers/reporter checks to layout https://review.opendev.org/c/zuul/zuul/+/945764
- [zuul/zuul] 945765: Remove tenant from configloader parse context https://review.opendev.org/c/zuul/zuul/+/945765
- [zuul/zuul] 945766: Refactor unprotected branch tests https://review.opendev.org/c/zuul/zuul/+/945766
- [zuul/zuul] 945767: Add error list to error accumulator https://review.opendev.org/c/zuul/zuul/+/945767
- [zuul/zuul] 945768: Store object errors separately https://review.opendev.org/c/zuul/zuul/+/945768
- [zuul/zuul] 945769: Add ConfigObjectCache https://review.opendev.org/c/zuul/zuul/+/945769
- [zuul/zuul] 945770: Reuse configuration objects https://review.opendev.org/c/zuul/zuul/+/945770
- [zuul/zuul] 945985: Remove unecessary code from manager._postConfig https://review.opendev.org/c/zuul/zuul/+/945985
@clarkb:matrix.orgcorvus: I guess we don't need to do something like `def _isTrusted():` then defined `isTrusted` in the constructor because we can get the original method through the cache object?21:32
@jim:acmegating.comClark: if we needed to, but we have no need of the original method.  (the cache wrapper does, so the cache wrapper keeps that reference we pass in)21:33
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946186: Fix race in test_ref_updated_reconfig https://review.opendev.org/c/zuul/zuul/+/94618621:46
@jim:acmegating.comClark: can you take another look at ^ -- i was about to approve it and noticed it had failed that test.  but it ran both jobs, which is the behavior we care about.  we don't care about the order -- they aren't sequenced.21:46
@clarkb:matrix.orgdone I went ahead and approved it since that was your intention anyway21:47
@jim:acmegating.comthanks!21:52
@clarkb:matrix.orgcorvus: the lru change seems to have gotten things back down to maxing out at 6322676 for used memory. That leak really must've consumed 1.4GB more22:45
@jim:acmegating.comyep..  i think it seems likely we can merge that stack with just the race fixes and the eatmydata, and stay on 8gb nodes for a while longer22:47
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 940824: Use niz nodes for unit tests https://review.opendev.org/c/zuul/zuul/+/94082422:49
@jim:acmegating.comon that note, i refreshed that ^22:49
@clarkb:matrix.org+2 from me. And I guess given we think we can make 8gb nodes work there isn't much reason to hold off on that dogfooding switch now?22:56
@clarkb:matrix.orgthe main thing would be if autoholds don't work but can always switch back just to hold a test node in the interim22:57
@jim:acmegating.comyeah, i think they're truly orthogonal now23:03
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 946186: Fix race in test_ref_updated_reconfig https://review.opendev.org/c/zuul/zuul/+/94618623:08
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 946085: Log reasons for missing image builds https://review.opendev.org/c/zuul/zuul/+/94608523:18
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 946086: Add REPL to zuul-launcher https://review.opendev.org/c/zuul/zuul/+/94608623:24

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!