-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 946035: emit-job-header: add instance and region information https://review.opendev.org/c/zuul/zuul-jobs/+/946035 | 07:28 | |
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 946035: emit-job-header: add instance and region information https://review.opendev.org/c/zuul/zuul-jobs/+/946035 | 08:53 | |
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [zuul/zuul-jobs] 940158: ensure-python: Skip "t" versions in pyenv https://review.opendev.org/c/zuul/zuul-jobs/+/940158 | 10:32 | |
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [zuul/nodepool] 946136: Allow any (and multiple) whitespace separator for glean https://review.opendev.org/c/zuul/nodepool/+/946136 | 11:46 | |
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [zuul/nodepool] 946143: Remove nonlocal assignment from tests https://review.opendev.org/c/zuul/nodepool/+/946143 | 12:22 | |
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [zuul/nodepool] 946136: Allow any (and multiple) whitespace separator for glean https://review.opendev.org/c/zuul/nodepool/+/946136 | 12:22 | |
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 946148: Annotate "reporting start|enqueue|dequeue" log messages with event UUID https://review.opendev.org/c/zuul/zuul/+/946148 | 12:56 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 15:12 | |
- [zuul/zuul] 946166: Use eatmydata with zookeeper in tests https://review.opendev.org/c/zuul/zuul/+/946166 | ||
- [zuul/zuul] 946167: DNM: eatmydata zk test battery https://review.opendev.org/c/zuul/zuul/+/946167 | ||
@jim:acmegating.com | Clark: the big config object stack without large nodes did very poorly. many memory-related timeouts. so i think we either need the large nodes or eatmydata ^; let's see how that does | 15:13 |
---|---|---|
@clarkb:matrix.org | ack | 15:13 |
@clarkb:matrix.org | corvus: any indication if the end state does better again? Curious if we're returning to a lower memory requirement as expected at the end of the stack | 15:15 |
@jim:acmegating.com | it does not | 15:16 |
@jim:acmegating.com | the more i think about it though -- i'm not sure we should expect to see a big difference in tests anyway. so i'm not sure if that stack is making anything significantly worse, or if it's just a little bit different and we're already right on the edge. | 15:18 |
we do know that we are already right on the edge, because we've seen plenty of timeouts from low memory before that stack. | ||
@clarkb:matrix.org | ya I would agree we are right there | 15:19 |
@clarkb:matrix.org | corvus: quickly checking some of the failures with the eatmydata update I suspect those are existing races similar to the one where I had to shutdown zk caches to avoid their watchers attempting to use a client connection that was stopped | 16:50 |
@jim:acmegating.com | yeah i was doing the exact same thing and came to the same conclusion | 16:50 |
@clarkb:matrix.org | If I had to guess these are shaking out due to timing differences with eatmydata vs tmpfs but not actually an indication of memroy pressure problems? | 16:50 |
@jim:acmegating.com | the mechanics of the fix are a little different though, so it's taking a minute to come up with a solution (there's no "stop" for the change cache) | 16:51 |
@jim:acmegating.com | i agree with the assessment as it relates to memory | 16:51 |
@jim:acmegating.com | and the jobs that passed did so with typical runtimes, so we didn't slow it down | 16:51 |
@clarkb:matrix.org | ack I'll let you sort out the fix then I can review it | 16:51 |
@clarkb:matrix.org | the dstat graph looks pretty good too. I need to look at the csv more closely though as that is a bit more accurate | 16:52 |
@jim:acmegating.com | yeah, clearly the actual memory numbers are wrong, but they seem to have some relation to the memory :) | 16:53 |
@clarkb:matrix.org | looking at the csv values I think we peak at ~6.3GB used | 16:56 |
@clarkb:matrix.org | which if we assume ~740MB freed from zookeeper storage that is a fair bit of extra headroom. This may be a good workaround | 16:56 |
@clarkb:matrix.org | and does the paging in and out values account for swap? If so I don't think we ever swapped but I'm not confident of that assessment | 17:02 |
@clarkb:matrix.org | tail, cut and sort say "6342564" is the largest used memory value | 17:08 |
@clarkb:matrix.org | from this build https://zuul.opendev.org/t/zuul/build/cb69fc18b7ba473a87dd6bad03af6bae/logs | 17:08 |
@jim:acmegating.com | i do think this specific test failure indicates a difference in zk response time. i don't know if it's faster or slower, but from the test, my guess would be slower -- i think we're getting watch callback after the end of the test. it's not a problem -- it's just a test race after all. but it's a clue as to how the system is different. | 17:10 |
@clarkb:matrix.org | the total job runtimes seemed to be in the ballpark at least. So we're not making things terribly slow if it is slower | 17:10 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946176: Fix change cache test races https://review.opendev.org/c/zuul/zuul/+/946176 | 17:12 | |
@jim:acmegating.com | agree. just micro changes. there's hopefully a fix ^ | 17:12 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946176: Fix change cache test races https://review.opendev.org/c/zuul/zuul/+/946176 | 17:14 | |
@jim:acmegating.com | based on the results, i think i'll try stacking race fix, eatmydata, then the config object stack | 17:15 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 17:17 | |
- [zuul/zuul] 946166: Use eatmydata with zookeeper in tests https://review.opendev.org/c/zuul/zuul/+/946166 | ||
- [zuul/zuul] 945743: Remove tenant references from project parsing https://review.opendev.org/c/zuul/zuul/+/945743 | ||
- [zuul/zuul] 945744: Stop copying ProjectConfig objects when adding to layout https://review.opendev.org/c/zuul/zuul/+/945744 | ||
- [zuul/zuul] 945745: Check for nodeset label perms in layout https://review.opendev.org/c/zuul/zuul/+/945745 | ||
- [zuul/zuul] 945746: Check for job timeouts in layout https://review.opendev.org/c/zuul/zuul/+/945746 | ||
- [zuul/zuul] 945747: Resolve required-projects in layout https://review.opendev.org/c/zuul/zuul/+/945747 | ||
- [zuul/zuul] 945748: Resolve include-vars projects in layout https://review.opendev.org/c/zuul/zuul/+/945748 | ||
- [zuul/zuul] 945749: Check allowed-projects in layout https://review.opendev.org/c/zuul/zuul/+/945749 | ||
- [zuul/zuul] 945750: Move ignore_allowed_projects to job freeze https://review.opendev.org/c/zuul/zuul/+/945750 | ||
- [zuul/zuul] 945751: Move post_review setting to job freezing https://review.opendev.org/c/zuul/zuul/+/945751 | ||
- [zuul/zuul] 945752: Move base job check to layout https://review.opendev.org/c/zuul/zuul/+/945752 | ||
- [zuul/zuul] 945753: Resolve role references when freezing https://review.opendev.org/c/zuul/zuul/+/945753 | ||
- [zuul/zuul] 945754: Set playbook trusted flag when freezing https://review.opendev.org/c/zuul/zuul/+/945754 | ||
- [zuul/zuul] 945755: Determine job branch matchers when freezing https://review.opendev.org/c/zuul/zuul/+/945755 | ||
- [zuul/zuul] 945756: Remove trusted flag from source context https://review.opendev.org/c/zuul/zuul/+/945756 | ||
- [zuul/zuul] 945757: Move some pipeline attributes to manager https://review.opendev.org/c/zuul/zuul/+/945757 | ||
- [zuul/zuul] 945758: Move some manager attributes to pipeline https://review.opendev.org/c/zuul/zuul/+/945758 | ||
- [zuul/zuul] 945759: Move queue-related methods to PipelineState https://review.opendev.org/c/zuul/zuul/+/945759 | ||
- [zuul/zuul] 945760: Remove the manager attribute from pipelines https://review.opendev.org/c/zuul/zuul/+/945760 | ||
- [zuul/zuul] 945761: Remove layout.pipelines https://review.opendev.org/c/zuul/zuul/+/945761 | ||
- [zuul/zuul] 945762: Remove tenant attribute from pipeline https://review.opendev.org/c/zuul/zuul/+/945762 | ||
- [zuul/zuul] 945763: Move formatStatusJSON to manager https://review.opendev.org/c/zuul/zuul/+/945763 | ||
- [zuul/zuul] 945764: Move allowed triggers/reporter checks to layout https://review.opendev.org/c/zuul/zuul/+/945764 | ||
- [zuul/zuul] 945765: Remove tenant from configloader parse context https://review.opendev.org/c/zuul/zuul/+/945765 | ||
- [zuul/zuul] 945766: Refactor unprotected branch tests https://review.opendev.org/c/zuul/zuul/+/945766 | ||
- [zuul/zuul] 945767: Add error list to error accumulator https://review.opendev.org/c/zuul/zuul/+/945767 | ||
- [zuul/zuul] 945768: Store object errors separately https://review.opendev.org/c/zuul/zuul/+/945768 | ||
- [zuul/zuul] 945769: Add ConfigObjectCache https://review.opendev.org/c/zuul/zuul/+/945769 | ||
- [zuul/zuul] 945770: Reuse configuration objects https://review.opendev.org/c/zuul/zuul/+/945770 | ||
- [zuul/zuul] 945985: Remove unecessary code from manager._postConfig https://review.opendev.org/c/zuul/zuul/+/945985 | ||
@clarkb:matrix.org | corvus: a few questions on the cache race fix | 17:17 |
@jim:acmegating.com | Clark: thx, replied | 17:25 |
@clarkb:matrix.org | +2 thanks | 17:41 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946186: Fix race in test_ref_updated_reconfig https://review.opendev.org/c/zuul/zuul/+/946186 | 18:18 | |
@clarkb:matrix.org | corvus: it seems like jobs weren't hitting memory limits (which often manifest as timeouts iirc) but instead maybe more related race conditions? | 19:40 |
@jim:acmegating.com | Clark: well... some timeouts have happened recently... | 20:08 |
@clarkb:matrix.org | ya I just noticed https://zuul.opendev.org/t/zuul/build/15f999d213044a2dafdb88aa86d4cb83/logs and if you look at dstat it uses more memroy too | 20:08 |
@clarkb:matrix.org | that job's used memory was as high as 7762940 | 20:09 |
@clarkb:matrix.org | that is about 1.4gb more than the previous example I looked at | 20:10 |
@fungicide:matrix.org | maybe it consumes more ram the longer it runs? | 20:10 |
@clarkb:matrix.org | I don't think so. If you look at the dstat graphs it uses more very quickly | 20:11 |
@fungicide:matrix.org | mmm | 20:11 |
@clarkb:matrix.org | almost 50% of the runtime is over 7GB used memory | 20:12 |
@clarkb:matrix.org | Its likely this is the higher memory demand from the changes in question | 20:12 |
@clarkb:matrix.org | now that may still be excessive and there is a bug maybe? I'm not sure | 20:12 |
@jim:acmegating.com | hrm okay, there may be smoke and fire there. that change was the turning point last time for when most jobs timed out | 20:14 |
@jim:acmegating.com | i mean... there's an lru_cache which is an obvious candidate... but it's a cache on an object (the tenant) which should frequently be deleted. | 20:15 |
@jim:acmegating.com | i'll dig into that change a bit and see if i can get some evidence one way or the other. | 20:16 |
@clarkb:matrix.org | corvus: I wonder if that lru cache effectively leaks because we're allocating the cache for each callable against a tenant object? | 20:57 |
@clarkb:matrix.org | and maybe preallocation is expensive there? | 20:57 |
@clarkb:matrix.org | corvus: https://docs.python.org/3/faq/programming.html#faq-cache-method-calls I think this may answer some questions. Specifically we may have to clear the cache to cause the entries (which include self) to be GCable | 21:02 |
@jim:acmegating.com | hrm... but a LRUCache is just a handful of objects... and even if we leaked every Tenant in the unit tests, that's what, like 10,000 tenant objects total? | 21:02 |
@clarkb:matrix.org | "If a method is cached, the self instance argument is included in the cache. " | 21:03 |
@clarkb:matrix.org | ya we typically only have one tenant per test | 21:03 |
@clarkb:matrix.org | and maybe a handful of layouts per test so like 1 * 3 * 1800 for a rough estimate of object count total? | 21:04 |
@clarkb:matrix.org | maybe tenants are larger than we realize since we so rarely have many of them? | 21:04 |
@jim:acmegating.com | okay i think i just confirmed that we are leaking tenants | 21:06 |
@jim:acmegating.com | took a few tries to find the right approach for detection | 21:07 |
@clarkb:matrix.org | https://bugs.python.org/issue44310 here is some additional discussion of it | 21:23 |
@clarkb:matrix.org | the example at the end of that issue might be applicable here? | 21:23 |
@clarkb:matrix.org | oh and even more discussion here https://discuss.python.org/t/memoizing-methods-considered-harmful/24691 | 21:25 |
@jim:acmegating.com | yeah, we already have a pattern to resolve this issue in the cloud drivers: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/aws/awsendpoint.py#L388-L390 | 21:25 |
@jim:acmegating.com | i'm just applying that pattern to this one | 21:25 |
@clarkb:matrix.org | reading that discuss thread I wonder if maybe setting maxsize is sufficient | 21:26 |
@jim:acmegating.com | tried that, not so much | 21:26 |
@clarkb:matrix.org | since most zuuls have a small number of tenants we can probably set that to a reasonable number like 256 and call it a day? | 21:26 |
@clarkb:matrix.org | ah | 21:26 |
@jim:acmegating.com | i'm doing the constructor pattern from the adaptor and also setting maxsize to None because it should be okay to cache all of the tenants. it's going to be a tiny cache. | 21:28 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 21:29 | |
- [zuul/zuul] 945751: Move post_review setting to job freezing https://review.opendev.org/c/zuul/zuul/+/945751 | ||
- [zuul/zuul] 945752: Move base job check to layout https://review.opendev.org/c/zuul/zuul/+/945752 | ||
- [zuul/zuul] 945753: Resolve role references when freezing https://review.opendev.org/c/zuul/zuul/+/945753 | ||
- [zuul/zuul] 945754: Set playbook trusted flag when freezing https://review.opendev.org/c/zuul/zuul/+/945754 | ||
- [zuul/zuul] 945755: Determine job branch matchers when freezing https://review.opendev.org/c/zuul/zuul/+/945755 | ||
- [zuul/zuul] 945756: Remove trusted flag from source context https://review.opendev.org/c/zuul/zuul/+/945756 | ||
- [zuul/zuul] 945757: Move some pipeline attributes to manager https://review.opendev.org/c/zuul/zuul/+/945757 | ||
- [zuul/zuul] 945758: Move some manager attributes to pipeline https://review.opendev.org/c/zuul/zuul/+/945758 | ||
- [zuul/zuul] 945759: Move queue-related methods to PipelineState https://review.opendev.org/c/zuul/zuul/+/945759 | ||
- [zuul/zuul] 945760: Remove the manager attribute from pipelines https://review.opendev.org/c/zuul/zuul/+/945760 | ||
- [zuul/zuul] 945761: Remove layout.pipelines https://review.opendev.org/c/zuul/zuul/+/945761 | ||
- [zuul/zuul] 945762: Remove tenant attribute from pipeline https://review.opendev.org/c/zuul/zuul/+/945762 | ||
- [zuul/zuul] 945763: Move formatStatusJSON to manager https://review.opendev.org/c/zuul/zuul/+/945763 | ||
- [zuul/zuul] 945764: Move allowed triggers/reporter checks to layout https://review.opendev.org/c/zuul/zuul/+/945764 | ||
- [zuul/zuul] 945765: Remove tenant from configloader parse context https://review.opendev.org/c/zuul/zuul/+/945765 | ||
- [zuul/zuul] 945766: Refactor unprotected branch tests https://review.opendev.org/c/zuul/zuul/+/945766 | ||
- [zuul/zuul] 945767: Add error list to error accumulator https://review.opendev.org/c/zuul/zuul/+/945767 | ||
- [zuul/zuul] 945768: Store object errors separately https://review.opendev.org/c/zuul/zuul/+/945768 | ||
- [zuul/zuul] 945769: Add ConfigObjectCache https://review.opendev.org/c/zuul/zuul/+/945769 | ||
- [zuul/zuul] 945770: Reuse configuration objects https://review.opendev.org/c/zuul/zuul/+/945770 | ||
- [zuul/zuul] 945985: Remove unecessary code from manager._postConfig https://review.opendev.org/c/zuul/zuul/+/945985 | ||
@clarkb:matrix.org | corvus: I guess we don't need to do something like `def _isTrusted():` then defined `isTrusted` in the constructor because we can get the original method through the cache object? | 21:32 |
@jim:acmegating.com | Clark: if we needed to, but we have no need of the original method. (the cache wrapper does, so the cache wrapper keeps that reference we pass in) | 21:33 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946186: Fix race in test_ref_updated_reconfig https://review.opendev.org/c/zuul/zuul/+/946186 | 21:46 | |
@jim:acmegating.com | Clark: can you take another look at ^ -- i was about to approve it and noticed it had failed that test. but it ran both jobs, which is the behavior we care about. we don't care about the order -- they aren't sequenced. | 21:46 |
@clarkb:matrix.org | done I went ahead and approved it since that was your intention anyway | 21:47 |
@jim:acmegating.com | thanks! | 21:52 |
@clarkb:matrix.org | corvus: the lru change seems to have gotten things back down to maxing out at 6322676 for used memory. That leak really must've consumed 1.4GB more | 22:45 |
@jim:acmegating.com | yep.. i think it seems likely we can merge that stack with just the race fixes and the eatmydata, and stay on 8gb nodes for a while longer | 22:47 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 940824: Use niz nodes for unit tests https://review.opendev.org/c/zuul/zuul/+/940824 | 22:49 | |
@jim:acmegating.com | on that note, i refreshed that ^ | 22:49 |
@clarkb:matrix.org | +2 from me. And I guess given we think we can make 8gb nodes work there isn't much reason to hold off on that dogfooding switch now? | 22:56 |
@clarkb:matrix.org | the main thing would be if autoholds don't work but can always switch back just to hold a test node in the interim | 22:57 |
@jim:acmegating.com | yeah, i think they're truly orthogonal now | 23:03 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 946186: Fix race in test_ref_updated_reconfig https://review.opendev.org/c/zuul/zuul/+/946186 | 23:08 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 946085: Log reasons for missing image builds https://review.opendev.org/c/zuul/zuul/+/946085 | 23:18 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 946086: Add REPL to zuul-launcher https://review.opendev.org/c/zuul/zuul/+/946086 | 23:24 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!