Tuesday, 2025-04-01

@newbie23:matrix.orgThanks fungi Clark Albin Vass for your replies and inputs.07:08
I will check out the changes for skipping the workspace checkout you posted earlier 🙇‍♂️
About caching, we already keep a git repo cache for workers and we try to keep it warm.
The issue we are experiencing is with Zuul mergers when cloning the repos.
> The mergers should all keep their git repo caches fairly warm though. So the biggest hit is the first time
A bit more context: this is a large legacy repo that is not frequently updated, but when it is, the merger clone/fetch timeout expires.
OFC we could keep increasing the timeout :), but we decided to look into why it was so slow.
GitHub (we use GitHub) instance performance is ok, and it turned out that the repo is just huge.
We are taking actions to optimize the repo, but rewriting its history is a no-go, as it would break a lot of things.
So I wanted to explore how to convince the Zuul merger to do a shallow clone, but Clark, you are right: ensuring enough history is present to allow correct merges in all circumstances might be challenging.
Any other suggestions?
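(For reference: one way to shrink that first transfer without truncating history is a partial, "blobless" clone. The sketch below is illustrative only, it is not how the Zuul merger clones internally, and the repo URL and destination path are hypothetical.)

```python
# Illustrative only -- this is not how the Zuul merger clones internally.
# A partial "blobless" clone keeps the full commit/tree history, so
# merge-base calculation and cross-branch merges still work; blobs are
# fetched on demand from the promisor remote when a checkout or merge
# actually needs them. A shallow clone (--depth=N) truncates history
# instead, which is what makes correct merges hard to guarantee.
import subprocess

REPO_URL = "https://github.com/example-org/huge-legacy-repo"   # hypothetical
CLONE_DIR = "/var/lib/zuul/example/huge-legacy-repo"           # hypothetical

subprocess.run(
    ["git", "clone", "--filter=blob:none", REPO_URL, CLONE_DIR],
    check=True,
)

# Later fetches only transfer new commits/trees plus the blobs that are touched.
subprocess.run(["git", "-C", CLONE_DIR, "fetch", "origin"], check=True)
```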
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 945795: Add limit-log-file-size role. https://review.opendev.org/c/zuul/zuul-jobs/+/945795 08:43
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 945795: Add limit-log-file-size role. https://review.opendev.org/c/zuul/zuul-jobs/+/945795 09:10
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 945795: Add limit-log-file-size role. https://review.opendev.org/c/zuul/zuul-jobs/+/945795 10:32
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 945795: Add limit-log-file-size role. https://review.opendev.org/c/zuul/zuul-jobs/+/945795 10:39
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul] 942432: Implement zuul-web OIDC endpoints https://review.opendev.org/c/zuul/zuul/+/942432 11:51
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 946033: mirror-workspace-git-repos: Allow deleting current branch https://review.opendev.org/c/zuul/zuul-jobs/+/946033 12:40
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 946035: emit-job-header: add instance and region information https://review.opendev.org/c/zuul/zuul-jobs/+/946035 12:42
@clarkb:matrix.orgYou can preclone the repo for your mergers outside of your normal timeout period. Then subsequent operations would do smaller updates which means you may be able to use a smaller timeout14:55
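(A minimal sketch of that preclone idea, assuming a hypothetical git_dir and repo layout; in practice the path comes from the merger's zuul.conf. The point is that a one-off clone run with a generous timeout leaves the cache warm, so routine merger operations only transfer deltas and fit in a smaller timeout.)

```python
# Sketch only: pre-populate a merger's git cache outside its normal operation.
# The git_dir value and the directory layout under it are assumptions here;
# take the real path from the [merger] section of zuul.conf.
import subprocess

GIT_DIR = "/var/lib/zuul/merger-git"  # assumed; read from zuul.conf in practice
REPOS = {
    "github.com/example-org/huge-legacy-repo":
        "https://github.com/example-org/huge-legacy-repo",
}

for rel_path, url in REPOS.items():
    dest = f"{GIT_DIR}/{rel_path}"
    # One-off full clone; an hour is an arbitrary, generous ceiling that does
    # not affect the merger's own (smaller) clone/fetch timeout.
    subprocess.run(["git", "clone", url, dest], check=True, timeout=3600)
```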
@jim:acmegating.comClark: i'm curious about your thoughts on this: the config object reuse stack has been running with the niz nodes (more ram).  i haven't seen the zk failures indicating memory pressure that we have been seeing (that is expected and good).  but it still has somewhere between a 75% and 90% success rate on tests.  i think adding the additional node types has, as is usual, exposed a new set of test races.  do you think we should keep going down that path and fix those races as they come up?  or should we see how that stack does without the niz nodes, now that we've cleaned up a bunch of issues on the 8gb nodes?16:41
@clarkb:matrix.orgcorvus: re zookeeper it occurred to me that we run zk on a tmpfs right? so memory pressure may affect us both for the disk-in-memory content as well as the process memory. As far as flakiness for niz nodes goes, yes I suspect that we're shaking out new races by running on raxflex more often (faster nodes). I've seen a number of tests with new races on those nodes in general. But also I guess more memory means less contention for shared resources which might generally speed things up.16:44
@clarkb:matrix.orgMaybe rebase the stack off of niz dogfooding and see if the memory pressure mid-stack is a big issue? If not we might be able to get away with the current nodes and treat these as two separate trains of thought rather than combining them?16:45
@clarkb:matrix.orgthat might make it easier to quantify the improvements from that stack16:45
@jim:acmegating.comsounds good to me.  i'm kind of curious too, even if we decide to put it back on niz.  :)16:47
@jim:acmegating.com * sounds good to me.  i'm kind of curious too (about current performance without niz), even if we decide to put it back on niz.  :)16:47
@fungicide:matrix.orgideally, tmpfs should perform no worse under memory pressure than a device-backed filesystem, since they both use the same fs cache model; only tmpfs has nowhere to sync its writes to or pull its reads from other than swap devices16:47
@fungicide:matrix.orgthough tmpfs can cause additional memory pressure, since it needs that cache16:47
@jim:acmegating.comfun fact, since zk is also an in-memory db, we're just storing 2 copies of everything in ram :)16:48
@jim:acmegating.comwonder if we could just have it write to /dev/null16:49
@fungicide:matrix.orghuh, so in theory it gets no performance benefits from tmpfs either16:49
@jim:acmegating.comhere comes the rebase16:49
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: 16:50
- [zuul/zuul] 945743: Remove tenant references from project parsing https://review.opendev.org/c/zuul/zuul/+/945743
- [zuul/zuul] 945744: Stop copying ProjectConfig objects when adding to layout https://review.opendev.org/c/zuul/zuul/+/945744
- [zuul/zuul] 945745: Check for nodeset label perms in layout https://review.opendev.org/c/zuul/zuul/+/945745
- [zuul/zuul] 945746: Check for job timeouts in layout https://review.opendev.org/c/zuul/zuul/+/945746
- [zuul/zuul] 945747: Resolve required-projects in layout https://review.opendev.org/c/zuul/zuul/+/945747
- [zuul/zuul] 945748: Resolve include-vars projects in layout https://review.opendev.org/c/zuul/zuul/+/945748
- [zuul/zuul] 945749: Check allowed-projects in layout https://review.opendev.org/c/zuul/zuul/+/945749
- [zuul/zuul] 945750: Move ignore_allowed_projects to job freeze https://review.opendev.org/c/zuul/zuul/+/945750
- [zuul/zuul] 945751: Move post_review setting to job freezing https://review.opendev.org/c/zuul/zuul/+/945751
- [zuul/zuul] 945752: Move base job check to layout https://review.opendev.org/c/zuul/zuul/+/945752
- [zuul/zuul] 945753: Resolve role references when freezing https://review.opendev.org/c/zuul/zuul/+/945753
- [zuul/zuul] 945754: Set playbook trusted flag when freezing https://review.opendev.org/c/zuul/zuul/+/945754
- [zuul/zuul] 945755: Determine job branch matchers when freezing https://review.opendev.org/c/zuul/zuul/+/945755
- [zuul/zuul] 945756: Remove trusted flag from source context https://review.opendev.org/c/zuul/zuul/+/945756
- [zuul/zuul] 945757: Move some pipeline attributes to manager https://review.opendev.org/c/zuul/zuul/+/945757
- [zuul/zuul] 945758: Move some manager attributes to pipeline https://review.opendev.org/c/zuul/zuul/+/945758
- [zuul/zuul] 945759: Move queue-related methods to PipelineState https://review.opendev.org/c/zuul/zuul/+/945759
- [zuul/zuul] 945760: Remove the manager attribute from pipelines https://review.opendev.org/c/zuul/zuul/+/945760
- [zuul/zuul] 945761: Remove layout.pipelines https://review.opendev.org/c/zuul/zuul/+/945761
- [zuul/zuul] 945762: Remove tenant attribute from pipeline https://review.opendev.org/c/zuul/zuul/+/945762
- [zuul/zuul] 945763: Move formatStatusJSON to manager https://review.opendev.org/c/zuul/zuul/+/945763
- [zuul/zuul] 945764: Move allowed triggers/reporter checks to layout https://review.opendev.org/c/zuul/zuul/+/945764
- [zuul/zuul] 945765: Remove tenant from configloader parse context https://review.opendev.org/c/zuul/zuul/+/945765
- [zuul/zuul] 945766: Refactor unprotected branch tests https://review.opendev.org/c/zuul/zuul/+/945766
- [zuul/zuul] 945767: Add error list to error accumulator https://review.opendev.org/c/zuul/zuul/+/945767
- [zuul/zuul] 945768: Store object errors separately https://review.opendev.org/c/zuul/zuul/+/945768
- [zuul/zuul] 945769: Add ConfigObjectCache https://review.opendev.org/c/zuul/zuul/+/945769
- [zuul/zuul] 945770: Reuse configuration objects https://review.opendev.org/c/zuul/zuul/+/945770
- [zuul/zuul] 945985: Remove unecessary code from manager._postConfig https://review.opendev.org/c/zuul/zuul/+/945985
@clarkb:matrix.orgcorvus: the dev null idea is actually not terrible for testing. Since we namespace everything anyway it's not like we're preserving data across test cases. Should mean that shutting down has very minimal impact16:50
@clarkb:matrix.orgfungi: the performance benefit of using tmpfs is that you are not writing to slow disk16:50
@fungicide:matrix.orgeatmydata might also be an option?16:50
@clarkb:matrix.orgyes I think eatmydata would also work16:50
@fungicide:matrix.orgdo zk writes block because they assume they're being backed by an actual physical device?16:51
@jim:acmegating.comwould eatmydata do much more than tmpfs for performance?16:51
@fungicide:matrix.orgif writes are sync then yes, that sounds painful16:51
@clarkb:matrix.orgcorvus: I think the upside to eatmydata is that it simply returns from sync calls immediately, so you actually still get things written to disk in most situations16:51
@clarkb:matrix.orgsort of a compromise between tmpfs and dev null16:51
@fungicide:matrix.orgi think eatmydata coupled with a normal fs would probably free up some memory pressure, if there is any16:52
@jim:acmegating.comeatmydata would still store the data, just not wait for sync, which doesn't help our ram pressure16:52
@jim:acmegating.comoh i see what you mean16:52
@jim:acmegating.comnormal disk (to reduce ram pressure) + eatmydata (to keep some performance benefit)16:52
@fungicide:matrix.orgzk writes would still go fast because they don't need to wait, but unlike tmpfs they won't reside in memory forever16:52
@clarkb:matrix.orgya exactly. dev null might be slightly faster as the underlying disk will still have to do the writes in most cases and other things will compete for those iops16:53
@jim:acmegating.comcould be.  though, we should probably quantify the problem before we spin too much on it.  the actual zk data size in tests is probably not huge.16:53
@fungicide:matrix.orgbut at worst it will be "normal" i/o contention, zk writes can still be non-blocking16:53
@fungicide:matrix.organd also the writes to disk can get batched up more efficiently16:54
@fungicide:matrix.orgoverall though, i doubt it would be that much different than using tmpfs and just allocating a bunch of additional swap16:55
@jim:acmegating.comi'll kick off a local test run and see what my zk tmpfs ends up at16:55
@fungicide:matrix.orgsince higher-priority memory use can cause infrequently-accessed tmpfs blocks to be paged out and not use the ram cache16:56
@jim:acmegating.com(and we are allocating some swap now after a recent Clark change)16:56
@jim:acmegating.comthat's probably the biggest reason to see what this stack does without niz16:56
@jim:acmegating.comer, based on that status page, i'm guessing npm is having a sad.  i'll just leave everything running and recheck the unit test jobs that hit that later.17:05
@jim:acmegating.comanecdata: i have noticed the build-image job failing more with npm errors recently.17:06
@jim:acmegating.com 40M   /data17:27
694M /datalog
@clarkb:matrix.org`/tmp/zookeeper/version-2` is what we mount the tmpfs to. Not sure which of those it covers17:40
@jim:acmegating.comprobably both; that's the default17:41
@clarkb:matrix.orgcorvus: so if we drop the tmpfs and use dev null or eatmydata we'd save 3/4 of a gig? that might be worthwhile to see if that resolves memory pressure problems on the smaller 8gb nodes?18:24
@jim:acmegating.comi agree.  not actually sure how to accomplish the devnull approach18:32
@jim:acmegating.comeatmydata is at least something we know how to do (and shouldn't confuse zk too much) so maybe that's the one to start with18:32
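(A rough sketch of what the eatmydata route could look like for a throwaway test ZooKeeper, with its data on regular disk instead of tmpfs. This is not Zuul's actual test setup; the zkServer.sh and config paths below are assumptions.)

```python
# Sketch only: run a disposable ZooKeeper with its data directory on regular
# disk (no tmpfs) and wrap the server in eatmydata so fsync()/fdatasync()
# become no-ops that return immediately. Paths are assumptions.
import pathlib
import subprocess

DATA_DIR = pathlib.Path("/opt/zk-test-data")          # regular disk, hypothetical
ZK_SERVER = "/usr/share/zookeeper/bin/zkServer.sh"    # assumed package path
CONFIG = pathlib.Path("/tmp/zoo-test.cfg")            # hypothetical

DATA_DIR.mkdir(parents=True, exist_ok=True)
CONFIG.write_text(
    "tickTime=2000\n"
    f"dataDir={DATA_DIR}\n"   # snapshots and txn logs both land here by default
    "clientPort=2181\n"
)

# eatmydata (libeatmydata) LD_PRELOADs stub sync calls, so ZK's writes are
# buffered by the page cache and flushed lazily instead of blocking on fsync,
# while the data still ends up on disk (unlike tmpfs or /dev/null).
subprocess.run(
    ["eatmydata", ZK_SERVER, "start-foreground", str(CONFIG)],
    check=True,
)
```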
@fungicide:matrix.orgyeah, i suspect telling zk to use something like a nullfs as its backend might cause problems if it thinks it should be able to read things from it at some point18:33
@jim:acmegating.comoh nullfs exists nice.  but i don't see it in noble, so logistically non-trivial18:36
@clarkb:matrix.orglooks like it's primarily a bsd thing. there is a fuse driver you can pull off of github but fuse isn't quick18:38
@fungicide:matrix.orgyeah, i was using it as a generic term18:39
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946085: Log reasons for missing image builds https://review.opendev.org/c/zuul/zuul/+/946085 20:37
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946086: Add REPL to zuul-launcher https://review.opendev.org/c/zuul/zuul/+/946086 20:43
@jim:acmegating.comClark: ^ those are in aid of tracking down why opendev's zuul-launcher appears to be building images more frequently than necessary20:44
@clarkb:matrix.orgcool let me see if I can review them quickly before the school run20:44
@clarkb:matrix.orgboth are straightforward +2 from me20:46
