@newbie23:matrix.org | Thanks fungi Clark Albin Vass for your replies and inputs. | 07:08 |
---|---|---|
I will check out the changes for skipping the workspace checkout you posted earlier 🙇‍♂️ | ||
About caching, we already keep a git repo cache for workers and we try to keep it warm. | ||
The issue we are experiencing is with Zuul mergers when cloning the repos. | ||
> The mergers should all keep their git repo caches fairly warm though. So the biggest hit is the first time | ||
A bit more context: this is a legacy large repo that is not frequently updated, but when it is, the merger clone/fetch timeout expires. | ||
OFC we could keep increasing the timeout :), but we decided to look into why it was so slow. | ||
GitHub instance performance is OK (we use GitHub), and it turned out that the repo is just huge. | ||
We are taking steps to optimize the repo, but rewriting its history is a no-go, as it would break a lot of things. | ||
So I wanted to explore how to convince the Zuul merger to do a shallow clone, but Clark, you are right: ensuring enough history is present to allow correct merges in all circumstances might be challenging. | ||
Any other suggestions? | ||
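(A hedged sketch of one avenue: a blobless partial clone keeps the full commit graph, so merge-base computation still resolves, while deferring blob downloads. GitHub supports partial clone. This is not something stock Zuul mergers do themselves, so it is only a manual experiment for sizing the problem; `<org>/<repo>` is a placeholder.)

```sh
# Partial (blobless) clone: full history, blobs fetched lazily on demand.
git clone --filter=blob:none https://github.com/<org>/<repo>.git

# For comparison, a shallow clone truncates history, which is what makes
# correct merges in all circumstances risky:
git clone --depth 50 https://github.com/<org>/<repo>.git repo-shallow
```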
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 945795: Add limit-log-file-size role. https://review.opendev.org/c/zuul/zuul-jobs/+/945795 | 08:43 | |
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 945795: Add limit-log-file-size role. https://review.opendev.org/c/zuul/zuul-jobs/+/945795 | 09:10 | |
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 945795: Add limit-log-file-size role. https://review.opendev.org/c/zuul/zuul-jobs/+/945795 | 10:32 | |
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 945795: Add limit-log-file-size role. https://review.opendev.org/c/zuul/zuul-jobs/+/945795 | 10:39 | |
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul] 942432: Implement zuul-web OIDC endpoints https://review.opendev.org/c/zuul/zuul/+/942432 | 11:51 | |
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 946033: mirror-workspace-git-repos: Allow deleting current branch https://review.opendev.org/c/zuul/zuul-jobs/+/946033 | 12:40 | |
-@gerrit:opendev.org- Lukas Kranz proposed: [zuul/zuul-jobs] 946035: emit-job-header: add instance and region information https://review.opendev.org/c/zuul/zuul-jobs/+/946035 | 12:42 | |
@clarkb:matrix.org | You can preclone the repo for your mergers outside of your normal timeout period. Then subsequent operations would do smaller updates which means you may be able to use a smaller timeout | 14:55 |
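(A minimal sketch of that pre-cloning idea, with assumptions: the cache root and path layout below are illustrative, so check `[merger] git_dir` in your zuul.conf and the layout of an existing cached repo before using them.)

```sh
# Run as the same user the merger runs as, so ownership matches.
# GIT_ROOT and the <hostname>/<org>/<repo> layout are assumptions.
GIT_ROOT=/var/lib/zuul/merger-git
mkdir -p "$GIT_ROOT/github.com/<org>"
git clone https://github.com/<org>/<repo>.git \
    "$GIT_ROOT/github.com/<org>/<repo>"
```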
@jim:acmegating.com | Clark: i'm curious about your thoughts on this: the config object reuse stack has been running with the niz nodes (more ram). i haven't seen the zk failures indicating memory pressure that we have been seeing (that is expected and good). but it still has somewhere between a 75%-90% success rate on tests. i think adding the additional node types has, as is usual, exposed a new set of test races. do you think we should keep going down that path, and fixing those races as they come up? or should we see how that stack does without the niz nodes, now that we've cleaned up a bunch of issues on the 8gb nodes? | 16:41 |
@clarkb:matrix.org | corvus: re zookeeper it occurred to me that we run zk on a tmpfs right? so memory pressure may affect us both for the disk-in-memory content as well as the process memory. As far as flakiness for niz nodes goes, yes I suspect that we're shaking out new races by running on raxflex more often (faster nodes). I've seen a number of tests with new races on those nodes in general. But also I guess more memory means less contention for shared resources which might generally speed things up. | 16:44 |
@clarkb:matrix.org | Maybe rebase the stack off of niz dogfooding and see if the memory pressure mid-stack is a big issue? If not we might be able to get away with the current nodes and treat these as two separate trains of thought rather than combining them? | 16:45 |
@clarkb:matrix.org | that might make it easier to quantify the improvements from that stack | 16:45 |
@jim:acmegating.com | sounds good to me. i'm kind of curious too, even if we decide to put it back on niz. :) | 16:47 |
@jim:acmegating.com | * sounds good to me. i'm kind of curious too (about current performance without niz), even if we decide to put it back on niz. :) | 16:47 |
@fungicide:matrix.org | ideally, tmpfs should perform no worse under memory pressure than a device-backed filesystem, since they both use the same fs cache model; tmpfs just has nowhere to sync its writes to or pull its reads from other than swap devices | 16:47 |
@fungicide:matrix.org | though tmpfs can cause additional memory pressure, since it needs that cache | 16:47 |
@jim:acmegating.com | fun fact, since zk is also an in-memory db, we're just storing 2 copies of everything in ram :) | 16:48 |
@jim:acmegating.com | wonder if we could just have it write to /dev/null | 16:49 |
@fungicide:matrix.org | huh, so in theory it gets no performance benefits from tmpfs either | 16:49 |
@jim:acmegating.com | here comes the rebase | 16:49 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 16:50 | |
- [zuul/zuul] 945743: Remove tenant references from project parsing https://review.opendev.org/c/zuul/zuul/+/945743 | ||
- [zuul/zuul] 945744: Stop copying ProjectConfig objects when adding to layout https://review.opendev.org/c/zuul/zuul/+/945744 | ||
- [zuul/zuul] 945745: Check for nodeset label perms in layout https://review.opendev.org/c/zuul/zuul/+/945745 | ||
- [zuul/zuul] 945746: Check for job timeouts in layout https://review.opendev.org/c/zuul/zuul/+/945746 | ||
- [zuul/zuul] 945747: Resolve required-projects in layout https://review.opendev.org/c/zuul/zuul/+/945747 | ||
- [zuul/zuul] 945748: Resolve include-vars projects in layout https://review.opendev.org/c/zuul/zuul/+/945748 | ||
- [zuul/zuul] 945749: Check allowed-projects in layout https://review.opendev.org/c/zuul/zuul/+/945749 | ||
- [zuul/zuul] 945750: Move ignore_allowed_projects to job freeze https://review.opendev.org/c/zuul/zuul/+/945750 | ||
- [zuul/zuul] 945751: Move post_review setting to job freezing https://review.opendev.org/c/zuul/zuul/+/945751 | ||
- [zuul/zuul] 945752: Move base job check to layout https://review.opendev.org/c/zuul/zuul/+/945752 | ||
- [zuul/zuul] 945753: Resolve role references when freezing https://review.opendev.org/c/zuul/zuul/+/945753 | ||
- [zuul/zuul] 945754: Set playbook trusted flag when freezing https://review.opendev.org/c/zuul/zuul/+/945754 | ||
- [zuul/zuul] 945755: Determine job branch matchers when freezing https://review.opendev.org/c/zuul/zuul/+/945755 | ||
- [zuul/zuul] 945756: Remove trusted flag from source context https://review.opendev.org/c/zuul/zuul/+/945756 | ||
- [zuul/zuul] 945757: Move some pipeline attributes to manager https://review.opendev.org/c/zuul/zuul/+/945757 | ||
- [zuul/zuul] 945758: Move some manager attributes to pipeline https://review.opendev.org/c/zuul/zuul/+/945758 | ||
- [zuul/zuul] 945759: Move queue-related methods to PipelineState https://review.opendev.org/c/zuul/zuul/+/945759 | ||
- [zuul/zuul] 945760: Remove the manager attribute from pipelines https://review.opendev.org/c/zuul/zuul/+/945760 | ||
- [zuul/zuul] 945761: Remove layout.pipelines https://review.opendev.org/c/zuul/zuul/+/945761 | ||
- [zuul/zuul] 945762: Remove tenant attribute from pipeline https://review.opendev.org/c/zuul/zuul/+/945762 | ||
- [zuul/zuul] 945763: Move formatStatusJSON to manager https://review.opendev.org/c/zuul/zuul/+/945763 | ||
- [zuul/zuul] 945764: Move allowed triggers/reporter checks to layout https://review.opendev.org/c/zuul/zuul/+/945764 | ||
- [zuul/zuul] 945765: Remove tenant from configloader parse context https://review.opendev.org/c/zuul/zuul/+/945765 | ||
- [zuul/zuul] 945766: Refactor unprotected branch tests https://review.opendev.org/c/zuul/zuul/+/945766 | ||
- [zuul/zuul] 945767: Add error list to error accumulator https://review.opendev.org/c/zuul/zuul/+/945767 | ||
- [zuul/zuul] 945768: Store object errors separately https://review.opendev.org/c/zuul/zuul/+/945768 | ||
- [zuul/zuul] 945769: Add ConfigObjectCache https://review.opendev.org/c/zuul/zuul/+/945769 | ||
- [zuul/zuul] 945770: Reuse configuration objects https://review.opendev.org/c/zuul/zuul/+/945770 | ||
- [zuul/zuul] 945985: Remove unecessary code from manager._postConfig https://review.opendev.org/c/zuul/zuul/+/945985 | ||
@clarkb:matrix.org | corvus: the dev null idea is actually not terrible for testing. Since we namespace everything anyway, it's not like we're preserving data across test cases. Should mean that shutting down has very minimal impact | 16:50 |
@clarkb:matrix.org | fungi: the performance benefit of using tmpfs is that you are not writing to slow disk | 16:50 |
@fungicide:matrix.org | eatmydata might also be an option? | 16:50 |
@clarkb:matrix.org | yes I think eatmydata would also work | 16:50 |
@fungicide:matrix.org | do zk writes block because they assume they're being backed by an actual physical device? | 16:51 |
@jim:acmegating.com | would eatmydata do much more than tmpfs for performance? | 16:51 |
@fungicide:matrix.org | if writes are sync then yes, that sounds painful | 16:51 |
@clarkb:matrix.org | corvus: I think the upside to eatmydata is that it returns from sync calls immediately, so you actually get things written to disk in most situations | 16:51 |
@clarkb:matrix.org | sort of a compromise between tmpfs and dev null | 16:51 |
@fungicide:matrix.org | i think eatmydata coupled with a normal fs would probably free up some memory pressure, if there is any | 16:52 |
@jim:acmegating.com | eatmydata would still store the data, just not wait for sync, which doesn't help our ram pressure | 16:52 |
@jim:acmegating.com | oh i see what you mean | 16:52 |
@jim:acmegating.com | normal disk (to reduce ram pressure) + eatmydata (to keep some performance benefit) | 16:52 |
@fungicide:matrix.org | zk writes would still go fast because they don't need to wait, but unlike tmpfs they won't reside in memory forever | 16:52 |
@clarkb:matrix.org | ya exactly. dev null might be slightly faster, as the underlying disk will still have to do the writes in most cases and other things will compete for those iops | 16:53 |
@jim:acmegating.com | could be. though, we should probably quantify the problem before we spin too much on it. the actual zk data size in tests is probably not huge. | 16:53 |
@fungicide:matrix.org | but at worst it will be "normal" i/o contention, zk writes can still be non-blocking | 16:53 |
@fungicide:matrix.org | and also the writes to disk can get batched up more efficiently | 16:54 |
@fungicide:matrix.org | overall though, i doubt it would be that much different than using tmpfs and just allocating a bunch of additional swap | 16:55 |
@jim:acmegating.com | i'll kick off a local test run and see what my zk tmpfs ends up at | 16:55 |
@fungicide:matrix.org | since higher-priority memory use can cause infrequently-accessed tmpfs blocks to be paged out and not use the ram cache | 16:56 |
@jim:acmegating.com | (and we are allocating some swap now after a recent Clark change) | 16:56 |
@jim:acmegating.com | that's probably the biggest reason to see what this stack does without niz | 16:56 |
@jim:acmegating.com | er, based on that status page, i'm guessing npm is having a sad. i'll just leave everything running and recheck the unit test jobs that hit that later. | 17:05 |
@jim:acmegating.com | anecdata: i have noticed the build-image job failing more with npm errors recently. | 17:06 |
@jim:acmegating.com | 40M /data | 17:27 |
694M /datalog | ||
@clarkb:matrix.org | `/tmp/zookeeper/version-2` is what we mount the tmpfs to. Not sure which of those it covers | 17:40 |
@jim:acmegating.com | probably both; that's the default | 17:41 |
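(For reference, the stock ZooKeeper sample config behaves as described here; a hedged excerpt with illustrative paths:)

```
# zoo.cfg: with only dataDir set, snapshots and transaction logs both
# land in <dataDir>/version-2, so a tmpfs mounted there covers both.
dataDir=/tmp/zookeeper
# dataLogDir is unset by default; setting it would move the (much larger)
# transaction logs to <dataLogDir>/version-2, e.g. onto a regular disk:
#dataLogDir=/var/lib/zookeeper/txlog
```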
@clarkb:matrix.org | corvus: so if we drop the tmpfs and use dev null or eatmydata we'd save 3/4 of a gig? that might be worthwhile to see if that resolves memory pressure problems on the smaller 8gb nodes? | 18:24 |
@jim:acmegating.com | i agree. not actually sure how to accomplish the devnull approach | 18:32 |
@jim:acmegating.com | eatmydata is at least something we know how to do (and shouldn't confuse zk too much) so maybe that's the one to start with | 18:32 |
@fungicide:matrix.org | yeah, i suspect telling zk to use something like a nullfs as its backend might cause problems if it thinks it should be able to read things from it at some point | 18:33 |
@jim:acmegating.com | oh nullfs exists nice. but i don't see it in noble, so logistically non-trivial | 18:36 |
@clarkb:matrix.org | looks like it's primarily a bsd thing. there is a fuse driver you can pull off of github, but fuse isn't quick | 18:38 |
@fungicide:matrix.org | yeah, i was using it as a generic term | 18:39 |
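(A minimal sketch of the eatmydata approach the thread settles on, assuming a tarball ZooKeeper started via zkServer.sh; adjust for however the test ZooKeeper is actually launched.)

```sh
# eatmydata LD_PRELOADs libeatmydata, turning fsync()/fdatasync()/sync()
# into no-ops: ZooKeeper's txn-log syncs return immediately, while the
# data still reaches disk via normal writeback (unlike tmpfs or /dev/null).
# LD_PRELOAD propagates to the child JVM, so wrapping the launcher suffices.
sudo apt-get install -y eatmydata
eatmydata /opt/zookeeper/bin/zkServer.sh start-foreground
```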
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946085: Log reasons for missing image builds https://review.opendev.org/c/zuul/zuul/+/946085 | 20:37 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 946086: Add REPL to zuul-launcher https://review.opendev.org/c/zuul/zuul/+/946086 | 20:43 | |
@jim:acmegating.com | Clark: ^ those are in aid of tracking down why opendev's zuul-launcher appears to be building images more frequently than necessary | 20:44 |
@clarkb:matrix.org | cool let me see if I can review them quickly before the school run | 20:44 |
@clarkb:matrix.org | both are straightforward +2 from me | 20:46 |