Tuesday, 2024-01-30

-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 906596: Fix duplicate fd registration in nodescan https://review.opendev.org/c/zuul/nodepool/+/906596 00:34
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [zuul/zuul] 907149: Replace mock_kinesis with mock_aws https://review.opendev.org/c/zuul/zuul/+/907149 04:08
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 905465: Client (old): don't translate null to 0000000 https://review.opendev.org/c/zuul/zuul/+/905465 09:04
-@gerrit:opendev.org- Simon Westphahl proposed: 13:02
- [zuul/nodepool] 906815: Refactor config loading in builder and launcher https://review.opendev.org/c/zuul/nodepool/+/906815
- [zuul/nodepool] 907201: Improve logging around manual/scheduled image builds https://review.opendev.org/c/zuul/nodepool/+/907201
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 907210: Add backoff delay only after successful job start https://review.opendev.org/c/zuul/zuul/+/907210 14:17
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 907210: Add backoff delay only after successful job start https://review.opendev.org/c/zuul/zuul/+/907210 14:32
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 907109: Allow custom k8s pod specs https://review.opendev.org/c/zuul/nodepool/+/907109 15:01
@raaar:matrix.org Hi Folks, does anyone have any negative experience with squash-merging? We enabled it some time ago across several projects we use with Zuul and have recently discovered that under relatively moderate load (20-30 concurrent jobs per executor) it takes a very long time (up to a few hours) for an executor to prepare a workspace for a build. It looks like most of this slowdown comes from the repo.index.commit call in the squash_merge method. Has anyone seen anything similar and could perhaps suggest a way to solve this? 17:25
@clarkb:matrix.org raaar: how many mergers are you running? It is possible you need more to process the resulting git tasks. 17:28
@raaar:matrix.org We have 5 mergers and 5 executors (with merge_jobs enabled), so 10 in total. Whenever the squash_merge is done by a merger, it takes virtually no time... but on executor pods it's extremely slow... FWIW, we run those on shared k8s nodes with 32 cores 17:30
@clarkb:matrix.org are the executors all adjacent to each other on the same shared node(s) and the mergers elsewhere? could be a problem of contention (cpu, memory and disk io are all important for merges iirc). Determining why one is slow and not the other is probably helpful in finding a solution 17:32
@raaar:matrix.org Just double-checked: 5 of our 6 nodes each run an instance of an executor, and they might additionally run an instance of a merger, a launcher, a ZooKeeper, or a web shard. So a merger and an executor might share the same node... yet perform vastly differently. Also, we don't really see clear evidence of resource contention. 17:42
@clarkb:matrix.org I think the merge logs do record what actions are being taken (you might need to set log level to debug). That may have clues? 17:59
@raaar:matrix.org > <@clarkb:matrix.org> I think the merge logs do record what actions are being taken (you might need to set log level to debug). That may have clues? 18:26
Well, we added some additional print-debugging around here -- https://opendev.org/zuul/zuul/src/branch/master/zuul/merger/merger.py#L678 and all we see is that the call to `repo.index.commit` takes tens of minutes. We'll also try to enable logging in gitpython to see what actually goes on there...
IIUC, one major difference between executors and mergers is that the former are heavily multi-threaded and run multiple merge operations concurrently (one for each AnsibleJob they are starting), while the latter do a single merge at a time... I wonder if that could explain our observations?
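
The print-debugging and gitpython logging described above could be sketched roughly as follows. This is a minimal illustration, not Zuul's code: the `timed` helper, the `merge_timing` logger name, and `enable_git_command_logging` are hypothetical names, and the assumption that GitPython logs through loggers under the `git` namespace (and only logs executed commands when the `GIT_PYTHON_TRACE` environment variable is set before it is imported) should be checked against the installed GitPython version.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, used only for this sketch.
log = logging.getLogger("merge_timing")


@contextmanager
def timed(label, threshold=0.0):
    """Log how long the wrapped block took if it ran for at least
    `threshold` seconds. Uses a monotonic clock so wall-clock
    adjustments do not distort the measurement."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        if elapsed >= threshold:
            log.warning("%s took %.3fs", label, elapsed)


def enable_git_command_logging():
    """Lower the level of the parent "git" logger to DEBUG.

    Assumption: GitPython's modules create their loggers under the
    "git" namespace. Whether individual git commands are logged at all
    is assumed to depend on GIT_PYTHON_TRACE being set in the
    environment before GitPython is imported."""
    logging.getLogger("git").setLevel(logging.DEBUG)
```

Wrapping only the suspect call, for example `with timed("repo.index.commit"): ...`, keeps the measurement separate from the preceding merge step, so the two can be compared directly in the executor log.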
@jim:acmegating.com there is a mixture of threads and processes in the executor; most merge operations happen in their own process 18:30
@clarkb:matrix.org But merges occur one at a time per merger 18:32
@clarkb:matrix.org Pretty sure the run loop in mergers can't handle more than one at a time. But then ya they fork out to git for that 18:32
@clarkb:matrix.org looking at the squash merge code it does a merge first then a commit. I would've expected the merge to be the slow part then the commit is mostly bookkeeping. But maybe the bookkeeping actually causes the other work to happen? 18:38
@clarkb:matrix.org one crazy thing to try may be a tmpfs for those directories to try and rule out disk io problems 18:39
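
Before moving the directories to a tmpfs, a quick way to compare disk behaviour might be to time small synchronous writes in the executor's working directory and in a memory-backed directory. The sketch below is only a rough proxy under stated assumptions: `fsync_write_latency` is a hypothetical helper, it measures small write-plus-fsync latency rather than git itself, and the directories to compare (for example `/dev/shm` versus the executor's build directory) depend on the deployment.

```python
import os
import tempfile
import time


def fsync_write_latency(directory, files=50, size=4096):
    """Return the mean seconds per small write+fsync in `directory`.

    Files are created inside a scratch subdirectory that is removed
    afterwards, so the probed directory is left unchanged."""
    payload = os.urandom(size)
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        start = time.monotonic()
        for i in range(files):
            path = os.path.join(scratch, f"probe-{i}")
            with open(path, "wb") as f:
                f.write(payload)
                f.flush()
                # Force the data to the backing store before continuing.
                os.fsync(f.fileno())
        return (time.monotonic() - start) / files
```

If the two directories show similar latency while the commit is still slow on the executor, disk IO becomes a less likely explanation, although this check cannot rule it out on its own.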
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 906320: WIP Finish circular dependency refactor https://review.opendev.org/c/zuul/zuul/+/906320 18:46
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 906320: WIP Finish circular dependency refactor https://review.opendev.org/c/zuul/zuul/+/906320 18:47
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 906320: WIP Finish circular dependency refactor https://review.opendev.org/c/zuul/zuul/+/906320 19:27
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 907109: Allow custom k8s pod specs https://review.opendev.org/c/zuul/nodepool/+/907109 20:40
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 907109: Allow custom k8s pod specs https://review.opendev.org/c/zuul/nodepool/+/907109 20:43
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 907109: Allow custom k8s pod specs https://review.opendev.org/c/zuul/nodepool/+/907109 20:46
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 907255: Simplify k8s pod sleep https://review.opendev.org/c/zuul/nodepool/+/907255 20:52
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: 22:34
- [zuul/zuul] 906320: WIP Finish circular dependency refactor https://review.opendev.org/c/zuul/zuul/+/906320
- [zuul/zuul] 907256: Add --keep-config-cache option to delete-state command https://review.opendev.org/c/zuul/zuul/+/907256
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 905275: Optimize db prune query https://review.opendev.org/c/zuul/zuul/+/905275 22:39
