-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 814679: Store FrozenJob data in separate znodes https://review.opendev.org/c/zuul/zuul/+/814679 | 01:45 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Simon Westphahl: | 01:45 | |
- [zuul/zuul] 814544: Cleanup stale items after refreshing a pipeline https://review.opendev.org/c/zuul/zuul/+/814544 | ||
- [zuul/zuul] 814570: Reference active change queues in pipeline state https://review.opendev.org/c/zuul/zuul/+/814570 | ||
- [zuul/zuul] 814571: Update pipeline state when modifying attributes https://review.opendev.org/c/zuul/zuul/+/814571 | ||
- [zuul/zuul] 814772: Allow passing extra attributes to ZKObject.fromZK https://review.opendev.org/c/zuul/zuul/+/814772 | ||
- [zuul/zuul] 814862: Bail out when a project moves between connections https://review.opendev.org/c/zuul/zuul/+/814862 | ||
- [zuul/zuul] 814773: Move re-enqueue to pipeline processing https://review.opendev.org/c/zuul/zuul/+/814773 | ||
- [zuul/zuul] 814899: Delete old build sets immediately https://review.opendev.org/c/zuul/zuul/+/814899 | ||
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/815089 | 05:12 | |
@iwienand:matrix.org | > <@gerrit:opendev.org> Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/815089 | 05:15 |
---|---|---|
Clark: corvus ... as you know I can never leave well enough alone. I remembered why it was hard to update, the actual thing that does the graphs and wraps around d3 seems abandoned -- meaning it's stuck on d3 v3 (it's up to 7 now). but ... i had some little fixes and half a conversion to bootstrap 5 i pushed on. that's all in the tree referenced there -- a sample is http://people.redhat.com/~iwienand/dstat-sample.html | ||
@iwienand:matrix.org | i've made an upstream pull request (https://github.com/Dabz/dstat_graph/pull/6) and we can see if that goes anywhere | 05:15 |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/815089 | 05:45 | |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/815089 | 06:01 | |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/815089 | 06:01 | |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/815089 | 06:22 | |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/815089 | 07:31 | |
-@gerrit:opendev.org- Simon Westphahl proposed: | 07:56 | |
- [zuul/zuul] 814773: Move re-enqueue to pipeline processing https://review.opendev.org/c/zuul/zuul/+/814773 | ||
- [zuul/zuul] 814899: Delete old build sets immediately https://review.opendev.org/c/zuul/zuul/+/814899 | ||
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 815111: wip: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/815111 | 12:21 | |
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 815111: wip: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/815111 | 12:33 | |
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 760807: UI: Add components page https://review.opendev.org/c/zuul/zuul/+/760807 | 12:36 | |
@westphahl:matrix.org | corvus: I started with the builds in Zookeeper in 815111 | 12:43 |
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 760807: UI: Add components page https://review.opendev.org/c/zuul/zuul/+/760807 | 12:53 | |
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 760808: WIP UI: Make components table filterable and add pagination https://review.opendev.org/c/zuul/zuul/+/760808 | 13:29 | |
-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 814695: ensure-docker: remove Debian Stretch testing https://review.opendev.org/c/zuul/zuul-jobs/+/814695 | 15:19 | |
@jim:acmegating.com | Clark, tobiash, swest: opendev had a zk connection issue several hours ago and zuul hit the "RuntimeError: can't start new thread" issue again. i'm wondering if we have many thousands of data watches, and on a connection event, each of them spawns a thread to refresh | 15:35 |
@tobias.henkel:matrix.org | A thread per watch? That would be insane | 15:37 |
@tobias.henkel:matrix.org | But I guess that exception points towards this theory | 15:38 |
@tobias.henkel:matrix.org | Is there a stack trace in the logs so we see where it gets created? | 15:39 |
@jim:acmegating.com | thousands of these: https://paste.opendev.org/show/810171/ | 15:40 |
@jim:acmegating.com | i'm assuming each one corresponds to a different watch object | 15:40 |
@jim:acmegating.com | https://github.com/python-zk/kazoo/blob/master/kazoo/handlers/threading.py | 15:43 |
@jim:acmegating.com | maybe we need a handler with a spawn method that uses a threadpoolexecutor? | 15:44 |
@clarkb:matrix.org | similar to jenkins OOMing on stack space due to leaked threads for ssh connections | 15:47 |
@jim:acmegating.com | i think that would be pretty easy for us to do; the kazoo api appears to be set up for that. if we like that idea, i can implement it shortly. | 15:47 |
@clarkb:matrix.org | that seems like a reasonable change particularly if it doesn't require significant api rewrites (so easy to understand and review) | 15:48 |
@jim:acmegating.com | Clark: i don't think there's a leak here, i think it's just a design mismatch. kazoo uses threads for "long-running" async background tasks, and authors probably didn't anticipate this kind of flood of background tasks on reconnect. | 15:48 |
@jim:acmegating.com | Clark: yeah, i'm thinking on the order of tens of lines, and no api shenanigans | 15:49 |
@clarkb:matrix.org | right we aren't really leaking ourselves but the fallout is similar. You can have plenty of memory overall but the stack is typically allocated a small portion of that which can be filled with threads and all their separate stacks | 15:50 |
@tobias.henkel:matrix.org | A handler using a thread pool sounds useful | 15:57 |
@jim:acmegating.com | cool, i'll get on that now | 15:58 |
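The thread-pool idea discussed above can be illustrated with a minimal, standard-library-only sketch. It is not the change that was later proposed and does not import kazoo: `PooledSpawner` and `simulate_reconnect` are hypothetical names, and the only assumption carried over from the discussion is that the handler exposes a `spawn(func, *args, **kwargs)` method. The point is that a flood of callbacks is queued on a fixed number of worker threads instead of starting one thread per callback.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PooledSpawner:
    """Hypothetical stand-in for a handler's spawn logic (not kazoo code)."""

    def __init__(self, max_workers=10):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def spawn(self, func, *args, **kwargs):
        # Queue the callable on the shared pool instead of starting a thread.
        return self._pool.submit(func, *args, **kwargs)

    def stop(self):
        # Wait for queued work to finish, then release the worker threads.
        self._pool.shutdown(wait=True)


def simulate_reconnect(num_watches=5000, max_workers=10):
    """Fire one callback per watch; return (callbacks run, threads used)."""
    seen = set()
    lock = threading.Lock()

    def refresh():
        # Record which thread ran this callback.
        with lock:
            seen.add(threading.get_ident())

    spawner = PooledSpawner(max_workers=max_workers)
    futures = [spawner.spawn(refresh) for _ in range(num_watches)]
    for future in futures:
        future.result()
    spawner.stop()
    return len(futures), len(seen)


if __name__ == "__main__":
    callbacks, threads = simulate_reconnect()
    print(f"{callbacks} callbacks ran on {threads} threads")
```

With this shape, the number of threads is bounded by `max_workers` no matter how many watches fire at once; the trade-off is that callbacks wait in a queue when all workers are busy.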
-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 812273: ensure-rust: verify cryptography build on Ubuntu https://review.opendev.org/c/zuul/zuul-jobs/+/812273 | 16:13 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815135: Use a ThreadPoolExecutor for kazoo callbacks https://review.opendev.org/c/zuul/zuul/+/815135 | 16:29 | |
@jim:acmegating.com | Clark, tobiash: ^ how does that look? | 16:29 |
@tobias.henkel:matrix.org | lgtm | 16:38 |
@jim:acmegating.com | tobiash: can you take a quick look at https://review.opendev.org/815072 ? | 16:42 |
@jim:acmegating.com | i think you added that flag originally, but i think it's not necessary now? | 16:42 |
@clarkb:matrix.org | I'm trying to keep an eye on a deployment right now but will review it while the less interesting playbooks execute | 16:44 |
@clarkb:matrix.org | that is remarkably straightforward. I have approved it | 16:47 |
@jim:acmegating.com | Clark, tobiash: it's going to fail testing... i'm looking into what else we need | 17:03 |
@jim:acmegating.com | on the dstat front: we need to install actual dstat on bionic | 17:05 |
@jim:acmegating.com | okay, back to kazoo -- one thing i see is that TreeCaches spawn a background thread; so every treecache we use will eat up another slot. we could either just make the max workers "large enough" or we could do a bit more work to make that a real thread instead of using the threadpool. | 17:07 |
@jim:acmegating.com | if we wanted to do the latter, one way to accomplish that would be to vendor/subclass/etc some more things so that we use a different method for spawning long-running vs throwaway actions. | 17:08 |
@jim:acmegating.com | another way to do it would be to whitelist the callables inside our spawn method and decide what to do with them. it's less intrusive, but maybe more fragile. | 17:09 |
@clarkb:matrix.org | check the spawn action type then decide if it is spawned in a threadpool or a real thread? that might be more future proof? (if we forget to bump the count) | 17:09 |
@jim:acmegating.com | yeah, that's what i'm thinking. like "check if func == <bound method TreeCache.dobackground of <kazoo.recipe.cache.TreeCache object at 0x7f3dc8d11910>>" | 17:09 |
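The "check the callable inside spawn" option described above could look roughly like the following sketch. Everything here is an assumption for illustration: `FakeTreeCache`, `DispatchingSpawner`, and the `LONG_RUNNING` set are hypothetical, and nothing is imported from kazoo. Callables whose qualified name is whitelisted get a real daemon thread; everything else goes to a bounded pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeTreeCache:
    """Hypothetical stand-in for a class with a long-running background loop."""

    def do_background(self, stop_event):
        # Block until asked to stop, like a long-lived worker would.
        stop_event.wait()


class DispatchingSpawner:
    # Qualified names treated as long-running; this list is an assumption.
    LONG_RUNNING = {"FakeTreeCache.do_background"}

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def spawn(self, func, *args, **kwargs):
        # Bound methods expose the underlying function's __qualname__.
        name = getattr(func, "__qualname__", "")
        if name in self.LONG_RUNNING:
            # Long-running work gets its own thread so it never holds a pool slot.
            thread = threading.Thread(
                target=func, args=args, kwargs=kwargs, daemon=True)
            thread.start()
            return thread
        # Short, throwaway work shares the bounded pool.
        return self._pool.submit(func, *args, **kwargs)

    def stop(self):
        self._pool.shutdown(wait=True)
```

As noted in the conversation, matching on names is less intrusive than subclassing but more fragile: a renamed method would silently fall through to the pool.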
@jim:acmegating.com | okay, one more idea is to go ahead and change the session_watchers to not use spawn | 17:12 |
@jim:acmegating.com | that might be better and not hard since we're already vendoring those classes anyway | 17:12 |
@jim:acmegating.com | okay, i'm leaning toward that. the only problematic spawns in the entire codebase afaict are in watchers, and we vendor that whole file | 17:14 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815135: Use a ThreadPoolExecutor for kazoo callbacks https://review.opendev.org/c/zuul/zuul/+/815135 | 17:19 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815142: Update dstat to support bionic and others https://review.opendev.org/c/zuul/zuul-jobs/+/815142 | 17:25 | |
@jim:acmegating.com | ianw: https://review.opendev.org/815089 looks great thanks! | 17:26 |
@clarkb:matrix.org | I see the normal spawn method will continue to be called in zk for everything but the watches then watches which we have overridden run short_spawn | 17:31 |
@clarkb:matrix.org | lgtm | 17:31 |
@jim:acmegating.com | cool, it passes tests locally. i also tried some CLI commands to make sure the shutdown procedure wasn't whack. so i think that's good. | 17:32 |
@jim:acmegating.com | tobiash: if you're around for another look at https://review.opendev.org/815135 that'd be great | 17:33 |
@jim:acmegating.com | Clark: https://zuul.opendev.org/t/zuul/build/8b9609645e2c4f7da4ea4658e84b3ab8/console didn't run the Ubuntu-18.04.yaml file... do you grok why? | 17:49 |
@jim:acmegating.com | oh it's a - not a ... | 17:49 |
@jim:acmegating.com | we appear to be inconsistent with that :/ | 17:49 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815142: Update dstat to support bionic and others https://review.opendev.org/c/zuul/zuul-jobs/+/815142 | 17:54 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815077: Run dstat in tox jobs https://review.opendev.org/c/zuul/zuul/+/815077 | 17:54 | |
@clarkb:matrix.org | corvus: you did zk_distro_os and zj_distro_os in different places | 17:56 |
@jim:acmegating.com | the mind is a terrible thing | 17:56 |
@clarkb:matrix.org | typing muscle memory is crazy | 17:56 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815142: Update dstat to support bionic and others https://review.opendev.org/c/zuul/zuul-jobs/+/815142 | 17:56 | |
@jim:acmegating.com | 815142 still doesn't seem to actually run the right file... | 19:02 |
@jim:acmegating.com | omg is major_version just 18? | 19:03 |
@jim:acmegating.com | have we ever done separate tasks or variables in zuul-jobs for ubuntu bionic / focal? | 19:04 |
@jim:acmegating.com | okay, i see what we do in ensure-podman | 19:05 |
@jim:acmegating.com | i feel like having every instance of this pattern being different from what's in the docs is counter-productive. i thought it would be best to copy the docs, but there are zero instances where we actually do what the docs say, so of course that didn't work :( | 19:06 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815142: Update dstat to support bionic and others https://review.opendev.org/c/zuul/zuul-jobs/+/815142 | 19:07 | |
@clarkb:matrix.org | corvus: minor thing on the dstat change. I think it is sufficient for your purposes now but you were talking about being consistent there so I noted it | 19:50 |
@jim:acmegating.com | oh yeah i should fix that. will do after i check in on results | 19:51 |
@jim:acmegating.com | yay it worked | 19:51 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815142: Update dstat to support bionic and others https://review.opendev.org/c/zuul/zuul-jobs/+/815142 | 19:51 | |
@jim:acmegating.com | Clark: okay that should be the final version | 19:51 |
@clarkb:matrix.org | Also I don't know why "file extension" didn't occur to me when writing that comment | 19:52 |
@jim:acmegating.com | the mind is a terrible thing | 19:53 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 19:53 | |
- [zuul/zuul] 815077: Run dstat in tox jobs https://review.opendev.org/c/zuul/zuul/+/815077 | ||
- [zuul/zuul] 814684: DNM: Increase unit test job timeout to 2h https://review.opendev.org/c/zuul/zuul/+/814684 | ||
- [zuul/zuul] 814685: DNM: Test unit tests on larger nodes https://review.opendev.org/c/zuul/zuul/+/814685 | ||
@jim:acmegating.com | now hopefully that gets us some real data | 19:53 |
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul] 814848: Add addtional checks to key deletion testing https://review.opendev.org/c/zuul/zuul/+/814848 | 20:11 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 815135: Use a ThreadPoolExecutor for kazoo callbacks https://review.opendev.org/c/zuul/zuul/+/815135 | 20:11 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815154: Update test_inventory to be ZK-friendly https://review.opendev.org/c/zuul/zuul/+/815154 | 21:00 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Simon Westphahl: [zuul/zuul] 815111: wip: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/815111 | 21:00 | |
@jim:acmegating.com | Clark: since you were asking about this earlier this week -- i think 815111 is the last major change in the "store pipeline state in zk" series. i'm working on finishing it up now. i think with that done, it will make sense to review that stack for real next week. it's long, but it's basically just a few concepts applied iteratively to each of the model objects starting with pipelines going all the way down to builds. | 21:02 |
@jim:acmegating.com | obviously there is the minor impediment of the test jobs failing, but i have been running boatloads of tests locally which reliably pass, so i'm optimistic we can figure something out. i don't see that as a blocker for reviewing (only merging). | 21:03 |
@clarkb:matrix.org | corvus: noted, I'll plan to review the stack | 21:36 |
@clarkb:matrix.org | also note that matrix.org just had an extended outage due to a database hardware failure. Things might be catching up communication wise | 21:36 |
@jim:acmegating.com | Clark: interesting; have a link, or timestamps? (because i hadn't noticed) | 21:45 |
@jim:acmegating.com | i guess https://status.matrix.org/ | 21:45 |
@jim:acmegating.com | looks like 21:23 | 21:46 |
@clarkb:matrix.org | corvus: ya thats probably the best link. My oftc bridge lost almost an hour of messages | 21:47 |
@jpew:matrix.org | I've noticed that when an ansible task reruns due to an `until` the logger doesn't show any output from the subsequent runs. anyone else seen this? (I am using zuul that is a few months old) | 21:56 |
@jpew:matrix.org | * I've noticed that when an ansible task reruns due to an `until` the logger doesn't show any output from the subsequent (re)runs. anyone else seen this? (I am using zuul that is a few months old) | 21:56 |
@jpew:matrix.org | It picks up again just fine on the next task.... just makes it look like something is hung (esp. when the task rerunning takes a long time) | 21:56 |
@clarkb:matrix.org | jpew: hrm the way that all works is a bit of hacky magic. It wouldn't surprise me if it is buggy around retries | 22:04 |
@clarkb:matrix.org | trying to remember the details I think it starts a daemon that runs the entirety of the playbook run. Then command and shell modules are swapped out to write to a socket that daemon has held open | 22:05 |
@clarkb:matrix.org | I wonder if ansible is loading the actual command and shell modules on the retry and not the overloaded versions | 22:06 |
@clarkb:matrix.org | In the past we had problems with the rsync module because ansible didn't reset state or replace the class between retries and that led to state leakage between attempts. I wonder if they fixed that more broadly by creating new module instances for each attempt and that is breaking the retries | 22:07 |
@jim:acmegating.com | aha! https://zuul.opendev.org/t/zuul/build/c85eaeff4bee4b4b9c2dfd456dc203a9/log/zookeeper-root-server-ubuntu-bionic-32GB-airship-kna1-0027062402.out#5576 | 22:14 |
@jim:acmegating.com | our zk server is running out of disk space (which probably means running out of tmpfs space?) | 22:14 |
@jim:acmegating.com | in tests -- to be clear | 22:15 |
@clarkb:matrix.org | oh ya since they use a tmpfs | 22:15 |
@clarkb:matrix.org | are those jobs using the 64MB tmpfs? | 22:15 |
@jim:acmegating.com | at least... i think we have zk use a tmpfs? | 22:15 |
@clarkb:matrix.org | it seems we don't bound them in the docker-compose tmpfs but we aren't using that in CI iirc | 22:16 |
@jim:acmegating.com | suddenly i'm not sure... i see we set up tmpfs for zuul | 22:16 |
@jim:acmegating.com | yeah, i think we rely on the zuul-jobs role to set up zk | 22:16 |
@clarkb:matrix.org | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-zookeeper/tasks/main.yaml#L36 | 22:17 |
@clarkb:matrix.org | that appears to be what it is from the job log | 22:17 |
@jim:acmegating.com | you are 2 seconds faster than i am :) | 22:17 |
@jim:acmegating.com | so.... hrm.... let me see what my local zk tmpfs usage is | 22:18 |
@jim:acmegating.com | tmpfs 32G 530M 31G 2% /data | 22:19 |
@jim:acmegating.com | tmpfs 32G 1.7G 30G 6% /datalog | 22:19 |
@jim:acmegating.com | that's from a handful of test runs; i can clean it and do a single run to get better numbers | 22:19 |
@clarkb:matrix.org | You might be able to tell zk to not keep old copies of data around? | 22:20 |
@clarkb:matrix.org | I want to say it keeps all versions by default but that is configurable? something like that | 22:20 |
@jim:acmegating.com | hrm... the autopurge interval is configured in hours :/ | 22:25 |
@clarkb:matrix.org | ya I was just going to paste that | 22:25 |
@clarkb:matrix.org | but it is disabled by default | 22:25 |
@clarkb:matrix.org | which means no purging even after an hour | 22:25 |
@clarkb:matrix.org | might be worth setting the interval to 1 and the retain count to 1 anyway? | 22:26 |
@jim:acmegating.com | well, if we're considering using larger nodes for the cpus; we're going to get gobs of ram for free. maybe we should just unlimit the tmpfs? | 22:26 |
@jim:acmegating.com | Clark: well, i'd like to get this back under 1hr. | 22:26 |
@clarkb:matrix.org | ah | 22:27 |
@clarkb:matrix.org | also are both /data and /datalog synchronous? or can you get away with one or the other being on disk? | 22:27 |
@jim:acmegating.com | i think the log is probably more performance sensitive | 22:29 |
@jim:acmegating.com | i think that's synchronous, and the datadir is for snapshots which only happen every 100,000 transactions | 22:29 |
@jim:acmegating.com | or 10,000 depending on versions/config | 22:29 |
@clarkb:matrix.org | got it. Might be another option to reduce what is in the tmpfs too | 22:29 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815163: ensure-zookeeper: eliminate the tmpfs size limit https://review.opendev.org/c/zuul/zuul-jobs/+/815163 | 22:32 | |
@jim:acmegating.com | Clark: ^ that's the way i'm leaning right now, but i'm totally open to the idea that we can/should make it a role variable. technically it's a backwards incompat change, but it's probably not going to break anyone | 22:33 |
@jim:acmegating.com | tmpfs 32G 84M 32G 1% /data | 22:33 |
@jim:acmegating.com | tmpfs 32G 556M 31G 2% /datalog | 22:33 |
@jim:acmegating.com | those are the values after a single run of the full test suite | 22:33 |
@clarkb:matrix.org | I +2'd because I don't think failing due to a lack of memory is any worse than failing due to filling the tmpfs | 22:34 |
@clarkb:matrix.org | and this at least has a chance of not filling | 22:34 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Simon Westphahl: [zuul/zuul] 815111: wip: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/815111 | 22:34 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 22:41 | |
- [zuul/zuul] 815077: Run dstat in tox jobs https://review.opendev.org/c/zuul/zuul/+/815077 | ||
- [zuul/zuul] 815167: Use a larger nodeset for unit tests https://review.opendev.org/c/zuul/zuul/+/815167 | ||
@jim:acmegating.com | okay with any luck we should see the tests on the end of the zk stack run to completion on large nodes with dstat graphs | 22:41 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815175: Add reno about zk disconnect thread fix https://review.opendev.org/c/zuul/zuul/+/815175 | 23:36 | |
@jim:acmegating.com | Clark, fungi: ^ could we go ahead and approve that? so that if the opendev restart goes well, we can just tag and release that on monday? | 23:37 |
@clarkb:matrix.org | looking | 23:38 |
@clarkb:matrix.org | It is probably fine to self approve that one after my +2 since it is just a release note | 23:39 |