Friday, 2021-10-22

-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 814679: Store FrozenJob data in separate znodes https://review.opendev.org/c/zuul/zuul/+/81467901:45
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Simon Westphahl:01:45
- [zuul/zuul] 814544: Cleanup stale items after refreshing a pipeline https://review.opendev.org/c/zuul/zuul/+/814544
- [zuul/zuul] 814570: Reference active change queues in pipeline state https://review.opendev.org/c/zuul/zuul/+/814570
- [zuul/zuul] 814571: Update pipeline state when modifying attributes https://review.opendev.org/c/zuul/zuul/+/814571
- [zuul/zuul] 814772: Allow passing extra attributes to ZKObject.fromZK https://review.opendev.org/c/zuul/zuul/+/814772
- [zuul/zuul] 814862: Bail out when a project moves between connections https://review.opendev.org/c/zuul/zuul/+/814862
- [zuul/zuul] 814773: Move re-enqueue to pipeline processing https://review.opendev.org/c/zuul/zuul/+/814773
- [zuul/zuul] 814899: Delete old build sets immediately https://review.opendev.org/c/zuul/zuul/+/814899
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/81508905:12
@iwienand:matrix.org> <@gerrit:opendev.org> Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch  https://review.opendev.org/c/zuul/zuul-jobs/+/81508905:15
Clark: corvus ... as you know I can never leave well enough alone. I remembered why it was hard to update, the actual thing that does the graphs and wraps around d3 seems abandoned -- meaning its stuck on d3 v3 (it's up to 7 now). but ... i had some little fixes and half a conversion to bootstrap 5 i pushed on. that's all in the tree referenced there -- a sample is http://people.redhat.com/~iwienand/dstat-sample.html
@iwienand:matrix.orgi've made an upstream pull request (https://github.com/Dabz/dstat_graph/pull/6) and we can see if that goes anywhere05:15
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/81508905:45
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/81508906:01
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/81508906:01
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/81508906:22
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 815089: ensure-dstat-graph: pull updated branch https://review.opendev.org/c/zuul/zuul-jobs/+/81508907:31
-@gerrit:opendev.org- Simon Westphahl proposed:07:56
- [zuul/zuul] 814773: Move re-enqueue to pipeline processing https://review.opendev.org/c/zuul/zuul/+/814773
- [zuul/zuul] 814899: Delete old build sets immediately https://review.opendev.org/c/zuul/zuul/+/814899
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 815111: wip: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/81511112:21
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 815111: wip: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/81511112:33
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 760807: UI: Add components page https://review.opendev.org/c/zuul/zuul/+/76080712:36
@westphahl:matrix.orgcorvus: I started with the builds in Zookeeper in 81511112:43
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 760807: UI: Add components page https://review.opendev.org/c/zuul/zuul/+/76080712:53
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 760808: WIP UI: Make components table filterable and add pagination https://review.opendev.org/c/zuul/zuul/+/76080813:29
-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 814695: ensure-docker: remove Debian Stretch testing https://review.opendev.org/c/zuul/zuul-jobs/+/81469515:19
@jim:acmegating.comClark, tobiash, swest: opendev had a zk connection issue several hours ago and zuul hit the "RuntimeError: can't start new thread" issue again.  i'm wondering if we have many thousands of data watches, and on a connection event, each of them spawns a thread to refresh15:35
@tobias.henkel:matrix.orgA thread per watch? That would be insane15:37
@tobias.henkel:matrix.orgBut I guess that exception points towards this theory15:38
@tobias.henkel:matrix.orgIs there a stack trace in the logs so we see where it gets created?15:39
@jim:acmegating.comthousands of these: https://paste.opendev.org/show/810171/15:40
@jim:acmegating.comi'm assuming each one corresponds to a different watch object15:40
@jim:acmegating.comhttps://github.com/python-zk/kazoo/blob/master/kazoo/handlers/threading.py15:43
@jim:acmegating.commaybe we need a handler with a spawn method that uses a threadpoolexecutor?15:44
@clarkb:matrix.orgsimilar to jenkins OOMing on stack space due to leaked threads for ssh connections15:47
@jim:acmegating.comi think that would be pretty easy for us to do; the kazoo api appears to be set up for that.  if we like that idea, i can implement it shortly.15:47
@clarkb:matrix.orgthat seems like a reasonable change particularly if it doesn't require significant api rewrites (so easy to udnerstand and review)15:48
@jim:acmegating.comClark: i don't think there's a leak here, i think it's just a design mismatch.  kazoo uses threads for "long-running" async background tasks, and authors probably didn't anticipate this kind of flood of background tasks on reconnect.15:48
@jim:acmegating.comClark: yeah, i'm thinking on the order of tens of lines, and no api shenanigans15:49
@clarkb:matrix.orgright we aren't really leaking ourselves but the fallout is similar. YOu can have plenty of memory overall but the stack is typically allocated a small portion of that which can be fileld with threads and all their separate stacks15:50
@tobias.henkel:matrix.orgA handler using a thread pool sounds useful15:57
@jim:acmegating.comcool, i'll get on that now15:58
-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 812273: ensure-rust: verify cryptography build on Ubuntu https://review.opendev.org/c/zuul/zuul-jobs/+/81227316:13
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815135: Use a ThreadPoolExecutor for kazoo callbacks https://review.opendev.org/c/zuul/zuul/+/81513516:29
@jim:acmegating.comClark, tobiash: ^ how does that look?16:29
@tobias.henkel:matrix.orglgtm16:38
@jim:acmegating.comtobiash: can you take a quick look at https://review.opendev.org/815072 ?16:42
@jim:acmegating.comi think you added that flag originally, but i think it's not necessary now?16:42
@clarkb:matrix.orgI'm trying to keep an eye on a deployment right now but will review it while the less interesting playbooks execute16:44
@clarkb:matrix.orgthat is remarkably straightforward. I have approved it16:47
@jim:acmegating.comClark, tobiash: it's going to fail testing... i'm looking into what else we need17:03
@jim:acmegating.comon the dstat front: we need to instal actual dstat on bionic17:05
@jim:acmegating.comokay, back to kazoo -- one thing i see is that TreeCaches spawn a background thread; so every treecache we use will eat up another slot.   we could either just make the max workers "large enough" or we could do a bit more work to make that a real thread instead of using the threadpool.17:07
@jim:acmegating.comif we wanted to the latter, one way to accomplish that would be to vendor/subclass/etc some more things so that we use a different method for spawning long-running vs throwaway actions.17:08
@jim:acmegating.comanother way to do it would be to whitelist the callables inside our spawn method and decide what to do with them.  it's less intrusive, but maybe more fragile.17:09
@clarkb:matrix.orgcheck the spawn action type then decide if it is spawned in a threadpool or a real thread? that might be more future proof? (if we forget to bump the count)17:09
@jim:acmegating.comyeah, that's what i'm thinking.  like "check if func == <bound method TreeCache.dobackground of <kazoo.recipe.cache.TreeCache object at 0x7f3dc8d11910>>"17:09
@jim:acmegating.comokay, one more idea is to go ahead and change the session_watchers to not use spawn17:12
@jim:acmegating.comthat might be better and not hard since we're already vendoring those classes anyway17:12
@jim:acmegating.comokay, i'm leaning toward that.  the only problematic spawns in the entire codebase afaict are in watchers, and we vendor that whole file17:14
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815135: Use a ThreadPoolExecutor for kazoo callbacks https://review.opendev.org/c/zuul/zuul/+/81513517:19
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815142: Update dstat to support bionic and others https://review.opendev.org/c/zuul/zuul-jobs/+/81514217:25
@jim:acmegating.comianw: https://review.opendev.org/815089 looks great thanks!17:26
@clarkb:matrix.orgI see the normal spawn method will continue to be called in zk for everything but the watches then watches which we have overridden run short_spawn17:31
@clarkb:matrix.orglgtm17:31
@jim:acmegating.comcool, it passes tests locally.  i also tried some CLI commands to make sure the shutdown procedure wasn't whack.  so i think that's good.17:32
@jim:acmegating.comtobiash: if you're around for another look at https://review.opendev.org/815135 that'd be great17:33
@jim:acmegating.comClark: https://zuul.opendev.org/t/zuul/build/8b9609645e2c4f7da4ea4658e84b3ab8/console didn't run the Ubuntu-18.04.yaml file... do you grok why?17:49
@jim:acmegating.comoh it's a - not a ...17:49
@jim:acmegating.comwe appear to be inconsistent with that :/17:49
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815142: Update dstat to support bionic and others https://review.opendev.org/c/zuul/zuul-jobs/+/81514217:54
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815077: Run dstat in tox jobs https://review.opendev.org/c/zuul/zuul/+/81507717:54
@clarkb:matrix.orgcorvus: you did zk_distro_os and zj_distro_os in different places17:56
@jim:acmegating.comthe mind is a terrible thing17:56
@clarkb:matrix.orgtyping muscle memory is crazy17:56
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815142: Update dstat to support bionic and others https://review.opendev.org/c/zuul/zuul-jobs/+/81514217:56
@jim:acmegating.com815142 still doesn't seem to actually run the right file...19:02
@jim:acmegating.comomg is major_version just 18?19:03
@jim:acmegating.comhave we ever done separate tasks or variables in zuul-jobs for ubuntu bionic / focal?19:04
@jim:acmegating.comokay, i see what we do in ensure-podman19:05
@jim:acmegating.comi feel like having every instance of this pattern being different from what's in the docs is counter-productive.  i thought it would be best to copy the docs, but there are zero instances where we actually do what the docs say, so of course that didn't work :(19:06
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815142: Update dstat to support bionic and others https://review.opendev.org/c/zuul/zuul-jobs/+/81514219:07
@clarkb:matrix.orgcorvus: minor thing on the dstat change. I think it is sufficient for your purposes now but you were talking about being consistent there so I noted it19:50
@jim:acmegating.comoh yeah i should fix that.  will do after i check in on results19:51
@jim:acmegating.comyay it worked19:51
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815142: Update dstat to support bionic and others https://review.opendev.org/c/zuul/zuul-jobs/+/81514219:51
@jim:acmegating.comClark: okay that should be the final version19:51
@clarkb:matrix.orgAlso I don't know why "file extension" didn't occur to me when writing that comment19:52
@jim:acmegating.comthe mind is a terrible thing19:53
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:19:53
- [zuul/zuul] 815077: Run dstat in tox jobs https://review.opendev.org/c/zuul/zuul/+/815077
- [zuul/zuul] 814684: DNM: Increase unit test job timeout to 2h https://review.opendev.org/c/zuul/zuul/+/814684
- [zuul/zuul] 814685: DNM: Test unit tests on larger nodes https://review.opendev.org/c/zuul/zuul/+/814685
@jim:acmegating.comnow hopefully that gets us some real data19:53
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul] 814848: Add addtional checks to key deletion testing https://review.opendev.org/c/zuul/zuul/+/81484820:11
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 815135: Use a ThreadPoolExecutor for kazoo callbacks https://review.opendev.org/c/zuul/zuul/+/81513520:11
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815154: Update test_inventory to be ZK-friendly https://review.opendev.org/c/zuul/zuul/+/81515421:00
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Simon Westphahl: [zuul/zuul] 815111: wip: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/81511121:00
@jim:acmegating.comClark: since you were asking about this earlier this week -- i think 815111 is the last major change in the "store pipeline state in zk" series.  i'm working on finishing it up now.  i think with that done, it will make sense to review that stack for real next week.  it's long, but it's basically just a few concepts applied iteratively to each of the model objects starting with pipelines going all the way down to builds.21:02
@jim:acmegating.comobviously there is the minor impediment of the test jobs failing, but i have been running boatloads of tests locally which reliably pass, so i'm optimistic we can figure something out.  i don't see that as a blocker for reviewing (only merging).21:03
@clarkb:matrix.orgcorvus: noted, I'll plan to review the stack21:36
@clarkb:matrix.orgalso note that matrix.org just had an extended outage due to a database hardware failure. Things might be catching up communication wise21:36
@jim:acmegating.comClark: interesting; have a link, or timestamps?  (because i hadn't noticed)21:45
@jim:acmegating.comi guess https://status.matrix.org/21:45
@jim:acmegating.comlooks like 21:2321:46
@clarkb:matrix.orgcorvus: ya thats probably the best link. My oftc bridge lost almost an hour of messages21:47
@jpew:matrix.orgI've noticed that when an ansible task reruns due to an `until` the logger doesn't show any output from the subseqent runs. anyone else seen this? (I am using zuul that is a few months old)21:56
@jpew:matrix.org * I've noticed that when an ansible task reruns due to an `until` the logger doesn't show any output from the subseqent (re)runs. anyone else seen this? (I am using zuul that is a few months old)21:56
@jpew:matrix.orgIt picks up again just fine on the next task.... just make it look like something is hung (esp. when the task rerunning takes a long time)21:56
@clarkb:matrix.orgjpew: hrm the way that all works is a bit of hacky magic. It wouldn't surprise me if it is buggy around retries22:04
@clarkb:matrix.orgtrying to remember the details I think it starts a daemon that runs the entirety of the playbook run. Then command and shell modules are swapped out to write to a socket that daemon has held open22:05
@clarkb:matrix.orgI wonder if ansible is loading the actual command and shell modules on the retry and not the overloaded versions22:06
@clarkb:matrix.orgIn the past we had problems with the rsync module because ansible didn't reset state or replace the class between retries and that led to state leakage between attempts. I wonder if they fixed that more broadly by creating new module instances for each attempt and that is breaking the retries22:07
@jim:acmegating.comaha!  https://zuul.opendev.org/t/zuul/build/c85eaeff4bee4b4b9c2dfd456dc203a9/log/zookeeper-root-server-ubuntu-bionic-32GB-airship-kna1-0027062402.out#557622:14
@jim:acmegating.comour zk server is running out of disk space (which probably means running out of tempfs space?)22:14
@jim:acmegating.comin tests -- to be clear22:15
@clarkb:matrix.orgoh ya since they use a tmpfs22:15
@clarkb:matrix.orgare those jobs using the 64MB tmpfs?22:15
@jim:acmegating.comat least... i think we have zk use a tmpfs?22:15
@clarkb:matrix.orgit seems we don't bound them in the docker-compose tmpfs but we aren't using that in CI iirc22:16
@jim:acmegating.comsuddenly i'm not sure... i see we set up tmpfs for zuul22:16
@jim:acmegating.comyeah, i think we rely on the zuul-jobs role to set up zk22:16
@clarkb:matrix.orghttps://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-zookeeper/tasks/main.yaml#L3622:17
@clarkb:matrix.orgthat appears to be what it is from the job log22:17
@jim:acmegating.comyou are 2 seconds faster than i am :)22:17
@jim:acmegating.comso.... hrm.... let me see what my local zk tmpfs usage is22:18
@jim:acmegating.comtmpfs            32G  530M   31G   2% /data22:19
@jim:acmegating.comtmpfs            32G  1.7G   30G   6% /datalog22:19
@jim:acmegating.comthat's from a handful of test runs; i can clean it and do a single run to get better numbers22:19
@clarkb:matrix.orgYou might be able to tell zk to not keep old copies of data around?22:20
@clarkb:matrix.orgI want to say it keeps all versions by default but that is configurable? something like that22:20
@jim:acmegating.comhrm... the autopurge interval is configured in hours :/22:25
@clarkb:matrix.orgya I was just going to paste that22:25
@clarkb:matrix.orgbut it is disabled by default22:25
@clarkb:matrix.orgwhich means no purging even after an hour22:25
@clarkb:matrix.orgmight be worth setting the interval to 1 and the retain count to 1 anyway?22:26
@jim:acmegating.comwell, if we're considering using larger nodes for the cpus; we're going to get gobs of ram for free.  maybe we should just unlimit the tmpfs?22:26
@jim:acmegating.comClark: well, i'd like to get this back under 1hr.22:26
@clarkb:matrix.orgah22:27
@clarkb:matrix.orgalso are both /data and /datalogs synchronous? or can you get away with one or the other being on disk?22:27
@jim:acmegating.comi think the log is probably more performance sensitive22:29
@jim:acmegating.comi think that's synchronous, and the datadir is for snapshots which only happen every 100,000 transactions22:29
@jim:acmegating.comor 10,000 depending on versions/config22:29
@clarkb:matrix.orggot it. Might be another option to reduce what is in the tmpfs too22:29
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 815163: ensure-zookeeper: eliminate the tmpfs size limit https://review.opendev.org/c/zuul/zuul-jobs/+/81516322:32
@jim:acmegating.comClark: ^ that's the way i'm leaning right now, but i'm totally open to the idea that we can/should make it a role variable.  technically it's a backwards incompat change, but it's probably not going to break anyone22:33
@jim:acmegating.comtmpfs            32G   84M   32G   1% /data22:33
@jim:acmegating.comtmpfs            32G  556M   31G   2% /datalog22:33
@jim:acmegating.comthat's the values after a single run of the full test suite22:33
@clarkb:matrix.orgI +2'd because I don't think failing due to a lack of memroy is any worse than failing due to filling the tmpfs22:34
@clarkb:matrix.organd this at least has a chance of not filling22:34
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Simon Westphahl: [zuul/zuul] 815111: wip: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/81511122:34
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:22:41
- [zuul/zuul] 815077: Run dstat in tox jobs https://review.opendev.org/c/zuul/zuul/+/815077
- [zuul/zuul] 815167: Use a larger nodeset for unit tests https://review.opendev.org/c/zuul/zuul/+/815167
@jim:acmegating.comokay with any luck we should see the tests on the end of the zk stack run to completion on large nodes with dstat graphs22:41
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815175: Add reno about zk disconnect thread fix https://review.opendev.org/c/zuul/zuul/+/81517523:36
@jim:acmegating.comClark, fungi: ^ could we go ahead and approve that?  so that if the opendev restart goes well, we can just tag and release that on monday?23:37
@clarkb:matrix.orglooking23:38
@clarkb:matrix.orgIt is probably fine to self approve that one after my +2 since it is just a release note23:39

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!