Wednesday, 2021-09-22

-@gerrit:opendev.org- Felix Edel proposed:05:21
- [zuul/zuul] Don't use executor.builds when processing build result events https://review.opendev.org/c/zuul/zuul/+/808091
- [zuul/zuul] Don't use executor.builds to find out if tests are settled https://review.opendev.org/c/zuul/zuul/+/808792
- [zuul/zuul] Remove the local builds list from the executor client https://review.opendev.org/c/zuul/zuul/+/809175
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] Let zuul-web look up the live log streaming address from ZooKeeper https://review.opendev.org/c/zuul/zuul/+/80941005:25
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] Let zuul-web look up the live log streaming address from ZooKeeper https://review.opendev.org/c/zuul/zuul/+/80941005:25
@felixedel:matrix.orgClark, corvus : I've added the use_cache=False flag to zuul-web as well. I think it doesn't require a cache so we can go without. If we see the need for this in the future, we could simply change it. But as the build request lookup doesn't benefit from the cache, I think it's not needed.05:36
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] web UI: add "show retries" toggles on buildset page https://review.opendev.org/c/zuul/zuul/+/80620106:47
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] wip: Make QueueItem a Zookeeper object https://review.opendev.org/c/zuul/zuul/+/80941407:14
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] wip: Make QueueItem a Zookeeper object https://review.opendev.org/c/zuul/zuul/+/80941408:31
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] web UI: add "show retries" toggles on buildset page https://review.opendev.org/c/zuul/zuul/+/80620109:37
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] wip: Make QueueItem a Zookeeper object https://review.opendev.org/c/zuul/zuul/+/80941410:10
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] web UI: user login with OpenID Connect https://review.opendev.org/c/zuul/zuul/+/73408210:30
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] Add authentication-realm attribute to tenants https://review.opendev.org/c/zuul/zuul/+/73558610:30
@iselor:matrix.orgHi,10:36
I've got a question. Is there a way to run ansible k8s module from zuul job? When I'm trying to use k8s module I'm getting module not found exception.
-@gerrit:opendev.org- Zuul merged on behalf of Matthieu Huin https://matrix.to/#/@mhuin:matrix.org: [zuul/zuul] UI: remove immutability-helper dependency https://review.opendev.org/c/zuul/zuul/+/80942310:43
-@gerrit:opendev.org- Zuul merged on behalf of Albin Vass: [zuul/zuul] Exit sucessfully when manipulating project keys https://review.opendev.org/c/zuul/zuul/+/81024010:48
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] Periodicall run malloc_trim via apscheduler https://review.opendev.org/c/zuul/zuul/+/81041311:58
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] Periodically run malloc_trim via apscheduler https://review.opendev.org/c/zuul/zuul/+/81041312:00
-@gerrit:opendev.org- Joshua Watt proposed: [zuul/nodepool] kubernetes, openshift: Add volume mount support https://review.opendev.org/c/zuul/nodepool/+/81032212:54
@icey:matrix.orghey, is there any way to have the gerrit reporter and mysql reporter on a 3.19.1 install, but then have the gerrit reporter still use the links provided, rather than changing them to refer to the zuul server itself?13:09
@icey:matrix.orgon the same server, I've just configured the mysql reporter and have had jobs finish, but there are no builds being recorded in the database13:09
@icey:matrix.orgwell, looks like https://opendev.org/zuul/zuul/src/tag/3.19.1/zuul/driver/sql/sqlreporter.py#L36 didn't update after the tables actually did get created, I see the following line in my logs13:14
@icey:matrix.org*wave* fungi13:17
@fungicide:matrix.orgoh, good, you made it over to here13:29
@fungicide:matrix.orgi'm not all that familiar with that part of the code, i'll have to track down where the table creation was expected to happen in 3.19.113:29
@icey:matrix.orgI've got tables now, hoping that a restart of the scheduler (bleh) resolves the reporting skipping the database bits13:31
@icey:matrix.orgthe bigger question, for me, is: how can I have the gerrit reporter continue using the links that I had it using before adding in the mysql reporter13:31
@icey:matrix.orgI guess I can cause https://opendev.org/zuul/zuul/src/tag/3.19.1/zuul/model.py#L2748-L2749 to be Falsey13:33
@fungicide:matrix.orgas far as reporting a specific link instead of the link to zuul's build results page, there were success-url and failure-url job attributes, but those are deprecated in 4.7.0 so it probably wouldn't make sense to rely on them. the logs and artifacts should all be accessible from the build results page, while the reverse is rarely true13:35
@fungicide:matrix.orgthe report-build-page attribute for tenants (added in 4.0.0) was also able to control that, but it too was deprecated in 4.7.0 when it became the only behavior13:38
@fungicide:matrix.orgso i guess until you upgrade to 4.7.0 at least, you could leave the sql reporter disabled in 3.19.1 now that you've confirmed writes to the db are working, then when upgrading to a version >=4 <4.7 set report-build-page on your tenant(s) to false, but just be aware that as of 4.7.0 that's no longer an option and zuul assumes it should always report the build results page url13:42
-@gerrit:opendev.org- Joshua Watt proposed:13:49
- [zuul/zuul-operator] Decode operator secrets as strings https://review.opendev.org/c/zuul/zuul-operator/+/806452
- [zuul/zuul-operator] Allow type annotations for connection secret keys https://review.opendev.org/c/zuul/zuul-operator/+/810440
@icey:matrix.org> <@fungicide:matrix.org> so i guess until you upgrade to 4.7.0 at least, you could leave the sql reporter disabled in 3.19.1 now that you've confirmed writes to the db are working, then when upgrading to a version >=4 <4.7 set report-build-page on your tenant(s) to false, but just be aware that as of 4.7.0 that's no longer an option and zuul assumes it should always report the build results page url14:09
it sounds like the real issue is that my zuul's web UI _must_ become public
@icey:matrix.orglooks like the `report-build-page: false` works on 3.19.1 as well14:11
@clarkb:matrix.orgAlso note that 4.3.0 (I think that was the version) was an important security update.14:13
@jim:acmegating.com4.6.014:15
@icey:matrix.orgthe requirement for Zookeeper TLS is what's holding me back from 4 :-/14:16
@icey:matrix.organd my entire Zuul is in a restricted network environment (hence not wanting the links pointing straight to Zuul)14:16
@clarkb:matrix.orgicey: re zookeeper tls you can deploy it alongside no tls iirc. Then switch zuul after you are happy with the zookeeper side. It might help make that transition easier if you can do it step by step rather than all at once16:03
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] Add some debug logging to gerrit event processing https://review.opendev.org/c/zuul/zuul/+/81046716:44
-@gerrit:opendev.org- Joshua Watt proposed: [zuul/zuul] zk: change_cache: Delete bad JSON data from cache https://review.opendev.org/c/zuul/zuul/+/81046816:44
@clarkb:matrix.orgZuulians re 810467 we're seeing slowness in OpenDev on processing events that are in the zk gerrit connection event queue16:49
@clarkb:matrix.orgAt this point we are waiting about an hour to enqueue changes after they are updated16:49
@clarkb:matrix.orgThe event queues in zk were running on our prior restart I think. But the change caches in zk were not, those are new as of this restart. However, I'm having a hard time figuring out from logs where the slowness might be hence adding logs16:50
@jim:acmegating.comClark: oh, i'll try to catch up16:50
@clarkb:matrix.orgtahnks16:51
-@gerrit:opendev.org- Jeremy Stanley proposed: [zuul/zuul-website] Update the OpenInfra Community CoC URL https://review.opendev.org/c/zuul/zuul-website/+/81046917:10
-@gerrit:opendev.org- Joshua Watt proposed: [zuul/zuul-operator] Allow type annotations for connection secret keys https://review.opendev.org/c/zuul/zuul-operator/+/81044018:03
-@gerrit:opendev.org- Joshua Watt proposed: [zuul/zuul-operator] Pass web configuration keys https://review.opendev.org/c/zuul/zuul-operator/+/81049718:06
-@gerrit:opendev.org- Emmanuel Odeyemi proposed: [zuul/zuul] modify pr files to allow job triggers on a push event for a deleted files https://review.opendev.org/c/zuul/zuul/+/80741118:07
-@gerrit:opendev.org- Joshua Watt proposed: [zuul/zuul-operator] Try to load the in-cluster config when starting pods https://review.opendev.org/c/zuul/zuul-operator/+/81049818:07
@jim:acmegating.comsummary from #opendev -- it looks like the change cache is slow (~15 seconds to write into it).  we've rolled back to 4.9.0.  I'm going to run some local tests to learn more.18:19
@jpew:matrix.orgcorvus: It also broke my setup when some non-JSON data got insterted somehow: https://review.opendev.org/c/zuul/zuul/+/81046818:20
@jpew:matrix.org(I think I restarted while it was writing maybe?)18:20
@jim:acmegating.comjpew: hm, i wouldn't expect a crash to write out invalid json :/18:23
@jim:acmegating.comthat does seem like a good fix, but i'd still love to understand the cause more18:23
@jpew:matrix.orgIs there a way I can go look at it?18:24
@jim:acmegating.comjpew: yeah, you can use zk_shell (though i think that may not support tls, so you may need a non-tls zk client port open for that)18:25
@jim:acmegating.comin opendev, we have the zk servers listening on tls and non-tls, but we firewall off the non-tls ports, so we can run zk-shell from the zk servers themselves, but not elsewhere18:25
@jpew:matrix.orgI'm running entirely in kubernetes, is there a pod that has that in it I can exec into?18:26
@avass:vassast.orgjpew: I think the official zookeeper images has it18:26
@jim:acmegating.comavass: sure you aren't thinking of zkCli?18:26
@avass:vassast.orgcorvus: oh, yeah18:26
@jpew:matrix.orgTBH, I know nothing about zk (which is probably more of a testament to how well it's "just worked" :) )18:26
@avass:vassast.orgcorvus: does zk_shell do anything different?18:28
@jim:acmegating.comavass: it annoyed me a lot less; but zkCli may be fine for this :)18:28
@avass:vassast.orgheh :)18:28
@jim:acmegating.comjpew: if you feel like spelunking in zk and can identify that node, that'd be great; and if you don't and just want to blow everything away and start scratch, there's this command: https://zuul-ci.org/docs/zuul/reference/client.html#delete-state18:29
@jim:acmegating.com(we certainly don't expect people to fiddle with zk in the normal course of operations; it should 'just work')18:30
@jpew:matrix.orgI was just waiting 2 hours for it to timeout 😛18:30
@jim:acmegating.comoh now that i look a bit more... i see there's a sharded reader under there18:31
@jim:acmegating.comi think that indeed a crash could end up with half-written data because of that18:31
@jim:acmegating.comjpew: i think your initial suspicion is very likely correct: a crash mid-write left incomplete data.  inspecting zk would probably confirm that, but i think now that i know the sharded writer is involved, i'm not sure it's worth expending much effort to try to confirm it.18:33
@jpew:matrix.orgOk, works for me18:34
@clarkb:matrix.orgcorvus: do you think my logging change is worth trying to land?19:18
@clarkb:matrix.orgor should I abandon it?19:18
@jim:acmegating.comClark: unsure at this point; i'm starting to run a local scheduler with lots of new logging, i'll see what's useful19:21
@clarkb:matrix.orgok le me know and I'm happy to get that into better shape if it iwll help19:22
@jim:acmegating.comone thing i have just noticed (i don't think this is related to the performance issues) is that in the gerrit driver we have an update/retry loop that just does variable assignment as part of the update... see updateChangeWithRetry in gerritconnection19:41
@jim:acmegating.comwe should probably think about what exactly we're doing the retry for there and reconcile that... but it's not important now19:41
@jim:acmegating.comoh i think i get it -- it's re-updating the change in memory in case it was updated externally19:42
@clarkb:matrix.orgmy logging change left a comment around a uuid assignment I couldn't understand. But I may just be missing some extra data context19:43
@clarkb:matrix.orgI don't think it is important other than possibly removing a distraction19:44
@jim:acmegating.comi instrumented the updateChangeWithRetry method and it took 0.0034462958574295044 seconds :/19:50
@jim:acmegating.comyesterday right after the restart, it was <1 second from "Updating <Change" to "Scheduling event"; and it was 16 seconds today19:54
@jim:acmegating.comso i wonder if we're looking for something that accumulates over time19:54
@clarkb:matrix.orgmight eb able to artificially inject a bunch of changes into the cache to simulate that?19:55
@clarkb:matrix.orgthen double it and recompare to see if time grows?19:55
@jim:acmegating.comyeah... i'll go check the size of the cache in opendev19:55
@jim:acmegating.com/zuul/cache/connection/gerrit/cache: 554319:59
@jim:acmegating.com/zuul/cache/connection/gerrit/data: 1130419:59
@jim:acmegating.comi will write a routine to cache the most recent 5k changes in my local cache20:00
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed:20:02
- [zuul/zuul] web UI: user login with OpenID Connect https://review.opendev.org/c/zuul/zuul/+/734082
- [zuul/zuul] Add authentication-realm attribute to tenants https://review.opendev.org/c/zuul/zuul/+/735586
- [zuul/zuul] web UI: allow a privileged user to dequeue a change https://review.opendev.org/c/zuul/zuul/+/734850
- [zuul/zuul] web UI: allow a privileged user to re-enqueue a change https://review.opendev.org/c/zuul/zuul/+/736772
- [zuul/zuul] Web UI: allow a privileged user to request autohold https://review.opendev.org/c/zuul/zuul/+/768115
- [zuul/zuul] Web UI: add Autoholds, Autohold page https://review.opendev.org/c/zuul/zuul/+/768199
- [zuul/zuul] REST API, Web UI: add pipelines' manager, triggers data in status https://review.opendev.org/c/zuul/zuul/+/736968
- [zuul/zuul] web UI: allow a privileged user to promote a change https://review.opendev.org/c/zuul/zuul/+/781858
- [zuul/zuul] Web UI: Add "Create Autohold Request" form, improve API error messages https://review.opendev.org/c/zuul/zuul/+/802559
- [zuul/zuul] Example Docker compose: keycloak integration https://review.opendev.org/c/zuul/zuul/+/769943
@jpew:matrix.orgI'm running K8s pods for (long) builds on a local cluster, and I have more builds than can run simultaneously on the cluster which causes the pods to stay in the Pending state (there is some status you can look up for this exact condition). Nodepool is timing out after 300 seconds, which fails my jobs. Does anyone have an idea about what I can do about this?20:17
@jim:acmegating.comi think the expected thing from a zuul/nodepool perspective would be to add "quota" support to the k8s driver.  that's not simple, but it's what the algorithm is designed for.  it may be possible to work around that by tweaking timeout or retry values (to near-infinity) but that obvs comes with downsides.20:23
@jpew:matrix.orgcorvus: Ya, I think I might be doing a somewhat unusual case were I want to use my entire cluster for builds instead of carving out a chunk, and I can't exactly "scale" it up since it's baremetal..... I just want it to use everything available and then nothing when it's done :)20:28
@jim:acmegating.comi wonder if our working set really is 5500 changes, or if the gc isn't working properly20:44
@jim:acmegating.comwe may be missing apscheduler log messages...both in zuul's default logging config and opendev's...20:53
@jim:acmegating.comthat may make it difficult to identify when the cleanup routines ran; however, i think if there were an error we would till get the exception logged since i think we log everything at error level20:54
@jim:acmegating.comoh, i think i found something -- i missed that the "clearing dependent change caches" log line is part of this sequence (it's a call out to the scheduler, so it has the scheduler log prefix)21:43
@jim:acmegating.comi think that may be what's slowing us down21:43
@jim:acmegating.com(so the sequence is: 1) get event from connection event queue; 2) query gerrit for change info about the event; 3) write the new info to zk; 4) tell the scheduler to refresh dependent changes; 5) done.)21:44
@jim:acmegating.comthat step 4 may involve its own zk writes21:46
@jim:acmegating.comClark, fungi: indeed; i updated https://etherpad.opendev.org/p/gaK1SsRMpe3hh4ASOCLi with that log line, and i think that shows the additional time is all in that routine21:48
@clarkb:matrix.orgAnd we're thinking 16 * 5k ~= 1hour? (I haven't done that math yet)21:50
@clarkb:matrix.org3600 is an hour so that is well over but also a lot of events don't go through that processing21:50
@jim:acmegating.comyeah, ref-replicated won't take that long, and there are a lot of those21:54
@pearcetyler:matrix.orgHas anyone encountered an issue where jobs run infinitely, regardless of the timeout? 22:10
testjob (from the quickstart guide) has a timeout of 600s and up to 3 retires, but it's been running far longer than expected. My scheduler's last log suggests it started fine `2021-09-22 21:20:45,031 INFO zuul.ExecutorClient: [e: ec19be10-1bea-11ec-8378-332f10c0134b] Execute job testjob [...]`. Looking at the web interface, Status -> stream, all I see in the console is `--- END OF STREAM ---` repeated every few seconds.
@jim:acmegating.comClark: i'm unclear why that would take so long though.  with my local cache of 6300 changes, it still only takes a second (which is long, but not 16s long)22:11
@jim:acmegating.comTyler Pearce: i'd check the executor log22:12
@clarkb:matrix.orgcorvus: is there anything else that happens between those two log messages?22:13
@clarkb:matrix.orgI can look at that code soon and will see if anything jumps out at me22:13
@jim:acmegating.comClark: i don't think so22:14
@pearcetyler:matrix.orgcorvus: I don't see anything unusual in the executor logs. Most recent log:22:15
`2021-09-22 21:20:39,298 INFO zuul.ExecutorServer: [e: ec19be10-1bea-11ec-8378-332f10c0134b] Merge <MergeRequest e11b914e87e84f98a8db9701f222611f, job_type=refstate, state=running, path=/zuul/merger/requests/e11b914e87e84f98a8db9701f222611f> complete, merged: False, updated: True, commit: None, branches: []`
@jim:acmegating.comClark: now i wish we still had the scheduler running so i could inspect what that method does with the production data more closely :/22:23
@clarkb:matrix.orgI'm wondering if the state of depends on matters?22:26
@clarkb:matrix.orgopendev does have some changes with a ton of depends on22:26
@fungicide:matrix.orglike that tripleo change with the 50 depends-on footers22:28
@fungicide:matrix.orgwhich they seem to keep rechecking22:28
@jim:acmegating.comyeah, that's probably a big difference with my local env; i'm not going to get that populated with a random working set22:29
@jim:acmegating.comand especially since depends-on are only set when changes end up in pipelines22:29
@jim:acmegating.comand that method is mostly looking at updates for depends-on changes22:29
@clarkb:matrix.orglooks like the connections.getSource() is constructing an entirely new source object for each change in the list.22:32
@clarkb:matrix.orgoh wait no its a dict22:33
@clarkb:matrix.orgthe driver loader creates the new source22:33
@clarkb:matrix.orgoh which we call after the dict lookup22:33
@clarkb:matrix.orgThat does a regex compile, but overall shouldn't be a big cost22:33
@clarkb:matrix.orgbut then we set change attributes. Is this n + n-1 + n-2 + n-3... + 1 operations per stack?22:34
@clarkb:matrix.orgI guess it sin't quite that bad because it should only evaluate it for the live changes in the cache22:35
@jim:acmegating.comi just used the repl in opendev prod, and we have 459 changes in the gerrit cache, and something like 20 have commitneedschanges, all of which are length 1 except for one which has 322:38
@jim:acmegating.comso even right now, opendev's zuul isn't in a state for that code to give us any trouble22:39
@jim:acmegating.comokay, that just adds an entry with 43 deps; it doesn't seem to do anything crazy22:50
@jim:acmegating.comthat iteration only takes 0.05 seconds.  so we're still missing some key piece of info22:51
@clarkb:matrix.orgcorvus: I'm noticing there is a bit of recursion happening here22:57
@clarkb:matrix.org_handleEvent calls self._getChange() calls connection._getChange() calls _updateChange() calls sched.onChangeUpdated(). But we also have _updateChange() call _updateChangeDependencies()  calls connection._getChange()22:59
@clarkb:matrix.orgThinking out loud here I wonder if it is possible for us to get into a deep recursion/loop somehow23:00
@jim:acmegating.comClark: potentially -- but i'm focusing on onChangeUpdated -- do you see recursion happening starting there?23:00
@clarkb:matrix.orgNo I think that method might be in the middle of the loop but no the originator23:01
@jim:acmegating.comwe know from log timestamps that our delay doesn't start until onChangeUpdated is called23:02
@clarkb:matrix.orgas a sanity check onChangeUpdated hasn't been changed between the two restarts. But the methods it calls were involved in the cache update.23:09
@jim:acmegating.comyeah, and i suspect we may have more items in the cache23:10
@clarkb:matrix.orgTyler Pearce: if you run the executor with debug logging (I think there is a command to toggle that on?) you should get a lot more info about what it is doing. Another option would be to directly inspect the job as it is running wherever it is running23:10
@jim:acmegating.comwe're storing more than just changes in the cache (also refs)23:11
@jim:acmegating.com * we're storing more than just changes in the cache (also refs) with the new code23:11

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!