Friday, 2021-09-03

@iwienand:matrix.org: https://github.com/quay/quay/blob/f774e4c6b6c674c822057a1d3dd49a3d5c36e6ca/endpoints/v2/manifest.py looks generally like the quay manifest handling  00:11
@iwienand:matrix.org: urgh, there's so many layers, and it abstracts everything as it seems to handle about 8 different formats from docker/OCI etc  00:23
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-registry] Reject mismatched layer sizes, with some exceptions https://review.opendev.org/c/zuul/zuul-registry/+/807232  04:23
-@gerrit:opendev.org- Zuul merged on behalf of Benjamin Schanzel: [zuul/zuul] Update tenant quota spec config example https://review.opendev.org/c/zuul/zuul/+/800766  08:38
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] Fix unstable cross-scheduler config priming test https://review.opendev.org/c/zuul/zuul/+/807272  08:44
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] [GUI] Buildset: add "show retries" toggles https://review.opendev.org/c/zuul/zuul/+/806201  12:18
@erbarr:matrix.org: hello, I was wondering how I need to set up my connections and tenants so that I can inherit the ironic-base job. In my tenant, if I list opendev.org/openstack/ironic to match the required project, I get an error in the scheduler and I don't get any jobs. https://opendev.org/openstack/ironic/src/branch/master/zuul.d/ironic-jobs.yaml#L10  14:32
@fungicide:matrix.org: your tenant needs a connection for opendev.org in that case (though it can probably be a git connection if you don't want to set up an account for the gerrit connection driver)  14:37
@gtema:matrix.org:  14:38
```
[connection "opendev"]
name=opendev
driver=git
baseurl=https://opendev.org
```
@gtema:matrix.org: that's what works for us, erbarr  14:38
@fungicide:matrix.org: and then put the opendev source in your tenant config, listing opendev.org/openstack/ironic as an untrusted project  14:39
@fungicide:matrix.org: well, in the tenant config it doesn't have to be a fully-qualified project name, since the domain part is taken from the connection details; so it can just be openstack/ironic as the untrusted-project entry there  14:42
@fungicide:matrix.org: you might also consider limiting that untrusted project entry with `include: [job]` so that it doesn't pull in things like the project definition  14:52
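Putting fungi's suggestions together, a tenant-config entry along these lines would pull in only the job definitions from ironic (a minimal sketch; the tenant name here is illustrative):
```
- tenant:
    name: example
    source:
      opendev:
        untrusted-projects:
          - openstack/ironic:
              include: [job]
```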
@fungicide:matrix.org: we've also got a bunch of real-world examples of tenant configs we use in opendev: https://opendev.org/openstack/project-config/src/branch/master/zuul/main.yaml  14:54
@erbarr:matrix.org: I have that entry for opendev and I have gerrit; I can trigger from project changes and it adds comments to the review, but when I add opendev.org/openstack/ironic to the tenant, cloning the project fails by trying the URL https://opendev.org/opendev.org/openstack/ironic  15:04
@erbarr:matrix.org: openstack/ironic works though, but fails to upload jobs that require opendev.org/openstack/ironic or others that start with opendev.org  15:05
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] REST API: Implement nodes filtering https://review.opendev.org/c/zuul/zuul/+/736042  15:08
@clarkb:matrix.org: corvus: ianw: I wrote a bit of a novel on https://review.opendev.org/c/zuul/zuul-registry/+/807232 to try and convince myself there isn't a race around the strict verification blob cleanup. If you can double-check that, I would appreciate it.  15:18
@jim:acmegating.com: Clark, ianw: what's the end goal here? if the registry rejects the upload, how does that improve the situation?  15:20
@clarkb:matrix.org: corvus: it has a fixup case specifically for the docker pushes and would only reject other clients pushing bad images  15:21
@clarkb:matrix.org: I think the idea is we can special-case buggy clients, but if they are not special-cased, we error  15:21
@jim:acmegating.com: ah  15:21
@clarkb:matrix.org: I think the change is ok if we're happy that the existing lock check and bail out is sufficient to avoid the race I describe in my comment  15:29
@jim:acmegating.com: Clark: replied to your last paragraph  15:31
@jim:acmegating.com: why would we delete anything?  15:32
@clarkb:matrix.org: corvus: the line I've commented on deletes the blobs if manifest size verification fails. I think the idea is that the image isn't valid at all and we shouldn't store it  15:34
@jim:acmegating.com: Clark: i left a -1 to that effect.  15:34
@jim:acmegating.com: correct, the image manifest is wrong. but the layer is correct. no need to delete it.  15:34
@clarkb:matrix.org: Wouldn't that be an orphaned layer then?  15:35
@jim:acmegating.com: it could be referenced by a different manifest that is correct  15:35
@jim:acmegating.com: yes, it would. there's a cleanup process for that.  15:35
@jim:acmegating.com: i don't think we're running it, but it's a necessary part of operations anyway, and we should rely on it.  15:35
@clarkb:matrix.org: I see, we dedupe the layers across images.  15:35
@clarkb:matrix.org: and deleting here would affect any other images with that layer  15:35
@jim:acmegating.com: yes -- though it's not like we're doing anything special to de-dupe the layers; that's just inherent to a CAS (content-addressable store)  15:36
@clarkb:matrix.org: yup  15:36
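A rough sketch of the check under discussion (the storage helper and names here are hypothetical, not zuul-registry's actual internals): the manifest is rejected when a declared layer size disagrees with the stored blob, but the blob itself is left alone, since a content-addressed blob may be shared by other, correct manifests.
```
import json

def validate_manifest(manifest_data, storage, allow_fixup=False):
    # 'storage.blob_size()' is a hypothetical helper returning the
    # actual stored size of a blob, looked up by digest.
    manifest = json.loads(manifest_data)
    for layer in manifest.get('layers', []):
        actual = storage.blob_size(layer['digest'])
        if actual == layer['size']:
            continue
        if allow_fixup:
            # Special case for known-buggy clients: accept the push
            # despite the mismatch.
            continue
        # Deliberately do NOT delete the blob: its content matches its
        # digest, only this manifest is wrong, and other manifests may
        # reference the same layer.  Orphaned blobs are reclaimed by a
        # separate cleanup pass.
        raise ValueError(
            'Layer %s: manifest declares size %s but stored blob is %s'
            % (layer['digest'], layer['size'], actual))
```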
@clarkb:matrix.org: Should I push an updated patchset so we can keep pushing this along? ianw is weekending at this point, so he's unlikely to do that for a couple of days  15:41
@jim:acmegating.com: Clark: i feel like it's a substantial enough change that it would be best if he could weigh in on it, unless it's urgent?  15:43
@clarkb:matrix.org: I guess it isn't super urgent; the only place we've observed it breaking is with buildkit, and not much is using that yet?  15:43
@jim:acmegating.com: Clark: (if it's urgent, sure, and we can make sure he knows to push a patch to add it back if he wants to continue discussion.)  15:43
@jim:acmegating.com: what is using buildkit now?  15:44
@clarkb:matrix.org: With zuul-registry I think it is all attempts at this point: the zuul things mordred was looking at, and the opendev assets image that started ianw down this debugging path  15:44
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] Fix unstable cross-scheduler config priming test https://review.opendev.org/c/zuul/zuul/+/807272  15:45
@clarkb:matrix.org: Basically work in progress and not production usage yet  15:45
@jim:acmegating.com: Clark: a middle ground would be to push a patchset, both of us leave +2s on it, and ianw can approve on what we call "sunday" if he likes it :)  15:46
@clarkb:matrix.org: that seems like a reasonable compromise. I'll do that  15:47
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] Remove nodeset from NodeRequest https://review.opendev.org/c/zuul/zuul/+/806063  15:48
-@gerrit:opendev.org- Clark Boylan proposed on behalf of Ian Wienand: [zuul/zuul-registry] Reject mismatched layer sizes, with some exceptions https://review.opendev.org/c/zuul/zuul-registry/+/807232  15:51
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Try harder to unlock failed build requests https://review.opendev.org/c/zuul/zuul/+/807221  16:23
@jim:acmegating.com: Clark: ^  16:23
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] Add user_data field to Node https://review.opendev.org/c/zuul/nodepool/+/807362  16:30
@mordred:inaugust.com: corvus: nodepool and python-base/builder use buildkit  16:39
@jim:acmegating.com: mordred: are we failing builds on those due to this error now?  16:44
@jim:acmegating.com: i guess because of the way this manifests, we can build exactly 1 thing with buildkit, but anything that depends on it fails?  16:46
@jim:acmegating.com: but docker doesn't fail pulling... only podman?  16:46
@mordred:inaugust.com: We're failing builds on ianw's patches to update to bullseye. I don't think it's URGENT  16:46
@jim:acmegating.com: why would just those fail?  16:47
-@gerrit:opendev.org- Atit Patel proposed: [zuul/zuul] Correcting timeout logic to fix retry errors https://review.opendev.org/c/zuul/zuul/+/807364  16:48
@mordred:inaugust.com: https://review.opendev.org/c/zuul/nodepool/+/806312/11 for instance  16:50
@mordred:inaugust.com: I'm not sure?  16:50
-@gerrit:opendev.org- Zuul merged on behalf of Benjamin Schanzel: [zuul/nodepool] Add Tenant-Scoped Resource Quota https://review.opendev.org/c/zuul/nodepool/+/800765  16:51
@jim:acmegating.com: mordred: was -buster built without buildkit, and bullseye with?  16:55
@jim:acmegating.com: i think that would explain it; basically we have a poisoned bullseye image, so anything that tries to use that will fail  16:56
@jim:acmegating.com: Clark: ^?  16:56
@clarkb:matrix.org: I'm not sure  16:57
@clarkb:matrix.org: I thought they were built the same way  16:57
@clarkb:matrix.org: https://review.opendev.org/c/zuul/nodepool/+/806312/11 specifically is failing on a dib thing that is being fixed, iirc  16:58
@clarkb:matrix.org: https://review.opendev.org/c/openstack/diskimage-builder/+/806318 is the depends-on change, but it needs to land and then get a release that 806312 can set in requirements.txt  16:58
@jim:acmegating.com: Clark, mordred: ah, maybe that job specifically uses podman?  17:00
@jim:acmegating.com: oh but that job is still passing on buster :/  17:01
@jim:acmegating.com: very weird  17:02
-@gerrit:opendev.org- Atit Patel proposed: [zuul/zuul] Changes in time out logic, addition of retry to handle slower API responses https://review.opendev.org/c/zuul/zuul/+/805396  17:10
-@gerrit:opendev.org- Atit Patel proposed: [zuul/zuul] Correcting timeout logic to fix retry errors caused by incorrect implementation of an object `retries` https://review.opendev.org/c/zuul/zuul/+/807364  17:12
@mordred:inaugust.com: corvus: Iirc, at least some of the ianw changes explicitly enabled buildkit... But I'm not at a computer presently  17:13
@clarkb:matrix.org: mordred: I don't think those have landed yet due to these problems we are having  17:17
@clarkb:matrix.org: corvus: the logs for the failed test in https://zuul.opendev.org/t/zuul/build/53db0725e0ee4d9ab8b998b5b5b922f0 show that the _retry() function is hitting an exception. Specifically, a NoNodeError, which makes me wonder if something is causing the build request to go away between the update and the put?  21:08
@jim:acmegating.com: Clark: neat, i didn't hit that when running locally (which is why i didn't catch the log bug)  21:20
@jim:acmegating.com: Clark: that test removes a tenant, which recursively deletes the tenant info in zk, including the pipeline result queues  21:24
@jim:acmegating.com: Clark: i think maybe we have to handle NoNodeError on the put and pass it  21:25
@clarkb:matrix.org: aha, that would cause a NoNodeError  21:25
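A minimal sketch of that fix, assuming the put is a kazoo set() call (the path and payload are placeholders):
```
from kazoo.exceptions import NoNodeError

def put_result(client, path, data):
    # If the tenant is removed, its result queues in ZooKeeper are
    # recursively deleted; a put racing with that deletion raises
    # NoNodeError.  There is nowhere left to report the result, so
    # swallow the error and move on.
    try:
        client.set(path, data)
    except NoNodeError:
        pass
```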
@jim:acmegating.com: i'll fix that up when i finish up the change i'm writing  21:26
@clarkb:matrix.org: I'm not sure I see the log bug  21:27
@jim:acmegating.com: oh sorry, it's what i fixed between ps1 and 2  21:29
@jim:acmegating.com: flake8 caught it  21:29
@clarkb:matrix.org: oh, got it  21:31
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:  21:36
- [zuul/zuul] Make node requests persistent https://review.opendev.org/c/zuul/zuul/+/806280
- [zuul/zuul] Add node request cache to zk nodepool interface https://review.opendev.org/c/zuul/zuul/+/806639
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821
- [zuul/zuul] Remove internal nodepool request cache https://review.opendev.org/c/zuul/zuul/+/807196
- [zuul/zuul] Report nodepool resource stats gauges in scheduler https://review.opendev.org/c/zuul/zuul/+/807406
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Try harder to unlock failed build requests https://review.opendev.org/c/zuul/zuul/+/807221  21:40
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Try harder to unlock failed build requests https://review.opendev.org/c/zuul/zuul/+/807221  21:41
@jim:acmegating.com: Clark: ^ maybe something like that  21:41
@clarkb:matrix.org: the double pass of log is fun  21:43
@jim:acmegating.com: everyone needs a log! log, log, log!  21:49
@clarkb:matrix.org: corvus: does https://review.opendev.org/c/zuul/zuul/+/806280/ impact our ability to freely restart nodepool launchers? I don't think so, as it wasn't creating the node requests, but now I'm wondering how that works today. Does it just pick up the requests and start booting new instances when it comes back? or does it go back to the queue?  22:53
@jim:acmegating.com: Clark: what's "it"? a launcher?  22:54
@jim:acmegating.com: Clark: a launcher basically just walks the request queue repeatedly; that's about it.  22:55
@jim:acmegating.com: to directly answer your first question: nodepool won't perceive a difference.  22:55
@clarkb:matrix.org: "it" being the node request. If a launcher is handling a node request and we restart the launcher, what happens? I guess the ephemeral property wasn't important there because the executor handled that  22:56
@jim:acmegating.com: ah yep, that's correct. what happens (before and after that change) is the node request gets unlocked (because the launcher locks it) and another launcher (possibly the restarted one, possibly a different one) locks it and starts over  22:57
@clarkb:matrix.org: aha, the lock remains ephemeral but not the request itself  22:57
@clarkb:matrix.org: that makes sense  22:57
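A small kazoo sketch of that split (the paths here are illustrative, not Zuul's actual ZooKeeper layout): the request znode is persistent and survives restarts, while the launcher's lock is ephemeral and vanishes with the launcher's session, letting another launcher take over.
```
from kazoo.client import KazooClient

zk = KazooClient(hosts='localhost:2181')
zk.start()

# The node request: a persistent znode, so a launcher restart does not
# erase it.
zk.create('/example/requests/100-0000000001', b'{"state": "requested"}',
          makepath=True)

# The launcher's claim on the request: kazoo's Lock recipe is built on
# ephemeral znodes, so if the holder dies, the lock is released
# automatically and another launcher can lock the request and start over.
lock = zk.Lock('/example/requests-lock/100-0000000001')
with lock:
    pass  # launch nodes to fulfill the request
```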
@clarkb:matrix.org: corvus: does the cleanup there need to somehow signal to a launcher that a request is no longer needed and unlock it? My concern there is that the leaked node requests may still be fulfilled by nodepool  22:58
@clarkb:matrix.org: but then zuul will do nothing with them?  22:59
@clarkb:matrix.org: Then once the node request goes away, the nodes can be reused? At worst we're delayed an hour?  23:00
@clarkb:matrix.org: The code itself seems fine and does what it says on the tin. I just worry about ^ at this point  23:01
@jim:acmegating.com: Clark: the cleanup is a failsafe; it shouldn't be necessary  23:03
@jim:acmegating.com: Clark: but if something does slip through, then the cleanup will delete the node request, and as soon as a launcher sees a 'ready' node allocated to a node request that doesn't exist anymore, it will recycle it.  23:03
@clarkb:matrix.org: yup, the launchers handle that already for cancelled jobs, iirc.  23:04
@jim:acmegating.com: all the typical cases, like zuul canceling a build before it starts, etc, should immediately delete the node request. the cleanup is more like handling, i dunno, the zk server disappearing right as a cancel request happens :)  23:04
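The failsafe corvus describes could look roughly like this (the paths and node schema here are invented for illustration): any 'ready' node still allocated to a request that no longer exists is returned to the pool.
```
import json

def recycle_orphaned_nodes(zk):
    # Hypothetical layout: requests under /example/requests, nodes
    # under /example/nodes, each node recording its allocated request.
    live_requests = set(zk.get_children('/example/requests'))
    for node_id in zk.get_children('/example/nodes'):
        data, _ = zk.get('/example/nodes/%s' % node_id)
        node = json.loads(data)
        if (node.get('state') == 'ready'
                and node.get('allocated_to')
                and node['allocated_to'] not in live_requests):
            # The request leaked and was deleted; deallocate the node
            # so it can satisfy a future request.
            node['allocated_to'] = None
            zk.set('/example/nodes/%s' % node_id,
                   json.dumps(node).encode('utf8'))
```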
@clarkb:matrix.org: Thinking out loud here, I wonder if it might be a good idea to have a small utility that skims zuul logs for things like this, where we've tripped failsafes we didn't expect to  23:04
@clarkb:matrix.org: For nodepool, the NodeWorker thread restarting would be something to look for too  23:05
@jim:acmegating.com: we could -- we'd have to examine each one and see if it's something we could actually prevent, though.  23:06
@jim:acmegating.com: node worker threads are expected to restart occasionally; that shouldn't be an error  23:06
@jim:acmegating.com: basically, each time you change the configuration, you get a new node worker  23:07
@jim:acmegating.com: (while the old one gracefully stops)  23:07
@clarkb:matrix.org: corvus: that was one of the things I was looking at with the debugging yesterday, and I think that isn't true. Only the provider manager is restarted; the NodeWorker typically isn't  23:07
@clarkb:matrix.org: it's a little confusing, as there are a lot of threads. But the driver-specific thread does restart when the config changes, while the NodeWorker thread sits above the driver-specific thread; it only starts when the provider is added, and restarts if the previous thread died  23:08
@jim:acmegating.com: Clark: okay, let's get specific -- exactly what class are you talking about?  23:08
@jim:acmegating.com: cause there isn't anything called a NodeWorker in nodepool?  23:09
@clarkb:matrix.org: https://opendev.org/zuul/nodepool/src/commit/433e4f33d187496a61a333a6f976572725ead4a2/nodepool/launcher.py#L1077-L1096  23:09
@clarkb:matrix.org: Sorry, it is a PoolWorker  23:10
@clarkb:matrix.org: We run a top-level Nodepool thread which runs a PoolWorker thread for each pool, regardless of driver specifics. That PoolWorker thread interacts with driver-specific threads, which do get restarted  23:11
@jim:acmegating.com: okay. so what's the error you want to scan for?  23:11
@clarkb:matrix.org: Specifically, https://opendev.org/zuul/nodepool/src/commit/433e4f33d187496a61a333a6f976572725ead4a2/nodepool/launcher.py#L1092 would indicate an unexpected thread death  23:12
@clarkb:matrix.org: which could indicate uncaught exceptions, or perhaps some sort of OOM if the OOM killer can kill specific lightweight threads and not the parent process (that is an interesting question I don't know the answer to, but I suspect it only kills parent threads)  23:12
@clarkb:matrix.org: it isn't a majorly urgent thing, but thinking out loud I thought maybe it could be useful to give operators hints about bad things  23:13
@jim:acmegating.com: yeah, my preference would be to make sure that the run method in that thread catches all exceptions and that code isn't necessary. then exceptions just get logged as exceptions.  23:13
@jim:acmegating.com: that code is pretty non-idiomatic for zuul  23:14
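The idiomatic pattern corvus prefers, sketched (class and method names here are illustrative): catch everything inside the run loop so the thread never dies unexpectedly and the restart watchdog becomes unnecessary.
```
import logging
import threading
import time

class PoolWorkerSketch(threading.Thread):
    log = logging.getLogger('example.PoolWorker')

    def __init__(self, interval=1.0):
        super().__init__()
        self.interval = interval
        self._running = True

    def run(self):
        while self._running:
            try:
                self._runOnce()  # poll the request queue, etc.
            except Exception:
                # Failures are logged as exceptions and the loop keeps
                # going; the thread itself never exits unexpectedly.
                self.log.exception('Error in pool worker:')
            time.sleep(self.interval)

    def _runOnce(self):
        pass

    def stop(self):
        self._running = False
```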
@clarkb:matrix.org:  23:14
> <@jim:acmegating.com> yeah, my preference would be to make sure that the run method in that thread catches all exceptions and that code isn't necessary. then exceptions just get logged as exceptions.
Yup, that is what my earlier change to address the issue does: it catches exceptions in more of the places they can happen.
@jim:acmegating.com: anyway, then the "look for bad stuff" operation becomes "look for log messages with severity ERROR" :)  23:14
@clarkb:matrix.org: fair enough  23:15
@clarkb:matrix.org: looks like linters failed on the node request change  23:15
@jim:acmegating.com: though maybe it's worth thinking about whether we want those to be warnings or errors. might be worth getting specific about that.  23:15
@clarkb:matrix.org: I'm going to +2 it now to remind myself that I liked the current state before it got updated  23:15
@jim:acmegating.com: maybe error should mean "zuul can not fix this" vs warning meaning "zuul noted an anomaly which it corrected"  23:16
@clarkb:matrix.org: I like that.  23:16
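Under that convention, the distinction might look like this (hypothetical messages):
```
import logging

log = logging.getLogger('zuul.Example')
request = 'build-request-0001'  # placeholder

# Zuul noted an anomaly which it corrected itself: a warning.
log.warning('Build request %s was already unlocked; ignoring', request)
# Zuul can not fix this; an operator needs to intervene: an error.
log.error('Unable to unlock build request %s', request)
```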
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:  23:17
- [zuul/zuul] Make node requests persistent https://review.opendev.org/c/zuul/zuul/+/806280
- [zuul/zuul] Add node request cache to zk nodepool interface https://review.opendev.org/c/zuul/zuul/+/806639
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821
- [zuul/zuul] Remove internal nodepool request cache https://review.opendev.org/c/zuul/zuul/+/807196
- [zuul/zuul] Report nodepool resource stats gauges in scheduler https://review.opendev.org/c/zuul/zuul/+/807406
@jim:acmegating.com: Clark: ^ fixed the lint error  23:18
