Thursday, 2021-09-02

@jim:acmegating.comthe registry just restarted?00:01
@clarkb:matrix.orgI haven't touched the hosts so wasn't me if it did00:03
@jim:acmegating.comthe storage appears to have persisted, so not a problem00:03
@iwienand:matrix.orgahh sorry i just killed it, didn't know you were looking.  will leave it00:04
@iwienand:matrix.orgthe build will have come from ...00:04
@iwienand:matrix.orghttps://zuul.opendev.org/t/openstack/build/69f9654bebfe435fb2792ba173f9dffc i would say00:05
@iwienand:matrix.orgUpdate upload metadata chunks: [{'size': 7670, 'md5': '54a131e3f12627a57edf174190aabebf'}]00:06
@iwienand:matrix.orghttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_69f/805932/3/check/opendev-buildset-registry/69f9654/docker/buildset_registry.txt is the registry log00:06
@jim:acmegating.comianw: i'd probably run the buildset registry with a patch that output the received manifest to confirm what docker pushed00:15
@iwienand:matrix.orgcorvus: will opendev-buildset-registry obey depends-on?00:17
@iwienand:matrix.orgi feel like no00:17
@jim:acmegating.comianw: i just mean restart it on the held node.  you don't need to run it in docker, just clone the repo and run it by hand.00:18
@jim:acmegating.comanyway, i'd bet a nickel that it's reporting a size of 7671 because that's what's in the manifest that's being uploaded.00:20
@jim:acmegating.comianw: hosts are all yours, i've logged out00:21
@iwienand:matrix.org2021-09-02 00:44:35,172 INFO registry.api: processing {'mediaType': 'application/vnd.docker.image.rootfs.diff.tar.gzip', 'digest': 'sha256:8ba24a495b7a3a8458f46a749b1072be2e88ec0476b88b61289c5b51341b8874'}00:46
2021-09-02 00:44:35,172 INFO registry.api: digest:sha256:8ba24a495b7a3a8458f46a749b1072be2e88ec0476b88b61289c5b51341b8874 blob_size:3121
@iwienand:matrix.org-rw-r--r-- 1 root root 3121 Sep  2 00:43 data00:46
@jim:acmegating.comianw: does that mean i lose my nickel?00:47
@iwienand:matrix.orgwe do not seem to get a size from docker push (if i'm doing this right).  the blob_size seems to match what's on disk ...00:48
@jim:acmegating.comianw: that's not 716x?00:49
@jim:acmegating.comianw: what's the info for sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1 ?00:50
@iwienand:matrix.orgi've rebuilt so i guess the numbers are slightly different00:51
@jim:acmegating.comianw: also, that looks like the first layer -- why is the sha different?00:51
@jim:acmegating.comoh00:51
@jim:acmegating.comwhy did you rebuild?00:51
@iwienand:matrix.orgi figured that was the way to docker push it?00:51
@jim:acmegating.comyou can just run docker push00:52
@jim:acmegating.comanyway, what's the info for the third layer this time?00:52
@iwienand:matrix.orgwell doh, i just tried to push the original and i think i've cleared it00:54
@iwienand:matrix.orgnice, now i'm getting 500 errors from teh registry00:55
@iwienand:matrix.orgok, cleared /stroage and started again00:56
@iwienand:matrix.org021-09-02 00:56:12,069 INFO registry.api: processing {'mediaType': 'application/vnd.docker.image.rootfs.diff.tar.gzip', 'size': 7670, 'digest': 'sha256:3a7c8ad16a18e78ae62b0490ab345b44303e3718bc6af474e20facc335020748'}00:56
@iwienand:matrix.orgthis time the push *did* specify a size00:56
@iwienand:matrix.orgnow things are even *further* out01:01
@iwienand:matrix.orgthe manifest says01:01
@iwienand:matrix.org      "size": 7670,01:02
"digest": "sha256:3a7c8ad16a18e78ae62b0490ab345b44303e3718bc6af474e20facc335020748"
@iwienand:matrix.orgif i get that blob01:02
@iwienand:matrix.org curl  -H "Authorization: Bearer ${TOKEN}" 'https://zuul-jobs.buildset-registry:5000/v2/opendevorg/assets/blobs/sha256:3a7c8ad16a18e78ae62b0490ab345b44303e3718bc6af474e20facc335020748?ns=docker.io' --output test01:02
@iwienand:matrix.org-rw-r--r-- 1 root root 7669 Sep  2 01:01 test01:02
@iwienand:matrix.orgoh no, 69 + 1 = 70.  just confused.  i think this replicates ...01:03
@jim:acmegating.comso the push was mismatched -- either the size it sent us is wrong, or the data were truncated?01:04
@iwienand:matrix.orgit seems so01:05
@iwienand:matrix.orgsigh, the timeout just expired and the nodes went away01:06
@iwienand:matrix.orghttps://etherpad.opendev.org/p/zrobo <- i'm going to put some debugging things here01:07
@iwienand:matrix.orgclarkb: if you a have a sec for https://review.opendev.org/c/opendev/base-jobs/+/806818 i might be able to make a more persistent debugging setup in infra01:09
@clarkb:matrix.orgianw:  approved, I guess be prepared to revert it if it breaks jobs when we don't want them to fail. That is in base jobs which is always a bit tricky to test01:12
@iwienand:matrix.orgyeah, thanks.  i think, if i set that var and set holds for opendev-buildset-registry and the system-config image build job i can replicate it01:14
@jim:acmegating.comianw: did that patch just merge without a gerritbot announcement?01:28
@jim:acmegating.comoh that was in base jobs, nm01:28
@iwienand:matrix.org@corvus i i think cause it's opendev doesn't echo here01:29
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Delete election https://review.opendev.org/c/zuul/zuul/+/80700501:35
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:02:52
- [zuul/zuul] Remove nodeset from NodeRequest https://review.opendev.org/c/zuul/zuul/+/806063
- [zuul/zuul] Make node requests persistent https://review.opendev.org/c/zuul/zuul/+/806280
- [zuul/zuul] Add node request cache to zk nodepool interface https://review.opendev.org/c/zuul/zuul/+/806639
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Reduce leaked file descriptors in tests https://review.opendev.org/c/zuul/zuul/+/80699502:53
@jim:acmegating.comClark swest tobiash : ^ okay, i think i fixed the leak (with `del self.election`) and addressed/responded to all comments.02:57
-@gerrit:opendev.org- Jiri Podivin proposed: [zuul/zuul-jobs] DNM https://review.opendev.org/c/zuul/zuul-jobs/+/80703106:46
@iwienand:matrix.orgClark: corvus see https://etherpad.opendev.org/p/zrobo ... i have a pretty good debugging setup i think, but haven't quite pinpointed what's wrong08:05
@iwienand:matrix.orgtl;dr ... there is a screen running on 23.253.159.167 ; one window you can "docker push" from and the other has mitmproxy running (docker is setup to http_proxy through this)08:06
@iwienand:matrix.orgthe registry is  104.130.13.181.  as zuul you can run ".tox/venv/bin/zuul-registry  -d -c /home/zuul/buildset_registry/conf/registry.yaml serve" and then "docker push" to it from the other host08:07
@iwienand:matrix.orgm -rf /home/zuul/storage/_local/* and rinse and repeat ...08:08
@iwienand:matrix.orgwe are one byte short on *all* the layers; info in the etherpad08:08
@iwienand:matrix.orgfeel free to fiddle ... 08:09
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] Remove hold request cache from nodepool interface https://review.opendev.org/c/zuul/zuul/+/80681308:54
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com:09:00
- [zuul/zuul] Don't treat failed requirement jobs as ready https://review.opendev.org/c/zuul/zuul/+/806781
- [zuul/zuul] Remove returnNodeSet calls from scheduler https://review.opendev.org/c/zuul/zuul/+/806062
-@gerrit:opendev.org- Simon Westphahl proposed:09:54
- [zuul/zuul] Periodically maintain connection caches https://review.opendev.org/c/zuul/zuul/+/806756
- [zuul/zuul] Clean up dangling cache data nodes more often https://review.opendev.org/c/zuul/zuul/+/807102
-@gerrit:opendev.org- Artem Goncharov proposed: [zuul/zuul-jobs] Drop deprecated url param of upload_logs_XXX modules https://review.opendev.org/c/zuul/zuul-jobs/+/80711811:18
-@gerrit:opendev.org- Artem Goncharov proposed: [zuul/zuul-jobs] Drop deprecated url param of upload_logs_XXX modules https://review.opendev.org/c/zuul/zuul-jobs/+/80711811:23
@gtema:matrix.orgcorvus, I tried to complete https://review.opendev.org/c/zuul/zuul-jobs/+/776677 with https://review.opendev.org/c/opendev/base-jobs/+/806941 and https://review.opendev.org/c/zuul/zuul-jobs/+/807118 as requested11:23
@gtema:matrix.orgplease have a look whether there is something else still missing11:24
-@gerrit:opendev.org- Artem Goncharov proposed: [zuul/zuul] Add support for gitlab squash merge https://review.opendev.org/c/zuul/zuul/+/80623111:31
-@gerrit:opendev.org- Artem Goncharov proposed: [zuul/zuul-jobs] Drop deprecated url param of upload_logs_XXX modules https://review.opendev.org/c/zuul/zuul-jobs/+/80711811:40
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] Make sure dynamic change queues are marked as such https://review.opendev.org/c/zuul/zuul/+/80712111:43
@gtema:matrix.orgavass, I see you implemented initially upload-logs-s3 role. Do you still know what is this `_undocumented_test_worker_node_`? Potentially it started failing in https://review.opendev.org/c/zuul/zuul-jobs/+/776677 (at least this is the only difference to regular swift role12:37
@avass:vassast.orggtema: I think it was added to make it easier to test the role12:39
@gtema:matrix.orgdue to the need of boto?12:40
@avass:vassast.orgit's not possible to target localhost like that in an untrusted context. but it may also be a good idea to remove that variable completely12:41
@gtema:matrix.orgI wonder why it now fails12:42
@gtema:matrix.orgI see boto is installed on test node, while role now tries to execute delegated (what in reality should be same node)12:43
-@gerrit:opendev.org- Artem Goncharov proposed: [zuul/zuul-jobs] [DNM] Test dropping delegation in the upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/80713212:47
-@gerrit:opendev.org- Artem Goncharov proposed: [zuul/zuul-jobs] [DNM] Test dropping delegation in the upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/80713213:05
@gtema:matrix.orgoh, now I got, cause syntax in https://opendev.org/zuul/zuul-jobs/src/branch/master/test-playbooks/upload-logs-s3.yaml#L48 apparently does not work anymore since last security change13:15
-@gerrit:opendev.org- Artem Goncharov proposed: [zuul/zuul-jobs] [DNM] Test upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/80713213:22
@jim:acmegating.comgtema: thanks!  see comments on https://review.opendev.org/776677  --  it looks like there already was a change for step 3 at https://review.opendev.org/c/opendev/base-jobs/+/777087 but i don't think it ever got linked to 776677 :/  -- your change for step 5 lgtm.14:22
-@gerrit:opendev.org- Jiri Podivin proposed: [zuul/zuul-jobs] DNM https://review.opendev.org/c/zuul/zuul-jobs/+/80703114:23
-@gerrit:opendev.org- Jiri Podivin proposed: [zuul/zuul-jobs] DNM https://review.opendev.org/c/zuul/zuul-jobs/+/80703114:42
@gtema:matrix.orgcorvus: thks. Sadly the 776677 fails now, I assume since security release. And honestly I have now no clue what to do with it15:01
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:15:02
- [zuul/zuul] Make node requests persistent https://review.opendev.org/c/zuul/zuul/+/806280
- [zuul/zuul] Add node request cache to zk nodepool interface https://review.opendev.org/c/zuul/zuul/+/806639
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821
- [zuul/zuul] WIP: re-add test_zookeeper_disconnect2 https://review.opendev.org/c/zuul/zuul/+/807166
@clarkb:matrix.orgcatching up on ianw's debugging of the zuul-registry do we know if the builder push is supplying the manifest to zuul-registry with the off by one sizes or is zuul-registry calculating them at off by one values? It isn't clear to me from the mitm log in the etherpad if we've checked what manifest data is supplied to the registry and whether or not the registry is updating the manifest15:05
@jim:acmegating.comClark: i haven't looked into that today15:08
@clarkb:matrix.orgok, I've got a meeting and breakfast but I'll try to look into that in a bit15:09
@clarkb:matrix.orgI half wonder if the docker hub registry is always rewriting the size in the manifest and the builder is supplying the wrong value due to a bug15:10
-@gerrit:opendev.org- Artem Goncharov proposed: [zuul/zuul-jobs] [DNM] Test upload_logs_s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/80713215:12
@jim:acmegating.comthat sure would be something15:30
@jpew:matrix.orgIs there a recommended way to validate that the configuration in a trusted project is correct before merging it?15:47
@jpew:matrix.org(since it won't use the pending changes in the check or gate(?) pipelines)15:48
@clarkb:matrix.orgjpew: we have a test base job that allows us to land a changes without affecting any jobs, then we can reparent other jobs to it speculatively or temporarily to test things work well. The biggest downside to this approach is that you can often end up with copies of stuff15:51
@clarkb:matrix.orgSimilar setup works for roles, modify a test role, use test role in test base job and so on15:51
@clarkb:matrix.orgpabelanger: has an approach where as little as possible is put in the trusted repo to minimize the amount of code that needs testing in this way15:52
@jim:acmegating.comtobiash: i noticed https://review.opendev.org/788680 is going to conflict with the nodepool sos work -- can you take a look at that soon?  you have a previous +2 on it, but i want to make sure you're okay with the minor change i requested afterwords.16:28
@jpew:matrix.orgClark: So you have a project that is just a basic test job, then you submit a change to it with a Depends-On change for the pending change in the config repo?16:30
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:16:30
- [zuul/zuul] Remove nodeset from NodeRequest https://review.opendev.org/c/zuul/zuul/+/806063
- [zuul/zuul] Make node requests persistent https://review.opendev.org/c/zuul/zuul/+/806280
- [zuul/zuul] Add node request cache to zk nodepool interface https://review.opendev.org/c/zuul/zuul/+/806639
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Benjamin Schanzel: [zuul/zuul] Add tenant name on NodeRequests for Nodepool https://review.opendev.org/c/zuul/zuul/+/78868016:30
@jim:acmegating.comjpew: https://opendev.org/opendev/base-jobs/src/branch/master/zuul.d/jobs.yaml#L5-L23 explains the test process for base-jobs16:31
@clarkb:matrix.orgjpew: no the base test job is a copy of the base job for the tenant. But we can modify it without affecting everything. Then we merge changes to that base-test job and we can update zuul's unittests for example to parent to that and exercise the trusted stuff16:31
@jpew:matrix.orgAh, OK16:32
@jpew:matrix.orgI'll look at that, thanks16:32
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] Reduce leaked file descriptors in tests https://review.opendev.org/c/zuul/zuul/+/80699517:17
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Remove internal nodepool request cache https://review.opendev.org/c/zuul/zuul/+/80719618:13
-@gerrit:opendev.org- Clark Boylan proposed:18:43
- [zuul/nodepool] Catch more potential errors in PoolWorker run loop https://review.opendev.org/c/zuul/nodepool/+/807199
- [zuul/nodepool] Deregister launchers when restarting them https://review.opendev.org/c/zuul/nodepool/+/807200
@clarkb:matrix.orgcorvus: I think ^ those two things may be related to the leaked launchers in zk18:43
-@gerrit:opendev.org- Zuul merged on behalf of Benjamin Schanzel: [zuul/zuul] Add tenant name on NodeRequests for Nodepool https://review.opendev.org/c/zuul/zuul/+/78868019:04
@jim:acmegating.comClark: nice catch -- i left a q on one, but +2 on both20:34
@clarkb:matrix.orgthanks responded20:37
@jim:acmegating.comkk20:40
@mordred:inaugust.comcorvus: "zuul-tox-py38 https://zuul.opendev.org/t/zuul/build/a31ec1e2fe9744b0976f3acea87c3451 : TIMED_OUT in 1h 31m 59s" <-- that's on the bottom patch in your SOS stack20:44
@mordred:inaugust.comis that a thing that bears investigating?20:44
@jim:acmegating.commordred: i think we're getting to a point where we're adding enough additional zk traffic that we're noticably slowing down the tests.  in particular, we put zero-node requests into zk now, and a lot of tests use those (previously, we just no-op'd it in the scheduler, but it'd be hard to do that with multi-scheduler without some zk traffic)21:22
@iwienand:matrix.org@corvus that's what i was hoping to catch with the mitmproxy dump, see what the on-the-wire body length is as it leaves the docker push, before zuul-registry writes it out21:34
@iwienand:matrix.orgsorry, i mean @clarkb21:34
@iwienand:matrix.org * @clarkb that's what i was hoping to catch with the mitmproxy dump, see what the on-the-wire body length is as it leaves the docker push, before zuul-registry writes it out21:34
@pearcetyler:matrix.orgHello 👋 Is it possible to allow Zuul to run a gate pipeline even if GitHub's status checks have not passed? I'd like to make `gate` a required check in GitHub (To force devs to use the gate pipeline instead of manually merging), but when I make `gate` required, Zuul does not queue the PR because the required checks haven't passed 21:36
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan:21:37
- [zuul/nodepool] Catch more potential errors in PoolWorker run loop https://review.opendev.org/c/zuul/nodepool/+/807199
- [zuul/nodepool] Deregister launchers when restarting them https://review.opendev.org/c/zuul/nodepool/+/807200
@jim:acmegating.comTyler Pearce: should work if you do this: https://zuul-ci.org/docs/zuul/reference/drivers/github.html#branch-protection-rules21:50
@clarkb:matrix.org> <@iwienand:matrix.org> @clarkb that's what i was hoping to catch with the mitmproxy dump, see what the on-the-wire body length is as it leaves the docker push, before zuul-registry writes it out21:55
Unfortunately I think it was gzipped over the wire so the byte length there isn't telling us
@pearcetyler:matrix.orgcorvus: I looked at that, though it suggests making `check` required. I want to make `gate` required, but doing so prevents Zuul from running on that PR. Ultimately I want to block users from manually merging (force them to use Zuul for the project gating benefits)21:56
@mordred:inaugust.com> <@jim:acmegating.com> mordred: i think we're getting to a point where we're adding enough additional zk traffic that we're noticably slowing down the tests.  in particular, we put zero-node requests into zk now, and a lot of tests use those (previously, we just no-op'd it in the scheduler, but it'd be hard to do that with multi-scheduler without some zk traffic)21:57
Nod, makes sense. So the timeout itself isn't indicative of an specific issue, but there's a deeper thing we might need to ponder at some point as the time continues to grow
@iwienand:matrix.orgClark: i think mitmproxy dump is unzipped.  i'm just trying to figure out how to get it show the body length in some reasonable way; it weirdly doesn't show that in mitmdump22:02
@jim:acmegating.comTyler Pearce: The intent with zuul is that if zuul knows that it will report some piece of information that will satisfy the code review system's requirements to merge a change, then it should enqueue the change regardless of that value.  IOW, it should do what you want.  If everything is set up correctly and that's not working, then there may be a bug in the github driver.22:08
@jim:acmegating.comianw: why not just have zuul-registry emit the info instead of using mitm?  or do you want to try to eliminate the possibility of a framework error?22:09
@jim:acmegating.commordred: yeah, i'm not seeing any easy answers right now, other than increase timeout and/or run tests on larger nodes.  fwiw, i've actually seen fairly modest test runtime growth when running locally; i think the clouds are exacerbating the issue.22:12
@iwienand:matrix.org@corvus yes, and also if there's any chance of getting someone who knows docker a trace that says "here's a trace: these PATCH requests send X bytes and then it sends a manifest that has X+1 bytes" stands alone 22:13
@jim:acmegating.comianw: btw the "@user" isn't doing anything special -- you can just call me "corvus" :)22:14
@pearcetyler:matrix.org> <@jim:acmegating.com> Tyler Pearce: The intent with zuul is that if zuul knows that it will report some piece of information that will satisfy the code review system's requirements to merge a change, then it should enqueue the change regardless of that value.  IOW, it should do what you want.  If everything is set up correctly and that's not working, then there may be a bug in the github driver.22:15
Thanks. It's likely I set something up incorrectly 😅 I'll investigate
@iwienand:matrix.orgcorvus oh, in element it makes it pop up like a tag22:15
@jim:acmegating.comianw: are you using element to type that?22:15
@iwienand:matrix.orgyep22:15
@jim:acmegating.comianw: something isn't working then; it's coming across as plain text for me in element.  it looks correct when other people mention me.22:16
@clarkb:matrix.orgit is coming through as a literal @user22:17
@clarkb:matrix.orgthe client should translate @user to a real user thing though22:17
@jim:acmegating.comClark: the sending client22:17
@clarkb:matrix.orgbut you have to tab complete onto it22:17
@clarkb:matrix.orgcorvus: yes the sending client, you do @corvus then tab and that produces corvus 22:17
@iwienand:matrix.orgClark: corvus ok, it seems i need to press tab22:17
@iwienand:matrix.orgheh, yes ;)22:18
@jim:acmegating.comianw, Clark: yes, also you don't actually need @22:18
@pearcetyler:matrix.orgcorvus: After approving the PR I was testing, the `gate` pipeline works as expected. It seems Zuul may be checking for an approval, even if the branch protection rules do not require approval. In a real workflow we should always have approvals, it just caught me off guard for my local testing :) 22:18
@clarkb:matrix.orglooks like without the @ you have to hit tab twice22:18
@clarkb:matrix.orgwith the @ it is just once. Same amount of typing though. An interesting UX choice22:19
@jim:acmegating.comClark: weird, "cla^I" gets me this.  i'm using the "experimental irc" mode. but i didn't think that changed anything regarding input, only line height22:19
@jim:acmegating.comi'm also using a slightly different element than matrix.org22:20
@jpew:matrix.orgWe have to allow untrusted projects to download from an authenticated artifactory server. I was trying to set it up so that the trusted context would generate a temporary artifactory token using the secret password which the untrusted context could use, then revoke it at the end of the build. However, I'm not sure the best way to "transfer" the generated token; I tried an ansible fact, but that didn't work  (ansible facts don't transfer between playbooks?).... are there any suggestions or precendence for these types of things?22:21
@jim:acmegating.comjpew: cacheable ansible facts do22:22
@jpew:matrix.orgAh, so if I set the `cachable` flag in `set_fact` it should be available all the time?22:22
@jim:acmegating.comjpew: yep https://docs.ansible.com/ansible/latest/collections/ansible/builtin/set_fact_module.html#parameter-cacheable22:23
@jpew:matrix.orgYa, I saw that but I wasn't sure if zuul had caching enabled. Thanks!22:23
@jim:acmegating.comgood point, not sure if we mention that in docs :)22:23
@jim:acmegating.combut we do, explicitly for that purpose22:23
@jim:acmegating.comjpew: just a quick thought: you might be able to construct a job where a trusted playbook gets the artifacts without giving the token to an untrusted playbook.  (but may not be necessary if you fully trust your "untrusted" users)22:26
@clarkb:matrix.orgianw: ok looks like my nodepool changes for the leaked launchers have landed and I've fixed the leaked changes in the check queue for openstack. I'm finally to the point where I can look at the zuul-registry off by one thing if it will help. But I figure you've got more of that paged in so let me know if I can help22:47
@clarkb:matrix.orgI'm going to try and review hashtag:sos in the maentime22:47
@iwienand:matrix.orgthanks; yeah getting a dump that shows lengths is harder than you'd think22:49
@iwienand:matrix.orgone thing that has me thinking is that Chunked Transfer-Encoding ends with a 0\r\n ... that seems suspicious22:49
@clarkb:matrix.orgcorvus: mordred re: test runtimes I think we divide the number of cpus in half before letting the job run. Its possible that may be too conservative? (or maybe not and using more CPUs will create more contention and break more often)22:53
@jim:acmegating.comClark: i think the cpu usage is actually pretty good -- but i wouldn't mind a dstat chart to double check that :)22:57
@jim:acmegating.comby good i mean we're not leaving any/much on the table22:57
@iwienand:matrix.orgClark: ok, got it; if you want me to bring you up to where i'm at I can23:00
@iwienand:matrix.orgdocker push is definitely putting X-1 bytes on the wire, and reporting X bytes in the manifest23:02
@clarkb:matrix.orgianw: nice! I theorized that might be happening earlier and suggested that maybe docker hub et al are just alays writing the manifest over themselves :)23:06
@clarkb:matrix.orgcorvus: I'm struggling with your response to https://review.opendev.org/c/zuul/zuul/+/806653/6/zuul/nodepool.py I think the sendNodesProvisionedEvent() event is on a different queue than the node requests events that trigger  _handleNodeRequestEvent ?23:06
@clarkb:matrix.orgcorvus:  there are different states being checked between the two as well. Fulfilled vs completed and deleted vs failed23:08
@clarkb:matrix.orghrm actually ZooKeeperNodepool might be doing that translation for us? Does this mean the only risk here is you need some next event to trigger to process a prior one that hit the race?23:09
@clarkb:matrix.orgcorvus: I think that case is worth considering if say it is a quiet period like a weekend you might have to push up another change to get the node request for an older change to finally process if it had previously hit that race and short circuited23:10
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Try harder to unlock failed build requests https://review.opendev.org/c/zuul/zuul/+/80722123:12
@jim:acmegating.comClark: ^ that's from the stuck build earlier.  i'll context switch to what you just wrote now.  :)23:13
@jim:acmegating.comianw: i'd be okay expanding the registry's dockerhub compat rewriting23:14
@iwienand:matrix.orgi'm thinking zuul-registry should start validating the size to make sure it doesn't ingest bad data23:18
@jim:acmegating.comianw: that's clearly not bug-compatible with dockerhub :)23:19
@jim:acmegating.comClark: whenever an election happens, we 1) enable newly received completion events (from zk) to put node provisioned events on the scheduler's result queue, then 2) go over all existing requests and put node provisioned events on the scheduler's result queue.  i think that covers all cases.23:20
@jim:acmegating.comthere's an overlap between those two things, which means we can generate duplicate events, but we shouldn't miss anything.23:20
@jim:acmegating.com(and specifically, #2 happens after #1)23:22
@clarkb:matrix.orgcorvus: but we aren't garunteed to process those events with the lock held are we?23:22
@clarkb:matrix.orgthats the issue. What happens if we are in  _handleNodeRequestEvent without the election being won23:22
@clarkb:matrix.org * corvus: but we aren't garunteed to process those events with the election won are we?23:23
@clarkb:matrix.orgoh now I see it, beacuse we do the same work once the election is won you'd only wait for the election to be won23:23
@jim:acmegating.comthen the event we're processing will be handled by whoever either has already won the election or will win the election next23:23
@clarkb:matrix.orgnot some other external event23:23
@jim:acmegating.comyep, and we're continuously trying to win the election23:24
@clarkb:matrix.orgin my head I keep thinking these events will fire based on exteranl events from either gerrit or nodepool but in this specific case the election won handler is only an interaction with zk23:24
@jim:acmegating.comcorrect23:24
@clarkb:matrix.orgcorvus: in https://review.opendev.org/c/zuul/zuul/+/807221/1/zuul/executor/server.py line 3890 when you say exit there you mean process exit? Since there shouldn't be any code there that will raise?23:31
@jim:acmegating.comClark: correct23:32
@clarkb:matrix.orgthat race should really only become a problem in a multi scheduler setup then?23:32
@jim:acmegating.comand i did check _stopped in the retry so it doesn't hang forever, so either a crash or a normal exit could trigger that race condition (if it happens along with a zk outage)23:32
@jim:acmegating.comClark: no, that's the executor, so if an executor lost ZK connectivity and exited between those two sections, we get a perpetually stuck build23:33
@clarkb:matrix.orgah.23:33
@clarkb:matrix.orgI don't see where _stopped is checked23:33
@jim:acmegating.comsorry _running23:33
@jim:acmegating.comin _retry23:34
@clarkb:matrix.orgoh right different word for similar thing23:34
@jim:acmegating.com(note it's checked after the first try, so we'll always try those methods at least once; that's why we'd need to have zk errors to trigger it too)23:34
@clarkb:matrix.orgThanks for looking into that problem. +2 from me. I'm going to need to pop out for more school stuff (I hate how last minute this all feels, surprise its school time do all the things!) but I do still want to review the hashtag:sos stack more properly23:39
@clarkb:matrix.orgdon't wait on me though if people are reviewing and approving that is fine too23:39
@jim:acmegating.comno rush :)23:39
@iwienand:matrix.orgi've filed https://github.com/docker/for-linux/issues/1296 which i have a feeling is a place issues go to die23:42
@iwienand:matrix.orgmitmweb view of the dump file is pretty cool23:43
@jim:acmegating.comianw: in paragraph 2, first sentence, you might want to clarify "it" is "docker" that's getting things wrong; because otherwise i'd assume the anticedent is "our registry"23:44
@jim:acmegating.comianw: otherwise looks pretty clear :)23:45
@jim:acmegating.comianw: oh wait, i think i misread that sentence23:46
@iwienand:matrix.orgi reworded it anyway to be a little clearer23:46
@jim:acmegating.comianw: i think it was correct as originally, so i guess my only feedback is that it was easy for me to read wrong :).  it's definitely harder for me to mess up now.  :)23:47
@iwienand:matrix.orgwhat are other registries that we could look at?  gitlab has one maybe?23:50
@iwienand:matrix.orgit might be useful to see if they just ignore the size sent to them23:51
@jim:acmegating.comianw: the open source docker registry23:51
@jim:acmegating.comjfrog23:51
@fungicide:matrix.orgback in my c programming days i would have guessed this was the result of counting the length of an array containing a null-terminated string ;)23:51
@jim:acmegating.comianw: quay23:52

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!