*** dmsimard0 is now known as dmsimard | 00:25 | |
jhesketh | mhu: no worries, I need to revisit that series but I've been a little snowed under recently :-s | 00:30 |
*** mattw4 has quit IRC | 00:37 | |
*** michael-beaver has quit IRC | 00:43 | |
openstackgerrit | Joshua Hesketh proposed zuul/zuul master: Expose ansible_date_time instead of date_time https://review.opendev.org/666268 | 00:44 |
SpamapS | hrm | 00:49 |
SpamapS | I wish there was a clear way to say "why didn't this event trigger that pipeline?" | 00:50 |
SpamapS | Like, I just had to dig through hundreds of lines of logs to find Exception: Project GoodMoney/funnel-cake is not allowed to run job promote-funnel-cake | 00:53 |
SpamapS | I wish that would somehow post to the PR | 00:54 |
SpamapS | (or maybe get recorded as a failed build) | 00:54 |
SpamapS | Does look like it was recorded as a CONFIG_ERROR buildset | 00:54 |
SpamapS | but there's no link to the text | 00:54 |
fungi | figuring out which things to report that about is where i get stuck | 01:00 |
SpamapS | Ultimately I think it belongs somewhere in zuul's database. | 01:01 |
fungi | there are so many things zuul chooses not to run that it seems like it would be very overwhelming | 01:01 |
fungi | but yeah, maybe database reporter | 01:01 |
SpamapS | The "don't run because not matching" is fine. CONFIG_ERROR though, is important. | 01:01 |
SpamapS | That error message is very clear, and the user will know what to do with it. | 01:02 |
SpamapS | So, it belongs in the user's hands. | 01:02 |
fungi | and that config error isn't reported with the configuration errors in the status interface (via the "bell" icon)? | 01:11 |
SpamapS | fungi: no, it's detected at runtime. | 01:15 |
SpamapS | (though from what I see, it *could* be detected at config parse time) | 01:15 |
SpamapS | If you try to add a project stanza that isn't allowed, you should get an error. | 01:16 |
SpamapS | Instead it lands just fine, and then at the time where it tries to run the job, it fails the allowed projects check. | 01:16 |
SpamapS | I haven't looked closely though, there may be runtime circumstances that make it allowed. | 01:16 |
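The restriction SpamapS hit can be declared on the job itself; a minimal sketch, reusing the hypothetical project and job names from the exception quoted above:

```yaml
# A job restricted to one project: a project stanza elsewhere that adds
# this job will fail the allowed-projects check at run time, as described.
- job:
    name: promote-funnel-cake
    allowed-projects:
      - GoodMoney/funnel-cake
```

Only changes to GoodMoney/funnel-cake may then run the job; any other project attaching it triggers the "is not allowed to run job" error discussed here.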
*** sanjayu__ has joined #zuul | 01:26 | |
*** spsurya has joined #zuul | 01:39 | |
*** rlandy|bbl is now known as rlandy | 01:42 | |
pabelanger | jlk: Did you get a chance to see about github3.py release process? | 02:01 |
*** rlandy has quit IRC | 02:44 | |
*** jamesmcarthur has joined #zuul | 02:47 | |
*** bhavikdbavishi has joined #zuul | 03:23 | |
*** bhavikdbavishi1 has joined #zuul | 03:26 | |
*** bhavikdbavishi has quit IRC | 03:27 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 03:28 | |
*** jamesmcarthur has quit IRC | 03:35 | |
*** jamesmcarthur has joined #zuul | 03:42 | |
*** jamesmcarthur has quit IRC | 03:47 | |
*** jamesmcarthur has joined #zuul | 03:55 | |
*** sanjayu__ has quit IRC | 04:22 | |
*** jamesmcarthur has quit IRC | 04:26 | |
*** jamesmcarthur has joined #zuul | 04:34 | |
*** bhavikdbavishi has quit IRC | 04:49 | |
*** jamesmcarthur has quit IRC | 05:04 | |
*** jamesmcarthur has joined #zuul | 05:16 | |
*** jamesmcarthur has quit IRC | 05:21 | |
*** jamesmcarthur has joined #zuul | 05:27 | |
*** jamesmcarthur has quit IRC | 05:31 | |
*** raukadah is now known as chandankumar | 05:40 | |
*** jamesmcarthur has joined #zuul | 05:47 | |
*** jamesmcarthur has quit IRC | 05:52 | |
*** pcaruana has joined #zuul | 06:15 | |
*** jamesmcarthur has joined #zuul | 06:20 | |
*** jamesmcarthur has quit IRC | 06:30 | |
*** saneax has joined #zuul | 06:45 | |
*** jamesmcarthur has joined #zuul | 06:45 | |
*** jamesmcarthur has quit IRC | 06:52 | |
*** themroc has joined #zuul | 07:09 | |
*** jamesmcarthur has joined #zuul | 07:21 | |
*** bhavikdbavishi has joined #zuul | 07:32 | |
*** jamesmcarthur has quit IRC | 07:33 | |
*** bhavikdbavishi1 has joined #zuul | 07:35 | |
*** bhavikdbavishi has quit IRC | 07:36 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 07:36 | |
*** jamesmcarthur has joined #zuul | 07:44 | |
*** jpena|off is now known as jpena | 07:46 | |
*** jamesmcarthur has quit IRC | 07:48 | |
*** jamesmcarthur has joined #zuul | 07:50 | |
*** jamesmcarthur has quit IRC | 07:57 | |
*** jamesmcarthur has joined #zuul | 07:58 | |
*** bhavikdbavishi has quit IRC | 08:06 | |
*** saneax has quit IRC | 08:06 | |
*** sanjayu_ has joined #zuul | 08:06 | |
*** bhavikdbavishi has joined #zuul | 08:06 | |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Add command processor to zuul-web https://review.opendev.org/666307 | 08:10 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Add repl server for debug purposes https://review.opendev.org/579962 | 08:10 |
*** bhavikdbavishi has quit IRC | 08:13 | |
*** zbr has joined #zuul | 08:17 | |
*** igordc has quit IRC | 08:20 | |
*** yolanda has quit IRC | 08:54 | |
*** jamesmcarthur has quit IRC | 08:55 | |
*** jamesmcarthur has joined #zuul | 08:59 | |
*** jamesmcarthur has quit IRC | 09:08 | |
*** jamesmcarthur has joined #zuul | 09:18 | |
openstackgerrit | Ian Wienand proposed zuul/nodepool master: Pin to openshift <= 0.8.9 https://review.opendev.org/666526 | 09:31 |
*** jamesmcarthur has quit IRC | 09:36 | |
openstackgerrit | Jean-Philippe Evrard proposed zuul/zuul-jobs master: Dockerhub now returns 200 for DELETEs https://review.opendev.org/666529 | 09:37 |
*** jamesmcarthur has joined #zuul | 09:45 | |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Add Bitbucket Server source functionality https://review.opendev.org/657837 | 09:47 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Create a basic Bitbucket build status reporter https://review.opendev.org/658335 | 09:47 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Create a basic Bitbucket event source https://review.opendev.org/658835 | 09:47 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Upgrade formatting of the patch series. https://review.opendev.org/660683 | 09:47 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 09:47 |
*** bhavikdbavishi has joined #zuul | 09:55 | |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 09:59 |
*** jamesmcarthur has quit IRC | 10:00 | |
*** gtema has joined #zuul | 10:07 | |
*** gtema has quit IRC | 10:16 | |
*** gtema has joined #zuul | 10:16 | |
openstackgerrit | Jean-Philippe Evrard proposed zuul/zuul-jobs master: Dockerhub now returns 200 for DELETEs https://review.opendev.org/666529 | 10:20 |
*** electrofelix has joined #zuul | 10:23 | |
*** electrofelix has quit IRC | 10:27 | |
*** electrofelix has joined #zuul | 10:27 | |
*** avass has joined #zuul | 10:35 | |
*** NBorg has joined #zuul | 10:36 | |
*** avass has quit IRC | 10:44 | |
*** jamesmcarthur has joined #zuul | 10:45 | |
*** jamesmcarthur has quit IRC | 10:50 | |
*** jamesmcarthur has joined #zuul | 10:50 | |
*** jpena is now known as jpena|lunch | 10:59 | |
openstackgerrit | Alexander Braverman proposed zuul/nodepool master: Openshift client https://review.opendev.org/666541 | 10:59 |
*** jamesmcarthur has quit IRC | 11:06 | |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 11:08 |
*** NBorg has quit IRC | 11:08 | |
*** jamesmcarthur has joined #zuul | 11:09 | |
*** rfolco_off has joined #zuul | 11:20 | |
*** rfolco_off is now known as rfolco | 11:21 | |
*** rlandy has joined #zuul | 11:30 | |
*** gtema has quit IRC | 11:30 | |
*** rlandy is now known as rlandy|afk | 11:33 | |
flaper87 | tobiash: thanks for the hint on using statefulsets. That worked | 11:35 |
*** rfolco is now known as rfolco_pto | 11:36 | |
tobiash | :) | 11:38 |
*** hashar has joined #zuul | 12:00 | |
*** bhavikdbavishi has quit IRC | 12:12 | |
*** jpena|lunch is now known as jpena | 12:17 | |
*** themroc has quit IRC | 12:17 | |
*** rfolco_pto has quit IRC | 12:35 | |
ofosos | My builds seem to be stuck :( | 12:35 |
*** themroc has joined #zuul | 12:37 | |
*** rfolco has joined #zuul | 12:42 | |
*** jamesmcarthur has quit IRC | 12:42 | |
*** NBorg_ has joined #zuul | 13:00 | |
*** pcaruana has quit IRC | 13:00 | |
*** pcaruana has joined #zuul | 13:00 | |
pabelanger | morning, I'm having an issue with nodepool 3.7.0: http://paste.openstack.org/show/753220/ | 13:19 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 13:19 |
pabelanger | zuul-maint: ^ | 13:20 |
pabelanger | I'm rolling back, and going to debug in a few minutes | 13:21 |
pabelanger | openshift==0.9.0 | 13:28 |
pabelanger | that got pulled in | 13:28 |
pabelanger | I suspect something changed in their API | 13:28 |
*** spsurya has quit IRC | 13:29 | |
Shrews | pabelanger: there are at least two changes up to fix that but I’m not really here today to review | 13:31 |
pabelanger | okay, thanks | 13:31 |
pabelanger | https://review.opendev.org/666526/ | 13:32 |
pabelanger | cool, thanks fungi / tobiash | 13:32 |
pabelanger | okay, downgrading openshift, has fixed the issue | 13:33 |
pabelanger | we maybe should consider a 3.7.1 release to pick that up | 13:33 |
*** rlandy|afk is now known as rlandy | 13:40 | |
*** rfolco has quit IRC | 13:58 | |
openstackgerrit | Merged zuul/nodepool master: Pin to openshift <= 0.8.9 https://review.opendev.org/666526 | 13:59 |
*** sanjayu_ has quit IRC | 13:59 | |
*** sanjayu_ has joined #zuul | 14:12 | |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 14:15 |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Add missing start-message in pipeline config schema https://review.opendev.org/665936 | 14:22 |
*** jamesmcarthur has joined #zuul | 14:22 | |
*** felixgcb has joined #zuul | 14:26 | |
felixgcb | hey :) have any of you guys ever tried to use "include_role" from one untrusted project to call a role in another untrusted project? It works locally for me, but on the zuul-executor it just gives me an immediate "ok" result and proceeds.. | 14:29 |
fungi | felixgcb: is that other project in the roles list for the job? | 14:30 |
fungi | zuul-maint: there's a very lengthy log of me basically talking to myself over in #openstack-infra where it seems like we ended up perpetually locking a node for a paused build when the change was rescheduled (though i can't be certain)... it picks up around: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2019-06-20.log.html#t2019-06-20T13:54:01 | 14:33 |
fungi | if anybody has further ideas of things i should check (and how i should cleanly release that lock), it would be much appreciated | 14:33 |
*** sanjayu_ has quit IRC | 14:35 | |
felixgcb | fungi: It wasn't explicitly defined in the projects job.roles setting, but it should be read automatically, since it is a role in the roles directory. I tried adding it explicitly but that also didn't work.. | 14:39 |
fungi | felixgcb: you're talking about the project name, right? like this: https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/zuul.d/jobs.yaml#L18-L19 | 14:41 |
felixgcb | fungi: ah this section is really helpful. is "zuul" the repository which contains that job? | 14:42 |
fungi | zuul/zuul-jobs is the repository which contains roles we want this particular job to also use | 14:43 |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Add missing doc for pipeline start-message https://review.opendev.org/665930 | 14:43 |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Add support for item.change for pipeline start-message formater https://review.opendev.org/665968 | 14:43 |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Add change replacement field in doc for start-message https://review.opendev.org/665974 | 14:43 |
corvus | fungi: locks should be released when the scheduler restarts | 14:43 |
fungi | felixgcb: you have to tell zuul for a given job which other projects you want it to search for roles | 14:43 |
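The linked example boils down to listing the role-providing repository on the job; a minimal sketch, with the job name hypothetical:

```yaml
# Make roles from another repository available to this job's playbooks,
# so include_role can resolve them on the executor.
- job:
    name: my-base-job
    roles:
      - zuul: zuul/zuul-jobs
```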
*** zbr is now known as zbr|ruck | 14:44 | |
fungi | corvus: but there's no (easy) way to manually tell the scheduler to release a lock for a build which has been (or was supposed to be) cancelled so that nodepool will clean up the node? | 14:44 |
fungi | corvus: is the best course of action just to manually delete the node from outside zuul/nodepool and ignore the locked node entry in nodepool until the zuul scheduler is next restarted? | 14:46 |
*** michael-beaver has joined #zuul | 14:46 | |
corvus | fungi: i guess it depends on the goal -- if the zk entry is still there, nodepool will reserve space for it in quota calculations, so deleting the node won't necessarily get the quota back | 14:47 |
corvus | fungi: wearing my opendev hat, i'd say "one node doesn't matter, just leave it until next restart" | 14:48 |
fungi | one goal is to comply with a request from the provider's support staff to clean up an apparently unused instance | 14:48 |
fungi | since they were the ones who brought it to our attention | 14:49 |
corvus | fungi: sure, then delete it out from under nodepool | 14:49 |
fungi | just wanting to make sure i'm not going to cause more problems by deleting it out of band when there's a zuul lock on it | 14:49 |
fungi | the bigger reason i brought it up in here was to better postulate about what could cause zuul to indefinitely hold a node lock associated with a build which seems to have been cancelled | 14:50 |
tobiash | corvus, fungi: I also noticed rare node leaks which I haven't got to analyze so far | 14:51 |
corvus | it should be possible to find the exact sequence that caused that in the logs | 14:51 |
tobiash | my first quick log analysis a few weeks ago showed nothing special, but I'll also have a deeper look | 14:53 |
pabelanger | corvus: if we do nodepool 3.7.1 (for openshift pin), do we need to add reno note to generate something for releasenotes page? | 14:53 |
fungi | would log messages about cancellation of a build not include the build uuid (and so that's why i'm not finding them)? or maybe did we never actually cancel the build? | 14:54 |
corvus | pabelanger: yes; we should always have at least one release note (otherwise, why did we make a release?) | 14:54 |
tobiash | but that analysis was before we annotated the logs | 14:54 |
pabelanger | corvus: ack | 14:54 |
corvus | tobiash: it was harder before, but still possible :) | 14:55 |
tobiash | I'm sure it's possible, I just had more pressing issues last weeks | 14:56 |
tobiash | We leak around 20 nodes per week atm, and a temporary workaround is deleting the znode | 14:58 |
openstackgerrit | Paul Belanger proposed zuul/nodepool master: Add release note about pinning openshift client https://review.opendev.org/666605 | 14:59 |
corvus | it would be really good to find and fix that leak | 14:59 |
flaper87 | does zuul have a concept of stages? Something that would allow for building "artifacts" in one stage and then reuse them in other jobs? Something like https://docs.gitlab.com/ee/ci/yaml/#stages | 15:09 |
openstackgerrit | Merged zuul/zuul-jobs master: Dockerhub now returns 200 for DELETEs https://review.opendev.org/666529 | 15:09 |
pabelanger | flaper87: Yup! you'll want to checkout the recent work corvus has been doing around that | 15:10 |
pabelanger | (trying to find docs) | 15:11 |
flaper87 | sweeeeet | 15:12 |
pabelanger | flaper87: https://zuul-ci.org/docs/zuul-jobs/docker-image.html has some info about how containers would work | 15:13 |
corvus | flaper87: see https://zuul-ci.org/docs/zuul/user/config.html#attr-job.dependencies to have one job depend on another job | 15:14 |
*** pcaruana has quit IRC | 15:14 | |
corvus | flaper87: https://zuul-ci.org/docs/zuul/user/jobs.html#return-values can be used to pass information to dependent jobs | 15:15 |
*** themroc has quit IRC | 15:15 | |
corvus | flaper87: and https://zuul-ci.org/docs/zuul/user/jobs.html#var-zuul.artifacts can be used to extend depends-on behavior to artifacts (so you can have a change in one project depend on a built artifact in another project) | 15:16 |
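Put together, a stage-like pipeline of jobs might be sketched as follows (job names, artifact name, and URL are hypothetical):

```yaml
# Job dependency: use-artifact starts only after build-artifact succeeds.
- job:
    name: build-artifact

- job:
    name: use-artifact
    dependencies:
      - build-artifact
```

A playbook in the build job can then record what it produced via the zuul_return module, making it available to dependent jobs and to Depends-On artifact handling:

```yaml
# In a build-artifact playbook: register the produced artifact.
- hosts: localhost
  tasks:
    - zuul_return:
        data:
          zuul:
            artifacts:
              - name: release tarball
                url: "artifacts/release.tar.gz"
```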
*** hashar has quit IRC | 15:17 | |
flaper87 | niiiice! Thanks, I'll read all those and come back with questions | 15:19 |
tobiash | corvus: hrm, we seem to be missing a relation between job name and build uuid in the logs | 15:24 |
corvus | tobiash: hrm, we used to have it when we launched a build | 15:25 |
corvus | tobiash: zuul/executor/client.py: "Execute job %s (uuid: %s) on nodes %s for change %s " | 15:26 |
tobiash | corvus: oh found it | 15:26 |
tobiash | yeah that | 15:26 |
tobiash | so apparently the job that leaked a node never shows this message | 15:26 |
tobiash | last thing for the job that leaked is froze job and completed node request | 15:28 |
flaper87 | corvus: pabelanger thanks! sounds like all the building blocks for what I need are there | 15:28 |
flaper87 | nice | 15:28 |
tobiash | so it might have gotten dequeued somewhere between node lock and 'execute job' | 15:29 |
corvus | fungi: was our node also missing that line ^ ? | 15:30 |
tobiash | fungi: it's a little bit hard to correlate, you could search first for the node, then the node-request which tells you the job and then filter for event id and job name | 15:31 |
felixgcb | fungi: thank you so much :) now it works. | 15:33 |
fungi | corvus: sorry, in two meetings simultaneously at the moment. pretty sure i saw the scheduler log that in my case, yes... checking | 15:33 |
fungi | ahh, no it was the entry about the executor accepting the build request i was thinking of... the one you're asking about would be logged by the builder? | 15:35 |
*** jamesmcarthur has quit IRC | 15:36 | |
corvus | fungi: that should show up in the scheduler debug log | 15:37 |
fungi | er, yeah i meant s/builder/executor/ but i was looking at the normal scheduler log not the debug log | 15:40 |
fungi | just a sec | 15:40 |
*** electrofelix has quit IRC | 15:44 | |
fungi | well, more than a sec | 15:45 |
fungi | power outage here, but also grepping our scheduler debug log is not fast | 15:45 |
tobiash | corvus, fungi: the job that leaked is making use of skipping child jobs | 15:45 |
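"Skipping child jobs" refers to the child_jobs job return value; a minimal sketch of a playbook task that skips all children:

```yaml
# Returning an empty child_jobs list tells Zuul to skip
# every job that depends on this one.
- hosts: localhost
  tasks:
    - zuul_return:
        data:
          zuul:
            child_jobs: []
```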
*** jamesmcarthur has joined #zuul | 15:53 | |
fungi | ooh, we've got a bunch of useful info in the scheduler debug log i'm surprised didn't percolate into the normal service log | 15:54 |
corvus | we should fix that | 15:54 |
fungi | agreed, just figuring out which ones should get elevated | 15:54 |
fungi | apparently the scheduler was trying to cancel the build but couldn't find it in the queue and thought it wasn't started | 15:55 |
*** fdegir8 has joined #zuul | 15:55 | |
*** kklimonda_ has joined #zuul | 15:55 | |
corvus | mordred: i've run into an error with my new nodepool functional job, can i get your help debugging this: http://paste.openstack.org/show/753227/ (i've set an autohold on that node, so it should be available for us to ssh into in a bit) | 15:56 |
fungi | but yeah, the log entry you were asking about is present in this case: | 15:56 |
*** jamesmcarthur has quit IRC | 15:56 | |
fungi | 2019-06-20 04:53:52,464 INFO zuul.ExecutorClient: [e: 9b44cc146cd649e585ff229c0a8a296b] Execute job swift-upload-image (uuid: 395c781799ea452c8f639d3037f8de0f) on nodes <NodeSet [<Node 0008103245 ('ubuntu-bionic',):ubuntu-bionic>]> for change <Change 0x7fa6a08caf60 openstack/swift 665487,2> [...] | 15:56 |
corvus | then it seems we may have a different situation than tobiash | 15:56 |
fungi | seems so, yes | 15:56 |
*** jamesmcarthur has joined #zuul | 15:56 | |
*** persia_ has joined #zuul | 15:57 | |
tobiash | hrm, two node leaks? | 15:57 |
tobiash | bad | 15:57 |
fungi | this is probably the critical element in our scenario: | 15:57 |
fungi | 2019-06-20 04:54:18,614 DEBUG zuul.ExecutorClient: [e: 9b44cc146cd649e585ff229c0a8a296b] Response to cancel build request: b'ERR UNKNOWN_JOB' | 15:57 |
*** sean-k-mooney1 has joined #zuul | 15:58 | |
*** mattw4 has joined #zuul | 15:58 | |
fungi | the build cancellation event was logged at 04:54:18 a few lines before that | 16:00 |
fungi | so ~36 seconds after the job execution event | 16:01 |
fungi | strangely, it seems to have continued on the executor, which i guess never got asked to abort | 16:02 |
fungi | and at 05:02:00 the executor paused the build | 16:02 |
*** pcaruana has joined #zuul | 16:02 | |
fungi | so maybe a race around cancelling a job moments after execution? | 16:02 |
*** kklimonda has quit IRC | 16:03 | |
*** fdegir has quit IRC | 16:03 | |
*** persia has quit IRC | 16:03 | |
*** sean-k-mooney has quit IRC | 16:03 | |
*** kklimonda_ is now known as kklimonda | 16:03 | |
*** jangutter has quit IRC | 16:03 | |
fungi | hrm, but the first reference to that build uuid in the executor debug log was for picking up the corresponding executor:execute request from gearman at 04:53:52, then starting the ssh agent at 04:53:53... so the job was well under way when the cancellation happened | 16:09 |
tobiash | I found the sequence of my scenario: https://etherpad.openstack.org/p/EAvGRON5P4 | 16:13 |
fungi | yeah, in our case the build seems to have been well into checking out git refs on the executor when the cancellation should have happened | 16:13 |
tobiash | corvus: that etherpad shows the sequence in my scenario and a proposal for a fix ^ | 16:17 |
fungi | in the log example above, is 'ERR UNKNOWN_JOB' being returned by gearman? i don't see where the executor ever responded to a cancellation | 16:18 |
ofosos | Hey ho, has anybody experience with running the OpenShift drivers to connect to an EKS cluster? | 16:20 |
*** hwangbo has joined #zuul | 16:23 | |
fungi | ofosos: i think SpamapS maybe tried out hooking nodepool to ekcs? not sure if it was with the openshift or kubernetes node driver | 16:24 |
fungi | sounded like authentication for ekcs is complicated | 16:24 |
hwangbo | Hi everyone. When we start up our Zuul server, the executor takes a long time (several hours) updating/resetting the repos in our gerrit server. Is there any way to shorten this time? | 16:24 |
pabelanger | hwangbo: sounds like maybe an IO issue? | 16:25 |
pabelanger | how many repos do you have, and I guess they are pretty large? | 16:25 |
hwangbo | We're using 3 repos, but they have dozens of branches with thousands of commits. Pretty large | 16:26 |
*** felixgcb has quit IRC | 16:28 | |
pabelanger | hwangbo: I'd try to collect stats on your disk, and try to figure out where the bottleneck is. IIRC, tobiash also had some IO issues recently. | 16:31 |
pabelanger | hwangbo: how many executors are you running? | 16:31 |
*** fdegir8 is now known as fdegir | 16:32 | |
corvus | fungi: yeah, 'err unknown job' is the gearman server telling the client that it doesn't know about the job it was requested to cancel | 16:32 |
hwangbo | We're using the Zuul quick start docker-compose, so it's just a single executor container | 16:32 |
fungi | corvus: should it have been in there, or is that a normal occurrence once an executor picks up the job? | 16:32 |
pabelanger | hwangbo: you may want to also scale out the executor, which should allow fewer jobs per executor, resulting in less IO too | 16:33 |
tobiash | hwangbo: note that the jobdir must be on the same mountpoint as the cache dir | 16:33 |
tobiash | hwangbo: do you already have this in your deployment https://review.opendev.org/665186 ? | 16:33 |
tobiash | this fixes the default in the docker-compose so this condition is met | 16:34 |
pabelanger | good point | 16:34 |
tobiash | (which has been merged in the last few days) | 16:34 |
hwangbo | tobiash: Nope, we don't have that change yet | 16:35 |
fungi | corvus: hrm, grep counts 46 occurrences in a 24 hour period, so i guess it's somewhat expected (or else this problem is more widespread) | 16:35 |
pabelanger | tobiash: mind a review on https://review.opendev.org/666605/ | 16:35 |
tobiash | pabelanger: done | 16:35 |
corvus | fungi: yes, the race is not unexpected, and, at least theoretically handled :) | 16:37 |
fungi | right, so there's more to it than just that | 16:37 |
corvus | it is certainly an interesting code path though | 16:37 |
fungi | presumably if a build is underway on an executor then it is expected that there is a corresponding gearman job, otherwise i'd expect waaaay more than a couple of these an hour | 16:38 |
tobiash | maybe connection issues with gearman? | 16:39 |
tobiash | fungi, corvus: fwiw I don't have a match to 'ERR UNKNOWN_JOB' in my logs in the last 15 days | 16:41 |
fungi | tobiash: in the scheduler debug log, right? | 16:41 |
tobiash | oh wait, the matching was wrong :/ | 16:41 |
tobiash | I lied, I see 5000 in the last 15 days | 16:42 |
tobiash | yes, scheduler debug log | 16:42 |
hwangbo | pabelanger: Is there any documentation on how to scale the executor? This is just for the server initialization, I thought there's only ever 1 "executor" container running. | 16:44 |
tobiash | so this looks quite normal | 16:44 |
corvus | hwangbo: update with the change tobiash mentioned first, then let us know if you still have problems | 16:45 |
*** panda has quit IRC | 16:50 | |
*** panda has joined #zuul | 16:52 | |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Defer setting build result to event queue https://review.opendev.org/666643 | 16:55 |
tobiash | this should fix the node leak in my scenario ^ | 16:56 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 16:56 |
tobiash | corvus: is this worth a release note? | 16:57 |
tobiash | pabelanger: I'd appreciate a review on https://review.opendev.org/579962 and parent. Facilitates advanced debugging techniques in a running zuul | 16:59 |
corvus | tobiash: i don't feel strongly about a release note for that | 17:00 |
pabelanger | tobiash: sure, can look shortly | 17:00 |
pabelanger | contemplating zuul upgrade to 3.9.0 this afternoon :) | 17:01 |
corvus | tobiash: that will slow things down a bit, but i think we actually process events fairly quickly now, so it's probably okay (there was a time where result event processing took so long, half our system would be idle) | 17:01 |
corvus | tobiash: i think we'll just want to keep an eye out for efficiency changes related to that | 17:02 |
openstackgerrit | Merged zuul/nodepool master: Add release note about pinning openshift client https://review.opendev.org/666605 | 17:03 |
tobiash | corvus: yeah, I also thought about that but at least freeing the nodes is at the same stage | 17:05 |
pabelanger | nodepool 3.7.1 should be ready now, if we want to tag with ^ | 17:05 |
tobiash | corvus: or do you think we need a different solution? | 17:05 |
pabelanger | We are already running the pre-release version for zuul.a.c, without issues | 17:05 |
corvus | tobiash: this feels like the more correct one, and is simpler; so i think it's worth trying | 17:06 |
tobiash | ++ | 17:06 |
corvus | i'll tag nodepool master as 3.7.1 | 17:07 |
corvus | pushed | 17:09 |
pabelanger | Thanks! | 17:10 |
hwangbo | corvus pabelanger: I don't think it helped. I updated with the change, but it's still stuck doing "Updating repo...Resetting repo" for what seems like each commit in the repository | 17:13 |
*** jpena is now known as jpena|off | 17:15 | |
tobiash | hwangbo: it shouldn't do that for each commit, so logs would be helpful | 17:16 |
hwangbo | tobiash: Here's a snippet of what the log looks like. I spoofed out some sensitive information, but I think you'd get the idea. https://pastebin.com/HEQH1eWM | 17:32 |
tobiash | hwangbo: you mean the cat jobs that are executed when starting up zuul? | 17:35 |
tobiash | those are executed for each repo and branch | 17:35 |
hwangbo | Yeah, sorry. I should have been more verbose | 17:35 |
tobiash | (not every commit) | 17:35 |
hwangbo | For us, that process takes several hours | 17:35 |
tobiash | in this case a single executor is a bottleneck | 17:35 |
tobiash | so you likely want to run more than one executor | 17:35 |
tobiash | (which is not covered by the docker-compose file) | 17:36 |
hwangbo | Is there any other documentation on how to set that up? | 17:36 |
tobiash | well you have the executor config, so you just can take that part and run it multiple times | 17:37 |
tobiash | note that the cache dirs must not be shared | 17:37 |
hwangbo | run it multiple times? | 17:38 |
hwangbo | Is this separate from the docker-compose file? | 17:38 |
fungi | in our case we run executors and additional mergers, and put them on separate servers from where we run the scheduler/web/finger-gw daemons | 17:39 |
fungi | i think our current count is 12 executors and 8 additional mergers for the opendev zuul deployment | 17:40 |
fungi | the cat job workload probably scales in an embarrassingly parallel fashion, so you could divide the current duration by the desired duration to get a rough estimate of the number you need | 17:42 |
pabelanger | does docker-compose setup a zuul-merger? | 17:43 |
fungi | another (or additional) option could be to limit which repositories you want zuul to look in for job configuration | 17:43 |
pabelanger | if not, maybe we make that an option thing to enable | 17:43 |
hwangbo | It doesn't setup a standalone zuul-merger | 17:44 |
fungi | since i *think* cat jobs are only run for repositories where zuul is looking for job configuration (someone can probably correct me on that) | 17:44 |
pabelanger | however, given this is a single server, IO is still going to be a concern | 17:44 |
hwangbo | Is there an example zuul.conf I can look at to see how multiple executors/mergers are configured? | 17:45 |
pabelanger | hwangbo: in theory, you shouldn't need to change anything when you add another executor online. However, in this case, like tobiash said, you need to create a new volume in docker as not to share it with other executor | 17:47 |
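For the quick-start setup, that could mean adding a second executor service to the compose file; a rough sketch, with the service name, volume name, and mount paths all hypothetical and assumed to mirror the existing executor service:

```yaml
# docker-compose fragment: a second executor sharing zuul.conf
# but with its own (unshared) git cache volume.
services:
  executor2:
    image: zuul/zuul-executor
    volumes:
      - ./etc_zuul:/etc/zuul:ro        # same zuul.conf as the first executor
      - executor2-git:/var/lib/zuul    # separate cache dir, not shared

volumes:
  executor2-git:
```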
pabelanger | https://github.com/openstack-infra/puppet-zuul/blob/master/templates/zuulv3.conf.erb | 17:48 |
pabelanger | is example from opendev.org | 17:48 |
pabelanger | https://github.com/ansible-network/windmill-config/blob/master/zuul/zuul.conf.j2 | 17:48 |
pabelanger | is ours for zuul.ansible.com | 17:48 |
fungi | zuul is designed so that you can have just one service configuration and hand it out to all the executors and mergers as well as the scheduler, but for merger-specific configuration options see https://zuul-ci.org/docs/zuul/admin/components.html#id4 | 17:49 |
fungi | really if you just take your executor configuration and put it on additional servers with zuul installed and start the zuul-executor or zuul-merger daemon, it ought to simply work | 17:51 |
fungi | assuming you've configured them to communicate with the scheduler on an address they can all reach | 17:51 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Differentiate between queued and waiting jobs in zuul web UI https://review.opendev.org/660878 | 17:51 |
fungi | and that any firewall rules are configured to allow communication between those servers | 17:52 |
hwangbo | fungi, pabelanger, tobiash: Thanks for all the help. I'll be giving this a shot | 17:52 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Differentiate between queued and waiting jobs in zuul web UI https://review.opendev.org/660878 | 17:52 |
fungi | for a better understanding of what firewall rules you might need to allow communication between distributed services, see https://zuul-ci.org/docs/zuul/admin/components.html | 17:53 |
corvus | yes, io on a single machine may still be a bottleneck no matter how many mergers/executors are running. it all depends on the situation. for that reason, i don't think it's worth expanding the quick start for this. the operator, otoh, should handle this case. | 17:57 |
tobiash | corvus: we're running into another scalability problem regarding the number of branches in a repo that is being used by a job. We have a project containing 2000 branches (due to branch&pull workflow and many people working on it). Restoring the repo state when running a job takes a significant amount of time (20 minutes when under io pressure). | 17:58 |
tobiash | most jobs only need the protected branches (<5) | 17:59 |
corvus | tobiash: "most"? | 17:59 |
tobiash | actually probably all | 17:59 |
corvus | whew, that's probably more actionable :) | 18:00 |
tobiash | one idea would be to be able to limit the repo state to protected branches as a job setting | 18:00 |
tobiash | but that would violate layering in the model | 18:00 |
tobiash | however I cannot guarantee that it's *all* jobs so I guess we'd need something configurable | 18:01 |
tobiash | (e.g. projects following fork and pull on github.com might not even need protected branches) | 18:02 |
tobiash | or we could determine this by the exclude-unprotected-branches option on the repo | 18:02 |
tobiash | which we already have | 18:03 |
tobiash | I think in this case we could require a user to protect a branch if he wants to run something on it | 18:03 |
tobiash | I think I'll go for the existing exclude-unprotected-branches option as the best compromise. I don't really want to put that into a job parameter. | 18:05 |
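The option tobiash settles on lives in the tenant configuration for GitHub-sourced projects. A hedged sketch of how it might be applied (tenant and project names are hypothetical):

```yaml
# Hypothetical tenant config snippet: with exclude-unprotected-branches
# set, Zuul only considers the few protected branches of the repo rather
# than all ~2000 work branches, which is what makes restoring the repo
# state for a job affordable again.
- tenant:
    name: example-tenant
    source:
      github:
        untrusted-projects:
          - big/monorepo:
              exclude-unprotected-branches: true
```

As noted in the discussion, this implies a workflow rule: a user who wants jobs to run on a branch must have that branch protected.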
*** hashar has joined #zuul | 18:06 | |
*** jamesmcarthur has quit IRC | 18:06 | |
*** jamesmcarthur has joined #zuul | 18:07 | |
pabelanger | yah, I've seen some random branches get created in a repo, when humans edit a file via web ui | 18:08 |
pabelanger | hard to suggest changing that workflow, as users tend to not be well versed in git | 18:08 |
*** jamesmcarthur has quit IRC | 18:12 | |
tobiash | in many enterprise workflows fork&pull is not even possible. In this case you always have the source branches of pull requests in the same repo (which most likely won't ever be needed by jobs) | 18:12 |
*** jamesmcarthur has joined #zuul | 18:12 | |
*** jamesmcarthur_ has joined #zuul | 18:17 | |
*** jamesmcarthur has quit IRC | 18:17 | |
*** jamesmcarthur_ has quit IRC | 18:22 | |
*** jamesmcarthur has joined #zuul | 18:23 | |
*** igordc has joined #zuul | 18:27 | |
*** Minnie100 has joined #zuul | 18:30 | |
*** igordc has quit IRC | 18:36 | |
openstackgerrit | Merged zuul/zuul master: Update cached repo during job startup only if needed https://review.opendev.org/648229 | 18:39 |
*** jamesmcarthur has quit IRC | 18:42 | |
*** jamesmcarthur has joined #zuul | 18:51 | |
*** jamesmcarthur has quit IRC | 19:01 | |
*** jamesmcarthur has joined #zuul | 19:14 | |
*** Minnie100 has quit IRC | 19:16 | |
*** jamesmcarthur has quit IRC | 19:22 | |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Filter out unprotected branches from builds if excluded https://review.opendev.org/666664 | 19:34 |
*** gtema has joined #zuul | 19:40 | |
*** bhavikdbavishi has joined #zuul | 19:44 | |
*** gtema has quit IRC | 19:46 | |
*** bhavikdbavishi has quit IRC | 20:15 | |
SpamapS | tobiash: or in my case, the users just refuse to use fork&pull because they like that branches on the main repo are shared by default. | 20:26 |
SpamapS | *ugh*, the prohibition of untrusted repos sharing a job with a secret just ruined my day :-/ | 20:32 |
SpamapS | have to move that job into a config repo :-/ | 20:32 |
corvus | i had an idea of how to address that, but i don't seem to have time to work on it | 20:42 |
*** pcaruana has quit IRC | 20:45 | |
SpamapS | corvus: I'm happy my secrets are protected, but yeah... I wish there was a better answer. | 20:46 |
*** jeliu_ has joined #zuul | 20:47 | |
SpamapS | also this runs into the job renaming problem I had in the past | 20:49 |
SpamapS | (there's no way to move a job from repo to repo without going through a 3-way dance to create a new name, change all references to it, and then remove the old one and rename the new one) | 20:52 |
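The three-way dance SpamapS describes might look like the following sequence of changes (job and repo names are hypothetical, for illustration only):

```yaml
# Step 1: in the destination repo, define the job under a temporary name.
- job:
    name: promote-widget-tmp     # hypothetical temporary name
    run: playbooks/promote.yaml

# Step 2: update every project stanza that referenced the old job to use
# the temporary name instead.
- project:
    promote:
      jobs:
        - promote-widget-tmp

# Step 3: delete the original job from the source repo, then rename
# promote-widget-tmp back to the original name and update all the
# references a second time.
```

Each step has to merge before the next can safely land, which is why the process feels so heavyweight for what is conceptually a single move.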
dmsimard | SpamapS: I'd love a better solution to that dance too. | 20:53 |
dmsimard | I don't have any great ideas :( | 20:53 |
corvus | i expect it could be better with bidirectional dependencies | 20:53 |
corvus | (at least, for those who wished to enable that, once support for it exists) | 20:54 |
SpamapS | Maybe, another thought I had was just to have an optional precedence field. | 20:54 |
corvus | SpamapS: you could temporarily allow a repo to shadow another | 20:54 |
*** jeliu_ has quit IRC | 20:54 | |
corvus | (that is a form of optional precedence) | 20:54 |
SpamapS | corvus: indeed.. a bit of a big hammer, but swingable. | 20:55 |
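The "big hammer" corvus suggests is the tenant configuration's per-project shadow attribute. A sketch of how it might be used during a migration (tenant and repo names are assumptions):

```yaml
# Hypothetical: while moving a job from org/repo-a to org/repo-b,
# temporarily let repo-b's copy of identically-named config take
# precedence over repo-a's, avoiding the rename dance.
- tenant:
    name: example-tenant
    source:
      github:
        untrusted-projects:
          - org/repo-b:
              shadow: org/repo-a
```

Once the job has been removed from the shadowed repo, the shadow entry can be dropped again.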
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 20:56 |
SpamapS | corvus: I kind of wonder why I can't just break the prohibition and say "allowed-projects: foo" | 20:56 |
SpamapS | The two repos are separate purely for focus reasons.. we expect a small team to iterate rapidly on a small piece of the code so we extracted it to a separate repo.. | 20:57 |
SpamapS | so it would be nice if I could just make an explicit trust. | 20:57 |
corvus | SpamapS: i think the solution i was considering had the idea that a config project could cause a job in untrusted repo A to be run in untrusted repo B. in other words, allowing the config project to assert that trust relationship. | 20:59 |
SpamapS | corvus: yes, that's exactly what I want. | 21:00 |
corvus | SpamapS: unfortunately, just doing "allowed-projects: foo" in an untrusted project is dangerous (that is a thing we had to remove) because a change can speculatively change it. so we need to get the config project involved to fixate it safely. | 21:00 |
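The direction corvus sketches (and which the WIP change below starts on) is to let the trusted config project assert the trust relationship. A hypothetical sketch, with all job, secret, and project names invented for illustration:

```yaml
# Hypothetical job in a trusted config project: because the definition
# lives in a config project, the allowed-projects list cannot be altered
# speculatively by a proposed change, so it is safe to pair a secret
# with specific untrusted projects.
- job:
    name: promote-widget         # hypothetical job name
    secrets:
      - deploy-credentials       # hypothetical secret name
    allowed-projects:
      - org/widget
      - org/widget-lib
```

This keeps the secret protected while still letting the two closely related untrusted repos share the job, which is the explicit-trust arrangement SpamapS asks for above.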
*** hashar has quit IRC | 21:01 | |
corvus | SpamapS: i'll see if i can't negotiate that one up higher in my todo list... it's an unhandled use case, and you know i don't like those. | 21:02 |
SpamapS | I do ;) | 21:03 |
SpamapS | also, TBH, we may actually kick all the secrets out of the untrusted repos to have more separation between deploy and dev. | 21:04 |
openstackgerrit | James E. Blair proposed zuul/zuul master: WIP: Allow config projects to override allowed-projects https://review.opendev.org/666733 | 21:17 |
corvus | SpamapS: ^ maybe now it'll be a bit more in my face and i can finish it up piecemeal. | 21:18 |
corvus | i wonder if, in the nodepool devstack job, we could start the nodepool builder at the beginning of the job, so that it ran the DIB build portion while devstack was being installed, and we rely on the upload retries for the builder to eventually upload the image once the cloud came online... | 21:39 |
corvus | i think i'll try that after i get the basic functionality working | 21:40 |
*** sshnaidm is now known as sshnaidm|off | 21:55 | |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 22:15 |
*** mattw4 has quit IRC | 22:47 | |
*** sean-k-mooney1 has quit IRC | 22:49 | |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 22:53 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 22:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 22:55 |
*** mattw4 has joined #zuul | 23:02 | |
*** rlandy has quit IRC | 23:44 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!