tristanC | corvus: thanks you for the pings, i approved the changes | 00:20 |
---|---|---|
corvus | tristanC: thanks! | 00:24 |
*** iurygregory_ is now known as iurygregory | 06:42 | |
*** rpittau|afk is now known as rpittau | 07:22 | |
*** jpena|off is now known as jpena | 07:34 | |
opendevreview | Ian Wienand proposed zuul/zuul master: [wip] treeview for job overview page https://review.opendev.org/c/zuul/zuul/+/804956 | 07:35 |
opendevreview | Benjamin Schanzel proposed zuul/nodepool master: Add Tenant-Scoped Resource Quota https://review.opendev.org/c/zuul/nodepool/+/800765 | 07:47 |
opendevreview | Benjamin Schanzel proposed zuul/nodepool master: Add Tenant-Scoped Resource Quota https://review.opendev.org/c/zuul/nodepool/+/800765 | 07:59 |
opendevreview | Felix Edel proposed zuul/zuul master: WIP NodeRequest watches https://review.opendev.org/c/zuul/zuul/+/804961 | 08:08 |
*** mgoddard- is now known as mgoddard | 08:20 | |
*** mhuin is now known as mhu | 08:31 | |
gryf | hello! | 08:48 |
avass[m] | gryf: hi! | 08:48 |
gryf | I have a question regarding two nodes job (controller and compute1), defined like this: https://opendev.org/openstack/kuryr-kubernetes/src/branch/master/.zuul.d/multinode.yaml#L15-L63 and I'd like to pass something which is generated during stacking on controller devstack to the compute1. is it possible? | 08:50 |
gryf | I'm looking on https://zuul-ci.org/docs/zuul/reference/jobs.html#return-values but cannot figure out how to expose the data (from first example) on controller. any hints? | 08:50 |
gryf | is it even possible on job level? | 08:50 |
*** dviroel|ruck|out is now known as dviroel|ruck | 11:12 | |
*** jpena is now known as jpena|lunch | 11:34 | |
*** sshnaidm|pto is now known as sshnaidm | 12:25 | |
*** jpena|lunch is now known as jpena | 12:32 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 12:38 | |
fungi | gryf: i'm not clear on what you mean by "expose the data" (the example you've linked was just for a job definition) | 13:02 |
fungi | corvus: okay, this one may be cache-related, and i'm also not sure where to go next debugging it: a project in opendev recently renamed a zuul config file in their git repository, and zuul is reporting config errors for content in the old filename which no longer exists. change https://review.opendev.org/801436 renamed zuul.d/jobs.yaml to zuul.d/devstack-tobiko.yaml weeks ago, but zuul is | 13:32 |
fungi | reporting config errors for zuul.d/jobs.yaml in x/devstack-plugin-tobiko here: https://zuul.opendev.org/t/openstack/config-errors | 13:32 |
fungi | like it still has the old file cached | 13:32 |
opendevreview | Matthieu Huin proposed zuul/zuul master: web UI: user login with OpenID Connect https://review.opendev.org/c/zuul/zuul/+/734082 | 13:32 |
fungi | related, another project which uses jobs from there ceased being able to run any builds after we restarted zuul yesterday, getting "Unable to freeze job graph: Job devstack-tobiko is abstract and may not be directly run" and i'm not sure yet whether that's related | 13:33 |
fungi | to the same problem | 13:33 |
fungi | suggestions welcome from anybody really, i just mentioned corvus because yesterday he mentioned the potential for regressions in the recent config cache work | 13:35 |
fungi | i suspect the second problem is related though because the change which renamed that file also switched the problem job definition from abstract to runnable | 13:41 |
fungi | so zuul still seeing the old file would explain both symptoms | 13:41 |
gryf | fungi, so I have a job which is multinode. on controller there is devstack spinned up, and as one of the results there is a kubernetes cluster spinned up. I'd like to pass somehow results of the command `kubeadm token create --print-join-command`, which gives something like `kubeadm join 10.0.2.15:6443 --token z4dnzr.ag24bdnghb3nb3o5 --discovery-token-ca-cert-hash | 13:44 |
gryf | sha256:6a9b90fe4394bbff3b5d88f5aefea3a2b04744a4f4502a1d51992870bd920cb8` on the output. I'd like to pass it somehow to the other node, compute1. | 13:44 |
corvus | fungi: i think i'd start at the same place -- did we get the event for the change merge? | 13:46 |
fungi | gryf: the two nodes should be able to communicate with one another, worst case you could run the command remotely from the other node via ssh, or put it into a file and scp it to or from the other node | 13:46 |
fungi | corvus: even though the change merged in july and we've had multiple scheduler restarts since? | 13:47 |
fungi | and the problem doesn't seem to have appeared until some time between 14:55 utc yesterday and 07:16 utc today | 13:48 |
gryf | fungi, I was thinking about it as a last resort, and was hoping about some mechanism within zuul itself. but it will work aswell. thank you, kind sir :) | 13:48 |
corvus | fungi: in re your 1st 2 questions: yes. your last question does change things; i don't know why it would suddenly appear | 13:49 |
fungi | so we no longer reread configuration on scheduler restart i guess, and that's been the case since some number of weeks | 13:50 |
fungi | and that's now persisted in a cache (in zk?) | 13:50 |
fungi | gryf: you could probably rely on ansible to save the output as a cached fact and then reference it like a variable in a later playbook, though if that data is sensitive i'm not sure facts are a safe way to pass it around | 13:52 |
corvus | fungi: a small number of weeks, yes, though we've been partially writing to the cache for a larger number of weeks. | 13:52 |
corvus | fungi: i can help dig into this in a few hours | 13:52 |
fungi | okay, so it's conceivable we've cached that config from prior to july 21, at least | 13:53 |
gryf | fungi, I guess that's not the case, since this is all happening within job, so after job is done, it's already too late. I guess, I'd go with the ssh way. | 13:53 |
fungi | corvus: no hurry, i was just trying to figure out what to be looking at. i'll see if we still have scehduler logs from the timedframe where that file rename merged | 13:53 |
pabelanger[m] | gryf: you can use set_fact but delegate_to the other host, then the next play can access it. You may want to set cacheable to be safe | 14:00 |
pabelanger[m] | if the nodes are in the same job / nodeset | 14:01 |
pabelanger[m] | child jobs, you'll need to use zuul_return | 14:01 |
fungi | corvus: it hadn't dawned on me that the cache might persist past scheduler restarts, but in retrospect that makes sense and explains the reconfigure on start change (804806) | 14:03 |
fungi | i'm going to conveniently blame the fact that people always seem to want to report these issues while i'm asleep, and so i foolishly start trying to debug them before breakfast | 14:15 |
gryf | pabelanger[m], can you provide some general job example for this please? | 14:27 |
opendevreview | Benjamin Schanzel proposed zuul/nodepool master: Add Tenant-Scoped Resource Quota https://review.opendev.org/c/zuul/nodepool/+/800765 | 14:35 |
clarkb | gryf: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-tox/tasks/main.yaml#L15-L19 that ensures a cacheable fact called tox-executable that later playbooks can use to run tox | 15:19 |
clarkb | gryf: that is the sort of thing pabelanger[m] is describing | 15:19 |
*** jpena is now known as jpena|off | 15:42 | |
opendevreview | Felix Edel proposed zuul/zuul master: WIP NodeRequest watches https://review.opendev.org/c/zuul/zuul/+/804961 | 15:44 |
*** rpittau is now known as rpittau|afk | 16:06 | |
pabelanger[m] | gryf: https://docs.ansible.com/ansible/latest/user_guide/playbooks_delegation.html#delegating-facts | 16:10 |
pabelanger[m] | actually: https://docs.ansible.com/ansible/latest/user_guide/playbooks_delegation.html#delegating-tasks | 16:10 |
clarkb | https://lwn.net/ml/linux-kernel/xmqq1r6touqi.fsf@gitster.g/ git 2.33 released on Monday and includes a new merge algorithm. This algorithm is expected to become the default in the next release. It is apparently much much faster under certain situations | 16:34 |
clarkb | Wondering if maybe zuul should consider updating its docker images to include git 2.34 whenever that happens as merge times are one of the big time costs in zuul? | 16:35 |
fungi | do we perform default strategy merges now? i know at one point it tried to mimic the limited merge strategy implemented by jgit because of wanting to avoid merging something gerrit couldn't | 16:36 |
clarkb | fungi: I think it depends on the merge context? but maybe we do it globally | 16:37 |
clarkb | ya strategy is an optional argument to the merge method. Now to see if we ever call it with None set | 16:38 |
corvus | the default is 'git merge -s resolve' | 16:38 |
corvus | if you waint plain 'git merge' from zuul, it's a specific setting | 16:38 |
corvus | we could add 'merge-sort' (lol :) as another strategy though; but i don't know that it makes sense to change zuul's default unless the code review systems change | 16:40 |
corvus | but maybe github will because of their close ties to the git folks? | 16:40 |
clarkb | corvus: I think when git 2.34 happens zuul.model.MERGER_MERGE will use the new mode | 16:41 |
corvus | clarkb: yes | 16:42 |
clarkb | which is probably sufficient here | 16:42 |
clarkb | and ya as long as jgit is simpler then resolve is likely best there (and apparently resolve is quite fast too but doesn't succeed as often?) | 16:42 |
opendevreview | James E. Blair proposed zuul/zuul master: Add delete-state command to delete everything from ZK https://review.opendev.org/c/zuul/zuul/+/804304 | 16:44 |
corvus | clarkb, fungi: apropos the current opendev issue, ^ is a "reset button" for zuul operators | 16:45 |
fungi | neat! | 16:46 |
fungi | big red switch | 16:46 |
gryf | corvus, pabelanger[m], I'll take a look, thanks! | 16:58 |
corvus | fungi, clarkb: on a first pass through here -- i'm beginning to suspect we may not delete files from the cache in zk, which was fine right up until we started relying on the list of files in zk as canonical on startup. | 17:17 |
corvus | (we do delete entire projects from the cache, but if we delete files, i'm missing it) | 17:17 |
corvus | that still doesn't explain why this worked yesterday. | 17:18 |
fungi | which commit started relying on the file list from zk? could that have maybe merged between our last two scheduler restarts? | 17:19 |
corvus | fungi: https://review.opendev.org/771463 merged on aug 6 | 17:23 |
corvus | we restarted the next day. we restarted aug 7, aug 10, and aug 17 | 17:25 |
fungi | so we should have seen this cropping up as soon as the 7th | 17:25 |
fungi | puzzling | 17:26 |
corvus | maybe they merged a config change since then which did not rely on zk for the file cache | 17:27 |
corvus | s/cache/list/ | 17:27 |
corvus | yes, they did | 17:28 |
fungi | well, for the observed symptom, we saw zuul was able to enqueue a change in the check pipeline yesterday, which today it could no longer freeze the job graph for. all open changes they've tried for that same project are seeing that behavior today | 17:28 |
corvus | so it's possible that when zuul sees a change that touches the config, it gets the list of existing files from the merger via cat jobs, but on startup, it will get the set of files from zk. | 17:29 |
corvus | so every time they merge a config change they "fix" the problem, temporarily, until the next time we restart the scheduler, at which point it 'breaks" again | 17:29 |
fungi | oh, got it | 17:29 |
fungi | that explains why they ended up doing https://review.opendev.org/804356 and then thought it solved a problem | 17:30 |
fungi | all it really did was convince zuul to start using a current configuration there, until the next restart | 17:31 |
corvus | yep | 17:31 |
corvus | i think this is enough information for me to start a reproducer test case; i'm going to take a short break, then start working on that. i may be only intermittently available for the next little bit because this is a heads-down kind of thing. | 17:31 |
fungi | thanks! i agree this seems to explain everything we saw | 17:31 |
corvus | i'm not going to issue the .1 release i was planning on, because i think this is a more important fix than the other things, and that will take time. | 17:31 |
fungi | it does seem like a regression worth including in whatever we tag next | 17:32 |
fungi | er, worth including a fix for | 17:32 |
Clark[m] | I'll have to catch up as I remembered the garden needed tending before the sun showed up. Then I found a present from what I think is a raccoon... | 17:33 |
dmsimard | I'm already on matrix but was surprised to see the thread on zuul-discuss about moving off of IRC entirely :o | 17:39 |
clarkb | Agreed that holding off on the .1 makes sense so that this can be included in that release | 17:53 |
clarkb | dmsimard: that has been the intention for a while, just a matter of getting a homeserver and bots on it sorted out | 17:54 |
fungi | dmsimard: https://zuul-ci.org/docs/zuul/reference/developer/specs/community-matrix.html | 17:55 |
dmsimard | thanks for the info and the link | 18:04 |
corvus | clarkb, fungi: i have a test reproducer | 19:04 |
corvus | (which confirms the theory) | 19:04 |
fungi | corvus: splendid! thanks for working on the proof | 19:07 |
*** dviroel|ruck is now known as dviroel|ruck|out | 19:07 | |
fungi | i guess once there's a fix we're also going to want to recommend that deployments flush the state cache (so 804304 will come in handy there)? | 19:09 |
Clark[m] | Does the flush delete all the secrets too? If so we should proceed carefully :) | 19:12 |
fungi | it does not | 19:13 |
fungi | at least the documentation added with it says it doesn't | 19:13 |
fungi | but worth double-checking the code to make sure when reviewing | 19:13 |
corvus | fungi, Clark: i think a 'zuul-scheduler full-reconfigure' would be sufficient to fix; but let's check back in on that when i'm done with the fix :) | 19:21 |
fungi | absolutely | 19:22 |
opendevreview | James E. Blair proposed zuul/zuul master: Fix error with config cache https://review.opendev.org/c/zuul/zuul/+/805078 | 20:00 |
corvus | clarkb, fungi, swest, tobiash: ^ | 20:01 |
clarkb | reviewing | 20:01 |
corvus | actual fix is 7 lines, naturally | 20:02 |
fungi | that many? ;P | 20:03 |
opendevreview | James E. Blair proposed zuul/zuul master: Fix error with config cache https://review.opendev.org/c/zuul/zuul/+/805078 | 20:04 |
corvus | ps2 only fixes typos in commit message | 20:04 |
corvus | grr | 20:04 |
opendevreview | James E. Blair proposed zuul/zuul master: Fix error with config cache https://review.opendev.org/c/zuul/zuul/+/805078 | 20:05 |
corvus | ditto | 20:05 |
tobiash[m] | wow | 20:05 |
corvus | to be fair, 2 of the 7 lines are only there because the linter doesn't like long lines :) | 20:08 |
corvus | and 4 of them are a log line | 20:09 |
clarkb | corvus: super small thing noted inline that may be worth another ps if you can take a look | 20:15 |
clarkb | I +2'd though in case you don't think it is a big deal you can probably +A the change | 20:15 |
corvus | clarkb: yes we should do that; that actually matches what we do elsewhere | 20:16 |
corvus | can anyone think of a reason this should be startswith and not == ? https://opendev.org/zuul/zuul/src/branch/master/zuul/configloader.py#L2408 | 20:22 |
corvus | clarkb: oh, we actually don't append the / in that place either | 20:24 |
clarkb | corvus: since extra config files is distinct from extra config dirs I think line 2408 that you call out should be an == | 20:25 |
clarkb | we should only need startswith for the dirs case | 20:25 |
corvus | yeah. i feel like that would be better. i'm going to fix that in this change, mostly because i want the two cases to be consistent, and if that's going to fail tests, i want to know now, before we merge the main fix. | 20:27 |
opendevreview | James E. Blair proposed zuul/zuul master: Fix error with config cache https://review.opendev.org/c/zuul/zuul/+/805078 | 20:27 |
corvus | clarkb, tobiash: tell me if PS3..PS4 looks correct; i'm going to kick off a full local test run | 20:28 |
clarkb | corvus: I think my only concern with what you have is line 2429. What happens if the extra config dirs are passed with a trailing / do we strip that off? | 20:29 |
clarkb | otherwise you may need to check for a trailing / first? I guess that is true for the code on line 1856 too | 20:29 |
corvus | clarkb: yes we do https://opendev.org/zuul/zuul/src/branch/master/zuul/configloader.py#L1692 | 20:29 |
clarkb | in that case +2 | 20:30 |
corvus | it's actually significant in the config specification; 'trailing /' means directory in the yaml file, then we strip it once we store it in the _dirs variable | 20:30 |
corvus | and it looks like the merger is designed to return only files if you ask for files (it's not going to return a dir that matches a requested file). that's the only other reason i could think of using the startswith. | 20:33 |
corvus | tobiash: ^ if you have a second for a re-review | 20:33 |
*** timburke_ is now known as timburke | 21:00 | |
*** y2kenny is now known as Guest4766 | 21:07 | |
*** Guest4766 is now known as y2kenny | 21:10 | |
y2kenny | When dealing with git repository that uses submodule, are there any special configuration needed with Zuul? | 21:13 |
corvus | tobiash: i'm carrying over your +2 and approving that | 21:15 |
clarkb | y2kenny: your jobs needs to be smarter and the coordination between repos that zuul does gets trickier. But at a very base level no | 21:16 |
corvus | y2kenny: no, but zuul doesn't know anything about them (only enough to not corrupt data). getting them to work with zuul's speculative execution can be considerable work. | 21:16 |
corvus | (and depends heavily on the environment they're used in -- what works in some regime may not work in another) | 21:17 |
y2kenny | ok... so the executor will just checkout the parent without any additional 'git submodule' commands and any nesting will need to be handled by the job? | 21:19 |
corvus | yes | 21:19 |
y2kenny | ok thanks. | 21:19 |
opendevreview | Merged zuul/zuul-operator master: Document externalConfig https://review.opendev.org/c/zuul/zuul-operator/+/800791 | 21:58 |
opendevreview | Merged zuul/zuul-operator master: Fix config update detection https://review.opendev.org/c/zuul/zuul-operator/+/800991 | 22:03 |
opendevreview | Merged zuul/zuul master: Fix error with config cache https://review.opendev.org/c/zuul/zuul/+/805078 | 22:23 |
clarkb | corvus: ^ I guess that needs a restart and then the full reconfigur? | 22:27 |
corvus | yep, i'll get that going over in #opendev | 22:32 |
opendevreview | James E. Blair proposed zuul/zuul master: Fix error with config cache some more https://review.opendev.org/c/zuul/zuul/+/805089 | 23:42 |
corvus | clarkb, fungi, tobiash, swest: ^ part 2. i don't think that's as urgent, but i would like to merge and restart with that before making the .1 release | 23:43 |
fungi | thanks, reviewing | 23:43 |
opendevreview | Ian Wienand proposed zuul/zuul master: web: Jobs: Use TreeView for job overview page https://review.opendev.org/c/zuul/zuul/+/804956 | 23:51 |
clarkb | yup reviewing | 23:55 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!