Monday, 2024-02-05

-@gerrit:opendev.org- Peter Strunk proposed: [zuul/zuul] 903808: zuul_stream: add FQCN for windows command and shell https://review.opendev.org/c/zuul/zuul/+/903808 01:42
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 907256: Add --keep-config-cache option to delete-state command https://review.opendev.org/c/zuul/zuul/+/907256 09:25
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-client] 907627: Handle forward compatability with cycle refactor https://review.opendev.org/c/zuul/zuul-client/+/907627 09:34
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 906765: Fix 10 second delay after skipped command task https://review.opendev.org/c/zuul/zuul/+/906765 10:10
@vonschultz:matrix.orgI restarted Zuul, and now my scheduler won't start. I'm confused, because I haven't upgraded the Zuul Docker images (still at 8.2.0), and the zuul.conf on the scheduler machine hasn't changed as far as I know. I'm getting the error18:01
```
  File "/usr/local/lib/python3.11/site-packages/zuul/configloader.py", line 2206, in _cacheTenantYAMLBranch
    self._updateUnparsedBranchCache(
  File "/usr/local/lib/python3.11/site-packages/zuul/configloader.py", line 2321, in _updateUnparsedBranchCache
    min_ltimes[source_context.project_canonical_name][
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'opendev.org/zuul/zuul-jobs'
```
Does anyone have any idea what this is or how I might recover from it?
@jim:acmegating.comChristian von Schultz: the cause would take quite a bit of effort to determine (and may not be relevant in current versions), but the remedy should be to stop all zuul components, then run `zuul-admin delete-state` on the scheduler to completely reset the ephemeral zookeeper state (that includes current pipeline contents), then start the system up again.18:09
@jim:acmegating.comthat's the universal fix for "something in zk is wrong"18:10
@vonschultz:matrix.orgIt looks like it worked. Thanks. 👍️18:26
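For reference, the recovery corvus describes might look roughly like the sketch below. The docker compose service names and the compose-based invocation are assumptions about one possible deployment; only `zuul-admin delete-state` itself (and the `--keep-config-cache` option from change 907256 above, available in newer releases) comes from the discussion.
```
# Rough sketch of the "reset ZooKeeper state" recovery, assuming the components
# run as docker compose services with these hypothetical names.

# 1. Stop every Zuul component (schedulers, executors, mergers, web).
docker compose stop zuul-scheduler zuul-executor zuul-merger zuul-web

# 2. Wipe Zuul's ephemeral ZooKeeper state. Note this also discards whatever is
#    currently enqueued in pipelines; the ZK connection is read from zuul.conf.
docker compose run --rm zuul-scheduler zuul-admin delete-state
# Newer releases add a --keep-config-cache option here so the configuration
# cache does not have to be rebuilt from scratch afterwards.

# 3. Start everything back up; the scheduler repopulates ZooKeeper on startup.
docker compose up -d zuul-scheduler zuul-executor zuul-merger zuul-web
```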
@fungicide:matrix.orgChristian von Schultz: as to the cause, is it possible your zookeeper service got interrupted mid-write or something?18:41
@fungicide:matrix.orghaving a 3-way zk cluster helps mitigate the risk that a spontaneous crash/reboot corrupts your state18:42
@vonschultz:matrix.orgI do have a 3-way zk cluster.18:42
@fungicide:matrix.orgcool, so probably not that at least18:43
@vonschultz:matrix.orgBut I suppose it _is_ possible that it got interrupted mid-write anyway. I ran an Ansible playbook, and in the upgrade task, a new version of `docker-ce` got rolled out. Since Ansible likes to run many hosts in parallel, it's entirely possible that the Docker containers for Zookeeper got pulled down at approximately the same time.18:44
@clarkb:matrix.orgwhen you upgrade docker it will restart your containers18:45
@clarkb:matrix.orgyou will want to put serial: 1 or whatever the ansible is for that on playbooks that impact your zk members18:46
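A minimal sketch of that suggestion, assuming a hypothetical `zookeeper` inventory group and an apt-based docker-ce upgrade task: `serial: 1` makes Ansible run the whole play to completion on one ZooKeeper member before starting the next, so the cluster never loses quorum to simultaneous container restarts.
```
# Hypothetical playbook fragment: roll out the docker-ce upgrade (which
# restarts the containers on that host) one ZooKeeper member at a time.
- hosts: zookeeper          # assumed inventory group for the ZK members
  serial: 1                 # finish this play on one host before the next
  tasks:
    - name: Upgrade docker-ce
      ansible.builtin.apt:
        name: docker-ce
        state: latest
```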
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 907509: Use git native commit with squash merge https://review.opendev.org/c/zuul/zuul/+/907509 20:52
@jkkadgar:matrix.orgWhen 3 changes are in a gate queue (merge-mode: cherry-pick) and the top change has a set of 10 jobs in which 1 retries due to a failure in pre-run, Zuul then dequeues and restarts all the builds in the other 2 changes in the gate queue as the first change is marked failed. Zuul then dequeues and restarts all the builds in the other 2 changes in the gate queue a second time as the first change is reinserted to retry its 1 job.  Is this expected behavior? Is there a way to disable or enable this? Ideally I would like to see the dequeue only happen after the retries have been exhausted on the top change.20:53
@fungicide:matrix.orgjkkadgar: i agree, it doesn't make sense for an automatically rerun build to mark the entire buildset as failing unless the build is failing on its final try. i've never noticed it doing that, but maybe i've just not been observant enough. perhaps this is an oversight related to the early failure detection from 9.0.0... what version of zuul are you running?21:00
@jkkadgar:matrix.org9.3.021:00
@fungicide:matrix.orgso it's possible this is a relatively recent behavior change, yeah21:00
@fungicide:matrix.orgthough also, i can think of another corner case... zuul will automatically retry a build if it detects what looks like a network connection failure reaching a node, even in the run phase. if zuul marks the change as failing due to a failed ansible task in a run phase playbook, maybe it also needs to turn off the auto-rerun behavior for detected connection failures21:03
@fungicide:matrix.orgcorvus: ^ is any of this already known problems with early failure detection?21:04
@clarkb:matrix.orgwhen a job is retried for network issues it should not dequeue the change. It will just rerun the job21:06
@clarkb:matrix.orgthe indication that the change is dequeued makes me think something else may be happening21:07
@clarkb:matrix.orgbecause yes, dequeuing a change causes its children to reset as they need to evict their parent from the speculative git trees and test with the new git state21:07
@clarkb:matrix.orgis it hitting the retry limit? By default it will only attempt the job three times21:08
@jkkadgar:matrix.orgJust one retry causes this to occur21:09
@fungicide:matrix.orgClark: is it possible that zuul is noticing a failed task, early failure detection is moving the change out of the main sequence for the dependent pipeline at that time, and then moving it back in again when it decides it will retry the build? just since the introduction of early failure detection in 9.0.0 i mean21:09
@fungicide:matrix.orgthe observed case was for retries due to pre-run phase playbook failures21:10
@fungicide:matrix.orgnot network connection errors21:10
@fungicide:matrix.orgi was merely speculating that network connection error retries might similarly expose the same behavior21:10
@jim:acmegating.comfungi: early failure detection does not take place in pre-run playbooks21:11
@fungicide:matrix.orgcorvus: got it, so that shouldn't be influencing the queue21:12
@clarkb:matrix.orgI would look at debug logs on the scheduler to see its decision making process for this21:13
@fungicide:matrix.orgthen i agree, i'd be surprised if i'd failed to notice pre-run phase failure retries causing a build reset for changes behind the change with the retried build21:13
@fungicide:matrix.orgmmm, that's a mouthful21:14
@jkkadgar:matrix.orgOk thanks, I'll try to see if I can figure out what is happening in the logs.21:15
@fungicide:matrix.orgjkkadgar: if you can identify the sequence of events, we should be able to add a regression test for it to double-check the behavior is or isn't as expected21:16
@fungicide:matrix.orgbut yes, the scheduler debug log is the best place to find that21:17
@clarkb:matrix.orgone useful tip for doing that is to grep on the event id for the event enqueuing that change21:20
@clarkb:matrix.orgit may not give you everything but should give you a pretty good overview that you can then dig into from there21:20
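As a concrete illustration of that tip (the log path and event id below are placeholders; the exact log layout varies by deployment and version):
```
# The scheduler tags its log lines with the id of the trigger event it is
# processing, so grepping for that id shows the decision making for the
# change that event enqueued.
grep 'f0e1d2c3b4a5' /var/log/zuul/scheduler-debug.log | less
```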
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:23:40
- [zuul/zuul] 907506: Update web ui for dependency refactor https://review.opendev.org/c/zuul/zuul/+/907506
- [zuul/zuul] 906320: Finish circular dependency refactor https://review.opendev.org/c/zuul/zuul/+/906320
- [zuul/zuul] 907628: Change status json to use "refs" instead of "changes" https://review.opendev.org/c/zuul/zuul/+/907628
