Monday, 2024-02-05

-@gerrit:opendev.org- Peter Strunk proposed: [zuul/zuul] 903808: zuul_stream: add FQCN for windows command and shell https://review.opendev.org/c/zuul/zuul/+/903808 01:42
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 907256: Add --keep-config-cache option to delete-state command https://review.opendev.org/c/zuul/zuul/+/907256 09:25
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-client] 907627: Handle forward compatability with cycle refactor https://review.opendev.org/c/zuul/zuul-client/+/907627 09:34
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 906765: Fix 10 second delay after skipped command task https://review.opendev.org/c/zuul/zuul/+/906765 10:10
@vonschultz:matrix.orgI restarted Zuul, and now my scheduler won't start. I'm confused, because I haven't upgraded the Zuul Docker images (still at 8.2.0), and the zuul.conf on the scheduler machine hasn't changed as far as I know. I'm getting the error18:01
```
  File "/usr/local/lib/python3.11/site-packages/zuul/configloader.py", line 2206, in _cacheTenantYAMLBranch
    self._updateUnparsedBranchCache(
  File "/usr/local/lib/python3.11/site-packages/zuul/configloader.py", line 2321, in _updateUnparsedBranchCache
    min_ltimes[source_context.project_canonical_name][
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'opendev.org/zuul/zuul-jobs'
```
Does anyone have any idea what this is or how I might recover from it?
@jim:acmegating.comChristian von Schultz: the cause would take quite a bit of effort to determine (and may not be relevant in current versions), but the remedy should be to stop all zuul components, then run `zuul-admin delete-state` on the scheduler to completely reset the ephemeral zookeeper state (that includes current pipeline contents), then start the system up again.18:09
@jim:acmegating.comthat's the universal fix for "something in zk is wrong"18:10
@vonschultz:matrix.orgIt looks like it worked. Thanks. 👍️18:26
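For reference, the recovery corvus describes might look roughly like the sketch below. The docker compose service names and the compose-based invocation are assumptions about one possible deployment; only `zuul-admin delete-state` itself (and the `--keep-config-cache` option from change 907256 above, available in newer releases) comes from the discussion.
```
# Rough sketch of the "reset ZooKeeper state" recovery, assuming the components
# run as docker compose services with these hypothetical names.

# 1. Stop every Zuul component (schedulers, executors, mergers, web).
docker compose stop zuul-scheduler zuul-executor zuul-merger zuul-web

# 2. Wipe Zuul's ephemeral ZooKeeper state. Note this also discards whatever is
#    currently enqueued in pipelines; the ZK connection is read from zuul.conf.
docker compose run --rm zuul-scheduler zuul-admin delete-state
# Newer releases add a --keep-config-cache option here so the configuration
# cache does not have to be rebuilt from scratch afterwards.

# 3. Start everything back up; the scheduler repopulates ZooKeeper on startup.
docker compose up -d zuul-scheduler zuul-executor zuul-merger zuul-web
```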
@fungicide:matrix.orgChristian von Schultz: as to the cause, is it possible your zookeeper service got interrupted mid-write or something?18:41
@fungicide:matrix.orghaving a 3-way zk cluster helps mitigate the risk that a spontaneous crash/reboot corrupts your state18:42
@vonschultz:matrix.orgI do have a 3-way zk cluster.18:42
@fungicide:matrix.orgcool, so probably not that at least18:43
@vonschultz:matrix.orgBut I suppose it _is_ possible that it got interrupted mid-write anyway. I ran an Ansible playbook, and in the upgrade task, a new version of `docker-ce` got rolled out. Since Ansible likes to run many hosts in parallel, it's entirely possible that the Docker containers for Zookeeper got pulled down at approximately the same time.18:44
@clarkb:matrix.orgwhen you upgrade docker it will restart your containers18:45
@clarkb:matrix.orgyou will want to put serial: 1 or whatever the ansible is for that on playbooks that impact your zk members18:46
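A minimal sketch of that suggestion, assuming a hypothetical `zookeeper` inventory group and an apt-based docker-ce upgrade task: `serial: 1` makes Ansible run the whole play to completion on one ZooKeeper member before starting the next, so the cluster never loses quorum to simultaneous container restarts.
```
# Hypothetical playbook fragment: roll out the docker-ce upgrade (which
# restarts the containers on that host) one ZooKeeper member at a time.
- hosts: zookeeper          # assumed inventory group for the ZK members
  serial: 1                 # finish this play on one host before the next
  tasks:
    - name: Upgrade docker-ce
      ansible.builtin.apt:
        name: docker-ce
        state: latest
```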
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 907509: Use git native commit with squash merge https://review.opendev.org/c/zuul/zuul/+/907509 20:52
@jkkadgar:matrix.orgWhen 3 changes are in a gate queue (merge-mode: cherry-pick) and the top change has a set of 10 jobs in which 1 retries due to a failure in pre-run, Zuul then dequeues and restarts all the builds in the other 2 changes in the gate queue as the first change is marked failed. Zuul then dequeues and restarts all the builds in the other 2 changes in the gate queue a second time as the first change is reinserted to retry its 1 job.  Is this expected behavior? Is there a way to disable or enable this? Ideally I would like to see the dequeue only happen after the retries have been exhausted on the top change.20:53
@fungicide:matrix.orgjkkadgar: i agree, it doesn't make sense for an automatically rerun build to mark the entire buildset as failing unless the build is failing on its final try. i've never noticed it doing that, but maybe i've just not been observant enough. perhaps this is an oversight related to the early failure detection from 9.0.0... what version of zuul are you running?21:00
@jkkadgar:matrix.org9.3.021:00
@fungicide:matrix.orgso it's possible this is a relatively recent behavior change, yeah21:00
@fungicide:matrix.orgthough also, i can think of another corner case... zuul will automatically retry a build if it detects what looks like a network connection failure reaching a node, even in the run phase. if zuul marks the change as failing due to a failed ansible task in a run phase playbook, maybe it also needs to turn off the auto-rerun behavior for detected connection failures21:03
@fungicide:matrix.orgcorvus: ^ is any of this already known problems with early failure detection?21:04
@clarkb:matrix.orgwhen a job is retried for network issues it should not dequeue the change. It will just rerun the job21:06
@clarkb:matrix.orgthe indication that the change is dequeued makes me think something else may be happening21:07
@clarkb:matrix.orgbecause yes, dequeuing a change causes its children to reset as they need to evict their parent from the speculative git trees and test with the new git state21:07
@clarkb:matrix.orgis it hitting the retry limit? By default it will only attempt the job three times21:08
@jkkadgar:matrix.orgJust one retry causes this to occur21:09
@fungicide:matrix.orgClark: is it possible that zuul is noticing a failed task, early failure detection is moving the change out of the main sequence for the dependent pipeline at that time, and then moving it back in again when it decides it will retry the build? just since the introduction of early failure detection in 9.0.0 i mean21:09
@fungicide:matrix.orgthe observed case was for retries due to pre-run phase playbook failures21:10
@fungicide:matrix.orgnot network connection errors21:10
@fungicide:matrix.orgi was merely speculating that network connection error retries might similarly expose the same behavior21:10
@jim:acmegating.comfungi: early failure detection does not take place in pre-run playbooks21:11
@fungicide:matrix.orgcorvus: got it, so that shouldn't be influencing the queue21:12
@clarkb:matrix.orgI would look at debug logs on the scheduler to see its decision making process for this21:13
@fungicide:matrix.orgthen i agree, i'd be surprised if i'd failed to notice pre-run phase failure retries causing a build reset for changes behind the change with the retried build21:13
@fungicide:matrix.orgmmm, that's a mouthful21:14
@jkkadgar:matrix.orgOk thanks, I'll try to see if I can figure out what is happening in the logs.21:15
@fungicide:matrix.orgjkkadgar: if you can identify the sequence of events, we should be able to add a regression test for it to double-check the behavior is or isn't as expected21:16
@fungicide:matrix.orgbut yes, the scheduler debug log is the best place to find that21:17
@clarkb:matrix.orgone useful tip for doing that is to grep on the event id for the event enqueuing that change21:20
@clarkb:matrix.orgit may not give you everything but should give you a pretty good overview that you can then dig into from there21:20
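As a concrete illustration of that tip (the log path and event id below are placeholders; the exact log layout varies by deployment and version):
```
# The scheduler tags its log lines with the id of the trigger event it is
# processing, so grepping for that id shows the decision making for the
# change that event enqueued.
grep 'f0e1d2c3b4a5' /var/log/zuul/scheduler-debug.log | less
```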
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:23:40
- [zuul/zuul] 907506: Update web ui for dependency refactor https://review.opendev.org/c/zuul/zuul/+/907506
- [zuul/zuul] 906320: Finish circular dependency refactor https://review.opendev.org/c/zuul/zuul/+/906320
- [zuul/zuul] 907628: Change status json to use "refs" instead of "changes" https://review.opendev.org/c/zuul/zuul/+/907628
