Monday, 2023-08-14

@jjbeckman:matrix.org Hi folks. 08:38
I've posted about a similar issue previously and thought it was resolved by adding more resources to the Kubernetes cluster that Zuul is deployed on, but I still find it happening quite frequently (about once in 10 builds).
My Zuul builds fail, seemingly at random, in various places, with one of the following variations of errors.
```
2023-07-31 08:04:12.621503 | TASK [Read target {redacted} metadata file]
2023-07-31 08:06:16.690816 | debian-bullseye | ERROR
2023-07-31 08:06:16.691120 | debian-bullseye | {
2023-07-31 08:06:16.691162 | debian-bullseye | "msg": "failed to transfer file /var/lib/zuul/builds/3f9565e3e3c1474c8103df5247a64576/work/tmp/ansible-local-2rastc7sj/tmp8v1_trs9 to '/~zuul/.ansible/tmp/ansible-tmp-1690790653.9951165-654-214536513211330/AnsiballZ_command.py':\nb''\nb'error: error sending request: Post \"https://10.0.0.1:443/api/v1/namespaces/main-0000000250/pods/debian-bullseye/exec?command=%2Fbin%2Fsh&command=-c&command=dd+of%3D%27%2F~zuul%2F.ansible%2Ftmp%2Fansible-tmp-1690790653.9951165-654-214536513211330%2FAnsiballZ_command.py%27+bs%3D65536+%26%26+sleep+0&container=debian-bullseye&stderr=true&stdin=true&stdout=true\": unexpected EOF\\n'"
2023-07-31 08:06:16.691191 | debian-bullseye | }
```
```
2023-07-27 05:15:26.919655 | TASK [Create namespace to deploy {redacted}]
2023-07-27 05:16:31.070031 | debian-bullseye | ERROR
2023-07-27 05:16:31.070386 | debian-bullseye | {
2023-07-27 05:16:31.070436 | debian-bullseye | "msg": "Failed to set execute bit on remote files (rc: 1, err: error: error sending request: Post \"https://10.0.0.1:443/api/v1/namespaces/main-0000000221/pods/debian-bullseye/exec?command=%2Fbin%2Fsh&command=-c&command=%2Fbin%2Fsh+-c+%27chmod+u%2Bx+%27%22%27%22%27~zuul%2F.ansible%2Ftmp%2Fansible-tmp-1690434927.4863539-150-2938412173267%2F%27%22%27%22%27+%27%22%27%22%27~zuul%2F.ansible%2Ftmp%2Fansible-tmp-1690434927.4863539-150-2938412173267%2FAnsiballZ_command.py%27%22%27%22%27+%26%26+sleep+0%27&container=debian-bullseye&stderr=true&stdin=true&stdout=true\": unexpected EOF\n)"
2023-07-27 05:16:31.070470 | debian-bullseye | }
```
```
2023-08-10 07:29:16.381698 | TASK [Install Ingress-NGINX]
2023-08-10 07:32:56.999231 | debian-bullseye | ERROR
2023-08-10 07:32:56.999528 | debian-bullseye | {
2023-08-10 07:32:56.999572 | debian-bullseye | "msg": "Failed to create temporary directory. In some cases, you may have been able to authenticate and did not have permissions on the target directory. Consider changing the remote tmp path in ansible.cfg to a path rooted in \"/tmp\", for more error information use -vvv. Failed command was: ( umask 77 && mkdir -p \"` echo ~zuul/.ansible/tmp `\"&& mkdir \"` echo ~zuul/.ansible/tmp/ansible-tmp-1691652558.1460352-264-268319145791209 `\" && echo ansible-tmp-1691652558.1460352-264-268319145791209=\"` echo ~zuul/.ansible/tmp/ansible-tmp-1691652558.1460352-264-268319145791209 `\" ), exited with result 1",
2023-08-10 07:32:56.999605 | debian-bullseye | "unreachable": true
2023-08-10 07:32:56.999633 | debian-bullseye | }
```
```
2023-08-10 06:27:08.843073 | TASK [Obtain kubeconfig for target cluster]
2023-08-10 06:28:15.523562 | debian-bullseye | MODULE FAILURE: error: error sending request: Post "https://10.0.0.1:443/api/v1/namespaces/main-0000000416/pods/debian-bullseye/exec?command=%2Fbin%2Fsh&command=-c&command=%2Fbin%2Fsh+-c+%27%2Fusr%2Fbin%2Fpython3+%27%22%27%22%27~zuul%2F.ansible%2Ftmp%2Fansible-tmp-1691648829.3785486-116-61947887594951%2FAnsiballZ_command.py%27%22%27%22%27+%26%26+sleep+0%27&container=debian-bullseye&stderr=true&stdin=true&stdout=true": unexpected EOF
```
```
2023-08-10 07:42:23.827956 | TASK [Deploy {redacted} job]
2023-08-10 07:42:35.716548 | debian-bullseye | Unable to connect to the server: net/http: TLS handshake timeout
2023-08-10 07:42:37.732360 | debian-bullseye | ERROR
2023-08-10 07:42:37.732585 | debian-bullseye | {
2023-08-10 07:42:37.732627 | debian-bullseye | "delta": "0:00:10.072682",
2023-08-10 07:42:37.732657 | debian-bullseye | "end": "2023-08-10 07:42:35.720062",
2023-08-10 07:42:37.732758 | debian-bullseye | "msg": "non-zero return code",
2023-08-10 07:42:37.732795 | debian-bullseye | "rc": 1,
2023-08-10 07:42:37.732822 | debian-bullseye | "start": "2023-08-10 07:42:25.647380"
2023-08-10 07:42:37.732848 | debian-bullseye | }
2023-08-10 07:42:37.732895 | debian-bullseye | ERROR: Ignoring Errors
```
Most of them appear to stem from failing to copy the `AnsiballZ_command.py` file to the node host.
I've been troubleshooting this issue without much success, until I found out about Ansible Turbo mode for kubernetes.core.
https://github.com/ansible-collections/kubernetes.core/blob/main/docs/ansible_turbo_mode.rst
In order to test this feature, I added the following to the source code.
```
host_vars['environment'] = {}
host_vars['environment']['ENABLE_TURBO_MODE'] = 1
```
https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L2014
As a result, instead of builds failing in random places, the `AnsiballZ_command.py` copying issue only ever happens during the first job, but when it does, the pipeline ends up with the "RESET" status and spawns a second (or later) build attempt. This new behavior more or less makes Zuul usable for our use case.
Would appreciate any advice regarding the following.
1. Is there any way to enable Ansible Turbo Mode without touching the source code?
2. Have there been any reports of Zuul being unstable when running nodes on Kubernetes? I haven't been able to find any information online about this problem.
Sorry for the wall of text, thank you for reading.
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 890353: Change image ID from int sequence to UUID https://review.opendev.org/c/zuul/nodepool/+/890353 09:13
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 890354: Update references of build "number" to "id" https://review.opendev.org/c/zuul/nodepool/+/890354 09:19
@clarkb:matrix.org > <@gerrit:opendev.org> Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 890353: Change image ID from int sequence to UUID https://review.opendev.org/c/zuul/nodepool/+/890353 13:34
Image name collisions aren't actually an issue with openstack as it uses the underlying image uuid to differentiate between images. Is this a problem for a specific cloud? Might be good to add that info to the change if so.
@jim:acmegating.com Clark: it's the zk image id 13:40
@jim:acmegating.com Clark: addresses the comment left on https://review.opendev.org/846801 13:42
@clarkb:matrix.org jjbeckman: you can set the environment for the remote host in your playbooks exactly how they have in the example in your documentation link. 13:42
I don't think there is currently a way to set it globally in the way you have without making the change you did.
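A minimal sketch of what that per-playbook setting might look like, following the pattern in the linked kubernetes.core Turbo mode document. The host pattern, the task, and the namespace name are placeholders for illustration and are not taken from the job under discussion:

```yaml
- hosts: all
  environment:
    # Exported to the remote host for every task in this play
    ENABLE_TURBO_MODE: "1"
  tasks:
    - name: Create a namespace (placeholder task)
      kubernetes.core.k8s:
        api_version: v1
        kind: Namespace
        name: example-namespace
        state: present
```

Setting `environment` at play level applies it to every task in that play; Ansible also accepts the same keyword on an individual task if only some tasks should be affected.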
@clarkb:matrix.org Ah it's zk itself betting angry 13:43
@clarkb:matrix.org * Ah it's zk itself getting angry 13:43
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 891332: Only support Python 3.11 https://review.opendev.org/c/zuul/nodepool/+/891332 16:53
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 891332: Only support Python 3.11 https://review.opendev.org/c/zuul/nodepool/+/891332 16:55
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 891332: Only support Python 3.11 https://review.opendev.org/c/zuul/nodepool/+/891332 17:20
@clarkb:matrix.org corvus another related todo would be to bump the base image to bookworm from bullseye. It's less important for python itself since that is compiled on top of those platforms for a specific version. But things like git updates and bwrap backport cleanup would be good to get from bookworm 17:26
@clarkb:matrix.org I'm not sure where you think that should fit into the release planning 17:26
@jim:acmegating.com Clark: yeah.  i believe bullseye is still getting security updates, so it's not urgent.  maybe let's do that after we cut 9.x? 17:36
@clarkb:matrix.org wfm and yes it is getting updates for another 10 months or something iirc 17:36
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 891338: Resolve statsd client once at startup https://review.opendev.org/c/zuul/nodepool/+/891338 17:39
@jim:acmegating.com apparently ensure-k8s only works for jammy 17:40
@jim:acmegating.com i think we could either split the job into two nodes like we do for openshift, or see if we can get ensure-k8s to run on bookworm 17:41
@jim:acmegating.com (also, maybe both of those jobs should actually just run nodepool in containers; but then we almost certainly want 2 nodes for that) 17:42
@jim:acmegating.com i'll take a quick look at ensure-k8s 17:42
@jim:acmegating.com oh that's the one that changed to microk8s ... running that on bookworm may not be a slam dunk 17:43
@jim:acmegating.com hrm, it's all snapd, it might work? 17:44
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 891339: Support ensure-kubernetes on bookworm https://review.opendev.org/c/zuul/zuul-jobs/+/891339 17:47
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 891338: Resolve statsd client once at startup https://review.opendev.org/c/zuul/nodepool/+/891338 17:48
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 891339: Support ensure-kubernetes on bookworm https://review.opendev.org/c/zuul/zuul-jobs/+/891339 17:49
@jim:acmegating.com cool that passed zuul-jobs test; will rebase 17:53
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 891332: Only support Python 3.11 https://review.opendev.org/c/zuul/nodepool/+/891332 17:54
@jim:acmegating.com it looks like ensure-k8s crio bionic started failing at the beginning of this year; i'm going to pull it.  the rest looks good 18:13
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: 18:15
- [zuul/zuul-jobs] 891339: Support ensure-kubernetes on bookworm https://review.opendev.org/c/zuul/zuul-jobs/+/891339
- [zuul/zuul-jobs] 891341: Stop testing ensure-kubernetes with crio under ubuntu bionic https://review.opendev.org/c/zuul/zuul-jobs/+/891341
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 886432: Add ZK cache stats https://review.opendev.org/c/zuul/nodepool/+/886432 21:00
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 890446: Always complete cache init on reconnect https://review.opendev.org/c/zuul/nodepool/+/890446 21:22
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 891358: Add support for limiting dependency processing https://review.opendev.org/c/zuul/zuul/+/891358 21:53
