Tuesday, 2021-10-12

-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:02:43
- [zuul/zuul] 812750: Add LocalZKContext for job freezing https://review.opendev.org/c/zuul/zuul/+/812750
- [zuul/zuul] 812760: Add RepoState object https://review.opendev.org/c/zuul/zuul/+/812760
- [zuul/zuul] 813552: Remove Worker class https://review.opendev.org/c/zuul/zuul/+/813552
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Simon Westphahl:02:44
- [zuul/zuul] 812452: Store build sets in Zookeeper https://review.opendev.org/c/zuul/zuul/+/812452
- [zuul/zuul] 812467: Add support for sharded ZKObjects https://review.opendev.org/c/zuul/zuul/+/812467
- [zuul/zuul] 812673: Store RepoFiles for a build set in Zookeeper https://review.opendev.org/c/zuul/zuul/+/812673
@fzzfh:matrix.org> <@nborg:matrix.org> fzzf: You can get host-key from the static node by "ssh-keyscan <host>". Take for instance the ssh-ed25519 key returned.06:25
Thanks, I found the ssh-ed25519 key.
@fzzfh:matrix.org> <@clarkb:matrix.org> The SSH protocol does host key checking to avoid mitm attacks. By default it will not connect to a server whose host key it does not recognize. In this case nodepool allows you to specify in config the public part of the host key so that nodepool and zuul can safely recognize the server when connecting via SSH.06:32
>
> The method to restart nodepool will depend on your method of deployment. Systemd units need a systemctl restart, docker(-compose) will need you to stop/start the container
Thank you, I get it.
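A minimal sketch of the static-node configuration being discussed, for reference; the provider, pool, label, hostname and key value below are placeholders, and the attribute spellings are from memory of the nodepool static driver docs. The `host-key` value is the public key line that `ssh-keyscan <host>` prints.

```
providers:
  - name: static-provider          # placeholder provider name
    driver: static
    pools:
      - name: main
        nodes:
          - name: node01.example.com
            labels:
              - my-static-label
            username: zuul
            # Public host key, e.g. the ssh-ed25519 line returned by
            # `ssh-keyscan node01.example.com` (truncated placeholder).
            host-key: "ssh-ed25519 AAAAC3NzaC1lZDI1..."
```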
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul-jobs] 813034: Implement role for limiting zuul log file size https://review.opendev.org/c/zuul/zuul-jobs/+/81303407:13
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul-jobs] 813593: WIP, only for test if build works https://review.opendev.org/c/zuul/zuul-jobs/+/81359308:00
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul-jobs] 813596: wip: only for test https://review.opendev.org/c/zuul/zuul-jobs/+/81359608:26
@tristanc_:matrix.orgHello, a tripleo project is adding a secret to one of their jobs and this is causing a config-error for rdo's zuul. We don't plan on using that job, but is this going to affect the other jobs defined in this project, or will zuul simply skip that job after the change is merged ( https://review.opendev.org/c/openstack/tripleo-ci/+/810261 )?12:23
@tristanc_:matrix.orgThen I guess adding `secret` to the tenant config project `include` list is going to result in a decryption error... So what is the recommended strategy to deal with third-party jobs which use secrets?12:27
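A hedged sketch of the tenant-config mechanism tristanC refers to: per-project `include`/`exclude` lists control which configuration item types Zuul loads from a remote project, so leaving `secret` out avoids decryption errors for keys encrypted against another Zuul, at the cost of config errors in any job that references the secret. The tenant and connection names below are invented; only the project name comes from the discussion.

```
- tenant:
    name: third-party-ci            # placeholder tenant name
    source:
      opendev:                      # placeholder connection name
        untrusted-projects:
          - openstack/tripleo-ci:
              include:              # only load these config item types
                - job
                - project
                - project-template
                - nodeset
```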
-@gerrit:opendev.org- Jeremy Stanley proposed: [zuul/zuul-jobs] 813620: DNM: Exercise tox role on CentOS 7 https://review.opendev.org/c/zuul/zuul-jobs/+/81362012:27
@avass:vassast.orgWe seem to have some issues with importing a backup of Zuul's project keys and it's causing the scheduler to crash with the following traceback:12:59
```
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: Error starting Zuul:
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: Traceback (most recent call last):
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/zuul/cmd/scheduler.py", line 152, in run
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: self.sched.prime(self.config)
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/zuul/scheduler.py", line 727, in prime
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: self.primeSystemConfig()
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/zuul/scheduler.py", line 1589, in primeSystemConfig
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: loader.loadTPCs(self.abide, self.unparsed_abide)
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/zuul/configloader.py", line 2292, in loadTPCs
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: self.tenant_parser.loadTenantProjects(unparsed_config)
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/zuul/configloader.py", line 1751, in loadTenantProjects
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: self._loadProjectKeys(source_name, tpc.project)
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/zuul/configloader.py", line 1649, in _loadProjectKeys
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: self.keystorage.getProjectSecretsKeys(
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/cachetools/decorators.py", line 26, in wrapper
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: v = func(*args, **kwargs)
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/zuul/lib/keystorage.py", line 187, in getProjectSecretsKeys
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: private_key, public_key = encryption.deserialize_rsa_keypair(
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/zuul/lib/encryption.py", line 98, in deserialize_rsa_keypair
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: private_key = serialization.load_pem_private_key(
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/cryptography/hazmat/primitives/serialization/base.py", line 20, in load_pem_private_key
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: return backend.load_pem_private_key(data, password)
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/cryptography/hazmat/backends/openssl/backend.py", line 1217, in load_pem_private_key
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: return self._load_key(
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/cryptography/hazmat/backends/openssl/backend.py", line 1448, in _load_key
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: self._handle_key_loading_error()
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: File "/usr/local/lib/python3.8/site-packages/cryptography/hazmat/backends/openssl/backend.py", line 1478, in _handle_key_loading_error
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: raise ValueError("Bad decrypt. Incorrect password?")
2021-10-12 12:52:02,678 ERROR zuul.Scheduler: ValueError: Bad decrypt. Incorrect password?
```
I don't really understand why that's failing if the key-import seems to succeed
@avass:vassast.org(it's a development environment so no big issue at the moment)13:00
@avass:vassast.orgOh nvm it was just a mislabeled backup :)13:23
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 811969: WIP Put frozen jobs in ZooKeeper https://review.opendev.org/c/zuul/zuul/+/81196914:14
@ashleybullock:matrix.orgHey, would someone be able to show me where the Dockerfiles are for the different zuul components that are included here: https://opendev.org/zuul/zuul/src/branch/master/Dockerfile#L88 I seem to be missing how they are built, thanks! 15:19
@clarkb:matrix.orgAshley Bullock: that is the dockerfile for all of the zuul images. The zuul image is built at https://opendev.org/zuul/zuul/src/branch/master/Dockerfile#L51 then the specific services modify that with different command directives.15:24
@jim:acmegating.comtristanC: you might consider whether a multi-level job structure using pass-to-parent could help in that case.15:26
@tristanc_:matrix.orgcorvus: that is if we need to use that new job, but the concern at the moment is that we don't want the config-error about missing secrets to break the existing jobs being imported15:27
@tristanc_:matrix.orgright now the third-party CI is unable to run on the change because the proposed configuration is invalid15:29
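For reference, a minimal sketch of the pass-to-parent layering corvus suggests above, with invented job and secret names; the parent job would normally live in a trusted config project and its playbooks consume the secret that the child job attaches.

```
- job:
    name: publish-logs-base
    # Defined in a trusted (config) project; its playbooks use the secret.
    post-run: playbooks/publish-logs.yaml

- job:
    name: tripleo-publish-logs
    parent: publish-logs-base
    secrets:
      - name: log_site_creds        # variable name seen by the playbooks
        secret: tripleo_log_secret  # encrypted secret defined alongside this job
        pass-to-parent: true        # make it available to the parent's playbooks
```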
@ashleybullock:matrix.orgClark: thank you! Completely misread that. 15:33
@ologinov:synapse.sardinasystems.comHi folks!15:52
I've launched the docker-compose example from the Quick Installation Guide. I added a GitLab connection to `zuul.conf` (my Docker instance on my PC) and a GitLab pipeline to `pipelines.yaml` in the zuul-config project hosted in Gerrit. I put a demo-zuul project (a copy of https://gitlab.com/fabien.dot.boucher/demo-zuul ) into GitLab, set up a GitLab webhook pointing to zuul-web (it works), and added a zuul user with a token (verified via curl that it can comment on and approve MRs in GitLab).
So I can see that the pipeline (named GitLab-check here) was launched, but I don't see Zuul trying to report a comment or approval on the GitLab MR.
I can't find anything suspicious to debug, and I can't launch zuul-executor with the `-d` option because the service can't create its PID file inside its own Docker container.
Could you please point me in the right direction to find the cause of this strange behavior? I'd appreciate any help!
logs and configs here: https://pastebin.com/BhXpE3Sp
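For comparison, a hedged sketch of what a GitLab check pipeline with comment/approval reporting can look like in `pipelines.yaml`; the connection name `gitlab` is assumed to match `zuul.conf`, and the trigger/reporter attributes are from memory of the GitLab driver docs, so double-check them there.

```
- pipeline:
    name: GitLab-check              # the pipeline name mentioned above
    manager: independent
    trigger:
      gitlab:
        - event: gl_merge_request
          action:
            - opened
            - changed
    success:
      gitlab:
        comment: true               # leave an MR comment with the result
        approval: true              # approve the MR on success
    failure:
      gitlab:
        comment: true
        approval: false
```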
@jim:acmegating.comOleg Loginov: you can run zuul-executor with "-d" and "-f" to get debug logs while running in foreground.16:04
@ologinov:synapse.sardinasystems.comcorvus: indeed, thx!16:20
@clarkb:matrix.orgcorvus: is the plan to do that 4.10.1 release today?16:30
@jim:acmegating.comClark: yep, i'll go ahead and push it now17:47
@goneri:matrix.orgWe're stuck in a situation where nodepool's got the quota wrong and returns "Not enough quota remaining to satisfy request" all the time. What's the best way to get it to recompute the quota?18:03
@clarkb:matrix.orgGonéri: the first thing to do is determine which side the error is on: cloud or nodepool18:06
@clarkb:matrix.orgto check the cloud side you can ask the cloud what it reports its quota usage as, then compare that against the actual set of instances in the cloud18:07
@clarkb:matrix.orgTo check the nodepool side you want to make sure you haven't leaked any resources in nodepool's zk db that are not in the cloud18:07
@goneri:matrix.orgnodepool says: Predicted remaining provider quota: {'compute': {'cores': -1, 'instances': 11, 'ram': -1024}} and I've got like 18GB free in my tenant.18:10
@goneri:matrix.organd 18 vcpus18:10
@goneri:matrix.orgIs there a command that allows me to just reset the counters?18:11
@clarkb:matrix.orgthere isn't a command to reset the counters; it's based on the records that are in the DB and what the cloud reports. This is why you need to determine where the mismatch is. If the mismatch is on the cloud side you have to fix the cloud side. If the mismatch is in nodepool you need to figure out what leaked and clean up those records18:12
@goneri:matrix.orgOK, how can I do that?18:13
@clarkb:matrix.orgYou need to get a listing of the instances in the cloud. Then ask the cloud for quota limits and usage and compare that against the listing of actual instances. If that all lines up you compare the instances in the cloud against the instances listed by nodepool.18:15
@clarkb:matrix.orgIt is possible for the accounting leak to happen in the cloud itself, in which case nodepool will operate against incorrect data, or for there to be leaked instances in the nodepool db that do not exist in the cloud but which nodepool still counts against quota18:15
@goneri:matrix.orgI'm looking at Horizon and we've got 18/50 GB and 7/25 instances.18:16
@goneri:matrix.orgIt's a bit difficult for me to understand how nodepool computes the whole thing.18:16
@clarkb:matrix.orgI don't know how horizon computes those values. I would do a show limits instead18:16
@clarkb:matrix.orgopenstack limits show --absolute or something18:17
@clarkb:matrix.orgGonéri: nodepool then computes its usage using the instance records it has in its local db. If there are instances in its DB that do not exist in the cloud that can explain it too. You can ask nodepool to delete those instances in that case and that should clean up the db18:17
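To make the computation above a bit more concrete: as I understand it, nodepool subtracts the usage derived from its own ZK node records from the cloud-reported limits (and from any explicit caps set on the pool), which is why leaked records can push the "Predicted remaining provider quota" negative. A hedged sketch of those optional pool caps in the OpenStack driver, with provider and pool names and numbers invented:

```
providers:
  - name: my-cloud                  # placeholder provider name
    driver: openstack
    cloud: my-cloud                 # assumed clouds.yaml entry
    pools:
      - name: main
        max-servers: 20             # hard cap on instances for this pool
        max-cores: 40               # optional; cloud limits apply if unset
        max-ram: 65536              # optional, in MiB
```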
@goneri:matrix.orgoh I see. I was expecting nodepool to do that by itself. I need to dig a bit.18:18
@goneri:matrix.orgI tried to delete a 6-day-old node and it fails because it's locked.18:21
@goneri:matrix.orgIs there a way to know what is locking it?18:21
@clarkb:matrix.orgGonéri: I think you can look at logs for that. The locking entity may also write its info into the lock in the zk db if you inspect that directly. All of the locks are ephemeral znodes which means if the process holding them goes away then the lock is unlocked. Chances are Zuul is holding the lock to run a job.18:25
@goneri:matrix.orgThanks for the help.18:28
@clarkb:matrix.orgDocker hub sorts image tags in a weird way, but I see 4.10.1 tags and the release jobs were all successful.18:45
@jim:acmegating.comcool, will send announcement after lunch18:47
@shrews:matrix.orgaaaaah, nodepool + cloud provider debugging.... a good way to make the day fly by19:36
@fungicide:matrix.orgshrews: i could spend all day just staring at the clouds20:09
@goneri:matrix.orgI'm a bit lost at this stage. I've got this node 0001943369 that has been ready for the last 6 days. I found the associated request in the log, but it does not exist anymore. I've no ongoing job older than 3h. I see the lock if I connect to ZK and do curl http://localhost:8080/commands/watches_by_path|grep 0001943369 but I don't know what to do next.20:13
@iwienand:matrix.orgcorvus: Clark i think there is something else going on with zuul not reporting on https://review.opendev.org/c/opendev/system-config/+/807672.  I added a README change to it, and that didn't trigger anything, and adding the noop jobs in check/gate also hasn't triggered anything.  updating the readme should have triggered a few jobs (https://review.opendev.org/c/opendev/system-config/+/813700 tests that in the gate)20:15
@shrews:matrix.orgGonéri: Zuul will delete the request once it has been satisfied by nodepool. You now need to check zuul logs for the associated build ID (I think that was the proper ID) and trace that through zuul. It likely has not finished or has gotten stuck.20:16
@fungicide:matrix.orgianw: we'll probably need to look in the scheduler log20:16
@iwienand:matrix.orgfungi: yep :)  was just pulling it up20:16
@goneri:matrix.orgoh! thanks shrews! I ended up with "Canceling node request <NodeRequest 200-0000643513"  https://paste.openstack.org/show/809937/20:18
@iwienand:matrix.orgahh, it's reporting a syntax error!20:19
@iwienand:matrix.orghttps://paste.opendev.org/show/809938/20:20
@iwienand:matrix.orgbut that error isn't making it back to gerrit.  it is *very* long, i wonder if that has something to do with it20:21
@shrews:matrix.orgGonéri: iirc, zuul should release the locks on a cancelled request when handled properly. not sure if you've encountered a bug or something else. the folks here should know.20:26
@iwienand:matrix.org"2021-10-12 19:20:08,957 ERROR zuul.GerritConnection: [e: 759c53e53b4644089d09f7cc3313d0cd] Error submitting data to gerrit, attempt 1 ... Exception: Received response 400"20:27
@goneri:matrix.orgWe had some network outages with ZK last week. So I imagine Zuul tried to clean up the lock but failed to reach ZK. And just gave up.20:27
@goneri:matrix.orgThis was the associated issue https://storyboard.openstack.org/#!/story/200928020:28
@shrews:matrix.orgGonéri: scheduler restart would likely clear it. i could tell you how to take a hammer to it to brute force it, but corvus might get angry at me for suggesting it   :)20:29
@goneri:matrix.orgMy understanding is that the two problems are actually connected. And if Zuul loses the ZK connection it can lose track of the locks.20:29
@shrews:matrix.orgyeah, zk + network outages can cause some real wacky things20:30
@jim:acmegating.comgoneri: what version?20:30
@jim:acmegating.comof zuul20:30
@goneri:matrix.org4.8.120:30
@goneri:matrix.orgPaul told me that if we restart the scheduler, we also need to restart all the other services. Is this still the case with Zuul 4? If so, I will do that later.20:33
@jim:acmegating.comthere is a bug like that in 4.9.0 that is fixed in 4.10.0, but that shouldn't affect 4.8.1.20:34
@jim:acmegating.comto clarify my last: that was regarding stuck node requests20:51
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 813704: DNM: for discussion of test_live_reconfiguration_shared_queue_removed https://review.opendev.org/c/zuul/zuul/+/81370420:57
@fungicide:matrix.orgianw: do we know for sure whether opendev's zuul deployment is even able to do its syntax error reporting to gerrit 3.3? that's something i didn't think to test yet21:34
@iwienand:matrix.orgfungi: i've replicated it, it has to do with the size.  it was also happening on 3.2.  i'll send a patch soon :)21:42
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 813711: Build Zuul's docker images on Bullseye https://review.opendev.org/c/zuul/zuul/+/81371122:30
@clarkb:matrix.orgI think ianw was looking at images on the nodepool side of things. I've got it on my list of TODOs to update the images we use, so here is a change for Zuul :)22:31
@iwienand:matrix.org> <@clarkb:matrix.org> I think ianw was looking at images on the nodepool side of things. I've got it on my list of TODOs to update the images we use, so here is a change for Zuul :)22:33
yep, WIP but it basically works. it's not nodepool as such, it's the nodepool-builder image and making sure that dib can produce what we need
