Tuesday, 2021-11-09

@tristanc_:matrix.orgI get it, some new ideas are needed to solve unexpected problems, and I think it's ok to fast-track their implementations, but I worry the important details will be forgotten if we don't write them down (after such a change settles).00:04
@jim:acmegating.comtristanC: yeah.  i think "done" is probably after zuul-web is updated to read everything from zk, which is the next and final major effort.  i'm sure there will be more to do after that, but by then i think the zk schema and concepts will be fairly settled.  basically, it'll be feature complete.  :)00:16
@jim:acmegating.comi did start on https://review.opendev.org/809300 but it needs a lot more work.  it's mostly a placeholder right now.00:17
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 817126: Move pipeline state to /zuul/tenant https://review.opendev.org/c/zuul/zuul/+/81712600:19
@spamaps:spamaps.ems.host```root@e3b6233e6707:/# nodepool list00:20
+------------+------------+--------------+-----------+-------------+------+----------+-------------+--------+
| ID | Provider | Label | Server ID | Public IPv4 | IPv6 | State | Age | Locked |
+------------+------------+--------------+-----------+-------------+------+----------+-------------+--------+
| 0000000070 | static-vms | ubuntu-focal | None | None | None | building | 00:00:05:11 | locked |
+------------+------------+--------------+-----------+-------------+------+----------+-------------+--------+```
@spamaps:spamaps.ems.hostHow do I delete this node?00:20
@spamaps:spamaps.ems.hostI have ubuntu-focal set to min-ready: 000:20
@jim:acmegating.comspamaps: nodepool delete 000000007000:21
@spamaps:spamaps.ems.hosttimes out00:21
@spamaps:spamaps.ems.hostit's locked00:21
@jim:acmegating.comspamaps: restart the launcher and let it clean it up00:21
@spamaps:spamaps.ems.hostAh ok restart cleaned it up right away ty00:22
@spamaps:spamaps.ems.hostHopefully some day nodepool will clean up after itself if you delete a provider00:22
@jim:acmegating.comhttps://zuul-ci.org/docs/nodepool/operation.html#removing-a-provider00:23
@spamaps:spamaps.ems.hostthere must be something still in ZK00:23
@spamaps:spamaps.ems.hostnodepool list shows nothing but deleting the provider angers launcher00:24
@spamaps:spamaps.ems.host```launcher_1   | 2021-11-09 00:18:56,634 INFO nodepool.NodeDeleter: Deleting ZK node id=0000000071, state=deleting, external_id=node00:27
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: Cannot re-register deleted node Node(hostname='node', username='root', port=22):
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: Traceback (most recent call last):
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/site-packages/nodepool/driver/static/provider.py", line 489, in nodeDeletedNotification
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: self.registerNodeFromConfig(
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/site-packages/nodepool/driver/static/provider.py", line 188, in registerNodeFromConfig
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: host_keys = self.checkHost(static_node)
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/site-packages/nodepool/driver/static/provider.py", line 64, in checkHost
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: keys = nodeutils.nodescan(node["name"],
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/site-packages/nodepool/nodeutils.py", line 70, in nodescan
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: addrinfo = socket.getaddrinfo(ip, port)[0]
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/socket.py", line 953, in getaddrinfo
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
launcher_1 | 2021-11-09 00:18:56,744 ERROR nodepool.driver.static.StaticNodeProvider: socket.gaierror: [Errno -2] Name or service not known
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: Couldn't sync node:
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: Traceback (most recent call last):
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/site-packages/nodepool/driver/static/provider.py", line 427, in cleanupLeakedResources
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: self.syncNodeCount(registered, node, pool)
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/site-packages/nodepool/driver/static/provider.py", line 319, in syncNodeCount
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: self.registerNodeFromConfig(
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/site-packages/nodepool/driver/static/provider.py", line 188, in registerNodeFromConfig
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: host_keys = self.checkHost(static_node)
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/site-packages/nodepool/driver/static/provider.py", line 64, in checkHost
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: keys = nodeutils.nodescan(node["name"],
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/site-packages/nodepool/nodeutils.py", line 70, in nodescan
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: addrinfo = socket.getaddrinfo(ip, port)[0]
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: File "/usr/local/lib/python3.9/socket.py", line 953, in getaddrinfo
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
launcher_1 | 2021-11-09 00:19:41,977 ERROR nodepool.driver.static.StaticNodeProvider: socket.gaierror: [Errno -2] Name or service not known```
@spamaps:spamaps.ems.hostlauncher prints this every time it starts up00:27
@spamaps:spamaps.ems.hostand so far won't start any nodes00:27
@spamaps:spamaps.ems.hostAh that last part was a bad merge on my part.. now it's making nodes00:32
@clarkb:matrix.orgarg 816094 failed again00:36
@clarkb:matrix.orgI'll just let it be until zuul is reloaded with the functional fixes00:37
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 816208: WIP Use AuthProvider https://review.opendev.org/c/zuul/zuul/+/81620800:58
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Matthieu Huin https://matrix.to/#/@mhuin:matrix.org:00:58
- [zuul/zuul] 768199: Web UI: add Autoholds, Autohold page https://review.opendev.org/c/zuul/zuul/+/768199
- [zuul/zuul] 810699: Web UI: Show pipeline types as icons https://review.opendev.org/c/zuul/zuul/+/810699
- [zuul/zuul] 781858: web UI: allow a privileged user to promote a change https://review.opendev.org/c/zuul/zuul/+/781858
- [zuul/zuul] 802559: Web UI: Add "Create Autohold Request" form, improve API error messages https://review.opendev.org/c/zuul/zuul/+/802559
- [zuul/zuul] 769943: Example Docker compose: keycloak integration https://review.opendev.org/c/zuul/zuul/+/769943
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Matthieu Huin https://matrix.to/#/@mhuin:matrix.org: [zuul/zuul] 768115: Web UI: allow a privileged user to request autohold https://review.opendev.org/c/zuul/zuul/+/76811501:29
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 816208: WIP Use AuthProvider https://review.opendev.org/c/zuul/zuul/+/81620801:30
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Matthieu Huin https://matrix.to/#/@mhuin:matrix.org:01:30
- [zuul/zuul] 768199: Web UI: add Autoholds, Autohold page https://review.opendev.org/c/zuul/zuul/+/768199
- [zuul/zuul] 810699: Web UI: Show pipeline types as icons https://review.opendev.org/c/zuul/zuul/+/810699
- [zuul/zuul] 781858: web UI: allow a privileged user to promote a change https://review.opendev.org/c/zuul/zuul/+/781858
- [zuul/zuul] 802559: Web UI: Add "Create Autohold Request" form, improve API error messages https://review.opendev.org/c/zuul/zuul/+/802559
- [zuul/zuul] 769943: Example Docker compose: keycloak integration https://review.opendev.org/c/zuul/zuul/+/769943
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 817122: Shard config errors https://review.opendev.org/c/zuul/zuul/+/81712201:38
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 816094: Adjust spacing on status page https://review.opendev.org/c/zuul/zuul/+/81609403:52
-@gerrit:opendev.org- Simon Westphahl proposed on behalf of Felix Edel:08:17
- [zuul/zuul] 816807: Split up registerScheduler() and onLoad() methods https://review.opendev.org/c/zuul/zuul/+/816807
- [zuul/zuul] 814996: Make the ConfigLoader work independently of the Scheduler https://review.opendev.org/c/zuul/zuul/+/814996
- [zuul/zuul] 816361: Load system config and tenant layouts in zuul-web https://review.opendev.org/c/zuul/zuul/+/816361
- [zuul/zuul] 816362: Implement job freezing API in zuul-web https://review.opendev.org/c/zuul/zuul/+/816362
- [zuul/zuul] 816514: Implement management events directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/816514
- [zuul/zuul] 816783: Implement autohold endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/816783
-@gerrit:opendev.org- Simon Westphahl proposed:08:17
- [zuul/zuul] 817003: Store pipeline status for zuul-web in Zookeeper https://review.opendev.org/c/zuul/zuul/+/817003
- [zuul/zuul] 817004: Use pipeline status from Zookeeper in zuul-web https://review.opendev.org/c/zuul/zuul/+/817004
- [zuul/zuul] 817157: Use abide for listing pipelines in zuul-web https://review.opendev.org/c/zuul/zuul/+/817157
- [zuul/zuul] 817158: Use abide for getting public keys in zuul-web https://review.opendev.org/c/zuul/zuul/+/817158
-@gerrit:opendev.org- Felix Edel proposed:08:23
- [zuul/zuul] 817159: Implement job endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817159
- [zuul/zuul] 817160: Implement project endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817160
-@gerrit:opendev.org- Simon Westphahl proposed:08:42
- [zuul/zuul] 817158: Use abide for getting public keys in zuul-web https://review.opendev.org/c/zuul/zuul/+/817158
- [zuul/zuul] 817164: Use abide for getting project SSH keys in zuul-web https://review.opendev.org/c/zuul/zuul/+/817164
-@gerrit:opendev.org- Simon Westphahl proposed:09:32
- [zuul/zuul] 817171: Check authentication directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817171
- [zuul/zuul] 817172: Create list of admin tenants directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817172
@mhuin:matrix.org> <@jim:acmegating.com> tristanC: yeah.  i think "done" is probably after zuul-web is updated to read everything from zk, which is the next and final major effort.  i'm sure there will be more to do after that, but by then i think the zk schema and concepts will be fairly settled.  basically, it'll be feature complete.  :)09:45
On that note, the zuul admin CLI still supports a gearman client for autoholds, enqueues etc. How about we remove the stuff that can be done through the web API -and thus zuul-client- and keep only create-auth-token, tenant-conf-check and the ZK-related commands once the web admin GUI is finalized?
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 817174: Use abide to get (dis-)allowed labels in zuul-web https://review.opendev.org/c/zuul/zuul/+/81717409:49
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 817177: Create list of connections directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/81717710:23
-@gerrit:opendev.org- Simon Westphahl proposed:11:36
- [zuul/zuul] 817164: Use abide for getting project SSH keys in zuul-web https://review.opendev.org/c/zuul/zuul/+/817164
- [zuul/zuul] 817171: Check authentication directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817171
- [zuul/zuul] 817172: Create list of admin tenants directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817172
- [zuul/zuul] 817174: Use abide to get (dis-)allowed labels in zuul-web https://review.opendev.org/c/zuul/zuul/+/817174
- [zuul/zuul] 817177: Create list of connections directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817177
- [zuul/zuul] 817193: Use config errors directly from layout in zuul-web https://review.opendev.org/c/zuul/zuul/+/817193
-@gerrit:opendev.org- Simon Westphahl proposed:11:41
- [zuul/zuul] 817171: Check authentication directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817171
- [zuul/zuul] 817172: Create list of admin tenants directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817172
- [zuul/zuul] 817174: Use abide to get (dis-)allowed labels in zuul-web https://review.opendev.org/c/zuul/zuul/+/817174
- [zuul/zuul] 817177: Create list of connections directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817177
- [zuul/zuul] 817193: Use config errors directly from layout in zuul-web https://review.opendev.org/c/zuul/zuul/+/817193
@westphahl:matrix.orgmhu:  yep, I think that's the plan11:43
-@gerrit:opendev.org- Felix Edel proposed:11:59
- [zuul/zuul] 817159: Implement job endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817159
- [zuul/zuul] 817160: Implement project endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817160
- [zuul/zuul] 817196: Implement tenant endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817196
-@gerrit:opendev.org- Simon Westphahl proposed on behalf of Felix Edel:13:44
- [zuul/zuul] 817159: Implement job endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817159
- [zuul/zuul] 817160: Implement project endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817160
- [zuul/zuul] 817196: Implement tenant endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817196
-@gerrit:opendev.org- Simon Westphahl proposed:13:44
- [zuul/zuul] 817174: Use abide to get (dis-)allowed labels in zuul-web https://review.opendev.org/c/zuul/zuul/+/817174
- [zuul/zuul] 817177: Create list of connections directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817177
- [zuul/zuul] 817193: Use config errors directly from layout in zuul-web https://review.opendev.org/c/zuul/zuul/+/817193
-@gerrit:opendev.org- Simon Westphahl proposed on behalf of Felix Edel:13:45
- [zuul/zuul] 817160: Implement project endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817160
- [zuul/zuul] 817196: Implement tenant endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817196
@spamaps:spamaps.ems.hostcorvus: correct to say that the GCE driver doesn't support network arguments for instances? I'm going to need those.. happy to add it.14:37
@spamaps:spamaps.ems.hostWow we still don't have pretty ansi color log viewer? I thought by now y'all would have gotten that one working. ;)14:56
@avass:vassast.orgspamaps: we did have one but it couldn't handle large log files so it was reverted until someone implements something that doesn't take >60sec to render large files :)14:57
@spamaps:spamaps.ems.hostAh, yeah some of the other CI systems have that same problem. I've noticed they implement as a ring buffer.14:57
@spamaps:spamaps.ems.hostIf you start scrolling up they drop the new stuff.14:57
-@gerrit:opendev.org- Tristan Cacqueray proposed: [zuul/zuul] 762759: web: integrate re-ansi to render ANSI code in Zuul consoles https://review.opendev.org/c/zuul/zuul/+/76275915:10
@tristanc_:matrix.orgAlbin Vass: spamaps well here is a change to add ansi color https://review.opendev.org/c/zuul/zuul/+/762759 . @corvus could you please re-evaluate your -2 ?15:12
@goneri:matrix.orgHi! We've recently had some short (less than 60s) network outages, and this seems to be the cause of some deadlocks in nodepool. My understanding ( and I'm not an expert :-) ) is that nodepool gets the request from zs and starts to process it. It pushes the lock to zk and loses the connection. It quickly gives up because of an exception coming from Kazoo ( https://paste.openstack.org/show/810873/ ). But the lock remains. When it retries, it's too late: a lock is already there.15:12
@clarkb:matrix.orgGonéri: the node request and the locks should be ephemeral which means if you lose connectivity they go away. I don't know why that connectionloss wouldn't unlock the locks15:17
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 817228: Remove gearman from zuul-web https://review.opendev.org/c/zuul/zuul/+/81722815:20
@jpew:matrix.orgI have a cache cleanup job I want to run once a week with zuul, and while it's running I don't want any other jobs to run.... is there a way to do that?15:30
@goneri:matrix.orgClark: the Kazoo documentation says the ephemeral nodes are deleted if the state transition to LOST, here we just switch to SUSPEND.15:32
@goneri:matrix.org * Clark: the Kazoo documentation says the ephemeral nodes are deleted if the state transition to LOST, here we just switch to SUSPENDED.15:33
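The SUSPENDED versus LOST distinction above can be sketched as a small connection listener. This is a minimal illustration, not nodepool's code: `SessionWatcher` is a hypothetical name, and the three string constants are assumed to mirror the values of `kazoo.client.KazooState`. The point it shows is that a SUSPENDED -> CONNECTED round trip keeps the same session, so ephemeral znodes created by that session would still exist on the server, while LOST means the session expired and its ephemerals are removed.

```python
# Assumed to mirror kazoo.client.KazooState.{SUSPENDED,CONNECTED,LOST}.
SUSPENDED = "SUSPENDED"
CONNECTED = "CONNECTED"
LOST = "LOST"


class SessionWatcher:
    """Hypothetical listener that records whether the session ever expired."""

    def __init__(self):
        self.history = []
        self.session_expired = False

    def __call__(self, state):
        # Kazoo calls a listener with the new connection state.
        self.history.append(state)
        if state == LOST:
            # Session expired: ZooKeeper drops this session's ephemeral znodes.
            self.session_expired = True
        # SUSPENDED only means the connection dropped. If CONNECTED follows
        # without a LOST in between, the original session is still alive.


# Usage with a real client (assumes the kazoo package is installed):
#   from kazoo.client import KazooClient
#   client = KazooClient(hosts="localhost:2181")
#   watcher = SessionWatcher()
#   client.add_listener(watcher)
#   client.start()
```

After a short outage, `watcher.history` ending in `SUSPENDED, CONNECTED` with `session_expired` still false would suggest that any ephemeral lock taken before the outage is still owned by the same session.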
@jim:acmegating.comjpew: the cleanup is a zuul job, or an external thing?15:34
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 816918: Upgrade react-router-dom https://review.opendev.org/c/zuul/zuul/+/81691815:34
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 734082: web UI: user login with OpenID Connect https://review.opendev.org/c/zuul/zuul/+/73408215:34
@jpew:matrix.orgcorvus: A zuul job15:34
@jim:acmegating.comjpew: i don't have a good solution; semaphores don't have a read/write variant which i assume you would need for that.15:35
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 734850: web UI: allow a privileged user to dequeue a change https://review.opendev.org/c/zuul/zuul/+/73485015:35
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 736772: web UI: allow a privileged user to re-enqueue a change https://review.opendev.org/c/zuul/zuul/+/73677215:36
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 768115: Web UI: allow a privileged user to request autohold https://review.opendev.org/c/zuul/zuul/+/76811515:36
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 768199: Web UI: add Autoholds, Autohold page https://review.opendev.org/c/zuul/zuul/+/76819915:36
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 810699: Web UI: Show pipeline types as icons https://review.opendev.org/c/zuul/zuul/+/81069915:36
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 781858: web UI: allow a privileged user to promote a change https://review.opendev.org/c/zuul/zuul/+/78185815:37
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 802559: Web UI: Add "Create Autohold Request" form, improve API error messages https://review.opendev.org/c/zuul/zuul/+/80255915:37
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 769943: Example Docker compose: keycloak integration https://review.opendev.org/c/zuul/zuul/+/76994315:37
@goneri:matrix.orgMy bad, when the connection's back, the state switches to LOST -> CONNECTED. So the locks should be removed at this moment.15:58
@goneri:matrix.orgActually, the node request has ephemeralOwner == 0 in zk:  https://paste.openstack.org/show/810883/16:17
@jim:acmegating.comcorrect, node requests are not ephemeral16:17
@goneri:matrix.orgOh, so this explains my endless loop of [node_request: 300-0000678000] Request is locked by someone else.16:18
@clarkb:matrix.orgbut the locks are ephemeral?16:18
@jim:acmegating.comlocks are ephemeral16:19
@jim:acmegating.comwell, some are16:19
@jim:acmegating.comnodepool's lock on the request is16:19
@clarkb:matrix.orgGonéri: you can list the ephemeral nodes and their sessions. Then you can list sessions which includes IP addrs to identify the lock owner16:20
@goneri:matrix.orgWell, I think my problem is at the node request lock.16:21
@clarkb:matrix.orgdump to list the ephemeral nodes and sessions, and cons for sessions16:21
@goneri:matrix.orgThe node request is not processed and the lock remains.16:21
@jim:acmegating.comClark: can you take a look at https://review.opendev.org/817126 ? i think we might want to go ahead and rip the bandage off on that one.16:22
@clarkb:matrix.orgI always have to look that up. maybe I should write it down somewhere16:22
@clarkb:matrix.orgcorvus: that will require a stop, clear of zk data, then start? Or just start with new data on the new path and sometime later delete the old path contents?16:23
@jim:acmegating.comClark: i'd do the first i think16:24
@jim:acmegating.combut the second would work16:24
@clarkb:matrix.orgcorvus: does my nit there change anything? perhaps there are other places we need to update if there was a pipeline name conflict (I think the conflict may be with tenant names?)16:26
@jim:acmegating.comClark: you're right, it was just a commit message typo16:26
@jim:acmegating.comi think we should just go ahead and approve it16:27
@clarkb:matrix.orgwfm I'll +A16:28
@clarkb:matrix.orgI was just concerned I might have missed something important there and if I had it was worth double checking :)16:28
@jim:acmegating.comGonéri: Ie5aae0704e5925a5bcc73cc6bc0bcb91287ab26e fixed an issue with node requests in zuul; i doubt it could cause the behavior you're seeing, but perhaps related.  just fyi.16:30
@goneri:matrix.orgAn uncaught exception that happens after the node request lock in _assignHandlers https://paste.openstack.org/show/810885/16:38
@goneri:matrix.orgnote: my node request state in zk is: "requested".16:49
@goneri:matrix.orgIf I understand correctly, _removeCompletedHandlers() is in charge of cleaning this kind of unachieved node request. It loops on self.request_handlers list. But it does not include our thread. Shouldn't this be done before? https://opendev.org/zuul/nodepool/src/branch/master/nodepool/launcher.py#L199 e.g: https://paste.openstack.org/show/810886/17:02
@clarkb:matrix.orgGonéri: is the request internally completed? I wonder if the exception caused it to bail out early so a necessary state transition is not occurring17:11
@clarkb:matrix.orgI would inspect the znode record in zk directly to check that17:12
@goneri:matrix.orgYes, the node request is stuck. Nothing has been started.17:12
@clarkb:matrix.orgshould give you the current state and other info17:12
@goneri:matrix.orgI pasted that above, is this what you want? https://paste.openstack.org/show/810883/17:13
@clarkb:matrix.orgI'm wondering if this is the thing opendev has seen in the past where we have had to restart launchers17:13
@clarkb:matrix.orgbut we haven't seen that recently17:13
@clarkb:matrix.orgGonéri: no the actual content of the record in the database not the metadata. That should tell you who has declined it, who is currently working on it, what state the request is in and so on17:14
@goneri:matrix.orgFrom the log, it's clear nothing is working on it. I've just an endless loop of "Request is locked by someone else".17:15
@goneri:matrix.orgIt has been like that for 12h.17:15
@clarkb:matrix.orgright which is why I've suggested you determine who it is locked by and work from there17:20
@clarkb:matrix.orgI think there are two approaches for that. Either examine the record data directly to see if it was recorded or use the two zookeeper four letter commands to figure it out17:20
@clarkb:matrix.orgsince connection losses should remove ephemeral locks17:20
@clarkb:matrix.orgI think it is a good idea to determine how the lock exists17:21
@goneri:matrix.orgClark: I need your help on this, where is the information you're looking for? In MariaDB or in zk?17:37
@clarkb:matrix.orgGonéri: all in zookeeper. The first is similar to what you have in your paste at https://paste.openstack.org/show/810883/ but instead of stat'ing the record you can get the record (I think the command is get). The second is zookeeper's 4 letter commands which can be sent to the client port but depending on how your zookeeper is deployed they may or may not be enabled https://zookeeper.apache.org/doc/r3.4.8/zookeeperAdmin.html#sc_zkCommands There is also a rest api that accepts the four letter commands that may or may not be enabled17:39
@clarkb:matrix.orgThe idea is to use that to get the identity of the lock holder in nodepool or zuul. Then from that figure out why the lock is held even though no action is taken17:40
@goneri:matrix.orgah! thanks :-) https://paste.openstack.org/show/810888/17:40
@clarkb:matrix.orginteresting it doesn't have any declined by entries17:42
@clarkb:matrix.organd is still in the requested state. This must have happened very early in the request process17:43
@goneri:matrix.orgYes, this one happened at the beginning of the first try. And zk was away after, so no way to update the record.17:43
@clarkb:matrix.orgright but the lock should've been removed? then processing could try again via locking it and proceeding17:43
@goneri:matrix.orgThis is how I would address that: https://paste.openstack.org/show/810889/17:44
@clarkb:matrix.orgI think the idea is the request handler failed and the lock goes away so we don't need to continue to track that request handler but I may be missing something important around that17:45
@goneri:matrix.orgThe exception is not caught and so we break the loop. But the lock remains.17:45
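The failure mode described here (an exception escaping after the lock is taken, which ends the loop and leaves the lock held) can be sketched in isolation. This is not nodepool's `_assignHandlers` and not the fix from the paste; `assign_handlers`, `lock_for`, and `handle` are hypothetical names, and plain `threading.Lock` objects stand in for ZooKeeper locks. It only illustrates the general pattern of releasing the lock when handler setup raises and then continuing with the next request.

```python
import threading


def assign_handlers(requests, lock_for, handle):
    """Try every request; release the lock of any request whose handler raises.

    Returns the list of requests that failed. On success the lock stays
    held, on the assumption that the handler owns it until the request
    is finished.
    """
    failed = []
    for req in requests:
        lock = lock_for(req)
        if not lock.acquire(blocking=False):
            continue  # "Request is locked by someone else"
        try:
            handle(req)
        except Exception:
            # Without this release the request would stay locked while
            # nothing is working on it.
            lock.release()
            failed.append(req)
    return failed
```

Called with three requests where the second handler raises, this returns only that request, leaves its lock free for a retry, and still processes the third.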
@clarkb:matrix.orgOne reason I'm asking about double checking the lock holder is it may be possible that zuul holds the lock (though that is probably unlikely)17:45
@goneri:matrix.orgWe've got 3 node requests like that because of this 1 minute outage.17:46
@clarkb:matrix.orgright but we should confirm it is nodepool that holds the lock before making a change like the one in your diff (I think the change in your diff is probably safe either way but we may not be looking at the right spot if the lock holder is zuul)17:47
@goneri:matrix.orgHow can I know who holds the lock?17:49
@clarkb:matrix.orgfor that I think you need to use the four letter commands. You can run `dump` to get a listing of all the ephemeral nodes and their associated sessions (this should include the lock). Then you can run `cons` to get all the connections mapped to sessions.17:50
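The four-letter commands are plain text sent to the ZooKeeper client port, so they can be issued with nothing but a socket. A minimal sketch, assuming the commands are enabled on the server (as noted above, they may not be) and that the server closes the connection after replying; `four_letter_word` is a hypothetical helper name, and the host and port in the usage comment are placeholders.

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    if len(command) != 4:
        raise ValueError("expected a four-letter command such as 'dump' or 'cons'")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break  # server closed the connection: reply is complete
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Usage (placeholder host and port):
#   print(four_letter_word("localhost", 2181, "dump"))  # ephemerals per session
#   print(four_letter_word("localhost", 2181, "cons"))  # connections per session
```

Matching a session ID from the `dump` reply against the `cons` reply should give the client address that owns the ephemeral node.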
@goneri:matrix.org    "/nodepool/requests/300-0000678000" : [ 72408799404359680 ],17:52
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 817126: Move pipeline state to /zuul/tenant https://review.opendev.org/c/zuul/zuul/+/81712617:52
@clarkb:matrix.orgGonéri: I think the lock path is something like /nodepool/requests/foo-bar/lock?17:52
@goneri:matrix.orgthis is nl01, my nodepool node.17:53
@clarkb:matrix.orgfor the /lock path too?17:53
@goneri:matrix.orgFor the following lock: /nodepool/requests/300-000067800017:54
@clarkb:matrix.orghrm I don't think that is the lock that is just the request?17:54
@clarkb:matrix.orgthen there is a separate `/nodepool/requests/300-0000678000/lock` path? But maybe we don't lock that way for node requests17:55
@goneri:matrix.org```17:56
ls /nodepool/requests/300-0000678000
[]
```
@jim:acmegating.comClark: regarding the issue of the estimated time not showing up; we have a thread doing that in the background.  we prime it when we get the node request back, and we get the actual value when the job starts.  but that can happen on two different schedulers now, so i think that needs some re-thinking.17:56
@clarkb:matrix.orgGonéri: interesting, I wonder if that means the lock really did go away? when you say the lock remains above are you talking about `/nodepool/requests/300-0000678000` ?17:57
@clarkb:matrix.orgin that case I would agree that the request remains and that your fix seems like a possible fix (however, I would've expected the launcher to simply try handling the request again since the previous handler failed)17:58
@clarkb:matrix.orgcorvus:  oh neat. I'm glad this early running in OpenDev has been so productive :)17:58
@goneri:matrix.orgI assume the node_request was locked because of the `Request is locked by someone else` messages. that's all :-).17:59
@clarkb:matrix.orgaha18:01
@clarkb:matrix.org`/nodepool/requests-lock/300-0000678000` is the path. Can you check who holds that?18:02
@goneri:matrix.orgsure18:04
@goneri:matrix.orgWell, the lock exists but nothing holds it.18:06
-@gerrit:opendev.org- Clint Byrum proposed: [zuul/zuul-jobs] 817291: Remove google_sudoers in revoke-sudo https://review.opendev.org/c/zuul/zuul-jobs/+/81729118:10
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 817292: Make time estimation synchronous https://review.opendev.org/c/zuul/zuul/+/81729218:11
@clarkb:matrix.orgGonéri: in your dump output that path should show up with a session that should show you who holds it18:11
@jim:acmegating.comClark, tobiash: we expected  https://review.opendev.org/c/zuul/zuul/+/817292 i think the time has come18:12
@goneri:matrix.orgI did curl http://localhost:8080/commands/watches_by_path. I can see all the locks and they are all in /nodepool/nodes/*/lock.18:13
@clarkb:matrix.orgGonéri: I don't know that nodepool watches the requests path. I think it may iterate through it18:14
@goneri:matrix.orghttps://opendev.org/zuul/nodepool/src/branch/master/nodepool/launcher.py#L181 it looks to me like it uses the node request directly.18:17
@clarkb:matrix.orgGonéri: does /nodepool/requests-lock/300-0000678000 show up in the dump output?18:19
@clarkb:matrix.orgthat will show you which session has the lock18:19
@goneri:matrix.orgthe entry exists if this is the question. However, it's not locked.18:21
@goneri:matrix.orgactually, when I do a `ls /nodepool/requests-lock`, I see some very old entries. Looks like nothing cleans them up.18:22
@clarkb:matrix.orgwell they should be ephemeral. Looks like they use the base kazoo Lock recipe18:23
@goneri:matrix.org537 entries!18:23
@clarkb:matrix.orgya confirmed https://kazoo.readthedocs.io/en/latest/api/recipe/lock.html#kazoo.recipe.lock.Lock.acquire shows the default is ephemeral and nodepool doesn't seem to override that in any of its acquire calls18:24
@clarkb:matrix.orgI wonder if zookeeper didn't consider the connection to be lost18:25
@clarkb:matrix.organd kazoo is relying on zookeeper to clean those up in a connection loss state.18:25
@clarkb:matrix.orgmaybe check your zookeeper logs and ss/netstat to see if the connection was ever  recorded as lost? maybe it is still there?18:26
@goneri:matrix.orgI don't think this is what caused our problem. req here https://opendev.org/zuul/nodepool/src/branch/master/nodepool/launcher.py#L181 comes from https://opendev.org/zuul/nodepool/src/branch/master/nodepool/zk.py#L1742. It's the regular path pointing to /nodepool/requests/300-0000678000.18:29
@goneri:matrix.organd this is the req nodepool tries to lock here: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/launcher.py#L18118:30
@clarkb:matrix.orgGonéri: the 'Request is locked' messages that you are getting are from nodepool trying to process the request. It would be able to process it if it wasn't locked18:30
@clarkb:matrix.orgThe way a disconnect should be handled if I understand correctly is that the requests will be unlocked and then when reconnection happens nodepool can reprocess them18:31
@clarkb:matrix.orgbut this is failing due to the lock18:31
@clarkb:matrix.orgIf we understand why the lock doesn't go away then we can fix this18:31
@goneri:matrix.orgI mean, I think there are two different paths. And the one nodepool uses in this block is /nodepool/requests/300-000067800018:31
@clarkb:matrix.orgthere are two different paths. `/nodepool/requests/300-0000678000` is where the request details are written. How many nodes, what type of nodes etc. Then `/nodepool/requests-lock/300-0000678000` is used to lock the first path so that only one launcher or zuul are modifying it at a time18:33
@clarkb:matrix.orgThe recovery process is failing because somehow that lock remains in place. If the lock goes away then recovery should proceed (and this is the anticipated behavior)18:33
@clarkb:matrix.orgI wonder if the disconnection doesn't kill the session somehow and that is what keeps the lock in place but the code that noticed the disconnect has bailed out so cannot process the request further18:34
@clarkb:matrix.orgcorvus: ^ do you know if you can disconnect without losing the session?18:34
@jim:acmegating.comClark: a session is lost on disconnect18:35
@goneri:matrix.orgoh I see, I didn't realize lockNodeRequest uses a different lock path internally18:35
@goneri:matrix.orgthank you for your patience :-)18:36
@clarkb:matrix.org> <@jim:acmegating.com> Clark: a session is lost on disconnect18:37
That has me thinking that zookeeper didn't notice the disconnect somehow and only kazoo seemed to notice
@clarkb:matrix.orgbecause it does seem that we rely on the zookeeper side of things to clean up ephemeral nodes18:38
@jim:acmegating.comthat seems unlikely to be the issue.  i think we'd need some solid proof for that.18:39
@clarkb:matrix.orgI'm going to have to stop looking at this momentarily to prep for the opendev team meeting.18:39
@goneri:matrix.orgSo indeed, there is a /nodepool/requests-lock/300-0000678000/380fe6217cb947c88c3ac7b6c2c008d7__lock__0000000000 lock in the dump output and it's associated with nl01 (same session 72408799404359680).18:41
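Editor's sketch of the lookup discussed above: given the request ID, derive the lock path under `/nodepool/requests-lock` and find which session owns an ephemeral znode beneath it. This is a minimal sketch, not nodepool code. It assumes the ZooKeeper AdminServer `dump` command returns JSON with a `session_id_to_ephemeral_paths` mapping (check this against your ZooKeeper version); the function names are hypothetical, and the sample data reuses the path and session quoted in the chat.

```python
import json

# Assumption: shape of the AdminServer "dump" response; the key name
# "session_id_to_ephemeral_paths" should be verified for your ZooKeeper version.
SAMPLE_DUMP = json.loads("""
{
  "session_id_to_ephemeral_paths": {
    "72408799404359680": [
      "/nodepool/requests-lock/300-0000678000/380fe6217cb947c88c3ac7b6c2c008d7__lock__0000000000"
    ]
  }
}
""")


def lock_path_for(request_id):
    """Request details live in /nodepool/requests; the lock lives in a sibling tree."""
    return "/nodepool/requests-lock/" + request_id


def sessions_holding(dump, request_id):
    """Return the session ids that own an ephemeral znode under the request's lock path."""
    prefix = lock_path_for(request_id) + "/"
    holders = []
    for session_id, paths in dump.get("session_id_to_ephemeral_paths", {}).items():
        if any(path.startswith(prefix) for path in paths):
            holders.append(int(session_id))
    return sorted(holders)


print(sessions_holding(SAMPLE_DUMP, "300-0000678000"))
```

A non-empty result means some live session still holds the lock, so the request cannot be reprocessed until that session releases it or expires.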
@clarkb:matrix.orgok good that confirms zuul doesn't have the lock18:44
@clarkb:matrix.orgwhich means the problem is independent of zuul18:44
@clarkb:matrix.orgBut I must context switch now18:44
@clarkb:matrix.orgcorvus: I'll try to review the time db change after the opendev meeting18:46
@jim:acmegating.comClark: thx; btw i've been going through the #sos stack to do the zuul-web stuff, i think it's 99% there and ready for other reviewers18:47
@clarkb:matrix.orgcorvus: noted18:49
@jim:acmegating.commhu: i think we could use your input on https://review.opendev.org/81717219:00
-@gerrit:opendev.org- Simon Westphahl proposed on behalf of Felix Edel:19:10
- [zuul/zuul] 817159: Implement job endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817159
- [zuul/zuul] 817160: Implement project endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817160
- [zuul/zuul] 817196: Implement tenant endpoints directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817196
- [zuul/zuul] 817228: Remove gearman from zuul-web https://review.opendev.org/c/zuul/zuul/+/817228
-@gerrit:opendev.org- Simon Westphahl proposed:19:10
- [zuul/zuul] 817164: Use abide for getting project SSH keys in zuul-web https://review.opendev.org/c/zuul/zuul/+/817164
- [zuul/zuul] 817171: Check authentication directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817171
- [zuul/zuul] 817172: Create list of admin tenants directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817172
- [zuul/zuul] 817174: Use abide to get (dis-)allowed labels in zuul-web https://review.opendev.org/c/zuul/zuul/+/817174
- [zuul/zuul] 817177: Create list of connections directly in zuul-web https://review.opendev.org/c/zuul/zuul/+/817177
- [zuul/zuul] 817193: Use config errors directly from layout in zuul-web https://review.opendev.org/c/zuul/zuul/+/817193
@westphahl:matrix.orgcorvus: responded to your comment in 817003. I won't have the time to fix this today, so feel free to ninja fix if you find the time19:17
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 817298: Add tag to tag serialization https://review.opendev.org/c/zuul/zuul/+/81729819:18
@jim:acmegating.comswest: ack thx.19:19
@jim:acmegating.comClark: ^ that change should fix the tag issue mentioned in #opendev19:19
@clarkb:matrix.orgsounds like my hunch was a good one19:27
@clarkb:matrix.orgcorvus: question on https://review.opendev.org/c/zuul/zuul/+/817292 but +2'd as it isn't very important and can be cleaned up in a followup20:05
@jim:acmegating.comtristanC: done :)20:08
@jim:acmegating.comClark: do you think https://review.opendev.org/817298 is sufficiently reviewed to approve?  (cc: zuul-maint)  I think that might be a good candidate to get in now, then do an opendev clear zk data and restart this afternoon/evening.20:10
@jim:acmegating.comthat would get the tag fix as well as the pipeline relocation20:11
@clarkb:matrix.orgcorvus: ++ not sure if we want to try and track down the nodeset already defined issue before restarting though?20:24
@jpew:matrix.orgI have two jobs: "build" and "test". "build" builds then "provides" build artifacts and "test" then "requires" those artifacts and tests them. For $REASONS, "test" loops over all provided artifacts and will test any found in a loop. However, I've noticed that when changes start stacking, later "test" jobs will get artifacts from the previously stacked patchsets (we are using Gerrit BTW), meaning it runs the tests multiple times: once for the current patch and once for each patchset the current one was stacked on top of..... I don't really want that latter part to happen (although I assume there is some mindblowing reason why Zuul does it :) )20:26
@jim:acmegating.comClark: i'm thinking let's get 298 in the pipe then see where we are with the other thing20:26
@jpew:matrix.orgTL;DR: I want my test job to only test things from the current patchset20:26
@jim:acmegating.comjpew: let's clarify: do you really mean you want test only considering artifacts from the current queue item, not anything from changes it depends on?  or do you just mean that you only want to use the "latest" of any particular artifact?20:27
@jpew:matrix.orgcorvus: The former20:29
@jim:acmegating.comjpew: does the build job by any chance take into account in some way (either via git checkouts or artifact reuse) the changes ahead of it?20:29
@jpew:matrix.orgcorvus: I'm not specifically sure what you are asking, but it doesn't require anything from Zuul's perspective.... there is a lot of caching between builds (think ccache), but they all start from scratch20:31
@jim:acmegating.comjpew: given change A, and change B after it in the pipeline, when the 'build' job for change B runs, would it produce anything different if change A wasn't there?20:32
@jim:acmegating.comjpew:  anyway, if the answer to that is 'no', then it's a little bit of a warning flag that you may want to make sure you're getting the most out of gating.  :)  to directly answer your question, you can drop provides/requires from the test job, since provides/requires is used for making a linkage between queue items, not between jobs within a single queue item.  the latter is done via 'dependencies'.20:37
@jpew:matrix.orgcorvus: I think the answer is yes, but my mind is broken trying to think how it could possibly be no...20:38
@jim:acmegating.comjpew: for a similar example in zuul's repo, the "zuul-build-image" or "zuul-upload-image" (they are effectively the same job, just tweaked for different pipelines) is your 'build' job, and the "zuul-quick-start" is your test job.  but zuul-(build|upload)-image uses artifacts from changes ahead of it in the queue.  so it has provides/requires to make sure it gets those.  zuul-quick-start just has dependencies to make sure that it runs after the build job.20:39
@jim:acmegating.comhttps://opendev.org/zuul/zuul/src/branch/master/.zuul.yaml20:39
@jpew:matrix.org@corvus: Ah! Ok I have the dependencies, but I couldn't figure out how to get the artifacts from the "build" job to the "test" job, so I added the provides/requires20:40
@jim:acmegating.com(* white lie: it also has a 'requires' line because it requires nodepool container images which are not built by the zuul-build-image job, so it actually does both things)20:40
@clarkb:matrix.org> <@jim:acmegating.com> Clark: i'm thinking let's get 298 in the pipe then see where we are with the other thing20:40
Ok, my only worry with this is that the zk clearing will likely "fix" the nodeset issue and we have no way of knowing if we'll see it again soon
@jpew:matrix.orgI suppose the thing to do there is make the "build" and "test" job agree where the artifacts are stashed?20:42
@jim:acmegating.comjpew: i thought dependencies also pass artifacts.  but that is also an option.  that's effectively what's happening in the zuul jobs -- they agree that the artifacts are in the buildset container registry.20:43
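Editor's sketch of the layout corvus describes: `dependencies` orders jobs within one queue item, while `provides`/`requires` links queue items to each other. This is a hedged config fragment, not jpew's actual configuration. The job names `build` and `test` come from the chat; the `check` pipeline and the `build-artifacts` name are assumptions for illustration.

```yaml
- job:
    name: build
    # Keep 'provides' only if builds for later queue items
    # should consume this change's artifacts.
    provides: build-artifacts

- job:
    name: test
    # No 'requires' here: the job should not pull artifacts
    # from changes ahead of it in the queue.

- project:
    check:
      jobs:
        - build
        - test:
            # Run only after 'build' for the same queue item.
            dependencies:
              - build
```

With this shape, how `test` locates the output of `build` is still up to the two jobs, for example by agreeing on a shared location as suggested above.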
@jim:acmegating.comClark: ok, let's dig into that and keep it in mind20:44
@mhuin:matrix.org> <@jim:acmegating.com> mhu: i think we could use your input on https://review.opendev.org/81717220:59
I just reviewed, I think this is indeed a bug. Auth override in the token was tested only when the override is disabled in zuul's conf, from what I can tell
@jim:acmegating.commhu: should we return a list of all tenants in that case?21:01
@mhuin:matrix.org> <@jim:acmegating.com> mhu: should we return a list of all tenants in that case?21:04
No, the idea was that an operator would generate a token scoped to one or several tenants with the `zuul create-auth-token` command. The claim zuul.admin would then hold the list of tenants on which the token holder would be admin
@jim:acmegating.commhu: oh got it.  and i see your comment with the suggested fix.  that all makes sense, thx.  :)21:05
@mhuin:matrix.orgnp! I'm just facepalming because I forgot to add a test for this obvious use case21:05
@mhuin:matrix.orgI only tested when the override is forbidden21:06
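Editor's sketch of the behavior mhu describes: when the override is allowed, the `zuul.admin` claim in the token lists the tenants on which the holder is admin. This is a minimal illustration and not the code in zuul-web. The function name and the `allow_override` parameter are hypothetical, and the sketch returns an empty list when the override is disallowed rather than modeling Zuul's configured authorization rules.

```python
def admin_tenants_from_claims(claims, allow_override):
    """Return the tenants granted by the token's zuul.admin claim.

    claims: decoded JWT payload as a dict (signature checks are assumed
            to have happened earlier).
    allow_override: hypothetical flag standing in for the deployment's
            setting that permits token-based overrides.
    """
    if not allow_override:
        # Override forbidden: the claim grants nothing by itself.
        return []
    zuul_claim = claims.get("zuul") or {}
    tenants = zuul_claim.get("admin") or []
    # Only keep well-formed tenant names.
    return [tenant for tenant in tenants if isinstance(tenant, str)]


token_claims = {"sub": "venkman", "zuul": {"admin": ["tenant-one", "tenant-two"]}}
print(admin_tenants_from_claims(token_claims, allow_override=True))
```

The missing test mentioned above corresponds to the `allow_override=True` branch: the result should be exactly the scoped tenants, not every tenant.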
-@gerrit:opendev.org- Clark Boylan proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 817292: Make time estimation synchronous https://review.opendev.org/c/zuul/zuul/+/81729221:07
@clarkb:matrix.orgThat is just a linter fix21:07
@goneri:matrix.orgClark: I just did a restart of nodepool-launcher and I confirm it fixes the problem.21:29
@clarkb:matrix.orgok we know then that removing the lock allows it to recover gracefully as expected. That means we should probably track down why the lock didn't get removed to fix this going forward21:30
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 817298: Add tag to tag serialization https://review.opendev.org/c/zuul/zuul/+/81729822:07
@goneri:matrix.orgClark: I think https://review.opendev.org/c/zuul/nodepool/+/817287 would be enough to avoid the situation. I'm not sure yet how to test it.22:08
@clarkb:matrix.orgit might but it doesn't really explain why it broke in the first place which bothers me :)22:19
@clarkb:matrix.orgwe need https://review.opendev.org/c/zuul/nodepool/+/817114 to land before other nodepool fixes22:38
@clarkb:matrix.orgit fixes the boto deps situation22:38
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 817292: Make time estimation synchronous https://review.opendev.org/c/zuul/zuul/+/81729222:39
@clarkb:matrix.orghrm actually they made a new release and the pip dep solver might be able to solve it. But giving the dep solver help might also make installations quicker. Just less urgent now22:40
