*** tosky has quit IRC | 00:00 | |
openstackgerrit | Merged zuul/zuul master: Support per branch change queues https://review.opendev.org/c/zuul/zuul/+/718531 | 00:43 |
openstackgerrit | Merged zuul/zuul master: Move queue from pipeline to project https://review.opendev.org/c/zuul/zuul/+/720182 | 00:45 |
*** hamalq has quit IRC | 00:56 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Use ZooKeeper TLS in tests https://review.opendev.org/c/zuul/zuul/+/777489 | 00:59 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Emit config warnings if shared queues on pipelines https://review.opendev.org/c/zuul/zuul/+/777479 | 01:08 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Add test-setup-docker.sh https://review.opendev.org/c/zuul/nodepool/+/777491 | 01:21 |
corvus | tristanC, tobiash: ^ i just now realized i forgot to 'git add' that file when we were working on the nodepool zk change. some of the review comments make more sense now. :) | 01:21 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Use ZooKeeper TLS in tests https://review.opendev.org/c/zuul/zuul/+/777489 | 01:57 |
*** rlandy|bbl is now known as rlandy | 03:15 | |
*** rlandy has quit IRC | 03:16 | |
*** SpamapS has quit IRC | 03:47 | |
*** SpamapS has joined #zuul | 03:51 | |
*** ykarel has joined #zuul | 04:55 | |
*** bhagyashri|ruck is now known as bhagyashri|rover | 05:15 | |
*** asettle has quit IRC | 05:17 | |
openstackgerrit | Merged zuul/zuul master: Add python-logstash-async to container images https://review.opendev.org/c/zuul/zuul/+/776551 | 05:28 |
*** evrardjp has quit IRC | 05:33 | |
*** evrardjp has joined #zuul | 05:33 | |
*** ajitha has joined #zuul | 05:35 | |
*** jfoufas1 has joined #zuul | 05:42 | |
*** saneax has joined #zuul | 05:55 | |
*** yoctozepto0 has joined #zuul | 06:05 | |
*** yoctozepto has quit IRC | 06:05 | |
*** yoctozepto0 is now known as yoctozepto | 06:05 | |
*** zbr has quit IRC | 06:43 | |
*** reiterative has quit IRC | 06:50 | |
*** reiterative has joined #zuul | 06:51 | |
*** icey has quit IRC | 06:58 | |
*** icey has joined #zuul | 06:58 | |
*** icey has quit IRC | 07:01 | |
*** icey has joined #zuul | 07:05 | |
*** icey has quit IRC | 07:08 | |
*** jpena|off is now known as jpena | 07:45 | |
*** rpittau|afk is now known as rpittau | 07:57 | |
*** jcapitao has joined #zuul | 07:59 | |
*** jfoufas1 has quit IRC | 08:19 | |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Add zookeeper-timeout connection config https://review.opendev.org/c/zuul/nodepool/+/752022 | 08:22 |
avass | corvus: there's a shell_type change for nodepool as well but that one is marked WIP. I think there should be a simple unittest for it | 08:28 |
*** ykarel_ has joined #zuul | 08:31 | |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Harmonize zk session timeouts https://review.opendev.org/c/zuul/zuul/+/763209 | 08:31 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Make repo state buildset global https://review.opendev.org/c/zuul/zuul/+/738603 | 08:33 |
*** ykarel has quit IRC | 08:33 | |
*** ykarel_ is now known as ykarel | 08:34 | |
*** tosky has joined #zuul | 08:38 | |
*** icey has joined #zuul | 08:57 | |
*** msuszko has quit IRC | 09:04 | |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Move fingergw config to fingergw https://review.opendev.org/c/zuul/zuul/+/664949 | 09:05 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Route streams to different zones via finger gateway https://review.opendev.org/c/zuul/zuul/+/664965 | 09:05 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Support ssl encrypted fingergw https://review.opendev.org/c/zuul/zuul/+/664950 | 09:05 |
*** msuszko has joined #zuul | 09:05 | |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Add --validate-tenants option to zuul scheduler https://review.opendev.org/c/zuul/zuul/+/542160 | 09:11 |
Open10K8S | Hi team. can you please check these again? https://review.opendev.org/c/zuul/zuul-jobs/+/776677 https://review.opendev.org/c/opendev/base-jobs/+/777087 | 09:37 |
*** harrymichal has joined #zuul | 09:48 | |
*** yoctozepto9 has joined #zuul | 09:55 | |
*** nils has joined #zuul | 09:55 | |
*** jhesketh_ has joined #zuul | 09:57 | |
*** paulalbertella has joined #zuul | 09:57 | |
*** avass_ has joined #zuul | 09:58 | |
*** arxcruz has quit IRC | 10:00 | |
*** arxcruz has joined #zuul | 10:01 | |
*** reiterative has quit IRC | 10:03 | |
*** yoctozepto has quit IRC | 10:03 | |
*** jhesketh has quit IRC | 10:03 | |
*** avass has quit IRC | 10:03 | |
*** imtiazc has quit IRC | 10:03 | |
*** irclogbot_0 has quit IRC | 10:03 | |
*** yoctozepto9 is now known as yoctozepto | 10:03 | |
*** irclogbot_2 has joined #zuul | 10:08 | |
*** jangutter has joined #zuul | 10:40 | |
*** jangutter_ has joined #zuul | 10:45 | |
*** jangutter has quit IRC | 10:49 | |
*** sshnaidm|afk is now known as sshnaidm|pto | 11:22 | |
*** jangutter_ is now known as jangutter | 12:05 | |
*** jcapitao is now known as jcapitao_lucnh | 12:19 | |
*** jcapitao_lucnh is now known as jcapitao_lunch | 12:19 | |
*** jpena is now known as jpena|lunch | 12:27 | |
avass_ | any plans on upgrading opendev's zuul to v4? I want to upgrade ours but I don't want to be first out :) | 12:38 |
avass_ | or is that 3.19 version the same change but from an image built in gate and not by a tag? | 12:39 |
*** rlandy has joined #zuul | 12:51 | |
openstackgerrit | Felix Edel proposed zuul/zuul master: Remove superfluous flushes and queries from SQL reporter https://review.opendev.org/c/zuul/zuul/+/752664 | 12:51 |
*** zbr has joined #zuul | 13:13 | |
*** saneax has quit IRC | 13:19 | |
*** jpena|lunch is now known as jpena | 13:26 | |
openstackgerrit | Merged zuul/nodepool master: Add test-setup-docker.sh https://review.opendev.org/c/zuul/nodepool/+/777491 | 13:27 |
tobiash | avass_: v4 has been tagged from the last restart rev | 13:28 |
*** jcapitao_lunch is now known as jcapitao | 13:30 | |
*** ykarel has quit IRC | 13:32 | |
*** ykarel has joined #zuul | 13:37 | |
*** iurygregory has quit IRC | 13:43 | |
*** iurygregory has joined #zuul | 13:44 | |
*** zbr7 has joined #zuul | 13:47 | |
avass_ | tobiash: thanks! | 13:49 |
*** zbr has quit IRC | 13:49 | |
*** zbr has joined #zuul | 13:51 | |
*** zbr7 has quit IRC | 13:53 | |
*** jangutter_ has joined #zuul | 13:55 | |
*** jangutter has quit IRC | 13:58 | |
fungi | avass_: tobiash: yeah it's v4, we restarted it on the commit we were considering tagging just to make sure it was still working | 14:01 |
*** yoctozepto has quit IRC | 14:01 | |
*** yoctozepto has joined #zuul | 14:01 | |
*** zbr0 has joined #zuul | 14:11 | |
*** zbr has quit IRC | 14:13 | |
*** zbr0 is now known as zbr | 14:13 | |
*** jangutter has joined #zuul | 14:16 | |
*** jangutter_ has quit IRC | 14:19 | |
openstackgerrit | Felix Edel proposed zuul/zuul master: Move fingergw config to fingergw https://review.opendev.org/c/zuul/zuul/+/664949 | 14:31 |
openstackgerrit | Felix Edel proposed zuul/zuul master: Route streams to different zones via finger gateway https://review.opendev.org/c/zuul/zuul/+/664965 | 14:31 |
openstackgerrit | Felix Edel proposed zuul/zuul master: Support ssl encrypted fingergw https://review.opendev.org/c/zuul/zuul/+/664950 | 14:31 |
openstackgerrit | Felix Edel proposed zuul/zuul master: Perform per tenant locking in getStatus https://review.opendev.org/c/zuul/zuul/+/772695 | 14:48 |
*** saneax has joined #zuul | 14:49 | |
*** zbr1 has joined #zuul | 14:54 | |
*** avass_ is now known as avass | 14:56 | |
*** zbr has quit IRC | 14:56 | |
*** zbr1 is now known as zbr | 14:56 | |
*** jfoufas1 has joined #zuul | 15:28 | |
*** saneax has quit IRC | 15:35 | |
*** ykarel has quit IRC | 15:38 | |
*** zbr3 has joined #zuul | 15:50 | |
*** zbr has quit IRC | 15:52 | |
*** zbr3 is now known as zbr | 15:52 | |
openstackgerrit | Dong Zhang proposed zuul/zuul master: Display branch of queue in status page https://review.opendev.org/c/zuul/zuul/+/777613 | 15:53 |
openstackgerrit | Dong Zhang proposed zuul/zuul master: Display branch of queue in status page https://review.opendev.org/c/zuul/zuul/+/777613 | 15:57 |
openstackgerrit | Dong Zhang proposed zuul/zuul master: Display branch of queue in status page https://review.opendev.org/c/zuul/zuul/+/777613 | 15:58 |
openstackgerrit | Dong Zhang proposed zuul/zuul master: Display branch of queue in status page https://review.opendev.org/c/zuul/zuul/+/777613 | 15:59 |
*** jfoufas1 has quit IRC | 16:12 | |
*** zbr9 has joined #zuul | 16:13 | |
*** zbr has quit IRC | 16:15 | |
*** zbr9 is now known as zbr | 16:15 | |
*** nils has quit IRC | 16:20 | |
*** nils has joined #zuul | 16:26 | |
*** zbr3 has joined #zuul | 16:39 | |
*** zbr has quit IRC | 16:41 | |
*** zbr3 is now known as zbr | 16:41 | |
*** jpena is now known as jpena|off | 16:54 | |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Spec: external permissions for the REST admin API https://review.opendev.org/c/zuul/zuul/+/777629 | 17:04 |
*** zbr7 has joined #zuul | 17:17 | |
*** zbr has quit IRC | 17:19 | |
*** zbr7 is now known as zbr | 17:19 | |
*** ikhan has joined #zuul | 17:32 | |
*** rpittau is now known as rpittau|afk | 17:34 | |
*** zbr3 has joined #zuul | 17:38 | |
*** zbr has quit IRC | 17:40 | |
*** zbr3 is now known as zbr | 17:40 | |
*** zbr5 has joined #zuul | 17:42 | |
avass | fungi: I think we'll be upgrading very soon if you haven't had any issues yet | 17:44 |
*** zbr has quit IRC | 17:44 | |
*** zbr5 is now known as zbr | 17:44 | |
fungi | none related to latest zuul/nodepool, as far as i'm aware | 17:46 |
avass | good :) | 17:47 |
fungi | we've been restarting frequently on latest master branch builds though, so if we did run into problems we already addressed them earlier i suppose | 17:48 |
clarkb | the only thing we ran into was the ready locked nodes not being marked fulfilled in nodesets, but tobiash reports they observed that too previously (so unlikely to be a v4 issue) | 17:51 |
openstackgerrit | Albin Vass proposed zuul/zuul master: Remove sqlreporter from quickstart pipeline definitions https://review.opendev.org/c/zuul/zuul/+/777638 | 17:51 |
clarkb | and we solved that by restarting the launcher | 17:51 |
clarkb | the workaround at least is easy | 17:51 |
tobiash | We have had ready-locked nodes for ages | 17:51 |
avass | those ^ are supposed to be removed now right? | 17:51 |
fungi | avass: yes, they'll generate deprecation warnings in the log | 17:52 |
avass | I think we've encountered that a couple of times as well | 17:52 |
avass | we had one locked static node this week that could be what you're talking about | 17:54 |
avass | fungi: just making sure since the example pipelines hadn't been updated :) | 17:55 |
clarkb | related to that one thing I noticed was that we don't log much when we skip in the noderequest poll | 17:55 |
clarkb | maybe we should add logging to the various skip reasons to try and identify why that situation happens (I suspect it must be the poll that causes it but not sure of that) | 17:55 |
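The per-skip-reason logging clarkb describes could look like this minimal sketch — a hypothetical poll helper, not nodepool's actual handler code; the request fields (`launches_done`, `paused`) are illustrative. Each early-return path logs why the request was skipped, so a request that never completes leaves a trail in the logs:

```python
import logging

log = logging.getLogger("nodepool.poll")

def poll_request(request):
    """Poll one node request; log the reason for every skip (sketch)."""
    if not request.get("launches_done"):
        log.debug("Request %s waiting for launches to complete",
                  request["id"])
        return False
    if request.get("paused"):
        log.debug("Request %s is paused", request["id"])
        return False
    log.debug("Request %s fulfilled", request["id"])
    return True
```

As the later messages note, the "waiting for launches" branch fires on every poll cycle, so that particular line can dominate the log on a busy system.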
*** jcapitao has quit IRC | 17:55 | |
clarkb | I'll look at that again while I am thinking about it | 17:56 |
clarkb | hrm it might be very noisy though | 17:57 |
avass | oh those aren't used by quickstart. there's another zuul-conf directory that is used. | 17:57 |
openstackgerrit | Albin Vass proposed zuul/zuul master: Remove sqlreporter from example pipeline definitions https://review.opendev.org/c/zuul/zuul/+/777638 | 17:58 |
*** nils has quit IRC | 17:59 | |
avass | however for static nodes it's easy to just delete the node in zookeeper. I'm not sure if that's the correct way to solve it however | 17:59 |
avass | I guess for dynamic nodes they would be leaked? | 18:00 |
openstackgerrit | Clark Boylan proposed zuul/nodepool master: Add logging to noderequest handler polls https://review.opendev.org/c/zuul/nodepool/+/777641 | 18:01 |
clarkb | avass: when you restart the launcher the locks are removed which allows the newly started replacement launcher to reprocess those nodes and assign them to jobs | 18:01 |
avass | oh that sounds a bit better | 18:03 |
*** wuchunyang has joined #zuul | 18:03 | |
clarkb | you can probably manually remove the locks too | 18:04 |
clarkb | for the same effect | 18:04 |
clarkb | but not sure, haven't tested that option | 18:04 |
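Manually clearing a stuck node lock, as clarkb speculates here, might look like the following sketch. The znode layout assumes nodepool's default `/nodepool/nodes/<id>/lock` path; the ZooKeeper host and node id are placeholders, and this is untested against a live deployment:

```python
def node_lock_path(node_id, root="/nodepool/nodes"):
    """Lock znode path for a node, assuming nodepool's default layout."""
    return "%s/%s/lock" % (root, node_id)

if __name__ == "__main__":
    # Requires kazoo and a reachable ZooKeeper; host and id are examples.
    from kazoo.client import KazooClient
    zk = KazooClient(hosts="localhost:2181")
    zk.start()
    path = node_lock_path("0000012345")
    # Deleting every contender znode under the lock releases it.
    for contender in zk.get_children(path):
        zk.delete(path + "/" + contender)
    zk.stop()
```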
avass | I _think_ I tried restarting nodepool for static nodes one time that happened and it didn't help but I might be misremembering | 18:04 |
*** wuchunyang has quit IRC | 18:07 | |
*** wuchunyang has joined #zuul | 18:41 | |
*** saneax has joined #zuul | 18:45 | |
*** wuchunyang has quit IRC | 18:45 | |
*** harrymichal has quit IRC | 18:53 | |
*** jangutter_ has joined #zuul | 19:03 | |
*** jangutter has quit IRC | 19:06 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: cabal-test: add install_args role var https://review.opendev.org/c/zuul/zuul-jobs/+/777653 | 19:17 |
openstackgerrit | Dong Zhang proposed zuul/zuul master: Display branch of queue in status page https://review.opendev.org/c/zuul/zuul/+/777613 | 19:21 |
*** tjgresha has joined #zuul | 19:33 | |
tobiash | clarkb: in our case the scheduler holds the lock of the leaked nodes | 19:57 |
*** saneax has quit IRC | 20:02 | |
*** hamalq has joined #zuul | 20:02 | |
clarkb | in https://review.opendev.org/c/zuul/nodepool/+/777641 almost 10% of the log lines in the functional openstack job are "Node request waiting for launches to complete" which that change adds | 20:24 |
clarkb | definitely chatty, would be curious to hear if others think that is a problem though | 20:24 |
*** ajitha has quit IRC | 20:25 | |
avass | tobiash: right that was the reason | 20:35 |
*** tflink_ has joined #zuul | 20:35 | |
*** logan- has quit IRC | 20:35 | |
*** icey_ has joined #zuul | 20:36 | |
*** decimuscorvinus_ has joined #zuul | 20:36 | |
*** mugsie_ has joined #zuul | 20:36 | |
tobiash | clarkb: since that's the 'not yet ready' within a poll I think that can create problems for busy systems | 20:36 |
*** persia_ has joined #zuul | 20:36 | |
*** Tahvok_ has joined #zuul | 20:36 | |
avass | clarkb: would that become very noisy with a lot of requests? | 20:36 |
*** Tahvok has quit IRC | 20:36 | |
*** zbr has quit IRC | 20:36 | |
*** fdegir has quit IRC | 20:36 | |
*** decimuscorvinus has quit IRC | 20:36 | |
*** icey has quit IRC | 20:36 | |
*** EmilienM has quit IRC | 20:36 | |
*** mvadkert has quit IRC | 20:36 | |
*** persia has quit IRC | 20:36 | |
*** melwitt has quit IRC | 20:36 | |
*** mugsie has quit IRC | 20:36 | |
*** swest_ has joined #zuul | 20:36 | |
*** Tahvok_ is now known as Tahvok | 20:36 | |
*** melwitt has joined #zuul | 20:36 | |
*** corvus has quit IRC | 20:36 | |
tobiash | clarkb: however I think you could do the debugging for that without this message as well. | 20:37 |
*** tjgresha has quit IRC | 20:37 | |
tobiash | clarkb: what I'm wondering is, do you see any exception trace with the poll function in it? | 20:37 |
*** tosky_ has joined #zuul | 20:37 | |
*** jonass_ has joined #zuul | 20:37 | |
tobiash | I guess that might be also a way to get into that state | 20:37 |
tobiash | clarkb: and to double check, you're sure that your nodes are locked by nodepool? | 20:38 |
tobiash | in this case we have two leaks | 20:38 |
tobiash | one in the scheduler and one in nodepool | 20:38 |
*** evrardjp_ has joined #zuul | 20:38 | |
*** jonass has quit IRC | 20:38 | |
*** parallax has quit IRC | 20:38 | |
*** stevthedev has quit IRC | 20:38 | |
*** dry has joined #zuul | 20:39 | |
*** mordred has quit IRC | 20:39 | |
*** stevthedev has joined #zuul | 20:39 | |
*** mmedvede has quit IRC | 20:40 | |
*** parallax has joined #zuul | 20:40 | |
*** Eighth_Doctor has quit IRC | 20:40 | |
*** tosky has quit IRC | 20:40 | |
*** noonedeadpunk has quit IRC | 20:40 | |
*** ricolin has quit IRC | 20:40 | |
*** evrardjp has quit IRC | 20:40 | |
*** guillaumec has quit IRC | 20:40 | |
*** jpenag has joined #zuul | 20:40 | |
*** msuszko has quit IRC | 20:40 | |
*** logan- has joined #zuul | 20:40 | |
*** mnasiadka has quit IRC | 20:41 | |
*** mmedvede has joined #zuul | 20:41 | |
avass | tobiash: that also means that the nodepool leak could be recent | 20:41 |
*** tosky_ is now known as tosky | 20:42 | |
*** masterpe has quit IRC | 20:42 | |
*** ironfoot has quit IRC | 20:42 | |
*** jpena|off has quit IRC | 20:42 | |
*** tflink has quit IRC | 20:42 | |
*** mnasiadka has joined #zuul | 20:42 | |
*** swest has quit IRC | 20:43 | |
clarkb | tobiash: I'm like 98% sure that nodepool had the lock because the noderequest state was not fulfilled. But I looked around in zk directly and couldn't figure out how to determine what actually owned the lock. The lock had a uuid associated with it but I couldn't find a way to map that uuid to anything | 20:44 |
*** corvus has joined #zuul | 20:44 | |
tobiash | clarkb: and the noderequest was for one node? | 20:44 |
clarkb | tobiash: the node lock should only be taken by zuul once the node request is fulfilled if I read the handshaking there properly | 20:44 |
clarkb | tobiash: correct it was a single node node request | 20:44 |
*** ironfoot has joined #zuul | 20:45 | |
tobiash | clarkb: do you have logs filtered for request id and/or node id? | 20:45 |
clarkb | checking for tracebacks with the poll function in them is a good idea. I don't think I did that when this happened. I did take thread stack dumps before restarting but those all looked normal to me (the provider handler thread was still running) | 20:45 |
clarkb | tobiash: I did, they ended with the node going ready, there was nothing for either node id or request id after that | 20:46 |
tobiash | clarkb: do you still have those logs? I'd be interested in them, maybe there is something missing. | 20:49 |
tobiash | clarkb: on exception in poll it should re-add it to the active list and poll them again | 20:50 |
tobiash | that node request should be then recurring in the 'Active requests' line | 20:51 |
tobiash | if that stops recurring there it should be either finished or failed | 20:51 |
clarkb | tobiash: I think we do have them let me look | 20:56 |
tobiash | clarkb: also you should have either 'Error unlocking node' or 'Unlocked node <number>' in the logs, the former unfortunately without node id but with exception trace | 20:57 |
*** mordred has joined #zuul | 20:58 | |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Mention node id when unlock failed https://review.opendev.org/c/zuul/nodepool/+/777678 | 21:00 |
tobiash | clarkb: this should aid correlating node and unlocking exception ^ | 21:01 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Mention node id when unlock failed https://review.opendev.org/c/zuul/nodepool/+/777678 | 21:02 |
clarkb | thanks, I've got the node request and node id logs sorted out now. Looking for unlock failures and will put a paste together | 21:02 |
clarkb | tobiash: http://paste.openstack.org/show/OTea9DPKielkU8wVO1XF/ | 21:05 |
tobiash | clarkb: that is really weird | 21:08 |
tobiash | that means that we got out with a result of True of the poll method | 21:08 |
clarkb | or the poll method was always short circuiting | 21:09 |
clarkb | which was why I wanted to add that extra logging | 21:09 |
clarkb | oh except the node request is removed from the active requests list | 21:10 |
clarkb | does that imply poll returned true? | 21:10 |
tobiash | False and exception trigger reenqueue to active handlers | 21:10 |
tobiash | so poll can only get lost if it returned true | 21:10 |
*** guillaumec has joined #zuul | 21:10 | |
clarkb | and the request not being logged as active shows us it did that | 21:10 |
clarkb | that is weird | 21:10 |
tobiash | see _removeCompletedHandlers | 21:11 |
tobiash | you presumably also got no zk session loss? | 21:12 |
tobiash | clarkb: what's also weird is that you should have 'Removing request handler' in the logs | 21:13 |
avass | maybe that kze.SessionExpiredError should be logged | 21:14 |
clarkb | aha 2021-02-22 22:59:25,186 WARNING kazoo.client: Connection dropped: socket connection error: Connection reset by peer | 21:14 |
clarkb | 2021-02-22 22:59:25,442 INFO kazoo.client: Zookeeper connection established, state: CONNECTED | 21:15 |
clarkb | we were disconnected for 300ms but maybe that was long enough to cause problems for those nodes | 21:16 |
tobiash | clarkb: we do log it in: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/zk.py#L913 | 21:17 |
tobiash | if you check for 'ZooKeeper connection: LOST' you see the lost session events | 21:17 |
clarkb | ya 2021-02-22 22:59:25,187 DEBUG nodepool.zk.ZooKeeper: ZooKeeper connection: SUSPENDED | 21:17 |
clarkb | then 2021-02-22 22:59:25,442 DEBUG nodepool.zk.ZooKeeper: ZooKeeper connection: CONNECTED | 21:17 |
clarkb | that is likely the underlying cause, but doesn't quite tell us why that was not recoverable | 21:18 |
tobiash | clarkb: I think we should get this in and default to a higher session timeout: https://review.opendev.org/c/zuul/nodepool/+/752022 | 21:19 |
tobiash | clarkb: so the session was not lost but suspended which likely means that some zk store might have failed but we didn't lose the lock | 21:19 |
tobiash | we lose the lock only on session loss | 21:20 |
clarkb | and if that had happened we would've avoided this | 21:20 |
clarkb | tobiash: though your change would presumably make this problem worse? | 21:20 |
clarkb | because we'd lose the session, but not the lock and result in this problem? | 21:20 |
*** Eighth_Doctor has joined #zuul | 21:21 | |
tobiash | clarkb: well a busy nodepool is sometimes too late with the heartbeat thread which makes a longer session timeout required | 21:21 |
clarkb | I can see how this might make one problem better but another worse :) | 21:22 |
tobiash | ok, well that fixes a different issue ;) | 21:22 |
clarkb | I'm ok with that too, if we find this issue I observed becomes worse we'll just debug it faster :) | 21:22 |
tobiash | do you have any exception in that time range? | 21:22 |
avass | could there be a similar issue with the scheduler not releasing locks in that case? | 21:22 |
tobiash | some are logged without node id | 21:22 |
clarkb | tobiash: http://paste.openstack.org/show/803023/ that maybe | 21:25 |
clarkb | its the same provider and it fails in _assignHandlers ? | 21:25 |
clarkb | that happens after 2021-02-22 23:00:11,102 DEBUG nodepool.zk.ZooKeeper: ZooKeeper connection: SUSPENDED and before 2021-02-22 23:00:12,052 DEBUG nodepool.zk.ZooKeeper: ZooKeeper connection: CONNECTED | 21:26 |
tobiash | is that one trace? | 21:26 |
tobiash | looks weird | 21:26 |
clarkb | yes I believe it is one trace, but I did use grep -A -B so let me double check | 21:26 |
clarkb | yes one trace unless the log is interleaved | 21:27 |
clarkb | (which I suppose it could be) | 21:27 |
tobiash | no idea what createMinReady should have to do with _assignHandlers | 21:28 |
clarkb | 2021-02-22 23:00:11,109 ERROR nodepool.PoolWorker.rax-iad-main: Error in PoolWorker also fails just above it and has basically the same trace | 21:28 |
clarkb | so I don't think they are interleaved | 21:28 |
tobiash | anyway, nodepool is not good at handling connection errors, there is much room to improve there | 21:29 |
tobiash | as far as I've understood I think we need to handle the kazoo.exceptions.ConnectionLoss exception everywhere in the zk class and only reraise if the session has been lost and until then keep retrying | 21:30 |
tobiash | with this connectionloss the kazoo state turns into suspended | 21:31 |
clarkb | that makes sense | 21:31 |
clarkb | since before the session is lost our locks are still valid | 21:31 |
clarkb | it's only once the lock goes away that nodepool proper will care | 21:31 |
tobiash | correct | 21:32 |
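The policy tobiash outlines — on a connection loss, keep retrying the ZooKeeper operation while the session is merely suspended, and only give up once the session is truly lost (because ephemeral locks survive a suspension but not a session loss) — can be sketched in pure Python. The exception classes here are stand-ins, not kazoo's real client:

```python
import time

class ConnectionLoss(Exception):
    """Stands in for kazoo.exceptions.ConnectionLoss."""

class SessionLost(Exception):
    """Raised once the ZooKeeper session is truly gone."""

def retry_while_suspended(op, session_lost, delay=0.1, max_tries=50):
    """Run op(); on ConnectionLoss, retry unless the session was lost."""
    for _ in range(max_tries):
        try:
            return op()
        except ConnectionLoss:
            if session_lost():
                # Locks are gone with the session; re-raising is correct.
                raise SessionLost("session expired; locks are gone")
            time.sleep(delay)
    raise SessionLost("gave up waiting for reconnect")
```

In real kazoo code, `session_lost()` would be derived from a connection state listener that watches for `KazooState.LOST` versus `KazooState.SUSPENDED`.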
tobiash | corvus: I think we need to take this into consideration in zuul as well | 21:32 |
tobiash | connection state handling will become more important there as well | 21:33 |
tobiash | I guess getting this right will be non-trivial | 21:34 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Log error message on request handler removal on session loss https://review.opendev.org/c/zuul/nodepool/+/777689 | 21:37 |
tobiash | clarkb: maybe it went this route if there was a misleading exception in kazoo ^ | 21:38 |
clarkb | tobiash: can we log the request id in that log? | 21:39 |
clarkb | r.request.id is the value I think | 21:39 |
tobiash | that's part of the logger already | 21:39 |
clarkb | oh wait ya I see that now | 21:39 |
*** rlandy has quit IRC | 21:54 | |
clarkb | is `zuul-executor graceful` not documented intentionally? | 22:02 |
clarkb | I want to try it out but only if it is expected to work :) | 22:02 |
tobiash | we use it in production | 22:02 |
clarkb | cool I guess if it works for me I'll update the docs too | 22:02 |
tobiash | I guess that was just forgotten | 22:03 |
tobiash | or let me rephrase, I forgot ;) | 22:04 |
tobiash | clarkb: there is one glitch tho, if the executor is already paused or governed | 22:05 |
clarkb | governed based on the disk/cpu/memory governors? | 22:06 |
clarkb | anyway it seems to be working, it is currently waiting on ~39 jobs to complete | 22:06 |
tobiash | then it stays paused and won't terminate because of an uncaught "unable to unregister unregistered function" error | 22:06 |
clarkb | ah ok, so if it goes to 0 jobs to complete but doesn't exit I can probably safely stop it at that point | 22:07 |
clarkb | hrm I think I just discovered a flaw in this though. We tell docker-compose to restart-always in our zuul-executor config | 22:08 |
clarkb | so it will exit and then restart :/ | 22:08 |
clarkb | I guess what I really need is pause, wait for it to do no work then docker-compose down. /me is learning | 22:08 |
tobiash | oh, that's unfortunate ;) | 22:12 |
tobiash | btw, this is the exception when it's already paused: http://paste.openstack.org/show/803026/ | 22:12 |
tobiash | so double pausing throws an exception | 22:13 |
*** harrymichal has joined #zuul | 22:14 | |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Catch exception when double unregistering merge jobs https://review.opendev.org/c/zuul/zuul/+/777694 | 22:19 |
tobiash | clarkb: this should fix this corner case ^ | 22:19 |
clarkb | thanks | 22:20 |
tobiash | clarkb: can you tell docker-compose to auto-restart only on result !=0? | 22:20 |
tobiash | then you could still use graceful | 22:20 |
clarkb | good question | 22:21 |
tobiash | clarkb: at least pure docker can do that: https://docs.docker.com/config/containers/start-containers-automatically/#use-a-restart-policy | 22:21 |
clarkb | restart: on-failure and restart: unless-stopped might be good options | 22:22 |
tobiash | clarkb: restart: on-failure is probably what you want | 22:22 |
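The compose change being discussed might look like this fragment (service name and image are illustrative, not opendev's actual config): `restart: on-failure` restarts the container only when it exits non-zero, so a clean exit from `zuul-executor graceful` stays down instead of being restarted.

```yaml
# Illustrative docker-compose service; names and image are placeholders.
services:
  executor:
    image: zuul/zuul-executor:latest
    restart: on-failure   # restart only when the process exits non-zero
```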
corvus | on-failure looks good -- also, it's worth checking whether we need a restart in order to have these start on boot. maybe we want to remove restart altogether? | 22:27 |
clarkb | I think start on boot is related to whether or not the last state was an up or a down | 22:29 |
clarkb | we don't auto start on first deployment here because nothing has up'd the containers | 22:29 |
clarkb | but once they are up'd they should start on boot if I understand this correctly | 22:30 |
clarkb | so ya we can probably remove restart altogether | 22:30 |
*** harrymichal has quit IRC | 22:41 | |
openstackgerrit | Clark Boylan proposed zuul/zuul master: Document zuul-executor graceful https://review.opendev.org/c/zuul/zuul/+/777699 | 22:59 |
corvus | 777689 failed with interesting errors if anyone feels like more nodepool debugging: https://zuul.opendev.org/t/zuul/build/b3492e29b42e4adeb480cdf177790327 | 23:11 |
corvus | i suspect the failure is transient, but it still looks like a bug | 23:11 |
corvus | clarkb: super nit on the docs patch, but worth it i think (sorry) | 23:15 |
openstackgerrit | Clark Boylan proposed zuul/zuul master: Document zuul-executor graceful https://review.opendev.org/c/zuul/zuul/+/777699 | 23:17 |
clarkb | no problem | 23:17 |
fungi | clarkb: we've mostly not used graceful because we're in a hurry to restart and builds which get aborted by an abrupt stop will be retried (unless unlucky and already on their third strike) | 23:54 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!