Thursday, 2023-08-24

-@gerrit:opendev.org- Zuul merged on behalf of Tobias Henkel: [zuul/zuul] 761520: Only report dequeue if we have reported start https://review.opendev.org/c/zuul/zuul/+/76152006:31
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 891556: Use the GitHub default branch as the default branch https://review.opendev.org/c/zuul/zuul/+/89155606:49
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 891639: Add default branch support to the Gerrit driver https://review.opendev.org/c/zuul/zuul/+/89163906:53
-@gerrit:opendev.org- Zuul merged on behalf of Tristan Cacqueray: [zuul/nodepool] 892573: Add tenant and label name to Launch failed error https://review.opendev.org/c/zuul/nodepool/+/89257307:31
-@gerrit:opendev.org- Zuul merged on behalf of Benedikt Löffler: [zuul/nodepool] 890401: Test if username is set for diskimage based nodes in AWS https://review.opendev.org/c/zuul/nodepool/+/89040108:54
@dpawlik:matrix.orghey, could someone take a look at https://review.opendev.org/c/zuul/zuul/+/889314 ? Thanks12:39
@jpew:matrix.orgHas anyone experienced nodepool keyscan timeout errors on OpenStack when it's at capacity before? I can't figure out what the problem might be :/14:28
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 892682: WIP: Add failing config cache cleanup test. https://review.opendev.org/c/zuul/zuul/+/89268214:37
@fungicide:matrix.org> <@jpew:matrix.org> Has anyone experienced nodepool keyscan timeout errors on OpenStack when it's at capacity before? I can't figure out what the problem might be :/14:41
the possibilities which immediately come to mind: nodes not booting quickly enough, nodes not able to configure their network quickly enough (overloaded metadata service? consider booting with configdrive enabled), overloaded network channels leading to significant packet loss...
@jpew:matrix.orgfungi: Ya, I tried with 10 minute boot-timeout (up from 5) and it still failed, so... I don't _think_ it's that the nodes are too slow14:42
@fungicide:matrix.org> <@jpew:matrix.org> fungi: Ya, I tried with 10 minute boot-timeout (up from 5) and it still failed, so... I don't _think_ it's that the nodes are too slow14:56
are you sure the network is coming up on those nodes? do their addresses respond to ping?
@fungicide:matrix.orgalso if you recycle ip addresses too quickly, i think you may end up with problems allocating virtual switchports, stale host routes, and related network problems depending on the architecture14:57
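
A quick way to answer the "is the network coming up" question above, independent of nodepool, is to probe the SSH port from the launcher host while the node still exists. This is only a minimal sketch: the function name `ssh_port_reachable` and the address in the usage comment are illustrative assumptions, and a successful TCP connect does not prove that a key scan would succeed.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for manual debugging; not part of nodepool.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Example usage (203.0.113.10 is a documentation address, replace it):
# print(ssh_port_reachable("203.0.113.10"))
```

If this returns False while the cloud reports the server as active, the problem is more likely in the node's network setup or the cloud's network layer than in boot speed.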
@jpew:matrix.orgfungi: Ya, hard to tell if the nodes are actually accessible on the network or not because nodepool deletes them right away.... is there a way to make it keep them around for a bit?15:11
@clarkb:matrix.orgyou might need to set a long time out and then debug in that window.15:11
@clarkb:matrix.orgIf they are openstack nodes I think nodepool can request their console log from the api which can help you determine if the nodes set up networking properly (doesn't say anything about the cloud side though just the node)15:12
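
While a node is still around, the console log can also be pulled by hand. The sketch below assumes `conn` is an openstacksdk `Connection` (for example from `openstack.connect(cloud="mycloud")`, where the cloud name is a placeholder) and uses its `get_server_console` call; the helper name `fetch_console_log` and the `max_lines` parameter are illustrative and do not reflect how nodepool does this internally.

```python
from typing import List


def fetch_console_log(conn, server: str, max_lines: int = 100) -> List[str]:
    """Return the last max_lines lines of a server's console log.

    Assumption: conn offers get_server_console(server), as the openstacksdk
    cloud layer does, and returns the log as one string.
    """
    output = conn.get_server_console(server) or ""
    # Keep only the tail, which usually shows the network and cloud-init state.
    return output.splitlines()[-max_lines:]


# Example usage (requires openstacksdk and a configured clouds.yaml):
# import openstack
# conn = openstack.connect(cloud="mycloud")
# for line in fetch_console_log(conn, "server-name-or-id"):
#     print(line)
```

As noted above, this only shows what the node itself logged; it says nothing about the cloud side.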
@jpew:matrix.orgClark: Ah, ok. I see that option, I will enable it15:18
@jpew:matrix.orgHmm, it is enabled.... maybe I just don't know where to look for them15:19
@jpew:matrix.org... I don't see anywhere in the nodepool code where console-log does anything?15:21
@clarkb:matrix.orghrm I wonder if that got lost in the statemachine refactor15:25
@jpew:matrix.orgLooks that way15:25
@clarkb:matrix.orgya I think it did. fyi corvus 15:26
@jim:acmegating.comyep.  looks like that was added without a test.15:27
@jim:acmegating.comi can add it back on the basis that it wasn't properly deprecated, but tbh, i'm not sure i think it's actually a good idea.  it's the only driver that has ever attempted to do something like that, and it might be better to take the opportunity to keep things simple.  if we want something like that, we need someone to add it to other drivers and add tests.15:36
@jpew:matrix.orgThat's.... disappointing. I think I could add it for OpenStack, but I really don't think I can swing the time for all other drivers also :(15:44
@fungicide:matrix.orgmaybe a hook framework might make more sense? "on node creation failure, call this entrypoint"15:46
@jpew:matrix.orgRight, so it could be added to the others later15:46
@fungicide:matrix.orgthen anyone can just plug in the script they want called and redirected to a log or whatever15:46
@fungicide:matrix.orgbecause the things you might want to run for diagnostics could vary by environment anyway, beyond simply what driver you're using15:48
@jim:acmegating.comsure, i mean, that's the only way to actually do it now, because we *are* unifying the drivers15:48
@jim:acmegating.combut what i'm saying is that this code went in when we basically had no standards, and it shows15:48
@jim:acmegating.comit went in without a test, without any kind of framework to support other drivers, etc15:48
@jim:acmegating.comand we're trying to get away from that15:48
@jim:acmegating.comall of the cloud drivers (and almost all of the drivers) at this point use the same internal mechanics to do their work.  we're standardizing it, and all of the drivers are benefiting.15:49
@jim:acmegating.comanyway, i'm almost done with the patch15:49
@fungicide:matrix.orgsure, i was just saying if we want to take the opportunity to drop it instead, then rather than readding that feature we could add a hook framework and let operators supply their own diagnostic scripts (which could be a script to grab a nova console log, or just about anything else)15:52
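
The hook idea described above ("on node creation failure, call this entrypoint") could look roughly like the following. It is purely a sketch of the concept: no such hook exists in nodepool, and the function name `run_failure_hook` and the `NODE_` environment variable prefix are invented for illustration.

```python
import os
import subprocess
from typing import Dict, List, Optional


def run_failure_hook(command: Optional[List[str]],
                     context: Dict[str, str],
                     timeout: float = 60.0) -> Optional[str]:
    """Run an operator-supplied diagnostic command after a launch failure.

    Node details are passed as NODE_<KEY> environment variables (a made-up
    convention). Returns the combined output, or None if no hook is set.
    """
    if not command:
        return None
    env = dict(os.environ)
    env.update({"NODE_" + key.upper(): value for key, value in context.items()})
    try:
        result = subprocess.run(command, env=env, capture_output=True,
                                text=True, timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A broken hook must never mask the original launch failure.
        return "hook failed: {}".format(exc)
    return result.stdout + result.stderr


# Example usage with a hypothetical operator script:
# log = run_failure_hook(["/usr/local/bin/grab-console"],
#                        {"id": "0001", "provider": "mycloud"})
```

Keeping the hook as an external command means the diagnostics can differ per environment without any driver-specific code, which is the point made in the discussion.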
@jim:acmegating.comoh i get ya15:53
@fungicide:matrix.organd even if it's being patched back to working condition for now, we can deprecate it on the grounds that it's not general15:53
@jim:acmegating.comyeah, that could be an improvement15:54
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 892690: Restore openstack console log https://review.opendev.org/c/zuul/nodepool/+/89269015:57
@jim:acmegating.comthat should add it back.  but there is no test, so i don't know if it works now, and i don't know if it's worked at any point after 2017.16:00
@jpew:matrix.orgI'll test it16:00
@fungicide:matrix.orgit worked at some points after 2017 because we did use it in opendev from time to time to diagnose node creation problems, but when and how recently i can't recall16:05
@fungicide:matrix.orgobviously not more recently than when the refactoring happened16:05
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 892696: Use bookworm container images https://review.opendev.org/c/zuul/zuul/+/89269616:55
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 892697: Use bookworm container images https://review.opendev.org/c/zuul/nodepool/+/89269717:33
-@gerrit:opendev.org- Tristan Cacqueray https://matrix.to/#/@tristanc_:matrix.org proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-operator] 881245: Publish container images to quay.io https://review.opendev.org/c/zuul/zuul-operator/+/88124518:39
-@gerrit:opendev.org- Tristan Cacqueray https://matrix.to/#/@tristanc_:matrix.org proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-operator] 881245: Publish container images to quay.io https://review.opendev.org/c/zuul/zuul-operator/+/88124520:12
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 892716: Add pipeline queue stats https://review.opendev.org/c/zuul/zuul/+/89271620:36
-@gerrit:opendev.org- Joshua Watt proposed: [zuul/nodepool] 892719: tests: Add test console logging on keyscan error https://review.opendev.org/c/zuul/nodepool/+/89271922:49
-@gerrit:opendev.org- Joshua Watt proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 892690: Restore openstack console log https://review.opendev.org/c/zuul/nodepool/+/89269022:49
@jpew:matrix.org^^ Now with tests :)22:49
-@gerrit:opendev.org- Joshua Watt proposed: [zuul/nodepool] 892719: tests: Add test for console logging on keyscan error https://review.opendev.org/c/zuul/nodepool/+/89271922:49
@jim:acmegating.comjpew: thanks!  +2 with a suggestion for a smidge more coverage22:57
