-@gerrit:opendev.org- Zuul merged on behalf of Tobias Henkel: [zuul/zuul] 761520: Only report dequeue if we have reported start https://review.opendev.org/c/zuul/zuul/+/761520 | 06:31 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 891556: Use the GitHub default branch as the default branch https://review.opendev.org/c/zuul/zuul/+/891556 | 06:49 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 891639: Add default branch support to the Gerrit driver https://review.opendev.org/c/zuul/zuul/+/891639 | 06:53 | |
-@gerrit:opendev.org- Zuul merged on behalf of Tristan Cacqueray: [zuul/nodepool] 892573: Add tenant and label name to Launch failed error https://review.opendev.org/c/zuul/nodepool/+/892573 | 07:31 | |
-@gerrit:opendev.org- Zuul merged on behalf of Benedikt Löffler: [zuul/nodepool] 890401: Test if username is set for diskimage based nodes in AWS https://review.opendev.org/c/zuul/nodepool/+/890401 | 08:54 | |
@dpawlik:matrix.org | hey, could someone take a look at https://review.opendev.org/c/zuul/zuul/+/889314 ? Thanks | 12:39 |
---|---|---|
@jpew:matrix.org | Has anyone experienced nodepool keyscan timeout errors on OpenStack when it's at capacity before? I can't figure out what the problem might be :/ | 14:28 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 892682: WIP: Add failing config cache cleanup test. https://review.opendev.org/c/zuul/zuul/+/892682 | 14:37 | |
@fungicide:matrix.org | > <@jpew:matrix.org> Has anyone experienced nodepool keyscan timeout errors on OpenStack when it's at capacity before? I can't figure out what the problem might be :/ | 14:41 |
the possibilities which immediately come to mind: nodes not booting quickly enough, nodes not able to configure their network quickly enough (overloaded metadata service? consider booting with configdrive enabled), overloaded network channels leading to significant packet loss... | ||
@jpew:matrix.org | fungi: Ya, I tried with 10 minute boot-timeout (up from 5) and it still failed, so... I don't _think_ it's that the nodes are too slow | 14:42 |
@fungicide:matrix.org | > <@jpew:matrix.org> fungi: Ya, I tried with 10 minute boot-timeout (up from 5) and it still failed, so... I don't _think_ it's that the nodes are too slow | 14:56 |
are you sure the network is coming up on those nodes? do their addresses respond to ping? | ||
@fungicide:matrix.org | also if you recycle ip addresses too quickly, i think you may end up with problems allocating virtual switchports, stale host routes, and related network problems depending on the architecture | 14:57 |
@jpew:matrix.org | fungi: Ya, hard to tell if the nodes are actually accessible on the network or not because nodepool deletes them right away.... is there a way to make it keep them around for a bit? | 15:11 |
@clarkb:matrix.org | you might need to set a long time out and then debug in that window. | 15:11 |
@clarkb:matrix.org | If they are openstack nodes I think nodepool can request their console log from the api which can help you determine if the nodes set up networking properly (doesn't say anything about the cloud side though just the node) | 15:12 |
@jpew:matrix.org | Clark: Ah, ok. I see that option, I will enable it | 15:18 |
@jpew:matrix.org | Hmm, it is enabled.... maybe I just don't know where to look for them | 15:19 |
@jpew:matrix.org | ... I don't see anywhere in the nodepool code where console-log does anything? | 15:21 |
@clarkb:matrix.org | hrm I wonder if that got lost in the statemachine refactor | 15:25 |
@jpew:matrix.org | Looks that way | 15:25 |
@clarkb:matrix.org | ya I think it did. fyi corvus | 15:26 |
@jim:acmegating.com | yep. looks like that was added without a test. | 15:27 |
@jim:acmegating.com | i can add it back on the basis that it wasn't properly deprecated, but tbh, i'm not sure i think it's actually a good idea. it's the only driver that has ever attempted to do something like that, and it might be better to take the opportunity to keep things simple. if we want something like that, we need someone to add it to other drivers and add tests. | 15:36 |
@jpew:matrix.org | That's.... disappointing. I think I could add it for OpenStack, but I really don't think I can swing the time for all other drivers also :( | 15:44 |
@fungicide:matrix.org | maybe a hook framework might make more sense? "on node creation failure, call this entrypoint" | 15:46 |
@jpew:matrix.org | Right, so it could be added to the others later | 15:46 |
@fungicide:matrix.org | then anyone can just plug in the script they want called and redirected to a log or whatever | 15:46 |
@fungicide:matrix.org | because the things you might want to run for diagnostics could vary by environment anyway, beyond simply what driver you're using | 15:48 |
@jim:acmegating.com | sure, i mean, that's the only way to actually do it now, because we *are* unifying the drivers | 15:48 |
@jim:acmegating.com | but what i'm saying is that this code went in when we basically had no standards, and it shows | 15:48 |
@jim:acmegating.com | it went in without a test, without any kind of framework to support other drivers, etc | 15:48 |
@jim:acmegating.com | and we're trying to get away from that | 15:48 |
@jim:acmegating.com | all of the cloud drivers (and almost all of the drivers) at this point use the same internal mechanics to do their work. we're standardizing it, and all of the drivers are benefiting. | 15:49 |
@jim:acmegating.com | anyway, i'm almost done with the patch | 15:49 |
@fungicide:matrix.org | sure, i was just saying if we want to take the opportunity to drop it instead, then rather than re-adding that feature we could add a hook framework and let operators supply their own diagnostic scripts (which could be a script to grab a nova console log, or just about anything else) | 15:52 |
@jim:acmegating.com | oh i get ya | 15:53 |
@fungicide:matrix.org | and even if it's being patched back to working condition for now, we can deprecate it on the grounds that it's not general | 15:53 |
@jim:acmegating.com | yeah, that could be an improvement | 15:54 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 892690: Restore openstack console log https://review.opendev.org/c/zuul/nodepool/+/892690 | 15:57 | |
@jim:acmegating.com | that should add it back. but there is no test, so i don't know if it works now, and i don't know if it's worked at any point after 2017. | 16:00 |
@jpew:matrix.org | I'll test it | 16:00 |
@fungicide:matrix.org | it worked at some points after 2017 because we did use it in opendev from time to time to diagnose node creation problems, but when and how recently i can't recall | 16:05 |
@fungicide:matrix.org | obviously not more recently than when the refactoring happened | 16:05 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 892696: Use bookworm container images https://review.opendev.org/c/zuul/zuul/+/892696 | 16:55 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 892697: Use bookworm container images https://review.opendev.org/c/zuul/nodepool/+/892697 | 17:33 | |
-@gerrit:opendev.org- Tristan Cacqueray https://matrix.to/#/@tristanc_:matrix.org proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-operator] 881245: Publish container images to quay.io https://review.opendev.org/c/zuul/zuul-operator/+/881245 | 18:39 | |
-@gerrit:opendev.org- Tristan Cacqueray https://matrix.to/#/@tristanc_:matrix.org proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-operator] 881245: Publish container images to quay.io https://review.opendev.org/c/zuul/zuul-operator/+/881245 | 20:12 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 892716: Add pipeline queue stats https://review.opendev.org/c/zuul/zuul/+/892716 | 20:36 | |
-@gerrit:opendev.org- Joshua Watt proposed: [zuul/nodepool] 892719: tests: Add test console logging on keyscan error https://review.opendev.org/c/zuul/nodepool/+/892719 | 22:49 | |
-@gerrit:opendev.org- Joshua Watt proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 892690: Restore openstack console log https://review.opendev.org/c/zuul/nodepool/+/892690 | 22:49 | |
@jpew:matrix.org | ^^ Now with tests :) | 22:49 |
-@gerrit:opendev.org- Joshua Watt proposed: [zuul/nodepool] 892719: tests: Add test for console logging on keyscan error https://review.opendev.org/c/zuul/nodepool/+/892719 | 22:49 | |
@jim:acmegating.com | jpew: thanks! +2 with a suggestion for a smidge more coverage | 22:57 |
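
The hook idea floated in the discussion ("on node creation failure, call this entrypoint") could look roughly like the sketch below. This is only an illustration of the concept and not nodepool code: `LaunchFailure`, `FailureHookRegistry`, and `command_hook` are hypothetical names invented here, and nodepool exposes no such interface as far as this log shows. The sketch assumes each hook receives a small record describing the failed node and returns diagnostic text, and that a misbehaving hook should never interrupt the launcher.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LaunchFailure:
    """Hypothetical record describing a failed node launch."""
    provider: str
    label: str
    node_id: str
    address: str
    reason: str


# A hook takes the failure record and returns diagnostic text to log.
FailureHook = Callable[[LaunchFailure], str]


class FailureHookRegistry:
    """Hypothetical registry of operator-supplied diagnostic hooks."""

    def __init__(self) -> None:
        self._hooks: Dict[str, FailureHook] = {}

    def register(self, name: str, hook: FailureHook) -> None:
        self._hooks[name] = hook

    def run_all(self, failure: LaunchFailure) -> Dict[str, str]:
        """Run every hook in registration order and collect its output."""
        results: Dict[str, str] = {}
        for name, hook in self._hooks.items():
            try:
                results[name] = hook(failure)
            except Exception as exc:
                # A broken diagnostic must not take the launcher down with it.
                results[name] = f"hook raised {exc!r}"
        return results


def command_hook(argv: List[str], timeout: float = 30.0) -> FailureHook:
    """Wrap an external command as a hook.

    The node id and address are appended as the last two arguments, and the
    command's combined stdout and stderr become the diagnostic text.
    """
    def hook(failure: LaunchFailure) -> str:
        proc = subprocess.run(
            [*argv, failure.node_id, failure.address],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.stdout + proc.stderr
    return hook
```

An operator could then register something like `command_hook(["/usr/local/bin/grab-console-log"])` (a made-up script path) for OpenStack and a different script for another cloud, which matches the point made above that useful diagnostics vary by environment and not only by driver.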