opendevreview | Jay Faulkner proposed openstack/ironic master: Fix tox4 and setuptools errors https://review.opendev.org/c/openstack/ironic/+/868749 | 00:21 |
---|---|---|
JayF | rpittau: ^^ figured an edit and seeing if it got V+1 was preferable to a PR comment... thanks for looking at this | 00:22 |
JayF | rpittau: please email me directly at jay at jvf dot cc if/when some of these are ready to land and I can try to find time to unstick it | 00:22 |
JayF | family in town for like two more days then I can have more time to help review/troubleshoot these in detail | 00:22 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Fix CI https://review.opendev.org/c/openstack/ironic/+/868749 | 09:18 |
samuelkunkel[m] | Hi folks, I have a strange scenario where you may be able to give me a hint. We have deployed our second ironic infrastructure using the zed release (ipa running stream9 with release 9.2). We are facing an issue where the conductor fails to validate the cert of the agent (during cleaning, which is done as the first step) because the certificate is not yet valid. I am assuming this is due to a time difference between the conductor and the | 12:12 |
samuelkunkel[m] | agent. (stream9 defaults to timezone EST, which I fixed by unpacking the initramfs, changing the symlink and repacking it). So the time (according to timedatectl) is basically the same... | 12:12 |
samuelkunkel[m] | In our first deployment (currently running train) I did not see anything with the certificate stuff but I assume this was not present in train release | 12:13 |
iurygregory | good morning Ironic | 12:18 |
iurygregory | samuelkunkel[m], release 9.2 you mean the IPA version in use? | 12:19 |
samuelkunkel[m] | yes | 12:19 |
iurygregory | this would be for Antelope not Zed... | 12:20 |
iurygregory | zed is 9.1 | 12:20 |
samuelkunkel[m] | hmm | 12:20 |
samuelkunkel[m] | ok that's interesting | 12:20 |
samuelkunkel[m] | but I also faced the same issue using 9.0 | 12:20 |
samuelkunkel[m] | therefore I just built it from the newest sources | 12:20 |
iurygregory | gotcha | 12:20 |
samuelkunkel[m] | (and I just saw that stream 9 doubled its image size since august, very interesting) | 12:21 |
iurygregory | it can be a configuration option you need to set, do you have the exact error message you get from the logs about the cert problem? | 12:21 |
samuelkunkel[m] | let me quickly fetch the one I can see | 12:22 |
samuelkunkel[m] | Node failed to start the first cleaning step: Connection to agent failed: Failed to connect to the agent running on node 33cd3d68-b7df-44b1-9d2d-0027c2b93c79 for invoking command clean.get_clean_steps. Error: HTTPSConnectionPool(host='10.218.14.242', port=9999): Max retries exceeded with url: /v1/commands/?wait=true&agent_token=_N6wuKa-bhGvVV9l_Mp1PpNaIur0ylEvKFnIgiu5qmw (Caused by SSLError(SSLCertVerificationError(1, '[SSL: | 12:22 |
samuelkunkel[m] | CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate is not yet valid (_ssl.c:1131)'))) | 12:22 |
samuelkunkel[m] | on the ipa I can also see correlating ssl validation issues at the times the conductor tries to connect | 12:23 |
samuelkunkel[m] | My guess would be it's down to a time issue, as the cert is created on the ipa to be valid from localtime up to localtime+1 | 12:24 |
samuelkunkel[m] | So my idea was to provide ipa-ntp-server so it syncs its time to one of our ntp servers, to rule out a time issue | 12:24 |
samuelkunkel[m] | but this one fails during startup as the difference is too big (let me quickly boot up a node to check the exact error message here) | 12:26 |
iurygregory | yeah the problem can be a clock skew | 12:27 |
samuelkunkel[m] | We had that before because stream9 defaulted to EST as its standard timezone, but after adjusting the initramfs the time looks mostly good | 12:28 |
samuelkunkel[m] | (before = that the clock was not properly set) | 12:28 |
samuelkunkel[m] | but in our train deployment nothing cared | 12:29 |
samuelkunkel[m] | ok, one piece of information was wrong: I just saw that the cert created on the ipa is valid for 30 days | 12:39 |
samuelkunkel[m] | On the agent I see the error message (right after the successful heartbeat): ssl.SSLError: [SSL: SSLV3_ALERT_BAD_CERTIFICATE] sslv3 alert bad certificate (_ssl.c:2633) | 12:48 |
samuelkunkel[m] | thrown by eventlet/green/ssl.py (does a full traceback help? Not sure if I can pack one in irc) | 12:48 |
samuelkunkel[m] | This time cleaning worked but deploying failed. And every time it fails I see an issue with chrony | 12:50 |
samuelkunkel[m] | 2022-12-29 14:43:59.713 3665 ERROR ironic_python_agent.utils [-] Failed to sync time using chrony to ntp server: 10.219.208.25: Unexpected error while running command. | 12:51 |
samuelkunkel[m] | Dec 29 14:43:59 localhost.localdomain ironic-python-agent[3665]: Command: chronyd -q 'server 10.219.208.25 iburst' | 12:51 |
samuelkunkel[m] | Dec 29 14:43:59 localhost.localdomain ironic-python-agent[3665]: Exit code: 1 | 12:51 |
samuelkunkel[m] | Dec 29 14:43:59 localhost.localdomain ironic-python-agent[3665]: Stdout: '' | 12:51 |
samuelkunkel[m] | Dec 29 14:43:59 localhost.localdomain ironic-python-agent[3665]: Stderr: '2022-12-29T14:43:49Z chronyd version 4.3 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +NTS +SECHASH +IPV6 +DEBUG)\n2022-12-29T14:43:59Z No suitable source for synchronisation\n2022-12-29T14:43:59Z chronyd exiting\n': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. | 12:51 |
samuelkunkel[m] | So I guess I need to dig into why chrony fails | 12:51 |
arozman | Hi Ironic! | 13:19 |
samuelkunkel[m] | Ok it seems to be an issue related to dhcp. As my hardware starts the ipa kernel / ramdisk I seem to lose some connectivity (or basically ipa is too fast). I lose around 20 pings. If I run chrony again around 20s after the first failure, it succeeds. If I do that (or every time chrony runs successfully) all the ssl issues are gone. | 13:51 |
samuelkunkel[m] | If chrony fails I also cannot see a request on port 123 | 13:51 |
samuelkunkel[m] | so seems like the ipa is not "able" to query for ntp during that time | 13:54 |
samuelkunkel[m] | meh | 13:54 |
jssfr | checked syslog for timestamps when the DHCP lease was acquired? | 13:54 |
jssfr | also, hi samuel | 13:54 |
samuelkunkel[m] | hi! | 13:55 |
TheJulia | Have you checked switch logs to see if the port is going into a blocking state? | 14:11 |
samuelkunkel[m] | good points, going to investigate in that direction(s) | 14:14 |
TheJulia | Some switches with bonds will behave oddly when you dhcp and line carrier up at certain versions | 14:15 |
TheJulia | I know that is vague but some vendors actually say don’t try to pxe over bonds, and then things like networkmanager can impact dhcp behavior | 14:16 |
samuelkunkel[m] | I think the ipxe itself is also capable of forming lacp. If I recall correctly we have the switchports configured for bonding but we also have the "port-channels" (name for bonds on the arista switches) configured for lacp fallback to individual (which is basically "if there is no lacp also allow them to run as standalone port"). So the idea was to just leave them unbundled within ipa and once we boot into the final deployment | 14:22 |
samuelkunkel[m] | we have them bundled | 14:22 |
TheJulia | So there is the conundrum | 14:28 |
TheJulia | Ipxe sends an lacp bpdu “I’m here” message which the switch counts as an activation | 14:28 |
TheJulia | When the OS boots, and no more messages are sent, then the switch resets port states | 14:29 |
TheJulia | Supposedly a newer network manager version does try to setup bonds automatically which would navigate this… but I know no details | 14:29 |
samuelkunkel[m] | what I see, on the switchports itself we do not have spanning-tree portfast activated (only on the port-channel) | 14:29 |
samuelkunkel[m] | so once the switch falls back to individual it runs the whole spanning-tree chain | 14:29 |
samuelkunkel[m] | I have now configured my test setup to also use portfast on the individual ports, let's see. | 14:30 |
samuelkunkel[m] | Gonna have a look if there is an ipxe config flag to not speak lacp (would make this a bit easier) | 14:30 |
TheJulia | There is not, it gets compiled in | 14:31 |
TheJulia | But portfast should fix you up | 14:31 |
TheJulia | Vendor dependent of course | 14:31 |
samuelkunkel[m] | that should not be the biggest issue as we compile it from source anyway | 14:31 |
TheJulia | Yeah | 14:32 |
samuelkunkel[m] | maybe there is something in the build to adjust | 14:32 |
TheJulia | Some larger operators do | 14:32 |
TheJulia | If memory serves, there is | 14:32 |
TheJulia | We… might even have that mentioned in our docs someplace | 14:32 |
opendevreview | waleed mousa proposed openstack/ironic-python-agent master: update NVIDIA NIC firmware images and settings by ironic-python-agent https://review.opendev.org/c/openstack/ironic-python-agent/+/566544 | 15:06 |
samuelkunkel[m] | TheJulia: your guess was correct, after configuring spanning-tree portfast on the respective switchports there is no network interrupt anymore, chrony runs fine and therefore all certificates are valid | 15:55 |
samuelkunkel[m] | thanks to all of you for your help :) | 15:56 |
samuelkunkel[m] | (have tested it now with 4 different hardware types, all work perfectly fine) | 15:56 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Fix CI https://review.opendev.org/c/openstack/ironic/+/868749 | 16:10 |
TheJulia | samuelkunkel[m]: awesome | 17:16 |
* TheJulia goes back to home repairs | 17:16 | |
* rpittau appears from the shadow | 18:09 | |
* rpittau hello ironicers! Mind the CI is broken almost everywhere but this https://review.opendev.org/c/openstack/ironic/+/868749 should fix it | 18:09 | |
* rpittau once that merges others that are failing with the same topic (like https://review.opendev.org/c/openstack/ironic-inspector/+/868750) should succeed after a recheck | 18:09 | |
* rpittau happy holidays! | 18:09 | |
* rpittau /me goes back into the shadows | 18:09 | |
TheJulia | 4gb of ram!?! | 18:11 |
TheJulia | That is super problematic, what did they add that we can remove? | 18:11 |
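
Note on the "certificate is not yet valid" error from 12:22 to 12:29: a minimal sketch, not taken from the channel, of why clock skew produces it. The agent stamps its certificate with its own clock, so a conductor whose clock is behind sees a start time in the future. The function name and the timestamps below are hypothetical; only `ssl.cert_time_to_seconds` is a real stdlib call.

```python
import ssl
from datetime import datetime, timezone


def seconds_until_valid(not_before: str, now: datetime) -> float:
    """Return the seconds remaining until a certificate becomes valid.

    `not_before` uses the text format reported by ssl.getpeercert(),
    for example "Dec 29 14:43:59 2022 GMT". A positive result means the
    checking clock is behind the clock that issued the certificate.
    """
    start = ssl.cert_time_to_seconds(not_before)
    return start - now.timestamp()


# Hypothetical values: the agent's clock runs 5 hours ahead of the conductor.
conductor_now = datetime(2022, 12, 29, 9, 43, 59, tzinfo=timezone.utc)
skew = seconds_until_valid("Dec 29 14:43:59 2022 GMT", conductor_now)

if skew > 0:
    print(f"certificate is not yet valid for another {skew:.0f}s")
```

A positive value here corresponds to the situation where a successful NTP sync on the agent, before the certificate is generated, makes the error disappear.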
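
Note on the chrony failure from 12:51 to 13:51: the one-shot `chronyd -q` call failed while the port was still blocked, and a manual run about 20s later succeeded. The sketch below shows a retry loop around that same command as a diagnostic aid. It is an assumption for illustration, not how ironic-python-agent behaves, and `sync_time_with_retry`, `run` and `sleep` are hypothetical names. The fix that actually worked in the channel was enabling spanning-tree portfast on the switchports.

```python
import subprocess
import time
from typing import Callable, Sequence


def sync_time_with_retry(
    ntp_server: str,
    attempts: int = 5,
    delay: float = 10.0,
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run the one-shot chrony sync until it exits 0 or attempts run out."""
    # Same command line as in the log above, passed as an argument list.
    cmd = ["chronyd", "-q", f"server {ntp_server} iburst"]
    for attempt in range(1, attempts + 1):
        if run(cmd) == 0:
            return True
        if attempt < attempts:
            # Give the switch port time to start forwarding traffic.
            sleep(delay)
    return False


# Dry run with a fake runner that fails twice, as if the port were blocked.
calls = []


def fake_run(cmd: Sequence[str]) -> int:
    calls.append(list(cmd))
    return 0 if len(calls) >= 3 else 1


synced = sync_time_with_retry("10.219.208.25", run=fake_run, sleep=lambda s: None)
print(synced, len(calls))
```

Injecting `run` and `sleep` keeps the loop testable without root access or a real NTP server; on a real ramdisk the defaults would invoke chronyd directly.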