Wednesday, 2022-04-27

<opendevreview> Merged openstack/ironic master: Grenade: Turn up interfaces for vxlan https://review.opendev.org/c/openstack/ironic/+/839420 [00:46]
<arne_wiebalck> Good morning, Ironic! [06:20]
<rpittau> good morning ironic! o/ [06:58]
<opendevreview> Riccardo Pittau proposed openstack/sushy-tools master: Use python Zed tests https://review.opendev.org/c/openstack/sushy-tools/+/838674 [08:24]
<opendevreview> Dmitry Tantsur proposed openstack/ironic master: Decouple deploy callback timeout from deploy step timeout https://review.opendev.org/c/openstack/ironic/+/837690 [08:44]
<dtantsur> TheJulia: okay, apparently you were right, and I don't remember how the timeout stuff works :) we do update provision_updated_at on heartbeats, so running deploy steps never time out. [09:41]
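A minimal sketch of the behaviour dtantsur describes. It assumes the timeout check simply compares the current time against `provision_updated_at`. The function names are hypothetical, and this is not ironic's actual implementation. Because every agent heartbeat refreshes the timestamp, a deploy that keeps heartbeating never crosses the timeout, however long its steps run.

```python
from datetime import datetime, timedelta, timezone


def is_timed_out(provision_updated_at, timeout_s, now):
    """Hypothetical check: True if the node was last touched more than timeout_s ago."""
    return now - provision_updated_at > timedelta(seconds=timeout_s)


def deploy_times_out(duration_s, heartbeat_every_s, timeout_s):
    """Simulate a deploy of duration_s seconds with periodic agent heartbeats.

    The timeout check runs at each heartbeat instant, just before the
    heartbeat refreshes the timestamp. Returns True if it ever fires.
    """
    start = datetime(2022, 4, 27, tzinfo=timezone.utc)
    provision_updated_at = start
    for t in range(heartbeat_every_s, duration_s + 1, heartbeat_every_s):
        now = start + timedelta(seconds=t)
        if is_timed_out(provision_updated_at, timeout_s, now):
            return True
        # Assumption taken from the log: each heartbeat bumps provision_updated_at.
        provision_updated_at = now
    return False


# A 3-hour deploy with 30 s heartbeats and a 30 min timeout never times out.
print(deploy_times_out(3 * 3600, 30, 1800))  # False
# Heartbeats spaced wider than the timeout do trigger it.
print(deploy_times_out(3 * 3600, 2000, 1800))  # True
```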
<dtantsur> folks, I'd really appreciate some reviews on https://review.opendev.org/c/openstack/sushy-tools/+/830157/ and https://review.opendev.org/c/openstack/sushy-tools/+/830598/ [10:15]
<hjensas> networking-baremetal CI was broken, can cores take a look at https://review.opendev.org/c/openstack/networking-baremetal/+/839298 ?, thanks [10:16]
<dtantsur> +2 [10:18]
<iurygregory> morning Ironic [11:04]
<hjensas> thanks dtantsur [11:18]
<hjensas> TheJulia: how on earth can we get conductor takeover issues on the undercloud? (It's just one conductor?) [11:18]
<hjensas> https://logserver.rdoproject.org/15/15761b77d91ab3e398f6fa1d10d2f5da267b7931/openstack-periodic-integration-stable1-cs8/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/2f5dc02/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz [11:19]
<hjensas> 2022-04-26 23:40:31.669 10 WARNING ironic.conductor.manager [-] Forcibly removed reservation of conductor undercloud.localdomain on node 9187de8c-59e1-46d1-8fac-d6cb28fca0b4 as that conductor went offline [11:19]
<dtantsur> hjensas: hostname change? [11:24]
<hjensas> dtantsur: thanks, that seems like a plausible reason. [11:25]
<hjensas> the conductor starts with "host = undercloud.localdomain", I don't see any indication of hostname change in the journal. [11:46]
<hjensas> dtantsur: is it possible self.dbapi.get_offline_conductors() would return the same conductor in case the heartbeat was not received? [11:48]
<dtantsur> hjensas: does not sound impossible to me, I'm not sure if we have any safeguards [11:54]
<dtantsur> (unclear why a heartbeat wouldn't be received in this case) [11:54]
<hjensas> yeah, how does this heartbeat work? A periodic task is updating a timestamp in the db? Or is RPC/MQ involved? [11:56]
<dtantsur> hjensas: it's a thread in the conductor IIRC [11:56]
<TheJulia> Local takeover can occur when failures or super long pauses occur [12:00]
<TheJulia> Which exceed 60 seconds. It is a sign that resources are way overcommitted [12:01]
<dtantsur> maybe we need a safeguard for the current host [12:01]
<dtantsur> (won't help in case of changing hostnames) [12:01]
<TheJulia> A running process doesn’t learn of a host name change [12:02]
<TheJulia> It is a launch time variable [12:02]
<TheJulia> So that stays static in the running namespace aiui [12:02]
<TheJulia> Well, process space [12:02]
<TheJulia> A safeguard wouldn’t really defend it… it is basically a “I stopped writing to the dab or being able to write due to external conditions”…. [12:04]
<TheJulia> To the db [12:04]
* TheJulia goes back to sleep [12:04]
<TheJulia> Or at least, try… cats [12:05]
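A rough model of how a lone conductor can end up "taking over" from itself. It assumes each conductor's heartbeat thread stores a last-seen timestamp per hostname, and that a conductor counts as offline once that timestamp is older than the heartbeat timeout. The model uses 60 seconds, matching the figure TheJulia gives. The helper name is hypothetical; it only mirrors the idea behind `get_offline_conductors()`.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT_S = 60  # assumption: the 60 s threshold mentioned above


def offline_conductors(last_heartbeat, now, timeout_s=HEARTBEAT_TIMEOUT_S):
    """Return hostnames whose last heartbeat is older than timeout_s (hypothetical)."""
    cutoff = now - timedelta(seconds=timeout_s)
    return sorted(host for host, seen in last_heartbeat.items() if seen < cutoff)


t0 = datetime(2022, 4, 26, 23, 39, tzinfo=timezone.utc)
heartbeats = {"undercloud.localdomain": t0}

# Normal operation: checked 10 s after the last heartbeat, nothing is offline.
print(offline_conductors(heartbeats, t0 + timedelta(seconds=10)))  # []

# The heartbeat thread stalls on an overcommitted host; 75 s later the only
# conductor looks offline, even to itself, so it may take over its own nodes.
print(offline_conductors(heartbeats, t0 + timedelta(seconds=75)))
# ['undercloud.localdomain']
```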
<hjensas> I looked at the dstat data of the job, there is for sure a load spike. And this is OVB, so potentially noisy neighbours. [12:16]
<dtantsur> it shouldn't be too hard to make the conductor never take over itself [12:16]
<hjensas> It may make sense to set CONF.conductor.check_provision_state_interval = 0 on the undercloud, or in the CI. [12:16]
<dtantsur> please no [12:17]
<dtantsur> this is a bug that is trivial to fix. what you're suggesting is not even the right option.. [12:17]
<dtantsur> probably just wrap all calls to self.dbapi.get_offline_conductors in manager.py in a helper that excludes the current host [12:19]
<dtantsur> (and maybe logs a warning that something is off) [12:19]
<hjensas> hm, I figured from https://opendev.org/openstack/ironic/src/branch/master/ironic/conductor/manager.py#L1587 that check_provision_state_interval = 0 would disable the entire thing. [12:20]
<hjensas> But yeah, I can add something to exclude current host. [12:20]
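The helper dtantsur proposes could look roughly like the following. This is a hypothetical sketch: the function name and warning text are illustrative. It assumes `get_offline_conductors()` returns a list of hostnames (consistent with how the log uses it) and that `current_host` is the conductor's own `host` value.

```python
import logging

LOG = logging.getLogger(__name__)


def get_offline_conductors_excluding_self(dbapi, current_host):
    """Hypothetical wrapper: never report this conductor as offline.

    A conductor that is running this code is evidently alive, so treating
    its own hostname as offline would make it break its own reservations.
    """
    offline = dbapi.get_offline_conductors()
    if current_host in offline:
        # Our own heartbeat is stale: likely heavy load or database trouble.
        LOG.warning("Conductor %s appears offline to itself; its heartbeat "
                    "is probably delayed. Not taking over its own nodes.",
                    current_host)
    return [host for host in offline if host != current_host]
```

Assuming the manager exposes its hostname as `self.host`, each call site in manager.py would call the wrapper with `self.dbapi` and `self.host` instead of calling `self.dbapi.get_offline_conductors()` directly. That way the takeover, orphaned-node and allocation checks all skip the local conductor.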
<dtantsur> ah, so we're reusing the same option for many purposes. fun. [12:20]
<dtantsur> even if you disable the periodic task, the conductor will be considered offline for e.g. hash ring purposes [12:21]
<dtantsur> which is... extra fun because it's on the API side [12:21]
<dtantsur> but yeah, at least never take over the current conductor, nor orphan its nodes or allocations [12:22]
<dtantsur> this option also affects checking for deploy timeouts [12:23]
<hjensas> ok, I'll suggest they try to bump the heartbeat_interval and heartbeat_timeout. That may allow it to survive the load peak. [12:28]
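Some quick arithmetic on that tuning, under a simplified model. The model assumes heartbeats are written every `heartbeat_interval` seconds, and a conductor counts as offline once its last heartbeat is older than `heartbeat_timeout`. It also assumes heartbeating resumes immediately after a stall. The worst case is a stall that starts just before a heartbeat is due. So the longest stall that is always survived is `heartbeat_timeout - heartbeat_interval`. The 10 s / 60 s pair below is assumed to be the default; check the configuration reference for your release.

```python
def guaranteed_stall_tolerance(heartbeat_interval_s, heartbeat_timeout_s):
    """Longest stall (in seconds) that never makes a conductor look offline.

    Worst case: the stall begins just before the next heartbeat is due, so the
    gap since the last heartbeat is (almost) heartbeat_interval + stall, and
    that gap must stay within heartbeat_timeout.
    """
    return max(0, heartbeat_timeout_s - heartbeat_interval_s)


print(guaranteed_stall_tolerance(10, 60))   # 50: assumed defaults
print(guaranteed_stall_tolerance(10, 180))  # 170: raising only the timeout helps
print(guaranteed_stall_tolerance(30, 60))   # 30: raising only the interval hurts
```

Under this model the timeout is what buys headroom. A longer interval on its own actually shrinks the margin.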
<iurygregory> dtantsur, rpittau, TheJulia https://review.opendev.org/c/openstack/releases/+/839524 releasing ironic and sushy in victoria before moving to EM [13:19]
<iurygregory> the other deliverables didn't have new commits or had commits that were just related to CI [13:19]
<opendevreview> Merged openstack/ironic master: [iRMC] Change the way to get irmc-info in raid https://review.opendev.org/c/openstack/ironic/+/839122 [13:22]
<opendevreview> Riccardo Pittau proposed openstack/ironic-python-agent master: Multipath Hardware path handling https://review.opendev.org/c/openstack/ironic-python-agent/+/837039 [13:38]
<iurygregory> rpittau, should I update the wallaby patch? [13:39]
<rpittau> iurygregory: yes please! [13:39]
<iurygregory> doing now [13:39]
<rpittau> thanks [13:39]
<rpittau> dtantsur, iurygregory, TheJulia, please double-check the mpathconf options, I checked on RHEL8 so it should be fine, but still, the more eyes the better :) [13:40]
<TheJulia> rpittau: does debian/ubuntu have /sbin/mpathconf? [13:47]
<rpittau> TheJulia: mmm probably no [13:51]
<TheJulia> not on debian [13:52]
<rpittau> no, they don't, they use a different way to configure multipath [13:52]
<TheJulia> so [13:52]
<TheJulia> directly launching the daemon *should* configure pathing [13:53]
<TheJulia> at least it did on my test machine [13:53]
<rpittau> not in RHEL [13:53]
<TheJulia> oh jebus [13:53]
<rpittau> unfortunately the procedure for RHEL is to use mpathconf [13:53]
<iurygregory> yay... [13:54]
<rpittau> \o/ [13:54]
<rpittau> we can put a big TRY there [13:54]
<TheJulia> I suspect we need to [13:54]
<TheJulia> :\ [13:54]
<TheJulia> and I can re-test manually on my desktop after my next reboot [13:54]
* TheJulia feels like major distro differences should be a reason to begin drinking very early [13:54]
* iurygregory agrees [13:55]
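The "big TRY" approach discussed above could be sketched as follows. The helper prefers `mpathconf` when the ramdisk has it (RHEL family). Either way it then launches `multipathd`, which per TheJulia's test configures pathing on Debian. A multipath failure never aborts the caller. This is a hypothetical helper, not the code in the ironic-python-agent change. It passes only `mpathconf --enable`; the options the actual patch uses are the ones rpittau asks reviewers to double-check. The command runner is injectable so the logic can be exercised without touching the system.

```python
import logging
import shutil
import subprocess

LOG = logging.getLogger(__name__)


def configure_multipath(run=subprocess.run, which=shutil.which):
    """Hypothetical sketch: enable multipath on RHEL- and Debian-style images.

    Returns the list of commands attempted, in order. Failures are logged
    and swallowed so that a missing or broken multipath setup never aborts
    the caller.
    """
    commands = []
    if which("mpathconf"):
        # RHEL family: mpathconf creates or updates /etc/multipath.conf first.
        commands.append(["mpathconf", "--enable"])
    # Everywhere: start the daemon (on Debian this alone configures pathing).
    commands.append(["multipathd"])

    attempted = []
    for cmd in commands:
        attempted.append(cmd)
        try:
            run(cmd, check=True)
        except (OSError, subprocess.CalledProcessError) as exc:
            LOG.warning("%s failed, continuing without it: %s", cmd[0], exc)
    return attempted
```

On a Debian-based ramdisk `which("mpathconf")` returns `None`, so only `multipathd` is attempted.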
<opendevreview> Iury Gregory Melo Ferreira proposed openstack/ironic-python-agent stable/wallaby: Multipath Hardware path handling https://review.opendev.org/c/openstack/ironic-python-agent/+/837784 [13:56]
* rpittau had a beer at lunch [13:58]
<opendevreview> Merged openstack/networking-baremetal master: Register neutron common config options https://review.opendev.org/c/openstack/networking-baremetal/+/839298 [14:05]
<TheJulia> rpittau: +2+A ;) [14:46]
<rpittau> :D [14:47]
* TheJulia may, or may not, be a bad influence [14:50]
*** mat_fechner is now known as matfechner [14:58]
<opendevreview> Julia Kreger proposed openstack/ironic master: Auto-populate lessee for deployments https://review.opendev.org/c/openstack/ironic/+/818641 [16:06]
<opendevreview> Julia Kreger proposed openstack/ironic master: Auto-populate lessee for deployments https://review.opendev.org/c/openstack/ironic/+/818641 [16:07]
<opendevreview> Julia Kreger proposed openstack/ironic master: DNM: v6/grenade multinode jobs https://review.opendev.org/c/openstack/ironic/+/839086 [16:18]
<rpittau> good night! o/ [16:21]
<iurygregory> gn! [16:22]
<TheJulia> rpittau: are you updating the multipath change tests? [16:22]
<rpittau> TheJulia: I was going to wait for the tests in the field before moving further forward with the patch [16:25]
<TheJulia> rpittau: ack [16:25]
<TheJulia> hjensas: figured out multinode, seems to be an issue with ngs [18:48]
<TheJulia> how to fix is now the question [18:48]
<TheJulia> got it! [19:21]
* TheJulia queues up dancing for after the next meeting [19:22]
<opendevreview> Julia Kreger proposed openstack/networking-generic-switch master: CI: Fix Multinode ssh key file placement https://review.opendev.org/c/openstack/networking-generic-switch/+/839645 [22:24]
<opendevreview> Julia Kreger proposed openstack/ironic master: DNM: v6/grenade multinode jobs https://review.opendev.org/c/openstack/ironic/+/839086 [22:25]
<TheJulia> hjensas: so... dhcp stateful + centos 8... it never got another dhcp address. [22:32]
<opendevreview> Julia Kreger proposed openstack/ironic master: DNM: v6/grenade multinode jobs https://review.opendev.org/c/openstack/ironic/+/839086 [22:34]
