Wednesday, 2025-03-12

stevebaker[m]	:P	01:17
opendevreview	Merged openstack/ironic master: fix glance metadata layout https://review.opendev.org/c/openstack/ironic/+/942496	03:36
frickler	and another apache2 restart failure. I wonder if it is coincidence that all failures I've seen recently are in raxflex. maybe those nodes are faster and thus more prone to trigger this kind of race condition? https://zuul.opendev.org/t/openstack/build/89e65f7557764fcf98e29651d0e8c9e5	07:54
rpittau	good morning ironic! o/	07:57
frickler	urgs, why do you do this? none of our mirrors are meant to be used hardcoded directly "IRONIC_GRUB2_FILE: https://mirror.iad3.inmotion.opendev.org/..."	07:59
frickler	good morning rpittau	07:59
frickler	oh, that's only from TheJulia's patch, /me goes fixing	08:03
rpittau	hey frickler :)	08:04
opendevreview	Dr. Jens Harbott proposed openstack/ironic master: Advanced vmedia deployment test ops https://review.opendev.org/c/openstack/ironic/+/898010	08:07
opendevreview	Verification of a change to openstack/ironic master failed: [CI] Use bigger partition as work dir for metal3 job https://review.opendev.org/c/openstack/ironic/+/943374	08:08
opendevreview	Merged openstack/networking-generic-switch master: Fix info leakage in Netmiko connection errors https://review.opendev.org/c/openstack/networking-generic-switch/+/943061	09:37
opendevreview	Merged openstack/ironic-tempest-plugin master: Validate automatic lessee https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/927545	09:57
opendevreview	minwoo seo proposed openstack/ironic-inspector master: Add action in plugin/rules.py for webhook https://review.opendev.org/c/openstack/ironic-inspector/+/942968	10:42
opendevreview	Merged openstack/ironic master: [CI] Use bigger partition as work dir for metal3 job https://review.opendev.org/c/openstack/ironic/+/943374	10:57
rpittau	\o/	11:35
dtantsur	Let's hope it remains stable until we can get to E2E tests	12:27
dtantsur	(in all fairness, I don't know if these have a smaller footprint)	12:27
opendevreview	cid proposed openstack/ironic master: doc: Update the runbook API usage https://review.opendev.org/c/openstack/ironic/+/944115	12:53
iurygregory	https://openinfra.org/blog/openinfra-linux-announcement it's official	13:10
TheJulia	good morning	13:46
opendevreview	cid proposed openstack/ironic master: doc: Update the runbook API usage https://review.opendev.org/c/openstack/ironic/+/944115	14:49
opendevreview	Julia Kreger proposed openstack/ironic master: WIP: Add network simulator support for force10 os 10 https://review.opendev.org/c/openstack/ironic/+/943345	15:12
opendevreview	Riccardo Pittau proposed openstack/ironic master: [WIP] Run metal3 integration job using UEFI boot (default) https://review.opendev.org/c/openstack/ironic/+/939694	16:01
dtantsur	Hey folks, since not everyone is in the metal3 channel: I've recorded an ironic-standalone-operator demo https://www.youtube.com/watch?v=LfA0Qt798QA	16:02
opendevreview	Riccardo Pittau proposed openstack/ironic master: [WIP] Run metal3 integration job using UEFI boot (default) https://review.opendev.org/c/openstack/ironic/+/939694	16:04
opendevreview	Julia Kreger proposed openstack/ironic master: Add network simulator support for force10 OS 10 https://review.opendev.org/c/openstack/ironic/+/943345	17:25
opendevreview	Julia Kreger proposed openstack/ironic master: docs: detail network switch simulator support https://review.opendev.org/c/openstack/ironic/+/944139	17:25
opendevreview	Julia Kreger proposed openstack/networking-generic-switch master: docs: add status for force10 os https://review.opendev.org/c/openstack/networking-generic-switch/+/943070	17:58
TheJulia	\o/ I got force10 OS10 based switch to fire up	18:11
TheJulia	and be configured, and have NGS do configuration	18:11
TheJulia	<Insert dancing here>	18:11
JayF	it'd be nice to get https://review.opendev.org/c/openstack/ironic/+/699953 in for the release; I just rechecked it	18:43
JayF	cid: cardoe: https://review.opendev.org/c/openstack/ironic/+/931848 does this intersect at all with the sushy-aware oem support stuff?	18:44
JayF	in general kinda unsure what to do with 931848 -- do we want cid to carry it over the line or what?	18:45
TheJulia	It would, It could also be considered a fix depending on framing	18:53
TheJulia	(speaking regarding the port binding change)	18:54
TheJulia	(since that can guard against failure cases)	18:54
shermanm	TheJulia: very cool! with OS10, did you happen to observe long delays for each ssh connection? We see it on all of our os10 switches, but I wasn't sure if it also shows up in the emulator	18:57
TheJulia	I mainly focused on it working, not the actual timing	19:03
TheJulia	It doesn't feel like the fastest thing to interact with, fwiw. It might also be specific switch settings as well. At least once the console was interactive, I was able to configure it and the ssh connections did the needful	19:04
TheJulia	shermanm: my work at the moment is mainly focused on "does ngs do what it promises" not how it performs speed wise, but I know neutron is also not the quickest about triggering plugins, so I'd honestly focus on an atomic example if possible	19:06
shermanm	yep, that all makes sense, seems like it would also be helpful for things like the NGS error-regex.	19:13
shermanm	I asked mostly because I'm working on a test-case for our scaling issues with these os10 switches, Dell basically told us that 10 seconds to start a new ssh connection to the switch is normal. (subsequent commands per connection are fine though)	19:13
opendevreview	Julia Kreger proposed openstack/ironic master: Add network simulator support for force10 OS 10 https://review.opendev.org/c/openstack/ironic/+/943345	19:49
JayF	shermanm: "10 seconds to start a new ssh connection [snip] is normal" :-| ... that's a wild statement to me	20:02
JayF	like I know it's probably true, and I'm not surprised it's true	20:02
JayF	I'm surprised someone said that out loud without stopping themselves with a "how bad have we failed at software" ;)	20:02
shermanm	yeah, it's par for the course with os10 :( . I'd done a bit of digging by logging in as the `linuxadmin` user, the actual time is taken by starting up the `clish` shell, which is the login shell for normal config users. It's also invoked by their restconf implementation, so that also takes 10s to run anything.	20:05
shermanm	to work around this, I was poking around to see if I can get NGS to hold onto an open netmiko ssh connection per switch and reuse it	20:07
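The connection-reuse idea above can be sketched roughly as follows. This is a minimal illustration and not NGS code: `ConnectionCache` and its `connect` factory are hypothetical names, with the factory standing in for whatever opens the slow SSH session. Handling of stale or dropped connections is left out.

```python
import threading
from typing import Any, Callable, Dict


class ConnectionCache:
    """Keep one open connection per switch and reuse it (hypothetical helper)."""

    def __init__(self, connect: Callable[[str], Any]):
        self._connect = connect  # slow factory, e.g. one that opens an SSH session
        self._conns: Dict[str, Any] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _lock_for(self, switch: str) -> threading.Lock:
        # Create the per-switch lock lazily, guarded against concurrent callers.
        with self._guard:
            return self._locks.setdefault(switch, threading.Lock())

    def run(self, switch: str, action: Callable[[Any], Any]) -> Any:
        # Serialize use of each switch's connection; pay the connect cost once.
        with self._lock_for(switch):
            conn = self._conns.get(switch)
            if conn is None:
                conn = self._conns[switch] = self._connect(switch)
            return action(conn)
```

Serializing per switch also lines up with the point that changes should only come from one session at a time.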
JayF	Do you know concurrent connections allowed for the switch?	20:07
shermanm	1	20:07
JayF	So one dangling connection and we're locked out :/	20:08
shermanm	ah, I misunderstood. You can have 10+ connections, but you are explicitly only allowed to make changes from one, as there seems to be no support for locking, commit-confirmed, etc.	20:09
JayF	ah, that's nicer	20:09
shermanm	nothing enforces this btw, you just get buggy behavior	20:09
JayF	some switches I've used in production in the past would reject ssh connections once you got past two	20:09
shermanm	that part is thankfully configurable on these	20:10
JayF	and if they were timing out; you were just SOL until the timeout hit	20:10
JayF	It should be better; this story is a decade old lol	20:10
-opendevstatus- NOTICE: One of our Zuul job log storage providers is experiencing errors. We have removed that storage target from base jobs. You should be able to safely recheck changes now.		20:23
TheJulia	shermanm: I did notice in my testing that the initial cli shell invocation takes 6+ seconds to fire up	20:25
TheJulia	so that maps and aligns with ssh I guess	20:25
TheJulia	I've been thinking we need to build a model and revisit the adhoc change model	20:25
TheJulia	I've recently talked with folks who want switches reconfigured in sub-second, but the reality is "it takes time"	20:27
JayF	it's always useful to keep in mind that the server reboot will still always be the long pole in the provisioning	20:27
JayF	but I know people still want things now() anyway	20:28
TheJulia	I'd frame it as POST	20:28
TheJulia	since some methods you can go direct to booting workload, others...	20:28
TheJulia	it takes a deployment	20:28
JayF	++	20:32
TheJulia	so, 10 seconds for anything makes me think we really need to do a bunch of batching	20:32
JayF	we need nagle, but for switches	20:33
shermanm	most switches are way better about this, but os10 in particular is just a headache	20:33
JayF	"how likely is it we're going to get more configurations to set"	20:33
TheJulia	If you're deploying the entire rack, totally likely	20:33
TheJulia	if you're doing 2-3 in the same rack, sure, but maybe not right now	20:34
JayF	yeah, but today we don't have a way to express that	20:34
TheJulia	yup	20:34
JayF	nagle is all about guessing the answer to that question, except for TCP packets that aren't full yet :D	20:34
TheJulia	Yeah	20:35
TheJulia	I'm thinking if a server has multiple ports we could know	20:35
TheJulia	but each port is a separate bind	20:35
TheJulia	so if I have 8 ports, I've got 80 seconds of switch configuration	20:35
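The arithmetic behind the 80-second figure, next to what batching over a single connection might cost, can be worked through as below. The 10-second connection delay comes from the discussion; the 0.5-second per-command time is purely an assumed value for illustration, and both function names are hypothetical.

```python
def serial_bind_seconds(ports: int, connect_delay_s: float) -> float:
    """Each port bind opens its own connection, so the delay is paid per port."""
    return ports * connect_delay_s


def batched_bind_seconds(ports: int, connect_delay_s: float,
                         per_command_s: float) -> float:
    """One connection is opened, then every port is configured over it."""
    return connect_delay_s + ports * per_command_s


print(serial_bind_seconds(8, 10.0))        # 80.0
print(batched_bind_seconds(8, 10.0, 0.5))  # 14.0 (0.5 s per command is assumed)
```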
shermanm	how does this interact with the existing etcd-based batching?	20:36
TheJulia	I'd have to dig into it	20:36
shermanm	btw, for comparison, I tried dropping SONiC onto one of these same switches, and was able to "xargs ssh do thing" with like 40 connections no issue, 40 port reconfiguration in < 5 seconds.	20:36
JayF	oh, good call. my mental model was more cross-switch LACP than people using bonds on a single switch	20:38
TheJulia	hey, eavesdrop, update the logs so I can prod someone with a question	20:38
* TheJulia taps foot and looks at the bot		20:40
shermanm	This "10s delay on everything" also interacts badly with the provisioning flow of "set vlan to provision net, remove from provision net, add to tenant net", as soon as more than a few nodes are in flight. And because it's currently still a synchronous operation, you start getting API timeouts from neutron.	20:46
shermanm	At one point we'd just set all of the timeouts as high as possible and saw it take upwards of an hour to complete networking for 20 nodes.	20:46
TheJulia	JayF: well, standing cross-switch I would think is typical and really outside of our need/logic/modeling	20:46
JayF	I just meant that in terms of; I never considered we'd hit the same switch for "N" ports in a single deployment	20:46
TheJulia	nodes, oh yeah. and even if it is not all bonded and just individual interfaces, that is still a ton of time. Portgroup bonds are going to go faster since it should just be one portgroup and get fanned out from there by ml2 execution	20:47
TheJulia	OH	20:47
TheJulia	yeah	20:47
shermanm	one architecture that I liked was the way networking-odl did their ML2 driver https://github.com/openstack/networking-odl/blob/stable/2023.1/networking_odl/ml2/mech_driver_v2.py ; all of the operations get put into a journal, and a periodic worker applies changes from it	20:59
TheJulia	That is kind of what I've been thinking	21:02
TheJulia	"is the periodic running" -> "if not, add to list and trigger", "if so, add to list, and check to see when done"	21:16
JayF	with some kind of catch for backpressure if the queue grows too fast	21:21
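The "add to list and trigger" flow with a backpressure catch might look something like the sketch below. It is not how networking-odl or NGS implement it: `SwitchJournal`, `JournalFull`, `apply_batch`, and `max_pending` are hypothetical names, and the drain runs in the submitting caller's thread rather than in a real periodic worker.

```python
import threading
from collections import deque


class JournalFull(Exception):
    """Raised when the journal is at capacity (backpressure signal)."""


class SwitchJournal:
    def __init__(self, apply_batch, max_pending=100):
        self._apply_batch = apply_batch  # callable taking a list of operations
        self._pending = deque()
        self._max_pending = max_pending
        self._lock = threading.Lock()
        self._running = False

    def submit(self, op):
        with self._lock:
            if len(self._pending) >= self._max_pending:
                raise JournalFull("too many queued switch operations")
            self._pending.append(op)
            if self._running:
                return  # the active drain pass will pick this operation up
            self._running = True
        self._drain()

    def _drain(self):
        while True:
            with self._lock:
                if not self._pending:
                    self._running = False
                    return
                batch = list(self._pending)
                self._pending.clear()
            try:
                self._apply_batch(batch)  # one switch session for the whole batch
            except Exception:
                with self._lock:
                    self._running = False
                raise
```

Operations that arrive while a batch is being applied are grouped into the next batch, which is where the time saving would come from on a switch with a high per-connection cost.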
shermanm	I'm not entirely sure, it's complicated by the fact that they're using an external SDN, so that periodic worker is reading the journal and sending API calls there. Maybe their doc here answers your question though? https://opendev.org/openstack/networking-odl/src/commit/fe40cbe13a9b9bca2ee740de0b5736b5806e54ae/doc/source/contributor/drivers_architecture.rst#v2_design	21:21
shermanm	one inherent limit is that you only have ~40ish ports per switch. what do you do in the case where you have 5 queued operations to set the state for the same port? Just pick the latest one to apply?	21:24
JayF	another good question	21:24
shermanm	I feel that it's a reasonable thing to do, as the neutron DB has already accepted those changes, and I wouldn't want the state of a port to depend on the order of multiple calls, each port create/bind should fully describe the desired state. and this helps limit how large the queue can ever get.	21:27
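Picking the latest queued operation per port can be sketched as below, assuming each queued entry fully describes the desired state of its port. The `coalesce` helper, the port names, and the VLAN IDs are all hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple


def coalesce(ops: Iterable[Tuple[Hashable, dict]]) -> Dict[Hashable, dict]:
    """Collapse queued (port, desired_state) pairs so the newest state per port wins."""
    latest: Dict[Hashable, dict] = {}
    for port, desired_state in ops:
        latest.pop(port, None)  # re-insert so order reflects the most recent request
        latest[port] = desired_state
    return latest


queued = [
    ("eth1/1", {"vlan": 100}),
    ("eth1/2", {"vlan": 200}),
    ("eth1/1", {"vlan": 300}),
]
print(coalesce(queued))
# {'eth1/2': {'vlan': 200}, 'eth1/1': {'vlan': 300}}
```

Because the result holds at most one entry per port, the work left to apply is bounded by the number of ports on the switch, however many requests were queued.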
TheJulia	i think it does, but I'd need to check	21:28
TheJulia	Anyway, I need to get going	21:29
opendevreview	Adam McArthur proposed openstack/ironic-tempest-plugin master: WIP: Testing all microversion tests on CI https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/943086	22:30
opendevreview	Satoshi Shirosaka proposed openstack/ironic-python-agent master: WIP Add ContainerHardwareManager https://review.opendev.org/c/openstack/ironic-python-agent/+/941714	23:41
