Wednesday, 2025-03-12

stevebaker[m]:P01:17
opendevreviewMerged openstack/ironic master: fix glance metadata layout  https://review.opendev.org/c/openstack/ironic/+/94249603:36
fricklerand another apache2 restart failure. I wonder if it is coincidence that all failures I've seen recently are in raxflex. maybe those nodes are faster and thus more prone to trigger this kind of race condition? https://zuul.opendev.org/t/openstack/build/89e65f7557764fcf98e29651d0e8c9e507:54
rpittaugood morning ironic! o/07:57
fricklerurgs, why do you do this? none of our mirrors are meant to be used hardcoded directly "IRONIC_GRUB2_FILE: https://mirror.iad3.inmotion.opendev.org/..."07:59
fricklergood morning rpittau 07:59
frickleroh, that's only from TheJulia's patch, /me goes fixing08:03
rpittauhey frickler :)08:04
opendevreviewDr. Jens Harbott proposed openstack/ironic master: Advanced vmedia deployment test ops  https://review.opendev.org/c/openstack/ironic/+/89801008:07
opendevreviewVerification of a change to openstack/ironic master failed: [CI] Use bigger partition as work dir for metal3 job  https://review.opendev.org/c/openstack/ironic/+/94337408:08
opendevreviewMerged openstack/networking-generic-switch master: Fix info leakage in Netmiko connection errors  https://review.opendev.org/c/openstack/networking-generic-switch/+/94306109:37
opendevreviewMerged openstack/ironic-tempest-plugin master: Validate automatic lessee  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/92754509:57
opendevreviewminwoo seo proposed openstack/ironic-inspector master: Add action in plugin/rules.py for webhook  https://review.opendev.org/c/openstack/ironic-inspector/+/94296810:42
opendevreviewMerged openstack/ironic master: [CI] Use bigger partition as work dir for metal3 job  https://review.opendev.org/c/openstack/ironic/+/94337410:57
rpittau\o/11:35
dtantsurLet's hope it remains stable until we can get to E2E tests12:27
dtantsur(in all fairness, I don't know if these have a smaller footprint)12:27
opendevreviewcid proposed openstack/ironic master: doc: Update the runbook API usage  https://review.opendev.org/c/openstack/ironic/+/94411512:53
iurygregoryhttps://openinfra.org/blog/openinfra-linux-announcement it's official 13:10
TheJuliagood morning13:46
opendevreviewcid proposed openstack/ironic master: doc: Update the runbook API usage  https://review.opendev.org/c/openstack/ironic/+/94411514:49
opendevreviewJulia Kreger proposed openstack/ironic master: WIP: Add network simulator support for force10 os 10  https://review.opendev.org/c/openstack/ironic/+/94334515:12
opendevreviewRiccardo Pittau proposed openstack/ironic master: [WIP] Run metal3 integration job using UEFI boot (default)  https://review.opendev.org/c/openstack/ironic/+/93969416:01
dtantsurHey folks, since not everyone is in the metal3 channel: I've recorded an ironic-standalone-operator demo https://www.youtube.com/watch?v=LfA0Qt798QA16:02
opendevreviewRiccardo Pittau proposed openstack/ironic master: [WIP] Run metal3 integration job using UEFI boot (default)  https://review.opendev.org/c/openstack/ironic/+/93969416:04
opendevreviewJulia Kreger proposed openstack/ironic master: Add network simulator support for force10 OS 10  https://review.opendev.org/c/openstack/ironic/+/94334517:25
opendevreviewJulia Kreger proposed openstack/ironic master: docs: detail network switch simulator support  https://review.opendev.org/c/openstack/ironic/+/94413917:25
opendevreviewJulia Kreger proposed openstack/networking-generic-switch master: docs: add status for force10 os  https://review.opendev.org/c/openstack/networking-generic-switch/+/94307017:58
TheJulia\o/ I got force10 OS10 based switch to fire up18:11
TheJuliaand be configured, and have NGS do configuration18:11
TheJulia<Insert dancing here>18:11
JayFit'd be nice to get https://review.opendev.org/c/openstack/ironic/+/699953 in for the release; I just rechecked it18:43
JayFcid: cardoe: https://review.opendev.org/c/openstack/ironic/+/931848 does this intersect at all with the sushy-aware oem support stuff?18:44
JayFin general kinda unsure what to do with 931848 -- do we want cid to carry it over the line or what?18:45
TheJuliaIt would, It could also be considered a fix depending on framing18:53
TheJulia(speaking regarding the port binding change)18:54
TheJulia(since that can guard against failure cases)18:54
shermanmTheJulia: very cool! with OS10, did you happen to observe long delays for each ssh connection? We see it on all of our os10 switches, but I wasn't sure if it also shows up in the emulator18:57
TheJuliaI mainly focused on it working, not the actual timing19:03
TheJuliaIt doesn't feel like the fastest thing to interact with, fwiw. It might also be specific switch settings as well. At least once the console was interactive, I was able to configure it and the ssh connections did the needful19:04
TheJuliashermanm: my work at the moment is mainly focused on "does ngs *do* what it promises" not how it performs speed wise, but I know neutron is also not the quickest about triggering plugins, so I'd honestly focus on an atomic example if possible19:06
shermanmyep, that all makes sense, seems like it would also be helpful for things like the NGS error-regex. 19:13
shermanmI asked mostly because I'm working on a test-case for our scaling issues with these os10 switches, Dell basically told us that 10 seconds to start a new ssh connection to the switch is normal. (subsequent commands per connection are fine though)19:13
opendevreviewJulia Kreger proposed openstack/ironic master: Add network simulator support for force10 OS 10  https://review.opendev.org/c/openstack/ironic/+/94334519:49
JayFshermanm: "10 seconds to start a new ssh connection [snip] is normal" :-| ... that's a wild statement to me20:02
JayFlike I know it's probably true, and I'm not surprised it's true20:02
JayFI'm surprised someone said that out loud without stopping themselves with a "how bad have we failed at software" ;) 20:02
shermanmyeah,  it's par for the course with os10 :(  . I'd done a bit of digging by logging in as the `linuxadmin` user, the actual time is taken by starting up the `clish` shell, which is the login shell for normal config users. It's also invoked by their restconf implementation, so that also takes 10s to run anything.20:05
shermanmto work around this, I was poking around to see if I can get NGS to hold onto an open netmiko ssh connection per switch and reuse it20:07
JayFDo you know concurrent connections allowed for the switch? 20:07
shermanm120:07
JayFSo one dangling connection and we're locked out :/20:08
shermanmah, I misunderstood. You can have 10+ connections, but you're explicitly only allowed to make changes from one, as there seems to be no support for locking, commit-confirmed, etc.20:09
JayFah, that's nicer20:09
shermanmnothing enforces this btw, you just get buggy behavior20:09
JayFsome switches I've used in production in the past would reject ssh connections once you got past two20:09
shermanmthat part is thankfully configurable on these20:10
JayFand if they were timing out; you were just SOL until the timeout hit20:10
JayFIt should be better now; this story is a decade old lol20:10
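
The connection-reuse idea discussed above (hold one open session per switch, and only ever change configuration from one session at a time) could look roughly like the sketch below. It is not how NGS works today. `SwitchConnectionCache` and the `connect` factory are hypothetical names; the factory is assumed to return a netmiko-style session exposing `send_config_set()`. Reconnecting after a dropped or timed-out session is left out.

```python
import threading
from typing import Any, Callable, Dict, List


class SwitchConnectionCache:
    """Keep one long-lived session per switch and serialize writers on it."""

    def __init__(self, connect: Callable[[str], Any]):
        # `connect` is a hypothetical factory, e.g. a thin wrapper around
        # netmiko.ConnectHandler(...) that returns an open session.
        self._connect = connect
        self._conns: Dict[str, Any] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _lock_for(self, switch: str) -> threading.Lock:
        # One lock per switch, created on first use.
        with self._guard:
            return self._locks.setdefault(switch, threading.Lock())

    def run(self, switch: str, commands: List[str]) -> Any:
        # Only one caller at a time may push config to a given switch,
        # since the switch itself does not enforce a single writer.
        with self._lock_for(switch):
            conn = self._conns.get(switch)
            if conn is None:
                # Pay the slow login once, then reuse the session.
                conn = self._connect(switch)
                self._conns[switch] = conn
            return conn.send_config_set(commands)
```

With this shape, the slow login is paid once per switch per process rather than once per port bind, at the cost of having to detect and replace stale sessions.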
-opendevstatus- NOTICE: One of our Zuul job log storage providers is experiencing errors. We have removed that storage target from base jobs. You should be able to safely recheck changes now.20:23
TheJuliashermanm: I did notice in my testing that the initial cli shell invocation takes 6+ seconds to fire up20:25
TheJuliaso that maps and aligns with ssh I guess20:25
TheJuliaI've been thinking we need to build a model and revisit the adhoc change model20:25
TheJuliaI've recently talked with folks who want switches reconfigured in sub-second, but the reality is "it takes time"20:27
JayFit's always useful to keep in mind that the server reboot will still always be the long pole in the provisioning20:27
JayFbut I know people still want things now() anyway20:28
TheJuliaI'd frame it as POST20:28
TheJuliasince some methods you can go direct to booting workload, others...20:28
TheJuliait takes a deployment20:28
JayF++20:32
TheJuliaso, 10 seconds for anything makes me think we *really* need to do a bunch of batching20:32
JayFwe need nagle, but for switches 20:33
shermanmmost switches are way better about this, but os10 in particular is just a headache20:33
JayF"how likely is it we're going to get more configurations to set"20:33
TheJuliaIf you're deploying the entire rack, totally likely20:33
TheJuliaif you're doing 2-3 in the same rack, sure, but maybe not right now20:34
JayFyeah, but today we don't have a way to express that20:34
TheJuliayup20:34
JayFnagle is all about guessing the answer to that question, except for TCP packets that aren't full yet :D 20:34
TheJuliaYeah20:35
TheJuliaI'm thinking if a server has multiple ports we could know20:35
TheJuliabut each port is a separate bind20:35
TheJuliaso if I have 8 ports, I've got 80 seconds of switch configuration20:35
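
The arithmetic behind that estimate, and what batching would buy, can be written out as a small sketch. The 10-second connection cost comes from the discussion above; the per-command cost `command_s` is an assumed parameter (the log only says subsequent commands on an open connection are "fine"), and both function names are hypothetical.

```python
def serial_bind_seconds(ports: int, connect_s: float = 10.0,
                        command_s: float = 0.0) -> float:
    """Each port bind opens its own connection, so the login cost repeats."""
    return ports * (connect_s + command_s)


def batched_bind_seconds(ports: int, connect_s: float = 10.0,
                         command_s: float = 0.0) -> float:
    """All binds for one switch share a single connection."""
    return connect_s + ports * command_s


print(serial_bind_seconds(8))    # 8 binds, 8 logins
print(batched_bind_seconds(8))   # 8 binds, 1 login
```

With the default `command_s` of zero, eight ports cost 80 seconds serially and 10 seconds batched; any realistic per-command cost adds the same amount to both.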
shermanmhow does this interact with the existing etcd-based batching?20:36
TheJuliaI'd have to dig into it20:36
shermanmbtw, for comparison, I tried dropping SONiC onto one of these same switches, and was able to "xargs ssh do thing" with like 40 connections no issue, 40 port reconfiguration in < 5 seconds.20:36
JayFoh, good call. my mental model was more cross-switch LACP than people using bonds on a single switch20:38
TheJuliahey, eavesdrop, update the logs so I can prod someone with a question20:38
* TheJulia taps foot and looks at the bot20:40
shermanmThis "10s delay on everything" also interacts badly with the provisioning flow of "set vlan to provision net, remove from provision net, add to tenant net", as soon as more than a few nodes are in flight. And because it's currently still a synchronous operation, you start getting API timeouts from neutron.20:46
shermanmAt one point we'd just set all of the timeouts as high as possible and saw it take upwards of an hour to complete networking for 20 nodes.20:46
TheJuliaJayF: well, standing cross-switch I would think is typical and really outside of our need/logic/modeling20:46
JayFI just meant that in terms of; I never considered we'd hit the same switch for "N" ports in a single deployment20:46
TheJulianodes, oh yeah. and even if it is not all bonded and just individual interfaces, that is still a ton of time. Portgroup bonds are going to go faster since it should just be one portgroup and get fanned out from there by ml2 execution20:47
TheJuliaOH20:47
TheJuliayeah20:47
shermanmone architecture that I liked was the way networking-odl did their ML2 driver https://github.com/openstack/networking-odl/blob/stable/2023.1/networking_odl/ml2/mech_driver_v2.py ; all of the operations get put into a journal, and a periodic worker applies changes from it20:59
TheJuliaThat is kind of what I've been thinking21:02
TheJulia"is the periodic running" -> "if not, add to list and trigger", "if so, add to list, and check to see when done"21:16
JayFwith some kind of catch for backpressure if the queue grows too fast21:21
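
A minimal sketch of the flow described in the last two messages: add the operation to the list, start the worker only if it is not already running, and refuse new work once the backlog is too large. This illustrates the idea only; it is not networking-odl's or NGS's implementation. `Journal`, `JournalFull`, `apply_batch` and `max_pending` are hypothetical names, and the worker here runs in the submitting thread rather than as a real periodic task.

```python
import threading
from collections import deque
from typing import Any, Callable, List


class JournalFull(Exception):
    """Raised when the backlog exceeds the configured limit (backpressure)."""


class Journal:
    def __init__(self, apply_batch: Callable[[List[Any]], None],
                 max_pending: int = 100):
        self._apply_batch = apply_batch
        self._pending = deque()
        self._max_pending = max_pending
        self._lock = threading.Lock()
        self._running = False

    def submit(self, op: Any) -> bool:
        """Queue `op`; return True if this call ran the worker itself."""
        with self._lock:
            if len(self._pending) >= self._max_pending:
                raise JournalFull("too many queued switch operations")
            self._pending.append(op)
            if self._running:
                # A worker is active; it will pick this entry up.
                return False
            self._running = True
        self._drain()
        return True

    def _drain(self) -> None:
        while True:
            with self._lock:
                if not self._pending:
                    self._running = False
                    return
                batch = list(self._pending)
                self._pending.clear()
            try:
                # One call per batch, e.g. one ssh session for many ports.
                self._apply_batch(batch)
            except Exception:
                # Sketch only: the failed batch is dropped, not retried.
                with self._lock:
                    self._running = False
                raise
```

A caller that gets `False` back would still need some way to learn when its entry has been applied, which is the "check to see when done" part and is not covered here.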
shermanmI'm not entirely sure, it's complicated by the fact that they're using an external SDN, so that periodic worker is reading the journal and sending API calls there.  Maybe their doc here answers your question though? https://opendev.org/openstack/networking-odl/src/commit/fe40cbe13a9b9bca2ee740de0b5736b5806e54ae/doc/source/contributor/drivers_architecture.rst#v2_design21:21
shermanmone inherent limit is that you only have ~40ish ports per switch. what do you do in the case where you have 5 queued operations to set the state for the same port? Just pick the latest one to apply?21:24
JayFanother good question21:24
shermanmI feel that it's a reasonable thing to do, as the neutron DB has already accepted those changes, and I wouldn't want the state of a port to depend on the order of multiple calls, each port create/bind should fully describe the desired state. and this helps limit how large the queue can ever get.21:27
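
The "just pick the latest one" approach can be sketched as a queue keyed by switch and port, where a newer desired state overwrites any older one still waiting. This assumes, as the message above does, that each entry fully describes the desired state of the port. `CoalescingQueue` and its methods are hypothetical names.

```python
from typing import Any, Dict, List, Tuple


class CoalescingQueue:
    """Hold at most one pending desired state per (switch, port)."""

    def __init__(self) -> None:
        self._desired: Dict[Tuple[str, str], Any] = {}

    def put(self, switch: str, port: str, state: Any) -> None:
        # Last write wins: older pending state for this port is discarded.
        self._desired[(switch, port)] = state

    def drain(self, switch: str) -> List[Tuple[str, Any]]:
        """Remove and return all pending (port, state) pairs for one switch."""
        keys = [k for k in self._desired if k[0] == switch]
        return [(port, self._desired.pop((sw, port))) for sw, port in keys]

    def __len__(self) -> int:
        return len(self._desired)


q = CoalescingQueue()
for vlan in (10, 20, 30, 40, 50):
    q.put("sw1", "ethernet1/1/1", {"vlan": vlan})
print(len(q))          # one pending entry, not five
print(q.drain("sw1"))  # only the vlan 50 state remains
```

Because the queue can never hold more entries than there are ports, its size is bounded by the switch itself.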
TheJuliai *think* it does, but I'd need to check21:28
TheJuliaAnyway, I need to get going21:29
opendevreviewAdam McArthur proposed openstack/ironic-tempest-plugin master: WIP: Testing all microversion tests on CI  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/94308622:30
opendevreviewSatoshi Shirosaka proposed openstack/ironic-python-agent master: WIP Add ContainerHardwareManager  https://review.opendev.org/c/openstack/ironic-python-agent/+/94171423:41
