stevebaker[m] | :P | 01:17 |
---|---|---|
opendevreview | Merged openstack/ironic master: fix glance metadata layout https://review.opendev.org/c/openstack/ironic/+/942496 | 03:36 |
frickler | and another apache2 restart failure. I wonder if it is coincidence that all failures I've seen recently are in raxflex. maybe those nodes are faster and thus more prone to trigger this kind of race condition? https://zuul.opendev.org/t/openstack/build/89e65f7557764fcf98e29651d0e8c9e5 | 07:54 |
rpittau | good morning ironic! o/ | 07:57 |
frickler | urgs, why do you do this? none of our mirrors are meant to be used hardcoded directly "IRONIC_GRUB2_FILE: https://mirror.iad3.inmotion.opendev.org/..." | 07:59 |
frickler | good morning rpittau | 07:59 |
frickler | oh, that's only from TheJulia's patch, /me goes fixing | 08:03 |
rpittau | hey frickler :) | 08:04 |
opendevreview | Dr. Jens Harbott proposed openstack/ironic master: Advanced vmedia deployment test ops https://review.opendev.org/c/openstack/ironic/+/898010 | 08:07 |
opendevreview | Verification of a change to openstack/ironic master failed: [CI] Use bigger partition as work dir for metal3 job https://review.opendev.org/c/openstack/ironic/+/943374 | 08:08 |
opendevreview | Merged openstack/networking-generic-switch master: Fix info leakage in Netmiko connection errors https://review.opendev.org/c/openstack/networking-generic-switch/+/943061 | 09:37 |
opendevreview | Merged openstack/ironic-tempest-plugin master: Validate automatic lessee https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/927545 | 09:57 |
opendevreview | minwoo seo proposed openstack/ironic-inspector master: Add action in plugin/rules.py for webhook https://review.opendev.org/c/openstack/ironic-inspector/+/942968 | 10:42 |
opendevreview | Merged openstack/ironic master: [CI] Use bigger partition as work dir for metal3 job https://review.opendev.org/c/openstack/ironic/+/943374 | 10:57 |
rpittau | \o/ | 11:35 |
dtantsur | Let's hope it remains stable until we can get to E2E tests | 12:27 |
dtantsur | (in all fairness, I don't know if these have a smaller footprint) | 12:27 |
opendevreview | cid proposed openstack/ironic master: doc: Update the runbook API usage https://review.opendev.org/c/openstack/ironic/+/944115 | 12:53 |
iurygregory | https://openinfra.org/blog/openinfra-linux-announcement it's official | 13:10 |
TheJulia | good morning | 13:46 |
opendevreview | cid proposed openstack/ironic master: doc: Update the runbook API usage https://review.opendev.org/c/openstack/ironic/+/944115 | 14:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: WIP: Add network simulator support for force10 os 10 https://review.opendev.org/c/openstack/ironic/+/943345 | 15:12 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Run metal3 integration job using UEFI boot (default) https://review.opendev.org/c/openstack/ironic/+/939694 | 16:01 |
dtantsur | Hey folks, since not everyone is in the metal3 channel: I've recorded an ironic-standalone-operator demo https://www.youtube.com/watch?v=LfA0Qt798QA | 16:02 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] Run metal3 integration job using UEFI boot (default) https://review.opendev.org/c/openstack/ironic/+/939694 | 16:04 |
opendevreview | Julia Kreger proposed openstack/ironic master: Add network simulator support for force10 OS 10 https://review.opendev.org/c/openstack/ironic/+/943345 | 17:25 |
opendevreview | Julia Kreger proposed openstack/ironic master: docs: detail network switch simulator support https://review.opendev.org/c/openstack/ironic/+/944139 | 17:25 |
opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: docs: add status for force10 os https://review.opendev.org/c/openstack/networking-generic-switch/+/943070 | 17:58 |
TheJulia | \o/ I got a force10 OS10 based switch to fire up | 18:11 |
TheJulia | and be configured, and have NGS do configuration | 18:11 |
TheJulia | <Insert dancing here> | 18:11 |
JayF | it'd be nice to get https://review.opendev.org/c/openstack/ironic/+/699953 in for the release; I just rechecked it | 18:43 |
JayF | cid: cardoe: https://review.opendev.org/c/openstack/ironic/+/931848 does this intersect at all with the sushy-aware oem support stuff? | 18:44 |
JayF | in general kinda unsure what to do with 931848 -- do we want cid to carry it over the line or what? | 18:45 |
TheJulia | It would. It could also be considered a fix depending on framing | 18:53 |
TheJulia | (speaking regarding the port binding change) | 18:54 |
TheJulia | (since that can guard against failure cases) | 18:54 |
shermanm | TheJulia: very cool! with OS10, did you happen to observe long delays for each ssh connection? We see it on all of our os10 switches, but I wasn't sure if it also shows up in the emulator | 18:57 |
TheJulia | I mainly focused on it working, not the actual timing | 19:03 |
TheJulia | It doesn't feel like the fastest thing to interact with, fwiw. It might also be specific switch settings as well. At least once the console was interactive, I was able to configure it and the ssh connections did the needful | 19:04 |
TheJulia | shermanm: my work at the moment is mainly focused on "does ngs *do* what it promises" not how it performs speed-wise, but I know neutron is also not the quickest about triggering plugins, so I'd honestly focus on an atomic example if possible | 19:06 |
shermanm | yep, that all makes sense, seems like it would also be helpful for things like the NGS error-regex. | 19:13 |
shermanm | I asked mostly because I'm working on a test-case for our scaling issues with these os10 switches, Dell basically told us that 10 seconds to start a new ssh connection to the switch is normal. (subsequent commands per connection are fine though) | 19:13 |
opendevreview | Julia Kreger proposed openstack/ironic master: Add network simulator support for force10 OS 10 https://review.opendev.org/c/openstack/ironic/+/943345 | 19:49 |
JayF | shermanm: "10 seconds to start a new ssh connection [snip] is normal" :-| ... that's a wild statement to me | 20:02 |
JayF | like I know it's probably true, and I'm not surprised it's true | 20:02 |
JayF | I'm surprised someone said that out loud without stopping themselves with a "how bad have we failed at software" ;) | 20:02 |
shermanm | yeah, it's par for the course with os10 :( . I'd done a bit of digging by logging in as the `linuxadmin` user, the actual time is taken by starting up the `clish` shell, which is the login shell for normal config users. It's also invoked by their restconf implementation, so that also takes 10s to run anything. | 20:05 |
shermanm | to work around this, I was poking around to see if I can get NGS to hold onto an open netmiko ssh connection per switch and reuse it | 20:07 |
JayF | Do you know concurrent connections allowed for the switch? | 20:07 |
shermanm | 1 | 20:07 |
JayF | So one dangling connection and we're locked out :/ | 20:08 |
shermanm | ah, I misunderstood. You can have 10+ connections, but you're explicitly only allowed to make changes from one, as there seems to be no support for locking, commit-confirmed, etc. | 20:09 |
JayF | ah, that's nicer | 20:09 |
shermanm | nothing enforces this btw, you just get buggy behavior | 20:09 |
JayF | some switches I've used in production in the past would reject ssh connections once you got past two | 20:09 |
shermanm | that part is thankfully configurable on these | 20:10 |
JayF | and if they were timing out; you were just SOL until the timeout hit | 20:10 |
JayF | It should be better; this story is a decade old lol | 20:10 |
-opendevstatus- | NOTICE: One of our Zuul job log storage providers is experiencing errors. We have removed that storage target from base jobs. You should be able to safely recheck changes now. | 20:23 |
TheJulia | shermanm: I did notice in my testing that the initial cli shell invocation takes 6+ seconds to fire up | 20:25 |
TheJulia | so that maps and aligns with ssh I guess | 20:25 |
TheJulia | I've been thinking we need to build a model and revisit the adhoc change model | 20:25 |
TheJulia | I've recently talked with folks who want switches reconfigured in sub-second, but the reality is "it takes time" | 20:27 |
JayF | it's always useful to keep in mind that the server reboot will still always be the long pole in the provisioning | 20:27 |
JayF | but I know people still want things now() anyway | 20:28 |
TheJulia | I'd frame it as POST | 20:28 |
TheJulia | since some methods you can go direct to booting workload, others... | 20:28 |
TheJulia | it takes a deployment | 20:28 |
JayF | ++ | 20:32 |
TheJulia | so, 10 seconds for anything makes me think we *really* need to do a bunch of batching | 20:32 |
JayF | we need nagle, but for switches | 20:33 |
shermanm | most switches are way better about this, but os10 in particular is just a headache | 20:33 |
JayF | "how likely is it we're going to get more configurations to set" | 20:33 |
TheJulia | If you're deploying the entire rack, totally likely | 20:33 |
TheJulia | if you're doing 2-3 in the same rack, sure, but maybe not right now | 20:34 |
JayF | yeah, but today we don't have a way to express that | 20:34 |
TheJulia | yup | 20:34 |
JayF | nagle is all about guessing the answer to that question, except for TCP packets that aren't full yet :D | 20:34 |
TheJulia | Yeah | 20:35 |
TheJulia | I'm thinking if a server has multiple ports we could know | 20:35 |
TheJulia | but each port is a separate bind | 20:35 |
TheJulia | so if I have 8 ports, I've got 80 seconds of switch configuration | 20:35 |
shermanm | how does this interact with the existing etcd-based batching? | 20:36 |
TheJulia | I'd have to dig into it | 20:36 |
shermanm | btw, for comparison, I tried dropping SONiC onto one of these same switches, and was able to "xargs ssh do thing" with like 40 connections no issue, 40 port reconfiguration in < 5 seconds. | 20:36 |
JayF | oh, good call. my mental model was more cross-switch LACP than people using bonds on a single switch | 20:38 |
TheJulia | hey, eavesdrop, update the logs so I can prod someone with a question | 20:38 |
* TheJulia | taps foot and looks at the bot | 20:40 |
shermanm | This "10s delay on everything" also interacts badly with the provisioning flow of "set vlan to provision net, remove from provision net, add to tenant net", as soon as more than a few nodes are in flight. And because it's currently still a synchronous operation, you start getting API timeouts from neutron. | 20:46 |
shermanm | At one point we'd just set all of the timeouts as high as possible and saw it take upwards of an hour to complete networking for 20 nodes. | 20:46 |
TheJulia | JayF: well, standing cross-switch I would think is typical and really outside of our need/logic/modeling | 20:46 |
JayF | I just meant that in terms of; I never considered we'd hit the same switch for "N" ports in a single deployment | 20:46 |
TheJulia | nodes, oh yeah. and even if it is not all bonded and just individual interfaces, that is still a ton of time. Portgroup bonds are going to go faster since it should just be one portgroup and get fanned out from there by ml2 execution | 20:47 |
TheJulia | OH | 20:47 |
TheJulia | yeah | 20:47 |
shermanm | one architecture that I liked was the way networking-odl did their ML2 driver https://github.com/openstack/networking-odl/blob/stable/2023.1/networking_odl/ml2/mech_driver_v2.py ; all of the operations get put into a journal, and a periodic worker applies changes from it | 20:59 |
TheJulia | That is kind of what I've been thinking | 21:02 |
TheJulia | "is the periodic running" -> "if not, add to list and trigger", "if so, add to list, and check to see when done" | 21:16 |
JayF | with some kind of catch for backpressure if the queue grows too fast | 21:21 |
shermanm | I'm not entirely sure, it's complicated by the fact that they're using an external SDN, so that periodic worker is reading the journal and sending API calls there. Maybe their doc here answers your question though? https://opendev.org/openstack/networking-odl/src/commit/fe40cbe13a9b9bca2ee740de0b5736b5806e54ae/doc/source/contributor/drivers_architecture.rst#v2_design | 21:21 |
shermanm | one inherent limit is that you only have ~40ish ports per switch. what do you do in the case where you have 5 queued operations to set the state for the same port? Just pick the latest one to apply? | 21:24 |
JayF | another good question | 21:24 |
shermanm | I feel that it's a reasonable thing to do, as the neutron DB has already accepted those changes, and I wouldn't want the state of a port to depend on the order of multiple calls; each port create/bind should fully describe the desired state. And this helps limit how large the queue can ever get. | 21:27 |
TheJulia | i *think* it does, but I'd need to check | 21:28 |
TheJulia | Anyway, I need to get going | 21:29 |
opendevreview | Adam McArthur proposed openstack/ironic-tempest-plugin master: WIP: Testing all microversion tests on CI https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/943086 | 22:30 |
opendevreview | Satoshi Shirosaka proposed openstack/ironic-python-agent master: WIP Add ContainerHardwareManager https://review.opendev.org/c/openstack/ironic-python-agent/+/941714 | 23:41 |
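
The journal idea discussed above (queue the operations, let a worker apply them in batches, and keep only the latest desired state per port) could be sketched roughly as follows. This is an illustration only, not code from networking-generic-switch or networking-odl: `PortJournal`, `apply_batch`, `open_session`, and `apply_port` are hypothetical names, and the two callables stand in for whatever actually talks to the switch.

```python
import threading
from collections import OrderedDict


class PortJournal:
    """Hypothetical per-switch journal keeping the latest desired state per port."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = OrderedDict()  # port name -> desired state

    def record(self, port, desired_state):
        # A newer request for the same port replaces the older one, so the
        # journal never holds more entries than the switch has ports.
        with self._lock:
            self._pending.pop(port, None)
            self._pending[port] = desired_state

    def drain(self):
        # Hand the whole batch to the worker and start a fresh one.
        with self._lock:
            batch, self._pending = self._pending, OrderedDict()
        return list(batch.items())


def apply_batch(journal, open_session, apply_port):
    """Apply everything queued so far over a single switch session.

    open_session and apply_port are caller-supplied callables (assumptions).
    """
    batch = journal.drain()
    if not batch:
        return 0
    session = open_session()  # pay the slow connection setup once per batch
    for port, desired_state in batch:
        apply_port(session, port, desired_state)
    return len(batch)
```

With this shape, eight port binds against one switch would cost one slow session setup instead of eight, assuming a periodic worker (not shown) calls `apply_batch` and that each queued entry fully describes the port's desired state.
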