opendevreview | Ghanshyam proposed openstack/ironic master: Run ironic-dbsync in grenade testing from venv https://review.opendev.org/c/openstack/ironic/+/932016 | 02:55 |
---|---|---|
opendevreview | Takashi Kajinami proposed openstack/metalsmith master: Drop SETUPTOOLS_USE_DISTUTILS=stdlib https://review.opendev.org/c/openstack/metalsmith/+/932020 | 03:53 |
rpittau | good morning ironic! o/ | 06:06 |
TheJulia | good morning! | 07:09 |
rpittau | hey TheJulia welcome to the Old World :) | 07:18 |
TheJulia | heh | 07:20 |
TheJulia | Thanks... I think ;) | 07:20 |
rpittau | :D | 08:58 |
iurygregory | good morning ironic | 09:37 |
opendevreview | cid proposed openstack/ironic-specs master: Add a Kea DHCP backend https://review.opendev.org/c/openstack/ironic-specs/+/931025 | 13:14 |
opendevreview | cid proposed openstack/ironic-specs master: Add a Kea DHCP backend https://review.opendev.org/c/openstack/ironic-specs/+/931025 | 13:54 |
opendevreview | cid proposed openstack/ironic-specs master: Add a Kea DHCP backend https://review.opendev.org/c/openstack/ironic-specs/+/931025 | 13:58 |
TheJulia | Hey guys, question for metal3 folks, did rebuild support ever get wired up? | 14:27 |
TheJulia | like, so we can trigger the rebuild verb for a node | 14:27 |
TheJulia | For example, I need to change the disk image | 14:27 |
dtantsur | TheJulia: no rebuild as you know it. Either use a new image URL or remove and re-add it. | 14:34 |
dtantsur | ref https://book.metal3.io/bmo/provisioning#reprovisioning | 14:34 |
opendevreview | Merged openstack/ironic stable/2023.1: Revert "ci: stable-only: explicitly pin centos build" https://review.opendev.org/c/openstack/ironic/+/921884 | 14:55 |
JayF | TheJulia: dtantsur: other cores who may be around: https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/thread/Z52INRWOC5GT3NFPG5427FM4PSXJFRUG/ is the public proposal for the stuff cores have been discussing | 15:06 |
JayF | cardoe: cid ^ you will find that email interesting | 15:06 |
cardoe | I'm honored for the nomination. | 15:08 |
cardoe | cid: wrt to the kea DHCP backend, I was wondering if a side outcome could be capturing better what things we need to pass for DHCP in generic terms. | 15:14 |
JayF | What do you mean by that? | 15:15 |
cardoe | The reason I ask is that I'm running my Ironic inside of k8s using OpenStack Helm right now and the DHCP backends make assumptions around being on the same physical filesystem as the DHCP server. | 15:15 |
JayF | DHCP backends in Ironic or Neutron? | 15:16 |
cardoe | Ironic | 15:16 |
cardoe | Obviously the neutron one is sending neutron commands. | 15:16 |
JayF | I'd suggest that might be a separate spec | 15:16 |
JayF | well, I mean, we're going to do a neutron-dhcp-agent for kea as well | 15:16 |
cardoe | Just thinking around some RPC mechanism. | 15:18 |
cardoe | So I can run DHCP not in a 1:1 fashion with conductors | 15:18 |
JayF | So lets develop some kind of like, network configuration API, that when we tell it to, it configures a DHCP server via an agent. That sounds space-riffic so I think we can call it "Neutron"! | 15:19 |
JayF | /s obviously but this is something I'm thinking about | 15:19 |
JayF | "conductors are in the same subnet as the nodes they provision" is a reasonable assumption that many Ironic pieces make | 15:20 |
JayF | (for example: Grub won't cross a network boundary to network boot AIUI) | 15:21 |
cardoe | So you're not wrong about the fact that it's neutron. I agree. | 15:24 |
cardoe | There's also the Mercury proposal from Julia to have a different network backend instead of neutron. | 15:24 |
cardoe | So my RPC thought was tied to Mercury. | 15:25 |
cardoe | I don't have specifics until Mercury moves forward more. It was more the current code is a bit racy if you NFS mount the path. | 15:30 |
opendevreview | Verification of a change to openstack/ironic-python-agent stable/2023.1 failed: Call evaluate_hardware_support exactly once per hwm https://review.opendev.org/c/openstack/ironic-python-agent/+/920216 | 15:40 |
opendevreview | Verification of a change to openstack/ironic-python-agent stable/2023.1 failed: Call evaluate_hardware_support exactly once per hwm https://review.opendev.org/c/openstack/ironic-python-agent/+/920216 | 16:06 |
opendevreview | Scott Tran proposed openstack/sushy master: Add Port interface https://review.opendev.org/c/openstack/sushy/+/932096 | 16:20 |
cid | cardoe: My brain interpreted your question as, "finding out what things we need to pass for DHCP" no matter the backend while at the Kea change (?). | 16:25 |
rpittau | good night! o/ | 16:26 |
cid | \o | 16:28 |
opendevreview | Scott Tran proposed openstack/sushy master: Add Port interface https://review.opendev.org/c/openstack/sushy/+/932096 | 16:42 |
JayF | cardoe: so mercury is about hooking ironic up to NGS, at least in practice today, and NGS is a remote agent | 16:56 |
JayF | cardoe: you'd still need the conductor in the same subnet for most cases | 16:56 |
JayF | cardoe: I'm not saying "we can't/shouldn't do this" so much as "this is a completely different case than kea support" | 16:57 |
JayF | at least imho | 16:57 |
cardoe | I'm not disagreeing with that. | 16:57 |
JayF | I do wonder if kea, in practice, can help solve it | 16:57 |
JayF | it does have clustering/replication features | 16:57 |
cardoe | I've got 2 conductors but 1 dnsmasq | 16:57 |
JayF | yeah that's honestly just wrong | 16:57 |
JayF | what kinda horrible person would've left an ironic install with a bunch of static dhcp stuff /me hides | 16:58 |
cardoe | I'm just using that as an example. | 16:58 |
JayF | (we did horrible dhcp things for onmetal; mainly didn't let ironic do it at all; used static dhcp configs and network switching did the work) | 16:58 |
cardoe | is that really horrible to have multiple conductors for redundancy? | 16:58 |
JayF | No, but they each need to run their own dnsmasq, yeah? | 16:58 |
JayF | I don't think upstream a shared dnsmasq across two conductors is supported or makes sense | 16:59 |
JayF | but maybe I'm not visualizing the shape properly | 16:59 |
cardoe | Yep. But can't have unmanaged inspection on. | 16:59 |
cardoe | Cause one dnsmasq sees the box as unmanaged and the other doesn't | 16:59 |
JayF | have you seen the noise about how unmanaged inspection support is going to go away if people don't make loud noises about wanting you? | 17:00 |
JayF | s/you/it | 17:00 |
JayF | you might wanna make loud noises if you use that feature :D | 17:00 |
cardoe | I don't actually in how it works today. | 17:00 |
JayF | okay, cool, yeah, it creates a *lot* of tough problems which is why we don't love it so much | 17:00 |
cardoe | So what I want is out-of-band inspection to be great. | 17:03 |
JayF | I think we're there, or getting there :D | 17:03 |
cardoe | I'm happy to share my thoughts / ideas and folks can tell me if it's dumb. | 18:12 |
cardoe | JayF: hundo percent not there. | 18:12 |
JayF | Do we have RFEs for the stuff you want that doesn't exit? | 18:13 |
JayF | **exist | 18:13 |
cardoe | I gotta go look | 18:14 |
cardoe | https://bugs.launchpad.net/ironic/+bugs?field.tag=rfe right? | 18:15 |
JayF | or rfe-approved | 18:16 |
cardoe | So I'll just blab first here. I already blabbed in #-keystone after I got accused of not providing enough context and not reading docs before submitting a patch to the docs. | 18:19 |
cardoe | So some background context, we're using https://github.com/nautobot/nautobot for DCIM / IPAM | 18:21 |
cardoe | We define hardware types like... https://github.com/netbox-community/devicetype-library/blob/master/device-types/Dell/PowerEdge-R7615.yaml | 18:22 |
cardoe | But even more so than that. | 18:23 |
cardoe | So from my personal experience, we've needed to have the same chassis, processor, RAM, PCI devices (including sub-vendor/sub-device). | 18:27 |
cardoe | So that's defined for a flavor which is a child of a device type. | 18:27 |
JayF | I don't understand where that lands in use case terms | 18:28 |
cardoe | So getting there. | 18:28 |
cardoe | So one of my guys (not sure if he's on here) wrote some extra inspection hooks. We compare the data and determine what flavor it is. If it matches. | 18:29 |
cardoe | And set that as the resource class | 18:29 |
cardoe | Which then nova flavors use for scheduling. Which then jamesdenton consumes for hypervisors and whatever else to give clarkb VMs for Zuul. | 18:30 |
cardoe | But to run the agent inspector there's quite a bit of data that needs to be in there for the node. | 18:31 |
cardoe | So rather than having the out-of-band and then the agent inspector the thought is just do it once. | 18:31 |
cardoe | Hence why Scott Tran is submitting patches to sushy. | 18:31 |
cardoe | I'm wanting to enable redfish inspector to use inspection hooks as well to modify the node. | 18:32 |
cardoe | We grab node asset tag / serial, put an SSL certificate on there, grab all the PCI card info, grab all the NIC info, grab LLDP info to populate physical network and local link connection. | 18:33 |
JayF | so redfish inspection interface supporting inspection rules is like, 90% of what you need? | 18:34 |
* JayF | does not have the firmest understanding of inspection | 18:35 |
JayF | it's always been similar-but-different from our other flows | 18:35 |
cardoe | So inspection rules vs inspection hooks.... what's the difference? | 18:36 |
dtantsur | cardoe: hooks are written in python and available as local modules; rules are written in a special DSL and can be created via API | 18:39 |
dtantsur | their purpose is pretty similar, just the audiences slightly differ | 18:40 |
cardoe | ah okay cool. | 18:41 |
dtantsur | fwiw unifying in-band and out-of-band inspections was one of my background thoughts when I started the effort to merge inspection in ironic | 18:42 |
cardoe | we've got a DHCP server on the BMC network. When it gives out an IP, we trigger our out-of-band inspection. Our BMCs are normally static IP. But it's also idempotent so we can inspect the same box multiple times so there's cases when they get inspected without DHCPing. | 18:43 |
cardoe | dtantsur: while you're here... I SWEAR I read some proposal from you for having vendor interfaces overriding the generic interface without needing to have a custom interface. I think it was for idrac. | 18:44 |
cardoe | So like we use "redfish" in the config and it transparently knows to use "idrac-redfish" | 18:44 |
samcat116 | Hah, I'm building a very similar workflow with Netbox/Ironic | 18:45 |
cardoe | We thought we'd use the SSoT stuff a bunch so we went Nautobot. | 18:45 |
samcat116 | We were on Netbox before the split, and IMO none of the other stuff on the Nautobot side has tempted me | 18:46 |
samcat116 | For us a lot of our stuff is SNMP based, so we can't fully do discovery/onboarding as we have no way to discover the PDU outlet | 18:47 |
samcat116 | But I was gonna work on a service for "Node is tagged in netbox and thats enrolled in ironic" | 18:47 |
cardoe | So our UUIDs are in sync between both systems. | 18:48 |
samcat116 | In a custom field on the NB side I assume? | 18:49 |
cardoe | Nope. We put the device into NB first and use that UUID to create it in Ironic. | 18:49 |
cardoe | Some OpenStack APIs allow you to do that pattern. I usually refer to it as upsert. | 18:49 |
cardoe | Like I argue that the REST style is POST /api/object/ -> make me a brand new one... PUT /api/object/<id> -> create or replace and PATCH /api/object/<id> -> update existing | 18:50 |
cardoe | The PUT being the upsert. | 18:51 |
samcat116 | Here was my dream workflow which I haven't gotten around to fully implementing: someone racks a device and plugs in every cable. Its set to PXE by default so boots to inspector with some custom scripts to grab all the info we'd want. That writes it both to Ironic to enroll the node and to Netbox to track it. | 18:56 |
JayF | Disclaimer: this is not a comment reflecting if Ironic should or should not implement that workflow | 19:00 |
JayF | but ... I don't see the appeal at all | 19:00 |
JayF | I always saw it as a feature-not-a-bug that most onboarding workflows I've worked with started with an assertion about what we ordered/racked/had delivered that we could compare against | 19:00 |
JayF | rather than trusting that the hardware that was racked was the right thing | 19:00 |
cardoe | samcat116: that's what we're working on as well. | 19:02 |
cardoe | JayF: so the resource-class is a custom field in our Nautobot. "unknown" is bad | 19:02 |
JayF | at which point a device, that for unknown reasons has been modified, is already live on your network | 19:03 |
cardoe | So right now we're doing 1 server at a time | 19:03 |
cardoe | Well anytime a DC tech touches a box (programmatically) we re-run inspection. | 19:04 |
JayF | hehe I know from experience that's a /great/ idea | 19:04 |
JayF | might wanna run inspection on the servers on either side of it in the rack too ;) | 19:04 |
cardoe | So current Ironic stuff we don't | 19:06 |
cardoe | But existing OSPC stuff that I own... we at a minimum LLDP the whole rack | 19:06 |
cardoe | So I know you said you don't see what the appeal is... but for me personally I don't care what the hardware is that showed up. | 19:08 |
cardoe | If it's a known flavor to me, great I'll use it and throw it into the available pool. | 19:08 |
samcat116 | I think the difference is that we aren't in a multi-tenant compute-providing environment, we're in a device testing environment for different software versions on different hardware types | 19:09 |
cardoe | I'm multi-tenant. | 19:09 |
JayF | It's interesting, just different ways to view operations | 19:09 |
JayF | I like a bit more top-down control; documentation decides what the environment looks like | 19:09 |
cardoe | So docs still decide. | 19:10 |
JayF | you all are describing something more descriptive; creating documentation for what reality is (and reacting to it) | 19:10 |
JayF | same output just different pathway | 19:10 |
samcat116 | Where was that documentation traditionally held? | 19:11 |
JayF | At almost every company I've worked at, they had a database that was directly interacted with by the purchasing team | 19:12 |
JayF | that database was considered source of truth, and was used to feed into ironic | 19:13 |
samcat116 | Interesting | 19:14 |
JayF | so imagine this flow; which generically describes I think 3 outta 4 ironic clusters I've helped manage: | 19:14 |
JayF | Hardware is ordered, added to internal DB system as ordered | 19:15 |
JayF | arrives, is racked, DC verifies manually against internal system (theoretically lol), moved into a state that says like "ready to go" or whatever | 19:15 |
JayF | automation sees machines ready to go, grabs them, uses the metadata to add into ironic | 19:15 |
JayF | then updates the state in the internal db system to "ironic has this" which would (in some places) lock out updates on the other DB unless the machine was in manageable/maintenance/some other sign | 19:16 |
samcat116 | Yeah for us Netbox is that database and has all that info. Additionally I'd have to do some sort of discovery as I'd never get all the info I'd need from purchasing (what are the MACs of the interfaces and the LLDP neighbor info so I know what switch port its plugged into so NGS can program it on deploy), so I might as well have automation fill it in initially, rather than us fill in the device name and serial while ignoring everything else | 19:18 |
JayF | maybe that's some of the difference | 19:19 |
JayF | I'm used to bigger environments that have more layers | 19:19 |
JayF | but the workflow you talk about might be a LOT more ergonomic for teams that are more tightly integrated with the datacenter/hardware processes | 19:20 |
samcat116 | yeah im in more of a lab environment than a production DC | 19:21 |
JayF | thanks for walking through this with me | 19:22 |
JayF | just trying to understand the underlying motivations not just the workflow | 19:22 |
samcat116 | The netbox folks have a public demo instance with some demo data if you want to poke around and see what kind of things are tracked and how they're represented: https://demo.netbox.dev/ | 19:23 |
cardoe | So JayF same thing. Hardware is ordered in one system. It's not Nautobot for us. | 19:30 |
cardoe | But we can add it as "Ordered" or not. | 19:30 |
JayF | That seems like a core part of most workflows ;) | 19:31 |
cardoe | So Nautobot/NetBox require devices to have a location. They need to have a Rack and a U location to be created. I think "Ordered" devices can be assigned to a Rack without a U slot. But any other status requires more. | 19:35 |
cardoe | The DC people can come into Nautobot and create a rack of type X. Add the switch and mark it "Ready to Go". | 19:36 |
cardoe | s/the switch/the BMC switch/ | 19:36 |
cardoe | At that point we'll lay down the base switch config | 19:37 |
cardoe | well really add all the switches. | 19:37 |
cardoe | The switch health should go green. | 19:37 |
cardoe | But with the BMC switch they'll start trying to DHCP and so our inspection process starts off and starts adding them into the rack. | 19:38 |
cardoe | We specify that rack type X has BMC switch port 1 = U3 let's say to start. | 19:39 |
cardoe | So it pre-populates all the boxes in like that. Shows the flavor of machine and the asset tag. | 19:40 |
cardoe | unknown is bad | 19:40 |
cardoe | the rest? well that's on them to validate. The little pull out tab on the chassis shows the asset tag. | 19:41 |
cardoe | If they're supposed to be "cpu2.large" or "gpu2.large", well that's on them to validate with the order. | 19:41 |
cardoe | Sometimes they stop chewing on crayons long enough to validate it. | 19:42 |
JayF | I think part of why I'm so insistent about documentation-first is that I had the experience, a lot, at rackspace of devices being miscabled | 19:42 |
cardoe | But once all the machines are a known flavor they can approve it and off it goes getting added to Ironic and enabled | 19:42 |
JayF | so asserting "it's plugged into these two switches, and these two switchports" helped us avoid pain a lot of time | 19:43 |
cardoe | Yes. That's part of that inspection process to go green. | 19:43 |
JayF | since we'd often end up with maybe two devices miscabled into each others' switches (like two cables from each into the same tor, as opposed to one cable to each tor) | 19:43 |
cardoe | LLDP info is used. | 19:43 |
JayF | yep, same | 19:43 |
JayF | we totally implemented a thing 4 different ways in 4 different places over a few years there it seems lol | 19:44 |
cardoe | I'm sure. Cause it needed to be a basic thing provided across the WHOLE company. | 19:44 |
jamesdenton | JayF miscabled? How dare you. | 19:47 |
jamesdenton | you just didn't know you wanted it that way | 19:47 |
JayF | jamesdenton: nobody screwed up onmetal as bad as cisco screwed up in v1 :-| | 19:47 |
JayF | might have been a bug around them just randomly not enforcing the security features they built just for us | 19:48 |
jamesdenton | nothing custom ever goes awry | 19:48 |
JayF | I was putting the finishing touches on a test to get maintainership in gentoo today | 19:50 |
JayF | and one of the answers was effectively "don't patch in meaningful features, it's dumb" :) | 19:51 |
jamesdenton | haha | 19:51 |
cardoe | jamesdenton: was what I said sensible? | 19:53 |
cardoe | Almost like I should take this IRC log and make a doc out of it or a blog post. I'm not as fancy as you and Kevin though. | 19:54 |
JayF | serious suggestion: if you wanna do that, take your messages from the logs, and ask chatgpt to draft it | 19:54 |
jamesdenton | we have a docs site, you should check it out | 19:55 |
JayF | you might get 75% of the way there for free-ish | 19:55 |
cardoe | So question... Scott is adding a sample file for testing.. https://review.opendev.org/c/openstack/sushy/+/932096/2/sushy/tests/unit/json_samples/port.json#56 that's from real hardware. It's spelled "Inteface". So his patch is failing codespell. Should he fix that spelling or ignore it since its real hardware? | 20:24 |
JayF | the codespell target should be updated to ignore that whole dir | 20:25 |
JayF | json_samples implies we are pulling direct samples | 20:25 |
opendevreview | Scott Tran proposed openstack/sushy master: Add Port interface https://review.opendev.org/c/openstack/sushy/+/932096 | 20:28 |
cardoe | okay I'll add that. He decided to just update it. | 20:31 |
JayF | I suggest explicitly leaving it misspelled | 20:32 |
JayF | if it's misspelled in the real machine | 20:32 |
JayF | with maybe even a note to that effect in the test or the commit | 20:32 |
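
The POST / PUT / PATCH split that cardoe describes at 18:50 (with PUT acting as the "upsert") can be sketched with a tiny in-memory store. This is only an illustration of the semantics under stated assumptions, not the Ironic, Nautobot, or NetBox API: `ObjectStore` and its method names are hypothetical, and the dictionary stands in for whatever backs `/api/object/`.

```python
import uuid


class ObjectStore:
    """Hypothetical in-memory stand-in for a REST collection at /api/object/."""

    def __init__(self):
        self._objects = {}

    def post(self, body):
        """POST /api/object/ : always create a brand new object, server picks the id."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = dict(body)
        return object_id

    def put(self, object_id, body):
        """PUT /api/object/<id> : create or replace (the upsert).

        Returns True if the object was created, False if it was replaced.
        """
        created = object_id not in self._objects
        self._objects[object_id] = dict(body)
        return created

    def patch(self, object_id, changes):
        """PATCH /api/object/<id> : update an existing object, fail if it is absent."""
        if object_id not in self._objects:
            raise KeyError(object_id)
        self._objects[object_id].update(changes)

    def get(self, object_id):
        """GET /api/object/<id> : return a copy of the stored object."""
        return dict(self._objects[object_id])


if __name__ == "__main__":
    store = ObjectStore()
    # The caller supplies the id, e.g. a UUID already assigned in the DCIM system.
    device_id = str(uuid.uuid4())
    store.put(device_id, {"name": "node-1"})
    # Repeating the same PUT is safe: it replaces rather than duplicates.
    store.put(device_id, {"name": "node-1"})
    store.patch(device_id, {"resource_class": "cpu2.large"})
    print(store.get(device_id))
```

Because the caller chooses the id for PUT, the same UUID can be used in both systems, which is the property the 18:48 to 18:49 messages rely on.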