Thursday, 2024-10-10

<opendevreview> Ghanshyam proposed openstack/ironic master: Run ironic-dbsync in grenade testing from venv  https://review.opendev.org/c/openstack/ironic/+/932016 [02:55]
<opendevreview> Takashi Kajinami proposed openstack/metalsmith master: Drop SETUPTOOLS_USE_DISTUTILS=stdlib  https://review.opendev.org/c/openstack/metalsmith/+/932020 [03:53]
<rpittau> good morning ironic! o/ [06:06]
<TheJulia> good morning! [07:09]
<rpittau> hey TheJulia welcome to the Old World :) [07:18]
<TheJulia> heh [07:20]
<TheJulia> Thanks... I think ;) [07:20]
<rpittau> :D [08:58]
<iurygregory> good morning ironic [09:37]
<opendevreview> cid proposed openstack/ironic-specs master: Add a Kea DHCP backend  https://review.opendev.org/c/openstack/ironic-specs/+/931025 [13:14]
<opendevreview> cid proposed openstack/ironic-specs master: Add a Kea DHCP backend  https://review.opendev.org/c/openstack/ironic-specs/+/931025 [13:54]
<opendevreview> cid proposed openstack/ironic-specs master: Add a Kea DHCP backend  https://review.opendev.org/c/openstack/ironic-specs/+/931025 [13:58]
<TheJulia> Hey guys, question for metal3 folks, did rebuild support ever get wired up? [14:27]
<TheJulia> like, so we can trigger the rebuild verb for a node [14:27]
<TheJulia> For example, I need to change the disk image [14:27]
<dtantsur> TheJulia: no rebuild as you know it. Either use a new image URL or remove and re-add it. [14:34]
<dtantsur> ref https://book.metal3.io/bmo/provisioning#reprovisioning [14:34]
<opendevreview> Merged openstack/ironic stable/2023.1: Revert "ci: stable-only: explicitly pin centos build"  https://review.opendev.org/c/openstack/ironic/+/921884 [14:55]
<JayF> TheJulia: dtantsur: other cores who may be around: https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/thread/Z52INRWOC5GT3NFPG5427FM4PSXJFRUG/ is the public proposal for the stuff cores have been discussing [15:06]
<JayF> cardoe: cid ^ you will find that email interesting [15:06]
<cardoe> I'm honored for the nomination. [15:08]
<cardoe> cid: wrt the Kea DHCP backend, I was wondering if a side outcome could be capturing better what things we need to pass for DHCP in generic terms. [15:14]
<JayF> What do you mean by that? [15:15]
<cardoe> The reason I ask is that I'm running my Ironic inside of k8s using OpenStack Helm right now and the DHCP backends make assumptions around being on the same physical filesystem as the DHCP server. [15:15]
<JayF> DHCP backends in Ironic or Neutron? [15:16]
<cardoe> Ironic [15:16]
<cardoe> Obviously the neutron one is sending neutron commands. [15:16]
<JayF> I'd suggest that might be a separate spec [15:16]
<JayF> well, I mean, we're going to do a neutron-dhcp-agent for kea as well [15:16]
<cardoe> Just thinking around some RPC mechanism. [15:18]
<cardoe> So I can run DHCP not in a 1:1 fashion with conductors [15:18]
<JayF> So let's develop some kind of like, network configuration API, that when we tell it to, it configures a DHCP server via an agent. That sounds space-riffic so I think we can call it "Neutron"! [15:19]
<JayF> /s obviously but this is something I'm thinking about [15:19]
<JayF> "conductors are in the same subnet as the nodes they provision" is a reasonable assumption that many Ironic pieces make [15:20]
<JayF> (for example: Grub won't cross a network boundary to network boot AIUI) [15:21]
<cardoe> So you're not wrong about the fact that it's neutron. I agree. [15:24]
<cardoe> There's also the Mercury proposal from Julia to have a different network backend instead of neutron. [15:24]
<cardoe> So my RPC thought was tied to Mercury. [15:25]
<cardoe> I don't have specifics until Mercury moves forward more. It was more the current code is a bit racy if you NFS mount the path. [15:30]
<opendevreview> Verification of a change to openstack/ironic-python-agent stable/2023.1 failed: Call evaluate_hardware_support exactly once per hwm  https://review.opendev.org/c/openstack/ironic-python-agent/+/920216 [15:40]
<opendevreview> Verification of a change to openstack/ironic-python-agent stable/2023.1 failed: Call evaluate_hardware_support exactly once per hwm  https://review.opendev.org/c/openstack/ironic-python-agent/+/920216 [16:06]
<opendevreview> Scott Tran proposed openstack/sushy master: Add Port interface  https://review.opendev.org/c/openstack/sushy/+/932096 [16:20]
<cid> cardoe: My brain interpreted your question as, "finding out what things we need to pass for DHCP" no matter the backend while at the Kea change (?). [16:25]
<rpittau> good night! o/ [16:26]
<cid> \o [16:28]
<opendevreview> Scott Tran proposed openstack/sushy master: Add Port interface  https://review.opendev.org/c/openstack/sushy/+/932096 [16:42]
<JayF> cardoe: so mercury is about hooking ironic up to NGS, at least in practice today, and NGS is a remote agent [16:56]
<JayF> cardoe: you'd still need the conductor in the same subnet for most cases [16:56]
<JayF> cardoe: I'm not saying "we can't/shouldn't do this" so much as "this is a completely different case than kea support" [16:57]
<JayF> at least imho [16:57]
<cardoe> I'm not disagreeing with that. [16:57]
<JayF> I do wonder if kea, in practice, can help solve it [16:57]
<JayF> it does have clustering/replication features [16:57]
<cardoe> I've got 2 conductors but 1 dnsmasq [16:57]
<JayF> yeah that's honestly just wrong [16:57]
<JayF> what kinda horrible person would've left an ironic install with a bunch of static dhcp stuff /me hides [16:58]
<cardoe> I'm just using that as an example. [16:58]
<JayF> (we did horrible dhcp things for onmetal; mainly didn't let ironic do it at all; used static dhcp configs and network switching did the work) [16:58]
<cardoe> is that really horrible to have multiple conductors for redundancy? [16:58]
<JayF> No, but they each need to run their own dnsmasq, yeah? [16:58]
<JayF> I don't think upstream a shared dnsmasq across two conductors is supported or makes sense [16:59]
<JayF> but maybe I'm not visualizing the shape properly [16:59]
<cardoe> Yep. But can't have unmanaged inspection on. [16:59]
<cardoe> Cause one dnsmasq sees the box as unmanaged and the other doesn't [16:59]
<JayF> have you seen the noise about how unmanaged inspection support is going to go away if people don't make loud noises about wanting you? [17:00]
<JayF> s/you/it [17:00]
<JayF> you might wanna make loud noises if you use that feature :D [17:00]
<cardoe> I don't actually in how it works today. [17:00]
<JayF> okay, cool, yeah, it creates a *lot* of tough problems which is why we don't love it so much [17:00]
<cardoe> So what I want is out-of-band inspection to be great. [17:03]
<JayF> I think we're there, or getting there :D [17:03]
<cardoe> I'm happy to share my thoughts / ideas and folks can tell me if it's dumb. [18:12]
<cardoe> JayF: hundo percent not there. [18:12]
<JayF> Do we have RFEs for the stuff you want that doesn't exit? [18:13]
<JayF> **exist [18:13]
<cardoe> I gotta go look [18:14]
<cardoe> https://bugs.launchpad.net/ironic/+bugs?field.tag=rfe right? [18:15]
<JayF> or rfe-approved [18:16]
<cardoe> So I'll just blab first here. I already blabbed in #-keystone after I got accused of not providing enough context and not reading docs before submitting a patch to the docs. [18:19]
<cardoe> So some background context, we're using https://github.com/nautobot/nautobot for DCIM / IPAM [18:21]
<cardoe> We define hardware types like... https://github.com/netbox-community/devicetype-library/blob/master/device-types/Dell/PowerEdge-R7615.yaml [18:22]
<cardoe> But even more so than that. [18:23]
<cardoe> So from my personal experience, we've needed to have the same chassis, processor, RAM, PCI devices (including sub-vendor/sub-device). [18:27]
<cardoe> So that's defined for a flavor which is a child of a device type. [18:27]
<JayF> I don't understand where that lands in use case terms [18:28]
<cardoe> So getting there. [18:28]
<cardoe> So one of my guys (not sure if he's on here) wrote some extra inspection hooks. We compare the data and determine what flavor it is. If it matches. [18:29]
<cardoe> And set that as the resource class [18:29]
<cardoe> Which then nova flavors use for scheduling. Which then jamesdenton consumes for hypervisors and whatever else to give clarkb VMs for Zuul. [18:30]
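The flavor matching cardoe describes (compare inspected chassis, processor, RAM and PCI devices against flavor definitions, then use the match as the resource class) could be sketched roughly as follows. This is only an illustration: the flavor names, field names, and values are hypothetical, and it does not use the real Ironic inspection hook interface or any Nautobot API.

```python
from typing import Any, Mapping

# Hypothetical flavor definitions. Every name, key and value here is an
# illustrative assumption, not data from Nautobot or Ironic.
FLAVORS: dict[str, dict[str, Any]] = {
    "cpu-example": {
        "chassis": "example-chassis",
        "cpu_model": "example-cpu",
        "memory_gib": 256,
        "pci_devices": frozenset(),
    },
    "gpu-example": {
        "chassis": "example-chassis",
        "cpu_model": "example-cpu",
        "memory_gib": 256,
        # (vendor, device, sub-vendor, sub-device) tuples
        "pci_devices": frozenset({("v1", "d1", "sv1", "sd1")}),
    },
}


def normalise(inspected: Mapping[str, Any]) -> dict[str, Any]:
    """Copy the inspection data and make the PCI list order-independent."""
    data = dict(inspected)
    data["pci_devices"] = frozenset(
        tuple(dev) for dev in data.get("pci_devices", ())
    )
    return data


def resource_class_for(
    inspected: Mapping[str, Any],
    flavors: Mapping[str, Mapping[str, Any]] = FLAVORS,
) -> str:
    """Return the first flavor whose every field matches, else "unknown"."""
    data = normalise(inspected)
    for name, spec in flavors.items():
        if all(data.get(key) == value for key, value in spec.items()):
            return name
    return "unknown"
```

A node that matches no definition comes back as "unknown", which lines up with the "unknown is bad" convention mentioned later in the log.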
<cardoe> But to run the agent inspector there's quite a bit of data that needs to be in there for the node. [18:31]
<cardoe> So rather than having the out-of-band and then the agent inspector the thought is just do it once. [18:31]
<cardoe> Hence why Scott Tran is submitting patches to sushy. [18:31]
<cardoe> I'm wanting to enable redfish inspector to use inspection hooks as well to modify the node. [18:32]
<cardoe> We grab node asset tag / serial, put an SSL certificate on there, grab all the PCI card info, grab all the NIC info, grab LLDP info to populate physical network and local link connection. [18:33]
<JayF> so redfish inspection interface supporting inspection rules is like, 90% of what you need? [18:34]
* JayF does not have the firmest understanding of inspection [18:35]
<JayF> it's always been similar-but-different from our other flows [18:35]
<cardoe> So inspection rules vs inspection hooks.... what's the difference? [18:36]
<dtantsur> cardoe: hooks are written in python and available as local modules; rules are written in a special DSL and can be created via API [18:39]
<dtantsur> their purpose is pretty similar, just the audiences slightly differ [18:40]
<cardoe> ah okay cool. [18:41]
<dtantsur> fwiw unifying in-band and out-of-band inspections was one of my background thoughts when I started the effort to merge inspection in ironic [18:42]
<cardoe> we've got a DHCP server on the BMC network. When it gives out an IP, we trigger our out-of-band inspection. Our BMCs are normally static IP. But it's also idempotent so we can inspect the same box multiple times so there's cases when they get inspected without DHCPing. [18:43]
<cardoe> dtantsur: while you're here... I SWEAR I read some proposal from you for having vendor interfaces overriding the generic interface without needing to have a custom interface. I think it was for idrac. [18:44]
<cardoe> So like we use "redfish" in the config and it transparently knows to use "idrac-redfish" [18:44]
<samcat116> Hah, I'm building a very similar workflow with Netbox/Ironic [18:45]
<cardoe> We thought we'd use the SSoT stuff a bunch so we went Nautobot. [18:45]
<samcat116> We were on Netbox before the split, and IMO none of the other stuff on the Nautobot side has tempted me [18:46]
<samcat116> For us a lot of our stuff is SNMP based, so we can fully do discovery/onboarding as we have no way to discover the PDU outlet [18:47]
<samcat116> But I was gonna work on a service for "Node is tagged in netbox and that's enrolled in ironic" [18:47]
<cardoe> So our UUIDs are in sync between both systems. [18:48]
<samcat116> In a custom field on the NB side I assume? [18:49]
<cardoe> Nope. We put the device into NB first and use that UUID to create it in Ironic. [18:49]
<cardoe> Some OpenStack APIs allow you to do that pattern. I usually refer to it as upsert. [18:49]
<cardoe> Like I argue that the REST style is POST /api/object/ -> make me a brand new one... PUT /api/object/<id> -> create or replace and PATCH /api/object/<id> -> update existing [18:50]
<cardoe> The PUT being the upsert. [18:51]
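The POST / PUT / PATCH split cardoe argues for can be illustrated with a small in-memory store. This is a sketch of the semantics only, under the assumption that the caller may supply its own UUID (as with the NB-first workflow above); the class and method names are hypothetical and do not correspond to any OpenStack API.

```python
import uuid
from typing import Any


class ObjectStore:
    """Toy store modelling POST (create), PUT (upsert) and PATCH (update)."""

    def __init__(self) -> None:
        self._objects: dict[str, dict[str, Any]] = {}

    def post(self, body: dict[str, Any]) -> str:
        """POST /api/object/ : always create a new object with a fresh ID."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = dict(body)
        return object_id

    def put(self, object_id: str, body: dict[str, Any]) -> bool:
        """PUT /api/object/<id> : create or replace. Returns True if created."""
        created = object_id not in self._objects
        self._objects[object_id] = dict(body)
        return created

    def patch(self, object_id: str, changes: dict[str, Any]) -> None:
        """PATCH /api/object/<id> : update an existing object, else KeyError."""
        if object_id not in self._objects:
            raise KeyError(object_id)
        self._objects[object_id].update(changes)

    def get(self, object_id: str) -> dict[str, Any]:
        """Return a copy of the stored object."""
        return dict(self._objects[object_id])
```

With this shape, a system that already owns the UUID can call `put` repeatedly with the same ID and get the same end state, which is what makes the pattern convenient for keeping two systems in sync.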
<samcat116> Here was my dream workflow which I haven't gotten around to fully implementing: someone racks a device and plugs in every cable. It's set to PXE by default so boots to inspector with some custom scripts to grab all the info we'd want. That writes it both to Ironic to enroll the node and to Netbox to track it. [18:56]
<JayF> Disclaimer: this is not a comment reflecting if Ironic should or should not implement that workflow [19:00]
<JayF> but ... I don't see the appeal at all [19:00]
<JayF> I always saw it as a feature-not-a-bug that most onboarding workflows I've worked with started with an assertion about what we ordered/racked/had delivered that we could compare against [19:00]
<JayF> rather than trusting that the hardware that was racked was the right thing [19:00]
<cardoe> samcat116: that's what we're working on as well. [19:02]
<cardoe> JayF: so the resource-class is a custom field in our Nautobot. "unknown" is bad [19:02]
<JayF> at which point a device, that for unknown reasons has been modified, is already live on your network [19:03]
<cardoe> So right now we're doing 1 server at a time [19:03]
<cardoe> Well anytime a DC tech touches a box (programmatically) we re-run inspection. [19:04]
<JayF> hehe I know from experience that's a /great/ idea [19:04]
<JayF> might wanna run inspection on the servers on either side of it in the rack too ;) [19:04]
<cardoe> So current Ironic stuff we don't [19:06]
<cardoe> But existing OSPC stuff that I own... we at a minimum LLDP the whole rack [19:06]
<cardoe> So I know you said you don't see what the appeal is... but for me personally I don't care what the hardware is that showed up. [19:08]
<cardoe> If it's a known flavor to me, great I'll use it and throw it into the available pool. [19:08]
<samcat116> I think the difference is that we aren't in a multi-tenant compute-providing environment, we're in a device testing environment for different software versions on different hardware types [19:09]
<cardoe> I'm multi-tenant. [19:09]
<JayF> It's interesting, just different ways to view operations [19:09]
<JayF> I like a bit more top-down control; documentation decides what the environment looks like [19:09]
<cardoe> So docs still decide. [19:10]
<JayF> you all are describing something more descriptive; creating documentation for what reality is (and reacting to it) [19:10]
<JayF> same output just different pathway [19:10]
<samcat116> Where was that documentation traditionally held? [19:11]
<JayF> At almost every company I've worked at, they had a database that was directly interacted with by the purchasing team [19:12]
<JayF> that database was considered source of truth, and was used to feed into ironic [19:13]
<samcat116> Interesting [19:14]
<JayF> so imagine this flow; which generically describes I think 3 outta 4 ironic clusters I've helped manage: [19:14]
<JayF> Hardware is ordered, added to internal DB system as ordered [19:15]
<JayF> arrives, is racked, DC verifies manually against internal system (theoretically lol), moved into a state that says like "ready to go" or whatever [19:15]
<JayF> automation sees machines ready to go, grabs them, uses the metadata to add into ironic [19:15]
<JayF> then updates the state in the internal db system to "ironic has this" which would (in some places) lock out updates on the other DB unless the machine was in manageable/maintenance/some other sign [19:16]
<samcat116> Yeah for us Netbox is that database and has all that info. Additionally I'd have to do some sort of discovery as I'd never get all the info I'd need from purchasing (what are the MACs of the interfaces and the LLDP neighbor info so I know what switch port it's plugged into so NGS can program it on deploy), so I might as well have automation fill it in initially, rather than us fill in the device name and serial while ignoring [19:18]
<samcat116> everything else [19:18]
<JayF> maybe that's some of the difference [19:19]
<JayF> I'm used to bigger environments that have more layers [19:19]
<JayF> but the workflow you talk about might be a LOT more ergonomic for teams that are more tightly integrated with the datacenter/hardware processes [19:20]
<samcat116> yeah im in more of a lab environment than a production DC [19:21]
<JayF> thanks for walking through this with me [19:22]
<JayF> just trying to understand the underlying motivations not just the workflow [19:22]
<samcat116> The netbox folks have a public demo instance with some demo data if you want to poke around and see what kind of things are tracked and how they're represented: https://demo.netbox.dev/ [19:23]
<cardoe> So JayF same thing. Hardware is ordered in one system. It's not Nautobot for us. [19:30]
<cardoe> But we can add it as "Ordered" or not. [19:30]
<JayF> That seems like a core part of most workflows ;) [19:31]
<cardoe> So Nautobot/NetBox require devices to have a location. They need to have a Rack and a U location to be created. I think "Ordered" devices can be assigned to a Rack without a U slot. But any other status requires more. [19:35]
<cardoe> The DC people can come into Nautobot and create a rack of type X. Add the switch and mark it "Ready to Go". [19:36]
<cardoe> s/the switch/the BMC switch/ [19:36]
<cardoe> At that point we'll lay down the base switch config [19:37]
<cardoe> well really add all the switches. [19:37]
<cardoe> The switch health should go green. [19:37]
<cardoe> But with the BMC switch they'll start trying to DHCP and so our inspection process starts off and starts adding them into the rack. [19:38]
<cardoe> We specify that rack type X has BMC switch port 1 = U3 let's say to start. [19:39]
<cardoe> So it pre-populates all the boxes in like that. Shows the flavor of machine and the asset tag. [19:40]
<cardoe> unknown is bad [19:40]
<cardoe> the rest? well that's on them to validate. The little pull out tab on the chassis shows the asset tag. [19:41]
<cardoe> If they're supposed to be "cpu2.large" or "gpu2.large", well that's on them to validate with the order. [19:41]
<cardoe> Sometimes they stop chewing on crayons long enough to validate it. [19:42]
<JayF> I think part of why I'm so insistent about documentation-first is that I had the experience, a lot, at rackspace of devices being miscabled [19:42]
<cardoe> But once all the machines are a known flavor they can approve it and off it goes getting added to Ironic and enabled [19:42]
<JayF> so asserting "it's plugged into these two switches, and these two switchports" helped us avoid pain a lot of time [19:43]
<cardoe> Yes. That's part of that inspection process to go green. [19:43]
<JayF> since we'd often end up with maybe two devices miscabled into each other's switches (like two cables from each into the same tor, as opposed to one cable to each tor) [19:43]
<cardoe> LLDP info is used. [19:43]
<JayF> yep, same [19:43]
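The miscabling check JayF and cardoe describe (assert which switches and switchports a node should be plugged into, then compare against what LLDP reports) might look something like the sketch below. The data shapes, node names, and function name are assumptions for illustration; real LLDP data from inspection would need to be parsed into this form first.

```python
# A link is a (switch name, switch port) pair. These shapes are assumed
# for illustration and are not an Ironic or Nautobot data model.
Link = tuple[str, str]


def find_miscabled(
    expected: dict[str, set[Link]],
    observed: dict[str, set[Link]],
) -> dict[str, dict[str, set[Link]]]:
    """Compare documented cabling with LLDP-observed cabling per node.

    Returns only the nodes that differ, with the links that are missing
    and the links that were seen but not documented.
    """
    problems: dict[str, dict[str, set[Link]]] = {}
    for node, want in expected.items():
        got = observed.get(node, set())
        if got != want:
            problems[node] = {
                "missing": want - got,
                "unexpected": got - want,
            }
    return problems


# Documented intent: one cable from each node to each ToR.
expected = {
    "node-a": {("tor1", "p1"), ("tor2", "p1")},
    "node-b": {("tor1", "p2"), ("tor2", "p2")},
}
# Observed via LLDP: each node has both cables in the same ToR.
observed = {
    "node-a": {("tor1", "p1"), ("tor1", "p2")},
    "node-b": {("tor2", "p1"), ("tor2", "p2")},
}
problems = find_miscabled(expected, observed)
```

In the swapped-cable case above both nodes are reported, since each is missing one documented link and has one undocumented link.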
<JayF> we totally implemented a thing 4 different ways in 4 different places over a few years there it seems lol [19:44]
<cardoe> I'm sure. Cause it needed to be a basic thing provided across the WHOLE company. [19:44]
<jamesdenton> JayF miscabled? How dare you. [19:47]
<jamesdenton> you just didn't know you wanted it that way [19:47]
<JayF> jamesdenton: nobody screwed up onmetal as bad as cisco screwed up in v1 :-| [19:47]
<JayF> might have been a bug around them just randomly not enforcing the security features they built just for us [19:48]
<jamesdenton> nothing custom ever goes awry [19:48]
<JayF> I was putting the finishing touches on a test to get maintainership in gentoo today [19:50]
<JayF> and one of the answers was effectively "don't patch in meaningful features, it's dumb" :) [19:51]
<jamesdenton> haha [19:51]
<cardoe> jamesdenton: what I said sensible? [19:53]
<cardoe> Almost like I should take this IRC log and make a doc out of it or a blog post. I'm not as fancy as you and Kevin though. [19:54]
<JayF> serious suggestion: if you wanna do that, take your messages from the logs, and ask chatgpt to draft it [19:54]
<jamesdenton> we have a docs site, you should check it out [19:55]
<JayF> you might get 75% of the way there for free-ish [19:55]
<cardoe> So question... Scott is adding a sample file for testing.. https://review.opendev.org/c/openstack/sushy/+/932096/2/sushy/tests/unit/json_samples/port.json#56 that's from real hardware. It's spelled "Inteface". So his patch is failing codespell. Should he fix that spelling or ignore it since it's real hardware? [20:24]
<JayF> the codespell target should be updated to ignore that whole dir [20:25]
<JayF> json_samples implies we are pulling direct samples [20:25]
<opendevreview> Scott Tran proposed openstack/sushy master: Add Port interface  https://review.opendev.org/c/openstack/sushy/+/932096 [20:28]
<cardoe> okay I'll add that. He decided to just update it. [20:31]
<JayF> I suggest explicitly leaving it misspelled [20:32]
<JayF> if it's misspelled in the real machine [20:32]
<JayF> with maybe even a note to that effect in the test or the commit [20:32]
