Thursday, 2024-10-10

opendevreview	Ghanshyam proposed openstack/ironic master: Run ironic-dbsync in grenade testing from venv https://review.opendev.org/c/openstack/ironic/+/932016	02:55
opendevreview	Takashi Kajinami proposed openstack/metalsmith master: Drop SETUPTOOLS_USE_DISTUTILS=stdlib https://review.opendev.org/c/openstack/metalsmith/+/932020	03:53
rpittau	good morning ironic! o/	06:06
TheJulia	good morning!	07:09
rpittau	hey TheJulia welcome to the Old World :)	07:18
TheJulia	heh	07:20
TheJulia	Thanks... I think ;)	07:20
rpittau	:D	08:58
iurygregory	good morning ironic	09:37
opendevreview	cid proposed openstack/ironic-specs master: Add a Kea DHCP backend https://review.opendev.org/c/openstack/ironic-specs/+/931025	13:14
opendevreview	cid proposed openstack/ironic-specs master: Add a Kea DHCP backend https://review.opendev.org/c/openstack/ironic-specs/+/931025	13:54
opendevreview	cid proposed openstack/ironic-specs master: Add a Kea DHCP backend https://review.opendev.org/c/openstack/ironic-specs/+/931025	13:58
TheJulia	Hey guys, question for metal3 folks, did rebuild support ever get wired up?	14:27
TheJulia	like, so we can trigger the rebuild verb for a node	14:27
TheJulia	For example, I need to change the disk image	14:27
dtantsur	TheJulia: no rebuild as you know it. Either use a new image URL or remove and re-add it.	14:34
dtantsur	ref https://book.metal3.io/bmo/provisioning#reprovisioning	14:34
opendevreview	Merged openstack/ironic stable/2023.1: Revert "ci: stable-only: explicitly pin centos build" https://review.opendev.org/c/openstack/ironic/+/921884	14:55
JayF	TheJulia: dtantsur: other cores who may be around: https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/thread/Z52INRWOC5GT3NFPG5427FM4PSXJFRUG/ is the public proposal for the stuff cores have been discussing	15:06
JayF	cardoe: cid ^ you will find that email interesting	15:06
cardoe	I'm honored for the nomination.	15:08
cardoe	cid: wrt the Kea DHCP backend, I was wondering if a side outcome could be capturing better what things we need to pass for DHCP in generic terms.	15:14
JayF	What do you mean by that?	15:15
cardoe	The reason I ask is that I'm running my Ironic inside of k8s using OpenStack Helm right now and the DHCP backends make assumptions around being on the same physical filesystem as the DHCP server.	15:15
JayF	DHCP backends in Ironic or Neutron?	15:16
cardoe	Ironic	15:16
cardoe	Obviously the neutron one is sending neutron commands.	15:16
JayF	I'd suggest that might be a separate spec	15:16
JayF	well, I mean, we're going to do a neutron-dhcp-agent for kea as well	15:16
cardoe	Just thinking around some RPC mechanism.	15:18
cardoe	So I can run DHCP not in a 1:1 fashion with conductors	15:18
JayF	So let's develop some kind of like, network configuration API, that when we tell it to, it configures a DHCP server via an agent. That sounds space-riffic so I think we can call it "Neutron"!	15:19
JayF	/s obviously but this is something I'm thinking about	15:19
JayF	"conductors are in the same subnet as the nodes they provision" is a reasonable assumption that many Ironic pieces make	15:20
JayF	(for example: Grub won't cross a network boundary to network boot AIUI)	15:21
cardoe	So you're not wrong about the fact that it's neutron. I agree.	15:24
cardoe	There's also the Mercury proposal from Julia to have a different network backend instead of neutron.	15:24
cardoe	So my RPC thought was tied to Mercury.	15:25
cardoe	I don't have specifics until Mercury moves forward more. It was more the current code is a bit racy if you NFS mount the path.	15:30
opendevreview	Verification of a change to openstack/ironic-python-agent stable/2023.1 failed: Call evaluate_hardware_support exactly once per hwm https://review.opendev.org/c/openstack/ironic-python-agent/+/920216	15:40
opendevreview	Verification of a change to openstack/ironic-python-agent stable/2023.1 failed: Call evaluate_hardware_support exactly once per hwm https://review.opendev.org/c/openstack/ironic-python-agent/+/920216	16:06
opendevreview	Scott Tran proposed openstack/sushy master: Add Port interface https://review.opendev.org/c/openstack/sushy/+/932096	16:20
cid	cardoe: My brain interpreted your question as, "finding out what things we need to pass for DHCP" no matter the backend while at the Kea change (?).	16:25
rpittau	good night! o/	16:26
cid	\o	16:28
opendevreview	Scott Tran proposed openstack/sushy master: Add Port interface https://review.opendev.org/c/openstack/sushy/+/932096	16:42
JayF	cardoe: so mercury is about hooking ironic up to NGS, at least in practice today, and NGS is a remote agent	16:56
JayF	cardoe: you'd still need the conductor in the same subnet for most cases	16:56
JayF	cardoe: I'm not saying "we can't/shouldn't do this" so much as "this is a completely different case than kea support"	16:57
JayF	at least imho	16:57
cardoe	I'm not disagreeing with that.	16:57
JayF	I do wonder if kea, in practice, can help solve it	16:57
JayF	it does have clustering/replication features	16:57
cardoe	I've got 2 conductors but 1 dnsmasq	16:57
JayF	yeah that's honestly just wrong	16:57
JayF	what kinda horrible person would've left an ironic install with a bunch of static dhcp stuff /me hides	16:58
cardoe	I'm just using that as an example.	16:58
JayF	(we did horrible dhcp things for onmetal; mainly didn't let ironic do it at all; used static dhcp configs and network switching did the work)	16:58
cardoe	is that really horrible to have multiple conductors for redundancy?	16:58
JayF	No, but they each need to run their own dnsmasq, yeah?	16:58
JayF	I don't think upstream a shared dnsmasq across two conductors is supported or makes sense	16:59
JayF	but maybe I'm not visualizing the shape properly	16:59
cardoe	Yep. But can't have unmanaged inspection on.	16:59
cardoe	Cause one dnsmasq sees the box as unmanaged and the other doesn't	16:59
JayF	have you seen the noise about how unmanaged inspection support is going to go away if people don't make loud noises about wanting you?	17:00
JayF	s/you/it	17:00
JayF	you might wanna make loud noises if you use that feature :D	17:00
cardoe	I don't actually in how it works today.	17:00
JayF	okay, cool, yeah, it creates a lot of tough problems which is why we don't love it so much	17:00
cardoe	So what I want is out-of-band inspection to be great.	17:03
JayF	I think we're there, or getting there :D	17:03
cardoe	I'm happy to share my thoughts / ideas and folks can tell me if it's dumb.	18:12
cardoe	JayF: hundo percent not there.	18:12
JayF	Do we have RFEs for the stuff you want that doesn't exit?	18:13
JayF	**exist	18:13
cardoe	I gotta go look	18:14
cardoe	https://bugs.launchpad.net/ironic/+bugs?field.tag=rfe right?	18:15
JayF	or rfe-approved	18:16
cardoe	So I'll just blab first here. I already blabbed in #-keystone after I got accused of not providing enough context and not reading docs before submitting a patch to the docs.	18:19
cardoe	So some background context, we're using https://github.com/nautobot/nautobot for DCIM / IPAM	18:21
cardoe	We define hardware types like... https://github.com/netbox-community/devicetype-library/blob/master/device-types/Dell/PowerEdge-R7615.yaml	18:22
cardoe	But even more so than that.	18:23
cardoe	So from my personal experience, we've needed to have the same chassis, processor, RAM, PCI devices (including sub-vendor/sub-device).	18:27
cardoe	So that's defined for a flavor which is a child of a device type.	18:27
JayF	I don't understand where that lands in use case terms	18:28
cardoe	So getting there.	18:28
cardoe	So one of my guys (not sure if he's on here) wrote some extra inspection hooks. We compare the data and determine what flavor it is. If it matches.	18:29
cardoe	And set that as the resource class	18:29
cardoe	Which then nova flavors use for scheduling. Which then jamesdenton consumes for hypervisors and whatever else to give clarkb VMs for Zuul.	18:30
cardoe	But to run the agent inspector there's quite a bit of data that needs to be in there for the node.	18:31
cardoe	So rather than having the out-of-band and then the agent inspector the thought is just do it once.	18:31
cardoe	Hence why Scott Tran is submitting patches to sushy.	18:31
cardoe	I'm wanting to enable redfish inspector to use inspection hooks as well to modify the node.	18:32
cardoe	We grab node asset tag / serial, put an SSL certificate on there, grab all the PCI card info, grab all the NIC info, grab LLDP info to populate physical network and local link connection.	18:33
JayF	so redfish inspection interface supporting inspection rules is like, 90% of what you need?	18:34
* JayF does not have the firmest understanding of inspection		18:35
JayF	it's always been similar-but-different from our other flows	18:35
cardoe	So inspection rules vs inspection hooks.... what's the difference?	18:36
dtantsur	cardoe: hooks are written in python and available as local modules; rules are written in a special DSL and can be created via API	18:39
dtantsur	their purpose is pretty similar, just the audiences slightly differ	18:40
cardoe	ah okay cool.	18:41
dtantsur	fwiw unifying in-band and out-of-band inspections was one of my background thoughts when I started the effort to merge inspection in ironic	18:42
cardoe	we've got a DHCP server on the BMC network. When it gives out an IP, we trigger our out-of-band inspection. Our BMCs are normally static IP. But it's also idempotent so we can inspect the same box multiple times so there's cases when they get inspected without DHCPing.	18:43
cardoe	dtantsur: while you're here... I SWEAR I read some proposal from you for having vendor interfaces overriding the generic interface without needing to have a custom interface. I think it was for idrac.	18:44
cardoe	So like we use "redfish" in the config and it transparently knows to use "idrac-redfish"	18:44
samcat116	Hah, I'm building a very similar workflow with Netbox/Ironic	18:45
cardoe	We thought we'd use the SSoT stuff a bunch so we went Nautobot.	18:45
samcat116	We were on Netbox before the split, and IMO none of the other stuff on the Nautobot side has tempted me	18:46
samcat116	For us a lot of our stuff is SNMP based, so we can fully do discovery/onboarding as we have no way to discover the PDU outlet	18:47
samcat116	But I was gonna work on a service for "Node is tagged in netbox and thats enrolled in ironic"	18:47
cardoe	So our UUIDs are in sync between both systems.	18:48
samcat116	In a custom field on the NB side I assume?	18:49
cardoe	Nope. We put the device into NB first and use that UUID to create it in Ironic.	18:49
cardoe	Some OpenStack APIs allow you to do that pattern. I usually refer to it as upsert.	18:49
cardoe	Like I argue that the REST style is POST /api/object/ -> make me a brand new one... PUT /api/object/<id> -> create or replace and PATCH /api/object/<id> -> update existing	18:50
cardoe	The PUT being the upsert.	18:51
samcat116	Here was my dream workflow which I haven't gotten around to fully implementing: someone racks a device and plugs in every cable. It's set to PXE by default so boots to inspector with some custom scripts to grab all the info we'd want. That writes it both to Ironic to enroll the node and to Netbox to track it.	18:56
JayF	Disclaimer: this is not a comment reflecting if Ironic should or should not implement that workflow	19:00
JayF	but ... I don't see the appeal at all	19:00
JayF	I always saw it as a feature-not-a-bug that most onboarding workflows I've worked with started with an assertion about what we ordered/racked/had delivered that we could compare against	19:00
JayF	rather than trusting that the hardware that was racked was the right thing	19:00
cardoe	samcat116: that's what we're working on as well.	19:02
cardoe	JayF: so the resource-class is a custom field in our Nautobot. "unknown" is bad	19:02
JayF	at which point a device, that for unknown reasons has been modified, is already live on your network	19:03
cardoe	So right now we're doing 1 server at a time	19:03
cardoe	Well anytime a DC tech touches a box (programmatically) we re-run inspection.	19:04
JayF	hehe I know from experience that's a /great/ idea	19:04
JayF	might wanna run inspection on the servers on either side of it in the rack too ;)	19:04
cardoe	So current Ironic stuff we don't	19:06
cardoe	But existing OSPC stuff that I own... we at a minimum LLDP the whole rack	19:06
cardoe	So I know you said you don't see what the appeal is... but for me personally I don't care what the hardware is that showed up.	19:08
cardoe	If it's a known flavor to me, great I'll use it and throw it into the available pool.	19:08
samcat116	I think the difference is that we aren't in a multi-tenant compute-providing environment, we're in a device testing environment for different software versions on different hardware types	19:09
cardoe	I'm multi-tenant.	19:09
JayF	It's interesting, just different ways to view operations	19:09
JayF	I like a bit more top-down control; documentation decides what the environment looks like	19:09
cardoe	So docs still decide.	19:10
JayF	you all are describing something more descriptive; creating documentation for what reality is (and reacting to it)	19:10
JayF	same output just different pathway	19:10
samcat116	Where was that documentation traditionally held?	19:11
JayF	At almost every company I've worked at, they had a database that was directly interacted with by the purchasing team	19:12
JayF	that database was considered source of truth, and was used to feed into ironic	19:13
samcat116	Interesting	19:14
JayF	so imagine this flow; which generically describes I think 3 outta 4 ironic clusters I've helped manage:	19:14
JayF	Hardware is ordered, added to internal DB system as ordered	19:15
JayF	arrives, is racked, DC verifies manually against internal system (theoretically lol), moved into a state that says like "ready to go" or whatever	19:15
JayF	automation sees machines ready to go, grabs them, uses the metadata to add into ironic	19:15
JayF	then updates the state in the internal db system to "ironic has this" which would (in some places) lock out updates on the other DB unless the machine was in manageable/maintenance/some other sign	19:16
samcat116	Yeah for us Netbox is that database and has all that info. Additionally I'd have to do some sort of discovery as I'd never get all the info I'd need from purchasing (what are the MACs of the interfaces and the LLDP neighbor info so I know what switch port its plugged into so NGS can program it on deploy), so I might as well have automation fill it in initially, rather than us fill in the device name and serial while ignoring	19:18
samcat116	everything else	19:18
JayF	maybe that's some of the difference	19:19
JayF	I'm used to bigger environments that have more layers	19:19
JayF	but the workflow you talk about might be a LOT more ergonomic for teams that are more tightly integrated with the datacenter/hardware processes	19:20
samcat116	yeah im in more of a lab environment than a production DC	19:21
JayF	thanks for walking through this with me	19:22
JayF	just trying to understand the underlying motivations not just the workflow	19:22
samcat116	The netbox folks have a public demo instance with some demo data if you want to poke around and see what kind of things are tracked and how they're represented: https://demo.netbox.dev/	19:23
cardoe	So JayF same thing. Hardware is ordered in one system. It's not Nautobot for us.	19:30
cardoe	But we can add it as "Ordered" or not.	19:30
JayF	That seems like a core part of most workflows ;)	19:31
cardoe	So Nautobot/NetBox require devices to have a location. They need to have a Rack and a U location to be created. I think "Ordered" devices can be assigned to a Rack without a U slot. But any other status requires more.	19:35
cardoe	The DC people can come into Nautobot and create a rack of type X. Add the switch and mark it "Ready to Go".	19:36
cardoe	s/the switch/the BMC switch/	19:36
cardoe	At that point we'll lay down the base switch config	19:37
cardoe	well really add all the switches.	19:37
cardoe	The switch health should go green.	19:37
cardoe	But with the BMC switch they'll start trying to DHCP and so our inspection process starts off and starts adding them into the rack.	19:38
cardoe	We specify that rack type X has BMC switch port 1 = U3 let's say to start.	19:39
cardoe	So it pre-populates all the boxes in like that. Shows the flavor of machine and the asset tag.	19:40
cardoe	unknown is bad	19:40
cardoe	the rest? well that's on them to validate. The little pull out tab on the chassis shows the asset tag.	19:41
cardoe	If they're supposed to be "cpu2.large" or "gpu2.large", well that's on them to validate with the order.	19:41
cardoe	Sometimes they stop chewing on crayons long enough to validate it.	19:42
JayF	I think part of why I'm so insistent about documentation-first is that I had the experience, a lot, at rackspace of devices being miscabled	19:42
cardoe	But once all the machines are a known flavor they can approve it and off it goes getting added to Ironic and enabled	19:42
JayF	so asserting "it's plugged into these two switches, and these two switchports" helped us avoid pain a lot of time	19:43
cardoe	Yes. That's part of that inspection process to go green.	19:43
JayF	since we'd often end up with maybe two devices miscabled into each others' switches (like two cables from each into the same tor, as opposed to one cable to each tor)	19:43
cardoe	LLDP info is used.	19:43
JayF	yep, same	19:43
JayF	we totally implemented a thing 4 different ways in 4 different places over a few years there it seems lol	19:44
cardoe	I'm sure. Cause it needed to be a basic thing provided across the WHOLE company.	19:44
jamesdenton	JayF miscabled? How dare you.	19:47
jamesdenton	you just didn't know you wanted it that way	19:47
JayF	jamesdenton: nobody screwed up onmetal as bad as cisco screwed up in v1 :-\|	19:47
JayF	might have been a bug around them just randomly not enforcing the security features they built just for us	19:48
jamesdenton	nothing custom ever goes awry	19:48
JayF	I was putting the finishing touches on a test to get maintainership in gentoo today	19:50
JayF	and one of the answers was effectively "don't patch in meaningful features, it's dumb" :)	19:51
jamesdenton	haha	19:51
cardoe	jamesdenton: what I said sensible?	19:53
cardoe	Almost like I should take this IRC log and make a doc out of it or a blog post. I'm not as fancy as you and Kevin though.	19:54
JayF	serious suggestion: if you wanna do that, take your messages from the logs, and ask chatgpt to draft it	19:54
jamesdenton	we have a docs site, you should check it out	19:55
JayF	you might get 75% of the way there for free-ish	19:55
cardoe	So question... Scott is adding a sample file for testing.. https://review.opendev.org/c/openstack/sushy/+/932096/2/sushy/tests/unit/json_samples/port.json#56 that's from real hardware. It's spelled "Inteface". So his patch is failing codespell. Should he fix that spelling or ignore it since it's real hardware?	20:24
JayF	the codespell target should be updated to ignore that whole dir	20:25
JayF	json_samples implies we are pulling direct samples	20:25
opendevreview	Scott Tran proposed openstack/sushy master: Add Port interface https://review.opendev.org/c/openstack/sushy/+/932096	20:28
cardoe	okay I'll add that. He decided to just update it.	20:31
JayF	I suggest explicitly leaving it misspelled	20:32
JayF	if it's misspelled in the real machine	20:32
JayF	with maybe even a note to that effect in the test or the commit	20:32
