Wednesday, 2022-12-14

opendevreviewMerged openstack/ironic-inspector master: [grenade] Explicitly enable Neutron ML2/OVS services in the CI job  https://review.opendev.org/c/openstack/ironic-inspector/+/86757300:04
vanougood morning ironic02:27
vanouTheJulia JayF: Thanks for the review on https://review.opendev.org/c/openstack/ironic/+/865075 I have one concern about the try/fail model. If I follow this approach, the driver will first try IPMI and then, if IPMI fails, try Redfish. However, the IPMI code in Ironic resends the IPMI command until the timeout is reached (default 60 seconds). So this try/fail model can introduce a 60s delay on every IPMI attempt.06:00
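For context, the try/fail ordering under discussion would look roughly like the sketch below; the short-deadline probe is a hypothetical way around the 60-second retry window vanou mentions, not the actual patch (all names are illustrative):

    class BMCProbeError(Exception):
        # Stand-in for a failed BMC probe (illustrative only).
        pass

    def probe_ipmi(node, deadline=5):
        # Hypothetical helper: issue a single IPMI command with a short
        # deadline instead of retrying until the default 60 s timeout.
        raise BMCProbeError(node)

    def probe_redfish(node):
        # Hypothetical helper: check the Redfish service root.
        return True

    def detect_protocol(node):
        # Try IPMI first; fall back to Redfish only after the cheap probe
        # fails, so Redfish-only nodes do not pay the full IPMI timeout.
        try:
            probe_ipmi(node)
            return 'ipmi'
        except BMCProbeError:
            probe_redfish(node)
            return 'redfish'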
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for ports node_uuid  https://review.opendev.org/c/openstack/ironic/+/86293308:15
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for port groups node_uuid  https://review.opendev.org/c/openstack/ironic/+/86478108:15
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for node chassis_uuid  https://review.opendev.org/c/openstack/ironic/+/86480208:15
opendevreviewHarald Jensås proposed openstack/ironic master: Use association_proxy for node allocation_uuid  https://review.opendev.org/c/openstack/ironic/+/86598908:15
codecap_Hi guys,08:36
codecap_I'm experiencing this problem:08:36
codecap_Look up error: Could not find a node for attributes {'bmc_address': ['10.40.11.68'], 'mac': ['4c:52:62:42:de:98', '4c:52:62:42:de:96', '4c:52:62:52:ae:2b', '4c:52:62:42:de:97', '4c:52:62:42:de:95', '4c:52:62:52:ae:2a']}08:36
codecap_Can anybody help me understand the reason for this problem?08:36
rpittaugood morning ironic! o/08:59
kubajjMorning rpittau and ironic o/09:29
rpittauhey kubajj :)09:29
dtantsurcodecap_: could be a mismatch between existing BMC address and ports on the Ironic node09:37
codecap_dtantsur: Hi ! Can you please explain ?09:38
dtantsurcodecap_: so, you're inspecting the node right? the node has a BMC (IPMI/Redfish/iDRAC/...) address and maybe some pre-existing ports?09:39
codecap_dtantsur: IPMI ... pre-existing ports ?09:40
dtantsurcodecap_: okay, let's take a step back. What exactly are you doing? How familiar with Ironic are you?09:42
codecap_dtantsur: I'm trying to install and use it ... not much experience, I would say ... I've already learned how to use bifrost... The problem I described above occurs when I use ironic within openstack. The node which is being deployed comes up under the control of IPA; at one point IPA tries to contact the inspector, which replies with HTTP 500 09:48
dtantsurIs it still Bifrost or another installation with the rest of OpenStack?09:53
dtantsurwithin the Bifrost context, attempting to contact Inspector first is normal09:54
dtantsurwhen used with Nova/Neutron, we usually expect DHCP to be set up correctly to avoid that (in which case it may be a problem with networking or Neutron)09:54
dtantsurcodecap_: ^^09:55
dtantsurI need to step away for a while, other folks may help you further (or I'll check your messages when I'm back)09:56
codecap_dtantsur: bifrost is completely independent. Ironic within OpenStack was installed by kolla-ansible ... DHCP looks good, so the node (being deployed) starts loading the IPA image09:57
codecap_dtantsur: can you explain to me how Inspector searches for the node when contacted by IPA?10:00
opendevreviewMerged openstack/ironic master: Ironic doesn't use metering; don't start it in CI  https://review.opendev.org/c/openstack/ironic/+/86757410:21
ajyajanders: iDRAC 6.10.00.00 has been released, if you want to try it out10:30
opendevreviewRiccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job  https://review.opendev.org/c/openstack/ironic/+/86387310:42
opendevreviewRiccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job  https://review.opendev.org/c/openstack/ironic/+/86387310:48
dtantsurcodecap_: when inspection is started, it caches BMC address (ipmi_address, etc) and MAC addresses to use them for lookup later. Note that if you don't start inspection, it will never work.11:12
dtantsurcodecap_: so if you end up inspecting when you meant to deploy, something went wrong on the networking level11:12
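The lookup dtantsur is describing is roughly this shape (a simplification of ironic-inspector's node cache matching, not the real code):

    def lookup(attributes, cached_nodes):
        # attributes come from the ramdisk, e.g.
        # {'bmc_address': ['10.40.11.68'], 'mac': ['4c:52:62:42:de:98', ...]};
        # cached_nodes hold the values recorded when inspection started.
        matches = [
            node for node in cached_nodes
            if set(attributes['bmc_address']) & set(node['bmc_address'])
            or set(attributes['mac']) & set(node['mac'])
        ]
        if len(matches) != 1:
            raise LookupError(
                'Could not find a node for attributes %s' % attributes)
        return matches[0]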
codecap_dtantsur: so what is the order a node should be deployed in ? Create -> Inspect -> Deploy ?11:14
kubajjdtantsur: We use the pecan REST for the decorators, right?11:17
dtantsurcodecap_: it depends on what you need. Inspection is optional.11:21
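A typical flow with optional inspection, using the openstack client (the node name is illustrative):

    openstack baremetal node manage node-0    # enroll -> manageable
    openstack baremetal node inspect node-0   # optional: run inspection
    openstack baremetal node provide node-0   # manageable -> available
    openstack baremetal node deploy node-0    # available -> active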
dtantsurkubajj: I think we have a lot of our own decorator code, inherited from a project called wsme11:22
kubajjdtantsur: and these are defined in the api repo?11:23
kubajj(not repo, directory)11:23
codecap_dtantsur: how can I debug why the lookup fails? Look up error: Could not find a node for attributes {'bmc_address': ['10.40.11.68'], 'mac': ['4c:52:62:42:de:97' ... both (bmc_address and macs) are the correct ones11:28
dtantsurkubajj: yep, somewhere there11:29
dtantsurcodecap_: do you start inspection? if not, it will never work. if you start deploy and end up inspecting, you need to see what went wrong on the PXE stage.11:29
dtantsurin a normal openstack installation, different dnsmasq instances are responsible for the two processes11:29
dtantsurwhen inspection is not running, ironic-inspector's dnsmasq is blocked for these nodes by its own filters, neutron's is working11:30
dtantsursomewhere there something went wrong, but I cannot guess what without a deep look11:31
codecap_dtantsur: Inspection works11:31
codecap_dtantsur:  [node: 8280e833-8a86-4ee1-8cf6-675d55d54970 state finished MAC 4c:52:62:52:ae:2a BMC 10.40.11.68] Introspection finished successfully 11:31
codecap_dtantsur: after successful inspection found data should be saved in the ironic inspector DB, right?11:33
dtantsurcodecap_: if you have it set up to do it, yes (I don't know how kolla sets up inspector)11:34
dtantsurso, inspection works, deployment does not?11:34
codecap_dtantsur: deployment fails with Look up error: Could not find a node for attributes {'bmc_address': ['10.40.11.68'], 'mac': ['4c:52:62:42:de:97', ....11:36
dtantsurright, so your flow erroneously goes back into inspection11:37
dtantsurcodecap_: see what your PXE filter is ([pxe_filter]driver in inspector.conf). check for suspicious messages about it in inspector logs.11:38
kubajjdtantsur: I am sorry for asking so many questions, but I've been stuck on this for more than a day now. Is the api I am aiming to implement much different from the StatesController?11:38
dtantsurkubajj: it should not be (and don't worry about questions). It's not entirely clear to me at which step you're stuck now. If  you elaborate, I may be able to help better.11:39
codecap_dtantsur: driver = noop11:40
dtantsurthat explains...11:41
dtantsuris it something you did or something kolla set up for you?11:41
dtantsur(driver noop means that access to inspector's dnsmasq is not limited, which means that inspection always starts)11:41
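For reference, the setting lives in inspector.conf; switching to the dnsmasq filter looks roughly like this (the hostsdir path is an example, and the filter needs a matching dnsmasq instance, hence the advice below to check with the kolla folks first):

    [pxe_filter]
    driver = dnsmasq

    [dnsmasq_pxe_filter]
    dhcp_hostsdir = /var/lib/ironic-inspector/dhcp-hostsdir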
kubajjdtantsur: I am still getting the "Missing mandatory parameter" node_ident error. The thing is, the states controller uses just node_ident instead of self.node_ident and does not have the constructor (which Julia suggested I implement as in the history controller), and it can still be called from the node controller https://opendev.org/openstack/ironic/src/branch/master/ironic/api/controllers/v1/node.py#L195611:43
kubajjI am trying to figure out why I can't call it as well11:43
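A minimal sketch of the pattern under discussion, modeled on the history-controller approach (the class name and handler body are illustrative, not the final patch):

    from pecan import rest

    from ironic.api import method

    class NodeInventoryController(rest.RestController):

        def __init__(self, node_ident=None):
            super().__init__()
            # Stash the parent node's ident on the controller, so pecan
            # does not expect it as a routed request parameter (the likely
            # source of the "Missing mandatory parameter" error).
            self.node_ident = node_ident

        @method.expose()
        def get_all(self):
            # Look the node up via self.node_ident and return its
            # inventory (placeholder body).
            return {'inventory': {}}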
dtantsurokay, I'll check after the current meeting (please upload the latest patch)11:43
codecap_dtantsur: kolla configures it this way11:44
dtantsurcodecap_: then please ask mgoddard and other folks on #openstack-kolla which architecture they have in mind11:45
dtantsurit feels like it's a mix of a standalone architecture and a normal openstack one... but I may simply be missing what they mean11:45
codecap_dtantsur: should it be driver = dnsmasq?11:53
dtantsurcodecap_: that's what we recommend, but it may require additional configuration, so I'd check with kolla folks first11:54
dtantsurI assume they don't have a lot of docs around this area?11:54
codecap_dtantsur: sure ... will also ask there for help11:57
codecap_dtantsur: thanx a lot !11:58
codecap_dtantsur: I've already read a lot... but I'm still not smart enough :)11:58
dtantsurnetworking is the least trivial thing in openstack IMO :)11:58
opendevreviewJakub Jelinek proposed openstack/ironic master: WIP: API for node inventory  https://review.opendev.org/c/openstack/ironic/+/86687612:16
kubajjdtantsur: somehow I managed to fix it. What should happen if I try to access a node inventory of a node that doesn't have one?12:17
kubajjdtantsur: What should be the behaviour if the version is too old?12:17
dtantsurkubajj: in both cases we return 404 with an appropriate message12:31
opendevreviewJakub Jelinek proposed openstack/ironic master: API for node inventory  https://review.opendev.org/c/openstack/ironic/+/86687613:18
kubajjdtantsur: if you had a moment, feedback for ^ would be appreciated13:19
dtantsursure, will put on my queue13:20
dtantsurkubajj: before that: this is the point where we need a release note since we're finally adding something user-visible. I can help with wording, but feel free to propose a draft.13:23
kubajjdtantsur: I am just drafting it, I was expecting you to say that we need it 😀13:25
dtantsurgreat :)13:25
kubajjdtantsur: I need apiref as well, right?13:40
opendevreviewRiccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job  https://review.opendev.org/c/openstack/ironic/+/86387314:09
mgoddardajya: Hi, we're having trouble with cleaning on some Dell hardware. We have NVMes behind a RAID controller, and it seems to be falling back to shredding rather than secure erase14:13
mgoddardRAID controller: https://dl.dell.com/content/manual61357984-dell-poweredge-raid-controller-11-user-s-guide-perc-h755-adapter-h755-front-sas-h755n-front-nvme-h755-mx-adapter-h750-adapter-sas-h355-adapter-sas-h355-front-sas-h350-adapter-sas-h350-mini-monolithic-sas.pdf?language=en-us&ps=true14:13
mgoddardNVMe: https://www.dell.com/en-us/shop/dell-384tb-data-center-nvme-read-intensive-ag-drive-u2-gen4-with-carrier/apd/400-bmto/storage-drives-media#tabs_section14:13
mgoddardhave you seen this before?14:13
ajyamgoddard: haven't seen, but I can check. What commands are being used?14:15
opendevreviewRiccardo Pittau proposed openstack/ironic-python-agent master: [DNM] test ci  https://review.opendev.org/c/openstack/ironic-python-agent/+/86765914:34
TheJuliamgoddard: just standard automated cleaning? if so I suspect you'll need to share an agent log from cleaning, but I wonder whether the actual raid controller's capabilities inhibit some of that14:42
dtantsurkubajj: yes please14:45
dtantsurmgoddard: I believe TheJulia is right: I don't think we can see the NVMe behind the controller14:46
TheJuliasecure erase as the code uses it is an ATA concept14:46
mgoddardthat does seem to be the case14:46
mgoddardnvme list is empty14:47
TheJuliascsi, depending on the protocol exposed by the raid controller, has no concept of it14:47
mgoddarddisks show up as /dev/sdX14:47
TheJuliahow do you see the devices to an OS?14:47
TheJuliaoh!14:47
TheJuliaso yeah14:47
TheJuliathey are exposed as SCSI devices :(14:47
* TheJulia is surprised they are not exposed as NVMe devices14:47
mgoddardwondering whether it kills the benefits of NVMe14:47
TheJuliathe raid controller utilities *might* have a tool to help enable this14:47
TheJuliaraid controllers can be useful, and they can also hurt performance/usage at times14:48
TheJuliait is a value tradeoff unfortunately14:48
dtantsuror rebuild the raid every time..14:48
TheJuliayeah14:49
TheJuliathe raid controller, depending on how smart it is might issue unmap/deallocate commands to the device14:49
TheJuliaor it could write zeros14:49
TheJuliaor keep a memory map and just double zero14:49
TheJulialike qcow2s14:49
* TheJulia has crashed many PERCs in her career14:51
mgoddardseems like we need to try to expose the NVMes directly, if possible14:53
TheJuliaso, you could try disabling raid mode14:53
mgoddardwould much prefer not to need to use some custom tool in IPA (if that is even possible)14:53
TheJuliathe bios firmware should have a mode setting for the card14:53
mgoddardthere isn't currently any RAID virtual disk configured14:53
TheJuliaOh, yeah, even better reason to check the card settings14:54
mgoddardok, I'll have a poke around14:54
mgoddardthanks all14:54
TheJuliaso fwiw, people have wrapped individual binaries in their ramdisks to do special things, and a hardware plugin can always override, but disabling RAID mode on the card would hopefully just make it a pass-through14:55
TheJuliaNow, if that translates the protocol, that is a whole different question14:55
TheJuliaand if it does.... yeouch. We can work on figuring something out14:55
TheJuliayou'll know quickly after rebooting the machine with the raid controller card in a different mode14:56
* TheJulia wonders if we have this sort of caveat documented heavily14:56
dtantsurI feel like we do not15:05
TheJuliaI'm writing something now15:13
dtantsurI recall there were some talks about dropping dhcp-all-interfaces from IPA deps, does anyone else remember? TheJulia?15:23
TheJuliawe talked about it because network manager has similar behavior15:23
TheJuliaalthough I think we now set networkmanager to do the thing with dhcp-all-interfaces15:23
TheJuliaso apparently there is a SCSI FORMAT UNIT command15:26
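If the controller actually passes it through, FORMAT UNIT can be issued with sg_format from sg3_utils, e.g. (device name illustrative; this is destructive):

    sg_format --format /dev/sdX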
dtantsurgot someone complaining about the time it takes to DHCP interfaces that don't have a DHCP server on them15:26
TheJuliathe counterbalancing issue is that if you turn that off, networkmanager is going to focus on the first one it finds15:31
TheJuliaand you may lose connectivity because there could be multiple interfaces :(15:31
opendevreviewJulia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid  https://review.opendev.org/c/openstack/ironic/+/86767415:55
TheJuliamgoddard: dtantsur: ^^15:55
opendevreviewJulia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid  https://review.opendev.org/c/openstack/ironic/+/86767415:57
JayFWe probably should mention that we have explicit examples for making your own hardware managers15:57
JayFwith tooling provided by raid card vendor15:57
JayFto use the actual-vendor-tool to do this if you want15:57
JayFhttps://github.com/openstack/ironic-python-agent/tree/master/examples/custom-disk-erase15:57
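The linked example boils down to a custom hardware manager that outranks the generic one and swaps in its own erase logic; roughly (simplified from the example, vendor tool invocation omitted):

    from ironic_python_agent import hardware

    class CustomDiskEraser(hardware.HardwareManager):
        HARDWARE_MANAGER_NAME = 'CustomDiskEraser'
        HARDWARE_MANAGER_VERSION = '1'

        def evaluate_hardware_support(self):
            # Claim a higher support level so this manager's methods win
            # over the generic hardware manager's.
            return hardware.HardwareSupport.SERVICE_PROVIDER

        def erase_block_device(self, node, block_device):
            # Shell out to the RAID card vendor's tool here instead of
            # the generic shred/ATA secure erase path.
            pass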
TheJuliawe've not seen that as a thing as much, most vendors are trying to integrate purely into BMC interactive actions, but yeah15:58
* TheJulia revises15:58
JayFOr, potentially, that pathway is better documented so more people self-serve it without bothering us ;) 15:58
dtantsurTheJulia: one formatting issue inline15:58
* JayF knows of at least 3 installs that have used it without consulting with us15:58
TheJulia++15:58
TheJuliadtantsur: which line?15:58
TheJuliaJayF: that is an interesting data point15:59
dtantsurTheJulia: 1091, need to use a :doc: link, otherwise it won't work when relocated or built locally15:59
TheJuliaack15:59
JayFTheJulia: well, it's what I tend to ask people about given it's the piece I know the most about, so there's a lot of confirmation bias :D 15:59
opendevreviewAija Jauntēva proposed openstack/sushy master: Fix exceeding retries  https://review.opendev.org/c/openstack/sushy/+/86767516:00
dtantsursee you tomorrow folks16:00
ajyarpittau:  see the patch ^ 16:00
opendevreviewJulia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid  https://review.opendev.org/c/openstack/ironic/+/86767416:05
TheJuliaweb-edited that, so hopefully no issues16:05
opendevreviewJulia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid  https://review.opendev.org/c/openstack/ironic/+/86767416:05
TheJuliaOkay, that should be good16:06
JayFTheJulia: just +A shard_key spec16:12
JayFTheJulia: Arne gave me an email thumbs-up16:12
JayF\o/16:13
TheJulia\o/16:13
TheJulianow, where was I with extracting useful metrics!16:20
opendevreviewMerged openstack/ironic-specs master: Add a shard key  https://review.opendev.org/c/openstack/ironic-specs/+/86180316:22
JayFTheJulia: circling back to the hwmgr/disk stuff; I think crossing "I'm building my own ramdisk" is a threshold for some folks; once it's crossed those are not hard problems anymore (because the code is trivial)16:26
JayFTheJulia: the thing that, on reflection, I realize is common to some of the issues reported here is that people are trying hard (for good reason) to use a fully stock ramdisk16:27
TheJuliaeh, we made ipa-b to simplify ramdisk building16:35
TheJuliaso the knowledge barrier is a bit lower now16:36
JayFthere's a reason I used 'threshold' over 'barrier'16:37
JayFIME operating it, it's always surrounding BS that makes it hard: setting up internal forks, CI, etc16:37
JayFbuilding one special ramdisk is easy; setting up a ramdisk building machine is more :)16:37
TheJuliaI think it is variable based upon perception/experience, but I do agree16:38
opendevreviewMerged openstack/ironic-inspector master: Update tox.ini for tox 4  https://review.opendev.org/c/openstack/ironic-inspector/+/86754116:40
opendevreviewJulia Kreger proposed openstack/ironic-lib master: Provide an interface to store metrics  https://review.opendev.org/c/openstack/ironic-lib/+/86531116:40
TheJuliaJayF: ^ I could use a sanity check of this ironic-lib change since you were deeply involved with this ages ago16:41
JayFhopefully you like the interface :D 16:41
JayFI think aweeks wrote that though, not me, although I was upstreaming with him16:42
TheJuliait took a little time to wrap my head around it, but once I did I was like "oh, this is perfect for what I want"16:42
JayFthis is a very bog-standard metrics interface from 2015 :) 16:43
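For reference, the decorator-style interface being discussed is used throughout ironic roughly like this (standard ironic-lib usage; the metric name is illustrative):

    from ironic_lib import metrics_utils

    METRICS = metrics_utils.get_metrics_logger(__name__)

    # The timer decorator records how long each call took; the configured
    # backend (statsd, noop, or a future one) decides where it goes.
    @METRICS.timer('ConductorManager.do_node_deploy')
    def do_node_deploy(task):
        pass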
JayFTheJulia: reading this... you're reimplementing half of statsd16:44
JayFTheJulia: is there a statsd->prometheus connector that we could document instead?16:44
TheJuliadunno, but I need both node sensor data and ironic data in one pass16:44
TheJuliaand we have ipe for the node data16:44
TheJuliaand deploying 16:44
TheJuliaerr16:44
TheJuliaincomplete thought, disregard that last line16:45
JayFTheJulia:  https://github.com/prometheus/statsd_exporter16:45
TheJuliayeah, I was just looking at that16:46
TheJuliageared as a migration tool16:46
JayFthat even handles turning timers into summary/histogram16:46
JayFI am not opposed to native support, I'm just asking the question 16:46
TheJuliaif I were to do that, I'd have to completely forklift all of the sensor data stuff over16:46
JayFAre you saying that right now16:46
JayFwe have two metrics stacks in Ironic?16:46
JayFbecause whoever added sensor data didn't plug it into existing metrics collection (or vice-versa, but I think app metrics were first)16:47
TheJulianot metrics stacks, sensor data collection walks/collects data from the BMC and was originally geared for handing it off wholesale to... ceilometer16:47
JayFyeah as I hit enter there I realized it was the ceilometer stuff16:47
JayFwhich I steered clear of when designing these metrics because people thought ceilometer would continue to exist still :| 16:47
JayFI mean, "forklift all the sensor data over" sounds like the right answer16:48
JayFor "forklift all the metrics stuff over"16:48
JayFi don't think many deployers would distinguish, in the same way we do, between app and hardware metrics16:48
TheJuliahmm16:49
TheJuliawe had to write a lot of code to handle the sensors16:49
JayFThis isn't definitive, I'm just talking through it16:49
TheJuliabut the proxy might be the right way to go16:49
TheJuliastevebaker[m]: any thought on running the statsd->prometheus proxy?16:50
JayFlike, statsd is definitely the older school thing vs prometheus16:50
JayFwhich is something to consider too16:50
JayFbut our metrics interface was written with statsd in mind, so using the proxy you sorta get the code you wanted to write here for free16:50
TheJuliathe downside of IPE is it relies upon the message bus architecture16:50
JayFlet me make sure I understand16:51
TheJuliayou don't actually *need* a message bus, but it plugs in there16:51
JayFit used to be 16:51
JayFIronic node sensor data -> ceil -> ??? 16:51
JayFnow it's Ironic node sensor data -> IPE -> Prom16:51
TheJuliayes, for those that wish to use it16:51
JayFand separately, we have Ironic app metrics -> statsd (I think this is the only plugin we ship?)16:51
TheJuliaI don't think anyone actually runs with statsd, tbh16:52
JayFme neither16:52
TheJuliait is statsd, and noop currently16:52
JayFWell, like let me put it this way16:52
JayFeverywhere I've run at crazy scale16:52
JayFwe ran statsd16:52
JayFwe almost never used the data though16:52
JayFSo if the answer is "plumb up metrics library to prom" I'm OK-ish with that, but we might wanna consider just changing the paradigm at that point16:53
JayFbecause it seems bad that sensor data and app data don't flow through the same code paths16:53
TheJuliayeah, you're raising a really good point16:54
TheJuliadoes statsd record how many times it got a timer call?16:54
JayFstatsd gives you counters for free off timers16:54
JayFwell, IDK if statsd does it16:54
JayFor if you can derive it at the display layer16:54
JayFmy knowledge of this is significantly muddied because most of my experience with it was sending the statsd output to the Rackspace Monitoring agent16:55
JayFand then [magic] 16:55
TheJuliaso we could always just go "oh, that implies a count as well, ship it!"16:56
TheJuliaif we need to16:56
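The statsd line protocol makes the distinction concrete: every timer sample is also one occurrence, so a backend can count timer samples, or the sender can emit an explicit counter (metric names illustrative):

    ironic.do_node_deploy:5230|ms     # timer: one sample, 5.23 s
    ironic.do_node_deploy.count:1|c   # counter: explicit increment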
JayFI am less concerned with that level of implementation16:56
JayFand more concerned with the core problem of having two metrics libraries (essentially) in the same project16:57
TheJuliaI'm more concerned because people tend to want to know how many times a thing was performed, not the total time spent16:57
JayFI will say from experience that timer metrics were the most useful16:57
TheJuliayou can't imply # of total deploys from just at timer unless statsd does magic there16:57
JayFI am surprised you wouldn't see that, since you started the practice of actually benchmarking our API calls during dev :) 16:58
TheJuliawith fixed counts :)16:58
JayFTheJulia: most of the time, I used oslo notifications to derive that info...16:58
JayFTheJulia: timing was more for troubleshooting or isolating perf issues 16:58
JayFthis is part of why the app metrics seem so weird now16:58
TheJuliawait...16:59
JayFyou don't need them as much because we don't have as many "101-level" performance metrics as you used to16:59
TheJuliawhere is the oslo notifications consumption?16:59
TheJuliaor did that not make it upstream?16:59
JayFIronic emits oslo notifications on state change16:59
JayFmost places I've worked emitted them to splunk where you could aggregate them16:59
JayFhttps://docs.openstack.org/ironic/6.2.2/dev/notifications.html17:00
TheJuliaahh17:00
JayFLike, if you just wanna quick and dirty solve this, use the statsd->prom connector... if we're going to enhance it, lets actually fix it though17:00
TheJuliaThat is way beyond the level of info17:00
TheJuliawell, there is no quick and dirty if I need both streams17:01
opendevreviewRiccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job  https://review.opendev.org/c/openstack/ironic/+/86387317:01
TheJuliaon a prometheus payload :(17:01
JayFTheJulia: so a manager rloo and I used to share, he used oslo notifications going to splunk, aggregated nova and ironic notifications, to create a splunk "success %" dashboard for how many builds succeeded17:01
JayFTheJulia: wait, why do you need them on the *same payload* 17:01
JayFis there some prom limitation I don't have awareness of?17:01
TheJuliabecause you do a poll per service/thing17:01
TheJuliayeah17:01
TheJuliait polls an http endpoint for a file17:01
TheJuliawhich is all of the metrics data for it17:02
TheJuliain a particular format17:02
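That "particular format" is the Prometheus text exposition format, e.g. (metric name and label illustrative):

    # HELP ironic_node_deploys_total Completed node deployments.
    # TYPE ironic_node_deploys_total counter
    ironic_node_deploys_total{conductor="cond-1"} 42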
TheJuliamoving to a proxy *would* potentially allow us to sunset IPE17:02
JayFIt sounds like having prom fetch those metrics twice (node metrics then app metrics) is the correct configuration17:02
JayFfor how our code is currently structured17:02
TheJuliaiurygregory: would that make anyone in your area sad/unhappy?17:02
JayFthey are two separate gathering and export mechanisms17:02
JayFso why would we aggregate their output together for prom?17:02
TheJuliaJayF: aiui, you actually can't if you want them connected, they are different query/sensor areas since it is modeled as a single service/operation 17:03
JayFwe need to tie together all the way back if it's the same; or leave them separate all the way.... last minute tie in is the worst of all worlds: ops don't have flexibility to split app/node metrics and we still have two codepaths to do the same thing17:03
JayFTheJulia: I'm saying our code currently strongly implies "these are not tied together" because it's two separate code paths altogether17:04
TheJuliaI'm stressing that our metrics stuff was always designed with ceilometer as the consumer17:04
TheJuliaand ipe was a means to an end without deploying anything else or retooling stats17:04
TheJuliaoh17:05
TheJuliayou know what the issue is17:05
TheJuliathat payload to prometheus has to have descriptive text17:05
TheJuliahttps://github.com/prometheus/statsd_exporter#explicit-metric-type-mapping17:05
TheJuliaand the sensor data we get from BMCs includes freeform stuff that we predictably translate17:05
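The mapping file in question rewrites dotted statsd names into Prometheus names and labels, roughly like this (shape per the statsd_exporter README; names illustrative). Freeform BMC sensor names would each need such an entry, which is the maintenance problem being raised:

    mappings:
      - match: "ironic.conductor.*"
        name: "ironic_conductor_events"
        labels:
          method: "$1"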
JayFthat also makes sense as to why we were talking past each other a little bit17:06
JayFthat mismatch of "metric type" and "metric name" being separate in statsd but not prom17:06
JayFWould it be useful for us to talk about this sync a little bit?17:06
TheJuliayeah, prometheus has the additional layer of labeling for each metric17:07
JayFI really wish I had a whiteboard and physical proximity to talk about this sorta thing 17:07
TheJuliawell, this was supposed to be a lightweight exploration17:07
JayFhard to do in text only17:07
TheJuliato go down a heavyweight path might mean we punt17:07
TheJuliajust, being completely transparent17:07
JayFso lets back up a step then and ask the core question: what's your end goal?17:07
JayFWhat's the specific thing you want to achieve? Not even "app metrics for prometheus" but like "I want to be able to count how many widgets we provisioned" or whatever17:08
TheJuliato get meaningful metric data about ironic's operation out of ironic plus the node sensor data to prometheus17:08
JayFnode sensor data already goes to prom via IPE, right?17:08
TheJuliaif deployed, yes17:08
JayFack17:08
TheJuliaThe goal would be "how many deploys, how many times we continued cleaning" sort of stuff17:09
JayFSo, if we merged this prom backend for metrics when it's done, would the next step be to make node sensor data use that natively, and deprecate "ceilometer"/ipe support?17:09
TheJuliawe don't need the classic notifications payloads because that is *way* too much information and is more for logging and log statistic aggregation tools like splunk17:09
TheJuliaI'm unsure we could merge a prometheus backend because we would need another service17:10
TheJuliaor another HTTP endpoint to offer the content up17:10
JayFis there any concept of an official prometheus agent?17:11
JayFthat does that http endpoint for "free" for you?17:11
TheJulianot really, everything is supposed to export for pickup by prometheus17:11
JayFI assume the frequency of updates is such that using something intermediate, like uploading the stats file to a remote host somehow (webdav, swift temp urls, etc?) is out of the question17:11
TheJuliathere are things like node exporter, which walks a bunch of endpoints, and creates the html document when called17:12
TheJuliakind of yeah, ipe does two things. It has a plugin that transforms the data, and then runs a tiny webserver for prometheus to grab the data out from it17:12
JayFSo does your PR to add a prom backend, in that case, target IPE as the intermediary?17:13
TheJuliathe path I was heading down was to graft onto the metrics collection interface in ironic-lib to serve as the collection point, where I could then pull the data out of ironic's memory every time the conductor shipped a node sensor update to IPE17:14
TheJuliaIPE picks that up, transforms it, and has it ready for when prometheus goes to get an update17:15
JayFI'm not sure how you could do that from the ironic-lib level17:15
JayFthat sounds like a new API endpoint?17:16
TheJuliaIronic-lib would be in memory and could be the keeper of the local stats data, it would mean a new method, but basic testing I've done shows that works as expected at the moment17:18
JayFI'm confused as to how IPE and Ironic-lib communicate for this purpose17:18
rpittaubye everyone o/17:18
TheJuliawe have a periodic in the conductor which does a sweep for sensor data17:19
JayFo/17:19
TheJuliato ship to the oslo notification endpoint, which IPE picks up17:19
TheJuliaso it would just be a "oh, lets grab this data and send it along too" action17:19
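The periodic in question is controlled by existing conductor options, something like (interval value illustrative):

    [conductor]
    send_sensor_data = true
    send_sensor_data_interval = 600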
JayFokay; I dig it17:20
JayFthe thing that is cool about this, which I wonder if you've forgotten17:20
JayFis we have IPA plumbed up to metrics, too17:20
JayFso you could even enhance node sensor data with in-band metrics from IPA, if desired, too17:20
JayFTheJulia: so here's my suggestion: don't worry about statsd compatibility, change the interface with impunity if you need to make it more prom-friendly, and we retire the statsd backend over time17:21
JayFand then, sometime off in the future, we tie the existing ironic node sensor data fetching into metrics17:22
JayF(from ironic_lib)17:22
JayFand in fact, that could be a very low hanging fruit for a new/more junior contributor17:22
JayFespecially if we make sure to tailor the interface to be well suited for how we use it now (and ignore the use cases from 7 years ago which are not in use so much now)17:22
TheJuliayeah... so then I would need another exporter plugin or to retool IPE's operation17:23
TheJuliahmmm17:23
JayFWhy would you need another exporter plugin or to retool IPE's operation?17:23
JayFI think you laid out a pretty clear migration path; intentional or not:17:24
JayF- add IPE-compatible backend to metrics17:24
TheJuliaIf I'm remembering correctly, it is one-shot operations, not disjointed, mashed-together data operations17:24
JayF- IPE gets ir-lib metrics in the same polling loop it does for node sensor data17:24
iurygregoryTheJulia, let me scroll back to read things17:25
JayFthere's nothing wrong with a generic endpoint in the metrics lib that says17:25
JayF"here are some metrics from out of band"17:25
TheJuliaipe doesn't load ir-lib17:25
TheJuliaipe is an entirely separate process17:25
TheJuliaso it doesn't have access to memory, we would need someplace to store the data17:25
TheJuliahmmmmm17:25
JayFI'm extremely confused; I thought you said IPE polls conductor for the metrics data17:25
TheJuliano, the conductor does it and ships to IPE via the notification plugin17:26
* TheJulia needs to see how that gets invoked17:26
TheJuliawe *might* actually be able to ask ironic lib from the plugin17:26
JayFack; that's the point of misunderstanding I had 17:26
TheJuliaso... this might actually just work if we ask from the i-p-e plugin to ironic-lib for data when running inside of the conductor context17:28
JayFso if we figure out that point of contention, the path is pretty clear, yeah?17:28
TheJuliayeah, it is basically the same path I've been on though :)17:28
JayFyeah but I didn't see it until now17:29
iurygregorygoing to lose some connection, brb17:29
JayFit was foggy ;) 17:29
TheJuliaoh, it is very confusing :)17:29
TheJuliathat is for sure17:29
JayFso okay; thank you for the lesson and the little bit of back and forth17:29
JayFI'm a little worried this is one of those situations where I feel like "we" figured something out but in reality, I was just confused the whole time until now lol17:29
TheJuliaI think I need to fire up a devstack soon() and dig through this to make sure it would work17:30
JayFI do think that statsd->prom thing will be useful though17:30
JayFin terms of figuring out how to translate the metrics from the existing interface terms17:31
JayFnot as a service; but as like, a reference17:31
TheJuliayeah, that is a good point, and I'm realizing now to hand things out to prometheus, we're going to have to label them properly beyond the name17:31
TheJuliawhich means we either create a reference, or... mmm17:31
TheJulias/mmm/hmmm/17:31
JayFI suspect you have all the information you need17:31
* TheJulia puts her glasses on and opens a spreadsheet to review test cases17:32
JayFbecause the type of metric asked for is one piece of data; the name of the metric is the other17:32
TheJuliaI, unfortunately, don't exactly remember what the prometheus labeling is supposed to look like17:32
iurygregoryTheJulia, let me see if I understood your question correctly: are you referring to "fetch those metrics twice"?17:41
TheJuliain what context?17:41
iurygregorywhat JayF said before you asked me "would that make anyone in your area sad/unhappy? "17:43
JayFwe were talking about killing either existing ir-lib metrics or IPE at that point17:43
iurygregoryI don't see a problem tbh 17:43
JayFI think we landed on "make ir-lib metrics use IPE"17:44
JayFwhich would, if anything, make you happier17:44
TheJuliaand would allow us to collect all the data together and have it handy in the prometheus way17:44
iurygregoryI do think it makes sense to collect all data and provide in the prometheus format 17:45
TheJuliaOne intermediate thought we reached is what if we dropped i-p-e for the statsd proxy exporter, but then I realized the need to maintain label mappings would be quite problematic for IPMI data in particular17:50
iurygregoryyeah, ipmi data is "funny"18:00
TheJulia"funny" is a polite way of putting it18:41
stevebaker[m]good morning19:08
TheJuliao/ stevebaker[m] 19:08
* stevebaker[m] starts reading the backscroll19:08
TheJuliahehe19:08
TheJuliawould you have a few minutes to join a call in say 10-15 m?19:08
* TheJulia guesses no :)19:32
stevebaker[m]yes!19:36
stevebaker[m]TheJulia: just caught up19:36
stevebaker[m]TheJulia: What is your github username? I'm just updating ironic-operator/OWNERS19:42
JayFhttps://github.com/openstack/ironic/graphs/contributors looks like juliakreger19:44
stevebaker[m]thanks19:45
TheJuliayup19:56
TheJuliaThanks!19:56
TheJuliasorry, I stepped outside for a few to obtain my daily allocation of vitamin D19:57
TheJuliastevebaker[m]: jparoly sent you an email downstream w/r/t what I was pinging you about. I have 2 hours before my next and final call of the day19:58
*** tosky_ is now known as tosky21:12
opendevreviewJay Faulkner proposed openstack/ironic master: DB & Object layer for node.shard  https://review.opendev.org/c/openstack/ironic/+/86423622:11
opendevreviewJay Faulkner proposed openstack/ironic master: API support for CRUD node.shard  https://review.opendev.org/c/openstack/ironic/+/86623522:11
JayFjust bringing it in line with lint and the spec as written, still need to get into rbac testing22:12
JayFand TheJulia I might take that sync intro after failing at RBAC testing for a bit on my own22:12
* JayF set up some stuff in yaml, couldn't get it to pass no matter what22:13
TheJuliawould 7 am work tomorrow?22:15
TheJulia\I can also do 11 am22:15
JayFprobably, yeah22:15
JayFlike 7am-ish, I start at 7am but not always able to communicate to other humans until some caff is ingested lol22:15
JayFI suspect it's like, something basic I'm missing22:18
JayFmaybe just a single breadcrumb needed22:18
JayFI'll even take another stab at it this afternoon; I'd like to get that RBAC testing in before I work on port queries filtered by node22:19
JayFTheJulia: I think I managed to get on the right track starting again from scratch22:47
JayFTheJulia: the yaml file is so well formatted that I did `shard_patch_set_node_shard_disallowed` and copilot literally completed the entire rest of the dict :) 22:47
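For reference, entries in that yaml are ddt-driven request/assertion pairs, roughly of this shape (keys as commonly used in ironic's RBAC tests; the exact entry is the patch's detail):

    shard_patch_set_node_shard_disallowed:
      path: '/v1/nodes/{node_ident}'
      method: patch
      headers: *reader_headers   # YAML anchor defined earlier in the file
      body:
        - op: replace
          path: /shard
          value: foo
      assert_status: 403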
TheJuliacreepy magic!22:48
TheJuliaheh22:48
TheJuliaokay, well I'll be happy to look at things in the morning22:48
opendevreviewJay Faulkner proposed openstack/ironic master: API support for CRUD node.shard  https://review.opendev.org/c/openstack/ironic/+/86623522:49
JayFTheJulia: feel free to do it as a code review ^22:49
JayFI suspect there may be cases I wanna check that I'm not now, but there is checking being done now and I have confidence the rules are working22:49
TheJuliacool cool22:57
TheJuliawell, off to the grocery store I go23:01
JayFTheJulia: iurygregory: I don't know which of you is invested in bugfix branches; but we're about a month late for our first one of the cycle. My question is essentially: should we actually cut one?23:10
JayFI am happy to propose a change to our release policy indicating we *may* cut a bugfix release (much happier than I'd be cutting this release with no known consumers of it)23:11
vanouJayF TheJulia: if possible, could you give me feedback on my concern in the last comment at https://review.opendev.org/c/openstack/ironic/+/865075 ?23:16
JayFoh yeah, that was on my mental list and dropped off23:17
JayFI'll put this comment in the PR as well, but essentially, don't let yourself be limited by what exists23:18
JayFif we need to add a way, in our ipmi module, to send a single message no retries, or similar, I think you should just add it23:18
JayFor more likely, promote some method from private to public in the ipmi modules, because it almost certainly already exists23:18
vanouThanks for caring. I'll reconsider how to deal with it based on your advice :)23:20
JayFand fwiw, you get some sympathy :) I know in this sort of patch you have two sets of folks to answer to: the ones downstream who want it a certain way, and then we in Ironic who want it a certain way23:22
JayFso don't hesitate to ask if you get stuck or need more direction23:23
vanouThanks a lot JayF o/23:23
JayF\o23:23
vanouIf I change an IPMI method/func, I think I should make a separate patch for that modification. Is that correct?23:25
JayFHonestly, I'd just do it directly in the patch we're talking about; that way we can see it as a result of our comments. 23:26
opendevreviewVerification of a change to openstack/ironic-python-agent stable/victoria failed: Drop python2 from bindep.txt  https://review.opendev.org/c/openstack/ironic-python-agent/+/86265623:26
JayFIf it helps you to split it up; feel free23:26
vanouOK. Thanks :)23:27
TheJuliaJayF: in the past, when we have been late, typically we have evaluated whether there has been anything substantive23:28
TheJuliaAnd just skipping if it has been a quiet month or two23:29
JayFTheJulia: I'm asking the question not because we're late; but because part of the reason we're late is some chatter in here that the release wouldn't be consumed23:29
JayFTheJulia: and also I'm info gathering; I want to have a patch for our release policy ready for when I retire bugfix branches so I can feel like I cleaned up all the release-debt from that :D 23:29
TheJuliaAck, to be clear, we have skipped a few times when there just wasn’t a good reason to ship23:31
TheJuliaAround holidays in particular23:31
JayFthis would absolutely apply23:31
JayFand we should also make sure our published policies fit reality around this23:31
JayFeven if it just means adding more "squish" to the public policy :D 23:32
TheJuliaI thought we already did that, actually, but it has been a while23:34
JayFyou're 10000% right23:35
JayFthere's nothing about the release cadence in the spec now23:36
JayFI'm not insane; it used to say that, right?23:36
JayFnope; it never said that23:38
TheJuliaheh23:45
