Monday, 2023-10-02

rpittaugood morning ironic! o/06:59
opendevreviewMahnoor Asghar proposed openstack/ironic master: Add inspection hooks  https://review.opendev.org/c/openstack/ironic/+/89266107:48
*** Continuity__ is now known as Continuity08:03
ContinuityMorning o/08:04
*** Continuity__ is now known as Continuity10:24
opendevreviewMichal Nasiadka proposed openstack/networking-generic-switch stable/xena: CI: Removing job queue  https://review.opendev.org/c/openstack/networking-generic-switch/+/89702311:13
opendevreviewMichal Nasiadka proposed openstack/networking-generic-switch stable/wallaby: CI: Removing job queue  https://review.opendev.org/c/openstack/networking-generic-switch/+/89702411:13
opendevreviewMichal Nasiadka proposed openstack/networking-generic-switch stable/victoria: CI: Removing job queue  https://review.opendev.org/c/openstack/networking-generic-switch/+/89702511:13
opendevreviewMichal Nasiadka proposed openstack/networking-generic-switch stable/ussuri: CI: Removing job queue  https://review.opendev.org/c/openstack/networking-generic-switch/+/89702611:13
opendevreviewMichal Nasiadka proposed openstack/networking-generic-switch stable/train: CI: Removing job queue  https://review.opendev.org/c/openstack/networking-generic-switch/+/89702711:14
opendevreviewMichal Nasiadka proposed openstack/networking-generic-switch stable/pike: CI: Removing job queue  https://review.opendev.org/c/openstack/networking-generic-switch/+/89702811:14
iurygregorygood morning Ironic o/11:15
opendevreviewMichal Nasiadka proposed openstack/networking-generic-switch stable/pike: CI: Removing job queue  https://review.opendev.org/c/openstack/networking-generic-switch/+/89702811:49
* TheJulia raises eyebrows re branches12:54
TheJuliazigo: yes, you can, the inherent challenge is if you're in a higher security environment, you need to change the network, which was the original root question, if you've tried it out12:55
zigoWell, I'm still stuck at the level of before the last summit, where the Ironic client doesn't use the correct endpoint, so my Ironic setup isn't complete ... :/12:56
TheJuliaugh :(13:08
fricklermnasiadka: iirc JayF was working on retiring at least the older n-g-s branches, so the above should not be necessary13:08
mnasiadkaah ok, I'll abandon - no problem ;)13:08
TheJuliazigo: I don't entirely remember at this point, but didn't it seem like it was rooted in keystoneauth1?13:09
zigoTheJulia: It was hard to debug, because it failed in a generator, so I didn't really know how to go forward...13:09
TheJulia:\13:09
zigoIMO, the issue was in ironicclient.13:09
TheJuliazigo: was there a bug filed?13:09
zigoI'm sorry I had no time to move forward ...13:09
TheJuliaI can't look this week, I'm driving to San Francisco for a conference13:10
zigoNo bug filed, because without enough details it wouldn't have been valuable.13:10
TheJulia.. and giving that talk13:10
TheJuliazigo: ahh13:10
TheJuliahmmm.13:10
TheJuliaI don't remember enough of what was going on to try and hunt13:10
TheJuliaunfortunately13:10
zigoBut in short, it was because I had the Ironic endpoint at /<SOMETHING> instead of :<someport>/ (ie: root of the http).13:10
fricklermnasiadka: see e.g. https://review.opendev.org/c/openstack/releases/+/89570713:11
zigoTheJulia: I'll try to re-start my PoC, which is easy to setup, as I scripted absolutely all.13:11
zigoIt starts Ironic in Debian and all the setup is done with Puppet...13:11
mnasiadkafrickler: sure, done ;)13:12
TheJuliazigo: the openstack puppet modules?13:13
TheJuliazigo: I actually have an idea of where to look, but won't have time really for a few days13:14
dtantsurSo, our scale testers managed to choke Ironic to the extent it only returns NoFreeConductorWorker13:34
dtantsurI wonder how realistic is having a limit of 100 workers. Has anyone tried raising it?13:35
dtantsurcc TheJulia, JayF ^^13:35
TheJuliaI have heard of some folks running towards 20013:35
TheJuliaBut haven't thought about it honestly13:37
TheJuliawheee 3.5 hours of driving to go13:38
dtantsurwheeee13:40
dtantsurdeploying 500-600 nodes in parallel, well, no wonder it has issues...13:42
TheJuliawheeeeeeee13:42
rpittaugood luck TheJulia :)13:43
iurygregorywondering if arne_wiebalck tested deploying this amount in parallel :D13:43
opendevreviewMerged openstack/ironic-prometheus-exporter master: Update master for stable/2023.2  https://review.opendev.org/c/openstack/ironic-prometheus-exporter/+/89604813:43
iurygregorygood luck TheJulia 13:43
TheJuliaI've heard of folks doing such huge deploys... with numerous conductors and an awesome backend database cluster of super magic scale13:44
iurygregorysuper magic is the answer13:44
dtantsurAll we have is 1 conductor backed by SQLite :D13:45
iurygregorywondering why in the past things used to work lol13:46
dtantsurI suspect it was non-converged flow13:46
dtantsurI also suspect we need to do something about heartbeats..13:46
iurygregoryhummmm13:47
TheJuliahow so? I can guess and all :)13:47
dtantsur500 nodes heartbeating may prevent the conductor from doing anything else13:48
opendevreviewBaptiste Jonglez proposed openstack/networking-generic-switch master: Introduce NGS agent design  https://review.opendev.org/c/openstack/networking-generic-switch/+/89704713:49
iurygregorybut this would happen in both flows now? non-converged and the converged one? 13:50
dtantsurno heartbeats in the non-converged flow. it's just ramdisk deploy.13:50
iurygregoryoh ok!13:50
TheJuliadtantsur: well, yeah :)   If we kick back a "no free conductor" error to it, then we likely ought to make sure we have appropriate hold downs in the agent13:51
dtantsurHmm the tester claims to have been running the converged flow before. Interesting.13:53
opendevreviewMerged openstack/ironic-inspector master: Update master for stable/2023.2  https://review.opendev.org/c/openstack/ironic-inspector/+/89604513:53
JayFI wonder if the various misplaced locks somehow were acting as a rate limit or bottleneck.13:54
JayFAnd now with those gone, we'll try as hard as we can to deploy everything immediately, even in a case where that might not be wise13:54
dtantsurPossibly13:55
TheJulialooks like we do have an opportunity to teach the agent to be a little smarter, fwiw13:56
dtantsurI wonder if the thread number must be larger than max_concurrent_deploy13:56
dtantsurand then we can use max_concurrent_deploy to tell BMO to go retry13:56
TheJuliaoh yes, very yes13:56
dtantsurright. max_concurrent_deploy defaults to 250, the number of threads - to 10013:56
TheJuliaconcurrent is across the board/entire deployment wise for a scaled deployment13:57
dtantsurfair enough13:57
* dtantsur needs to remember what happens when `max_concurrent_deploy` is hit13:57
TheJuliamaximum upper bounds of deploys in flight13:58
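The two limits under discussion live in the conductor section of ironic.conf; a sketch with the default values mentioned above (100 worker threads, 250 concurrent deploys), for illustration only:

```ini
[conductor]
# Size of the conductor's worker thread pool
# (default 100 at the time of this discussion).
workers_pool_size = 100

# Upper bound on deployments in flight, across the whole
# deployment for a scaled installation (default 250).
max_concurrent_deploy = 250
```

Note the mismatch dtantsur points out: the deploy cap defaults to more than twice the thread pool size, so a single conductor can run out of workers long before the deploy limit kicks in.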
TheJuliaI think we error on the deploy state change request13:58
dtantsurConcurrentActionLimit is a generic HTTP 500?13:58
TheJuliaI believe so yes13:59
dtantsurhmm, inconvenient13:59
dtantsurTheJulia, do you remember why it was not derived from TemporaryFailure?13:59
TheJuliano specific reason14:01
TheJuliaI'd treat it as a bug at this point if we want to change it14:01
TheJulia*also*14:01
rpittaugoing to catch a flight, see you tomorrow! o/14:01
TheJuliathe concurrent action limit was mainly for abuse prevention, so TemporaryFailure is likely better14:02
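The fix being discussed (deriving ConcurrentActionLimit from TemporaryFailure so clients get a retryable 503 instead of a generic 500) boils down to choosing the right base class. A minimal sketch, where the class layout and the `code` attribute convention are assumptions for illustration, not Ironic's actual code:

```python
# Illustrative exception hierarchy: the API layer maps an exception's
# `code` attribute to the HTTP status it returns, so changing a base
# class changes the response from 500 to a retryable 503.

class IronicException(Exception):
    code = 500  # default: generic server error


class TemporaryFailure(IronicException):
    code = 503  # Service Unavailable: the client should retry later


class ConcurrentActionLimit(TemporaryFailure):
    """Raised when a concurrency limit (e.g. max_concurrent_deploy) is hit."""


def http_status(exc):
    # Hypothetical helper standing in for the API's error mapping.
    return getattr(exc, "code", 500)
```

With this arrangement, callers such as the Metal3 baremetal-operator can distinguish "back off and retry" (503) from a real server bug (500).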
iurygregorysafe travels rpittau o/14:02
dtantsur(logs with thousands of clusters look scary)14:02
TheJuliaugh14:02
iurygregoryyes it does14:02
dtantsurI've just had my archive manager crashing on trying to read this tar.gz14:02
iurygregoryespecially when the files were rotated every 50MB.. =)14:02
TheJuliaugh14:02
iurygregoryoh wow, this hasn't happened to me yet14:03
TheJuliaIt has happened to me a few times14:03
TheJuliaI've gotten multi gigabyte log dumps from customers14:03
dtantsuriurygregory: must-gather for just the openshift-machine-api namespace is 2.5G14:03
TheJuliaarchive manager chokes hardcore14:04
iurygregoryWTH14:04
TheJuliaYou need to manually extract for your own sanity14:04
dtantsurthat's a monstrous testing lab. they repeatedly launch thousands of OCP clusters for 10-15 hours.14:04
iurygregoryonly for that namespace?! JESUS14:04
iurygregoryyeah, the 502 bug I'm working is on the same lab if I recall14:05
dtantsuryeah, I think I've extracted the relevant bits14:05
dtantsurthinking what I could need more14:05
dtantsur(thank god vim can handle 100M text files - normal editors choke much sooner)14:05
iurygregorytruth ^14:06
dtantsurI must admit, I'm excited about this bug. I love to see Ironic tested at such a scale!14:08
TheJuliadtantsur: if you want to change the base class to temporaryfailure, consider me +214:11
dtantsuron it14:11
TheJuliaCan't spend much more time talking about it today, I need to hit the road soon if I want to get into San Francisco before lunch time14:11
dtantsurSure, no worries. I need to stare at the logs, maybe play with numbers.14:12
TheJuliaenjoy14:12
TheJulia:)14:12
* TheJulia packs up laptop and goes to find breakfast and resume the drive14:26
iurygregorysafe travel TheJulia =)14:29
opendevreviewDmitry Tantsur proposed openstack/ironic master: Fix the HTTP code for reaching max_concurrent_deploy: 503 instead of 500  https://review.opendev.org/c/openstack/ironic/+/89705014:34
dtantsurbtw anyone who wants to drive forward scale testing Ironic should review https://review.opendev.org/c/openstack/sushy-tools/+/875366/14:38
dtantsurthis can allow us to test this sort of scale without being able to create many thousands of VMs14:38
dtantsurhold on, what the hell is this? https://opendev.org/openstack/ironic/src/branch/master/ironic/conf/conductor.py#L216-L21714:44
dtantsurthis seems... far-reaching14:45
iurygregoryI think it was for the effort to support multi-arch or something14:45
iurygregorytrying to remember14:45
JayFdtantsur: deploy_ramdisk_by_arch14:45
dtantsurI know why the new options were added, I see no point in deprecating the old ones14:45
dtantsurit's just new busy-work for 99% of operators14:45
JayFThere was consensus in the reviews for that change that deprecation was the right move; with an extremely long tail14:45
dtantsurjesus.. okay14:46
JayFbasically try to drive new opers to the *_by_arch but we don't have a timeline for actually pulling the deprecated value14:46
JayFlet me put it this way: I would not be upset if this value existed for a long, long, long time in a deprecated state to avoid that exact busywork you are concerned of14:46
dtantsurokay, that makes sense14:46
JayFbut we don't wanna have new opers using the more coarse setting with a fine grained one existing14:46
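The coarse option and its per-architecture replacement discussed above would sit side by side in the conductor config; a sketch (the image URLs are placeholders, and the exact option names and dict syntax come from the change under discussion, so they may differ):

```ini
[conductor]
# Coarse option: one ramdisk for everything. Deprecated with a very
# long tail, per the consensus described above.
deploy_ramdisk = http://server/ipa.initramfs

# Finer-grained replacement: per-architecture arch:ref pairs, so a
# mixed-arch shop can serve the right ramdisk to each machine.
deploy_ramdisk_by_arch = x86_64:http://server/ipa-x86_64.initramfs,aarch64:http://server/ipa-aarch64.initramfs
```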
dtantsurHas anyone pondered if we need bootloader_by_arch? I don't know the answer offhand14:47
JayFThat's a pretty astute observation.14:47
JayFthe answer is almost certain yes. dang14:48
dtantsurPossibly no. EFI partitions are arch-independent, aren't they?14:48
JayFthey are?14:48
dtantsurhaha14:48
dtantsurwelllll... I think so?14:48
JayFI honestly have very little experience with ... standards-compliant, server quality arm14:48
JayFIs there any ... affordable arm hardware with some kind of BMC and efi support?14:49
dtantsurOn the other hand, operators may not wish to include all possible boot loaders in all possible ISOs14:49
dtantsurJayF, if you find one, tell me. I've been looking for it for some time.14:49
TheJuliaThey are arch independent14:56
TheJuliaThey have separate files by arch name14:56
dtantsuryeah, but it may not be desirable to have really large bootloaders14:57
dtantsur* images14:57
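TheJulia's point above is the UEFI removable-media convention: the ESP layout is arch-independent, but the fallback bootloader file name encodes the architecture. A small sketch of the standard paths (the helper function is hypothetical; the file names come from the UEFI specification):

```python
# UEFI removable-media fallback bootloaders: one FAT-formatted ESP
# layout, with a per-architecture file name under /EFI/BOOT.
FALLBACK_BOOTLOADER = {
    "x86_64": "BOOTX64.EFI",
    "i386": "BOOTIA32.EFI",
    "aarch64": "BOOTAA64.EFI",
    "arm": "BOOTARM.EFI",
}


def fallback_path(arch):
    """Path of the default bootloader for an arch, relative to the ESP."""
    return "/EFI/BOOT/" + FALLBACK_BOOTLOADER[arch]
```

This is also why dtantsur's size concern holds: shipping every arch's bootloader in every ISO works, but inflates the images.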
JayFso not crucial, but a nice to have14:58
JayFanother low-hanging-fruit thing that has clear examples on how to implement14:58
* JayF adds to his list to write it up for LP14:58
JayFI've been focusing on trying to find nuggets like this and write up h/t low-hanging-fruit bugs in LP for them14:58
TheJuliaRe by arch stuff, mixed arch shops have used it14:58
JayFI *try* to put enough context to make them doable by anyone who knows how to submit patches to openstack14:58
JayFoh hey it's time15:00
iurygregoryyes15:00
JayF#startmeeting ironic15:00
opendevmeetMeeting started Mon Oct  2 15:00:23 2023 UTC and is due to finish in 60 minutes.  The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot.15:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.15:00
opendevmeetThe meeting name has been set to 'ironic'15:00
JayFWelcome to the weekly Ironic meeting.15:00
iurygregoryo/15:00
dtantsuro/15:00
JayFA reminder that this meeting is held under the OpenInfra Code of Conduct available at https://openinfra.dev/legal/code-of-conduct.15:00
JayF#topic Announcements/Reminder15:00
TheJuliao/15:00
JayF#info As always; please add hashtag:ironic-week-prio to patches ready for review; and prioritize your reviews on such patches.15:01
JayF#topic Review Action Items15:01
JayFI had an action to cut another release of NGS; I have not yet as it never made it onto my todo list; my bad. 15:02
JayFI'll get that done as soon as we're done15:02
JayF#action JayF Carryover: release another NGS15:02
JayF#topic Bobcat release: October 4th, 202315:03
JayFOther than that refreshed action; is there anything else we need to do/consider/etc?15:03
JayFThank you all for another awesome Ironic release! \o/ I'll be on OpenInfra Live on Thursday talking about it; if you want me to highlight something specific please say so.15:03
iurygregory\o/15:04
TheJuliaNot that I’m aware of, likely a reminder wrt the PTG and etherpad15:04
JayFack; good callout15:04
JayFThat means I need to have a time for the BM SIG by then15:04
JayFwhich leads into...15:04
JayF#topic PTG Planning for PTG October 23-27, 202315:05
JayFI'll be honest, the time I had set aside to take care of this before the meeting was eaten by an emergency Friday; but for now my plan is the following:15:05
JayFI'm working off the assumption of similar/identical availability to last PTG15:05
JayF1300-1700 UTC main window, 2200-2400 APAC friendly if needed. Any objections or additional input?15:05
JayFI will be sending this to the mailing list as well for feedback.15:05
TheJuliaI think that is reasonable15:06
JayFMy intention will be to start labelling topics in PTG etherpad with times (or 'do we need to talk about this?' in a couple of cases I suspect)15:06
JayFI'll make noise in here when I do so if someone wants to async feedback with me; the goal will be to have a schedule nailed down for time slots before Thursday, and hopefully have topics<->timeslots nailed down by EOW15:07
TheJuliaI advise against aggressive compression, some of the topics can be split as well15:07
JayFack15:07
TheJuliaAlso, operator feedback being good, we should be prepared for emergent items15:08
JayFmaybe like, try to keep the top level slots to a larger theme15:08
TheJulia++, and just be flexible15:08
JayFI think we always are15:08
dtantsur++15:08
JayFPTG schedules for Ironic are ... suggestive at best :D 15:08
JayFGoing to move on?15:08
JayF#topic Review Ironic CI status & update whiteboard if needed15:08
TheJuliaWe also need to schedule a 3rd party ci session during the PTG15:09
JayFThat is on the list :)15:09
TheJuliaCool cool15:09
TheJuliaI resume driving then15:09
JayFthe list is ... huge15:09
JayFIf you've (not Julia-you, anyone-listening-you) not looked at it, please do15:09
JayFI have no knowledge about CI status and the like15:09
JayFI think we're doing OK? Julia fixed some metalsmith pain around SDK/Ansible shenanigans15:09
JayFLooks like folks have similar opinions to CI as me :) outta sight outta mind15:11
JayF#info No RFEs to review, skipping agenda item15:11
JayF#topic Open Discussion15:11
dtantsuryeah, I'm not aware of issues15:11
JayFany open discussion items?15:11
dtantsurDoes anything speak against increasing heartbeat interval?15:12
JayFHmm.15:12
JayFwe used to document it loudly15:12
TheJuliadtantsur: eh, mixed feelings. I’d teach agent to handle the too busy case first15:13
JayFoh, you're asking a different question than I thought you were?15:13
JayFheartbeat interval will not help you for performance when doing 500+ deploys15:13
JayFbecause agent will check in as soon as it's done with an item15:13
JayFregardless of hb interval, as an optimization15:13
dtantsuryeah, I know15:13
dtantsurbut at this scale, even periodic check-ups can be an issue15:13
JayFThis seems like a combination of edge cases? Extreme scale + fast track + simultaneous deploys? Just making sure I understand15:14
JayF?15:14
JayFI guess extreme scale isn't really a good term, it's more about pushing so many deploys at once15:14
dtantsurI don't think fast track is even relevant; just huge scale15:14
JayFI think it's: huge scale *on one conductor*15:15
JayFwhich traditionally isn't even something we've optimized for, yeah?15:15
TheJuliaGuys, we’re in meeting…15:15
JayFthis is open discussion15:15
JayFI think this counts :)15:15
TheJuliaOh, didn’t realize we moved to open15:15
TheJuliaSorry!15:15
iurygregoryit happens =)15:15
JayFI'm just trying to understand the problem space/edges here... if it's single conductor/single process I'm not super surprised there are edges here, and increasing perf in that case is good (and handling being on the edges of our scale)15:16
JayFdtantsur: I suggest writing this up in an LP bug, with details about the edges, and potentially if we can brainstorm specific "scaling up simultaneous deploys w/single-process ironic" maybe have a PTG session about it15:18
dtantsurhmm, good point15:18
JayFI know for a fact you can do 500 simul deploys with Ironic with a scaled up cluster, conductor groups, all the trimmings :D15:19
JayFdoing it on a single-process will be an extremely valuable exercise in seeing how Ironic handles queuing tasks and being patient under heavy load15:20
JayFit'll be fun and will be a nice robustness exercise, similar to how the sqlite locking fixes were, except hopefully these will be 500x less painful :D15:20
JayFI'm going to close the meeting, this can just be chat in channel15:20
JayFty all o/15:20
JayF#endmeeting15:20
opendevmeetMeeting ended Mon Oct  2 15:20:52 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)15:20
opendevmeetMinutes:        https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-10-02-15.00.html15:20
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-10-02-15.00.txt15:20
opendevmeetLog:            https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-10-02-15.00.log.html15:20
JayFheh it just occurred to me15:22
JayFwe literally *remove the queue* for single process Ironic15:22
JayFlol15:22
JayFAction item actioned https://review.opendev.org/c/openstack/releases/+/897054 NGS 7.3.0: addl release for Bobcat15:32
opendevreviewDmitry Tantsur proposed openstack/ironic-python-agent master: [WIP] Delay heartbeat after a failure and handle Unavailable  https://review.opendev.org/c/openstack/ironic-python-agent/+/89705515:41
dtantsuropinions?15:41
dtantsurhmm, maybe it's actually already handled by the event.wait15:42
dtantsurJust as a data point, Ironic received around 100 heartbeat requests PER SECOND15:49
dtantsurI nearly think we need a separate thread pool for heartbeats15:54
iurygregorythe logic does make sense to me16:01
iurygregorytrying to understand the part event.wait you mentioned 16:02
dtantsuriurygregory: see the loop just above. it uses wait() on an event with a minimal wait for 5 seconds already16:03
iurygregoryoh the while not self.stop_event.wait ?16:03
dtantsuryeah16:03
iurygregoryright, make sense16:04
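The loop pattern they are describing, Event.wait as the sleep so the agent can be stopped promptly, a minimum interval, and a delay after failures, can be sketched as follows. This is an illustration only; the class and attribute names are assumptions, not IPA's actual code:

```python
import random
import threading


class Heartbeater:
    """Sketch of a heartbeat loop: sleep via Event.wait (wakes
    immediately when stop_event is set), enforce a minimum interval,
    and back off after a failed heartbeat so a conductor returning
    NoFreeConductorWorker isn't hammered by hundreds of agents."""

    min_interval = 5.0

    def __init__(self, send, interval=30.0):
        self.send = send              # callable that posts one heartbeat
        self.interval = interval
        self.stop_event = threading.Event()

    def run(self):
        wait = max(self.interval, self.min_interval)
        # Event.wait() returns True only once stop_event is set.
        while not self.stop_event.wait(wait):
            try:
                self.send()
                wait = max(self.interval, self.min_interval)
            except Exception:
                # e.g. HTTP 503 from the conductor: retry later, with
                # randomized growth so the agents don't resynchronize.
                wait = max(self.min_interval, wait * (1.5 + random.random()))
```

The randomization matters at scale: without jitter, 500 agents that all failed at once would all retry at once.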
dtantsurWe also have an issue in our thread pool handling that essentially doubles the capacity...16:05
TheJulia100/second is super impressive16:09
shermanmI actually had a question related to large (100-500) instance launches on a single conductor. Most of the time seemed to get taken up by hashing and converting the glance image from qcow2->raw?16:13
dtantsuryou can easily hit that (not my case though)16:14
dtantsurso, we're doing a weird thing. we let eventlet create 100 threads. so far so good. then we let futurist park 100 more requests that just wait for these 100 threads.16:14
dtantsurthis is my fault: I think I misunderstood the futurist API back then.16:14
dtantsurI don't believe letting requests sit and wait is a good idea in any case. I'd rather have 200 threads trying to run.16:15
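The double-capacity problem dtantsur describes (100 running threads plus 100 parked submissions, each pinning a TaskManager and its node lock) and the "reject instead of queue" fix can be illustrated with the standard library. Ironic itself uses futurist; this stdlib sketch with hypothetical names only shows the idea:

```python
import concurrent.futures
import threading


class RejectingExecutor:
    """Sketch: cap in-flight work at max_workers and reject overflow
    immediately, instead of letting submissions queue behind a full
    pool (where each queued item would hold resources, such as a node
    lock, for an unbounded time)."""

    def __init__(self, max_workers):
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers)
        self._slots = threading.BoundedSemaphore(max_workers)

    def submit(self, fn, *args, **kwargs):
        # Non-blocking acquire: if every slot is busy, fail fast so the
        # API can return a retryable error (NoFreeConductorWorker).
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("NoFreeWorker")
        fut = self._pool.submit(fn, *args, **kwargs)
        fut.add_done_callback(lambda f: self._slots.release())
        return fut
```

With futurist, roughly the same effect comes from sizing the pool up and rejecting once the backlog is reached, which is the shape of the "remove queueing" change linked below.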
shermanm(sorry, I also don't need to interrupt the threading/heartbeat discussion)16:16
dtantsurno worries. you're right: image conversion is very heavy. shouldn't block the threads, but will otherwise take a ton of time and CPU16:17
* dtantsur hopes he didn't scare them away :(16:18
shermanmmostly I was trying to figure out what combination of configuration triggered such a conversion on the host, as I was under the impression that you could send a qcow2 (or raw) image to the agent directly, and not have the controller do this conversion16:20
dtantsuryou can, you need to set force_raw to False, keeping in mind that your qcow2 conversion will happen in memory on the node16:22
shermanmahh, thank you16:24
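For shermanm's case, the knob dtantsur refers to is, to the best of my knowledge, the `force_raw_images` boolean in ironic.conf; a hedged sketch:

```ini
[DEFAULT]
# Keep qcow2 images as-is instead of converting to raw on the
# conductor. The trade-off noted above: the qcow2-to-raw conversion
# then happens in memory on the node being deployed.
force_raw_images = False
```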
opendevreviewDmitry Tantsur proposed openstack/ironic master: Bump workers_pool_size to 300 and remove queueing of tasks  https://review.opendev.org/c/openstack/ironic/+/89707116:44
dtantsurthe case of a release note being larger than the change ^^^ JayF and TheJulia, your opinions especially appreciated16:44
JayFI renew my request for documenting the conditions in a bug; it's difficult for me to consume that sorta information in the IRC firehose and without all the context it's hard to review the change16:45
dtantsurJayF, it's not per se a fix for that bug, more of a logical problem in our defaults16:45
JayFwow that is like, two lines and 50 lines of release note LOL16:45
dtantsurI'll WIP it until I can double-check all assumptions there. but please review anyway.16:47
JayFthe changes look OK, but also are the sort of thing that could have unintended side effects16:47
JayFI wanna see the CI run on it, and I'm curious what it'd do different under load but we don't have a good way to test that upstream16:48
JayFif we are going to land a change like that; top of the cycle is the 100% ideal place to land it so if it did cause weirdness we'd find it 16:48
JayF(not the default change; the no-more-queuing)16:48
dtantsurI feel like futurist is a mess...16:48
dtantsurI think my assumptions may be right. Will see what the CI says.16:58
dtantsurThe more I think about it, the more I'd like to backport the no-more-queuing17:12
dtantsurQueuing means we're keeping TaskManager instances cached somewhere inside futurist. Potentially for a long time.17:13
dtantsurPotentially blocking threads that are actually running. Ugh.17:13
JayFlocked, unlocked, or both?17:13
JayFAre taskmanagers always a lock?17:13
dtantsurOnly with shared=False17:14
dtantsurBut we use that quite often. And then go into spawn_worker.17:14
JayFah17:14
dtantsurHa, the tested has found an issue in their configuration. Maybe it's not too bad in the end :D17:18
dtantsur* the tester17:18
dtantsur(they claim to have tested the same thing successfully in the past, so who knows)17:18
dtantsurAnyway, time for dinner. It's a public holiday tomorrow, so talk to you on Wednesday.17:19
JayFDoes this make sense as a statement for the Bobcat update?18:28
JayF> Ironic has more thoroughly tested, and resolved many issues related to single conductor mode with SQLite database backend. This is the configuration utilized by Metal3.18:28
TheJuliaI would reword the the very first part somehow19:09
TheJuliaMaybe flip it, "Ironic resolved issues reported by the Metal3 project related to SQLite database utilization."19:11
JayFoooooh19:11
JayFI like that a *lot* better19:11
JayFI know it sounded weird, making them the actor fixes it19:11
TheJuliayeah19:13
TheJuliaand also emphasises the feedback loop and the power of it19:14
JayFI have a *very* bad habit of slipping into passive voice much too often19:14
TheJuliahttps://usercontent.irccloud-cdn.com/file/uKuKXHp4/IMG_1782.JPG19:18
TheJuliaMy awesome hotel room view right now19:19
JayFWhoa I didn't know they were redoing the ferry building tower (that is what it is, yeah?)19:19
JayFI used to work at 2nd@Folsom, I'd often walk straight down Folsom to right under the bay bridge and just look at the water after a long day19:19
JayF(and you got to get on your muni train at the first stop guaranteeing a seat but that was merely a bonus)19:20
TheJuliaJayF: yeah19:23

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!