Monday, 2023-07-17

rpittaugood morning ironic! o/06:48
iurygregorygood morning Ironic11:25
*** iurygregory_ is now known as iurygregory12:08
iurygregoryanything urgent for review?12:15
TheJuliagood morning13:01
rpittaugood morning TheJulia :)13:03
rpittauiurygregory: anything in ironic-week-prio tags?13:03
TheJuliaI added it to https://review.opendev.org/c/openstack/ironic/+/88850613:05
TheJuliahttps://review.opendev.org/c/openstack/ironic/+/888500 is likely super close, but I've not seen it log any retries :(13:05
iurygregorygood morning TheJulia o/13:08
iurygregoryrpittau, right, one week of PTO and I started to forget things lol13:08
rpittau:D13:08
rpittauwell to be fair it's probably not super up-to-date13:08
iurygregory=X13:09
TheJuliamore along the lines of we've been fighting the gate a lot13:09
rpittauyeah13:11
TheJuliaA continued fight to reduce lock conflicts in metal313:12
TheJulianuking the heartbeats seems to help a *ton*13:12
TheJuliawhich leaves retry really only viable to test with a loaded CI system13:13
iurygregoryyup, I was imagining that we would still be fighting CI13:13
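A minimal sketch of the heartbeat mitigation mentioned above, assuming the decision can simply be keyed off the configured database connection URL (the actual Ironic change is wired through the conductor code; should_heartbeat is a hypothetical name):

    def should_heartbeat(db_connection_url):
        """Skip the periodic conductor heartbeat when the backend is SQLite.

        With SQLite every heartbeat touch is another writer contending for
        the single database file, so the cheapest mitigation is to not do it.
        """
        return not db_connection_url.startswith('sqlite')

    # should_heartbeat('sqlite:////var/lib/ironic/ironic.sqlite')  -> False
    # should_heartbeat('mysql+pymysql://ironic@db/ironic')         -> True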
rpittau"EFI variables are not supported on this system" ok....13:44
TheJuliawhere did you see that... 8|13:45
rpittautinycore 14.x patch https://621ce5ea301cd3e5f94e-66098508dda66f6b765978d818669c61.ssl.cf2.rackcdn.com/887754/2/check/ironic-standalone-ipa-src/aaad71d/job-output.txt13:45
rpittauare we back to jammy for standalone? don't remember13:45
TheJuliawe are13:46
TheJuliaidentified the issue13:46
TheJuliabug filed, routed around13:46
rpittauok, something else going wrong there then13:47
TheJuliaIf we can get some reviews on https://review.opendev.org/c/openstack/ironic/+/888506 I think it would make sense to proceed with it13:51
TheJuliaThen again, I have no idea if CI is presently super happy load wise, or hates our existence right now.13:51
TheJuliaAlso, file under "wheeeeeee": https://bugzilla.redhat.com/show_bug.cgi?id=222298113:51
rpittauw00t13:56
iurygregoryDon't actually heartbeat with sqlite14:00
iurygregoryWOW14:00
iurygregoryre 2222981 WHAT?!14:01
TheJuliayeah.....14:05
TheJuliaand the retry change passed again14:06
mohammedTheJulia we have built an ironic image for metal3 with the fix https://review.opendev.org/c/openstack/ironic/+/888188 but still get the error sqlite3.OperationalError: database is locked14:12
TheJuliamohammed: there are three competing issues, one is periodics, the next is the heartbeat sync for conductor status which I've got a patch up to disable when using sqlite, and the third seems to be to add db retry logic14:17
TheJuliaso, it was not a "silver bullet" unfortunately, but I think we're almost there14:17
TheJuliaI also worked until midnight local time on Friday night to try and get it sorted, and I'm still trying to observe failures in CI14:17
TheJuliaif you're willing, https://review.opendev.org/c/openstack/ironic/+/888506 and https://review.opendev.org/c/openstack/ironic/+/888500/ should reduce the chance substantially and add a retry decorator around db writes.14:20
mohammedTheJulia do you think replacing SQLite with MariaDB can limit the occurrences of this issue?14:20
TheJuliaabsolutely, MariaDB can handle the multithreaded write operations natively, whereas write operations with sqlite are more transactional at the file level, so everything going on right now tends to result in one consumer encountering another14:22
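For reference, the general shape of the retry-on-lock pattern being discussed; a sketch of the idea only, not the actual change in 888500, which decorates Ironic's own db write methods rather than raw sqlite3 calls:

    import functools
    import sqlite3
    import time

    def retry_on_sqlite_lock(max_attempts=5, initial_delay=0.1):
        """Retry a DB write when SQLite reports 'database is locked'."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                delay = initial_delay
                for attempt in range(1, max_attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except sqlite3.OperationalError as exc:
                        if ('database is locked' not in str(exc)
                                or attempt == max_attempts):
                            raise
                        time.sleep(delay)
                        delay *= 2  # back off a bit more on every retry
            return wrapper
        return decorator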
JayFrpittau: are you going to run today's meeting?14:29
JayFIt's already 30 minutes late :( 14:29
rpittau?14:29
rpittauisn't it in 30 minutes ?14:29
JayFI just traversed a giant amount of timezones14:29
JayFso I would trust your clocking better than mine14:29
rpittau:D14:29
JayFlol14:29
JayFsee, this is why I'm not working today14:29
rpittauJayF: no worries, I'll take care of the meeting :)14:30
JayFthank you lol14:30
JayFI was just so confused14:30
iurygregoryyeah, it's in 30min14:30
opendevreviewJulia Kreger proposed openstack/ironic master: Retry SQLite DB write failures due to locks  https://review.opendev.org/c/openstack/ironic/+/88850014:34
TheJuliamohammed: See the release note attached to ^14:34
mohammedTheJulia thanks for your efforts! We'll try replacing SQLite with MariaDB on our CI to mitigate the issue, while following the progress of solving it with SQLite14:37
TheJuliamohammed: you might just want to wait 24-48 hours, since it seems like we might have a path forward ready to go, we just haven't seen enough load in our CI to trigger locking issues yet, either today or over the weekend14:39
opendevreviewJulia Kreger proposed openstack/ironic master: Retry SQLite DB write failures due to locks  https://review.opendev.org/c/openstack/ironic/+/88850014:40
TheJuliaso... 4k disks. If we can't use iso9660, and we cannot use vfat, what options are really left, ext2/3 ?14:41
TheJuliai guess we could detect and repack it as xfs as long as the config label exists14:42
* TheJulia wonders what local patch HPE's CI is keeping that merge conflicts with upstream14:48
mohammedTheJulia sounds like a great option, we can patiently wait for this fix! Thanks :)14:48
TheJuliamohammed: since you can locally reproduce, definitely give the two additional patches a try, I'm really quite hopeful14:50
opendevreviewElod Illes proposed openstack/ironic stable/victoria: [stable-only] Cap virtualenv/setuptools  https://review.opendev.org/c/openstack/ironic/+/88870114:57
*** iurygregory_ is now known as iurygregory14:59
rpittau#startmeeting ironic15:00
opendevmeetMeeting started Mon Jul 17 15:00:05 2023 UTC and is due to finish in 60 minutes.  The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot.15:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.15:00
opendevmeetThe meeting name has been set to 'ironic'15:00
iurygregoryo/15:00
TheJuliao/15:00
rpittauwelcome everyone to our weekly meeting!15:00
rpittauI'll be your host for today :)15:00
rpittauThe meeting agenda can be found here:15:00
rpittau#link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting15:00
rpittau#topic Announcements/Reminder15:01
rpittauwe announced last week that the next PTG will take place virtually on October 23-27, 202315:01
rpittaugoing to remove the reminder after this meeting15:02
rpittauwe'll remind everyone when we're closer to the date15:02
rpittau#note usual friendly reminder to review patches tagged #ironic-week-prio, and tag your patches for priority review15:02
rpittauI'm leaving the bobcat timeline for reference in the reminder section15:03
rpittauany other announcement/reminder today ?15:04
TheJuliaI worked stupidly late on Friday night on a decorator for sqlite issues15:04
rpittaulol15:04
TheJuliaI *think* it works, but I've not seen CI give me a resource contention issue to generate a "database is locked" error yet today15:04
TheJuliaunit test wise, it definitely works!15:05
rpittau\o/15:05
rpittaulet15:05
iurygregory:D15:05
rpittauI guess we need to review it and recheck a couple of times15:05
TheJuliait has been tagged as a prio review, any reviews/feedback would be appreciated15:05
TheJulia++15:05
rpittaugreat15:05
rpittauthanks TheJulia :)15:05
TheJuliayeah, I think I'm on the 3rd stupidly minor change today, so hopefully that is helping15:05
rpittau(I'm silently skipping the action items review as there are none from last time)15:06
rpittausince we started with that patch let's go to15:06
rpittau#topic Review Ironic CI Status15:06
rpittauso we're back to jammy, except for the snmp pxe job15:07
rpittauand still battling the sqlite shenanigans15:07
TheJuliayup, I have no idea why it is failing, but I don't remember the exact reason we held it back to begin with15:07
rpittauI don't know, I'll try to make time this week to try and replicate, downstream permitting15:08
rpittaualso found an issue with the new tinycore 14.x in the standalone job15:08
rpittauthe one I mentioned before15:08
rpittauoutput of efibootmgr "EFI variables are not supported on this system"15:09
rpittauoh well15:09
rpittauwe're probably not in a rush for that15:09
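For context on that efibootmgr message, a rough approximation of the condition it complains about, assuming the standard Linux efivarfs mount point (efibootmgr itself checks more than this):

    import os

    def efi_vars_supported(efivars='/sys/firmware/efi/efivars'):
        """True if the kernel booted in UEFI mode and exposed EFI variables.

        "EFI variables are not supported on this system" usually means the
        ramdisk booted in legacy/BIOS mode or efivarfs is not mounted, so
        efibootmgr has nothing to read or write.
        """
        try:
            return os.path.isdir(efivars) and bool(os.listdir(efivars))
        except OSError:
            return False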
rpittauanything else to mention for the CI status?15:10
rpittauok, moving on15:10
rpittau#topic 2023.2 Workstream15:10
rpittau#link https://etherpad.opendev.org/p/IronicWorkstreams2023.215:10
rpittauany update to share?15:11
iurygregoryfeedback would be appreciated on the Firmware Update patches =)15:11
TheJuliaI've updated the patch to support vendor passthru methods as steps15:11
TheJuliaand it now passes CI \o/15:11
rpittauiurygregory: yeah, was going to mention that, I have some time this week, I'll put here on top of my list15:12
rpittauTheJulia: great!15:12
TheJuliahopefully get back to service steps this week15:12
* TheJulia hopes15:12
iurygregorytks rpittau o/15:12
TheJuliabut CI is obviously the priority15:12
iurygregory++15:12
iurygregoryI will take a look at patches today for CI15:13
rpittauok, great15:14
rpittauI think we're good15:14
rpittau#topic RFE review15:14
rpittauJayF: left a couple of notes15:14
TheJuliaSo, Jay also put the same link on both15:14
rpittauso the first one is https://bugs.launchpad.net/ironic/+bug/202768815:14
rpittauah yeah :D15:14
* TheJulia hands JayF coffee15:14
TheJuliaI left a comment on the first one15:15
rpittauthe second one is https://bugs.launchpad.net/ironic/+bug/202769015:15
rpittauI'll update the agenda15:15
TheJuliaI think the idea is good overall, but we need to better understand where the delineation is, and even then, some of the things that fail are deep in the weeds15:15
TheJuliaand you only see the issue when you're at that step, deep in the weeds15:15
TheJuliain other words, more discussion/clarification required15:16
rpittaummmm I agree15:16
TheJuliasince we can't go check switches we know *nothing* about15:16
TheJuliabut if we can provide better failure clarity, or "identify networking failed outright" mid-deploy, then we don't have to time out completely15:16
iurygregory++, I like the rfe, but we will need further discussion about it15:17
TheJuliaso yeah, I agree with the second rfe, we'll need it as an explicit spec and we'll need to figure out how to handle permissions15:18
TheJuliait might just be we pre-flight validate all permission rights and then cache the step data from the templates15:18
* iurygregory is reading the second15:18
TheJuliabut, that requires a human putting their brain into the deploy templates code15:19
rpittauchecking early for the permissions would be best IMHO15:19
TheJulia*also* custom clean steps would be a thing to consider15:19
TheJuliamaybe we just... permit them to be referenced, dunno15:19
TheJuliathat is a later idea I guess15:19
rpittauyeah15:19
rpittauonce we define the templates, it shouldn't be too hard to expand with custom clean steps15:20
TheJuliayeah15:20
JayFI mainly wanted feedback on the interfaces I laid out there; I intend to specify the implementation just to keep my thoughts straight15:21
iurygregory++15:21
TheJuliaright now you need to have system privileges for the deploy template endpoint if memory serves, so if we take the same pattern as add fields, begin to wire everything together, and then change the rbac policy, it should be a good overall approach15:21
TheJuliawe have the patterns already spelled out in nodes fairly cleanly too15:22
TheJuliaand allocations15:22
rpittauright!15:22
TheJuliabecause you can have an allocation owner15:22
TheJulia(but not lessee, since that model doesn't exist there)15:23
rpittauwe're probably going to discuss both RFEs further, but they both look good to me15:24
TheJuliayeah, first one just needs some more "what are the specific failure cases we're trying to detect" to provide greater clarity15:25
rpittauprobably clarify some aspects and then finalize during the PTG ?15:25
TheJuliabecause honestly, there are some areas we do a bad job with today, and we should make that better since they could still fail there even if neutron has a thing saying "yes, I can log in to the switch"15:25
JayFI don't think for the precheck idea, we're looking for anything perfect, just catch anything obviously broken15:27
JayFit was suggested by johnthetubaguy that for ilo, for instance, there's an internal BMC status and you could fail if that wasn't 'green'15:27
TheJuliaI *suspect* the big one is the "bmc is half broken" weird case, and I'm not sure we actually test a swift tempurl we create...15:28
JayFand I know with the hardware I ran at Rackspace, even just an ipmi power status would've been enough to indicate if our most common failure modes were active15:28
iurygregoryyellow can be a thing depending on what you are looking at in iLO15:28
iurygregoryXD15:28
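As a concrete example of the cheap precheck idea, a hypothetical helper that shells out to ipmitool for a power status; this is not an existing Ironic interface, and host/username/password are placeholders:

    import subprocess

    def bmc_precheck(host, username, password):
        """Cheapest possible BMC sanity probe: ask for IPMI power status.

        If the BMC cannot answer this, a deploy or rebuild is very unlikely
        to succeed, so fail before touching the node.
        """
        cmd = ['ipmitool', '-I', 'lanplus', '-H', host,
               '-U', username, '-P', password, 'power', 'status']
        try:
            out = subprocess.check_output(cmd, timeout=30)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
            return False
        return out.strip().startswith(b'Chassis Power is')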
TheJuliai guess one question might end up being, are there *really* operators still turning off power sync at scale? We went from 1 to 8 power sync workers. Of course, that is also ipmi15:30
TheJuliaAnyway, that is beyond the scope of the meeting, just a question to get a datapoint really15:31
rpittaulet's remember this during (and more) during future discussion :)15:31
JayFThe more I think of it, the more I wonder if it's just a flag to validation to say "no really, actually validate"15:32
JayFand have config to change default behavior for when Ironic does it15:32
TheJuliato at least try and fail faster, I guess15:33
JayFI also have a use case15:33
JayFof failing a rebuild before the disk gets nuked15:33
rpittauyeah, I guess that's the point (fail faster)15:33
JayFso like e.g. a Ceph node wouldn't drop outta the cluster for longer than needed15:33
JayFif we can prevent that even in 25% of failure cases, it reduces operator pain because they don't have to go $bmc_fix_period without their node in a usable state15:34
TheJuliaJayF: I think additional details or context might be needed there since that might also just be a defect15:34
TheJuliaor, maybe something we've done work on and a lack of failure awareness is causing the wrong recovery path to be taken by humans15:34
JayFthat's sorta the theme of both of these rfes; allowing people with that kind of "I have a sensitive cluster" use case to do their own scheduling of maintenance, and reducing the number of times we'd have them in a failed-down state15:35
TheJuliaanyway, let's keep talking about this one after the meeting, I think we need to better understand your use case and what spawned the issue to begin with, if that makes sense15:36
TheJuliaspecifically, if I'm trying to rebuild [to update] a node, and somehow I'm wiping all of ceph volumes out15:36
rpittaualright, moving on!15:37
rpittau#topic Open Discussion15:37
TheJuliaI had https://bugzilla.redhat.com/show_bug.cgi?id=2222981 pop up in my inbox this morning!15:38
rpittauit's been great running the meetings again, this week and last :)15:38
rpittaunext week JayF will be back!15:38
rpittau#note Overcloud deploy fails when mounting config drive on 4k disks15:38
rpittau#link https://bugzilla.redhat.com/show_bug.cgi?id=222298115:39
TheJuliaI'm looking at cloud-init code to see if there is a happy path forward, but I suspect it might be we may need to locally repack the config drive as an intermediate solution15:41
TheJuliathere might be, it just depends on how long it has *really* been an "or", and we'll likely need to fix glean15:44
opendevreviewMerged openstack/bifrost master: Refactor use of include_vars  https://review.opendev.org/c/openstack/bifrost/+/85580715:44
iurygregoryI'm wondering if repack would cause performance issues (but it would be only in the 4k scenario right?)15:44
TheJuliawell, we would have to open the file on a loopback... if we even can...15:45
TheJuliaand then write out as a brand new filesystem15:45
TheJuliaso... dunno15:45
iurygregorygotcha15:46
TheJulia¯\_(ツ)_/¯15:46
rpittauI don't see a "straight" solution honestly15:49
TheJuliayeah, since we support binary payloads15:50
TheJuliaI'll dig more, and write an upstream bug up15:51
rpittauthanks15:51
iurygregoryack15:51
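A rough sketch of the "locally repack" idea floated above, assuming ext2 as the replacement filesystem and the usual config-2 label; the open question of keeping cloud-init and glean happy is untouched, and error handling plus temp-dir cleanup are left out:

    import shutil
    import subprocess
    import tempfile

    def repack_configdrive(iso_path, target_dev):
        """Repack an iso9660 config drive onto a 4k-sector block device."""
        src = tempfile.mkdtemp()
        dst = tempfile.mkdtemp()
        try:
            # Mount the original ISO read-only over a loop device.
            subprocess.check_call(['mount', '-o', 'loop,ro', iso_path, src])
            # Rebuild the target with a filesystem that tolerates 4k
            # sectors, keeping the label the metadata readers look for.
            subprocess.check_call(['mkfs.ext2', '-L', 'config-2', target_dev])
            subprocess.check_call(['mount', target_dev, dst])
            shutil.copytree(src, dst, dirs_exist_ok=True)
        finally:
            subprocess.call(['umount', dst])
            subprocess.call(['umount', src])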
rpittaualright, anything else for open discussion ?15:52
rpittauthanks everyone!15:53
rpittau#endmeeting15:53
opendevmeetMeeting ended Mon Jul 17 15:53:34 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)15:53
opendevmeetMinutes:        https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-07-17-15.00.html15:53
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-07-17-15.00.txt15:53
opendevmeetLog:            https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-07-17-15.00.log.html15:53
iurygregorytks!15:53
TheJuliaJayF: so, that failure case, with the rebuild, what is your perception of the process/failure? I'm guessing the rebuild fails, and instead of being worked as a distinct thing, something happened with nova at that point?15:57
TheJuliaI'm trying to understand since we have capabilities to prevent those disks from getting erased, so I'm trying to sort out in my head, where things went from "oh, retry rebuild" to "oh no, we lost everything"16:02
rpittaugood night! o/16:06
opendevreviewMerged openstack/bifrost master: remove nginx system packages requirement  https://review.opendev.org/c/openstack/bifrost/+/87452116:15
TheJuliaAnyone have any block devices with 1k, 2k, or 4k sector sizes handy? Could you run `blockdev --getss /dev/<device>` and provide output17:12
TheJuliaoh wow17:13
TheJulialooks like maybe blockdev --getbsz <device> might be the thing, maybe?!17:14
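For the record, a tiny hypothetical helper showing the distinction being poked at here: blockdev --getss reports the logical sector size, --getpbsz the physical sector size, and --getbsz the kernel block size:

    import subprocess

    def logical_sector_size(device):
        """Logical sector size of a block device in bytes, via util-linux.

        512 on traditional and 512e disks, 4096 on 4k-native disks, the
        case that breaks the iso9660/vfat config drives discussed earlier.
        """
        out = subprocess.check_output(['blockdev', '--getss', device])
        return int(out.strip())

    # logical_sector_size('/dev/vda')  -> 512 or 4096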
TheJulia5-6 runs with no database is locked errors. I guess that is a good sign :\17:57
iurygregoryyeah 17:57
iurygregoryI need to drop now, going to the airport, will continue working from there o/17:57
JayFTheJulia: not lost data; just downtime in the cluster. I have a server, X, I do a rebuild with --preserve-ephemeral; it fails in the middle (while in prov network). That machine is now dead-to-me until I can get ops teams to fix the failure. If, instead, Ironic had predicted that failure and refused to touch the node; sure the nova instance would be in ERROR but the workload18:00
JayFwould be untouched18:00
JayFthink about cases where you are tight on capacity and can't take a long downtime in a cluster18:00
JayFI'll talk about use cases in the rfe more.18:01
TheJuliaso, you'd need to be able to ask neutron if it can noop on that switch/interface18:02
JayFyeah I'm not sure what an implementation looks like for every interface -- but having the hook is useful18:03
TheJuliayeah, in ironic it would mostly be fairly light touches, but that network part is the real conundrum18:03
JayFfor instance, in some future world with a non-neutron network driver, we might be able to ask better questions, even (does the smartnic have desired_vlan trunked?)18:03
TheJulia++18:03
TheJuliaor switch, etc18:03
TheJuliabut yeah18:04
JayFI do not want to say, we shouldn't add it to the interface b/c we might not be able to do a great job of it on that interface18:04
TheJuliaAnd that is not why I'm asking the question, I'm trying to understand in what case you are blocked from being able to perform basic actions to recover18:04
JayFmeaning I'd want to implement it across all interfaces (like validate), but it's going to naturally be more useful for some than others18:04
TheJuliayeah18:05
JayFI really think once I look at this tuesday18:05
JayFI might shape it more like "validate, except you have time to do more stuff"18:06
JayFwhich could make it useful for enrollment situations, or for doing actual-validation of a fix after a node has undergone servicing (either by human, or by Ironic I guess)18:06
JayFI don't think this would actually work out in practice; but it'd be neat if we could also communicate via API the confidence level18:07
TheJuliawell, the actual validation today is just "can we get the power status"18:07
TheJuliaI do like the idea of giving that feedback18:08
JayFheh, maybe even redfish super-validate is running the DMTF validation against it18:09
JayF"It'll give you a medium Ironic experience"18:09
TheJuliaheh18:12
* TheJulia goes back to the gigabytes of logs a customer has shipped me18:12
NobodyCamGood afternoon Ironic folk19:37
TheJuliagood afternoon!19:37
TheJuliait is... a bit... warm here today20:31
opendevreviewJulia Kreger proposed openstack/ironic-python-agent master: Log the number of bytes downloaded  https://review.opendev.org/c/openstack/ironic-python-agent/+/88772920:47
TheJuliastevebaker[m]: ^20:47
TheJuliaINFO ironic_python_agent.extensions.standby [-] Image streamed onto device /dev/vda in 201.56154799461365 seconds for 2958688256 bytes. Server originally reported 2958688256.22:47
TheJuliawoot22:47
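For scale, the numbers in that log line work out to roughly 14 MiB/s:

    bytes_written = 2958688256
    seconds = 201.56154799461365
    print(f"{bytes_written / seconds / 1024 ** 2:.1f} MiB/s")  # -> 14.0 MiB/s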
opendevreviewJulia Kreger proposed openstack/ironic-python-agent master: Log the number of bytes downloaded  https://review.opendev.org/c/openstack/ironic-python-agent/+/88772923:19
