Thursday, 2021-06-17

cenneG. Morning ironic 05:55
jandershey cenne05:55
cennehey janders o/06:01
iurygregorygood morning cenne janders and Ironic o/06:28
jandershey iurygregory o/06:28
arne_wiebalckGood morning iurygregory cenne janders and Ironic!06:32
jandershey arne_wiebalck o/06:32
iurygregorymorning arne_wiebalck o/06:35
cenneHeya06:46
opendevreviewAija Jauntēva proposed x/sushy-oem-idrac stable/victoria: Add get PXE port MACs for BIOS mode  https://review.opendev.org/c/x/sushy-oem-idrac/+/79666507:18
*** rpittau|afk is now known as rpittau07:22
rpittaugood morning ironic! o/07:22
iurygregorymorning rpittau o/07:22
rpittauhey iurygregory :)07:23
arne_wiebalckTheJulia: stevebaker: I built an IPA image with https://review.opendev.org/c/openstack/ironic-python-agent/+/796068 and successfully installed a c8 image on top of a UEFI s/w RAID node ... so the patch does not seem to hurt :) Anything specific you would like me to check in the deploy logs?07:38
iurygregoryarne_wiebalck, you should try with centos scream :D07:39
iurygregoryops typo :D07:39
arne_wiebalckiurygregory: heh07:39
arne_wiebalckiurygregory: on it ... but it seems our cs8 images are robust enough to even boot the node despite the grub2 issue in recent scream versions07:40
arne_wiebalckoops typo :-D07:40
iurygregorywow really?07:41
iurygregoryI was expecting the problems to show up in cs8 ...07:41
arne_wiebalckyes: it complains about secure boot as expected, but the node boots none the less07:41
iurygregoryI see07:41
arne_wiebalckthese are our own cs8 images, not upstream, not built with dib07:42
iurygregoryoh ok =)07:43
arne_wiebalcktbh, I am a little surprised the node boots since it essentially means calling grub is not required07:51
arne_wiebalckiurygregory: cs8 boots as well08:16
iurygregorywoot :O08:23
opendevreviewMerged openstack/ironic master: Add documentation for anaconda deploy interface  https://review.opendev.org/c/openstack/ironic/+/79611008:29
dtantsurmorning folks08:36
arne_wiebalckgood morning dtantsur o/08:36
* dtantsur has +28 in the work room already08:36
iurygregorymorning dtantsur o/08:36
iurygregoryfor me +28 is outside :D08:37
arne_wiebalckdtantsur: so you need no warm-up today :)08:37
dtantsurdefinitely not :)08:37
arne_wiebalckdtantsur: this may be of interest to you: https://storyboard.openstack.org/#!/story/200898308:37
* iurygregory not looking forward for the weekend 32...08:37
dtantsurwe have 35 today, but should be slightly better on the weekend08:38
iurygregoryI wish we could skip summer...08:38
dtantsurheh, right08:41
dtantsurI like Spring and Autumn, but Summer and Winter (except for holidays) are not my things08:41
cenneMorning dtantsur08:41
iurygregoryI like summer, but not the one here hehehehe08:42
dtantsurhi cenne 08:43
arne_wiebalckdtantsur: and I checked your "Do not retry locking when heartbeating" patch is in Victoria, so it should be included already08:48
* dtantsur will check messages a bit later, sorry08:48
arne_wiebalckdtantsur: ty08:49
opendevreviewkamlesh chauvhan proposed openstack/ironic master: Upgrade oslo-db version  https://review.opendev.org/c/openstack/ironic/+/79681109:54
opendevreviewkamlesh chauvhan proposed openstack/ironic master: Upgrade oslo-db version  https://review.opendev.org/c/openstack/ironic/+/79681110:01
dtantsurarne_wiebalck: there is still a possibility of racing if the next heartbeat comes exactly when the previous one is released10:33
dtantsurhowever, you bug doesn't mention the exact error, so hard to tell10:33
arne_wiebalckdtantsur: ok10:39
arne_wiebalckdtantsur: the IPA sends heartbeats in a very quick succession10:39
arne_wiebalckdtantsur: since we do force_heartbeat at the end of all async commands10:39
arne_wiebalckdtantsur: I can match the commands with the heartbeats arriving, in fact10:40
arne_wiebalckdtantsur: http://paste.openstack.org/show/806723/ shows an example10:41
arne_wiebalckdtantsur: the IPA sends heartbeats within a few microseconds, this may expose the race you mention10:42
dtantsurarne_wiebalck: hmm, this is a good point. maybe we should delay a heartbeat if the previous one was sent very recently.10:42
arne_wiebalckdtantsur: yes, I think so, too10:42
dtantsurmaybe forcing heartbeat should work by just reducing the delay until the next heartbeat significantly10:43
arne_wiebalckdtantsur: why do we force the heartbeat at the end of async commands?10:43
dtantsurarne_wiebalck: to avoid waiting the whole minute before a normal heartbeat happens10:43
arne_wiebalckdtantsur: oh, I see10:44
dtantsurnext_heartbeat = min(next_heartbeat, 10) or so10:44
arne_wiebalckwhat about reducing the heartbeat interval to 10secs and remove the forced ones?10:46
arne_wiebalck(I guess this was discussed at the time :))10:46
arne_wiebalckI guess remembering the last send time is better.10:47
dtantsurI don't know what side effects may reducing the heartbeat interval have10:51
arne_wiebalckright10:52
arne_wiebalckso, how about we remember the last send time and do not send one if the last one was less than 10 secs ago?10:52
arne_wiebalckforce_heartbeat will be a noop in this case10:53
dtantsurarne_wiebalck: not really, we should still send a heartbeat earlier than in +- a minute10:56
arne_wiebalckdtantsur: we could also "collect" forced_heartbeats and have one at max every 10 secs (like a flush) 11:00
dtantsurarne_wiebalck: yep, I'm thinking of the same thing11:01
janderssee you tomorrow Ironic o/12:39
rpittaubye janders :)12:41
opendevreviewMerged x/sushy-oem-idrac stable/victoria: Add get PXE port MACs for BIOS mode  https://review.opendev.org/c/x/sushy-oem-idrac/+/79666512:54
TheJuliagood morning13:14
dtantsurmorning TheJulia 13:29
rpittaugood morning TheJulia :)13:37
opendevreviewkamlesh chauvhan proposed openstack/ironic master: Upgrade oslo.db version  https://review.opendev.org/c/openstack/ironic/+/79681113:42
arne_wiebalckdtantsur: would sth like this work? http://paste.openstack.org/show/806732/15:07
JayFarne_wiebalck: we have a force heartbeat method in IPA... I'm missing context but wanted you to know there is a template you can follow15:11
arne_wiebalckJayF: this is the force_heartbeat function :)15:13
arne_wiebalckJayF: I am trying to prevent heartbeats from being sent within a few microseconds15:14
JayFoooh15:14
JayFoh, so like if one is forced15:14
JayFyou should restart the timer15:14
JayFor are you having multiples forced at the same time?15:14
arne_wiebalckJayF: yes, within a few microseconds15:14
JayFyeah my refactor of this in U (I think?) made sure the timer gets reset15:15
JayFOK, ty for context, reading your code again15:15
JayFIt is 100% a bad idea to use threading.[] methods inside eventlet-using apps15:15
arne_wiebalckJayF: the idea is to collect "close-by" calls and send only one 15:15
JayFthat will likely have negative side effects for ya15:15
JayFarne_wiebalck: maybe something as simple as a `self.last_forced_heartbeat` that gets set to a timestamp, and ignoring forced hb requests that happen within [configurable seconds] of the last forced hb?15:16
arne_wiebalckJayF: yeah, we discussed this as well15:16
arne_wiebalckJayF: but this may mean we wait until the next timer15:16
arne_wiebalckJayF: which may be very long15:17
JayFI'll be honest, this is part of why I was worried about fixing ironic agent client to be OK with multiple commands15:17
JayFIPA structure was not really designed as parallel as the API is :|15:17
JayFbut that cat has been outta the bag for years15:17
JayFarne_wiebalck: I wonder if we could do something like make it possible to override the interval... probably not though15:18
arne_wiebalckJayF: like shorten the interval on a forced_timeout?15:19
JayFif a heartbeat is forced, and it cannot be executed because it's too close to the other one, shorten the interval15:19
arne_wiebalckJayF: for the full context, this is to avoid a race we seem to run into when the heartbeats hit the server: https://storyboard.openstack.org/#!/story/200898315:20
JayFbut I suspect at that point we're already in the cycle of waiting interval seconds15:20
JayFand you can't get outta that15:20
arne_wiebalckright15:20
JayFyeah I saw some of the IRC chatter15:20
arne_wiebalckoh, ok15:20
arne_wiebalckit does not seem to break anything15:20
JayFyeah, apparently I introduced threading.event15:20
arne_wiebalckbut as an operator I prefer logs with ERROR15:20
JayFmake sure you test with TLS15:20
JayFeventlet-type issues in IPA are exposed with TLS15:21
arne_wiebalcks/with/without ofc15:21
JayFand that image streaming code15:21
JayFif both of those things work it should mostly [knock on wood] be safe15:21
arne_wiebalck"does not break anything" is for the initial issue (not the proposed code)15:22
arne_wiebalckso, the initial problem is not fatal, deployment works15:22
JayFoh crap, this could totally be happening with scheduled heartbeats too15:22
JayFbecause when force_hearbeat() calls do_hearbeat() it updates the interval but run is alredy waiting on the old interval15:22
JayFso we have to fix this generically, not just for the forced heartbeat15:23
arne_wiebalckright, but probability for scheduled heartbeats should be much lower, no? 15:26
dtantsurrather than using locks and stuff, cannot we just adjust the next heartbeat time?15:26
* dtantsur doesn't remember the code well15:27
JayFdtantsur: that's exactly what we've been talking about if you read the scrollback :|15:27
opendevreviewDmitry Tantsur proposed openstack/ironic master: Clean up vendor prefixes for iRMC boot  https://review.opendev.org/c/openstack/ironic/+/79687915:27
JayFthat doesn't work, and looking at that I confirmed the bug can happen with scheduled heartbeats (one race) or with forced heartbeats (another race)15:27
JayFprobability for scheduled heartbeats varies based on heartbeat_timeout configuration, and when the forced heartbeats are happening15:28
JayFI don't see a compelling reason why we wouldn't fix all cases instead of just one15:28
arne_wiebalckthere is no compelling reason, it is just the one at hand I can already reproduce on almost every deploy 15:29
arne_wiebalckas there seem to be two async commands in quick succession15:29
dtantsurwe could do it so that we wake up every second and check the interval rather than sleeping the whole interval15:29
JayFdtantsur: I think any approach that works is good. I know this code is more fragile than it seems because you can easily break eventlet with too much of that kinda stuff :|15:30
* JayF is still scarred from the week he spent chasing the eventlet tls/heartbeat problems15:30
dtantsurhttp://paste.openstack.org/show/806732/ adds time.sleep into force_heartbeast. I'm not sure that will work.15:30
dtantsuri.e. I'm not sure the callers of this method expect it to get stuck for 10 seconds.15:31
* TheJulia reads some and starts to get very worried15:31
arne_wiebalckthis is why I was asking :)15:31
arne_wiebalckit seems to be ok on a first test, though15:31
* arne_wiebalck will run another test to see if the received heartbeats go down from 4 to 315:32
JayFI think your idea is maybe solid, but perhaps it should move into do_heartbeat() instead?15:32
JayFe.g. if we just had a heartbeat (of any kind) delay the forced/scheduled heartbeat N seconds15:32
JayFthis honestly is kinda screaming for a queue of heartbeat requests15:32
JayFbut not sure multiprocessing.Queue is going to make our lives easier in this code :P15:32
arne_wiebalckJayF: the counter is my queue :-D15:33
JayF.o(or maybe it will? Then we just only have to add heartbeat events to the queue on a timer)15:33
JayFand we could separate the heartbeat scheduling from the heartbeat-doing15:33
JayFand most of the multiprocessing.Queue bugs are around not evacuating the queue before shutdown and getting a deadlock, but since we power off the machine it's not a big deal :P15:33
dtantsurI have no clue how multiprocessing plays with eventlet though15:34
JayFwe might just need a regular, not threaded queue, I've only dealt with the threaded one though15:35
lmcgannWhy is it that you can only specify manual clean steps from power, management, deploy, bios, and raid drivers and not of other driver types. What is it about these driver types that makes them different?15:35
dtantsurlmcgann: mm, we just decided these are the drivers that will most likely provide steps15:36
dtantsurif we have a use case for expanding the list, I'm all for it15:36
JayF`eventlet.queue` exists, and likely has a queue we could use15:36
JayFhttps://eventlet.net/doc/modules/queue.html#eventlet.queue.Queue15:36
dtantsurI'm not sure how a queue will help us to avoid heartbeats too often15:36
JayFbecause then we'll be doing heartbeats off a queue consumer15:37
dtantsurright, but still several at the same time?15:37
JayFand can delay sequential heartbeats based on a minimum interval15:37
dtantsuror does it have any de-dup?15:37
dtantsurthen we need to store the last heartbeat time15:37
dtantsurthen we don't really need a queue any more15:37
JayFpull from queue -> check if last_heartbeat_time is set, if so, make sure it's been less than "N" seconds or wait those N secondss -> do a heartbeat -> set last_heartbeat_time -> repeat15:38
dtantsurs/pull from queue/check the flag/15:38
JayFdtantsur: I'm trying to get out of the >         while not self.stop_event.wait(self.interval):15:38
JayFline of code15:38
dtantsurthis is necessary to be able to stop IPA, no?15:39
dtantsuror do you suggest also storing a stop sign in the queue?15:39
JayFI'd say that kinda code would populate the queue15:39
JayFand the actual-do-heartbeat code would be separate15:39
JayFright now we have force a hearbeat, heartbeat timer, and the actual "doing" of a heartbeat all in the same place15:39
dtantsurI honestly see it as an advantage15:39
JayFI am not tied to any one way of doing stuff, tbh, I just want us to handle both of the cases, and I know that this code is fragile15:40
JayFand I'm trying to think of ways to make it less fragile15:40
JayFmy brainstorm might be more of a drizzle though :P15:41
arne_wiebalcklogs from a first test (slightly different version): http://paste.openstack.org/show/806735/15:41
dtantsurI think it's easier to show on the code what I mean15:41
lmcganndtantsur: so other drivers dont get used during clean and deploy generally?15:46
opendevreviewDmitry Tantsur proposed openstack/ironic-python-agent master: [PoC] heartbeat de-dup  https://review.opendev.org/c/openstack/ironic-python-agent/+/79688215:46
dtantsurJayF, arne_wiebalck, that's what I mean ^^15:46
dtantsurlmcgann: you mean, interfaces? Boot interface does, I don't remember why it wasn't included.15:47
dtantsurprobably because booting is a bit orthogonal to the deploy process, even though it's used by it15:47
JayFdtantsur: you did what I was thinking, except in 10 less simple lines with less complexity :D 15:47
dtantsurI did suspect that we were talking about the same thing :)15:47
JayFYeah, I sometimes only know the scenic route to a solution :D 15:47
JayFyou took the highway15:48
arne_wiebalckdtantsur: uh, nice15:48
dtantsur:D15:48
dtantsurarne_wiebalck: feel free to take it over and finish if you like this approach15:48
JayFdtantsur: why is self.interval in there still, for the while not line?15:49
dtantsurJayF: I haven't put much thought in it. can we have interval < 5?15:49
dtantsurat the very least, we start with interval == 015:49
JayFhmm. I guess so15:50
lmcganndtantsur do the other interfaces invoke boot interface methods as part of their steps?15:55
arne_wiebalckdtantsur: JayF: thanks for all your input, I give that patch a try (and finish it if it works ok)15:55
opendevreviewDmitry Tantsur proposed openstack/ironic master: CI: change ilo_deploy_iso to deploy_iso  https://review.opendev.org/c/openstack/ironic/+/79688615:55
dtantsurlmcgann: yes, mostly deploy and inspect interfaces15:55
opendevreviewJulia Kreger proposed openstack/ironic master: Fix node detail instance_uuid request handling  https://review.opendev.org/c/openstack/ironic/+/79672015:55
dtantsurarne_wiebalck: cool! note that I haven't even run unit tests on it, so there may be stupid typos15:55
arne_wiebalckdtantsur: I will find them if there any :)15:56
arne_wiebalckTheJulia: in case you missed it, I tried the efi preservation patch, and it worked for me15:56
TheJuliaarne_wiebalck: I saw, thanks.15:56
arne_wiebalckTheJulia: but, as mentioned, it seems our images are somewhat robust15:56
TheJuliaindeed15:57
arne_wiebalckTheJulia: ah, ok15:57
TheJuliaCan I get a review on https://review.opendev.org/c/openstack/ironic-inspector/+/796535 so we can have working unit tests on inspector again?15:59
rpittaugood night! o/16:09
*** rpittau is now known as rpittau|afk16:09
*** sshnaidm is now known as sshnaidm|afk16:16
dtantsurwdyt folks? https://storyboard.openstack.org/#!/story/200898716:35
opendevreviewVerification of a change to openstack/ironic-inspector failed: Fix SqlAlchemy >1.3.19 support  https://review.opendev.org/c/openstack/ironic-inspector/+/79653516:51
opendevreviewDmitry Tantsur proposed openstack/ironic master: Don't run the inspector job on changes to inspector tests  https://review.opendev.org/c/openstack/ironic/+/79689816:56
opendevreviewDmitry Tantsur proposed openstack/ironic-inspector master: [DNM] Test excluding files  https://review.opendev.org/c/openstack/ironic-inspector/+/79690016:57
opendevreviewDmitry Tantsur proposed openstack/ironic master: Don't run the inspector job on changes to inspector tests  https://review.opendev.org/c/openstack/ironic/+/79689817:01
opendevreviewDmitry Tantsur proposed openstack/ironic-inspector master: [DNM] Test excluding files  https://review.opendev.org/c/openstack/ironic-inspector/+/79690017:03
opendevreviewDmitry Tantsur proposed openstack/bifrost master: Skip running jobs on ironic and inspector unit tests  https://review.opendev.org/c/openstack/bifrost/+/79691417:07
opendevreviewDmitry Tantsur proposed openstack/ironic-inspector master: [DNM] Test excluding files  https://review.opendev.org/c/openstack/ironic-inspector/+/79690017:08
NobodyCamGood Morning OpenStack Folks17:11
NobodyCamany good docs on supporting ARM? IPA Image building gotcha's, etc..17:12
dtantsurmorning NobodyCam. I know people have done it, I don't think we have docs though.17:13
NobodyCam:) Morning dtantsur :)17:13
dtantsuriPXE used to be a problem, not sure about now17:13
NobodyCamack, I'll take a look see17:13
NobodyCammay be able to treat it like we do our snmp controlled hardware: https://github.com/ipxe/pipxe17:15
dtantsurheh, nice17:17
dtantsurwell, iPXE is the only big problem I'm aware of. If you can build it for your hardware, everything else should be a matter of figuring out the right configuration.17:18
dtantsurokay folks, I'll be out tomorrow and on Monday, so see you on Tuesday17:19
dtantsuro/17:20
JayFo/17:20
NobodyCamhave a good night dtantsur 17:22
arne_wiebalckdtantsur: JayF: I added some logging to Dmitry's patch to confirm it does what we would like it to do, logs seem to confirm it's working fine: http://paste.openstack.org/show/806741/17:27
arne_wiebalckI'll take over and submit the patch tomorrow.17:28
JayFnice work :D 17:31
arne_wiebalckOh, and it fixes the issue on the coductor side :)17:32
arne_wiebalckThanks again dtantsur JayF !17:32
arne_wiebalckAnd TheJulia who looked into this yesterday already as well :)17:33
arne_wiebalckbye everyone, see you tmrw o/17:54
TheJuliarats... dtantsur is gone :(17:59
TheJuliac'est la vie17:59
TheJuliaNobodyCam: iPXE but you can use PXE + Grub. You *CAN* build an ipxe arm image though.18:00
NobodyCam:)18:00
TheJuliaNobodyCam: We've had a couple bug reports for if like you try and build wheels of ipa18:01
NobodyCamhave request access to some ARM hardware to attempt building 18:01
TheJuliaand... some arm hardware has super unhappy ipmi interfaces18:01
NobodyCam+++18:01
TheJuliathat could be nasty but with an understanding of what to look for it would be super easy to patch18:01
TheJuliawe just have a "ipmitool freezes" report18:02
TheJuliaand we don't have any other info really18:02
NobodyCam+++ I'll try and take notes from the build process and see if I can do a how to writeup for others 18:02
TheJuliaawesome18:03
TheJuliaI'd happilly merge a patch in to ipa for the ipmitool freeze thing, but need to understand more first :(18:03
NobodyCambe a few days before I get the HW I'm sure18:03
NobodyCam+++18:03
opendevreviewMerged openstack/ironic-inspector master: Fix SqlAlchemy >1.3.19 support  https://review.opendev.org/c/openstack/ironic-inspector/+/79653518:32
TheJuliaso!  What if! When we started, we checked the db status and did the needful and just relied upon online data migrations instead of just run 2 commands19:17
kkillsfirstHello, I have a test setup with Ironic using bifrost and have deployed nodes with a ramdisk image. Where should I start with using kexec on my testbed with Ironic?19:58
opendevreviewJulia Kreger proposed openstack/ironic master: Only return the requested fields  https://review.opendev.org/c/openstack/ironic/+/79227420:27
opendevreviewJulia Kreger proposed openstack/ironic master: Set stage for objects to handle selected field lists.  https://review.opendev.org/c/openstack/ironic/+/79227520:27
opendevreviewJulia Kreger proposed openstack/ironic master: API to pass fields to node object list  https://review.opendev.org/c/openstack/ironic/+/79229620:27
opendevreviewJulia Kreger proposed openstack/ironic master: Allow node_sanitize function to be provided overrides  https://review.opendev.org/c/openstack/ironic/+/79488020:27
TheJuliakkillsfirst: ideally for the transition from deployment, so support would need to be in the agent, and I guess as what we call a deploy step in ironic itself20:57
jandersgood morning Ironic o/23:56

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!