Friday, 2021-07-02

iurygregorygood morning stevebaker janders and Ironic! Happy Friday!06:04
arne_wiebalckGood morning iurygregory and Ironic!06:41
iurygregorymorning arne_wiebalck o/06:42
dtantsurmorning folks, happy Friday07:06
dtantsurjanders: _do_node_verify in manager.py is the place you need, yes07:06
cenneGood morning dtantsur, iurygregory,  arne_wiebalck, janders . :)07:23
dtantsuro/07:23
arne_wiebalckhey cenne and dtantsur o/07:24
iurygregorymorning cenne and dtantsur o/07:24
opendevreviewAija Jauntēva proposed openstack/ironic master: Fix Redfish RAID interface_type physical disk hint  https://review.opendev.org/c/openstack/ironic/+/79550508:43
jandershey iurygregory arne_wiebalck iurygregory dtantsur cenne10:04
jandersanyone feeling SPUC-ky?10:04
dtantsurI can join if we have a quorum (also cc ajya)10:05
jandershttps://bluejeans.com/772893798 & I'm on standby - let's see how we go10:07
dtantsurjoining10:07
iurygregoryI was having lunch sorry 11:09
dtantsurno worries, lunch is important11:09
* dtantsur needs one too11:09
janderssee you tomorrow Ironic o/12:51
jandersehhm12:51
janderssee you on Monday Ironic o/12:51
iurygregorybye janders o/12:53
jandersTheJulia when you're online, I would like to ask for your thoughts on https://storyboard.openstack.org/#!/story/200902513:09
TheJuliajanders: okay.. what was encountered that this is a fix to?!13:11
jandersTheJulia context: it's something that came up in Metal3, but after some thought I figured I would like to propose it in Ironic. The idea is preventing issues where existing Lifecycle Controller jobs may interfere with inspection/deployments. Essentially I wanted to tap into the great work Dell have done, just make it used more regularly13:12
jandersoh, you are here TheJulia13:12
jandersgood morning :)13:12
jandersthere were issues reported where pre-existing stuck LC jobs were preventing other jobs of the same type being queued13:13
jandersbreaking inspections, deployments, etc13:13
jandersfull disclosure: I haven't witnessed these issues myself; I was asked to help with a metal3 change that would add iDRAC resets on node registration, but I figured this may belong more in Ironic13:14
TheJuliaSounds like test environments are being smashed to be redeployed before they have failed13:15
TheJuliaUnless the bmc truly is getting jobs stuck forever which sounds like a bug to me13:16
jandersI believe it's been spotted in a few CU BZs13:16
TheJuliaNot that I've personally seen, but proceed carefully. My concern is largely if we're working the problem in the wrong place... Routing around other problems 13:17
TheJuliaBut if impossible or whatever, then it could make sense as long as vendor agnostic w/r/t frameworking 13:18
jandersthank you TheJulia13:20
dtantsurTheJulia: I'm trying to direct janders towards "verify steps" :)13:22
dtantsurbecause you know there is never enough types of steps13:22
TheJuliadtantsur: that is just pure evil ;)13:27
dtantsurhowever weird, this idea is growing on me :)13:27
TheJuliait feels strongly like we're just routing around someone else's bug, to be totally honest13:27
dtantsurisn't it a part of our job definition?13:27
TheJuliaIt *does* make some sense, I just think caution is needed13:27
TheJuliaor maybe a medium sized grain of salt13:28
dtantsuron the plus side, we already have caching for 3 different driver-specific things that could be verify steps instead13:28
TheJuliawrt bugs: Touché13:28
TheJuliahmmm13:28
* TheJulia caffeinates13:31
TheJuliaThink something like: https://res.cloudinary.com/teepublic/image/private/s--Ogwu8oEo--/t_Preview/b_rgb:ffffff,c_limit,f_jpg,h_630,q_90,w_630/v1446156588/production/designs/40878_1.jpg13:32
dtantsur:D13:32
ajyaIt could be that the job is scheduled by someone (outside Ironic), but never executed and so it sits there blocking everything else for that subsystem.13:44
ajyadon't recall seeing jobs stuck forever, but have heard of cases that something gets stuck in iDRAC internally and then reset could help.13:45
TheJuliaI have seen people just try to destroy a deployment or work attempt right after $things have been started and start over and have that collide in odd ways, so I do concur, $something may make sense13:46
dtantsursimpler. they create a job and never reboot the machine.13:49
TheJuliayup13:49
TheJuliayou know...13:49
TheJuliaI have run into this sort of issue13:49
TheJuliain a lab at the Techridge campus...13:50
iurygregorywelcome tkot o/14:37
dtantsuro/14:39
opendevreviewArne Wiebalck proposed openstack/ironic master: [doc] Update section on ESP consistency  https://review.opendev.org/c/openstack/ironic/+/79921514:46
opendevreviewArne Wiebalck proposed openstack/ironic master: [doc] Bootloader reinstallations on Software RAID  https://review.opendev.org/c/openstack/ironic/+/79921715:01
opendevreviewJulia Kreger proposed openstack/ironic master: Add reno and reset legacy policy deprecation expectation  https://review.opendev.org/c/openstack/ironic/+/79922115:14
NobodyCamGood Morning Ironic'ers...15:38
NobodyCamand OFC ...15:38
NobodyCamTGIF!15:38
TheJuliaYAY! Friday!15:39
JayFdoes NobodyCam know about SPUC?15:41
JayFNobodyCam: come get your sanity preserved this afternoon :D 15:42
JayFprobably too late but no harm in trying ;)15:42
NobodyCamHey Hey TheJulia and JayF 15:43
NobodyCamNot sure I can make it today15:43
JayFNot a big deal just wanted to make sure you were aware of it :D 15:43
JayFwell, I mean, it'd be nice to see and say hello, but I understand not everyone has a meeting free friday :)15:43
NobodyCamoh man, meeting free15:44
* NobodyCam is jelly15:44
JayFYeah, it's a policy in my downstream team to have no manager-initiated meetings on Fridays.15:44
NobodyCam++ 15:44
JayFOccasionally we'll have a group code review or chat through a technical problem, but no sprint ceremonies / status meetings / etc 15:44
JayFmost Fridays my only meeting is SPUC :D15:44
NobodyCamnice!15:45
opendevreviewJay Faulkner proposed openstack/ironic master: Add reno and reset legacy policy deprecation expectation  https://review.opendev.org/c/openstack/ironic/+/79922115:48
JayFarne_wiebalck: ^ if you're still around, I just did a patch edit on Julia's patch to fix the typos, if you wanna +2 it now :)15:48
arne_wiebalckJayF: thanks, done!15:50
* dtantsur also doesn't do meetings on Friday15:51
TheJuliaI pretty much try to avoid them because I end up working extra hours during the week, so friday is my "adjustment" day where I try to wrap up the needful and don't feel bad when I've hit my hours for the week and call it a weekend15:57
JayFI often leave early on Fridays and make up the time upstream reviewing on Sat/Sun15:58
JayFnot doing that this week though after missing 2 days to a heat wave and unlucky AC failure15:58
dtantsurhave a great weekend folks15:58
TheJuliayou too dtantsur 15:58
JayFo/15:59
TheJuliaJayF: ugh. I'm worried about mine...  It *is* starting to pull more power than it should which is a sign15:59
JayFOurs dropped a capacitor, which the repairman told me is basically an unpredictable, spontaneous failure16:00
JayFso it failing the night before a heat wave is just emblematical of my luck16:00
JayFlol16:00
TheJuliaYeah...16:01
* TheJulia remembers a certain series of servers with capacitor failures on the motherboards16:01
JayFI mean, more an era than just a series16:02
JayFthat era of P3/P4/Athlon boards would end up with bulging caps with frequency16:02
JayFTheJulia: you got a sec to help me with something, perhaps? or maybe during/before/after spuc?16:07
JayFTrying to find if there's a published image *anywhere* that'll work with the kickstart driver16:07
TheJuliaJayF: sure, I'm sending an email at the moment on potential openinfra live topic in the future regarding ironic16:09
JayFack, ty16:09
JayFas it stands right now, it looks like the only option to use a prebuilt image would involve using the 10GB dvd live iso for centos16:12
JayFwhich is obviously hilariously large for such a test16:12
TheJuliaomg that is16:12
TheJuliaThat may actually be more like.... the joker just laughing as he is being hauled off to the asylum16:13
JayFI joined the SPUC bluejeans meeting early, once you get to a stopping point just drop in and I'll lay it out16:13
JayFI'm hoping you know some secret place where there are better images that are published by centos (unlikely) or have good ideas as to how to otherwise approach it (actually likely)16:13
TheJuliaokay16:14
TheJuliaI'm about 5-10 minutes from wrapping this email16:14
JayFtake your time :)16:14
arne_wiebalckbye everyone, see you next week o/16:23
JayFo/16:23
opendevreviewMerged openstack/ironic master: Ramdisk: do not require image_source  https://review.opendev.org/c/openstack/ironic/+/79868116:24
trandlesnot going to make SPUC today...I'm stealing an empty desk at my wife's place of work and forgot my headphones. I can't be making a bunch of racket...16:44
trandles:(16:45
opendevreviewVerification of a change to openstack/ironic failed: Suppress policy deprecation and default change warnings  https://review.opendev.org/c/openstack/ironic/+/79912016:46
JayFnp, enjoy your novel office for the day16:59
opendevreviewMerged openstack/ironic master: Only return the requested fields from the DB  https://review.opendev.org/c/openstack/ironic/+/79227417:03
TheJuliawhee17:08
NobodyCamhttps://bugs.launchpad.net/nova/+bug/1853009 :'(17:09
TheJuliaNobodyCam: that sounds super familiar and I thought it was fixed17:13
NobodyCamsays fix released17:14
NobodyCamhttps://www.irccloud.com/pastebin/Ewqki6BX/17:15
opendevreviewMerged openstack/ironic master: Set stage for objects to handle selected field lists.  https://review.opendev.org/c/openstack/ironic/+/79227517:15
opendevreviewMerged openstack/ironic-inspector master: Ignored error state cache for new requests  https://review.opendev.org/c/openstack/ironic-inspector/+/78524517:18
JayFI filed https://storyboard.openstack.org/#!/story/2009026 about getting CI for upstream Anaconda17:29
JayFafter spending ~1 working day on it, I found some pretty gnarly blockers and wrote that to document the issue17:29
opendevreviewVerification of a change to openstack/ironic failed: Suppress policy deprecation and default change warnings  https://review.opendev.org/c/openstack/ironic/+/79912017:32
TheJuliaNobodyCam: hmm :\17:44
NobodyCam;p 17:45
TheJuliaNobodyCam: ?train? right?17:45
NobodyCamthis one is ussuri17:45
TheJuliaso... someone on the spuc call says they have a downstream fix for this17:46
JayFMAYBE17:46
NobodyCamoh17:46
JayFgotta look to see if it matches zer0c00l's fix17:46
NobodyCamThank you in advance 🙇‍♂️17:53
TheJuliaNobodyCam: In logging, can you confirm that it was right after the instance was launched?17:54
NobodyCamseen on conductor restart, noticed a hash ring mismatch .. looking into the logs17:56
NobodyCamsaw trace back for:`2021-07-01 16:43:54.829 2510473 ERROR nova.compute.manager oslo_messaging.rpc.client.RemoteError: Remote error: InvalidRequestError This session is in 'inactive' state, due to the SQL transaction being rolled back; no further SQL can be emitted within this transaction.`17:56
NobodyCam`2021-07-01 16:50:38.516 2510473 ERROR nova.compute.resource_tracker [req-d6749c0a-373f-45dc-8f8c-905757721a7c - - - - -] Skipping removal of allocations for deleted instances: Failed to retrieve allocations for resource provider...<blah>17:56
NobodyCamwhich led me to the bug17:57
TheJuliaNobodyCam: was the instance tying to it discarded as an orphan?18:00
NobodyCamthat is the state it is in now. if I read it correctly it was attempting to create the record,18:05
NobodyCam`2021-07-01 16:43:49.852 2510473 ERROR nova.compute.manager ['Traceback (most recent call last):\n', '  File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 139, in _object_dispatch\n    return getattr(target, method)(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper\n    return fn(self, *args, **kwargs)\n', '  File 18:05
NobodyCam"/usr/lib/python3.6/site-packages/nova/objects/compute_node.py", line 338, in create\n    db_compute = db.compute_node_create(self._context, updates)\n`18:05
JayFzer0c00l: ^^^ We think NobodyCam is having the same issue you fixed downstream FWIW18:05
JayFwith resource provider cache and node caching 18:05
opendevreviewLeo McGann proposed openstack/ironic-specs master: Add attestation interface spec  https://review.opendev.org/c/openstack/ironic-specs/+/57671818:12
NobodyCamckrelle-mlt:OpenStack ckrelle$ openstack resource provider show 74649c21-994a-42a2-83a0-7f7966d136e418:18
NobodyCam+------------+--------------------------------------+18:18
NobodyCam| Field      | Value                                |18:18
NobodyCam+------------+--------------------------------------+18:18
NobodyCam| uuid       | 74649c21-994a-42a2-83a0-7f7966d136e4 |18:18
NobodyCam| name       | 74649c21-994a-42a2-83a0-7f7966d136e4 |18:18
NobodyCam| generation | 2                                    |18:18
NobodyCam+------------+--------------------------------------+18:18
JayFhere is the potentially relevant patch, applies to ocata https://gist.github.com/jayofdoom/4fa315489330430ea7aeaa6e8ad62dec18:26
TheJuliaI'm running nova unit tests now18:53
TheJuliaNobodyCam: https://review.opendev.org/c/openstack/nova/+/79932720:06
NobodyCamoh20:06
NobodyCam/me Clicks20:06
TheJuliaI tried to add context to it so people understand that we really just are creating a giant race window20:11
NobodyCamyes, we should be able to test on Tuesday20:12
NobodyCamSide note. Recovery ended up being to remove the delete flagged records from nova/compute_nodes table and restart the nova compute20:13
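The recovery step NobodyCam describes can be sketched roughly as below. This is only an illustration against an in-memory SQLite database: the three-column `compute_nodes` table is a simplified stand-in, not nova's real schema, and the convention that soft-deleted rows carry a non-zero `deleted` value is an assumption here. The function names are hypothetical.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a toy stand-in for the compute_nodes table (NOT nova's real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compute_nodes ("
        "id INTEGER PRIMARY KEY, hypervisor_hostname TEXT, deleted INTEGER)"
    )
    conn.executemany(
        "INSERT INTO compute_nodes (id, hypervisor_hostname, deleted) VALUES (?, ?, ?)",
        [
            (1, "node-a", 0),  # live record
            (2, "node-a", 2),  # assumed soft-delete marker: non-zero 'deleted'
        ],
    )
    conn.commit()
    return conn


def purge_soft_deleted(conn: sqlite3.Connection) -> int:
    """Remove rows flagged as deleted and return how many were removed."""
    cur = conn.execute("DELETE FROM compute_nodes WHERE deleted != 0")
    conn.commit()
    return cur.rowcount


conn = make_demo_db()
removed = purge_soft_deleted(conn)
remaining = conn.execute("SELECT COUNT(*) FROM compute_nodes").fetchone()[0]
```

On a real deployment this would be done against the nova database with a backup taken first, followed by the nova compute restart mentioned above.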
TheJuliaso it *looks* like the full complete fix is mgoddard's patch series20:18
TheJuliabecause it handles the case when it has happened, and it *can* happen for a couple different reasons20:19
NobodyCamthe one in merge conflict?20:31
NobodyCamhttps://review.opendev.org/c/openstack/nova/+/69518920:32
TheJuliaNobodyCam: https://review.opendev.org/q/topic:%2522bug/1839560%2522+(status:open+OR+status:merged)+and+owner:mark%2540stackhpc.com20:40
TheJuliaI guess I can try rebasing those next week20:41
TheJuliaI'm thinking it is approaching nap time20:41
NobodyCam+++ for Nap Time20:41
opendevreviewMerged openstack/ironic master: Suppress policy deprecation and default change warnings  https://review.opendev.org/c/openstack/ironic/+/79912020:43
TheJuliaJayF: I'm thinking performance can be improved in the nova virt driver by using the cache for one-off things such as "give me the number of nodes". That is as long as the local cache is correct. The cache refresh seems to be the big possible area of resource utilization20:44
JayFHonestly I assumed we already used the cache for as much as possible20:44
JayFif there's more room there to use it, I'm +120:44
TheJuliayeah, we just have to tidy it by removing things that end up in the list that are not required20:55
TheJuliaor not applicable.20:55
opendevreviewJay Faulkner proposed openstack/ironic stable/wallaby: Suppress policy deprecation and default change warnings  https://review.opendev.org/c/openstack/ironic/+/79925320:56
TheJulia*looks* like it would get rid of the other full node enumeration for just a count of nodes. The power sync stuff would still be fairly harsh, but at least cache wise we'd peel away the potentially long call to make a list of the nodes which does a ton of work20:56
TheJuliaarne_wiebalck already opened a bz against nova for this too if memory serves20:56
TheJuliaerr20:57
TheJulialp item20:57
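The idea discussed above, answering a one-off question such as a node count from a local cache instead of enumerating every node again, can be sketched as follows. This is a minimal illustration and not the nova virt driver's code: `NodeCache`, `fetch_all`, `ttl` and `clock` are hypothetical names, and a simple time-based refresh is assumed.

```python
import time
from typing import Callable, Dict, List, Optional


class NodeCache:
    """Hypothetical local cache of node records, keyed by UUID."""

    def __init__(
        self,
        fetch_all: Callable[[], List[dict]],
        ttl: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_all = fetch_all  # the expensive full enumeration
        self._ttl = ttl
        self._clock = clock
        self._nodes: Dict[str, dict] = {}
        self._refreshed_at: Optional[float] = None

    def refresh(self) -> None:
        """Run the expensive enumeration and replace the cached records."""
        self._nodes = {node["uuid"]: node for node in self._fetch_all()}
        self._refreshed_at = self._clock()

    def _ensure_fresh(self) -> None:
        # Refresh only when the cache is empty or older than the TTL.
        if (
            self._refreshed_at is None
            or self._clock() - self._refreshed_at >= self._ttl
        ):
            self.refresh()

    def count(self) -> int:
        """Answer 'how many nodes?' from the cache, not a new enumeration."""
        self._ensure_fresh()
        return len(self._nodes)
```

The count is only as good as the last refresh, which matches the caveat above that this works as long as the local cache is correct; the refresh itself remains the expensive part.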
opendevreviewArun S A G proposed openstack/ironic master: Add support for configdrive in anaconda interface  https://review.opendev.org/c/openstack/ironic/+/78039823:28
