iurygregory | good morning stevebaker janders and Ironic! Happy Friday! | 06:04 |
---|---|---|
arne_wiebalck | Good morning iurygregory and Ironic! | 06:41 |
iurygregory | morning arne_wiebalck o/ | 06:42 |
dtantsur | morning folks, happy Friday | 07:06 |
dtantsur | janders: _do_node_verify in manager.py is the place you need, yes | 07:06 |
cenne | Good morning dtantsur, iurygregory, arne_wiebalck, janders . :) | 07:23 |
dtantsur | o/ | 07:23 |
arne_wiebalck | hey cenne and dtantsur o/ | 07:24 |
iurygregory | morning cenne and dtantsur o/ | 07:24 |
opendevreview | Aija Jauntēva proposed openstack/ironic master: Fix Redfish RAID interface_type physical disk hint https://review.opendev.org/c/openstack/ironic/+/795505 | 08:43 |
janders | hey iurygregory arne_wiebalck iurygregory dtantsur cenne | 10:04 |
janders | anyone feeling SPUC-ky? | 10:04 |
dtantsur | I can join if we have a quorum (also cc ajya) | 10:05 |
janders | https://bluejeans.com/772893798 & I'm on standby - let's see how we go | 10:07 |
dtantsur | joining | 10:07 |
iurygregory | I was having lunch sorry | 11:09 |
dtantsur | no worries, lunch is important | 11:09 |
* dtantsur needs one too | 11:09 | |
janders | see you tomorrow Ironic o/ | 12:51 |
janders | ehhm | 12:51 |
janders | see you on Monday Ironic o/ | 12:51 |
iurygregory | bye janders o/ | 12:53 |
janders | TheJulia when you're online, I would like to ask for your thoughts on https://storyboard.openstack.org/#!/story/2009025 | 13:09 |
TheJulia | janders: okay.. what was encountered that this is a fix to?! | 13:11 |
janders | TheJulia context: it's something that came up in Metal3, but after some thought I figured I would like to propose it in Ironic. The idea is preventing issues where existing Lifecycle Controller jobs may interfere with inspection/deployments. Essentially I wanted to tap into the great work Dell have done, just make it used more regularly | 13:12 |
janders | oh, you are here TheJulia | 13:12 |
janders | good morning :) | 13:12 |
janders | there were issues reported where pre-existing stuck LC jobs were preventing other jobs of the same type being queued | 13:13 |
janders | breaking inspections, deployments, etc | 13:13 |
janders | full disclosure: I haven't witnessed these issues myself; I was asked to help with a metal3 change that would add iDRAC resets on node registration, but I figured this may belong more in Ironic | 13:14 |
TheJulia | Sounds like test environments are being smashed to be redeployed before they have failed | 13:15 |
TheJulia | Unless the bmc truly is getting jobs stuck forever which sounds like a bug to me | 13:16 |
janders | I believe it's been spotted in a few CU BZs | 13:16 |
TheJulia | Not that I've personally seen, but proceed carefully. My concern is largely if we're working the problem in the wrong place... Routing around other problems | 13:17 |
TheJulia | But if impossible or whatever, then it could make sense as long as vendor agnostic w/r/t frameworking | 13:18 |
janders | thank you TheJulia | 13:20 |
dtantsur | TheJulia: I'm trying to direct janders towards "verify steps" :) | 13:22 |
dtantsur | because you know there is never enough types of steps | 13:22 |
TheJulia | dtantsur: that is just pure evil ;) | 13:27 |
dtantsur | however weird, this idea is growing on me :) | 13:27 |
TheJulia | it feels strongly like we're just routing around someone else's bug, to be totally honest | 13:27 |
dtantsur | isn't it a part of our job definition? | 13:27 |
TheJulia | It *does* make some sense, I just think caution is needed | 13:27 |
TheJulia | or maybe a medium sized grain of salt | 13:28 |
dtantsur | on the plus side, we already have caching for 3 different driver-specific things that could be verify steps instead | 13:28 |
TheJulia | wrt bugs: Touché | 13:28 |
TheJulia | hmmm | 13:28 |
* TheJulia caffeinates | 13:31 | |
TheJulia | Think something like: https://res.cloudinary.com/teepublic/image/private/s--Ogwu8oEo--/t_Preview/b_rgb:ffffff,c_limit,f_jpg,h_630,q_90,w_630/v1446156588/production/designs/40878_1.jpg | 13:32 |
dtantsur | :D | 13:32 |
ajya | It could be that the job is scheduled by someone (outside Ironic), but never executed and so it sits there blocking everything else for that subsystem. | 13:44 |
ajya | don't recall seeing jobs stuck forever, but have heard of cases that something gets stuck in iDRAC internally and then reset could help. | 13:45 |
TheJulia | I have seen people just try to destroy a deployment or work attempt right after $things have been started and start over and have that collide in odd ways, so I do concur, $something may make sense | 13:46 |
dtantsur | simpler. they create a job and never reboot the machine. | 13:49 |
TheJulia | yup | 13:49 |
TheJulia | you know... | 13:49 |
TheJulia | I have run into this sort of issue | 13:49 |
TheJulia | in a lab at the Techridge campus... | 13:50 |
iurygregory | welcome tkot o/ | 14:37 |
dtantsur | o/ | 14:39 |
opendevreview | Arne Wiebalck proposed openstack/ironic master: [doc] Update section on ESP consistency https://review.opendev.org/c/openstack/ironic/+/799215 | 14:46 |
opendevreview | Arne Wiebalck proposed openstack/ironic master: [doc] Bootloader reinstallations on Software RAID https://review.opendev.org/c/openstack/ironic/+/799217 | 15:01 |
opendevreview | Julia Kreger proposed openstack/ironic master: Add reno and reset legacy policy deprecation expectation https://review.opendev.org/c/openstack/ironic/+/799221 | 15:14 |
NobodyCam | Good Morning Ironic'ers... | 15:38 |
NobodyCam | and OFC ... | 15:38 |
NobodyCam | TGIF! | 15:38 |
TheJulia | YAY! Friday! | 15:39 |
JayF | does NobodyCam know about SPUC? | 15:41 |
JayF | NobodyCam: come get your sanity preserved this afternoon :D | 15:42 |
JayF | probably too late but no harm in trying ;) | 15:42 |
NobodyCam | Hey Hey TheJulia and JayF | 15:43 |
NobodyCam | Not sure I can make it today | 15:43 |
JayF | Not a big deal just wanted to make sure you were aware of it :D | 15:43 |
JayF | well, I mean, it'd be nice to see and say hello, but I understand not everyone has a meeting free friday :) | 15:43 |
NobodyCam | oh man, meeting free | 15:44 |
* NobodyCam is jelly | 15:44 | |
JayF | Yeah, it's a policy in my downstream team to have no manager-initiated meetings on Fridays. | 15:44 |
NobodyCam | ++ | 15:44 |
JayF | Occasionally we'll have a group code review or chat through a technical problem, but no sprint ceremonies / status meetings / etc | 15:44 |
JayF | most Fridays my only meeting is SPUC :D | 15:44 |
NobodyCam | nice! | 15:45 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Add reno and reset legacy policy deprecation expectation https://review.opendev.org/c/openstack/ironic/+/799221 | 15:48 |
JayF | arne_wiebalck: ^ if you're still around, I just did a patch edit on Julia's patch to fix the typos, if you wanna +2 it now :) | 15:48 |
arne_wiebalck | JayF: thanks, done! | 15:50 |
* dtantsur also doesn't do meetings on Friday | 15:51 | |
TheJulia | I pretty much try to avoid them because I end up working extra hours during the week, so friday is my "adjustment" day where I try to wrap up the needful and don't feel bad when I've hit my hours for the week and call it a weekend | 15:57 |
JayF | I often leave early on Fridays and make up the time upstream reviewing on Sat/Sun | 15:58 |
JayF | not doing that this week though after missing 2 days to a heat wave and unlucky AC failure | 15:58 |
dtantsur | have a great weekend folks | 15:58 |
TheJulia | you too dtantsur | 15:58 |
JayF | o/ | 15:59 |
TheJulia | JayF: ugh. I'm worried about mine... It *is* starting to pull more power than it should which is a sign | 15:59 |
JayF | Ours dropped a capacitor, which the repairman told me is basically an unpredictable, spontaneous failure | 16:00 |
JayF | so it failing the night before a heat wave is just emblematical of my luck | 16:00 |
JayF | lol | 16:00 |
TheJulia | Yeah... | 16:01 |
* TheJulia remembers a certain series of servers with capacitor failures on the motherboards | 16:01 | |
JayF | I mean, more an era than just a series | 16:02 |
JayF | that era of P3/P4/Athlon boards would end up with bulging caps with frequency | 16:02 |
JayF | TheJulia: you got a sec to help me with something, perhaps? or maybe during/before/after spuc? | 16:07 |
JayF | Trying to find if there's a published image *anywhere* that'll work with the kickstart driver | 16:07 |
TheJulia | JayF: sure, I'm sending an email at the moment on potential openinfra live topic in the future regarding ironic | 16:09 |
JayF | ack, ty | 16:09 |
JayF | as it stands right now, it looks like the only option to use a prebuilt image would involve using the 10GB dvd live iso for centos | 16:12 |
JayF | which is obviously hilariously large for such a test | 16:12 |
TheJulia | omg that is | 16:12 |
TheJulia | That may actually be more like.... the joker just laughing as he is being hauled off to the asylum | 16:13 |
JayF | I joined the SPUC bluejeans meeting early, once you get to a stopping point just drop in and I'll lay it out | 16:13 |
JayF | I'm hoping you know some secret place where there are better images that are published by centos (unlikely) or have good ideas as to how to otherwise approach it (actually likely) | 16:13 |
TheJulia | okay | 16:14 |
TheJulia | I'm about 5-10 minutes from wrapping this email | 16:14 |
JayF | take your time :) | 16:14 |
arne_wiebalck | bye everyone, see you next week o/ | 16:23 |
JayF | o/ | 16:23 |
opendevreview | Merged openstack/ironic master: Ramdisk: do not require image_source https://review.opendev.org/c/openstack/ironic/+/798681 | 16:24 |
trandles | not going to make SPUC today...I'm stealing an empty desk at my wife's place of work and forgot my headphones. I can't be making a bunch of racket... | 16:44 |
trandles | :( | 16:45 |
opendevreview | Verification of a change to openstack/ironic failed: Suppress policy deprecation and default change warnings https://review.opendev.org/c/openstack/ironic/+/799120 | 16:46 |
JayF | np, enjoy your novel office for the day | 16:59 |
opendevreview | Merged openstack/ironic master: Only return the requested fields from the DB https://review.opendev.org/c/openstack/ironic/+/792274 | 17:03 |
TheJulia | whee | 17:08 |
NobodyCam | https://bugs.launchpad.net/nova/+bug/1853009 :'( | 17:09 |
TheJulia | NobodyCam: that sounds super familiar and I thought it was fixed | 17:13 |
NobodyCam | says fix released | 17:14 |
NobodyCam | https://www.irccloud.com/pastebin/Ewqki6BX/ | 17:15 |
opendevreview | Merged openstack/ironic master: Set stage for objects to handle selected field lists. https://review.opendev.org/c/openstack/ironic/+/792275 | 17:15 |
opendevreview | Merged openstack/ironic-inspector master: Ignored error state cache for new requests https://review.opendev.org/c/openstack/ironic-inspector/+/785245 | 17:18 |
JayF | I filed https://storyboard.openstack.org/#!/story/2009026 about getting CI for upstream Anaconda | 17:29 |
JayF | after spending ~1 working day on it, I found some pretty gnarly blockers and wrote that to document the issue | 17:29 |
opendevreview | Verification of a change to openstack/ironic failed: Suppress policy deprecation and default change warnings https://review.opendev.org/c/openstack/ironic/+/799120 | 17:32 |
TheJulia | NobodyCam: hmm :\ | 17:44 |
NobodyCam | ;p | 17:45 |
TheJulia | NobodyCam: ?train? right? | 17:45 |
NobodyCam | this one is ussuri | 17:45 |
TheJulia | so... someone on the spuc call says they have a downstream fix for this | 17:46 |
JayF | MAYBE | 17:46 |
NobodyCam | oh | 17:46 |
JayF | gotta look to see if it matches zer0c00l's fix | 17:46 |
NobodyCam | Thank you in advance 🙇‍♂️ | 17:53 |
TheJulia | NobodyCam: In logging, can you confirm that it was right after the instance was launched? | 17:54 |
NobodyCam | seen on conductor restart, noticed a hash ring mismatch .. looking into the logs | 17:56 |
NobodyCam | saw trace back for:`2021-07-01 16:43:54.829 2510473 ERROR nova.compute.manager oslo_messaging.rpc.client.RemoteError: Remote error: InvalidRequestError This session is in 'inactive' state, due to the SQL transaction being rolled back; no further SQL can be emitted within this transaction.` | 17:56 |
NobodyCam | `2021-07-01 16:50:38.516 2510473 ERROR nova.compute.resource_tracker [req-d6749c0a-373f-45dc-8f8c-905757721a7c - - - - -] Skipping removal of allocations for deleted instances: Failed to retrieve allocations for resource provider...<blah> | 17:56 |
NobodyCam | which led me to the bug | 17:57 |
TheJulia | NobodyCam: was the instance tied to it discarded as an orphan? | 18:00 |
NobodyCam | that is the state it is in now. if I read it correctly it was attempting to create the record, | 18:05 |
NobodyCam | `2021-07-01 16:43:49.852 2510473 ERROR nova.compute.manager ['Traceback (most recent call last):\n', ' File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 139, in _object_dispatch\n return getattr(target, method)(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper\n return fn(self, *args, **kwargs)\n', ' File | 18:05 |
NobodyCam | "/usr/lib/python3.6/site-packages/nova/objects/compute_node.py", line 338, in create\n db_compute = db.compute_node_create(self._context, updates)\n` | 18:05 |
JayF | zer0c00l: ^^^ We think NobodyCam is having the same issue you fixed downstream FWIW | 18:05 |
JayF | with resource provider cache and node caching | 18:05 |
opendevreview | Leo McGann proposed openstack/ironic-specs master: Add attestation interface spec https://review.opendev.org/c/openstack/ironic-specs/+/576718 | 18:12 |
NobodyCam | ckrelle-mlt:OpenStack ckrelle$ openstack resource provider show 74649c21-994a-42a2-83a0-7f7966d136e4 | 18:18 |
NobodyCam | +------------+--------------------------------------+ | 18:18 |
NobodyCam | | Field | Value | | 18:18 |
NobodyCam | +------------+--------------------------------------+ | 18:18 |
NobodyCam | | uuid | 74649c21-994a-42a2-83a0-7f7966d136e4 | | 18:18 |
NobodyCam | | name | 74649c21-994a-42a2-83a0-7f7966d136e4 | | 18:18 |
NobodyCam | | generation | 2 | | 18:18 |
NobodyCam | +------------+--------------------------------------+ | 18:18 |
JayF | here is the potentially relevant patch, applies to ocata https://gist.github.com/jayofdoom/4fa315489330430ea7aeaa6e8ad62dec | 18:26 |
TheJulia | I'm running nova unit tests now | 18:53 |
TheJulia | NobodyCam: https://review.opendev.org/c/openstack/nova/+/799327 | 20:06 |
NobodyCam | oh | 20:06 |
NobodyCam | /me Clicks | 20:06 |
TheJulia | I tried to add context to it so people understand that we really just are creating a giant race window | 20:11 |
NobodyCam | yes, we should be able to test on Tuesday | 20:12 |
NobodyCam | Side note. Recovery ended up being to remove the delete flagged records from nova/compute_nodes table and restart the nova compute | 20:13 |
TheJulia | so it *looks* like the full complete fix is mgoddard's patch series | 20:18 |
TheJulia | because it handles the case when it has happened, and it *can* happen for a couple different reasons | 20:19 |
NobodyCam | the one in merge conflict? | 20:31 |
NobodyCam | https://review.opendev.org/c/openstack/nova/+/695189 | 20:32 |
TheJulia | NobodyCam: https://review.opendev.org/q/topic:%2522bug/1839560%2522+(status:open+OR+status:merged)+and+owner:mark%2540stackhpc.com | 20:40 |
TheJulia | I guess I can try rebasing those next week | 20:41 |
TheJulia | I'm thinking it is approaching nap time | 20:41 |
NobodyCam | +++ for Nap Time | 20:41 |
opendevreview | Merged openstack/ironic master: Suppress policy deprecation and default change warnings https://review.opendev.org/c/openstack/ironic/+/799120 | 20:43 |
TheJulia | JayF: I'm thinking performance can be improved in the nova virt driver by using the cache for one-off things such as "give me the number of nodes". That is as long as the local cache is correct. The cache refresh seems to be the big possible area of resource utilization | 20:44 |
JayF | Honestly I assumed we already used the cache for as much as possible | 20:44 |
JayF | if there's more room there to use it, I'm +1 | 20:44 |
TheJulia | yeah, we just have to tidy it by removing things that end up in the list that are not required | 20:55 |
TheJulia | or not applicable. | 20:55 |
opendevreview | Jay Faulkner proposed openstack/ironic stable/wallaby: Suppress policy deprecation and default change warnings https://review.opendev.org/c/openstack/ironic/+/799253 | 20:56 |
TheJulia | *looks* like it would get rid of the other full node enumeration for just a count of nodes. The power sync stuff would still be fairly harsh, but at least cache wise we'd peel away the potentially long call to make a list of the nodes which does a ton of work | 20:56 |
TheJulia | arne_wiebalck already opened a bz against nova for this too if memory serves | 20:56 |
TheJulia | err | 20:57 |
TheJulia | lp item | 20:57 |
opendevreview | Arun S A G proposed openstack/ironic master: Add support for configdrive in anaconda interface https://review.opendev.org/c/openstack/ironic/+/780398 | 23:28 |