vanou | good morning ironic | 00:58 |
---|---|---|
opendevreview | Aija Jauntēva proposed openstack/sushy master: Fix volume deletion on newer iDRACs https://review.opendev.org/c/openstack/sushy/+/864845 | 07:53 |
rpittau | good morning ironic! o/ | 08:36 |
kubajj | Good morning everyone | 09:16 |
kubajj | rpittau: is it possible to recheck just one zuul pipeline? | 11:26 |
rpittau | kubajj: unfortunately that's not possible | 11:26 |
iurygregory | good morning Ironic | 11:29 |
kubajj | good morning iurygregory | 11:29 |
kubajj | If I want to recheck, do I just leave a comment, a review with just recheck in it or is there something special? | 11:30 |
rpittau | kubajj: just add a comment with recheck at the beginning and possibly a reason for the recheck | 11:31 |
rpittau | what's the patch ? | 11:31 |
kubajj | https://review.opendev.org/c/openstack/ironic/+/866056 | 11:31 |
kubajj | Zuul says Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance | 11:32 |
kubajj | and I don't think that I modified anything that should cause a problem with partition image | 11:33 |
iurygregory | you can add "recheck zuul retries" | 11:34 |
iurygregory | this would trigger recheck on the patch, and you provided the reason why you would like to do it =) | 11:34 |
rpittau | kubajj: I agree, it doesn't look like the failure is due to your change | 11:34 |
kubajj | thanks, will do | 11:36 |
opendevreview | Jonathan Rosser proposed openstack/ironic master: Fix debug log message argument formatting https://review.opendev.org/c/openstack/ironic/+/866856 | 13:56 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Fix unit tests for Python 3.11 https://review.opendev.org/c/openstack/ironic/+/866861 | 14:12 |
janders | dtantsur sorry I missed the ping about https://review.opendev.org/c/openstack/sushy/+/866612 - missed the boat now, but LGTM, thanks! | 14:23 |
janders | Out of curiosity, what hardware needed this fix? | 14:23 |
dtantsur | janders: some ARM stuff, I pinged you downstream | 14:24 |
janders | dtantsur ACK | 14:25 |
kubajj | dtantsur: what are the links that the controllers return? | 14:37 |
dtantsur | kubajj: these are for navigation between API resources. Since your new endpoint is leaf (nothing under it), I think you can omit links for now. | 14:38 |
dtantsur | e.g. a node endpoint links to states (/v1/nodes/<node> to /v1/nodes/<node>/states) | 14:39 |
kubajj | thanks | 14:39 |
TheJulia | good morning folks | 14:43 |
kubajj | dtantsur: just to make sure, I need just get_one and it should just return the data | 14:50 |
dtantsur | kubajj: yep | 14:50 |
rpittau | need to split, see ya tomorrow o/ | 16:02 |
kubajj | dtantsur: I think I figured out why we had the discussion about the get_by | 16:13 |
kubajj | get_by_node_uuid in object definition | 16:14 |
opendevreview | Jakub Jelinek proposed openstack/ironic master: WIP: API for node inventory https://review.opendev.org/c/openstack/ironic/+/866876 | 16:16 |
jrosser | TheJulia: following up from my difficulty with debug=True yesterday i found a suspicious debug log https://review.opendev.org/c/openstack/ironic/+/866856 | 16:44 |
jrosser | it was a little tricky to find because the exception handling here doesnt output the stack trace https://github.com/openstack/ironic/blob/b34d79e3f440c408520f24a2263dc587ac205ee2/ironic/drivers/modules/agent_base.py#L539-L545 | 16:44 |
JayF | I don't /think/ that should be a functional change | 16:45 |
JayF | but imbw | 16:45 |
jrosser | i'm now stuck in a bit of a loop where i can't get cleaning to succeed https://paste.opendev.org/show/bTFmRIKizVxJwO8E14Ww/ | 16:46 |
JayF | Are you using any kind of custom code in that? | 16:47 |
JayF | Other than the above patch? | 16:47 |
jrosser | no, this is stable/yoga | 16:48 |
JayF | This looks extremely broken | 16:48 |
JayF | Can you reproduce that with IPA in debug mode? | 16:48 |
jrosser | is there a correct way to "start again" after getting to clean failed state | 16:49 |
jrosser | i am wondering if i'm doing that wrong | 16:49 |
JayF | So if it's in clean failed | 16:50 |
JayF | you should be able to run | 16:50 |
JayF | baremetal node blah manage | 16:50 |
JayF | baremetal node blah provide | 16:50 |
JayF | you have to go from: | 16:50 |
JayF | clean failed -> managable | 16:50 |
JayF | then from managable -> cleaning -> available | 16:50 |
JayF | if you're working with automated cleaning | 16:50 |
JayF | so just manage, provide | 16:50 |
jrosser | hmm ok thats what i'm doing | 16:50 |
JayF | So lets do a couple of things: | 16:51 |
JayF | Do your "manage" | 16:51 |
JayF | get a dump of the node object | 16:51 |
JayF | (/v1/nodes/detail?uuid=blah or whatever the node-show-detail command is in OSC) | 16:51 |
JayF | then do a provide | 16:51 |
JayF | when it breaks, get a node detail again | 16:51 |
JayF | toss all that into a pastebin and we should have the info we need to find the problem, I hope | 16:51 |
jrosser | ok i'll try that | 16:52 |
JayF | TheJulia: ^^ uh, it looks like cleaning could be crazy-broken on stable/yoga --- we are trying to call "wait" state verb on a node in cleanwait state | 16:52 |
* TheJulia tries to digest | 16:54 | |
TheJulia | what the.... | 16:56 |
JayF | yeah, that's what I was thinking lol | 16:56 |
TheJulia | yeah, we're going to need state data for the fields before/after I think | 17:00 |
TheJulia | just so we're not spinning endlessly | 17:00 |
* jrosser not sure how to get node detail | 17:01 | |
JayF | baremetal node show $node_uuid --detail # I think ? | 17:01 |
opendevreview | Vishal Manchanda proposed openstack/ironic-ui master: [DNM] Test CI jobs status https://review.opendev.org/c/openstack/ironic-ui/+/866880 | 17:06 |
jrosser | JayF: before i run the 'provide' is this sufficient detail? https://paste.opendev.org/show/bkMzulFg0D4nm9zZqB41/ | 17:18 |
JayF | perfect | 17:18 |
jrosser | i can't find a way to get detailed output from 'baremetal node show' | 17:18 |
JayF | --long might be it | 17:18 |
JayF | you have the fields I care about :D | 17:18 |
JayF | TheJulia: ^^ is clean_steps supposed to be erased on the failed -> managable transition, or when cleaning restarts? | 17:19 |
TheJulia | it is just baremetal node show | 17:19 |
TheJulia | JayF: I believe it is | 17:20 |
TheJulia | uhh... both I think | 17:20 |
JayF | So I'll note; node.clean_steps in the above paste is still populated | 17:21 |
JayF | we'll wait to see the output but that looks sus | 17:21 |
TheJulia | That seems like it is a problem then | 17:22 |
TheJulia | but it has been ~2 years since I was looking at that code | 17:22 |
JayF | same, but longer | 17:22 |
JayF | and I don't know what coulda busted it in Yoga | 17:22 |
TheJulia | likewise | 17:22 |
TheJulia | this seems super weird | 17:22 |
TheJulia | because we run CI in debug too | 17:23 |
JayF | jrosser: information about how you have installed Ironic would be useful too -- pip installed? Using packages? etc | 17:23 |
TheJulia | but... maybe it got orphaned there earlier on and the presence short circuits things | 17:23 |
TheJulia | which *does* feel like we've had that come up | 17:23 |
jrosser | it is installed with pip from git using openstack-ansible | 17:23 |
jrosser | this feels less directly related to debug than my trouble yesterday, but i may be wrong there | 17:24 |
jrosser | unless i have some bogus state now from it encountering the exception | 17:25 |
TheJulia | well, I think you hit an unexpected exception | 17:26 |
TheJulia | lets see, what was that exception | 17:26 |
jrosser | ok here is the node detail after i get clean failed https://paste.opendev.org/show/bPcYy3RTxwWFSThLUVVl/ | 17:27 |
TheJulia | also, sorry for not being really around late yesterday/early today. Board stuffs | 17:28 |
TheJulia | ouch | 17:31 |
TheJulia | so yeah | 17:31 |
TheJulia | you hit a typeerror, which bombed out things really abdly | 17:31 |
TheJulia | badly | 17:31 |
JayF | If we can't recover from that in Ironic-proper; it's still our bug to fix that up, yeah? | 17:31 |
JayF | manage/provide should be resetting enough context on the node to make it work | 17:32 |
JayF | I just can't tell what is broken where, in the output | 17:32 |
TheJulia | yeah, I'm not that far yet | 17:34 |
TheJulia | mother in law just called | 17:34 |
TheJulia | err | 17:34 |
TheJulia | no, step mother | 17:34 |
* TheJulia gets the two confused | 17:34 | |
TheJulia | oh! | 17:44 |
TheJulia | I see what is going on | 17:44 |
TheJulia | so clean step overrides are loaded | 17:44 |
JayF | I'm not sure I follow? | 17:45 |
TheJulia | https://github.com/openstack/ironic/blame/stable/yoga/ironic/conductor/steps.py#L187 | 17:45 |
JayF | I'm still not sure how that could cause the failures we're seeing | 17:47 |
TheJulia | I think we just need a general exception catch in do_node_clean | 17:47 |
jrosser | yes i have [conductor] clean_step_priority_override = deploy.erase_devices_express:5 | 17:48 |
JayF | So that's the original typeerror | 17:49 |
JayF | but I don't think that explains why it's failing in a recurring method now | 17:49 |
TheJulia | yeah, that is what I'm struggling with | 17:54 |
TheJulia | do we have the latest error? | 17:54 |
JayF | he showed a node log of the error we see in the last node | 17:55 |
JayF | like well above | 17:55 |
JayF | it appears to be failing reliably in the same way | 17:55 |
jrosser | yes it is the same, this is from my last attempt just now https://paste.opendev.org/show/bUASWfgCz9kRVzd1fvRn/ | 17:56 |
jrosser | i can see in the console of the node that the express cleaning step complete with result: None | 17:57 |
TheJulia | oh! | 17:59 |
TheJulia | I think i know what is going on | 17:59 |
TheJulia | jrosser: any chance we can get like the hundred lines before that in the logs ? | 18:04 |
jrosser | sure | 18:04 |
jrosser | this is from the point that the node was powered on https://pastebin.com/raw/Dn0AwziH | 18:14 |
JayF | 17:20:37 | 18:22 |
JayF | hmm, nope, I'm just reading multiple logs about the same thing as if it was multiple different logs | 18:23 |
JayF | it was already in clean wait after it powered on, it never transitioned back to cleaning at any point, but yet wait was called | 18:24 |
JayF | I wonder if this is some kind of weird edge case in DRAC cleaning + clean step override | 18:24 |
jrosser | oh the drac stuff is just noise, there are a bunch of dells as well but this isnt one of them | 18:24 |
* JayF burns his hypothesis | 18:25 | |
TheJulia | yeah, this is bizzar | 18:33 |
JayF | jrosser: you 100000000% sure this is unpatched, up to date stable/yoga? | 18:36 |
JayF | jrosser: maybe a pip freeze from inside the venv in the container, if you can do that? | 18:37 |
JayF | 17:20:37 is the only state transition | 18:38 |
JayF | why it is trying to wait a cleanwait node??? | 18:38 |
jrosser | https://paste.opendev.org/show/baKpxWfnQ18WzAMBXvjl/ | 18:38 |
jrosser | it should be ironic from commit 5fc42c41 | 18:40 |
JayF | yeah, and that's the latest yoga release | 18:40 |
JayF | I'm looking at git commit logs for something to blame lol | 18:40 |
JayF | this is one of the most clearly broken things I've seen with Ironic, and I can't figure out why lol | 18:41 |
jrosser | i need to fork ironic now anyway to apply my debug log patch repeatably, so i can redeploy all this fresh tomorrow | 18:41 |
JayF | well, if it stopped breaking after you redeployed | 18:41 |
JayF | I'd be even more confused and upset lol | 18:41 |
TheJulia | eh, everything with weird cleaning stuff is... weird | 18:43 |
JayF | https://github.com/openstack/ironic/commit/8034242c225f3293c08ca46dc588d00c5ad0e10a is the only commit I can even see in yoga that I'd be sus over | 18:43 |
JayF | and I can't draw a line between those changes | 18:43 |
jrosser | and just for completeness i tried again with debug=False and it's totally nothing to do with that | 18:43 |
JayF | yeah my hunch is somehow garbage was left in the node cleaning information | 18:44 |
JayF | from the typeerror | 18:44 |
JayF | and that is causing consistent failures | 18:44 |
JayF | so there are 2x things to fix: the root cause, the typeerror, and our recovery semantics around bad data in cleaning metadata | 18:44 |
JayF | but I'd really like to track it through the code | 18:44 |
jrosser | is there value in persisting with this, i'm happy to debug but i don't really know what i'm looking at | 18:44 |
jrosser | alternatively i can delete / recreate the node but the learning opportunity may be lost then | 18:45 |
TheJulia | so it is *almost* like an explicit call to resume is not getting recorded | 18:45 |
TheJulia | and when it saves the object and refreshes, it gets the old data | 18:45 |
JayF | TheJulia: do you think there's any value in jrosser NOT rebuilding this? | 18:45 |
JayF | I think we have all the data, right? | 18:45 |
TheJulia | jrosser: what version of sqlalchemy is in use? | 18:45 |
JayF | even if we can't explain it | 18:45 |
JayF | oooooh | 18:45 |
JayF | SQLAlchemy==1.4.31 | 18:46 |
JayF | sqlalchemy-migrate==0.13.0 | 18:46 |
JayF | that looks right to me | 18:46 |
TheJulia | bngo | 18:46 |
JayF | that's not correct? for yoga? | 18:46 |
TheJulia | drop down to 1.3.x or 1.2.x | 18:46 |
JayF | oslo.db==11.2.0 | 18:46 |
TheJulia | we're just doing 1.4/2.0 stuff in master branch | 18:46 |
JayF | are our requirements wrong? or is OSA ignoring them? | 18:46 |
TheJulia | uhhhhhhh | 18:46 |
TheJulia | 11.2.0 | 18:46 |
* TheJulia goes and checks | 18:47 | |
JayF | SQLAlchemy>=1.2.19 # MIT | 18:47 |
JayF | oslo.db>=9.1.0 # Apache-2.0 | 18:47 |
JayF | no upper bound, when clearly there needs to be one | 18:47 |
TheJulia | upper comes from requirements repo | 18:47 |
JayF | https://github.com/openstack/requirements/blob/stable/yoga/global-requirements.txt#L326 | 18:48 |
TheJulia | https://github.com/openstack/requirements/blob/stable/yoga/upper-constraints.txt#L164 | 18:48 |
TheJulia | yeah | 18:48 |
TheJulia | so, I'd drop the version back to just see if it works | 18:48 |
JayF | so that is incorrect? | 18:48 |
JayF | 1.4 drops autocommit? | 18:48 |
TheJulia | I don't know, I've never seen this before | 18:48 |
TheJulia | 1.4 doesn't drop it, 2.0 does | 18:48 |
JayF | let me ask the question in a more declarative way: | 18:48 |
TheJulia | but... this feels like an autocommit issue | 18:48 |
JayF | Would you have expected oslo.db 11.2.0 and sqlalchemy 1.4.31 to work? | 18:49 |
JayF | (on stable/yoga Ironic) | 18:49 |
TheJulia | yes, I would have, but I'm grasping at straws because 1.4 does have some major changes that perhaps we just didn't detect | 18:49 |
JayF | OK; I understand where you're at now | 18:50 |
jrosser | do you think i have something inconsistent with u-c | 18:50 |
* jrosser tries to follow along | 18:50 | |
JayF | jrosser: the hypothesis is that there may be a currently-unknown-bug in newer sqlalchemy that we haven't identified yet | 18:50 |
JayF | jrosser: canyou try with 1.3.x or 1.2.x | 18:50 |
TheJulia | ... I feel like maybe there could be something, because the only thing that explains this, is if we get an old row back for node upon changing some other fields | 18:50 |
JayF | jrosser: if that fixes it; we'll have to either fix compat with sqla 1.4.x or drop the upper-constraint | 18:50 |
JayF | jrosser: are you using anything exotic for your DB here? | 18:51 |
jrosser | mariadb | 18:51 |
JayF | e.g. some kind of not-quite-mysql cloud service, or read-only secondaries, etc | 18:51 |
JayF | I would think new mariadb would be fine | 18:51 |
TheJulia | the path is task.process_event('resume') not actually executing | 18:51 |
TheJulia | at least, that what it *seems* | 18:52 |
TheJulia | but it has no way to actually not do so to get as far as it does | 18:52 |
* TheJulia hopes that makes sense | 18:52 | |
JayF | what line # do you think that's being called on? | 18:52 |
TheJulia | looking for it again, give me a minute | 18:52 |
JayF | just to make sure I'm looking in the right spot | 18:52 |
JayF | ty | 18:52 |
TheJulia | https://github.com/openstack/ironic/blob/master/ironic/conductor/task_manager.py#L613 | 18:53 |
JayF | yeah that's what I expected | 18:54 |
TheJulia | the fact the target_provision_state is available, and not none, also points to something funky | 18:54 |
TheJulia | Honestly, I'm kind of at the point where I'd want to try and reproduce this, but I don't see how we got on that track to begin with unless something "weird" occurd with db interaction | 18:55 |
JayF | like I said above; I think the typeerror killed it in a bad spot | 18:55 |
TheJulia | oh, I agree | 18:55 |
JayF | and somehow gave us a node in a state that just screws up this over and over | 18:55 |
TheJulia | oh | 18:55 |
TheJulia | you know what... it could just be the fact target_provision_state is already set | 18:55 |
JayF | like maybe that leftover target_prov_state is never getting zeroed? | 18:55 |
TheJulia | I wonder if nuking that field int he db would clear this case up | 18:55 |
TheJulia | because I think on failure, that does get reset | 18:55 |
JayF | jrosser: mind trying something for uS? | 18:55 |
jrosser | sure | 18:56 |
TheJulia | or at least, with cleaning is *should* | 18:56 |
JayF | jrosser: reset the node to managable state | 18:56 |
jrosser | ok, done | 18:56 |
JayF | nope, that hypothesis is bunk TheJulia | 18:56 |
JayF | https://paste.opendev.org/show/bkMzulFg0D4nm9zZqB41/ | 18:56 |
TheJulia | openstack baremetal node show ? | 18:56 |
JayF | target_provision_state is null in the before photo | 18:56 |
JayF | dii['clean_steps'] is not though | 18:57 |
JayF | which is what I think could be the culprit but I'm not sure | 18:57 |
TheJulia | ugh | 18:57 |
JayF | TheJulia: we can just have jrosser wipe all of dii, right? | 18:57 |
TheJulia | yes | 18:58 |
JayF | jrosser: so now you'll run this: | 18:58 |
JayF | update nodes set driver_internal_info=None where uuid='THE NODES UUID' limit 1; | 18:58 |
JayF | be 1000000% sure that's right or you'll break some other node | 18:58 |
JayF | I do not believe there is a non-DB way to blank driver_internal_info :( | 18:58 |
TheJulia | doing that will at least identify where this is going sideways since it is not standing out in the code | 18:58 |
TheJulia | there is not | 18:58 |
TheJulia | we purge most of it in error handling stuffs that make sense | 18:59 |
TheJulia | but yeah... | 18:59 |
JayF | jrosser: then, after you've wiped driver_internal_info on the node; lets do another before picture (node show -> pastebin) then do the provide again | 18:59 |
jrosser | ERROR 1054 (42S22): Unknown column 'None' in 'field list' | 19:02 |
jrosser | oh hmm | 19:02 |
JayF | oh | 19:02 |
JayF | null | 19:02 |
JayF | not None | 19:02 |
JayF | sorry, it's null in mysql, not `None` | 19:02 |
jrosser | ok done | 19:03 |
JayF | aight, then like I said, do a node show -> pastebin | 19:03 |
JayF | so we can have a snapshot before state for this test | 19:03 |
JayF | then do the provide | 19:03 |
JayF | and it'll succeed this time. Maybe. We hope. Perhaps. | 19:03 |
JayF | Cross your fingers :D | 19:03 |
jrosser | https://paste.opendev.org/show/b853jZDgzdhsYrsrAHKN/ | 19:05 |
* TheJulia is hoping lots | 19:14 | |
jrosser | same error :( | 19:18 |
TheJulia | Ugh | 19:19 |
jrosser | i am kind if tired now and it's late | 19:19 |
JayF | I'll be here in the morning :) Thanks for working withus | 19:19 |
jrosser | i will do that again tomorrow, as first time the host dropped to the uefi shell and didnt boot | 19:19 |
jrosser | and i can't be certain if that was my fault or not | 19:19 |
TheJulia | jrosser: get some rest, perhaps log.debug(task.node) before the exception is raised | 19:19 |
JayF | I don't see anything else in the "before" node show that would indicate to me that something in the node object is busted | 19:20 |
JayF | so it's gotta be environmental :( | 19:20 |
jrosser | thankyou very much for your help, i can't help feeling theres something really silly going on here | 19:20 |
JayF | like TheJulia's scary hypothesis | 19:20 |
JayF | jrosser: it shouldn't be possible to get Ironic in this broken of a state, even if misconfigured | 19:20 |
JayF | jrosser: unless you find out it's been flapping between two or three DB servers mid-cleaning :P | 19:20 |
JayF | lol | 19:21 |
JayF | I'm sayin', almost certainly Ironic's fault. | 19:21 |
opendevreview | Julia Kreger proposed openstack/ironic master: Catch any exception for Cleaning https://review.opendev.org/c/openstack/ironic/+/866933 | 19:58 |
TheJulia | stevebaker[m]: By chance have you looked at https://zuul.opendev.org/t/openstack/build/1172af245e204813a34d2fc3d61ef5c1 ? | 20:07 |
TheJulia | rpittau: fyi https://review.opendev.org/c/openstack/ironic/+/866780 is failing, looks like job configuration on the branch | 20:07 |
TheJulia | stevebaker[m]: rloo: hjensas: Reviews on https://review.opendev.org/c/openstack/ironic-specs/+/861803 would be appreciated | 20:08 |
stevebaker[m] | TheJulia: that unit test fail looks very much related to the change. I'll take a look | 20:09 |
TheJulia | thanks | 20:09 |
TheJulia | yeah, I saw the ipmitool octets and went "rutro" | 20:09 |
rloo | TheJulia: yes, I started it this morning and then got side-tracked. Have it down to do tomorrow morning (hopefully no interruptions tomorrow...) | 20:12 |
TheJulia | okay | 20:12 |
TheJulia | Thanks | 20:12 |
stevebaker[m] | TheJulia: ah, those tests are testing the default boot mode, which is bios on xena | 20:30 |
TheJulia | so question comes to mind, do we stop backports at that point then? | 20:32 |
stevebaker[m] | TheJulia: the test tests default boot mode, explicit uefi, then explicit bios. The fix actually removes the test_management.py changes from the change | 20:35 |
stevebaker[m] | So I think its ok to backport imo | 20:35 |
opendevreview | Steve Baker proposed openstack/ironic stable/xena: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866627 | 20:36 |
TheJulia | stevebaker[m]: ack ack | 20:38 |
opendevreview | Steve Baker proposed openstack/ironic stable/wallaby: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866628 | 20:39 |
opendevreview | Steve Baker proposed openstack/ironic stable/wallaby: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866628 | 20:41 |
JayF | TheJulia: are stevebaker[m] ^^ patches the new version of that one I originally NACK'd? | 21:01 |
JayF | or is this yet-another-kinda-breaking-change? :( | 21:01 |
* JayF put a -1 and a question on the first backport in that chain | 21:07 | |
JayF | As discussed in Monday's meeting, I've proposed a change to project-config to give access to a new group, ironic-release, to delete bugfix branches and create new tags (so we can tag bugfix-xy-eol and delete bugfix/x.y branch) | 21:31 |
JayF | https://review.opendev.org/c/openstack/project-config/+/866937 | 21:31 |
TheJulia | JayF: different one | 21:31 |
JayF | Ack; I mean, this is 2x in a week for iRMC driver? | 21:32 |
JayF | Are we not having a high enough review bar for those changes? | 21:32 |
TheJulia | they have a ton of changes in review right now | 21:32 |
JayF | If we think this is OK to land, can someone make that case in the merge request in reponse to my comment | 21:32 |
TheJulia | it is not about a bar, it is about addressing/fixing issues most appropriately | 21:32 |
JayF | and lets get all the breaky stuff landed together and cut a minor-version-bump release | 21:32 |
JayF | This particular default boot mode fix looks like a straight up bug that we should've caught in review. | 21:34 |
JayF | and "we" is probably literal because there's a good chance I was a reviewer on it | 21:34 |
TheJulia | my concious grok which is not deep was that they were not honoring the framework as the driver didn't get updated as ironic evolved | 21:36 |
TheJulia | but.... I could be in left field, under some rocks for all I know right now | 21:36 |
JayF | I'm still at google maps trying to get directions to the ballpark ;) | 21:36 |
TheJulia | oh good, it is not just me | 21:37 |
JayF | I hadn't even thought about how easy it is for bitrot to break these drivers | 21:39 |
JayF | and we don't even track or think about that at all in any kind of structured way | 21:40 |
JayF | this goes back to the discussion about iRMC driver having it's own bios deal, and not doing the bios interface, bceause it predates | 21:40 |
stevebaker[m] | JayF: any existing irmc node with explicit boot mode will not be affected by this backport the irmc docs are clear about boot mode being explicitly set on the node. Not setting the boot mode on the node is has undefined behaviour, that is what the backport fixes | 21:40 |
JayF | undefined meaning, undocumented or unreliable? | 21:41 |
JayF | This is the way I posed the question to TheJulia about the other, similar kinda-breaky change: is there any chance a reasonable operator would have this working reliably the way they want it to *and* that the change could break it | 21:42 |
JayF | sounds like in this case it would be: | 21:42 |
JayF | - boot_mode unset on the node | 21:42 |
JayF | - default_boot_mode set in config | 21:42 |
JayF | - node is using !default boot_mode reliably | 21:42 |
JayF | - operator wants the node to be using the nondefault boot_mode, despite never setting it explicitly | 21:42 |
JayF | Do I understand it properly? | 21:42 |
JayF | If so, I think this fits like the other one: as long as we have an extremely straightforward release note warning operators the behavior is changing, it should be OK. And I have a strong preference for us cutting a minor-version-bump release of any impacted branches as soon as this (and the other breaky bugfix that Julia was proposing) lands | 21:43 |
JayF | stevebaker[m]: ^ can you ack my understanding is correct? Then one of us put this in the merge req so it's documnted and I'll flip my vote | 21:43 |
TheJulia | I *think* the inverse, not explicitly stating it thus undefined result on the back end based upon ?last? or prior hardware state | 21:44 |
TheJulia | sort of like what you get by default iwth IPMI if you don't do the raw magic | 21:44 |
JayF | I don't like that scenario as much | 21:44 |
JayF | because it means there's no way for an operator to get the existing behavior post-bugfix | 21:44 |
TheJulia | ahh, but steve said the docs said the are supposed to be explicit | 21:45 |
JayF | and I stopped assuming if certain configurations made sense after working at places that did things that only made sense in downstream context | 21:45 |
TheJulia | this handles when the humand didn't do the thing | 21:45 |
TheJulia | as I understand it | 21:45 |
JayF | what I'm saying though, is right now, if you're running Zed | 21:45 |
JayF | and you're in the impacted group | 21:45 |
JayF | you're getting "we boot with the last boot mode" behavior. Which is not awesome IMO, but if it reliably behaves that way | 21:45 |
TheJulia | well, if you didn't follow the docs, your impacted regardless | 21:45 |
JayF | there's no way for an oper to restore that behavior | 21:45 |
stevebaker[m] | JayF: It looks like defaults to bios when not explicit. default_boot_mode is bios for W, X, and uefi for Y, Z | 21:46 |
JayF | e.g. if it was always bios or always uefi; we could at least dictate how to change it | 21:46 |
JayF | OK; that is nicer then. | 21:46 |
JayF | Can we ensure the release note explcitly tells an operator how to get existing behavior back if they want? | 21:46 |
TheJulia | I'm not sure, if documented to work differently, that the existing behavior was valid in a sense. | 21:46 |
TheJulia | without being set... | 21:46 |
TheJulia | that is | 21:47 |
JayF | I guess I should explicitly state | 21:47 |
JayF | if someone is in this configuration | 21:47 |
JayF | you have to assume they aren't reading docs, right? | 21:47 |
TheJulia | the docs are dense, they could have just missed it | 21:49 |
JayF | yeah, that's fair | 21:49 |
JayF | I think what I said above and re: the previous change is still reasonable, make sure we have clear, directive release notes to operators who were desiring the previous behavior (or were in the uncertain config) | 21:49 |
JayF | and then lets cut a minor version bump release as soon as those two pieces are in | 21:50 |
stevebaker[m] | safest release note advice would be to set capablities boot_mode:bios explicitly for any node which doesn't have it set. And this note would only be required on Y, Z | 21:50 |
JayF | and maybe, maybe consider not backporting this beyond the point where we can't "fit" in a new minor version :/ | 21:50 |
stevebaker[m] | JayF: W, X is actually safer because default_boot_mode is bios? | 21:51 |
JayF | So really, Y and Z are the only ones that are changing behavior in a real way | 21:51 |
JayF | I cringe b/c I don't think we can minor-version-bump Y | 21:51 |
JayF | but that is just an unfortunate reality | 21:51 |
JayF | I just noticed a misconfiguration in ironic-stable-maint; I'm going to be taking immediate action to rectify it, and will email the list. | 21:54 |
JayF | We have core-reviewers-emeritus still on the ironic-stable-maint group | 21:54 |
* TheJulia raises an eyebrow | 21:54 | |
JayF | I'm actually going to purge the whole list; we have ironic-stable-maint inheriting from ironic-core | 21:55 |
JayF | so anything left in the list for ironic-stable-maint is just asking to be forgotten about | 21:55 |
stevebaker[m] | JayF: So just clarifying, we can go ahead with the backports. Y, Z will get a clear warning in the release note to set the boot mode on existing nodes. W, X can have a softer warning because defaults result in no change? | 21:56 |
JayF | I'm onboard with that | 21:57 |
JayF | just make sure the release note says what you'd tell an operator if they were like "Steve, what happened to my node?" :D | 21:57 |
stevebaker[m] | JayF: maybe in a new "upgrade:" release note section? | 21:58 |
JayF | +++++ | 21:58 |
stevebaker[m] | STEVE, MY NODES! | 21:58 |
TheJulia | JayF: when do you want to look at the RBAC stuffs? | 21:59 |
JayF | uh | 21:59 |
JayF | give me a minute to finish sending a message to the list about the stable stuff | 21:59 |
JayF | then nowish is fine w/me? | 21:59 |
TheJulia | uhh, maybe in a little bit | 21:59 |
TheJulia | I'm reviewing the api change now | 22:00 |
JayF | tomorrow is fine, a breadcrumb in the review about where to go look to run tests is fine | 22:00 |
JayF | I know you're busy :D | 22:00 |
TheJulia | sure | 22:00 |
opendevreview | Steve Baker proposed openstack/ironic stable/zed: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866625 | 22:12 |
stevebaker[m] | JayF: let me know what you think of ^^ wording | 22:12 |
JayF | TheJulia: hmm, could jrosser's issue be like, a stuck-open transaction | 22:18 |
JayF | TheJulia: does that even make sense as a thing to suggest? | 22:18 |
JayF | stevebaker[m]: +1'd it, only not a +2 because I only looked at it from an impact perspective right now | 22:24 |
TheJulia | eh... I don't *think* so, but it is curious if the conductor has restarted or not | 22:24 |
JayF | I was just thinking that an open transaction, modifying that node, might be an explainable reason why you'd be getting weird results back after you SHOULD see the node in a new position ... but I guess that hypothesis is 100% negated by the fact the node is in a decent state between runs | 22:27 |
opendevreview | Verification of a change to openstack/ironic master failed: Fix debug log message argument formatting https://review.opendev.org/c/openstack/ironic/+/866856 | 22:28 |
opendevreview | Steve Baker proposed openstack/ironic stable/yoga: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866626 | 22:40 |
opendevreview | Steve Baker proposed openstack/ironic stable/xena: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866627 | 22:40 |
opendevreview | Steve Baker proposed openstack/ironic stable/wallaby: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866628 | 22:43 |
opendevreview | Steve Baker proposed openstack/ironic bugfix/20.2: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866780 | 22:44 |
opendevreview | Steve Baker proposed openstack/ironic bugfix/21.0: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866656 | 22:45 |
opendevreview | Jay Faulkner proposed openstack/ironic master: DB & Object layer for node.shard https://review.opendev.org/c/openstack/ironic/+/864236 | 23:27 |
opendevreview | Jay Faulkner proposed openstack/ironic master: API support for CRUD node.shard https://review.opendev.org/c/openstack/ironic/+/866946 | 23:27 |
TheJulia | JayF: rutro https://review.opendev.org/c/openstack/ironic/+/866946/1 | 23:30 |
TheJulia | check the enitre commit message | 23:31 |
JayF | lolsob | 23:31 |
JayF | gimme a sec, I think I can "fix" | 23:31 |
TheJulia | yeah, delete the new changeid line | 23:31 |
opendevreview | Jay Faulkner proposed openstack/ironic master: DB & Object layer for node.shard https://review.opendev.org/c/openstack/ironic/+/864236 | 23:32 |
opendevreview | Jay Faulkner proposed openstack/ironic master: API support for CRUD node.shard https://review.opendev.org/c/openstack/ironic/+/866235 | 23:32 |
JayF | okay; fixed | 23:32 |
JayF | ty for pointing that out before it was too long on the split brain lol | 23:33 |
JayF | gerrit telling on me, and how I usually manage multi-patch setups | 23:33 |
JayF | (fixup things patch-by-patch, then rebase them back into the correct commits) | 23:33 |
TheJulia | I go patch by patch and just cherry-pick the next patch on top and go from there | 23:36 |
JayF | Yeah; I think I am a little less scared of rebasing than most folks. I try to avoid telling people my git workflow because they either want to adopt it or change it. | 23:39 |
JayF | lol | 23:39 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!