opendevreview | Michal Nasiadka proposed openstack/bifrost master: WIP: Bump up Ansible to >=2.9,<5 https://review.opendev.org/c/openstack/bifrost/+/814858 | 06:09 |
---|---|---|
iurygregory | good morning Ironic o/ | 06:11 |
janders | hey iurygregory o/ | 06:11 |
janders | (making progress with my laptop setup :) ) | 06:11 |
iurygregory | janders, o/ | 06:11 |
dtantsur | please tell me it's Friday already | 06:11 |
janders | hey dtantsur o/ | 06:12 |
iurygregory | dtantsur, almost? | 06:12 |
janders | unfortunately I do not have good news on that | 06:12 |
janders | but good news is - it ain't Wednesday! | 06:12 |
janders | (unless someone is in LA or Hawaii - then it's Wednesday still) | 06:14 |
iurygregory | OMG :D | 06:16 |
dtantsur | poor people! | 06:16 |
iurygregory | *poor soul* | 06:17 |
arne_wiebalck | Good morning janders iurygregory dtantsur and Ironic! | 06:28 |
janders | hey arne_wiebalck o/ | 06:28 |
iurygregory | arne_wiebalck, o/ | 06:32 |
rpittau | good morning ironic! o/ | 06:34 |
iurygregory | morning rpittau o/ | 06:34 |
rpittau | hey iurygregory :) | 06:34 |
* rpittau breakfast incoming | 06:39 | |
* dtantsur -> haircut, bbl | 07:54 | |
opendevreview | Michal Nasiadka proposed openstack/bifrost master: Bump up Ansible to >=2.10,<5 https://review.opendev.org/c/openstack/bifrost/+/814858 | 08:52 |
*** pmannidi is now known as pmannidi|brb | 10:34 | |
opendevreview | Riccardo Pittau proposed openstack/ironic-python-agent-builder master: Bump pip for tinyipa to 21.3 https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/814894 | 10:37 |
opendevreview | Merged openstack/ironic master: Add Xena versions to release notes https://review.opendev.org/c/openstack/ironic/+/814581 | 10:51 |
*** dviroel|rover|out is now known as dviroel|rover | 10:51 | |
TheJulia | Good morning | 11:35 |
dtantsur | morning TheJulia. it's still not Friday, imagine? | 11:36 |
TheJulia | no, unfortunately it is not | 11:50 |
TheJulia | iurygregory: so it looks like I'm going to get pulled onto a call at 10:30 | 11:53 |
iurygregory | good morning TheJulia o/ | 11:56 |
TheJulia | well, 10:30 my local time, 14:30 UTC | 11:57 |
iurygregory | oh ok | 11:57 |
TheJulia | So we can start discussions and see where they go | 11:57 |
iurygregory | I was going to ask the UTC time XD | 11:57 |
TheJulia | And I can always follow-up with notes | 11:57 |
iurygregory | yeah, we can probably also move dtantsur's topic from Friday to today, so we don't have 3 topics from you | 11:58 |
opendevreview | Arne Wiebalck proposed openstack/ironic-python-agent stable/xena: Assert EFI part UUID is not None before editing fstab https://review.opendev.org/c/openstack/ironic-python-agent/+/814904 | 12:11 |
opendevreview | Arne Wiebalck proposed openstack/ironic-python-agent stable/wallaby: Assert EFI part UUID is not None before editing fstab https://review.opendev.org/c/openstack/ironic-python-agent/+/814905 | 12:12 |
iurygregory | Just a reminder that our PTG sessions today are in the Kilo room - https://ptg.opendev.org/ptg.html | 12:36 |
opendevreview | Riccardo Pittau proposed openstack/ironic-python-agent bugfix/8.1: Assert EFI part UUID is not None before editing fstab https://review.opendev.org/c/openstack/ironic-python-agent/+/814769 | 12:49 |
*** redrobot is now known as Guest3656 | 12:59 | |
opendevreview | HanGuangyu proposed openstack/ironic master: Add a description of stopping ironic-api.service https://review.opendev.org/c/openstack/ironic/+/814912 | 13:04 |
dtantsur | TheJulia, iurygregory, wrote the RFE about enabled interfaces: https://storyboard.openstack.org/#!/story/2009316 | 13:05 |
dtantsur | will probably take next (downstream) sprint if nobody beats me to it | 13:06 |
dtantsur | btw what do you think about enabling redfish by default nowadays? (and thus making sushy a requirement) | 13:07 |
iurygregory | sounds like a good idea i would say ^ | 13:07 |
iurygregory | reminder: the tc discussion about CI is in a few minutes | 13:08 |
dtantsur | 12 minutes to be precise? | 13:08 |
iurygregory | yeah =) | 13:08 |
iurygregory | if it starts sooner I will put a message here | 13:09 |
arne_wiebalck | I observe that nodes after instantiation sometimes end up in active but with no instance UUID. I think this happens when RPCs time out (Nova fails, but Ironic moves on). While I guess this "active, but no UUID" situation is ok for Ironic stand-alone, it is a false active and hence a blocked resource in an Ironic w/ Nova deployment. Would it make sense / is it possible to have something in Ironic which detects and cleans this up? | 13:09 |
dtantsur | arne_wiebalck: I thought Nova was supposed to set instance_uuid *first* | 13:21 |
dtantsur | since it's a sort of a lock | 13:21 |
arne_wiebalck | dtantsur: set it in Ironic? | 13:22 |
dtantsur | yeah | 13:22 |
arne_wiebalck | dtantsur: I thought the lock is the resource allocation in placement | 13:22 |
dtantsur | it shouldn't proceed with deployment if instance_uuid is not set on a node | 13:22 |
dtantsur | that's on the nova side | 13:22 |
dtantsur | on the ironic side instance_uuid is a lock | 13:22 |
arne_wiebalck | right, but no one else can take that node anymore (through nova) | 13:22 |
arne_wiebalck | also, the reason why I end up here is just speculation | 13:23 |
arne_wiebalck | but it is a fact I end up here :) | 13:23 |
arne_wiebalck | ... and the nodes are blocked until I step in and clean up | 13:24 |
dtantsur | Node ca2b5f09-1fc7-4784-b2c9-700cd614cec3 reached failure state deploy failed while waiting for provision_state=['active']. Error: Agent returned error for deploy step {'interface': 'raid', 'step': 'apply_configuration', 'args': {'raid_config': {'logical_disks': [{'size_gb': 'MAX', 'raid_level': '1', 'controller': 'software'}]}}, 'priority': 97} on node | 13:29 |
dtantsur | ca2b5f09-1fc7-4784-b2c9-700cd614cec3 : Error performing deploy_step apply_configuration: Software RAID caused unknown error: Failed to create md device /dev/md0 on /dev/vda1 /dev/vdb1: Unexpected error while running command. | 13:29 |
dtantsur | Command: mdadm --create /dev/md0 --force --run --metadata=1 --level 1 --raid-devices 2 /dev/vda1 /dev/vdb1 | 13:29 |
dtantsur | Exit code: 2 | 13:29 |
dtantsur | Stdout: '' | 13:29 |
dtantsur | Stderr: 'mdadm: cannot open /dev/vdb1: No such file or directory\n'. | 13:29 |
dtantsur | This has started appearing in the CI (stable/wallaby). arne_wiebalck, has your recent fix hit wallaby? | 13:30 |
arne_wiebalck | dtantsur: the udev patch, yes it should have | 13:31 |
arne_wiebalck | TheJulia: filed the backports IIRC | 13:31 |
arne_wiebalck | but the symptom looks indeed identical | 13:33 |
TheJulia | what did I do!?? | 13:33 |
TheJulia | or forgot?!? | 13:33 |
opendevreview | HanGuangyu proposed openstack/ironic master: Add description to the mod_wsgi part https://review.opendev.org/c/openstack/ironic/+/814916 | 13:36 |
dtantsur | hmm, this is a fresh stable/wallaby run | 13:36 |
dtantsur | it's https://opendev.org/openstack/ironic-python-agent/commit/9d707e9f4bab40109b7e29df2136e86d65325ea3 right? | 13:36 |
arne_wiebalck | dtantsur: yes | 13:37 |
dtantsur | it should be in the builds... | 13:37 |
opendevreview | HanGuangyu proposed openstack/ironic master: Add description to the mod_wsgi part https://review.opendev.org/c/openstack/ironic/+/814916 | 13:37 |
dtantsur | the job: https://zuul.opendev.org/t/openstack/build/9a9727d4b6f44734949ccb47f2ee1040/artifacts | 13:37 |
arne_wiebalck | the patch is indeed in stable/wallaby | 13:39 |
arne_wiebalck | TheJulia: heh, you submitted the backport for the udev raid patch :) | 13:40 |
TheJulia | dtantsur: redfish by default was on my mind, just haven't pushed the patch up yet. By all means! | 13:40 |
TheJulia | lock on the ironic node is instance_uuid being set | 13:41 |
TheJulia | if field is already set, then it reschedules | 13:41 |
opendevreview | HanGuangyu proposed openstack/ironic master: Add description to the mod_wsgi part https://review.opendev.org/c/openstack/ironic/+/814916 | 13:41 |
arne_wiebalck | TheJulia: hmm ... how would I end up with an active node without UUID then? | 13:42 |
arne_wiebalck | *instance UUID | 13:42 |
TheJulia | aiui, you shouldn't unless a human did it | 13:43 |
arne_wiebalck | a human did what? | 13:43 |
TheJulia | deploy an instance | 13:43 |
dtantsur | those humans!!! | 13:43 |
TheJulia | evil, pesky humans | 13:43 |
arne_wiebalck | it is the humans' fault then! | 13:44 |
TheJulia | bad humans, no cookies | 13:44 |
dtantsur | we, wool owls, consider humans overrated | 13:44 |
arne_wiebalck | srsly, what do you mean by a human deployed an instance? directly on ironic w/o nova? | 13:44 |
TheJulia | dtantsur: upgraded to a wool owl? | 13:44 |
TheJulia | arne_wiebalck: you can totally do it :) | 13:44 |
dtantsur | TheJulia: indeed: https://twitter.com/creepy_owlet/status/1450815340900933634 (actually it says "owl wool", dunno what they mean) | 13:45 |
TheJulia | so dtantsur is now a wool owl | 13:45 |
TheJulia | huh! | 13:45 |
TheJulia | brainsplodes | 13:45 |
arne_wiebalck | I bet, but this is not what I did ... unfortunately I do not have the nova instances anymore | 13:46 |
TheJulia | does it show up in nova logs? | 13:46 |
arne_wiebalck | I will see if I can get it from my logs ... (input for the log discussion later :-D) | 13:46 |
TheJulia | ++ | 13:47 |
arne_wiebalck | TheJulia: it should, the node should show up in nova and from there I should be able to find the failed instance and ... reasons! | 13:47 |
TheJulia | oh wow | 13:47 |
opendevreview | Aija Jauntēva proposed x/sushy-oem-idrac master: Update unit test folder structure https://review.opendev.org/c/x/sushy-oem-idrac/+/814919 | 13:48 |
TheJulia | arne_wiebalck: I *guess* there could have been a time window where the request was still in the pipeline but not locked yet with a task on the conductor side | 13:49 |
TheJulia | if nova failed, it could potentially rip the entry out and ironic would keep going | 13:49 |
dtantsur | Iury, Julia and I are still in the TC room FYI | 14:00 |
iurygregory | dtantsur, I just left | 14:04 |
dtantsur | I think I'll stay until the end of this discussion | 14:05 |
iurygregory | that was my plan also.. | 14:06 |
rpittau | I'm having some issues with zoom | 14:08 |
* dtantsur hears "privsep has issues with memory" | 14:09 | |
iurygregory | wow | 14:13 |
dtantsur | like, serious issues apparently... | 14:13 |
iurygregory | funny =( | 14:13 |
opendevreview | Merged openstack/ironic-python-agent stable/xena: Assert EFI part UUID is not None before editing fstab https://review.opendev.org/c/openstack/ironic-python-agent/+/814904 | 15:16 |
erbarr | need a molteniron review: https://review.opendev.org/c/openstack/molteniron/+/815023 | 16:00 |
rpittau | bye everyone, see you on monday o/ | 16:06 |
dtantsur | same, but on Friday o/ | 16:07 |
TheJulia | stevebaker[m]: so I guess my concern of sorts is that vendors are building some grub network booting, some not. I guess fedora/centos/rhel grubs are not network booting for uefi, except for a special purpose built binary which is documented to be on the install disks. :\ | 16:08 |
TheJulia | stevebaker[m]: which to me, seems like a bug/defect, but maybe we should address that with the maintainers directly | 16:08 |
TheJulia | some others, upstream, I guess it is there/present in other forms | 16:08 |
stevebaker[m] | Yeah I'm assuming when grub network boot doesn't work it's because it's not well tested by distros | 16:10 |
arne_wiebalck | TheJulia: dtantsur: I had a look at the logs of the node in active w/o instance UUID now. It seems indeed that nova sees a timeout ("Failed to request Ironic to provision instance abc: Timed out waiting for a reply to message ID xyz"), then issues a delete, but Ironic ends up with an active instance. | 16:24 |
arne_wiebalck | Ironic moves the node from available to deploying only after Nova has given up already. | 16:24 |
JayF | That is an overworked rmq or conductor, IME | 16:25 |
arne_wiebalck | JayF: exactly | 16:25 |
JayF | this is a very longstanding bug, we experienced this from day 1 at onmetal during failure cases | 16:25 |
arne_wiebalck | the conductor is overloaded and cannot handle the RPC in time, Nova gives up, Ironic moves on :) | 16:25 |
arne_wiebalck | JayF: ha! ok :) | 16:26 |
arne_wiebalck | JayF: Usually, our conductor is not overloading, but we re-created >1000 nodes in a short time. | 16:26 |
arne_wiebalck | *overloaded | 16:27 |
arne_wiebalck | JayF: I increased the RPC timeout as a first measure | 16:28 |
arne_wiebalck | JayF: these are nodes which are not distributed over groups nicely | 16:28 |
arne_wiebalck | so the conductor side is understaffed | 16:28 |
arne_wiebalck | anyway, good to know my theory of how I ended up here seems to check out | 16:29 |
arne_wiebalck | JayF: Is there a bug in storyboard for this, or were there even any ideas how to address this? | 16:30 |
arne_wiebalck | (apart from "scale up your infra!" :)) | 16:30 |
JayF | I don't know the answers to those questions; I just recognize the flow and the behavior | 16:30 |
JayF | we only ever saw it during like, split-brain network failurey sorta things | 16:30 |
arne_wiebalck | thanks | 16:31 |
arne_wiebalck | bye everyone, see you tomorrow o/ | 16:37 |
opendevreview | Merged openstack/ironic-python-agent stable/wallaby: Assert EFI part UUID is not None before editing fstab https://review.opendev.org/c/openstack/ironic-python-agent/+/814905 | 16:52 |
opendevreview | Arun S A G proposed openstack/ironic master: Fix format of kickstart options https://review.opendev.org/c/openstack/ironic/+/814087 | 17:50 |
*** sshnaidm is now known as sshnaidm|afk | 18:53 | |
*** dviroel|rover is now known as dviroel|rover|afk | 21:58 | |
*** pmannidi|brb is now known as pmannidi | 23:23 |
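
Note on the "active, but no instance UUID" discussion above: a minimal Python sketch of how such nodes could be spotted. It assumes the node records have already been fetched and are available as plain dicts using the `uuid`, `provision_state` and `instance_uuid` fields of the Ironic node API. The function name and the sample records are hypothetical and were not part of the discussion. The sketch only reports candidates; whether a hit is really a leftover from a timed-out Nova request (rather than a legitimate stand-alone deployment) still has to be checked by an operator.

```python
from typing import Iterable, List, Mapping, Optional


def find_active_without_instance(
    nodes: Iterable[Mapping[str, Optional[str]]],
) -> List[Optional[str]]:
    """Return the UUIDs of nodes that are 'active' but have no instance_uuid.

    In a deployment where Nova owns all instances, such nodes are candidates
    for the "false active" case described in the log. This is an assumption:
    in stand-alone Ironic the same combination can be perfectly valid.
    """
    suspects = []
    for node in nodes:
        is_active = node.get("provision_state") == "active"
        has_instance = bool(node.get("instance_uuid"))
        if is_active and not has_instance:
            suspects.append(node.get("uuid"))
    return suspects


# Hypothetical sample records for illustration only (not real output).
sample_nodes = [
    {"uuid": "node-a", "provision_state": "active", "instance_uuid": "inst-1"},
    {"uuid": "node-b", "provision_state": "active", "instance_uuid": None},
    {"uuid": "node-c", "provision_state": "available", "instance_uuid": None},
    {"uuid": "node-d", "provision_state": "deploying", "instance_uuid": "inst-2"},
]

if __name__ == "__main__":
    for node_uuid in find_active_without_instance(sample_nodes):
        print(f"active without instance_uuid: {node_uuid}")
```

The check is deliberately read-only: cleaning up (for example, undeploying the node) is left out because the log does not settle how or whether Ironic should do that automatically.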