opendevreview | Julia Kreger proposed openstack/ironic master: DNM Enable OVN https://review.opendev.org/c/openstack/ironic/+/885087 | 00:13 |
---|---|---|
opendevreview | Verification of a change to openstack/ironic master failed: Add additional logging on iLO power failure https://review.opendev.org/c/openstack/ironic/+/885549 | 02:03 |
rpittau | good morning ironic! o/ | 06:32 |
opendevreview | Riccardo Pittau proposed openstack/ironic-python-agent-builder master: [WIP] Build tinyipa with tinycore 14.x https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/887754 | 07:04 |
dtantsur | TheJulia: ilo and ilo5 don't differ in the power interface. | 09:48 |
dtantsur | (I don't know details, this was just mentioned to me) | 09:48 |
opendevreview | Dmitry Tantsur proposed openstack/bifrost stable/2023.1: CI: Update cached cirros image to 0.5.3 https://review.opendev.org/c/openstack/bifrost/+/887783 | 09:49 |
dtantsur | rpittau: we're very close ^^ | 09:49 |
Nisha_Agarwal | dtantsur, hi | 10:52 |
iurygregory | good morning Ironic | 10:56 |
Nisha_Agarwal | iurygregory, o/ | 11:04 |
adam-metal3 | hey Ironic, is there currently anyone who is experiencing such a situation that sometimes the inspector's sqlite db gets locked and inspection fails? now we are seeing this issue from time to time in the Metal3 pipeline and also downstream. | 11:22 |
iurygregory | hi Nisha_Agarwal o/ | 11:23 |
iurygregory | adam-metal3, I'm not aware of such problem =( (never saw tbh) | 11:24 |
adam-metal3 | iurygregory, Thanks for the feedback, we started to experience this both in our CI and downstream in the last 2 weeks, (we use master branches btw) | 11:29 |
adam-metal3 | it is hard to reproduce, it happens seemingly randomly | 11:30 |
iurygregory | gotcha, I haven't seen anything in the #cluster-api-baremetal (not sure if we mention this problems there..) | 11:30 |
iurygregory | we do see some failures upstream in the metal3-integration job (not sure if they are related ....) | 11:31 |
adam-metal3 | we suspect it is | 11:31 |
iurygregory | https://zuul.opendev.org/t/openstack/builds?job_name=metal3-integration&project=openstack/ironic | 11:31 |
iurygregory | FAIL - node-0 Baremetalhost introspecting completed expected inspecting to be in ready available | 11:32 |
iurygregory | so maybe they are the same failure I think... | 11:32 |
adam-metal3 | yes exactly this is the symptom https://zuul.opendev.org/t/openstack/build/4e7d36544870445ba3ccacb44617559c | 11:32 |
iurygregory | dtantsur, rpittau JayF ^ (now I'm wondering if we should have metal3-integration voting =X in gate ) | 11:33 |
adam-metal3 | here is one of ours https://jenkins.nordix.org/view/Metal3%20Periodic/job/metal3_daily_release-1-2_integration_test_ubuntu/300/ | 11:35 |
iurygregory | ack | 11:40 |
adam-metal3 | yeah same thing if I check the zuul logs https://ac08acb7444cabb70070-2bbdba88dbc29e3876c4caa3f782a7e4.ssl.cf5.rackcdn.com/885549/2/gate/metal3-integration/4e7d365/controller/before_pivoting/ironic.log and search for "locked" it is full of it | 11:41 |
adam-metal3 | "Unable to start managed inspection for node 9639c227-2791-4b52-bc5a-72372c7209a6: (sqlite3.OperationalError) database is locked" | 11:43 |
iurygregory | yeah, just saw | 11:51 |
rpittau | iurygregory: I think we do have metal3 job voting in gate | 11:59 |
rpittau | iurygregory: https://opendev.org/openstack/ironic/src/branch/master/zuul.d/project.yaml#L87 | 12:00 |
iurygregory | rpittau, yeah we do, it was more a sarcastic comment, because we have the random failures XD | 12:01 |
iurygregory | and seems very legit | 12:01 |
rpittau | iurygregory: can you point me to the failures please? | 12:02 |
rpittau | I'm surprised as I see very few failures in metal3-integration recent history | 12:03 |
iurygregory | rpittau, https://zuul.opendev.org/t/openstack/builds?job_name=metal3-integration&project=openstack/ironic and the link that adam-metal3 mentioned =) | 12:03 |
dtantsur | there have been a lot of sqlalchemy-related patches recently.. any of them could cause a regression | 12:09 |
rpittau | was thinking the same | 12:09 |
iurygregory | I think I saw this problem even before we had the sqlalchemy "fixes" | 12:10 |
* iurygregory cheks | 12:10 | |
adam-metal3 | iurygregory, yep similar error has happen 4 years ago there is an old story board ticket : https://storyboard.openstack.org/#!/story/2005093 | 12:11 |
rpittau | the oldest I could find https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5e1/883062/17/check/metal3-integration/5e1dab8/controller/before_pivoting/ironic.log | 12:12 |
adam-metal3 | but ofc that was way before you have refactored and updated the sql alchemy stuff | 12:12 |
iurygregory | oh lovely .-. | 12:12 |
rpittau | June 29 | 12:12 |
rpittau | I think we had the patch in place already | 12:13 |
rpittau | are we not closing connections? maybe a race condition? | 12:14 |
dtantsur | yeah, it feels like another case of a hanging connection | 12:16 |
iurygregory | touch_conductor.. I'm wondering if the if should be inside the with _session and not on the same level | 12:18 |
dtantsur | iurygregory: 'if count'? it does not use anything with the database | 12:18 |
dtantsur | also, touch_conductor is where it failed the first time. The blocking connection is probably the action *before* that. | 12:19 |
dtantsur | as an aside, we should seriously consider retrying on "database is locked". That's the expected response to it. | 12:20 |
rpittau | I agree | 12:20 |
dtantsur | .. but I'd rather catch the root cause before we hide it by retrying :) | 12:21 |
rpittau | totally, also considered that this is not happening very frequently, at least in out CI, we should have the time to troubleshoot | 12:26 |
rpittau | adam-metal3 how frequent is the issue in the metal3 CI ? | 12:26 |
adam-metal3 | mohhammed, managed to induce the error by running a lot jobs I beleive it was 1 error from 10 jobs started right ? | 12:28 |
adam-metal3 | mohammed, ^ | 12:29 |
rpittau | ok, so roughly 10%, and we're seeing fewer | 12:31 |
mohammed | adam, Yeah pretty hard to repro this! did not see it on the PR checks only on some periodics | 12:31 |
opendevreview | Merged openstack/bifrost stable/2023.1: CI: Update cached cirros image to 0.5.3 https://review.opendev.org/c/openstack/bifrost/+/887783 | 12:51 |
TheJulia | good morning | 13:09 |
TheJulia | so, we should likely add logging to the pragma update call just to make sure it is triggering for metal3 | 13:12 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Fix several issues in the lock/release database code https://review.opendev.org/c/openstack/ironic/+/887835 | 13:12 |
TheJulia | on connection initiation, just to be sure | 13:12 |
dtantsur | Not sure if this ^^ helps, but it might. At least that's what caught my attention | 13:12 |
dtantsur | TheJulia: I must admit, SQLA 2 made everything so much more confusing... | 13:16 |
TheJulia | heh, it has a style, and that style was not for humans | 13:16 |
dtantsur | :D | 13:16 |
dtantsur | Thinking more about it, I don't think my patch will fix the issue in question.. Just things I've noticed while staring at the code. | 13:19 |
TheJulia | yeah, I'm suspecting the same since I remember structuring that to avoid the very issue your seeking to avoid | 13:20 |
TheJulia | rpittau: which sqlalchemy patch are you suspecting? | 13:21 |
opendevreview | Julia Kreger proposed openstack/ironic master: Fix db migration tests for sqlalchemy 2.0 https://review.opendev.org/c/openstack/ironic/+/887432 | 13:26 |
opendevreview | Julia Kreger proposed openstack/ironic master: Add job to test with SQLAlchemy master (2.x) https://review.opendev.org/c/openstack/ironic/+/886020 | 13:27 |
rpittau | TheJulia: not one in particular, yet | 13:29 |
rpittau | I see we're all loving sqla 2 | 13:30 |
TheJulia | heh | 13:32 |
rpittau | I'm not even sure the issue is there honestly, I'm looking at the patches merged 2-3 weeks ago and I find it hard to believe that one of them caused the issue | 13:34 |
TheJulia | And here I thought I fixed this: sqlalchemy.exc.SADeprecationWarning: The argument signature for the "ConnectionEvents.engine_connect" event listener has changed as of version 2.0, and conversion for the old argument signature will be removed in a future release. The new signature is "def engine_connect(conn)" | 13:34 |
TheJulia | I think the huge challenge is we've seen huge swings in performance | 13:36 |
rpittau | oh yeah, the only patch would be https://opendev.org/openstack/ironic/commit/9be84608f1da2cb3ad433554bdd3d8a395eab06f | 13:38 |
rpittau | but I'm not sure, just going byh exsclusion | 13:38 |
opendevreview | Julia Kreger proposed openstack/ironic master: SQLAlchemy: LOG the connection event action https://review.opendev.org/c/openstack/ironic/+/887848 | 13:39 |
TheJulia | this might provide our needed clarity | 13:39 |
TheJulia | okay, I think I got that original change wrong ... | 14:03 |
TheJulia | still trying to sort out a forward path | 14:04 |
TheJulia | and I think I have it | 14:08 |
TheJulia | I made the change too literally | 14:09 |
opendevreview | Julia Kreger proposed openstack/ironic master: Fix SQLAlchemy listener for engine connection, correctly https://review.opendev.org/c/openstack/ironic/+/887853 | 14:14 |
TheJulia | wheeeeeeeeeeeeeeeeeeeeeeeeee | 14:14 |
TheJulia | Someone bolt some more wheeee on to that! | 14:15 |
dtantsur | This is useful, but we already use WAL in metal3... | 14:55 |
dtantsur | https://github.com/metal3-io/ironic-image/blob/main/Dockerfile#L60-L61 | 14:57 |
rpittau | oh that's the second of the chain | 14:57 |
sschmitt | Has anyone ever managed to seemingly kill an APC pdu thats used with the ironic SNMP driver? | 14:59 |
sschmitt | (By kill i mean switching an interface on/off no longer works, even from the PDU web UI) | 14:59 |
TheJulia | dtantsur: that does nothing, remember the discussion with JayF a few weeks ago? | 15:01 |
TheJulia | it is not a persistent db setting, it is a on connect option | 15:01 |
TheJulia | sschmitt: sounds more like the relay fused | 15:02 |
dtantsur | TheJulia: this *is* a persistent change, unlike all other pragmas | 15:02 |
dtantsur | I told that JayF back then.. | 15:02 |
TheJulia | JayF pushed back on that | 15:02 |
dtantsur | $ sqlite3 /tmp/ironic.sqlite "pragma journal_mode=wal" | 15:02 |
dtantsur | wal | 15:02 |
dtantsur | $ sqlite3 /tmp/ironic.sqlite "pragma journal_mode" | 15:02 |
dtantsur | wal | 15:02 |
dtantsur | JayF was wrong as this simple test shows ^^ :) | 15:02 |
TheJulia | interesting | 15:03 |
TheJulia | Could also be sqlalchemy cannot honor that or doesn't know to | 15:03 |
dtantsur | Maybe? I don't know if you can override it. I'd expect it to be impossible, but... | 15:03 |
TheJulia | JayF did mention to me that he has seen clients turn off wal and keep working, as frightening as that sounds | 15:03 |
sschmitt | TheJulia yeah that makes sense... this is now the second one to have this happen :( | 15:04 |
dtantsur | The WAL journaling mode is persistent; after being set it stays in effect across multiple database connections and after closing and reopening the database. (c) https://www.sqlite.org/pragma.html#pragma_journal_mode | 15:04 |
TheJulia | sschmitt: electrically I'd be curious about what the contact is plated with :\ | 15:04 |
dtantsur | (I've spent too much time reading these docs lol) | 15:04 |
TheJulia | (JayF worked in a shop that did sqlite very heavily for years, fwiw) | 15:05 |
JayF | dtantsur: yeah but we don't differentiate between "db exists" and "new db" | 15:05 |
JayF | so we have to do it on each connection to ensure we don't have any non-wal writes before it's created | 15:05 |
dtantsur | mmm? | 15:05 |
JayF | even if it persists | 15:05 |
dtantsur | JayF: yeah, I don't object to Ironic doing that | 15:05 |
JayF | it's a safety thing, in that mindset, to do it on each conn | 15:05 |
dtantsur | simply pointing that Metal3 already does it *unless* I'm missing something significant | 15:05 |
JayF | you *do* have to set the foreign_key pragma each conn | 15:06 |
TheJulia | sschmitt: how much load is on the port? | 15:06 |
JayF | but oslodb does that | 15:06 |
dtantsur | yep. all other pragmas (I think) are per connection. | 15:06 |
TheJulia | sschmitt: the one which no longer switches | 15:06 |
TheJulia | I'm also guessing 208v or 230v? | 15:06 |
sschmitt | these are 208v | 15:07 |
sschmitt | no per outlet power reading so hard to say | 15:07 |
sschmitt | but these are like xeon-d based network devices so not that much power draw I'd imagine | 15:07 |
sschmitt | This reminds me, I was gonna open a bug to see if the SNMP driver could support two PDU outlets per device for devices with redundant PSUs | 15:09 |
TheJulia | sschmitt: how often has the port been toggled? | 15:10 |
sschmitt | like 50 times in the past few days | 15:10 |
TheJulia | so NobodyCam has kind of wanted that in the past | 15:10 |
TheJulia | oh wow, that is a lot | 15:10 |
TheJulia | so, the issue might be that depending on how the relay is designed, that might just exceed the duty cycle | 15:10 |
TheJulia | how much do you know about electrical contactors? | 15:11 |
sschmitt | yeah i could see that. these are the older 8000 series APC PDUs that are EOL as off last november. Maybe i should find an APC rep to see if the 9000 series are better | 15:11 |
TheJulia | maybe, price may also be a thing | 15:12 |
TheJulia | I'd ask if the contacts are beryllium copper (BeCU, 2% Be) alloy which basically doesn't arc | 15:12 |
TheJulia | with rapid toggling you can also get carbon buildup | 15:12 |
TheJulia | from the air (thanks air!) | 15:13 |
TheJulia | I took apart my transfer switch on my RV which was seeing a fault in my electrical cable (which needed a plate of BeCU to repair it, actually) and it kept cutting out the electrical feed and the contactors had BeCU as well and they were pristine even though they were toggling an absurd amount under high load | 15:14 |
TheJulia | (30+ to 50A) | 15:14 |
NobodyCam | Good Morning Ironic Folks, Happy Thursday | 15:17 |
TheJulia | re secondary ports, the idea I have is to use child nodes, and then introduce some sort of nested power off logic support | 15:18 |
TheJulia | but not there yet (patches welcome!) | 15:18 |
sschmitt | child nodes as in each psu is a child node? | 15:20 |
TheJulia | well, a child node entry could in theory represent the secondary port | 15:24 |
sschmitt | Maybe kinda like Modules in Netbox (https://docs.netbox.dev/en/stable/models/dcim/module/) | 15:27 |
TheJulia | kind of | 15:29 |
TheJulia | the idea behind is is more for things like DPUs, but in theory you could do a BMC and then have the snmp interfaces as "children" | 15:29 |
JayF | TheJulia: dtantsur: If we retry on db is locked failures, we can probably debug-log information about why we had to retry | 15:34 |
JayF | TheJulia: dtantsur: that might be the better route out of the pain loop and into a way for us to fix these when they happen (even if a deadlock happens, if we retry and print debugging info maybe we can use that to fix? | 15:35 |
dtantsur | that's not a bad idea | 15:35 |
TheJulia | I concur, there is a challenge there with the exception, but if we can pin it down I think that would help | 15:38 |
rpittau | time to close some tabs, see you tomorrow! o/ | 16:16 |
TheJulia | hmmm | 16:21 |
TheJulia | weird, I hope we don't nee dto do version specific listeners | 16:21 |
iurygregory | interesting the failures in standalone today are a bit different the instance was able to find ipxe and initialize, but reached Network unreachable .-. (the root still seems to be dhcp I think) at least for the redfish one https://zuul.opendev.org/t/openstack/build/b1d2fefad3b645669a42080b9bfe72db/log/controller/logs/ironic-bm-logs/node-2_console_log.txt | 17:06 |
opendevreview | Julia Kreger proposed openstack/ironic master: Fix SQLAlchemy listener for engine connection, correctly https://review.opendev.org/c/openstack/ironic/+/887853 | 17:10 |
TheJulia | okay, That *should* fix it | 17:10 |
TheJulia | tested on both versions, refactored for differences | 17:10 |
TheJulia | and yes, I've confirmed it gets called properly in CI | 17:10 |
TheJulia | well | 17:10 |
TheJulia | in local test jobs | 17:10 |
JayF | that change confuses me | 17:11 |
TheJulia | iurygregory: https://review.opendev.org/c/openstack/ironic/+/886863 maybe?! | 17:12 |
TheJulia | JayF: they check the signature of the method they are going to call and issue a compatability warning | 17:12 |
TheJulia | at least, for engine_connect | 17:13 |
JayF | ah, you're just trying to prevent a warning then | 17:14 |
JayF | that's sorta what it read like but I wasn't sure | 17:14 |
TheJulia | well, it is fatal at present | 17:14 |
TheJulia | since it is a hard break between the two | 17:14 |
TheJulia | so navigating it between the two | 17:14 |
iurygregory | TheJulia, maybe.. you have green on standalone jobs on it | 17:15 |
JayF | aha | 17:16 |
opendevreview | Julia Kreger proposed openstack/ironic master: Disable spanning tree https://review.opendev.org/c/openstack/ironic/+/886863 | 17:16 |
TheJulia | rebased | 17:16 |
iurygregory | oh jesus https://review.opendev.org/c/openstack/ironic/+/887853/2/ironic/db/sqlalchemy/__init__.py :D | 17:17 |
iurygregory | if _is_sqlalchemy_2 :D | 17:17 |
TheJulia | never said it was pretty :) | 17:17 |
iurygregory | I still think it's pretty, because we are writing in python | 17:18 |
iurygregory | it could be worse | 17:18 |
TheJulia | At least, we're not punching holes in cards | 17:19 |
iurygregory | yesssss | 17:19 |
JayF | if you were required to have the manual dexterity to keep a stack of cards in order and feed them in order | 17:20 |
JayF | lets just say I'd be making malts instead of using unitac | 17:20 |
TheJulia | Malting grains?! | 17:21 |
JayF | was thinking more 50s diner malt | 17:21 |
TheJulia | I was about to ask if you meant lactose | 17:21 |
JayF | IDK what the correct time period is for punch cards | 17:21 |
JayF | so I just went full happy days | 17:21 |
TheJulia | My mother was punching cards in the 70s | 17:22 |
TheJulia | but she was doing document coding, so a document ID associated to a card and basic information about the document so it could be indexed and found later | 17:22 |
opendevreview | Harald Jensås proposed openstack/metalsmith stable/wallaby: Allow both 'network' and 'subnet' in NIC https://review.opendev.org/c/openstack/metalsmith/+/887867 | 18:06 |
opendevreview | Merged openstack/bifrost master: CI: Update cached cirros image to 0.5.3 https://review.opendev.org/c/openstack/bifrost/+/885874 | 18:38 |
opendevreview | Pierre Riteau proposed openstack/bifrost stable/2023.1: Skip unnecessary SDK get_machine calls https://review.opendev.org/c/openstack/bifrost/+/883115 | 18:52 |
opendevreview | Merged openstack/ironic-python-agent-builder master: Remove outdated install pyyaml with pip2 https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/886379 | 19:21 |
iurygregory | TheJulia, seems like https://review.opendev.org/c/openstack/ironic/+/886863 didn't help with the standalone failures https://zuul.opendev.org/t/openstack/build/5a4438deadf94548b4df9d944b579493/log/controller/logs/ironic-bm-logs/node-1_console_log.txt | 22:00 |
TheJulia | pity :( | 22:01 |
opendevreview | Verification of a change to openstack/ironic master failed: Remove python 3.6 mock hack https://review.opendev.org/c/openstack/ironic/+/887023 | 22:18 |
iurygregory | yeah =( | 22:19 |
opendevreview | Verification of a change to openstack/ironic master failed: Add DB API for Firmware and Object https://review.opendev.org/c/openstack/ironic/+/883062 | 23:22 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!