Thursday, 2023-07-06

opendevreviewJulia Kreger proposed openstack/ironic master: DNM Enable OVN  https://review.opendev.org/c/openstack/ironic/+/88508700:13
opendevreviewVerification of a change to openstack/ironic master failed: Add additional logging on iLO power failure  https://review.opendev.org/c/openstack/ironic/+/88554902:03
rpittaugood morning ironic! o/06:32
opendevreviewRiccardo Pittau proposed openstack/ironic-python-agent-builder master: [WIP] Build tinyipa with tinycore 14.x  https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/88775407:04
dtantsurTheJulia: ilo and ilo5 don't differ in the power interface.09:48
dtantsur(I don't know details, this was just mentioned to me)09:48
opendevreviewDmitry Tantsur proposed openstack/bifrost stable/2023.1: CI: Update cached cirros image to 0.5.3  https://review.opendev.org/c/openstack/bifrost/+/88778309:49
dtantsurrpittau: we're very close ^^09:49
Nisha_Agarwaldtantsur, hi10:52
iurygregorygood morning Ironic10:56
Nisha_Agarwaliurygregory, o/11:04
adam-metal3hey Ironic, is there currently anyone who is experiencing such a situation that sometimes the inspector's sqlite db gets locked and inspection fails? now we are seeing this issue from time to time in the Metal3 pipeline and also downstream. 11:22
iurygregoryhi Nisha_Agarwal o/11:23
iurygregoryadam-metal3, I'm not aware of such problem =( (never saw tbh)11:24
adam-metal3iurygregory, Thanks for the feedback, we started to experience this both in our CI and downstream in the last 2 weeks, (we use master branches btw)11:29
adam-metal3it is hard to reproduce, it happens seemingly randomly 11:30
iurygregorygotcha, I haven't seen anything in the #cluster-api-baremetal (not sure if we mention this problems there..)11:30
iurygregorywe do see some failures upstream in the metal3-integration job (not sure if they are related ....)11:31
adam-metal3we suspect it is 11:31
iurygregoryhttps://zuul.opendev.org/t/openstack/builds?job_name=metal3-integration&project=openstack/ironic11:31
iurygregoryFAIL - node-0 Baremetalhost introspecting completed   expected inspecting to be in ready available11:32
iurygregoryso maybe they are the same failure I think...11:32
adam-metal3yes exactly this is the symptom https://zuul.opendev.org/t/openstack/build/4e7d36544870445ba3ccacb44617559c11:32
iurygregorydtantsur, rpittau JayF ^ (now I'm wondering if we should have metal3-integration voting =X in gate )11:33
adam-metal3here is one of ours https://jenkins.nordix.org/view/Metal3%20Periodic/job/metal3_daily_release-1-2_integration_test_ubuntu/300/11:35
iurygregoryack11:40
adam-metal3yeah same thing if I check the zuul logs https://ac08acb7444cabb70070-2bbdba88dbc29e3876c4caa3f782a7e4.ssl.cf5.rackcdn.com/885549/2/gate/metal3-integration/4e7d365/controller/before_pivoting/ironic.log and search for "locked" it is full of it11:41
adam-metal3"Unable to start managed inspection for node 9639c227-2791-4b52-bc5a-72372c7209a6: (sqlite3.OperationalError) database is locked"11:43
iurygregoryyeah, just saw11:51
rpittauiurygregory: I think we do have metal3 job voting in gate11:59
rpittauiurygregory:  https://opendev.org/openstack/ironic/src/branch/master/zuul.d/project.yaml#L8712:00
iurygregoryrpittau, yeah we do, it was more a sarcastic comment, because we have the random failures XD12:01
iurygregoryand seems very legit12:01
rpittauiurygregory: can you point me to the failures please?12:02
rpittauI'm surprised as I see very few failures in metal3-integration recent history12:03
iurygregoryrpittau, https://zuul.opendev.org/t/openstack/builds?job_name=metal3-integration&project=openstack/ironic and the link that adam-metal3 mentioned =)12:03
dtantsurthere have been a lot of sqlalchemy-related patches recently.. any of them could cause a regression12:09
rpittauwas thinking the same12:09
iurygregoryI think I saw this problem even before we had the sqlalchemy "fixes"12:10
* iurygregory cheks12:10
adam-metal3iurygregory, yep similar error has happen 4 years ago there is an old story board ticket : https://storyboard.openstack.org/#!/story/200509312:11
rpittauthe oldest I could find https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5e1/883062/17/check/metal3-integration/5e1dab8/controller/before_pivoting/ironic.log12:12
adam-metal3but ofc that was way before you have refactored and updated the sql alchemy stuff12:12
iurygregoryoh lovely .-.12:12
rpittauJune 2912:12
rpittauI think we had the patch in place already12:13
rpittauare we not closing connections? maybe a race condition?12:14
dtantsuryeah, it feels like another case of a hanging connection12:16
iurygregorytouch_conductor.. I'm wondering if the if should be inside the with _session and not on the same level12:18
dtantsuriurygregory: 'if count'? it does not use anything with the database12:18
dtantsuralso, touch_conductor is where it failed the first time. The blocking connection is probably the action *before* that.12:19
dtantsuras an aside, we should seriously consider retrying on "database is locked". That's the expected response to it.12:20
rpittauI agree12:20
dtantsur.. but I'd rather catch the root cause before we hide it by retrying :)12:21
rpittautotally, also considered that this is not happening very frequently, at least in out CI, we should have the time to troubleshoot12:26
rpittauadam-metal3 how frequent is the issue in the metal3 CI ?12:26
adam-metal3mohhammed, managed to induce the error by running a lot jobs I beleive it was 1 error from 10 jobs started right ?12:28
adam-metal3mohammed, ^ 12:29
rpittauok, so roughly 10%, and we're seeing fewer12:31
mohammedadam, Yeah pretty hard to repro this! did not see it on the PR checks only on some periodics  12:31
opendevreviewMerged openstack/bifrost stable/2023.1: CI: Update cached cirros image to 0.5.3  https://review.opendev.org/c/openstack/bifrost/+/88778312:51
TheJuliagood morning13:09
TheJuliaso, we should likely add logging to the pragma update call just to make sure it is triggering for metal313:12
opendevreviewDmitry Tantsur proposed openstack/ironic master: Fix several issues in the lock/release database code  https://review.opendev.org/c/openstack/ironic/+/88783513:12
TheJuliaon connection initiation, just to be sure13:12
dtantsurNot sure if this ^^ helps, but it might. At least that's what caught my attention13:12
dtantsurTheJulia: I must admit, SQLA 2 made everything so much more confusing...13:16
TheJuliaheh, it has a style, and that style was not for humans13:16
dtantsur:D13:16
dtantsurThinking more about it, I don't think my patch will fix the issue in question.. Just things I've noticed while staring at the code.13:19
TheJuliayeah, I'm suspecting the same since I remember structuring that to avoid the very issue your seeking to avoid13:20
TheJuliarpittau: which sqlalchemy patch are you suspecting?13:21
opendevreviewJulia Kreger proposed openstack/ironic master: Fix db migration tests for sqlalchemy 2.0  https://review.opendev.org/c/openstack/ironic/+/88743213:26
opendevreviewJulia Kreger proposed openstack/ironic master: Add job to test with SQLAlchemy master (2.x)  https://review.opendev.org/c/openstack/ironic/+/88602013:27
rpittauTheJulia: not one in particular, yet13:29
rpittauI see we're all loving sqla 213:30
TheJuliaheh13:32
rpittauI'm not even sure the issue is there honestly, I'm looking at the patches merged 2-3 weeks ago and I find it hard to believe that one of them caused the issue13:34
TheJuliaAnd here I thought I fixed this:     sqlalchemy.exc.SADeprecationWarning: The argument signature for the "ConnectionEvents.engine_connect" event listener has changed as of version 2.0, and conversion for the old argument signature will be removed in a future release.  The new signature is "def engine_connect(conn)"13:34
TheJuliaI think the huge challenge is we've seen huge swings in performance13:36
rpittauoh yeah, the only patch would be https://opendev.org/openstack/ironic/commit/9be84608f1da2cb3ad433554bdd3d8a395eab06f13:38
rpittaubut I'm not sure, just going byh exsclusion13:38
opendevreviewJulia Kreger proposed openstack/ironic master: SQLAlchemy: LOG the connection event action  https://review.opendev.org/c/openstack/ironic/+/88784813:39
TheJuliathis might provide our needed clarity13:39
TheJuliaokay, I think I got that original change wrong ...14:03
TheJuliastill trying to sort out a forward path14:04
TheJuliaand I think I have it14:08
TheJuliaI made the change too literally14:09
opendevreviewJulia Kreger proposed openstack/ironic master: Fix SQLAlchemy listener for engine connection, correctly  https://review.opendev.org/c/openstack/ironic/+/88785314:14
TheJuliawheeeeeeeeeeeeeeeeeeeeeeeeee14:14
TheJuliaSomeone bolt some more wheeee on to that!14:15
dtantsurThis is useful, but we already use WAL in metal3...14:55
dtantsurhttps://github.com/metal3-io/ironic-image/blob/main/Dockerfile#L60-L6114:57
rpittauoh that's the second of the chain14:57
sschmittHas anyone ever managed to seemingly kill an APC pdu thats used with the ironic SNMP driver?14:59
sschmitt(By kill i mean switching an interface on/off no longer works, even from the PDU web UI)14:59
TheJuliadtantsur: that does nothing, remember the discussion with JayF a few weeks ago?15:01
TheJuliait is not a persistent db setting, it is a on connect option15:01
TheJuliasschmitt: sounds more like the relay fused15:02
dtantsurTheJulia: this *is* a persistent change, unlike all other pragmas15:02
dtantsurI told that JayF back then..15:02
TheJuliaJayF pushed back on that15:02
dtantsur$ sqlite3 /tmp/ironic.sqlite "pragma journal_mode=wal"15:02
dtantsurwal15:02
dtantsur$ sqlite3 /tmp/ironic.sqlite "pragma journal_mode"15:02
dtantsurwal15:02
dtantsurJayF was wrong as this simple test shows ^^ :)15:02
TheJuliainteresting15:03
TheJuliaCould also be sqlalchemy cannot honor that or doesn't know to15:03
dtantsurMaybe? I don't know if you can override it. I'd expect it to be impossible, but...15:03
TheJuliaJayF did mention to me that he has seen clients turn off wal and keep working, as frightening as that sounds15:03
sschmittTheJulia yeah that makes sense... this is now the second one to have this happen :(15:04
dtantsur The WAL journaling mode is persistent; after being set it stays in effect across multiple database connections and after closing and reopening the database. (c) https://www.sqlite.org/pragma.html#pragma_journal_mode15:04
TheJuliasschmitt: electrically I'd be curious about what the contact is plated with :\15:04
dtantsur(I've spent too much time reading these docs lol)15:04
TheJulia(JayF worked in a shop that did sqlite very heavily for years, fwiw)15:05
JayFdtantsur: yeah but we don't differentiate between "db exists" and "new db"15:05
JayFso we have to do it on each connection to ensure we don't have any non-wal writes before it's created15:05
dtantsurmmm?15:05
JayFeven if it persists15:05
dtantsurJayF: yeah, I don't object to Ironic doing that15:05
JayFit's a safety thing, in that mindset, to do it on each conn15:05
dtantsursimply pointing that Metal3 already does it *unless* I'm missing something significant15:05
JayFyou *do* have to set the foreign_key pragma each conn15:06
TheJuliasschmitt: how much load is on the port?15:06
JayFbut oslodb does that15:06
dtantsuryep. all other pragmas (I think) are per connection.15:06
TheJuliasschmitt: the one which no longer switches15:06
TheJuliaI'm also guessing 208v or 230v?15:06
sschmittthese are 208v15:07
sschmittno per outlet power reading so hard to say15:07
sschmittbut these are like xeon-d based network devices so not that much power draw I'd imagine15:07
sschmittThis reminds me, I was gonna open a bug to see if the SNMP driver could support two PDU outlets per device for devices with redundant PSUs15:09
TheJuliasschmitt: how often has the port been toggled?15:10
sschmittlike 50 times in the past few days15:10
TheJuliaso NobodyCam has kind of wanted that in the past15:10
TheJuliaoh wow, that is a lot15:10
TheJuliaso, the issue might be that depending on how the relay is designed, that might just exceed the duty cycle15:10
TheJuliahow much do you know about electrical contactors?15:11
sschmittyeah i could see that. these are the older 8000 series APC PDUs that are EOL as off last november. Maybe i should find an APC rep to see if the 9000 series are better15:11
TheJuliamaybe, price may also be a thing15:12
TheJuliaI'd ask if the contacts are beryllium copper (BeCU, 2% Be) alloy which basically doesn't arc15:12
TheJuliawith rapid toggling you can also get carbon buildup15:12
TheJuliafrom the air (thanks air!)15:13
TheJuliaI took apart my transfer switch on my RV which was seeing a fault in my electrical cable (which needed a plate of BeCU to repair it, actually) and it kept cutting out the electrical feed and the contactors had BeCU as well and they were pristine even though they were toggling an absurd amount under high load15:14
TheJulia(30+ to 50A)15:14
NobodyCamGood Morning Ironic Folks, Happy Thursday 15:17
TheJuliare secondary ports, the idea I have is to use child nodes, and then introduce some sort of nested power off logic support15:18
TheJuliabut not there yet (patches welcome!)15:18
sschmittchild nodes as in each psu is a child node?15:20
TheJuliawell, a child node entry could in theory represent the secondary port15:24
sschmittMaybe kinda like Modules in Netbox (https://docs.netbox.dev/en/stable/models/dcim/module/)15:27
TheJuliakind of15:29
TheJuliathe idea behind is is more for things like DPUs, but in theory you could do a BMC and then have the snmp interfaces as "children"15:29
JayFTheJulia: dtantsur: If we retry on db is locked failures, we can probably debug-log information about why we had to retry15:34
JayFTheJulia: dtantsur: that might be the better route out of the pain loop and into a way for us to fix these when they happen (even if a deadlock happens, if we retry and print debugging info maybe we can use that to fix?15:35
dtantsurthat's not a bad idea15:35
TheJuliaI concur, there is a challenge there with the exception, but if we can pin it down I think that would help15:38
rpittautime to close some tabs, see you tomorrow! o/16:16
TheJuliahmmm16:21
TheJuliaweird, I hope we don't nee dto do version specific listeners16:21
iurygregoryinteresting the failures in standalone today are a bit different the instance was able to find ipxe and initialize, but reached Network unreachable .-. (the root still seems to be dhcp I think) at least for the redfish one https://zuul.opendev.org/t/openstack/build/b1d2fefad3b645669a42080b9bfe72db/log/controller/logs/ironic-bm-logs/node-2_console_log.txt17:06
opendevreviewJulia Kreger proposed openstack/ironic master: Fix SQLAlchemy listener for engine connection, correctly  https://review.opendev.org/c/openstack/ironic/+/88785317:10
TheJuliaokay, That *should* fix it17:10
TheJuliatested on both versions, refactored for differences17:10
TheJuliaand yes, I've confirmed it gets called properly in CI17:10
TheJuliawell17:10
TheJuliain local test jobs17:10
JayFthat change confuses me17:11
TheJuliaiurygregory: https://review.opendev.org/c/openstack/ironic/+/886863 maybe?!17:12
TheJuliaJayF: they check the signature of the method they are going to call and issue a compatability warning17:12
TheJuliaat least, for engine_connect17:13
JayFah, you're just trying to prevent a warning then17:14
JayFthat's sorta what it read like but I wasn't sure17:14
TheJuliawell, it is fatal at present17:14
TheJuliasince it is a hard break between the two17:14
TheJuliaso navigating it between the two17:14
iurygregoryTheJulia, maybe.. you have green on standalone jobs on it17:15
JayFaha17:16
opendevreviewJulia Kreger proposed openstack/ironic master: Disable spanning tree  https://review.opendev.org/c/openstack/ironic/+/88686317:16
TheJuliarebased17:16
iurygregoryoh jesus https://review.opendev.org/c/openstack/ironic/+/887853/2/ironic/db/sqlalchemy/__init__.py :D17:17
iurygregoryif _is_sqlalchemy_2 :D17:17
TheJulianever said it was pretty :)17:17
iurygregoryI still think it's pretty, because we are writing in python17:18
iurygregoryit could be worse17:18
TheJuliaAt least, we're not punching holes in cards17:19
iurygregoryyesssss17:19
JayFif you were required to have the manual dexterity to keep a stack of cards in order and feed them in order17:20
JayFlets just say I'd be making malts instead of using unitac17:20
TheJuliaMalting grains?!17:21
JayFwas thinking more 50s diner malt17:21
TheJuliaI was about to ask if you meant lactose17:21
JayFIDK what the correct time period is for punch cards17:21
JayFso I just went full happy days17:21
TheJuliaMy mother was punching cards in the 70s17:22
TheJuliabut she was doing document coding, so a document ID associated to a card and basic information about the document so it could be indexed and found later17:22
opendevreviewHarald JensÃ¥s proposed openstack/metalsmith stable/wallaby: Allow both 'network' and 'subnet' in NIC  https://review.opendev.org/c/openstack/metalsmith/+/88786718:06
opendevreviewMerged openstack/bifrost master: CI: Update cached cirros image to 0.5.3  https://review.opendev.org/c/openstack/bifrost/+/88587418:38
opendevreviewPierre Riteau proposed openstack/bifrost stable/2023.1: Skip unnecessary SDK get_machine calls  https://review.opendev.org/c/openstack/bifrost/+/88311518:52
opendevreviewMerged openstack/ironic-python-agent-builder master: Remove outdated install pyyaml with pip2  https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/88637919:21
iurygregoryTheJulia, seems like https://review.opendev.org/c/openstack/ironic/+/886863 didn't help with the standalone failures https://zuul.opendev.org/t/openstack/build/5a4438deadf94548b4df9d944b579493/log/controller/logs/ironic-bm-logs/node-1_console_log.txt 22:00
TheJuliapity :(22:01
opendevreviewVerification of a change to openstack/ironic master failed: Remove python 3.6 mock hack  https://review.opendev.org/c/openstack/ironic/+/88702322:18
iurygregoryyeah =( 22:19
opendevreviewVerification of a change to openstack/ironic master failed: Add DB API for Firmware and Object  https://review.opendev.org/c/openstack/ironic/+/88306223:22

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!