Tuesday, 2023-08-15

01:04 *** awb_ is now known as awb
08:06 <kubajj> Good morning Ironic o/
11:39 <iurygregory> good morning Ironic
11:51 <kubajj> morning iurygregory o/
12:01 <iurygregory> hey hey kubajj o/
13:15 <TheJulia> good morning
14:05 <iurygregory> good morning TheJulia
14:21 <TheJulia> oh tenacity, why are you stuck at 6.3.1
14:28 <opendevreview> Julia Kreger proposed openstack/ironic master: Slow down the sqlite retry  https://review.opendev.org/c/openstack/ironic/+/891333
14:28 <opendevreview> Julia Kreger proposed openstack/ironic master: Log upon completion of power sync  https://review.opendev.org/c/openstack/ironic/+/891334
14:28 <opendevreview> Julia Kreger proposed openstack/ironic master: Don't yield on power sync at the end of the work  https://review.opendev.org/c/openstack/ironic/+/891335
14:41 <opendevreview> Dmitry Tantsur proposed openstack/ironic master: [DNM] Test how much breaks without query.all() in the SQL API  https://review.opendev.org/c/openstack/ironic/+/891505
14:42 <TheJulia> oh tenacity, why are you stuck at 6.3.1
14:42 <TheJulia> err, already sent that
14:42 <TheJulia> dtantsur: driver templates and node history, basically
14:42 <dtantsur> I'd like to be 100% sure :)
14:42 <TheJulia> there is a port lookup with shards test that triggers it, which I don't understand *why* it heads down that route :\
15:01 <dtantsur> TheJulia: I wonder if the update part of register_conductor is safe
15:02 <TheJulia> got a link to what you're looking at?
15:03 <dtantsur> TheJulia: https://opendev.org/openstack/ironic/src/branch/master/ironic/db/sqlalchemy/api.py#L1348-L1352
15:03 * dtantsur just scrolling through the file...
15:06 <TheJulia> hmm, that might not be safe...
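
For context: a write inside register_conductor that is not covered by the SQLite retry decorator can fail outright with OperationalError("database is locked") when it collides with another transaction. A minimal sketch of that kind of retry wrapper, assuming tenacity and SQLAlchemy; the conductors table and the update_conductor_row helper are made up for illustration, and this is not Ironic's actual decorator:

    import tenacity
    from sqlalchemy import create_engine, text
    from sqlalchemy.exc import OperationalError

    engine = create_engine("sqlite:///ironic.db")

    def _is_sqlite_locked(exc):
        return isinstance(exc, OperationalError) and "database is locked" in str(exc)

    @tenacity.retry(
        retry=tenacity.retry_if_exception(_is_sqlite_locked),
        wait=tenacity.wait_fixed(0.25),        # short pause between attempts
        stop=tenacity.stop_after_attempt(10),  # eventually give up and surface the error
        reraise=True,
    )
    def update_conductor_row(hostname):
        # The sort of write a conductor re-registration performs: mark the row online.
        with engine.begin() as conn:
            conn.execute(
                text("UPDATE conductors SET online = 1 WHERE hostname = :h"),
                {"h": hostname},
            )
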
15:36 <TheJulia> dtantsur: is it just me, or does the base req-id logging seem to behave differently in single process mode
15:37 <dtantsur> Possibly? Although context should be passed..
15:37 <TheJulia> It seems like the req-id just never changes if it is an internal thing
15:37 <TheJulia> makes grepping for things... difficult.
15:38 <TheJulia> looking at the nordix log, did you by chance spot the super long locked introspection?
15:38 <dtantsur> what are the symptoms?
15:39 <TheJulia> the introspection that had a node locked for 1200+ seconds
15:39 <TheJulia> or, maybe it was like 1100+
15:39 <TheJulia> which caused the agent to get a little freaked out
15:39 <TheJulia> and it kept piling on lookups, eventually exiting due to failure and restarting
15:39 <dtantsur> 20 minutes Oo
15:42 <dtantsur> https://jenkins.nordix.org/job/metal3_bmo_main_integration_test_ubuntu/291/ is the link btw
15:43 <TheJulia> 2023-08-10 16:27:46.644 1 DEBUG ironic.conductor.task_manager [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] Node 07bec2af-6630-48c0-8b8e-e8b2bf084534 successfully reserved for node update (took 992.41 seconds) reserve_node [trim]
15:44 <TheJulia> that looks like it was just from a general update_node patch operation
15:44 <dtantsur> yeah
15:45 <dtantsur> this is weird, it seems like some operations succeed while the process is still retrying. how is it possible?
15:45 <TheJulia> bad timing maybe
15:46 <TheJulia> DEBUG ironic.conductor.task_manager [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] Node 07bec2af-6630-48c0-8b8e-e8b2bf084534 successfully reserved for node update (took 992.41 seconds) reserve_node /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:352
15:47 <TheJulia> err
15:47 <TheJulia> DEBUG ironic.conductor.task_manager [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] Node 07bec2af-6630-48c0-8b8e-e8b2bf084534 successfully reserved for node update (took 992.41 seconds) reserve_node /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:352
15:47 <dtantsur> yeah, we definitely need to reduce the maximum database retry period
15:47 <TheJulia> one of the agent logs basically suggests a node.reservation field was held by the "conductor" for quite some time
15:47 <dtantsur> OR it was retrying on the database being locked
15:48 <TheJulia> What if db read waiting time is also being hidden from us?
15:50 <TheJulia> hmm, look at the timing on this https://paste.openstack.org/show/bzcEG3Eo6hIF1VmboPDM/
15:51 <TheJulia> so
15:51 <TheJulia> that took time
15:51 <TheJulia> is tenacity running without logging
15:51 <TheJulia> capturing NodeLocked
15:52 <TheJulia> line 343 of task_manager.py
15:52 <TheJulia> so there is our silent read hold
15:53 <dtantsur> Not following. This can make things take time, but it won't lock the database?
15:53 <dtantsur> (unless my recent patch was actually onto something)
15:54 <TheJulia> the code retries every 1 second by default
15:54 <TheJulia> but it means the db could be read; something else was running against that particular node and held the lock
15:54 <TheJulia> the node lock, not the db lock
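
What is being described is the reservation retry loop in task_manager: NodeLocked is caught and retried roughly once a second, and without a before_sleep hook those retries (and the time spent waiting) leave no trace in the log. A rough sketch of that pattern with logging added, assuming tenacity; NodeLocked, reserve_node, and the in-memory _reservations dict are stand-ins for illustration, not the real ironic code paths:

    import logging
    import tenacity

    LOG = logging.getLogger(__name__)
    _reservations = {}  # stand-in for the node.reservation column

    class NodeLocked(Exception):
        """Another task already holds the node reservation."""

    @tenacity.retry(
        retry=tenacity.retry_if_exception_type(NodeLocked),
        wait=tenacity.wait_fixed(1),                                 # the ~1 second default pace
        stop=tenacity.stop_after_delay(300),                         # bound the silent waiting window
        before_sleep=tenacity.before_sleep_log(LOG, logging.DEBUG),  # make each retry visible
        reraise=True,
    )
    def reserve_node(node_uuid, conductor="conductor-1"):
        holder = _reservations.get(node_uuid)
        if holder and holder != conductor:
            raise NodeLocked(f"{node_uuid} is reserved by {holder}")
        _reservations[node_uuid] = conductor
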
15:56 <TheJulia> so the question I have right now is what was node 07bec2af-6630-48c0-8b8e-e8b2bf084534 doing before 2023-08-10 16:27:46.644
16:00 <dtantsur> starting inspection, apparently
16:00 * TheJulia wonders if we need additional internal locking w/r/t token generation
16:00 <TheJulia> then again, once things unwedged, it looks like we sifted through everything really fast
16:01 <dtantsur> I'm puzzled why retries did not help
16:02 * TheJulia hears rain and goes "wut"
16:02 <TheJulia> I'm trying to find the inspection; pity we don't see anything get released logging-wise, at least I'm not seeing it
16:08 <TheJulia> 2023-08-10 16:27:46.671 1 INFO eventlet.wsgi.server [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] 172.22.0.2,127.0.0.1 "PATCH /v1/nodes/07bec2af-6630-48c0-8b8e-e8b2bf084534 HTTP/1.1" status: 200  len: 0 time: 992.4749577 <-- I'm guessing from inspector or the operator?
16:11 <dtantsur> TheJulia: possibly inspector; it's a pity that we don't log the user-agent
16:12 <TheJulia> 2023-08-10 16:11:14.227 <-- starting there and walking the next 900 or so seconds in the log
16:12 <TheJulia> .... we should fix that I guess
16:12 <TheJulia> the user-agent logging, that is
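
Logging the User-Agent would make it obvious whether a long-running PATCH came from inspector, the operator, or something else. A rough sketch of one way it could be done, as a small WSGI middleware; this is illustrative only, not the eventlet.wsgi or ironic API access-log code:

    import logging

    LOG = logging.getLogger(__name__)

    class UserAgentLogger:
        """WSGI middleware that logs the client's User-Agent for each request."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            LOG.debug("%s %s user-agent=%s",
                      environ.get("REQUEST_METHOD"),
                      environ.get("PATH_INFO"),
                      environ.get("HTTP_USER_AGENT", "-"))
            return self.app(environ, start_response)
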
16:20 <TheJulia> hmmm
16:28 <TheJulia> so the maximum window drove it to hang out too long trying to run
16:29 <TheJulia> the update_node *must* have had a lock on the reservation from the task, but something got in the way
16:30 <TheJulia> oh wait
16:44 * dtantsur waits
16:45 <TheJulia> so the patch failed to run because NodeLocked was the state; the retrying write was not the patch operation itself, it was just the victim of the cascade, it looks like
16:46 <TheJulia> the very first "database is locked" is the change which removes the node.reservation
16:47 <TheJulia> which caused the node to go totally off the rails, and it didn't get cleaned up until the conductor explicitly failed it
16:47 <TheJulia> or it succeeded some huge long time later
16:47 <TheJulia> but its spans between retries were huge as a result
16:47 <TheJulia> ... we almost need an aggressive and a lazy sqlite retrier
16:49 <dtantsur> maybe we shouldn't retry some things this way. maybe we should convert "database locked" to NodeLocked and let the TaskManager do its job
16:49 <dtantsur> (which sometimes includes NOT retrying, like for lookups)
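
A minimal sketch of that translation idea, as a decorator around the DB-facing call; NodeLocked here is a stand-in for ironic's real exception class, and locked_db_as_node_locked is a made-up name:

    import functools
    from sqlalchemy.exc import OperationalError

    class NodeLocked(Exception):
        """Stand-in for the exception TaskManager already handles."""

    def locked_db_as_node_locked(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except OperationalError as exc:
                if "database is locked" in str(exc):
                    # Surface the SQLite contention as a node-level conflict so the
                    # caller decides whether to retry (task_manager) or give up
                    # immediately (e.g. agent lookups).
                    raise NodeLocked(str(exc)) from exc
                raise
        return wrapper
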
16:49 <TheJulia> maybe
16:49 <TheJulia> there is still something at play here I'm trying to wrap my head around
16:49 <TheJulia> ... I started making my own log
16:49 <TheJulia> picking out individual pieces and annotating them
16:54 -opendevstatus- NOTICE: Zuul job execution is temporarily paused while we rearrange local storage on the servers
16:55 * dtantsur has asked the other metal3 folks to send us each case of the database locked error
16:55 <dtantsur> maybe if we collect some statistics...
16:55 <TheJulia> ++
16:56 <TheJulia> yeah, something doesn't seem right, but with the logging and the CI run, it is a little... bit of a task to put together what exactly is happening
16:57 <TheJulia> but for closing out an exclusive lock, we just can't rely upon the decorator as is
16:57 <TheJulia> otherwise we can logjam the entire workflow
16:59 <TheJulia> I suspect the first failure was more of a collision, and we still sort of see that even in OpenDev CI, but we've never really had things go sideways on the release of the node reservation
16:59 <TheJulia> .... but this is just one node; I guess the next question I need to look at is what happened to the other node
17:10 <dtantsur> I definitely see two series of retries
17:12 <TheJulia> yeah
17:12 <TheJulia> so, ... I'm not sure node c25e ever released its exclusive lock for the inspect_hardware call
17:13 <TheJulia> sort of like it went on a vacation :\
17:14 <dtantsur> ...as all of us would love to
17:16 <TheJulia> dtantsur: does the test job have any selection/variable w/r/t inspection?
17:16 <dtantsur> TheJulia: I'm not sure; it's possible to disable inspection on the node level, but I doubt we do
17:22 <TheJulia> ... I think we lost a thread
17:22 <TheJulia> well, a greenthread
17:32 <opendevreview> Jakub Jelinek proposed openstack/ironic master: Introduce default kernel/ramdisks by arch  https://review.opendev.org/c/openstack/ironic/+/890819
17:38 <TheJulia> 2023-08-10 16:27:54.590 1 DEBUG ironic.conductor.task_manager [None req-de13bb62-5543-402b-92c3-112facf44ecf - - - - - -] Successfully released exclusive lock for hardware inspection on node c25e451b-d2fb-4168-b690-f15bc8365520 (lock was held 1139.54 sec) release_resources /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:448 <-- didn't lose it.... just never exited for ages
17:43 -opendevstatus- NOTICE: Zuul job execution has resumed with additional disk space on the servers
17:52 <TheJulia> 2023-08-10 16:29:03.707 1 ERROR ironic.drivers.modules.inspector.interface [None req-87a049b8-fb1d-42c9-b01d-7023e836c5f6 - - - - - -] Inspection failed for node c25e451b-d2fb-4168-b690-f15bc8365520 with error: Unexpected exception ConflictException during processing: ConflictException: 409: Client Error for url: https://172.22.0.2:6385/v1/nodes/c25e451b-d2fb-4168-b690-f15bc8365520, Cannot update node "c25e451b-d2fb-4168-b690-f15bc8365520" while it is in state "inspecting".
18:39 <TheJulia> okay, yeah, update_node *has* to be very aggressive along with release_resources; we can't just tank the entire interaction with NodeLocked there, because we already have an exclusive lock to get to that point. it is actually the release where things are going sideways, it seems
18:39 <JayF> Are you implying the DB is locked while trying to unlock the node?
18:39 <JayF> that's the edge?
18:45 <TheJulia> well, basically the unlock is what never completed, which generated a lot of NodeLocked errors
18:45 <TheJulia> well, it completed, eventually
18:46 <TheJulia> we need to be sort of aggressive on that
18:46 <JayF> yeah I can see it, you could have two things fighting for the DB's attention
18:46 <JayF> where one of them needs the node unlocked but keeps the update to unlock it out (?)
18:47 <TheJulia> yeah
18:47 <TheJulia> and read locks get concealed by NodeLocked too
19:09 <opendevreview> Julia Kreger proposed openstack/ironic master: Retool sqlite retries  https://review.opendev.org/c/openstack/ironic/+/891333
19:10 <TheJulia> we will want to check the logging for the metal3 job on that
19:10 <TheJulia> I'm suspecting we need to add a second decorator to act as a "this must not fail" sort of retry
19:10 <TheJulia> or just have a second decorator with slightly different logic
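
A sketch of what those two policies might look like side by side, assuming tenacity; the wait/stop numbers are placeholders, not the values in the retool patch:

    import tenacity
    from sqlalchemy.exc import OperationalError

    def _locked(exc):
        return isinstance(exc, OperationalError) and "database is locked" in str(exc)

    # Lazy policy for ordinary writes: back off politely and give up fairly
    # quickly, letting the caller surface the failure.
    lazy_sqlite_retry = tenacity.retry(
        retry=tenacity.retry_if_exception(_locked),
        wait=tenacity.wait_random_exponential(multiplier=0.25, max=5),
        stop=tenacity.stop_after_delay(30),
        reraise=True,
    )

    # Aggressive "must not fail" policy for releasing a reservation: short,
    # fixed waits and a long deadline, because leaving node.reservation set
    # logjams everything queued behind it.
    aggressive_sqlite_retry = tenacity.retry(
        retry=tenacity.retry_if_exception(_locked),
        wait=tenacity.wait_fixed(0.1),
        stop=tenacity.stop_after_delay(120),
        reraise=True,
    )

The aggressive variant would cover the release/update_node path discussed above, with the lazy one left on everything else.
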
21:02 <TheJulia> so the retry retool has no errors, I guess we ought to retry it a few times just to make sure we're happy with it
