*** awb_ is now known as awb | 01:04 | |
kubajj | Good morning Ironic o/ | 08:06 |
iurygregory | good morning Ironic | 11:39 |
kubajj | morning iurygregory o/ | 11:51 |
iurygregory | hey hey kubajj o/ | 12:01 |
TheJulia | good morning | 13:15 |
iurygregory | good morning TheJulia | 14:05 |
TheJulia | oh tenacity, why are you stuck at 6.3.1 | 14:21 |
opendevreview | Julia Kreger proposed openstack/ironic master: Slow down the sqlite retry https://review.opendev.org/c/openstack/ironic/+/891333 | 14:28 |
opendevreview | Julia Kreger proposed openstack/ironic master: Log upon completion of power sync https://review.opendev.org/c/openstack/ironic/+/891334 | 14:28 |
opendevreview | Julia Kreger proposed openstack/ironic master: Don't yield on power sync at the end of the work https://review.opendev.org/c/openstack/ironic/+/891335 | 14:28 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: [DNM] Test how much breaks without query.all() in the SQL API https://review.opendev.org/c/openstack/ironic/+/891505 | 14:41 |
TheJulia | oh tenacity, why are you stuck at 6.3.1 | 14:42 |
TheJulia | err, already sent that | 14:42 |
TheJulia | dtantsur: driver templates and node history, basically | 14:42 |
dtantsur | I'd like to be 100% sure :) | 14:42 |
TheJulia | there is a port lookup with shards test that triggers it, which I don't understand *why* it heads down that route :\ | 14:42 |
dtantsur | TheJulia: I wonder if the update part of register_conductor is safe | 15:01 |
TheJulia | got a link to what you're looking at? | 15:02
dtantsur | TheJulia: https://opendev.org/openstack/ironic/src/branch/master/ironic/db/sqlalchemy/api.py#L1348-L1352 | 15:03 |
* dtantsur just scrolling through the file... | 15:03 | |
TheJulia | hmm, that might not be safe... | 15:06 |
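A minimal sketch of the kind of "read then update in one transaction" pattern being questioned above, assuming SQLAlchemy 1.4+. This is not the actual register_conductor code; the model, session handling, and arguments are illustrative only.

```python
# Illustrative sketch, NOT ironic's register_conductor: keep the
# "reactivate an existing conductor row, else insert one" logic inside a
# single transaction so the read and the write cannot interleave with
# another writer doing the same thing.
from sqlalchemy import select

def register_conductor(session_factory, Conductor, hostname, drivers):
    with session_factory() as session, session.begin():
        row = session.execute(
            # with_for_update() is a no-op on SQLite but takes a row lock
            # on MySQL/PostgreSQL
            select(Conductor).where(Conductor.hostname == hostname)
            .with_for_update()
        ).scalar_one_or_none()
        if row is None:
            row = Conductor(hostname=hostname, drivers=drivers, online=True)
            session.add(row)
        else:
            row.drivers = drivers
            row.online = True
        return row   # commit happens when session.begin() exits
```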
TheJulia | dtantsur: is it just me, or does the base req id logging seem to behave differently in single process mode | 15:36 |
dtantsur | Possibly? Although context should be passed.. | 15:37 |
TheJulia | It seems like the req-id just never changes if it is an internal thing | 15:37 |
TheJulia | makes grepping for things... difficult. | 15:37 |
TheJulia | looking at the nordix log, did you by chance spot the super long locked introspection ? | 15:38 |
dtantsur | what are the symptoms? | 15:38
TheJulia | the introspection that had a node locked for 1200+ seconds | 15:39 |
TheJulia | or, maybe it was like 1100+ | 15:39 |
TheJulia | which caused the agent to get a little freaked out | 15:39 |
TheJulia | and it kept piling on lookups, eventually exiting due to failure and restarting | 15:39
dtantsur | 20 minutes Oo | 15:39 |
dtantsur | https://jenkins.nordix.org/job/metal3_bmo_main_integration_test_ubuntu/291/ is the link btw | 15:42 |
TheJulia | 2023-08-10 16:27:46.644 1 DEBUG ironic.conductor.task_manager [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] Node 07bec2af-6630-48c0-8b8e-e8b2bf084534 successfully reserved for node update (took 992.41 seconds) reserve_node [trim] | 15:43 |
TheJulia | that looks like it was just from a general update_node patch operation | 15:44 |
dtantsur | yeah | 15:44 |
dtantsur | this is weird, it seems like some operations succeed while the process is still retrying. how is it possible? | 15:45 |
TheJulia | bad timing maybe | 15:45 |
TheJulia | DEBUG ironic.conductor.task_manager [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] Node 07bec2af-6630-48c0-8b8e-e8b2bf084534 successfully reserved for node update (took 992.41 seconds) reserve_node /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:352 | 15:46 |
TheJulia | err | 15:47 |
TheJulia | DEBUG ironic.conductor.task_manager [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] Node 07bec2af-6630-48c0-8b8e-e8b2bf084534 successfully reserved for node update (took 992.41 seconds) reserve_node /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:352 | 15:47 |
dtantsur | yeah, we definitely need to reduce the maximum database retry period | 15:47 |
TheJulia | one of the agent logs basically suggests a node.reservation field was held by the "conductor" for quite some time | 15:47 |
dtantsur | OR it was retrying on the database locked | 15:47 |
TheJulia | What if db read waiting time is also being hidden from us? | 15:48 |
TheJulia | hmm, look at the timing on this https://paste.openstack.org/show/bzcEG3Eo6hIF1VmboPDM/ | 15:50 |
TheJulia | so | 15:51 |
TheJulia | that took time | 15:51 |
TheJulia | is tenacity running without logging | 15:51 |
TheJulia | capturing NodeLocked | 15:51 |
TheJulia | line 343 of task_manager.py | 15:52 |
TheJulia | so there is our silent read hold | 15:52 |
dtantsur | Not following. This can make things take time, but it won't lock the database? | 15:53 |
dtantsur | (unless my recent patch was actually onto something) | 15:53 |
TheJulia | the code retries every 1 second by default | 15:54 |
TheJulia | but it means the db could be read; something else was running against that particular node and held the lock | 15:54
TheJulia | the node lock, not the db lock | 15:54 |
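For context on the retry behaviour being discussed here: a sketch of a tenacity-based NodeLocked retry with the retries made visible via before_sleep_log. The one-second wait matches what is described above, but the attempt limit, function names, and exception class are illustrative, not ironic's actual task_manager code.

```python
# Sketch: retry NodeLocked roughly once per second, but log each retry so
# the "silent read hold" becomes visible in the conductor log.
import logging
import tenacity

LOG = logging.getLogger(__name__)

class NodeLocked(Exception):
    """Stand-in for ironic.common.exception.NodeLocked."""

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(NodeLocked),
    wait=tenacity.wait_fixed(1),                # retry every second
    stop=tenacity.stop_after_attempt(60),       # then give up
    before_sleep=tenacity.before_sleep_log(LOG, logging.DEBUG),
    reraise=True)
def reserve_node(dbapi, tag, node_id):
    # dbapi.reserve_node raises NodeLocked while another task holds
    # node.reservation; without before_sleep_log these retries are silent.
    return dbapi.reserve_node(tag, node_id)
```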
TheJulia | so the question I have right now is what was node 07bec2af-6630-48c0-8b8e-e8b2bf084534 doing before 2023-08-10 16:27:46.644 | 15:56 |
dtantsur | starting inspection, apparently | 16:00 |
* TheJulia wonders if we need additional internal locking w/r/t token generation | 16:00 | |
TheJulia | then again, once things unwedged, it looks like we sifted through everything really fast | 16:00 |
dtantsur | I'm puzzled why retries did not help | 16:01 |
* TheJulia hears rain and goes "wut" | 16:02 | |
TheJulia | I'm trying to find the inspection, pity we don't see anything get released logging wise, at least I'm not seeing it | 16:02 |
TheJulia | 2023-08-10 16:27:46.671 1 INFO eventlet.wsgi.server [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] 172.22.0.2,127.0.0.1 "PATCH /v1/nodes/07bec2af-6630-48c0-8b8e-e8b2bf084534 HTTP/1.1" status: 200 len: 0 time: 992.4749577 <-- I'm guessing from inspector or the operator? | 16:08 |
dtantsur | TheJulia: possibly inspector; it's a pity that we don't log user-agent | 16:11
TheJulia | 2023-08-10 16:11:14.227 <-- starting at and walking the next 900 or so seconds in the log | 16:12 |
TheJulia | .... we should fix that I guess | 16:12 |
TheJulia | user-agent that is | 16:12 |
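A hypothetical sketch of the user-agent logging being wished for here: a generic WSGI middleware that records the User-Agent header, so a long-running PATCH like the one above could be attributed to inspector vs. an operator client. This is not an existing ironic feature; all names are illustrative.

```python
# Hypothetical WSGI middleware sketch for logging the User-Agent header.
import logging

LOG = logging.getLogger(__name__)

class UserAgentLoggingMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        LOG.debug('%(method)s %(path)s user-agent=%(agent)s',
                  {'method': environ.get('REQUEST_METHOD'),
                   'path': environ.get('PATH_INFO'),
                   'agent': environ.get('HTTP_USER_AGENT', '-')})
        return self.app(environ, start_response)
```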
TheJulia | hmmm | 16:20 |
TheJulia | so the maximum retry window drove it to hang around too long trying to run | 16:28
TheJulia | the update_node *must* have had a lock on the reservation from the task, but something got in the way | 16:29 |
TheJulia | oh wait | 16:30
* dtantsur waits | 16:44 | |
TheJulia | so the patch failed to run because NodeLocked was the state, the retry write was not the patch operation itself, it was just the victim of the cascade it looks like | 16:45 |
TheJulia | the very first "database is locked" error is on the change which removes the node.reservation | 16:46
TheJulia | which caused the node to go totally off the rails, and it didn't get cleaned up until the conductor explicitly failed it | 16:47
TheJulia | or it succeeded some huge long time later | 16:47 |
TheJulia | but its spans between retries were huge as a result of it | 16:47
TheJulia | ... we almost need an aggressive and a lazy sqlite retrier | 16:47
dtantsur | maybe we shouldn't retry some things this way. maybe we should convert "database locked" to NodeLocked and let the TaskManager do its job | 16:49 |
dtantsur | (which sometimes includes NOT retrying, like for lookups) | 16:49 |
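A sketch of the idea dtantsur describes, under the assumption that SQLite lock contention surfaces as a SQLAlchemy OperationalError containing "database is locked". The exception class and decorator name are illustrative, not ironic's.

```python
# Sketch: translate backend lock contention into NodeLocked instead of
# retrying blindly inside the DB API, so the TaskManager decides whether
# to retry at all (it deliberately does not retry for things like lookups).
import functools
from sqlalchemy import exc as sa_exc

class NodeLocked(Exception):
    """Stand-in for ironic.common.exception.NodeLocked."""

def dblock_as_nodelocked(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except sa_exc.OperationalError as e:
            if 'database is locked' in str(e):
                raise NodeLocked('backend lock contention') from e
            raise
    return wrapper
```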
TheJulia | maybe | 16:49 |
TheJulia | there is still something at play here I'm trying to wrap my head around | 16:49 |
TheJulia | ... I started making my own log | 16:49 |
TheJulia | picking out individual pieces and annotating them | 16:49 |
-opendevstatus- NOTICE: Zuul job execution is temporarily paused while we rearrange local storage on the servers | 16:54 | |
* dtantsur has asked the other metal3 folks to send us each case of the database locked error | 16:55 | |
dtantsur | maybe if we collect some statistics... | 16:55 |
TheJulia | ++ | 16:55 |
TheJulia | yeah, something doesn't seem right, but with the logging and the CI run, it is a little... bit of a task to put together what exactly is happening | 16:56 |
TheJulia | but for closing out an exclusive lock, we just can't rely upon the decorator as is | 16:57 |
TheJulia | otherwise we can logjam the entire workflow | 16:57
TheJulia | I suspect the first failure was more a collision, and we still sort of see that even in OpenDev CI, but we've never really had things go sideways on the node release of the reservation | 16:59
TheJulia | .... but this is just one node, I guess the next question I need to look at is what happened to the other node | 16:59 |
dtantsur | I definitely see two series of retries | 17:10 |
TheJulia | yeah | 17:12 |
TheJulia | so, ... I'm not sure node c25e ever released its exclusive lock for the inspect_hardware call | 17:12
TheJulia | sort of like it went on a vacation :\ | 17:13 |
dtantsur | ...as all of us would love to | 17:14 |
TheJulia | dtantsur: does the test job have any selection/variable w/r/t inspection? | 17:16 |
dtantsur | TheJulia: I'm not sure; it's possible to disable inspection on the node level, I doubt we do though | 17:16 |
TheJulia | ... I think we lost a thread | 17:22 |
TheJulia | well, greenthread | 17:22 |
opendevreview | Jakub Jelinek proposed openstack/ironic master: Introduce default kernel/ramdisks by arch https://review.opendev.org/c/openstack/ironic/+/890819 | 17:32 |
TheJulia | 2023-08-10 16:27:54.590 1 DEBUG ironic.conductor.task_manager [None req-de13bb62-5543-402b-92c3-112facf44ecf - - - - - -] Successfully released exclusive lock for hardware inspection on node c25e451b-d2fb-4168-b690-f15bc8365520 (lock was held 1139.54 sec) release_resources /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:448 <-- didn't lose it.... just never exited for ages | 17:38
-opendevstatus- NOTICE: Zuul job execution has resumed with additional disk space on the servers | 17:43 | |
TheJulia | 2023-08-10 16:29:03.707 1 ERROR ironic.drivers.modules.inspector.interface [None req-87a049b8-fb1d-42c9-b01d-7023e836c5f6 - - - - - -] Inspection failed for node c25e451b-d2fb-4168-b690-f15bc8365520 with error: Unexpected exception ConflictException during processing: ConflictException: 409: Client Error for url: https://172.22.0.2:6385/v1/nodes/c25e451b-d2fb-4168-b690-f15bc8365520, Cannot update node | 17:52 |
TheJulia | "c25e451b-d2fb-4168-b690-f15bc8365520" while it is in state "inspecting".ESC[00m | 17:52 |
TheJulia | okay, yeah, update_node *has* to be very aggressive along with release_resources; we can't just tank the entire interaction with NodeLocked there because we already have an exclusive lock to get to that point. it is actually the release where things are going sideways, it seems | 18:39
JayF | Are you implying the DB is locked while trying to unlock the node? | 18:39 |
JayF | that's the edge? | 18:39 |
TheJulia | well, basically the unlock is what never completed which generated a lot of NodeLocked errors | 18:45 |
TheJulia | well, completed, eventually | 18:45 |
TheJulia | we need to be sort of aggressive on that | 18:46 |
JayF | yeah I can see, you could have two things fighting for the DB's attention | 18:46 |
JayF | where one of them needs the node unlocked but keeps the update that would unlock it from getting through (?) | 18:46
TheJulia | yeah | 18:46 |
TheJulia | and read locks get concealed by NodeLocked too | 18:47 |
opendevreview | Julia Kreger proposed openstack/ironic master: Retool sqlite retries https://review.opendev.org/c/openstack/ironic/+/891333 | 19:09 |
TheJulia | we will want to check the logging for the metal3 job on that | 19:10 |
TheJulia | I'm suspecting we need to add a second decorator to act as a "this must not fail" sort of retry | 19:10 |
TheJulia | or just have a second decorator with the slightly different logic | 19:10 |
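A sketch of the "aggressive and lazy sqlite retrier" / second-decorator idea: a lazy retrier for ordinary operations that fails fast, and an aggressive "this must not fail" retrier for writes such as clearing node.reservation. All names and parameters are illustrative, not tuned or existing ironic values.

```python
# Sketch: two retry policies for SQLite "database is locked" contention.
import tenacity
from sqlalchemy import exc as sa_exc

def _is_sqlite_locked(exc):
    return (isinstance(exc, sa_exc.OperationalError)
            and 'database is locked' in str(exc))

def _sqlite_retry(max_seconds, wait):
    return tenacity.retry(
        retry=tenacity.retry_if_exception(_is_sqlite_locked),
        wait=tenacity.wait_fixed(wait),
        stop=tenacity.stop_after_delay(max_seconds),
        reraise=True)

# Ordinary reads/writes: give up quickly and let NodeLocked handling run.
lazy_sqlite_retry = _sqlite_retry(max_seconds=5, wait=0.25)

# Lock release and similar cleanup: keep trying much longer, since a stale
# node.reservation logjams every other operation against that node.
aggressive_sqlite_retry = _sqlite_retry(max_seconds=60, wait=0.5)
```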
TheJulia | so the retry retool has no errors, I guess we ought to retry it a few times just to make sure we're happy with it | 21:02 |