*** awb_ is now known as awb | 01:04 | |
kubajj | Good morning Ironic o/ | 08:06 |
iurygregory | good morning Ironic | 11:39 |
kubajj | morning iurygregory o/ | 11:51 |
iurygregory | hey hey kubajj o/ | 12:01 |
TheJulia | good morning | 13:15 |
iurygregory | good morning TheJulia | 14:05 |
TheJulia | oh tenacity, why are you stuck at 6.3.1 | 14:21 |
opendevreview | Julia Kreger proposed openstack/ironic master: Slow down the sqlite retry https://review.opendev.org/c/openstack/ironic/+/891333 | 14:28 |
opendevreview | Julia Kreger proposed openstack/ironic master: Log upon completion of power sync https://review.opendev.org/c/openstack/ironic/+/891334 | 14:28 |
opendevreview | Julia Kreger proposed openstack/ironic master: Don't yield on power sync at the end of the work https://review.opendev.org/c/openstack/ironic/+/891335 | 14:28 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: [DNM] Test how much breaks without query.all() in the SQL API https://review.opendev.org/c/openstack/ironic/+/891505 | 14:41 |
TheJulia | oh tenacity, why are you stuck at 6.3.1 | 14:42 |
TheJulia | err, already sent that | 14:42 |
TheJulia | dtantsur: driver templates and node history, basically | 14:42 |
dtantsur | I'd like to be 100% sure :) | 14:42 |
TheJulia | there is a port lookup with shards test that triggers it, which I don't understand *why* it heads down that route :\ | 14:42 |
dtantsur | TheJulia: I wonder if the update part of register_conductor is safe | 15:01 |
TheJulia | got a link to what you're looking at? | 15:02
dtantsur | TheJulia: https://opendev.org/openstack/ironic/src/branch/master/ironic/db/sqlalchemy/api.py#L1348-L1352 | 15:03 |
* dtantsur just scrolling through the file... | 15:03 | |
TheJulia | hmm, that might not be safe... | 15:06 |
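A minimal sketch of the kind of "read then update in one transaction" pattern being questioned above, assuming SQLAlchemy 1.4+. This is not the actual register_conductor code; the model, session handling, and arguments are illustrative only.

```python
# Illustrative sketch, NOT ironic's register_conductor: keep the
# "reactivate an existing conductor row, else insert one" logic inside a
# single transaction so the read and the write cannot interleave with
# another writer doing the same thing.
from sqlalchemy import select

def register_conductor(session_factory, Conductor, hostname, drivers):
    with session_factory() as session, session.begin():
        row = session.execute(
            # with_for_update() is a no-op on SQLite but takes a row lock
            # on MySQL/PostgreSQL
            select(Conductor).where(Conductor.hostname == hostname)
            .with_for_update()
        ).scalar_one_or_none()
        if row is None:
            row = Conductor(hostname=hostname, drivers=drivers, online=True)
            session.add(row)
        else:
            row.drivers = drivers
            row.online = True
        return row   # commit happens when session.begin() exits
```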
TheJulia | dtantsur: is it just me, or does the base req id logging seem to behave differently in single process mode | 15:36 |
dtantsur | Possibly? Although context should be passed.. | 15:37 |
TheJulia | It seems like the req-id just never changes if it is an internal thing | 15:37 |
TheJulia | makes grepping for things... difficult. | 15:37 |
TheJulia | looking at the nordix log, did you by chance spot the super long locked introspection ? | 15:38 |
dtantsur | what are the symptoms? | 15:38
TheJulia | the introspection that had a node locked for 1200+ seconds | 15:39 |
TheJulia | or, maybe it was like 1100+ | 15:39 |
TheJulia | which caused the agent to get a little freaked out | 15:39 |
TheJulia | and it kept piling on lookups, eventually exiting due to failure and restarting | 15:39
dtantsur | 20 minutes Oo | 15:39 |
dtantsur | https://jenkins.nordix.org/job/metal3_bmo_main_integration_test_ubuntu/291/ is the link btw | 15:42 |
TheJulia | 2023-08-10 16:27:46.644 1 DEBUG ironic.conductor.task_manager [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] Node 07bec2af-6630-48c0-8b8e-e8b2bf084534 successfully reserved for node update (took 992.41 seconds) reserve_node [trim] | 15:43 |
TheJulia | that looks like it was just from a general update_node patch operation | 15:44 |
dtantsur | yeah | 15:44 |
dtantsur | this is weird, it seems like some operations succeed while the process is still retrying. how is it possible? | 15:45 |
TheJulia | bad timing maybe | 15:45 |
TheJulia | DEBUG ironic.conductor.task_manager [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] Node 07bec2af-6630-48c0-8b8e-e8b2bf084534 successfully reserved for node update (took 992.41 seconds) reserve_node /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:352 | 15:46 |
TheJulia | err | 15:47 |
TheJulia | DEBUG ironic.conductor.task_manager [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] Node 07bec2af-6630-48c0-8b8e-e8b2bf084534 successfully reserved for node update (took 992.41 seconds) reserve_node /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:352 | 15:47 |
dtantsur | yeah, we definitely need to reduce the maximum database retry period | 15:47 |
TheJulia | one of the agent logs basically suggests a node.reservation field was held by the "conductor" for quite some time | 15:47 |
dtantsur | OR it was retrying on the database locked | 15:47 |
TheJulia | What if db read waiting time is also being hidden from us? | 15:48 |
TheJulia | hmm, look at the timing on this https://paste.openstack.org/show/bzcEG3Eo6hIF1VmboPDM/ | 15:50 |
TheJulia | so | 15:51 |
TheJulia | that took time | 15:51 |
TheJulia | is tenacity running without logging | 15:51 |
TheJulia | capturing NodeLocked | 15:51 |
TheJulia | line 343 of task_manager.py | 15:52 |
TheJulia | so there is our silent read hold | 15:52 |
dtantsur | Not following. This can make things take time, but it won't lock the database? | 15:53 |
dtantsur | (unless my recent patch was actually onto something) | 15:53 |
TheJulia | the code retries every 1 second by default | 15:54 |
TheJulia | but it means the db could be read; something else was running against that particular node and held the lock | 15:54
TheJulia | the node lock, not the db lock | 15:54 |
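For context on the retry behaviour being discussed here: a sketch of a tenacity-based NodeLocked retry with the retries made visible via before_sleep_log. The one-second wait matches what is described above, but the attempt limit, function names, and exception class are illustrative, not ironic's actual task_manager code.

```python
# Sketch: retry NodeLocked roughly once per second, but log each retry so
# the "silent read hold" becomes visible in the conductor log.
import logging
import tenacity

LOG = logging.getLogger(__name__)

class NodeLocked(Exception):
    """Stand-in for ironic.common.exception.NodeLocked."""

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(NodeLocked),
    wait=tenacity.wait_fixed(1),                # retry every second
    stop=tenacity.stop_after_attempt(60),       # then give up
    before_sleep=tenacity.before_sleep_log(LOG, logging.DEBUG),
    reraise=True)
def reserve_node(dbapi, tag, node_id):
    # dbapi.reserve_node raises NodeLocked while another task holds
    # node.reservation; without before_sleep_log these retries are silent.
    return dbapi.reserve_node(tag, node_id)
```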
TheJulia | so the question I have right now is what was node 07bec2af-6630-48c0-8b8e-e8b2bf084534 doing before 2023-08-10 16:27:46.644 | 15:56 |
dtantsur | starting inspection, apparently | 16:00 |
* TheJulia wonders if we need additional internal locking w/r/t token generation | 16:00 | |
TheJulia | then again, once things unwedged, it looks like we sifted through everything really fast | 16:00 |
dtantsur | I'm puzzled why retries did not help | 16:01 |
* TheJulia hears rain and goes "wut" | 16:02 | |
TheJulia | I'm trying to find the inspection, pity we don't see anything get released logging wise, at least I'm not seeing it | 16:02 |
TheJulia | 2023-08-10 16:27:46.671 1 INFO eventlet.wsgi.server [None req-62650c91-4bc6-4306-b45d-3cd2aa71877a - - - - - -] 172.22.0.2,127.0.0.1 "PATCH /v1/nodes/07bec2af-6630-48c0-8b8e-e8b2bf084534 HTTP/1.1" status: 200 len: 0 time: 992.4749577 <-- I'm guessing from inspector or the operator? | 16:08 |
dtantsur | TheJulia: possibly inspector; it's a pity that we don't log user-agent | 16:11
TheJulia | 2023-08-10 16:11:14.227 <-- starting at and walking the next 900 or so seconds in the log | 16:12 |
TheJulia | .... we should fix that I guess | 16:12 |
TheJulia | user-agent that is | 16:12 |
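A hypothetical sketch of the user-agent logging being wished for here: a generic WSGI middleware that records the User-Agent header, so a long-running PATCH like the one above could be attributed to inspector vs. an operator client. This is not an existing ironic feature; all names are illustrative.

```python
# Hypothetical WSGI middleware sketch for logging the User-Agent header.
import logging

LOG = logging.getLogger(__name__)

class UserAgentLoggingMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        LOG.debug('%(method)s %(path)s user-agent=%(agent)s',
                  {'method': environ.get('REQUEST_METHOD'),
                   'path': environ.get('PATH_INFO'),
                   'agent': environ.get('HTTP_USER_AGENT', '-')})
        return self.app(environ, start_response)
```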
TheJulia | hmmm | 16:20 |
TheJulia | so the maximum retry window drove it to hang around too long trying to run | 16:28
TheJulia | the update_node *must* have had a lock on the reservation from the task, but something got in the way | 16:29 |
TheJulia | oh wait | 16:30
* dtantsur waits | 16:44 | |
TheJulia | so the patch failed to run because NodeLocked was the state, the retry write was not the patch operation itself, it was just the victim of the cascade it looks like | 16:45 |
TheJulia | the very first "database is locked" error is on the change which removes the node.reservation | 16:46
TheJulia | which caused the node to go totally off the rails, and it didn't get cleaned up until the conductor explicitly failed it | 16:47
TheJulia | or it succeeded some huge long time later | 16:47 |
TheJulia | but its spans between retries were huge as a result of it | 16:47
TheJulia | ... we almost need an aggressive and a lazy sqlite retrier | 16:47
dtantsur | maybe we shouldn't retry some things this way. maybe we should convert "database locked" to NodeLocked and let the TaskManager do its job | 16:49 |
dtantsur | (which sometimes includes NOT retrying, like for lookups) | 16:49 |
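A sketch of the idea dtantsur describes, under the assumption that SQLite lock contention surfaces as a SQLAlchemy OperationalError containing "database is locked". The exception class and decorator name are illustrative, not ironic's.

```python
# Sketch: translate backend lock contention into NodeLocked instead of
# retrying blindly inside the DB API, so the TaskManager decides whether
# to retry at all (it deliberately does not retry for things like lookups).
import functools
from sqlalchemy import exc as sa_exc

class NodeLocked(Exception):
    """Stand-in for ironic.common.exception.NodeLocked."""

def dblock_as_nodelocked(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except sa_exc.OperationalError as e:
            if 'database is locked' in str(e):
                raise NodeLocked('backend lock contention') from e
            raise
    return wrapper
```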
TheJulia | maybe | 16:49 |
TheJulia | there is still something at play here I'm trying to wrap my head around | 16:49 |
TheJulia | ... I started making my own log | 16:49 |
TheJulia | picking out individual pieces and annotating them | 16:49 |
-opendevstatus- NOTICE: Zuul job execution is temporarily paused while we rearrange local storage on the servers | 16:54 | |
* dtantsur has asked the other metal3 folks to send us each case of the database locked error | 16:55 | |
dtantsur | maybe if we collect some statistics... | 16:55 |
TheJulia | ++ | 16:55 |
TheJulia | yeah, something doesn't seem right, but with the logging and the CI run, it is a little... bit of a task to put together what exactly is happening | 16:56 |
TheJulia | but for closing out an exclusive lock, we just can't rely upon the decorator as is | 16:57 |
TheJulia | otherwise we can logjam the entire workflow | 16:57
TheJulia | I suspect the first failure was more a collision, and we still sort of see that even in OpenDev CI, but we've never really had things go sideways on the node release of the reservation | 16:59
TheJulia | .... but this is just one node, I guess the next question I need to look at is what happened to the other node | 16:59 |
dtantsur | I definitely see two series of retries | 17:10 |
TheJulia | yeah | 17:12 |
TheJulia | so, ... I'm not sure node c25e ever released its exclusive lock for the inspect_hardware call | 17:12
TheJulia | sort of like it went on a vacation :\ | 17:13 |
dtantsur | ...as all of us would love to | 17:14 |
TheJulia | dtantsur: does the test job have any selection/variable w/r/t inspection? | 17:16 |
dtantsur | TheJulia: I'm not sure; it's possible to disable inspection on the node level, I doubt we do though | 17:16 |
TheJulia | ... I think we lost a thread | 17:22 |
TheJulia | well, greenthread | 17:22 |
opendevreview | Jakub Jelinek proposed openstack/ironic master: Introduce default kernel/ramdisks by arch https://review.opendev.org/c/openstack/ironic/+/890819 | 17:32 |
TheJulia | 2023-08-10 16:27:54.590 1 DEBUG ironic.conductor.task_manager [None req-de13bb62-5543-402b-92c3-112facf44ecf - - - - - -] Successfully released exclusive lock for hardware inspection on node c25e451b-d2fb-4168-b690-f15bc8365520 (lock was held 1139.54 sec) release_resources /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:448 <-- didn't lose it.... just never exited for ages | 17:38
-opendevstatus- NOTICE: Zuul job execution has resumed with additional disk space on the servers | 17:43 | |
TheJulia | 2023-08-10 16:29:03.707 1 ERROR ironic.drivers.modules.inspector.interface [None req-87a049b8-fb1d-42c9-b01d-7023e836c5f6 - - - - - -] Inspection failed for node c25e451b-d2fb-4168-b690-f15bc8365520 with error: Unexpected exception ConflictException during processing: ConflictException: 409: Client Error for url: https://172.22.0.2:6385/v1/nodes/c25e451b-d2fb-4168-b690-f15bc8365520, Cannot update node | 17:52 |
TheJulia | "c25e451b-d2fb-4168-b690-f15bc8365520" while it is in state "inspecting".ESC[00m | 17:52 |
TheJulia | okay, yeah, update_node *has* to be very aggressive along with release_resources; we can't just tank the entire interaction with NodeLocked there because we already have an exclusive lock to get to that point. it is actually the release where things are going sideways, it seems | 18:39
JayF | Are you implying the DB is locked while trying to unlock the node? | 18:39 |
JayF | that's the edge? | 18:39 |
TheJulia | well, basically the unlock is what never completed which generated a lot of NodeLocked errors | 18:45 |
TheJulia | well, completed, eventually | 18:45 |
TheJulia | we need to be sort of aggressive on that | 18:46 |
JayF | yeah I can see, you could have two things fighting for the DB's attention | 18:46 |
JayF | where one of them needs the node unlocked but keeps the update that would unlock it from getting through (?) | 18:46
TheJulia | yeah | 18:46 |
TheJulia | and read locks get concealed by NodeLocked too | 18:47 |
opendevreview | Julia Kreger proposed openstack/ironic master: Retool sqlite retries https://review.opendev.org/c/openstack/ironic/+/891333 | 19:09 |
TheJulia | we will want to check the logging for the metal3 job on that | 19:10 |
TheJulia | I'm suspecting we need to add a second decorator to act as a "this must not fail" sort of retry | 19:10 |
TheJulia | or just have a second decorator with the slightly different logic | 19:10 |
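A sketch of the "aggressive and lazy sqlite retrier" / second-decorator idea: a lazy retrier for ordinary operations that fails fast, and an aggressive "this must not fail" retrier for writes such as clearing node.reservation. All names and parameters are illustrative, not tuned or existing ironic values.

```python
# Sketch: two retry policies for SQLite "database is locked" contention.
import tenacity
from sqlalchemy import exc as sa_exc

def _is_sqlite_locked(exc):
    return (isinstance(exc, sa_exc.OperationalError)
            and 'database is locked' in str(exc))

def _sqlite_retry(max_seconds, wait):
    return tenacity.retry(
        retry=tenacity.retry_if_exception(_is_sqlite_locked),
        wait=tenacity.wait_fixed(wait),
        stop=tenacity.stop_after_delay(max_seconds),
        reraise=True)

# Ordinary reads/writes: give up quickly and let NodeLocked handling run.
lazy_sqlite_retry = _sqlite_retry(max_seconds=5, wait=0.25)

# Lock release and similar cleanup: keep trying much longer, since a stale
# node.reservation logjams every other operation against that node.
aggressive_sqlite_retry = _sqlite_retry(max_seconds=60, wait=0.5)
```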
TheJulia | so the retry retool has no errors, I guess we ought to retry it a few times just to make sure we're happy with it | 21:02 |