Saturday, 2025-02-08

corvus	clarkb: thanks!	02:04
corvus	both zuul and my local web browser are having trouble connecting to gerrit	16:59
corvus	i am also unable to ssh to the host on port 22	17:00
corvus	review02.opendev.org | SHUTOFF | public=199.204.45.33, 2604:e100:1:0:f816:3eff:fe52:22de	17:01
corvus	that does not look nominal	17:02
cardoe	i cannot connect as well	17:02
corvus	| updated | 2025-02-08T16:08:13Z	17:03
corvus	i don't see any error messages	17:03
corvus	i with "shutoff" had a reason or some other auditing	17:03
corvus	wish	17:03
*** darmach3 is now known as darmach		17:04
corvus	i don't see anything strange about the volumes	17:06
corvus	i'm going to start the server	17:06
corvus	operating system is up; i'm going to do some checks before starting the service	17:08
corvus	starting up now; mariadb performed crash recovery	17:11
Clark[m]	That mariadb for Gerrit is just to track the reviewed flags for files on changes per user. It isn't critical if we need to start over (though we may back it up too)	17:16
corvus	that mariadb looks happy	17:16
Clark[m]	Cool	17:16
corvus	i'm not sure gerrit is though	17:16
corvus	i've got a change push that's hanging	17:16
Clark[m]	Ack give me a few and I can transplant to a computer	17:17
clarkb	https://review.opendev.org/c/opendev/system-config/+/840972 took a while to open and when it did there are no files or diffs listed. I half wonder if we've got cache management problems again /me will look at show-queue	17:21
clarkb	Disk Cache Pruner (gerrit_file_diff) is running since 17:11	17:21
corvus	i have a show-caches command which is stuck for a few mins now	17:22
corvus	should we stop it, rm caches, and start?	17:22
clarkb	it's doing git_file_diff now	17:22
corvus	ok, it is progressing then	17:23
clarkb	neither file on disk is particularly big (at least compared to what we saw before)	17:23
corvus	maybe we should leave it, but maybe we should also just rm the caches in the startup script? :)	17:23
clarkb	ya I'm thinking we can leave it and monitor then check if it is happy when pruning is done. Then agreed automating the cleanups is probably a good idea.	17:24
clarkb	it wasn't a big issue when we manually restart things but for something like this would likely improve our lives/sanity	17:24
corvus	i thought about doing that this time, but figured the data collection for science was worth it.	17:24
fungi	sorry, stepped away to get a shower, but i agree, not clearing the caches during our restart the other day was probably a mistake in retrospect	17:24
clarkb	though the files weren't that much different in size and were much quicker to prune	17:25
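
The idea raised above of removing the caches as part of startup could be sketched roughly as follows. This is only an illustration, not the OpenDev startup script: the two cache names come from the discussion, while the `clear_disk_caches` helper, its `cache_dir` argument, and the assumption that each cache lives in files named `<cache name>.<suffix>` in a single directory are hypothetical. It would only be safe to run while Gerrit is stopped.

```python
from pathlib import Path

# Cache names taken from the discussion. The on-disk naming
# ("<name>.<suffix>" in one directory) is an assumption.
PRUNE_HEAVY_CACHES = ("gerrit_file_diff", "git_file_diff")


def clear_disk_caches(cache_dir: Path, names=PRUNE_HEAVY_CACHES,
                      dry_run: bool = True) -> list[Path]:
    """List the on-disk files for the named caches; delete them unless dry_run."""
    matched: list[Path] = []
    for name in names:
        # Matches the cache file itself plus any sibling files that
        # share the same stem (hypothetical lock or trace files).
        matched.extend(sorted(cache_dir.glob(f"{name}.*")))
    if not dry_run:
        for path in matched:
            path.unlink()
    return matched
```

With `dry_run` left at its default the function only reports what it would remove, which makes it easy to check the match before wiring anything into a real startup path.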
clarkb	I wonder if there is underlying io slowness (perhaps with ceph?)	17:25
clarkb	that could explain why things shutdown if there was a filesystem error too potentially?	17:25
clarkb	and maybe there was similar slowness when we restarted in december which caused things to be sad then? So less about absolute size of the caches and more about io performance?	17:26
clarkb	(this is all brainstorming I don't have any real evidence other than this is slower than the recent restart which didn't have the files deleted before starting again)	17:27
fungi	entirely possible	17:27
corvus	yeah... i guess we could try running a storage benchmark or something, but designing a useful one that can run while the service is up would be tricky.	17:28
fungi	this is in vexxhost ca-ymq-1 so maybe guilhermesp or mnaser know if something's happened there	17:28
clarkb	all caches have been pruned at this point and corvus' show-queue is no longer showing up in the queue either	17:28
clarkb	and I can see files/diffs on that change I linked above	17:29
corvus	it did return	17:29
clarkb	corvus: any chance your push completed too?	17:29
fungi	16:08:13 utc event according to nova, sounds like. i'll see if i can correlate in syslog	17:29
corvus	clarkb: the push error'd out earlier; but i just repushed it now.	17:29
corvus	and that did succeed quickly	17:29
clarkb	great	17:30
corvus	(also, got the usual behavior of gerrit indicating that it already had the first patchset, so i pushed a second ps)	17:30
fungi	last timestamp recorded to syslog before the startup messages was 16:05:43 utc, for the record	17:30
corvus	https://review.opendev.org/941046 is the change	17:30
fungi	nothing surprising in syslog, looks like whatever happened took the system by surprise, or at least began with preventing it from writing to /var	17:31
corvus	fungi: nova api "updated" time is 16:08:13	17:31
clarkb	my best guess at this point is that we experienced a slow startup similar to the one in december with cache pruning locking out operations that depend on the cache. It wasn't as extreme as december either because this is a weekend with less overall demand or because the cache files are smaller or both. But this new occurrence coinciding with an unexpected shutdown does make me	17:31
clarkb	suspect something unhappy with io which is backed by ceph aiui	17:31
fungi	corvus: yep, i was starting from there and then checking it against the syslog for correlation	17:31
clarkb	so ya maybe ricolin guilhermesp or mnaser can check and see if anything abnormal shows up on their end	17:32
fungi	oh, right, i forgot ricolin is there too	17:32
corvus	that's a 3 minute timeframe to look in	17:32
fungi	yeah, it looks like the system was healthy as of 16:05:43 utc and then nova recorded it in shutoff state at 16:08:13 utc	17:33
fungi	that's the most insight we have from this side, i think	17:34
corvus	status notice The opendev gerrit service suffered an unscheduled outage from around 16:00 to 17:30. Service has been restored.	17:34
corvus	^?	17:34
fungi	lgtm	17:34
clarkb	++	17:34
corvus	#status notice The opendev gerrit service suffered an unscheduled outage from around 16:00 to 17:30. Service has been restored.	17:35
opendevstatus	corvus: sending notice	17:35
-opendevstatus- NOTICE: The opendev gerrit service suffered an unscheduled outage from around 16:00 to 17:30. Service has been restored.		17:35
corvus	i don't see any more messages in the gerrit log (so, like, no errors after it cleaned the caches). i'm going to stop watching it now.	17:35
clarkb	re data collection for science: I think that turned out to be useful here as it indicates whether the file is 8-9GB or 220GB we can experience the same issue	17:36
fungi	refining the start of the outage further, gerrit's ssh log has events timestamped up to 16:07:45 utc	17:37
opendevstatus	corvus: finished sending notice	17:38
corvus	oh nice, so it was probably running right up to the "shutoff" time	17:38
clarkb	must without anything notable to record in syslog	17:38
clarkb	s/must/just/	17:38
fungi	yeah, regular syslog noise was from snmp queries	17:39
clarkb	or maybe file syncing for journal and gerrit java logs are different	17:39
fungi	which have gaps between them of course	17:39
corvus	snmp is our "everything's okay alarm"	17:40
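
The correlation done by hand above (last syslog entry, last gerrit ssh log entry, nova "updated" time) amounts to taking the latest sign of life and measuring the gap to the recorded shutoff. A minimal sketch of that arithmetic, using only the timestamps quoted in this log; the helper names `utc` and `failure_window` are made up for illustration:

```python
from datetime import datetime, timezone


def utc(ts: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as a UTC timestamp."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


def failure_window(last_seen: list[datetime], shutoff: datetime):
    """Return (latest sign of life, shutoff, gap in seconds)."""
    latest = max(last_seen)
    return latest, shutoff, int((shutoff - latest).total_seconds())


# Timestamps quoted in the discussion (all UTC).
last_syslog = utc("2025-02-08 16:05:43")
last_gerrit_ssh = utc("2025-02-08 16:07:45")
nova_updated = utc("2025-02-08 16:08:13")

# Syslog alone leaves a 150 second window.
print(failure_window([last_syslog], nova_updated)[2])
# Adding the gerrit ssh log narrows it to 28 seconds.
print(failure_window([last_syslog, last_gerrit_ssh], nova_updated)[2])
```

This assumes the nova "updated" time reflects when the instance entered the shutoff state, which the log suggests but does not prove.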
clarkb	I'm going to return to morning activity. I'll check back in here later in case there is more that needs to be done in the near term	17:42
clarkb	there are waffles	17:42