Saturday, 2025-02-08

corvusclarkb: thanks!02:04
corvusboth zuul and my local web browser are having trouble connecting to gerrit16:59
corvusi am also unable to ssh to the host on port 2217:00
corvus review02.opendev.org | SHUTOFF | public=199.204.45.33, 2604:e100:1:0:f816:3eff:fe52:22de 17:01
corvusthat does not look nominal17:02
cardoei cannot connect as well17:02
corvus| updated | 2025-02-08T16:08:13Z 17:03
corvusi don't see any error messages17:03
corvusi with "shutoff" had a reason or some other auditing17:03
corvuswish17:03
*** darmach3 is now known as darmach17:04
corvusi don't see anything strange about the volumes17:06
corvusi'm going to start the server17:06
corvusoperating system is up; i'm going to do some checks before starting the service17:08
corvusstarting up now; mariadb performed crash recovery17:11
Clark[m]That mariadb for Gerrit is just to track the reviewed flags for files on changes per user. It isn't critical if we need to start over (though we may back it up too)17:16
corvusthat mariadb looks happy17:16
Clark[m]Cool17:16
corvusi'm not sure gerrit is though17:16
corvusi've got a change push that's hanging17:16
Clark[m]Ack give me a few and I can transplant to a computer 17:17
clarkbhttps://review.opendev.org/c/opendev/system-config/+/840972 took a while to open and when it did there are no files or diffs listed. I half wonder if we've got cache management problems again /me will look at show-queue17:21
clarkbDisk Cache Pruner (gerrit_file_diff) is running since 17:1117:21
corvusi have a show-caches command which is stuck for a few mins now17:22
corvusshould we stop it, rm caches, and start?17:22
clarkbit's doing git_file_diff now17:22
corvusok, it is progressing then17:23
clarkbneither file on disk is particularly big (at least compared to what we saw before)17:23
corvusmaybe we should leave it, but maybe we should also just rm the caches in the startup script? :)17:23
clarkbya I'm thinking we can leave it and monitor then check if it is happy when pruning is done. Then agreed automating the cleanups is probably a good idea.17:24
clarkbit wasn't a big issue when we manually restart things but for something like this would likely improve our lives/sanity17:24
corvusi thought about doing that this time, but figured the data collection for science was worth it.17:24
fungisorry, stepped away to get a shower, but i agree, not clearing the caches during our restart the other day was probably a mistake in retrospect17:24
clarkbthough the files weren't that much different in size and were much quicker to prune17:25
clarkbI wonder if there is underlying io slowness (perhaps with ceph?)17:25
clarkbthat could explain why things shutdown if there was a filesystem error too potentially?17:25
clarkband maybe there was similar slowness when we restarted in december which caused things to be sad then? So less about absolute size of the caches and more about io performance?17:26
clarkb(this is all brainstorming I don't have any real evidence other than this is slower than the recent restart which didn't have the files deleted before starting again)17:27
fungientirely possible17:27
corvusyeah... i guess we could try running a storage benchmark or something, but designing a useful one that can run while the service is up would be tricky.17:28
fungithis is in vexxhost ca-ymq-1 so maybe guilhermesp or mnaser know if something's happened there17:28
clarkball caches have been pruned at this point and corvus' show-queue is no longer showing up in the queue either17:28
clarkband I can see files/diffs on that change I linked above17:29
corvusit did return17:29
clarkbcorvus: any chance your push completed too?17:29
fungi16:08:13 utc event according to nova, sounds like. i'll see if i can correlate in syslog17:29
corvusclarkb: the push error'd out earlier; but i just repushed it now.17:29
corvusand that did succeed quickly17:29
clarkbgreat17:30
corvus(also, got the usual behavior of gerrit indicating that it already had the first patchset, so i pushed a second ps)17:30
fungilast timestamp recorded to syslog before the startup messages was 16:05:43 utc, for the record17:30
corvushttps://review.opendev.org/941046 is the change17:30
funginothing surprising in syslog, looks like whatever happened took the system by surprise, or at least began with preventing it from writing to /var17:31
corvusfungi: nova api "updated" time is 16:08:1317:31
clarkbmy best guess at this point is that we experienced a slow startup similar to the one in december with cache pruning locking out operations that depend on the cache. It wasn't as extreme as december either because this is a weekend with less overall demand or because the cache files are smaller or both. But this new occurrence coinciding with an unexpected shutdown does make me17:31
clarkbsuspect something unhappy with io which is backed by ceph aiui17:31
fungicorvus: yep, i was starting from there and then checking it against the syslog for correlation17:31
clarkbso ya maybe ricolin guilhermesp or mnaser can check and see if anything abnormal shows up on their end17:32
fungioh, right, i forgot ricolin is there too17:32
corvusthat's a 3 minute timeframe to look in17:32
fungiyeah, it looks like the system was healthy as of 16:05:43 utc and then nova recorded it in shutoff state at 16:08:13 utc17:33
fungithat's the most insight we have from this side, i think17:34
corvusstatus notice The opendev gerrit service suffered an unscheduled outage from around 16:00 to 17:30. Service has been restored.17:34
corvus ^?17:34
fungilgtm17:34
clarkb++17:34
corvus#status notice The opendev gerrit service suffered an unscheduled outage from around 16:00 to 17:30. Service has been restored.17:35
opendevstatuscorvus: sending notice17:35
-opendevstatus- NOTICE: The opendev gerrit service suffered an unscheduled outage from around 16:00 to 17:30. Service has been restored.17:35
corvusi don't see any more messages in the gerrit log (so, like, no errors after it cleaned the caches).  i'm going to stop watching it now.17:35
clarkbre data collection for science: I think that turned out to be useful here as it indicates whether the file is 8-9GB or 220GB we can experience the same issue17:36
fungirefining the start of the outage further, gerrit's ssh log has events timestamped up to 16:07:45 utc17:37
opendevstatuscorvus: finished sending notice17:38
corvusoh nice, so it was probably running right up to the "shutoff" time17:38
clarkbmust without anything notable to record in syslog17:38
clarkbs/must/just/17:38
fungiyeah, regular syslog noise was from snmp queries17:39
clarkbor maybe file syncing for journal and gerrit java logs are different17:39
fungiwhich have gaps between them of course17:39
corvussnmp is our "everything's okay alarm"17:40
clarkbI'm going to return to morning activity. I'll check back in here later in case there is more that needs to be done in the near term17:42
clarkbthere are waffles17:42
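
As a rough check on the timeline discussed above, the sketch below computes the gaps between the three timestamps quoted in the log: the last syslog entry (16:05:43), the last gerrit ssh log entry (16:07:45), and the nova "updated" time (16:08:13), all UTC. The dictionary keys and helper names are illustrative assumptions and are not part of any OpenDev tooling.

```python
from datetime import datetime, timezone

# Timestamps quoted in the discussion (all UTC, on 2025-02-08).
DAY = "2025-02-08"
EVENTS = {
    "last_syslog_entry": "16:05:43",
    "last_gerrit_ssh_entry": "16:07:45",
    "nova_updated_shutoff": "16:08:13",
}


def parse_utc(hms: str) -> datetime:
    """Turn an HH:MM:SS string into a timezone-aware UTC datetime on DAY."""
    naive = datetime.strptime(f"{DAY} {hms}", "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


def gap_seconds(start: str, end: str) -> int:
    """Seconds elapsed between two HH:MM:SS timestamps on DAY."""
    return int((parse_utc(end) - parse_utc(start)).total_seconds())


shutoff = EVENTS["nova_updated_shutoff"]
syslog_gap = gap_seconds(EVENTS["last_syslog_entry"], shutoff)
ssh_gap = gap_seconds(EVENTS["last_gerrit_ssh_entry"], shutoff)

print(f"syslog -> shutoff: {syslog_gap} s")
print(f"gerrit ssh log -> shutoff: {ssh_gap} s")
```

With these figures the window between the last syslog entry and the recorded shutoff is 150 seconds, and the gerrit ssh log narrows it to 28 seconds.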
