corvus | clarkb: thanks! | 02:04 |
---|---|---|
corvus | both zuul and my local web browser are having trouble connecting to gerrit | 16:59 |
corvus | i am also unable to ssh to the host on port 22 | 17:00 |
corvus | review02.opendev.org | SHUTOFF | public=199.204.45.33, 2604:e100:1:0:f816:3eff:fe52:22de | 17:01 |
corvus | that does not look nominal | 17:02 |
cardoe | i cannot connect as well | 17:02 |
corvus | | updated | 2025-02-08T16:08:13Z | 17:03 |
corvus | i don't see any error messages | 17:03 |
corvus | i with "shutoff" had a reason or some other auditing | 17:03 |
corvus | wish | 17:03 |
*** darmach3 is now known as darmach | 17:04 | |
corvus | i don't see anything strange about the volumes | 17:06 |
corvus | i'm going to start the server | 17:06 |
corvus | operating system is up; i'm going to do some checks before starting the service | 17:08 |
corvus | starting up now; mariadb performed crash recovery | 17:11 |
Clark[m] | That mariadb for Gerrit is just to track the reviewed flags for files on changes per user. It isn't critical if we need to start over (though we may back it up too) | 17:16 |
corvus | that mariadb looks happy | 17:16 |
Clark[m] | Cool | 17:16 |
corvus | i'm not sure gerrit is though | 17:16 |
corvus | i've got a change push that's hanging | 17:16 |
Clark[m] | Ack give me a few and I can transplant to a computer | 17:17 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/840972 took a while to open and when it did there are no files or diffs listed. I half wonder if we've got cache management problems again /me will look at show-queue | 17:21 |
clarkb | Disk Cache Pruner (gerrit_file_diff) is running since 17:11 | 17:21 |
corvus | i have a show-caches command which is stuck for a few mins now | 17:22 |
corvus | should we stop it, rm caches, and start? | 17:22 |
clarkb | its doing git_file_diff now | 17:22 |
corvus | ok, it is progressing then | 17:23 |
clarkb | neither file on disk is particularly big (at least compared to what we saw before) | 17:23 |
corvus | maybe we should leave it, but maybe we should also just rm the caches in the startup script? :) | 17:23 |
clarkb | ya I'm thinking we can leave it and monitor then check if it is happy when pruning is done. Then agreed automating the cleanups is probably a good idea. | 17:24 |
clarkb | it wasn't a big issue when we manually restart things but for something like this would likely improve our lives/sanity | 17:24 |
corvus | i thought about doing that this time, but figured the data collection for science was worth it. | 17:24 |
fungi | sorry, stepped away to get a shower, but i agree, not clearing the caches during our restart the other day was probably a mistake in retrospect | 17:24 |
clarkb | though the files weren't that much different in size and were much quicker to prune | 17:25 |
clarkb | I wonder if there is underlying io slowness (perhaps with ceph?) | 17:25 |
clarkb | that could explain why things shut down if there was a filesystem error too potentially? | 17:25 |
clarkb | and maybe there was similar slowness when we restarted in december which caused things to be sad then? So less about absolute size of the caches and more about io performance? | 17:26 |
clarkb | (this is all brainstorming I don't have any real evidence other than this is slower than the recent restart which didn't have the files deleted before starting again) | 17:27 |
fungi | entirely possible | 17:27 |
corvus | yeah... i guess we could try running a storage benchmark or something, but designing a useful one that can run while the service is up would be tricky. | 17:28 |
fungi | this is in vexxhost ca-ymq-1 so maybe guilhermesp or mnaser know if something's happened there | 17:28 |
clarkb | all caches have been pruned at this point and corvus' show-queue is no longer showing up in the queue either | 17:28 |
clarkb | and I can see files/diffs on that change I linked above | 17:29 |
corvus | it did return | 17:29 |
clarkb | corvus: any chance your push completed too? | 17:29 |
fungi | 16:08:13 utc event according to nova, sounds like. i'll see if i can correlate in syslog | 17:29 |
corvus | clarkb: the push error'd out earlier; but i just repushed it now. | 17:29 |
corvus | and that did succeed quickly | 17:29 |
clarkb | great | 17:30 |
corvus | (also, got the usual behavior of gerrit indicating that it already had the first patchset, so i pushed a second ps) | 17:30 |
fungi | last timestamp recorded to syslog before the startup messages was 16:05:43 utc, for the record | 17:30 |
corvus | https://review.opendev.org/941046 is the change | 17:30 |
fungi | nothing surprising in syslog, looks like whatever happened took the system by surprise, or at least began with preventing it from writing to /var | 17:31 |
corvus | fungi: nova api "updated" time is 16:08:13 | 17:31 |
clarkb | my best guess at this point is that we experienced a slow startup similar to the one in december with cache pruning locking out operations that depend on the cache. It wasn't as extreme as december either because this is a weekend with less overall demand or because the cache files are smaller or both. But this new occurrence coinciding with an unexpected shutdown does make me | 17:31 |
clarkb | suspect something unhappy with io which is backed by ceph aiui | 17:31 |
fungi | corvus: yep, i was starting from there and then checking it against the syslog for correlation | 17:31 |
clarkb | so ya maybe ricolin guilhermesp or mnaser can check and see if anything abnormal shows up on their end | 17:32 |
fungi | oh, right, i forgot ricolin is there too | 17:32 |
corvus | that's a 3 minute timeframe to look in | 17:32 |
fungi | yeah, it looks like the system was healthy as of 16:05:43 utc and then nova recorded it in shutoff state at 16:08:13 utc | 17:33 |
fungi | that's the most insight we have from this side, i think | 17:34 |
corvus | status notice The opendev gerrit service suffered an unscheduled outage from around 16:00 to 17:30. Service has been restored. | 17:34 |
corvus | ^? | 17:34 |
fungi | lgtm | 17:34 |
clarkb | ++ | 17:34 |
corvus | #status notice The opendev gerrit service suffered an unscheduled outage from around 16:00 to 17:30. Service has been restored. | 17:35 |
opendevstatus | corvus: sending notice | 17:35 |
-opendevstatus- NOTICE: The opendev gerrit service suffered an unscheduled outage from around 16:00 to 17:30. Service has been restored. | 17:35 | |
corvus | i don't see any more messages in the gerrit log (so, like, no errors after it cleaned the caches). i'm going to stop watching it now. | 17:35 |
clarkb | re data collection for science: I think that turned out to be useful here as it indicates whether the file is 8-9GB or 220GB we can experience the same issue | 17:36 |
fungi | refining the start of the outage further, gerrit's ssh log has events timestamped up to 16:07:45 utc | 17:37 |
opendevstatus | corvus: finished sending notice | 17:38 |
corvus | oh nice, so it was probably running right up to the "shutoff" time | 17:38 |
clarkb | must without anything notable to record in syslog | 17:38 |
clarkb | s/must/just/ | 17:38 |
fungi | yeah, regular syslog noise was from snmp queries | 17:39 |
clarkb | or maybe file syncing for journal and gerrit java logs are different | 17:39 |
fungi | which have gaps between them of course | 17:39 |
corvus | snmp is our "everything's okay alarm" | 17:40 |
clarkb | I'm going to return to morning activity. I'll check back in here later in case there is more that needs to be done in the near term | 17:42 |
clarkb | there are waffles | 17:42 |
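
The timestamp correlation discussed above (last syslog entry, last gerrit ssh log entry, nova "updated" time) can be sketched as a small calculation. This is only an illustration: the function name `outage_window` and the dictionary labels are hypothetical, the date is assumed to be 2025-02-08 for all three timestamps (taken from the nova "updated" field), and all times are assumed to be UTC as stated in the channel.

```python
from datetime import datetime

# Assumed timestamp layout; matches the nova "updated" value quoted in the log.
FMT = "%Y-%m-%dT%H:%M:%SZ"


def outage_window(last_seen, shutoff):
    """Return the gap between the newest log timestamp and the shutoff time."""
    newest = max(datetime.strptime(t, FMT) for t in last_seen.values())
    return datetime.strptime(shutoff, FMT) - newest


# Values quoted in the channel; the labels are illustrative, not real file names.
last_seen = {
    "syslog": "2025-02-08T16:05:43Z",
    "gerrit ssh log": "2025-02-08T16:07:45Z",
}
window = outage_window(last_seen, "2025-02-08T16:08:13Z")
print(window.total_seconds())
```

With only the syslog timestamp the window is 150 seconds; adding the gerrit ssh log narrows it to the gap between 16:07:45 and 16:08:13.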