clarkb | last call on meeting agenda content. I'll get that sent out shortly | 00:16 |
---|---|---|
tonyb | nothing from me | 00:17 |
fungi | i have nothing else to suggest either | 00:18 |
clarkb | ok I'll send it nowish then | 00:21 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add a tool for displaying CPU flags https://review.opendev.org/c/openstack/diskimage-builder/+/937836 | 04:52 |
opendevreview | Joel Capitao proposed openstack/project-config master: Authorize packstack-core to force push to remove branch https://review.opendev.org/c/openstack/project-config/+/937792 | 08:27 |
opendevreview | Joel Capitao proposed openstack/project-config master: Authorize packstack-release to delete https://review.opendev.org/c/openstack/project-config/+/937792 | 08:41 |
*** ykarel_ is now known as ykarel | 12:28 | |
NeilHanlon | clarkb, fungi: at this time Rocky is not intending to diverge from the x86-v3 baseline that RH has decided on for v10.. though if a SIG wanted to, I'd support it... we're already planning on having RISC-V support for Rocky 10 via a SIG, so... | 15:02 |
clarkb | NeilHanlon: I doubt we'd drive such an effort. Mostly just curious what sorts of choices others are making | 15:15 |
karolinku[m] | if after updating this CPU flags info it appears that x86-v3 is available, would you consider creating a label which would work with C10? | 15:17 |
clarkb | karolinku[m]: I think that depends a lot on what we discover. My primary concerns are that availability will be extremely limited, and as mentioned yesterday I feel like that is ok for minor features that are minimally used, which projects can live without simply by having longer jobs, vs an entire platform that only works on more specialized hardware | 15:28 |
clarkb | I also have a concern that this effectively means centos 10 can only run on what is likely our most performant hardware | 15:29 |
clarkb | it's one thing to turn off nested virt support and force projects to use emulation or rewrite tests to approach the problem another way. It is another to say "you can't have centos 10 stream anymore today because a cloud is no longer present" | 15:30 |
clarkb | disk use of gitea09 and paste seems stable even with the logging changes (that's good) | 17:08 |
clarkb | I'll see how things are looking after my morning of meetings and we can probably proceed with a gerrit restart at that point | 17:08 |
opendevreview | Merged openstack/project-config master: Authorize packstack-release to delete https://review.opendev.org/c/openstack/project-config/+/937792 | 17:20 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add a tool for displaying CPU flags https://review.opendev.org/c/openstack/diskimage-builder/+/937836 | 19:00 |
tonyb | karolinku[m]: Can you rebase your CS-10 testing patch on top of ^^ | 20:04 |
tonyb | karolinku[m]: note my change doesn't *do* anything to ensure x86_64-v3 nodes but it makes it very clear if you got one. | 20:04 |
clarkb | tonyb: karolinku[m] I would continue to use the nested virt labels for now since that constrains the problem space a bit and it would be good to know, among those clouds, which are not capable | 20:08 |
clarkb | then once we've understood that subset we can go to the wider labels | 20:08 |
tonyb | clarkb: you wanted me to include KVM info alongside the cpu-level. Are you thinking capturing the output of `qemu-kvm --version` and `qemu-system-$(arch) --version` would be adequate? | 20:09 |
clarkb | ya that seems adequate. But we can also probably just check the packaged versions too and not have an explicit check | 20:10 |
clarkb | 7.2 and newer can emulate haswell (though I don't know what the underlying cpu requirements for that are) | 20:10 |
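A minimal sketch of the kind of node introspection being discussed, not the actual diskimage-builder tool from change 937836; the ld.so path and the qemu binary names are assumptions that vary by distro.

```bash
#!/bin/bash
# Hedged sketch: report the x86-64 microarchitecture level and qemu versions
# so a job log makes it obvious what kind of node it landed on.

# glibc 2.33+ ld.so can report which x86-64-v* hwcaps levels the CPU supports.
LDSO=/lib64/ld-linux-x86-64.so.2   # assumed path; differs on some distros
"$LDSO" --help 2>/dev/null | grep -E 'x86-64-v[0-9]' || true

# Fall back to the raw CPU flags if ld.so does not report hwcaps.
grep -m1 '^flags' /proc/cpuinfo

# qemu/KVM versions as discussed above; either binary may be absent.
qemu-kvm --version 2>/dev/null || true
"qemu-system-$(arch)" --version 2>/dev/null || true
```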
tonyb | clarkb: Yup I agree, I also have in my head a poorly formed request about adding a label explicitly for "foundational support" for RHEL-10-like distros. NOT for general CI of those OSs | 20:12 |
tonyb | so we'd know that we can build images, but not actually build and add them to our clouds | 20:12 |
tonyb | with my thinking being that as we do get clouds and adequate quota, so that these machines are more common, we could potentially add CI | 20:14 |
tonyb | although I haven't thought much about if that's actually helpful | 20:14 |
Clark[m] | tonyb: for the foundational stuff I wondered if we could just qemu emulate haswell to check a build and not do functests | 20:21 |
Clark[m] | We already emulate those VM boots for checking iirc | 20:21 |
Clark[m] | So it shouldn't be any slower | 20:21 |
tonyb | Clark[m]: That's certainly possible | 20:22 |
clarkb | infra-root there is nothing in the openstack gate or release pipelines. I've just notified the release team that I'll be restarting Gerrit shortly | 20:55 |
clarkb | if there are no objections I'll start on that in a couple of minutes by sending a notice first then figure out the commands I need to run while that posts | 20:56 |
tonyb | no objection from me | 20:57 |
clarkb | I'm not pulling new images or anything like that, just a docker-compose down; mv the waiting queue task files for the replication plugin; then docker-compose up -d | 20:58 |
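A rough sketch of that down/mv/up sequence; the compose directory and the replication plugin's waiting-queue path are assumptions, not taken from the actual server layout.

```bash
cd /etc/gerrit-compose               # hypothetical docker-compose directory
docker-compose down
mkdir -p /home/gerrit2/tmp
# set aside the replication plugin's persisted waiting-queue task files
mv /home/gerrit2/review_site/data/replication/ref-updates/waiting \
   /home/gerrit2/tmp/replication-waiting-$(date +%F)     # assumed path
docker-compose up -d
```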
clarkb | #status notice Gerrit will be restarted to pick up a small configuration update. You may notice a short Gerrit outage. | 20:59 |
opendevstatus | clarkb: sending notice | 20:59 |
-opendevstatus- NOTICE: Gerrit will be restarted to pick up a small configuration update. You may notice a short Gerrit outage. | 21:00 | |
opendevstatus | clarkb: finished sending notice | 21:03 |
clarkb | ok proceeding with the restart now. There is a root screen on review02 if anyone else is interested but this should be quick | 21:03 |
clarkb | web ui is up but I'm still waiting for diffs (this is expected) | 21:05 |
clarkb | the /var/log/containers content for the db container appears to have updated as expected | 21:06 |
clarkb | still waiting for files/diffs | 21:09 |
JayF | how long does it usually take for the diffs to start back working again? | 21:12 |
clarkb | usually about 5 minutes, it's going long this time but I'm not seeing anything yet indicating why | 21:13 |
clarkb | it does prune caches on startup and I suspect that is related | 21:14 |
clarkb | seems like it may have stopped responding too? | 21:15 |
JayF | I'm seeing the same. | 21:15 |
clarkb | arg I don't understand what is going on yet; the error log is largely devoid of anything related | 21:15 |
clarkb | there are some ssh timeouts | 21:15 |
clarkb | oh now the error log is logging rejected connections over http so that explains that at least | 21:16 |
clarkb | not the underlying cause but it is logging it | 21:16 |
clarkb | from what I can see gerrit is rejecting http connections so then the apache proxy is responding with 502 bad gateways | 21:20 |
clarkb | gerrit show-queue shows a gerrit_file_diff pruning task started around the restart | 21:22 |
clarkb | I suspect this is related to the lack of diffs | 21:22 |
clarkb | however it isn't clear to me yet why this has snowballed into gerrit rejecting http connections. maybe all of us trying to load diffs filled up all the http slots? | 21:22 |
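For reference, the inspection commands being discussed (show-queue above, show-caches a few lines below) run over Gerrit's admin SSH interface; the host and account here are placeholders.

```bash
# list pending/running server tasks, grouped by queue
ssh -p 29418 admin@review.opendev.org gerrit show-queue --wide --by-queue
# cache hit rates and sizes; can be slow on a busy server, as seen below
ssh -p 29418 admin@review.opendev.org gerrit show-caches
```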
corvus | clarkb: i'm around and can take a look | 21:23 |
clarkb | corvus: thanks. the gerrit_file_diff cache is 61GB on disk last | 21:24 |
clarkb | s/last// | 21:24 |
clarkb | the last time we pruned it was 0100 and gerrit says it was 3GB at that time | 21:25 |
clarkb | however this is an h2 db so the on disk size might be much larger than the actual cache content size? | 21:25 |
hashar | yup :) | 21:25 |
hashar | unless you get it vacuumed from time to time | 21:25 |
clarkb | no vacuuming that i know of | 21:25 |
clarkb | there are a number of other stuck tasks in the queue as well so I'm guessing that things got stuck in gerrit running startup tasks and that is leading to other things not proceeding | 21:26 |
hashar | do you have some monitoring in place? | 21:27 |
clarkb | hashar: we don't use the prometheus plugin type stuff if that is what you are asking. And I think we removed that one java monitoring plugin when log4shell happened | 21:28 |
clarkb | (possibly bad) ideas: we could stop gerrit and start it again to clear out the tasks and/or try manually killing tasks | 21:28 |
clarkb | if we stop and start gerrit again we could potentially move the git file diff cache aside to see if bringing that up clean is happier | 21:28 |
hashar | the prometheus plugin and the grafana dashboard maintained somewhere upstream have very nicely helped us | 21:29 |
corvus | i ran a show-caches command a few mins ago; it hasn't returned yet | 21:29 |
clarkb | corvus: ya that one is slow | 21:29 |
corvus | clarkb: do you mean we removed java melody? | 21:29 |
hashar | show-caches runs a full gc iirc | 21:29 |
clarkb | corvus: yes java melody was the one I couldn't remember the name for | 21:29 |
clarkb | looks like gerrit is responsive now fwiw | 21:30 |
corvus | losing javamelody is sad; it has been very helpful to get a stack trace when something was stuck... :/ | 21:30 |
clarkb | I wonder if the show caches running a gc tripped something over into working again? or it could be that we just needed to wait | 21:30 |
clarkb | corvus: agreed but it also did scary things with the logging methods iirc so we removed it at the time. We could probably add it back now? | 21:30 |
corvus | well my show-caches hasn't returned yet | 21:30 |
clarkb | gerrit has cleaned up the logging stuff quite a bit | 21:30 |
JayF | My diff loads (well, at least once) now, fwiw. | 21:31 |
corvus | i don't see a significant difference in the queue | 21:31 |
clarkb | corvus: ya the show queue output looks very similar to when it was being sad | 21:32 |
hashar | for the H2 backed cache: the database will grow over time and get fragmented as bits are written/removed from it | 21:32 |
corvus | so if it's responsive, i'm guessing it's just slowly working through a backlog while still performing the pruning | 21:32 |
clarkb | hashar: is the suggestion that you delete the backing files occasionally? | 21:32 |
hashar | on startup, Gerrit only allows 200ms to compact the database, which is certainly not long enough, and if the host does not restart often it is essentially never cleaned | 21:32 |
corvus | "does not restart often" matches here :) | 21:33 |
hashar | on our setup we had been transferring the caches over and over and had massive multi-GB caches which eventually one day ended up filling the disk | 21:33 |
hashar | I probably spent two weeks of my life debugging it, exactly two years ago | 21:33 |
hashar | my fix was to set `-Dh2.maxCompactTime=15000` which sets the time allowed to compact the db to 15 seconds | 21:34 |
hashar | and after some restart the files were smaller | 21:34 |
hashar | I wrote about my debugging at https://phabricator.wikimedia.org/phame/post/view/300/shrinking_h2_database_files/ | 21:34 |
hashar | the raw journal is in https://phabricator.wikimedia.org/T323754 | 21:35 |
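One way hashar's fix could be wired in, assuming the default review_site layout; gerrit.config is git-config formatted, so git config can edit it, and 15000 ms mirrors the value above. A Gerrit restart is needed for container.javaOptions changes to apply.

```bash
# assumed path to gerrit.config; value is the compaction budget in milliseconds
git config -f /home/gerrit2/review_site/etc/gerrit.config \
    --add container.javaOptions "-Dh2.maxCompactTime=15000"
```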
clarkb | hashar: did the problems with those caches eventually lead to slow startup times like we observed? | 21:35 |
hashar | but I don't think we had any performance issue. The files were just super large | 21:35 |
clarkb | I'm wondering if this is related or we may have a different more urgent problem that we need to debug first then get back to the gerrit stuff | 21:35 |
clarkb | * get back to the gerrit cache stuff | 21:35 |
hashar | I don't think that slowed the startup times | 21:35 |
hashar | at least I don't remember that to have been an issue | 21:35 |
hashar | nor that anything has improved after having the files compacted | 21:36 |
hashar | our diff caches went from 12G and 8.2G down to 0.5G | 21:36 |
clarkb | looks like index tasks for some changes show up in the queue then go away | 21:36 |
corvus | clarkb: i'm going to attempt to get a stack trace in the container | 21:37 |
clarkb | but the ones at the top of the queue listing from around when we restarted are slow | 21:37 |
clarkb | corvus: ok | 21:37 |
clarkb | *from around when we restarted are slow and still in there | 21:37 |
corvus | loving "ps: command not found" | 21:38 |
hashar | (for the H2 cache, Gerrit 3.10 has a system running the H2 cache pruning on a daily basis at 1:00 (UTC I think) https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#cachePruning | 21:38 |
clarkb | there is a growing set of errors in the error log for a user attempting to push to starlingx/metal | 21:38 |
clarkb | but it almost looks like they tried to push and got an error, but then gerrit eventually caught up and now they get "no new changes" errors? | 21:39 |
corvus | clarkb: /home/gerrit2/review_site/logs/jstack.log | 21:40 |
clarkb | corvus: looks like the gerrit_file_diff is RUNNABLE in that list | 21:42 |
clarkb | system load has shot back up and I think http is unhappy again | 21:42 |
clarkb | though maybe less unhappy than before; seems things eventually loaded for me, just slowly | 21:42 |
corvus | i'm worried that there may be a deadlock... | 21:43 |
clarkb | corvus: the queue dropped down quite a bit actually | 21:43 |
corvus | lemme put some things together | 21:43 |
clarkb | ack | 21:43 |
clarkb | gerrit_file_diff and git_file_diff are quite large on disk | 21:45 |
clarkb | everything else appears to be under 10GB | 21:45 |
corvus | i think a lot of the index commands are waiting on my show-caches command to finish | 21:47 |
corvus | so that's clearly a dangerous command to run :( | 21:47 |
corvus | that thread is runnable though; it's apparently reading the h2 db. | 21:48 |
clarkb | corvus: ah ok your show-cache isn't showing up anymore and the index tasks are no longer in there | 21:48 |
corvus | oh yep it finished | 21:48 |
clarkb | so theory time: the cache pruning is very expensive on very large backing files and things may get caught up in that? | 21:48 |
clarkb | I suspect though don't know for sure that we can remove the cache files and have gerrit start those cache files over again | 21:49 |
clarkb | hashar: ^ do you know? | 21:49 |
clarkb | the git_file_diff backing file is even larger | 21:49 |
hashar | gerrit_file_diff and git_file_diff hold `git diff` output | 21:50 |
clarkb | but I'm kind of thinking it may be prudent to stop gerrit, move those files aside / delete them, then start gerrit back up again | 21:50 |
hashar | or something like that, so they are necessarily super large | 21:50 |
clarkb | hashar: ya but 61GB and 222GB large? | 21:50 |
clarkb | the actual data in them is about 3GB | 21:50 |
hashar | oh | 21:50 |
hashar | is the cache pruning showing in the show-queue output or the jstack? | 21:51 |
clarkb | hashar: both | 21:51 |
hashar | https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#cachePruning | 21:51 |
hashar | looks like it default to be enabled on startup | 21:51 |
hashar | so potentially cachePruning.pruneOnStartup=false ? | 21:51 |
clarkb | ya and before 3.10 it only ran on startup | 21:51 |
clarkb | well we want to prune, I think, to try and keep disk usage under control over time | 21:51 |
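For reference only, since pruning is still wanted here: hashar's suggested toggle would look roughly like this in gerrit.config (key name as given above; check the linked config-gerrit docs for the exact spelling).

```bash
# hypothetical: skip pruning at startup, keeping only the daily 01:00 run
git config -f /home/gerrit2/review_site/etc/gerrit.config \
    cachePruning.pruneOnStartup false
```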
corvus | i did another jstack dump in jstack-2.log | 21:52 |
corvus | the disk cache pruner thread stack is different, so it's doing something :) | 21:52 |
clarkb | ok that is good to confirm | 21:53 |
hashar | my fix was to raise it to 15 seconds to let it prune the caches (`java -Dh2.maxCompactTime=15000`) | 21:53 |
hashar | and I haven't looked at whether that had any effect with the newer "pruneOnStartup" | 21:53 |
clarkb | hashar: I think that is a good followup though I am concerned it won't be sufficient with things being as large as they are | 21:53 |
clarkb | and instead wondering if we can just delete the caches. I want to say we can and gerrit generates new empty caches on startup | 21:54 |
hashar | true yes | 21:54 |
clarkb | but probably a good idea to move the cache aside rather than delete it | 21:54 |
clarkb | then once gerrit is up and happy delete it | 21:54 |
corvus | clarkb: i think you are correct; i think all the h2 stuff can be considered ephemeral and i'm fairly sure an individual missing h2 db would be created empty. | 21:54 |
clarkb | I'm kinda thinking we should maybe take the hit of moving the files aside then given their size and apparent impact on startup | 21:55 |
hashar | there is one h2 backed db which is not a cache though | 21:55 |
hashar | I think the one storing whether a given file has been reviewed | 21:55 |
clarkb | oh we're processing git_file_diff according to show queue now so it is done with gerrit_file_diff | 21:55 |
corvus | yep | 21:55 |
clarkb | hashar: we store that one in mariadb | 21:55 |
corvus | and the web ui is pretty responsive | 21:55 |
hashar | so it took like 20 minutes to clear the git_file_diff? | 21:56 |
hashar | +1 on mariadb :) | 21:56 |
clarkb | hashar: more like 45 minutes for gerrit_file_diff I think | 21:56 |
hashar | has the file become smaller at least? | 21:56 |
clarkb | no | 21:57 |
clarkb | corvus: do you think stopping gerrit, moving the two massive diff caches aside then starting gerrit again is something we should try and if so should we do that nowish? | 21:58 |
clarkb | it does feel like this is related to gerrit becoming overly occupied with cache maintenance on startup leading to an inability to process other tasks/requests | 21:58 |
clarkb | I suspect that growth of those files will be more constrained in the future too since we have daily pruning now when before it was only pruning on startup | 22:00 |
corvus | clarkb: i think we're going to get out of this eventually and the system is sufficiently operable at the moment that we don't need to do that. | 22:01 |
corvus | but maybe taking the hit of that now with a few days before holidays sets us up to be more resilient if there is another problem later? | 22:02 |
clarkb | corvus: that is/was one of my thoughts though I suspect if there is a reason to restart gerrit later we can simply bundle the cache moves into that effort | 22:03 |
corvus | basically -- i'd say let's do that friday, if friday weren't like the last day anyone would be around for a while. so maybe now is better if we want to just nip it in the bud. | 22:03 |
clarkb | ya I think the main reason to do it now would be to see that we can restart gerrit safely with a known process before we have holidays | 22:03 |
corvus | i'm on board with that line of thinking | 22:03 |
tonyb | ++ | 22:03 |
corvus | clarkb: all clear from me whenever you want to do that | 22:04 |
clarkb | ok the process I'm thinking is we down gerrit, move the replication waiting queue aside, move the gerrit_file_diff and git_file_diff files out of the cache and into /home/gerrit2/tmp (this prevents them from being backed up but is the same fs so it should be an immediate mv of the large files), then up gerrit | 22:05 |
clarkb | let me get commands written down for all that then we can send a notice and try again | 22:05 |
corvus | oh look there's a tempfile | 22:05 |
corvus | -rw-r--r-- 1 gerrit2 gerrit2 237595340800 Dec 17 22:05 git_file_diff.h2.db | 22:05 |
corvus | -rw-r--r-- 1 gerrit2 gerrit2 1517463392 Dec 17 21:50 git_file_diff.698130932.385.temp.db | 22:05 |
corvus | i wonder if that can be used to gauge progress | 22:06 |
corvus | anyway -- main thing i was looking for is to "mv gerrit__diff." out of the way -- to make sure we get all the related files. | 22:07 |
corvus | uh not sure if that made it through the bridge right, but you get the idea. | 22:07 |
clarkb | ya basically get the lock file and the trace file and the tempfile if present | 22:07 |
corvus | ++ | 22:08 |
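Putting the plan together as a sketch; paths are assumptions, and the globs are deliberately broad so the .h2.db, .lock.db, .trace.db and *.temp.db files corvus mentions all move together. Gerrit recreates empty caches on startup.

```bash
cd /etc/gerrit-compose                               # hypothetical compose dir
docker-compose down
mkdir -p /home/gerrit2/tmp
# replication waiting queue, as in the earlier restart (assumed path)
mv /home/gerrit2/review_site/data/replication/ref-updates/waiting /home/gerrit2/tmp/
# move every file belonging to the two oversized diff caches out of the cache dir
mv /home/gerrit2/review_site/cache/gerrit_file_diff.* \
   /home/gerrit2/review_site/cache/git_file_diff.* /home/gerrit2/tmp/
docker-compose up -d
```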
clarkb | someone want to work on sending a status notice? I should be ready by the time that gets through | 22:09 |
clarkb | actually I think I'm ready now | 22:10 |
clarkb | how about #status notice You may have noticed the Gerrit restart was a bit bumpy. We have identified an issue with Gerrit caches that we'd like to address which we think will make this better. This requires one more restart | 22:11 |
hashar | +1 :) | 22:11 |
corvus | ++ | 22:11 |
clarkb | ok sending that now | 22:11 |
clarkb | #status notice You may have noticed the Gerrit restart was a bit bumpy. We have identified an issue with Gerrit caches that we'd like to address which we think will make this better. This requires one more restart | 22:11 |
opendevstatus | clarkb: sending notice | 22:11 |
-opendevstatus- NOTICE: You may have noticed the Gerrit restart was a bit bumpy. We have identified an issue with Gerrit caches that we'd like to address which we think will make this better. This requires one more restart | 22:12 | |
hashar | fun: your git_file_diff has a diskLimit of 2G and gerrit_file_diff 3G | 22:12 |
clarkb | hashar: ya that was based on a single day size | 22:12 |
clarkb | which is what the docs suggest | 22:12 |
hashar | the cache pruning that happens on startup does some magic sql queries to keep those caches under those limits | 22:13 |
clarkb | well we had the default limits which were 128MB iirc then we did a restart a day after a prior restart and based the sizes on the reported prune size from the second restart | 22:13 |
hashar | if you have the h2 files on disk that are severely larger than those (230G and 61G), then my guess is you suffered from the same issue I have encountered: the database needs to be compacted | 22:13 |
hashar | which is a different mechanism than the cachePruning one | 22:13 |
clarkb | right, it's the content vs the backing file problem. But now pruning happens daily, which is new in 3.10, so maybe that will keep it under control | 22:14 |
corvus | it would do that only if pruning also compacts the db? | 22:14 |
hashar | that is only keeping the data under those 2G and 3G limits | 22:14 |
corvus | i think hashar is talking about something like postgres vacuuming | 22:14 |
hashar | the compaction is at a lower level (that is the H2 driver itself) | 22:15 |
hashar | yeah same as vacuuming | 22:15 |
opendevstatus | clarkb: finished sending notice | 22:15 |
corvus | so we'd need a "compact h2" system cron job | 22:15 |
hashar | I knew of the concept after Sqlite Vacuum | 22:15 |
hashar | and eventually found H2 has the exact same logic but named Compact | 22:15 |
hashar | when Gerrit connects to the H2 databases through the java driver, the driver does compact upon connection | 22:16 |
clarkb | ok proceeding | 22:16 |
hashar | for up to 20 ms | 22:16 |
corvus | oh that's the timeout you mentioned | 22:16 |
hashar | or up to system property `h2.maxCompactTime` ms | 22:16 |
hashar | yeah sorry I wasn't clear | 22:16 |
hashar | so in 20 ms it can't vacuum much | 22:16 |
corvus | no you were clear :) | 22:16 |
hashar | I gave an arbitrary 15 seconds value, restarted Gerrit some times and eventually the file got smaller | 22:17 |
corvus | so right after clarkb clears out the db file would be a really good time to bump that property | 22:17 |
hashar | and if you have been carrying those h2 cache files over and over as we did | 22:18 |
hashar | then I guess you had the exact same issue I have encountered :b | 22:18 |
hashar | ever growing caches! | 22:18 |
clarkb | oh oops it's already coming back | 22:18 |
hashar | which makes me regret not having pushed that further upstream to have a nice solution implemented | 22:18 |
clarkb | but we can deal with that some time later as in theory we have time for that | 22:18 |
hashar | have you nuked both h2 files?* | 22:18 |
corvus | clarkb: yeah i think we just do that today/tomorrow or maybe next year :) | 22:19 |
corvus | but that's a system-config change and i think we don't want to rush it | 22:19 |
clarkb | ++ | 22:19 |
clarkb | I have confirmed that new h2 backing files and lock files were created | 22:19 |
clarkb | and show-queue looks much much cleaner | 22:21 |
clarkb | and I can see diffs | 22:21 |
hashar | congratulations! | 22:22 |
clarkb | the disk cache pruning things that show up in show-queue are not running; they are the 01:00 scheduled tasks I believe | 22:22 |
clarkb | and distinct from the startup gerrit tasks which have all completed as far as I can tell | 22:22 |
corvus | yep i saw some startup cache pruning that is done now | 22:23 |
clarkb | I'm making notes now to followup with deletion of the large h2 backing files from the tmpdir and to look at the compaction option hashar mentioned | 22:25 |
clarkb | then I'll push an update to my podman docker compose change to make it mergeable | 22:25 |
clarkb | which should be a good exercise of gerrit overall | 22:25 |
corvus | i'll delete the jstack logs | 22:27 |
clarkb | hashar: any idea if this is something upstream knows about? It does seem like h2 is a bad caching implementation if it grows indefinitely | 22:27 |
hashar | I guess I will investigate the size of those cache files. At a quick glance Wikimedia runs with the default of 10mb in memory and whatever the default is for disk based | 22:28 |
clarkb | I think I saw there was someone looking at an alternative caching implementation I wonder if that is better | 22:28 |
hashar | I might have mentioned it on the upstream Discord yeah | 22:28 |
hashar | my guess is Google is using something else | 22:28 |
hashar | but Sap surely relies on H2 | 22:28 |
clarkb | also fwiw I had originally thought that the caches being too small is what led to diffs not loading quickly on startup, but now I'm wondering if that is purely related to the backing files not pruning quickly | 22:28 |
clarkb | and so in theory smaller caches are actually better? | 22:28 |
hashar | so maybe worth filing a task for it or at least have the driver to compact things | 22:29 |
hashar | no clue | 22:29 |
clarkb | we may want to revert the changes to increase the cache sizes and clear out all of the h2 caches and reset our baselines | 22:29 |
clarkb | (I don't think this is urgent) | 22:29 |
hashar | my guess is one wants to look at the hit ratio | 22:30 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run containers on Noble with docker compose and podman https://review.opendev.org/c/opendev/system-config/+/937641 | 22:31 |
clarkb | I think ^ should be mergeable now though it won't do anything until we have followups that run on noble. Probably also want to sanity check that the jobs it does run don't try to install and use podman | 22:32 |
hashar | I am off, congratulations on fixing the Gerrit caches! | 22:32 |
clarkb | hashar: thank you for the help! | 22:33 |
hashar | you are very welcome, I am always happy to help here | 22:34 |
hashar | I guess take note of the magic compact time for next year | 22:35 |
clarkb | yup I have written a note in my notes file locally | 22:35 |
hashar | and do note https://phabricator.wikimedia.org/phame/post/view/300/shrinking_h2_database_files/ | 22:35 |
clarkb | done added the link to my notes | 22:36 |
hashar | also the compaction time should be able to be passed to the java database driver, based on my comment at https://phabricator.wikimedia.org/T323754#8419742 | 22:36 |
hashar | java -cp h2-1.3.176.jar org.h2.tools.Shell -url 'jdbc:h2:file:.git_file_diff;IFEXISTS=TRUE;MAX_COMPACT_TIME=15000' | 22:37 |
hashar | sql> SHUTDOWN COMPACT; | 22:37 |
hashar | The file went from 9G to 302MB \o/ | 22:37 |
hashar | which is a HACKY way to compact it manually :b | 22:37 |
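Cleaned up from hashar's lines above: an offline way to compact a copy of one of these cache databases. This assumes Gerrit does not have the file open and that the copy sits in the current directory; the JDBC URL points at the file minus its .h2.db suffix.

```bash
# open a copy of git_file_diff.h2.db with a generous compaction budget
java -cp h2-1.3.176.jar org.h2.tools.Shell \
    -url 'jdbc:h2:file:./git_file_diff;IFEXISTS=TRUE;MAX_COMPACT_TIME=15000'
# then at the sql> prompt:
#   SHUTDOWN COMPACT;
```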
clarkb | hashar: did you do that while gerrit was running? | 22:37 |
hashar | nop | 22:37 |
hashar | locally | 22:37 |
hashar | err I mean independently | 22:38 |
hashar | I think I copied the whole cache file on my local machine to ease debugging | 22:38 |
clarkb | ya that makes sense; seems likely that would conflict with gerrit h2 operations. especially since they set the default compaction time so short, it must be something you don't want happening while stuff is running | 22:38 |
hashar | but potentially Gerrit could learn yet another setting that would be passed down to the jdbc connection url | 22:38 |
hashar | but I did not look into that since the java property was a quick/good enough fix | 22:39 |
clarkb | makes sense thank you for the pointers | 22:39 |
hashar | at least I wrote a blog post! :b | 22:39 |
hashar | what I wonder is why you would have the issue only triggering now | 22:40 |
clarkb | hashar: I'm guessing the size went over some threshold that caused us to not trim within the timeout of some request or client and then that snowballed into many many requests | 22:41 |
clarkb | it is interesting that Gerrit started almost immediately with no file diff delay etc after moving those files aside | 22:41 |
clarkb | makes me feel more confident this was related | 22:41 |
hashar | most probably yeah | 22:42 |
hashar | the cache pruning tries to keep your caches below 2G/3G apparently. Maybe it does not run that often | 22:43 |
hashar | anyway, no cache, no slowness :) | 22:43 |
hashar | you are probably set for the next ten years | 22:43 |
clarkb | ha indeed | 22:43 |
hashar | I am off for real! | 22:43 |
clarkb | goodnight | 22:43 |
clarkb | timburke: I'm just now seeing your comments on the swift container deleter script that I wrote. Sorry, I have ignored that mostly because more important things have been fighting for my attention. I'll try to get back to it at some point as we still have need for it, and thank you for the feedback | 22:45 |
clarkb | infra-root I've confirmed that I can push a change, comment on a change, search changes, view diffs and do other web ui actions. I've also fetched the new patchset I pushed from gitea, implying replication is working | 22:47 |
clarkb | the only thing I haven't seen yet is a change merge | 22:47 |
clarkb | I don't have any changes that are in a good spot to merge right this moment. Does anyone else have a change we can merge as a canary? | 22:48 |
tonyb | sorry I don't think I do | 22:50 |
corvus | actually yes | 22:55 |
corvus | +3 https://review.opendev.org/935970 | 22:55 |
clarkb | awesome thanks! | 22:56 |
clarkb | corvus: hrm that change isn't showing up in zuul | 22:57 |
clarkb | but zuul says it is starting the jobs so it must've seen it? | 22:57 |
clarkb | oh that's opendev/zuul-jobs not zuul/zuul-jobs | 22:58 |
clarkb | I see it now | 22:59 |
clarkb | also mordred's wandertracks has a couple of changes in the gate queued up forever, I wonder what is with that | 22:59 |
clarkb | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_46d/937641/13/check/system-config-run-review-3.10/46d66cc/bridge99.opendev.org/ara-report/results/147.html I think that shows gerrit system-config-run jobs properly selecting the defaul.yaml from my updated install-docker role and not the noble specific file | 23:34 |
clarkb | I see it installing docker-compose too and not docker compose | 23:34 |
clarkb | corvus: the opendev tenant also has some stuck promote jobs for an opendev/base-jobs merge from a month ago. I suspect we can simply evict those changes and the wandertracks ones and move on, but before we attempt that I wanted to make sure you didn't want to try and debug this | 23:36 |
clarkb | though I suspect log files may have rolled over | 23:37 |
opendevreview | Clark Boylan proposed opendev/zuul-jobs master: Compress raw/vhd images https://review.opendev.org/c/opendev/zuul-jobs/+/935970 | 23:47 |
clarkb | corvus: ^ there was a post run failure and I think that should fix it | 23:47 |
clarkb | I ended up using https://review.opendev.org/c/opendev/sandbox/+/934997 to test merging. That worked and it replicated here: https://opendev.org/opendev/sandbox/commits/branch/master | 23:49 |