Wednesday, 2024-04-03

hasharfungi: I got another hack for `git-review` to let it show some colors when pushing a change https://review.opendev.org/c/opendev/git-review/+/914000 :)06:47
hasharthe remote sends back some messages and cgit colors some of them, but that only happens when run interactively (stdin or stderr is a tty)06:48
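For context, here is a minimal sketch of the tty-gating idea behind that change. It is not the actual git-review patch; the helper name and the color.ui override are illustrative assumptions:

    # Sketch only: enable color when attached to an interactive terminal,
    # roughly the condition git itself uses for color.ui=auto.
    import subprocess
    import sys

    def push_with_remote_colors(remote, ref):
        cmd = ["git"]
        if sys.stderr.isatty():
            # Assumption: forcing color.ui keeps git's own colouring enabled
            # even when git-review captures the subprocess output.
            cmd += ["-c", "color.ui=always"]
        return subprocess.run(cmd + ["push", remote, ref], check=True)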
opendevreviewAntoine Musso proposed opendev/git-review master: Show remote side colors when color is enabled  https://review.opendev.org/c/opendev/git-review/+/91400006:49
hashar:)06:49
opendevreviewTobias Rydberg proposed opendev/irc-meetings master: We have decided to adjust meeting time to 0700 during summer time.  https://review.opendev.org/c/opendev/irc-meetings/+/91494008:06
namanhello11:19
namananyone online here ??11:20
funginot sure what's going on, but in the middle of the openstack coordinated release we've been getting stalled by near-constant "management events" (i guess pipeline reconfigurations?) taking upwards of 40 minutes to clear13:27
fungimaybe it's lots of job configuration changes merging, though that seems unlikely. i wonder if there's ongoing changes in some external repo we're watching that's triggering them13:28
fungii don't really know what to look for in the scheduler debug logs to isolate the cause13:28
fungithere's not that many tenant reconfigurations in the log13:30
fungione at 09:59:42 because of an update to openstack/magnum and another at 11:16:44 from an update to nebulous/nebulous13:31
fungioh, though the latter was for the nebulous tenant13:31
fungiso yeah, only one reconfiguration for the openstack tenant during the relevant time period13:32
fungii wonder what else those per-pipeline management events could signify13:32
fungibecause trigger and result events just seem to keep stacking up until the management event clears, which seems to take ages13:33
fricklerI'm also seeing semi regular errors in zuul debug logs due to the replication mentioned in #zuul https://opendev.org/opendev/sandbox/commit/1c3e7447e7046e2bb6d7ae364e4f4c83090c185a, not sure if that can affect event processing13:44
fungifuture improvement idea: move opendev/sandbox to a dedicated "sandbox" tenant in zuul13:47
Clark[m]fungi: reconfigurations should be pretty clearly logged if that is what is doing it. I don't think that dependency in supercedent pipeline should affect zuul that way but you can dequeue it manually to check13:53
Clark[m]You seem to have touched on this in the release channel a little but my hunch is it has to do with all the ref creation events for the tags. Each of those tags represents a zuul config I think13:54
Clark[m]Though that shouldn't be any different from previous releases13:54
fungiright now the release-post pipeline has a new management event, which seems to mean we wait half an hour for its status to update and anything else there to start running13:55
fungier, sorry, release pipeline this time13:56
fungithis has been occurring over and over in different pipelines though13:56
fungiokay, now it's up to 2 management events for the release pipeline13:56
Clark[m]fungi: is there an event ID in the logs associated with those events that you can trace to see where the time is lost?14:01
fungithose did clear, but it's still not processing all the results (been sitting at 0 trigger, 0 management, 13 results for a while)14:01
Clark[m]And have we evicted the sandbox change that may be problematic?14:02
fungii haven't looked at the sandbox change, not sure what that's all about14:02
fricklerClark[m]: the changes were merged, you would need to merge the revert14:02
fungior whether it needs to be preserved in the pipeline for inspection14:03
fricklerwould it make sense to take a stacktrace of zuul schedulers at this point in time?14:03
Clark[m]frickler no I mean evict it from the zuul queues if it is still there14:03
fungiokay, release pipeline is moving again now14:03
fricklerClark[m]: yes, it is in post, so not change but ref. we could try to evict it unless corvus1 wants to do some looking at it first14:03
Clark[m]fungi: someone was being "helpful" trying to reproduce a bug they found in zuul via the sandbox repo14:03
Clark[m]It apparently got an event stuck in the queues and we could in theory manually dequeue it if we were concerned it is related14:04
Clark[m]But also just tailing scheduler logs during the pauses should give hints as to where the time is lost. Might be noisy but you'd see a lot of git merge events for reconfigurations for example14:05
fricklerit seems they also added the failing job to the periodic pipeline, so it would likely repeat anyway. I just don't want to break the running release jobs, they still do seem to be progressing14:05
fricklerfwiw I'm still getting no results on the builds/buildsets pages unless I add some filter like for a job or project14:40
fricklerthat also only affects the openstack tenant14:41
fungiit does seem to spin for a while like its retrieving data though14:42
frickleroh, there's an error in the zuul web log: 2024-04-03 14:41:41,066 ERROR cherrypy.error.139740297355920:   sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (3024, 'Query execution was interrupted, maximum statement execution time exceeded')14:42
fungifun!14:42
fricklerseems zuul show show an error in the UI for that14:42
frickler*should show14:43
fricklerMAX_EXECUTION_TIME(30000) - I think that will be 30s?14:44
Clark[m]Ya js uses milliseconds iirc14:45
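For reference, MAX_EXECUTION_TIME is a MySQL optimizer hint expressed in milliseconds, so 30000 is indeed 30s. A self-contained SQLAlchemy sketch of attaching such a hint to a query follows; the table and columns are made up for illustration and are not Zuul's actual schema:

    # Illustrative only: cap a MySQL SELECT at 30s via an optimizer hint.
    import sqlalchemy as sa

    metadata = sa.MetaData()
    builds = sa.Table(
        "builds", metadata,
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("end_time", sa.DateTime),
    )

    stmt = (
        sa.select(builds)
        .order_by(builds.c.end_time.desc())
        .limit(100)
        # MySQL reads /*+ ... */ hints placed right after SELECT; the value
        # is in milliseconds, matching the 30s timeout in the error above.
        .prefix_with("/*+ MAX_EXECUTION_TIME(30000) */", dialect="mysql")
    )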
fricklermaybe we need to somehow archive old builds? or just add some proper indices?14:46
clarkbfrickler: was that with a specific query?14:49
fungiclarkb: just go to https://zuul.opendev.org/t/openstack/builds with no query params14:49
clarkbok so no query parameters result in that. I wonder what changed14:50
clarkbfwiw the order of magnitude of log entries for sandbox-publish-docs is low enough that I don't think we're in a tight loop adding failed builds to the db due to that14:52
fricklerclarkb: the timeout was without a query14:53
fungii'm wondering if zuul could be getting flooded with internal events or something to cause its queue and tenant event processing to just get drowned out resource-wise14:58
clarkbwe do appear to be getting crawled possibly in a manner similar to what we've seen with gitea14:59
fungioh, like zuul-web could be bombarding the db with queries and we're hanging on writes?15:00
fungithough i wouldn't expect that to block at the db level15:00
*** blarnath is now known as d34dh0r5315:00
clarkbno, but they are querying things like change logs which may also do db lookups?15:00
fungiload average on the schedulers is very low too15:00
fungioh, but maybe the trove instance is overloaded15:01
clarkbya its a db load issue I think15:01
clarkbweb is just a simple passthrough to fetch that data15:01
clarkbI don't think we're running the apache filter of user agents on zuul web's apache instances but I think we should15:02
clarkbbut that may still be an orthogonal issue15:02
clarkbif you look at apache logs there is a python requests user agent hitting the 500 error for the timeout consistently for some time15:06
fungithat could certainly be related to the query timeouts we're seeing too15:06
fungieither caused by, or causing15:06
clarkbI think that is our culprit and won't be impacted by our existing user agent filter (and we probably don't want it to be until we understand better)15:06
clarkbseems to have started sometime yesterday15:06
clarkbits in a tight loop failing about twice a minute (which aligns with our 30 second timeout)15:07
clarkbno nevermind its older than that15:08
clarkb(my greps are getting better)15:08
clarkbit started with the 500 errors on Sunday (March 31)15:10
clarkbit seems before that it was iterating through 100 builds at a time (arg why do you need to rescan over and over again)15:10
clarkbbut now it gets the 500 and retries the first request over and over again in a loop15:10
fungicould it be ci-log-processor or whatever it's called?15:10
fungibut yeah, it's probably some phd student doing "research" like we've seen knock gerrit offline in the past15:11
clarkb[2024-03-30 07:18:01.736] ok I think this is when it starts based on the first instance of a 500 error in the apache log on zuul0215:12
clarkband I think zuul-web was restarted on march 30 at 07:1315:12
fricklerjust for reference it seems we have about 20M builds in 4M buildsets. that sounds to me like something mysql should still be able to handle reasonably15:12
clarkbI'm beginning to suspect a regression in zuul after our automatic weekend upgrade15:13
fungiyeah, unless someone's pounding it with queries maybe15:13
clarkbcorvus1: ^ fyi 15:13
fungibut yes, the timing is conspicuous15:13
clarkb`zgrep '/api/tenant/openstack/builds?' zuul-ssl-access.log.4 | grep 'HTTP/1.1" 500' | head` to find the first occurrence that I see15:13
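A rough Python equivalent of that kind of log digging, tallying which user agents hit the builds API and how many of their requests returned 500; it assumes Apache's combined log format, and the file name is a placeholder:

    # Rough helper, not an existing OpenDev tool: per-user-agent request and
    # HTTP 500 counts for the builds API in an Apache combined-format log.
    import gzip
    import re
    from collections import Counter

    LINE = re.compile(r'"(?P<req>[^"]*)" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')
    totals, errors = Counter(), Counter()

    with gzip.open("zuul-ssl-access.log.4.gz", "rt", errors="replace") as f:
        for line in f:
            m = LINE.search(line)
            if not m or "/api/tenant/openstack/builds" not in m.group("req"):
                continue
            totals[m.group("ua")] += 1
            if m.group("status") == "500":
                errors[m.group("ua")] += 1

    for ua, count in totals.most_common(10):
        print(f"{count:6d} requests, {errors[ua]:6d} 500s  {ua}")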
clarkbI wonder if that may be related to the slow event processing too (a bug affecting zuul in both ways I mean)15:14
fungiwithout more data, it seems at least possible that they're linked15:14
fricklerchecking git log https://review.opendev.org/c/zuul/zuul/+/913875 spring to my eye15:14
clarkbI think there are ~3 things going on that we should consider addressing 1) good bots that will respect robots.txt crawling data that probably shouldn't be crawled we should write a robots.txt to refine what if anything gets crawled 2) bad bots that use weird user agents that we should apply our existing apache ruleset to block on zuul's apache servers and 3) a likely zuul regression15:20
clarkbafter the weekly upgrade leading to timeouts trying to load builds from the db and possibly other related slowness15:20
clarkb1) and 2) are straight forward and may even help the situation with 3) by reducing total queries15:21
fricklerI'm no mysql expert but could that query block the table for the 30s while it is running and thus cause executor updates like event processing to be blocked?15:21
clarkbfrickler: generally I would expect read locks on rows to not interfere with write locks, but maybe we don't have those locks properly set up in the orm and it is being conservative and safe?15:22
clarkbI think we're in no worse a spot than we were since the weekend so I'll take a break here for morning things like food and shower. I can work on 1) and/or 2) when I get back15:23
clarkbfrickler: I would agree that change you identified is suspicious15:26
corvus1i'll take a look at that query15:58
*** gthiemon1e is now known as gthiemonge15:59
clarkbunrelated to everything else rpm's man page doesn't have ascii -'s it has ‐'s instead. This is very disappointing16:17
fricklerah, wasn't there a big discussion about what's typographically correct recently?16:28
funginot sure about recently, but it's been coming up in *roff related discussions for at least 3 decades i can remember, likely closer to 516:30
opendevreviewClark Boylan proposed opendev/system-config master: Apply apache us filter to Zuul  https://review.opendev.org/c/opendev/system-config/+/91498516:34
clarkbya it came up somewhere maybe hn16:35
clarkbI would've expected it to be a solved problem in man pages for popular utilities like rpm though16:35
clarkb914985 is 2) from above16:35
clarkb1) I realized is a bit more tricky because we have to decide what if anything we want to be crawlable. I'm tempted to say nothing should be?16:35
clarkbbut at the very least crawling our log files seems like unnecessary bw for our cloud providers16:36
fungithere's no reference material on zuul.o.o, so i'm inclined to say there's nothing worth indexing by search engines at least16:37
fungialso the ability of unvetted actors to use it to publish content/urls in some locations probably means search engine indexing creates an attractive nuisance for spammers and scammers16:38
clarkback I'll make a blanket please don't index rule16:39
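For reference, a blanket do-not-index robots.txt is just two lines. Note it only helps with well-behaved crawlers; the user agent filter change above is what deals with bots that ignore robots.txt:

    User-agent: *
    Disallow: /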
*** cid__ is now known as cid16:55
*** jroll00 is now known as jroll016:55
*** Reverbverbverb_ is now known as Reverbverbverb16:55
*** gmann_ is now known as gmann16:55
clarkbI think we lost gerritbot16:58
clarkbremote:   https://review.opendev.org/c/opendev/system-config/+/914989 Add robots.txt to Zuul web [NEW]16:58
clarkbThats 1) and 2) proposed. reviews welcome17:00
corvus1regarding the database issue: i think i have two half-solutions that add up to a whole one:  first, we can add some index hints to the query, that improves the runtime for queries locally for me with mysql 8.x.  they do not improve the runtime in all cases on the opendev production database.  also, they might actually produce the wrong data on opendev's production mysql.  so i think we should also move to a mysql 8.x trove instance.  we can do17:11
corvus1that relatively quickly, though it will require a small downtime (maybe 20 mins?) for the final copy step.17:11
clarkbwe did have a small outage previously where the db wasn't available and zuul mostly kept working but things weren't recorded properly obviously17:13
clarkbI guess we can decide if we want to shutdown zuul or go through that half working state again?17:13
corvus1true, we could actually just decide to lose a few minutes of data17:13
clarkbcorvus1: also I guess we want to land the index updates first? Any idea what caused the change? is it the change frickler identified?17:14
corvus1we would just end up with missing builds.17:14
clarkband if it is that change I guess my next thought is "is that change valid if it has such a big impact"17:14
clarkbre downtime or missing builds I'll defer to fungi and frickler as they are more involved in the openstack release process today. I suspect we can tolerate that elsewhere17:14
corvus1clarkb: that change gave us correct results at the expense of speed; so a revert is a reasonable solution for opendev, but not a good long term solution17:15
clarkbI see17:15
corvus1adding the index hints would give us potentially incorrect results in opendev, and not necessarily speed up all the queries we need to17:16
clarkbbut with a more modern mysql it would both be correct and relatively quick (at least compared to human wall clock scale?)17:16
corvus1i think the incorrect results from adding the index hints on old mysql would be more visible than anything we've seen yet -- the builds page would show the first 100 builds instead of the last 100.  despite the very explicit "ORDER BY ... DESC" in the query.17:17
corvus1so i think in opendev, we want mysql 8 first, then the index hints.17:18
clarkb++ that does seem like it would be problematic for the common use case of that search page17:18
corvus1and yes, once we have both, at least in my local testing, the builds and buildset queries should both be nearly instantaneous.17:18
clarkbso ya no objections from me. I think the main thing is to coordinate with fungi and frickler to ensure we minimize impact to the openstack release process (not sure if anything is left for that)17:19
corvus1agree; i'll spin up a trove mysql 8 to prep, but await more input before proceeding past that17:20
fungisorry, had to step away for a moment but catching back up...17:22
fungiopenstack release activities have wound down now, so i'm not concerned about a few missing build results if we go that route17:23
fungialso, unless we stop gerrit at the same time, having zuul offline would mean missing events, which isn't necessarily any better17:25
fungii *think* rackspace lets you do in-place mysql upgrades, but a cut-over to an already running and populated second instance might be faster17:26
fungiwhat's the order of operations for swapping out the trove instance in that case?17:26
fungiis zuul going to reread the configuration for the dburi or is there still some sort of restart needed?17:27
clarkbfungi: I think its prevent access from zuul to existing trove instance, dump db, restore db into new instance, update zuul to point at new instance17:27
corvus1start mysql 8; import data; restart zuul-web with new dburi17:27
clarkbthats a good question I think zuul may need a restart to pick up the ini config17:27
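For context, the connection string lives in the [database] section of zuul.conf, which is read at start-up, hence the restart of zuul-web mentioned above. The host and credentials below are placeholders:

    [database]
    dburi=mysql+pymysql://zuul:PASSWORD@new-trove-host/zuul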
corvus1if we're not worried about losing data, we don't have to prevent writes17:28
corvus1we will end up with incomplete data, but not inconsistent data in the new db, so that should be fine17:28
corvus1remote:   https://review.opendev.org/c/zuul/zuul/+/914990 Restore mysql index hint [NEW]        <-- the index hints change17:29
clarkbah ok I wasn't sure if it would be broken/inconsistent17:29
clarkband I don't think we're too worried about losing data. It has happened before and impact was minimal (almost no one noticed even)17:30
corvus1the relations are like a tree, we'll just be missing leaf nodes, and updates to buildsets to say they are complete17:30
fungiyeah, that's exactly what we observed in the past17:30
corvus1(technically, there is a cycle in the relations graph, but there is still a definite order they are created in, so it doesn't affect this issue)17:30
fungisomeone was looking at an affected build result and confused that zuul thought it had never completed17:31
fungiwhich was what led me to discover the db outage17:31
corvus1heh our db is called "zuul-v3" maybe i should call the new one "zuul-v10"17:34
clarkbnice17:34
fungiwe coulda had a v817:34
clarkbzuul the hot rod or zuul the tomato juice?17:35
fungiwhy can't it be both?17:35
clarkbhot rod powered tomato blender17:37
corvus1do we want more ram?  currently 4g17:37
clarkbperformance has been fine I think. But maybe not super fast at times17:38
clarkbmore ram can't hurt I guess17:38
corvus1their internal monitoring thing just has the last 6 hours and shows all memory used17:38
clarkbprobably a bit of an abnormal state in the last 6 hours but also good enough indication to me we should bump it17:39
corvus1there's no separate cache page metric, so it could be fine17:39
corvus1but <shrug> i'm also inclined to maybe bump it up to 817:39
clarkb++17:39
fungii think it will always show all memory used17:41
corvus1when we move to self-hosted, that will be a nice benefit17:41
fungiit basically tells mysqld to use up to the amount available, and that will quickly end up as query cache17:41
corvus1fun fact: no query cache in mysql 8 (but there are other kinds of index caches -- some of which persist on disk!)17:42
fungiinteresting17:43
corvus1oh i just noticed in the web ui we can enable the root user (but rax may not support us if we do).  just filing that bit of info away17:44
fungii suppose people who want an actual query cache just inline a separate proxy daemon these days17:44
corvus1the zuul-v3 server has some server parameters set, but i think the defaults in mysql 8 are equivalent or better, so i don't think we need a custom config17:56
corvus1(zuul-v3 has character_set_server=utf8 (which is an alias for utf8mb3) and collation_server=utf8_bin and wait_timeout=28800)17:57
corvus1the defaults in mysql 8 are utf8mb4 and that same wait_timeout value17:57
corvus1so i think our forward-compatible config has done its job :)17:57
fungisounds good17:59
fungii guess utf8mb4 could make some tables larger, but i doubt we're short on disk17:59
clarkb++ to utf8mb418:02
corvus1i timed a mysqldump: 16m19:32
corvus1when i finish these potstickers, i'll time a restore19:33
clarkbthanks!19:34
*** carloss_ is now known as carloss19:46
fungiclarkb: looks like https://review.opendev.org/914989 isn't quite working. maybe we need a carve-out in the proxy?20:26
fungii'm heading out to run some errands and grab dinner but will be back in an hour or so20:27
clarkbya I wasn't sure if the mod rewrite proxying would work with the location directive like that20:27
clarkbI guess not20:27
clarkbI see that etherpad works differently I'll update20:27
opendevreviewClark Boylan proposed opendev/system-config master: Add robots.txt to Zuul web  https://review.opendev.org/c/opendev/system-config/+/91498920:31
clarkbsomething like that maybe?20:31
corvus1the mysqldump is going to create the tables with utf8mb3 instead of mb4; i'm thinking we should go ahead and sed the dump file to change that.20:36
clarkbcorvus1: thinking out loud here: that should be safe because we're using mysqldump which isn't a binary backup20:50
clarkband we can probably test that process ahead of time?20:50
corvus1yeah, i'm currently restoring an unaltered dump; can repeat with the mb4 version20:51
corvus1the restore is not fast20:51
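A rough sketch of that charset munging step as a streaming rewrite of the dump file; the file names are placeholders, a plain sed over the dump would do the same thing, and it assumes the utf8mb3 string only appears in table and column definitions rather than in row data:

    # Sketch only: rewrite a mysqldump so tables are created with utf8mb4
    # instead of utf8mb3 before importing into the MySQL 8 instance.
    # Streaming line by line avoids holding the whole dump in memory.
    with open("zuul-v3.sql") as src, open("zuul-v10.sql", "w") as dst:
        for line in src:
            dst.write(line.replace("utf8mb3", "utf8mb4"))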
fungii'm not actually here at the moment, but sounds good to me20:52
opendevreviewMohammed Naser proposed zuul/zuul-jobs master: chore: refactor setting buildset_registry fact  https://review.opendev.org/c/zuul/zuul-jobs/+/91502520:57
corvus1i think the restore is going to be like 90 minutes; the dump was about 15, throw in a few more mins for munging the dump file and we're probably looking at just under 2 hours for the whole operation.21:03
clarkbsorry stepped away for a bit. I wonder if that is long enough we should schedule it or try to do over a weekend?21:45
clarkbif we do do it over a weekend we may want to stop the auto upgrade playbook from running? Just throwing thoughts out there21:45
fungior i can stick around friday afternoon your time22:20
*** dmitriis is now known as Guest1522:38
corvus1yeah, if we can limp along for a bit, then perhaps a non-weekday would be good.  i can plan on doing it saturday.22:42
clarkbI can likely be around on saturday22:46
opendevreviewMohammed Naser proposed zuul/zuul-jobs master: chore: refactor setting buildset_registry fact  https://review.opendev.org/c/zuul/zuul-jobs/+/91502522:47
fungisame22:48
fungialso related, topic:zuul-ua-filter lgtm now22:49
fungilanding that might take some pressure off22:50
corvus1what's the intent there?  would it block folks running requests-based tools?  (also zuul-client? :)23:05
corvus1or is it just blocking known bad uas?23:06
fungithe idea is that we've got a corpus of known made-up/nonsensical user agent strings often used by botfarms that completely ignore robots.txt23:06
corvus1got it23:07
fungithe same ones are presently being filtered for gitea and gerrit i think23:07
fungibecause they flood us with repeated/duplicate requests for computationally-intensive queries, bringing those services to their knees at times23:08
fungii think the bulk of that set was taken from a filter mwf uses to keep the load on wikipedia in check23:08
fungis/mwf/wmf/23:09
fungiand when clarkb was going through the apache logs on the zuul-web servers, he saw quite a few of them trying to scrape builds and buildsets query urls23:12
fungiwhich were probably contributing to the overall db query load23:14
