Friday, 2024-12-06

*** JamesNg[m] is now known as jamesngn[m]01:09
*** elodilles_pto is now known as elodilles07:23
*** mrunge_ is now known as mrunge09:20
*** mko is now known as Guest224312:05
tonybclarkb: fungi: Would it be helpful for me to do steps 4-6 in the gerrit 3.10 upgrade checklist?14:35
fungitonyb: i had already set myself reminders to do those, but feel free!14:35
fungii was just getting ready to do step 6 now, in case running deploy jobs take a little longer to complete14:36
fungiand then step 5 at the top of the hour as a one-hour notice14:37
tonybfungi: okay. Sounds like you have it in hand already.  I don't want to get in the way14:37
fungihappy to share, my morning was sidetracked a bit by a power blip that made it clear to me the ups for my workstation needs replacing. since it's rebooted anyway i'm taking a few extra minutes to upgrade packages/kernel on it14:39
tonybAh okay14:39
fungiopenafs kernel modules take soooo long to build14:40
tonybYes, Yes they do14:42
tonybOkay emergency file updated14:47
fungithanks!14:49
fungitonyb: are you in a position to send the status notice now as well?15:01
tonyb#status notice Gerrit on review.opendev.org is being upgraded to version 3.10 and will be offline. We have allocated an hour for the outage window lasting until 1700 UTC.15:01
opendevstatustonyb: sending notice15:01
fungihah, perfect!15:01
fungithanks!!!15:01
-opendevstatus- NOTICE: Gerrit on review.opendev.org is being upgraded to version 3.10 and will be offline. We have allocated an hour for the outage window lasting until 1700 UTC.15:01
tonybnp, I was just a little distracted so I missed the target by a little15:02
tonyb#ooops15:02
tonyb#status notice Gerrit on review.opendev.org is being upgraded to version 3.10 and will be offline starting at 1600 UTC. We have allocated an hour for the outage window lasting until 1700 UTC.15:02
opendevstatustonyb: finished sending notice15:04
opendevstatustonyb: sending notice15:04
-opendevstatus- NOTICE: Gerrit on review.opendev.org is being upgraded to version 3.10 and will be offline starting at 1600 UTC. We have allocated an hour for the outage window lasting until 1700 UTC.15:04
opendevstatustonyb: finished sending notice15:08
fungigood catch15:09
tonybfungi: Thanks.  Hopefully it's clear enough and if not anyone confused will ask here15:10
fungiyeah, should be fine15:10
fungilooking at trending for the openeuler mirror, if growth rate of the new version is similar to the old one, we'll probably hit the 350gb quota around june15:14
tonybThat doesn't seem sustainable15:14
fungiseems to grow linearly at about 25gb/mo15:15
tonyblong term that's a lot of disk if there isn't a natural reset/truncate point15:17
clarkbI think the reset is probably the next release but as this transition shows there hasn't been much community involvement in getting those rotated regularly15:24
clarkbalso I am awake and almost ready (I still need tea and ssh keys need loading)15:24
tonybYeah I guess that's part of my worry?15:25
clarkbbut ya I personally would like to see openeuler try a no mirror approach given the amount of jobs run there and see if that is stable enough15:27
tonybclarkb: That sounds fair15:27
clarkbI have tea and ssh keys are loaded15:39
clarkbI've confirmed the preflight items (the logging config is in place and the change to swap our gerrit image tag in prod after we do manual things has a +1 from zuul)15:41
tonybsounds good to me15:42
clarkbthe emergency file looks good and I see the notice was sent15:42
clarkbapparently emergency file edits were in the file twice so I trimmed the extra one.15:43
clarkbscreen is started and logging in screen is enabled too15:43
fungiattached15:44
* tonyb is also attached FWIW15:47
clarkbsounds good. I'm happy to drive since I've gone through this several times in the last several weeks15:49
fungii've also given the openstack release managers a couple of reminders this morning15:53
clarkber I meant the emergency file edits step was in the etherpad twice so I edited the etherpad15:56
clarkbthe actual emergency file looked fine15:56
clarkbI think I've just remembered that we need to restart zuul schedulers to pick up the new gerrit version too, but I don't think zuul has any new functionality based on that version so its probably fine for this upgrade if it waits for our normal friday/saturday upgrade process15:57
tonybclarkb: ooooh that makes more sense ... I couldn't figure out how I'd added the content twice15:57
clarkbtonyb: sorry about that. I reread what I wrote and realized it wasn't very clear15:57
tonybclarkb: all good15:57
corvusclarkb: agree re zuul15:58
clarkbI'll make a note about the zuul scheduler thing on the etherpad so we remember for next time15:59
clarkb#status notice Gerrit on review.opendev.org is being upgraded to version 3.10 and will be offline momentarily. We have allocated an hour for the outage window lasting until 1700 UTC.15:59
opendevstatusclarkb: sending notice15:59
-opendevstatus- NOTICE: Gerrit on review.opendev.org is being upgraded to version 3.10 and will be offline momentarily. We have allocated an hour for the outage window lasting until 1700 UTC.16:00
clarkbonce that gets back with a complete notice I'll start with the disruptive commands (down gerrit and proceed with the rest of the upgrade process)16:01
clarkbin the future maybe we send the notice at 15:55 UTC :)16:03
opendevstatusclarkb: finished sending notice16:03
fungithere we are16:03
clarkbok starting now16:03
clarkbfs backup is done the db backup is running now16:05
clarkbit is done too both subcommands report terminating with success status, rc 0 in the log file16:06
fungiand no containers running16:07
clarkbthese indexes are not small16:09
clarkbI half wonder if we should skip this step for the future and just reindex from scratch if we downgrade16:10
fungii think it's good to keep16:10
fungiyou can plan a longer outage for the upgrade, but rollback is always likely to end in a scramble16:11
fungiso easing the rollback process seems like a reasonable choice16:11
clarkback16:13
clarkbpulled image looks correct to me16:13
clarkbok now is when I'll lean on ya'll to check gerrit functionality looks good to you16:15
clarkbwe are pruning caches and reindexing all of the indexes though so things might be a bit slow16:15
clarkbI'm going to check on reindexing progress16:16
clarkbreindexing appears to be moving forward but it has a lot of work to do16:16
clarkbthe web ui loads for me and reports the expected version. I can see change lists and changes and at least one file diff16:17
fungigrrr... my broadband provider picked now to have an outage16:17
clarkbI'll leave the log file tail in the first screen window and create window 1 to check the config diffs16:18
tonybUI looks good, version is as expected logout/login (with OpenID) "just worked"16:18
fungii'm on a phone terminal right now so no longer following the screen session, but will try to get up on a tether asap16:19
tonybchanges load but currently there isn't any diff data16:19
clarkbthe config diff lgtm. As expected there is a delta but it is limited to the email soy templates16:19
clarkb(we don't manage those)16:19
clarkbtonyb: ya diffs take a bit to load as caches trim and refresh but after 5-10 minutes should be consistently available16:19
clarkbchange upload, recheck and zuul enqueuing things, replication and eventually merging a change are the big items to check on16:20
clarkbanyone have a change to upload or push a new patchset to?16:21
tonybclarkb: yeah, I know it's expected, just calling it out so that I can confirm when it starts working (which it has done)16:21
clarkbI suspect those tracebacks for commits not being found may be due to the reindexing backlog16:22
clarkbI'm happy with reindexing progress down to under 2k tasks16:22
clarkb(we started with over 16k)16:23
tonybnice16:23
clarkbof course it is a long tail as it gets into projects like nova and neutron and cinder but progress is progress16:23
clarkbI rechecked https://review.opendev.org/c/opendev/system-config/+/935395 because it runs only a few jobs16:24
clarkband zuul has enqueued things for that recheck16:24
clarkbI don't see any new changes or patchsets since the update16:28
clarkbI guess I'll work on a noop change to check that16:28
opendevreviewClark Boylan proposed opendev/system-config master: DNM testing  https://review.opendev.org/c/opendev/system-config/+/93726616:29
clarkbok new change create worked per ^ and zuul has enqueued it16:29
tonybThanks16:30
clarkbI'm going to check replication of 937266 next16:30
fungiokay, finally on a slightly more useful computer via phone tether16:31
clarkbgit fetch origin refs/changes/66/937266/1 for origin https://opendev.org/opendev/system-config (fetch) succeeded and git show FETCH_HEAD looks correct to me16:32
clarkbthat has me checking off all of the things we've explicitly listed to check for functionality in the etherpad16:32
fungiseems like i missed the most exciting steps, thankfully nothing got more exciting than expected16:32
clarkbbut please do your own checks as we have different browsers, network setups, and workflows16:32
clarkbfungi: I wonder if you should get on the starlink bandwagon :)16:33
clarkbalso welcome back16:33
tonybOh feck I was making the replication check way more complex than it needed to be.16:34
fungiyeah, there's a new fiber provider rolling out lines here, so hoping to give them a try soon16:34
clarkbtonyb: heh ya its a bit hard to discover the magical refs since they are mostly implementation details but we replicate them too so they make a quick and easy check16:34
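
The "magical refs" mentioned here follow Gerrit's change-ref layout: `refs/changes/<last two digits of the change number>/<change number>/<patchset>`. A minimal sketch of building one is below; the helper name `change_ref` is hypothetical and not part of any tool used in this log.

```python
def change_ref(change: int, patchset: int) -> str:
    """Build the Gerrit ref name for a given change number and patchset.

    Hypothetical helper for illustration. The shard directory is the
    change number modulo 100, zero-padded to two digits.
    """
    return f"refs/changes/{change % 100:02d}/{change}/{patchset}"


# The ref fetched during the replication check in this log:
print(change_ref(937266, 1))
```

The resulting string can be passed to `git fetch origin <ref>`, after which `git show FETCH_HEAD` displays the replicated patchset.
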
clarkbwe're currently waiting on reindexing to complete per the etherpad but as mentioned before any additional testing people do is appreciated16:35
tonybclarkb: I was going to each gitea server bypassing the haproxy and checking the object was there, which is more full gitea checking than gerrit -> gitea replication checking16:36
clarkbtonyb: ah I see. That doesn't hurt but ya shouldn't be necessary to check if new gerrit can talk to the version of gitea we run16:37
clarkbif a specific gitea has a problem I wouldn't expect that to be due to the upgrade16:37
tonybYeah. My thought process was just plain wrong16:39
clarkbwhen reindexing is complete and we're happy with the testing we have done https://review.opendev.org/c/opendev/system-config/+/937051 is the next thing to get in place so we can remove the emergency file entries16:40
clarkbcorvus: I think zuul jobs that attempted to update state from gerrit while gerrit was done have reported merge conflict16:42
clarkbcorvus: I believe this is expected and a recheck will sort them out; however, I wonder if we should update zuul to differentiate between an actual merge command fail and network access/git access failures16:42
fungis/done/down/ presumably16:42
clarkbyes down sorry16:42
clarkbjust about 1k reindexing tasks remaining16:43
clarkbhttps://review.opendev.org/936705 is a change in the gate that should merge soon16:44
fungiat least this time the fiber cut is only a mile down the road, not off the side of a deserted stretch of highway over on the mainland like usual16:45
clarkbfungi: you should lobby them to wrap the fiber in lots of kevlar and steel16:46
fungiand whatever cern uses for antimatter containment16:46
clarkbhttps://review.opendev.org/c/openstack/ansible-collection-kolla/+/936705 did merge16:47
fungiseems things are generally working post-upgrade. great work!16:47
clarkbreindexing is complete. 3 changes failed to reindex and quickly skimming the log its ancient changes that we've had problems with before16:51
clarkbso I think that is expected16:51
clarkbI'm going to stop the logfile tail now16:51
fungithanks!16:51
tonybSounds good16:51
tonybThanks clarkb 16:51
clarkbso I think the next step is to decide if we are happy with this and if so land the change to reflect the new version in system-config16:52
clarkbI haven't seen anything concerning yet so I think we can proceed with landing that change16:52
clarkblanding that change doesn't prevent a rollback later just makes it slightly more painful16:52
fungisounds good to me, virtual +216:52
clarkbI removed my -W16:53
clarkbif at least one other person can +2 before I approve that would be great (so we record it in there more than just a virtual +2)16:53
clarkbhttps://review.opendev.org/c/opendev/system-config/+/937051 this change16:54
clarkbor maybe fungi can just add the +2 later?16:57
clarkbI'll go ahead and approve it we have the time it takes to gate for anyone to protest and ya'll can add +2's later16:58
clarkband just before that appears ready to merge I'll remove nodes from the emergency file so we can check idempotency17:00
fungii can +2 for reals once my broadband is back, getting logged into gerrit or setting up gertty on this connection is going to be a bit of extra work17:00
clarkbI just don't want history in gerrit to look like I was ninja upgrading gerrit17:00
clarkbbut even that is not a big deal17:00
clarkband while I wait I'm going to refresh my tea17:01
tonybI have +2+A'd 93705117:02
fungithanks tonyb!17:03
clarkbtyty17:03
tonybfungi: good luck with your broadband :/17:04
clarkbI think our total outage was about 15 minutes? maybe 2017:04
clarkband most of that was waiting for things to copy/backup17:04
fungiyeah, if it goes on much longer i'll swap my workstation over to the phone tether, though it's really not optimized for lower-bandwidth activity17:05
fungithey're estimating service restoration in the next 3 hours, so i may need to bite the bullet for now17:06
clarkbmy ISP is getting purchased by Bell Canada17:06
tonybclarkb: 14 mins going off the IRC timestamps, and roughly 10 was waiting on backups ;P17:07
corvusclarkb: it's not trivial to make that differentiation in zuul.  we are intentionally conservative about what information git outputs that we send back to the user.17:07
clarkbsomewhat concerned because the scrappy localish ISP i've been on has been great from a communications perspective17:07
clarkbcorvus: ack and its understandable from our end by comparing timestamps and we can just tell people to recheck17:07
tonybclarkb: Yeah I can understand that feeling17:07
fungiis all of oregon seceding from the usa to become part of canada, or only the portland area?17:08
clarkbthe main network admin is on their subreddit answering questions. I have a feeling that will stop once the acquisition completes17:08
clarkbfungi: the main secession movement is "cascadia" which is where oregon and washington steal BC and we become our own thing17:08
fungithat definitely sounds like more work17:09
clarkbthere is also the state of jefferson which is southern oregon and northern california becoming a 51st state17:09
clarkband then "greater idaho" which is where eastern oregon splits off to join idaho17:09
clarkbI wonder if any other part of the country has this many border realignment groups. Hawaii and alaska maybe?17:09
clarkbI suspect the problem is people who live in oregon really like it (weather isn't too extreme, its beautiful almost everywhere) but then get annoyed when politics don't align so rather than moving themselves want to move the borders17:10
opendevreviewClark Boylan proposed openstack/project-config master: Update jeepyb triggered Gerrit builds to Gerrit 3.10  https://review.opendev.org/c/openstack/project-config/+/93726817:16
clarkblooking at post upgrade tasks led to ^ there are several other tasks too that aren't super urgent like dropping 3.9 image builds and adding 3.11 and updating the upgrade job17:19
clarkbthat said I suspect the issue corvus found with 3.11 means adding 3.11 and updating the upgrade job may be a bit of work compared to the previous upgrade lookahead job updates17:20
clarkbwhich makes me less enthusiastic for jumping into that today17:20
clarkbI'll probably try to get some naive changes up today then we can work through the broken next week17:20
tonybAlso I suppose the dropping 3.9 can/should wait until we confirm a rollback isn't going to happen17:22
clarkb++17:23
clarkbI think the change to update the gerrit version is going to land a minute or two after hourly jobs enqueue17:47
clarkbthat said I think it is safe to remove hosts from the emergency file while hourly jobs run because the hourly jobs don't hit review17:49
clarkbso I'll go ahead and do that now17:49
clarkb(also we don't automatically restart gerrit so even if things update out of sync the running service should continue on the version we want until we fix the on disk version)17:49
tonyb++17:49
* tonyb puts laptop down and thinks about dinner17:50
clarkband done, enjoy dinner17:50
clarkbI will check the docker-compose file after jobs run inside of the screen so that we log that then I'll stop the screen and move the logfile17:53
clarkbas expected hourly jobs have started before the change merged18:01
clarkbthis shouldn't be a problem it will just delay our idempotency check by 15-20 minutes18:02
opendevreviewMerged opendev/system-config master: Update Gerrit image tag to 3.10 (from 3.9)  https://review.opendev.org/c/opendev/system-config/+/93705118:05
opendevreviewClark Boylan proposed opendev/gerritlib master: Test gerrit lib (and jeepyb) against Gerrit 3.10.3  https://review.opendev.org/c/opendev/gerritlib/+/93727618:11
opendevreviewClark Boylan proposed opendev/system-config master: Remove log cleanup cronjob from review  https://review.opendev.org/c/opendev/system-config/+/93727718:16
opendevreviewClark Boylan proposed opendev/system-config master: Drop Gerrit log cleanup cron from Ansible  https://review.opendev.org/c/opendev/system-config/+/93727818:16
clarkbinfra-prod-service-review is running now18:22
clarkbansible reports "changed": false, when writing out the docker-compose.yaml and 3.10 still shows up in the file on disk and the timestamp for file modification hasn't changed18:25
clarkbif anyone else wants to confirm that in the screen session that would be great then I'll turn off the screen18:25
clarkbmanage-projects is running now18:26
clarkbhttps://zuul.opendev.org/t/openstack/build/ac72d6bfe8174fa3abca27d817880320 manage projects was successful18:30
clarkbskimming the log on bridge I see that project acl updates are all skipped as anticipated and there are no obvious errors18:31
clarkbso ya I think we're good for both manage-projects/jeepyb and also docker-compose18:31
clarkbI'm going to shutdown the screen now18:32
clarkboh wait someone was checking something looks like buffer got cleared I'll wait for that to be done (let me know)18:32
fungithat was me, sorry, i'm not succeeding at getting into the screen session (it's acting hung after i screen -x, but i guess keypresses were going through). since it seems like nothing went sideways, i'm going to step away and get the shower i skipped earlier. power outage and network outage in the same morning has made this day productivity-challenged18:33
clarkback and ya I think everything looks good you can always take a look later18:34
clarkbquitting screen now18:34
fungithanks. i think the background traffic volume in/out of my workstation is just overwhelming this phone tether18:35
clarkbthe screenlog is in the upgrade scratch dir and screen is closed18:36
clarkbcleaning up autoholds and writing changes to drop 3.9 images and testing add 3.11 images and testing as well as landing the other post upgrade changes I've pushed are next on the agenda. However, I don't think there is a rush on that as the further we get on those the more painful any rollback will be18:37
clarkbfungi: assuming network connectivity improves for you I think the main things to check are that docker-compose.yaml looks right, that manage-projects looks happy to you (log is on bridge in the ansible log dir), and that the screenlog we captured is complete18:39
clarkbbut I've checked all that and it lgtm so I'm not super worried18:39
funginoted18:39
fungii'll hopefully be able to take a look soon18:39
clarkbI'm going to take a short break (I haven't eaten anything today) and then work on proposing more followup changes18:40
opendevreviewClark Boylan proposed opendev/system-config master: Update gerrit image build job dependencies  https://review.opendev.org/c/opendev/system-config/+/93728819:43
opendevreviewClark Boylan proposed opendev/system-config master: Drop Gerrit 3.9 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93728919:44
opendevreviewClark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93729019:44
clarkbinfra-root of ^ https://review.opendev.org/c/opendev/system-config/+/937288 should be landed soonish. The other two are less urgent19:44
clarkbI've also been using topic:update-gerrit-3.10 if you want to load all of them up at once19:46
clarkblooks like we need to land the jeepyb dependency update before zuul will test the gerrit image changes (this is due to jobs being removed that are still used by jeepyb until that change lands) https://review.opendev.org/c/openstack/project-config/+/937268 is the other change that would be good land soonish as a result19:51
fungiugh, isp has bumped my repair estimate out another 3.5 hours19:51
fungii've been taking the opportunity to clean up around my lab and reroute wiring at least19:52
clarkbfwiw I have reproduced the login behavior from 3.9.8 on 3.10.3 (well as far as checking login links and refreshing to see they updated I haven't done a full login pass)19:54
clarkbI'll make a note to look into getting my fix that landed on 3.9.8 merged up into 3.10 soonish19:54
clarkber it landed on stable-3.9 after 3.9.819:54
clarkbtypically they roll things forward themselves too, but my guess is with holidays and very few changes on 3.9 so far pushing the merge commit up myself will speed things up19:55
clarkbfungi: if your internets come back and you haven't called it a weekend then I think 937288 and 937268 would be great to get in. But they can also safely wait until next week22:10
clarkbI find myself wearing out and winding down and my day didn't start as early as yours. Gerrit upgrades are always a bit draining22:10
fungii'm still holding out hope i can take a look at those, though my isp has bumped out their estimated resolution time by several more hours22:50
fungi02:00 utc now22:51
