Tuesday, 2021-10-19

-opendevstatus- NOTICE: Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again16:59
clarkbHello, we'll start the opendev infra team meeting shortly19:00
ianwo/19:00
clarkbI expect we'll try to keep it relatively short since I know a few of us have been in PTG stuff the last couple of days and may be tired of meetings :)19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Oct 19 19:01:10 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2021-October/000292.html Our Agenda19:01
clarkb#topic Announcements19:01
clarkbAs mentioned it is the PTG this week. There are a couple of sessions that might interest those listening in19:01
clarkbOpenDev session Wednesday October 20, 2021 at 14:00 - 16:00 UTC in https://meetpad.opendev.org/oct2021-ptg-opendev19:01
clarkbThis is intended to be an opendev office hours. If this time is not good for you you shouldn't feel obligated to stay up late or get up early. I'll be there and I should have it covered19:02
clarkbIf you've got opendev related questions concerns etc feel free to add them to the etherpad on that meetpad and join us tomorrow19:02
clarkbZuul session Thursday October 21, 2021 at 14:00 UTC in https://meetpad.opendev.org/zuul-2021-10-2119:02
clarkbThis was scheduled recently and sounds like a birds of a feather session with a focus on the k8s operator19:02
clarkbI'd like to go to this one but I think I may end up getting pulled into the openstack tc session around this time as well since they will be talking CI requirement related stuff (python versions and distros)19:03
clarkbAlso I've got dentistry wednesday and school stuff thursday so will have a few blocks of time where I'm not super available outside of ptg hours19:03
fungialso worth noting the zuul operator bof does not appear on the official ptg schedule19:04
fungiowing to there not being any available slots for the preferred time19:04
clarkb#topic Actions from last meeting19:05
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-12-19.01.txt minutes from last meeting19:05
clarkbWe didn't record any actions that I could see19:05
clarkb#topic Specs19:05
clarkbThe prometheus spec has been landed if anyone is interested in looking at the implementation for that. I'm hoping that once I get past some of the zuul and gerrit stuff I've been doing I'll wind up with time for that myself19:06
clarkbwe'll see19:06
clarkb#link https://review.opendev.org/810990 Mailman 3 spec19:06
clarkbI think I've been the only reviewer on the mailman3 spec so far. Would be great to get some other reviewers' input on that too19:06
fungifurther input welcome, but i don't think there are any unaddressed comments at this point19:07
clarkbWould be good to have this up for approval next week if others think they can get reviews in between now and then19:07
clarkbI guess we can make that decision in a week at our next meeting19:07
fungiyeah, hopefully i'll be nearer to being able to focus on it by then19:07
clarkb#topic Topics19:08
clarkb#topic Improving OpenDev's CD Throughput19:08
clarkbianw: this sort of stalled on the issue in zuul preventing this from merging. Those issues in zuul should be corrected now and if you recheck you should get an error message? Then we can fix the changes and hopefully land them?19:09
ianwahh, yep, will get back to it!19:09
clarkbthanks19:09
clarkb#topic Gerrit Account Cleanups19:09
clarkbsame situation as the last number of weeks. Too many more urgent distractions. I may pull this off the agenda then add it back when I have time to make progress again19:10
clarkb#topic Gerrit Project Renames19:10
clarkbWanted to do a followup on the project renames that fungi and I did last Friday. Overall things went well but we have noticed a few things19:10
clarkbThe first thing is that part of the rename process has us copy the zuul keys in zk for the renamed projects to their new names then delete the content at the old names.19:10
clarkbUnfortunately this left behind just enough zk db content that the daily key backups are complaining about the old names not having content19:11
clarkb#link https://review.opendev.org/c/zuul/zuul/+/814504 fix for zuul key exports post rename19:11
clarkbThat change should fix this for future renames. Cleanup for the existing errors will likely require our intervention. Possibly by rerunning the delete-keys commands with my change landed.19:11
clarkbNote that the backups are otherwise successful, the scary error messages are largely noise19:12
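(A minimal sketch of what re-running that cleanup could look like once the fix lands, assuming the scheduler container still provides a `zuul delete-keys <tenant> <project>` subcommand; the container name and the list of old project names are placeholders, not a record of the actual commands run.)

```python
#!/usr/bin/env python3
# Hypothetical cleanup sketch: re-run delete-keys for the leftover old
# project names so the daily key backups stop complaining about them.
# The container name, subcommand and OLD_NAMES entries are assumptions.
import subprocess

OLD_NAMES = [
    # ("openstack", "old/project-name"),  # placeholder entries only
]

for tenant, project in OLD_NAMES:
    subprocess.run(
        ["docker", "exec", "zuul-scheduler",  # assumed container name
         "zuul", "delete-keys", tenant, project],
        check=True,
    )
```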
clarkbNext up we accidentally updated all gitea projects with their current projects.yaml information. We had wanted to only update the subset of projects that are being renamed as this process was very expensive in the past.19:12
clarkbOur accidental update to all projects showed that current gitea with the rest api isn't that slow at updating. And we should consider doing full updates anytime projects.yaml changes19:13
clarkb#link https://review.opendev.org/c/opendev/system-config/+/814443 Update all gitea projects by default19:13
clarkbI think it took something like 3-5 minutes to run the gitea-git-repos role against all giteas.19:13
clarkb(we didn't directly time it since it was unexpected)19:13
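(For reference, a rough sketch of the kind of per-project call the gitea-git-repos role ends up making against each backend; the host, token, and repository names below are placeholders, and only the PATCH /api/v1/repos/{owner}/{repo} endpoint is standard Gitea API.)

```python
#!/usr/bin/env python3
# Sketch of updating one project's metadata through the Gitea REST API,
# assuming a token with admin rights on the org.  Host and token are
# placeholders for whatever the real deployment uses.
import requests

GITEA = "https://gitea01.opendev.org:3000"  # placeholder backend URL
TOKEN = "REDACTED"                          # placeholder API token

def update_description(org, repo, description):
    resp = requests.patch(
        f"{GITEA}/api/v1/repos/{org}/{repo}",
        headers={"Authorization": f"token {TOKEN}"},
        json={"description": description},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# e.g. update_description("opendev", "system-config", "OpenDev system configuration")
```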
ianw++ i feel like i had a change in that area ...19:14
clarkbif infra-root can weigh in on that, and if there are other concerns beyond just time costs please bring them up19:14
fricklerthis followup job is also failing https://review.opendev.org/c/opendev/system-config/+/808480 , some unrelated cert issue according to apache log19:14
fungiseparately, this raised the point that ansible doesn't like to rely on yaml's inferred data types, instead explicitly recasting them to str unless told to recast them to some other specific type. this makes it very hard to implement your own trinary field19:14
fungii wanted a rolevar which could be either a bool or a list, but i think that's basically not possible19:15
ianwoh, the one i'm thinking of is https://review.opendev.org/c/opendev/system-config/+/782887 "gitea: switch to token auth for project creation"19:15
clarkbyou'd have to figure out ansible's serialization system and implement it in your module19:15
clarkbianw: I think gitea reverted the behavior that made normal auth really slow19:15
fungipretty much, re-parse post ansible's meddling19:15
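(A hypothetical custom-module fragment illustrating that re-parse idea: accept the value as a raw parameter and recover a bool-or-list with yaml.safe_load. The parameter name and module are invented for illustration, not something that exists in system-config.)

```python
#!/usr/bin/env python3
# Illustrative Ansible module fragment: "projects" may arrive as a
# string after templating, so re-parse it to get back a real bool or
# list.  Parameter name and behaviour are assumptions for the sketch.
import yaml
from ansible.module_utils.basic import AnsibleModule

def main():
    module = AnsibleModule(
        argument_spec=dict(projects=dict(type='raw', default=False)),
    )
    projects = module.params['projects']
    if isinstance(projects, str):
        projects = yaml.safe_load(projects)
    if projects is True:
        module.exit_json(changed=False, msg="would update every project")
    elif isinstance(projects, list):
        module.exit_json(changed=False, msg="would update %d projects" % len(projects))
    else:
        module.exit_json(changed=False, msg="nothing to update")

if __name__ == '__main__':
    main()
```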
clarkbenough other people complained about it and its memory cost iirc19:16
clarkbfrickler: looks like refstack reported a 500 error when adding results. Might need to look in refstack logs?19:16
fricklerhttps://ef5d43af22af7b1c1050-17fc8f83c20e6521d7d8a3ccd8bca531.ssl.cf2.rackcdn.com/808480/1/check/system-config-run-refstack/a4825bb/refstack01.openstack.org/apache2/refstack-ssl-error.log19:16
clarkbfrickler: https://zuul.opendev.org/t/openstack/build/a4825bbef768406d940fcdd459cb92c6/log/refstack01.openstack.org/containers/docker-refstack-api.log I think that might be genuine error?19:16
clarkbfrickler: ya I think the apache error there is a side effect of how we test with faked up LE certs but shouldn't be a breaking issue19:17
fricklerah, o.k.19:17
fungithat apache warning is expected i think? did you confirm it's not present on earlier successful builds?19:18
clarkbThe last thing I noticed during the rename process is that manage-projects.yaml runs the gitea-git-repos update then the gerrit jeepyb manage projects. It updates project-config to master only after running gitea-git-repos. For extra safety it should do the project-config update before any other actions.19:18
fricklerno, I didn't check. then someone with refstack knowledge needs to debug19:18
clarkbOne thing that makes this complicated is that the sync-project-config role updates project-config then copies its contents to /opt/project-config on the remote server like review. But we don't want it to do that for gitea. I think we want a simpler update process to run early. Maybe using a flag off of sync-project-config. I'll look at this in the next day or two hopefully19:19
clarkbfrickler: ++ seems refstack specific19:19
fungijsonschema.exceptions.SchemaError: [{'type': 'object', 'properties': {'name': {'type': 'string'}, 'uuid': {'type': 'string', 'format': 'uuid_hex'}}}] is not of type 'object', 'boolean'19:19
clarkbAnything else to go over on the rename process/results/etc19:20
fricklerthere was a rename overlap missed19:21
fungirename overlap?19:21
fricklerand the fix then also needed a pyyaml6 fix https://review.opendev.org/c/openstack/project-config/+/81440119:21
fungithanks for fixing up the missed acl switch19:23
clarkbya when I rebased the ansible role change I missed a fixup19:23
clarkbit's difficult to review those since we typically rely on zuul to check things for us. Maybe figure out how to run things locally19:23
frickleralso still a lot of config errors19:25
clarkbya, but those are largely on the tenant to correct19:25
clarkbMaybe we should send a reminder email to openstack/venus/refstack/et al to update their job configs19:25
clarkbMaybe we wait and see another week and if progress isn't made send email to openstack-discuss asking that the related parties update their configs19:26
clarkbI've just written a note in my notes file to do that19:27
frickler+119:27
fungiin theory any changes they push should get errors reported to them by zuul, which ought to serve as a reminder19:27
clarkb#topic Improving Zuul Restarts19:28
clarkbLast week frickler discovered zuul in a sad state due to a zookeeper connectivity issue that appears to have dropped all the zk watches19:28
clarkbTo correct that the zuul scheduler was restarted, but frickler noticed we haven't kept our documentation around doing zuul restarts up to date19:29
clarkbLed to questions around what to restart and how.19:29
clarkbAlso when to reenqueue in the start up process and general debugging process19:29
clarkbTo answer what to restart: I think we currently need to restart the scheduler and executors together. The mergers don't matter as much but there is a restart-zuul.yaml playbook in system-config that will do everything (which is safe to do if doing the scheduler and executors)19:30
frickleryes, it would be nice to have that written up in a way that makes it easy to follow in an emergency19:30
clarkbIf you run that playbook it will do everything for you except capture and restore the queues.19:30
clarkbI think the general process we want to put in the documentation is: 1) save queues 2) run restart-zuul.yaml playbook 3) wait for all tenants to load in zuul (can be checked at https://opendev.org/ and wait for all tenants to show up there) 4) run re-enqueue script (possibly after editing it to remove problematic entries?)19:31
clarkbI can work on writing this up in the next day or two19:32
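(A rough helper sketch for step 3 of that process, polling the Zuul web API until the tenants are all loaded before re-enqueueing; the /api/tenants listing is standard Zuul web API, but the expected tenant count is an assumption to fill in.)

```python
#!/usr/bin/env python3
# Sketch: wait for the restarted scheduler to report all tenants before
# running the re-enqueue script.  EXPECTED_TENANTS is an assumption.
import time
import requests

ZUUL_API = "https://zuul.opendev.org/api/tenants"
EXPECTED_TENANTS = 4  # assumption: set to the real number of tenants

def wait_for_tenants(timeout=3600, interval=30):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            tenants = requests.get(ZUUL_API, timeout=10).json()
            if len(tenants) >= EXPECTED_TENANTS:
                return tenants
        except requests.RequestException:
            pass  # scheduler may still be starting up; keep polling
        time.sleep(interval)
    raise TimeoutError("Zuul tenants did not all load in time")

if __name__ == "__main__":
    print([t["name"] for t in wait_for_tenants()])
```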
ianw(it's good to timestamp that re-enqueue dump ... says the person who has previously restored an old dump after restarting :)19:32
fricklerregarding queues I wondered whether we should drop periodic jobs to save some time/load and they'll run again soon enough19:32
fungigranted, the need to also restart the executors whenever restarting the scheduler is a very recent discovery related to in progress scale-out-scheduler development efforts, which should no longer be necessary once it gets farther along19:32
clarkbfungi: yes though at other times we've similarly had to restart broader sets of stuff to accommodate zuul changes. I think the safest thing is always to do a full restart especially if you are already debugging an issue and want to avoid new ones :)19:33
fungithat's fair19:33
fungiin which case restarting the mergers too is warranted19:33
clarkbyup and the fingergw and web and the playbook does all of that19:33
fungiin case there are changes in the mergers which also need to be picked up19:33
clarkbfungi: and ya not having periodic jobs that just fail is probably a good idea too19:34
clarkbAs far as debugging goes this is a bit harder to describe a specific process. Generally when I debug zuul I try to find a specific change/merge/ref/tag event that I can focus on as it helps you to narrow the amount of logs that are relevant19:35
clarkbwhat that means is if a queue entry isn't doing what I expect it to I grep the zuul logs for its identifier (change number or sha1 etc) then from that you find logs for that event which have an event id in the logs. Then you can grep the logs for that event id and look for ERROR or traceback etc19:36
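(That workflow could be scripted roughly like this; the "[e: <id>]" pattern is an assumption about how event ids appear in the scheduler debug log, so adjust it to whatever the real lines look like.)

```python
#!/usr/bin/env python3
# Sketch of the grep workflow: find log lines mentioning a change,
# collect their event ids, then print errors/tracebacks for those events.
import re
import sys

def find_event_errors(logfile, change):
    event_ids = set()
    with open(logfile) as f:
        for line in f:
            if change in line:
                m = re.search(r"\[e: ([0-9a-f-]+)\]", line)  # assumed format
                if m:
                    event_ids.add(m.group(1))
    with open(logfile) as f:
        for line in f:
            if any(e in line for e in event_ids) and ("ERROR" in line or "Traceback" in line):
                print(line, end="")

if __name__ == "__main__":
    find_event_errors(sys.argv[1], sys.argv[2])  # e.g. debug.log 814504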
fungii guess we should go about disabling tobiko's periodic jobs for them?19:36
fungithey seem to run all day and end up with retry_limit results19:36
clarkbfungi: ya or figure out why zuul is always so unhappy with them19:36
clarkbwhen we reenqueue them they go straight to error19:36
clarkbthen eventually when they actually run they break too. I suspect something is really unhappy with it and we did do a rename with topiko to tobiko something19:37
fungiError: Project opendev.org/x/devstack-plugin-tobiko does not have the default branch master19:37
clarkbhrm but it does19:37
clarkbI wonder if we need to get zuul to reclone it19:37
fungiyeah, those are the insta-error messages19:37
fungicould be broken repo caches i guess19:38
clarkbAnyway I'll take this todo to update the process docs for restarting and try to give hints about my debugging process above even though it may not be applicable in all cases19:38
fungibut when they do actually run, they're broken on something else i dug into a while back which turned out to be an indication that they're probably just abandoned19:38
clarkbit should give others an idea of where to start when digging into zuuls very verbose logging19:38
clarkb#topic Open Discussion19:40
clarkbhttps://review.opendev.org/c/opendev/system-config/+/813675 is the last sort of trailing gerrit upgrade related change. It scares me because it is difficult to know for sure the gerrit group wasn't used anywhere but I've checked it a few times now.19:40
clarkbI'm thinking that maybe I can land that on thursday when I should have most of the day to pay attention to it and restart gerrit and make sure all is still happy19:41
ianwi've been focusing on getting rid of debian-stable, to free up space, afs is looking pretty full19:42
clarkbdebian-stable == stretch?19:42
fungithe horribly misnamed "debian-stable" nodeset19:42
ianwyeah, sorry, stretch from the mirrors, and the "debian-stable" nodeset which is stretch19:43
clarkbaha19:43
fungiwhich is technically oldstable19:43
ianwbut then there's a lot of interest in 9-stream, so that's what i'm trying to make room for19:43
clarkboldoldstable19:43
fungioldoldstable in fact, buster is oldstable now and bullseye is stable19:43
ianwa test sync on that had it coming in at ~200gb19:43
ianw(9-stream)19:43
clarkbFor 9 stream we (opendev) will build our own images on ext4 as we always do I suppose, but any progress with the image based update builds?19:44
clarkbAlso this reminds me that fedora boots are still unreliable in most clouds iirc19:44
ianwclarkb: yep, if you could maybe take a look over19:44
ianw#link https://review.opendev.org/q/topic:%22func-testing-bullseye%22+(status:open%20OR%20status:merged)19:44
ianwthat a) pulls out a lot of duplication and old fluff from the functional testing19:45
ianwand b) moves our functional testing to bullseye, which has a 5.10 kernel which can read the XFS "bigtime" volumes19:45
ianwthe other part of this is updating nodepool-builder to bullseye too; i have that19:46
ianw#link https://review.opendev.org/c/zuul/nodepool/+/80631219:46
ianwthis needs a dib release to pick up fixes for running on bullseye though19:46
ianwi was thinking get that testing stack merged, then release at this point19:47
clarkbok I can try to review the func testing work today19:47
ianwit's really a lot of small single steps, but i wanted to call out each test we're dropping/moving etc. explicitly19:47
fungisounds great19:49
ianwthis should also enable 9-stream minimal builds19:49
ianwevery time i think they're dead for good, somehow we find a way...19:49
clarkbOne thing the fedora boot struggles make me wonder about is whether we can try and phase out fedora in favor of stream. We essentially do the same thing with ubuntu already since the quicker cadence is hard to keep up with and it seems like you end up fixing issues in a half baked distro constantly19:50
ianwyeah, i agree stream might be the right level of updatey-ness for us19:51
ianwfedora boot issues are about 3-4 down on my todo list ATM :)  will get there ...19:52
fungiwe continue to struggle with gentoo and opensuse-tumbleweed similarly19:52
fungiprobably it's just that we expect to have to rework image builds for a new version of the distro, but for rolling distros instead of getting new broken versions at predetermined intervals we get randomly broken image builds whenever something changes in them19:53
clarkbthat is part of it, but also at least with ubuntu, and it seems like for stream too, there is a lot more care in updating the main releases than the every 6 month releases19:54
clarkbbecause finding bugs is part of the reason they do the every 6 month releases, which isn't the case for the big releases19:54
fungiyeah, agreed, they do seem to be treated as a bit more... "disposable?"19:55
fungianyway, it's the same reason i haven't pushed for debian testing or unstable image labels19:55
fungitracking down spontaneous image build breakage eats a lot of our time19:56
clarkbYup. Also we are just about at time. Last call for anything else, otherwise I'll end it at the change of the hour19:57
ianwoh that was one concern of mine with gentoo in that dib stack19:57
ianwit seems like it will frequently be in a state of taking 1:30 hours to time out19:57
fungii still haven't worked out how to fix whatever's preventing us from turning gentoo testing back on for zuul-jobs changes either19:58
ianwwhich is a little annoying to hold up all the other jobs19:58
fungithough for a while it seemed to be staleness because the gentoo images hadn't been updated in something like 6-12 months19:58
clarkbAlright we are at time. Thank you everyone.20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Oct 19 20:00:07 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-19-19.01.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-19-19.01.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-19-19.01.log.html20:00
fungithanks clarkb!20:00
