-opendevstatus- NOTICE: Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again | 16:59 |
clarkb | Hello, we'll start the opendev infra team meeting shortly | 19:00 |
ianw | o/ | 19:00 |
clarkb | I expect we'll try to keep it relatively short since I know a few of us have been in PTG stuff the last couple of days and may be tired of meetings :) | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Oct 19 19:01:10 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link http://lists.opendev.org/pipermail/service-discuss/2021-October/000292.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | As mentioned it is the PTG this week. There are a couple of sessions that might interest those listening in | 19:01 |
clarkb | OpenDev session Wednesday October 20, 2021 at 14:00 - 16:00 UTC in https://meetpad.opendev.org/oct2021-ptg-opendev | 19:01 |
clarkb | This is intended to be an opendev office hours. If this time is not good for you, you shouldn't feel obligated to stay up late or get up early. I'll be there and I should have it covered | 19:02 |
clarkb | If you've got opendev related questions, concerns, etc. feel free to add them to the etherpad on that meetpad and join us tomorrow | 19:02 |
clarkb | Zuul session Thursday October 21, 2021 at 14:00 UTC in https://meetpad.opendev.org/zuul-2021-10-21 | 19:02 |
clarkb | This was scheduled recently and sounds like a birds of a feather session with a focus on the k8s operator | 19:02 |
clarkb | I'd like to go to this one but I think I may end up getting pulled into the openstack tc session around this time as well since they will be talking CI requirement related stuff (python versions and distros) | 19:03 |
clarkb | Also I've got dentistry wednesday and school stuff thursday so will have a few blocks of time where I'm not super available outside of ptg hours | 19:03 |
fungi | also worth noting the zuul operator bof does not appear on the official ptg schedule | 19:04 |
fungi | owing to there not being any available slots for the preferred time | 19:04 |
clarkb | #topic Actions from last meeting | 19:05 |
clarkb | #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-12-19.01.txt minutes from last meeting | 19:05 |
clarkb | We didn't record any actions that I could see | 19:05 |
clarkb | #topic Specs | 19:05 |
clarkb | The prometheus spec has been landed if anyone is interested in looking at the implementation for that. I'm hoping that once I get past some of the zuul and gerrit stuff I've been doing I'll wind up with time for that myself | 19:06 |
clarkb | we'll see | 19:06 |
clarkb | #link https://review.opendev.org/810990 Mailman 3 spec | 19:06 |
clarkb | I think I've been the only reviewer on the mailman3 spec so far. Would be great to get some other reviewers' input on that too | 19:06 |
fungi | further input welcome, but i don't think there are any unaddressed comments at this point | 19:07 |
clarkb | Would be good to have this up for approval next week if others think they can get reviews in between now and then | 19:07 |
clarkb | I guess we can make that decision in a week at our next meeting | 19:07 |
fungi | yeah, hopefully i'll be nearer to being able to focus on it by then | 19:07 |
clarkb | #topic Topics | 19:08 |
clarkb | #topic Improving OpenDev's CD Throughput | 19:08 |
clarkb | ianw: this sort of stalled on the issues in zuul preventing the changes from merging. Those issues should be corrected now, and if you recheck you should get an error message? Then we can fix the changes and hopefully land them? | 19:09 |
ianw | ahh, yep, will get back to it! | 19:09 |
clarkb | thanks | 19:09 |
clarkb | #topic Gerrit Account Cleanups | 19:09 |
clarkb | same situation as the last number of weeks. Too many more urgent distractions. I may pull this off the agenda then add it back when I have time to make progress again | 19:10 |
clarkb | #topic Gerrit Project Renames | 19:10 |
clarkb | Wanted to do a followup on the project renames that fungi and I did last Friday. Overall things went well but we have noticed a few things | 19:10 |
clarkb | The first thing is that part of the rename process has us copy the zuul keys in zk for the renamed projects to their new names, then delete the content at the old names. | 19:10 |
clarkb | Unfortunately this left behind just enough zk db content that the daily key backups are complaining about the old names not having content | 19:11 |
clarkb | #link https://review.opendev.org/c/zuul/zuul/+/814504 fix for zuul key exports post rename | 19:11 |
clarkb | That change should fix this for future renames. Cleanup for the existing errors will likely require our intervention. Possibly by rerunning the delete-keys commands with my change landed. | 19:11 |
clarkb | Note that the backups are otherwise successful, the scary error messages are largely noise | 19:12 |
clarkb | Next up we accidentally updated all gitea projects with their current projects.yaml information. We had wanted to only update the subset of projects that are being renamed as this process was very expensive in the past. | 19:12 |
clarkb | Our accidental update to all projects showed that current gitea with the rest api isn't that slow at updating. And we should consider doing full updates anytime projects.yaml changes | 19:13 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/814443 Update all gitea projects by default | 19:13 |
clarkb | I think it took something like 3-5 minutes to run the gitea-git-repos role against all giteas. | 19:13 |
clarkb | (we didn't directly time it since it was unexpected) | 19:13 |
ianw | ++ i feel like i had a change in that area ... | 19:14 |
clarkb | if infra-root can weigh in on that, and if there are other concerns beyond just time costs please bring them up | 19:14 |
frickler | this followup job is also failing https://review.opendev.org/c/opendev/system-config/+/808480 , some unrelated cert issue according to apache log | 19:14 |
fungi | separately, this raised the point that ansible doesn't like to rely on yaml's inferred data types, instead explicitly recasting them to str unless told to recast them to some other specific type. this makes it very hard to implement your own trinary field | 19:14 |
fungi | i wanted a rolevar which could be either a bool or a list, but i think that's basically not possible | 19:15 |
ianw | oh, the one i'm thinking of is https://review.opendev.org/c/opendev/system-config/+/782887 "gitea: switch to token auth for project creation" | 19:15 |
clarkb | you'd have to figure out ansible's serialization system and implement it in your module | 19:15 |
clarkb | ianw: I think gitea reverted the behavior that made normal auth really slow | 19:15 |
fungi | pretty much, re-parse post ansible's meddling | 19:15 |
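A minimal sketch of that re-parse approach, assuming a hypothetical module with a parameter that should accept either a bool or a list. This is not the actual gitea-git-repos code; declaring the parameter as type='raw' and normalizing it afterwards is just one way to work around the str casting:

```python
# Hypothetical sketch of normalizing a "bool or list" module parameter
# after Ansible's type coercion; not the actual gitea-git-repos module.
import ast

from ansible.module_utils.basic import AnsibleModule
from ansible.module_utils.parsing.convert_bool import boolean


def normalize_projects(value):
    """Return a bool or a list, even if Ansible handed us a string."""
    if isinstance(value, (bool, list)):
        return value
    if isinstance(value, str):
        try:
            # "True"/"false"/"yes" style strings become bools again.
            return boolean(value)
        except TypeError:
            pass
        try:
            # "['a', 'b']" style strings become lists again.
            parsed = ast.literal_eval(value)
            if isinstance(parsed, (bool, list)):
                return parsed
        except (ValueError, SyntaxError):
            pass
    raise ValueError('expected a bool or a list, got: %r' % value)


def main():
    module = AnsibleModule(
        argument_spec=dict(
            # type='raw' asks Ansible not to recast the value at all.
            projects=dict(type='raw', default=True),
        )
    )
    projects = normalize_projects(module.params['projects'])
    module.exit_json(changed=False, projects=projects)


if __name__ == '__main__':
    main()
```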
clarkb | enough other people complained about it and its memory cost iirc | 19:16 |
clarkb | frickler: looks like refstack reported a 500 error when adding results. Might need to look in refstack logs? | 19:16 |
frickler | https://ef5d43af22af7b1c1050-17fc8f83c20e6521d7d8a3ccd8bca531.ssl.cf2.rackcdn.com/808480/1/check/system-config-run-refstack/a4825bb/refstack01.openstack.org/apache2/refstack-ssl-error.log | 19:16 |
clarkb | frickler: https://zuul.opendev.org/t/openstack/build/a4825bbef768406d940fcdd459cb92c6/log/refstack01.openstack.org/containers/docker-refstack-api.log I think that might be a genuine error? | 19:16 |
clarkb | frickler: ya I think the apache error there is a side effect of how we test with faked up LE certs but shouldn't be a breaking issue | 19:17 |
frickler | ah, o.k. | 19:17 |
fungi | that apache warning is expected i think? did you confirm it's not present on earlier successful builds? | 19:18 |
clarkb | The last thing I noticed during the rename process is that manage-projects.yaml runs the gitea-git-repos update then the gerrit jeepyb manage projects. It updates project-config to master only after running gitea-git-repos. For extra safety it should do the project-config update before any other actions. | 19:18 |
frickler | no, I didn't check. then someone with refstack knowledge needs to debug | 19:18 |
clarkb | One thing that makes this complicated is that the sync-project-config role updates project-config then copies its contents to /opt/project-config on the remote server like review. But we don't want it to do that for gitea. I think we want a simpler update process to run early. Maybe using a flag off of sync-project-config. I'll look at this in the next day or two hopefully | 19:19 |
clarkb | frickler: ++ seems refstack specific | 19:19 |
fungi | jsonschema.exceptions.SchemaError: [{'type': 'object', 'properties': {'name': {'type': 'string'}, 'uuid': {'type': 'string', 'format': 'uuid_hex'}}}] is not of type 'object', 'boolean' | 19:19 |
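That traceback is jsonschema complaining about the schema itself rather than the data: a bare list is not a valid schema under the metaschema, which expects an object or boolean. A simplified reproduction and one plausible shape of a fix (the real refstack schema also uses a custom uuid_hex format, omitted here):

```python
# Minimal reproduction of the SchemaError above; the real refstack schema
# may differ, this just shows why a top-level list is rejected.
import jsonschema

item = {
    'type': 'object',
    'properties': {
        'name': {'type': 'string'},
        'uuid': {'type': 'string'},
    },
}

bad_schema = [item]  # a list is not a schema
good_schema = {'type': 'array', 'items': item}

try:
    jsonschema.Draft7Validator.check_schema(bad_schema)
except jsonschema.SchemaError as exc:
    print('rejected as expected:', exc.message)

jsonschema.Draft7Validator.check_schema(good_schema)  # passes
jsonschema.validate([{'name': 'x', 'uuid': 'abc'}], good_schema)
```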
clarkb | Anything else to go over on the rename process/results/etc | 19:20 |
frickler | there was a rename overlap missed | 19:21 |
fungi | rename overlap? | 19:21 |
frickler | and the fix then also needed a pyyaml6 fix https://review.opendev.org/c/openstack/project-config/+/814401 | 19:21 |
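For anyone hitting the same thing elsewhere: the usual PyYAML 6 breakage is that yaml.load() now requires an explicit Loader argument. The fix in the linked change is presumably along these lines, though the exact code may differ:

```python
import yaml

with open('gerrit/projects.yaml') as f:  # path is illustrative
    # Before PyYAML 6 this worked (with a warning since 5.1):
    #   data = yaml.load(f)
    # With PyYAML 6 the Loader argument is mandatory, so use one of:
    data = yaml.safe_load(f)
    # or: data = yaml.load(f, Loader=yaml.SafeLoader)
```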
fungi | thanks for fixing up the missed acl switch | 19:23 |
clarkb | ya when I rebased the ansible role change I missed a fixup | 19:23 |
clarkb | it's difficult to review those since we typically rely on zuul to check things for us. Maybe we should figure out how to run things locally | 19:23 |
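In the meantime a crude local sanity check is possible: parse every yaml file in a project-config checkout and report syntax errors. This is only a rough sketch and catches far less than zuul's own validation:

```python
# Crude pre-review check for a project-config checkout: make sure every
# yaml file at least parses. Not a substitute for zuul's checks.
import pathlib
import sys

import yaml

errors = 0
for path in sorted(pathlib.Path('.').rglob('*.yaml')):
    try:
        with open(path) as f:
            yaml.safe_load(f)
    except yaml.YAMLError as exc:
        errors += 1
        print('%s: %s' % (path, exc))

sys.exit(1 if errors else 0)
```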
frickler | also still a lot of config errors | 19:25 |
clarkb | ya, but those are largely on the tenant to correct | 19:25 |
clarkb | Maybe we should send a reminder email to openstack/venus/refstack/et al to update their job configs | 19:25 |
clarkb | Maybe we wait and see another week and if progress isn't made send email to openstack-discuss asking that the related parties update their configs | 19:26 |
clarkb | I've just written a note in my notes file to do that | 19:27 |
frickler | +1 | 19:27 |
fungi | in theory any changes they push should get errors reported to them by zuul, which ought to serve as a reminder | 19:27 |
clarkb | #topic Improving Zuul Restarts | 19:28 |
clarkb | Last week frickler discovered zuul in a sad state due to a zookeeper connectivity issue that appears to have dropped all the zk watches | 19:28 |
clarkb | To correct that the zuul scheduler was restarted, but frickler noticed we haven't kept our documentation around doing zuul restarts up to date | 19:29 |
clarkb | Led to questions around what to restart and how. | 19:29 |
clarkb | Also when to reenqueue in the start up process and general debugging process | 19:29 |
clarkb | To answer what to restart I think we're currently needing to restart the scheduler and executors together. The mergers don't matter as much but there is a restart-zuul.yaml playbook in system-config that will do everything (which is safe to do if doing the scheduler and executors) | 19:30 |
frickler | yes, it would be nice to have that written up in a way that makes it easy to follow in an emergency | 19:30 |
clarkb | If you run that playbook it will do everything for you except capture and restore queues. | 19:30 |
clarkb | I think the general process we want to put in the documentation is: 1) save queues 2) run restart-zuul.yaml playbook 3) wait for all tenants to load in zuul (can be checked at https://zuul.opendev.org/ and wait for all tenants to show up there) 4) run re-enqueue script (possibly after editing it to remove problematic entries?) | 19:31 |
clarkb | I can work on writing this up in the next day or two | 19:32 |
ianw | (it's good to timestamp that re-enqueue dump ... says the person who has previously restored an old dump after restarting :) | 19:32 |
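For reference until the docs are updated, a rough sketch of what the save-queues step amounts to: fetch the status JSON from the Zuul web API, write it out with a timestamp as ianw suggests, and list the live items that would need re-enqueueing. The JSON structure here is assumed from memory and the actual helper script we use may differ:

```python
# Rough sketch of dumping Zuul queues before a restart; the layout of the
# status JSON is assumed and the real re-enqueue script may differ.
import json
import time
import urllib.request

TENANT = 'openstack'
URL = 'https://zuul.opendev.org/api/tenant/%s/status' % TENANT

with urllib.request.urlopen(URL) as resp:
    status = json.load(resp)

# Timestamped raw dump, so an old dump is never restored by accident.
dump_name = 'queues-%s-%s.json' % (TENANT, time.strftime('%Y%m%d-%H%M%S'))
with open(dump_name, 'w') as f:
    json.dump(status, f)

for pipeline in status.get('pipelines', []):
    for queue in pipeline.get('change_queues', []):
        for head in queue.get('heads', []):
            for item in head:
                if not item.get('live', True):
                    continue
                change = item.get('id')  # e.g. "12345,6"; refs have none
                print(pipeline['name'], item.get('project'), change)
```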
frickler | regarding queues I wondered whether we should drop periodic jobs to save some time/load; they'll run again soon enough | 19:32 |
fungi | granted, the need to also restart the executors whenever restarting the scheduler is a very recent discovery related to in progress scale-out-scheduler development efforts, which should no longer be necessary once it gets farther along | 19:32 |
clarkb | fungi: yes though at other times we've similarly had to restart broader sets of stuff to accommodate zuul changes. I think the safest thing is always to do a full restart especially if you are already debugging an issue and want to avoid new ones :) | 19:33 |
fungi | that's fair | 19:33 |
fungi | in which case restarting the mergers too is warranted | 19:33 |
clarkb | yup and the fingergw and web and the playbook does all of that | 19:33 |
fungi | in case there are changes in the mergers which also need to be picked up | 19:33 |
clarkb | fungi: and ya not having periodic jobs that just fail is probably a good idea too | 19:34 |
clarkb | As far as debugging goes, it is a bit harder to describe a specific process. Generally when I debug zuul I try to find a specific change/merge/ref/tag event that I can focus on as it helps you to narrow the amount of logs that are relevant | 19:35 |
clarkb | what that means is if a queue entry isn't doing what I expect it to I grep the zuul logs for its identifier (change number or sha1 etc) then from that you find logs for that event which have an event id in the logs. Then you can grep the logs for that event id and look for ERROR or traceback etc | 19:36 |
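A small sketch of that two-pass search; the log path and the "[e: <id>]" annotation format are assumptions about how our Zuul debug logs are laid out:

```python
# Two-pass search of a Zuul debug log: first find the event ids attached
# to a change, then pull any ERROR/Traceback lines for those events.
import re
import sys

change = sys.argv[1]   # e.g. "814504" or a sha1
logfile = sys.argv[2]  # e.g. "/var/log/zuul/debug.log" (assumed path)

event_ids = set()
with open(logfile, errors='replace') as f:
    for line in f:
        if change in line:
            m = re.search(r'\[e: ([0-9a-f]+)\]', line)
            if m:
                event_ids.add(m.group(1))

print('events mentioning %s: %s' % (change, sorted(event_ids)))

with open(logfile, errors='replace') as f:
    for line in f:
        m = re.search(r'\[e: ([0-9a-f]+)\]', line)
        if m and m.group(1) in event_ids and (
                'ERROR' in line or 'Traceback' in line):
            print(line.rstrip())
```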
fungi | i guess we should go about disabling tobiko's periodic jobs for them? | 19:36 |
fungi | they seem to run all day and end up with retry_limit results | 19:36 |
clarkb | fungi: ya or figure out why zuul is always so unhappy with them | 19:36 |
clarkb | when we reenqueue them they go straight to error | 19:36 |
clarkb | then eventually when they actually run they break too. I suspect something is really unhappy with it and we did do a rename with tobiko to something | 19:37 |
fungi | Error: Project opendev.org/x/devstack-plugin-tobiko does not have the default branch master | 19:37 |
clarkb | hrm but it does | 19:37 |
clarkb | I wonder if we need to get zuul to reclone it | 19:37 |
fungi | yeah, those are the insta-error messages | 19:37 |
fungi | could be broken repo caches i guess | 19:38 |
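One way to compare what the canonical remote advertises as its default branch with what a cached copy thinks, as a hedged sketch (the cache path on the executors is an assumption):

```python
# Compare the default branch advertised by the canonical remote with what
# a local cached clone has as HEAD; the repo cache path is an assumption.
import subprocess

url = 'https://opendev.org/x/devstack-plugin-tobiko'

remote = subprocess.run(
    ['git', 'ls-remote', '--symref', url, 'HEAD'],
    capture_output=True, text=True, check=True)
print('remote HEAD:', remote.stdout.splitlines()[0])

cache = '/var/lib/zuul/executor-git/opendev.org/x/devstack-plugin-tobiko'
local = subprocess.run(
    ['git', '-C', cache, 'symbolic-ref', 'HEAD'],
    capture_output=True, text=True)
print('cached HEAD:', local.stdout.strip() or local.stderr.strip())
```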
clarkb | Anyway I'll take this todo to update the process docs for restarting and try to give hints about my debugging process above even though it may not be applicable in all cases | 19:38 |
fungi | but when they do actually run, they're broken on something else i dug into a while back which turned out to be an indication that they're probably just abandoned | 19:38 |
clarkb | it should give others an idea of where to start when digging into zuuls very verbose logging | 19:38 |
clarkb | #topic Open Discussion | 19:40 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/813675 is the last sort of trailing gerrit upgrade related change. It scares me because it is difficult to know for sure the gerrit group wasn't used anywhere but I've checked it a few times now. | 19:40 |
clarkb | I'm thinking that maybe I can land that on thursday when I should have most of the day to pay attention to it and restart gerrit and make sure all is still happy | 19:41 |
ianw | i've been focusing on getting rid of debian-stable, to free up space, afs is looking pretty full | 19:42 |
clarkb | debian-stable == stretch? | 19:42 |
fungi | the horribly misnamed "debian-stable" nodeset | 19:42 |
ianw | yeah, sorry, stretch from the mirrors, and the "debian-stable" nodeset which is stretch | 19:43 |
clarkb | aha | 19:43 |
fungi | which is technically oldstable | 19:43 |
ianw | but then there's a lot of interest in 9-stream, so that's what i'm trying to make room for | 19:43 |
clarkb | oldoldstable | 19:43 |
fungi | oldoldstable in fact, buster is oldstable now and bullseye is stable | 19:43 |
ianw | a test sync on that had it coming in at ~200gb | 19:43 |
ianw | (9-stream) | 19:43 |
clarkb | For 9 stream we (opendev) will build our own images on ext4 as we always do I suppose, but any progress with the image based update builds? | 19:44 |
clarkb | Also this reminds me that fedora boots are still unreliable in most clouds iirc | 19:44 |
ianw | clarkb: yep, if you could maybe take a look over | 19:44 |
ianw | #link https://review.opendev.org/q/topic:%22func-testing-bullseye%22+(status:open%20OR%20status:merged) | 19:44 |
ianw | that a) pulls out a lot of duplication and old fluff from the functional testing | 19:45 |
ianw | and b) moves our functional testing to bullseye, which has a 5.10 kernel which can read the XFS "bigtime" volumes | 19:45 |
ianw | the other part of this is updating nodepool-builder to bullseye too; i have that | 19:46 |
ianw | #link https://review.opendev.org/c/zuul/nodepool/+/806312 | 19:46 |
ianw | this needs a dib release to pick up fixes for running on bullseye though | 19:46 |
ianw | i was thinking get that testing stack merged, then release at this point | 19:47 |
clarkb | ok I can try to review the func testing work today | 19:47 |
ianw | it's really a lot of small single steps, but i wanted to call out each test we're dropping/moving etc. explicitly | 19:47 |
fungi | sounds great | 19:49 |
ianw | this should also enable 9-stream minimal builds | 19:49 |
ianw | every time i think they're dead for good, somehow we find a way... | 19:49 |
clarkb | One thing the fedora boot struggles make me wonder is whether we can try and phase out fedora in favor of stream. We essentially do the same thing with ubuntu already since the quicker cadence is hard to keep up with and it seems like you end up fixing issues in a half baked distro constantly | 19:50 |
ianw | yeah, i agree stream might be the right level of updatey-ness for us | 19:51 |
ianw | fedora boot issues are about 3-4 down on my todo list ATM :) will get there ... | 19:52 |
fungi | we continue to struggle with gentoo and opensuse-tumbleweed similarly | 19:52 |
fungi | probably it's just that we expect to have to rework image builds for a new version of the distro, but for rolling distros instead of getting new broken versions at predetermined intervals we get randomly broken image builds whenever something changes in them | 19:53 |
clarkb | that is part of it, but also at least with ubuntu, and it seems like for stream too, there is a lot more care in updating the main releases than the every-6-month releases | 19:54 |
clarkb | because finding bugs is part of the reason they do the every-6-month releases, but not the big releases | 19:54 |
fungi | yeah, agreed, they do seem to be treated as a bit more... "disposable?" | 19:55 |
fungi | anyway, it's the same reason i haven't pushed for debian testing or unstable image labels | 19:55 |
fungi | tracking down spontaneous image build breakage eats a lot of our time | 19:56 |
clarkb | Yup. Also we are just about at time. Last call for anything else, otherwise I'll end it at the change of the hour | 19:57 |
ianw | oh that was one concern of mine with gentoo in that dib stack | 19:57 |
ianw | it seems like it will frequently be in a state of taking 1:30 hours to time out | 19:57 |
fungi | i still haven't worked out how to fix whatever's preventing us from turning gentoo testing back on for zuul-jobs changes either | 19:58 |
ianw | which is a little annoying to hold up all the other jobs | 19:58 |
fungi | though for a while it seemed to be staleness because the gentoo images hadn't been updated in something like 6-12 months | 19:58 |
clarkb | Alright we are at time. Thank you everyone. | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Oct 19 20:00:07 2021 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-19-19.01.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-19-19.01.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-19-19.01.log.html | 20:00 |
fungi | thanks clarkb! | 20:00 |