-opendevstatus- NOTICE: Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again | 16:59 |
clarkb | Hello, we'll start the opendev infra team meeting shortly | 19:00 |
ianw | o/ | 19:00 |
clarkb | I expect we'll try to keep it relatively short since I know a few of us have been in PTG stuff the last couple of days and may be tired of meetings :) | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Oct 19 19:01:10 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link http://lists.opendev.org/pipermail/service-discuss/2021-October/000292.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | As mentioned it is the PTG this week. There are a couple of sessions that might interest those listening in | 19:01 |
clarkb | OpenDev session Wednesday October 20, 2021 at 14:00 - 16:00 UTC in https://meetpad.opendev.org/oct2021-ptg-opendev | 19:01 |
clarkb | This is intended to be an opendev office hours. If this time is not good for you, you shouldn't feel obligated to stay up late or get up early. I'll be there and I should have it covered | 19:02 |
clarkb | If you've got opendev related questions, concerns, etc. feel free to add them to the etherpad on that meetpad and join us tomorrow | 19:02 |
clarkb | Zuul session Thursday October 21, 2021 at 14:00 UTC in https://meetpad.opendev.org/zuul-2021-10-21 | 19:02 |
clarkb | This was scheduled recently and sounds like a birds of a feather session with a focus on the k8s operator | 19:02 |
clarkb | I'd like to go to this one but I think I may end up getting pulled into the openstack tc session around this time as well since they will be talking CI requirement related stuff (python versions and distros) | 19:03 |
clarkb | Also I've got dentistry wednesday and school stuff thursday so will have a few blocks of time where I'm not super available outside of ptg hours | 19:03 |
fungi | also worth noting the zuul operator bof does not appear on the official ptg schedule | 19:04 |
fungi | owing to there not being any available slots for the preferred time | 19:04 |
clarkb | #topic Actions from last meeting | 19:05 |
clarkb | #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-12-19.01.txt minutes from last meeting | 19:05 |
clarkb | We didn't record any actions that I could see | 19:05 |
clarkb | #topic Specs | 19:05 |
clarkb | The prometheus spec has been landed if anyone is interested in looking at the implementation for that. I'm hoping that once I get past some of the zuul and gerrit stuff I've been doing I'll wind up with time for that myself | 19:06 |
clarkb | we'll see | 19:06 |
clarkb | #link https://review.opendev.org/810990 Mailman 3 spec | 19:06 |
clarkb | I think I've been the only reviewer on the mailman3 spec so far. Would be great to get some other reviewers' input on that too | 19:06 |
fungi | further input welcome, but i don't think there are any unaddressed comments at this point | 19:07 |
clarkb | Would be good to have this up for approval next week if others think they can get reviews in between now and then | 19:07 |
clarkb | I guess we can make that decision in a week at our next meeting | 19:07 |
fungi | yeah, hopefully i'll be nearer to being able to focus on it by then | 19:07 |
clarkb | #topic Topics | 19:08 |
clarkb | #topic Improving OpenDev's CD Throughput | 19:08 |
clarkb | ianw: this sort of stalled on the issues in zuul preventing the changes from merging. Those issues should be corrected now, and if you recheck you should get an error message? Then we can fix the changes and hopefully land them? | 19:09 |
ianw | ahh, yep, will get back to it! | 19:09 |
clarkb | thanks | 19:09 |
clarkb | #topic Gerrit Account Cleanups | 19:09 |
clarkb | same situation as the last number of weeks. Too many more urgent distractions. I may pull this off the agenda then add it back when I have time to make progress again | 19:10 |
clarkb | #topic Gerrit Project Renames | 19:10 |
clarkb | Wanted to do a followup on the project renames that fungi and I did last Friday. Overall things went well but we have noticed a few things | 19:10 |
clarkb | The first thing is that part of the rename process has us copy the zuul keys in zk for the renamed projects to their new names, then delete the content at the old names. | 19:10 |
clarkb | Unfortunately this left behind just enough zk db content that the daily key backups are complaining about the old names not having content | 19:11 |
clarkb | #link https://review.opendev.org/c/zuul/zuul/+/814504 fix for zuul key exports post rename | 19:11 |
clarkb | That change should fix this for future renames. Cleanup for the existing errors will likely require our intervention. Possibly by rerunning the delete-keys commands with my change landed. | 19:11 |
clarkb | Note that the backups are otherwise successful, the scary error messages are largely noise | 19:12 |
clarkb | Next up we accidentally updated all gitea projects with their current projects.yaml information. We had wanted to only update the subset of projects that are being renamed as this process was very expensive in the past. | 19:12 |
clarkb | Our accidental update to all projects showed that current gitea with the rest api isn't that slow at updating. And we should consider doing full updates anytime projects.yaml changes | 19:13 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/814443 Update all gitea projects by default | 19:13 |
clarkb | I think it took something like 3-5 minutes to run the gitea-git-repos role against all giteas. | 19:13 |
clarkb | (we didn't directly time it since it was unexpected) | 19:13 |
ianw | ++ i feel like i had a change in that area ... | 19:14 |
clarkb | if infra-root can weigh in on that, and if there are other concerns beyond just time costs please bring them up | 19:14 |
frickler | this followup job is also failing https://review.opendev.org/c/opendev/system-config/+/808480 , some unrelated cert issue according to apache log | 19:14 |
fungi | separately, this raised the point that ansible doesn't like to rely on yaml's inferred data types, instead explicitly recasting them to str unless told to recast them to some other specific type. this makes it very hard to implement your own trinary field | 19:14 |
fungi | i wanted a rolevar which could be either a bool or a list, but i think that's basically not possible | 19:15 |
ianw | oh, the one i'm thinking of is https://review.opendev.org/c/opendev/system-config/+/782887 "gitea: switch to token auth for project creation" | 19:15 |
clarkb | you'd have to figure out ansible's serialization system and implement it in your module | 19:15 |
clarkb | ianw: I think gitea reverted the behavior that made normal auth really slow | 19:15 |
fungi | pretty much, re-parse post ansible's meddling | 19:15 |
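A minimal sketch of that re-parse approach, assuming a hypothetical module with a parameter that should accept either a bool or a list. This is not the actual gitea-git-repos code; declaring the parameter as type='raw' and normalizing it afterwards is just one way to work around the str casting:

```python
# Hypothetical sketch of normalizing a "bool or list" module parameter
# after Ansible's type coercion; not the actual gitea-git-repos module.
import ast

from ansible.module_utils.basic import AnsibleModule
from ansible.module_utils.parsing.convert_bool import boolean


def normalize_projects(value):
    """Return a bool or a list, even if Ansible handed us a string."""
    if isinstance(value, (bool, list)):
        return value
    if isinstance(value, str):
        try:
            # "True"/"false"/"yes" style strings become bools again.
            return boolean(value)
        except TypeError:
            pass
        try:
            # "['a', 'b']" style strings become lists again.
            parsed = ast.literal_eval(value)
            if isinstance(parsed, (bool, list)):
                return parsed
        except (ValueError, SyntaxError):
            pass
    raise ValueError('expected a bool or a list, got: %r' % value)


def main():
    module = AnsibleModule(
        argument_spec=dict(
            # type='raw' asks Ansible not to recast the value at all.
            projects=dict(type='raw', default=True),
        )
    )
    projects = normalize_projects(module.params['projects'])
    module.exit_json(changed=False, projects=projects)


if __name__ == '__main__':
    main()
```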
clarkb | enough other people complained about it and its memory cost iirc | 19:16 |
clarkb | frickler: looks like refstack reported a 500 error when adding results. Might need to look in refstack logs? | 19:16 |
frickler | https://ef5d43af22af7b1c1050-17fc8f83c20e6521d7d8a3ccd8bca531.ssl.cf2.rackcdn.com/808480/1/check/system-config-run-refstack/a4825bb/refstack01.openstack.org/apache2/refstack-ssl-error.log | 19:16 |
clarkb | frickler: https://zuul.opendev.org/t/openstack/build/a4825bbef768406d940fcdd459cb92c6/log/refstack01.openstack.org/containers/docker-refstack-api.log I think that might be a genuine error? | 19:16 |
clarkb | frickler: ya I think the apache error there is a side effect of how we test with faked up LE certs but shouldn't be a breaking issue | 19:17 |
frickler | ah, o.k. | 19:17 |
fungi | that apache warning is expected i think? did you confirm it's not present on earlier successful builds? | 19:18 |
clarkb | The last thing I noticed during the rename process is that manage-projects.yaml runs the gitea-git-repos update then the gerrit jeepyb manage projects. It updates project-config to master only after running gitea-git-repos. For extra safety it should do the project-config update before any other actions. | 19:18 |
frickler | no, I didn't check. then someone with refstack knowledge needs to debug | 19:18 |
clarkb | One thing that makes this complicated is that the sync-project-config role updates project-config then copies its contents to /opt/project-config on the remote server like review. But we don't want it to do that for gitea. I think we want a simpler update process to run early. Maybe using a flag off of sync-project-config. I'll look at this in the next day or two hopefully | 19:19 |
clarkb | frickler: ++ seems refstack specific | 19:19 |
fungi | jsonschema.exceptions.SchemaError: [{'type': 'object', 'properties': {'name': {'type': 'string'}, 'uuid': {'type': 'string', 'format': 'uuid_hex'}}}] is not of type 'object', 'boolean' | 19:19 |
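That traceback is jsonschema complaining about the schema itself rather than the data: a bare list is not a valid schema under the metaschema, which expects an object or boolean. A simplified reproduction and one plausible shape of a fix (the real refstack schema also uses a custom uuid_hex format, omitted here):

```python
# Minimal reproduction of the SchemaError above; the real refstack schema
# may differ, this just shows why a top-level list is rejected.
import jsonschema

item = {
    'type': 'object',
    'properties': {
        'name': {'type': 'string'},
        'uuid': {'type': 'string'},
    },
}

bad_schema = [item]  # a list is not a schema
good_schema = {'type': 'array', 'items': item}

try:
    jsonschema.Draft7Validator.check_schema(bad_schema)
except jsonschema.SchemaError as exc:
    print('rejected as expected:', exc.message)

jsonschema.Draft7Validator.check_schema(good_schema)  # passes
jsonschema.validate([{'name': 'x', 'uuid': 'abc'}], good_schema)
```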
clarkb | Anything else to go over on the rename process/results/etc | 19:20 |
frickler | there was a rename overlap missed | 19:21 |
fungi | rename overlap? | 19:21 |
frickler | and the fix then also needed a pyyaml6 fix https://review.opendev.org/c/openstack/project-config/+/814401 | 19:21 |
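For anyone hitting the same thing elsewhere: the usual PyYAML 6 breakage is that yaml.load() now requires an explicit Loader argument. The fix in the linked change is presumably along these lines, though the exact code may differ:

```python
import yaml

with open('gerrit/projects.yaml') as f:  # path is illustrative
    # Before PyYAML 6 this worked (with a warning since 5.1):
    #   data = yaml.load(f)
    # With PyYAML 6 the Loader argument is mandatory, so use one of:
    data = yaml.safe_load(f)
    # or: data = yaml.load(f, Loader=yaml.SafeLoader)
```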
fungi | thanks for fixing up the missed acl switch | 19:23 |
clarkb | ya when I rebased the ansible role change I missed a fixup | 19:23 |
clarkb | it's difficult to review those since we typically rely on zuul to check things for us. Maybe we should figure out how to run things locally | 19:23 |
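In the meantime a crude local sanity check is possible: parse every yaml file in a project-config checkout and report syntax errors. This is only a rough sketch and catches far less than zuul's own validation:

```python
# Crude pre-review check for a project-config checkout: make sure every
# yaml file at least parses. Not a substitute for zuul's checks.
import pathlib
import sys

import yaml

errors = 0
for path in sorted(pathlib.Path('.').rglob('*.yaml')):
    try:
        with open(path) as f:
            yaml.safe_load(f)
    except yaml.YAMLError as exc:
        errors += 1
        print('%s: %s' % (path, exc))

sys.exit(1 if errors else 0)
```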
frickler | also still a lot of config errors | 19:25 |
clarkb | ya, but those are largely on the tenant to correct | 19:25 |
clarkb | Maybe we should send a reminder email to openstack/venus/refstack/et al to update their job configs | 19:25 |
clarkb | Maybe we wait and see another week and if progress isn't made send email to openstack-discuss asking that the related parties update their configs | 19:26 |
clarkb | I've just written a note in my notes file to do that | 19:27 |
frickler | +1 | 19:27 |
fungi | in theory any changes they push should get errors reported to them by zuul, which ought to serve as a reminder | 19:27 |
clarkb | #topic Improving Zuul Restarts | 19:28 |
clarkb | Last week frickler discovered zuul in a sad state due to a zookeeper connectivity issue that appears to have dropped all the zk watches | 19:28 |
clarkb | To correct that the zuul scheduler was restarted, but frickler noticed we haven't kept our documentation around doing zuul restarts up to date | 19:29 |
clarkb | Led to questions around what to restart and how. | 19:29 |
clarkb | Also when to reenqueue in the start up process and general debugging process | 19:29 |
clarkb | To answer what to restart I think we're currently needing to restart the scheduler and executors together. The mergers don't matter as much but there is a restart-zuul.yaml playbook in system-config that will do everything (which is safe to do if doing the scheduler and executors) | 19:30 |
frickler | yes, it would be nice to have that written up in a way that makes it easy to follow in an emergency | 19:30 |
clarkb | If you run that playbook it will do everything for you except capture and restore queues. | 19:30 |
clarkb | I think the general process we want to put in the documentation is: 1) save queues 2) run restart-zuul.yaml playbook 3) wait for all tenants to load in zuul (can be checked at https://zuul.opendev.org/ and wait for all tenants to show up there) 4) run re-enqueue script (possibly after editing it to remove problematic entries?) | 19:31 |
clarkb | I can work on writing this up in the next day or two | 19:32 |
ianw | (it's good to timestamp that re-enqueue dump ... says the person who has previously restored an old dump after restarting :) | 19:32 |
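For reference until the docs are updated, a rough sketch of what the save-queues step amounts to: fetch the status JSON from the Zuul web API, write it out with a timestamp as ianw suggests, and list the live items that would need re-enqueueing. The JSON structure here is assumed from memory and the actual helper script we use may differ:

```python
# Rough sketch of dumping Zuul queues before a restart; the layout of the
# status JSON is assumed and the real re-enqueue script may differ.
import json
import time
import urllib.request

TENANT = 'openstack'
URL = 'https://zuul.opendev.org/api/tenant/%s/status' % TENANT

with urllib.request.urlopen(URL) as resp:
    status = json.load(resp)

# Timestamped raw dump, so an old dump is never restored by accident.
dump_name = 'queues-%s-%s.json' % (TENANT, time.strftime('%Y%m%d-%H%M%S'))
with open(dump_name, 'w') as f:
    json.dump(status, f)

for pipeline in status.get('pipelines', []):
    for queue in pipeline.get('change_queues', []):
        for head in queue.get('heads', []):
            for item in head:
                if not item.get('live', True):
                    continue
                change = item.get('id')  # e.g. "12345,6"; refs have none
                print(pipeline['name'], item.get('project'), change)
```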
frickler | regarding queues I wondered whether we should drop periodic jobs to save some time/load; they'll run again soon enough | 19:32 |
fungi | granted, the need to also restart the executors whenever restarting the scheduler is a very recent discovery related to in progress scale-out-scheduler development efforts, which should no longer be necessary once it gets farther along | 19:32 |
clarkb | fungi: yes though at other times we've similarly had to restart broader sets of stuff to accommodate zuul changes. I think the safest thing is always to do a full restart especially if you are already debugging an issue and want to avoid new ones :) | 19:33 |
fungi | that's fair | 19:33 |
fungi | in which case restarting the mergers too is warranted | 19:33 |
clarkb | yup and the fingergw and web and the playbook does all of that | 19:33 |
fungi | in case there are changes in the mergers which also need to be picked up | 19:33 |
clarkb | fungi: and ya not having periodic jobs that just fail is probably a good idea too | 19:34 |
clarkb | As far as debugging goes, it is a bit harder to describe a specific process. Generally when I debug zuul I try to find a specific change/merge/ref/tag event that I can focus on as it helps you to narrow the amount of logs that are relevant | 19:35 |
clarkb | what that means is if a queue entry isn't doing what I expect it to I grep the zuul logs for its identifier (change number or sha1 etc) then from that you find logs for that event which have an event id in the logs. Then you can grep the logs for that event id and look for ERROR or traceback etc | 19:36 |
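A small sketch of that two-pass search; the log path and the "[e: <id>]" annotation format are assumptions about how our Zuul debug logs are laid out:

```python
# Two-pass search of a Zuul debug log: first find the event ids attached
# to a change, then pull any ERROR/Traceback lines for those events.
import re
import sys

change = sys.argv[1]   # e.g. "814504" or a sha1
logfile = sys.argv[2]  # e.g. "/var/log/zuul/debug.log" (assumed path)

event_ids = set()
with open(logfile, errors='replace') as f:
    for line in f:
        if change in line:
            m = re.search(r'\[e: ([0-9a-f]+)\]', line)
            if m:
                event_ids.add(m.group(1))

print('events mentioning %s: %s' % (change, sorted(event_ids)))

with open(logfile, errors='replace') as f:
    for line in f:
        m = re.search(r'\[e: ([0-9a-f]+)\]', line)
        if m and m.group(1) in event_ids and (
                'ERROR' in line or 'Traceback' in line):
            print(line.rstrip())
```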
fungi | i guess we should go about disabling tobiko's periodic jobs for them? | 19:36 |
fungi | they seem to run all day and end up with retry_limit results | 19:36 |
clarkb | fungi: ya or figure out why zuul is always so unhappy with them | 19:36 |
clarkb | when we reenqueue them they go straight to error | 19:36 |
clarkb | then eventually when they actually run they break too. I suspect something is really unhappy with it and we did do a rename with tobiko to something | 19:37 |
fungi | Error: Project opendev.org/x/devstack-plugin-tobiko does not have the default branch master | 19:37 |
clarkb | hrm but it does | 19:37 |
clarkb | I wonder if we need to get zuul to reclone it | 19:37 |
fungi | yeah, those are the insta-error messages | 19:37 |
fungi | could be broken repo caches i guess | 19:38 |
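One way to compare what the canonical remote advertises as its default branch with what a cached copy thinks, as a hedged sketch (the cache path on the executors is an assumption):

```python
# Compare the default branch advertised by the canonical remote with what
# a local cached clone has as HEAD; the repo cache path is an assumption.
import subprocess

url = 'https://opendev.org/x/devstack-plugin-tobiko'

remote = subprocess.run(
    ['git', 'ls-remote', '--symref', url, 'HEAD'],
    capture_output=True, text=True, check=True)
print('remote HEAD:', remote.stdout.splitlines()[0])

cache = '/var/lib/zuul/executor-git/opendev.org/x/devstack-plugin-tobiko'
local = subprocess.run(
    ['git', '-C', cache, 'symbolic-ref', 'HEAD'],
    capture_output=True, text=True)
print('cached HEAD:', local.stdout.strip() or local.stderr.strip())
```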
clarkb | Anyway I'll take this todo to update the process docs for restarting and try to give hints about my debugging process above even though it may not be applicable in all cases | 19:38 |
fungi | but when they do actually run, they're broken on something else i dug into a while back which turned out to be an indication that they're probably just abandoned | 19:38 |
clarkb | it should give others an idea of where to start when digging into zuuls very verbose logging | 19:38 |
clarkb | #topic Open Discussion | 19:40 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/813675 is the last sort of trailing gerrit upgrade related change. It scares me because it is difficult to know for sure the gerrit group wasn't used anywhere but I've checked it a few times now. | 19:40 |
clarkb | I'm thinking that maybe I can land that on thursday when I should have most of the day to pay attention to it and restart gerrit and make sure all is still happy | 19:41 |
ianw | i've been focusing on getting rid of debian-stable, to free up space, afs is looking pretty full | 19:42 |
clarkb | debian-stable == stretch? | 19:42 |
fungi | the horribly misnamed "debian-stable" nodeset | 19:42 |
ianw | yeah, sorry, stretch from the mirrors, and the "debian-stable" nodeset which is stretch | 19:43 |
clarkb | aha | 19:43 |
fungi | which is technically oldstable | 19:43 |
ianw | but then there's a lot of interest in 9-stream, so that's what i'm trying to make room for | 19:43 |
clarkb | oldoldstable | 19:43 |
fungi | oldoldstable in fact, buster is oldstable now and bullseye is stable | 19:43 |
ianw | a test sync on that had it coming in at ~200gb | 19:43 |
ianw | (9-stream) | 19:43 |
clarkb | For 9 stream we (opendev) will build our own images on ext4 as we always do I suppose, but any progress with the image based update builds? | 19:44 |
clarkb | Also this reminds me that fedora boots are still unreliable in most clouds iirc | 19:44 |
ianw | clarkb: yep, if you could maybe take a look over | 19:44 |
ianw | #link https://review.opendev.org/q/topic:%22func-testing-bullseye%22+(status:open%20OR%20status:merged) | 19:44 |
ianw | that a) pulls out a lot of duplication and old fluff from the functional testing | 19:45 |
ianw | and b) moves our functional testing to bullseye, which has a 5.10 kernel which can read the XFS "bigtime" volumes | 19:45 |
ianw | the other part of this is updating nodepool-builder to bullseye too; i have that | 19:46 |
ianw | #link https://review.opendev.org/c/zuul/nodepool/+/806312 | 19:46 |
ianw | this needs a dib release to pick up fixes for running on bullseye though | 19:46 |
ianw | i was thinking get that testing stack merged, then release at this point | 19:47 |
clarkb | ok I can try to review the func testing work today | 19:47 |
ianw | it's really a lot of small single steps, but i wanted to call out each test we're dropping/moving etc. explicitly | 19:47 |
fungi | sounds great | 19:49 |
ianw | this should also enable 9-stream minimal builds | 19:49 |
ianw | every time i think they're dead for good, somehow we find a way... | 19:49 |
clarkb | One thing the fedora boot struggles make me wonder is whether we can try and phase out fedora in favor of stream. We essentially do the same thing with ubuntu already since the quicker cadence is hard to keep up with and it seems like you end up fixing issues in a half baked distro constantly | 19:50 |
ianw | yeah, i agree stream might be the right level of updatey-ness for us | 19:51 |
ianw | fedora boot issues are about 3-4 down on my todo list ATM :) will get there ... | 19:52 |
fungi | we continue to struggle with gentoo and opensuse-tumbleweed similarly | 19:52 |
fungi | probably it's just that we expect to have to rework image builds for a new version of the distro, but for rolling distros instead of getting new broken versions at predetermined intervals we get randomly broken image builds whenever something changes in them | 19:53 |
clarkb | that is part of it, but also at least with ubuntu, and it seems like for stream too, there is a lot more care in updating the main releases than the every-6-month releases | 19:54 |
clarkb | because finding bugs is part of the reason they do the every-6-month releases, but not the big releases | 19:54 |
fungi | yeah, agreed, they do seem to be treated as a bit more... "disposable?" | 19:55 |
fungi | anyway, it's the same reason i haven't pushed for debian testing or unstable image labels | 19:55 |
fungi | tracking down spontaneous image build breakage eats a lot of our time | 19:56 |
clarkb | Yup. Also we are just about at time. Last call for anything else, otherwise I'll end it at the change of the hour | 19:57 |
ianw | oh that was one concern of mine with gentoo in that dib stack | 19:57 |
ianw | it seems like it will frequently be in a state of taking 1:30 hours to time out | 19:57 |
fungi | i still haven't worked out how to fix whatever's preventing us from turning gentoo testing back on for zuul-jobs changes either | 19:58 |
ianw | which is a little annoying to hold up all the other jobs | 19:58 |
fungi | though for a while it seemed to be staleness because the gentoo images hadn't been updated in something like 6-12 months | 19:58 |
clarkb | Alright we are at time. Thank you everyone. | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Oct 19 20:00:07 2021 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-19-19.01.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-19-19.01.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-19-19.01.log.html | 20:00 |
fungi | thanks clarkb! | 20:00 |