Tuesday, 2024-11-26

clarkbsent out the meeting agenda for tomorrow to make it official00:51
opendevreviewKarolina Kula proposed zuul/zuul-jobs master: DNM Switch to KVM  https://review.opendev.org/c/zuul/zuul-jobs/+/93602313:55
ykarelgit diff14:53
fungiwarning: Not a git repository. Use --no-index to compare two paths outside a working tree14:54
* fungi is --helpful14:54
opendevreviewJoel Capitao proposed openstack/diskimage-builder master: DNM Testing on KVM  https://review.opendev.org/c/openstack/diskimage-builder/+/93602416:53
opendevreviewJoel Capitao proposed openstack/diskimage-builder master: DNM Testing on KVM  https://review.opendev.org/c/openstack/diskimage-builder/+/93602417:50
clarkbfungi: I responded to your comment on the mm3 migration etherpad. Basically there are some things that I think need updating in ansible that I interpreted as aliases that I don't see covered. I see now that they are distinct to what you were referring to but I think they still need to be updated18:10
fungiclarkb: i had intended those to be covered by the todo at 5.118:11
clarkbok just make sure you also cover the mailman side and not just the mta/exim side18:12
clarkbthe value is used in both places18:12
fungii added ansible inventory groupvars to the change description to cover that just now18:12
clarkbbut also as noted I think you can update them before the change and have both old and new names listed?18:12
clarkbthough I'm not 100% certain of that18:13
fungii'm worried if we update the mailman groupvars in ansible it will try to create the new domain and recreate the lists in it18:13
clarkbfungi: right for the listdomains and lists themselves that would happen. I'm talking specifically about mm_domains18:14
fungisome bits could be added to ansible early, but i'm unclear on whether splitting it into two changes makes sense18:14
clarkbwhich I think is about allowing exim and django to accept connections with those names18:14
fungiwe'd still need to add forwarding aliases in exim from the new addresses to the old temporarily to make that work, and then flip them during the maintenance18:15
clarkbbut I also don't feel too strongly about it. If you remember to update all the places by hand then update ansible to match the end result should be the same18:15
fungii'm worried i don't have enough time between now and when the foundation announced the domain change maintenance to set up working temporary forwards to accept messages to the new addresses in advance18:16
funginor what the benefits of doing that extra work would be18:17
clarkbfungi: fwiw I didn't intend on setting up actual delivery of things18:22
clarkbjust updating our configs so that if someone did connect to that address they would get an error at a step past the initial connection18:23
clarkbto reduce the amount of changes required during the migration itself18:23
fungimaybe between steps #7 and #8 we should apply the config change with ansible (moving step #12 earlier)?18:25
opendevreviewClark Boylan proposed opendev/system-config master: Screenshot lodgeit captcha images  https://review.opendev.org/c/opendev/system-config/+/93629718:33
clarkbfrickler: "You retain all your ownership rights in your User Content. Docker simply displays or makes the User Content available to users of the Service and does not otherwise control the content thereof." I'm not a lawyer but I read this as meaning the content is available to be used under the content's license18:38
clarkbfrickler: and at least for the debian image it seems to indicate they are using the underlying distro and software licenses and not applying any additional restrictions18:38
clarkbnow there may be additional restrictions in that terms of service that limit the use of the image beyond where it is safe to rehost. I don't know18:39
fricklerclarkb: well the official images in my understanding are not user content, but content provided by docker.18:43
clarkbfrickler: right and the source code for those images is apache2 licensed and then in the debian image at least they say the image is provided under the licenses of the software contained within the image. But as you noted there is a note that the images also fall under the terms of use so you'd need something in the terms of use that prevents rehosting which I haven't found yet but I18:45
clarkbalso haven't read the terms of use in its entirety18:45
clarkbat the very least we could rebuild the images I think18:46
clarkbsince the source code for them is apache 218:46
fungithough even just caching and serving the images through a proxy could be considered rehosting, which we already do18:46
clarkbya though thats a bit different since we're caching the docker debian image fetched from docker18:48
clarkbvs download the docker debian image from docker then reupload it to quay. But in any case I haven't seen anything yet that would prevent this18:48
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems  https://review.opendev.org/c/openstack/diskimage-builder/+/93630118:59
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems  https://review.opendev.org/c/openstack/diskimage-builder/+/93433219:00
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems  https://review.opendev.org/c/openstack/diskimage-builder/+/93433219:01
corvusdid i miss the beginning of a docker license conversation?19:02
clarkbcorvus: it came up in the openstack tc meeting in the prior hour19:02
clarkbcorvus: tldr is frickler is concerned that docker official images state they are used under docker's terms of service and that we might not be allowed to rehost them19:02
clarkbcorvus: https://zuul.opendev.org/t/openstack/build/a2454524bdf6447cbaa7a7f38e8bb889 I wonder if this 404 is related to intermediate registry pruning? I've rechecked the parent change to reupload to the intermediate registry19:21
corvuscould check to see if those show up in logs19:22
clarkbyup doing a grep now19:23
clarkbthats weird we seem to log both a keep and delete for it19:23
clarkbin the real-6.log19:23
corvusremember the virtual dirs don't count19:24
clarkboh its a delete for an upload of the underlying data19:25
clarkbI'll have to look more closely after the meeting19:25
opendevreviewJeremy Stanley proposed opendev/system-config master: Move OpenInfra mailing lists to new domain  https://review.opendev.org/c/opendev/system-config/+/93630319:43
clarkbcorvus: urllib3.connectionpool: https://storage101.dfw1.clouddrive.com:443 "DELETE /v1/api_path/container_name/_local/blobs/sha256:e611aa8258cc3cdb338784afa7c6e85a2ab2d1fd85beef36a69170a54a5b0377/data HTTP/11" 204 019:49
clarkbcorvus: I think this means we decide the blob was not in a manifest and too old?19:49
clarkband or some bug in that determination19:50
corvusclarkb: so if i'm following correctly, that was a build from nov 20; when did the delete happen?19:50
clarkb2024-11-21 22:08:35,72619:50
corvuswe're supposed to keep 180 days, yeah?19:51
clarkbthe blob timeout is only an hour19:51
clarkbmanifest_target is 180 days and upload_target is 1 hour. We use upload target when deleting blobs19:52
clarkbUploading on the 20th would mean that we didn't see this manifest when listing manifests because it took 7 days the bulk of which was blob handling19:53
clarkbso maybe there is something wrong in the logic for cleaning up blobs that don't have a manifest tied to them with the hour long timeout19:53
corvusit's supposed to be one hour before the start of the prune19:54
clarkbyes and that does seem to be what we do19:55
clarkbI don't see us redefining upload_target19:56
corvusthat blob already existed on 11-1919:56
clarkbbut pruning started on the 15th right?19:57
corvuswhich suggests that more than one manifest pointed to it19:57
corvuslooks like it19:57
clarkb2024-11-16 00:02:42,501 <- thats the first timestamp in our log19:58
clarkboh are you suggesting that we had a collision maybe?19:58
corvusno collisions in a CAS19:58
corvusis it possible because of our stops/starts that we pruned all the manifests that pointed to the blob in earlier runs before our final blob-pruning run?19:58
clarkbbasically object existed at least an hour prior to 2024-11-16 00:02:42,501 and was pointed at by manifest A. Then we upload manifest B that points to it but we prune manifest A?19:59
corvusyep19:59
clarkbcorvus: ya thats what I'm wondering19:59
corvusi think that is possible, in which case that would make this an artifact of this individual erroneous pruning operation (ie, a one-off)19:59
clarkbif that is the cause then regular pruning like we do now should avoid this in the future19:59
clarkbsince we can do regular pruning in a short enough period and go from start to finish in one go to avoid this issue20:00
corvusi don't think it's a timing issue20:00
corvuswell20:00
corvusit's an "interruption + timing" issue20:00
corvusi think maybe the assertion that pruning is interruptible is not 100% right :)20:00
clarkbya20:00
corvusif it is not interrupted, then timing doesn't matter.  if it is interrupted, then we introduce a race.20:01
clarkbsince we move the upload_target ahead each time we start over20:01
corvusi think what we need is an adjustable "now"20:01
clarkbas long as we don't move that ahead we're fine with it running as long as it takes20:01
corvusso that we can resume pruning and set the pruning start time ("now" variable in the code) to the original pruning start time20:01
clarkbya that would avoid the problem20:02
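A minimal sketch of the adjustable "now" idea discussed above, assuming the two cutoffs are derived from a single start time. The names `prune_cutoffs`, `MANIFEST_RETENTION` and `UPLOAD_RETENTION` are hypothetical and are not taken from the registry code; only the 180 day and 1 hour values come from the conversation. The start time used in the example is also illustrative.

```python
from datetime import datetime, timedelta, timezone

# Retention values quoted in the conversation; the constant names are hypothetical.
MANIFEST_RETENTION = timedelta(days=180)
UPLOAD_RETENTION = timedelta(hours=1)


def prune_cutoffs(now=None):
    """Compute pruning cutoffs relative to a pruning start time.

    Passing the start time of an interrupted run as `now` keeps both
    cutoffs where they were, instead of moving them ahead on restart.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        "manifest_target": now - MANIFEST_RETENTION,
        "upload_target": now - UPLOAD_RETENTION,
    }


# Illustrative start time only; the real start time of the run is not known here.
original_start = datetime(2024, 11, 15, 0, 0, tzinfo=timezone.utc)

# A resumed run that reuses the original start time gets the same cutoffs.
resumed = prune_cutoffs(now=original_start)

# A restart that takes a fresh "now" a day later moves upload_target ahead.
restarted = prune_cutoffs(now=original_start + timedelta(days=1))
```

Recording the start time when a run begins and feeding it back in on resume is the follow-up described later in the log; how that value would be stored is left open here.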
fungigonna pop out to grab early dinner but should be back in about an hour20:02
opendevreviewClark Boylan proposed opendev/system-config master: Update Gerrit images to 3.9.8 and 3.10.3  https://review.opendev.org/c/opendev/system-config/+/93630520:08
clarkbcorvus: so to tldr I don't think there is anything we need to do right now for insecure-ci-registry. We can followup with capturing timestamps and feeding that back into the system for interrupted runs but I'm not super concerned about that now20:09
opendevreviewClark Boylan proposed opendev/system-config master: Screenshot lodgeit captcha images  https://review.opendev.org/c/opendev/system-config/+/93629720:35
corvusclarkb: ++20:41
clarkbfungi: fwiw I think you do need last because HTTP_HOST is listed as an http headervar which I think comes from the original request and not inflight rewrite state20:55
clarkbI left notes about that in your change. But otherwise that lgtm20:55
opendevreviewClark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test the 3.10 upgrade  https://review.opendev.org/c/opendev/system-config/+/89357120:57
clarkbI'm cycling my held nodes for gerrit testing to get this new patchset20:57
mnasiadkaIn Kolla we've noticed recently DockerHub started to be more... aggressive towards pull limits. We basically use what we can from quay.io, but debian and ubuntu base images are pulled from DockerHub - which fails from time to time. I understand the caching mechanism through Apache mod_proxy does not work - but maybe there's a pull through docker registry we could use somewhere inside OpenDev? Or at least a place to mirror those images? I21:02
mnasiadkadoubt authenticating to Docker Hub will help (but maybe I'm plain wrong)21:02
clarkbmnasiadka: so first up authenticating to docker hub will almost certainly fix your problems there are much bigger rate limits for authenticated users21:04
clarkbthe problem is their open source program is a bit of a pain to go through though they did make it easier than it has been21:05
clarkbI think ildikov and starglinx have some experience with that. We don't around here because we opted out of pursuing that when the original requirements were published21:05
clarkbnext I don't expect a pull through registry to change much compared to the apache proxy because you still have to pull in the data that bit doesn't change then you get rate limited whether you run apache to cache or a registry21:06
mnasiadkaI don't think I'm going to pursue the open source program, because Kolla does not satisfy the reqs today (we support podman)21:06
clarkba pull through registry has the additional problem of not being prunable so its size can grow without bound21:06
mnasiadkaSo I can try using the existing kolla account in DockerHub to work around the limits and see if that helps21:07
clarkbmnasiadka: I believe they dropped the requirement to not support other systems. Thats what I mean by them making it easier21:07
clarkbtheir original open source program said you can't use other tools but it didn't the last time I saw the requirements21:07
corvusmnasiadka: i started work on automated mirroring of images to quay.io in https://review.opendev.org/93557421:07
clarkband ya the last bit of info is the migration of images to a different registry ^ is one way to do that21:07
clarkbanother method would be to use alternative images if they already exist on say quay or in github's registry etc21:08
corvusi'd like to use that to mirror the few docker.io images that the zuul project uses to quay.io21:08
clarkbmnasiadka: keep in mind that you need to be careful when using docker accounts particularly if they have push permissions21:08
clarkbif you leak the credentials in a check job and those credentials can push then you could have your repo contents replaced21:09
mnasiadkayeah, for debian and ubuntu there are no alternative images - centos and rocky use quay.io today - so what corvus is working on could help us maintain a copy of debian/ubuntu base image in Kolla's quay.io namespace 21:09
mnasiadkawe have an org and some read only tokens as well, but thanks for the reminder :)21:09
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/93557421:12
clarkbit is also worth noting that I recently manually went through the process of getting an anonymous token and checked the rate limits in the token and they did not match the recent 10 per hour from the docker blog post21:14
clarkbit seems like something is fishy somewhere but I don't have enough insight behind the scenes to say where21:14
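One way to check anonymous pull limits by hand is sketched below. This is an assumption about method, not a record of what was actually run: it reads the `ratelimit-limit` and `ratelimit-remaining` response headers that Docker documents for its `ratelimitpreview/test` repository, rather than inspecting the token itself. The function names are hypothetical, and the network call is only defined, not executed.

```python
import json
import urllib.request

# Endpoints as described in Docker's rate limit documentation (assumed current).
TOKEN_URL = (
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull"
)
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


def parse_ratelimit(value):
    """Parse a header value such as '100;w=21600' into (count, window_seconds)."""
    count, _, rest = value.partition(";")
    window = None
    for part in rest.split(";"):
        key, _, val = part.strip().partition("=")
        if key == "w" and val:
            window = int(val)
    return int(count), window


def fetch_anonymous_limits():
    """Fetch an anonymous token, then read the rate limit headers via HEAD."""
    with urllib.request.urlopen(TOKEN_URL) as resp:
        token = json.load(resp)["token"]
    req = urllib.request.Request(
        MANIFEST_URL,
        headers={"Authorization": f"Bearer {token}"},
        method="HEAD",
    )
    with urllib.request.urlopen(req) as resp:
        limit = resp.headers.get("ratelimit-limit")
        remaining = resp.headers.get("ratelimit-remaining")
    return (
        parse_ratelimit(limit) if limit else None,
        parse_ratelimit(remaining) if remaining else None,
    )
```

Calling `fetch_anonymous_limits()` from different source IPs would show whether the advertised limit matches what the blog post announced.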
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/93557421:15
corvusname[template]: Jinja templates should only be at the end of 'name'21:15
corvusthat's from ansible-lint21:15
corvusis that a style lint, or is there something functional about that?21:16
corvusthat's saying "you can't use name: "something {{ foo }} something" you can only say name: "something {{ foo }}"21:17
clarkbcorvus: I'm still parsing the question but just so you know the matrix to irc translation ended up with some weird characters in front of name in your first and last line21:18
clarkbcorvus: https://ansible.readthedocs.io/projects/lint/rules/name/ its just the linter's opinion man21:19
clarkb"This helps with the identification of tasks inside the source code when they fail."21:19
clarkbthey suggest it for greppability basically21:19
corvusthanks.  that's.  yeah.  Dude.21:20
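The shape of the rule can be approximated in a few lines. This is a rough illustration of "templates only at the end of the name", not ansible-lint's actual implementation, and the function name `templates_only_at_end` is made up for the example.

```python
import re

# Matches a single Jinja expression sitting at the very end of the string.
_TRAILING_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}$")


def templates_only_at_end(name):
    """Return True if every '{{ ... }}' in a task name is at the end of it."""
    rest = name.rstrip()
    # Peel off any templates that sit at the end of the name.
    while True:
        match = _TRAILING_TEMPLATE.search(rest)
        if match is None:
            break
        rest = rest[: match.start()].rstrip()
    # Whatever is left is the fixed prefix; it must not contain a template.
    return "{{" not in rest


ok = templates_only_at_end("something {{ foo }}")
flagged = not templates_only_at_end("something {{ foo }} something")
```

Keeping the fixed text at the front means the prefix of a failing task name can be grepped for in the source, which is the motivation quoted from the rule's documentation.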
clarkbok I think my capture the new gerrit releases on test nodes change was successful in doing so21:30
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems  https://review.opendev.org/c/openstack/diskimage-builder/+/93433221:31
clarkbI don't think I'll run through upgrade and downgrade testing today, but having that all prepped for monday is a good thing21:31
clarkbmnasiadka: oh I just remembered another idea that was thrown out by someone (maybe you) but recording it here again for completeness: Using the buildset registry to act as a local cache for all your jobs may be helpful. Then you basically only do a single set of fetches from docker hub, push into buildset registry and everything fetches from there21:33
clarkbmnasiadka: the downside to this is it is unlikely to reduce the total number of requests per ip (since different jobs often use a different ip) however in the broader scope of things it will reduce the total number of requests on each ip which should make subsequent jobs more likely to pass when they run using the same ip later21:34
clarkbany reduction in requests to docker hub should produce overall improvements within the ci system21:35
mnasiadkawell, we basically fetch only two images (one in ubuntu build job and another in debian build job), so mirroring them or going the authenticated route probably is a better idea - in Kolla we only fetch the base image outside of OpenDev - and then install all pip/rpm/deb based on those images21:36
mnasiadkaWhich reminds me that Rocky 9 rpms are not mirrored - and then there's Rocky 10 somewhere around next year21:37
clarkbmnasiadka: right but kolla runs like 20 jobs when you push a change up (I randomed that number I don't know what the actual number is) and if each of those does several requests that goes into the bucket against our rate limit21:38
clarkbif instead you make 2 requests and then everything else fetches from the buildset registry you've gotten a large overall reduction21:38
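The reduction being described is simple arithmetic. The sketch below uses the made-up figure of 20 jobs from the conversation and an assumed 3 Docker Hub requests per job ("several" is not quantified in the log); both numbers and the function name `hub_requests` are illustrative only, and one request per image is a simplification.

```python
def hub_requests(jobs, requests_per_job, distinct_images, use_buildset_registry):
    """Count Docker Hub requests for one buildset under a simplified model.

    With a buildset registry, each distinct image is fetched from Docker Hub
    once and every job then pulls from the registry instead.
    """
    if use_buildset_registry:
        return distinct_images
    return jobs * requests_per_job


# Illustrative numbers only: 20 jobs, 3 requests each, 2 base images.
direct = hub_requests(20, 3, 2, use_buildset_registry=False)
cached = hub_requests(20, 3, 2, use_buildset_registry=True)
```

Under these assumed numbers the buildset makes 2 requests to Docker Hub instead of 60, although the per-IP picture depends on which nodes the jobs land on.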
clarkbI agree that not fetching from docker in the first place is better21:38
clarkbI'm just trying to call out all possible areas of improvement as some may be easier to implement than others21:39
mnasiadkakolla-ansible fetches the images from quay.io (from openstack.kolla namespace) - we could think of using the buildset registry (as in build daily for master and weekly for stable branches, push the images there and use it in kolla-ansible jobs) - but I don't think that's on our immediate plan21:40
clarkbthe buildset registry only runs while the buildset is running21:40
mnasiadkaah, ok21:40
clarkbyou wouldn't do anything with daily or weekly stuff there21:40
mnasiadkaany... more persistent registry we could use to limit the amount of traffic we generate?21:41
clarkbits just "what do we need for the buildset" -> we need debian and ubuntu base images -> cache them then everything grabs from there within the buildset21:41
clarkbmnasiadka: the big problem with a persistent registry is disk21:41
clarkbjust storing our speculative images for 180 days for CI testing is about 2TB21:42
clarkband thats only because we wrote our own registry just for that purpose that we can prune. None of the off the shelf registries really do pruning last I checked21:42
clarkbthey want to store things forever which is kind of a problem21:42
mnasiadkaright21:43
clarkbthen to build on top of that problem we've tried to be good citizens and proxy cache things but all the registries make that difficult to impossible too21:43
clarkbthe unfortunate truth is this system was never designed with cost scaling in mind21:43
clarkband now a decade later we're dealing with the consequences21:43
clarkbthere may be some crazy idea where we cache a super specific subset of items in the insecure-ci-registry and let its pruning deal with stuff21:44
clarkbor run a second instance21:44
corvusi don't know why we'd do that instead of just using quay or something else?21:44
clarkbif we run a second instance we could set different retention periods to the ephemeral images21:44
clarkbcorvus: probably the main reason would be to host within each cloud21:45
clarkbproblem then is not every cloud offers swift21:45
corvuswell, i mean, we already decided that we didn't want to be in the registry hosting business21:45
clarkbyes I agree and I don't think thats changed. I'm just trying to brainstorm all the options21:45
mnasiadkaif network traffic is not a problem, then quay.io is fine - I'll go waste my time somewhere else than in fixing problems that don't exist :)21:46
clarkbI think if we did do that it would be for a very small set of images and only as a mirror that we prune. Not as a host for official downloads?21:46
corvusif we do change our minds about that then i would probably propose hosting our public images in a self-hosted registry.  :)21:46
clarkbmnasiadka: I think we should start with that assumption since it is probably the easiest one to follow for now21:46
clarkband if that assumption is proven false we debug and brainstorm from there21:47
mnasiadkaI never thought about hosting official downloads, that's what quay.io is for - maybe having a local semi-persistent registry that would be cleaned up daily or something like that would speed up the CI / make projects use less internet bandwidth - but if that's not trivial - let's not try to fix what is not broken21:48
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/93557421:48
clarkbmnasiadka: ya I think the main issue is solving the storage problem21:48
clarkbwhich is not trivial due to clouds being inconsistent about their storage options21:48
clarkband limits on cinder volumes and so on21:48
clarkbwe would have to have a bunch of bespoke solutions for a common problem and I think we should avoid that if we can21:49
fungialso the concern that rh could decide to follow in docker's footsteps and impose similarly strict client rate limits22:00
clarkbseparately if kolla is only fetching two images from docker and opendev is only fetching a similar number (for builds its python-base + python-build and possibly the buildx container if multiarching; for usage its $server + mariadb typically) it makes me wonder who or what is doing many more fetches. its not like these jobs are short either and we're running many back to back22:05
clarkbbut I think that must be why it seems to work ~70% of the time for us right now22:05
clarkbmaybe even more22:05
clarkbbut that may be another avenue to approach this from. Identify the worst offenders and see if we can dial them back22:10
clarkbit used to be we fetched the buildset registry from docker too but we fixed that22:14
opendevreviewClark Boylan proposed opendev/system-config master: Capture lodgeit captchas for verification purposes  https://review.opendev.org/c/opendev/system-config/+/93629722:23
clarkbI've discovered an interesting frame of reference problem. We run testinfra on the bridge node and then run commands against test nodes. We have the screenshot mechanism that doesn't work for raw pngs you have to load html pages. But the selenium driver runs on the test node with a backhaul to the testinfra node built in22:24
clarkbanyway since I'm not using selenium I have to do it myself from the bridge node and I think we have /etc/hosts set up to make this work but not sure about ssl certs22:25
clarkbanyway I'm hopeful that will confirm the captcha is written where we want it to be and we can land the lodgeit updates22:27
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/93557422:32
opendevreviewClark Boylan proposed opendev/system-config master: Capture lodgeit captchas for verification purposes  https://review.opendev.org/c/opendev/system-config/+/93629722:43
clarkbok ssl verification did fail. But I think this should work now22:43
ianwin terms of ssl the nodes have a self-signed cert that they all should trust22:51
ianwhttps://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-base.yaml#L722:54
opendevreviewMerged opendev/lodgeit master: Run python3.11 job on Jammy  https://review.opendev.org/c/opendev/lodgeit/+/93571923:05
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/93557423:07
ianwclarkb: ohh, you know i bet that testinfra is in a venv, so the requests in there isn't patched to use the system ca-certificates.  i'd probably suggest the easiest thing to do is just "curl" the file and save it -- that way it uses system certs23:12
Clark[m]ianw: verify false worked23:13
ianwbut also, I think you could just use the screenshot and do "https://localhost/_captcha.png" and that would automatically save it23:13
Clark[m]No that was what I tried first but selenium loads the png and explodes23:14
Clark[m]It expects js/html input 23:14
Clark[m]I think the current ps works and shows the code does fix the problem 23:14
Clark[m]I think I'm ok with this verify=False solution. But ya the lodgeit change lgtm based on that test23:15
ianwoh ok, must have pulled up the old one sorry23:15
ianwprobably we should run testinfra with REQUESTS_CA_BUNDLE23:17
ianwbut ok, the captcha renders ... once again our infra testing pulls through :)23:18
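An alternative to disabling verification is to point the client at an explicit CA bundle. The sketch below uses only the standard library rather than requests, so it is a stand-in for the approach and not the code in the change; the helper names are hypothetical. Note that `REQUESTS_CA_BUNDLE` is a variable honored by the requests library itself, and this sketch simply reuses the same name by convention.

```python
import os
import ssl
import urllib.request


def build_tls_context(ca_file=None):
    """Return a verifying TLS context.

    Uses `ca_file` if given, else the REQUESTS_CA_BUNDLE environment
    variable if set, else Python's default trust store.
    """
    ca_file = ca_file or os.environ.get("REQUESTS_CA_BUNDLE")
    return ssl.create_default_context(cafile=ca_file)


def save_url(url, dest, ca_file=None):
    """Download `url` to `dest` with certificate verification enabled."""
    context = build_tls_context(ca_file)
    with urllib.request.urlopen(url, context=context) as resp:
        data = resp.read()
    with open(dest, "wb") as out:
        out.write(data)
```

On Debian and Ubuntu the system bundle typically lives at `/etc/ssl/certs/ca-certificates.crt`, so exporting that path before running the tests would be one way to try the suggestion above.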
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/93557423:26
*** iurygregory__ is now known as iurygregory23:37
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/93557423:58
