clarkb | sent out the meeting agenda for tomorrow to make it official | 00:51 |
---|---|---|
opendevreview | Karolina Kula proposed zuul/zuul-jobs master: DNM Switch to KVM https://review.opendev.org/c/zuul/zuul-jobs/+/936023 | 13:55 |
ykarel | git diff | 14:53 |
fungi | warning: Not a git repository. Use --no-index to compare two paths outside a working tree | 14:54 |
* fungi is --helpful | 14:54 | |
opendevreview | Joel Capitao proposed openstack/diskimage-builder master: DNM Testing on KVM https://review.opendev.org/c/openstack/diskimage-builder/+/936024 | 16:53 |
opendevreview | Joel Capitao proposed openstack/diskimage-builder master: DNM Testing on KVM https://review.opendev.org/c/openstack/diskimage-builder/+/936024 | 17:50 |
clarkb | fungi: I responded to your comment on the mm3 migration etherpad. Basically there are some things that I think need updating in ansible that I interpreted as aliases that I don't see covered. I see now that they are distinct from what you were referring to, but I think they still need to be updated | 18:10 |
fungi | clarkb: i had intended those to be covered by the todo at 5.1 | 18:11 |
clarkb | ok just make sure you also cover the mailman side and not just the mta/exim side | 18:12 |
clarkb | the value is used in both places | 18:12 |
fungi | i added ansible inventory groupvars to the change description to cover that just now | 18:12 |
clarkb | but also as noted I think you can update them before the change and have both old and new names listed? | 18:12 |
clarkb | though I'm not 100% certain of that | 18:13 |
fungi | i'm worried if we update the mailman groupvars in ansible it will try to create the new domain and recreate the lists in it | 18:13 |
clarkb | fungi: right for the listdomains and lists themselves that would happen. I'm talking specifically about mm_domains | 18:14 |
fungi | some bits could be added to ansible early, but i'm unclear on whether splitting it into two changes makes sense | 18:14 |
clarkb | which I think is about allowing exim and django to accept connections with those names | 18:14 |
fungi | we'd still need to add forwarding aliases in exim from the new addresses to the old temporarily to make that work, and then flip them during the maintenance | 18:15 |
clarkb | but I also don't feel too strongly about it. If you remember to update all the places by hand then update ansible to match the end result should be the same | 18:15 |
fungi | i'm worried i don't have enough time between now and when the foundation announced the domain change maintenance to set up working temporary forwards to accept messages to the new addresses in advance | 18:16 |
fungi | nor what the benefits of doing that extra work would be | 18:17 |
clarkb | fungi: fwiw I didn't intend on setting up actual delivery of things | 18:22 |
clarkb | just updating our configs so that if someone did connect to that address they would get an error at a step past the initial connection | 18:23 |
clarkb | to reduce the amount of changes required during the migration itself | 18:23 |
fungi | maybe between steps #7 and #8 we should apply the config change with ansible (moving step #12 earlier)? | 18:25 |
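The forwarding-alias idea in this thread can be sketched roughly as follows (a hypothetical helper; the list and domain names are illustrative, and the real configuration lives in opendev/system-config ansible groupvars and exim templates):

```python
# Hypothetical sketch: generate temporary exim alias entries that forward
# addresses at a new list domain to the existing lists on the old domain,
# so mail to the new addresses is accepted before the maintenance flips
# them. All names here are illustrative, not the real migration data.

def temporary_forwards(lists, old_domain, new_domain):
    """Map each new-domain address to its old-domain counterpart."""
    return {
        f"{name}@{new_domain}": f"{name}@{old_domain}"
        for name in lists
    }

if __name__ == "__main__":
    aliases = temporary_forwards(
        ["foundation", "staff"],
        old_domain="lists.old.example.org",
        new_domain="lists.new.example.org",
    )
    for src, dst in sorted(aliases.items()):
        print(f"{src}: {dst}")
```

During the maintenance the mapping would be inverted (old addresses forwarding to new) and the mailman/django side updated to match.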
opendevreview | Clark Boylan proposed opendev/system-config master: Screenshot lodgeit captcha images https://review.opendev.org/c/opendev/system-config/+/936297 | 18:33 |
clarkb | frickler: "You retain all your ownership rights in your User Content. Docker simply displays or makes the User Content available to users of the Service and does not otherwise control the content thereof." I'm not a lawyer but I read this as meaning the content is available to be used under the content's license | 18:38 |
clarkb | frickler: and at least for the debian image it seems to indicate they are using the underlying distro and software licenses and not applying any additional restrictions | 18:38 |
clarkb | now there may be additional restrictions in the terms of service that limit the use of the image beyond where it is safe to rehost. I don't know | 18:39 |
frickler | clarkb: well the official images in my understanding are not user content, but content provided by docker. | 18:43 |
clarkb | frickler: right and the source code for those images is apache2 licensed and then in the debian image at least they say the image is provided under the licenses of the software contained within the image. But as you noted there is a note that the images also fall under the terms of use so you'd need something in the terms of use that prevents rehosting which I haven't found yet but I | 18:45 |
clarkb | also haven't read the terms of use in its entirety | 18:45 |
clarkb | at the very least we could rebuild the images I think | 18:46 |
clarkb | since the source code for them is apache 2 | 18:46 |
fungi | though even just caching and serving the images through a proxy could be considered rehosting, which we already do | 18:46 |
clarkb | ya though that's a bit different since we're caching the docker debian image fetched from docker | 18:48 |
clarkb | vs downloading the docker debian image from docker and then reuploading it to quay. But in any case I haven't seen anything yet that would prevent this | 18:48 |
opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems https://review.opendev.org/c/openstack/diskimage-builder/+/936301 | 18:59 |
opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems https://review.opendev.org/c/openstack/diskimage-builder/+/934332 | 19:00 |
opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems https://review.opendev.org/c/openstack/diskimage-builder/+/934332 | 19:01 |
corvus | did i miss the beginning of a docker license conversation? | 19:02 |
clarkb | corvus: it came up in the openstack tc meeting in the prior hour | 19:02 |
clarkb | corvus: tldr is frickler is concerned that docker official images state they are used under docker's terms of service and that we might not be allowed to rehost them | 19:02 |
clarkb | corvus: https://zuul.opendev.org/t/openstack/build/a2454524bdf6447cbaa7a7f38e8bb889 I wonder if this 404 is related to intermediate registry pruning? I've rechecked the parent change to reupload to the intermediate registry | 19:21 |
corvus | could check to see if those show up in logs | 19:22 |
clarkb | yup doing a grep now | 19:23 |
clarkb | that's weird, we seem to log both a keep and a delete for it | 19:23 |
clarkb | in the real-6.log | 19:23 |
corvus | remember the virtual dirs don't count | 19:24 |
clarkb | oh its a delete for an upload of the underlying data | 19:25 |
clarkb | I'll have to look more closely after the meeting | 19:25 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Move OpenInfra mailing lists to new domain https://review.opendev.org/c/opendev/system-config/+/936303 | 19:43 |
clarkb | corvus: urllib3.connectionpool: https://storage101.dfw1.clouddrive.com:443 "DELETE /v1/api_path/container_name/_local/blobs/sha256:e611aa8258cc3cdb338784afa7c6e85a2ab2d1fd85beef36a69170a54a5b0377/data HTTP/11" 204 0 | 19:49 |
clarkb | corvus: I think this means we decided the blob was not in a manifest and was too old? | 19:49 |
clarkb | and or some bug in that determination | 19:50 |
corvus | clarkb: so if i'm following correctly, that was a build from nov 20; when did the delete happen? | 19:50 |
clarkb | 2024-11-21 22:08:35,726 | 19:50 |
corvus | we're supposed to keep 180 days, yeah? | 19:51 |
clarkb | the blob timeout is only an hour | 19:51 |
clarkb | manifest_target is 180 days and upload_target is 1 hour. We use upload target when deleting blobs | 19:52 |
clarkb | Uploading on the 20th would mean that we didn't see this manifest when listing manifests, because the run took 7 days, the bulk of which was blob handling | 19:53 |
clarkb | so maybe there is something wrong in the logic for cleaning up blobs that don't have a manifest tied to them with the hour long timeout | 19:53 |
corvus | it's supposed to be one hour before the start of the prune | 19:54 |
clarkb | yes and that does seem to be what we do | 19:55 |
clarkb | I don't see us redefining upload_target | 19:56 |
corvus | that blob already existed on 11-19 | 19:56 |
clarkb | but pruning started on the 15th right? | 19:57 |
corvus | which suggests that more than one manifest pointed to it | 19:57 |
corvus | looks like it | 19:57 |
clarkb | 2024-11-16 00:02:42,501 <- thats the first timestamp in our log | 19:58 |
clarkb | oh are you suggesting that we had a collision maybe? | 19:58 |
corvus | no collisions in a CAS | 19:58 |
corvus | is it possible because of our stops/starts that we pruned all the manifests that pointed to the blob in earlier runs before our final blob-pruning run? | 19:58 |
clarkb | basically object existed at least an hour prior to 2024-11-16 00:02:42,501 and was pointed at by manifest A. Then we upload manifest B that points to it but we prune manifest A? | 19:59 |
corvus | yep | 19:59 |
clarkb | corvus: ya that's what I'm wondering | 19:59 |
corvus | i think that is possible, in which case that would make this an artifact of this individual erroneous pruning operation (ie, a one-off) | 19:59 |
clarkb | if that is the cause then regular pruning like we do now would avoid this in the future | 19:59 |
clarkb | since we can do regular pruning in a short enough period and go from start to finish in one go to avoid this issue | 20:00 |
corvus | i don't think it's a timing issue | 20:00 |
corvus | well | 20:00 |
corvus | it's an "interruption + timing" issue | 20:00 |
corvus | i think maybe the assertion that pruning is interruptible is not 100% right :) | 20:00 |
clarkb | ya | 20:00 |
corvus | if it is not interrupted, then timing doesn't matter. if it is interrupted, then we introduce a race. | 20:01 |
clarkb | since we move the upload_target ahead each time we start over | 20:01 |
corvus | i think what we need is an adjustable "now" | 20:01 |
clarkb | as long as we don't move that ahead we're fine with it running as long as it takes | 20:01 |
corvus | so that we can resume pruning and set the pruning start time ("now" variable in the code) to the original pruning start time | 20:01 |
clarkb | ya that would avoid the problem | 20:02 |
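The race and the proposed fix can be modeled in a few lines (a simplified, hypothetical model, not the actual registry pruning code; the timestamps mirror the ones quoted above):

```python
# Simplified, hypothetical model of the blob-pruning cutoff discussed
# above. upload_target is the one-hour grace period for blobs not yet
# referenced by any manifest.

from datetime import datetime, timedelta

UPLOAD_TARGET = timedelta(hours=1)

def blob_should_prune(uploaded_at, referenced, prune_start):
    """A blob is prunable only if nothing references it AND it was
    uploaded more than UPLOAD_TARGET before the prune's start time."""
    return not referenced and uploaded_at < prune_start - UPLOAD_TARGET

# The race: a prune that began on the 16th is interrupted and restarted
# on the 21st. A blob uploaded on the 20th is safe against the original
# start time, but prunable against the restarted "now" -- so a resumed
# prune should reuse the original start time, not the current clock.
original_start = datetime(2024, 11, 16, 0, 2)
restart = datetime(2024, 11, 21, 22, 8)
uploaded = datetime(2024, 11, 20, 12, 0)

print(blob_should_prune(uploaded, False, original_start))  # False: kept
print(blob_should_prune(uploaded, False, restart))         # True: deleted
```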
fungi | gonna pop out to grab early dinner but should be back in about an hour | 20:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update Gerrit images to 3.9.8 and 3.10.3 https://review.opendev.org/c/opendev/system-config/+/936305 | 20:08 |
clarkb | corvus: so, tldr: I don't think there is anything we need to do right now for insecure-ci-registry. We can follow up with capturing timestamps and feeding that back into the system for interrupted runs, but I'm not super concerned about that now | 20:09 |
opendevreview | Clark Boylan proposed opendev/system-config master: Screenshot lodgeit captcha images https://review.opendev.org/c/opendev/system-config/+/936297 | 20:35 |
corvus | clarkb: ++ | 20:41 |
clarkb | fungi: fwiw I think you do need last because HTTP_HOST is listed as an http header var which I think comes from the original request and not inflight rewrite state | 20:55 |
clarkb | I left notes about that in your change. But otherwise that lgtm | 20:55 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test the 3.10 upgrade https://review.opendev.org/c/opendev/system-config/+/893571 | 20:57 |
clarkb | I'm cycling my held nodes for gerrit testing to get this new patchset | 20:57 |
mnasiadka | In Kolla we've noticed recently DockerHub started to be more... aggressive towards pull limits. We basically use what we can from quay.io, but debian and ubuntu base images are pulled from DockerHub - which fails from time to time. I understand the caching mechanism through Apache mod_proxy does not work - but maybe there's a pull through docker registry we could use somewhere inside OpenDev? Or at least a place to mirror those images? I | 21:02 |
mnasiadka | doubt authenticating to Docker Hub will help (but maybe I'm plain wrong) | 21:02 |
clarkb | mnasiadka: so first up, authenticating to docker hub will almost certainly fix your problems; there are much bigger rate limits for authenticated users | 21:04 |
clarkb | the problem is their open source program is a bit of a pain to go through, though they did make it easier than it has been | 21:05 |
clarkb | I think ildikov and starlingx have some experience with that. We don't have any around here because we opted out of pursuing that when the original requirements were published | 21:05 |
clarkb | next I don't expect a pull through registry to change much compared to the apache proxy because you still have to pull in the data; that bit doesn't change, and then you get rate limited whether you run apache to cache or a registry | 21:06 |
mnasiadka | I don't think I'm going to pursue the open source program, because Kolla does not satisfy the reqs today (we support podman) | 21:06 |
clarkb | a pull through registry has the additional problem of not being prunable so its size can grow without bound | 21:06 |
mnasiadka | So I can try using the existing kolla account in DockerHub to work around the limits and see if that helps | 21:07 |
clarkb | mnasiadka: I believe they dropped the requirement to not support other systems. That's what I mean by them making it easier | 21:07 |
clarkb | their original open source program said you can't use other tools, but it didn't the last time I saw the requirements | 21:07 |
corvus | mnasiadka: i started work on automated mirroring of images to quay.io in https://review.opendev.org/935574 | 21:07 |
clarkb | and ya the last bit of info is the migration of images to a different registry ^ is one way to do that | 21:07 |
clarkb | another method would be to use alternative images if they already exist on say quay or in github's registry etc | 21:08 |
corvus | i'd like to use that to mirror the few docker.io images that the zuul project uses to quay.io | 21:08 |
clarkb | mnasiadka: keep in mind that you need to be careful when using docker accounts particularly if they have push permissions | 21:08 |
clarkb | if you leak the credentials in a check job and those credentials can push then you could have your repo contents replaced | 21:09 |
mnasiadka | yeah, for debian and ubuntu there are no alternative images - centos and rocky use quay.io today - so what corvus is working on could help us maintain a copy of debian/ubuntu base image in Kolla's quay.io namespace | 21:09 |
mnasiadka | we have an org and some read only tokens as well, but thanks for the reminder :) | 21:09 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 21:12 |
clarkb | it is also worth noting that I recently manually went through the process of getting an anonymous token and checked the rate limits in the token and they did not match the recent 10 per hour from the docker blog post | 21:14 |
clarkb | it seems like something is fishy somewhere but I don't have enough insight behind the scenes to say where | 21:14 |
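For reference, Docker Hub reports pull limits via `ratelimit-limit`/`ratelimit-remaining` headers on a manifest HEAD request made with the (possibly anonymous) token; here is a small parser for that header format (fetching the token itself is omitted, and the sample value is illustrative):

```python
# Hedged sketch: parse Docker Hub's ratelimit headers, which look like
# "100;w=21600" (a limit of 100 pulls per 21600-second window), per
# Docker's published rate-limit documentation. Obtaining the headers
# requires a token from auth.docker.io and a HEAD request against
# registry-1.docker.io, which this sketch does not perform.

def parse_ratelimit(value):
    """Return (limit, window_seconds) from an 'N;w=SECONDS' header value."""
    limit_part, _, window_part = value.partition(";")
    window = int(window_part.split("=", 1)[1]) if window_part else None
    return int(limit_part), window

if __name__ == "__main__":
    print(parse_ratelimit("100;w=21600"))  # (100, 21600)
```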
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 21:15 |
corvus | name[template]: Jinja templates should only be at the end of 'name' | 21:15 |
corvus | that's from ansible-lint | 21:15 |
corvus | is that a style lint, or is there something functional about that? | 21:16 |
corvus | that's saying "you can't use name: "something {{ foo }} something" you can only say name: "something {{ foo }}" | 21:17 |
clarkb | corvus: I'm still parsing the question but just so you know the matrix to irc translation ended up with some weird characters in front of name in your first and last line | 21:18 |
clarkb | corvus: https://ansible.readthedocs.io/projects/lint/rules/name/ it's just the linter's opinion, man | 21:19 |
clarkb | "This helps with the identification of tasks inside the source code when they fail." | 21:19 |
clarkb | they suggest it for greppability basically | 21:19 |
corvus | thanks. that's. yeah. Dude. | 21:20 |
clarkb | ok I think my capture the new gerrit releases on test nodes change was successful in doing so | 21:30 |
opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems https://review.opendev.org/c/openstack/diskimage-builder/+/934332 | 21:31 |
clarkb | I don't think I'll run through upgrade and downgrade testing today, but having that all prepped for monday is a good thing | 21:31 |
clarkb | mnasiadka: oh I just remembered another idea that was thrown out by someone (maybe you) but recording it here again for completeness: Using the buildset registry to act as a local cache for all your jobs may be helpful. Then you basically only do a single set of fetches from docker hub, push into buildset registry and everything fetches from there | 21:33 |
clarkb | mnasiadka: the downside to this is it is unlikely to reduce the number of requests from any single ip (since different jobs often use a different ip), however in the broader scope of things it will reduce the total number of requests across the ips, which should make subsequent jobs more likely to pass when they run using the same ip later | 21:34 |
clarkb | any reduction in requests to docker hub should produce overall improvements within the ci system | 21:35 |
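The request arithmetic being described can be sketched as follows (job and image counts are made up for illustration):

```python
# Back-of-the-envelope sketch (job and image counts are invented) of why
# a buildset registry reduces upstream Docker Hub traffic: without it,
# every job pulls each base image itself; with it, the images are pulled
# once per buildset and all jobs fetch from the local registry instead.

def upstream_pulls(jobs, images, buildset_registry):
    """Total image pulls that hit the upstream registry for one buildset."""
    return images if buildset_registry else jobs * images

if __name__ == "__main__":
    print(upstream_pulls(jobs=20, images=2, buildset_registry=False))  # 40
    print(upstream_pulls(jobs=20, images=2, buildset_registry=True))   # 2
```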
mnasiadka | well, we basically fetch only two images (one in ubuntu build job and another in debian build job), so mirroring them or going the authenticated route probably is a better idea - in Kolla we only fetch the base image outside of OpenDev - and then install all pip/rpm/deb based on those images | 21:36 |
mnasiadka | Which reminds me that Rocky 9 rpms are not mirrored - and then there's Rocky 10 somewhere around next year | 21:37 |
clarkb | mnasiadka: right but kolla runs like 20 jobs when you push a change up (I made that number up, I don't know what the actual number is) and if each of those does several requests that goes into the bucket against our rate limit | 21:38 |
clarkb | if instead you make 2 requests and then everything else fetches from the buildset registry you've gotten a large overall reduction | 21:38 |
clarkb | I agree that not fetching from docker in the first place is better | 21:38 |
clarkb | I'm just trying to call out all possible areas of improvement as some may be easier to implement than others | 21:39 |
mnasiadka | kolla-ansible fetches the images from quay.io (from openstack.kolla namespace) - we could think of using the buildset registry (as in build daily for master and weekly for stable branches, push the images there and use it in kolla-ansible jobs) - but I don't think that's on our immediate plan | 21:40 |
clarkb | the buildset registry only runs while the buildset is running | 21:40 |
mnasiadka | ah, ok | 21:40 |
clarkb | you wouldn't do anything with daily or weekly stuff there | 21:40 |
mnasiadka | any... more persistent registry we could use to limit the amount of traffic we generate? | 21:41 |
clarkb | it's just "what do we need for the buildset" -> we need debian and ubuntu base images -> cache them, then everything grabs from there within the buildset | 21:41 |
clarkb | mnasiadka: the big problem with a persistent registry is disk | 21:41 |
clarkb | just storing our speculative images for 180 days for CI testing is about 2TB | 21:42 |
clarkb | and that's only because we wrote our own registry just for that purpose that we can prune. None of the off the shelf registries really do pruning, last I checked | 21:42 |
clarkb | they want to store things forever which is kind of a problem | 21:42 |
mnasiadka | right | 21:43 |
clarkb | then to build on top of that problem we've tried to be good citizens and proxy cache things but all the registries make that difficult to impossible too | 21:43 |
clarkb | the unfortunate truth is this system was never designed with cost scaling in mind | 21:43 |
clarkb | and now a decade later we're dealing with the consequences | 21:43 |
clarkb | there may be some crazy idea where we cache a super specific subset of items in the insecure-ci-registry and let its pruning deal with stuff | 21:44 |
clarkb | or run a second instance | 21:44 |
corvus | i don't know why we'd do that instead of just using quay or something else? | 21:44 |
clarkb | if we run a second instance we could set different retention periods to the ephemeral images | 21:44 |
clarkb | corvus: probably the main reason would be to host within each cloud | 21:45 |
clarkb | problem then is not every cloud offers swift | 21:45 |
corvus | well, i mean, we already decided that we didn't want to be in the registry hosting business | 21:45 |
clarkb | yes I agree and I don't think that's changed. I'm just trying to brainstorm all the options | 21:45 |
mnasiadka | if network traffic is not a problem, then quay.io is fine - I'll go waste my time somewhere else than in fixing problems that don't exist :) | 21:46 |
clarkb | I think if we did do that it would be for a very small set of images and only as a mirror that we prune. Not as a host for official downloads? | 21:46 |
corvus | if we do change our minds about that then i would probably propose hosting our public images in a self-hosted registry. :) | 21:46 |
clarkb | mnasiadka: I think we should start with that assumption since it is probably the easiest one to follow for now | 21:46 |
clarkb | and if that assumption is proven false we debug and brainstorm from there | 21:47 |
mnasiadka | I never thought about hosting official downloads, that's what quay.io is for - maybe having a local semi-persistent registry that would be cleaned up daily or something like that would speed up the CI / make projects use less internet bandwidth - but if that's not trivial - let's not try to fix what is not broken | 21:48 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 21:48 |
clarkb | mnasiadka: ya I think the main issue is solving the storage problem | 21:48 |
clarkb | which is not trivial due to clouds being inconsistent about their storage options | 21:48 |
clarkb | and limits on cinder volumes and so on | 21:48 |
clarkb | we would have to have a bunch of bespoke solutions for a common problem and I think we should avoid that if we can | 21:49 |
fungi | also the concern that rh could decide to follow in docker's footsteps and impose similarly strict client rate limits | 22:00 |
clarkb | separately, if kolla is only fetching two images from docker and opendev is only fetching a similar number (for builds it's python-base + python-build and possibly the buildx container if multiarching; for usage it's $server + mariadb typically), it makes me wonder who or what is doing many more fetches. it's not like these jobs are short either, and we're running many back to back | 22:05 |
clarkb | but I think that must be why it seems to work ~70% of the time for us right now | 22:05 |
clarkb | maybe even more | 22:05 |
clarkb | but that may be another avenue to approach this from. Identify the worst offenders and see if we can dial them back | 22:10 |
clarkb | it used to be we fetched the buildset registry from docker too but we fixed that | 22:14 |
opendevreview | Clark Boylan proposed opendev/system-config master: Capture lodgeit captchas for verification purposes https://review.opendev.org/c/opendev/system-config/+/936297 | 22:23 |
clarkb | I've discovered an interesting frame of reference problem. We run testinfra on the bridge node and then run commands against test nodes. We have the screenshot mechanism that doesn't work for raw pngs; you have to load html pages. But the selenium driver runs on the test node with a backhaul to the testinfra node built in | 22:24 |
clarkb | anyway since I'm not using selenium I have to do it myself from the bridge node and I think we have /etc/hosts set up to make this work but not sure about ssl certs | 22:25 |
clarkb | anyway I'm hopeful that will confirm the captcha is written where we want it to be and we can land the lodgeit updates | 22:27 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 22:32 |
opendevreview | Clark Boylan proposed opendev/system-config master: Capture lodgeit captchas for verification purposes https://review.opendev.org/c/opendev/system-config/+/936297 | 22:43 |
clarkb | ok ssl verification did fail. But I think this should work now | 22:43 |
ianw | in terms of ssl the nodes have a self-signed cert that they all should trust | 22:51 |
ianw | https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-base.yaml#L7 | 22:54 |
opendevreview | Merged opendev/lodgeit master: Run python3.11 job on Jammy https://review.opendev.org/c/opendev/lodgeit/+/935719 | 23:05 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 23:07 |
ianw | clarkb: ohh, you know i bet that testinfra is in a venv, so the requests in there isn't patched to use the system ca-certificates. i'd probably suggest the easiest thing to do is just "curl" the file and save it -- that way it uses system certs | 23:12 |
Clark[m] | ianw: verify false worked | 23:13 |
ianw | but also, I think you could just use the screenshot and do "https://localhost/_captcha.png" and that would automatically save it | 23:13 |
Clark[m] | No that was what I tried first but selenium loads the png and explodes | 23:14 |
Clark[m] | It expects js/html input | 23:14 |
Clark[m] | I think the current ps works and shows the code does fix the problem | 23:14 |
Clark[m] | I think I'm ok with this verify=False solution. But ya the lodgeit change lgtm based on that test | 23:15 |
ianw | oh ok, must have pulled up the old one sorry | 23:15 |
ianw | probably we should run testinfra with REQUESTS_CA_BUNDLE | 23:17 |
ianw | but ok, the captcha renders ... once again our infra testing pulls through :) | 23:18 |
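The REQUESTS_CA_BUNDLE suggestion can be sketched as a tiny helper (hypothetical; requests already honors the variable on its own, this just makes the fallback to verify=False explicit, and the bundle path is illustrative):

```python
# Hypothetical helper: decide what to pass as requests' verify= argument
# when running testinfra out of a venv. If REQUESTS_CA_BUNDLE points at
# the job's self-signed CA, verification works; otherwise fall back to
# the verify=False approach used in the change above. The bundle path in
# the example is illustrative.

import os

def verify_setting(env=os.environ):
    """Prefer a CA bundle from the environment; else skip verification."""
    return env.get("REQUESTS_CA_BUNDLE", False)

if __name__ == "__main__":
    print(verify_setting({"REQUESTS_CA_BUNDLE": "/etc/ssl/certs/test-ca.pem"}))
    print(verify_setting({}))
```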
opendevreview | James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 23:26 |
*** iurygregory__ is now known as iurygregory | 23:37 | |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 23:58 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!