clarkb | sent out the meeting agenda for tomorrow to make it official | 00:51 |
---|---|---|
opendevreview | Karolina Kula proposed zuul/zuul-jobs master: DNM Switch to KVM https://review.opendev.org/c/zuul/zuul-jobs/+/936023 | 13:55 |
ykarel | git diff | 14:53 |
fungi | warning: Not a git repository. Use --no-index to compare two paths outside a working tree | 14:54 |
* fungi is --helpful | 14:54 | |
opendevreview | Joel Capitao proposed openstack/diskimage-builder master: DNM Testing on KVM https://review.opendev.org/c/openstack/diskimage-builder/+/936024 | 16:53 |
opendevreview | Joel Capitao proposed openstack/diskimage-builder master: DNM Testing on KVM https://review.opendev.org/c/openstack/diskimage-builder/+/936024 | 17:50 |
clarkb | fungi: I responded to your comment on the mm3 migration etherpad. Basically there are some things that I think need updating in ansible that I interpreted as aliases that I don't see covered. I see now that they are distinct from what you were referring to, but I think they still need to be updated | 18:10 |
fungi | clarkb: i had intended those to be covered by the todo at 5.1 | 18:11 |
clarkb | ok just make sure you also cover the mailman side and not just the mta/exim side | 18:12 |
clarkb | the value is used in both places | 18:12 |
fungi | i added ansible inventory groupvars to the change description to cover that just now | 18:12 |
clarkb | but also as noted I think you can update them before the change and have both old and new names listed? | 18:12 |
clarkb | though I'm not 100% certain of that | 18:13 |
fungi | i'm worried if we update the mailman groupvars in ansible it will try to create the new domain and recreate the lists in it | 18:13 |
clarkb | fungi: right for the listdomains and lists themselves that would happen. I'm talking specifically about mm_domains | 18:14 |
fungi | some bits could be added to ansible early, but i'm unclear on whether splitting it into two changes makes sense | 18:14 |
clarkb | which I think is about allowing exim and django to accept connections with those names | 18:14 |
fungi | we'd still need to add forwarding aliases in exim from the new addresses to the old temporarily to make that work, and then flip them during the maintenance | 18:15 |
clarkb | but I also don't feel too strongly about it. If you remember to update all the places by hand then update ansible to match the end result should be the same | 18:15 |
fungi | i'm worried i don't have enough time between now and when the foundation announced the domain change maintenance to set up working temporary forwards to accept messages to the new addresses in advance | 18:16 |
fungi | nor what the benefits of doing that extra work would be | 18:17 |
clarkb | fungi: fwiw I didn't intend on setting up actual delivery of things | 18:22 |
clarkb | just updating our configs so that if someone did connect to that address they would get an error at a step past the initial connection | 18:23 |
clarkb | to reduce the amount of changes required during the migration itself | 18:23 |
fungi | maybe between steps #7 and #8 we should apply the config change with ansible (moving step #12 earlier)? | 18:25 |
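The forwarding-alias idea in this thread can be sketched roughly as follows (a hypothetical helper; the list and domain names are illustrative, and the real configuration lives in opendev/system-config ansible groupvars and exim templates):

```python
# Hypothetical sketch: generate temporary exim alias entries that forward
# addresses at a new list domain to the existing lists on the old domain,
# so mail to the new addresses is accepted before the maintenance flips
# them. All names here are illustrative, not the real migration data.

def temporary_forwards(lists, old_domain, new_domain):
    """Map each new-domain address to its old-domain counterpart."""
    return {
        f"{name}@{new_domain}": f"{name}@{old_domain}"
        for name in lists
    }

if __name__ == "__main__":
    aliases = temporary_forwards(
        ["foundation", "staff"],
        old_domain="lists.old.example.org",
        new_domain="lists.new.example.org",
    )
    for src, dst in sorted(aliases.items()):
        print(f"{src}: {dst}")
```

During the maintenance the mapping would be inverted (old addresses forwarding to new) and the mailman/django side updated to match.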
opendevreview | Clark Boylan proposed opendev/system-config master: Screenshot lodgeit captcha images https://review.opendev.org/c/opendev/system-config/+/936297 | 18:33 |
clarkb | frickler: "You retain all your ownership rights in your User Content. Docker simply displays or makes the User Content available to users of the Service and does not otherwise control the content thereof." I'm not a lawyer but I read this as meaning the content is available to be used under the content's license | 18:38 |
clarkb | frickler: and at least for the debian image it seems to indicate they are using the underlying distro and software licenses and not applying any additional restrictions | 18:38 |
clarkb | now there may be additional restrictions in the terms of service that limit the use of the image beyond where it is safe to rehost. I don't know | 18:39 |
frickler | clarkb: well the official images in my understanding are not user content, but content provided by docker. | 18:43 |
clarkb | frickler: right and the source code for those images is apache2 licensed and then in the debian image at least they say the image is provided under the licenses of the software contained within the image. But as you noted there is a note that the images also fall under the terms of use so you'd need something in the terms of use that prevents rehosting which I haven't found yet but I | 18:45 |
clarkb | also haven't read the terms of use in its entirety | 18:45 |
clarkb | at the very least we could rebuild the images I think | 18:46 |
clarkb | since the source code for them is apache 2 | 18:46 |
fungi | though even just caching and serving the images through a proxy could be considered rehosting, which we already do | 18:46 |
clarkb | ya though that's a bit different since we're caching the docker debian image fetched from docker | 18:48 |
clarkb | vs downloading the docker debian image from docker and then reuploading it to quay. But in any case I haven't seen anything yet that would prevent this | 18:48 |
opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems https://review.opendev.org/c/openstack/diskimage-builder/+/936301 | 18:59 |
opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems https://review.opendev.org/c/openstack/diskimage-builder/+/934332 | 19:00 |
opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems https://review.opendev.org/c/openstack/diskimage-builder/+/934332 | 19:01 |
corvus | did i miss the beginning of a docker license conversation? | 19:02 |
clarkb | corvus: it came up in the openstack tc meeting in the prior hour | 19:02 |
clarkb | corvus: tldr is frickler is concerned that docker official images state they are used under docker's terms of service and that we might not be allowed to rehost them | 19:02 |
clarkb | corvus: https://zuul.opendev.org/t/openstack/build/a2454524bdf6447cbaa7a7f38e8bb889 I wonder if this 404 is related to intermediate registry pruning? I've rechecked the parent change to reupload to the intermediate registry | 19:21 |
corvus | could check to see if those show up in logs | 19:22 |
clarkb | yup doing a grep now | 19:23 |
clarkb | that's weird, we seem to log both a keep and a delete for it | 19:23 |
clarkb | in the real-6.log | 19:23 |
corvus | remember the virtual dirs don't count | 19:24 |
clarkb | oh its a delete for an upload of the underlying data | 19:25 |
clarkb | I'll have to look more closely after the meeting | 19:25 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Move OpenInfra mailing lists to new domain https://review.opendev.org/c/opendev/system-config/+/936303 | 19:43 |
clarkb | corvus: urllib3.connectionpool: https://storage101.dfw1.clouddrive.com:443 "DELETE /v1/api_path/container_name/_local/blobs/sha256:e611aa8258cc3cdb338784afa7c6e85a2ab2d1fd85beef36a69170a54a5b0377/data HTTP/11" 204 0 | 19:49 |
clarkb | corvus: I think this means we decided the blob was not in a manifest and was too old? | 19:49 |
clarkb | and or some bug in that determination | 19:50 |
corvus | clarkb: so if i'm following correctly, that was a build from nov 20; when did the delete happen? | 19:50 |
clarkb | 2024-11-21 22:08:35,726 | 19:50 |
corvus | we're supposed to keep 180 days, yeah? | 19:51 |
clarkb | the blob timeout is only an hour | 19:51 |
clarkb | manifest_target is 180 days and upload_target is 1 hour. We use upload target when deleting blobs | 19:52 |
clarkb | Uploading on the 20th would mean that we didn't see this manifest when listing manifests, because the run took 7 days, the bulk of which was blob handling | 19:53 |
clarkb | so maybe there is something wrong in the logic for cleaning up blobs that don't have a manifest tied to them with the hour long timeout | 19:53 |
corvus | it's supposed to be one hour before the start of the prune | 19:54 |
clarkb | yes and that does seem to be what we do | 19:55 |
clarkb | I don't see us redefining upload_target | 19:56 |
corvus | that blob already existed on 11-19 | 19:56 |
clarkb | but pruning started on the 15th right? | 19:57 |
corvus | which suggests that more than one manifest pointed to it | 19:57 |
corvus | looks like it | 19:57 |
clarkb | 2024-11-16 00:02:42,501 <- thats the first timestamp in our log | 19:58 |
clarkb | oh are you suggesting that we had a collision maybe? | 19:58 |
corvus | no collisions in a CAS | 19:58 |
corvus | is it possible because of our stops/starts that we pruned all the manifests that pointed to the blob in earlier runs before our final blob-pruning run? | 19:58 |
clarkb | basically object existed at least an hour prior to 2024-11-16 00:02:42,501 and was pointed at by manifest A. Then we upload manifest B that points to it but we prune manifest A? | 19:59 |
corvus | yep | 19:59 |
clarkb | corvus: ya that's what I'm wondering | 19:59 |
corvus | i think that is possible, in which case that would make this an artifact of this individual erroneous pruning operation (ie, a one-off) | 19:59 |
clarkb | if that is the cause then regular pruning like we do now would avoid this in the future | 19:59 |
clarkb | since we can do regular pruning in a short enough period and go from start to finish in one go to avoid this issue | 20:00 |
corvus | i don't think it's a timing issue | 20:00 |
corvus | well | 20:00 |
corvus | it's an "interruption + timing" issue | 20:00 |
corvus | i think maybe the assertion that pruning is interruptible is not 100% right :) | 20:00 |
clarkb | ya | 20:00 |
corvus | if it is not interrupted, then timing doesn't matter. if it is interrupted, then we introduce a race. | 20:01 |
clarkb | since we move the upload_target ahead each time we start over | 20:01 |
corvus | i think what we need is an adjustable "now" | 20:01 |
clarkb | as long as we don't move that ahead we're fine with it running as long as it takes | 20:01 |
corvus | so that we can resume pruning and set the pruning start time ("now" variable in the code) to the original pruning start time | 20:01 |
clarkb | ya that would avoid the problem | 20:02 |
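The race and the proposed fix can be modeled in a few lines (a simplified, hypothetical model, not the actual registry pruning code; the timestamps mirror the ones quoted above):

```python
# Simplified, hypothetical model of the blob-pruning cutoff discussed
# above. upload_target is the one-hour grace period for blobs not yet
# referenced by any manifest.

from datetime import datetime, timedelta

UPLOAD_TARGET = timedelta(hours=1)

def blob_should_prune(uploaded_at, referenced, prune_start):
    """A blob is prunable only if nothing references it AND it was
    uploaded more than UPLOAD_TARGET before the prune's start time."""
    return not referenced and uploaded_at < prune_start - UPLOAD_TARGET

# The race: a prune that began on the 16th is interrupted and restarted
# on the 21st. A blob uploaded on the 20th is safe against the original
# start time, but prunable against the restarted "now" -- so a resumed
# prune should reuse the original start time, not the current clock.
original_start = datetime(2024, 11, 16, 0, 2)
restart = datetime(2024, 11, 21, 22, 8)
uploaded = datetime(2024, 11, 20, 12, 0)

print(blob_should_prune(uploaded, False, original_start))  # False: kept
print(blob_should_prune(uploaded, False, restart))         # True: deleted
```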
fungi | gonna pop out to grab early dinner but should be back in about an hour | 20:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update Gerrit images to 3.9.8 and 3.10.3 https://review.opendev.org/c/opendev/system-config/+/936305 | 20:08 |
clarkb | corvus: so, tldr: I don't think there is anything we need to do right now for insecure-ci-registry. We can follow up with capturing timestamps and feeding that back into the system for interrupted runs, but I'm not super concerned about that now | 20:09 |
opendevreview | Clark Boylan proposed opendev/system-config master: Screenshot lodgeit captcha images https://review.opendev.org/c/opendev/system-config/+/936297 | 20:35 |
corvus | clarkb: ++ | 20:41 |
clarkb | fungi: fwiw I think you do need last because HTTP_HOST is listed as an http header var which I think comes from the original request and not inflight rewrite state | 20:55 |
clarkb | I left notes about that in your change. But otherwise that lgtm | 20:55 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test the 3.10 upgrade https://review.opendev.org/c/opendev/system-config/+/893571 | 20:57 |
clarkb | I'm cycling my held nodes for gerrit testing to get this new patchset | 20:57 |
mnasiadka | In Kolla we've noticed recently DockerHub started to be more... aggressive towards pull limits. We basically use what we can from quay.io, but debian and ubuntu base images are pulled from DockerHub - which fails from time to time. I understand the caching mechanism through Apache mod_proxy does not work - but maybe there's a pull through docker registry we could use somewhere inside OpenDev? Or at least a place to mirror those images? I | 21:02 |
mnasiadka | doubt authenticating to Docker Hub will help (but maybe I'm plain wrong) | 21:02 |
clarkb | mnasiadka: so first up, authenticating to docker hub will almost certainly fix your problems; there are much bigger rate limits for authenticated users | 21:04 |
clarkb | the problem is their open source program is a bit of a pain to go through, though they did make it easier than it has been | 21:05 |
clarkb | I think ildikov and starlingx have some experience with that. We don't have any around here because we opted out of pursuing that when the original requirements were published | 21:05 |
clarkb | next I don't expect a pull through registry to change much compared to the apache proxy because you still have to pull in the data; that bit doesn't change, and then you get rate limited whether you run apache to cache or a registry | 21:06 |
mnasiadka | I don't think I'm going to pursue the open source program, because Kolla does not satisfy the reqs today (we support podman) | 21:06 |
clarkb | a pull through registry has the additional problem of not being prunable so its size can grow without bound | 21:06 |
mnasiadka | So I can try using the existing kolla account in DockerHub to work around the limits and see if that helps | 21:07 |
clarkb | mnasiadka: I believe they dropped the requirement to not support other systems. That's what I mean by them making it easier | 21:07 |
clarkb | their original open source program said you can't use other tools, but it didn't the last time I saw the requirements | 21:07 |
corvus | mnasiadka: i started work on automated mirroring of images to quay.io in https://review.opendev.org/935574 | 21:07 |
clarkb | and ya the last bit of info is the migration of images to a different registry ^ is one way to do that | 21:07 |
clarkb | another method would be to use alternative images if they already exist on say quay or in github's registry etc | 21:08 |
corvus | i'd like to use that to mirror the few docker.io images that the zuul project uses to quay.io | 21:08 |
clarkb | mnasiadka: keep in mind that you need to be careful when using docker accounts particularly if they have push permissions | 21:08 |
clarkb | if you leak the credentials in a check job and those credentials can push then you could have your repo contents replaced | 21:09 |
mnasiadka | yeah, for debian and ubuntu there are no alternative images - centos and rocky use quay.io today - so what corvus is working on could help us maintain a copy of debian/ubuntu base image in Kolla's quay.io namespace | 21:09 |
mnasiadka | we have an org and some read only tokens as well, but thanks for the reminder :) | 21:09 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 21:12 |
clarkb | it is also worth noting that I recently manually went through the process of getting an anonymous token and checked the rate limits in the token and they did not match the recent 10 per hour from the docker blog post | 21:14 |
clarkb | it seems like something is fishy somewhere but I don't have enough insight behind the scenes to say where | 21:14 |
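For reference, Docker Hub reports pull limits via `ratelimit-limit`/`ratelimit-remaining` headers on a manifest HEAD request made with the (possibly anonymous) token; here is a small parser for that header format (fetching the token itself is omitted, and the sample value is illustrative):

```python
# Hedged sketch: parse Docker Hub's ratelimit headers, which look like
# "100;w=21600" (a limit of 100 pulls per 21600-second window), per
# Docker's published rate-limit documentation. Obtaining the headers
# requires a token from auth.docker.io and a HEAD request against
# registry-1.docker.io, which this sketch does not perform.

def parse_ratelimit(value):
    """Return (limit, window_seconds) from an 'N;w=SECONDS' header value."""
    limit_part, _, window_part = value.partition(";")
    window = int(window_part.split("=", 1)[1]) if window_part else None
    return int(limit_part), window

if __name__ == "__main__":
    print(parse_ratelimit("100;w=21600"))  # (100, 21600)
```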
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 21:15 |
corvus | name[template]: Jinja templates should only be at the end of 'name' | 21:15 |
corvus | that's from ansible-lint | 21:15 |
corvus | is that a style lint, or is there something functional about that? | 21:16 |
corvus | that's saying "you can't use name: "something {{ foo }} something" you can only say name: "something {{ foo }}" | 21:17 |
clarkb | corvus: I'm still parsing the question but just so you know the matrix to irc translation ended up with some weird characters in front of name in your first and last line | 21:18 |
clarkb | corvus: https://ansible.readthedocs.io/projects/lint/rules/name/ it's just the linter's opinion, man | 21:19 |
clarkb | "This helps with the identification of tasks inside the source code when they fail." | 21:19 |
clarkb | they suggest it for greppability basically | 21:19 |
corvus | thanks. that's. yeah. Dude. | 21:20 |
clarkb | ok I think my capture the new gerrit releases on test nodes change was successful in doing so | 21:30 |
opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems https://review.opendev.org/c/openstack/diskimage-builder/+/934332 | 21:31 |
clarkb | I don't think I'll run through upgrade and downgrade testing today, but having that all prepped for monday is a good thing | 21:31 |
clarkb | mnasiadka: oh I just remembered another idea that was thrown out by someone (maybe you) but recording it here again for completeness: Using the buildset registry to act as a local cache for all your jobs may be helpful. Then you basically only do a single set of fetches from docker hub, push into buildset registry and everything fetches from there | 21:33 |
clarkb | mnasiadka: the downside to this is it is unlikely to reduce the number of requests from any single ip (since different jobs often use a different ip), however in the broader scope of things it will reduce the total number of requests across the ips, which should make subsequent jobs more likely to pass when they run using the same ip later | 21:34 |
clarkb | any reduction in requests to docker hub should produce overall improvements within the ci system | 21:35 |
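The request arithmetic being described can be sketched as follows (job and image counts are made up for illustration):

```python
# Back-of-the-envelope sketch (job and image counts are invented) of why
# a buildset registry reduces upstream Docker Hub traffic: without it,
# every job pulls each base image itself; with it, the images are pulled
# once per buildset and all jobs fetch from the local registry instead.

def upstream_pulls(jobs, images, buildset_registry):
    """Total image pulls that hit the upstream registry for one buildset."""
    return images if buildset_registry else jobs * images

if __name__ == "__main__":
    print(upstream_pulls(jobs=20, images=2, buildset_registry=False))  # 40
    print(upstream_pulls(jobs=20, images=2, buildset_registry=True))   # 2
```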
mnasiadka | well, we basically fetch only two images (one in ubuntu build job and another in debian build job), so mirroring them or going the authenticated route probably is a better idea - in Kolla we only fetch the base image outside of OpenDev - and then install all pip/rpm/deb based on those images | 21:36 |
mnasiadka | Which reminds me that Rocky 9 rpms are not mirrored - and then there's Rocky 10 somewhere around next year | 21:37 |
clarkb | mnasiadka: right but kolla runs like 20 jobs when you push a change up (I made that number up, I don't know what the actual number is) and if each of those does several requests that goes into the bucket against our rate limit | 21:38 |
clarkb | if instead you make 2 requests and then everything else fetches from the buildset registry you've gotten a large overall reduction | 21:38 |
clarkb | I agree that not fetching from docker in the first place is better | 21:38 |
clarkb | I'm just trying to call out all possible areas of improvement as some may be easier to implement than others | 21:39 |
mnasiadka | kolla-ansible fetches the images from quay.io (from openstack.kolla namespace) - we could think of using the buildset registry (as in build daily for master and weekly for stable branches, push the images there and use it in kolla-ansible jobs) - but I don't think that's on our immediate plan | 21:40 |
clarkb | the buildset registry only runs while the buildset is running | 21:40 |
mnasiadka | ah, ok | 21:40 |
clarkb | you wouldn't do anything with daily or weekly stuff there | 21:40 |
mnasiadka | any... more persistent registry we could use to limit the amount of traffic we generate? | 21:41 |
clarkb | it's just "what do we need for the buildset" -> we need debian and ubuntu base images -> cache them, then everything grabs from there within the buildset | 21:41 |
clarkb | mnasiadka: the big problem with a persistent registry is disk | 21:41 |
clarkb | just storing our speculative images for 180 days for CI testing is about 2TB | 21:42 |
clarkb | and that's only because we wrote our own registry just for that purpose that we can prune. None of the off the shelf registries really do pruning, last I checked | 21:42 |
clarkb | they want to store things forever which is kind of a problem | 21:42 |
mnasiadka | right | 21:43 |
clarkb | then to build on top of that problem we've tried to be good citizens and proxy cache things but all the registries make that difficult to impossible too | 21:43 |
clarkb | the unfortunate truth is this system was never designed with cost scaling in mind | 21:43 |
clarkb | and now a decade later we're dealing with the consequences | 21:43 |
clarkb | there may be some crazy idea where we cache a super specific subset of items in the insecure-ci-registry and let its pruning deal with stuff | 21:44 |
clarkb | or run a second instance | 21:44 |
corvus | i don't know why we'd do that instead of just using quay or something else? | 21:44 |
clarkb | if we run a second instance we could set different retention periods to the ephemeral images | 21:44 |
clarkb | corvus: probably the main reason would be to host within each cloud | 21:45 |
clarkb | problem then is not every cloud offers swift | 21:45 |
corvus | well, i mean, we already decided that we didn't want to be in the registry hosting business | 21:45 |
clarkb | yes I agree and I don't think that's changed. I'm just trying to brainstorm all the options | 21:45 |
mnasiadka | if network traffic is not a problem, then quay.io is fine - I'll go waste my time somewhere else than in fixing problems that don't exist :) | 21:46 |
clarkb | I think if we did do that it would be for a very small set of images and only as a mirror that we prune. Not as a host for official downloads? | 21:46 |
corvus | if we do change our minds about that then i would probably propose hosting our public images in a self-hosted registry. :) | 21:46 |
clarkb | mnasiadka: I think we should start with that assumption since it is probably the easiest one to follow for now | 21:46 |
clarkb | and if that assumption is proven false we debug and brainstorm from there | 21:47 |
mnasiadka | I never thought about hosting official downloads, that's what quay.io is for - maybe having a local semi-persistent registry that would be cleaned up daily or something like that would speed up the CI / make projects use less internet bandwidth - but if that's not trivial - let's not try to fix what is not broken | 21:48 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 21:48 |
clarkb | mnasiadka: ya I think the main issue is solving the storage problem | 21:48 |
clarkb | which is not trivial due to clouds being inconsistent about their storage options | 21:48 |
clarkb | and limits on cinder volumes and so on | 21:48 |
clarkb | we would have to have a bunch of bespoke solutions for a common problem and I think we should avoid that if we can | 21:49 |
fungi | also the concern that rh could decide to follow in docker's footsteps and impose similarly strict client rate limits | 22:00 |
clarkb | separately, if kolla is only fetching two images from docker and opendev is only fetching a similar number (for builds it's python-base + python-build and possibly the buildx container if multiarching; for usage it's $server + mariadb typically), it makes me wonder who or what is doing many more fetches. it's not like these jobs are short either, and we're running many back to back | 22:05 |
clarkb | but I think that must be why it seems to work ~70% of the time for us right now | 22:05 |
clarkb | maybe even more | 22:05 |
clarkb | but that may be another avenue to approach this from. Identify the worst offenders and see if we can dial them back | 22:10 |
clarkb | it used to be we fetched the buildset registry from docker too but we fixed that | 22:14 |
opendevreview | Clark Boylan proposed opendev/system-config master: Capture lodgeit captchas for verification purposes https://review.opendev.org/c/opendev/system-config/+/936297 | 22:23 |
clarkb | I've discovered an interesting frame of reference problem. We run testinfra on the bridge node and then run commands against test nodes. We have the screenshot mechanism that doesn't work for raw pngs; you have to load html pages. But the selenium driver runs on the test node with a backhaul to the testinfra node built in | 22:24 |
clarkb | anyway since I'm not using selenium I have to do it myself from the bridge node and I think we have /etc/hosts set up to make this work but not sure about ssl certs | 22:25 |
clarkb | anyway I'm hopeful that will confirm the captcha is written where we want it to be and we can land the lodgeit updates | 22:27 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 22:32 |
opendevreview | Clark Boylan proposed opendev/system-config master: Capture lodgeit captchas for verification purposes https://review.opendev.org/c/opendev/system-config/+/936297 | 22:43 |
clarkb | ok ssl verification did fail. But I think this should work now | 22:43 |
ianw | in terms of ssl the nodes have a self-signed cert that they all should trust | 22:51 |
ianw | https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-base.yaml#L7 | 22:54 |
opendevreview | Merged opendev/lodgeit master: Run python3.11 job on Jammy https://review.opendev.org/c/opendev/lodgeit/+/935719 | 23:05 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 23:07 |
ianw | clarkb: ohh, you know i bet that testinfra is in a venv, so the requests in there isn't patched to use the system ca-certificates. i'd probably suggest the easiest thing to do is just "curl" the file and save it -- that way it uses system certs | 23:12 |
Clark[m] | ianw: verify false worked | 23:13 |
ianw | but also, I think you could just use the screenshot and do "https://localhost/_captcha.png" and that would automatically save it | 23:13 |
Clark[m] | No that was what I tried first but selenium loads the png and explodes | 23:14 |
Clark[m] | It expects js/html input | 23:14 |
Clark[m] | I think the current ps works and shows the code does fix the problem | 23:14 |
Clark[m] | I think I'm ok with this verify=False solution. But ya the lodgeit change lgtm based on that test | 23:15 |
ianw | oh ok, must have pulled up the old one sorry | 23:15 |
ianw | probably we should run testinfra with REQUESTS_CA_BUNDLE | 23:17 |
ianw | but ok, the captcha renders ... once again our infra testing pulls through :) | 23:18 |
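The REQUESTS_CA_BUNDLE suggestion can be sketched as a tiny helper (hypothetical; requests already honors the variable on its own, this just makes the fallback to verify=False explicit, and the bundle path is illustrative):

```python
# Hypothetical helper: decide what to pass as requests' verify= argument
# when running testinfra out of a venv. If REQUESTS_CA_BUNDLE points at
# the job's self-signed CA, verification works; otherwise fall back to
# the verify=False approach used in the change above. The bundle path in
# the example is illustrative.

import os

def verify_setting(env=os.environ):
    """Prefer a CA bundle from the environment; else skip verification."""
    return env.get("REQUESTS_CA_BUNDLE", False)

if __name__ == "__main__":
    print(verify_setting({"REQUESTS_CA_BUNDLE": "/etc/ssl/certs/test-ca.pem"}))
    print(verify_setting({}))
```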
opendevreview | James E. Blair proposed zuul/zuul-jobs master: WIP: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 23:26 |
*** iurygregory__ is now known as iurygregory | 23:37 | |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574 | 23:58 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!