Friday, 2025-02-28

kevkoaaah, okay, now I see the discussion above :) ..sorry that I asked before reading :/00:00
clarkbthe tl;dr is that normal openstack release demand (which is higher than typical) has shown up and has also triggered a bug in nodepool effectively reducing our total available quota00:03
clarkbwe have a fix for the nodepool bug but it has been stuck behind all the other demand from the day as well00:03
clarkbwe currently appear to have about an hour's worth of noderequest backlog, maybe less. Which is good as it means we should be relatively idle before periodic jobs start (also good job picking the time for periodic jobs)00:07
clarkbhrm the unittests failed this time. I wonder if there is some other persistent issue00:18
clarkblooks like image errors in ibmvpc driver00:19
clarkblooking at tracebacks it looks like we might actually be trying to make requests against ibm?00:20
clarkbI wonder if a mock or fake is broken now with an ibmvpc update00:21
clarkbthere is a new ibm-vpc from january 600:22
clarkbI feel like we've landed changes since january 6 though00:22
clarkbyes we last merged nodepool code on February 1300:26
opendevreviewMerged opendev/system-config master: Add networks and routers in new flex tenants  https://review.opendev.org/c/opendev/system-config/+/94293601:05
corvusclarkb: i recreated my venv and ran tests again and no ibmvpc failures01:07
Clark[m]Ack must be some flaky test then01:22
opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/94296902:25
opendevreviewMerged openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/94296903:03
opendevreviewKarolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles  https://review.opendev.org/c/opendev/glean/+/94167215:41
clarkblooks like the nodepool fix landed and graphs are much better this morning. Zuul queues are not small either so it is likely we're actually exercising things. In a few I'll load ssh keys and double check the launchers all restarted their containers (but I expect they have)15:48
fungidid the nodepool launchers restart onto the fix?15:51
clarkbthey should as part of the opendev hourly pipeline jobs. but that is what I want to double check after I load ssh keys15:51
fungiah, okay, i didn't realize we automatically restarted nodepool services, thought we handled them more like the zuul services15:52
clarkbya zuul and nodepool both run their deployment jobs in the hourly pipeline but zuul doesn't do continuous rollouts due to the cost (we save that for the weekly playbook). Nodepool restarts are cheap and largely non impactful so we just go ahead and do them15:53
fungigot it15:54
clarkbwith zuul we do update the tenant config though so it does have a purpose15:55
clarkblooks like xorg updates have finally arrived. I'm going to install those, reboot for good measure, then load ssh keys15:55
clarkbI've rechecked https://review.opendev.org/c/opendev/system-config/+/942650 as it hit docker hub rate limits yesterday and I think cloud launcher should've run against the new raxflex tenants. We can probably check them and look at launching mirrors there now as well15:56
clarkbthe container on nl04 is 10 hours old and the image matches the latest on quay. This lgtm16:04
fungilooks like we've been topped out on usable quota for the past 2.5 hours, and yeah there's not a bunch of available nodes just sitting like we saw yesterday16:53
fungithe running builds/running jobs graphs are a good 25% higher than when we were maxxed out on quota yesterday16:55
clarkbyup things look much happier in grafana and just in general zuul status16:55
clarkba cool grafana feature would be to select a section of a graph and have it do an integration over that period17:02
clarkbjust to make it easy to compare totals between different periods of time without generating a new graphite query17:02
fungimmm, could even just have it compute and display the integral of the graphed lines over the displayed time window17:05
fungiso integral of concurrent jobs from the executor queues graph over the past 6 hours when you're on the 6-hour view, but if you shorten it to 3 hours then automatically recalculate it over that17:06
clarkbya that would work too17:07
fungithat wouldn't really require ui interaction changes17:07
fungijust some additional data it could display17:08
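A back-of-the-envelope sketch of the feature being discussed (hypothetical data points; not a real Grafana or graphite API, which would supply the series for the displayed window):

```python
# Integrate a metric series over a time window so totals from different
# windows can be compared, as suggested above. The series is made up.

def integrate(points):
    """Trapezoidal integral over (timestamp_seconds, value) pairs."""
    total = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        total += (v0 + v1) / 2 * (t1 - t0)
    return total

# e.g. a flat 50 concurrent jobs held for one hour -> 50 * 3600 job-seconds
series = [(0, 50), (1800, 50), (3600, 50)]
print(integrate(series))  # 180000.0
```

Recomputing this whenever the displayed window changes is exactly the "no UI interaction changes" version fungi describes.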
clarkbI'm thinking Friday is not a great day to land the infra-prod refactor change in part because it is ianw's saturday but also because if anything goes wrong I'd like to not have it run into the weekend. I was hoping to do it yesterday then zuul got cranky. I guess aim for Monday then? I do have an appointment Monday afternoon.17:11
clarkbthere is never going to be a good time and Monday is probably better than Friday all things considered though17:12
fungii'm on board with that suggestion, sure17:12
clarkbthe gitea memcached change is rechecking now. That might be ok for a Friday (worst case we revert back to in process memory caching). We can also work on spinning up raxflex mirrors17:14
clarkbfungi: both sjc3 and dfw3's new tenant have a number of error'd mirror servers. I suspect those are from before we caved on networking?17:15
clarkbnetwork list looks good in both regions implying cloud launcher successfully created the resources we need17:15
clarkbanyway probably want to try and delete those error'd servers first if we can to avoid confusion later (maybe even increment the mirror number digit by one when booting new servers if the old ERROR servers can't be deleted)17:16
fungiah, yeah i didn't even expect that it had left those behind. i'll try to delete them first17:19
fungierror server instances are cleaned up in both regions now17:23
clarkbI don't see any associated volumes either (so no additional cleanups)17:25
clarkbwe do need a volume before deploying the servers though iirc as these flavors don't have enough disk by default for the afs and apache caches17:25
fungiyeah, i didn't use the volume option to launch-node because it doesn't do lvm that i could see17:26
fungifigured i'd just create and attach them manually17:26
clarkbya I think launch-node expects the entire volume mounted at a single location17:26
clarkbin the launch dir there is a mirror_volume.sh script that tonyb added iirc to simplify things but you have to copy it over and run it manually after launch node is done and you've attached a volume17:27
fungiit's just a few commands to cut-n-paste anyway17:27
clarkbinstance, volume, and floating ip are the things I try to check when launch node cleanup doesn't work as expected. In this case no floating ips (since that was what was tested) and no volumes because they weren't requested. 17:29
fungiyeah, i have the two new mirrors bootstrapping now17:29
fungiwill attach the volumes and carve them up as soon as these complete17:31
clarkbjitsi meet made a release and new images a few hours ago. We should expect meetpad to automatically update during daily periodic runs 0200 UTC tomorrow17:36
clarkbI don't expect problems but I subscribed to their release notifications on github so figured I'd pass it along17:36
clarkb(I do that for gitea, etherpad, jitsi meet, and maybe some others I can't recall now)17:36
opendevreviewJeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex tenant mirrors to inventory  https://review.opendev.org/c/opendev/system-config/+/94303717:42
opendevreviewJeremy Stanley proposed opendev/zone-opendev.org master: Add new Rackspace Flex tenant mirrors  https://review.opendev.org/c/opendev/zone-opendev.org/+/94303817:42
fungipushed those up to get them in progress while i work on storage setup17:42
clarkbfungi: one important note on the system-config change17:46
clarkbalso system-config needs a Depends-On for the zone update if you want to add that in the new patchset17:46
fungihuh, identically booted server instances, identically created and attached volumes, kernel for the instance in dfw3 didn't register the hotadd, while in sjc3 it did17:53
fungiokay, it finally showed up, but took about 5 minutes17:53
fungiweird17:53
clarkbI wonder if they have different underlying io systems17:53
fungior maybe just "eventually consistent" ;)17:54
fungihotplug events recorded by dmesg look identical17:54
fungivirtio-pci/virtio_blk for both17:54
clarkbI guess that sort of thing would hit the nova api, then nova would put it on the rabbitmq bus for the compute node to pick up then it would tell libvirt to do the attachment through qemu? eventually consistent is a likely explanation18:01
fungiokay, volumes are mounted and cross-checked comparing to the existing mirror in the old sjc3 tenant18:06
fungiinterestingly, mtu on the ens3 interface on the new servers is 394218:07
fungivs 1442 on the old server18:07
fungihopefully pmtud works well ;)18:07
fungibut i don't feel like tcpdumping my connections to check for fragmentation right this moment, it's probably working fine18:08
fungiand this might mean we get 3942 through between test nodes and the mirrors as a result, which could improve network performance quite a bit for things like package downloads18:09
corvuswow18:13
opendevreviewJeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex tenant mirrors to inventory  https://review.opendev.org/c/opendev/system-config/+/94303718:14
clarkbshould we consider setting them to 1500?18:16
clarkbfor test nodes we would need to update glean or have something else do that, but the mirrors would just be a netplan update? I think we have examples of netplan edits for other servers (review maybe?)18:16
fungiwe could, i'm sure18:18
clarkbmaybe it's best to see if it creates a problem first18:18
fungican also be done later, right18:18
clarkbthe system-config update lgtm now too. Just need to land the dns update first (for the acme records)18:20
clarkbshould we go ahead and do that now?18:20
clarkbI'm happy to approve it if you're otherwise happy with the new nodes18:20
fungiyeah, go for it18:25
fungineither of those changes should impact anything, and the inventory change is going to take time to run the myriad of jobs it triggers18:26
clarkbdns change is approved. Once the new names resolve for me I'll approve the other change18:27
fungithanks18:30
opendevreviewMerged opendev/zone-opendev.org master: Add new Rackspace Flex tenant mirrors  https://review.opendev.org/c/opendev/zone-opendev.org/+/94303818:34
clarkbdns resolves now. I'm approving the system-config change18:39
fungiagreed, bridge can resolve them all now18:42
fungi942650 timed out running system-config-run-gitea this time18:45
clarkbhrm we had that timeout in rax-dfw installing pdf dependencies yesterday. This timeout ran in ovh-gra118:46
clarkbI know we were close to the timeout before. It is possible that memcached usage makes things slower overall (but hopefully more consistent over time)18:47
clarkbI'll update with a bumped-up timeout so that we can get more data18:47
fungiara didn't identify any failures18:47
clarkbya it timed out during testinfra test case runs. So right at the end18:47
clarkbprobably would've been fine with 5 minutes or even less of extra time18:48
fungiaha, i didn't see a testinfra report but i guess that's why18:48
fungiand yeah, the change is adding something, so conceivably it's just tipped over the timeout18:48
frickler3942 is 1500+1442, that sounds like a weird bug by raxflex18:48
fungifrickler: mmm, great observation18:49
fricklerand IMHO we should override to 1500 ourselves, neutron does allow that iirc18:49
fricklerhaving larger MTU in my experience just leads to weird issues unless it is being used purely internally18:50
opendevreviewClark Boylan proposed opendev/system-config master: Run gitea with memcached cache adapter  https://review.opendev.org/c/opendev/system-config/+/94265018:50
fungii wonder if the encapsulation is reducing the available mtu by 8 and instead of increasing the mtu at the outer layer by that amount they just doubled it and said "that should be big enough"?18:50
clarkbfrickler: you mean we can configure our network or subnet in neutron to have dhcp set a 1500 mtu?18:51
fungiso now instead of 1x1500-8 it's 2x1500-818:51
clarkbI think that would be a great solution if possible18:51
frickleroh, wait, 1500+1442=2942, so add another 1000 in for weirdness18:51
fungiaha, yep18:52
fungiit's friday. i don't math on fridays18:52
fricklerclarkb: yes, "openstack network set mtu 1500" should do that18:52
fricklerehm "--mtu"18:52
fricklerguess I should eod soonish ;)18:53
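Writing frickler's corrected arithmetic out plainly (plain numbers taken from the conversation; no cloud access involved):

```python
# The odd MTU seen on the new tenants vs the old sjc3 mirror.
observed = 3942   # mtu reported on ens3 in the new tenants
old = 1442        # mtu on the old sjc3 mirror
standard = 1500   # conventional ethernet mtu

print(standard + old)                # 2942, frickler's corrected sum
print(observed - (standard + old))   # 1000 of unexplained extra
```

So 3942 is not simply 1500 + 1442; there is an extra 1000 bytes of "weirdness" on top, which is why it looks like a provider-side bug rather than a deliberate encapsulation allowance.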
fungii wonder if we already have support for that via cloud-launcher or would need to incorporate it18:54
Clark[m]https://opendev.org/opendev/ansible-role-cloud-launcher/src/branch/master/tasks/create_network.yml looks like no19:03
Clark[m]Maybe do it manually then add a comment to our cloud launcher stuff saying we've done so19:04
clarkbya I think the simplest thing to do for now is likely ^19:13
clarkbbecause if we add support to cloud launcher I'm not sure we want to run that again against all clouds with network config either so we might need different setups?19:13
fungino, we'd probably have to make it configurable where we do our network definitions19:14
clarkbanother option would be to not use the central definition and do it like the router (where we basically define it bespoke per cloud region and do that for the ones that need overrides)19:14
clarkbbut since this is a one time deal (typically) I think manually setting it and making a note we've done so is sufficient and low effort/cost19:15
fungianyway, i've manually updated the networks in both tenants of both regions now19:15
clarkbbut not the old sjc3 deployment right? (we need that one to use the sub 1500 mtu)19:15
fungicorrect19:15
fungijust the new control plane and zuul projects in both dfw3 and sjc319:16
clarkbcool. And in theory we just wait for leases to renew and we'll get the new configuration19:16
fungii rebooted mirror01.dfw3.raxflex to see if that kick-starts it19:16
fungii'll leave mirror02.sjc3.raxflex if we want to see it take effect naturally19:17
fungistill 3942 after a reboot, so probably would have to force expiration on the local copy of the lease19:18
fungiif we want to see it happen sooner19:18
clarkbhow long is the lease for? 24 hours?19:18
funginot sure where noble is storing that19:19
fungithere's a /var/lib/dhcpcd but it's empty19:20
clarkbya no dhcpcd process either.19:21
clarkbthe /etc/netplan/50-cloud-init.yaml file says dhcp4 is true but it also sets an mtu in that file19:22
clarkbnot sure if that value came from dhcp or metadata (I don't think these are config drived, looking at lsblk)19:23
fungii wonder if it's coming from configdrive19:23
fungiindeed19:23
clarkbcould be metadata service19:23
clarkbanyway I guess there are now two questions. Will dhcp override the netplan mtu value or not, and where are the dhcp lease details stored19:24
fungii did specify --config-drive when i launched these19:24
fungisr0 is the configdrive i think19:24
clarkboh yup blkid confirms. My mistake was expecting config drive to be mounted for us but that is a glean behavior and we aren't using glean19:25
fungialso the images had cloud-init installed looks like, and then our ansible uninstalled it19:25
clarkbso we could mount the config drive and check if the mtu is set there. In theory that mtu value comes from the network settings in neutron though so if we unset it in the netplan file manually we'd be fine. Still need to figure out if dhcp is setting it19:26
fungiso my guess is cloud-init built the netplan config from data in configdrive, then we uninstalled cloud-init19:26
clarkbfungi: yup that is expected. Basically we want an initial bootstrap config then we don't want cloud init making new changes later (something it did in the past with hpcloud iirc)19:26
clarkbthe config drive value for mtu won't ever update (because it is created on instance creation and never edited) but with cloud-init removed we're free to make edits like remove the mtu line and rely on dhcp etc19:27
clarkbbut need to find where dhcp is doing its thing19:27
fungiconfigdrive still has "mtu": 394219:27
fungiso i guess that doesn't get rewritten after the server is created19:27
fungii've removed the hard-coded mtu line from /etc/netplan/50-cloud-init.yaml and rebooted19:28
fungi2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 100019:28
fungithat did it19:29
fungii can do it on the other server as well if we want, but need to take a break to assemble/bake/eat pizzas19:29
clarkbfungi: correct config drive is never updated after the initial creation. It is the main drawback to using it.19:29
clarkbI've been looking at mirror02.sjc3 and I think systemd-networkd is responsible19:29
fungi(or irresponsible)19:30
clarkbsystemd-networkd[695]: /run/systemd/network/10-netplan-ens3.network: MTUBytes= in [Link] section and UseMTU= in [DHCP] section are set. Disabling UseMTU=.19:30
clarkbthe journalctl -u systemd-networkd shows this. I think that is a warning saying "I have two different mtu values one from netplan and one from dhcp. I'm using the netplan value"19:31
clarkbah nope the link value is a hardcoded mtu and the dhcp value is just a flag for whether or not it should use dhcp provided mtus19:32
clarkbso ya removing the value from the netplan file and rebooting is probably simplest. I suspect that new servers that get booted will just work with 1500 mtus because dhcp will provide that19:32
clarkband config drive for new instances should also have 1500 mtus because we set that on the network19:32
clarkbI think that all means the intervention you took is fine and appropriate. New servers should just work and the intervention means we can keep using these servers as is19:33
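A text-only sketch of the manual fix fungi applied: drop the hard-coded `mtu:` line from the netplan config so systemd-networkd falls back to the DHCP-provided value. The file content here is a simplified stand-in for /etc/netplan/50-cloud-init.yaml, not a copy of the real one:

```python
# Remove the hard-coded mtu line from a netplan config (simplified example
# content, assumed for illustration).
netplan = """\
network:
  version: 2
  ethernets:
    ens3:
      dhcp4: true
      mtu: 3942
"""

fixed = "\n".join(
    line for line in netplan.splitlines()
    if not line.strip().startswith("mtu:")
)
print(fixed)  # same config, with the mtu override gone
```

After a reboot (or netplan apply) the interface then picks up whatever MTU the DHCP server advertises, which is 1500 now that the neutron network has been updated.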
opendevreviewMerged opendev/system-config master: Add new Rackspace Flex tenant mirrors to inventory  https://review.opendev.org/c/opendev/system-config/+/94303719:40
clarkbfungi: be careful restarting mirror02.sjc3 to not break ^19:42
clarkbI'm going to grab lunch now. The gitea memcached change has been approved but is still at least 1.5 hours away. so not too worried about missing that go out. It should auto deploy because it checks the image ids before and after pull  and memcached will be a new id causing it to restart things19:47
clarkbjitsi has made a couple more releases since I mentioned the earlier one. I guess there were problems with those images20:46
clarkbI think the mirrors are deploying right now too20:50
clarkband no known_host complaints \o/20:50
clarkbat first glance https://mirror02.sjc3.raxflex.opendev.org/ and https://mirror.dfw3.raxflex.opendev.org/ lgtm20:53
clarkbwe can probably fix sjc3's mtu then update dns to switch the mirror over to it then start figuring out how to migrate load between the old tenant and the new tenant then turn on dfw320:54
fungiyeah, that was my plan as well20:57
clarkband I think we decided to use the more straightforward method of shutting things down in sjc3, deleting resources, then swapping the config over so that the raxflex name doesn't change.20:59
clarkbso the next step there is likely to set max-servers to 0, then empty the image list for that region21:00
clarkblet nodepool delete everything it can, then anything it can't delete we'll have to follow up with manually later21:00
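A rough sketch of what that wind-down change might look like in the nodepool provider config (illustrative fragment only; the provider/pool names and surrounding config are assumptions, not the actual opendev files):

```yaml
providers:
  - name: raxflex
    region-name: SJC3
    diskimages: []        # empty list: stop uploading images to this region
    pools:
      - name: main
        max-servers: 0    # stop booting new nodes
        labels: []        # nothing left to request from this provider
```

With max-servers at 0 and the image list empty, nodepool deletes what it can on its own and any stragglers get cleaned up manually afterwards.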
fungilatest system-config-run-gitea failure on 942650 looks like a dockerhub rate limit hit again21:01
fungii'll recheck21:01
clarkboh did it just fail ugh21:01
fungi20:58:0121:01
fungiyup21:01
clarkbjust keep in mind that it will do a deployment update to the production service when it lands21:01
fungiyeah, i'm around21:02
clarkbI have to do a school run this afternoon then will try to do a bike ride before the cold wet weather returns but I can keep an eye on things around that21:02
clarkbif you want to speed things up, direct enqueuing to the gate might be a good idea (otherwise you have to wait for clean check)21:02
fungiyep, can do21:03
fungiit's back in the gate again21:06
clarkbI think the new docker hub rate limits start tomorrow too21:06
clarkbcurrently its like 100 image pulls per 6 hours and tomorrow it becomes 10 per 1 hour21:07
clarkbso a 40% reduction? to be determined if this creates utter chaos or if our efforts to rely less on them have helped enough21:07
fungitomorrow utc is less than 3 hours away21:07
clarkbit is possible the change may even help us because the time range is shorter and we have lots of long-running jobs21:08
clarkbbasically it would be easy to accumulate over 6 hours and then be stuck for a while but resetting regularly may be advantageous21:09
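Checking the "40% reduction" figure against the limits as described (numbers taken from the conversation):

```python
# Docker Hub anonymous pull limits: 100 pulls per 6-hour window today vs
# 10 pulls per 1-hour window starting tomorrow.
old_rate = 100 / 6   # pulls per hour, ~16.7
new_rate = 10 / 1    # pulls per hour

reduction = 1 - new_rate / old_rate
print(f"{reduction:.0%}")  # 40%
```

So the per-hour budget does drop by exactly 40%, but as noted the shorter window means accumulated usage resets more often, which may help workloads with long-running jobs.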
fungideploy of 943037 is on the 5th-from-last job, but well past anything that will touch the mirror servers so i'm going to remediate the mtu on mirror02.sjc3.raxflex now21:09
clarkbThis is what I tell myself to not worry too much until we have real world conditions to check21:09
clarkbfungi: ++ that should be safe now21:09
fungi2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 100021:12
fungiall good21:12
clarkbI can still reach https://mirror02.sjc3.raxflex.opendev.org/21:13
fungiso, next a dns change to move the mirror.sjc3.raxflex cname>?21:13
clarkbI think so21:14
clarkbmaybe concurrently start shutting down raxflex usage21:14
fungiyeah, i'll throw in a change for that as well21:14
clarkbI think we may need to coordinate with zuul launcher as well21:14
clarkbits usage is minimal I'm less worried about breaking someone/something and more thinking about cleanup being incomplete if we forget zuul-launcher21:15
opendevreviewJeremy Stanley proposed opendev/zone-opendev.org master: Move Rackspace Flex SJC3 mirror to new server  https://review.opendev.org/c/opendev/zone-opendev.org/+/94306421:15
fungiset max-servers to 0 and comment out all the labels entries?21:17
fungior is it diskimages that needs to be an empty list?21:18
clarkbfungi: not the labels, but the image list needs to be []21:18
fungigot it21:18
clarkbnot having an image in the image list means don't upload this image21:19
clarkbthere should be git history showing the process. I think I've usually split those two changes up21:19
clarkbso that instances go away then images go away21:19
clarkbits probably fine to do it all at once though and let nodepool complain if it must21:19
fungicool21:19
fungiworth a try, again this is only 32 servers so could be an interesting experiment21:19
opendevreviewMerged opendev/zone-opendev.org master: Move Rackspace Flex SJC3 mirror to new server  https://review.opendev.org/c/opendev/zone-opendev.org/+/94306421:21
opendevreviewJeremy Stanley proposed openstack/project-config master: Wind down/clean up Rackspace Flex SJC3 resources  https://review.opendev.org/c/openstack/project-config/+/94306521:23
fungithat's it, i think?21:23
fungiwhere's the zuul-launcher config?21:28
fungiopendev/system-config?21:29
fungioh, right, opendev/zuul-providers21:31
clarkbya sorry got distracted for a minute21:32
clarkbit's the new config in zuul so we probably don't have any examples of winding that down yet21:32
fungidoesn't look like provider sections have a max-servers set21:32
fungiand there's no configuration reference for launcher yet, that i can see21:35
fungiuse the source, luke21:35
clarkbI suspect if you removed the labels from here that it would be equiavlent https://opendev.org/opendev/zuul-providers/src/branch/master/zuul.d/providers.yaml#L38-L5021:36
clarkbrather than setting a limit we remove the tie from those labels to the provider so nothing should get booted?21:36
fungiwfm21:36
clarkband then also empty the image list here: https://opendev.org/opendev/zuul-providers/src/branch/master/zuul.d/providers.yaml#L23-L26 ?21:37
opendevreviewJeremy Stanley proposed opendev/zuul-providers master: Wind down/clean up Rackspace Flex SJC3 resources  https://review.opendev.org/c/opendev/zuul-providers/+/94306721:39
fungisomething like that ^21:39
clarkbyes I've +2'd that one but that would be a good one to get corvus' input on21:39
opendevreviewMerged opendev/system-config master: Run gitea with memcached cache adapter  https://review.opendev.org/c/opendev/system-config/+/94265021:54
clarkbI have to pop out now but ^ seems to be deploying and at least gitea09 is up running memcached22:02
clarkbthe bytes count is nonzero running `echo 'stats' | nc -N 127.0.0.1 11211` there as well indicating that caching is working22:03
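A sketch of the check clarkb describes, using a fabricated excerpt of memcached's `stats` response (the real output comes from `echo 'stats' | nc -N 127.0.0.1 11211`; this sample is assumed for illustration):

```python
# Parse memcached "stats" output and confirm the cache actually holds data.
sample_stats = """\
STAT pid 1
STAT uptime 36000
STAT bytes 524288
STAT curr_items 1024
END
"""

stats = {}
for line in sample_stats.splitlines():
    if line.startswith("STAT "):
        _, key, value = line.split(None, 2)
        stats[key] = value

print(int(stats["bytes"]) > 0)  # True -> gitea is writing into the cache
```

A nonzero `bytes` counter (and growing `curr_items`) is the quick signal that the new cache adapter is in use rather than sitting idle.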
clarkbnevermind I don't have to do the school run timing worked out22:04
clarkbthe deployment was a success. I'll check each of the backends have the new setup now22:10
clarkbyup all looks well to me22:11
clarkbgit clone works as well22:12
clarkbmnaser: so tl;dr is we've swapped out what I suspect is the problematic caching system for memcached. If you see problems again please let us know as it is possible that this isn't the fix or isn't a complete fix22:13
fungiyeah, gitea servers lgtm as well22:47
clarkbanecdotally loading the system-config base page on gitea isn't any slower or faster for me than before (a good thing)23:05

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!