fungi | opendev "infra" meeting is starting shortly | 18:59 |
---|---|---|
clarkb | I'm going to let fungi be substitute meeting chair today as I think I've got something kids brought home from school (you may have noticed I was awake far too early today) | 18:59 |
clarkb | But I'll follow along | 19:00 |
fungi | sounds like you could use some debugging | 19:00 |
fungi | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Sep 20 19:00:29 2022 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
fungi | #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000361.html Our agenda | 19:01 |
fungi | #topic Announcements | 19:01 |
fungi | anybody have anything that needs announcing? | 19:01 |
fungi | i'll take your silence as a resounding negative | 19:02 |
fungi | #topic Actions from last meeting | 19:02 |
ianw | o/ | 19:02 |
fungi | #link https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.html | 19:03 |
fungi | looks like that says "none" so i think we can move on | 19:03 |
fungi | #topic Specs Review | 19:04 |
fungi | i don't see any recently proposed nor recently merged | 19:04 |
fungi | anyone have a spec on the way? | 19:05 |
fungi | the in progress specs are covered in upcoming topics anyway | 19:05 |
fungi | so i guess let's jump straight to those | 19:05 |
fungi | #topic Bastion host | 19:05 |
fungi | ianw: looks like you had some questions there? | 19:06 |
ianw | i have found some time to move this forward a bit | 19:06 |
fungi | excellent! | 19:06 |
ianw | #link https://review.opendev.org/q/topic:bridge-ansible-venv | 19:06 |
fungi | that is quite the series of changes | 19:07 |
ianw | is the stack. it's doing a few things -- moving our ansible into a venv, updating all our testing environments to a jammy-based bridge, and abstracting things so we can replace bridge.openstack.org with another fresh jammy host | 19:07 |
ianw | probably called bridge.opendev.org | 19:08 |
fungi | looks like clarkb has already reviewed the first several | 19:08 |
ianw | yeah, the final change isn't 100% ready yet but hopefully soon | 19:08 |
fungi | 852477 is getting a post_failure on run-base. is that expected? or unrelated? | 19:08 |
ianw | i'm on PTO after today, until 10/3, so i don't expect much progress, but after that i think we'll be pretty much ready to swap out bridge | 19:09 |
fungi | sounds great, i'll try to set aside some time to go through those | 19:09 |
ianw | 852477 is not expected to fail :) that is probably unrelated | 19:10 |
fungi | enjoy your time afk! | 19:10 |
fungi | i'm hoping to do something like that myself at the end of next month | 19:10 |
ianw | it should all be a big no-op; *should* being the operative term :) | 19:11 |
fungi | awesome, anything else specific you want to call out there? | 19:12 |
fungi | the agenda mentions venv for openstacksdk, upgrading the server distro version... | 19:12 |
clarkb | I think the change stack does a lot of that | 19:13 |
clarkb | so now that we have changes we can probably shift to reviewing things there | 19:13 |
fungi | sounds good. i guess we can proceed to the next topic | 19:14 |
fungi | thanks for putting that together ianw! | 19:15 |
fungi | #topic Upgrading Bionic servers to Focal/Jammy | 19:15 |
clarkb | I think the only new thing related to this topic is what ianw just covered above | 19:15 |
fungi | #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades | 19:15 |
fungi | yep | 19:16 |
fungi | #topic Mailman 3 | 19:17 |
fungi | this is coming along nicely | 19:17 |
fungi | i was able to hold a node from the latest patchset of the implementing change and import production copies of all our data into it | 19:18 |
fungi | most full site migrations will go quickly. i think i clocked lists.opendev.org in at around 8 minutes (i'd have to double-check my notes) | 19:18 |
fungi | the slowest of course will be lists.openstack.org, which took about 2.5 hours when i tested it | 19:19 |
fungi | the stable branch failures list accounts for about half that time on its own | 19:19 |
fungi | the only outstanding errors were around a few of the older openstack lists having some rejection/moderation message templates which were too large for the default column width in the db | 19:20 |
fungi | for those, i think i'll just fix the current production ones to be smaller (it was a total of 3, so not a big deal) | 19:20 |
fungi | i still need to find some time to test that the legacy pipermail redirects work, and test that mysqldump works on the imported data set | 19:21 |
fungi | the held server with the full imported data set is 104.239.143.143 in case anyone wants to poke at it | 19:21 |
clarkb | sounds like great progress. Thank you for continuing to push this along | 19:21 |
fungi | i think the plan, as it stands, is to shoot for doing lists.opendev.org and lists.zuul-ci.org around early november, if things continue to go well and stay on track? | 19:22 |
clarkb | that wfm. I think we can switch opendev whenever we're comfortable with the deployment (general functionality, migration path, etc) | 19:23 |
fungi | and then we can consider doing other lists after the start of the year if we don't run into any problems with the already migrated sites | 19:24 |
ianw | ++ having done nothing to help with this, it sounds good :) | 19:25 |
ianw | thank you, and just from poking at the site it looks great | 19:25 |
fungi | thanks! | 19:25 |
fungi | #topic Jaeger tracing server (for Zuul) | 19:25 |
fungi | corvus: i saw the change to add the server and initial deployment bits merged? | 19:26 |
fungi | #link https://review.opendev.org/c/opendev/system-config/+/855983 Add Jaeger tracing server | 19:29 |
corvus | yep, have not launched yet, will do so when i have a spare moment | 19:30 |
fungi | thanks, i'll keep an eye out for the inventory and dns changes | 19:31 |
fungi | anything else to cover on that at the moment? | 19:31 |
fungi | sounds like no | 19:32 |
fungi | #topic Fedora 36 Rollout | 19:32 |
fungi | we found out today that it needs an extra package to work with glean's networkmanager implementation for non-dhcp providers | 19:33 |
ianw | yes, thanks for digging on that! | 19:33 |
fungi | #link https://review.opendev.org/858547 Install Fedora ifcfg NM compat package | 19:34 |
ianw | #link https://review.opendev.org/c/openstack/diskimage-builder/+/858547 | 19:34 |
ianw | jinx :) | 19:34 |
fungi | #undo | 19:34 |
opendevmeet | Removing item from minutes: #link https://review.opendev.org/c/openstack/diskimage-builder/+/858547 | 19:34 |
fungi | it was conflated with a separate problem where we crashed nl04 with a bad config when the weekend restart happened | 19:34 |
ianw | i have déjà vu on that ... I either vaguely knew about it or we've fixed something similar before, but i can't pinpoint it | 19:35 |
fungi | so we've not been able to boot anything in ovh and some other providers since early utc saturday | 19:35 |
clarkb | ya the OVH thing made it worse as ovh has a lot of dhcp resources | 19:35 |
ianw | (the missing package) | 19:35 |
fungi | or, rather, we weren't able to until it got spotted and fixed earlier today | 19:35 |
ianw | was the nl04 issue ultimately something I did, missing bits of f36? | 19:36 |
fungi | and with the dib change approved and on its way to merge, i guess we'll need the usual dib release tagged, nodepool requirements bumped, new container images built and deployed | 19:36 |
clarkb | ianw: it was missing bits of rockylinux | 19:36 |
clarkb | ianw: and I think a few days ago nodepool became more strict about checking that stuff | 19:36 |
fungi | ianw: it was a missed rockylinux-9 label from the nested-virt addition | 19:36 |
fungi | whatever strictness was added would have been between saturday and whenever the previous restart of nl04's container was | 19:37 |
clarkb | https://review.opendev.org/c/zuul/nodepool/+/858577 is something I've just pushed that should have our CI for project-config catch it | 19:37 |
fungi | the bad config merged at the beginning of the month, but nodepool only spams exceptions about failing to load its config until you try to restart it | 19:38 |
fungi | then it fails to start up after being stopped | 19:38 |
clarkb | fungi: oh I see it crashed because it had to start over but it was probably angry before then too | 19:38 |
ianw | clarkb: excellent; thanks, that was my next question as to why the linter passed it :) | 19:38 |
fungi | yes, i saw the tracebacks in older logs, but i didn't look to see how far back it started complaining | 19:38 |
fungi | anyway, if we decide we want the networking fix for f36 sooner we can add that package to infra-package-needs, but i'm fine waiting. it's no longer urgent with nl04 back in a working state | 19:40 |
fungi | #topic Improving Ansible task runtime | 19:40 |
fungi | clarkb: you had a couple changes around this, looks like | 19:40 |
clarkb | ya this is still on the agenda mostly for https://review.opendev.org/c/opendev/system-config/+/857239 | 19:41 |
fungi | #link https://review.opendev.org/857232 Use synchronize to copy test host/group vars | 19:41 |
clarkb | The idea behind that change is to more forcefully enable pipelining for infra ansible things since we don't care about other connection types | 19:41 |
clarkb | zuul won't make that change because it has many connection types to support and that isn't necessarily safe. But in our corner of the world it should be | 19:42 |
fungi | #link https://review.opendev.org/857239 More aggressively enable ansible pipelining | 19:42 |
clarkb | and then ya comments on whether or not people want faster file copying at the expense of more complicated job setup would be good | 19:42 |
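(For context, a minimal sketch of the pipelining knob being discussed, assuming an ansible.cfg-style configuration; the exact file and mechanism used by 857239 may differ. Pipelining runs modules over the already-open SSH connection instead of copying a temporary script to the remote host first, cutting per-task round trips.)

```python
import configparser

# Illustrative only: show the [ssh_connection] pipelining setting the
# change is about; the real change may set this elsewhere (e.g. via an
# environment variable or a managed ansible.cfg template).
cfg = configparser.ConfigParser()
cfg["ssh_connection"] = {
    # Execute modules over the existing SSH session rather than
    # transferring a temporary file per task.
    "pipelining": "True",
}

with open("ansible.cfg", "w") as f:  # path is hypothetical
    cfg.write(f)
```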
corvus | we should do that asap, that change will bitrot quick | 19:43 |
corvus | (do that == decide whether to merge it) | 19:44 |
clarkb | ++ the mm3 change actually conflicts with it too iirc | 19:44 |
clarkb | as an example of how quickly it will run into trouble | 19:44 |
corvus | (that change == 857232) | 19:44 |
ianw | heh yes, and the venv stack mentioned earlier will conflict with it too | 19:45 |
clarkb | I'm happy to do the legwork to coordinate and rebase things. Mostly just looking for feedback on whether or not we want to make the change in the first place | 19:46 |
clarkb | EOT for me | 19:47 |
fungi | i relate to corvus's review comment there. if it's an average savings of 3 minutes and just affects our deployment tests, that's not a huge savings (it's not exactly trivial either though, i'll grant) | 19:47 |
clarkb | Re time saving the idea I had was that if we can trim a few places that are slow like that then in aggregate maybe we end up saving 10 minutes per job | 19:48 |
clarkb | that is yet to be proven though | 19:48 |
clarkb | (other places for improvement are the multi node ssh key management stuff which our jobs use too) | 19:48 |
fungi | yeah, on the other hand, i don't think 857232 significantly adds to the complexity there. mainly it means we need to be mindful of what files need templating and what don't | 19:50 |
ianw | yeah multiple minutes, to me, is worth going after | 19:50 |
ianw | multiple seconds maybe not :) | 19:50 |
fungi | not to downplay the complexity, it's complex for sure and that means more we can trip over | 19:50 |
fungi | on balance i'm like +0.5 in favor | 19:51 |
clarkb | ack can follow up on the change | 19:52 |
fungi | okay, well we can hash it out in review but let's not take too long as corvus indicated | 19:52 |
clarkb | ++ | 19:52 |
fungi | #topic Nodepool Builder Disk utilization | 19:52 |
fungi | the perennial discussion | 19:53 |
fungi | can we drop some images soon? or should we add more/bigger builders | 19:53 |
clarkb | I added this to the agenda after frickler discovered that the disk had filled on nb01 and nb02 | 19:53 |
ianw | i think we're close on f35, at least | 19:53 |
clarkb | we've added two flavors of rocky since our last cleanup and our base disk utilization is higher. This makes it easier for the disk to fill if something goes wrong | 19:53 |
ianw | (current issues not withstanding) | 19:53 |
frickler | so one raw image is 20-22G | 19:54 |
frickler | vmdk the same plus ca. 14G qcow2 | 19:54 |
frickler | makes 50-60GB per image, times 2 because we keep the last two | 19:54 |
clarkb | note the raw disk utilization size is closer to the qcow2 size due to fs/block magic | 19:54 |
frickler | we have 1T per builder | 19:54 |
clarkb | du vs ls show the difference | 19:54 |
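(A small sketch of the du vs ls point: raw images are sparse files, so the apparent size reported by ls -l is much larger than the blocks actually allocated on disk. The image path below is illustrative, not the real builder layout.)

```python
import os

# Hypothetical path; the real builder image layout may differ.
path = "/opt/dib/images/example.raw"

st = os.stat(path)
apparent_gb = st.st_size / 1e9           # what "ls -l" reports
allocated_gb = st.st_blocks * 512 / 1e9  # what "du" counts (st_blocks is in 512-byte units)

print(f"apparent: {apparent_gb:.1f} GB, allocated: {allocated_gb:.1f} GB")
```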
frickler | one of the problems I think is that the copies aren't necessarily distributed | 19:55 |
fungi | i'll note that gentoo hasn't successfully built for almost 2 months (nodes still seem to be booting but the image is very stale now) | 19:55 |
frickler | gentoo builds are disabled | 19:55 |
frickler | they still need some fix, I don't remember the details | 19:55 |
clarkb | frickler: yes, when the builders are happy they pretty evenly distribute things. But as soon as one has problems all the load shifts to the other and it then falls over | 19:56 |
fungi | opensuse images seem to have successfully built today after two weeks of not being able to | 19:56 |
clarkb | Adding another builder or more disk would alleviate that aspect of the issue | 19:56 |
fungi | but that was likely due to the builders filling up | 19:56 |
ianw | I have https://review.opendev.org/c/zuul/nodepool/+/764280 that i've never got back to that would stop a builder trying to start a build if it knew there wasn't enough space | 19:56 |
frickler | the disks were full for two weeks, so no images got built during that time | 19:57 |
fungi | right, makes sense | 19:57 |
frickler | but all builds look green now in grafana | 19:57 |
fungi | are there concerns with adding another builder? | 19:58 |
frickler | the other option would be just add another disk on the existing ones? | 19:58 |
fungi | yeah, though that doesn't get us faster build throughput | 19:58 |
clarkb | one upside to adding a new builder would be we could deploy it on jammy and make sure everything is working | 19:58 |
clarkb | then replace the older two | 19:58 |
frickler | because we don't seem to have an issue with buildtime yet | 19:58 |
clarkb | Adding more disk is probably the easiest thing to do though | 19:59 |
frickler | ah, upgrading is a good reason for new one(s) | 19:59 |
ianw | this is true. i think the 1tb limit is just a self-imposed thing, perhaps related to the maximum cinder volume size in rax, but i don't think it's a special number | 19:59 |
fungi | but any idea what % of the day is spent building images on each builder? we won't know that we're out of build bandwidth until we are | 19:59 |
clarkb | fungi: it takes about an hour to build each image | 19:59 |
clarkb | total images / 2 ~= active hours of the day per builder | 20:00 |
fungi | and we're at ~16 images now | 20:00 |
fungi | so yeah, we're theoretically at 33% capacity for build bandwidth | 20:00 |
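(The ~33% figure follows from the rough numbers above: about an hour per build and ~16 images split across two builders; all values are approximate.)

```python
# Back-of-envelope estimate using the approximate figures from the discussion.
images = 16
builders = 2
hours_per_build = 1.0

busy_hours = images / builders * hours_per_build  # ~8 build-hours per builder per day
print(f"~{busy_hours:.0f} h/day per builder, ~{busy_hours / 24:.0%} of the day spent building")
```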
fungi | i agree more space would be fine for now in that case, unless we just want to be able to get images online a bit faster | 20:00 |
frickler | oh, right, we should check upload times | 20:01 |
frickler | I don't think that is included in the 1h | 20:01 |
clarkb | it isn't | 20:01 |
clarkb | but we do parallelize those, while builds are serial | 20:01 |
clarkb | there is still an upper limit though. I don't think it is the one we are likely to hit first. But checking is a good idea | 20:02 |
fungi | and we still need to replace the arm builder right? does it need more space too? | 20:02 |
clarkb | yes it needs to be replaced. Still waiting on the new cloud to put it in unless we move it to osuosl | 20:02 |
ianw | that will need replacement as linaro goes, yep. that seems to be a wip | 20:02 |
fungi | also note that we're roughly three minutes over the end of the meeting | 20:03 |
clarkb | and its disk does much better since we don't need the vhd and qcow images | 20:03 |
clarkb | we use raw only for arm64 iirc | 20:03 |
fungi | so sounds like some consensus on adding a second volume for nb01 and nb02. i can try to find time for that later this week | 20:03 |
fungi | it's lvm so easy to just grow in-place | 20:04 |
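(A rough sketch of the in-place grow after attaching a second volume, with hypothetical device and volume names; the actual VG/LV layout on nb01/nb02 may differ, so check pvs/vgs/lvs first.)

```python
import subprocess

# Hypothetical names -- verify the real layout before running anything.
new_device = "/dev/xvdc"
volume_group = "main"
logical_volume = "/dev/main/nodepool"

# Register the new volume as an LVM physical volume, add it to the volume
# group, then grow the logical volume and its filesystem into the new space.
subprocess.run(["pvcreate", new_device], check=True)
subprocess.run(["vgextend", volume_group, new_device], check=True)
subprocess.run(["lvextend", "-l", "+100%FREE", "--resizefs", logical_volume], check=True)
```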
fungi | #topic Open Discussion | 20:04 |
fungi | if nobody has anything else i'll close the meeting | 20:05 |
clarkb | I'm good. Thank you for stepping in today | 20:05 |
ianw | #link https://review.opendev.org/c/zuul/zuul/+/858243 | 20:05 |
ianw | is a zuul one; thanks for reviews on the recent console web-stack -- that one is a fix for narrow screens on the console | 20:06 |
ianw | there's others on top, but they're opinions, not fixes :) | 20:06 |
ianw | and thanks to fungi for meeting too :) | 20:07 |
fungi | thanks everyone! enjoy the rest of your week | 20:07 |
fungi | #endmeeting | 20:07 |
opendevmeet | Meeting ended Tue Sep 20 20:07:19 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:07 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-20-19.00.html | 20:07 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-20-19.00.txt | 20:07 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-20-19.00.log.html | 20:07 |