*** SotK has quit IRC | 06:54 | |
*** SotK has joined #opendev-meeting | 06:55 | |
*** frickler is now known as frickler_pto | 09:44 | |
*** frickler_pto is now known as frickler | 09:47 | |
*** gouthamr has quit IRC | 10:22 | |
*** gouthamr has joined #opendev-meeting | 10:23 | |
*** frickler is now known as frickler_pto | 13:50 | |
*** hamalq has joined #opendev-meeting | 15:23 | |
*** hamalq has quit IRC | 15:29 | |
*** hamalq has joined #opendev-meeting | 15:35 | |
corvus | o/ | 18:59 |
clarkb | hello! | 18:59 |
clarkb | we'll get started in a couple minutes | 18:59 |
ianw | o/ | 19:00 |
clarkb | #startmeeting infra | 19:01 |
openstack | Meeting started Tue Jul 14 19:01:05 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
*** openstack changes topic to " (Meeting topic: infra)" | 19:01 | |
openstack | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link http://lists.opendev.org/pipermail/service-discuss/2020-July/000056.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
*** openstack changes topic to "Announcements (Meeting topic: infra)" | 19:01 | |
clarkb | OpenDev virtual event #2 happening July 20-22 | 19:01 |
*** zbr has joined #opendev-meeting | 19:01 | |
clarkb | calling this out as they are using etherpad, but the previous event didn't have any problems with etherpad. I plan to be around and support the service if necessary though | 19:01 |
zbr | o/ | 19:01 |
clarkb | also if you are interested in baremetal management that is the topic and you are welcome to join | 19:02 |
clarkb | #topic Actions from last meeting | 19:02 |
*** openstack changes topic to "Actions from last meeting (Meeting topic: infra)" | 19:02 | |
clarkb | #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-07-19.00.txt minutes from last meeting | 19:03 |
clarkb | ianw: thank you for running last week's meeting when I was out. I didn't see any actions recorded in the minutes. | 19:03 |
clarkb | ianw: is there anything else to add or should we move on to today's topics? | 19:03 |
ianw | nothing, i think move on | 19:03 |
clarkb | #topic Specs approval | 19:04 |
*** openstack changes topic to "Specs approval (Meeting topic: infra)" | 19:04 | |
clarkb | #link https://review.opendev.org/#/c/731838/ Authentication broker service | 19:04 |
clarkb | I believe this got a new patchset and I was going to review it, then things got busy before I took a week off | 19:04 |
clarkb | fungi: ^ other than needing reviews anything else to add? | 19:04 |
fungi | nope, there's a minor sequence numbering problem in one of the lists in it, but no major revisions requested yet | 19:05 |
clarkb | great, and a friendly reminder to the rest of us to try and review that spec | 19:05 |
fungi | or counting error i guess | 19:05 |
clarkb | #topic Priority Efforts | 19:05 |
*** openstack changes topic to "Priority Efforts (Meeting topic: infra)" | 19:06 | |
clarkb | #topic Update Config Management | 19:06 |
*** openstack changes topic to "Update Config Management (Meeting topic: infra)" | 19:06 | |
clarkb | ze01 is running on containers again. We've vendored the gear lib into the ansible role that uses it | 19:06 |
fungi | no new issues seen since? | 19:06 |
clarkb | other than a small hiccup with the vendoring I haven't seen any additional issues related to this | 19:06 |
clarkb | maybe give it another day or two then we should consider updating the remaining executors? | 19:07 |
fungi | any feel for how long we should pilot it before redoing the other 11? | 19:07 |
fungi | ahh, yeah, another day or two sounds fine to me | 19:07 |
clarkb | most of the issues we've hit have been in jobs that don't run frequently, which is why giving it a few days to have those random jobs run on that executor seems like a good idea | 19:07 |
clarkb | but I don't think we need to wait for very long either | 19:07 |
ianw | umm, there is one | 19:08 |
ianw | https://review.opendev.org/#/c/740854/ | 19:08 |
clarkb | ah that was related to the executor then? | 19:08 |
clarkb | (I saw the failures were happening but hadn't followed it that closely) | 19:08 |
ianw | yes, the executor writes out the job ssh key in the new openssh format, and it is more picky about whitespace | 19:08 |
clarkb | #link https://review.opendev.org/#/c/740854/ fixes an issue with containerized ze01. Should be landed and confirmed happy before converting more executors | 19:09 |
fungi | ahh, right, specifically because the version of openssh in the container is newer | 19:09 |
clarkb | I'll give that a review after the meeting if no one beats me to it | 19:09 |
fungi | fwiw the reasoning is sound and it's a very small patch, but disappointing default behavior from variable substitution | 19:10 |
fungi | i guess ansible or jinja assumes variables with trailing whitespace are a mistake unless you tell it otherwise | 19:10 |
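For context, a minimal sketch of the behavior being described, assuming the key material passes through Jinja templating at some point; whether that is the precise mechanism here is an assumption, and the actual fix is in change 740854. Jinja strips a single trailing newline from rendered output by default, and the newer OpenSSH private key format does not tolerate losing it.

```python
# Illustration only (not the actual fix): Jinja2 drops one trailing newline
# from rendered output unless keep_trailing_newline=True is set.
import jinja2

# Hypothetical key material; the newer OpenSSH format expects the final newline.
key = "-----BEGIN OPENSSH PRIVATE KEY-----\n...\n-----END OPENSSH PRIVATE KEY-----\n"

default_env = jinja2.Environment()                              # keep_trailing_newline defaults to False
preserving_env = jinja2.Environment(keep_trailing_newline=True)

print(repr(default_env.from_string(key).render()))     # final "\n" stripped
print(repr(preserving_env.from_string(key).render()))  # final "\n" kept
```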
clarkb | as far as converting the other 11 goes, I'm not entirely sure what the exact process is there. I think it's something like stop zuul services, manually remove systemd units for zuul services, run ansible, start container, but we'll want to double check that if mordred isn't able to update us | 19:11 |
mordred | ohai - not really here - but here for a sec | 19:11 |
fungi | i'd also be cool waiting for mordred's return to move forward, in case he wants to be involved in the next steps | 19:11 |
mordred | yeah - I think the story when we're happy with ze01 is for each remaining ze to shut down the executor, run ansible to update to docker | 19:12 |
mordred | but I can totally drive that when I'm back online for real | 19:12 |
clarkb | cool that probably gives us a good burn in period for ze01 too | 19:12 |
mordred | yah | 19:13 |
ianw | yeah probably worth seeing if any other weird executor specific behaviour pops up | 19:13 |
fungi | sounds okay to me | 19:13 |
corvus | mordred: eta for your return? | 19:13 |
mordred | I'll be back online enough to work on this on Thursday | 19:14 |
fungi | "chapter 23: the return of mordred" | 19:14 |
mordred | I'll have electricians replacing mains power ... But I have a laptop and phone :) | 19:14 |
fungi | that also fits with the "couple of days" suggestion | 19:14 |
corvus | cool, no rush or anything, just thought if it was going to be > this week maybe i'd start to pick some stuff off, but "wait for mordred" sounds like it'll fit time-wise :) | 19:15 |
clarkb | ya I'm willing to help too, just let me know | 19:15 |
fungi | same | 19:15 |
mordred | Cool. It should be straightforward at this point | 19:15 |
corvus | meanwhile, i'll continue work on the (tangentially related) multi-arch container stuff | 19:15 |
clarkb | corvus: that was the next item on my list of notes related to config management updates | 19:15 |
corvus | cool i'll summarize | 19:16 |
corvus | despite all of our reworking, we're still seeing the "container ends up with wrong arch" problem for the nodepool builder containers | 19:16 |
corvus | we managed to autohold a set of nodes exhibiting the problem reliably | 19:16 |
corvus | (and by reliably, i mean, i can run the build over and over and get the same result) | 19:17 |
corvus | so i should be able to narrow down the problem with that | 19:17 |
corvus | at this point, it's unknown whether it's an artifact of buildx, zuul-registry, or something else | 19:17 |
clarkb | is there any concern that if we were to restart nodepool builders right now they may fail due to a mismatch in the published artifacts? | 19:17 |
corvus | clarkb: 1 sec | 19:18 |
corvus | mordred, ianw: do you happen to have a link to the change? | 19:18 |
ianw | https://review.opendev.org/#/c/726263/ the multi-arch python-builder you mean? | 19:19 |
corvus | yep | 19:19 |
ianw | then yes :) | 19:19 |
corvus | #link https://review.opendev.org/726263 failing multi arch change | 19:19 |
corvus | that's the change with the held nodes | 19:19 |
corvus | (it was +3, but failed in gate with the error; i've modified it slightly to fail the buildset registry in order to complete the autohold) | 19:20 |
corvus | clarkb: i *think* the failure is happening reliably in the build stage, so i think we're unlikely to have a problem restarting with anything published | 19:21 |
clarkb | gotcha, basically if we make it to publishing things have succeeded which implies the arches are mapped properly? | 19:21 |
corvus | we do have published images for both arches, and, tbh, i'm not sure what's actually on them. | 19:21 |
* mordred is excited to learn what the issue is | 19:21 | |
corvus | are we running both arches in containers at this point? | 19:21 |
mordred | no - arm is still running non-container | 19:22 |
mordred | multi-arch being finished here should let us run the arm builder in containers | 19:22 |
mordred | and stop having differences | 19:22 |
corvus | okay. then my guess would be that there is a good chance the arm images published may not be arm images. but i honestly don't know. | 19:22 |
corvus | we should certainly not proceed any further with arm until this is resolved | 19:23 |
clarkb | ++ | 19:23 |
mordred | ++ | 19:23 |
fungi | noted | 19:23 |
mordred | well | 19:23 |
mordred | we haven't built arm python-base images | 19:23 |
mordred | so any existing arm layers for nodepool-builder are definitely bogus | 19:23 |
mordred | so definitely should not proceed further :) | 19:23 |
clarkb | mordred: even if those layers possibly don't do anything arch specific? | 19:23 |
mordred | they do | 19:24 |
clarkb | like our python-base is just python and bash right? | 19:24 |
clarkb | ah ok | 19:24 |
mordred | they install dumb-init | 19:24 |
mordred | which is arch-specific | 19:24 |
fungi | python is arch-specific | 19:24 |
clarkb | fungi: ya but cpython is on the lower layer | 19:24 |
mordred | yah - but it comes from the base image | 19:24 |
mordred | from docker.io/library/python | 19:24 |
mordred | and is properly arched | 19:24 |
clarkb | .py files in a layer won't care | 19:24 |
clarkb | unless they link to c things | 19:24 |
mordred | but we install at least one arch-specific package in docker.io/opendevorg/python-base | 19:25 |
clarkb | or we install dumb-init | 19:25 |
mordred | yah | 19:25 |
fungi | oh, okay, when you said "our python-base is just python and bash right" you meant python scripts, not compiled cpython | 19:25 |
ianw | clarkb: but anything that builds does, i think that was where we saw some issues with gcc at least | 19:25 |
clarkb | fungi: yup | 19:25 |
fungi | i misunderstood, sorry. thought you meant python the interpreter | 19:25 |
mordred | ianw: the gcc issue is actually a symptom of arch mismatch | 19:25 |
mordred | the builder image builds wheels so the base image doesn't have to - but the builder and base layers were out of sync arch-wise - so we saw the base image install trying to gcc something (and failing) | 19:26 |
mordred | (yay for complex issues) | 19:26 |
ianw | i thought so, and the random nature of the return was why it passed check but failed in gate (iirc?) | 19:27 |
mordred | yup | 19:27 |
mordred | thank goodness corvus managed to hold a reproducible env | 19:27 |
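As an aside, one quick way to spot the kind of mismatch discussed above is to compare what an image's metadata claims against what the binaries inside it report. A rough sketch with a placeholder image name (running a foreign-arch image this way assumes qemu/binfmt is set up on the host):

```python
# Sketch only: compare the architecture an image claims in its metadata with
# what the userland inside it actually reports. A disagreement between the two
# is the symptom described above. IMAGE is a placeholder for the tag under test.
import subprocess

IMAGE = "docker.io/zuul/nodepool-builder:latest"  # assumption: substitute the image being checked

def docker(*args):
    return subprocess.check_output(["docker", *args], text=True).strip()

docker("pull", IMAGE)
claimed = docker("image", "inspect", "--format", "{{.Architecture}}", IMAGE)  # e.g. amd64 / arm64
actual = docker("run", "--rm", "--entrypoint", "uname", IMAGE, "-m")          # e.g. x86_64 / aarch64
print(f"metadata says {claimed}, binaries report {actual}")
```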
clarkb | anything else on the topic of config management? | 19:27 |
ianw | i have grafana.opendev.org rolled out | 19:28 |
ianw | i'm still working on graphite02.opendev.org and migrating the settings and data correctly | 19:28 |
fungi | might be worth touching on the backup server split, though we can cover that under a later topic if preferred | 19:29 |
clarkb | yup its a separate topic | 19:30 |
fungi | cool, let's just lump it into that then | 19:30 |
clarkb | #topic OpenDev | 19:30 |
*** openstack changes topic to "OpenDev (Meeting topic: infra)" | 19:30 | |
clarkb | let's talk opendev things really quickly, then we can get to some of the things that popped up recently | 19:30 |
clarkb | #link https://review.opendev.org/#/c/740716/ Upgrade to v1.12.2 | 19:30 |
clarkb | That change upgrades us to the latest gitea. Notably its changelog says it allows you to properly set the default branch on new projects to something other than manster | 19:31 |
* fungi changes all his default branches to manster | 19:31 | |
clarkb | this isn't something we're using yet, but figuring these things out was noted in https://etherpad.opendev.org/p/opendev-git-branches so upgrading sooner rather than later seems like a good idea | 19:31 |
fungi | yeah, i think it's a good idea to have that in place soon | 19:32 |
clarkb | my fix for the repo list pagination did merge upstream and some future version of gitea should include it. That said the extra check we've got seems like good belts and suspenders | 19:32 |
fungi | surprised nobody's asked for the option yet | 19:32 |
clarkb | that fix is not in v1.12.2 | 19:32 |
clarkb | and finally I need to send an email to all of our advisory board volunteers and ask them to sub to service-discuss and service-announce if they haven't already, then see if I can get them to agree on a comms method (I've suggested service-discuss for simplicity) | 19:33 |
clarkb | #topic General Topics | 19:34 |
*** openstack changes topic to "General Topics (Meeting topic: infra)" | 19:34 | |
clarkb | #topic Dealing with Bup indexes and backup server volume migrations and our new backup server | 19:34 |
*** openstack changes topic to "Dealing with Bup indexes and backup server volume migrations and our new backup server (Meeting topic: infra)" | 19:34 | |
clarkb | this is the catch all for backup related items. Maybe we should start with what led us into discovering things? | 19:34 |
clarkb | My understanding of it is that zuul01's root disk filled up and this was tracked back to bup's local "git" indexes on zuul01 | 19:35 |
clarkb | we rm'ed that entire dir in /root/ but then bup stopped working on zuul01 | 19:35 |
clarkb | in investigating the fix for that we discovered our existing volume was nearing full capacity so we rotated out the oldest volume and made it latest on the old backup server | 19:35 |
fungi | probably the biggest things to discuss are that we've discovered it's safe to reinitialize ~root/.bup on backup clients, and that we're halfway through replacing the puppet-managed backups with ansible-managed backups but they use different servers (and redundancy would be swell) | 19:36 |
clarkb | after that ianw pointed out we have a newer backup server which is in use for some servers | 19:36 |
ianw | i had a think about how it ended up like that ... | 19:36 |
corvus | i kinda thought rm'ing the local index should not have caused a problem; it's not clear if it did or not; we didn't spend much time on that since it was time to roll the server side anyway | 19:36 |
clarkb | corvus: I think for review we may not have rotated its remote backup after rm'ing the local index because its remote server was the new server (not the old one). ianw and fungi can probably confirm that though | 19:37 |
clarkb | ianw: fungi ^ maybe lets sort that out first then talk about the new server? | 19:37 |
fungi | corvus: i did see a (possibly related) problem when i did it on review01... specifically that it ran away with disk (spooling something i think but i couldn't tell where) on the first backup attempt and filled the rootfs and crashed | 19:37 |
corvus | oh i missed the review01 issue | 19:38 |
corvus | it filled up root01's rootfs? | 19:38 |
corvus | er | 19:38 |
fungi | and yeah, when i removed and reinitialized ~root/.bup on review01 i didn't realize we were backing it up to a different (newer) backup server | 19:38 |
corvus | review01's rootfs | 19:38 |
corvus | fungi: what did you do to correct that? | 19:39 |
fungi | then i started the backup without clearing its remote copy on the new backup server, and rootfs space available quickly drained to 0% | 19:39 |
ianw | fungi: is that still the state of things? | 19:39 |
fungi | bup crashed with a python exception due to the enospc, but it immediately freed it all, leading me to suspect it was spooling in unlinked tempfiles | 19:39 |
fungi | which would also explain why i couldn't find them | 19:40 |
clarkb | ya it basically fixed itself afterwards but cacti showed a spike during roughly the time bup was running | 19:40 |
clarkb | then a subsequent run of bup was fine | 19:40 |
fungi | after that, i ran it again, and it used a bit of space on / temporarily but eventually succeeded | 19:40 |
fungi | so really not sure what to make of that | 19:41 |
clarkb | it may be worth doing a test recovery off the current review01 backups (and zuul01?) just to be sure the removal of /root/.bup isn't a problem there | 19:41 |
fungi | it did not exhibit whatever behavior led zuul01 to have two hung/running bup processes started on successive days | 19:41 |
ianw | clarkb: ++ i can take an action item to confirm i can get some data out of those backups if we like | 19:43 |
clarkb | ianw: that would be great, thanks | 19:43 |
clarkb | and I think otherwise we continue to monitor it and see if we have disk issues? | 19:43 |
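For reference, a restore spot-check along the lines ianw is volunteering for might look roughly like the following; the repository path, branch name, and restored file are assumptions rather than the actual layout on the backup servers.

```python
# Rough sketch of a restore spot-check: list the newest save and pull one small
# file back into a scratch directory. BUP_REPO and BRANCH are hypothetical, and
# this assumes /etc is part of the backup.
import subprocess
import tempfile

BUP_REPO = "/opt/backups/review01"   # assumption: per-host bup repo on the backup server
BRANCH = "review01"                  # assumption: name used with `bup save -n`

def bup(*args):
    return subprocess.run(["bup", "-d", BUP_REPO, *args],
                          check=True, capture_output=True, text=True).stdout

print(bup("ls", f"/{BRANCH}/latest"))                     # what the newest save contains
with tempfile.TemporaryDirectory() as dest:
    bup("restore", "-C", dest, f"/{BRANCH}/latest/etc/hostname")
    print(open(f"{dest}/hostname").read())                # proves real data came back
```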
clarkb | ianw: what are we thinking for the server swap itself? | 19:44 |
ianw | yeah, so some history | 19:44 |
ianw | i wrote the ansible roles to install backup users and cron jobs etc, and basically iirc the idea was that as we got rid of puppet everything would pull that in, everything would be on the new server, and the old one could be retired | 19:44 |
ianw | however, puppet clearly has a long tail ... | 19:45 |
ianw | which is how we've ended up in a confusing situation for a long time | 19:45 |
ianw | firstly | 19:45 |
fungi | but also we're already ansibling other stuff on all those servers, so the fact that some also get unrelated services configured by puppet should be irrelevant now as far as that goes | 19:45 |
ianw | fungi: yes, that was my next point :) i don't think that was true, or as true, at the time of the original backup roles | 19:46 |
ianw | so, for now | 19:46 |
fungi | if we can manage user accounts across all of them with ansible then seems like we could manage backups across all of them with ansible too | 19:46 |
ianw | #link https://review.opendev.org/740824 add zuul to backup group | 19:46 |
fungi | yeah, a year ago maybe not | 19:46 |
ianw | we should do that ^ ... zuul dropped the bup:: puppet bits, but didn't pick up the ansible bits | 19:46 |
ianw | then | 19:46 |
ianw | #link https://review.opendev.org/740827 backup all hosts with ansible | 19:47 |
ianw | that *adds* the ansible backup roles to all extant backup hosts | 19:47 |
ianw | so, they will be backing up to the vexxhost server (new, ansible roles) and the rax one (old, puppet roles) | 19:47 |
clarkb | gotcha, so we'll swap over the puppeted hosts too; that way it's less confusing | 19:47 |
ianw | once that is rolled out, we should clean up the puppet side, drop the bup:: bits from them and remove the cron job | 19:47 |
clarkb | oh we'll keep the puppet too? would it be better to have the ansible side configure both the old and new servers? | 19:48 |
fungi | and once we do that, build a second new backup server and add it to the role? | 19:48 |
clarkb | and simply remove the puppetry? | 19:48 |
ianw | *then* we should add a second backup server in RAX, add that to the ansible side, and we'll have dual backups | 19:48 |
fungi | yeah, all that sounds fine to me | 19:48 |
clarkb | gotcha | 19:48 |
ianw | yes ... sounds like we agree :) | 19:48 |
* fungi makes thumbs-up sign | 19:48 | |
clarkb | basically add the second backup server with ansible rather than worry too much about continuing to use the puppeted side of things | 19:48 |
clarkb | wfm | 19:48 |
clarkb | as a time check we have ~12 minutes and a few more items so I'll keep things moving here | 19:49 |
clarkb | #topic Retiring openstack-infra ML July 15 | 19:49 |
*** openstack changes topic to "Retiring openstack-infra ML July 15 (Meeting topic: infra)" | 19:49 | |
clarkb | fungi: I haven't seen any objections to this, are we still a go for that tomorrow? | 19:49 |
fungi | yeah, that's the plan | 19:50 |
fungi | #link https://review.opendev.org/739152 Forward openstack-infra ML to openstack-discuss | 19:51 |
fungi | i'll be approving that tomorrow, preliminary reviews appreciated | 19:51 |
fungi | i've also got a related issue | 19:51 |
fungi | in working on a mechanism for analyzing mailing list volume/activity for our engagement statistics i've remembered that we'd never gotten around to coming up with a means of providing links to the archives for retired mailing lists | 19:52 |
fungi | and mailman 2.x doesn't have a web api really | 19:53 |
fungi | or more specifically pipermail which does the archive presentation | 19:53 |
clarkb | the archives are still there if you know the urls though iirc. Maybe a basic index page we can link to somewhere? | 19:53 |
fungi | basically once these are deleted, mailman no longer knows about the lists but pipermail-generated archives for them continue to exist and be served if you know the urls | 19:54 |
fungi | at the moment there are 24 (25 tomorrow) retired mailing lists on domains we host, and they're all on the lists.openstack.org domain so far but eventually there will be others | 19:54 |
fungi | i don't know if we should just manually add links to retired list archives in the html templates for each site (there is a template editor in the webui though i've not really played with it) | 19:55 |
clarkb | each site == mailman list? | 19:55 |
fungi | or if we should run some cron/ansible/zuul automation to generate a static list of them and publish it somewhere discoverable | 19:55 |
fungi | sites are like lists.opendev.org, lists.zuul-ci.org, et cetera | 19:56 |
clarkb | ah | 19:56 |
clarkb | that seems reasonable to me because it is where people will go looking for it | 19:56 |
clarkb | but I'm not sure how automatable that is | 19:56 |
fungi | yeah, i'm mostly just bringing this up now to say i'm open to suggestions outside the meeting (so as not to take up any more of the hour) | 19:56 |
clarkb | ++ | 19:57 |
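One possible shape for the generate-and-publish option fungi mentions, sketched with assumed stock Mailman 2 / pipermail paths rather than the actual server layout: walk the public archive directories, drop anything mailman still knows about, and write a static index page.

```python
# Sketch of a generated "retired list archives" index. All paths, the site
# name, and the output location are assumptions for illustration.
import os
import subprocess

ARCHIVE_ROOT = "/var/lib/mailman/archives/public"  # assumption: stock pipermail layout
SITE = "lists.openstack.org"                       # assumption: one site per run
OUTPUT = "/var/www/retired-archives.html"          # assumption: somewhere discoverable

# Lists mailman still serves; anything archived but absent here is retired.
active = set(subprocess.check_output(
    ["/usr/lib/mailman/bin/list_lists", "--bare"], text=True).split())

retired = sorted(
    name for name in os.listdir(ARCHIVE_ROOT)
    if os.path.isdir(os.path.join(ARCHIVE_ROOT, name)) and name not in active
)

with open(OUTPUT, "w") as index:
    index.write("<html><body><h1>Retired mailing list archives</h1><ul>\n")
    for name in retired:
        index.write(f'<li><a href="https://{SITE}/pipermail/{name}/">{name}</a></li>\n')
    index.write("</ul></body></html>\n")
```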
clarkb | #topic China telecom blocks | 19:57 |
*** openstack changes topic to "China telecom blocks (Meeting topic: infra)" | 19:57 | |
fungi | i'll keep this short | 19:57 |
clarkb | AIUI we removed the blocks and did not need to switch to ianw's UA based filtering? | 19:57 |
fungi | we dropped the temporary firewall rules (see opendev-announce ml archives for date and time) once the background activity dropped to safe levels | 19:57 |
fungi | it could of course reoccur, or something like it, at any time. no guarantees it would be from the same networks/providers either | 19:58 |
clarkb | we've landed the apache filtration code now though right? | 19:58 |
fungi | so i do still think ianw's solution is a good one to keep in our back pocket | 19:58 |
clarkb | so our response in the future can be to switch to the apache port in haproxy configs? | 19:59 |
fungi | yes, the plumbing is in place we just have to turn it on and configure it | 19:59 |
ianw | yeah, i think it's probably good we have the proxy option up our sleeve if we need those layer 7 blocks | 19:59 |
ianw | touch wood, never need it | 19:59 |
clarkb | ++ | 19:59 |
fungi | but of course if we don't exercise it, then it's at risk of bitrot as well, so we should be prepared to have to fix something with it | 19:59 |
clarkb | are there any changes needed to finish that up so it is ready if we need it? | 19:59 |
clarkb | or are we in the state where it's in our attic and good to go when necessary? | 20:00 |
ianw | fungi: it is enabled and tested during the gate testing runs | 20:00 |
clarkb | (we are at time now but have one last thing to bring up) | 20:00 |
ianw | gate testing runs for gitea | 20:00 |
fungi | yeah, hopefully that mitigates the bitrot risk then | 20:00 |
clarkb | #topic Project Renames | 20:01 |
*** openstack changes topic to "Project Renames (Meeting topic: infra)" | 20:01 | |
clarkb | There are a couple of renames requested now. | 20:01 |
clarkb | I'm already feeling a bit swamped this week just catching up on things and making progress on items that I was pushing on | 20:01 |
clarkb | makes me think that July 24 may be a good option for rename outage | 20:02 |
clarkb | if I can get at least one other set of eyeballs for that I'll go ahead and announce it. We're at time so don't need to have that answer right now but let me know if you can help | 20:02 |
fungi | the opendev hardware automation conference finishes on the 22nd, so i can swing the 24th | 20:02 |
clarkb | (we've largely automated that whole process now which is cool) | 20:02 |
clarkb | fungi: thanks | 20:03 |
clarkb | Thanks everyone! | 20:03 |
clarkb | #endmeeting | 20:03 |
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev" | 20:03 | |
openstack | Meeting ended Tue Jul 14 20:03:17 2020 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:03 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-14-19.01.html | 20:03 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-14-19.01.txt | 20:03 |
openstack | Log: http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-14-19.01.log.html | 20:03 |
fungi | thanks clarkb! | 20:03 |
*** hamalq has quit IRC | 22:58 | |