Tuesday, 2020-07-14

*** SotK has quit IRC06:54
*** SotK has joined #opendev-meeting06:55
*** frickler is now known as frickler_pto09:44
*** frickler_pto is now known as frickler09:47
*** gouthamr has quit IRC10:22
*** gouthamr has joined #opendev-meeting10:23
*** frickler is now known as frickler_pto13:50
*** hamalq has joined #opendev-meeting15:23
*** hamalq has quit IRC15:29
*** hamalq has joined #opendev-meeting15:35
corvuso/18:59
clarkbhello!18:59
clarkbwe'll get started in a couple minutes18:59
ianwo/19:00
clarkb#startmeeting infra19:01
openstackMeeting started Tue Jul 14 19:01:05 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
*** openstack changes topic to " (Meeting topic: infra)"19:01
openstackThe meeting name has been set to 'infra'19:01
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2020-July/000056.html Our Agenda19:01
clarkb#topic Announcements19:01
*** openstack changes topic to "Announcements (Meeting topic: infra)"19:01
clarkbOpenDev virtual event #2 happening July 20-2219:01
*** zbr has joined #opendev-meeting19:01
clarkbcalling this out as they are using etherpad; the previous event didn't have any problems with etherpad, but I plan to be around to support the service if necessary19:01
zbro/19:01
clarkbalso if you are interested in baremetal management that is the topic and you are welcome to join19:02
clarkb#topic Actions from last meeting19:02
*** openstack changes topic to "Actions from last meeting (Meeting topic: infra)"19:02
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-07-19.00.txt minutes from last meeting19:03
clarkbianw: thank you for running last week's meeting when I was out. I didn't see any actions recorded in the minutes.19:03
clarkbianw: is there anything else to add or should we move on to today's topics?19:03
ianwnothing, i think move on19:03
clarkb#topic Specs approval19:04
*** openstack changes topic to "Specs approval (Meeting topic: infra)"19:04
clarkb#link https://review.opendev.org/#/c/731838/ Authentication broker service19:04
clarkbI believe this got a new patchset and I was going to review it, but then things got busy before I took a week off19:04
clarkbfungi: ^ other than needing reviews anything else to add?19:04
funginope, there's a minor sequence numbering problem in one of the lists in it, but no major revisions requested yet19:05
clarkbgreat, and a friendly reminder to the rest of us to try and review that spec19:05
fungior counting error i guess19:05
clarkb#topic Priority Efforts19:05
*** openstack changes topic to "Priority Efforts (Meeting topic: infra)"19:06
clarkb#topic Update Config Management19:06
*** openstack changes topic to "Update Config Management (Meeting topic: infra)"19:06
clarkbze01 is running on containers again. We've vendored the gear lib into the ansible role that uses it19:06
fungino new issues seen since?19:06
clarkbother than a small hiccup with the vendoring I haven't seen any additional issues related to this19:06
clarkbmaybe give it another day or two then we should consider updating the remaining executors?19:07
fungiany feel for how long we should pilot it before redoing the other 11?19:07
fungiahh, yeah, another day or two sounds fine to me19:07
clarkbmost of the issues we've hit have been in jobs that don't run frequently, which is why giving it a few days to have those random jobs run on that executor seems like a good idea19:07
clarkbbut I don't think we need to wait for very long either19:07
ianwumm, there is one19:08
ianwhttps://review.opendev.org/#/c/740854/19:08
clarkbah that was related to the executor then?19:08
clarkb(I saw the failures were happening but hadn't followed it that closely)19:08
ianwyes, the executor writes out the job ssh key in the new openssh format, and it is more picky about whitespace19:08
clarkb#link https://review.opendev.org/#/c/740854/ fixes an issue with containerized ze01. Should be landed and confirmed happy before converting more executors19:09
fungiahh, right, specifically because the version of openssh in the container is newer19:09
clarkbI'll give that a review after the meeting if no one beats me to it19:09
fungifwiw the reasoning is sound and it's a very small patch, but disappointing default behavior from variable substitution19:10
fungii guess ansible or jinja assumes variables with trailing whitespace are a mistake unless you tell it otherwise19:10
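A minimal way to sanity-check that theory on an executor, assuming the job key lands in a file on disk (the path below is illustrative only): ssh-keygen will refuse to load an OPENSSH-format private key whose trailing whitespace or newline was mangled by templating.

    # hypothetical path to the job ssh key the executor wrote out
    keyfile=/tmp/job-ssh-key
    if ! ssh-keygen -y -f "$keyfile" >/dev/null 2>&1; then
        echo "key no longer parses; likely trailing-whitespace damage from templating" >&2
    fi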
clarkbas far as converting the other 11 goes, I'm not entirely sure what the exact process is there. I think it's something like stop zuul services, manually remove systemd units for zuul services, run ansible, start container, but we'll want to double check that if mordred isn't able to update us19:11
mordredohai - not really here - but here for a sec19:11
fungii'd also be cool waiting for mordred's return to move forward, in case he wants to be involved in the next steps19:11
mordredyeah - I think the story when we're happy with ze01 is for each remaining ze to shut down the executor, run ansible to update to docker19:12
mordredbut I can totally drive that when I'm back online for real19:12
clarkbcool that probably gives us a good burn in period for ze01 too19:12
mordredyah19:13
ianwyeah probably worth seeing if any other weird executor specific behaviour pops up19:13
fungisounds okay to me19:13
corvusmordred: eta for your return?19:13
mordredI'll be back online enough to work on this on Thursday19:14
fungi"chapter 23: the return of mordred"19:14
mordredI'll have electricians replacing mains power ... But I have a laptop and phone :)19:14
fungithat also fits with the "couple of days" suggestion19:14
corvuscool, no rush or anything, just thought if it was going to be > this week maybe i'd start to pick some stuff off, but "wait for mordred" sounds like it'll fit time-wise :)19:15
clarkbya I'm willing to help too, just let me know19:15
fungisame19:15
mordredCool. It should be straightforward at this point19:15
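Very roughly, the per-executor sequence mordred describes would look like the sketch below; the playbook name, paths, hostnames, and service names here are assumptions, not the actual runbook.

    # convert the remaining executors one at a time (hostnames/playbook are examples)
    for host in ze02.openstack.org ze03.openstack.org; do
        # stop and disable the old non-container service so it cannot race the container
        ssh "$host" 'sudo systemctl stop zuul-executor && sudo systemctl disable zuul-executor'
        # re-run the ansible that deploys the containerized executor (playbook path assumed)
        ansible-playbook -l "$host" playbooks/service-zuul.yaml
        # confirm the container actually came up before moving on
        ssh "$host" 'sudo docker ps --format "{{.Names}}" | grep -i executor'
    done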
corvusmeanwhile, i'll continue work on the (tangentially related) multi-arch container stuff19:15
clarkbcorvus: that was the next item on my list of notes related to config management updates19:15
corvuscool i'll summarize19:16
corvusdespite all of our reworking, we're still seeing the "container ends up with wrong arch" problem for the nodepool builder containers19:16
corvuswe managed to autohold a set of nodes exhibiting the problem reliably19:16
corvus(and by reliably, i mean, i can run the build over and over and get the same result)19:17
corvusso i should be able to narrow down the problem with that19:17
corvusat this point, it's unknown whether it's an artifact of buildx, zuul-registry, or something else19:17
clarkbis there any concern that if we were to restart nodepool builders right now they may fail due to a mismatch in the published artifacts?19:17
corvusclarkb: 1 sec19:18
corvusmordred, ianw: do you happen to have a link to the change?19:18
ianwhttps://review.opendev.org/#/c/726263/ the multi-arch python-builder you mean?19:19
corvusyep19:19
ianwthen yes :)19:19
corvus#link https://review.opendev.org/726263 failing multi arch change19:19
corvusthat's the change with the held nodes19:19
corvus(it was +3, but failed in gate with the error; i've modified it slightly to fail the buildset registry in order to complete the autohold)19:20
corvusclarkb: i *think* the failure is happening reliably in the build stage, so i think we're unlikely to have a problem restarting with anything published19:21
clarkbgotcha, basically if we make it to publishing things have succeeded which implies the arches are mapped properly?19:21
corvuswe do have published images for both arches, and, tbh, i'm not sure what's actually on them.19:21
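For reference, one way to see what actually got published per architecture is to read the manifest list and the per-image config; the image name and tag below are examples, not necessarily the real published tags.

    # list the architectures present in the (possibly multi-arch) manifest
    # (needs a docker CLI new enough to have "manifest inspect")
    docker manifest inspect docker.io/zuul/nodepool-builder:latest
    # or inspect the architecture recorded in a single image's config
    skopeo inspect docker://docker.io/zuul/nodepool-builder:latest | grep -i architecture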
* mordred is excited to learn what the issue is19:21
corvusare we running both arches in containers at this point?19:21
mordredno - arm is still running non-container19:22
mordredmulti-arch being finished here should let us run the arm builder in containers19:22
mordredand stop having differences19:22
corvusokay.  then my guess would be that there is a good chance the arm images published may not be arm images.  but i honestly don't know.19:22
corvuswe should certainly not proceed any further with arm until this is resolved19:23
clarkb++19:23
mordred++19:23
funginoted19:23
mordredwell19:23
mordredwe haven't built arm python-base images19:23
mordredso any existing arm layers for nodepool-builder are definitely bogus19:23
mordredso definitely should not proceed further :)19:23
clarkbmordred: even if those layers possibly don't do anything arch specific?19:23
mordredthey do19:24
clarkblike our python-base is just python and bash right?19:24
clarkbah ok19:24
mordredthey install dumb-init19:24
mordredwhich is arch-specific19:24
fungipython is arch-specific19:24
clarkbfungi: ya but cpython is on the lower layer19:24
mordredyah - but it comes from the base image19:24
mordredfrom docker.io/library/python19:24
mordredand is properly arched19:24
clarkb.py files in a layer won't care19:24
clarkbunless they link to c things19:24
mordredbut we install at least one arch-specific package in docker.io/opendevorg/python-base19:25
clarkbor we install dumb init19:25
mordredyah19:25
fungioh, okay, when you said "our python-base is just python and bash right" you meant python scripts, not compiled cpython19:25
ianwclarkb: but anything that builds does, i think that was where we saw some issues with gcc at least19:25
clarkbfungi: yup19:25
fungii misunderstood, sorry. thought you meant python the interpreter19:25
mordredianw: the gcc issue is actually a symptom of arch mismatch19:25
mordredthe builder image builds wheels so the base image doesn't have to - but the builder and base layers were out of sync arch-wise - so we saw the base image install trying to gcc something (and failing)19:26
mordred(yay for complex issues)19:26
ianwi thought so, and the random nature of the return was why it passed check but failed in gate (iirc?)19:27
mordredyup19:27
mordredthank goodness corvus managed to hold a reproducible env19:27
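As a quick spot check that the arch-specific pieces in a layer (like dumb-init) match the platform requested, something like the following works; the image tag, the dumb-init location, and the presence of the file utility in the image are all assumptions.

    # needs a docker new enough to support --platform on run
    docker run --rm --platform linux/arm64 docker.io/opendevorg/python-base:latest \
        sh -c 'uname -m; file "$(command -v dumb-init)"'
    # uname -m should report aarch64, and file should describe an aarch64 ELF binary,
    # if the layer really matches the requested architecture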
clarkbanything else on the topic of config management?19:27
ianwi have grafana.opendev.org rolled out19:28
ianwi'm still working on graphite02.opendev.org and migrating the settings and data correctly19:28
fungimight be worth touching on the backup server split, though we can cover that under a later topic if preferred19:29
clarkbyup its a separate topic19:30
fungicool, let's just lump it into that then19:30
clarkb#topic OpenDev19:30
*** openstack changes topic to "OpenDev (Meeting topic: infra)"19:30
clarkblets talk opendev things really quickly then we can get to some of the things that popped up recently19:30
clarkb#link https://review.opendev.org/#/c/740716/ Upgrade to v1.12.219:30
clarkbThat change upgrades us to latest gitea. Notably its changelog says it allows you to properly set the default branch on new projects to something other than manster19:31
* fungi changes all his default branches to manster19:31
clarkbthis isn't something we're using yet but figuring these things out was noted in https://etherpad.opendev.org/p/opendev-git-branches so upgrading sooner rather than later seems like a good idea19:31
fungiyeah, i think it's a good idea to have that in place soon19:32
clarkbmy fix for the repo list pagination did merge upstream and some future version of gitea should include it. That said the extra check we've got seems like good belts and suspenders19:32
fungisurprised nobody's asked for the option yet19:32
clarkbthat fix is not in v1.12.219:32
clarkband finally I need to send an email to all of our advisory board volunteers and ask them to sub to service-discuss and service-announce if they haven't already, then see if I can get them to agree on a comms method (I've suggested service-discuss for simplicity)19:33
clarkb#topic General Topics19:34
*** openstack changes topic to "General Topics (Meeting topic: infra)"19:34
clarkb#topic Dealing with Bup indexes and backup server volume migrations and our new backup server19:34
*** openstack changes topic to "Dealing with Bup indexes and backup server volume migrations and our new backup server (Meeting topic: infra)"19:34
clarkbthis is the catch all for backup related items. Maybe we should start with what led us into discovering things?19:34
clarkbMy understanding of it is that zuul01's root disk filled up and this was tracked back to bups local to zuul01 "git" indexes19:35
clarkbwe rm'ed that entire dir in /root/ but then bup stopped working on zuul0119:35
clarkbin investigating the fix for that we discovered our existing volume was nearing full capacity so we rotated out the oldest volume and made it latest on the old backup server19:35
fungiprobably the biggest things to discuss are that we've discovered it's safe to reinitialize ~root/.bup on backup clients, and that we're halfway through replacing the puppet-managed backups with ansible-managed backups but they use different servers (and redundancy would be swell)19:36
clarkbafter that ianw pointed out we have a newer backup server which is in use for some servers19:36
ianwi had a think about how it ended up like that ...19:36
corvusi kinda thought rm'ing the local index should not have caused a problem; it's not clear if it did or not; we didn't spend much time on that since it was time to roll the server side anyway19:36
clarkbcorvus: I think for review we may not have rotated its remote backup after rm'ing the local index because its remote server was the new server (not the old one). ianw and fungi can probably confirm that though19:37
clarkbianw: fungi ^ maybe lets sort that out first then talk about the new server?19:37
fungicorvus: i did see a (possibly related) problem when i did it on review01... specifically that it ran away with disk (spooling something i think but i couldn't tell where) on the first backup attempt and filled the rootfs and crashed19:37
corvusoh i missed the review01 issue19:38
corvusit filled up root01's rootfs?19:38
corvuser19:38
fungiand yeah, when i removed and reinitialized ~root/.bup on review01 i didn't realize we were backing it up to a different (newer) backup server19:38
corvusreview01's rootfs19:38
corvusfungi: what did you do to correct that?19:39
fungithen i started the backup without clearing its remote copy on the new backup server, and rootfs space available quickly drained to 0%19:39
ianwfungi: is that still the state of things?19:39
fungibup crashed with a python exception due to the enospc, but it immediately freed it all, leading me to suspect it was spooling in unlinked tempfiles19:39
fungiwhich would also explain why i couldn't find them19:40
clarkbya it basically fixed itself afterwards but cacti showed a spike during roughly the time bup was running19:40
clarkbthen a subsequent run of bup was fine19:40
fungiafter that, i ran it again, and it used a bit of space on / temporarily but eventually succeeded19:40
fungiso really not sure what to make of that19:41
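For the record, the client-side reset being described amounts to roughly the following; the exact index/save options should match whatever the existing backup cron job uses, so treat this as a sketch rather than the real commands.

    # keep the old local bup metadata around until the next backup succeeds
    mv /root/.bup /root/.bup.old-$(date +%F)
    # recreate the local repo/index; the actual backup data lives on the remote backup server
    bup init
    # the next bup index + bup save -r run will re-hash everything, so expect it to be slow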
clarkbit may be worth doing a test recovery off the current review01 backups (and zuul01?) just to be sure the removal of /root/.bup isn't a problem there19:41
fungiit did not exhibit whatever behavior led zuul01 to end up with two hung/running bup processes started on successive days19:41
ianwclarkb: ++ i can take an action item to confirm i can get some data out of those backups if we like19:43
clarkbianw: that would be great, thanks19:43
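A restore smoke test on the backup server could be as small as the following; the repo location and branch name are assumptions about how the backups are laid out, since bup stores each client's saves as a git-style branch.

    # on the backup server, point bup at the repo holding review01's backups (path assumed)
    export BUP_DIR=/opt/backups/bup-review01
    bup ls                      # list branches/snapshots to find the right name
    # pull one small file out of the latest snapshot into a scratch directory
    bup restore -C /tmp/restore-test review01/latest/etc/hostname
    cat /tmp/restore-test/hostname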
clarkband I think otherwise we continue to monitor it and see if we have disk issues?19:43
clarkbianw: what are we thinking for the server swap itself?19:44
ianwyeah, so some history19:44
ianwi wrote the ansible roles to install backup users and cron jobs etc in ansible, and basically iirc the idea was that as we got rid of puppet everything would pull that in, everything would be on the new server and the old could be retired19:44
ianwhowever, puppet clearly has a long tail ...19:45
ianwwhich is how we've ended up in a confusing situation for a long time19:45
ianwfirstly19:45
fungibut also we're already ansibling other stuff on all those servers, so the fact that some also get unrelated services configured by puppet should be irrelevant now as far as that goes19:45
ianwfungi: yes, that was my next point :)  i don't think that was true, or as true, at the time of the original backup roles19:46
ianwso, for now19:46
fungiif we can manage user accounts across all of them with ansible then seems like we could manage backups across all of them with ansible too19:46
ianw#link https://review.opendev.org/740824 add zuul to backup group19:46
fungiyeah, a year ago maybe not19:46
ianwwe should do that ^ ... zuul dropped the bup:: puppet bits, but didn't pick up the ansible bits19:46
ianwthen19:46
ianw#link https://review.opendev.org/740827 backup all hosts with ansible19:47
ianwthat *adds* the ansible backup roles to all extant backup hosts19:47
ianwso, they will be backing up to the vexxhost server (new, ansible roles) and the rax one (old, puppet roles)19:47
clarkbgotcha so we'll swap over the puppeted hosts too, that way it's less confusing19:47
ianwonce that is rolled out, we should clean up the puppet side, drop the bup:: bits from them and remove the cron job19:47
clarkboh we'll keep the puppet too? would it be better to have the ansible side configure both the old and new server?19:48
fungiand once we do that, build a second new backup server and add it to the role?19:48
clarkband simply remove the puppetry?19:48
ianw*then* we should add a second backup server in RAX, add that to the ansible side, and we'll have dual backups19:48
fungiyeah, all that sounds fine to me19:48
clarkbgotcha19:48
ianwyes ... sounds like we agree :)19:48
* fungi makes thumbs-up sign19:48
clarkbbasically add the second back in with ansible rather than worry too much about continuing to use the puppeted side of things19:48
clarkbwfm19:48
clarkbas a time check we have ~12 minutes and a few more items so I'll keep things moving here19:49
clarkb#topic Retiring openstack-infra ML July 1519:49
*** openstack changes topic to "Retiring openstack-infra ML July 15 (Meeting topic: infra)"19:49
clarkbfungi: I haven't seen any objections for this, are we still a go for that tomorrow?19:49
fungiyeah, that's the plan19:50
fungi#link https://review.opendev.org/739152 Forward openstack-infra ML to openstack-discuss19:51
fungii'll be approving that tomorrow, preliminary reviews appreciated19:51
fungii've also got a related issue19:51
fungiin working on a mechanism for analyzing mailing list volume/activity for our engagement statistics i've remembered that we'd never gotten around to coming up with a means of providing links to the archives for retired mailing lists19:52
fungiand mailman 2.x doesn't have a web api really19:53
fungior more specifically pipermail which does the archive presentation19:53
clarkbthe archives are still there if you know the urls though iirc. Maybe a basic index page we can link to somewhere?19:53
fungibasically once these are deleted, mailman no longer knows about the lists but pipermail-generated archives for them continue to exist and be served if you know the urls19:54
fungiat the moment there are 24 (25 tomorrow) retired mailing lists on domains we host, and they're all on the lists.openstack.org domain so far but eventually there will be others19:54
fungii don't know if we should just manually add links to retired list archives in the html templates for each site (there is a template editor in the webui though i've not really played with it)19:55
clarkbeach site == mailman list?19:55
fungior if we should run some cron/ansible/zuul automation to generate a static list of them and publish it somewhere discoverable19:55
fungisites are like lists.opendev.org, lists.zuul-ci.org, et cetera19:56
clarkbah19:56
clarkbthat seems reasonable to me because it is where people will go looking for it19:56
clarkbbut I'm not sure how automatable that is19:56
fungiyeah, i'm mostly just bringing this up now to say i'm open to suggestions outside the meeting (so as not to take up any more of the hour)19:56
clarkb++19:57
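One possible shape for the generate-and-publish option fungi mentions, with every path and the pipermail layout being assumptions (and list_lists being the mailman 2 tool that prints the lists mailman still knows about):

    archive_root=/srv/mailman/archives/public      # pipermail archive directory (assumed)
    out=/var/www/retired-lists.html                # wherever we choose to publish the index
    {
      echo "<html><body><h1>Archives of retired mailing lists</h1><ul>"
      for d in "$archive_root"/*/; do
        name=$(basename "$d")
        # only index archives whose list mailman no longer knows about
        if ! list_lists -b | grep -qx "$name"; then
          echo "<li><a href=\"/pipermail/$name/\">$name</a></li>"
        fi
      done
      echo "</ul></body></html>"
    } > "$out"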
clarkb#topic China telecom blocks19:57
*** openstack changes topic to "China telecom blocks (Meeting topic: infra)"19:57
fungii'll keep this short19:57
clarkbAIUI we removed the blocks and did not need to switch to ianw's UA based filtering?19:57
fungiwe dropped the temporary firewall rules (see opendev-announce ml archives for date and time) once the background activity dropped to safe levels19:57
fungiit could of course reoccur, or something like it, at any time. no guarantees it would be from the same networks/providers either19:58
clarkbwe've landed the apache filtration code now though right?19:58
fungiso i do still think ianw's solution is a good one to keep in our back pocket19:58
clarkbso our response in the future can be to switch to the apache port in haproxy configs?19:59
fungiyes, the plumbing is in place we just have to turn it on and configure it19:59
ianwyeah, i think it's probably good we have the proxy option up our sleeve if we need those layer 7 blocks19:59
ianwtouch wood, never need it19:59
clarkb++19:59
fungibut of course if we don't exercise it, then it's at risk of bitrot as well so we should be prepared to have to fix something with it19:59
clarkbare there any changes needed to finish that up so it is ready if we need it?19:59
clarkbor are we in the state where it's in our attic and good to go when necessary?20:00
ianwfungi: it is enabled and tested during the gate testing runs20:00
clarkb(we are at time now but have one last thing to bring up)20:00
ianwgate testing runs for gitea20:00
fungiyeah, hopefully that mitigates the bitrot risk then20:00
clarkb#topic Project Renames20:01
*** openstack changes topic to "Project Renames (Meeting topic: infra)"20:01
clarkbThere are a couple of renames requested now.20:01
clarkbI'm already feeling a bit swamped this week just catching up on things and making progress on items that I was pushing on20:01
clarkbmakes me think that July 24 may be a good option for rename outage20:02
clarkbif I can get at least one other set of eyeballs for that I'll go ahead and announce it. We're at time so don't need to have that answer right now but let me know if you can help20:02
fungithe opendev hardware automation conference finishes on the 22nd, so i can swing the 24th20:02
clarkb(we've largely automated that whole process now which is cool)20:02
clarkbfungi: thanks20:03
clarkbThanks everyone!20:03
clarkb#endmeeting20:03
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev"20:03
openstackMeeting ended Tue Jul 14 20:03:17 2020 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:03
openstackMinutes:        http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-14-19.01.html20:03
openstackMinutes (text): http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-14-19.01.txt20:03
openstackLog:            http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-14-19.01.log.html20:03
fungithanks clarkb!20:03
*** hamalq has quit IRC22:58
