*** hamalq has quit IRC | 01:24 | |
*** hashar has joined #opendev-meeting | 07:09 | |
*** hashar has quit IRC | 08:19 | |
*** hashar has joined #opendev-meeting | 09:25 | |
*** hashar has quit IRC | 11:08 | |
*** hashar has joined #opendev-meeting | 13:04 | |
*** hashar has quit IRC | 15:28 | |
*** hashar has joined #opendev-meeting | 15:57 | |
*** hashar has quit IRC | 17:07 | |
*** hamalq has joined #opendev-meeting | 18:30 | |
*** hashar has joined #opendev-meeting | 18:53 | |
clarkb | anyone else here for the meeting? | 19:00 |
clarkb | we will get started shortly | 19:00 |
ianw | o/ | 19:00 |
clarkb | #startmeeting infra | 19:01 |
openstack | Meeting started Tue Mar 9 19:01:06 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
*** openstack changes topic to " (Meeting topic: infra)" | 19:01 | |
openstack | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link http://lists.opendev.org/pipermail/service-discuss/2021-March/000195.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
*** openstack changes topic to "Announcements (Meeting topic: infra)" | 19:01 | |
clarkb | clarkb out March 23rd, could use a volunteer meeting chair or plan to skip | 19:02 |
clarkb | I'll probably just let this resolve itself. If you see a meeting agenda next week show up to the meeting otherwise skip it :) | 19:02 |
clarkb | er sorry it's 2 weeks from now | 19:03 |
clarkb | I'm getting too excited :) | 19:03 |
fungi | heh | 19:03 |
clarkb | DST change happens for those of us in North America this weekend. EU and others follow in a few weeks. | 19:03 |
fungi | you're in a hurry for northern hemisphere spring i guess | 19:03 |
ianw | :) i am around so can run it | 19:03 |
clarkb | ianw: thanks! | 19:03 |
clarkb | heads up on the DST changes starting soon for many of us. You'll want to update your calendars if you operate in local time | 19:03 |
clarkb | North America is this weekend, then the EU in like two weeks and Australia in 3? something like that | 19:04 |
clarkb | #topic Actions from last meeting | 19:04 |
*** openstack changes topic to "Actions from last meeting (Meeting topic: infra)" | 19:04 | |
clarkb | #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-02-19.01.txt minutes from last meeting | 19:04 |
clarkb | corvus has started the jitsi unfork | 19:04 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/778308 | 19:05 |
clarkb | I don't think we need to re-action that and can track it with the change now. Currently it is failing CI for some reason; I haven't had a chance to look at it yet, though I should try to today | 19:05 |
clarkb | #topic Priority Efforts | 19:05 |
*** openstack changes topic to "Priority Efforts (Meeting topic: infra)" | 19:05 | |
clarkb | #topic OpenDev | 19:05 |
*** openstack changes topic to "OpenDev (Meeting topic: infra)" | 19:05 | |
clarkb | The gerrit account inconsistency work continues. | 19:05 |
clarkb | Since we last recorded the status here I have deleted conflicting external ids from about 35 accounts that were previously inactive. These were considered safe changes because the accounts were already disabled. | 19:06 |
clarkb | I have also put another 70 something through the disabling process in preparation for deleting their external ids (something I'd like to do today or tomorrow if we are comfortable with the list). These were accounts that the audit script identified as having no valid openids, ssh usernames, ssh keys, reviewed changes, or pushed changes | 19:07 |
clarkb | essentially they were never used and cannot be used for anything. I set them inactive on friday so that any use over the last few days would flip an error, but I haven't seen anything related to that. | 19:08 |
clarkb | I'll put together the external id cleanup input file and run that when we are ready | 19:08 |
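For reference, a minimal sketch of what driving that external id cleanup from an input file might look like. This is not OpenDev's actual tooling: it assumes Gerrit's documented REST endpoint POST /a/accounts/{account-id}/external.ids:delete, a hypothetical "account-id external-id-key" per-line input format, and placeholder credentials.

```python
# Hypothetical cleanup driver, NOT the script actually used here.
# Assumes an input file of "account-id external-id-key" pairs and
# Gerrit's REST endpoint for bulk external id deletion.
import collections
import requests

GERRIT = "https://review.example.org"          # placeholder base URL
AUTH = ("admin", "http-password")              # placeholder credentials

to_delete = collections.defaultdict(list)
with open("external-ids-to-delete.txt") as f:  # hypothetical input file
    for line in f:
        account_id, external_id = line.split()
        to_delete[account_id].append(external_id)

for account_id, external_ids in sorted(to_delete.items()):
    # Gerrit expects a JSON list of external id keys to remove.
    resp = requests.post(
        f"{GERRIT}/a/accounts/{account_id}/external.ids:delete",
        json=external_ids,
        auth=AUTH,
    )
    resp.raise_for_status()
    print(f"removed {len(external_ids)} external ids from {account_id}")
```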
fungi | yeah, i don't see any way those accounts could be used in their current state, so should be safe | 19:08 |
fungi | hard to say they were never logged into, but they can't be logged into now anyway | 19:08 |
fungi | possible at least some are lingering remnants of earlier account merges/cleanups | 19:09 |
*** hamalq has quit IRC | 19:09 | |
fungi | but just vestiges now if so | 19:09 |
clarkb | yup | 19:09 |
*** hamalq has joined #opendev-meeting | 19:09 | |
clarkb | Sort of related to this we had a user reach out to service-incident about not being able to log in. This was for an entirely new account though and appears to be the moving-openid-then-email-conflict problem. | 19:09 |
clarkb | they removed their email from moderation themselves but I reached out anyway asking them for info on how they got into that state and offered a couple of options for moving forward (they managed to create a second account with a third non-conflicting email which would work, or we can apply these same retirement and external id cleanups to the original account and have them try again) | 19:10 |
clarkb | will wait and see what they say. I cc'd fungi on the email since fungi would've gotten the moderation notice too. But kept it off of public lists so we can talk about email addrs and details like that | 19:11 |
fungi | not that service-incident is a public list anyway | 19:11 |
fungi | but also not really on-topic | 19:11 |
clarkb | fungi: right, I was thinking we could discuss it on service-discuss if not for all the personal details | 19:12 |
fungi | yep | 19:12 |
clarkb | any other opendev items to discuss? if not we can move on? (I've sort of stalled out on the work to profile gerrit in CI, as I'm prioritizing the account work) | 19:13 |
clarkb | #topic Update Config Management | 19:14 |
*** openstack changes topic to "Update Config Management (Meeting topic: infra)" | 19:14 | |
clarkb | I'm not aware of any items under this heading to talk about, but thought I'd ask before skipping ahead | 19:14 |
fungi | nothing new this week afaik | 19:14 |
clarkb | #topic General Topics | 19:16 |
*** openstack changes topic to "General Topics (Meeting topic: infra)" | 19:16 | |
clarkb | #topic OpenAFS cluster status | 19:16 |
*** openstack changes topic to "OpenAFS cluster status (Meeting topic: infra)" | 19:16 | |
clarkb | ianw: I think this may be all complete now? (aside from making room on dfw01's vicepa?) | 19:16 |
clarkb | we have a third db server and all servers are upgraded to focal now? | 19:17 |
ianw | yeah, i've moved on to the kerberos kdc hosts, which are related | 19:17 |
ianw | i hope to get ansible for that finished very soon; i don't think an in-place upgrade is as important there but probably easiest | 19:17 |
clarkb | good point. Should we drop this topic and track kerberos under the general upgrades heading or would you like to keep this here as a separate item? | 19:18 |
clarkb | also thank you for pushing on this, our openafs cluster should be much happier now and possibly ready for 2.0 whenever that is something to consider | 19:18 |
ianw | i think we can remove it | 19:18 |
clarkb | ok | 19:18 |
ianw | you're right i owe looking at the fedora mirror bits, on the todo list | 19:19 |
clarkb | #topic Borg Backups | 19:19 |
*** openstack changes topic to "Borg Backups (Meeting topic: infra)" | 19:19 | |
clarkb | Last I heard we were going to try manually running the backup for gitea db and see if we could determine why it is sad but only to one target | 19:19 |
clarkb | any new news on that? | 19:19 |
fungi | the errors for gitea01 ceased. not sure if anyone changed anything there? | 19:19 |
ianw | i did not | 19:19 |
clarkb | I did not either | 19:20 |
clarkb | I guess we keep our eyes open for recurrence but can probably drop this topic now too? | 19:20 |
ianw | yep! i think we're done there | 19:21 |
clarkb | great | 19:21 |
clarkb | thank you for working on this as well | 19:21 |
fungi | i just rechecked the root inbox to be sure, no more gitea01 errors | 19:21 |
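For the record, manually exercising a single backup target (as discussed above) can be done with a plain borg invocation along these lines; the repo URL and source path are placeholders, not our actual backup wrapper's configuration:

```python
# Sketch of manually exercising one backup target with borg; repo path,
# host, and source dir are placeholders, not our real backup config.
import subprocess

target = "ssh://borg@backup01.example.org/opt/backups/gitea01"  # placeholder
subprocess.run(
    ["borg", "create", "--stats",
     f"{target}::manual-test-{{now}}",   # borg expands {now} itself
     "/var/backups/gitea-db"],           # placeholder source path
    check=True,
)
```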
clarkb | #topic Puppet replacements and Server upgrades | 19:21 |
*** openstack changes topic to "Puppet replacements and Server upgrades (Meeting topic: infra)" | 19:21 | |
fungi | though the ticket about fedora 33 being unable to reach emergency consoles seems to be waiting for an update from us | 19:22 |
clarkb | I've rotated all the zuul-executors at this point. That means zuul-mergers and executors are done. Next on my list was nodepool launchers | 19:22 |
clarkb | I think these are going to be a bit more involved since we need to keep the old launcher from interfering with the new one. One idea I had was to land a change that sets max-servers: 0 on the old host and max-servers: valid-value on the new server, then remove the old server when it stops managing any hosts | 19:23 |
clarkb | corvus wasn't sure if that would be safe (it sounds like maybe it would be) | 19:23 |
clarkb | not sure if we want to find out the hard way or do a more careful disable old server, wait for it to idle, start new server setup | 19:23 |
clarkb | the downside with the careful approach is we'll drop our node count by the number of nodes in that provider in the interim | 19:23 |
ianw | if anyone would know, corvus would :) | 19:24 |
ianw | it doesn't seem like turning one to zero would communicate anything to the other | 19:24 |
clarkb | I've got some changes to rebase and clean up anyway related to this so I'll look at it a bit more and see if I can convince myself the first idea is safe | 19:24 |
ianw | it feels like the other one would just see more resources available | 19:24 |
clarkb | ianw: I think the concern is that they may see each other as leaking nodes within the provider | 19:24 |
clarkb | also possibly the max-servers 0 instance may reject node requests for the provider since it has no quota. Not sure how the node request rejections work though, and whether they would be unique enough to avoid that problem. | 19:25 |
clarkb | if the node requests are handled with host+provider unique info we would be ok. I can check on that | 19:25 |
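To make the max-servers: 0 idea concrete, here is a toy sketch of the quota check a launcher effectively applies per provider pool. It is a simplification, not nodepool's real handler code (the names are illustrative): a pool capped at 0 can never fit a request, so that launcher declines it, and a request only fails outright once every registered launcher has declined it.

```python
# Toy model of the concern above; not nodepool's actual code.
def pool_can_fit(requested_nodes: int, running: int, max_servers: int) -> bool:
    """A pool can accept a request only if it has remaining quota."""
    return running + requested_nodes <= max_servers

# Old launcher after the proposed change: max-servers 0 declines everything.
assert not pool_can_fit(2, running=0, max_servers=0)
# New launcher with the provider's real limit picks the request up instead.
assert pool_can_fit(2, running=10, max_servers=40)
```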
clarkb | That was all I had on this though | 19:26 |
clarkb | ianw: anything to add re kerberos servers? | 19:26 |
ianw | no, wip, but i think i have a handle on it after a crash course on kerberos configuration files :) | 19:27 |
clarkb | let us know when you want reviews | 19:28 |
clarkb | #topic Deploy new refstack | 19:28 |
*** openstack changes topic to "Deploy new refstack (Meeting topic: infra)" | 19:28 | |
clarkb | kopecmartin: ianw: any luck sorting out the api situation (and wsgi etc)? | 19:28 |
kopecmartin | this is ready: https://review.opendev.org/c/opendev/system-config/+/776292 | 19:28 |
kopecmartin | i wrote a comment as well so that we know why the vhost was edited the way it is | 19:29 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/776292 refstack api url change | 19:29 |
clarkb | thanks I've got that on my list for review now | 19:29 |
kopecmartin | great, thanks | 19:29 |
kopecmartin | i tested it and it seems ok | 19:29 |
clarkb | I guess we land that then retest the new server and take it from there? | 19:29 |
kopecmartin | so I'd say it can go to production | 19:29 |
clarkb | the comment helps, thank you for that | 19:30 |
clarkb | anything else related to refstack? | 19:30 |
ianw | ok, it seems we don't quite understand what is going on, but i doubt any of us have a lot of effort to put into it if it seems to work | 19:30 |
kopecmartin | yeah, i'm out of time on this unfortunately | 19:30 |
kopecmartin | nothing else from my side | 19:31 |
clarkb | kopecmartin: ok, left a quick note for something I noticed on that change | 19:31 |
clarkb | if we update that I expect we can land it | 19:31 |
clarkb | #topic Bridge Disk Space | 19:32 |
*** openstack changes topic to "Bridge Disk Space (Meeting topic: infra)" | 19:32 | |
clarkb | the major consumer of disk here got tracked down to stevedore (thank you ianw and mordred and frickler) writing out entrypoint cache files | 19:32 |
clarkb | latest stevedore avoids writing those caches when it can detect it is running under ansible | 19:33 |
clarkb | ianw wrote some changes to upgrade stevedore since the major problem was ansible related, but also wrote out a disable file to the cache dir to avoid other uses polluting the dir | 19:33 |
clarkb | ianw: it also looks like you cleaned up the stale cache files. | 19:33 |
clarkb | Anything else to bring up on this? I think we can consider this a solved problem | 19:33 |
ianw | yep i checked the deployment of stevedore and cleaned up those files, so ++ | 19:34 |
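For anyone reproducing the mitigation described above, a minimal sketch follows. It assumes stevedore's per-user cache lives under ~/.cache/python-entrypoints and that a marker file named .disable suppresses future cache writes; both the path and the marker filename here are assumptions, so double-check them against the stevedore release notes for your version.

```python
# Sketch of the cache mitigation described above; the path and marker
# filename are assumptions, verify against your stevedore version.
import pathlib

cache_dir = pathlib.Path.home() / ".cache" / "python-entrypoints"
cache_dir.mkdir(parents=True, exist_ok=True)

# Marker file so stevedore skips writing new entrypoint caches.
(cache_dir / ".disable").touch()

# Remove any stale per-environment cache files already on disk.
for stale in cache_dir.glob("*.json"):
    stale.unlink()
```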
clarkb | #topic PTG Prep | 19:35 |
*** openstack changes topic to "PTG Prep (Meeting topic: infra)" | 19:35 | |
clarkb | The next PTG dates have been announced as April 19-23. We have been asked to fill out a survey by March 25 to indicate whether we wish to participate | 19:35 |
clarkb | I'd be interested in hearing from others if they think this will be valuable or not. | 19:36 |
clarkb | The last PTG was full of distractions and a lot of other stuff going on and it felt less useful to me. But I'm not sure if that was due to circumstances or if this smaller group just doesn't need as much synchronous time | 19:36 |
clarkb | curious to hear what others think. I'm happy to organize time for us if we want to participate, just let me know | 19:37 |
clarkb | maybe think about it and we can bring it up again in next week's meeting and take it from there | 19:38 |
clarkb | #topic Open Discussion | 19:38 |
*** openstack changes topic to "Open Discussion (Meeting topic: infra)" | 19:38 | |
clarkb | That was all I had on the agenda, anything else? | 19:38 |
ianw | i haven't got to the new review server setup although the server is started | 19:39 |
ianw | i did wonder if maybe we should run some performance things on it now | 19:40 |
clarkb | we might also want to consider if we need a bigger instance? | 19:40 |
ianw | i know the old server has hot-migrated itself around, but i wonder if maybe the new server is in a new rack or something and might be faster? | 19:40 |
clarkb | but ya some performance sanity checks make sense to me. In addition to cpu checks, disk io checking (against the ssd volume?) might be good | 19:41 |
ianw | i'm not sure how it works on the backend. perhaps there's the "openstack rack" in a corner and that's what we're always on :) | 19:41 |
ianw | i think the next size up was 96gb | 19:41 |
clarkb | ianw: ya i don't know either. That said the notedb migration was much quicker on review-test than it ended up being on prod | 19:41 |
clarkb | it is possible that the newer hosts gain some benefit somewhere based on ^ | 19:41 |
clarkb | too many variables involved to say for sure though | 19:42 |
ianw | performance2-90 | 90 GB Performance | 92160 | 40 | 900 | 24 | N/A | 19:42 |
clarkb | ianw: it is probably worth considering ^ since we're tight on memory right now and one of my theories is that the lack of space for the kernel to cache things may be impacting io | 19:43 |
clarkb | (of course I haven't been able to test that in any reasonable way) | 19:43 |
ianw | we also have an onMetal flavour | 19:44 |
clarkb | I think we had been asked to not onmetal at some point | 19:44 |
ianw | ok, the medium flavor looks about right with 64gb and 24vcpu; but yeah, we may not actually have any quota | 19:45 |
clarkb | we do have a couple of services we have talked about turning off like pbx and mqtt | 19:46 |
clarkb | and if we go more radical, dropping elasticsearch would free up massive resources. At this point the trade-off between a better gerrit and no elasticsearch may be worthwhile | 19:46 |
ianw | i'd have to check but i think we'd be very tight to add 30gb ATM | 19:46 |
clarkb | definitely something to consider, I don't really want to hold your work up overthinking it though | 19:47 |
ianw | at this point i need to get back to pulling apart where we've mixed openstack.org/opendev.org in ansible so we have a clear path for adding the new server anyway | 19:48 |
clarkb | ok, fungi frickler corvus ^ maybe you can think about that a bit and we can decide if we should go bigger (and make necessary sacrifices if so) | 19:48 |
frickler | going bigger would mean changing IPs, right? would it be an option to move to vexxhost then? | 19:49 |
ianw | frickler: either way we've been looking at a new host and changing ip's | 19:49 |
clarkb | frickler: that may also be something to consider especially if we want to fine tune sizing | 19:50 |
corvus | o/ | 19:50 |
clarkb | my concern with a change like that would be the disk io (we can test it to ensure some confidence in it though). We'd also want to talk to mnaser and see if that is reasonable | 19:50 |
frickler | iirc mnaser has nice fast amd cpus | 19:50 |
clarkb | frickler: yup, but then all the disk is ceph and I'm not sure how that compares to $ssd gerrit cinder volume we have currently | 19:51 |
clarkb | it may be great it may not be, something to check | 19:51 |
frickler | sure | 19:51 |
mnaser | clarkb / frickler: our ceph is all nvme/ssd backed | 19:51 |
mnaser | and we also have local (but unreliable) storage available | 19:51 |
mnaser | depending on your timeline, we're rolling out access to baremetal systems | 19:52 |
mnaser | so that might be an interesting option too | 19:52 |
frickler | depending on your timeline, we might consider waiting for that ;) | 19:52 |
corvus | i like the idea of increasing gerrit size; i also like the idea of moving it to vexx if mnaser is ok; | 19:52 |
clarkb | mnaser: is that something you might be interested in hosting on vexxhost? we're thinking that a bigger server will probably help with some of the performance issues. In particular we allocate a ton of memory to the jvm and that impacts the kernel's ability to cache at its level | 19:52 |
clarkb | mnaser: the current server is 60GB ram + 16 vcpu and we'd probably want to bump up both of those axes if possible | 19:53 |
mnaser | hm | 19:53 |
mnaser | so, we're 'recycling' our old compute nodes to make them available as baremetal instances | 19:53 |
ianw | (plus a 256gb attached volume for /home/gerrit2) | 19:54 |
mnaser | so you'd have 2x 240G for OS (RAID-1), 2x 960G disks (for whatever you want to use them, including raid), 384gb memory, but the cpus aren't the newest, but.. | 19:54 |
mnaser | its not vcpus | 19:54 |
clarkb | part of me likes the simplicity of VMs. If they are on a failing host they get migrated somewhere else | 19:55 |
corvus | baremetal sounds good assuming remote disk and floating ips to cope with hardware failure; with our issues changing ips, i wouldn't want to need to do an ip change to address a failure | 19:55 |
clarkb | but there is a performance impact | 19:55 |
clarkb | corvus: thats a better way of describing what I'm saying I think | 19:55 |
mnaser | 40 thread cpu systems, but yeah | 19:55 |
corvus | clarkb: yeah, my preference is still fully virtualized until we hit that performance wall :) | 19:55 |
mnaser | virtual works too, our cpu to mem ratio is 4 so | 19:56 |
mnaser | for every 1 vcpu => 4 gb of memory | 19:56 |
mnaser | 32vcpus => 128gb memory | 19:56 |
clarkb | mnaser: is 96GB and 24 vcpu a possibility? | 19:56 |
clarkb | (I haven't looked at flavors lately) | 19:56 |
mnaser | i think there is a flavor with that size i believe | 19:56 |
mnaser | if not we can make it happen as long as it fits the ratio | 19:57 |
clarkb | I suspect that sort of bump may be reasonable given the current situation on 60 + 16 | 19:57 |
clarkb | we wouldn't really increase jvm heap allocation from 48gb, we'd just let the kernel participate in file caching | 19:57 |
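The headroom arithmetic behind that, using only the figures quoted above (heap held at 48 GB):

```latex
% Memory left for the kernel page cache, JVM heap fixed at 48 GB:
\text{current:}\quad 60\,\text{GB} - 48\,\text{GB} \approx 12\,\text{GB}
\qquad
\text{proposed:}\quad 96\,\text{GB} - 48\,\text{GB} \approx 48\,\text{GB}
```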
mnaser | also, i'd advise against using a floating ip (so traffic is not hairpinned) but instead attach publicly directly -- you can keep the port and reuse it if you need to | 19:57 |
corvus | one thing to consider if we move gerrit to vexxhost is there will likely be considerable network traffic between it and zuul; probably not a big deal, but right now all of gerrit+zuul is in one data center | 19:58 |
corvus | mnaser: ++ | 19:58 |
clarkb | mnaser: that looks like: create a port with an ip in neutron (but not a floating ip), then when we openstack server create or similar, pass the port in for network info? | 19:58 |
mnaser | admins can create ports with any ips | 19:59 |
mnaser | so can help with that | 19:59 |
clarkb | mnaser: gotcha | 19:59 |
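A minimal openstacksdk sketch of the direct-port approach mnaser describes. The cloud, network, image, flavor, and server names below are all placeholders, and per the note above, creating a port with a specific public IP may require admin help:

```python
# Sketch only: boot a server attached to a pre-created neutron port so
# the IP survives rebuilds. Cloud, network, image, flavor, and server
# names below are placeholders, not our real configuration.
import openstack

conn = openstack.connect(cloud="vexxhost")        # clouds.yaml entry

public_net = conn.network.find_network("public")  # placeholder network name
port = conn.network.create_port(network_id=public_net.id)

server = conn.compute.create_server(
    name="review-new",                                        # placeholder
    image_id=conn.compute.find_image("Ubuntu 20.04").id,      # placeholder
    flavor_id=conn.compute.find_flavor("v2-highcpu-96").id,   # placeholder
    networks=[{"port": port.id}],
)
conn.compute.wait_for_server(server)
# Because the port exists independently of the server, it (and its IP)
# can be reattached to a replacement server later, which is what avoids
# hairpinned floating-ip traffic.
```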
clarkb | we are just about at time. It sounds like mnaser isn't opposed to the idea. In addition to untangling opendev vs openstack maybe the next step here is to decide what an instance in vexxhost should look like and discuss those specifics with mnaser? | 19:59 |
mnaser | +1, also recommend going to mtl for this one | 20:00 |
clarkb | then we can spin that up and do some perf testing to make sure we aren't missing something important and take it from there | 20:00 |
clarkb | I'll go ahead and end the meeting now so that we can have lunch/dinner/breakfast | 20:00 |
clarkb | #endmeeting | 20:00 |
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev" | 20:00 | |
openstack | Meeting ended Tue Mar 9 20:00:40 2021 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-09-19.01.html | 20:00 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-09-19.01.txt | 20:00 |
openstack | Log: http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-09-19.01.log.html | 20:00 |
clarkb | thanks everyone and feel free to continue conversations in #opendev or on the mailing list | 20:00 |
corvus | mnaser: aww, i was hoping for sjc ;) | 20:01 |
fungi | thanks clarkb! | 20:02 |
clarkb | corvus: the worst thing is when silly ISPs send you halfway across the country to hit local resources due to peering and route costs | 20:04 |
clarkb | corvus: up here it's really common to go to at least seattle before returning to oregon | 20:04 |
ianw | clarkb: my brother-in-law lives in what could only be termed the middle of nowhere. he's signed up for starlink ... going to be very interested if he ends up with better internet than me in a suburb of a major city | 20:15 |
fungi | my folks did the early signup too. same deal. their only current "broadband" option is slow and often dead at&t adsl | 20:16 |
fungi | though they're in a tight valley, i warned them that it may be a while before there's a satellite which isn't behind a mountain for them | 20:17 |
clarkb | ianw: eventually you should be able to get starlink too? though that may be a long way away | 20:27 |
*** irclogbot_3 has quit IRC | 20:31 | |
*** irclogbot_1 has joined #opendev-meeting | 20:32 | |
*** sboyron has quit IRC | 20:38 | |
fungi | in more ways than one | 20:38 |
*** hashar has quit IRC | 22:53 |