Friday, 2020-04-24

*** diablo_rojo has quit IRC07:45
clarkbfungi maybe move the mailman discussion here. I think the IncomingRunner is the runner that processes our "GLOBAL_PIPELINE" config16:05
clarkbso it could be emails causing it to balloon16:05
clarkbfungi: around the time of the memory balloon we have a lot of errors like Apr 24 11:28:33 2020 (4985) listinfo: No such list "openstack-dev": in the openstack vhost error log16:07
clarkblooking in the exim log around that same time period I've grepped for mailman_router and I don't see anything headed to openstack-dev there16:09
clarkbI don't think those errors are generated by external email as a result. But maybe we've got something still referencing openstack-dev internally and it not existing causes us to use more memory than we should?16:10
clarkb(I'16:10
clarkber16:10
clarkb(I'm mostly just pulling on the random thread I've found; don't know how valid this is)16:11
clarkbaha mailman docs tell me listinfo: likely originates from the web ui16:12
clarkbthat explains why exim is unaware16:12
clarkbyup I've generated that error by visiting the listinfo page for openstack-dev16:13
clarkband it didn't drastically change memory use. I think that is the thread fully pulled out16:14
clarkbhttps://mail.python.org/pipermail/mailman-users/2012-November/074397.html16:16
clarkbskimming the error and vette logs for all vhosts the only one active with anything out of the ordinary is openstack and that was a message that was 6kb too large16:26
clarkbcorvus: I seem to recall at one point we had swap issues due to a kernel bug on executors? where they thought they were running out of memory but they had plenty of cache and buffers that could be evicted16:32
clarkbcorvus: do you recall what the outcome of that was? looking at the cacti graphs here it almost seems like it may be similar (granted we are using memory unexpectedly too)16:33
clarkbhrm I take that back I think we may be down to ~50MB of buffers and cache at the point OOMkiller is invoked16:34
fungiokay, moving here16:41
corvusclarkb: not off hand, but it looks like based on your latest message that may not be important?16:42
clarkbcorvus: ya I don't think it is important anymore16:42
fungiright, cacti showed something spiked active memory and exhausted it, so not buffers/cache16:43
clarkbcorvus: I was off by an order of magnitude when looking at memory use16:43
fungiclarkb: also it looks like these spikes happen around 1000-1300z daily, so could be some scheduled thing i suppose16:44
fungimaybe digests going out?16:44
clarkbaren't those monthly though?16:44
clarkb(I don't digest so not sure how frequently they can be configured)16:44
corvusdaily is possible16:45
fungii think there's a periodic wakeup which sends them so long as the length is over a certain number of messages, but also no fewer than some timeframe16:45
corvus(in fact, i think daily is typical for digest users)16:45
clarkbah ok16:45
clarkbmaybe we should try and spread the timing of those out over different vhosts (even if it isn't the problem, that may be a healthy thing?)16:46
clarkb(I'm not seeing where that might be configurable in the config though)16:47
clarkbfungi: other thoughts, this server notes it could be restarted for updates. Might be worthwhile starting there just to rule out any bugs in other systems?16:47
clarkbbut otherwise I'm thinking restart dstat with cpu and memory consumption reporting for processes and see if that adds any new useful info16:49
fungiyeah, can't hurt, it was last rebooted over a year ago16:50
clarkbI think digests are configured via cron to go out at 12:00 UTC16:52
clarkbwhich is slightly before we saw the OOM in this case16:52
clarkbthe thing prior to that runs at 9:00 UTC and is the disabled account reminder send16:52
clarkbgiven that I don't think it's digests, but dstat should help us further rule that in or out16:54
fungiand /etc/cron.d/mailman-senddigest-sites seems to fire each of the sites after 12:00 as well16:54
funginothing in /etc/cron.d/mailman really correlates either16:55
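For reference, stock Mailman 2.x ships a senddigests cron job that typically fires once a day at 12:00, which lines up with the time clarkb found; a generic example of such an entry follows (paths are assumptions from a typical Debian-style install, not the contents of this host's /etc/cron.d/mailman-senddigest-sites):

    # Hedged example of a stock Mailman 2.x digest cron entry; path assumed.
    0 12 * * * list /usr/lib/mailman/cron/senddigests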
clarkbmy hunch continues to be it's some email that makes the IncomingRouter's pipeline modules unhappy16:56
clarkbor a pipeline module16:56
clarkband we are seeing the openstack IncomingRunner (not router) in a half state due to that right now16:57
corvusi have a vague memory we started logging emails at some point to try to catch something; is that still in place?17:02
fungilogging e-mail content? i believe i remember doing something like that at one point but only for a specific list17:03
fungiand technically we have logs of what messages arrived and were accepted courtesy of exim, though if the message winds up not getting processed because some fork handling it is killed in an oom, we might not have record of the message content17:04
fungior it may still be sitting in one of the incoming dirs or something17:04
fungibut yeah, if memory serves we were recording posts for the openstack-discuss (or maybe it was long enough ago to be openstack-dev?) ml to have unadulterated copies for comparison while investigating dmarc validation failures17:07
clarkbI'm making a top-mem plugin that shows pid17:07
clarkbcurrently trying to sort out output width formatting :)17:08
fungiand yeah, i agree the stuff i saw dstat logging didn't seem particularly more useful than what cacti is trending, just better granularity and more reliable recording17:15
fungibut i'd never looked at what devstack's generating so wasn't really clear what we should expect out of dstat17:15
fungii assumed it was basically like a fancy version of sar17:15
clarkbfungi: ~clarkb/.dstat/dstat_top_mem_adv.py is a working modification to the shipped top mem plugin to show pid17:16
clarkbfungi: if you put it in the dstat process owner's homedir at .dstat/dstat_top_mem_adv.py you can use it with --top-mem-adv flag17:16
clarkbfungi: ya the thing it does that is useful for us is tracking memory use up to the OOM I think17:17
clarkbwhereas with OOM output you only get that once oomkiller has run17:17
clarkb(and cacti isn't really tracking with that same granularity)17:18
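As a rough illustration of what the tweaked plugin reports (this is not dstat's actual plugin API; the real dstat_top_mem_adv.py mentioned above is a copy of the shipped top-mem plugin with the pid added), a minimal standalone sketch that finds the largest-RSS process by scanning /proc:

    # Illustrative sketch only: report the biggest resident-set process with its PID.
    import os

    def top_mem_process():
        best = (0, None, None)  # (rss_kib, pid, name)
        for pid in filter(str.isdigit, os.listdir('/proc')):
            try:
                with open('/proc/%s/status' % pid) as f:
                    fields = dict(line.split(':', 1) for line in f if ':' in line)
                rss = int(fields.get('VmRSS', '0 kB').split()[0])
                name = fields.get('Name', '?').strip()
            except (EnvironmentError, ValueError):
                continue  # process exited or entry unreadable
            if rss > best[0]:
                best = (rss, pid, name)
        return best

    if __name__ == '__main__':
        rss, pid, name = top_mem_process()
        print('%s (pid %s) using %d KiB resident' % (name, pid, rss))

The value is exactly what clarkb describes below: a per-sample record of which process (and which PID) is growing, captured continuously up to the point the OOM killer fires rather than only after it runs.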
fungiclarkb: like the command i have staged in the root screen session? i copied that file into place and added --top-mem-adv to the previous command, after copying the old dstat-csv.log out of the way17:19
clarkbfungi: I don't think its compatible with csv output /me looks at root screen17:19
clarkbfungi: ya that looks right. lets try it and see what csv looks like?17:20
clarkbif it fails it fails and we figure out redirecting the table output to file instead17:20
fungidstat: option --top-mem-adv not recognized17:27
clarkbfungi: where did you copy the file?17:27
clarkbit should be ~root/.dstat/dstat_top_mem_adv.py17:28
fungid'oh, i did it into my homedir. scattered today17:28
fungisudo doesn't re-resolve ~ when your shell is already expanding it :/17:28
fungithat's better :)17:29
fungithanks17:29
clarkbfungi: ok its not doing what I expect and I know why17:29
clarkbfungi: let me hack on that plugin a bit more17:29
fungisure17:31
clarkbfungi: ok recopy it from my homedir and restart it. It should have the pid, process name, and memory use now17:32
clarkbthe existing thing only has process name and memory use (no pid) because the csv and tabular output are different parts of the plugin17:32
clarkband hopefully that gives us better clues. Also do we want to reboot ?17:32
fungiclarkb: it's running again with the plugin updated17:32
clarkbfungi: yup that seems to be happy it has the info about memory in the csv17:33
fungilooks like it does make it into the csv, yep17:33
fungishould i also be running it with --swap?17:34
clarkbfwiw making plugins like that seems pretty straightforward17:34
clarkbwe could probably write a mailman specific plugin to check queue lengths and such pretty easily17:34
fungiyeah, looks like you just copied one and tweaked a few lines?17:34
clarkbfungi: yup I copied top-mem and edited it to output pid in addtion to the other stuff17:35
fungicool17:35
clarkbjust thinking out loud here that maybe if we find it is mailman that is exploding in size the next step is instrumenting mailman and dstat may help us do that too via plugins17:35
fungiyep17:35
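A sketch of that queue-length idea, assuming a typical Mailman 2 layout (the qfiles path is an assumption, and again this is plain Python rather than dstat's plugin API):

    # Count queued entries per Mailman 2 qrunner queue; directory layout assumed.
    import os

    QUEUE_DIR = '/var/lib/mailman/qfiles'

    for queue in sorted(os.listdir(QUEUE_DIR)):
        path = os.path.join(QUEUE_DIR, queue)
        if os.path.isdir(path):
            print('%-10s %d' % (queue, len(os.listdir(path))))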
clarkbfwiw the vette log for mailman does report when it gets really large emails. And we only have a 46kb email showing up around that time17:36
clarkbI suppose some list could have a really high message size set by admins and someone is sending really large emails?17:36
clarkbwe wouldn't see that in vette because it wouldn't error?17:36
clarkbthe existing plugins are in /usr/share/dstat if you want to see them17:36
fungishould i also be running it with --swap? how about --top-cpu-adv and --top-io-adv?17:37
clarkbfungi: ya maybe we should just add in as much info as possible17:37
fungithose are the other ones the foreground dstat uses in devstack17:37
clarkbthe cpu and io info could be useful if its also processing some email furiously while expanding in size17:37
clarkbwe'd probably see that info there and it would help us understand the behavior better17:38
fungiKeyError: 'i/o process'17:38
fungihrm17:38
fungimaybe raised at dstat_top_io_adv.py +7517:38
clarkbfungi: it only does that when run with --output17:39
clarkbI think the plugin doesn't support csv output. Maybe just drop that on17:39
clarkb*that one17:39
fungii've left out --top-io-adv for now17:39
fungiyeah, that's what i figured too17:39
clarkband that key is definitely unused elsewhere in the plugin and not set anywhere so bug in dstat17:41
fungimaybe that's why devstack only uses it in the foreground process17:41
clarkbfungi: other than watching it to see what happens I think other things we can do are: reboot, move digest cron time, log all incoming email sizes (or complexity if we can do that somehow)17:42
clarkbcorvus: ^ you might know how to make exim logging richer? I don't see message sizes but maybe we can start there?17:43
clarkboh unless is S=\d+ a size?17:44
corvusclarkb: it is17:45
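Exim's main log records each accepted message on a "<=" arrival line that already includes an S=<bytes> field, so the sizes are there to mine; a quick sketch (the log path and default log format are assumptions based on Debian's stock exim4 layout):

    # Print arrival timestamp, sender, and size in bytes for each accepted message.
    import re

    arrival = re.compile(r'^(\S+ \S+) \S+ <= (\S+) .* S=(\d+)')
    with open('/var/log/exim4/mainlog') as log:
        for line in log:
            m = arrival.match(line)
            if m:
                print(m.group(1), m.group(2), m.group(3))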
corvusfungi: yes! the dmarc debug thing is what i was thinking of17:45
corvusis that still in place?17:46
fungichecking...17:46
corvusfungi: yes /var/mail/openstack-discuss17:47
corvuswe probably all have that content anyway, but that's the raw incoming messages to that list if we need them17:47
clarkbooh more debugging tools17:48
fungicorvus: looks like we used a mailman_copy router for that one lisyt17:48
fungilist17:48
clarkbfwiw we are looking for something around 11:28UTC today that triggered the memory increase17:48
clarkb(also how funny would it be if that is the source of the issue :P)17:48
clarkb(I mean that file is quite large and if it has to be opened to write to...)17:49
fungiif so it's something getting sent to the ml at around the same time every day, which would be weird17:49
fungior opening that file around the same time every day17:50
clarkbfungi: it does make me wonder if that is why the incoming runner for openstack is 800MB in resident size right now17:50
corvusclarkb: that's written by exim, mm doesn't know about that file17:50
clarkbcorvus: ah thanks17:50
fungialso exim is appending the file, so probably just does a seek based on file length?17:52
clarkbok I think  Imay have found something weird18:18
clarkbSubject: Re: [openstack-dev] [Tacker] <- that thread received a message at 11:22UTC according to exim18:19
clarkbthat message isn't in the openstack-discuss mailman archive18:19
clarkbit's possible that is just fallout but may also indicate a trigger?18:19
fungiit may just be in the moderation queue, i haven't checked held messages yet today18:20
fungii'll go through them now18:20
clarkbI also notice that at 01:04UTC april 23 we get a message with a fairly large zip attachment. It's clearly spam, but not sure if it ever gets to mailman or not (could be mailman has a sad dealing with attachments like that?)18:23
fungiclarkb: yeah, there were several posts to that thread from a non-subscriber i'm approving momentarily (once i get through skimming the remaining spam)18:28
fungilargest is 183kb, which is excessive but not significantly so18:29
clarkbfungi: ya it comes in right before we have issues though. Also it has a bunch of utf8 in it18:29
clarkbits got a few things that make it stand out but not necessarily point to it being the issue18:30
fungiokay, all caught up on the 10 lists i moderate, nothing really jumped out at me as unusual18:31
fungithose missing messages should show up soon18:32
clarkbya I see them now18:32
clarkbfungi: do you think the zip spam yesterday is anything to be concerned about?18:33
clarkbI guess at this point its probably better to stop looking without direction and wait for more dstat info18:33
clarkbmy rtt to the AP just got really bad. Need a local network reset I think18:35
clarkbfungi: I guess we might also consider adding swap to the host as a stop gap?18:35
fungiwe get tons of spam with attachments, i doubt exim or mailman attempt to expand them18:36
clarkbmordred: did you add the zuul user to the hosts?20:01
*** yoctozepto has joined #opendev-meeting20:01
fungi/etc/passwd on zm02 was last modified 25 minutes ago20:02
clarkbin any case there is no /home/zuul/.ssh/id_rsa*20:02
mordredso - we have a uid/gid conflict20:02
mordredplaybooks/group_vars/zuul.yaml:zuul_user_id: 1000120:02
clarkbI think what happened was you added a zuul user and that changed the zuul user's homedir20:02
clarkband now the homedir has no ssh key20:02
clarkbthis is why we added a zuulcd user20:02
mordredno - wait20:02
clarkb/var/lib/zuul/ssh/id_rsa is where the key is I Think20:03
mordredwe did not add the zuul user with the additional_users mechanism we used on bridge20:03
mordredwe added the zuul user in playbooks/roles/zuul/tasks/main.yaml explicitly20:03
mordredand did, in fact, set the home dir to /home/zuul20:03
clarkbah I see20:04
mordredit seems that was a mistake20:04
fungithe zuul group seems to have maybe been changed from 3000 to 10001 though20:04
AJaegerlooks like we have the same problem now also on https://review.opendev.org/72230920:04
mordredand we should set it to /var/lib/zuul ?20:04
clarkbmordred: I think /var/lib/zuul is what it was20:04
mordredshould that be zuul's home on all of the machines?20:04
clarkbbut I'm going to check puppet now20:04
mordredthanks20:04
fungiand yeah, there's stuff in /var/lib/zuul and /home/zuul group-owned by an anonymous gid 300020:05
clarkboh no it was /home/zuul before20:05
clarkbI think its just that the uids changed?20:05
fungiuid still seems the same as it was20:05
clarkbso the issue is adding a new uid?20:05
clarkbwhich orphaned the perms on those files?20:05
fungii don't see a new uid, just gid20:05
mordredthe ansible thinks it wants to set the uid and gid to 1000120:06
fungilooks like the old zuul group was probably 3000 but has been replaced by group 1000120:06
clarkb-r-------- 1 3000 10001 1675 Apr 24 19:40 id_rsa20:06
clarkbit can't use the ssh key because it's owned by the old uid20:06
fungipossible paramiko/openssh don't like incorrect group ownership, yeah20:06
clarkbgroup doesn't matter there20:06
clarkbits 40020:06
clarkbbut note the uid is 300020:07
fungiyou keep saying the old uid, but the old uid and new uid still seem to be 300020:07
fungii only see a change in the gid20:07
clarkboh we didn't change the uid?20:07
funginot that i have seen evidence of yet20:07
clarkbfungi: I was going off of "mordred | the ansible thinks it wants to set the uid and gid to 10001"20:07
mordredthat's what we have set in ansible20:07
fungithe zuul user in /etc/passwd is still 300020:08
fungion ze02 anyway20:08
mordredso - we told ansible to set uid and gid to 10001 but that only worked for group it seems20:08
corvususermod refuses to change the uid of a user if it's running processes20:09
corvus(that task should have failed)20:09
mordredok - so maybe let's change ansible back to 3000 for both of these?20:09
mordredI thought 10001 was the correct value, but clearly it wasn't20:09
clarkbwhy would ssh fail to read the file if it's got the correct perms? maybe the process is running as a different uid?20:09
*** gouthamr has joined #opendev-meeting20:09
corvusit is the correct value for the container20:09
mordredcorvus: yeah - I thought we picked the value for the container to match opendev's prod20:10
clarkboh20:10
mordredbut clearly that's not correct20:10
fungiclarkb: openssh and paramiko are both paranoid about permissions and ownership on key files, so it could just be that it doesn't like the random group id even though that group lacks read perms20:10
fungianyway, this is at least *a* problem, it may not be the only problem20:11
clarkbfwiw there are no containers running20:12
clarkbso we aren't having a mismatch between container and host uids20:12
corvusare we sure the key didn't change?20:12
corvusrunning this as zuul: ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 review.opendev.org -vvv gerrit ls-projects20:12
fungicorvus: i'm not sure, no20:12
corvusthat doesn't complain about perms at all20:13
fungi/var/lib/zuul/ssh/id_rsa was modified 35 minutes ago20:13
fungiso may also have been overwritten with incorrect content20:13
clarkbthe error complains about publickey too20:13
corvusthat seems like the theory we should work to confirm or discount right now20:13
mordredclarkb: we set the ansible values to match the container values so that when we started the containers all of the permissions would match. we did this thinking that we'd picked the container values to match the production opendev values. that is why there is a gid numeric mismatch20:14
clarkbdrwxr-xr-x  5 zuul             3000 4096 Apr 24 20:10 zuul20:14
mordredbut clearly those were not actually our production values20:14
clarkbis it possibly looking in /home/zuul/.ssh/known_hosts to verify host keys (or similar) and finding the homedir is owned by the wrong group and bailing out?20:15
fungi`sudo -u zuul -- ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 review.opendev.org` gives me "Permission denied (publickey)."20:15
mordredso I'd say one thing we should do - that may have _zilch_ to do with this issue - is that we should reset the ansible uid and gid values to 3000, since that's what the files are on disk - and then add a user: 3000 to the docker-compose files so that when we start docker things will also work20:15
corvusmordred: i disagree, but let's talk about that later20:15
corvusi suspect we're rejecting every new patchset now20:16
*** tristanC has joined #opendev-meeting20:16
corvusin fact20:16
corvuswe may want to see if we can pause the mergers and executors20:16
corvusthat may reduce the carnage20:16
clarkbcorvus: should be able to just stop them entirely?20:16
clarkbthen the scheduler will maintain the state?20:16
corvusclarkb: yeah, if we are okay throwing away the jobs20:16
corvusdo we have executor pause running yet?20:16
clarkbI don't think we've restarted zuul any time recently20:17
fungistatus notice zuul is reporting new patches in merge conflict erroneously due to a configuration error, fix in progress20:17
fungishould we send something like that? ^20:17
corvusfungi: sounds good; clarkb you stop all mergers, i'll do something to the executors20:17
clarkbok stopping mergers now20:18
gouthamr+1, helps me stop rechecking :)20:18
clarkbsystemctl stop zuul-merger doesn't seem to be working fwiw20:18
clarkbdid we remove the unit?20:18
fungi#status notice The Zuul project gating service is reporting new patches in merge conflict erroneously due to a configuration error, fix in progress20:18
openstackstatusfungi: sending notice20:18
clarkboh there it goes, maybe just a delay20:18
mordreddon't think so - that was going to be cleanup20:18
clarkbok doing the other 7 now20:18
-openstackstatus- NOTICE: The Zuul project gating service is reporting new patches in merge conflict erroneously due to a configuration error, fix in progress20:19
clarkbzm01-08 should all be stopped or stopping now20:20
fungido the mergers need to be stopped as well?20:20
fungi(the independent mergers i mean)20:20
clarkbfungi: that is what I stopped20:20
mordredfungi: clark has stopped the mergers - corvus is working on executors20:20
corvusall the executors should be paused, meaning they aren't accepting new jobs, but a lot of them are in retry loops for git checkouts20:20
fungioh, right, i misread20:20
corvusso they are still logging errors20:21
corvusbut i don't think stopping them would produce a different outcome20:21
mordred++20:21
clarkbthe operating theory is that we've changed the private key so we no longer authenticate against the public key in gerrit?20:21
mordredclarkb: corvus used the key to talk to gerrit and it worked20:21
corvusmordred: no it failed20:21
fungiclarkb: that seems likely if my reproducer quoted above is the correct way to test it20:21
mordredoh! I misread20:22
fungi`sudo -u zuul -- ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 review.opendev.org` gives me "Permission denied (publickey)."20:22
openstackstatusfungi: finished sending notice20:22
corvusso yeah, i think that's the unconfirmed theory which we should confirm now, probably by untangling the old hiera values and comparing?20:22
clarkbI've confirmed the key in bridge hiera is different than what we have on zm0220:22
mordredclarkb: which key?20:22
fungii also tried zuul@review.opendev.org too just to make sure20:23
mordredzuul_ssh_private_key_contents is what ansible wants to install20:23
clarkbbridge:/etc/ansible/hosts/group_vars/zuul-merger.yaml:zuulv3_ssh_private_key_contents != zm02:/var/lib/zuul/ssh/id_rsa20:23
mordredah - wrong hiera key20:23
clarkbnote I've not confirmed that is the correct key20:24
clarkbbut going to try and do that now20:24
corvus  $zuul_ssh_private_key    = hiera('zuul_ssh_private_key_contents')20:24
clarkb$zuul_ssh_private_key = hiera('zuulv3_ssh_private_key_contents')20:24
clarkbmy line is from the zm node block20:24
corvusi think that's what puppet writes out to '/var/lib/zuul/ssh/id_rsa'20:25
corvusclarkb: mine is from executor20:25
clarkbwe may have called it different things in different places?20:25
corvusin site.pp20:25
clarkbya looks like it may be different key between merger and executor20:25
corvuswow20:25
mordredno wonder it broke then20:26
clarkbwell both have different values than what is on the merger right now20:26
mordredso maybe we should just update the key name in the zuul merger hiera20:26
clarkband they are different from each other I think20:26
mordredif zuul_ssh_private_key_contents is what we're using elsewhere?20:26
corvusdo we have both public keys set for that user in gerrit?20:27
mordredlooking20:27
mordredoh wait - not sure how to look - do I need to log in as that user to gerrit?20:28
clarkbwe also have two different keys in used on executors20:28
corvusi'll poke at the gerrit db20:28
fungii confirm the /etc/ansible/hosts/group_vars/zuul-merger.yaml:zuulv3_ssh_private_key_contents works with my reproducer command above20:28
clarkbok I think I've got it20:28
fungiso that seems to confirm it's the one we want20:28
mordredawesome20:28
clarkbzuul_ssh_private_key might be for executor to test nodes20:28
clarkbgerrit_ssh_private_key on the executor is ssh from executor to gerrit20:28
clarkband on zuul mergers we called that zuulv3_ssh_private_key20:29
clarkbtrying to compare them in hiera now20:29
clarkbyes gerrit_ssh_private_key == zuulv3_ssh_private_key20:29
clarkbbetween zuul-executor.yaml and zuul-merger.yaml20:29
fungiit's definitely the correct file content anyway (or at least it's *a* key the zuul user in gerrit can authenticate with)20:29
mordredwow20:29
mordredso - in the current ansible we want that to be zuul_ssh_private_key_contents everywhere20:30
clarkbmordred: and then you need to do something to write out the executor to test node key20:30
clarkb(I mean that's probably already done, we just need to make sure we get it right too)20:30
corvusthere is one public key in gerrit.  i could paste it but it doesn't seem necessary since that's easy to test.20:30
mordredclarkb: yeah20:31
mordredclarkb: I think we can followup with the executor-to-test-nodes key since I don't think we're touching it right now20:31
clarkbI'll see if I can find it on ze01 and cross check against hiera20:31
fungiwell, like i said, i tested authenticating with it, and it does work20:32
mordredbut if we update hiera for zuul-to-gerrit-key to be called zuul_ssh_private_key_contents and then re-reun ansible, we should get the right keys back everywhere - right?20:32
fungioh, you mean the test node key20:32
fungiyeah20:32
mordredbut I think we don't have to solve that key right now, because it shouldn't have changed from the last time puppet wrote it if we're not writing it20:32
clarkbmordred: confirmed those other keys look ok20:33
clarkbso its just the zuul to gerrit key that got mismatched20:33
clarkbas well as gids?20:33
mordredclarkb: cool. so - you've got your head wrapped around the key names in hiera - can you update the key names?20:33
clarkb(and possibly uids if ansible manages to do its thing?)20:33
fungiso far that's all i've seen wrong at least20:33
clarkbmordred: I don't know where I need to update them20:34
mordredclarkb: in hiera20:34
clarkbmordred: right but in which group or host?20:34
mordredall of them20:34
clarkbmordred: sorry let me rephrase20:34
mordredansible wants the key to be zuul_ssh_private_key_contents20:34
clarkbzuul-merger doesn't have the key you used and yet it still updated the contents20:34
clarkbI don't know where it is finding those contents20:34
mordredgroup_vars/zuul-merger.yaml:zuul_ssh_private_key_contents: |20:35
mordredclarkb: that's where ansible got the contents from20:35
mordredfor the mergers20:35
clarkboh it is in there sorry20:35
mordred*phew*20:35
fungizuul_ssh_private_key_contents seems to exist in zuul-executor.yaml, zuul-merger.yaml and zuul-scheduler.yaml according to git grep20:35
mordredI was worried there was even more weirdness20:35
clarkbok now my concern is: if I change that, are we consuming it anywhere it is expected to be a thing?20:36
clarkbmordred: did you add that key to zuul-merger and zuul-scheduler recently?20:36
mordredno - so -  I think my error is that I expected we called the key the same thing in all three places20:36
mordredbut that is apparently not what we did20:36
clarkbright so if I change the value of that key, anything consuming the key for its intended purpose may break20:37
mordrednothing else is consuming that key20:37
mordredthe only thing consuming this is ansible now20:37
clarkbok20:37
fungicodesearch turns up a couple of hits for zuul_ssh_private_key_contents in x/ci-cd-pipeline-app-murano and x/infra-ansible which i expect we can ignore20:37
corvusshould be able to grep system-config to confirm that20:37
clarkbI'm taking the lock then20:37
mordredand the only thing it's consuming it for is /var/lib/git/id_rsa20:37
fungiaside from that, only hits are in system-config20:37
clarkbI'll update zuul_ssh_private_key_contents to be the same value as zuulv3_whatever on zuul-merger20:37
fungiplaybooks/roles/zuul/tasks/main.yaml, playbooks/zuul/templates/group_vars/zuul-scheduler.yaml.j2 and playbooks/zuul/templates/group_vars/zuul.yaml.j220:37
mordredyah20:38
mordredso as long as we have the correct data referred to by that key, we should be good to go20:38
mordredcorvus: what do you think we should do about uid/gid?20:39
corvusi think we should fix hiera and run ansible first; get that deployed and verify things are running (since it doesn't look like the id issue will be a problem right now)20:39
mordredkk20:39
mordredand agree20:39
corvusthen maybe take a breath and talk about that :)20:39
clarkbI think zuul-merger.yaml is good now, I kept the old ssh key around under a new hiera key name20:40
clarkbI'm going to do the same in the other two files now20:40
mordredcool20:40
mordred++ \o/20:40
clarkbif you want to diff zuul-merger you'll see what I did20:40
fungiyeah, odds are the changed gid may be purely cosmetic/immaterial, and ansible is failing to change the uid anyway20:40
corvusclarkb: maybe go ahead and get rid of it?20:40
corvusclarkb: we have git history :)20:40
corvusclarkb: and the last thing we need is more keys in those file :)20:40
fungiyeah, easy to revert changes to that file20:40
mordredyeah - these files are ... busy20:40
corvus(we're never going to think it's a better time than right now to make that cleanup)20:41
clarkbcorvus: ++ I'll remove them20:42
clarkbI think I'm done except for that cleanup20:42
clarkbzuul-scheduler was already correct20:42
clarkbso we had 3 different names for the key in 3 different places. All three are now using the name and value that scheduler was using20:42
mordredawesome20:43
mordredclarkb: and then zuul_ssh_private_key_contents_for_ssh_to_test_nodes is the executor key for talking to the test nodes, right?20:43
clarkbmordred: yes I've removed that now though20:44
mordredclarkb: +old_zuul_ssh_private_key_contents_for_test_node_ssh still seems to be in merger20:44
clarkbmordred: should be gone from both now20:44
fungiwhat a marvellously verbose variable name20:44
mordredclarkb: I agree20:44
mordredok - so I should re-run the playbook now yes?20:44
clarkbmordred: and yes you can confirm by looking at /var/lib/zuul/ssh/nodepool_id_rsa20:44
clarkbya I think hiera/group_vars are ready for it now20:44
mordredcool20:44
fungialso zuul_ssh_private_key_contents_for_ssh_to_test_nodes appears nowhere in our public repos per codesearch20:45
mordredI've joined the screen session on bridge20:45
fungijoined20:45
clarkbfungi: I added it as a placeholder but then removed it per corvus' suggestion20:45
mordredwe will expect it to error out eventually at zuul-executor20:45
fungiclarkb: oh, got it, so that was a made-up-on-the-spot one20:45
mordredbut that's ok20:45
clarkbmordred: if we are happy with the output of that run we should git add zuul-executor.yaml and zuul-merger.yaml in group_vars and commit the change20:46
mordredclarkb: ++20:46
mordred https://review.opendev.org/723023 <-- this is the fix for the executor error that will happen, fwiw20:46
clarkb+2 on that change20:46
mordredthe mergers are stopped not paused, right?20:46
clarkbmordred: correct systemctl stop zuul-merger'd20:47
mordredok. I think that means it's going to update the uid20:47
clarkbso we may see ansible start them again20:47
clarkbhrm that may break us too20:47
corvusplease carry on, but once this is done, we do need to talk about gid before we reinstate20:47
mordredyeah20:47
mordredI'm happy to talk about that while this is running20:47
corvusif the mergers are stopped, i say we just let ansible do its remapping20:47
mordredkk20:48
corvusbut we don't want to start the mergers20:48
mordredshould I cancel ansible real quick?20:48
clarkbya we can chown /var/lib/zuul /var/run/zuul /home/zuul easy enough20:48
mordredI don't think it's going to start anything20:48
corvusdid we land the change where we said "oh it's always safe to start mergers"?20:48
mordredactually - it's DEFINITELY not going to start anything20:48
mordredno, we did not20:48
corvusk20:48
fungiconfirmed, it did switch up the uid on the mergers just now20:48
fungi-r-------- 1 3000 zuul 1675 Apr 24 19:40 id_rsa20:49
fungiso that's definitely going to need fixing before we can start them again20:49
mordredok. so the mergers now completely match the container uid/gid - but we'll need to run a chown on some directories20:49
mordredyup20:49
corvusok, stable enough to talk about id stuff?20:49
mordredI'm good to talk about it20:49
corvus  File "/usr/local/lib/python3.5/dist-packages/zuul/driver/bubblewrap/__init__.py", line 121, in getPopen20:49
corvus    group = grp.getgrgid(gid)20:49
corvusKeyError: 'getgrgid(): gid not found: 3000'20:49
corvusthat's why we can't turn things on again until we fix that :/20:50
corvusi think we should just roll forward and re-id everything to the container ids20:50
clarkbits using the current process gid I guess?20:50
fungithat seems fine to me. we'll need to do the scheduler too right?20:50
mordredcorvus: I agree - since we're essentially down anyway20:50
corvusyeah, that's basically my thinking20:50
corvusfungi: maybe?20:51
fungiclarkb: the running process is running with gid 3000 which doesn't have a corresponding group name, right20:51
corvusi think this is easy for the executors (they'll be just like the mergers; maybe we have to do some chowns)20:51
corvusbut i'm guessing we do want to do the scheduler too, so maybe we just go ahead and save queues and do a full shutdown?20:51
mordredcorvus: yeah - and I mean - we needed to shutdown to start up from container - so while it's not ideal, again, we're down, so might as well20:52
corvusyep; fungi, clarkb: sound good?  if so, i can do the shutdown20:52
fungisounds fine, still here to help20:52
mordreddo we know which all dirs we need to chown?20:52
clarkbya I think rolling forward is our best choice20:52
mordred/var/lib/zuul for sure20:52
clarkbotherwise we'll just be chasing down weird bugs likely20:52
corvuswhy don't folks divide up looking on the different machine classes for dirs to change?20:53
clarkbmordred: /var/lib/zuul/ /var/run/zuul and /home/zuul I think20:53
clarkbthere may be others20:53
mordredok. I'm going to write a quick playbook20:53
* mnaser doesn’t have access but can provide extra hands or eyes doing other things if needed20:53
fungii can check executors since clarkb's already been tackling mergers and mordred's ansibling20:54
mordredhttps://etherpad.opendev.org/p/Egpih9sInMgzpz0DFF9m20:55
clarkbmordred: also /etc/zuul20:55
mordredhow does that look so far? (mnaser - eyeballs ^^ ?)20:55
fungilike originally on the mergers, the executors failed to update the zuul uid, which is still 3000, but have updated the zuul gid to 1000120:55
mordredclarkb: those are the same on both mergers and executors right? is it the same dirs on scheduler?20:55
clarkbmordred: yes I believe its the same for mergers and executors20:55
clarkblooking at scheduler now20:55
mnasermordred: looks good, i think that covers most folders20:55
mordredthank you20:55
fungimordred: some of our files in those paths are root-owned20:56
funginot sure if that matters20:56
mordredbetcha ansible will happily set them back if it does20:56
mordredfungi: got an example?20:56
fungipossibly just because they're world-readable and configuration management wasn't told a uid/gid20:56
mordredoh - I bet all the /etc/zuul stuff20:56
corvusya20:56
mordredyeah20:56
clarkbmordred: ya scheduler looks the same20:56
mnasermordred: maybe adding a -v for verbosity just in case to see what actually got changed20:57
fungi/etc/zuul/executor-logging.conf for example20:57
mordredcool. so that etherpad should take care of it20:57
mordredmnaser: ++20:57
clarkbthe scheduler has some old extra stuff in /var/run but those aren't used anymore20:57
clarkbits just /var/run/zuul now20:57
corvusdeja vu, btw20:57
corvuswe did this the last time we landed this change :)20:57
corvusmaybe drop /etc/zuul and just chown /etc/zuul/github.key20:57
clarkbmordred: ^20:58
corvusand zuul.conf20:58
mordredso - I think we should a) shut down scheduler, executors, web and fingergw b) run current ansible c) run chown playbook20:58
mordredcorvus: kk20:58
corvusd) run ansible again20:58
mordredyes20:58
corvus(to achieve steady state in case chown playbook != ansible)20:58
mordredyup20:58
mnasercorvus: gerrit.key too? or is that in /var/lib/zuul/.ssh/id_rsa ?20:58
clarkbmnaser: its /var/lib/zuul/ssh/$files20:58
mordredok - I wrote that etherpad to chown-zuul.yaml20:59
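The etherpad contents aren't reproduced in this log; a rough sketch of what a chown playbook along the lines being discussed might look like (the host group names, become setting, and exact path list are assumptions, not the actual chown-zuul.yaml):

    # Sketch only; groups and paths are assumptions drawn from the discussion above.
    - hosts: "zuul-scheduler:zuul-merger:zuul-executor"
      become: true
      tasks:
        - name: Recursively chown zuul-owned trees to the new uid/gid
          file:
            path: "{{ item }}"
            state: directory
            owner: zuul
            group: zuul
            recurse: true
          loop:
            - /var/lib/zuul
            - /var/run/zuul
            - /home/zuul

        - name: Chown individual zuul config files
          file:
            path: "{{ item }}"
            owner: zuul
            group: zuul
          loop:
            - /etc/zuul/zuul.conf
            - /etc/zuul/github.key

As it plays out below, github.key only exists on the scheduler, which is why the first run bails on the other hosts before reaching the second task.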
corvusyeah, our install is... weird and not recommended :)20:59
corvusit's a zuulv0 install20:59
mnaserhehe, also, the -v might had a _lot_ of output20:59
clarkbya the git repos have lots of files20:59
mnaseresp when it hits the repo folders i guess20:59
clarkbmordred: ^20:59
fungithere are a few files in /etc on executors with old gids but that may be because they're stale files? /etc/zuul/site-variables.yaml, /etc/zuul/ssl/{ca.pem,client.key,client.pem}20:59
mordredoh - maybe we don't want the -v20:59
clarkbor do a -v for everything but /var/lib/zuul21:00
clarkbbut definitely be careful with -v in /var/lib/zuul :)21:00
corvusfungi: those files are important21:00
fungiwe *can* also use find to look for files/directories with old uids and gids21:00
mordredclarkb: like that?21:01
mnaserfungi: that's a good idea too, depending on how happy the iops on the system are :)21:01
corvuswe may just not have installed the site-variables file in this playbook, since it comes from system-config21:01
clarkbmordred: the repos are in /var/lib/zuul/git iirc21:01
clarkbmnaser: and then there is another set of hardlinked repos on the executors21:01
clarkber sorry mordred ^21:01
corvusclarkb: we shouldn't have any hardlinked repos21:01
mordredok. but I mean - does the playbook update look better?21:01
fungicorvus: good point, those files may reflect things we're missing in ansible too21:01
corvuslet me make sure we have cleaned the build dir21:02
mordredthese: /etc/zuul/ssl/{ca.pem,client.key,client.pem} changed in the new rollout21:02
mnaserfind / -uid $olduid --- could very well be useful if the drives are fast enough21:02
mordredthey're /etc/zuul/ssl/gearman-{ca.pem,client.key,client.pem}21:02
clarkbmordred: ya it looks good though corvus is mentioning we should clear out the build dirs21:02
mordredah - good point21:02
mnaseras suggested by fungi21:02
fungimordred: i can confirm, those new filenames have the correct ownership21:02
corvusderp, my shutdown was hung on an interactive step; i've resumed it but it's still proceeding21:03
fungiit's the old filenames which do not21:03
mnaserwait actually21:03
mordredsite-variables hasn't been written yet because of the executor bug21:03
corvusmordred: wait for an all clear on shutdown from me before running21:03
mordredcorvus: yes21:03
clarkbmordred: fungi we should check that the new files are used by zuul?21:03
mnaserfind / -uid old_zuul_uid -exec chown zuul:zuul {} \;21:03
mnaserthoughts about that ^ ?21:03
mordredclarkb: I did earlier - but please double-check that21:03
mordredbecause this is one of those cascading failures sorts of days21:04
mnaser(if we're traversing the entire file system, might as well check if it's the right uid and chown)21:04
clarkbmnaser: I've checked on zm01 and the new files appear to be what is used21:04
mnasers/entire/most/21:04
fungimnaser: i'd probably do without the -exec first to see what's found21:04
clarkbmnaser: it will likely be incredibly slow too21:04
fungiso that we can look into why they got skipped21:04
mordredI think I'm ok with just the targetted chown - but maybe let's do that as a print later to make sure we didn't miss something21:04
fungibut yeah, may want to exclude /var/lib/zuul from the find21:04
fungisince we're chowning it all anyway21:05
fungiand that'll be the bulk of the chew anyway21:05
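One way to do that exclusion (a sketch; the 3000 uid/gid values come from the discussion above):

    find /etc /home /usr /var -path /var/lib/zuul -prune -o \( -uid 3000 -o -gid 3000 \) -print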
mnasercool, fair. i figured that while chown -Rv will probably scan all inodes, find was going to do the same, but probably an overoptimization21:05
mordredfungi: the old filenames got skipped because ansible doesn't reference them at all21:05
corvusfwiw, we could also just remove the git repos21:05
fungimordred: yeah, that's what i gathered when you said they'd been renamed in ansible21:05
corvusbit slower on startup, but it's friday afternoon, going into a weekend, and they all need pruning anyway21:05
mordredah - nod21:05
mnasercorvus: when was the last time an executor started with a clean slate? maybe that might take a while and introduce all other fun things too21:06
clarkbmnaser: you don't need to get the old file stat to update its perms21:06
mordredcorvus: fine by me21:06
corvusmnaser: a few months21:06
clarkbmnaser: I expect it won't do a full stat21:06
mnaserclarkb: ah yes good point21:06
corvusmnaser: (they get terribly slow, and need periodic cleaning at opendev scale; we *just* merged a patch from tobiash to fix that, and i think we'll be restarting with it in place)21:07
mordredcorvus: looks like there's no python processes on scheduler21:07
corvusnot ready yet21:07
mordredkk21:07
corvusokay, everything is shutdown, and there are no files in builds21:08
corvusi think now we decide if we want to delete the git repos or chown them21:08
mnasercorvus: maybe start with chown and see how quickly its churning through them?21:09
corvussure, it's a second task anyway21:09
mordredcorvus: I think I can run the service-zuul playbook now while we discuss yeah?21:09
corvusmordred: ++21:09
mordredok. I'm going to run that now21:09
corvusand restarting with the new code will immediately correct and gc them as soon as they are used (now *that* might be slow)21:09
corvusbut it'll be a good upgrade test :)21:10
mordred++21:10
clarkbare we starting on containers or old init?21:10
mordredcontainers21:10
corvusi think we're gonna be starting on containers (except executors)21:10
mordredyeah21:10
fungiso really really roll forward21:10
mordredyup! we needed to restart everything to pick up the container change anyway - so, you know, here we are :)21:10
mnaserif not using init then are the (assuming systemd) units disabled just in case?21:10
clarkbmnaser: we've done that as a followup in other cases yes21:11
corvusyeah.  the upside of the 'unscheduled outage on friday' approach is this is going to go a lot faster than it would have otherwise :)21:11
mordredcorvus: jfdi ftw!21:11
clarkbmnaser: basically systemctl stop foo && systemctl disable foo21:11
mnaserok cool21:11
mnaserno one wants a reboot surprise :)21:11
mordredyea - I think we need to write a cleanup playbook that does the disable and then deletes the units21:12
corvusi would like a tequila sunrise21:12
mordredcorvus: now _I_ want a tequila sunrise21:12
funginobody expects the reboot inquisition!21:13
clarkbI might need to go make an okolehao something (https://www.islanddistillers.com/product/okolehao-100-proof-cork/)21:13
clarkbyou'll like the literal translation on that21:13
clarkbbut its the only alcohol I have left I Think21:13
mordredclarkb: sandy thinks you should save that for disinfecting things21:14
corvusnot high enough proof :(21:14
fungionly good for disinfecting your digestive tract21:14
mordredor your iron butt21:14
mordredok. the playbook is done21:15
mordredwe ready for the chown playbook?21:15
corvus++21:15
* fungi nods affirmatively21:15
clarkbI think all the things I was doing are long done21:15
mordredcut/paste looks ok?21:15
corvus++21:15
mordred-f 20 is 20 forks right?21:16
fungithat's my understanding21:16
mordredI will now run that command21:16
fungifireball 2021:16
mordredugh. what's page up in screen scrollback?21:17
fungithe pgup key usually21:17
corvusfor me "page up"21:17
mordred(I saw a bunch of error stream by)21:17
mordredcan one of you do that? I have a stupid keyboard21:17
corvuswe're at the top21:17
fungiyeah, there's a limited buffer by default21:17
mordredoh. well - let's run it and re look later then21:18
mnasermordred: if stupid key that being mac keyboard, fn+arrow up is page up :p21:18
mnasers/key/keyboard/21:18
fungilooks like it just worried about the missing file21:18
fungiyou told it to chown something which only exists on the scheduler21:18
mordredok. so it might have done everything ok21:18
mordred/var/lib/zuul on the scheduler looks decent21:19
mordredso - the answer is - recursive chown is quick21:20
fungi/var/lib/zuul/ssh/{nodepool_id_rsa,static_id_rsa} on ze01 are 3000:300021:20
fungibut maybe those are stale files?21:20
mordredwe still should have chowned them21:20
mordredmaybe let's pull github.key from the list21:20
clarkbfungi: those are unmanaged by ansible currently but a chown -R should've gotten them21:21
clarkbfungi: the nodepool_id_rsa file is how we ssh into nodes21:21
mordredI'm re-running having taken the github.key file out of the first task21:21
fungimakes sense21:22
mordredbecause I'm pretty sure that caused it to bail and not run the second task :)21:22
clarkbokolehau + club soda + lime and some ice is drinkable. Not really something you'd probably pay for though21:23
mordredI would have expected zuul01 to take the longest21:26
mordreddone21:27
mordred6 minutes total fwiw - so still not terrible for a chown21:27
mordredfolks want to verify that stuff looks chown'd?21:27
mordredthen we can run the service-zuul playbook again21:27
fungithose keyfiles in /var/lib/zuul/ssh/ on ze01 are correct now21:27
clarkbzm01 looks good to me in /var/lib21:28
fungi/home/dmsimard on ze01 has some files with old ownership21:29
fungiwhich is fine21:29
mordredif that breaks zuul something is horribly wrong :)21:29
fungiindeed21:29
mordredI will now re-run service-zuul - yeah?21:29
fungithe aforementioned files in /etc/zuul still have old gids21:30
fungibut they weren't included in the chown21:30
mordred(should I fix the jemalloc line in the executor role?)21:30
fungi/etc/zuul/site-variables.yaml /etc/zuul/ssl/ca.pem /etc/zuul/ssl/client.pem /etc/zuul/ssl/client.key21:30
corvusmordred: oh, i guess if it's aborting due to that error we should fix it21:30
mordredI think since we're rolling forward - we should go ahead and fix that21:30
fungithose last three are not used by the container sounded like, as theyve been renamed21:30
mordredyeah21:30
fungii'm unsure whether /etc/zuul/site-variables.yaml needs correcting though?21:31
mordredlet's see how it is after this ansible run21:31
mordredsince we won't abort on zuul-executor now (we hope)21:31
fungii guess it's managed by a different playbook we haven't run? or just missing from ansible entirely?21:31
mordredwe've been bombing out due to the jemalloc bug21:32
fungigot it21:32
mordredso didn't get to the scheduler role21:32
mordredok - I run playbook21:32
fungialso /etc/zuul/site-variables.yaml is world-readable so if we're configuration-managing it then that's probably fine for the moment21:32
mordredcorvus: so - the real question I have is - why did I think 10001 was the uid of zuul in opendev production?21:33
clarkbmordred: was that the uid that we used on static for the zuul user?21:34
clarkbto run goaccess?21:34
mordredno - I mean - we set 10001 in the zuul images a while ago21:34
mordredand we picked it to make it match what we were already running21:34
mordredexcept WOW was that wrong21:34
clarkboh got it21:35
mordredalso - this same issue exists with nodepool nodes21:35
clarkbI have no idea in that case :)21:35
clarkbI'm not in the screen, are we running the regular ansible now?21:35
mordredyes21:35
mordredit's very exciting21:36
fungialso i've done a find on /etc, /home, /usr and /var for anything using -uid 3000 -o -gid 300021:36
fungino other hits21:36
fungi(on ze01)21:36
corvusmordred: i don't know, i think i just took your word for it when i asked about it :)21:37
mordredcorvus: :)21:37
mordredcorvus: so - how should we handle this for nodepool?21:37
fungimordred is a very believable fellow21:37
corvusmordred: same way -- full shutdown and restart?21:37
corvus(that's less disruptive)21:37
mordredyeah21:37
mordredwe haven't landed the nodepool patch yet21:38
clarkbif we do it in a rolling fashion it won't be an outage21:38
clarkband each nodepool launcher runs indepednently so should be ok?21:38
mordredok - the playbook is done and it looks like it was happy about itself21:40
mordredI believe we are now at the place where we can consider running some docker-composes21:41
fungithat still hasn't updated ownership for /etc/zuul/site-variables.yaml on the executors, btw21:41
clarkbwe need mergers for the scheduler to start up21:41
corvuslets start a merger?21:41
mordredis there some way to do that piecewise in a way that's less disruptive while we check stuff?21:41
clarkbcorvus: ++ I think merger then scheduler then the others if they are happy?21:41
mordredshould we do a chown playbook for site-variables real quick?21:41
mordredlike that?21:42
fungimaybe we should confirm we're ansibling that file at all21:42
mordredoh - nope21:43
mordredwe point to /opt/project-config in conf now21:43
mordredplaybooks/roles/zuul/templates/zuul.conf.j2:variables=/opt/project-config/zuul/site-variables.yaml21:43
fungicodesearch only turns up hits for that path in opendev/puppet-zuul21:43
fungiso yeah, that's stale along with the other three files in /etc/ with old ownership21:43
fungiwe're all set in that case21:43
mordredcool. I think we should start a merger21:44
corvuslet's rm that file then21:44
clarkbwell we use that file21:44
* fungi concurs21:44
corvusclarkb: ?21:44
mordredclarkb: we do not21:44
clarkboh wait I get it now21:44
clarkbwe changed the config to consume it directly21:44
clarkbrather than writing it in /etc/zuul21:44
clarkbgot it21:44
corvusclarkb, fungi, mordred: ok to rm /etc/zuul/site-variables.yaml ?21:44
mordred++21:45
fungiyeah, we should be able to delete these: /etc/zuul/site-variables.yaml /etc/zuul/ssl/ca.pem /etc/zuul/ssl/client.pem /etc/zuul/ssl/client.key21:45
corvus(that way we don't get confused)21:45
mordredwant me to playbook it?21:45
clarkbya I think so21:45
fungiall are using new names or paths now21:45
corvusfungi: agreed21:45
fungiwhich is why they have old ownership still, no longer referenced anywhere21:45
mordredlike that?21:45
fungiyep21:46
mordredk. running real quick21:46
mordredgreat. worked on executors, not on mergers - which is probably correct21:46
clarkbya mergers won't have site-variables21:46
fungifind no longer turns up files under /etc with -uid 3000 -o -gid 300021:46
fungilgtm21:46
clarkbbut will have zuul/ssl/ files21:47
mordredyes - we should have /etc/zuul/ssl/gearman-*21:47
fungiwe do on ze0121:47
mordredand on zm0121:47
clarkbzm01 looks good21:47
mordredcool21:47
mordredwho wants to start a merger?21:48
corvusi will21:48
clarkbI'm still on zm01 if you want me to do it21:48
clarkbk /me lets corvus do it21:48
corvusclarkb: you do it21:48
* mordred wants both clarkb and corvus to do it21:48
corvusclarkb: all you :)21:48
clarkbalright starting momentarily21:48
clarkbits sudo docker-compose up -d ?21:48
clarkbin /etc/zuul-merger?21:49
mordredyup21:49
clarkbcontainer is up21:49
clarkbits waiting for gearman21:49
clarkbI think that is about as far as it will get without the scheduler?21:49
corvusyep21:50
corvuslet's start an executor?21:50
clarkbwfm21:50
corvusi'll do that one :)21:50
mordredthis is very exciting21:50
corvusit's up and running, no complaints (of course it's not a container)21:51
mordredgood that it's running though :)21:51
corvusmordred: can you write a playbook to start the rest of the mergers and executors?21:52
clarkbwhats startup for executor? same as before?21:52
corvusclarkb: yeah systemctl21:52
mordredcorvus: yes21:52
clarkbrgr21:52
corvusmordred: i think both 'docker-compose up' and 'systemctl start' are idempotent enough we can just run that on all hosts21:52
mordredhow's that look?21:53
mordredoh - piddle21:53
mordredhow's that?21:53
corvusclarkb: i kind of want a unit file, because stopping/starting zuul is currently much easier than with docker-compose21:53
corvusmordred: ++21:53
corvus(like, a unit file to run docker-compose)21:53
mordredk. I'm going to run that21:53
clarkbooh ya maybe do that for all our docker-compose too?21:54
clarkbwe can also do it without -d and systemd will be happy I think21:54
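A minimal sketch of the kind of wrapper corvus is describing, so the familiar systemctl start/stop verbs drive docker-compose (the unit name, working directory, and docker-compose path are all assumptions):

    [Unit]
    Description=Zuul Merger (docker-compose)
    Requires=docker.service
    After=docker.service network-online.target

    [Service]
    WorkingDirectory=/etc/zuul-merger
    # Running without -d keeps docker-compose in the foreground so systemd can track it.
    ExecStart=/usr/local/bin/docker-compose up
    ExecStop=/usr/local/bin/docker-compose down
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target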
corvusmordred: "- shell"21:54
mordredthanks21:55
clarkbbtw the sudo that failed from zm02 was me21:55
mordredok. they're all started21:55
clarkbI was trying to check if it was running under containers yet or not and didn't realize I was zuul21:55
corvusall right, i'll start the scheduler?21:56
clarkbI guess so21:56
clarkbcan't think of much else we can do without gearman in the mix21:56
corvusstarting21:58
corvusthe scheduler is waiting for gearman21:58
corvusoh it's in a restart loop21:59
clarkbgearman is?21:59
corvusno the scheduler container21:59
clarkbgot it21:59
corvusproblem with the zk config21:59
corvusi'm thinking this may not be well tested in the gate21:59
corvus[zookeeper]22:00
corvushosts=23.253.236.126:2888:3888,172.99.117.32:2888:3888,23.253.90.246:2888:3888session_timeout=4022:00
corvusthat's on all hosts22:00
mordredcorvus: I agree - it's possible we're only testing that docker-compose up works (which it did)22:00
corvusi think we should shut everything down, fix that manually, run ansible, then start again22:00
mordredcorvus: we're missing a \n yeah?22:00
corvusyep22:01
clarkbmordred: ya session_timeout needs its own line22:01
mordredk. I'll stop the mergers and executors22:01
mordredand edit the file - someone else want to capture it in a gerrit change?22:01
clarkbI can take a look at gerrit22:01
corvusscheduler is down22:01
mordredcorvus: what's syntax issue there?22:02
mordredthere's no -%} - it's just a %}22:02
corvusmordred: i don't understand the question22:02
mordredcorvus: sorry - in the jinja template for zuul.conf.j2 - I don't understand why there's no \n in the output22:03
clarkbmordred: ya I was looking at it and getting confused too22:03
mordredI've got it up in vim in the screen22:03
clarkbmordred: we can just add a newline between the two items probably22:03
clarkbbut that's hacky22:03
mordredyeah - maybe do that for now?22:04
clarkbk I'm pulling up jinja2 docs now to try and understand better22:05
mordredalso - let's look at the rest of the file and make sure nothing else is dumb22:05
mordredthe connection sections look ok to me22:05
corvuscould use some spaces between them22:05
corvussome newlines before each connection header22:05
mordredadded. shall I run it with this?22:06
corvusalso the gearman headers22:06
clarkbhttps://jinja.palletsprojects.com/en/2.11.x/templates/#whitespace-control22:06
clarkb"a single trailing newline is stripped if present" is the default apparently?22:07
corvusmordred: ++22:08
mordredk. I did the hacky version - let's see if we can't get the nicer version22:08
corvus"By default, Jinja also removes trailing newlines. To keep single trailing newlines, configure Jinja to keep_trailing_newline."22:08
clarkbreading that doc I think we have to do the hacky thing22:08
clarkbcorvus: ya so we'd have to change ansible to get what we want?22:08
corvusso if there's a way to tell ansible to do that, that might be the less hacky thing.  but if not, i agree with clarkb, hacky is the only option22:09
corvusclarkb: i dunno if there's an argument to the template module?22:09
clarkbcorvus: let me read those docs22:09
clarkbhttps://docs.ansible.com/ansible/latest/modules/template_module.html#parameter-trim_blocks22:09
clarkbwe can set that22:10
mordredok - the zuul.conf looks better now22:10
corvusclarkb: i see a trim_blocks option but not a keep_trailing_newline22:10
corvusclarkb: i read trim_blocks as being to remove the first newline inside a block22:10
corvusbut i'm not at all confident i'm interpreting that right22:10
clarkboh ya22:10
clarkbkeep_trailing_newline is different22:10
corvusit's worth a local experiment22:10
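A local experiment along those lines is easy; this minimal snippet (not the actual zuul.conf.j2) demonstrates the default trailing-newline stripping the docs describe, and the keep_trailing_newline option corvus quotes:

    # Demonstrates Jinja2 dropping a template's single trailing newline by default.
    import jinja2

    tmpl = "hosts={{ hosts | join(',') }}\n"

    print(repr(jinja2.Template(tmpl).render(hosts=['zk1', 'zk2'])))
    # -> 'hosts=zk1,zk2'      (trailing newline stripped)

    env = jinja2.Environment(keep_trailing_newline=True)
    print(repr(env.from_string(tmpl).render(hosts=['zk1', 'zk2'])))
    # -> 'hosts=zk1,zk2\n'    (trailing newline preserved)

Whether Ansible's template module exposes keep_trailing_newline is the open question in the exchange above; if it doesn't, the extra blank line in the template is the pragmatic fix.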
mordredcorvus: I still didn't get a blank line on top of each [connection ...22:10
mordreddo we care?22:11
corvusmordred: not now, but we should when we fix it in gerrit22:11
mordred++22:11
corvuscause it's pretty hard to read without22:11
mordredok. that's done - shall I restart the executors and mergers?22:12
corvusyep22:12
mordreddone22:12
mordredready for a second stab at the scheduler22:12
corvusit's still in a restart loop with the same error22:13
corvus[zookeeper]22:13
corvushosts=23.253.236.126:2888:3888,172.99.117.32:2888:3888,23.253.90.246:2888:388822:13
corvuswhat looks wrong with that?22:13
clarkbcorvus: the colons maybe?22:13
clarkbis that how we range them?22:13
clarkbthats my best guess. kazoo doesn't like foo:2888:3888 ?22:15
corvusdoes anyone know where the config file reference is for zuul.conf now?22:15
corvusbecause i can't seem to find it with our new improved navigation22:16
corvusoh22:16
corvushttps://zuul-ci.org/docs/zuul/discussion/components.html22:17
corvushosts=zk1.example.com,zk2.example.com,zk3.example.com22:17
corvusmordred: let's drop all the ports22:17
mordredcorvus: ok22:17
corvusthose are actually quorum ports22:17
clarkboh22:17
corvus(we could do zk:2181 -- that's the client port)22:17
clarkbwe do that on the servers but not as clients22:17
corvusbut i think it's optional22:17
mordredstopping executors and mergers22:17
corvusscheduler is stopped22:18
mordredre-running service-zuul22:18
mordredok. all done22:21
mordredready to start mergers and executors?22:22
corvusyep22:22
corvusscheduler is up22:22
mordredcorvus: is it happier this time?22:22
corvusyep22:22
mordred(not having those ports will make that jinja line able to use | join(','))22:22
corvusmordred: we may want ports for zk-tls22:23
corvus(i'm not sure, we should test)22:23
clarkbmerger is resetting all the things22:23
mordredcorvus: nod22:23
mordredcorvus: we clearly need a better test for this in the gate22:23
clarkbno failure yet that I see22:23
fungionce we're sure this is working...22:23
fungistatus notice This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:25 were missed entirely by Zuul and should also be rechecked to get fresh test results22:23
fungidoes that ^ look reasonable? i tried to base the times there on when the keyfile was first overwritten, when the mergers were stopped, when the scheduler was stopped, and when the scheduler was started again22:23
mordredturns out "does docker-compose work" is not sufficient22:23
clarkbfungi: lgtm22:24
mordredfungi: lgtm22:24
corvusmordred: we can borrow a lot of stuff from quickstart22:24
corvusfungi: ++22:24
corvusmordred: are we not running zuul-web and fingergw in containers?22:25
corvusoh it's a separate docker-compose22:25
corvusfor zuul-web22:25
clarkbprobably want to consolidate that?22:25
corvusweb and fingergw are in the same docker-compose22:26
*** rkukura has joined #opendev-meeting22:26
corvusweb is in a restart loop22:26
corvusPermissionError: [Errno 13] Permission denied: '/var/log/zuul/web.log'22:26
corvuswe did not chown those22:26
fungiyep, still has old uid/gid22:26
corvushow is the scheduler writing to its log?22:26
mordredcorvus: it's 66622:27
fungiworld-writeable22:27
fungiyep22:27
mordredweb.log is 64422:27
corvusthat is amusing22:27
mordredyeah22:27
fungiwe probably don't really want those logs world-writeable22:27
mordredso - shall I do a quick cown of /var/log/zuul ?22:27
corvusi'll do the chown22:27
mordredok22:28
mordredwe might need to do it everywhere?22:28
mordredalso - I just changed windows in screen I think by accident?22:28
fungilogs on executors are world-readable too22:28
fungiwhich is why they're also working22:28
fungier, world-writeable22:28
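A sketch of the kind of recursive chown corvus runs here, assuming the host has a "zuul" account whose uid/gid match what the containerized services run as; if they differ, the numeric ids from the container image would go here instead:

import os
import pwd
import grp

# Give /var/log/zuul (and everything under it) back to the zuul account so the
# containerized web and fingergw processes can write their logs.
uid = pwd.getpwnam("zuul").pw_uid
gid = grp.getgrnam("zuul").gr_gid

for root, dirs, files in os.walk("/var/log/zuul"):
    for name in dirs + files:
        os.chown(os.path.join(root, name), uid, gid)
os.chown("/var/log/zuul", uid, gid)
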
mordredhow do I switch windows in screen?22:29
fungictrl-a,n22:29
fungifor next window22:29
mordredthanks22:29
fungip for previous22:29
corvusweb server is up now22:29
corvustenants loaded, will re-enqueue22:29
corvusi think we're good to call this up and send the announcement?22:29
fungilmk when i should status notice that novella22:29
mordredcorvus: I think I should run the playbook I just wrote in screen22:29
fungimaybe after reenqueuing, yeah22:29
clarkbcorvus: maybe we should check executors can ssh into test nodes?22:30
mordredand yes - I think we can announce back up22:30
corvusmordred: sure22:30
clarkbif we haven't confirmed that yet? simply because ssh was affected22:30
corvusoh22:30
corvusyeah22:30
corvuslike those retry_limits i'm seeing at https://zuul.opendev.org/t/openstack/status22:30
corvusand there was one merger_failure22:30
mordred2020-04-24 22:31:13,629 DEBUG zuul.AnsibleJob.output: [build: ad8be97939f543289336ff87bf847f1e] Ansible output: b'    "msg": "Data could not be sent to remote host \\"23.253.166.144\\". Make sure this host can be reached over ssh: Permission denied (publickey).\\r\\n",'22:31
mordredI think that's a nope22:31
corvusi'll shut down the scheduler22:31
clarkbis that executor to test node?22:32
corvusyeah22:32
fungilooks like it22:32
clarkbok the nodepool_id_rsa file hasn't been changed recently22:33
clarkbwhich makes me think maybe it's a config issue. possibly we are trying to use the gerrit key to ssh to test nodes?22:33
clarkbtrying to figure out where we configure that on executors now22:33
corvus[executor]22:34
corvusprivate_key_file=/var/lib/zuul/ssh/id_rsa22:34
mordredso - that's wrong - that just need to point to nodepool_id_rsa22:35
clarkbthat should be /var/lib/zuul/ssh/nodepool_id_rsa22:35
clarkb++22:35
mordredyes - and we need to be sure to write that file out in the ansible22:35
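A quick check of the setting under discussion, assuming the usual /etc/zuul/zuul.conf location on the executor; the two paths are the ones quoted just above:

import configparser

cfg = configparser.ConfigParser()
cfg.read("/etc/zuul/zuul.conf")

# The executor should use the nodepool key for test nodes, not the gerrit key.
key_path = cfg.get("executor", "private_key_file", fallback=None)
print("executor.private_key_file =", key_path)
assert key_path == "/var/lib/zuul/ssh/nodepool_id_rsa", (
    "executor is still pointed at the wrong key: %r" % key_path
)
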
mordredI'll make the local change real quick22:35
mordredclarkb: so - we should make sure there is a hiera entry for that key content :)22:36
mordredclarkb: I think that's one of the keys you were looking at earlier22:36
mordredI don't think we're writing it at all in the ansible22:36
clarkbmordred: correct, I think ansible needs to pull it out of hiera22:37
clarkbmordred: and it's the key content I deleted in zuul-executor22:37
clarkbmordred: should I go ahead and add it under a new name?22:38
clarkbhow about nodepool_test_node_ssh_key_content or similar22:38
fungifine by me22:38
mordred++22:39
corvusmordred: restart all the executors?22:39
corvusprobably fine to do the mergers too if it's easy to reuse the playbook22:40
mordred++22:40
mordredI pushed up a broken patch to remind us to fix the nodepool key22:41
mordredok - executors and mergers are restarted22:41
clarkbnodepool_test_node_ssh_private_key_contents <- that is the key I used22:41
clarkbI've also annotated the keys to explain what they are for22:42
mordredclarkb: awesome22:42
mordredI pushed remote:   https://review.opendev.org/723049 Add nodepool node key22:42
corvusready to restart sched?22:42
mordredthat's a much better key name22:42
mordredcorvus: yes22:42
corvusstarting22:43
corvusk it's up22:47
corvusre-enqueueing22:47
clarkbjobs seem to be working on ze01 if I'm reading the logs correctly22:48
corvusyeah, that's what i'm seeing22:48
mordredPHEW22:48
clarkbit's configuring mirrors and stuff22:48
mordredok - we don't actually have that many changes in the local ansible repo on system-config - I think we've captured them all in gerrit changes22:48
clarkbjemalloc, zuul.conf, and managing nodepool test node ssh key?22:49
fungistatus notice This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:45 UTC were missed entirely by Zuul and should also be rechecked to get fresh test results22:49
fungimaybe after we've seen at least one build report success22:49
fungiand all the reenqueuing is finished22:49
clarkbmordred: https://review.opendev.org/#/c/723049/1 that needs a small update22:50
clarkb(commented inline)22:51
mordredclarkb: you didn't click send22:51
clarkbhttps://review.opendev.org/#/c/723023/ https://review.opendev.org/#/c/723049/ and https://review.opendev.org/#/c/723046/ are the changes22:51
clarkbmordred: gah sorry22:51
clarkbmordred: now I have22:52
mordredprobably 0400 too yeah?22:52
mordredshould I do owner/group?22:52
clarkbthe others are all 0400 and owned by zuul:zuul22:53
clarkbI think paramiko may complain if it isn't set that way so probably should22:53
fungiyes, or 0600 would also be okay but since we're deploying them with configuration management 0400 is probably a better signal22:53
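A minimal sanity check for the deployed key, assuming the /var/lib/zuul/ssh/nodepool_id_rsa path quoted earlier and (as the filename suggests) an RSA key; owner-only permissions are what OpenSSH expects before it will use a private key, and loading the file with paramiko just proves the deployed content parses as a key at all:

import os
import stat

import paramiko

key_path = "/var/lib/zuul/ssh/nodepool_id_rsa"

# Owner-only permissions, as configuration management will enforce.
mode = stat.S_IMODE(os.stat(key_path).st_mode)
assert mode in (0o400, 0o600), "unexpected mode on %s: %o" % (key_path, mode)

# Parse the key to make sure the deployed content is an actual private key.
paramiko.RSAKey.from_private_key_file(key_path)
print("%s looks sane (mode %o)" % (key_path, mode))
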
corvusfungi: reenqueing is finished22:54
corvusfungi: i see a successful build22:54
fungithanks corvus, now to find a success22:54
clarkbmaybe someone else can double check the group_vars git diff and we can git commit that? I'm happy to git commit if someone else confirms it looks good to them22:54
corvusfungi: top of check (but the change hasn't reported yet, still has a job running)22:54
fungicorvus: any reason to wait for a full buildset to report, or is that unlikely to be broken?22:54
corvusfungi: i think you're good to send22:55
fungithanks, doing then22:55
mordredclarkb: I updated 723049 with 0400 and also added a key to the test templates22:55
fungi#status notice This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:45 UTC were missed entirely by Zuul and should also be rechecked to get fresh test results22:55
openstackstatusfungi: sending notice22:55
-openstackstatus- NOTICE: This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:45 UTC were missed entirely by Zuul and should also be rechecked to get fresh test results22:56
clarkbmordred: thanks lgtm22:56
corvusam i clear to eod?  i have no brains.22:58
clarkbI think so. If there are any braincells left, reviewing those changes might be a good thing. However, maybe better to do that when brains are ready22:58
clarkbI think the three changes above cover all our debugging22:59
fungii will review with coffee in the morning22:59
openstackstatusfungi: finished sending notice22:59
fungiif they don't get approved sooner22:59
clarkbI'm going to take a break then will check back in to see that things are still chugging along23:01
clarkblots of successful jobs though so things are looking good too23:01
corvusmordred: oh i'm seeing a lot of cronspam23:01
corvuslooks like the backup cron may have a syntax error?23:02
corvusthe zuul queue backup cron23:02
corvus# Puppet Name: zuul_scheduler_status_backup-openstack-zuul-tenant23:02
corvus* * * * * timeout -k 5 10 curl https://zuul.opendev.org/api/tenant/openstack/status -o /var/lib/zuul/backup/openstack-zuul-tenant_status_$(date +\%s).json 2>/dev/null23:02
corvus#Ansible: zuul-scheduler-status-openstack23:02
corvus* * * * * timeout -k 5 10 curl https://zuul.opendev.org/api/tenant/openstack/status -o /var/lib/zuul/backup/openstack_status_$(date +\\%s).json 2>/dev/null23:02
corvuslooks like the wrong level of escaping23:03
mordredyeah23:03
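A throwaway check for the symptom spotted above, assuming it is run as root and that the affected crontab belongs to the zuul user; it only looks for the doubled escape in front of '%' rather than trying to reason about cron's quoting rules:

import re
import subprocess

# List crontab entries that contain a literal '\\%' -- the wrong level of
# escaping -- instead of the single '\%' cron expects before a percent sign.
# The crontab owner is an assumption; adjust -u as needed.
crontab = subprocess.run(
    ["crontab", "-l", "-u", "zuul"],
    capture_output=True, text=True, check=True,
).stdout

for line in crontab.splitlines():
    if re.search(r"\\\\%", line):
        print("suspect escaping:", line)
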
corvusdo we have periodic crons running these playbooks yet?23:04
corvusi've manually fixed the crontab on zuul0123:04
mordredcorvus: we do23:05
corvusthat means we need to land all of the changes, plus this one now, right?23:05
mordredhttps://review.opendev.org/#/c/723052/23:05
mordredyeah23:05
mordredclarkb, corvus: https://review.opendev.org/#/c/723022/ ... actually, lemme free that from the stack23:06
mordredcorvus: should we direct enqueue some into the gate?23:07
clarkbwe can also put them in emergency23:07
mordredclarkb: https://review.opendev.org/#/c/723052/23:07
corvusi like emergency23:07
corvusmordred: mostly because https://review.opendev.org/72304623:08
mordredyeah23:08
corvus(i don't want to land that without seeing the output)23:08
clarkbI've +2'd 3052 but not approved it23:08
mordredI think it's just executors we need in emergency yes?23:08
corvusi think we can +w that one23:08
corvusmordred: everything?23:08
mordredkk. on it23:08
corvus(i don't want a delta between the running config and what's on disk)23:09
mordredok. I have put all of zuul in emergency23:09
clarkbcorvus: if we approve it, it will run ansible against zuul and change the configs I think?23:09
corvusmordred: thanks!23:09
clarkbI'll let others decide if that is safe and approve if so23:10
corvusclarkb: hopefully not if they're in emergency?23:10
mordredclarkb: no - we just put everything in emergency23:10
clarkbcorvus: oh ya - if we are in emergency now we can approve. I'll do that23:10
clarkbmordred what is a zl01?23:16
mordredexactly23:16
clarkbI'm trying to update the zuul run job to log the zuul conf and noticed that23:16
mordredclarkb: fixed here: https://review.opendev.org/#/c/723023/3/.zuul.yaml23:16
fungiour old ansible launchers were zl*23:17
fungifor zuul v2.523:17
mordredclarkb: which is why the jemalloc bug was able to exist23:17
clarkbok - I'll rebase23:17
clarkbok popping out now for a bit23:21
mordredsame23:27
