*** diablo_rojo has quit IRC | 07:45 | |
clarkb | fungi maybe move the mailman discussion here. I think the IncomingRunner is the runner that processes our "GLOBAL_PIPELINE" config | 16:05 |
---|---|---|
clarkb | so it could be emails causing it to balloon | 16:05 |
clarkb | fungi: around the time of the memory balloon we have a lot of errors like Apr 24 11:28:33 2020 (4985) listinfo: No such list "openstack-dev": in the openstack vhost error log | 16:07 |
clarkb | looking in the exim log around that same time period I've grepped for mailman_router and I don't see anything headed to openstack-dev there | 16:09 |
clarkb | I don't think those errors are generated by external email as a result. But maybe we've got something still referencing openstack-dev internally and it not existing causes us to use more memory than we should? | 16:10 |
clarkb | (I' | 16:10 |
clarkb | er | 16:10 |
clarkb | (I'm mostly just pulling on the random thread I've found don't know how valid this is) | 16:11 |
clarkb | aha, mailman docs tell me listinfo: likely originates from the web ui | 16:12 |
clarkb | that explains why exim is unaware | 16:12 |
clarkb | yup I've generated that error by visiting the listinfo page for openstack-dev | 16:13 |
clarkb | and it didn't drastically change memory use. I think that is the thread fully pulled out | 16:14 |
clarkb | https://mail.python.org/pipermail/mailman-users/2012-November/074397.html | 16:16 |
clarkb | skimming the error and vette logs for all vhosts, the only one active with anything out of the ordinary is openstack, and that was a message that was 6kb too large | 16:26 |
clarkb | corvus: I seem to recall at one point we had swap issues due to a kernel bug on executors? where they thought they were running out of memory but they had plenty of cache and buffers that could be evicted | 16:32 |
clarkb | corvus: do you recall what the outcome of that was? looking at the cacti graphs here it almost seems like it may be similar (granted we are using memory unexpectedly too) | 16:33 |
clarkb | hrm I take that back I think we may be down to ~50MB of buffers and cache at the point OOMkiller is invoked | 16:34 |
fungi | okay, moving here | 16:41 |
corvus | clarkb: not off hand, but it looks like based on your latest message that may not be important? | 16:42 |
clarkb | corvus: ya I don't think it is important anymore | 16:42 |
fungi | right, cacti showed something spiked active memory and exhausted it, so not buffers/cache | 16:43 |
clarkb | corvus: I was off by an order of magnitude when looking at memory use | 16:43 |
fungi | clarkb: also it looks like these spikes happen around 1000-1300z daily, so could be some scheduled thing i suppose | 16:44 |
fungi | maybe digests going out? | 16:44 |
clarkb | aren't those monthly though? | 16:44 |
clarkb | (I don't digest so not sure how frequently they can be configured) | 16:44 |
corvus | daily is possible | 16:45 |
fungi | i think there's a periodic wakeup which sends them so long as the length is over a certain number of messages, but also no fewer than some timeframe | 16:45 |
corvus | (in fact, i think daily is typical for digest users) | 16:45 |
clarkb | ah ok | 16:45 |
clarkb | maybe we should try and spread the timing of those out over different vhosts (even if it isn't the problem that may be a healthy thing?) | 16:46 |
clarkb | (I'm not seeing where that might be configurable in the config though) | 16:47 |
clarkb | fungi: other thoughts, this server notes it could be restarted for updates. Might be worthwhile starting there just to rule out any bugs in other systems? | 16:47 |
clarkb | but otherwise I'm thinking restart dstat with cpu and memory consumption reporting for processes and see if that adds any new useful info | 16:49 |
fungi | yeah, can't hurt, it was last rebooted over a year ago | 16:50 |
clarkb | I think digests are configured via cron to go out at 12:00 UTC | 16:52 |
clarkb | which is slightly before we saw the OOM in this case | 16:52 |
clarkb | the thing prior to that runs at 9:00 UTC and is the disabled account reminder send | 16:52 |
clarkb | given that I don't think its digests but dstat should help us further rule that in or out | 16:54 |
fungi | and /etc/cron.d/mailman-senddigest-sites seems to fire each of the sites after 12:00 as well | 16:54 |
fungi | nothing in /etc/cron.d/mailman really correlates either | 16:55 |
clarkb | my hunch continues to be its some email that makes the IncomingRouter's pipeline modules unhappy | 16:56 |
clarkb | or a pipeline module | 16:56 |
clarkb | and we are seeing the openstack IncomingRunner (not router) in a half state due to that right now | 16:57 |
corvus | i have a vague memory we started logging emails at some point to try to catch something; is that still in place? | 17:02 |
fungi | logging e-mail content? i believe i remember doing something like that at one point but only for a specific list | 17:03 |
fungi | and technically we have logs of what messages arrived and were accepted courtesy of exim, though if the message winds up not getting processed because some fork handling it is killed in an oom, we might not have record of the message content | 17:04 |
fungi | or it may still be sitting in one of the incoming dirs or something | 17:04 |
fungi | but yeah, if memory serves we were recording posts for the openstack-discuss (or maybe it was long enough ago to be openstack-dev?) ml to have unadulterated copies for comparison while investigating dmarc validation failures | 17:07 |
clarkb | I'm making a top-mem plugin that shows pid | 17:07 |
clarkb | currently trying to sort out output width formatting :) | 17:08 |
fungi | and yeah, i agree the stuff i saw dstat logging didn't seem particularly more useful than what cacti is trending, just better granularity and more reliable recording | 17:15 |
fungi | but i'd never looked at what devstack's generating so wasn't really clear what we should expect out of dstat | 17:15 |
fungi | i assumed it was basically like a fancy version of sar | 17:15 |
clarkb | fungi: ~clarkb/.dstat/dstat_top_mem_adv.py is a working modification to the shipped top mem plugin to show pid | 17:16 |
clarkb | fungi: if you put it in the dstat process owner's homedir at .dstat/dstat_top_mem_adv.py you can use it with --top-mem-adv flag | 17:16 |
clarkb | fungi: ya the thing it does that is useful for us is tracking memory use up to the OOM I think | 17:17 |
clarkb | whereas with OOM output you only get that once oomkiller has run | 17:17 |
clarkb | (and cacti isn't really tracking with that same granularity) | 17:18 |
fungi | clarkb: like the command i have staged in the root screen session? i copied that file into place and added --top-mem-adv to the previous command, after copying the old dstat-csv.log out of the way | 17:19 |
clarkb | fungi: I don't think its compatible with csv output /me looks at root screen | 17:19 |
clarkb | fungi: ya that looks right. lets try it and see what csv looks like? | 17:20 |
clarkb | if it fails it fails and we figure out redirecting the table output to file instead | 17:20 |
fungi | dstat: option --top-mem-adv not recognized | 17:27 |
clarkb | fungi: where did you copy the file? | 17:27 |
clarkb | it should be ~root/.dstat/dstat_top_mem_adv.py | 17:28 |
fungi | d'oh, i did it into my homedir. scattered today | 17:28 |
fungi | sudo doesn't re-resolve ~ when your shell is already expanding it :/ | 17:28 |
fungi | that's better :) | 17:29 |
fungi | thanks | 17:29 |
clarkb | fungi: ok its not doing what I expect and I know why | 17:29 |
clarkb | fungi: let me hack on that plugin a bit more | 17:29 |
fungi | sure | 17:31 |
clarkb | fungi: ok recopy it from my homedir and restart it. It should have the pid, process name, and memory use now | 17:32 |
clarkb | the existing thing only has process name and memory use (no pid) because the csv and tabular output are different parts of the plugin | 17:32 |
clarkb | and hopefully that gives us better clues. Also do we want to reboot ? | 17:32 |
fungi | clarkb: it's running again with the plugin updated | 17:32 |
clarkb | fungi: yup that seems to be happy it has the info about memory in the csv | 17:33 |
fungi | looks like it does make it into the csv, yep | 17:33 |
fungi | should i also be running it with --swap? | 17:34 |
clarkb | fwiw making plugins like that seems pretty straightforward | 17:34 |
clarkb | we could probably write a mailman specific plugin to check queue lengths and such pretty easily | 17:34 |
fungi | yeah, looks like you just copied one and tweaked a few lines? | 17:34 |
clarkb | fungi: yup I copied top-mem and edited it to output pid in addtion to the other stuff | 17:35 |
fungi | cool | 17:35 |
clarkb | just thinking out loud here that maybe if we find it is mailman that is exploding in size the next step is instrumenting mailman and dstat may help us do that too via plugins | 17:35 |
fungi | yep | 17:35 |
clarkb | fwiw the vette log for mailman does report when it gets really large emails. And we only have a 46kb email showing up around that time | 17:36 |
clarkb | I suppose some list could have a really high message size set by admins and someone is sending really large emails? | 17:36 |
clarkb | we wouldn't see that in vette because it wouldn't error? | 17:36 |
clarkb | the existing plugins are in /usr/share/dstat if you want to see them | 17:36 |
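
(The modified plugin in clarkb's homedir isn't reproduced in this log. As a rough standalone sketch of the idea it implements — walk /proc, find the process with the largest resident set, report its name and pid — something like the following would do. The names here are illustrative; the real dstat plugin API instead subclasses dstat's plugin class and fills in self.val.)

```python
# Illustrative sketch only, not the actual ~clarkb/.dstat/dstat_top_mem_adv.py.
import os

def top_mem_process():
    """Return (pid, command, rss_bytes) for the process with the largest RSS."""
    pagesize = os.sysconf('SC_PAGESIZE')
    best = (None, '', 0)
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/statm' % pid) as f:
                rss_pages = int(f.read().split()[1])  # second field: resident pages
            with open('/proc/%s/comm' % pid) as f:
                name = f.read().strip()
        except (IOError, OSError):
            continue  # process exited while we were scanning
        rss = rss_pages * pagesize
        if rss > best[2]:
            best = (pid, name, rss)
    return best

if __name__ == '__main__':
    pid, name, rss = top_mem_process()
    print('%s (pid %s): %d MiB resident' % (name, pid, rss // (1024 * 1024)))
```
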
fungi | should i also be running it with --swap? how about --top-cpu-adv and --top-io-adv? | 17:37 |
clarkb | fungi: ya maybe we should just add in as much info as possible | 17:37 |
fungi | those are the other ones the foreground dstat uses in devstack | 17:37 |
clarkb | the cpu and io info could be useful if its also processing some email furiously while expanding in size | 17:37 |
clarkb | we'd probably see that info there and it would help us understand the behavior better | 17:38 |
fungi | KeyError: 'i/o process' | 17:38 |
fungi | hrm | 17:38 |
fungi | maybe raised at dstat_top_io_adv.py +75 | 17:38 |
clarkb | fungi: it only does that when run with --output | 17:39 |
clarkb | I think the plugin doesn't support csv output. Maybe just drop that on | 17:39 |
clarkb | *that one | 17:39 |
fungi | i've left out --top-io-adv for now | 17:39 |
fungi | yeah, that's what i figured too | 17:39 |
clarkb | and that key is definitely unused elsewhere in the plugin and not set anywhere, so it's a bug in dstat | 17:41 |
fungi | maybe that's why devstack only uses it in the foreground process | 17:41 |
clarkb | fungi: other than watching it to see what happens I think other things we can do are: reboot, move digest cron time, log all incoming email sizes (or complexity if we can do that somehow) | 17:42 |
clarkb | corvus: ^ you might know how to make exim logging richer? I don't see message sizes but maybe we can start there? | 17:43 |
clarkb | oh unless is S=\d+ a size? | 17:44 |
corvus | clarkb: it is | 17:45 |
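
(For reference: exim's main log records each accepted message on a "<=" arrival line that includes an S=<bytes> field, so sizes around the window of interest can be pulled with something like the one-liner below. The mainlog path is the Debian default and the timestamp pattern is an assumption.)

```sh
# hypothetical one-liner; adjust the date/hour prefix and log path as needed
awk '/^2020-04-24 11:2/ && / <= / {
  for (i = 1; i <= NF; i++) if ($i ~ /^S=/) print $1, $2, $i
}' /var/log/exim4/mainlog
```
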
corvus | fungi: yes! the dmarc debug thing is what i was thinking of | 17:45 |
corvus | is that still in place? | 17:46 |
fungi | checking... | 17:46 |
corvus | fungi: yes /var/mail/openstack-discuss | 17:47 |
corvus | we probably all have that content anyway, but that's the raw incoming messages to that list if we need them | 17:47 |
clarkb | ooh more debugging tools | 17:48 |
fungi | corvus: looks like we used a mailman_copy router for that one lisyt | 17:48 |
fungi | list | 17:48 |
clarkb | fwiw we are looking for something around 11:28UTC today that triggered the memory increase | 17:48 |
clarkb | (also how funny would it be if that is the source of the issue :P) | 17:48 |
clarkb | (I mean that file is quite large and if it has to be opened to write to...) | 17:49 |
fungi | if so it's something getting sent to the ml at around the same time every day, which would be weird | 17:49 |
fungi | or opening that file around the same time every day | 17:50 |
clarkb | fungi: it does make me wonder if that is why the incoming runner for openstack is 800MB in resident size right now | 17:50 |
corvus | clarkb: that's written by exim, mm doesn't know about that file | 17:50 |
clarkb | corvus: ah thanks | 17:50 |
fungi | also exim is appending the file, so probably just does a seek based on file length? | 17:52 |
clarkb | ok I think Imay have found something weird | 18:18 |
clarkb | Subject: Re: [openstack-dev] [Tacker] <- that thread received a message at 11:22UTC according to exim | 18:19 |
clarkb | that message isn't in the openstack-discuss mailman archive | 18:19 |
clarkb | it's possible that is just fallout but may also indicate a trigger? | 18:19 |
fungi | it may just be in the moderation queue, i haven't checked held messages yet today | 18:20 |
fungi | i'll go through them now | 18:20 |
clarkb | I also notice that at 01:04UTC april 23 we get a message with a fairly large zip attachment. It's clearly spam, but not sure if it ever gets to mailman or not (could be mailman has a sad dealing with attachments like that?) | 18:23 |
fungi | clarkb: yeah, there were several posts to that thread from a non-subscriber i'm approving momentarily (once i get through skimming the remaining spam) | 18:28 |
fungi | largest is 183kb, which is excessive but not significantly so | 18:29 |
clarkb | fungi: ya it comes in right before we have issues though. Also it has a bunch of utf8 in it | 18:29 |
clarkb | its got a few things that make it stand out but not necessarily point to it being the issue | 18:30 |
fungi | okay, all caught up on the 10 lists i moderate, nothing really jumped out at me as unusual | 18:31 |
fungi | those missing messages should show up soon | 18:32 |
clarkb | ya I see them now | 18:32 |
clarkb | fungi: do you think the zip spam yesterday is anything to be concerned about? | 18:33 |
clarkb | I guess at this point its probably better to stop looking without direction and wait for more dstat info | 18:33 |
clarkb | my rtt to the AP just got really bad. Need a local network reset I think | 18:35 |
clarkb | fungi: I guess we might also consider adding swap to the host as a stop gap? | 18:35 |
fungi | we get tons of spam with attachments, i doubt exim or mailman attempt to expand them | 18:36 |
clarkb | mordred: did you add the zuul user to the hosts? | 20:01 |
*** yoctozepto has joined #opendev-meeting | 20:01 | |
fungi | /etc/passwd on zm02 was last modified 25 minutes ago | 20:02 |
clarkb | in any case there is no /home/zuul/.ssh/id_rsa* | 20:02 |
mordred | so - we have a uid/gid conflict | 20:02 |
mordred | playbooks/group_vars/zuul.yaml:zuul_user_id: 10001 | 20:02 |
clarkb | I think what happened was you added a zuul user and that changed the zuul user's homedir | 20:02 |
clarkb | and now the homedir has no ssh key | 20:02 |
clarkb | this is why we added a zuulcd user | 20:02 |
mordred | no - wait | 20:02 |
clarkb | /var/lib/zuul/ssh/id_rsa is where the key is I Think | 20:03 |
mordred | we did not add the zuul user with the additional_users mechanism we used on bridge | 20:03 |
mordred | we added the zuul user in playbooks/roles/zuul/tasks/main.yaml explicitly | 20:03 |
mordred | and did, in fact, set the home dir to /home/zuul | 20:03 |
clarkb | ah I see | 20:04 |
mordred | it seems that was a mistake | 20:04 |
fungi | the zuul group seems to have maybe been changed from 3000 to 10001 though | 20:04 |
AJaeger | looks like we have the same problem now also on https://review.opendev.org/722309 | 20:04 |
mordred | and we should set it to /var/lib/zuul ? | 20:04 |
clarkb | mordred: I think /var/lib/zuul is what it was | 20:04 |
mordred | should that be zuul's home on all of the machines? | 20:04 |
clarkb | but I'm going to check puppet now | 20:04 |
mordred | thanks | 20:04 |
fungi | and yeah, there's stuff in /var/lib/zuul and /home/zuul group-owned by an anonymous gid 3000 | 20:05 |
clarkb | oh no it was /home/zuul before | 20:05 |
clarkb | I think its just that the uids changed? | 20:05 |
fungi | uid still seems the same as it was | 20:05 |
clarkb | so the issue is adding a new uid? | 20:05 |
clarkb | which orphaned the perms on those files? | 20:05 |
fungi | i don't see a new uid, just gid | 20:05 |
mordred | the ansible thinks it wants to set the uid and gid to 10001 | 20:06 |
fungi | looks like the old zuul group was probably 3000 but has been replaced by group 10001 | 20:06 |
clarkb | -r-------- 1 3000 10001 1675 Apr 24 19:40 id_rsa | 20:06 |
clarkb | it can't use the ssh key because it's owned by the old uid | 20:06 |
fungi | possible paramiko/openssh don't like incorrect group ownership, yeah | 20:06 |
clarkb | group doesn't matter there | 20:06 |
clarkb | its 400 | 20:06 |
clarkb | but note the uid is 3000 | 20:07 |
fungi | you keep saying the old uid, but the old uid and new uid still seem to be 3000 | 20:07 |
fungi | i only see a change in the gid | 20:07 |
clarkb | oh we didn't change the uid? | 20:07 |
fungi | not that i have seen evidence of yet | 20:07 |
clarkb | fungi: I was going off of "mordred | the ansible thinks it wants to set the uid and gid to 10001" | 20:07 |
mordred | that's what we have set in ansible | 20:07 |
fungi | the zuul user in /etc/passwd is still 3000 | 20:08 |
fungi | on ze02 anyway | 20:08 |
mordred | so - we told ansible to set uid and gid to 10001 but that only worked for the group it seems | 20:08 |
corvus | usermod refuses to change the uid of a user if it's running processes | 20:09 |
corvus | (that task should have failed) | 20:09 |
mordred | ok - so maybe let's change ansible back to 3000 for both of these? | 20:09 |
mordred | I thought 10001 was the correct value, but clearly it wasn't | 20:09 |
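
(The task in question isn't quoted in the log; reconstructed from the conversation, it is roughly an Ansible user task of the following shape. Field values other than the 10001 uid/gid mentioned above are assumptions.)

```yaml
# reconstructed sketch, not the actual playbooks/roles/zuul/tasks/main.yaml
- name: Add zuul user
  user:
    name: zuul
    uid: "{{ zuul_user_id }}"   # 10001 in playbooks/group_vars/zuul.yaml at this point
    group: zuul
    home: /home/zuul
    shell: /bin/bash
```

(Because the user module shells out to usermod, and usermod refuses to renumber an account that has running processes, only the group ended up changed on the live hosts — which is what corvus notes above.)
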
clarkb | why would ssh fail to read the file if its got the correct perms? maybe the process is the different uid? | 20:09 |
*** gouthamr has joined #opendev-meeting | 20:09 | |
corvus | it is the correct value for the container | 20:09 |
mordred | corvus: yeah - I thought we picked the value for the container to match opendev's prod | 20:10 |
clarkb | oh | 20:10 |
mordred | but clearly that's not correct | 20:10 |
fungi | clarkb: openssh and paramiko are both paranoid about permissions and ownership on key files, so it could just be that it doesn't like the random group id even though that group lacks read perms | 20:10 |
fungi | anyway, this is at least *a* problem, it may not be the only problem | 20:11 |
clarkb | fwiw there are no containers running | 20:12 |
clarkb | so we aren't having a mismatch between container and host uids | 20:12 |
corvus | are we sure the key didn't change? | 20:12 |
corvus | running this as zuul: ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 review.opendev.org -vvv gerrit ls-projects | 20:12 |
fungi | corvus: i'm not sure, no | 20:12 |
corvus | that doesn't complain about perms at all | 20:13 |
fungi | /var/lib/zuul/ssh/id_rsa was modified 35 minutes ago | 20:13 |
fungi | so may also have been overwritten with incorrect content | 20:13 |
clarkb | the error complains about publickey too | 20:13 |
corvus | that seems like the theory we should work to confirm or discount right now | 20:13 |
mordred | clarkb: we set the ansible values to match the container values so that when we started the containers all of the permissions would match. we did this thinking that we'd picked the container values to match the production opendev values. that is why there is a numeric gid mismatch | 20:14 |
clarkb | drwxr-xr-x 5 zuul 3000 4096 Apr 24 20:10 zuul | 20:14 |
mordred | but clearly those were not actually our production values | 20:14 |
clarkb | is it possibly looking in /home/zuul/.ssh/known_hosts to verify host keys (or similar) and finding the homedir is owned by the wrong group and bailing out? | 20:15 |
fungi | `sudo -u zuul -- ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 review.opendev.org` gives me "Permission denied (publickey)." | 20:15 |
mordred | so I'd say one thing we should do - that may have _zilch_ to do with this issue - is that we should reset the ansible uid and gid values to 3000, since that's what the files are on disk - and then add a user: 3000 to the docker-compose files so that when we start docker things will also work | 20:15 |
corvus | mordred: i disagree, but let's talk about that later | 20:15 |
corvus | i suspect we're rejecting every new patchset now | 20:16 |
*** tristanC has joined #opendev-meeting | 20:16 | |
corvus | in fact | 20:16 |
corvus | we may want to see if we can pause the mergers and executors | 20:16 |
corvus | that may reduce the carnage | 20:16 |
clarkb | corvus: should be able to just stop them entirely? | 20:16 |
clarkb | then the scheduler will maintain the state? | 20:16 |
corvus | clarkb: yeah, if we are okay throwing away the jobs | 20:16 |
corvus | do we have executor pause running yet? | 20:16 |
clarkb | I don't think we've restarted zuul any time recently | 20:17 |
fungi | status notice zuul is reporting new patches in merge conflict erroneously due to a configuration error, fix in progress | 20:17 |
fungi | should we send something like that? ^ | 20:17 |
corvus | fungi: sounds good; clarkb you stop all mergers, i'll do something to the executors | 20:17 |
clarkb | ok stopping mergers now | 20:18 |
gouthamr | +1, helps me stop rechecking :) | 20:18 |
clarkb | systemctl stop zuul-merger doesn't seem to be working fwiw | 20:18 |
clarkb | did we remove the unit? | 20:18 |
fungi | #status notice The Zuul project gating service is reporting new patches in merge conflict erroneously due to a configuration error, fix in progress | 20:18 |
openstackstatus | fungi: sending notice | 20:18 |
clarkb | oh there it goes, maybe just a delay | 20:18 |
mordred | don't think so - that was going to be cleanup | 20:18 |
clarkb | ok doing the other 7 now | 20:18 |
-openstackstatus- NOTICE: The Zuul project gating service is reporting new patches in merge conflict erroneously due to a configuration error, fix in progress | 20:19 | |
clarkb | zm01-08 should all be stopped or stopping now | 20:20 |
fungi | do the mergers need to be stopped as well? | 20:20 |
fungi | (the independent mergers i mean) | 20:20 |
clarkb | fungi: that is what I stopped | 20:20 |
mordred | fungi: clark has stopped the mergers - corvus is working on executors | 20:20 |
corvus | all the executors should be paused, meaning they aren't accepting new jobs, but a lot of them are in retry loops for git checkouts | 20:20 |
fungi | oh, right, i misread | 20:20 |
corvus | so they are still logging errors | 20:21 |
corvus | but i don't think stopping them would produce a different outcome | 20:21 |
mordred | ++ | 20:21 |
clarkb | the operating theory is that we've changed the private key so we no longer authenticate against the public key in gerrit? | 20:21 |
mordred | clarkb: corvus used the key to talk to gerrit and it worked | 20:21 |
corvus | mordred: no it failed | 20:21 |
fungi | clarkb: that seems likely if my reproducer quoted above is the correct way to test it | 20:21 |
mordred | oh! I misread | 20:22 |
fungi | `sudo -u zuul -- ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 review.opendev.org` gives me "Permission denied (publickey)." | 20:22 |
openstackstatus | fungi: finished sending notice | 20:22 |
corvus | so yeah, i think that's the unconfirmed theory which we should confirm now, probably by untangling the old hiera values and comparing? | 20:22 |
clarkb | I've confirmed the key in brdige hiera is different than what we have on zm02 | 20:22 |
mordred | clarkb: which key? | 20:22 |
fungi | i also tried zuul@review.opendev.org too just to make sure | 20:23 |
mordred | zuul_ssh_private_key_contents is what ansible wants to install | 20:23 |
clarkb | bridge:/etc/ansible/hosts/group_vars/zuul-merger.yaml:zuulv3_ssh_private_key_contents != zm02:/var/lib/zuul/ssh/id_rsa | 20:23 |
mordred | ah - wrong hiera key | 20:23 |
clarkb | note I've not confirmed that is the correct key | 20:24 |
clarkb | but going to try and do that now | 20:24 |
corvus | $zuul_ssh_private_key = hiera('zuul_ssh_private_key_contents') | 20:24 |
clarkb | $zuul_ssh_private_key = hiera('zuulv3_ssh_private_key_contents') | 20:24 |
clarkb | my line is from the zm node block | 20:24 |
corvus | i think that's what puppet writes out to '/var/lib/zuul/ssh/id_rsa' | 20:25 |
corvus | clarkb: mine is from executor | 20:25 |
clarkb | we may have called it different things in different places? | 20:25 |
corvus | in site.pp | 20:25 |
clarkb | ya looks like it may be different key between merger and executor | 20:25 |
corvus | wow | 20:25 |
mordred | no wonder it broke then | 20:26 |
clarkb | well both have different values than what is on the merger right now | 20:26 |
mordred | so maybe we should just update the key name in the zuul merger hiera | 20:26 |
clarkb | and they are different from each other I think | 20:26 |
mordred | if zuul_ssh_private_key_contents is what we're using elsewhere? | 20:26 |
corvus | do we have both public keys set for that user in gerrit? | 20:27 |
mordred | looking | 20:27 |
mordred | oh wait - not sure how to look - do I need to log in as that user to gerrit? | 20:28 |
clarkb | we also have two different keys in used on executors | 20:28 |
corvus | i'll poke at the gerrit db | 20:28 |
fungi | i confirm the /etc/ansible/hosts/group_vars/zuul-merger.yaml:zuulv3_ssh_private_key_contents works with my reproducer command above | 20:28 |
clarkb | ok I think I've got it | 20:28 |
fungi | so that seems to confirm it's the one we want | 20:28 |
mordred | awesome | 20:28 |
clarkb | zuul_ssh_private_key might be for executor to test nodes | 20:28 |
clarkb | gerrit_ssh_private_key on the executor is ssh from executor to gerrit | 20:28 |
clarkb | and on zuul mergers we called that zuulv3_ssh_private_key | 20:29 |
clarkb | trying to compare them in hiera now | 20:29 |
clarkb | yes gerrit_ssh_private_key == zuulv3_ssh_private_key | 20:29 |
clarkb | between zuul-executor.yaml and zuul-merger.yaml | 20:29 |
fungi | it's definitely the correct file content anyway (or at least it's *a* key the zuul user in gerrit can authenticate with) | 20:29 |
mordred | wow | 20:29 |
mordred | so - in the current ansible we want that to be zuul_ssh_private_key_contents everywhere | 20:30 |
clarkb | mordred: and then you need to do something to write out the executor to test node key | 20:30 |
clarkb | (I mean thats probably alerady done, we just need to make sure we get it right too) | 20:30 |
corvus | there is one public key in gerrit. i could paste it but it doesn't seem necessary since that's easy to test. | 20:30 |
mordred | clarkb: yeah | 20:31 |
mordred | clarkb: I think we can followup with the executor-to-test-nodes key since I don't think we're touching it right now | 20:31 |
clarkb | I'll see if I can find it on ze01 and cross check against hiera | 20:31 |
fungi | well, like i said, i tested authenticating with it, and it does work | 20:32 |
mordred | but if we update hiera for zuul-to-gerrit-key to be called zuul_ssh_private_key_contents and then re-reun ansible, we should get the right keys back everywhere - right? | 20:32 |
fungi | oh, you mean the test node key | 20:32 |
fungi | yeah | 20:32 |
mordred | but I think we don't have to solve that key right now, because it shouldn't have changed from the last time puppet wrote it if we're not writing it | 20:32 |
clarkb | mordred: confirmed those other keys look ok | 20:33 |
clarkb | so its just the zuul to gerrit key that got mismatched | 20:33 |
clarkb | as well as gids? | 20:33 |
mordred | clarkb: cool. so - you've got your head wrapped around the key names in hiera - can you update the key names? | 20:33 |
clarkb | (and possibly uids if ansible manages to do its thing?) | 20:33 |
fungi | so far that's all i've seen wrong at least | 20:33 |
clarkb | mordred: I don't know where I need to update them | 20:34 |
mordred | clarkb: in hiera | 20:34 |
clarkb | mordred: right but in which group or host? | 20:34 |
mordred | all of them | 20:34 |
clarkb | mordred: sorry let me rephrase | 20:34 |
mordred | ansible wants the key to be zuul_ssh_private_key_contents | 20:34 |
clarkb | zuul-merger doesn't have the key you used and yet it still updated the contents | 20:34 |
clarkb | I don't know where it is finding those contents | 20:34 |
mordred | group_vars/zuul-merger.yaml:zuul_ssh_private_key_contents: | | 20:35 |
mordred | clarkb: that's where ansible got the contents from | 20:35 |
mordred | for the mergers | 20:35 |
clarkb | oh it is in there sorry | 20:35 |
mordred | *phew* | 20:35 |
fungi | zuul_ssh_private_key_contents seems to exist in zuul-executor.yaml, zuul-merger.yaml and zuul-scheduler.yaml according to git grep | 20:35 |
mordred | I was worried there was even more weirdness | 20:35 |
clarkb | ok now my concern is: if I change that, are we consuming it anywhere it is expected to be something else? | 20:36 |
clarkb | mordred: did you add that key to zuul-merger and zuul-scheduler recently? | 20:36 |
mordred | no - so - I think my error is that I expected we called the key the same thing in all three places | 20:36 |
mordred | but that is apparently not what we did | 20:36 |
clarkb | right so if I change the value of that key, anything consuming the key for its intended purpose may break | 20:37 |
mordred | nothing else is consuming that key | 20:37 |
mordred | the only thing consuming this is ansible now | 20:37 |
clarkb | ok | 20:37 |
fungi | codesearch turns up a couple of hits for zuul_ssh_private_key_contents in x/ci-cd-pipeline-app-murano and x/infra-ansible which i expect we can ignore | 20:37 |
corvus | should be able to grep system-config to confirm that | 20:37 |
clarkb | I'm taking the lock then | 20:37 |
mordred | and the only thing it's consuming it for is /var/lib/git/id_rsa | 20:37 |
fungi | aside from that, only hits are in system-config | 20:37 |
clarkb | I'll update zuul_ssh_private_key_contents to be the same value as zuulv3_whatever on zuul-merger | 20:37 |
fungi | playbooks/roles/zuul/tasks/main.yaml, playbooks/zuul/templates/group_vars/zuul-scheduler.yaml.j2 and playbooks/zuul/templates/group_vars/zuul.yaml.j2 | 20:37 |
mordred | yah | 20:38 |
mordred | so as long as we have the correct data referred to by that key, we should be good to go | 20:38 |
mordred | corvus: what do you think we should do about uid/gid? | 20:39 |
corvus | i think we should fix hiera and run ansible first; get that deployed and verify things are running (since it doesn't look like the id issue will be a problem right now) | 20:39 |
mordred | kk | 20:39 |
mordred | and agree | 20:39 |
corvus | then maybe take a breath and talk about that :) | 20:39 |
clarkb | I think zuul-merger.yaml is good now; I kept the old ssh key around under a new hiera key name | 20:40 |
clarkb | I'm going to do the same in the other two files now | 20:40 |
mordred | cool | 20:40 |
mordred | ++ \o/ | 20:40 |
clarkb | if you want to diff zuul-merger you'll see what I did | 20:40 |
fungi | yeah, odds are the changed gid may be purely cosmetic/immaterial, and ansible is failing to change the uid anyway | 20:40 |
corvus | clarkb: maybe go ahead and get rid of it? | 20:40 |
corvus | clarkb: we have git history :) | 20:40 |
corvus | clarkb: and the last thing we need is more keys in those files :) | 20:40 |
fungi | yeah, easy to revert changes to that file | 20:40 |
mordred | yeah - these files are ... busy | 20:40 |
corvus | (we're never going to think it's a better time than right now to make that cleanup) | 20:41 |
clarkb | corvus: ++ I'll remove them | 20:42 |
clarkb | I think I'm done except for that cleanup | 20:42 |
clarkb | zuul-scheduler was already correct | 20:42 |
clarkb | so we had 3 different names for the key in 3 different places. All three are now using the name and value that the scheduler was using | 20:42 |
mordred | awesome | 20:43 |
mordred | clarkb: and then zuul_ssh_private_key_contents_for_ssh_to_test_nodes is the executor key for talking to the test nodes, right? | 20:43 |
clarkb | mordred: yes I've removed that now though | 20:44 |
mordred | clarkb: +old_zuul_ssh_private_key_contents_for_test_node_ssh still seems to be in merger | 20:44 |
clarkb | mordred: should be gone from both now | 20:44 |
fungi | what a marvellously verbose variable name | 20:44 |
mordred | clarkb: I agree | 20:44 |
mordred | ok - so I should re-run the playbook now yes? | 20:44 |
clarkb | mordred: and yes you can confirm by looking at /var/lib/zuul/ssh/nodepool_id_rsa | 20:44 |
clarkb | ya I think hiera/group_vars are ready for it now | 20:44 |
mordred | cool | 20:44 |
fungi | also zuul_ssh_private_key_contents_for_ssh_to_test_nodes appears nowhere in our public repos per codesearch | 20:45 |
mordred | I've joined the screen session on bridge | 20:45 |
fungi | joined | 20:45 |
clarkb | fungi: I added it as a placeholder but then removed it per corvus' suggestion | 20:45 |
mordred | we will expect it to error out eventually at zuul-executor | 20:45 |
fungi | clarkb: oh, got it, so that was a made-up-on-the-spot one | 20:45 |
mordred | but that's ok | 20:45 |
clarkb | mordred: if we are happy with the output of that run we should git add zuul-executor.yaml and zuul-merger.yaml in group_vars and commit the change | 20:46 |
mordred | clarkb: ++ | 20:46 |
mordred | https://review.opendev.org/723023 <-- this is the fix for the executor error that will happen, fwiw | 20:46 |
clarkb | +2 on that change | 20:46 |
mordred | the mergers are stopped not paused, right? | 20:46 |
clarkb | mordred: correct systemctl stop zuul-merger'd | 20:47 |
mordred | ok. I think that means it's going to update the uid | 20:47 |
clarkb | so we may see ansible start them again | 20:47 |
clarkb | hrm that may break us too | 20:47 |
corvus | please carry on, but once this is done, we do need to talk about gid before we reinstate | 20:47 |
mordred | yeah | 20:47 |
mordred | I'm happy to talk about that while this is running | 20:47 |
corvus | if the mergers are stopped, i say we just let ansible do it's remapping | 20:47 |
mordred | kk | 20:48 |
corvus | but we don't want to start the mergers | 20:48 |
mordred | should I cancel ansible real quick? | 20:48 |
clarkb | ya we can chown /var/lib/zuul /var/run/zuul /home/zuul easily enough | 20:48 |
mordred | I don't think it's going to start anything | 20:48 |
corvus | did we land the change where we said "oh it's always safe to start mergers"? | 20:48 |
mordred | actually - it's DEFINITELY not going to start anything | 20:48 |
mordred | no, we did not | 20:48 |
corvus | k | 20:48 |
fungi | confirmed, it did switch up the uid on the mergers just now | 20:48 |
fungi | -r-------- 1 3000 zuul 1675 Apr 24 19:40 id_rsa | 20:49 |
fungi | so that's definitely going to need fixing before we can start them again | 20:49 |
mordred | ok. so the mergers now completely match the container uid/gid - but we'll need to run a chown on some directories | 20:49 |
mordred | yup | 20:49 |
corvus | ok, stable enough to talk about id stuff? | 20:49 |
mordred | I'm good to talk about it | 20:49 |
corvus | File "/usr/local/lib/python3.5/dist-packages/zuul/driver/bubblewrap/__init__.py", line 121, in getPopen | 20:49 |
corvus | group = grp.getgrgid(gid) | 20:49 |
corvus | KeyError: 'getgrgid(): gid not found: 3000' | 20:49 |
corvus | that's why we can't turn things on again until we fix that :/ | 20:50 |
corvus | i think we should just roll forward and re-id everything to the container ids | 20:50 |
clarkb | its using the current process gid I guess? | 20:50 |
fungi | that seems fine to me. we'll need to do the scheduler too right? | 20:50 |
mordred | corvus: I agree - since we're essentially down anyway | 20:50 |
corvus | yeah, that's basically my thinking | 20:50 |
corvus | fungi: maybe? | 20:51 |
fungi | clarkb: the running process is running with gid 3000 which doesn't have a corresponding group name, right | 20:51 |
corvus | i think this is easy for the executors (they'll be just like the mergers; maybe we have to do some chowns) | 20:51 |
corvus | but i'm guessing we do want to do the scheduler too, so maybe we just go ahead and save queues and do a full shutdown? | 20:51 |
mordred | corvus: yeah - and I mean - we needed to shutdown to start up from container - so while it's not ideal, again, we're down, so might as well | 20:52 |
corvus | yep; fungi, clarkb: sound good? if so, i can do the shutdown | 20:52 |
fungi | sounds fine, still here to help | 20:52 |
mordred | do we know which all dirs we need to chown? | 20:52 |
clarkb | ya I think rolling forward is our best choice | 20:52 |
mordred | /var/lib/zuul for sure | 20:52 |
clarkb | otherwise we'll just be chasing down weird bugs likely | 20:52 |
corvus | why don't folks divide up looking on the different machine classes for dirs to change? | 20:53 |
clarkb | mordred: /var/lib/zuul/ /var/run/zuul and /home/zuul I think | 20:53 |
clarkb | there may be others | 20:53 |
mordred | ok. I'm going to write a quick playbook | 20:53 |
* mnaser doesn’t have access but can provide extra hands or eyes doing other things if needed | 20:53 | |
fungi | i can check executors since clarkb's already been tackling mergers and mordred's ansibling | 20:54 |
mordred | https://etherpad.opendev.org/p/Egpih9sInMgzpz0DFF9m | 20:55 |
clarkb | mordred: also /etc/zuul | 20:55 |
mordred | how does that look so far? (mnaser - eyeballs ^^ ?) | 20:55 |
fungi | like originally on the mergers, the executors failed to update the zuul uid, which is still 3000, but have updated the zuul gid to 10001 | 20:55 |
mordred | clarkb: those are the same on both mergers and executors right? is it the same dirs on scheduler? | 20:55 |
clarkb | mordred: yes I believe its the same for mergers and executors | 20:55 |
clarkb | looking at scheduler now | 20:55 |
mnaser | mordred: looks good, i think that covers most folders | 20:55 |
mordred | thank you | 20:55 |
fungi | mordred: some of our files in those paths are root-owned | 20:56 |
fungi | not sure if that matters | 20:56 |
mordred | betcha ansible will happily set them back if it does | 20:56 |
mordred | fungi: got an example? | 20:56 |
fungi | possibly just because they're world-readable and configuration management wasn't told a uid/gid | 20:56 |
mordred | oh - I bet all the /etc/zuul stuff | 20:56 |
corvus | ya | 20:56 |
mordred | yeah | 20:56 |
clarkb | mordred: ya scheduler looks the same | 20:56 |
mnaser | mordred: maybe adding a -v for verbosity just in case to see what actually got changed | 20:57 |
fungi | /etc/zuul/executor-logging.conf for example | 20:57 |
mordred | cool. so that etherpad should take care of it | 20:57 |
mordred | mnaser: ++ | 20:57 |
clarkb | the scheduler has some old extra stuff in /var/run but those aren't used anymore | 20:57 |
clarkb | its just /var/run/zuul now | 20:57 |
corvus | deja vu, btw | 20:57 |
corvus | we did this the last time we landed this change :) | 20:57 |
corvus | maybe drop /etc/zuul and just chown /etc/zuul/github.key | 20:57 |
clarkb | mordred: ^ | 20:58 |
corvus | and zuul.conf | 20:58 |
mordred | so - I think we should a) shut down scheduler, executors, web and fingergw b) run current ansible c) run chown playbook | 20:58 |
mordred | corvus: kk | 20:58 |
corvus | d) run ansible again | 20:58 |
mordred | yes | 20:58 |
corvus | (to achieve steady state in case chown playbook != ansible) | 20:58 |
mordred | yup | 20:58 |
mnaser | corvus: gerrit.key too? or is that in /var/lib/zuul/.ssh/id_rsa ? | 20:58 |
clarkb | mnaser: its /var/lib/zuul/ssh/$files | 20:58 |
mordred | ok - I wrote that etherpad to chown-zuul.yaml | 20:59 |
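
(The etherpad contents aren't preserved here; the chown-zuul.yaml being drafted would have looked roughly like this first version. The paths are the ones named in the conversation, while the host groups and task names are assumptions. It was later run as `ansible-playbook -f 20 chown-zuul.yaml`, and github.key was subsequently dropped from the first task.)

```yaml
# hypothetical reconstruction of chown-zuul.yaml as first drafted
- hosts: zuul-scheduler:zuul-merger:zuul-executor
  become: true
  tasks:
    - name: Chown zuul state, run, and home directories recursively
      file:
        path: "{{ item }}"
        state: directory
        owner: zuul
        group: zuul
        recurse: true
      loop:
        - /var/lib/zuul
        - /var/run/zuul
        - /home/zuul
    - name: Chown specific zuul config files
      file:
        path: "{{ item }}"
        owner: zuul
        group: zuul
      loop:
        - /etc/zuul/zuul.conf
        - /etc/zuul/github.key   # later removed; the file only exists on the scheduler
```
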
corvus | yeah, our install is... weird and not recommended :) | 20:59 |
corvus | it's a zuulv0 install | 20:59 |
mnaser | hehe, also, the -v might have a _lot_ of output | 20:59 |
clarkb | ya the git repos have lots of files | 20:59 |
mnaser | esp when it hits the repo folders i guess | 20:59 |
clarkb | mordred: ^ | 20:59 |
fungi | there are a few files in /etc on executors with old gids but that may be because they're stale files? /etc/zuul/site-variables.yaml, /etc/zuul/ssl/{ca.pem,client.key,client.pem} | 20:59 |
mordred | oh - maybe we don't want the -v | 20:59 |
clarkb | or do a -v for everything but /var/lib/zuul | 21:00 |
clarkb | but definitely be careful with -v in /var/lib/zuul :) | 21:00 |
corvus | fungi: those files are important | 21:00 |
fungi | we *can* also use find to look for files/directories with old uids and gids | 21:00 |
mordred | clarkb: like that? | 21:01 |
mnaser | fungi: that's a good idea too, depending on how happy the iops on the system are :) | 21:01 |
corvus | we may just not have installed the site-variables file in this playbook, since it comes from system-config | 21:01 |
clarkb | mordred: the repos are in /var/lib/zuul/git iirc | 21:01 |
clarkb | mnaser: and then there is another set of hardlinked repos on the executors | 21:01 |
clarkb | er sorry mordred ^ | 21:01 |
corvus | clarkb: we shouldn't have any hardlinked repos | 21:01 |
mordred | ok. but I mean - does the playbook update look better? | 21:01 |
fungi | corvus: good point, those files may reflect things we're missing in ansible too | 21:01 |
corvus | let me make sure we have cleaned the build dir | 21:02 |
mordred | these: /etc/zuul/ssl/{ca.pem,client.key,client.pem} changed in the new rollout | 21:02 |
mnaser | find / -uid $olduid --- could very well be useful if the drives are fast enough | 21:02 |
mordred | they're /etc/zuul/ssl/gearman-{ca.pem,client.key,client.pem} | 21:02 |
clarkb | mordred: ya it looks good though corvus is mentioning we should clear out the build dirs | 21:02 |
mordred | ah - good point | 21:02 |
mnaser | as suggested by fungi | 21:02 |
fungi | mordred: i can confirm, those new filenames have the correct ownership | 21:02 |
corvus | derp, my shutdown was hung on an interactive step; i've resumed it but it's still proceeding | 21:03 |
fungi | it's the old filenames which do not | 21:03 |
mnaser | wait actually | 21:03 |
mordred | site-variables hasn't been written yet because of the executor bug | 21:03 |
corvus | mordred: wait for an all clear on shutdown from me before running | 21:03 |
mordred | corvus: yes | 21:03 |
clarkb | mordred: fungi we should check that the new files are used by zuul? | 21:03 |
mnaser | find / -uid old_zuul_uid -exec chown zuul:zuul {} \; | 21:03 |
mnaser | thoughts about that ^ ? | 21:03 |
mordred | clarkb: I did earlier - but please double-check that | 21:03 |
mordred | because this is one of those cascading failures sorts of days | 21:04 |
mnaser | (if we're traversing the entire file system, might as well as check if its the right uid and chown) | 21:04 |
clarkb | mnaser: I'ev checked on zm01 and the new files appear to be what is used | 21:04 |
mnaser | s/entire/most/ | 21:04 |
fungi | mnaser: i'd probably do without the -exec first to see what's found | 21:04 |
clarkb | mnaser: it will likely be incredibly slow too | 21:04 |
fungi | so that we can look into why they got skipped | 21:04 |
mordred | I think I'm ok with just the targeted chown - but maybe let's do that as a print later to make sure we didn't miss something | 21:04 |
fungi | but yeah, may want to exclude /var/lib/zuul from the find | 21:04 |
fungi | since we're chowning it all anyway | 21:05 |
fungi | and that'll be the bulk of the chew anyway | 21:05 |
mnaser | cool, fair - i figured since chown -Rv will probably scan all inodes, find was going to do the same, but that's probably an overoptimization | 21:05 |
mordred | fungi: the old filenames got skipped because ansible doesn't reference them at all | 21:05 |
corvus | fwiw, we could also just remove the git repos | 21:05 |
fungi | mordred: yeah, that's what i gathered when you said they'd been renamed in ansible | 21:05 |
corvus | bit slower on startup, but it's friday afternoon, going into a weekend, and they all need pruning anyway | 21:05 |
mordred | ah - nod | 21:05 |
mnaser | corvus: when was the last time an executor started with a clean slate? maybe that might take a while and introduce all other fun things too | 21:06 |
clarkb | mnaser: you don't need to get the old file stat to update its perms | 21:06 |
mordred | corvus: fine by me | 21:06 |
corvus | mnaser: a few months | 21:06 |
clarkb | mnaser: I expect it won't do a full stat | 21:06 |
mnaser | clarkb: ah yes good point | 21:06 |
corvus | mnaser: (they get terribly slow, and need periodic cleaning at opendev scale; we *just* merged a patch from tobiash to fix that, and i think we'll be restarting with it in place) | 21:07 |
mordred | corvus: looks like there's no python processes on scheduler | 21:07 |
corvus | not ready yet | 21:07 |
mordred | kk | 21:07 |
corvus | okay, everything is shutdown, and there are no files in builds | 21:08 |
corvus | i think now we decide if we want to delete the git repos or chown them | 21:08 |
mnaser | corvus: maybe start with chown and see how quickly its churning through them? | 21:09 |
corvus | sure, it's a second task anyway | 21:09 |
mordred | corvus: I think I can run the service-zuul playbook now while we discuss yeah? | 21:09 |
corvus | mordred: ++ | 21:09 |
mordred | ok. I'm going to run that now | 21:09 |
corvus | and restarting with the new code will immediately correct and gc them as soon as they are used (now *that* might be slow) | 21:09 |
corvus | but it'll be a good upgrade test :) | 21:10 |
mordred | ++ | 21:10 |
clarkb | are we starting on containers or old init? | 21:10 |
mordred | containers | 21:10 |
corvus | i think we're gonna be starting on containers (except executors) | 21:10 |
mordred | yeah | 21:10 |
fungi | so really really roll forward | 21:10 |
mordred | yup! we needed to restart everything to pick up the container change anyway - so, you know, here we are :) | 21:10 |
mnaser | if not using init then are the (assuming systemd) units disabled just in case? | 21:10 |
clarkb | mnaser: we've done that as a followup in other cases yes | 21:11 |
corvus | yeah. the upside of the 'unscheduled outage on friday' approach is this is going to go a lot faster than it would have otherwise :) | 21:11 |
mordred | corvus: jfdi ftw! | 21:11 |
clarkb | mnaser: basically systemctl stop foo && systemctl disable foo | 21:11 |
mnaser | ok cool | 21:11 |
mnaser | no one wants a reboot surprise :) | 21:11 |
mordred | yea - I think we need to write a cleanup playbook that does the disable and then deletes the units | 21:12 |
corvus | i would like a tequila sunrise | 21:12 |
mordred | corvus: now _I_ want a tequila sunrise | 21:12 |
fungi | nobody expects the reboot inquisition! | 21:13 |
clarkb | I might need to go make an okolehao something (https://www.islanddistillers.com/product/okolehao-100-proof-cork/) | 21:13 |
clarkb | you'll like the literal translation on that | 21:13 |
clarkb | but its the only alcohol I have left I Think | 21:13 |
mordred | clarkb: sandy thinks you should save that for disinfecting things | 21:14 |
corvus | not high enough proof :( | 21:14 |
fungi | only good for disinfecting your digestive tract | 21:14 |
mordred | or your iron butt | 21:14 |
mordred | ok. the playbook is done | 21:15 |
mordred | we ready for the chown playbook? | 21:15 |
corvus | ++ | 21:15 |
* fungi nods affirmatively | 21:15 | |
clarkb | I think all the things I was doing are long done | 21:15 |
mordred | cut/paste looks ok? | 21:15 |
corvus | ++ | 21:15 |
mordred | -f 20 is 20 forks right? | 21:16 |
fungi | that's my understanding | 21:16 |
mordred | I will now run that command | 21:16 |
fungi | fireball 20 | 21:16 |
mordred | ugh. what's page up in screen scrollback? | 21:17 |
fungi | the pgup key usually | 21:17 |
corvus | for me "page up" | 21:17 |
mordred | (I saw a bunch of error stream by) | 21:17 |
mordred | can one of you do that? I have a stupid keyboard | 21:17 |
corvus | we're at the top | 21:17 |
fungi | yeah, there's a limited buffer by default | 21:17 |
mordred | oh. well - let's run it and re look later then | 21:18 |
mnaser | mordred: if stupid key that being mac keyboard, fn+arrow up is page up :p | 21:18 |
mnaser | s/key/keyboard/ | 21:18 |
fungi | looks like it just worried about the missing file | 21:18 |
fungi | you told it to chown something which only exists on the scheduler | 21:18 |
mordred | ok. so it might have done everything ok | 21:18 |
mordred | /var/lib/zuul on the scheduler looks decent | 21:19 |
mordred | so - the answer is - recursive chown is quick | 21:20 |
fungi | /var/lib/zuul/ssh/{nodepool_id_rsa,static_id_rsa} on ze01 are 3000:3000 | 21:20 |
fungi | but maybe those are stale files? | 21:20 |
mordred | we still should have chowned them | 21:20 |
mordred | maybe let's pull github.key from the list | 21:20 |
clarkb | fungi: those are unmanaged by ansible currently but a chown -R should've gotten them | 21:21 |
clarkb | fungi: the nodepool_id_rsa file is how we ssh into nodes | 21:21 |
mordred | I'm re-running having taken the github.key file out of the first task | 21:21 |
fungi | makes sense | 21:22 |
mordred | because I'm pretty sure that caused it to bail and not run the second task :) | 21:22 |
clarkb | okolehau + club soda + lime and some ice is drinkable. Not really something you'd probably pay for though | 21:23 |
mordred | I would have expected zuul01 to take the longest | 21:26 |
mordred | done | 21:27 |
mordred | 6 minutes total fwiw - so still not terrible for a chown | 21:27 |
mordred | folks want to verify that stuff looks chown'd? | 21:27 |
mordred | then we can run the service-zuul playbook again | 21:27 |
fungi | those keyfiles in /var/lib/zuul/ssh/ on ze01 are correct now | 21:27 |
clarkb | zm01 looks good to me in /var/lib | 21:28 |
fungi | /home/dmsimard on ze01 has some files with old ownership | 21:29 |
fungi | which is fine | 21:29 |
mordred | if that breaks zuul something is horribly wrong :) | 21:29 |
fungi | indeed | 21:29 |
mordred | I will now re-run service-zuul - yeah? | 21:29 |
fungi | the aforementioned files in /etc/zuul still have old gids | 21:30 |
fungi | but they weren't included in the chown | 21:30 |
mordred | (should I fix the jemalloc line in the executor role?) | 21:30 |
fungi | /etc/zuul/site-variables.yaml /etc/zuul/ssl/ca.pem /etc/zuul/ssl/client.pem /etc/zuul/ssl/client.key | 21:30 |
corvus | mordred: oh, i guess if it's aborting due to that error we should fix it | 21:30 |
mordred | I think since we're rolling forward - we should go ahead and fix that | 21:30 |
fungi | those last three are not used by the container sounded like, as theyve been renamed | 21:30 |
mordred | yeah | 21:30 |
fungi | i'm unsure whether /etc/zuul/site-variables.yaml needs correcting though? | 21:31 |
mordred | let's see how it is after this ansible run | 21:31 |
mordred | since we won't abort on zuul-executor now (we hope) | 21:31 |
fungi | i guess it's managed by a different playbook we haven't run? or just missing from ansible entirely? | 21:31 |
mordred | we've been bombing out due to the jemalloc bug | 21:32 |
fungi | got it | 21:32 |
mordred | so didn't get to the scheduler role | 21:32 |
mordred | ok - I run playbook | 21:32 |
fungi | also /etc/zuul/site-variables.yaml is world-readable so if we're configuration-managing it then that's probably fine for the moment | 21:32 |
mordred | corvus: so - the real question I have is - why did I think 10001 was the uid of zuul in opendev production? | 21:33 |
clarkb | mordred: was that the uid that we used on static for the zuul user? | 21:34 |
clarkb | to run goaccess? | 21:34 |
mordred | no - I mean - we set 10001 in the zuul images a while ago | 21:34 |
mordred | and we picked it to make it match what we were already running | 21:34 |
mordred | except WOW was that wrong | 21:34 |
clarkb | oh got it | 21:35 |
mordred | also - this same issue exists with nodepool nodes | 21:35 |
clarkb | I have no idea in that case :) | 21:35 |
clarkb | I'm not in the screen, are we running the regular ansible now? | 21:35 |
mordred | yes | 21:35 |
mordred | it's very exciting | 21:36 |
fungi | also i've done a find on /etc, /home, /usr and /var for anything using -uid 3000 -o -gid 3000 | 21:36 |
fungi | no other hits | 21:36 |
fungi | (on ze01) | 21:36 |
corvus | mordred: i don't know, i think i just took your word for it when i asked about it :) | 21:37 |
mordred | corvus: :) | 21:37 |
mordred | corvus: so - how should we handle this for nodepool? | 21:37 |
fungi | mordred is a very believable fellow | 21:37 |
corvus | mordred: same way -- full shutdown and restart? | 21:37 |
corvus | (that's less disruptive) | 21:37 |
mordred | yeah | 21:37 |
mordred | we haven't landed the nodepool patch yet | 21:38 |
clarkb | if we do it in a rolling fashion it won't be an outage | 21:38 |
clarkb | and each nodepool launcher runs indepednently so should be ok? | 21:38 |
mordred | ok - the playbook is done and it looks like it was happy about itself | 21:40 |
mordred | I believe we are now at the place where we can consider running some docker-composes | 21:41 |
fungi | that still hasn't updated ownership for /etc/zuul/site-variables.yaml on the executors, btw | 21:41 |
clarkb | we need mergers for the scheduler to start up | 21:41 |
corvus | lets start a merger? | 21:41 |
mordred | is there some way to do that piecewise in a way that's less disruptive while we check stuff? | 21:41 |
clarkb | corvus: ++ I think merger then the scheduler then the others if they are happy? | 21:41 |
mordred | should we do a chown playbook for site-variables real quick? | 21:41 |
mordred | like that? | 21:42 |
fungi | maybe we should confirm we're ansibling that file at all | 21:42 |
mordred | oh - nope | 21:43 |
mordred | we point to /opt/project-config in conf now | 21:43 |
mordred | playbooks/roles/zuul/templates/zuul.conf.j2:variables=/opt/project-config/zuul/site-variables.yaml | 21:43 |
fungi | codesearch only turns up hits for that path in opendev/puppet-zuul | 21:43 |
fungi | so yeah, that's stale along with the other three files in /etc/ with old ownership | 21:43 |
fungi | we're all set in that case | 21:43 |
mordred | cool. I think we should start a merger | 21:44 |
corvus | let's rm that file then | 21:44 |
clarkb | well we use that file | 21:44 |
* fungi concurs | 21:44 | |
corvus | clarkb: ? | 21:44 |
mordred | clarkb: we do not | 21:44 |
clarkb | oh wait I get it now | 21:44 |
clarkb | we changed the config to consume it directly | 21:44 |
clarkb | rather than writing it in /etc/zuul | 21:44 |
clarkb | got it | 21:44 |
corvus | clarkb, fungi, mordred: ok to rm /etc/zuul/site-variables.yaml ? | 21:44 |
mordred | ++ | 21:45 |
fungi | yeah, we should be able to delete these: /etc/zuul/site-variables.yaml /etc/zuul/ssl/ca.pem /etc/zuul/ssl/client.pem /etc/zuul/ssl/client.key | 21:45 |
corvus | (that way we don't get confused) | 21:45 |
mordred | want me to playbook it? | 21:45 |
clarkb | ya I think so | 21:45 |
fungi | all are using new names or paths now | 21:45 |
corvus | fungi: agreed | 21:45 |
fungi | which is why they have old ownership still, no longer referenced anywhere | 21:45 |
mordred | like that? | 21:45 |
fungi | yep | 21:46 |
mordred | k. running real quick | 21:46 |
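
(A sketch of what that quick cleanup playbook plausibly looked like; the real one was only visible in the screen session, so the host groups and task name are assumptions.)

```yaml
# hypothetical reconstruction of the stale-file cleanup playbook
- hosts: zuul-executor:zuul-merger
  become: true
  tasks:
    - name: Remove stale puppet-era zuul config files
      file:
        path: "{{ item }}"
        state: absent
      loop:
        - /etc/zuul/site-variables.yaml
        - /etc/zuul/ssl/ca.pem
        - /etc/zuul/ssl/client.pem
        - /etc/zuul/ssl/client.key
```
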
mordred | great. worked on executors, not on mergers - which is probably correct | 21:46 |
clarkb | ya mergers won't have site-variables | 21:46 |
fungi | find no longer turns up files under /etc with -uid 3000 -o -gid 3000 | 21:46 |
fungi | lgtm | 21:46 |
clarkb | but will have zuul/ssl/ files | 21:47 |
mordred | yes - we should have /etc/zuul/ssl/gearman-* | 21:47 |
fungi | we do on ze01 | 21:47 |
mordred | and on zm01 | 21:47 |
clarkb | zm01 looks good | 21:47 |
mordred | cool | 21:47 |
mordred | who wants to start a merger? | 21:48 |
corvus | i will | 21:48 |
clarkb | I'm still on zm01 if you want me to do it | 21:48 |
clarkb | k /me lets corvus do it | 21:48 |
corvus | clarkb: you do it | 21:48 |
* mordred wants both clarkb and corvus to do it | 21:48 | |
corvus | clarkb: all you :) | 21:48 |
clarkb | alright starting momentarily | 21:48 |
clarkb | its sudo docker-compose up -d ? | 21:48 |
clarkb | in /etc/zuul-merger? | 21:49 |
mordred | yup | 21:49 |
clarkb | container is up | 21:49 |
clarkb | its waiting for gearman | 21:49 |
clarkb | I think that is about as far as it will get without the scheduler? | 21:49 |
corvus | yep | 21:50 |
corvus | let's start an executor? | 21:50 |
clarkb | wfm | 21:50 |
corvus | i'll do that one :) | 21:50 |
mordred | this is very exciting | 21:50 |
corvus | it's up and running, no complaints (of course it's not a container) | 21:51 |
mordred | good that it's running though :) | 21:51 |
corvus | mordred: can you write a playbook to start the rest of the mergers and executors? | 21:52 |
clarkb | whats startup for executor? same as before? | 21:52 |
corvus | clarkb: yeah systemctl | 21:52 |
mordred | corvus: yes | 21:52 |
clarkb | rgr | 21:52 |
corvus | mordred: i think both 'docker-compose up' and 'systemctl start' are idempotent enough we can just run that on all hosts | 21:52 |
mordred | how's that look? | 21:53 |
mordred | oh - piddle | 21:53 |
mordred | how's that? | 21:53 |
corvus | clarkb: i kind of want a unit file, because stopping/starting zuul is currently much easier than with docker-compose | 21:53 |
corvus | mordred: ++ | 21:53 |
corvus | (like, a unit file to run docker-compose) | 21:53 |
mordred | k. I'm going to run that | 21:53 |
clarkb | ooh ya maybe do that for all our docker-compose too? | 21:54 |
clarkb | we can also do it without -d and systemd will be happy I think | 21:54 |
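That idea -- a unit file wrapping docker-compose, run in the foreground so systemd owns the process -- might look roughly like this; the unit name and the docker-compose binary path are assumptions, not something that was actually deployed here:

```ini
# /etc/systemd/system/zuul-merger.service -- hypothetical sketch
[Unit]
Description=Zuul merger (docker-compose)
Requires=docker.service
After=docker.service network-online.target

[Service]
WorkingDirectory=/etc/zuul-merger
# Foreground (no -d) so systemd tracks the container lifecycle directly
ExecStart=/usr/local/bin/docker-compose up
ExecStop=/usr/local/bin/docker-compose down
Restart=on-failure

[Install]
WantedBy=multi-user.target
```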
corvus | mordred: "- shell" | 21:54 |
mordred | thanks | 21:55 |
clarkb | btw the sudo that failed from zm02 was me | 21:55 |
mordred | ok. they're all started | 21:55 |
clarkb | I was trying to check if it was running under containers yet or not and didn't realize I was zuul | 21:55 |
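The ad-hoc start playbook itself also isn't visible outside the screen session, but given the two idempotent commands corvus mentioned it was presumably shaped like this (group and service names assumed):

```yaml
# Hypothetical sketch; the actual playbook ran in the screen session.
- hosts: zuul-merger
  tasks:
    - name: Start the merger container
      shell: docker-compose up -d
      args:
        chdir: /etc/zuul-merger

- hosts: zuul-executor
  tasks:
    - name: Start the executor service
      command: systemctl start zuul-executor
```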
corvus | all right, i'll start the scheduler? | 21:56 |
clarkb | I guess so | 21:56 |
clarkb | can't think of much else we can do without gearman in the mix | 21:56 |
corvus | starting | 21:58 |
corvus | the scheduler is waiting for gearman | 21:58 |
corvus | oh it's in a restart loop | 21:59 |
clarkb | gearman is? | 21:59 |
corvus | no the scheduler container | 21:59 |
clarkb | got it | 21:59 |
corvus | problem with the zk config | 21:59 |
corvus | i'm thinking this may not be well tested in the gate | 21:59 |
corvus | [zookeeper] | 22:00 |
corvus | hosts=23.253.236.126:2888:3888,172.99.117.32:2888:3888,23.253.90.246:2888:3888session_timeout=40 | 22:00 |
corvus | that's on all hosts | 22:00 |
mordred | corvus: I agree - it's possible we're only testing that docker-compose up works (which it did) | 22:00 |
corvus | i think we should shut everything down, fix that manually, run ansible, then start again | 22:00 |
mordred | corvus: we're missing a \n yeah? | 22:00 |
corvus | yep | 22:01 |
clarkb | mordred: ya session_timeout needs its own line | 22:01 |
mordred | k. I'll stop the mergers and executors | 22:01 |
mordred | and edit the file - someone else want to capture it in a gerrit change? | 22:01 |
clarkb | I can take a look at gerrit | 22:01 |
corvus | scheduler is down | 22:01 |
mordred | corvus: what's syntax issue there? | 22:02 |
mordred | there's no -%} - it's just a %} | 22:02 |
corvus | mordred: i don't understand the question | 22:02 |
mordred | corvus: sorry - in the jinja template for zuul.conf.j2 - I don't understand why there's no \n in the output | 22:03 |
clarkb | mordred: ya I was looking at it and getting confused too | 22:03 |
mordred | I've got it up in vim in the screen | 22:03 |
clarkb | mordred: we can just add a newline between the two items probably | 22:03 |
clarkb | but that's hacky | 22:03 |
mordred | yeah - maybe do that for now? | 22:04 |
clarkb | k I'm pulling up jinja2 docs now to try and understand better | 22:05 |
mordred | also - let's look at the rest of the file and make sure nothing else is dumb | 22:05 |
mordred | the connection sections look ok to me | 22:05 |
corvus | could use some spaces between them | 22:05 |
corvus | some newlines before each connection header | 22:05 |
mordred | added. shall I run it with this? | 22:06 |
corvus | also the gearman headers | 22:06 |
clarkb | https://jinja.palletsprojects.com/en/2.11.x/templates/#whitespace-control | 22:06 |
clarkb | "a single trailing newline is stripped if present" is the default apparently? | 22:07 |
corvus | mordred: ++ | 22:08 |
mordred | k. I did the hacky version - let's see if we can't get the nicer version | 22:08 |
corvus | "By default, Jinja also removes trailing newlines. To keep single trailing newlines, configure Jinja to keep_trailing_newline." | 22:08 |
clarkb | reading that doc I think we have to do the hacky thing | 22:08 |
clarkb | corvus: ya so we'd have to change ansible to get what we want? | 22:08 |
corvus | so if there's a way to tell ansible to do that, that might be the less hacky thing. but if not, i agree with clarkb, hacky is the only option | 22:09 |
corvus | clarkb: i dunno if there's an argument to the template module? | 22:09 |
clarkb | corvus: let me read those docs | 22:09 |
clarkb | https://docs.ansible.com/ansible/latest/modules/template_module.html#parameter-trim_blocks | 22:09 |
clarkb | we can set that | 22:10 |
mordred | ok - the zuul.conf looks better now | 22:10 |
corvus | clarkb: i see a trim_blocks option but not a keep_trailing_newline | 22:10 |
corvus | clarkb: i read trim_blocks as being to remove the first newline inside a block | 22:10 |
corvus | but i'm not at all confident i'm interpreting that right | 22:10 |
clarkb | oh ya | 22:10 |
clarkb | keep_trailing_newline is different | 22:10 |
corvus | it's worth a local experiment | 22:10 |
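For context on that knob: if the hosts= line in the template ends with a Jinja block tag such as {% endfor %}, Ansible's default trim_blocks=yes removes the newline that follows it, which would match the symptom; whether that's actually what happened needs the local experiment. The parameter itself is set per-task, roughly like this (the src/dest/ownership values are illustrative, not the real role):

```yaml
- name: Write zuul.conf
  template:
    src: zuul.conf.j2
    dest: /etc/zuul/zuul.conf
    owner: zuul
    group: zuul
    mode: "0644"
    trim_blocks: false   # keep the newline after {% ... %} block tags
```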
mordred | corvus: I still didn't get a blank line on top of each [connection ... | 22:10 |
mordred | do we care? | 22:11 |
corvus | mordred: not now, but we should when we fix it in gerrit | 22:11 |
mordred | ++ | 22:11 |
corvus | cause it's pretty hard to read without | 22:11 |
mordred | ok. that's done - shall I restart the executors and mergers? | 22:12 |
corvus | yep | 22:12 |
mordred | done | 22:12 |
mordred | ready for a second stab at the scheduler | 22:12 |
corvus | it's still in a restart loop with the same error | 22:13 |
corvus | [zookeeper] | 22:13 |
corvus | hosts=23.253.236.126:2888:3888,172.99.117.32:2888:3888,23.253.90.246:2888:3888 | 22:13 |
corvus | what looks wrong with that? | 22:13 |
clarkb | corvus: the colons maybe? | 22:13 |
clarkb | is that how we range them? | 22:13 |
clarkb | thats my best guess. kazoo doesn't like foo:2888:3888 ? | 22:15 |
corvus | does anyone know where the config file reference is for zuul.conf now? | 22:15 |
corvus | because i can't seem to find it with our new improved navigation | 22:16 |
corvus | oh | 22:16 |
corvus | https://zuul-ci.org/docs/zuul/discussion/components.html | 22:17 |
corvus | hosts=zk1.example.com,zk2.example.com,zk3.example.com | 22:17 |
corvus | mordred: let's drop all the ports | 22:17 |
mordred | corvus: ok | 22:17 |
corvus | those are actually quorum ports | 22:17 |
clarkb | oh | 22:17 |
corvus | (we could do zk:2181 -- that's the client port) | 22:17 |
clarkb | we do that on the servers but not as clients | 22:17 |
corvus | but i think it's optional | 22:17 |
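So the target form, per the components doc corvus linked, is host names only, optionally with the 2181 client port; a sketch with placeholder hostnames:

```ini
# Illustrative only -- hostnames are placeholders; 2888/3888 are the ZooKeeper
# quorum/election ports and belong in the server config, not in a client host list.
[zookeeper]
hosts=zk01.example.com,zk02.example.com,zk03.example.com
session_timeout=40
```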
mordred | stopping executors and mergers | 22:17 |
corvus | scheduler is stopped | 22:18 |
mordred | re-running service-zuul | 22:18 |
mordred | ok. all done | 22:21 |
mordred | ready to start mergers and executors? | 22:22 |
corvus | yep | 22:22 |
corvus | scheduler is up | 22:22 |
mordred | corvus: is it happier this time? | 22:22 |
corvus | yep | 22:22 |
mordred | (not having those ports will make that jinja line able to use | join(',')) | 22:22 |
corvus | mordred: we may want ports for zk-tls | 22:23 |
corvus | (i'm not sure, we should test) | 22:23 |
clarkb | merger is resetting all the things | 22:23 |
mordred | corvus: nod | 22:23 |
mordred | corvus: we clearly need a better test for this in the gate | 22:23 |
clarkb | no failure yet that I see | 22:23 |
fungi | once we're sure this is working... | 22:23 |
fungi | status notice This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:25 were missed entirely by Zuul and should also be rechecked to get fresh test results | 22:23 |
fungi | does that ^ look reasonable? i tried to base the times there on when the keyfile was first overwritten, when the mergers were stopped, when the scheduler was stopped, and when the scheduler was started again | 22:23 |
mordred | turns out "does docker-compose work" is not sufficient | 22:23 |
clarkb | fungi: lgtm | 22:24 |
mordred | fungi: lgtm | 22:24 |
corvus | mordred: we can borrow a lot of stuff from quickstart | 22:24 |
corvus | fungi: ++ | 22:24 |
corvus | mordred: are we not running zuul-web and fingergw in containers? | 22:25 |
corvus | oh it's a separate docker-compose | 22:25 |
corvus | for zuul-web | 22:25 |
clarkb | probably want to consolidate that? | 22:25 |
corvus | web and fingergw are in the same docker-compose | 22:26 |
*** rkukura has joined #opendev-meeting | 22:26 | |
corvus | web is in a restart loop | 22:26 |
corvus | PermissionError: [Errno 13] Permission denied: '/var/log/zuul/web.log' | 22:26 |
corvus | we did not chown those | 22:26 |
fungi | yep, still has old uid/gid | 22:26 |
corvus | how is the scheduler writing to its log? | 22:26 |
mordred | corvus: its 666 | 22:27 |
fungi | world-writeable | 22:27 |
fungi | yep | 22:27 |
mordred | web.log is 644 | 22:27 |
corvus | that is amusing | 22:27 |
mordred | yeah | 22:27 |
fungi | we probably don't really want those logs world-writeable | 22:27 |
mordred | so - shall I do a quick cown of /var/log/zuul ? | 22:27 |
corvus | i'll do the chown | 22:27 |
mordred | ok | 22:28 |
mordred | we might need to do it everywhere? | 22:28 |
mordred | also - I just changed windows in screen I think by accident? | 22:28 |
fungi | logs on executors are world-readable too | 22:28 |
fungi | which is why they're also working | 22:28 |
fungi | er, world-writeable | 22:28 |
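A sketch of the ownership fix being applied here, assuming the containers run as a zuul user/group that exists on these hosts:

```shell
# Give the zuul user back its log directory and drop the accidental world-writeable bits.
sudo chown -R zuul:zuul /var/log/zuul
sudo chmod 0644 /var/log/zuul/*.log
```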
mordred | how do I switch windows in screen? | 22:29 |
fungi | ctrl-a,n | 22:29 |
fungi | for next window | 22:29 |
mordred | thanks | 22:29 |
fungi | p for previous | 22:29 |
corvus | web server is up now | 22:29 |
corvus | tenants loaded, will re-enqueue | 22:29 |
corvus | i think we're good to call this up and send the announcement? | 22:29 |
fungi | lmk when i should status notice that novella | 22:29 |
mordred | corvus: I think I should run the playbook I just wrote in screen | 22:29 |
fungi | maybe after reenqueuing, yeah | 22:29 |
clarkb | corvus: maybe we should check executors can ssh into test nodes? | 22:30 |
mordred | and yes - I think we can announce back u | 22:30 |
corvus | mordred: sure | 22:30 |
mordred | up | 22:30 |
clarkb | if we haven't confirmed that yet? simply because ssh was affected | 22:30 |
corvus | oh | 22:30 |
corvus | yeah | 22:30 |
corvus | like those retry_limits i'm seeing at https://zuul.opendev.org/t/openstack/status | 22:30 |
corvus | and there was one merger_failure | 22:30 |
mordred | 2020-04-24 22:31:13,629 DEBUG zuul.AnsibleJob.output: [build: ad8be97939f543289336ff87bf847f1e] Ansible output: b' "msg": "Data could not be sent to remote host \\"23.253.166.144\\". Make sure this host can be reached over ssh: Permission denied (publickey).\\r\\n",' | 22:31 |
mordred | I think that's a nope | 22:31 |
corvus | i'll shut down the scheduler | 22:31 |
clarkb | is that executor to test node? | 22:32 |
corvus | yeah | 22:32 |
fungi | looks like it | 22:32 |
clarkb | ok the nodepool_id_rsa file hasn't been changed recently | 22:33 |
clarkb | which makes me think maybe it's a config issue. possibly we are trying to use the gerrit key to ssh to test nodes? | 22:33 |
clarkb | trying to figure out where we configure that on executors now | 22:33 |
corvus | [executor] | 22:34 |
corvus | private_key_file=/var/lib/zuul/ssh/id_rsa | 22:34 |
mordred | so - that's wrong - that just need to point to nodepool_id_rsa | 22:35 |
clarkb | that should be /var/lib/zuul/ssh/nodepool_id_rsa | 22:35 |
clarkb | __ | 22:35 |
clarkb | er ++ | 22:35 |
mordred | yes - and we need to be sure to write that file out in the ansible | 22:35 |
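The corrected executor section, per the path clarkb and mordred agree on above:

```ini
[executor]
private_key_file=/var/lib/zuul/ssh/nodepool_id_rsa
```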
mordred | I'll make the local change real quick | 22:35 |
mordred | clarkb: so - we should make sure there is a hiera entry for that key content :) | 22:36 |
mordred | clarkb: I think that's one of the keys you were looking at earlier | 22:36 |
mordred | I don't think we're writing it at all in the ansible | 22:36 |
clarkb | mordred: correct I think ansible needs to pull it out of ansible | 22:37 |
clarkb | mordred: and it's the key content I deleted in zuul-executor | 22:37 |
clarkb | mordred: should I go ahead and add it under a new name? | 22:38 |
clarkb | how about nodepool_test_node_ssh_key_content or similar | 22:38 |
fungi | fine by me | 22:38 |
mordred | ++ | 22:39 |
corvus | mordred: restart all the executors? | 22:39 |
corvus | probably fine to do the mergers too if it's easy to reuse the playbook | 22:40 |
mordred | ++ | 22:40 |
mordred | I pushed up a broken patch to remind us to fix the nodepool key | 22:41 |
mordred | ok - executors and mergers are restarted | 22:41 |
clarkb | nodepool_test_node_ssh_private_key_contents <- that is the key I used | 22:41 |
clarkb | I've also annotated the keys to explain what they are for | 22:42 |
mordred | clarkb: awesome | 22:42 |
mordred | I pushed remote: https://review.opendev.org/723049 Add nodepool node key | 22:42 |
corvus | ready to restart sched? | 22:42 |
mordred | that's a much better key name | 22:42 |
mordred | corvus: yes | 22:42 |
corvus | starting | 22:43 |
corvus | k it's up | 22:47 |
corvus | re-enqueueing | 22:47 |
clarkb | jobs seem to be working on ze01 if I'm reading the logs correctly | 22:48 |
corvus | yeah, that's what i'm seeing | 22:48 |
mordred | PHEW | 22:48 |
clarkb | its configuring mirrors and stuff | 22:48 |
mordred | ok - we don't actually have that many changes in the local ansible repo on system-config - I think we've captured them all in gerrit changes | 22:48 |
clarkb | jemalloc, zuul.conf, and managing nodepool test node ssh key? | 22:49 |
fungi | status notice This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:45 UTC were missed entirely by Zuul and should also be rechecked to get fresh test results | 22:49 |
fungi | maybe after we've seen at least one build report success | 22:49 |
fungi | and all the reenqueuing is finished | 22:49 |
clarkb | mordred: https://review.opendev.org/#/c/723049/1 that needs a small update | 22:50 |
clarkb | (commented inline) | 22:51 |
mordred | clarkb: you didn't click send | 22:51 |
clarkb | https://review.opendev.org/#/c/723023/ https://review.opendev.org/#/c/723049/ and https://review.opendev.org/#/c/723046/ are the changes | 22:51 |
clarkb | mordred: gah sorry | 22:51 |
clarkb | mordred: now I have | 22:52 |
mordred | probably 0400 too yeah? | 22:52 |
mordred | should I do owner/group? | 22:52 |
clarkb | the others are all 0400 and owned by zuul:zuul | 22:53 |
clarkb | I think paramiko may complain if it isn't set that way so probably should | 22:53 |
fungi | yes, or 0600 would also be okay but since we're deploying them with configuration management 0400 is probably a better signal | 22:53 |
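Writing that key out of the group_vars entry might then look something like this; the variable name is the one clarkb added, while everything else (task placement, exact path handling) is an assumption rather than what 723049 actually does:

```yaml
- name: Install the nodepool test-node SSH private key on executors
  copy:
    content: "{{ nodepool_test_node_ssh_private_key_contents }}"
    dest: /var/lib/zuul/ssh/nodepool_id_rsa
    owner: zuul
    group: zuul
    mode: "0400"
  no_log: true   # keep private key material out of the ansible logs
```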
corvus | fungi: reenqueing is finished | 22:54 |
corvus | fungi: i see a successful build | 22:54 |
fungi | thanks corvus, now to find a success | 22:54 |
clarkb | maybe someone else can double check the group_vars git diff and we can git commit that? I'm happy to git commit if someone else confirms it looks good to them | 22:54 |
corvus | fungi: top of check (but the change hasn't reported yet, still has a job running) | 22:54 |
fungi | corvus: any reason to wait for a full buildset to report, or is that unlikely to be broken? | 22:54 |
corvus | fungi: i think you're good to send | 22:55 |
fungi | thanks, doing then | 22:55 |
mordred | clarkb: I updated 723049 with 0400 and also added a key to the test templates | 22:55 |
fungi | #status notice This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:45 UTC were missed entirely by Zuul and should also be rechecked to get fresh test results | 22:55 |
openstackstatus | fungi: sending notice | 22:55 |
-openstackstatus- NOTICE: This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:45 UTC were missed entirely by Zuul and should also be rechecked to get fresh test results | 22:56 | |
clarkb | mordred: thanks lgtm | 22:56 |
corvus | am i clear to eod? i have no brains. | 22:58 |
clarkb | I think so. If there are any braincells left reviewing those changes might be a good thing. However, maybe better to do that when brains are ready | 22:58 |
clarkb | I think the three changes above cover all our debugging | 22:59 |
fungi | i will review with coffee in the morning | 22:59 |
openstackstatus | fungi: finished sending notice | 22:59 |
fungi | if they don't get approved sooner | 22:59 |
clarkb | I'm going to take a break then will check back in to see that things are still chugging along | 23:01 |
clarkb | lots of successful jobs though so things are looking goog | 23:01 |
clarkb | *good too | 23:01 |
corvus | mordred: oh i'm seeing a lot of cronspam | 23:01 |
corvus | looks like the backup cron may have a syntax error? | 23:02 |
corvus | the zuul queue backup cron | 23:02 |
corvus | # Puppet Name: zuul_scheduler_status_backup-openstack-zuul-tenant | 23:02 |
corvus | * * * * * timeout -k 5 10 curl https://zuul.opendev.org/api/tenant/openstack/status -o /var/lib/zuul/backup/openstack-zuul-tenant_status_$(date +\%s).json 2>/dev/null | 23:02 |
corvus | #Ansible: zuul-scheduler-status-openstack | 23:02 |
corvus | * * * * * timeout -k 5 10 curl https://zuul.opendev.org/api/tenant/openstack/status -o /var/lib/zuul/backup/openstack_status_$(date +\\%s).json 2>/dev/null | 23:02 |
corvus | looks like the wrong level of escaping | 23:03 |
mordred | yeah | 23:03 |
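In a crontab, % is special and has to reach cron escaped as a single \%, while the generated entry ended up with a doubled \\%. A hedged sketch of what the cron task probably needs to produce (the cron user and task name are assumptions, not necessarily what 723052 does):

```yaml
# The cron module defaults minute/hour/day/month/weekday to "*",
# matching the every-minute schedule in the crontab quoted above.
- name: Back up the openstack tenant status
  cron:
    name: zuul-scheduler-status-openstack
    user: zuul
    job: >-
      timeout -k 5 10 curl https://zuul.opendev.org/api/tenant/openstack/status
      -o /var/lib/zuul/backup/openstack_status_$(date +\%s).json 2>/dev/null
```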
corvus | do we have periodic crons running these playbooks yet? | 23:04 |
corvus | i've manually fixed the crontab on zuul01 | 23:04 |
mordred | corvus: we do | 23:05 |
corvus | that means we need to land all of the changes, plus this one now, right? | 23:05 |
mordred | https://review.opendev.org/#/c/723052/ | 23:05 |
mordred | yeah | 23:05 |
mordred | clarkb, corvus: https://review.opendev.org/#/c/723022/ ... actually, lemme free that from the stack | 23:06 |
mordred | corvus: should we direct enqueue some into the gate? | 23:07 |
clarkb | we can also put them in emergency | 23:07 |
mordred | clarkb: https://review.opendev.org/#/c/723052/ | 23:07 |
corvus | i like emergency | 23:07 |
corvus | mordred: mostly because https://review.opendev.org/723046 | 23:08 |
mordred | yeah | 23:08 |
corvus | (i don't want to land that without seeing the output) | 23:08 |
clarkb | I've +2'd 3052 but not approved it | 23:08 |
mordred | I think it's just executors we need in emergency yes? | 23:08 |
corvus | i think we can +w that one | 23:08 |
corvus | mordred: everything? | 23:08 |
mordred | kk. on it | 23:08 |
corvus | (i don't want a delta between the running config and what's on disk) | 23:09 |
mordred | ok. I have put all of zuul in emergency | 23:09 |
clarkb | corvus: if we approve it it will run ansible against zuul and change the configs I think? | 23:09 |
corvus | mordred: thanks! | 23:09 |
clarkb | I'll let others decide if that is safe and approve if so | 23:10 |
corvus | clarkb: hopefully not if they're in emergency? | 23:10 |
mordred | clarkb: no - we just put everything in emergency | 23:10 |
clarkb | corvus: ohya if we are in emergency now we can approve I'll do that | 23:10 |
clarkb | mordred what is a zl01? | 23:16 |
mordred | exactly | 23:16 |
clarkb | I'm trying to update the zuul run job to log the zuul conf and noticed that | 23:16 |
mordred | clarkb: fixed here: https://review.opendev.org/#/c/723023/3/.zuul.yaml | 23:16 |
fungi | our old ansible launchers were zl* | 23:17 |
fungi | for zuul v2.5 | 23:17 |
mordred | clarkb: which is why the jemalloc bug was able to exist | 23:17 |
clarkb | 'm k I'll rebase | 23:17 |
clarkb | ok popping out now for a bit | 23:21 |
mordred | same | 23:27 |