fungi | looks like it fit! | 00:00 |
---|---|---|
clarkb | I think it's still channel dependent because the channel name goes at the beginning of the message, but ya seems like for the channels I'm in it is good | 00:00 |
clarkb | once 763618 lands and promotes the image I think we're in a good spot to turn on cd again, but more and more I'm feeling like that is a tomorrow morning thing | 00:02 |
clarkb | will others be around for that? | 00:02 |
clarkb | sounded like ianw would be around AU morning | 00:02 |
fungi | i will be around circa 13-14z | 00:07 |
corvus | i probably won't be around tomorrow | 00:08 |
clarkb | my concern with doing it today is if we turn it on and don't notice problems because they happen at 0600 utc or whatever | 00:08 |
clarkb | so probably tomorrow is best? | 00:09 |
corvus | yeah i agree | 00:09 |
fungi | i don't expect to have major additional issues crop up which we'll be unable to deal with on the spot | 00:09 |
clarkb | ok I think the promote has happened | 01:05 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/763618/ build succeeded deploy pipeline | 01:05 |
fungi | excellent | 01:40 |
fungi | i'm starting to fade though | 01:40 |
mordred | Congrats on the Gerrit upgrade!!! | 02:20 |
mordred | The post upgrade etherpad doesn't look horrible | 02:22 |
clarkb | the x/ conflict is probably the big thing | 02:26 |
fungi | thanks mordred! | 02:32 |
mordred | Btw ... If anybody needs a Netflix show to unwind with ... We Are The Champions is amazing. We watched the hair episode last night. I promise you've never seen anything like it | 02:56 |
*** hamalq has joined #opendev-meeting | 09:09 | |
*** hamalq has quit IRC | 11:10 | |
*** hamalq has joined #opendev-meeting | 11:46 | |
*** hamalq has quit IRC | 11:51 | |
fungi | no signs of trouble this morning, i'm around whenever folks are ready to try reenabling ansible | 14:23 |
clarkb | fungi: I think our first step is to confirm our newly promoted 3.2 tagged image is happy on review-test, then roll that out to prod | 15:02 |
clarkb | then in my sleep I figured a staged rollout of cd would probably be good: put the ssh keys back but keep review and zuul in the emergency file and see what jobs do, then remove zuul from emergency file and see what jobs do, then remove review and land the html cleanup change I wrote and see how that does? | 15:03 |
clarkb | I think for the first two steps the periodic jobs should give us decent coverage | 15:03 |
clarkb | mordred: I've watched the first two episodes. the cheese wheel racing is amazing | 15:04 |
fungi | the image we're currently running from is hand-built? or fetched from check pipeline? | 15:07 |
clarkb | david ostrovsky has congratulated us on the gerrit mailing list. Also asks if we have any feedback. I guess following up there with the url paths thing might be good, as well as questions about whether you can make the notedb step go faster by somehow manually gc'ing then manually reindexing | 15:07 |
clarkb | fungi: review-test should be running the check pipeline image | 15:07 |
clarkb | the docker compose file should reflect that | 15:08 |
fungi | but we've got a fix of some sort in place in production right? | 15:08 |
clarkb | fungi: and prod is in the same boat iirc | 15:08 |
fungi | ahh, okay, yep looks like it's also running the image built in the check pipeline then | 15:09 |
clarkb | fungi: the fix is that https://review.opendev.org/c/opendev/system-config/+/763618 landed and promoted our workaround as the 3.2 tag in docker hub. Which means we can switch back to using the opendevorg/gerrit:3.2 image on both servers | 15:09 |
clarkb | I think we should do review-test first and just quickly double check that git clone still works, then do the same with prod | 15:09 |
fungi | right, i'll get it swapped out on review-test now | 15:09 |
fungi | opendevorg/gerrit 3.2 3391de1cd0b2 15 hours ago 681MB | 15:10 |
fungi | that's what review-test is in the process of starting on now | 15:11 |
clarkb | fungi: I can clone ranger from review-test | 15:18 |
clarkb | via https | 15:18 |
fungi | yup, same. perfect | 15:23 |
fungi | shall i similarly edit the docker-compose.yaml on review.o.o in that case? | 15:23 |
clarkb | yes I think we should go ahead and get as many of these restarts in on prod during our window as we can | 15:23 |
fungi | edits made, see screen window | 15:24 |
fungi | do i need to down before i pull, or can i pull first? | 15:24 |
clarkb | you can pull first | 15:24 |
clarkb | sorry I'm not on the screen yet, but I think it will be fine since you just did it on -test | 15:25 |
fungi | opendevorg/gerrit 3.2 3391de1cd0b2 15 hours ago 681MB | 15:25 |
fungi | that's what's pulled | 15:25 |
clarkb | and it matches -test | 15:25 |
fungi | shall i down and up -d? | 15:25 |
clarkb | ++ | 15:25 |
fungi | done | 15:25 |
clarkb | one thing that occurred to me is we should double check our container shutdown process is still valid. I figured an easy way to do that was to grab the deb packages they publish and read the init script but I can find where the actual packages are | 15:29 |
fungi | `git clone https://review.opendev.org/x/ranger` is still working for me | 15:30 |
clarkb | *I can't find where | 15:30 |
fungi | which package? docker? docker-compose? | 15:31 |
clarkb | nevermind, found them. deb.gerritforge.com is only older stuff, bionic.gerritforge.com has newer things | 15:31 |
clarkb | fungi: the "native packages" that luca publishes http://bionic.gerritforge.com/dists/gerrit/contrib/binary-amd64/gerrit-3.2.5.1-1.noarch.deb | 15:31 |
clarkb | since I assume that will have a systemd unit or init file so we can see how stop is done | 15:32 |
clarkb | our current stop is based on the 2.13 provided init script. actually I wonder if 3.2 provides one too | 15:32 |
clarkb | ah yup it does | 15:33 |
clarkb | resources/com/google/gerrit/pgm/init/gerrit.sh and that still shows sig hup so I think we're good | 15:33 |
fungi | oh, got it | 15:33 |
fungi | thought you were talking about docker tooling packages | 15:34 |
clarkb | no just more generally. Our docker-compose config should send a sighup to stop gerrit's container | 15:34 |
clarkb | which it looks like is correct | 15:34 |
clarkb | *is still correct | 15:35 |
*** hamalq has joined #opendev-meeting | 15:58 | |
clarkb | remote: https://review.opendev.org/c/opendev/system-config/+/763656 Update gerrit docker image to java 11 | 15:58 |
clarkb | I think that is a later thing so will mark it WIP | 15:58 |
clarkb | also gerritbot didn't report that :/ | 15:58 |
clarkb | oh right we just restarted :) | 15:58 |
clarkb | I'm restarting gerritbot now | 15:59 |
clarkb | also git review gives a nice error message when you try to push to new gerrit with a too-old git review | 16:00 |
fungi | seems like we need to restart gerritbot any time we restart gerrit these days | 16:00 |
*** hamalq has quit IRC | 16:02 | |
clarkb | ok I won't say I feel ready, but I'm probably as ready as I will be :) what do you think of my staged plan to get zuul cd happening again? | 16:03 |
fungi | it seems sound, i'm up for it | 16:06 |
clarkb | opendev-prod-hourly jobs are the ones that we'd expect to run and those run at the top of the hour. So if we move authorized_keys back in place then we should be able to monitor at 17:00UTC? | 16:06 |
clarkb | then if we're happy with the results of that we remove zuul from emergency and wait for the hourly prod jobs at 18:00UTC | 16:07 |
clarkb | (zuul is in that list) | 16:07 |
clarkb | fungi: I put a commented out mv command in the bridge screen to put keys back in place, can you check it? | 16:10 |
fungi | yep, that looks adequate | 16:10 |
clarkb | ok I guess we wait for 17:00 then? | 16:10 |
fungi | was ansible globally disabled, and have we taken things back out of the emergency disable list? | 16:14 |
fungi | looks like /home/zuul/DISABLE-ANSIBLE does not exist on bridge at least | 16:14 |
clarkb | ansible was not globally disabled with the DISABLE-ANSIBLE file and the hosts are all still in the emergency disable list | 16:14 |
clarkb | we used the more forceful "you cannot ssh at all" disable method | 16:14 |
fungi | cool, so in theory the 1700z deploy will skip the stuff we still have disabled in the emergency list | 16:15 |
clarkb | yup, then after that if we're happy with the results we take the zuul hosts out of emergency and let the next hourly pulse run on them | 16:15 |
clarkb | then if we're happy with that we remove review and then land my html cleanup change | 16:16 |
clarkb | review isn't part of the hourly jobs so we need something else to trigger a job on it (it is on the daily periodic jobs though so we should ensure we run jobs against it before ~0600 or put it back in the emergency file) | 16:16 |
clarkb | fungi: one upside to doing the ssh disable is that the jobs fail quicker in zuul | 16:19 |
clarkb | which we wanted because we knew that things would be off for a long period of time | 16:19 |
clarkb | when you write the disable ansible file the jobs will poll it and see if it goes away before their timeout | 16:19 |
clarkb | during typical operation ^ is better because it's a short window where you want to pause rather than a full stop | 16:20 |
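As an aside, a minimal illustration of the flag-file pattern clarkb is describing (purely a sketch of the concept, not the actual system-config implementation; the timeout and interval values are made up):

```python
import os
import time

# Illustrative sketch of the DISABLE-ANSIBLE flag-file pattern described above;
# the real check lives in system-config's deploy job wrappers, not here.
DISABLE_FLAG = "/home/zuul/DISABLE-ANSIBLE"

def wait_until_enabled(timeout=3600, poll_interval=30):
    """Poll until the disable flag goes away, or give up at the job timeout."""
    deadline = time.time() + timeout
    while os.path.exists(DISABLE_FLAG):
        if time.time() > deadline:
            raise TimeoutError("ansible still disabled after %ss" % timeout)
        time.sleep(poll_interval)
```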
clarkb | https://etherpad.opendev.org/p/E3ixAAviIQ1F_1-gzuq_ is the gerrit mailing list email from david. I figure we should respond. fungi not sure if you're subscribed? but seems like we should write up an email and bring up the x/ conflict? | 16:21 |
fungi | i'm not subscribed, but happy if someone mentions that bug to get some additional visibility/input | 16:22 |
clarkb | fungi: I drafted a response in that etherpad, have a moment to take a look? | 16:32 |
fungi | yep, just a sec | 16:33 |
clarkb | I think they were impressed we were able to incorporate a jgit fix from yesterday too :) | 16:35 |
clarkb | something something zuul | 16:35 |
fungi | yep, reply lgtm, thanks! | 16:37 |
fungi | i've got something to which i must attend briefly, but will be back to check the hourly deploy run | 16:38 |
clarkb | response sent | 16:41 |
clarkb | infra-prod-install-ansible is running | 17:01 |
clarkb | as well as publish-irc-meetings (that one maybe didn't rely on bridge though?) | 17:01 |
clarkb | infra-prod-install-ansible reports success | 17:03 |
clarkb | now it is running service-bridge | 17:03 |
clarkb | service-bridge claims success now too | 17:06 |
clarkb | cloud-launcher is running now | 17:07 |
clarkb | fungi: are you back? | 17:12 |
fungi | yup | 17:12 |
clarkb | I've checked that system-config updated in zuul's homedir | 17:12 |
fungi | looking good so far | 17:12 |
clarkb | but now am trying to figure out where the hell project-config is synced to/from | 17:13 |
clarkb | /opt/project-config is what system-config ansible vars seem to show but that seems old as dirt on bridge | 17:13 |
clarkb | that makes me think that it isn't actually where we sync from, but I'm having a really hard time understanding it | 17:13 |
fungi | i thought it put one in ~zuul | 17:13 |
clarkb | ok that one looks more up to date | 17:14 |
clarkb | but I still can't tell from our config management what is used | 17:14 |
fungi | from friday, yah | 17:14 |
clarkb | (also it's a project creation from friday... maybe we should've stopped those for a bit) | 17:14 |
fungi | well, it won't run manage-projects yet | 17:15 |
fungi | because of review.o.o still being in emergency | 17:15 |
clarkb | ya | 17:15 |
fungi | but yeah once we reenable that, we should check the outcome of manage-projects runs | 17:15 |
clarkb | I think I figured it out | 17:15 |
clarkb | /opt/project-config is the remote path but not the bridge path | 17:16 |
clarkb | the bridge path is /home/zuul/src/../project-config | 17:16 |
clarkb | fungi: looking at timestamps there is a good chance that project is already created /me checks | 17:17 |
clarkb | https://opendev.org/openstack/charm-magpie | 17:18 |
clarkb | and they are in gerrit too, ok one less thing to worry about until we're happy with the state of the world | 17:18 |
clarkb | fungi: nodepool's job is next and I think that one may be expected to fail due to the issues on the builders that ianw was debugging. Not sure if they have been fixed yet | 17:19 |
clarkb | just a heads up that a failure there is probably sane | 17:19 |
clarkb | I suspect that our hourly jobs take longer than an hour to complete | 17:20 |
clarkb | huh cloud launcher failed, I wonder if it is trying to talk to a cloud that isn't available anymore (that is usually why it fails) | 17:22 |
clarkb | fungi: it just occurred to me that the jeepyb scripts that talk to the db likely won't fail until we remove the db config from the gerrit config | 17:23 |
clarkb | fungi: and there is potential there for welcome message to spam new users created on 3.2 because it won't see them in the 2.16 db | 17:23 |
clarkb | I don't think that is urgent ( we can apologise a lot) but it's in the stack of changes to do that cleanup anyway. Then those scripts should start failing on the ini file lookups | 17:24 |
fungi | mmm, maybe if they have only one recorded change in the old db, yes | 17:26 |
fungi | i think it would need them to exist in the old db but have only one change present | 17:26 |
fungi | i need to look back at the logic there | 17:27 |
clarkb | also we can just edit playbooks/roles/gerrit/files/hooks/patchset-created to drop welcome-message? | 17:27 |
fungi | easily | 17:27 |
clarkb | the other one was the blueprint update? | 17:28 |
fungi | bug update | 17:28 |
clarkb | looks like bug and blueprint both | 17:29 |
clarkb | confirmed that nodepool failed | 17:29 |
clarkb | registry running now | 17:29 |
clarkb | fungi: so ya maybe we get a change that simply comments out those scripts in the various hook scripts for now? | 17:29 |
clarkb | then that can land before or after the db config cleanup | 17:30 |
fungi | yeah, looking at welcome_message.py the query is looking for changes in the db matching that account id, so it won't fire for actually new users post upgrade, but will i think continue to fire for existing users who only had one change in before the upgrade | 17:30 |
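Roughly, the behaviour fungi describes reduces to a check like the following (a simplified sketch, not the actual welcome_message.py; the table and column names are assumptions):

```python
def should_send_welcome(cursor, uploader_account_id):
    # Count the uploader's changes in the (now frozen) 2.16 reviewdb and only
    # greet them on what looks like their first change.  Post-upgrade the db
    # never grows: genuinely new accounts return 0 and never match, while a
    # pre-upgrade account frozen at exactly one change keeps matching forever.
    cursor.execute(
        "SELECT COUNT(*) FROM changes WHERE owner_account_id = %s",
        (uploader_account_id,))
    (change_count,) = cursor.fetchone()
    return change_count == 1
```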
clarkb | got it | 17:30 |
clarkb | registry just succeeded. zuul is running now and it should noop succeed | 17:30 |
fungi | i think update_bug.py will still half-work, it will just fail to reassign bugs | 17:30 |
clarkb | fungi: but will it raise an exception early because the ini file doesn't have the keys it is looking for anymore? | 17:31 |
clarkb | I think it will | 17:31 |
fungi | oh, did we remove the db details from the config? | 17:31 |
clarkb | fungi: not yet, that is one of the changes to land though | 17:32 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/757162 | 17:32 |
fungi | got it | 17:32 |
fungi | so yeah i guess we can strip them out until someone has time to address those two scripts | 17:32 |
fungi | i'll amend 757162 with that i guess | 17:32 |
clarkb | wfm | 17:33 |
clarkb | note its parent is the html cleanup | 17:33 |
clarkb | which is also not yet merged | 17:33 |
fungi | yeah, i'm keeping it in the stack | 17:33 |
clarkb | zuul "succeeded" | 17:33 |
clarkb | fungi: also it's three scripts | 17:34 |
clarkb | welcome message, and update bug and update blueprint | 17:34 |
fungi | update blueprint doesn't rely on the db though | 17:36 |
clarkb | it does | 17:36 |
fungi | huh, i wonder why. okay i'll take another look at that one too | 17:36 |
fungi | select subject, topic from changes where change_key=%s | 17:37 |
fungi | yeesh, okay so it's using the db to look up changes | 17:38 |
clarkb | ya rest api should be fine for that anonymously too | 17:38 |
clarkb | I'm adding notes to the etherpad | 17:38 |
fungi | and yeah, the find_specs() function performing that query is called unconditionally in update_blueprint.py so it'll break entirely | 17:38 |
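For comparison, the anonymous REST lookup clarkb suggests could look roughly like this (a sketch only, not a jeepyb patch; the function name is made up, but stripping Gerrit's ")]}'" XSSI guard line before parsing is how its JSON responses work):

```python
import json
import urllib.parse
import urllib.request

def fetch_subject_and_topic(change_id, base_url="https://review.opendev.org"):
    # Look up a change anonymously over the REST API instead of querying the
    # old reviewdb.  Gerrit prefixes every JSON response with a ")]}'" guard
    # line, so drop the first line before parsing.
    url = "%s/changes/%s" % (base_url, urllib.parse.quote(change_id, safe=""))
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")
    change = json.loads(body.split("\n", 1)[1])
    # "topic" is only present in the response when one is set on the change.
    return change.get("subject"), change.get("topic")
```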
clarkb | and eavesdrop succeeded. Now puppet else is starting | 17:40 |
fungi | update_bug.py is also called from two other hook scripts, i'll double-check whether those modes are expected to work at all | 17:40 |
fungi | looks like the others are safe to stay, update_bug.py is only connecting to the db within set_in_progress() which is only called within a conditional checking args.hook == "patchset-created" | 17:43 |
clarkb | fungi: where does it do the ini file lookups? | 17:44 |
clarkb | because it will raise on those when the keys are removed from the file | 17:44 |
clarkb | (it's less about where it connects and more where it finds the config) | 17:44 |
clarkb | puppetry is still running according to the log I'm tailing | 17:48 |
clarkb | I don't expect this to finish before 18:00, but should I go ahead and remove the zuul nodes from the emergency file anyway now since things seem to be working? | 17:48 |
clarkb | fungi: ^ | 17:48 |
fungi | ini file is parsed in jeepyb.gerritdb.connect() which isn't called outside the check for patchset-created | 17:49 |
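The failure mode clarkb is anticipating is just the config lookup raising before any connection is attempted; in configparser terms it is roughly this (illustrative only, not jeepyb's actual parsing, and the section/key names are assumptions):

```python
import configparser

def read_db_settings(path):
    # Once the [database] keys are stripped from the ini file, these lookups
    # raise NoSectionError/NoOptionError, so the hook fails at config-read
    # time rather than when it tries to connect.
    config = configparser.ConfigParser()
    config.read(path)
    return {
        "host": config.get("database", "hostname"),
        "user": config.get("database", "username"),
        "password": config.get("database", "password"),
    }
```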
fungi | sorry, digging in jeepyb internals | 17:50 |
fungi | what's the desire to take zuul servers out of emergency in the middle of a deploy run? | 17:50 |
fungi | oh, just in case it finishes before the top of the hour? | 17:51 |
clarkb | that this deploy run is racing the next cron iteration | 17:51 |
clarkb | yup | 17:51 |
clarkb | if we wait we might have to skip to 19:00 though this may end up happening anyway | 17:51 |
clarkb | oh wait puppet is done it says | 17:51 |
fungi | is it likely to decide to deploy things to zuul servers in this run if they're taken out of emergency early? | 17:51 |
fungi | ahh, then go for it | 17:51 |
clarkb | ok will do it in the screen | 17:51 |
clarkb | and done, can you double check the contents of the emergency file really quickly just to make sure I didn't do anything obviously wrong? | 17:52 |
fungi | emergency file in the bridge screen lgtm | 17:54 |
fungi | gerritbot is still silent. did it get restarted? | 17:55 |
clarkb | ya I thought I restarted it | 17:55 |
fungi | last started at 15:59 | 17:55 |
fungi | gerrit restart was 15:25 | 17:56 |
fungi | but for some reason it didn't echo when i pushed updates for your system-config stack | 17:56 |
fungi | checking gerritbot's logs | 17:56 |
fungi | Nov 22 16:59:23 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 16:59:23,196 ERROR gerrit.GerritWatcher: Exception consuming ssh event stream: | 17:58 |
fungi | (from syslog) | 17:58 |
clarkb | neat | 17:58 |
clarkb | I thought it was supposed to try and reconnect | 17:59 |
fungi | looks like the json module failed to parse an event | 17:59 |
fungi | and that crashed gerritbot | 17:59 |
clarkb | probably sufficient for now to ignore json parse failures there? | 17:59 |
clarkb | basically go "well I don't understand, oh well" | 17:59 |
fungi | this is being parsed from within gerritlib, the exception was raised from there | 18:00 |
fungi | so we'll likely need to fix this in gerritlib and tag a new release | 18:00 |
clarkb | side note: zuul doesn't use gerritlib, maybe there is something to be learned in zuul's implementation | 18:01 |
fungi | http://paste.openstack.org/show/800291/ | 18:05 |
fungi | that's the full traceback | 18:05 |
clarkb | are we getting empty events | 18:05 |
clarkb | maybe the fix is to wrap that in if l : data = json.loads(l) | 18:06 |
clarkb | and maybe catch json decode errors that happen anyway and reconnect or something like that | 18:07 |
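A minimal sketch of the kind of guard being discussed (illustrative only, not the actual gerritlib code; the reader loop around it is simplified away):

```python
import json
import logging

log = logging.getLogger("gerrit.GerritWatcher")

def handle_stream_line(line, enqueue_event):
    # Skip empty reads and survive unparseable lines instead of letting the
    # watcher thread die on a JSONDecodeError.
    if not line:
        return
    try:
        event = json.loads(line)
    except ValueError:
        # json.JSONDecodeError is a ValueError subclass; log the raw line
        # (not the never-assigned parse result) and carry on.
        log.warning("Cannot parse data from Gerrit event stream:\n%r", line)
        return
    enqueue_event(event)
```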
clarkb | fungi: what's with the docker package inspection in the review screen? | 18:11 |
fungi | https://review.opendev.org/c/opendev/gerritlib/+/763658 Log JSON decoding errors rather than aborting | 18:11 |
fungi | clarkb: when you were first suggesting we needed to look at shutdown routines in some unspecified package you couldn't find, i thought you meant the docker package so i was tracking down where we'd installed it from | 18:12 |
clarkb | gotcha | 18:12 |
clarkb | fungi: looks like gerritbot side will also need to be updated to handle None events | 18:14 |
clarkb | I think we can land the gerritbot change first and be backward compatible | 18:14 |
clarkb | then land gerritlib and release it | 18:15 |
fungi | clarkb: i don't see where gerritbot needs updating. _read() is just trying to parse a line from the stream and then either explicitly returning None early or returning None implicitly after enqueuing any event it found | 18:16 |
fungi | i figured the return value wasn't being used since that method wasn't previously explicitly returning at all | 18:17 |
clarkb | fungi: https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L303-L346 specifically line 307 assumes a dict | 18:17 |
clarkb | oh I see | 18:17 |
clarkb | you're short circuiting | 18:17 |
clarkb | nevermind you're right | 18:17 |
fungi | yeah, the _read() in gerritbot is being passed contents from the queue, not return values from gerritlib's _read() | 18:18 |
clarkb | left a suggestion but +2'd it | 18:18 |
clarkb | yup | 18:18 |
clarkb | nodepool is running now, then registry then zuul | 18:24 |
clarkb | on the zuul side of things we expect it to noop the config because I already removed the digest auth option from zuul's config files | 18:25 |
fungi | okay, gerritbot is theoretically running now with 763658 hand applied | 18:26 |
fungi | we should be able to check its logs for "Cannot parse data from Gerrit event stream:" | 18:27 |
clarkb | and see what the data is | 18:27 |
fungi | exactly | 18:27 |
clarkb | infra-prod-service-zuul is starting nowish | 18:28 |
clarkb | oh another thing I noticed is that we do fetch from gitea for our periodic jobs when syncing project-config | 18:29 |
clarkb | this didn't end up being a problem because replication went quickly and we replicated project-config first, but we should keep that in mind for the future. It isn't always gerrit state | 18:30 |
clarkb | (maybe a good followup change is to switch it) | 18:30 |
fungi | good thing we waited for the replication to finish yeah | 18:30 |
clarkb | that is in the sync-project-config role | 18:33 |
clarkb | it has a flag to run off of master that we set on the periodic jobs | 18:33 |
*** hamalq has joined #opendev-meeting | 18:34 | |
clarkb | looks like we're pulling new zuul images (not surprising) | 18:37 |
clarkb | it succeeded and ansible logs show that zuul.conf is changed: false which is what we wanted to see \o/ | 18:40 |
clarkb | infra-root I think we are ready for reviews on https://review.opendev.org/c/opendev/system-config/+/757161 since zuul looked good. if this change looks good to you maybe don't approve it just yet as we have to remove review.o.o from the emergency file for it to take effect | 18:42 |
clarkb | also note that we'll have to manually clean up those html/js/css files as the change doesn't rm them. But the change does update gerrit.config so we'll see if it does the right thing there | 18:42 |
fungi | it's just dawned on me that 763658 isn't going to log in production, i don't think, because that's being emitted by gerritlib and we'd need python logging set up to write to a gerritlib debug log? | 19:11 |
clarkb | depends on what the default log setup is I think | 19:11 |
fungi | mmm | 19:11 |
clarkb | I don't know how that service sets up logging | 19:11 |
clarkb | you could edit your update to always log the event at debug level and see if you get those | 19:12 |
clarkb | if you don't then more digging is required | 19:12 |
fungi | gerritbot itself is logging info level and above to syslog at least | 19:13 |
fungi | ahh, okay, https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerritbot/files/logging.config#L14-L17 seems to be mapped into the container and logs gerritlib debug level and higher to stdout | 19:16 |
fungi | if i'm reading that correctly | 19:16 |
fungi | oh, but then the console handler overrides the level to info i guess? | 19:17 |
fungi | so i'd need to tweak that presumably | 19:17 |
clarkb | ya or update your thing to log at a higher level | 19:18 |
clarkb | I suggested that on the change earlier | 19:18 |
clarkb | warn was what I suggested | 19:18 |
fungi | yeah, which i agree with if the content it can't parse identifies some buggy behavior somewhere | 19:19 |
fungi | anyway, in the short term i've restarted it set to debug level logging in the console handler | 19:19 |
clarkb | sounds good | 19:20 |
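What fungi ran into is the usual two-level filter in Python logging: a record has to pass both the logger's level and the handler's level. A dictConfig equivalent of that setup might look like this (illustrative only; the production file is an ini-style fileConfig and the names here are assumptions):

```python
import logging.config

# Sketch of the interaction described above: the gerrit logger is set to
# DEBUG, but the console handler's own level still filters records, so
# raising only the logger level is not enough to see debug output.
LOGGING = {
    "version": 1,
    "formatters": {
        "simple": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "level": "INFO",  # bump this to DEBUG to actually see debug records
            "formatter": "simple",
        },
    },
    "loggers": {
        "gerrit": {"level": "DEBUG", "handlers": ["console"]},
    },
}

logging.config.dictConfig(LOGGING)
```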
clarkb | fungi: what do you think about 757161? should we proceed with that? | 19:22 |
fungi | sure, i've approved it just now | 19:23 |
clarkb | we need to edit emergency.yaml as well | 19:24 |
fungi | oh, yep, doing | 19:24 |
fungi | removed review.o.o from the emergency list just now in the screen on bridge.o.o | 19:25 |
clarkb | lgtm | 19:25 |
clarkb | fungi: I'm going to make a copy of gerrit.config in my homedir so that we can easily diff the result after these changes land | 19:32 |
*** hamalq has quit IRC | 19:33 | |
fungi | sounds good | 19:34 |
clarkb | if this change lands ok then I think we land the next one, then restart gerrit and ensure it is happy with those updates? We can also make these updates on review-test really quickly and restart it | 19:35 |
clarkb | why don't I do that since I'm paranoid and this will make me feel better | 19:35 |
fungi | yeah, not a terrible idea | 19:40 |
fungi | go for it | 19:40 |
clarkb | I did both of the changes against review-test manually and it is restarting now | 19:40 |
fungi | i'm poking more stuff into https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes i've thought of, and updated stuff there we've since resolved | 19:40 |
clarkb | I didn't do the hooks updates though since testing those is more painful | 19:41 |
fungi | oh, one procedural point which sprang to mind, once all remaining upgrade steps are completed and we're satisfied with the result and endmeeting here, we can include the link to the maintenance meeting log in the conclusion announcement | 19:42 |
fungi | that may be a nice bit of additional transparency for folks | 19:42 |
clarkb | ++ | 19:42 |
clarkb | review-test seems happy to me and no errors in the log (except the expected one from plugin manager plugin) | 19:43 |
fungi | i guess that should go on the what's broken list so we don't forget to dig into it | 19:43 |
fungi | adding | 19:43 |
clarkb | fungi: I'm 99% sure it's because you have to explicitly enable that plugin in config in addition to installing it | 19:44 |
clarkb | but we aren't enabling remote plugin management so it breaks. But ya we can test if enabling remote plugin management fixes review-test's error log error | 19:44 |
fungi | i just added it as a reminder to either enable or remove that | 19:45 |
fungi | not a high priority, but would rather not forget | 19:45 |
clarkb | ++ | 19:45 |
clarkb | zuul says we are at least half an hour from it merging the change to update commentlinks on review | 19:46 |
clarkb | ianw: when your monday starts, I was going to ask if you could maybe do a quick check of the review backups to ensure that all the shuffling hasn't made anything sad | 19:47 |
clarkb | related to ^ we'll want to clean up the old reviewdb when we're satisfied with things so that only the accountPatchReviewDb is backed up | 19:49 |
clarkb | should cut down on backup sizes | 19:49 |
fungi | yeah, though we should preserve the pre-upgrade mysqldump for a while "just in case" | 19:52 |
clarkb | ++ | 19:52 |
fungi | i added some "end maintenance" communication steps to the upgrade plan pad | 19:57 |
clarkb | fungi: that list lgtm | 19:58 |
clarkb | the change that should trigger infra-prod-service-review is about to merge | 20:20 |
clarkb | hrm I think that decided it didn't need to run the deploy job :/ | 20:21 |
clarkb | ya ok our files list seems wrong for that job :/ | 20:22 |
clarkb | or wait now playbooks/roles/gerrit is in there | 20:22 |
clarkb | Unable to freeze job graph: Job system-config-promote-image-gerrit-2.13 not defined is the error | 20:23 |
clarkb | I see the issue | 20:23 |
*** gouthamr_ has quit IRC | 20:25 | |
clarkb | remote: https://review.opendev.org/c/opendev/system-config/+/763663 Fix the infra-prod-service-review image dependency | 20:26 |
clarkb | fungi: ^ gerritbot didn't report that or the earlier merge that failed to run the job I expected. Did you catch things in logs | 20:26 |
fungi | looking | 20:26 |
fungi | it didn't log the "Cannot parse ..." message at least | 20:27 |
fungi | seeing if it's failed in some other way | 20:27 |
*** gouthamr_ has joined #opendev-meeting | 20:29 | |
clarkb | I'm not sure if merging https://review.opendev.org/c/opendev/system-config/+/763663 will trigger the infra-prod-service-review job (I think it may since we are updating that job). If it doesn't then I guess we can land the db cleanup change? | 20:32 |
fungi | so here's the new gerritbot traceback :/ | 20:33 |
fungi | http://paste.openstack.org/show/800294/ | 20:33 |
clarkb | fungi: its a bug in your change | 20:34 |
clarkb | you should be print line not data | 20:34 |
clarkb | because data doesn't get assigned if you fall into the traceback | 20:34 |
fungi | d'oh, yep! | 20:35 |
fungi | okay, switched that log from data to l | 20:36 |
fungi | will update the change | 20:36 |
clarkb | fungi: note its line not l | 20:36 |
clarkb | at least looking at your change | 20:36 |
fungi | well, it's "l" in production, it'll be "line" in my change | 20:37 |
fungi | we're running the latest release of gerritlib in that container, not the master branch tip | 20:37 |
clarkb | I see | 20:37 |
fungi | pycodestyle mandated that get "fixed" | 20:37 |
clarkb | of course | 20:38 |
fungi | but the fact that we were tripping that code path indicates we're seeing more occurrences of unparseable events in the stream at least | 20:38 |
clarkb | ya | 20:38 |
clarkb | can you review https://review.opendev.org/c/opendev/system-config/+/763663 ? | 20:39 |
clarkb | zuul should be done check testing it in about 7 minutes | 20:39 |
clarkb | fungi: I wonder if there is a new event type that isn't json | 20:40 |
clarkb | and we've just got to ignore it or parse it differently | 20:40 |
clarkb | I guess we should find out soon enough | 20:40 |
fungi | aha, yep, good catch | 20:41 |
fungi | on 763663 | 20:41 |
fungi | also ansible seems to have undone my log level edit in the gerritbot logging config so i restarted again with it reset | 20:42 |
clarkb | fungi: it will do that hourly as eavesdrop is in the hourly cron jobs | 20:42 |
clarkb | fungi: maybe put eavesdrop in emergency? | 20:42 |
fungi | yeah, i suppose i can do that | 20:42 |
fungi | done | 20:43 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/757162/ is the next change to land if infra-prod-service-review doesn't run after my fix (it's not clear to me if the fix will trigger the job due to file matchers and zuul behavior) | 20:46 |
ianw | clarkb will do | 20:55 |
clarkb | fungi: fwiw `grep GerritWatcher -A 20 debug.log` in /var/log/zuul on zuul01 doesn't show anything like that. It does show when we restart gerrit and connectivity is lost | 20:59 |
ianw | just trying to catch up ... gerritbot not listening? | 21:00 |
clarkb | ianw: it's having trouble decoding events at times | 21:00 |
clarkb | grepping JSONDecodeError in that debug.log for zuul shows it happens once? | 21:00 |
clarkb | and then it tries to reconnect. I think that may line up with a service restart | 21:00 |
clarkb | 15:25:46,161 <- iirc that is when we restarted to get on the newly promoted image | 21:01 |
clarkb | so no real key indicator yet | 21:01 |
clarkb | ianw: we've started reenabling cd too, having trouble getting infra-prod-service-review to run due to job deps which should be fixed by https://review.opendev.org/c/opendev/system-config/+/763663 not sure if that change landing will run the job though | 21:01 |
clarkb | once that job does run and we're happy with the result I think we're good from the cd perspective | 21:02 |
clarkb | fungi: ya in the zuul example of this problem it seems that zuul gets a short read because we restarted the service. That then fails to decode because it's incomplete json. Then it fails a few times after that trying to reconnect | 21:03 |
ianw | ok sorry i'm about 40 minutes away from being 100% here | 21:04 |
clarkb | ianw: no worries | 21:04 |
clarkb | waiting for system-config-legacy-logstash-filters to start | 21:11 |
clarkb | kolla runs 40 check jobs | 21:11 |
clarkb | 29 are non voting | 21:12 |
clarkb | I'm not sure this is how we imagined this would work | 21:12 |
clarkb | "OR, more simply, just check the User-Agent and serve the all the HTTP incoming requests for Git repositories if the client user agent is Git." I like this idea from luca | 21:23 |
fungi | clarkb: yeah, no idea if it's a short read or what, though we're not restarting gerrit when it happens | 21:26 |
fungi | though that could explain why it was crashing when we'd restart gerrit | 21:27 |
clarkb | ya | 21:27 |
fungi | i hadn't looked into that failure mode | 21:27 |
clarkb | that makes me wonder if sighup isn't happening or isn't as graceful as we hope | 21:27 |
clarkb | you'd expect gerrit to flush connections and close them on a graceful stop | 21:27 |
clarkb | that might be a question for luca | 21:27 |
clarkb | we can probably test that by manually doing a sighup to the process and observing its behavior | 21:28 |
clarkb | rather than relying on docker-compose to do it | 21:28 |
clarkb | then we at least know gerrit got the signal | 21:28 |
*** hamalq has joined #opendev-meeting | 21:30 | |
clarkb | or maybe we need a longer stop_grace_period value in docker-compose | 21:30 |
clarkb | though its already 5m and we stop faster than that | 21:30 |
clarkb | this system-config-legacy-logstash-filters job ended up on the airship cloud and its super slow :/ | 21:33 |
clarkb | slightly worried it might time out | 21:34 |
*** hamalq has quit IRC | 21:35 | |
clarkb | fungi: I put a kill command (commented out) in the review-test screen if we want to try and manually stop the gerrit process that way and see if it goes away quickly like we see with docker-compose down | 21:36 |
fungi | checking | 21:37 |
clarkb | if https://review.opendev.org/c/opendev/system-config/+/763663 fails I'm gonna break for lunch/rest | 21:38 |
clarkb | while it rechecks | 21:38 |
fungi | clarkb: yeah, that looks like the proper child process | 21:38 |
clarkb | k I guess I should go ahead and run it and see what happens | 21:39 |
clarkb | it stopped almost immediately | 21:40 |
clarkb | that's "good" i guess. means our docker compose file is unlikely to be broken | 21:40 |
clarkb | I wonder if that means that gerrit no longer handles sighup | 21:40 |
fungi | may be worth double-checking the error_log to see if it did log a graceful stop | 21:41 |
clarkb | wow I think it may have finished just before the timeout | 21:45 |
clarkb | the job I mean | 21:45 |
fungi | seems like we ought to consider putting it on a diet rsn | 21:45 |
fungi | hey! my stuff is logging | 21:46 |
fungi | looks like it's getting a bunch of empty strings on read | 21:46 |
fungi | Nov 22 21:46:32 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 21:46:20,320 DEBUG gerrit.GerritWatcher: Cannot parse data from Gerrit event stream: | 21:47 |
fungi | Nov 22 21:46:32 eavesdrop01 docker-gerritbot[1386]: '' | 21:47 |
fungi | it's seriously spamming the log | 21:47 |
clarkb | fungi: maybe just filter those out then? | 21:47 |
fungi | i wonder if this is some pathological behavior from it getting disconnected and not noticing | 21:47 |
clarkb | oh maybe? | 21:47 |
fungi | but yeah, i'll add a "if line" or similar to skip empty reads | 21:48 |
clarkb | ok I don't think https://review.opendev.org/c/opendev/system-config/+/763663 was able to trigger the deploy job | 21:48 |
fungi | and see what happens | 21:48 |
fungi | it's having trouble stopping the gerritbot container even | 21:48 |
fungi | okay, it's restarted with that conditional wrapping everything after the readline() | 21:50 |
clarkb | interestingly zuul doesn't appear to have that problem | 21:50 |
clarkb | could it be a paramiko issue? | 21:50 |
clarkb | maybe compare paramiko between zuul and gerritbot | 21:50 |
clarkb | infra-root do we want to land https://review.opendev.org/c/opendev/system-config/+/757162 to try and get the infra-prod-service-review deploy job to run now? Or would we prefer a more noopy change to get the change that previously merged to run? | 21:51 |
clarkb | I don't think enqueue will work because the issue was present on the change that merged earlier so enqueuing will just abort | 21:52 |
fungi | i'm in the middle of dinner now so can look soonish or will happily defer to others' judgement there | 21:52 |
clarkb | fungi: related to that I'm fading fast I think I need a meal | 21:52 |
* clarkb tracks one down | 21:52 | |
clarkb | I think the risk with 757162 is that it adds more changes to apply with infra-prod-service-review rather than the more simple 757161 | 21:53 |
clarkb | I'll push up a noopy change then go get food | 21:53 |
clarkb | remote: https://review.opendev.org/c/opendev/system-config/+/763665 Change to trigger infra-prod-service-review | 21:55 |
clarkb | and now I'm taking a break | 21:56 |
ianw | i'm surprised gerrit would want to do the UA matching; but we could do something like the google approach and move git to a separate hostname, but then we do the UA switching with a 301 in apache? | 21:56 |
clarkb | ianw: well its luca not google | 21:57 |
clarkb | I'm not quite sure how just a separate hostname helps | 21:58 |
fungi | i haven't seen anyone reply or comment on my bug report yet | 21:58 |
clarkb | because you need to apply gerrit acls and auth | 21:58 |
clarkb | fungi: ya most of the discussion is on the mailing list. I'm hoping they poke at the bug when the work week resumes | 21:58 |
fungi | is there some sort of side discussion going on? | 21:59 |
fungi | oh | 21:59 |
fungi | process: mention to luca and he asks me to file a bug. i do that and discussion of it happens somewhere other than the bug or my e-mail | 21:59 |
clarkb | indeed | 22:00 |
fungi | so just to be clear: if there has been some discussion of the bug report i filed, i have seen none of it | 22:01 |
fungi | i'll happily weigh in on discussion within the bug report | 22:02 |
clarkb | I mentioned both the issue and the bug itself in my response to the mailing list and they are now discussing it on the mailing list not the bug | 22:02 |
fungi | i'll just continue to assume my input on the topic is not useful in that case | 22:03 |
clarkb | my hunch is it's more that on sunday it's easy to couch quarterback the mailing list but not the bug tracker | 22:04 |
fungi | fair enough, i'll be at the ready to reply with bug comments once the professionals are back on the field | 22:05 |
ianw | yeah, i see it as just floating a few ideas; but fundamentally you let people call their projects anything, have people access repos via /<name> and use some bits for their UI. seems like a choose two situation | 22:06 |
ianw | clarkb: https://review.opendev.org/c/opendev/system-config/+/757162 seems ok to me? | 22:07 |
clarkb | ianw: ya I expect it's fine. it's more that we've force merged a number of changes as well as merged 757161 at this point and none of those have run yet | 22:12 |
clarkb | ianw: so I'm thinking keep the delta down as much as possible may be nice | 22:12 |
clarkb | ianw: but if you're able to keep an eye on things I'm also ok with merging 757162 | 22:13 |
clarkb | I'm "around". Eating soup | 22:13 |
clarkb | fwiw looking at the code it seems that gerrit does properly install a java runtime shutdown hook | 22:13 |
clarkb | not sure if that hook is sufficient to gracefully stop connections though | 22:14 |
ianw | clarkb: yeah, i'm around and can watch | 22:14 |
clarkb | ianw: that change still has my WIP on it but feel free to remove that and approve if the change itself looks good after a review | 22:14 |
clarkb | ianw: I also put a copy of gerrit.config and secure.config in ~clarkb/gerrit_config_backups on review to aid in checking of diffs after stuff runs | 22:15 |
ianw | clarkb: i'm not sure i can remove your wip now | 22:15 |
clarkb | oh because we don't have admin | 22:15 |
clarkb | ok give me a minute | 22:15 |
clarkb | WIP removed | 22:16 |
clarkb | (but I didn't approve) | 22:16 |
ianw | i'll watch that | 22:17 |
ianw | TASK [sync-project-config : Sync project-config repo] ************************** seems to be failing on nb01 & nb02 | 22:44 |
fungi | :/ | 22:44 |
clarkb | are the disks full again? | 22:45 |
clarkb | we put project-config on /opt too | 22:45 |
ianw | /dev/mapper/main-main 1007G 1007G 0 100% /opt | 22:45 |
ianw | clarkb: jinx :) | 22:45 |
ianw | ok, it looks like i'm debugging that properly today now :) | 22:46 |
fungi | gerritbot is still parsing events for the moment | 22:48 |
fungi | time check, we've got just over two hours until our maintenance is officially slated to end | 22:53 |
ianw | system-config-run-review (2. attempt) ... unfortunately i missed what caused the first attempt to fail | 22:53 |
ianw | this is on the gate job for https://review.opendev.org/c/opendev/system-config/+/757162/ | 22:53 |
clarkb | fungi: yup I'm hopeful we'll get ^ to deploy and we restart one more time | 22:54 |
ianw | i'll go poke at the zuul logs to make sure it was an infra error, not something with the job | 22:54 |
clarkb | but once that restart is done and we're happy with things I think we call it done | 22:55 |
clarkb | its merging now | 22:58 |
fungi | excellent | 22:58 |
clarkb | infra-prod-service-review is queued | 22:58 |
clarkb | and ansible is running | 22:59 |
clarkb | and its done | 23:01 |
clarkb | the only thing I didn't quite expect was it restarted apache2 | 23:01 |
clarkb | so maybe the edits we made to the vhost config didn't quite line up | 23:01 |
clarkb | I'm going to compare diffs and look at files and stuff now | 23:01 |
clarkb | gerrit.config looks "ok". We are not quoting the same way as gerrit I don't think, so a lot of the comment links have "changes" | 23:03 |
clarkb | I think those are fine | 23:03 |
clarkb | secure.config looks good | 23:03 |
fungi | we should probably try to normalize them in git though | 23:04 |
clarkb | ++ | 23:04 |
clarkb | docker-compose.yaml lgtm | 23:04 |
clarkb | the track-upstream and manage-projects scripts lgtm | 23:04 |
clarkb | patchset-created lgtm | 23:05 |
clarkb | the apache vhost lgtm, it's got the request header and no /p/ redirection | 23:06 |
clarkb | ok I think the only other thing to do is delete/move aside the files that 757161 stops managing | 23:07 |
clarkb | I'll move them into my homedir then we can restart? | 23:07 |
fungi | sgtm | 23:09 |
fungi | on hand for the gerrit container restart once you've moved those files away | 23:10 |
clarkb | files are moved | 23:10 |
clarkb | fungi: do you want me to do the down up -d or will you do it? | 23:11 |
fungi | happy to do it | 23:11 |
clarkb | (not sure if on hand meant around for it or doing the typing) | 23:11 |
clarkb | k go for it | 23:11 |
fungi | downed and upped in the root screen session on review.o.o | 23:12 |
clarkb | yup saw it happen | 23:12 |
clarkb | seems to be up now. I can view changes | 23:13 |
clarkb | fungi: you may need to convince gerritbot to be happy? or maybe not after your changes | 23:14 |
clarkb | on the upgrade etherpad everything but item 7 is struck through | 23:14 |
clarkb | I'll abandon my noopy change | 23:16 |
fungi | looking | 23:16 |
fungi | Nov 22 23:11:46 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 23:11:46,598 DEBUG paramiko.transport: EOF in transport thread | 23:16 |
fungi | that seems likely to indicate it lost the connection and didn't reconnect? | 23:16 |
fungi | i've restarted the gerritbot container now | 23:17 |
fungi | it's getting and logging events | 23:17 |
clarkb | cool | 23:18 |
clarkb | fungi: you haven't happened to have drafted the content for item 7 have you? | 23:18 |
fungi | nope, but i could | 23:18 |
clarkb | re the config I actually wonder if what is happening is we quote things in our ansible and old gerrit removed them but new gerrit does not remove them | 23:19 |
clarkb | because the config has been static since 2.13 except for hand edits | 23:19 |
clarkb | fungi: that would be great | 23:19 |
fungi | i'll start a draft in a pad | 23:19 |
clarkb | zuul is seeing events too because a horizon change just entered the gate | 23:20 |
*** hamalq has joined #opendev-meeting | 23:31 | |
clarkb | fungi: I've detached from the screen on review and bridge | 23:33 |
clarkb | I don't think they have to go away anytime soon but I think I'm done with them myself | 23:33 |
fungi | cool | 23:33 |
clarkb | also detached on review-test | 23:33 |
*** hamalq has quit IRC | 23:35 | |
fungi | started the announce ml post draft here: https://etherpad.opendev.org/p/nzYm6eWfCr1mSf0Dis4B | 23:37 |
fungi | i'm positive it's missing stuff | 23:37 |
clarkb | looking | 23:37 |
clarkb | fungi: I made a couple small edits | 23:38 |
clarkb | lgtm otherwise | 23:38 |