Sunday, 2020-11-22

fungilooks like it fit!00:00
clarkbI think its still channel dependent because the channel name goes at the beginning of the message but ya seems like for the channels I'm in it is good00:00
clarkbonce 763618 lands and promotes the image I think we're in a good spot to turn on cd again, but more and more I'm feeling like that is a tomorrow morning thing00:02
clarkbwill others be around for that?00:02
clarkbsounded like ianw would au morning00:02
fungii will be around circa 13-14z00:07
corvusi probably won't be around tomorrow00:08
clarkbmy concern with doing it today is if we turn it on and don't notice problems because they happen at 0600 utc or whatever00:08
clarkbso probably tomorrow is best?00:09
corvusyeah i agree00:09
fungii don't expect to have major additional issues crop up which we'll be unable to deal with on the spot00:09
clarkbok I think the promote has happened01:05
clarkbhttps://review.opendev.org/c/opendev/system-config/+/763618/ build succeeded deploy pipeline01:05
fungiexcellent01:40
fungii'm starting to fade though01:40
mordredCongrats on the Gerrit upgrade!!!02:20
mordredThe post upgrade etherpad doesn't look horrible02:22
clarkbthe x/ conflict is probably the big thing02:26
fungithanks mordred!02:32
mordredBtw ... If anybody needs an unwind Netflix ... We Are The Champions is amazing. We watched the hair episode last night. I promise you've never seen anything like it02:56
*** hamalq has joined #opendev-meeting09:09
*** hamalq has quit IRC11:10
*** hamalq has joined #opendev-meeting11:46
*** hamalq has quit IRC11:51
fungino signs of trouble this morning, i'm around whenever folks are ready to try reenabling ansible14:23
clarkbfungi: I think our first step is to confirm our newly promoted 3.2 tagged image is happy on review-test, then roll that out to prod15:02
clarkbthen in my sleep I figured a staged rollout of cd would probably be good: put the ssh keys back but keep review and zuul in the emergency file and see what jobs do, then remove zuul from emergency file and see what jobs do, then remove review and land the html cleanup change I wrote and see how that does?15:03
clarkbI think for the first two steps the periodic jobs should give us decent coverage15:03
clarkbmordred: I've watched the first two episodes. the cheese wheel racing is amazing15:04
fungithe image we're currently running from is hand-built? or fetched from check pipeline?15:07
clarkbdavid ostrovsky has congratulated us on the gerrit mailing list. Also asks if we have any feedback. I guess following up there with the url paths thing might be good as well as questions about if you can make the notedb step go faster by somehow manually gc'ing then manually reindexing15:07
clarkbfungi: review-test should be running the check pipeline image15:07
clarkbthe docker compose file should reflect that15:08
fungibut we've got a fix of some sort in place in production right?15:08
clarkbfungi: and prod is in the same boat iirc15:08
fungiahh, okay, yep looks like it's also running the image built in the check pipeline then15:09
clarkbfungi: the fix is that https://review.opendev.org/c/opendev/system-config/+/763618 and promoted our workaround as the 3.2 tag in docker hub. Which means we can switch back to using the opendevorg/gerrit:3.2 image on both servers15:09
clarkbI think we should do review-test first and just quickly double check that git clone still works, then do the same with prod15:09
fungiright, i'll get it swapped out on review-test now15:09
fungiopendevorg/gerrit                                         3.2                                    3391de1cd0b2        15 hours ago        681MB15:10
fungithat's what review-test is in the process of starting on now15:11
clarkbfungi: I can clone ranger from review-test15:18
clarkbvia https15:18
fungiyup, same. perfect15:23
fungishall i similarly edit the docker-compose.yaml on review.o.o in that case?15:23
clarkbyes I think we should go ahead and get as many of these restarts in on prod during our window as we can15:23
fungiedits made, see screen window15:24
fungido i need to down before i pull, or can i pull first?15:24
clarkbyou can pull first15:24
clarkbsorry I'm not on the screen yet, but I think it will be fine since you just did it on -test15:25
fungiopendevorg/gerrit                                         3.2                                    3391de1cd0b2        15 hours ago        681MB15:25
fungithat's what's pulled15:25
clarkband it matches -test15:25
fungishall i down and up -d?15:25
clarkb++15:25
fungidone15:25
clarkbone thing that occurred to me is we should double check our container shutdown process is still valid. I figured an easy way to do that was to grab the deb packages they publish and read the init script but I can find where the actual packages are15:29
fungi`git clone https://review.opendev.org/x/ranger` is still working for me15:30
clarkb*I can't find where15:30
fungiwhich package? docker? docker-compose?15:31
clarkbnevermind, found them. deb.gerritforge.com is only older stuff; bionic.gerritforge.com has newer things15:31
clarkbfungi: the "native packages" that luca publishes http://bionic.gerritforge.com/dists/gerrit/contrib/binary-amd64/gerrit-3.2.5.1-1.noarch.deb15:31
clarkbsince I assume that will have systemd unit or init file that we can see how stop is done15:32
clarkbour current stop is based on the 2.13 provided init script. actually I wonder if 3.2 provides one too15:32
clarkbah yup it does15:33
clarkbresources/com/google/gerrit/pgm/init/gerrit.sh and that still shows sig hup so I think we're good15:33
fungioh, got it15:33
fungithought you were talking about docker tooling packages15:34
clarkbno just more generally. Our docker-compose config should send a sighup to stop gerrit's container15:34
clarkbwhich it looks like is correct15:34
clarkb*is still correct15:35
*** hamalq has joined #opendev-meeting15:58
clarkbremote:   https://review.opendev.org/c/opendev/system-config/+/763656 Update gerrit docker image to java 1115:58
clarkbI think that is a later thing so will mark it WIP15:58
clarkbalso gerritbot didn't report that :/15:58
clarkboh right we just restarted :)15:58
clarkbI'm restarting gerritbot now15:59
clarkbalso git review gives a nice error message when you try to push with too old git review to new gerrit16:00
fungiseems like we need to restart gerritbot any time we restart gerrit these days16:00
*** hamalq has quit IRC16:02
clarkbok I won't say I feel ready, but I'm probably as ready as I will be :) what do you think of my staged plan to get zuul cd happening again?16:03
fungiit seems sound, i'm up for it16:06
clarkbopendev-prod-hourly jobs are the ones that we'd expect to run and those run at the top of the hour. So if we move authorized_keys back in place then we should be able to monitor at 17:00UTC?16:06
clarkbthen if we're happy with the results of that we remove zuul from emergency and wait for the hourly prod jobs at 18:00UTC16:07
clarkb(zuul is in that list)16:07
clarkbfungi: I put a commented out mv command in the bridge screen to put keys back in place, can you check it?16:10
fungiyep, that looks adequate16:10
clarkbok I guess we wait for 17:00 then?16:10
fungiwas ansible globally disabled, and have we taken things back out of the emergency disable list?16:14
fungilooks like /home/zuul/DISABLE-ANSIBLE does not exist on bridge at least16:14
clarkbansible was not globally disabled with the DISABLE-ANSIBLE file and the hosts are all still in the emergency disable list16:14
clarkbwe used the more forceful "you cannot ssh at all" disable method16:14
fungicool, so in theory the 1700z deploy will skip the stuff we still have disabled in the emergency list16:15
clarkbyup, then after that if we're happy with the results we take the zuul hosts out of emergency and let the next hourly pulse run on them16:15
clarkbthen if we're happy with that we remove review and then land my html cleanup change16:16
clarkbreview isn't part of the hourly jobs so we need something else to trigger a job on it (it is on the daily periodic jobs though so we should ensure we run jobs against it before ~0600 or put it back in the emergency file)16:16
clarkbfungi: one upside to doing the ssh disable is that the jobs fail quicker in zuul16:19
clarkbwhich we wanted because we knew that things would be off for a long period of time16:19
clarkbwhen you write the disable ansible file the jobs will poll it and see if it goes away before their timeout16:19
clarkbduring typical operation ^ is better because it's a short window where you want to pause rather than a full stop16:20
clarkbhttps://etherpad.opendev.org/p/E3ixAAviIQ1F_1-gzuq_ is the gerrit mailing list email from david. I figure we should respond. fungi not sure if you're subscribed? but seems like we should write up an email and bring up the x/ conflict?16:21
fungii'm not subscribed, but happy if someone mentions that bug to get some additional visibility/input16:22
clarkbfungi: I drafted a response in that etherpad, have a moment to take a look?16:32
fungiyep, just a sec16:33
clarkbI think they were impressed we were able to incorporate a jgit fix from yesterday too :)16:35
clarkbsomething something zuul16:35
fungiyep, reply lgtm, thanks!16:37
fungii've got something to which i must attend briefly, but will be back to check the hourly deploy run16:38
clarkbresponse sent16:41
clarkbinfra-prod-install-ansible is running17:01
clarkbas well as publish-irc-meetings (that one maybe didn't rely on bridge though?)17:01
clarkbinfra-prod-install-ansible reports success17:03
clarkbnow it is running service-bridge17:03
clarkbservice-bridge claims success now too17:06
clarkbcloud-launcher is running now17:07
clarkbfungi: are you back?17:12
fungiyup17:12
clarkbI've checked that system-config updated in zuul's homedir17:12
fungilooking good so far17:12
clarkbbut now am trying to figure out where the hell project-config is synced to/from17:13
clarkb/opt/project-config is what system-config ansible vars seem to show but that seems old as dirt on bridge17:13
clarkbthat makes me think that it isn't actually where we sync from, but I'm having a really hard time understanding it17:13
fungii thought it put one in ~zuul17:13
clarkbok that one looks more up to date17:14
clarkbbut I still can't tell from our config management what is used17:14
fungifrom friday, yah17:14
clarkb(also its a project creation from friday... maybe we should've stopped those for a bit)17:14
fungiwell, it won't run manage-projects yet17:15
fungibecause of review.o.o still being in emergency17:15
clarkbya17:15
fungibut yeah once we reenable that, we should check the outcome of manage-projects runs17:15
clarkbI think I figured it out17:15
clarkb/opt/project-config is the remote path but not the bridge path17:16
clarkbthe bridge path is /home/zuul/src/../project-config17:16
clarkbfungi: looking at timestamps there is a good chance that project is already created /me checks17:17
clarkbhttps://opendev.org/openstack/charm-magpie17:18
clarkband they are in gerrit too, ok one less thing to worry about until we're happy with the state of the world17:18
clarkbfungi: nodepool's job is next and I think that one may be expected to fail due to the issues on the builders that ianw was debugging. Not sure if they have been fixed yet17:19
clarkbjust a heads up that a failure there is probably sane17:19
clarkbI suspect that our hourly jobs take longer than an hour to complete17:20
clarkbhuh cloud launcher failed, I wonder if it is trying to talk to a cloud that isn't available anymore (that is usually why it fails)17:22
clarkbfungi: it just occurred to me that the jeepyb scripts that talk to the db likely won't fail until we remove the db config from the gerrit config17:23
clarkbfungi: and there is potential there for welcome message to spam new users created on 3.2 because it won't see them on the 2.16 db17:23
clarkbI don't think that is urgent ( we can apologise a lot) but its in the stack of changes to do that cleanup anyway. Then those scripts should start failing on the ini file lookups17:24
fungimmm, maybe if they have only one recorded change in the old db, yes17:26
fungii think it would need them to exist in the old db but have only one change present17:26
fungii need to look back at the logic there17:27
clarkbalso we can just edit playbooks/roles/gerrit/files/hooks/patchset-created to drop welcome-message?17:27
fungieasily17:27
clarkbthe other one was the blueprint update?17:28
fungibug update17:28
clarkblooks like bug and blueprint both17:29
clarkbconfirmed that nodepool failed17:29
clarkbregistry running now17:29
clarkbfungi: so ya maybe we get a change in that simply comments out those scripts in the various hook scripts for now?17:29
clarkbthen that can land before or after the db config cleanup17:30
fungiyeah, looking at welcome_message.py the query is looking for changes in the db matching that account id, so it won't fire for actually new users post upgrade, but will i think continue to fire for existing users who only had one change in before the upgrade17:30
clarkbgot it17:30
clarkbregistry just succeeded. zuul is running now and it should noop succeed17:30
fungii think update_bug.py will still half-work, it will just fail to reassign bugs17:30
clarkbfungi: but will it raise an exception early because the ini file doesn't have the keys it is looking for anymore?17:31
clarkbI think it will17:31
fungioh, did we remove the db details from the config?17:31
clarkbfungi: not yet, that is one of the changes to land though17:32
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/75716217:32
fungigot it17:32
fungiso yeah i guess we can strip them out until someone has time to address those two scripts17:32
fungii'll amend 757162 with that i guess17:32
clarkbwfm17:33
clarkbnote its parent is the html cleanup17:33
clarkbwhich is also not yet merged17:33
fungiyeah, i'm keeping it in the stack17:33
clarkbzuul "succeeded"17:33
clarkbfungi: also its three scripts17:34
clarkbwelcome message, and update bug and update blueprint17:34
fungiupdate blueprint doesn't rely on the db though17:36
clarkbit does17:36
fungihuh, i wonder why. okay i'll take another look at that one too17:36
fungiselect subject, topic from changes where change_key=%s17:37
fungiyeesh, okay so it's using the db to look up changes17:38
clarkbya rest api should be fine for that anonymously too17:38
clarkbI'm adding notes to the etherpad17:38
fungiand yeah, the find_specs() function performing that query is called unconditionally in update_blueprint.py so it'll break entirely17:38
clarkband eavesdrop succeeded. Now puppet else is starting17:40
fungiupdate_bug.py is also called from two other hook scripts, i'll double-check whether those modes are expected to work at all17:40
fungilooks like the others are safe to stay, update_bug.py is only connecting to the db within set_in_progress() which is only called within a conditional checking args.hook == "patchset-created"17:43
clarkbfungi: where does it do the ini file lookups?17:44
clarkbbecause it will raise on those when the keys are removed from the file17:44
clarkb(its less about where it connects and more where it finds the config)17:44
clarkbpuppetry is still running according to the log I'm tailing17:48
clarkbI don't expect this to finish before 18:00, but should I go ahead and remove the zuul nodes from the emergency file anyway now since things seem to be working?17:48
clarkbfungi: ^17:48
fungiini file is parsed in jeepyb.gerritdb.connect() which isn't called outside the check for patchset-created17:49
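The point being made here is that a missing key only raises once the code that reads the config actually runs. A minimal sketch of that behaviour, assuming Python's standard configparser; the section name, key name, and both helper functions are hypothetical stand-ins and are not taken from jeepyb:

```python
import configparser

# Illustrative config snippets; the section and key names are assumptions.
WITH_DB = "[database]\nhostname = localhost\n"
WITHOUT_DB = "[gerrit]\nbasePath = git\n"


def connect_settings(config_text):
    """Hypothetical stand-in for a connect() helper that reads db keys lazily."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Raises NoSectionError/NoOptionError once the db details are removed.
    return parser.get("database", "hostname")


def run_hook(hook, config_text):
    """Only the patchset-created path touches the database settings."""
    if hook == "patchset-created":
        return connect_settings(config_text)
    return None
```

Under these assumptions, run_hook("change-merged", WITHOUT_DB) returns None without ever parsing the database section, while run_hook("patchset-created", WITHOUT_DB) raises configparser.NoSectionError.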
fungisorry, digging in jeepyb internals17:50
fungiwhat's the desire to take zuul servers out of emergency in the middle of a deploy run?17:50
fungioh, just in case it finishes before the top of the hour?17:51
clarkbthat this deploy run is racing the next cron iteration17:51
clarkbyup17:51
clarkbif we wait we might have to skip to 19:00 though this may end up happening anyway17:51
clarkboh wait puppet is done it says17:51
fungiis it likely to decide to deploy things to zuul servers in this run if they're taken out of emergency early?17:51
fungiahh, then go for it17:51
clarkbok will do it in the screen17:51
clarkband done, can you double check the contents of the emergency file really quickly just to make sure I didn't do anything obviously wrong?17:52
fungiemergency file in the bridge screen lgtm17:54
fungigerritbot is still silent. did it get restarted?17:55
clarkbya I thought I restarted it17:55
fungilast started at 15:5917:55
fungigerrit restart was 15:2517:56
fungibut for some reason it didn't echo when i pushed updates for your system-config stack17:56
fungichecking gerritbot's logs17:56
fungiNov 22 16:59:23 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 16:59:23,196 ERROR gerrit.GerritWatcher: Exception consuming ssh event stream:17:58
fungi(from syslog)17:58
clarkbneat17:58
clarkbI thought it was supposed to try and reconnect17:59
fungilooks like the json module failed to parse an event17:59
fungiand that crashed gerritbot17:59
clarkbprobably sufficient for now to ignore json parse failures there?17:59
clarkbbasically go "well I don't understand, oh well"17:59
fungithis is being parsed from within gerritlib, the exception was raised from there18:00
fungiso we'll likely need to fix this in gerritlib and tag a new release18:00
clarkbside note: zuul doesn't use gerritlib, maybe there is something to be learned in zuul's implementation18:01
fungihttp://paste.openstack.org/show/800291/18:05
fungithat's the full traceback18:05
clarkbare we getting empty events18:05
clarkbmaybe the fix is to wrap that in if l : data = json.loads(l)18:06
clarkband maybe catch json decode errors that happen anyway and reconnect or something like that18:07
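The two suggestions above (skip empty reads, and tolerate decode errors instead of crashing) can be sketched roughly as follows. This is an illustration only, not the gerritlib code; the function name, logger name, and log message are assumptions:

```python
import json
import logging

log = logging.getLogger("example.eventstream")


def parse_event_line(line):
    """Return the decoded event dict, or None if the line is unusable."""
    # An empty read carries no event, so skip it before calling json.loads.
    if not line or not line.strip():
        return None
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Log the raw line; nothing assigned inside the try block exists here.
        log.warning("Cannot parse event stream line: %r", line)
        return None
```

A caller would enqueue the result only when it is not None, so a truncated or empty read is logged and dropped rather than ending the watcher thread. Whether to reconnect after repeated failures is a separate decision this sketch does not cover.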
clarkbfungi: what's with the docker package inspection in the review screen?18:11
fungihttps://review.opendev.org/c/opendev/gerritlib/+/763658 Log JSON decoding errors rather than aborting18:11
fungiclarkb: when you were first suggesting we needed to look at shutdown routines in some unspecified package you couldn't find, i thought you meant the docker package so i was tracking down where we'd installed it from18:12
clarkbgotcha18:12
clarkbfungi: looks like gerritbot side will also need to be updated to handle None events18:14
clarkbI think we can land the gerritbot change first and be backward compatible18:14
clarkbthen land gerritlib and release it18:15
fungiclarkb: i don't see where gerritbot needs updating. _read() is just trying to parse a line from the stream and then either explicitly returning None early or returning None implicitly after enqueuing any event it found18:16
fungii figured the return value wasn't being used since that method wasn't previously explicitly returning at all18:17
clarkbfungi: https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L303-L346 specifically line 307 assumes a dict18:17
clarkboh I see18:17
clarkbyou're short circuiting18:17
clarkbnevermind you're right18:17
fungiyeah, the _read() in gerritbot is being passed contents from the queue, not return values from gerritlib's _read()18:18
clarkbleft a suggestion but +2'd it18:18
clarkbyup18:18
clarkbnodepool is running now, then registry then zuul18:24
clarkbon the zuul side of things we expect it to noop the config because I already removed the digest auth option from zuul's config files18:25
fungiokay, gerritbot is theoretically running now with 763658 hand applied18:26
fungiwe should be able to check its logs for "Cannot parse data from Gerrit event stream:"18:27
clarkband see what the data is18:27
fungiexactly18:27
clarkbinfra-prod-service-zuul is starting nowish18:28
clarkboh another thing I noticed is that we do fetch from gitea for our periodic jobs when syncing project-config18:29
clarkbthis didn't end up being a problem because replication went quickly and we replicated project-config first, but we should keep that in mind for the future. It isn't always gerrit state18:30
clarkb(maybe a good followup change is to switch it)18:30
fungigood thing we waited for the replication to finish yeah18:30
clarkbthat is in the sync-project-config role18:33
clarkbit has a flag to run off of master that we set on the periodic jobs18:33
*** hamalq has joined #opendev-meeting18:34
clarkblooks like we're pulling new zuul images (not surprising)18:37
clarkbit succeeded and ansible logs show that zuul.conf is changed: false which is what we wanted to see \o/18:40
clarkbinfra-root I think we are ready for reviews on https://review.opendev.org/c/opendev/system-config/+/757161 since zuul looked good. if this change looks good to you maybe don't approve it just yet as we have to remove review.o.o from the emergency file for it to take effect18:42
clarkbalso note that we'll have to manually clean up those html/js/css files as the change doesn't rm them. But the change does update gerrit.config so we'll see if it does the right thing there18:42
fungi it's just dawned on me that 763658 isn't going to log in production, i don't think, because that's being emitted by gerritlib and we'd need python logging set up to write to a gerritlib debug log?19:11
clarkbdepends on what the default log setup is I think19:11
fungimmm19:11
clarkbI don't know how that service sets up logging19:11
clarkbyou could edit your update to always log the event at debug level and see if you get those19:12
clarkbif you dont then more digging is required19:12
fungigerritbot itself is logging info level and above to syslog at least19:13
fungiahh, okay, https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerritbot/files/logging.config#L14-L17 seems to be mapped into the container and logs gerritlib debug level and higher to stdout19:16
fungiif i'm reading that correctly19:16
fungioh, but then the console handler overrides the level to info i guess?19:17
fungiso i'd need to tweak that presumably19:17
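The behaviour described here, where a logger set to debug still emits nothing below info because its handler has a higher level, can be reproduced with the standard logging module. A small sketch; the logger name and messages are made up for illustration, and this is not the gerritbot logging config itself:

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setLevel(logging.INFO)       # the handler filters below INFO

logger = logging.getLogger("example.gerritlib")
logger.setLevel(logging.DEBUG)       # the logger itself accepts DEBUG
logger.propagate = False
logger.addHandler(handler)

logger.debug("dropped by handler")   # passes the logger, blocked by the handler
logger.info("emitted")
captured_at_info = stream.getvalue()

handler.setLevel(logging.DEBUG)      # lowering the handler level lets DEBUG through
logger.debug("now emitted")
captured_at_debug = stream.getvalue()
```

A record has to pass both the logger level and the handler level, which is why either lowering the handler level or raising the severity of the message would make it show up.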
clarkbya or update your thing to log at a higher level19:18
clarkbI suggested that on the change earlier19:18
clarkbwarn was what I suggested19:18
fungiyeah, which i agree with if the content it can't parse identifies some buggy behavior somewhere19:19
fungianyway, in the short term i've restarted it set to debug level logging in the console handler19:19
clarkbsounds good19:20
clarkbfungi: what do you think about 757161? should we proceed with that?19:22
fungisure, i've approved it just now19:23
clarkbwe need to edit emergency.yaml as well19:24
fungioh, yep, doing19:24
fungiremoved review.o.o from the emergency list just now in the screen on bridge.o.o19:25
clarkblgtm19:25
clarkbfungi: I'm going to make a copy of gerrit.config in my homedir so that we can easily diff the result after these changes land19:32
*** hamalq has quit IRC19:33
fungisounds good19:34
clarkbif this change lands ok then I think we land the next one, then restart gerrit and ensure it is happy with those updates? We can also make these updates on review-test really quickly and restart it19:35
clarkbwhy don't I do that since I'm paranoid and this will make me feel better19:35
fungiyeah, not a terrible idea19:40
fungigo for it19:40
clarkbI did both of the changes against review-test manually and it is restarting now19:40
fungii'm poking more stuff into https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes i've thought of, and updated stuff there we've since resolved19:40
clarkbI didn't do the hooks updates though since testing those is more painful19:41
fungioh, one procedural point which sprang to mind, once all remaining upgrade steps are completed and we're satisfied with the result and endmeeting here, we can include the link to the maintenance meeting log in the conclusion announcement19:42
fungithat may be a nice bit of additional transparency for folks19:42
clarkb++19:42
clarkbreview-test seems happy to me and no errors in the log (except the expected one from plugin manager plugin)19:43
fungii guess that should go on the what's broken list so we don't forget to dig into it19:43
fungiadding19:43
clarkbfungi: I'm 99% sure its because you have to explicitly enable that plugin in config in addition to installing it19:44
clarkbbut we aren't enabling remote plugin management so it breaks. But ya we can test if enabling remote plugin management fixes review-tests error log error19:44
fungii just added it as a reminder to either enable or remove that19:45
funginot a high priority, but would rather not forget19:45
clarkb++19:45
clarkbzuul says we are at least half an hour to it merging the change to update commentlinks on review19:46
clarkbianw: when your monday starts, I was going to ask if you could maybe do a quick check of the review backups to ensure that all the shuffling hasn't made anything sad19:47
clarkbrelated to ^ we'll want to cleanup the old reviewdb when we're satisfied with things so that only the accountPatchReviewDb is backed up19:49
clarkbshould cut down on backup sizes19:49
fungiyeah, though we should preserve the pre-upgrade mysqldump for a while "just in case"19:52
clarkb++19:52
fungii added some "end maintenance" communication steps to the upgrade plan pad19:57
clarkbfungi: that list lgtm19:58
clarkbthe change that should trigger infra-prod-service-review is about to merge20:20
clarkbhrm I think that decided it didn't need to run the deploy job :/20:21
clarkbya ok our files list seems wrong for that job :/20:22
clarkbor wait now playbooks/roles/gerrit is in there20:22
clarkbUnable to freeze job graph: Job system-config-promote-image-gerrit-2.13 not defined is the error20:23
clarkbI see the issue20:23
*** gouthamr_ has quit IRC20:25
clarkbremote:   https://review.opendev.org/c/opendev/system-config/+/763663 Fix the infra-prod-service-review image dependency20:26
clarkbfungi: ^ gerritbot didn't report that or the earlier merge that failed to run the job I expected. Did you catch things in logs20:26
fungilooking20:26
fungiit didn't log the "Cannot parse ..." message at least20:27
fungiseeing if it's failed in some other way20:27
*** gouthamr_ has joined #opendev-meeting20:29
clarkbI'm not sure if merging https://review.opendev.org/c/opendev/system-config/+/763663 will trigger the infra-prod-service-review job (I think it may since we are updating that job). If it doesn't then I guess we can land the db cleanup change?20:32
fungiso here's the new gerritbot traceback :/20:33
fungihttp://paste.openstack.org/show/800294/20:33
clarkbfungi: its a bug in your change20:34
clarkbyou should be printing line not data20:34
clarkbbecause data doesn't get assigned if you fall into the traceback20:34
fungid'oh, yep!20:35
fungiokay, switched that log from data to l20:36
fungiwill update the change20:36
clarkbfungi: note its line not l20:36
clarkbat least looking at your change20:36
fungiwell, it's "l" in production, it'll be "line" in my change20:37
fungiwe're running the latest release of gerritlib in that container, not the master branch tip20:37
clarkbI see20:37
fungipycodestyle mandated that get "fixed"20:37
clarkbof course20:38
fungibut the fact that we were tripping that code path indicates we're seeing more occurrences of unparseable events in the stream at least20:38
clarkbya20:38
clarkbcan you review https://review.opendev.org/c/opendev/system-config/+/763663 ?20:39
clarkbzuul should be done check testing it in about 7 minutes20:39
clarkbfungi: I wonder if there is a new event type that isn't json20:40
clarkband we've just got to ignore it or parse it differently20:40
clarkbI guess we should find out soon enough20:40
fungiaha, yep, good catch20:41
fungion 76366320:41
fungialso ansible seems to have undone my log level edit in the gerritbot logging config so i restarted again with it reset20:42
clarkbfungi: it will do that hourly as eavesdrop is in the hourly cron jobs20:42
clarkbfungi: maybe put eavesdrop in emergency?20:42
fungiyeah, i suppose i can do that20:42
fungidone20:43
clarkbhttps://review.opendev.org/c/opendev/system-config/+/757162/ is the next change to land if infra-prod-service-review doesn't run after my fix (it's not clear to me if the fix will trigger the job due to file matchers and zuul behavior)20:46
ianwclarkb will do20:55
clarkbfungi: fwiw `grep GerritWatcher -A 20 debug.log` in /var/log/zuul on zuul01 doesn't show anything like that. It does show when we restart gerrit and connectivity is lost20:59
ianwjust trying to catch up ... gerritbot not listening?21:00
clarkbianw: its having trouble decoding events at times21:00
clarkbgrepping JSONDecodeError in that debug.log for zuul shows it happens once?21:00
clarkband then it tries to reconnect. I think that may line up with a service restart21:00
clarkb15:25:46,161 <- iirc that is when we restarted to get on the newly promoted image21:01
clarkbso no real key indicator yet21:01
clarkbianw: we've started reenabling cd too, having trouble getting infra-prod-service-review to run due to job deps which should be fixed by https://review.opendev.org/c/opendev/system-config/+/763663 not sure if that change landing will run the job though21:01
clarkbonce that job does run and we're happy with the result I think we're good from the cd perspective21:02
clarkbfungi: ya in the zuul example of this problem it seems that zuul gets a short read because we restarted the service. That then fails to decode because it's incomplete json. Then it fails a few times after that trying to reconnect21:03
ianwok sorry i'm about 40 minutes away from being 100% here21:04
clarkbianw: no worries21:04
clarkbwaiting for system-config-legacy-logstash-filters to start21:11
clarkbkolla runs 40 check jobs21:11
clarkb29 are non voting21:12
clarkbI'm not sure this is how we imagined this would work21:12
clarkb"OR, more simply, just check the User-Agent and serve the all the HTTP incoming requests for Git repositories if the client user agent is Git." I like this idea from luca21:23
fungiclarkb: yeah, no idea if it's a short read or what, though we're not restarting gerrit when it happens21:26
fungithough that could explain why it was crashing when we'd restart gerrit21:27
clarkbya21:27
fungii hadn't looked into that failure mode21:27
clarkbthat makes me wonder if sighup isn't happening or isn't as graceful as we hope21:27
clarkbyou'd expect gerrit to flush connections and close them on a graceful stop21:27
clarkbthat might be a question for luca21:27
clarkbwe can probably test that by manually doing a sighup to the process and observing its behavior21:28
clarkbrather than relying on docker-compose to do it21:28
clarkbthen we at least know gerrit got the signal21:28
*** hamalq has joined #opendev-meeting21:30
clarkbor maybe we need a longer stop_grace_period value in docker-compose21:30
clarkbthough its already 5m and we stop faster than that21:30
clarkbthis system-config-legacy-logstash-filters job ended up on the airship cloud and its super slow :/21:33
clarkbslightly worried it might time out21:34
*** hamalq has quit IRC21:35
clarkbfungi: I put a kill command (commented out) in the review-test screen if we want to try and manually stop the gerrit process that way and see if it goes away quickly like we see with docker-compose down21:36
fungichecking21:37
clarkbif https://review.opendev.org/c/opendev/system-config/+/763663 fails I'm gonna break for lunch/rest21:38
clarkbwhile it rechecks21:38
fungiclarkb: yeah, that looks like the proper child process21:38
clarkbk I guess I should go ahead and run it and see what happens21:39
clarkbit stopped almost immediately21:40
clarkbthats "good" i guess. means our docker compose file is unlikely to be broken21:40
clarkbI wonder if that means that gerrit no longer handles sighup21:40
fungimay be worth double-checking the error_log to see if it did log a graceful stop21:41
clarkbwow I think it may have finished just before the timeout21:45
clarkbthe job I mean21:45
fungiseems like we ought to consider putting it on a diet rsn21:45
fungihey! my stuff is logging21:46
fungilooks like it's getting a bunch of empty strings on read21:46
fungiNov 22 21:46:32 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 21:46:20,320 DEBUG gerrit.GerritWatcher: Cannot parse data from Gerrit event stream:21:47
fungiNov 22 21:46:32 eavesdrop01 docker-gerritbot[1386]: ''21:47
fungiit's seriously spamming the log21:47
clarkbfungi: maybe just filter those out then?21:47
fungii wonder if this is some pathological behavior from it getting disconnected and not noticing21:47
clarkboh maybe?21:47
fungibut yeah, i'll add a "if line" or similar to skip empty reads21:48
clarkbok I don't think https://review.opendev.org/c/opendev/system-config/+/763663 was able to trigger the deploy job21:48
fungiand see what happens21:48
fungiit's having trouble stopping the gerritbot container even21:48
fungiokay, it's restarted with that conditional wrapping everything after the readline()21:50
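A minimal sketch of the kind of guard discussed above, assuming the event stream is a text-mode, line-oriented file-like object. `read_events` is a hypothetical name and this is not gerritbot's actual code. It goes one step further than a plain "if line" check: in Python, `readline()` returning an empty string means end of file, so the sketch treats that as a dropped connection rather than something to skip indefinitely.

```python
import io
import json


def read_events(stream):
    """Yield parsed JSON events from a line-oriented stream until EOF."""
    while True:
        line = stream.readline()
        if line == "":
            # An empty string from readline() means EOF: the far end is
            # gone, so stop and let the caller decide whether to reconnect.
            return
        line = line.strip()
        if not line:
            continue  # blank line, nothing to parse
        try:
            yield json.loads(line)
        except ValueError:
            continue  # malformed line; skip it instead of crashing
```

Returning at EOF instead of looping avoids the log spam described above and gives the caller an obvious place to re-establish the connection.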
clarkbinterestingly zuul doesn't appear to have that problem21:50
clarkbcould it be a paramiko issue?21:50
clarkbmaybe compare paramiko between zuul and gerritbot21:50
clarkbinfra-root do we want to land https://review.opendev.org/c/opendev/system-config/+/757162 to try and get the infra-prod-service-review deploy job to run now? Or would we prefer a more noopy change to get the change that previously merged to run?21:51
clarkbI don't think enqueue will work because the issue was present on the change that merged earlier so enqueuing will just abort21:52
fungii'm in the middle of dinner now so can look soonish or will happily defer to others' judgement there21:52
clarkbfungi: related to that I'm fading fast I think I need a meal21:52
* clarkb tracks one down21:52
clarkbI think the risk with 757162 is that it adds more changes to apply with infra-prod-service-review rather than the more simple 75716121:53
clarkbI'll push up a noopy change then go get food21:53
clarkbremote:   https://review.opendev.org/c/opendev/system-config/+/763665 Change to trigger infra-prod-service-review21:55
clarkband now I'm taking a break21:56
ianwi'm surprised gerrit would want to do the UA matching; but we could do something like the google approach and move git to a separate hostname, but then we do the UA switching with a 301 in apache?21:56
clarkbianw: well its luca not google21:57
clarkbI'm not quite sure how just a separate hostname helps21:58
fungii haven't seen anyone reply or comment on my bug report yet21:58
clarkbbecause you need to apply gerrit acls and auth21:58
clarkbfungi: ya most of the discussion is on the mailing list I'm hoping they poke at the bug when the work week resumes21:58
fungiis there some sort of side discussion going on?21:59
fungioh21:59
fungiprocess: mention to luca and he asks me to file a bug. i do that and discussion of it happens somewhere other than the bug or my e-mail21:59
clarkbindeed22:00
fungiso just to be clear: if there has been some discussion of the bug report i filed, i have seen none of it22:01
fungii'll happily weigh in on discussion within the bug report22:02
clarkbI mentioned both the issue and the bug itself on my response to the mailing list and they are now discussing it on the mailing list not the bug22:02
fungii'll just continue to assume my input on the topic is not useful in that case22:03
clarkbmy hunch is its more that on sunday its easy to couch quarterback the mailing list but not the bug tracker22:04
fungifair enough, i'll be at the ready to reply with bug comments once the professionals are back on the field22:05
ianwyeah, i see it as just floating a few ideas; but fundamentally you let people call their projects anything, have people access repos via /<name> and use some bits for their UI.  seems like a choose two situation22:06
ianwclarkb: https://review.opendev.org/c/opendev/system-config/+/757162 seems ok to me?22:07
clarkbianw: ya I expect its fine. its more that we've force merged a number of changes as well as merged 757161 at this point and none of those have run yet22:12
clarkbianw: so I'm thinking keep the delta down as much as possible may be nice22:12
clarkbianw: but if you're able to keep an eye on things I'm also ok with merging 75716222:13
clarkbI'm "around". Eating soup22:13
clarkbfwiw looking at the code it seems that gerrit does properly install a java runtime shutdown hook22:13
clarkbnot sure if that hook is sufficient to gracefully stop connections though22:14
ianwclarkb: yeah, i'm around and can watch22:14
clarkbianw: that change still has my WIP on it but feel free to remove that and approve if the change itself looks good after a review22:14
clarkbianw: I also put a copy of gerrit.config and secure.config in ~clarkb/gerrit_config_backups on review to aid in checking of diffs after stuff runs22:15
ianwclarkb: i'm not sure i can remove your wip now22:15
clarkboh because we don't admin22:15
clarkbok give me a minute22:15
clarkbWIP removed22:16
clarkb(but I didn't approve)22:16
ianwi'll watch that22:17
ianwTASK [sync-project-config : Sync project-config repo] ************************** seems to be failing on nb01 & nb0222:44
fungi:/22:44
clarkbare the disks full again?22:45
clarkbwe put project-config on /opt too22:45
ianw /dev/mapper/main-main 1007G 1007G     0 100% /opt22:45
ianwclarkb: jinx :)22:45
ianwok, it looks like i'm debugging that properly today now :)22:46
fungigerritbot is still parsing events for the moment22:48
fungitime check, we've got just over two hours until our maintenance is officially slated to end22:53
ianwsystem-config-run-review (2. attempt) ... unfortunately i missed what caused the first attempt to fail22:53
ianwthis is on the gate job for https://review.opendev.org/c/opendev/system-config/+/757162/22:53
clarkbfungi: yup I'm hopeful we'll get ^ to deploy and we restart one more time22:54
ianwi'll go poke at the zuul logs to make sure it was an infra error, not something with the job22:54
clarkbbut once that restart is done and we're happy with things I think we call it done22:55
clarkbits merging now22:58
fungiexcellent22:58
clarkbinfra-prod-service-review is queued22:58
clarkband ansible is running22:59
clarkband its done23:01
clarkbthe only thing I didn't quite expect was it restarted apache223:01
clarkbso maybe the edits we made to the vhost config didn't quite line up23:01
clarkbI'm going to compare diffs and look at files and stuff now23:01
clarkbgerrit.config looks "ok". We are not quoting the same way as gerrit I don't think so a lot of the comment links have "changes"23:03
clarkbI think those are fine23:03
clarkbsecure.config looks good23:03
fungiwe should probably try to normalize them in git though23:04
clarkb++23:04
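One rough way to check that two copies of a config differ only in quoting, as suspected above, is to strip simple surrounding double quotes from `key = value` lines before diffing. The helpers below are hypothetical and deliberately conservative: they ignore escapes and comments, and leave any value containing an inner quote untouched. In git-config style syntax quotes can be significant (for instance around `#` or `;`), so this is only an aid for reviewing a diff, not a safe rewriter.

```python
import difflib
import re

# Matches lines such as:   key = "value"
_QUOTED = re.compile(r'^(\s*[\w.-]+\s*=\s*)"(.*)"\s*$')


def normalize_line(line):
    """Drop one pair of surrounding double quotes from a simple value."""
    match = _QUOTED.match(line)
    if match and '"' not in match.group(2):
        return match.group(1) + match.group(2)
    return line.rstrip()


def config_diff(old_text, new_text):
    """Return unified diff lines between two configs, ignoring simple quoting."""
    old = [normalize_line(l) for l in old_text.splitlines()]
    new = [normalize_line(l) for l in new_text.splitlines()]
    return list(difflib.unified_diff(old, new, lineterm=""))
```

An empty result suggests the two files differ only in quoting; anything left over is a change worth reading by hand.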
clarkbdocker-compose.yaml lgtm23:04
clarkbthe track-upstream and manage-projects scripts lgtm23:04
clarkbpatchset-created lgtm23:05
clarkbthe apache vhost lgtm its got the request header and no /p/ redirection23:06
clarkbok I think the only other thing to do is delete/move aside the files that 757161 stops managing23:07
clarkbI'll move them into my homedir then we can restart?23:07
fungisgtm23:09
fungion hand for the gerrit container restart once you've moved those files away23:10
clarkbfiles are moved23:10
clarkbfungi: do you want me to do the down up -d or will you do it?23:11
fungihappy to do it23:11
clarkb(not sure if on hand meant around for it or doing the typing)23:11
clarkbk go for it23:11
fungidowned and upped in the root screen session on review.o.o23:12
clarkbyup saw it happen23:12
clarkbseems to be up now. I can view changes23:13
clarkbfungi: you may need to convince gerritbot to be happy? or maybe not after your changes23:14
clarkbon the upgrade etherpad everything but item 7 is struck through23:14
clarkbI'll abandon my noopy change23:16
fungilooking23:16
fungiNov 22 23:11:46 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 23:11:46,598 DEBUG paramiko.transport: EOF in transport thread23:16
fungithat seems likely to indicate it lost the connection and didn't reconnect?23:16
fungii've restarted the gerritbot container now23:17
fungiit's getting and logging events23:17
clarkbcool23:18
clarkbfungi: you haven't happened to have drafted the content for item 7 have you?23:18
funginope, but i could23:18
clarkbre the config I actually wonder if what is happening is we quote things in our ansible and old gerrit removed them but new gerrit does not remove them23:19
clarkbbecause the config has been static since 2.13 except for hand edits23:19
clarkbfungi: that would be great23:19
fungii'll start a draft in a pad23:19
clarkbzuul is seeing events too because a horizon change just entered the gate23:20
*** hamalq has joined #opendev-meeting23:31
clarkbfungi: I've detached from the screen on review and bridge23:33
clarkbI don't think they have to go away anytime soon but I think I'm done with them myself23:33
fungicool23:33
clarkbalso detached on review-test23:33
*** hamalq has quit IRC23:35
fungistarted the announce ml post draft here: https://etherpad.opendev.org/p/nzYm6eWfCr1mSf0Dis4B23:37
fungii'm positive it's missing stuff23:37
clarkblooking23:37
clarkbfungi: I made a couple small edits23:38
clarkblgtm otherwise23:38
