Friday, 2016-03-18

*** ajmiller has joined #openstack-infra-incident02:23
*** ajmiller has quit IRC03:15
*** lifeless has quit IRC04:55
*** lifeless has joined #openstack-infra-incident04:56
*** ajmiller_ has joined #openstack-infra-incident14:18
*** ajmiller has joined #openstack-infra-incident14:18
*** ajmiller_ has quit IRC14:19
*** anteaya has joined #openstack-infra-incident15:17
anteayaso jesusaur has noticed a jenkins security advisory: https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-02-2415:18
anteayaand right now yolanda is the only infra-root online15:18
anteayafungi appears to be traveling with poor connectivity when he can get it15:18
anteayaand since jenkins posts the version in every footer, it seems we should discuss a plan to upgrade15:19
yolandamordred should be around15:20
yolandaand i'd like to get opinion of more infra-root and jenkins experts like zaro15:20
yolandai'm afraid of plugins not being compatible15:20
anteayayolanda: good points15:20
anteayaand I agree with you15:21
jeblairstart by installing the new version of jenkins on jenkins-dev15:21
anteayamorning jeblair15:21
anteayayolanda: are you able to focus on installing the suggested version of jenkins on jenkins-dev?15:22
yolandaanteaya, i need to leave in over an hour as much15:22
yolandachildren going out from school, and is the last day before holiday15:23
anteayayolanda: okay well this seems to me a higher priority than firefox at the moment15:23
anteayahmmmm15:23
anteayaokay thanks, I understand yolanda15:23
yolandai can start trying on jenkins-dev but i cannot follow the issue until completion15:24
jesusauris the jenkins version in puppet somewhere that I'm not seeing? how would jenkins-dev be upgraded?15:24
anteayajesusaur: look at the bottom: https://jenkins.openstack.org/15:25
yolandai think manually installing the war, jesusaur15:25
anteayajesusaur: oh sorry, that wasn't the question you asked, my apologies15:25
jesusauryolanda: ah15:25
jeblairhopefully one of clarkb, pleia2, fungi, jhesketh, mordred, pabelanger, SergeyLukjanov or nibalizer will be around by then.15:25
anteayapleia2: is in singapore15:25
anteayayolanda: can you get started and post what actions you are taking?15:26
anteayaso someone else can pick up from where you left off?15:26
*** AJaeger has joined #openstack-infra-incident15:26
yolandathat will be the first time that i do an upgrade on upstream so i actually need to learn about it first15:27
yolandai've done on downstream but i don't know if you follow same steps15:27
mordredmorning15:27
yolandamordred, jeblair, how do you do the upgrades? install new war manually?15:27
anteayamordred: morning15:27
anteayamordred: glad you are here15:28
anteayamordred: we need to discuss upgrading our jenkins, starting with jenkins-dev: https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-02-2415:28
mordredanteaya: just waking up - haven't coffeed yet, so I may not be AWESOME at incident helping yet15:28
anteayamordred: fair enough15:28
mordredlet me start the coffee real quick15:28
anteayayup15:28
fungiyeah, my reception is terrible at the moment and i'm about to be getting back on the road again15:28
fungiyolanda: apt-get install jenkins15:29
anteayamordred: if you can help yolanda learn how we upgrade a jenkins that would be a great place to start15:29
fungiwell, sudo apt-get install jenkins15:29
mordredanteaya: my personal process is "cower in terror and cry a lot"15:29
fungibut on jenkins-dev first as jeblair said15:29
anteayamordred: good to know15:29
anteayafungi: glad to have your help when we can get it, thank you15:29
anteayafungi: safe travels15:29
mordredyolanda: yah - we do jenkins-dev first just to make sure that the jenkins people haven't broke us15:30
*** ChanServ changes topic to "https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-02-24"15:31
fungii think you _may_ want to also confirm that's pulling in the lts version and not the dev version. if it ends up with a dev version then you may need to check their apt repository for the corresponding package for the lts version instead, wget that and dpkg -i it to install15:31
anteayafungi: thank you, I assumed we were running their lts version, but glad to have the confirmation15:31
fungiwe tend to avoid the dev releases of jenkins because of stability concerns15:31
yolandaok going to try that on jenkins-dev15:32
yolandai have a fear about gear plugin15:32
anteayayolanda: thank you15:32
anteayayup, good thing it is the -dev server15:33
jeblairhttps://review.openstack.org/271543 is the change to fix the gearman plugin15:35
yolandaok fingers crossed15:35
anteayayolanda: go you15:35
jeblairit claims to have passed unit tests, but if you actually read the console log at http://logs.openstack.org/43/271543/3/check/gate-gearman-plugin-build/d49cb94/console.html.gz  there are errors15:36
yolandajeblair so this needs to go first...15:36
yolandajenkins-dev is starting now but it should fail15:36
jeblairyolanda: the commit message says gearman function registration will fail; so we may need to check on that15:37
yolandajeblair i remember having this problem on my tests with latest versions15:37
anteayayolanda: what version was pulled with apt?15:37
yolanda1.642.315:38
anteayathanks15:38
yolandaso jeblair, how do we test? we have some test jobs we can trigger in zuul to point to jenkins-dev?15:39
jeblairoh wow, this exists now: https://wiki.jenkins-ci.org/display/JENKINS/Slave+To+Master+Access+Control/15:39
fungiyeah, they added that a year or so ago15:40
fungifinally15:40
anteayaacls for jenkins15:40
fungianyway, i need to disappear again for a few hours. hope to be back to the internet around 1900z or a little after15:41
anteayafungi: thank you15:41
anteayabest if luck with whatever you are doing15:41
anteayaof15:41
yolandajeblair, jenkins-dev is up, but how can we better check about gear interaction , and zmq?15:42
jeblairyolanda: i'll clean up zuul-dev; you can reconfigure jenkins-dev to talk to gearman on zuul-dev.15:42
anteayazaro: once you are online, any suggestions you could make are welcome15:43
yolandajeblair, to talk with zuul gear in production?15:43
jeblairyolanda: zuul-dev15:44
yolandaok i updated the url15:45
jeblairthe stop and set description jobs are registered, but no real jobs15:48
jeblairthat suggests this version is also incompatible15:48
yolandaso i rollback it?15:49
yolandaor does it not matter to have that on dev for future tests?15:49
jeblairno, now we need to land 27154315:49
jeblairi have rechecked to see if it will pass tests15:49
yolandaok15:50
jeblairi will also go ahead and pull the build off of that node so we don't have to wait for it to actually land15:53
jeblairit failed with the same errors, but i'm going to go ahead and install it on jenkins-dev and see what happens15:54
yolandaok15:55
jeblairit's installed and restarted and many more functions are registered now15:56
yolandacan you trigger some job on zuul-dev to test the full workflow?15:57
jeblairlet me see how hard that would be15:58
nibalizeri am touristing in france15:58
nibalizerso pretty useless15:58
nibalizersorry15:58
yolandalucky you :)15:58
jeblairnibalizer: go enjoy france!15:59
anteayanibalizer: thanks for letting us know15:59
nibalizerkk16:00
jeblairzuul-dev is configured to only run noop; i'll manually reconfigure it to run a real job real quick16:00
nibalizergood luck :hugops:16:00
anteayanibalizer: thanks, have fun16:03
jeblairhttps://jenkins-dev.openstack.org/view/zaro/job/check-gtest-org/8/console16:04
jeblairthat looks reasonably good16:05
jeblairthe content of the job is all wrong which is what the error is about16:05
jeblairbut it ran it16:05
anteayait did16:05
jeblairi think we should now shut down a jenkins master and upgrade it16:06
jeblairi will put 07 and 06 in shutdown mode16:07
jeblairactually, first i will restart firefox16:07
yolandai need to go for a pair of hours16:08
anteayayolanda: thanks for your help16:08
jeblairyolanda: thanks, it sounds like mordred will be online soon.16:08
anteayayolanda: enjoy your next thing16:08
yolandaanteaya, family duties. Have a nice weekend16:09
anteayayolanda: thanks you too16:09
jeblairanteaya: did the ubuntu-trusty cleanup change land yesterday?  and did zuul get restarted?16:13
jeblairjenkins06 and jenkins07 are in quiesce mode16:13
anteayasorry will look16:16
anteayawe reverted clarkb's move puppet jobs to ubuntu-trusty yesterday: https://review.openstack.org/#/c/294199/16:17
anteayaas there are issues with the images, as in it doesn't contain cron for example16:17
jeblairwow, crontab16:17
anteayayeah16:17
jeblair*that* is something i would support in a base image :)16:18
anteayaand there is a jjb yaml parsing item16:18
anteayajeblair: yay another convert16:18
anteayaha ha ha16:18
anteayaand did zuul get restarted, yes it did16:18
anteayafungi restarted it late in the day16:18
jeblairanteaya: the second link in your comment doesn't work (it was to jenkins, not logs.o.o)16:18
anteayaI'm wrong, zuul didn't appear to get restarted: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-03-18.log.html#t2016-03-18T00:39:1016:19
anteayajeblair: ah sorry about that broken comment link16:20
jeblairokay, i find it strange that we would run for a whole day without restarting zuul when it's having such a big impact on our quota16:20
jeblairi will restart it now16:20
anteayathanks16:20
anteayait was discussed but the timing was to restart at the end of the day16:21
jeblairokay, i will be a softie and wait just a few mins for the top 5 changes to merge16:22
AJaegerjeblair, anteaya : Last words from fungi last night: He didn't restart and asked somebody else to do it...16:22
AJaegerHe ran out of time ;(16:22
jeblairAJaeger: if last night is anything like now, i guess no one else was around16:22
AJaegerjeblair: Last night: 12+ hours ago16:23
* jeblair is starting to think the more roots we are the fewer we have16:23
anteayaI'm noticing the same thing16:23
anteayaAJaeger: yup, found that in the logs, thanks16:24
jeblairsince i'm throwing in a zuul restart, i'll go ahead and quiesce jenkins01 and jenkins0216:26
anteayaack16:26
jeblairthen i'll be able to knock out half the system at a time16:26
anteayawonderful16:27
AJaegeris anything in the release queue? Should we give release team a heads-up?16:27
anteayapost does have 70 changes16:27
jeblairi don't see anything in release16:28
anteayaoh sorry, release not post16:28
* AJaeger gives them a heads-up now...16:29
anteayathanks16:29
jeblairAJaeger: ++16:29
jeblairthis is the slowest job ever16:31
anteaya:(16:31
*** pabelanger has joined #openstack-infra-incident16:45
pabelangero/16:45
anteayahello pabelanger16:45
pabelangerjust catching up on backscroll for #openstack-infra16:45
anteayayup16:46
pabelangerWhile on PTO today, I have some time for the next 2 hours to help out if needed16:46
pabelangerkids down for naps16:46
clarkbI am only just now waking up, will be a bit before I can help but looks like jenkins and gearman plugin upgrades in process16:46
anteayathanks pabelanger but you should be time offing it you are on time off16:47
anteayaguess it just would be helpful to know in advance so not everyone is off on the same day if possible16:47
anteayaclarkb: welcome16:47
pabelangeranteaya: don't mind offering if there is an issue ATM16:48
anteayayup, understood and thanks16:48
pabelangeranteaya: agreed.  Not sure the protocol for PTO upstream, I have no issues sending out a notification in advance, just need to figure out where :)16:49
anteayawell in the past we have just said in channel16:49
anteayaI'll be away tomorrow16:49
anteayahas worked well thus far when practiced16:49
jeblairi *will* be away tomorrow ;)16:52
jeblairalso the day after16:52
anteayaawesome16:52
* anteaya makes a note16:52
jeblairback on monday :)16:52
anteayawoooo16:52
jeblairwhy is this job still running?!16:53
anteaya:(16:53
jeblairdid we add vexxhost to grafana?16:53
clarkbjeblair: yes it and osic are both in grafana16:53
jeblairclarkb: oh! it's in the dropdown but not on the main page16:54
jeblairthere is no indication that the list of dashboards is not complete.  :(16:54
pabelangerya, we need to rejig the front-page to include them16:55
jeblairit looks like it generally runs tempest jobs in a reasonable time, but there is a 2h spike.  i wonder if this is another similar spike.16:56
clarkbjeblair: are you just waiting for jobs to finish on 01 02 06 and 07?17:10
AJaegerjeblair: did you restart zuul? ttx wants to know on #openstack-release...17:12
jeblairnope, still waiting on https://jenkins04.openstack.org/job/gate-tempest-dsvm-full/16564/consoleFull to finish17:13
jeblairwhen ^ finishes, i'll restart zuul; if one of the masters finishes first (unlikely?), i'll upgrade it17:13
* anteaya checks the water in the basement17:14
jeblairwow, this run has taken more than 2 hours17:22
anteaya:(17:24
*** dims_ has joined #openstack-infra-incident17:28
* dhellmann catches up on scrollback17:33
anteayadhellmann: hello17:33
dhellmannhi, anteaya17:33
dhellmannjust to make sure I have the situation clear, you're working on upgrading jenkins still, correct?17:34
jeblairyes, and about to restart zuul17:34
dhellmanncool, thanks, I'll keep an eye on this channel17:35
anteayadhellmann: great17:35
jeblairzuul is stopped17:35
jeblairdeleted all nodepool nodes17:36
jeblairstarting zuul17:36
jeblair(rather: set all nodepool nodes to be deleted)17:36
jeblairre-enqueing changes17:36
clarkbjeblair: all nodepool nodes or just devstack-trusty?17:37
jeblairclarkb: all -- i'm restarting zuul17:37
clarkboh right all the jobs will go off in never never land17:38
jeblairright that17:38
jeblairit's eventually consistent, but this speeds up freeing up the 4 masters and gives us more available quota17:38
jeblaireverything is re-enqueued and 01,02,06,07 are idle17:40
jeblairi gave the all-clear in -release17:41
jeblairi'm going to upgrade jenkins07 now17:41
clarkbok17:41
jeblairremoved all slaves from config.xml17:44
jeblairSEVERE: Failed to create slave log directory /var/lib/jenkins/logs/slaves/devstack-centos7-internap-nyj01-881092517:45
jeblairneat17:45
jeblairokay17:46
jeblairsomehow the gearman plugin is working17:46
jeblairi have not upgraded it17:46
anteayaodd17:46
jeblairi have quiesced jenkins07 to stop further jobs from running there17:47
clarkb07 claims to be the old version17:47
clarkboh maybe my html just needed more of a refresh17:47
clarkbnow its new and shiny17:47
jeblairMar 18, 2016 5:47:27 PM org.gearman.common.GearmanJobServerSession handleReqSessionEvent17:48
jeblairINFO: Session GearmanJobServerSession:1038:GearmanNIOJobServerConnection:zuul.openstack.org/162.242.150.96:4730 handling a REQ/CAN_DO event17:48
jeblairMar 18, 2016 5:47:27 PM org.gearman.common.GearmanJobServerSession submitTask17:48
jeblairINFO: Session GearmanJobServerSession:1038:GearmanNIOJobServerConnection:zuul.openstack.org/162.242.150.96:4730 is now handling the task GearmanTask:null:1213a77b-4535-448f-ab26-d266eca4859d17:48
jeblairalso, those logs are insane17:48
clarkbjeblair: was not upgrading the plugin intentional for testing?17:50
jeblairclarkb: nope.  dpkg helpfully restarted jenkins17:50
jeblairi think the order needs to be install new plugin, then apt-get upgrade17:51
jeblairer, apt-get install17:51
clarkbgotcha17:51
jeblairthe reason it can't create the slave log directory is that there are too many entries in that directory already17:51
jeblairi expect that has been the case for some time, and we probably don't really care that much atm, so i'm just going to leave it17:51
clarkbkk17:52
clarkbI can probably whip up a find to clean out anything older than say 2 months17:52
jeblairso... given that i have no idea what the deal is with the gearman plugin; should we just go ahead and upgrade it anyway?17:52
clarkbyes I think upgrade sounds safest17:53
anteayaI can support that17:53
anteayaon dev we saw the upgraded plugin worked17:53
jeblairokay, i will nodepool delete all of the jobs running on 07, then upgrade gearman plugin there17:55
jeblairthe nodepool delete will cause them to be re-enqueued in zuul17:55
anteayaack17:55
clarkbfind /var/lib/jenkins/logs/slaves -mtime +60 -delete ?17:59
clarkbjeblair: ^ should I go ahead and run that on the 8 masters?17:59
jeblairclarkb: sure; we could probably do mtime +7 really...18:00
clarkbjeblair: it seems to have decided that all of the slaves dir itself should go away18:02
jeblairclarkb: you mean /var/lib/jenkins/logs/slaves or all of the entries in there?18:02
clarkb/var/lib/jenkins/logs/slaves18:02
jeblairclarkb: neat18:03
jeblairclarkb: it was jenkins:nogroup18:03
clarkbthanks18:03
clarkbrecreated18:03
jeblairi have uploaded the gearman-plugin18:04
clarkbbah because the first output of that is that dir18:04
clarkbwill do this better on not 07 :)18:04
jeblairi'm going to restart jenkins on 07 now18:04
clarkbkk18:05
clarkbfind /var/lib/jenkins/logs/slaves/ -mindepth 1 -mtime +60 -delete appears to be what we want18:06
clarkbam running that on 06 now18:06
jeblairi'm going to create a new default view for all of the masters like this: https://jenkins01.openstack.org/18:07
jeblairit has no jobs in18:07
jeblairit18:07
clarkb++18:08
jeblairwhich means that when we visit one of the masters, we don't have to wait 5 minutes for the list of 8000 jobs to load18:08
clarkbyup that above find is much happier, it removed all contents of the dir anyways so it was more than 2 months ago when we ran out of room for new dirs18:08
clarkbbut left the dir in place18:08
clarkbwill do the remainder of the masters now18:08
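The difference between the two find invocations above can be sketched in Python. The first command also matched its starting point: the slaves directory had not been modified in over 60 days, and -delete works depth-first, so once its contents were gone the directory itself was removed. Adding -mindepth 1 excludes the starting point. This is only a minimal sketch of that behaviour, not what was run on the masters; the function name `prune_old_entries` and its arguments are hypothetical, and the age cutoff is a plain "older than N days" comparison rather than find's exact -mtime rounding.

```python
import os
import time


def prune_old_entries(root, max_age_days, now=None):
    """Delete entries under root older than max_age_days; never delete root."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400

    # Record modification times first: removing a child updates its parent's
    # mtime, which would otherwise make every emptied directory look new.
    mtimes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mtimes[path] = os.lstat(path).st_mtime

    removed = []
    # Walk bottom-up so children are handled before their parent directory.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if mtimes.get(path, now) < cutoff:
                os.remove(path)
                removed.append(path)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if mtimes.get(path, now) >= cutoff:
                continue  # recent directory: keep it
            if os.path.islink(path):
                os.remove(path)  # symlink to a directory: drop the link only
            elif not os.listdir(path):
                os.rmdir(path)  # old and now empty: safe to remove
            else:
                continue  # old but still holds recent entries: keep it
            removed.append(path)
    # Only descendants of root are ever candidates, so root always survives.
    return removed
```

Calling `prune_old_entries("/var/lib/jenkins/logs/slaves", 60)` would empty out stale per-node log directories while leaving the slaves directory (and its ownership) in place, which is the property the second find command was after.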
anteayaclarkb: yay18:09
anteayajeblair: sounds good18:09
jeblairoh, are slave config files now outside of the main config?18:11
clarkbjeblair: they werent on the old version but thats good news if so18:12
clarkbjeblair: looks like /var/lib/jenkins/nodes18:15
jeblairyep18:15
jeblairi'm watching a build to make sure the whole lifecycle works18:15
jeblairhttps://jenkins07.openstack.org/job/gate-horizon-npm-run-test/2838/console is finished18:15
jeblairit ran on https://jenkins07.openstack.org/computer/ubuntu-trusty-bluebox-sjc1-8878183/18:15
jeblairwhich is offline now (good)18:16
jeblairnodepool got the complete event for that and will delete it shortly18:16
anteayaI hadn't seen a gui for a single node before, thanks18:17
clarkbjeblair: this is with upgraded gearman and upgraded jenkins right?18:18
jeblairclarkb: yes18:18
jeblairscp console log looks good: http://logs.openstack.org/14/289314/21/check/gate-horizon-npm-run-test/bbe19e1/console.html18:18
jeblairjenkins node was just deleted (good)18:19
jeblairregular scp looks good: http://docs-draft.openstack.org/57/292957/1/check/gate-storyboard-webclient-js-draft/a4f46f9/dist/#!/page/about18:20
jeblair(different job from jenkins07)18:20
jesusaurwhere is the new plugin hosted? and will a new release be tagged?18:20
jeblairthat's all i can think of; anything else we should check?18:20
clarkb/monitoring 404s18:20
jeblairclarkb: neat18:21
jeblairjesusaur: i approved the change earlier, that should at least result in a branch-tip build18:21
clarkbmaybe I need to login first and won't get redirected automagically anymore18:21
jeblairi'm logged in and it 404s too18:21
clarkbnope still 404s ya18:21
jeblairmonitoring used to be in the manage menu but is gone18:22
clarkbmaybe we just need to upgrade the plugin18:22
clarkbhttps://wiki.jenkins-ci.org/display/JENKINS/Monitoring has newer releases so presumably it should still work if upgraded18:22
jeblairit is not listed under installed plugins on 0718:23
jeblairi will install the current version18:23
jeblairshall i delete all nodes from 07 and restart?18:24
anteayaI support that18:24
clarkbbtw jenkins02 cleanup of slave logs is far slower than the higher numbered jenkinses. I think the reason this one is so much slower is that its IO isn't as good18:24
clarkbjeblair: sure, then we can bundle an upgrade of javamelody into the later restarts18:24
anteayayes fungi said 01 and 02 are the slowest18:24
anteayasince they were the first and are probably on the slowest vms so should be upgraded18:25
anteayawhen someone finds the time...18:25
jeblairclarkb: weirdly, 01 and 02 are the fastest at rendering the overview dashboards...18:25
anteayareally?18:25
jeblairokay, all clear in #-release, so i'm restarting 07 again now18:26
*** asselin has joined #openstack-infra-incident18:26
jeblairanteaya: i confirmed our rax errors are likely because our ord quota is insufficient for the request antonym made18:32
jeblairForbidden: Quota exceeded for ram: Requested 8192, but already used 516096 of 512000 ram (HTTP 403) (Request-ID: req-c7dc80bb-6798-4783-9165-ea4155e8410b)18:32
anteayajeblair: ah okay18:32
anteayajeblair: I can offer a fix, any suggestions for quota?18:33
jeblairanteaya: antonym is increasing our ord quota18:33
anteayaord is currently 19518:33
anteayaoh okay, that is even better18:33
jeblairokay, 07 is idle again, i will stop it, remove the leaked node config directories, and start again18:35
anteayado I calculate correctly that a quota of 512000 ram is 62 nodes?18:36
jeblairyep18:36
anteayathanks18:36
jeblairanteaya: we set max-servers a few lower than that in order to have headroom for building images, etc.18:36
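The quota arithmetic in this exchange can be checked with a short Python sketch. It assumes each node needs the 8192 units of ram shown as "Requested" in the error above, and that the error text keeps that wording; the helper names `parse_ram_quota_error` and `max_nodes` are hypothetical and not part of nodepool or the provider's API.

```python
import re

# Matches the "Requested X, but already used Y of Z ram" part of the error.
QUOTA_RE = re.compile(r"Requested (\d+), but already used (\d+) of (\d+) ram")


def parse_ram_quota_error(message):
    """Return (requested, used, limit) as ints, or None if no match."""
    match = QUOTA_RE.search(message)
    if match is None:
        return None
    requested, used, limit = (int(group) for group in match.groups())
    return requested, used, limit


def max_nodes(ram_quota, ram_per_node):
    """Whole number of nodes that fit inside the ram quota."""
    return ram_quota // ram_per_node


error = (
    "Forbidden: Quota exceeded for ram: Requested 8192, but already used "
    "516096 of 512000 ram (HTTP 403)"
)
requested, used, limit = parse_ram_quota_error(error)
print(max_nodes(limit, requested))  # 62 nodes fit in a quota of 512000
print(used // requested)            # 63 nodes' worth of ram already in use
```

With 63 nodes' worth of ram already allocated against a limit that fits 62, any further request fails, which is why max-servers is kept a few below the computed maximum.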
anteayaack18:36
jeblairclarkb: https://jenkins07.openstack.org/monitoring exists18:37
anteayaso the original quota of 55 seems to have been the max without a rax increase18:37
anteayaI can sign into jenkins but I can't see the monitoring dashboard18:37
clarkbjeblair: confirmed that monitoring is working for me18:38
anteayaguess I'll just have to listen to the stories18:38
clarkband if I log out it forces me to log in18:38
jeblairanteaya: yeah, it's limited to admins since it lets you kill threads which would do damage18:38
anteayaagreed18:38
jeblairare we ready to proceed to 1,2,6 now?18:38
anteayaclarkb: though shalt be known!18:38
anteayaI have no objection to proceeding18:39
jeblairclarkb: ?18:43
clarkboh ya I have no objections18:43
clarkb01-07 have the logs/slaves cleaned out18:44
jeblaircool, will proceed; i'll upgrade monitoring plugin, gearman plugin, then restart18:44
jeblairer, then jenkins, then restart :)18:44
anteayaack18:44
jeblair06 is starting18:47
anteayayay18:47
jeblair06 is up; monitoring exists...18:48
anteayawooooo18:49
jeblairnodes are being added18:49
jeblairand its running jobs18:49
jeblairi think before i do 01 and 02, i will pause and do jenkins.o.o18:49
nibalizerhello18:49
jeblairsince that's the one we actually care about18:49
anteayamakes sense18:49
nibalizeri have returned to a computer, are there things I can help with?18:49
anteayanibalizer: hi there18:50
anteayaI'll let jeblair respond when he has a moment18:50
anteayaas I don't know18:50
clarkbjeblair: ok I am about to clean out the slave logs on jenkins.o.o18:50
clarkbI don't think that will interfere with your upgrade process18:50
jeblairclarkb: thanks and i agree18:50
jeblairnibalizer: i think we've got it covered now18:52
clarkball done18:52
nibalizerjeblair: fantastic18:52
anteayanibalizer: how is the touristing going?18:52
nibalizeranteaya: pretty good18:52
anteayaawesome18:53
nibalizerim exhausted and have lots of pictures18:53
anteayahave you had crepes?18:53
anteayanibalizer: sounds like you're doing it right18:53
nibalizeranteaya: no i should get some crepes!18:53
nibalizerwait no i did have crepes18:53
anteayaobviously memorable18:53
anteayawere they any good?18:53
nibalizerya18:54
anteayanice18:54
nibalizerit was street vendor crepes during the jet lag march of death18:54
anteayaah18:54
anteayarestaurant crepes18:54
anteayato savour18:54
* anteaya fondly remembers crepes from normandy18:55
clarkbI think the zuul restart earlier may have confused the nodepool image uploads18:55
clarkbsupposedly we have 42 images uploading but I can't find any evidence that is in progress18:56
jeblairwe have the all-clear from release to upgrade jenkins.o.o18:56
jeblairit's restarting18:56
clarkbconfirmed with geard that none of the uploads are actually queued or running18:59
clarkbI am going to restart nodepool-builder18:59
clarkband then probably have to clean out nodepool itself18:59
clarkbactually nodepool-builder restart probably not necessary as there are 4 workers for each upload18:59
jeblairclarkb: i think we knew there were still some edge cases in the builder job code; this may be one of them.18:59
clarkbya19:00
jeblairjenkins.o.o is up19:00
anteayayay19:00
AJaegergreat19:00
clarkbthe simple fix I think is to restart nodepoold19:00
clarkbsince that is supposed to clear any building images out of the db19:00
jeblairnow i will do 01 and 0219:01
clarkbso maybe lets wait for jenkins stuff to settle down a bit before restarting nodepoold19:01
jeblair01 and 02 are restarting19:05
jeblair01 is up19:05
jeblair02 is up19:07
jeblair01 is getting nodes and jobs19:08
AJaegerthat went smooth, great!19:08
clarkbjeblair: let me know when you think it is relatively safe to restart nodepool19:08
jeblairAJaeger: still have 3 more masters to go, but it's pretty mechanical at this point19:08
jeblair02 is getting nodes and jobs19:09
* AJaeger leaves now for the weekend! Hope that's the last fire for today ;)19:09
jeblairclarkb: https://jenkins04.openstack.org/monitoring does not exist19:10
jeblairwhich is weird.19:11
clarkbjeblair: huh, it hasnt been upgraded19:11
jeblairclarkb: oh sorry, yeah, i haven't started on 3,4,5 yet19:11
jeblairclarkb: i was just taking stock of the old state19:11
clarkbright just noticing that its the old version so should be there19:11
jeblairi have created the new 'Default' dashboard on them and just now put them in quiesce mode19:11
jeblairclarkb: ok yep.19:12
clarkb03 has it19:12
jeblairand 519:12
jeblairclarkb: i'm thinking of doing the 'delete nodes from under jenkins' trick for 3-5 and just knocking this out.  what do you think?19:13
clarkbjeblair: seems reasonable, better to get it done than spend all day on it19:13
anteayaI support that19:13
clarkbthen when it is done we can restart nodepool and get new images up19:13
pabelangerclarkb: jeblair I did https://review.openstack.org/#/c/294339/ last night (exposed upload_workers), but haven't tested the uploads yet.19:15
jeblairwe have clearance from release, so i'm going to upgrade 3-5 now19:15
pabelangershould be able to do that on Monday19:15
clarkbthe one thing we didn't test was jjb ? just thinking of that now19:15
jeblairpabelanger: awesome!19:15
clarkbbut I think other people are successfully using jjb on a variety of jenkins versions so not very worried19:15
jeblairclarkb: heh, true.  and i think we can work through it if there are problems19:16
pabelangerokay, kids are waking up for naps, back to PTO shortly. I'll check back in later tonight.19:16
pabelangerfrom*19:16
anteayapabelanger: thanks for your help19:16
jeblairhrm, 04 is somewhat unresponsive19:19
jeblairit got better19:20
anteayaheh19:21
jeblairdeleting nodes now19:21
clarkbnumber of threads is keeping steady on the upgraded jenkinses19:24
jeblairupgrading 03 now19:26
jeblairand 0419:27
jeblairand 0519:28
jeblair03 is up19:29
jeblairand 0419:29
jeblairand 0519:29
jeblair03 is getting nodes and jobs19:30
jeblairi have quiesced 0419:31
jeblair04 did not end up with the monitoring plugin installed19:31
anteayaah19:32
jeblair05 is getting nodes and jobs19:32
anteayaso just respinning 04 remaining?19:32
jeblairre-re-installing monitoring plugin on 0419:32
anteayaack19:33
jeblairrestarting 0419:33
jeblairquiescing 04 again19:34
clarkba start-in-shutdown mode would be nice19:34
jeblairno kidding19:34
jeblairthis time it snagged some nodes; i'm deleting them19:36
jeblairhttp://paste.openstack.org/show/491153/19:36
jeblairis the error19:36
clarkbwow heap issues19:36
clarkbpossible the plugin data is corrupt?19:37
jeblairclarkb:19:38
jeblairFYI, there is a proprietary plugin in Jenkins Enterprise by CloudBees19:38
jeblairfor this purpose.19:38
jeblairhttps://groups.google.com/forum/#!topic/jenkinsci-dev/Gr2QOxSl7_819:38
jeblairthough, someone did post a groovy script to do it19:38
jeblairhttps://wiki.jenkins-ci.org/display/JENKINS/Post-initialization+script19:38
clarkbwow19:39
clarkbvalue add19:39
jeblairclarkb: the /etc/default/jenkins file on 04 does not match that of 0319:39
jeblair03 has -Xmx12g; 04 does not have an Xmx arg19:40
clarkbhuh19:40
clarkbpuppet appears to have run on 04 recently19:41
jeblairah....19:41
clarkbhrm 04 has it now19:42
clarkbmaybe this is the package racing puppet or something?19:42
jeblairyeah, maybe a dpkg puppet race19:42
jeblairi will restart 04 now19:42
jeblair(ansible confirms all /etc/default/jenkins files are identical)19:42
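A check like the one described here, whether a defaults file carries an -Xmx heap setting, can be sketched in Python. This is an illustration only: the sample file contents below are assumed, not copied from the masters, and the helper name `find_max_heap` is hypothetical.

```python
import re

# A JVM max-heap flag such as -Xmx12g or -Xmx4096m.
XMX_RE = re.compile(r"-Xmx(\d+)([kKmMgG]?)")


def find_max_heap(defaults_text):
    """Return the -Xmx value (for example '12g') or None if it is absent."""
    for line in defaults_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # ignore commented-out settings
        match = XMX_RE.search(stripped)
        if match:
            return match.group(1) + match.group(2).lower()
    return None


# Assumed example contents, standing in for the file on two masters.
with_heap = 'JAVA_ARGS="-Xmx12g -Djava.awt.headless=true"\n'
without_heap = 'JAVA_ARGS="-Djava.awt.headless=true"\n'

print(find_max_heap(with_heap))     # 12g
print(find_max_heap(without_heap))  # None
```

Running the same check against each master's file would flag a host that is temporarily missing the heap argument, for instance while a package upgrade and a configuration-management run are racing.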
jeblairwhat it failed again19:44
anteaya:(19:44
jeblairquiesced19:44
clarkbhuh same error with the heap?19:45
jeblairyep19:45
jeblairi see -Xmx12g in the command line19:45
clarkb/tmp/hsperfdata_jenkins is the melody data I think19:46
clarkbbut its mostly empty19:46
jeblairdeleting 04 nodes again19:46
jeblairi need to lunch now; i'm going to leave 04 as-is if clarkb or anyone else wants to look into it19:47
clarkbok19:47
clarkbit is trying to copy an array it looks like and that hits the error19:49
anteayais it worth it to spin up a fresh vm and install jenkins on it and call it 04?19:50
clarkbno we should figure out why it is doing this19:50
anteayaokay19:51
clarkbthe md5sums of the .jpi files differ19:54
clarkbjeblair: I think whatever process you were using to update isn't functioning. We can try copying the .jpi in from another master19:54
anteayawonder why that would only cause an issue on 0419:55
fungii am back and starting to catch up, though i may also have to pop back out again very briefly at 2130z (but not for long)19:55
anteayawelcome back19:55
fungiand yeah, zuul was still not caught up when i started running out of steam last night, sounds like there wasn't really anybody else around in shape to restart it after that either19:56
anteayafungi: yeah didn't appear so19:56
anteayajeblair observed that the more root folks we have the less we seem to have19:57
clarkbjeblair: my suggestion would be to do that, copy the jpi in from 07, rm plugins/monitoring/ on 04 and restart 0419:57
fungiso jenkins 1.642.3 is the latest lts. interesting they've changed up the theming on the webui19:58
clarkbjeblair: but will wait for your lunch to conclude so we can figure out what might be wrong with the process that was previously used19:59
clarkbfungi: basic sitrep is zuul has been restarted (which broke nodepools gearman state so it needs a restart too, then manual uploads of ubuntu-trusty for uuidgen updates) and every master but jenkins04 is fully operational on new jenkins lts with new gearman plugin version20:03
clarkbfungi: the monitoring plugin install is unhappy on 04; it appears to be due to trying to use the old version against new jenkins lts20:03
clarkbI intend on restarting nodepoold as soon as the jenkinses are happy20:03
clarkband will requeue ubuntu-trusty uploads20:05
fungiokay, sounds better than it could be i guess20:08
fungiand makes sense that nodepool image updates might be disrupted by its gearman disappearing out from under it at the moment20:09
clarkbit should mostly handle that but there are still a few corner cases20:09
fungiand yeah, yesterday was mostly a bust for getting anything done other than dealing with the openstackid/zanata situation20:09
clarkbthe fix for uuidgen did get in we just haven't managed to upload any of those images yet from what I can see20:10
fungialso yolanda and krotscheck seem to have discovered that missing dbus on the minimal images was keeping firefox from working20:22
fungior i think that was the conclusion they had arrived at20:23
anteayayes20:23
anteayaI replied in -infra to keep the conversation about anything not incident related in there20:24
anteayafor logs searching purposes20:25
clarkb-rw-rw-r-- 1 jenkins jenkins     0 Mar 18 20:03 plugin.ini20:27
clarkbthats extra weird20:27
clarkber ww20:27
fungiwhat's weird? the fact that it's empty?20:31
fungimaybe we can just copy that from another server?20:31
clarkbthis was response to emilienm in other channel20:31
fungioh20:32
clarkbtheir job failed being unable to copy that file and I found it weird the file existed20:32
clarkbsince the copy failed20:32
clarkbbut I think scp plugin must create the dirent before copying contents20:32
clarkbrather than doing a w and writing20:32
fungiahh, i haven't finished catching up in there yet20:32
jeblairclarkb: catching up20:42
jeblairclarkb: i used the webui; on jenkins04 and 07 i 'installed' the plugin; on others i 'upgraded' it20:44
jeblairclarkb: just selecting from the list of available plugins; no manual uploads or anything in the case of monitoring20:44
clarkbjeblair: huh, for some reason it appears to still be using the old version on 04 when you do that20:44
clarkbjeblair: maybe we can 'upgrade' it now?20:44
jeblair(for gearman-plugin, they were all uploaded through the webui)20:44
* clarkb looks20:44
jeblairclarkb: i think it still wasn't showing as upgradable, probably since it isn't really loaded20:45
clarkbya I don't see it20:45
jeblairclarkb: assuming that's still the case, i'd agree that next steps are probably to delete, manually install as you suggest20:45
clarkbmy original suggestion of just copying a jpi from the other servers over is the best idea i have20:45
jeblairlet's first do the groovy script to have it start in shutdown mode20:46
clarkb++20:46
jeblairi will do that20:46
jeblairclarkb: do you want to work on staging the delete/copy?20:46
clarkbyup20:46
clarkbit is at jenkins04:/home/clarkb/monitoring.jpi and the md5sums match20:48
clarkbwe can just cp that into place when 04 jenkins process is stopped20:48
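
The checksum comparison clarkb describes before copying the plugin into place can be sketched as follows. This is a minimal illustration, not the commands actually run during the incident; the helper names `md5sum` and `same_content` are hypothetical, and the files to compare are whatever paths the caller passes in.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files hash to the same MD5 digest (hypothetical helper)."""
    return md5sum(path_a) == md5sum(path_b)
```

A matching digest only shows the staged copy is byte-identical to the source file; it says nothing about whether the plugin will load, which is what the restart that follows checks.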
jeblairclarkb: my groovy script is staged20:50
jeblairclarkb: i'm all set; why don't you drive from here?20:50
clarkbok I will stop jenkins now20:50
clarkbok plugin copied in place starting jenkins20:51
clarkbjeblair: ready?20:51
jeblairyep20:52
jeblairclarkb: same error :/20:53
jeblairit did start in shutdown mode20:54
clarkbmonitoring/META-INF/MANIFEST.MF looks good at least20:54
clarkbbut ya not working20:54
clarkbperhaps /var/lib/jenkins/monitoring is the problem?20:55
clarkbthere are almost 13k files in there20:55
clarkbhrm 07 has almost as many20:56
clarkbthat would be my next best guess: move that dir aside, and have the plugin restart from scratch20:56
jeblairlet's try it20:57
clarkbjeblair: you doing it this time? and do we have to update the groovy stuff each restart or is it set once and that way until the script is removed?20:58
jeblairclarkb: it's there for good.  it's a file in ~jenkins/init.groovy.d/20:58
clarkbI like the idea of putting that on every master and updating the ansible playbook to restart them to explicitly online them20:59
jeblairclarkb: i worry a little about system restarts though...20:59
clarkboh good point20:59
clarkbjeblair: I can move the monitoring dir aside if you want21:00
jeblairclarkb: but, we could ansible the process of adding and removing that21:00
jeblairclarkb: so we could have an "ansible shut down jenkins" and "ansible start jenkins" that does that for us21:00
clarkbya21:00
jeblairclarkb: why don't you continue to drive21:00
clarkbkk21:00
clarkbstopping jenkins again21:01
clarkband starting21:01
jeblairseems happier now21:02
clarkbyup I can get to the melody page now21:03
clarkbjeblair: anything else you want to check before cancelling shutdown mode?21:03
jeblairclarkb: nope, i'll go ahead and do that and rm the script21:03
clarkbkk21:03
clarkbit is running jobs21:05
clarkblooking good, I am ready to restart nodepool if others are ready21:06
clarkbjeblair: fungi anteaya ^ any reason to not do that now?21:08
fungilooks like a fine time for a nodepool restart21:08
fungii'm around to help. pretty much caught up on what i've missed the first part of the day21:09
clarkbok stopping nodepool services now21:09
anteayaI have no reason to not restart nodepool21:09
jeblair++21:09
clarkbok both services are restarted and appear happy21:10
clarkbI will start ubuntu-trusty uploads in a screen session21:10
anteayayay21:10
fungiand wow so moving the monitoring dir aside solved it?21:11
fungifreaky21:11
fungii guess that's accumulating graph/stats data over time?21:11
anteayafungi: clarkb removed a whole bunch of slave files accumulated on the jenkins earlier today21:13
anteayanot sure if that is related or not21:13
jeblairit wasn't the largest such directory; 03 has 17k files in it21:13
clarkbfungi: ya tons of rrd files21:13
fungiyikes21:14
jeblairbut i guess something in there was :(21:14
clarkbif I had to guess it wasn't the size so much as maybe corrupt data21:14
clarkbwhich caused the jvm reading them in to allocate crazy amounts of memory and fail21:14
jeblairya.  a lot of those files are for ephemeral nodes too21:14
clarkball ubuntu-trusty image uploads are queued21:14
clarkbI started with the qcow2 clouds21:14
clarkbas they should flush out quicker21:15
clarkbI need to step out now back in 20 or so21:15
anteayaclarkb: thanks21:15
fungithe more data there is, the more likely some of it is to end up corrupt as well21:18
anteayathis was the magic incantation: find /var/lib/jenkins/logs/slaves/ -mindepth 1 -mtime +60 -delete21:19
anteayathe first round was sans -mindepth 1 and took out the dir as well21:19
anteayawhich was recreated21:19
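
The effect of `-mindepth 1` in the find command quoted above can be sketched in Python. This is a rough approximation under stated assumptions, not what was run on the Jenkins masters: the function name `prune_old_entries` is hypothetical, age is measured as a plain 60-day cutoff rather than with find's exact `-mtime +60` rounding, and directories are removed only when they are already empty.

```python
import os
import time


def prune_old_entries(root, max_age_days=60, now=None):
    """Delete entries below root older than max_age_days; never remove root itself."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    # Walk bottom-up so files are handled before the directory that holds them.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_mtime < cutoff:
                os.remove(path)
                removed.append(path)
        # Counterpart of -mindepth 1: the starting directory is never a candidate.
        if dirpath != root and os.lstat(dirpath).st_mtime < cutoff:
            try:
                os.rmdir(dirpath)  # succeeds only if the directory is empty
                removed.append(dirpath)
            except OSError:
                pass
    return removed
```

Dropping the `dirpath != root` guard reproduces the first-round behaviour described above, where the starting directory itself was old enough to match and was removed along with its contents.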
fungiyep, saw that in scrollback21:21
anteayafigured21:21
*** zaro has quit IRC21:25
*** greghaynes has quit IRC21:25
*** zaro has joined #openstack-infra-incident21:26
*** mordred has quit IRC21:26
*** dhellmann has quit IRC21:26
*** dhellmann has joined #openstack-infra-incident21:26
*** mordred has joined #openstack-infra-incident21:27
*** greghaynes has joined #openstack-infra-incident21:28
clarkbI am back21:31
clarkbvexxhost image upload failed on the nginx looking thing21:32
clarkbosic and bluebox are done21:32
*** ChanServ changes topic to "situation normal"22:21
*** asselin has quit IRC23:22