*** ajmiller has joined #openstack-infra-incident | 02:23 | |
*** ajmiller has quit IRC | 03:15 | |
*** lifeless has quit IRC | 04:55 | |
*** lifeless has joined #openstack-infra-incident | 04:56 | |
*** ajmiller_ has joined #openstack-infra-incident | 14:18 | |
*** ajmiller has joined #openstack-infra-incident | 14:18 | |
*** ajmiller_ has quit IRC | 14:19 | |
*** anteaya has joined #openstack-infra-incident | 15:17 | |
anteaya | so jesusaur has noticed a jenkins security advisory: https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-02-24 | 15:18 |
anteaya | and right now yolanda is the only infra-root online | 15:18 |
anteaya | fungi appears to be traveling with poor connectivity when he can get it | 15:18 |
anteaya | and since jenkins posts the version in every footer, it seems we should discuss a plan to upgrade | 15:19 |
yolanda | mordred should be around | 15:20 |
yolanda | and i'd like to get opinion of more infra-root and jenkins experts like zaro | 15:20 |
yolanda | i'm afraid of plugins not being compatible | 15:20 |
anteaya | yolanda: good points | 15:20 |
anteaya | and I agree with you | 15:21 |
jeblair | start by installing the new version of jenkins on jenkins-dev | 15:21 |
anteaya | morning jeblair | 15:21 |
anteaya | yolanda: are you able to focus on installing the suggested version of jenkins on jenkins-dev? | 15:22 |
yolanda | anteaya, i need to leave in an hour at most | 15:22 |
yolanda | children going out from school, and is the last day before holiday | 15:23 |
anteaya | yolanda: okay well this seems to me a higher priority than firefox at the moment | 15:23 |
anteaya | hmmmm | 15:23 |
anteaya | okay thanks, I understand yolanda | 15:23 |
yolanda | i can start trying on jenkins-dev but i cannot follow the issue until completion | 15:24 |
jesusaur | is the jenkins version in puppet somewhere that I'm not seeing? how would jenkins-dev be upgraded? | 15:24 |
anteaya | jesusaur: look at the bottom: https://jenkins.openstack.org/ | 15:25 |
yolanda | i think manually installing the war, jesusaur | 15:25 |
anteaya | jesusaur: oh sorry, that wasn't the question you asked, my apologies | 15:25 |
jesusaur | yolanda: ah | 15:25 |
jeblair | hopefully one of clarkb, pleia2, fungi, jhesketh, mordred, pabelanger, SergeyLukjanov or nibalizer will be around by then. | 15:25 |
anteaya | pleia2: is in singapore | 15:25 |
anteaya | yolanda: can you get started and post what actions you are taking? | 15:26 |
anteaya | so someone else can pick up from where you left off? | 15:26 |
*** AJaeger has joined #openstack-infra-incident | 15:26 | |
yolanda | that will be the first time that i do an upgrade on upstream so i actually need to learn about it first | 15:27 |
yolanda | i've done on downstream but i don't know if you follow same steps | 15:27 |
mordred | morning | 15:27 |
yolanda | mordred, jeblair, how do you do the upgrades? install new war manually? | 15:27 |
anteaya | mordred: morning | 15:27 |
anteaya | mordred: glad you are here | 15:28 |
anteaya | mordred: we need to discuss upgrading our jenkins, starting with jenkins-dev: https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-02-24 | 15:28 |
mordred | anteaya: just waking up - haven't coffeed yet, so I may not be AWESOME at incident helping yet | 15:28 |
anteaya | mordred: fair enough | 15:28 |
mordred | let me start the coffee real quick | 15:28 |
anteaya | yup | 15:28 |
fungi | yeah, my reception is terrible at the moment and i'm about to be getting back on the road again | 15:28 |
fungi | yolanda: apt-get install jenkins | 15:29 |
anteaya | mordred: if you can help yolanda learn how we upgrade a jenkins that would be a great place to start | 15:29 |
fungi | well, sudo apt-get install jenkins | 15:29 |
mordred | anteaya: my personal process is "cower in terror and cry a lot" | 15:29 |
fungi | but on jenkins-dev first as jeblair said | 15:29 |
anteaya | mordred: good to know | 15:29 |
anteaya | fungi: glad to have your help when we can get it, thank you | 15:29 |
anteaya | fungi: safe travels | 15:29 |
mordred | yolanda: yah - we do jenkins-dev first just to make sure that the jenkins people haven't broke us | 15:30 |
*** ChanServ changes topic to "https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-02-24" | 15:31 | |
fungi | i think you _may_ want to also confirm that's pulling in the lts version and not the dev version. if it ends up with a dev version then you may need to check their apt repository for the corresponding package for the lts version instead, wget that and dpkg -i it to install | 15:31 |
anteaya | fungi: thank you, I assumed we were running their lts version, but glad to have the confirmation | 15:31 |
fungi | we tend to avoid the dev releases of jenkins because of stability concerns | 15:31 |
yolanda | ok going to try that on jenkins-dev | 15:32 |
yolanda | i have a fear about gear plugin | 15:32 |
anteaya | yolanda: thank you | 15:32 |
anteaya | yup, good thing it is the -dev server | 15:33 |
jeblair | https://review.openstack.org/271543 is the change to fix the gearman plugin | 15:35 |
yolanda | ok fingers crossed | 15:35 |
anteaya | yolanda: go you | 15:35 |
jeblair | it claims to have passed unit tests, but if you actually read the console log at http://logs.openstack.org/43/271543/3/check/gate-gearman-plugin-build/d49cb94/console.html.gz there are errors | 15:36 |
yolanda | jeblair so this needs to go first... | 15:36 |
yolanda | jenkins-dev is starting now but it should fail | 15:36 |
jeblair | yolanda: the commit message says gearman function registration will fail; so we may need to check on that | 15:37 |
yolanda | jeblair i remember having this problem on my tests with latest versions | 15:37 |
anteaya | yolanda: what version was pulled with apt? | 15:37 |
yolanda | 1.642.3 | 15:38 |
anteaya | thanks | 15:38 |
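The LTS check fungi asks for can be made from the version string alone. A minimal Python sketch, assuming the Jenkins 1.x numbering convention in which LTS releases carry three numeric components (such as 1.642.3) and weekly releases carry two; the function name is hypothetical and not part of any Jenkins tooling.

```python
def looks_like_lts(version: str) -> bool:
    """Return True if the version has three numeric components.

    Assumption: three components (e.g. 1.642.3) indicate an LTS build,
    two components (e.g. 1.653) indicate a weekly build.
    """
    parts = version.strip().split(".")
    return len(parts) == 3 and all(part.isdigit() for part in parts)
```

Under that assumption, the 1.642.3 package reported above would be classified as an LTS build.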
yolanda | so jeblair, how do we test? we have some test jobs we can trigger in zuul to point to jenkins-dev? | 15:39 |
jeblair | oh wow, this exists now: https://wiki.jenkins-ci.org/display/JENKINS/Slave+To+Master+Access+Control/ | 15:39 |
fungi | yeah, they added that a year or so ago | 15:40 |
fungi | finally | 15:40 |
anteaya | acls for jenkins | 15:40 |
fungi | anyway, i need to disappear again for a few hours. hope to be back to the internet around 1900z or a little after | 15:41 |
anteaya | fungi: thank you | 15:41 |
anteaya | best of luck with whatever you are doing | 15:41 |
yolanda | jeblair, jenkins-dev is up, but how can we better check the gear interaction, and zmq? | 15:42 |
jeblair | yolanda: i'll clean up zuul-dev; you can reconfigure jenkins-dev to talk to gearman on zuul-dev. | 15:42 |
anteaya | zaro: once you are online, any suggestions you could make are welcome | 15:43 |
yolanda | jeblair, to talk with zuul gear in production? | 15:43 |
jeblair | yolanda: zuul-dev | 15:44 |
yolanda | ok i updated the url | 15:45 |
jeblair | the stop and set description jobs are registered, but no real jobs | 15:48 |
jeblair | that suggests this version is also incompatible | 15:48 |
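The registration check jeblair describes amounts to reading the function list from the gearman server and looking for real job functions. A rough Python sketch that parses the text returned by the gearman admin `status` command (tab-separated function name, queued total, running, available workers, terminated by a line containing a single dot). The `build:` prefix for job functions and the sample function names are assumptions for illustration, and the helper names are hypothetical.

```python
def parse_gearman_status(text):
    """Map each registered function name to (total, running, workers)."""
    functions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":
            continue  # skip blanks and the terminating dot
        name, total, running, workers = line.split("\t")
        functions[name] = (int(total), int(running), int(workers))
    return functions


def build_functions(functions):
    """Return job functions, assuming they are registered as 'build:<job>'."""
    return sorted(name for name in functions if name.startswith("build:"))


# Hypothetical status output resembling the situation described above:
# only the stop and set-description functions are registered.
SAMPLE = "stop:jenkins-dev\t0\t0\t1\nset_description:jenkins-dev\t0\t0\t1\n.\n"
registered = parse_gearman_status(SAMPLE)
```

An empty result from `build_functions(registered)` would match the symptom seen here: the plugin connected but registered no real jobs.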
yolanda | so should i roll it back? | 15:49 |
yolanda | or doesn't it matter if we keep that on dev for future tests? | 15:49 |
jeblair | no, now we need to land 271543 | 15:49 |
jeblair | i have rechecked to see if it will pass tests | 15:49 |
yolanda | ok | 15:50 |
jeblair | i will also go ahead and pull the build off of that node so we don't have to wait for it to actually land | 15:53 |
jeblair | it failed with the same errors, but i'm going to go ahead and install it on jenkins-dev and see what happens | 15:54 |
yolanda | ok | 15:55 |
jeblair | it's installed and restarted and many more functions are registered now | 15:56 |
yolanda | can you trigger some job on zuul-dev to test the full workflow? | 15:57 |
jeblair | let me see how hard that would be | 15:58 |
nibalizer | i am touristing in france | 15:58 |
nibalizer | so pretty useless | 15:58 |
nibalizer | sorry | 15:58 |
yolanda | lucky you :) | 15:58 |
jeblair | nibalizer: go enjoy france! | 15:59 |
anteaya | nibalizer: thanks for letting us know | 15:59 |
nibalizer | kk | 16:00 |
jeblair | zuul-dev is configured to only run noop; i'll manually reconfigure it to run a real job real quick | 16:00 |
nibalizer | good luck :hugops: | 16:00 |
anteaya | nibalizer: thanks, have fun | 16:03 |
jeblair | https://jenkins-dev.openstack.org/view/zaro/job/check-gtest-org/8/console | 16:04 |
jeblair | that looks reasonably good | 16:05 |
jeblair | the content of the job is all wrong which is what the error is about | 16:05 |
jeblair | but it ran it | 16:05 |
anteaya | it did | 16:05 |
jeblair | i think we should now shut down a jenkins master and upgrade it | 16:06 |
jeblair | i will put 07 and 06 in shutdown mode | 16:07 |
jeblair | actually, first i will restart firefox | 16:07 |
yolanda | i need to go for a couple of hours | 16:08 |
anteaya | yolanda: thanks for your help | 16:08 |
jeblair | yolanda: thanks, it sounds like mordred will be online soon. | 16:08 |
anteaya | yolanda: enjoy your next thing | 16:08 |
yolanda | anteaya, family duties. Have a nice weekend | 16:09 |
anteaya | yolanda: thanks you too | 16:09 |
jeblair | anteaya: did the ubuntu-trusty cleanup change land yesterday? and did zuul get restarted? | 16:13 |
jeblair | jenkins06 and jenkins07 are in quiesce mode | 16:13 |
anteaya | sorry will look | 16:16 |
anteaya | we reverted clarkb's move puppet jobs to ubuntu-trusty yesterday: https://review.openstack.org/#/c/294199/ | 16:17 |
anteaya | as there are issues with the images, as in it doesn't contain cron for example | 16:17 |
jeblair | wow, crontab | 16:17 |
anteaya | yeah | 16:17 |
jeblair | *that* is something i would support in a base image :) | 16:18 |
anteaya | and there is a jjb yaml parsing item | 16:18 |
anteaya | jeblair: yay another convert | 16:18 |
anteaya | ha ha ha | 16:18 |
anteaya | and did zuul get restarted, yes it did | 16:18 |
anteaya | fungi restarted it late in the day | 16:18 |
jeblair | anteaya: the second link in your comment doesn't work (it was to jenkins, not logs.o.o) | 16:18 |
anteaya | I'm wrong, zuul didn't appear to get restarted: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-03-18.log.html#t2016-03-18T00:39:10 | 16:19 |
anteaya | jeblair: ah sorry about that broken comment link | 16:20 |
jeblair | okay, i find it strange that we would run for a whole day without restarting zuul when it's having such a big impact on our quota | 16:20 |
jeblair | i will restart it now | 16:20 |
anteaya | thanks | 16:20 |
anteaya | it was discussed but the timing was to restart at the end of the day | 16:21 |
jeblair | okay, i will be a softie and wait just a few mins for the top 5 changes to merge | 16:22 |
AJaeger | jeblair, anteaya : Last words from fungi last night: He didn't restart and asked somebody else to do it... | 16:22 |
AJaeger | He ran out of time ;( | 16:22 |
jeblair | AJaeger: if last night is anything like now, i guess no one else was around | 16:22 |
AJaeger | jeblair: Last night: 12+ hours ago | 16:23 |
* jeblair is starting to think the more roots we are the fewer we have | 16:23 | |
anteaya | I'm noticing the same thing | 16:23 |
anteaya | AJaeger: yup, found that in the logs, thanks | 16:24 |
jeblair | since i'm throwing in a zuul restart, i'll go ahead and quiesce jenkins01 and jenkins02 | 16:26 |
anteaya | ack | 16:26 |
jeblair | then i'll be able to knock out half the system at a time | 16:26 |
anteaya | wonderful | 16:27 |
AJaeger | is anything in the release queue? Should we give release team a heads-up? | 16:27 |
anteaya | post does have 70 changes | 16:27 |
jeblair | i don't see anything in release | 16:28 |
anteaya | oh sorry, release not post | 16:28 |
* AJaeger gives them a heads-up now... | 16:29 | |
anteaya | thanks | 16:29 |
jeblair | AJaeger: ++ | 16:29 |
jeblair | this is the slowest job ever | 16:31 |
anteaya | :( | 16:31 |
*** pabelanger has joined #openstack-infra-incident | 16:45 | |
pabelanger | o/ | 16:45 |
anteaya | hello pabelanger | 16:45 |
pabelanger | just catching up on backscroll for #openstack-infra | 16:45 |
anteaya | yup | 16:46 |
pabelanger | While on PTO today, I have some time for the next 2 hours to help out if needed | 16:46 |
pabelanger | kids down for naps | 16:46 |
clarkb | I am only just now waking up, will be a bit before I can help but looks like jenkins and gearman plugin upgrades in process | 16:46 |
anteaya | thanks pabelanger but you should be time offing if you are on time off | 16:47 |
anteaya | guess it just would be helpful to know in advance so not everyone is off on the same day if possible | 16:47 |
anteaya | clarkb: welcome | 16:47 |
pabelanger | anteaya: don't mind offering if there is an issue ATM | 16:48 |
anteaya | yup, understood and thanks | 16:48 |
pabelanger | anteaya: agreed. Not sure the protocol for PTO upstream, I have no issues sending out a notification in advance, just need to figure out where :) | 16:49 |
anteaya | well in the past we have just said in channel | 16:49 |
anteaya | I'll be away tomorrow | 16:49 |
anteaya | has worked well thus far when practiced | 16:49 |
jeblair | i *will* be away tomorrow ;) | 16:52 |
jeblair | also the day after | 16:52 |
anteaya | awesome | 16:52 |
* anteaya makes a note | 16:52 | |
jeblair | back on monday :) | 16:52 |
anteaya | woooo | 16:52 |
jeblair | why is this job still running?! | 16:53 |
anteaya | :( | 16:53 |
jeblair | did we add vexxhost to grafana? | 16:53 |
clarkb | jeblair: yes it and osic are both in grafana | 16:53 |
jeblair | clarkb: oh! it's in the dropdown but not on the main page | 16:54 |
jeblair | there is no indication that the list of dashboards is not complete. :( | 16:54 |
pabelanger | ya, we need to rejig the front-page to include them | 16:55 |
jeblair | it looks like it generally runs tempest jobs in a reasonable time, but there is a 2h spike. i wonder if this is another similar spike. | 16:56 |
clarkb | jeblair: are you just waiting for jobs to finish on 01 02 06 and 07? | 17:10 |
AJaeger | jeblair: did you restart zuul? ttx wants to know on #openstack-release... | 17:12 |
jeblair | nope, still waiting on https://jenkins04.openstack.org/job/gate-tempest-dsvm-full/16564/consoleFull to finish | 17:13 |
jeblair | when ^ finishes, i'll restart zuul; if one of the masters finishes first (unlikely?), i'll upgrade it | 17:13 |
* anteaya checks the water in the basement | 17:14 | |
jeblair | wow, this run has taken more than 2 hours | 17:22 |
anteaya | :( | 17:24 |
*** dims_ has joined #openstack-infra-incident | 17:28 | |
* dhellmann catches up on scrollback | 17:33 | |
anteaya | dhellmann: hello | 17:33 |
dhellmann | hi, anteaya | 17:33 |
dhellmann | just to make sure I have the situation clear, you're working on upgrading jenkins still, correct? | 17:34 |
jeblair | yes, and about to restart zuul | 17:34 |
dhellmann | cool, thanks, I'll keep an eye on this channel | 17:35 |
anteaya | dhellmann: great | 17:35 |
jeblair | zuul is stopped | 17:35 |
jeblair | deleted all nodepool nodes | 17:36 |
jeblair | starting zuul | 17:36 |
jeblair | (rather: set all nodepool nodes to be deleted) | 17:36 |
jeblair | re-enqueuing changes | 17:36 |
clarkb | jeblair: all nodepool nodes or just devstack-trusty? | 17:37 |
jeblair | clarkb: all -- i'm restarting zuul | 17:37 |
clarkb | oh right all the jobs will go off in never never land | 17:38 |
jeblair | right that | 17:38 |
jeblair | it's eventually consistent, but this speeds up freeing up the 4 masters and gives us more available quota | 17:38 |
jeblair | everything is re-enqueued and 01,02,06,07 are idle | 17:40 |
jeblair | i gave the all-clear in -release | 17:41 |
jeblair | i'm going to upgrade jenkins07 now | 17:41 |
clarkb | ok | 17:41 |
jeblair | removed all slaves from config.xml | 17:44 |
jeblair | SEVERE: Failed to create slave log directory /var/lib/jenkins/logs/slaves/devstack-centos7-internap-nyj01-8810925 | 17:45 |
jeblair | neat | 17:45 |
jeblair | okay | 17:46 |
jeblair | somehow the gearman plugin is working | 17:46 |
jeblair | i have not upgraded it | 17:46 |
anteaya | odd | 17:46 |
jeblair | i have quiesced jenkins07 to stop further jobs from running there | 17:47 |
clarkb | 07 claims to be the old version | 17:47 |
clarkb | oh maybe my html just needed more of a refresh | 17:47 |
clarkb | now its new and shiny | 17:47 |
jeblair | Mar 18, 2016 5:47:27 PM org.gearman.common.GearmanJobServerSession handleReqSessionEvent | 17:48 |
jeblair | INFO: Session GearmanJobServerSession:1038:GearmanNIOJobServerConnection:zuul.openstack.org/162.242.150.96:4730 handling a REQ/CAN_DO event | 17:48 |
jeblair | Mar 18, 2016 5:47:27 PM org.gearman.common.GearmanJobServerSession submitTask | 17:48 |
jeblair | INFO: Session GearmanJobServerSession:1038:GearmanNIOJobServerConnection:zuul.openstack.org/162.242.150.96:4730 is now handling the task GearmanTask:null:1213a77b-4535-448f-ab26-d266eca4859d | 17:48 |
jeblair | also, those logs are insane | 17:48 |
clarkb | jeblair: was not upgrading the plugin intentional for testing? | 17:50 |
jeblair | clarkb: nope. dpkg helpfully restarted jenkins | 17:50 |
jeblair | i think the order needs to be install new plugin, then apt-get upgrade | 17:51 |
jeblair | er, apt-get install | 17:51 |
clarkb | gotcha | 17:51 |
jeblair | the reason it can't create the slave log directory is that there are too many entries in that directory already | 17:51 |
jeblair | i expect that has been the case for some time, and we probably don't really care that much atm, so i'm just going to leave it | 17:51 |
clarkb | kk | 17:52 |
clarkb | I can probably whip up a find to clean out anything older than say 2 months | 17:52 |
jeblair | so... given that i have no idea what the deal is with the gearman plugin; should we just go ahead and upgrade it anyway? | 17:52 |
clarkb | yes I think upgrade sounds safest | 17:53 |
anteaya | I can support that | 17:53 |
anteaya | on dev we saw the upgraded plugin worked | 17:53 |
jeblair | okay, i will nodepool delete all of the jobs running on 07, then upgrade gearman plugin there | 17:55 |
jeblair | the nodepool delete will cause them to be re-enqueued in zuul | 17:55 |
anteaya | ack | 17:55 |
clarkb | find /var/lib/jenkins/logs/slaves -mtime +60 -delete ? | 17:59 |
clarkb | jeblair: ^ should I go ahead and run that on the 8 masters? | 17:59 |
jeblair | clarkb: sure; we could probably do -mtime +7 really... | 18:00 |
clarkb | jeblair: it seems to have decided that all of the slaves dir itself should go away | 18:02 |
jeblair | clarkb: you mean /var/lib/jenkins/logs/slaves or all of the entries in there? | 18:02 |
clarkb | /var/lib/jenkins/logs/slaves | 18:02 |
jeblair | clarkb: neat | 18:03 |
jeblair | clarkb: it was jenkins:nogroup | 18:03 |
clarkb | thanks | 18:03 |
clarkb | recreated | 18:03 |
jeblair | i have uploaded the gearman-plugin | 18:04 |
clarkb | bah because the first output of that is that dir | 18:04 |
clarkb | will do this better on not 07 :) | 18:04 |
jeblair | i'm going to restart jenkins on 07 now | 18:04 |
clarkb | kk | 18:05 |
clarkb | find /var/lib/jenkins/logs/slaves/ -mindepth 1 -mtime +60 -delete appears to be what we want | 18:06 |
clarkb | am running that on 06 now | 18:06 |
jeblair | i'm going to create a new default view for all of the masters like this: https://jenkins01.openstack.org/ | 18:07 |
jeblair | it has no jobs in it | 18:07 |
clarkb | ++ | 18:08 |
jeblair | which means that when we visit one of the masters, we don't have to wait 5 minutes for the list of 8000 jobs to load | 18:08 |
clarkb | yup that above find is much happier, it removed all contents of the dir anyways so it was more than 2 months ago when we ran out of room for new dirs | 18:08 |
clarkb | but left the dir in place | 18:08 |
clarkb | will do the remainder of the masters now | 18:08 |
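The difference between the two find invocations is that the first also tested the starting directory itself, so once its contents were gone the directory was removed too; `-mindepth 1` restricts the match to entries below it. A minimal Python sketch of the same idea, written as an illustration only. It is not a drop-in equivalent of the find command: it judges each top-level entry by its own mtime and removes the whole entry, and the function name is hypothetical.

```python
import shutil
import time
from pathlib import Path


def prune_old_entries(root, max_age_days, now=None):
    """Remove entries directly under root older than max_age_days.

    The root directory itself is never removed, mirroring the intent
    of adding -mindepth 1 to the find command.
    """
    root = Path(root)
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    # Materialize the listing first so deletions do not disturb iteration.
    for entry in list(root.iterdir()):
        if entry.lstat().st_mtime < cutoff:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

Calling `prune_old_entries("/var/lib/jenkins/logs/slaves", 60)` would leave the directory in place even if every entry in it is stale, which avoids having to recreate it with the right ownership afterwards.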
anteaya | clarkb: yay | 18:09 |
anteaya | jeblair: sounds good | 18:09 |
jeblair | oh, are slave config files now outside of the main config? | 18:11 |
clarkb | jeblair: they werent on the old version but thats good news if so | 18:12 |
clarkb | jeblair: looks like /var/lib/jenkins/nodes | 18:15 |
jeblair | yep | 18:15 |
jeblair | i'm watching a build to make sure the whole lifecycle works | 18:15 |
jeblair | https://jenkins07.openstack.org/job/gate-horizon-npm-run-test/2838/console is finished | 18:15 |
jeblair | it ran on https://jenkins07.openstack.org/computer/ubuntu-trusty-bluebox-sjc1-8878183/ | 18:15 |
jeblair | which is offline now (good) | 18:16 |
jeblair | nodepool got the complete event for that and will delete it shortly | 18:16 |
anteaya | I hadn't seen a gui for a single node before, thanks | 18:17 |
clarkb | jeblair: this is with upgraded gearman and upgraded jenkins right? | 18:18 |
jeblair | clarkb: yes | 18:18 |
jeblair | scp console log looks good: http://logs.openstack.org/14/289314/21/check/gate-horizon-npm-run-test/bbe19e1/console.html | 18:18 |
jeblair | jenkins node was just deleted (good) | 18:19 |
jeblair | regular scp looks good: http://docs-draft.openstack.org/57/292957/1/check/gate-storyboard-webclient-js-draft/a4f46f9/dist/#!/page/about | 18:20 |
jeblair | (different job from jenkins07) | 18:20 |
jesusaur | where is the new plugin hosted? and will a new release be tagged? | 18:20 |
jeblair | that's all i can think of; anything else we should check? | 18:20 |
clarkb | /monitoring 404s | 18:20 |
jeblair | clarkb: neat | 18:21 |
jeblair | jesusaur: i approved the change earlier, that should at least result in a branch-tip build | 18:21 |
clarkb | maybe I need to login first and won't get redirected automagically anymore | 18:21 |
jeblair | i'm logged in and it 404s too | 18:21 |
clarkb | nope still 404s ya | 18:21 |
jeblair | monitoring used to be in the manage menu but is gone | 18:22 |
clarkb | maybe we just need to upgrade the plugin | 18:22 |
clarkb | https://wiki.jenkins-ci.org/display/JENKINS/Monitoring has newer releases so presumably it should still work if upgraded | 18:22 |
jeblair | it is not listed under installed plugins on 07 | 18:23 |
jeblair | i will install the current version | 18:23 |
jeblair | shall i delete all nodes from 07 and restart? | 18:24 |
anteaya | I support that | 18:24 |
clarkb | btw jenkins02 cleanup of slave logs is far slower than the higher numbered jenkinses. I think the reason this one is so much slower is that its IO isn't as good | 18:24 |
clarkb | jeblair: sure, then we can bundle an upgrade of javamelody into the later restarts | 18:24 |
anteaya | yes fungi said 01 and 02 are the slowest | 18:24 |
anteaya | since they were the first and are probably on the slowest vms so should be upgraded | 18:25 |
anteaya | when someone finds the time... | 18:25 |
jeblair | clarkb: weirdly 01 and 02 are the fastest at rendering the overview dashboards... | 18:25 |
anteaya | really? | 18:25 |
jeblair | okay, all clear in #-release, so i'm restarting 07 again now | 18:26 |
*** asselin has joined #openstack-infra-incident | 18:26 | |
jeblair | anteaya: i confirmed our rax errors are likely because our ord quota is insufficient for the request antonym made | 18:32 |
jeblair | Forbidden: Quota exceeded for ram: Requested 8192, but already used 516096 of 512000 ram (HTTP 403) (Request-ID: req-c7dc80bb-6798-4783-9165-ea4155e8410b) | 18:32 |
anteaya | jeblair: ah okay | 18:32 |
anteaya | jeblair: I can offer a fix, any suggestions for quota? | 18:33 |
jeblair | anteaya: antonym is increasing our ord quota | 18:33 |
anteaya | ord is currently 195 | 18:33 |
anteaya | oh okay, that is even better | 18:33 |
jeblair | okay, 07 is idle again, i will stop it, remove the leaked node config directories, and start again | 18:35 |
anteaya | do I calculate correctly that a quota of 512000 ram is 62 nodes? | 18:36 |
jeblair | yep | 18:36 |
anteaya | thanks | 18:36 |
jeblair | anteaya: we set max-servers a few lower than that in order to have headroom for building images, etc. | 18:36 |
anteaya | ack | 18:36 |
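The arithmetic behind the 62-node figure can be checked against the numbers in the quota error. A small Python sketch, assuming the quota values are in megabytes (the error message does not state the unit) and that every node requests the same 8192; the function name is hypothetical.

```python
QUOTA_RAM = 512000      # quota limit from the error message
USED_RAM = 516096       # usage reported in the error message
RAM_PER_NODE = 8192     # size of the rejected request


def nodes_that_fit(ram, ram_per_node):
    """Number of whole nodes that fit in the given amount of RAM."""
    return ram // ram_per_node


max_nodes = nodes_that_fit(QUOTA_RAM, RAM_PER_NODE)    # 512000 / 8192 = 62.5, so 62
nodes_in_use = nodes_that_fit(USED_RAM, RAM_PER_NODE)  # 516096 / 8192 = 63 exactly
```

The reported usage corresponds to 63 nodes of that size, one more than the 62 the quota allows, which is consistent with new requests being rejected.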
jeblair | clarkb: https://jenkins07.openstack.org/monitoring exists | 18:37 |
anteaya | so the original quota of 55 seems to have been the max without a rax increase | 18:37 |
anteaya | I can sign into jenkins but I can't see the monitoring dashboard | 18:37 |
clarkb | jeblair: confirmed that monitoring is working for me | 18:38 |
anteaya | guess I'll just have to listen to the stories | 18:38 |
clarkb | and if I log out it forces me to log in | 18:38 |
jeblair | anteaya: yeah, it's limited to admins since it lets you kill threads which would do damage | 18:38 |
anteaya | agreed | 18:38 |
jeblair | are we ready to proceed to 1,2,6 now? | 18:38 |
anteaya | clarkb: thou shalt be known! | 18:38 |
anteaya | I have no objection to proceeding | 18:39 |
jeblair | clarkb: ? | 18:43 |
clarkb | oh ya I have no objections | 18:43 |
clarkb | 01-07 have the logs/slaves cleaned out | 18:44 |
jeblair | cool, will proceed; i'll upgrade monitoring plugin, gearman plugin, then restart | 18:44 |
jeblair | er, then jenkins, then restart :) | 18:44 |
anteaya | ack | 18:44 |
jeblair | 06 is starting | 18:47 |
anteaya | yay | 18:47 |
jeblair | 06 is up; monitoring exists... | 18:48 |
anteaya | wooooo | 18:49 |
jeblair | nodes are being added | 18:49 |
jeblair | and its running jobs | 18:49 |
jeblair | i think before i do 01 and 02, i will pause and do jenkins.o.o | 18:49 |
nibalizer | hello | 18:49 |
jeblair | since that's the one we actually care about | 18:49 |
anteaya | makes sense | 18:49 |
nibalizer | i have returned to a computer, are there things I can help with? | 18:49 |
anteaya | nibalizer: hi there | 18:50 |
anteaya | I'll let jeblair respond when he has a moment | 18:50 |
anteaya | as I don't know | 18:50 |
clarkb | jeblair: ok I am about to clean out the slave logs on jenkins.o.o | 18:50 |
clarkb | I don't think that will interfere with your upgrade process | 18:50 |
jeblair | clarkb: thanks and i agree | 18:50 |
jeblair | nibalizer: i think we've got it covered now | 18:52 |
clarkb | all done | 18:52 |
nibalizer | jeblair: fantastic | 18:52 |
anteaya | nibalizer: how is the touristing going? | 18:52 |
nibalizer | anteaya: pretty good | 18:52 |
anteaya | awesome | 18:53 |
nibalizer | im exhausted and have lots of pictures | 18:53 |
anteaya | have you had crepes? | 18:53 |
anteaya | nibalizer: sounds like you're doing it right | 18:53 |
nibalizer | anteaya: no i should get some crepes! | 18:53 |
nibalizer | wait no i did have crepes | 18:53 |
anteaya | obviously memorable | 18:53 |
anteaya | were they any good? | 18:53 |
nibalizer | ya | 18:54 |
anteaya | nice | 18:54 |
nibalizer | it was street vendor crepes during the jet lag march of death | 18:54 |
anteaya | ah | 18:54 |
anteaya | restaurant crepes | 18:54 |
anteaya | to savour | 18:54 |
* anteaya fondly remembers crepes from normandy | 18:55 | |
clarkb | I think the zuul restart earlier may have confused the nodepool image uploads | 18:55 |
clarkb | supposedly we have 42 images uploading but I can't find any evidence that is in progress | 18:56 |
jeblair | we have the all-clear from release to upgrade jenkins.o.o | 18:56 |
jeblair | it's restarting | 18:56 |
clarkb | confirmed with geard that none of the uploads are actually queued or running | 18:59 |
clarkb | I am going to restart nodepool-builder | 18:59 |
clarkb | and then probably have to clean out nodepool itself | 18:59 |
clarkb | actually nodepool-builder restart probably not necessary as there are 4 workers for each upload | 18:59 |
jeblair | clarkb: i think we knew there were still some edge cases in the builder job code; this may be one of them. | 18:59 |
clarkb | ya | 19:00 |
jeblair | jenkins.o.o is up | 19:00 |
anteaya | yay | 19:00 |
AJaeger | great | 19:00 |
clarkb | the simple fix I think is to restart nodepoold | 19:00 |
clarkb | since that is supposed to clear any building images out of the db | 19:00 |
jeblair | now i will do 01 and 02 | 19:01 |
clarkb | so maybe lets wait for jenkins stuff to settle down a bit before restarting nodepoold | 19:01 |
jeblair | 01 and 02 are restarting | 19:05 |
jeblair | 01 is up | 19:05 |
jeblair | 02 is up | 19:07 |
jeblair | 01 is getting nodes and jobs | 19:08 |
AJaeger | that went smooth, great! | 19:08 |
clarkb | jeblair: let me know when you think it is relatively safe to restart nodepool | 19:08 |
jeblair | AJaeger: still have 3 more masters to go, but it's pretty mechanical at this point | 19:08 |
jeblair | 02 is getting nodes and jobs | 19:09 |
* AJaeger leaves now for the weekend! Hope that's the last fire for today ;) | 19:09 | |
jeblair | clarkb: https://jenkins04.openstack.org/monitoring does not exist | 19:10 |
jeblair | which is weird. | 19:11 |
clarkb | jeblair: huh, it hasnt been upgraded | 19:11 |
jeblair | clarkb: oh sorry, yeah, i haven't started on 3,4,5 yet | 19:11 |
jeblair | clarkb: i was just taking stock of the old state | 19:11 |
clarkb | right just noticing that its the old version so should be there | 19:11 |
jeblair | i have created the new 'Default' dashboard on them and just now put them in quiesce mode | 19:11 |
jeblair | clarkb: ok yep. | 19:12 |
clarkb | 03 has it | 19:12 |
jeblair | and 5 | 19:12 |
jeblair | clarkb: i'm thinking of doing the 'delete nodes from under jenkins' trick for 3-5 and just knocking this out. what do you think? | 19:13 |
clarkb | jeblair: seems reasonable, better to get it done than spend all day on it | 19:13 |
anteaya | I support that | 19:13 |
clarkb | then when it is done we can restart nodepool and get new images up | 19:13 |
pabelanger | clarkb: jeblair I did https://review.openstack.org/#/c/294339/ last night (exposed upload_workers), but haven't tested the uploads yet. | 19:15 |
jeblair | we have clearance from release, so i'm going to upgrade 3-5 now | 19:15 |
pabelanger | should be able to do that on Monday | 19:15 |
clarkb | the one thing we didn't test was jjb ? just thinking of that now | 19:15 |
jeblair | pabelanger: awesome! | 19:15 |
clarkb | but I think other people are successfully using jjb on a variety of jenkins versions so not very worried | 19:15 |
jeblair | clarkb: heh, true. and i think we can work through it if there are problems | 19:16 |
pabelanger | okay, kids are waking up from naps, back to PTO shortly. I'll check back in later tonight. | 19:16 |
anteaya | pabelanger: thanks for your help | 19:16 |
jeblair | hrm, 04 is somewhat unresponsive | 19:19 |
jeblair | it got better | 19:20 |
anteaya | heh | 19:21 |
jeblair | deleting nodes now | 19:21 |
clarkb | number of threads is keeping steady on the upgraded jenkinses | 19:24 |
jeblair | upgrading 03 now | 19:26 |
jeblair | and 04 | 19:27 |
jeblair | and 05 | 19:28 |
jeblair | 03 is up | 19:29 |
jeblair | and 04 | 19:29 |
jeblair | and 05 | 19:29 |
jeblair | 03 is getting nodes and jobs | 19:30 |
jeblair | i have quiesced 04 | 19:31 |
jeblair | 04 did not end up with the monitoring plugin installed | 19:31 |
anteaya | ah | 19:32 |
jeblair | 05 is getting nodes and jobs | 19:32 |
anteaya | so just respinning 04 remaining? | 19:32 |
jeblair | re-re-installing monitoring plugin on 04 | 19:32 |
anteaya | ack | 19:33 |
jeblair | restarting 04 | 19:33 |
jeblair | quiescing 04 again | 19:34 |
clarkb | a start-in-shutdown mode would be nice | 19:34 |
jeblair | no kidding | 19:34 |
jeblair | this time it snagged some nodes; i'm deleting them | 19:36 |
jeblair | http://paste.openstack.org/show/491153/ | 19:36 |
jeblair | is the error | 19:36 |
clarkb | wow heap issues | 19:36 |
clarkb | possibly the plugin data is corrupt? | 19:37 |
jeblair | clarkb: | 19:38 |
jeblair | FYI, there is a proprietary plugin in Jenkins Enterprise by CloudBees | 19:38 |
jeblair | for this purpose. | 19:38 |
jeblair | https://groups.google.com/forum/#!topic/jenkinsci-dev/Gr2QOxSl7_8 | 19:38 |
jeblair | though, someone did post a groovy script to do it | 19:38 |
jeblair | https://wiki.jenkins-ci.org/display/JENKINS/Post-initialization+script | 19:38 |
clarkb | wow | 19:39 |
clarkb | value add | 19:39 |
jeblair | clarkb: the /etc/default/jenkins file on 04 does not match that of 03 | 19:39 |
jeblair | 03 has -Xmx12g; 04 does not have an Xmx arg | 19:40 |
clarkb | huh | 19:40 |
clarkb | puppet appears to have run on 04 recently | 19:41 |
jeblair | ah.... | 19:41 |
clarkb | hrm 04 has it now | 19:42 |
clarkb | maybe this is the package racing puppet or something? | 19:42 |
jeblair | yeah, maybe a dpkg puppet race | 19:42 |
jeblair | i will restart 04 now | 19:42 |
jeblair | (ansible confirms all /etc/default/jenkins files are identical) | 19:42 |
jeblair | what it failed again | 19:44 |
anteaya | :( | 19:44 |
jeblair | quiesced | 19:44 |
clarkb | huh same error with the heap? | 19:45 |
jeblair | yep | 19:45 |
jeblair | i see -Xmx12g in the command line | 19:45 |
clarkb | /tmp/hsperfdata_jenkins is the melody data I think | 19:46 |
clarkb | but it's mostly empty | 19:46 |
jeblair | deleting 04 nodes again | 19:46 |
jeblair | i need to lunch now; i'm going to leave 04 as-is if clarkb or anyone else wants to look into it | 19:47 |
clarkb | ok | 19:47 |
clarkb | it is trying to copy an array it looks like and that hits the error | 19:49 |
anteaya | is it worth it to spin up a fresh vm and install jenkins on it and call it 04? | 19:50 |
clarkb | no we should figure out why it is doing this | 19:50 |
anteaya | okay | 19:51 |
clarkb | the md5sums of the .jpi files differ | 19:54 |
clarkb | jeblair: I think whatever process you were using to update isn't functioning. We can try copying the .jpi in from another master | 19:54 |
anteaya | wonder why that would only cause an issue on 04 | 19:55 |
fungi | i am back and starting to catch up, though i may also have to pop back out again very briefly at 2130z (but not for long) | 19:55 |
anteaya | welcome back | 19:55 |
fungi | and yeah, zuul was still not caught up when i started running out of steam last night, sounds like there wasn't really anybody else around in shape to restart it after that either | 19:56 |
anteaya | fungi: yeah didn't appear so | 19:56 |
anteaya | jeblair observed that the more root folks we have the less we seem to have | 19:57 |
clarkb | jeblair: my suggestion would be to do that, copy the jpi in from 07, rm plugins/monitoring/ on 04 and restart 04 | 19:57 |
fungi | so jenkins 1.642.3 is the latest lts. interesting they've changed up the theming on the webui | 19:58 |
clarkb | jeblair: but will wait for your lunch to conclude so we can figure out what might be wrong with the process that was previously used | 19:59 |
clarkb | fungi: basic sitrep is zuul has been restarted (which broke nodepool's gearman state so it needs a restart too, then manual uploads of ubuntu-trusty for uuidgen updates) and every master but jenkins04 is fully operational on new jenkins lts with new gearman plugin version | 20:03 |
clarkb | fungi: the monitoring plugin install is unhappy on 04; it appears to be due to trying to use the old version against the new jenkins lts | 20:03 |
clarkb | I intend on restarting nodepoold as soon as the jenkinses are happy | 20:03 |
clarkb | and will requeue ubuntu-trusty uploads | 20:05 |
fungi | okay, sounds better than it could be i guess | 20:08 |
fungi | and makes sense that nodepool image updates might be disrupted by its gearman disappearing out from under it at the moment | 20:09 |
clarkb | it should mostly handle that but there are still a few corner cases | 20:09 |
fungi | and yeah, yesterday was mostly a bust for getting anything done other than dealing with the openstackid/zanata situation | 20:09 |
clarkb | the fix for uuidgen did get in we just haven't managed to upload any of those images yet from what I can see | 20:10 |
fungi | also yolanda and krotscheck seem to have discovered that missing dbus on the minimal images was keeping firefox from working | 20:22 |
fungi | or i think that was the conclusion they had arrived at | 20:23 |
anteaya | yes | 20:23 |
anteaya | I replied in -infra to keep the conversation about anything not incident related in there | 20:24 |
anteaya | for logs searching purposes | 20:25 |
clarkb | -rw-rw-r-- 1 jenkins jenkins 0 Mar 18 20:03 plugin.ini | 20:27 |
clarkb | that's extra weird | 20:27 |
clarkb | er ww | 20:27 |
fungi | what's weird? the fact that it's empty? | 20:31 |
fungi | maybe we can just copy that from another server? | 20:31 |
clarkb | this was response to emilienm in other channel | 20:31 |
fungi | oh | 20:32 |
clarkb | their job failed being unable to copy that file and I found it weird the file existed | 20:32 |
clarkb | since the copy failed | 20:32 |
clarkb | but I think scp plugin must create the dirent before copying contents | 20:32 |
clarkb | rather than doing a w and writing | 20:32 |
fungi | ahh, i haven't finished catching up in there yet | 20:32 |
jeblair | clarkb: catching up | 20:42 |
jeblair | clarkb: i used the webui; on jenkins04 and 07 i 'installed' the plugin; on others i 'upgraded' it | 20:44 |
jeblair | clarkb: just selecting from the list of available plugins; no manual uploads or anything in the case of monitoring | 20:44 |
clarkb | jeblair: huh, for some reason it appears to still be using the old version on 04 when you do that | 20:44 |
clarkb | jeblair: maybe we can 'upgrade' it now? | 20:44 |
jeblair | (for gearman-plugin, they were all uploaded through the webui) | 20:44 |
* clarkb looks | 20:44 | |
jeblair | clarkb: i think it still wasn't showing as upgradable, probably since it isn't really loaded | 20:45 |
clarkb | ya I don't see it | 20:45 |
jeblair | clarkb: assuming that's still the case, i'd agree that next steps are probably to delete, manually install as you suggest | 20:45 |
clarkb | my original suggestion of just copying a jpi from the other servers over is the best idea i have | 20:45 |
jeblair | let's first do the groovy script to have it start in shutdown mode | 20:46 |
clarkb | ++ | 20:46 |
jeblair | i will do that | 20:46 |
jeblair | clarkb: do you want to work on staging the delete/copy? | 20:46 |
clarkb | yup | 20:46 |
clarkb | it is at jenkins04:/home/clarkb/monitoring.jpi and the md5sums match | 20:48 |
clarkb | we can just cp that into place when 04 jenkins process is stopped | 20:48 |
jeblair | clarkb: my groovy script is staged | 20:50 |
jeblair | clarkb: i'm all set; why don't you drive from here? | 20:50 |
clarkb | ok I will stop jenkins now | 20:50 |
clarkb | ok plugin copied in place starting jenkins | 20:51 |
clarkb | jeblair: ready? | 20:51 |
jeblair | yep | 20:52 |
jeblair | clarkb: same error :/ | 20:53 |
jeblair | it did start in shutdown mode | 20:54 |
clarkb | monitoring/META-INF/MANIFEST.MF looks good at least | 20:54 |
clarkb | but ya not working | 20:54 |
clarkb | perhaps /var/lib/jenkins/monitoring is the problem? | 20:55 |
clarkb | there are almost 13k files in there | 20:55 |
clarkb | hrm 07 has almost as many | 20:56 |
clarkb | that would be my next best guess: move that dir aside and have the plugin restart from scratch | 20:56 |
jeblair | let's try it | 20:57 |
clarkb | jeblair: you doing it this time? and do we have to update the groovy stuff each restart or is it set once and that way until the script is removed? | 20:58 |
jeblair | clarkb: it's there for good. it's a file in ~jenkins/init.groovy.d/ | 20:58 |
clarkb | I like the idea of putting that on every master and updating the ansible playbook to restart them to explicitly online them | 20:59 |
jeblair | clarkb: i worry a little about system restarts though... | 20:59 |
clarkb | oh good point | 20:59 |
clarkb | jeblair: I can move the monitoring dir aside if you want | 21:00 |
jeblair | clarkb: but, we could ansible the process of adding and removing that | 21:00 |
jeblair | clarkb: so we could have an "ansible shut down jenkins" and "ansible start jenkins" that does that for us | 21:00 |
clarkb | ya | 21:00 |
jeblair | clarkb: why don't you continue to drive | 21:00 |
clarkb | kk | 21:00 |
clarkb | stopping jenkins again | 21:01 |
clarkb | and starting | 21:01 |
jeblair | seems happier now | 21:02 |
clarkb | yup I can get to the melody page now | 21:03 |
clarkb | jeblair: anything else you want to check before cancelling shutdown mode? | 21:03 |
jeblair | clarkb: nope, i'll go ahead and do that and rm the script | 21:03 |
clarkb | kk | 21:03 |
clarkb | it is running jobs | 21:05 |
clarkb | looking good, I am ready to restart nodepool if others are ready | 21:06 |
clarkb | jeblair: fungi anteaya ^ any reason to not do that now? | 21:08 |
fungi | looks like a fine time for a nodepool restart | 21:08 |
fungi | i'm around to help. pretty much caught up on what i've missed the first part of the day | 21:09 |
clarkb | ok stopping nodepool services now | 21:09 |
anteaya | I have no reason to not restart nodepool | 21:09 |
jeblair | ++ | 21:09 |
clarkb | ok both services are restarted and appear happy | 21:10 |
clarkb | I will start ubuntu-trusty uploads in a screen session | 21:10 |
anteaya | yay | 21:10 |
fungi | and wow so moving the monitoring dir aside solved it? | 21:11 |
fungi | freaky | 21:11 |
fungi | i guess that's accumulating graph/stats data over time? | 21:11 |
anteaya | fungi: clarkb removed a whole bunch of slave files accumulated on the jenkins earlier today | 21:13 |
anteaya | not sure if that is related or not | 21:13 |
jeblair | it wasn't the largest such directory; 03 has 17k files in it | 21:13 |
clarkb | fungi: ya tons of rrd files | 21:13 |
fungi | yikes | 21:14 |
jeblair | but i guess something in there was :( | 21:14 |
clarkb | if I had to guess it wasn't the size so much as maybe corrupt data | 21:14 |
clarkb | which caused the jvm reading them in to allocate crazy amounts of memory and fail | 21:14 |
jeblair | ya. a lot of those files are for ephemeral nodes too | 21:14 |
clarkb | all ubuntu-trusty image uploads are queued | 21:14 |
clarkb | I started with the qcow2 clouds | 21:14 |
clarkb | as they should flush out quicker | 21:15 |
clarkb | I need to step out now back in 20 or so | 21:15 |
anteaya | clarkb: thanks | 21:15 |
fungi | the more data there is, the more likely some of it is to end up corrupt as well | 21:18 |
anteaya | this was the magic incantation: find /var/lib/jenkins/logs/slaves/ -mindepth 1 -mtime +60 -delete | 21:19 |
anteaya | the first round was sans -mindepth 1 and took out the dir as well | 21:19 |
anteaya | which was recreated | 21:19 |
fungi | yep, saw that in scrollback | 21:21 |
anteaya | figured | 21:21 |
*** zaro has quit IRC | 21:25 | |
*** greghaynes has quit IRC | 21:25 | |
*** zaro has joined #openstack-infra-incident | 21:26 | |
*** mordred has quit IRC | 21:26 | |
*** dhellmann has quit IRC | 21:26 | |
*** dhellmann has joined #openstack-infra-incident | 21:26 | |
*** mordred has joined #openstack-infra-incident | 21:27 | |
*** greghaynes has joined #openstack-infra-incident | 21:28 | |
clarkb | I am back | 21:31 |
clarkb | vexxhost image upload failed on the nginx looking thing | 21:32 |
clarkb | osic and bluebox are done | 21:32 |
*** ChanServ changes topic to "situation normal" | 22:21 | |
*** asselin has quit IRC | 23:22 |
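
The cleanup command quoted at 21:19 (`find /var/lib/jenkins/logs/slaves/ -mindepth 1 -mtime +60 -delete`) and the note that the first run without `-mindepth 1` also removed the directory itself can be illustrated with a rough Python sketch. This is not what was run during the incident; the function name `prune_old_files` and its parameters are hypothetical, it only removes files (not emptied subdirectories), and its age check is a simple approximation of `-mtime +60` rather than an exact match for find's day rounding.

```python
import os
import time
from pathlib import Path


def prune_old_files(root, max_age_days=60, now=None):
    """Delete files under root older than max_age_days; never remove root.

    Hypothetical helper, roughly analogous to
    `find ROOT -mindepth 1 -mtime +N -delete` restricted to files.
    Returns the list of paths that were removed.
    """
    root = Path(root)
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    removed = []
    # os.walk does not follow directory symlinks by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # lstat so a symlink is judged by its own mtime, not its target's.
            if path.lstat().st_mtime < cutoff:
                path.unlink()
                removed.append(path)
    # The starting directory is only traversed, never deleted, which is
    # the property that -mindepth 1 provides in the find command.
    return removed
```

Calling `prune_old_files("/var/lib/jenkins/logs/slaves", 60)` would leave the `slaves` directory in place even if every file in it were old, which avoids the situation described in the log where the directory had to be recreated.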