Thursday, 2018-12-06

clarkbah it's going through the ansible log not the job console log00:00
corvusclarkb: so 1deb5f1e391aa7eea4d84b2032bb1c970e005500 would be dev15?00:00
clarkb1deb5f1e391aa7eea4d84b2032bb1c970e005500 is what I find for dev1500:03
clarkbyup00:03
clarkb(the thing that makes that weird is that pbr will do 3.3.1.dev$commits since 3.3.0 but git describe will do 3.3.0-$commits since 3.3.000:03
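A rough sketch of the version mapping being described here, assuming the most recent tag is a plain X.Y.Z tag with no sem-ver hints in commit messages (pbr names the next patch release plus .devN, while git describe counts commits since the last tag):

```python
# Hedged illustration of pbr-style dev versions vs. `git describe` output.
# Run inside a git checkout; assumes the latest tag is e.g. "3.3.0".
import subprocess

describe = subprocess.check_output(
    ["git", "describe", "--tags", "--long"], universal_newlines=True).strip()
# e.g. "3.3.0-15-g1deb5f1": last tag, commits since that tag, abbreviated sha
tag, commits, _sha = describe.rsplit("-", 2)
major, minor, patch = (int(part) for part in tag.split("."))

# pbr reports the *next* patch release and appends .devN, so 15 commits after
# the 3.3.0 tag becomes 3.3.1.dev15 -- which is why the two numbers look
# inconsistent when read side by side.
pbr_style = "{}.{}.{}.dev{}".format(major, minor, patch + 1, commits)
print(describe, "->", pbr_style)
```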
*** jamesmcarthur has joined #openstack-infra00:08
openstackgerritPaul Belanger proposed openstack-infra/nodepool master: Include host_id for openstack provider  https://review.openstack.org/62310700:08
corvusclarkb, pabelanger: none of those changes look suspicious.  we don't have objgraph installed on the executors, so we can't get a histogram of memory usage.00:11
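For context, the histogram being referred to is the kind of output objgraph can produce once installed; a minimal sketch (how zuul actually wires this into its signal handler may differ):

```python
# Minimal sketch: dump a histogram of the most common live object types in
# the current process. This is roughly the data the executors produce later
# in this log once objgraph is installed; the zuul integration may differ.
import objgraph

objgraph.show_most_common_types(limit=25)   # top 25 types by instance count
objgraph.show_growth(limit=10)              # delta since the previous call
```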
corvusmeanwhile, the current executor behavior is highly anomalous for the last 90 days.  this is the first time we've ever had more jobs queued than running.00:11
corvuswe're running a mere 331 jobs right now00:12
*** jamesmcarthur has quit IRC00:12
clarkbhuh00:12
pabelangeragree, have all the executors been rebooted recently? I haven't checked to see if maybe we have a new hwe kernel00:12
corvusinfra-root: so i'd say that whatever this problem is, it's critical at this point.00:12
clarkbpabelanger: I don't think they have, ~111 days00:12
*** _alastor_ has joined #openstack-infra00:13
clarkbcorvus: do we want to rotate executors out and see if a reboot resets swap and returns them to happiness?00:13
clarkb(then monitor it for any signs of swap returning)00:13
clarkbmostly thinking that without instrumentation we likely aren't going to debug the swappage now, but restarting may give us breathing room00:14
corvusclarkb: yeah, i think it's likely to buy us a few days at least, assuming this behavior is consistent with the way it's been for the past week00:14
corvusyeah00:14
pabelanger+100:14
corvusso let's pip3 install objgraph on all of them and do that00:14
pabelanger++00:14
clarkblet me know how I can help00:14
corvusthough i'm just inclined to go ahead and hard-restart them all, rather than rotate out00:14
corvusmostly because it's eod00:15
clarkbalso only using 30% capacity anyway (so the shock is low)00:15
corvusokay, objgraph is installed everywhere00:15
corvusi'll stop all executors now00:16
clarkbstatus page reflects that executors are stopping00:17
*** gyee has quit IRC00:17
*** _alastor_ has quit IRC00:17
clarkbze01 appears to have stopped its executor00:24
corvuscuriously 7 and 9 seem to be the slowest to stop00:24
corvusall stopped now; i will reboot them all00:25
pabelangerack00:25
corvusthey're starting00:26
corvusall running except 8,9,1000:27
*** wolverineav has quit IRC00:27
corvusall running now; i guess those were just slow to boot00:27
corvusgah, i should have deleted the old builds00:28
clarkbwas that not fixed?00:28
corvusclarkb: not merged: https://review.openstack.org/62069700:28
clarkbfwiw ze01 looks sane so far. Memory usage seems to be roughly proportional to the number of processes running00:29
clarkbah00:29
johnsomI am guessing you all are talking about the jobs that have been sitting for over an hour, started, but no progress/stream?00:29
*** wolverineav has joined #openstack-infra00:30
clarkbjohnsom: they aren't quite started yet. They go to the empty box on the status page as soon as they have a node assigned aiui, then you have to wait for an executor to pick up that node and start the job. But yes00:30
johnsomYep, cool. Just checking that it's a known issue.00:31
corvusi'll manually delete some build dirs (lots of old stuff sitting around will cause du to waste cycles)00:31
clarkbsnmp hasn't quite caught up on all the nodes according to cacti but spot checking by hand it looks like things haven't immediately returned to the former state00:32
clarkbcorvus: another thing I notice is that ansible 2.5 had a release at the end of october that we may have picked up? there have been a couple since then too (which we have been using on more recent restarts)00:34
corvusclarkb: yeah, i wonder if something about that could affect it.  we don't import it or anything, but it could be using more memory and driving us into swap in general.  or outputting more data that we capture or something.00:35
clarkbya. The change log https://github.com/ansible/ansible/blob/stable-2.5/changelogs/CHANGELOG-v2.5.rst looks sane though00:35
clarkbwe are now running more jobs than are queued00:37
*** jaosorior has quit IRC00:37
corvus#status log rebooted all zuul executors (ze01-ze11) due to suspected performance degradation due to swap.  underlying cause is unclear, but may be due to a regression in zuul introduced since 3.3.0, or in dependencies (including ansible).  objgraph installed on all executors to support future memory profiling.00:40
openstackstatuscorvus: finished logging00:40
corvusclarkb, pabelanger, tobiash: i'm not 100% sure i want to put ze12 into production at this point.  we may have been wrong about our supposition for the increased queued jobs.00:42
pabelangersure, makes sense00:42
clarkbya if slowness was caused by memory issues we may not need it00:43
clarkbcorvus: fwiw I did approve the change to puppet ze1200:43
clarkbdo we need to -W it?00:43
corvusespecially based on the sort of exponential regression we were seeing00:43
pabelangeryah, we should in next 5mins, about to merge00:43
corvusi'll -w it00:43
pabelangerI'll look at sf.io tomorrow to see if we are also seeing an increase of swap00:44
clarkbcorvus: on ze01 and ze03 there are a few megabytes of swap being used, none of it from the two zuul-executor processes00:44
*** yamamoto has joined #openstack-infra00:45
corvusclarkb: yeah, looking at the list it seems pretty reasonable -- kernel just moving idle stuff out of the way00:46
corvusalso, we don't need to run apache on those servers00:46
clarkb++00:46
corvusi've deleted old build dirs from the 3 largest offenders, so the servers should be generally in-line now.  there may be a few stragglers, but no big deal00:47
corvusclarkb, pabelanger: it's possible that ansible is using more memory and the only thing to do about it is just to add more executors after all00:50
corvusi kinda don't want to jump to that conclusion though00:50
pabelangerYah, I can also look at open issues in ansible/ansible tomorrow, see if anybody has reported anything00:50
clarkbcorvus: looks like there are ~200 playbooks running on ze01 but only ~60 jobs?00:51
pabelangerlike clarkb said, there have been a few releases of 2.5 recently00:51
clarkbI guess that could be ansible forking00:51
clarkbah yup it appears there are multiple ansible playbook processes running concurrently per build00:51
*** Swami has quit IRC00:52
corvusqueued jobs: 000:52
pabelangeryay00:52
clarkbcorvus: if that is the case we'd still want to have the governors throttle such that they don't swap00:53
clarkbthough the swap was from the zuul-executor process so I dunno00:54
corvusclarkb: the zuul-executor process on ze01 is the same virt size it was before the reboot00:54
corvus 1882 zuul      20   0 5534272 175336  10224 S  45.5  2.1   9:10.47 zuul-executor00:54
corvusless resident00:54
*** Belgar81 has joined #openstack-infra00:55
clarkband about 10MB into swap now00:55
clarkb(far less than before)00:55
corvusbut given that we're running all out now, and we've achieved the same virtual size as before, and pretty close to the same resident size (what was it, like 300000 or 400000 before?)  i'm not sure the executor is going to turn out to be the smoking gun00:56
corvusi'm going to eod now00:58
clarkbya I need to pop out myself.00:59
clarkbianw and/or fungi if amorin wanders by later today/tomorrow morning (relative to me) maybe you can point out https://etherpad.openstack.org/p/bhs1-test-node-slowness I've tried to capture what we/I know there00:59
clarkbcorvus: thinking out loud here it might be good to instrument things like ansible as used by zuul so that we'll know if/when there are regressions in performance or resource usage01:00
clarkbthat feedback might also be useful to ansible itself01:01
*** sthussey has quit IRC01:17
*** yamamoto has quit IRC01:18
*** yamamoto has joined #openstack-infra01:18
*** tosky has quit IRC01:19
*** rkukura has quit IRC01:20
*** harlowja has quit IRC01:27
*** markvoelker has quit IRC01:33
rm_workhey, how complex is the process of getting a cloud added to nodepool? including technical and legal/political/whatever -- I assume there's all sorts of things that need to be signed?01:35
*** betherly has joined #openstack-infra01:40
*** betherly has quit IRC01:44
*** david-lyle has joined #openstack-infra01:48
*** manjeets_ has joined #openstack-infra01:49
clarkbrm_work: https://docs.openstack.org/infra/system-config/contribute-cloud.html is the doc we have for it. It tends to be more informal and we try to do our best to accommodate the needs of the provider01:49
rm_workcool cool01:50
clarkbcorvus: I have a really derpy script at ze01:~clarkb/swap.sh that looks for playbooks using more than 60MB ish of swap. It seems that "larger" jobs tend to be in that club, things like grenade and tripleo and lbaas jobs01:50
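The script itself isn't shown in the log; a rough equivalent of the idea (flag ansible-playbook processes with more than ~60MB in swap by walking /proc) might look like this — the threshold and the name match are assumptions:

```python
#!/usr/bin/env python3
# Rough sketch of the idea behind the swap.sh mentioned above: list
# ansible-playbook processes that have pushed more than ~60MB into swap.
# The 60MB threshold and the process-name match are assumptions.
import glob

THRESHOLD_KB = 60 * 1024

for status_path in glob.glob("/proc/[0-9]*/status"):
    try:
        with open(status_path) as f:
            fields = dict(line.split(":", 1) for line in f if ":" in line)
    except OSError:
        continue  # the process exited while we were reading
    # /proc comm names are truncated to 15 chars, so match on the prefix
    name = fields.get("Name", "").strip()
    swap_kb = int(fields.get("VmSwap", "0 kB").split()[0])
    if name.startswith("ansible") and swap_kb > THRESHOLD_KB:
        pid = status_path.split("/")[2]
        print(pid, name, swap_kb, "kB in swap")
```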
*** dklyle has quit IRC01:51
*** manjeets has quit IRC01:51
clarkbalso it seems that swap usage may have stabilized a bit. And now really calling it a day01:51
*** _alastor_ has joined #openstack-infra02:13
*** mrsoul has quit IRC02:15
*** _alastor_ has quit IRC02:18
*** hongbin has joined #openstack-infra02:35
*** wolverineav has quit IRC02:40
*** bhavikdbavishi has joined #openstack-infra02:41
*** wolverineav has joined #openstack-infra02:41
*** ykarel has joined #openstack-infra02:41
*** bhavikdbavishi1 has joined #openstack-infra02:44
*** yamamoto has quit IRC02:45
*** bhavikdbavishi has quit IRC02:45
*** bhavikdbavishi1 is now known as bhavikdbavishi02:45
*** wolverineav has quit IRC02:46
*** betherly has joined #openstack-infra02:51
*** imacdonn has quit IRC02:53
*** imacdonn has joined #openstack-infra02:53
*** betherly has quit IRC02:55
*** rlandy|bbl is now known as rlandy03:09
*** rlandy has quit IRC03:10
openstackgerritPaul Belanger proposed openstack-infra/nodepool master: Include host_id for openstack provider  https://review.openstack.org/62310703:12
*** wolverineav has joined #openstack-infra03:21
*** yamamoto has joined #openstack-infra03:24
*** wolverineav has quit IRC03:26
*** wolverineav has joined #openstack-infra03:30
*** psachin has joined #openstack-infra03:32
*** wolverineav has quit IRC03:34
*** yamamoto has quit IRC03:34
*** ramishra has quit IRC03:36
*** wolverineav has joined #openstack-infra03:46
*** hwoarang has quit IRC03:47
*** hwoarang has joined #openstack-infra03:50
*** wolverineav has quit IRC04:02
*** wolverineav has joined #openstack-infra04:03
*** diablo_rojo has quit IRC04:06
*** wolverineav has quit IRC04:07
*** yamamoto has joined #openstack-infra04:29
*** betherly has joined #openstack-infra04:32
*** hongbin has quit IRC04:33
*** janki has joined #openstack-infra04:34
*** ramishra has joined #openstack-infra04:35
*** betherly has quit IRC04:37
*** rh-jelabarre has quit IRC04:41
*** ykarel is now known as ykarel|afk04:50
*** lpetrut has joined #openstack-infra04:52
*** ykarel|afk has quit IRC04:54
*** yboaron has joined #openstack-infra05:02
*** apetrich has quit IRC05:07
*** ykarel|afk has joined #openstack-infra05:10
*** ykarel|afk is now known as ykarel05:10
openstackgerritIan Wienand proposed openstack/diskimage-builder master: [wip] rhel8 beta support  https://review.openstack.org/62313705:13
*** ahosam has joined #openstack-infra05:32
*** lpetrut has quit IRC05:36
*** wolverineav has joined #openstack-infra05:43
*** apetrich has joined #openstack-infra05:48
tonybtobias-urdin: I sent a list of repos to openstack-discuss, can you verify them and then I'll get them taken care of05:49
*** wolverineav has quit IRC05:50
*** _alastor_ has joined #openstack-infra06:15
*** _alastor_ has quit IRC06:19
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: web: update status page layout based on screen size  https://review.openstack.org/62201006:43
*** ahosam has quit IRC07:00
*** wolverineav has joined #openstack-infra07:05
*** wolverineav has quit IRC07:10
*** bhavikdbavishi has quit IRC07:13
*** jtomasek has joined #openstack-infra07:22
*** ahosam has joined #openstack-infra07:24
*** yboaron has quit IRC07:26
*** yboaron has joined #openstack-infra07:26
*** dpawlik has joined #openstack-infra07:28
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats  https://review.openstack.org/61630607:33
*** gfidente has joined #openstack-infra07:35
*** e0ne has joined #openstack-infra07:38
*** ginopc has joined #openstack-infra07:50
*** irdr has quit IRC07:55
*** rcernin has quit IRC07:56
amorinhey all07:57
*** pcaruana has joined #openstack-infra07:58
*** pcaruana is now known as muttley07:58
*** shardy has joined #openstack-infra08:02
*** rcernin has joined #openstack-infra08:03
*** shardy has quit IRC08:05
amorinso as far as I can read, the results are a little bit better since I moved the disk sched to deadline, but this is still not perfect.08:05
*** lpetrut has joined #openstack-infra08:06
amorinon my side, I am investigating two things: enabling the VMX flag back on the cpu (for nested virt)08:06
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: web: refactor jobs page to use a reducer  https://review.openstack.org/62139608:06
*** slaweq has joined #openstack-infra08:06
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: web: refactor job page to use a reducer  https://review.openstack.org/62315608:06
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: web: refactor tenants page to use a reducer  https://review.openstack.org/62315708:06
amorinand also completely removing iotune on osf flavors08:06
amorincc fungi clarkb mordred dmsimard08:07
*** slaweq has quit IRC08:15
*** florianf|afk is now known as florianf08:17
*** ykarel is now known as ykarel|lunch08:18
frickleramorin: the comment regarding kvm caching iiuc would amount to setting "libvirt:disk_cachemodes=writeback" in nova.conf08:18
frickleramorin: but I'd defer that to a second step08:18
frickleramorin: looking at the last 6h, I still see about 50% of the job timeouts on bhs1, which sadly doesn't look like much progress08:20
*** ralonsoh has joined #openstack-infra08:21
amorinfrickler: ok08:26
*** shardy has joined #openstack-infra08:28
tobias-urdintonyb: will check it out asap, sorry I missed that yesterday, was out on adventures08:32
*** rcernin has quit IRC08:33
tonybtobias-urdin: All good.  I hope they were good adventures :)08:34
tobias-urdintonyb: answered on ML, but that list is correct and complete, thanks tonyb!08:42
*** AJaeger has quit IRC08:49
*** AJaeger has joined #openstack-infra08:51
*** ahosam has quit IRC08:54
*** bhavikdbavishi has joined #openstack-infra08:55
*** bkero has quit IRC08:55
*** jpich has joined #openstack-infra08:56
*** lpetrut has quit IRC08:57
*** tosky has joined #openstack-infra09:00
*** markvoelker has joined #openstack-infra09:01
*** ahosam has joined #openstack-infra09:01
*** kjackal has joined #openstack-infra09:10
*** gfidente has quit IRC09:14
*** ramishra has quit IRC09:19
*** ramishra has joined #openstack-infra09:21
*** ahosam has quit IRC09:26
*** ahosam has joined #openstack-infra09:26
*** ykarel|lunch is now known as ykarel09:27
*** witek has joined #openstack-infra09:28
*** dtantsur|afk is now known as dtantsur09:29
*** markvoelker has quit IRC09:34
*** derekh has joined #openstack-infra09:44
*** sshnaidm|afk has quit IRC09:45
*** sshnaidm|afk has joined #openstack-infra09:46
*** bhavikdbavishi has quit IRC09:49
*** electrofelix has joined #openstack-infra10:04
dulekHey, is it possible to sync job lists between two repos in Zuul v3?10:05
dulekIt's a bit hard to keep tempest plugin repo job list in sync with the main repo manually.10:06
*** sshnaidm|afk is now known as sshnaidm10:12
*** ahosam has quit IRC10:15
*** e0ne has quit IRC10:24
*** gfidente has joined #openstack-infra10:26
*** yboaron_ has joined #openstack-infra10:26
*** e0ne has joined #openstack-infra10:27
*** yboaron has quit IRC10:29
*** markvoelker has joined #openstack-infra10:31
fricklerdulek: iiuc the usual solution would be to use project-templates, see https://zuul-ci.org/docs/zuul/user/config.html#project-template10:32
*** sshnaidm has quit IRC10:33
*** sshnaidm has joined #openstack-infra10:34
dulekfrickler: Oh my, how early is Christmas this year, this is awesome!10:34
dulekThanks!10:34
fricklerfor an example see http://git.openstack.org/cgit/openstack/designate/tree/.zuul.yaml#n155 vs. http://git.openstack.org/cgit/openstack/designate-tempest-plugin/tree/.zuul.yaml10:35
*** Belgar81 has quit IRC10:54
*** markvoelker has quit IRC11:05
*** yboaron_ has quit IRC11:05
*** yboaron_ has joined #openstack-infra11:08
*** jesusaur has quit IRC11:27
*** jesusaur has joined #openstack-infra11:31
*** yboaron_ has quit IRC11:33
*** bhavikdbavishi has joined #openstack-infra11:48
*** agopi has quit IRC11:52
*** gfidente has quit IRC11:54
stephenfinfungi, clarkb: When you're about, fancy taking a look at these three doc changes for git-review? https://review.openstack.org/#/q/project:openstack-infra/git-review+status:open+owner:%22Stephen+Finucane+%253Cstephenfin%2540redhat.com%253E%2211:58
ssbarnea|roverdoes anyone know the bashate-friendly way of doing something like local foo=$(cmd) ? -- see https://github.com/openstack-dev/bashate/blob/master/bashate/messages.py#L16612:01
openstackgerritSorin Sbarnea proposed openstack-infra/git-review master: Stash and unstash changes during download  https://review.openstack.org/34002412:07
*** psachin has quit IRC12:08
*** sshnaidm is now known as sshnaidm|bbl12:08
*** irdr has joined #openstack-infra12:13
*** _alastor_ has joined #openstack-infra12:18
*** yamamoto has quit IRC12:22
*** _alastor_ has quit IRC12:23
*** ahosam has joined #openstack-infra12:27
*** mriedem has joined #openstack-infra12:27
*** gfidente has joined #openstack-infra12:28
*** yboaron_ has joined #openstack-infra12:30
*** yamamoto has joined #openstack-infra12:31
*** pbourke has quit IRC12:40
*** pbourke has joined #openstack-infra12:40
*** udesale has joined #openstack-infra12:41
openstackgerritDirk Mueller proposed openstack-infra/openstack-zuul-jobs master: use opensuse15 as generic name instead of opensuse150  https://review.openstack.org/61962812:45
*** tpsilva has joined #openstack-infra12:48
fricklerssbarnea|rover: split it into two commands: local foo; foo=$(cmd)12:49
*** hjensas has quit IRC12:49
ssbarnea|roverfresta: thanks. I was considering it but I was not sure if that had the desired behavior.12:49
*** betherly has joined #openstack-infra12:52
*** eharney has quit IRC12:54
openstackgerritMerged openstack/os-performance-tools master: Change openstack-dev to openstack-discuss  https://review.openstack.org/62217312:56
*** rh-jelabarre has joined #openstack-infra12:57
*** rlandy has joined #openstack-infra12:58
*** bobh has quit IRC13:00
*** bobh has joined #openstack-infra13:00
*** betherly has quit IRC13:01
openstackgerritGhanshyam Mann proposed openstack-dev/hacking master: Fix 'ref' format errors in README file  https://review.openstack.org/62320313:05
openstackgerritGhanshyam Mann proposed openstack-dev/hacking master: Fix 'ref' format errors in README file  https://review.openstack.org/62320313:06
*** muttley has quit IRC13:08
*** boden has joined #openstack-infra13:15
*** dave-mccowan has joined #openstack-infra13:16
*** rcernin has joined #openstack-infra13:17
fungi#status log deleted stale /var/log/exim4/paniclog on ns2.opendev.org to silence nightly cron alert e-mails about it13:17
openstackstatusfungi: finished logging13:17
*** dave-mccowan has quit IRC13:21
*** muttley has joined #openstack-infra13:21
*** muttley has quit IRC13:25
*** muttley has joined #openstack-infra13:26
Linkidhi13:29
*** rcernin has quit IRC13:29
Linkidfungi: about peertube, can I add a page here : https://docs.openstack.org/infra/system-config/systems.html ?13:29
*** muttley has quit IRC13:29
Linkid(as WIP)13:29
*** yboaron_ has quit IRC13:29
AJaegerLinkid: a spec is the better next step13:30
AJaegerLinkid: against openstack-infra/infra-specs repo13:30
AJaegerLinkid: once the spec is approved, adding a page is one step13:30
Linkidand corvus talked about using ansible to deploy services, but I only saw puppet classes for services on the system-config repo13:30
fungiLinkid: yes, in whatever change implements the configuration management for the service you would also add a document there explaining its management, but AJaeger is right after the mailing list discussion the next step is likely an infra-spec describing how we will get it bootstrapped13:31
Linkidok, I'll make a spec this Friday or this week-end, then13:31
Linkid(I don't have enough time today)13:32
fungiLinkid: there is a template file in the openstack-infra/infra-specs repo you can fill in with the proposal, see the readme in that repo for instructions13:32
Linkidoh, great :)13:33
fungiand there's no hurry, we operate on the assumption that people are working on these sorts of things in their spare/volunteer time anyway13:33
fungiand feel free to look at other approved specs in that repo for examples, or ask questions in here or on the ml if you need help13:34
*** pcaruana has joined #openstack-infra13:34
Linkidok, I will read some other specs today, I think13:35
*** kota_ has quit IRC13:36
Linkidthanks for your help :)13:36
fungimy pleasure!13:37
*** ahosam has quit IRC13:37
*** ahosam has joined #openstack-infra13:37
fungirereading the readme in the specs repo, i can see that it could use some clarifications too. i'll improve it a bit here shortly13:38
*** kota_ has joined #openstack-infra13:38
*** pcaruana has quit IRC13:39
*** rfolco has quit IRC13:41
*** rfolco has joined #openstack-infra13:41
*** yamamoto has quit IRC13:41
*** pcaruana has joined #openstack-infra13:43
*** pcaruana has quit IRC13:47
*** ahosam has quit IRC13:47
*** kgiusti has joined #openstack-infra13:47
*** ykarel is now known as ykarel|away13:48
*** dpawlik has quit IRC13:50
openstackgerritJeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity  https://review.openstack.org/62321113:51
fungiLinkid: ^ those updates to the readme might be helpful to you13:52
*** bhavikdbavishi has quit IRC13:53
Linkidok, I'll take a look13:55
openstackgerritJeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity  https://review.openstack.org/62321113:56
*** agopi has joined #openstack-infra13:57
openstackgerritJeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity  https://review.openstack.org/62321113:58
openstackgerritJeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity  https://review.openstack.org/62321114:00
fungiokay, i think i'm happy with it now ;)14:00
amorinfungi: AJaeger clarkb I just enabled the cpu VMX flag on BHS1, so now, you should be able to spawn icccccdlucfbribncbuvefvbjlbeeckvvikkcvuhtdgn14:00
amorininstances14:01
*** eharney has joined #openstack-infra14:01
fungithose are some fun looking instances14:01
amorinwith nested virt14:01
amorinyes they are :p14:01
amorinsorry14:01
fungino worries. i fall asleep on my keyboard all the time ;)14:01
fungiand thanks!14:01
*** pgaxatte has joined #openstack-infra14:03
*** pgaxatte has left #openstack-infra14:03
*** pgaxatte has joined #openstack-infra14:04
*** eernst has joined #openstack-infra14:07
*** jcoufal has joined #openstack-infra14:10
openstackgerritJeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity  https://review.openstack.org/62321114:11
fungiokay, now i'm really done with it... i think14:12
*** sthussey has joined #openstack-infra14:13
fricklerfungi: oh, since when do we only need the Task: header? that feature has managed to avoid making itself known to me so far14:17
fungifrickler: ever since https://review.openstack.org/607699 merged on halloween14:20
fungihrm, though i still need to restart the gerrit service on review.o.o to pick that up. it's been running undisturbed since august 314:21
fungithat's probably why it is unknown to you, i haven't announced it working because gerrit hasn't been restarted14:22
fungiperhaps i can do a quick gerrit restart late today when things (hopefully) quiet down14:22
*** jaosorior has joined #openstack-infra14:23
mordredfungi: I really need to start working on the gerrit upgrade again14:23
mordredfungi: too many plates spinnin14:23
mordredfungi: it looks like the project rename plugin is actually real now though https://gerrit.googlesource.com/plugins/rename-project/+/master/src/main/resources/Documentation/about.md14:24
*** yamamoto has joined #openstack-infra14:24
fungiyay! that'll be swell. once we have a new enough gerrit to use it14:25
mordredyah. I'll be glad when that's no longer a downtime event14:25
fricklerfungi: hmm, IIUC that patch only creates the hyperlink to the task, will updates to the story still get posted on storyboard when a patch contains only the task reference?14:26
fungifrickler: yes, in fact only the task footer causes story updates to happen14:27
fungiafter digging into the current implementation of the its-storyboard plugin for gerrit, it does nothing at all with story ids, and only acts on task ids14:27
fungiwe had wanted it to leave story comments when the story footer was included in a change even without a task footer, but it seems that was never actually implemented14:28
*** eernst has quit IRC14:28
*** ykarel|away has quit IRC14:29
fungiso if someone with a good grasp of java wants to work on the its-storyboard plugin for gerrit, that would be a good next feature14:29
fricklerfungi: o.k., going via the task seems to work just as well, so fine for me14:29
*** psachin has joined #openstack-infra14:31
*** janki has quit IRC14:33
*** yamamoto has quit IRC14:37
mordredinfra-root: keystoneauth1 3.11.2 has been released, which has the fix for rackspace discovery in it14:38
mordredit should be safe to unpin nodepool and to use latest sdk for launch-node now14:39
mordredbut I'm on a plane, so I'm not going to do any of those things right now14:39
fungialso i notice we still have some significant gaps in executor availability so we may want to proceed with the ze12 addition14:40
*** rkukura has joined #openstack-infra14:44
*** Swami has joined #openstack-infra14:44
*** sshnaidm|bbl is now known as sshnaidm14:51
*** _alastor_ has joined #openstack-infra14:53
*** janki has joined #openstack-infra14:54
*** Swami has quit IRC14:55
*** Swami has joined #openstack-infra14:56
*** _alastor_ has quit IRC14:57
fungioh joy, now spammers seem to be mistyping their spoofed domain and i'm getting messages into the openstack-discuss moderation queue for random addresses @q.com instead of @qq.com14:59
fungion the order of one every few seconds14:59
*** ramishra has quit IRC15:00
fungiupdated the nonmember discard filter to ^[0-9]+@q+\.com$ now15:00
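A quick sanity check of that filter, for reference: the q+ is what lets a single pattern cover both the mistyped q.com senders and the original qq.com ones:

```python
# Quick check of the updated discard-filter regex: "q+" matches one or more
# "q", so both the mistyped q.com addresses and the original qq.com spam hit it.
import re

pattern = re.compile(r"^[0-9]+@q+\.com$")
for addr in ("12345@q.com", "12345@qq.com", "12345@example.com"):
    print(addr, bool(pattern.match(addr)))
# -> True, True, False
```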
fungiand my renovation contractors are making me high on spray-foam insulation fumes15:03
fungii should open a window but it's windy and close to freezing outside right now15:03
amorinwhere are you from?15:04
fungia barrier island in the atlantic, off the coast of north carolina (usa)15:05
fungiwe're ~16km from shore15:06
amorinnice, windy situation I guess15:06
openstackgerritMerged openstack-dev/hacking master: Fix 'ref' format errors in README file  https://review.openstack.org/62320315:07
*** alexchadin has quit IRC15:07
*** ykarel|away has joined #openstack-infra15:08
fungiyeah, the water is no more than 30 meters from my house, at the end of my yard, so very windy15:08
funginothing to slow it down15:08
*** ykarel|away is now known as ykarel15:08
fungii ended up opening a window anyway because i figure i'm far less likely to pass out from hypothermia than hypoxia (and i can always at least put on a jacket)15:09
*** dpawlik has joined #openstack-infra15:10
*** bobh has quit IRC15:12
*** jcoufal_ has joined #openstack-infra15:14
*** jcoufal_ has quit IRC15:15
*** dpawlik has quit IRC15:15
*** bobh has joined #openstack-infra15:16
*** jcoufal_ has joined #openstack-infra15:16
*** dpawlik has joined #openstack-infra15:17
openstackgerritChris Dent proposed openstack-infra/openstack-zuul-jobs master: Make lower-constraints job use python 3.6  https://review.openstack.org/62322915:17
*** dpawlik has quit IRC15:17
*** jcoufal has quit IRC15:17
fungiwe nearly caught up on node requests around 1300z but i guess then north america woke up15:17
*** dpawlik has joined #openstack-infra15:18
*** slaweq has joined #openstack-infra15:23
*** slaweq has quit IRC15:29
*** bobh has quit IRC15:31
mordredstupid north america15:32
*** adam_zhang has joined #openstack-infra15:33
mordredfungi: you should also ventilate for a while once they're done with that spray foam - it offgasses for a while, is my understanding15:33
*** jhesketh has quit IRC15:34
*** adam_zhang has quit IRC15:35
*** jhesketh has joined #openstack-infra15:35
fungiyeah15:35
fungiluckily this house is fairly leaky already (part of why we're renovating the downstairs entry instead of just repairing it)15:36
fungijust an unfortunate time of year to need to leave windows open15:36
*** bobh has joined #openstack-infra15:37
mordred++15:40
*** bobh has quit IRC15:42
*** jamesmcarthur has joined #openstack-infra15:43
jrossercould i get some more eyes on this when folks have a moment? https://review.openstack.org/#/c/622169/15:45
corvusfungi: good morning; i'll start looking at data in a bit15:46
fricklerjrosser: done15:47
jrosserfrickler: thanks :)15:47
mriedemclarkb: also for you https://bugs.launchpad.net/nova/+bug/180721915:51
openstackLaunchpad bug 1807219 in OpenStack Compute (nova) "SchedulerReporClient init slows down nova-api startup" [Medium,Triaged]15:51
mriedemworking a patch now15:51
*** zul has joined #openstack-infra15:52
*** dpawlik has quit IRC15:55
*** bobh has joined #openstack-infra15:55
*** ramishra has joined #openstack-infra15:59
*** bobh has quit IRC16:00
corvusfungi, clarkb: i'm going to sigusr2 ze01 to get an objgraph list16:00
*** _alastor_ has joined #openstack-infra16:00
fungioh! right, it was so late for me i didn't commit to memory that we'd added it to all the executors16:01
*** jcoufal_ has quit IRC16:01
*** janki is now known as janki|dinner16:02
corvushere are the object counts: http://paste.openstack.org/show/736764/16:04
openstackgerritMerged openstack-infra/project-config master: Add centos/suse to OSA grafana dashboard  https://review.openstack.org/62216916:04
corvusour first class on that list is Repo.  and 1700 repos sounds about right.16:05
fungiyep16:08
corvusi agree with clarkb; i wish we had historical values for "how much memory ansible is using".  also, for that matter, "how much memory the executor process is using"16:09
corvuscause at this point, all we suspect is something changed and we don't even know which piece of software.16:09
fungiit does at least look like we're not swapping as hard since the restart16:10
*** takamatsu has quit IRC16:10
corvusfungi: yeah, we seem to be around 2g, which is a value we've encountered before in our history without too much issue.16:11
*** bobh has joined #openstack-infra16:17
*** jcoufal has joined #openstack-infra16:20
*** bobh has quit IRC16:22
clarkbcorvus what I found interesting is ansible memory/swap use seems to correlate to the job playbooks16:25
clarkbgrenade and tripleo were showing up abunch16:25
mordredyou know ...16:26
mordredin the callback plugins, we actually have the entire log data in memory for the entire job in RAM16:26
mordredat least in the json one16:26
openstackgerritFrank Kloeker proposed openstack-infra/irc-meetings master: Change meeting time and format for Docs & I18n team  https://review.openstack.org/62324216:27
mordredwhich we can improve by switching that to yaml and using multiple documents which we can just append to without reading the old data, like we discussed in berlin16:27
mordredso - grenade and tripleo being long/complex and potentially verbose could be causing the json callback plugin to eat ram16:27
mordredthat said - we could even improve the json plugin by only reading the old data into memory right before doing the append and write out - so that it's not holding the RAM for the whole playbook invocation16:28
*** e0ne has quit IRC16:30
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Read old json data right before writing new data  https://review.openstack.org/62324516:30
mordredclarkb, corvus: ^^ like that16:30
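A minimal illustration of the pattern being proposed — not the actual zuul callback plugin, just a sketch of deferring the read of the previously written JSON until the final stats callback so it isn't held in memory for the whole playbook run (the output path and record shape here are invented):

```python
# Illustrative Ansible callback plugin for the "read the old JSON right before
# writing the new data" approach discussed above. Not the real zuul plugin;
# the output path and record structure are made up for the example.
import json
import os

from ansible.plugins.callback import CallbackBase

OUTPUT = os.path.expanduser("~/playbook-results.json")


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = "aggregate"
    CALLBACK_NAME = "append_json_sketch"

    def __init__(self):
        super(CallbackModule, self).__init__()
        # Only the current playbook's results are kept in memory...
        self.results = []

    def v2_runner_on_ok(self, result):
        self.results.append({
            "host": result._host.get_name(),
            "task": result._task.get_name(),
        })

    def v2_playbook_on_stats(self, stats):
        # ...and the data written by earlier playbooks is only read here, at
        # the very end, instead of being held for the whole invocation.
        existing = []
        if os.path.exists(OUTPUT):
            with open(OUTPUT) as f:
                existing = json.load(f)
        existing.append(self.results)
        with open(OUTPUT, "w") as f:
            json.dump(existing, f)
```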
*** ykarel is now known as ykarel|away16:33
clarkbah ya if that's held for say 2 hours by grenade or tripleo I could see how those would be good swap candidates16:37
mriedemclarkb: btw, this takes 26 seconds on nova-api startup:16:37
mriedemDec 05 20:13:23.060958 ubuntu-xenial-ovh-bhs1-0000959981 devstack@n-api.service[23459]: running "unix_signal:15 gracefully_kill_them_all" (master-start)...16:37
mriedemhttp://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/logs/screen-n-api.txt.gz#_Dec_05_20_13_23_06095816:37
mriedemthen we spend about 27 seconds loading up API extensions16:38
*** bobh has joined #openstack-infra16:38
clarkbmriedem: ya is it waiting on child pids to go away? that looked like uwsgi not nova?16:38
mriedemwe're looking into the latter but i don't know if we can control the former16:38
*** rossella_s has quit IRC16:38
mriedemyeah i'm not sure what's doing that16:40
*** psachin has quit IRC16:41
mriedemhttp://git.openstack.org/cgit/openstack-dev/devstack/tree/lib/apache#n27216:41
mriedemdevstack sets the hook16:41
*** bobh has quit IRC16:42
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Add appending yaml log plugin  https://review.openstack.org/62325616:47
mordredand there's the yaml version16:47
*** pgaxatte has quit IRC16:49
clarkbmordred: left a small comment on the json change but +216:49
mordredclarkb: yeah - totally. that would be better for sure16:53
*** bobh has joined #openstack-infra16:56
clarkbmriedem: reading uwsgi, that sets up a hook so that on SIGTERM (signal 15) it calls the kill-them-all-gracefully function16:59
clarkbmriedem: I wonder if that is actually slow or if uwsgi is just not logging what it is doing in the interim well16:59
*** bobh has quit IRC17:01
mriedemclarkb: same, i suspect it's doing other things but not logging it17:01
mriedemi'll enable debug logging and see if that shows anything17:07
mriedem26 https://review.openstack.org/62326517:12
clarkbmriedem: like a season of 24 but two episodes longer17:13
*** Swami has quit IRC17:14
*** janki|dinner has quit IRC17:16
*** bobh has joined #openstack-infra17:16
*** jamesmcarthur has quit IRC17:17
mriedemheh the 26 was typing in the wrong window17:17
mriedemi can't watch 24, the ads alone with kiefer constantly yelling is just too much17:18
mriedem"the pop tarts are done omfg!!!"17:18
clarkbmnaser: followup on centos images. All regions but inap-mtl1 should have centos 7.6 now17:18
clarkbmnaser: waiting on inap upload to complete17:18
clarkbmriedem: I couldn't watch it when broadcast but managed to get through the first season on netflix relatively recently17:19
openstackgerritFrank Kloeker proposed openstack-infra/irc-meetings master: Change meeting time and format for Docs & I18n team  https://review.openstack.org/62324217:19
*** bobh has quit IRC17:21
*** jamesmcarthur has joined #openstack-infra17:22
clarkbfungi: heh ns2 now emails about packages on hold?17:25
clarkbhttps://packages.ubuntu.com/bionic/netplan.io17:25
fungiclarkb: that's yet another package which can't be upgraded because it will bring in a new dependency17:31
*** kjackal has quit IRC17:32
*** jtomasek has quit IRC17:32
corvusfungi, clarkb: i think we should throw ze12 at the problem.17:32
clarkbcorvus: ok, just a matter of merging the change to puppet it right?17:32
corvusclarkb: yeah, i'll re-approve17:33
*** kjackal has joined #openstack-infra17:33
fungiwfm17:34
*** bobh has joined #openstack-infra17:34
openstackgerritJeremy Stanley proposed openstack-infra/infra-specs master: Overhaul instructions in README.rst for clarity  https://review.openstack.org/62321117:36
clarkbnotmyname: http://logs.openstack.org/31/592231/6/gate/swift-probetests-centos-7/7bde795/job-output.txt.gz#_2018-12-06_17_32_48_836444 just failed in the gate. I took a quick look to see if it was for any of the known problems associated with the centos 7.6 release and unless that is a new race caused by new python or libs I don't think it is. (just an fyi that it appears to be an actual bug and not17:36
clarkbcentos 7.6 causing problems)17:36
*** rascasoft has quit IRC17:38
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool master: Fix race in test_handler_poll_session_expired  https://review.openstack.org/62326917:39
openstackgerritChris Dent proposed openstack-infra/openstack-zuul-jobs master: Make lower-constraints job use python 3.6  https://review.openstack.org/62322917:39
notmynameclarkb: thanks17:40
clarkbssbarnea|rover: new theory on the file:// lookup issue with delorean. Is it possible that delorean is looking for that file within its chroot but it exists on the surrounding fs?17:42
fungiclarkb: followup on the stackalytics-bot-2 ip6tables block rule. it looks like the bot eventually switched to ipv4 anyway, so probably safe to say it's not what was causing the gerrit slowdowns a couple weeks back and there's no point in continuing to leave that block rule in place17:48
*** jpich has quit IRC17:48
*** florianf has quit IRC17:49
*** eharney has quit IRC17:52
*** gyee has joined #openstack-infra17:55
fungiso based on graphs it looks like we're topping out around 70 concurrent builds per executor? i guess if with the addition of ze12 we see we get up around 850 concurrent builds for extended periods that suggests we need another17:55
fungithe hysteresis kicking in on the executor queue graph around the time we stop accepting more builds is interesting and fairly pronounced17:57
fungior perhaps each of the queued jobs spikes there reflects a major gate queue reset17:58
openstackgerritMerged openstack-infra/system-config master: Add ze12.openstack.org  https://review.openstack.org/62306717:59
*** derekh has quit IRC17:59
pabelangerfungi: yah, gate resets are amplifying the backlog for sure17:59
fungiyeah, i guess there are corresponding spikes on the node requests graph so that seems likely17:59
*** bdodd has quit IRC17:59
fungithough we do seem to be managing ~0.2kjph higher than yesterday already18:00
*** dtantsur is now known as dtantsur|afk18:01
*** udesale has quit IRC18:04
*** Swami has joined #openstack-infra18:05
*** gfidente is now known as gfidente|afk18:06
*** ykarel|away has quit IRC18:07
*** e0ne has joined #openstack-infra18:08
*** jamesmcarthur has quit IRC18:08
*** e0ne has quit IRC18:09
*** bobh has quit IRC18:09
*** bobh has joined #openstack-infra18:17
openstackgerritMerged openstack-infra/zuul master: web: break the reducers module into logical units  https://review.openstack.org/62138518:20
clarkbssbarnea|rover: I'll move the conversation back here since I'm no longer thinking this is likely a zuul bug. http://logs.openstack.org/25/620625/2/gate/tripleo-ci-centos-7-standalone/70949b6/logs/ara_oooq/reports/730db7e5-5c8a-4aec-a2a4-836c4367225a.html That ansible run crashes, this seems to crash the console log streaming which is then noticed when pre is started18:20
clarkbssbarnea|rover: it seems to crash when executing tempest and even the tempest log seems truncated: http://logs.openstack.org/25/620625/2/gate/tripleo-ci-centos-7-standalone/70949b6/logs/undercloud/home/zuul/tempest.log.txt.gz notice that it has concurrency = 4 but only workers 0 and 2 record tests (we should at least have {1} as well)18:21
*** bobh has quit IRC18:22
*** electrofelix has quit IRC18:26
*** wolverineav has joined #openstack-infra18:29
*** wolverineav has quit IRC18:29
*** wolverineav has joined #openstack-infra18:29
*** dave-mccowan has joined #openstack-infra18:29
clarkbssbarnea|rover: ok I figured it out http://logs.openstack.org/25/620625/2/gate/tripleo-ci-centos-7-standalone/70949b6/logs/undercloud/var/log/journal.txt.gz#_Dec_06_16_08_49 -- Reboot --18:29
clarkbssbarnea|rover: ^ so that's the issue after all18:29
clarkbmordred: ^ fyi18:30
*** slaweq has joined #openstack-infra18:30
fungihah, rebooting a node mid-job. we knew that would prematurely terminate the console stream at least, right?18:31
clarkbfungi: yes18:32
mordredyeah. I really do need to raise the priority of reworking streaming18:32
mordredin the mean time - they can add a zuul_console line to their playbook after the reboot18:32
mordredit's just an ansible module - nothing stopping it from being restarted18:32
clarkbmordred: while I agree, I also don't think that running tempest should induce a reboot18:34
*** yamamoto has joined #openstack-infra18:34
clarkbso I think there is a bigger tripleo bug here18:34
clarkb(or maybe I am misunderstanding the logs around whati s going on at that time)18:35
openstackgerritMerged openstack-infra/irc-meetings master: Change meeting time and format for Docs & I18n team  https://review.openstack.org/62324218:35
*** diablo_rojo has joined #openstack-infra18:35
*** slaweq has quit IRC18:36
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Import install-docker role  https://review.openstack.org/60558518:37
*** wolverineav has quit IRC18:37
mordredclarkb: I saw something about enabling something - perhaps a new kernel is happening?18:37
*** yamamoto has quit IRC18:38
openstackgerritMerged openstack-infra/irc-meetings master: Change Senlin meeting to different biweekly times  https://review.openstack.org/62303118:39
*** jcoufal has quit IRC18:40
*** jcoufal has joined #openstack-infra18:40
clarkbmordred: is zuul_console a task that you run or a role?18:41
clarkbupdating the bug for this now and hoping to give that ^ info18:41
mordredclarkb: task. one sec ...18:41
*** eharney has joined #openstack-infra18:42
*** bdodd has joined #openstack-infra18:42
mordredclarkb: sorry, role: http://git.openstack.org/cgit/openstack-infra/zuul-jobs/tree/roles/start-zuul-console18:42
clarkbmordred: thanks18:42
logan-clarkb: i began seeing similar behavior about 2-3 days ago in jobs I have that launch nested vms with nested virt enabled. i have temporarily changed the affected jobs to use software virt and they no longer reboot. this is on ubuntu xenial test nodes launching bionic nested vms.18:44
clarkblogan-: ah good to know. Nested virt hits again. I'll leave that note too18:45
logan-pretty concerning to see it is happening on centos guests too18:45
mordredclarkb: that said - it's a one-task-role - so if it's more convenient to run it as a task, that's fine too18:45
*** wolverineav has joined #openstack-infra18:45
logan-nothing has changed on the hosts, but maybe I should look to see if there is a newer kernel we can try.18:45
clarkblogan-: well centos just updated its kernels with 7.6 I'm sure18:46
clarkblogan-: could be in the guest side of things18:46
fungi#status log unblocked stackalytics-bot-2 access to review.o.o since the performance problems observed leading up to addition of the rule on 2018-11-23 seem to be unrelated (it eventually fell back to connecting via ipv4 and no recurrence was reported)18:46
openstackstatusfungi: finished logging18:47
*** ramishra has quit IRC18:47
logan-yeah it seems like these breakages are usually guest induced by updated nodepool images, and then we usually get it back on track by updating the hosts. when I looked the other day there were no kernel updates available for the hosts :/18:47
logan-there is a kernel update available now. I'm taking a host out of the aggregate to update and test. will let you know how it goes.18:52
*** slaweq has joined #openstack-infra18:52
clarkblogan-: sounds good. It will be really interesting to see in a year or so (when the current 4.19 kernel shows up in places) if the intel nested virt enabled by default there actually pans out as being much more reliable18:53
logan-no kidding. I'm running xenial hwe on these hosts and it is pretty sad that it still breaks a few times per quarter while still being more reliable than the regular xenial kernel. :(18:54
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool master: Fix race in test_handler_poll_session_expired  https://review.openstack.org/62326918:56
*** slaweq has quit IRC18:57
*** shardy has quit IRC18:59
clarkbmwhahaha: ssbarnea|rover EmilienM to tl;dr it I think the three issues I'm aware of affecting tripleo jobs that are not "slowness" are the ntp errors, delorean not finding the /home/zuul/.*/repomd.xml file, delorean using pypi.python.org directly and having errors, and the nested virt possibly crashing VM and rebooting it (which may be the cause of the repomd.xml thing as a side effect?)19:01
clarkbI guess that's 3.5 issues19:01
clarkbI'm pretty sure all of these have bugs and I've updated the one I have new info on (reboots) with the data I collected19:02
clarkbThen there is the ovh slowness related stuff. I do still think reducing memory pressure would be worthwhile as an exercise to see if that helps. Especially if there are any easy wins like kernel same page merging19:03
clarkband we'll keep working with ovh on the infra side to characterize and hopefully address underlying issues as well19:04
*** udesale has joined #openstack-infra19:11
clarkbhrm my neighbor is getting a new roof, will need to find the airplane headphones19:14
clarkbfungi: ^ must be worse at your place :)19:14
mwhahahaclarkb: ok we had a fix for the ntp thing but it failed in the gate, maybe we can get that promoted next gate reset19:14
mwhahahawhich seems to have just occurred19:15
* mwhahaha sighs19:15
mwhahahaclarkb: https://review.openstack.org/#/c/621930/ if you can promote that to the top of the gate so we stop getting that one19:15
mwhahahaclarkb: i'm not aware of the repomd.xml one or the crashing vm. Is the crashing VM the standalone job?19:16
clarkbmwhahaha: ya an example is http://logs.openstack.org/25/620625/2/gate/tripleo-ci-centos-7-standalone/70949b6/logs/ara_oooq/19:17
mwhahahaok so originally we used just qemu hard coded and then we moved to try and do the auto direction19:18
clarkbmwhahaha: notice that the multinode-standalone.yaml playbook is incomplete/interrupted. If you then go look at the journal log you'll see that there was a reboot around 16:07 ish19:18
mwhahahayea so it crashes in tempest19:18
clarkbthen later in the job delorean fails because the repomd.xml isn't present (possibly because ansible crashed in a way that really confused things?)19:18
clarkbmwhahaha: ya I think tempest is the trigger there19:18
mwhahahawe can force that to qemu but that is less than ideal19:18
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Run a local MySQL service on StoryBoard servers  https://review.openstack.org/62329019:19
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Switch StoryBoard database backups to local  https://review.openstack.org/62329119:19
clarkbmwhahaha: unfortunately nested virt has never been stable19:19
clarkbit will work then stop then work and its really hard to debug unless you've got logan- or mnaser investigating the hypervisor side too19:19
fungiclarkb: roof rebuild hasn't started yet, we're still getting quotes and arguing about whether we want shingle or steel19:19
clarkbmwhahaha: ntp fix is being promoted now19:19
mwhahahathanks19:19
mwhahahai'll propose a patch for the qemu thing19:20
clarkbfungi: I'm unsure of the comparative advantages in your part of the world but steel is very loud when it rains19:20
clarkbfungi: growing up it would rain hard enough that it would be louder than the speakers hooked to the tv19:21
clarkb(granted we lived in the tropics with minimal insulation to dampen things too)19:21
*** dpawlik has joined #openstack-infra19:22
fungiyeah, lots of insulation here. i know metal roofs are loud, though in theory require less maintenance and last a lot longer for not a lot higher cost19:22
*** dpawlik has quit IRC19:24
mwhahahaclarkb: for the nested virt thing: https://review.openstack.org/#/c/623293/19:26
mwhahahassbarnea|rover, EmilienM fyi -^19:26
clarkbmwhahaha: thanks!19:26
EmilienMmwhahaha: ack19:27
*** ndahiwade has joined #openstack-infra19:27
openstackgerritRonelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null  https://review.openstack.org/62329419:28
*** udesale has quit IRC19:31
openstackgerritMerged openstack-infra/zuul master: web: refactor info and tenant reducers action  https://review.openstack.org/62138619:35
fungithis is awesome: https://github.com/systemd/systemd/issues/1102619:39
*** dpawlik has joined #openstack-infra19:39
*** boden has quit IRC19:42
fungithough now https://gitlab.freedesktop.org/polkit/polkit/issues/74 is arguing it's a systemd bug after all19:42
clarkbyou get root I get root we all get root19:43
fungifinger pointing ftw!19:43
*** sshnaidm is now known as sshnaidm|afk19:43
*** dpawlik has quit IRC19:44
*** bobh has joined #openstack-infra19:47
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool master: Fix race in test_handler_poll_session_expired  https://review.openstack.org/62326919:50
*** bobh has quit IRC19:54
*** jamesmcarthur has joined #openstack-infra19:57
*** ndahiwade has quit IRC19:59
*** wolverineav has quit IRC19:59
corvus#status log added ze12 to zuul executor pool to reduce memory pressure20:00
openstackstatuscorvus: finished logging20:00
corvusinfra-root: ze12 is in production20:00
*** wolverineav has joined #openstack-infra20:00
*** jamesmcarthur has quit IRC20:01
*** jamesmcarthur has joined #openstack-infra20:01
fungiahh, yep, just noticed the green line on the executors graph bump up a notch20:02
corvusi really like that it's immediately reflected in monitoring :)20:02
fungithat is super nice20:02
corvusall the governor graphs have an extra line now too.  even cacti is updated.20:02
fungiand there's a gate resent underway in the integrated queue. curious to see if we fall into exhaustion again20:04
fungier, gate reset20:04
corvusoh good, that will help things equalize across all the executors faster :)20:04
clarkbya it's been bumpy there too. I've been trying to context switch into debugging some of those failures next, but running out of steam20:04
clarkbglance python3 unittests were the last failure20:05
*** wolverineav has quit IRC20:05
clarkbseemed to be a legit issue with bytes and unicode or something20:05
clarkbhttp://logs.openstack.org/61/610661/7/gate/openstack-tox-py35/f70430e/job-output.txt.gz#_2018-12-06_18_48_43_31160420:05
*** bobh has joined #openstack-infra20:07
clarkbthe most recent reset was a grenade job failing on bhs1 because the nova node tempest was testing didn't reach an active state before the timeout20:07
*** bobh has quit IRC20:11
*** sthussey has quit IRC20:12
*** rcernin has joined #openstack-infra20:12
fungiany puppet gurus know how to work around http://logs.openstack.org/90/623290/1/check/infra-puppet-apply-4-ubuntu-xenial/62c7ca7/applytest/puppetapplytest32.final.out.FAILED ?20:17
fungiseems we can't use our mysql::backup_remote class with the puppet mysql module because both want to install mysql-client20:18
fungiunfortunately one of them isn't a module under our control (i think?)20:18
clarkbfungi: ya the mysql module is an upstream module.20:19
fungimaybe it provides a way to not declare the mysql-client package20:19
clarkbfungi: puppet 4 is ordered (and maybe 3 is now too?) in any case in the backup module you can do the if !defined() check for mysql-client and install it that way. Then ensure that you include the backup class after the regular mysql stuff20:20
clarkbthen the ordering should be such that it works20:20
fungiohh20:20
fungii can actually order them?20:20
fungiwill try that, thanks!20:20
clarkbfungi: it's implied top to bottom order in puppet 420:20
clarkbbut I think you can also have mysql backup require mysql then it will order them explicitly too20:20
fungiwe already do the if ! defined(Package['mysql-client']) dance in puppet-mysql_backup so if i can order them that should do the trick20:21
*** david-lyle is now known as dklyle20:24
*** udesale has joined #openstack-infra20:29
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Run a local MySQL service on StoryBoard servers  https://review.openstack.org/62329020:31
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Switch StoryBoard database backups to local  https://review.openstack.org/62329120:31
fungihopefully that ^ will solve it for puppet 3 and 4 then20:31
pabelangercorvus: clarkb: I think we forgot to move /var/lib/zuul to /dev/xvde2 partition for ze12.o.o, which means we only have 40GB there for builds20:32
*** bobh has joined #openstack-infra20:33
*** wolverineav has joined #openstack-infra20:33
corvuspabelanger: hrm.  we should automate that.20:34
pabelangeragree!20:35
openstackgerritTobias Henkel proposed openstack-infra/zuul master: WIP: Add spec for scale out scheduler  https://review.openstack.org/62147920:35
corvuspabelanger: i think it's okay for now; we can shut it down later when it's quieter20:35
*** bobh has quit IRC20:37
*** wolverineav has quit IRC20:38
fungiooh, just remembered, since i want to restart gerrit soon for the task footer hyperlinking config addition, might be nice to get https://review.openstack.org/471078 merged before as well20:39
fungiadding lp bug tracking ids20:39
fungi(to make them searchable)20:40
openstackgerritMerged openstack-infra/zuul master: web: add error reducer and info toast notification  https://review.openstack.org/62138720:42
fungicorvus: looks like we've entered a period of no executors accepting new builds again20:43
*** mriedem has quit IRC20:45
clarkbfungi: we had a few periods of that overnight according to grafana, they all seem to have recovered on their own (not sure if this one will)20:45
fungiyeah, it's more of a question of whether we'll have any more 2-hour-long ones20:46
*** udesale has quit IRC20:46
fungithis at least has only persisted for ~15 minutes so far20:46
*** mriedem has joined #openstack-infra20:47
clarkbon the bhs1 front I hopped into a test node and manually checked it had reasonable disk throughput, then found the job it is running https://zuul.openstack.org/stream/b81a8b3afe0f48819fcd3ed0fa201fba?logfile=console.log in the hopes of looking at dstat for that job to see if it exhibits similar behavior to the jobs that timeout20:47
clarkbthats a heat functional job that uses devstack, I'm not actually sure if it runs dstat :/20:48
fungifingers crossed20:48
clarkbback to swapping. Last night I found that each of the swapping ansible jobs would use up to 75MB of swap20:49
clarkbI think we should consider getting mordred's patch in around the json handling20:50
clarkbthat should reduce the window where memory is needed in the jobs20:50
fungitheory is that it's paging out the console stream?20:50
clarkbadditionally we probably want to consider testing a downgrade of ansible to 2.5.older to see if it changes behavior20:50
clarkbfungi: it's the ansible json log data, not the console stream itself, but ya we open it and keep it open the whole time when we really only need to write a new copy with new data at the end aiui20:51
clarkbI think it scales with the size of the roles and tasks in a playbook as it's capturing all of that data?20:51
mordredthere's a patch up to fix that20:51
clarkbmordred: ya I +2'd I'm saying we should try to get that in20:51
mordredyah. I totally agree20:52
mordredI think we should try rolling that out before we try downgrading anything20:52
clarkb++20:52
corvusi'll take a look now20:52
fungi623245?20:53
*** wolverineav has joined #openstack-infra20:53
clarkbfungi: yes20:53
corvusmordred: i like the local var idea; is there any reason not to do that?20:54
mordredthere's also a followup patch that will do the same thing but with yaml and appending to a file instead of reading and re-writing20:54
clarkbon_stats is the thing that runs at the end of a playbook run to display stats around what took time and all that20:54
mordredcorvus: there isn't - although that function is the last thing called before the process exits, so I didn't, just to avoid the respin20:54
mordredbut I can totally do that real quick20:55
clarkbmordred: ya I did double check on that in ansible docs (that the hook fires at the end)20:55
clarkbso I didn't -120:55
fungii suppose a local would be more future-proof in case more function calls get tacked on after that down the road?20:55
mordredyeah. I'll push up a followup20:55
corvusyeah, it's mostly just confusing from a dev/maint pov.20:55
corvusmordred: you may as well ammend20:56
mordredkk20:56
corvusor however you spell that :)20:56
fungiammm...mmmend20:56
fungiwe have enough people around to approve the revised version anyway20:56
corvusi'd want to restart with that anyway, so we'd be waiting for the second, and we can all re+2 real quick20:56
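The fix being discussed (https://review.openstack.org/623245) appears to boil down to deferring the read of the existing JSON log until the on_stats hook fires, and keeping the loaded data in a local variable so it can be collected right away. A minimal sketch of that pattern, assuming a hypothetical output path and record structure rather than Zuul's actual callback plugin:

    import json
    import os

    from ansible.plugins.callback import CallbackBase

    class CallbackModule(CallbackBase):
        CALLBACK_VERSION = 2.0
        CALLBACK_TYPE = 'aggregate'
        CALLBACK_NAME = 'json_log_sketch'

        def __init__(self):
            super(CallbackModule, self).__init__()
            # hypothetical path; the real plugin derives this from the job
            self.output_path = os.environ.get('JSON_LOG_PATH', 'job-output.json')
            self.playbook_results = []   # only the current playbook's results

        def v2_runner_on_ok(self, result):
            self.playbook_results.append(result._result)

        def v2_playbook_on_stats(self, stats):
            # read the old data as late as possible, into a local variable,
            # instead of holding the whole log in memory for the entire run
            existing = []
            if os.path.exists(self.output_path):
                with open(self.output_path) as f:
                    existing = json.load(f)
            existing.append({'plays': self.playbook_results})
            with open(self.output_path, 'w') as f:
                json.dump(existing, f, indent=2)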
fungido we want to get the yaml equivalent in too?20:57
clarkbthe one upside to json is browsers nicely render it20:57
clarkbyaml is more readable on its own though20:57
corvuspersonally, i'd like to take that one slower if we can, since it's basically a new feature.20:57
fungioh, i see the yaml one is more involved20:57
corvusthe json one seems more like an operational fix20:58
mordredyeah. the yaml one is like a change in approach20:58
fungionly just started looking at it and, yeah, i agree20:58
corvusi like it, i just think we should talk through it fully (eg clarkb's point)20:58
fungientirely possible there are users of zuul who prefer the json version, so it's probably a bigger community question20:59
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Read old json data right before writing new data  https://review.openstack.org/62324520:59
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Add appending yaml log plugin  https://review.openstack.org/62325620:59
mordredok. there's the updated20:59
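The follow-up change (https://review.openstack.org/623256) takes an append-only approach instead: since YAML supports multi-document streams, each playbook's results can be written as a new document and the old data never needs to be read back. A rough sketch of the idea, with the file path and record shape as assumptions:

    import yaml

    def append_playbook_results(path, results):
        # appending a new "---"-separated document avoids re-reading old data
        with open(path, 'a') as f:
            f.write('---\n')
            yaml.safe_dump({'plays': results}, f, default_flow_style=False)

    # a consumer can reassemble the stream later with yaml.safe_load_all()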
*** bobh has joined #openstack-infra21:02
clarkblooks like dstat is enabled in that test, should have that data once logs post21:04
fungilooks like that busy cycle for the executors lasted ~17 minutes21:05
openstackgerritNate Johnston proposed openstack-infra/project-config master: Neutron grafana update for co-gating section  https://review.openstack.org/62241821:05
*** bobh has quit IRC21:07
*** agopi has quit IRC21:12
*** agopi has joined #openstack-infra21:12
*** gfidente|afk has quit IRC21:15
SpamapSHey I just realized my company affiliation changed recently. Is there still a place to update somewhere?21:17
* SpamapS always forgets where it is21:17
*** jcoufal has quit IRC21:19
corvusSpamapS: the openstack foundation site has a thing for that21:21
corvusSpamapS: foundation individual membership21:21
corvusfungi, clarkb, mordred: i think our next steps should be to restart executors with mordred's patch, observe behavior, then create ze13 if needed.21:22
clarkbok. Im stepping out for a bit for lunch and need a break from staring at the monitor21:22
clarkbcan help when I return21:22
mriedemclarkb: interesting, not seeing that 26 sec mystery time gap in this run on n-api startup http://logs.openstack.org/65/623265/1/check/tempest-full/0e80f2a/controller/logs/screen-n-api.txt.gz#_Dec_06_17_53_58_03930321:22
mriedemuwsgi debug logging seems to not do anything21:23
*** jaosorior has quit IRC21:23
clarkbmriedem: it could be ovh specific :(21:23
fungiSpamapS: and if you care about stackalytics at all (or your handlers do?) then there's a config file in the stackalytics repo you could push up a review for21:23
fungisomeday stackalytics might start consuming the affiliation info in osf profiles since there's an api to query that now, but it doesn't today21:24
fungicorvus: this sounds like a fine plan. i need to disappear in about 55 minutes to meet some friends for dinner, but can help with restarts prior to that21:26
fungior once i get back (probably around 23:30-00:00z)21:26
SpamapScorvus: ty21:26
SpamapSfungi: ty too21:26
*** kjackal has quit IRC21:27
fungicorvus: oh, and taking ze12 offline briefly to add a cinder volume i guess slots in there somewhere too21:29
*** bobh has joined #openstack-infra21:30
corvusfungi: i don't believe we use cinder volumes21:30
corvusfungi: we just mount the ephemeral volume at /var/lib/zuul21:31
corvus(we should probably instead have our automation symlink that to /opt or something)21:31
corvusbut i don't know how to untangle our deployment from openstackci at this point, so i don't want to touch it until we can move to the new stuff.21:31
*** bobh has quit IRC21:34
fungioh, got it, i always forget we have ephemeral disk in that provider21:34
fungilooks like 623245 has all its node requests fulfilled in the gate as of just now, eta 18 minutes21:37
*** boden has joined #openstack-infra21:40
*** jamesmcarthur has quit IRC21:49
clarkbI think the way it has been done in the past is passing the option to launch node to mount the ephemeral disk elsewhere?21:50
clarkbPutting http://logs.openstack.org/38/589238/13/check/heat-functional-convg-mysql-lbaasv2-amqp1/b81a8b3/logs/dstat-csv_log.txt.gz into https://lamada.eu/dstat-graph/ shows a much happier test run than the runs that fail. And there is quite a bit of IO happening too21:54
clarkbfungi: ianw ^ I think that points to ovh bhs1 (and maybe gra1) having unhappy and happy hypervisors as the source of the problem (assuming we aren't seeing noisy neighbor issues)21:54
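For comparing runs, the dstat CSV from a job log can also be summarized directly; a rough sketch (the 'read'/'writ' column names are assumptions about dstat's CSV layout) that averages disk throughput over a run:

    import csv
    import statistics

    def summarize_dstat(path):
        with open(path) as f:
            rows = list(csv.reader(f))
        # dstat CSVs start with a few metadata lines; the real header is the
        # row that carries the per-column names such as "read" and "writ"
        header_idx = next(i for i, r in enumerate(rows)
                          if 'read' in r and 'writ' in r)
        header = rows[header_idx]
        read_i, writ_i = header.index('read'), header.index('writ')
        reads, writes = [], []
        for row in rows[header_idx + 1:]:
            try:
                reads.append(float(row[read_i]))
                writes.append(float(row[writ_i]))
            except (ValueError, IndexError):
                continue
        return statistics.mean(reads), statistics.mean(writes)

    print(summarize_dstat('dstat-csv_log.txt'))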
*** bobh has joined #openstack-infra21:55
openstackgerritMerged openstack-infra/zuul master: Read old json data right before writing new data  https://review.openstack.org/62324521:56
fungii have a feeling puppet isn't going to update the executors before i have to disappear in 20 minutes, but happy to assist with restarts when i return from dinner if there's still any to be done at that point21:59
*** bobh has quit IRC22:01
clarkbI've updated https://etherpad.openstack.org/p/bhs1-test-node-slowness22:02
clarkbI think we should consider halving the max-servers again and watch those e-r bugs I identified as being corrected by not running in bhs122:02
clarkbamorin pointed out that none of those slow jobs ran on the same hypervisor, so it's less likely to be one or two unhappy hypervisors. Instead we are maybe our own noisy neighbor22:02
clarkband halving the number of nodes should reduce noisy neighbor impacts22:03
corvusclarkb: ack22:03
fungimakes sense to me22:04
openstackgerritRonelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null  https://review.openstack.org/62329422:04
openstackgerritClark Boylan proposed openstack-infra/project-config master: Halve bhs1 max-servers value  https://review.openstack.org/62333822:06
clarkbfungi: corvus ^ quick review to implement that22:06
corvusclarkb: unfortunately that's a variable into our zuul executor swap problem.  we should do it because resets are bad, we're just going to need to keep that in mind as we evaluate further changes.22:07
clarkbcorvus: ++22:07
fungiif we drop it back to 79 instead of 75 we can more directly compare behavior differences between bhs1 and gra122:08
logan-clarkb: updating xenial hwe from 4.15.0.34.56 to 4.15.0.42.63 makes my nested kvm jobs work again. ¯\_(ツ)_/¯22:08
logan-i will cycle thru the hvs and update them all over the next day or so22:08
clarkblogan-: weird22:08
clarkbfungi: I think gra1 has less physical hardware too though22:09
clarkbso that comparison won't be super accurate?22:09
fungiyeah, not a big deal either way22:09
fungicertainly if we still see more failures in bhs1 than gra1 even with a lower max-servers, that's telling too22:10
fungianyway, i approved it22:10
*** kjackal has joined #openstack-infra22:11
fungiand i'm being dragged away 10 minutes early. back as soon as i can be22:11
clarkbthanks!22:11
*** calebb has quit IRC22:12
clarkblogan-: that does sort of imply to me that canonical/ubuntu must have testing for this stuff, but its likely a losing battle for them trying to keep up22:17
clarkbcorvus: for the executor stuff we are waiting on mordred's change to merge now?22:18
corvusclarkb: it's merged; waiting to deploy22:19
clarkbrgr22:19
corvusclarkb: but i think we should wait a bit after your quota change merges before we restart22:19
corvusso we get a new baseline22:19
*** eernst has joined #openstack-infra22:20
openstackgerritRonelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null  https://review.openstack.org/62329422:20
*** eernst has quit IRC22:25
*** bobh has joined #openstack-infra22:25
*** kgiusti has left #openstack-infra22:27
*** manjeets_ is now known as manjeets22:28
openstackgerritJonathan Rosser proposed openstack-infra/project-config master: Separate out success/failure/timeout charts in grafana for OSA  https://review.openstack.org/62334122:29
*** bobh has quit IRC22:29
openstackgerritMerged openstack-infra/project-config master: Halve bhs1 max-servers value  https://review.openstack.org/62333822:30
tonybCan I please be added to the bootstrappers gerrit group so I can EOL the puppet repos as per: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000663.html22:32
* tonyb will self remove when done22:32
clarkbtonyb: yes, one moment22:32
tonybclarkb: Thanks22:33
clarkbtonyb: done22:33
tonybclarkb: \o/ as always I'll be careful :)22:33
*** boden has quit IRC22:45
*** bobh has joined #openstack-infra22:45
tonybclarkb, tobias-urdin: Done and removed22:48
*** bobh has quit IRC22:49
openstackgerritRonelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null  https://review.openstack.org/62329422:53
*** bobh has joined #openstack-infra23:04
*** lbragstad has quit IRC23:08
*** bobh has quit IRC23:09
*** lbragstad has joined #openstack-infra23:09
clarkbmelwitt: mriedem: email sent23:16
mriedemclarkb: thanks23:19
clarkbcorvus: max-servers change was applied at ~2300UTC and dropped to ~75 in use at about 23:15UTC23:19
clarkbbaseline numbers probably want to start at 23:15UTC23:20
corvusclarkb: yeh, i was just looking.  so maybe we wait until at least 24:00 before we restart any executors.23:20
clarkbwfm23:20
melwittclarkb: danke23:21
*** bobh has joined #openstack-infra23:30
*** bobh has quit IRC23:35
clarkbcorvus: we seem to be stabilizing just under 2GB swap per executor?23:42
*** rkukura_ has joined #openstack-infra23:48
corvusclarkb: looks like it23:48
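A quick way to spot-check swap usage on an executor is to read /proc/meminfo (values are reported in kB); a small sketch, with nothing Zuul-specific assumed:

    def swap_used_gb():
        info = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, value = line.split(':', 1)
                info[key] = int(value.split()[0])   # kB
        return (info['SwapTotal'] - info['SwapFree']) / (1024.0 * 1024.0)

    print('%.2f GB swap in use' % swap_used_gb())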
openstackgerritPaul Belanger proposed openstack-infra/nodepool master: Include host_id for openstack provider  https://review.openstack.org/62310723:49
*** bobh has joined #openstack-infra23:49
*** rkukura has quit IRC23:51
*** rkukura_ is now known as rkukura23:51
*** bobh has quit IRC23:54
*** yamamoto has joined #openstack-infra23:56
corvusclarkb: other than 'atomic images prune' and 'atomic pull --storage ostree docker.io/openstackmagnum/kubernetes-kubelet:v1.11.5-1'  did you have to do anything else to upgrade the cluster?23:58
clarkbcorvus: yes, I "vacuumed" the journald contents to free up more disk space23:59
clarkbcorvus: oh and I upgraded proxy, scheduler, kubelet, api, and controller-manager23:59
