Thursday, 2018-01-11

*** jascott1 has joined #openstack-infra00:00
*** edmondsw has quit IRC00:00
*** iyamahat_ has quit IRC00:01
*** dougwig has quit IRC00:03
*** jascott1 has quit IRC00:03
*** jascott1 has joined #openstack-infra00:03
*** iyamahat_ has joined #openstack-infra00:05
*** iyamahat_ has quit IRC00:06
*** jascott1 has quit IRC00:07
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Add skipped CRD tests  https://review.openstack.org/53188700:08
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Support cross-source dependencies  https://review.openstack.org/53080600:08
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests  https://review.openstack.org/53269900:08
*** wolverineav has joined #openstack-infra00:16
*** felipemonteiro has joined #openstack-infra00:19
*** sbra has joined #openstack-infra00:22
*** yamamoto has quit IRC00:24
*** yamamoto has joined #openstack-infra00:27
*** erlon has quit IRC00:27
*** armax has joined #openstack-infra00:31
openstackgerritIan Wienand proposed openstack-infra/project-config master: Revert "Pause builds for dib 2.10 release"  https://review.openstack.org/53270100:36
*** abelur_ has quit IRC00:38
*** yamamoto has quit IRC00:39
*** abelur_ has joined #openstack-infra00:39
*** claudiub has quit IRC00:43
*** sree has joined #openstack-infra00:43
*** rmcall has quit IRC00:46
*** rmcall has joined #openstack-infra00:46
*** sbra has quit IRC00:46
*** sree has quit IRC00:48
*** wolverin_ has joined #openstack-infra00:49
*** wolverineav has quit IRC00:50
*** wolverin_ has quit IRC00:53
*** cuongnv has joined #openstack-infra00:58
*** cuongnv has quit IRC00:59
openstackgerritPaul Belanger proposed openstack-infra/project-config master: Set max-server to 0 for infracloud-vanilla  https://review.openstack.org/53270501:01
*** pcrews has quit IRC01:01
*** threestrands has quit IRC01:03
clarkbianw: ^ do you want to review that too?01:03
openstackgerritKendall Nelson proposed openstack-infra/storyboard master: [WIP]Migration Error with Suspended User  https://review.openstack.org/53270601:04
*** pcrews has joined #openstack-infra01:05
*** ilpianista has quit IRC01:05
ianware we 100% sure the gate is moving?01:09
*** ricolin has joined #openstack-infra01:10
clarkbianw: I am not no01:12
*** stakeda has joined #openstack-infra01:12
clarkbze01 at least is running stuff01:12
pabelangerissues?01:12
clarkbzuul scheduler is running01:12
clarkband zm01 has a zuul-merger01:13
ianw532395,1 ... my change of course ... i would call stuck ... just looking at some of the jobs01:13
ianweverything on the grafana page seems to be going01:13
*** cuongnv has joined #openstack-infra01:15
ianwzuul-executor on ze04 seems to have pretty high load ... maybe it's normal01:15
ianwhttp://paste.openstack.org/show/642527/ in the logs, odd one01:15
ianwgit.exc.GitCommandError: Cmd('git') failed due to: exit code(128)01:15
ianw  stderr: 'Host key verification failed.01:16
clarkbI've got to afk to take care of kids. I did approve pabelanger's max-servers: 0 for infracloud01:16
clarkbianw: ya that was brought up earlier today too01:16
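That "Host key verification failed" traceback comes out of GitPython on the executor when the remote's SSH host key is not in known_hosts. A minimal sketch of how the failure surfaces and can be detected (the repo URL and helper are hypothetical; the real fix is populating known_hosts on the executor, not code):

    import git  # GitPython, the library raising GitCommandError in the paste above

    def clone_with_hostkey_check(url, dest):
        """Clone a repo and surface SSH host key problems explicitly."""
        try:
            return git.Repo.clone_from(url, dest)
        except git.exc.GitCommandError as e:
            # git exits 128 and prints this message when the remote's host key
            # is missing from known_hosts (or has changed since it was recorded).
            if 'Host key verification failed' in str(e.stderr):
                raise RuntimeError('missing or changed SSH host key for %s; '
                                   'update known_hosts on this executor' % url)
            raise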
pabelangeryah, if we loaded from queues, it is possible one executor grabbed more than normal01:16
clarkbI think tomorrow once meltdown is behind us we need to do some zuul stabilization and investigation01:16
pabelangercool01:16
clarkband for you that may be today >_>01:16
ianwdo we have 11 executors now?01:16
clarkbianw: just 10, ze01-1001:16
ianwgrafana seems to think we have 1101:17
*** kiennt26 has joined #openstack-infra01:17
clarkbcould be stale connection to geard?01:17
pabelangerI can't seem to get any of the stream.html pages working01:18
pabelangerfinger 779ce43930ab4f48aa41ad6f9e422734@zuulv3.openstack.org also returns Internal streaming error01:19
ianw2018-01-11 00:10:20,511 DEBUG zuul.AnsibleJob: [build: ac4dbd3dd3eb4dbbbbed8ab3171500b2] Ansible complete, result RESULT_NORMAL code 001:19
ianwthat was about an hour ago, and the status page hasn't picked up the job returned01:19
*** slaweq has joined #openstack-infra01:19
pabelangertcp6       0      0 :::7900                 :::*                    LISTEN01:20
pabelangerthat is, new01:20
*** slaweq_ has joined #openstack-infra01:21
pabelangerdid we land new tcp 7900 changes?01:21
pabelangercorvus: Shrews: ^01:21
pabelangeryah, we must have01:21
pabelangerso, we have a mix of executors on tcp/79 and other on tcp/790001:22
ianwmy other job, 0aee65504f2d491ea497064898f2ad8e, maybe hasn't been picked up by an executor01:22
pabelanger2018-01-11 01:18:11,940 DEBUG zuul.web.LogStreamingHandler: Connecting to finger server ze09:7901:23
pabelangersocket.gaierror: [Errno -2] Name or service not known01:23
pabelangerin zuul-web debug log01:24
*** slaweq has quit IRC01:24
ianwthe job at the top of the integrated gate queue01:25
ianwhttp://zuulv3.openstack.org/stream.html?uuid=a379a36040e047cebfbc4b3e2e9d79a3&logfile=console.log01:25
ianw2018-01-10 22:53:10,886 DEBUG zuul.AnsibleJob: [build: a379a36040e047cebfbc4b3e2e9d79a3] Ansible complete, result RESULT_NORMAL code 001:25
*** slaweq_ has quit IRC01:25
pabelangerokay, so I think just ze09 is having web streaming issues01:26
pabelangerdue to hostname01:26
ianwi think my amateur analysis is ... something's not right01:26
*** ilpianista has joined #openstack-infra01:26
pabelangeryup, hostname on ze09 is ze09 again01:27
pabelangerwhile ze01.o.o is ze01.openstack.org01:27
clarkbyes and plan is to switch everything to be like ze0901:28
clarkbat least that was mordreds ideal and no one objected01:28
*** masayukig_ has quit IRC01:28
pabelangerI guess zuulv3.o.o cannot resolve ze09 right now01:28
pabelangerguess we need to append domain01:28
clarkbor use IP addrs01:29
pabelangerhttp://paste.openstack.org/show/642528/01:29
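The two failure modes in play here are an unresolvable short hostname (the socket.gaierror from zuul-web) and executors listening on the old finger port 79 versus the new 7900. A small diagnostic sketch, using only the hostnames and ports mentioned above:

    import socket

    def probe_finger(host, ports=(79, 7900), timeout=5):
        """Resolve an executor hostname and report which finger port answers."""
        try:
            addr = socket.getaddrinfo(host, None)[0][4][0]
        except socket.gaierror as exc:
            # This is the zuul-web error above: bare 'ze09' does not resolve from
            # zuulv3.o.o, while 'ze09.openstack.org' or a raw IP address would.
            print('%s: cannot resolve (%s)' % (host, exc))
            return
        for port in ports:
            try:
                with socket.create_connection((addr, port), timeout=timeout):
                    print('%s (%s): finger daemon listening on %d' % (host, addr, port))
                    return
            except OSError:
                continue
        print('%s (%s): no finger port open' % (host, addr))

    for candidate in ('ze09', 'ze09.openstack.org'):
        probe_finger(candidate)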
ianwinfra-root: ^ do we want to agree zuul isn't currently making progress?  restart?01:30
pabelangeryah, something to deal with in the morning01:30
*** masayukig has joined #openstack-infra01:30
clarkbianw: I'll have to defer to your judgement, am on phone and feeding kids01:30
pabelangerianw: no, it is inap; we have been having issues there today01:30
pabelangerI'd wait until the job times out01:31
pabelangerianw: we should see if we can SSH into node and see what network looks like01:31
ianwpabelanger: but look at something like a379a36040e047cebfbc4b3e2e9d79a3 ?  it appears to have finished but zuul hasn't noticed?01:32
pabelangerianw: what does load look like on the executor?01:32
ianwthat job went to ze0301:32
ianwzuul-executor is busy there, but probably not more than normal.  load ~1.501:33
pabelangerlooks like a379a36040e047cebfbc4b3e2e9d79a3 is still running01:33
*** zhurong has joined #openstack-infra01:33
*** felipemonteiro has quit IRC01:35
pabelangerianw: I think we need to wait for a379a36040e047cebfbc4b3e2e9d79a3 to timeout01:35
ianwyeah ok, that has gone to internap01:36
*** caphrim007_ has joined #openstack-infra01:36
ianwubuntu-xenial-inap-mtl01-000180399601:36
ianwac4dbd3dd3eb4dbbbbed8ab3171500b2 is a docs job that should have finished ages ago however ...01:36
ianwthat one went to ovh01:37
*** caphrim007 has quit IRC01:38
pabelangerianw: do you know which executor ac4dbd3dd3eb4dbbbbed8ab3171500b2 was on?01:39
pabelangerze04.o.o, according to scheduler01:39
ianwyes, ze04 ... it went to 158.69.75.10101:39
*** yamamoto has joined #openstack-infra01:39
ianwit's got everything cloned, so something happened01:40
pabelangerwow01:40
pabelangerI think ze04.o.o was stop and started01:40
pabelangeror killed01:40
pabelanger2018-01-11 01:00:06,180 DEBUG zuul.log_streamer: LogStreamer starting on port 790001:41
*** askb has quit IRC01:41
ianwhmm, this might explain the 11 executors01:41
pabelangersomething happened 40mins ago01:41
ianwpuppet run?01:41
*** caphrim007_ has quit IRC01:41
pabelangerchecking01:42
*** rcernin has quit IRC01:42
*** askb has joined #openstack-infra01:43
*** edmondsw has joined #openstack-infra01:44
pabelangerwe rebooted 40mins ago01:44
pabelangerwhich make sense01:44
pabelangeris that when we did meltdown reboots?01:44
ianwno, i rebooted them all yesterday!01:44
pabelangerI think we did it again01:44
pabelangeruptime is 1hr:44mins01:44
ianwactually only 44 minutes (it's 1:44) ...01:45
pabelangerokay, so I think the reboot in logs was from server reboot, so executor didn't crash01:46
pabelangerhowever01:46
ianwafaics there was no login around the time of the reboot.  was it externally triggered?01:46
*** shu-mutou-AWAY is now known as shu-mutou01:46
pabelangerac4dbd3dd3eb4dbbbbed8ab3171500b2 was running just before server reboot01:47
*** yamamoto has quit IRC01:47
pabelangermaybe?01:47
clarkbwe didn't do ze's unless jeblair did them?01:47
pabelangerwas the server live migrated?01:47
ianwit must have been, or something.  it happened at exactly 01:0001:47
pabelangerbecause executor was killed when ac4dbd3dd3eb4dbbbbed8ab3171500b2 was running01:48
pabelangerso, in that case, zuul does think it is running01:48
*** askb has quit IRC01:48
pabelangerwhen it is in fact done01:48
pabelangerwhich means, I do think we need to dump queues and restart scheduler01:48
*** edmondsw has quit IRC01:48
ianwyah :/01:48
pabelangerokay, then I agree01:49
*** askb has joined #openstack-infra01:49
pabelangerthat also explains why it is running as tcp/790001:49
pabelangersince daemon restarted01:49
*** markvoelker has joined #openstack-infra01:50
*** rcernin has joined #openstack-infra01:50
ianwis anyone logged into the rax console where they send those notifications?01:50
Shrewspabelanger: hrm, yes, it seems my executor change recently landed01:51
ianwis this finger thing something we want to consider before reloading?01:51
Shrewspossibly? we may not have considered all of the things that need to change for it01:53
*** bobh has joined #openstack-infra01:53
Shrewsthere's this: https://review.openstack.org/53259401:53
Shrewsand we might need iptables rules for the new port?01:53
ianwwith the gate the way it is, unless we start force pushing, not sure we can do too much changing01:55
openstackgerritDavid Shrewsbury proposed openstack-infra/system-config master: Zuul executor needs to open port 7900 now.  https://review.openstack.org/53270901:57
*** Apoorva has quit IRC01:58
Shrews532594 and 532709 are going to be needed since that change landed. Then restart the executors. I'm not sure what the recently restarted ze* servers are going to need since they're probably running as root and not zuul now. Might run into file permission issues? Can't speak to that.02:00
Shrewsianw: pabelanger: I wasn't expecting that change to land so quickly. My fault for not marking it WIP or -2 before we discussed coordination.02:03
Shrewsianw: pabelanger: what can i do to assist here?02:04
ianwwell only ze04 has probably restarted with the new code02:04
ianwis the only consequence that live streaming doesn't work?02:04
Shrewsright02:05
Shrewsi see the processes there running as root, not zuul. i'm not sure of the implications of that02:05
Shrewscorvus would know02:06
pabelangerwe could manually downgrade ze04.o.o for tonight02:06
pabelangerthen work on rolling out new code02:07
ianwor just turn it off?02:07
pabelangeror that02:07
pabelangermaybe best for tonight02:07
pabelangerI'm going to have to run shortly02:07
ianwyeah, given it's run as root now, turning it back to zuul seems dangerous02:07
ianwso, let's stop ze0402:08
ianwthen i'll dump & restart zuul scheduler02:08
pabelanger1wfm02:08
ianwand cross fingers none of the other ze's restart02:08
*** xarses has joined #openstack-infra02:08
ianwand the a-team can pick this up tomorrow02:08
pabelangeragree02:08
ianwok02:08
Shrewswfm x 2. should I -2 532594 and 532709 for now?02:08
Shrewslanding 709 w/o restarting the executors would break streaming on all executors until restart02:09
ianwi think so, just for safety02:09
pabelangerwe could make 709 support both 79,7900 then remove 79 in follow02:10
*** gouthamr has joined #openstack-infra02:10
*** ramishra has joined #openstack-infra02:10
pabelangermight be safer approach02:10
Shrews-2'd for now02:11
pabelangerkk02:11
*** annp has joined #openstack-infra02:11
ianw#status log zuul-executor stopped on ze04.o.o and it is placed in the emergency file, due to an external reboot applying https://review.openstack.org/#/c/532575/.  we will need to more carefully consider the rollout of this code02:13
openstackstatusianw: finished logging02:13
*** Apoorva has joined #openstack-infra02:14
*** gouthamr_ has joined #openstack-infra02:15
*** gouthamr has quit IRC02:15
pabelangerianw: okay, I'm EOD, good luck02:17
ianwi'm not sure i like the look of this02:18
ianwhttp://paste.openstack.org/show/642531/02:18
*** namnh has joined #openstack-infra02:18
*** yamahata has joined #openstack-infra02:18
Shrewsianw: where's that from?02:18
Shrewsi think dmsimard pasted something similar a couple of days ago02:19
*** gouthamr_ has quit IRC02:19
*** yamamoto has joined #openstack-infra02:19
dmsimardyeah02:19
dmsimarddid you restart zuul ?02:19
*** Apoorva has quit IRC02:19
dmsimardiirc I saw that being spammed (like hell) in zuulv3.o.o logs after starting the scheduler/web02:20
ianwyes, just restarted02:20
Shrewsdmsimard: did anyone look into that?02:20
ianwit seems to be going ...02:20
ianwok, i'm going to requeue02:20
*** rmcall has quit IRC02:20
dmsimardShrews: no one jumped on it when I pointed it out, no02:21
ianwif its job is to scare the daylights out of the person restarting, then it has worked :)02:22
dmsimardianw: iirc there's the same errors in the zuul-web logs too02:22
dmsimardnot just zuul-scheduler02:22
dmsimardyeah, zuul web: http://paste.openstack.org/raw/642533/02:23
*** gouthamr has joined #openstack-infra02:23
Shrewslooks like it might be a bug in the /{tenant}/status route?02:30
ShrewstristanC: know anything about that? ^^^^02:30
dmsimardI was talking to him about that just now :D02:30
tristanCShrews: yep, i'm currently writing a better exception handling02:30
Shrewsianw: my instinct says it's not critical02:30
Shrewsw00t!02:31
dmsimardHe says it's because when the scheduler restarts, the layouts are not yet available, so when you try to load the webpage it tries to seek them out02:31
dmsimardor something along those lines02:31
tristanCdmsimard: ianw: this happens when the requested tenant layout isn't ready in the scheduler02:31
Shrewsdmsimard: tristanC: thx for the explanation02:32
ianwtristanC: ok :) crisis averted02:33
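The pattern tristanC is describing for 532718 is to treat a not-yet-loaded tenant layout as an expected condition rather than letting it raise a traceback in the status route. A rough sketch of that shape (illustrative names only, not Zuul's actual handler code):

    import json

    class TenantNotReady(Exception):
        pass

    def format_tenant_status(scheduler, tenant_name):
        """Return status JSON for a tenant, or signal that it is still loading."""
        tenant = scheduler.tenants.get(tenant_name)  # illustrative attribute
        if tenant is None or tenant.layout is None:
            # Right after a scheduler restart the tenant layouts are still being
            # built; report that cleanly instead of raising deep in the formatter.
            raise TenantNotReady(tenant_name)
        return json.dumps(tenant.layout.formatStatus())  # illustrative method

    def handle_status_request(scheduler, tenant_name):
        try:
            return 200, format_tenant_status(scheduler, tenant_name)
        except TenantNotReady:
            return 503, json.dumps({'error': 'tenant %s is still loading' % tenant_name})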
ianwstill requeueing02:33
*** Qiming has quit IRC02:33
*** markvoelker has quit IRC02:36
*** yamahata has quit IRC02:38
ianw#status log zuul restarted due to the unexpected loss of ze04; jobs requeued02:38
openstackstatusianw: finished logging02:38
*** yamahata has joined #openstack-infra02:39
ianwok, we're back to 9 executors in grafana, which seems right02:39
*** Qiming has joined #openstack-infra02:39
*** gouthamr has quit IRC02:40
ianwwill just leave things for a bit, will send a summary email to avoid people having to pull apart chat logs02:40
*** caphrim007 has joined #openstack-infra02:41
*** hongbin has joined #openstack-infra02:44
dmsimardwe have 10, what other executor is missing ?02:45
*** caphrim007 has quit IRC02:45
*** threestrands has joined #openstack-infra02:47
*** yamahata has quit IRC02:54
*** mriedem has quit IRC02:56
*** wolverineav has joined #openstack-infra03:01
*** fultonj has quit IRC03:01
*** masber has quit IRC03:02
*** fultonj has joined #openstack-infra03:02
*** masber has joined #openstack-infra03:05
*** wolverineav has quit IRC03:05
*** ijw has joined #openstack-infra03:06
*** aeng has quit IRC03:08
*** aeng has joined #openstack-infra03:08
*** katkapilatova1 has quit IRC03:08
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: scheduler: better handle format status error  https://review.openstack.org/53271803:10
*** rmcall has joined #openstack-infra03:10
*** yamamoto_ has joined #openstack-infra03:12
ianwdmsimard: 4 is turned off, see prior, and grafana was saying there was 11 at one point, so something had gone wrong there03:14
*** yamamoto has quit IRC03:16
*** gyee has quit IRC03:16
*** felipemonteiro has joined #openstack-infra03:18
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: scheduler: better handle format status error  https://review.openstack.org/53271803:21
*** slaweq has joined #openstack-infra03:21
*** harlowja_ has quit IRC03:21
*** slaweq_ has joined #openstack-infra03:21
*** ijw has quit IRC03:22
*** slaweq has quit IRC03:25
*** ramishra has quit IRC03:25
*** slaweq_ has quit IRC03:26
*** ramishra has joined #openstack-infra03:27
*** wolverineav has joined #openstack-infra03:33
*** wolverineav has quit IRC03:38
*** ijw has joined #openstack-infra03:39
*** sbra has joined #openstack-infra03:42
*** ramishra has quit IRC03:45
*** ijw has quit IRC03:47
*** rlandy|bbl is now known as rlandy03:50
mtreinishclarkb: I just did a quick test and afs works fine on one of my raspberry pi's running arch03:54
*** sbra has quit IRC03:54
*** verdurin_ has joined #openstack-infra03:55
*** verdurin has quit IRC03:56
*** felipemonteiro has quit IRC03:57
*** caphrim007 has joined #openstack-infra03:58
*** ameliac has quit IRC04:00
*** ramishra has joined #openstack-infra04:01
*** verdurin has joined #openstack-infra04:03
*** verdurin_ has quit IRC04:04
*** xarses has quit IRC04:05
openstackgerritMerged openstack/diskimage-builder master: Revert "Dont install python-pip for py3k"  https://review.openstack.org/53239504:05
*** esberglu has quit IRC04:10
*** bobh has quit IRC04:11
*** zhurong has quit IRC04:12
*** xarses has joined #openstack-infra04:13
*** xarses_ has joined #openstack-infra04:14
*** xarses_ has quit IRC04:14
*** xarses_ has joined #openstack-infra04:15
*** felipemonteiro has joined #openstack-infra04:15
*** daidv has joined #openstack-infra04:16
*** xarses has quit IRC04:18
*** jamesmcarthur has joined #openstack-infra04:18
*** jamesmcarthur has quit IRC04:18
*** jamesmcarthur has joined #openstack-infra04:19
*** sree has joined #openstack-infra04:21
*** armax has quit IRC04:24
*** links has joined #openstack-infra04:25
*** spzala has joined #openstack-infra04:31
*** andreas_s has joined #openstack-infra04:32
*** ykarel|away has joined #openstack-infra04:33
*** rosmaita has quit IRC04:34
*** bhavik1 has joined #openstack-infra04:35
*** shu-mutou has quit IRC04:35
*** shu-mutou has joined #openstack-infra04:36
*** xarses_ has quit IRC04:37
*** andreas_s has quit IRC04:37
*** felipemonteiro has quit IRC04:39
*** spzala has quit IRC04:42
*** dingyichen has quit IRC04:43
*** rlandy has quit IRC04:53
*** rmcall has quit IRC04:54
*** dingyichen has joined #openstack-infra04:55
*** links has quit IRC04:56
*** links has joined #openstack-infra04:57
*** janki has joined #openstack-infra05:07
*** dingyichen has quit IRC05:09
*** dingyichen has joined #openstack-infra05:09
*** nibalizer has joined #openstack-infra05:14
*** claudiub has joined #openstack-infra05:15
*** sree has quit IRC05:19
*** edmondsw has joined #openstack-infra05:20
*** psachin has joined #openstack-infra05:21
*** edmondsw has quit IRC05:24
*** gema has quit IRC05:26
*** gema has joined #openstack-infra05:27
*** psachin has quit IRC05:27
*** zhurong has joined #openstack-infra05:28
*** rcernin has quit IRC05:29
*** CrayZee has joined #openstack-infra05:29
*** harlowja has joined #openstack-infra05:29
*** sree has joined #openstack-infra05:29
*** gcb has quit IRC05:30
*** gcb has joined #openstack-infra05:31
*** hongbin has quit IRC05:32
*** shu-mutou is now known as shu-mutou-AWAY05:32
*** wolverineav has joined #openstack-infra05:34
*** psachin has joined #openstack-infra05:35
*** wolverin_ has joined #openstack-infra05:35
*** jamesmcarthur has quit IRC05:36
*** wolveri__ has joined #openstack-infra05:38
*** wolverineav has quit IRC05:39
*** wolverin_ has quit IRC05:39
*** bhavik1 has quit IRC05:42
*** wolveri__ has quit IRC05:43
*** markvoelker has joined #openstack-infra05:47
*** coolsvap has joined #openstack-infra05:47
*** dhajare has joined #openstack-infra05:47
*** jamesmcarthur has joined #openstack-infra05:52
*** xinliang has quit IRC05:56
*** sandanar has joined #openstack-infra05:58
*** ykarel|away is now known as ykarel05:59
*** jamesmcarthur has quit IRC06:00
*** jbadiapa has quit IRC06:06
*** xinliang has joined #openstack-infra06:08
*** xinliang has quit IRC06:08
*** xinliang has joined #openstack-infra06:08
CrayZeeHi, Is there anything going on with zuul? I see over 300 tasks in the queue with tasks running over 3.5 hours that are probably requeued ?06:08
CrayZeee.g. 532631...06:09
*** liujiong has joined #openstack-infra06:09
*** slaweq has joined #openstack-infra06:09
*** slaweq has quit IRC06:13
*** kjackal has joined #openstack-infra06:14
*** armaan has quit IRC06:17
ianwCrayZee: I think we're just flat out ATM.  we're down one provider until we can sort out some issues, but http://grafana01.openstack.org/dashboard/db/zuul-status seems to be okish06:17
*** Hunner has quit IRC06:19
*** bmjen has quit IRC06:19
CrayZeeianw: thanks06:21
*** liujiong has quit IRC06:28
*** jamesmcarthur has joined #openstack-infra06:28
*** jamesmcarthur has quit IRC06:33
*** liujiong has joined #openstack-infra06:33
*** jamesmcarthur has joined #openstack-infra06:35
*** annp has quit IRC06:38
*** markvoelker has quit IRC06:40
*** markvoelker has joined #openstack-infra06:42
eumel8ianw: you're still around?06:44
*** kzaitsev_pi has quit IRC06:45
*** yamamoto has joined #openstack-infra06:45
ianweumel8: a little ... wrapping up06:45
*** alexchadin has joined #openstack-infra06:46
*** harlowja has quit IRC06:47
eumel8ianw:  morning :) can you please take a look into https://review.openstack.org/#/c/531736/ and plan the next upgrade of Zanata?06:47
*** yamamoto_ has quit IRC06:49
*** kzaitsev_pi has joined #openstack-infra06:52
ianweumel8: ok, i can probably try that out tomorrow my time.  basically 1) pause puppet 2) do the table drop 3) commit that 4) restart puppet?06:54
eumel8ianw: drop database, not table :)06:55
*** aeng has quit IRC06:56
ianwahh, ok.  how about i stop puppet for it now & we get it in the queue06:56
eumel8ianw: sounds good, thx06:57
*** threestrands has quit IRC07:00
*** sbra has joined #openstack-infra07:00
*** threestrands has joined #openstack-infra07:01
*** threestrands has quit IRC07:02
*** threestrands has joined #openstack-infra07:03
*** threestrands has quit IRC07:03
*** threestrands has joined #openstack-infra07:03
*** threestrands has quit IRC07:04
*** threestrands has joined #openstack-infra07:04
*** edmondsw has joined #openstack-infra07:08
*** jamesmcarthur has quit IRC07:12
*** edmondsw has quit IRC07:13
*** armaan has joined #openstack-infra07:13
*** dhill_ has quit IRC07:13
*** dhill__ has joined #openstack-infra07:13
*** jamesmcarthur has joined #openstack-infra07:19
masayukigmtreinish: clarkb: yeah, I don't have a lot of stuff installed on my machines. But I also tested it on a very slow hdd. Files might be on disk cache because the size is not so big.07:19
*** pcaruana has joined #openstack-infra07:21
*** dims_ has quit IRC07:21
openstackgerritMerged openstack-infra/infra-manual master: Clarify the point of a repo in a zuulv3 job name  https://review.openstack.org/53268607:21
*** hemna_ has quit IRC07:21
*** jamesmcarthur has quit IRC07:24
*** dims has joined #openstack-infra07:25
openstackgerritMerged openstack-infra/system-config master: Upgrade translate-dev.o.o to Zanata 4.3.3  https://review.openstack.org/53173607:25
openstackgerritMerged openstack-infra/project-config master: Switch grafana neutron board to non-legacy jobs  https://review.openstack.org/53263207:26
*** afred312 has quit IRC07:27
*** abelur_ has quit IRC07:29
*** vsaienk0 has joined #openstack-infra07:29
*** jamesmcarthur has joined #openstack-infra07:30
*** afred312 has joined #openstack-infra07:30
AJaegerconfig-core, could you review some changes on https://etherpad.openstack.org/p/Nvt3ovbn5x , please? I put up ready changes there for easier review07:34
*** jamesmcarthur has quit IRC07:35
*** slaweq has joined #openstack-infra07:38
*** jtomasek has joined #openstack-infra07:39
*** jamesmcarthur has joined #openstack-infra07:39
*** jtomasek has joined #openstack-infra07:39
*** armaan has quit IRC07:41
*** armaan has joined #openstack-infra07:41
*** jamesmcarthur has quit IRC07:44
openstackgerritMerged openstack-infra/project-config master: Set max-server to 0 for infracloud-vanilla  https://review.openstack.org/53270507:45
*** jamesmcarthur has joined #openstack-infra07:46
*** armaan has quit IRC07:46
*** armaan has joined #openstack-infra07:47
*** dciabrin has quit IRC07:49
AJaegerfrickler: could you review https://review.openstack.org/#/c/523018/ to remove an unused template so that it won't get used again, please?07:50
*** andreas_s has joined #openstack-infra07:53
*** jamesmcarthur has quit IRC07:55
gemamtreinish: thanks for trying it, good to know afs can be built on arm :)07:57
AJaegerinfra-root, looking at http://grafana.openstack.org/dashboard/db/nodepool I see a nearly constant number of "deleting nodes". Do we have a problem with deletion somewhere?07:57
*** annp has joined #openstack-infra08:01
*** jamesmcarthur has joined #openstack-infra08:04
*** evin has joined #openstack-infra08:04
*** sbra has quit IRC08:06
*** threestrands has quit IRC08:08
*** jamesmcarthur has quit IRC08:09
*** slaweq_ has joined #openstack-infra08:10
*** apetrich has quit IRC08:13
*** florianf has joined #openstack-infra08:14
*** slaweq_ has quit IRC08:14
*** jamesmcarthur has joined #openstack-infra08:15
*** pcichy has joined #openstack-infra08:17
*** tesseract has joined #openstack-infra08:17
*** ramishra has quit IRC08:18
*** aviau has quit IRC08:20
*** aviau has joined #openstack-infra08:20
*** ramishra has joined #openstack-infra08:20
*** jamesmcarthur has quit IRC08:21
*** jamesmcarthur has joined #openstack-infra08:22
*** shardy has joined #openstack-infra08:22
*** jamesmcarthur has quit IRC08:26
openstackgerritVasyl Saienko proposed openstack-infra/project-config master: Add networking-generic-switch-tempest-plugin repo  https://review.openstack.org/53254208:27
openstackgerritVasyl Saienko proposed openstack-infra/project-config master: Add jobs for n-g-s tempest-plugin  https://review.openstack.org/53254308:27
*** alexchadin has quit IRC08:28
*** tosky has joined #openstack-infra08:29
*** cshastri has joined #openstack-infra08:31
*** jamesmcarthur has joined #openstack-infra08:31
*** jpena|off is now known as jpena08:34
*** jamesmcarthur has quit IRC08:36
*** jamesmcarthur has joined #openstack-infra08:37
*** jbadiapa has joined #openstack-infra08:40
*** alexchadin has joined #openstack-infra08:41
*** jamesmcarthur has quit IRC08:42
*** pcaruana has quit IRC08:44
*** links has quit IRC08:46
openstackgerritDmitrii Shcherbakov proposed openstack-infra/project-config master: add charm-panko project-config  https://review.openstack.org/53276908:47
*** jaosorior has joined #openstack-infra08:47
*** b_bezak has joined #openstack-infra08:47
*** armaan has quit IRC08:47
*** armaan has joined #openstack-infra08:48
*** jamesmcarthur has joined #openstack-infra08:48
*** ralonsoh has joined #openstack-infra08:50
*** jamesmcarthur has quit IRC08:53
*** dingyichen has quit IRC08:54
*** edmondsw has joined #openstack-infra08:56
*** b_bezak has quit IRC08:57
*** hashar has joined #openstack-infra08:57
*** links has joined #openstack-infra08:59
*** jamesmcarthur has joined #openstack-infra09:01
*** edmondsw has quit IRC09:01
*** jpich has joined #openstack-infra09:02
*** jamesmcarthur has quit IRC09:05
*** pblaho has quit IRC09:06
*** sree_ has joined #openstack-infra09:10
*** sree_ is now known as Guest3737509:11
*** sree has quit IRC09:12
*** sree has joined #openstack-infra09:12
*** Guest37375 has quit IRC09:15
*** dbecker has joined #openstack-infra09:17
*** jamesmcarthur has joined #openstack-infra09:17
*** flaper87 has quit IRC09:17
*** pcichy has quit IRC09:18
*** dsariel has joined #openstack-infra09:19
*** pblaho has joined #openstack-infra09:19
*** jamesmcarthur has quit IRC09:19
*** flaper87 has joined #openstack-infra09:21
*** flaper87 has quit IRC09:21
*** e0ne has joined #openstack-infra09:21
*** flaper87 has joined #openstack-infra09:23
*** lucas-afk is now known as lucasagomes09:26
*** kopecmartin has joined #openstack-infra09:27
*** armaan has quit IRC09:27
*** armaan has joined #openstack-infra09:27
*** dsariel has quit IRC09:27
*** pblaho has quit IRC09:36
*** armaan has quit IRC09:36
*** dtantsur|afk is now known as dtantsur09:38
ykarelJobs are in the queue for long, some jobs for 7+ hr; is some issue going on or is it just that infra is out of nodes?09:43
*** pblaho has joined #openstack-infra09:48
*** links has quit IRC09:51
*** apetrich has joined #openstack-infra09:53
*** vsaienk0 has quit IRC09:55
*** stakeda has quit IRC09:57
CrayZeeykarel: This is the answer I got a few hours ago: "(06:17:58 UTC) ianw: CrayZee: I think we're just flat out ATM.  we're down one provider until we can sort out some issues, but http://grafana01.openstack.org/dashboard/db/zuul-status seems to be okish"09:59
*** CrayZee has quit IRC10:00
*** armaan has joined #openstack-infra10:03
*** pcaruana has joined #openstack-infra10:04
*** links has joined #openstack-infra10:05
ykarelCrayZee Thanks10:06
*** slaweq_ has joined #openstack-infra10:11
*** markvoelker has quit IRC10:13
*** ankkumar has joined #openstack-infra10:13
*** sambetts|afk is now known as sambetts10:14
*** namnh has quit IRC10:14
*** slaweq_ has quit IRC10:15
ankkumarHi10:16
ankkumarI am trying to build openstack/ironic CI using zuul v310:16
ankkumarCan anyone tell me how it would read my customized zuul.yaml file, because the openstack/ironic repo has their own?10:16
*** cuongnv has quit IRC10:16
ankkumarHow would my gate run, I mean using which zuul yaml files?10:16
*** pbourke has joined #openstack-infra10:16
*** liujiong has quit IRC10:23
*** sree_ has joined #openstack-infra10:25
*** vsaienk0 has joined #openstack-infra10:25
*** sree_ is now known as Guest8860010:26
*** ricolin has quit IRC10:28
*** sree has quit IRC10:29
*** rosmaita has joined #openstack-infra10:32
*** alexchadin has quit IRC10:34
*** cshastri has quit IRC10:35
*** zhurong has quit IRC10:35
*** ldnunes has joined #openstack-infra10:37
*** abelur_ has joined #openstack-infra10:37
*** e0ne has quit IRC10:38
*** e0ne has joined #openstack-infra10:39
*** sandanar has quit IRC10:39
*** sandanar has joined #openstack-infra10:40
*** edmondsw has joined #openstack-infra10:44
*** pcichy has joined #openstack-infra10:46
*** slaweq_ has joined #openstack-infra10:46
*** danpawlik_ has joined #openstack-infra10:46
*** maciejjo1 has joined #openstack-infra10:47
*** andreas_s has quit IRC10:48
*** slaweq_ has quit IRC10:48
*** maciejjozefczyk has quit IRC10:48
*** danpawlik has quit IRC10:48
*** slaweq_ has joined #openstack-infra10:48
*** cshastri has joined #openstack-infra10:48
*** slaweq has quit IRC10:48
*** edmondsw has quit IRC10:49
*** pcichy has quit IRC10:50
*** pcichy has joined #openstack-infra10:50
*** numans has quit IRC10:54
*** numans has joined #openstack-infra10:55
*** andreas_s has joined #openstack-infra10:58
fricklerinfra-root: AJaeger: ack, seeing lots of deletion-related exceptions on nl02, will try to dig a bit more11:10
*** andreas_s has quit IRC11:19
*** shardy has quit IRC11:22
*** shardy has joined #openstack-infra11:24
*** panda|rover|afk has quit IRC11:24
*** derekh has joined #openstack-infra11:25
*** eyalb has joined #openstack-infra11:26
*** maciejjo1 is now known as maciejjozefczyk11:27
eyalbany news on zuul? our jobs are queuing for a long time11:28
fricklereyalb: due to various issues over the last days we have a pretty large backlog currently, please be patient. if you have specific jobs that seem stuck, please post them here so we can take a look11:30
eyalbfrickler: thanks11:31
*** tpsilva_ has joined #openstack-infra11:31
fricklerinfra-root: seems we have multiple issues currently: a) nodes in stuck deleting state, this seems to be on all providers, all since about the time zuul seems to have been restarted (around 22:30)11:32
fricklerinfra-root: b) some timeouts when deleting current nodes and c) some more issues with infracloud-vanilla11:33
fricklerfor a) I think a restart of nodepool might help, but I don't want to make the situation worse. this seems to be just blocking about 30% of our quota, so not optimal, but also not catastrophic I'd say, as the remaining quota seems to get used without issues11:34
*** tpsilva_ is now known as tpsilva11:34
fricklerb) might actually be normal behaviour. c) infracloud-vanilla seems to have been taken offline anyway, so also not critical11:35
*** andreas_s has joined #openstack-infra11:36
*** panda has joined #openstack-infra11:38
frickleralthough it may be that the leaked instance cleanup is not working due to c)11:38
*** apetrich has quit IRC11:38
*** sdague has joined #openstack-infra11:39
*** wolverineav has joined #openstack-infra11:39
*** andreas_s has quit IRC11:40
*** wolverineav has quit IRC11:44
*** andreas_s has joined #openstack-infra11:50
*** florianf has quit IRC11:52
*** florianf has joined #openstack-infra11:53
*** rhallisey has joined #openstack-infra11:54
*** andreas_s has quit IRC11:55
Shrewsfrickler: AJaeger: ugh, not what i want to see when i first wake up. i'll take a look at nodepool11:56
*** e0ne has quit IRC11:58
*** janki has quit IRC11:59
Shrewsoh, this is new: http://paste.openstack.org/show/642553/11:59
*** panda has quit IRC11:59
Shrewsa restart of nodepool would not help with that12:00
Shrewsfrickler: you said "more" issues with vanilla. what were the previous issues?12:00
*** davidlenwell has quit IRC12:01
*** yamamoto has quit IRC12:01
*** davidlenwell has joined #openstack-infra12:01
*** yamamoto has joined #openstack-infra12:01
*** electrical has quit IRC12:01
*** icey has quit IRC12:02
*** pcaruana has quit IRC12:02
*** pcaruana|afk| has joined #openstack-infra12:02
*** icey has joined #openstack-infra12:02
*** electrical has joined #openstack-infra12:02
Shrewsoh, the meltdown patches causing issues12:03
Shrewswe might should disable all of vanilla until someone can look into that infrastructure a bit more12:04
*** jpena is now known as jpena|lunch12:04
*** jkilpatr has joined #openstack-infra12:05
*** annp has quit IRC12:05
Shrewsoh, it IS disabled in nodepool.yaml12:06
*** kjackal has quit IRC12:07
Shrewsoh geez, happening across other providers too. *sigh*12:08
Shrewsmordred: was there a new shade release recently?12:09
*** dsariel has joined #openstack-infra12:10
*** arxcruz|ruck is now known as arxcruz|rover12:11
*** slaweq has joined #openstack-infra12:11
fricklerShrews: the failure in your paste is happening because keystone isn't responding. but I'm also seeing timeouts when deleting nodes from other providers12:12
*** panda has joined #openstack-infra12:13
Shrewsfrickler: yeah, looking at those on the other launcher now12:13
Shrewsseems to be affecting rax, ovh, and chocolate too12:13
Shrewsno idea what's up with those12:13
*** markvoelker has joined #openstack-infra12:14
fricklerShrews: http://paste.openstack.org/show/642563/ seems to show the reason for infra-chocolate12:15
AJaegerShrews:  https://review.openstack.org/532705 disabled vanilla already.12:16
fricklerShrews: nodepool seems always to retry deleting the same node before deleting others12:16
*** slaweq has quit IRC12:16
AJaegerfrickler: thanks for looking into this.12:16
AJaegerand good morning, Shrews !12:17
*** sandanar has quit IRC12:18
fricklerShrews: I'm wondering whether we should delete infracloud-vanilla from nodepool.yaml completely for the time being. that should at least allow the leaked instance cleanup to run for the other clouds12:18
*** smatzek has joined #openstack-infra12:21
Shrewsfrickler: iirc, what should be happening is a new thread is started for each node delete. so it should cycle through all deleted instances rather quickly, but the actual delete threads are the things timing out12:22
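A rough sketch of the per-delete threading Shrews describes, where each instance delete gets its own worker with a timeout so a single unresponsive cloud (vanilla here) cannot wedge cleanup for the other providers (the provider object and deleteServer call are placeholders):

    import logging
    import threading

    log = logging.getLogger('cleanup')

    def delete_instance(provider, server_id, timeout=600):
        """Run one server delete in its own thread with an overall timeout."""
        def _delete():
            provider.deleteServer(server_id)  # placeholder for the cloud API call

        worker = threading.Thread(target=_delete, daemon=True)
        worker.start()
        worker.join(timeout)
        if worker.is_alive():
            # A hung API (e.g. a dead keystone endpoint) should not block the
            # cleanup loop; log it and move on to the next instance/provider.
            log.warning('Delete of %s on %s timed out after %ss',
                        server_id, provider, timeout)

    def cleanup_deleted(nodes_by_provider):
        for provider, server_ids in nodes_by_provider.items():
            for server_id in server_ids:
                delete_instance(provider, server_id)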
*** sdoran has quit IRC12:24
*** mrmartin has quit IRC12:24
*** madhuvishy has quit IRC12:24
*** tomhambleton_ has quit IRC12:24
*** madhuvishy has joined #openstack-infra12:24
*** lucasagomes is now known as lucas-hungry12:25
*** pcichy has quit IRC12:25
*** abelur has quit IRC12:27
*** abelur has joined #openstack-infra12:27
*** kjackal has joined #openstack-infra12:28
*** e0ne has joined #openstack-infra12:30
*** pcichy has joined #openstack-infra12:31
Shrewslooks like we lost connection to our zookeeper server yesterday12:32
*** edmondsw has joined #openstack-infra12:33
*** e0ne has quit IRC12:33
Shrewsok, this is weird. i'm going to restart the launchers to see what happens12:33
*** rosmaita has quit IRC12:34
*** abelur_ has quit IRC12:35
AJaegerShrews: wait, please...12:35
AJaegerShrews: first read http://lists.openstack.org/pipermail/openstack-infra/2018-January/005774.html - ianw run into a problem with ze04 that we should check before restarting12:36
ShrewsAJaeger: already restarted nl01, which seems to be cleaning up the instances. i worked with ianw last night on ze04, so aware of that12:37
Shrewsunrelated things12:37
*** edmondsw has quit IRC12:38
AJaegerShrews: glad to hear.12:38
*** mrmartin has joined #openstack-infra12:39
*** e0ne has joined #openstack-infra12:42
Shrewsnl02 restarted which seems to have cleaned up the enabled providers there12:42
*** bobh has joined #openstack-infra12:42
ShrewsAJaeger: frickler: not sure what happened there. I'm going to have to spend the day sorting through tons of logs.12:43
Shrewscan one of you #status the things while I finally have a cup of coffee?12:43
*** vsaienk0 has quit IRC12:44
AJaeger#status log nl01 and nl02 restarted to recover nodes in deletion12:47
openstackstatusAJaeger: finished logging12:47
*** markvoelker has quit IRC12:48
AJaegerShrews: done - or do you want to write more - or on broader scale? Enjoy your coffee!12:48
*** pgadiya has joined #openstack-infra12:49
*** pcaruana|afk| has quit IRC12:49
*** krtaylor_ has quit IRC12:50
ShrewsAJaeger: thx. good enough for now i guess. i don't yet know enough about the problem to say more12:50
mpetersonhello, I was wondering if you know what might be causing a job not to go into the RUN phase at all? it does PRE and then POST all successful but not RUN12:51
*** e0ne has quit IRC12:52
*** e0ne has joined #openstack-infra12:53
*** andreas_s has joined #openstack-infra12:58
pabelangerwe'll be deleting infracloud-vanilla, and moving compute node to chocolate. We lost the controller yesterday to a bad HDD12:58
pabelangerhttps://review.openstack.org/532705/ should have disable it12:59
*** ankkumar has quit IRC12:59
*** jpena|lunch is now known as jpena13:00
*** pcaruana|afk| has joined #openstack-infra13:02
*** andreas_s has quit IRC13:04
*** vsaienk0 has joined #openstack-infra13:04
*** dprince has joined #openstack-infra13:08
*** jaosorior has quit IRC13:09
AJaegermpeterson: do you have logs?13:10
*** olaph has quit IRC13:13
*** olaph has joined #openstack-infra13:14
*** janki has joined #openstack-infra13:14
mpetersonAJaeger: yes, patches 530642 and 51735913:16
mpetersonAJaeger: first one in devstack-tempest, second one in networking-odl-functional-carbon13:16
*** pcichy has quit IRC13:17
AJaegermpeterson: we have on https://wiki.openstack.org/wiki/Infrastructure_Status item "another set of broken images has been in use from about 06:00-11:00 UTC, reverted once more to the previous one".13:19
*** e0ne has quit IRC13:20
AJaegermpeterson: that was at 14:00 and your run was at that time - and would explain it. I'm just wondering whether the time span is accurate.13:20
fricklerpabelanger: the issue seems to have been that nodepool still wanted to delete nodes on vanilla and that may have blocked other deletes13:22
AJaegermpeterson: wait, those are changes to the Zuul v3 config itself. I wonder whether there's a bug in them that lets Zuul skip the run playbooks.13:22
AJaegermpeterson: please discuss with rest of the team later...13:22
pabelangerfrickler: ouch, okay. We should be able to write a unit test for that in nodepool at least and see what happens with a bad provider13:23
mpetersonAJaeger: yeah, I think it is probably an issue in layer 8 (aka me) rather than the infra13:23
mpetersonAJaeger: sure, when would the rest team be in aprox?13:24
*** katkapilatova1 has joined #openstack-infra13:24
AJaegermpeterson: US best - so during the next three hours...13:24
AJaegerUS based I mean13:24
openstackgerritMerged openstack-infra/project-config master: Remove Neutron legacy jobs definition  https://review.openstack.org/53050013:24
*** e0ne has joined #openstack-infra13:24
*** katkapilatova1 has left #openstack-infra13:25
*** bobh has quit IRC13:25
mpetersonAJaeger: okey I'm EMEA based, I hope I'll still be around, thanks13:25
fricklerpabelanger: Shrews: fyi there are still errors about this happening now on nl02, not sure about the impact13:26
pabelangerfrickler: specific to a cloud?13:27
Shrewspabelanger: you'll keep seeing them for vanilla, unless we remove that provider entirely13:28
*** kiennt26_ has joined #openstack-infra13:28
Shrewsi don't see any deleted instances that aren't getting cleaned up in zookeeper except for vanilla13:29
*** panda is now known as panda|ruck13:29
Shrewspabelanger: were there any issues with the zookeeper server yesterday evening? looks like we lost connection to it around 2018-01-10 22:27:18,481 UTC (5:27pm eastern)13:30
*** e0ne has quit IRC13:30
Shrewsi think that's about when this delete problem started13:31
*** eyalb has left #openstack-infra13:31
Shrewshrm, we probably didn't notice since it recovered about a second later13:33
*** e0ne has joined #openstack-infra13:33
*** apetrich has joined #openstack-infra13:33
*** e0ne has quit IRC13:34
frickler"nodepool list" still shows about 50 nodes in infracloud-vanilla in state deleting13:35
mgagnepabelanger: any known/pending issue with inap? I keep seeing inap mentioned and I'm starting to wonder if there is an underlying issue on our side that needs to be addressed.13:35
pabelangerShrews: yah, working on patches to remove vanilla now13:36
pabelangerShrews: also, not aware of any issues with zookeeper13:36
*** rlandy has joined #openstack-infra13:36
*** trown|outtypewww is now known as trown13:37
Shrewspabelanger: fyi, if we never re-enable vanilla, we'll need to manually delete its instances. might even need to cleanup zookeeper nodes for it.13:38
Shrewsi guess we can't do the instances without a working controller though13:39
openstackgerritZara proposed openstack-infra/storyboard-webclient master: Fix reference to angular-bootstrap in package.json  https://review.openstack.org/53281913:40
pabelangerShrews: Agree13:41
openstackgerritPaul Belanger proposed openstack-infra/project-config master: Remove infracloud-vanilla from nodepool  https://review.openstack.org/53282013:41
pabelangerShrews: frickler: ^13:41
*** rosmaita has joined #openstack-infra13:41
pabelangerShrews: it might be a good idea to maybe come up with a tool or docs to help do the clean up for users in this case of a dead cloud.13:42
pabelangerbefore we'd have to purge the database, would be good to figure out the syntax in zookeeper13:42
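Something like the following is the shape of the tool pabelanger is asking for: a kazoo script that purges a dead provider's node records directly from ZooKeeper. It assumes nodepool's /nodepool/nodes/<id> znode layout with a 'provider' field in each node's JSON; verify that against the real schema before running anything like it:

    import json

    from kazoo.client import KazooClient

    def purge_provider_nodes(zk_hosts, provider, dry_run=True):
        """Remove znodes for nodes that belong to a dead provider."""
        client = KazooClient(hosts=zk_hosts)
        client.start()
        try:
            for node_id in client.get_children('/nodepool/nodes'):
                path = '/nodepool/nodes/' + node_id
                data, _ = client.get(path)
                node = json.loads(data.decode('utf8'))
                if node.get('provider') != provider:
                    continue
                print('%s node %s (state=%s)' % (
                    'would delete' if dry_run else 'deleting',
                    node_id, node.get('state')))
                if not dry_run:
                    client.delete(path, recursive=True)
        finally:
            client.stop()

    # e.g. purge_provider_nodes('zk.example.org:2181', 'infracloud-vanilla')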
vsaienk0AJaeger: pabelanger please add to your review queue https://review.openstack.org/#/c/531375/ - adds releasenotes job to n-g-s13:44
*** lucas-hungry is now known as lucasagomes13:44
*** markvoelker has joined #openstack-infra13:44
*** wolverineav has joined #openstack-infra13:46
Shrewspabelanger: ++  Having someone do anything manual in zookeeper is not a friendly user experience13:47
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Improve logging around ZooKeeper suspension  https://review.openstack.org/53282313:48
Shrews^^^ might help with debugging this situation in the future13:49
*** tosky has quit IRC13:50
fricklerShrews: pabelanger: different issue, would this be correct if I want to hold a node for debugging? zuul autohold --tenant openstack --project openstack/cookbook-openstack-common --job openstack-chef-repo-integration --reason "frickler: debug 523030" --count 113:51
*** masber has quit IRC13:52
Shrewsfrickler: i believe that's correct13:52
pabelangeryah13:52
pabelangerlooks right13:52
Shrewsfrickler: just make sure to delete the node when you're done with it13:52
fricklerShrews: sure13:53
*** tosky has joined #openstack-infra13:54
fricklerhmm, is it expected that "zuul autohold-list" takes like forever? or is that maybe also blocking, trying to access vanilla?13:56
*** pgadiya has quit IRC13:56
*** e0ne has joined #openstack-infra13:58
fricklermy autohold command also doesn't proceed. maybe I'll just wait with that until nodepool issues are sorted out13:58
*** alexchadin has joined #openstack-infra13:59
*** shardy has quit IRC14:02
pabelangerdid nodepool restart?14:03
pabelangergrafana.o.o says we have zero nodes online14:03
*** kiennt26_ has quit IRC14:04
pabelangerhttp://grafana.openstack.org/dashboard/db/nodepool14:04
pabelangerfrickler: Shrews ^14:04
*** kiennt26_ has joined #openstack-infra14:05
pabelangeroh, maybe we just had a gate reset14:05
pabelangerbut ~500 nodes just deleted14:06
Shrewspabelanger: yes, restarted launchers earlier14:06
fricklerpabelanger: Shrews: https://review.openstack.org/532303 failed at the head of integrated gate14:06
pabelangerShrews: this just happened a minute ago14:06
pabelangeryah, must be a gate reset14:07
*** eharney has joined #openstack-infra14:07
pabelangerall our nodes tied up in gate pipeline14:07
*** efried has joined #openstack-infra14:08
fricklerhttp://logs.openstack.org/03/532303/1/gate/openstack-tox-py35/c7bad8c/job-output.txt.gz has a timeout and ssh host id changed warning14:08
pabelangerlooks like citycloud14:09
*** dave-mccowan has joined #openstack-infra14:10
*** slaweq has joined #openstack-infra14:12
*** dave-mccowan has quit IRC14:14
rosmaitahowdy infra folks ... quick question: when a gate job seems to be stuck in queued for a long time, is there any action i can take (other than be patient or go grab a coffee)?14:14
fricklerrosmaita: you can tell us which job it is so we can take a closer look14:15
*** dave-mccowan has joined #openstack-infra14:15
fricklerrosmaita: but also we have a pretty large backlog currently due to various issues during the last couple of days14:16
rosmaitafrickler thanks! the one i was interested in has actually been picked up14:16
*** abhishekk has joined #openstack-infra14:17
*** slaweq has quit IRC14:17
*** markvoelker has quit IRC14:18
*** Goneri has joined #openstack-infra14:18
*** edmondsw has joined #openstack-infra14:21
*** tomhambleton_ has joined #openstack-infra14:21
*** sdoran has joined #openstack-infra14:21
*** evin has quit IRC14:24
*** edmondsw_ has joined #openstack-infra14:24
*** psachin has quit IRC14:24
*** esberglu has joined #openstack-infra14:28
openstackgerritChandan Kumar proposed openstack-infra/project-config master: Switch to tempest-plugin-jobs for ec2api-tempest-plugin  https://review.openstack.org/53283514:28
*** edmondsw has quit IRC14:28
chandankumarAJaeger: regarding uploading package to pypi, respective team will take care of that14:29
chandankumarfor their tempest plugins14:29
mordredShrews: morning! anything I should be looking at?14:29
*** ijw has joined #openstack-infra14:31
Shrewsmordred: no, don't think so14:32
*** rosmaita has quit IRC14:34
*** jkilpatr has quit IRC14:39
*** jkilpatr has joined #openstack-infra14:45
*** edmondsw_ is now known as edmondsw14:45
*** abhishekk has quit IRC14:46
openstackgerritTytus Kurek proposed openstack-infra/project-config master: Add charm-interface-designate project  https://review.openstack.org/52937914:46
*** hongbin has joined #openstack-infra14:48
*** Swami has joined #openstack-infra14:52
AJaegerchandankumar: once all are created, tell me in the patch and we can merge...14:55
chandankumarAJaeger: sure!14:56
*** hoangcx_ has joined #openstack-infra14:57
hoangcx_AJaeger: Hi, Do you know why https://docs.openstack.org/neutron-vpnaas/latest/ is not live even if we merged a patch in our repos recently14:58
*** eharney has quit IRC15:00
hoangcx_AJaeger: I mean https://review.openstack.org/#/c/522695/ merged in project-config and the we megered one patch in vpnaas to reflect the change. But I don't see the link alive yet.15:00
*** gouthamr has joined #openstack-infra15:01
hoangcx_s/the we/then we15:01
*** kopecmartin has quit IRC15:02
*** markvoelker has joined #openstack-infra15:02
*** markvoelker has quit IRC15:02
*** mriedem has joined #openstack-infra15:02
AJaegerhoangcx_: check http://zuulv3.openstack.org/ - the post queue should contain your job15:03
*** eharney has joined #openstack-infra15:03
AJaegerhoangcx_: we have a backlog of 11+ hours, and your change only merged 1h ago15:04
openstackgerritsebastian marcet proposed openstack-infra/openstackid-resources master: Added new endpoint merge speakers  https://review.openstack.org/53284415:04
*** ihrachys has quit IRC15:05
*** ihrachys has joined #openstack-infra15:05
*** kgiusti has joined #openstack-infra15:08
*** shardy has joined #openstack-infra15:08
fricklerinfra-root: I've seen multiple gate failures with "node_failure" now, but don't know how to debug these further, zuul.log just says "node request failed". http://paste.openstack.org/show/642701/ and also 526061,215:08
hoangcx_AJaeger: Clear the point now. I will check it tomorrow. Sorry for trouble and thank you for pointing it out.15:11
openstackgerritMerged openstack-infra/openstackid-resources master: Added new endpoint merge speakers  https://review.openstack.org/53284415:11
*** hoangcx_ has quit IRC15:12
*** Apoorva has joined #openstack-infra15:12
*** hjensas has quit IRC15:12
*** jkilpatr has quit IRC15:13
*** jkilpatr has joined #openstack-infra15:13
*** jkilpatr has quit IRC15:13
*** jkilpatr has joined #openstack-infra15:13
*** kopecmartin has joined #openstack-infra15:15
*** bmjen has joined #openstack-infra15:15
*** Hunner has joined #openstack-infra15:16
*** Hunner has quit IRC15:16
*** Hunner has joined #openstack-infra15:16
*** hamzy has quit IRC15:17
*** edmondsw has quit IRC15:18
*** hashar is now known as hasharAway15:19
Shrewsfrickler: i'm thinking that is likely related to the vanilla failure. vanilla worker thread in the launcher is getting assigned the request, but it's throwing an exception while trying to determine if it has quota to handle it15:20
Shrewsfrickler: i think we need to cleanup exception handling there. it's newish code15:20
Shrewsoh, the exception is handled correctly. it fails the request15:21
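In other words: the provider worker accepts the request, the dynamic quota lookup hits the dead cloud's API and raises, and the handled exception turns into a failed request, which shows up as NODE_FAILURE. The kind of hardening being hinted at would look roughly like this (illustrative names, not nodepool's actual code):

    import logging

    log = logging.getLogger('launcher')

    def has_quota(provider, request):
        """Best-effort quota check that degrades instead of failing the request."""
        try:
            # Dynamic path: ask the cloud what is actually available right now.
            limits = provider.get_compute_limits()  # illustrative API call
            available = limits.max_instances - limits.used_instances
        except Exception:
            # A dead provider (keystone unreachable, etc.) should not turn into a
            # NODE_FAILURE; fall back to the static max-servers bookkeeping, or
            # decline so another provider can pick the request up instead.
            log.exception('Could not query quota for %s', provider.name)
            available = provider.max_servers - provider.launched_nodes
        return available >= len(request.node_types)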
sshnaidm|afkpabelanger, hi, we have connection errors to nodes since yesterday, is it known issue? http://logs.openstack.org/63/531563/3/gate/tripleo-ci-centos-7-scenario004-multinode-oooq-container/357414a/job-output.txt.gz#_2018-01-11_14_06_38_64818215:21
*** sshnaidm|afk is now known as sshnaidm15:21
*** kopecmartin has quit IRC15:21
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Log request ID on request failure  https://review.openstack.org/53285715:23
*** markvoelker has joined #openstack-infra15:24
Shrewspabelanger: approved https://review.openstack.org/53282015:24
Shrewsfrickler: ^^^ should help15:24
*** edmondsw has joined #openstack-infra15:24
*** bobh has joined #openstack-infra15:24
*** rmcall has joined #openstack-infra15:26
*** pcichy has joined #openstack-infra15:27
*** pcichy has joined #openstack-infra15:27
*** alexchad_ has joined #openstack-infra15:28
dmsimardFYI there are github unicorns going around15:30
dmsimardin case some jobs fail because of that15:30
*** slaweq_ has quit IRC15:30
dmsimardhttps://status.github.com/messages confirmed outage15:31
*** slaweq has joined #openstack-infra15:31
*** alexchadin has quit IRC15:31
*** afred312 has quit IRC15:31
*** afred312 has joined #openstack-infra15:32
pabelangerShrews: danke15:32
*** kopecmartin has joined #openstack-infra15:33
*** Swami has quit IRC15:33
pabelangerdmsimard: shouldn't be much of an impact, minus replication15:34
*** janki has quit IRC15:35
*** shardy has quit IRC15:35
*** slaweq has quit IRC15:35
*** yamamoto has quit IRC15:37
*** felipemonteiro has joined #openstack-infra15:37
*** alexchad_ has quit IRC15:37
*** links has quit IRC15:38
*** yamamoto has joined #openstack-infra15:38
*** felipemonteiro_ has joined #openstack-infra15:38
*** caphrim007 has quit IRC15:40
*** caphrim007 has joined #openstack-infra15:40
corvusShrews: i think we can approve 532594 now and if you make the change i suggested in 532709 we can merge it too.15:42
*** yamamoto has quit IRC15:43
*** felipemonteiro has quit IRC15:43
mwhahahagetting post failures again today15:44
openstackgerritDavid Shrewsbury proposed openstack-infra/system-config master: Zuul executor needs to open port 7900 now.  https://review.openstack.org/53270915:45
Shrewscorvus: done15:45
mwhahahahttp://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22SSH%20Error%3A%20data%20could%20not%20be%20sent%20to%20remote%20host%5C%2215:45
*** xarses has joined #openstack-infra15:45
*** caphrim007 has quit IRC15:45
dmsimardmwhahaha: we're already tracking it in elastic-recheck15:45
dmsimardI've spent some amount of time investigating this week without being able to pinpoint a specific issue that wouldn't be related to load15:46
pabelangercorvus: mordred: dmsimard: this doesn't look healthy: http://paste.openstack.org/show/642816/15:46
pabelangerfirst time I've seen DB errors on ARA15:46
pabelangers/error/warning15:47
dmsimardI've discussed that issue with upstream Ansible recently15:47
dmsimardIt's an Ansible bug15:47
*** kiennt26_ has quit IRC15:48
*** hamzy has joined #openstack-infra15:48
dmsimardtl;dr, in some circumstances Ansible can pass a non-boolean ignore_errors down to callbacks (breaking the "contract")15:48
* dmsimard searches logs15:51
*** esberglu has quit IRC15:52
dmsimard2018-01-05 18:16:46     @sivel  dmsimard: I just did a little testing, your `yes` bug was fixed by bcoca in https://github.com/ansible/ansible/commit/aa54a3510f6f14491808291a0300da609b42753d15:52
dmsimardThat's landed in 2.4.215:52
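For context, the callback contract in question is v2_runner_on_failed(result, ignore_errors=False); before 2.4.2 Ansible could pass the raw string (e.g. 'yes') through instead of a bool. A defensive coercion on the callback side would look roughly like this (a sketch of the pattern only, not ARA's actual code):

    from ansible.plugins.callback import CallbackBase

    TRUTHY = frozenset(('yes', 'on', 'true', '1'))

    def to_bool(value):
        """Coerce the loosely typed ignore_errors value older Ansible may pass."""
        if isinstance(value, bool):
            return value
        return str(value).strip().lower() in TRUTHY

    class DefensiveCallback(CallbackBase):
        CALLBACK_VERSION = 2.0
        CALLBACK_NAME = 'defensive_example'

        def v2_runner_on_failed(self, result, ignore_errors=False):
            # Pre-2.4.2 Ansible could hand us 'yes' here instead of True; normalize
            # before persisting so storage/database code only ever sees a real bool.
            ignore_errors = to_bool(ignore_errors)
            self._record_failure(result, ignore_errors)

        def _record_failure(self, result, ignore_errors):
            pass  # placeholder for whatever the real callback records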
Shrewsdmsimard: sounds like we need https://review.openstack.org/53100915:53
dmsimardShrews: yeah I was just about to mention that15:53
dmsimard2.4.2 also contains other fixes we need15:53
*** alexchadin has joined #openstack-infra15:53
Shrewsdmsimard: that might require coordination with infra-root about the upgrade path15:54
pabelangeryah, maybe even PTG?15:54
pabelangersince we are all in a room15:54
dmsimardShrews: right, but it also means we'll be upgrading all the jobs, playbooks, roles, etc... to 2.4.215:54
pabelangers/all/most all/15:54
dmsimardand you know what it means to upgrade from a minor ansible release to the next15:54
Shrewsi'm going to -2 that, lest someone approve before we discussed it (based on last night's experience)15:54
pabelangerdmsimard: yah, part of my concern as well, some playbook / role some place might not be 2.4.2 compat15:55
pabelangerso, we'll need to be on deck to help migrate that15:55
pabelangermaybe we need to bring zuulv3-dev.o.o online as thirdparty to do some testing15:55
pabelangerwould be a good way to get some code coverage on 2.4.215:56
dmsimardpabelanger: and it's not just z-j and o-z-j, it's every project's playbooks/roles, across branches, etc.15:56
pabelangeryes15:56
*** ykarel is now known as ykarel|away15:57
*** derekh has quit IRC15:59
Shrewsi have to step away for a bit now for an appointment. bbl16:01
*** Guest88600 has quit IRC16:03
toskydmsimard: re those POST_FAILURE you discussed earlier, should I just recheck or wait?16:03
*** armaan has quit IRC16:04
*** sree has joined #openstack-infra16:04
dmsimardtosky: there's a lot of load in general right now16:04
*** ricolin has joined #openstack-infra16:04
dmsimardcorvus: I know you want to wait for the RAM governor but the executors aren't keeping up..16:04
dmsimarddo you have any suggestions ?16:05
*** ykarel|away has quit IRC16:05
dmsimardthe graphs are fairly thundering herd-ish16:05
*** sbra has joined #openstack-infra16:05
toskyI guess I can wait a bit until people leave the offices in the US :)16:06
corvusdmsimard: fix the broken executor?16:06
corvusdmsimard: and then implement the ram governor?  perhaps that should be considered a high priority?16:07
dmsimardShrews: should we rebuild ze04 ? I'm not sure what's the takeaway from your email16:08
*** jbadiapa has quit IRC16:08
*** sree has quit IRC16:09
*** slaweq has joined #openstack-infra16:09
pabelangerI don't think we need a rebuild, we just need to finish landing tcp/7900 changes in puppet-zuul16:09
pabelangerthen, schedule restarts16:09
corvusdon't schedule, just do.  but there will be permissions that need fixing.16:09
pabelanger+3 on https://review.openstack.org/532709/ for firewall16:10
corvushttps://review.openstack.org/532594 is the other16:10
pabelangergreat, +316:11
pabelangerchecking permissions16:11
corvuswhile those are landing, someone can go on ze04 and fix filesystem permissions16:11
openstackgerritMerged openstack-infra/project-config master: Remove infracloud-vanilla from nodepool  https://review.openstack.org/53282016:12
*** slaweq has quit IRC16:13
*** slaweq_ has joined #openstack-infra16:13
pabelangercorvus: any objections to also having puppet-zuul check them?16:15
corvuspabelanger: why?16:15
*** eumel8 has quit IRC16:16
corvuslet's just fix this and move on.  it was a one time mistake.16:16
pabelangerok16:16
*** armaan has joined #openstack-infra16:17
*** slaweq_ has quit IRC16:17
pabelangeractually, we'll need to change these on other executors too, right? Currently /var/log/zuul is root:root16:18
pabelangeractually16:18
pabelangerzuul:root16:18
*** kopecmartin has quit IRC16:19
pabelangerokay, we're safe there16:19
*** kopecmartin has joined #openstack-infra16:19
efriedqq, does 'recheck' kick a currently-running check out of the queue?  With how long stuff is taking, I want to be able to recheck as soon as I see any failure rather than waiting for the job to finish.16:19
*** armaan has quit IRC16:21
pabelangerodd, /var/log/zuul/executor.log is root:root; I would have thought it was zuul:zuul or zuul:root on ze0216:22
*** e0ne has quit IRC16:24
*** alexchadin has quit IRC16:24
*** alexchadin has joined #openstack-infra16:25
*** esberglu has joined #openstack-infra16:26
AJaegerefried: no, it does not. one way to stop: rebase16:26
pabelangerokay, I think we'll also need fix permissions on /var/lib/zuul for all executors after we stop16:26
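For reference, a minimal sketch of the ownership fix being discussed, assuming the executor service name and the zuul:zuul ownership are what the puppet change expects:
    # stop the executor, hand its log and state dirs to the zuul user, then start it again
    sudo service zuul-executor stop
    sudo chown -R zuul:zuul /var/log/zuul /var/lib/zuul
    sudo service zuul-executor start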
efriedAJaeger Okay, thanks.  I was getting tempted to do that anyway to see if it knocked anything loose :)16:27
mriedemwhat does NODE_FAILURE as a job result generally mean?16:28
*** alexchadin has quit IRC16:29
clarkbmriedem: shrews was saying earlier that he thought it was related to the vanilla infracloud going away and quota calculations not working16:29
*** sree has joined #openstack-infra16:30
clarkbShrews: does ^ mean max-servers: 0 is no longer effective? or maybe we didn't get that merged quickly enough to avoid problems?16:30
mriedemso it's like creating a vm fails ?16:30
mriedemjust suspicious because it's on a new job definition i'm working on and it's in the experimental queue, and zuul never posted results on it last night, and now it failed this morning with node_failure16:30
clarkbmriedem: that was my understanding of it, due to a bug in new code that tries to more dynamically determine available quota, rather than statically assuming that what we have told nodepool about its quota is correct16:30
*** rosmaita has joined #openstack-infra16:31
mriedemdynamically determine available quota, good luck16:31
mriedemupgrade to pike where there are no more quota reservations to screw everything up16:31
clarkbShrews: pabelanger I'm not sure that removing vanilla entirely should be our fix in this case. Nodepool should handle cloud outages gracefully (and always has in the past)16:32
*** shardy has joined #openstack-infra16:32
*** rwellum has left #openstack-infra16:32
mriedemhere is an unrelated question, i see a change that's been approved for a few days now is sitting in the gate queue for 9 hours, and it's also sitting in the check queue16:32
*** ijw has quit IRC16:32
mriedemshould i recheck it? or just wait for it to drop from the queues altogether?16:32
pabelangerclarkb: agree, it sounds like there was some bug preventing nodes from deleting. Something we should fix for sure16:33
pabelangerclarkb: I removed vanilla to stop image uploads by nodepool-builder16:33
dmsimardmriedem: might be side effect of people re-checking the patches and the few zuul restarts that have happened where we re-enqueued jobs16:33
*** pblaho1 has joined #openstack-infra16:33
mriedemis there a timeout where a change just gets kicked out altogether? like 10 hours or something?16:33
mriedemgiven the reboots and timeouts and such, i assume people are just rechecking like mad to try and get anything in16:34
openstackgerritMerged openstack-infra/puppet-zuul master: Start executor as 'zuul' user  https://review.openstack.org/53259416:34
*** sree has quit IRC16:34
*** ramishra has quit IRC16:34
*** vsaienk0 has quit IRC16:34
*** maciejjozefczyk has quit IRC16:35
clarkbpabelanger: but deleting nodes should be independent of handling node requests, right? In any case it's something we should make sure nodepool handles gracefully so that losing a cloud doesn't affect jobs16:35
*** pblaho has quit IRC16:35
*** HenryG has quit IRC16:36
dmsimardmriedem: the rechecks are exacerbating the load issues that we have right now so we have a bit of a thundering herd effect going on.16:36
pabelangerclarkb: agree, I'd have to defer to Shrews on why that didn't work. But I would guess we can reproduce in a unit test to see why16:36
mriedemthat's what i was wondering - if it's just better to not recheck16:36
dmsimardmriedem: things are slowly stabilizing and we also have a zuul executor that isn't running right now (which is not helping) so our current focus is getting that executor back in line (and resolving the root cause of why it went out in the first place)16:38
mriedemack16:38
dmsimardmriedem: a good idea might be to keep an eye on the zuul-executor graphs here: http://grafana.openstack.org/dashboard/db/zuul-status16:38
mriedemdmsimard: what's a healthy graph look like?16:39
mriedemnot maxed out?16:39
dmsimardmriedem: one where the load of the executors is not 100 :)16:39
mriedemheh is that all16:40
mriedemok will do - thanks for the link btw, i always forget where that is (bookmarked now)16:40
fungiAJaeger: efried: you may be able to abort running check jobs and restart them by abandoning and restoring the change, so you don't need gratuitous rebases16:40
dmsimardfungi: I'm not sure restoring the change triggers the jobs by itself, it might need a recheck after restore but I'm not sure16:41
efriedfungi That's a nice tip.  Though wouldn't that require re+2 as well as re+W?16:41
efriedfungi In any case, rebase wasn't gonna hurt this particular one.16:41
*** HenryG has joined #openstack-infra16:41
fungiefried: as opposed to rebasing? abandon doesn't remove any votes16:41
dmsimardefried: a rebase is likely to require re-voting as well FWIW (though I'm not entirely sure on the criteria about what votes are removed when)16:41
efrieddmsimard In my recent experience, code reviews stick around on a (non-manual) rebase, but +Ws are cleared.16:42
fungidmsimard: it's configurable in gerrit, but generally for our projects it's set to keep code review votes (and clear verified and workflow votes) when the diff of the new change basically matches the diff of the old change. used to use git patch-id to achieve that but i haven't looked at the code in gerrit recently to know whether that's still the case16:43
*** kopecmartin has quit IRC16:43
fungi(`git patch-id` is a sha-1 checksum of the diff's contents, not to be confused with the commit id which covers additional metadata like the commit message and timestamps)16:45
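As a concrete illustration, comparing the patch-ids of two revisions shows whether their diffs are identical (the revisions below are just examples):
    # prints "<patch-id> <commit-id>"; rebases that keep the same diff keep the same patch-id
    git show HEAD | git patch-id
    git show HEAD~1 | git patch-id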
dmsimardfungi: yeah I assumed it had something to do along those lines -- I've modified commit messages before without having to re-run jobs for example16:47
*** dtantsur is now known as dtantsur|afk16:47
fungihopefully not in our gerrit. a new patch set should clear the verified vote and cause new jobs to run16:48
dmsimardmaybe it wasn't in the upstream gerrit, I don't remember16:49
fungialso clears workflow votes so that we don't automatically send it into the gate without an explicit reapproval16:49
fungibut yes, that's configurable per label, so other gerrits may preserve verified votes on rebase, for example16:49
clarkb(we've had to do this because tests whose results are affected by the commit message are or were common)16:49
fungifor similar reasons, we clear all votes (including code review) when a commit message is changed16:50
fungibut that's also configurable16:50
fungii love logging into the rax dashboard. you never know what new tickets will be waiting16:52
fungi"This message is to inform you that the host your cloud server 'ze04.openstack.org' resides on became unresponsive. We have rebooted the server and will continue to monitor it for any further alerts."16:52
pabelangerokay, think I've cleaned up all permissions on ze0416:53
pabelangerwill start it up here in a moment once puppet runs16:53
clarkbpabelanger: that is to pick up the init script change?16:53
clarkb(puppet that is)16:53
pabelangerclarkb: yah16:53
pabelangerbut it does look like logging on the executor is being written as root, so we also need to set up permissions on the others after we stop16:54
*** armaan has joined #openstack-infra16:55
fungi#status log previously mentioned trove maintenance activities in rackspace have been postponed/cancelled and can be ignored16:56
openstackstatusfungi: finished logging16:56
*** pcaruana|afk| has quit IRC16:57
*** felipemonteiro_ has quit IRC16:58
*** florianf has quit IRC17:00
*** dsariel has quit IRC17:00
fungittx: are you still using odsreg.openstack.org or can we delete that instance?17:00
ttxfungi: I thought we cleared it some times ago, I remember someone asking me about it17:01
fungithat may have been forumtopics17:01
ttxthose days we only use ptgbot on one side and forumtopics.o.o17:01
fungii thought i remembered asking you about odsreg but scoured my irc logs and came up with nothing so thought it best to ask again17:02
fungii'm happy to delete odsreg.openstack.org now if you don't object17:02
ttxfungi:  go for it!17:02
fungidone. thanks for confirming ttx!17:03
fungi#status log deleted old odsreg.openstack.org instance17:03
openstackstatusfungi: finished logging17:03
*** iyamahat has joined #openstack-infra17:03
pabelangerokay, ze04 is coming online17:10
*** pcichy has quit IRC17:10
*** iyamahat has quit IRC17:11
pabelangerand running jobs17:14
pabelangerstill waiting for firewall to open up I think17:14
pabelangerfinger dd7cbca51ff543389eeb43dac537557f@zuulv3.openstack.org17:14
*** electrofelix has left #openstack-infra17:16
*** links has joined #openstack-infra17:17
*** links has quit IRC17:17
*** jpich has quit IRC17:17
*** lucasagomes is now known as lucas-afk17:19
clarkbI'm really confused by this node failure stuff17:20
clarkbI'm looking at a node request that went in at about 1420UTC got rejected by citycloud and vexxhost then appears to go idle. Then 2 hours later infracloud picks it up and then the status is marked as failed?17:21
pabelangerI haven't looked, is there a log?17:21
clarkbpabelanger: ya I'm trying to put it together for this example I'll have a paste up soon17:21
pabelangerI did see some citycloud failures this morning about SSH hostkey changing, wonder if we had some ghost instances17:22
pabelangerbut haven't looked more into it17:22
clarkbthis is way before any ssh can happen17:23
clarkbits all in the "please boot me a node" negotiation17:23
*** amotoki has quit IRC17:24
*** dhajare has quit IRC17:24
pabelangerShrews: do you have a moment to look at fingergw.log? Seeing some exceptions around routing; possibly we need to improve logging, or it's a bug17:24
*** electrofelix has joined #openstack-infra17:26
*** Apoorva has quit IRC17:28
*** hjensas has joined #openstack-infra17:28
fungiclarkb: possible when we restarted the launchers we upgraded to a regression of some sort?17:28
Shrewspabelanger: i just got back from the appointment i mentioned earlier. let me catch up on backscroll17:29
pabelangerokay, I think it might be firewall17:29
pabelangerconfirming that is open17:29
pabelangerah, yup: https://review.openstack.org/532709/ isn't merged17:30
pabelangerUmm17:31
pabelanger[Thu Jan 11 17:26:37 2018] zuul-scheduler[19039]: segfault at a9 ip 0000000000513ef4 sp 00007f2411437ea8 error 4 in python3.5[400000+3a9000]17:31
*** trown is now known as trown|outtypewww17:31
pabelangerthat doesn't look good17:31
pabelangerhttp://zuulv3.openstack.org/ is also down17:31
fungidoes that correspond to another puppet exec upgrading zuul or its deps?17:32
Shrewsclarkb: the new quota checks from tristanC were failing (b/c the query to the provider wasn't working). i think we should short-circuit that if we've said max-servers is 0. it worked before on a failed provider b/c it didn't query that provider, and just went off max-servers17:32
pabelangerfungi: let me check17:32
clarkbpabelanger: Shrews corvus http://paste.openstack.org/show/642927/ (not to distract from the segfault) thats what I can find about infracloud vanilla17:32
*** armax has joined #openstack-infra17:32
pabelangerfungi: yes17:33
clarkbShrews: ya I see that but why did it take 2 hours to process that request in the first place?17:33
pabelangerJan 11 17:16:43 zuulv3 puppet-user[26991]: (/Stage[main]/Zuul/Exec[install_zuul]) Triggered 'refresh' from 1 events17:33
Shrewsand by "i think we should short-circuit", i mean "should fix it so that it short-circuits"17:33
*** slaweq has joined #openstack-infra17:33
fungiso far all the segfault events have happened right on the heels (within a minute) of a puppet exec upgrading zuul or a dependency on the server where it happened17:33
clarkbShrews: but also I'm not sure its sufficient to short circuit, nodepool should handle cloud failures gracefully17:33
clarkbShrews: if the cloud is not there then we shouldn't mark the request failed and instead let some other cloud service it17:33
Shrewsclarkb: i dunno. that is new info to me17:34
corvusthe segfault killed the gearman server17:34
Shrewsclarkb: i am not disagreeing with you.17:34
corvusthis is not recoverable17:34
pabelangerI also do not see zuul-web17:35
pabelangercorvus: happy to pass to you to drive if you'd like17:35
clarkbShrews: cool, just pointing out a short circuit on max-servers isn't sufficient to get that behavior17:35
corvuspabelanger: i don't think there's anything to do but to restart with empty queues.17:35
corvuspabelanger: you continue driving17:35
pabelangerokay17:36
clarkbShrews: having a hard time with the logs because we don't seem to log the point where we mark it failed on the nodepool side with the request id (that may be because we've bubbled up super far after the cloud exception)17:36
Shrewsclarkb: https://review.openstack.org/53285717:36
pabelangerokay, stopping zuul-scheduler per corvus's recommendation17:36
*** coolsvap has quit IRC17:36
*** slaweq has quit IRC17:37
pabelangerscheduler starting17:37
Shrewsclarkb: pabelanger: fwiw, there were 2 separate issues i found this morning in the firefight. the vanilla issue is not related to the stuck delete issue17:39
pabelangerokay, I've started, and stopped zuul-web. It seems to raising exceptions in scheduler17:39
pabelangerAttributeError: 'NoneType' object has no attribute 'layout'17:39
pabelangercat jobs now running17:39
Shrewsthe vanilla issue not being handled well we can fix. i don't have an answer on the delete thing17:40
corvuspabelanger: you can start zuul-web, that's harmless17:40
pabelangercorvus: okay17:40
pabelangerzuul-web started17:40
*** vsaienk0 has joined #openstack-infra17:41
corvusis there any place where we have the full output from the pip install of zuul?17:41
*** cshastri has quit IRC17:41
corvusie, puppet reports or anything?17:41
Shrewspabelanger: did you still need me to look at something?17:42
pabelangerokay, scheduler back online17:42
corvuspabelanger: i suggest sending a notice17:42
pabelangerShrews: I don't think so, firewall change hasn't landed yet17:42
pabelangercorvus: agree17:42
*** jamesmcarthur has joined #openstack-infra17:43
fungicorvus: i believe pip will log in the homedir of the account running that exec, so presumably ~root/17:43
fungilooking17:43
clarkbShrews: fwiw I think the deleted issue is the smaller concern of the two since it doesn't directly affect job results17:43
fungithough i'm not finding it yet17:44
clarkbShrews: comment on https://review.openstack.org/#/c/532857/117:44
pabelangerhow does this sound: #status log Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets17:44
clarkbnow to catch up on the zuul scheduler thing17:45
fungipabelanger: lgtm17:45
pabelanger#status notice Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets17:45
openstackstatuspabelanger: sending notice17:45
clarkbI'm guessing its too late now but maybe we should turn on core dumps?17:46
corvusit's never too late to turn on core dumps17:46
fungiJan 11 17:16:40 zuulv3 kernel: [67122.422307] zuul-web[19121]: segfault at a9 ip 0000000000513ef4 sp 00007fb4116b98a8 error 4 in python3.5[400000+3a9000]17:46
clarkb(might be able to modify that ulimit for a running process somehow?)17:46
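One way to do that without a restart, assuming util-linux's prlimit is available on the host and using pgrep to find the pid:
    # raise the core-file size limit on the already-running scheduler process
    sudo prlimit --pid "$(pgrep -o -f zuul-scheduler)" --core=unlimited:unlimited
    # and point core dumps at a known location (pattern is just an example)
    sudo sysctl -w kernel.core_pattern=/var/crash/core.%e.%p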
fungiJan 11 17:16:43 zuulv3 puppet-user[26991]: (/Stage[main]/Zuul/Exec[install_zuul]) Triggered 'refresh' from 1 events17:46
pabelangerI am unsure where pip install would be logged now17:46
fungithat's pretty tight correlation :/17:46
-openstackstatus- NOTICE: Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets17:47
corvusfungi: the scheduler segfault (which was the geard process) was 10m later?17:47
fungi17:26:38 yeah17:48
corvusi think it would be really useful to know if a certain dependency was updated, or if it was just zuul17:48
corvusthus my query about logs17:48
openstackstatuspabelanger: finished sending notice17:48
fungithis is at least the third time we've seen this happen (last couple times it took out all the executors and maybe the mergers too)17:49
fungiand yeah, i'm looking to see if explicit pip logging needs to be enabled in that exec17:49
pabelangerI did run kick.sh on zuulv3.o.o to attempt to pick up firewall changes, I don't see how, but maybe related?17:50
Shrewsclarkb: i think that log thing is totally separate. there's a bug there, i think, with a handler being present when the provider has been removed17:50
*** vsaienk0 has quit IRC17:51
fungipabelanger: at what time?17:51
Shrewsclarkb: i think we should land the log improvement, then I can look at the other thing17:51
clarkbShrews: run_handler is what creates the self.log object and so if it crashes in run_handler then we hvae no logger17:51
clarkbShrews: I'm fine with that but I also don't know that it will actually give us the logs we want17:51
*** ricolin has quit IRC17:51
pabelangerfungi: I do not have timestamps on my terminal, but it was before I noticed an issue with http://zuulv3.o.o17:51
clarkb(but merging the change isn't a regression in that case, it just also won't fix things?)17:51
fungipabelanger: was the kick happening roughly concurrent with the 17:26:38 segfault for the gearman process or the 17:16:40 segfault for the zuul-web daemon?17:52
clarkbShrews: I'll approve it17:52
pabelangerthe puppet run, was success with no unknown errors17:52
pabelangerfungi: it was after 2018-01-11 17:15:05,152 so it could have been the reason for puppet run17:52
corvusclarkb, Shrews: erm, let's fix it right?17:52
clarkbcorvus: yes we should fix it17:53
fungipabelanger: yeah, i don't see any puppet activity around 17:26:38 just 17:16:4017:53
fungipabelanger: so sounds likely to be the earlier one17:53
clarkbcorvus: its unrelated enough to the existing change and if the exception happens after configuring the logger we will be fine17:53
clarkbso I think we can approve the existing change then swing around and fix the logger17:53
*** Apoorva has joined #openstack-infra17:53
corvusclarkb, Shrews: i'd hate to restart nodepool just to end up without the extra info we need because of that bug17:53
pabelangerfungi: in fact, I only see puppet-user from my kick.sh attempt17:53
pabelangerfungi: I don't see any runs of our wheel for some reason17:54
*** sambetts is now known as sambetts|afk17:54
fungipabelanger: yeah, i wonder if something about our meltdown upgrades yesterday has broken puppeting17:55
*** jistr is now known as jistr|afk17:55
clarkbcorvus: I can remove the +W if you want to see it fixed all at once17:55
pabelangergit.o.o is failing puppet, i think that is blocking zuulv3.o.o from running17:55
pabelangerlooking at git.o.o now17:55
clarkbwent ahead and did that17:55
pabelangeroh, yum-crontab fix didn't land17:56
Shrewsclarkb: corvus: run_handler() is re-entrant, executed multiple times for a request. most of the time, self.log is available. but that race i mentioned is what we need to fix, and is rare17:56
pabelangerthat is odd17:56
pabelangerhttps://review.openstack.org/532331/17:57
pabelangerI thought that merged, apparently not17:57
*** ijw has joined #openstack-infra17:57
pabelangerso, it is possible that was the first puppet run on zuulv3.o.o in a few days17:57
*** yamamoto has joined #openstack-infra18:00
*** nicolasbock has joined #openstack-infra18:00
pabelangerApologies, it does seem my kick.sh was the reason for puppet run18:00
openstackgerritDirk Mueller proposed openstack/diskimage-builder master: Add SUSE Mapping  https://review.openstack.org/53292518:00
corvuspabelanger: well, i mean, puppet runs happen.  it's hardly your fault.  i don't think there's anything about kick.sh that should cause a segfault.18:02
corvusclarkb, pabelanger, fungi: this segfault thing is pretty critical -- we'll sink if zuul just randomly gets killed.  what do we need to do to track it down?18:03
*** ralonsoh has quit IRC18:04
*** jbadiapa has joined #openstack-infra18:04
*** yamamoto has quit IRC18:04
fungicorvus: looking through pip --help i see options for directing it where to log and how verbosely, but we'll probably want to set that in /etc/pip.conf given the mix of explicit pip calls and pip puppet package resources18:04
fungiso i'm looking up the corresponding config options18:05
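A sketch of the kind of pip.conf entry being looked up; the path and option name should be verified against the pip version actually installed:
    # log every pip invocation (explicit calls and puppet package resources alike) to one file
    sudo mkdir -p /var/log/pip
    sudo tee /etc/pip.conf <<'EOF'
    [global]
    log = /var/log/pip/pip.log
    EOF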
corvusmaybe we can figure out what was updated by filesytem timestamps18:05
corvusfungi, pabelanger, clarkb: http://paste.openstack.org/show/642940/18:05
fungimaybe, though with the gearman crash coming after a 10-minute delay that may indicate that the correlation isn't tight enough to be able to match them18:05
corvuswas msgpack the thing from earlier?18:05
fungiyes, it's the one which changed its name i think?18:05
johnsomFYI, we are still seeing "retry_limit" errors that are bogus.18:06
johnsomzuul patch 531514 for example18:06
fungicorvus: also, deploying pip.conf updates widely will depend on getting puppet working again, so also need to look into why that broke18:06
clarkbcorvus: probably a good idea to get core dumps turned on too?18:06
pabelangerI agree we should consider coredumps for zuul, is there any impact to that?18:07
clarkbpabelanger: could fill up our disk (but we can watch it closely)18:07
corvusso does this happen everytime msgpack updates, or just once?18:07
pabelangerI believe msgpack is what dmsimard saw for the executor issue also18:07
fungicorvus: we've had several crashes across all zuul daemons (launchers, mergers) i think, so maybe each time msgpack updates?18:08
dmsimardeverything crashed *again* ? sorry I've been in meetings all morning18:08
dmsimardFWIW we're not the only ones having issues with this... https://github.com/msgpack/msgpack-python/issues/268 && https://github.com/msgpack/msgpack-python/issues/26618:09
fungidmsimard: so far just the daemons on the zuulv3.o.o server, but puppet isn't working (pabelanger ran it manually against that server) so that may be why it hasn't crashed any others yet18:09
dmsimardWe might have to manually uninstall both msgpack-python and msgpack, make sure there's no more files in /usr/local/python3.5/dist-packages (stale distinfo, eggs, etc.) and then re-run the (zuul?) reinstallation or just install msgpack by itself, I don't know.18:10
fungithough interestingly, pip3 list says zuulv3 and ze01 both have the same versions of msgpack and msgpack-python18:10
dmsimardfungi: yeah, that's mentioned in the issues I linked18:10
fungimaybe zuulv3 upgrade was delayed somehow?18:11
corvusokay, so if puppet wasn't running on zuulv3.o.o, then maybe this is the same issue, just delayed, and it's not a systemic problem?18:11
dmsimardwell, when it first happened last sunday, only the ZE and ZM nodes were impacted .. and then monday (I think?) zuulv3.o.o crashed because of msgpack as well18:11
corvusoh, so it has happened before18:11
clarkbpabelanger: yes msgpack is what I think we tracked the executors dying to18:12
corvusin that case, since there was just a msgpack release, i wonder if we can expect the z* servers to all crash shortly?18:12
fungicorvus: yeah, it's tanked all the zuul servers at least twice now, but there may have been two versions of msgpack triggering this18:12
pabelangerdmsimard: when on Monday? We had a swapping issue due to memory, but wasn't aware of a crash18:12
dmsimardfungi: the correlation can be found in /usr/local/lib/python3.5/dist-packages (see timestamps where modules were last updated) and the dmesg timestamps of the general protection fault18:13
fungicorvus: i think what crashed it most recently on the other servers is what just now crashed zuulv3 but was delayed because puppet hasn't run on that server in a while until pabelanger did so manually18:13
clarkbcould it be that the msgpack builds are replacing compiled so files under zuul?18:13
dmsimardthat's how I ended up finding out what was the issue18:13
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Delete the pool thread when a provider is removed  https://review.openstack.org/53293118:13
clarkbexcept we should load everything into memory right? so probably not that18:13
corvusi'm worried that msgpack_python (the deprecated package) was updated, but not msgpack (the replacement)18:14
dmsimardpabelanger: there was two separate incidents for msgpack, one sunday and one monday18:14
dmsimard2018-01-07 20:55:48 UTC (dmsimard) all zuul-mergers and zuul-executors stopped simultaneously after what seems to be a msgpack update which did not get installed correctly: http://paste.openstack.org/raw/640474/ everything is started after reinstalling msgpack properly.18:14
pabelangerfungi: corvus: I believe so, I am going to see if I can reproduce it locally here by installing old version of msgpack, then upgrading it when zuul is running18:14
dmsimard2018-01-08 20:33:57 UTC (dmsimard) the msgpack issue experienced yesterday on zm and ze nodes propagated to zuulv3.o.o and crashed zuul-web and zuul-scheduler with the same python general protection fault. They were started after re-installing msgpack but the contents of the queues were lost.18:14
*** weshay is now known as weshay_interview18:14
pabelangerdmsimard: okay18:15
dmsimardpabelanger: it might require something to hit that region of the memory for the GPF to trigger18:15
fungipypi release history shows msgpack and msgpack-python releases on january 6 and january 918:15
fungione release each on each of those two dates18:15
corvushttps://pypi.python.org/pypi/msgpack/0.5.1 is worth a read18:15
pabelangerlet me see when the last time we ran puppet on zuulv3.o.o was18:16
dmsimardwhy the hell is there two *package names* using the same *module name* ?18:16
fungi0.5.0 for both on the 6th and 0.5.1 on the 9th18:16
*** jpena is now known as jpena|off18:16
pabelangerJan  9 20:33:36 zuulv3 puppet-user[17648]: Finished catalog run in 7.59 seconds18:16
pabelangerwas the last puppet run on zuulv3, before the kick.sh18:17
dmsimardthat's weird, it's almost an exact match from the 01-08 timestamp18:17
fungianybody happen to know off the top of their head what zuul dep is dragging in msgpack?18:17
pabelangerJan  8 21:49:06 zuulv3 puppet-user[10102]: (/Stage[main]/Zuul/Exec[install_zuul]) Triggered 'refresh' from 1 events18:17
pabelangerwas our last install_zuul before kisk.sh18:18
fungi(or msgpack-python i guess)18:18
corvusfungi: no, but i'd like to find out18:18
fungiyeah, was going to track that down next18:18
fungijust didn't know if anyone already had18:18
dmsimardis there something else than puppet that could update python modules ? here's the timestamps from dist-packages: http://paste.openstack.org/raw/642949/18:18
*** dhajare has joined #openstack-infra18:18
corvusfungi: cachecontrol18:19
dmsimardfungi, corvus: the stack trace from msgpack should give us an idea of where it's imported from http://paste.openstack.org/raw/640474/18:19
fungidmsimard: potentially unattended-upgrades if installed from distro packages18:19
*** SumitNaiksatam has joined #openstack-infra18:19
fungicorvus: thanks18:19
corvusdmsimard: confirmed :)18:19
pabelangerdrwxr-sr-x   2 root staff   4096 Jan 11 17:16 msgpack_python-0.5.1.dist-info I think that is what we just installed18:20
dmsimardfungi: doesn't look like it'd be from a package http://paste.openstack.org/raw/642953/18:20
fungilatest release of cachecontrol is what started using msgpack, it seems18:21
dmsimardfungi: but from that last paste (dist-packages), you can see that it's not just msgpack that was updated.. there's zuul there as well18:21
fungiintroduced in 0.12.0, while 0.11.2 and earlier don't use it18:21
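A quick way to confirm that chain on any of the hosts (package names as discussed above):
    # cachecontrol's metadata lists its dependencies, including msgpack as of 0.12.x
    pip3 show cachecontrol | grep -i requires
    # and show which msgpack distributions are actually installed
    pip3 list 2>/dev/null | grep -i msgpack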
*** rmcall has quit IRC18:21
fungidmsimard: right, it happens when we pip install zuul from source each time a new commit lands18:21
dmsimardfungi: so that's not driven by puppet then, it's automatic somehow ?18:22
fungidmsimard: it's driven by puppet. there's a vcsrepo resource for teh zuul git repo which triggers an exec to upgrade zuul from that18:22
dmsimardok, yeah, there's a puppet run that matches the timestamps where the modules have been updated18:23
dmsimardBTW, if this happens again (where we lose the full scheduler) -- there is a short window where the status.json appears still populated while the scheduler reloads everything and we might be able to dump everything during that short window18:24
dmsimardbut I haven't tried18:24
dmsimard(only realized when it was too late)18:24
*** ldnunes has quit IRC18:27
corvusdmsimard: i believe we were alerted to this because zuul-web died18:27
dmsimardcorvus: yeah but for some reason, (monday) when I started zuul-scheduler and zuul-web again, I recall seeing the whole status page and wondering why it wasn't empty18:28
corvusdmsimard: that'll be the apache cache.  it was gone.18:28
dmsimardcorvus: likely18:29
corvusdmsimard: i'm certain; i checked.18:29
dmsimardcorvus: status.json is generated dynamically right ? it's not dumped to disk periodically (even for cache purposes) ?18:29
corvusdmsimard: it's in-memory in zuul.  apache may put it on disk.18:30
dmsimardcorvus: should we consider dumping it to disk periodically even if just for backup purposes so we can re-queue if need be ? though now that I think about it, this can be out of band -- like a cron that just gets it every minute or something18:31
corvusdmsimard: i don't object to a cron18:32
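A minimal sketch of such a crontab entry; the status.json URL and backup directory are assumptions, and the timestamped filename just keeps a rolling day of snapshots:
    # snapshot Zuul's status page once a minute (percent signs must be escaped in crontab)
    * * * * * curl -sf http://zuulv3.openstack.org/status.json -o /var/lib/zuul/backup/status.json.$(date +\%H\%M)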
corvusin the mean time, what should we do about this?18:32
dmsimardI believe pabelanger mentioned he'd like to try and  reproduce it locally18:32
dmsimardI'll send a patch for a cron which could at least help prevent us losing the entire queues -- check and gate aren't so bad, but losing post/release/tag kind of sucks18:33
corvusdmsimard: post/release/tag aren't safe to automatically re-enqueue18:33
*** dprince has quit IRC18:34
dmsimardright, but we have no visibility on what jobs might not have run18:34
corvusit's fine to save them, just pointing out that we can't restore them without consideration.18:34
* dmsimard nods18:34
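If a saved snapshot ever did need to be replayed by hand, it would be one RPC-client call per change along these lines; every value below is a placeholder and the exact option names should be checked against the installed zuul:
    # re-enqueue a single change recorded in a saved status.json (all values are examples)
    zuul enqueue --tenant openstack --trigger gerrit \
        --pipeline check --project openstack/nova --change 123456,7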
*** ldnunes has joined #openstack-infra18:41
*** dprince has joined #openstack-infra18:42
*** sree has joined #openstack-infra18:43
*** iyamahat has joined #openstack-infra18:43
*** tesseract has quit IRC18:45
*** jamesmcarthur has quit IRC18:46
*** sree has quit IRC18:47
AJaegermordred: your change https://review.openstack.org/#/c/532304/ fail in http://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/configure-unbound/tasks/main.yaml#n44 . Is role_path not set correctly?18:48
*** fultonj has quit IRC18:48
*** beagles has quit IRC18:49
*** b3nt_pin has joined #openstack-infra18:49
*** b3nt_pin is now known as beagles18:49
openstackgerritMerged openstack-infra/system-config master: Fix typo with yum-cron package / service  https://review.openstack.org/53233118:51
openstackgerritMerged openstack-infra/zuul-jobs master: Capture and report errors in sibling installation  https://review.openstack.org/53221618:51
AJaegermordred: or is that a problem of the base-minimal parent?18:51
*** yamahata has joined #openstack-infra18:53
AJaegermordred: left comments on 53230418:54
*** felipemonteiro has joined #openstack-infra18:55
*** felipemonteiro_ has joined #openstack-infra18:56
*** hemna_ has joined #openstack-infra18:59
*** sree has joined #openstack-infra19:00
*** jkilpatr_ has joined #openstack-infra19:00
*** felipemonteiro has quit IRC19:00
*** jkilpatr has quit IRC19:01
*** shardy has quit IRC19:01
*** caphrim007 has joined #openstack-infra19:02
*** sree has quit IRC19:05
*** fultonj has joined #openstack-infra19:06
*** sree has joined #openstack-infra19:06
AJaegermlavalle, just commented on https://review.openstack.org/#/c/531496 - do you know what to do or do you have further questions?19:07
*** jkilpatr_ has quit IRC19:07
corvusfungi, pabelanger, clarkb: based on what i reported in #zuul, i *think* we can expect msgpack not to break us again unless they do another rename.  further version upgrades shouldn't break us.  i'm assuming we have 0.5.1 installed everywhere now.  if we want to be extra safe, we might want to uninstall msgpack and msgpack-python everywhere and reinstall msgpack-python 0.5.1 just to clean up, but19:09
corvusshouldn't be necessary.19:09
corvusdmsimard: ^19:09
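Roughly what that extra-safe cleanup would look like on a single host, assuming pip3 owns the packages under /usr/local as noted earlier:
    # remove both distributions, reinstall the fixed transitional package, and sanity-check the import
    sudo pip3 uninstall -y msgpack msgpack-python
    sudo pip3 install msgpack-python==0.5.1
    python3 -c 'import msgpack; print(msgpack.version)'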
pabelangerokay, puppet is running again on zuulv3.o.o now that git servers are running puppet again19:09
clarkbcorvus: good to know thanks for digging in19:09
corvusShrews: what do we need to do for nodepool?19:10
*** sree has quit IRC19:10
pabelangerI am also still waiting for 532709 to land and confirm console logging is working on ze04.o.o before continuing with reboots of other executors19:12
*** sshnaidm is now known as sshnaidm|afk19:15
*** weshay_interview is now known as weshay19:16
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix races around deleting a provider  https://review.openstack.org/53293119:16
Shrewscorvus: bunches of things?19:18
Shrewscorvus: sort of sinking in the changes at the moment and can't cover them all19:19
Shrewsand i've totally missed lunch b/c of it, so gonna grab a quick bite. brb19:19
*** slaweq has joined #openstack-infra19:20
corvusShrews: okay.  it looks like the segfault thing is mostly sorted, so when you get a chance to sort out what needs to happen, let me know.19:20
*** vsaienk0 has joined #openstack-infra19:20
*** panda|ruck is now known as panda|ruck|afk19:20
*** jkilpatr_ has joined #openstack-infra19:21
smcginnisLooking into some stable branch failures. Anyone know how this could have happened? http://logs.openstack.org/periodic-stable/git.openstack.org/openstack/kolla/stable/pike/build-openstack-sphinx-docs/60e2946/job-output.txt.gz#_2018-01-09_06_16_06_65155319:23
smcginnisMust be something local: http://git.openstack.org/cgit/openstack/kolla/tree/doc/source?h=stable/pike19:24
fungiwhere did we end up defining zuul nodeset names?19:24
corvusfungi: ozj i think19:26
*** dprince has quit IRC19:26
openstackgerritIhar Hrachyshka proposed openstack-infra/openstack-zuul-jobs master: Switched all jobs from q-qos to neutron-qos  https://review.openstack.org/53294819:26
fungicorvus: thanks! i thought it was in pc for some reason19:26
*** smatzek has quit IRC19:27
AJaegerteam, is our pypi mirror working? We pushed https://pypi.python.org/pypi/openstackdocstheme 2hours ago and it's not yet used in new jobs19:27
*** florianf has joined #openstack-infra19:27
*** smatzek has joined #openstack-infra19:28
openstackgerritMerged openstack-infra/system-config master: Zuul executor needs to open port 7900 now.  https://review.openstack.org/53270919:28
*** alexchadin has joined #openstack-infra19:29
pabelangerAJaeger: let me check19:30
*** sree has joined #openstack-infra19:30
*** vsaienk0 has quit IRC19:30
pabelangerAJaeger: bandersnatch is running19:31
AJaegerpabelanger: and it's at http://mirror.ca-ymq-1.vexxhost.openstack.org/pypi/simple/openstackdocstheme/ - let me recheck then.19:31
AJaegerthanks19:31
*** smatzek has quit IRC19:33
*** sree has quit IRC19:34
Shrewscorvus: so nodepool, we have A) new quota handling does not fail as gracefully since we do more than just check max-servers now  B) it has been suggested that instead of failing a request that gets an exception in request handling, that we instead let another provider try (which is a good idea IMO)  and C) some zookeeper wonkiness caused us to not be able to delete 'deleting' znodes even though the19:39
Shrewsactual instance was deleted. I have no idea on this one atm19:39
Shrewscorvus: and the other thing clarkb was concerned about is hopefully handled in https://review.openstack.org/53293119:40
Shrewsbut i can't test that one locally b/c linux and the world hates me19:40
pabelangernice19:41
pabelangerfinger 8ee380a2a3ec4b1698ccd4fe6e6d5ecb@zuulv3.openstack.org19:42
pabelangerworks19:42
pabelangerthat should be using tcp/7900 now19:42
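A quick way to confirm the port is actually reachable through the firewall from another host (hostname taken from the discussion above):
    # zero-I/O connect test against the executor's finger port
    nc -zv ze04.openstack.org 7900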
*** eharney has quit IRC19:42
pabelangerI think we can proceed with rolling restarts of zuul-executors to drop root permissions19:43
*** harlowja has joined #openstack-infra19:43
pabelangeralong with /var/log/zuul permission fix19:43
*** eharney has joined #openstack-infra19:44
*** edmondsw_ has joined #openstack-infra19:48
*** edmonds__ has joined #openstack-infra19:48
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-zuul master: Periodically retrieve and back up Zuul's status.json  https://review.openstack.org/53295519:49
dmsimard^ as per my suggestion19:49
dmsimardcorvus: ^19:49
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-zuul master: Periodically retrieve and back up Zuul's status.json  https://review.openstack.org/53295519:51
*** alexchadin has quit IRC19:51
clarkbShrews: it would be good to get A) and B) sorted out so that we can safely reboot the chocolate infracloud controller19:51
clarkbI'm sort of all over the place today taking care of sick family and getting new glasses and doing travel paperwork19:51
clarkbbut happy to help where I can (will review that one change shortly)19:51
pabelangerclarkb: are we okay to proceed with zuul-executor restarts? This is to drop root permissions and change the finger port to tcp/7900. Confirmed to be working on ze0419:52
*** edmondsw has quit IRC19:52
*** edmondsw_ has quit IRC19:52
clarkbpabelanger: I think as long as it isn't expected to affect job results we should be fine19:52
clarkbexecutor stops result in jobs being rerun right?19:52
clarkb(its been a couple rough days so doing everything we can to make it less rough is nice)19:53
pabelangerclarkb: yah, jobs will abort and requeue19:53
pabelangerjust means people wait a little longer for stuff to merge19:53
clarkband we are at capacity ya?19:53
clarkbmight be best to wait for things to cool off a bit for that?19:54
pabelangeryah, we are maxed out right now19:54
dmsimardthere's a lot of compounding issues right now19:54
dmsimardthe restarts, the loaded executors, leading people to recheck a significant backlog of things19:55
dmsimardneed to afk food19:55
pabelangerI think what might happen is: if an executor is stopped and started again with /var/log/zuul still owned by root, it might not properly start again19:56
pabelangereg: live migration19:56
clarkbdmsimard: right, I think it may be best to let things settle in a bit19:56
clarkband see where we are since the executor restarts aren't urgent19:56
clarkbpabelanger: will it not fail to start at all or will it start and be broken?19:57
*** jistr|afk is now known as jistr19:57
*** smatzek has joined #openstack-infra19:57
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Short-circuit request handling on disable provider  https://review.openstack.org/53295719:57
pabelangerlet me check something19:57
pabelangerokay, we might be good19:58
pabelanger-rw-rw-rw- 1 root root 171389851 Jan 11 19:58 /var/log/zuul/executor-debug.log19:58
pabelangerzuul user will be able to write19:58
*** sree has joined #openstack-infra19:59
corvusShrews, clarkb: i think (B) is compatible with the algorithm and should be okay to implement.  i think that would manifest as: when all of the node launch retries have been exhausted, we decline the request.  the final provider which handles that request would still cause it to fail, in the normal way that requests that are universally declined are failed.19:59
corvuspabelanger: it's world writable?20:01
Shrewscorvus: it would be more than node launch retries (in the vanilla case, it was throwing an exception when trying to query the provider for quota info), but... yeah20:01
*** Apoorva has quit IRC20:01
corvusShrews: sounds good20:01
corvusShrews: is there more detail about (A)?20:02
openstackgerritSean McGinnis proposed openstack-infra/openstack-zuul-jobs master: Add commit irrelevant files to tempest-full  https://review.openstack.org/53295920:02
pabelangercorvus: yes20:02
*** armaan has quit IRC20:02
corvuspabelanger: let's fix that after the restarts :)20:03
pabelangeragree20:03
*** armaan has joined #openstack-infra20:03
Shrewscorvus: such as? https://review.openstack.org/532957 re-enables that short-circuit20:03
openstackgerritSean McGinnis proposed openstack-infra/project-config master: Remove irrelevant-files for tempest-full  https://review.openstack.org/53296020:03
Shrewscorvus: i think you know as much as i do at this point.... happy to clarify anything though20:04
corvusShrews: that sounds good, but what do you mean by "new quota handling does not fail as gracefully since we do more than just check max-servers now" ? in what cases does it fail not gracefully?20:04
*** sree has quit IRC20:04
Shrewscorvus: http://paste.openstack.org/show/643045/20:05
ianwthere's a lot of scroll-back for us antipodeans ... are we at a point we want to merge https://review.openstack.org/#/c/532701/ to start new image builds, or is there too much else going on?20:06
corvusShrews: or are A and B really related -- if we fix B, we'll handle the case where we can't calculate quota because the provider is broken by declining the request.  and then when we come along and set max-servers to 0 because we know the cloud is broken, change 532957 will short circuit that and make things cleaner?20:06
*** gema has quit IRC20:06
Shrewscorvus: oh, not failing gracefully... the new quota stuff adds the dimension of exceptions from a wonky provider during the "should I decline this?" checks20:07
ShrewsA and B are sort of related, yes20:07
*** hasharAway is now known as hashar20:08
corvusShrews: okay, i think i grok.  my understanding is: 532957 should fix the most immediate thing and let us restart infracloud without node failures, then solving (B) will let infracloud break in the future without spewing lots of node_failure messages..  correct?20:09
Shrewscorvus: 532957 would have hidden the provider failure, but doesn't fix the exception stuff that can still occur when max-servers >020:09
corvusShrews: yep, that jives with my understanding20:09
Shrewsbecause vanilla was disabled, but we were still getting the exceptions20:09
pabelangerianw: +320:10
Shrewscorvus: yes, with all my current changes that I have up, we can restart and run for a while while I sort B out20:11
pabelangerianw: also, ze04.o.o is back online and running as zuul user again20:11
pabelangerianw: we have not done any other executors yet, likely do so in another day20:11
fungiianw: probably safe as long as we don't expect that updates to the images will bring yet new regressions... we're somewhat backlogged and maxxed out on capacity at the moment20:11
corvusShrews: i've +2 532931 and 53295720:11
corvusShrews: any others?20:11
pabelangerianw: but all patches for puppet-zuul and firewall landed20:11
Shrewscorvus: clarkb: what were the concerns about chocolate? i'm afraid i've been too heads down in firecoding to have noticed20:12
pabelangerfungi: ianw: actually, lets hold off until tomorrow then, just to be safe20:12
pabelangerand time for some patches to merge20:12
*** sree has joined #openstack-infra20:12
ianwok, that puts it into my weekend, so maybe leave it with +2's until my monday20:13
fungiShrews: we need to reboot some of the chocolate control plane for meltdown patching still20:13
Shrewscorvus: no, i think you got them all20:13
ianwthat way a) it's usually quiet(er) and b) i can monitor20:13
pabelangerianw: good idea, might want to -2 then20:13
fungiShrews: and are concerned that any prolonged outage of the api will cause heartbreak for nodepool20:13
Shrewsfungi: is chocolate disabled in nodepool?20:13
corvusclarkb: +3 532931 and 532957 please20:13
fungiShrews: no, only vanilla is disabled because we were unable to get it back into operation reliably20:13
Shrewsfungi: has chocolate been otherwise functional?20:14
fungii _think_ so (inasmuch as it ever is anyway)20:15
corvusShrews: i think the extent of what i was saying is that being able to reliably zero max-servers for a cloud gives us the room to enable/disable as needed without worrying about errors related to (B)20:15
Shrewscorvus: yeah. i think we should disable chocolate with max-servers=0 if we're concerned about it working20:15
Shrews957 gives us that leeway20:16
fungibut if we set it to max-servers=0 are we still going to cause the same problems that vanilla was/is causing with max-servers=0 already?20:16
Shrewsfungi: not with 95720:16
Shrews(in theory)20:16
fungiaha, right, that's the missing piece thanks20:16
*** sree has quit IRC20:16
* Shrews has missing pieces scattered everywhere today20:17
*** cody-somerville has joined #openstack-infra20:23
*** dprince has joined #openstack-infra20:26
clarkbcorvus: looking now20:27
clarkbfwiw 957 isn't really the concern with restarting the controller20:28
clarkbwe shouldn't have node failures if a cloud goes away even if max servers is > 020:29
*** SumitNaiksatam has quit IRC20:29
clarkbthis requires us to know in advance when clouds will have outages which isn't always the case20:29
clarkbI've approved 957 because its a good improvement either way. Also left a comment on it for a followup20:30
clarkbShrews: ^20:30
*** fultonj has quit IRC20:31
*** cody-somerville has quit IRC20:31
Shrewsclarkb: yeah, not saying that fixes the B part. it's a stop-gap until i can fix the other thing20:32
Shrewsbut it's a good stop-gap that should stay since it prevents unnecessary provider calls20:33
*** gema has joined #openstack-infra20:33
clarkbShrews: yup20:33
clarkbShrews: in https://review.openstack.org/#/c/532931/2 why are we removing the extra logging that you just added?20:33
clarkboh wait I misread that nevermind20:33
clarkbcorvus: Shrews both changes have been approved, just the one minor improvement idea on 95720:36
clarkbI need to grab lunch now20:36
Shrewsclarkb: can I +A the logging change?20:37
*** hrubi has quit IRC20:37
*** hrubi has joined #openstack-infra20:39
*** eharney has quit IRC20:41
clarkbShrews: ya I think so20:42
efriedDon't want to recheck unnecessarily and worsen the problem, so: if my change isn't showing up on the zuulv3.o.o dashboard, do I need to recheck it?20:49
openstackgerritMerged openstack-infra/nodepool feature/zuulv3: Fix races around deleting a provider  https://review.openstack.org/53293120:49
openstackgerritMerged openstack-infra/nodepool feature/zuulv3: Short-circuit request handling on disable provider  https://review.openstack.org/53295720:49
efriedSpecifically https://review.openstack.org/#/c/518633/20:50
*** eharney has joined #openstack-infra20:51
AJaegerefried: yes. But in your case: Ask mriedem to toggle the +W, so it goes directly into gate. A recheck would run it first through check.20:52
*** eharney has quit IRC20:52
efriedAJaeger Oh, that's useful to know.  Thank you.20:52
AJaegerefried: or have another nova core just add an additional +W to trigger that20:52
AJaegerefried: see also top entry at https://wiki.openstack.org/wiki/Infrastructure_Status20:52
efriedAJaeger What about the other already+W'd patches behind that guy?  Will they go to the gate automatically once the first one clears?20:53
efriedor do they need to be +W-twiddled too?20:53
AJaegerefried: hope so ;) Give it a try20:53
efriedAJaeger Thanks.20:53
mriedemwhat do i need to do?20:55
AJaegermriedem: toggle +W on  https://review.openstack.org/#/c/51863320:55
*** e0ne has joined #openstack-infra20:55
mriedemtoggle == remove +W and add it back?20:55
AJaegermriedem: yes20:56
mriedemconsider it toggled20:56
* mriedem blushes20:56
AJaegerthanks, mriedem20:56
efriedThanks mriedem20:56
mriedemmy pleasure20:56
efriedIt appeared in the gate right away, wohoo!20:56
efriedNow we'll see if it gets through.  Last time it sat there for about 12 hours before disappearing mysteriously without a trace...20:57
AJaegerefried: https://review.openstack.org/#/c/518982/19 will not go into gate - there's no Zuul +1 vote. So, that one needs a recheck.20:57
mriedemit hit the bermuda triangle20:57
AJaegerefried: I suggest you recheck those in the stack that don't have Zuul +1.20:58
efriedAJaeger Thanks, will do.20:58
AJaegermriedem: it tried to hide but you found it ;)20:59
fungione of zuul's dependencies made a botched attempt at transitioning between package names, and that has resulted in a couple of rather sudden outages where we ended up unable to reenqueue previously running jobs once we got it back online20:59
fungithankfully, we think this shouldn't be a recurrent problem now that they've fixed their transitional package on pypi20:59
fungithe joys of continuous delivery ;)21:00
mtreinishfungi: which package?21:00
fungimsgpack-python -> msgpack21:00
mtreinishah, ok21:00
fungia transitive dep via cachecontrol21:00
fungimsgpack-python 0.5.0 was basically empty, and so in-place upgrading it caused segfaults for running zuul processes21:01
*** krtaylor has joined #openstack-infra21:01
*** Apoorva has joined #openstack-infra21:02
fungiand upgrading from it had similarly disastrous impact once they released a fixed version21:02
*** david-lyle has quit IRC21:03
*** krtaylor has quit IRC21:03
*** david-lyle has joined #openstack-infra21:04
*** krtaylor has joined #openstack-infra21:04
ianwfungi: ahh, so that's the suspect in our "all the executors went bye-bye" scenario the other day?21:06
*** Apoorva has quit IRC21:06
corvusianw: yep; there's a bit more detail about versions, etc, in #zuul21:06
*** gouthamr has quit IRC21:07
*** edmonds__ is now known as edmondsw21:07
*** e0ne has quit IRC21:09
*** masber has joined #openstack-infra21:10
*** sree has joined #openstack-infra21:12
efriedmriedem Well, that worked so well for that patch, would you mind doing the same for https://review.openstack.org/#/c/521686/ ?21:14
*** gouthamr has joined #openstack-infra21:15
mriedemefried: bauzas is still awake, i'm sure he can do it21:15
efriedHum, that one might be different, since it's still in the check queue (six times ??)21:15
*** olaph1 has joined #openstack-infra21:15
ianweumel8 / ianychoi : i dropped the db (made a backup) and ran puppet on translate-dev ... it doesn't seem to have redeployed zanata.  looking into it21:16
*** olaph has quit IRC21:16
*** sree has quit IRC21:17
AJaegerconfig-core, I updated the list of reviews to review at https://etherpad.openstack.org/p/Nvt3ovbn5x - the backlog is growing; just in case somebody has time left after all the updating and fixing...21:21
AJaegerefried: 521686 has no Zuul +121:21
fungismcginnis: pasted a job log in #-release where a job failed to push content via rsync to static.o.o, connection unexpectedly closed21:23
fungier, smcginnis pasted21:23
fungihttp://logs.openstack.org/a5/a52aa0b2ad06a52e50be8879f9256576ceceb91c/release-post/publish-static/cee3a5f/job-output.txt.gz#_2018-01-11_21_08_06_84464121:23
efriedAJaeger Perhaps you can help me understand what I'm seeing on the dashboard.  518633,23 showed up in the gate and was trucking along nicely.  Then while it was going, three copies of it showed up in the check queue.  They're still there.  And now the one in the gate doesn't have the sub-job thingies anymore - just the one line labeled 518633,2321:23
fungii can connect to the server, at least21:23
efriedAJaeger What does it mean when a change shows up that way (with just the one status line labeled with the change number, that never seems to move)?21:24
fungiefried: does hovering over the dot next to it tell you anything in a pop-up tooltip?21:24
efriedfungi Oo, "dependent change required for testing" -- howzat?21:24
AJaegerefried: I see 518633 as bottom of stack, so that is fine. You rechecked some changes that are stacked on top of it. so, this is fine21:25
fungiefried: looks like the cinder change ahead of it failed a voting job, so your change has been restarted to no longer include testing with that failing cinder change21:25
efriedfungi When you say "ahead of it"...21:26
fungiit took a moment to get new node assignments for the jobs21:26
fungi"above" it in the gate pipeline21:26
efriedThere's no dependency between them, is there?21:26
*** ldnunes has quit IRC21:26
openstackgerritMerged openstack-infra/nodepool feature/zuulv3: Improve logging around ZooKeeper suspension  https://review.openstack.org/53282321:26
fungiefried: only insofar as they share some jobs and there's a chance that the cinder change could break the ability of your change to pass jobs (or vice versa) so we test them together to make sure we don't race in merging an interdependency bug21:27
*** jkilpatr_ has quit IRC21:27
fungibut if we realize the cinder change isn't going to merge, we have to retest your nova change without it applied21:27
efriedfungi I see.  And that helps me understand how the whole queue thrashing problem can compound as it gets more full.21:28
fungiyep. we have a sliding stability window where we'll try to test as many changes at a time as possible, but if we fail too often we scale the window down to a minimum of 20 changes at a time in a shared queue21:28
fungijust to keep the thrash manageable21:29
lbragstadi have a quick question, we started noticing this on stable/pike http://logs.openstack.org/periodic-stable/git.openstack.org/openstack/keystone/stable/pike/build-openstack-sphinx-docs/a5a6cb8/job-output.txt.gz#_2018-01-09_06_13_52_29743221:29
lbragstadwe did see that on master a while ago but I don't think we fixed it with a requirements bump21:29
lbragstadi'm wondering if that jogs anyone's memory here21:30
fungilbragstad: smells like a dependency updating in a backward-incompatible fashion21:30
smcginnislbragstad: Is that package python-ldap?21:30
cmurphylbragstad: fungi let me dig up AJaeger's patch that fixed it - the problem was autodoc wasn't loading all the libs that were declared in setup.cfg and not in requirements.txt21:31
lbragstadhttps://github.com/openstack/keystone/blob/master/setup.cfg#L3121:31
smcginnislbragstad: Ah, it can't be in setup.cfg anymore since the source isn't installed to run the docs job.21:32
lbragstadaha21:32
smcginnislbragstad: Probably need to backport the change cmurphy is thinking of.21:32
*** e0ne has joined #openstack-infra21:32
ianwfungi: any thoughts on what we can do about translate-dev getting  "451 4.7.1 Greylisting in action, please come back in 00:04:59" when sending confirm emails?  how do we make it look more legit?21:32
smcginnisI've been doing that in a few projects to get stable jobs working. Let me know if there's any questions about it.21:32
fungiianw: make sure it's not covered by a listing in the spamhaus pbl, and apply for an exception for that ip address if it is21:33
fungiianw: rackspace has added basically all of their ip assignments to the pbl, i guess as a way to cut down on abuse reports, so you have to explicitly poke holes in those blanket listings for systems you want to be able to send e-mail to popular domains which may make filtering decisions based on pbl lookups21:34
fungicmurphy: great memory, i didn't even consider this might be fallout from the docs pti compliance work21:35
lbragstadcmurphy: wasn't it this one? https://review.openstack.org/#/c/530087/21:35
*** numans has quit IRC21:36
*** numans has joined #openstack-infra21:36
fungiianw: i usually start by looking at https://talosintelligence.com/reputation_center/lookup?search=translate-dev.openstack.org and that's suggesting there aren't any matching blacklist entries so it's probably not pbl impact at least21:37
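The equivalent manual PBL check is a DNS lookup against spamhaus with the server's IPv4 address reversed; 192.0.2.10 below is only an example address:
    # NXDOMAIN means not listed; a 127.0.0.x answer means listed (x indicates which list)
    host 10.2.0.192.zen.spamhaus.org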
cmurphylbragstad: it was either that one or this one https://review.openstack.org/#/c/52896021:37
smcginnislbragstad: It was probably a change in openstack/keystone.21:37
cmurphyif it's that one ^ then we just need to backport21:37
smcginniscmurphy: Yep, that looks like what you'd need.21:37
ianwfungi: yeah ... also digging into the logs, there's not much email traffic out, but it might just be @redhat.com that's being too picky21:37
fungiianw: up side, if it's just greylisting, is that usually subsides as popular domains get used to receiving e-mail from it21:38
smcginnislbragstad, cmurphy: Only tricky thing I've run in to backporting these is the differences in global-requirements with the stable branch.21:38
cmurphysmcginnis: oh hrm21:38
corvusianw: is it putting translate-dev01.openstack.org in the envelope sender address, or translate-dev?21:38
fungiooh, excellent point21:38
*** sbra has quit IRC21:38
smcginniscmurphy, lbragstad: But actually, doesn't look like that patch was complete. Should switch s/python setup.py build_sphinx/sphinx-build .../. But I don't think that's necessary as far as fixing pike.21:39
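    For reference, a minimal sketch of the shape of that fix under the docs PTI, assuming
    the standard doc/requirements.txt layout: the libraries autodoc needs move out of the
    setup.cfg extras into the docs requirements, and the job calls sphinx-build directly
    instead of "python setup.py build_sphinx". The package list and pins below are
    illustrative, not the exact contents of the keystone patches.

        # doc/requirements.txt (illustrative entries; real pins come from constraints)
        sphinx>=1.6.2
        openstackdocstheme>=1.17.0
        python-ldap>=3.0.0
        ldappool>=2.0.0

        # docs build invocation per the PTI, replacing "python setup.py build_sphinx"
        sphinx-build -W -b html doc/source doc/build/html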
ianwcorvus: hmm, possibly "noreply@openstack.org" http://paste.openstack.org/show/643127/21:40
lbragstadsmcginnis: ack - let me get a backport proposed quick21:40
fungiianw: if the message is still in exim's queue, you should be able to find the message (body and headers) in /var/spool/exim4/input/21:40
ianwfungi: yep, that's what i just did :)  anyway, let's see if it's an issue with the real server & people not getting their auth emails and then dig into it21:41
*** aeng has joined #openstack-infra21:41
fungiand yeah, looking at it i agree it seems to be using noreply@openstack.org for sender21:42
fungiwhich at least matches the from header21:42
*** smatzek has quit IRC21:42
ianwthere are some interesting options.  i wonder if i want to turn my "maximum gravatar rating shown" up to "X"21:42
fungiit doesn't go to 11?21:43
fungier, i mean, xi?21:43
corvusnoreply@openstack.org does not accept bounce messages, so it's a bad choice for an envelope sender.  that's a legit greylist trigger.21:43
*** jtomasek has quit IRC21:43
*** sree has joined #openstack-infra21:43
fungiyep, we usually configure an alias of the infra-root address as sender for other systems21:44
*** olaph1 has quit IRC21:44
*** olaph has joined #openstack-infra21:45
corvushttp://paste.openstack.org/show/643135/21:45
*** eharney has joined #openstack-infra21:46
*** rcernin has joined #openstack-infra21:46
fungiright, not gonna get accepted by any mta which enforces sender verification callbacks21:46
ianwi wonder where it comes from.  not immediately obvious from http://codesearch.openstack.org/?q=noreply%40openstack.org&i=nope&files=&repos=21:47
*** sree has quit IRC21:48
ianwwell the server gui configuration has a "from email address" durr21:48
fungii was gonna say, may be manually configured21:48
*** eharney has quit IRC21:48
ianwi just dropped the db for it to repopulate from scratch though21:49
corvusit's fine as a from header, just not great as an envelope sender21:49
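    The distinction corvus is making maps directly onto the SMTP API: the From: header
    lives inside the message, while the envelope sender (MAIL FROM) is passed separately
    and is what greylisting and sender-verification callouts act on. A small illustrative
    Python sketch with made-up addresses; the real fix is pointing the envelope sender at
    an alias that actually accepts bounces.

        import smtplib
        from email.message import EmailMessage

        msg = EmailMessage()
        msg["From"] = "noreply@openstack.org"   # header recipients see; fine to keep
        msg["To"] = "user@example.com"
        msg["Subject"] = "translate-dev account confirmation"
        msg.set_content("Please confirm your account.")

        with smtplib.SMTP("localhost") as smtp:
            # from_addr sets the envelope sender (MAIL FROM). Receiving MTAs run
            # greylisting and sender verification against this address, so it
            # should be a mailbox that accepts bounces, e.g. an infra-root alias.
            smtp.send_message(msg, from_addr="infra-root@example.org")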
lbragstadsmcginnis: not sure if i'm on the right track here https://review.openstack.org/#/c/532984/21:49
lbragstadbut there was a merge conflict with the test-requirements.txt file21:49
fungiianw: maybe it doesn't store that in the remote db?21:50
lbragstadand i'm not quite sure if what i proposed breaks process or not (since i had to change requirement versions because of the conflict)21:50
ianwfungi: yeah, i guess.  thanks, i think it's some good info if it's a problem with other people not getting emails, especially on the live server21:52
smcginnislbragstad: No, that looks right to me. Let's see how it goes in the gate.21:52
*** jkilpatr has joined #openstack-infra21:52
ianwclarkb / ianychoi / eumel8 : i'll drop some notes in the change, but i think translate-dev should be back with the fresh db, i can log in and poke around at least.  let me know if issues21:53
lbragstadsmcginnis: cool - thanks for the sanity check21:53
lbragstadcmurphy: fyi ^21:53
lbragstadsmcginnis: fwiw - if that does fix things, you might have to kick that one through (keystone doesn't have two active stable cores)21:54
*** slaweq has quit IRC21:55
*** slaweq has joined #openstack-infra21:56
*** e0ne has quit IRC21:57
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-zuul master: Periodically retrieve and back up Zuul's status.json  https://review.openstack.org/53295521:57
*** threestrands has joined #openstack-infra21:58
*** threestrands has quit IRC21:58
*** threestrands has joined #openstack-infra21:58
*** hamzy has quit IRC21:58
*** threestrands has quit IRC21:59
*** threestrands has joined #openstack-infra21:59
*** threestrands has quit IRC22:00
smcginnislbragstad: Really?22:00
smcginnisOh, stable. Sure, I will +2 once things pass.22:00
*** threestrands has joined #openstack-infra22:01
*** jascott1 has joined #openstack-infra22:02
*** threestrands has quit IRC22:02
*** gyee has joined #openstack-infra22:02
*** threestrands has joined #openstack-infra22:02
*** Apoorva has joined #openstack-infra22:03
*** sree has joined #openstack-infra22:06
*** tpsilva has quit IRC22:09
*** nicolasbock has quit IRC22:10
*** sree has quit IRC22:10
smcginnislbragstad: Looked up the stable/pike g-r values.22:11
clarkbnot really back yet but looks like nodepool changes did merge, next step is restarting the launchers or is that done?22:16
*** dave-mccowan has quit IRC22:16
*** dave-mccowan has joined #openstack-infra22:17
*** olaph has quit IRC22:17
*** Goneri has quit IRC22:20
*** Apoorva_ has joined #openstack-infra22:23
*** hashar has quit IRC22:24
*** dave-mccowan has quit IRC22:25
*** Apoorva has quit IRC22:26
corvusclarkb: i don't think it's been done22:27
*** Apoorva_ has quit IRC22:30
*** Apoorva has joined #openstack-infra22:31
*** dprince has quit IRC22:31
*** sree has joined #openstack-infra22:32
*** bobh has quit IRC22:32
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Support cross-source dependencies  https://review.openstack.org/53080622:35
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests  https://review.openstack.org/53269922:35
*** sree has quit IRC22:36
*** esberglu has quit IRC22:36
*** olaph has joined #openstack-infra22:38
*** sree has joined #openstack-infra22:38
*** kjackal has quit IRC22:39
*** kjackal has joined #openstack-infra22:40
*** sree has quit IRC22:43
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Delete stale jobdirs at startup  https://review.openstack.org/53151022:44
*** slaweq has quit IRC22:46
*** jbadiapa has quit IRC22:47
*** markvoelker has quit IRC22:49
ianwcorvus: one thing i'd note about deleting job dirs is that on the cinder-mounted volumes it actually takes quite a while22:49
*** markvoelker has joined #openstack-infra22:49
ianwwhen i restarted the executors the other day i cleared out the old stuff, and it was like upwards of 10 minutes22:50
ianwit made me think maybe a cron job that removes X day old dirs might work22:50
corvusianw: well, if we need that, we should build it into zuul.22:50
corvustrouble is, it's hard to tell if the admin intended to keep the dir22:51
corvusso if we do that, we'd need to drop a flag in the dirs indicating they were 'kept' and not have the 'cron' delete them22:51
*** niska has quit IRC22:51
*** ruhe has quit IRC22:51
ianwoh, from the keep var in config you mean?22:52
*** rlandy is now known as rlandy|bbl22:52
*** niska has joined #openstack-infra22:52
corvusianw: it's a run-time toggle, so can change at any time22:52
*** ruhe has joined #openstack-infra22:52
ianwyeah, a stamp-file in the directory might be a good interface22:53
*** Jeffrey4l has quit IRC22:53
*** markvoelker has quit IRC22:54
*** andreas_s has joined #openstack-infra22:54
*** sree has joined #openstack-infra22:54
*** bandini has quit IRC22:54
*** edmondsw has quit IRC22:54
openstackgerritJames E. Blair proposed openstack-infra/system-config master: Add zuul mailing lists  https://review.openstack.org/53300622:54
ianwit's probably fine to clear out on start, but at least in our case, with the slower remote storage for dirs, having the slate almost clean when it starts would be helpful22:54
*** wolverineav has quit IRC22:55
ianwespecially because you're usually restarting them in a pressure situation :)22:55
*** edmondsw has joined #openstack-infra22:55
*** abelur_ has joined #openstack-infra22:55
*** vsaienk0 has joined #openstack-infra22:55
*** felipemonteiro_ has quit IRC22:55
*** wolverineav has joined #openstack-infra22:55
corvusyep.  i think both things are incremental improvements.22:55
ianwi'll add it to my todo list :)22:55
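    A rough sketch of what that todo item could look like: a periodic sweep that removes
    old build directories unless they carry a stamp file marking them as deliberately
    kept. The ".zuul-keep" stamp name and the builds path are assumptions for
    illustration, not Zuul's actual on-disk layout.

        import os
        import shutil
        import time

        BUILD_ROOT = "/var/lib/zuul/builds"   # assumed location for the sketch
        KEEP_STAMP = ".zuul-keep"             # hypothetical marker for kept dirs
        MAX_AGE_DAYS = 2

        def prune_stale_jobdirs(root=BUILD_ROOT, max_age_days=MAX_AGE_DAYS):
            cutoff = time.time() - max_age_days * 86400
            for name in os.listdir(root):
                path = os.path.join(root, name)
                if not os.path.isdir(path):
                    continue
                if os.path.exists(os.path.join(path, KEEP_STAMP)):
                    # The keep toggle was on when this build ran; leave it alone.
                    continue
                if os.path.getmtime(path) < cutoff:
                    shutil.rmtree(path, ignore_errors=True)

        if __name__ == "__main__":
            prune_stale_jobdirs()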
*** Jeffrey4l has joined #openstack-infra22:56
*** bandini has joined #openstack-infra22:56
*** markvoelker has joined #openstack-infra22:56
clarkbok back with new glasses. I can see again. Except now everything looks funny22:58
*** sree has quit IRC22:58
clarkbcorvus: nl02 is the launcher for chocolate and has the new nodepool code installed. thoughts on restarting that one now?22:58
*** edmondsw has quit IRC22:59
*** markvoelker has quit IRC22:59
*** wolverineav has quit IRC22:59
corvusclarkb: i'm around for a bit longer and can help with probs, i say go for it23:00
*** andreas_s has quit IRC23:00
clarkbcorvus: ok restart it now then23:00
*** markvoelker has joined #openstack-infra23:00
clarkbit is running again and appears to be acting normally (declining requests that are at quota, and I just saw a node go ready)23:03
clarkbcorvus: Shrews what is the request handler behavior when all clouds are at quota? I seem to recall reading that code at some point but forget the behavior23:03
*** vsaienk0 has quit IRC23:04
*** markvoelker has quit IRC23:05
corvusclarkb: every provider will grab one more request than it can handle and then block on that request, not accepting any more until it completes.23:05
*** sree has joined #openstack-infra23:05
corvus(strictly speaking, when a single provider is at quota, it grabs one more request and blocks.  providers act independently, so if they are all at quota, they simply all do that)23:06
*** jascott1 has quit IRC23:06
clarkbcorvus: "Pausing request handling to satisfy request" appears to be the logged message for that?23:07
*** jascott1 has joined #openstack-infra23:07
corvusclarkb: i believe so23:07
corvusclarkb: once it gets enough available quota, it should 'unpause', finish that request, and as long as we are backlogged, grab another one and go right back to being paused.23:08
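    What corvus describes can be illustrated with a toy provider loop: the worker keeps
    accepting requests until one would exceed quota, then holds exactly that request and
    stops polling for new ones until capacity frees up. Class and method names here are
    invented for the sketch; this is not nodepool's implementation.

        import collections

        Request = collections.namedtuple("Request", "nodes")

        class ProviderWorker:
            """Toy model of one provider pool worker."""

            def __init__(self, name, quota):
                self.name = name
                self.quota = quota
                self.in_use = 0
                self.paused_request = None

            def has_capacity(self, request):
                return self.in_use + request.nodes <= self.quota

            def poll(self, queue):
                if self.paused_request is not None:
                    # Paused: only retry the held request, never grab new ones.
                    if self.has_capacity(self.paused_request):
                        self.launch(self.paused_request)
                        self.paused_request = None
                    return
                if queue:
                    request = queue.popleft()
                    if self.has_capacity(request):
                        self.launch(request)
                    else:
                        # At quota: hold this one request and pause.
                        self.paused_request = request

            def launch(self, request):
                self.in_use += request.nodes

        if __name__ == "__main__":
            pending = collections.deque(Request(1) for _ in range(5))
            worker = ProviderWorker("example-provider", quota=3)
            for tick in range(8):
                worker.poll(pending)
                state = "paused" if worker.paused_request else "active"
                print(f"tick {tick}: in_use={worker.in_use} {state}")
                if tick == 5:
                    worker.in_use -= 1  # pretend a node was deleted, freeing quota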
*** jascott1 has quit IRC23:09
*** jascott1 has joined #openstack-infra23:09
*** sree has quit IRC23:10
clarkbcorvus: 2018-01-11 23:01:28,873 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl02-7970-PoolWorker.citycloud-la1-main]: Declining node request 100-0002001054 because it would exceed quota lots of messages like that which I would expect wouldn't happen if the handler was paused?23:11
*** jascott1 has quit IRC23:11
openstackgerritMerged openstack-infra/system-config master: Add note on how to talk to zuul's gearman  https://review.openstack.org/53152223:11
*** jascott1 has joined #openstack-infra23:11
*** jascott1 has quit IRC23:12
*** jascott1 has joined #openstack-infra23:13
*** jbadiapa has joined #openstack-infra23:14
*** jtomasek has joined #openstack-infra23:15
openstackgerritMerged openstack-infra/puppet-lodgeit master: Systemd: start lodgeit after network  https://review.openstack.org/52772923:16
corvusclarkb: that should only happen if it's impossible for the provider to ever satisfy that (ie, a request for 10 nodes in a provider where our absolute limit is 5)23:16
clarkbhuh I wonder if someone is requesting 50 nodes for a job23:17
*** jascott1 has quit IRC23:17
clarkb(that could also explain why we are at quota so much)23:17
corvusclarkb: the example you cited was 1 node23:18
clarkbcorvus: oh wait citycloud has a couple regions that are turned off23:19
clarkband la1 is one of them23:19
clarkbok that explains it23:19
corvuslooks like it was also declined by rh1-main23:20
corvuser rh123:20
clarkbcorvus: have you seen anything re nl02 that would indicate we shouldn't restart nl01 as well? I haven't yet23:20
*** jtomasek has quit IRC23:21
corvusclarkb: nope,  i say go.23:22
clarkbdone23:24
*** jascott1 has joined #openstack-infra23:29
clarkbShrews: corvus: related to the max-servers: 0 short circuit from earlier today, we may want to avoid doing all the extra logging and pause the handler entirely (eg stop polling for requests)23:30
*** jbadiapa has quit IRC23:32
clarkbcorvus: do you want to review https://review.openstack.org/#/c/523951/ I think it is ready and will allow us to merge zuulv3 branches into master23:33
corvusclarkb: when a provider is paused, it should stop polling for new requests23:36
corvusclarkb: i'll take a look23:36
corvus+3 and reviewed children as well23:40
clarkbcool23:40
*** kgiusti has left #openstack-infra23:41
*** smatzek has joined #openstack-infra23:43
*** sree has joined #openstack-infra23:43
openstackgerritClark Boylan proposed openstack-infra/project-config master: Set infracloud chocolate to max-servers: 0  https://review.openstack.org/53301223:45
clarkb^^ is not a rush, don't think I will have time to do controller00 patching reboots today but will try for tomorrow23:45
*** smatzek has quit IRC23:47
*** sree has quit IRC23:47
*** hongbin has quit IRC23:53
*** armaan has quit IRC23:53
clarkbzxiiro: is https://review.openstack.org/#/c/194497/ still valid?23:54
clarkbcorvus: I think https://review.openstack.org/#/c/163637/ ended up being replaced by the zuulv3 spec?23:55
corvusclarkb: yeah, i think it covered everything there except the ip whitelist.23:57
zxiiroclarkb: not sure. I can bring it up on our meeting tomorrow though23:57
clarkbzxiiro: thanks23:57
