*** mwhahaha has joined #openstack-infra-incident | 00:05 | |
*** jhesketh has quit IRC | 01:59 | |
*** rlandy has quit IRC | 02:44 | |
*** fungi has quit IRC | 05:45 | |
*** fungi has joined #openstack-infra-incident | 05:47 | |
*** myoung|rover|bbl is now known as myoung|rover | 06:53 | |
-openstackstatus- NOTICE: Our CI system Zuul is currently not accessible. Wait with approving changes and rechecks until it's back online. Currently waiting for an admin to investigate. | 08:48 | |
-openstackstatus- NOTICE: Zuul is back online, looks like a temporary network problem. | 09:08 | |
*** AJaeger has quit IRC | 09:51 | |
*** AJaeger has joined #openstack-infra-incident | 09:51 | |
*** jhesketh has joined #openstack-infra-incident | 10:04 | |
*** rosmaita has joined #openstack-infra-incident | 13:04 | |
*** weshay|sickday is now known as weshay | 13:15 | |
*** rlandy has joined #openstack-infra-incident | 13:31 | |
-openstackstatus- NOTICE: We're currently seeing an elevated rate of timeouts in jobs and the zuulv3.openstack.org dashboard is intermittently unresponsive, please stand by while we troubleshoot the issues. | 14:39 | |
fungi | good call | 14:46 |
dmsimard | seeing a lot of those in the logs http://paste.openstack.org/raw/628731/ | 14:46 |
*** ChanServ changes topic to "Troubleshooting Zuul scheduler memory utilization | OpenStack Infra team incident handling channel | Find us in #openstack-infra for more stimulating conversation" | 14:46 | |
fungi | how far back? just since the last restart? | 14:47 |
dmsimard | not sure, let me see.. | 14:48 |
dmsimard | fwiw amount of jobs running doesn't correlate with the load and available capacity | 14:48 |
dmsimard | http://grafana.openstack.org/dashboard/db/zuul-status | 14:48 |
dmsimard | http://grafana.openstack.org/dashboard/db/nodepool | 14:48 |
*** Shrews has joined #openstack-infra-incident | 14:48 | |
fungi | yeah, if this is just the memory increase we typically see from stacks of config changes, then it generally won't | 14:48 |
clarkb | iirc zk disconnects cause the lock to be released which could explain that if zk isn't staying connected due to memory pressure? | 14:49 |
clarkb | its also possible that is related to the bug I've been trying to sort out on the nodepool side around node request locking | 14:50 |
dmsimard | # grep "Exception" debug.log |grep "is not locked" |grep -v raise |wc -l | 14:50 |
dmsimard | 248926 | 14:50 |
dmsimard | that's a problem | 14:50 |
clarkb | (fwiw my changes haven't merged yet so unlikely to be the source of the regression) | 14:50 |
clarkb | Shrews: ^ you've been looking at things like this recently. Does it look familiar? | 14:51 |
fungi | looks like it first showed up in the logs 2017-12-12 03:22:46,423 (zuul.log.1.gz right now) | 14:51 |
fungi | went back a week and no hits in the logs until then | 14:52 |
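A minimal sketch of the kind of scan fungi describes here, assuming the scheduler's debug log lives at the default /var/log/zuul/debug.log and carries the standard timestamp prefix; rotated or gzipped files (like the zuul.log.1.gz mentioned above) would need extra handling:

```python
# Hedged sketch: locate the first "is not locked" exception in the Zuul
# scheduler debug log. The log path and timestamp format are assumptions;
# rotated .gz files are not handled here.
LOG = "/var/log/zuul/debug.log"

with open(LOG, errors="replace") as fh:
    for line in fh:
        if "Exception" in line and "is not locked" in line and "raise" not in line:
            print("first hit:", line[:23])  # "YYYY-MM-DD HH:MM:SS,mmm" prefix
            break
```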
dmsimard | that's UTC right ? | 14:52 |
fungi | yes | 14:52 |
Shrews | clarkb: no. but if zk disconnects are happening, that would be logged too | 14:52 |
fungi | utc | 14:52 |
Shrews | clarkb: re: your fix, lemme check nodepool real quick | 14:52 |
fungi | 03:22z is ~ when we started swapping | 14:54 |
Shrews | clarkb: the main symptom I saw related to your bug fix was a lot of "ready" nodes that were locked. I'm not seeing that. | 14:54 |
fungi | so i agree this could easily be a symptom of memory pressure | 14:54 |
*** frickler has joined #openstack-infra-incident | 14:55 | |
fungi | in fact, the timing of that first exception and when we used up the last of the ram is so close that it seems too correlated to be coincidence | 14:55 |
dmsimard | infra-root: found the issue | 14:55 |
dmsimard | wait no :/ | 14:56 |
dmsimard | last zk logs on nodepool.o.o date back to october with no disk space left on device, only saw part of the trace | 14:56 |
clarkb | ya I think we've largely got that sorted and so the last time it was a problem shows up in the logs as the last time we had a problem (I got caught by that in the past trying to debug something else as well) | 14:57 |
fungi | are you saying there's no space left on disk for nodepool.o.o right now? | 14:57 |
fungi | looks like it has free space | 14:57 |
clarkb | fungi: there wasn't free space months ago but thats the last thing in the error log iirc | 14:57 |
clarkb | because we fixed it and been a non issue since | 14:58 |
*** jeblair has joined #openstack-infra-incident | 14:58 | |
fungi | oh, you're saying october is the last time we ran out of space | 14:58 |
Shrews | yeah, we should have fixed that issue with the log space (by "we", i mean fungi) | 14:58 |
* fungi fixes nothing | 14:58 | |
Shrews | lies | 14:58 |
dmsimard | fungi: I said "wait no", that's me not reading the full trace :( | 14:58 |
fungi | got it | 14:58 |
fungi | so anyway, i strongly suspect the exceptions you found in the scheduler log are an upshot of the server swapping heavily, given they started at the precise moment the server started paging | 15:00 |
dmsimard | yeah, perhaps | 15:00 |
fungi | +/- the 5-minute snmp polling granularity | 15:00 |
dmsimard | is it normal that a nodepool list on nl02 would not be working ? says os_client_config.exceptions.OpenStackConfigException: Cloud citycloud was not found. | 15:00 |
jeblair | dmsimard: run as nodepool user | 15:00 |
dmsimard | ah | 15:01 |
dmsimard | jeblair: thanks ! /me still learning | 15:01 |
jeblair | did someone run sigusr2? | 15:01 |
dmsimard | jeblair: we mentioned it earlier but I didn't run it. | 15:02 |
dmsimard | not sure if fungi or clarkb have | 15:02 |
clarkb | I have not | 15:02 |
jeblair | then i'll stop looking for its output in the logs | 15:02 |
fungi | no, i did not | 15:02 |
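For reference, the signal being discussed is SIGUSR2, which the scheduler traps to write diagnostic output (stack traces and object counts) into its debug log; a hedged illustration of sending it, with the pidfile path being an assumption about this deployment:

```python
# Hedged illustration: send SIGUSR2 to the running scheduler so it writes
# its diagnostic dump into the debug log. The pidfile path is assumed.
import os
import signal

PIDFILE = "/var/run/zuul/zuul-scheduler.pid"  # hypothetical location

with open(PIDFILE) as fh:
    pid = int(fh.read().strip())
os.kill(pid, signal.SIGUSR2)
```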
jeblair | i'm going to start poking at things using the repl; this may use a lot of cpu and make things worse | 15:03 |
dmsimard | ack | 15:05 |
jeblair | zuul built 88 layouts during the 22:00 hour yesterday | 15:08 |
jeblair | up from a more typical 20-30 per hour | 15:08 |
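A per-hour tally like the one jeblair quotes can be pulled from the scheduler debug log; a hedged sketch, where the exact log message to match ("Created layout" below) is an assumption and may differ from the real message text:

```python
# Hedged sketch: tally layout creations per hour from the scheduler debug
# log. Both the log path and the matched message text are assumptions.
from collections import Counter

per_hour = Counter()
with open("/var/log/zuul/debug.log", errors="replace") as fh:
    for line in fh:
        if "Created layout" in line:
            per_hour[line[:13]] += 1  # "YYYY-MM-DD HH" prefix

for hour, count in sorted(per_hour.items()):
    print(hour, count)
```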
fungi | release jobs are almost all done, i think | 15:16 |
fungi | hard to tell since i'm getting intermittent proxy errors retrieving the status.json | 15:16 |
jeblair | i don't see an unusual number of layouts being created around the 04:00 and 06:00 hours | 15:18 |
jeblair | which seem to be the bumps into swap | 15:19 |
jeblair | here's the current object count: http://paste.openstack.org/show/628734/ | 15:27 |
jeblair | (since that has returned, things should start speeding up again) | 15:28 |
fungi | yeah, i'm getting the status.json again | 15:29 |
fungi | just two release jobs remaining | 15:29 |
fungi | we're getting really long delays between playbooks on jobs too... i'm going to check load on the executors | 15:33 |
pabelanger | fungi: yah, load looks higher because of the pulsing jobs from zuul. Seems to overwhelm executors, then evens out, just in time for the next pulse | 15:35 |
fungi | most are around 30-40 for load average. highest 5-minute average was ze06 at just over 50 | 15:35 |
dmsimard | there was a large spike on ze02 | 15:36 |
dmsimard | obvious in grafana http://grafana.openstack.org/dashboard/db/zuul-status | 15:37 |
pabelanger | fungi: yah, I think 25 loadavg is when we stop accepting jobs | 15:37 |
dmsimard | there was mention of an elevated rate of time outs | 15:38 |
*** tobiash has joined #openstack-infra-incident | 15:39 | |
dmsimard | this is the error that comes up: http://logs.openstack.org/54/510654/5/check/openstack-ansible-functional-ubuntu-xenial/031df0b/job-output.txt.gz#_2017-12-12_12_40_13_061676 | 15:39 |
dmsimard | one of the errors, at least | 15:39 |
fungi | makes sense. so zuul ends up batching up a lot of events, then thundering herds the executors, they push back because load gets too high, cycle continues a la hysteresis | 15:39 |
dmsimard | logstash -> message:"Make sure this host can be reached over ssh" is picking up a lot of hits even within the last 15 minutes | 15:39 |
dmsimard | fungi: so should there be a governor on top of load/disk that caps the amount of jobs per executor ? | 15:40 |
dmsimard | because the load typically doesn't occur until after the executor picked up new jobs | 15:40 |
dmsimard | so an executor might go ahead and pick up 50 new jobs but then it starts loading heavily a minute or two later | 15:40 |
pabelanger | yah, when executors are under a large load, ansible-playbook runs slow down greatly, so much that there is almost zero communication between the playbook and the remote host. What that ends up doing is spawning a large number of job timeouts as a result. So, I wouldn't be surprised if the high load is timing out running jobs | 15:40 |
fungi | i don't know that we should be designing safety valves into the executor to deal with pathological situations on the scheduler | 15:41 |
pabelanger | In v2.5, when a launcher picked up new jobs, it did so incrementally; is that something we still have in zuulv3? | 15:41 |
fungi | i would just consider this a symptom of scheduler memory pressure for now | 15:41 |
pabelanger | e.g.: when an executor can accept a job, not accept a large number at once | 15:42 |
dmsimard | executors are swapping as well | 15:42 |
jeblair | there are 10 layout objects known to the scheduler, and 79 in memory. so there's a leak somewhere. | 15:42 |
dmsimard | ze02 is 4GB into swap territory | 15:42 |
dmsimard | but it's weird because there is 4GB "available" | 15:42 |
dmsimard | vm.swappiness is 0 :/ | 15:43 |
pabelanger | dmsimard: Yah, that is the result of high load. Once it comes down, swap will stop | 15:43 |
jeblair | fungi: when the release jobs are finished, i'd like to make an graph. that's going to eat up cpu again and make things very bad; likely much worse than before | 15:43 |
fungi | sure. i guess we should grab a fresh copy of pipelines, generate memory graph, then restart scheduler and reenqueue everything? | 15:44 |
dmsimard | +1 | 15:45 |
fungi | or are you going to want to keep it running in this condition a little longer after grabbing the object graph? | 15:45 |
clarkb | pabelanger: that is what the load governor is supposed to provide but the problem with it is it is a delayed reaction | 15:45 |
dmsimard | clarkb: yeah that's what I was pointing out, maybe there should be a soft and a hard limit on the job concurrency | 15:45 |
fungi | and any delayed reaction leads to hysteresis under some conditions | 15:46 |
jeblair | fungi: probably after i grab the graph we can restart. your plan sounds good. | 15:46 |
pabelanger | clarkb: yah, I think so. I'll look at code and see what is happening. From grafana it seems we might just be accepting jobs too fast at start. | 15:47 |
dmsimard | fungi: you were mentioning thundering herd, looking at the graphs it's exactly that http://grafana.openstack.org/dashboard/db/zuul-status | 15:47 |
dmsimard | executors pick them up in batches | 15:47 |
fungi | right | 15:47 |
clarkb | pabelanger: ya its a similar problem to the one that mnaser had with the cpu based nova scheduling | 15:47 |
dmsimard | I mean, you don't see that in the "running builds" graph, but the load graph. | 15:48 |
clarkb | pabelanger: you don't know what the compute cost of a job is until it is properly running and until that point you can grab a lot of jobs | 15:48 |
pabelanger | clarkb: right | 15:48 |
dmsimard | clarkb: computing the cost of a job (over an average) is doable | 15:48 |
dmsimard | clarkb: through collectd/graphite for example | 15:49 |
fungi | but yeah, any situation where you perform some action, take measurements, then adjust to correct is going to lead to that behavior under extremes | 15:49 |
pabelanger | the way zuul-launchers handled running ansible-playbook worked pretty well when jobs came online. Going to see if that would be an option for zuul-executors | 15:49 |
dmsimard | clarkb: in a public cloud environment (such as mnaser) you can't really do that because the workloads from customers are really unpredictable | 15:49 |
clarkb | dmsimard: right | 15:49 |
clarkb | also you could likely do something like a pid loop | 15:49 |
clarkb | where you measure after each change and adjust (and don't really care about the work load being predictable because you measure at each step) | 15:50 |
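A rough sketch of the feedback loop clarkb is describing, purely illustrative and not Zuul's actual executor governor; the 25 load-average target echoes the threshold pabelanger mentioned above, and the gain and limits are made up for the example:

```python
# Illustrative proportional controller for job acceptance: re-measure load
# after every interval and adjust how many new jobs to accept, rather than
# reacting only once load is already too high. Not Zuul code; thresholds
# and gains are assumptions for the sketch.
import os
import time

TARGET_LOAD = 25.0
GAIN = 0.5
slots = 10  # jobs we are willing to accept this cycle

while True:
    load1, _, _ = os.getloadavg()
    error = TARGET_LOAD - load1
    slots = max(0, min(50, int(slots + GAIN * error)))
    print(f"1m load={load1:.1f} -> accepting up to {slots} new jobs")
    time.sleep(60)
```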
jeblair | as fungi said, let's not design something for when the system is broken. anything more than a simple change or minor adjustment to what we have probably isn't worth it. | 15:51 |
Shrews | FYI, if the executors were also low on available memory, we should check the logs for the OOM killer trying to kill zuul-executor children (which tends to kill the finger daemon there). zuul-executor will need restarting to correct that | 15:58 |
jeblair | Shrews: did we ever adjust the ratio for that? | 15:59 |
dmsimard | Shrews: I don't see any OOM killer on ze02 which has this weird memory usage (4GB swap, 4GB available) | 15:59 |
Shrews | jeblair: i certainly didn't | 16:00 |
jeblair | oh, we don't have a ram limiter. that might be a thing. | 16:03 |
pabelanger | Yah, I think spamaps talked about writing that | 16:10 |
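A hedged sketch of what a RAM limiter on the executor side might look like, assuming the psutil library is available; this is only an illustration, not whatever (if anything) was ultimately written:

```python
# Hedged sketch of a memory-based acceptance check for an executor,
# using psutil; the threshold is an assumption for illustration only.
import psutil

MIN_AVAILABLE_MB = 2048

def ok_to_accept_job() -> bool:
    avail_mb = psutil.virtual_memory().available // (1024 * 1024)
    return avail_mb > MIN_AVAILABLE_MB

if __name__ == "__main__":
    print("accepting" if ok_to_accept_job() else "declining")
```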
jeblair | fungi: how are the release jobs? | 16:12 |
fungi | jeblair: that last release job just reenqueued due to pre-task timeouts, but it's just a propose-update-constraints so maybe we should just cut our losses now | 16:12 |
jeblair | ok, i'll start my queries now and leave it to its own devices | 16:13 |
jeblair | i just saved queues | 16:13 |
fungi | it's been delaying ~30 minutes between each task due to executor load, so it'd be a couple hours at least before the rerun completed anyway | 16:13 |
fungi | er, between each playbook i mean | 16:14 |
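For context on "saved queues": one common approach is to snapshot status.json and turn each live item into an enqueue command for replay after the restart. A rough sketch, assuming the dashboard URL mentioned earlier, the "openstack" tenant name, and the zuul client's enqueue syntax; the actual tooling used here may differ:

```python
# Rough sketch: dump the current pipeline contents and print "zuul enqueue"
# commands that could replay them after a restart. The URL, tenant name and
# client syntax are assumptions; the real infra tooling may differ.
import json
import urllib.request

STATUS_URL = "http://zuulv3.openstack.org/status.json"

status = json.load(urllib.request.urlopen(STATUS_URL))
for pipeline in status["pipelines"]:
    for queue in pipeline.get("change_queues", []):
        for head in queue.get("heads", []):
            for item in head:
                if item.get("live") and "," in (item.get("id") or ""):
                    print("zuul enqueue --tenant openstack --trigger gerrit"
                          f" --pipeline {pipeline['name']}"
                          f" --project {item['project']}"
                          f" --change {item['id']}")
```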
jeblair | graph is still being constructed | 17:01 |
fungi | i figured. been watching the red rectangle growing to the right on the cacti cpu graph | 17:07 |
jeblair | how long you think we should let this run? | 18:07 |
clarkb | jeblair: any rough idea how long it has taken in the past? | 18:32 |
*** rlandy is now known as rlandy|brb | 18:34 | |
jeblair | clarkb: hrm, i don't think i have timing info from previous runs. i would have guessed 1-2 hours. but we're at 2.25 now. | 18:34 |
jeblair | i think we're using more memory than last time, so there's more work to do (where work involves a lot of swapping to examine all the objects) | 18:35 |
jeblair | i'm sure it's working and will finish | 18:35 |
jeblair | it's more of a trade off between lost productivity now and obtaining data vs lost productivity the next time this happens if we don't fix it | 18:36 |
fungi | also we probably need to be ready to roll back the duplicate map key detection at restart if we find that there are projects which have inadvertently introduced any | 18:38 |
clarkb | if we aren't swapping already does it perform reasonably well? | 18:39 |
jeblair | fungi: or drop those projects from config, depending on which projects. | 18:39 |
clarkb | basically could we restart it, get jobs running again, then watch for leak and run graph generation before it becomes major performance issue? | 18:40 |
jeblair | clarkb: usually much faster (like maybe tens of minutes), but i think it's intense enough there's a high risk of losing the zk connection and causing disruption. | 18:40 |
jeblair | clarkb: it's an option, and may be less impactful, but it won't be transparent | 18:41 |
clarkb | ya just thinking that may be preferable to no jobs at all | 18:41 |
clarkb | especially if we do it around 0200UTC or so when we seem to have a quite time | 18:41 |
jeblair | yup | 18:42 |
clarkb | *quiet time | 18:42 |
jeblair | want to give it 30m more (so about 3h total)? | 18:43 |
clarkb | sure | 18:43 |
fungi | wfm | 18:44 |
fungi | i wonder if we should attempt to let events/results quiesce after the object graph is done, and then nab another copy of pipeline contents? the ones from before may not be as relevant now | 18:47 |
clarkb | I've approved a few things myself but sort of expected to go back and recheck | 18:48 |
dmsimard | fungi: yeah, we should definitely get a new snapshot of the pipelines right before restarting | 18:48 |
fungi | that's only if the scheduler is still sane enough to provide one | 18:49 |
fungi | no guarantees. and yes, worst case we just remind people to recheck/reapprove stuff | 18:49 |
fungi | i also need to look through the stack of release job failures today and possibly reenqueue stuff if any got caught up in playbook timeouts | 18:50 |
*** myoung|rover is now known as myoung|rover|bia | 18:58 | |
*** myoung|rover|bia is now known as myoung|rover|bbl | 19:00 | |
jeblair | welp, we crossed the 3 hour mark :( | 19:23 |
dmsimard | jeblair: do we have part of it ? or it's an all or nothing deal ? | 19:25 |
dmsimard | like, is it stuck in a recursion loop or something ? | 19:25 |
jeblair | all or nothing | 19:26 |
dmsimard | bleh | 19:26 |
jeblair | and no progress indication | 19:26 |
dmsimard | no insight from the dump either ? | 19:27 |
jeblair | could be 3 hours or 3 minutes away from completion | 19:27 |
jeblair | dmsimard: dump? | 19:27 |
dmsimard | the SIG USR2 | 19:27 |
dmsimard | or is that what we're running right now | 19:27 |
jeblair | oh, that hasn't been run, though i used the repl to run what it does independently. that's the object count i pasted earlier | 19:28 |
dmsimard | ok | 19:28 |
jeblair | what's running now is a graph traversal finding all the backlinks to one of the leaked layouts | 19:29 |
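The kind of traversal being described can be done with the objgraph library from an interactive shell inside the process; a minimal sketch, with the 'Layout' class name and the output filename as illustrative assumptions (this is not the exact repl session used here):

```python
# Minimal sketch of a backreference walk over a suspected leaked object,
# using objgraph inside the scheduler process (e.g. via a repl). With most
# of the heap swapped out, this touches nearly every object, which is why
# it is slow and heavy on swap.
import objgraph

# Roughly what the SIGUSR2 object-count dump reports.
objgraph.show_most_common_types(limit=20)

layouts = objgraph.by_type('Layout')  # class name assumed for illustration
if layouts:
    chain = objgraph.find_backref_chain(layouts[-1], objgraph.is_proper_module)
    objgraph.show_chain(chain, filename='layout-backrefs.dot')
```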
dmsimard | the swap and the load likely isn't helping :( | 19:29 |
jeblair | oh at this point the load is entirely due to this process. it has to access objects that have been swapped out, so everything is swapping in/out again, multiple times more than likely. | 19:31 |
*** rlandy|brb is now known as rlandy | 19:31 | |
dmsimard | does RAX have 32GB VMs ? :/ | 19:32 |
clarkb | it does have quite large VMs | 19:32 |
clarkb | I wouldn't want to use massive flavor long term but could be a reasonable way to have healthier zuul while fixing these bugs | 19:32 |
dmsimard | We should probably consider upgrading to the next flavor -- memory leaks will still be obvious while giving us the room we need to troubleshoot them | 19:33 |
fungi | we would probably still have at least mostly maxed out the ram on their 30gb flavor | 19:35 |
jeblair | restarting zuul-scheduler | 19:41 |
fungi | thanks | 19:42 |
fungi | so i guess there was at least that one | 19:53 |
fungi | will we need to manually submit 527507 through gerrit? | 19:53 |
jeblair | fungi: i reverted the change manually and restarted | 19:54 |
fungi | oh, gotcha | 19:54 |
fungi | was that the only offender? | 19:54 |
jeblair | i'm re-enqueing now | 19:54 |
jeblair | fungi: it stops at one; i don't know :( | 19:54 |
fungi | ahh | 19:55 |
jeblair | maybe next time we restart, we should manually apply a patch to make that a warning, and collect them | 19:55 |
AJaeger | jeblair: can we do trial runs to test this offline? | 19:55 |
jeblair | AJaeger: i have no tool to do that :( | 19:55 |
AJaeger | jeblair: that tool would be convenient for such changes. Ok... | 19:56 |
jeblair | AJaeger: yes, though we're not going to get changes like this after the v3.0 release :) | 19:56 |
AJaeger | ;) | 19:57 |
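The offline check AJaeger is asking about is feasible in principle with a small standalone script; a hedged sketch using PyYAML with a loader that rejects duplicate mapping keys (this is not Zuul's configloader, and real config validation covers much more than duplicate keys):

```python
# Hedged sketch of an offline duplicate-key check for job YAML files,
# roughly the offline check discussed above. Standalone PyYAML
# illustration, not Zuul's actual configuration loader.
import sys
import yaml

class StrictLoader(yaml.SafeLoader):
    """SafeLoader variant that refuses duplicate mapping keys."""

def construct_mapping_no_dupes(loader, node):
    seen = set()
    for key_node, _value_node in node.value:
        key = loader.construct_object(key_node)
        if key in seen:
            raise yaml.constructor.ConstructorError(
                None, None, f"duplicate key {key!r}", key_node.start_mark)
        seen.add(key)
    return loader.construct_mapping(node)

StrictLoader.add_constructor(
    yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, construct_mapping_no_dupes)

for path in sys.argv[1:]:
    with open(path) as fh:
        try:
            list(yaml.load_all(fh, Loader=StrictLoader))
        except yaml.YAMLError as exc:
            print(f"{path}: {exc}")
```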
clarkb | which change was reverted? | 20:02 |
AJaeger | clarkb: see #openstack-infra - openstack/congress-tempest-plugin | 20:03 |
AJaeger | clarkb: see #openstack-infra - https://review.openstack.org/527231 | 20:03 |
* AJaeger and copy & paste ;( | 20:03 | |
AJaeger | jeblair: what's up with zuulv3.openstack.org/ ? takes ages | 20:04 |
jeblair | AJaeger: oh i think i need to restart zuul-web | 20:05 |
*** myoung|rover|bbl is now known as myoung|rover | 20:05 | |
jeblair | AJaeger: thanks. there we go | 20:06 |
AJaeger | jeblair: looks fine now, thanks | 20:06 |
-openstackstatus- NOTICE: The zuul scheduler has been restarted after lengthy troubleshooting for a memory consumption issue; earlier changes have been reenqueued but if you notice jobs not running for a new or approved change you may want to leave a recheck comment or a new approval vote | 20:15 | |
*** ChanServ changes topic to "Situation normal | OpenStack Infra team incident handling channel | Find us in #openstack-infra for more stimulating conversation" | 20:16 | |
*** myoung|rover has quit IRC | 20:35 | |
*** myoung has joined #openstack-infra-incident | 20:39 | |
*** clarkb has quit IRC | 20:45 | |
*** clarkb has joined #openstack-infra-incident | 20:46 | |
*** myoung is now known as myoung|rover | 20:47 | |
*** rlandy is now known as rlandy|afk | 22:10 | |
jeblair | zuul's memory usage is annoyingly flat. :| | 23:17 |
clarkb | quick throw some reconfigures at it ? | 23:19 |