*** jeblair has joined #openstack-infra-incident | 15:21 | |
*** dmsimard has joined #openstack-infra-incident | 15:21 | |
jeblair | infra-root: hi, who's here? | 15:21 |
dmsimard | I'm not root but I'm here | 15:21 |
*** kiennt26 has joined #openstack-infra-incident | 15:22 | |
fungi | i am | 15:22 |
fungi | had to re-find the right buffer for this one | 15:22 |
* mordred waves to jeblair | 15:22 | |
*** pabelanger has joined #openstack-infra-incident | 15:22 | |
*** Shrews has joined #openstack-infra-incident | 15:23 | |
*** clarkb has joined #openstack-infra-incident | 15:24 | |
clarkb | o/ | 15:24 |
pabelanger | o/ | 15:25 |
jeblair | okay, i think all of us who are awake are here | 15:25 |
jeblair | i think we should plan how we're going to deal with the flood of problems | 15:26 |
fungi | agreed | 15:26 |
jeblair | if i'm going to work on memory usage, i won't be able to deal with anything else. probably for days. maybe the whole week. | 15:26 |
jeblair | so, assuming that doesn't just immediately convince everyone that we should roll back... | 15:27 |
fungi | i figured as much, and am trying not to draw your attention to anything unrelated to the performance/memory stuff | 15:27 |
fungi | i expect the rest of us can handle job configuration related issues | 15:27 |
jeblair | i think we'll need folks to jump into debugging problems, and if you find something you need me to dig into, put it on a backlog for me | 15:27 |
jeblair | so maybe we should have an etherpad with all the issues being raised, and who's working on them, and resolution, and then we can have a section for problems that need deeper debugging | 15:28 |
mordred | jeblair: works for me | 15:29 |
Shrews | wfm | 15:29 |
pabelanger | yah, that works well | 15:29 |
jeblair | how about https://etherpad.openstack.org/p/zuulv3-issues ? | 15:29 |
clarkb | sounds good | 15:29 |
mordred | maybe three sections- problems/who's working them - need deeper debugging - and need jeblair | 15:29 |
fungi | i'm also prioritizing review on jeblair's zuul patches since that's basically the one thing which would make me strongly consider rollback at this point, and the sooner performance improves the faster job configuration changes get tested/merged | 15:29 |
mordred | because some deeper debugging is stuff various of us can dig in to | 15:29 |
mordred | but there are times when the answer is 'need jeblair to look at XXX' - and we should prioritize which things we raise the jeblair flag on | 15:30 |
fungi | i would argue that unless the problem is more severe than the current performance/resource consumption situation with zuul, we should just find ways to not disturb jeblair | 15:31 |
pabelanger | ++ | 15:31 |
pabelanger | I'm happy to focus on job failures | 15:31 |
fungi | triage existing problems, and also attempt to help others in the community come up to speed on fixing their own jobs as much as possible to free up more bandwidth for all of us | 15:32 |
dmsimard | can we use a review topic for fixes that need to be prioritized ? | 15:33 |
dmsimard | like jeblair's performance/memory patches, or other important things | 15:33 |
fungi | i was wondering the same earlier. we could use topic:zuulv3 but there's a bunch of open stuff under that topic already | 15:34 |
jeblair | yeah, maybe something new; zuulv3-fixes ? | 15:34 |
fungi | though i expect most of jeblair's critical patches will be project:openstack-infra/zuul branch:feature/zuulv3 and there's not a ton of those | 15:34 |
jeblair | also, i'm not expecting a flood of patches related to memory use :) | 15:34 |
dmsimard | yeah I feel zuulv3 might be overloaded right now (especially with other projects starting to use it for their own purposes) | 15:35 |
mordred | yah | 15:35 |
fungi | more considering how to prioritize any random changes which are fixing broken jobs | 15:35 |
fungi | those could certainly use some topic coordination | 15:35 |
dmsimard | ^ | 15:35 |
mordred | I used 'unbreak-zuulv3' for some over the weekend, but zuulv3-fixes seems less negative | 15:35 |
mordred | fungi: what about 'zuulv3-jobs' for things fixing a job, 'zuulv3-fixes' for things that are fixing wider issues (like a patch to zuul-cloner, for instance) | 15:36 |
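Whichever topic names win out, contributors can attach the topic when uploading a fix with git-review's `-t`/`--topic` option (for example `git review -t zuulv3-fixes`, assuming that name is the one chosen), or edit the topic afterwards on the change in the Gerrit web UI.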
*** kiennt26 has quit IRC | 15:36 | |
fungi | yes, let's avoid creating our own negative imagery. we have enough of a public relations challenge just getting the community to not revolt over all the currently broken jobs | 15:36 |
fungi | so far they've been amazingly tolerant | 15:37 |
fungi | but i only expect that to last for so long | 15:37 |
jeblair | i also added a fixed issues section we can move things to, to help folks know when something known has been fixed | 15:37 |
fungi | that'll help, thanks | 15:38 |
mordred | infra-manual publishing works again too - so we can also start shifting FAQ content into infra-manual | 15:38 |
jeblair | who's working on neutron stadium? | 15:40 |
pabelanger | it looks like I am focusing on our zuul-executors right now, I am seeing large load on some of them | 15:41 |
Shrews | i'll attempt to poke at the branch matcher failure for us. If we don't have a test for that, that's where I'll start with it. | 15:41 |
jeblair | i moved the disk-filling line from issues with jobs to deeper debugging; looks like we have 2 instances of that now | 15:42 |
jeblair | oh ha that's the same instance | 15:42 |
jeblair | Shrews: thanks, can you put your name on it in the etherpad? | 15:42 |
mordred | pabelanger: tobiash submitted a patch which was +A'd this morning (may not have landed yet) which should reduce executor load | 15:43 |
Shrews | I can and shall! | 15:43 |
fungi | pabelanger: SpamapS has a feature proposed to limit picking up new jobs when load exceeds a given threshold (2.5 x cpu count currently) | 15:43 |
jeblair | fungi: oh, maybe i should review that now? | 15:43 |
pabelanger | mordred: ya, we might also need more executors, currently ze04.o.o is running 400 ansible-playbook processes, with load of 81 | 15:43 |
mordred | jeblair: ++ | 15:43 |
fungi | jeblair: i'm just now pulling it back up to see what state it's in, but he was working through it over the weekend | 15:43 |
pabelanger | mordred: ze03 seems stopped, trying to see why | 15:43 |
mordred | pabelanger: cool | 15:44 |
mordred | infra-root: if you didn't see - simple patch from tobiash which increases the ansible check interval from 0.001 to 0.01 which in his testing significantly reduces CPU load on executors https://review.openstack.org/#/c/508805/ - it has landed, so we should restart executors with it applied | 15:45 |
fungi | jeblair: yeah, i reviewed it and +2'd yesterday, as did tobiash: https://review.openstack.org/508649 | 15:45 |
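For context on 508649: the idea is for an executor to stop accepting new jobs while its load average exceeds a multiple of its CPU count (the 2.5x figure mentioned above). A minimal sketch of that check, purely illustrative and not the actual patch:

```python
import multiprocessing
import os

# Illustrative multiplier matching the "2.5 x cpu count" figure discussed above.
LOAD_MULTIPLIER = 2.5


def ok_to_accept_job():
    """Return True if the 1-minute load average is below the threshold.

    Sketch of the load-governor idea behind https://review.openstack.org/508649;
    the real change lives in the executor server code and its parameter name
    may differ.
    """
    max_load = multiprocessing.cpu_count() * LOAD_MULTIPLIER
    one_minute_load, _, _ = os.getloadavg()
    return one_minute_load < max_load
```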
Shrews | mordred: cool. but why not an even larger value? | 15:46 |
Shrews | (not sure what that affects, tbh) | 15:46 |
jeblair | Shrews: it increases the amount of time between tasks, which can become noticeable with lots of small tasks | 15:47 |
Shrews | ah | 15:47 |
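In other words, the interval is how long each waiting playbook sleeps between polls for a task result: at 0.001s every worker wakes roughly a thousand times a second, which adds up across the hundreds of ansible-playbook processes an executor runs, while a larger interval adds a little latency to every task. A rough illustration of that tradeoff (not Ansible's actual code, which is what 508805 tunes):

```python
import time


def wait_for_result(result_ready, check_interval=0.01):
    """Poll until result_ready() returns True, sleeping between checks.

    A smaller check_interval means lower per-task latency but far more
    wakeups per second; with hundreds of concurrent playbooks that polling
    alone can keep an executor's CPUs busy.  Illustration only.
    """
    while not result_ready():
        time.sleep(check_interval)
    return True
```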
* AJaeger joins and reads backscroll | 15:48 | |
pabelanger | okay ze03.o.o started again | 15:48 |
jeblair | pabelanger: what was wrong with ze03? | 15:48 |
fungi | we may also want to consider using promote to get things like 508344 through faster | 15:48 |
pabelanger | jeblair: not sure, it was stopped | 15:49 |
jeblair | pabelanger: like, no zuul process running? | 15:49 |
pabelanger | jeblair: yes, i think because somebody stopped it | 15:49 |
jeblair | pabelanger: can you look into that? or do you need me to? | 15:50 |
pabelanger | I see the following as last lines: http://paste.openstack.org/show/622478/ | 15:50 |
pabelanger | jeblair: yup, looking into history now | 15:50 |
pabelanger | but across all zuul-executors, we are having load issues, which is resulting in ansible-playbook timeouts | 15:50 |
clarkb | pabelanger: that probably makes tobiash's patch an important one | 15:51 |
dmsimard | clarkb: I was about to say that | 15:51 |
pabelanger | clarkb: yah, looking at it now | 15:51 |
jeblair | pabelanger: please put 'ze03 was stopped' on the etherpad list and investigate the cause | 15:51 |
pabelanger | ok | 15:52 |
jeblair | zuul components don't just stop -- if one does, that's a pretty critical bug | 15:52 |
jeblair | pabelanger, mordred, fungi: SpamapS change lgtm though i'd like to change the config param. however, we can land it now if we think it's important. | 15:53 |
mordred | should we go ahead and start working on adding more executors? Also - we currently run zuul-web co-located with zuul-scheduler - since zuul-scheduler wants all the memory and CPU at the moment- should we spin up a zuul-web server? | 15:53 |
jeblair | (we can change the tunable in a followup) | 15:53 |
jeblair | mordred: afaict, zuul-web uses almost nothing | 15:53 |
jeblair | mordred: we have 7 idle cpus on zuulv3.o.o, so cpu is not a problem | 15:53 |
jeblair | (remember, the gil means that the scheduler only gets to use one) | 15:54 |
fungi | pabelanger: the auth log on ze03 doesn't indicate anyone running sudo over the past few days until a few minutes ago when you started working on restarting the service | 15:54 |
fungi | also, no oom killer events in dmesg | 15:54 |
mordred | jeblair: I'm fine changing the tunable in the followup - if we can limit the number of jobs a single executor tries to run, that should hopefully at least reduce the instances of playbook timeouts because of executor resource starvation | 15:54 |
mordred | jeblair: ok - cool | 15:54 |
jeblair | okay, i'll land that now | 15:55 |
jeblair | and yeah, we should spin up more executors | 15:55 |
pabelanger | fungi: http://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/cmd/executor.py?h=feature/zuulv3#n76 seems to be the last log entry. The line 'Keep running until the parent dies' piqued my interest | 15:55 |
Shrews | pabelanger: yeah, the log streamer is a forked process of the executor. when the executor goes, we want the streamer to quit too | 15:57 |
fungi | pabelanger: any chance you tried to `journalctl -t zuul-executor` before restarting? right now i only see a traceback from 15:46z which i expect is when you were trying to start it back up | 15:57 |
mordred | jeblair: should we consider landing tristanC's move-status-to-zuul-web since those status requests will be potential thread contention on the scheduler? or do you think that's unlikely to be worth the distraction? | 15:57 |
pabelanger | fungi: the traceback is from me when I did service zuul-executor stop, systemd reported that the process was still running | 15:58 |
mordred | AJaeger: ^^ also - see scrollback if you didn't notice that we had conversation in here - especially https://etherpad.openstack.org/p/zuulv3-issues | 15:59 |
pabelanger | wow, 282 playbooks on ze03 already | 15:59 |
pabelanger | 150 was the average for zuul-launchers | 15:59 |
clarkb | is spamaps patch for limiting job grabbing up yet? | 16:00 |
clarkb | because that will also help with ^ | 16:00 |
jeblair | clarkb: yeah we were just discussing that; 508649 is +3 | 16:01 |
clarkb | awesome so we have a couple of load limiting options going in | 16:01 |
AJaeger | mordred: I read scrollback - thanks | 16:01 |
jeblair | okay, i think we have a plan. i'd say that once 508649 lands and is deployed, we should restart the executors. i have no idea what the status of either graceful or stop are in zuulv3. restarting the executors at this point may very well be disruptive. | 16:02 |
pabelanger | okay, ze03 should have the fix from tobiash but up to 110.0 load | 16:02 |
clarkb | I have a hunch that 8649 will also assist with potential disk consumption concerns | 16:02 |
pabelanger | with zuul-executor process taking 115% CPU, I wonder if we should think about adding another zuul-executor or two | 16:03 |
clarkb | pabelanger: yes was mentioned above we should add more | 16:03 |
jeblair | clarkb: i am very suspicious about disk consumption. i think we should avoid assuming that the problem is merely that the disk is full. the executor for the job you linked had something like 32G available. | 16:04 |
pabelanger | clarkb: okay, I'll start doing that | 16:04 |
clarkb | jeblair: ya, I've been watching it on ze04 and seen a 10s of GB swing but not to full yet | 16:04 |
clarkb | trying to sort out why it's not in cacti for proper data collection | 16:04 |
pabelanger | clarkb: I'll do 2 at a time | 16:05 |
jeblair | pabelanger: to be clear, you are going to look into why ze03 was stopped, right? | 16:05 |
jeblair | pabelanger: you put "pabelanger started again" on the etherpad, but it's not clear you took the task | 16:06 |
pabelanger | jeblair: yes, I also added it to the list. Would you like me to do that now? | 16:06 |
jeblair | pabelanger: it can wait, just wanted to clarify :) | 16:06 |
pabelanger | jeblair: At first glance, i think our start_log_streamer function finished, which has os._exit(0) in it. But I need to read up on why that would be; based on the comments, our pipe may have died | 16:07 |
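The mechanism pabelanger is describing, as we read executor.py: the executor forks the log streamer and holds the write end of a pipe; the child blocks reading that pipe and treats the read returning (because the parent's end closed or the pipe otherwise died) as its signal to exit via os._exit(0). A simplified sketch of that parent-death pattern, based on our reading rather than the exact zuul code:

```python
import os


def fork_child_tied_to_parent():
    """Fork a child that exits as soon as its parent goes away.

    The child blocks reading a pipe whose write end only the parent holds;
    when the parent exits (or the pipe dies), the read returns and the child
    shuts down with os._exit(0).  Sketch of the pattern only.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: wait here until the parent's end of the pipe closes.
        os.close(write_fd)
        with os.fdopen(read_fd) as pipe_read:
            pipe_read.read()
        os._exit(0)
    # Parent: keep write_fd open for as long as the child should live.
    os.close(read_fd)
    return pid, write_fd
```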
jeblair | okay, i'm going to go back to my hole and look at memory stuff. i'm not going to be following irc, so ping me if you need me | 16:07 |
fungi | pabelanger: based on the fact that systemd recorded an executor traceback at stop, are you sure there was no running executor process (vs a running process which was simply failing to do anything useful)? | 16:14 |
pabelanger | fungi: Yes, the traceback is because systemd tried to use zuul stop (via socket) and failed. | 16:15 |
fungi | okay, got it | 16:16 |
fungi | yeah, i see now the traceback is for the z-e cli not the daemon | 16:16 |
fungi | so this isn't another case of hung executors like we had last week | 16:17 |
AJaeger | I added reviews for changes that have fixes to the etherpad - releasenotes and neutron. For neutron, it's not clear which approach to use and I would love it if pabelanger or mordred could check lines 15/16 in the etherpad | 16:23 |
mordred | AJaeger: I'm not sure I understand the difference - they both look the same to me for neutron | 16:31 |
AJaeger | mordred: https://review.openstack.org/#/c/508822/ creates new jobs that include neutron - and that is then used everywhere. Alternative is: Adding the requires everywhere with the default job | 17:15 |
AJaeger | did I link to the wrong changes? | 17:15 |
mordred | AJaeger: maybe? I think we need to make the new jobs / project-template - as otherwise we're going to have to just add required-projects to the non-template version and it'll be harder to go back and clean up once we figure out a better strategy for the neutron jobs that need this | 17:16 |
AJaeger | mordred: I'm not following - and tired... | 17:19 |
AJaeger | mordred: so, you propose to follow 508822 and https://review.openstack.org/#/c/508775/3/zuul.d/projects.yaml which uses the new jobs? | 17:20 |
AJaeger | and suggest to -1 the other changes and ask them to do it the same way? | 17:20 |
AJaeger | mordred: https://review.openstack.org/#/c/508785/16/zuul.d/projects.yaml is the alternative way of doing it - I had wrong links in etherpad. will move around | 17:28 |
mordred | AJaeger: ah! thanks - that's helpful | 17:28 |
mordred | AJaeger: maybe a mix of the two - make a new project-template that just makes variants adding the neutron repo - then apply that to networking- projects - some of them still may need to do things like 508785 did (in that case it's also adding openstack/networking-sfc) | 17:30 |
AJaeger | mordred: So, 508822 as basis - and some others might still need adding more repos. Yeah, works for me. | 17:31 |
mordred | AJaeger: yah - we could further update 508822 to not actually create new jobs but just do the thing 508785 is doing but in the project-template definition of openstack-python-jobs-neutron | 17:33 |
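To make the distinction concrete: mordred's suggestion is a project-template whose entries are variants of existing jobs with required-projects added, rather than brand-new job definitions. A hedged sketch of what openstack-python-jobs-neutron could look like under that approach (the job name below is illustrative, not taken from the actual changes):

```yaml
- project-template:
    name: openstack-python-jobs-neutron
    check:
      jobs:
        - openstack-tox-py27:
            required-projects:
              - openstack/neutron
    gate:
      jobs:
        - openstack-tox-py27:
            required-projects:
              - openstack/neutron
```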
AJaeger | mordred: do you want to comment and tell frickler about it? Perhaps discuss further on #openstack-infra? | 17:33 |
mordred | AJaeger: yah - I'll go ping frickler in #openstack-infra? | 17:35 |
AJaeger | that's best - thanks | 17:35 |
pabelanger | fungi: clarkb: mordred: all zuul-executors restarted to pickup latest zuul fixes. And we also have ze09.o.o and ze10.o.o online | 18:13 |
clarkb | pabelanger: 9 and 10 had the fixes from the start right? | 18:13 |
pabelanger | clarkb: yes | 18:13 |
pabelanger | I am hoping all jobs were aborted properly, however ze07.o.o was in a weird state | 18:14 |
pabelanger | was running 3 zuul-executor processes | 18:14 |
mordred | pabelanger: we have not yet figured out why that happens sometimes | 18:18 |
pabelanger | mordred: I'd like to first replace our init.d script with proper systemd scripts (not today) then see if it is still an issue | 18:19 |
mordred | pabelanger: yes - I imagine that will either have an impact or be one of the things we wind up doing - but a) totally agree, not today and b) I mostly want to make sure we keep track of the main issue - which is that sometimes start/stop is weird, and that the sysvinit/systemd overlap right now may be a cause, a symptom, or something else (and I'd REALLY like to understand the third process | 18:21 |
mordred | we sometimes see) | 18:21 |
pabelanger | ++ | 18:22 |
clarkb | looks like scheduler ran into swap with the latest reconfig? but it's starting new jobs so still functioning | 18:37 |
clarkb | also implies that the schedulers are still happy with the changes made to them | 18:37 |
clarkb | er executors | 18:37 |
clarkb | what do people think of having a queue of changes that are ready for review in the etherpad? for example my tripleo fixes aren't ready for review because we need to collect more data on them and whether or not the fs detection fixes cp. But I am sure there are other changes that are ready and otherwise lost in the shuffle? | 18:40 |
clarkb | maybe if you see one of your changes is ready just add it to the list then remove it once it merges? | 18:40 |
fungi | wfm | 18:52 |
pabelanger | mordred: clarkb: fungi: okay, I am stepping away for the next hour or so. I'll catch up on backscroll then | 19:44 |
fungi | thanks for the heads up, pabelanger | 19:44 |
mordred | clarkb: wfm | 20:14 |
fungi | similarly, if you find a change which is ready and you're going to +1/+2 but it needs another +2 to approve, go ahead and add it | 20:32 |
fungi | to the list in the etherpad | 20:32 |