Monday, 2017-10-02

00:30 *** dkranz has quit IRC
01:56 <openstackgerrit> Tristan Cacqueray proposed openstack-infra/nodepool feature/zuulv3: Implement an OpenContainer driver  https://review.openstack.org/468753
02:06 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP add repl  https://review.openstack.org/508793
02:15 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP add repl  https://review.openstack.org/508793
02:17 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP add repl  https://review.openstack.org/508793
03:46 *** isaacb has joined #zuul
03:54 *** isaacb has quit IRC
05:18 *** logan- has quit IRC
05:21 *** logan- has joined #zuul
05:42 <openstackgerrit> Tobias Henkel proposed openstack-infra/zuul feature/zuulv3: Increase ansible internal_poll_interval  https://review.openstack.org/508805
05:42 <tobiash> jeblair, mordred, SpamapS: this should reduce the cpu load on the executors quite a bit ^^^
05:48 *** logan- has quit IRC
05:53 *** logan- has joined #zuul
07:09 *** hashar has joined #zuul
07:18 *** electrofelix has joined #zuul
07:57 <tristanC> tobiash: nice!
08:02 *** hashar has quit IRC
08:02 *** hashar has joined #zuul
08:40 <openstackgerrit> Merged openstack-infra/zuul-jobs master: Change in to work dir before executing tox  https://review.openstack.org/508783
09:00 *** bhavik1 has joined #zuul
10:29 *** jkilpatr has quit IRC
10:39 *** bhavik1 has quit IRC
11:06 *** jkilpatr has joined #zuul
12:21 *** dkranz has joined #zuul
13:13 *** isaacb has joined #zuul
13:28 <Shrews> So, I'm going to share this little ZooKeeper knowledge nugget here, for all to enjoy and be aware of: I was wondering why, after ZK filled up its disk and suspended all connections, nodepool didn't seem to reconnect properly, but zuul did (even though they share identical code).
13:29 <Shrews> Having kazoo debug logging on zuul was helpful: http://paste.openstack.org/show/622467/
13:29 <Shrews> Turns out the kazoo library uses a retry mechanism that varies the connection retry attempts and spaces them out oddly (as seen in that paste). http://kazoo.readthedocs.io/en/latest/api/retry.html
13:30 <Shrews> So it seems that I just didn't wait long enough for nodepool-launchers to try again.
13:30 <Shrews> That is all. Happy Monday'ing
13:32 <Shrews> Also, note that kazoo seems to try both the ipv6 and ipv4 addresses. That surprised me.
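[editor's note: the "oddly spaced" retries Shrews mentions come from kazoo's KazooRetry helper, which sleeps for exponentially growing intervals plus random jitter between connection attempts. A minimal sketch of that kind of delay schedule — the parameter names mirror kazoo's, but the values are illustrative, not kazoo's documented defaults:]

```python
import random

def backoff_delays(tries, delay=0.1, backoff=2, max_jitter=0.4, max_delay=60.0):
    """Yield sleep intervals for successive retries: exponential growth
    (delay * backoff**attempt) plus random jitter, capped at max_delay."""
    for attempt in range(tries):
        jitter = random.uniform(0, max_jitter)
        yield min(delay * (backoff ** attempt) + jitter, max_delay)

# With jitter disabled the schedule is deterministic and clearly exponential:
print(list(backoff_delays(5, max_jitter=0)))  # [0.1, 0.2, 0.4, 0.8, 1.6]
```

With jitter enabled, consecutive launchers reconnect at staggered, hard-to-predict moments — which is why it looked like nodepool simply wasn't reconnecting.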
13:34 <pabelanger> Shrews: okay, so did we restart nodepool-launchers this morning?
13:35 <Shrews> pabelanger: yes
13:35 <Shrews> but not zuul
13:35 <pabelanger> Shrews: was that around 90mins ago? I see a lot of ready nodes on our launchers, trying to see why that is
13:36 <Shrews> pabelanger: i think that's just the nature of the beast
13:37 <Shrews> i did not restart any builders either, but they seem to have recovered
13:37 <pabelanger> Shrews: okay, I think it might be more however. I don't see any nodes that are used. It could be zuulv3 hasn't grabbed them yet for some reason
13:38 <pabelanger> request-list is showing a large list of fulfilled, I've never seen that before
13:39 <Shrews> pabelanger: hmm, yeah. another zuul issue, perhaps related to the stuff jeblair threw in over the weekend?
13:39 <pabelanger> kk, that is what I expected. will hold off for now
13:42 <Shrews> nodes are READY, assigned, and unlocked, so zuul hasn't done anything with them after nodepool fulfilled the request. odd
13:43 <pabelanger> Yah, I wonder if dropped zk connections / filled HDDs confused something
13:43 <pabelanger> zuulv3 also has 1036 events in the queue too, so it is doing something
13:44 <Shrews> pabelanger: maybe just slow?
13:44 <pabelanger> maybe
13:44 <pabelanger> For now, I'm holding off until jeblair is online
14:17 <pabelanger> Shrews: have you seen http://paste.openstack.org/show/622475/ before? I am seeing a large number of NoNodeErrors in the debug.log
14:32 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Increase ansible internal_poll_interval  https://review.openstack.org/508805
14:40 <Shrews> pabelanger: yes
14:41 <Shrews> pabelanger: tl;dr has to do with NodeRequest disappearing from zuul (likely due to zk connection issues)
14:41 <pabelanger> ack
14:43 <Shrews> pabelanger: since that seems rather recent, not sure what's happening with zuul right now
14:44 <Shrews> pabelanger: but if nodepool sees an unlocked node that is READY, IN-USE, or USED, it will delete it. loss of zk connection in zuul will unlock a node
14:44 <pabelanger> Shrews: ya, zuul looks to be happy again and processing things.  I can see node requests in the debug log now
14:45 <dmsimard> Shrews: fwiw the web consoles aren't working, not sure if it's related
14:45 <Shrews> well, not READY
14:45 <pabelanger> Shrews: okay, that is helpful.
14:46 <Shrews> pabelanger: if you're interested, this bit of code: https://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/launcher.py?h=feature/zuulv3#n547
14:46 <pabelanger> Shrews: great, thanks
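[editor's note: a toy sketch of the cleanup rule Shrews describes (with his correction that READY nodes are spared): an unlocked node in an allocated or used state was abandoned by its requestor and gets deleted. States and names here are illustrative; the real logic is in the launcher.py link above.]

```python
# Illustrative states; see nodepool's launcher.py for the real set.
IN_USE, USED, READY = 'in-use', 'used', 'ready'

def should_clean_up(state, locked):
    """A node in IN-USE or USED that nobody holds a lock on was abandoned
    by its requestor (e.g. zuul's ZooKeeper session dropped), so delete it."""
    return not locked and state in (IN_USE, USED)

print(should_clean_up(USED, locked=False))   # True  -> node gets deleted
print(should_clean_up(USED, locked=True))    # False -> still held by zuul
print(should_clean_up(READY, locked=False))  # False -> ready nodes are kept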
14:47 <Shrews> dmsimard: let's assume all issues are due to zuul doing more memory/thread debugging things (applied yesterday, i think)
14:48 * dmsimard nod
14:48 * Shrews catches up on #infra backscroll
15:14 <jeblair> Shrews: there should be no more overhead from memory debugging.  that finished hours ago
15:16 *** isaacb has quit IRC
15:16 <Shrews> jeblair: k. wasn't aware
15:36 *** lennyb_ has joined #zuul
15:36 *** lennyb_ has quit IRC
15:47 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul-jobs master: Handle z-c shim copies across filesystems  https://review.openstack.org/508772
15:48 <jeblair> SpamapS: governator patch looks good to me -- though i have one suggestion about the configuration value
15:56 <SpamapS> jeblair: hah, yeah I thought about that too. Perhaps we can just add a max_load_average config item and make it clear the multiplier is only used if you don't set load average explicitly.
15:59 <jeblair> SpamapS: is the multiplier really useful though?  i suggested it as a way to get a sane default load average value, but it's really just an indirect thing.  the direct value is load average itself.
15:59 <jeblair> i feel like it's like making admins do unnecessary unit conversions :)
16:13 <dmsimard> I know we're not using Ansible 2.4 yet but FYI I found a bug impacting ara's configuration parsing https://github.com/ansible/ansible/pull/31200/commits/865d57aaf2fe47bc95d7becaac3a0cb182f71aa8
16:13 <dmsimard> already pinged bcoca about it.
16:23 <SpamapS> jeblair: the multiplier is useful if you plan to deploy executors on different sized boxes and you know roughly how much more load you can handle than you have CPUs.
16:23 <SpamapS> jeblair: I agree, that seems like a weird and unnatural thing to plan for.
16:25 <SpamapS> I have just never experienced a calculated tuning parameter where the authors of the software got any of the constants right. ;)
16:25 <SpamapS> So it goes deep inside me never to have a hard set constant like that. :-P
16:26 <jeblair> SpamapS: right, i guess what i'm trying to say is that i don't expect it to be right, and i don't expect people to fix it when things are wrong.  what i expect is for admins to say "my executor hit a load average of 50!  that's too high!  i never want it going over 30!" and to set that
16:26 <SpamapS> I agree.
16:28 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Limit concurrency in zuul-executor under load  https://review.openstack.org/508649
16:28 <SpamapS> My only thinking was that I don't want to take away the ability to just tune what's already used in the default calc. So I think later on we should definitely add the hard override of the load average. I am not totally opposed to removing the multiplier, but think once we have fewer fires burning we should think it through.
16:32 <jeblair> i think the main reason to have the multiplier is, as you say, to use a single zuul.conf on differently-sized executors.  i think that's unlikely to work very well in the long run (it would only work if cpu/disk/ram scaled exactly in tandem), and anyone doing that can probably handle templating the value in cfgmgt anyway.  so i just want to avoid making this confusing or difficult.
16:33 <jeblair> but yeah, we can think about it later.  the patch is merged, which means this is, in fact, *not* bikeshedding.  the shed has been built!
16:33 <jeblair> SpamapS: thanks :)
16:35 <SpamapS> jeblair: I also wonder if we should start doing some memory backoff too
16:35 *** hashar has quit IRC
16:36 <jeblair> SpamapS: almost certainly
16:36 <SpamapS> load could be relatively moderate, but a few extra jobs start and the cascade into OOM starts
16:36 <SpamapS> I'll see if we could do that in manageLoad and kind of get it for free
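[editor's note: the governor under discussion can be sketched roughly like this: derive a default maximum load average from the CPU count via a multiplier (the indirection jeblair questions), and, per SpamapS's follow-up, also back off when available memory runs low. Option names, constants, and the /proc/meminfo approach are all illustrative, not zuul's actual implementation.]

```python
import os

def accept_more_jobs(load_multiplier=2.5, min_avail_mem_pct=5.0, meminfo=None):
    """Return False when the 1-minute load average exceeds a multiple of
    the CPU count, or available memory drops below a threshold."""
    max_load = load_multiplier * os.cpu_count()
    if os.getloadavg()[0] > max_load:
        return False
    if meminfo is None:
        # Linux-only sketch: parse /proc/meminfo (values in kB).
        meminfo = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, value = line.split(':', 1)
                meminfo[key] = int(value.split()[0])
    avail_pct = 100.0 * meminfo['MemAvailable'] / meminfo['MemTotal']
    return avail_pct >= min_avail_mem_pct

# A fake meminfo dict (kB) makes the memory half easy to exercise:
fake = {'MemAvailable': 800_000, 'MemTotal': 16_000_000}  # 5% available
print(accept_more_jobs(meminfo=fake))
```

A hard `max_load_average` override, as discussed above, would simply bypass the `load_multiplier * cpu_count` default.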
17:04 *** bhavik1 has joined #zuul
17:10 *** harlowja has joined #zuul
17:18 *** bhavik1 has quit IRC
17:35 <openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: DNM: Fix branch matching logic  https://review.openstack.org/508955
17:35 <Shrews> so that ^^^ is showing that we do indeed have an issue with branch matching logic (at least in the case of (?!...) matching)
17:36 <Shrews> now need to find the fix
17:56 *** electrofelix has quit IRC
17:57 <openstackgerrit> Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add memory awareness to system load governor  https://review.openstack.org/508960
18:10 <Shrews> jeblair: when you poke your head up, could you please tell me if this 'or' between the branch and ref checks should actually be an 'and'?  http://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/change_matcher.py?h=feature/zuulv3#n64
18:11 <Shrews> because the ref is matching, even when branch doesn't
18:13 <jeblair> Shrews: while i'm thinking about it -- a v2/v3 difference is that all 'change' objects that function gets now have a ref attribute.
18:14 <jeblair> likely explains the immediate cause of the break
18:14 <Shrews> hrm, changing to 'and' breaks everything (but does make my new test pass)
18:14 <Shrews> so likely not the correct fix
18:15 <jeblair> Shrews: probably just need to rework that as "if has branch: return matches branch.  return matches ref."
18:15 * Shrews tries
18:15 <jeblair> basically, give branch priority, and fall back on ref
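[editor's note: jeblair's suggestion — match against the change's branch when it has one, and only fall back to matching the ref otherwise — sketched against a stand-in matcher class. This is a simplification for illustration; zuul's real implementation is in the change_matcher.py link above.]

```python
import re

class BranchMatcher:
    """Simplified stand-in: match branch with priority, ref as fallback."""
    def __init__(self, pattern):
        self.regex = re.compile(pattern)

    def matches(self, change):
        branch = getattr(change, 'branch', None)
        if branch is not None:
            return bool(self.regex.match(branch))
        return bool(self.regex.match(getattr(change, 'ref', '') or ''))

class Change:
    def __init__(self, branch=None, ref=None):
        self.branch, self.ref = branch, ref

# Negative lookahead, the (?!...) case Shrews hit:
m = BranchMatcher(r'^(?!stable/).*$')
print(m.matches(Change(branch='master')))       # True
print(m.matches(Change(branch='stable/pike')))  # False
```

Without the branch-priority ordering, a change whose ref happens to match can slip past a branch pattern it should fail, which is the symptom Shrews observed.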
18:15 <Shrews> jeblair: ok. thx!
18:26 <Shrews> hrm, that doesn't seem to be the fix either
18:32 <openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: DNM: Fix branch matching logic  https://review.openstack.org/508955
19:27 *** hashar has joined #zuul
19:28 <openstackgerrit> Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add memory awareness to system load governor  https://review.openstack.org/508960
19:48 <Shrews> oh, silly me
19:48 <openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Fix branch matching logic  https://review.openstack.org/508955
19:48 <Shrews> infra-root: ^^^ should fix the branch matching bug
19:49 <fungi> Shrews: awesome!!!
19:49 <Shrews> note to self: don't modify existing set of test configs w/o expecting consequences
19:49 <fungi> heh
19:50 <openstackgerrit> Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add memory awareness to system load governor  https://review.openstack.org/508960
19:54 <jeblair> Shrews: awesome thanks!  +2 with comments
19:55 <Shrews> jeblair: oops, lemme clean that up
19:55 <openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Fix branch matching logic  https://review.openstack.org/508955
19:55 <Shrews> superfluous :'s removed!
19:56 <jeblair> Shrews: cool; did you see my comment about @simple_layout?
19:56 <Shrews> oops, no
19:56 <jeblair> Shrews: maybe look into that and see if you want to rework the test as a followup
19:56 <Shrews> jeblair: ah, k. should i change that? or do a follow up? or just note for next time?
19:57 <Shrews> jeblair: i'll do a follow up
19:57 <jeblair> Shrews: mostly i wanted you to know about it so you wouldn't think all zuul tests needed this much overhead :)
20:01 <openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Simplify TestSchedulerBranchMatcher  https://review.openstack.org/508980
20:02 <Shrews> jeblair: umm, yeah. that is much less overhead  :)
20:12 <clarkb> seeing a lot of http://paste.openstack.org/show/622499/ in the zuul scheduler debug log
20:12 <clarkb> SpamapS: ^ any chance that is related to the governor change?
20:15 <clarkb> zuulv3 is effectively dead right now I think. No in use nodes
20:15 <jeblair> i think that's a straightforward bug
20:15 <clarkb> but still a lot of ansible on executors so maybe I'm wrong?
20:16 <jeblair> and yeah, it's in the middle of a config generation and an object graph walk
20:16 <clarkb> ah ok
20:16 <jeblair> bad timing on my part, sorry
20:17 <clarkb> what's weird about that bug is it complains about str but it is generated from __ which should get the type right, I thought
20:17 <clarkb> so wouldn't it be _str_gearman_job?
20:19 <jeblair> clarkb: i think it's iterating over dict keys instead of values
20:19 <jeblair> clarkb: this line https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/executor/client.py?id=feature/zuulv3#n104
20:21 <clarkb> ah ok, still funny that python seems to infer more than I would expect but that makes sense given the loop there
20:22 <clarkb> ya dict is uuid: build
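[editor's note: the bug jeblair and clarkb pinpoint, reproduced in miniature — iterating a uuid-to-build dict yields the string keys, not the build objects. The Build class here is a hypothetical stand-in, not zuul's.]

```python
class Build:
    def __init__(self, uuid):
        self.uuid = uuid

builds = {'abc123': Build('abc123')}  # keyed by uuid, as in zuul's client.py

# Buggy: iterating the dict directly yields the str keys, hence the
# "str" complaint on what was assumed to be a Build object.
for build in builds:
    print(type(build).__name__)   # str

# Fixed: iterate the values to get the Build objects.
for build in builds.values():
    print(type(build).__name__)   # Build
```

The fix clarkb posts just below (review 508992) is exactly this kind of keys-versus-values correction.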
20:26 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Fix Gearman UnknownJob handler  https://review.openstack.org/508992
20:31 *** dkranz has quit IRC
20:48 *** jkilpatr has quit IRC
20:50 *** hashar has quit IRC
21:13 *** jkilpatr has joined #zuul
21:38 <openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Fix branch matching logic  https://review.openstack.org/508955
21:45 * Shrews stepping afk for a few minutes before what he expects to be a very busy Zuul meeting
21:46 <jlk> oh yeah that's happening today
21:58 <mordred> jeblair: I cannot make today's zuul meeting due to a scheduling conflict - I'll be back for meetings next week - I believe I'm up to date on my things in the etherpad
22:05 <Shrews> mordred: we are skipping
22:05 <mordred> Shrews: cool - I will feel less bad
22:07 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Move shadow layout to item  https://review.openstack.org/509014
22:08 <jeblair> from what i can tell via objgraph, that ^ is very *unlikely* to be the main part of the problem.  however, it could still be a very big waste of memory.
22:10 <jeblair> from my example earlier (where i knew about 19 layouts but python had 77) those layouts would have been included in the 19.  i think.  if this ends up reducing the set of unknown layouts, i still won't understand the mechanism (but anything is possible)
22:12 <clarkb> jeblair: you have +2's but not sure if you want to wait for anything else before approving
22:17 <jeblair> it passed tests locally, i'll +3
23:06 <SpamapS> jeblair: did anything pan out with the gc module?
23:08 <jeblair> SpamapS: i'm not really sure what to explore there.
23:13 <jeblair> i did force a collection and note that it said there were 0 unreachable objects
23:14 <jeblair> but i haven't synthesized all of the information about the gc into something actionable
23:15 <SpamapS> jeblair: My previous understanding of how gc works is that with each generation it will go deeper into objects to find if they are self referencing.
23:15 <SpamapS> but I could be inventing that to suit my hope
23:16 <jeblair> i mean, the simplest explanation is that these objects are referenced, and i haven't quite satisfied myself that they aren't.
23:17 <jeblair> it's possible that the unsatisfactory results i'm getting from objgraph right now are, in fact, saying that they are unreferenced, but some weird stuff happened with a prior debugging session and i no longer trust the output from the current run; i'll need to try again after a restart.
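[editor's note: the forced collection jeblair describes (e.g. from the WIP repl patch above) can be done in any Python process. gc.collect() returns the number of unreachable objects it found; his run reported 0.]

```python
import gc

# Run a full collection of all generations; the return value is the
# number of unreachable objects found.
unreachable = gc.collect()
print('unreachable objects found:', unreachable)

# Objects the collector found unreachable but could not free end up in
# gc.garbage; on Python 3.4+ this list is almost always empty.
print('uncollectable objects:', len(gc.garbage))
```

A result of 0 unreachable objects is consistent with jeblair's simplest explanation: the extra layouts are still referenced from somewhere, so the collector has nothing to reclaim.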

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!