*** dkranz has quit IRC | 00:30 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool feature/zuulv3: Implement an OpenContainer driver https://review.openstack.org/468753 | 01:56 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP add repl https://review.openstack.org/508793 | 02:06 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP add repl https://review.openstack.org/508793 | 02:15 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP add repl https://review.openstack.org/508793 | 02:17 |
*** isaacb has joined #zuul | 03:46 | |
*** isaacb has quit IRC | 03:54 | |
*** logan- has quit IRC | 05:18 | |
*** logan- has joined #zuul | 05:21 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul feature/zuulv3: Increase ansible internal_poll_interval https://review.openstack.org/508805 | 05:42 |
tobiash | jeblair, mordred, SpamapS: this should reduce the cpu load on the executors quite a bit ^^^ | 05:42 |
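For context, the Ansible setting being raised is `internal_poll_interval` (under `[defaults]`), which controls how often Ansible's internal processes poll each other for results; the stock default of 0.001 seconds busy-waits aggressively on a loaded executor. A minimal sketch of the ini form of that setting; the 0.01 value is illustrative, not necessarily what the patch chooses:

```ini
# ansible.cfg -- illustrative only
[defaults]
# Interval (in seconds) at which Ansible's internal processes poll for
# results.  Raising it from the 0.001 default trades a little per-task
# latency for much less CPU spinning on busy executors.
internal_poll_interval = 0.01
```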
*** logan- has quit IRC | 05:48 | |
*** logan- has joined #zuul | 05:53 | |
*** hashar has joined #zuul | 07:09 | |
*** electrofelix has joined #zuul | 07:18 | |
tristanC | tobiash: nice! | 07:57 |
*** hashar has quit IRC | 08:02 | |
*** hashar has joined #zuul | 08:02 | |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Change in to work dir before executing tox https://review.openstack.org/508783 | 08:40 |
*** bhavik1 has joined #zuul | 09:00 | |
*** jkilpatr has quit IRC | 10:29 | |
*** bhavik1 has quit IRC | 10:39 | |
*** jkilpatr has joined #zuul | 11:06 | |
*** dkranz has joined #zuul | 12:21 | |
*** isaacb has joined #zuul | 13:13 | |
Shrews | So, I'm going to share this little ZooKeeper knowledge nugget here, for all to enjoy and be aware of: I was wondering why, after ZK filled up its disk and suspended all connections, nodepool didn't seem to reconnect properly, but zuul did (even though they share identical code). | 13:28 |
Shrews | Having kazoo debug logging on zuul was helpful: http://paste.openstack.org/show/622467/ | 13:29 |
Shrews | Turns out the kazoo library uses a retry mechanism that varies the connection retry attempts and spaces them out oddly (as seen in that paste). http://kazoo.readthedocs.io/en/latest/api/retry.html | 13:29 |
Shrews | So it seems that I just didn't wait long enough for nodepool-launchers to try again. | 13:30 |
Shrews | That is all. Happy Monday'ing | 13:30 |
Shrews | Also, note that kazoo seems to try both the ipv6 and ipv4 addresses. That surprised me. | 13:32 |
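For anyone reproducing this, a minimal sketch of tuning kazoo's connection retry so reconnection attempts land on a predictable schedule. This uses the stock kazoo API; the host name and retry values are illustrative, not what nodepool or zuul actually configure:

```python
from kazoo.client import KazooClient
from kazoo.retry import KazooRetry

# Retry forever with exponential backoff capped at 60 seconds, instead of
# relying on the default schedule whose widely spaced attempts made it look
# like nodepool was never reconnecting.
connection_retry = KazooRetry(max_tries=-1, delay=0.3, backoff=2,
                              max_jitter=0.4, max_delay=60)

zk = KazooClient(hosts='zk01.example.org:2181',
                 connection_retry=connection_retry)
zk.start()
```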
pabelanger | Shrews: okay, so did we restart nodepool-launchers this morning? | 13:34 |
Shrews | pabelanger: yes | 13:35 |
Shrews | but not zuul | 13:35 |
pabelanger | Shrews: was that around 90mins ago? I see a lot of ready nodes on our launchers, trying to see why that is | 13:35 |
Shrews | pabelanger: i think that's just the nature of the beast | 13:36 |
Shrews | i did not restart any builders either, but they seem to have recovered | 13:37 |
pabelanger | Shrews: okay, I think it might be more than that, however. I don't see any nodes that are used. It could be that zuulv3 hasn't grabbed them yet for some reason | 13:37 |
pabelanger | request-list is showing a large list of fulfilled requests; I've never seen that before | 13:38 |
Shrews | pabelanger: hmm, yeah. another zuul issue, perhaps related to the stuff jeblair threw in over the weekend? | 13:39 |
pabelanger | kk, that is what I expected. will hold off for now | 13:39 |
Shrews | nodes are READY, assigned, and unlocked, so zuul hasn't done anything with them after nodepool fulfilled the request. odd | 13:42 |
pabelanger | Yah, I wonder if dropped zk connections / filled HDDs confused something | 13:43 |
pabelanger | zuulv3 also has 1036 events in the queue, so it is doing something | 13:43 |
Shrews | pabelanger: maybe just slow? | 13:44 |
pabelanger | maybe | 13:44 |
pabelanger | For now, I'm holding off until jeblair is online | 13:44 |
pabelanger | Shrews: have you seen http://paste.openstack.org/show/622475/ before? I am seeing a large number of NoNodeErrors in the debug.log | 14:17 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Increase ansible internal_poll_interval https://review.openstack.org/508805 | 14:32 |
Shrews | pabelanger: yes | 14:40 |
Shrews | pabelanger: tl;dr has to do with NodeRequest disappearing from zuul (likely due to zk connection issues) | 14:41 |
pabelanger | ack | 14:41 |
Shrews | pabelanger: since that seems rather recent, not sure what's happening with zuul right now | 14:43 |
Shrews | pabelanger: but if nodepool sees an unlocked node that is READY, IN-USE, or USED, it will delete it. loss of zk connection in zuul will unlock a node | 14:44 |
pabelanger | Shrews: ya, zuul looks to be happy again and processing things. I can see node requests in the debug log now | 14:44 |
dmsimard | Shrews: fwiw the web consoles aren't working, not sure if it's related | 14:45 |
Shrews | well, not READY | 14:45 |
pabelanger | Shrews: okay, that is helpful. | 14:45 |
Shrews | pabelanger: if you're interested, this bit of code: https://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/launcher.py?h=feature/zuulv3#n547 | 14:46 |
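Roughly, the cleanup pass in that launcher code does the following (a paraphrased sketch, not the actual nodepool source; the state constants and the try_lock/store callables are hypothetical stand-ins for nodepool's ZooKeeper wrapper):

```python
# Paraphrased sketch of the leaked-node cleanup described above.
IN_USE, USED, DELETING = 'in-use', 'used', 'deleting'

def cleanup_leaked_nodes(nodes, try_lock, store):
    """Delete allocated nodes whose holder (zuul) no longer has them locked."""
    for node in nodes:
        if node.state not in (IN_USE, USED):
            continue
        if not try_lock(node):
            # Still locked: zuul's ZooKeeper session is alive and using it.
            continue
        # The lock was free, meaning zuul's session (and its lock) went away,
        # e.g. because of the connection loss above; reclaim the node.
        node.state = DELETING
        store(node)
```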
pabelanger | Shrews: great, thanks | 14:46 |
Shrews | dmsimard: let's assume all issues are due to zuul doing more memory/thread debugging things (applied yesterday, i think) | 14:47 |
* dmsimard nod | 14:48 | |
* Shrews catches up on #infra backscroll | 14:48 | |
jeblair | Shrews: there should be no more overhead from memory debugging. that finished hours ago | 15:14 |
*** isaacb has quit IRC | 15:16 | |
Shrews | jeblair: k. wasn't aware | 15:16 |
*** lennyb_ has joined #zuul | 15:36 | |
*** lennyb_ has quit IRC | 15:36 | |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul-jobs master: Handle z-c shim copies across filesystems https://review.openstack.org/508772 | 15:47 |
jeblair | SpamapS: governator patch looks good to me -- though i have one suggestion about the configuration value | 15:48 |
SpamapS | jeblair: hah, yeah I thought about that too. Perhaps we can just add a max_load_average config item and make it clear the multiplier is only used if you don't set load average explicitly. | 15:56 |
jeblair | SpamapS: is the multiplier really useful though? i suggested it as a way to get a sane default load average value, but it's really just an indirect thing. the direct value is load average itself. | 15:59 |
jeblair | i feel like it's making admins do unnecessary unit conversions :) | 15:59 |
dmsimard | I know we're not using Ansible 2.4 yet but FYI I found a bug impacting ara's configuration parsing https://github.com/ansible/ansible/pull/31200/commits/865d57aaf2fe47bc95d7becaac3a0cb182f71aa8 | 16:13 |
dmsimard | already pinged bcoca about it. | 16:13 |
SpamapS | jeblair: the multiplier is useful if you plan to deploy executors on different sized boxes and you know roughly how much more load you can handle than you have CPUs. | 16:23 |
SpamapS | jeblair: I agree, that seems like a weird and unnatural thing to plan for. | 16:23 |
SpamapS | I have just never seen a calculated tuning parameter where the authors of the software got any of the constants right. ;) | 16:25 |
SpamapS | So it's deeply ingrained in me never to have a hard-set constant like that. :-P | 16:25 |
jeblair | SpamapS: right, i guess what i'm trying to say is that i don't expect it to be right, and i don't expect people to fix it when things are wrong. what i expect is for admins to say "my executor hit a load average of 50! that's too high! i never want it going over 30!" and to set that | 16:26 |
SpamapS | I agree. | 16:26 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Limit concurrency in zuul-executor under load https://review.openstack.org/508649 | 16:28 |
SpamapS | My only thinking was that I don't want to take away the ability to just tune what's already used in the default calc. So I think later on we should definitely add the hard override of the load average. I am not totally opposed to removing the multiplier, but think once we have fewer fires burning we should think it through. | 16:28 |
jeblair | i think the main reason to have the multiplier is, as you say, to use a single zuul.conf on differently-sized executors. i think that's unlikely to work very well in the long run (it would only work if cpu/disk/ram scaled exactly in tandem), and anyone doing that can probably handle templating the value in cfgmgt anyway. so i just want to avoid making this confusing or difficult. | 16:32 |
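For reference, a sketch of the relevant zuul.conf knobs; the `load_multiplier` name is assumed from the merged change, and `max_load_average` is the hypothetical direct override being discussed, not something that exists at this point:

```ini
# zuul.conf sketch -- illustrative values only
[executor]
# Stop accepting new jobs when the 1-minute load average exceeds
# load_multiplier * CPU count (option name assumed from the merged change).
load_multiplier = 2.5
# Hypothetical direct cap discussed above, not yet implemented:
# max_load_average = 30
```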
jeblair | but yeah, we can think about it later. the patch is merged, which means this is, in fact, *not* bikeshedding. the shed has been built! | 16:33 |
jeblair | SpamapS: thanks :) | 16:33 |
SpamapS | jeblair: I also wonder if we should start doing some memory backoff too | 16:35 |
*** hashar has quit IRC | 16:35 | |
jeblair | SpamapS: almost certainly | 16:36 |
SpamapS | load could be relatively moderate, but a few extra jobs start and the cascade into OOM starts | 16:36 |
SpamapS | I'll see if we could do that in manageLoad and kind of get it for free | 16:36 |
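A rough sketch of what such a memory check could look like inside a manageLoad-style governor, using psutil; the threshold and the register/unregister callbacks are hypothetical placeholders, not zuul's actual executor API:

```python
import psutil

# Hypothetical threshold: stop taking new jobs when less than 5% of system
# memory is available, and resume once it recovers.
MIN_AVAIL_MEM_PCT = 5.0

def manage_load(accepting_work, register, unregister):
    """Toggle job acceptance based on available memory (sketch only)."""
    vm = psutil.virtual_memory()
    avail_pct = 100.0 * vm.available / vm.total
    if accepting_work and avail_pct < MIN_AVAIL_MEM_PCT:
        unregister()   # stop asking gearman for new work
        return False
    if not accepting_work and avail_pct >= MIN_AVAIL_MEM_PCT:
        register()     # memory recovered, resume accepting work
        return True
    return accepting_work
```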
*** bhavik1 has joined #zuul | 17:04 | |
*** harlowja has joined #zuul | 17:10 | |
*** bhavik1 has quit IRC | 17:18 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: DNM: Fix branch matching logic https://review.openstack.org/508955 | 17:35 |
Shrews | so that ^^^ is showing that we do indeed have an issue with branch matching logic (at least in the case of (?!...) matching) | 17:35 |
Shrews | now need to find the fix | 17:36 |
*** electrofelix has quit IRC | 17:56 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add memory awareness to system load governor https://review.openstack.org/508960 | 17:57 |
Shrews | jeblair: when you poke your head up, could you please tell me if this 'or' between the branch and ref checks should actually be an 'and'? http://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/change_matcher.py?h=feature/zuulv3#n64 | 18:10 |
Shrews | because the ref is matching, even when branch doesn't | 18:11 |
jeblair | Shrews: while i'm thinking about it -- a v2/v3 difference is that all 'change' objects that function gets now have a ref attribute. | 18:13 |
jeblair | likely explains the immediate cause of the break | 18:14 |
Shrews | hrm, changing to 'and' breaks everything (but does make my new test pass) | 18:14 |
Shrews | so likely not the correct fix | 18:14 |
jeblair | Shrews: probably just need to rework that as "if has branch: return matches branch. return matches ref." | 18:15 |
* Shrews tries | 18:15 | |
jeblair | basically, give branch priority, and fall back on ref | 18:15 |
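In code, the rework being suggested looks roughly like this (a sketch against a simplified matcher, not the actual change_matcher.py diff):

```python
import re

class BranchMatcher:
    """Sketch: give the branch priority, and fall back on the ref."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern)

    def matches(self, change):
        # If the change knows its branch, that alone decides the result;
        # only changes without a branch (e.g. tags) fall back to the ref.
        if hasattr(change, 'branch'):
            return bool(self.regex.match(change.branch))
        return bool(self.regex.match(change.ref))
```

The key difference from an 'or' (or an 'and') of the two checks is that a ref match can no longer override a failed branch match, which is exactly the symptom described above.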
Shrews | jeblair: ok. thx! | 18:15 |
Shrews | hrm, that doesn't seem to be the fix either | 18:26 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: DNM: Fix branch matching logic https://review.openstack.org/508955 | 18:32 |
*** hashar has joined #zuul | 19:27 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add memory awareness to system load governor https://review.openstack.org/508960 | 19:28 |
Shrews | oh, silly me | 19:48 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Fix branch matching logic https://review.openstack.org/508955 | 19:48 |
Shrews | infra-root: ^^^ should fix the branch matching bug | 19:48 |
fungi | Shrews: awesome!!! | 19:49 |
Shrews | note to self: don't modify existing set of test configs w/o expecting consequences | 19:49 |
fungi | heh | 19:49 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add memory awareness to system load governor https://review.openstack.org/508960 | 19:50 |
jeblair | Shrews: awesome thanks! +2 with comments | 19:54 |
Shrews | jeblair: oops, lemme clean that up | 19:55 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Fix branch matching logic https://review.openstack.org/508955 | 19:55 |
Shrews | superfluous :'s removed! | 19:55 |
jeblair | Shrews: cool; did you see my comment about @simple_layout? | 19:56 |
Shrews | oops, no | 19:56 |
jeblair | Shrews: maybe look into that and see if you want to rework the test as a followup | 19:56 |
Shrews | jeblair: ah, k. should i change that? or do a follow up? or just note for next time? | 19:56 |
Shrews | jeblair: i'll do a follow up | 19:57 |
jeblair | Shrews: mostly i wanted you to know about it so you wouldn't think all zuul tests needed this much overhead :) | 19:57 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Simplify TestSchedulerBranchMatcher https://review.openstack.org/508980 | 20:01 |
Shrews | jeblair: umm, yeah. that is much less overhead :) | 20:02 |
clarkb | seeing a lot of http://paste.openstack.org/show/622499/ in the zuul scheduler debug log | 20:12 |
clarkb | SpamapS: ^ any chance that is related to the governor change? | 20:12 |
clarkb | zuulv3 is effectively dead right now I think. No in use nodes | 20:15 |
jeblair | i think that's a straightforward bug | 20:15 |
clarkb | but there's still a lot of ansible running on the executors, so maybe I'm wrong? | 20:15 |
jeblair | and yeah, it's in the middle of a config generation and an object graph walk | 20:16 |
clarkb | ah ok | 20:16 |
jeblair | bad timing on my part, sorry | 20:16 |
clarkb | what's weird about that bug is it complains about str but it is generated from __ which should get the type right, I thought | 20:17 |
clarkb | so wouldn't it be _str_gearman_job? | 20:17 |
jeblair | clarkb: i think it's iterating over dict keys instead of values | 20:19 |
jeblair | clarkb: this line https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/executor/client.py?id=feature/zuulv3#n104 | 20:19 |
clarkb | ah ok, still funny that python seems to infer more than I would expect but that makes sense given the loop there | 20:21 |
clarkb | ya dict is uuid: build | 20:22 |
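The failure mode being described: iterating the dict directly yields the uuid keys (strings), not the build objects, so the handler ends up treating a str as a build. A minimal illustration, not the actual zuul code:

```python
builds = {'a1b2c3': object()}   # stand-in for {uuid: Build}

# Buggy: iterating a dict yields its keys, so `build` is the uuid string.
for build in builds:
    print(type(build))           # <class 'str'>

# Fixed: iterate the values to get the build objects themselves.
for build in builds.values():
    print(type(build))           # <class 'object'>
```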
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Fix Gearman UnknownJob handler https://review.openstack.org/508992 | 20:26 |
*** dkranz has quit IRC | 20:31 | |
*** jkilpatr has quit IRC | 20:48 | |
*** hashar has quit IRC | 20:50 | |
*** jkilpatr has joined #zuul | 21:13 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Fix branch matching logic https://review.openstack.org/508955 | 21:38 |
* Shrews stepping afk for a few minutes before what he expects to be a very busy Zuul meeting | 21:45 | |
jlk | oh yeah that's happening today | 21:46 |
mordred | jeblair: I cannot make today's zuul meeting due to a scheduling conflict - I'll be back for meetings next week - I believe I'm up to date on my things in the etherpad | 21:58 |
Shrews | mordred: we are skipping | 22:05 |
mordred | Shrews: cool - I will feel less bad | 22:05 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Move shadow layout to item https://review.openstack.org/509014 | 22:07 |
jeblair | from what i can tell via objgraph, that ^ is very *unlikely* to be the main part of the problem. however, it could still be a very big waste of memory. | 22:08 |
jeblair | from my example earlier (where i knew about 19 layouts but python had 77) those layouts would have been included in the 19. i think. if this ends up reducing the set of unknown layouts, i still won't understand the mechanism (but anything is possible) | 22:10 |
clarkb | jeblair: you have +2's but not sure if you want to wait for anything else before approving | 22:12 |
jeblair | it passed tests locally, i'll +3 | 22:17 |
SpamapS | jeblair: did anything pan out with the gc module? | 23:06 |
jeblair | SpamapS: i'm not really sure what to explore there. | 23:08 |
jeblair | i did force a collection and noted that it said there were 0 unreachable objects | 23:13 |
jeblair | but i haven't synthesized all of the information about the gc into something actionable | 23:14 |
SpamapS | jeblair: My previous understanding of how gc works is that with each generation it will go deeper into objects to find if they are self referencing. | 23:15 |
SpamapS | but I could be inventing that to suit my hope | 23:15 |
jeblair | i mean, the simplest explanation is that these objects are referenced, and i haven't quite satisfied myself that they aren't. | 23:16 |
jeblair | it's possible that the unsatisfactory results i'm getting from objgraph right now are, in fact, saying that they are unreferenced, but some weird stuff happened with a prior debugging session and i no longer trust the output from the current run; i'll need to try again after a restart. | 23:17 |
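For anyone following along, the sort of interactive poking being described (via the WIP REPL change earlier in the log) looks roughly like this; the `Layout` type name comes from the conversation, the rest is standard gc/objgraph usage:

```python
import gc
import objgraph

# Force a full collection; the return value is the number of unreachable
# objects found, reported above as 0.
print(gc.collect())

# Count live Layout objects and, for a suspect one, render its referrers
# to see what is keeping it alive.
layouts = objgraph.by_type('Layout')
print(len(layouts))
if layouts:
    objgraph.show_backrefs(layouts[-1], max_depth=4,
                           filename='layout-backrefs.dot')
```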