Tuesday, 2018-01-16

00:27  *** rlandy has quit IRC
00:48  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Support cross-source dependencies  https://review.openstack.org/530806
00:48  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests  https://review.openstack.org/532699
00:49  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Compress testr results.html before fetching it  https://review.openstack.org/533828
01:09  <openstackgerrit> Tristan Cacqueray proposed openstack-infra/nodepool feature/zuulv3: Clarify provider manager vs provider config  https://review.openstack.org/531618
01:10  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Revert "Add consolidated role for processing subunit"  https://review.openstack.org/533831
01:17  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Revert "Revert "Add consolidated role for processing subunit""  https://review.openstack.org/533834
01:33  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Adjust check for .stestr directory  https://review.openstack.org/532688
01:55  <openstackgerrit> Merged openstack-infra/zuul-jobs master: Revert "Add consolidated role for processing subunit"  https://review.openstack.org/533831
02:01  *** haint_ has joined #zuul
02:02  *** threestrands_ has joined #zuul
02:04  *** haint has quit IRC
02:04  <openstackgerrit> Tristan Cacqueray proposed openstack-infra/nodepool feature/zuulv3: Do pep8 housekeeping according to zuul rules  https://review.openstack.org/522945
02:04  *** jappleii__ has quit IRC
02:46  *** dtruong2 has joined #zuul
02:54  *** tflink has quit IRC
03:37  *** dtruong2 has quit IRC
03:47  *** jaianshu has joined #zuul
04:03  *** tflink has joined #zuul
04:17  *** dtruong2 has joined #zuul
04:26  *** dtruong2 has quit IRC
04:40  *** dtruong2 has joined #zuul
05:17  *** bhavik1 has joined #zuul
05:17  *** dtruong2 has quit IRC
05:56  *** bhavik1 has quit IRC
06:08  *** dtruong2 has joined #zuul
06:11  <tobiash> corvus, dmsimard: re the executor governor discussion from the meeting
06:12  <tobiash> couldn't participate as it's too late for me (every time I attend I'm kind of jetlagged the whole next day)
06:13  <tobiash> I think we need something smarter than the simple on/off approach based on current load/ram
06:17  <tobiash> I think we need something like what tcp does with slow start and a sliding window
06:24  *** dtruong2 has quit IRC
06:26  <SpamapS> tobiash: could poll stats more aggressively on changes, and back off over time.
06:33  <tobiash> SpamapS: I'm thinking about something like slow start and a congestion window where we could hook in generic 'sensors' like load and ram
06:34  <tobiash> e.g. the ram governor would not work for me in the openshift/kubernetes environment as it doesn't know how much of the free ram it is allowed to take
06:34  <tobiash> for this we will have to check the cgroups
06:35  <tobiash> and I like the idea of a general congestion algorithm that we can plug sensors into, rather than having a separate governor for each metric
06:36  <tobiash> I'll think more about this topic and maybe write a post to the ml
06:36  <clarkb> tcp slow start was the model used for the dependent pipeline windowing too
06:39  <tobiash> I think that could be a good fit
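
A rough sketch of the congestion-window idea tobiash describes above, with hypothetical Sensor and ExecutorGovernor classes (not Zuul's actual executor code): grow the job-acceptance window while every sensor reports healthy, and halve it when one does not, roughly like TCP slow start.

    # Hypothetical sketch of a sensor-driven "slow start" governor; the
    # class and method names are illustrative, not Zuul internals.
    import os

    class Sensor:
        def is_healthy(self):
            """Return True if this metric allows accepting more work."""
            raise NotImplementedError

    class LoadSensor(Sensor):
        def __init__(self, max_load_per_cpu=2.5):
            self.max_load = max_load_per_cpu * os.cpu_count()

        def is_healthy(self):
            load1, _, _ = os.getloadavg()
            return load1 < self.max_load

    class ExecutorGovernor:
        def __init__(self, sensors, min_window=1, max_window=64):
            self.sensors = sensors
            self.min_window = min_window
            self.max_window = max_window
            self.window = min_window   # current job-acceptance budget
            self.running = 0           # jobs currently running

        def can_accept(self):
            return self.running < self.window

        def job_started(self):
            self.running += 1

        def job_finished(self):
            self.running -= 1
            if all(s.is_healthy() for s in self.sensors):
                # slow start: grow the window while every sensor is happy
                self.window = min(self.window * 2, self.max_window)
            else:
                # congestion: back off, like TCP halving its window
                self.window = max(self.window // 2, self.min_window)

A RAM sensor reading cgroup limits (as tobiash suggests for the OpenShift/Kubernetes case) would just be another Sensor implementation plugged into the same list.
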
06:44  *** dtruong2 has joined #zuul
06:55  *** dtruong2 has quit IRC
07:09  <SpamapS> querying the cgroups is also not a bad idea at all.
07:18  *** threestrands_ has quit IRC
07:28  <openstackgerrit> Andreas Jaeger proposed openstack-infra/zuul-jobs master: Remove testr and stestr specific roles  https://review.openstack.org/529340
08:14  *** hashar has joined #zuul
08:19  <SpamapS> https://github.com/facebook/pyre2/pull/10
08:19  <SpamapS> So now to figure out what to do about the MIA author.
08:46  *** jpena|off is now known as jpena
08:50  <openstackgerrit> Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Use re2 for change_matcher  https://review.openstack.org/534190
08:50  <SpamapS> tobiash: ^ patch to use re2 for change matchers.
08:51  <tobiash> cool
08:51  <SpamapS> yeah.. note that pyre2 has not merged my PR yet.
08:51  <SpamapS> and that's blocked on GoDaddy signing Facebook's Corp CLA
08:51  <SpamapS> so it might be a while before we can merge that. :-P
08:52  <tobiash> SpamapS: how long do you expect?
08:52  <tobiash> signing the openstack cla took me a year...
08:52  <SpamapS> No clue, it's my first time asking for a CLA to be signed. ;)
08:52  <SpamapS> We've signed about 15 of them
08:52  <SpamapS> so I expect we'll do this one relatively "easily"
08:52  <SpamapS> I just dunno how long it takes.
08:52  <tobiash> the openstack cla was our first ;)
08:53  <SpamapS> Yeah, IMO they're all stupid.
08:53  <SpamapS> Lawyers making work for themselves.
08:53  <tobiash> yepp
08:53  <SpamapS> But I guess nobody wants another SCO vs. Linux situation.
08:53  <tobiash> probably
08:54  <SpamapS> Also wondering if we can get some speed gains by using re2
08:54  <SpamapS> in some cases it is 10x faster.
08:54  <SpamapS> Not that scheduler CPU has been an issue.
08:54  <tobiash> that would be cool, but I guess we're not limited by regex parsing currently
09:52  *** sshnaidm|afk is now known as sshnaidm
10:06  *** saop has joined #zuul
10:06  <saop> tristanC, Hello
10:07  <saop> tristanC, Now the CI is able to pick up the job and run it
10:08  <saop> tristanC, But in the Ansible execution phase, we are getting an error: Ansible output: b'Unknown option --unshare-all'
10:08  *** ankkumar has joined #zuul
10:08  <saop> tristanC, Sending result: {"data": {}, "result": "FAILURE"}
10:08  <saop> tristanC, any idea about that?
10:08  <saop> tristanC, We have a very basic Ansible playbook file
10:08  <tristanC> saop: it seems like you need a more recent bubblewrap version
10:09  <saop> tristanC, thanks
10:11  <tristanC> well, --unshare-all was added in bwrap-0.1.7
10:14  <saop> tristanC, We installed 0.1.2
10:14  <saop> tristanC, will upgrade now
10:14  <tristanC> saop: what distro are you using?
10:14  <saop> tristanC, ubuntu xenial
10:21  <saop> tristanC, One more question: our CI is able to post the result but it's not showing in gerrit, but when we do Toggle CI we can see something like: Our CI: test-basic finger://ubuntu/f3f5f726a27a480687b826d9bf6a3e57 : FAILURE in 22s
10:21  <saop> tristanC, Do we need to have any configuration for that?
10:22  <tristanC> saop: you mean the result in the table under the vote?
10:23  <saop> tristanC, Yes
10:23  <saop> tristanC, you can check here: https://review.openstack.org/#/c/534137/
10:23  <saop> tristanC, toggle for HPE Proliant CI
10:23  <tristanC> saop: iirc there is some javascript magic that parses CI comments, and it only works when the result has http links
10:26  <saop> tristanC, Now we are getting b'Unknown option --die-with-parent' from Ansible
10:26  <saop> tristanC, Do we need to upgrade more?
10:26  *** AJaeger has quit IRC
10:26  <saop> tristanC, We are using bwrap-0.1.7
10:27  <tristanC> saop: yes, i meant you need the most recent bubblewrap version, that is 2.0.0
10:27  <saop> tristanC, ohh okay
10:28  <tristanC> perhaps ubuntu only needs 0.1.8
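
Since missing bubblewrap options only surface as opaque Ansible failures like the ones above, a quick out-of-band check of the installed version can save a round trip. This is just a diagnostic sketch, not something Zuul runs itself; the minimum version and the output format of bwrap --version are assumptions based on the discussion above.

    # Diagnostic sketch: verify the local bubblewrap is new enough for the
    # options the executor passes it (--unshare-all appeared in 0.1.7,
    # --die-with-parent shortly after).
    import subprocess

    MIN_BWRAP = (0, 1, 8)

    def bwrap_version():
        # "bwrap --version" prints something like "bubblewrap 0.1.8"
        out = subprocess.check_output(['bwrap', '--version']).decode()
        return tuple(int(x) for x in out.split()[-1].split('.'))

    if bwrap_version() < MIN_BWRAP:
        print('bubblewrap too old, need at least %s' %
              '.'.join(str(x) for x in MIN_BWRAP))
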
10:32  *** AJaeger has joined #zuul
10:50  <saop> tristanC, how do we set the log url in zuul v3?
10:54  <tristanC> saop: you should use https://docs.openstack.org/infra/zuul-jobs/roles.html#role-upload-logs  or build something similar
11:01  *** electrofelix has joined #zuul
11:31  *** fbo has joined #zuul
12:35  *** jpena is now known as jpena|lunch
13:01  *** weshay_PTO is now known as weshay
13:07  <saop> tristanC, I created post-logs.yaml according to the documentation, and also provided it in the job's post-run section, but it didn't execute, any idea?
13:12  <tristanC> saop: to debug this kind of failure, use the executor keep option to get the raw ansible job logs in /tmp
13:13  <saop> tristanC, Thanks
13:30  *** rlandy has joined #zuul
13:30  *** rlandy_ has joined #zuul
13:30  *** rlandy_ has quit IRC
13:39  *** jpena|lunch is now known as jpena
13:46  *** jaianshu_ has joined #zuul
13:49  *** jaianshu has quit IRC
13:50  *** jaianshu_ has quit IRC
13:51  *** saop has quit IRC
14:05  *** ankkumar has quit IRC
14:13  *** dkranz has joined #zuul
15:00  <pabelanger> http://grafana.openstack.org/dashboard/db/nodepool-inap shows an interesting pattern with test nodes, I wonder if we are getting a large amount of locked ready nodes piling up that cannot transition to in-use because we cannot fulfill the requests (no more quota)
15:00  <pabelanger> we end up getting more than 1/2 the nodes marked ready, then seem to use them all at once
15:01  <pabelanger> maybe we had a gate reset during that time too?
15:10  <Shrews> pabelanger: what are you seeing in zk for that provider?
15:13  <pabelanger> Shrews: anything specific I should be looking for? Just that we are at quota for the provider, and we have locked ready nodes waiting for other nodes to come online
15:15  <Shrews> pabelanger: first thing i'd look at is whether the ready&locked nodes have been around for a long time. if so, could be an issue we might need to look into. otherwise, might be a normal pattern
15:16  *** bhavik1 has joined #zuul
15:18  <pabelanger> Shrews: yah, I'll see if I can find a pattern in the logs
15:33  *** flepied has joined #zuul
15:33  *** flepied_ has joined #zuul
15:33  *** flepied_ has quit IRC
16:09  *** bhavik1 has quit IRC
16:32  *** hashar has quit IRC
16:36  *** dkranz has quit IRC
16:37  *** jpena is now known as jpena|off
16:41  *** jpena|off is now known as jpena
16:46  <openstackgerrit> Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Only decline requests if no cloud can service them  https://review.openstack.org/533372
16:46  <openstackgerrit> Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Add test_launcher test  https://review.openstack.org/533771
16:48  <openstackgerrit> Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Only fail requests if no cloud can service them  https://review.openstack.org/533372
16:48  <openstackgerrit> Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Add test_launcher test  https://review.openstack.org/533771
16:52  *** dkranz has joined #zuul
16:54  *** tflink has quit IRC
16:57  *** tflink has joined #zuul
17:15  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests  https://review.openstack.org/532699
17:15  *** dkranz has quit IRC
17:27  *** dkranz has joined #zuul
17:33  *** zigo has quit IRC
17:33  *** openstackgerrit has quit IRC
17:34  *** sshnaidm is now known as sshnaidm|afk
17:34  *** sshnaidm|afk has quit IRC
17:35  *** bhavik1 has joined #zuul
17:35  *** bhavik1 has quit IRC
17:37  *** zigo has joined #zuul
17:49  *** openstackgerrit has joined #zuul
17:49  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Support cross-source dependencies  https://review.openstack.org/530806
17:49  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add cross-source tests  https://review.openstack.org/532699
17:49  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Documentation changes for cross-source dependencies  https://review.openstack.org/534397
17:49  <corvus> that patch series is a 3.0 release blocker and is ready for review ^
17:50  <corvus> it may be worth reading the docs change first
17:50  <SpamapS> corvus: so, re2 doesn't support this regexp: '^(?!stable)' .. that apparently *is* an inefficient backref.
17:51  <SpamapS> wondering if there's a more efficient way to say "doesn't start with stable"
17:52  <corvus> hrm.  well, i think we're going to need that less in the future, however, we still have vestigial uses of it now, and it's proven so useful in the past i worry about not being able to use something like that...
17:56  <SpamapS> yeah, re2 seems to only support negation of character classes, not strings.
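
For illustration, the pattern under discussion and the kind of positive-only rewrite re2 can handle (plain stdlib re here; whether a given re2 binding raises or silently falls back on unsupported syntax depends on how it is configured):

    import re

    # what openstack configs use today: "anything not starting with stable"
    negative = re.compile(r'^(?!stable)')

    # a positive-only equivalent has to enumerate what *is* wanted instead
    positive = re.compile(r'^(master|feature/.*)$')

    for branch in ('master', 'stable/pike', 'feature/zuulv3'):
        print(branch, bool(negative.match(branch)), bool(positive.match(branch)))
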
17:58  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Revert "Revert "Add consolidated role for processing subunit""  https://review.openstack.org/533834
17:58  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Adjust check for .stestr directory  https://review.openstack.org/532688
18:02  <SpamapS> corvus: we could make an irrelevant-branches maybe?
18:03  *** sshnaidm|afk has joined #zuul
18:03  <SpamapS> which would allow positive matching
18:03  <corvus> SpamapS: an alternative might be to make the zuul config language more sophisticated, so you could specify boolean ops on the regexes.  something like "branches: [not: stable/.*]".  the underlying classes are sophisticated enough to support that, it's just not exposed in syntax.
18:03  <corvus> SpamapS: we had similar ideas simultaneously :)
18:07  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove updateChange history from github driver  https://review.openstack.org/531904
18:19  <SpamapS> corvus: Might be good to wrap this up before 3.0. Would be a shame to release with a DoS'able scheduler feature.
18:21  <corvus> SpamapS: there are still so many other ways to DoS, and this one has been out there for 6 years
18:22  <corvus> openstack's zuul runs out of memory every 4 days just through normal use, and i wasn't even planning on considering that a 3.0 blocker at this point.
18:22  <SpamapS> :-/
18:22  <corvus> arbitrary node set size is another one
18:23  <SpamapS> kk, 3.1 -- the hardening release. ;)
18:23  <SpamapS> (where we let you turn off re maybe? :-P  )
18:25  <corvus> SpamapS: ++ 3.1 hardening, but i think we should avoid making language features configurable -- we should either reduce the scope of re (ie, switch to re2, possibly compensate with irrelevant-branches or booleans), or drop it altogether
18:25  <SpamapS> Yeah I like the idea of allowing only positive matches and using re2.
18:27  <SpamapS> The CLA process has begun here so hopefully we can get my py3k support merged and released relatively quickly.
18:27  <corvus> SpamapS: if you end up being the maintainer, the CLA process won't matter anyway :)
18:27  * fungi feels sorry at a personal level that this cla is still hanging around
18:27  <corvus> fungi: different cla
18:27  <fungi> oh!
18:27  <fungi> hah, i missed that
18:28  <corvus> facebook i think?
18:28  <SpamapS> corvus: actually the maintainer may have re-appeared. Apparently facebook changed their email domain and they didn't update their addresses in the README (they were @facebook.com and are now @fb.com)
18:28  <clarkb> if we don't do negative lookahead do we even need regexes?
18:28  <fungi> regardless, not *another* cla i need to feel bad about
18:28  <clarkb> could just list all positive matches. Would be more verbose but avoids the re problem entirely
18:29  <clarkb> (then just string == otherstring)
18:29  <corvus> clarkb: agreed, i think that's something worth considering, and tbh, i would prefer that to our (openstack) habit of negative lookaheads regardless.
18:30  <fungi> i suppose if you don't need negative lookahead you could get away with implementing positive matches via an expansion syntax to save some space
18:30  <fungi> e.g. branch: stable/{ocata,pike},master
18:31  <fungi> then internally expand and match against the resulting list
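
A sketch of the expansion idea fungi describes, purely illustrative (this syntax is not part of Zuul): expand the compact spec into a plain list and then match branches by string equality, with no regexes involved.

    def expand(spec):
        # split on commas outside braces: "stable/{ocata,pike},master"
        parts, buf, depth = [], '', 0
        for ch in spec:
            if ch == ',' and depth == 0:
                parts.append(buf)
                buf = ''
                continue
            if ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
            buf += ch
        parts.append(buf)

        branches = []
        for part in parts:
            if '{' in part:
                head, rest = part.split('{', 1)
                body, tail = rest.split('}', 1)
                branches.extend(head + alt + tail for alt in body.split(','))
            else:
                branches.append(part)
        return branches

    print(expand('stable/{ocata,pike},master'))
    # ['stable/ocata', 'stable/pike', 'master']

    def branch_matches(branch, spec):
        # matching is then plain string equality against the expanded list
        return branch in expand(spec)
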
18:31  <corvus> this would be a really good mailing list discussion -- have folks weigh in on "regex positive-only matches vs no regex at all"
18:31  <clarkb> oh that's what I need to do, sign up for the ml again
18:31  <fungi> yup!
18:31  * clarkb wonders if that will make mailman mad
18:31  <corvus> fungi: if we do forward expansions, we may as well use re2 rather than rolling our own, i'd think
18:32  <fungi> i took the hint yesterday and finally signed up for the lists.zuul-ci.org mailing lists
18:32  <clarkb> I signed up last week but things were not working
18:32  <fungi> corvus: oh, that's a feature of re2? even nicer
18:32  <corvus> fungi: er, i dunno?  maybe i'm making stuff up.
18:32  <fungi> that's cool too ;)
18:33  <fungi> everyone needs a hobby
18:33  <corvus> fungi: i just assumed that something like "stable/(foo|bar)" would work
18:33  <fungi> oh, sure
18:33  <clarkb> re2 does safer, more performant regexes with fewer magical features like negative lookahead
18:34  <corvus> SpamapS: the other thing is there are some more uses of regex -- i think some of the pipeline trigger / approval matching stuff uses it.  maybe it's okay to leave that since that's config-repos only.  maybe we just need to identify all the untrusted-repo uses of regex and address them.
18:34  <SpamapS> corvus: yeah I was mostly concerned with job config. Did not dig into other uses.
18:35  <SpamapS> At some point it seems like it would make sense to just use re2 everywhere since it is also far faster even on the simple cases.
18:35  <corvus> files needs either regex or glob.  so if we entertain dropping regex, we'd have to glob there i think.
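
If regexes were dropped from untrusted config, the files matcher could fall back to shell-style globs; a tiny sketch with the stdlib fnmatch (not Zuul's actual matcher code):

    from fnmatch import fnmatch

    irrelevant = ['doc/*', 'releasenotes/*', '*.rst']
    changed = ['doc/source/index.rst', 'zuul/scheduler.py']

    if all(any(fnmatch(f, pat) for pat in irrelevant) for f in changed):
        print('only irrelevant files touched; skip the job')
    else:
        print('run the job')
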
18:36  <corvus> talk of mailing lists reminds me...
18:36  -corvus- Please sign up for new zuul mailing lists: http://lists.openstack.org/pipermail/openstack-infra/2018-January/005800.html
18:37  <corvus> and we should update the readme/docs too
18:39  *** sshnaidm|afk is now known as sshnaidm
18:45  *** jpena is now known as jpena|off
18:46  <openstackgerrit> Merged openstack-infra/nodepool feature/zuulv3: Clarify provider manager vs provider config  https://review.openstack.org/531618
18:57  *** electrofelix has quit IRC
18:58  *** electrofelix has joined #zuul
19:07  *** flepied has quit IRC
19:08  *** flepied has joined #zuul
19:30  *** flepied has quit IRC
19:30  *** flepied has joined #zuul
19:44  *** harlowja has joined #zuul
20:04  <openstackgerrit> Andreas Jaeger proposed openstack-infra/zuul-jobs master: Refactor fetch-subunit-output  https://review.openstack.org/534427
20:05  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Clean up when conditions in fetch-subunit-output  https://review.openstack.org/534428
20:05  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Use subunit2html from tox envs instead of os-testr-env  https://review.openstack.org/534429
20:05  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-subunit-output work with multiple tox envs  https://review.openstack.org/534430
20:05  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Move testr finding to a script  https://review.openstack.org/534431
20:07  *** hashar has joined #zuul
20:10  <openstackgerrit> Merged openstack-infra/zuul-jobs master: Revert "Revert "Add consolidated role for processing subunit""  https://review.openstack.org/533834
20:23  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Move testr finding to a script  https://review.openstack.org/534431
20:25  <pabelanger> Shrews: we seem to be continuing to accumulate ready / locked nodes in inap ATM. I'm trying to see why new requests are coming online when existing ones haven't been fulfilled yet
20:25  <pabelanger> we are up to 32 ready / locked nodes right now
20:28  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Move testr finding to a script  https://review.openstack.org/534431
20:29  <pabelanger> odd, up to 52 now
20:29  <pabelanger> something is going on
20:30  <Shrews> pabelanger: "why new requests are coming online when existing ones haven't been fulfilled yet" ... that confuses me. requests are allowed to exist in parallel for a provider, so not sure what you mean by that
20:31  <Shrews> but i will see if i can glean anything from the logs
20:33  <pabelanger> 2018-01-16 20:28:04,870 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl01.openstack.org-30110-PoolWorker.inap-mtl01-main]: Fulfilled node request 100-0002104682
20:33  <pabelanger> at that point, we seem to fulfill about 52 nodes, but I don't see why it took so long
20:33  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Documentation changes for cross-source dependencies  https://review.openstack.org/534397
20:33  <pabelanger> some of them appear to be single ubuntu-xenial nodes
20:34  <Shrews> pabelanger: i do not see any ready&locked nodes
20:35  <pabelanger> Shrews: yah, they just unlocked at the timestamp from above
20:35  <pabelanger> if you look at the logs, we fulfilled a bunch of requests at one time, upwards of 45 nodes
20:35  <Shrews> pabelanger: we are at capacity
20:35  <Shrews> rather, inap is
20:36  <pabelanger> http://grafana.openstack.org/dashboard/db/nodepool-inap shows the spike in available ready nodes
20:36  <Shrews> max-servers is 190
20:37  <pabelanger> Shrews: right, but I am trying to understand how we keep growing ready / locked nodes. If at capacity, don't the next node requests go towards existing open nodesets waiting for nodes?
20:38  <pabelanger> For example: http://grafana.openstack.org/dashboard/db/nodepool-inap?from=1516133623844&to=1516134574705&var-provider=All
20:38  <Shrews> wow, i cannot grok that sentence for some reason. :)
20:38  <Shrews> pabelanger: i'm not sure i
20:38  <pabelanger> for almost 15mins, we grew ready / locked nodes, until something happened for them to all dump
20:39  <pabelanger> almost 30% of capacity grew to be idle for 15mins
20:41  <pabelanger> http://grafana.openstack.org/dashboard/db/nodepool-rackspace?from=1516125202560&to=1516126777140 is another interesting graph for rackspace
20:41  <pabelanger> IAD had 87 available nodes but only 37 in-use
20:44  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove updateChange history from github driver  https://review.openstack.org/531904
20:48  <corvus> clarkb: i'm a little confused by https://storyboard.openstack.org/#!/story/2001427  --  that diagram makes it look like C is the parent of B, but the text says that B is the parent of C.
20:54  <corvus> pabelanger: you should entertain the idea that zuul doesn't have enough executors to handle all of the jobs.  zuul collects nodes before it gives them to executors.  i see some small and large spikes on the executor queue graph.  that means zuul is spending at least some time waiting for executors to pick up jobs.
20:55  <corvus> in those cases, the nodes will be ready and locked by zuul, and their requests will be fulfilled.
20:55  <corvus> no idea if that's what you're seeing, just throwing it out there.
20:56  <pabelanger> corvus: okay, let me confirm, but I think nodepool is still holding the lock at this point. Would zuul be involved if that is still the case?
20:57  <pabelanger> the part I am trying to figure out is, I see the following in the logs:
20:57  <pabelanger> 2018-01-16 20:28:04,866 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl01.openstack.org-30110-PoolWorker.inap-mtl01-main]: Pausing request handling to satisfy request
20:57  <pabelanger> <snipped for size>
20:57  <pabelanger> then nodepool starts to unlock nodes
20:58  <pabelanger> which brings ready / locked nodes back down to zero
20:59  <pabelanger> it seems we only pause once at capacity, according to comments in nodepool/driver/openstack/handler.py
21:01  <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Add skipped CRD tests  https://review.openstack.org/531887
21:01  *** haint_ has quit IRC
21:02  <Shrews> pabelanger: i'm not sure what's happening here, but it does seem that something is keeping completed request handlers from being processed
21:02  <Shrews> in a timely fashion
21:02  *** haint_ has joined #zuul
21:05  <pabelanger> I'm trying to walk myself through the poll function in nodepool/driver/__init__.py
21:05  *** rlandy_ has joined #zuul
21:05  <corvus> clarkb: though, i think the direction of that arrow doesn't actually matter, the resulting trees are equivalent (though the final tree would be BCA not CBA)
21:05  *** rlandy__ has joined #zuul
21:05  *** rlandy__ has quit IRC
21:08  <Shrews> pabelanger: theory... i believe something in _assignHandlers(), where it loops through the node requests and processes them, is causing a significant delay within the loop
21:09  <Shrews> pabelanger: because until that loop finishes, it will never remove the completed handlers (and i'm not seeing that happen until minutes later in the log)
21:10  <Shrews> i only see zookeeper communication happening there, so i wonder if something is wonky with that
21:10  <Shrews> an overloaded ZK?
21:10  <Shrews> or communication issues with it?
21:11  <pabelanger> 1 sec, need to AFK quickly
21:12  <clarkb> corvus: C is the parent of B, that is what the text says
21:12  <Shrews> java.io.IOException: No space left on device
21:12  <Shrews> on zookeeper
21:12  <Shrews> wheeeee
21:12  <clarkb> corvus: C -> B -> A, then separately C and B depend on A
21:12  <Shrews> pabelanger: ^^^
21:12  <corvus> clarkb: right, i was confused by "Change C has parent change B" which means B is the parent of C.
21:13  <clarkb> corvus: actually the depends on may just be C to A
21:13  <Shrews> pabelanger: oh, nm. that's an old log
21:13  <corvus> clarkb: but honestly, i don't think it matters, we can just pick one and stick with it.  :)
21:13  <clarkb> er now I'm all confused. In any case I think the arrows are correct :)
21:13  <clarkb> corvus: ya, basically you just need a dag that looks like a cycle but with one of the arrows pointed the wrong way
21:13  <corvus> clarkb: yeah, both sets of arrows match.
21:14  <pabelanger> sorry, #dadop
21:14  <clarkb> Shrews: ya nodepool.o.o has ~50% of / free
21:14  <pabelanger> I'm looking at cacti.o.o now
21:14  <corvus> it looks busy, but not critically so.
21:14  <pabelanger> to see if we are spiking anything
21:14  <clarkb> could it be the executors just aren't grabbing more jobs?
21:15  <clarkb> or not grabbing nodes from nodepool fast enough?
21:15  <pabelanger> if I understand the logs, I don't think nodepool is releasing them to zuul
21:15  <pabelanger> but will let Shrews confirm
21:15  <corvus> clarkb: it's possible that's happened a few times, but i think Shrews and pabelanger have also found behavior that it doesn't explain, and it's more common
21:16  <corvus> (the grafana graphs suggest that has maybe happened after some gate resets)
21:18  *** threestrands_ has joined #zuul
21:33  <Shrews> corvus: pabelanger: is it possible that we could be getting rate limited by quota queries by the provider here? if those are slow, we could spend a long time in the request acceptance loop
21:33  <Shrews> which i'm growing more confident that we are doing
21:35  <pabelanger> Shrews: let me check if we are doing any caching for quota or not
21:35  <pabelanger> maybe we need to add it, if missing
21:35  <Shrews> i'm not certain what those queries are... trying to track down where we do that
21:36  <clarkb> Shrews: we may not be rate limited so much as the clouds responding slowly
21:36  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix dependency cycle false positive  https://review.openstack.org/534444
21:37  <clarkb> Shrews: pabelanger you can probably test that out of band by making the same requests and timing them
21:37  <pabelanger> yah, was going to suggest that
21:38  <pabelanger> we could also start tracking them via statsd too, like create servers
21:38  <corvus> we should be caching the quota queries -- except when we unexpectedly get an over-quota failure, we empty the cache.  we're continually getting over-quota failures in inap due to the nova mismatch bug.
21:38  <mordred> pabelanger: we should already be tracking them in statsd, just need to add grafana graphs for them
21:38  <pabelanger> mordred: nice
21:39  <mordred> pabelanger: (we should be generating statsd metrics for every REST call made - so we should get all the metrics all the time)
21:41  <pabelanger> stats.nodepool.task.inap-mtl01.ComputeGetLimits looks to be the key
21:41  <mordred> yup. that seems about right
21:42  <pabelanger> I'll confirm and work up a grafyaml patch
21:43  <corvus> pabelanger: do you have a graphite link handy for us to look at?
21:44  <Shrews> seems getting compute limits on inap takes about a second
21:44  <corvus> clarkb: in 534444 i opted to fix only the new-style depends-on; do you think it's important to fix the legacy gerrit depends-on as well?
21:44  <pabelanger> let me see how to share
21:44  <mordred> corvus: test_crd_gate_triangle sounds like a place I don't want to fly a small aircraft over
21:44  <corvus> pabelanger: right click on image ?
21:44  <clarkb> corvus: is old style going away?
21:44  <corvus> mordred: definitely
21:44  <clarkb> if old style is going away probably not, but if we expect it to stick around it might be good to have consistent behavior
21:44  <corvus> clarkb: yeah, but with a timeframe convenient for openstack, so maybe 3-6 mo...
21:45  <corvus> i don't want to immediately invalidate a bunch of depends-on headers
21:46  <pabelanger> http://graphite.openstack.org/render/?width=586&height=308&_salt=1516139140.947&target=stats.nodepool.task.inap-mtl01.ComputeGetLimits
21:46  <pabelanger> that is the image, but not sure how to properly share
21:46  <corvus> you just did
21:46  <pabelanger> kk
21:46  <pabelanger> seems pretty flat right now
21:47  <Shrews> i'm not sure how to track this down without throwing in a bunch of spurious logging statements in a printf-style-ala-mordred attack
21:48  <corvus> pabelanger: i suspect that's a graph of when we emit those calls
21:48  <mordred> pabelanger: the timing key is ...
21:48  <corvus> http://graphite.openstack.org/render/?width=586&height=308&_salt=1516139263.857&target=stats.timers.nodepool.task.inap-mtl01.ComputeGetLimits.mean
21:48  <corvus> something like that ^ i think
21:48  <pabelanger> ah, thank you, that does look right
21:48  <mordred> yes. what corvus said
21:49  <mordred> that's in milliseconds?
21:49  <corvus> mordred: i think so
21:49  <mordred> yah. so - not the world's fastest call - but not the world's slowest either
21:49  <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Support cross-source dependencies  https://review.openstack.org/530806
21:49  <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Add cross-source tests  https://review.openstack.org/532699
21:52  <clarkb> corvus: reviewed the fix for triangle deps
21:52  <clarkb> corvus: there is a bug
21:55  *** flepied has quit IRC
21:57  *** flepied has joined #zuul
22:03  <Shrews> pabelanger: corvus: http://paste.openstack.org/show/645748/   Those "predicted" calculations happen back-to-back in the code, so the second one is taking at least 5 seconds. I'm not sure how long the 1st is taking
22:03  <Shrews> I'm seeing processing of a single request take between 20-30 seconds. If there are lots of requests to go through, then the code will not get around to removing completed requests for quite a while
22:04  <Shrews> So we could be seeing a combination of things here
22:04  <Shrews> Heavy load + inefficient request processing
22:05  <Shrews> This would explain why pabelanger saw many ready+locked nodes suddenly free up
22:06  *** dkranz has quit IRC
22:06  <Shrews> or change to in-use, rather
22:07  <pabelanger> yah, nodepool-launcher is at 100% CPU for the most part
22:08  <Shrews> pabelanger: this was a good spot. well done for catching it
22:09  <pabelanger> could we run another nodepool-launcher process on the host, but just for inap, as a test?
22:09  <Shrews> pabelanger: now tell us how to make it better!!!  :)
22:09  <Shrews> i don't think this is inap specific. i see similar trends with other providers
22:10  <pabelanger> yah, nl01 is 8vCPU; if we are at 100% for a single nodepool-launcher daemon, maybe we shard the config more and run a single process per provider?
22:13  <Shrews> i don't think that'd help, tbh
22:13  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix dependency cycle false positive  https://review.openstack.org/534444
22:14  <corvus> clarkb: thx ^
22:14  <pabelanger> Shrews: ack
22:16  <Shrews> adding caching of limits to shade might help a bit
22:17  <Shrews> i can work on that in the morning if we think that's a good idea
22:17  <corvus> Shrews: can you summarize what the problem is?  i don't have the full picture.
22:18  <mordred> Shrews: I think adding in the same caching pattern we use for servers, ports and floating-ips would be a fine idea (pending we don't learn something else to the contrary between now and then)
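
A minimal sketch of the caching pattern being discussed: remember the limits answer for a short TTL and invalidate it on an unexpected over-quota error. Illustrative only; the real caching would live in nodepool/shade, and the fetch callable here is a stand-in for whatever queries the cloud.

    import time

    class CachedLimits:
        def __init__(self, fetch, ttl=300):
            self.fetch = fetch      # callable that queries the cloud's limits
            self.ttl = ttl
            self._value = None
            self._fetched_at = 0

        def get(self):
            if self._value is None or time.time() - self._fetched_at > self.ttl:
                self._value = self.fetch()
                self._fetched_at = time.time()
            return self._value

        def invalidate(self):
            # call this on an unexpected over-quota error so the next
            # request re-queries the cloud instead of trusting stale numbers
            self._value = None
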
22:19  <Shrews> corvus: requests aren't getting satisfied in a timely fashion because of the processing of the node requests loop. we are seeing several seconds between processing each request. until we get through all of them, we do not attempt to mark requests fulfilled
22:19  <corvus> Shrews: which loop is the 'node requests loop' ?
22:20  <Shrews> corvus: PoolWorker._assignHandlers()
22:21  <corvus> Shrews: what in there do you think is slow?
22:22  *** haint_ has quit IRC
22:22  <Shrews> corvus: all of it?  :)  the paste from me above shows several seconds for at least one of the quota estimations.
22:22  *** haint_ has joined #zuul
22:23  <Shrews> corvus: i have not identified other areas of slowness yet, but can see over 20s between assigning a request, and actually launching the node for it
22:25  <Shrews> without some more debugging info, i can only guess as to what else is slow about it
22:26  <corvus> Shrews: okay, so we've got several seconds for the new handlers to run if that method creates any new ones...
22:27  <corvus> Shrews: if the handler is paused, we short out of there pretty quick
22:27  <corvus> Shrews: but if it isn't paused, then we're going to try to lock every request.  we have 1800 of them right now.
22:28  <corvus> but it seems like we should run up to the point where we are paused pretty quickly, so that shouldn't be an issue
22:28  <Shrews> right. looking at the logs, i saw the inap thread accept 50 requests before it completed the loop, which took about 13m in all
22:29  <Shrews> it never paused in that time
22:29  <Shrews> which means it had capacity to handle them all
22:29  <clarkb> each provider gets its own thread too right? and those threads fight all the launch threads? could we be cpu starved?
22:29  <clarkb> like maybe we should run multiple nodepool launchers per host or something
22:30  <pabelanger> clarkb: yah, that is what I was thinking about, multiple launchers per host
22:30  <pabelanger> they'd each need to have a config with a specific provider
22:30  <Shrews> clarkb: yes, 1 provider pool per thread
22:30  <Shrews> a provider can have multiple pools
22:31  <clarkb> ah right, it's PoolWorker
22:31  <clarkb> in our case pools and providers are currently 1:1
22:31  <corvus> nl01 is currently cpu-bound
22:32  <corvus> they both are
22:32  <Shrews> there could be a lot of thread context switching going on since we have 1 thread per node launch too
22:32  <Shrews> maybe multiple processes could help
22:32  <Shrews> as pabelanger suggested
22:33  <corvus> yep.  this is one of the reasons we designed it to accommodate sharding -- we were starved on one machine.  it's interesting that we are already starved with two processes.
22:33  <corvus> so more processes (whether that's on more machines or the same one) may help, if the main pool thread is being starved by the launch threads.
22:34  <clarkb> in this case I meant the same host, since I think we have a lot of available cpu; it's just python being unable to take advantage
22:34  <corvus> fwiw, the context switch overhead before it became unbearable was about 1k threads.
22:34  <mordred> seems about right
22:35  <pabelanger> nb01.o.o is also 8vcpu / 8GB RAM; if we are CPU bound, that's an expensive server for a single process. We could try bringing on more launcher nodes in smaller sizes, or run more launchers on each host
22:35  <corvus> Shrews, mordred: i'm not 100% sure, but i worry that quota caching isn't an immediate answer.  we should verify that nodepool isn't already caching sufficiently and that changing shade would actually do anything different before we go down that road.
22:36  <Shrews> corvus: nod
22:36  <corvus> pabelanger: yes that server is too large.  it should be 2g
22:36  <clarkb> pabelanger: more nodes in smaller sizes potentially makes us less outage prone, so that's a plus, but also harder to do things like switch zk servers
22:36  <corvus> i'd size it 2G for each process we want to run on it.
22:37  <corvus> maybe even fit a few more processes on larger hosts
22:38  <pabelanger> if we did a single process on 2GB hosts, that would work well today with how we manage nodepool.yaml in project-config (by hostname). If we do more processes on a larger server, we'll need to rework some puppet code first
22:38  <corvus> Shrews: we could also make the poolworker a little more concurrent, either by having another thread handle completions, or just interleaving completions with assignments.
22:39  <Shrews> corvus: difficult to handle correctly since either way we'd want to modify the same data structure (the handler array). but yeah, could be possible with the right locking and whatnot
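
A rough sketch of the "interleave completions with assignments" idea corvus raises: reap finished handlers between each request instead of only after the whole assignment loop. Names here are illustrative stand-ins, not the real PoolWorker internals.

    class PoolWorkerSketch:
        def __init__(self, handler_factory):
            self.handler_factory = handler_factory  # builds a handler for a request
            self.request_handlers = []

        def _removeCompletedHandlers(self):
            # handler.poll() returns True once its request is fulfilled or failed
            self.request_handlers = [h for h in self.request_handlers
                                     if not h.poll()]

        def _assignHandlers(self, requests):
            for request in requests:
                handler = self.handler_factory(request)  # may be slow (quota checks)
                if handler is not None:
                    self.request_handlers.append(handler)
                # reap completed handlers as we go, so fulfilled requests are
                # unlocked promptly instead of only after the whole loop
                self._removeCompletedHandlers()
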
22:39  <pabelanger> I have to step away for a bit, will catch up when back
22:40  <Shrews> and i need to make dinner
22:41  <corvus> let's resume this tomorrow :)
22:41  <clarkb> another idea: we could use multiprocessing for pool worker possibly?
22:41  <clarkb> and have python handle it more behind the scenes for us? that might be friendlier to other users
22:42  <corvus> clarkb: yes, as long as we do it at that level (the pool worker), where there's no shared data between any of them.  that's worth looking into.
22:42  <clarkb> ya all the communication is through zk anyways so that should be pretty safe
22:42  <corvus> multiprocessing is cool as long as you don't try to share data, then it gets bad.  so a process for a pool worker, but then all the launchers still as threads.
22:43  <mordred> ++
22:43  <corvus> they already even have their own zk connection, so even that wouldn't be different
22:43  <Shrews> i think the Nodepool object itself is shared, which could be problematic
22:43  <Shrews> anyway, really away now
22:44  <corvus> Shrews: yeah, i think that's mostly config stuff at this point; we could probably fix that.
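
A sketch of the multiprocessing idea: one OS process per provider pool, each opening its own ZooKeeper connection and sharing nothing in memory, exactly as all coordination already goes through ZK. run_pool_worker and its arguments are hypothetical stand-ins, not real nodepool entry points.

    import multiprocessing

    def run_pool_worker(provider_name, pool_name, zk_hosts):
        # each process would open its own ZooKeeper connection here; all
        # coordination happens through ZK, so no Python objects are shared
        # across processes
        pass

    def start_workers(provider_pools, zk_hosts):
        procs = []
        for provider_name, pool_name in provider_pools:
            p = multiprocessing.Process(
                target=run_pool_worker,
                args=(provider_name, pool_name, zk_hosts),
                name='%s-%s' % (provider_name, pool_name))
            p.start()
            procs.append(p)
        return procs
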
22:48  <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix dependency cycle false positive  https://review.openstack.org/534444
22:49  <corvus> clarkb: there are tests which would have caught the error you pointed out.  most of them worked, but two needed a fix, so i updated the patch with that as well. ^
22:51  <clarkb> corvus: cool I will rereview
22:51  <openstackgerrit> Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Stop running tox-cover job  https://review.openstack.org/534458
22:57  *** flepied has quit IRC
23:05  <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Really change patchset column to string  https://review.openstack.org/532459
23:06  <clarkb> Shrews: corvus do you want to review https://review.openstack.org/#/c/533771/5 and parent?
23:06  <clarkb> I think that is slightly more important now that infracloud might shut down at any time
23:10  <corvus> clarkb: looking
23:11  <pabelanger> https://review.openstack.org/534450/ should start tracking the getlimits API query in grafana to help with tomorrow
23:12  <clarkb> there is another cleanup related to that where I think we can more aggressively unset allocated_to (or whatever the var is called) when we fail a multinode nodeset
23:12  <clarkb> I think right now we let the cleanup thread find those and unset them, but we can unset them early, allowing them to be used in other nodesets sooner
23:14  <corvus> clarkb: lgtm -- we have some tests which set an image flag that tells the fake to always fail when booting that image.  i wonder if there's a way to turn that on mid-stream to avoid that particular monkey-patch.
23:15  <corvus> clarkb: i've +2d it; let's see if Shrews wants to review it tomorrow? unless you think we need to accelerate deployment of it.
23:16  <clarkb> looking at a multiprocess PoolWorker more closely, we use nodepool.getZK() and nodepool.getPoolWorkers() along with some config related stuff
23:16  <clarkb> I think the only thing that may be a real problem is getPoolWorkers? /me makes a quick change and sees what happens
23:17  <clarkb> oh, you know where this might break the most is in the tests :/
23:18  <clarkb> because we sort of assume a single process we can manipulate
23:21  <corvus> hrm.  i'm guessing black-box tests might work, but not so much white-box.
23:22  <corvus> the tests which do end-to-end testing through zookeeper should work.
23:28  <corvus> how does this message about the zuulv3 merge look? https://etherpad.openstack.org/p/4sX3qKYDBN
23:32  <clarkb> ya, just hacking in multiprocessing and running it against the test I added above; once I fix some structural issues we run into the problem where the test framework introspects its threads and stuff to know how things are doing
23:32  <clarkb> so it will be a bit of work to get that going in the test suite
23:34  <corvus> clarkb: maybe we need a slightly different structure then -- maybe we need to handle the process split in a way where we can test the system all in one process, but when actually started, we get multiple ones.
23:34  <clarkb> that seems viable, we would need a more true functional black box test to go with that (which we do have)
23:35  <clarkb> this isn't something I'm going to dive into now, but wanted to confirm my suspicion that it is non-trivial before I ignored it :)
23:36  <clarkb> but I do think if we can make it work this will be the most user friendly way of scaling up launchers per host
23:38  <corvus> clarkb: yeah, for the most part, i'd think even just a special test like "i started a two-provider-pool system with multiple processes and they both gave me a node" would be sufficient to exercise that -- then all the rest of the correctness tests can remain in the single-process realm
23:42  *** sshnaidm is now known as sshnaidm|off
23:43  <corvus> i sent the first message to zuul-announce: http://lists.zuul-ci.org/pipermail/zuul-announce/2018-January/000000.html
23:43  <corvus> hopefully folks got that
23:46  <pabelanger> yup, got the email here
23:52  *** rlandy_ has quit IRC
23:53  <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Use hotlink instead log url in github job report  https://review.openstack.org/531545
23:55  <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Disambiguate with Netflix and Javascript zuul  https://review.openstack.org/531292
23:56  <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Add support for protected jobs  https://review.openstack.org/522985
23:56  <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Documentation changes for cross-source dependencies  https://review.openstack.org/534397
23:58  <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Remove updateChange history from github driver  https://review.openstack.org/531904
23:58  <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Handle sigterm in nodaemon mode  https://review.openstack.org/528646
