Friday, 2019-06-14

<pabelanger> looks like we didn't properly dequeue the previous version of the PR for some reason  (00:00)
<pabelanger> looking at https://github.com/ansible/ansible/pull/55199 it seems to have ~14 commits pushed close together, wonder if that has something to do with it  (00:01)
<pabelanger> http://paste.openstack.org/show/752908/  (00:02)
<pabelanger> I'd only expect 1 change there, not 3  (00:03)
<pabelanger> okay, looking at https://github.com/ansible/ansible/pull/55199 more, I _think_ they somehow used the web ui to make code changes to the PR  (02:03)
<pabelanger> which results in zuul enqueuing the same change 6 times into check  (02:03)
<pabelanger> http://paste.openstack.org/show/752910/ was the event from github  (02:17)
<pabelanger> however, I am unsure why zuul triggered on it, the action was synchronize  (02:17)
<pabelanger> oh, that seems to map to 'changed' for zuul  (02:19)
<pabelanger> okay, will have to continue this in the morning, I think it is possible we are getting more than 1 Change ID for a PR, which seems to not dequeue the previous Change, because they are different  (02:34)
<tobiash> zuul-maint: looks like zuul-web got a memleak here: https://paste.pics/9f7cf9a259c39876361b49b1ebebcd29  (06:45)
<tobiash> does anyone experience the same?  (06:46)
<tobiash> unfortunately it's hard to debug because sigusr2 seems to print the stack trace and objects but also kills it :/  (06:46)
<tobiash> those should be the changes that went into this update: https://etherpad.openstack.org/p/ipW8nmqqPn  (06:59)
<tristanC> tobiash: perhaps one of the requirements also changed?  (07:20)
<tobiash> maybe  (07:21)
<tristanC> tobiash: fwiw we are also running with the commits that are bolded in the etherpad without issue  (07:24)
<tristanC> tobiash: e.g. SQLAlchemy 1.3.4 released May 28  (07:31)
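Purely as an illustration of a less invasive way to chase a leak like the one tobiash describes (nothing in the thread says Zuul does this), the stdlib tracemalloc module can snapshot allocations periodically and diff them without killing the process; all names and the 5-minute interval below are assumptions:

```python
# Hypothetical debugging aid, not part of Zuul: diff tracemalloc
# snapshots over time to see which source lines keep growing.
import time
import tracemalloc

tracemalloc.start(25)                      # keep 25 frames per allocation
baseline = tracemalloc.take_snapshot()

while True:
    time.sleep(300)                        # every 5 minutes
    snapshot = tracemalloc.take_snapshot()
    # Print the ten biggest growers relative to the baseline.
    for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
        print(stat)
```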
<openstackgerrit> Matthieu Huin proposed zuul/zuul master: Zuul Web: add /api/user/authorizations endpoint  https://review.opendev.org/641099  (08:16)
<openstackgerrit> Matthieu Huin proposed zuul/zuul master: authentication config: add optional token_expiry  https://review.opendev.org/642408  (08:17)
<openstackgerrit> Matthieu Huin proposed zuul/zuul master: [WIP] admin REST API: zuul-web integration  https://review.opendev.org/643536  (08:17)
<openstackgerrit> Fabien Boucher proposed zuul/zuul master: Set same timeout for py35/36 jobs in check and gate  https://review.opendev.org/665331  (09:01)
<fbo> Any reason the timeout was different between check and gate py35/36 jobs? The pagure driver patch failed in the gate due to a timeout. In check we have 1 hour but in gate 45 minutes. ^  (09:04)
<tobiash> fbo: it used to be 45 minutes and the bump to 1 hour probably just forgot the gate  (09:09)
<fbo> tobiash: alright, yes it seems to be a good reason :)  (09:19)
<fbo> I'll recheck the pagure driver patch after we this .zuul.yaml change then  (09:20)
<fbo> *merge*  (09:21)
<openstackgerrit> Matthieu Huin proposed zuul/zuul master: [WIP] admin REST API: zuul-web integration  https://review.opendev.org/643536  (09:44)
<tobias-urdin> https://review.opendev.org/#/c/637339/ yay looking forward to the next zuul and nodepool releases :)  (12:25)
<openstackgerrit> Matthieu Huin proposed zuul/zuul master: [WIP] admin REST API: zuul-web integration  https://review.opendev.org/643536  (12:37)
<corvus> tobias-urdin: yeah, we were planning on releasing them today, but we found an issue with github interaction that we want to clear up first  (14:24)
<corvus> tobiash: i don't see radical memory use changes on the server where we're running zuul-web: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all  (though that reporting is not as detailed)  (14:28)
<tobiash> corvus: k, thanks, I'll look next week more in depth into this  (14:29)
<tobiash> for now I just increased the resource limit and let openshift handle the restarts :)  (14:29)
<corvus> yay cloud  (14:29)
<Shrews> corvus: mordred: It looks like next Wednesday is going to be my last full day for a while. Will be gone until at least July 8th. I'll let you two decide what to do with the autohold changes (which I consider complete at this point) in relation to the web admin stuff.  (14:29)
<Shrews> certainly not urgent to merge, but will require changes if merged after the web admin stuff  (14:31)
<corvus> Shrews: yeah, i'm expecting it to be relatively straightforward api changes rather than more complicated structural changes, so i expect we can update the patches  (14:33)
<Shrews> k k  (14:33)
<corvus> i'm restarting opendev's zuul to pick up the new github fixes  (14:41)
<corvus> we could release nodepool now, but i'd like to release them at the same time because of the node python path change (it's not strictly necessary, but it seems weird not to)  (14:42)
<corvus> pabelanger: 2019-06-14 14:44:39,652 ERROR zuul.GithubEventProcessor: [e: c657f89e-8eb2-11e9-908a-1f05e5124b4f] Failed handling status event; remote said retry after 60,will retry attempt 1/5 in 60.0 seconds  (14:46)
<corvus> pabelanger: that was 403 as a first response (ie, it wasn't after a 502)  (14:46)
<pabelanger> neat, let's see what happens  (14:46)
<corvus> pabelanger, tobiash: that makes me think that in addition to the 502 errors causing retries, we're *also* simply running into rate limits probably due to the parallelization change  (14:47)
<tobiash> corvus: yes, probably  (14:47)
<tobiash> before we just queued on our side with no chance of reaching the limit because of serialization  (14:47)
<pabelanger> would agree  (14:47)
<tobiash> I think there is still some potential to reduce api calls (e.g. some calls due to the github3.py object model)  (14:49)
<tobiash> and the getpullbysha can be optimized using graphql with one request  (14:50)
<corvus> pabelanger: 2019-06-14 14:51:08,371 ERROR zuul.GithubEventProcessor: [e: c657f89e-8eb2-11e9-908a-1f05e5124b4f] Failed handling status event; remote said retry after 60,will retry attempt 5/5 in 60.0 seconds  (14:51)
<corvus> pabelanger: perhaps the retry-after should be +5 seconds?  (14:52)
<pabelanger> corvus: yah, that was my fear, we needed to add a buffer  (14:52)
<pabelanger> sounds like a good idea  (14:52)
<corvus> tobiash: do you think we also need to calculate our internal rate and slow down?  we do that in openstacksdk (it's code inherited from nodepool).  we don't necessarily have to ask github for the rate after each request, we could just do it once and count our own calls.  we could probably avoid hitting the limit most of the time that way.  (14:56)
<corvus> tobiash: or do you think we should just try to honor the retry-after header and see if that will take care of it?  (14:56)
<openstackgerrit> Paul Belanger proposed zuul/zuul master: Pad retry_after logic for github when we raise an exception  https://review.opendev.org/665415  (15:00)
<pabelanger> tobiash: corvus: ^ but I also wonder if somehow multiple retries for different events are also going to trigger the 403 abuse api  (15:03)
<pabelanger> do we need to wait 'retry_after' for any API call, or just to the specific api?  (15:03)
<tobiash> corvus: we don't have a mechanism yet to globally throttle the requests  (15:06)
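A minimal sketch of the retry behaviour being discussed above — honour a Retry-After header when present and pad it with a small buffer. The helper name, the 5-second pad and the 60-second default are illustrative assumptions, written against plain requests, not the actual 665415 change:

```python
# Illustrative only: retry a GitHub API call, honouring Retry-After
# plus a small safety pad so parallel retries don't race the limit.
import time
import requests

def get_with_retry(session, url, attempts=5, pad=5.0, default_wait=60.0):
    for attempt in range(1, attempts + 1):
        response = session.get(url)
        if response.status_code not in (403, 502):
            return response
        # Abuse/rate-limit 403s carry Retry-After; 502s may not.
        wait = float(response.headers.get('Retry-After', default_wait)) + pad
        print(f"remote said retry; attempt {attempt}/{attempts}, sleeping {wait}s")
        time.sleep(wait)
    return response

# session = requests.Session()
# session.headers['Authorization'] = 'token ...'
# resp = get_with_retry(session, 'https://api.github.com/repos/ansible/ansible')
```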
<pabelanger> tobiash: I wonder if github3.py exposes anything, IIRC they are using python requests  (15:17)
<fungi> score! http://logs.openstack.org/19/663119/1/check/zuul-jobs-test-registry/6109606/ara-report/result/35f53315-15d6-4f5a-8e24-aa07a2796913/  (15:17)
<fungi> looks like https://review.opendev.org/663119 is ready to be reviewed  (15:17)
<openstackgerrit> Paul Belanger proposed zuul/zuul master: Pad retry_after logic for github when we raise an exception  https://review.opendev.org/665415  (15:22)
<pabelanger> ^also cleaned up traceback message, to add space after comma :)  (15:22)
<corvus> thanks, that would have annoyed me to no end :)  (15:23)
<pabelanger> Ooo  (15:24)
<pabelanger> 2019-06-14 14:55:33,759 ERROR zuul.GithubEventProcessor: [e: 75a44c70-8eb4-11e9-94c5-7a90e0072678] Failed handling pull_request event; remote said retry after 60,will retry attempt 1/5 in 60.0 seconds  (15:24)
<pabelanger> however, the retry worked  (15:24)
<pabelanger> I can see zuul processing that event  (15:24)
<pabelanger> interestingly enough  (15:24)
<pabelanger> it looks like all the 5/5 retries seem to be from the status event  (15:24)
<pabelanger> but that pull_request did work as expected  (15:25)
<pabelanger> http://paste.openstack.org/show/752933/  (15:26)
<pabelanger> that is a snip of the logic working  (15:26)
<openstackgerrit> Merged zuul/zuul master: Set same timeout for py35/36 jobs in check and gate  https://review.opendev.org/665331  (15:32)
<pabelanger> fbo: thanks, I didn't think to check that the other day^  (15:33)
<pabelanger> also saw a timeout in gate  (15:33)
<corvus> pabelanger: is the zuul.ansible.com instance watching ansible/ansible?  (15:38)
<pabelanger> corvus: yes, we are reporting results back too  (15:38)
<corvus> pabelanger: you're seeing the same errors there, right?  (15:39)
<pabelanger> corvus: when I looked last week yes, I can check now, give me a moment  (15:39)
<corvus> pabelanger: if this is going to be an iterative process, perhaps it would be easier for you to test out changes like 665415 on that instance, and then, once you're satisfied with the result, we could restart opendev as a final verification?  (15:40)
<pabelanger> corvus: c657f89e-8eb2-11e9-908a-1f05e5124b4f we also see that failure today  (15:40)
<corvus> it's... not easy to restart opendev, and we lose about 1500 compute-hours worth of work each time we do it... :/  (15:41)
<corvus> probably more, really  (15:41)
<pabelanger> corvus: sure, that seems reasonable, I'd apply it now, however we are not running the new multiprocessing patch tobiash landed. That is not to say we cannot upgrade to latest master  (15:42)
<corvus> pabelanger: yeah, i think you'll need master + your new patch  (15:42)
<pabelanger> corvus: agree  (15:42)
<corvus> that is interesting that you see that error even without the multiprocessing patch  (15:43)
<pabelanger> looking at zuul.o.o logs, I am concerned that if we group 2 or more retries together, they will both not wait the required '60' seconds for a retry, we just keep fighting each other until 5/5 tries.  (15:43)
<clarkb> pabelanger: yes that is the multiprocessing change  (15:44)
<clarkb> that should be fine if the retry is due to a backend error  (15:44)
<clarkb> but likely less good if being throttled by the rate limiter  (15:44)
<pabelanger> I wonder if we should consider backing that out, until we can determine a better way to deal with rate limits  (15:44)
<pabelanger> (multiprocessing change, I mean)  (15:45)
<corvus> pabelanger: yes, i think that's quite likely going to be a problem.  maybe you and tobiash can come up with a determination of whether it is, and a recommendation for how we should proceed.  (15:45)
<pabelanger> corvus: okay, I think what we can do on zuul.a.c is apply the retry-after logic (minus multiprocessing) to even see if 665415 is needed  (15:46)
<corvus> pabelanger: good idea, that will give us another data point  (15:46)
<corvus> pabelanger: then you could also apply 665415 (still without MP) and see if that changes things  (15:47)
<pabelanger> okay, let me work on that. However, I am not sure I will be able to complete that today, have to #dadops early today.  But should have time over the weekend to confirm  (15:47)
<pabelanger> corvus: +1  (15:47)
<corvus> pabelanger: looking at the ...b4f event -- it's interesting that some of the timestamps between the attempts are ~2 minutes apart, suggesting that the request itself still took almost a minute to return a 403.  (15:52)
<pabelanger> okay, I've applied: https://review.opendev.org/665220/  (15:58)
<pabelanger> and confirmed zuul.a.c is able to enqueue a PR  (15:58)
<pabelanger> so, now we wait for an exception  (15:58)
<pabelanger> 2019-06-14 15:59:43,328 ERROR zuul.GithubEventConnector: [delivery: 281d19ce-8ebd-11e9-8513-c2e73b0c5da4] Failed handling status event; remote said retry after None,will retry attempt 1/5 in 60.0 seconds  (16:00)
<pabelanger> now we wait  (16:00)
<pabelanger> also  (16:00)
<pabelanger> http://paste.openstack.org/show/752935/  (16:00)
<pabelanger> that is new  (16:00)
<pabelanger> okay, a retry worked  (16:01)
<pabelanger> http://paste.openstack.org/show/752936/  (16:01)
<pabelanger> which is without any padding  (16:02)
<pabelanger> let me find another instance or 2  (16:02)
<tobiash> corvus, pabelanger: re rate limit, do you still see frequent 'get all prs of ansible' in getPullBySha?  (16:04)
<tobiash> if yes, that would likely be one reason to hit the rate limit  (16:05)
<tobiash> basically this is a quick stab I got from a github engineer to get a pr by sha: http://paste.openstack.org/show/752937/  (16:07)
<tobiash> using graphql and this only costs one request  (16:07)
<tobiash> so if getpullbysha is the reason we have our first graphql use case  (16:08)
<tobiash> (and those events listed above *are* status events)  (16:08)
<pabelanger> 2019-06-14 16:04:21,511 ERROR zuul.GithubEventConnector: [delivery: f506a3a6-8ebd-11e9-935d-2fe311e1498f] Failed handling status event; remote said retry after None,will retry attempt 1/5 in 60.0 seconds  (16:17)
<pabelanger> that was a 502 Server Error  (16:17)
<pabelanger> and the retry was successful  (16:17)
<pabelanger> \o/  (16:17)
<pabelanger> so, that at least fixes the original issue  (16:17)
<pabelanger> so far 2 for 2 on retry  (16:17)
<pabelanger> also, we didn't get a 'Retry-After' header on the 502 exception  (16:18)
<pabelanger> tobiash: yah, each one has always been in getPullBySha(), that I have seen on zuul.a.c  (16:20)
<openstackgerrit> Paul Belanger proposed zuul/zuul master: Revert "Parallelize github event processing"  https://review.opendev.org/665429  (16:28)
<pabelanger> tobiash: clarkb: corvus: ^revert of parallel github, if we decide to go that route before the next zuul release  (16:28)
<pabelanger> I marked it WIP for now  (16:28)
<openstackgerrit> Paul Belanger proposed zuul/zuul master: Revert "Parallelize github event processing"  https://review.opendev.org/665429  (16:29)
<pabelanger> tobiash: corvus: clarkb: while waiting for zuul.a.c to get more exceptions, last night I noticed the following: http://paste.openstack.org/show/752908/ we had more than 1 PR (same ID) in our check pipeline. Which resulted in zuul posting multiple test results on the PR: https://github.com/ansible/ansible/pull/55199 I am unsure why we didn't dequeue the previous changes in the check pipeline, but have a feeling because  (16:57)
<pabelanger> they were different change ids  (16:57)
<corvus> pabelanger: i saw you mentioned that.  are you looking into it?  (16:58)
<mnaser> is there a way to lock down zuul's UI with any kind of auth or should i go back to basics (i.e. whitelist ips or something on the apache layer)  (16:59)
<mnaser> i'm thinking if there isn't anything native i can try using mod_auth_openidc or something like that  (17:00)
<mnaser> just not sure how many things that might break :)  (17:00)
<corvus> mnaser: no, i think apache should be fine  (17:00)
<mnaser> going to run mod_auth_openidc and see how that works out then :)  (17:01)
<corvus> mnaser: that should certainly work for whitelabeled tenants, or the entire system non-whitelabeled.  if you wanted to restrict access to individual tenants without whitelabeling, i think it would mostly work because of the way almost all the urls are tenant-scoped, except for the tenant index itself.  (17:03)
<corvus> not sure which of the 3 situations you're dealing with  (17:03)
<pabelanger> corvus: I am, but needed a little help understanding how we decide to dequeue a previous version of a PR / review. If I understand, it is tied to the Change?  (17:04)
<mnaser> corvus: yeah it's more of a single-tenant deployment in this case so it's just protecting the whole thing behind a wall  (17:04)
<corvus> mnaser: cool, i don't anticipate problems; let us know how it goes :)  (17:04)
<corvus> pabelanger: first -- you left this set to the default, right?  https://zuul-ci.org/docs/zuul/user/config.html#attr-pipeline.dequeue-on-new-patchset  (17:05)
<pabelanger> corvus: correct, also I didn't know that was an option :)  (17:06)
<corvus> pabelanger: good, forget you ever heard about it :)  (17:06)
<pabelanger> done  (17:07)
<corvus> pabelanger: this is the key bit: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/github/githubmodel.py#L41  (17:07)
<mnaser> wee  (17:07)
<mnaser> worked fine (initially)  (17:07)
<mnaser> i'll see if other things might break  (17:07)
<corvus> pabelanger: when a new item is enqueued, zuul should iterate over all the other changes in the pipeline and if that's true for any of them, remove those  (17:07)
<pabelanger> great  (17:08)
<pabelanger> that gives me a starting point  (17:08)
<corvus> pabelanger: first thing that comes to mind is that the "updated_at" looks like a weak spot -- what if something raced and that ended up wrong?  (17:09)
<pabelanger> corvus: yah, my thoughts too. looking at the PR, I see a quick push of multiple commits  (17:09)
<corvus> pabelanger: yeah, so it might be worth collecting all the log lines for those and see if you can determine what the updated_at values would be each time it ran through  (17:10)
<pabelanger> ack, will do  (17:10)
<pabelanger> thanks  (17:11)
<corvus> pabelanger: the other key part of this system is https://opendev.org/zuul/zuul/src/branch/master/zuul/manager/__init__.py#L213  (17:11)
<corvus> pabelanger: you should be able to grep for those methods and see their call locations, etc  (17:12)
<pabelanger> k  (17:12)
<clarkb> I guess PRs don't have the nice iteration number that changes in gerrit come with  (17:13)
<corvus> no, that's why we have to check both the sha and the timestamp  (17:13)
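For readers following along, the check corvus points at (githubmodel.py line 41) boils down to something like the following — a simplified paraphrase of "same PR, different sha, newer timestamp", not the exact code, and the attribute names are illustrative:

```python
# Simplified paraphrase: does new_pr supersede old_pr in a pipeline?
# Same project and PR number, a different head sha, and a newer updated_at.
def is_update_of(new_pr, old_pr):
    return (new_pr.project == old_pr.project
            and new_pr.number == old_pr.number
            and new_pr.head_sha != old_pr.head_sha
            and new_pr.updated_at > old_pr.updated_at)
```

If updated_at raced and came back stale or equal for one of the quickly-pushed revisions, a check of this shape would return False and the older item would stay enqueued, which matches the symptom pabelanger describes.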
<tobiash> pabelanger, corvus: the parallel event processing is crucial for bigger environments like ours. Without it we'll have hours of event queueing. And I'm not sure if that's better than occasionally hitting the rate limit. So I think we should try the graphql for getpullbysha and see if that reduces the number of requests.  (17:16)
<clarkb> tobiash: we might also be able to throttle it so that there is always at least X time between requests or something  (17:17)
<corvus> tobiash: that sounds reasonable.  i still think it would be swell if you or pabelanger or someone would investigate whether an overall throttle is necessary.  (17:17)
<clarkb> then you could tune that for your install (which I'm guessing has less rate limit concerns than public github)  (17:17)
<corvus> clarkb: you lost me at "tune"  (17:17)
<corvus> github tells us rate limits, no tuning should be necessary  (17:18)
<tobiash> yes, that makes sense  (17:18)
<tobiash> I think we might be able to hook into the requests session  (17:18)
<tobiash> and delay that if necessary  (17:18)
<clarkb> corvus: but it seems we are hitting errors on startup well before the rate limits?  (17:19)
<pabelanger> tobiash: Yup, my main concern is the abuse message we get back from github.com, I'd hate to see github apps be banned (not that I expect them to do that for a first offense)  (17:19)
<clarkb> the rate limit is like 5k requests per hour on a regular account  (17:19)
<clarkb> (I guess we could be making many thousands of requests all at once)  (17:19)
<corvus> clarkb: if we're hitting some kind of unpublished rate limit, then i think that would dramatically change how we would approach it.  (17:20)
<pabelanger> so far, without the parallel patch, zuul.a.c has worked as expected with the retry logic.  If we release with parallel, and not fixing getpullbysha, that will be something new for us to debug  (17:20)
<corvus> it sounds like we need *a lot* more information.  (17:20)
<pabelanger> tobiash: I agree, graphql sounds like the fix, is that something we need to add to github3.py?  (17:20)
<tobiash> pabelanger: are you already running this: https://review.opendev.org/660800 ?  (17:20)
<tobiash> no  (17:20)
<tobiash> just a post request with the session object  (17:21)
<tobiash> I'll try it later or during the weekend  (17:21)
<pabelanger> tobiash: no, 660800, I just cherry-picked the retry-after changes  (17:21)
<tobiash> should be fairly easy  (17:21)
<clarkb> ya it's 5k/hour on a regular authenticated account  (17:21)
<pabelanger> does getpullbysha hit the search API?  (17:22)
<clarkb> pabelanger: no  (17:22)
<pabelanger> k  (17:22)
<clarkb> it lists all the PRs instead (because the search api does have a lower limit)  (17:22)
<tobiash> no it does not according to the github engineer  (17:23)
<tobiash> it's a graphql query and he showed me that this one is very cheap  (17:23)
<pabelanger> clarkb: right, that's what google was saying, the search API seems to have a more aggressive rate-limit  (17:23)
<clarkb> tobiash: https://developer.github.com/v3/search/#rate-limit depends on the api you use  (17:23)
<clarkb> that may be true for graphql but the search api via rest has the lower rate limit  (17:24)
<pabelanger> tobiash: what is the path forward for graphql in zuul? Do you have thoughts?  (17:24)
<corvus> i think that before we can say something is a solution, we need to know what the problem is; and it sounds like there is some doubt about whether this is a regular api limit, or something new.  so before we engineer solutions which may or may not address the underlying problem, can we try to identify what it is first?  (17:25)
<pabelanger> +1  (17:26)
<clarkb> https://developer.github.com/v3/guides/best-practices-for-integrators/#dealing-with-abuse-rate-limits has clues  (17:26)
<clarkb> "make requests serially"  (17:26)
<clarkb> I think it is likely we are hitting their abuse traps rather than regular rate limits on startup and that may be due to concurrent requests  (17:27)
<corvus> (perhaps the problem is that we're calling getpullbysha on the same repo simultaneously in 10 different threads with an empty cache?)  (17:27)
<corvus> (perhaps an analysis of the logs would reveal information like that ^)  (17:27)
<tobiash> ah, so the abuse rate limit is something different  (17:29)
<tobiash> so we should probably reduce the number of threads  (17:29)
<corvus> that is one thing that we could do  (17:30)
<corvus> but again, we need data  (17:30)
<tobiash> and implement rate limiting in a central place that evaluates the retry-after header  (17:30)
<corvus> that may not be necessary if we avoid the behavior that causes it in the first place  (17:30)
<tobiash> yeah, I'm just thinking as I have no data ;)  (17:30)
<corvus> yep, thinking is good :)  (17:31)
<corvus> so far we've had more patches than analysis  (17:31)
<corvus> mostly i'm trying to say it's time for the opposite :)  (17:31)
<tobiash> pabelanger: can you confirm in the logs that you hit the abuse rate limit and not the normal rate limit? I thought I read something like this in your snippets but cannot find it anymore.  (17:32)
<pabelanger> tobiash: yah, it was abuse, I need a moment to post it from zuul.o.o  (17:33)
<tobiash> ok, then I think we have the following options:  (17:33)
<tobiash> * queue per installation id  (17:33)
<tobiash> * reduce threads (but to which number?) Github doesn't document this  (17:34)
<tobiash> * central limit with retry-after inspection  (17:34)
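A rough sketch of the third option — a single choke point that every request thread goes through, which backs off globally when GitHub answers with Retry-After. The class and method names are hypothetical, and the 5-second pad is an assumption:

```python
# Hypothetical central limiter: workers call request() instead of hitting
# the session directly; a 403 with Retry-After pauses all of them.
import threading
import time

class CentralLimiter:
    def __init__(self, session, pad=5.0):
        self.session = session
        self.pad = pad
        self.lock = threading.Lock()
        self.resume_at = 0.0          # monotonic time before which we must wait

    def request(self, method, url, **kwargs):
        with self.lock:
            delay = self.resume_at - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        response = self.session.request(method, url, **kwargs)
        retry_after = response.headers.get('Retry-After')
        if response.status_code == 403 and retry_after:
            # Remember the backoff so every other thread waits too.
            with self.lock:
                self.resume_at = time.monotonic() + float(retry_after) + self.pad
        return response
```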
<corvus> tobiash: the docs clarkb linked suggest one thread per install id  (17:36)
<corvus> well, it says "Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently."  (17:36)
<corvus> what's a client id?  (17:36)
<tobiash> yes, that's the safe side, but that imposes scalability problems :/  (17:36)
<tobiash> a client id is either an installation or a user  (17:37)
<tobiash> so when using app auth you have a rate limit per installation (most of the time a github organization)  (17:37)
<corvus> so for most of us, we'd just end up back at 'one therad'  (17:37)
<corvus> one thread  (17:37)
<corvus> oh wait  (17:37)
<tobiash> which is basically back to queuing  (17:38)
<corvus> no, an installation is app-in-a-project  (17:38)
<corvus> so wouldn't that be one thread per project?  (17:38)
<tobiash> most users install the app on the org  (17:38)
<tobiash> as best practice  (17:38)
<tobiash> you can install it on an org or on individual repos  (17:38)
<tobiash> so in our case we have a large org which would fall back to being single threaded, which would break us again  (17:39)
<tobiash> maybe we can try out the grey area of a few (but > 1) threads  (17:39)
<tobiash> pabelanger: it would be great to see all get requests in the log around that incident  (17:40)
<tobiash> to see if we polled 1000 pull requests from ansible (which for sure would quickly trigger the abuse limit when multithreaded)  (17:41)
<corvus> i think "grep zuul.Github debug.log" would be safe to share?  (17:41)
<pabelanger> tobiash: yah, we have 1.9k open PRs on ansible/ansible  (17:42)
<corvus> pabelanger, tobiash: it would also be interesting to know if we see 403 errors after successfully populating the LRU caches  (17:42)
<pabelanger> corvus: I think do, I haven't seen any sensitive data  (17:42)
<pabelanger> so*  (17:42)
<tobiash> maybe we didn't even successfully populate the lru caches  (17:43)
<openstackgerrit> Merged zuul/zuul master: cli: prevent exception for the show command  https://review.opendev.org/664235  (17:44)
<corvus> tobiash, pabelanger: http://paste.openstack.org/show/752945/ http://paste.openstack.org/show/752946/ http://paste.openstack.org/show/752947/ http://paste.openstack.org/show/752948/  (17:47)
<corvus> that's the first 45 minutes after the most recent startup  (17:47)
<corvus> (in 500 line chunks in order)  (17:48)
<corvus> oh oops, that's still too long, they got cut off  (17:48)
<tobiash> maybe you can also further grep for 'GET' to just see the requests (otherwise I'll do it locally)  (17:49)
<corvus> tobiash, pabelanger: http://paste.openstack.org/show/752949/  through  http://paste.openstack.org/show/752957/  inclusive  (17:49)
<corvus> that's 200 line chunks  (17:50)
<corvus> there are raw links, you can fetch http://paste.openstack.org/raw/752957/ etc and recombine them  (17:50)
<corvus> okay, i'm going to leave you to it  (17:51)
<tobiash> already on it  (17:51)
<tobiash> ftr, this does the job: for i in `seq 752949 752957`; do wget http://paste.openstack.org/raw/$i ; done  (17:52)
<mnaser> https://zuul-ci.org/docs/zuul-jobs/install.html  (17:55)
<mnaser> is this wrong? should the source be "opendev" and not "zuul-git"?  (17:55)
<pabelanger> mnaser: yah, looks wrong  (17:56)
<mnaser> and i assume it should be zuul/zuul-jobs?  (17:56)
<pabelanger> maybe fallout from the mass rename  (17:56)
<pabelanger> yup  (17:56)
<tobiash> I see the 403 only during/after sequences like those: http://paste.openstack.org/show/752960/  (17:57)
<mnaser> guilhermesp: wanna push up a patch to zuul/zuul-jobs to fix that so it has 'opendev' under sources and zuul/zuul-jobs instead?  (17:57)
<guilhermesp> sure mnaser let me see  (17:58)
<tobiash> and there are many sequences like those  (17:58)
<tobiash> so I think that lru cache is also less effective than expected initially (I think we saw this also when introducing it)  (17:58)
<tobiash> and with every one of those occasions we're doing multiple sweeps in parallel  (18:00)
<tobiash> so we could either serialize status events or try out graphql for this which would eliminate those sweeps completely  (18:00)
<pabelanger> tobiash: so, do you think we need to address that before the next zuul release? Or revert parallel, release, and work on your suggestions  (18:08)
<tobiash> hrm, I'd have to re-add parallel then to my deployment, as without it I'm doomed  (18:09)
<openstackgerrit> Guilherme Steinmuller Pimentel proposed zuul/zuul-jobs master: Fix installation docs  https://review.opendev.org/665443  (18:10)
<tobiash> I just tried out the getpull by sha via graphql and the rate limit api tells me that I have 5000 bucks per hour and the request costs 1  (18:10)
<pabelanger> I can still upgrade zuul.a.c to latest master and see how often we see the 403 exceptions, but I would imagine it will be the same frequency as zuul.o.o.  But I won't be able to do that now, I am AFK in 30mins or so  (18:11)
<pabelanger> but, I do think it will introduce a new issue for us to debug  (18:11)
<tobiash> example graphql: http://paste.openstack.org/show/752961/  (18:11)
<mnaser> thanks guilhermesp ^  (18:11)
<tobiash> ok so how about revert parallel, release, re-add parallel with fewer threads (4-8? during regular requests in the logs the abuse detection didn't kick in) + getpullbysha via graphql?  (18:13)
<pabelanger> I'm game for what everybody else wants :)  (18:14)
<tobiash> but yeah we definitely need to address this before a release as at least getpullbysha on large repos kind of guarantees to trigger the abuse detection  (18:16)
<pabelanger> agree, sounds like this is something we should address. I'm not sure what is involved to move to graphql honestly.  (18:18)
<tobiash> or if a release is not that urgent we could skip the revert and directly change thread count and graphql  (18:18)
<pabelanger> I believe SpamapS was looking for a new release, but in our case if we did the revert of parallel, we could install latest master for all the fixes we need in zuul.a.c  (18:21)
<pabelanger> then work to address parallel issues  (18:21)
<pabelanger> so we don't 'need  (18:21)
<pabelanger> err  (18:21)
<pabelanger> so we don't 'need' a release, we can pick it up from pre-release  (18:21)
<tobiash> this is a graphql example that contains exactly the information we need in getpullbysha: http://paste.openstack.org/show/752962/  (18:22)
<tobiash> it just has to be templated with the sha, sent to github as a POST using the github client session, plus some json parsing  (18:22)
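The pasted query itself isn't reproduced here, but the mechanics tobiash describes — template the sha into a GraphQL document, POST it with the existing authenticated session, parse the JSON — would look roughly like the sketch below. The query text and helper name are assumptions (a search-style lookup, which as noted later turns out to be unavailable to GitHub Apps), not the example from the paste:

```python
# Hypothetical sketch: POST a templated GraphQL query to find a PR by sha,
# using an already-authenticated requests session.
import requests

QUERY = """
query($q: String!) {
  search(query: $q, type: ISSUE, first: 5) {
    nodes {
      ... on PullRequest { number state headRefOid }
    }
  }
}
"""

def get_pull_by_sha(session, sha, repo='ansible/ansible'):
    payload = {'query': QUERY,
               'variables': {'q': f'{sha} repo:{repo} type:pr'}}
    resp = session.post('https://api.github.com/graphql', json=payload)
    resp.raise_for_status()
    nodes = resp.json()['data']['search']['nodes']
    # Keep only PRs whose current head actually matches the sha.
    return [n for n in nodes if n.get('headRefOid') == sha]
```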
<corvus> i lean toward revert parallel, release, then spend a week or so getting it right before we release again :)  (18:29)
<pabelanger> okay, I've also removed my wip on https://review.opendev.org/665429/ (the revert)  (18:29)
<pabelanger> looking at zuul.a.c logs again, I've only had 2 instances of retries today. Github is working well it seems :D  (18:30)
<tobiash> then it's decided  (18:30)
<pabelanger> but did confirm a 502 Server Error will not have a 'Retry-After' header  (18:30)
<fungi> well, or at least some 502 server errors won't  (18:31)
<fungi> can't say for sure that they all don't  (18:31)
<pabelanger> true  (18:31)
<corvus> pabelanger: oh, i missed the link to a 502 without a retry, can you paste that again?  (18:31)
<pabelanger> let me find it  (18:32)
<pabelanger> http://paste.openstack.org/show/752964/  (18:33)
<pabelanger> then  (18:34)
<pabelanger> 2019-06-14 16:07:12,081 DEBUG zuul.GithubEventConnector: [delivery: f506a3a6-8ebd-11e9-935d-2fe311e1498f] Got PR on project ansible/ansible for sha ba7b2f053d487e533c82b17cb194619b33cbde4f  (18:34)
<pabelanger> so it worked as expected  (18:34)
<pabelanger> okay, I have to wrap up for now. Thanks again everybody for the help on the github driver this week, much appreciated. I believe this will go a long way to making zuul better for the ansible org!  (18:35)
<pabelanger> will check back later this evening  (18:36)
<openstackgerrit> Merged zuul/zuul-jobs master: Fix installation docs  https://review.opendev.org/665443  (18:58)
<tobiash> grr, github apps are not allowed to use search in graphql :/  (19:16)
<tobiash> so that option is gone  (19:16)
<tobiash> pabelanger, corvus: I'm currently trying out some things using the search api. And I think we can and should go with the search api instead. It has a rate limit of 30 per minute which is one every two seconds on average per installation id. It's unlikely that we hit that. I'd combine this with search related rate limit handling. With that we should be on the safe side and can remove this lru cache entirely.  (20:05)
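A rough illustration of that approach (not the actual 665469 change): ask the REST issue-search endpoint for the sha, and if the search-specific limit is exhausted, sleep until the reset time GitHub reports and try again. The query string and function name are assumptions:

```python
# Illustrative only: look up PRs for a sha via the REST search API and
# wait out the (30 requests/minute) search rate limit when exhausted.
import time
import requests

def search_pulls_by_sha(session, sha, repo):
    url = 'https://api.github.com/search/issues'
    params = {'q': f'{sha} type:pr repo:{repo}'}
    while True:
        resp = session.get(url, params=params)
        if (resp.status_code == 403
                and resp.headers.get('X-RateLimit-Remaining') == '0'):
            # Search limit hit: sleep until the advertised reset time.
            reset = int(resp.headers.get('X-RateLimit-Reset', time.time() + 60))
            time.sleep(max(reset - time.time(), 0) + 1)
            continue
        resp.raise_for_status()
        return [item['number'] for item in resp.json().get('items', [])]
```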
<corvus> tobiash: have we determined whether it's merely the startup of the lru cache that's at issue?  (20:09)
<tobiash> the lru cache in this case is suboptimal and a workaround in any case, so I think we should use search+rate limit handling anyway  (20:10)
<corvus> (i thought the lru cache was a pretty good idea -- after all, it may be able to eliminate some queries entirely?)  (20:10)
<corvus> iirc, the idea was we get an event for a sha, we do one query on it, then as different people and bots leave comments and updates on that sha for the next hour or so, we don't have to look it up again  (20:12)
<tobiash> well the initial idea was not to have to iterate over all pull requests on every status event. For this use case the cache was a (complicated) workaround which I think I remember worked somehow but was not as effective as we hoped.  (20:14)
<corvus> basically, "find a pr by sha" is expensive either way -- now it takes many requests (and many seconds) out of a large limit, if we switch to search, it will take one request out of a much lower limit.  so it seems like either way, the cache would be helpful.  (20:14)
<clarkb> it does work (it also populates the cache when it does the scan for all known shas)  (20:14)
<clarkb> so it isn't just the current PR that it will remember  (20:14)
<clarkb> where it doesn't work is when you have other CI systems doing post merge testing  (20:14)
<clarkb> because those commit shas cannot be known  (20:14)
<tobiash> the limit is high enough that it's really hard to hit, essentially only by trying to hit it in my local testing  (20:15)
<corvus> clarkb: what do you mean by "cannot be known?"  (20:15)
<tobiash> the sha changes on merge  (20:15)
<tobiash> that was the problem that invalidates the cache in ansible on every post job  (20:15)
<clarkb> corvus: when ansible merges say 3 changes in a row. Only one of the three post merge shas can be calculated before they merge  (20:15)
<corvus> sure, but if you search for it, you will find it, right?  (20:16)
<clarkb> corvus: yes  (20:16)
<clarkb> the cache can't cache it though  (20:16)
<corvus> well, wait, why can't it cache it?  (20:16)
<clarkb> so for that particular case the cache isn't helpful  (20:16)
<clarkb> because when PR1 triggers a cache update the sha for PR2 post merge isn't known  (20:17)
<clarkb> so you basically always hit the search case when you get post merge test results  (20:17)
<corvus> why are there multiple PRs in the example?  (20:17)
<clarkb> corvus: because for the first PR it is cacheable due to github telling you what the sha will be if this change merges (and nothing else merges first)  (20:18)
<clarkb> it's only a problem when you start merging multiple PRs in succession  (20:18)
<corvus> i think we're looking at this in very different ways  (20:18)
<clarkb> basically when those events come in a large percentage of the time the data is not present in the cache  (20:19)
<clarkb> so for that particular set of events the cache is not helpful. For all of the pre merge events it's super helpful  (20:19)
<corvus> all i see is a cache which maps sha -> PR.  sometimes a PR gets new shas.  that can be because of a new commit being added, or it can be because of a merge.  in any case, when a PR gets a new sha, it's a cache miss, and a cache miss means we (currently) query the whole list of PRs.  (20:19)
<corvus> i understand you're saying that generating lots of events on PRs post-merge means we see a lot of new shas which are cache misses.  (20:20)
<clarkb> yes and that causes backup problems  (20:20)
<clarkb> (which is the sort of thing tobiash is trying to fix)  (20:20)
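To make the hit/miss argument concrete, the cache being discussed is essentially a mapping from commit sha to PR that falls back to the expensive lookup on a miss — sketched below with hypothetical names. Post-merge shas are exactly the keys that could never have been inserted in advance, hence the misses clarkb describes:

```python
# Hypothetical sketch of a sha -> PR cache with fallback to a full lookup.
class PullCache:
    def __init__(self, lookup):
        self._lookup = lookup      # expensive: list/search PRs for a sha
        self._by_sha = {}          # sha -> PR

    def get(self, sha):
        if sha not in self._by_sha:
            # Cache miss: e.g. a post-merge sha that could not be known
            # when the PR's pre-merge shas were cached.
            pr = self._lookup(sha)
            if pr is not None:
                self._by_sha[sha] = pr
        return self._by_sha.get(sha)
```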
<corvus> tobiash: is this the main culprit in terms of time we're spending on api queries?  ie, is this the main thing the parallelization patch was designed to address?  (20:21)
<openstackgerrit> Tobias Henkel proposed zuul/zuul master: Switch getPullBySha to using the search api  https://review.opendev.org/665469  (20:22)
<tobiash> corvus, clarkb: this uses search and handles the rate limit gracefully ^  (20:23)
<tobiash> in local testing with an endless loop it takes around 45 seconds to reach the limit (so worst case scenario). The code handles this gracefully, waits for the time github tells us to wait (a few seconds in testing) and after that it right away has another 30 requests left  (20:24)
<corvus> tobiash: see my question about the parallelization patch?  (20:26)
<tobiash> still reading backlog  (20:26)
<corvus> ack.  i'm trying to understand whether making getpullbysha faster would mean we don't need that.  (20:27)
<tobiash> corvus: regarding the parallelization, this is not only about getpullbysha, even without it we're running into queueing because of serialization  (20:28)
<tobiash> this is the thing the parallelization patch addresses  (20:28)
<tobiash> and basically the current implementation of getpullbysha abuses this and triggers the abuse protection  (20:29)
<corvus> well, maybe -- i mean, github won't promise that any amount of parallelization is acceptable.  (20:29)
<tobiash> so switching to search should resolve this mostly and we should maybe combine this with limiting the threads to like 4 or 8  (20:29)
<tobiash> per installation, but I think we should try out a little bit of the grey area  (20:30)
<tobiash> otherwise we should at least parallelize and lock per instance id  (20:31)
<tobiash> and from the log you posted the only problematic occurrences were due to getpullbysha  (20:33)
<tobiash> so I'd suggest limiting the threads and using search. If there are still issues then, limit parallelization per installation id  (20:34)
<corvus> okay, let me propose some things: 1) we use the search api like in your patch, but we *also* keep the cache.  that should still help in the cases where multiple events arrive for a sha in succession (still fairly common).  2) we parallelize per installation id.  3) we add an undocumented option to increase the number of threads per instance id (and maybe even an RPC call to change it if you want to get fancy) so we can observe the grey area.  if we think it's necessary/useful, we document the option (but leave the default as 1)  (20:34)
<corvus> oh, and if we find it's not useful or dangerous, we remove the option  (20:35)
<tobiash> sounds like a good plan  (20:36)
<corvus> clarkb, pabelanger: ^  (20:36)
<clarkb> corvus: sounds good to me  (20:36)
<clarkb> in this case an installation id is the app on the github side?  (20:37)
<clarkb> so opendev has 1?  (20:37)
<corvus> clarkb: it's an app installed in an org or repo  (20:37)
<corvus> clarkb: so opendev has a handful of them  (20:37)
<clarkb> ah ok so it's the application of our app into the projects  (20:37)
<corvus> yep.  it's kind of annoying that there's an incentive to install the app for each repo rather than at the org level (in order to dodge the abuse limit).  (20:41)
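A rough sketch of what "parallelize per installation id" could look like — one worker queue per installation so that requests for a single installation stay serial while different installations proceed in parallel, with a configurable thread count per installation as in point 3 of the plan. All names are hypothetical, it assumes submit() is called from a single dispatcher thread, and it ignores the global-ordering question discussed next:

```python
# Hypothetical per-installation worker pool: events for one installation
# are handled serially; different installations run in parallel.
import queue
import threading

class PerInstallationWorkers:
    def __init__(self, handler, threads_per_installation=1):
        self.handler = handler
        self.threads_per_installation = threads_per_installation
        self.queues = {}   # installation id -> queue of pending events

    def submit(self, installation_id, event):
        # Assumed to be called from a single dispatcher thread, so the
        # lazy queue/worker creation below does not need locking.
        if installation_id not in self.queues:
            q = queue.Queue()
            self.queues[installation_id] = q
            for _ in range(self.threads_per_installation):
                threading.Thread(target=self._run, args=(q,), daemon=True).start()
        self.queues[installation_id].put(event)

    def _run(self, q):
        while True:
            self.handler(q.get())
```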
<openstackgerrit> Tobias Henkel proposed zuul/zuul master: Switch getPullBySha to using the search api  https://review.opendev.org/665469  (20:46)
<tobiash> that retains the cache ^  (20:47)
<tobiash> tested locally against github.com  (20:48)
<tobiash> so 1) check, 2) is a little bit more complicated, so I won't do this today ;)  (20:49)
<tobiash> corvus: when parallelizing per installation id, do we still want to keep global ordering or do you think ordering per installation id is sufficient?  (20:50)
<corvus> tobiash: i'm inclined to say that global ordering would be best.  it probably would be mostly okay to order per installation id, but maybe there would be times where cross-project dependencies might end up wrong if we did that?  (20:53)
<tobiash> k, I'm also unsure about that, I just wanted to think about whether this is a compromise we can make in favor of simpler implementation and scalability  (20:54)
<corvus> tobiash: yeah, the one thing i know is that sequencing per install-id will require more thought.  :)  (20:55)
<clarkb> corvus: I think that may be the case considering Depends-On doesn't specify a sha but instead specifies the container object (change/PR)  (20:55)
<tobiash> yepp  (20:55)
<openstackgerrit> Tobias Henkel proposed zuul/zuul master: Switch getPullBySha to using the search api  https://review.opendev.org/665469  (20:56)
<corvus> clarkb: yeah, i'm coming up with similar thoughts.  that, basically, we could enqueue A1->B1 even after B2 has been pushed, then later, we see B2 and abort the job because it's not current.  so in the end, zuul won't test the wrong thing, but users would be confused because from their POV, A1 should have tested with B2 to start with.  (20:57)
<corvus> (and they'd see one of those annoying cryptic 'dependency unable to merge' messages)  (20:57)
<tobiash> yeah, that's one of those edge cases  (20:57)
<tobiash> ok, I'll keep the global orderung  (20:58)
<corvus> ++  (20:58)
<tobiash> *ordering  (20:58)
<corvus> ofosos: i left comments on the bitbucket series -- i think it's really close.  mostly a couple of subtle things which i think our abbreviated developer docs don't make very clear, sorry.  (21:20)
<ofosos> corvus: woohoo, thank you very much  (21:36)
<pabelanger> corvus: tobiash: clarkb: +1 plan looks good.  (22:25)
<pabelanger> tobiash: looks like a test case failed on your search api update  (22:25)
<SpamapS> ofosos: I'll try and find some time next week to look at your next pushes. Thanks for hanging in there, it's really awesome to add another flower to the zuul bouquet of drivers. :-D  (23:02)
<clarkb> I managed to do a quick skim but didn't leave much useful feedback (just one code cleanup suggestion)  (23:03)
<openstackgerrit> Merged zuul/zuul master: Revert "Parallelize github event processing"  https://review.opendev.org/665429  (23:37)
