Tuesday, 2019-02-12

tristanCzuul-maint, looking for review on https://review.openstack.org/632620 to be able to use add-build-sshkey on static node. thanks :)00:00
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: executor: enable zuul_return to update Ansible inventory  https://review.openstack.org/59009200:20
*** rlandy is now known as rlandy|bbl00:36
*** TheJulia has joined #zuul00:59
*** mrhillsman has joined #zuul00:59
*** jtanner has joined #zuul00:59
*** sdake has joined #zuul01:01
openstackgerritMerged openstack-infra/zuul master: Add a note about sqlalchemy metadata  https://review.openstack.org/63618901:18
*** sdake has quit IRC01:37
*** sdake has joined #zuul01:39
openstackgerritJames E. Blair proposed openstack-infra/zuul-jobs master: Add intermediate registry push/pull roles  https://review.openstack.org/63482901:41
*** sdake has quit IRC01:43
*** sdake has joined #zuul01:50
*** sdake has quit IRC01:51
openstackgerritMerged openstack-infra/zuul master: web: add /{tenant}/buildsets route  https://review.openstack.org/63003502:12
*** saneax has joined #zuul02:50
*** saneax has quit IRC02:52
*** saneax has joined #zuul02:53
*** rlandy|bbl is now known as rlandy03:12
*** rlandy has quit IRC03:19
*** sdake has joined #zuul03:39
*** sdake has quit IRC03:59
*** manjeets has quit IRC04:41
*** saneax has quit IRC04:42
*** bjackman has joined #zuul04:55
*** shanemcd has quit IRC05:03
*** shanemcd has joined #zuul05:04
*** greg-g has quit IRC05:13
*** chandankumar is now known as chkumar|ruck05:47
openstackgerritMerged openstack-infra/zuul master: Update SQLAlchemy url in docs  https://review.openstack.org/63514805:59
*** sshnaidm is now known as sshnaidm|afk06:39
*** saneax has joined #zuul06:45
openstackgerritFelix Schmidt proposed openstack-infra/zuul master: Retrieve full list of jobs with details per tenant via API  https://review.openstack.org/63571406:56
*** quiquell|off is now known as quiquell|rover07:02
*** snapiri has joined #zuul07:24
*** quiquell|rover is now known as quique|rover|brb07:53
*** themroc has joined #zuul08:10
*** quique|rover|brb is now known as quiquell|rover08:12
openstackgerritQuique Llorente proposed openstack-infra/zuul master: Escape jinja2 template from commit message  https://review.openstack.org/63393008:21
*** gtema has joined #zuul08:24
*** jpena|off is now known as jpena08:53
*** panda|off is now known as panda09:04
openstackgerritQuique Llorente proposed openstack-infra/zuul master: Escape jinja2 template from commit message  https://review.openstack.org/63393009:43
quiquell|rovertristanC: different approach ^09:45
openstackgerritQuique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory  https://review.openstack.org/63393009:45
*** sshnaidm|afk is now known as sshnaidm09:52
SpamapStristanC: heh, your python, on first test, comes out about 10x slower than my rust. Trying the C++ now. ;)10:22
SpamapSThe C++ is still faster. :-P10:24
SpamapSoh interesting, with more URLs, the rust gets faster10:27
SpamapSsuggesting that the C++ LRU cache maybe isn't so awesome.10:28
ttxHey all, I'm trying to get to the bottom of delays in the GitHub triggers... it seems to take anywhere between 30sec and 10min for a GitHub event to enter Zuul pipelines. I was wondering if that was coming from the processing on Zuul side, or some form of webhook batching on the GitHub side. How much of that is expected behavior vs. bug ?10:32
tristanCSpamapS: how are you testing?10:33
SpamapStristanC: I spun up an nginx with 10000 pretend builds10:34
SpamapStristanC: and fed them all to each program10:35
SpamapSttx: GitHub triggers have been known to happen asynchronously, but you can see them on the app's detail page if you have access.. and then compare with the zuul scheduler logs to see who was slower.10:35
SpamapStristanC: I'm pretty sure the problem with your python is that it doesn't use a persistent client10:36
ttxSpamapS: ok, I don't have access to that, but i know a few people who have10:36
tristanCSpamapS: I mean, how did you measure the performance?10:36
SpamapStristanC: oh, just /usr/bin/time10:37
SpamapStristanC: I'm about to push what I did into the git repo I made, so you can repeat if you like. :)10:39
tristanCSpamapS: sure, i'm curious how this would affect overall user experience. Initial build lookup only happens once per visit, then it should rewrite should be served by the cache10:40
tristanCs/it should//10:41
SpamapStristanC: yeah I would like to try some different schemes, like repeat builds10:41
SpamapSBut I'd say log URL's are probably not going to have a fantastic cache hit rate.10:42
tobiashttx: normally this is a matter of less than 30s10:42
SpamapStristanC: I tend to agree with you that this probably isn't a big deal though.10:43
ttxtobiash: I'm testing right now and I'm at 17min so far10:43
tobiashttx: is this the time until a change shows up in the pipeline without or with jobs?10:43
ttxchange shows up in pipeline10:44
tobiashJob freezing can take a while because that requires a merger call10:44
tobiashttx: maybe you're running into api request limit?10:44
tobiashttx: do you use an api token or github app?10:45
ttxtobiash: maybe, but the activity is not that significant. GitHub app10:45
tobiashHrm10:45
tobiashThen you should probably first check webhook history in gh and then zuul web logs10:46
SpamapStristanC: using a persistent session object improved by about 15%. Still 7x slower than the C++/Rust.10:46
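For reference, a rough sketch of the two tweaks under discussion (a persistent requests.Session plus a small LRU cache in front of the build lookup); the endpoint and field names here are illustrative assumptions, not tristanC's actual service:

    import functools
    import requests

    # Reusing one Session keeps the TCP/TLS connection alive across lookups,
    # which is where a persistent-client speedup like the ~15% above comes from.
    session = requests.Session()

    @functools.lru_cache(maxsize=4096)
    def lookup_log_url(build_uuid):
        # Hypothetical endpoint: one API round trip per unseen build uuid;
        # repeat requests for the same build are served from the cache.
        resp = session.get('https://zuul.example.com/api/tenant/example/builds',
                           params={'uuid': build_uuid})
        resp.raise_for_status()
        return resp.json()[0]['log_url']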
ttxtobiash: if you confirm that this is unexpected behavior, I'll try to dig up more data points10:46
tobiashJust to verify that there is everything ok10:46
tobiashThe webhooks should arrive within a couple of seconds10:47
SpamapSand uses 20MB more memory than the C++/Rust which only use about 8MB... but honestly, this is just late night insomnia playing out for me... not sure how performance sensitive this actually is.10:47
tristanCSpamapS: well it depends what the overhead is relative to the total latency of the browser going through that service10:49
SpamapStristanC: the python responds to cache misses in 1ms. ;)10:54
tristanCSpamapS: e.g., a request to logs.openstack.org takes about 300ms, thus if the difference between the implementation is less than 10ms, then it shouldn't matter much right10:54
SpamapSYeah, 1ms vs. 100ns. :-P10:55
SpamapSI mean it's nice if things are fast. :)10:55
SpamapSand honestly, the C++ and Rust are pretty darn readable.10:55
SpamapSBut, there's also about 4x the code. :)10:56
SpamapSso, yeah, in terms of overall cost to the project, while I wish we could rewrite all of Zuul in Rust (please god not C++), this isn't a part that I think needs to be in !python10:57
* SpamapS should probably try that sleep thing again11:02
*** iurygregory has joined #zuul11:34
iurygregoryHi everyone, while moving jobs to zuulv3 in ironic we noticed that ironic-tempest-plugin is not running all the jobs we have set in project.yaml https://github.com/openstack/ironic-tempest-plugin/blob/master/zuul.d/project.yaml, i checked that the parent jobs really exist and they do, anyone have ideas what it could be? Some jobs that aren't running are defined in stable-jobs.yaml https://github.com/openstack/ironic-tempest-plugin/blob/master/zuul.d/stable-jobs.yaml11:42
openstackgerritMatthieu Huin proposed openstack-infra/zuul master: [WIP] Zuul CLI: allow access via REST  https://review.openstack.org/63631511:50
*** jpena is now known as jpena|lunch12:26
*** abregman has joined #zuul12:32
*** panda is now known as panda|lunch12:32
abregmanhey. does zuul have an API?12:34
pabelangerabregman: yes, zuul-web is the service that allows that12:46
abregmanpabelanger: how exactly do I access it? for example if my address is https://my-zuul.com12:47
jktabregman: there's this pending work, https://review.openstack.org/#/c/535541/12:50
jktabregman: are you looking for Gerrit integration by any chance?12:50
abregmanjkt: thanks, I'll look into it. what do you mean by Gerrit integration?12:51
pabelangerabregman: there is some docs around deployment options: https://zuul-ci.org/docs/zuul/admin/installation.html?highlight=dashboard#web-deployment-options but should be https://my-zuul.com/api12:51
abregmanbasically I'm just looking for a way to pull some information using some scripts I want to write12:51
abregmanpabelanger: thanks, I'll go over it. using /api just redirects me to /tenants12:52
openstackgerritQuique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory  https://review.openstack.org/63393012:53
jktabregman: ok; I just wanted to see if you were looking for https://gerrit.googlesource.com/plugins/zuul-status/12:54
pabelangerabregman: yup, depends on how your dashboard is setup, here is a list of entrypoints: http://git.zuul-ci.org/cgit/zuul/tree/zuul/web/__init__.py#n66112:55
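For anyone following along later, pulling data from those endpoints is only a couple of lines of Python (the base URL and tenant name are placeholders here):

    import requests

    base = 'https://my-zuul.com/api'

    # List the tenants this Zuul serves, then fetch recent builds for one.
    tenants = requests.get(base + '/tenants').json()
    builds = requests.get(base + '/tenant/example-tenant/builds').json()

    for build in builds[:5]:
        print(build['job_name'], build['result'], build['log_url'])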
pabelangeryou can also get data from statsd, if setup too12:55
abregmanpabelanger: awesome! exactly what I was looking for. thanks :)12:57
abregmanpabelanger++12:57
*** bjackman has quit IRC12:59
dmsimardpabelanger: http://logs.openstack.org/41/535541/9/check/zuul-build-dashboard/b99f517/npm/html click API at the top right (from https://review.openstack.org/#/c/535541/ )13:00
jktwow!13:05
*** chkumar|ruck is now known as chkumar|pto13:10
*** panda|lunch is now known as panda13:13
*** quiquell|rover is now known as quique|rover|eat13:17
*** rlandy has joined #zuul13:17
*** jpena|lunch is now known as jpena13:30
*** sdake has joined #zuul13:32
openstackgerritQuique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory  https://review.openstack.org/63393013:37
*** openstackgerrit has quit IRC13:37
*** iurygregory has left #zuul13:38
*** quique|rover|eat is now known as quiquell|rover13:39
*** abregman has quit IRC14:01
*** sdake has quit IRC14:03
*** openstackgerrit has joined #zuul14:15
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: web: prevent status update loop in background  https://review.openstack.org/63634314:15
pabelangermordred: with openstack client and openstack server show, does the properties field map to metadata in openstacksdk?14:18
pabelangermordred: eg: line 25 @ https://softwarefactory-project.io/paste/show/1425/14:19
mordredpabelanger: yes, I think so?14:19
pabelangerack14:20
pabelangertrying to figure out why nodepool might be leaking nodes when an exception is raised while creating a node14:21
pabelangerhttps://softwarefactory-project.io/paste/show/1424/ from nodepool, note line 514:22
pabelangerbut https://softwarefactory-project.io/paste/show/1425/ shows it actually exists14:22
pabelangerbut, then trying to figure out why cleanup handler isn't seeing it, and trying to delete it14:22
*** sshnaidm has quit IRC14:30
*** sshnaidm has joined #zuul14:30
openstackgerritFelix Schmidt proposed openstack-infra/zuul master: Retrieve full list of jobs with details per tenant via API  https://review.openstack.org/63571414:30
*** sdake_ has joined #zuul14:37
*** ParsectiX has joined #zuul14:44
*** swest has quit IRC14:47
openstackgerritMatthieu Huin proposed openstack-infra/zuul master: [WIP] Zuul CLI: allow access via REST  https://review.openstack.org/63631514:47
ParsectiXHi guys. I'm trying to add an ldap group as a member on another group via CLI and it fails. When I'm doing the same operation from the WebUI it works14:49
ParsectiXssh -p 2244 localhost gerrit set-members Administrators --include ldap/admins14:50
ParsectiXfatal: Group Not Found: ldap%3Acn%3Dadmins%2Cou%3Dgroups%2Cdc%3Dexample%2Cdc%3Dio14:50
ParsectiXCan you please advise how to debug this?14:50
openstackgerritMatthieu Huin proposed openstack-infra/zuul master: [WIP] Zuul CLI: allow access via REST  https://review.openstack.org/63631514:54
*** snapiri has quit IRC14:56
openstackgerritQuique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory  https://review.openstack.org/63393014:58
*** Parsecti_ has joined #zuul15:05
*** ParsectiX has quit IRC15:07
*** ianychoi has quit IRC15:07
*** ianychoi has joined #zuul15:07
fungiParsecti_: you may want to try asking in #gerrit15:21
*** Parsecti_ has quit IRC15:21
pabelangeris anybody able to see why https://softwarefactory-project.io/paste/show/1425/ isn't hitting line 493 in http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py?h=3.4.0#n45815:23
pabelangerif my zk data is correct, the nodepool_node_id doesn't exist15:24
pabelangerbut, for some reason, nodepool isn't trying to delete the node15:24
quiquell|roverpabelanger, tobiash, tristanC: The little fix for jinja2 in the gerrit commit message is all good now https://review.openstack.org/#/c/633930/15:26
quiquell|roverI used !unsafe from ansible yaml to skip templating it15:26
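Roughly, that approach amounts to tagging the value when the inventory is written, as in this PyYAML sketch (an illustration of the idea, not the code in 633930):

    import yaml

    class UnsafeText(str):
        # Marker type for values that must never be templated by Ansible.
        pass

    def unsafe_representer(dumper, data):
        # Emit the value with Ansible's !unsafe tag so any jinja2 in, say,
        # a commit message is treated as a literal string.
        return dumper.represent_scalar('!unsafe', str(data))

    yaml.add_representer(UnsafeText, unsafe_representer)

    # Dumps as: message: !unsafe '{{ not_a_template }}'
    print(yaml.dump({'message': UnsafeText('{{ not_a_template }}')}))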
*** gtema has quit IRC15:32
*** gtema has joined #zuul15:35
clarkbpabelanger: see line 46215:38
clarkbpabelanger: that method is specifically for cleaning up servers not in zk, that server appears to be in zk15:39
pabelangerclarkb: based on the info i am getting from fbo / jpena we don't see the server in zk15:39
pabelangerhttps://softwarefactory-project.io/paste/show/1424/15:40
clarkbpabelanger: can you double check that? 0001603003 would be the identifier in zk15:40
pabelangeris also log from nodepool15:40
clarkboh that log says the cloud didn't have the node. Seems like a bug in your cloud then15:41
*** saneax has quit IRC15:41
pabelangerno, server create failed with 500 no more resources15:41
pabelangerbut, couldn't delete because in error state I think15:42
pabelangerhowever15:42
pabelangerI can't see why cleanup worker isn't trying to delete again15:42
pabelangerinteresting15:43
pabelangerzk id=0001603336 but nodepool_node_id='0001603003'15:43
pabelangershould they be the same?15:43
*** ParsectiX has joined #zuul15:44
clarkbpabelanger: nodepool will assign a new id to nodes it didn't know about in zk15:44
clarkbthat log could be from the cleanup method?15:45
pabelangerk, I think I see the issue15:46
pabelangerwith help from fbo15:46
pabelanger/nodepool/nodes/0001603003/lock/780341bd82e9489c991ac7e972d2b5d0__lock__000000003415:46
pabelangerso, data does exist in zk15:47
pabelangerwhich is instance https://softwarefactory-project.io/paste/show/1425/15:47
pabelangerhowever15:47
pabelangerhttps://softwarefactory-project.io/paste/show/1424/ is the error when nodepool tried to create server15:47
pabelangerbut, we've deleted the wrong node id for some reason15:47
pabelangerleading to data being leaked in zk15:48
clarkbI'm not sure its the wrong id15:48
clarkbthe way nodepool accounts for this stuff is it creates a new node to track leaked things15:48
clarkbso the id changes, but that is expected iirc15:48
clarkbit is possible that it doesn't cleanup details for the original id properly though15:50
pabelangerokay, but in that case, I can't see how we'd actually delete the data in zk, based on http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py?h=3.4.0#n49215:50
clarkbpabelanger: ah ya that might be buggy then15:50
clarkbthough what it is checking there is "if I don't have an id for this id in zk then make a new node with a new id to track it"15:51
clarkbwhich should work15:51
clarkbit is giving it a dummy node in zk so that it can track that it needs to delete the leaked instance15:51
clarkbwhere this might be buggy is if it fails to delete the first time through it could create a second dummy node and so on15:52
clarkb(so we'd leak dummy nodes until the real instance deleted and then all the dummy nodes could be cleaned up?)15:52
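To make that concrete, the leak check being discussed has approximately this shape (an approximation with guessed names; the real code is at the provider.py#n458 link above):

    def cleanupLeakedInstances(self):
        # Node ids nodepool currently tracks in ZooKeeper.
        known_ids = set(node.id for node in self._zk.nodeIterator())
        for server in self.listNodes():
            meta = server.get('metadata', {})
            if meta.get('nodepool_provider_name') != self.provider.name:
                continue
            node_id = meta.get('nodepool_node_id')
            if node_id and node_id in known_ids:
                # A ZooKeeper record still points at this instance, so it is
                # not treated as leaked.  That is the case being described:
                # the id was reused for a later boot attempt, so the failed
                # instance is skipped here on every pass.
                continue
            # No matching record: register a placeholder ("dummy") node so
            # the deleted-node worker will remove the instance from the cloud.
            self._markLeaked(server, node_id)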
fungii've observed an odd behavior with a github connection in our zuul deployment. we see trigger events appear in the scheduler's debug log immediately (zuul.GithubGearmanWorker: Github Webhook Received) but then there's a lengthy delay sometimes on the order of an hour before it gets processed (zuul.GithubConnection: Scheduling event from github, zuul.Scheduler: Processing trigger event <GithubTriggerEvent ...>).15:54
fungiany idea what could cause that delay? when it happens i don't see any events hanging out in the events queue according to the status api, and there's nothing identifiable in the debug logs i can interpret as an explanation for the wait either15:54
pabelangerI'm just looking at code now, I know we changed this a little recently15:54
clarkbpabelanger: http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py?h=3.4.0#n247 I think that is the area that changed semi recently15:57
clarkbhttp://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py?h=3.4.0#n255 is where we create the new node to stand in15:57
pabelangerah, right15:58
pabelangerclarkb: so, I don't think we are updating the actual instance properly with that new zk ID15:59
pabelangerwhich leads to the leak15:59
clarkbpabelanger: I agree we don't update the existing instance with the new zk id. I don't know that that causes a leak16:00
pabelangerlet me get some logs on new zk id16:00
clarkb why do you think that is causing the leak? nodepool should delete the error'd instance under the new id and it will go away from the cloud16:00
clarkbit is possible we leak the old node in zk, but that shouldn't affect the leak cloud side16:01
clarkbfungi: what happens after the webhook is received by the scheduler? I wonder if there is anything obvious that could block16:02
pabelangerclarkb: I am confirming we actually do a delete on new zk ID16:02
clarkbfungi: like maybe construction of the trigger event requires info back from github?16:02
clarkbfungi: I think the flow there is github POSTs to zuul-web, zuul-web does a gearman event, zuul scheduler processes gearman event as worker, then I don't know16:05
fungii thought that happened concurrent with the "processing trigger" activity in the scheduler16:05
clarkbfungi: I think the event has to be converted to a trigger first16:06
fungibut yeah, when my current meeting ends i'll see if i can trace the flow of functions between GithubGearmanWorker and GithubConnection16:06
clarkbthe event is the raw github activity and triggers are a bit more filtered down to things zuul can take action on16:06
pabelangerclarkb: https://softwarefactory-project.io/paste/show/1428/16:07
pabelangerwe seem to be creating multiple servers using the same zk id16:07
pabelangerbut you can see bae8dfbb-d165-47db-9e08-fc8111ba36b8 there16:07
pabelangereventually the zk id comes online16:07
pabelangerbut the error instance, also has that nodepool_node_id16:08
pabelangerhttps://softwarefactory-project.io/paste/show/1429/ looks to be the delete16:09
pabelangerbut, didn't delete cloud side16:09
clarkbya so its likely failing to delete in the cloud? andnodepool trying over and over again?16:09
pabelangerno, the delete only happens once16:13
pabelangerI'd like it to keep trying to delete over and over16:13
pabelangerbut, cleanup worker doesn't because nodepool_node_id is in zk16:13
pabelangerbeing used to boot the next server16:13
pabelangerso, I am unsure how to proceed in this case, if we don't properly delete the node, and a new node is booted with same zk id16:14
pabelangertobiash: might be of interest to you, since you helped here in the past16:16
pabelanger^16:16
clarkbpabelanger: so the original node_id is being reused?16:16
clarkbpabelanger: can you show the logs of that?16:16
*** quiquell|rover is now known as quiquell|off16:17
pabelangerclarkb: https://softwarefactory-project.io/paste/show/1428/16:17
pabelanger0001603003 is the id in question16:18
pabelangereach of those attempts, fail with 500 (lack of resources)16:18
pabelangerhttps://softwarefactory-project.io/paste/show/1425/ is actually bae8dfbb-d165-47db-9e08-fc8111ba36b8 attempt16:19
clarkbright ok so the problem is the logical request's id doesn't change but we've already written properties to the host that we check against for cleanup16:19
clarkbso cleanup won't happen until the reused original request id is deleted then at that point we can cleanup the original failures. That does indeed seem to be a bug16:19
pabelangeryah, I am just trying to confirm we have more than 1 leaked (error node) with the same nodepool_node_id16:20
*** manjeets has joined #zuul16:27
*** themroc has quit IRC16:28
pabelangerclarkb: confirmed, found multiple error'd instances with same nodepool_node_id: https://softwarefactory-project.io/paste/show/1430/ 000161536016:47
corvuspabelanger, clarkb: especially since Shrews is out, can you make sure to collect all of this in a storyboard bug report?16:49
pabelangercorvus: yup!16:50
clarkbpabelanger: ya the final thing we'll want to confirm is if things do cleanup once the successfully booted node with that id is deleted16:56
clarkbthat should remove the zk record for it allowing everything else to be properly processed as a leaked node (mostly this is just to confirm observed behavior vs expected behavior)16:56
pabelangerclarkb: agree16:57
*** ParsectiX has quit IRC16:58
*** sdake_ has quit IRC17:01
*** sdake has joined #zuul17:02
fungijust to follow up, current theory on our github trigger processing delay is growing backlog in GithubConnection.event_queue so looking to stuff a statsd report call somewhere for that17:06
fungi(we have some very, very active gh projects associated with our connection)17:06
*** gtema has quit IRC17:06
fungipresumably event_queue.qsize() is what we should report on?17:07
openstackgerritFabien Boucher proposed openstack-infra/zuul master: [WIP] URLTrigger driver time based - artifact change jobs triggering driver  https://review.openstack.org/63556717:15
clarkbfungi: I think so17:16
fungicool, i'm just learning me up on how to add a new stat now17:17
fungior more like reverse-engineering from the existing uses17:17
AJaegerzuul-jobs maintainer, could you review https://review.openstack.org/#/c/635941/ for the updated puppet-forge upload, please?17:27
openstackgerritPaul Belanger proposed openstack-infra/nodepool master: Properly handle TaskManagerStopped exception  https://review.openstack.org/63639317:29
pabelangerclarkb: mordred: corvus: do you mind confirming ^ is the correct approach for handling when task managers die?  In the case of sf.io (rdo-cloud) we end up spamming logs a fair bit when it happens.17:30
pabelangernhicher: fbo: ^17:30
clarkbpabelanger: do you know why the task manager dies?17:33
clarkbwe should be careful we aren't papering over another bug17:33
*** tobiash has quit IRC17:34
*** pwhalen has quit IRC17:34
pabelangerclarkb: yah, haven't figured out why it dies, I don't think we have proper logging enabled. But I do see the new manager being created17:35
*** tobiash has joined #zuul17:35
pabelangerjust the previous requests are using original, and failing17:35
pabelangerlet me get logs17:35
pabelangerclarkb: http://paste.openstack.org/show/744958/17:37
pabelangeris the best we have right now17:37
pabelangerline 23 is when new TaskManager is created17:38
pabelangerand believe line 30 is still using the original, that just died17:38
clarkbnew provider managers are created when your config is changed iirc17:38
pabelangerOh, hmm17:39
clarkbif the exception is due to ^ then your fix is probably appropriate17:39
clarkbsince we can safely short circuit in that case17:39
pabelangerI should be able to confirm if that was the case17:39
*** spsurya has quit IRC17:40
pabelangerclarkb: all managers would be recreated right, not just a specific one?17:41
clarkbpabelanger: would be specific to the providers that had their config change17:41
pabelangerk17:42
clarkbthere is an if statement that checks the old config against the new config and if they differ creates a new manager under the new config iirc17:42
pabelangerack17:42
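A toy, self-contained illustration of that config-diff check (the names only loosely mirror nodepool's; this is not its code):

    class ProviderManager:
        def __init__(self, provider_config):
            self.provider_config = provider_config
            self.stopped = False

        def stop(self):
            # In nodepool this is where the TaskManager thread ends; anything
            # still holding the old manager afterwards can see an exception
            # like TaskManagerStopped.
            self.stopped = True

    managers = {}

    def reconfigure(old_providers, new_providers):
        for name, new_cfg in new_providers.items():
            if old_providers.get(name) == new_cfg:
                continue  # unchanged providers keep their existing manager
            old = managers.pop(name, None)
            if old:
                old.stop()
            managers[name] = ProviderManager(new_cfg)

    reconfigure({'rdo-cloud': {'max-servers': 10}},
                {'rdo-cloud': {'max-servers': 20}})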
*** pwhalen has joined #zuul17:44
*** sdake has quit IRC17:44
*** sdake has joined #zuul17:46
fungigrr, having a hard time working out where to plumb in the statsd configuration/options in the github driver. i'm guessing i could reference it from the scheduler object? presumably by the time GithubConnection.addEvent() and eventDone() are called, self.sched is not None so i could just reference self.sched.statsd directly there?17:50
fungior is there a better way?17:51
clarkbfungi: zuul.cmd.scheduler calls registerConnections which registers the scheduler with the connections. That should happen quite early.17:54
clarkbYou probably still want to guard against self.sched being None but otherwise should be safe to do what you describe17:54
clarkbfungi: and with the guard we just won't send any of that data until the scheduler is registered which should be fine17:56
tobiashfungi: yes, you can just use self.sched (and guard against none)18:02
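Concretely, the guarded call being described would look roughly like this fragment inside GithubConnection (the statsd key and exact signature here are placeholders, not the eventual change):

    def addEvent(self, data, event=None, delivery=None):
        self.event_queue.put((data, event, delivery))
        if self.sched and self.sched.statsd:
            # Only report once the scheduler (and its statsd client) has been
            # registered with this connection; emit the backlog as a gauge so
            # a growing queue shows up in graphite.
            self.sched.statsd.gauge(
                'zuul.connection.%s.event_queue' % self.connection_name,
                self.event_queue.qsize())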
fungiyep, i already had the conditional in there. just running tox against it now before i push it up18:02
fungishould this come with a regression test too?18:02
fungiapparently i now go down the rabbit hole of making sure i have the necessary libraries to build the c extensions for some of zuul's dependencies18:05
clarkbfungi: I don't know how much testing of the statsd events we do. I don't think its much (but that may have changed)18:05
clarkbfungi: bindep should capture them last I tried18:05
clarkbre2 being the big "recent" one18:05
fungioh, neat, i also need a mysql server?18:06
tobiashfungi: yes, for some tests18:06
*** panda is now known as panda|off18:06
fungiand a postgresql server and a zookeeper server18:06
fungiwill it skip those tests if i don't have those servers installed and configured?18:07
tobiashfungi: there is the tools/test-setup.sh that configures all of that stuff18:07
fungiguess i'm about to find out18:07
tobiashfungi: I guess they will fail18:07
tobiashand zookeeper is needed for all tests ;)18:07
tobiashif there is interest, I can push up a docker-compose based test setup for this stuff18:08
clarkbfungi: ya it will fail the db tests. I usually just run with a zookeeper and ignore the db tests unless I'm looking at code in that region18:08
tobiashthat might be useful for folks to just fire up correctly configured mysql,postgres and zookeeper on tmpfs in docker18:08
fungilooks like i'll be running pep8 locally and then pushing this up to run unit tests under ci instead, since i have a backlog of other stuff to get to18:09
pabelangerclarkb: looks like you are right, nodepool.yaml was modified18:09
clarkbtobiash: or even document a regex to skip the mysql and postgres tests18:09
clarkbI always have to figure out that regex again (shame on me for not writing it down)18:09
tobiashI never used such a regex18:10
openstackgerritJeremy Stanley proposed openstack-infra/zuul master: Track GitHub connection event queue length stats  https://review.openstack.org/63640418:10
fungithere we go ^18:10
tobiashfungi: thanks, wanna have this ^ :)18:11
fungistatsd key name there was chosen arbitrarily18:11
fungiplease recommend improvements! ;)18:11
clarkbfungi: you probably want to do github.$connectionname.event_queue18:12
clarkbsince you can potentially have multiple github connections18:12
fungiaha, thanks18:12
fungion a related note, i see where it reports the github api rate limit count remainder and reset epoch timestamp but we seem to be tracking several different rate limits there and the logs don't do anything to differentiate them. any idea what the distinction between them might be?18:14
clarkbjlk: ^18:15
corvusfungi: it might be per-app-install?18:15
fungioh, do we have several of those on the openstack/opendev zuul deployment?18:16
fungiif so, that would make sense18:16
fungiare those treated as different github "connections"?18:16
jlkThere is a per-app rate, and anonymous rate, and non-app API rate18:16
jlkI believe the rate is per installation of your app18:16
corvusfungi: we have only one connection (which maps to one github app), but that connection has many installations of the app.18:17
jlkso if you have one Zuul app, and 30 places have installed it (that would be 30 connections in Zuul config), you'll get a rate per installation18:17
clarkbre event queue lengths the openstack zuul takes a significant amount of time to process events from github. reading logs with fungi it seems that we process events serially in a fifo and those events individually may take some seconds to fully process. Because this is serialized, over time we accumulate a backlog. The change above should help track the size of the queue to see if it does indeed grow a backlog (we18:17
fungiahh, so basically per project (org?) using the app?18:17
clarkbjust don't have enough logging/data to know for sure currently)18:17
jlkoh wait, I think corvus just contradicted me, and he would know better.18:17
tobiashcorvus: is right ')18:17
tobiash;)18:17
jlkI forgot how each installation mapped to Zuul config18:18
tobiashjlk: each connection can have many installations18:18
fungiwondering what additional detail we should include in the debug log (and perhaps in an associated statsd key name pattern) for tracking api rate limits18:18
tobiashotherwise we would not be able to handle that18:18
jlkhttps://developer.github.com/apps/building-github-apps/understanding-rate-limits-for-github-apps/18:19
corvusfungi: there's an installation number, and we know what repo/org it is associated with.  maybe that?18:19
corvus(and we know what installation we're using for any given operation)18:19
tobiashfungi: for tracking api rate limits you probably want to have that per installation (if we're having the installation id)18:19
fungiif the rate limit is per org (or project?) which has installed the app, then i guess we're at least not at risk of one user of the app starving the api rate limit for another?18:19
corvusfungi: that is the idea, yes18:19
jlkyes. It's part of why app based is so compelling18:20
fungithat helps. was trying to wrap my mind around the model there. thanks!18:20
jlkyou get a rate limit per consumer of your app, rather than the old method of a single rate limit shared by everybody sending you web hooks18:20
jlkthat said, I think not every call we make to the API is via an app18:20
clarkbtobiash: have you seen pile up of github driver events leading to slow processing? I'm curious if this is something others have dealt with18:21
jlkhttps://developer.github.com/v3/#rate-limiting18:21
jlkthe additional rate limiting is ^^18:21
jlk5K per hour for things we do as our own user, 60 an hour for anonymous calls18:21
jlkAAAND18:21
jlkhttps://developer.github.com/v3/search/#rate-limit18:21
tobiashclarkb: not that dramatically18:21
tobiashbut our github enterprise has no rate limit set (yet)18:21
jlkwhen we hit search (for finding DependsOn and such), 30 per minute18:22
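The split jlk describes is visible directly in the API's own rate_limit resource, e.g. (the token is a placeholder):

    import requests

    resp = requests.get('https://api.github.com/rate_limit',
                        headers={'Authorization': 'token <redacted>'})
    limits = resp.json()['resources']
    # 'core' is the large per-hour bucket for authenticated calls; 'search'
    # is the much smaller bucket mentioned above.
    for name in ('core', 'search'):
        print(name, limits[name]['remaining'], 'of', limits[name]['limit'])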
jlkclarkb: I haven't looked at the change, but I thought we did work to parallelize GitHub event processing a while back, sending it into a worker queue18:22
jlkor is it CONSUMPTION of the queue is serial?18:22
fungijlk: that's what it seems like, yes (the latter)18:23
jlkI think the work I did was to make it so the hook receiver didn't hang until the event was processed.18:23
jlknod18:23
fungiwe see log entries immediately about the trigger event getting queued, but then a lengthy delay (sometimes hours) before the log entry saying it's being scheduled and processed18:23
*** jpena is now known as jpena|off18:23
jlkgotcha.18:24
fungiso basically hanging out in the connection driver's own event queue (we think)18:24
jlkthat makes sense given what I know18:24
jlkI had /assumed/ that processing the queue would be expedient. I didn't think it would be serial.18:25
jlkWhatever thing processes the queue should grow threading/forking capability18:25
fungiso i suppose burning down that queue is each trigger multiplied by the time it takes for each query that trigger implies (plus the occasional github api socket timeouts)18:26
jlkI wonder if one needs to create buckets per installation, and then serially process each bucket, but allow parallel bucket processing18:26
jlkthat way each bucket is going to FIFO the events w/ some guarantee, but one noisy client shouldn't slow down every client18:27
corvusjlk: i don't know what the problem fungi and clarkb are dealing with is, but we receive a tiny percentage of github events compared to gerrit, and our scheduler has plenty of idle time, so i don't think this is a structural problem.18:28
jlkDoes the system process gerrit events serially, or are there more than one worker working the queue?18:29
jlkalso I think in teh gerrit case all the info is at hand, whereas on the GitHub case I _think_ we have to go query the API for more info.18:29
corvusjlk: serially -- both gerrit and github accept events from the remote system, do whatever querying needs to be done (yes, it's more in github than gerrit, but sometimes in gerrit too), then hand off the fully populated event to the scheduler, which is itself single-threaded.18:30
jlkDefine 'idle time' for the scheduler. Could it be working just one task and is otherwise idle (and that one task is the serial burn down of the queue), or is it completely idle, no tasks waiting (which would seem counter to what fungi and clarkb are seeing)18:30
fungicorvus: current problem observed is that when an action is taken on a github project which should trigger a job, it takes our zuul deployment hours (at peak) to enqueue the builds18:31
*** ParsectiX has joined #zuul18:33
corvusjlk: the scheduler's event queue (which is where the gerrit/github events are deposited), is often empty.  it's possible the github driver is spending too much time performing queries to prepare the events for the scheduler.  but those events still need to be delivered in order.  it's possible that parallelizing the queries there, and then re-serializing before sending them to the scheduler could be more18:33
corvusefficient -- maybe that's what you were suggesting.  but i have my doubts as to whether we're seeing a volume where that's necessary.18:33
fungireviewing the scheduler's debug log and correlating to the gh app's "advanced" log entries, the trigger event is seen by the githubgearmanworker immediately but then there are no subsequent log entries for it until the githubconnection schedules it much, much later18:33
corvusfungi: have some log entries you can paste?18:33
fungisure, just a sec i'll get one up18:34
*** panda|off has quit IRC18:34
fungihere are the log entries containing the triggering event's github uuid: http://paste.openstack.org/show/744965/18:35
fungiin that case there was a 3-hour delay between receipt and scheduling18:36
tobiashfungi: do you have a stream from zuul.GithubConnection at hand?18:36
fungihow long of one would help?18:37
tobiashfungi: maybe that could help to identify where the bottleneck is18:37
tobiashfungi: maybe starting with that event and a few minutes length18:37
fungiwe logged 14764 entries for zuul.GithubConnection in the past ~12 hours18:37
fungisure, i can get that18:37
*** gcerami_ has joined #zuul18:37
tobiashfungi: maybe just the part between the first two lines of your paste18:38
clarkbjlk ya my current theory is each individual event is costly enough that we accumulate a backlog18:38
clarkbwe have little insight via logging of that though hence fungi's change18:38
tobiashdo you have numbers about the event inflow?18:39
clarkbwe can probably wc the event received log messages to get that18:39
fungi4657 log entries for zuul.GithubGearmanWorker and zuul.GithubConnection between 15:32:45,739 and 18:33:32,24318:40
fungiso i don't think i can fit them in a paste18:40
clarkbI've got to prep for the infra meeting now though18:40
fungii can certainly do the few minutes following 15:32:45,739 though as a start18:40
tobiashfungi: oops, I overlooked the hours part of the timestamp ;)18:41
openstackgerritMerged openstack-infra/zuul-jobs master: Add intermediate registry push/pull roles  https://review.openstack.org/63482918:41
fungiyeah, three hours between trigger and scheduling in that instance, which is... why i'm trying to look into it18:41
tobiashso that was queued for three hours?18:42
tobiashthat is definitely not normal18:42
fungii think it was in the githubconnection object's event_queue for three hours, yes18:42
funginot the zuul scheduler's event queue18:42
tobiashyes, pretty sure18:42
fungibecause during that time the scheduler just about always showed an event queue of 0 for that tenant18:43
*** sdake has quit IRC18:44
mordredfungi: so is that 4657 log entries 4657 events from github, or just log lines?18:44
jlkhrm, the first log entry is when we are in handle_payload18:45
fungidebug level or higher log lines for GithubConnection and GithubGearmanWorker18:45
jlkwhich then sends it off to self.connection.addEvent18:45
clarkbjlk: ya it gets converted into an internal event there and added to the connection.event_queue18:45
jlkyeah18:45
clarkbjlk: and then it seems to just sit there for hours. Which is why I think the queue is long and we aren't flushing it as quickly as we get new events18:45
jlkProbably need more logging/metrics in _handleEvent()18:46
fungihere's a sample of the GithubConnection and GithubGearmanWorker log entries starting with when 717fd7b0-2edb-11e9-89ef-7895d9174016 was received and stretching for roughly 8 minutes http://paste.openstack.org/show/74496718:47
jlkhrm, no real way to tell what event caused line 518:49
mordredfeels like it might be nice to log the webhook id with more things to make it easier to trace the actions of an individual webhook18:49
jlkyup18:49
jlkI was just thinking the same18:49
fungiwould filtering out the repo permissions and rate limit entries help?18:49
jlkmaybe, but I think we lack enough context around log events to truly trace the event18:50
jlkIt's likely we need to make the webhook ID (headers.get('x-github-delivery')) part of the event object that carries through18:51
*** sdake has joined #zuul18:51
jlkoh we do18:51
jlkwe add it in addEvent18:51
jlkWe just don't log it again until _after_ much of the processing has been done18:52
jlkWe have the data at https://github.com/openstack-infra/zuul/blob/master/zuul/driver/github/githubconnection.py#L19018:52
corvuswe received 278 events from github during the 17:00 hour.  that gives us a budget of about 13 seconds per event.18:53
jlkWe should log/emit a metric at that point before the work begins to process it18:53
pabelangerjust happen to be looking at openstack-discuss, and looks like no official room for zuul at PTG, is that right? http://lists.openstack.org/pipermail/openstack-discuss/2019-February/002618.html18:56
corvuspabelanger: that's right18:57
pabelangerack, thanks!18:57
*** krasmussen has joined #zuul18:58
corvusjlk: agreed, we know when it's finished, but not when it starts18:58
jlkThat whole function and the method calls are ripe for metrics18:59
corvus2019-02-12 15:34:22,636 DEBUG zuul.GithubConnection: Scheduling event from github: <GithubTriggerEvent 0x7f6a05daf9e8 pull_request ansible/ansible refs/pull/50854/head reopened github.com/ansible/ansible 50854,b8048fe130722c11a42eb817aae08f628fc3af53 delivery: 5ea967b0-2ecb-11e9-9d2b-4586291aacb1>18:59
corvus2019-02-12 15:37:05,593 DEBUG zuul.GithubConnection: Got PR on project ansible for sha b8048fe130722c11a42eb817aae08f628fc3af5318:59
corvusfor instance, what happened in those 3 minutes?18:59
corvusthe first line is the end of processing of one event, right?  the next line is 3 minutes later... was it idle, or was it doing something for b804 that whole time19:00
clarkbor perhaps dealing with intermediate events that don't get processed to the point of logging?19:01
corvusalso, why is it the same sha in both?  probably we started a new event for the same PR (happens a lot).  but would be good to be certain.19:01
corvusfungi: are you working on sprinkling more log events in there?19:02
jlka re-open would be a new event on the same sha19:02
jlkbut we actually have no idea if the second line is in any way related to the event on the first line19:02
jlkthe second line could have been from an event BEFORE the first line19:02
*** tflink_ has joined #zuul19:03
jlkcorvus: that 'Got PR on' line is in a couple of different functions, and is used by a variety of events19:04
jlkit's just saying "I asked the GitHub API for the details on a pull request object"19:04
*** mugsie has quit IRC19:04
*** tflink has quit IRC19:04
*** fdegir has quit IRC19:04
jlkwell, I take that back19:05
*** mugsie has joined #zuul19:05
jlkthat PARTICULAR line is called via _event_status(), it got a status event on that PR for that hash19:05
*** fdegir has joined #zuul19:05
jlkwhich could be "pending" status, or "success" or "failure"19:05
jlkwell no, we return early in pending.19:06
jlkanyway a lot of this logging stuff was written before I really had any idea what we'd need to trace :D19:06
fungicorvus: i was going to try to understand enough github to help differentiate the various rate limit loglines. can try to improve other logging as well if i get enough understanding of what it means19:09
*** tflink_ is now known as tflink19:10
jlkfungi: basically at the start of https://github.com/openstack-infra/zuul/blob/6d6c69f93e9755b3b812c85ffceb1b830bd75d6f/zuul/driver/github/githubconnection.py#L189 we get a 'delivery' item, which is the webhook ID. We need to log reception of it there, to match the line when we added it to the queue. From there we need to log the processing of it.  Specifically starting to handle the event at https://github.com/openstack-infra/zuul/blob/6d6c69f93e9755b3b812c85ffceb1b830bd75d6f/zuul/driver/github/githubconnection.py#L21919:12
jlkThere's some amount of processing in the method() call that causes outbound requests to GitHub's API. Once method() is done, there are further calls that are made in the "if event:" block. All the logging within this class should be called with the 'delivery' string.19:14
*** gcerami_ is now known as panda19:14
corvusjlk, fungi: yeah, i think i'll take a stab at reworking that to use a loggingadaptor real quick19:17
corvusso we can easily get the string everywhere19:17
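The core of an adapter like that is small; a sketch of the idea (the actual change is in review 636429):

    import logging

    class GithubEventLogAdapter(logging.LoggerAdapter):
        def process(self, msg, kwargs):
            # Prefix every message with the webhook delivery id so one grep
            # follows a single event end to end.
            return '[delivery: %s] %s' % (self.extra['delivery'], msg), kwargs

    log = GithubEventLogAdapter(
        logging.getLogger('zuul.GithubConnection'),
        {'delivery': '717fd7b0-2edb-11e9-89ef-7895d9174016'})
    log.debug('Scheduling event from github')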
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Add dockerized test setup  https://review.openstack.org/63642419:18
tobiashcorvus, mordred, fungi: is there interest in this? ^19:18
tobiashI find it quite useful for setting up the test environment on a dev machine19:19
*** dmsimard has quit IRC19:22
fungigetting familiar with using docker locally is still on my to do list. presumably useful but i don't yet have enough context to know whether i'll find it so19:24
*** jamesmcarthur has joined #zuul19:26
tobiashfungi: in that case, just fire up that script, wait for completion and then you can run tox19:26
jamesmcarthurhi all - not sure who to address this towards, but we looked into some solutions for anonymous analytics for the zuul.org site.19:28
jamesmcarthurOur fantastic intern put together some details: https://docs.google.com/document/d/1xli_xW5Fp5xI-FYtEzJV2ijn6cxUfEczk0pVdO69I7w/edit?usp=sharing19:28
jamesmcarthurcorvus: fungi: if this seems like it would work for zuulci.org, please let me know19:28
*** jamesmcarthur has quit IRC19:30
*** aprice has joined #zuul19:30
dkehnclarkb: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n9, when setting a parent's  job.abstract what is the implication of that?19:32
*** dmsimard has joined #zuul19:32
clarkbdkehn: an abstract job cannot be run. It is there only to be part of the inheritance tree19:32
clarkbdkehn: useful if you have some incomplete set of job config that needs to be inherited to fix up19:32
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Improve event logging in githubconnection  https://review.openstack.org/63642919:33
corvustobiash, jlk, fungi: ^19:34
* jlk looks19:34
corvusfrom a unit test run, that looks like: http://paste.openstack.org/show/744972/19:36
jlkI think it's a good start, but I would also sprinkle in logging into the _whatever() methods that are called by method(). Could be a follow up change.19:37
corvusjlk: good idea, and i just saw a bug, so new ps incoming19:38
jlkneat19:38
tobiashcorvus: commented19:39
*** sdake has quit IRC19:42
corvusyeah, that's the bug :)19:47
*** jamesmcarthur has joined #zuul19:59
*** ParsectiX has quit IRC20:12
fungithe delivery uuid is indeed most helpful there20:13
fungithat'll make it easy to track with one grep20:13
tobiashcorvus, jlk: given that the event processing really is too slow for the single threaded event queue, what do you think about doing that multi-threaded while maintaining strictly ordered forwarding to the scheduler loop?20:13
corvustobiash: you are way ahead of where i am in understanding this.20:14
corvusi would really like to avoid dramatic redesigns until we have more data20:15
tobiashsorry, I'm thinking if it's possible to do the github event queue processing multi threaded and still maintain the correct order in which these events are forwarded into the scheduler run loop20:15
corvustobiash: yes, i believe jlk and i also said as much earlier20:15
tobiashcorvus: sure, I'm just already thinking about it ;)20:16
corvustobiash: but let's not even think about doing that until we understand the problem20:16
SpamapSpoppycock, run off into the weeds and write code, I say, data is for the unbold.20:17
jlkhehe20:17
tobiashlooks like I missed parts of the backlog, sorry for the noise ;)20:17
corvusthere are so many ways that approach could cause us problems too, especially if we don't know what's actually causing the issue20:17
jlkI'm with corvus on this though. We're making guesses w/out data. I'd like to see logging output from the changes we're making to really trace through what's happening20:17
jlkWe could throw more threads at the problem but if the real issue is that our API calls are really slow for some reason we'd want to fix that as well / instead.20:17
corvusas soon as i finish this sandwich, i will finish that patch20:18
clarkbI should go find lunch too, then return to review the data collecting change20:18
jlklunch sounds good20:19
tobiashmaybe there is also potential to reduce the number of api requests20:20
clarkbjlk: it is that time of day in this part of the world. Did your precipitation switch to rain? we never got any snow here but pdx airport got ~5". Now we have inches of rain20:20
jlkyeah we've been getting rained on since last night.20:20
jlkWe had likely over a foot of accumulated snow in the yard/roofs/etc.20:21
jlkI'm very much done w/ shoveling my patio/driveway/sidewalks20:21
mordredjlk: a foot of snow is not what your neck of the woods is designed for20:21
clarkbjlk: it was crazy how different being a few miles west changed everything here20:21
clarkbjlk: on saturday we had a bunch of ice then everything went to melty rain and no snow :(20:22
jlkmordred: no, indeed not. That's why the kids have been on snow days for... a while now20:22
jlkI'm also very much ready for them to go back.20:22
jlkI'm honestly surprised I didn't lose power. Most our power is above ground lines, and all the trees had HEAVY snow on them. I could hear trees breaking all over the neighborhood. None took out our lines though. Seattle and the east side are not as lucky.20:23
mordredjlk: I saw photos that looked very inappropriate for seattle20:27
*** edmondsw has joined #zuul20:27
mordredthey looked more like photos of the side of mt. baker20:28
jlkHere's a set of photos from yesterday, before ANOTHER 5~ inches fell. (or rather WHILE it was falling) https://www.instagram.com/p/BtwjHNhBlw7/20:28
pabelangershoveling wet snow stinks, but makes great snowballs20:28
*** jamesmcarthur has quit IRC20:31
SpamapSTrust the Canadian about snow and ice.20:32
SpamapSThey know.20:32
fungiyeah, i don't want to just assume we know why github event processing is slow on our deployment. i'd rather collect data first and see where it leads us. i was mostly trying to infer enough from the data we have to know where strategic gathering of additional data is likely to pay off in subsequent diagnosis20:34
fungiwhich has resulted in needing to learn a bit more about github than i anticipated, but i suppose i shouldn't be surprised by it20:35
tobiashthis may not be the github event processing, I just looked at your paste and every event from  'Updating <Change...' until ' Scheduling event from github' is below 8s as fas as I can see20:35
tobiashI'm stumbling across this: http://paste.openstack.org/show/744974/20:36
tobiashif the queue length is > 0 I'd expect that after 'scheduling event from github' it immediately resumes with the next 'updating <change'20:37
tobiashbut there is a 6min gap in between20:37
clarkbwe don't log the updating change until after the requests though right?20:39
clarkbso it could be blocking somewhere between?20:39
tobiashhrm, I don't see much code in between that could be blocking20:42
tobiashjust looks like it is blocked here: https://git.zuul-ci.org/cgit/zuul/tree/zuul/driver/github/githubconnection.py#n19020:44
tobiashbut I also think we need the logging improvement for a deeper analysis20:45
clarkbya20:45
corvusi'm about to push up a new ps20:45
corvusmake sure i have logging there :)20:45
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Improve event logging in githubconnection  https://review.openstack.org/63642920:47
corvustobiash, jlk, fungi, clarkb: ^ sorry it's a more significant change, but it means less passing around of data and loggers.20:47
corvusoh cool, the diff on that in gertty at least looks really good20:48
corvusgerrit too.  cool.  should be easy to follow and review then.20:48
*** jamesmcarthur has joined #zuul20:52
*** jamesmcarthur has quit IRC20:54
tobiashcorvus: found a bug ^21:00
corvustobiash: same thing a few lines down as well21:01
tobiashyah21:01
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Improve event logging in githubconnection  https://review.openstack.org/63642921:01
corvusi'm running tests.unit.test_github_driver.TestGithubDriver.test_pull_event locally, but that's it21:02
corvusso i'm catching the big errors, but that doesn't test everything21:02
*** fdegir has quit IRC21:06
*** mugsie has quit IRC21:06
*** mugsie has joined #zuul21:06
tobiashcorvus: found some more, do you mind if I push an update?21:15
corvustobiash: go for it21:15
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Improve event logging in githubconnection  https://review.openstack.org/63642921:15
tobiashran the complete test suite locally ^21:15
corvusclever21:15
corvustobiash: those all look good, thx21:16
tobiashyw :)21:16
tobiashcorvus: so our +1 add up to +2?21:17
corvustobiash: i think so.  :)  let's +3 it unless jlk wants in on this...?21:18
jlkI'm running a meeting, don't wait on me21:18
tobiashk21:18
tobiashthat logging also should reveal if there are blocking problems on that queue21:19
corvustobiash, clarkb, fungi, jlk: i ran the sigusr debug handler, which means i got two stack traces at random points in time.  for the githubconnection event handler thread, they both had the same stack traces:21:21
corvus  File "/usr/local/lib/python3.5/dist-packages/zuul/driver/github/githubconnection.py", line 1133, in getPullBySha21:21
corvus    for pr in repo.pull_requests(state='open'):21:21
corvuswhich tells me that, statistically speaking, we probably spend a lot of time in that bit of code.21:21
corvusin particular, i can think of one repo that we monitor where "iterate over every open pull request" may not be the best idea.21:21
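For anyone unfamiliar with that debug handler, it amounts to dumping every thread's current frame on a signal; two samples taken at different times give the crude statistical profile corvus is using here (simplified sketch, not zuul's exact handler):

    import signal
    import sys
    import traceback

    def stack_dump_handler(signum, frame):
        # Print a stack trace for every running thread.
        for thread_id, stack in sys._current_frames().items():
            print('Thread: %s' % thread_id)
            traceback.print_stack(stack)

    signal.signal(signal.SIGUSR2, stack_dump_handler)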
clarkbwe must list all open pull requests too? I wonder if that is expensive for ansible21:22
clarkbya21:22
tobiashhrm, that sounds not very optimal21:22
tobiashhrm, that code path is used on a status event21:23
tobiashwe should check if there is really no pr number in the status event that forces us to do that21:24
tobiashbut I guess that explains what you observe21:25
mordredoy. yeah21:25
clarkbthe unlabel event I have pulled up has a pull_request dict21:26
*** sdake has joined #zuul21:26
tobiashhrm, the docs tell me that there is really no reference to the pr: https://developer.github.com/v3/activity/events/types/#statusevent21:27
corvuscan we do a search instead of iterating?21:27
clarkboh I see it's a commit being updated and we try to reference that back to a PR21:28
jlkI'll pay attention here in a moment.21:28
tobiashchecking the api docs21:28
tobiashI think a search could work: https://help.github.com/articles/searching-issues-and-pull-requests/#search-by-commit-sha21:30
tobiashtogether with https://developer.github.com/v3/search/#search-issues-and-pull-requests21:30
corvusi'm looking far back in the logs today when things were less busy.  8a98cf82-2e8f-11e9-921a-f68fa09c6c2a was received by the gearman worker at 6:29:25 and was delivered to the scheduler at 6:30:49.  so it took 84 seconds.21:31
jlkokay trying to get caught up on backscroll21:32
corvus(assuming there was no event processor backlog, which i am assuming because there was no activity for the next 4 minutes after the delivery to the scheduler)21:32
jlkwait21:33
SpamapSoh wow, if you have debug logging on the scheduler, zuul-git makes a Read Refs log entry that is basically "everything that has ever landed in any of these repos"21:33
jlkwe're not iterating in the way you think21:33
corvusjlk: whew21:33
jlkAt least I don't think. let met get to that line of code21:33
tobiashcorvus: search by sha works: curl https://api.github.com/search/issues\?q\=6891aacd27e28676bab8ec3e957cde26c1616e1c21:33
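A Python equivalent of that curl lookup might look like this (illustrative; authentication, pagination and rate-limit handling omitted):

    import requests

    def find_prs_for_sha(sha, repo):
        # Ask the search API which PRs reference this head sha.
        resp = requests.get('https://api.github.com/search/issues',
                            params={'q': '%s type:pr repo:%s' % (sha, repo)})
        resp.raise_for_status()
        # More than one PR can share a head sha, so return them all and let
        # the caller decide what to do with an ambiguous result.
        return [item['number'] for item in resp.json().get('items', [])]

    print(find_prs_for_sha('6891aacd27e28676bab8ec3e957cde26c1616e1c',
                           'example/repo'))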
SpamapS$ kubectl logs zuul-scheduler-0|grep 'Read refs'|wc21:33
SpamapS      8   88944 320090221:33
SpamapSall of the refs/changes/foo's and every single commit hash21:34
corvusSpamapS: i think you can filter out that logging with a logging config file21:34
SpamapScorvus: I can, I just wonder if that's a necessary level of logging, even at debug. :)21:34
jlkSo it's one API call to get all the pull requests21:34
jlkthen WITHIN zuul code we iterate over what we got as a result in that API call21:34
corvusSpamapS: it's a good question.  i've certainly used it in the past, but we do not currently have it enabled in opestack-infra21:35
jlkit's python loop iteration, not a new API call for each PR21:35
SpamapSI find the debug logs to be quite useful. I dump them all into elasticsearch and filter them out usually, but it's nice to have them there.21:35
jlkSo an API call to get all the open pull requests of a repo vs an API call to search for a PR w/ a hash is likely to be roughly the same21:35
tobiashjlk: well, that probably does pagination21:35
mordredjlk: it might be doing pagination calls on the backend in github3.py though21:35
jlktobiash: that's a fair point.21:35
corvusSpamapS: i could see removing it from the "default" debug logging config, and next time i need it, i can make a manual logging config21:35
mordredespecially since we're talking about the ansible/ansible repo here :)21:36
tobiashjlk: so on a busy project you might get much data while looping21:36
jlklet me dig further21:36
jlkBecause we could tell it to get more at once21:36
SpamapSIs there a trace level?21:36
corvusapparently there are 1713 open prs21:36
corvusSpamapS: not in python21:36
SpamapSshame21:36
SpamapSanyway, I think I found a bug in github that is unrelated to what you all are looking at21:37
jlkhttps://github.com/sigmavirus24/github3.py/blob/master/src/github3/repos/repo.py#L2355-L241221:37
SpamapSbut it may be all zuul just not working well with github....21:37
corvusSpamapS: yeah.  openstack was so disappointed they made one, but it turned out to be a bad idea because none of the tooling deals with it right, so i don't think we should follow suit.  but i'd totally use it if it really existed.21:37
SpamapSI'd like to have the "require gate status check" branch protection checked on my repos, so only Zuul can merge things, but the status report comes *after* the merge.21:38
tobiashjlk: well, we only want that single matching pull request so I think we really want to change this to a search21:38
jlkSearch is a different API limit21:38
jlka much shorter one21:38
tobiashhow much?21:39
jlk30 per minute21:39
tobiashhrm21:39
SpamapSoh wait, no, the driver actually does set statuses before calling merge21:40
tobiashbut I guess we can live with that limit given that it takes 2-10s to process one event21:40
jlktobiash: On a busy zuul w/ lots of github clients you could possibly overrun that limit quickly21:41
tobiashSpamapS: we're using that since ages and I got no complaints about this so far, is this ghe?21:41
SpamapStobiash: no, github.com21:41
SpamapStobiash: I wonder if it's just async21:41
SpamapS'bit of a race condition21:41
tobiashjlk: with the rate I saw in the logs zuul is not able to process more than 30 events per minute21:41
SpamapSLike, need to confirm the status if you get a 40521:42
tobiashSpamapS: maybe, I haven't connected my zuul to github.com21:42
SpamapSI've used the setting with GHE in the past too, at GoDaddy.21:42
SpamapSso I was surprised when my change failed to merge.21:42
* SpamapS tries again21:43
tobiashSpamapS: then async on github.com seems likely21:43
tobiashjlk: granted, there are other places in zuul that perform a search too21:43
jlkoh I know why this code looks familiar21:44
jlkand why I hate this part of GitHub21:44
jlkYou can have multiple PRs open with the same head SHA.21:44
jlkit's unlikely, but possible.21:44
corvusyeah, if it happens in zuul, we just punt.21:44
jlkand since there's no way to match the event with the specific PR it came from, we can't just pick the first and continue21:45
jlkwhich is why we dump them all21:45
jlkwe DO punt if we find more than 121:45
SpamapSmebbe some day github will adopt Gerrit's model and deprecate "pull" requests. :-P21:45
* SpamapS can dream21:45
jlkSpamapS: I don't think that would improve this21:45
jlksee what corvus just said21:45
corvusi think the new logging will tell us with more certainty how long the 'getpullbysha' method takes.  i'm not sure it will tell us how many underlying requests we made...21:45
jlkcorvus: yeah. We might be able to get some logging from github3.py for that21:46
jlkanyway...21:46
jlkI _think_ this problem set of code could go away if we fully switched over to the checks API21:46
tobiashcorvus: but the github3.py logging will tell you21:46
jlkexcept, we still want to be able to trigger stuff on events outside of zuul events, so maybe not21:46
corvusjlk: hrm, yeah, i guess it probably can't go away until nothing uses status anymore...21:47
jlksadly.21:48
clarkbout of curiousity what do the status events give us?21:48
corvushowever, the fewer status changes, the less this code is hit.  so any movement in that direction is an improvement.21:48
tobiashso what should we do, perform a search or try to make the iteration faster?21:48
jlkand Ansible has a LOT of status users.21:48
tobiashclarkb: we need this to trigger the gate21:48
clarkbmost of the trigger stuff is oriented around PR changes not commit changes right?21:48
jlkclarkb: they're how we currently set a head of a PR as passing CI or not21:48
tobiashin case review and merge label are already there, the status event from the check pipeline will trigger the gate21:48
jlkclarkb: and how we learn if another CI system has passed the change or not21:49
*** sdake has quit IRC21:50
jlktobiash: We should get some visibility into whether or not this bit of code is a significant slow down, and if so, how much faster a search would be, or if there are ways we can improve the API calls github3.py makes21:50
* clarkb reviews the new logging change since that seems like a good next step21:50
corvusjlk: ++  clarkb so i think openstack-infra should restart the scheduler with the logging change, and probably also increase the logging level for github3.py so we can see those requests.21:51
corvusthat latter will be a logging config change on our scheduler... and should be temporary because boy is it going to be big.21:51
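The quick, temporary version of that change could be as small as bumping two loggers; the names 'github3' and 'urllib3' are assumptions about which loggers actually carry the per-request output.

    import logging

    # Temporary and very noisy: surface every HTTP request the scheduler makes.
    logging.getLogger('github3').setLevel(logging.DEBUG)
    logging.getLogger('urllib3').setLevel(logging.DEBUG)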
clarkblooks like everyone has already reviewed it :)21:51
tobiashcorvus: we have 71M of github3.py logging in three days21:53
jlkhrm.21:54
jlkI'm curious21:55
tobiashwhile having 1G of zuul debug logs in 10 hours...21:55
jlkGithubEventProcessor() class uses the GithubEventLogAdapter, which logs the 'delivery' key.21:55
* corvus stands by to -2 his own patch....21:55
jlkoh wait, okay. I see we hand it a 'delivery' thing in __init__21:56
* corvus stands down21:56
jlkI was worried that run() wouldn't have the 'delivery'.21:56
jlkso now we'll have one of these running for EACH event that comes in?21:56
corvusyeah.  it should always be there.  it may be None when we shut down.21:56
corvusjlk: yes, but it's not a thread.  it's really just a data structure.21:57
jlkYeah, I'm just thinking we could really move things along if "GithubEventProcessor(self, data).run()" was an async fire and forget call21:58
corvus(however, if we did decide to have some parallelism in the future, you could hand these things to, say, a pool of workers)21:58
jlkGithubEventConnector uses threading, how many threads does it start?21:58
clarkbjlk: one aiui21:58
jlklike how quickly will it go through its run() loop?21:59
corvuscertainly O(1).21:59
corvusmight be 2.21:59
clarkbjlk one event at a time22:00
jlkI wonder if we could relieve some back pressure by spawning threads for each event22:00
clarkbcorvus: where does kwargs['extra']['delivery'] make its way into that method?22:00
jlkbut... threading. Now you have O(N) problems.22:00
jlkclarkb: I believe it comes from the 'data' sent in on line 45822:01
corvusjlk: yeah if we did have to parallelize, i'd want to keep a tight lid on the number of threads here22:01
jlkWe'll just re-write it all in Go, use Goroutines, and everything will be fine, right?22:01
clarkboh I see, ya it's reading it directly from the event data22:02
openstackgerritMerged openstack-infra/zuul master: Improve event logging in githubconnection  https://review.openstack.org/63642922:02
corvus(because, right now, we have a running system with a (presumed) backlog.  which is better than a non-running system with 2000 stuck threads)22:02
jlkif we were to make threads there, we could totally get out of sync too, where some threads would finish faster than others and we'd deliver events out of order.22:02
clarkbthe reason it isn't already parallelized is that for state correctness we want to update zuul's view of the world serially right?22:02
clarkbso ya we could churn through events quicker, but we'd have to carefully lock around updates to the internal state model22:02
corvuscorrect22:03
jlkmy brain just went all melty22:03
corvusjust another day in #zuul22:03
jlkthinking about creating an ordered list of events in a data store, allowing parallel processing to get the events processed up to the point of adding them into the scheduler queue, and then going back to serial to dump them INTO the queue22:03
jlker into the Scheduler queue22:03
jlkI can see it in my mind, but no way I could interpret that into code22:04
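As a napkin-to-Python translation only (none of these names exist in Zuul), the idea is roughly: let a small bounded pool do the expensive per-event pre-processing, then drain the results in arrival order into the scheduler queue.

    import concurrent.futures

    def run_pipeline(events, preprocess, scheduler_queue, workers=4):
        # 'preprocess' and 'scheduler_queue' are hypothetical stand-ins for
        # the per-event work and the scheduler's trigger event queue.
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(preprocess, event) for event in events]
            # Waiting on the futures in submission order re-serializes the
            # results, so events reach the scheduler exactly as they arrived
            # even though they were pre-processed concurrently.
            for future in futures:
                scheduler_queue.put(future.result())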
tobiashjlk: getchange has side effects...22:04
jlkyeh I haven't even thought through the side effects22:04
corvusjlk: it's so easy in napkin-language22:04
jlknone of this was written with thread safety in mind22:05
corvuswell, that's not quite true :)22:05
corvustobiash: in general, getchange's side effects should be okay22:05
jlkahem. None of what _I_ wrote was written with thread safety in mind...22:05
clarkbis the extra kwarg something that python logging inserts?22:06
clarkbI'm assuming so if this all works in tests22:06
tobiashcorvus: I guess you're right22:07
corvuswe allow a certain amount of mutation of data on the Ref objects.  they're generally expected to be as up to date as possible and can change without locking.22:07
corvusclarkb: yes, and the adapter is copied from the executor22:07
corvus(i'd like to use them in more and more places, they're really cool)22:08
clarkbya its a neat way to apply a mutation to consistent types of log data22:08
clarkbit just has a slightly magic interface with kwargs mutations22:08
corvusclarkb: it's also extra metadata on the log entry itself, so it's queryable via journald, etc.22:08
clarkboh that's an extra neat feature22:09
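A toy version of that adapter pattern, to make the 'extra' mechanics concrete; the class name and message formatting here are illustrative, not Zuul's actual GithubEventLogAdapter.

    import logging

    class DeliveryLogAdapter(logging.LoggerAdapter):
        """Prefix messages with the delivery id and attach it as metadata."""

        def process(self, msg, kwargs):
            delivery = self.extra.get('delivery')
            # The 'extra' kwarg becomes attributes on the LogRecord, which is
            # what makes the field queryable in journald / elasticsearch.
            kwargs.setdefault('extra', {})['delivery'] = delivery
            return '[delivery: %s] %s' % (delivery, msg), kwargs

    log = DeliveryLogAdapter(
        logging.getLogger('zuul.GithubEventProcessor'), {'delivery': 'abc123'})
    log.debug('Starting event processing')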
jlkOh I remember something else. We can't even ask GitHub to add a reference to the PR in that webhook, because the status is SET on a hash ( https://developer.github.com/v3/repos/statuses/#create-a-status ) it's not set on any specific pull request.22:23
clarkbya this seems to be a mismatch between pull request workflows and commit based api22:23
clarkbso you have to jump through hoops to match them up in places22:23
corvusgiven that, i wonder if an endpoint to return all the prs for a hash wouldn't be a terrible idea?22:24
corvusespecially one that counted against the normal api limit, not the search limit :)22:24
jlkI'm sure you could cook something like that up w/ GraphQL22:27
jlkhowever. GraphQL.22:27
*** jamesmcarthur has joined #zuul22:27
corvustristanC: i've restarted zuul-web with the buildsets api endpoint patch, and the buildsets tab on http://logs.openstack.org/41/630041/5/check/zuul-build-dashboard/2290fd5/npm/html/ is behaving strangely.  it seems to return a small set of buildsets at random.22:29
jlkDigging through _iter() some more. The repo.pull_requests() API call defaults to asking for "-1" number of pulls, which is "all". That -1 is passed through to _iter(), which goes to GitHubIterator, which sets a page limit at 100 in that case.22:32
jlkso, without direction it's going to fetch 100 at a time and page through them until done. We could... fetch more? Ask for 500?22:32
corvusso that's 18 requests for ansible by default22:33
jlkA good candidate for why that takes a while.22:33
jlkI'm looking to see if there is an upper bound for how many we can request at once22:34
jlk"For some resources, you can also set a custom page size up to 100 with the ?per_page parameter"22:34
jlkso... 100 :/22:34
corvusmaybe we could do an lru cache of sha->pr mapping22:35
jlkHonestly, it wouldn't be the worst idea ever to just work 100 at a time to find ONE that matches and just end there.22:35
jlkit's just perpetuating the oddity22:36
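The "stop at the first match" variant is small because github3.py's iterator already pages lazily; something along these lines (illustrative only, and note it gives up the existing "punt if more than one PR matches" safety check):

    def find_pr_for_sha(repo, sha):
        # repo.pull_requests() fetches 100 PRs per request and only asks for
        # the next page when the loop needs it, so returning early means the
        # remaining pages are never requested.
        for pr in repo.pull_requests(state='open'):
            if pr.head.sha == sha:
                return pr
        return None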
clarkbcorvus: even at a few thousand entries that should be a relatively small amount of memory22:36
clarkbcorvus: that seems like a not bad idea to me22:36
jlkWould need some way to expire things from the cache as they close. Handle a PR close event and remove the PR from the cache22:39
jlkotherwise you wouldn't recover from "multiple PRs w/ this hash" problem fast enough.22:39
corvusgood point22:40
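A stdlib-only sketch of that cache, with the close-event invalidation jlk describes; real code would also need locking and careful sizing, and whether multiple PR numbers per sha get cached or punted on is left open here.

    import collections

    class ShaPrCache:
        """LRU-ish map from head sha to the PR numbers seen for it."""

        def __init__(self, maxsize=4096):
            self.maxsize = maxsize
            self._data = collections.OrderedDict()

        def get(self, sha):
            if sha not in self._data:
                return None
            self._data.move_to_end(sha)              # mark as recently used
            return self._data[sha]

        def put(self, sha, pr_numbers):
            self._data[sha] = list(pr_numbers)
            self._data.move_to_end(sha)
            while len(self._data) > self.maxsize:
                self._data.popitem(last=False)        # drop least recently used

        def invalidate_pr(self, pr_number):
            # Hook this to the 'pull_request closed' webhook so stale entries
            # (including the multiple-PRs-per-sha case) disappear promptly.
            stale = [s for s, prs in self._data.items() if pr_number in prs]
            for sha in stale:
                del self._data[sha]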
clarkb(mostly personal curiosity) how do you end up with PRs that have the same hash?22:41
clarkbis the common case trying to merge the same commit to multiple branches?22:41
clarkband the branches have the same commit history?22:41
jlkI think it's very uncommon, it's just possible?22:41
clarkb(if we can describe the case we might be able to say something like "zuul doesn't work with this workflow")22:41
jlklike you've made a release branch and then quickly had to fix a thing, the fix goes to both master and the release branch, and master hadn't moved yet22:41
jlkor if somebody accidentally opens 2x PRs from the same branch to the same target22:42
jlkthere's lots of tooling out there to open PRs, a hiccup could cause multiple.22:42
*** tflink has quit IRC23:02
*** tflink has joined #zuul23:02
*** jamesmcarthur has quit IRC23:11
SpamapSclarkb: you force push23:24
SpamapShave accidentally done it ;)23:24
jlkheh. Yeah there's nothing GitHub (or gerrit) side that would prevent a variety of scenarios from happening23:25
clarkbGerrit actually enforces a lot more rules around this23:25
clarkblike force pushing can be disabled pretty forcefully. And you can't push the same commit to multiple changes; iirc you'd have to change the changeid? though maybe that works with the stable/master example above23:27
clarkbsince the unique tuple is sha, changeid, branch23:27
jlka force push can be disabled by GitHub too, if the repo owner chooses to do so23:27
jlkexcept that often PRs come from forks, and the rules don't necessarily apply23:27
*** jamesmcarthur has joined #zuul23:32
SpamapSya, it's not only possible, but I'd say it's a valid scenario that should result in a clear error message.23:32
*** jamesmcarthur has quit IRC23:32
*** jamesmcarthur has joined #zuul23:33
SpamapSEven if it means spamming both PR's. ;)23:33
clarkbya I don't necessarily think it's a bad thing, except that if we sacrifice some corner-case functionality in exchange for working at all, that might be a good compromise23:33
jlkThat means finding all the matches again :/23:33
clarkbThe other thing that gerrit does is more explicitly couples the workflow and api. So you have the commit info in the events23:34
clarkb(I have no idea what it is like to implement that in github's api, but a query for the list of PRs that map to this commit could be useful; this was suggested earlier I think)23:34
jlkour checks API is definitely an attempt at improving this23:35
jlkbecause check runs and suites are attached to specific PRs23:35
clarkbjlk: in that case the entire CI workflow runs through that different API?23:35
jlkit'll just take a while for the whole world to move over23:35
jlkclarkb: yeah. As an app you register as a check suite, and on PR open a check run is requested, which Zuul would get and start interacting with. We send results into the check run23:36
clarkbwould we still need to process status updates to deal with gating things?23:37
jlkyes, until all the other systems move over into check runs23:37
*** dkehn has quit IRC23:37
jlkso instead of setting status, they would be updating their own specific check suite/runs. We can iterate over them to see if they're complete or not23:37
jlkhrm I wonder23:37
*** dkehn has joined #zuul23:38
jlkI wonder if we would get updates about a check run for another app23:38
jlkmaybe. Hard to tell from the docs23:39
jlk"GitHub Apps that have the checks:read permission and subscribe to the check_suite webhook event receive the completed action payload for all check suites in the app's repository. "23:42
jlkI think we'd receive hooks as each check suite completes, which is analogous to a status success.23:43
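To illustrate why that helps, here is what a handler for such a hook might look like; the field names follow the public checks API payload documentation (head_sha, conclusion, and an attached pull_requests list), and the function itself is hypothetical.

    def handle_check_suite_event(payload):
        # Unlike a status event, the suite carries its PRs directly, so no
        # sha -> pull request lookup is needed.
        if payload.get('action') != 'completed':
            return
        suite = payload['check_suite']
        for pr in suite.get('pull_requests', []):
            print('check suite for PR #%s (%s) finished: %s'
                  % (pr['number'], suite['head_sha'], suite['conclusion']))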
*** sdake has joined #zuul23:43
clarkbso the limitation might become "gating only works properly if all check suites use the checks api" and then switch zuul over to that (I'm hand waving though as I doubt I'd write that conversion or have all the details anticipated)23:44
jlkheh23:44
jlkmaybe. I think it'd be more phased23:44
jlk1) move Zuul over to checks API. 2) sometime later deprecate reacting to status events. 3) stop responding to status events, and state that gating (on systems other than Zuul) requires those systems to be using the checks API23:45
jlka bunch of work is done for 1, in github3.py. It's probably time I poke at Zuul again and try to make use of it23:45
*** jamesmcarthur has quit IRC23:56
