jlk | Turns out, when we poke at the dusty corners, we uncover some ugly stuff. | 00:06 |
---|---|---|
*** adam_g has quit IRC | 00:06 | |
*** adam_g has joined #zuul | 00:09 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Put variables into the inventory https://review.openstack.org/467635 | 00:16 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Write inventory as yaml not ini https://review.openstack.org/467634 | 00:16 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Display command/shell results in a banner https://review.openstack.org/467603 | 00:21 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Strip unneeded cmd, changed, stdout and stderr from results https://review.openstack.org/467310 | 00:21 |
mordred | jeblair: ^^ I updated that per your review and also responded. relatedly - I think we're only fetching the contents of the first task | 00:22 |
mordred | jeblair: I seem to remember we discovered that was a bug before | 00:22 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Disable leaked repo check https://review.openstack.org/466810 | 00:23 |
mordred | Shrews: ^^ since you've been looking at logging, maybe you'll have an idea? (It'll be easier to spot once the logs aren't spammy) | 00:23 |
jlk | oooh merged the leaked repo. | 00:28 |
jlk | time to recheck all the things. | 00:28 |
mordred | jlk: I just rechecked the first half of your stack | 00:29 |
jlk | ^5 | 00:29 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Wait for merger to complete before item is ready https://review.openstack.org/463054 | 00:44 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add support for requiring github pr head status https://review.openstack.org/449390 | 00:44 |
*** tristanC has joined #zuul | 01:06 | |
jlk | mordred: you've used betamax in the past, would you have any objections if it were brought into zuul to better test github driver code that interacts with the API? We've hit a number of bugs that were not discoverable until running the actual code against the API instead of our mocked fake stuff. | 01:22 |
jlk | mordred: I'm thinking we can just stuff json events into the webapp, record the zuul responses to interacting with real PRs on github, and then re-use those betamax responses in tests. | 01:23 |
jlk | (similar to how github3.py does things, except they don't deal with webhooks) | 01:23 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Implement trigger require/reject-review https://review.openstack.org/463387 | 01:35 |
jeblair | jlk, mordred: i'm a fan of betamax for such things | 01:39 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Include tenant name in github context https://review.openstack.org/463871 | 01:39 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Save installation ids to a cache and fetch them per project https://review.openstack.org/463424 | 01:39 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Catch integration key file not found in zuul https://review.openstack.org/463425 | 01:39 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Create github client each time it's called. https://review.openstack.org/463421 | 01:39 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Use integration_id with github https://review.openstack.org/463420 | 01:39 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Refactor integration key loading to always load https://review.openstack.org/463423 | 01:39 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Decode JSON body once for requests https://review.openstack.org/463422 | 01:40 |
jeblair | jlk: i'm thinking that might be best as a somewhat more traditional unit test of the github driver, yeah? verifying one thing at a time and keeping the fake for the functional tests? | 01:41 |
jlk | maybe? I'm a bit of a novice when it comes to designing tests. Mostly I just dislike that in the github test area we're manufacturing API call/responses and at times we've done it wrong. SO we wind up mostly testing our fake methods instead of testing the real methods. | 01:43 |
jeblair | jlk: (the only reason i'd hesitate about betamax is the thought that the fake needs to be accurate enough for the functional tests, so once we get all the errors out, will betamax still offer something beyond the fake? | 01:43 |
jlk | I've tried to extract the calls to external APIs into small functions but it's not complete. | 01:43 |
* jlk afk for the dinners. | 01:44 | |
jeblair | the main thing is that the fake needs to be fully functional so that we can make changes to zuul itself with confidence. *also* testing the driver with betamax is fine with me (if, at the end of the day, it's still useful) | 01:46 |
jeblair | fwiw, on the gerrit side, we created the fake by copying the data from production responses and then stuffing them into the fake with templated values | 01:49 |
jeblair | betamax might be useful getting a better baseline for the fake | 01:49 |
jlk | oh oh | 02:25 |
jlk | So | 02:25 |
jlk | I can easily fake the events. | 02:25 |
jlk | that we have to have no matter what | 02:25 |
jlk | it's what the driver does in _response_ to the events that matters. Because unlike gerrit, we don't get _all_ the data. Github driver has to make calls back to the API to get more data. | 02:25 |
jlk | and, it makes calls to _set_ data, like status. It's those things that we're doing now with fake calls, that's bypassing parts of the actual driver code. If we used betamax, it'd use the driver code, and we should still be able to inspect that it Did The Right Thing. | 02:26 |
*** adam_g has quit IRC | 02:28 | |
*** adam_g has joined #zuul | 02:30 | |
SpamapS | jlk: I think jeblair is suggesting that you could make a fake for that side too. | 03:26 |
SpamapS | 2017-05-25 03:57:40,836 kazoo.client INFO Connecting to localhost:2181 | 03:58 |
SpamapS | 2017-05-25 03:57:40,837 kazoo.client WARNING Connection dropped: socket connection error: Connection refused | 03:58 |
SpamapS | the plot thickens | 03:58 |
SpamapS | btw, we have _something_ calling basicConfig during tests | 03:58 |
* SpamapS puts a pin in it to come back to | 03:58 | |
SpamapS | I monkeypatched basicConfig out and at least started getting sane output from these fails | 03:58 |
SpamapS | gah n/m on the ZK thing | 04:00 |
* SpamapS has too many things going on at once ;) | 04:00 | |
jlk | SpamapS: but... isn't that what betamax does? Creates the fakes for you, based on the real interaction, so you can run it over and over again without needing the real API there? | 04:03 |
SpamapS | jlk: I believe it will parrot things back to you. I don't believe it will handle a two way comm functionally the way the fake gerrit does for instance. | 04:06 |
jlk | what's the difference? | 04:06 |
SpamapS | The fake gerrit keeps some state and will feed it back to you as it was submitted. I don't recall that betamax has a facility for that. | 04:06 |
SpamapS | so you can tell the fake gerrit "add a change. now give me the event stream".. betamax is more "if I add a change, respond with this response" | 04:07 |
SpamapS | but I could be wrong | 04:07 |
SpamapS | my betamax experience was over a year ago and brief | 04:08 |
jlk | ah. | 04:12 |
jlk | hrm. | 04:12 |
jlk | I'm not sure if it's that way or if I'm following | 04:12 |
jlk | but it seemed like you could betamax a multi-call thing and have the back/forth | 04:13 |
jlk | maybe it's because there isn't an "event stream" on the github side per se | 04:13 |
*** bhavik1 has joined #zuul | 05:25 | |
SpamapS | jlk: probably worth investigating. | 05:38 |
* SpamapS 's been debugging on the ssh agent.. all the live long day | 05:53 | |
*** bhavik1 has quit IRC | 07:01 | |
*** DangerousDaren has joined #zuul | 07:39 | |
*** Cibo_ has joined #zuul | 08:40 | |
*** smyers has quit IRC | 09:04 | |
*** Cibo_ has quit IRC | 09:09 | |
*** smyers has joined #zuul | 09:15 | |
*** bhavik1 has joined #zuul | 09:41 | |
*** jkilpatr has quit IRC | 10:57 | |
*** jkilpatr has joined #zuul | 11:14 | |
*** isaacb has joined #zuul | 11:26 | |
*** isaacb has quit IRC | 11:30 | |
*** bhavik1 has quit IRC | 11:43 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul feature/zuulv3: Use ssh for git-upload-pack https://review.openstack.org/436802 | 11:48 |
*** jkilpatr has quit IRC | 12:00 | |
*** jkilpatr has joined #zuul | 12:01 | |
*** jkilpatr has quit IRC | 12:16 | |
*** jkilpatr has joined #zuul | 12:43 | |
*** rcarrillocruz has quit IRC | 12:46 | |
Shrews | mordred: sorry, an idea on what? | 13:05 |
mordred | Shrews: I _believe_ there is a bug in the consumption of log streaming and writing it to the file on the executor | 13:17 |
mordred | Shrews: I _think_ the bug is that it for $reason only winds up streaming the first shell task and not subsequent ones | 13:18 |
Shrews | hmm | 13:18 |
mordred | Shrews: it's hard to see right now because we also log a whole mess of garbage | 13:18 |
mordred | which https://review.openstack.org/#/c/467310/ will hopefully fix | 13:18 |
Shrews | i still see garbage :( | 13:19 |
mordred | yah. so much garbage | 13:19 |
mordred | we need to land 467310 and 467603 and the logs should be prettier | 13:19 |
mordred | pabelanger: ^^ if you're up - those are for the logging issue | 13:20 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Re-enable E305 pep8 errors https://review.openstack.org/466286 | 13:21 |
*** dkranz has quit IRC | 13:23 | |
* SpamapS has resorted to appending debug traces to files to try and find out where the thing is getting stuck | 13:36 | |
mordred | SpamapS: print is the ultimate debug tool | 13:38 |
SpamapS | nope | 13:38 |
SpamapS | that gets swallowed | 13:38 |
SpamapS | thanks testr! | 13:38 |
mordred | oh yuck | 13:38 |
SpamapS | (and the problem doesn't happen not in testr) | 13:38 |
mordred | SpamapS: running the test in isolation outside of testr everything is fine? | 13:39 |
SpamapS | so like literally I have a thing that appends to a file at ~/tmp/debug.{pid} | 13:39 |
mordred | wow | 13:39 |
SpamapS | mordred: no one test kills it | 13:39 |
SpamapS | it just gets stuck somewhere and times out | 13:39 |
*** Cibo_ has joined #zuul | 13:39 | |
SpamapS | in futex() | 13:39 |
SpamapS | which tells me "in python thread stuffs" | 13:40 |
* mordred hands SpamapS a large bucket into which to put his sads | 13:40 | |
mordred | I _hate_ these types of bugs | 13:40 |
mordred | because the sad thing is - they're REAL bugs that will hit actual production at some point | 13:41 |
SpamapS | the good news is I just found the line it's getting stuck at | 13:41 |
mordred | ooh! | 13:41 |
* mordred hands SpamapS a large bucket into which to put his happy | 13:41 | |
SpamapS | the bad news is.. now what? :-P | 13:41 |
mordred | SpamapS: have you tried deleting the line? | 13:41 |
mordred | maybe it's just naughty | 13:41 |
SpamapS | mordred: who needs to join threads anyway? | 13:42 |
SpamapS | be an individual | 13:42 |
SpamapS | not a joiner | 13:42 |
mordred | that's what I'm saying | 13:43 |
mordred | SpamapS: so -it's trying to join threads but one of the threads decides it can't because $reason and $reason is the issue | 13:43 |
SpamapS | Right, the threads do try to cooperate a bit | 13:43 |
SpamapS | so we have a running flag | 13:44 |
SpamapS | and we set it to False, and then wait for that to work | 13:44 |
mordred | but if one of the threads is blocked on something, it can't cycle back around to obey the flag | 13:44 |
SpamapS | I think yes | 13:44 |
SpamapS | I think | 13:46 |
mordred | SpamapS: golly - you might have to add ~/tmp/debug.{threadid} files | 13:46 |
SpamapS | I may be wrong | 13:46 |
SpamapS | I think it's possible we're doing a blanket 'wait()' | 13:46 |
SpamapS | and picking up the ssh-agent | 13:46 |
mordred | ahhhhh | 13:46 |
mordred | that is not a bad hypothesis | 13:46 |
SpamapS | hrm no... waitpid | 13:47 |
SpamapS | self.proc.wait() :-P | 13:47 |
mordred | ssh-agent is going in to subprocess right? | 13:47 |
mordred | but I guess there is a controlling execution thread that owns that subprocess handle | 13:47 |
SpamapS | so, I tried it with os.fork/execlp actually | 13:48 |
SpamapS | that doesn't seem to fix it | 13:48 |
* mordred is just being the dumb sounding board saying stupid things in case the trigger a thought btw | 13:48 | |
SpamapS | appreciate it | 13:48 |
SpamapS | there are a lot of bad theories bouncing around in my head | 13:48 |
mordred | SpamapS: ok - I have a REALLY stupid idea | 13:48 |
SpamapS | that's exactly what I need | 13:49 |
mordred | SpamapS: but it's REALLY stupid - so bear with me | 13:49 |
mordred | SpamapS: make a completely separate daemon whose job is only to fork and reap ssh-agents. have the executor request a new ssh-agent from the agent daemon rather than forking one itself, and have the executor return the agent when it's done | 13:50 |
SpamapS | Ugh | 13:50 |
SpamapS | I mean yeah | 13:50 |
SpamapS | that's stupid | 13:50 |
mordred | it is | 13:50 |
SpamapS | but | 13:50 |
SpamapS | I'm getting to where I have no idea why having this process running is locking up my python interpreter | 13:51 |
mordred | however - we might be hitting the point where we're at enough layers of subprocesses interleaved with threads that we just don't have the tools to actually find the sticking point | 13:51 |
Shrews | SpamapS: ooh, i cherish the bugs that force me to log to a file | 13:51 |
SpamapS | found spot seems to have been a red herring.. it hasn't locked up there again | 13:51 |
Shrews | o/` Cherish the bugs you have o/` | 13:52 |
SpamapS | Cherish is the word I use to remind me of... | 13:53 |
SpamapS | Shrews's Love | 13:53 |
Shrews | aawwwww | 13:53 |
Shrews | oops, autocorrect. should have been... | 13:53 |
Shrews | eewwwww | 13:53 |
Shrews | mordred: is it ok to filter result._result before calling the super? | 13:55 |
Shrews | need to check what the super does with that... | 13:55 |
* SpamapS has to go begin the child shuttling process | 13:55 | |
SpamapS | mordred: I'll ponder on that.. thanks for the dumb idea | 13:56 |
mordred | Shrews: I responded to same question in the review from jeblair | 13:58 |
mordred | Shrews: we have no choice | 13:58 |
Shrews | oh | 13:59 |
mordred | Shrews: well - we do have a choice - if we find that filtering first is unacceptable because it would break something else - we can stop calling super and write our own complete methods | 13:59 |
mordred | Shrews: so figuring out if it's a bad idea _is_ worthwhile | 13:59 |
*** isaacb has joined #zuul | 14:00 | |
Shrews | mordred: i'm not immediately seeing anything that would break. However, I'm willing to bet that this is yet another unexpected bastardization of the ansible API that would make bcoca and jimi cringe. :) | 14:01 |
mordred | yah | 14:02 |
mordred | Shrews: that said - we need to put on the list "need defined API for result objects in callback plugins" | 14:03 |
mordred | becauseyou basically can't do _anything_ without hitting private members | 14:03 |
mordred | I know bcoca has said before that they weren't ready to commit to what's there yet | 14:03 |
mordred | which I totally respect | 14:03 |
*** Cibo_ has quit IRC | 14:04 | |
Shrews | yep | 14:04 |
mordred | but it seems enough people are writing callback plugins that it's time ready or not | 14:04 |
*** dkranz has joined #zuul | 14:18 | |
*** isaacb has quit IRC | 14:34 | |
rbergeron | pabelanger: where is beefy being used? | 14:48 |
pabelanger | rbergeron: just some random hacking by dougbtv | 14:50 |
dougbtv | :D | 14:51 |
dougbtv | it's in a cowsay-beefymiracle rpm available in fedora fwiw | 14:51 |
rbergeron | yay | 14:56 |
rbergeron | ahhh | 14:56 |
rbergeron | ah, thanks to rrix and will woods | 14:58 |
* rbergeron wonders if they can be bribed to package it for epel, lol | 14:59 | |
dougbtv | +1! | 15:13 |
*** isaacb has joined #zuul | 15:17 | |
pabelanger | mordred: clarkb: jeblair: do you mind adding https://review.openstack.org/#/c/452494/ so your review pipeline. Its our nodepool console-log patch | 15:24 |
mordred | pabelanger: we wrote get_server_console in shde to just return none if the API call fails,right? | 15:26 |
pabelanger | let me confirm | 15:27 |
pabelanger | mordred: empty string it looks like | 15:28 |
mordred | pabelanger: ok - cool. so no exceptions | 15:38 |
mordred | pabelanger: patch +2 from me | 15:38 |
*** jasondotstar has quit IRC | 15:43 | |
*** leifmadsen has quit IRC | 15:45 | |
*** leifmadsen has joined #zuul | 15:49 | |
*** jasondotstar has joined #zuul | 15:51 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add override-branch property to job repos https://review.openstack.org/467375 | 15:58 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Enable test_one_branch cloner test https://review.openstack.org/467779 | 15:58 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Check out the appropriate branch in executor https://review.openstack.org/467778 | 15:58 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Use the executor cached repos more often https://review.openstack.org/467777 | 15:58 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Return resulting commits from merger https://review.openstack.org/467776 | 15:58 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add default-branch property to projects https://review.openstack.org/467334 | 15:58 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Rename 'repos' job attribute to 'required-projects' https://review.openstack.org/467376 | 15:58 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add override-branch property to jobs https://review.openstack.org/467775 | 15:58 |
*** DangerousDaren has quit IRC | 16:08 | |
mordred | SpamapS, jeblair: topic came up in a completely unrelated conversation - but I thought it might be worth mentioning here | 16:21 |
mordred | SpamapS, jeblair: http://opentracing.io/ ... current effort Adrian Cole (dude who wrote jclouds originally) is working on - it's a spec and a set of implementation for tracing things through distributed systems | 16:22 |
mordred | we have a distributed system that's goal in life is to become even more distributed - thought at least pointing people to that would be nice - on the off chance we happen across the problems they're focused on solving | 16:23 |
SpamapS | heh | 16:25 |
SpamapS | tracing :) | 16:25 |
jeblair | mordred: *nod* could be interesting. though i'd like to start by making sure our log files are sufficient (and sufficiently tagged) so that an admin doesn't need an extra tool to do basic debugging. but once we have our house in order, might be nice. :) | 16:26 |
SpamapS | I need something like set -x for python right now :-P | 16:28 |
* SpamapS is literally putting a debug append to file on every line in executor server right now | 16:28 | |
jeblair | SpamapS: maybe i should poke at that problem with you... what do i do? | 16:29 |
*** bhavik1 has joined #zuul | 16:29 | |
jeblair | mordred, jlk: i reviewed a bunch of github changes | 16:29 |
mordred | jeblair: woot. I'll do that next | 16:32 |
SpamapS | jeblair: Would love a second set of eyes. Especially if they're your elf eyes. | 16:32 |
SpamapS | jeblair: the problematic patch is 462712 | 16:32 |
SpamapS | jeblair: running tests under testr times out tests reliably. Have tried on single thread, still happens. | 16:33 |
SpamapS | stracing reveals before they get SIGALRM they're in 'futex()' | 16:33 |
jeblair | but running a single test succeeds? | 16:33 |
SpamapS | and looks like they're stuck on joining threads | 16:33 |
mordred | speaking of logging, I just pushed up a shade stack related to logging that ends in https://review.openstack.org/#/c/468095/ which is a document about the loggers we log to. mostly mentioning here in that, in as much as nodepool uses shade, admin info on how to control the logging shade emits can be useful | 16:33 |
SpamapS | jeblair: yes any single test succeeds though it's possible that's a sample size problem | 16:34 |
jeblair | mordred: yeah, we may want to xref that in nodepool docs | 16:34 |
SpamapS | also setting gentle=True deadlocks forever, presumably because the thread that receives SIGALRM is the one stuck in futex() | 16:35 |
mordred | jlk: I'm not going to -1 it - but in https://review.openstack.org/#/c/453844/17/tests/base.py you moved user = 'zuul' to a parameter - but the comment at the top of the method says "we hard set user to zuul" - I think the comment needs changing/removal | 16:36 |
jeblair | SpamapS: neat. my ssg-adent has no "-D" argument | 16:37 |
jeblair | ssh-agent even | 16:37 |
jeblair | ii openssh-client 1:6.6p1-2ubuntu2.8 | 16:37 |
jeblair | SpamapS: that's what zuulv3-dev has as well. any chance yours is a gnome thingy? | 16:40 |
SpamapS | whhhhhhhhhat? | 16:40 |
SpamapS | openssh-client: /usr/bin/ssh-agent | 16:40 |
SpamapS | ii openssh-client 1:7.2p2-4ubuntu2.2 amd64 secure shell (SSH) client, for secure access to remote machines | 16:41 |
SpamapS | so -D is new? | 16:41 |
SpamapS | hrm | 16:41 |
jeblair | seems that way | 16:41 |
jeblair | http://paste.openstack.org/show/610657/ | 16:41 |
SpamapS | good 'ol BSD people.. daemonizing everything | 16:41 |
SpamapS | so the original patchset did use the daemonized one | 16:41 |
SpamapS | that made me nervous though | 16:41 |
SpamapS | as it makes it harder for us to ensure they get cleaned up | 16:41 |
SpamapS | at least with it in the same pg as zuul-executor in theory when zuul-executor's session ends it will get a HUP and die. | 16:42 |
pabelanger | I was using ssh-agent -k | 16:42 |
pabelanger | to help clean up | 16:42 |
SpamapS | so of course we can track pids and kill | 16:42 |
SpamapS | but the kernel helps us quite a bit if we stay in the same pg | 16:42 |
jeblair | yeah, i really like the foregroud/pgroup option | 16:43 |
SpamapS | but requiring openssh >= 7.x might be a bit much? | 16:43 |
SpamapS | Trusty still has 2 years left. | 16:43 |
clarkb | we already require bubblewrap | 16:44 |
clarkb | or mostly require it | 16:44 |
jeblair | and we're on the verge of being py3 only | 16:44 |
SpamapS | and RHEL6 still has 2 years too :-P | 16:44 |
clarkb | rhel6 is python2.6 which I doubt we care about | 16:44 |
pabelanger | centos-7 lacks zookeeper currently | 16:44 |
SpamapS | so | 16:44 |
SpamapS | agree on that! | 16:44 |
SpamapS | however.. | 16:44 |
SpamapS | there's a difference between helpers (zk, bwrap) and core system software (ssh) | 16:44 |
pabelanger | I've been testing on fedora latest, works well | 16:44 |
SpamapS | quite a different thing to run backported ssh | 16:45 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Adds github triggering from status updates https://review.openstack.org/453844 | 16:45 |
clarkb | aren't we talking about running backported kernels? | 16:45 |
clarkb | I don't think you get more core sytstem software than that | 16:45 |
pabelanger | ssh-agent on xenial support -D | 16:46 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Implement pipeline requirement on github reviews https://review.openstack.org/453845 | 16:46 |
pabelanger | and I think we're planning on migrating more things in openstack-infra to xenial | 16:46 |
pabelanger | I just did nodepool-builders this week | 16:46 |
SpamapS | I'm surprised we booted zuulv3-dev on trusty :-P | 16:47 |
pabelanger | easy mode | 16:47 |
clarkb | yes I awnt it if for no other reason than to be systemd everywhere and no longer have to deal with 3 init systems | 16:47 |
mordred | also - isn't zesty already out? | 16:47 |
pabelanger | We could upgrade it to xenial, if people would like to do so. | 16:47 |
mordred | oh - wait - zesty still isn't lts is it? | 16:48 |
SpamapS | zesty is in fact out | 16:48 |
SpamapS | but zesty is a 9 month support release | 16:48 |
jeblair | pabelanger: we'll need to soon anyway for the py3 stuff | 16:48 |
mordred | jeblair: ++ | 16:48 |
pabelanger | okay, I can prep zuulv3 upgrade | 16:48 |
mordred | I certainly don't think we should target zuul v3 to trusty | 16:48 |
clarkb | no, but xenial is fine | 16:49 |
mordred | yup | 16:49 |
pabelanger | actually | 16:49 |
pabelanger | can we just inplace distro upgrade it? That there is alot of stuff that is not under puppet right now | 16:49 |
jeblair | pabelanger: well, maybe we should leave it alone and build zuulv3.openstack.org on xenial with puppet? | 16:50 |
pabelanger | we can, just a lot of puppet work atm | 16:50 |
SpamapS | you can borrow BonnyCI's ansible if you want ;-) | 16:50 |
* SpamapS ducks | 16:51 | |
jeblair | i'd be down with that :) | 16:51 |
mordred | honestly, if puppet is hurting not helping us to manage the systems, I think that should be considered | 16:52 |
mordred | managing our systems via config management is suppoesd to be a force-multiplier | 16:52 |
SpamapS | ya'll puppet what you need to puppet. :) | 16:52 |
pabelanger | well, we do have some ansible roles today in gate. I'd love to demo some of the stuff I did will windmill testing :) | 16:52 |
mordred | if we're avoiding tasks because too-much-puppet - it's not helping | 16:52 |
pabelanger | that's how I have been rolling zuulv3 locally | 16:52 |
SpamapS | hoist likely has some non-generic code that would make porting it to infra as hard as porting openstack's stuff to v3. :) | 16:52 |
pabelanger | but, didn't want to push the ansible / puppet cfgmgt issue in openstack-infra | 16:53 |
clarkb | I'm not sure that changing cfg mgmt systems magically fixes anything | 16:53 |
Shrews | jeblair: mordred: well, i absolutely am baffled as to why I cannot get the asyncio server to shutdown when I daemonize the thing. It works fine non-daemonized. I'm about to give up on this. | 16:53 |
SpamapS | (hoist being our pirate-themed Ansible for deploying BonnyCI) | 16:53 |
clarkb | (they all have problems and make like miserable in their own ways) | 16:53 |
jeblair | at any rate, one way or another, we do need the production system to be under config management, so i think we should leave the dev machine alone and launch the prod server on xenial with cfgmgt | 16:53 |
mordred | I agree with jeblair | 16:53 |
mordred | I also agree with clarkb - Ido not think there is a magical bullet - but if humans are avoiding the puppet work it's a worthy facet to consider | 16:54 |
mordred | jeblair, jlk: different topic- minor bikeshed: https://review.openstack.org/#/c/460700/14/tests/base.py ... gh_time_format = '%Y-%m-%dT%H:%M:%SZ' | 16:54 |
mordred | I bring this up because I quite literally JUST hit this with shade and nova api | 16:54 |
clarkb | protip never ever read teh ansible service management code if you want to use ansible to manage services | 16:54 |
jlk | So, just so I can keep it straight again, EventFilter deals with data in the actual event (like the branch, the repo, etc.. and the RefFilter is about deeper data in the change, like what votes it has or whether it's the latest ref in the change request and whatever. | 16:55 |
SpamapS | clarkb: to be fair, they have to deal with even more than 3 init systems. :) | 16:55 |
jlk | mordred: I have no skin in that game, I think that was a time format created by adam_g | 16:55 |
mordred | jeblair, jlk: '%Y-%m-%dT%H:%M:%SZ' is a subset of ISO 8601 - but is not full ISO 8601 - are we certain that github uses the subset and will never send richer ISO 8601 values? | 16:55 |
jeblair | jlk: yep | 16:55 |
jlk | er not format, but code | 16:55 |
clarkb | SpamapS: ya tahts actually a big part of my grump. Everywhere else ansible has explicitly decided to avoid abstracting things into generic resources. Except for service. And its terrible | 16:55 |
jlk | mordred: we're not certain about anything with regard to the github API | 16:55 |
clarkb | I think in part because it flies in the face of the general design of the rest of it | 16:56 |
jlk | all we have is empirical evidence | 16:56 |
jlk | oh oh are we griping about the "service" module? | 16:56 |
* jlk hates it | 16:56 | |
jlk | clarkb: the "package" module is getting to be like that too, trying to abstract around apt, yum, dnf, etc... | 16:57 |
mordred | jlk: which is a dumb idea | 16:58 |
jlk | I get the desire, but I choose not to embrace it | 16:58 |
mordred | jlk: the original ansible approach of an apt, yum and dnf module is much nicer | 16:58 |
mordred | yup | 16:58 |
jlk | I'd rather use playbook branching logic to call the correct modules. | 16:58 |
mordred | dehann wasn't always right - but he's right on this one | 16:59 |
jlk | agreed | 16:59 |
mordred | well - you have to have branching logic ANYWAY | 16:59 |
SpamapS | clarkb: Agree, would much rather see a systemctl module. | 16:59 |
mordred | because fo package names | 16:59 |
mordred | SpamapS, clarkb: https://docs.ansible.com/ansible/systemd_module.html | 16:59 |
jlk | well.. | 16:59 |
jlk | yes | 16:59 |
SpamapS | well there you go | 16:59 |
clarkb | mordred: oh nice | 16:59 |
jlk | you need branching logic for variable data loading | 16:59 |
clarkb | even more reason to just get to systemd everywhere world :) | 16:59 |
jlk | and presumably you always feed the same _variable name_ into the "package" action plugin. | 17:00 |
clarkb | as much as I'm sure it hurts mordred to do that :) | 17:00 |
* SpamapS is now more confused after adding tracing | 17:00 | |
mordred | jlk: so - data type is not listed in thier API reference anywyhere for datatimes - but in their graphql docs they have "An ISO-8601 encoded UTC date string." - https://developer.github.com/early-access/graphql/scalar/datetime/ | 17:00 |
jlk | variable branching is WAY easier than playbook branching | 17:00 |
SpamapS | thread is exiting... join is not returning | 17:00 |
jlk | mordred: I really need to get a grasp on graphql and see if we should move over to it, but for now we're on the REST API | 17:00 |
mordred | in any case - it may be worth considering just using iso8601 to parse the incoming string for potential forward compat | 17:00 |
jlk | mordred: if that's safe with the not quite 8601 we get now, sure. | 17:01 |
mordred | since it'll parse the subset | 17:01 |
mordred | well - the string above is valid iso 8601 | 17:01 |
jlk | my time code skills are REALLY lacking. | 17:01 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: Add web-based console log streaming https://review.openstack.org/463353 | 17:01 |
mordred | it's just that there are other things that are also optional and valid | 17:01 |
mordred | jlk: I'll make a follow up patch - I don't think it's worth reworking the existing patch | 17:02 |
*** isaacb has quit IRC | 17:02 | |
jeblair | SpamapS: i'll switch to a xenial machine to poke at this for now, and we can defer the ssh version decision for later | 17:02 |
jlk | kay thanks | 17:02 |
jlk | I have to get back to rebasing this stuff, because I screwed up the EventFilter vs RefFilter for some things | 17:02 |
pabelanger | jeblair: mordred: does that mean we want ze01.o.o and zuulv3.o.o ? | 17:02 |
pabelanger | or all in one still | 17:02 |
*** Cibo_ has joined #zuul | 17:02 | |
jeblair | pabelanger: let's go ahead and separate | 17:02 |
SpamapS | jeblair: There's an earlier patch set that doesn't use -D ... if that one works rebased on feature/zuulv3, it might be a clue about what's causing the weirdness :-P | 17:03 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: Add web-based console log streaming https://review.openstack.org/463353 | 17:03 |
jeblair | SpamapS: true; i'll reserve that line of inquiry for later | 17:03 |
Shrews | mordred: In case there's any chance you might have a clue on this, lines 164-170 of https://review.openstack.org/#/c/463353/13/zuul/streamer.py summarize the situation. | 17:04 |
mordred | jlk: while you're in there- https://review.openstack.org/#/c/46076 is +3 but jeblair left an inline comment | 17:06 |
jlk | alrighty | 17:06 |
Shrews | hah. typical. now that i've publicly given up, i finally find something on the internet: https://bugs.python.org/issue21998 | 17:07 |
mordred | Shrews: yay! | 17:08 |
Shrews | from the comments i've read so far, the answer is "don't do that" | 17:08 |
Shrews | so maybe we just have to treat this differently from the other zuul processes | 17:09 |
jeblair | Shrews: where's the daemonization code? | 17:09 |
Shrews | jeblair: i posted the latest in 463353 | 17:09 |
Shrews | jeblair: zuul/cmd/streamer.py does the normal zuul stuff | 17:10 |
Shrews | with a little manual forking thrown in | 17:11 |
clarkb | oh Shrews is here. Shrews pabelanger can has review on https://review.openstack.org/#/c/467783/ its the failed image upload records thing | 17:11 |
clarkb | if that looks good on the v3 branch I'd like to port it over to master too so we can get that in soonish | 17:11 |
Shrews | clarkb: yeah, it's on my list. i want to make sure we aren't depeding on those anywhere | 17:11 |
Shrews | pretty sure we aren't, but need to refresh my memory | 17:12 |
clarkb | Shrews: I did some rudimentary checking (grepping for FAILED or was it FAILING), but extra eyes appreciated | 17:12 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Ensure PRs arent rejected for stale negative reviews https://review.openstack.org/460700 | 17:12 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Represent github change ID in status page by PR number https://review.openstack.org/460716 | 17:12 |
jeblair | Shrews: ah, i see it; i was misled by the "TODO: make zuul/cmd/streamer.py" comment, as that appears to be TODONE :) | 17:12 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Comment on PRs if a remote call to merge a change failed https://review.openstack.org/460762 | 17:13 |
Shrews | jeblair: where is that comment? | 17:16 |
jeblair | Shrews: bottom of zuul/streamer.py | 17:16 |
SpamapS | #4 0x000000000049f414 in PyThread_acquire_lock () | 17:16 |
* SpamapS finds all the lock acquires | 17:16 | |
Shrews | hah | 17:16 |
jeblair | Shrews: i'll poke at it after i finish poking at the ssh-agent thing | 17:17 |
SpamapS | oh hrm | 17:17 |
SpamapS | that's lower level | 17:17 |
SpamapS | all the threads are at that spot | 17:17 |
Shrews | jeblair: according to that bug report, the event loop FD appears to be the problem | 17:17 |
Shrews | i suspect i need to play to creating a new event loop | 17:19 |
SpamapS | Shrews: what are you forking for? | 17:22 |
* SpamapS has no context | 17:22 | |
Shrews | SpamapS: all zuul processes support being run in the foreground (with -d option) or being daemonized (the default when you run the command). Forking is part of the daemonizing process | 17:25 |
SpamapS | Shrews: so you haven't done any actual work before daemonizing right? | 17:25 |
Shrews | SpamapS: not i, but i can't control what the asyncio lib does | 17:26 |
SpamapS | Seems like moving the event loop init to after daemonizing would be the simplest thing? | 17:26 |
Shrews | SpamapS: right, that's part of asyncio | 17:26 |
Shrews | i wonder if i move the import to much later... | 17:27 |
SpamapS | zomg this is awesome. So I added a crap ton of debug writes to my log files... and of course.. now it's not timing out. | 17:27 |
SpamapS | Shrews: the event loop isn't initialized on import, is it? | 17:27 |
jeblair | Shrews: yeah, if it does something in module import (which is evil... didn't quido write this?), you may need to do that. we have similar late imports due to paramiko | 17:27 |
SpamapS | That would be .. awful. | 17:27 |
jeblair | guido even | 17:27 |
clarkb | Shrews: was it not somethign as simple as keeping fds open? | 17:28 |
SpamapS | import side effects are one of those newbie mistakes I thought | 17:28 |
Shrews | SpamapS: i have to assume so because one does not initialize the event loop | 17:28 |
clarkb | daemonization closes all open fds | 17:28 |
clarkb | but you can whitelist them | 17:28 |
jeblair | i don't think we want any fds open | 17:28 |
jeblair | we want daemonization to happen before any asyncio stuff starts | 17:29 |
Shrews | the fds are part of asyncio internals. i don't want to try to mess with those | 17:29 |
jeblair | right. i'm just saying they shouldn't be open before we daemonize. | 17:29 |
clarkb | well you don't mess with them so much as tell daemonization process to leave them alone (the opposite of messing with them :) ) | 17:29 |
clarkb | and ya doing asyncio setup after wards is fine too | 17:29 |
Shrews | i'm going to try creating a new event loop, and late import as a last resort | 17:30 |
SpamapS | I'm not kidding though.. I actually find it hard to believe asyncio initializes the event loop on import. I believe you it does, but it _hurts_ my brain to think they did that on purpose. | 17:30 |
jeblair | well, just speculation at this point. but it would be nice to have a definitive answer. | 17:31 |
SpamapS | since you always have to asyncio.get_event_loop() , that seems like the thing to move after fork. | 17:36 |
Shrews | SpamapS: that is after the fork | 17:37 |
Shrews | absolutely nothing is done with asyncio until after the fork | 17:38 |
Shrews | and creating a new event loop, assuming i'm doing it correctly, doesn't help | 17:38 |
jlk | jeblair: can you help me understand why in gerritmodel.py the EventFilter class deals with required_approvals and reject_approvals, instead of just handling those in RefFilter? Since they deal with the deeper content of the change, shouldn't it be only in the Ref filter? | 17:46 |
SpamapS | Shrews: I think you can get a fresh one with get_event_loop_policy().get_event_loop() | 17:49 |
SpamapS | Shrews: oh, that's kind of what new_event_loop() does actually. | 17:50 |
jeblair | jlk: it's a borderline abstraction violation -- it means "this event matches iff the change already has some approval (wich may not be related to the event)" | 17:50 |
SpamapS | Shrews: set_event_loop(new_event_loop()) maybe? | 17:50 |
Shrews | SpamapS: i tried all of that | 17:50 |
SpamapS | ow | 17:51 |
jlk | jeblair: er.... hrm. How would that play out? I'm trying to determine if I need to make the same violation on github. | 17:51 |
jeblair | jlk: it *seems* like it should be obsoleted by the pipeline requirement, however, we use it to say "enqueue a change into check if someone adds a workflow +1 vote and jenkins has a -1 vote". normally a workflow +1 would not enqueue in check. and you definitely don't want to require a workflow +1 for check. | 17:51 |
jlk | ah, that feels... gerrit/openstack specific. | 17:52 |
jlk | I'm going to ignore that for now and not duplicate it for github. | 17:52 |
jeblair | jlk: well, our *use* of it is; i was just giving you an example of how it's different. i think it's okay to ignore for now, but we may end up finding it useful to add later. | 17:53 |
SpamapS | Shrews: so yeah, next step is to delay import of asyncio I guess. :-/ | 17:53 |
Shrews | and just tried that (well, delayed the import of the thing that imports asyncio). still no worky | 17:54 |
* Shrews stabs all things python | 17:54 | |
jeblair | jlk: (the same thing can map to github pretty easily: enqueue in check if zuul has voted failure already and someone approves the pr nonetheless; essentially, an implied "recheck") | 17:54 |
jlk | so normally a "review" would trigger gate, but gate can't trigger because it requires a zuul status, and zuul has said no. Somehow that scenario manages to trigger the check? | 17:56 |
jlk | thinking | 17:56 |
jlk | so when scheduler tries to match up the event to a pipeline, | 17:57 |
jlk | It won't see the "check" pipeline because the trigger config doesn't match, it has no "review" trigger | 17:57 |
SpamapS | Shrews: what about the websocket library? | 18:01 |
SpamapS | Shrews: perhaps it has import side effects | 18:02 |
SpamapS | reading asyncio's code, I don't think it creates the event loop on import | 18:02 |
Shrews | SpamapS: ah, that's entirely probable | 18:03 |
Shrews | but still, nothing is done with the websocket lib until after fork | 18:04 |
SpamapS | Shrews: import? | 18:04 |
Shrews | it's imported in the same file that i tried delaying import on | 18:04 |
SpamapS | Shrews: is it possible the problem has been misdiagnosed? | 18:06 |
jlk | jeblair: oh hrm, I found something else | 18:07 |
jlk | jeblair: A trigger schema has to account for 'require-approval', and that gets defined in <driver>/<driver>trigger.py, which calls <Driver>EventFilter | 18:08 |
Shrews | SpamapS: i welcome others to investigate. all code is up in 463353 that demonstrates the problem | 18:08 |
jlk | so the EventFilter has to handle the require/reject intput | 18:08 |
jlk | even though it seems it should be handled by RefFilter | 18:08 |
* jlk pulls more hair out | 18:08 | |
Shrews | SpamapS: just run 'zuul-web-streamer -c zuul.yaml' and 'kill <child pid>' | 18:09 |
jeblair | jlk: a trigger doesn't have to support require-approval -- i think you just suggested not doing that for github, yeah? | 18:09 |
jlk | I think I was arguing the wrong thing | 18:09 |
jeblair | jlk: successfully though :) | 18:09 |
jlk | jeblair: in my head, I mapped EventFilter to "just look at the content of the event, i.e. the stuff that changed right now to cause the event" | 18:10 |
jlk | So, EventFilter handles things like a comment being added, a label being added, a ref being updated, a vote being added. | 18:10 |
jlk | and I mapped RefFilter to "deeper data about the change itself that must be in a specific set" | 18:11 |
jeblair | jlk: yes, that is, generally, and ideally, the case. except for this one thing. :) | 18:11 |
jlk | so "this label must exist on the change, even if it's not set on this event" or "this vote must exist even if it wasn't set on this event" or things like that. | 18:11 |
SpamapS | Shrews: If I get tired of my current personal threading hell, I'll try your asyncio hell. | 18:12 |
jeblair | jlk: yep. it's just that we also discovered that it was useful to say "this event, but only if this vote exists". | 18:12 |
jlk | the code makes this awkward when you want to express a RefFilter level requirement at the trigger config point | 18:12 |
jlk | yeah, we've got pipeline requirements that generically cover this | 18:13 |
jlk | but I think I see what you're saying | 18:13 |
jlk | pipeline reqs are more like "any of these events are cool, but this must be in place for any of them" | 18:13 |
jlk | and a trigger req is more "this event is _only_ cool in this specific scenario, and _STILL_ the pipeline req must be in place." | 18:14 |
jlk | so I could configure a trigger for check that was an approved review from a writer, but only if there is a failed status from zuul. | 18:15 |
*** Cibo_ has quit IRC | 18:15 | |
jlk | so that an approved review from a writer wouldn't trigger if there was a success status from zuul. | 18:16 |
jlk | I think I get it now. I had originally looked at trigger requirements as a way to have a driver specific requirement on a pipeline with multiple drivers, but we fixed that a different way. | 18:17 |
jeblair | jlk: yep! | 18:17 |
jeblair | jlk: yeah, i recall we talked about that earlier, and i think it's no longer necessary to think of them in that way since we made driver-specific pipeline reqs | 18:18 |
jeblair | (it would be awkward to use them that way since you'd have to duplicate them on every trigger) | 18:18 |
jlk | But this does mean now that EventFilter has to deal with deeper data of the change. | 18:18 |
jlk | since a trigger calls EventFilter | 18:18 |
jlk | I think it'll make sense to implement a trigger requirement on status, but not on review. | 18:19 |
jlk | gerrit doesn't really have status vs review, it's all in just differently formatted approvals. So it makes sense for gerrit to have require-approval | 18:20 |
jeblair | that sounds reasonable | 18:23 |
jlk | and this definitely needs a new test case | 18:24 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Drop -e for pip install for devstack plugin.sh https://review.openstack.org/468127 | 18:37 |
*** bhavik1 has quit IRC | 18:39 | |
Shrews | hrm, ok. seems something about the Ctl-C stops the event loop. | 18:40 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Include tenant name in github context https://review.openstack.org/463871 | 18:44 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Add cachecontrol to requests to github https://review.openstack.org/461587 | 18:44 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Save installation ids to a cache and fetch them per project https://review.openstack.org/463424 | 18:44 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Catch integration key file not found in zuul https://review.openstack.org/463425 | 18:44 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Include exc_info in reporter failure https://review.openstack.org/460765 | 18:44 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Implement github trigger requirement status https://review.openstack.org/463386 | 18:44 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Create github client each time it's called. https://review.openstack.org/463421 | 18:44 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Use integration_id with github https://review.openstack.org/463420 | 18:44 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Refactor integration key loading to always load https://review.openstack.org/463423 | 18:44 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Decode JSON body once for requests https://review.openstack.org/463422 | 18:44 |
*** rcarrillocruz has joined #zuul | 18:46 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: Add web-based console log streaming https://review.openstack.org/463353 | 18:51 |
Shrews | that PS ^^^ captures SIGINT and attempts to shutdown things, but doesn't work as expected, whether daemonized or not | 18:52 |
Shrews | so i guess it ISN'T fork related | 18:52 |
Shrews | progress, at least | 18:53 |
jeblair | Shrews: that looks like it should immediately exit? (it calls start then stop) | 18:55 |
jeblair | er, rather, start then it just exits | 18:55 |
Shrews | jeblair: start() doesn't return | 18:55 |
Shrews | not until the loop stops | 18:56 |
jlk | mordred: jeblair: I think I addressed the stated concerns in those patches. I dropped one more, the trigger level review requirement. Tests appear to pass on the tip of the branch locally. I'm off to find some fish and chips. | 18:56 |
Shrews | jlk: event_loop.stop() should cause the run_forever() call to exit at the next opportunity | 18:57 |
Shrews | err, jeblair ^^^ | 18:57 |
jeblair | Shrews: ah, well, ctrl+c only goes to the main thread, so it will interrupt the streamer start() method which is running run_forever | 18:57 |
jeblair | Shrews: what happens if you run that code then send sigterm? | 18:58 |
jeblair | it'll do the same, actually | 18:58 |
Shrews | jeblair: i'm not using threads | 18:58 |
jeblair | Shrews: yeah, but you still get one for free. :) | 18:58 |
Shrews | well, yeah. there's only 1, i meant | 18:59 |
jeblair | Shrews: so ctrl+c will raise an exception inside of run_forever | 18:59 |
Shrews | jeblair: right, but we SHOULD be able to stop it manually | 19:00 |
jeblair | i would not be surprised if that breaks things, so i say it's best to set that aside for now and try a different method of triggering the shutdown | 19:00 |
jeblair | like sigterm, which should call the exit handler then return to processing insed of run_forever | 19:00 |
jeblair | Shrews: so what happens when you start that with "-d", then in another terminal, kill -15 the pid? | 19:01 |
Shrews | jeblair: same thing | 19:02 |
Shrews | event loop is never stopped, even though we've asked it to | 19:02 |
Shrews | jeblair: the answer might be here: https://stackoverflow.com/questions/42628795/indirectly-stopping-a-python-asyncio-event-loop-through-sigterm-has-no-effect/42635128 | 19:04 |
Shrews | "Instead of signal.signal() use loop.add_signal_handler()" | 19:05 |
Shrews | I will try that | 19:05 |
jeblair | Shrews: ah, so incidentally, fixing the ctrl-c problem may fix this as well. | 19:09 |
jeblair | Shrews: if you have the streamer start in a thread, and have the start() method return, then wait on signal.pause() in the main thread, it may cause both things to work without the need for loop.add_signal_handler | 19:10 |
Shrews | jeblair: i'll try that next because add_signal_handler does the same thing :/ | 19:11 |
*** hashar has joined #zuul | 19:42 | |
*** jamielennox is now known as jamielennox|away | 19:56 | |
* SpamapS returns to the ssh-agent salt mine | 20:23 | |
SpamapS | hrm | 20:27 |
SpamapS | when I attach to the threads of my running tests that are timing out, I just noticed they're actually really busy in the kazoo threads | 20:28 |
pabelanger | heads up, nodepool dsvm jobs are broken. Working to fix them | 20:28 |
*** jkilpatr has quit IRC | 20:29 | |
SpamapS | 2017-05-25 20:25:58,064 zuul.zk.ZooKeeper DEBUG ZooKeeper connection: CONNECTED | 20:30 |
SpamapS | 2017-05-25 20:25:58,080 zuul.nodepool DEBUG Updating node request <NodeRequest 100-0000000000 <NodeSet OrderedDict([('controller', <Node None controller:invalid-label>)])>> | 20:30 |
SpamapS | 2017-05-25 20:25:58,080 zuul.nodepool INFO Submited node request <NodeRequest 100-0000000000 <NodeSet OrderedDict([('controller', <Node None controller:invalid-label>)])>> | 20:30 |
SpamapS | running with 'ttrun -epy27' just prints those 3 lines and then that process is looping hard selecting/recving/sending with zookeeper | 20:31 |
SpamapS | http://paste.openstack.org/show/610675/ | 20:33 |
Shrews | SpamapS: that's likely the kazoo keepalive thread | 20:37 |
Shrews | we had to restrict it to a single kazoo connection b/c otherwise it hit the cpu pretty hard with multiple | 20:38 |
Shrews | in nodepool, that is | 20:38 |
Shrews | i think kazoo uses something like 3 threads per connection, iirc | 20:39 |
SpamapS | Yeah that sounds about right. Need at least 1 for watches. | 20:39 |
jeblair | SpamapS: in general, what i've found is that the stop process for ssh isn't robust, and when we hit the wait timeout, errors stopping ssh can cause the executor to raise an exception when stopping and end up aborting the test shutdown. i'm still tracking stuff down, but i'm making progress. | 20:39 |
SpamapS | Shrews: it's an awfully busy thread. | 20:39 |
SpamapS | jeblair: I have some updates that I may not have pushed up where I run .wait() in a thread | 20:40 |
SpamapS | jeblair: so that I can do terminate, then start waiting for it to die, then after a few seconds use kill. | 20:40 |
jeblair | SpamapS: yeah, though normally, we'd expect terminate to be pretty fast, right? | 20:42 |
SpamapS | jeblair: in fact I was just re-reading the subprocess manual.. | 20:42 |
SpamapS | it says to be careful with subprocess.PIPE and wait. | 20:42 |
SpamapS | jeblair: terminate should be fast, but wait needs to wait for the actual death of ssh-agent. | 20:42 |
SpamapS | the agent does have to unlink its socket | 20:42 |
SpamapS | so, a few ms, but not instant | 20:43 |
SpamapS | jeblair: I'm not seeing executor exceptions ever though, so you're clearly finding more than I could. | 20:43 |
SpamapS | jeblair: I wonder if I should just send all the fd's to devnull and poll for the socket existence, rather than try to be fancy and select stdout. | 20:44 |
SpamapS | that way we don't have any pipes | 20:44 |
jeblair | SpamapS: i think that's a good idea. i think the buffering, etc, there is a theoretical problem, though i'm not sure it's an actual one. at any rate, a lot of stuff gets simpler if we do that. (also, incidentally, the same is true if we don't run in foreground, so we might still consider running without -D depending on whether we find the pgroup thing useful) | 20:46 |
jeblair | SpamapS: the biggest thing i've found is that stopping is racy, and ExecutorServer.stop() can end up running with self not having a .thread attribute; so i added protection there (and for ssh_agent). | 20:47 |
jeblair | SpamapS: i still have some tests hitting the hard timeout though; so still something else to track down | 20:47 |
SpamapS | jeblair: I'm trying with devnull now. | 20:47 |
*** yolanda has quit IRC | 20:49 | |
*** jkilpatr has joined #zuul | 20:49 | |
SpamapS | jeblair: also re-reading the stop method, I think it will be more robust with wait() running in a thread. Otherwise we could get stuck there. | 20:50 |
jeblair | SpamapS: i keep seeing something that looks like git commands are getting stuck. i wonder if there is some interaction there with ssh-agent? (i don't understand why -- all the git commands are local) | 20:50 |
SpamapS | oh hm | 20:50 |
SpamapS | I saw that too | 20:50 |
SpamapS | but kind of let it fall by the wayside | 20:50 |
SpamapS | I was wondering if maybe the pipes were open and causing FD issues with git | 20:51 |
SpamapS | still hard timeout with devnull so pipes probably not the culprit | 20:51 |
jeblair | SpamapS: but this only happens sometimes... | 20:51 |
SpamapS | http://paste.ubuntu.com/24659247/ | 20:52 |
SpamapS | my current diff | 20:52 |
SpamapS | jeblair: merger runs an update thread, right? | 20:53 |
jeblair | SpamapS: yes, and that's one of the things that can get stuck and block shutdown | 20:53 |
SpamapS | jeblair: so what if that merger thread is concurrently running forks and picks up FDs opened for ssh-agent | 20:53 |
SpamapS | subprocess is supposed to handle this | 20:54 |
SpamapS | by closing all fds | 20:54 |
SpamapS | but I'm grasping for anything to hold on to now | 20:54 |
jeblair | ya | 20:54 |
SpamapS | would be great if I could attach gdb but get actual python line numbers | 20:56 |
jeblair | SpamapS: that used to work with https://wiki.python.org/moin/DebuggingWithGdb but last time i tried (not recently) it was somehow broken | 20:59 |
jeblair | i just ran ps during a failing test and saw a zombie git process | 21:00 |
clarkb | I want to say I got ^ mostly working, where it falls down is when you are in code outside of stdlib iirc | 21:01 |
jeblair | also, amazingly, it was a single test in the foreground. first time that's happened. test_zuul_refs | 21:02 |
clarkb | whcih was my problem when trying to debug the python3.4 segfaults | 21:02 |
*** dkranz has quit IRC | 21:02 | |
jeblair | oh, yeah, i just tried it. it's producing some output. let's see if it's helpful | 21:04 |
jeblair | be in the zuul root directory when you run it so it can pick up the local files | 21:06 |
SpamapS | jeblair: I'm trying without -D | 21:09 |
jeblair | http://paste.openstack.org/show/610684/ | 21:10 |
jeblair | there's the traceback (via gdb) for the thread stuck on a git operation | 21:10 |
jeblair | its git subprocess is (Z)ombie | 21:10 |
SpamapS | My copious debug logging led me to innerUpdateLoop, but I couldn't crack what was happening there | 21:13 |
SpamapS | specifically updateRepo | 21:13 |
jeblair | SpamapS: http://bugs.python.org/issue12786 | 21:14 |
jeblair | (via https://stackoverflow.com/questions/14615462/why-does-communicate-deadlock-when-used-with-multiple-popen-subprocesses) | 21:14 |
jeblair | SpamapS: looks like your fd hunch may have been right | 21:14 |
jeblair | SpamapS: i am running in 2.7; are/were you when you saw the git hangs? | 21:15 |
SpamapS | jeblair: yes, 2.7 | 21:16 |
SpamapS | the paint is still a little wet on 3.5 ;) | 21:16 |
jeblair | SpamapS: yeah, that's what i thought too. :) i'm adding close_fds=True to the ssh-agent and re-running now | 21:17 |
SpamapS | but I'm on 2.7.12 | 21:17 |
SpamapS | jeblair: reading the python issue.. this definitely sounds like our problem | 21:18 |
Shrews | OMG OMG OMG | 21:18 |
Shrews | i have left the asyncio hell | 21:19 |
jeblair | Shrews: yay! | 21:20 |
SpamapS | jeblair: oh doh.. 3 changed the default | 21:20 |
* SpamapS shoudl have tried 3.5 | 21:20 | |
SpamapS | running without -D also seems to have had a positive effect I think | 21:20 |
SpamapS | which makes sense, because ssh-agent detaches and exits relatively quickly. | 21:20 |
jeblair | SpamapS: yeah, though then we would have thrown up our hands and wondered why 3 "fixed" it :) | 21:20 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: Add web-based console log streaming https://review.openstack.org/463353 | 21:20 |
SpamapS | does anybody else get a spray of logs to stdout just before the tests succeed? | 21:21 |
Shrews | jeblair: SpamapS: ^^^ changed from run_forever to run_until_complete on a Future, where I set the Future (scheduled in the event loop) in our stop() | 21:22 |
SpamapS | Shrews: oh yay | 21:22 |
Shrews | i am now going to drink copious amounts of alcohol | 21:22 |
SpamapS | if close_fds=True solves this issue I may do the same | 21:23 |
SpamapS | jeblair: FYI, the run without -D did still get one deadlocked test process, so it just narrowed the window. I'm re-running with my non -D patch, and close_fds=True | 21:24 |
mordred | Shrews: yay indeed! | 21:24 |
jeblair | Shrews: cool -- though ultimately we shouldn't have any signal stuff in the streamer. did you try the thread idea? | 21:24 |
Shrews | nope. was running down some other last minute hunches | 21:25 |
Shrews | plus i read some scary things about threads and event loops | 21:25 |
Shrews | but i'll look more closely tomorrow | 21:26 |
SpamapS | IIRC, usually one has an event-loop per thread and dispatches things between them using queues. | 21:26 |
Shrews | jeblair: i THINK it will work by moving the signal handling back out of the streamer | 21:27 |
* Shrews tests real quick | 21:27 | |
jeblair | yeah, we're not trying to do anything funny with the event loop; we just want to start it in its own thread, and tell it to stop. | 21:27 |
jeblair | and we're going to want it in its own thread too so it follows the pattern of the other services, and so we can start/stop it from tests | 21:27 |
Shrews | jeblair: hrm, nope. gotta be done in the event loop | 21:29 |
jeblair | Shrews: what does? | 21:29 |
Shrews | signal handling | 21:29 |
jeblair | Shrews: did you put it in another thread? | 21:29 |
Shrews | (for the current method that i have working) | 21:29 |
Shrews | no | 21:29 |
Shrews | will try tomorrow | 21:29 |
jeblair | Shrews: right, i don't expect that to work with the current method because of the way signals are delivered. they are always delivered to the main thread. so the trick is to put the event loop in a new (second) thread, so that the main thread just waits for a signal. then it tells the event loop to stop when it gets a signal. | 21:30 |
SpamapS | jeblair: close_fds worked for me! | 21:30 |
jeblair | SpamapS: it's working for me, so far, inasmuch as i'm getting real test timeouts and not alarm clocks :) | 21:31 |
SpamapS | Ran 274 (+5) tests in 455.393s (-103.340s) | 21:31 |
SpamapS | PASSED (id=88, skips=18) | 21:31 |
SpamapS | when running with my non -D patch as well | 21:31 |
SpamapS | already have 4 rogue ssh-agents though | 21:32 |
jeblair | oh, my ssh-adds are failing, that's why all the failures | 21:33 |
SpamapS | I have a patch for that | 21:34 |
SpamapS | I'm going to push up this ssh agent patch as a new change ID so we can see the same things | 21:34 |
jeblair | SpamapS: http://paste.openstack.org/show/610686/ | 21:35 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add SSH Agent Primitives and usage https://review.openstack.org/468173 | 21:35 |
jeblair | SpamapS: that's the other thing we should include | 21:35 |
SpamapS | Ah yeah | 21:36 |
SpamapS | saw those missing threads too which were confusing me | 21:36 |
SpamapS | I thought it was just fallout | 21:36 |
SpamapS | ok 468173 is without -D, and with an attempt at a singleton one-per-test-run-process SSH key generator | 21:36 |
SpamapS | the __del__ approach there produces wonky stuff htough | 21:37 |
SpamapS | though | 21:37 |
SpamapS | Exception TypeError: TypeError("'NoneType' object is not callable",) in <bound method SshKeyResource.__del__ of <tests.base.SshKeyResource object at 0x7f4ea8a23610>> ignored | 21:37 |
jeblair | SpamapS: oh, that also needs initializers, i left that off the diff, but you get the idea | 21:37 |
jeblair | SpamapS: regarding the key -- i was thinking it might be nice to just generate a key and stick it in as a test fixture so we don't have to do that live | 21:38 |
jeblair | SpamapS: we do that for the RSA keys | 21:38 |
*** hashar has quit IRC | 21:38 | |
SpamapS | jeblair: oh good idea | 21:39 |
SpamapS | that crossed my mind but this was more fun ;) | 21:39 |
SpamapS | I'll rework around that | 21:39 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add SSH Agent Primitives and usage https://review.openstack.org/468173 | 21:40 |
SpamapS | oh I had another thought which was that we probably don't need to explicitly call SshAgent.stop in AnsibleJob.stop because it's already called in the finally block of execute | 21:40 |
jeblair | SpamapS: ^ i pushed the .thread attribute safety as a new PS to your new chang ^ | 21:41 |
jeblair | SpamapS: oh, even better. :) | 21:41 |
jeblair | SpamapS: let's go with that. | 21:41 |
SpamapS | jeblair: and how about -D vs. not? | 21:44 |
SpamapS | I'm still of two minds | 21:44 |
SpamapS | jeblair: thanks so much for the help on debugging this. I was definitely at the end of the rope. | 21:45 |
SpamapS | I'm going to collect my thoughts and see about putting in a final version of the patch we can be proud of :) | 21:45 |
jeblair | SpamapS: yeah, i'm not sure what the right answer is there yet.... | 21:49 |
jeblair | SpamapS: that patch passes tests for me | 21:54 |
jeblair | oh look, my "actually check out the right branches of stuff" patch series is all green. that's 461176 through 467779 if folks want to start taking a look. it should be ready for review. | 21:56 |
jlk | oh | 22:04 |
jeblair | mordred: reviewed your group/inventory changes | 22:13 |
*** jamielennox|away is now known as jamielennox | 22:36 | |
* SpamapS never thought watching tests pass would bring tears to his eyes.. but... today... | 22:43 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add SSH Agent Primitives and usage https://review.openstack.org/468173 | 22:49 |
SpamapS | jeblair: ^ Let's stick without -D. It makes us work on more platforms and should work well enough. | 22:50 |
SpamapS | I also like that after .communicate() I know the agent is up and working. | 22:50 |
SpamapS | (vs. with -D where I have to select.select on the stdout) | 22:51 |
jeblair | SpamapS: wfm | 22:51 |
* SpamapS restacks bwrap on that | 22:52 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add support for bwrap https://review.openstack.org/453851 | 22:52 |
mordred | jeblair: thanks! (I forgot to say that before I started working on addressing your reviews a little while ago) | 22:53 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Add support for defining groups in nodesets https://review.openstack.org/467611 | 22:53 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Put variables into the inventory https://review.openstack.org/467635 | 22:53 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Write inventory as yaml not ini https://review.openstack.org/467634 | 22:53 |
mordred | jeblair: hopefully those will be mo better | 22:53 |
mordred | jeblair: and yay on check out the right branches of stuff! | 22:54 |
mordred | jeblair: I believe I've read this stack many times already ... I know I've read the commit message at least multiple times | 22:55 |
SpamapS | doh | 23:10 |
SpamapS | http://paste.openstack.org/show/610690/ | 23:10 |
SpamapS | have to muck with the perms | 23:10 |
jeblair | mordred: i just assume you're always saying "thanks!" | 23:22 |
jeblair | SpamapS: ah yeah; we could copy it into the test root and muck it up there | 23:22 |
SpamapS | jeblair: testing that now | 23:26 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add support for bwrap https://review.openstack.org/453851 | 23:29 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Add SSH Agent Primitives and usage https://review.openstack.org/468173 | 23:29 |
jlk | jeblair: what happens if you have a pipeline that triggers on push, but a requirement of "open"? A push event doesn't have a change that can either be open or closed. | 23:32 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Revert "Use devstack's zookeeper support" https://review.openstack.org/468205 | 23:32 |
jeblair | jlk: probably nothing? | 23:32 |
jeblair | jlk: ie, all changes rejected | 23:33 |
jlk | It looks like the "open" attribute to a change only gets created/updated when _updateChange is called | 23:33 |
jlk | so it kind of looks like you'd get a traceback on a missing attribute in the model | 23:33 |
jeblair | jlk: yeah, probably needs tidying up and has never been exercised due to the contradiction | 23:35 |
jlk | I'll put a thumbtack in that | 23:35 |
jlk | (as I'm implementing it for github) | 23:35 |
jlk | Probably need to test a number of things with a push event instead of a change event for requirements and such | 23:36 |
mordred | jeblair: my updates to the inventory group patches are still failing - but not in a super weird way or anything - I'll look at them tomorrow and get them fixed | 23:41 |
jeblair | mordred: yeah, i think the first change is fine; that py35 failure looks unrelated | 23:42 |
jeblair | i rechecked | 23:42 |
jeblair | (but something to keep an eye on) | 23:42 |
jeblair | the second one def has a real problem | 23:42 |
jeblair | mordred: the second one is just some job name mismatches | 23:44 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Make zuul ansible dir self-contained https://review.openstack.org/468208 | 23:45 |
SpamapS | mordred: ^ | 23:45 |
SpamapS | I think that makes it work inside the bwrap | 23:46 |
SpamapS | but I'm out of runway for today.. | 23:46 |
* SpamapS EOD's | 23:46 | |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Implement github pipepline req of open https://review.openstack.org/468213 | 23:56 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!