Tuesday, 2017-04-25

00:01 <SpamapS> jeblair: AH right ok
00:01 * SpamapS may poke at that, I understand now
00:01 <SpamapS> tho I kinda wanna bang out the last few test re-enables first
00:01 * SpamapS EOD's
00:09 <clarkb> wow is it that time already?
00:12 <clarkb> jeblair: can you look at test_client_enqueue_negative? is the reuse of client there valid when using shutdown?
00:14 <clarkb> test_client_promote_negative does it too
00:21 <clarkb> I think those are likely passing because they raise the expected exceptions, but because the client is shutdown, not because it actually spoke through gearman and failed /me will adjust
00:51 *** Shrews has quit IRC
00:57 <jeblair> clarkb: the exception check ostensibly checks the specific error, so it seems like they ought to be valid checks (ie, not fooled by the client being shutdown).  however, i agree, the reuse of client is wrong and should cause problems.  so you're almost certainly on to something.  :)
00:58 <clarkb> ya I think I have this mostly sorted out
00:58 <clarkb> It's ~133 tests in and no fails except for 3 where I derped a cleanup
01:02 <clarkb> (still not fast but seems stable at least)
01:14 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Don't run apsched thread when no jobs need scheduling  https://review.openstack.org/457798
01:14 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use current class name on super call  https://review.openstack.org/459409
01:14 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Add ability to stop drivers  https://review.openstack.org/459477
01:14 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use cStringIO if available.  https://review.openstack.org/459478
01:14 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Join watchdog thread so it doesn't leak in tests  https://review.openstack.org/459479
01:14 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests  https://review.openstack.org/459480
01:14 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests  https://review.openstack.org/459481
01:14 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Error if threads leak.  https://review.openstack.org/459482
01:14 <clarkb> that's today's braindump
01:14 <clarkb> I'm not sure if the last one will pass as I haven't been able to run the full suite locally yet
01:14 <clarkb> But it's way better than it has been, I think
01:15 <jeblair> clarkb: i haven't looked, but it's possible that could cause follow-on failures
01:15 <jeblair> clarkb: like, if one test fails, then all subsequent tests in that test runner will also fail
01:15 <jeblair> which is a bit distracting; hard to see where the real failure is
01:15 <clarkb> jeblair: the error-if-threads-leak change?
01:15 <clarkb> I don't think it should
01:15 <jeblair> clarkb: yeah.
01:16 <clarkb> cleanups should run regardless of other cleanups raising
01:16 <clarkb> and it's at the end of that cleanup, so I think we are fine
01:16 <jeblair> clarkb: if a test fails, it seems quite likely that it may fail to clean up properly, and threads may leak.
01:16 <clarkb> even failed tests should clean up properly though...
01:17 <clarkb> that's why cleanup is better than teardown
01:17 <clarkb> anyways I can sort out how to make that assertion stronger later. I do think we need something like it though
01:17 <clarkb> since leaked threads are very expensive based on my local job runs
01:17 <jeblair> ideally yes.  we can choose to make that a priority.  i'm just saying it's probably not the case now.  we'll probably be mopping that up for a while.
01:23 <clarkb> one reason I want it to fail is so that you see leaked threads on the first test that leaks them (since otherwise logs are not captured by default)
01:23 <clarkb> but I can work on that
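
For context, a hedged sketch of the sort of check "Error if threads leak." (459482) adds; the real change likely differs. It also illustrates the "cleanup is better than teardown" point above: addCleanup callbacks run even when the test body or another cleanup raises.

```python
import threading
import unittest


class BaseTestCase(unittest.TestCase):
    # Threads allowed to outlive a test; a named watchdog thread could be
    # whitelisted here, as discussed later in the day.
    WHITELIST = ('MainThread',)

    def setUp(self):
        super(BaseTestCase, self).setUp()
        # Cleanups run LIFO and run even if the test or another cleanup
        # fails, so this fires on the first test that leaks a thread.
        self.addCleanup(self._check_leaked_threads)

    def _check_leaked_threads(self):
        leaked = [t.name for t in threading.enumerate()
                  if t.name not in self.WHITELIST]
        self.assertEqual([], leaked, 'leaked threads: %s' % leaked)
```
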
01:30 <clarkb> I think http://logs.openstack.org/81/459481/1/check/gate-zuul-python27-ubuntu-xenial/02d8108/console.html is possibly a race introduced by my changes too :/ I touched the ansible execution code a little. Will have to poke more tomorrow
01:45 <clarkb> jlk: SpamapS: would appreciate it if you could grab that stack and see if it's better for you too
01:57 <jlk> kk
02:09 <clarkb> hrm, tests.unit.test_scheduler.TestScheduler.test_needed_changes_enqueue has been running for ~8 minutes
02:10 <clarkb> might have to look at that test more closely too
03:38 <jamielennox> jeblair: so backscroll, but do you consider py3 and py3 gear a blocker for that
03:39 <jamielennox> jeblair: ie, is py3 a specific valuable task that someone could work on today?
03:42 <jamielennox> SpamapS obviously has some influence in the gear space, so i can talk with him about a path to getting that done
03:42 <jamielennox> because it definitely makes sense that the proxying service be asyncio based
05:09 <tobiash> morning
05:10 <tobiash> what do you think about enriching nodepool logging with the extra keyword argument?
05:11 <tobiash> with that we could add e.g. the node id as extra metadata
05:14 <tobiash> using log formatters which support this extra metadata, this would make it easy to e.g. track/filter the lifecycle of a node in the logs
05:15 <tobiash> I use logstash_formatter, which formats the log to json (including the extra dict), which is pushed directly to elasticsearch
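
For illustration, a minimal stdlib-only sketch of the `extra` mechanism tobiash is describing; the field name and format string are invented:

```python
import logging

# Fields supplied via "extra" become attributes on the LogRecord, so a
# structured formatter (logstash_formatter, etc.) can emit them as json
# fields; a plain format string can reference them too.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s node=%(node_id)s %(message)s')
log = logging.getLogger('nodepool')

log.info('launching node', extra={'node_id': '0001234'})
# Caveat: with node_id in the format string, records logged without that
# key in "extra" will fail to format.
```
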
05:31 *** yolanda has quit IRC
05:54 <SpamapS> jamielennox: yeah I think the idea is that to do websockets properly the right path is asyncio
06:29 *** bhavik1 has joined #zuul
06:58 *** bhavik1 has quit IRC
07:04 *** yolanda has joined #zuul
07:15 *** isaacb has joined #zuul
07:41 *** jeblair_ has joined #zuul
07:41 *** jeblair has quit IRC
07:44 *** yolanda has quit IRC
07:56 *** hashar has joined #zuul
08:00 *** yolanda has joined #zuul
08:17 *** yolanda has quit IRC
08:33 *** yolanda has joined #zuul
09:22 *** yolanda has quit IRC
09:37 *** yolanda has joined #zuul
10:49 *** jkilpatr has quit IRC
11:05 *** yolanda has quit IRC
11:11 *** jkilpatr has joined #zuul
11:29 *** Shrews has joined #zuul
11:56 <mordred> jamielennox, SpamapS: yes re: websockets and asyncio - but I don't think py3/py3 gear needs to be a blocker for getting the websockets service done
11:56 <mordred> it'll only be a blocker for writing unit tests for the websockets service
11:57 <mordred> I say that because getting gear python3'd has already eaten one person and is an easy hole to fall down and be frustrated inside of
11:58 <mordred> while getting the websockets streamer going should actually be fun
11:58 <mordred> OH
11:59 <mordred> well - I mean, tox -epy35 isn't really going to do anything right now - so I suppose one could actually even write unit tests for the websockets streamer and just put something into the tox definition that excludes the non-websockets unit tests from running under 35
12:00 <mordred> so if it's a separate little daemon, and it fetches the information it needs about where the logs are from an http call to status.json ...
12:00 <mordred> I don't think there are any blockers
12:02 <mordred> (it wants to make the rest call, fwiw, so that it's not just an open proxy to streaming from random ports - if the browser->websocket protocol is "please give me the stream for job XXX and optionally for logfile YYY", then the websocket server asks zuul "hey, where's XXX and YYY?" it should be a fairly closed loop)
12:03 <mordred> also, Shrews is currently working on the thing the websocket server will be streaming _from_, so if anybody does decide to dive on that topic, syncing with him will likely be helpful
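
A sketch (not zuul's actual code) of the streaming-proxy shape mordred describes, using the third-party "websockets" and "aiohttp" libraries. The status.json URL, the request format, and how a build's stream address appears in the status data are all assumptions here:

```python
import asyncio
import json

import aiohttp
import websockets

STATUS_URL = 'http://zuul.example.org/status.json'  # hypothetical URL


async def find_stream(uuid):
    # Ask zuul where the live log stream for this build is; looking the
    # build up server-side keeps this from being an open proxy to
    # arbitrary client-supplied host/port pairs.
    async with aiohttp.ClientSession() as session:
        async with session.get(STATUS_URL) as resp:
            status = await resp.json()
    # How builds appear in status.json is an assumption for this sketch:
    build = status['builds'][uuid]
    return build['host'], build['port']


async def handler(websocket, path):
    # Hypothetical protocol: the browser sends {"uuid": "..."} and gets
    # the raw log stream back over the websocket.
    request = json.loads(await websocket.recv())
    host, port = await find_stream(request['uuid'])
    reader, writer = await asyncio.open_connection(host, port)
    try:
        while True:
            chunk = await reader.read(4096)
            if not chunk:
                break
            await websocket.send(chunk)
    finally:
        writer.close()


loop = asyncio.get_event_loop()
loop.run_until_complete(websockets.serve(handler, '0.0.0.0', 8000))
loop.run_forever()
```
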
12:17 <jamielennox> mordred: so i was mostly asking because if that just needed someone to put in the work to unblock something, i could pick it up
12:17 <jamielennox> mordred: but so gear is apparently already running with py35, but it has no asyncio
12:18 <jamielennox> mordred: and a google search for asyncio gearman has 3 libraries in the top 4 results; what was the initial rationale for that being an openstack library?
12:19 <openstackgerrit> Tristan Cacqueray proposed openstack-infra/zuul master: Don't getChange on source not triggering a change  https://review.openstack.org/406129
12:21 *** yolanda has joined #zuul
13:10 <openstackgerrit> Tobias Henkel proposed openstack-infra/nodepool feature/zuulv3: Support externally managed images  https://review.openstack.org/458073
13:14 *** atarakt has quit IRC
13:28 *** yolanda has quit IRC
13:40 <pabelanger> https://admin.fedoraproject.org/pkgdb/package/rpms/zuul/
13:40 <pabelanger> \o/
13:44 *** yolanda has joined #zuul
13:52 *** hashar is now known as wmf-SWAT-shop
13:52 *** wmf-SWAT-shop is now known as hashar
14:04 <pabelanger> zuulv3-dev.o.o back online, but I see an exception
14:04 <pabelanger> http://paste.openstack.org/show/607867/
14:05 <pabelanger> possibly another configuration change?
14:10 <pabelanger> Hmm, possibly a bug. According to the docs, server should be the default for canonical_hostname
14:11 <Shrews> commit 1c7744 added canonical hostname to the source obj. probably related
14:11 <pabelanger> thanks
14:11 <pabelanger> I think we want our configuration to be the following now
14:12 <pabelanger> server = review.openstack.org
14:12 <pabelanger> canonical_hostname = git.openstack.org
14:12 <jeblair_> yes, though it should work without that, so we should fix the error.
14:13 <pabelanger> agree
14:15 *** jeblair_ is now known as jeblair
14:17 *** isaacb has quit IRC
14:17 <jeblair> pabelanger: i'm going to do some editing/debugging on zuulv3-dev, ok?
14:17 <pabelanger> jeblair: sure
14:17 <pabelanger> going to grab a coffee
14:17 <mordred> jamielennox: we wrote gear originally just because we needed a better/simpler library for infra things back in the day (the other python libs were either incorrect or tried to be too clever)
14:18 <mordred> jamielennox: I think if we have the streamer get the needed data via http and not via gear, we can avoid the need to get gear ported to py3 for now (it still needs to be done to support zuul on py3, but there's no need to block the websocket streaming on gear getting a py3 update)
14:21 *** yolanda has quit IRC
14:32 <jeblair> pabelanger: oh, it's because of the 'legacy gerrit section'.  it's time to remove that.  :)
14:33 <jeblair> pabelanger: [gerrit] needs to be [connection gerrit]
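
Putting the pieces of this exchange together, the resulting zuul.conf section would look roughly like the following; this is a sketch assembled from the values quoted above, not the actual production file:

```ini
[connection gerrit]
driver=gerrit
server=review.openstack.org
canonical_hostname=git.openstack.org
```
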
14:36 <pabelanger> Ah, cool!
14:36 *** yolanda has joined #zuul
15:15 *** yolanda has quit IRC
15:23 <pabelanger> apparently gnocchi is moving out of the openstack namespace. Not sure what they are planning on for testing, but maybe another project to add to the list for our zuulv3 github connection
15:25 <clarkb> well they weren't using our testing before, just merge check and noop jobs.
15:25 <clarkb> or is that just debs ETOOEARLY
15:26 <clarkb> oh that's transitional, gah I need more caffeine
15:26 <pabelanger> I think they had python-jobs at one point
15:26 <pabelanger> and publish-to-pypi
15:26 <clarkb> pabelanger: ya, the change diff I saw was the removal of their transition-out change
15:27 <clarkb> so the first change is to remove python jobs et al and replace with noop jobs
15:27 <clarkb> then a change to remove noop jobs, and I saw this one
15:30 *** yolanda has joined #zuul
15:54 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Error if threads leak.  https://review.openstack.org/459482
15:54 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests  https://review.openstack.org/459481
15:54 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests  https://review.openstack.org/459480
15:54 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Join watchdog thread so it doesn't leak in tests  https://review.openstack.org/459479
15:54 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Properly record timed out builds  https://review.openstack.org/459767
15:54 <clarkb> ok that should fix up the watchdog-related error
15:54 <clarkb> and it fixes a bug I found in the process
15:54 *** Shrews has quit IRC
15:54 *** Shrews has joined #zuul
15:55 <clarkb> SpamapS: jlk: ^ I still have one test worker crash locally so I don't run the full suite, but this seems to get me reliably completing the test suite, and things are generally happier than they were, so reviews and test driving would be great
15:55 *** Shrews has quit IRC
15:55 <clarkb> now to turn the timeout to gentle and see if I can't sort that out
15:58 *** Shrews has joined #zuul
15:59 <Shrews> \o/  got my irc bouncer back
15:59 <Shrews> no thx to vexxhost support, though
16:06 <jeblair> Shrews: regarding 456685, i just remembered that the random tmp dir name has a security component.  it seems like the build uuid might be sufficiently secure, but i think we need to be as careful in creating that as mkdtemp.
16:06 <jeblair> Shrews: i left a comment inline.  i think what it describes will be sufficient.  if we're not comfortable with that, your previous version (which used a prefix and glob) would work.
16:07 <jeblair> Shrews: we may want to ask SpamapS to look at this too since he has opinions on this sort of thing.  :)
16:09 * SpamapS awake
16:09 <Shrews> jeblair: ack
16:11 <SpamapS> I think it's probably unpredictable enough. Unless it's sitting around in a globally readable place for a long time (more than 5 seconds == long) before the dir gets created.
16:12 <SpamapS> seems like build UUIDs are only internal to zuul and the gearman jobs
16:13 <jeblair> SpamapS: it will be exposed by the status json, and if it takes a while for a node to show up, it could sit there for a while.
16:13 <clarkb> tests.unit.test_scheduler.TestScheduler.test_needed_changes_enqueue seems to be the problem test based on my grepping of subunit
16:14 <clarkb> the gentle timeout also doesn't seem strong enough
16:14 <jeblair> SpamapS, Shrews: so we may not be able to rely on the element of surprise.  given that, is the mkdir sufficient?  (it should fail if it already exists)
16:14 <SpamapS> jeblair: yeah, I think it being a dir name and not a file, we can relax with a mkdir that fails.
16:15 <jeblair> cool, yeah that seems like a nice benefit of using a dir
16:15 *** yolanda has quit IRC
16:15 <SpamapS> since the trouble would only come if they symlinked the UUID to a place with files that would match what gets written into a job dir.
16:16 <Shrews> jeblair: i suppose still using os.makedirs() on the root (not including the build uuid) is necessary, yeah?
16:16 <SpamapS> and then zuul would wreak havoc on that dir. So yeah, just make sure the mkdir raises on any existing thing.
16:16 <jeblair> Shrews: i don't think we do that now; i think we expect the root to exist.
16:17 <jeblair> Shrews: er, i'm using 'root' to mean the 'root' argument as passed in to jobdir
16:17 <Shrews> jeblair: ok. was wondering because all of the following code uses that
16:18 <jeblair> Shrews: also, fwiw, JobDir is used in exactly one place, so you can drop the conditionals (all arguments are, in reality, required).
16:19 <jeblair> that might make this simpler to reason about
16:25 <Shrews> jeblair: it looks like 'root' can actually be None
16:25 <Shrews> so should probably keep that conditional
16:25 <jeblair> Shrews: oh, and that uses TMPDIR?  sounds right then.
16:27 <Shrews> i'll make the other two params really required
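
A sketch of the jobdir-creation behavior agreed on above (not the actual zuul JobDir code): since the build UUID is visible in the status json, the directory name can't be a surprise, so the creation itself has to refuse to reuse anything an attacker planted.

```python
import os
import tempfile


def make_job_dir(root, build_uuid):
    """Create <root>/<build_uuid> about as carefully as mkdtemp would.

    A plain os.mkdir (not makedirs, no exist_ok) raises OSError/
    FileExistsError if *anything* -- file, dir, or symlink -- already
    occupies the path, so a planted entry cannot be silently reused.
    """
    if root is None:
        root = tempfile.gettempdir()  # honors TMPDIR, per the discussion
    path = os.path.join(root, build_uuid)
    os.mkdir(path, 0o700)  # restrictive mode, like mkdtemp
    return path
```
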
16:28 <jeblair> clarkb: re 457798 -- how does that leak happen?
16:29 <clarkb> jeblair: https://review.openstack.org/#/c/459477/1 is what properly addresses the leak, since we never actually stop the driver otherwise
16:30 <clarkb> jeblair: so every time a test creates a new scheduler you get a new apsched thread
16:30 <jeblair> clarkb: ok, so i think we may not want 457798 (it has a structural problem i don't see a good way to fix; comment inline)
16:31 <clarkb> ya I see that now
16:31 <clarkb> jeblair: I'll wait for more reviews on the stack if you are going through it, then address them in one go (rather than push a bunch)
16:32 <jeblair> kk
16:32 *** yolanda has joined #zuul
16:34 <clarkb> there is still something causing the tests to run longer under testr than testtools.run
16:34 <clarkb> test_crd_branch is ~40 seconds vs ~16 seconds
16:34 <clarkb> so it's actually quicker for me to do for X in `cat testlist` ; do testtools.run $X ; done than testr run
16:35 <jeblair> that's so weird
16:35 <clarkb> also more reliable
16:36 <clarkb> my hunch is that it's the processing of the subunit stream
16:37 <openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for using build UUID as temp job dir name  https://review.openstack.org/456685
16:37 <clarkb> however the time difference is only apparent when all the tests are run together and not individually, so maybe not
16:39 <clarkb> *** Error in `python': free(): invalid pointer: 0x00007fdf880000a8 *** <- that just happened running tox -epy27 -- test_crd_branch
16:39 *** Shrews has quit IRC
16:40 <jeblair> clarkb: 459479 -- it seems like even if that does leak, it won't leak for more than 10 seconds.  the fix may cause an extra 0-10 second delay in production after every job completes.  is it enough of a problem in the tests to make that worthwhile?
16:41 <clarkb> jeblair: the delay should only be up to one second
16:41 <clarkb> and it actually makes the job that tests for timeouts quicker
16:42 <jeblair> clarkb: i don't think we should increase the watchdog interval in production.  we can do it in tests, but in prod i think we should stick with 10s.  we're going to have at least 100 of those threads in production; having them all poll every second will be a considerable amount of context switching and python GIL locking.
16:42 <jeblair> or, i guess that should have said 'decrease the watchdog interval'.
16:42 <clarkb> the reason to clean it up is so that we can assert the end state of tests more strongly; we could whitelist the watchdog thread when we do that
16:43 <clarkb> so the change could be to add a name to the watchdog thread, then whitelist it in the tests rather than shorten sleep and join
16:44 *** yolanda has quit IRC
16:44 <jeblair> and that amount of thread leakage will be okay, you think?
16:45 <clarkb> ya, especially since that thread will be on a 10 second timeout
16:45 <jeblair> wfm then.
16:45 <clarkb> I think the costly threads are gearman and kazoo because they go as quickly as possible doing nothing in the background but generating log noise
16:45 <clarkb> oh, them and apsched
16:45 <clarkb> for the same reason
16:46 <jeblair> clarkb: i think gearman has a 2s sleep, but yeah, it adds up.
16:50 <jlk> o/
16:51 <jeblair> clarkb: stack review complete
16:52 <clarkb> jeblair: this stack appears to have made my local runs of CRD tests not error on timeout waiting for zuul to settle. So ya, I think it does add up. Unfortunately it hasn't corrected the "tests are generally slower under testr" problem :/
16:57 *** Shrews has joined #zuul
16:57 <SpamapS> clarkb: I'm at home with a sick baby today, so I'll take a look when she goes down for a nap.
16:58 <SpamapS> "at home" vs. "working at home" ;-)
17:01 *** hashar_ has joined #zuul
17:01 *** hashar_ has quit IRC
17:13 <jlk> So what's with the "queue" key in a project definition?
17:13 <jlk> and queue-name in a job definition?
17:14 <jeblair> jlk: queue-name should go away (it's zuulv2 only)
17:15 <jeblair> jlk: queue in a project assigns that project to a shared change queue in a dependent pipeline.
17:16 <jlk> jeblair: is that how multiple unrelated projects can use the same "dependent" manager pipeline, but each have their own queues?
17:16 <jlk> projects A, B, C all use one queue, project D uses a different one, and thus is not dependent on jobs of A, B, C?
17:16 <jeblair> jlk: exactly.
17:16 <jlk> cool, I've had that question bouncing in my head recently
17:17 <jeblair> jlk: in zuulv2, combining them was automatic (based on job name).  but in v3, we don't assume anything about job names, so you have to manually group projects together like that.
17:17 <jlk> ah
17:17 <jlk> oh yeah, that is bad for us on v2, where everything shares a single job name: the job where we run whatever "run.sh" is in the project repo
17:18 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove queue-name from job schema  https://review.openstack.org/459792
17:18 <jeblair> jlk: yeah.  it was designed under the assumption of job names like "gate-nova-python27".
17:18 <jlk> makes sense
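
A hedged sketch of the shared-queue grouping jeblair describes: project, job, and queue names are invented, and details of this era's v3 syntax may differ.

```yaml
- project:
    name: org/project-a
    gate:
      queue: integrated     # A, B and C name the same queue...
      jobs:
        - project-a-unit

- project:
    name: org/project-d
    gate:
      queue: solo           # ...while D gates independently of them
      jobs:
        - project-d-unit
```
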
17:19 <jlk> so, only a tiny bit related: I'm working on the patch that adds dependent queue support to github
17:20 <jlk> and in the tests I'm failing, because the call in manager/dependent.py getChangeQueue() is calling out to self.pipeline.getQueue(change.project), which then looks at self.queues and tries to see if "project" is in queue.projects
17:20 <jlk> and while it looks like it, the comparison is failing for some reason
17:20 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Error if threads leak.  https://review.openstack.org/459482
17:20 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests  https://review.openstack.org/459481
17:20 <jlk> so I'm trying to figure out where equivalence of project objects is determined
17:20 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests  https://review.openstack.org/459480
17:20 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Properly record timed out builds  https://review.openstack.org/459767
17:20 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Whitelist watchdog threads in test thread leak checker  https://review.openstack.org/459479
17:21 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use cStringIO if available.  https://review.openstack.org/459478
17:21 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use current class name on super call  https://review.openstack.org/459409
17:21 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Add ability to stop drivers  https://review.openstack.org/459477
17:21 <clarkb> jeblair: ^ I think that addresses your concerns
17:21 <jlk> I only have one queue in self.queues
17:21 <clarkb> now to abandon the unneeded apsched change
17:21 <jlk> and there is only one project in that queue
17:22 <jlk> and there is only one project in that queue: <Project org/project>
17:22 <jlk> which LOOKS like it should match the project in question, <Project org/project>
17:22 <jlk> (Pdb) project == queue.projects[0]
17:22 <jlk> False
17:22 <jlk> but it doesn't
17:22 <jeblair> jlk: i believe we rely on object identity for equivalence.  Sources are responsible for supplying Project objects, and they are expected to cache them (probably on their driver) so that they always return the same object for a given project.
17:23 <jlk> hrm.
17:24 <jlk> I'll go hunting in that direction, thanks.
17:24 <jeblair> (it's quite possible our data model has evolved to the point where we no longer need to do that and could implement a real __eq__ method (now that we can fully qualify project names with hostnames), but let's defer changing that for later)
17:26 <jeblair> jlk: given that, temporarily throwing an id(self) in the __repr__ for project might help track that down too.
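
A minimal illustration of the identity-equality issue here: without a custom __eq__, two Project objects for the same project compare unequal unless the source returns the exact same cached object. The id(self) in __repr__ is the debugging trick jeblair suggests; the class below is a stub, not zuul's model code.

```python
class Project:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # Including id(self) makes "identical-looking" objects distinguishable
        return '<Project %s %#x>' % (self.name, id(self))


a = Project('org/project')
b = Project('org/project')
print(a == b)   # False: default equality is object identity
print(a == a)   # True: only the same cached object matches
```
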
17:27 <clarkb> tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue is actually the test that appears to time out under a full testr run. It takes ~55 seconds alone, either with testr or testtools. So confused
17:27 <jeblair> clarkb: 15s for me locally
17:27 <jeblair> clarkb: and about 20-25s in the gate
17:28 <jeblair> (alone)
17:28 <clarkb> jeblair: do you see the same time difference between tox -epy27 run tests and testtools.run or similar?
17:28 <jeblair> clarkb: nope
17:28 <jlk> yeah, the IDs are definitely different
17:28 <jeblair> (i checked that yesterday)
17:28 <jlk> unfortunately how the github code is handling projects basically mirrors how gerrit does.
17:28 <jlk> unless...
17:29 <jlk> nope, doesn't look like tests/base.py overrides that
17:29 <jeblair> clarkb: 'nodepool list|grep jeblair' if you want to putz around on the 4 test nodes i held.  ~root/zuul already exists, checked out at feature/zuulv3 with a .tox/py27
17:30 <clarkb> jeblair: thanks
17:33 <jeblair> jlk: yeah, looks like the projects are cached on the connection object.
17:33 <jlk> oh, this is interesting
17:34 <jlk> I think the source object is not being cached
17:34 <jlk> so some calls to self.source.getProject are hitting a _different_ source object.
17:35 <jeblair> jlk: yeah, i think that's okay though, since the different sources should have the same connection object.
17:35 <jlk> hrm.
17:35 <jlk> then it's the connection object that's different
17:35 <jlk> either way, this trip through pdb I'm seeing an empty list of projects even though the first trip through some projects were added
17:36 <jlk> self.connections.getSourceByHostname
17:36 <jlk> I think that failed to find the right source for me
17:36 <jeblair> jlk: which test are you running?
17:37 <jlk> it's a new one in this patch. It's running jobs through a pipeline with a dependent manager
17:37 <jlk> oooooh wait.
17:37 <jlk> what happens if you have two sources that point to the same hostname?
17:37 <jeblair> jlk: yeah, i've got it checked out
17:38 * jlk has a hunch
17:38 <jeblair> jlk: two connections?  bad mojo.
17:38 <jlk> okay, I think I know where this is fubar then
17:38 <jeblair> jlk: i think there's a right way to do that though, lemme look it up
17:39 <jlk> https://review.openstack.org/#/c/444034/7/tests/fixtures/zuul-github-driver.conf
17:39 <jlk> I think this is what did it, in an earlier test
17:39 <jlk> er, earlier patch
17:40 <jeblair> jlk: hrm.  it should actually work -- tests/fixtures/zuul-connections-same-gerrit.conf tests that case
17:40 <jlk> oh, darn.
17:40 <jeblair> but it's possible there's a bug that's slipping by that test case but not yours
17:41 <jlk> well, I'm going to modify that file and see if things change.
17:42 <jlk> sure did
17:42 <jeblair> jlk: generally speaking, the idea is that even if we have two connections for a given hostname, in the tenant config file (main.yaml) we associate each project with a specific connection.  so in your case, if you say "source: github:" in main.yaml, the following projects are associated with that specific connection (vs "source: github_ssh")
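
A hedged sketch of the main.yaml idea jeblair describes; the tenant and project names are invented, and the repo-list key names of this era may have differed:

```yaml
- tenant:
    name: example-tenant
    source:
      github:              # projects listed here use the 'github' connection
        project-repos:
          - org/project
      github_ssh:          # ...and these the 'github_ssh' connection
        project-repos:
          - org/other-project
```
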
17:43 <clarkb> woo, if I increase the test timeout I get a successful tox run on the tip of my stack
17:43 <clarkb> tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue    141.804 so I'm about 20s past the default timeout
17:44 <jeblair> jlk: however, i think the call to getSourceByHostname in scheduler.process_event_queue is wrong; it fails to take that into account.
17:44 <clarkb> next slowest is 98 seconds
17:44 <jlk> jeblair: yeah, I was about to trace around that call
17:44 <jeblair> jlk: my guess is that the gerrit test happens to have the sources in the right order to mask that, but the github test doesn't.
17:47 <jeblair> jlk: i have an idea; gimme a few to hack it up.
17:48 <jlk> okay
17:48 <jlk> I'll keep poking too
17:51 <clarkb> jeblair: http://paste.openstack.org/show/607901/ I'm seeing the testr full-run slowdown on one of your held nodes
17:51 <clarkb> gonna tweak the --concurrency to see if that affects it too (eg less concurrency == slower?)
17:52 <jlk> jeblair: I think the logic in line 108 of zuul/lib/connections.py is suspect.
17:53 <jlk> It'll wind up replacing sources['hostname'] with whatever the last-read connection to the same host is
17:53 <jlk> so the sources list will be incomplete
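
A toy illustration of the suspected bug (made-up names, not the actual connections.py code): indexing sources by hostname alone means a later connection to the same host silently replaces an earlier one.

```python
sources = {}
for conn_name, hostname in [('github', 'github.com'),
                            ('github_ssh', 'github.com')]:
    sources[hostname] = conn_name  # second assignment clobbers the first

print(sources)  # {'github.com': 'github_ssh'} -- the 'github' source is lost
```
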
17:58 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: remove getSourceByHostname  https://review.openstack.org/459821
17:58 <jeblair> jlk: yes, i think that's right.  ^ may be the solution.
18:00 <jeblair> clarkb: yeah, that test is pretty busy so i expect it's especially affected by concurrency.
18:00 <clarkb> jeblair: what I am curious to see is if less concurrency makes it slower or faster. Faster is what you'd expect, because that means more cpu time/less contention
18:01 <clarkb> jeblair: but I'm thinking maybe it's slower, because that means a single runner is running more tests that are leaking something, causing it to be slow?
18:07 <clarkb> jeblair: http://paste.openstack.org/show/607905/ it is indeed slower the fewer cpus we use
18:07 <clarkb> and that scaling factor is very similar to mine compared to running isolated
18:08 *** isaacb has joined #zuul
18:08 <jeblair> clarkb: interesting... though it could also be unfortunate test ordering, yeah?
18:09 <clarkb> jeblair: I don't think so. each process is running fairly isolated. The only shared resource is zk, but you'd expect slower times with more concurrency if it was getting hit hard
18:10 <clarkb> each process runs one test at a time; regardless of order I would expect similar test runtimes
18:10 <jeblair> clarkb: io is shared, which is quite a lot of what these are doing
18:10 <clarkb> unless it is a resource leak early in the test run that negatively affects subsequent tests
18:10 <clarkb> jeblair: right, but you'd expect less concurrency to be faster if that was the issue, not slower
18:11 <clarkb> by reducing concurrency we are reducing total cpu and io load on the system, but the tests run slower
18:12 <jeblair> clarkb: yes, i mostly agree; just holding out the thought that in a pathological case, with concurrency=8 we could be up against cpu-bound tests and with concurrency=4 up against io-bound tests.  that's unlikely, but possible.
18:13 <clarkb> jeblair: I'm going to run it with concurrency=1
18:13 <clarkb> which may rule out that possibility
18:21 <jlk> hey, look at that, all my tests are actually passing now :/
18:36 <clarkb> jeblair: --concurrency=1 on that held node just hit timeout waiting for zuul to settle
18:36 <clarkb> so I think there must be something else going on in here that causes consolidated test processes to perform poorly
18:52 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: remove getSourceByHostname  https://review.openstack.org/459821
18:52 <jeblair> jlk: ^ passes tests now.
18:52 <jlk> woo
18:56 <clarkb> tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue    100.753 <- under the timeout. I killed logging
18:56 <clarkb> I think our logging stuff must be leaking somehow?
19:06 <jeblair> clarkb: that's still 30s longer than run alone, right?
19:07 <clarkb> no, sorry, that's local to me where it took ~55s
19:07 <clarkb> so that's in line with my local isolated run
19:07 <clarkb> I'm testing a change that I think may help; iteration time is so long though
19:07 <clarkb> (while keeping our logging)
19:11 <clarkb> er derp, I see what you are saying
19:11 <clarkb> I don't know how I math'd that wrong
19:38 *** yolanda has joined #zuul
19:55 *** hashar has quit IRC
20:00 <jlk> GREAT NEWS EVERYBODY
20:00 <jlk> er, GOOD NEWS, whatever.
20:00 <jlk> Github branch protection now has an option where new pushes to a PR will dismiss any Reviews on the PR!
20:00 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Cleanup logging handlers  https://review.openstack.org/459849
20:00 <clarkb> jeblair: ^ that's a massive improvement in test runtime, but still slower than isolated
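
A hedged sketch of the kind of handler leak "Cleanup logging handlers" (459849) is about; the actual zuul fixture code differs. If every test installs a handler and never removes it, the root logger fans each record out to all accumulated handlers, slowing every later test in the same worker:

```python
import logging
import unittest


class BaseTestCase(unittest.TestCase):
    def setUp(self):
        super(BaseTestCase, self).setUp()
        handler = logging.StreamHandler()
        logging.getLogger().addHandler(handler)
        # Without this cleanup, handlers pile up test after test:
        self.addCleanup(logging.getLogger().removeHandler, handler)
```
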
20:01 <pabelanger> jlk: success?
20:02 <clarkb> SpamapS: jlk: ^ that change is the one you'll want to grab for local testing
20:02 <jlk> kk
20:04 <clarkb> I need to grab lunch now, but very interested if ^ helps anyone else running tests locally
20:10 *** jkilpatr has quit IRC
20:45 <clarkb> jeblair: http://logs.openstack.org/49/459849/1/check/gate-zuul-python27-ubuntu-xenial/b9421e1/console.html http://logs.openstack.org/77/459477/2/check/gate-zuul-python27-ubuntu-xenial/e93f61e/console.html comparison seems to confirm this change works as expected in the gate too
20:47 *** jkilpatr has joined #zuul
21:19 <clarkb> based on ^ it would be excellent to get reviews on https://review.openstack.org/#/q/topic:test-stability -- jeblair has reviewed most of them, and it should make everyone's dev process happier
21:23 <jeblair> clarkb: nice, those even both ran on internap
21:27 <clarkb> jeblair: yup, so it should be a good comparison. Ran in half the time
21:28 <jeblair> clarkb: +2 from me
21:28 <clarkb> I'm still not sure where the other 40-50s is going locally
21:28 <clarkb> but guessing it's another leak in the same sort of genre
21:30 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove queue-name from job schema  https://review.openstack.org/459792
21:36 *** isaacb has quit IRC
21:48 <SpamapS> oh wow.. github has added the ability to dismiss PR reviews on push to the pr source.
21:48 <SpamapS> I guess they did get enough angry fist shakes
22:03 <clarkb> jeblair: ok, I just finished a successful --concurrency=1 run locally. And the individual tests had runtimes closer to what I would expect. So I think the remaining 40-50s is contention over shared resources like disk IO and zk
22:03 <clarkb> jeblair: I'm going to treat this as "solved" for now as a result of ^ those results
22:03 <jeblair> clarkb: w00t, thanks!
22:03 <clarkb> tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue    61.805
22:03 <clarkb> Ran 236 tests in 1614.315s (+1008.612s) -- wall time is much much longer but individual tests run quicker
22:08 <jesusaur> clarkb++ continues to be awesome
22:21 <jlk> woo
22:21 <jlk> Reviews on https://review.openstack.org/459821 would help me a lot in my progress :)
22:22 <jeblair> clarkb, mordred: i think you can help jlk there ^
22:23 <Shrews> clarkb: question for you on 459480
22:25 <jeblair> jlk, mordred, SpamapS: i have reviewed the first 4 github-v3 changes.  i think 439834 and 443947 open a set of related questions on the subject of driver->zuul event manipulation which would be good to come to consensus on.
22:25 <jlk> thanks! I saw those hit my inbox, and I briefly went over them. I'll probably have replies on them tomorrow morning
22:26 <clarkb> Shrews: responded
22:26 <clarkb> Shrews: that's actually something that I think we likely want to get better at: constructing things such that the cleanups are in place before they have a chance to break
22:27 <jeblair> clarkb: cleanups are all in individual try/except clauses, right?
22:27 <clarkb> jeblair: yes, one cleanup failing doesn't prevent the next from running
22:27 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Add ability to stop drivers  https://review.openstack.org/459477
22:27 <clarkb> jeblair: unless you sys.exit or something
22:28 <jeblair> clarkb: so yeah, i agree, that's probably the safer method, and we may want to move many of our cleanups up a line.
22:28 <Shrews> clarkb: k. iirc, the disconnect() checks to see if it's actually connected first anyway (assuming it was copied directly from nodepool's zk.py)
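
A minimal sketch of the "cleanups in place before they can break" pattern being discussed; FakeClient is a stand-in, not nodepool's actual client.

```python
import unittest


class FakeClient(object):
    connected = False

    def connect(self):
        self.connected = True
        raise RuntimeError('connection blew up partway through')

    def disconnect(self):
        # Like nodepool's zk.py disconnect(), tolerate never having
        # (fully) connected.
        self.connected = False


class Example(unittest.TestCase):
    def test_connect(self):
        client = FakeClient()
        # Register the cleanup *before* the call that can fail, so even a
        # partial connect is torn down and nothing leaks into later tests.
        self.addCleanup(client.disconnect)
        self.assertRaises(RuntimeError, client.connect)
```
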
22:30 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove queue-name from job schema  https://review.openstack.org/459792
22:32 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Use current class name on super call  https://review.openstack.org/459409
22:32 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Use cStringIO if available.  https://review.openstack.org/459478
22:32 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Whitelist watchdog threads in test thread leak checker  https://review.openstack.org/459479
22:34 *** harlowja has quit IRC
22:34 <clarkb> jeblair: jlk: make sure I get this right: basically you can have review.o.o/foo and review.o.o/bar, so sources[review.o.o] isn't a singleton?
22:35 <jeblair> clarkb: your example looks like two projects over the same connection, which is more straightforward than the problem -- the problem is actually two *connections* to the same server.
22:36 <clarkb> when would you have two connections to the same server?
22:36 <jeblair> clarkb: it's actually the original impetus for multi-connection support, though we've never used it: report changes to gerrit under a different username.
22:36 <clarkb> aha
22:38 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Cleanup zookeeper and fake nodepool in nodepool tests  https://review.openstack.org/459480
22:38 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Cleanup gearman clients in tests  https://review.openstack.org/459481
22:40 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Properly record timed out builds  https://review.openstack.org/459767
22:41 <jeblair> clarkb: so the only way to really know which connection to use for a project is via the tenant config file, where project names are definitively mapped to connections.  that process ensures that we don't end up with two identically named projects (with that index, it's okay to have two connections with the same hostname, as long as they don't also have projects with the same name)
22:41 <jeblair> (attempting to do so is a configuration error, so we can rely on that)
22:46 <mordred> jeblair: I believe this matches what we talked about that one time
22:48 <clarkb> jeblair: ok, I am confused about the comment at line 498 of https://review.openstack.org/#/c/459821/2/zuul/scheduler.py. What is the corresponding live item, and why is it N items behind the current item? Is it valid to have live -> not live -> not live -> live -> not live ?
22:49 <clarkb> jeblair: at least with v2, I thought that once you are not live, everything behind is not live?
22:50 <jeblair> clarkb: behind and ahead may be confusing here.  take the gnocchi change near the top of check right now
22:51 <jeblair> clarkb: that's a bunch of non-live items *ahead of* a single live item.
22:51 <clarkb> ah
22:51 <clarkb> I'm thinking of the gate case with windows (because I'm familiar with that code)
22:51 <jeblair> (which is the only configuration non-live shows up in, at the moment)
22:51 <clarkb> thanks
22:51 <jeblair> clarkb: ah, that's 'active'
22:51 <jeblair> (rather than 'live')
22:51 <clarkb> gotcha
22:53 <jeblair> honestly, foreign projects and multi-connections need a bit of a rethink, because at some point we're going to want to say "this zuul patch Depends-On this ansible pull request".  but we have no way of saying that at the moment, so that code makes things continue to work the way they do now with minimal fuss.
22:53 <clarkb> where you won't merge until the other has merged, but won't actually get that pull request in your git repo?
22:54 <jeblair> clarkb: you will get that pull request in the executor-staged copy of the ansible repo
22:54 <jeblair> (so the executor will stage both your change and the ansible pull request in the jobdir src directory)
22:55 <clarkb> but the PR would not be live in that case because we aren't voting back on it?
22:55 <clarkb> we'd vote on zuul
22:55 <jeblair> clarkb: correct, it'd still be non-live
22:55 <clarkb> ok, that makes a lot more sense now
22:56 <jeblair> clarkb: i mention this because the code i just wrote will break in that circumstance.  so it'll have to change.  but it's not possible to get into that state now, so i think it's good enough.
22:57 <jeblair> (because it assumes that all changes for foreign projects were pulled in via a depends-on from the same source/connection, which obviously won't be the case when you can do cross-connection dependencies)
22:58 <clarkb> Shrews: just two more changes to review on topic:test-stability
22:58 <clarkb> Shrews: the last one is the best one too
22:59 <jeblair> also, holy cow, i think it's time for a glossary.  "item", "foreign", "live", "active", "cross-repo-dependency", "cross-connection-dependency" are all zuul-specific terms of art used in the last 5 minutes with very specific meanings.  :)
22:59 <clarkb> indeed
23:01 <Shrews> clarkb: my pizza arrived. Basic math tells us that pizza > reviews
23:01 <jeblair> mordred: should i take "jeblair: I believe this matches what we talked about that one time" as a +2 on 459821?
23:01 <clarkb> Shrews: except that the last change will make your tests run twice as fast. So you can eat more pizza
23:01 <clarkb> :P
23:01 <clarkb> enjoy dinner
23:02 <Shrews> Will look after fooding
23:03 <mordred> jeblair: I was about to leave the actual +2
23:03 <mordred> jeblair: also - yes, I think it's time for a glossary :)
23:03 <jeblair> mordred: ok, cool :)
23:14 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: remove getSourceByHostname  https://review.openstack.org/459821
23:19 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Error if threads leak.  https://review.openstack.org/459482
23:19 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Cleanup logging handlers  https://review.openstack.org/459849
23:31 *** openstack has joined #zuul
23:38 <mordred> clarkb: well, eatmydata does make sync() a no-op. I'm not sure if tmpfs does any more work than "return 0" on a sync or not?
