Monday, 2017-04-24

*** jamielennox is now known as jamielennox|away  [01:33]
*** jamielennox|away is now known as jamielennox  [01:44]
*** yolanda has joined #zuul  [03:54]
*** yolanda has quit IRC  [05:23]
*** isaacb has joined #zuul  [06:26]
*** hashar has joined #zuul  [06:59]
*** bhavik1 has joined #zuul  [07:00]
*** yolanda has joined #zuul  [07:29]
*** yolanda has quit IRC  [08:20]
*** yolanda has joined #zuul  [08:39]
*** yolanda has quit IRC  [08:54]
*** yolanda has joined #zuul  [09:11]
*** yolanda has quit IRC  [09:39]
*** jkilpatr has quit IRC  [10:40]
*** jkilpatr has joined #zuul  [11:02]
*** yolanda has joined #zuul  [11:16]
*** Shrews has joined #zuul  [11:26]
*** bhavik1 has quit IRC  [11:40]
<openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for using build UUID as temp job dir name  https://review.openstack.org/456685  [13:16]
<openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for specifying root job directory  https://review.openstack.org/456691  [13:16]
<openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for using build UUID as temp job dir name  https://review.openstack.org/456685  [13:34]
<openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Allow for specifying root job directory  https://review.openstack.org/456691  [13:34]
*** yolanda has quit IRC  [13:36]
*** bhavik1 has joined #zuul  [13:58]
*** bhavik1 has quit IRC  [14:05]
*** yolanda has joined #zuul  [15:03]
*** yolanda has quit IRC  [15:10]
*** yolanda has joined #zuul  [15:11]
<pabelanger> mordred: speaking of packaging, just bringing the zuul package for fedora rawhide online today: https://bugzilla.redhat.com/show_bug.cgi?id=1220451  [15:20]
<openstack> bugzilla.redhat.com bug 1220451 in Package Review "Review Request: zuul - Trunk gating system developed for the OpenStack Project" [Medium,Post] - Assigned to karlthered  [15:20]
*** yolanda has quit IRC  [15:23]
*** yolanda has joined #zuul  [15:32]
*** hashar is now known as hasharAway  [15:37]
<jlk> Good morning you zuuligans!  [15:54]
*** isaacb has quit IRC  [15:59]
<clarkb> I finally have zuul test suite running under a hand crafted all natural free range python install made with gcc7  [16:09]
<clarkb> curious if this will show any difference  [16:09]
*** isaacb has joined #zuul  [16:11]
<clarkb> Exception: Timeout waiting for Zuul to settle <- so free range python didn't help that. Now to see if it at least completes running in a reasonable amount of time  [16:21]
<clarkb> SpamapS: jeblair was thinking about this more and could it be that git itself is slower in newer versions in how zuul uses it?  [16:22]
<clarkb> I'm going to attempt putting yappi in place on a per test basis  [16:28]
<jlk> What time (UTC) is the meeting today?  [16:38]
<clarkb> 2200 iirc  [16:38]
<SpamapS> clarkb: no I think it's python.  [16:49]
<SpamapS> clarkb: free range python.. did you enable all the optimizations?  [16:50]
<jlk> oh good I'll be able to make that.  [16:50]
<jeblair> i have iad and osic trusty and xenial nodes held; i'll kick off some tests there  [16:54]
<clarkb> SpamapS: ya I enabled lto and pgo  [16:55]
<clarkb> currently running into the problem of not being able to get stdout out of the test suite even when setting OS_STDOUT_CAPTURE=0  [16:56]
<jeblair> fwiw, osic installed the tox deps in 1m6s +/-2s; rax-iad in 1m36s +/- 1s.  [16:59]
<clarkb> ok if I print in the test cases I see that, but not in BaseTestCase.setUp(); even sys.stdout.write isn't working  [17:05]
<clarkb> hrm I wonder if we are calling that setUp(), even pdb doesn't seem to break there  [17:07]
<SpamapS> clarkb: threads make that tricky  [17:08]
*** isaacb has quit IRC  [17:08]
<SpamapS> or are you pdb.set_trace()'ing in it? should break there  [17:08]
<clarkb> I figured it out, bug in the test suite I think. Gonna grab HEAD and double check and push a fix up if still there  [17:09]
<pabelanger> clarkb: jeblair: mordred: care to add https://review.openstack.org/#/c/452494/ to your review queue? Fetch console log if SSH fails.  [17:10]
<openstackgerrit> Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Use current class name on super call  https://review.openstack.org/459409  [17:12]
<clarkb> SpamapS: ^ is the fix  [17:12]
<clarkb> setUp was just never running on the fast test I was iterating on  [17:12]
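
A minimal sketch of the kind of super() mistake fixed by 459409 (the class names here are hypothetical): in a Python 2 style call, naming the wrong class skips a parent's setUp() entirely, which matches the "setUp was just never running" symptom.

    import testtools

    class BaseTestCase(testtools.TestCase):
        def setUp(self):
            super(BaseTestCase, self).setUp()
            # common fixtures live here

    class FastTestCase(BaseTestCase):
        def setUp(self):
            # Bug: naming the parent class instead of the current class jumps
            # past BaseTestCase in the MRO, so BaseTestCase.setUp() never runs.
            super(BaseTestCase, self).setUp()
            # Fix: super(FastTestCase, self).setUp()
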
<jeblair> pabelanger: is that a port of a change to another branch?  [17:16]
<pabelanger> jeblair: yes, I decided just to add it to feature/zuulv3 to avoid recent nodepool.yaml changes  [17:17]
<clarkb> woo I have more infos. If I run a test that fails for me, tests.unit.test_scheduler.TestScheduler.test_crd_undefined_project, under testr alone it takes 55 seconds. If I run it under testtools, 12 seconds  [17:17]
<jeblair> pabelanger: can you mention/link to the previous change?  [17:17]
<pabelanger> jeblair: sure, 1 moment  [17:18]
<clarkb> and I haven't managed to get profiling working yet with testr runs to see what is eating all that time (I can't print out the results as stdout seems to be eaten)  [17:18]
<jeblair> clarkb: what does it mean to "run under testtools"?  [17:19]
<clarkb> jeblair: python -m testtools.run instead of testr run  [17:19]
<jeblair> ah ok  [17:19]
<jeblair> clarkb: testr does the discover thing, testtools.run doesn't, right?  [17:19]
<jeblair> clarkb: (that was a big motivation for mordred writing ttrun)  [17:20]
<openstackgerrit> Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Fetch server console log if ssh connection fails  https://review.openstack.org/452494  [17:20]
<clarkb> jeblair: yes but discover is really fast  [17:20]
<clarkb> I can time it independently in a sec  [17:20]
<clarkb> subunit.run is ~14 seconds  [17:21]
<clarkb> discover is 1 second wall clock  [17:21]
<jeblair> clarkb: was your 55 seconds the time that the "testr" command took, or the time that testr reported for the test?  [17:24]
<clarkb> jeblair: the time that testr reported for the test  [17:24]
<jeblair> oh.  yeah.  i agree those should be the same.  :)  [17:24]
<clarkb> let me rerun under time just to make sure there is no funny business in that recording  [17:24]
<SpamapS> ooooooooooooooooooh my  [17:25]
<SpamapS> yeah that's a bit nuts  [17:25]
<clarkb> wow also seems quite variable. last run was 27 seconds  [17:29]
<clarkb> (with testr, testtools seems pretty consistently 12-14 seconds)  [17:29]
<jeblair> i'm not seeing that locally with behind_dequeue  [17:30]
* clarkb tries with that test  [17:32]
<jeblair> also, running behind_dequeue on trusty/xenial nodepool nodes produced fairly consistent low-mid 20s times  [17:32]
<clarkb> jeblair: using testr or testtools?  [17:32]
<jeblair> clarkb: testr ("tox -e py27 ...")  [17:33]
<jeblair> clarkb: i ran crd_undefined_project locally and get 3.3s both ways  [17:33]
<clarkb> ZUUL_TEST_ROOT=/home/clark/tmp/tmpfs time testr run tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue produces: 103.92user 18.40system 1:15.73elapsed 161%CPU (0avgtext+0avgdata 278300maxresident)k 0inputs+496outputs (0major+4407299minor)pagefaults 0swaps Ran 1 tests in 69.140s (+48.013s)  [17:34]
<clarkb> now to try with testtools.run  [17:34]
<clarkb> Ran 1 test in 60.213s 86.46user 16.58system 1:00.92elapsed 169%CPU (0avgtext+0avgdata 270040maxresident)k 0inputs+0outputs (0major+4220868minor)pagefaults 0swaps  [17:35]
<clarkb> so that test seems a lot more consistent  [17:36]
<clarkb> jeblair: fwiw the crd tests are the ones that always break locally for me  [17:36]
<clarkb> ..py:531 RecordingAnsibleJob.execute  44     0.004697  10.05743  0.228578 is biggest cost in dequeue job according to yappi  [17:38]
<clarkb> with testr run we are ingesting the subunit streams and writing data out to the "db" files  [17:38]
<clarkb> possibly that is what is making it slower in some cases depending on the stream contents?  [17:38]
<clarkb> after that git is the highest cost  [17:39]
<clarkb> now if I could just get the testr runs to print out their yappi infos :/  [17:40]
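
One way to wire yappi in on a per-test basis, roughly as described above; the hook and file names are illustrative, not zuul's actual test code.

    import testtools
    import yappi

    class ProfiledTestCase(testtools.TestCase):
        def setUp(self):
            super(ProfiledTestCase, self).setUp()
            yappi.start()
            self.addCleanup(self._save_profile)

        def _save_profile(self):
            yappi.stop()
            # Write pstat-format data so it can be inspected later, even when
            # the runner is eating stdout.
            yappi.get_func_stats().save(self.id() + '.prof', type='pstat')
            yappi.clear_stats()
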
<jeblair> clarkb: yeah.  the new behavior is that we only save log files from failing tests, because they are quite large.  you mention your test is failing, so that should be happening (though your env variables may alter that)  [17:42]
<clarkb> jeblair: well it only fails when run as a whole, its actually passing when i run it alone (so possibly an intertest interaction?)  [17:44]
<clarkb> but I have new data. set *_CAPTURE=0 in .testr.conf and the crd test takes a minute. set them to 1 (the default) and it takes half a minute (still twice the time of a testtools.run), implying the cost is in parsing the output stream, and that is somehow affected by not capturing things, possibly because it's bubbling all the way up to testr and it has to work harder?  [17:45]
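
For reference, the OS_*_CAPTURE switches are usually honored via fixtures in the base test class; a typical arrangement (not necessarily zuul's exact code) looks like this, which is why turning them off leaves raw stdout in the stream testr has to parse.

    import logging
    import os

    import fixtures
    import testtools

    class BaseTestCase(testtools.TestCase):
        def setUp(self):
            super(BaseTestCase, self).setUp()
            if os.environ.get('OS_STDOUT_CAPTURE') in ('1', 'True'):
                stdout = self.useFixture(fixtures.StringStream('stdout')).stream
                self.useFixture(fixtures.MonkeyPatch('sys.stdout', stdout))
            if os.environ.get('OS_LOG_CAPTURE') in ('1', 'True'):
                self.useFixture(fixtures.FakeLogger(level=logging.DEBUG))
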
*** yolanda has quit IRC  [17:48]
<clarkb> if I run testr run --subunit foo.test | subunit-2to1 then I see the output. But I don't see it in the .testrepository files. So testr is ingesting a bunch of information then discarding it?  [17:50]
<clarkb> jeblair: is gdbm present on the xenial and trusty nodes? (that's another potential difference here; testr won't have a gdbm to use on my system apparently)  [17:52]
* clarkb installs it  [17:52]
<jeblair> -bash: gdbm: command not found  [17:53]
<jeblair> on both  [17:53]
<clarkb> jeblair: python interpreter import gdbm?  [17:53]
<jeblair> oh yeah that's there  [17:53]
<clarkb> having it makes the warning go away but doesn't affect the test runtime  [17:55]
<clarkb> still ~30s for that crd test and *_CAPTURE=1  [17:55]
<clarkb> http://paste.openstack.org/show/607721/  [17:59]
<clarkb> also comparing function calls, logging is more expensive, so that may be the cost there?  [18:11]
<clarkb> oh wow so that time increase was purely due to having stdout written that wasn't part of the subunit stream  [18:22]
<clarkb> (my yappi stuff was not getting the monkey patched stdout; if I fixed that, the run time is in line with testtools.run)  [18:23]
<clarkb> actually its not working for capturing yet either. Arg, but good to know that having "raw" stdout over the wire makes things really unhappy with testr  [18:28]
<Shrews> Can I just say that Vexxhost's first response to my issue of "instance down" being to ask if it is still down doesn't instill much confidence in their support methods.  [18:40]
<clarkb> woo finally have yappi working the way I want it and test run times are not crazy, and then discover some of these tests run long enough to have the counters roll over  [18:44]
<clarkb> yes negative runtimes, that's gotta be correct. Zuul can time travel you know  [18:45]
<jeblair> clarkb: if zuulv3 can time travel backwards that will be great for our release timetable  [19:07]
<clarkb> fwiw I have the suite running now and trying to get a failure, none yet  [19:08]
<clarkb> changes are yappi profiling and use cStringIO  [19:08]
<jeblair> clarkb: i wanted to use cstringio, but that's apparently gone in python3.  [19:22]
<clarkb> jeblair: ya I've got it conditionally importing  [19:24]
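
The conditional import is the usual two-liner; a sketch, assuming only StringIO is needed:

    try:
        from cStringIO import StringIO  # fast C implementation, Python 2 only
    except ImportError:
        from io import StringIO         # Python 3
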
*** hasharAway is now known as hashar  [19:25]
<SpamapS> oh we got that time travel feature done?  [19:26]
<SpamapS> cause.. I have a few ideas for use cases..  [19:26]
<clarkb> oh it just failed. But the data is useless because its all rolling over :(  [19:27]
* clarkb tries to get test run alone to fail  [19:27]
<clarkb> something must be leaking  [19:28]
<clarkb> 11 seconds when run alone  [19:30]
<clarkb> 1 minute 8 seconds when run with everything else (but with concurrency=1 so sequentially)  [19:30]
<clarkb> SpamapS: ^  [19:31]
<clarkb> SpamapS: do you see that behavior too?  [19:31]
<SpamapS> clarkb: which test?  [19:36]
<SpamapS> $ ttrun -e py27 tests.unit.test_scheduler.TestScheduler.test_crd_check_transitive  [19:37]
<SpamapS> takes ~21s consistently  [19:37]
<SpamapS> worth noting I'm on Ice9c2fd320a4d581f0f71cbacc61f7ac58183c23 sha=070ee7e979dbb9b488493984aeddb866da3884ba  [19:38]
<clarkb> tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies  [19:38]
<clarkb> SpamapS: right, I am comparing testr run tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies to the runtime of that test in a full testr run  [19:39]
<clarkb> running it on its own seems consistent  [19:39]
<SpamapS> 8.5s for that one  [19:41]
<SpamapS> 7.5s sometimes  [19:42]
<SpamapS> tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies  5.401  [19:42]
<SpamapS> faster under testr  [19:42]
<clarkb> SpamapS: a full testr run?  [19:43]
<SpamapS> no that's just alone  [19:43]
<clarkb> right its consistent and fast alone  [19:43]
<SpamapS> hm can I tell testr to show me last_run-1 ?  [19:43]
<clarkb> SpamapS: testr last  [19:43]
<SpamapS> no that's last_run  [19:44]
<clarkb> oh for that i just vim the file in .testrepository directly, you can also testr load it then testr last  [19:44]
<SpamapS> ahh  [19:46]
<SpamapS> tests.unit.test_scheduler.TestScheduler.test_crd_check_ignore_dependencies                                  5.401  [19:47]
<SpamapS> clarkb: that's actually from the last real run  [19:47]
<SpamapS> same as alone  [19:47]
<SpamapS> to the thousandth  [19:47]
* SpamapS is suspicious  [19:48]
<SpamapS> seems like there's some funny business going on  [19:48]
<clarkb> Exception: Timeout waiting for Zuul to settle is the exception, which makes me think that maybe we aren't settling deterministically for some reason  [19:48]
<clarkb> alone is fine and runs in 11 seconds. Not alone takes more than a minute then hits ^  [19:48]
<clarkb> so I think profiling is not necessary (at least not for this), but glad I got to this point  [19:52]
<clarkb> (the other angle I am investigating is maybe there is rogue data in the subunit stream which is causing testr to be unhappy. Have it running dumping subunit to stdout and into a file to check. Then will also do a full run with testtools and see if it exhibits the same behavior of longer tests over time)  [19:56]
*** harlowja has quit IRC  [20:03]
*** jamielennox is now known as jamielennox|away  [20:13]
<openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Add a finger protocol log streamer  https://review.openstack.org/456721  [20:14]
<Shrews> jeblair: that ^^^ caused me much headache  [20:14]
<Shrews> hopefully it's treading the right path now  [20:15]
<jeblair> Shrews: cool!  [20:17]
<Shrews> i also fully expect tests to fail b/c of the single port thing  [20:17]
<mordred> Shrews: its correctness is directly proportional to how much your head hurts  [20:17]
<jeblair> Shrews: single port thing?  [20:18]
<Shrews> jeblair: not sure if more than 1 executor is started for tests, but if so, they'll get port conflicts for the finger port  [20:18]
<Shrews> don't know enough about tests though  [20:19]
*** jamielennox|away is now known as jamielennox  [20:20]
<jeblair> Shrews: we don't start an executor using the cmd, we do it directly.  so there shouldn't be anything starting a log streamer until you add it to the tests.  that's why i asked you to split it out from the startup code.  [20:20]
<Shrews> ah. then yay  [20:20]
<jeblair> Shrews: ya.  when we add it, we'll want to have it start on port 0 in the tests to pick a random port, since there can be multiple tests which start it running, and we also don't want to require root in the tests.  [20:21]
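
Binding to port 0 and reading the real port back is the standard trick; a small illustrative sketch (names are made up, this is not the actual streamer code):

    import socket

    def listen_on_ephemeral_port(host='127.0.0.1'):
        # Port 0 asks the kernel for any free port, so parallel tests don't
        # collide and nothing needs root for the privileged finger port (79).
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, 0))
        sock.listen(5)
        return sock, sock.getsockname()[1]
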
<mordred> jeblair: it's worth noting that this approach does tie us to a 1:1 relationship between executors and hosts - which I believe we're fine with - but noting it because I believe there had been musing in the past about the viability of running more than one executor on a given machine  [20:23]
<mordred> jeblair: I'm quite in favor of it as an approach and don't think the limitation is an issue - that said, it might make things fun if people attempt to run an executor farm in something like k8s  [20:24]
<jeblair> mordred: indeed.  i think the 1:1 aspect is the main reason to do this.  i may not fully recall the musings you're recalling -- is there a potential advantage to having more than one on a host?  [20:24]
<mordred> jeblair: not to my knowledge, no - unless there were some thread-scaling issue that would be better served by multiple processes - although in the infra case I'd imagine if that were the case we'd just spawn more smaller VMs  [20:25]
<jeblair> mordred: if you were using multiple executors in a k8s environment, would you be able to run a single log streamer across them?  [20:26]
<mordred> jeblair: _maybe_? I mean, we haven't really designed for that at all - but with the 1:1 relationship it implies that each executor needs a routable IP  [20:27]
<mordred> I believe one could fairly easily "expose" a service on each executor to make that happen in k8s land  [20:27]
<mordred> but down the road we might wind up wanting/needing a single finger multiplexer that could hit each of the executor fingers on the non-routable network ... this is HIGHLY hand-wavey - trying to put my brain in the space of the people who like to put everything on private addresses and expose a bare minimum of public ips  [20:29]
<jeblair> mordred: yes, we even want something similar for the websocket multiplexer  [20:29]
<mordred> (this is all "I think this is the right approach and will scale well for us. I don't think it will prevent a k8s scale out of executors - but we might have some enablement work to do in the future to make that nicer for some of those users")  [20:30]
<mordred> jeblair: ++  [20:30]
<mordred> jeblair: luckily finger already includes in the protocol the ability to chain hosts  [20:30]
<mordred> jeblair: so it's almost like the RFC authors back when the internet worked planned this for us!  [20:30]
<jeblair> mordred: what i'm trying to figure out is whether the 1:1 or 1:N choice has an impact on that.  i guess i don't understand enough about k8s to understand how you would create a scenario where you could have multiple executors writing log files to a space where a single log streamer reads from.  [20:30]
<fungi> sorry all, i'm entertaining inlaws this evening and will likely be out at dinner during the zuul meeting, but will check out the log when i get back  [20:32]
<mordred> jeblair: yah - the more I think about it the more I think the 1:1 choice is beneficial for that - and in a k8s land with less public IPs than desired executors having a multiplexor will be good  [20:33]
<mordred> and since it's a 1:1 with the executors, we're talking about an easy to understand "service talks to other service over network"  [20:34]
<mordred> and _not_ "this one daemon reads the files these other 10 daemons are writing - oh wait, they're all in their own containers"  [20:34]
<jeblair> mordred: ok.  i definitely think this is worth thinking about; we don't want to back ourselves into a corner and there are certainly compromises in this direction.  a good argument could easily push us one way or the other at this point.  this approach has 'ease of deployment' going for it right now.  [20:36]
<Shrews> fwiw, if we make a decision similar to "we need to rewrite this to be a highly distributed, highly redundant, dockerized application", imma gonna fight somebody  [20:38]
<mordred> Shrews: hell no. BUT - I don't want to do something that prevents someone who does want to run it in k8s or mesos from doing so  [20:39]
<mordred> jeblair: I think it gets us one-executor==one-network-listener, which I think folks grok fairly well in k8s land ("microservices") - so you'd wind up with N containers each running an executor/finger-listener  [20:39]
<mordred> which is good  [20:39]
<Shrews> mordred: lol, i know. was j/k  :)  [20:39]
<mordred> Shrews: :)  [20:39]
<mordred> Shrews: I dunno - you COULD want to fight someone anyway  [20:39]
<mordred> jeblair: and an executor is never going to work in a container that doesn't support fork - because ansible  [20:40]
<jeblair> Shrews: it already is a highly distributed application, and becoming moreso, especially thanks to your work.  it only has one SPOF left, and in my mind, zuulv4 will get rid of that.  :)  [20:40]
<jeblair> mordred: yeah, that makes sense.  [20:40]
<Shrews> all hail zuulv4  [20:40]
<mordred> so we're never going to hit the "purist" place of only ever having a single executable in each container with the fork scaleout model actually being at the k8s layer  [20:41]
<clarkb> SpamapS: ok after lunch and looking at my log from the testr run, that job is trying to connect to a bunch of gearman servers that it can't connect to  [20:41]
<clarkb> SpamapS: but general idea is to run in isolation and compare logs  [20:41]
<mordred> (I mean, we could if we wanted to go 100% down the road of doing that by having the executor make k8s api calls instead of subprocess calls - but that's a whole GIANT other thing)  [20:41]
<mordred> and I do not think it buys us, well, anything  [20:41]
<jeblair> mordred: yeah, i feel that that is significant enough of a change to warrant revisiting this anyway.  i mean that's not an implementation choice you can easily make today.  that's something worth considering later, and when we do, "how to get the logs off" will be a good question to answer at the same time.  [20:42]
<jeblair> mordred: i know that's something SpamapS and others were interested in when we were discussing the executor-security thing.  [20:43]
<jeblair> clarkb: if there are multiple test failures in your run, it's quite possible those gearman connects are from threads left over from previous tests which did not shut down correctly.  they should be mostly harmless, other than taking up some resources.  [20:44]
<mordred> jeblair: ++  [20:45]
<clarkb> jeblair: they show up in the first test that failed. So I don't think its just dirty test env from fails  [20:46]
<jeblair> clarkb: hrm.  i've been considering having us save a subunit attachment with the gearman port number of every test, even the successful ones, so we can trace those back to the test that launched them.  [20:47]
<clarkb> I now have log files up, gonna work on seeing if I can see where they start to diverge time wise  [20:50]
<clarkb> have confirmed that a single test does not have a bunch of servers its trying to connect to  [20:50]
<mordred> jeblair: re: significant enough change - I totally agree. and I mostly want to make sure we don't prevent anyone from going deep on running what we have now in k8s so that we can get a bunch of feedback/knowledge of what the ups and downs are... so that if we decide that zuul v5 will just darned-well depend on k8s, that we're doing it not because of new-shiny but because we've actually been able to learn about  [20:51]
<mordred> real/tangible benefits  [20:51]
<jeblair> mordred: yep  [20:51]
<clarkb> it's actually really early on, hrm. The time between test start and gearman server starting is 6 seconds in the slow failed test and one second in the test run alone  [20:52]
<clarkb> so maybe it is just leaked resources vying for cpu time  [20:52]
<clarkb> and the reason gate jobs are happy is by splitting into more cpus you have processes that can better time share the cpu?  [20:52]
<clarkb> anyways will keep digging  [20:52]
<jeblair> clarkb: are you saying those may not be leaked threads, but are just the actual threads for that test but geard hasn't started yet?  [20:53]
<clarkb> jeblair: no, pretty sure they are leaked. I'm looking at the python logging time sequence and the time taken to go from test start to gearman server starting for that test is different  [20:54]
<clarkb> and that happens really early on  [20:54]
*** harlowja has joined #zuul  [20:59]
<SpamapS> jeblair: mordred jlk I will likely miss today's Zuul meeting. Sick child has to be taken to doctor.  [21:01]
<jlk> :(  [21:02]
<SpamapS> pink eye... always pink eye.. :-/  [21:02]
<clarkb> jeblair: there is also a lot more kazoo activity  [21:02]
*** yolanda has joined #zuul  [21:02]
<jamielennox> mordred, jeblair: having just found out about the finger thing, and it basically replacing my little websocket app, why are we sending the uuid to the executor, isn't the executor only running one job?  [21:13]
<mordred> jamielennox: oh - it's not replacing the websocket app - we also need a websocket app  [21:14]
<jamielennox> i'm reading scrollback and we are in a position of limited public ips, and i'm not even sure executors would be public  [21:14]
<mordred> jamielennox: nod, so you might be the first user who needs the multiplexor then :)  [21:15]
<mordred> jeblair: ^^ look how quickly we go from thought to actual usecases  [21:16]
<jamielennox> so i think what we would need there is some way to ask zuul for a build UUID and resolve that to a (private) executor ip  [21:21]
<jamielennox> or otherwise just expose the private IPs to a user which is what my thing is doing now  [21:21]
<jamielennox> but i'm also interested in the decision to limit 1 executor per VM, we don't have a decision on this but most executors are just going to be running ansible (network bound) and will register themselves on a job queue  [21:24]
<jeblair> wait there's already a websocket app?  [21:24]
<jamielennox> jeblair: it was something i POCed for our zuul 2.5  [21:24]
<jamielennox> but i can show you  [21:24]
<jeblair> jamielennox: which framework did you use?  [21:25]
<jamielennox> jeblair: https://github.com/BonnyCI/logging-proxy  [21:25]
<jamielennox> jeblair: nodejs :(  [21:25]
<jamielennox> in its defense when pumping out to a browser it's probably the right choice  [21:25]
<mordred> jamielennox: so - the fun thing about multi-plexing is that it's built in to the finger spec - so if you say "finger XXX@zuul.openstack.org@private-executor01.openstack.org"  [21:26]
<mordred> which basically means we just have to do a little more plumbing  [21:26]
<mordred> gah  [21:26]
<mordred> I didn't finish my sentence  [21:26]
<jeblair> mordred: while that's neat, i don't think that's what we should actually do.  i don't think we need to conform to the finger spec here.  :)  [21:26]
<mordred> if you say "finger XXX@zuul.openstack.org@private-executor01.openstack.org" - that tells zuul.o.o to finger private-executor01.openstack.org and pass the results back  [21:26]
<jamielennox> mordred: that's really cool from a tech perspective and something you could build into a client side zuul-cli, but probably not something many people would use  [21:27]
<jeblair> i think our process should be to build up layers to get to the point where we have a user-friendly system.  here's how i think we should proceed:  [21:27]
<clarkb> ok running testr run with OS_LOG_CAPTURE set to 0 so that I can see the logs; we are leaking threads  [21:27]
<clarkb> so best guess is that we are eventually getting to the point where those threads are harmful?  [21:28]
<mordred> clarkb: if so, it would make sense that it would be more harmful on machines with less cores  [21:28]
<jeblair> 1) implement a console log streamer on each test node (done)  [21:28]
<clarkb> mordred: and slower cores  [21:28]
<jeblair> 2) implement a daemon which multiplexes the console log from each test node in a job along with the ansible log on the executor (in progress)  [21:29]
<clarkb> a good chunk of them are apscheduler threads, I will pull down my change and use it in my testing too  [21:29]
<clarkb> <Thread(Thread-158, started daemon 140017141806848)> a lot of those too  [21:29]
<jeblair> 3) implement a *central* daemon which can connect to the correct executor and ship that log stream over websockets  [21:30]
<jeblair> it's not much of a stretch to have that daemon *also* implement finger as well as websockets  [21:31]
<mordred> jeblair: ++ yes, I agree  [21:31]
<mordred> (turns out finger is much easier to implement than websockets)  [21:31]
<jeblair> that way we end up with "ws://zuul.example.com/build/..." and "finger build@zuul.example.com" as easy front-ends so users don't have to see executors.  [21:32]
<jeblair> i anticipate that we would use the executor finger url as the build url for now, and then once we have the central daemons, replace that  [21:32]
<jeblair> jamielennox: does that address your needs?  [21:33]
<jamielennox> jeblair: yea, that will work - one way or another we'd be overriding the build url for viewing anyway, so as long as we have a central place to hit to find information for BUILD_UUID we're good  [21:35]
<mordred> jeblair: the central daemon should also be pretty easy to scale-out and be a little fleet of log streaming daemons if needed too, yeah?  [21:36]
<jeblair> jamielennox: well, the end goal is to have the build url point to a js page which uses websockets to stream the logs, so hopefully you won't need to overwrite it  [21:36]
<jeblair> mordred: yes, i don't see a reason it can't be a scalable component (if you put it behind a load balancer)  [21:36]
<jamielennox> mordred: yea, there shouldn't be any state in that central component  [21:37]
<jeblair> mordred, jamielennox: i think we brainstormed either having the central daemon perform a http/json query to the zuul status endpoint or a direct gearman function call to the scheduler to find out which executor to talk to.  other than that, it's just shifting network traffic.  [21:38]
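
A rough sketch of the status.json option; the layout of status.json and the field carrying the executor address are assumptions here, not confirmed API:

    import json
    import urllib2  # urllib.request on python3

    def find_executor(status_url, build_uuid):
        status = json.load(urllib2.urlopen(status_url))
        for pipeline in status.get('pipelines', []):
            for queue in pipeline.get('change_queues', []):
                for head in queue.get('heads', []):
                    for item in head:
                        for job in item.get('jobs', []):
                            if job.get('uuid') == build_uuid:
                                # assumed field name for the executor address
                                return job.get('worker', {}).get('hostname')
        return None
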
<jeblair> jamielennox: so, aside from that, are there other concerns about 1 executor per machine?  [21:39]
<clarkb> I think I may have found that kazoo leak  [21:39]
<jamielennox> jeblair: not real concerns, i don't see any reason we couldn't run a number of small vms with executors as opposed to cgroup/docker/something isolated on the one machine  [21:41]
*** hashar has quit IRC  [21:41]
<jamielennox> we haven't really discussed it, it was just at first pass of the idea it seemed that log streaming is a strange thing to be imposing that limitation for  [21:42]
<jamielennox> NFS ftw?  [21:43]
<jeblair> jamielennox: once we have a central daemon, it will make much more sense to put the executor finger daemon on an alternate port, then you could run multiple executor-streamer pairs.  *however*, i really think you'd need a sizable machine to get to the point where we have more python thread contention than ansible process overhead.  [21:45]
<jamielennox> jeblair: yea, there are ways we could get around it later if required  [21:46]
<jamielennox> quick question though - what's the log streamer on the test node? is that taking syslog etc?  [21:47]
<jlk> jeblair: for today's meeting, can we do the github subject early? I have to pick up my kid from school, leaving about 10~15 minutes after the meeting starts.  [21:47]
<mordred> jamielennox: it's hacked in to the ansible command/shell module  [21:48]
<jamielennox> mordred: but won't that be coming from the executor now?  [21:48]
<mordred> jamielennox: it's actually the exact same mechanism that provides the telnet streaming in the 2.5 codebase  [21:48]
<mordred> jamielennox: the executor has to get the data somehow  [21:48]
*** jkilpatr has quit IRC  [21:48]
<jeblair> jamielennox: it streams stdout of commands.  i want to enhance it later to also fetch syslog, etc. (and similarly plumb that through the entire stack too)  [21:48]
<jeblair> jlk: sure thing  [21:48]
<jlk> <3  [21:49]
<mordred> jamielennox: so currently the callback plugin on the executor makes a connection to the streamer on the test node and pulls back whatever the test node is streaming out its telnet stream and then writes it to the ansible log file on the executor  [21:49]
<jamielennox> mordred: oh - ok, i wasn't aware that was how that worked  [21:50]
<jamielennox> makes sense  [21:50]
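
Roughly what that callback-plugin side of the stream does; a sketch, with 19885 being the port the 2.5-era console streamer used, if memory serves:

    import socket

    ZUUL_CONSOLE_PORT = 19885

    def stream_console(node_ip, ansible_log_path):
        # Connect to the console streamer on the test node and append
        # whatever it emits to the job's ansible log on the executor.
        sock = socket.create_connection((node_ip, ZUUL_CONSOLE_PORT))
        with open(ansible_log_path, 'ab') as log:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                log.write(data)
        sock.close()
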
<jeblair> oh it's meeting time in #openstack-meeting-alt  [22:00]
<openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Add a finger protocol log streamer  https://review.openstack.org/456721  [22:21]
<openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Add a finger protocol log streamer  https://review.openstack.org/456721  [22:24]
*** jkilpatr has joined #zuul  [22:28]
<jeblair> jamielennox: the plan for the websocket proxy in zuul is to use the autobahn framework with the asyncio module in python3.  we won't be able to run tests for that until we convert zuul to python3 (which is blocked on gear for python3).  but we can get started on the code without integrating it into tests for now.  [22:50]
<mordred> (we do run our pep8 tests under python3, so python3 syntax checking at least happens)  [22:53]
<mordred> (this is important since asyncio in python3 uses new python3 syntax that does not exist in python2 and thus pep8 has sads)  [22:54]
<jeblair> ya  [22:54]
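
A bare-bones autobahn/asyncio starting point along those lines; this is only a sketch, with the build lookup and actual streaming left as stubs:

    import asyncio

    from autobahn.asyncio.websocket import (WebSocketServerFactory,
                                            WebSocketServerProtocol)

    class ConsoleStreamProtocol(WebSocketServerProtocol):
        def onMessage(self, payload, isBinary):
            # Hypothetical: payload names a build UUID; look up the executor
            # holding that build and proxy its log stream back here.
            self.sendMessage(b'lookup/streaming not implemented in this sketch')

    factory = WebSocketServerFactory(u'ws://0.0.0.0:9000')
    factory.protocol = ConsoleStreamProtocol

    loop = asyncio.get_event_loop()
    server = loop.run_until_complete(
        loop.create_server(factory, '0.0.0.0', 9000))
    loop.run_forever()
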
<mordred> very not working patch: https://review.openstack.org/#/c/320563  [22:55]
<clarkb> the helpful yaml exception parsing we do for bad configs is really confusing when you've broken the config validation :) (easy enough to add a raise in the context manager though)  [22:56]
<jeblair> mordred: oh hey i didn't know that was so far along.  :)  [22:58]
<jeblair> jamielennox: ^  [22:58]
<jeblair> mordred: do you think it's worth porting to v3 and provisionally merging it with the understanding we'll run tests on it later?  [22:58]
<mordred> jeblair: sure - I mean, it doesn't have the extra info plumbed in to it - but I could get that ball rolling probably  [23:00]
<mordred> jeblair: oh - actually, the thing it doesn't have is gear for py3  [23:01]
<jeblair> mordred: ah, right, so that's a harder dep on the gear py3 thing.  [23:02]
<mordred> jeblair: although that part could be rewritten as a GET on the status.json  [23:02]
<jeblair> true; that could be a good interim thing.  [23:02]
<jeblair> i don't think py3gear will take much time once we have the brainspace for it, but until then, if folks would like to move forward on that, there is a path.  [23:03]
<jeblair> jamielennox, SpamapS: ^  [23:03]
<jeblair> (and if it's not urgent, we can wait for py3gear)  [23:03]
<clarkb> jeblair: one of the problems with leaking is that the timer driver has a stop() which is never called. That means we leak the apsched threads. I tried addressing this by implementing a shim connection for timer but that implies zuul.conf config that will be checked against the layout.yaml and that fails  [23:04]
<jeblair> clarkb: sounds like we need driver shutdown?  [23:04]
<clarkb> ya, or separate the shutdown sequence from the magic that sets up valid config  [23:05]
<jeblair> clarkb: i think explicit driver shutdown is better anyway, since connection shutdown doesn't necessarily imply driver shutdown  [23:05]
<jeblair> clarkb: the drivers are loaded regardless of configuration  [23:06]
<clarkb> ya, thats just a lot more work :) today the driver manipulation appears to go through connections  [23:06]
<clarkb> also it looks like currently stop() is implied for both  [23:07]
<clarkb> eg we don't stop only one gerrit connection, we stop all the gerrit connections. So maybe this is easier and I can just add a "for driver in drivers: if driver hasattr stop" and call it there  [23:07]
<jeblair> clarkb: yeah, i think so.  [23:10]
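
In other words, something along these lines in the shutdown path (attribute names assumed; the point is just to call stop() on every loaded driver that has one):

    def stop_drivers(self):
        # Drivers are loaded regardless of configuration, so shut down every
        # driver that knows how, not just the configured connections; that is
        # what lets the timer driver kill its apscheduler threads.
        for driver in self.drivers.values():
            if hasattr(driver, 'stop'):
                driver.stop()
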
<jlk> Can v3 jobs have their own failure-message any more?  [23:15]
<SpamapS> I lost the thread on py3gear a bit..  [23:15]
<jlk> oh n/m  [23:16]
<SpamapS> I believe we had settled on assuming utf8 and adding a binary set of classes for py3 users who want to do non-utf8 function names and payloads.  [23:16]
<SpamapS> (with the understanding that a gear user migrating to python3 in this particular position would likely see oddness)  [23:17]
<jeblair> SpamapS: me too, i'd need to spend a few minutes refreshing my memory.  but that sounds reasonable.  [23:18]
<SpamapS> I think I wrote that patch  [23:21]
<SpamapS> https://review.openstack.org/398560  [23:21]
<SpamapS> I also started on a py3 branch of v3  [23:22]
<jeblair> SpamapS: ya, looks like the next patch is where we left off  [23:23]
<jeblair> https://review.openstack.org/393544  [23:23]
<SpamapS> jeblair: I think we may have actually decided to abandon 393544 as it was a more heavy-for-py3k-migrations approach.  [23:24]
* SpamapS forgot to write that down  [23:24]
<SpamapS> 398560 handles name automagically.. and payload I think works if all you do is json decode/encode  [23:25]
<SpamapS> ah no, payload ends up coming out as bytes in py3  [23:27]
<SpamapS> which won't json.loads without a decode  [23:27]
<SpamapS> but that works the same py2/py3 so isn't too painful to wrap your json loads's in it  [23:27]
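
On the worker side that ends up looking something like this (the decode is the only py3-specific part):

    import json

    def handle(job):
        # job.arguments is bytes on python3; decoding first keeps one code
        # path that also works on python2.
        args = json.loads(job.arguments.decode('utf8'))
        result = {'ok': True, 'echo': args}
        job.sendWorkComplete(json.dumps(result).encode('utf8'))
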
<jeblair> SpamapS: i think we *don't* want to require wrapping encodes around everything -- the idea was that TextJob lets you avoid that.  so i think we still want the idea of 393544, just a simpler implementation.  [23:39]

Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!