Thursday, 2017-06-15

mordred(garbage, org, repo, more_garbage, number) = s.rsplit('/', 5) really00:00
* mordred loves split00:01
jlkoh yeah I could do that00:02
jlkdoes "__" by itself have special meaning in python?00:05
jlkpep8 might bitch about defining garbage and more_garbage but not using them00:05
jamielennoxjlk: there's convention but it doesn't mean anything by default00:08
jamielennoxin web stuff it's generally used as the _('string') translation function, which is annoying when you want to use it as an ignore variable00:09
jamielennoxso pep8 and a lot of other tools have rules built in to ignore it00:09
jlkyeah I'll just use _ twice.00:13
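(For reference, a minimal sketch of the unpack-and-ignore idiom settled on above; the exact URL shape is an assumption for illustration:)

    # Hypothetical PR URL; rsplit peels pieces off from the right, and a
    # repeated underscore discards the parts we don't care about:
    url = 'https://github.com/j2sol/z8s-sandbox/pull/1'
    _, org, repo, _, number = url.rsplit('/', 4)
    # _ is an ordinary name that is simply rebound twice, which is why
    # pep8 and friends special-case it as "intentionally unused".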
jlkDEBUG:connection.github:Updating <Change 0x7f20b8159c50 2,943abf6d7430610d348e105b6196e67626ff2d32>: Getting commit-dependent pull request j2sol/z8s-sandbox/100:23
mordredjlk: ooh - that seems promising00:24
jlkzuul-scheduler_1     | DEBUG:zuul.IndependentPipelineManager:Checking for changes needed by <Change 0x7f20b8159c50 2,943abf6d7430610d348e105b6196e67626ff2d32>:00:25
jlkzuul-scheduler_1     | DEBUG:zuul.IndependentPipelineManager:  Change <Change 0x7f20b8159c50 2,943abf6d7430610d348e105b6196e67626ff2d32> needs change <Change 0x7f20b81b38d0 1,bb71d444ce36ea561bf98d83d05d2ccefe38525a>:00:25
jlkzuul-scheduler_1     | DEBUG:zuul.IndependentPipelineManager:  Change <Change 0x7f20b81b38d0 1,bb71d444ce36ea561bf98d83d05d2ccefe38525a> is needed00:25
* mordred hands jlk a fatted calf00:25
jlkit's falling over a bit later, but yes, definitely progress00:25
SpamapSnice00:29
jamielennoxjlk: oh you're working on dependencies - awesome00:36
jlkthought I'd throw some brain at it today00:36
jamielennoxjlk: re https://review.openstack.org/#/c/474300/ - what is a non-PR event?00:37
jamielennoxlike a merge event?00:37
jlka push00:37
jamielennoxso direct push to a repo, why are we handling that/running tests at all00:37
jlkeither by way of somebody merging a pr, or a direct push00:37
jlkpost-merge deployment pipelines00:37
jlklike publishing to pypi00:38
jlka push event covers tags too00:38
jlkso somebody may want pipeline action when a tag is applied00:38
jamielennoxhmm, yea that's difficult in the current model00:39
jamielennoxi imagine gerrit has the same problem, there's not really anywhere to report a pypi publish failing00:39
jlkit kind of has that problem yes00:39
jlkbut it doesn't really come up in openstack because rarely is anything pushed directly rather than merged via gerrit00:40
jamielennoxyou still push tags for version publishing, but yea00:42
jlkzuul-scheduler_1     |  Project j2sol/z8s-sandbox change 1,bb71d444ce36ea561bf98d83d05d2ccefe38525a based on None01:02
jlkzuul-scheduler_1     | Project j2sol/z8s-sandbox change 2,943abf6d7430610d348e105b6196e67626ff2d32 based on <QueueItem 0x7fecd40c0310 for <Change 0x7fecd410a8d0 1,bb71d444ce36ea561bf98d83d05d2ccefe38525a> in check>01:02
jlkoh cute, the github web UI is "helpfully" translating my Depends-On URLs into relative links01:04
jlkbut it DOES notice that and makes the link01:04
jlkcomments about it in the linked to PR01:05
openstackgerritJesse Keating proposed openstack-infra/zuul feature/zuulv3: Implement Depends-On for github [wip]  https://review.openstack.org/47440101:20
jlkSo ^^ seems to be working at least in the independent manager. My setup doesn't yet _do_ jobs, but it is at least merging the right things and scheduling the jobs. I know it doesn't reflect the work happening in https://review.openstack.org/#/c/451423 but I can easily integrate that when it lands.01:21
jlkof course need to add tests and what not01:21
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: Add support for zuul.d configuration split  https://review.openstack.org/47376402:27
*** isaacb has joined #zuul03:22
*** isaacb has quit IRC03:24
SpamapSjlk: nice!03:44
* SpamapS has been fully distracted and hopes to be undistracted tomorrow03:44
*** nt has quit IRC03:59
*** nt has joined #zuul04:01
jlktomorrow I'm mostly AFK, playing co-pilot for my grandmother, to show her how to get to the recovery facility where my grandfather is for the next few weeks.04:53
jlkI'm still really loving testing out this live code from my source tree via docker compose. Really fast iteration.04:59
jamielennoxjlk: how are you kicking it off though? i'm still testing on our v3 instance and making github reissue the events05:03
jamielennoxmeans you have to do everything right though05:04
jamielennoxcannot figure out why the delegate_to: localhost is still not producing output05:04
*** hashar has joined #zuul06:39
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: executor: add support for custom ansible_port  https://review.openstack.org/46871007:04
*** ajafo has quit IRC07:13
*** ajafo has joined #zuul07:13
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: bubblewrap: adds --die-with-parent option  https://review.openstack.org/47316407:18
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: executor: run trusted playbook in a bubblewrap  https://review.openstack.org/47446007:18
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: executor: run trusted playbook in a bubblewrap  https://review.openstack.org/47446007:49
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: config: refactor config get default  https://review.openstack.org/47448408:23
*** yolanda_ has joined #zuul08:59
*** jkilpatr has quit IRC10:32
*** jkilpatr has joined #zuul10:49
*** yolanda__ has joined #zuul11:30
*** yolanda_ has quit IRC11:30
*** olaph has joined #zuul11:54
*** yolanda__ is now known as yolanda12:05
dmsimardpabelanger: what are the odds of seeing https://review.openstack.org/#/c/464283/ backported to master ?13:20
dmsimardI don't think anything there is zuulv3 specific but I could be mistaken13:20
SpamapSdmsimard: new features are not being merged into master for the most part, but I'm sure exceptions can be made. Also I believe there's an effort to allow running nodepoolv3 with zuul v2.5 for a while via a shim.13:30
mordredSpamapS: fwiw, nobody (me) has gotten around to working on that shim13:32
SpamapSmordred: so there's a *stalled* effort ... ;)13:35
mordredheh13:36
jlkjamielennox: I have copied a few event payloads into files, and then just use curl locally to throw the file at the scheduler. The PRs the files reference are themselves still open13:37
dmsimardSpamapS: okay.13:45
*** hashar has quit IRC13:47
*** hashar has joined #zuul13:54
*** hashar has quit IRC14:04
*** hashar has joined #zuul14:12
Shrewsjlk: any chance you'd share your docker setup?14:18
jlkShrews: https://github.com/j2sol/z8s14:36
jlkwhen I want to mount in my local code directory instead of using the clone that happens during the image build, I use docker-compose -f docker-compose.yaml -f devel.yaml up zuul-zookeeper zuul-scheduler zuul-executor14:36
jlkpointing to both docker-compose.yaml and devel.yaml14:36
jlkand you could use ZUUL_SRC=/path/to/your/checkout   to change from the default of ~/src/zuul14:37
Shrewsjlk: thx. will have to give that a try soon14:39
pabelangerdmsimard: not sure, it should be straightforward to manually backport right now.  Biggest change is the shade dependency14:50
dmsimardpabelanger: ok tristanC will take a shot at it14:53
openstackgerritTristan Cacqueray proposed openstack-infra/nodepool master: Add boot-from-volume support for nodes  https://review.openstack.org/47460715:07
tristanCdmsimard: pabelanger: let's see how it goes ^15:08
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: config: refactor config get default  https://review.openstack.org/47448415:29
*** hashar has quit IRC15:31
jeblairtristanC: oh thank you!15:32
jeblairthat removes 147 lines of the most boring, hard to read, error-prone code we have!  :)15:33
jlkoh lordy!15:34
jlkthen again, python 3's ConfigParser has that built into it.15:34
jlkparser.get('section', 'value', fallback='something_to_fall_on')15:35
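(A quick illustration of the py3 behaviour jlk is pointing at; the section and option names here are invented:)

    import configparser

    parser = configparser.ConfigParser()
    parser.read_string('[gearman]\nserver = localhost\n')
    # Python 3's ConfigParser takes the default inline, no wrapper needed:
    port = parser.get('gearman', 'port', fallback='4730')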
jlkBTW I'm mostly out today. I have to deal with some grandparent medical stuffs.15:36
jeblairwe're on the cusp of moving to py3 entirely, we could probably just move to that.  however, we haven't actually turned off the py2 tests yet, so would be hard to land that right now.  maybe go with tristanC's, then move to py3 version later...?15:36
jlkjeblair: yeah his is absolutely an improvement15:37
jlkand there may be reason to keep our own wrapper, who knows.15:37
*** jkilpatr_ has joined #zuul15:40
jeblairSpamapS: https://review.openstack.org/474064 and https://review.openstack.org/474188 are good warmups.  :)15:42
*** jkilpatr has quit IRC15:43
pabelangerjeblair: I'd like to stop ze01.o.o and pick up the latest zuul changes16:15
jeblairpabelanger: wym16:17
jeblairpabelanger: wfm16:17
mordredjlk, jeblair: his version does have the expand_user flag, which is nice.16:22
jeblairmordred: ++16:33
*** jkilpatr has joined #zuul16:36
*** jkilpatr_ has quit IRC16:37
jlkyeah, the v3 move might just be to simplify the wrapper, without losing the wrapper.16:39
mordredjeblair: also - re: py3/py2 - as we consider operator documentation and whatnot, we should be careful about py2 on nodes, as ansible modules themselves are the least-likely thing to be py3 compat at the moment17:08
mordredjeblair: jamielennox ran into that with the delegate_to: localhost issue17:09
SpamapSso if you delegate_to: localhost it uses whatever interpreter is running ansible-playbook ?17:09
jeblairmordred: oh, i thought the py3 issue was on remote nodes -- that was on delegate_to: localhost?17:09
SpamapSseems like you can solve that with ansible_interpreter=/usr/bin/python2.7 or something17:09
jeblairmordred: no wonder we couldn't figure it out.17:10
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Default bubblewrap to work_root  https://review.openstack.org/47309917:10
mordredSpamapS: yes - we just need to make sure people realize that "zuul v3 runs py3" doesn't mean "I don't need py2 on my nodes"17:11
mordredjeblair: well, delegate_to: localhost uses localhost as a remote node17:11
clarkbSpamapS: mordred not sure about delegate_to but connection local uses system python iirc, there are all sorts of problems using that and virtualenvs17:12
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Read layout from BuildSet in first merge scenario  https://review.openstack.org/47406417:12
jeblairmordred: yeah, i grok that.  i just remember having conversations like "python on the remote node is python2"17:12
mordredyah. I think there is stuff here, like clarkb and SpamapS mention, that we probably  just need to tighten just a bit17:12
jeblairmordred: i was clearly missing an important piece of information.  i'm happy i have it now and it all makes much more sense.17:12
mordredwoot17:12
clarkbit's my big grump when running ansible tests locally out of a venv because my virtualenv makes a python3 env by default then it does python2 "remotely" iirc17:13
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Limit github reporters to event types  https://review.openstack.org/47430017:18
*** greghaynes is now known as greghayn117:18
*** greghayn1 has quit IRC17:25
pabelangermordred: we seem to no longer be redirecting ansible error output to zuul-executor debug logs? Is that intentional?17:34
jeblairpabelanger: not sure what you mean -- i see some ansible output, what's missing?17:47
jeblairpabelanger: (also, remember since you restarted, if you want -vvv you'll need to 'zuul-launcher verbose' again)17:48
jeblairlooks like you did that though, i think?17:49
pabelangerjeblair: if you have a syntax error in your playbook / role, we no longer seem to log that into zuul-executor. I thought we did that before17:49
pabelangerjeblair: yes, verbose is enabled17:49
mordredpabelanger: that's likely a bug17:50
jeblairpabelanger: oh, i don't know if that was there or not.17:50
jeblairmordred: is it though?  if that can end up in the ansible_log, isn't that enough?17:50
jeblair(obviously, if that doesn't end up in the ansible log, it should be in the executor log)17:51
jeblairi just don't want to make the executor log noiser than necessary :)17:51
jeblair2017-06-15 17:51:11,695 DEBUG zuul.AnsibleJob: [build: 74ce26349c2546d29f31e38a03ccd516] Ansible output: b''17:53
jeblair2017-06-15 17:52:09,729 DEBUG zuul.AnsibleJob: [build: 74ce26349c2546d29f31e38a03ccd516] Ansible output terminated17:53
pabelangerI think adding it to the executor log makes sense, because we also add normal ansible output today.  It seems we are just missing errors.  If we want to remove that from the executor log too, that is okay. But it makes debugging a little harder; if your log publishers are down (like they are today for us) you become blind17:53
jeblairpabelanger: ok17:53
jeblairpabelanger: those lines tell me that the ~1 minute delay is between the last line being read and stdout closing17:54
pabelangerjeblair: agree17:55
pabelangerIs ansible production ready under python3? I've been having a hard time over the last few days trying to figure out a series of errors since we switched to ze01.o.o, and have made little progress writing jobs17:56
pabelangeras an example: http://paste.openstack.org/show/612739/17:57
pabelangerwe are getting random file exists errors from tox now17:57
pabelangerand I am not seeing this on zuul.o.o today17:58
jeblairpabelanger: how is that related to ansible?  that's an error in virtualenv17:59
pabelangerI'm not sure it is, but I am having a hard time determining what changed between moving to ze01.o.o and random failures18:00
pabelangerI've been basically trying to make our existing jobs stable for the last day and a half, and haven't been able to write any new jobs18:01
jeblairpabelanger, mordred: switching back to the one minute delay -- actually, i'm a little confused -- the "Ansible output: b''" log line is the sentinel for the iterator.  as soon as it is read, the for loop should exit.18:01
jeblairpabelanger: i'm happy to help investigate.  which problem would you like to focus on?18:01
jeblairpabelanger: 1min delay, or venv error?18:01
pabelangerjeblair: 1min delay, lets continue with that18:02
jeblairokay18:02
pabelangerI'll put venv error on hold18:02
jeblairpabelanger, mordred, clarkb, SpamapS: i think i may not fully understand the iter() method and how it relates to the sentinel18:02
jeblairi'm going to poke at that and see if i can gain a better understanding18:03
jeblairoh, i think i see it18:04
jeblairthe b'' in the log is probably originally a b'\n'.  then we strip it18:04
mordredjeblair: oh. yes.18:04
jeblairso that's not the sentinel being logged18:04
jeblairso indeed, it looks like we get a blank line from ansible, then 1 minute later, stdout closes.18:05
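(Roughly the loop in question, reconstructed from the discussion; the command and logging are illustrative:)

    import subprocess

    proc = subprocess.Popen(['ansible-playbook', 'playbook.yaml'],
                            stdout=subprocess.PIPE)
    # iter(callable, sentinel) calls readline() until it returns b'' (EOF).
    # A blank line arrives as b'\n'; it merely *logs* as b'' once stripped,
    # so it is not the sentinel, and the loop keeps reading until stdout
    # actually closes.
    for line in iter(proc.stdout.readline, b''):
        print('Ansible output: %r' % line.strip())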
jeblairmordred, pabelanger: how about we throw some more vvv logging into zuul_stream to see if it's being slow18:06
jeblair(it doesn't appear slow enough to log errors, but maybe there's still something going on?)18:06
pabelangerwfm18:07
jeblairworking on that18:07
mordredjeblair: ++18:08
* mordred also looking at that code, fwiw18:08
jeblairmordred: part of me wants the 'handle list of streamers' change to land before we do too much more on this18:09
mordredjeblair: want me to address jamie's comments and push that back up real quick?18:09
jeblairmordred: sure18:11
jeblairpabelanger: while that's going on, i want to try to strace an ansible-playbook process18:15
jeblairpabelanger: do you have any idea what the file/stat loop is?18:16
jeblair2017-06-15 18:16:58,929 DEBUG zuul.AnsibleJob: [build: fa29ae54fa2a40a4bc3ad3082ad3bb2c] Ansible output: b'Using module file /usr/local/lib/python3.5/dist-packages/ansible/modules/files/stat.py'18:17
jeblair2017-06-15 18:16:59,210 DEBUG zuul.AnsibleJob: [build: fa29ae54fa2a40a4bc3ad3082ad3bb2c] Ansible output: b'Using module file /usr/local/lib/python3.5/dist-packages/ansible/modules/files/file.py'18:17
pabelangerjeblair: no, I haven't debugged that yet. I did notice it18:17
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Make logging helper method in zuul_stream  https://review.openstack.org/47296318:17
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Special case shell logging on localhost  https://review.openstack.org/47421618:17
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Change log streaming link to finger protocol  https://review.openstack.org/43776418:17
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Extract get_playhosts listing in zuul_stream to a method  https://review.openstack.org/47296418:17
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Handle lists of streamers  https://review.openstack.org/47423018:17
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Direct streaming at delegated_to target  https://review.openstack.org/47421518:17
jeblairit's real slow18:17
mordredjeblair: ok. there's the stack fixed and restacked - I checked it locally against a node running a zuul_console and it did what I expected18:18
jeblairi have a hunch... working on it18:19
jeblairi think it's bubblewrap18:20
jeblairansible exits as expected, but it takes bubblewrap 1 minute to then exit itself18:21
mordredoh - interesting. I'm not running anything with bubblewrap in my local tests18:21
jeblairwhen we're between the b'' and 'Ansible output terminated' log lines, there is a bwrap process, but no ansible-playbook process18:21
jeblairwait4(-1,18:21
jeblair[{WIFEXITED(s) && WEXITSTATUS(s) == 255}], 0, NULL) = 2818:21
jeblairwait4(-1, 0x7ffdb3e6cc9c, 0, NULL)      = -1 ECHILD (No child processes)18:21
jeblairexit_group(2)                           = ?18:21
jeblair+++ exited with 2 +++18:21
jeblairstrace only reports that ^18:22
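(In python terms, that strace tail is the end of a reap-all-children loop, something like:)

    import errno, os, subprocess

    subprocess.Popen(['true'])           # stand-in for the sandboxed child
    while True:                          # the wait4(-1, ...) calls above
        try:
            os.wait()
        except OSError as e:
            if e.errno == errno.ECHILD:  # "No child processes": all reaped
                break
            raise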
pabelangercool, now we know what is happening18:23
pabelangerit is possible we could terminate bubblewrap ourselves from the executor?18:27
jeblairpabelanger: i doubt we would get the correct exit code, and it could even be that bwrap *is* terminating already18:35
mordredmaybe it's taking time removing things?18:38
pabelangerthat likely explains why I also see18:40
pabelangerzuul     12161  0.0  0.0      0     0 ?        Zs   18:39   0:00 [bwrap] <defunct>18:40
pabelangeransible-playbook terminated, and bwrap waiting to close18:40
jeblairmordred: seems to be stuck on a wait -- there was only one bwrap process, and i followed threads in strace.  so i think that was the only thing it was doing.18:41
mordredjeblair: so might just be a bug in bwrap?18:44
mordredhttps://github.com/projectatomic/bubblewrap/blob/a4709b6547caf438e41cb478b0b9faded7e4b941/bubblewrap.c#L38418:45
jeblairmordred: i think there are 3 wait/waitpids in bwrap; i haven't quite wrapped my head around what they do18:45
jeblairoh that helps18:45
jeblairmordred: oh18:46
jeblairi should look for the whole process group18:46
jeblairthere could be something other than just bwrap or ansible-playbook...18:46
jeblairugh all the jobs are in the long boring file/stat loop18:48
jeblairwow18:48
jeblairthat's running gcc every time18:48
jeblair(i don't know if it's file.py or stat.py)18:48
mordredjeblair: really? something is running gcc at runtime every time?18:49
jeblairmordred: every second18:49
jeblairbwrap───ansible-playboo─┬─ansible-playboo───sh───sh───python3───python3───sh───gcc───collect2───ld └─{ansible-playboo}18:49
jeblairthat did not work, sorry.18:49
Shrewsheh, i thought weechat just threw up18:50
jeblairShrews: i threw up into the channel18:50
* jeblair mops up ascii characters18:50
jeblairokay that's running /tmp/a3d7e5eac7b84be59ac32d2f088fb39d/ansible/post_playbook_2/git.openstack.org/openstack-infra/openstack-zuul-jobs/playbooks/base/post.yaml18:51
jeblairmaybe it's terrible because /opt/zuul-logs/ doesn't exist?18:51
pabelangerOh18:51
pabelangeryes, that is likely the case18:51
jeblairpabelanger: maybe we should just empty the task list for the base post playbook right now?18:51
SpamapSgcc at runtime smells like ffi?18:51
pabelangerjeblair: ++18:51
jeblairSpamapS: yeah18:51
pabelangerI will do that now18:52
jeblairpabelanger: thx18:52
jeblairpabelanger: i'm considering mkdiring that directory for the moment, but then rm-rfing it when your patch lands18:53
mordredpabelanger, jeblair: which patch - should I go review real quick?18:54
jeblairmordred: pabelanger is writing it now18:54
pabelangerjeblair: WFM18:54
pabelangerI had to manually create it on zuulv3-dev.o.o a while back too18:55
jeblairmkdir didn't help the copy already in progress :(18:55
SpamapSso catching up... there's a thought that bubblewrap's pid piece is waiting for a minute after the last process exits?18:56
jeblairSpamapS: yep -- or, at least, one minute after ansible-playbook exits.  i have not confirmed that's the last process18:57
SpamapSyeah its entirely possible there are some other processes spawned out by ansible that need cleaning up18:57
jeblairokay, after all that, there was no 1 minute delay on that playbook18:57
jeblairso at least for the base/post playbook, we don't see the behavior18:58
jeblairi'll need to retrigger something to try to catch it18:58
jeblairpabelanger: oh, /opt/zuul-logs isn't in bwrap, that's why the mkdir didn't help18:59
pabelangerAha18:59
jeblairpabelanger: anyway, new jobs should get your change18:59
pabelanger++18:59
jeblairbwrap───ssh19:00
jeblairaha!19:00
jeblairit's the ssh connection cache thingy19:00
jeblairer... controlmaster?19:00
SpamapSah and it times out at 1 minute19:01
jeblairyep19:01
pabelangerright, that now makes sense, it is 60s19:01
pabelangerControlPersist=60s is what I am thinking of19:02
jeblairyep19:02
jeblairthis is interesting... i guess with bwrap we won't be able to utilize that across playbook runs...19:02
SpamapSif we execute 'ssh -O cancel' I think that might do it.19:02
SpamapSjeblair: it's still useful for each play sharing SSH conns inside a single playbook.19:03
SpamapSso not a total sadmaker19:03
SpamapSalso we could get fancy and pass the socket in19:03
jeblairSpamapS: yep.  best we can do right now i think, at least without a lot of creative thought19:03
jeblairoh look, creative thought19:03
pabelangerSpamapS: ya, was just searching for a socket thing19:03
SpamapSIt would work just like ssh-agent19:04
jeblairis it just one process that handles all connections to all hosts?19:04
mordredhow do controlpersist and agent work with each other?19:04
pabelangerYa, looks like control_path is what we'd have to bindmount19:04
SpamapSmordred: controlpersist just works like a multiplexer for the ssh socket.19:05
SpamapSso agent doesn't interact with controlpersist19:05
SpamapSoh hm19:05
SpamapSOH19:05
SpamapSwe could replace the agent copying in with already-established SSH conns on a controlpersist master19:06
SpamapSwould reduce the attack vector a bit19:06
jeblairSpamapS: if a non-ssh play took longer than 60s, the controlpersist master will die, and the next ssh task ansible performs will start a new one, right?19:07
SpamapSsince they couldn't establish new conns19:07
SpamapSjeblair: you can run a persistent master.19:07
SpamapSthat just sticks around until you tell it to go away19:07
jeblairoh, neat19:07
SpamapS(assuming your keepalives work so you don't loose the transport)19:07
SpamapSlose19:07
jeblairso basically *just like* the agent model.  which is what you said.  but now with more emphasis.  :)19:07
SpamapSYeah you just set ControlPersist to 0 to make it live indefinitely.19:08
mordredthe downside of that is that we'd have to know which connections to start and when19:08
SpamapSmordred: you'd have to do a ping all I think19:08
SpamapSwhich isn't a bad idea anyway19:08
mordredSpamapS: nod. so write a playbook that the executor runs that does hosts: all ping essentially, but with it set to control-persist 019:09
mordredthen pass in the created socket to the real playbook invocations19:09
mordredthen destroy the thing when we're done with the job19:09
mordred(saying that to make sure I grok)19:09
SpamapSBasically just set ControlPersist to 0, run ansible -i ourinventory -m ping all in trusted context, then after the job 'ssh -O exit'19:10
jeblairSpamapS: there's still a middle ground, right?  run both agent and master before starting.  ansible can still create new connections, but we solve the caching and exit problem.19:10
SpamapSI actually really like this and it might even prove simpler than managing agents.19:10
pabelangerIf we moved ControlPath to outside bwrap, passed into bwrap, doesn't that solve the cache issue? But we still have the ControlPersist=60s issue?19:14
pabelangeralso, I would think ansible-playbook might want to run ssh -O cancel directly, to just avoid this issue on shutdown too19:14
jeblairpabelanger: bwrap is still running because the ssh process is running.  so to solve the exit issue, you need to start the control master before running bwrap.19:14
pabelangerjeblair: Ah, I see now19:15
pabelangerthank you19:15
jeblairso i think the two options on the table are:19:15
pabelangerokay, ya. So setting up control master, with ssh-agent seems to make sense then19:15
jeblairA) start agent and controlmaster before first ansible invocation.  kill both after last ansible invocation.  all connections will actually be made within the playbooks, and new connections can be made.  the agent can be accessed by the playbooks and manipulated.19:16
jeblairB) start controlmaster and then ping all hosts before first ansible-playbook invocation.  kill controlmaster after last ansible invocation.  all connections will be made before playbooks run, no new connection can be made.  no agent is required.19:17
mordredjeblair: I think I like B better - less moving parts - and if we can't make a connection to a remote host it happens in a place where we're totally in control of that and will know exactly what is happening19:18
mordredjeblair: that said - if someone wants to reboot a host in the middle of a playbook, this would prevent that19:19
mordredbecause the connnection would go away and the playbook would not be able to re-connect19:19
pabelangerinteresting, ya. I'd love for us to be able to support that19:19
jeblairmordred: yeah.  B has a lot going for it (it removes the need for the agent, *and* the whole ssh-key-removal pre-playbook dance SpamapS is working on).  but it does have that problem.19:20
mordredthat's potentially a valid thing to want to do for testing kernels and nothing else in the system prevents that19:20
jeblairmordred: or really sophisticated deployment testing?19:20
mordredyah19:21
jeblairlike the sort of hypothetical "run ursula" job19:21
pabelanger++ In the case of losing reboots, I think A gets my vote19:21
mordredexactly. although that one, I believe, needs to run ursula itself on a node pointed at the other node19:21
mordredsince it needs to do things we don't allow19:22
jeblairright.  "trusted run usula" :)19:22
jeblairursula even19:22
SpamapSA+=119:22
SpamapSless state, more resilience.19:23
jeblairokay.  the A's have it.  but we'll all remember B and how it's actually so much cooler than A.19:23
pabelanger:)19:24
jeblairi'm going to grab some lunch; pabelanger i'll look at the venv thing with you when i get back19:24
SpamapSI liked B before it was cool.19:25
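(A rough sketch of the controlmaster half of option A as just agreed; the socket path and host are illustrative, and the real wiring would live in the executor:)

    import subprocess

    sock = '/var/lib/zuul/ssh/cm-%r@%h:%p'
    # Start a master that persists until told to exit; per SpamapS above,
    # ControlPersist=0 keeps it alive indefinitely.
    subprocess.check_call([
        'ssh', '-N', '-f',
        '-o', 'ControlMaster=yes',
        '-o', 'ControlPersist=0',
        '-o', 'ControlPath=' + sock,
        'zuul@node1'])
    # ... run the ansible-playbook invocations (inside bwrap, with the
    # socket path bind-mounted in) so they reuse the open connection ...
    subprocess.check_call([
        'ssh', '-o', 'ControlPath=' + sock, '-O', 'exit', 'zuul@node1'])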
*** jkilpatr has quit IRC20:18
jeblairpabelanger: what job was your venv error from?20:53
clarkbjeblair: openstack ansible jobs pep8 iirc20:53
jeblairer, should have asked which build :)20:53
jeblairor does it consistently happen on all builds of that job?20:54
pabelangerjeblair: I believe 473845 was one that showed the issue20:55
pabelangerso far I have seen it on pep8, and docs jobs20:55
pabelangernow, it is completely possible it is limited to my patchset, that's what I've been trying to determine20:56
jeblairpabelanger: do you have the whole log from that build?20:57
pabelangerI can manually grab it now20:57
jeblairpabelanger: from where?20:58
pabelangerjeblair: /tmp/paul.txt on ze01.o.o20:58
jeblairpabelanger: ok i'll look at it there20:58
pabelangerI manually copied it from jobdir20:58
*** jkilpatr has joined #zuul20:59
pabelangerjeblair: I have to step away for a bit, for family activity. Will be able to loop back in a few hours20:59
mordredSpamapS: if you have a second: https://review.openstack.org/#/c/474230 and the 5 before it are ready to go in21:09
jeblairclarkb, mordred: okay, i'm on a node where it failed... there is, indeed, a tox pep8 venv.21:27
jeblairwhen i try to run 'tox' on my own, in various forms, i can't reproduce the error21:27
clarkbjeblair: check the git reflog to see if host was reused maybe?21:27
clarkb(should have checkouts of earlier change ?)21:28
jeblairclarkb: i checked that; syslog looks new, and the last time zl01 used the host was a couple hours ago.  uptime is ~20 mins21:29
jeblairbut i'm surprised that i can't reproduce the error21:32
jeblairrun-tox.sh works whether .tox/pep8 exists or not.21:32
clarkbya tox seems to be deciding that it needs to execute virtualenv even though the venv exists already21:32
clarkband that is what fails. I bet if you manually tried to run virtualenv there it would fail too21:33
jeblairclarkb: nope that works too21:33
jeblair/usr/bin/python2 -m virtualenv --python /usr/bin/python2 pep821:33
jeblairsucceeds21:33
jeblair(that's the command from the log)21:33
clarkbthat is in .tox/ ?21:33
jeblair2017-06-15 21:09:31.343068 | ubuntu-xenial | ERROR: InvocationError: /usr/bin/python2 -m virtualenv --python /usr/bin/python2 pep8 (see /home/zuul/src/git.openstack.org/openstack-infra/openstack-zuul-jobs/.tox/pep8/log/pep8-0.log)21:33
jeblairclarkb: yep. still succeeds.21:34
jeblairthe log file in that line, however, does not exist.21:34
mordredjeblair: something something bubblewrap?21:35
jeblairmordred: i hope not, we're actually on the remote node now.  but maybe something something ansible?21:36
mordredvirtualenv uses python egg-link files right?21:36
mordredoh- duh. remote node21:36
jeblairmordred: i'm going to try being zuul and running ansible-playbook from zl0121:36
mordredjeblair: ++21:36
jeblairoops, we missed adding some env vars to our debug lines21:37
jeblairalso, ssh-agent makes this hard21:37
clarkbhuh ya confirmed outside of tox and all that that virtualenv doesn't complain if you revenv21:38
mordredjeblair: fwiw,when I run ansible-playbook by hand in the context of our zuul stuff, I do:21:38
mordredjeblair: ZUUL_JOB_OUTPUT_FILE=$(pwd)/log.out ansible-playbook21:38
jeblairclarkb: the traceback almost looks like it's a file/dir mixup (like it's mkdiring over a file maybe?)21:38
jeblairmordred: ya, got past that, just need to figure out how to ssh now21:39
mordredjeblair: but - this is making me think we should make a utility like "zuul-playbook" that we can use to run ansible-playbook in the foreground like zuul runs it21:39
mordredjeblair: ++21:39
mordred(just as a zuul-admin debug tool)21:39
jeblair2017-06-15 21:07:05,545 DEBUG zuul.AnsibleJob: [build: 783818dc9fcb4ada9724bf32c6cc5e5b] Ansible command: ANSIBLE_CONFIG=/tmp/783818dc9fcb4ada9724bf32c6cc5e5b/ansible/untrusted.cfg ansible-playbook -vvv /tmp/783818dc9fcb4ada9724bf32c6cc5e5b/work/src/git.openstack.org/openstack-infra/openstack-zuul-jobs/playbooks/base/pre.yaml21:40
jeblairthat used to work as a copy paste.  we can add the ZUUL_JOB_OUTPUT_FILE var to it, but we'll need to solve the ssh-agent thing.21:40
mordred++21:40
jeblairmordred: that may be the thing that pushes us to needing zuul-playbook21:40
mordredjeblair: --private-key=21:41
mordredjeblair: argument to ansible-playbook21:41
jeblairmordred: oh neat21:41
jeblairjust in time for the clock to run out (i put a sleep 40 mins in a change since we don't have node holding yet)21:42
jeblairwill try again21:42
mordredjeblair: fwiw, I'm running a tox -epep8 via ansible with our libs/ansible plugins loaded against a node I grabbed a few days ago from regular nodepool21:45
mordredjeblair: and it worked21:45
*** yolanda has quit IRC21:45
jeblairokay, i'm on a node before it's gotten around to running tox21:48
jeblairand can confirm that there is no .tox dir at the moment21:48
jeblairand now there is21:49
jeblair2017-06-15 21:48:14.564793 | ubuntu-xenial | py.error.ENOENT: [No such file or directory]: open('/home/zuul/src/git.openstack.org/openstack-infra/openstack-zuul-jobs/.tox/pep8/log/pep8-0.log', 'r')21:49
jeblairokay, that's a slightly different error21:49
jeblairclarkb, mordred: i am able to reproduce when running via ansible21:54
mordredjeblair: and it's base/pre.yaml that's failing?21:54
jeblairmordred: no, tox/linters.yaml21:54
jeblairwhen i run it with no existing .tox dir, i get the first error (the one about logs).  if i run it after that, i get the other error about the python binary.21:55
jeblair/tmp/jeblair1 and /tmp/jeblair2 are the zuul job log output files for both of those cases21:55
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Make logging helper method in zuul_stream  https://review.openstack.org/47296321:56
jeblairalso http://paste.openstack.org/show/612753/ is the console output for one of those21:56
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Extract get_playhosts listing in zuul_stream to a method  https://review.openstack.org/47296421:57
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Change log streaming link to finger protocol  https://review.openstack.org/43776421:57
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Direct streaming at delegated_to target  https://review.openstack.org/47421521:58
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Special case shell logging on localhost  https://review.openstack.org/47421621:58
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Handle lists of streamers  https://review.openstack.org/47423021:58
mordredjeblair: ok - as another data point - if I run that playbook with the role from openstack-zuul-jobs with ansible with our plugins but on the host I just happen to have - it works22:01
mordredjeblair: so I have not managed to reproduce it using ansible from my laptop22:01
jeblairi made a shell script on the node that set the same env variables that were printed in the log22:15
jeblairi ran it with "env -i".  and it works22:15
jeblairbut ansible-playbook from zl01 continues to fail22:15
jeblairZUUL_JOB_OUTPUT_FILE=/tmp/jeblair3 ANSIBLE_CONFIG=/tmp/fc8b064159e742fd8d7ecfe216335fb4/ansible/untrusted.cfg ansible-playbook -vvv /tmp/fc8b064159e742fd8d7ecfe216335fb4/work/src/git.openstack.org/openstack-infra/openstack-zuul-jobs/playbooks/tox/linters.yaml --private-key=/var/lib/zuul/ssh/nodepool_id_rsa22:15
jeblairis the command i'm using22:16
jeblairwhen i run that test script using ansible-playbook on zl01, it fails22:17
jeblairand there goes the host22:20
mordredblast22:20
jeblairgetting a new one22:20
clarkbvirtualenv uses the python you are using by default which may be affected by ansible buy tox passes explicit -p to select the python22:22
*** jkilpatr has quit IRC22:24
mordredclarkb: it seems to be the right python in both cases22:31
mordredclarkb: /home/zuul/src/git.openstack.org/openstack-infra/openstack-zuul-jobs/.tox$ /usr/bin/python2 -m virtualenv --python /usr/bin/python2 pep8 >/home/zuul/src/git.openstack.org/openstack-infra/openstack-zuul-jobs/.tox/pep8/log/pep8-0.log22:32
clarkbya it passes -p I guess really --python which is as explicit as you can get so the calling python shouldn't matter at all22:32
jeblairby commenting out a lot of stuff in run-tox.sh and removing the .tox dir, i got it to succeed.22:41
jeblairworking out what's relevant now22:41
mordredjeblair: woot22:43
jeblairnope.  this is not consistent.22:44
jeblairit sometimes succeeds and sometimes fails.22:44
jeblairi'm out of ideas22:45
mordredjeblair: which things did you comment out?22:45
mordredjeblair: were any of them the env vars it sets?22:45
jeblairmordred: yes22:45
jeblairmordred: i don't trust that experiment22:45
mordredme either22:45
jeblairmordred: maybe commenting some of them out makes it sometimes succeed instead of always fail22:46
jeblairbut... ??22:46
pabelangerand back, just catching up22:46
mordredyah - I'd like to understand what's different in the env22:46
mordredespecially since I can run it successfully from my laptop using the role and playbook22:46
mordredso there is $something being changed in between those two22:47
* jlk back from a long day22:47
jlk... to start working22:47
mordredjeblair: are you working in /tmp/d3ed28ec9f954f56b897a52ecefa62ea/ansible ?22:48
jeblairmordred: yes22:48
jeblair2017-06-15 22:51:58.461919 | ubuntu-xenial | py.error.EACCES: [Permission denied]: open('/home/zuul/src/git.openstack.org/openstack-infra/openstack-zuul-jobs/.tox/pep8/log/pep8-1.log', 'w')22:52
mordredjeblair: I'm running a few times22:52
jeblairwhat is22:52
jeblairoh22:52
jeblairthat explains why every time i run i get a different error22:52
mordredno - this is the first time I ran - sorry - thought you'd stopped22:52
* mordred no longer touching anyhting22:52
jeblair2017-06-15 22:54:14.881247 | ubuntu-xenial | shutil.Error: `/usr/lib/python2.7/copy_reg.py` and `/home/zuul/src/git.openstack.org/openstack-infra/openstack-zuul-jobs/.tox/pep8/lib/python2.7/copy_reg.py` are the same file22:54
jeblairnow a success22:54
mordredwtf22:54
jeblairanother success22:55
pabelangerya, that's been my last 24 hours. pulling hair out at randomness22:55
jeblairmordred: all yours :)22:55
mordred2017-06-15 22:56:00.397324 | mordred-irc.inaugust.com | /usr/local/jenkins/slave_scripts/run-tox.sh: .tox/pep8/bin/pip: /home/zuul/src/git.openstack.org/openstack-infra/openstack-zuul-jobs/.tox/pep: bad interpreter: No such file or directory22:56
jeblairmordred: note, i have edited run-tox.sh.  i only commented things out.  feel free to edit.22:56
jeblairmordred, pabelanger: i have stopped nodepool-launcher on nl01.22:58
jeblairthat should prevent this host from being deleted.22:58
pabelangerk22:58
mordredjeblair: so - I'm currently removing the .tox dir between each run22:59
jeblairoh wow, i did that just in time.  the job completed.  i lost the tmpdir on ze01, but the host is still there.22:59
mordredwoot23:00
jeblairmordred: yeah, i did that too23:00
mordredjeblair: I have not yet gotten it to fail - so I'm uncommenting the things and trying again23:02
jeblairmordred: i'll see if i can convert another build directory on ze01 to work with this host23:02
mordredjeblair: ok. I can run it over and over again and it works - so something is different from my execution and the build dir on ze0123:04
jeblairmordred: ok, can you pause for a min, and lemme see if i have this reconstructed?23:04
mordredyup.23:04
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Implement pipeline reject filter for github  https://review.openstack.org/47400123:08
jeblairZUUL_JOB_OUTPUT_FILE=/tmp/jeblair3 ANSIBLE_CONFIG=/tmp/39321fb5d76f4a54a37de4f3beb384fa/ansible/untrusted.cfg ansible-playbook -vvv /tmp/39321fb5d76f4a54a37de4f3beb384fa/ansible/playbook_0/git.openstack.org/openstack-infra/openstack-zuul-jobs/playbooks/tox/linters.yaml  --private-key=/var/lib/zuul/ssh/nodepool_id_rsa23:09
jeblairmordred: okay, that just hit the error23:09
jeblairmordred: i ran this on the host before running that:23:09
jeblairrm -fr ~zuul/.cache/ ~zuul/src/git.openstack.org/openstack-infra/openstack-zuul-jobs/.tox/23:09
mordredjeblair: cool23:09
jeblairmordred: do you want to try running that?23:11
mordredyes - copying a few things real quick ...23:11
mordredjeblair: ok. I'm running it again - the first time worked23:18
mordredsecond time worked23:18
mordredran it again without cleaning and it worked23:18
mordredcleaned the dir - running again23:19
mordredk. it keeps working for me23:20
mordredlet's figure out what's different between me and ze0123:20
SpamapSon the surface.. ze01 is a VM and you are a human?23:21
* SpamapS will stop snarking and try to be helpful23:21
SpamapSis there something I can do to be a third pair of eyes?23:21
jeblairmordred: have you run it on ze01 yet?23:21
jeblairmordred: i'm guessing your most recent series was copy to your workstation and run from there?23:22
mordredjeblair: yes, that is correct - so all the files from the work dir are what I'm working from23:22
mordredI also just updated my local zuul copy to be the same as the copy of zuul on ze01 and it still works23:23
mordredjeblair: oh wait - it just failed23:24
jeblairyay?23:26
SpamapSsuccess fail!23:26
mordredok. SO - one of the differences ...23:27
jeblairSpamapS: i feel like i'm successfail in all my endeavors!23:27
mordredare 429428c0200becf7ac59d83ad1a61fb37171de5c and 1dbf5f96b5928695d65587cccc7fe56630f6114a23:27
mordredwhich are updates to the command module from latest ansible added to try to fix python3 related issues23:27
jeblairmordred: interesting...23:28
mordredI have reverted those two locally - trying again23:28
jeblairmordred: those should be easy to revert on their own, right?23:28
jeblair++23:28
mordredbingo23:29
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Revert sync from latest command from ansible  https://review.openstack.org/47480823:30
mordredjeblair: now - the purpose of those is that running our command module with ansible with python3 was borking for jamie23:31
jlkbitten by ansible updates again?23:31
jeblairmordred: but not us?23:31
mordredjeblair: we weren't running jobs when it bit him23:31
pabelangerI don't think we are using command much right now23:31
pabelangerjust shell23:32
mordredwe use it a ton23:32
pabelangeror, is that the same23:32
mordredit's the same thing23:32
mordredyup23:32
pabelangerah23:32
jeblairmordred: but was that also the same problem as delegate_to?23:32
jeblairlet me rephrase23:32
jeblairwere the python3 ansible issues also related to delegate_to tasks?23:32
mordredyes23:32
pabelangermordred: thank you, I was going crazy23:33
jeblairok, so that's another reason we may not see those issues immediately23:33
mordredI think we should we should consider, at least until we've got this particular topic sorted, running ansible via python2 and setting ansible_python_interpreter in our inventory files to python2 as well23:33
jeblairmordred: ansible_python_interpreter sounds straightforward... run ansible-playbook via python2 is, perhaps, less so?23:35
jeblairmordred: since we're listing ansible as a dependency of zuul; we'd have to switch to having it externally installed, and then we lose control over versioning23:35
mordredyes - I really just mean for a few days, not as a general approach23:36
jeblairoh ok23:36
jeblairwell, probably a week or two at this point :|23:36
pabelangerI'm okay for python2 and ansible23:36
mordredso that we can unstick folks from working on jobs and figure out the py3 issues in isolation23:36
mordredjeblair: yah23:36
jeblairmordred: okay, so what's the proposal on the table?23:36
mordredjeblair: we need to do a pip install of ansible on ze01 with python2's pip so that "ansible-playbook" gets us a python2 version23:37
mordredjeblair: then we can also add ansible_python_version to the inventory we write out23:37
mordredor, ansible_python_interpreter rather23:37
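(One way that inventory tweak could look, sketched in python; the file path, node list, and line format are assumptions:)

    # Hypothetical: pin every node's remote interpreter to python2 in the
    # inventory zuul writes out, until the py3 module issues are sorted.
    nodes = [('node1', '203.0.113.10')]
    with open('/tmp/inventory', 'w') as inv:
        for name, ip in nodes:
            inv.write('%s ansible_host=%s '
                      'ansible_python_interpreter=/usr/bin/python2\n'
                      % (name, ip))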
jeblairmordred: okay... i'm probably going to want to blow away ze01 when we're ready to revert this23:38
mordred++23:38
jeblairi don't trust that "uninstall something we manually installed with pip2" is going to leave us in a fresh state :)23:39
jeblairmordred: how do we use the py2 version of ansible-playbook?23:39
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Use python2 for ansible on remote hosts  https://review.openstack.org/47481023:39
jeblair(considering that the py3 one will still be installed)23:40
clarkbfwiw pip has gotten much better at uninstalling stuff23:40
clarkband worst case you just delete everything under the python2 site packages in /usr/local23:40
mordredthe latest pip install will overwrite the /usr/local/bin/ansible-playbook link23:40
pabelangerwhat about ansible into a virtualenv, then setup symlink?23:40
mordredjeblair: you can verify after installing by doing ansible-playbook --version and it'll also print the python version23:40
jeblairmordred: i guess that's okay until there's another ansible release and pip installs it for zuul again; that's probably not going to happen this or next week i guess.  :)23:41
mordredjeblair: yah. this is definitely not the state we want long-term23:42
mordredjeblair: I mean - I'm going to wake up in the morning and try to figure out why the module changes break things23:42
jeblairmordred: does delegate_to not honor python_interpereter_whatever?23:42
*** jkilpatr has joined #zuul23:42
jeblairansible_python_interpreter.  that's it.23:42
jeblairmordred: i guess i'm wondering if just setting *that*, but still running py3 ansible-playbook would be okay...23:43
mordredjeblair: it should - I honestly don't know why delegate_to was finding python3 instead of python223:43
mordredI'll _also_ look in to that23:43
mordredjeblair: we can certainly try that23:43
jeblairmordred: will save a bunch of time, if it works.23:43
mordredjeblair: we could pull those revert patches into ze01, then add ansible_python_interpreter to the inventory and run the ze01 test again23:44
mordredactually - lemme see if I can test python3 real quick23:44
pabelangerif we are concerned about changing ansible versions on ze01.o.o over time, we could have bwrap launch from a tarball. I think we talked about that in the past. Obviously, that increases the burden on the operator to get said tarball onto ze01.o.o23:44
mordredjeblair: k. that seems to work23:44
jeblairpabelanger: i'm not at all concerned with that under normal circumstances.  i don't want to deal with that now.23:45
mordredyup23:45
mordredjeblair: I believe I have verified that reverting those patches should unstick us without any additional python interpreter stuffs23:45
mordredjeblair: this may re-break jamie and delegate_to23:45
mordredbut I've got a constrainted surface area to investigate tomorrow23:46
jeblairmordred: right, but the ansible_python_interpreter may fix jamie ?23:46
mordredoh - I think the problem there might be localhost connection=local may use the action in-process python23:46
jeblairmordred: or did jamie try ansible_python_interpreter and it still didn't work?23:46
mordredI don't know23:46
jeblairok.23:47
mordredbut that's a problem we can figure out23:47
jeblairjamielennox: ^ we said your name a lot :)23:47
* mordred needs to eod - but promises to pick up helping solve the issue with the thing we just reverted in the morning23:48
clarkbmordred: I want to say connection=local forks a new 'python' and doesn't use the existing process23:48
jeblairmordred: good night!23:48
clarkbmordred: this is why virtualenvs don't work as people expect with connection=local23:48
mordredclarkb: k. in that case, ansible_python_interpreter MAY still help that23:48
clarkbyes I think ansible_python_interpreter is how you work around the virtuaelnv issue, you point it at the python in the venv and then its happy23:48
SpamapSThat's actually quite nice as you may intentionally want to use the same python you use for your remote nodes for local connections, but you may want a specific python for ansible that is different.23:52
SpamapSNot how you'd expect, sure, because PATH is being ignored, but functional once revealed.23:52
jamielennoxjeblair: oh? i flicked through the history and didn't see highlights23:54
jamielennoxthere's a lot of overnight in the zuul channels at the moment23:54
jamielennoxjeblair: so the patch we've merged fixes the localhost issue afaik23:55
jamielennoxi have other localhost issues, primarily at the moment that i don't get any output from a task on localhost, and so i have a failing task and cannot figure out why or what it's doing23:55
jamielennoxbut from what i can see when you connection=local you are still in the same venv23:56
jamielennox(though realistically i am running command: with localhost so i don't actually notice)23:56
