openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: requirements: update ansible version https://review.openstack.org/475228 | 01:10 |
*** lennyb has quit IRC | 03:40 | |
*** lennyb has joined #zuul | 03:51 | |
tristanC | tobiash: how about using command: rsync instead of the synchronize module? | 04:19 |
tobiash | tristanC: at least the shell module also didn't work delegated to localhost | 04:59 |
tobiash | but that's ok as I currently have a working workaround | 05:00 |
jamielennox | tobiash: you have a workaround to command on localhost? | 05:05 |
tobiash | jamielennox: not for command but for pulling files | 05:06 |
jamielennox | i haven't figured out exactly what fails on mine, but currently there is no output from the command | 05:06 |
tobiash | jamielennox: the output is missing as ansible tries to use py3 when delegated to localhost and many modules don't work yet with py3 | 05:06 |
tobiash | (e.g. synchronize, shell, command,...) | 05:07 |
jamielennox | ah ok, pull didn't seem like a problem, it was me trying to rsync from the executor to the log server which i was running via command | 05:07 |
jamielennox | tobiash: at least command should though, i patched it up the other day | 05:07
tobiash | jamielennox: didn't try command, just shell and synchronize | 05:08 |
jamielennox | tobiash: command and shell are basically the same thing | 05:08 |
tobiash | ah, ok | 05:08 |
jamielennox | same code run, a few flags set | 05:08 |
tobiash | jamielennox: for pulling files I currently use a combination of find and fetch modules | 05:09 |
jamielennox | tobiash: synchronize seemed to be pulling ok for me | 05:12 |
jamielennox | with mode: pull | 05:12 |
jamielennox | i'm not sure how much of that runs on client vs server in that mode | 05:12 |
tobiash | hm, for me that didn't work (failed without any helpful messages) | 05:12
jamielennox | i haven't been able to debug it far | 05:16 |
*** smyers has quit IRC | 05:21 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul feature/zuulv3: Fix and test report urls for unknown failures https://review.openstack.org/469214 | 05:24 |
*** smyers has joined #zuul | 05:24 | |
gundalow | Zuul people at AnsibleFest we have booked "Riverview 9" at 11am. The room is on the UK 2nd floor, i.e. the floor above where the main tracks are | 07:21
jlk | gundalow: what's the topics? | 07:58 |
pabelanger | gundalow: thanks | 08:00 |
gundalow | jlk: I'd like to know the next steps for Zuul + Ansible integration. Especially anything that will need Ansible Core Team people so I can add it into the roadmap | 08:02 |
jlk | Cool. I'll be there. | 08:03 |
*** hashar has joined #zuul | 08:06 | |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Add a test to verify push reports only set status https://review.openstack.org/476400 | 08:09 |
jlk | jamielennox: So when working on ^ I ran across something that may bite us. I think our code is trying to create a log path to use with the github status we report, and I think it's trying to use a PR number as part of that URL (and maybe the internal log path). But a push event doesn't have an associated PR. | 08:11 |
pabelanger | remote: https://review.openstack.org/476403 Allow secrets for check pipeline | 08:17 |
pabelanger | so, I think ^ is going to be an issue. | 08:17 |
pabelanger | I'm not sure we want secrets in check today for any untrusted job, but we do need it for trusted atm | 08:19 |
pabelanger | ssh private key for logs.o.o to publish job logs | 08:19 |
jlk | oh eww | 08:26 |
pabelanger | I guess we could move log publishing into a new pipeline | 08:36 |
jamielennox | log publishing is a trusted job, so it runs outside of bwrap on the system | 08:45 |
jamielennox | so if you have an ssh key on the executor you can scp your logs to publish | 08:45 |
jamielennox | eg (because i have it open) https://github.com/BonnyCI/project-config/blob/master/playbooks/roles/base/tasks/post.yaml | 08:45 |
jamielennox | note, that's the delegate_to that currently doesn't work - but it used to | 08:45
pabelanger | jamielennox: right, we could do that but that means we need to manage keys with puppet. We are also going to run trusted in bwrap shortly too: https://review.openstack.org/#/c/474460/ | 08:46
pabelanger | so, with that in mind, we could just store key as secret and have zuul just pass it | 08:47 |
pabelanger | and we talked about that last week; it's part of the reason we added tls to gearman | 08:47
jamielennox | pabelanger: what's the benefit of being trusted if you're still in a bwrap? | 08:51
pabelanger | jamielennox: usage of lookups / ansible things | 08:51 |
jamielennox | also yea, but we're already doing some key management; isn't the first key (the one nodepool puts on it) already going to need to be in the executor? | 08:52
pabelanger | ya, today we need the ssh key for nodepool. but I feel a job could manage the SSH key for logs.o.o under zuulv3. I guess if we think that is going to be an issue, we'll just fall back to having puppet manage that key | 08:55
jamielennox | so i've wondered a bit about that, particularly if you were uploading logs to an api of some sort instead of a straight ssh | 08:57 |
jamielennox | whether you could use like a one time private key from hashicorp vault or something that would also tie the job identifier to the key | 08:57 |
jamielennox | but all of these things will require some sort of trusted key on the executor | 08:58 |
pabelanger | right, swift is a good example. we'd want a trusted job to publish to swift. I don't see us storing something directly on the executor for that, the job would have that info in a secret and we'd just use it | 09:00 |
*** hashar has quit IRC | 09:01 | |
*** smyers has quit IRC | 09:03 | |
*** hashar has joined #zuul | 09:04 | |
jamielennox | right, but temp_urls work because you can create them with your token/username+password | 09:04 |
jamielennox | you'll always have a bootstrap problem of how do you know who can create the upload approval token | 09:05 |
jamielennox | but that's what the trusted jobs have been for | 09:05 |
*** hashar_ has joined #zuul | 09:09 | |
*** hashar has quit IRC | 09:09 | |
*** smyers has joined #zuul | 09:15 | |
*** bhavik1 has joined #zuul | 09:23 | |
jlk | gundalow: The rooms aren't labeled, but we've taken the one straight across from the women's restroom | 09:38
*** hashar_ has quit IRC | 09:40 | |
jlk | mordred: jeblair: we're upstairs in one of the two Ansible rooms, floor 2 above the main snack area of Ansiblefest. | 09:46 |
*** bhavik1 has quit IRC | 10:00 | |
jeblair | pabelanger: yeah, i don't think we want to allow secrets in check. i think we'll have to have the log key installed on the executor and use it from a trusted playbook. when we switch to using bubblewrap for trusted, we'll bind-mount that key in as well. i'd like to be able to use secrets for this, but i think we're going to have to revisit some things, so let's not hold up progress on jobs. | 10:23 |
pabelanger | jeblair: okay, I'll do that now in puppet | 10:24 |
jeblair | (in particular, secrets are currently given to an entire job, and we'll have to think about whether we can make them available to just a single playbook within a job) | 10:28 |
pabelanger | jeblair: so, we do not set zuul_changes today in the zuul dict() in the inventory file. We do have the legacy ZUUL_CHANGES variable, but I think we are (or should now be?) removing them | 10:55
mordred | jamielennox, tobiash: do you have access to logs or anything that shows the issues with synchronize and python3? | 11:01
mordred | I'm sitting with toshio looking at the issue | 11:02 |
*** jkilpatr has joined #zuul | 11:04 | |
*** hashar has joined #zuul | 11:17 | |
tobiash | mordred: I haven't figured out how to get logs for that issue | 11:42 |
tobiash | just see that ansible starts a py3 session | 11:42 |
tobiash | and after that there is nothing in the executor log | 11:42 |
mordred | ahh. kk. I think I can try to reproduce that | 11:43 |
*** hashar has quit IRC | 11:46 | |
*** hashar has joined #zuul | 11:46 | |
tobiash | mordred: http://paste.openstack.org/show/613370/ | 11:48 |
tobiash | that's what I got with verbose executor debug logging | 11:48 |
*** hashar has quit IRC | 11:53 | |
*** hashar has joined #zuul | 12:03 | |
*** hashar has quit IRC | 12:18 | |
*** hashar has joined #zuul | 12:19 | |
mordred | tobiash: awesome! thank you! | 12:52 |
mordred | oh fun | 12:52 |
mordred | pabelanger: "test/runner/ansible-test sanity --local --python 2.7" is they way to run ansible test without sourcing hacking/env-setup | 12:54 |
mordred | pabelanger: --local tells it to use the local env not a virtualenv | 12:54 |
pabelanger | ++ | 12:55 |
mordred | tobiash: can you paste me your playbook with synchronize? | 14:03 |
tobiash | mordred: let me check | 14:03 |
tobiash | mordred: did much rebasing during debugging, have to check reflog... | 14:05 |
tobiash | mordred: http://paste.openstack.org/show/613402/ | 14:08 |
mordred | tobiash: awesome! thank you | 14:09 |
tobiash | prerequisite is that the main or pre job at least creates the logs dir in zuul_workspace_root | 14:09 |
tobiash | (in theory if it would work) | 14:10 |
mordred | tobiash: and which task broke? | 14:10 |
mordred | the publish logs? | 14:10 |
tobiash | the synchronize | 14:11 |
tobiash | (collect node logs) | 14:11
tobiash | mordred: the publish works | 14:11 |
tobiash | the create log dirs also works | 14:12 |
mordred | kk. thanks (we're trying to make it break here so we can debug) | 14:12 |
tobiash | :) | 14:14 |
tobiash | mordred: additional info is that I'm already using the patch to run trusted jobs in bwrap too | 14:18 |
tobiash | if that has an impact on your debugging | 14:19 |
mordred | tobiash: but you had the issue before you deployed that right? | 14:21
tobiash | mordred: I think, but I'm not sure anymore | 14:22 |
tobiash | I was in a "Try and error until it works"-mode... | 14:23 |
tobiash | and the result was a workaround with find and fetch modules as I wasn't able to really debug this... | 14:24 |
mordred | tobiash: yah - we need better debugging structure around that :( | 14:55 |
mordred | tobiash: and ... we can't make it break here :( | 14:55 |
mordred | tobiash: but we're still trying things | 14:55
* SpamapS catching up with all of you in GMT | 15:32 | |
jeblair | tobiash: we've tried running a playbook with that synchronize task inside of bubblewrap under py3.6 and still can't reproduce. if you have a chance to try this procedure to get more info, that would be great: run zuul-executor with --keep-jobdir, run a job, grab the bubblewrap command from the logs, then run it manually from a shell. | 15:33 |
* SpamapS points at zuul-bwrap as a possible aid to debugging | 15:34 | |
SpamapS | I'm sure there's context that has been missed. Have we totally thrown out the idea of just running ansible-playbook using python2.7 ? | 15:35 |
jeblair | tobiash: if that fails, then edit the ansible.cfg to remove zuul_stream and run it again, hopefully if there is a problem where zuul_stream is eating output, that will show more | 15:36 |
jeblair | tobiash: also, i forgot to mention: run 'zuul-executor verbose' before running the job to get -vvv output | 15:36
jlk | I think there were some concerns about the streaming code in python 2 | 15:36 |
mordred | SpamapS: well - we've mostly been focused on trying to reproduce the issue while we had the ansible core team here | 15:37 |
mordred | SpamapS: and have been unable to do so - although we understand basically everything there is to know about synchronize now :) | 15:37 |
SpamapS | hah | 15:37 |
mordred | SpamapS: general consensus is that there should be no issues with synchronize in python3 since the majority of it is run in the action plugin which is core-not-module | 15:38 |
SpamapS | that's exactly what I was hoping we could avoid tying our progress to. | 15:38 |
SpamapS | ahh | 15:38 |
SpamapS | so maybe it's not python3's fault | 15:38 |
mordred | yah | 15:38 |
SpamapS | that explains the debug steps taken | 15:38
mordred | SpamapS: we've also got context in folks heads, so if we can get it to happen again and get a reproducible case, we should be able to debug and figure out what's up | 15:40 |
SpamapS | yeah I don't even have context on when it was seen | 15:41 |
*** hashar is now known as hasharAway | 15:43 | |
*** tobiash_ has joined #zuul | 16:28 | |
tobiash_ | jeblair, mordred: thanks, will try tomorrow again | 16:28 |
tobiash_ | I'm already @home without my work laptop | 16:29 |
tobiash_ | the log I posted earlier was already with verbose logging | 16:30 |
tobiash_ | maybe it could be also a different issue | 16:33 |
tobiash_ | how does zuul setup these module overrides? | 16:33 |
SpamapS | tobiash_: ansible.cfg loads them as plugins IIRC | 16:34 |
tobiash_ | there might also be a possibility of leftovers from earlier module overrides? | 16:34
tobiash_ | I'll clean up that directory and retry the synchronize again tomorrow | 16:35 |
tobiash_ | btw, will this stay like this or is there a plan to integrate localhost restrictions in ansible itself? | 16:38 |
SpamapS | tobiash_: that's a really long long way away.. but I think it's the right path | 16:44 |
SpamapS | have a "safe mode" | 16:44 |
SpamapS | it would need to be designed generically which might get complicated | 16:45 |
tobiash_ | ok, so more a topic for ansible 3 and zuul 5 | 16:47 |
tobiash_ | what comes to my mind now: does zuul inhibit loading custom modules alongside the playbooks? | 16:51
tobiash_ | like http://blog.toast38coza.me/custom-ansible-module-hello-world/ | 16:54 |
Shrews | ugh, i think my signal.pause() is affecting asyncio happenings | 16:59 |
SpamapS | tobiash_: I believe that's disabled in untrusted mode by the action plugin restrictions. Could be wrong. | 17:44 |
SpamapS | Shrews: Highly likely. Signals and async are always dicey. | 17:44 |
tobiash_ | ok | 17:48 |
*** tobiash_ has left #zuul | 18:11 | |
*** tobiash__ has joined #zuul | 18:11 | |
*** tobiash__ has quit IRC | 18:11 | |
*** tobiash__ has joined #zuul | 18:12 | |
*** tobiash__ has quit IRC | 18:12 | |
*** tobiash_ has joined #zuul | 18:12 | |
clarkb | Shrews: I am following up on the "we don't delete uploads on process restart" thing from yesterday. The current cleanup worker checks if the upload is in progress by attempting to get the lock for that upload. If it gets the lock then it isn't in progress and should be deleted. Do we not release the locks when disconnecting from zk? | 18:46
clarkb | Shrews: our init script uses --signal 9, which may be part of it. I wonder if we can gracefully terminate and release locks on sigterm | 18:50 |
clarkb | Shrews: that said, our locks are expected to survive disconnects? or should our lock grabbing code check if whoever has the lock is connected before failing? | 18:50 |
clarkb | harlowja: do you know re ^ and http://kazoo.readthedocs.io/en/latest/api/recipe/lock.html ? | 18:53 |
clarkb | harlowja: if a zk client holds the lock via kazoo's lock recipe there, then disconnects, is that lock still held by that client? | 18:54
clarkb | I would expect not but that appears to be the behavior we are seeing | 18:54 |
harlowja | me too, though potentially after the zk session timeout | 18:54
harlowja | i didn't think there was an explicit on-disconnect release of all locks (or try to before disconnecting) | 18:55
harlowja | locks are basically https://github.com/python-zk/kazoo/blob/master/kazoo/recipe/lock.py#L213-L214 | 18:57 |
clarkb | harlowja: ok it does look like the records that were there yesterday (that I wanted to go away) have gone away on a query today | 18:57 |
clarkb | harlowja: so the session timeout may be it | 18:58 |
harlowja | ya | 18:58 |
harlowja | check whatever that is set to | 18:58 |
clarkb | harlowja: I don't see us setting it explicitly so I think whatever kazoo sets it to | 19:00 |
SpamapS | I thought locks had an ephemeral component | 19:01 |
SpamapS | so if the session goes away -> lock released | 19:01 |
SpamapS | and that was like, why zk is good | 19:01 |
clarkb | SpamapS: that is the behavior I expected but it definitely appears like that isn't the case based on nodepool builder behavior | 19:01 |
SpamapS | so yeah, session timeout is very likely the thing you are seeing. | 19:01 |
harlowja | see https://github.com/python-zk/kazoo/blob/master/kazoo/recipe/lock.py#L213-L214 | 19:02 |
harlowja | `node = self.client.create(self.create_path, self.data, | 19:02 |
harlowja | ephemeral=True, sequence=True)` | 19:02 |
harlowja | thats pretty much what a lock is, ha | 19:02 |
harlowja | an ephemeral node with sequencing | 19:02
SpamapS | right | 19:02 |
SpamapS | so until the session times out, lock will be held | 19:02 |
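A minimal sketch of the kazoo lock recipe being discussed, with a placeholder ZooKeeper address and lock path: the lock is backed by an ephemeral, sequenced znode, so it disappears immediately on a clean session close, but lingers until the session timeout if the holder is killed with SIGKILL.

```python
from kazoo.client import KazooClient

client = KazooClient(hosts='127.0.0.1:2181')
client.start()

# Acquiring the lock creates an ephemeral sequential node under this path.
lock = client.Lock('/nodepool/images/ubuntu-xenial/lock', identifier='nb01')
with lock:
    print('lock held; an ephemeral znode now exists under the lock path')
    # kill -9 here would leave that znode until ZooKeeper expires the session

# A clean stop closes the session, so the ephemeral node (and the lock)
# goes away right away instead of waiting for the session timeout.
client.stop()
```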
clarkb | oh I bet I know what is going on | 19:02
SpamapS | clarkb: and yeah, --signal 9 seems like a poor choice. What's up with that? | 19:02 |
clarkb | because we use sigkill we don't get to send the tcp fun | 19:02 |
clarkb | *fin | 19:03 |
SpamapS | yeah | 19:03 |
clarkb | which means zk has to timeout | 19:03 |
SpamapS | TCP timeout, even worse | 19:03 |
clarkb | SpamapS: I don't know why we use sigkill | 19:03 |
harlowja | SpamapS ya agreed, if the only way an app can die is via signal -9, prob app should be redesigned :-P | 19:03 |
clarkb | but if we sigtermed I think python would gracefully kill that for us | 19:03 |
SpamapS | we should be sending SIGTERM and waiting a bit before SIGKILL | 19:03 |
SpamapS | where's this init script? | 19:03 |
clarkb | it is likely that way because we run long-lived threads and don't mark them daemon? | 19:03
SpamapS | start-stop-daemon will do this for you automatically. | 19:04 |
harlowja | there are many reasons | 19:04 |
harlowja | a daemon thread really isn't a good answer imho either :-P | 19:04 |
clarkb | SpamapS: yes this is start stop daemon | 19:04 |
clarkb | harlowja: it is if this thing is literally supposed to run forever | 19:04 |
clarkb | but you still want to be able to stop it | 19:04 |
harlowja | sorry, i mean, u can use them, but u still need a way to signal to those threads to stop in a nice-happy-fashion | 19:04 |
clarkb | something somewhere is going to be in a tight loop | 19:04 |
harlowja | nice-happy-death preferred | 19:04 |
clarkb | harlowja: sure | 19:04 |
SpamapS | yeah, should be using start-stop-daemon --retry 10 or something | 19:05 |
harlowja | which usually evolves into breaking up your work into small tasks, and then in between tasks checking if some kind of event has been set (ie, for the thread to stop) | 19:05 |
harlowja | blah blah | 19:05 |
harlowja | lol | 19:05 |
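A minimal sketch of the shutdown pattern harlowja describes, assuming nothing about the actual nodepool code: SIGTERM/SIGINT set an event, worker threads check it between small units of work, and the process exits cleanly so any ZooKeeper session is closed rather than left to time out after a kill -9.

```python
import signal
import threading

stop_event = threading.Event()

def request_stop(signum, frame):
    # runs in the main thread when SIGTERM or SIGINT arrives
    stop_event.set()

signal.signal(signal.SIGTERM, request_stop)
signal.signal(signal.SIGINT, request_stop)

def worker():
    while not stop_event.is_set():
        # ... one small unit of work would go here ...
        stop_event.wait(1)   # re-check the stop request at least once a second

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

signal.pause()        # sleep until a signal arrives and the handler runs
for t in threads:
    t.join()          # workers notice the event and exit on their own
# falling off the end here lets sessions and sockets close cleanly
```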
clarkb | SpamapS: https://git.openstack.org/cgit/openstack-infra/puppet-nodepool/tree/files/nodepool-builder.init | 19:06 |
clarkb | we do actually appear to have a graceful stop in the nodepool builder too | 19:06 |
SpamapS | argh, cgit has no annotate? | 19:07 |
clarkb | that marks each running worker to shut down, waits for them, then disconnects from zk | 19:07
clarkb | I think I see the patch needed to make this work | 19:08 |
clarkb | nodepool.cmd.builder handles sigint as a stop | 19:09 |
clarkb | SpamapS: so ya switching the init script to sigint and wait before sigkill should do it | 19:09 |
clarkb | SpamapS: I can review that really quick if you are wanting to push it. Otherwise I'll sort out the right start-stop-daemon incantation | 19:10
clarkb | you know what the problem is going to be? its the block on image upload | 19:11 |
clarkb | because that can take a few hours | 19:11
clarkb | so we might separately have to make that die gracefully on sigint | 19:11 |
clarkb | yup looks like upload image and image build are both synchronous with their workers | 19:14 |
clarkb | and we join() the workers on stop() | 19:14 |
Shrews | image build spawns a subprocess that should receive any signals sent, so should terminate rather quickly. image uploading doesn't work that way. | 19:17 |
clarkb | ah ok so upload is the one to worry about here | 19:17
Shrews | this should pretty much follow the design of the older nodepool in this area | 19:18 |
clarkb | ya, difference is that older nodepool didn't rely on graceful connection closes to properly update state in the database | 19:18
clarkb | (because there was no database) | 19:18
Shrews | (sorry, was out getting new tires. yay fun) | 19:18 |
clarkb | so old nodepool was fine with a forceful kill | 19:19 |
clarkb | Shrews: what do you think about joining only the build workers and not the uploads? | 19:19 |
clarkb | thats the easy mode fix for this I think | 19:19 |
clarkb | (and roughly equivalent to the behavior with kill -9 going on now) | 19:19 |
clarkb | my only concern doing that is upload holds a lock with the zk client and stop() will attempt to stop the client and kill it under the upload | 19:21 |
clarkb | will that raise in the context manager properly? | 19:21
Shrews | well, i'm not worried about the lock handling. that's automatic | 19:23 |
Shrews | i think what would happen is those records will just remain in the UPLOADING state and get cleaned up at some point after restart | 19:24 |
clarkb | Shrews: yup. That is what seems to happen today. The problem today is we kill the tcp connection to zk without letting zk know so we have to wait for that session to time out. Whereas if we were to call nb.stop() properly with sigint we'd close the zk connection which should unlock things at that point | 19:25
clarkb | but for that to be effective we can't wait for three hours for a thread to join. I think I see some other related bugs and will push up a patch shortly and we can look at this with a concrete change | 19:25
Shrews | clarkb: are you finding the zookeeper session timeout too long? | 19:27 |
clarkb | Shrews: I think that is why those uploading records for centos hung around forever yesterday. Either they were waiting on the zk session timeout or the tcp connection to timeout on the zk side or both | 19:28 |
Shrews | clarkb: those hung around because they were part of the latest build | 19:28 |
Shrews | we needed two more builds to age them out | 19:28 |
Shrews | (2 most recent builds and their zk records hang around) | 19:29 |
SpamapS | I'm a little confused | 19:29 |
SpamapS | not looking at the code | 19:29 |
SpamapS | upload threads or upload subprocesses? | 19:29 |
Shrews | the problem with those was that we were not cleaning up not-in-progress UPLOADING records for the most recent builds | 19:29 |
clarkb | SpamapS: threads | 19:29 |
SpamapS | but Shrews just said processes. | 19:30 |
SpamapS | oh | 19:30 |
SpamapS | that's build | 19:30 |
Shrews | SpamapS: image builds are subprocess | 19:30 |
SpamapS | k | 19:30 |
clarkb | Shrews: if you look at the cleanup code it will delete uploads in progress if nothing holds their lock | 19:30 |
clarkb | Shrews: and that is independent of age | 19:30 |
SpamapS | then yeah, don't join those | 19:30 |
SpamapS | exit() should take care of them. | 19:30 |
clarkb | SpamapS: yup | 19:30 |
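A hypothetical sketch of the "easy mode" fix being agreed on here; the class and attribute names are made up, not the actual nodepool builder code. Every worker is asked to stop, but only the (fast) build workers are joined; upload workers are daemon threads, so process exit abandons them and closing the ZooKeeper session releases their locks for the cleanup worker.

```python
import threading

class Worker(threading.Thread):
    """Stand-in for a nodepool build or upload worker thread."""
    def __init__(self, name, interval=1):
        super(Worker, self).__init__(name=name)
        self.daemon = True                  # never blocks interpreter exit
        self._stop_event = threading.Event()
        self._interval = interval

    def run(self):
        while not self._stop_event.is_set():
            # ... a real worker would build or upload an image here ...
            self._stop_event.wait(self._interval)

    def shutdown(self):
        self._stop_event.set()

build_workers = [Worker('builder-%d' % i) for i in range(2)]
upload_workers = [Worker('uploader-%d' % i) for i in range(4)]
for w in build_workers + upload_workers:
    w.start()

def stop():
    for w in build_workers + upload_workers:
        w.shutdown()
    for w in build_workers:
        w.join()        # builds wrap up quickly once signalled
    # Upload workers are intentionally not joined: an in-flight upload can
    # take hours. Exiting and closing the ZooKeeper session releases their
    # locks so the cleanup worker can reap the leftover UPLOADING records.

stop()
```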
Shrews | clarkb: nope. 1 sec... | 19:31 |
SpamapS | clarkb: and the start-stop-daemon incantation is just --retry <timeoutvalue> instead of --signal 9 | 19:31 |
Shrews | clarkb: https://github.com/openstack-infra/nodepool/blob/master/nodepool/builder.py#L441-L442 | 19:31 |
clarkb | Shrews: https://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/builder.py#n376 | 19:31 |
Shrews | that little bit right there keeps us from calling cleanupProvider | 19:31 |
SpamapS | If timeout is specified instead of schedule, then the schedule signal/timeout/KILL/timeout is used, where signal is the signal specified with --signal. | 19:32 |
clarkb | SpamapS: thats a build not an upload | 19:32 |
clarkb | er Shrews ^ | 19:32 |
clarkb | Shrews: the code I linked is called before the section you point out and is explicitly for builds we want to keep | 19:33 |
clarkb | Shrews: basically it is making sure that the uploads corresponding to builds we keep are curated | 19:33 |
clarkb | SpamapS: so we just have to change the signal and set a timeout? | 19:35 |
Shrews | clarkb: yeah, you're right, i was in the wrong place yesterday | 19:38 |
Shrews | clarkb: perhaps we should adjust the session timeout in zookeeper server if it's too long? | 19:39 |
clarkb | Shrews: we could do that too, but right now we aren't able to gracefully shutdown the builder because uploads can take hours | 19:39 |
clarkb | Shrews: which is why I think we use sigkill in the init script | 19:40
clarkb | so fixing that first seems like a good idea | 19:40 |
clarkb | then if we still have problems we can adjust the session timeout | 19:40 |
Shrews | i honestly didn't expect it to be very long, the docs for it are vague, measuring in "ticks": https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkSessions | 19:40 |
clarkb | Shrews: ya I think SpamapS may be right that the tcp timeout is actually at fault because we don't gracefully close the tcp connection | 19:41
clarkb | so zk has to hang around and wait for tcp to give up | 19:41 |
Shrews | ah, that would make sense | 19:41 |
Shrews | what is the tcp timeout? anyone recall? | 19:43 |
clarkb | I don't I think it is defined by the kernel by default | 19:46 |
Shrews | clarkb: hrm, a quick test of a 'kill -9' on a kazoo script holding lock shows the lock disappear relatively quickly | 19:54 |
clarkb | Shrews: it's entirely possible that there are more bugs in the cleanup code too, resulting in the upload hanging around | 19:55
clarkb | (eg it's not just the lock we have to worry about) | 19:56
Shrews | yeah | 19:56 |
clarkb | but I think either way we should avoid using sigkill and instead use sigint | 19:56 |
clarkb | right now stop() is essentially dead code but it shouldn't be | 19:56 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Use private attribute lookup interanlly in builder.py https://review.openstack.org/476677 | 19:56 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Don't join image upload workers on stop() https://review.openstack.org/476678 | 19:56 |
clarkb | there I wrote lots of words to explain this | 19:56 |
Shrews | clarkb: oh, is it possible that the original nb* node that built the image was taken offline? | 20:00 |
clarkb | Shrews: no, I just restarted the service but the node was up the whole time | 20:01 |
openstackgerrit | Joshua Harlow proposed openstack-infra/nodepool master: Use threading.Event for quicker shutdowns https://review.openstack.org/476683 | 20:02 |
harlowja | my contribution ^ | 20:03 |
harlowja | lol | 20:03 |
Shrews | clarkb: hrm... so, i think we might have some hostname issues. 'nodepool dib-image-list' shows builds that have 'nb04' AND 'nb04.openstack.org' as the builder hostname. that could cause problems | 20:04 |
Shrews | i do not know how that could happen, but changing the hostname of a builder is a definite thing that jeblair and I have considered as a potential issue | 20:05 |
openstackgerrit | Joshua Harlow proposed openstack-infra/nodepool master: Use threading.Event for quicker shutdowns https://review.openstack.org/476683 | 20:05 |
harlowja | prob a pattern that should be used more ^ | 20:05 |
Shrews | which brings up SpamapS' idea of writing a unique id to disk and using that instead of hostnames | 20:06 |
clarkb | Shrews: I think that may have been related to pabelanger? I recall that he noticed hostname was fqdn in one cloud and just hostname in the other. Looks like they have been changed to both be just hostname | 20:07 |
clarkb | Shrews: happens because different cloud images and metadata retrieval/application methods apply that inconsistently | 20:07 |
Shrews | harlowja: ooh, i like that. though i don't think using that around the SUSPEND_WAIT_TIME portion of code is correct | 20:12 |
harlowja | why not | 20:12 |
harlowja | prob don't want to loop forever waiting for zk to come back up? | 20:13
Shrews | harlowja: oh wait. yeah, that works too | 20:13 |
Shrews | nm | 20:13 |
harlowja | :) | 20:13 |
harlowja | ya, shutting down things quickly is hard, lol | 20:13 |
harlowja | https://github.com/google/python-subprocess32#subprocess32 is also nice, in that it allows the call to timeout | 20:14 |
harlowja | its the backport from py3.3 | 20:14 |
harlowja | i've done stuff like run(cmd, on_timeout_cb=...) | 20:14 |
harlowja | so that u can like stop subprocesses that u start if your app is dying/getting shut down | 20:15
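A small sketch of the timeout support subprocess32 backports from Python 3.3 (the run(cmd, on_timeout_cb=...) call above is harlowja's own wrapper, not the library API); 'sleep 3600' stands in for a long-running child such as a dib image build.

```python
import subprocess32

proc = subprocess32.Popen(['sleep', '3600'])
try:
    proc.wait(timeout=5)           # give the child a deadline
except subprocess32.TimeoutExpired:
    proc.kill()                    # don't let a hung child block shutdown
    proc.wait()                    # reap it so no zombie is left behind
    print('child exceeded its deadline and was terminated')
```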
clarkb | in this case seconds is fine, but hours isn't :) | 20:16 |
harlowja | up to u | 20:16 |
harlowja | a lot of the time i see time.sleep in thread stuff it makes me think a threading.Event is really what's wanted (so the timeout can finish and the thread can die), but anyways, the more u know :-P | 20:17
Shrews | harlowja: no, i definitely like the use of that. i think we can avoid setting very short timeout intervals in our tests using that | 20:18 |
harlowja | prob | 20:18 |
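A minimal illustration of the substitution harlowja describes: a periodic loop built on time.sleep() can take up to a full interval to notice a shutdown request, whereas Event.wait() returns as soon as the event is set. The 60-second interval is arbitrary.

```python
import threading
import time

stop_event = threading.Event()

def periodic_worker(interval):
    while not stop_event.is_set():
        # ... one cleanup/build pass would go here ...
        # time.sleep(interval) here would pin shutdown latency at ~interval;
        # Event.wait() wakes up immediately once stop_event is set.
        stop_event.wait(interval)

t = threading.Thread(target=periodic_worker, args=(60,))
t.start()
time.sleep(1)
stop_event.set()     # worker exits almost at once instead of after ~60s
t.join()
```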
harlowja | https://docs.python.org/3/library/threading.html#barrier-objects is the other stuff that is useful | 20:18 |
harlowja | especially for the code that waits for X threads to start | 20:18
harlowja | i've got a version that i can extract that is equivalent, somewhere in taskflow, lol | 20:18 |
harlowja | ya, https://github.com/openstack/taskflow/blob/master/taskflow/types/latch.py#L22 | 20:19 |
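A small sketch of what harlowja is pointing at: threading.Barrier (Python 3.2+; taskflow's Latch, linked above, is the Python 2 equivalent) lets a coordinator or a test wait until all N worker threads have actually started before continuing.

```python
import threading

NUM_WORKERS = 4
started = threading.Barrier(NUM_WORKERS + 1)   # the workers plus the coordinator

def worker():
    started.wait()       # block until every party has reached the barrier
    # ... real work begins only once all threads are known to be running ...

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

started.wait()           # coordinator: returns once all workers have started
for t in threads:
    t.join()
```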
Shrews | i don't think we have need of barriers (thankfully) | 20:19 |
harlowja | k | 20:19 |
*** jkilpatr has quit IRC | 20:20 | |
Shrews | harlowja: it unfortunately doesn't work :( debugging now | 20:28 |
harlowja | impassible | 20:28 |
harlowja | lol | 20:28 |
clarkb | harlowja: does threading.event force a context switch? | 20:30 |
clarkb | harlowja: I am assuming it does if it sleeps under the covers | 20:30
harlowja | yes | 20:30 |
harlowja | py2.7 def had time.sleeps, lol | 20:30 |
harlowja | i think they made it slightly better in py3 | 20:30 |
harlowja | https://hg.python.org/cpython/file/v2.7.11/Lib/threading.py#l343 | 20:31 |
harlowja | 3.x they got rid of that loop, lol | 20:31 |
harlowja | from what i remember | 20:32 |
harlowja | or at least moved it down a level... | 20:33 |
Shrews | harlowja: found the bug. the running prop needs to do "not is_set" | 20:34 |
harlowja | ah poops | 20:34 |
Shrews | harlowja: i'll push up the fix. trying some other things with the tests, too | 20:34 |
harlowja | thought i did that, ha | 20:34 |
harlowja | kk | 20:34 |
harlowja | thx | 20:34 |
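A guess at the shape of the bug Shrews found, with made-up class and attribute names: when a plain boolean running flag is replaced by a stop Event, the property has to be inverted, since "running" now means the stop event is not set.

```python
import threading

class BaseWorker(object):
    def __init__(self):
        self._stop_event = threading.Event()

    @property
    def running(self):
        # the buggy version returned self._stop_event.is_set(), which is
        # backwards: the worker is running only while the event is unset
        return not self._stop_event.is_set()

    def stop(self):
        self._stop_event.set()
```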
harlowja | not high priority, i just saw this, and had urge to change it | 20:34 |
harlowja | back to writing peer reviews :-P | 20:34 |
clarkb | harlowja: I just know sleep() gets used a lot because it is the documented way to force a context switch | 20:34 |
harlowja | ya | 20:35 |
*** jkilpatr has joined #zuul | 20:40 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Use threading.Event for quicker shutdowns https://review.openstack.org/476683 | 20:42 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Use threading.Event for quicker shutdowns https://review.openstack.org/476683 | 20:45 |
Shrews | harlowja: bug fixed and cleaned up a test. I also removed that TODO b/c I don't feel we need to bring that kind of complexity in for that. | 20:46 |
Shrews | harlowja: thanks! | 20:46 |
harlowja | Shrews np :) | 20:47 |
harlowja | the hard part imho becomes when u start to have a thing that u call into that u want to pass that event into to stop that thing early | 20:48 |
harlowja | u start to end up with coroutines and eventlet at some point, lol | 20:49 |
harlowja | lol | 20:49 |
harlowja | (or something like eventlet/asyncio...) | 20:49 |
Shrews | harlowja: ugh, please don't mention asyncio. i'm using it for some of our websocket streaming stuff and it's causing me to have nightmares | 20:49 |
harlowja | lol | 20:49 |
harlowja | i bet | 20:49 |
harlowja | i would have just liked it if eventlet (at least the concept) was mainlined into python | 20:50
harlowja | i think that would have gone over better | 20:50
harlowja | but meh, is what it is, lol | 20:50 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Use threading.Event for quicker shutdowns https://review.openstack.org/476705 | 20:52 |
Shrews | the feature/zuulv3 cherry pick ^^^ | 20:52 |
Shrews | want to carry that over to the launcher process, too | 20:53 |
*** dmellado has quit IRC | 21:00 | |
*** dmellado has joined #zuul | 21:03 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Use Event for faster shutdown of launcher https://review.openstack.org/476709 | 21:07 |
harlowja | cools | 21:32 |
*** dmellado has quit IRC | 22:33 | |
*** dmellado has joined #zuul | 22:37 | |
*** hasharAway has quit IRC | 22:48 | |
*** dmellado has quit IRC | 23:03 | |
*** dmellado has joined #zuul | 23:06 | |
*** dmellado has quit IRC | 23:15 | |
*** dmellado has joined #zuul | 23:18 | |
*** dmellado has quit IRC | 23:31 | |
jlk | adam_g: thanks for the comments re GHE. Are you all running another zuul somewhere? | 23:43 |
*** dmellado has joined #zuul | 23:51 | |
SpamapS | jlk: they are | 23:52 |
SpamapS | jlk: They're toying with a BonnyCI for an IBM internal project | 23:52 |
jlk | neat! | 23:53 |
SpamapS | along with mestery | 23:53 |
SpamapS | so it's all in the family :) | 23:53 |
jlk | craziness. | 23:54 |
*** dmellado has quit IRC | 23:55 | |
jamielennox | lol | 23:57 |
jamielennox | well, it shows value for bonny long term in ibm | 23:58 |