Tuesday, 2018-12-04

SpamapScorvus: indeed00:22
*** techraving has quit IRC01:01
*** bhavikdbavishi has joined #zuul01:06
*** bhavikdbavishi has quit IRC01:24
openstackgerritIan Wienand proposed openstack-infra/zuul-jobs master: mirror-workspace-git-repos: Explicitly show HEAD of checked out branches  https://review.openstack.org/62184002:21
*** j^2 has joined #zuul02:22
*** bhavikdbavishi has joined #zuul02:39
*** rlandy has quit IRC02:40
openstackgerritMerged openstack-infra/nodepool master: Set pool for error'ed instances  https://review.openstack.org/62168103:42
*** smyers_ has joined #zuul03:52
*** smyers has quit IRC03:53
*** smyers_ is now known as smyers03:53
*** sshnaidm is now known as sshnaidm|afk04:04
*** bjackman has joined #zuul04:09
tristanCShrews: i'm seeing quite a few "Cannot find provider pool for node" for deleting nodes, and it doesn't seem like they are being deleted. Is this fixed by https://review.openstack.org/621681 ?05:30
tristanC(it seems like a new behavior of nodepool-3.3.1)05:30
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: web: update status page layout to 4 columns  https://review.openstack.org/62201006:15
*** quiquell|off is now known as quiquell06:59
*** pcaruana has joined #zuul07:10
*** AJaeger has quit IRC07:11
*** AJaeger has joined #zuul07:15
*** quiquell is now known as quiquell|brb07:50
*** nilashishc has joined #zuul07:56
tobiashtristanC: yes, this should be fixed by this07:57
tristanCtobiash: thanks. should i cherry-pick or will there be a new nodepool release to fix that issue?07:57
tobiashtristanC: a new nodepool release will contain the relative priority stuff, so we probably want to run this in openstack for a few more days07:59
tristanCtobiash: ok thanks. I'll cherry-pick because this is going to be an issue as the nodes don't seem to be cleaned up08:04
*** chkumar|off is now known as chandan_kumar08:05
tristanCtobiash: what about the empty zk-nodes? there is also "AttributeError: 'NoneType' object has no attribute 'state'" in the logs. is this fixed by https://review.openstack.org/621305 ?08:05
tristanCthe traceback is http://paste.openstack.org/show/736613/08:07
*** quiquell|brb is now known as quiquell08:07
*** gtema has joined #zuul08:18
tristanCactually, that is not fixed by 621305, 3.3.1 doesn't have tree cache08:27
*** ianychoi has quit IRC08:30
*** ianychoi has joined #zuul08:30
openstackgerritTristan Cacqueray proposed openstack-infra/nodepool master: Set type for error'ed instances  https://review.openstack.org/62210108:44
*** jpena|off is now known as jpena08:55
openstackgerritTristan Cacqueray proposed openstack-infra/nodepool master: Set id for error'ed instances  https://review.openstack.org/62210808:56
*** sshnaidm|afk is now known as sshnaidm09:46
quiquelltobiash: is relative_priority a mandatory attribute in the zuul.conf scheduler config?09:51
quiquelltobiash: it's failing on me if I don't add it09:51
quiquelltobiash: Ok found the issue https://github.com/openstack-infra/zuul/blob/master/zuul/model.py#L75209:56
quiquelltobiash: this makes relative_priority mandatory09:56
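
For context, a minimal sketch of the kind of fix proposed in 622175; the class and method below are illustrative, not the exact zuul/model.py code. The idea is to read the field with a default so node requests serialized before relative_priority existed don't raise.

    # Illustrative sketch, not the actual zuul code: tolerate node requests
    # written by an older scheduler that never set relative_priority.
    class NodeRequest(object):
        def __init__(self):
            self.relative_priority = 0

        def updateFromDict(self, data):
            # Fall back to a default instead of raising KeyError.
            self.relative_priority = data.get('relative_priority', 0)
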
*** bhavikdbavishi1 has joined #zuul10:08
openstackgerritQuique Llorente proposed openstack-infra/zuul master: Add default value for relative_priority  https://review.openstack.org/62217510:09
*** bhavikdbavishi has quit IRC10:09
*** bhavikdbavishi1 is now known as bhavikdbavishi10:09
quiquelltobiash: fix for it https://review.openstack.org/#/c/622175/10:10
quiquellcorvus: ^10:11
*** dkehn has quit IRC10:27
*** electrofelix has joined #zuul10:34
*** bhavikdbavishi1 has joined #zuul10:44
*** bhavikdbavishi has quit IRC10:48
*** bhavikdbavishi1 is now known as bhavikdbavishi10:48
*** bhavikdbavishi has quit IRC10:52
*** jpena is now known as jpena|brb10:54
*** bhavikdbavishi has joined #zuul11:54
*** bjackman has quit IRC12:32
*** quiquell is now known as quiquell|lunch12:33
*** jpena|brb is now known as jpena|lunch12:42
*** bhavikdbavishi has quit IRC12:50
*** bhavikdbavishi has joined #zuul12:56
*** sshnaidm is now known as sshnaidm|afk13:14
*** gtema has quit IRC13:26
*** rlandy has joined #zuul13:27
*** bhavikdbavishi has quit IRC13:29
ShrewstristanC: the nodes should still be getting deleted (unless you have a problem with the provider). it's just that the error message is more likely to be present now when launches fail13:37
*** dkehn has joined #zuul13:38
ShrewstristanC: what is 622101 fixing?13:38
Shrewsor tobiash ^^, if you know13:39
mordredquiquell|lunch: thanks!13:40
*** jpena|lunch is now known as jpena13:40
*** quiquell|lunch is now known as quiquell13:44
tristanCShrews: it's fixing the fact that those deleting nodes don't have a type13:45
ShrewstristanC: where is a type required to delete a node?13:45
tristanCShrews: those nodes don't have an id in the json object in zk, it seems like the node deleter doesn't delete them in that case13:45
tristanCShrews: it's not required to delete a node, but having a type is good to know13:46
ShrewstristanC: i'm not seeing similar behavior13:51
Shrewsis there a log message to look for?13:51
*** bjackman has joined #zuul13:51
*** sshnaidm|afk is now known as sshnaidm13:59
*** bjackman has quit IRC14:14
*** irclogbot_3 has quit IRC14:16
*** bjackman has joined #zuul14:21
*** bhavikdbavishi has joined #zuul14:43
*** bjackman has quit IRC14:44
Shrewscorvus: tobiash: I think we have an issue with node deletion that I'm not sure how to solve14:47
Shrewscorvus: tobiash: I'm seeing many totally empty node znodes in zookeeper, and they seem to result from an overlap of runs of the DeletedNodeWorker thread (e.g., http://paste.openstack.org/show/736643/)14:48
*** gtema has joined #zuul14:49
Shrewsthe first run starts a NodeDeleter thread that begins to delete the znode, but it takes long enough (more than the 5 second interval of the DeletedNodeWorker runs) that the 2nd run sees it again as something to be deleted, but fails (and somehow leaves the empty node)14:49
Shrewswe probably should increase the interval for DeletedNodeWorker runs from 5 to at least 60. this won't solve the problem but should reduce it somewhat14:50
Shrewswe can also add a cleanup routine to delete these empty nodes14:50
Shrewsbut as for solving the real issue, i don't know how to do that. it seems the problem stems from the node lock existing within the node znode itself (so deleting the node deletes the lock and it's a race to get rid of it before someone else attempts to lock it)14:51
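
A minimal kazoo sketch of the child-lock layout described above (the ZooKeeper address and node path are made up): the lock recipe's contender znode lives underneath the node znode it protects, which is why deleting the node also destroys the lock.

    from kazoo.client import KazooClient
    from kazoo.recipe.lock import Lock

    client = KazooClient(hosts='localhost:2181')   # assumed local ZooKeeper
    client.start()

    node_path = '/nodepool/nodes/0000000001'       # illustrative node znode
    client.create(node_path, b'{"state": "used"}', makepath=True)

    # The lock contender znode is created under node_path/lock, i.e. as a
    # child of the node znode itself.
    lock = Lock(client, node_path + '/lock')
    lock.acquire(timeout=5)

    print(client.get_children(node_path))            # ['lock']
    print(client.get_children(node_path + '/lock'))  # one contender entry
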
*** irclogbot_3 has joined #zuul14:55
pabelangerShrews: I wonder if what tristanC is trying to debug is related to https://review.openstack.org/621043/ and https://review.openstack.org/621040/ I don't believe sf.io is running those yet15:02
*** bjackman has joined #zuul15:04
openstackgerritFilippo Inzaghi proposed openstack/pbrx master: Change openstack-dev to openstack-discuss  https://review.openstack.org/62235315:11
Shrewspabelanger: not sure15:13
Shrewsi think the node deletion problem is at least one reason why he's seeing empty znodes15:17
Shrewsbut i'm just guessing w/o access to that system15:17
*** bjackman has quit IRC15:17
Shrewsand that seems to have existing for quite a while15:19
Shrewss/existing/existed/15:19
*** kklimonda_ has joined #zuul15:32
*** kklimonda has quit IRC15:33
corvusquiquell, tobiash, mordred: see comment on https://review.openstack.org/62217515:33
tobiashcorvus: oops, revoked my review15:34
corvusShrews: in http://paste.openstack.org/show/736643/ why does the log line say "deleting instance None" when the previous log line has the instance id?15:35
corvusShrews, tobiash: is it because the "Deleting ZK node" is operating from cached data, and "Deleting" happens after the lock and therefore has updated data?15:38
*** kklimonda has joined #zuul15:38
*** kklimonda_ has quit IRC15:41
*** kklimonda_ has joined #zuul15:41
*** kklimonda has quit IRC15:42
corvusShrews, tobiash: when we delete a znode, which happens first: the lock is deleted or the node data?15:44
corvustobiash: istr you looked into that last week15:44
corvushttps://kazoo.readthedocs.io/en/latest/_modules/kazoo/client.html#KazooClient.delete15:46
corvuslooking at the source, it looks like it's children first -- so the lock is removed, then the node is deleted15:46
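
In simplified form (this is not the real kazoo code), a children-first recursive delete looks like the sketch below; the lock child goes away before the parent znode, so the node is briefly unlocked while it is being deleted.

    # Simplified sketch of a children-first recursive delete; not the
    # actual kazoo implementation.
    def delete_recursive(client, path):
        for child in client.get_children(path):
            delete_recursive(client, path + '/' + child)
        client.delete(path)
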
corvusso the hypothesis is: thread (A) locks the node, deletes the instance, deletes the lock; thread (B) locks the node; thread (A) deletes the node (which only deletes the data since there's a new child); thread (B) loads data from the node, it's empty, so the Node object in memory is zeroed out, then continues with invalid data.15:48
corvuswe can detect this situation by observing that the node record has no data15:50
corvuswe gracefully handle that in zk.py, but i'm not sure why.  i'm not sure we should ever expect a node with no data15:51
*** kklimonda_ has quit IRC15:52
*** kklimonda has joined #zuul15:52
dmsimardjhesketh: Is there a spec or something that explains the direction for https://review.openstack.org/#/q/topic:freeze_job ? It's something I might be able to help with15:53
*** kklimonda has joined #zuul15:53
openstackgerritFilippo Inzaghi proposed openstack-infra/zuul-base-jobs master: Change openstack-infra to openstack-discuss  https://review.openstack.org/62238616:01
openstackgerritFilippo Inzaghi proposed openstack-infra/zuul-jobs master: Change openstack-infra to openstack-discuss  https://review.openstack.org/62238716:02
Shrewscorvus: in response to your first question, i think it's because (as you seem to have also deduced) competing threads are getting different data16:03
openstackgerritFilippo Inzaghi proposed openstack-infra/zuul-sphinx master: Change openstack-infra to openstack-discuss  https://review.openstack.org/62238816:03
Shrewsb/c the 1st thd deletes the data, so the 2nd thd has none16:03
Shrewscorvus: how this relates to the cache (if at all), i'm not certain16:04
Shrewsbut this may have existed before caching was introduced16:04
Shrews(just based on the node IDs I see in zookeeper)16:04
corvusShrews: yeah, perhaps caching may make it worse, but i don't think it's required.  the hypothesis i laid out above doesn't require it16:05
Shrewscorvus: I suppose we could have the DeletedNodeWorker cleanup thread keep a cache of the NodeDeleter threads it starts, and not try to delete something already being deleted on a subsequent run. But that bookkeeping can be tricky16:07
Shrewsas far as trying to detect the situation by observing the node has no data, i would think that would still be a bit racey (since it could still be in the process of being deleted)16:11
*** pcaruana has quit IRC16:12
corvusShrews: i think i have an idea; patch in a minute16:15
quiquellHelo I am back16:20
quiquellcorvus: thanks will fix right now16:21
openstackgerritQuique Llorente proposed openstack-infra/zuul master: Add default value for relative_priority  https://review.openstack.org/62217516:23
quiquellcorvus, tobiash ^16:24
SpamapSShrews: from your original description it sounds like you need to use locks so only one thread ever acts on a node. No?16:25
*** electrofelix has quit IRC16:27
ShrewsSpamapS: we *are* using locks16:28
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Fix race when deleting Node znodes  https://review.openstack.org/62240316:28
corvusShrews, tobiash, tristanC, SpamapS: ^ how about that idea?16:28
corvusalso, i think we're missing some potential unlock cases in exception handlers; i'll work on that in a followup16:29
Shrewscorvus: we don't want the deleteNode() call there do we? that's counter to what we're trying to do16:32
Shrewsoh, that's for the znode, not the instance16:32
Shrewsduh16:32
Shrewscorvus: but also, the node.state check will never succeed there16:34
Shrewswe'd need to refresh after locking the node16:34
Shrews(see the lines above it)16:34
Shrewsoh lockNode() refreshes it16:35
Shrewsi forgot that was added16:35
corvusyep and yep16:37
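
The rough shape of the approach being discussed for 622403, as a hedged sketch rather than the actual patch; it assumes a nodepool-style zk wrapper with lockNode(), unlockNode() and deleteNode() methods and a plain string state.

    # Hedged sketch of the idea, not the real change: lock first, rely on
    # the lock acquisition refreshing the node data, then re-check state.
    def delete_node(zk, node):
        try:
            # lockNode() refreshes the node, so the state check below sees
            # current ZooKeeper data rather than a cached copy.
            zk.lockNode(node, blocking=False)
        except Exception:
            return  # someone else holds the lock; let them handle it
        if node.state != 'deleting':
            # Another thread already finished with it; back off.
            zk.unlockNode(node)
            return
        # Deleting the znode also removes our lock child underneath it.
        zk.deleteNode(node)
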
*** quiquell is now known as quiquell|off16:50
*** pcaruana has joined #zuul17:12
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Add governance document  https://review.openstack.org/62243917:14
SpamapSYou're using locks, but two threads are acting on the lock?17:23
corvusSpamapS: yes -- the lock is under the node and recursive deletion deletes the lock first.  so deleting a node effectively unlocks it then deletes the node.17:26
corvusit seems there are disadvantages to both child locks and sibling locks (we use both in nodepool)17:27
tobiashThat is an interesting race17:27
corvus(requests use sibling locks, nodes use child locks)17:27
tobiashcorvus, Shrews: i was in meetings all day. Are there still open questions or was that all about that race?17:31
corvustobiash: i think all about the race; i don't have open questions17:32
*** gtema has quit IRC17:32
tobiashcorvus: thanks for the great review on the spec, I think I'll answer and refine tomorrow17:33
openstackgerritMerged openstack-infra/nodepool master: Make launcher debug slightly less chatty  https://review.openstack.org/62167517:34
SpamapSOh I wasn't aware there was such a thing as child locks.17:34
corvusSpamapS: i assumed that was the standard way of doing it17:36
corvusall things considered, we may want to standardize on sibling locks; it's easier to leak the locks, but at least you can't leave the resource unprotected.17:40
*** bhavikdbavishi has quit IRC17:41
tobiashcorvus: yes, and cleaning up leaked locks is easy by a restart of the service ;)17:45
*** nilashishc has quit IRC17:46
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Fix race when deleting Node znodes  https://review.openstack.org/62240317:49
tobiashquiquell|off: commented on https://review.openstack.org/62217517:53
*** pcaruana has quit IRC17:56
Shrewscorvus: should i work up a script to delete empty nodes, or do you think we should add that to nodepool as a cleanup routine?18:05
Shrewscorvus: and any thoughts on changing the deleted node worker interval from 5s?18:05
corvusShrews: i don't think we should need to change the interval.  i think it's important for this to be responsive.18:07
corvusShrews: if we switched it to being event-based in the normal case, we could increase the interval of the periodic check; but i don't think we want to wait 60 seconds to start deleting a node after it's finished.18:08
Shrewsok18:08
corvusShrews: as for the other question -- maybe a cleanup routine?18:09
tobiashShrews: I think adding that as a cleanup routine makes sense as probably many of us have these empty znodes18:09
corvusShrews: that way it's automatic and we don't have to tell people to go run a thing after upgrade18:09
corvusShrews: could probably just be added to the delete worker?18:10
tobiash++18:10
Shrewsi don't like it there, personally18:11
Shrewswe have a general cleanup worker for such things18:12
corvusShrews: ok; mostly was thinking we didn't need a new thread18:12
pabelangerShrews: when you have a moment, trying to debug an issue of nodepool deleting a used node, how do we know a node is unlocked based on this comment: https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/launcher.py#n700 ?18:12
corvuspabelanger: nodepool is supposed to delete used nodes18:13
pabelangercorvus: right, but in this case a job is still running18:13
corvuspabelanger: the lock is checked in line 70318:13
pabelangerI am missing the zuul side of the logs currently18:13
corvuspabelanger: was the job aborted?18:14
pabelangercorvus: not that I see in job-output.txt. I'll know more once I get a copy of the executor log18:15
corvuspabelanger: you need the scheduler log18:15
pabelangerkk18:15
Shrewspabelanger: you'll know a node is unlocked if you cannot obtain a lock for it18:16
Shrewserrr18:16
Shrewsif you CAN obtain a lock for it18:16
pabelangercool, thanks!18:16
corvuspabelanger: there is a situation where zuul may be unable to abort the job shortly after it starts; in that case, it will go ahead and unlock the node and nodepool will delete it while the job is still running.  however, zuul has already moved on and no longer considers the job relevant.  so you should never see this in a reported job.  only if you're watching logs.18:16
corvuspabelanger: you'll see that in the scheduler logs, not the executor.  also, everything about locking is in the scheduler, not the executor.18:17
pabelangercorvus: okay, let see what the job was marked as18:17
tobiashpabelanger: the other possibility would be the scheduler losing its zk session, but this would result in nodepool deleting *all* nodes18:20
pabelangertobiash: yah, that was my first thought18:20
pabelangerdownloading 5GB file now to confirm18:20
*** jpena is now known as jpena|off18:20
corvusShrews: replied to your comment on 622403 (didn't mean to ignore it, my update was just out of sync timewise)18:25
pabelangercorvus: okay, yes. I believe you are correct; it looks like zuul aborted the job, a new patchset was uploaded into gerrit at the same time.  Will confirm 100% once this logfile is done downloading18:25
pabelangeron the nodepool side, hard to see that :)18:25
corvuspabelanger: how did you notice this?18:26
pabelangercorvus: was pointed to https://logs.rdoproject.org/87/622187/2/openstack-check/tripleo-ci-centos-7-multinode-1ctlr-featureset010/7be402f/job-output.txt.gz by jpena|off because it showed up in a logstash query for WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED18:27
pabelangerthey are trying to debug an issue where FIPs are being reused, and I think they are getting false positives because of job aborts18:28
clarkbpabelanger: fwiw openstack has noticed genuine fip reuse18:28
clarkbI think there is a nova/neutron bug somewhere that is hitting things18:28
clarkbwe've seen it in rax and inap iirc18:28
pabelangerclarkb: yah, that is what I thought at first too, but then I got the launcher log and saw nodepool deleted the node behind the back of the running job18:28
tobiashpabelanger: now that you mention it, I also saw this warning in rare cases but in real running jobs18:29
corvuspabelanger: yeah, that's unfortunate; normally if the executor got the abort signal correctly, it wouldn't upload the logs18:29
SpamapSIIRC fip's are pulled without any ordering, so they're in database-pk order, and as such, you may keep getting the same one over and over if you allocate/deallocate over and over.18:29
clarkbactually inap and rax don't use fips18:30
clarkbso its IP address reuse without fips18:30
corvusthis is very tricky to fix in the gearman system; however it's easy to fix in zuulv4.  i think our first zuulv4 change should be the scheduler-executor interface18:30
SpamapShaven't looked at how fixed IPs are allocated, but if I had to guess, I'd say they find the first available in the subnet.18:31
SpamapSYeah not being able to cleanly cancel jobs is one of gearman's weakest traits.18:31
pabelangercorvus: ++18:33
SpamapSIn fact typically the way to cancel jobs in gearman is sentinels in a consistent system like ZK ;)18:33
corvusthis happens if the abort call comes between the executor starting the job and the scheduler receiving notification that it has started.  in that case, the gearman job is not cancelable and the scheduler doesn't know which executor to talk to.  there is actually a 1 second sleep in the code if we hit this to try to give the system time to resolve, but i looked yesterday, and we're seeing 10+ second delays18:33
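
A self-contained sketch of that timing window (names here are illustrative, not zuul's actual ExecutorClient): the cancel request arrives before the scheduler knows which executor took the gearman job, so it sleeps briefly and then gives up, leaving the build to run unreported.

    import time

    def cancel_build(build, delay=1.0):
        """Illustrative only; not zuul's real cancel logic."""
        if build.get('worker') is None:
            # Start notification not received yet, so we don't know which
            # executor to tell to abort. Give the system a moment.
            time.sleep(delay)
        if build.get('worker') is None:
            # Still unknown: give up, unlock the nodes and move on. The job
            # may keep running on an executor, but is never reported.
            print('Unable to cancel build %s' % build.get('uuid'))
            return False
        return True  # we know the executor; send it an abort request
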
pabelangercorvus: Yup, scheduler aborted the job, can see it in logs18:34
corvuspabelanger: you should see a log line with "Still unable to find build"18:34
pabelangerI need to see why the logstash instances in rdo are reporting the build.result as failure, and not aborted18:34
pabelangercorvus: yes, that is right18:35
pabelanger2018-12-04 10:35:41,660 DEBUG zuul.ExecutorClient: Unable to cancel build <Build 7be402f400574587bf4cd292595813d9 of tripleo-ci-centos-7-multinode-1ctlr-featureset010 voting:True on <Worker Unknown>>18:35
tobiashcorvus: I can reorder that in the spec (I ordered the spec in an order I thought it might be possible to implement)18:35
corvustobiash: i think the scheduler-executor part is almost completely separate from the rest18:36
tobiashthat's true18:36
corvuspabelanger: the logstash result doesn't really have anything to do with the zuul result18:36
tobiashit just needs to be done before moving the pipelines to zk18:36
corvustobiash: or it could be done as part of that (the executors could actually walk the pipelines rather than having their own build queue)18:38
pabelangercorvus: okay, am I right in understanding that zuul reports build_status FAILURE for the job, because the job finished (failed because the node was deleted) before the scheduler could tell the executor to abort the job?18:38
corvuspabelanger: no, zuul *did not report* that job18:38
tobiashcorvus: interesting idea18:38
corvuspabelanger: s/job/build/18:39
pabelangerokay18:39
corvuspabelanger: you only even know about that build because you're using data from outside of zuul.  if you looked at the change in gerrit, you would see no links to that build.18:39
pabelangercorvus: yah, I can see that now. They are just finger URLs still18:41
pabelangerand now I understand this exception: http://paste.openstack.org/show/736664/18:41
pabelangerbecause nodepool has already deleted the node / zk data18:41
corvuspabelanger: yep.  it's harmless18:42
pabelangerokay, this has been helpful and gives me more information to help debug the 'FIP issue' rdo is seeing.18:42
corvuspabelanger: that's when zuul finally gets caught up.18:42
pabelangercorvus: ++18:42
corvusShrews: https://review.openstack.org/622403 is green now18:52
corvuswow, fastest review ever18:52
tobiashcorvus: the secret handling just came to my mind; it might be a problem in a distributed scheduler18:53
tobiashcorvus: we need to somehow synchronize the private keys without storing them in zk18:53
Shrewscorvus: yeah, sorry. had to move back home b/c the coffee shop was blocking my outbound port in a test script  :/18:53
corvustobiash: i agree it might be nice to avoid storing them in zk, but we should probably consider that.  it's going to have decrypted copies of the data anyway.18:54
corvustobiash: in other words, it may not be any more secure to have a *second* place where the data could be compromised if we already have a first place.18:55
corvus(where the first place is zk)18:55
clarkbdoes zk need to have decrypted copies of the data? if the keys were everywhere that could be a local transformation right?18:56
clarkbbut then you are dealing with distributing keys (maybe a CA makes that easier)18:56
corvusclarkb: yes, if we distributed the keys to all schedulers *and* executors, we could avoid having decrypted data in zk.18:56
corvusbut then executors are a more interesting target (they're already interesting, this just makes them more interesting)18:57
clarkbas you say I'm not sure that is any better. Might be easier to secure zk18:57
clarkbyup18:57
corvusif we had a really good way to share those keys, i agree that would be an approach to consider18:57
tobiashdid anyone try out hashicorp vault?18:58
tobiashmaybe this could solve the key distribution?18:58
tobiashbut it would be yet another dependency on an external service18:59
corvustobiash: i don't see the connection; maybe you could elaborate.18:59
tobiashcorvus: hashicorp vault is a secure secret store that also has an api interface where services can retrieve the secrets18:59
tobiashalso I think with fine grained access control19:00
tobiashso this could be the one and only place where secrets are stored19:00
corvustobiash: so it's a spof, or another system that needs to be made HA19:00
tobiashyes, hadn't looked into that yet19:00
corvustobiash: what about zookeeper acls?19:01
corvustobiash: then only the schedulers could read the keys, and maybe only the executors could read the decrypted data for builds.  and web can't read any of it.19:02
tobiashthat makes sense19:02
corvussimilar exposure profile to an external service without additional deps.19:03
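
A minimal kazoo sketch of that ACL idea; the credentials and the /keystorage path are made up for illustration and are not zuul's actual layout.

    from kazoo.client import KazooClient
    from kazoo.security import make_digest_acl

    # Illustrative digest credential; only this identity may touch the keys.
    scheduler_acl = make_digest_acl('scheduler', 'scheduler-secret', all=True)

    client = KazooClient(
        hosts='localhost:2181',
        auth_data=[('digest', 'scheduler:scheduler-secret')])
    client.start()

    # Anything connecting with a different credential (e.g. zuul-web) gets
    # a NoAuth error when it tries to read below this path.
    client.create('/keystorage/example.com/org/project',
                  b'<encrypted private key>',
                  acl=[scheduler_acl], makepath=True)
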
corvus(openstack-infra meeting time; i'll be over there for a bit)19:03
openstackgerritMerged openstack-infra/nodepool master: Fix race when deleting Node znodes  https://review.openstack.org/62240319:20
SpamapScorvus: I'm not sure decrypted secrets or keys in ZK would make it through our security/compliance review process.19:35
SpamapSI'm already on thin ice with no zk auth.19:35
corvusSpamapS: decrypted secrets in gearman is okay?19:36
SpamapSCorrect, that's not on disk.19:36
SpamapSAnd encrypted in transit.19:37
corvusSpamapS: ack19:37
SpamapS(with cert validation as auth)19:37
SpamapSVault would be great.19:38
corvusSpamapS: what if we encrypted the keys in ZK, and then provided a single decryption key to the zuul servers?19:48
corvus(so it's basically the vault model without the extra service)19:48
clarkbzk supports auth right? is the issue we don't implement it as a client in zuul?19:49
corvuseverything is encrypted at rest19:49
corvusclarkb: SpamapS has a requirement that no secret data be decrypted at rest.  (ie, stored decrypted on disk).  that's a reasonable and frequently encountered security requirement.19:50
corvusclarkb: and i'm presuming we'll finish up the zk auth impl. in zuul before doing any of this19:50
clarkbya I think that is fine. Mostly responding to 19:35:14         SpamapS | I'm already on thin ice with no zk auth.19:51
corvusah yeah19:51
corvusso we should take care of that too :)19:51
pabelangerwasn't there patches up to add SSL support to zk for nodepool / zuul?19:53
corvuspabelanger: https://review.openstack.org/61915519:54
pabelangercorvus: thanks, I did see that. That looks to just be auth for now, will confirm with tristanC on the plan for SSL19:55
tobiashSpamapS: how do you handle the private keys of the scheduler?20:04
tobiashcorvus: I think encrypting the secrets with a single key (aes?) would be easy to do.20:06
tobiashSpamapS: would that make it through your security audit?20:06
tobiashpabelanger: kazoo just has landed ssl support a few weeks ago20:07
tobiashso that could be easily added to zuul probably20:07
pabelangertobiash: okay, maybe that is what I was remembering20:07
tobiashand I also saw a change to do auth with zk in zuul20:07
pabelangertobiash: yah, I don't mind looking at SSL for nodepool / zuul. Doing it for gearman was straightforward20:08
tobiashhttps://github.com/python-zk/kazoo/issues/38220:08
pabelanger+120:09
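
For reference, a sketch of a TLS-enabled kazoo connection; the keyword names below are an assumption based on the SSL support that landed in kazoo around this time and may differ between versions.

    from kazoo.client import KazooClient

    # Parameter names are assumptions about kazoo's SSL support, not a
    # verified zuul/nodepool configuration.
    client = KazooClient(
        hosts='zk.example.com:2281',
        use_ssl=True,
        verify_certs=True,
        ca='/etc/zookeeper/ca.pem',
        certfile='/etc/zookeeper/client.pem',
        keyfile='/etc/zookeeper/client.key')
    client.start()
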
*** manjeets_ has joined #zuul20:13
*** manjeets has quit IRC20:14
pabelangerzuul dashboard question - how can I add back the remaining time of a job when you hover over the blue bar showing the running job?  It has been missing for some time and was helpful to quickly see how much longer a patch had left based on previous job runs20:18
tobiashpabelanger: I think there is an estimated time per job in the status.json20:24
tobiashThis should already be used to calculate the status bars20:25
pabelangerokay, so maybe I just need to add a tooltip20:26
tobiashProbably20:34
pabelangerk, that is another rabbit hole I'll hold off on for now.20:35
*** j^2 has quit IRC20:47
SpamapScorvus: Vault's model is not quite the same as a single decryption key. You provide clients with a vault authentication credential, which clients use to get keys from vault that decrypt the secrets as they are now. But the ciphertext is constantly rotated in a Vault model, and Vault just keeps track of who has what keys now, so you know when you can purge ciphertexts....21:52
SpamapSHonestly, Vault is great, but let's not go there. :)21:52
SpamapStobiash: right now the keys for zuul are on an encrypted EBS volume. That's not ideal, but it works, and it satisfies encryption at rest and revokability (I can delete the key in KMS and nobody can mount the volume anymore). We can also do ZK on encrypted volumes... but one thing I know we've been trying to do with Zuul is not require complicated deployment choices.. so I was hoping we could just do it21:56
SpamapSapp-level and supply the private keys via environment variables in k8s deployments (which have the benefit of being deletable post-container-startup)21:56
corvusSpamapS: it sounds like (even though it's not quite as robust as the vault model) the store-encrypted-in-zk-with-single-decryption-key model would probably fit the bill?22:00
SpamapScorvus: absolutely.22:00
SpamapSVault is like, 4th order security automation. I think it might be worth doing some day. But.. let's hit 2nd order which is basically where my baseline will be.22:00
corvusSpamapS: and thanks for chiming in -- i think the addition of the 'no decrypted data at rest' requirement is a good one and will have kept us from going down some rabbit holes :)22:00
SpamapScorvus: I will try to keep it lightweight and simple as much as possible. For all the pain that requirement brings, complexity will bring far far more.22:01
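
A small sketch of that "ciphertext in ZK, one locally held key" model using Fernet from the cryptography package; the key handling shown here is illustrative, not a proposed zuul interface.

    from cryptography.fernet import Fernet

    # The single decryption key lives only on the Zuul servers (for example
    # supplied through an environment variable); ZooKeeper only ever sees
    # ciphertext.
    local_key = Fernet.generate_key()
    f = Fernet(local_key)

    ciphertext = f.encrypt(b'-----BEGIN RSA PRIVATE KEY----- ...')
    # ciphertext is what a scheduler would store in the znode.
    plaintext = f.decrypt(ciphertext)
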
clarkbSpamapS: re env vars and deleting post container startup, you can still cat /proc/<pid>/environ ? Or do you mean remove it from the env after it's been internalized, so it would only be available by reading memory directly22:04
clarkb(mostly trying to understand what level of paranoia people find valuable when it comes to reading things from memory on linux)22:04
clarkbin an unrelated thread people are talking about supporting virtual TPMs, which are implemented (currently at least) without the support/safety of hardware based features (like write only data). And i wonder how useful an in memory store (which can have its data read back out again) is in practice22:07
SpamapSclarkb: yes you still have to deal with in-container and in-node compromise. But the idea is that nobody can go grab the secret from etcd/k8s anymore.22:18
SpamapSThe general thinking is that letting it sit there in RAM to be used is inevitable and you can't really protect against that level of compromise.22:19
clarkbSpamapS: you can with a tpm, but that requires a lot more setup and special hardware22:19
clarkb(which is why virtual tpm that doesn't protect in that way really surprised me)22:19
SpamapSBut letting it sit, available in an API, or available on disk for a backup or a medium level of privilege escalation (like, you can read files as root, but nothing else), is avoidable.22:19
SpamapSHah yeah, of course there are ways.22:20
SpamapSBasically you want to avoid accidental exposure to a deliberate attack as a policy. Going beyond that is where we get in to those higher orders, and reserved for things where the cost of compromise is higher than the cost to implement.22:22
*** leifmadsen has joined #zuul22:33
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool master: Add cleanup routine to delete empty nodes  https://review.openstack.org/62261622:33
Shrewscorvus: tobiash: ^^^ should do the empty node cleanup22:35
*** etp has quit IRC23:12
*** etp has joined #zuul23:13
corvusShrews: lgtm, thx!23:18
*** dkehn_ has joined #zuul23:51
*** dkehn has quit IRC23:52
*** dkehn_ is now known as dkehn23:53
*** pabelanger has quit IRC23:57
*** pabelanger has joined #zuul23:59
