Tuesday, 2018-12-04

SpamapScorvus: indeed00:22
*** techraving has quit IRC01:01
*** bhavikdbavishi has joined #zuul01:06
*** bhavikdbavishi has quit IRC01:24
openstackgerritIan Wienand proposed openstack-infra/zuul-jobs master: mirror-workspace-git-repos: Explicitly show HEAD of checked out branches  https://review.openstack.org/62184002:21
*** j^2 has joined #zuul02:22
*** bhavikdbavishi has joined #zuul02:39
*** rlandy has quit IRC02:40
openstackgerritMerged openstack-infra/nodepool master: Set pool for error'ed instances  https://review.openstack.org/62168103:42
*** smyers_ has joined #zuul03:52
*** smyers has quit IRC03:53
*** smyers_ is now known as smyers03:53
*** sshnaidm is now known as sshnaidm|afk04:04
*** bjackman has joined #zuul04:09
tristanCShrews: i'm seeing quite a few "Cannot find provider pool for node" for deleting nodes, and it doesn't seem like they are being deleted. Is this fixed by https://review.openstack.org/621681 ?05:30
tristanC(it seems like a new behavior of nodepool-3.3.1)05:30
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: web: update status page layout to 4 columns  https://review.openstack.org/62201006:15
*** quiquell|off is now known as quiquell06:59
*** pcaruana has joined #zuul07:10
*** AJaeger has quit IRC07:11
*** AJaeger has joined #zuul07:15
*** quiquell is now known as quiquell|brb07:50
*** nilashishc has joined #zuul07:56
tobiashtristanC: yes, this should be fixed by this07:57
tristanCtobiash: thanks. should i cherry-pick or will there be a new nodepool release to fix that issue?07:57
tobiashtristanC: a new nodepool release will contain the relative priority stuff, so we probably want to run this in openstack for a few more days07:59
tristanCtobiash: ok thanks. I'll cherry-pick because this is going to be an issue as the nodes don't seem to be cleaned up08:04
*** chkumar|off is now known as chandan_kumar08:05
tristanCtobiash: what about the empty zk-nodes? there is also "AttributeError: 'NoneType' object has no attribute 'state'" in the logs. is this fixed by https://review.openstack.org/621305 ?08:05
tristanCthe traceback is http://paste.openstack.org/show/736613/08:07
*** quiquell|brb is now known as quiquell08:07
*** gtema has joined #zuul08:18
tristanCactually, that is not fixed by 621305, 3.3.1 doesn't have tree cache08:27
*** ianychoi has quit IRC08:30
*** ianychoi has joined #zuul08:30
openstackgerritTristan Cacqueray proposed openstack-infra/nodepool master: Set type for error'ed instances  https://review.openstack.org/62210108:44
*** jpena|off is now known as jpena08:55
openstackgerritTristan Cacqueray proposed openstack-infra/nodepool master: Set id for error'ed instances  https://review.openstack.org/62210808:56
*** sshnaidm|afk is now known as sshnaidm09:46
quiquelltobiash: is relative_priority a mandatory attribute in the zuul.conf scheduler config?09:51
quiquelltobiash: it's failing on me if I don't add it09:51
quiquelltobiash: Ok found the issue https://github.com/openstack-infra/zuul/blob/master/zuul/model.py#L75209:56
quiquelltobiash: this makes relative_priority mandatory09:56
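
For context, a minimal sketch of the kind of fix proposed in 622175; the class and method below are illustrative, not the exact zuul/model.py code. The idea is to read the field with a default so node requests serialized before relative_priority existed don't raise.

    # Illustrative sketch, not the actual zuul code: tolerate node requests
    # written by an older scheduler that never set relative_priority.
    class NodeRequest(object):
        def __init__(self):
            self.relative_priority = 0

        def updateFromDict(self, data):
            # Fall back to a default instead of raising KeyError.
            self.relative_priority = data.get('relative_priority', 0)
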
*** bhavikdbavishi1 has joined #zuul10:08
openstackgerritQuique Llorente proposed openstack-infra/zuul master: Add default value for relative_priority  https://review.openstack.org/62217510:09
*** bhavikdbavishi has quit IRC10:09
*** bhavikdbavishi1 is now known as bhavikdbavishi10:09
quiquelltobiash: fix for it https://review.openstack.org/#/c/622175/10:10
quiquellcorvus: ^10:11
*** dkehn has quit IRC10:27
*** electrofelix has joined #zuul10:34
*** bhavikdbavishi1 has joined #zuul10:44
*** bhavikdbavishi has quit IRC10:48
*** bhavikdbavishi1 is now known as bhavikdbavishi10:48
*** bhavikdbavishi has quit IRC10:52
*** jpena is now known as jpena|brb10:54
*** bhavikdbavishi has joined #zuul11:54
*** bjackman has quit IRC12:32
*** quiquell is now known as quiquell|lunch12:33
*** jpena|brb is now known as jpena|lunch12:42
*** bhavikdbavishi has quit IRC12:50
*** bhavikdbavishi has joined #zuul12:56
*** sshnaidm is now known as sshnaidm|afk13:14
*** gtema has quit IRC13:26
*** rlandy has joined #zuul13:27
*** bhavikdbavishi has quit IRC13:29
ShrewstristanC: the nodes should still be getting deleted (unless you have a problem with the provider). it's just that the error message is more likely to be present now when launches fail13:37
*** dkehn has joined #zuul13:38
ShrewstristanC: what is 622101 fixing?13:38
Shrewsor tobiash ^^, if you know13:39
mordredquiquell|lunch: thanks!13:40
*** jpena|lunch is now known as jpena13:40
*** quiquell|lunch is now known as quiquell13:44
tristanCShrews: it's fixing the fact that those deleting nodes don't have a type13:45
ShrewstristanC: where is a type required to delete a node?13:45
tristanCShrews: those nodes don't have an id in the json object in zk, it seems like the node deleter doesn't delete them in that case13:45
tristanCShrews: it's not required to delete a node, but having a type is good to know13:46
ShrewstristanC: i'm not seeing similar behavior13:51
Shrewsis there a log message to look for?13:51
*** bjackman has joined #zuul13:51
*** sshnaidm|afk is now known as sshnaidm13:59
*** bjackman has quit IRC14:14
*** irclogbot_3 has quit IRC14:16
*** bjackman has joined #zuul14:21
*** bhavikdbavishi has joined #zuul14:43
*** bjackman has quit IRC14:44
Shrewscorvus: tobiash: I think we have an issue with node deletion that I'm not sure how to solve14:47
Shrewscorvus: tobiash: I'm seeing many totally empty node znodes in zookeeper, and they seem to result from an overlap of runs of the DeletedNodeWorker thread (e.g., http://paste.openstack.org/show/736643/)14:48
*** gtema has joined #zuul14:49
Shrewsthe first run starts a NodeDeleter thread that begins to delete the znode, but it takes long enough (more than the 5 second interval of the DeletedNodeWorker runs) that the 2nd run sees it again as something to be deleted, but fails (and somehow leaves the empty node)14:49
Shrewswe probably should increase the interval for DeletedNodeWorker runs from 5 to at least 60. this won't solve the problem but should reduce it somewhat14:50
Shrewswe can also add a cleanup routine to delete these empty nodes14:50
Shrewsbut as for solving the real issue, i don't know how to do that. it seems the problem stems from the node lock existing within the node znode itself (so deleting the node deletes the lock and it's a race to get rid of it before someone else attempts to lock it)14:51
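
A minimal kazoo sketch of the child-lock layout described above (the ZooKeeper address and node path are made up): the lock recipe's contender znode lives underneath the node znode it protects, which is why deleting the node also destroys the lock.

    from kazoo.client import KazooClient
    from kazoo.recipe.lock import Lock

    client = KazooClient(hosts='localhost:2181')   # assumed local ZooKeeper
    client.start()

    node_path = '/nodepool/nodes/0000000001'       # illustrative node znode
    client.create(node_path, b'{"state": "used"}', makepath=True)

    # The lock contender znode is created under node_path/lock, i.e. as a
    # child of the node znode itself.
    lock = Lock(client, node_path + '/lock')
    lock.acquire(timeout=5)

    print(client.get_children(node_path))            # ['lock']
    print(client.get_children(node_path + '/lock'))  # one contender entry
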
*** irclogbot_3 has joined #zuul14:55
pabelangerShrews: I wonder if what tristanC is trying to debug is related to https://review.openstack.org/621043/ and https://review.openstack.org/621040/ I don't believe sf.io is running those yet15:02
*** bjackman has joined #zuul15:04
openstackgerritFilippo Inzaghi proposed openstack/pbrx master: Change openstack-dev to openstack-discuss  https://review.openstack.org/62235315:11
Shrewspabelanger: not sure15:13
Shrewsi think the node deletion problem is at least one reason why he's seeing empty znodes15:17
Shrewsbut i'm just guessing w/o access to that system15:17
*** bjackman has quit IRC15:17
Shrewsand that seems to have existing for quite a while15:19
Shrewss/existing/existed/15:19
*** kklimonda_ has joined #zuul15:32
*** kklimonda has quit IRC15:33
corvusquiquell, tobiash, mordred: see comment on https://review.openstack.org/62217515:33
tobiashcorvus: oops, revoked my review15:34
corvusShrews: in http://paste.openstack.org/show/736643/ why does the log line say "deleting instance None" when the previous log line has the instance id?15:35
corvusShrews, tobiash: is it because the "Deleting ZK node" is operating from cached data, and "Deleting" happens after the lock and therefore has updated data?15:38
*** kklimonda has joined #zuul15:38
*** kklimonda_ has quit IRC15:41
*** kklimonda_ has joined #zuul15:41
*** kklimonda has quit IRC15:42
corvusShrews, tobiash: when we delete a znode, which happens first: the lock is deleted or the node data?15:44
corvustobiash: istr you looked into that last week15:44
corvushttps://kazoo.readthedocs.io/en/latest/_modules/kazoo/client.html#KazooClient.delete15:46
corvuslooking at the source, it looks like it's children first -- so the lock is removed, then the node is deleted15:46
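
In simplified form (this is not the real kazoo code), a children-first recursive delete looks like the sketch below; the lock child goes away before the parent znode, so the node is briefly unlocked while it is being deleted.

    # Simplified sketch of a children-first recursive delete; not the
    # actual kazoo implementation.
    def delete_recursive(client, path):
        for child in client.get_children(path):
            delete_recursive(client, path + '/' + child)
        client.delete(path)
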
corvusso the hypothesis is: thread (A) locks the node, deletes the instance, deletes the lock; thread (B) locks the node; thread (A) deletes the node (which only deletes the data since there's a new child); thread (B) loads data from the node, it's empty, so the Node object in memory is zeroed out, then continues with invalid data.15:48
corvuswe can detect this situation by observing that the node record has no data15:50
corvuswe gracefully handle that in zk.py, but i'm not sure why.  i'm not sure we should ever expect a node with no data15:51
*** kklimonda_ has quit IRC15:52
*** kklimonda has joined #zuul15:52
dmsimardjhesketh: Is there a spec or something that explains the direction for https://review.openstack.org/#/q/topic:freeze_job ? It's something I might be able to help with15:53
*** kklimonda has joined #zuul15:53
openstackgerritFilippo Inzaghi proposed openstack-infra/zuul-base-jobs master: Change openstack-infra to openstack-discuss  https://review.openstack.org/62238616:01
openstackgerritFilippo Inzaghi proposed openstack-infra/zuul-jobs master: Change openstack-infra to openstack-discuss  https://review.openstack.org/62238716:02
Shrewscorvus: in response to your first question, i think it's because (as you seem to have also deduced) competing threads are getting different data16:03
openstackgerritFilippo Inzaghi proposed openstack-infra/zuul-sphinx master: Change openstack-infra to openstack-discuss  https://review.openstack.org/62238816:03
Shrewsb/c the 1st thd deletes the data, so the 2nd thd has none16:03
Shrewscorvus: how this relates to the cache (if at all), i'm not certain16:04
Shrewsbut this may have existed before caching was introduced16:04
Shrews(just based on the node IDs I see in zookeeper)16:04
corvusShrews: yeah, perhaps caching may make it worse, but i don't think it's required.  the hypothesis i laid out above doesn't require it16:05
Shrewscorvus: I suppose we could have the DeletedNodeWorker cleanup thread keep a cache of the NodeDeleter threads it starts, and not try to delete something already being deleted on a subsequent run. But that bookkeeping can be tricky16:07
Shrewsas far as trying to detect the situation by observing the node has no data, i would think that would still be a bit racey (since it could still be in the process of being deleted)16:11
*** pcaruana has quit IRC16:12
corvusShrews: i think i have an idea; patch in a minute16:15
quiquellHelo I am back16:20
quiquellcorvus: thanks will fix right now16:21
openstackgerritQuique Llorente proposed openstack-infra/zuul master: Add default value for relative_priority  https://review.openstack.org/62217516:23
quiquellcorvus, tobiash ^16:24
SpamapSShrews: from your original description it sounds like you need to use locks so only one thread ever acts on a node. No?16:25
*** electrofelix has quit IRC16:27
ShrewsSpamapS: we *are* using locks16:28
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Fix race when deleting Node znodes  https://review.openstack.org/62240316:28
corvusShrews, tobiash, tristanC, SpamapS: ^ how about that idea?16:28
corvusalso, i think we're missing some potential unlock cases in exception handlers; i'll work on that in a followup16:29
Shrewscorvus: we don't want the deleteNode() call there do we? that's counter to what we're trying to do16:32
Shrewsoh, that's for the znode, not the instance16:32
Shrewsduh16:32
Shrewscorvus: but also, the node.state check will never succeed there16:34
Shrewswe'd need to refresh after locking the node16:34
Shrews(see the lines above it)16:34
Shrewsoh lockNode() refreshes it16:35
Shrewsi forgot that was added16:35
corvusyep and yep16:37
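
The rough shape of the approach being discussed for 622403, as a hedged sketch rather than the actual patch; it assumes a nodepool-style zk wrapper with lockNode(), unlockNode() and deleteNode() methods and a plain string state.

    # Hedged sketch of the idea, not the real change: lock first, rely on
    # the lock acquisition refreshing the node data, then re-check state.
    def delete_node(zk, node):
        try:
            # lockNode() refreshes the node, so the state check below sees
            # current ZooKeeper data rather than a cached copy.
            zk.lockNode(node, blocking=False)
        except Exception:
            return  # someone else holds the lock; let them handle it
        if node.state != 'deleting':
            # Another thread already finished with it; back off.
            zk.unlockNode(node)
            return
        # Deleting the znode also removes our lock child underneath it.
        zk.deleteNode(node)
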
*** quiquell is now known as quiquell|off16:50
*** pcaruana has joined #zuul17:12
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Add governance document  https://review.openstack.org/62243917:14
SpamapSYou're using locks, but two threads are acting on the lock?17:23
corvusSpamapS: yes -- the lock is under the node and recursive deletion deletes the lock first.  so deleting a node effectively unlocks it then deletes the node.17:26
corvusit seems there are disadvantages to both child locks and sibling locks (we use both in nodepool)17:27
tobiashThat is an interesting race17:27
corvus(requests use sibling locks, nodes use child locks)17:27
tobiashcorvus, Shrews: i was in meetings all day. Are there still open questions or was that all about that race?17:31
corvustobiash: i think all about the race; i don't have open questions17:32
*** gtema has quit IRC17:32
tobiashcorvus: thanks for the great review on the spec, I think I'll answer and refine tomorrow17:33
openstackgerritMerged openstack-infra/nodepool master: Make launcher debug slightly less chatty  https://review.openstack.org/62167517:34
SpamapSOh I wasn't aware there was such a thing as child locks.17:34
corvusSpamapS: i assumed that was the standard way of doing it17:36
corvusall things considered, we may want to standardize on sibling locks; it's easier to leak the locks, but at least you can't leave the resource unprotected.17:40
*** bhavikdbavishi has quit IRC17:41
tobiashcorvus: yes, and cleaning up leaked locks is easy by a restart of the service ;)17:45
*** nilashishc has quit IRC17:46
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Fix race when deleting Node znodes  https://review.openstack.org/62240317:49
tobiashquiquell|off: commented on https://review.openstack.org/62217517:53
*** pcaruana has quit IRC17:56
Shrewscorvus: should i work up a script to delete empty nodes, or do you think we should add that to nodepool as a cleanup routine?18:05
Shrewscorvus: and any thoughts on changing the deleted node worker interval from 5s?18:05
corvusShrews: i don't think we should need to change the interval.  i think it's important for this to be responsive.18:07
corvusShrews: if we switched it to being event-based in the normal case, we could increase the interval of the periodic check; but i don't think we want to wait 60 seconds to start deleting a node after it's finished.18:08
Shrewsok18:08
corvusShrews: as for the other question -- maybe a cleanup routine?18:09
tobiashShrews: I think adding that as a cleanup routine makes sense as probably many of us have these empty znodes18:09
corvusShrews: that way it's automatic and we don't have to tell people to go run a thing after upgrade18:09
corvusShrews: could probably just be added to the delete worker?18:10
tobiash++18:10
Shrewsi don't like it there, personally18:11
Shrewswe have a general cleanup worker for such things18:12
corvusShrews: ok; mostly was thinking we didn't need a new thread18:12
pabelangerShrews: when you have a moment, trying to debug an issue of nodepool deleting a used node, how do we know a node is unlocked based on this comment: https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/launcher.py#n700 ?18:12
corvuspabelanger: nodepool is supposed to delete used nodes18:13
pabelangercorvus: right, but in this case a job is still running18:13
corvuspabelanger: the lock is checked in line 70318:13
pabelangerI am missing the zuul side of the logs currently18:13
corvuspabelanger: was the job aborted?18:14
pabelangercorvus: not that I see in job-output.txt. I'll know more once I get a copy of the executor log18:15
corvuspabelanger: you need the scheduler log18:15
pabelangerkk18:15
Shrewspabelanger: you'll know a node is unlocked if you cannot obtain a lock for it18:16
Shrewserrr18:16
Shrewsif you CAN obtain a lock for it18:16
pabelangercool, thanks!18:16
corvuspabelanger: there is a situation where zuul may be unable to abort the job shortly after it starts; in that case, it will go ahead and unlock the node and nodepool will delete it while the job is still running.  however, zuul has already moved on and no longer considers the job relevant.  so you should never see this in a reported job.  only if you're watching logs.18:16
corvuspabelanger: you'll see that in the scheduler logs, not the executor.  also, everything about locking is in the scheduler, not the executor.18:17
pabelangercorvus: okay, let see what the job was marked as18:17
tobiashpabelanger: the other possibility would be the scheduler losing its zk session, but this would result in nodepool deleting *all* nodes18:20
pabelangertobiash: yah, that was my first thought18:20
pabelangerdownloading 5GB file now to confirm18:20
*** jpena is now known as jpena|off18:20
corvusShrews: replied to your comment on 622403 (didn't mean to ignore it, my update was just out of sync timewise)18:25
pabelangercorvus: okay, yes. I believe you are correct; it looks like zuul aborted the job, a new patchset was uploaded into gerrit at the same time.  Will confirm 100% once this logfile is done downloading18:25
pabelangeron the nodepool side, hard to see that :)18:25
corvuspabelanger: how did you notice this?18:26
pabelangercorvus: was pointed to https://logs.rdoproject.org/87/622187/2/openstack-check/tripleo-ci-centos-7-multinode-1ctlr-featureset010/7be402f/job-output.txt.gz by jpena|off because it showed up in a logstash query for WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED18:27
pabelangerthey are trying to debug an issue where FIPs are being reused, and I think they are getting false positives because of job aborts18:28
clarkbpabelanger: fwiw openstack has noticed genuine fip reuse18:28
clarkbI think there is a nova/neutron bug somewhere that is hitting things18:28
clarkbwe've seen it in rax and inap iirc18:28
pabelangerclarkb: yah, that is what I thought at first too, but then I got the launcher log and saw nodepool deleted the node behind the back of the running job18:28
tobiashpabelanger: now that you mention it, I also saw this warning in rare cases but in real running jobs18:29
corvuspabelanger: yeah, that's unfortunate; normally if the executor got the abort signal correctly, it wouldn't upload the logs18:29
SpamapSIIRC fip's are pulled without any ordering, so they're in database-pk order, and as such, you may keep getting the same one over and over if you allocate/deallocate over and over.18:29
clarkbactually inap and rax don't use fips18:30
clarkbso its IP address reuse without fips18:30
corvusthis is very tricky to fix in the gearman system; however it's easy to fix in zuulv4.  i think our first zuulv4 change should be the scheduler-executor interface18:30
SpamapShaven't looked at how fixed IPs are allocated, but if I had to guess, I'd say they find the first available in the subnet.18:31
SpamapSYeah not being able to cleanly cancel jobs is one of gearman's weakest traits.18:31
pabelangercorvus: ++18:33
SpamapSIn fact typically the way to cancel jobs in gearman is sentinels in a consistent system like ZK ;)18:33
corvusthis happens if the abort call comes between the executor starting the job and the scheduler receiving notification that it has started.  in that case, the gearman job is not cancelable and the scheduler doesn't know which executor to talk to.  there is actually a 1 second sleep in the code if we hit this to try to give the system time to resolve, but i looked yesterday, and we're seeing 10+ second delays18:33
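
A self-contained sketch of that timing window (names here are illustrative, not zuul's actual ExecutorClient): the cancel request arrives before the scheduler knows which executor took the gearman job, so it sleeps briefly and then gives up, leaving the build to run unreported.

    import time

    def cancel_build(build, delay=1.0):
        """Illustrative only; not zuul's real cancel logic."""
        if build.get('worker') is None:
            # Start notification not received yet, so we don't know which
            # executor to tell to abort. Give the system a moment.
            time.sleep(delay)
        if build.get('worker') is None:
            # Still unknown: give up, unlock the nodes and move on. The job
            # may keep running on an executor, but is never reported.
            print('Unable to cancel build %s' % build.get('uuid'))
            return False
        return True  # we know the executor; send it an abort request
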
pabelangercorvus: Yup, scheduler aborted the job, can see it in logs18:34
corvuspabelanger: you should see a log line with "Still unable to find build"18:34
pabelangerI need to see why the logstash instances in rdo are reporting the build.result as failure, and not aborted18:34
pabelangercorvus: yes, that is right18:35
pabelanger2018-12-04 10:35:41,660 DEBUG zuul.ExecutorClient: Unable to cancel build <Build 7be402f400574587bf4cd292595813d9 of tripleo-ci-centos-7-multinode-1ctlr-featureset010 voting:True on <Worker Unknown>>18:35
tobiashcorvus: I can reorder that in the spec (I ordered the spec in an order I thought it might be possible to implement)18:35
corvustobiash: i think the scheduler-executor part is almost completely separate from the rest18:36
tobiashthat's true18:36
corvuspabelanger: the logstash result doesn't really have anything to do with the zuul result18:36
tobiashit just needs to be done before moving the pipelines to zk18:36
corvustobiash: or it could be done as part of that (the executors could actually walk the pipelines rather than having their own build queue)18:38
pabelangercorvus: okay, am I right in understanding that zuul reports build_status FAILURE for the job, because the job finished (failed because the node was deleted) before the scheduler could tell the executor to abort the job?18:38
corvuspabelanger: no, zuul *did not report* that job18:38
tobiashcorvus: interesting idea18:38
corvuspabelanger: s/job/build/18:39
pabelangerokay18:39
corvuspabelanger: you only even know about that build because you're using data from outside of zuul.  if you looked at the change in gerrit, you would see no links to that build.18:39
pabelangercorvus: yah, I can see that now. They are just finger URLs still18:41
pabelangerand now I understand this exception: http://paste.openstack.org/show/736664/18:41
pabelangerbecause nodepool has already deleted the node / zk data18:41
corvuspabelanger: yep.  it's harmless18:42
pabelangerokay, this has been helpful and gives me more information to help debug the 'FIP issue' rdo is seeing.18:42
corvuspabelanger: that's when zuul finally gets caught up.18:42
pabelangercorvus: ++18:42
corvusShrews: https://review.openstack.org/622403 is green now18:52
corvuswow, fastest review ever18:52
tobiashcorvus: the secret handling just came to my mind; it might be a problem in a distributed scheduler18:53
tobiashcorvus: we need to somehow synchronize the private keys without storing them in zk18:53
Shrewscorvus: yeah, sorry. had to move back home b/c the coffee shop was blocking my outbound port in a test script  :/18:53
corvustobiash: i agree it might be nice to avoid storing them in zk, but we should probably consider that.  it's going to have decrypted copies of the data anyway.18:54
corvustobiash: in other words, it may not be any more secure to have a *second* place where the data could be compromised if we already have a first place.18:55
corvus(where the first place is zk)18:55
clarkbdoes zk need to have decrypted copies of the data? if the keys were everywhere that could be a local transformation right?18:56
clarkbbut then you are dealing with distributing keys (maybe a CA makes that easier)18:56
corvusclarkb: yes, if we distributed the keys to all schedulers *and* executors, we could avoid having decrypted data in zk.18:56
corvusbut then executors are a more interesting target (they're already interesting, this just makes them more interesting)18:57
clarkbas you say I'm not sure that is any better. Might be easier to secure zk18:57
clarkbyup18:57
corvusif we had a really good way to share those keys, i agree that would be an approach to consider18:57
tobiashdid anyone try out hashicorp vault?18:58
tobiashmaybe this could solve the key distribution?18:58
tobiashbut it would be yet another dependency on an external service18:59
corvustobiash: i don't see the connection; maybe you could elaborate.18:59
tobiashcorvus: hashicorp vault is a secure secret store that also has an api interface where services can retrieve the secrets18:59
tobiashalso I think with fine grained access control19:00
tobiashso this could be the one and only place where secrets are stored19:00
corvustobiash: so it's a spof, or another system that needs to be made HA19:00
tobiashyes, hadn't looked into that yet19:00
corvustobiash: what about zookeeper acls?19:01
corvustobiash: then only the schedulers could read the keys, and maybe only the executors could read the decrypted data for builds.  and web can't read any of it.19:02
tobiashthat makes sense19:02
corvussimilar exposure profile to an external service without additional deps.19:03
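
A minimal kazoo sketch of that ACL idea; the credentials and the /keystorage path are made up for illustration and are not zuul's actual layout.

    from kazoo.client import KazooClient
    from kazoo.security import make_digest_acl

    # Illustrative digest credential; only this identity may touch the keys.
    scheduler_acl = make_digest_acl('scheduler', 'scheduler-secret', all=True)

    client = KazooClient(
        hosts='localhost:2181',
        auth_data=[('digest', 'scheduler:scheduler-secret')])
    client.start()

    # Anything connecting with a different credential (e.g. zuul-web) gets
    # a NoAuth error when it tries to read below this path.
    client.create('/keystorage/example.com/org/project',
                  b'<encrypted private key>',
                  acl=[scheduler_acl], makepath=True)
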
corvus(openstack-infra meeting time; i'll be over there for a bit)19:03
openstackgerritMerged openstack-infra/nodepool master: Fix race when deleting Node znodes  https://review.openstack.org/62240319:20
SpamapScorvus: I'm not sure decrypted secrets or keys in ZK would make it through our security/compliance review process.19:35
SpamapSI'm already on thin ice with no zk auth.19:35
corvusSpamapS: decrypted secrets in gearman is okay?19:36
SpamapSCorrect, that's not on disk.19:36
SpamapSAnd encrypted in transit.19:37
corvusSpamapS: ack19:37
SpamapS(with cert validation as auth)19:37
SpamapSVault would be great.19:38
corvusSpamapS: what if we encrypted the keys in ZK, and then provided a single decryption key to the zuul servers?19:48
corvus(so it's basically the vault model without the extra service)19:48
clarkbzk supports auth right? is the issue we don't implement it as a client in zuul?19:49
corvuseverything is encrypted at rest19:49
corvusclarkb: SpamapS has a requirement that no secret data be decrypted at rest.  (ie, stored decrypted on disk).  that's a reasonable and frequently encountered security requirement.19:50
corvusclarkb: and i'm presuming we'll finish up the zk auth impl. in zuul before doing any of this19:50
clarkbya I think that is fine. Mostly responding to 19:35:14         SpamapS | I'm already on thin ice with no zk auth.19:51
corvusah yeah19:51
corvusso we should take care of that too :)19:51
pabelangerwasn't there patches up to add SSL support to zk for nodepool / zuul?19:53
corvuspabelanger: https://review.openstack.org/61915519:54
pabelangercorvus: thanks, I did see that. That looks to just be auth for now, will confirm with tristanC on the plan for SSL19:55
tobiashSpamapS: how do you handle the private keys of the scheduler?20:04
tobiashcorvus: I think encrypting the secrets with a single key (aes?) would be easy to do.20:06
tobiashSpamapS: would that make it through your security audit?20:06
tobiashpabelanger: kazoo just has landed ssl support a few weeks ago20:07
tobiashso that could be easily added to zuul probably20:07
pabelangertobiash: okay, maybe that is what I was remembering20:07
tobiashand I also saw a change to do auth with zk in zuul20:07
pabelangertobiash: yah, I don't mind looking at SSL for nodepool / zuul. Doing it for gearman was straightforward20:08
tobiashhttps://github.com/python-zk/kazoo/issues/38220:08
pabelanger+120:09
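
For reference, a sketch of a TLS-enabled kazoo connection; the keyword names below are an assumption based on the SSL support that landed in kazoo around this time and may differ between versions.

    from kazoo.client import KazooClient

    # Parameter names are assumptions about kazoo's SSL support, not a
    # verified zuul/nodepool configuration.
    client = KazooClient(
        hosts='zk.example.com:2281',
        use_ssl=True,
        verify_certs=True,
        ca='/etc/zookeeper/ca.pem',
        certfile='/etc/zookeeper/client.pem',
        keyfile='/etc/zookeeper/client.key')
    client.start()
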
*** manjeets_ has joined #zuul20:13
*** manjeets has quit IRC20:14
pabelangerzuul dashboard question - how can I add back the remaining time of a job when you hover over the blue bar showing the running job?  It has been missing for some time and was helpful to quickly see how much longer a patch had left based on previous job runs20:18
tobiashpabelanger: I think there is an estimated time per job in the status.json20:24
tobiashThis should already be used to calculate the status bars20:25
pabelangerokay, so maybe I just need to add a tooltip20:26
tobiashProbably20:34
pabelangerk, that is another rabbit hole I'll hold off on for now.20:35
*** j^2 has quit IRC20:47
SpamapScorvus: Vault's model is not quite the same as a single decryption key. You provide clients with a vault authentication credential, which clients use to get keys from vault that decrypt the secrets as they are now. But the ciphertext is constantly rotated in a Vault model, and Vault just keeps track of who has what keys now, so you know when you can purge ciphertexts....21:52
SpamapSHonestly, Vault is great, but let's not go there. :)21:52
SpamapStobiash: right now the keys for zuul are on an encrypted EBS volume. That's not ideal, but it works, and it satisfies encryption at rest and revokability (I can delete the key in KMS and nobody can mount the volume anymore). We can also do ZK on encrypted volumes... but one thing I know we've been trying to do with Zuul is not require complicated deployment choices.. so I was hoping we could just do it21:56
SpamapSapp-level and supply the private keys via environment variables in k8s deployments (which have the benefit of being deletable post-container-startup)21:56
corvusSpamapS: it sounds like (even though it's not quite as robust as the vault model) the store-encrypted-in-zk-with-single-decryption-key model would probably fit the bill?22:00
SpamapScorvus: absolutely.22:00
SpamapSVault is like, 4th order security automation. I think it might be worth doing some day. But.. let's hit 2nd order which is basically where my baseline will be.22:00
corvusSpamapS: and thanks for chiming in -- i think the addition of the 'no decrypted data at rest' requirement is a good one and will have kept us from going down some rabbit holes :)22:00
SpamapScorvus: I will try to keep it lightweight and simple as much as possible. For all the pain that requirement brings, complexity will bring far far more.22:01
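
A small sketch of that "ciphertext in ZK, one locally held key" model using Fernet from the cryptography package; the key handling shown here is illustrative, not a proposed zuul interface.

    from cryptography.fernet import Fernet

    # The single decryption key lives only on the Zuul servers (for example
    # supplied through an environment variable); ZooKeeper only ever sees
    # ciphertext.
    local_key = Fernet.generate_key()
    f = Fernet(local_key)

    ciphertext = f.encrypt(b'-----BEGIN RSA PRIVATE KEY----- ...')
    # ciphertext is what a scheduler would store in the znode.
    plaintext = f.decrypt(ciphertext)
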
clarkbSpamapS: re env vars and deleting post container startup, you can still cat /proc/<pid>/environ ? Or do you mean remove it from the env after it's been internalized, so it would only be available by reading memory directly22:04
clarkb(mostly trying to understand what level of paranoia people find valuable when it comes to reading things from memory on linux)22:04
clarkbin an unrelated thread people are talking about supporting virtual TPMs, which are implemented (currently at least) without the support/safety of hardware based features (like write only data). And i wonder how useful an in memory store (which can have its data read back out again) is in practice22:07
SpamapSclarkb: yes you still have to deal with in-container and in-node compromise. But the idea is that nobody can go grab the secret from etcd/k8s anymore.22:18
SpamapSThe general thinking is that letting it sit there in RAM to be used is inevitable and you can't really protect against that level of compromise.22:19
clarkbSpamapS: you can with a tpm, but that requires a lot more setup and special hardware22:19
clarkb(which is why virtual tpm that doesn't protect in that way really surprised me)22:19
SpamapSBut letting it sit, available in an API, or available on disk for a backup or a medium level of privilege escalation (like, you can read files as root, but nothing else), is avoidable.22:19
SpamapSHah yeah, of course there are ways.22:20
SpamapSBasically you want to avoid accidental exposure to a deliberate attack as a policy. Going beyond that is where we get in to those higher orders, and reserved for things where the cost of compromise is higher than the cost to implement.22:22
*** leifmadsen has joined #zuul22:33
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool master: Add cleanup routine to delete empty nodes  https://review.openstack.org/62261622:33
Shrewscorvus: tobiash: ^^^ should do the empty node cleanup22:35
*** etp has quit IRC23:12
*** etp has joined #zuul23:13
corvusShrews: lgtm, thx!23:18
*** dkehn_ has joined #zuul23:51
*** dkehn has quit IRC23:52
*** dkehn_ is now known as dkehn23:53
*** pabelanger has quit IRC23:57
*** pabelanger has joined #zuul23:59
