openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Add entry for ref-replication-scheduled event https://review.openstack.org/446251 | 00:47 |
openstackgerrit | K Jonathan Harker proposed openstack-infra/zuul feature/zuulv3: WIP: Reenable test_build_configuration_conflict https://review.openstack.org/446275 | 02:09 |
*** jasondotstar has quit IRC | 03:25 | |
*** jasondotstar has joined #zuul | 03:26 | |
*** leifmadsen has quit IRC | 03:26 | |
*** leifmadsen has joined #zuul | 03:27 | |
jamielennox | is it possible/are there any examples of shipping custom ansible modules in a common-config? what's the path? | 03:34 |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Re-add the ability to set username on zuul-executor https://review.openstack.org/446308 | 05:58 |
*** isaacb has joined #zuul | 06:26 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool master: Add default logging file argument https://review.openstack.org/446324 | 07:02 |
*** bhavik1 has joined #zuul | 07:03 | |
*** bhavik1 has quit IRC | 07:11 | |
*** isaacb has quit IRC | 07:20 | |
*** isaacb has joined #zuul | 07:23 | |
*** hashar has joined #zuul | 07:43 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul feature/zuulv3: Change mutex to counting semaphore https://review.openstack.org/442563 | 08:36 |
*** hashar has quit IRC | 08:42 | |
*** hashar has joined #zuul | 08:53 | |
*** hashar has quit IRC | 08:55 | |
*** hashar has joined #zuul | 08:59 | |
*** Cibo_ has quit IRC | 08:59 | |
*** Cibo_ has joined #zuul | 09:04 | |
*** Cibo_ has quit IRC | 09:08 | |
*** isaacb has quit IRC | 09:16 | |
*** isaacb has joined #zuul | 09:17 | |
pabelanger | morning | 13:07 |
mordred | morning pabelanger | 13:09 |
*** hashar is now known as hasharNap | 13:30 | |
*** isaacb has quit IRC | 13:58 | |
*** isaacb has joined #zuul | 14:11 | |
pabelanger | jeblair: do you mind looking at 446095 and 446110 this morning? Shrews patches to bring nl01.o.o back online | 14:14 |
*** Cibo_ has joined #zuul | 14:39 | |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Reset lost requests https://review.openstack.org/446095 | 14:40 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Deallocate ready nodes with no requests https://review.openstack.org/446110 | 14:42 |
*** hasharNap is now known as hashar | 14:43 | |
Shrews | pabelanger: let me know when you're ready to restart nl01. want to watch things | 14:47 |
jeblair | pabelanger, Shrews: done :) | 14:47 |
jeblair | jamielennox: i don't think we've experimented with that yet | 14:47 |
pabelanger | jeblair: danke | 14:47 |
pabelanger | Shrews: Sure, I'm ready once we update the git repo on disk, so 15mins or so | 14:48 |
Shrews | pabelanger: there are currently 5 nodes (the max) allocated to non-existing requests. They should get deleted at startup. The first 4 requests are PENDING and they should get reset to REQUESTED. | 14:51 |
*** isaacb has quit IRC | 14:51 | |
* Shrews crosses fingers | 14:51 | |
*** isaacb has joined #zuul | 14:51 | |
pabelanger | ack | 14:53 |
mordred | jeblair, jamielennox: one can put custom modules in roles, and we do deal with roles - so I'd hope we can just get it for free by putting modules in roles | 14:54
mordred | Shrews: I wish you luck | 14:55 |
jeblair | mordred, jamielennox: agreed -- and i think we've implemented enough for it to work at this point if someone wants to try it out | 14:57 |
jeblair | mordred, jamielennox: however, we do still have an outstanding bug (there's a story) where we may not put roles in the correct security context, so be aware of that. | 14:58 |
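Ansible picks up custom modules from a `library/` directory inside a role, which is why shipping them in a common-config repo's roles should work. A minimal sketch of such a module, assuming an invented role and module name (`example` / `zuul_hello`); the path layout is standard Ansible, everything else here is illustrative:

```python
#!/usr/bin/python
# Sketch of a custom module shipped inside a role, e.g.
# common-config/roles/example/library/zuul_hello.py (paths are illustrative).
from ansible.module_utils.basic import AnsibleModule


def main():
    module = AnsibleModule(
        argument_spec=dict(
            name=dict(type='str', required=True),
        )
    )
    # Return a value to the playbook without changing anything on the host.
    module.exit_json(changed=False, greeting='hello %s' % module.params['name'])


if __name__ == '__main__':
    main()
```

A playbook that applies the role can then use `zuul_hello:` like any other task.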
*** Cibo_ has quit IRC | 15:00 | |
Shrews | jamielennox: hrm, this json error on your nodepool change makes no sense to me: http://logs.openstack.org/75/445675/3/check/nodepool-coverage-ubuntu-xenial/45b42f1/console.html#_2017-03-15_20_22_33_272050 | 15:08 |
Shrews | jamielennox: maybe some sort of race condition we (or kazoo) do not account for | 15:09 |
Shrews | Node znodes should always have data | 15:09 |
*** isaacb has quit IRC | 15:10 | |
pabelanger | Hmm, I might need some help. Going a little crazy as to why e1c26d56b541f (^446110) is not the tip of feature/zuulv3: http://git.openstack.org/cgit/openstack-infra/nodepool/log/?h=feature/zuulv3 | 15:14 |
pabelanger | but is locally for me | 15:14 |
pabelanger | because, nl01.o.o /opt/nodepool is also out of sync with local | 15:15 |
Shrews | pabelanger: hrm, a fresh 'git pull' for me does not get those two latest merges | 15:16 |
pabelanger | I think something might be up with our git farm | 15:17 |
Shrews | yeah | 15:17 |
pabelanger | moving to #openstack-infra | 15:17 |
jeblair | okay, that's recovering now | 16:04 |
pabelanger | confirmed, nl01.o.o updating now | 16:05 |
pabelanger | Shrews: let me know when you are ready | 16:06 |
Shrews | pabelanger: ready. do you need to 'pip install' again since that didn't happen last time? | 16:07 |
pabelanger | Shrews: no, all fixed. | 16:07 |
Shrews | k | 16:07 |
pabelanger | okay, starting nodepool-launcher | 16:07 |
pabelanger | cool, nodes were deleted | 16:08 |
Shrews | yep | 16:08 |
Shrews | pending requests being processed | 16:09 |
pabelanger | Oh | 16:10 |
pabelanger | heh | 16:10 |
pabelanger | we'll need to restart zuul to pick up new executor changes too | 16:10 |
pabelanger | we'll let the queue clear out first | 16:10 |
pabelanger | Shrews: I do see a locked ready node however | 16:12 |
Shrews | pabelanger: that can be normal | 16:12 |
Shrews | they get locked satisfying requests | 16:13 |
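The node locks here are ZooKeeper locks taken through kazoo, so a ready-but-locked node simply means another client currently holds the lock znode. A sketch of the idea using the kazoo Lock recipe; the path and identifier are made up for illustration:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# A ready node is locked while it is being used to satisfy a request.
lock = zk.Lock('/nodepool/nodes/0000000584/lock', identifier='launcher-nl01')
if lock.acquire(blocking=False):
    try:
        print('node is free; allocate it to a request')
    finally:
        lock.release()
else:
    print('node is locked by another client (normal while satisfying a request)')
```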
* Shrews tests the new 'request-list' command | 16:14 | |
pabelanger | okay, I'm going to stop zuul-launcher and update to zuul-executor. Jobs are getting requeued until they hit the RETRY_LIMIT | 16:15 |
Shrews | pabelanger: umm, i'm confused why some nodes have AZ 'nova' and others are blank? if there are AZ's, we should list them in nodepool.yaml | 16:15 |
Shrews | oh, they're building. nm | 16:15 |
Shrews | jeblair: we should probably add node.az to the zuul model definition. seems we lose that value when you set the state to IN_USE | 16:16 |
clarkb | fwiw I don't think we have any clouds where az currently matters | 16:17 |
clarkb | (but others may and future might) | 16:17 |
jeblair | Shrews: yeah, though also, we should probably have the model classes transparently keep any unknown values, so we don't stomp on them. | 16:17 |
Shrews | jeblair++ | 16:17 |
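One way model classes could transparently keep unknown values, sketched here rather than taken from Zuul's actual model code: carry anything unrecognized in a side dict and write it back out on serialization so a writer never stomps on fields owned by another component.

```python
class Node:
    """Sketch of a ZooKeeper-backed model object that preserves unknown keys."""

    KNOWN = ('state', 'az', 'allocated_to')

    def __init__(self):
        self.state = None
        self.az = None
        self.allocated_to = None
        self._extra = {}  # anything read from ZK that we do not model explicitly

    def fromDict(self, data):
        for k, v in data.items():
            if k in self.KNOWN:
                setattr(self, k, v)
            else:
                self._extra[k] = v

    def toDict(self):
        # Start from the unknown keys so writing the node back never drops
        # fields set by a newer or different component.
        data = dict(self._extra)
        for k in self.KNOWN:
            data[k] = getattr(self, k)
        return data
```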
Shrews | pabelanger: hrm, seems we have a lock now. :( | 16:19 |
Shrews | jeblair: many nodes stuck in IN_USE | 16:21 |
Shrews | and locked | 16:21 |
jeblair | who owns the lock? | 16:21 |
jeblair | looks like zuul | 16:22 |
pabelanger | okay, zuulv3-dev started again | 16:23 |
jeblair | wait what? | 16:23 |
Shrews | oh, well | 16:23 |
jeblair | why are we restarting things? | 16:23 |
pabelanger | to use new zuul-executor | 16:24 |
pabelanger | should I have not done that? | 16:24 |
jeblair | we were trying to debug some stuck locks | 16:24 |
jeblair | it's possible the executor restart made them stuck | 16:24 |
jeblair | i think we still have a bug with that | 16:24 |
jeblair | pabelanger: if you restarted the executor first, then the main daemon, that may have been the cause | 16:25 |
Shrews | nodepool-launcher is still hung up | 16:25 |
pabelanger | apologies, I missed the locks part | 16:25 |
jeblair | pabelanger: did you configure gerrit debug logging? | 16:25 |
jeblair | in zuul | 16:25 |
pabelanger | jeblair: no, I did rename the launcher-logging.conf file to executor-logging.conf | 16:26 |
pabelanger | checking it now | 16:26 |
Shrews | pabelanger: jeblair: can we shutdown both things now so I can have a stable ZK state to investigate current issues? | 16:26 |
pabelanger | I can stop executor again | 16:26 |
jeblair | go for it | 16:26 |
jeblair | i think i know the logging problem | 16:26 |
pabelanger | executor stopped | 16:27 |
*** Cibo_ has joined #zuul | 16:27 | |
Shrews | hrm, i think we may be hitting some znode versioning issues | 16:31 |
Shrews | 2017-03-16 16:22:31,198 DEBUG nodepool.NodeRequestHandler: Locked existing node 0000000584 for request 100-0000002088 | 16:31 |
Shrews | yet, when I looked at that in zk-shell, the 'allocated_to' was still set for a lost request | 16:32 |
Shrews | only after zuul-exe was stopped did the value of allocated_to change to 2088 | 16:32
Shrews | after n-l was stopped, i guess | 16:33 |
jeblair | pabelanger: did you only stop the executor? | 16:34 |
jeblair | pabelanger: i thought Shrews asked for both components to be stopped. zuul-server is the only part that talks to nodepool. | 16:34 |
pabelanger | jeblair: I also stopped / started zuul too | 16:34 |
pabelanger | zuul is still running, I can stop that | 16:35 |
jeblair | i see zuul-scheduler running right now | 16:35 |
jeblair | Shrews: you have not had a stable environment | 16:35 |
pabelanger | yes, I restarted it when I did executor, but have not stopped it | 16:35 |
Shrews | yeah, wanted both stopped. just so as not to fill the queue | 16:35 |
jeblair | pabelanger: why don't you go ahead and stop it | 16:36 |
pabelanger | scheduler stopped too | 16:36 |
Shrews | so, i need to read up on znode versioning a bit. i'm really confused here | 16:36 |
Shrews | because n-l did what it was supposed to do, but the data it was reading was not what was expected | 16:37 |
*** bhavik1 has joined #zuul | 16:37 | |
pabelanger | while I am looking at launcher-debug on nl01.o.o, does it make sense to expose our cleanup timer to nodepool.yaml (like we did before)? I think 5 seconds might be a little aggressive | 16:44
jeblair | pabelanger: i do not think it should be exposed as an option because it doesn't make sense for a user to tune it. we should hardcode it to a good value. | 16:45 |
Shrews | pabelanger: We want deleted nodes to be cleaned up quickly (thus the short time), but I was thinking it might be a good idea to have that as a separate thread from all of the other cleanup that can be less aggressive | 16:45 |
jeblair | ++ | 16:46 |
pabelanger | Ah, I see | 16:46
jeblair | though, another option would be to set watches on nodes so that we get immediate notification of when to clean them up. | 16:46 |
jeblair | then all of the periodic cleanup tasks can be less aggressive | 16:47 |
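The watch idea would look roughly like this with kazoo's DataWatch; the znode path, the JSON layout, and the callback body are all illustrative assumptions:

```python
from kazoo.client import KazooClient
from kazoo.recipe.watchers import DataWatch

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()


def on_node_change(data, stat):
    # Called once immediately and again on every update; data is None when
    # the znode is removed. Immediate notification means no tight poll loop.
    if data is not None and b'"state": "deleting"' in data:
        print('node marked for deletion, clean it up now')


# Watch a single node znode (path is illustrative).
DataWatch(zk, '/nodepool/nodes/0000000584', on_node_change)
```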
pabelanger | I think the current ServerList that happens every 5 seconds is from our _cleanupLeakedInstances function. Today, we do 60 seconds on nodepool.o.o | 16:47 |
jeblair | pabelanger: on master we get immediate notification | 16:48 |
Shrews | watches are another option | 16:48 |
pabelanger | not a big deal, was just trying to see what nl01.o.o was doing with all the ServerList calls | 16:48 |
Shrews | harlowja: Can you think of any reason why reads from ZK would not return the latest version of a znode? | 16:48 |
Shrews | pabelanger: i'd be fine with first separating out into a separate thread, then doing the watches later | 16:49 |
jeblair | that seems like a tractable plan | 16:50 |
pabelanger | would NodeCleanupWorker() launch the new thread or Nodepool()? | 16:51 |
jeblair | pabelanger, jhesketh: https://review.openstack.org/224336 invalidated our logging configuration. | 16:51 |
jeblair | pabelanger: is puppet running on zuulv3-dev? | 16:51 |
pabelanger | jeblair: no, I made the changes manually | 16:51 |
jeblair | pabelanger, jhesketh: ok, i'm going to start manually making a new logging config there, and we can copy it into puppet for master later | 16:52 |
pabelanger | wfm | 16:52 |
Shrews | pabelanger: i envisioned NodePool doing it (a new DeletedNodeWorker thread and rename NodeCleanupWorker to CleanupWorker) | 16:53 |
Shrews | pabelanger: if you want to take a stab at it, go for it! | 16:53 |
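A sketch of the split described above: two periodic workers sharing a base class, with the aggressive interval reserved for nodes already marked deleted and the expensive server-listing work on the slow loop. The class names follow the chat; the intervals and method bodies are assumptions.

```python
import threading
import time


class BaseCleanupWorker(threading.Thread):
    """Run one cleanup pass on a fixed interval until stopped."""

    def __init__(self, interval):
        super(BaseCleanupWorker, self).__init__()
        self._interval = interval
        self._running = True

    def run(self):
        while self._running:
            self._cleanup()
            time.sleep(self._interval)

    def stop(self):
        self._running = False


class DeletedNodeWorker(BaseCleanupWorker):
    """Aggressive loop: only tears down nodes already marked deleted."""

    def _cleanup(self):
        pass  # delete instances/znodes in the DELETING state


class CleanupWorker(BaseCleanupWorker):
    """Slow loop: leaked instances, stale request locks, etc."""

    def _cleanup(self):
        pass  # the server-listing work lives here, so it runs less often


deleted_worker = DeletedNodeWorker(interval=5)
slow_worker = CleanupWorker(interval=60)
```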
pabelanger | Shrews: sure, would be good to try more threading programming | 16:54 |
pabelanger | Shrews: would _cleanupNodeRequestLocks() and _cleanupNodes() go into DeletedNodeWorker? | 16:59 |
*** hashar has quit IRC | 17:00 | |
Shrews | pabelanger: yes, though I might consider moving the new check in _cleanupNodes() for the allocated_to stuff into the less aggressive thread | 17:02 |
Shrews | not sure that's necessary. haven't given too much thought to it yet | 17:03 |
pabelanger | Shrews: okay, let me propose this over a few patches, so I better understand things. | 17:03
Shrews | pabelanger: the leaked node stuff should go into the less aggressive thread, obviously. that's where the server listing stuff is coming from | 17:05 |
jeblair | pabelanger, Shrews: okay i have a new logging config staged for when we're ready to restart zuul | 17:05 |
Shrews | jeblair: i'm stumped on this versioning stuff | 17:05 |
harlowja | Shrews reads from how many nodes? | 17:05 |
Shrews | maybe my assumption is wrong | 17:06 |
Shrews | harlowja: umm, what? from "nodes"? | 17:06 |
harlowja | sorry, how many zookeeper servers | 17:06 |
Shrews | harlowja: 1 | 17:06 |
harlowja | oh | 17:06 |
harlowja | then, nope can't think of a reason | 17:07 |
harlowja | lol | 17:07 |
Shrews | *sigh* | 17:07 |
Shrews | ok, maybe this is just a bug somewhere then | 17:07 |
Shrews | in my code, that is | 17:07 |
harlowja | if > 1 i think u can get stale reads (due to how the commit/blah blah algorithm works) | 17:07 |
Shrews | obviously a bug :) | 17:07 |
* Shrews tries a new test case | 17:08 | |
* harlowja trying to find if there are any kazoo folks around anymore | 17:13 | |
harlowja | i'm starting to think i'm the last one left in #zookeeper | 17:13
harlowja | lol | 17:13 |
Shrews | ah ha. found at least one bug | 17:14 |
Shrews | this may be the main thing. i've been assuming zk-shell reads the latest version of things. maybe not? | 17:19 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Unpause when we grab a pre-ready node https://review.openstack.org/446642 | 17:20 |
harlowja | it should | 17:20 |
harlowja | didn't think zk-shell was caching anything | 17:20 |
Shrews | jeblair: pabelanger: ^^^ | 17:20 |
harlowja | Shrews https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkGuarantees | 17:20 |
harlowja | may just be useful to read | 17:20 |
harlowja | `Sometimes developers mistakenly assume one other guarantee that ZooKeeper does not in fact make. This is: | 17:20 |
harlowja | Simultaneously Consistent Cross-Client Views | 17:20 |
harlowja | ZooKeeper does not guarantee that at every instance in time, two different clients will have identical views of ZooKeeper data.` | 17:20 |
*** bhavik1 has quit IRC | 17:20 | |
harlowja | it will guarantee `Sequential Consistency` | 17:21 |
harlowja | and blah blah | 17:21 |
Shrews | harlowja: hrm, except our clients (threads in this case) share the same ZK connection, so.... | 17:22 |
harlowja | k | 17:23 |
Shrews | that is, indeed, interesting though. | 17:23 |
Shrews | we could see differences between zuul and nodepool b/c of that | 17:23 |
harlowja | ya, there might be a way to tweak this to make it happen | 17:24 |
jeblair | Shrews: +2 | 17:24 |
Shrews | "The clients view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds)." | 17:25 |
Shrews | tens of seconds? | 17:25 |
Shrews | wonder why so high | 17:26 |
harlowja | likely the time the server has before it kicks itself out of the zookeeper cluster, lol | 17:26 |
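For reference, ZooKeeper's escape hatch for the cross-client consistency gap is sync(), which kazoo exposes; with a single ZK server, as here, stale reads should not be the explanation, so this is context only. A sketch (the path is illustrative):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

path = '/nodepool/nodes/0000000584'

# sync() asks the connected server to catch up with the leader before the
# next read, closing the "sequential but not simultaneous" consistency gap.
zk.sync(path)
data, stat = zk.get(path)
print(stat.version, data)
```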
jesusaur | jeblair: yesterday I started poking test_build_configuration_conflict() in test_scheduler.py, and I discovered that merge conflicts are getting reported as success. Should a merge failure call sendWorkFail instead of sendWorkComplete? or is there another way the merger should notify the server about the merge failure? | 17:26 |
harlowja | Shrews https://github.com/coreos/etcd/issues/741 is an interesting read also | 17:28
jeblair | jesusaur: i think they should report sendworkcomplete but with a null commit, or maybe a boolean flag? | 17:29 |
harlowja | pretty sure (though not 100%) that etcd does the same, in that reads may return slightly different versions of the data, depending on the server the client is connected to | 17:29
jesusaur | jeblair: ok, so then something should check for the null commit and/or merged=False afterwards? | 17:31 |
jeblair | jesusaur: hrm, i *thought* we had re-added merge checking on all changes, but it looks like that hasn't happened yet. i think someone was working on it but i guess they dropped off. | 17:34 |
jeblair | jesusaur: the missing code is in I14e3d96cadec6e4b4cca66137071e8ed67f161a1. things related to prepareRef | 17:36 |
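The reporting style jeblair suggests would look something like this on the merger side: always complete the gearman job, and let the payload carry the outcome. The field names are illustrative, not Zuul's actual result schema.

```python
import json


def report_merge_result(job, commit):
    # Always complete the gearman job; signal a merge conflict with a null
    # commit plus an explicit boolean rather than with a work-fail packet.
    result = {
        'merged': commit is not None,
        'commit': commit,
    }
    job.sendWorkComplete(json.dumps(result))
```

Something on the scheduler side would then check `merged`/`commit` and report the change as failed instead of treating the job as successful.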
SpamapS | Shrews: it's ok for Zuul and Nodepool to see different views, as long as they both eventually see the same view. | 17:37 |
harlowja | also https://aphyr.com/posts/313-strong-consistency-models | 17:37 |
jeblair | the main thing is whether locking works :) | 17:37 |
harlowja | distributed systems are hard | 17:38
harlowja | lol | 17:38 |
jeblair | jesusaur: but prepareLayout does very similar work now; so i think the result needs to be a combination of the two. | 17:39 |
jeblair | jesusaur, SpamapS: sorry my evaluation of that turned out to be wrong. | 17:40 |
jesusaur | jeblair: ah, thanks for the breadcrumbs, I'll dig through this a bit more | 17:43 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Rename NodeCleanupWorker to DeletedNodeWorker https://review.openstack.org/446649 | 17:43 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Create BaseNodeWorker class https://review.openstack.org/446650 | 17:43 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Split DeleteNodeWorker into two threads https://review.openstack.org/446651 | 17:43 |
SpamapS | jeblair: can you disambiguate "that" in your message for me? | 17:46 |
Shrews | SpamapS: yep, "eventually" consistent is what i'd expect | 17:49 |
jeblair | SpamapS: the area that jesusaur is working on -- i thought we had re-added the merge conflict detection, but i think that work got dropped. | 17:51 |
rbergeron | shrews, jeblair, spamaps, pabelanger, anyone else full of comments: I've got the draft of the mail update here -- https://etherpad.openstack.org/p/zuul-v3-first-status-update (at the top, the previous version is at the bottom) | 17:51 |
rbergeron | (yes, i know i said i would do this on my own but just want to double check if anyone has anything specific they want to highlight) | 17:51 |
Shrews | rbergeron: i'm full of something... | 17:51 |
rbergeron | shrews: me too. cheetos puffs, to be specific | 17:52 |
Shrews | rbergeron: i added words (at the same time as someone else) | 17:53 |
jeblair | o/ | 17:53 |
SpamapS | jeblair: OH! ACK | 17:53 |
jeblair | SpamapS: so that probably invalidates some of my evals in that etherpad | 17:54 |
SpamapS | jeblair: well we can always fix that. :) | 17:54 |
jeblair | rbergeron: lgtm thanks! | 17:55 |
jesusaur | SpamapS: and with any luck, I will! | 17:56 |
jeblair | rbergeron, pabelanger: we could mention the launcher rename | 17:56 |
jeblair | ah, pabelanger is on that :) | 17:56 |
Shrews | jeblair: perhaps we should reference your email for that syntax change? | 17:56 |
*** Cibo_ has quit IRC | 17:57 | |
pabelanger | ++ | 17:57 |
jeblair | Shrews: good idea: http://lists.openstack.org/pipermail/openstack-infra/2017-January/005018.html | 17:57 |
jeblair | rbergeron: ^ | 17:57 |
Shrews | added. | 17:58 |
pabelanger | rbergeron: updates left, looks good | 17:59 |
rbergeron | pabelanger, shrews, jeblair: merci beaucoup; looks like y'all added all those things above into the mail directly too (thanks!). will ship it out :) | 18:15 |
jeblair | Shrews: nodepool fix merged | 18:23 |
Shrews | jeblair: yup. pabelanger, ready for restart when nl01 updates | 18:24 |
Shrews | somebody restart it? | 18:25 |
Shrews | b/c it's now running... but i don't think the code updated | 18:25 |
Shrews | oh, it's been running | 18:26 |
Shrews | O.o | 18:26 |
pabelanger | I haven't done anything yet | 18:26 |
Shrews | it was running. i stopped it | 18:26 |
Shrews | not sure why | 18:27 |
pabelanger | let me check if puppet has run | 18:27 |
Shrews | /opt/nodepool hasn't updated | 18:27 |
pabelanger | confirmed | 18:27 |
pabelanger | I can kick.sh it | 18:28 |
Shrews | pabelanger: can i manually 'git pull', or does that muck things up | 18:28 |
pabelanger | Shrews: actually | 18:28 |
pabelanger | Shrews: /opt/system-config/production/tools/kick.sh nl01.openstack.org | 18:28 |
pabelanger | from puppetmaster as root | 18:28 |
Shrews | i'm not touching puppet things. you can do it, or we can wait :) | 18:28 |
pabelanger | k, running | 18:28 |
pabelanger | Shrews: okay, restart when you are ready | 18:30 |
pabelanger | we also need to restart zuulv3-dev.o.o | 18:31 |
Shrews | pabelanger: go for it. the request pool drained | 18:31 |
pabelanger | started | 18:32 |
pabelanger | k, we have no changes in queue | 18:33 |
pabelanger | let me add one | 18:33 |
pabelanger | just rechecked 441332 | 18:34 |
pabelanger | Oh, heh | 18:34 |
pabelanger | configure_mirror.sh is failing, since we removed /etc/nodepool/ folder | 18:34 |
pabelanger | but, I do see an issue | 18:34 |
jeblair | yeah, i reckon it's time to reimplement that in ansible | 18:35 |
pabelanger | jobs should be set to failing, but they are requeuing due to exit code 2 | 18:35 |
pabelanger | need to see why that is a 2 | 18:35 |
Shrews | nodepool-launcher seemed to do its thing. will need some watching though | 18:37 |
jeblair | pabelanger: that's a pre-playbook, right? | 18:37 |
jeblair | pabelanger: we re-queue for pre playbook failures | 18:37 |
Shrews | These nodepool bugs over the last few days were fantastic. Nothing like the drizzle savepoint bug that drove me crazy for weeks and made me curse mordred for getting me involved, but interesting nonetheless | 18:39
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add secret top-level config object https://review.openstack.org/446156 | 18:41 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add per-repo public and private keys https://review.openstack.org/406382 | 18:41 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Associate secrets with jobs https://review.openstack.org/446687 | 18:41 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Decrypt secrets and plumb to Ansible https://review.openstack.org/446688 | 18:41 |
Shrews | wooo, requests | 18:41 |
Shrews | MOAR | 18:42 |
jeblair | i'll type faster! | 18:42 |
* Shrews watches the debug log and sees the matrix | 18:43 | |
pabelanger | jeblair: ya, pre-run. thanks | 18:46 |
Shrews | still chugging... looking pretty good | 18:48 |
Shrews | jeblair: w00t. it scoffed at your attempts to throw it for a loop (literally, a loop) | 18:49 |
jeblair | heh | 18:49 |
pabelanger | jeblair: what are your thoughts on how we get nodepool (provider / cloud / region / az) into zuul? Is it possible to get some of the data out of zookeeper? | 18:49
Shrews | that question confuses me | 18:50 |
pabelanger | I somehow need to access the following data (which we deleted from nodepool) from zuul: https://review.openstack.org/#/c/445572/ | 18:52
pabelanger | we use it to build up our mirror information | 18:52 |
pabelanger | as an example: http://mirror.dfw.rax.openstack.org/ | 18:52 |
Shrews | oh, ok. less confused | 18:52 |
jeblair | ah yeah, we should be able to get all of that. what's the best way to give it to ansible? in vars.yaml? or should it be in the inventory file? | 18:53 |
pabelanger | vars would be the easiest | 18:54 |
pabelanger | but | 18:54 |
* Shrews uses this time of less brokenness to step away for a bit. bbl | 18:54 | |
pabelanger | do we maybe want to setup a FQDN in inventory? | 18:54 |
jeblair | it's per-host information | 18:54 |
pabelanger | then extract the info from it? | 18:54 |
jeblair | (i mean, those things should generally all be the same, but it feels like we should be able to support them being different) | 18:54 |
pabelanger | according to docs there is inventory_hostname (which could be FQDN) and inventory_hostname_short, which is up to first dot | 18:59 |
pabelanger | so, adding it to inventory could be an option | 18:59 |
jeblair | pabelanger: wait, are you suggesting we make up a fqdn for the hosts? | 18:59 |
pabelanger | jeblair: Ya, but that might be too specific to openstack-infra | 19:00 |
jeblair | yes | 19:01 |
jeblair | pabelanger: what i'm asking is whether it's possible to add host-specific metadata to the inventory | 19:01 |
jeblair | or whether we have to put that in variables and then cross-reference them | 19:01 |
jeblair | http://docs.ansible.com/ansible/intro_inventory.html#host-variables | 19:01 |
jeblair | that seems to say yes | 19:01 |
pabelanger | jeblair: right, what is the difference then versus just adding them into vars.yaml? | 19:02
jeblair | pabelanger: so it makes the most sense to me to add "nodepool_provider", etc, to host variables. then each invocation of a 'setup mirror' role would be correct on its local host, i assume. | 19:02 |
jeblair | pabelanger: we can have many hosts. some values like this may end up being different some day. | 19:03 |
pabelanger | I see | 19:04 |
pabelanger | I don't have a preference atm, so I'll defer to you | 19:04 |
jeblair | host variables it is then | 19:05 |
jeblair | pabelanger: i'm heading to lunch now; it's all yours if you want to work on that | 19:06 |
pabelanger | jeblair: sure, I can try | 19:06 |
Shrews | pabelanger: i want to bike shed on https://review.openstack.org/446650 :) | 19:38 |
pabelanger | Shrews: sure! | 19:40 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Decrypt secrets and plumb to Ansible https://review.openstack.org/446688 | 20:04 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Associate secrets with jobs https://review.openstack.org/446687 | 20:04 |
pabelanger | jeblair: I might need some help with putting vars into inventory, it looks like we need to update test/base.py to properly populate getHostList() | 20:13 |
pabelanger | right now vars.yaml looks to be the easier route :) | 20:14 |
pabelanger | Shrews: no objection, fixing | 20:14 |
jeblair | pabelanger: do you think that vars.yaml and inventory variables are equivalent? | 20:17 |
pabelanger | jeblair: good question, I don't really use inventory files to store variables. Maybe jlk is better to ask | 20:18 |
jeblair | pabelanger: i do not, and that's what i was trying to convey earlier | 20:18 |
pabelanger | Maybe I just haven't had a need for them until now | 20:18 |
jeblair | pabelanger: we have a bunch of hosts. we have metadata for them. ansible has a way of associating metadata with hosts. that seems the right way to do it. | 20:18 |
jeblair | pabelanger: if we stick the information in vars, then we have to cross-reference with some index, like the hostname | 20:19 |
pabelanger | jeblair: yes, I agree in that case inventory variables make sense. | 20:19 |
jeblair | pabelanger: so the vars would look like: "hosts: compute1: provider: rax" and then, the role would have to do something like "{{hosts.$fqdn.provider}}" i'm not sure what that actually looks like in jinja | 20:20 |
jeblair | pabelanger: but if we put it in the inventory, then the role just says "{{nodepool_provider}}" because it should be set already, if i'm following the docs | 20:21 |
pabelanger | jeblair: I am happy to continue down inventory variables, I just need some help with test/base.py. | 20:21 |
jeblair | (i'm sure all those specifics are wrong, but i think that would be the general process required) | 20:21 |
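The difference being discussed, sketched as executor-side code that might build an ini-style inventory line with per-host variables; the variable names follow the chat (nodepool_provider, etc.), while the node dict keys and exact format are assumptions. A role can then reference {{ nodepool_provider }} directly instead of indexing a shared vars structure by hostname.

```python
def make_inventory_line(node):
    # One line per host, with host variables inline in ini-style inventory.
    keyvals = {
        'ansible_host': node['interface_ip'],
        'nodepool_provider': node['provider'],
        'nodepool_region': node['region'],
        'nodepool_az': node['az'],
    }
    return '%s %s' % (
        node['name'],
        ' '.join('%s=%s' % (k, v) for k, v in sorted(keyvals.items())))


print(make_inventory_line(dict(
    name='ubuntu-xenial', interface_ip='192.0.2.10',
    provider='rax', region='dfw', az='nova')))
# ubuntu-xenial ansible_host=192.0.2.10 nodepool_az=nova nodepool_provider=rax nodepool_region=dfw
```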
pabelanger | currently getHostList() in test/base.py returns an empty hosts variable. Need some guidance on how to properly mock that information | 20:22
jeblair | pabelanger: i don't think getHostList in tests/base.py needs updating | 20:22 |
jeblair | pabelanger: that's just adding a local connection flag | 20:22 |
jeblair | pabelanger: it calls the real getHostList in executor/server.py | 20:23 |
pabelanger | jeblair: right, I was appending the nodepool_provider info alongside ansible_host, for getHostList() in executor/server.py | 20:23
jeblair | pabelanger: which is where the additional nodepool data should be added. but it will also need to be plumbed through from nodepool itself, which means in nodepool.py, and zk.py. possibly also the fake nodepool in tests/base.py (to add any missing attrs like az) | 20:24 |
pabelanger | because args['nodes'] is empty, the values are not set | 20:24 |
pabelanger | jeblair: mind if I show you what I have now? | 20:24 |
pabelanger | maybe I am on the wrong path | 20:24 |
jeblair | pabelanger: sure | 20:24 |
jeblair | pabelanger: though if nodes is empty, it's probably one of the many test cases which run jobs with no nodes | 20:25 |
pabelanger | maybe that is the problem | 20:25 |
jeblair | (that still needs to work, but obivously isn't going to exercise the code) | 20:25 |
jeblair | pabelanger: i think many of the tests in test_v3 use nodes | 20:26 |
jeblair | pabelanger: certainly the TestAnsible one | 20:28 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: WIP: Create nodepool inventory variables https://review.openstack.org/446727 | 20:28 |
pabelanger | jeblair: ^ | 20:28 |
pabelanger | I haven't added a test case yet | 20:28 |
pabelanger | but of the test_v3 tests I looked at | 20:29
pabelanger | inventory is set to 'localhost ansible_connection=local' | 20:29 |
pabelanger | which, could be because of what you just mentioned about nodes | 20:30 |
pabelanger | checking that now | 20:30 |
jeblair | pabelanger: oh yeah, the test gethostlist also appends a localhost entry, mostly so that if a test doesn't request any nodes, there's still something in the inventory. i think we can just ignore that. | 20:32 |
jeblair | pabelanger: i think what you're seeing is that you're on the right path, and just need to use a test that requests nodes, then you should see it | 20:33 |
pabelanger | yes, I think you are right | 20:33 |
pabelanger | let me do that | 20:33 |
pabelanger | jeblair: yup, that was it. Thanks! | 20:34 |
pabelanger | I'll add a test job now | 20:34 |
jeblair | pabelanger: can you add 'az' while you're at it? | 20:36 |
jeblair | Shrews: hrm, it looks like the zuul model is already set up to avoid clobbering unknown values. | 20:37 |
jeblair | Shrews: but i wonder if we missed a read, so we just wrote old values? | 20:38 |
pabelanger | jeblair: yes | 20:39 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Create nodepool inventory variables https://review.openstack.org/446727 | 21:02 |
pabelanger | jeblair: ready for comments | 21:03 |
jeblair | pabelanger: +2 | 21:04 |
pabelanger | jeblair: cool, sorry about the noise earlier about tests/base.py, I should have checked nodes were defined first | 21:05 |
jeblair | pabelanger: np. it's perhaps not immediately obvious that most of the tests don't actually request nodes. :) | 21:06 |
pabelanger | jeblair: right, is that a valid setup still? I remember something about how we might not want that to happen on the executor | 21:08
jeblair | pabelanger: i'm not sure. perhaps we will want to support jobs that only perform some action on the executor? | 21:09 |
pabelanger | jeblair: unsure also | 21:11 |
pabelanger | if we did that, what motivation would somebody have to run nodepool? | 21:11
Shrews | jeblair: hrm, then i have no immediate explanation | 21:14 |
pabelanger | clarkb: Shrews: either of you mind reviewing 446727? moves nodepool provider variables into a job inventory file | 21:16 |
clarkb | I can try to take a look in a few minutes (finishing up one thing) but then have to head out to kid's doctor appointment so depends on how long it takes me I guess | 21:16 |
Shrews | pabelanger: +2 | 21:25 |
pabelanger | clarkb: when you have time, no rush | 21:26 |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Handle the 'all' group in callback. https://review.openstack.org/446749 | 21:27 |
clarkb | it was simple, I approved | 21:27 |
clarkb | the "nice" thing about doctors appointment is my other thing (a tempest update) should have test results by the time I get back to a computer | 21:28 |
Shrews | jeblair: i think the AZ thing may either be a shade bug or a nova bug. Seems that only sometimes is an AZ value returned: http://paste.openstack.org/show/603028/ | 21:32 |
Shrews | jeblair: that could make our AZ grouping not too efficient | 21:33 |
Shrews | if "nova" is the same as "no AZ", we can't know that | 21:34 |
* Shrews places $$$ on nova bug | 21:34 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Create nodepool inventory variables https://review.openstack.org/446727 | 21:35 |
Shrews | heh, i think i lost $$$ | 21:38 |
clarkb | you dont have to have AZs iirc | 21:38 |
Shrews | mordred: i can see OS-EXT-AZ:availability_zone populated from 'openstack server show', but shade isn't always giving this data back in the location attr | 21:39 |
Shrews | You know what, I bet it's not populated right away by nova, so we need to check it again after waitForServer() | 21:44 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Set node AZ after we're done waiting for it. https://review.openstack.org/446753 | 21:47 |
Shrews | mordred: nevermind! above is a hopeful fix | 21:48 |
*** rattboi is now known as quartermaster | 21:48 | |
*** quartermaster is now known as rattboi | 21:57 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Serve public keys through webapp https://review.openstack.org/446756 | 21:58 |
jeblair | that's the extent of the cryptography related parts of the secrets facility. we still have some config-related changes to make for it. | 22:02 |
jhesketh | Morning | 22:04 |
*** yolanda has quit IRC | 22:16 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Decrypt secrets and plumb to Ansible https://review.openstack.org/446688 | 22:17 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Associate secrets with jobs https://review.openstack.org/446687 | 22:17 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Serve public keys through webapp https://review.openstack.org/446756 | 22:17 |
SpamapS | jeblair: what was the logical conclusion of the discussion around RSA-OAEP yesterday? | 22:35 |
* SpamapS vaguely remembers looking at gcrypt and then shaking in the corner | 22:36 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Remove irrelevant test_merging_queues https://review.openstack.org/446768 | 22:38 |
jeblair | SpamapS: i think it trailed off. but the way it's gelling in my mind is: RSA-OAEP gets us the ability to encrypt something up to a couple hundred bytes long. to do something longer we'd probably switch to some kind of hybrid scheme (pgp or rsa encrypted AES session key). that will produce longer ciphertext. i think we have 3 choices: | 22:39 |
jeblair | SpamapS: 1) stick with pkcs1 and accept the limits. (we can change our minds later). | 22:39 |
jeblair | SpamapS: 2) keep pkcs1 and also add a second system. maybe have the encryption utility automatically select it for you. | 22:40 |
jeblair | SpamapS: 3) drop pkcs1 for the hybrid system and accept that ciphertext blobs will be bigger. | 22:40 |
jeblair | SpamapS: i think that the framework in zuul will be substantially the same, and also, there is highly likely to be some RSA involved either way, so i decided to press on with implementation to spec for the moment. i think little work will end up being wasted if we change. | 22:41 |
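For concreteness, the size limit follows directly from the OAEP overhead: max plaintext = key_bytes - 2*hash_len - 2, so a 2048-bit key with SHA-1 allows 256 - 40 - 2 = 214 bytes, the "couple hundred bytes" mentioned above. A sketch with the cryptography library; the key size and hash choice here are assumptions, not what the spec settles on:

```python
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

private_key = rsa.generate_private_key(
    public_exponent=65537, key_size=2048, backend=default_backend())
public_key = private_key.public_key()

# 2048-bit key, SHA-1 OAEP: at most 256 - 2*20 - 2 = 214 plaintext bytes.
secret = b'a short credential or API key'
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA1()),
                    algorithm=hashes.SHA1(), label=None)

ciphertext = public_key.encrypt(secret, oaep)
plaintext = private_key.decrypt(ciphertext, oaep)
assert plaintext == secret
```

Anything longer than that limit is where the hybrid (RSA-wrapped AES key) schemes come in.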
jeblair | probably we need a good ML thread to hash out what we want. | 22:42 |
jeblair | fungi: ^ | 22:42 |
SpamapS | Always a good idea to go slow and listen well when playing with crypto. | 22:42 |
jeblair | ya | 22:42 |
SpamapS | jeblair: my only concern is that AFAICT, there's nothing to identify a ciphertext as pkcs1 other than attempting decryption, so changing our minds later may be complicated. | 22:44 |
fungi | SpamapS: the function is namespaced as a yaml datatype in the spec | 22:44 |
SpamapS | Though I suppose we can make a good guess since the data length is relatively static. | 22:44 |
jeblair | yeah, we annotate it with a yaml tag, so it's forward-compatible | 22:44 |
SpamapS | Ah, does that still work with safe_load? | 22:45 |
SpamapS | I've never used the feature. | 22:45 |
jeblair | SpamapS: yeah, you can annotate an object as being 'safe' | 22:45 |
fungi | ml thread sounds like a great idea, and also i agree the current plan makes it not particularly hard to swap out or even increase the algorithms supported | 22:45 |
SpamapS | jeblair: neat | 22:45 |
SpamapS | well yeah, then I'm for (1) | 22:45 |
jeblair | SpamapS: https://review.openstack.org/446156 has that bit | 22:45 |
jeblair | configloader.py | 22:46 |
jeblair | (due to metaclass magic, the class defn of EncryptedPKCS1 is all that's required to make pyyaml know to load that) | 22:46 |
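The metaclass magic is stock PyYAML: a yaml.YAMLObject subclass that sets yaml_tag and points yaml_loader at SafeLoader registers itself with safe_load. A sketch with an invented tag and a trivially stored ciphertext (the real class in 446156 will differ):

```python
import yaml


class EncryptedPKCS1(yaml.YAMLObject):
    # Registering against SafeLoader is what keeps safe_load happy.
    yaml_tag = '!encrypted/pkcs1'
    yaml_loader = yaml.SafeLoader

    def __init__(self, ciphertext):
        self.ciphertext = ciphertext

    @classmethod
    def from_yaml(cls, loader, node):
        # The scalar value is kept as-is; decryption happens elsewhere.
        return cls(node.value)


doc = yaml.safe_load("password: !encrypted/pkcs1 c2VjcmV0Y2lwaGVydGV4dA==")
print(type(doc['password']), doc['password'].ciphertext)
```

Because the tag travels with the data, a future scheme can simply register a second tag without breaking existing configs.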
fungi | pkcs#1 gets zuul the ability to handle typical passwords/passphrases and short-ish api keys. something else will be needed to pass along things like asymmetric x.509/rsa key material, i agree | 22:47 |
SpamapS | jeblair: fancy | 22:48 |
fungi | computers are both infuriating and amazing | 22:48 |
SpamapS | fungi: isn't there a somewhat standard encryption format that CSR's use? | 22:49 |
fungi | certificate signing requests aren't encrypted | 22:49 |
SpamapS | I thought they were, by a password, unless you decline to do so. | 22:50 |
fungi | you're probably thinking of pkcs#12 (a.k.a. "pfx") which is commonly used to encrypt x.509 key bundles, but it's symmetric | 22:51 |
fungi | "by a password" is a giveaway that it's symmetric | 22:51 |
SpamapS | Could be | 22:51 |
SpamapS | yeah | 22:51 |
SpamapS | the only arbitrary-length-data public key crypto I'm familiar with are both email standards: PGP and S/MIME | 22:52 |
SpamapS | Though I guess they're just "message" standards, not specifically email. | 22:52 |
fungi | pkcs#7 can, but it's also the underpinning of s/mime | 22:54 |
* fungi sometimes wishes he knew less about these things. shadows of a former career | 22:56 | |
SpamapS | Hey, I feel the same about flipping burgers. ;) | 22:57 |
fungi | the two are alike in soooo many ways | 22:57 |
*** jamielennox is now known as jamielennox|away | 23:03 | |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Set filter according to PR/Change in URL https://review.openstack.org/446782 | 23:05 |
*** jamielennox|away is now known as jamielennox | 23:05 | |
SpamapS | jeblair: so.. stuck jobs.. | 23:21 |
SpamapS | jeblair: I'm looking at how to re-enable that test, and I wonder if we can still get into the situation that it artificially creates. | 23:21 |
jeblair | SpamapS: yeah, something about the word 'stuck' and all the hullabaloo about gearman keeps distracting me from what i think is its main purpose, which is in the comment | 23:23
jeblair | or, rather the docstring | 23:23 |
jeblair | "Test that pending jobs are cleaned up if removed from layout" | 23:23 |
SpamapS | jeblair: the way it works now.. if I remove a job that is in the queue, the scheduler correctly cancels it. | 23:24 |
SpamapS | 2017-03-16 16:19:10,115 zuul.DependentPipelineManager ERROR Unable to find change queue for project org/project | 23:24 |
SpamapS | 2017-03-16 16:19:10,115 zuul.Scheduler WARNING Canceling build <Build 7a09454c0cd34b988de9e276f8712c64 of gate-noop on <Worker Unknown>> during reconfiguration | 23:24 |
SpamapS | that does not end up in history tho | 23:25 |
SpamapS | vs. before, when it did. | 23:25 |
jeblair | hrm. i agree that behavior matches the docstring. i may not fully understand the mechanics of the test as applied to v2... lemme dig further. | 23:26 |
SpamapS | jeblair: so perhaps it's ok to just run that scenario and assert that there's no history? | 23:26 |
jeblair | SpamapS: maybe? | 23:26 |
SpamapS | let me push up what I have and we can see the same code at least | 23:26 |
SpamapS | jeblair: I may be testing something else with the way I refactored the test. | 23:28 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Re-enable test_stuck_job_cleanup https://review.openstack.org/446783 | 23:29 |
SpamapS | there's code | 23:29 |
jlk | Did the zuulv3 install have a hiccup? | 23:30 |
jlk | my change just got a bunch of RETRY_LIMIT, and it doesn't seem to show up at all in review. | 23:30 |
jlk | (other than the vote) | 23:30
jeblair | jlk: yeah, playbooks broken atm | 23:30 |
jlk | okay. | 23:30 |
jeblair | jlk: pabelanger is fixing it, but we'll have some false -1s for a while | 23:30 |
jeblair | SpamapS: ah, it's testing that the gearman job is removed from the gearman queue | 23:31 |
jeblair | SpamapS: the first config has it running 'project-merge'. *that* is the job that it's looking for in self.assertEqual(len(self.gearman_server.getQueue()), 1) | 23:31 |
jeblair | SpamapS: (it doesn't let the job run, just lets the launcher enqueue it) | 23:32 |
pabelanger | jeblair: jlk: we need to restart zuulv3-dev.o.o to pick up inventory variables first | 23:32 |
jeblair | SpamapS: and the final assertion is checking that project-merge still didn't actually run. | 23:32 |
jeblair | pabelanger: oh has that merged? | 23:32 |
SpamapS | jeblair: ah, so it's just making sure the queue is made of the one thing that did run. | 23:33 |
pabelanger | jeblair: yes, I still need to update our playbooks however | 23:33 |
jeblair | SpamapS: well, completed history, but yes. | 23:33 |
Shrews | pabelanger: any chance you can review https://review.openstack.org/446753 ? very easy review. would like to restart n-l with it | 23:34 |
jeblair | SpamapS: so the key to that test is the hold_jobs_in_queue part. we want to keep doing that, and not set hold_jobs_in build. | 23:35 |
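A sketch of the test shape jeblair describes: hold jobs in the gearman queue, drop the job from the layout, then confirm it was cleaned out of the queue and never ran. The gearman_server calls are the ones quoted above; the change-enqueue and reconfiguration steps are illustrative, not the exact v3 test API.

```python
def test_stuck_job_cleanup(self):
    # Hold jobs in the gearman queue so nothing actually runs yet.
    self.gearman_server.hold_jobs_in_queue = True
    A = self.fake_gerrit.addFakeChange('org/project', 'master', 'A')
    self.fake_gerrit.addEvent(A.getPatchsetCreatedEvent(1))
    self.waitUntilSettled()
    self.assertEqual(len(self.gearman_server.getQueue()), 1)

    # Swap in a layout that no longer defines the queued job (however the
    # test framework does that), then reconfigure the scheduler.
    self.sched.reconfigure(self.config)
    self.waitUntilSettled()

    # Release whatever is left; the removed job must be gone from the
    # queue and must never show up in completed history.
    self.gearman_server.hold_jobs_in_queue = False
    self.gearman_server.release()
    self.waitUntilSettled()
    self.assertEqual(len(self.gearman_server.getQueue()), 0)
    self.assertNotIn('project-merge', [b.name for b in self.history])
```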
pabelanger | Shrews: +3 | 23:35 |
jeblair | SpamapS: pointed some things out in review | 23:37 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Set node AZ after we're done waiting for it. https://review.openstack.org/446753 | 23:38 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Replace ssh-keyscan with host keys from node https://review.openstack.org/446785 | 23:40 |
pabelanger | jeblair: ^ would be interested in comments when you are ready to review. Specifically, I made some changes to getHostList() return value | 23:41 |
Shrews | any objections to nodepool-launcher restart? | 23:42 |
pabelanger | none here, going to do the same for zuulv3-dev too | 23:42 |
pabelanger | jeblair: ^ | 23:42 |
jeblair | ++ | 23:42 |
Shrews | oh noes | 23:45 |
jeblair | pabelanger: lgtm (assuming what we get from nodepool is ready to splat into the known_hosts file) | 23:47 |
pabelanger | jeblair: I think so, I see the data in zookeeper using zk-shell | 23:48 |
* jeblair hopes Shrews is okay | 23:48 | |
Shrews | well | 23:49 |
jeblair | Shrews is well. that's good. | 23:49 |
jeblair | all i need to hear | 23:49 |
Shrews | something weird happened | 23:50 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Revert "Configure regional mirrors for our jobs" https://review.openstack.org/446786 | 23:51 |
Shrews | OpenStackCloudException: Multiple matches found for None | 23:51 |
Shrews | i know shade had a release today, and there were some changes to the name/id matching | 23:51 |
Shrews | so... maybe a bug there | 23:51 |
Shrews | but also, we should try to delete a node if the id is None | 23:52 |
jeblair | pabelanger: +3 | 23:52 |
Shrews | not try* | 23:52 |
pabelanger | jeblair: danke | 23:53 |
Shrews | jeblair: pabelanger: we need jamielennox's fix https://review.openstack.org/#/c/445299/ | 23:53 |
pabelanger | that will unblock us for now, I'll start refactoring our mirror logic tomorrow | 23:53
pabelanger | Shrews: jeblair: Did we want a test for that? Not sure I understand the issue enough | 23:54 |
jamielennox | pabelanger, Shrews: i'm not sure how to write a test for that one, there was nothing that really existed similar | 23:54 |
Shrews | pabelanger: i can try and work out a test tomorrow if we approve that tonight | 23:55 |
pabelanger | I am okay with that | 23:55 |
Shrews | jamielennox: we just need to make createServer() fail in the test. i think the framework for that might already exist | 23:56 |
pabelanger | +2, so jeblair can review | 23:56 |
jeblair | +3 | 23:56 |
jamielennox | Shrews: there was something like it but it was on a different thread, but i would need to double check for all the details | 23:56 |
jamielennox | like i think there was a test for create server failure, but i couldn't see a test that was actually running the cleanup thread | 23:57 |
Shrews | jamielennox: look at test_fail_request_on_launch_failure | 23:57 |
jamielennox | however my priority was making my stuff work at the time so i could have missed something | 23:57 |
Shrews | jamielennox: we'd just need to make sure the node is deleted properly. can probably do so in that same test | 23:58 |
jamielennox | Shrews: hmm, ok, i think that was what i was looking at, assume i just missed something | 23:59 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Don't try and delete nodes with no external_id https://review.openstack.org/445299 | 23:59 |