*** jiapei has quit IRC | 00:24 | |
*** _ari_ has quit IRC | 02:45 | |
*** rfolco has quit IRC | 04:08 | |
*** sshnaidm has joined #zuul | 05:10 | |
*** sshnaidm has quit IRC | 05:11 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: executor: enable zuul_return to update Ansible inventory https://review.openstack.org/590092 | 05:44 |
*** chkumar|off is now known as chkumar|ruck | 06:53 | |
*** hashar has joined #zuul | 07:10 | |
openstackgerrit | Simon Westphahl proposed openstack-infra/nodepool master: Cleanup of leaked resource for static driver https://review.openstack.org/600084 | 07:30 |
openstackgerrit | Simon Westphahl proposed openstack-infra/nodepool master: Implement liveness check for static nodes https://review.openstack.org/601513 | 07:30 |
*** jesusaur has quit IRC | 07:39 | |
*** jesusaur has joined #zuul | 07:44 | |
*** jpena|off is now known as jpena | 08:31 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/zuul master: Assure that status tooltip is displayed on entire row https://review.openstack.org/603504 | 09:05 |
rcarrillocruz | TIL about https://zuul-ci.org/docs/zuul/user/badges.html | 09:31 |
rcarrillocruz | nice! | 09:31 |
mordred | rcarrillocruz: \o/ | 10:36 |
*** jpena is now known as jpena|lunch | 11:20 | |
*** rfolco has joined #zuul | 12:17 | |
tobiash | mordred: do you know if nodepool can re-use floatingips instead of create/attach/detach/delete? | 12:17 |
tobiash | I have nodepool in its own tenant and don't need a cleanup of unused floatingips there | 12:17 |
tobiash | this currently slows down the openstacksdk task queue dramatically in my environment | 12:18 |
*** hashar is now known as hasharAway | 12:21 | |
mordred | tobiash: I don't believe we've implemented that option in nodepool. it's ... an unfortunately racey process | 12:23 |
tobiash | hrm, ok | 12:23 |
tobiash | so the workaround would be to create multiple providers to the same cloud? | 12:23 |
mordred | tobiash: the issue is that there is no way to atomically say "re-use an existing floating ip, else create one for me" | 12:23 |
tobiash | ah, ok | 12:24 |
*** jpena|lunch is now known as jpena | 12:24 | |
mordred | and since we do server creation in threads, we'd have to either have some sort of mutex somewhere to coordinate picking an available floating ip - or else handle conflict errors better | 12:24 |
mordred | that said - I think for your use case it could certainly be nicer to just re-use | 12:25 |
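To make the race mordred describes concrete, here is a minimal sketch of the non-atomic "re-use an existing floating IP, else create one" flow. The helpers passed in (list_unattached_fips, attach_fip) and the FipConflict exception are hypothetical placeholders, not openstacksdk API; create_floating_ip roughly matches the cloud-layer call, but treat the exact signature as an assumption.

```python
# Sketch only: shows why "re-use else create" is racy across launcher threads.
class FipConflict(Exception):
    """Placeholder: raised when another thread attached the floating IP first."""


def acquire_floating_ip(cloud, server, list_unattached_fips, attach_fip):
    # Another launcher thread can grab the same IP between this listing and
    # the attach below -- that window is the race being discussed.
    for fip in list_unattached_fips(cloud):
        try:
            attach_fip(cloud, server, fip)
            return fip
        except FipConflict:
            # Lost the race for this candidate; try the next one.
            continue
    # Nothing re-usable (or every candidate was stolen), so create a new one.
    return cloud.create_floating_ip(server=server)
```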
tobiash | mordred: is there a way to alternatively parallelize the task queue? | 12:27 |
tobiash | the problem is just that attach, create, delete take 1-2 seconds and block any other openstack request in that time | 12:27 |
tobiash | so if we could have a thread pool of say 5 parallel requests this issue would be solved for me probably too | 12:27 |
mordred | tobiash: it's a good point - I've actually been thinking about making the taskmanager be per-service too | 12:28 |
tobiash | just tested and launched 200 nodes at once using nodepool and task contention caused some boot timeouts | 12:28 |
mordred | so that we're ratelimiting on a per-service basis (which is where server-side limits usually sit anyway) | 12:29 |
tobiash | that would make this better but still 200 instances would then wait for the requests for 200 floating ips and 50 of them will timeout | 12:30 |
mordred | as for making the taskmanager have a thread pool - I think that would work, but we'd probably have to be more clever with how we track the rate limiting | 12:30 |
tobiash | the rate limiting is thought of as max x requests per second, right? | 12:30 |
mordred | yah. oh, I guess we could just track when a request enters the pool | 12:31 |
mordred | interestingly enough - we actually already have a version of this for swift object segments | 12:32 |
mordred | I wonder if we could just re-use it inside of the sdk Adapter class | 12:33 |
* mordred goes to ponder the possibility | 12:34 | |
mordred | tobiash: yes - I think we can make a reasonably simple patch to openstacksdk to make this happen | 12:36 |
tobiash | mordred: cool | 12:37 |
mordred | tobiash: remote: https://review.openstack.org/603739 Run all tasks through the threadpool | 12:43 |
mordred | tobiash: that's COMPLETELY untested | 12:44 |
tobiash | mordred: cool :) | 12:45 |
mordred | but I can't think of any reason it wouldn't work | 12:45 |
tobiash | mordred: btw, how do you handle releases in openstacksdk? | 12:47 |
tobiash | mordred: the 17.x releases are tied to the current openstack version? | 12:47 |
tobiash | mordred: so if we would want to have this and 602618 usable by nodepool what would be the best approach? | 12:48 |
*** samccann has joined #zuul | 12:58 | |
mordred | tobiash: we can cut a new release any time - and they're not tied to openstack version | 13:00 |
mordred | tobiash: in fact, we need to make one with clarkb's logging change in it for nodepool logs anyway | 13:00 |
tobiash | ah cool | 13:00 |
tobiash | mordred: I didn't quite understand your change | 13:00 |
tobiash | in case run_async is True the change is a noop right? | 13:01 |
tobiash | is run_async False when used by nodepool? | 13:01 |
mordred | tobiash: yes, that's right. the only thing that was calling run_async before was the swift object segment upload code | 13:01 |
tobiash | so that is a different task manager? | 13:02 |
mordred | it's the same task manager - it's a different behavior inside the task manager - if run_async is True in the task manager, then concurrent.futures is used and a Future object is returned | 13:02 |
tobiash | ah | 13:03 |
tobiash | thanks for the explanation :) | 13:03 |
mordred | sure! it's a deep dark section of the code | 13:03 |
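As a rough illustration of the run_async behavior mordred describes, a simplified sketch of the two code paths is below. This is not the actual openstacksdk TaskManager, just an assumed shape of the logic: async callers get a concurrent.futures Future, synchronous callers get the result directly.

```python
# Simplified sketch of the run_async flag discussed above -- illustrative only.
import concurrent.futures


class TaskManagerSketch:
    def __init__(self, workers=5):
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=workers)

    def submit_function(self, fn, *args, run_async=False, **kwargs):
        if run_async:
            # Async path: hand the work to the pool and return a Future
            # immediately (the swift object segment uploads use this path).
            return self._pool.submit(fn, *args, **kwargs)
        # Sync path: run in the calling thread and return the result directly.
        return fn(*args, **kwargs)
```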
mordred | speaking of ... | 13:03 |
mordred | Shrews: when you feel like staring at deep dark code - https://review.openstack.org/603739 and https://review.openstack.org/#/c/602085/ | 13:04 |
tobiash | mordred: is that thread pool configurable? | 13:08 |
tobiash | in terms of max parallel requests? | 13:08 |
tobiash | nevermind, found it | 13:09 |
mordred | tobiash: only by constructor parameter to TaskManager | 13:09 |
tobiash | that seems like a sane default and is probably not overwritten in nodepool | 13:09 |
mordred | yah. although if we wind up landing this patch to sdk - we should probably add a config option to nodepool to modify it | 13:14 |
mordred | I think it's probably fine to keep it that way for now though | 13:14 |
tobiash | yepp | 13:14 |
Shrews | mordred: about to head out for a checkup on my wrist, but can look when i return | 13:28 |
mordred | Shrews: have you considered replacing your wrist with an angry cat? | 13:28 |
Shrews | mordred: it's too early for you to have been drinking | 13:29 |
mordred | Shrews: I would have thought after 10 years you'd know me better than that :) | 13:30 |
*** annabelleB has joined #zuul | 13:56 | |
*** samccann has quit IRC | 14:04 | |
*** samccann has joined #zuul | 14:04 | |
clarkb | mordred: my sdk logging change? | 14:31 |
mordred | clarkb: yah | 14:31 |
*** gouthamr_ is now known as gouthamr | 14:35 | |
*** _ari_ has joined #zuul | 15:06 | |
*** hasharAway is now known as hashar | 15:30 | |
*** chkumar|ruck is now known as chkumar|off | 15:34 | |
openstackgerrit | Markus Hosch proposed openstack-infra/zuul master: Add support for authentication/STARTTLS to SMTP https://review.openstack.org/603833 | 15:42 |
openstackgerrit | Markus Hosch proposed openstack-infra/zuul master: Add support for authentication/STARTTLS to SMTP https://review.openstack.org/603833 | 15:44 |
openstackgerrit | Markus Hosch proposed openstack-infra/zuul master: Add support for authentication/STARTTLS to SMTP https://review.openstack.org/603833 | 15:46 |
*** pcaruana has joined #zuul | 15:47 | |
*** annabelleB has quit IRC | 15:50 | |
*** annabelleB has joined #zuul | 15:53 | |
*** annabelleB has quit IRC | 16:03 | |
tobiash | mordred: commented on https://review.openstack.org/603739 | 16:18 |
*** annabelleB has joined #zuul | 16:23 | |
*** pcaruana has quit IRC | 16:27 | |
mordred | tobiash: thanks! | 16:32 |
tobiash | mordred: I was brave and tried that in production ;) | 16:33 |
clarkb | tobiash: that is really brave :P | 16:33 |
tobiash | but the rollback was easy and it only caused some minutes delay | 16:33 |
mordred | wow! | 16:33 |
mordred | tobiash: hrm. that's ... interesting | 16:34 |
mordred | tobiash: but thank you very much - I'll have to dig in a little further I guess :) | 16:34 |
Shrews | hopefully our tests catch that | 16:36 |
tobiash | Shrews: the integration tests for sure | 16:37 |
tobiash | the unit tests: no | 16:37 |
tobiash | I ran the unit tests before I tried it :P | 16:37 |
tobiash | mordred: maybe do task_manager.wait_for_futures instead? | 16:38 |
tobiash | found that somewhere else in the code | 16:38 |
*** panda has joined #zuul | 16:38 | |
*** annabelleB has quit IRC | 16:50 | |
tobiash | Shrews: looks like I was wrong; right now only two non-voting jobs failed and all the other jobs are already green, except for two that are still running | 16:50 |
Shrews | well, looks like maybe test_get() failure might be related? http://logs.openstack.org/39/603739/1/check/openstacksdk-functional-devstack-magnum/f278e29/testr_results.html.gz | 16:54 |
Shrews | it at least references the new code | 16:55 |
*** annabelleB has joined #zuul | 16:59 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/zuul master: Assure that status tooltip is displayed on entire row https://review.openstack.org/603504 | 16:59 |
tobiash | Shrews: does nodepool override the task manager of openstacksdk?? | 17:00 |
tobiash | it overrides the submit_task method, which is used within the openstacksdk taskmanager to switch between async and not async | 17:01 |
tobiash | maybe the problem is not openstacksdk | 17:01 |
mordred | tobiash: I was just about to type that | 17:02 |
mordred | basically, the nodepool task_manager doesn't support the underlying async logic | 17:02 |
mordred | but fixing that is going to be a bit more work | 17:03 |
mordred | actually ... it MIGHT not be that terrible | 17:04 |
tobiash | mordred: it somehow looks like it re-implements more of the task manager than needed? | 17:07 |
mordred | yeah. well - it's the original task_manager and we've been refactoring things out of nodepool and into openstacksdk - I keep meaning to get the multithreaded one from nodepool also added to openstacksdk so that we can look at the whole picture in one file | 17:07 |
Shrews | mordred: didn't you do something around this in the past? | 17:08 |
mordred | yeah - I keep making baby steps | 17:09 |
Shrews | guess i'm thinking of https://review.openstack.org/#/c/574287/ | 17:09 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool master: Add support for async tasks https://review.openstack.org/603850 | 17:11 |
mordred | something like that ^^ in nodepool, combined with the sdk patch, perhaps | 17:12 |
mordred | of course, that patch screams "this needs a refactor" | 17:12 |
mordred | Shrews: it's also worth noting that dtantsur mentioned on https://review.openstack.org/#/c/574285/ that the task manager code is remarkably similar to what they did in futurist library | 17:13 |
mordred | and he'd be open to us adding a rate-limiting feature to that | 17:13 |
mordred | so maybe it's worth considering that if/when we decide to do any further refactoring of the taskmanager code | 17:14 |
clarkb | mordred: I might be overthinking this but if you are rate limiting does running async gain you anything? | 17:16 |
clarkb | execute wait execute wait can be done synchronously and is simpler to reason about | 17:17 |
mordred | clarkb: yah - because with rate limiting we're making sure we don't _start_ executing a particular task until the appropriate timeslice | 17:18 |
mordred | clarkb: so with the async it allows us to fire off the correct number of tasks for the timeslice but have them perform in parallel | 17:18 |
mordred | (we already do this for uploading swift large object segments - because the time it takes to upload a segment is much larger than the rate limit timeslice) | 17:19 |
clarkb | ah | 17:19 |
mordred | clarkb: this is going to be one of those where we spend a few weeks making sure we get the interactions right and the end result is going to be a tiny number of SLOC | 17:20 |
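A minimal sketch of the "rate limit when a task *starts*, then let it run in parallel" idea mordred explains above: submissions are released into a thread pool no faster than the configured rate, but once released they execute concurrently. The class and parameter names are assumptions, not the sdk or futurist code.

```python
# Sketch of rate-gated parallel submission -- illustrative, not the real code.
import concurrent.futures
import threading
import time


class RateGatedPool:
    def __init__(self, rate, workers=5):
        self._interval = 1.0 / rate
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=workers)

    def submit(self, fn, *args, **kwargs):
        # Reserve the next timeslice under the lock, then sleep outside it so
        # other submitters can line up behind us.
        with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_slot - now)
            self._next_slot = max(now, self._next_slot) + self._interval
        if wait:
            time.sleep(wait)  # don't *start* the task before its timeslice
        return self._pool.submit(fn, *args, **kwargs)
```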
tobiash | mordred: I'll try this out now | 17:21 |
tobiash | (this time in dev instance...) | 17:21 |
clarkb | mordred: did leave a comment on the nodepool change | 17:23 |
clarkb | I think there is a small bug | 17:23 |
clarkb | tobiash: ^ fyi | 17:23 |
mordred | clarkb: yes - you are very right - although I don't think putting the put into the else will work ... | 17:24 |
mordred | let me try again | 17:24 |
mordred | oh - it's even easier | 17:26 |
*** jpena is now known as jpena|off | 17:30 | |
mordred | ok. when I said "easier" ... I lied | 17:30 |
*** annabelleB has quit IRC | 18:17 | |
*** annabelleB has joined #zuul | 18:21 | |
Shrews | tobiash: my brain must not be working correctly because i'm not getting what a "leaked" static node is according to https://review.openstack.org/600084 (left a comment there). is there a real life example you could describe for me? | 18:22 |
AJaeger | anybody wants to +A a zuul-jobs change that has +2s already? https://review.openstack.org/#/c/602290/ fixes print node information | 18:23 |
tobiash | Shrews: I was on vacation at the time, but I think it was reducing the concurrency of a static node via config | 18:24 |
tobiash | Or removing a static node from config | 18:24 |
Shrews | tobiash: we have tests for decreasing max-parallel-jobs, which is why i'm confused i suppose | 18:24 |
tobiash | swest: ^ | 18:25 |
tobiash | There were at least problems on config changes and the need to do manual cleanup in zk | 18:26 |
Shrews | tobiash: swest: if it's removing a static node from the config, then i don't think that leak test is valid because it shares a hostname with an already registered node. in fact, if i change the test to use a different hostname, it fails | 18:26 |
Shrews | (hostnames assumed to be unique across nodes) | 18:26 |
Shrews | maybe we should use a different term than "leak". In openstack world, this is an instance that exists that we do not have recorded in ZK. This test seems to be the opposite of that | 18:32 |
Shrews | possibly also a source of my confusion :) | 18:34 |
tobiash | Shrews: do we remove znodes only on config changes and unlocks? | 18:34 |
tobiash | What happens if the config change is during a restart (and thus unnoticed by nodepool)? | 18:35 |
Shrews | we remove znodes (for the static driver) if we decrease parallelism or remove a config entry. not sure what you mean about unlocks | 18:35 |
tobiash | With unlocks I meant finished jobs | 18:35 |
Shrews | finished jobs should cause the node to be deleted which triggers the nodeDeletedNotification() where we re-register it | 18:37 |
Shrews | i *think* a config change across restarts gets consolidated at startup, but i'm not 100% sure | 18:37 |
Shrews | it obviously *should* | 18:38 |
tobiash | Yes, the question is what happens if we decrease parallelism during a restart | 18:38 |
Shrews | that question is totally not obvious anywhere in that stack :) | 18:38 |
Shrews | but i agree we should investigate it | 18:38 |
tobiash | I think that might be the thing swest means by leaked (z)nodes | 18:39 |
Shrews | but if that's the problem we are solving, let's not call it a leak | 18:39 |
Shrews | term overloading :) | 18:39 |
tobiash | I'm always open for better naming :) | 18:40 |
Shrews | how about "restart config consolidation"? | 18:42 |
Shrews | or other words | 18:42 |
Shrews | nargle sniping | 18:42 |
Shrews | rabbit wrangling | 18:42 |
tobiash | Or just config consolidation? | 18:42 |
Shrews | well, we always consolidate the config on config *changes*. not sure we are testing across *restarts* correctly | 18:43 |
tobiash | Good point | 18:44 |
Shrews | it's actually good that's being brought up because i don't think i've ever considered that | 18:45 |
tobiash | Yeah, that driver is still new and we start using it more now | 18:46 |
Shrews | awesome | 18:46 |
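To make the "config consolidation across restarts" question concrete, here is a hedged sketch of what a startup pass for the static driver could look like: compare what the config declares against what is registered in ZooKeeper and flag anything no longer covered (e.g. max-parallel-jobs lowered, or a host removed while the launcher was down). All names below are hypothetical; this is not the actual nodepool static driver code.

```python
# Hypothetical startup consolidation for the static driver -- sketch only.
def find_stale_static_nodes(zk_nodes, config_nodes):
    # Build the set of (hostname, slot) pairs the current config allows,
    # assuming one registered node record per parallel job slot.
    allowed = set()
    for node in config_nodes:
        for slot in range(node["max-parallel-jobs"]):
            allowed.add((node["name"], slot))

    # Anything registered in ZooKeeper but not allowed by the config is
    # stale and should be deregistered rather than left behind.
    return [n for n in zk_nodes if (n["hostname"], n["slot"]) not in allowed]
```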
*** annabelleB has quit IRC | 19:15 | |
*** annabelleB has joined #zuul | 19:22 | |
*** hashar has quit IRC | 21:23 | |
mordred | tobiash, Shrews, clarkb: remote: https://review.openstack.org/603739 Run all tasks through the threadpool | 21:33 |
mordred | I took another stab - a bit more aggressive this time | 21:34 |
mordred | however, I think this time it should work: | 21:35 |
mordred | >>> c.compute.get('/') | 21:35 |
mordred | <Response [200]> | 21:35 |
mordred | >>> c.compute.get('/', run_async=True) | 21:35 |
mordred | <Future at 0x7fad1205ec18 state=running> | 21:35 |
clarkb | mordred: are some queries async by default? and if so, does nodepool still need to accommodate those? | 21:36 |
mordred | clarkb: nothing that nodepool directly touches | 21:36 |
mordred | clarkb: the swift segment uploads are async - but they're internal to sdk | 21:37 |
clarkb | right nodepool says upload image, and that is synchronous from its perspective | 21:37 |
mordred | yup | 21:37 |
clarkb | sdk can upload a bunch of bytes async under that | 21:37 |
mordred | yah. and with this change the same should be true for create_server | 21:37 |
mordred | from each launcher thread's perspective it should look synchronous | 21:38 |
mordred | but behind the scenes the task manager should be able to service those tasks async in parallel, while never starting more tasks than the rate limit allows | 21:38 |
mordred | DEFINITELY double-check me on that logic though | 21:38 |
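For illustration, the caller-side view of the pattern mordred describes might look roughly like this: the launcher thread hands create_server to a rate-gated task manager and simply blocks on the Future, so the call appears synchronous even though the pool services many such tasks in parallel underneath. The task_manager.submit name is a stand-in, not the exact nodepool API.

```python
# Sketch of the "looks synchronous to the launcher thread" facade.
def launch_node(task_manager, conn, **server_args):
    future = task_manager.submit(conn.create_server, **server_args)
    return future.result()  # blocks only this launcher thread
```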
*** samccann has quit IRC | 21:43 | |
SpamapS | One of the users on my team at GoDaddy ran in to a really annoying problem this week. If you have syntax errors in your ansible... it just exits 1 and you get 0 in the logs. | 22:12 |
SpamapS | Would be quite useful if those errors from ansible were sent back to the user. | 22:13 |
clarkb | SpamapS: I want to say those show up in the console log? | 22:13 |
clarkb | I know I've seen them before. notmyname had one at the PTG and didn't require infra-root to debug it (he found the OSA table and they helped him) | 22:13 |
SpamapS | They do not in ours. The problem we had was that some directories were being moved around and so there were missing roles. | 22:16 |
SpamapS | We just got "RUN starting.." and then "RUN ENDED FAILURE" | 22:17 |
clarkb | SpamapS: I'm trying to get an example of notmyname's error so that we have that info too | 22:17 |
SpamapS | the executor log had the ansible errors in it | 22:17 |
clarkb | SpamapS: http://logs.openstack.org/86/601686/32/check/swift-multinode-rolling-upgrade/4ed9236/job-output.txt.gz#_2018-09-14_19_36_25_291563 | 22:29 |
clarkb | SpamapS: I would sort of expect all the ansible errors to be bubbled up like that? maybe it is a stdout vs stderr thing | 22:29 |
clarkb | in this case it was an actual error in the yaml syntax itself | 22:30 |
clarkb | rather than a higher level ansible error | 22:30 |
clarkb | http://paste.openstack.org/show/730388/ that might be a bit more permanent | 22:34 |
clarkb | SpamapS: in your case the playbooks referenced roles that were not included as required roles? | 22:36 |
*** dkehn has joined #zuul | 22:46 | |
SpamapS | clarkb: the roles are local to the playbook | 23:11 |
SpamapS | clarkb: so they're in the implied roles path, not an explicit one. | 23:11 |
*** annabelleB has quit IRC | 23:15 | |
*** dkehn has quit IRC | 23:17 | |
* SpamapS has now created a `zuul-local` command which does a reasonably decent job of running playbooks designed for zuul without zuul. | 23:18 | |
SpamapS | starting to think though that it might make more sense to make that an entry point in zuul itself that has access to all of zuul's parsers | 23:18 |
SpamapS | (right now I have to recreate the job with command-line args) | 23:18 |
clarkb | SpamapS: could also use executor bits to keep things lined up | 23:19 |
SpamapS | So, that was my original approach | 23:19 |
SpamapS | but that runs into need for a lot of zuul objects | 23:20 |
SpamapS | and thus, I think the approach I have now is a dead end | 23:20 |
SpamapS | I definitely am using it right now for testing playbooks that are really hard to iterate on otherwise... | 23:20 |
SpamapS | but basically what I want is a way to just run a single CLI command that does scheduler, merger, and executor's job all at once, locally | 23:20 |
SpamapS | call it "zuul-byovm" | 23:21 |
SpamapS | as in, bring your own vm | 23:21 |
clarkb | iirc the idea was to use the zuul api to get the job config bits and then inject them into a local executor thing | 23:21 |
SpamapS | something like `zuul-byovm --tenant Foo --project bar -c zuul.conf --inventory 1.2.3.4, --user ubuntu --job test-my-thing` | 23:22 |
pabelanger | SpamapS: agree, there are still use cases where only a zuul admin can see ansible failures; a wrong path for a role is one I know of. | 23:22 |
SpamapS | It would parse out the job, run all the playbooks in the order they need to be, and the only requirement is you need a local blob of any secrets. | 23:22 |
SpamapS | Right now I have that but I have to assemble a bunch of things.. | 23:23 |
SpamapS | like for instance, I can't just add -M zuul/ansible/library, because the command replacement in there expects to magically have zuul_log_id added to its arguments | 23:23 |
SpamapS | so I have a dir with symlinks to just zuul_console and zuul_return | 23:23 |
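A rough sketch of what the proposed entry point could look like, using the flags SpamapS lists above. Everything here is hypothetical since the tool is only being proposed: the command name, the options, and especially resolve_job_playbooks, which in a real implementation would need Zuul's own configuration parsers.

```python
#!/usr/bin/env python3
# Hypothetical "zuul-byovm" sketch: parse the proposed flags and run the
# job's playbooks with ansible-playbook against a user-supplied VM.
import argparse
import subprocess


def resolve_job_playbooks(config, tenant, project, job):
    # Placeholder: a real implementation would load the tenant config and
    # return the job's pre-run, run and post-run playbooks in order.
    return ["playbooks/pre.yaml", "playbooks/run.yaml", "playbooks/post.yaml"]


def main():
    parser = argparse.ArgumentParser(prog="zuul-byovm")
    parser.add_argument("--tenant", required=True)
    parser.add_argument("--project", required=True)
    parser.add_argument("-c", "--config", default="zuul.conf")
    parser.add_argument("--inventory", required=True,
                        help="ansible inventory, e.g. '1.2.3.4,'")
    parser.add_argument("--user", default="ubuntu")
    parser.add_argument("--job", required=True)
    args = parser.parse_args()

    playbooks = resolve_job_playbooks(args.config, args.tenant,
                                      args.project, args.job)
    for playbook in playbooks:
        # Run each playbook in order against the given inventory.
        subprocess.run(["ansible-playbook", "-i", args.inventory,
                        "-u", args.user, playbook], check=True)


if __name__ == "__main__":
    main()
```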
SpamapS | anyway, I just think we can't keep doing the same thing with base jobs' pre and post playbooks | 23:25 |
SpamapS | it's damn near impossible to write them efficiently without something like this. | 23:25 |
SpamapS | They are far too complected with zuul to just be ansible-playbooked | 23:26 |
SpamapS | so, this is my POC, and it's making my life a berzillion times easier... | 23:26 |
SpamapS | I'm going to sit down and think through how we might do it in a way that is generic and useful for zuul users. | 23:26 |
SpamapS | but where they can sit and iterate on playbooks on their own vms | 23:27 |
SpamapS | also like, the assumptions about having a clean SSH agent ... so many things | 23:30 |
clarkb | SpamapS: you have to have copies of all repos to have complete config | 23:32 |
clarkb | which is why asking zuul for the config seems to be a good idea | 23:32 |
SpamapS | ehhh... yes and no | 23:36 |
SpamapS | yes I think that's the perfect solution | 23:36 |
SpamapS | no I don't think that's the only solution | 23:36 |
SpamapS | The config isn't actually the hard part to reason about. | 23:36 |
SpamapS | It's the specialized environment and all of the inputs to it. | 23:37 |
SpamapS | but yeah, if we could do that, neat. | 23:37 |
clarkb | you would be unable to know your config is complete otherwise | 23:39 |
clarkb | you could have users guess at it | 23:39 |
SpamapS | that's not actually a problem | 23:39 |
clarkb | and probably get it right most of the time | 23:39 |
clarkb | it is if reproducing upstream behavior is important | 23:40 |
SpamapS | reproducing perfectly isn't ... reproducing a reasonable facsimile is. | 23:40 |
SpamapS | But yeah, I *want* what you describe | 23:40 |
SpamapS | And I think we could definitely do that. | 23:41 |
SpamapS | but I don't mind assembling a bunch of git repos in a dir myself and just saying "use this dir as my tenant config of projects" | 23:42 |
SpamapS | which is what my janky thing does now | 23:42 |