mordred | jkt: yeah - in this case things are ... yuck | 00:02 |
jkt | mordred: I read git blame, and apparently you wrote this. Any reason why that code picks the first network rather than trying the router first? I see that it got added in 9986646a1, and it *looks* like a case of a DT cloud (?) where there was a hidden network | 00:02 |
mordred | jkt: you might also want to try setting routes_externally to False for the other networks (I know that's likely a pita, but it might help you here) | 00:02 |
jkt | wouldn't that break once my provider introduces even more networks that I'm not supposed to use? | 00:03 |
mordred | might do. one sec - lemme look at something | 00:04 |
mordred | jkt: you should set nat_source to true for the right network | 00:05 |
jkt | mordred: looks like that did the trick in my tiny python reproducer, thanks! Let's try it with nodepool now | 00:08 |
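For reference, the `nat_source` hint mordred suggests is set per network in the openstacksdk `clouds.yaml`; a minimal sketch might look like this (the cloud name, network names, and auth URL are placeholders, not taken from the discussion):

```yaml
# Sketch only: clouds.yaml fragment marking which network openstacksdk
# should treat as the NAT source. Names below are invented.
clouds:
  mycloud:
    auth:
      auth_url: https://keystone.example.org:5000
    networks:
      - name: internal-net        # the network attached to the router
        nat_source: true          # pick this one for NAT/source decisions
      - name: other-net
        routes_externally: false  # the alternative mordred mentioned above
```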
mordred | jkt: it's entirely possible that the right choice should be to start by looking for the network attached to the router that's attached to the network that the node is attached to ... I think at this point it's possible we don't start there simply because of history - and because the code in question tries to handle so many gross edge cases, doing a major logic inversion is ... unfun | 00:09 |
mordred | jkt: awesome! | 00:09 |
mordred | jkt: when it's not sunday night I'll try to think through what it would take to start with router logic | 00:09 |
jkt | mordred: yeah, I can understand why it's fishy -- apparently it's global state, and determining that global state based on just one random server sounds wrong | 00:10 |
mordred | yeah. | 00:11 |
mordred | jkt: if only we didn't have to hack around a bazillion different ways a deployer could do things :) | 00:11 |
jkt | :) | 00:12 |
jkt | mordred: it works (as in, VMs are created, nodepool is happy, reports correct IP addresses and they are active). Thanks a lot! | 00:27 |
jkt | now I wonder what am I doing wrong with my SSH keys :). Is nodepool using cloud-init with Zuul's pubkey, or is that a wrong assumption? | 00:27 |
clarkb | it doesn't by default. You can configure it to do so | 00:28 |
clarkb | by default it expects ssh to work (key baked into image) | 00:28 |
jkt | clarkb: tx; so it's either https://zuul-ci.org/docs/nodepool/configuration.html#attr-providers.[openstack].pools.labels.userdata or redoing my images, right? | 00:30 |
clarkb | it's pools.labels.key-name | 00:30 |
jkt | clarkb: ah!, wonder how I missed that. Thanks! | 00:31 |
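The `key-name` attribute clarkb points at is set per label in the nodepool configuration; a hypothetical fragment (provider, image, label, and keypair names are all invented here) could look like:

```yaml
# Hypothetical nodepool.yaml fragment; all names are placeholders.
providers:
  - name: mycloud
    driver: openstack
    cloud: mycloud
    pools:
      - name: main
        labels:
          - name: my-label
            cloud-image: my-image
            flavor-name: m1.small
            key-name: zuul-ssh-key   # keypair registered in the cloud
```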
*** threestrands has joined #zuul | 00:35 | |
*** swest has quit IRC | 01:11 | |
*** swest has joined #zuul | 01:27 | |
*** jhesketh has joined #zuul | 02:56 | |
*** bhavikdbavishi has joined #zuul | 03:46 | |
*** bhavikdbavishi1 has joined #zuul | 03:51 | |
*** bhavikdbavishi has quit IRC | 03:53 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 03:53 | |
*** evrardjp has quit IRC | 04:23 | |
*** bhavikdbavishi has quit IRC | 05:26 | |
*** bhavikdbavishi1 has joined #zuul | 05:26 | |
*** bhavikdbavishi has joined #zuul | 05:29 | |
*** bhavikdbavishi1 has quit IRC | 05:30 | |
*** evrardjp has joined #zuul | 05:41 | |
*** dpawlik has joined #zuul | 06:09 | |
*** saneax has joined #zuul | 06:15 | |
*** bhavikdbavishi has quit IRC | 06:47 | |
*** ysandeep|away is now known as ysandeep | 06:53 | |
*** ysandeep is now known as ysandeep|afk | 07:11 | |
*** rpittau|afk is now known as rpittau | 07:12 | |
*** harrymichal has joined #zuul | 07:13 | |
*** threestrands_ has joined #zuul | 07:18 | |
*** threestrands_ has quit IRC | 07:18 | |
*** threestrands has quit IRC | 07:21 | |
*** jcapitao has joined #zuul | 07:26 | |
*** bhavikdbavishi has joined #zuul | 07:49 | |
*** jpena|off is now known as jpena | 07:55 | |
*** ysandeep|afk is now known as ysandeep | 08:02 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:04 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:08 |
avass | AJaeger: re ^ | 08:17 |
AJaeger | thanks, avass | 08:18 |
avass | It's failing since ensure-tox installs it in ~/.local/bin, but that's not in PATH, so fetch-tox-output fails because it wasn't using the 'tox_executable' variable | 08:18 |
avass | but it looks like setting a variable at the job level doesn't let ansible override it later, so the test fails | 08:19 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:19 |
avass | I guess the tox roles should be tested by completely removing tox in a pre-run | 08:21 |
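The fix avass describes amounts to replacing a bare `tox` invocation with the cached fact; a rough sketch of the kind of task involved (not the actual fetch-tox-output source) would be:

```yaml
# Rough sketch only; the real role's tasks differ.
- name: Run tox using the cached executable fact
  command: "{{ tox_executable | default('tox') }} --version"
  # Without the variable this would be a plain 'tox', which fails when
  # ensure-tox installed it into ~/.local/bin and that is not in PATH.
```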
*** sshnaidm has joined #zuul | 08:22 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:25 |
AJaeger | fun ;/ | 08:25 |
*** tobias-urdin has joined #zuul | 08:27 | |
*** swest has quit IRC | 08:32 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:43 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:46 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Parallelize initial static node creation https://review.opendev.org/721205 | 09:01 |
*** tosky has joined #zuul | 09:01 | |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Parallelize initial static node synchronization https://review.opendev.org/721205 | 09:02 |
*** sgw has quit IRC | 09:09 | |
*** swest has joined #zuul | 09:11 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 09:19 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Parallelize initial static node synchronization https://review.opendev.org/721205 | 09:20 |
openstackgerrit | Albin Vass proposed zuul/zuul master: Enables whitelisting and configuring callbacks https://review.opendev.org/717260 | 09:28 |
*** ysandeep is now known as ysandeep|lunch | 09:34 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 09:47 |
avass | I'm very confused right now, the fetch-spinch-tarball behaves very strange for us | 09:55 |
avass | for some reason "dest: {{ zuul.executor.log_root }}/docs-html.tar.bz2" evaluates to something completely different, but only for the unarchive task | 09:56 |
avass | see: http://paste.openstack.org/show/792400/ | 09:56 |
avass | *fetch-sphinx-tarball :) | 09:56 |
avass | ^ i mean the "src:" field for unarchive, the "dest:" field is correct | 09:58 |
*** saneax has quit IRC | 10:02 | |
*** swest has quit IRC | 10:02 | |
*** swest has joined #zuul | 10:03 | |
*** swest has quit IRC | 10:04 | |
*** saneax has joined #zuul | 10:06 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Fail fetch-sphinx-tarball if no html exists https://review.opendev.org/721221 | 10:12 |
*** ysandeep|lunch is now known as ysandeep | 10:13 | |
*** rpittau is now known as rpittau|bbl | 10:15 | |
*** harrymichal has quit IRC | 10:21 | |
jkt | tristanC: do you happen to have an updated version of the runc nodepool handler? My naive backporting seems to have hit a problem, http://paste.openstack.org/show/792403/ (code at https://gerrit.cesnet.cz/plugins/gitiles/github/openstack-infra/nodepool/+log/refs/heads/cesnet/2020-04-19--runc) | 10:28 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Fail fetch-sphinx-tarball if no html exists https://review.opendev.org/721221 | 10:33 |
*** jcapitao is now known as jcapitao_lunch | 10:37 | |
jkt | tristanC: ok, that turned out to be easy enough, https://gerrit.cesnet.cz/plugins/gitiles/github/openstack-infra/nodepool/+/f51f5c52c2b37f0814807fa6745ce8313ab45d0d%5E%21/ if you ever need it in softwarefactory | 10:43 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use remote_src false for easier debugging https://review.opendev.org/721237 | 10:52 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src false https://review.opendev.org/721237 | 10:53 |
zbr | avass: can you please help me with https://review.opendev.org/#/c/681532/ ? | 10:58 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src false https://review.opendev.org/721237 | 10:58 |
avass | I guess we just couldn't handle bzip2 archives and the error message confused me | 11:00 |
avass | zbr: Oh, I'm not a maintainer of zuul, just zuul-jobs if it's a review you want :) | 11:01 |
zbr | avass: ah, i forgot that they do not fully overlap. | 11:03 |
zbr | i am watching your changes as we had overlapping work around tox. | 11:04 |
zbr | i started https://review.opendev.org/#/c/690057/ long time ago, and still rebasing :D | 11:05 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: tox: allow tox to be upgraded https://review.opendev.org/690057 | 11:09 |
zbr | i wonder if we can do anything to make the yml / yaml usage consistent; right now it seems random, even inside the same role. | 11:15 |
avass | huh, yeah I usually stick to .yaml | 11:19 |
*** jpena is now known as jpena|lunch | 11:27 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src true https://review.opendev.org/721237 | 11:28 |
avass | Other way around, whoops | 11:28 |
zbr | my impression is that most of the ansible world does the opposite; afaik only zuul prefers yaml. | 11:32 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml https://review.opendev.org/721245 | 11:48 |
AJaeger | like this? ^ | 11:48 |
*** dpawlik has quit IRC | 11:50 | |
*** dpawlik has joined #zuul | 11:50 | |
*** dpawlik has quit IRC | 11:52 | |
*** dpawlik has joined #zuul | 11:53 | |
*** dpawlik has quit IRC | 11:54 | |
*** dpawlik has joined #zuul | 11:54 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml https://review.opendev.org/721245 | 11:55 |
*** bhavikdbavishi has quit IRC | 11:59 | |
*** swest has joined #zuul | 12:01 | |
*** rpittau|bbl is now known as rpittau | 12:03 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: Do not keep owner of archived files https://review.opendev.org/721248 | 12:05 |
zbr | AJaeger: i am not a big fan of using bash in tox.ini, especially 'find', as it has portability issues, but it should work for the moment. | 12:06 |
zbr | until we end up with some files where the .yml extension is mandatory | 12:06 |
*** jcapitao_lunch is now known as jcapitao | 12:07 | |
AJaeger | then we can remove the test or do it differently | 12:08 |
*** rlandy has joined #zuul | 12:09 | |
tristanC | jkt: thanks. though we are about to remove runc from the next version of softwarefactory. | 12:14 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Detach head before reset https://review.opendev.org/720458 | 12:20 |
openstackgerrit | Jan Kubovy proposed zuul/zuul master: Mandatory Zookeeper connection for ZuulWeb https://review.opendev.org/721254 | 12:21 |
*** rfolco has joined #zuul | 12:22 | |
jkt | tristanC: understood; we're trying to get ourselves migrated to openstack anyway | 12:25 |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: URLTrigger driver time based https://review.opendev.org/635567 | 12:25 |
*** jpena|lunch is now known as jpena | 12:33 | |
*** Goneri has joined #zuul | 12:42 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists https://review.opendev.org/721258 | 12:42 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists https://review.opendev.org/721258 | 12:54 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists https://review.opendev.org/721258 | 12:58 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml https://review.opendev.org/721245 | 13:00 |
AJaeger | sorry, had to update test. mordred, avass, want to review later again, please? ^ | 13:00 |
avass | Ajaeger: huh, yeah good catch | 13:01 |
avass | or wait, what's the difference? | 13:02 |
avass | I guess grep -v doesn't give an errorcode? | 13:03 |
AJaeger | the test failed if there were no .yml files. | 13:03 |
AJaeger | my local tests weren't good enough ;( | 13:03 |
AJaeger | now tested better | 13:03 |
*** irclogbot_0 has quit IRC | 13:06 | |
*** irclogbot_3 has joined #zuul | 13:08 | |
tristanC | zuul-maint : the zuul-operator has quite a few changes waiting to be reviewed from https://review.opendev.org/718162 to https://review.opendev.org/720822 , please have a look. Thank you in advance! | 13:11 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists https://review.opendev.org/721258 | 13:11 |
openstackgerrit | Merged zuul/zuul-jobs master: Update Fedora to 31 https://review.opendev.org/717657 | 13:21 |
*** sgw has joined #zuul | 13:24 | |
openstackgerrit | Merged zuul/zuul-jobs master: Make ubuntu-plain jobs voting https://review.opendev.org/719701 | 13:25 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: Adds role: ensure-ansible https://review.opendev.org/721269 | 13:27 |
avass | :) | 13:27 |
avass | I wonder if we could make the ensure-tox a bit more generic, something like an ensure-<python executable package> | 13:30 |
fungi | i thought the idea was that it was really ensure-<executable name> (which could be provided by distro packages, python packages, et cetera) | 13:48 |
avass | fungi: yeah sure, but it feels like a lot of the python packages will look the same when installing them | 13:50 |
avass | fungi: so they could probably be ensure-ansible, ensure-tox etc, but they import a role that does all the work | 13:51 |
avass | or at least the common parts | 13:51 |
fungi | so long as the executable name and the package name are the same, maybe | 13:52 |
openstackgerrit | Jan Kubovy proposed zuul/zuul master: Driver event ingestion https://review.opendev.org/717299 | 13:53 |
avass | yeah | 13:53 |
bolg | zuul-maint: someone please -2 on https://review.opendev.org/c/721254 and https://review.opendev.org/c/717299 - breaking changes (requires Zookeeper for ZuulWeb) | 13:53 |
avass | it could also separate the installation into the generic role and setting the relevant "executable" facts in the ensure-<executable> role | 13:54 |
tobiash | bolg: done | 13:55 |
bolg | tobiash: thanks | 13:55 |
fungi | avass: also i notice the current implementation of ensure-tox always runs ensure-pip even before testing whether tox is already installed, which is probably slightly wrong (tox itself doesn't require any pip outside the virtualenvs it creates) | 13:57 |
avass | fungi: yeah that depends on if it finds an already existing installation or not, since it uses pip to install tox if it doesn't | 13:59 |
fungi | yeah, but the order of operations seems to be: 1. install pip, 2. test whether tox is installed, 3. use pip to install tox if missing | 14:00 |
fungi | 1 and 2 should probably be swapped around | 14:00 |
avass | why do we do that with bash anyway? | 14:00 |
-openstackstatus- NOTICE: Zuul is temporarily offline; service should be restored in about 15 minutes. | 14:00 | |
avass | I agree | 14:00 |
fungi | the bash is there to work out which version of the pip executable should be used | 14:00 |
avass | ah yeah | 14:01 |
fungi | since there are several names for it, and some python version ambiguity as to which goes with which major python series | 14:01 |
avass | shouldn't that check for 'pip2' and not 'pip' then? | 14:03 |
avass | or pip2 > pip > pip3 | 14:04 |
avass | if prefer_python2 | 14:04 |
fungi | it looks like the logic there assumes that if python2 is installed, then `pip` will relate to it | 14:06 |
fungi | not all platforms provide a `pip2` but i suppose it could be added if needed for one | 14:07 |
avass | it seems like a safer thing to check for | 14:09 |
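The selection order being debated can be modeled as a tiny helper; this is only an illustration of the "version-specific name first, then plain `pip`" idea, not the role's actual bash:

```python
import shutil

def pick_pip(prefer_python2=False):
    # Illustrative only: try the version-specific executable first,
    # then fall back to plain 'pip'; return None if nothing is found.
    candidates = ["pip2", "pip"] if prefer_python2 else ["pip3", "pip"]
    for name in candidates:
        if shutil.which(name):
            return name
    return None
```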
avass | gotta take a break, I'll probably check in later | 14:12 |
*** ysandeep is now known as ysandeep|afk | 14:15 | |
*** cdearborn has joined #zuul | 14:34 | |
jkt | is it a good idea to run one nodepool-builder and zuul-exec per cloud? | 14:42 |
jkt | (looks like our company-internal cloud is constrained by IPv4 quotas, and there's no VPNaaS or anything like that) | 14:43 |
mordred | jkt: fwiw - everything works fine ipv6-only ;) | 14:45 |
mordred | jkt: but that might be a good solution for you | 14:46 |
jkt | yeah, unfortunately I've been asking our operators for IPv6 support for more than a year so far :( | 14:47 |
jkt | we're a research org, our core business are advanced network services, we run our own DWDM, we're building optical ROADMs, my colleagues have been deploying IPv6 for years, but hey, this cloud of ours is IPv4 only </rant> | 14:48 |
openstackgerrit | Merged zuul/zuul-jobs master: Document output variables https://review.opendev.org/719704 | 14:51 |
mordred | jkt: it's the sad story of the modern world isn't it/ | 14:52 |
mordred | ? | 14:52 |
mordred | especially given how well ipv6 works and how many "complex" problems go away once it's there | 14:53 |
* jkt mummbles something about how trivial IPv6 was in 2014-2015 with nova-network and SLAAC from an external, HW-based L3 switch | 14:53 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Log alembic migrations by default https://review.opendev.org/721283 | 14:55 |
jkt | but apart from this, is it OK to have one or several nodepool-launchers + zuul-exec in the cloud directly? (perhaps in a different subproject for, eh, "API safety" so that these service VMs cannot kill themselves) | 14:55 |
openstackgerrit | Merged zuul/zuul-jobs master: Python roles: misc doc updates https://review.opendev.org/720111 | 14:56 |
corvus | mordred: i think https://review.opendev.org/721283 will make us happier; that won't affect the current opendev install, but we'll get it when we stop using our custom logging config (and that will help new users not using a custom logging config) | 14:56 |
corvus | jkt: yep, you'll need to use the executor region feature to route the jobs to the right executor | 14:57 |
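The executor region routing corvus mentions is configured in two places: a zone on the executor (in zuul.conf, something like `[executor] zone = mycloud-zone`) and a matching node attribute on the nodepool pool. A hypothetical sketch of the nodepool side, with key names quoted from memory (verify against the current docs before relying on them):

```yaml
# Hypothetical nodepool.yaml fragment: tag nodes from this pool so
# only executors in the matching zone run jobs on them.
pools:
  - name: main
    node-attributes:
      executor-zone: mycloud-zone   # must match the executor's zone
```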
jkt | btw, https://docs.openstack.org/infra/system-config/zuul.html says that there's no need for backups. I think one should still back up per-project keys for providing secrets | 14:57 |
corvus | jkt: that's a good point | 14:58 |
jkt | thanks | 15:02 |
*** dpawlik has quit IRC | 15:07 | |
*** bhavikdbavishi has joined #zuul | 15:09 | |
*** bhavikdbavishi1 has joined #zuul | 15:12 | |
*** bhavikdbavishi has quit IRC | 15:13 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 15:13 | |
dmsimard | hey all, long time no see o/ | 15:26 |
dmsimard | I'm back after some medical leave time off, I wouldn't say I'm healed/cured but doing better now | 15:27 |
clarkb | dmsimard: good morning/afternoon. Hope you are feeling well | 15:28 |
mordred | dmsimard: awesome! being better is better than not doing better! | 15:28 |
*** yolanda has joined #zuul | 15:28 | |
dmsimard | thanks, sorry about the radio silence :) | 15:29 |
dmsimard | I'd like to look and discuss what it means to upgrade ara in zuul and I figured it might be worth taking a step back and consider the current zuul-web implementation which is ara-ish | 15:30 |
corvus | dmsimard: welcome back! | 15:31 |
AJaeger | dmsimard: welcome back | 15:31 |
corvus | dmsimard: there is some work-in-progress to make the callback plugin mechanism more generic, so that other callbacks can be used, and we could eventually make ara less 'special' (so that whether to use ara and what version could be a deployer decision) | 15:32 |
dmsimard | corvus: ah, a generic callback interface was something I wanted to ask about -- I think making ara less special is fine | 15:33 |
corvus | dmsimard: and yeah, i'd love to be able to share the dashboard console stuff -- obviously the backend is very different at the moment, but the frontend is very similar, and if there's a way to share at least widgets, if not more, that'd be great :) | 15:33 |
corvus | dmsimard: https://review.opendev.org/717260 is the wip callback stuff tha avass is working on | 15:34 |
dmsimard | thanks, looking | 15:34 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml https://review.opendev.org/721245 | 15:45 |
-openstackstatus- NOTICE: Gerrit will be restarted to correct a misconfiguration which caused some git mirrors to have outdated references. | 15:48 | |
dmsimard | corvus: yeah, the approach makes sense and at first glance provides the necessary bits to configure ara -- I'll try to give it a spin | 15:51 |
*** ianychoi_ is now known as ianychoi | 15:53 | |
dmsimard | re: dashboard bits, I'm not a javascript person but I understand that react components can be built in a way to be shared/embedded so there is definitely an opportunity to collaborate there | 15:54 |
AJaeger | corvus, mordred, could either of you look at https://review.opendev.org/#/c/721221/ for zuul-jobs once you're done with firefighting, please? It makes the docs job fail if html is stored elsewhere and can't be found | 15:54 |
dmsimard | there is ara-web which is meant to use the new ara api but it's still very WIP | 15:54 |
dmsimard | fwiw, there is a simple javascript-less built-in interface shipping with the api server now: https://api.trunk.demo.recordsansible.org/ | 15:55 |
dmsimard | or in API version: https://api.trunk.demo.recordsansible.org/api/v1/playbooks | 15:55 |
tristanC | regarding ansible console reporting, it would be nice to consider nested playbook execution. It seems like once a project uses a custom plugin, the only way to run the code is to make the zuul job run another ansible-playbook command, which results in an opaque `run ansible-playbook` task in the reporting. | 15:58 |
tristanC | iiuc ara 1.x implements an API that multiple ansible-playbook commands could use, and iirc the zuul_stream may be adapted to support adding extra job-output.json files to the job console report | 16:00 |
clarkb | tristanC: I think opendev currently does not want that as part of its CD rollout so making it optional would be a good thing | 16:04 |
clarkb | (we want to keep the details of nested runs hidden until we can confirm they don't expose secrets) | 16:04 |
dmsimard | tristanC: in the context of nested ansible runs, if zuul's ansible and the nested ansible both use the ara callback and report to the same location, you could have both reports (zuul and nested) in a single place | 16:04 |
dmsimard | brb, lunch | 16:05 |
tristanC | clarkb: sure, i guess the only way to make this work would be for zuul to expose an ansible.cfg blob that the nested playbook would need to use (opt-in) to be part of the build report. | 16:07 |
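One concrete shape for that opt-in (purely illustrative; the server URL is a placeholder and this assumes an ara 1.x API server is reachable) would be pointing the nested run's ara callback at the same API server via environment variables:

```shell
# Illustrative only: wire a nested ansible-playbook run to an existing
# ara 1.x API server so its playbooks land in the same report.
export ANSIBLE_CALLBACK_PLUGINS="$(python3 -m ara.setup.callback_plugins)"
export ARA_API_CLIENT=http
export ARA_API_SERVER="https://ara.example.org"   # placeholder URL
ansible-playbook nested-playbook.yaml
```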
*** rpittau is now known as rpittau|afk | 16:07 | |
openstackgerrit | Merged zuul/zuul master: Log alembic migrations by default https://review.opendev.org/721283 | 16:09 |
mordred | tristanC: I've pondered writing an ansible_playbook module (similar to the puppet one we wrote) | 16:11 |
corvus | note that nested runs don't generally run on the executor, so the installation environment will be different; there may not be much to be gained by integrating with zuul's ansible or ansible.cfg there. maybe a better approach is to have a module like mordred suggests, and have easy flags to enable ara or other callbacks | 16:13 |
*** jcapitao has quit IRC | 16:14 | |
corvus | i think there's more integration that could happen on the other side: integrating ara or zuul callback reporting data together after the run. | 16:14 |
*** yolanda has quit IRC | 16:17 | |
tristanC | corvus: perhaps the nested callback could be configured to write its report to ~/zuul-output, and then reporting would load that file and present it on the build page? | 16:18 |
openstackgerrit | Merged zuul/zuul-jobs master: Fail fetch-sphinx-tarball if no html exists https://review.opendev.org/721221 | 16:19 |
corvus | tristanC: yeah, that sounds good | 16:19 |
tristanC | i mentioned a ansible.cfg blob because ara would likely need to know what is the ara-api endpoint location and how it should annotate the nested playbook to be attached to the build report | 16:20 |
mordred | yeah - so - the ara api endpoint is one of the things that potentially concerns me about integrating new ara with zuul (although it makes total sense as an approach for ara's main target audience) | 16:21 |
mordred | with that structure, it would require us to have each job run an ara service so that the callback could report - and then we'd still need to do $something to have that ara service export the data to json files that we could upload to log storage so that the web stuff could read them | 16:22 |
mordred | the second part should be easy enough - it's just json data it's looking for after all | 16:22 |
mordred | but the first part - requiring a running service endpoint - starts to get rather complicated | 16:22 |
mordred | but I might be missing something - admittedly I haven't played with it too much recently so there's likely a ton I don't know | 16:23 |
*** fdegir has joined #zuul | 16:25 | |
*** yolanda has joined #zuul | 16:29 | |
*** evrardjp has quit IRC | 16:35 | |
*** evrardjp has joined #zuul | 16:35 | |
tristanC | mordred: iiuc the proposed plan is to have an api per zuul, or per tenant, not per build | 16:43 |
mordred | tristanC: that ara would have to store _all_ of the playbook results though right? then a given job would need to get the logs of just its playbook runs | 16:49 |
*** kklimonda has quit IRC | 16:54 | |
dmsimard | mordred: the api server isn't required when recording data locally, there is an "offline" client implementation which spins up an ephemeral server in the background for the duration of the run | 16:54 |
*** kklimonda has joined #zuul | 16:55 | |
tristanC | mordred: it seems like it yes. iirc ara features label where we could attach the zuul build uuid to each playbook results | 16:55 |
dmsimard | let me get an example for the labels | 16:56 |
tristanC | dmsimard: when using the offline client, what would happen with the data at the end of the run? | 16:56 |
dmsimard | with some experimental zuul labels, example: https://api.demo.recordsansible.org/?label=zuul.executor:ze07.openstack.org | 16:58 |
*** sshnaidm is now known as sshnaidm|afk | 16:59 | |
dmsimard | tristanC: whatever we want to happen, I guess ? I mean, by default it's still a sqlite database unless the (offline) API server is set up to use a mysql database or something | 16:59 |
dmsimard | there is an implementation of the sqlite middleware in 1.x which is similar to the one in 0.x but provides the API on top of the reporting interface | 17:00 |
dmsimard | otherwise it has static html generation too | 17:00 |
dmsimard | labels can be specified on a playbook basis or through ansible.cfg like this: https://opendev.org/recordsansible/ara/src/branch/master/tests/zuul_metadata.yaml -> https://opendev.org/recordsansible/ara/src/branch/master/tests/basic.yaml#L108-L110 | 17:05 |
AJaeger | avass, mordred , could you review https://review.opendev.org/721245 again, please? Now the test works... | 17:15 |
*** yolanda has quit IRC | 17:26 | |
*** bhavikdbavishi has quit IRC | 17:28 | |
*** bhavikdbavishi has joined #zuul | 17:29 | |
*** portdirect has quit IRC | 17:30 | |
*** portdirect has joined #zuul | 17:30 | |
*** jpena is now known as jpena|off | 17:32 | |
*** evrardjp has quit IRC | 17:44 | |
*** evrardjp has joined #zuul | 17:49 | |
clarkb | I think we've got an OSF newsletter going out this week, are there any zuul updates to call out from the last few weeks? | 18:17 |
clarkb | fungi: avass correct, not all platforms provide `pip` or `pip2`, depending on which combo of things you've got | 18:18 |
clarkb | the safest thing seems to be: if you know you want a specific version, check for that first, then check if just `pip` is available, as that can be used to install into arbitrary python envs | 18:18 |
fungi | clarkb: well, except that the system-wide `pip` doesn't actually install into any python envs other than the system context or --user (~/.local) | 18:23 |
fungi | so it's not even necessary to have a system-wide pip available if tox was already installed through other means | 18:24 |
clarkb | fungi: yes, but I think we can't do better than that unless you are overriding the pip version | 18:24 |
clarkb | which the roles allow you to do, aiui | 18:24 |
clarkb | correct | 18:24 |
clarkb | you should set the value for the tox executable path if you've done that | 18:24 |
fungi | but the ensure-tox role currently installs pip first, then checks to see whether you already have tox | 18:24 |
clarkb | ah I see, that is a bug yes | 18:25 |
clarkb | but that is independent of checking for pip vs pip2 vs pip3 | 18:25 |
fungi | yes, i agree | 18:25 |
fungi | two different suggestions which came out of avass and i looking at the role | 18:25 |
fungi | but we should probably reevaluate those concerns in the context of zbr's ensure-tox overhaul too, i suspect | 18:26 |
clarkb | with pip vs pip2 vs pip3 we don't really have any good options. What I went with should be the least surprising thing: use the version-specific pip if available, else use plain pip, which can be version specific if you choose it to be | 18:26 |
*** yolanda has joined #zuul | 18:46 | |
*** yolanda has quit IRC | 18:46 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists https://review.opendev.org/721258 | 18:55 |
*** maxamillion has quit IRC | 19:08 | |
*** maxamillion has joined #zuul | 19:11 | |
*** bhavikdbavishi has quit IRC | 19:18 | |
fungi | clarkb: well, the actual selection order seems to be: if python2 is preferred try pip then pip3, if python3 is preferred try pip3 then pip | 19:23 |
clarkb | hrm they should've been the same but maybe I got that wrong? | 19:24 |
fungi | it's possible for some platforms (recent fedora with a transitional python2 installed maybe?) to have pip2 and pip3 but not pip | 19:24 |
fungi | i gather the intended logic in that role was to fall back to the non-preferred python version's pip executable | 19:25 |
clarkb | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-pip/tasks/main.yaml#L19-L22 its version specific first each time | 19:25 |
clarkb | that isn't selecting a version of pip, merely deciding if one exists that is appropriate | 19:25 |
fungi | clarkb: this is in ensure-tox | 19:25 |
clarkb | I see | 19:25 |
fungi | which was also part of avass's point, i think, that we reimplement this selection logic (differently) in multiple roles | 19:25 |
clarkb | gotcha I read it as being in ensure-pip | 19:26 |
clarkb | and fwiw the selection logic may be different than "does any valid pip exist" | 19:26 |
clarkb | I don't know if they should be the same or not without thinking about it more | 19:26 |
fungi | probably what we could do is have ensure-tox conditionally include ensure-pip when tox is not found, and then use pip to install tox | 19:27 |
fungi | that way we at least don't unnecessarily choose pip in different ways in different roles | 19:27 |
clarkb | ++ | 19:27 |
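The reordering fungi suggests could look roughly like this in role terms (a sketch only, not the actual ensure-tox tasks; the `tox_pip` variable is invented for illustration):

```yaml
# Sketch of the suggested ordering: check for tox first, and only
# include ensure-pip when tox is actually missing.
- name: Check whether tox is already available
  command: "{{ tox_executable | default('tox') }} --version"
  register: tox_check
  failed_when: false

- name: Install tox only when missing
  when: tox_check.rc != 0
  block:
    - name: Make pip available first
      include_role:
        name: ensure-pip
    - name: Install tox with pip
      command: "{{ tox_pip | default('pip3') }} install --user tox"
```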
*** jtanner has quit IRC | 19:42 | |
*** jtanner has joined #zuul | 19:43 | |
*** sgw has quit IRC | 21:09 | |
*** sgw has joined #zuul | 21:18 | |
clarkb | nodepoolians we've noticed some interesting behavior from nodepool recently where it takes many hours to fulfill node requests. This seems related to failed instance deletes in openstack clouds and quotas being reported inaccurately | 21:19 |
clarkb | I've tried to summarize what I've found in https://etherpad.opendev.org/p/lF1-vMNVtDqlEH1QwNAZ | 21:20 |
clarkb | I don't think nodepool is doing anything really wrong here other than perhaps trusting the info from cloud APIs too much | 21:20 |
clarkb | that said I've got some suggestions in the etherpad for how nodepool might handle the situation better to avoid user impact | 21:20 |
clarkb | tobiash: ^ have you seen this before in your usage? | 21:20 |
corvus | clarkb: well, right at the beginning, why did it accept a request when the predicted remaining quota was -1? | 21:23 |
clarkb | corvus: I think because that is how nodepool goes into a paused state | 21:23 |
corvus | clarkb: ack | 21:23 |
clarkb | basically you need a new request to initially transition into paused | 21:24 |
corvus | clarkb: why do you think it accepted request 706 when it still had 864 pending? | 21:26 |
corvus | clarkb: do you think it thought it had enough quota for both? | 21:28 |
clarkb | corvus: I think because for a short period of time around line 14 of my etherpad it thinks it has room to boot new instances | 21:28 |
clarkb | so it unpauses and grabs more but then those all fail due to quota errors (and we don't know that until nova responds to us) | 21:28 |
clarkb | we have a window from 15:48 to 15:58 roughly where we are unpaused and grabbing additional requests | 21:29 |
clarkb | the second request I've added into the etherpad is accepted at 15:53 so is within that window | 21:29 |
corvus | ah got it, line 29 is chronologically around line 17.5 | 21:29 |
clarkb | ya sorry I organized by request not chronologically | 21:30 |
tobiash | clarkb: reading backlog | 21:31 |
corvus | that's fine, this makes a lot of sense, but the comment on line 27 about being hours later isn't right, is it? | 21:31 |
corvus | clarkb: or do you mean it's still active many hours later | 21:31 |
clarkb | corvus: ya its still pending right now | 21:31 |
clarkb | or was 5 minutes ago | 21:31 |
clarkb | because we slowly iterate through the 3 node request failures | 21:31 |
corvus | clarkb: how about this alternative solution: we unlock (without rejecting) the newest pending node requests when we're paused (so that after requests #2 and #3... finish their launch attempts failing, we unlock them. only request #1 stays locked) | 21:34 |
corvus | basically, they get to go back on the pile for some other provider to grab them | 21:34 |
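The "put them back on the pile" idea can be sketched as follows. The `Req` shape (`.id`, `.unlock()`) is a hypothetical stand-in, not nodepool's real request API:

```python
def release_excess_requests(pending_requests):
    """Sketch of corvus's proposal above: when a provider pauses, keep
    only the oldest accepted-but-unfulfilled request locked and unlock
    the rest so other providers can grab them.  Requests are assumed to
    expose .id (monotonic, so lower means older) and .unlock()."""
    ordered = sorted(pending_requests, key=lambda r: r.id)
    for req in ordered[1:]:        # everything but the oldest
        req.unlock()               # back on the pile for another provider
    return ordered[:1]             # the one request this provider keeps
```

The effect is the "oops, I never should have accepted this one" move: only request #1 stays locked while the provider waits for quota.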
clarkb | corvus: I like that I was wondering if we should reject outright but that seems nicer :) | 21:35 |
corvus | it's sort of an "oops, i never should have accepted this one in the first place" move | 21:35 |
corvus | which, i think is close to reality | 21:35 |
corvus | clarkb: is there anything more we can learn about the quota situation? i'm not sure i see specifically what is incorrect here | 21:36 |
corvus | also, i need a refresher on 'pool' vs 'provider' quota | 21:36 |
clarkb | corvus: if you look at grafana for inap, http://grafana.openstack.org/d/ykvSNcImk/nodepool-inap?orgId=1&from=now-24h&to=now, we see that we've had a prolonged state with the majority of nodes in a deleting state | 21:37 |
clarkb | corvus: that implies to me that inap actually has all of those instances stuck in a deleting state, but when you ask the nova api for quota usage we get told that we have 157 instances available | 21:37 |
clarkb | let me try and reproduce that manually via bridge | 21:38 |
corvus | cause it says 'provider quota' is 0, but 'pool quota' is 156/7 | 21:38 |
clarkb | oh I missed that distinction | 21:40 |
clarkb | running manually we have ~188 instances in a nova list output and limits show --absolute says we are using 196 instances | 21:40 |
tobiash | pool quota should be the self constrained quota based on the pool config, provider quota is the quota of the cloud tenant | 21:40 |
clarkb | so that is actually looking reasonably accurate | 21:40 |
clarkb | corvus: I wonder if that is actually the underlying thing here | 21:41 |
clarkb | because if the pool quota doesn't show 156 instances we should never unpause | 21:41 |
clarkb | I had attributed it to openstack because we've seen it get quotas wrong before, but maybe this is a nodepool bug? | 21:41 |
tobiash | it looks to me that nodepool locks a request, then pauses and waits for resources to get free which doesn't happen because of hanging deletes? | 21:42 |
clarkb | tobiash: ya, but as corvus points out we still show predicted provider instances is 0 but then pool is 157 | 21:43 |
clarkb | and I think we decide based on the pool number we can unpause | 21:43 |
clarkb | despite the provider not having enough capacity there (and nodepool knowing that) | 21:43 |
corvus | tobiash: i think the new issue here is that it's happening for multiple node requests (not just one -- it's okay if just one takes a few minutes to unpause) | 21:43 |
corvus | clarkb: but line 15 in your etherpad indicates that we expect the provider to be able to accommodate the request | 21:46 |
corvus | clarkb: er, i guess the same line at line 30 is more relevant | 21:46 |
clarkb | corvus: I think instances: 0 means we don't have any room | 21:46 |
corvus | clarkb: but that's predicted remaining | 21:46 |
clarkb | oh I see | 21:47 |
corvus | so it's saying "this is the last request i can accept without hitting the quota limit" | 21:47 |
corvus | and then it hits the quota | 21:47 |
clarkb | gotcha so we are right on the line but actually about 150 instances away from it? | 21:47 |
tobiash | the provider quota is potentially inaccurate | 21:47 |
tobiash | so when it hits the provider quota it pauses and invalidates the quota cache | 21:48 |
clarkb | also if we are right at quota that number may be accurate, but since we have ~20 active node requests only one of them will win the race to get that free slot | 21:48 |
tobiash | but then it might already have taken multiple requests | 21:48 |
clarkb | I think the key here is to avoid adding more noderequests because it races quota usage | 21:48 |
clarkb | whereas if we have a single outstanding request when paused then we know we'll service that one with the next available resources? | 21:48 |
clarkb | corvus: ^ basically that means your idea may still be the proper fix here | 21:49 |
tobiash | when we hit the provider quota we hit it asynchronously (when we try to spawn the instance) | 21:49 |
tobiash | at that point in time nodepool might have already taken more than one request | 21:49 |
corvus | so you're saying that the quota usage is so inaccurate that we think we can accept, say 10 node requests when we can really accept 0, and all 10 get stuck in the paused state? | 21:49 |
clarkb | tobiash: yup its got like 20 noderequests active | 21:49 |
clarkb | corvus: probably more like 9/10 get stuck in paused state | 21:49 |
tobiash | potentially | 21:49 |
clarkb | corvus: it is likely that one of them is succeeding since there was probably some quota available | 21:50 |
tobiash | do you run multiple providers against the same cloud? | 21:50 |
clarkb | tobiash: not in this cloud | 21:50 |
clarkb | (others do though) | 21:50 |
corvus | i still wish i understood what was wrong with the quota; like how did it get so far out of sync | 21:50 |
clarkb | normally this is fine because deletes actually free up space | 21:51 |
clarkb | corvus: maybe the predicted value isn't considering the other active requests properly? | 21:51 |
clarkb | say we have 10 requests for 1 node each. If we have 1 free instance quota available do we predict we can fulfill all 10 of those? | 21:51 |
corvus | clarkb: maybe, but i'm pretty sure nothing would work if that were the case... | 21:51 |
tobiash | can it be that nova lies about the quota? | 21:52 |
corvus | we only accept one node request at a time. we process them in parallel | 21:52 |
clarkb | tobiash: it could be, though I'm not seeing that right this moment | 21:52 |
corvus | so it'd be pretty hard for us to get that part wrong. | 21:52 |
corvus | clarkb: didn't it look like it was reporting we used 196 instances when we really have 188? granted, that's wrong in the wrong direction, but still... | 21:53 |
tobiash | we saw in our clouds that nova quota query returned say 20 cpus left and spawn still said not enough quota | 21:53 |
clarkb | another thought. Is it possible that predicted quota assumes the deletes will succeed in a reasonable amount of time? | 21:53 |
clarkb | corvus: yup right now its 194 from nova list but 196 from instances used quota accounting | 21:53 |
corvus | clarkb: can you find and add "Provider quota for inap-mtl01" log lines to the etherpad? | 21:56 |
corvus | clarkb: same for "Available quota for inap-mtl01" | 21:56 |
clarkb | ya | 21:56 |
corvus | the first line is "what did nova limit just tell us", and the second is the same with unmanaged quota subtracted. | 21:57 |
corvus | we only periodically update that; so the most recent pair of those log lines before "Predicted remaining..." is what it's going to be working with | 21:58 |
corvus | clarkb: oh, it looks like 196 is the number of instances available, not used? | 21:59 |
tobiash | do you have unmanaged quota in that tenant? | 22:00 |
clarkb | tobiash: no | 22:00 |
clarkb | corvus: maybe? or maybe its the max value from quota limits and not the totalused value? | 22:00 |
clarkb | the limit and not usage maybe | 22:00 |
clarkb | (or maybe thats what you meant) | 22:00 |
corvus | clarkb: yes, nodepool is looking at the limit | 22:00 |
corvus | "Provider quota for inap-mtl01: {'compute': {'instances': 196, 'cores': inf, 'ram': inf}}" means "the limit for this account is 196 instances" | 22:01 |
corvus | "Available quota for inap-mtl01: {'compute': {'instances': 34, 'cores': inf, 'ram': inf}}" means "there are 162 instances of unknown origin in this account leaving us with room for 34 nodepool instances" | 22:02 |
corvus | tobiash: ^ is that right? | 22:02 |
tobiash | let me check | 22:02 |
tobiash | corvus: yes, that should be right | 22:04 |
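The two log lines corvus quotes can be reduced to one subtraction: the account limit from nova, minus instances that carry no nodepool metadata for this provider. A rough sketch, with dict field names as assumptions rather than the real SDK shapes:

```python
def available_quota(provider_limit, all_instances, provider_name):
    """Sketch of the accounting described above: instances whose
    metadata names this provider are nodepool-managed; everything else
    counts as unmanaged and is subtracted from the account limit to get
    the 'Available quota' figure."""
    unmanaged = sum(
        1 for inst in all_instances
        if inst.get("metadata", {}).get("nodepool_provider_name") != provider_name
    )
    return provider_limit - unmanaged
```

With the numbers from the log (limit 196, 162 unmanaged instances), this yields the 34 nodepool slots corvus mentions.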
clarkb | separately I'm going to try manually deleting one of the instances that nodepool is trying to delete to see why that is failing | 22:04 |
corvus | it gets the 162 number by listing all of the nova instances and checking to see if they have nodepool metadata. | 22:04 |
clarkb | corvus: hrm I wonder if I wasn't completely wrong about the metadata being inappropriately removed then | 22:05 |
corvus | so it may be that our stuck-in-deleting instances have lost their nodepool metadata (clarkb: can you check that?) | 22:05 |
clarkb | corvus: I can try! | 22:05 |
tobiash | corvus: that sounds at least plausible | 22:05 |
clarkb | corvus: is it possible that is related to the userdata base64 thing you discovered too? | 22:05 |
clarkb | where the clients don't return the data to us in some cases? | 22:05 |
tobiash | if the metadata vanishes it would result in such an effect | 22:05 |
corvus | clarkb: i wouldn't expect that | 22:06 |
corvus | clarkb: the userdata thing was a definite "we don't want to print this out on the cli" change | 22:06 |
corvus | i wouldn't expect it to disappear from sdk return values. but i haven't confirmed anything related to that | 22:06 |
clarkb | corvus: got it | 22:06 |
clarkb | also fwiw it seems that inap just managed to delete those instances and now we have lots of room :/ | 22:07 |
clarkb | this also means it is hard for me to check for metadata | 22:07 |
corvus | clarkb: oh did you delete them all? :) | 22:07 |
clarkb | corvus: I didn't :) | 22:07 |
clarkb | this was entirely the cloud itself catching up | 22:07 |
corvus | oh "neat" | 22:07 |
clarkb | I didn't attempt to delete a single one myself yet | 22:07 |
corvus | even if the metadata is gone, this system should still work | 22:07 |
corvus | they should just be counted as external-to-nodepool | 22:08 |
corvus | i mean, it might take one quota error to happen in order to cause a cache refresh, but after that, it should return to correct behavior | 22:08 |
clarkb | I'll take it as a TODO to check metadata if we catch it in this state again in the future (because that will be useful to know either way) | 22:08 |
corvus | so then the line "Predicted remaining provider quota" is 34 - what nodepool thinks its using - the new node requirements | 22:11 |
corvus | so essentially, we can deduce that at a point where nodepool said "predicted remaining provider quota: 0 instances" it thought it had 33 instances under its control | 22:11 |
clarkb | why do we say 0 in that case? | 22:13 |
corvus | the predicted remaining is after accounting for the current request | 22:14 |
corvus | so nodepool saw 34 slots total for nodepool's use, 33 were used by nodepool, 1 was available, and it was looking at a request for 1 node | 22:15 |
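The arithmetic corvus walks through is just one more subtraction on top of the available figure. Names here are illustrative, not nodepool's:

```python
def predicted_remaining(available, nodepool_used, request_nodes):
    """Sketch of the 'Predicted remaining provider quota' line: slots
    left after subtracting nodes nodepool already controls and the
    nodes needed by the request currently being considered."""
    return available - nodepool_used - request_nodes
```

So 34 slots total, 33 in use, a 1-node request in hand gives "predicted remaining: 0 instances": accept this request, but nothing more fits.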
clarkb | got it | 22:15 |
corvus | clarkb: did anything happen between 15:48 and 15:58 for nodepool to think more room may be available? | 22:15 |
corvus | clarkb: cause now i'm wondering how, if it already was dealing with 864, why it would also accept 706 | 22:15 |
corvus | it's interesting that it refreshed the limits and external-to-nodepool limits between those times, but the values did not change | 22:17 |
corvus | so it must have thought that 2 slots opened up? | 22:17 |
clarkb | let me see | 22:17 |
*** tosky has quit IRC | 22:19 | |
clarkb | there is a lot of "cannot find provider pool for instance" in the logs around then | 22:20 |
clarkb | which from earlier debugging implies the instance's metadata was removed from nova metadata | 22:20 |
clarkb | (because we create phony instance data in zk to handle leaked nodes, but then those end up without metadata I think) | 22:21 |
corvus | clarkb: yep. that line indicates that instance should be counted in the 'external-to-nodepool' count | 22:21 |
clarkb | that might be something we can log more clearly to make this debugging easier in the future | 22:21 |
corvus | oh wait | 22:21 |
corvus | clarkb: can you paste one of those log lines? | 22:22 |
corvus | because if it's "Cannot find provider pool for node %s" then we actually aren't accounting for that node and that's worth digging into | 22:27 |
clarkb | corvus: I can fwiw it seems the 2020-04-20 15:48:52,257 DEBUG nodepool.driver.NodeRequestHandler[nl03-28001-PoolWorker.inap-mtl01-main]: [node_request: 300-0008521864] Retrying node request is the triggering thing and I can't find anything that triggers that | 22:27 |
clarkb | so maybe we have a timeout? | 22:27 |
clarkb | where we check in on a paused request status periodically? | 22:27 |
clarkb | 2020-04-20 15:48:52,604 WARNING nodepool.driver.openstack.OpenStackProvider: Cannot find provider pool for node <Node {'connection_port': 22, 'state_time': 1587391454.6109967, 'comment': None, 'launcher': None, 'pool': None, 'image_id': None, 'state': 'deleting', 'allocated_to': None, 'attributes': None, 'hostname': None, 'region': None, 'public_ipv4': None, 'cloud': None, 'host_id': None, 'id': '0016056699', | 22:28 |
clarkb | 'public_ipv6': None, 'external_id': '33ba9c42-379f-4a0e-8028-4e2db2a21ff5', 'python_path': None, 'connection_type': None, 'interface_ip': None, 'stat': ZnodeStat(czxid=85918431802, mzxid=85918431802, ctime=1587391430355, mtime=1587391430355, version=0, cversion=1, aversion=0, ephemeralOwner=0, dataLength=639, numChildren=1, pzxid=85918433437), 'private_ipv4': None, 'az': None, 'provider': 'inap-mtl01', | 22:28 |
clarkb | 'hold_job': None, 'username': None, 'ssh_port': 22, 'hold_expiration': None, 'resources': None, 'type': [], 'host_keys': [], 'created_time': 1587391430.3534877}> | 22:28 |
clarkb | oh sorry didn't realize it would get broken up like that | 22:28 |
clarkb | thats what the cannot find provider pool errors look like | 22:28 |
clarkb | 2020-04-20 12:12:01,790 INFO nodepool.DeletedNodeWorker: Deleting deleting instance 33ba9c42-379f-4a0e-8028-4e2db2a21ff5 from inap-mtl01 | 22:29 |
clarkb | 2020-04-20 12:12:01,840 INFO nodepool.NodeDeleter: Instance 33ba9c42-379f-4a0e-8028-4e2db2a21ff5 not found in provider inap-mtl01 | 22:29 |
corvus | clarkb: you say we made that znode because something leaked? | 22:29 |
clarkb | corvus: yes I believe so and I think the above explains it | 22:29 |
clarkb | basically we boot that instance it fails for some reason so we delete it. Then nova says that doesn't exist and we remove our znode for it | 22:30 |
corvus | but then it gets created? | 22:30 |
clarkb | then later we check for leaked instances and find 33ba9c42-379f-4a0e-8028-4e2db2a21ff5 and add a new znode with less info (no pool info for example) so that we can attempt to clean the leaked instance up | 22:30 |
clarkb | corvus: ya I think the trigger for those znodes is leak cleanup checking for instances it should delete. Any it finds it creates these phony znodes for with less info in them | 22:31 |
clarkb | (because its generated from lossy openstack api returns) | 22:31 |
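The lossy round-trip clarkb describes can be modeled in a few lines. This is a hypothetical mini-model of the leak-cleanup path, not nodepool's real code; the dict keys mirror the znode dump pasted below in the log:

```python
def record_leaked_instance(server):
    """Build the 'phony' znode record for a leaked server found via the
    cloud API.  Because the API response never carried pool information,
    'pool' comes back None, so per-pool quota accounting skips this
    node even though it still occupies real cloud capacity."""
    return {
        "external_id": server["id"],
        "provider": server["metadata"].get("nodepool_provider_name"),
        "pool": server["metadata"].get("nodepool_pool_name"),  # None if never stored
        "state": "deleting",
    }
```

That `None` pool is exactly what produces the "Cannot find provider pool for node" warnings quoted below.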
fungi | oh that's a fun race | 22:31 |
corvus | clarkb: huh. it seems like we need more metadata? | 22:31 |
corvus | like, why isn't there a pool entry in there? | 22:31 |
clarkb | corvus: because nova list doesn't tell us how the instance originated from nodepool | 22:32 |
corvus | (how is it possible there's enough metadata for nodepool to decide it's responsible for it, but not enough to account for it) | 22:32 |
clarkb | corvus: ya it's possible we aren't shipping enough data into metadata | 22:32 |
clarkb | let me find the code again | 22:32 |
corvus | clarkb: nodepool is only going to think it should delete an instance if there's enough metadata for it to recognize it as its own | 22:32 |
clarkb | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L159-L166 is the source of the error https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L532-L537 is where we create the phony record | 22:33 |
corvus | clarkb: seems like that second link should pull in the pool as well, yeah? | 22:34 |
corvus | then the instance should be properly accounted for? | 22:34 |
clarkb | corvus: ya I'm beginning to think that would be the case | 22:34 |
clarkb | so if we set that as part of the metadata we'd be able to write it back in again (probably cross check it against valid pools?) | 22:34 |
clarkb | oh we cross check at https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L159-L166 so simply setting the value may be sufficient | 22:35 |
clarkb | to summarize: we think nova reporting instances as nonexistent, then having them reappear later, leads to us not accounting for those instances when we later try to clean up leaked instances. This is because the pool information matters when calculating quotas and we lose it for leaked instances | 22:36 |
clarkb | we can make that quota calculation more accurate by setting the pool info on leaked instance phony records so they are accounted as belonging to the pool quota wise | 22:36 |
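The fix summarized here has two halves: stamp the pool name into the server metadata at launch, then copy it back onto the record created for a leaked instance. A sketch under assumed metadata key names (the real change is in nodepool's openstack provider, linked below):

```python
def launch_metadata(provider_name, pool_name):
    """Metadata written onto the server at boot; nodepool_pool_name is
    the new piece of information this fix adds."""
    return {
        "nodepool_provider_name": provider_name,
        "nodepool_pool_name": pool_name,
    }

def leaked_record_from_metadata(server_metadata):
    """Rebuild a deletion record for a leaked server.  With the pool
    name now present in metadata, quota accounting can attribute the
    instance to the right pool instead of dropping it."""
    return {
        "provider": server_metadata.get("nodepool_provider_name"),
        "pool": server_metadata.get("nodepool_pool_name"),
        "state": "deleting",
    }
```

With both pieces lined up, a leaked instance counts against its pool's quota and the provider never thinks slots have opened up that the cloud cannot actually provide.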
corvus | clarkb: the lack of the znode.pool attribute means the znode isn't being counted in the nodepool quota used, but if the 'nodepool_provider_name' metadata is on the instance, and matches this provider, then the instance is not going to be counted as unmanaged quota. | 22:38 |
clarkb | corvus: gotcha so we need both pieces of info to properly line up the quota calculations | 22:38 |
*** sanjayu_ has joined #zuul | 22:38 | |
corvus | clarkb, tobiash: ^ so if that sounds right, then yeah, i think the fix may be to just add the pool info back | 22:39 |
corvus | here: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L532-L537 | 22:39 |
*** saneax has quit IRC | 22:41 | |
clarkb | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L340 and there (in order to get it back later) | 22:42 |
corvus | oh it's not in the md in the first place. then yeah. | 22:45 |
corvus | that explains why we didn't copy it back :) | 22:45 |
clarkb | yup I have a change that outlines this | 22:45 |
clarkb | just writing a commit message now | 22:46 |
clarkb | I doubt it's correct but it should be enough for us to not forget this debugging session | 22:46 |
openstackgerrit | Clark Boylan proposed zuul/nodepool master: Set pool info on leaked instances https://review.opendev.org/721359 | 22:47 |
clarkb | corvus: tobiash ^ I think thats roughly what we want. My biggest concern right now is I'm not sure I lined up all the pool types properly (I think we have both name and dict versions of pools depending on context) | 22:48 |
*** sanjayu__ has joined #zuul | 22:50 | |
*** sanjayu_ has quit IRC | 22:52 | |
*** sanjayu__ has quit IRC | 22:52 | |
clarkb | ok I think the types are correct based on how we handle deleting a node later | 22:54 |
clarkb | (in the non leaked case) | 22:54 |
corvus | clarkb: comments | 22:55 |
clarkb | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/handler.py#L271-L286 | 22:55 |
clarkb | corvus: ah yup I thought it available then realized I had to pass it in then didn't update my use of it | 22:55 |
openstackgerrit | Clark Boylan proposed zuul/nodepool master: Set pool info on leaked instances https://review.opendev.org/721359 | 22:57 |
clarkb | corvus: ^ thanks! | 22:57 |
corvus | clarkb: thank you! tobiash ^ whenever it's a reasonable time for you again :) | 22:59 |
corvus | clarkb: maybe let's try this out for a bit and see if we avoid the situation; i'm not convinced we should implement the workaround we talked about earlier unless it's necessary (it would have masked this, and it's more churn than if nodepool actually has accurate quota) | 23:00 |
clarkb | ++ accurate quota would be best | 23:01 |
*** threestrands has joined #zuul | 23:18 | |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: helm-template: enable using values file https://review.opendev.org/721365 | 23:34 |