mordred | jkt: yeah - in this case things are ... yuck | 00:02 |
jkt | mordred: I read git blame, and apparently you wrote this. Any reason why that code picks the first network rather than trying the router first? I see that it got added in 9986646a1, and it *looks* like a case of a DT cloud (?) where there was a hidden network | 00:02 |
mordred | jkt: you might also want to try setting routes_externally to False for the other networks (I know that's likely a pita, but it might help you here) | 00:02 |
jkt | wouldn't that break once my provider introduces even more networks that I'm not supposed to use? | 00:03 |
mordred | might do. one sec - lemme look at something | 00:04 |
mordred | jkt: you should set nat_source to true for the right network | 00:05 |
jkt | mordred: looks like that did the trick in my tiny python reproducer, thanks! Let's try it with nodepool now | 00:08 |
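For reference, the `nat_source` hint mordred suggests is set per network in the openstacksdk `clouds.yaml`; a minimal sketch might look like this (the cloud name, network names, and auth URL are placeholders, not taken from the discussion):

```yaml
# Sketch only: clouds.yaml fragment marking which network openstacksdk
# should treat as the NAT source. Names below are invented.
clouds:
  mycloud:
    auth:
      auth_url: https://keystone.example.org:5000
    networks:
      - name: internal-net        # the network attached to the router
        nat_source: true          # pick this one for NAT/source decisions
      - name: other-net
        routes_externally: false  # the alternative mordred mentioned above
```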
mordred | jkt: it's entirely possible that the right choice should be to start by looking for the network attached to the router that's attached to the network that the node is attached to ... I think at this point it's possible we don't start there simply because of history - and because the code in question tries to handle so many gross edge cases, doing a major logic inversion is ... unfun | 00:09 |
mordred | jkt: awesome! | 00:09 |
mordred | jkt: when it's not sunday night I'll try to think through what it would take to start with router logic | 00:09 |
jkt | mordred: yeah, I can understand why it's fishy -- apparently it's global state, and determining that global state based on just one random server sounds wrong | 00:10 |
mordred | yeah. | 00:11 |
mordred | jkt: if only we didn't have to hack around a bazillion different ways a deployer could do things :) | 00:11 |
jkt | :) | 00:12 |
jkt | mordred: it works (as in, VMs are created, nodepool is happy, reports correct IP addresses and they are active). Thanks a lot! | 00:27 |
jkt | now I wonder what am I doing wrong with my SSH keys :). Is nodepool using cloud-init with Zuul's pubkey, or is that a wrong assumption? | 00:27 |
clarkb | it doesn't by default. You can configure it to do so | 00:28 |
clarkb | by default it expects ssh to work (key baked into image) | 00:28 |
jkt | clarkb: tx; so it's either https://zuul-ci.org/docs/nodepool/configuration.html#attr-providers.[openstack].pools.labels.userdata or redoing my images, right? | 00:30 |
clarkb | it's pools.labels.key-name | 00:30 |
jkt | clarkb: ah!, wonder how I missed that. Thanks! | 00:31 |
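The `key-name` attribute clarkb points at is set per label in the nodepool configuration; a hypothetical fragment (provider, image, label, and keypair names are all invented here) could look like:

```yaml
# Hypothetical nodepool.yaml fragment; all names are placeholders.
providers:
  - name: mycloud
    driver: openstack
    cloud: mycloud
    pools:
      - name: main
        labels:
          - name: my-label
            cloud-image: my-image
            flavor-name: m1.small
            key-name: zuul-ssh-key   # keypair registered in the cloud
```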
*** threestrands has joined #zuul | 00:35 | |
*** swest has quit IRC | 01:11 | |
*** swest has joined #zuul | 01:27 | |
*** jhesketh has joined #zuul | 02:56 | |
*** bhavikdbavishi has joined #zuul | 03:46 | |
*** bhavikdbavishi1 has joined #zuul | 03:51 | |
*** bhavikdbavishi has quit IRC | 03:53 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 03:53 | |
*** evrardjp has quit IRC | 04:23 | |
*** bhavikdbavishi has quit IRC | 05:26 | |
*** bhavikdbavishi1 has joined #zuul | 05:26 | |
*** bhavikdbavishi has joined #zuul | 05:29 | |
*** bhavikdbavishi1 has quit IRC | 05:30 | |
*** evrardjp has joined #zuul | 05:41 | |
*** dpawlik has joined #zuul | 06:09 | |
*** saneax has joined #zuul | 06:15 | |
*** bhavikdbavishi has quit IRC | 06:47 | |
*** ysandeep|away is now known as ysandeep | 06:53 | |
*** ysandeep is now known as ysandeep|afk | 07:11 | |
*** rpittau|afk is now known as rpittau | 07:12 | |
*** harrymichal has joined #zuul | 07:13 | |
*** threestrands_ has joined #zuul | 07:18 | |
*** threestrands_ has quit IRC | 07:18 | |
*** threestrands has quit IRC | 07:21 | |
*** jcapitao has joined #zuul | 07:26 | |
*** bhavikdbavishi has joined #zuul | 07:49 | |
*** jpena|off is now known as jpena | 07:55 | |
*** ysandeep|afk is now known as ysandeep | 08:02 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:04 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:08 |
avass | AJaeger: re ^ | 08:17 |
AJaeger | thanks, avass | 08:18 |
avass | It's failing since ensure-tox installs it in ~/.local/bin, but that's not in PATH, so fetch-tox-output fails because it wasn't using the 'tox_executable' variable | 08:18 |
avass | but it looks like setting a variable at the job level doesn't let ansible override it later, so the test fails | 08:19 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:19 |
avass | I guess the tox roles should be tested by completely removing tox in a pre-run | 08:21 |
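The fix avass describes amounts to replacing a bare `tox` invocation with the cached fact; a rough sketch of the kind of task involved (not the actual fetch-tox-output source) would be:

```yaml
# Rough sketch only; the real role's tasks differ.
- name: Run tox using the cached executable fact
  command: "{{ tox_executable | default('tox') }} --version"
  # Without the variable this would be a plain 'tox', which fails when
  # ensure-tox installed it into ~/.local/bin and that is not in PATH.
```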
*** sshnaidm has joined #zuul | 08:22 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:25 |
AJaeger | fun ;/ | 08:25 |
*** tobias-urdin has joined #zuul | 08:27 | |
*** swest has quit IRC | 08:32 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:43 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 08:46 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Parallelize initial static node creation https://review.opendev.org/721205 | 09:01 |
*** tosky has joined #zuul | 09:01 | |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Parallelize initial static node synchronization https://review.opendev.org/721205 | 09:02 |
*** sgw has quit IRC | 09:09 | |
*** swest has joined #zuul | 09:11 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 09:19 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Parallelize initial static node synchronization https://review.opendev.org/721205 | 09:20 |
openstackgerrit | Albin Vass proposed zuul/zuul master: Enables whitelisting and configuring callbacks https://review.opendev.org/717260 | 09:28 |
*** ysandeep is now known as ysandeep|lunch | 09:34 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output https://review.opendev.org/721192 | 09:47 |
avass | I'm very confused right now, the fetch-spinch-tarball behaves very strange for us | 09:55 |
avass | for some reason "dest: {{ zuul.executor.log_root }}/docs-html.tar.bz2" evaluates to something completely different, but only for the unarchive task | 09:56 |
avass | see: http://paste.openstack.org/show/792400/ | 09:56 |
avass | *fetch-sphinx-tarball :) | 09:56 |
avass | ^ i mean the "src:" field for unarchive, the "dest:" field is correct | 09:58 |
*** saneax has quit IRC | 10:02 | |
*** swest has quit IRC | 10:02 | |
*** swest has joined #zuul | 10:03 | |
*** swest has quit IRC | 10:04 | |
*** saneax has joined #zuul | 10:06 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Fail fetch-sphinx-tarball if no html exists https://review.opendev.org/721221 | 10:12 |
*** ysandeep|lunch is now known as ysandeep | 10:13 | |
*** rpittau is now known as rpittau|bbl | 10:15 | |
*** harrymichal has quit IRC | 10:21 | |
jkt | tristanC: do you happen to have an updated version of the runc nodepool handler? My naive backporting seems to have hit a problem, http://paste.openstack.org/show/792403/ (code at https://gerrit.cesnet.cz/plugins/gitiles/github/openstack-infra/nodepool/+log/refs/heads/cesnet/2020-04-19--runc) | 10:28 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Fail fetch-sphinx-tarball if no html exists https://review.opendev.org/721221 | 10:33 |
*** jcapitao is now known as jcapitao_lunch | 10:37 | |
jkt | tristanC: ok, that turned out to be easy enough, https://gerrit.cesnet.cz/plugins/gitiles/github/openstack-infra/nodepool/+/f51f5c52c2b37f0814807fa6745ce8313ab45d0d%5E%21/ if you ever need it in softwarefactory | 10:43 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use remote_src false for easier debugging https://review.opendev.org/721237 | 10:52 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src false https://review.opendev.org/721237 | 10:53 |
zbr | avass: can you please help me with https://review.opendev.org/#/c/681532/ ? | 10:58 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src false https://review.opendev.org/721237 | 10:58 |
avass | I guess we just couldn't handle bzip2 archives and the error message confused me | 11:00 |
avass | zbr: Oh, I'm not a maintainer of zuul, just zuul-jobs if it's a review you want :) | 11:01 |
zbr | avass: ah, i forgot that they do not fully overlap. | 11:03 |
zbr | i am watching your changes as we had overlapping work around tox. | 11:04 |
zbr | i started https://review.opendev.org/#/c/690057/ long time ago, and still rebasing :D | 11:05 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: tox: allow tox to be upgraded https://review.opendev.org/690057 | 11:09 |
zbr | i wonder if we can do anything to make the yml / yaml usage consistent; right now it seems random, even inside the same role. | 11:15 |
avass | huh, yeah I usually stick to .yaml | 11:19 |
*** jpena is now known as jpena|lunch | 11:27 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src true https://review.opendev.org/721237 | 11:28 |
avass | Other way around, whoops | 11:28 |
zbr | my impression is that most of the ansible world does the opposite; afaik only zuul prefers yaml. | 11:32 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml https://review.opendev.org/721245 | 11:48 |
AJaeger | like this? ^ | 11:48 |
*** dpawlik has quit IRC | 11:50 | |
*** dpawlik has joined #zuul | 11:50 | |
*** dpawlik has quit IRC | 11:52 | |
*** dpawlik has joined #zuul | 11:53 | |
*** dpawlik has quit IRC | 11:54 | |
*** dpawlik has joined #zuul | 11:54 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml https://review.opendev.org/721245 | 11:55 |
*** bhavikdbavishi has quit IRC | 11:59 | |
*** swest has joined #zuul | 12:01 | |
*** rpittau|bbl is now known as rpittau | 12:03 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: Do not keep owner of archived files https://review.opendev.org/721248 | 12:05 |
zbr | AJaeger: i am not a big fan of using bash in tox.ini, especially 'find', as it has portability issues, but it should work for the moment. | 12:06 |
zbr | until we end up with some files where the .yml extension is mandatory | 12:06 |
*** jcapitao_lunch is now known as jcapitao | 12:07 | |
AJaeger | then we can remove the test or do it differently | 12:08 |
*** rlandy has joined #zuul | 12:09 | |
tristanC | jkt: thanks. though we are about to remove runc from the next version of softwarefactory. | 12:14 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Detach head before reset https://review.opendev.org/720458 | 12:20 |
openstackgerrit | Jan Kubovy proposed zuul/zuul master: Mandatory Zookeeper connection for ZuulWeb https://review.opendev.org/721254 | 12:21 |
*** rfolco has joined #zuul | 12:22 | |
jkt | tristanC: understood; we're trying to get ourselves migrated to openstack anyway | 12:25 |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: URLTrigger driver time based https://review.opendev.org/635567 | 12:25 |
*** jpena|lunch is now known as jpena | 12:33 | |
*** Goneri has joined #zuul | 12:42 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists https://review.opendev.org/721258 | 12:42 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists https://review.opendev.org/721258 | 12:54 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists https://review.opendev.org/721258 | 12:58 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml https://review.opendev.org/721245 | 13:00 |
AJaeger | sorry, had to update test. mordred, avass, want to review later again, please? ^ | 13:00 |
avass | Ajaeger: huh, yeah good catch | 13:01 |
avass | or wait, what's the difference? | 13:02 |
avass | I guess grep -v doesn't give an errorcode? | 13:03 |
AJaeger | the test failed if there were no .yml files. | 13:03 |
AJaeger | my local tests weren't good enough ;( | 13:03 |
AJaeger | now tested better | 13:03 |
*** irclogbot_0 has quit IRC | 13:06 | |
*** irclogbot_3 has joined #zuul | 13:08 | |
tristanC | zuul-maint : the zuul-operator has quite a few changes waiting to be reviewed from https://review.opendev.org/718162 to https://review.opendev.org/720822 , please have a look. Thank you in advance! | 13:11 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists https://review.opendev.org/721258 | 13:11 |
openstackgerrit | Merged zuul/zuul-jobs master: Update Fedora to 31 https://review.opendev.org/717657 | 13:21 |
*** sgw has joined #zuul | 13:24 | |
openstackgerrit | Merged zuul/zuul-jobs master: Make ubuntu-plain jobs voting https://review.opendev.org/719701 | 13:25 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: Adds role: ensure-ansible https://review.opendev.org/721269 | 13:27 |
avass | :) | 13:27 |
avass | I wonder if we could make the ensure-tox a bit more generic, something like an ensure-<python executable package> | 13:30 |
fungi | i thought the idea was that it was really ensure-<executable name> (which could be provided by distro packages, python packages, et cetera) | 13:48 |
avass | fungi: yeah sure, but it feels like a lot of the python packages will look the same when installing them | 13:50 |
avass | fungi: so they could probably be ensure-ansible, ensure-tox etc, but they import a role that does all the work | 13:51 |
avass | or at least the common parts | 13:51 |
fungi | so long as the executable name and the package name are the same, maybe | 13:52 |
openstackgerrit | Jan Kubovy proposed zuul/zuul master: Driver event ingestion https://review.opendev.org/717299 | 13:53 |
avass | yeah | 13:53 |
bolg | zuul-maint: someone please -2 on https://review.opendev.org/c/721254 and https://review.opendev.org/c/717299 - breaking changes (requires Zookeeper for ZuulWeb) | 13:53 |
avass | it could also separate the installation into the generic role and setting the relevant "executable" facts in the ensure-<executable> role | 13:54 |
tobiash | bolg: done | 13:55 |
bolg | tobiash: thanks | 13:55 |
fungi | avass: also i notice the current implementation of ensure-tox always runs ensure-pip even before testing whether tox is already installed, which is probably slightly wrong (tox itself doesn't require any pip outside the virtualenvs it creates) | 13:57 |
avass | fungi: yeah that depends on if it finds an already existing installation or not, since it uses pip to install tox if it doesn't | 13:59 |
fungi | yeah, but the order of operations seems to be: 1. install pip, 2. test whether tox is installed, 3. use pip to install tox if missing | 14:00 |
fungi | 1 and 2 should probably be swapped around | 14:00 |
avass | why do we do that with bash anyway? | 14:00 |
-openstackstatus- NOTICE: Zuul is temporarily offline; service should be restored in about 15 minutes. | 14:00 | |
avass | I agree | 14:00 |
fungi | the bash is there to work out which version of the pip executable should be used | 14:00 |
avass | ah yeah | 14:01 |
fungi | since there are several names for it, and some python version ambiguity as to which goes with which major python series | 14:01 |
avass | shouldn't that check for 'pip2' and not 'pip' then? | 14:03 |
avass | or pip2 > pip > pip3 | 14:04 |
avass | if prefer_python2 | 14:04 |
fungi | it looks like the logic there assumes that if python2 is installed, then `pip` will relate to it | 14:06 |
fungi | not all platforms provide a `pip2` but i suppose it could be added if needed for one | 14:07 |
avass | it seems like a safer thing to check for | 14:09 |
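The selection order being debated can be modeled as a tiny helper; this is only an illustration of the "version-specific name first, then plain `pip`" idea, not the role's actual bash:

```python
import shutil

def pick_pip(prefer_python2=False):
    # Illustrative only: try the version-specific executable first,
    # then fall back to plain 'pip'; return None if nothing is found.
    candidates = ["pip2", "pip"] if prefer_python2 else ["pip3", "pip"]
    for name in candidates:
        if shutil.which(name):
            return name
    return None
```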
avass | gotta take a break, I'll probably check in later | 14:12 |
*** ysandeep is now known as ysandeep|afk | 14:15 | |
*** cdearborn has joined #zuul | 14:34 | |
jkt | is it a good idea to run one nodepool-builder and zuul-exec per cloud? | 14:42 |
jkt | (looks like our company-internal cloud is constrained by IPv4 quotas, and there's no VPNaaS or anything like that) | 14:43 |
mordred | jkt: fwiw - everything works fine ipv6-only ;) | 14:45 |
mordred | jkt: but that might be a good solution for you | 14:46 |
jkt | yeah, unfortunately I've been asking our operators for IPv6 support for more than a year so far :( | 14:47 |
jkt | we're a research org, our core business are advanced network services, we run our own DWDM, we're building optical ROADMs, my colleagues have been deploying IPv6 for years, but hey, this cloud of ours is IPv4 only </rant> | 14:48 |
openstackgerrit | Merged zuul/zuul-jobs master: Document output variables https://review.opendev.org/719704 | 14:51 |
mordred | jkt: it's the sad story of the modern world isn't it/ | 14:52 |
mordred | ? | 14:52 |
mordred | especially given how well ipv6 works and how many "complex" problems go away once it's there | 14:53 |
* jkt mummbles something about how trivial IPv6 was in 2014-2015 with nova-network and SLAAC from an external, HW-based L3 switch | 14:53 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Log alembic migrations by default https://review.opendev.org/721283 | 14:55 |
jkt | but apart from this, is it OK to have one or several nodepool-launchers + zuul-exec in the cloud directly? (perhaps in a different subproject for, eh, "API safety" so that these service VMs cannot kill themselves) | 14:55 |
openstackgerrit | Merged zuul/zuul-jobs master: Python roles: misc doc updates https://review.opendev.org/720111 | 14:56 |
corvus | mordred: i think https://review.opendev.org/721283 will make us happier; that won't affect the current opendev install, but we'll get it when we stop using our custom logging config (and that will help new users not using a custom logging config) | 14:56 |
corvus | jkt: yep, you'll need to use the executor region feature to route the jobs to the right executor | 14:57 |
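The executor region routing corvus mentions is configured in two places: a zone on the executor (in zuul.conf, something like `[executor] zone = mycloud-zone`) and a matching node attribute on the nodepool pool. A hypothetical sketch of the nodepool side, with key names quoted from memory (verify against the current docs before relying on them):

```yaml
# Hypothetical nodepool.yaml fragment: tag nodes from this pool so
# only executors in the matching zone run jobs on them.
pools:
  - name: main
    node-attributes:
      executor-zone: mycloud-zone   # must match the executor's zone
```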
jkt | btw, https://docs.openstack.org/infra/system-config/zuul.html says that there's no need for backups. I think one should still back up per-project keys for providing secrets | 14:57 |
corvus | jkt: that's a good point | 14:58 |
jkt | thanks | 15:02 |
*** dpawlik has quit IRC | 15:07 | |
*** bhavikdbavishi has joined #zuul | 15:09 | |
*** bhavikdbavishi1 has joined #zuul | 15:12 | |
*** bhavikdbavishi has quit IRC | 15:13 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 15:13 | |
dmsimard | hey all, long time no see o/ | 15:26 |
dmsimard | I'm back after some medical leave time off, I wouldn't say I'm healed/cured but doing better now | 15:27 |
clarkb | dmsimard: good morning/afternoon. Hope you are feeling well | 15:28 |
mordred | dmsimard: awesome! being better is better than not doing better! | 15:28 |
*** yolanda has joined #zuul | 15:28 | |
dmsimard | thanks, sorry about the radio silence :) | 15:29 |
dmsimard | I'd like to look and discuss what it means to upgrade ara in zuul and I figured it might be worth taking a step back and consider the current zuul-web implementation which is ara-ish | 15:30 |
corvus | dmsimard: welcome back! | 15:31 |
AJaeger | dmsimard: welcome back | 15:31 |
corvus | dmsimard: there is some work-in-progress to make the callback plugin mechanism more generic, so that other callbacks can be used, and we could eventually make ara less 'special' (so that whether to use ara and what version could be a deployer decision) | 15:32 |
dmsimard | corvus: ah, a generic callback interface was something I wanted to ask about -- I think making ara less special is fine | 15:33 |
corvus | dmsimard: and yeah, i'd love to be able to share the dashboard console stuff -- obviously the backend is very different at the moment, but the frontend is very similar, and if there's a way to share at least widgets, if not more, that'd be great :) | 15:33 |
corvus | dmsimard: https://review.opendev.org/717260 is the wip callback stuff tha avass is working on | 15:34 |
dmsimard | thanks, looking | 15:34 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml https://review.opendev.org/721245 | 15:45 |
-openstackstatus- NOTICE: Gerrit will be restarted to correct a misconfiguration which caused some git mirrors to have outdated references. | 15:48 | |
dmsimard | corvus: yeah, the approach makes sense and at first glance provides the necessary bits to configure ara -- I'll try to give it a spin | 15:51 |
*** ianychoi_ is now known as ianychoi | 15:53 | |
dmsimard | re: dashboard bits, I'm not a javascript person but I understand that react components can be built in a way to be shared/embedded so there is definitely an opportunity to collaborate there | 15:54 |
AJaeger | corvus, mordred, could either of you look at https://review.opendev.org/#/c/721221/ for zuul-jobs once you're done with firefighting, please? It makes the docs job fail if html is stored elsewhere and can't be found | 15:54 |
dmsimard | there is ara-web which is meant to use the new ara api but it's still very WIP | 15:54 |
dmsimard | fwiw, there is a simple javascript-less built-in interface shipping with the api server now: https://api.trunk.demo.recordsansible.org/ | 15:55 |
dmsimard | or in API version: https://api.trunk.demo.recordsansible.org/api/v1/playbooks | 15:55 |
tristanC | regarding ansible console reporting, it would be nice to consider nested playbook execution. It seems like once a project uses a custom plugin, the only way to run the code is to make the zuul job run another ansible-playbook command, which results in an opaque `run ansible-playbook` task in the reporting. | 15:58 |
tristanC | iiuc ara 1.x implements an API that multiple ansible-playbook commands could use, and iirc the zuul_stream may be adapted to support adding extra job-output.json files to the job console report | 16:00 |
clarkb | tristanC: I think opendev currently does not want that as part of its CD rollout so making it optional would be a good thing | 16:04 |
clarkb | (we want to keep the details of nested runs hidden until we can confirm they don't expose secrets) | 16:04 |
dmsimard | tristanC: in the context of nested ansible runs, if zuul's ansible and the nested ansible both use the ara callback and report to the same location, you could have both reports (zuul and nested) in a single place | 16:04 |
dmsimard | brb, lunch | 16:05 |
tristanC | clarkb: sure, i guess the only way to make this work would be for zuul to expose an ansible.cfg blob that the nested playbook would need to use (opt-in) to be part of the build report. | 16:07 |
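One concrete shape for that opt-in (purely illustrative; the server URL is a placeholder and this assumes an ara 1.x API server is reachable) would be pointing the nested run's ara callback at the same API server via environment variables:

```shell
# Illustrative only: wire a nested ansible-playbook run to an existing
# ara 1.x API server so its playbooks land in the same report.
export ANSIBLE_CALLBACK_PLUGINS="$(python3 -m ara.setup.callback_plugins)"
export ARA_API_CLIENT=http
export ARA_API_SERVER="https://ara.example.org"   # placeholder URL
ansible-playbook nested-playbook.yaml
```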
*** rpittau is now known as rpittau|afk | 16:07 | |
openstackgerrit | Merged zuul/zuul master: Log alembic migrations by default https://review.opendev.org/721283 | 16:09 |
mordred | tristanC: I've pondered writing an ansible_playbook module (similar to the puppet one we wrote) | 16:11 |
corvus | note that nested runs don't generally run on the executor, so the installation environment will be different; there may not be much to be gained by integrating with zuul's ansible or ansible.cfg there. maybe a better approach is to have a module like mordred suggests, and have easy flags to enable ara or other callbacks | 16:13 |
*** jcapitao has quit IRC | 16:14 | |
corvus | i think there's more integration that could happen on the other side: integrating ara or zuul callback reporting data together after the run. | 16:14 |
*** yolanda has quit IRC | 16:17 | |
tristanC | corvus: perhaps the nested callback could be configured to write its report to ~/zuul-output, and then reporting would load that file and present it on the build page? | 16:18 |
openstackgerrit | Merged zuul/zuul-jobs master: Fail fetch-sphinx-tarball if no html exists https://review.opendev.org/721221 | 16:19 |
corvus | tristanC: yeah, that sounds good | 16:19 |
tristanC | i mentioned a ansible.cfg blob because ara would likely need to know what is the ara-api endpoint location and how it should annotate the nested playbook to be attached to the build report | 16:20 |
mordred | yeah - so - the ara api endpoint is one of the things that potentially concerns me about integrating new ara with zuul (although it makes total sense as an approach for ara's main target audience) | 16:21 |
mordred | with that structure, it would require us to have each job run an ara service so that the callback could report - and then we'd still need to do $something to have that ara service export the data to json files that we could upload to log storage so that the web stuff could read them | 16:22 |
mordred | the second part should be easy enough - it's just json data it's looking for after all | 16:22 |
mordred | but the first part - requiring a running service endpoint - starts to get rather complicated | 16:22 |
mordred | but I might be missing something - admittedly I haven't played with it too much recently so there's likely a ton I don't know | 16:23 |
*** fdegir has joined #zuul | 16:25 | |
*** yolanda has joined #zuul | 16:29 | |
*** evrardjp has quit IRC | 16:35 | |
*** evrardjp has joined #zuul | 16:35 | |
tristanC | mordred: iiuc the proposed plan is to have an api per zuul, or per tenant, not per build | 16:43 |
mordred | tristanC: that ara would have to store _all_ of the playbook results though right? then a given job would need to get the logs of just its playbook runs | 16:49 |
*** kklimonda has quit IRC | 16:54 | |
dmsimard | mordred: the api server isn't required when recording data locally, there is an "offline" client implementation which spins up an ephemeral server in the background for the duration of the run | 16:54 |
*** kklimonda has joined #zuul | 16:55 | |
tristanC | mordred: it seems like it yes. iirc ara features label where we could attach the zuul build uuid to each playbook results | 16:55 |
dmsimard | let me get an example for the labels | 16:56 |
tristanC | dmsimard: when using the offline client, what would happen with the data at the end of the run? | 16:56 |
dmsimard | with some experimental zuul labels, example: https://api.demo.recordsansible.org/?label=zuul.executor:ze07.openstack.org | 16:58 |
*** sshnaidm is now known as sshnaidm|afk | 16:59 | |
dmsimard | tristanC: whatever we want to happen, I guess ? I mean, by default it's still a sqlite database unless the (offline) API server is set up to use a mysql database or something | 16:59 |
dmsimard | there is an implementation of the sqlite middleware in 1.x which is similar to the one in 0.x but provides the API on top of the reporting interface | 17:00 |
dmsimard | otherwise it has static html generation too | 17:00 |
dmsimard | labels can be specified on a playbook basis or through ansible.cfg like this: https://opendev.org/recordsansible/ara/src/branch/master/tests/zuul_metadata.yaml -> https://opendev.org/recordsansible/ara/src/branch/master/tests/basic.yaml#L108-L110 | 17:05 |
AJaeger | avass, mordred , could you review https://review.opendev.org/721245 again, please? Now the test works... | 17:15 |
*** yolanda has quit IRC | 17:26 | |
*** bhavikdbavishi has quit IRC | 17:28 | |
*** bhavikdbavishi has joined #zuul | 17:29 | |
*** portdirect has quit IRC | 17:30 | |
*** portdirect has joined #zuul | 17:30 | |
*** jpena is now known as jpena|off | 17:32 | |
*** evrardjp has quit IRC | 17:44 | |
*** evrardjp has joined #zuul | 17:49 | |
clarkb | I think we've got an OSF newsletter going out this week, are there any zuul updates to call out from the last few weeks? | 18:17 |
clarkb | fungi: avass correct, not all platforms provide `pip` or `pip2`, depending on which combo of things you've got | 18:18 |
clarkb | the safest thing seems to be: if you know you want a specific version, check for that first, then check if just `pip` is available, as that can be used to install into arbitrary python envs | 18:18 |
fungi | clarkb: well, except that the system-wide `pip` doesn't actually install into any python envs other than the system context or --user (~/.local) | 18:23 |
fungi | so it's not even necessary to have a system-wide pip available if tox was already installed through other means | 18:24 |
clarkb | fungi: yes, but I think we can't do better than that unless you are overriding the pip version | 18:24 |
clarkb | which the roles allow you to do, aiui | 18:24 |
clarkb | correct | 18:24 |
clarkb | you should set the value for the tox executable path if you've done that | 18:24 |
fungi | but the ensure-tox role currently installs pip first, then checks to see whether you already have tox | 18:24 |
clarkb | ah I see, that is a bug yes | 18:25 |
clarkb | but that is independent of checking for pip vs pip2 vs pip3 | 18:25 |
fungi | yes, i agree | 18:25 |
fungi | two different suggestions which came out of avass and i looking at the role | 18:25 |
fungi | but we should probably reevaluate those concerns in the context of zbr's ensure-tox overhaul too, i suspect | 18:26 |
clarkb | with pip vs pip2 vs pip3 we don't really have any good options. What I went with should be the least surprising thing: use the version-specific pip if available, else use plain pip, which can be version specific if you choose it to be | 18:26 |
*** yolanda has joined #zuul | 18:46 | |
*** yolanda has quit IRC | 18:46 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists https://review.opendev.org/721258 | 18:55 |
*** maxamillion has quit IRC | 19:08 | |
*** maxamillion has joined #zuul | 19:11 | |
*** bhavikdbavishi has quit IRC | 19:18 | |
fungi | clarkb: well, the actual selection order seems to be: if python2 is preferred try pip then pip3, if python3 is preferred try pip3 then pip | 19:23 |
clarkb | hrm they should've been the same but maybe I got that wrong? | 19:24 |
fungi | it's possible for some platforms (recent fedora with a transitional python2 installed maybe?) to have pip2 and pip3 but not pip | 19:24 |
fungi | i gather the intended logic in that role was to fall back to the non-preferred python version's pip executable | 19:25 |
clarkb | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-pip/tasks/main.yaml#L19-L22 its version specific first each time | 19:25 |
clarkb | that isn't selecting a version of pip, merely deciding if one exists that is appropriate | 19:25 |
fungi | clarkb: this is in ensure-tox | 19:25 |
clarkb | I see | 19:25 |
fungi | which was also part of avass's point, i think, that we reimplement this selection logic (differently) in multiple roles | 19:25 |
clarkb | gotcha I read it as being in ensure-pip | 19:26 |
clarkb | and fwiw the selection logic may be different than "does any valid pip exist" | 19:26 |
clarkb | I don't know if they should be the same or not without thinking about it more | 19:26 |
fungi | probably what we could do is have ensure-tox conditionally include ensure-pip when tox is not found, and then use pip to install tox | 19:27 |
fungi | that way we at least don't unnecessarily choose pip in different ways in different roles | 19:27 |
clarkb | ++ | 19:27 |
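The reordering fungi suggests could look roughly like this in role terms (a sketch only, not the actual ensure-tox tasks; the `tox_pip` variable is invented for illustration):

```yaml
# Sketch of the suggested ordering: check for tox first, and only
# include ensure-pip when tox is actually missing.
- name: Check whether tox is already available
  command: "{{ tox_executable | default('tox') }} --version"
  register: tox_check
  failed_when: false

- name: Install tox only when missing
  when: tox_check.rc != 0
  block:
    - name: Make pip available first
      include_role:
        name: ensure-pip
    - name: Install tox with pip
      command: "{{ tox_pip | default('pip3') }} install --user tox"
```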
*** jtanner has quit IRC | 19:42 | |
*** jtanner has joined #zuul | 19:43 | |
*** sgw has quit IRC | 21:09 | |
*** sgw has joined #zuul | 21:18 | |
clarkb | nodepoolians we've noticed some interesting behavior from nodepool recently where it takes many hours to fulfill node requests. This seems related to failed instance deletes in openstack clouds and quotas being reported inaccurately | 21:19 |
clarkb | I've tried to summarize what I've found in https://etherpad.opendev.org/p/lF1-vMNVtDqlEH1QwNAZ | 21:20 |
clarkb | I don't think nodepool is doing anything really wrong here other than perhaps trusting the info from cloud APIs too much | 21:20 |
clarkb | that said I've got some suggestions in the etherpad for how nodepool might handle the situation better to avoid user impact | 21:20 |
clarkb | tobiash: ^ have you seen this before in your usage? | 21:20 |
corvus | clarkb: well, right at the beginning, why did it accept a request when the predicted remaining quota was -1? | 21:23 |
clarkb | corvus: I think because that is how nodepool goes into a paused state | 21:23 |
corvus | clarkb: ack | 21:23 |
clarkb | basically you need a new request to initially transition into paused | 21:24 |
corvus | clarkb: why do you think it accepted request 706 when it still had 864 pending? | 21:26 |
corvus | clarkb: do you think it thought it had enough quota for both? | 21:28 |
clarkb | corvus: I think because for a short period of time around line 14 of my etherpad it thinks it has room to boot new instances | 21:28 |
clarkb | so it unpauses and grabs more but then those all fail due to quota errors (and we don't know that until nova responds to us) | 21:28 |
clarkb | we have a window from 15:48 to 15:58 roughly where we are unpaused and grabbing additional requests | 21:29 |
clarkb | the second request I've added into the etherpad is accepted at 15:53 so is within that window | 21:29 |
corvus | ah got it, line 29 is chronologically around line 17.5 | 21:29 |
clarkb | ya sorry I organized by request not chronologically | 21:30 |
tobiash | clarkb: reading backlog | 21:31 |
corvus | that's fine, this makes a lot of sense, but the comment on line 27 about being hours later isn't right, is it? | 21:31 |
corvus | clarkb: or do you mean it's still active many hours later | 21:31 |
clarkb | corvus: ya its still pending right now | 21:31 |
clarkb | or was 5 minutes ago | 21:31 |
clarkb | because we slowly iterate through the 3 node request failures | 21:31 |
corvus | clarkb: how about this alternative solution: we unlock (without rejecting) the newest pending node requests when we're paused (so that after requests #2 and #3... finish their launch attempts failing, we unlock them. only request #1 stays locked) | 21:34 |
corvus | basically, they get to go back on the pile for some other provider to grab them | 21:34 |
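The "put them back on the pile" idea can be sketched as follows. The `Req` shape (`.id`, `.unlock()`) is a hypothetical stand-in, not nodepool's real request API:

```python
def release_excess_requests(pending_requests):
    """Sketch of corvus's proposal above: when a provider pauses, keep
    only the oldest accepted-but-unfulfilled request locked and unlock
    the rest so other providers can grab them.  Requests are assumed to
    expose .id (monotonic, so lower means older) and .unlock()."""
    ordered = sorted(pending_requests, key=lambda r: r.id)
    for req in ordered[1:]:        # everything but the oldest
        req.unlock()               # back on the pile for another provider
    return ordered[:1]             # the one request this provider keeps
```

The effect is the "oops, I never should have accepted this one" move: only request #1 stays locked while the provider waits for quota.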
clarkb | corvus: I like that I was wondering if we should reject outright but that seems nicer :) | 21:35 |
corvus | it's sort of an "oops, i never should have accepted this one in the first place" move | 21:35 |
corvus | which, i think is close to reality | 21:35 |
corvus | clarkb: is there anything more we can learn about the quota situation? i'm not sure i see specifically what is incorrect here | 21:36 |
corvus | also, i need a refresher on 'pool' vs 'provider' quota | 21:36 |
clarkb | corvus: if you look at grafana for inap, http://grafana.openstack.org/d/ykvSNcImk/nodepool-inap?orgId=1&from=now-24h&to=now, we see that we've had a prolonged state with the majority of nodes in a deleting state | 21:37 |
clarkb | corvus: that implies to me that inap actually has all of those instances stuck in a deleting state, but when you ask the nova api for quota usage we get told that we have 157 instances available | 21:37 |
clarkb | let me try and reproduce that manually via bridge | 21:38 |
corvus | cause it says 'provider quota' is 0, but 'pool quota' is 156/7 | 21:38 |
clarkb | oh I missed that distinction | 21:40 |
clarkb | running manually we have ~188 instances in a nova list output and limits show --absolute says we are using 196 instances | 21:40 |
tobiash | pool quota should be the self constrained quota based on the pool config, provider quota is the quota of the cloud tenant | 21:40 |
clarkb | so that is actually looking reasonably accurate | 21:40 |
clarkb | corvus: I wonder if that is actually the underlying thing here | 21:41 |
clarkb | because if the pool quota doesn't show 156 instances we should never unpause | 21:41 |
clarkb | I had attributed it to openstack because we've seen it get quotas wrong before, but maybe this is a nodepool bug? | 21:41 |
tobiash | it looks to me that nodepool locks a request, then pauses and waits for resources to get free which doesn't happen because of hanging deletes? | 21:42 |
clarkb | tobiash: ya, but as corvus points out we still show predicted provider instances is 0 but then pool is 157 | 21:43 |
clarkb | and I think we decide based on the pool number we can unpause | 21:43 |
clarkb | despite the provider not having enough capacity there (and nodepool knowing that) | 21:43 |
corvus | tobiash: i think the new issue here is that it's happening for multiple node requests (not just one -- it's okay if just one takes a few minutes to unpause) | 21:43 |
corvus | clarkb: but line 15 in your etherpad indicates that we expect the provider to be able to accommodate the request | 21:46 |
corvus | clarkb: er, i guess the same line at line 30 is more relevant | 21:46 |
clarkb | corvus: I think instances: 0 means we don't have any room | 21:46 |
corvus | clarkb: but that's predicted remaining | 21:46 |
clarkb | oh I see | 21:47 |
corvus | so it's saying "this is the last request i can accept without hitting the quota limit" | 21:47 |
corvus | and then it hits the quota | 21:47 |
clarkb | gotcha so we are right on the line but actually about 150 instances away from it? | 21:47 |
tobiash | the provider quota is potentially inaccurate | 21:47 |
tobiash | so when it hits the provider quota it pauses and invalidates the quota cache | 21:48 |
clarkb | also if we are right at quota that number may be accurate, but since we have ~20 active node requests only one of them will win the race to get that free slot | 21:48 |
tobiash | but then it might already have taken multiple requests | 21:48 |
clarkb | I think the key here is to avoid adding more noderequests because it races quota usage | 21:48 |
clarkb | whereas if we have a single outstanding request when paused then we know we'll service that one with the next available resources? | 21:48 |
clarkb | corvus: ^ basically that means your idea may still be the proper fix here | 21:49 |
tobiash | when we hit the provider quota we hit it asynchronously (when we try to spawn the instance) | 21:49 |
tobiash | at that point in time nodepool might have already taken more than one request | 21:49 |
corvus | so you're saying that the quota usage is so inaccurate that we think we can accept, say 10 node requests when we can really accept 0, and all 10 get stuck in the paused state? | 21:49 |
clarkb | tobiash: yup its got like 20 noderequests active | 21:49 |
clarkb | corvus: probably more like 9/10 get stuck in paused state | 21:49 |
tobiash | potentially | 21:49 |
clarkb | corvus: it is likely that one of them is succeeding since there was probably some quota available | 21:50 |
tobiash | do you run multiple providers against the same cloud? | 21:50 |
clarkb | tobiash: not in this cloud | 21:50 |
clarkb | (others do though) | 21:50 |
corvus | i still wish i understood what was wrong with the quota; like how did it get so far out of sync | 21:50 |
clarkb | normally this is fine because deletes actually free up space | 21:51 |
clarkb | corvus: maybe the predicted value isn't considering the other active requests properly? | 21:51 |
clarkb | say we have 10 requests for 1 node each. If we have 1 free instance quota available do we predict we can fulfill all 10 of those? | 21:51 |
corvus | clarkb: maybe, but i'm pretty sure nothing would work if that were the case... | 21:51 |
tobiash | can it be that nova lies about the quota? | 21:52 |
corvus | we only accept one node request at a time. we process them in parallel | 21:52 |
clarkb | tobiash: it could be, though I'm not seeing that right this moment | 21:52 |
corvus | so it'd be pretty hard for us to get that part wrong. | 21:52 |
corvus | clarkb: didn't it look like it was reporting we used 196 instances when we really have 188? granted, that's wrong in the wrong direction, but still... | 21:53 |
tobiash | we saw in our clouds that nova quota query returned say 20 cpus left and spawn still said not enough quota | 21:53 |
clarkb | another thought. Is it possible that predicted quota assumes the deletes will succeed in a reasonable amount of time? | 21:53 |
clarkb | corvus: yup right now its 194 from nova list but 196 from instances used quota accounting | 21:53 |
corvus | clarkb: can you find and add "Provider quota for inap-mtl01" log lines to the etherpad? | 21:56 |
corvus | clarkb: same for "Available quota for inap-mtl01" | 21:56 |
clarkb | ya | 21:56 |
corvus | the first line is "what did nova limit just tell us", and the second is the same with unmanaged quota subtracted. | 21:57 |
corvus | we only periodically update that; so the most recent pair of those log lines before "Predicted remaining..." is what it's going to be working with | 21:58 |
corvus | clarkb: oh, it looks like 196 is the number of instances available, not used? | 21:59 |
tobiash | do you have unmanaged quota in that tenant? | 22:00 |
clarkb | tobiash: no | 22:00 |
clarkb | corvus: maybe? or maybe its the max value from quota limits and not the totalused value? | 22:00 |
clarkb | the limit and not usage maybe | 22:00 |
clarkb | (or maybe thats what you meant) | 22:00 |
corvus | clarkb: yes, nodepool is looking at the limit | 22:00 |
corvus | "Provider quota for inap-mtl01: {'compute': {'instances': 196, 'cores': inf, 'ram': inf}}" means "the limit for this account is 196 instances" | 22:01 |
corvus | "Available quota for inap-mtl01: {'compute': {'instances': 34, 'cores': inf, 'ram': inf}}" means "there are 162 instances of unknown origin in this account leaving us with room for 34 nodepool instances" | 22:02 |
corvus | tobiash: ^ is that right? | 22:02 |
tobiash | let me check | 22:02 |
tobiash | corvus: yes, that should be right | 22:04 |
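The two log lines corvus quotes can be reduced to one subtraction: the account limit from nova, minus instances that carry no nodepool metadata for this provider. A rough sketch, with dict field names as assumptions rather than the real SDK shapes:

```python
def available_quota(provider_limit, all_instances, provider_name):
    """Sketch of the accounting described above: instances whose
    metadata names this provider are nodepool-managed; everything else
    counts as unmanaged and is subtracted from the account limit to get
    the 'Available quota' figure."""
    unmanaged = sum(
        1 for inst in all_instances
        if inst.get("metadata", {}).get("nodepool_provider_name") != provider_name
    )
    return provider_limit - unmanaged
```

With the numbers from the log (limit 196, 162 unmanaged instances), this yields the 34 nodepool slots corvus mentions.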
clarkb | separately I'm going to try manually deleting one of the instances that nodepool is trying to delete to see why that is failing | 22:04 |
corvus | it gets the 162 number by listing all of the nova instances and checking to see if they have nodepool metadata. | 22:04 |
clarkb | corvus: hrm I wonder if I wasn't completely wrong about the metadata being inappropriately removed then | 22:05 |
corvus | so it may be that our stuck-in-deleting instances have lost their nodepool metadata (clarkb: can you check that?) | 22:05 |
clarkb | corvus: I can try! | 22:05 |
tobiash | corvus: that sounds at least plausible | 22:05 |
clarkb | corvus: is it possible that is related to the userdata base64 thing you discovered too? | 22:05 |
clarkb | where the clients don't return the data to us in some cases? | 22:05 |
tobiash | if the metadata vanishes it would result in such an effect | 22:05 |
corvus | clarkb: i wouldn't expect that | 22:06 |
corvus | clarkb: the userdata thing was a definite "we don't want to print this out on the cli" change | 22:06 |
corvus | i wouldn't expect it to disappear from sdk return values. but i haven't confirmed anything related to that | 22:06 |
clarkb | corvus: got it | 22:06 |
clarkb | also fwiw it seems that inap just managed to delete those instances and now we have lots of room :/ | 22:07 |
clarkb | this also means it is hard for me to check for metadata | 22:07 |
corvus | clarkb: oh did you delete them all? :) | 22:07 |
clarkb | corvus: I didn't :) | 22:07 |
clarkb | this was entirely the cloud itself catching up | 22:07 |
corvus | oh "neat" | 22:07 |
clarkb | I didn't attempt to delete a single one myself yet | 22:07 |
corvus | even if the metadata is gone, this system should still work | 22:07 |
corvus | they should just be counted as external-to-nodepool | 22:08 |
corvus | i mean, it might take one quota error to happen in order to cause a cache refresh, but after that, it should return to correct behavior | 22:08 |
clarkb | I'll take it as a TODO to check metadata if we catch it in this state again in the future (because that will be useful to know either way) | 22:08 |
corvus | so then the line "Predicted remaining provider quota" is 34 - what nodepool thinks its using - the new node requirements | 22:11 |
corvus | so essentially, we can deduce that at a point where nodepool said "predicted remaining provider quota: 0 instances" it thought it had 33 instances under its control | 22:11 |
clarkb | why do we say 0 in that case? | 22:13 |
corvus | the predicted remaining is after accounting for the current request | 22:14 |
corvus | so nodepool saw 34 slots total for nodepool's use, 33 were used by nodepool, 1 was available, and it was looking at a request for 1 node | 22:15 |
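The arithmetic corvus walks through is just one more subtraction on top of the available figure. Names here are illustrative, not nodepool's:

```python
def predicted_remaining(available, nodepool_used, request_nodes):
    """Sketch of the 'Predicted remaining provider quota' line: slots
    left after subtracting nodes nodepool already controls and the
    nodes needed by the request currently being considered."""
    return available - nodepool_used - request_nodes
```

So 34 slots total, 33 in use, a 1-node request in hand gives "predicted remaining: 0 instances": accept this request, but nothing more fits.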
clarkb | got it | 22:15 |
corvus | clarkb: did anything happen between 15:48 and 15:58 for nodepool to think more room may be available? | 22:15 |
corvus | clarkb: cause now i'm wondering how, if it already was dealing with 864, why it would also accept 706 | 22:15 |
corvus | it's interesting that it refreshed the limits and external-to-nodepool limits between those times, but the values did not change | 22:17 |
corvus | so it must have thought that 2 slots opened up? | 22:17 |
clarkb | let me see | 22:17 |
*** tosky has quit IRC | 22:19 | |
clarkb | there is a lot of "cannot find provider pool for instance" in the logs around then | 22:20 |
clarkb | which from earlier debugging implies the instance's metadata was removed from nova metadata | 22:20 |
clarkb | (because we create phony instance data in zk to handle leaked nodes, but then those end up without metadata I think) | 22:21 |
corvus | clarkb: yep. that line indicates that instance should be counted in the 'external-to-nodepool' count | 22:21 |
clarkb | that might be something we can log more clearly to make this debugging easier in the future | 22:21 |
corvus | oh wait | 22:21 |
corvus | clarkb: can you paste one of those log lines? | 22:22 |
corvus | because if it's "Cannot find provider pool for node %s" then we actually aren't accounting for that node and that's worth digging into | 22:27 |
clarkb | corvus: I can fwiw it seems the 2020-04-20 15:48:52,257 DEBUG nodepool.driver.NodeRequestHandler[nl03-28001-PoolWorker.inap-mtl01-main]: [node_request: 300-0008521864] Retrying node request is the triggering thing and I can't find anything that triggers that | 22:27 |
clarkb | so maybe we have a timeout? | 22:27 |
clarkb | where we check in on a paused request status periodically? | 22:27 |
clarkb | 2020-04-20 15:48:52,604 WARNING nodepool.driver.openstack.OpenStackProvider: Cannot find provider pool for node <Node {'connection_port': 22, 'state_time': 1587391454.6109967, 'comment': None, 'launcher': None, 'pool': None, 'image_id': None, 'state': 'deleting', 'allocated_to': None, 'attributes': None, 'hostname': None, 'region': None, 'public_ipv4': None, 'cloud': None, 'host_id': None, 'id': '0016056699', | 22:28 |
clarkb | 'public_ipv6': None, 'external_id': '33ba9c42-379f-4a0e-8028-4e2db2a21ff5', 'python_path': None, 'connection_type': None, 'interface_ip': None, 'stat': ZnodeStat(czxid=85918431802, mzxid=85918431802, ctime=1587391430355, mtime=1587391430355, version=0, cversion=1, aversion=0, ephemeralOwner=0, dataLength=639, numChildren=1, pzxid=85918433437), 'private_ipv4': None, 'az': None, 'provider': 'inap-mtl01', | 22:28 |
clarkb | 'hold_job': None, 'username': None, 'ssh_port': 22, 'hold_expiration': None, 'resources': None, 'type': [], 'host_keys': [], 'created_time': 1587391430.3534877}> | 22:28 |
clarkb | oh sorry didn't realize it would get broken up like that | 22:28 |
clarkb | thats what the cannot find provider pool errors look like | 22:28 |
clarkb | 2020-04-20 12:12:01,790 INFO nodepool.DeletedNodeWorker: Deleting deleting instance 33ba9c42-379f-4a0e-8028-4e2db2a21ff5 from inap-mtl01 | 22:29 |
clarkb | 2020-04-20 12:12:01,840 INFO nodepool.NodeDeleter: Instance 33ba9c42-379f-4a0e-8028-4e2db2a21ff5 not found in provider inap-mtl01 | 22:29 |
corvus | clarkb: you say we made that znode because something leaked? | 22:29 |
clarkb | corvus: yes I believe so and I think the above explains it | 22:29 |
clarkb | basically we boot that instance it fails for some reason so we delete it. Then nova says that doesn't exist and we remove our znode for it | 22:30 |
corvus | but then it gets created? | 22:30 |
clarkb | then later we check for leaked instances and find 33ba9c42-379f-4a0e-8028-4e2db2a21ff5 and add a new znode with less info (no pool info for example) so that we can attempt to clean the leaked instance up | 22:30 |
clarkb | corvus: ya I think the trigger for those znodes is leak cleanup checking for instances it should delete. Any it finds it creates these phony znodes for with less info in them | 22:31 |
clarkb | (because its generated from lossy openstack api returns) | 22:31 |
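The lossy round-trip clarkb describes can be modeled in a few lines. This is a hypothetical mini-model of the leak-cleanup path, not nodepool's real code; the dict keys mirror the znode dump pasted below in the log:

```python
def record_leaked_instance(server):
    """Build the 'phony' znode record for a leaked server found via the
    cloud API.  Because the API response never carried pool information,
    'pool' comes back None, so per-pool quota accounting skips this
    node even though it still occupies real cloud capacity."""
    return {
        "external_id": server["id"],
        "provider": server["metadata"].get("nodepool_provider_name"),
        "pool": server["metadata"].get("nodepool_pool_name"),  # None if never stored
        "state": "deleting",
    }
```

That `None` pool is exactly what produces the "Cannot find provider pool for node" warnings quoted below.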
fungi | oh that's a fun race | 22:31 |
corvus | clarkb: huh. it seems like we need more metadata? | 22:31 |
corvus | like, why isn't there a pool entry in there? | 22:31 |
clarkb | corvus: because nova list doesn't tell us how the instance originated from nodepool | 22:32 |
corvus | (how is it possible there's enough metadata for nodepool to decide it's responsible for it, but not enough to account for it) | 22:32 |
clarkb | corvus: ya it's possible we aren't shipping enough data into metadata | 22:32 |
clarkb | let me find the code again | 22:32 |
corvus | clarkb: nodepool is only going to think it should delete an instance if there's enough metadata for it to recognize it as its own | 22:32 |
clarkb | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L159-L166 is the source of the error https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L532-L537 is where we create the phony record | 22:33 |
corvus | clarkb: seems like that second link should pull in the pool as well, yeah? | 22:34 |
corvus | then the instance should be properly accounted for? | 22:34 |
clarkb | corvus: ya I'm beginning to think that would be the case | 22:34 |
clarkb | so if we set that as part of the metadata we'd be able to write it back in again (probably cross check it against valid pools?) | 22:34 |
clarkb | oh we cross check at https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L159-L166 so simply setting the value may be sufficient | 22:35 |
clarkb | to summarize: we think nova reporting instances as nonexistent, then having them reappear later, leads to us not accounting for those instances when we later try to clean up leaked instances. This is because the pool information matters when calculating quotas and we lose it for leaked instances | 22:36 |
clarkb | we can make that quota calculation more accurate by setting the pool info on leaked instance phony records so they are accounted as belonging to the pool quota wise | 22:36 |
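The fix summarized here has two halves: stamp the pool name into the server metadata at launch, then copy it back onto the record created for a leaked instance. A sketch under assumed metadata key names (the real change is in nodepool's openstack provider, linked below):

```python
def launch_metadata(provider_name, pool_name):
    """Metadata written onto the server at boot; nodepool_pool_name is
    the new piece of information this fix adds."""
    return {
        "nodepool_provider_name": provider_name,
        "nodepool_pool_name": pool_name,
    }

def leaked_record_from_metadata(server_metadata):
    """Rebuild a deletion record for a leaked server.  With the pool
    name now present in metadata, quota accounting can attribute the
    instance to the right pool instead of dropping it."""
    return {
        "provider": server_metadata.get("nodepool_provider_name"),
        "pool": server_metadata.get("nodepool_pool_name"),
        "state": "deleting",
    }
```

With both pieces lined up, a leaked instance counts against its pool's quota and the provider never thinks slots have opened up that the cloud cannot actually provide.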
corvus | clarkb: the lack of the znode.pool attribute means the znode isn't being counted in the nodepool quota used, but if the 'nodepool_provider_name' metadata is on the instance, and matches this provider, then the instance is not going to be counted as unmanaged quota. | 22:38 |
clarkb | corvus: gotcha so we need both pieces of info to properly line up the quota calculations | 22:38 |
*** sanjayu_ has joined #zuul | 22:38 | |
corvus | clarkb, tobiash: ^ so if that sounds right, then yeah, i think the fix may be to just add the pool info back | 22:39 |
corvus | here: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L532-L537 | 22:39 |
*** saneax has quit IRC | 22:41 | |
clarkb | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L340 and there (in order to get it back later) | 22:42 |
corvus | oh it's not in the md in the first place. then yeah. | 22:45 |
corvus | that explains why we didn't copy it back :) | 22:45 |
clarkb | yup I have a change that outlines this | 22:45 |
clarkb | just writing a commit message now | 22:46 |
clarkb | I doubt it's correct but it should be enough for us to not forget this debugging session | 22:46 |
openstackgerrit | Clark Boylan proposed zuul/nodepool master: Set pool info on leaked instances https://review.opendev.org/721359 | 22:47 |
clarkb | corvus: tobiash ^ I think thats roughly what we want. My biggest concern right now is I'm not sure I lined up all the pool types properly (I think we have both name and dict versions of pools depending on context) | 22:48 |
*** sanjayu__ has joined #zuul | 22:50 | |
*** sanjayu_ has quit IRC | 22:52 | |
*** sanjayu__ has quit IRC | 22:52 | |
clarkb | ok I think the types are correct based on how we handle deleting a node later | 22:54 |
clarkb | (in the non leaked case) | 22:54 |
corvus | clarkb: comments | 22:55 |
clarkb | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/handler.py#L271-L286 | 22:55 |
clarkb | corvus: ah yup I thought it available then realized I had to pass it in then didn't update my use of it | 22:55 |
openstackgerrit | Clark Boylan proposed zuul/nodepool master: Set pool info on leaked instances https://review.opendev.org/721359 | 22:57 |
clarkb | corvus: ^ thanks! | 22:57 |
corvus | clarkb: thank you! tobiash ^ whenever it's a reasonable time for you again :) | 22:59 |
corvus | clarkb: maybe let's try this out for a bit and see if we avoid the situation; i'm not convinced we should implement the workaround we talked about earlier unless it's necessary (it would have masked this, and it's more churn than if nodepool actually has accurate quota) | 23:00 |
clarkb | ++ accurate quota would be best | 23:01 |
*** threestrands has joined #zuul | 23:18 | |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: helm-template: enable using values file https://review.opendev.org/721365 | 23:34 |