Monday, 2020-04-20

mordredjkt: yeah - in this case things are ... yuck00:02
jktmordred: I read git blame, and apparently you wrote this. Any reason why that code picks the first network rather than trying the one attached to a router first? I see that it got added in 9986646a1, and it *looks* like a case of a DT cloud (?) where there was a hidden network00:02
mordredjkt: you might also want to try setting routes_externally to False for the other networks (I know that's likely a pita, but it might help you here)00:02
jktwouldn't that break once my provider introduces even more networks that I'm not supposed to use?00:03
mordredmight do. one sec - lemme look at something00:04
mordredjkt: you should set nat_source to true for the right network00:05
jktmordred: looks like that did the trick in my tiny python reproducer, thanks! Let's try it with nodepool now00:08
mordredjkt: it's entirely possible that the right choice should be to start by looking for the network attached to the router that's attached to the network that the node is attached to ... I think at this point it's possible we don't start there simply because of history - and because the code in question tries to handle so many gross edge cases doing a major logic inversion is ... unfun00:09
mordredjkt: awesome!00:09
mordredjkt: when it's not sunday night I'll try to think through what it would take to start with router logic00:09
jktmordred: yeah, I can understand why it's fishy -- apparently it's global state, and determining that global state based on just one random server sounds wrong00:10
mordredyeah.00:11
mordredjkt: if only we didn't have to hack around a bazillion different ways a deployer could do things :)00:11
jkt:)00:12
jktmordred: it works (as in, VMs are created, nodepool is happy, reports correct IP addresses and they are active). Thanks a lot!00:27
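For reference, the nat_source hint mordred gives above lives in the openstacksdk clouds.yaml; a minimal sketch, with placeholder cloud and network names rather than jkt's actual deployment:

```yaml
# clouds.yaml (read by openstacksdk / nodepool) -- names are placeholders
clouds:
  example-cloud:
    auth:
      auth_url: https://keystone.example.org:5000/v3
      project_name: nodepool
      username: nodepool
      password: secret
    networks:
      # Mark the network that should be treated as the NAT source /
      # the address nodepool advertises; this is the setting that fixed
      # the reproducer above.
      - name: public-net
        nat_source: true
      # Other provider networks can be left alone, or additionally set
      # routes_externally: false as mordred mentions.
      - name: hidden-net
```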
jktnow I wonder what am I doing wrong with my SSH keys :). Is nodepool using cloud-init with Zuul's pubkey, or is that a wrong assumption?00:27
clarkbit doesn't by default. You can configure it to do so00:28
clarkbby default it expects ssh to work (key baked into image)00:28
jktclarkb: tx; so it's either https://zuul-ci.org/docs/nodepool/configuration.html#attr-providers.[openstack].pools.labels.userdata or redoing my images, right?00:30
clarkbit's pools.labels.key-name00:30
jktclarkb: ah!, wonder how I missed that. Thanks!00:31
*** threestrands has joined #zuul00:35
*** swest has quit IRC01:11
*** swest has joined #zuul01:27
*** jhesketh has joined #zuul02:56
*** bhavikdbavishi has joined #zuul03:46
*** bhavikdbavishi1 has joined #zuul03:51
*** bhavikdbavishi has quit IRC03:53
*** bhavikdbavishi1 is now known as bhavikdbavishi03:53
*** evrardjp has quit IRC04:23
*** bhavikdbavishi has quit IRC05:26
*** bhavikdbavishi1 has joined #zuul05:26
*** bhavikdbavishi has joined #zuul05:29
*** bhavikdbavishi1 has quit IRC05:30
*** evrardjp has joined #zuul05:41
*** dpawlik has joined #zuul06:09
*** saneax has joined #zuul06:15
*** bhavikdbavishi has quit IRC06:47
*** ysandeep|away is now known as ysandeep06:53
*** ysandeep is now known as ysandeep|afk07:11
*** rpittau|afk is now known as rpittau07:12
*** harrymichal has joined #zuul07:13
*** threestrands_ has joined #zuul07:18
*** threestrands_ has quit IRC07:18
*** threestrands has quit IRC07:21
*** jcapitao has joined #zuul07:26
*** bhavikdbavishi has joined #zuul07:49
*** jpena|off is now known as jpena07:55
*** ysandeep|afk is now known as ysandeep08:02
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output  https://review.opendev.org/72119208:04
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output  https://review.opendev.org/72119208:08
avassAJaeger: re ^08:17
AJaegerthanks, avass08:18
avassIt's failing because ensure-tox installs tox in ~/.local/bin, which isn't in PATH, so fetch-tox-output fails since it wasn't using the 'tox_executable' variable08:18
avassbut it looks like setting a variable at the job level doesn't let ansible override it later, so the test fails08:19
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output  https://review.opendev.org/72119208:19
avassI guess the tox roles should be tested by completely removing tox in a pre-run08:21
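A rough illustration of the change being discussed -- not the actual zuul-jobs task, and the work-dir handling here is assumed -- calling tox through the cached 'tox_executable' variable instead of relying on `tox` being on $PATH:

```yaml
# Illustrative only: use the executable that ensure-tox recorded
# (possibly ~/.local/bin/tox) rather than a bare `tox`.
- name: Ask tox where its environments live
  command: "{{ tox_executable | default('tox') }} --showconfig"
  args:
    chdir: "{{ zuul_work_dir | default(ansible_user_dir) }}"
  register: tox_showconfig
  changed_when: false
```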
*** sshnaidm has joined #zuul08:22
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output  https://review.opendev.org/72119208:25
AJaegerfun ;/08:25
*** tobias-urdin has joined #zuul08:27
*** swest has quit IRC08:32
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output  https://review.opendev.org/72119208:43
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output  https://review.opendev.org/72119208:46
openstackgerritTobias Henkel proposed zuul/nodepool master: Parallelize initial static node creation  https://review.opendev.org/72120509:01
*** tosky has joined #zuul09:01
openstackgerritTobias Henkel proposed zuul/nodepool master: Parallelize initial static node synchronization  https://review.opendev.org/72120509:02
*** sgw has quit IRC09:09
*** swest has joined #zuul09:11
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output  https://review.opendev.org/72119209:19
openstackgerritTobias Henkel proposed zuul/nodepool master: Parallelize initial static node synchronization  https://review.opendev.org/72120509:20
openstackgerritAlbin Vass proposed zuul/zuul master: Enables whitelisting and configuring callbacks  https://review.opendev.org/71726009:28
*** ysandeep is now known as ysandeep|lunch09:34
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Use cached 'tox_executable' in fetch-tox-output  https://review.opendev.org/72119209:47
avassI'm very confused right now, the fetch-spinch-tarball behaves very strange for us09:55
avassfor some reason "dest: {{ zuul.executor.log_root }}/docs-html.tar.bz2" evaluates to something completely different, but only for the unarchive task09:56
avasssee: http://paste.openstack.org/show/792400/09:56
avass*fetch-sphinx-tarball :)09:56
avass^ i mean the "src:" field for unarchive, the "dest:" field is correct09:58
*** saneax has quit IRC10:02
*** swest has quit IRC10:02
*** swest has joined #zuul10:03
*** swest has quit IRC10:04
*** saneax has joined #zuul10:06
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Fail fetch-sphinx-tarball if no html exists  https://review.opendev.org/72122110:12
*** ysandeep|lunch is now known as ysandeep10:13
*** rpittau is now known as rpittau|bbl10:15
*** harrymichal has quit IRC10:21
jkttristanC: do you happen to have an updated version of the runc nodepool handler? My naive backporting seems to have hit a problem, http://paste.openstack.org/show/792403/ (code at https://gerrit.cesnet.cz/plugins/gitiles/github/openstack-infra/nodepool/+log/refs/heads/cesnet/2020-04-19--runc)10:28
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Fail fetch-sphinx-tarball if no html exists  https://review.opendev.org/72122110:33
*** jcapitao is now known as jcapitao_lunch10:37
jkttristanC: ok, that turned out to be easy enough, https://gerrit.cesnet.cz/plugins/gitiles/github/openstack-infra/nodepool/+/f51f5c52c2b37f0814807fa6745ce8313ab45d0d%5E%21/ if you ever need it in softwarefactory10:43
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Use remote_src false for easier debugging  https://review.opendev.org/72123710:52
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src false  https://review.opendev.org/72123710:53
zbravass: can you please help me with https://review.opendev.org/#/c/681532/ ?10:58
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src false  https://review.opendev.org/72123710:58
avassI guess we just couldn't handle bzip2 archives and the error message confused me11:00
avasszbr: Oh, I'm not a maintainer of zuul, just zuul-jobs if it's a review you want :)11:01
zbravass: ah, i forgot that they do not fully overlap.11:03
zbri am watching your changes as we had overlapping work around tox.11:04
zbri started https://review.opendev.org/#/c/690057/ long time ago, and still rebasing :D11:05
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: tox: allow tox to be upgraded  https://review.opendev.org/69005711:09
zbri wonder if we can do anything to make the yml / yaml usage consistent, right now it seems random, even inside the same role.11:15
avasshuh, yeah I usually stick to .yaml11:19
*** jpena is now known as jpena|lunch11:27
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src true  https://review.opendev.org/72123711:28
avassOther way around, whoops11:28
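The remote_src flag being toggled in those patches controls where Ansible's unarchive module expects the archive to live; a minimal illustration (the dest path is made up, this is not the fetch-sphinx-tarball role itself):

```yaml
# With remote_src: true, 'src' must already exist on the host the task
# runs against; with the default (false), 'src' is taken from the Ansible
# controller and copied over before extraction -- which is why the path
# can appear to "evaluate to something completely different".
- name: Extract the docs tarball
  unarchive:
    src: "{{ zuul.executor.log_root }}/docs-html.tar.bz2"
    dest: "{{ zuul.executor.log_root }}/docs"
    remote_src: true
```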
zbrmy impression is that most of the ansible world does the opposite, afaik only zuul prefers yaml.11:32
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml  https://review.opendev.org/72124511:48
AJaegerlike this? ^11:48
*** dpawlik has quit IRC11:50
*** dpawlik has joined #zuul11:50
*** dpawlik has quit IRC11:52
*** dpawlik has joined #zuul11:53
*** dpawlik has quit IRC11:54
*** dpawlik has joined #zuul11:54
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml  https://review.opendev.org/72124511:55
*** bhavikdbavishi has quit IRC11:59
*** swest has joined #zuul12:01
*** rpittau|bbl is now known as rpittau12:03
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: fetch-sphinx-tarball: Do not keep owner of archived files  https://review.opendev.org/72124812:05
zbrAJaeger: i am not a big fan of using bash in tox.ini, especially 'find' as it has portability issues, but it should work for the moment.12:06
zbruntil we end up with some files where the .yml extension is mandatory12:06
*** jcapitao_lunch is now known as jcapitao12:07
AJaegerthen we can remove the test or do it differently12:08
*** rlandy has joined #zuul12:09
tristanCjkt: thanks. though we are about to remove runc from the next version of softwarefactory.12:14
openstackgerritTobias Henkel proposed zuul/zuul master: Detach head before reset  https://review.opendev.org/72045812:20
openstackgerritJan Kubovy proposed zuul/zuul master: Mandatory Zookeeper connection for ZuulWeb  https://review.opendev.org/72125412:21
*** rfolco has joined #zuul12:22
jkttristanC: understood; we're trying to get ourselves migrated to openstack anyway12:25
openstackgerritFabien Boucher proposed zuul/zuul master: URLTrigger driver time based  https://review.opendev.org/63556712:25
*** jpena|lunch is now known as jpena12:33
*** Goneri has joined #zuul12:42
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists  https://review.opendev.org/72125812:42
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists  https://review.opendev.org/72125812:54
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists  https://review.opendev.org/72125812:58
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml  https://review.opendev.org/72124513:00
AJaegersorry, had to update test. mordred, avass, want to review later again, please? ^13:00
avassAjaeger: huh, yeah good catch13:01
avassor wait, what's the difference?13:02
avassI guess grep -v doesn't give an errorcode?13:03
AJaegerthe test failed if there were no .yml files.13:03
AJaegermy local tests weren't good enough ;(13:03
AJaegernow tested better13:03
*** irclogbot_0 has quit IRC13:06
*** irclogbot_3 has joined #zuul13:08
tristanCzuul-maint : the zuul-operator has quite a few changes waiting to be reviewed from https://review.opendev.org/718162 to https://review.opendev.org/720822 , please have a look. Thank you in advance!13:11
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists  https://review.opendev.org/72125813:11
openstackgerritMerged zuul/zuul-jobs master: Update Fedora to 31  https://review.opendev.org/71765713:21
*** sgw has joined #zuul13:24
openstackgerritMerged zuul/zuul-jobs master: Make ubuntu-plain jobs voting  https://review.opendev.org/71970113:25
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: Adds role: ensure-ansible  https://review.opendev.org/72126913:27
avass:)13:27
avassI wonder if we could make the ensure-tox a bit more generic, something like an ensure-<python executable package>13:30
fungii thought the idea was that it was really ensure-<executable name> (which could be provided by distro packages, python packages, et cetera)13:48
avassfungi: yeah sure, but it feels like a lot of the python packages will look the same when installing them13:50
avassfungi: so they could probably be ensure-ansible, ensure-tox etc, but they import a role that does all the work13:51
avassor at least the common parts13:51
fungiso long as the executable name and the package name are the same, maybe13:52
openstackgerritJan Kubovy proposed zuul/zuul master: Driver event ingestion  https://review.opendev.org/71729913:53
avassyeah13:53
bolgzuul-maint: someone please -2 on https://review.opendev.org/c/721254 and https://review.opendev.org/c/717299 - breaking changes (requires Zookeeper for ZuulWeb)13:53
avassit could also separate the installation into the generic role and setting the relevant "executable" facts in the ensure-<executable> role13:54
tobiashbolg: done13:55
bolgtobiash: thanks13:55
fungiavass: also i notice the current implementation of ensure-tox always runs ensure-pip even before testing whether tox is already installed, which is probably slightly wrong (tox itself doesn't require any pip outside the virtualenvs it creates)13:57
avassfungi: yeah that depends on if it finds an already existing installation or not, since it uses pip to install tox if it doesn't13:59
fungiyeah, but the order of operations seems to be: 1. install pip, 2. test whether tox is installed, 3. use pip to install tox if missing14:00
fungi1 and 2 should probably be swapped around14:00
avasswhy do we do that with bash anyway?14:00
-openstackstatus- NOTICE: Zuul is temporarily offline; service should be restored in about 15 minutes.14:00
avassI agree14:00
fungithe bash is there to work out which version of the pip executable should be used14:00
avassah yeah14:01
fungisince there are several names for it, and some python version ambiguity as to which goes with which major python series14:01
avassshouldn't that check for 'pip2' and not 'pip' then?14:03
avassor pip2 > pip > pip314:04
avassif prefer_python214:04
fungiit looks like the logic there assumes that if python2 is installed, then `pip` will relate to it14:06
funginot all platforms provide a `pip2` but i suppose it could be added if needed for one14:07
avassit seems like a safer thing to check for14:09
avassgotta take a break, I'll probably check in later14:12
*** ysandeep is now known as ysandeep|afk14:15
*** cdearborn has joined #zuul14:34
jktis it a good idea to run one nodepool-builder and zuul-exec per cloud?14:42
jkt(looks like our company-internal cloud is constrained by IPv4 quotas, and there's no VPNaaS or anything like that)14:43
mordredjkt: fwiw - everything works fine ipv6-only ;)14:45
mordredjkt: but that might be a good solution for you14:46
jktyeah, unfortunately I've been asking our operators for IPv6 support for more than a year so far :(14:47
jktwe're a research org, our core business is advanced network services, we run our own DWDM, we're building optical ROADMs, my colleagues have been deploying IPv6 for years, but hey, this cloud of ours is IPv4 only </rant>14:48
openstackgerritMerged zuul/zuul-jobs master: Document output variables  https://review.opendev.org/71970414:51
mordredjkt: it's the sad story of the modern world isn't it/14:52
mordred?14:52
mordredespecially given how well ipv6 works and how many "complex" problems go away once it's there14:53
* jkt mumbles something about how trivial IPv6 was in 2014-2015 with nova-network and SLAAC from an external, HW-based L3 switch14:53
openstackgerritJames E. Blair proposed zuul/zuul master: Log alembic migrations by default  https://review.opendev.org/72128314:55
jktbut apart from this, is it OK to have one of several nodepool-launchers + zuul-exec in the cloud directly? (perhaps in a different subproject for, eh, "API safety" so that these service VMs cannot kill themselves)14:55
openstackgerritMerged zuul/zuul-jobs master: Python roles: misc doc updates  https://review.opendev.org/72011114:56
corvusmordred: i think  https://review.opendev.org/721283  will make us happier; that won't affect the current opendev install, but we'll get it when we stop using our custom logging config (and that will help new users not using a custom logging config)14:56
corvusjkt: yep, you'll need to use the executor region feature to route the jobs to the right executor14:57
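The feature corvus mentions is usually called executor zones: the executor is given a zone name in its zuul.conf ([executor] zone=...), and the nodepool pool tags its nodes with a matching attribute so those nodes are only handled by executors in that zone. A sketch of the nodepool side with placeholder names -- the exact attribute key is from memory and should be double-checked against the Zuul executor docs:

```yaml
# nodepool.yaml -- placeholder names; verify the attribute key in the docs
providers:
  - name: internal-cloud
    cloud: internal-cloud
    pools:
      - name: main
        # Believed to be matched against the executor's configured zone.
        node-attributes:
          zuul_executor_zone: internal
        labels:
          - name: ubuntu-bionic
            diskimage: ubuntu-bionic
            flavor-name: m1.small
```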
jktbtw, https://docs.openstack.org/infra/system-config/zuul.html says that there's no need for backups. I think one should still back up per-project keys for providing secrets14:57
corvusjkt: that's a good point14:58
jktthanks15:02
*** dpawlik has quit IRC15:07
*** bhavikdbavishi has joined #zuul15:09
*** bhavikdbavishi1 has joined #zuul15:12
*** bhavikdbavishi has quit IRC15:13
*** bhavikdbavishi1 is now known as bhavikdbavishi15:13
dmsimardhey all, long time no see o/15:26
dmsimardI'm back after some medical leave time off, I wouldn't say I'm healed/cured but doing better now15:27
clarkbdmsimard: good morning/afternoon. Hope you are feeling well15:28
mordreddmsimard: awesome! being better is better than not doing better!15:28
*** yolanda has joined #zuul15:28
dmsimardthanks, sorry about the radio silence :)15:29
dmsimardI'd like to look and discuss what it means to upgrade ara in zuul and I figured it might be worth taking a step back and consider the current zuul-web implementation which is ara-ish15:30
corvusdmsimard: welcome back!15:31
AJaegerdmsimard: welcome back15:31
corvusdmsimard: there is some work-in-progress to make the callback plugin mechanism more generic, so that other callbacks can be used, and we could eventually make ara less 'special' (so that whether to use ara and what version could be a deployer decision)15:32
dmsimardcorvus: ah, a generic callback interface was something I wanted to ask about -- I think making ara less special is fine15:33
corvusdmsimard: and yeah, i'd love to be able to share the dashboard console stuff -- obviously the backend is very different at the moment, but the frontend is very similar, and if there's a way to share at least widgets, if not more, that'd be great :)15:33
corvusdmsimard: https://review.opendev.org/717260 is the wip callback stuff tha avass is working on15:34
dmsimardthanks, looking15:34
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Use main.yaml, not .yml  https://review.opendev.org/72124515:45
-openstackstatus- NOTICE: Gerrit will be restarted to correct a misconfiguration which caused some git mirrors to have outdated references.15:48
dmsimardcorvus: yeah, the approach makes sense and at first glance provides the necessary bits to configure ara -- I'll try to give it a spin15:51
*** ianychoi_ is now known as ianychoi15:53
dmsimardre: dashboard bits, I'm not a javascript person but I understand that react components can be built in a way to be shared/embedded so there is definitely an opportunity to collaborate there15:54
AJaegercorvus, mordred, could either of you look at https://review.opendev.org/#/c/721221/ for zuul-jobs once you're done with firefighting, please? It makes the docs job fail if html is stored elsewhere and can't be found15:54
dmsimardthere is ara-web which is meant to use the new ara api but it's still very WIP15:54
dmsimardfwiw, there is a simple javascript-less built-in interface shipping with the api server now: https://api.trunk.demo.recordsansible.org/15:55
dmsimardor in API version: https://api.trunk.demo.recordsansible.org/api/v1/playbooks15:55
tristanCregarding ansible console reporting, it would be nice to consider nested playbook execution. It seems like once a project uses a custom plugin, the only way to run the code is to make the zuul job run another ansible-playbook command, which results in an opaque `run ansible-playbook` task in the reporting.15:58
tristanCiiuc ara 1.x implements an API that multiple ansible-playbook commands could use, and iirc zuul_stream may be adapted to support adding an extra job-output.json to the job console report16:00
clarkbtristanC: I think opendev currently does not want that as part of its CD rollout so making it optional would be a good thing16:04
clarkb(we want to keep the details of nested runs hidden until we can confirm they don't expose secrets)16:04
dmsimardtristanC: in the context of nested ansible runs, if zuul's ansible and the nested ansible both use the ara callback and report to the same location, you could have both reports (zuul and nested) in a single place16:04
dmsimardbrb, lunch16:05
tristanCclarkb: sure, i guess the only way to make this work would be that zuul exposes an ansible.cfg blob that nested-playbook would need to use (opt-in) to be part of the build report.16:07
*** rpittau is now known as rpittau|afk16:07
openstackgerritMerged zuul/zuul master: Log alembic migrations by default  https://review.opendev.org/72128316:09
mordredtristanC: I've pondered writing an ansible_playbook module (similar to the puppet one we wrote)16:11
corvusnote that nested runs don't generally run on the executor, so the installation environment will be different; there may not be much to be gained by integrating with zuul's ansible or ansible.cfg there.  maybe a better approach is to have a module like mordred suggests, and have easy flags to enable ara or other callbacks16:13
*** jcapitao has quit IRC16:14
corvusi think there's more integration that could happen on the other side: integrating ara or zuul callback reporting data together after the run.16:14
*** yolanda has quit IRC16:17
tristanCcorvus: perhaps the nested callback could be configured to write its report in ~/zuul-output, and then the reporting would load that file and present it on the build page?16:18
openstackgerritMerged zuul/zuul-jobs master: Fail fetch-sphinx-tarball if no html exists  https://review.opendev.org/72122116:19
corvustristanC: yeah, that sounds good16:19
tristanCi mentioned an ansible.cfg blob because ara would likely need to know what the ara-api endpoint location is and how it should annotate the nested playbook to be attached to the build report16:20
mordredyeah - so - the ara api endpoint is one of the things that potentially concerns me about integrating new ara with zuul (although it makes total sense as an approach for ara's main target audience)16:21
mordredwith that structure, it would require us to have each job run an ara service so that the callback could report - and then we'd still need to do $something to have that ara service export the data to json files that we could upload to log storage so that the web stuff could read them16:22
mordredthe second part should be easy enough - it's just json data it's looking for after all16:22
mordredbut the first part - requiring a running service endpoint - starts to get rather complicated16:22
mordredbut I might be missing something - admittedly I haven't played with it too much recently so there's likely a ton I don't know16:23
*** fdegir has joined #zuul16:25
*** yolanda has joined #zuul16:29
*** evrardjp has quit IRC16:35
*** evrardjp has joined #zuul16:35
tristanCmordred: iiuc the proposed plan is to have an api per zuul, or per tenant, not per build16:43
mordredtristanC: that ara would have to store _all_ of the playbook results though right? then a given job would need to get the logs of just its playbook runs16:49
*** kklimonda has quit IRC16:54
dmsimardmordred: the api server isn't required when recording data locally, there is an "offline" client implementation which spins up an ephemeral server in the background for the duration of the run16:54
*** kklimonda has joined #zuul16:55
tristanCmordred: it seems like it, yes. iirc ara features labels which we could use to attach the zuul build uuid to each playbook result16:55
dmsimardlet me get an example for the labels16:56
tristanCdmsimard: when using the offline client, what would happen with the data at the end of the run?16:56
dmsimardwith some experimental zuul labels, example: https://api.demo.recordsansible.org/?label=zuul.executor:ze07.openstack.org16:58
*** sshnaidm is now known as sshnaidm|afk16:59
dmsimardtristanC: whatever we want to happen, I guess ? I mean, by default it's still a sqlite database unless the (offline) API server is set up to use a mysql database or something16:59
dmsimardthere is an implementation of the sqlite middleware in 1.x which is similar to the one in 0.x but provides the API on top of the reporting interface17:00
dmsimardotherwise it has static html generation too17:00
dmsimardlabels can be specified on a playbook basis or through ansible.cfg like this: https://opendev.org/recordsansible/ara/src/branch/master/tests/zuul_metadata.yaml -> https://opendev.org/recordsansible/ara/src/branch/master/tests/basic.yaml#L108-L11017:05
AJaegeravass, mordred , could you review https://review.opendev.org/721245 again, please? Now the test works...17:15
*** yolanda has quit IRC17:26
*** bhavikdbavishi has quit IRC17:28
*** bhavikdbavishi has joined #zuul17:29
*** portdirect has quit IRC17:30
*** portdirect has joined #zuul17:30
*** jpena is now known as jpena|off17:32
*** evrardjp has quit IRC17:44
*** evrardjp has joined #zuul17:49
clarkbI think we've got an OSF newsletter going out this week, are there any zuul updates to call out from the last few weeks?18:17
clarkbfungi: avass correct, not all platforms provide `pip` or `pip2` depending on which combo of things you've got18:18
clarkbthe safest thing seems to be: if you know you want a specific version, check for that first, then check if just `pip` is available, as that can be used to install into arbitrary python envs18:18
fungiclarkb: well, except that the system-wide `pip` doesn't actually install into any python envs other than the system context or --user (~/.local)18:23
fungiso it's not even necessary to have a system-wide pip available if tox was already installed through other means18:24
clarkbfungi: yes, but I think we can't do better than that unless you are overriding the pip version18:24
clarkbwhich the roles allow you to do aiui18:24
clarkbcorrect18:24
clarkbyou should set the value for the tox executable path if you've done that18:24
fungibut the ensure-tox role currently installs pip first, then checks to see whether you already have tox18:24
clarkbah I see, that is a bug yes18:25
clarkbbut that is independent of checking for pip vs pip2 vs pip318:25
fungiyes, i agree18:25
fungitwo different suggestions which came out of avass and i looking at the role18:25
fungibut we should probably reevaluate those concerns in the context of zbr's ensure-tox overhaul too, i suspect18:26
clarkbwith pip vs pip2 vs pip3 we don't really have any good options there. What I went with should be the least surprising thing, which is: use the version-specific pip if available, else use plain pip, which can be version specific if you choose it to be18:26
*** yolanda has joined #zuul18:46
*** yolanda has quit IRC18:46
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Only delete variables tempfile when it exists  https://review.opendev.org/72125818:55
*** maxamillion has quit IRC19:08
*** maxamillion has joined #zuul19:11
*** bhavikdbavishi has quit IRC19:18
fungiclarkb: well, the actual selection order seems to be: if python2 is preferred try pip then pip3, if python3 is preferred try pip3 then pip19:23
clarkbhrm they should've been the same but maybe I got that wrong?19:24
fungiit's possible for some platforms (recent fedora with a transitional python2 installed maybe?) to have pip2 and pip3 but not pip19:24
fungii gather the intended logic in that role was to fall back to the non-preferred python version's pip executable19:25
clarkbhttps://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-pip/tasks/main.yaml#L19-L22 its version specific first each time19:25
clarkbthat isn't selecting a version of pip, merely deciding if one exists that is appropriate19:25
fungiclarkb: this is in ensure-tox19:25
clarkbI see19:25
fungiwhich was also part of avass's point, i think, that we reimplement this selection logic (differently) in multiple roles19:25
clarkbgotcha I read it as being in ensure-pip19:26
clarkband fwiw the selection logic may be different than "does any valid pip exist"19:26
clarkbI don't know if they should be the same or not without thinking about it more19:26
fungiprobably what we could do is have ensure-tox conditionally include ensure-pip when tox is not found, and then use pip to install tox19:27
fungithat way we at least don't unnecessarily choose pip in different ways in different roles19:27
clarkb++19:27
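A sketch of the shape fungi and clarkb agree on here -- not the real zuul-jobs role, and the install command and final paths are simplified assumptions -- check for tox first and only pull in ensure-pip when it is missing:

```yaml
# Illustrative ordering only: 1. look for tox, 2. ensure pip, 3. install.
- name: Check for an existing tox executable
  shell: command -v tox
  register: tox_preinstalled
  failed_when: false
  changed_when: false

- name: Install tox only when it is missing
  when: tox_preinstalled.rc != 0
  block:
    - name: Ensure pip is available first
      include_role:
        name: ensure-pip

    - name: Install tox with pip (pip/pip2/pip3 selection glossed over here)
      command: pip3 install --user tox

    - name: Record where tox ended up for later roles
      set_fact:
        tox_executable: "{{ ansible_user_dir }}/.local/bin/tox"

- name: Default tox_executable when tox was already present
  set_fact:
    tox_executable: tox
  when: tox_preinstalled.rc == 0
```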
*** jtanner has quit IRC19:42
*** jtanner has joined #zuul19:43
*** sgw has quit IRC21:09
*** sgw has joined #zuul21:18
clarkbnodepoolians we've noticed some interesting behavior from nodepool recently where it takes many hours to fulfill node requests. This seems related to failed instance deletes in openstack clouds and quotas being reported inaccurately21:19
clarkbI've tried to summarize what I've found in https://etherpad.opendev.org/p/lF1-vMNVtDqlEH1QwNAZ21:20
clarkbI don't think nodepool is doing anything really wrong here other than perhaps trusting the info from cloud APIs too much21:20
clarkbthat said I've got a suggestions in the etherpad for how nodepool might handle the situation better to avoid user impact21:20
clarkbtobiash: ^ have you seen this before in your usage?21:20
corvusclarkb: well, right at the beginning, why did it accept a request when the predicted remaining quota was -1?21:23
clarkbcorvus: I think because that is how nodepool goes into a paused state21:23
corvusclarkb: ack21:23
clarkbbasically you need a new request to initially transition into paused21:24
corvusclarkb: why do you think it accepted request 706 when it still had 864 pending?21:26
corvusclarkb: do you think it thought it had enough quota for both?21:28
clarkbcorvus: I think because for a short period of time around line 14 of my etherpad it thinks it has room to boot new instances21:28
clarkbso it unpauses and grabs more but then those all fail due to quota errors (and we don't know that until nova responds to us)21:28
clarkbwe have a window from 15:48 to 15:58 roughly where we are unpaused and grabbing additional requests21:29
clarkbthe second request I've added into the etherpad is accepted at 15:53 so is within that window21:29
corvusah got it, line 29 is chronologically around line 17.521:29
clarkbya sorry I organized by request not chronologically21:30
tobiashclarkb: reading backlog21:31
corvusthat's fine, this makes a lot of sense, but the comment on line 27 about being hours later isn't right, is it?21:31
corvusclarkb: or do you mean it's still active many hours later21:31
clarkbcorvus: ya its still pending right now21:31
clarkbor was 5 minutes ago21:31
clarkbbecause we slowly iterate through the 3 node request failures21:31
corvusclarkb: how about this alternative solution: we unlock (without rejecting) the newest pending node requests when we're paused (so that after requests #2 and #3... finish failing their launch attempts, we unlock them.  only request #1 stays locked)21:34
corvusbasically, they get to go back on the pile for some other provider to grab them21:34
clarkbcorvus: I like that I was wondering if we should reject outright but that seems nicer :)21:35
corvusit's sort of an "oops, i never should have accepted this one in the first place" move21:35
corvuswhich, i think is close to reality21:35
corvusclarkb: is there anything more we can learn about the quota situation?  i'm not sure i see specifically what is incorrect here21:36
corvusalso, i need a refresher on 'pool' vs 'provider' quota21:36
clarkbcorvus: if you look at grafana for inap, http://grafana.openstack.org/d/ykvSNcImk/nodepool-inap?orgId=1&from=now-24h&to=now, we see that we've had a prolonged state with the majority of nodes in a deleting state21:37
clarkbcorvus: that implies to me that inap actually has all of those instances stuck in a deleting state, but when you ask the nova api for quota usage we get told that we have 157 instances available21:37
clarkblet me try and reproduce that manually via bridge21:38
corvuscause it says 'provider quota' is 0, but 'pool quota' is 156/721:38
clarkboh I missed that distinction21:40
clarkbrunning manually we have ~188 instances in a nova show output and limits show --absolute says we are using 196 instances21:40
tobiashpool quota should be the self constrained quota based on the pool config, provider quota is the quota of the cloud tenant21:40
clarkbso that is actually looking reasonably accurate21:40
clarkbcorvus: I wonder if that is actually the underlying thing here21:41
clarkbbecause if the pool quota doesn't show 156 instances we should never unpause21:41
clarkbI had attributed it to openstack because we've seen it get quotas wrong before, but maybe this is a nodepool bug?21:41
tobiashit looks to me that nodepool locks a request, then pauses and waits for resources to get free which doesn't happen because of hanging deletes?21:42
clarkbtobiash: ya, but as corvus points out we still show predicted provider instances is 0 but then pool is 15721:43
clarkband I think we decide based on the pool number we can unpause21:43
clarkbdespite the provider not having enough capacity there (and nodepool knowing that)21:43
corvustobiash: i think the new issue here is that it's happening for multiple node requests (not just one -- it's okay if just one takes a few minutes to unpause)21:43
corvusclarkb: but line 15 in your etherpad indicates that we expect the provider to be able to accomodate the request21:46
corvusclarkb: er, i guess the same line at line 30 is more relevant21:46
clarkbcorvus: I think instances: 0 means we don't have any room21:46
corvusclarkb: but that's predicted remaining21:46
clarkboh I see21:47
corvusso it's saying "this is the last request i can accept without hitting the quota limit"21:47
corvusand then it hits the quota21:47
clarkbgotcha so we are right on the line but actually about 150 instances away from it?21:47
tobiashthe provider quota is potentially inaccurate21:47
tobiashso when it hits the provider quota it pauses and invalidates the quota cache21:48
clarkbalso if we are right at quota that number may be accurate, but since we have ~20 active nodereuqests only one of them will win the race to get that free slot21:48
tobiashbut then it might already have taken multiple requests21:48
clarkbI think the key here is to avoid adding more noderequests because it races quota usage21:48
clarkbwhereas if we have a single outstanding request when paused then we know we'll service that one with the next available resources?21:48
clarkbcorvus: ^ basically that means your idea may still be the proper fix here21:49
tobiashwhen we hit the provider quota we hit it asynchronously (when we try to spawn the instance)21:49
tobiashat that point in time nodepool might have already taken more than one request21:49
corvusso you're saying that the quota usage is so inaccurate that we think we can accept, say 10 node requests when we can really accept 0, and all 10 get stuck in the paused state?21:49
clarkbtobiash: yup its got like 20 noderequests active21:49
clarkbcorvus: probably more like 9/10 get stuck in paused state21:49
tobiashpotentially21:49
clarkbcorvus: it is likely that one of them is succeeding since there was probably some quota available21:50
tobiashdo you run multiple providers against the same cloud?21:50
clarkbtobiash: not in this cloud21:50
clarkb(others do though)21:50
corvusi still wish i understood what was wrong with the quota; like how it got so far out of sync21:50
clarkbnormally this is fine because deletes actually free up space21:51
clarkbcorvus: maybe the predicted value isn't considering the other active requests properly?21:51
clarkbsay we have 10 requests for 1 node each. If we have 1 free instance quota available, do we predict we can fulfill all 10 of those?21:51
corvusclarkb: maybe, but i'm pretty sure nothing would work if that were the case...21:51
tobiashcan it be that nova lies about the quota?21:52
corvuswe only accept one node request at a time.  we process them in parallel21:52
clarkbtobiash: it could be, though I'm not seeing that right this moment21:52
corvusso it'd be pretty hard for us to get that part wrong.21:52
corvusclarkb: didn't it look like it was reporting we used 196 instances when we really have 188?  granted, that's wrong in the wrong direction, but still...21:53
tobiashwe saw in our clouds that nova quota query returned say 20 cpus left and spawn still said not enough quota21:53
clarkbanother thought. Is it possible that predicted quota assumes the deletes will succeed in a reasonable amount of time?21:53
clarkbcorvus: yup right now its 194 from nova list but 196 from instances used quota accounting21:53
corvusclarkb: can you find and add "Provider quota for inap-mtl01" log lines to the etherpad?21:56
corvusclarkb: same for "Available quota for inap-mtl01"21:56
clarkbya21:56
corvusthe first line is "what did nova limit just tell us", and the second is the same with unmanaged quota subtracted.21:57
corvuswe only periodically update that; so the most recent pair of those log lines before "Predicted remaining..." is what it's going to be working with21:58
corvusclarkb: oh, it looks like 196 is the number of instances available, not used?21:59
tobiashdo you have unmanaged quota in that tenant?22:00
clarkbtobiash: no22:00
clarkbcorvus: maybe? or maybe it's the max value from quota limits and not the total used value?22:00
clarkbthe limit and not usage maybe22:00
clarkb(or maybe thats what you meant)22:00
corvusclarkb: yes, nodepool is looking at the limit22:00
corvus"Provider quota for inap-mtl01: {'compute': {'instances': 196, 'cores': inf, 'ram': inf}}" means "the limit for this account is 196 instances"22:01
corvus"Available quota for inap-mtl01: {'compute': {'instances': 34, 'cores': inf, 'ram': inf}}" means "there are 162 instances of unknown origin in this account leaving us with room for 34 nodepool instances"22:02
corvustobiash: ^ is that right?22:02
tobiashlet me check22:02
tobiashcorvus: yes, that should be right22:04
clarkbseparately I'm going to try manually deleting one of the instances that nodepool is trying to delete to see why that is failing22:04
corvusit gets the 162 number by listing all of the nova instances and checking to see if they have nodepool metadata.22:04
clarkbcorvus: hrm I wonder if I wasn't completely wrong about the metadata being inappropriately removed then22:05
corvusso it may be that our stuck-in-deleting instances have lost their nodepool metadata (clarkb: can you check that?)22:05
clarkbcorvus: I can try!22:05
tobiashcorvus: that sounds at least plausible22:05
clarkbcorvus: is it possible that is related to the userdata base64 thing you discovered too?22:05
clarkbwhere the clients don't return the data to us in some cases?22:05
tobiashif the metadata vanishes it would result in such an effect22:05
corvusclarkb: i wouldn't expect that22:06
corvusclarkb: the userdata thing was a definite "we don't want to print this out on the cli" change22:06
corvusi wouldn't expect it to disappear from sdk return values.  but i haven't confirmed anything related to that22:06
clarkbcorvus: got it22:06
clarkbalso fwiw it seems that inap just managed to delete those instances and now we have lots of room :/22:07
clarkbthis also means it is hard for me to check for metadata22:07
corvusclarkb: oh did you delete them all? :)22:07
clarkbcorvus: I didn't :)22:07
clarkbthis was entirely the cloud itself catching up22:07
corvusoh "neat"22:07
clarkbI didn't attempt to delete a single one myself yet22:07
corvuseven if the metadata is gone, this system should still work22:07
corvusthey should just be counted as external-to-nodepool22:08
corvusi mean, it might take one quota error to happen in order to cause a cache refresh, but after that, it should return to correct behavior22:08
clarkbI'll take it as a TODO to check metadata if we catch it in this state again in the future (because that will be useful to know either way)22:08
corvusso then the line "Predicted remaining provider quota" is 34 - what nodepool thinks it's using - the new node requirements22:11
corvusso essentially, we can deduce that at a point where nodepool said "predicted remaining provider quota: 0 instances" it thought it had 33 instances under its control22:11
clarkbwhy do we say 0 in that case?22:13
corvusthe predicted remaining is after accounting for the current request22:14
corvusso nodepool saw 34 slots total for nodepool's use, 33 were used by nodepool, 1 was available, and it was looking at a request for 1 node22:15
clarkbgot it22:15
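Pulling the figures from the log lines quoted above into one place, the arithmetic behind that "predicted remaining provider quota: 0" works out as:

```yaml
# Recap of the numbers discussed above (instances):
#   nova limit for the tenant ("Provider quota"):        196
#   instances without nodepool metadata (unmanaged):     162
#   available to nodepool ("Available quota"):           196 - 162 = 34
#   nodes nodepool believed it controlled:                33
#   nodes needed by the request being considered:          1
#   predicted remaining provider quota:                   34 - 33 - 1 = 0
```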
corvusclarkb: did anything happen between 15:48 and 15:58 for nodepool to think more room may be available?22:15
corvusclarkb: cause now i'm wondering, if it was already dealing with 864, why it would also accept 70622:15
corvusit's interesting that it refreshed the limits and external-to-nodepool limits between those times, but the values did not change22:17
corvusso it must have thought that 2 slots opened up?22:17
clarkblet me see22:17
*** tosky has quit IRC22:19
clarkbthere is a lot of "cannot find provider pool for instance" in the logs around then22:20
clarkbwhich from earlier debugging implies the instance's metadata was removed from nova metadata22:20
clarkb(because we create phony instance data in zk to handle leaked nodes, but then those end up without metadata I think)22:21
corvusclarkb: yep.  that line indicates that instance should be counted in the 'external-to-nodepool' count22:21
clarkbthat might be something we can log more clearly to make this debugging easier in the future22:21
corvusoh wait22:21
corvusclarkb: can you paste one of those log lines?22:22
corvusbecause if it's "Cannot find provider pool for node %s"  then we actually aren't accounting for that node and that's worth digging into22:27
clarkbcorvus: I can. fwiw it seems the "2020-04-20 15:48:52,257 DEBUG nodepool.driver.NodeRequestHandler[nl03-28001-PoolWorker.inap-mtl01-main]: [node_request: 300-0008521864] Retrying node request" line is the triggering thing and I can't find anything that triggers that22:27
clarkbso maybe we have a timeout?22:27
clarkbwhere we check in on a paused request status periodically?22:27
clarkb2020-04-20 15:48:52,604 WARNING nodepool.driver.openstack.OpenStackProvider: Cannot find provider pool for node <Node {'connection_port': 22, 'state_time': 1587391454.6109967, 'comment': None, 'launcher': None, 'pool': None, 'image_id': None, 'state': 'deleting', 'allocated_to': None, 'attributes': None, 'hostname': None, 'region': None, 'public_ipv4': None, 'cloud': None, 'host_id': None, 'id': '0016056699',22:28
clarkb'public_ipv6': None, 'external_id': '33ba9c42-379f-4a0e-8028-4e2db2a21ff5', 'python_path': None, 'connection_type': None, 'interface_ip': None, 'stat': ZnodeStat(czxid=85918431802, mzxid=85918431802, ctime=1587391430355, mtime=1587391430355, version=0, cversion=1, aversion=0, ephemeralOwner=0, dataLength=639, numChildren=1, pzxid=85918433437), 'private_ipv4': None, 'az': None, 'provider': 'inap-mtl01',22:28
clarkb'hold_job': None, 'username': None, 'ssh_port': 22, 'hold_expiration': None, 'resources': None, 'type': [], 'host_keys': [], 'created_time': 1587391430.3534877}>22:28
clarkboh sorry didn't realize it would get broken up like that22:28
clarkbthats what the cannot find provider pool errors look like22:28
clarkb2020-04-20 12:12:01,790 INFO nodepool.DeletedNodeWorker: Deleting deleting instance 33ba9c42-379f-4a0e-8028-4e2db2a21ff5 from inap-mtl0122:29
clarkb2020-04-20 12:12:01,840 INFO nodepool.NodeDeleter: Instance 33ba9c42-379f-4a0e-8028-4e2db2a21ff5 not found in provider inap-mtl0122:29
corvusclarkb: you say we made that znode because something leaked?22:29
clarkbcorvus: yes I believe so and I think the above explains it22:29
clarkbbasically we boot that instance, it fails for some reason, so we delete it. Then nova says it doesn't exist and we remove our znode for it22:30
corvusbut then it gets created?22:30
clarkbthen later we check for leaked instances and find 33ba9c42-379f-4a0e-8028-4e2db2a21ff5 and add a new znode with less info (no pool info for example) so that we can attempt to clean the leaked instance up22:30
clarkbcorvus: ya I think the trigger for those znodes is leak cleanup checking for instances it should delete. Any it finds, it creates these phony znodes for, with less info in them22:31
clarkb(because it's generated from lossy openstack api returns)22:31
fungioh that's a fun race22:31
corvusclarkb: huh.  it seems like we need more metadata?22:31
corvuslike, why isn't there a pool entry in there?22:31
clarkbcorvus: because nova list doesn't tell us how the instance originated from nodepool22:32
corvus(how is it possible there's enough metadata for nodepool to decide it's responsible for it, but not enough to account for it)22:32
clarkbcorvus: ya it's possible we aren't shipping enough data into metadata22:32
clarkblet me find the code again22:32
corvusclarkb: nodepool is only going to think it should delete an instance if there's enough metadata for it to recognize it as its own22:32
clarkbhttps://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L159-L166 is the source of the error; https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L532-L537 is where we create the phony record22:33
corvusclarkb: seems like that second link should pull in the pool as well, yeah?22:34
corvusthen the instance should be properly accounted for?22:34
clarkbcorvus: ya I'm beginning to think that would be the case22:34
clarkbso if we set that as part of the metadata we'd be able to write it back in again (probably cross check it against valid pools?)22:34
clarkboh we cross check at https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L159-L166 so simply setting the value may be sufficient22:35
clarkbto summarize: we think that nova reporting that instances don't exist, but then having them appear later, leads to us not accounting for those instances when we try to clean up leaked instances. This is because the pool information matters when calculating quotas and we lose it for leaked instances22:36
clarkbwe can make that quota calculation more accurate by setting the pool info on leaked instance phony records so they are accounted as belonging to the pool quota wise22:36
corvusclarkb: the lack of the znode.pool attribute means the znode isn't being counted in the nodepool quota used, but if the 'nodepool_provider_name' metadata is on the instance, and matches this provider, then the instance is not going to be counted as unmanaged quota.22:38
clarkbcorvus: gotcha so we need both pieces of info to properly line up the quota calculations22:38
*** sanjayu_ has joined #zuul22:38
corvusclarkb, tobiash: ^ so if that sounds right, then yeah, i think the fix may be to just add the pool info back22:39
corvushere: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L532-L53722:39
*** saneax has quit IRC22:41
clarkbhttps://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L340 and there (in order to get it back later)22:42
corvusoh it's not in the md in the first place.  then yeah.22:45
corvusthat explains why we didn't copy it back :)22:45
clarkbyup I have a change that outlines this22:45
clarkbjust writing a commit message now22:46
clarkbI doubt it's correct but it should be enough for us to not forget this debugging session22:46
openstackgerritClark Boylan proposed zuul/nodepool master: Set pool info on leaked instances  https://review.opendev.org/72135922:47
clarkbcorvus: tobiash ^ I think that's roughly what we want. My biggest concern right now is I'm not sure I lined up all the pool types properly (I think we have both name and dict versions of pools depending on context)22:48
*** sanjayu__ has joined #zuul22:50
*** sanjayu_ has quit IRC22:52
*** sanjayu__ has quit IRC22:52
clarkbok I think the types are correct based on how we handle deleting a node later22:54
clarkb(in the non leaked case)22:54
corvusclarkb: comments22:55
clarkbhttps://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/handler.py#L271-L28622:55
clarkbcorvus: ah yup I thought it was available, then realized I had to pass it in, then didn't update my use of it22:55
openstackgerritClark Boylan proposed zuul/nodepool master: Set pool info on leaked instances  https://review.opendev.org/72135922:57
clarkbcorvus: ^ thanks!22:57
corvusclarkb: thank you!  tobiash ^ whenever it's a reasonable time for you again :)22:59
corvusclarkb: maybe let's try this out for a bit and see if we avoid the situation; i'm not convinced we should implement the workaround we talked about earlier unless it's necessary (it would have masked this, and it's more churn than if nodepool actually has accurate quota)23:00
clarkb++ accurate quota would be best23:01
*** threestrands has joined #zuul23:18
openstackgerritMohammed Naser proposed zuul/zuul-jobs master: helm-template: enable using values file  https://review.opendev.org/72136523:34
