clarkb | Downloading http://mirror.bhs1.ovh.openstack.org/pypi/packages/af/c6/904651ff18e647e37351ca61a183218d3773c60f16d49c2b2756235b0fd4/yarl-1.2.1-cp35-cp35m-manylinux1_x86_64.whl (252kB) | 00:17 |
clarkb | yarl requires Python '>=3.5.3' but the running Python is 3.5.2 | 00:17 |
clarkb | that is currently breaking zuul jobs | 00:17 |
clarkb | I'm not in a spot to debug it but if people are curious ^ yarl comes from aiohttp | 00:18 |
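The failure clarkb pastes above is pip enforcing yarl's Requires-Python metadata. A minimal sketch of that check using the `packaging` library (which pip vendors), assuming yarl 1.2.1 declares `>=3.5.3` as the error says:

```python
# Sketch of the interpreter check pip applies against yarl 1.2.1's
# Requires-Python metadata (">=3.5.3", per the error above).
from packaging.specifiers import SpecifierSet

requires_python = SpecifierSet(">=3.5.3")
print("3.5.2" in requires_python)  # False - the interpreter running the jobs
print("3.5.3" in requires_python)  # True
```

The pin that merges later in this log (review 565470) works around this by constraining yarl on interpreters older than 3.5.3.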
*** swest1 has joined #zuul | 01:34 | |
*** swest has quit IRC | 01:34 | |
*** spsurya has joined #zuul | 01:43 | |
tristanC | tobiash: fwiw, the re2 change broke status requirement written like 'status: "sf-io[bot]:local/check:success"' | 04:29 |
tristanC | e.g. when you want to require a check success from your zuul, instead of any ".*:success" as you pasted | 04:32 |
tristanC | when upgrading to 3.0.2, you have to escape regexp token in status requirement | 04:36 |
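A sketch of the breakage tristanC describes, using the stdlib `re` module for illustration (Zuul 3.0.2 switched to re2, but the escaping requirement is the same): interpreted as a regex, `[bot]` becomes a character class, so the literal status string no longer matches itself.

```python
import re

status = "sf-io[bot]:local/check:success"

# As a regex, '[bot]' matches a single character from {b, o, t},
# so the pattern no longer matches the literal status string:
print(re.fullmatch(status, status))             # None
# Escaping the regex metacharacters restores a literal match:
print(re.fullmatch(re.escape(status), status))  # <re.Match object ...>
```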
*** pwhalen_ has joined #zuul | 05:06 | |
*** pwhalen has quit IRC | 05:07 | |
* SpamapS drums fingers on desk while nodepool sits and does nothing | 06:43 | |
*** sigmavirus24 has quit IRC | 06:52 | |
tobiash | tristanC: oh, that was unexpected. Maybe we should string match and fallback to regex match if that fails | 06:52 |
tristanC | tobiash: fallback might be confusing too. well now that the change is released i think it's fine | 07:00 |
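For reference, the fallback tobiash floated (and tristanC found potentially confusing) would look roughly like this; a hypothetical sketch, not actual Zuul code:

```python
import re

def status_matches(required: str, actual: str) -> bool:
    # Exact string comparison first, so unescaped literals such as
    # "sf-io[bot]:local/check:success" keep working...
    if required == actual:
        return True
    # ...then fall back to treating the requirement as a regex.
    try:
        return re.fullmatch(required, actual) is not None
    except re.error:
        return False
```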
*** hashar has joined #zuul | 07:01 | |
tobiash | SpamapS: you seem to have this problem often. Is your environment differently constrained than others? E.g. providers with low quota? | 07:10 |
*** hashar is now known as hasharAway | 07:42 | |
*** ssbarnea_ has joined #zuul | 07:57 | |
*** ssbarnea_ has quit IRC | 08:13 | |
SpamapS | tobiash: I have one provider that is always at 0 max-servers so that I can rapidly fail over to it. Maybe that's a mistake? | 08:21 |
SpamapS | (it is the most constrained cloud we have, so I try not to use it, but if the other one goes down..) | 08:21 |
SpamapS | currently using about 75x8GB nodes on our less constrained cloud | 08:22 |
tobiash | SpamapS: the provider with max-servers 0 should decline every request | 08:22 |
SpamapS | it does | 08:22 |
SpamapS | so I think it's not the problem | 08:22 |
SpamapS | what I see is that the other one gets stuck | 08:22 |
tobiash | SpamapS: any logging about quota? | 08:23 |
SpamapS | Not sure where exactly, but I suspect it's a zk lock or something | 08:23 |
SpamapS | nothing about quota | 08:23 |
tobiash | you can inspect locks by telnetting to zk and typing 'dump' | 08:23 |
tobiash | that should list all sessions together with their locks | 08:23 |
tobiash | maybe that helps | 08:24 |
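The same `dump` four-letter command can be sent over a plain socket instead of telnet; a small sketch (note that `dump` lists sessions and their ephemeral nodes, i.e. the locks, and only works against the ZooKeeper leader):

```python
import socket

def zk_dump(host: str = "localhost", port: int = 2181) -> str:
    """Send ZooKeeper's 'dump' four-letter command, as one would via telnet."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(b"dump")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

print(zk_dump())  # sessions and the ephemeral nodes (locks) they hold
```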
SpamapS | It might | 08:24 |
SpamapS | I suspect locks mainly because it is inconsistent | 08:25 |
SpamapS | sometimes the thing fires up right away | 08:25 |
SpamapS | and sometimes jobs are queued for 15+ minutes | 08:25 |
SpamapS | (with no other jobs running, and more than enough nodes available, and none building) | 08:25 |
SpamapS | It gets stuck between "Accepting node request ..." and the next thing to log.. | 08:26 |
tobiash | hrm, did you notice any zk instabilities? | 08:31 |
tobiash | it is very sensitive to inconsistent io performance | 08:32 |
SpamapS | Not really.. it almost doesn't register | 08:34 |
SpamapS | 2018-05-01 01:32:21,515 INFO nodepool.PoolWorker.a-main: Assigning node request <NodeRequest {'state': 'requested', 'nodes': [], 'stat': ZnodeStat(czxid=7873512, mzxid=7873589, ctime=1525163448214, mtime=1525163495027, version=1, cversion=0, aversion=0, ephemeralOwner=98858567839189105, dataLength=217, numChildren=0, pzxid=7873512), 'declined_by': ['zuul.cloud.phx3.gdg-24752-PoolWorker.p-main'], 'requestor': | 08:34 |
SpamapS | 'zuul.cloud.phx3.gdg', 'node_types': ['g7', 'g7', 'g7'], 'id': '200-0000026032', 'reuse': True, 'state_time': 1525163495.026352}> | 08:34 |
SpamapS | That line is where it is stuck right at this moment | 08:34 |
SpamapS | though it just let go | 08:34 |
tobiash | can you get a thread dump? | 08:34 |
SpamapS | probably | 08:34 |
SpamapS | would need to run nodepool in pdb I suppose? | 08:35 |
SpamapS | This time it only queued for 3 minutes :-P | 08:35 |
tobiash | SpamapS: just send sigusr2 to it, it should write the thread dump into the log | 08:35 |
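The pattern behind that is a signal handler that writes every thread's stack into the log; an illustrative sketch, not nodepool's exact code:

```python
import logging
import signal
import sys
import threading
import traceback

def dump_threads(signum, frame):
    log = logging.getLogger("threaddump")
    names = {t.ident: t.name for t in threading.enumerate()}
    # sys._current_frames() maps thread ids to their current stack frames
    for ident, stack in sys._current_frames().items():
        log.info("Thread %s (%s):\n%s", ident, names.get(ident, "?"),
                 "".join(traceback.format_stack(stack)))

signal.signal(signal.SIGUSR2, dump_threads)
```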
SpamapS | http://paste.openstack.org/show/720170/ | 08:36 |
SpamapS | that's the time it was doing nothing | 08:36 |
SpamapS | zk was quiescent.. thing was just sitting there | 08:37 |
SpamapS | tobiash: ah didn't know that | 08:37 |
tobiash | when it hangs, check if the request is set to pending | 08:38 |
tobiash | it should do that right after that log message | 08:39 |
tobiash | and it looks like _waitForNodeSet doesn't log anything | 08:40 |
tobiash | so you might want to add additional debugging info into http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/driver/__init__.py#n239 | 08:41 |
SpamapS | the request is set to 'requested' when it is paused | 08:42 |
tobiash | maybe you're running into a corner case in that function | 08:42 |
SpamapS | yeah that's kind of what I assume | 08:43 |
SpamapS | just haven't had time to sit with it | 08:43 |
tobiash | at least it should not switch to paused without logging it | 08:44 |
SpamapS | http://paste.openstack.org/show/720172/ | 08:47 |
SpamapS | there's the thread dump | 08:47 |
SpamapS | a-main is the "working" cloud, p-main is the one with max-servers: 0 | 08:48 |
*** electrofelix has joined #zuul | 08:48 | |
SpamapS | hm | 08:49 |
SpamapS | I did just notice there are a ton of broken queued images in the alien image list | 08:49 |
SpamapS | makes it take a while to list images | 08:50 |
SpamapS | I suppose I should delete all the aliens | 08:50 |
tobiash | it can slow down processing | 08:50 |
SpamapS | yeah if it does that a few times | 08:50 |
SpamapS | alien-image-list takes ~60s | 08:50 |
tobiash | are you logging the openstack api requests? | 08:50 |
SpamapS | no | 08:50 |
tobiash | you should | 08:50 |
SpamapS | well just the debug level stuff | 08:50 |
SpamapS | but not like, GET's/PUT's | 08:51 |
tobiash | I mean the tasks of the provider request queue | 08:51 |
tobiash | it also logs the queue size | 08:51 |
SpamapS | yeah those stream | 08:51 |
tobiash | so it's not that create-server is starved because of image listing? | 08:53 |
tobiash | does this happen under quota pressure? | 08:53 |
SpamapS | looks like back around march 15 something horrible happened and we were thrashing on image uploads | 08:53 |
SpamapS | tobiash: create server isn't needed | 08:54 |
SpamapS | once it gets past this it just grabs the existing ready nodes | 08:54 |
tobiash | ah ok | 08:54 |
SpamapS | I have min-ready: 15 for the types that we use a lot | 08:54 |
SpamapS | I'm deleting all the image records. they aren't even uploaded, just 'queued' | 08:55 |
SpamapS | interesting.. I will say that now that we have enough jobs where we run 10 at a time on every PR.. it's time for a second executor. :-P | 08:58 |
SpamapS | 8 vcpu 16GB VM is hitting loads near 20 | 08:58 |
tobiash | SpamapS: did you configure 'rate' for your provider? | 08:59 |
tobiash | https://zuul-ci.org/docs/nodepool/configuration.html#openstack-driver | 08:59 |
tobiash | the default is 1 which means it sends at max one request per second to the cloud | 08:59 |
tobiash | which is often far too low | 09:00 |
SpamapS | tobiash: i didn't, that does sound slow | 09:01 |
SpamapS | and if it's busy just trying to look at server details instead of list images and fetch their details... | 09:01 |
tobiash | you have quite a few NodeDeleter threads in your dump so maybe it is under quota pressure because it's not cleaning up fast enough -> paused handler | 09:01 |
tobiash | try to set it to 0.001 | 09:02 |
SpamapS | My quota in the cloud is 300 | 09:02 |
tobiash | that means it basically doesn't do a pause before sending the next request in the queue | 09:02 |
SpamapS | max-servers is just me being polite. ;) | 09:02 |
SpamapS | also I've now deleted all the bad images | 09:03 |
tobiash | max-servers is also treated like quota in nodepool | 09:03 |
SpamapS | I thought quota was the server side quota. :-P | 09:03 |
tobiash | it handles both the same way | 09:03 |
tobiash | so you can have server side quota and self-constrained quota | 09:04 |
SpamapS | ah | 09:04 |
SpamapS | and yeah I'm churning a lot of nodes | 09:04 |
SpamapS | have a 3 node and a 5 node job | 09:04 |
SpamapS | that get run often | 09:05 |
tobiash | that makes me think your rate is definitely far too low | 09:05 |
tobiash | my providers run with a rate of 0.01 | 09:06 |
SpamapS | Yeah the control plane can def handle it | 09:07 |
tobiash | that also makes me wonder whether the default of 1 is actually a good one (as it will only work well for very small deployments) | 09:08 |
tobiash | maybe 0.1 would be a better default? | 09:08 |
tobiash | corvus: what do you think? ^ | 09:09 |
tobiash | SpamapS: it's even worse: rate = In seconds, amount to wait between operations on the provider. Defaults to 1.0. | 09:11 |
tobiash | SpamapS: so you're not even getting one request per second but actually less | 09:11 |
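In other words, `rate` is an enforced gap between provider API operations, not a requests-per-second figure. A minimal sketch of that throttling, assuming a simple task queue (nodepool's real task manager is more involved):

```python
import queue
import time

def run_provider_tasks(tasks: "queue.Queue", rate: float = 1.0) -> None:
    last = 0.0
    while True:
        task = tasks.get()                      # next queued API call
        wait = rate - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)                    # the 'rate' gap between operations
        task()                                  # e.g. list servers, create server
        last = time.monotonic()
```

With `rate: 1.0` and an API call that itself takes half a second, the provider gets one request every 1.5 seconds, which is why the task queue SpamapS describes below can back up to 20 entries.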
SpamapS | yeah | 09:14 |
SpamapS | tobiash: never noticed before but sometimes the queue was indeed getting long | 09:16 |
SpamapS | 20 or so at the worst | 09:16 |
SpamapS | I've now set it to 0.01 | 09:16 |
SpamapS | and queue is staying at 0 | 09:16 |
SpamapS | but box still waiting | 09:17 |
SpamapS | and yeah, my executor is also bombed | 09:18 |
* SpamapS puts adding another one on the next sprint board :-P | 09:18 | |
SpamapS | tobiash: thanks, that does seem a bit better | 09:18 |
SpamapS | still stalls for a minute or so.. but seems consistently 1 minute.. Have done 5 patches now.. and all were ~60s .. last night I was seeing 5-15 minute lag | 09:19 |
SpamapS | alright, sleep calls | 09:20 |
tobiash | at least better now | 09:22 |
openstackgerrit | Ian Wienand proposed openstack-infra/zuul master: Pin yarl for python < 3.5.3 https://review.openstack.org/565470 | 09:31 |
openstackgerrit | Ian Wienand proposed openstack-infra/zuul master: Pin yarl for python < 3.5.3 https://review.openstack.org/565470 | 09:37 |
*** ssbarnea_ has joined #zuul | 09:41 | |
*** gtema has joined #zuul | 10:28 | |
openstackgerrit | Merged openstack-infra/zuul master: Pin yarl for python < 3.5.3 https://review.openstack.org/565470 | 10:46 |
openstackgerrit | Merged openstack-infra/zuul master: Add release note about config memory improvements https://review.openstack.org/565348 | 10:56 |
*** gtema has quit IRC | 10:58 | |
odyssey4me | Hi everyone, has any thought been given to implementing the ability to define a matrix of jobs? I mean something like https://docs.travis-ci.com/user/customizing-the-build#Build-Matrix where I want to execute the same job, but with different parameters and instead of defining each job individually using the parent/child, zuul could prepare the matrix for me. | 11:44 |
*** ssbarnea_ has quit IRC | 12:22 | |
*** rlandy has joined #zuul | 12:26 | |
*** weshay is now known as weshay|ruck | 12:46 | |
*** pwhalen_ is now known as pwhalen | 13:01 | |
*** pwhalen has joined #zuul | 13:01 | |
*** ssbarnea_ has joined #zuul | 13:10 | |
*** ssbarnea_ has quit IRC | 13:15 | |
*** hasharAway has quit IRC | 13:27 | |
openstackgerrit | Merged openstack-infra/zuul master: Increase unit testing of host / group vars https://review.openstack.org/559405 | 13:27 |
openstackgerrit | Merged openstack-infra/zuul master: Inventory groups should be under children key https://review.openstack.org/559406 | 13:28 |
*** openstackgerrit has quit IRC | 13:34 | |
*** pwhalen has quit IRC | 13:47 | |
*** pwhalen has joined #zuul | 13:50 | |
*** gtema has joined #zuul | 13:55 | |
*** ssbarnea_ has joined #zuul | 14:03 | |
*** ssbarnea_ has quit IRC | 14:07 | |
*** ssbarnea_ has joined #zuul | 14:13 | |
*** gtema has quit IRC | 14:15 | |
corvus | odyssey4me: we had that in jjb, but we're trying to avoid it in zuul because it can get pretty confusing pretty fast. so at the moment, the answer is basically "make a bunch of one-liner jobs which override the thing you want to change". but if that doesn't work out in the long run, we can probably add some syntactic sugar to do that. | 14:17 |
*** dkranz has joined #zuul | 14:17 | |
odyssey4me | corvus yeah, understood - the trouble is that now we've got to make many, many job definitions which gets confusing too | 14:28 |
corvus | odyssey4me: indeed. is this in openstack, or a public repo where i can take a look? | 14:28 |
odyssey4me | especially when you're trying to cover multiple axes - like python version, ansible version, code path, and config elements | 14:28 |
odyssey4me | not just yet, I'm just forward planning - trying to figure out how to tackle things and looking at options | 14:29 |
corvus | odyssey4me: okay, let me know when you get there -- i'd like to see what it ends up looking like | 14:31 |
corvus | odyssey4me: also here's an idea for an interim solution: you could put all the variants in one file in zuul.d and write a script which generates the matrix output | 14:31 |
corvus | (so keep the main body of the job in a separate file so you don't have to worry about munging it) | 14:32 |
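A hypothetical sketch of such a generator: expand a small axis definition into flat one-liner job variants and emit them as a zuul.d file (the job names and axes here are invented for illustration):

```python
import itertools

import yaml  # PyYAML, assumed available

axes = {
    "python_version": ["3.5", "3.6"],
    "ansible_version": ["2.4", "2.5"],
}

jobs = []
for combo in itertools.product(*axes.values()):
    variant = dict(zip(axes, combo))
    name = "tox-matrix-" + "-".join(v.replace(".", "") for v in combo)
    jobs.append({"job": {"name": name, "parent": "tox-matrix", "vars": variant}})

# Write the expanded matrix next to the hand-maintained job body
with open("zuul.d/generated-matrix.yaml", "w") as f:
    yaml.dump(jobs, f, default_flow_style=False)
```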
odyssey4me | corvus yep, I was just thinking of doing that | 14:33 |
odyssey4me | use a tool to generate a static zuul yaml file - kinda the best of both worlds | 14:33 |
corvus | it's at least an easy way to experiment before we have to commit to syntax changes :) | 14:34 |
mordred | ++ I like that as an experimentation approach | 14:39 |
*** acozine1 has joined #zuul | 14:40 | |
*** CharlesShine has joined #zuul | 14:51 | |
*** CharlesShine has left #zuul | 14:52 | |
*** trishnag has quit IRC | 14:57 | |
clarkb | SpamapS: tobiash: I notice that reuse: True is set, maybe related to that? | 15:41 |
clarkb | (its not something we use upstream) | 15:41 |
tobiash | clarkb: reuse? | 15:55 |
clarkb | ['zuul.cloud.phx3.gdg-24752-PoolWorker.p-main'], 'requestor':'zuul.cloud.phx3.gdg', 'node_types': ['g7', 'g7', 'g7'], 'id': '200-0000026032', 'reuse': True, 'state_time':1525163495.026352}> | 15:56 |
clarkb | tobiash: ^ there reuse:True I think that means nodepool is reusing the nodes? | 15:56 |
tobiash | ah, that's a min-ready request | 16:05 |
clarkb | ah | 16:07 |
SpamapS | odyssey4me: fwiw, I think having a single clear file that has a list of jobs and the reason they vary is actually the least confusing option. But, that's me preferring composition in general. | 16:15 |
SpamapS | clarkb: yeah we're not reusing nodes. | 16:16 |
*** ssbarnea_ has quit IRC | 16:18 | |
*** trishnag has joined #zuul | 16:24 | |
*** trishnag has quit IRC | 16:24 | |
*** trishnag has joined #zuul | 16:24 | |
*** spsurya has quit IRC | 16:31 | |
*** acozine1 has quit IRC | 16:51 | |
*** ssbarnea_ has joined #zuul | 17:05 | |
*** sshnaidm is now known as sshnaidm|afk | 17:17 | |
pabelanger | clarkb: corvus: just looking at python3.6 jobs, for zuul / nodepool do we want both py35 / py36 or just take the leap directly to py36? | 17:33 |
clarkb | pabelanger: I think based on our py26->27 experience we probably want to run both | 17:34 |
clarkb | especially if we want to continue to support 35 | 17:34 |
clarkb | (since aiohttp for example just broke us there) | 17:34 |
pabelanger | wgm | 17:34 |
pabelanger | wfm* | 17:34 |
pabelanger | we also still have xenial servers for openstack-infra, so good to keep it around too | 17:35 |
clarkb | locally I use py36, so I'm reasonably confident zuul works under py36 | 17:35 |
clarkb | so doing both shouldn't be too bad outside of external deps breaking us | 17:36 |
pabelanger | yah, I've been using fedora a lot and haven't seen an issue so far | 17:36 |
mrhillsman | reproduce.sh no longer works because /usr/zuul-env/bin/zuul-cloner does not exist, is there a workaround/fix for this? | 17:43 |
clarkb | mrhillsman: not currently no. | 17:43 |
mrhillsman | thx | 17:43 |
clarkb | mrhillsman: beyond z-c not existing zuul isn't publicly publishing its git refs | 17:43 |
clarkb | so you'd need something to do the merges like zuul before invoking the devstack run which I think we've largely decided will happen as part of a "run zuul job locally" generic tool (rather than devstack specific) | 17:44 |
clarkb | (I want to say jhesketh has been looking at that problem) | 17:45 |
mrhillsman | cool, thx for the info | 17:46 |
mnaser | in super unrelated cool news, doing our first proposal to a customer for a full ci/cd pipeline that builds docker images based on their github commits and then (hopefully) deploys it to a k8s cluster using zuul | 17:46 |
clarkb | if your goal is to just run devstack you can grab the local.conf and run it that way. It won't coordinate multinode testing or run pre/post gate hooks etc | 17:47 |
clarkb | mnaser: neat | 17:47 |
mnaser | feels like the ideal to do it :> | 17:47 |
mnaser | cool way to explore use cases as well i guess | 17:47 |
mrhillsman | yeah, i just want to reproduce this error is all, i'll have to figure out a workaround | 17:47 |
mrhillsman | just did not want to dive down rabbit hole | 17:47 |
clarkb | mrhillsman: have a link to the error? | 17:47 |
clarkb | (just generally curious) | 17:48 |
mrhillsman | yeah | 17:48 |
mrhillsman | http://logs.openlabtesting.org/logs/67/967/7fbc3bfa7dbba1ac0264dab27ec84020c19cd964/check/gophercloud-acceptance-test/e432753/job-output.txt.gz | 17:48 |
mrhillsman | sorry, bad c/p | 17:48 |
mrhillsman | it is devstack related as well | 17:49 |
mrhillsman | something we are configuring wrong trying to lbaas/octavia testing | 17:49 |
mrhillsman | for gophercloud | 17:49 |
mrhillsman | waiting on devstack logs to finish load :) | 17:49 |
mrhillsman | http://paste.openstack.org/show/720193/ | 17:50 |
mrhillsman | i thought reproduce would allow me to kick it off and walk away for food | 17:51 |
clarkb | mrhillsman: /opt/stack/new/neutron-lbaas/devstack/plugin.sh:neutron_lbaas_configure_common:54 : /usr/local/bin/neutron-db-manage --subproject neutron-lbaas --config-file /etc/neutron/neutron.conf --config-file / upgrade head the config file isn't set properly for the db migration | 17:51 |
mrhillsman | yeah, just not sure how that is happening right now, i think maybe neutron-legacy is the issue | 17:52 |
clarkb | looks like you have both neutron-foo and q-foo enabled which may be part of the problem | 17:54 |
mrhillsman | ok, that is what i was leaning towards, thx for quick look | 17:55 |
mrhillsman | i'll go look at the PRs submitted shortly, appreciate it | 17:56 |
mrhillsman | it goes back to something i was worried a bit about some time ago | 17:57 |
mrhillsman | we are force enabling services in this one job; some refactoring needed | 17:57 |
*** myoung is now known as myoung|biab | 18:06 | |
SpamapS | https://photos.app.goo.gl/wD1z5tYowJr7TXZ92 <-- woot, my first successful unattended upgrade of zuul | 18:19 |
SpamapS | (by itself) | 18:20 |
*** openstackgerrit has joined #zuul | 18:35 | |
openstackgerrit | Fatih Degirmenci proposed openstack-infra/zuul master: Add additional steps for configuring Nodepool service on CentOS 7 https://review.openstack.org/564950 | 18:35 |
*** ssbarnea_ has quit IRC | 18:39 | |
*** trishnag has quit IRC | 18:40 | |
*** acozine1 has joined #zuul | 18:47 | |
*** trishnag has joined #zuul | 18:47 | |
*** ssbarnea_ has joined #zuul | 18:49 | |
openstackgerrit | Merged openstack-infra/zuul master: Install g++ on platform:rpm https://review.openstack.org/565070 | 18:52 |
*** trishnag has quit IRC | 18:52 | |
*** myoung|biab is now known as myoung | 19:16 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul-jobs master: Add ansible_hostname to /etc/hosts entries https://review.openstack.org/565564 | 19:20 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul-jobs master: Add ansible_hostname to /etc/hosts entries https://review.openstack.org/565564 | 19:21 |
openstackgerrit | Fatih Degirmenci proposed openstack-infra/zuul master: Add additional steps for configuring Nodepool service on CentOS 7 https://review.openstack.org/564950 | 19:27 |
*** hashar has joined #zuul | 19:28 | |
SpamapS | you know.. | 19:52 |
SpamapS | I love ansible dearly | 19:52 |
SpamapS | but sometimes jinja makes me have violent mood swings. | 19:52 |
*** hashar has quit IRC | 20:00 | |
mordred | SpamapS: yah | 20:01 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul-jobs master: Add ansible_hostname to /etc/hosts entries https://review.openstack.org/565564 | 20:18 |
*** pwhalen has quit IRC | 20:25 | |
*** pwhalen has joined #zuul | 20:27 | |
*** pwhalen has joined #zuul | 20:27 | |
*** myoung is now known as myoung|biab | 20:36 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix setting a change queue in a template https://review.openstack.org/565581 | 20:41 |
*** acozine1 has quit IRC | 20:45 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix regex project templates https://review.openstack.org/565584 | 21:00 |
*** ssbarnea_ has quit IRC | 21:06 | |
*** myoung|biab is now known as myoung | 21:54 | |
clarkb | corvus: https://review.openstack.org/#/c/565584/1 question on that change | 22:25 |
*** swest1 has quit IRC | 23:08 | |
corvus | clarkb: yeah... that's -1 worthy :) | 23:12 |
clarkb | done | 23:13 |
*** acozine1 has joined #zuul | 23:18 | |
*** swest has joined #zuul | 23:23 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix regex project templates https://review.openstack.org/565584 | 23:25 |
corvus | clarkb: i think i deleted the right things this time :) | 23:25 |