Monday, 2018-07-23

04:53 *** nchakrab has joined #zuul
05:36 <tobiash> pabelanger: I added a comment.
05:48 *** quiquell|off is now known as quiquell
07:02 *** pcaruana has joined #zuul
07:22 *** nchakrab has quit IRC
07:22 *** quiquell is now known as quiquell|bbl
07:25 *** nchakrab has joined #zuul
07:52 *** quiquell|bbl is now known as quiquell
08:05 *** hashar has joined #zuul
08:35 *** goern has joined #zuul
09:04 <openstackgerrit> Ian Wienand proposed openstack-infra/zuul master: Add variables to project  https://review.openstack.org/584230
09:04 <openstackgerrit> Ian Wienand proposed openstack-infra/zuul master: doc: Move zuul variable references to a section  https://review.openstack.org/584779
09:04 <openstackgerrit> Ian Wienand proposed openstack-infra/zuul master: Use definition list for job status  https://review.openstack.org/584780
09:08 *** nchakrab has quit IRC
09:09 *** nchakrab has joined #zuul
09:13 *** nchakrab has quit IRC
09:24 *** CrayZee has joined #zuul
09:30 *** jimi|ansible has quit IRC
09:48 *** panda|off is now known as panda
09:49 *** nchakrab has joined #zuul
09:56 *** sshnaidm|afk is now known as sshnaidm|rover
09:59 <tobiash> mordred: I just found a docker benchmark collection that benches various things using different base images: https://www.phoronix.com/scan.php?page=article&item=docker-summer-2018&num=4
10:00 <tobiash> mordred:
10:00 *** sshnaidm|rover is now known as sshnaidm|ruck
10:01 *** chkumar|ruck is now known as chkumar|rover
10:02 <tobiash> mordred: python with an alpine base image seems to perform quite badly compared to all other base images
10:25 <tristanC> tobiash: interesting results... a superuser answer points at the musl libc ( https://superuser.com/a/1234279 )
10:30 *** nchakrab has quit IRC
10:31 *** nchakrab has joined #zuul
10:36 *** nchakrab has quit IRC
10:39 *** quiquell is now known as quiquell|bbl
10:47 *** nchakrab has joined #zuul
10:57 *** bhavik1 has joined #zuul
11:03 *** quiquell|bbl is now known as quiquell
11:16 <panda> pabelanger: re OVB, the overlay network remains a difficult approach for the iPXE images. The iPXE process would first have to ask for its management IP, then establish a VXLAN tunnel, then send a DHCP request on that tunnel for the provision network. iPXE doesn't support any of this. I think we can work with the static provision network approach and mac filtering on the undercloud. The only difficult part here is to pass the
11:17 <panda> overcloud instances' mac addresses to the undercloud so it can enable leases only for those. The other part is to create an overlay network for the overcloud nodes to emulate network isolation.
11:36 <quiquell> Hello zuul guys, I have one question regarding ZUUL_CHANGES
11:36 <quiquell> if you want a commit to be there, do you add a Depends-On or parent it?
11:36 <quiquell> is that right?
11:37 *** bhavik1 has quit IRC
11:38 <tristanC> quiquell: yes, Depends-On in the commit message, or a git parent dependency.
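A Depends-On footer lives in the commit message of the dependent change; a minimal sketch (the change URL and Change-Id here are illustrative, the URL reused from elsewhere in this log):

    Add feature X

    This needs the synchronize fix to land first.

    Depends-On: https://review.openstack.org/584656
    Change-Id: I0000000000000000000000000000000000000000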
11:38 <tristanC> quiquell: ZUUL_CHANGES is a legacy var from zuul v2; the dependencies are now stored in the zuul.items variable
11:38 <quiquell> tristanC: We will get there :-)
11:39 <tristanC> quiquell: e.g. http://logs.rdoproject.org/62/584762/1/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/fd5f3d6/zuul-info/inventory.yaml
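In an inventory like that one, the dependency chain appears under zuul.items, roughly in this shape (a trimmed sketch with illustrative names and values; the real variable carries more keys):

    zuul:
      items:
        - project:
            name: org/parent-project
            src_dir: src/git.example.com/org/parent-project
          branch: master
          change: '584230'
          patchset: '1'
        - project:
            name: org/child-project
            src_dir: src/git.example.com/org/child-project
          branch: master
          change: '584779'
          patchset: '1'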
11:39 <panda> quiquell: what do you mean "parent" it?
11:40 *** GonZo2000 has joined #zuul
11:40 *** GonZo2000 has quit IRC
11:40 *** GonZo2000 has joined #zuul
11:41 <tristanC> panda: i think he meant in gerrit, when you click rebase and "change parent revision", or when you git review multiple commits stacked together
11:41 <quiquell> panda: If you have two commits for the same project, and you want zuul to test them in one review
11:41 <quiquell> panda: You need to make one the parent of the other
11:43 <panda> tristanC: hhmm, I thought rebasing a change on top of another didn't create another zuul.item, because zuul doesn't need to fetch the change, it's in the git log
11:44 <quiquell> panda: But it's stuff that is going to be tested
11:44 <quiquell> panda: Makes sense to have it there
11:45 <quiquell> panda: The other part is the refspec that the required projects point to
11:45 <openstackgerrit> Tobias Henkel proposed openstack-infra/nodepool master: Ignore files produced by tox-cover  https://review.openstack.org/584823
11:45 <panda> and how does zuul know which changes in the log need to be added to zuul.items?
11:45 <panda> looks at the whole log and sees which Change-Id: is still open?
11:46 <quiquell> panda: I suppose
11:46 <quiquell> panda: Cannot have stuff in the middle
11:48 <tristanC> panda: it does, gerrit automatically returns parent information in the "dependsOn" field, see this function for the full logic: https://git.zuul-ci.org/cgit/zuul/tree/zuul/driver/gerrit/gerritconnection.py#n524
11:49 <panda> tristanC: ah, I see, thanks.
11:59 *** quiquell is now known as quiquell|lunch
12:04 <openstackgerrit> Tobias Henkel proposed openstack-infra/nodepool master: Invalidate az cache on bad request  https://review.openstack.org/582746
12:21 *** dkranz has joined #zuul
12:24 *** quiquell|lunch is now known as quiquell
12:36 *** jimi|ansible has joined #zuul
12:39 *** rlandy has joined #zuul
12:45 *** nchakrab has quit IRC
12:50 *** samccann has joined #zuul
12:51 *** samccann_ has joined #zuul
12:52 <mordred> tobiash: interesting. unfortunately the other outlier is debian 9 - which is the base for the other main python base image
12:52 <mordred> tobiash: definitely worth keeping our eyes on
12:58 *** nchakrab has joined #zuul
12:59 *** samccann_ has quit IRC
13:05 *** jesusaur has quit IRC
13:25 <openstackgerrit> Merged openstack-infra/nodepool master: Invalidate az cache on bad request  https://review.openstack.org/582746
13:25 <openstackgerrit> Merged openstack-infra/nodepool master: Ignore files produced by tox-cover  https://review.openstack.org/584823
13:27 <pabelanger> tobiash: replied
13:27 <pabelanger> panda: +1
13:30 <panda> pabelanger: I know how to filter dhcp offers. kinda.
13:37 <tobiash> pabelanger: oh thanks, I think I understood this now
13:37 <tobiash> pabelanger: but I have another comment, sorry
13:39 <pabelanger> tobiash: sure, let me update and test
13:47 <openstackgerrit> Paul Belanger proposed openstack-infra/zuul master: Fix delegate_to for ansible synchronize  https://review.openstack.org/584656
13:47 <pabelanger> tobiash: ^ should be better now
13:49 <tobiash> pabelanger: I'd even leave out 'path: /tmp' and let the os decide where to put the tmpdir
13:49 <tobiash> pabelanger: apart from that lgtm
13:49 *** myoung has joined #zuul
13:50 <pabelanger> tobiash: I wanted to keep /tmp since on the executor that is outside of the filter path, just to ensure we can write data outside of the workdir, even though this is on the node
13:53 <pabelanger> thanks! left a comment on the review explaining too
13:53 <tobiash> pabelanger: ok (although it won't be inside of the workdir as it connects back via ssh)
13:56 <pabelanger> tobiash: right, agree! However, the original bug was actually still checking is_path_safe and failing a job: http://logs.openstack.org/14/584614/9/check/ansible-role-nodepool-ubuntu-xenial/82c6155/job-output.txt.gz#_2018-07-22_13_48_56_126339
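The pattern being tested, sketched as a playbook fragment (this is not the actual test fixture; the paths and delegation target are illustrative):

    - hosts: all
      tasks:
        # scratch dir on the node; tobiash's suggestion is to omit 'path' and
        # let the OS choose, while the test keeps /tmp to exercise writes
        # outside the executor's work dir
        - tempfile:
            state: directory
            path: /tmp
          register: destdir

        # rsync on the node itself: with the task delegated back to the same
        # host, both src and dest live under the tmpdir, whatever the mode
        - synchronize:
            src: "{{ destdir.path }}/common-file"
            dest: "{{ destdir.path }}/common-file-copy"
            mode: pull
          delegate_to: "{{ inventory_hostname }}"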
14:02 *** quiquell is now known as quiquell|off
14:08 *** acozine1 has joined #zuul
14:16 <openstackgerrit> Merged openstack-infra/nodepool master: Update README and add TESTING similar to Zuul repo  https://review.openstack.org/581348
14:19 *** yolanda__ has joined #zuul
14:20 *** acozine1 has quit IRC
14:21 *** acozine1 has joined #zuul
14:21 *** yolanda_ has quit IRC
14:26 *** yolanda__ is now known as yolanda
14:33 <Shrews> would anyone care to review this builder change? https://review.openstack.org/564687
14:35 <pabelanger> looking
14:38 <Shrews> pabelanger: maybe you can speak to ian's concerns about dib there
14:39 <Shrews> (i missed those initially, but it seems like a caveat might be warranted there)
14:40 <pabelanger> Shrews: yah, I just left a comment for just that
14:40 <pabelanger> if people are okay with it in general, I'll drop a +2
14:40 <pabelanger> but want to get another set of eyes
14:41 <Shrews> pabelanger: maybe corvus or mordred can comment there too
14:42 <pabelanger> +1
14:43 <mordred> pabelanger: I'd argue that needing REG_PASSWORD in a RHEL image is a design flaw in rhel - that password would be completely exposed to anyone booting that image and running workloads on it
14:44 <mordred> BUT - given that, I don't think it's any _more_ insecure than the need to put the REG_PASSWORD in the image in the first place
14:44 <mordred> (the nodepool patch)
14:45 <pabelanger> yah, that is true too. Didn't think of that
14:45 <pabelanger> really don't know how RHEL licensing works TBH
14:45 <mordred> it's funny how old-school proprietary software techniques like license managers are _completely_ incompatible with cloud
14:45 <pabelanger> Like, could you use a RHEL key in DIB, yum install all the latest packages, delete the key, and the booted image works?
14:46 <mordred> yeah - but then users of the image wouldn't be able to install new packages
14:46 <rcarrillocruz> haha, yah... the state of cloud images for networking makes me want to stab my eyes at times
14:46 <mordred> I mean - this is amongst the reasons that literally nobody uses rhel as guest images in cloud
14:46 <rcarrillocruz> cloud-init is not a thing for most
14:47 <pabelanger> yah, wonder if you could write a wrapper in ansible using a zuul secret, for bindep, then delete the key after yum, etc
14:47 <pabelanger> but that should likely work too
14:47 <mordred> pabelanger: yah - you could likely do that - but it's still a mess. I'd just expose the password since it's an absurd protection
14:47 <mordred> "you must use a password to download these things that you can download for free from centos"
14:48 <pabelanger> ++
14:48 * mordred looks around to see if he's been fired yet
14:51 <mordred> pabelanger: I think in the non-CI world, one would have a rhel key as a customer, and then injecting that key at boot time using something like cloud-init or ansible would not be exposing your personal rhel secret, because you're the user
14:51 *** jesusaur has joined #zuul
14:52 <mordred> pabelanger: for public CI, it's a complete mess ... and I think it basically means that for public CI purposes RHEL is 100% unsuitable and centos basically must be used
14:53 <pabelanger> Yah, maybe that is why I've not heard much discussion around RHEL and CI when I've asked.  Just for this reason of key leaks
14:53 <mordred> yah
14:54 <pabelanger> mordred: since you are here, and non-RHEL related: https://review.openstack.org/584656/ fixes a bug with synchronize in ansible and zuul. If you'd like to review and see if I missed something obvious in zuul's trusted / untrusted checks
14:57 <mordred> pabelanger: I think that patch looks right. I added a +2 - but I'd like corvus or clarkb to look too - it's one of those areas where more eyes is more better
14:57 <pabelanger> thanks!
14:57 <pabelanger> yah, happy to get more eyeballs on this
15:01 <openstackgerrit> Fabien Boucher proposed openstack-infra/zuul master: Add tenant yaml validation option to zuul client  https://review.openstack.org/574265
15:03 <clarkb> mordred: which one?
15:05 <mordred> clarkb: https://review.openstack.org/584656/
15:05 <corvus> pabelanger: i don't understand your answer to tobiash's comment; can you rephrase that, possibly with more examples?
15:06 <pabelanger> yes
15:11 <pabelanger> corvus: I've added more detail to the explanation.
15:18 *** mrhillsman has quit IRC
15:19 *** jtanner has quit IRC
15:19 *** patrickeast has quit IRC
15:19 *** hashar has quit IRC
15:19 *** maxamillion has quit IRC
15:19 *** clarkb has quit IRC
15:20 *** goern has quit IRC
15:20 *** sdoran has quit IRC
15:20 *** TheJulia has quit IRC
15:21 *** andreaf has quit IRC
15:21 *** patrickeast has joined #zuul
15:21 *** mrhillsman has joined #zuul
15:22 *** jtanner has joined #zuul
15:23 *** maxamillion has joined #zuul
15:23 *** sdoran has joined #zuul
15:24 *** jtanner has quit IRC
15:24 *** sdoran has quit IRC
15:24 *** sdoran has joined #zuul
15:25 *** jtanner has joined #zuul
15:25 *** andreaf has joined #zuul
15:25 *** maxamillion has quit IRC
15:25 *** maxamillion has joined #zuul
15:25 *** sdoran has quit IRC
15:26 *** hashar has joined #zuul
15:26 *** sdoran has joined #zuul
15:26 *** jtanner has quit IRC
15:26 *** hashar has quit IRC
15:26 *** hashar has joined #zuul
15:27 *** sdoran has quit IRC
15:27 *** sdoran has joined #zuul
15:29 <corvus> pabelanger: thanks, i think i grok
15:33 *** clarkb has joined #zuul
15:34 <openstackgerrit> James E. Blair proposed openstack-infra/zuul master: Add container spec  https://review.openstack.org/560136
15:35 *** TheJulia has joined #zuul
15:36 *** jtanner has joined #zuul
15:38 *** jtanner has quit IRC
15:38 *** jtanner has joined #zuul
15:44 <corvus> pabelanger: can you point me at a summary of your log work?  (maybe a change that explains most of what's going on, or an email, or something?)  i want to refresh my memory and think about how it might interact with things i've been thinking about wrt logs-in-swift
15:45 *** nchakrab has quit IRC
15:48 <openstackgerrit> James E. Blair proposed openstack-infra/nodepool master: Docs: increase visibility of env secrets warning  https://review.openstack.org/584950
15:48 <corvus> Shrews, pabelanger, ianw, tristanC: ^
16:00 <openstackgerrit> Merged openstack-infra/nodepool master: Use detail listing in integration testing  https://review.openstack.org/584542
16:04 <openstackgerrit> Merged openstack-infra/nodepool master: builder: support setting diskimage env-vars in secure configuration  https://review.openstack.org/564687
16:12 *** nchakrab has joined #zuul
16:13 *** nchakrab has quit IRC
16:30 *** panda is now known as panda|off
16:36 *** gchenuet has joined #zuul
16:37 <mordred> corvus, Shrews: if you get a spare sec, https://review.openstack.org/#/c/584395/ doc patch for pbrx
16:38 <pabelanger> clarkb: could I trouble you for a review of https://review.openstack.org/584656/ too, having an extra set of eyes helps
16:39 <gchenuet> Hi guys! We are working (at leboncoin, a french classified-ads website) on updating our Zuul v2 infrastructure to Zuul v3, and we have a question about the Nodepool Launcher service: is it possible to scale it up for redundancy? We saw that the Nodepool Builder is able to do this, but we don't know about the Nodepool Launcher.
16:39 <gchenuet> Thanks, and congrats on this awesome tool stack!
16:40 <clarkb> gchenuet: it is possible, but you must partition the clouds across launchers as they won't see each other's resource usage directly. If you only have one cloud you could give each launcher a different account with independent quota
16:45 <gchenuet> thanks clarkb! What is the best redundancy solution in your view if we have two DCs (with 1 OpenStack per DC) to keep all the Zuul infra (Zuul + Nodepool services) alive if we lose 1 DC?
16:46 <clarkb> gchenuet: for that case I would run two nodepool launchers, one in each DC, speaking to the local DC openstack. This way if a datacenter goes away the other nodepool keeps running. (do similar for the nodepool builders). The one thing this doesn't accommodate is where to run your zookeeper. I'm not sure you can run a zookeeper across multiple datacenters
16:46 <clarkb> pabelanger: why is the src value for synchronize in your test file destdir.path?
16:46 <clarkb> pabelanger: is that intentional?
16:47 *** pcaruana has quit IRC
16:47 <corvus> gchenuet, clarkb: i agree about launchers -- but i'd probably have both builders talk to both clouds
16:48 <corvus> (that way if they are both working, you're still only building the images once)
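A minimal sketch of what each DC-local launcher's nodepool.yaml could look like under that split (hostnames, cloud and label names are illustrative):

    zookeeper-servers:
      - host: zk1.dc1.example.com
        port: 2181

    providers:
      - name: dc1-openstack
        cloud: dc1                 # clouds.yaml entry for the DC-local cloud
        diskimages:
          - name: ubuntu-xenial
        pools:
          - name: main
            max-servers: 50
            labels:
              - name: ubuntu-xenial
                diskimage: ubuntu-xenial
                min-ram: 8192

The second DC's launcher would mirror this against its own cloud, while the builders' configs would list both clouds, per corvus's suggestion.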
16:48 <pabelanger> clarkb: yah, comes from the copy-common role
16:48 <pabelanger> see the depends-on for more examples of usage
16:48 <clarkb> corvus: oh good point
16:48 <clarkb> pabelanger: isn't that backwards? if nothing else it is confusing :)
16:49 <corvus> SpamapS: any thoughts/experience running zookeeper across a wan (multiple datacenters)?  (see question from gchenuet)
16:49 <clarkb> pabelanger: I guess you are prepping files for copying in the other direction?
16:49 <pabelanger> clarkb: right, depends on mode: pull or push
16:49 *** acozine1 has quit IRC
16:50 <pabelanger> fetching common-file from the copy-common role tmpdir
16:51 <clarkb> pabelanger: both use src: destdir.path in the child change
16:51 <corvus> gchenuet: if you try running zk across multiple DCs, please let us know how it goes.  we (in openstack infra) have a nodepool system that's similar to what you're setting up, but we've only run zk within one data center.  i assume it can be done but may require some tuning to allow for greater latencies?
16:51 <pabelanger> clarkb: which line?
16:52 <clarkb> pabelanger: https://review.openstack.org/#/c/584656/6/tests/fixtures/config/remote-action-modules/git/org_project/playbooks/synchronize-delegate-good.yaml lines 22 and 46
16:52 <tobiash> gchenuet: and you should check whether you can encrypt the communication between zookeepers over wan
16:53 <tobiash> but client-side tls is not implemented in kazoo yet (there are pull requests for that)
16:53 <gchenuet> our DCs are connected with LAN and not WAN
16:53 <gchenuet> FYI
16:54 <gchenuet> I think we'll set up a zk cluster with 6 nodes (3 per DC)
16:55 <clarkb> gchenuet: it needs to be an odd number
16:55 <tobiash> and write performance drops with more nodes (the upstream suggestion is 3, 5 or 7 I think)
16:56 <gchenuet> ok, good to know! thanks, we have a third DC so 2 - 2 - 1
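That 2-2-1 layout is a standard five-node ensemble; the zoo.cfg server section would look roughly like this (hostnames are placeholders). A majority of 3 is required, so losing any single DC keeps the cluster up:

    server.1=zk1.dc1.example.com:2888:3888
    server.2=zk2.dc1.example.com:2888:3888
    server.3=zk1.dc2.example.com:2888:3888
    server.4=zk2.dc2.example.com:2888:3888
    server.5=zk1.dc3.example.com:2888:3888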
16:56 <pabelanger> clarkb: right, I don't think it matters too much, because in this case we are doing an rsync on a single server. So push and pull are both delegated to the nodepool node; basically in this case we want src to always be destdir.path regardless of mode
16:57 <pabelanger> https://docs.ansible.com/ansible/2.5/modules/synchronize_module.html
16:57 <pabelanger> see mode
16:57 <clarkb> pabelanger: mostly it's just confusing that we are using destdir as the src, not the dest dir
16:57 <clarkb> why not call the variable srcdir in this case?
16:58 <tobiash> regarding the launchers, I think now that nodepool is resilient against unexpectedly hitting the quota you could even run multiple launchers against the same cloud
16:58 <gchenuet> really ?
16:58 *** bbayszczak has joined #zuul
16:58 <tobiash> it might cause more openstack api requests than necessary when being under quota pressure but I think it shouldn't break
16:59 <pabelanger> clarkb: that is what common-copy does, creates a tmp directory with a file, I just reused that to pre-populate something. And that is the file we are using for src here.  But I can see how that is a little confusing
16:59 <clarkb> it would be racy on consumption of quota, which may lead to NODE_FAILURES, but other than that it may work
16:59 *** bbayszcz_ has joined #zuul
16:59 <tobiash> clarkb: nodepool now retries when hitting quota failures
16:59 <clarkb> tobiash: oh good
16:59 *** bbayszczak has quit IRC
16:59 <tobiash> that doesn't cause node_failures anymore
17:00 <pabelanger> tobiash: ++
17:00 <gchenuet> so i can have one zk cluster across DCs, with nodepool launchers and builders in each DC connected to the same zk cluster ?
17:01 *** GonZo2000 has quit IRC
17:01 <tobiash> yes, that should work, if the latency doesn't have too much impact on zk
17:02 *** hashar is now known as hasharDinner
17:02 <clarkb> pabelanger: as for push vs pull, that just changes which host is looked at to copy from dest to src.
17:02 *** bbayszcz_ is now known as bbayszczak
17:02 <tobiash> gchenuet: but that's something you probably just need to try
17:02 <gchenuet> to be sure, when you said 'cloud', is that a 'cloud provider' in nodepool ?
17:02 <clarkb> pabelanger: the test takes advantage of the fact that they are the same in this case
17:02 <tobiash> gchenuet: yes
17:05 <pabelanger> clarkb: right, I can propose a follow-up change that removes the usage of the destdir variable and sets a fact for srcdir
17:05 <pabelanger> give me a few minutes
17:05 *** nchakrab has joined #zuul
17:07 <tobiash> corvus: regarding buildset resources, what do you think about this user api: https://etherpad.openstack.org/p/JINkHL4VuM
17:08 <tobiash> we already map child_jobs into the zuul variable so I think we should do the same for pause
17:09 <corvus> tobiash: oh, hrm; i wonder if we should actually do option #2 for both things?
17:10 <tobiash> corvus: can we still change that for child_jobs?
17:11 <tobiash> (I'd also be in favor of mapping behavior things to real module parameters)
17:12 <gchenuet> thanks tobiash !
17:12 <corvus> (the main difference between the two being whether the variables are exported to the child job)
17:12 <tobiash> but that should be consistent then
17:12 <gchenuet> In your OpenStack CI, where are the nodepool launchers ? in which cloud providers ?
17:12 <corvus> tobiash: we haven't released a version with child_jobs yet, i think we can change it
17:13 <corvus> pabelanger: ^ we're talking about maybe changing how child_jobs is returned
17:13 <tobiash> corvus: thinking about that, even if we switch to module args we could still map the values into the child jobs within the executor
17:14 <tobiash> in case of pause I would even remap that to a different variable like zuul.paused_parents: [p1, p2]
17:15 <pabelanger> corvus: okay, I haven't started using it yet
17:15 <corvus> hrm.  well, both of these are using the special 'zuul' dictionary.  maybe that's enough to distinguish it from regular data that's passed to children...
17:17 <openstackgerrit> Paul Belanger proposed openstack-infra/zuul master: Clean up remote ansible synchronize tests  https://review.openstack.org/584978
17:17 <pabelanger> clarkb: I believe ^ is clearer now?
17:17 <corvus> tobiash: ultimately the interface is actually just a json file on disk.  if we did option #2, we'd have to revise that structure somehow.
17:18 <corvus> tobiash: (i guess "zuul_return: pause: True" could just be a shortcut for "zuul_return: data: zuul: pause: True"....)
17:18 <tobiash> that's true, it's easier for the user, but a little bit harder to implement
17:18 <tobiash> interesting idea
17:21 <corvus> tobiash: well, log_url is already like this and is in a release.  child_jobs is like this, but isn't in a release yet (but is about to be).  i think maybe we should stick with option #1 for now, and then think about either making an alias later, or restructuring it (if we feel like actually deprecating things)
17:21 <openstackgerrit> Merged openstack-infra/nodepool master: Docs: increase visibility of env secrets warning  https://review.openstack.org/584950
17:22 <tobiash> corvus: ok, I'm fine with that
17:22 <corvus> (i bet if we just make it an alias later, that will improve the UX without making things more complicated)
17:23 <tobiash> corvus: shall we name it pause or do you have a better naming proposal (run_children_now, ...)?
17:24 <corvus> tobiash: i like pause (or suspend is my runner-up)
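The interface as settled here (option #1: values under the special 'zuul' key of the returned data) would look roughly like this from a job playbook; note that pause is still only a proposal at this point in the discussion:

    - zuul_return:
        data:
          zuul:
            pause: true              # proposed: hold this job while its children run
            child_jobs:
              - child-job-to-run     # only the listed child jobs get scheduled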
17:24 <pabelanger> tobiash: corvus: speaking of child_jobs for zuul_return, tristanC noticed this change in functionality for zuul: https://storyboard.openstack.org/#!/story/2003026 I think we maybe want to address it before tagging
17:24 *** bbayszcz_ has joined #zuul
17:24 <corvus> pabelanger: agreed; is someone working on that?
17:27 <corvus> gchenuet: we use quite a few public and private providers, you can see which by looking at our grafana dashboards (search for nodepool): http://grafana.openstack.org/d/rZtIH5Imz/nodepool?orgId=1
17:27 <corvus> gchenuet: (some of those have multiple regions too)
17:27 *** bbayszczak has quit IRC
17:27 <pabelanger> corvus: I can, but could use help on how best to fix it. This is a side effect of not allowing all jobs to be skipped: https://git.zuul-ci.org/cgit/zuul/commit/zuul/model.py?id=5c797a12a8229b30124988723f786d4ee8dea807
17:28 <pabelanger> it didn't account for non-voting
17:28 *** bbayszcz_ has quit IRC
17:29 <corvus> pabelanger: i think there's just a logic error in didAllJobsSucceed -- a non-voting job that ran isn't 'skipped', but it's checked first
17:29 *** myoung is now known as myoung|bbl
17:29 <tobiash> pabelanger: I think you could just set skipped to false in https://git.zuul-ci.org/cgit/zuul/tree/zuul/model.py?id=5c797a12a8229b30124988723f786d4ee8dea807#n1808
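The shape of the fix being suggested, as a rough sketch (this is not the actual zuul source; the names approximate the model.py logic under discussion):

    def did_all_jobs_succeed(builds):
        """Fail if any voting job failed, or if every job was skipped."""
        all_skipped = True
        for build in builds:
            if build.result == 'SKIPPED':
                continue
            # any job that actually ran clears the all-skipped state,
            # voting or not -- the step the non-voting case was missing
            all_skipped = False
            if build.job.voting and build.result != 'SUCCESS':
                return False
        return not all_skipped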
17:31 <gchenuet> thanks corvus, and do you have the same single zk + nodepool infra for all providers ?
17:34 <corvus> gchenuet: we have a single zk, 4 builders and 4 launchers.  some builders and launchers are located in specific clouds, some are just in one cloud and talk to several clouds.  (some of that is dictated by architecture and networking requirements, not just redundancy/availability)
17:35 *** CrayZee has quit IRC
17:35 <pabelanger> corvus: tobiash: I'll create a new test for all-skipped jobs and work to fix the logic error
17:35 <pabelanger> I'll do that now
17:35 <corvus> pabelanger: thanks!
17:36 <gchenuet> but do any of your launchers speak to the same cloud ?
17:37 <clarkb> gchenuet: no
17:38 <corvus> well, at least one of them is
17:38 <corvus> and at least two of the builders are
17:40 <clarkb> corvus: which launcher of ours talks to the same cloud?
17:41 <clarkb> they all talk to distinct sets of clouds
17:46 <corvus> clarkb: i interpreted "same cloud" as "cloud in which the launcher is residing".
17:47 <corvus> if the question is "do any two launchers talk to the same cloud" then the answer is indeed no
17:48 <corvus> (but i thought we already answered that)
17:48 <clarkb> ah
17:49 <gchenuet> thanks ! I think we've found a solution; let's test it, and if it works we'll be very happy to share our infra with you (we run 50K builds per week)
17:49 <gchenuet> thank you so much guys for your work and your help !
17:54 *** gchenuet has quit IRC
17:54 <pabelanger> with stestr, is there a way to move the captured traceback output after the captured logging? Or customize it locally? I find it a pain to always be scrolling up in the terminal to see why something failed.
18:04 <Shrews> pabelanger: have you tried 'script'? there might be a simpler solution, but i often use that for things where i'm not sure how to control output
18:04 <Shrews> i often use that for zk-shell output capturing
18:07 <mordred> corvus, Shrews, pabelanger, tobiash, tristanC, clarkb, SpamapS: you may want to skim the recent scrollback in #ansible-devel ... it's about an upcoming behavior change in ansible related to modules returning ansible_facts in their return json
18:07 <openstackgerrit> Paul Belanger proposed openstack-infra/zuul master: Fix zuul reporting build failure with only non-voting jobs  https://review.openstack.org/584990
18:08 <mordred> https://github.com/ansible/ansible/pull/41811 is the pr adding the deprecation messages - agaffney says we can test the new behavior by setting inject_facts_as_vars to false in ansible.cfg
18:08 *** nchakrab has quit IRC
18:08 <pabelanger> corvus: ^ implements tobiash's suggestion for fixing non-voting jobs with the zuul.child_jobs patch
18:10 <pabelanger> Shrews: yah, that could help too
18:14 <mordred> corvus, Shrews, pabelanger, tobiash, tristanC, clarkb, SpamapS: http://paste.ubuntu.com/p/xJFyJJ7rcy/ is the relevant channel scrollback in case you weren't in #ansible-devel
18:17 <corvus> mordred: so that might affect roles/multi-node-known-hosts/library/generate_all_known_hosts.py ?
18:17 <mordred> corvus: yes
18:17 <corvus> (sorry, that was in zuul-jobs)
18:17 <pabelanger> reading
18:18 <corvus> mordred: but maybe it could be worked around with a normal return / register / set_fact ?
18:18 <mordred> corvus: we're on 2.5 now right?
18:18 <corvus> mordred: yes
18:18 <mordred> corvus: cool. so - I believe we can just access the value as ansible_facts.all_known_hosts safely now
18:19 <mordred> corvus: or we could change it to return/register
18:19 <mordred> as of 2.5 the ansible_facts dict exists and that return should be setting the value both as a top-level var and in the ansible_facts dict
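For the zuul-jobs role in question, the scoped access mordred describes sketches out like this (the module and fact names follow the role discussed above; the exact return shape is an assumption):

    # ansible.cfg toggle to preview the new behavior, per the linked PR:
    #   [defaults]
    #   inject_facts_as_vars = False

    - name: Generate combined known_hosts entries
      generate_all_known_hosts:   # custom module in the role's library/ dir

    - name: Consume the fact via the scoped dict rather than a bare variable
      debug:
        msg: "{{ ansible_facts.all_known_hosts }}"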
18:21 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Scope all_known_hosts in mult-node-known-hosts  https://review.openstack.org/584993
18:22 <pabelanger> +2
18:23 <pabelanger> haven't used ansible_facts much yet, but sounds like we should be
18:26 <corvus> it's apparently new :)
18:33 *** ianychoi has quit IRC
18:34 *** ianychoi has joined #zuul
18:38 *** dkranz has quit IRC
18:42 *** dkranz has joined #zuul
19:00 *** hasharDinner has quit IRC
19:00 *** hashar has joined #zuul
19:03 <Shrews> tobiash: according to https://review.openstack.org/536930, if "quota exceeded" is returned during a launch, we should pause, right? I'm not seeing that pause in deployed code
19:04 <tobiash> What do you see instead?
19:04 <Shrews> i'm not sure what implications this might have
19:04 <openstackgerrit> Paul Belanger proposed openstack-infra/nodepool master: Allow launch-retries to indefinitely retry for openstack  https://review.openstack.org/584488
19:05 <pabelanger> clarkb: okay, think ^ handles launch-retries: 0
19:05 <pabelanger> but still haven't figured out the best way to test indefinite retries
19:05 <Shrews> tobiash: http://paste.openstack.org/show/726468/
19:06 <clarkb> pabelanger: you saw the note about how quota failures will short circuit too right?
19:06 <clarkb> pabelanger: I think that's likely to be your biggest enemy in that code
19:08 <Shrews> tobiash: oh, maybe we just don't log the actual pause in that code path?
19:08 <Shrews> tobiash: b/c i do see an "Unpaused request" for that request ID
19:09 <tobiash> Shrews: yes, I see that too
19:09 <Shrews> tobiash: in your env?
19:09 <tobiash> Shrews: in your log
19:10 <tobiash> In my env it unpauses every 30 seconds or so and checks again
19:10 <Shrews> i didn't include the unpause in that paste
19:11 <Shrews> it was further down in what i pasted, but regardless, looks like that's what's happening
19:12 <tobiash> I looked wrong, never mind, I saw a pause after the quota recalculation
19:13 <pabelanger> clarkb: yah, I'm not sure how best to do that. tobiash, Shrews: do you mind looking at the comment from clarkb on https://review.openstack.org/584488
19:17 <openstackgerrit> David Shrewsbury proposed openstack-infra/nodepool master: Add logging when pausing due to node abort  https://review.openstack.org/585007
19:17 <Shrews> tobiash: ^^
19:19 <tobiash> Lgtm
19:21 <tobiash> pabelanger: your change changes the failure count to count in advance
19:22 <tobiash> That will break indefinite retrying on quota exceeded
19:23 <tobiash> As quota exceeded should not be counted
19:23 <tobiash> pabelanger: note that the quota exceeded raises before increasing the counter
19:25 <pabelanger> Hmm, let me try again. I'll add a test where launch-retries is 0, because I'm pretty sure there is a bug today where that isn't accounted for
19:26 <clarkb> pabelanger: ya, rereading it I think max-retries is really "max-attempts" and it can't be 0 valued
19:26 <clarkb> or if it is, that is undefined
19:27 <pabelanger> okay, I'll grab some coffee and try again in a bit
19:32 <Shrews> tobiash: pabelanger: wait... if the "quota exceeded" part is hit, yes, it will leave the launch() method, but we should pause and retry the launch later (effectively still retrying forever as pabelanger wants). i don't see the breakage
19:33 <Shrews> so pabelanger's change alters the "inner scope" of retrying (within a single launch). but the "outer scope" of retrying when paused is unaffected (unless i miss something)
19:34 <tobiash> The breakage is when you don't have unlimited retries and hit quota exceeded multiple times; then these are counted as failures
19:35 <Shrews> tobiash: ah! i see now
19:35 <tobiash> Or is that retry a different launch with a new counter?
19:36 <Shrews> tobiash: i believe we should reset the counter on quota error (launch() gets called again)
19:36 <Shrews> so i think it may be ok
19:36 <clarkb> Shrews: tobiash: this is specific to single-provider configurations
19:36 <clarkb> I thought that a quota failure was treated as a failure in that provider?
19:37 <tobiash> switching to laptop
19:38 <Shrews> clarkb: a recent change by tobiash makes quota errors pause and retry until quota frees up.
19:38 <clarkb> gotcha
19:38 <Shrews> so before, it would have been a launch error and tried max-tries times
19:38 <Shrews> or whatever that config setting is called
19:39 <tobiash> Shrews: ah I see now
19:39 <tobiash> the retries variable is just local
19:39 <tobiash> so it shouldn't matter, as the retry on quota exceeded is in fact a new node
19:39 <Shrews> yes
19:39 <tobiash> as the failed node was aborted
19:40 <clarkb> in that case the original code was probably fine save for my confusion over max-retries actually being max attempts
19:40 <clarkb> pabelanger: ^
19:41 <tobiash> but it breaks quota retry if launch_retries is 0
19:41 <Shrews> double yes
19:41 <Shrews> but, i think it does need some good testing in the various scenarios
19:41 <clarkb> tobiash: I think the original patchset may be fine
19:43 <tobiash> we might need to test almost all combinations of retries = [inf, 0, >0] together with quota failures
19:44 <clarkb> I'm beginning to wonder if 0 is a valid value
19:44 <clarkb> I had assumed so because in config it is "retries" but in execution it is really attempts
19:45 <clarkb> and 0 attempts means nothing boots
19:45 <Shrews> tobiash: ++
19:45 <Shrews> clarkb: launch_attempts is probably more accurate
19:46 <Shrews> because launch_retries=1 would have always done 1 attempt
19:48 <Shrews> 0 doesn't seem valid
19:48 <clarkb> ya
19:48 <tobiash> btw, the quota retry has already saved us a lot of node_failures, because we now also have nodes that boot from volume combined with a relatively low volume quota and no volume quota support in nodepool yet :)
19:50 <tobiash> what I also learned is that labels that compete for different types of quota (cpu vs cpu+volume) need to be separated into different pools
19:51 <tobiash> otherwise a cpu+volume node can pause the handler regardless of whether there are cpus left for other nodes
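In nodepool config terms, the separation tobiash describes looks roughly like this (names and values are illustrative); the boot-from-volume labels get their own pool so a pause waiting on volume quota cannot starve the cpu-only labels:

    providers:
      - name: my-cloud
        cloud: mycloud
        pools:
          - name: cpu-only
            max-servers: 40
            labels:
              - name: standard-node
                diskimage: ubuntu-xenial
                min-ram: 8192
          - name: cpu-plus-volume
            max-servers: 10
            labels:
              - name: bfv-node
                diskimage: ubuntu-xenial
                min-ram: 8192
                boot-from-volume: true
                volume-size: 80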
19:54 <Shrews> tobiash: these sound like helpful things we should document someplace
19:54 <clarkb> tobiash: that is probably a good way to define a pool too
19:54 <clarkb> a pool is a resource contention aggregation
19:54 <clarkb> ++ to documenting
19:55 <Shrews> i just don't know where to document, since we don't really have a deployment/admin suggestion/faq/whatever kind of guide
19:56 <clarkb> to start we can probably just add it to the config docs
19:56 <clarkb> and then incorporate that into more narrative-style docs later
20:06 <tobiash> k, will write something tomorrow
20:14 <pabelanger> Shrews: clarkb: tobiash: could somebody reply with the expected changes on 584488? I'll look and re-read the backscroll again shortly
20:15 <clarkb> pabelanger: I think for me it's go back to the original patchset
20:15 <clarkb> pabelanger: but I think tobiash wants to figure it out a bit more
20:15 <pabelanger> kk
20:23 <Shrews> pabelanger: left a comment with the most important takeaway for me
20:24 <pabelanger> Shrews: ++
20:26 <mhu> tobiash, can the dequeue-in-CLI patch get a +3? https://review.openstack.org/#/c/95035/
20:51 <openstackgerrit> Merged openstack-infra/nodepool master: Add logging when pausing due to node abort  https://review.openstack.org/585007
21:05 *** samccann has quit IRC
21:05 *** hashar has quit IRC
21:07 <openstackgerrit> David Shrewsbury proposed openstack-infra/nodepool master: Add upgrade release note about aborted node status  https://review.openstack.org/585035
21:09 <Shrews> mordred: corvus: we need ^ merged before releasing
21:10 <corvus> ++
21:10 <mordred> Shrews: should we do another restart to also pick up the logging change, just to be doubly sure?
21:10 <Shrews> mordred: i think that would be nice for debugging
21:10 <Shrews> but not technically necessary
21:10 <mordred> yah - I'm fairly confident that we can emit log lines
21:10 <Shrews> but being double-sure is bueno
21:11 <mordred> corvus: I think Shrews and I have the same "yes but also meh" thought - do you have any divergent opinions?
21:11 <Shrews> i have to EOD and go pick up my vehicle, so if someone else wants to do that, please go for it (launchers only)
21:12 <mordred> kk
21:12 <clarkb> maybe restart with 585035 merged, then we will run exactly what is tagged?
21:12 <mordred> wfm
21:12 <Shrews> wfm x 2
21:13 <clarkb> though I wonder how long check will take this time of day
21:13 <mordred> clarkb: that is also a good point - and it's going to be that way all week
21:14 <corvus> i think we can wait for 035 and test again.  we probably won't release zuul until tues or wed at the earliest anyway.
21:14 <clarkb> ya, that's why I've been spending time improving the reliability of clouds and trying to increase capacity. We are right in the heart of the merge-all-things rush
21:14 <clarkb> corvus: ok
21:15 <corvus> (on the plus side, this should give us plenty of traffic to see if the zuul memory problem is fixed)
21:16 <clarkb> indeed
22:14 <pabelanger> http://paste.openstack.org/show/726483/
22:14 <pabelanger> we are getting that from the latest zuul-executor for some reason
22:14 <pabelanger> guessing something broke in cherrypy
22:16 <pabelanger> CherryPy==17.0.0 seems to be installed
22:17 <mordred> pabelanger: sounds like a problem with the six version
22:17 <pabelanger> six==1.10.0
22:19 <pabelanger> checking docs
22:20 <pabelanger> heh
22:20 <pabelanger> Pull request #203: Add parse_http_list and parse_keqv_list to moved
22:20 <pabelanger>   urllib.request.
22:20 <pabelanger> that is 1.11.0
22:20 <pabelanger> https://github.com/benjaminp/six/blob/master/CHANGES
22:21 <pabelanger> checking ze01 to see what is there
22:25 <openstackgerrit> Merged openstack-infra/nodepool master: Add upgrade release note about aborted node status  https://review.openstack.org/585035
22:28 *** rlandy has quit IRC
22:31 <pabelanger> mordred: for some reason, we didn't install six==1.11.0 on ze11.o.o, but it seems to be in install_requires: https://github.com/cherrypy/cherrypy/blob/master/setup.py
22:31 <pabelanger> manually doing pip install -U six fixed it
22:32 <mordred> yah - pip won't upgrade a package or resolve an internal dependency dag
22:33 <mordred> as part of a single install
22:33 <pabelanger> yah, xenial is 1.10.0, how can we fix that with puppet-zuul?
22:33 <mordred> that said - is six getting installed by apt?
22:33 <pabelanger> yah
22:33 <mordred> ugh
22:33 <mordred> we REALLY need to stop installing python things via apt
22:33 <mordred> it breaks literally everything every time
22:34 <pabelanger> it looks like a dependency of ubuntu-server
22:34 <mordred> sigh
22:34 <mordred> of course it is
22:34 <mordred> oh - I think because of cloud-init?
22:34 <mordred> pabelanger: did we not uninstall cloud-init?
22:34 <pabelanger> yes, no longer on the server
22:35 <mordred> hrm
22:35 <mordred> well - containers solve this ... but lemme stare at puppet for a sec
22:35 <pabelanger> mordred: would virtualenv help here?
22:36 <mordred> pabelanger: I think virtualenv would be too much engineering that we'd be throwing away anyway
22:36 <pabelanger> mordred: agree, I'd rather push on containers too
22:36 <mordred> pabelanger: we've got the containers building now - so it's about getting them published and then getting the ansible written
22:36 <mordred> pabelanger: but - in case we need to spin up another ze before that's done ...
22:37 <pabelanger> yah, fixing it in puppet would be great.
22:49 *** dkranz has quit IRC
22:59 *** myoung|bbl is now known as myoung
23:30 *** threestrands has joined #zuul
