pabelanger | easy +3 for somebody https://review.openstack.org/509833/ | 00:04 |
pabelanger | job layout changes | 00:04 |
pabelanger | jeblair: ^ maybe when you have a moment; it switches to use build-openstack-sphinx-docs | 00:05 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove references to pipelines, queues, and layouts on dequeue https://review.openstack.org/509903 | 00:37 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix early processing of merge-pending items on reconfig https://review.openstack.org/509912 | 00:37 |
pabelanger | something to look into in the morning | 00:47 |
pabelanger | I see the following in nodepool-launcher debug.log | 00:48 |
pabelanger | 2017-10-05 17:37:38,277 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl01.openstack.org-6338-PoolWorker.inap-mtl01-main]: Unlocked node 0000140400 for request 200-0000188337 | 00:48 |
pabelanger | however | 00:49 |
pabelanger | | 0000140400 | inap-mtl01 | nova | ubuntu-trusty | 309ec46a-1fee-4314-91fe-c3be122e891a | ready | 00:07:11:10 | locked | | 00:49 |
fungi | pabelanger: 509833 hit another post_failure (i rechecked just now) | 00:59 |
fungi | i guess those are still going on | 00:59 |
pabelanger | looking | 01:02 |
fungi | i wasn't having much luck identifying it from the console log, and ara is rather tough to navigate in lynx | 01:03 |
pabelanger | ya, according to ze01.o.o, we got exit code: 2 | 01:04 |
pabelanger | 2017-10-06 00:24:09,149 DEBUG zuul.AnsibleJob: [build: 2f4383a190674034a8d6e71e0d7d0aff] Ansible output: b'Using /var/lib/zuul/builds/2f4383a190674034a8d6e71e0d7d0aff/ansible/post_playbook_3/ansible.cfg as config file' | 01:05 |
pabelanger | was last post playbook | 01:05 |
*** jkilpatr has quit IRC | 01:06 | |
pabelanger | fungi: it is possible something in https://review.openstack.org/505451/ is causing the issue | 01:10 |
pabelanger | we'd never know since it happens after logs are uploaded | 01:10 |
pabelanger | we should confirm with jeblair about maybe moving that before the upload-logs role, to get info | 01:10 |
pabelanger | now I EOD | 01:13 |
fungi | a plausible theory | 01:21 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Use normal docs build jobs https://review.openstack.org/509833 | 01:25 |
*** zigo has quit IRC | 01:27 | |
fungi | though i think they rely on the logs already being uploaded right? so couldn't go before upload-logs | 01:28 |
fungi | we might need some alternative means of identifying if/how they're breaking | 01:29 |
*** zigo has joined #zuul | 01:31 | |
*** logan- has quit IRC | 02:12 | |
*** eventingmonkey has quit IRC | 02:12 | |
*** fbouliane has quit IRC | 02:15 | |
*** logan- has joined #zuul | 02:15 | |
*** fbouliane has joined #zuul | 02:17 | |
*** eventingmonkey has joined #zuul | 02:17 | |
*** mgagne has quit IRC | 02:23 | |
*** mgagne has joined #zuul | 02:24 | |
*** mgagne is now known as Guest66098 | 02:24 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Fix path exclusions https://review.openstack.org/509901 | 04:35 |
AJaeger | team, I proposed a change to shift some more nodes to v3, see https://review.openstack.org/509965 - changes have been sitting in the queue for 22 hours | 05:25 |
*** tobiash has quit IRC | 05:27 | |
*** tobiash has joined #zuul | 05:27 | |
*** Diabelko has quit IRC | 05:40 | |
*** isaacb has joined #zuul | 06:46 | |
*** isaacb has quit IRC | 07:30 | |
*** hashar has joined #zuul | 07:49 | |
*** electrofelix has joined #zuul | 08:38 | |
*** jkilpatr has joined #zuul | 11:03 | |
Shrews | pabelanger: jeblair: it would seem that zuul has chosen NOT to reuse any READY nodes for some reason: http://paste.openstack.org/show/622822/ | 11:03 |
Shrews | note the "reuse": false | 11:03 |
Shrews | this has the vexxhost pool thread stuck because it has 10 ready nodes (its max) | 11:04 |
Shrews | maybe some other pool threads too | 11:04 |
Shrews | oh! nm, this is a request from nodepool itself (to satisfy min-ready). | 11:10 |
Shrews | i wonder how it got into this state.... | 11:10 |
Shrews | this is going to require some deeper digging. but i need to breakfast first | 11:14 |
Shrews | i deleted a few nodes from a few regions to unwedge things | 11:15 |
*** jkilpatr has quit IRC | 11:16 | |
*** jkilpatr has joined #zuul | 11:18 | |
*** dkranz has joined #zuul | 12:09 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Do not satisfy min-ready requests if at capacity https://review.openstack.org/510085 | 12:11 |
Shrews | jeblair: pabelanger: i think that ^^^ will prevent such wedging again | 12:11 |
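For reference, the shape of the guard 510085 adds, as a rough sketch; the names below are illustrative, not nodepool's actual API (the real handler lives in the OpenStack driver):

```python
# Hypothetical sketch of the capacity check discussed above.

def should_decline_min_ready(request, pool):
    """Decline a min-ready request when the pool has no room to grow.

    Without a check like this, a pool sitting at max-servers with all
    of its nodes READY holds the request open forever and its thread
    wedges, as happened with vexxhost above.
    """
    if not request.is_min_ready:  # only nodepool's own min-ready requests
        return False
    in_use = pool.ready_count + pool.in_progress_count
    return in_use >= pool.max_servers  # at capacity: decline rather than wait
```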
*** hashar has quit IRC | 12:42 | |
*** hashar has joined #zuul | 12:53 | |
pabelanger | Shrews: great! +2 | 13:24 |
*** hashar has quit IRC | 14:09 | |
*** hashar has joined #zuul | 14:14 | |
pabelanger | 2017-10-06 14:06:56.793664 | {phase} {step} {result}: [{trusted} : {playbook}@{branch}]2017-10-06 14:06:56.793882 | {phase} {step}: [{trusted} : {playbook}@{branch}]2017-10-06 14:06:58.268508 | | 14:16 |
pabelanger | dmsimard: ^ I think you added that recently? | 14:17 |
dmsimard | errrrrrrrrrrrrrrrrr | 14:17 |
dmsimard | pabelanger: have a link? | 14:17 |
pabelanger | http://logs.openstack.org/91/509491/3/check/ansible-role-nodepool-ubuntu-xenial/84cba16/job-output.txt.gz | 14:17 |
dmsimard | damn it :( | 14:18 |
dmsimard | pabelanger: thanks, I'll figure out a fix, it's indeed coming from https://review.openstack.org/#/c/509254/4/zuul/executor/server.py | 14:19 |
jeblair | msg = msg.format() :) | 14:19 |
dmsimard | doh | 14:20 |
dmsimard | indeed | 14:20 |
openstackgerrit | David Moreau Simard proposed openstack-infra/zuul feature/zuulv3: Properly format messages coming out of emitPlaybookBanner https://review.openstack.org/510135 | 14:23 |
dmsimard | well, that was embarrassing | 14:23 |
dmsimard | we also need a line break in there, let me add that | 14:24 |
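The bug itself is easy to reproduce in plain Python: the banner template was emitted before `.format()` ran, so the placeholder names leaked into the job output verbatim. A minimal reconstruction, with made-up field values:

```python
template = "{phase} {step} {result}: [{trusted} : {playbook}@{branch}]"

# Buggy: the unformatted template goes straight to the log.
print(template)
# -> {phase} {step} {result}: [{trusted} : {playbook}@{branch}]

# Fixed: substitute the fields (and add the line break dmsimard mentions).
msg = template.format(phase="POST-RUN", step="END", result="RESULT_NORMAL",
                      trusted="untrusted", playbook="tox/post", branch="master")
print(msg + "\n")
# -> POST-RUN END RESULT_NORMAL: [untrusted : tox/post@master]
```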
jeblair | Shrews: comment on the min-ready change | 14:25 |
openstackgerrit | David Moreau Simard proposed openstack-infra/zuul feature/zuulv3: Properly format messages coming out of emitPlaybookBanner https://review.openstack.org/510135 | 14:26 |
dmsimard | pabelanger, jeblair ^ | 14:26 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Do not satisfy min-ready requests if at capacity https://review.openstack.org/510085 | 14:36 |
Shrews | jeblair: responded. as an example of my comment, i've been watching our ready node situation and we always have a LOT of ready/unlocked nodes, which i speculate happens when we get super busy responding to requests | 14:43 |
Shrews | (a LOT when idle, that is) | 14:44 |
jeblair | Shrews: i'm confused | 14:44 |
jeblair | Shrews: are you saying when we are very busy, we create a lot of min-ready requests but have few ready nodes, but then when we are idle, we have no requests but many ready nodes? | 14:45 |
Shrews | oh actually, nevermind. i already have checks in place to not submit more min-ready requests when there are unfulfilled requests already. | 14:47 |
Shrews | i'll revert that priority part | 14:47 |
jeblair | ok | 14:47 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Do not satisfy min-ready requests if at capacity https://review.openstack.org/510085 | 14:48 |
Shrews | we can bikeshed the priority thing later if we need to | 14:48 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Properly format messages coming out of emitPlaybookBanner https://review.openstack.org/510135 | 14:53 |
mordred | jeblair: I'm looking at a fix for https://review.openstack.org/#/c/508822/2 which is on our list of job-related errors to fix ... tl;dr is that in the project config a job entry was given a list instead of a dict | 14:54 |
mordred | jeblair: I'm poking through configloader - but any hints as to where to look? | 14:54 |
dmsimard | infra-root: need to reload executors; ^ landed and fixes this abomination http://logs.openstack.org/91/509491/3/check/ansible-role-nodepool-ubuntu-xenial/84cba16/job-output.txt.gz#_2017-10-06_14_03_41_849986 | 14:55 |
mordred | infra-root: do we have a "restart executors" playbook yet? | 14:55 |
mordred | hrm. lemme take that to infra ... | 14:56 |
dmsimard | mordred: that patchset rightfully returned a syntax error, I guess you want to make it more friendly? | 14:56 |
mordred | dmsimard: yah - the patchset just returned "unknown configuration error" - but in this case we should be able to know that the issue was that required-projects was given as a list and not a dict | 14:57 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Changes for Ansible 2.4 https://review.openstack.org/505354 | 14:57 |
dmsimard | mordred: we probably need something broader that knows what data types we are expecting and validate that we're receiving what we're expecting | 14:58 |
dmsimard | (for every config parameter) | 14:59 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: Changes for Ansible 2.4 https://review.openstack.org/505354 | 14:59 |
jeblair | dmsimard: something *other* than voluptuous? | 15:00 |
jeblair | mordred: i start by writing a test for the syntax error, then temporarily adding a raise into the exception handler that's masking it to see the full traceback, then go from there. | 15:02 |
dmsimard | jeblair: I'm not super familiar with how the configuration is loaded in the first place, would need to take a look. | 15:03 |
dmsimard | I guess what I was saying is that we should try to avoid handling things on a case-by-case basis (i.e., only fix "required-projects") when this kind of issue can be expected elsewhere | 15:04 |
jeblair | dmsimard: ah. well, we have what you describe. there is a bug. mordred is investigating the bug. | 15:04 |
jeblair | dmsimard: we absolutely do not handle this on a case-by-case basis. | 15:05 |
dmsimard | ok, whew :) | 15:05 |
*** hashar is now known as hasharAway | 15:05 | |
mordred | jeblair: gotcha! cool | 15:07 |
*** hasharAway has quit IRC | 15:10 | |
jeblair | mordred: see test_untrusted_shadow_error for an example test | 15:11 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Changes for Ansible 2.4 https://review.openstack.org/505354 | 15:15 |
Shrews | that ^^^ doesn't actually change the ansible requirement yet, but gets us ready for it when 2.4.1 is released | 15:16 |
Shrews | mordred: i had to rebase part of your changes in that. i think it's all still valid, but you might want to verify when you have idle time | 15:18 |
mordred | awesome | 15:18 |
dmsimard | Shrews: you're also waiting for 2.4.1 ? | 15:21 |
dmsimard | as in, ara is in stalemate right now and we also need to wait for 2.4.1 | 15:22 |
dmsimard | I had to send a bugfix upstream that will only land in 2.4.1 https://github.com/ansible/ansible/pull/31200 | 15:22 |
Shrews | dmsimard: 2.4 breaks YAML inventory parsing | 15:23 |
dmsimard | doh | 15:24 |
dmsimard | Shrews: we should probably add a job on zuul which tests against devel, make it non-voting | 15:25 |
dmsimard | that's what I do for ara | 15:25 |
mordred | jeblair: TIL you can set a merge-mode in a project-template | 15:26 |
jeblair | mordred: ya; first one wins though | 15:27 |
dmsimard | Shrews: lgtm, just added a comment | 15:30 |
Shrews | dmsimard: that's one of the changes mordred made. i'll let him comment | 15:30 |
mordred | jeblair: I've got the config syntax issue - made one fix, then realized that was slightly inaccurate, working on second fix | 15:49 |
mordred | jeblair: (tl;dr - we do not actually have a voluptuous schema here other than "dict") | 15:50 |
mordred | jeblair: I've now got it emitting this: http://paste.openstack.org/show/622850/ | 15:54 |
mordred | jeblair: which is better, but I think still maybe not as clear to the unwashed | 15:54 |
mordred | jeblair: but I think for now that's better than what it was | 15:55 |
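Zuul's config validation is voluptuous-based (as jeblair notes above), and the gap was that this spot only checked against `dict`. A toy example of the difference a declared schema makes; the field layout here is made up for illustration, not zuul's actual schema:

```python
import voluptuous as vs

# Each jobs entry maps a job name to an attribute dict.
schema = vs.Schema({'jobs': [{str: {'required-projects': [str]}}]})

schema({'jobs': [{'myjob': {'required-projects': ['openstack-infra/zuul']}}]})  # ok

try:
    # A list given where a dict was expected, as in the 508822 case.
    schema({'jobs': [{'myjob': ['openstack-infra/zuul']}]})
except vs.Invalid as e:
    print(e)  # names the offending path, e.g. "... @ data['jobs'][0]['myjob']"
```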
fungi | looks like zuulv3 stabilized to around 28GiB virtual memory in use in the 10:00-13:00z span, but looking at the timeline that's basically the period where it was unresponsive due to full rootfs up to the point where i nova rebooted it | 15:56 |
fungi | yikes, load average spiked up into the 360s around 06:30z. periodic jobs kicking off, i guess | 15:59 |
fungi | looks like maybe the periodic jobs starting at 06:00z may have just pushed memory utilization into swap, so then it was trying to deal with them while thrashing swap | 16:01 |
jeblair | mordred: yeah, all those errors are like that. it's not a great message, but it does at least say what's wrong. | 16:02 |
mordred | ++ | 16:02 |
jeblair | mordred: since this is just turning out to be a schema bug, i don't think we need a specific test for it (i.e., it's not a new class of unhandled error). up to you whether you want to include that in the fix, i'd say. | 16:02 |
mordred | jeblair: well - I made two tests for the case, might as well keep them :) | 16:03 |
jeblair | ok | 16:03 |
fungi | also, can see the dip on the disk usage graph when logrotate kicked in at 06:00z, but used space in / started curving upward sharply at that point (perhaps filling up with the kazoo thread exceptions pabelanger noted) | 16:04 |
fungi | so this was probably a cascade effect | 16:04 |
mordred | jeblair: do we have a list anywhere of which Job attributes are invalid in project and project-template job lists? I know name and parent are | 16:04 |
jeblair | mordred: i think those are the only ones... | 16:06 |
jeblair | mordred: though, fun fact, because of the bug you're fixing, we totally would have passed those through for amusing results | 16:07 |
mordred | yah | 16:07 |
jeblair | mordred: actually, maybe just omit name | 16:07 |
jeblair | mordred: if we reject parent for job variants, we should do that both here and at the top level | 16:08 |
mordred | jeblair: ooh - I have just learned a new python 3.5 syntax | 16:16 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix early processing of merge-pending items on reconfig https://review.openstack.org/509912 | 16:17 |
jeblair | uhoh :) | 16:18 |
mordred | jeblair: well, it doesn't work here - but in 3.5 you can pass more than one **kwargs argument to an invocation | 16:20 |
mordred | jeblair: so you can do dict(**dict1, **dict2) | 16:21 |
fungi | mordred: does it merge them? | 16:21 |
mordred | fungi: yes | 16:21 |
fungi | i guess it instantiates an empty dict and then does an .update() with each of them in sequence or something like that | 16:21 |
mordred | foo = dict(**dict1, **dict2) is like doing foo = dict1.copy() ; foo.update(dict2) | 16:21 |
fungi | yeah, that's more or less what i was thinking | 16:21 |
fungi | neat! | 16:22 |
mordred | but potentially useful in places where the update step isn't possible (like in a list of class attributes) | 16:22 |
fungi | yup | 16:28 |
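A quick runnable illustration of that PEP 448 (Python 3.5+) syntax. One caveat worth noting alongside the copy/update analogy: in the `dict(**a, **b)` call form a key present in both mappings raises TypeError, while the `{**a, **b}` literal form lets the rightmost mapping win:

```python
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3}

merged = dict(**d1, **d2)  # {'a': 1, 'b': 2, 'c': 3}

# Equivalent to the copy-then-update idiom:
tmp = d1.copy()
tmp.update(d2)
assert merged == tmp

# Duplicate keys are fine in the literal form (rightmost wins) ...
assert {**d1, **{'b': 9}} == {'a': 1, 'b': 9}
# ... but dict(**d1, **{'b': 9}) raises TypeError (duplicate keyword argument).
```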
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Don't load dynamic layout twice unless needed https://review.openstack.org/510180 | 16:36 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Don't load dynamic layout twice unless needed https://review.openstack.org/510180 | 16:44 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Provide error message on malformed job list https://review.openstack.org/510185 | 16:48 |
mordred | jeblair, dmsimard, fungi: ^^ that should fix the syntax error reporting for list instead of dict from https://review.openstack.org/#/c/508822/2 | 16:49 |
jeblair | mordred: the chainmap thing works with voluptuous? neat | 16:54 |
jeblair | mordred: oh i see, you dict() the result | 16:54 |
jeblair | that makes sense | 16:54 |
jeblair | we can probably use that later to reduce the duplication in project-template and project schemas | 16:55 |
mordred | jeblair: ++ | 16:57 |
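The ChainMap trick from 510185, sketched with made-up schema fields: layer the shared attributes under the specific ones and `dict()` the result so voluptuous sees a plain mapping. The first mapping wins on duplicate keys, which is also what would let the project and project-template schemas share their common keys later:

```python
from collections import ChainMap

template_attrs = {'jobs': list}          # illustrative, not zuul's real fields
project_attrs = {'merge-mode': str}

combined = dict(ChainMap(project_attrs, template_attrs))
# {'merge-mode': <class 'str'>, 'jobs': <class 'list'>}
```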
*** openstack has joined #zuul | 17:09 | |
*** ChanServ sets mode: +o openstack | 17:09 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Provide error message on malformed job list https://review.openstack.org/510185 | 17:19 |
*** electrofelix has quit IRC | 17:50 | |
dmsimard | pabelanger, jeblair, mordred: I just realized that we're likely not setting up unbound in v3 as it was configured in the v2 ready-script | 17:56 |
dmsimard | I'll put it on my todo, I found it while troubleshooting a centos unbound issue | 17:56 |
jeblair | dmsimard: add to jobs section of etherpad? | 17:56 |
dmsimard | yup | 17:56 |
mordred | dmsimard: great catch - thanks! | 18:07 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Do not satisfy min-ready requests if at capacity https://review.openstack.org/510085 | 18:14 |
pabelanger | Yay | 18:15 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Create git_http_low_speed_limit / git_http_low_speed_time https://review.openstack.org/509893 | 18:31 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Have zuul re-run ABORTED jobs https://review.openstack.org/510211 | 18:50 |
pabelanger | jeblair: mordred: I think that fixes our aborted jobs issue^. If so, I can see about writing a test too | 18:51 |
pabelanger | and fixing tox issues now | 19:14 |
jeblair | pabelanger: see my comment on change | 19:15 |
jeblair | i need to grab lunch now | 19:15 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Handle non-syntax errors from Ansible https://review.openstack.org/510219 | 19:15 |
mordred | jeblair, pabelanger: as a follow up to that - earlier when looking at restarting executors there was a comment about jobs not getting re-run if the executor was hard-killed rather than gracefulled - can we also detect that in the client and retry? It seems like the executor going away unexpectedly should *always* result in a retry, no? | 19:34 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Retry jobs on executor disconnect https://review.openstack.org/510223 | 19:38 |
mordred | jeblair, pabelanger: something like that ^^ although I'm not super sure how to test it ATM | 19:39 |
fungi | mordred: that's separate from what 510211 is attempting to address, i guess? | 19:44 |
fungi | mordred: by hard-killed you mean sigkill or sigsegv instead of sigterm? | 19:45 |
fungi | so the executor doesn't get sufficient time to communicate the abort? | 19:45 |
*** hashar has joined #zuul | 19:46 | |
mordred | yah | 19:48 |
mordred | or, heck, cloud decides to kill the VM on which the executor is running | 19:48 |
pabelanger | will look in a minute, about to push an update for 510211 | 19:48 |
mordred | the 'executor went away and we don't know why' case | 19:48 |
fungi | makes sense | 19:53 |
fungi | thanks | 19:53 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Handle double node locking snafu https://review.openstack.org/509603 | 19:55 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Have zuul re-run ABORTED jobs https://review.openstack.org/510211 | 19:56 |
pabelanger | okay, v2 of abort is up; however, I think one disk monitor job is failing | 19:57 |
pabelanger | might need help on how to properly handle that | 19:57 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Set zuul_log_path for a periodic job https://review.openstack.org/509384 | 19:59 |
jeblair | mordred: the discussion earlier about re-running jobs when restarting executors is exactly what pabelanger is working on | 20:11 |
mordred | jeblair: what's the difference between "ABORTED" and "DISCONNECTED" ? | 20:11 |
jeblair | mordred: iow, when an executor is stopped (via some method other than killall -9), it goes through the code path pabelanger is touching | 20:12 |
jeblair | mordred: where does DISCONNECT show up? afaict, you're adding it in your change | 20:12 |
mordred | jeblair: sorry - I meant what's the difference between result=='ABORTED' and onDisconnect - but I believe I understand now | 20:13 |
jeblair | mordred: ya, so the case you're looking at is important to consider too -- it's the unclean shutdown case | 20:13 |
mordred | jeblair: pabelanger's change is to handle the clean shutdown case, mine is the unclean | 20:13 |
mordred | yah | 20:13 |
jeblair | mordred: however, that should be handled by the current "result is None" case | 20:13 |
mordred | cool - so the only difference would be the suggestion that we don't count executor restarts (clean or unclean) against a job's retry count | 20:15 |
jeblair | mordred: ya; considering how things have changed in v3, if you want to push for it on that basis, i could see that | 20:15 |
* mordred can abandon his change, didn't quite track in the brain that the None case would handle it - but now that you say it, it totally makes sense and should have been obvious :) | 20:15 | |
pabelanger | jeblair: so, should I clean up the test_disk_accountant_kills_job to handle the proper abort retry logic now? http://logs.openstack.org/11/510211/2/infra-check/tox-py35/d6bc444/testr_results.html.gz Or do you have any other suggestion for that use case? | 20:16 |
pabelanger | we'd abort 3 times on that now | 20:17 |
pabelanger | eventually hitting retry_limit | 20:17 |
jeblair | hrm, i think if we hit the disk limit, we should actually return an error other than ABORTED. it's not a retryable error really | 20:18 |
jeblair | lemme look | 20:18 |
pabelanger | kk | 20:18 |
mordred | jeblair: agree | 20:19 |
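The dispatch being converged on, as a simplified sketch (hypothetical names and result strings, not zuul's actual code): an unexpectedly missing result is retryable, an explicit disk-full result is not:

```python
def handle_build_result(build, result):
    if result is None or result == 'ABORTED':
        # Executor disconnected or was restarted: not the change's fault,
        # so retry (per the discussion, ideally without counting this
        # against the job's retry limit).
        build.retry()
    elif result == 'DISK_FULL':  # hypothetical name for the distinct result
        # Hitting the disk limit is deterministic; retrying would just
        # fail the same way three times and end in retry_limit.
        build.fail('disk limit exceeded')
    else:
        build.complete(result)
```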
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Always retry jobs on executor disconnect https://review.openstack.org/510223 | 20:20 |
mordred | jeblair, pabelanger: ^^ rebased that on top of pabelanger's change and included ABORTED in the always-retry logic | 20:20 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Always retry jobs on executor disconnect https://review.openstack.org/510223 | 20:22 |
mordred | except this time without sucking | 20:22 |
pabelanger | mordred: wouldn't that mean if a job kept aborting, it would run forever? | 20:24 |
jlk | hey all, I'm out at SeaGL today, sorry for not responding to anything. But I had lunch with Clark :) | 20:24 |
pabelanger | say hi! | 20:24 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Return a distinct result on executor disk full https://review.openstack.org/510227 | 20:29 |
jeblair | mordred, pabelanger: maybe base your 2 changes on that? | 20:29 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Don't store pipeline references on builds https://review.openstack.org/509653 | 20:33 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Don't load dynamic layout twice unless needed https://review.openstack.org/510180 | 20:33 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Have zuul re-run ABORTED jobs https://review.openstack.org/510211 | 20:35 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Always retry jobs on executor disconnect https://review.openstack.org/510223 | 20:35 |
pabelanger | done | 20:35 |
jeblair | that's some teamwork | 20:35 |
pabelanger | jeblair: mordred: quick question on 510223 | 20:37 |
jeblair | mordred: on 510185 what was "None" needed for? | 20:38 |
jeblair | mordred: oh, i see it broke that test. i'm kind of inclined to go with PS1 and fix the test. i think that was an accident, and i think it's better not to have a ':' unless needed (that's the behavior elsewhere at least) | 20:40 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Don't store pipeline references on builds https://review.openstack.org/509653 | 20:52 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Don't load dynamic layout twice unless needed https://review.openstack.org/510180 | 20:52 |
pabelanger | so, if I see the following request from nodepool-launcher | 20:55 |
pabelanger | | 200-0000309834 | requested | zuulv3 | centos-7 | | nl02.openstack.org-10797-PoolWorker.infracloud-vanilla-main,nl02.openstack.org-10797-PoolWorker.citycloud-lon1-main,nl02.openstack.org-10797-PoolWorker.citycloud-sto2-main,nl02.openstack.org-10797-PoolWorker.infracloud-chocolate-main,nl02.openstack.org-10797-PoolWorker.citycloud-la1-main | 20:55 |
pabelanger | does that mean, it is waiting to launch 1 centos-7 on the clouds listed there? | 20:55 |
pabelanger | the other question: it seems nl02 has many more requests assigned to it than nl01 (2562 vs 60) | 20:58 |
pabelanger | trying to see why that is | 20:58 |
jeblair | pabelanger: what's the heading for that column? | 21:02 |
pabelanger | jeblair: I think declined by? | 21:03 |
pabelanger | okay, that helps | 21:03 |
jeblair | pabelanger: so that's a negative. those are launchers which have decided not to handle the request | 21:03 |
jeblair | pabelanger: where are you seeing the 2562 vs 60 numbers? | 21:05 |
jeblair | launchers shouldn't accept more requests than they can handle, and 2562 is definitely more than nl01 can handle | 21:05 |
pabelanger | jeblair: I might be doing it incorrectly, but I ran: sudo -H -u nodepool nodepool request-list | grep nl01 | wc -l vs sudo -H -u nodepool nodepool request-list | grep nl02 | wc -l | 21:07 |
pabelanger | and was trying to see if the requests were somehow split across launchers | 21:07 |
pabelanger | however, I admit, request-list is still new for me | 21:07 |
jeblair | pabelanger: does it include declined-by? | 21:08 |
pabelanger | yah | 21:08 |
pabelanger | I started looking at it, because we have 1 centos-7 ready node | 21:08 |
pabelanger | and trying to see why it wasn't getting used | 21:08 |
pabelanger | Oh, something has grabbed it now | 21:09 |
jeblair | pabelanger: okay, so nl01 is *declining* more requests than nl02. which may still be undesirable, but not what i was expecting when you first brought it up. | 21:09 |
pabelanger | ok | 21:10 |
pabelanger | I just saw this in the debug log | 21:13 |
pabelanger | 2017-10-06 21:12:01,739 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl02.openstack.org-10797-PoolWorker.citycloud-lon1-main]: Declining node request 200-0000310062 because it would exceed quota | 21:13 |
pabelanger | so, maybe a side effect of split quota over 2 nodepools? | 21:13 |
pabelanger | Oh, I know | 21:20 |
pabelanger | we have infracloud disabled in nl02 | 21:21 |
pabelanger | so, it will always decline | 21:21 |
pabelanger | same goes for citycloud | 21:21 |
pabelanger | jeblair: ^ | 21:21 |
clarkb | I think citycloud we might be able to turn back on again if the az thing is fixed? | 21:21 |
clarkb | fungi: ^ | 21:21 |
pabelanger | ya, infracloud can also be enabled again | 21:22 |
clarkb | and possibly we can turn on infracloud now that zuul cpu use is saner | 21:22 |
pabelanger | we did on nodepool.o.o, but not nl02 | 21:22 |
pabelanger | ya | 21:22 |
pabelanger | okay, EOD for me now | 21:23 |
fungi | clarkb: the assumption being that the citycloud failures were related to us booting in their starved "nova-local" ssd az? | 21:23 |
clarkb | fungi: ya, or at least we knew we had a problem there that we should've fixed | 21:23 |
fungi | worth a retry i s'pose | 21:24 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix early processing of merge-pending items on reconfig https://review.openstack.org/509912 | 21:28 |
jeblair | i'm going to self-approve some changes after rebase | 21:30 |
jeblair | i'm self-approving 509653 which skips the test as we discussed earlier | 21:30 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Update node requests after nodes https://review.openstack.org/509571 | 21:31 |
jeblair | fungi, mordred, clarkb: can you +2/+3 508698 | 21:32 |
fungi | lgtm | 21:35 |
fungi | also, i have a fair amount of confidence in SpamapS's review on it as well | 21:36 |
jeblair | mordred: i'm pretty sure i wrote a filename other than "sub_nodes" in the etherpad but it looks like you deleted it and replaced it with that | 21:39 |
jeblair | i agree that sub_nodes is used and we should add it, but i'm also pretty sure i was deliberate and correct about the thing i put in there | 21:39 |
jeblair | i will look through the history and try to find out what it was | 21:40 |
jeblair | okay, it was node_private | 21:41 |
jeblair | i will put that back | 21:41 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Don't store pipeline references on builds https://review.openstack.org/509653 | 21:42 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Don't load dynamic layout twice unless needed https://review.openstack.org/510180 | 21:42 |
jeblair | i'm going to self-approve 509571 which was previously approved before a rebase | 21:45 |
jeblair | also has 3 other +2s | 21:45 |
jeblair | 509912 is ready for review now | 21:46 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add base job and roles for javascript https://review.openstack.org/510236 | 21:46 |
jeblair | i'm going to pick up the stable branch depends on bug now | 21:47 |
mordred | jeblair: ok. so - the tripleo patch that was linked to didn't use node_private anywhere but instead used sub_nodes all throughout it | 21:48 |
mordred | jeblair: so I think we were looking at different things perhaps? | 21:48 |
jeblair | mordred: i saw node_private in the tht patches | 21:48 |
mordred | jeblair: k. https://review.openstack.org/#/c/508660/ is what I was looking at ... could you point me to what you're looking at? (I don't doubt you - but if we're seeing different things then I worry that there may be other things missed too) | 21:50 |
jeblair | mordred: https://review.openstack.org/509704 | 21:51 |
mordred | jeblair: thanks! | 21:53 |
jeblair | np, sorry i didn't leave enough breadcrumbs | 21:54 |
mordred | jeblair: re: None in 510185 - I can go either way - whichever you feel is better | 21:55 |
jeblair | mordred: okay. i'm feeling fond of PS1 and no support for none. i have a moderate (but not strong) feeling that will encourage tidy layouts and reduce typo errors. | 21:56 |
mordred | jeblair: okie - I'll update! | 21:57 |
jeblair | (i'm also certain it will result in more errors to users, but think it's worth it since it points out a potential typo) | 21:58 |
jeblair | it's one of those "maybe you just forgot to remove a colon, or maybe you forgot to add something important" things | 21:59 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Provide error message on malformed job list https://review.openstack.org/510185 | 22:01 |
mordred | jeblair: there ya go ^^ | 22:01 |
jeblair | w00t | 22:03 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Update node requests after nodes https://review.openstack.org/509571 | 22:04 |
jeblair | mordred: good news and bad news. | 22:07 |
jeblair | mordred: good news: we don't have a test for depends-on same id on multiple branches. | 22:07 |
jeblair | mordred: bad news: i added one and it passed. | 22:07 |
jeblair | even inspected the inventory file, and the stable branch change shows up in the items list | 22:07 |
jeblair | so i'm going to have to dig deeper on that. | 22:08 |
mordred | jeblair: I agree - that is both good news and bad news | 22:09 |
jeblair | i approved mordred's 2 outstanding zuul changes; i think when they land we should restart all components of zuul and zookeeper and nodepool | 22:11 |
mordred | ++ | 22:12 |
jeblair | i have to run now; anyone who feels up to it, feel free to do that restart when 510185 and 510219 land | 22:12 |
mordred | dmsimard: heya - you around? | 22:15 |
dmsimard | mordred: not for long | 22:15 |
mordred | dmsimard: well ... maybe you can answer super-quick | 22:15 |
mordred | dmsimard: I'm looking at the tox-linters job for openstack-zuul-jobs and the linters env in tox.ini | 22:15 |
mordred | it has this: | 22:16 |
mordred | ANSIBLE_ROLES_PATH = {toxinidir}/roles:{envdir}/src/zuul-jobs/roles | 22:16 |
mordred | but I don't know where it's getting that zuul-jobs from | 22:16 |
mordred | do you? | 22:16 |
dmsimard | iirc required-projects gets automatically added to role path or something to that effect | 22:17 |
dmsimard | or perhaps roles: | 22:17 |
mordred | AH - I see it ... | 22:17 |
mordred | -e git://git.openstack.org/openstack-infra/zuul-jobs#egg=zuul-jobs | 22:17 |
dmsimard | https://docs.openstack.org/infra/zuul/feature/zuulv3/user/config.html#attr-job.roles | 22:17 |
mordred | in test-requirements.txt | 22:17 |
dmsimard | Roles are added to the Ansible role path in the order they appear on the job – roles earlier in the list will take precedence over those which follow. | 22:17 |
mordred | dmsimard: yah - this is on the test node itself running linters | 22:18 |
mordred | dmsimard: I've got an ozj patch with a depends-on on a zj patch but it's not getting set up properly ... | 22:18 |
dmsimard | mordred: link ? | 22:18 |
mordred | although you know what - tox_install_siblings should be able to totally fix this for me :) | 22:19 |
mordred | http://logs.openstack.org/37/510237/1/infra-check/tox-linters/382fe42/job-output.txt.gz | 22:19 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Handle non-syntax errors from Ansible https://review.openstack.org/510219 | 22:19 |
mordred | dmsimard: gonna see if the magic in the tox job will take care of it ^^ | 22:19 |
mordred | wait ... I mean: remote: https://review.openstack.org/510237 Add javascript tarball publication job | 22:19 |
mordred | :) | 22:19 |
dmsimard | fwiw that patch makes sense :p | 22:20 |
mordred | heh | 22:21 |
mordred | dmsimard: btw - I like the POST-RUN END RESULT_NORMAL: [untrusted : git.openstack.org/openstack-infra/zuul-jobs/playbooks/tox/post@master] lines | 22:26 |
dmsimard | Aw yeah | 22:27 |
dmsimard | Much better than the non-formatted strings we had this morning | 22:27 |
* dmsimard coughs | 22:27 | |
dmsimard | afk food | 22:27 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Provide error message on malformed job list https://review.openstack.org/510185 | 22:27 |
*** hashar has quit IRC | 22:50 | |
*** harlowja has quit IRC | 23:45 | |
*** docaedo has joined #zuul | 23:47 |