-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 853913: zuul-jobs-test-ensure-python-pyenv: update matchers https://review.opendev.org/c/zuul/zuul-jobs/+/853913 | 00:52 | |
-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 853909: Fedora : update to Fedora 36 release nodes https://review.opendev.org/c/zuul/zuul-jobs/+/853909 | 01:03 | |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 | 04:50 | |
@westphahl:matrix.org | corvus: Clark : how about "fallback" or "fallback-labels" instead of "alternatives"? | 05:41 |
---|---|---|
@westphahl:matrix.org | corvus: but +1 for making that more explicit in the Zuul config | 05:42 |
@westphahl:matrix.org | looking at the proposed syntax it should probably be "fallback-nodesets" | 05:43 |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 | 06:56 | |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 | 07:04 | |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 | 07:10 | |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 | 07:25 | |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 | 08:00 | |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 | 08:25 | |
@mbecker12:matrix.org | corvus: Are there any more thoughts on this patch? :) | 08:42 |
853993: Add hold command to disable nodes | https://review.opendev.org/c/zuul/nodepool/+/853993 | ||
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 | 08:56 | |
@jim:acmegating.com | mbecker12: I left some thoughts on the tests. (cc swest) | 13:32 |
@jim:acmegating.com | swest: i considered that, but to me, the first one isn't a fallback. but if that's close enough and we want to use that terminology, we could just clarify it in the docs. | 13:33 |
@westphahl:matrix.org | corvus: yeah, then fallback might be more misleading than "alternatives" | 13:35 |
@jim:acmegating.com | > <@iwienand:matrix.org> in all the other tests i never saw that one time out. unfortunately i guess due to a combination of bwrap, namespaces, etc, etc. the trace is basically useless | 13:55 |
i'm confused -- i looked at that log and all i see is the command lines -- the exception handler you added for the gdb ran on every single attempt -- i don't see any tracebacks? | ||
@jim:acmegating.com | ianw: could the new console handling be causing ansible not to exit in some cases? | 13:58 |
@clarkb:matrix.org | Does anyone know if ansible really needs about 3 seconds per task to run https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/stage-output/tasks/main.yaml#L103 as shown by the log at https://zuul.opendev.org/t/openstack/build/363873433e4d4f1ab66f6b2e97fb9429/log/job-output.txt#15596? The disk io there should be minimal and quick as it is doing a file mv on the same filesystem (no copies across filesystems). This makes me question ansible's per task cost, but it could still be an IO issue | 15:18 |
@clarkb:matrix.org | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/stage-output/tasks/main.yaml#L13 that task also appears to have needed about 3 seconds | 15:19 |
@clarkb:matrix.org | I guess it could be the remote host too (swapping, or otherwise slow) | 15:19 |
@clarkb:matrix.org | Looking at other jobs, it seems like 1.5-ish seconds for check sudo may be expected | 15:27 |
@clarkb:matrix.org | Definitely not quick, but much quicker than this example | 15:27 |
@clarkb:matrix.org | When ansible runs remote tasks, does it copy the python code for each task separately, or does the whole playbook go over at once? I wonder if we're seeing write costs for the tasks over the wire | 15:36 |
@clarkb:matrix.org | This particular example happened halfway around the world from the zuul executor. Though the task code should be small | 15:36 |
@clarkb:matrix.org | (maybe it isn't so small, I don't know how much bootstrapping code must be sent) | 15:37 |
@jim:acmegating.com | Clark: ansible has a high per-task cost, even on localhost. i've been trying to use fewer tasks and more shell blobs. | 15:53 |
@clarkb:matrix.org | Ya that task may deserve an update to consolidate into a single shell task | 16:00 |
@fungicide:matrix.org | a number of the things we call from base on every build could probably be collapsed into fewer ansible tasks for efficiency. ansible-lint won't like that of course, because it thinks you should never call shell commands which have equivalent ansible tasks, but i get the impression its recommendations are more about readability and maintainability than efficiency | 16:06 |
-@gerrit:opendev.org- Marvin Becker proposed: [zuul/nodepool] 853993: Add hold command to disable nodes https://review.opendev.org/c/zuul/nodepool/+/853993 | 17:40 | |
@clarkb:matrix.org | Any idea how important the replacement of the dot in .suffix with underscore in _suffix.txt is at https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/stage-output/tasks/main.yaml#L103 ? it is relatively trivial to update that to a find command task that will rename everything with one command if we don't also need to replace the . with a _ | 18:14 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402 | 18:19 | |
@clarkb:matrix.org | That is marked WIP as I need to test it and better understand the potential for breaking things by removing the _ | 18:20 |
@clarkb:matrix.org | fungi: ^ you might also want to check that my find command isn't completely broken | 18:20 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402 | 18:21 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 18:36 | |
- [zuul/zuul] 855403: Do not use _display outside the main thread in zuul_stream https://review.opendev.org/c/zuul/zuul/+/855403 | ||
- [zuul/zuul] 855404: DNM: Run zuul-tox-remote many times https://review.opendev.org/c/zuul/zuul/+/855404 | ||
@jim:acmegating.com | ianw: i was looking into some issues with tobiash and swest, and they have been encountering some deadlocks when ansible forks and flushes stdout. the rate increased significantly after the console stream version change went into use, and that seems to correspond with the increase in failures in tox-remote. | 18:47 |
that has led me to the hypothesis that the additional log lines emitted by zuul_stream (and possibly slight timing changes) are causing an existing issue to occur more often. i think the underlying issue is that our spawned threads may call the ansible display methods (which send to stdout/err and flush) while ansible is forking, and the fork may end up inheriting a file object locked by a thread that does not exist in the child. | |
the change above removes any calls to display from within our spawned threads | ||
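A minimal sketch of the general pattern behind that hypothesis, assuming nothing about the actual zuul_stream code: worker threads never write output themselves, they only hand lines to a queue, and the calling thread is the only one that emits them. The names `stream_lines`, `sources`, and `emit` are hypothetical and used for illustration only. This is a swapped-in technique, not the approach of change 855403, which removes the display calls from the spawned threads rather than rerouting them.

```python
import queue
import threading


def stream_lines(sources, emit):
    """Collect lines in worker threads; call emit() only from the caller's thread.

    `sources` is a list of iterables of lines (hypothetical stand-ins for
    per-host log readers). `emit` is any callable that writes a line.
    """
    q = queue.Queue()
    done = object()  # sentinel marking the end of one worker's input

    def worker(lines):
        for line in lines:
            # No stdout/stderr writes in the worker: only a queue put.
            q.put(line)
        q.put(done)

    threads = [
        threading.Thread(target=worker, args=(src,), daemon=True)
        for src in sources
    ]
    for t in threads:
        t.start()

    remaining = len(threads)
    while remaining:
        item = q.get()
        if item is done:
            remaining -= 1
        else:
            # All output happens here, in the thread that called stream_lines().
            emit(item)

    for t in threads:
        t.join()


if __name__ == "__main__":
    stream_lines([["host1 | ok"], ["host2 | ok", "host2 | done"]], print)
```

With output confined to one thread, no spawned thread can be holding the stdout/stderr file object lock at the moment a fork happens.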
@jim:acmegating.com | note https://github.com/ansible/ansible/pull/77056 is related but doesn't help us directly. it's not in a released version of ansible yet. that change could actually make things worse for us without my proposed change as it would likely increase the number of deadlocks caused by our threads. it's more of an acknowledgement that this is a source of deadlocks. | 19:09 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 855417: DNM: Run zuul-tox-remote many times (base case) https://review.opendev.org/c/zuul/zuul/+/855417 | 19:17 | |
@tristanc_:matrix.org | @corvus that sounds good. Is there a name for such bugs that appear when adding debug? Like an anti-heisenbug? :) | 19:17 |
@y2kenny:matrix.org | Hi, is it possible to have the Gerrit driver receive stream events from one server but sync/clone from a different server? (possibly accounting for replication-completed events?) | 19:32 |
@clarkb:matrix.org | > <@y2kenny:matrix.org> Hi, is it possible to have the Gerrit driver receive stream events from one server but sync/clone from a different server? (possibly accounting for replication-completed events?) | 19:47 |
I think you can configure the scheduler with the gerrit server for stream events, then configure the executors and mergers (which do the git work) with another. Zuul doesn't process replication-completed events though so that may be unreliable | ||
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 19:47 | |
- [zuul/zuul] 855403: Do not use _display outside the main thread in zuul_stream https://review.opendev.org/c/zuul/zuul/+/855403 | ||
- [zuul/zuul] 855404: DNM: Run zuul-tox-remote many times https://review.opendev.org/c/zuul/zuul/+/855404 | ||
@y2kenny:matrix.org | > <@clarkb:matrix.org> I think you can configure the scheduler with the gerrit server for stream events, then configure the executors and mergers (which do the git work) with another. Zuul doesn't process replication-completed events though so that may be unreliable | 19:50 |
ok. Thinking along that line, is there a way to potentially "pre-warm" the executor git cache? (perhaps triggered by a zuul-executor command line?) | ||
@clarkb:matrix.org | You could preclone all of the repos before starting the executor | 19:54 |
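A rough sketch of what precloning could look like, under stated assumptions. The cache layout used here (cache root, then the connection's hostname, then the project name) and every URL and path in the example are assumptions for illustration and should be checked against the executor's configured git directory before use. The helper names `build_clone_commands` and `preclone` are hypothetical.

```python
import posixpath
import subprocess


def build_clone_commands(base_url, projects, cache_root, hostname):
    """Return one `git clone` argv list per project.

    Assumed layout (verify for your deployment):
        <cache_root>/<hostname>/<project>
    """
    commands = []
    for project in projects:
        url = f"{base_url.rstrip('/')}/{project}"
        dest = posixpath.join(cache_root, hostname, project)
        commands.append(["git", "clone", url, dest])
    return commands


def preclone(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if a clone fails.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    cmds = build_clone_commands(
        "https://mirror.example.org",   # hypothetical replica to clone from
        ["zuul/zuul", "zuul/nodepool"],
        "/srv/executor-git",            # hypothetical cache root
        "review.example.org",           # hypothetical connection hostname
    )
    preclone(cmds)  # dry run: only prints the commands
```

The dry-run default makes it possible to inspect the destinations before anything is written to disk.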
-@gerrit:opendev.org- Clark Boylan proposed: | 20:11 | |
- [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402 | ||
- [zuul/zuul-jobs] 855434: Exercise stage-output extensions_to_txt in testing https://review.opendev.org/c/zuul/zuul-jobs/+/855434 | ||
@jim:acmegating.com | I opened https://github.com/ansible/ansible/pull/78679 to address the lock issue on the ansible side. | 20:24 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 20:27 | |
- [zuul/zuul] 855403: Do not use _display outside the main thread in zuul_stream https://review.opendev.org/c/zuul/zuul/+/855403 | ||
- [zuul/zuul] 855404: DNM: Run zuul-tox-remote many times https://review.opendev.org/c/zuul/zuul/+/855404 | ||
-@gerrit:opendev.org- Clark Boylan proposed: | 20:39 | |
- [zuul/zuul-jobs] 855434: Exercise stage-output extensions_to_txt in testing https://review.opendev.org/c/zuul/zuul-jobs/+/855434 | ||
- [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402 | ||
@clarkb:matrix.org | 855434 is a small testing change to add coverage for the slow tasks I identified earlier. That shows 855402 doesn't work yet. I think 855434 can be merged whenever reviewers have time as it adds some good test coverage for if/when we make changes to that role | 20:57 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 855048: Add ENA support option on uploaded AWS images https://review.opendev.org/c/zuul/nodepool/+/855048 | 20:57 | |
@jim:acmegating.com | the tox-remote tests are failing 20% of the time on master, but 0% of the time so far with the zuul_stream logging change. (with 30 runs of each so far) | 20:59 |
@jim:acmegating.com | that's starting to sound significant | 20:59 |
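A quick back-of-the-envelope check of why those numbers look significant, assuming each run is independent and that the 20% failure rate observed on master is the true rate: if the fix did nothing, the chance of 30 passes in a row would be 0.8 to the 30th power. The function name `prob_all_pass` is made up for this sketch.

```python
def prob_all_pass(failure_rate, runs):
    """Probability that `runs` independent runs all pass at the given failure rate."""
    return (1.0 - failure_rate) ** runs


if __name__ == "__main__":
    # 20% failure rate on master, 30 runs with the change and no failures.
    p = prob_all_pass(0.20, 30)
    print(f"{p:.4f}")  # prints 0.0012
```

A result of roughly one in eight hundred is hard to attribute to luck, although the 20% estimate itself comes from only 30 runs and carries its own uncertainty.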
@clarkb:matrix.org | corvus: which change is that? Sounds like I should review it | 21:00 |
@jim:acmegating.com | Clark: 855403 | 21:00 |
@clarkb:matrix.org | corvus: the v2_on_playbook_start call is safe because it isn't running in a subthread? but the _read_log() call was in a subthread? | 21:05 |
@jim:acmegating.com | yep | 21:05 |
@clarkb:matrix.org | That all seems reasonable based on what you've described in the commit msg and comments. Thanks | 21:06 |
@iwienand:matrix.org | > <@jim:acmegating.com> i'm confused -- i looked at that log and all i see is the command lines -- the exception handler you added for the gdb ran on every single attempt -- i don't see any tracebacks? | 21:13 |
just on that, in https://zuul.opendev.org/t/zuul/build/b9598b929d944287b7178acec8b9c812/logs -- because maybe something like this will be useful in the future -- the gdb backtrace didn't make it into the tox output, but can be seen on the raw job output @ https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b95/855116/1/check/zuul-tox-remote/b9598b9/job-output.txt . it mostly can't attach to things (as expected) but we do get something for a few processes it can -- but unfortunately not even basic symbol names | ||
@iwienand:matrix.org | but, thanks for looking into this!!! i agree this looks like it | 21:14 |
@jim:acmegating.com | ianw: i must be missing something -- i only see "passing" instead of the gdb traces | 21:16 |
@iwienand:matrix.org | you can text search for "Target and debugger are in different PID namespaces;" in that job-output.txt and see some of it | 21:16 |
@jim:acmegating.com | ianw: any chance you can do my eyes a favor and use zuul's build page to deep link to a line? | 21:16 |
@jim:acmegating.com | ah ok | 21:17 |
@jim:acmegating.com | ianw: in the future, we'd probably need the python debug symbols package | 21:17 |
@iwienand:matrix.org | it was a bit of a hail mary -- i was hoping that would show us bt's into something suspicious, like we found with the grantpt() thing | 21:18 |
@jim:acmegating.com | ianw: maybe python3-all-dbg on ubuntu, i'm not positive | 21:18 |
@jim:acmegating.com | ianw: it was a very good idea. tobiash managed to find a stacktrace internally when investigating his issue that pointed to a stdout.flush on fork start, so that narrowed things down a lot. i think your approach may have come up with the same if we had symbols. :) | 21:20 |
@jim:acmegating.com | fyi i just wrote a response to sivel at https://github.com/ansible/ansible/pull/78679 with some gory details | 21:21 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402 | 21:22 | |
@jim:acmegating.com | (also, love that i got tagged with "new_contributor". guess 2016 doesn't count) | 21:22 |
@iwienand:matrix.org | Oh that's just a formality so that the AI bot knows it has fresh brains to suck into the matrix^W co-pilot :) | 21:28 |
@jim:acmegating.com | ianw: :) i think https://review.opendev.org/855403 is mergeable now with your +2 (no rush -- just letting you know i think it's ready but i also want to wait for your ack). | 21:30 |
@jim:acmegating.com | (wow, we got a good batch of base case tests, so the failure rate is down to 15% on master -- but we're still at 0% with the fix, with 40 runs for both) | 21:34 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402 | 21:40 | |
@clarkb:matrix.org | I'm beginning to wonder if it wouldn't be better to write a little ansible module to do this instead | 21:46 |
@clarkb:matrix.org | If this works it's reasonably lightweight but it adds a new dependency to the role (xargs) | 21:47 |
@clarkb:matrix.org | and there isn't really a good way to do the . -> _ transition in a readable fashion. But I'm still not sure how important that is | 21:48 |
@jim:acmegating.com | i think ansible module for this would be great | 21:48 |
@jim:acmegating.com | also, we have that unit test framework for python code in zuul-jobs now, so that could be nice too | 21:49 |
-@gerrit:opendev.org- Thomas Cardonne proposed wip: [zuul/zuul] 855439: feat(elasticsearch): allow filtering of reported returned-vars https://review.opendev.org/c/zuul/zuul/+/855439 | 21:49 | |
@clarkb:matrix.org | That latest patchset actually works now, but fails on linting | 21:50 |
@clarkb:matrix.org | I'll probably pause here and look at the ansible module idea tomorrow | 21:50 |
@clarkb:matrix.org | the testing in the parent change has been really helpful | 21:50 |
@iwienand:matrix.org | regex + jinja is indeed a nightmare zone | 22:01 |
@clarkb:matrix.org | The main thing I guess will be how much slower python emulating find is than find itself | 22:07 |
@clarkb:matrix.org | Probably still quite a bit quicker than the ansible tasks running one by one though | 22:08 |
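A minimal sketch of the rename discussed above, done as one Python pass over a directory tree instead of one ansible task per file. It assumes the intended mapping is `name.suffix` to `name_suffix.txt`, as described earlier in the conversation. The function names and the example extension set are hypothetical, and this is not the code from change 855402.

```python
import os


def txt_name(filename, extensions):
    """Map 'name.suffix' to 'name_suffix.txt' when suffix is in `extensions`."""
    root, ext = os.path.splitext(filename)
    # ext includes the leading dot, e.g. '.log'; dotfiles like '.log' have no ext.
    if ext and ext[1:] in extensions:
        return f"{root}_{ext[1:]}.txt"
    return filename


def rename_to_txt(top, extensions):
    """Walk `top` once and rename matching files; return (old, new) path pairs."""
    renamed = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            new_name = txt_name(name, extensions)
            if new_name != name:
                old_path = os.path.join(dirpath, name)
                new_path = os.path.join(dirpath, new_name)
                os.rename(old_path, new_path)
                renamed.append((old_path, new_path))
    return renamed


if __name__ == "__main__":
    print(txt_name("app.log", {"log", "conf"}))   # app_log.txt
    print(txt_name("notes.md", {"log", "conf"}))  # notes.md
```

Keeping the name mapping in a small pure function makes the `.` to `_` replacement easy to unit test separately from the filesystem walk.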
-@gerrit:opendev.org- Thomas Cardonne proposed wip: [zuul/zuul] 855439: feat(elasticsearch): allow filtering of reported returned-vars https://review.opendev.org/c/zuul/zuul/+/855439 | 22:22 | |
-@gerrit:opendev.org- Thomas Cardonne proposed wip: [zuul/zuul] 855439: feat(elasticsearch): allow filtering of reported returned-vars https://review.opendev.org/c/zuul/zuul/+/855439 | 22:23 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 22:45 | |
- [zuul/zuul] 855096: WIP: Tracing: implement span save/restore https://review.opendev.org/c/zuul/zuul/+/855096 | ||
- [zuul/zuul] 855293: Add tracing tutorial https://review.opendev.org/c/zuul/zuul/+/855293 | ||
@jim:acmegating.com | ianw: replied to your comment on 855403 -- just wanted to make sure you saw that since it's a +2 comment | 22:53 |
@jim:acmegating.com | Albin Vass: ^ you may be interested in... well... most of the conversation above re 855403 | 22:56 |
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul-jobs] 855434: Exercise stage-output extensions_to_txt in testing https://review.opendev.org/c/zuul/zuul-jobs/+/855434 | 23:03 | |
@iwienand:matrix.org | corvus thanks ; yeah not logging definitely seems best :) | 23:22 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 854044: Add config-error reporter and report config errors to DB https://review.opendev.org/c/zuul/zuul/+/854044 | 23:40 |