Wednesday, 2022-08-31

-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 853913: zuul-jobs-test-ensure-python-pyenv: update matchers https://review.opendev.org/c/zuul/zuul-jobs/+/853913 00:52
-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 853909: Fedora : update to Fedora 36 release nodes https://review.opendev.org/c/zuul/zuul-jobs/+/853909 01:03
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 04:50
@westphahl:matrix.org corvus: Clark: how about "fallback" or "fallback-labels" instead of "alternatives"? 05:41
@westphahl:matrix.org corvus: but +1 for making that more explicit in the Zuul config 05:42
@westphahl:matrix.org looking at the proposed syntax it should probably be "fallback-nodesets" 05:43
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 06:56
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 07:04
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 07:10
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 07:25
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 08:00
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 08:25
@mbecker12:matrix.org corvus: Are there any more thoughts on this patch? :) 08:42
853993: Add hold command to disable nodes | https://review.opendev.org/c/zuul/nodepool/+/853993
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 855309: [wip] disable streaming writing via var https://review.opendev.org/c/zuul/zuul/+/855309 08:56
@jim:acmegating.com mbecker12: I left some thoughts on the tests. (cc swest) 13:32
@jim:acmegating.com swest: i considered that, but to me, the first one isn't a fallback.  but if that's close enough and we want to use that terminology, we could just clarify it in the docs. 13:33
@westphahl:matrix.org corvus: yeah, then fallback might be more misleading than "alternatives" 13:35
@jim:acmegating.com > <@iwienand:matrix.org> in all the other tests i never saw that one time out.  unfortunately i guess due to a combination of bwrap, namespaces, etc, etc. the trace is basically useless 13:55
i'm confused -- i looked at that log and all i see is the command lines -- the exception handler you added for the gdb ran on every single attempt -- i don't see any tracebacks?
@jim:acmegating.com ianw: could the new console handling be causing ansible not to exit in some cases? 13:58
@clarkb:matrix.org Does anyone know if ansible really needs about 3 seconds per task to run https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/stage-output/tasks/main.yaml#L103 as shown by the log at https://zuul.opendev.org/t/openstack/build/363873433e4d4f1ab66f6b2e97fb9429/log/job-output.txt#15596? The disk io there should be minimal and quick as it is doing a file mv on the same filesystem (no copies across filesystems). This makes me question ansible's per task cost, but it could still be an IO issue 15:18
@clarkb:matrix.org https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/stage-output/tasks/main.yaml#L13 that task also appears to have needed about 3 seconds 15:19
@clarkb:matrix.org I guess it could be the remote host too (swapping, or otherwise slow) 15:19
@clarkb:matrix.org Looking at other jobs, it seems like 1.5-ish seconds for check sudo may be expected 15:27
@clarkb:matrix.org Definitely not quick, but much quicker than this example 15:27
@clarkb:matrix.org When ansible runs remote tasks, does it copy the python code for each task separately, or does the whole playbook go over at once? I wonder if we're seeing write costs for the tasks over the wire 15:36
@clarkb:matrix.org This particular example happened halfway around the world from the zuul executor. Though the task code should be small 15:36
@clarkb:matrix.org (maybe it isn't so small, I don't know how much bootstrapping code must be sent) 15:37
@jim:acmegating.com Clark: ansible has a high per-task cost, even on localhost.  i've been trying to use fewer tasks and more shell blobs. 15:53
@clarkb:matrix.org Ya that task may deserve an update to consolidate into a single shell task 16:00
@fungicide:matrix.org a number of the things we call from base on every build could probably be collapsed into fewer ansible tasks for efficiency. ansible-lint won't like that of course, because it thinks you should never call shell commands which have equivalent ansible tasks, but i get the impression its recommendations are more about readability and maintainability than efficiency 16:06
-@gerrit:opendev.org- Marvin Becker proposed: [zuul/nodepool] 853993: Add hold command to disable nodes https://review.opendev.org/c/zuul/nodepool/+/853993 17:40
@clarkb:matrix.org Any idea how important the replacement of the dot in .suffix with underscore in _suffix.txt is at https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/stage-output/tasks/main.yaml#L103 ? it is relatively trivial to update that to a find command task that will rename everything with one command if we don't also need to replace the . with a _ 18:14
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402 18:19
@clarkb:matrix.org That is marked WIP as I need to test it and better understand the potential for breaking things by removing the _ 18:20
@clarkb:matrix.org fungi: ^ you might also want to review that my find command isn't completely broken 18:20
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402 18:21
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: 18:36
- [zuul/zuul] 855403: Do not use _display outside the main thread in zuul_stream https://review.opendev.org/c/zuul/zuul/+/855403
- [zuul/zuul] 855404: DNM: Run zuul-tox-remote many times https://review.opendev.org/c/zuul/zuul/+/855404
@jim:acmegating.com ianw: i was looking into some issues with tobiash and swest, and they have been encountering some deadlocks when ansible forks and flushes stdout.  the rate increased significantly after the console stream version change went into use, and that seems to correspond with the increase in failures in tox-remote. 18:47
that has led me to the hypothesis that the additional log lines emitted by zuul_stream (and possibly slight timing changes) are causing an existing issue to occur more often. i think the underlying issue is that our spawned threads may call the ansible display methods, which send to stdout/err and flush, while ansible is forking, and the fork may end up inheriting a file object locked by a non-existent thread.
the change above removes any calls to display from within our spawned threads
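A minimal sketch of the hazard described above, assuming a POSIX system with `os.fork()`. It is not Zuul or Ansible code: the plain `threading.Lock` is only a stand-in for the lock guarding a stdout file object, and the function name `child_inherits_held_lock` is hypothetical.

```python
import os
import threading


def child_inherits_held_lock():
    """Return True if a forked child sees a lock held by a thread it does not have."""
    lock = threading.Lock()      # stand-in for the lock inside a stdout file object
    holding = threading.Event()
    done = threading.Event()

    def worker():
        # Plays the role of a spawned thread that is mid-write when the fork happens.
        with lock:
            holding.set()
            done.wait()

    thread = threading.Thread(target=worker)
    thread.start()
    holding.wait()               # fork only once the worker owns the lock

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied, so nothing in this process
        # can release the lock. A blocking acquire() would hang forever, so
        # probe without blocking and report the result via the exit status.
        acquired = lock.acquire(blocking=False)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    done.set()
    thread.join()
    return os.WEXITSTATUS(status) == 1


if __name__ == "__main__":
    print("child saw the lock as held:", child_inherits_held_lock())
```

In the child the lock looks held even though the thread that took it does not exist there, which is why keeping display calls out of spawned threads avoids the problem at its source.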
@jim:acmegating.com note https://github.com/ansible/ansible/pull/77056 is related but doesn't help us directly.  it's not in a released version of ansible yet.   that change could actually make things worse for us without my proposed change as it would likely increase the number of deadlocks caused by our threads.  it's more of an acknowledgement that this is a source of deadlocks. 19:09
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 855417: DNM: Run zuul-tox-remote many times (base case) https://review.opendev.org/c/zuul/zuul/+/855417 19:17
@tristanc_:matrix.org @corvus that sounds good. Is there a name for such bugs that appear when adding debug? Like an anti-heisenbug? :) 19:17
@y2kenny:matrix.org Hi, is it possible to have the Gerrit driver receive stream events from one server but sync/clone from a different server? (possibly accounting for replication-completed events?) 19:32
@clarkb:matrix.org > <@y2kenny:matrix.org> Hi, is it possible to have the Gerrit driver receive stream events from one server but sync/clone from a different server? (possibly accounting for replication-completed events?) 19:47
I think you can configure the scheduler with the gerrit server for stream events, then configure the executors and mergers (which do the git work) with another. Zuul doesn't process replication-completed events though so that may be unreliable
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: 19:47
- [zuul/zuul] 855403: Do not use _display outside the main thread in zuul_stream https://review.opendev.org/c/zuul/zuul/+/855403
- [zuul/zuul] 855404: DNM: Run zuul-tox-remote many times https://review.opendev.org/c/zuul/zuul/+/855404
@y2kenny:matrix.org > <@clarkb:matrix.org> I think you can configure the scheduler with the gerrit server for stream events, then configure the executors and mergers (which do the git work) with another. Zuul doesn't process replication-completed events though so that may be unreliable 19:50
ok. Thinking along that line, is there a way to potentially "pre-warm" the executor git cache? (perhaps triggered by a zuul-executor command line?)
@clarkb:matrix.org You could preclone all of the repos before starting the executor 19:54
-@gerrit:opendev.org- Clark Boylan proposed: 20:11
- [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402
- [zuul/zuul-jobs] 855434: Exercise stage-output extensions_to_txt in testing https://review.opendev.org/c/zuul/zuul-jobs/+/855434
@jim:acmegating.com I opened https://github.com/ansible/ansible/pull/78679 to address the lock issue on the ansible side. 20:24
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: 20:27
- [zuul/zuul] 855403: Do not use _display outside the main thread in zuul_stream https://review.opendev.org/c/zuul/zuul/+/855403
- [zuul/zuul] 855404: DNM: Run zuul-tox-remote many times https://review.opendev.org/c/zuul/zuul/+/855404
-@gerrit:opendev.org- Clark Boylan proposed: 20:39
- [zuul/zuul-jobs] 855434: Exercise stage-output extensions_to_txt in testing https://review.opendev.org/c/zuul/zuul-jobs/+/855434
- [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402
@clarkb:matrix.org 855434 is a small testing change to add coverage for the slow tasks I identified earlier. That shows 855402 doesn't work yet. I think 855434 can be merged whenever reviewers have time as it adds some good test coverage for if/when we make changes to that role 20:57
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 855048: Add ENA support option on uploaded AWS images https://review.opendev.org/c/zuul/nodepool/+/855048 20:57
@jim:acmegating.com the tox-remote tests are failing 20% of the time on master, but 0% of the time so far with the zuul_stream logging change. (with 30 runs of each so far) 20:59
@jim:acmegating.com that's starting to sound significant 20:59
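A quick back-of-the-envelope check on why those numbers look significant, assuming each run is independent and fails with the same probability (an assumption, not something measured in the channel). The helper name `chance_of_clean_streak` is made up for illustration.

```python
def chance_of_clean_streak(failure_rate, runs):
    """Probability that `runs` independent runs all pass at the given failure rate."""
    return (1.0 - failure_rate) ** runs


if __name__ == "__main__":
    # If the change had no effect and the true failure rate were still 20%,
    # how likely would 30 straight passes be?
    p = chance_of_clean_streak(0.20, 30)
    print(f"chance of 30 clean runs at a 20% failure rate: {p:.4f}")
```

Under those assumptions the chance of a 30-run clean streak by luck alone is roughly 0.1%, so the fix having no effect is a poor explanation for the result.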
@clarkb:matrix.org corvus: which change is that? Sounds like I should review it 21:00
@jim:acmegating.com Clark: 855403 21:00
@clarkb:matrix.org corvus: the v2_on_playbook_start call is safe because it isn't running in a subthread? but the _read_log() call was in a subthread? 21:05
@jim:acmegating.com yep 21:05
@clarkb:matrix.org That all seems reasonable based on what you've described in the commit msg and comments. Thanks 21:06
@iwienand:matrix.org > <@jim:acmegating.com> i'm confused -- i looked at that log and all i see is the command lines -- the exception handler you added for the gdb ran on every single attempt -- i don't see any tracebacks? 21:13
just on that, in https://zuul.opendev.org/t/zuul/build/b9598b929d944287b7178acec8b9c812/logs -- because maybe something like this will be useful in the future -- the gdb backtrace didn't make it into the tox output, but can be seen on the raw job output @ https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b95/855116/1/check/zuul-tox-remote/b9598b9/job-output.txt . it mostly can't attach to things (as expected) but we do get something for a few processes it can -- but unfortunately not even basic symbol names
@iwienand:matrix.org but, thanks for looking into this!!!  i agree this looks like it 21:14
@jim:acmegating.com ianw: i must be missing something -- i only see "passing" instead of the gdb traces 21:16
@iwienand:matrix.org you can text search for "Target and debugger are in different PID namespaces;" in that job-output.txt and see some of it 21:16
@jim:acmegating.com ianw: any chance you can do my eyes a favor and use zuul's build page to deep link to a line? 21:16
@jim:acmegating.com ah ok 21:17
@jim:acmegating.com ianw: in the future, we'd probably need the python debug symbols package 21:17
@iwienand:matrix.org it was a bit of a hail mary -- i was hoping that would show us bt's into something suspicious, like we found with the grantpt() thing 21:18
@jim:acmegating.com ianw: maybe python3-all-dbg on ubuntu, i'm not positive 21:18
@jim:acmegating.com ianw: it was a very good idea.  tobiash managed to find a stacktrace internally when investigating his issue that pointed to a stdout.flush on fork start, so that narrowed things down a lot.  i think your approach may have come up with the same if we had symbols.  :) 21:20
@jim:acmegating.com fyi i just wrote a response to sivel at https://github.com/ansible/ansible/pull/78679 with some gory details 21:21
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402 21:22
@jim:acmegating.com (also, love that i got tagged with "new_contributor".  guess 2016 doesn't count) 21:22
@iwienand:matrix.org Oh that's just a formality so that the AI bot knows it has fresh brains to suck into the matrix^W co-pilot :) 21:28
@jim:acmegating.com ianw: :)  i think https://review.opendev.org/855403 is mergeable now with your +2 (no rush -- just letting you know i think it's ready but i also want to wait for your ack). 21:30
@jim:acmegating.com (wow, we got a good batch of base case tests, so the failure rate is down to 15% on master -- but we're still at 0% with the fix, with 40 runs for both) 21:34
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 855402: Speed up log file fetching tasks https://review.opendev.org/c/zuul/zuul-jobs/+/855402 21:40
@clarkb:matrix.org I'm beginning to wonder if it wouldn't be better to write a little ansible module to do this instead 21:46
@clarkb:matrix.org If this works, it's reasonably lightweight, but it adds a new dependency to the role (xargs) 21:47
@clarkb:matrix.org and there isn't really a good way to do the . -> _ transition in a readable fashion. But I'm still not sure how important that is 21:48
@jim:acmegating.com i think ansible module for this would be great 21:48
@jim:acmegating.com also, we have that unit test framework for python code in zuul-jobs now, so that could be nice too 21:49
-@gerrit:opendev.org- Thomas Cardonne proposed wip: [zuul/zuul] 855439: feat(elasticsearch): allow filtering of reported returned-vars https://review.opendev.org/c/zuul/zuul/+/855439 21:49
@clarkb:matrix.org That latest patchset actually works now, but fails on linting 21:50
@clarkb:matrix.org I'll probably pause here and look at the ansible module idea tomorrow 21:50
@clarkb:matrix.org the testing in the parent change has been really helpful 21:50
@iwienand:matrix.org regex + jinja is indeed a nightmare zone 22:01
@clarkb:matrix.org The main thing, I guess, will be how much slower python is than find when emulating find 22:07
@clarkb:matrix.org Probably still quite a bit quicker than the ansible tasks running one by one though 22:08
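A rough sketch of what the Python side of such a module might do, assuming the rename rule discussed earlier (`foo.conf` becomes `foo_conf.txt`). The function name `rename_suffixes` and its arguments are hypothetical; this is not the stage-output role's code, and it leaves out the Ansible module wrapper entirely.

```python
import os


def rename_suffixes(root, extensions):
    """Rename every NAME.EXT under root to NAME_EXT.txt for EXT in extensions.

    Walks the tree once in a single process, which is the point of the
    exercise: one pass instead of one Ansible task per file.
    Returns the list of (old_path, new_path) pairs that were renamed.
    """
    wanted = set(extensions)
    renamed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            stem, dot, ext = name.rpartition(".")
            # Skip files with no dot, dotfiles such as ".conf", and other suffixes.
            if not dot or not stem or ext not in wanted:
                continue
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, f"{stem}_{ext}.txt")
            os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed
```

Calling `rename_suffixes("/path/to/logs", ["conf", "log"])` would handle the whole tree in one invocation; the path here is only a placeholder.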
-@gerrit:opendev.org- Thomas Cardonne proposed wip: [zuul/zuul] 855439: feat(elasticsearch): allow filtering of reported returned-vars https://review.opendev.org/c/zuul/zuul/+/855439 22:22
-@gerrit:opendev.org- Thomas Cardonne proposed wip: [zuul/zuul] 855439: feat(elasticsearch): allow filtering of reported returned-vars https://review.opendev.org/c/zuul/zuul/+/855439 22:23
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: 22:45
- [zuul/zuul] 855096: WIP: Tracing: implement span save/restore https://review.opendev.org/c/zuul/zuul/+/855096
- [zuul/zuul] 855293: Add tracing tutorial https://review.opendev.org/c/zuul/zuul/+/855293
@jim:acmegating.com ianw: replied to your comment on 855403 -- just wanted to make sure you saw that since it's a +2 comment 22:53
@jim:acmegating.com Albin Vass: ^ you may be interested in... well... most of the conversation above re 855403 22:56
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul-jobs] 855434: Exercise stage-output extensions_to_txt in testing https://review.opendev.org/c/zuul/zuul-jobs/+/855434 23:03
@iwienand:matrix.org corvus thanks ; yeah not logging definitely seems best :) 23:22
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 854044: Add config-error reporter and report config errors to DB https://review.opendev.org/c/zuul/zuul/+/854044 23:40
