Monday, 2022-10-03

-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul] 860062: configloader: Fix error report for job nodesets https://review.opendev.org/c/zuul/zuul/+/86006204:29
@clarkb:matrix.orgcorvus: I'm looking at https://tracing.opendev.org/search?service=zuul&tags=%7B%22zuul_event_id%22%3A%22dc53889d2a774a378dac4b8177b030e3%22%7D as an example (no specific thing I'm trying to debug) as I wanted to check that we get the larger traces after weekend reboots (we do). It is a little weird to me that a single event can create multiple traces rather than a single trace with potentially multiple trees17:00
@clarkb:matrix.orgcorvus: in this case I believe it happened because the one trace is for a change that depends on the chagne that generated the other trace17:01
@clarkb:matrix.orgwhich isn't really wrong I guess. Just unexpected to me that a single event has that behavior17:01
@clarkb:matrix.orgcorvus: https://tracing.opendev.org/search?end=1664816893236000&limit=20&lookback=1h&maxDuration&minDuration&operation=Build&service=zuul&start=1664813293236000&tags=%7B%22zuul_event_id%22%3A%22163bc9a5f8744d57aa4bde6ee693746c%22%7D is another interesting one. There are two traces. One that seems to line up with the arm64 check pipeline and the other is the normal check pipeline. However the trace for the check pipeline seems to root with a Merge instead of a QueueItem17:09
@clarkb:matrix.orgAs a result working back to the change that triggered the traces is difficult17:09
@clarkb:matrix.orgsearching QueueItems by ref_number for running changes in zuul status doesn't always produce results either. I wonder if that is related to the arm64 vs check thing above17:23
@clarkb:matrix.orgAfter a few minutes of poking around I think that would be my feedback. Having the event id root everything would probably help show that this is pipeline A and thati s pipeling B and here is change A that is enqueued and change B for the same event which is nice when looking at things from a "what do user actions result in" perspecitve.17:24
@clarkb:matrix.orgAnd using QueueItem consistently so that we can find specific entries when cross checking against running state17:25
@clarkb:matrix.orgThe flamegraph is pretty neat for looking at where time is being consumed17:25
@clarkb:matrix.orgThe comparison tool seems like ti could be really useful if you end up with two similar events but one acted weirdly17:28
@clarkb:matrix.orgMight also be helpful to provide some way to map a span back to code somehow? When debugging with logs we can grep the log string identifier. With the spans they can be updated from multiple locations which amkes this a bit tricky (also it seems that the PipelineManager creates a QeueuItem span and Queueitem creates a Buildset span which is a slightly unexpected mapping).17:38
@clarkb:matrix.orgaha looks like a commonish thing is to update spans in exception handlers similar to what you'd do with log.exception. Then you'd have that particular string to grep for17:46
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul] 859873: Set default SSH_AUTH_SOCK in zuul-bwrap command https://review.opendev.org/c/zuul/zuul/+/85987318:51
@jim:acmegating.comClark: i think having a trace for an event needs a lot of thought.  the mapping of events to actions is extremely unintuitive, particularly because items can end up in queues due to seemingly unrelated events (due to dependencies).  it is also rather difficult to determine when an event ends (without a full scan of the whole system).  and even if we did that, we would have event spans open for literally days in the case of opendev, which means a high chance of encountering warnings in jaeger about incomplete data.  my inclination would be to avoid that for now, and if we get to the point where everything else makes but that still feels like something that is missing, think about adding it.  in the mean time, i think we can focus on making sure we associate event ids appropriately, and adjust our expectations around the fact that a single event can cause multiple items to be enqueued.  put another way: the search results of a query for an event id do show us all the things that happened because of that event, just not in "span" form.  but there is a graph at the top still.19:13
@jim:acmegating.comClark: re the second link, that sounds like a bug.  i don't think a merge is an expected root span19:13
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 860156: Add extra logging around tenant reconfiguration events https://review.opendev.org/c/zuul/zuul/+/86015621:34
@clarkb:matrix.orgcorvus: ya I think that works. My feedback is coming from a place where I often find the event id then get all the logs for that and read them that way. When you have separate traces it doesn't seem like you can see them overlayed with each other. But maybe that is a good thing for debugging21:38
@jim:acmegating.comClark: yeah,  also, i forgot to say: thank you for looking and for the feedback.  it's definitely valuable.  :)  from a user pov, i agree that "seach for everything related to this event" is a good thing to handle and i think what you imagine would be ideal, all things being equal.  but then because of some technical limitations, we may have to accept some limitations.21:40
@jim:acmegating.comso yeah, i think we may need to try an approach like "well, all the circles at the top are the things that happened because of this event, let's start looking at the individual circles (traces) and see which ones are interesting"21:41
@clarkb:matrix.orgalso I think if we start updating spans from exception handlers we could provide some really quick interesting info21:42
@jim:acmegating.com(also, that would be mitigated a lot if we could see more information about the root span on the search results page... like if we get a bunch of queueitems, it would be great to know which pipeline)21:42
@jim:acmegating.comClark: re exceptions, see https://zuul-ci.org/docs/zuul/latest/developer/specs/tracing.html#events-and-exceptions21:44
@jim:acmegating.comi don't want to set expectations that this is going to be an adequate replacement for reading logs as a debugging tool for operational issues.  it's more like being able to see where bottlenecks are, or why some action took a certain amount of time..21:47
@clarkb:matrix.orghrm ok. In that case I think we should try to have some mapping between code and spans then21:48
@clarkb:matrix.orgmaybe record the calling object path or something?21:48
@jim:acmegating.comcan you elaborate as to why?21:49
@clarkb:matrix.orgotherwise you can see where things might be slower than expected in a trace but mapping that to the code becomes more difficult21:49
@clarkb:matrix.orgcorvus:  as I mentioned before we already have a mismatch in calling class and span name21:49
@jim:acmegating.comthe span name is greppable in the code?21:49
@clarkb:matrix.orgit is but it conflicts with class names that are different than where the span starts from what I saw21:50
@clarkb:matrix.orglet me get specific code links21:50
@clarkb:matrix.orghttps://opendev.org/zuul/zuul/src/branch/master/zuul/manager/__init__.py#L588 and https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L4403 for example21:51
@clarkb:matrix.orgthe first link is a QueueItem span in pipeline manager code and other is a BuildSet span in QueueItem code21:52
@jim:acmegating.combecause the pipeline creates the queueitem, and the queueitem creates the buildset21:52
@clarkb:matrix.orgright but if you use grep to find that you'll end up in the wrong part of the code21:53
@clarkb:matrix.orgor at least a level too high. Maybe "wrong" is too strong here21:54
@jim:acmegating.comi think the foremost consideration is that the span should be named according to what it represents.  if it takes a developer an extra step of cognition to make the leap to understanding that BuildSets are created by QueueItems, that seems okay to me, as long as the visualization of the process makes sense in jaeger21:57
@clarkb:matrix.orgyes, I have no problem with the naming. I'm more suggesting a hint to where each span is created be another attribute.21:58
@jim:acmegating.comyes, it's true that the first grep result for BuildSet isn't going to be where the span is created, but the code location for the creation of a BuildSet does say "startSavedSpan...BuildSet"21:58
@clarkb:matrix.orgactually maybe last updated rather than created makes more sense21:59
@clarkb:matrix.orgBut something along the lines of "if jaeger identifies this as anomalous then look here" 21:59
@jim:acmegating.comokay, i think i understand your idea, and it makes sense, but it does come with a cost (all of these attributes have to be stored in zk, and unless we make it automatic, it seems like it could bitrot fairly quickly).  to be honest, i'm not convinced it's worth it, and i would like to see if we spend a lot of time running into the problem you describe first (especially since i'm not sure this will be used in a way that would precipitate that problem)22:00
@clarkb:matrix.orgthats fair. This is mostly me anticipating how I might use it, but I haven't attempted to use it for any debugging yet22:02
@clarkb:matrix.orgin particular the trace comparison tool has me thinking it might be useful (since it highlights where two traces diverge)22:02
@jim:acmegating.com(and to be clear, if these attrs were free, i would totally be on board with adding `__file__ __class__` etc now)22:02
@jim:acmegating.com(so we're definitely just down to "cost-benefit")22:03
@jim:acmegating.com(they are more free if we know them at the end of spans, but things that we only know at the start of spans usually need to be stored in zk)22:04
@clarkb:matrix.orgya and last span update location would definitely need to be tracked in zk22:05
@iwienand:matrix.orgre above, i was just trying to restore my treeview change on this jobs page, after recent changes that have updated the patternfly version22:37
@iwienand:matrix.orgironic-tempest-partition-bios-redfish-pxe has sent me into a bit of a spin because it appears to be a child of itself?22:37
@jim:acmegating.comianw: i haven't looked, but consider the possibility the inheritance differs on different branches22:38
@iwienand:matrix.orghrm, i guess it must be something like that22:41
@iwienand:matrix.orgthe problem that was holding this up before, that the "expand" button didn't expand multiple levels, should be fixed by the pf update that the recent console changes pulled in22:43
@iwienand:matrix.orghence i thought i'd loop back to it22:44
@iwienand:matrix.orgzuul-maint: i also have a follow-on stack to that work @ https://review.opendev.org/q/topic:console-control-squish which is suitable for review22:44
@iwienand:matrix.orgalso, apropos the broken images above -> https://gerrit-review.googlesource.com/c/gerrit/+/346898 had a go at returning this in the initial page.  so far it seems the approach won't work due to latency concerns about having to load page data before serving the single-page app js bits22:47
@iwienand:matrix.org"this" being the link preview metadata22:47

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!