Friday, 2023-09-01

-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul] 892280: Log exceptions on db migration errors https://review.opendev.org/c/zuul/zuul/+/89228007:55
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/nodepool] 893375: Kubernetes/OpenShirt drivers: allow setting dynamic k8s labels https://review.opendev.org/c/zuul/nodepool/+/89337509:00
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/nodepool] 893481: Fix sphinx doc build https://review.opendev.org/c/zuul/nodepool/+/89348109:03
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/nodepool] 893375: Kubernetes/OpenShirt drivers: allow setting dynamic k8s labels https://review.opendev.org/c/zuul/nodepool/+/89337509:39
@mhuin:matrix.orghello zuul-maint, I've started this chain https://review.opendev.org/q/topic:connections_health and would appreciate feedback before I go further down the rabbit hole. For context/incentive behind the chain, we've had issues with github repositories where the zuul app wasn't installed or configured properly, resulting in degraded performances and hitting API rate limiting quotas much earlier than expected. While the zuul admins ended up finding evidence of the issue in the scheduler's logs, it wasn't especially obvious if you don't know what you're looking for. And furthermore, the zuul admins weren't admins on the problematic github repos so we couldn't fix the app, while the rate limiting was applied to our zuul instance as a whole and affected every GH repo IIRC14:20
@mhuin:matrix.orgSo the idea is that for connections that might be out of perimeter for zuul admins, the health of the connection should be made available, either in the form of monitoring metrics or GUI info14:21
@mhuin:matrix.organother possibility would be a zuul-admin subcommand that could poll connections on demand to check their state14:23
@tristanc_:matrix.orgmhu: if I understand correctly, the goal is to report connection issues (e.g. rate limit, project not found, api error, ...). Then I wonder if we need to carry such `_health` structure for every projects, wouldn't it be easier to simply have a list of such error next to the `config_errors` structure, and report them on the existing web page?15:04
@jim:acmegating.comi think having connection configuration errors reported in the logs is pretty reasonable and expected :).  but i understand that you might want to set alert conditions for administrators, and that can be difficult to do based on log messages.  so i think having statsd/prometheus metrics that report health on connections is reasonable.  but in general, connection configuration problems are not actionable by users, so i don't think they should be in the main part of the web ui.  if it's not something that can be fixed by changing zuul.yaml then it should probably not be on the projects page (which shows the zuul.yaml configuration for a project) or the configuration errors page (which shows errors in zuul.yaml configuration)15:07
@jim:acmegating.comtbh, my preference would be to keep this stuff in the logs.  especially if it's very open ended (you listed a lot of different types of errors there).  but something to alert admins that they should check the logs for errors, or certain kinds of errors, sounds useful.15:10
@tristanc_:matrix.orgcorvus: but sometime, connection problems are not actionable by admin either, for example when there is a GitHub api outage, or when the app is not installed properly.15:11
@jim:acmegating.comtristanC: sure, but none of those can be fixed by changes to zuul's configuration.  that's the distinction i'm trying to convey.15:12
@jim:acmegating.commaybe it would be worth writing up exactly what information is needed and where you think it makes sense to expose that.  because there's a lot of confusing crossover with connections and projects in a multi-tenant setup.  associating connection information to projects can be tricky.15:17
@mhuin:matrix.orgThe difficulty is that for connections like GH and possibly also GL and Pagure is the requirement that a repo has an application installed, which makes {connection,project} a unique setup15:19
@jim:acmegating.comexactly, while {tenant, project} is also unique and potentially overlaps in strange ways with the other.15:20
@mhuin:matrix.orgalso, metrics as in statsd and prometheus are just numerical values AFAICT so you can't really convey info about what's exactly going on, which is why I added a description field to the health structure I proposed15:21
@jim:acmegating.commetrics for alerting and logs for diagnosis makes sense to me15:21
@mhuin:matrix.orgso maybe a simpler approach could be to have statsd 0/1 gauges for states like "OK", "DEGRADED", "FAILING"15:22
@mhuin:matrix.orgfire them when there's a problem in a connection, and leave it up to admins to have alerting set up and to look in the logs15:23
@mhuin:matrix.organd drop the project headache15:23
@jim:acmegating.comi think that would be a simple solution.  but also, it's starting to sound to me like this problem may be much more specific than i was originally understanding.  if really the issue is "we want users to know if the github app used by the connection hasn't been installed for this project"... that maybe is a bit more of a tractable problem than generalized connection health reporting.15:26
@mhuin:matrix.orgyeah that was the original incentive, and then I thought "maybe I can go for something more generic and useful?"15:29
@jim:acmegating.commhu: okay, i think i get the use case, and i get why putting it on the project page in the ui makes sense for this case... and i think maybe setting it up in a generic enough way that we could use it for other things in the future makes sense.  but i'd want to be really careful with it that we don't put too much admin-level information in there, or that we don't leak what should be private information about the connection.  so i think i'm okay with adding connection information (specifically to start out: errors relating to app installation) to the project page.15:33
@jim:acmegating.comi think things like persistent 500 errors, etc, may be better for a statsd/prometheus metric for admin alerting.  like, if you get a 500, or hit a rate limit, or something else, that's better to signal an admin to check the logs15:37
@mhuin:matrix.orgAgreed, then 4XX errors would probably fall under config errors15:38
@mhuin:matrix.orgwell generally speaking15:38
@jim:acmegating.comi think maybe the "github app not installed" thing is a new class of malfunction for us to think about: basically, it's an external configuration error.  so propagating external configuration errors to users is reasonable.  and server 5xx or authentication errors, or rate limits, etc, are operational issues, or admin configuration errors, and they should not show up in the web ui, but should be in logs, or alerts...15:39
@clarkb:matrix.orgit isn't necessarily an error to not have an app installed though?15:46
@clarkb:matrix.orgzuul operates with a regular user token iirc15:46
@mhuin:matrix.org@Clark: no, if the app isn't install and if there's no token in the config, it falls back to anonymous API access15:46
@mhuin:matrix.orgbut then can get hit with API rate limiting15:47
@clarkb:matrix.orgbut zuul is functional in that state right?15:47
@clarkb:matrix.orgother than potential rate limits which for small intsalls may be a non concern15:48
@mhuin:matrix.orgwell not trying to brag, but our install is yuuuge! 🙂15:48
@clarkb:matrix.orgsure, my point is that having a big error sign may be counter productive for some zuul installs15:49
@jim:acmegating.comsure, but i think the point stands that not having a huge install is not an error15:49
@mhuin:matrix.orgbut yeah it's probably not a concern for smaller deployments15:49
@mhuin:matrix.orgbut if you have an app_id set in zuul's config and the app isn't installed on the repos zuul provides CI for, surely this should be an error?15:50
@jim:acmegating.comnope15:50
@jim:acmegating.comnot an error15:50
@jim:acmegating.comit's entirely reasonable and expected to add projects for sibling testing, dependencies, etc without having the app installed15:51
@mhuin:matrix.orgah ... this gets pretty complex then if this can be expected behavior15:52
@mhuin:matrix.orgmaybe that should be part of the project's description in the gui, if the connection is of type github, show whether the zuul app is installed or not15:54
@mhuin:matrix.orgwell, food for thought for the weekend15:55
@jim:acmegating.comit could potentially be okay as information, not error15:55

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!