Wednesday, 2021-12-08

*** rlandy|ruck is now known as rlandy|out00:09
*** dviroel is now known as dviroel|out00:10
opendevreviewMerged openstack/project-config master: Update the opendev/system-config tag  https://review.opendev.org/c/openstack/project-config/+/819715 00:26
*** timburke__ is now known as timburke00:33
opendevreviewMerged openstack/project-config master: Fix Neutron periodic dashboard  https://review.opendev.org/c/openstack/project-config/+/820912 00:34
opendevreviewMerged openstack/project-config master: Add rights to neutron-dynamic-routing-stable-maint  https://review.opendev.org/c/openstack/project-config/+/820351 00:41
*** raukadah is now known as chandankumar04:43
*** ysandeep|out is now known as ysandeep04:50
opendevreviewyatin proposed openstack/project-config master: Fix Neutron periodic dashboard  https://review.opendev.org/c/openstack/project-config/+/820980 06:37
*** bhagyashris_ is now known as bhagyashris06:57
*** bhagyashris_ is now known as bhagyashris07:19
*** ysandeep is now known as ysandeep|lunch07:23
opendevreviewMerged openstack/project-config master: Fix Neutron periodic dashboard  https://review.opendev.org/c/openstack/project-config/+/820980 08:29
*** ysandeep|lunch is now known as ysandeep08:35
*** ykarel_ is now known as ykarel09:21
*** ysandeep is now known as ysandeep|afk10:11
*** dviroel|out is now known as dviroel10:38
*** rlandy|out is now known as rlandy|ruck11:05
*** ysandeep|afk is now known as ysandeep11:16
*** jcapitao is now known as jcapitao_lunch12:02
*** ysandeep is now known as ysandeep|brb12:49
*** outbrito_ is now known as outbrito13:02
*** ysandeep|brb is now known as ysandeep13:07
*** jcapitao_lunch is now known as jcapitao13:34
*** ysandeep is now known as ysandeep|dinner13:49
opendevreviewdaniel.pawlik proposed openstack/ci-log-processing master: Convert max-skipped parameter to int  https://review.opendev.org/c/openstack/ci-log-processing/+/820848 13:50
*** ykarel is now known as ykarel|away14:07
*** dviroel is now known as dviroel|lunch14:56
slaweqHi infra team15:13
slaweqI want to ask about one potential improvement in zuul15:13
slaweqin Neutron team we were thinking how to improve number of rechecks on patches, and resources used by neutron15:14
slaweqand one of the potential improvement could be if maybe jobs which finish with POST_FAILURE could be automatically retried15:14
slaweqor if we could recheck only such POST_FAILURE jobs15:15
slaweqas in most of the cases when job will finish with POST_FAILURE it's not really related to the patch itself15:15
slaweqand it should be safe to not recheck everything else in such case15:15
slaweqwdyt about it? would it be doable maybe?15:16
fungiwe do automatically rerun builds which fail in a pre-run playbook, so rerunning builds which fail in a post-run playbook probably wouldn't be that different, except that for consistent failures of that sort you'd potentially wait far longer for a retry_limit result. one problem i foresee is that failures in the run playbook are often followed by failures in post-run (run didn't create15:25
fungisome artifact which is collected at the end of the job, for example) so this would potentially hide such error conditions15:25
fungialso if it were implemented the same way as how pre-run failures are caught, i think that's a global behavior of the scheduler so would affect all jobs for all projects in all tenants15:26
*** ysandeep|dinner is now known as ysandeep15:45
slaweqfungi regarding failures in RUN phase, I think that if there are such errors, then job finishes with "FAILED" not with "POST_FAILURE"16:10
slaweqbut I agree that it could potentially hide some other errors which happened earlier16:10
slaweqso maybe there would be way to recheck on jobs which ended up in POST_FAILURE state16:11
slaweqthat would save at least some infra resources in some "recheck" cases16:11
*** dviroel|lunch is now known as dviroel16:13
clarkbslaweq: it's more nuanced than that. If the run failure induces failure in post then you get a post failure. This is very common16:16
clarkbSince post-run tends to process outputs of run and if run fails to produce those outputs properly this happens16:16
Reed_dpawlik How's it going? Any luck connecting to OpenSearch?16:18
slaweqclarkb I see, but that's why I'm asking if it would be maybe possible to recheck only jobs which ended up like that, to not recheck "selectively" always, but at least in this specific scenario. Maybe e.g. allowed only for core team, I don't know16:18
slaweqif it's not possible, then it's fine too for me16:19
slaweqat least I'll have answer for that :16:19
clarkbslaweq: I addressed that in the mailing list thread16:19
slaweq😀16:19
clarkbit is a bad idea for a number of reasons. Most importantly it circumvents "clean check"16:19
slaweqclarkb I totally agree that it's bad idea in general16:19
slaweqbut I was hoping to maybe have exception only for such POST_FAILURE jobs16:20
slaweqif it's bad idea too, that's fine16:20
clarkbslaweq: I'm not sure that POST_FAILURE changes anything for why it is a bad idea.16:20
clarkbslaweq: I wrote this in the emails but basically where I stand is anything that doesn't fix or remove errors is only going to accelerate the problems not make them better16:21
fungislaweq: i think that would be very hard (likely intractably hard) for zuul to determine afterward, given the nature of job dependencies some may have been skipped so wouldn't actually represent a post_failure result for example, or dependencies may need to get rerun (consider a build ending in post_failure which needs an image registry being run by another job it depends on)16:21
clarkbslaweq: if you ignore errors and retry until you pass you allow more errors into the system quicker16:22
clarkbthis is why clean check exists. We did a bit of analysis on a number of gate breaking issues and found significant numbers of them were rechecked and forced through16:24
clarkbI think a better approach is to fix the errors. And if the scope is too large to fix then reduce the scope16:24
clarkbslaweq: looking at the last 4 neutron jobs with POST_FAILURE results 3 of them are related to subunit processing not correctly setting the command to run. So it's executing the command arguments with no command prefix and failing16:35
clarkbslaweq: the fourth has no logs which implies complete network connectivity loss (could be the job mangled the network stack or the kernel panicked etc)16:36
clarkbI'll look at the subunit thing today16:36
clarkbslaweq: remote:   https://review.opendev.org/c/zuul/zuul-jobs/+/821101 Try to fix broken stestr command discovery is the change17:04
*** rlandy|ruck is now known as rlandy|ruck|mtg17:09
*** ysandeep is now known as ysandeep|out17:10
*** rlandy|ruck|mtg is now known as rlandy|ruck18:05
*** tobias-urdin3 is now known as tobias-urdin20:10
*** aluria is now known as Guest800820:13
slaweqclarkb ok, I will check them. Thx20:52
*** rlandy|ruck is now known as rlandy|ruck|bbl23:35
*** ysandeep|out is now known as ysandeep23:51
