Saturday, 2021-02-27

clarkbfungi: it calls log = get_annotated_logger(self.log, task.zuul_event_id, build=task.build)00:00
clarkbcorvus: ya you could probably put task back on the queue00:01
clarkbaborting from that separate thread would be a bit weird though I think. If we're going to abort we should maybe do it via _update() checking the result of innerUpdateLoop activity00:02
fungiyeah, for reference this is how it winds up in the log: http://paste.openstack.org/show/803050/00:02
fungii get lost quickly when trying to reason through cross-thread interactions00:03
corvusoh, so the idea is to abort the build when that happens?00:03
fungiwe do abort under similar circumstances in AnsibleJob.execute()00:04
fungibut if it's caught in ExecutorServer._innerUpdateLoop() the build returns ERROR instead00:05
fungihttps://zuul.opendev.org/t/openstack/build/999f096bccbe43c286b7c6ed3c5deeb500:06
fungi"Error: Failed to update project None"00:06
corvusfungi: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L104600:06
fungiso that's where we should send abort?00:07
corvusthat's what translates the task failure into an ExecutorError exception, and ExecutorError means "non-transient"00:07
corvusfungi: i think you're suggesting a nuance, in that a brokenprocesspool on the executor should be considered a transient error00:08
fungiahh00:08
clarkbthat could be implemented by updating UpdateTask to have a transient error vs non transiet error flag00:09
corvusso you'll want to distinguish this condition from others, and do something other than ExecutorError.00:09
corvusyeah like clarkb said00:09
clarkbone tricky thing is that the regular merger server shares some of those code paths00:09
fungii was mostly wondering if it should exhibit behavior consistent with https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L948-L95300:09
corvusif you do clarkb's suggestion, then it'd probably look like that -- "if transient: self._send_aborted()"00:11
corvusi think you could do that without even dealing with any new exceptions00:12
fungior is the problem that the ExecutorError is masking the BrokenProcessPool there?00:12
corvusit's not masking it since they're in different threads00:12
fungiokay, i think i get it00:13
fungibut we can apparently raise ExecutorError and BrokenProcessPool in the execute thread too00:13
corvusfungi: but -- if we did transfer the exception between threads, then yeah, that would be caught by https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L94800:13
corvusyeah, i'm not sure where else it could come from00:14
fungior could it be that the exception handling at 948 and 954 are dead code because those aren't raised in that thread?00:15
corvuscould be00:16
corvuswell not 954, that's definitely live00:16
corvusbut 948 could be dead00:16
fungiwondering if someone added the one at 948 as an attempt to deal with this exact condition00:16
corvus(ExecutorError has subclasses which i know are raised)00:17
corvusworth a look in the history00:17
fungiadded a year ago by https://review.opendev.org/70930700:18
fungiand yeah the commit message describes exactly what i think we experienced00:18
fungioom killer sniped the python process, leading to a broken process pool00:19
corvusoh nope, it's not dead code.  https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L109500:20
corvusthat will hit the abort path00:20
clarkbya its only used by merger paths but apparently there is at least one outside of the separate thread00:22
clarkbwhat are state_items?00:22
corvusfungi: so yeah, i think we're back at the same place; the existing code isn't defective, it's just not as nuanced as you'd like, and i think clarkb's suggestion is the best way to add that.00:22
fungicool, i'll give it a shot. thanks!00:23
corvusclarkb: yeesh, that could sure use a comment :)  i believe they are queue items which are present only for the purpose of fixing repo state, but they don't have associated changes.  i think if a project is in required-projects but there isn't a change ahead in that project, then a synthetic item will appear there so that all of the jobs running for a change use the same repo state for all the repos.00:26
openstackgerritMerged zuul/zuul master: Delay getting many files to mergers  https://review.opendev.org/c/zuul/zuul/+/75683700:37
openstackgerritMerged zuul/zuul master: Defer setting build pause to event queue  https://review.opendev.org/c/zuul/zuul/+/71293900:37
*** tosky has quit IRC00:42
openstackgerritJames E. Blair proposed zuul/zuul master: Update zoned executor stats  https://review.opendev.org/c/zuul/zuul/+/77785201:11
corvusclarkb, fungi: can you check my comment on https://review.opendev.org/761756 ?01:17
clarkbcorvus: I think that is correct, basically ever downstream has to manually update to get new tag state (either removal or move)01:23
clarkbnote that git fetch --prune-tags will not help with a moved tag if I read the manpage correctly01:24
clarkbit will also remove any local tags that were never on the remote (shouldn't be an issue for zuul)01:24
corvusmoved tag? oh01:24
corvusoy01:24
corvuslike same name different commit01:25
clarkbcorrect01:25
clarkbzuul will continue to see the wrong value in that case I think (basically changing tags or removingthem is bad)01:25
corvusi guess the downside to implementing this change is that it might lead people to believe that this works when it really doesn't01:25
clarkbya, also I think it may be worth having a test of these cases01:26
clarkbbceause its tricky stuff and if we've got a test we can at least assert zuul does what we expect it to01:26
corvusie, you delete a tag, the next job prunes it on executor 01, then you add it back with a new value, and it runs on executor02 and silently does the wrong thing01:26
clarkbyup, or even just force push the new tag value so there is no intermediate delete01:27
clarkb(I think that is allowed when pushing)01:27
clarkb*force push the same tag name but with new commit value01:27
corvusso basically, nothing good ever happens with deleting or changing tags; and there's an argument to be made that the status quo at least breaks in a way that causes people to have to do work01:28
corvusotoh, a force-push tag update is not going to trigger any zuul breakage01:28
corvusso an argument could be made that this is an incremental improvement so we can at least handle 1 of the 3 possibilites semi-transparently.01:29
clarkbzuul would see the force push of the tag change as a refs/tags/ update and rerun jobs against the old commit01:29
clarkband ya this should handle deletes just fine as long as all executors/mergers see the delete eventually01:29
corvusi guess i'm on the fence on this; still slightly leaning toward saying "yes".  cause it probably doesn't make things any worse, even though it mostly doesn't make them better.01:30
clarkbI guess the only race there is if a new tag with the same name is pushed after a delete, because any other event would trigger the pruning01:30
corvusyeah01:30
corvusclarkb: you want to leave some notes on that change?  (or if you need to eod, i can)01:30
clarkbI can do it01:30
corvuskk, thx01:31
clarkbcorvus: you may want to double check what I wrote, but its posted now01:33
clarkbbut now I do need to eod, I smell food01:37
openstackgerritMerged zuul/zuul master: Enhance logging of status_get requests  https://review.opendev.org/c/zuul/zuul/+/77166902:24
*** paladox has quit IRC02:33
*** paladox has joined #zuul02:39
*** ikhan has quit IRC04:14
*** evrardjp has quit IRC05:33
*** evrardjp has joined #zuul05:33
tobiashcorvus: I've responded on 76708407:47
openstackgerritTobias Henkel proposed zuul/zuul master: Allow refreshing volatile data in canMerge check  https://review.opendev.org/c/zuul/zuul/+/76708408:14
openstackgerritTobias Henkel proposed zuul/zuul master: Check cycle items are mergeable before reporting  https://review.opendev.org/c/zuul/zuul/+/74345008:14
tobiashcorvus: I changed 767084 so that it does what we need but avoids to update the changes completely ^08:15
*** evrardjp_ has joined #zuul09:19
*** evrardjp has quit IRC09:21
*** zbr has joined #zuul09:25
*** zbr9 has joined #zuul09:37
*** zbr has quit IRC09:40
*** zbr9 is now known as zbr09:40
*** evrardjp_ has quit IRC09:46
*** zbr7 has joined #zuul09:51
*** zbr has quit IRC09:53
*** zbr7 is now known as zbr09:53
*** zbr has quit IRC10:15
*** zbr has joined #zuul10:16
*** zbr2 has joined #zuul10:20
*** zbr has quit IRC10:23
*** zbr2 is now known as zbr10:23
*** jangutter_ has quit IRC10:40
*** jangutter has joined #zuul10:40
*** tosky has joined #zuul11:24
*** evrardjp has joined #zuul11:44
*** zbr has quit IRC13:00
*** zbr has joined #zuul13:01
*** evrardjp has quit IRC13:26
*** evrardjp has joined #zuul13:27
*** holser_ has quit IRC13:27
*** zbr1 has joined #zuul13:29
*** zbr has quit IRC13:31
*** zbr1 is now known as zbr13:31
*** jangutter_ has joined #zuul13:32
*** jangutter_ has joined #zuul13:33
*** jangutter has quit IRC13:33
*** ikhan has joined #zuul13:36
*** jangutter_ has quit IRC13:54
*** zbr2 has joined #zuul13:55
*** jangutter has joined #zuul13:55
*** zbr has quit IRC13:56
*** zbr2 is now known as zbr13:56
*** evrardjp has quit IRC14:04
*** evrardjp has joined #zuul14:05
corvustobiash: can you see my reply on https://review.opendev.org/743450  ?  i'm still confused14:25
tobiashcorvus: I think I was wrong, generally an event is needed (e.g. gerrit). Only in the github case the event is not needed.14:28
tobiashbecause of the lacking events by github14:29
tobiashI confused this with the parent change14:29
tobiashgenerally this change was intended to work also without the parent change14:30
tobiashmaybe it's easier to understand if we'd switch the order of those two14:30
corvustobiash: yeah, i think it is closely related to the parent14:30
tobiashactually 743450 was older than the parent which has been added in between at some point in time to fix the lack of github events14:32
corvustobiash: i guess my thought is that the point of the late canMerge check is to catch the case where github has changed but our model of it hasn't, which is most likely to happen if github changes something without emitting an event (though there are other ways like delayed event processing).  so that's the new test case we should add.14:32
corvusi certainly won't argue with adding 2 test cases :)  but i still think if we had to pick one, it would be the "did not get event" case14:32
tobiashI agree, the 'did not get event' case covers both cases14:34
tobiashso yes, we could have either both cases or just the 'dif not get event' case14:35
*** jfoufas1 has joined #zuul14:35
corvustobiash: cool14:35
tobiashcorvus: in that case I'd change the 'make it draft but don't send event' to 'add some additional required check to branch protection and don't send event'14:36
tobiashthat fits the real world better since in that case github indeed doesn't send an event and the parent change gets automatically tested14:37
tobiashoh well, draft is part of the refetching14:37
tobiashso I think just removing the event will do fine14:38
corvusi love -1 line fixes14:38
tobiash:)14:38
*** jfoufas1 has quit IRC14:40
tobiashcorvus: I've replied with a summary of what we just discussed14:42
*** zbr0 has joined #zuul14:43
corvustobiash: lgtm.  do you want you or i to update that or wait for swest_ to chime in?14:43
tobiashalready on it14:43
corvusk14:44
tobiashtest case is green locally without the event :)14:45
*** zbr has quit IRC14:45
*** zbr0 is now known as zbr14:45
openstackgerritTobias Henkel proposed zuul/zuul master: Check cycle items are mergeable before reporting  https://review.opendev.org/c/zuul/zuul/+/74345014:45
tobiashcorvus: 777843 has a small typo and a minor linting issue, lgtm otherwise14:49
corvustobiash: ah yep, i have a few -1Vs to go through :)14:51
openstackgerritJames E. Blair proposed zuul/zuul master: Update zoned executor stats  https://review.opendev.org/c/zuul/zuul/+/77785214:53
openstackgerritJames E. Blair proposed zuul/zuul master: Support enqueuing behind circular dependencies  https://review.opendev.org/c/zuul/zuul/+/77784314:54
openstackgerritJames E. Blair proposed zuul/zuul master: Use ZooKeeper TLS in tests  https://review.opendev.org/c/zuul/zuul/+/77748914:59
tobiashcorvus: I think we should rebase 673840 (allow unzoned option) on 777852 since it would be affected by it14:59
openstackgerritJames E. Blair proposed zuul/zuul master: Emit config warnings if shared queues on pipelines  https://review.opendev.org/c/zuul/zuul/+/77747915:00
corvustobiash: oh that's going to make the 'online' calculation difficult15:02
tobiashcorvus: I think I'd do the same with online then (register unzoned and zoned online function)15:03
corvussince that's an online zoned executor, and online unzoned executor -- but for counting "all" is it 1 or 2 executors? :)15:04
tobiashcorvus: maybe we should remove the 'all' then15:04
corvusmy intuition says it's still just 1 executor for "all" which means we need to detect that the zoned and unzoned are the same15:05
tobiashif we want the 'all' to be correct, we could add a uuid to the online function and deduplicate then15:05
corvustobiash: yeah (though i'd go with hostname, we already do that for other gearman functions)15:06
corvus("hostname"  -- it's really more than the hostname and more like a uuid)15:06
corvus(that just happens to have the hostname in it :)15:06
tobiashk15:08
corvustobiash: it's generally not too hard to make dashboards that sum gauges -- so we could remove all and someone who wants that can sum(zoned, unzoned).  they'd get 2 as the answer in that case though.  so if we want to be able to have a dashboard with 1 executor in the dual zoned/unzoned case, then i think our only option is to keep all and do that extra work with the hostname15:08
tobiashspeaking for us we just list online/accepting executors in different graphs per zone plus unzoned so we don't sum them15:09
tobiashbut I can imagine that someone might want to keep the 'all' category as well15:10
corvustobiash: yeah.  i agree that zoned / unzoned is more useful from an ops pov.15:11
corvusdoesn't matter if you have 5 executors running if all of them are zoned and you need an unzoned one.15:11
tobiashyeah15:12
corvustobiash: then maybe we should eliminate the "all" stats as well as the "unzoned" stats hierarchy, make the top level mean unzoned, but keep the zoned ones under "zoned" to avoid name collisions?15:14
corvusthat sort of forces users to use that view of the system, so they don't inadvertently create a dashboard that says everything's okay15:15
tobiashfrom hierarchy data model long term I think zoned/unzoned should be the same level15:16
tobiashlike unzoned is just another or the default zone15:16
corvustobiash: that would be a breaking change for existing deployments15:16
tobiashyeah, so maybe we could do both for a period of time and deprecate the old one15:17
*** evrardjp has quit IRC15:17
corvusyeah15:17
*** evrardjp_ has joined #zuul15:17
tobiashand remove the old maybe with 5.015:17
corvusokay, i left a note on the unzoned option change, and i'm leaving a note on the stats change now; i'll rework the stats change in a little bit15:18
tobiashk15:18
corvustobiash: ok comment left on https://review.opendev.org/777852 i think that covers it15:20
tobiashlgtm15:23
*** holser has joined #zuul15:48
*** evrardjp_ has quit IRC16:00
*** evrardjp has joined #zuul16:00
*** evrardjp has quit IRC16:16
*** evrardjp has joined #zuul16:16
*** evrardjp has quit IRC16:17
*** evrardjp has joined #zuul16:18
*** holser has quit IRC16:33
*** evrardjp has quit IRC17:21
*** evrardjp has joined #zuul17:22
*** jangutter has quit IRC19:03
*** jangutter has joined #zuul19:03
sassynHi everyone19:21
sassynthank u avass19:21
sassynthank u also tobiash19:21
sassynThe issue I'm having is still there. Commit code to gerrit, go to zuul console, open the gate job, I see the commit. Clicking on the Job show for example : 16:21 Job console starting.....  and the next line is 16:31 Running Ansible setup....19:24
sassyn10 min waiting time for nothing. This is the only job i have in my system19:25
avasssassyn: that's probably the time it takes for the executor to clone/checkout any repos it needs to run the jobs. are the repos large or are you running jobs with a lot of repos?19:46
avasssassyn: in any way I don't think it's "nothing" it's just not showing what's actually happening during that time in a transparent way :)19:48
tobiashsassyn: you may want to have a look into the executor logs to see what it's doing20:07
*** ikhan has quit IRC20:08
tobiashsassyn: how large are the repos involved in that job and what storage type is it using?20:08
tobiashAlso the bandwidth to gerrit may matter20:08
avasstobiash: would it be possible to log something like "Setting up workspace" or does the executor build logs start when the jobs actually start?20:09
tobiashsassyn: also make sure that the git_dir and the job_dir are on the same volume (https://zuul-ci.org/docs/zuul/discussion/components.html#attr-executor.git_dir)20:10
tobiashhaving those on separate mount points will dramatically slow down the job startup times20:10
tobiashbecause this prevents hard linking during cloning into the job workspace20:11
tobiashavass: it's writing this when it creates the job dir: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L56820:13
avasstobiash: ah20:14
tobiashavass: which is right before fetching and preparing the repos20:14
avassmaybe updating that message and adding a "setting up workspace" then so users know that it's working on something :)20:15
*** ikhan has joined #zuul20:16
tobiashavass: I think it should be easy to add further messages there20:16
tobiashthe first could be something like updating repos20:17
tobiashand here we could add something like preparing workspace: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L106120:17
avassthanks. looking at that in a moment. I think it would be good to make that a bit more transparent to the user20:20
tobiashfungi, corvus: regarding the processpool topic. I haven't read the full conversation so just ignore me if that's superfluous. I don't think we should do a _send_aborted in this case since this sends the build result to ABORTED which should only be done if the scheduler requested the abort. If you want to trigger a retry build result None is the right thing like for a pre failure or in the unreachable case: https://20:22
tobiashopendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L143120:22
fungitobiash: does that mean you feel like the current behavior at https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L948-L953 is also wrong?20:26
tobiashfungi: I mean the job result20:33
tobiashfungi: ah yeah, I think exactly that's where you want to change the behavior20:34
fungitobiash: i'm not sure i follow. _send_aborted() causes the build to be aborted and retried, right? at least that's what the code comment says20:35
tobiashI think the code mismatches the comment20:35
tobiashsend_aborted sends a job result of ABORTED back to the scheduler20:36
tobiasha retry needs to be triggered exactly like https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L955-L958 just with result=None instead20:37
fungiinterestingly, what we observed was that the result reported for the build was ERROR with detail "Error: Failed to update project None"20:37
tobiasha result of None will tell the scheduler to retry20:37
fungiwe didn't get an ABORTED result reported20:38
fungihttps://zuul.opendev.org/t/openstack/build/999f096bccbe43c286b7c6ed3c5deeb520:38
tobiashfungi: can you dig up the stack trace in the executor log?20:38
fungitobiash: yep, this was what i found http://paste.openstack.org/show/803050/20:39
tobiashneed to check if we re-raise that exception20:39
fungias you guessed earlier, it was the result of the process being terminated by the kernel's oom killer20:39
tobiashyeah, that's pretty much always the case, either oom or a crashing python interpreter20:40
tobiashand oom is much more likely these day20:40
tobiashfungi: ok, so your stack trace shows that the exception is catched here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L296220:42
tobiashand the exception information is not passed to the caller20:42
tobiashwhich is why you end up here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L95520:43
tobiashwith result error20:43
tobiashwhat we could do is to attach some error information to the task here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L296320:44
tobiashand evaluate them here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L104620:44
tobiashe.g. if the error is transient return None there20:45
openstackgerritTobias Henkel proposed zuul/zuul master: Retry job on broken process pool  https://review.opendev.org/c/zuul/zuul/+/77787420:48
tobiashfungi: like ^20:48
fungiaha, thanks, think that's basically what we discussed at http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2021-02-27.log.html#t2021-02-27T00:11:5020:55
tobiashoh wait, that misses something20:55
openstackgerritTobias Henkel proposed zuul/zuul master: Retry job on broken process pool  https://review.opendev.org/c/zuul/zuul/+/77787420:59
openstackgerritTobias Henkel proposed zuul/zuul master: Retry job on broken process pool  https://review.opendev.org/c/zuul/zuul/+/77787421:02
openstackgerritTobias Henkel proposed zuul/zuul master: Retry job on failed repo updates  https://review.opendev.org/c/zuul/zuul/+/77787521:03
tobiashfungi, corvus: as an alternative we could also retry jobs on every uptate failure ^21:03
fungii guess i'd be slightly more worried about the possibility for some sorts of update failures to be persistent, leading to retry loops. but we do at least still have a retry limit serving as a backstop in that case21:20

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!