Tuesday, 2021-06-29

00:07 <ianw> clarkb: out of interest i ran a dnf update which pulled in firefox 89.0.2 and i agree that the console highlighting seems to work better on the current production with that
00:08 <ianw> also everything now has rounded corners
00:36 <fungi> for safety
00:50 <ianw> it also appears to have replaced a previous feature that i occasionally enjoyed "playing any sound" with long timeouts and "firefox is not responding" windows
05:12 *** marios is now known as marios|ruck
05:59 <opendevreview> Felix Edel proposed zuul/zuul master: Lock/unlock nodes on executor server  https://review.opendev.org/c/zuul/zuul/+/774610
07:00 *** jpena|off is now known as jpena
08:15 <opendevreview> Benjamin Schanzel proposed zuul/zuul-jobs master: Add a meta log upload role with a failover mechanism  https://review.opendev.org/c/zuul/zuul-jobs/+/795336
09:15 *** bhagyashris_ is now known as bhagyashris
09:20 <opendevreview> Merged zuul/zuul master: Configure unique command socket path per scheduler  https://review.opendev.org/c/zuul/zuul/+/771462
09:23 <opendevreview> Merged zuul/zuul master: Refactor config/tenant (re-)loading  https://review.opendev.org/c/zuul/zuul/+/795263
09:46 <opendevreview> Benjamin Schanzel proposed zuul/zuul-jobs master: Add a meta log upload role with a failover mechanism  https://review.opendev.org/c/zuul/zuul-jobs/+/795336
10:44 <opendevreview> Merged zuul/zuul master: Replace TreeCache in component registry  https://review.opendev.org/c/zuul/zuul/+/796582
11:34 *** jpena is now known as jpena|lunch
11:38 <opendevreview> Felix Edel proposed zuul/zuul master: Lock/unlock nodes on executor server  https://review.opendev.org/c/zuul/zuul/+/774610
12:36 *** jpena|lunch is now known as jpena
12:54 <pabelanger[m]> Is anyone using the jwt tokens to manage autoholds for jobs?
12:54 <pabelanger[m]> I'm wanting to set them up, but was curious to see if anyone else has
12:54 <pabelanger[m]> also, are there plans to expose that functionality in the web UI, if a token is found?
13:07 <tristanC> pabelanger[m]: yes we do, mostly to enable the qa team to debug their jobs without our assistance.
13:08 <tristanC> jwt is configured by default in sf-config with https://softwarefactory-project.io/cgit/software-factory/sf-config/tree/ansible/roles/sf-zuul/templates/zuul.conf.j2#n79
13:09 <tristanC> then tokens can be issued with `zuul create-auth-token --auth-config zuul_operator --tenant <tenant> --user <audit-info> --expires-in 7776000`
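
For context, the sf-config template linked above wires an HS256 authenticator into zuul.conf; a minimal sketch of such an auth section follows (the values here are illustrative placeholders, not the actual sf-config defaults). The `zuul create-auth-token` command above then signs tokens against that shared secret:

    [auth zuul_operator]
    driver=HS256
    allow_authz_override=true
    realm=zuul.example.com
    client_id=zuul.example.com
    issuer_id=zuul.example.com
    secret=exampleSharedSecret
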
13:12 <pabelanger[m]> so, each time you want QE to set up an autohold, you manually generate the token?
13:15 <pabelanger[m]> or maybe every 90 days
13:15 <pabelanger[m]> when it expires
13:16 <tristanC> pabelanger[m]: yes, those are per-user tokens
13:17 <pabelanger[m]> maybe I'll set up a demo with you downstream to see how it all works
13:20 <tristanC> pabelanger[m]: the documentation has been proposed in https://review.opendev.org/c/zuul/zuul/+/727785
13:20 <tristanC> pabelanger[m]: along with https://zuul-ci.org/docs/zuul/reference/client.html#create-auth-token
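
A token issued that way is presented as a bearer token to Zuul's tenant-scoped admin REST endpoints. Here is a minimal usage sketch in Python; the deployment URL, tenant and project names, and the exact payload fields are assumptions for illustration, not taken from the discussion:

    import requests  # third-party HTTP client, used here for brevity

    ZUUL_URL = "https://zuul.example.com"  # hypothetical deployment
    TOKEN = "<output of zuul create-auth-token>"

    # Ask zuul-web to autohold nodes for a failing job; the endpoint
    # shape and payload fields are assumptions about the admin API.
    resp = requests.post(
        f"{ZUUL_URL}/api/tenant/example-tenant/project/example/project/autohold",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "job": "example-job",
            "change": None,
            "ref": None,
            "reason": "qa debugging",
            "count": 1,
            "node_hold_expiration": 86400,  # seconds to keep held nodes
        },
    )
    resp.raise_for_status()
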
13:37 <pabelanger[m]> tristanC: do you have any logic for self-serve tokens?
13:48 <pabelanger[m]> would love to get another review on https://review.opendev.org/c/zuul/zuul/+/798131
13:48 <pabelanger[m]> adds a few more check_run statuses for github
13:53 <opendevreview> Felix Edel proposed zuul/zuul master: Lock/unlock nodes on executor server  https://review.opendev.org/c/zuul/zuul/+/774610
14:04 <opendevreview> Simon Westphahl proposed zuul/zuul master: wip: Items/ahead behind wrong after re-enqueue  https://review.opendev.org/c/zuul/zuul/+/798691
14:05 <opendevreview> Simon Westphahl proposed zuul/zuul master: wip: Items ahead/behind wrong after re-enqueue  https://review.opendev.org/c/zuul/zuul/+/798691
14:07 <tristanC> pabelanger[m]: what do you mean by self-serve tokens?
14:52 <pabelanger[m]> tristanC: if a token expires, an infra-root needs to regenerate said token and give it to QE, right?
14:53 <pabelanger[m]> Rather than the QE person generating it themselves?
14:54 <tristanC> pabelanger[m]: for that i think you need a service like keycloak to let the user authenticate themselves
14:55 <pabelanger[m]> ack
14:55 <pabelanger[m]> let me go and read up about it, see if I can hook it up to the existing SSO at redhat
15:22 <apevec[m]> pabelanger: mhu has a web UI WIP, lemme find the topic
15:26 <apevec[m]> https://review.opendev.org/q/topic:"fffaff"
15:59 *** jpena is now known as jpena|off
16:17 *** marios|ruck is now known as marios|out
16:54 <clarkb> corvus: looks like https://review.opendev.org/c/zuul/zuul/+/774610 is an important zuul change? I'll try to review that between meetings this morning
16:57 <corvus> clarkb: yes, but don't review it yet; i'm currently rebasing it on the rest of the executor api stack (there are significant conflicts)
16:57 <corvus> clarkb: i think the next step is https://review.opendev.org/770902
16:58 <clarkb> corvus: got it
16:58 * corvus < https://matrix.org/_matrix/media/r0/download/matrix.org/IvpviwJtxhrFhEOilheDGlly/message.txt >
16:58 <corvus> i'm putting the lock/unlock onto the end of that
16:59 <clarkb> ok so start at the bottom of that stack and work forward
16:59 <clarkb> top of the list in your paste
17:00 <corvus> yep; and most of that stack is one change exploded out into hopefully smaller parts; so there's a certain amount of "introduce an api; make some changes to the api; then use the api" in that progression
17:00 <clarkb> great I'll start there
17:12 <opendevreview> Merged zuul/zuul master: Add skipped / neutral statuses to the github driver  https://review.opendev.org/c/zuul/zuul/+/798131
17:39 <opendevreview> James E. Blair proposed zuul/zuul master: Execute builds via ZooKeeper  https://review.opendev.org/c/zuul/zuul/+/788988
17:39 <opendevreview> James E. Blair proposed zuul/zuul master: Move build request cleanup from executor to scheduler  https://review.opendev.org/c/zuul/zuul/+/794687
17:39 <opendevreview> James E. Blair proposed zuul/zuul master: Handle errors in the executor main loop  https://review.opendev.org/c/zuul/zuul/+/796583
17:39 <opendevreview> James E. Blair proposed zuul/zuul master: Shard BuildRequest parameters  https://review.opendev.org/c/zuul/zuul/+/797149
17:39 <opendevreview> James E. Blair proposed zuul/zuul master: Compress sharded ZK data  https://review.opendev.org/c/zuul/zuul/+/797156
17:39 <opendevreview> James E. Blair proposed zuul/zuul master: Switch to ZooKeeper backed merge result events  https://review.opendev.org/c/zuul/zuul/+/784195
17:39 <opendevreview> James E. Blair proposed zuul/zuul master: Switch from global to tenant event queues  https://review.opendev.org/c/zuul/zuul/+/797440
17:39 <opendevreview> James E. Blair proposed zuul/zuul master: Remove unused addResultEvent method  https://review.opendev.org/c/zuul/zuul/+/797542
17:39 <opendevreview> James E. Blair proposed zuul/zuul master: Lock/unlock nodes on executor server  https://review.opendev.org/c/zuul/zuul/+/774610
17:39 <corvus> (i spotted an error in the middle of the stack, so i went ahead and fixed it)
17:43 <corvus> that was just a 2-line change in 788988; the rest is rebasing on that. i expect the stack is stable now and remains ready for review.
19:19 <tobiash[m]> corvus: regarding https://review.opendev.org/c/zuul/zuul/+/774610/21/zuul/executor/server.py#1371 could prepareVars already have an effect on the nodes?
19:20 <tobiash[m]> I wonder if we need to set the nodeset into use before prepareVars
19:25 <corvus> tobiash: yes to both
20:42 <opendevreview> James E. Blair proposed zuul/zuul master: Lock/unlock nodes on executor server  https://review.opendev.org/c/zuul/zuul/+/774610
20:51 <opendevreview> James E. Blair proposed zuul/nodepool master: Add requestor_data field to node request  https://review.opendev.org/c/zuul/nodepool/+/798746
20:52 <clarkb> corvus: I left some comments/questions/notes on https://review.opendev.org/c/zuul/zuul/+/770902 if you or felixedel or tobiash[m] get a chance to look those over that would be great
20:52 <clarkb> I suspect some may be clarified by the rest of the stack which I'm going to try and review as well
20:54 <clarkb> aha yup the next change answers my questions about "default-zone"
20:55 <corvus> clarkb: yeah, sorry about that -- it's not my favorite thing to have changes not self-contained like that, but it also sort of seemed like leaving changes like the default-zone in their own change made some things easier to understand
21:31 <clarkb> corvus: I'm through the first 5 in the stack now. Left some comments on a few of them. Would be good if you can check those
21:32 <clarkb> The change to execute jobs via zk is not small. I'm going to take a break here and do reviews of some other stuff to ensure they get attention today then come back to this
21:36 <corvus> clarkb: thanks! i'm already well into working on your comments :)
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Execute builds via ZooKeeper  https://review.opendev.org/c/zuul/zuul/+/788988
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Move build request cleanup from executor to scheduler  https://review.opendev.org/c/zuul/zuul/+/794687
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Handle errors in the executor main loop  https://review.opendev.org/c/zuul/zuul/+/796583
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Shard BuildRequest parameters  https://review.opendev.org/c/zuul/zuul/+/797149
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Compress sharded ZK data  https://review.opendev.org/c/zuul/zuul/+/797156
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Switch to ZooKeeper backed merge result events  https://review.opendev.org/c/zuul/zuul/+/784195
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Switch from global to tenant event queues  https://review.opendev.org/c/zuul/zuul/+/797440
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Remove unused addResultEvent method  https://review.opendev.org/c/zuul/zuul/+/797542
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Lock/unlock nodes on executor server  https://review.opendev.org/c/zuul/zuul/+/774610
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Only call one build callback in executor api  https://review.opendev.org/c/zuul/zuul/+/798750
21:38 <opendevreview> James E. Blair proposed zuul/zuul master: Remove exception handling in executor api isLocked  https://review.opendev.org/c/zuul/zuul/+/798751
21:42 <corvus> clarkb: i don't want to steal your attention, but would you mind looking at https://review.opendev.org/798750 and https://review.opendev.org/798751 ? they are small and should address your comments in the first change
21:43 <clarkb> sure I'm between changes right now
21:48 <clarkb> corvus: for https://review.opendev.org/c/zuul/zuul/+/798751/1/zuul/zk/executor.py does the path lostBuildRequests() takes (and hunts for) not trigger the case where there is no lock node?
21:49 <clarkb> corvus: I think if those two functions weren't right next to each other I wouldn't have thought about that at all, but seeing them together it occurred to me that the lock may not exist (for whatever reason) and that is what lostBuildRequests() is looking for
21:49 <corvus> creating a Lock object creates a lock node
21:49 <clarkb> right but listBuildRequests() iterates over all BuildRequests and checks their lock state. If they have never been locked before (or have been unlocked and still exist) then we could hit that?
21:49 <corvus> and any parent nodes
21:50 <clarkb> this might be more test-specific where we manipulate the requests and locks more directly but I think the first test case that is added may even run over that (though I didn't look in the logs of the job to confirm)
21:50 <corvus> hit what? there is nothing in isLocked that could raise a NoNodeError under normal circumstances
21:51 <corvus> and by normal circumstances, i mean anything short of a malfunctioning connection or server
21:51 <clarkb> corvus: if a BuildRequest exists but has never been locked before then won't isLocked hit the NoNodeError() because a lock has never been created for that BuildRequest yet?
21:51 <clarkb> If BuildRequests and locks shared the same tree then this would probably be avoided but since they are in distinct trees I think we can hit this?
21:52 <corvus> clarkb: no -- because creating a kazoo.recipe.Lock() object causes the lock node and all parent nodes to be created
21:52 <clarkb> aha ok
21:53 <clarkb> then we check the contenders count to actually determine if it is locked (and it would be initialized to 0 in the case of no lock existing previously)
21:53 <corvus> so either ExecutorApi.lock() or ExecutorApi.isLocked() will cause the initial node to be created.
21:53 <corvus> clarkb: correct; contenders lists the children of the lock node
21:53 <corvus> (because contenders are ephemeral znodes under the lock node)
21:53 <clarkb> now I understand why you never expect that exception to happen at all
21:54 <clarkb> in which case the new change is the correct cleanup
21:55 <corvus> (also, btw, i ran this through a unit test to double check, but didn't think it was necessary to add a test for it)
21:56 <clarkb> ya I think I was confused by accepting the old code as correct (even though I had called it out as being odd). I'm not sure you need a specific test for this.
21:56 <clarkb> The first test that is added does cover this as a side effect I think
21:57 <corvus> yeah, if isLocked could raise that error, then so could lock
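
As a standalone illustration of the kazoo behavior corvus describes (the paths and identifiers below are made up for the sketch): the Lock recipe ensures its znode path exists before listing contenders, and contenders are ephemeral child znodes, so an isLocked-style check never raises NoNodeError just because nothing has locked the request yet.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Hypothetical lock path for a build request; kazoo's Lock recipe
    # ensures this node (and any missing parents) exists before it is
    # first used, which is why isLocked never sees NoNodeError.
    lock = zk.Lock("/sketch/locks/build-0001", identifier="executor-1")

    def is_locked(lock):
        # contenders() lists the ephemeral child znodes of the lock node;
        # an empty list means nobody currently holds or waits on the lock.
        return len(lock.contenders()) > 0

    print(is_locked(lock))       # False: lock node exists, no contenders
    if lock.acquire(timeout=5):  # creates an ephemeral contender znode
        print(is_locked(lock))   # True while we hold it
        lock.release()
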
22:03 <corvus> felixedel, tobiash: would you please (tomorrow) take a look at clarkb's comments on https://review.opendev.org/770902 and my changes in response: https://review.opendev.org/798750 and https://review.opendev.org/798751
23:11 <clarkb> corvus: one thing I'm noticing as I pick up the use zk to execute jobs change is that the tense on the BuildRequestEvent is maybe a bit odd. They are all past tense but we still have work to do to fulfill that action. A minor thing, but it may make it slightly more clear that we are actively canceling/resuming/deleting jobs if we name them cancel/resume/delete instead of canceled/resumed/deleted
23:12 <corvus> clarkb: ++
23:16 <clarkb> corvus: in https://review.opendev.org/c/zuul/zuul/+/788988/29/zuul/executor/server.py#3438 completeBuild only seems to be called in the error or cancel case. Where do we mark the build request as completed under normal operation?
23:16 <clarkb> the function is completeBuild at the end of that file fwiw
23:17 <clarkb> hrm maybe the gerrit diff is simply hiding the call in the normal case since it hasn't changed?
23:18 <clarkb> the parameters to the call do change though so it should show up if that is the case
23:25 <corvus> clarkb: line 1315
23:25 <clarkb> thanks
23:29 <clarkb> corvus: pauseBuild(), resumeBuild(), and completeBuild() all call self.executor_api.update() directly without explicitly holding the lock. In _runBuildWorker() we do that after grabbing the lock and explicitly make note of why it is safe to call update() directly and that we do it to reduce lock contention. Is it safe to do that in pause/resume/completeBuild()?
23:30 <clarkb> I think pause and resume are safe because we do those off to the side in child nodes
23:30 <clarkb> but is complete?
23:30 <clarkb> also all three update the build request state
23:33 <corvus> clarkb: thinking
23:39 <corvus> clarkb: ah, i think the first thing to note is that this is a slightly evolved use of zk compared to what we've done elsewhere -- basically, the .update() call is designed to potentially fail. previously, we've wrapped ZK updates with locks: lock, write, unlock. but this is designed to use the zstat attribute (the previously read zk state counter) to cause the write to fail if someone else has written in the interim. so basically, when calling update() we always assume we're the only actor, and if we're wrong, update() will raise an exception. instead, the executor api lock means that an executor is running the build (it more or less only exists so the scheduler can detect if a build's executor crashed)
23:40 <clarkb> Got it, the lock here is much higher level than a write mutex
23:41 <clarkb> more of a signal flare than a mutex. What happens when the zstat attribute is larger than expected and we try to resume a build? do we try again?
23:42 <corvus> clarkb: yep. then, secondarily, i think pauseBuild, resumeBuild, and completeBuild will all actually be called by the executor while it has the lock, so no other executor is going to write to it anyway; the scheduler might, however, if it were to forcibly remove the build; exceptions in that case would be okay i think
23:44 <corvus> clarkb: that case seems potentially problematic; perhaps we should set the threading event before calling executorapi.update?
23:44 <clarkb> where does the retry occur? for example when pausing the pause() method calls executor_server.pauseBuild() which calls executor_api.update() then pause() waits on the resume event to be set
23:44 <corvus> clarkb: (basically swap the last 2 lines of resume())
23:44 <clarkb> I think the only place we could retry in and have that work is in update()?
23:44 <corvus> clarkb: no retry, our irc messages raced
23:45 <clarkb> corvus: ok /me rereads
23:45 <clarkb> aha we assume it will work because we're the only high level signal (lock) holder
23:46 <corvus> clarkb: "yep, then secondarily" was a continuation of the first thought; "that case seems" is the beginning of the reply to "What happens when the zstat"
23:46 <clarkb> I'm following now
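
For reference, the reordering corvus floats at 23:44 ("swap the last 2 lines of resume()") would look roughly like this sketch; the class shape and names are hypothetical reductions, not Zuul's actual code:

    import threading

    class PausedJobSketch:
        """Hypothetical reduction of a paused build's resume handling;
        only the ordering under discussion is the point here."""

        def __init__(self, executor_api, build_request):
            self.executor_api = executor_api       # stand-in for Zuul's API
            self.build_request = build_request
            self.resume_event = threading.Event()  # pause() waits on this

        def resume(self):
            # Wake the paused job thread first, so a failure in the ZK
            # update below cannot leave the build stuck in pause ...
            self.resume_event.set()
            # ... then record the state change in ZooKeeper; with the
            # version check this raises if another actor wrote first.
            self.executor_api.update(self.build_request)
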
23:47 <corvus> clarkb: yeah, but since you mention it, i'm concerned with what could happen if the scheduler deletes the build request immediately after unpausing; i think that could trip your edge case
23:47 <corvus> it's not exactly the same thing, because it's not going to be a zstat error
23:48 <corvus> but basically, since looking at that bit of code, i see the potential for failing to resume
23:48 <clarkb> and the reason would be that we would raise in resume() prior to calling set() on the resume event
23:48 <corvus> oh, actually that case is taken care of
23:48 <corvus> we specifically handle the request not being found during the resume
23:48 <clarkb> is that what the BuildRequestNotFound exception handling in resumeBuild is handling?
23:48 <clarkb> yup ok
23:49 <corvus> yeah
23:49 <corvus> okay, so i think i'm not seeing a way to hit that case
23:49 <clarkb> and that is distinct from zstat being too new, as it doesn't exist at all
23:49 <corvus> right
23:50 <corvus> basically, the zstat change detection shouldn't fail, because we hold the lock and no one else should be updating it. but if someone did, we would raise an exception, and in the case of resuming a build, it probably wouldn't resume.
23:51 <corvus> it doesn't strike me as super robust, but also, it shouldn't happen
23:51 <clarkb> But for there to be a job to resume an executor must be running that job and to run that job it must hold the lock
23:51 <corvus> yeah
23:51 <clarkb> (similar case with pause)
23:52 <corvus> i think i'd be willing to give this a try, and if we find some way that it can happen, then we'd probably have to put in a periodic state check (like, on the executor, check every build in zk and see if it has a resume subnode). but if the event processing works as we expect, i don't think it should be an issue.
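
The contingency corvus mentions could look something like this sketch under assumed names (the root path, the running_jobs mapping, and the resume_event attribute are all hypothetical, and this is not a proposed patch):

    # Fallback safety net: periodically scan build requests in ZK and
    # resume any locally running job whose request grew a "resume"
    # subnode that the normal event processing somehow missed.
    def periodic_resume_check(zk, running_jobs,
                              root="/sketch/build-requests"):
        for build_id in zk.get_children(root):
            if zk.exists(f"{root}/{build_id}/resume"):
                job = running_jobs.get(build_id)  # hypothetical lookup
                if job is not None and not job.resume_event.is_set():
                    job.resume_event.set()
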
23:53 <clarkb> * for there to be a job to resume a unique executor must be running that job and it must hold the lock uniquely
23:54 <corvus> yep
23:56 <clarkb> great I'm going to finish up this review (going through the test changes now) but I think this is looking ok
23:56 <clarkb> have a couple of other minor notes but that covers the big stuff that jumped out at me here already
23:58 <corvus> cool, then if things look good to felixedel and tobiash tomorrow, we may be able to get these merged and resume speed on v5 :)
