Thursday, 2020-02-27

*** mattw4 has quit IRC00:00
*** tosky has quit IRC00:21
<zenkuro> corvus: zuul indeed shows “Loading...” if there were no builds  00:27
*** jamesmcarthur has joined #zuul00:35
*** mattw4 has joined #zuul00:46
*** jamesmcarthur has quit IRC00:51
*** jamesmcarthur has joined #zuul00:52
*** jamesmcarthur has quit IRC00:57
*** felixedel has joined #zuul01:09
*** felixedel has quit IRC01:10
*** jamesmcarthur has joined #zuul01:22
*** jamesmcarthur has quit IRC01:23
*** jamesmcarthur has joined #zuul01:23
*** sgw has quit IRC01:32
*** jamesmcarthur has quit IRC01:55
*** jamesmcarthur has joined #zuul02:00
*** jamesmcarthur has quit IRC02:31
*** jamesmcarthur has joined #zuul02:43
*** jamesmcarthur has quit IRC02:49
*** num8lock has joined #zuul02:50
*** num8lock has quit IRC02:51
*** num8lock has joined #zuul02:51
*** num8lock has left #zuul02:52
*** jamesmcarthur has joined #zuul02:53
*** rlandy|bbl is now known as rlandy02:56
<zenkuro> corvus: the right place to look for errors was mysql. in particular, the database name created in mysql should exactly match the one listed in the zuul config. which is kinda self evident  >_<  03:06
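
For reference, a minimal sketch of the zuul.conf SQL connection zenkuro is describing, with hypothetical host and credentials; the database name at the end of dburi must exactly match the database actually created in mysql:

    # zuul.conf -- host, user and password are hypothetical
    [connection database]
    driver=sql
    dburi=mysql+pymysql://zuul:secret@db.example.com/zuul
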
*** jamesmcarthur has quit IRC03:07
*** jamesmcarthur has joined #zuul03:31
*** rlandy has quit IRC03:44
*** jamesmcarthur has quit IRC03:58
*** bolg has joined #zuul05:09
*** mattw4 has quit IRC05:24
*** evrardjp has quit IRC05:34
*** evrardjp has joined #zuul05:35
*** sgw has joined #zuul05:35
*** swest has joined #zuul05:38
*** reiterative has quit IRC05:40
*** reiterative has joined #zuul05:41
*** sgw has quit IRC05:50
*** saneax has joined #zuul05:56
*** felixedel has joined #zuul06:32
*** Defolos has joined #zuul06:46
*** dangtrinhnt has joined #zuul07:06
*** felixedel has quit IRC07:19
*** felixedel has joined #zuul07:28
*** dangtrinhnt has quit IRC07:56
*** dangtrinhnt has joined #zuul07:57
*** jcapitao_off has joined #zuul07:58
*** jcapitao_off is now known as jcapitao08:00
*** dangtrinhnt has quit IRC08:01
*** raukadah is now known as chandankumar08:13
*** tosky has joined #zuul08:17
*** dangtrinhnt has joined #zuul08:20
*** bolg has quit IRC08:33
*** bolg has joined #zuul08:34
*** jpena|off is now known as jpena08:51
*** jfoufas178 has joined #zuul08:54
*** decimuscorvinus has quit IRC08:58
*** decimuscorvinus_ has joined #zuul08:58
*** avass has joined #zuul09:21
*** jfoufas178 has quit IRC09:24
*** dangtrinhnt has quit IRC09:43
*** dangtrinhnt has joined #zuul09:43
*** dangtrinhnt has quit IRC09:51
*** dmellado has quit IRC09:59
*** sshnaidm|afk has joined #zuul10:17
*** carli has joined #zuul10:18
*** sshnaidm|afk is now known as sshnaidm10:29
*** carli is now known as carli|afk10:51
*** felixedel has quit IRC10:52
*** felixedel has joined #zuul10:56
*** jcapitao is now known as jcapitao_lunch11:14
*** carli|afk is now known as carli12:02
*** jpena is now known as jpena|lunch12:35
*** jamesmcarthur has joined #zuul12:36
*** Goneri has joined #zuul12:50
*** rlandy has joined #zuul12:56
*** jamesmcarthur has quit IRC13:00
*** jamesmcarthur has joined #zuul13:00
*** dmellado has joined #zuul13:05
*** sshnaidm_ has joined #zuul13:05
*** jamesmcarthur has quit IRC13:06
<fbo> hi, just a question regarding the Zuul artifacts require/provide feature. Zuul does not expose artifacts of depends-on changes if the dependencies have been merged. That looks ok to me, but in some cases it would be great if zuul still exposed the artifacts in zuul.artifacts.  13:07
<fbo> for instance, in the packaging context for Fedora, a merged change on a distgit does not mean the package is built and available.  13:08
*** sshnaidm has quit IRC13:08
*** jamesmcarthur has joined #zuul13:10
*** jcapitao_lunch is now known as jcapitao13:17
<fbo> so when the dependent changes are not yet merged, their artifacts (here rpms) are exposed to the current project's rpm-(build|test) jobs. When the dependent changes are merged, the artifacts (rpms) are no longer exposed to the current project's rpm-(build|test) jobs, and in the meantime they are not accessible because they are not yet built and published to the final rpm repository.  13:18
<fbo> The question is: do you think it could make sense to have an option telling zuul to still add artifacts to zuul.artifacts even if the dependencies are merged?  13:19
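
For context, the plumbing fbo is describing: jobs declare provides/requires, and a build registers its artifacts via zuul_return so dependent changes see them in zuul.artifacts. A minimal sketch with hypothetical job and artifact names:

    - job:
        name: rpm-build
        provides: rpm-repo

    - job:
        name: rpm-test
        requires: rpm-repo

    # In a playbook of rpm-build: register the artifact so dependent
    # changes can find it in zuul.artifacts.
    - zuul_return:
        data:
          zuul:
            artifacts:
              - name: "RPM repository"
                url: "repo/"
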
*** plaurin has joined #zuul13:25
<plaurin> Hello irc people  13:25
<plaurin> So glad to see this patch! https://review.opendev.org/#/c/709261/ Really anxious to use this :D  13:26
<plaurin> As soon as it's merged I will be using this in production  13:26
*** rfolco has joined #zuul13:28
*** rfolco has quit IRC13:29
*** jamesmcarthur has quit IRC13:32
*** jamesmcarthur has joined #zuul13:32
*** jpena|lunch is now known as jpena13:34
*** jamesmcarthur has quit IRC13:47
*** jamesmcarthur_ has joined #zuul13:47
<openstackgerrit> Tobias Henkel proposed zuul/zuul master: Optimize canMerge using graphql  https://review.opendev.org/709836  14:28
*** sgw has joined #zuul14:32
*** jamesmcarthur_ has quit IRC14:35
*** sgw has quit IRC14:39
*** jamesmcarthur has joined #zuul14:40
<corvus> plaurin, mordred: i think based on conversations with tristanC i want to revise that a bit -- i think we should add socat to our executor images first to ensure that will work for folks running zuul in k8s.  14:41
<corvus> i've set it to WIP  14:41
<mordred> corvus: ah - socat  14:44
*** jamesmcarthur has quit IRC14:48
<tristanC> corvus: btw i've added port-forward to k1s and i was about to test it with the zuul change  14:49
<corvus> i'm writing a job now where i wish i could attach the secret just to the pre-playbook of the job, not the main playbook.  i think i have to add an extra layer of inheritance just to do that.  but maybe we should consider a config syntax change to allow that.  14:50
<corvus> tristanC: great!  14:50
<tristanC> turns out port-forward uses the same protocol as exec, so that was a simple addition  14:50
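
The workaround corvus describes works because Zuul secrets are only available to playbooks defined in the same job definition as the secret. A minimal sketch with hypothetical names -- the parent holds the secret and the pre-run playbook, so the child's main playbook never sees it:

    - job:
        name: base-with-secret
        pre-run: playbooks/pre/fetch-credentials.yaml
        secrets:
          - name: credentials
            secret: hypothetical-credentials

    - job:
        name: my-job
        parent: base-with-secret
        run: playbooks/run.yaml
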
*** felixedel has quit IRC14:56
<fungi> what is k1s? i'm getting lost in all the abbrevs  14:57
<mnaser> i've heard about k3s but never heard of k1s :p  15:00
<openstackgerrit> James E. Blair proposed zuul/zuul master: Stream output from kubectl pods  https://review.opendev.org/709261  15:02
*** bolg has quit IRC15:03
<corvus> tristanC, mordred, plaurin, Shrews: ^ i added socat to the executor image and also a release note.  does that seem good?  or should i update the patch so that it doesn't fail if kubectl isn't present?  15:03
<Shrews> corvus: i was thinking this morning that maybe we need a release note that mentions the need for a specific version of nodepool to support that  15:05
<Shrews> which i guess would be 3.12.0?  15:05
<corvus> Shrews: yeah, i reckon so  15:06
<corvus> Shrews: i'll amend for that  15:06
<corvus> i think at this point, i'll be more comfortable if we don't fail the job; i think i'll add that  15:07
*** mattw4 has joined #zuul15:10
<mordred> corvus: ++  15:10
<openstackgerrit> Tristan Cacqueray proposed zuul/zuul master: executor: blacklist dangerous ansible host vars  https://review.opendev.org/710287  15:15
*** mattw4 has quit IRC15:17
*** goneri_ has joined #zuul15:19
*** goneri_ has quit IRC15:21
<openstackgerrit> James E. Blair proposed zuul/zuul master: Stream output from kubectl pods  https://review.opendev.org/709261  15:23
<openstackgerrit> James E. Blair proposed zuul/zuul master: Add destructor for SshAgent  https://review.opendev.org/709609  15:23
<corvus> okay that should take care of it  15:23
*** avass has quit IRC15:24
*** sgw has joined #zuul15:28
*** jamesmcarthur has joined #zuul15:38
<mordred> corvus: the self._log in the stream callback is displayed to the user? or the operator?  15:41
*** chandankumar is now known as raukadah15:41
<corvus> mordred: user  15:41
*** sgw has quit IRC15:42
<mordred> corvus: k. then the only nit I might suggest is to say something like "on the executor" or "on the executor, contact your admin" - so it's clear it's not just something they need to add to their job config  15:47
<mordred> otherwise looks great  15:47
<corvus> mordred: good point  15:47
<corvus> mordred: "[Zuul] Kubectl and socat must be installed on the Zuul executor for streaming output from pods"  how's that?  15:49
<openstackgerrit> James E. Blair proposed zuul/zuul master: Stream output from kubectl pods  https://review.opendev.org/709261  15:51
<openstackgerrit> James E. Blair proposed zuul/zuul master: Add destructor for SshAgent  https://review.opendev.org/709609  15:51
*** rishabhhpe has joined #zuul15:54
<mordred> corvus: yeah  15:54
*** sshnaidm_ is now known as sshnaidm15:58
<Shrews> corvus: lgtm. +3'd  15:58
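
For operators building their own executor images, a minimal sketch of an Ansible task installing the new prerequisites; the package names are assumptions, and kubectl usually comes from a vendor repository rather than the distro's:

    - name: Ensure pod log streaming prerequisites on the executor
      become: true
      package:
        name:
          - socat
          - kubectl
        state: present
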
*** sgw has joined #zuul16:00
*** mattw4 has joined #zuul16:17
*** sgw has quit IRC16:24
*** sgw has joined #zuul16:25
*** carli has quit IRC16:32
*** NBorg has joined #zuul16:37
*** plaurin has quit IRC16:42
*** armstrongs has joined #zuul16:45
*** jpena is now known as jpena|brb16:45
*** jamesmcarthur has quit IRC16:45
*** jamesmcarthur has joined #zuul16:46
*** electrofelix has joined #zuul16:46
*** armstrongs has quit IRC16:51
*** erbarr has joined #zuul17:14
<openstackgerrit> Tristan Cacqueray proposed zuul/zuul master: executor: blacklist dangerous ansible host vars  https://review.opendev.org/710287  17:16
*** jpena|brb is now known as jpena17:23
*** evrardjp has quit IRC17:35
*** evrardjp has joined #zuul17:35
<NBorg> How does zuul handle submodules? We have a dependency on a repo on a gitlab server. We don't need (or want) to trigger anything on updates in that repo. Workarounds I can think of: keep a clone in gerrit and update it semi-regularly, use a submodule, or add a role for cloning that specific repo. Any suggestions?  17:36
*** felixedel has joined #zuul17:36
*** felixedel has quit IRC17:36
<corvus> NBorg: we've been discussing submodules a bit lately, and what zuul could do with them to be useful.  right now, it ignores them completely, so the job needs to take care of it itself.  17:37
<corvus> NBorg: gerrit supports submodule subscriptions, which automatically update submodule pointers if the 2 repos are on the same system, but that won't help if you're using gitlab  17:38
<corvus> NBorg: (but maybe if you kept a copy in gerrit, you could use the submodule subscriptions to automatically update the super-repo)  17:38
*** igordc has joined #zuul17:39
<corvus> NBorg: aside from that, i think the options are to do the cloning/submodule init in a job.  let me get a pointer to some stuff i did for the gerrit project recently  17:39
<corvus> NBorg: take a look at this role, it may have some things that could be useful: https://gerrit.googlesource.com/zuul/jobs/+/refs/heads/master/roles/prepare-gerrit-repos/  17:40
<corvus> NBorg: personally, i would say if you have a choice to avoid git submodules, i would  17:41
<corvus> NBorg: they're often confusing for everybody.  so if i were in your place, i would try to avoid that and either just clone the repo in the job, or get the repo into zuul somehow (gitlab driver, git driver, or copy into gerrit) and add it as a required-project if using depends-on is important.  17:43
<SpamapS> Let me just +1 corvus's point. submodules are a misfeature in git, and should be avoided.  17:43
<SpamapS> There are 100 other things that work better.  17:43
<fungi> in almost all cases i treat a git submodule as "just another repo" which i happen to check out inside some repo's worktree, and steer clear of the fancy submodule commands entirely  17:45
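
A minimal sketch of fungi's "just another repo" approach as a job pre-run task, assuming a hypothetical gitlab URL and destination path:

    - name: Clone the external dependency alongside the prepared repos
      git:
        repo: https://gitlab.example.com/mygroup/mydep.git
        dest: "{{ ansible_user_dir }}/src/gitlab.example.com/mygroup/mydep"
        version: master
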
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Fix GCE volume parameters  https://review.opendev.org/710324  17:49
<NBorg> I didn't consider the git driver. That seems like what I want. Thank you!  17:49
<corvus> NBorg: i think it'll update every 2 hours, which sounds like it may work for your case  17:50
<NBorg> You've even made it configurable with poll_delay. :)  17:51
<corvus> oh neat  17:52
<corvus> NBorg: if you get bored and want to try out the work-in-progress gitlab driver, that'd be cool too :)  17:52
<corvus> (i think it's in-tree but undocumented since it's incomplete; but it's probably good enough for cloning a repo)  17:54
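
For reference, a minimal sketch of the git-driver connection NBorg landed on, with a hypothetical base URL; poll_delay is in seconds, and 7200 is the two-hour default corvus mentions. Repos under that baseurl can then be listed in the tenant config and used as required-projects:

    # zuul.conf
    [connection external-git]
    driver=git
    baseurl=https://gitlab.example.com/mygroup
    poll_delay=7200
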
<corvus> Shrews, mordred: https://review.opendev.org/710324 i just noticed that we weren't actually getting the 40GB volumes i requested on gerrit's zuul; i tested this against gce and it seems to improve the situation  17:58
*** igordc has quit IRC18:01
*** jcapitao is now known as jcapitao_off18:02
<NBorg> Hehe, being bored is the least of my problems. I'll contribute some code when I have the time to clean it up. Any day now (maybe next year).  18:02
*** igordc has joined #zuul18:04
*** mattw4 has quit IRC18:05
*** mattw4 has joined #zuul18:05
<rishabhhpe> Hi all, after triggering a zuul job on a nodepool-spawned instance on which i installed devstack, i am not able to run any openstack command; it says "Failed to discover auth URL".  18:14
<corvus> rishabhhpe: you might want to ask in #openstack-infra or #openstack-qa  18:14
<rishabhhpe> corvus: ok  18:18
<SpamapS> feedback on UI: in the build screen, Console tab, Zuul shows tasks that had "ignore_errors: true" as "FAILED" .. it would be good to show "IGNORED" instead.  18:31
*** mattw4 has quit IRC18:31
<SpamapS> (from a member of my team)  18:32
*** mattw4 has joined #zuul18:32
*** jamesmcarthur has quit IRC18:33
*** jamesmcarthur has joined #zuul18:33
<corvus> SpamapS: yeah, i've been noticing that too.  i bet we have the info to do that.  18:33
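
The tasks in question look like this; Ansible does pass the ignore_errors flag through to result callbacks, which is presumably the info corvus means. A minimal sketch with a hypothetical command:

    - name: Run a check whose failure is tolerated
      command: /usr/bin/flaky-check   # hypothetical
      ignore_errors: true
      # The task result is still "failed", but the failure is ignored;
      # a UI could render this as IGNORED rather than FAILED.
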
*** jpena is now known as jpena|off18:33
*** mattw4 has quit IRC18:38
*** mattw4 has joined #zuul18:39
*** tjgresha has quit IRC18:46
*** tjgresha has joined #zuul18:46
<mordred> corvus: gce patch looks sensible  19:01
*** rlandy is now known as rlandy|brb19:07
*** jamesmcarthur has quit IRC19:12
-openstackstatus- NOTICE: Memory pressure on zuul.opendev.org is causing connection timeouts resulting in POST_FAILURE and RETRY_LIMIT results for some jobs since around 06:00 UTC today; we will be restarting the scheduler shortly to relieve the problem, and will follow up with another notice once running changes are reenqueued.  19:13
*** jcapitao_off has quit IRC19:15
*** rishabhhpe has quit IRC19:16
*** rlandy|brb is now known as rlandy19:19
*** electrofelix has quit IRC19:20
*** jamesmcarthur has joined #zuul19:40
-openstackstatus- NOTICE: The scheduler for zuul.opendev.org has been restarted; any changes which were in queues at the time of the restart have been reenqueued automatically, but any changes whose jobs failed with a RETRY_LIMIT, POST_FAILURE or NODE_FAILURE build result in the past 14 hours should be manually rechecked for fresh results  19:46
<openstackgerrit> Merged zuul/nodepool master: Fix GCE volume parameters  https://review.opendev.org/710324  20:02
*** felixedel has joined #zuul20:13
<SpamapS> is OpenDev running Zuul in GCE now?  20:51
<SpamapS> or rather, some zuul  20:52
<corvus> SpamapS: no, gerrit is! :)  20:52
<fungi> nope, the gerrit zuul is  20:52
<corvus> SpamapS: https://ci.gerritcodereview.com/tenants  20:52
*** jamesmcarthur has quit IRC20:54
<corvus> SpamapS: actually, strictly speaking, gerrit is running Zuul in GKE but using GCE for test nodes  20:54
<SpamapS> Cool!  20:57
*** erbarr has quit IRC21:11
*** jcapitao_off has joined #zuul21:17
*** Goneri has quit IRC21:19
*** sgw has quit IRC21:23
*** mattw4 has quit IRC21:30
*** mattw4 has joined #zuul21:31
*** sgw has joined #zuul21:38
*** dpawlik has quit IRC21:43
<NBorg> That automatic reenqueue is a nice feature. Is that new, or not part of zuul itself? I don't think I've seen that happen in any of our tenants.  21:44
<mordred> NBorg: it's not part of zuul itself at the moment - we saved the job queues before the restart and then restored them after the restart  21:53
*** Goneri has joined #zuul21:53
*** jamesmcarthur has joined #zuul21:54
<corvus> NBorg: here's the script though: https://opendev.org/zuul/zuul/src/branch/master/tools/zuul-changes.py  21:54
<corvus> hopefully it will soon be obsolete when we have HA schedulers :)  21:54
<fungi> yeah, it's just a workaround for the fact that the state is held in memory by the spof scheduler  21:59
<fungi> zuul v2 *had* some solution where the queue was exported as a python pickle, i want to say? but that was only viable if the data structure remained unchanged  22:00
<openstackgerrit> Merged zuul/zuul master: executor: blacklist dangerous ansible host vars  https://review.opendev.org/710287  22:02
<corvus> fungi: well, it had (and still has) a facility for saving the event queue, but that's more or less useless without being able to save the running queue  22:06
<NBorg> Thanks! That'll save some time.  22:07
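
For readers who don't follow the link: the script's approach, paraphrased here as a sketch (not a copy of tools/zuul-changes.py; the status JSON field names are assumptions), is to read each pipeline's state from Zuul's REST API and print one enqueue command per live change:

    import json
    import urllib.request

    def dump_enqueue_commands(zuul_url, tenant, pipeline):
        # Fetch the tenant status JSON from Zuul's REST API.
        url = "%s/api/tenant/%s/status" % (zuul_url, tenant)
        status = json.load(urllib.request.urlopen(url))
        for p in status["pipelines"]:
            if p["name"] != pipeline:
                continue
            for queue in p["change_queues"]:
                for head in queue["heads"]:
                    for change in head:
                        if not change.get("live"):
                            continue
                        # Print a re-enqueue command for this change.
                        print("zuul enqueue --tenant %s --pipeline %s"
                              " --project %s --change %s" % (
                                  tenant, pipeline,
                                  change["project_canonical"],
                                  change["id"]))
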
*** jamesmcarthur has quit IRC22:08
*** jamesmcarthur has joined #zuul22:11
*** NBorg has quit IRC22:12
<corvus> Shrews: re zombie node requests from #openstack-infra -- nodepool/driver/__init__.py line 679 is i think the key here -- that's failing to store a node (which disappeared) that was assigned to a request (which disappeared)  22:20
<Shrews> yeah. we're missing exception handling around it  22:20
<fungi> corvus: oh, right, that was events/results  22:20
<Shrews> corvus: i think we can safely capture NoNodeError there, log it, and move on to the next node  22:20
<corvus> Shrews: agreed; you want to type that up?  22:21
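
The fix being discussed has roughly this shape -- a sketch, not the actual patch; the function and the storeNode call paraphrase the nodepool driver code:

    from kazoo.exceptions import NoNodeError

    def store_nodes(zk, log, nodeset):
        # Persist each node's state back to ZooKeeper, tolerating
        # znodes that vanished underneath us.
        for node in nodeset:
            try:
                zk.storeNode(node)
            except NoNodeError:
                # The node znode vanished (e.g. after a lost ZK
                # session); log it and move on to the next node.
                log.warning("Node %s disappeared; skipping store",
                            node.id)
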
<Shrews> though i wonder how the node znode disappeared?  22:21
<Shrews> corvus: yeah  22:21
<clarkb> corvus, Shrews: do we need to restart our launcher to clear this out in the running system?  22:21
<clarkb> Shrews: because the scheduler was restarted  22:21
<corvus> clarkb: yes, i don't think it will recover (but i think the only problem is that the logs are spammy?)  22:22
<corvus> yeah, but something still deleted those nodes...  22:22
<Shrews> clarkb: how would that cause node znodes to disappear?  22:22
<clarkb> Shrews: aren't the znodes ephemeral with the connection?  22:22
<Shrews> no  22:22
<clarkb> that's why memory pressure problems cause node failures  22:22
<clarkb> the connection dies and the znodes go away  22:22
<Shrews> node requests are  22:22
<corvus> and node locks are, but not nodes  22:23
<clarkb> ah I see  22:23
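
The distinction just spelled out, as a kazoo sketch (the paths are illustrative, not nodepool's actual layout): ephemeral znodes vanish when the creating session expires, persistent ones survive:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk.example.com:2181")
    zk.start()

    # Node requests and locks are ephemeral: gone if the session dies.
    zk.create("/nodepool/requests/100-0000000001", b"{}",
              ephemeral=True, makepath=True)

    # Node records are persistent: they survive a session loss.
    zk.create("/nodepool/nodes/0000000001", b"{}", makepath=True)
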
<corvus> Shrews: i'll come up with a timeline and see if we can spot how it happened  22:25
<tristanC> we are seeing NODE_FAILURE after restarting the services, could this be related?  22:25
<fungi> tristanC: discussion of opendev failures is taking place in #openstack-infra but, yes, it sounds like maybe a launcher restart there is in order  22:26
<corvus> Shrews, clarkb: the sequence: https://etherpad.openstack.org/p/tujgF5HLpJ  22:29
<corvus> Shrews: i'm guessing that when the request disappeared, the launcher unlocked the nodes even though they were still building (though i don't see a log message for that)  22:31
*** Goneri has quit IRC22:32
corvushrm, i can't back that up22:33
<openstackgerrit> David Shrewsbury proposed zuul/nodepool master: Fix for clearing assigned nodes that have vanished  https://review.opendev.org/710343  22:33
<Shrews> corvus: if they're building, they wouldn't be allocated to the request  22:34
<corvus> i thought we allocated them as soon as we started building?  22:34
<Shrews> they'd have to be in the READY and unlocked state before being allocated  22:34
<corvus> if that's the case, "Locked building node 0014861474 for request 200-0007660291" is a very misleading message  22:34
<Shrews> oh, but wait a sec...  22:34
<Shrews> corvus: you're right. if we build a new node specifically for a request, the allocated_to is set. existing nodes were what i was thinking of  22:35
<corvus> so if we unlocked the node because the request disappeared, i would expect to see the "node request disappeared" log message around then, but i don't (that only comes much later)  22:36
<corvus> (basically i'm focusing on the lines at 18:06 -- i would not expect the node to be unlocked then...)  22:39
<Shrews> ok, so while it is building, the request disappears, the launcher cleanup thread runs and deletes the instance and znodes, then the request handler wakes up and gets confused  22:41
<Shrews> that makes sense  22:41
<corvus> Shrews: except i don't understand why the node was unlocked; it should be locked while building  22:41
<clarkb> should we be checking if the node can fulfill another request before deleting it?  22:41
<corvus> clarkb: that's actually what's supposed to happen  22:42
<Shrews> corvus: THAT is a good question. possible that if the scheduler was seeing disconnects, the launcher was too and lost locks?  22:43
<Shrews> any lost session errors?  22:43
<corvus> good q  22:43
<Shrews> i don't recall what the message would be. likely a zk exception  22:44
<Shrews> SessionExpiredError maybe  22:44
<corvus> 2020-02-27 18:04:59,092 INFO kazoo.client: Zookeeper session lost, state: EXPIRED_SESSION  22:45
<corvus> bingo  22:45
<Shrews> so hard to program defensively enough for that  22:46
<corvus> ok, i think that explains it, and i think under the circumstances that's fine; the only thing we really need to do is plug the hole with https://review.opendev.org/710343  22:46
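
A client can at least observe this kind of session expiry with a kazoo state listener; a minimal sketch (the handler bodies are hypothetical):

    from kazoo.client import KazooClient, KazooState

    zk = KazooClient(hosts="zk.example.com:2181")

    def session_listener(state):
        if state == KazooState.LOST:
            # The session expired: every ephemeral znode and lock this
            # client held is gone; in-flight work must be re-validated.
            print("zookeeper session lost")
        elif state == KazooState.SUSPENDED:
            print("zookeeper connection suspended")

    zk.add_listener(session_listener)
    zk.start()
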
<clarkb> ok so it was related to the connection loss after all, just in an unexpected manner  22:47
<corvus> yep, we didn't notice the launcher also lost its connection at the same time  22:48
<Shrews> glad we figured it out  22:48
<clarkb> looking at 710343, that causes us to ignore not being able to write the node because it has been deleted?  22:49
<clarkb> (seems like it was the request that was deleted, but we could still reuse the host?)  22:49
*** jamesmcarthur has quit IRC22:49
<clarkb> I guess my question comes down to: will 710343 cause us to leak some other state (which probably won't cause jobs to fail, but maybe we have to wait 8 hours for cleanup to run)?  22:51
<corvus> clarkb: 343 covers the case where both the request and the node have disappeared from zk.  it lets us clear the request out of the list of active requests the launcher is tracking  22:53
<Shrews> what cleanup runs every 8 hours?  22:53
<Shrews> iirc, the launcher cleanup thread runs 1x per minute  22:53
<corvus> i don't think it will cause anything else to leak (if a server leaks during that process, the normal leaked instance thing will catch it)  22:53
<Shrews> our cleanup thread is pretty thorough as long as znodes are unlocked or just not present  22:54
<Shrews> sort of the housekeeper of nodepool, sweeping the bugs that we just can't program against under the carpet  :)  22:55
<clarkb> Shrews: node cleanup is on an 8 hour timeout iirc  22:56
<clarkb> corvus: right, the request and node have disappeared from zk, but what about the actual cloud node?  22:57
<clarkb> if the intent is to reuse that in another request, have we leaked it?  22:57
<Shrews> clarkb: not in nodepool. hard coded to 60s: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/launcher.py#L874  22:58
<clarkb> Shrews: that's the interval, but then we check age after? I guess if the node znode is gone then we short circuit. The 8 hour timeout is only for other nodes?  22:58
<clarkb> Shrews: basically cleanup runs that often, but we don't take action on objects until they are older  22:58
<Shrews> are you referring to max-age? that's different from cleanup  22:59
<clarkb> I guess? basically whatever would remove a now-orphaned node  22:59
<clarkb> in the cloud  22:59
<corvus> clarkb: i think in this case the ship has sailed.  we lost our claim on the node, which means it and the underlying instance are subject to deletion (and, in fact, in this case both were deleted).  it doesn't make sense to put it back because we can't recover the state accurately.  22:59
<Shrews> or max-ready-age (whatever we named it). that just removes unused nodes after they've been alive for some set time  22:59
<Shrews> clarkb: orphaned nodes would get cleaned up within the 60s timeframe  23:00
<corvus> clarkb: (to answer your question another way, in this case we did not leak the underlying instance, we deleted it; but if we didn't for some reason (launcher crash?), the thing you're talking with Shrews about would catch it)  23:00
<clarkb> got it  23:00
<corvus> clarkb: also, to clarify an earlier point, we *do* reuse nodes if the request disappears; only if things are so bad that the launcher can't hold on to its own node locks do we get into the case where we might lose both.  23:01
<clarkb> basically if the node znode had survived then this would be a non-issue (and would allow reuse), but because the node znode does not survive, any launcher could then delete the instance, so we can't recycle it at that point  23:01
<Shrews> clarkb: correct  23:02
<corvus> (a node request disappearing is normal behavior and we optimize for it -- no failures are required to hit that)  23:02
<clarkb> and the way to avoid that is to keep network connectivity and memory use happy  23:02
*** mattw4 has quit IRC23:02
*** mattw4 has joined #zuul23:06
*** jamesmcarthur has joined #zuul23:25
<openstackgerrit> Merged zuul/nodepool master: Fix for clearing assigned nodes that have vanished  https://review.opendev.org/710343  23:38
