Friday, 2020-01-17

openstackgerritMerged zuul/nodepool master: Switch to collect-container-logs  https://review.opendev.org/70186900:03
clarkbcorvus: left a few notes on your docs changes00:21
clarkbI like the state this is headed towards00:21
pabelangerclarkb: yup, seen that before. Fix merged, we need a release00:22
clarkbtobiash: I left comments on https://review.opendev.org/#/c/702237/3 and https://review.opendev.org/#/c/702828/2 you aren't the author but I don't think they are here and you can probably make sure they see that before the weekend. Thanks!00:23
clarkbpabelanger: ya eventually got there00:23
pabelangerFor a while, i was manually applying the patch, but haven't in a few upgrades00:23
pabelangerso, would be nice to get github3 release for it00:24
*** mattw4 has quit IRC00:25
pabelangerzuul job design question. If a parent job sets job.files: foo/bar.py, can a child set job.files: [] to remove the filematch?00:25
pabelangerdoesn't look like it00:37
pabelangerbasically, trying to see if I can set job.files directly on a parent job stanza, instead of doing it in pipeline00:37
pabelangerbut child needs to remove all matchers00:37
pabelangerk, so job.files: [] on child doesn't work, but job.files: .* will00:39
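What pabelanger found can be sketched as Zuul job configuration (job names here are hypothetical, not from a real config): an empty `files` list on a child job does not clear the matcher inherited from the parent, but a catch-all regex effectively does.

```yaml
# Hypothetical illustration: the child cannot clear an inherited
# files matcher with an empty list, but a match-everything pattern
# makes the job run regardless of which files changed.
- job:
    name: parent-job
    files:
      - foo/bar.py

- job:
    name: child-job
    parent: parent-job
    # files: []      # does NOT remove the inherited matcher
    files:
      - .*           # matches every file, so the matcher is moot
```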
openstackgerritMerged zuul/zuul master: Report buildset result in MQTT reporter  https://review.opendev.org/70283800:50
openstackgerritMerged zuul/zuul master: Document the buildsets endpoint  https://review.opendev.org/70212700:56
*** openstackgerrit has quit IRC00:57
*** sgw has joined #zuul02:15
*** zxiiro has quit IRC02:33
*** logan- has quit IRC02:48
*** logan_ has joined #zuul02:50
*** logan_ is now known as logan-02:50
*** johanssone has quit IRC02:56
*** johanssone has joined #zuul02:57
*** rfolco has joined #zuul02:57
*** rfolco has quit IRC03:02
*** bhavikdbavishi has joined #zuul03:06
*** bhavikdbavishi1 has joined #zuul03:09
*** bhavikdbavishi has quit IRC03:10
*** bhavikdbavishi1 is now known as bhavikdbavishi03:10
*** openstackgerrit has joined #zuul03:11
openstackgerritMerged zuul/zuul-jobs master: ensure-tox: improve pip detection  https://review.opendev.org/70297803:11
*** bhavikdbavishi has quit IRC03:56
*** bhavikdbavishi has joined #zuul04:04
*** rlandy has quit IRC04:06
openstackgerritTristan Cacqueray proposed zuul/zuul-operator master: Add main configuration file  https://review.opendev.org/70301304:21
*** raukadah is now known as chandankumar04:46
*** dustinc is now known as dustinc|PTO04:52
*** bhavikdbavishi1 has joined #zuul04:58
*** bhavikdbavishi has quit IRC05:00
*** bhavikdbavishi1 is now known as bhavikdbavishi05:00
*** decimuscorvinus_ has quit IRC05:21
*** decimuscorvinus has joined #zuul05:22
*** kmalloc has quit IRC05:33
*** kmalloc has joined #zuul05:33
*** evrardjp has quit IRC05:34
*** evrardjp has joined #zuul05:34
*** saneax has joined #zuul06:06
*** swest has joined #zuul06:11
*** sgw has quit IRC06:12
*** saneax has quit IRC06:28
*** sgw has joined #zuul06:30
*** wxy-xiyuan has joined #zuul06:31
*** sanjay__u has quit IRC06:39
*** sanjay__u has joined #zuul06:41
*** michael-beaver has quit IRC06:53
*** notnone has joined #zuul07:15
*** bhavikdbavishi has quit IRC07:18
*** SotK has quit IRC07:19
*** klindgren_ has quit IRC07:19
*** ianw has quit IRC07:19
*** fbo has quit IRC07:19
*** at_work has quit IRC07:19
*** tobberydberg has quit IRC07:19
*** EmilienM has quit IRC07:19
*** openstackstatus has quit IRC07:20
*** saneax has joined #zuul07:29
*** bhavikdbavishi has joined #zuul08:05
*** SotK has joined #zuul08:05
*** klindgren_ has joined #zuul08:05
*** ianw has joined #zuul08:05
*** fbo has joined #zuul08:05
*** tobberydberg has joined #zuul08:05
*** EmilienM has joined #zuul08:05
*** armstrongs has joined #zuul08:14
*** bhavikdbavishi has quit IRC08:16
*** armstrongs has quit IRC08:24
*** avass has joined #zuul08:26
*** reiterative has joined #zuul08:26
*** tosky has joined #zuul08:29
*** dmellado has quit IRC08:34
*** themroc has joined #zuul08:34
*** dmellado has joined #zuul08:35
*** bhavikdbavishi has joined #zuul08:36
*** jpena|off is now known as jpena08:48
openstackgerritMatthieu Huin proposed zuul/zuul master: JWT drivers: Deprecate RS256withJWKS, introduce OpenIDConnect  https://review.opendev.org/70197208:49
openstackgerritMatthieu Huin proposed zuul/zuul master: JWT drivers: Deprecate RS256withJWKS, introduce OpenIDConnect  https://review.opendev.org/70197209:06
openstackgerritBenjamin Schanzel proposed zuul/zuul master: Allow Passing of Jitter Values in TimerDriver  https://review.opendev.org/70285409:08
*** themroc has quit IRC09:10
*** themroc has joined #zuul09:15
*** sanjay__u has quit IRC09:20
openstackgerritBenjamin Schanzel proposed zuul/zuul master: Handle Erroneous Cron Strings in TimerDriver  https://review.opendev.org/70223709:23
openstackgerritBenjamin Schanzel proposed zuul/zuul master: Allow Passing of Jitter Values in TimerDriver  https://review.opendev.org/70285409:23
*** bhavikdbavishi has quit IRC09:35
*** bhavikdbavishi has joined #zuul09:35
*** bhavikdbavishi has quit IRC09:42
openstackgerritTobias Henkel proposed zuul/zuul master: Handle jobs with dependencies on job page  https://review.opendev.org/70304509:43
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Remove trusty testing  https://review.opendev.org/70304609:45
AJaegercorvus,clarkb: fine to remove the trusty testing on zuul-jobs? ^09:47
reiterativeI'm trying to (manually) set up a node for use via the Nodepool Static driver, but Zuul is failing with RETRY_LIMIT when trying to use it. Can anyone point me at some info about the config / setup requirements for a node? I have confirmed that I can ssh to the node from my Nodepool instance using the configured user, and can see it logging on in the syslog. I've been trying to follow the setup from the quick start tutorial demo, but I suspect that I'm09:54
reiterativemissing something...09:54
openstackgerritSimon Westphahl proposed zuul/zuul master: Match tag items against containing branches  https://review.opendev.org/57855709:59
openstackgerritSimon Westphahl proposed zuul/zuul master: Optionally support mitogen for job execution  https://review.opendev.org/65702410:01
openstackgerritMatthieu Huin proposed zuul/zuul master: OIDCAuthenticator: add capabilities, scope option  https://review.opendev.org/70227510:03
openstackgerritSimon Westphahl proposed zuul/zuul master: Report retried builds in a build set via mqtt.  https://review.opendev.org/63272710:03
*** jangutter has joined #zuul10:06
*** hashar has joined #zuul10:12
*** jangutter has quit IRC10:13
*** jangutter has joined #zuul10:13
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: docker-install: workaround for centos-8 conflicts  https://review.opendev.org/70305310:49
openstackgerritBenjamin Schanzel proposed zuul/zuul master: Allow Passing of Jitter Values in TimerDriver  https://review.opendev.org/70285411:15
*** pcaruana has joined #zuul11:59
*** sshnaidm|afk is now known as sshnaidm|off12:00
openstackgerritTobias Henkel proposed zuul/zuul master: Don't expand change panel on middle click  https://review.opendev.org/70306412:08
*** zbr|rover is now known as zbr|drover12:09
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal  https://review.opendev.org/70306512:09
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal  https://review.opendev.org/70306512:11
*** rfolco has joined #zuul12:27
avassreiterative: I can take a look at it if you link a paste: http://paste.openstack.org/12:31
*** bhavikdbavishi has joined #zuul12:31
reiterativeThanks very much avass - what info would you like me to share?12:33
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal  https://review.opendev.org/70306512:34
*** bhavikdbavishi1 has joined #zuul12:34
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal  https://review.opendev.org/70306512:34
avassreiterative: Do you get any build logs during the pre-run? If so that's a good place to start. Otherwise the executor log during the build should contain some information as well :)12:35
*** bhavikdbavishi has quit IRC12:35
*** bhavikdbavishi1 is now known as bhavikdbavishi12:35
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal  https://review.opendev.org/70306512:37
avassreiterative: Here are the requirements for the nodes anyway: https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html#managed-node-requirements12:38
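For reference, a minimal static-driver pool along the lines of the nodepool documentation might look like the fragment below (the hostname, label, and username are placeholders). A RETRY_LIMIT result usually means the pre-run playbook failed repeatedly, often due to host-key or python interpreter problems on the node rather than basic ssh reachability.

```yaml
# Hypothetical nodepool.yaml fragment for one manually managed node.
labels:
  - name: static-ubuntu
providers:
  - name: static-provider
    driver: static
    pools:
      - name: main
        nodes:
          - name: node01.example.com   # placeholder hostname
            labels: static-ubuntu
            username: zuul             # account nodepool/zuul ssh in as
            python-path: /usr/bin/python3
            host-key-checking: true
```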
reiterativeavass Executor log is here http://paste.openstack.org/show/788531/12:39
reiterativeWhere would I look for build logs?12:39
*** dmellado has quit IRC12:41
tristanCcorvus: could the unknown configuration be caused by the fact that there are two jobs that depend on the build-image job? Is there an existing (stable) project pipeline config with more than one job that depends on the buildset registry?12:42
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: DNM: docker-install: test existing jobs  https://review.opendev.org/70306812:43
*** dmellado has joined #zuul12:44
*** jpena is now known as jpena|lunch12:46
avassreiterative: There will be a bit more information if you run the executor in debug mode... which I don't remember how to configure *shrug*12:50
avassreiterative: Otherwise you can use the fetch-output role from zuul-jobs repo: https://zuul-ci.org/docs/zuul-jobs/log-roles.html#role-fetch-output12:51
avassBut I guess that would fail as well. Can you see anything in the web dashboard during the run?12:51
openstackgerritBenjamin Schanzel proposed zuul/zuul master: Allow Passing of Jitter Values in TimerDriver  https://review.opendev.org/70285412:54
reiterativeCheers I'll try that12:58
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal  https://review.opendev.org/70306513:09
*** rlandy has joined #zuul13:11
*** bhavikdbavishi has quit IRC13:23
openstackgerritAlan Pevec proposed zuul/zuul-jobs master: Add phoronix-test-suite job  https://review.opendev.org/67908213:25
tristanCzuul-maint : could you please have a look at https://review.opendev.org/679082 . If zuul-jobs is not the right place for such a job, should we consider a zuul-jobs-extra project, or should this live outside of the opendev/zuul org?13:30
mnasersmall ping about https://review.opendev.org/#/c/701868/ :)13:32
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: remoke-sudo: improve sudo removal  https://review.opendev.org/70306513:38
*** jpena|lunch is now known as jpena13:39
*** pcaruana has quit IRC13:45
*** saneax has quit IRC14:06
openstackgerritTristan Cacqueray proposed zuul/zuul-operator master: Manage operator scaffolding using a function and configuration file  https://review.opendev.org/70301314:10
*** pcaruana has joined #zuul14:22
*** bhavikdbavishi has joined #zuul14:31
*** bhavikdbavishi1 has joined #zuul14:38
*** bhavikdbavishi has quit IRC14:40
*** bhavikdbavishi1 is now known as bhavikdbavishi14:40
*** jangutter has quit IRC14:51
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: revoke-sudo: improve sudo removal  https://review.opendev.org/70306515:02
*** jangutter has joined #zuul15:02
openstackgerritMerged zuul/zuul-registry master: Switch to collect-container-logs  https://review.opendev.org/70186815:04
*** jangutter has quit IRC15:06
*** zbr|drover has quit IRC15:08
sugaarHi, I am deploying zuul in kubernetes. For that reason, I took the zuul example docker-compose and I am implementing it in k8s "style". I am having a problem with zuul-scheduler because when the script wait-for-gearman.sh is executed I get "./wait-to-start.sh: line 9: mysql: Name or service not known15:12
sugaar./wait-to-start.sh: line 9: /dev/tcp/mysql/3306: Invalid argument15:12
sugaar"15:12
openstackgerritTobias Henkel proposed zuul/zuul master: Shard py35 and py37 test cases  https://review.opendev.org/70247315:13
sugaarSo I went into /dev/ and there is no tcp/ directory, so I was wondering how it works in the docker-compose? why doesn't it break there?15:13
tristanCsugaar: did you setup a service named 'mysql' ?15:14
sugaarI am using mariadb which is what is used in the docker-compose. By service you mean a k8s service object?15:14
tristanCsugaar: yes, the k8s service; iiuc that's what makes the `mysql` name resolvable in dns15:15
tobiashsugaar: I guess that dir is not forwarded to pods in k8s, however in k8s you won't really need that script as this is used to avoid race conditions in ci15:15
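The `/dev/tcp/<host>/<port>` path sugaar went looking for is not a real directory: it is a redirection feature synthesized by bash itself, so the script must run under bash (not plain `sh`) and the hostname must resolve, which explains both error lines in the paste. A minimal sketch of such a wait loop (host/port/retry count are placeholders, not the actual wait-to-start.sh):

```shell
# Requires bash: /dev/tcp is a bash pseudo-device, not a filesystem path.
# Return 0 if a TCP connection to $1:$2 succeeds.
check_tcp() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Hypothetical wait loop in the spirit of wait-to-start.sh: retry
# until the database service name resolves and accepts connections.
wait_for() {
    local host=$1 port=$2 tries=${3:-30}
    for ((i = 0; i < tries; i++)); do
        check_tcp "$host" "$port" && return 0
        sleep 1
    done
    return 1
}
```

If the name `mysql` does not resolve (e.g. no matching k8s Service), bash reports exactly the "Name or service not known" error shown above.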
Shrewsclarkb: I read you comment you pointed me to in https://review.opendev.org/702828. I think it's difficult for us to judge the impact of that change without seeing results, so maybe we should let someone who does use the static driver cast a vote first. (cc: tobiash)15:15
Shrewswow, such bad grammar i have today15:16
*** zbr has joined #zuul15:16
*** avass has quit IRC15:16
*** zbr is now known as zbr|drover15:16
tobiashShrews: we're using the static driver a lot (with user managed static nodes) and when they're doing maintenance (by firewalling ssh port off nodepool) our logs really get spammed by the exception traces15:16
sugaartristanC I have the service but I gave a different name to it, I will change the name to see what happens.15:17
sugaartobiash do you reckon I could get rid of the script?15:17
tobiashso for us this is 'normal operations' where one would not want exceptions in the log15:17
Shrewstobiash: sure. i trust your judgement on it, so if you're ok with the change, i'm happy to +3 it15:17
tobiashShrews: I'm fine with the change15:18
tristanCsugaar: fwiw, you might be interested by the work we are doing with the zuul-helm and the zuul-operator project15:18
Shrewstobiash: cool. just needs your +2  :)15:18
tobiashShrews: I'll review in a bit :)15:18
clarkbtobiash: did you see the case I was calling out that probably is still an error though?15:21
tobiashclarkb: ideally, user-triggered maintenance would be implemented by the user firewalling the port off directly after a job (which would then also hit the re-registration code path)15:23
tobiashfurthermore, nodepool does a periodic check of node connectivity anyway and still periodically logs warnings about that15:23
corvusclarkb, tobiash, Shrews: how about log.error and include the errno message?15:24
corvus(i think the main thing the traceback gets you there is *why* the connection failed (no route/refused/dns/etc))15:25
tobiashI guess that would be a reasonable compromise15:25
tobiashswest: what do you think? ^15:26
*** electrofelix has joined #zuul15:26
clarkbok so it is valid to hit that post-job and deregister. In that case I guess warning is fine though error might be better?15:27
corvustristanC: i think opendev/system-config has multiple jobs that depend on the registry, but they have file matchers, so i don't know how often they run at once.15:27
clarkbmostly I don't want the cause of the deregister to get lost15:27
openstackgerritAntoine Musso proposed zuul/zuul master: doc: add links to components documentation  https://review.opendev.org/70310515:28
corvusclarkb, tobiash, Shrews, swest: i guess the message should also say "failed to connect after liveness check" or something so we know where (that's the other thing the TB gives us)15:28
tobiashsounds good15:29
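corvus's suggestion can be sketched in Python (function and logger names are hypothetical, not the actual nodepool code): drop the full traceback but keep both the location ("after liveness check") and the errno text, which together preserve most of what the traceback provided.

```python
import logging
import socket

log = logging.getLogger("nodepool.static")

def check_node(host, port, timeout=5.0):
    """Hypothetical liveness re-check for a static node."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as e:
        # log.error instead of log.exception: no traceback spam in
        # normal operations, but the OSError text still says *why*
        # it failed (refused / no route / dns / timeout ...).
        log.error("Failed to connect to %s:%s after liveness check: %s",
                  host, port, e)
        return False
```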
sugaartristanC this one right? https://opendev.org/zuul/zuul-helm15:31
openstackgerritJames E. Blair proposed zuul/zuul master: Docs: add admin reference section  https://review.opendev.org/70299715:35
corvussugaar: yes -- there's no documentation yet, but looking at it may help15:41
openstackgerritAntoine Musso proposed zuul/zuul master: Fix release note for a 3.0.2 feature  https://review.opendev.org/70310915:41
tristanCsugaar: yes, and i'm also working on a zuul operator with this topic: https://review.opendev.org/#/q/topic:zuul-crd15:44
pabelangerI really don't like that depends-on in commit messages does not work for github16:02
pabelangerand I would like to see how to fix that16:02
tobiashpabelanger: use the hub tool ;)16:02
pabelangerI do use CLI tool16:03
pabelangerbut github issue templates overwrite it16:03
pabelangeralso, it forces users who use gerrit into a 2nd workflow16:03
pabelangerplus, if you edit the first PR comment when a job is running, zuul will abort the job and not re-enqueue16:03
pabelangeryou have to manually recheck16:04
clarkbpabelanger: to be fair github and gerrit workflows are already miles apart16:04
sugaartristanC thanks I will have a look to that too16:04
tobiashpabelanger: having the depends-on in the commit message doesn't really work with github as this information is only available after doing the merge and even then it'll be difficult as you can have all sorts of git trees in a pr16:04
pabelangerclarkb: yah, I agree. Just very frustrating to update it in multiple places16:04
pabelangertobiash: can you explain available after merge? I am not following16:06
tobiashpabelanger: in order to get the depends-on from commit messages you need to analyse all commits that are part of the pr (which can be quite a lot)16:07
pabelangeryes, I agree that would be needed. But, is that a lot of overhead?16:08
tobiashso you would need to either hammer the github api or use the mergers to get all depends-on headers of all involved commits16:08
pabelangercould we not support both16:08
tobiashand this needs to be done before enqueuing into any pipeline16:08
pabelangeryup, agree16:09
pabelangerwhat if mergers did squash commits, that should get all commits in the PR, with depends-on info, into a single commit16:10
*** hashar has quit IRC16:11
pabelangerI _really_ need to slow down on typing16:12
tobiashthe thing is not how to get those commit messages, but the added complexity and system load for doing that16:15
tobiashit could be doable with https://developer.github.com/enterprise/2.19/v3/pulls/#list-commits-on-a-pull-request16:15
tobiash(with the limitation that only the depends-on headers of max 250 commits get found)16:15
tobiashbut if we'd support this it should be optional because of increased system load when using this feature16:16
tobiashan important thing is also that, especially with multi-commit prs, users can quickly be confused about why dependencies are pulled in, because on the pr page you won't see any dependency headers from commit messages unless you also add them to the pr description16:17
tobiashso for me this means: more support tickets from inexperienced users wondering why or why not dependencies are pulled in16:18
tobiashthat's why I'd turn that feature off in my deployment16:18
pabelangeryah, I have the opposite issue: gerrit users of zuul are creating support issues because depends-on from commit messages isn't working.16:19
pabelangerbut yah, agree it would be nice to make it optional or supported some way16:19
fungiwell. and another problem is that the more popular github workflow is to not replace commits but merely add new child commits in the pr, so what's the expected behavior if there's a depends-on in a commit message for a non-leaf commit in the pr?16:19
pabelangerthis is about the main issue I have with github driver today, so #firstworldproblem I figure :)16:19
tobiashpabelanger: but I'm open adding this as an optional feature (probably best default off)16:20
pabelangerk, that is good to know16:20
fungiand how do you "remove" or "modify" a depends-on without rewriting that commit, in a workflow where rewriting commits is discouraged?16:20
tobiashfungi: in that case all depends-on headers would be squashed into a combined list16:20
*** mattw4 has joined #zuul16:20
tobiashremove would be only possible by force push16:20
fungiright, and there are projects which are strongly opposed to force-push in pull requests16:21
tobiash(which is mostly not that discouraged on pre-merge pr branches)16:21
fungibecause they want to retain the history of the pr development16:21
tobiashin that case it's not possible to remove dependencies16:21
fungiexactly16:21
fungi(that was basically my point)16:21
fungithere are popular github workflows which, as clarkb already indicated, are simply incompatible with gerrit workflows, so trying to have zuul pretend people use them the same way is likely to just lead to more problems16:22
tobiashpabelanger: so as long as you're willing to accept those shortcomings and the feature is optional and off by default I'm fine with it16:23
pabelangertobiash: Yup, I think it would be a good idea to get them all written down some place, then see if it's worth it.16:24
pabelangerI don't have time to work on it today, but it is helpful to understand how others deal with it16:24
tobiashI personally deal with it by adding the depends-on when creating the pr via commandline (which actually feels pretty much like creating a commit) ;)16:25
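For readers unfamiliar with the convention being discussed: a Depends-On footer is a plain line in the change description pointing at the URL of the change it depends on. With the github driver it is only read from the PR body, which is why tobiash sets it at creation time from the command line. A sketch of such a PR description (the summary text is invented; the URL is one of the reviews from this log):

```
Fix frobnicator timeout handling

This needs the jitter change to land first.

Depends-On: https://review.opendev.org/702854
```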
pabelangeryah, we have issue templates. for some reason, they don't read commit messages by default16:26
pabelangerso, I often forget to add it there16:27
pabelangerHmm, I noticed I have a locked ready node for 16hours16:30
pabelangerneed to debug that, but if scheduler is holding lock, there is no way to unlock it without restart right?16:30
tobiashpabelanger: yes, we also face this issue, zuul occasionally leaks nodes in some corner cases, but I didn't have time yet to deeply debug this16:31
pabelangerokay, thanks.16:31
pabelangerwould be nice to have way to unlocking, without full restart16:32
tobiashpabelanger: there is a hack to unlock them by deleting the lock znode via zookeeper cli16:32
tobiashbut that is a hack16:32
pabelangertobiash: k, I'd be okay with that right now16:32
pabelangernodepool delete --force --delete-lock16:33
pabelangerwould be nice16:33
tobiashwell, we should fix the leak instead ;)16:33
pabelangeryes, that is better16:33
tobiashbtw, this fixes one of the leaks: https://review.opendev.org/66664316:33
tobiashbut there is still one left somewhere16:34
Shrewsi do not endorse a way to manually break locks16:34
tobiashShrews: I know, that's why I emphasized it's a hack ;)16:34
Shrewsyeah, i meant the nodepool cli command16:35
pabelangeryah, in our case we are getting billed for the locked node, but I don't have bandwidth to debug or restart zuul16:35
pabelangerso, looking for some solution16:35
pabelangerI could delete the node behind nodepool / zuul's back, but that doesn't feel good either16:35
tobiashpabelanger: in your case I would delete the lock and let nodepool handle the delete, but I agree with Shrews that we shouldn't add such functionality to nodepool16:36
tobiashthat's what we do currently until I have enough time to analyze and fix the leak16:37
corvus643 lgtm.  that was originally an optimization for speed in zuul v2, but i think we can do without it now.16:38
pabelangertobiash: do you use zk-shell?16:39
pabelangertobiash: like to see if you use rm or rmr16:40
tobiashcorvus: thanks, at least I didn't notice any negative performance impact in our system (we're running with this fix already for quite some time)16:40
tobiashpabelanger: zkCli.sh rmr /nodepool/nodes/<nodeid>/lock16:41
pabelangerthanks16:41
clarkbtobiash: do you configure timeouts with gearman so that if an executor is lost eventually that work is handed to another executor?16:43
clarkbI seem to recall you were involved in some changes to gear to make that possible?16:43
pabelangerleaked node gone, I'll see if I can figure out why it was locked in a bit16:44
tobiashclarkb: that's done by using tcp keepalive and default in zuul16:44
tobiashclarkb: support for that in gear: https://review.opendev.org/59956716:44
tobiashclarkb: and usage in zuul: https://review.opendev.org/59957316:45
clarkbtobiash: ok so it should just work already?16:46
clarkb(we're seeing that in some cases we have changes stay queued waiting for a job to start for many hours, ~48, after an executor dies)16:46
tobiashactually yes, just read in -infra, did you verify that this job ran on the lost executor?16:46
*** themroc has quit IRC16:47
tobiashzuul did handle executors going away quite ok as far as I remember16:48
clarkbyup we regularly stop them and start them for upgrades and jobs typically restart as expected. This is likely a corner case of some sort16:48
clarkbI'm not sure if fungi managed to confirm that the lost executor was handling this queued job16:48
pabelangerclarkb: did executor stop, or die?16:50
clarkbpabelanger: it was not asked to stop if that is what you mean16:53
fungipabelanger: tobiash: ooh! i didn't know about rmr, i've been using tab completion to manually recurse for removals with multiple rm commands instead16:53
pabelangerclarkb: looking for old bug, might have idea16:54
tobiashclarkb: gear server should also use keepalive16:54
fungiclarkb: nope, i haven't had time to dig into the executors yet. still working my way around to it, but last few times this behavior was reported that was precisely the cause16:54
fungi(executor is already on its way out, cpu is spiking or whatever, but it manages to accept the build request moments before becoming entirely unresponsive)16:55
tobiashclarkb: so a freezing vm should be detected (but not a freezing executor process that continues to keep the connection running)16:55
pabelangerI remember an issue, if zuul executor was killed by rackspace, we leaked jobs16:55
pabelangerI think I opened a bug about it16:55
fungitobiash: what's the definition of "keep the connection running?" does that include the server disappearing without closing the connection? or is there some sort of dead peer detection which is supposed to notice when it's no longer responsive?16:57
tobiashpabelanger: if you find that bug, this was probably the fix: https://review.opendev.org/425248 (vm going away without terminating the connection)16:57
pabelangerclarkb: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2018-07-18.log.html#t2018-07-18T17:32:2716:57
pabelangerstill looking16:57
tobiashfungi: gear server as well as client use tcp keepalive, so a vm going away will be detected16:57
tobiashwhat won't be detected is a freezing executor python process (no idea how that could happen) because tcp keepalive is handled by the kernel16:58
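What tobiash describes can be sketched in Python (the function name is hypothetical; the real gear change is https://review.opendev.org/599567, and the option constants are Linux-specific): keepalive is armed per-socket and then serviced entirely by the kernel, which is exactly why a frozen python process on a machine with a live kernel is not detected.

```python
import socket

def enable_keepalive(sock, idle=7200, interval=75, count=9):
    """Arm kernel TCP keepalive on a connected socket.

    After `idle` seconds of silence the kernel sends up to `count`
    probes, `interval` seconds apart; if all go unanswered the
    connection is reset and the peer is treated as dead.  These
    option names are Linux-specific; defaults mirror the kernel's.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```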
pabelangertobiash: yah, that's what I remember, VM dies, server reboots, stuck job16:58
tobiashalso not handled: the vm getting unresponsive because of process explosion; in that case the kernel probably still answers keepalive16:59
pabelangeris there no way to get more then 10 items in storyboard?17:00
pabelangerhttps://storyboard.openstack.org/#!/project/67917:00
clarkbpabelanger: in your user settings you can check the paging size17:00
tobiashor just beneath the page selector17:00
pabelangerah, thanks see it17:01
fungithe little gear icon17:04
Shrewshrm, yet another nodepool-zuul-functional test failure during managed_ansible processing: https://zuul.opendev.org/t/zuul/build/a77e4d2b427745f3b8855249d6f05725/log/job-output.txt#85117:05
tobiashShrews: I have a hunch that multithreaded installation of ansible into different venvs sometimes confuses pip17:06
pabelangerShrews: I've seen that before, but not sure why it happens17:06
pabelangertobiash: I think that might be the case17:07
tobiashmaybe some tempfile/cache race17:07
tobiashat least it got worse with more supported versions17:07
openstackgerritTobias Henkel proposed zuul/zuul master: Limit parallelity when installing ansible  https://review.opendev.org/70312617:10
fungipip does use a couple of on-disk caches by default, so i could see concurrent writes potentially clobbering each other in those from time to time17:10
tobiashShrews: I'd suggest to try two threads as a compromise between test failure risk and time to install 4 different versions.17:11
fungiit has an http cache and a wheel cache17:11
tobiashif that's not enough we can disable parallelity entirely17:11
fungibut there are pip command-line options to set the locations of its caches, so we could in theory separate them out and see if that helps17:11
fungiat the expense of retrieving/building some packages multiple times, of course17:12
clarkbtobiash: is that what caused my docs change job to fail?17:12
clarkbtobiash: it was definitely unahppy about something very deep into python itself17:12
clarkb(importlib to be specific)17:12
tobiashwhich one?17:12
tobiashI think I remember one and I think it was during ansible installation17:13
clarkbtobiash: let me find it17:14
clarkbtobiash: https://zuul.opendev.org/t/zuul/build/c26b73191cf2443a8e083f7b1668052e that one17:14
tobiashclarkb: yes, exactly this issue17:15
corvustobiash: see my comment on 70312617:16
corvusi think we're really going to want a comment for that in the future :)17:16
tobiashcorvus: thanks, fixing17:16
tobiashyou're absolutely right :)17:16
openstackgerritTobias Henkel proposed zuul/zuul master: Limit parallelity when installing ansible  https://review.opendev.org/70312617:18
*** chandankumar is now known as raukadah17:20
*** tosky has quit IRC17:21
Shrewscorvus: As an alternative to https://review.opendev.org/702992, all of the remaining common Reference links are admin oriented *except* glossary and dev guide. We could also add an admin reference section, move glossary back under the "Indices and tables", then expand the Dev Guide by itself (thus replacing the common References section). That would expose the dev stuff more clearly on the root page.17:24
Shrewsoh, some of that is done in deeper reviews, i see now17:26
Shrewsat least the admin reference stuff17:26
openstackgerritMerged zuul/zuul master: Handle Erroneous Cron Strings in TimerDriver  https://review.opendev.org/70223717:27
Shrewsi guess the Overview content move complicates my suggestion17:27
*** evrardjp has quit IRC17:34
*** evrardjp has joined #zuul17:34
corvusShrews, clarkb: would you mind reviewing my doc changes despite the -1s, then i'll rebase them and push them through?  conflicts are getting annoying17:35
corvusmaybe if we like all 4 changes, i can squash them too17:35
clarkbcorvus: ya I can rereview them17:36
Shrewscorvus: yeah, i'm doing that now (thus the ramblings above)  :)17:36
Shrewsalready approved the lead change17:38
openstackgerritJames E. Blair proposed zuul/zuul master: Docs: flatten directory structure  https://review.opendev.org/70313517:38
corvuscool, i'll start fixing the conflicts then17:39
corvusoh, i think i'm just going to have to wait for those to merge, then rebase17:41
Shrewsi still think eliminating the common ref section and expanding the dev guide is worth some consideration down the road, but these all lgtm17:41
corvusShrews: yeah, what's left in common ref is ambiguous as to audience.17:42
corvusif i had to choose, i'd say monitoring is more admin, rest is more user.17:42
corvus(but opendev users make use of the monitoring reference)17:42
corvusthis is where separating by audience doesn't matter any more17:43
fungii guess we don't document the signal handler in zuul at all? looks like there's documentation for the (virtually identical) one in nodepool. should we copy that into zuul's docs? (also i'm getting an extreme sense of déjà vu here)17:43
fungimaybe there's already a proposed change to do that17:44
corvusfungi: https://zuul-ci.org/docs/zuul/discussion/components.html#reconfiguration17:45
*** jpena is now known as jpena|off17:45
openstackgerritJames E. Blair proposed zuul/zuul master: Docs: re-order reference index  https://review.opendev.org/70296217:46
openstackgerritJames E. Blair proposed zuul/zuul master: Docs: move project config docs to user reference  https://review.opendev.org/70299217:46
openstackgerritJames E. Blair proposed zuul/zuul master: Docs: move overview section to reference  https://review.opendev.org/70299517:46
openstackgerritJames E. Blair proposed zuul/zuul master: Docs: add admin reference section  https://review.opendev.org/70299717:46
openstackgerritJames E. Blair proposed zuul/zuul master: Docs: flatten directory structure  https://review.opendev.org/70313517:46
openstackgerritJames E. Blair proposed zuul/zuul master: Docs: fix styling in reconfigure commands  https://review.opendev.org/70313817:46
corvusoh phooey17:46
corvusthose are just rebases; i'll self +3 them.17:46
fungicorvus: hrm, that only talks about SIGHUP though. are we doing something similar to pass SIGUSR1 and SIGUSR2 from the cli?17:47
corvusfungi: oh, the debug thing.  yep that's missing and could be copied.17:48
fungiright, the paragraph at the end of https://zuul-ci.org/docs/nodepool/operation.html#daemon-usage17:48
fungishould that go in https://zuul-ci.org/docs/zuul/howtos/admins/troubleshooting.html do you think?17:48
fungior is there a better place?17:49
corvusfungi: that sounds good to me17:49
fungithanks, will do shortly17:49
*** michael-beaver has joined #zuul17:50
clarkbnote the yappi profiling is only available when yappi is installed manually. However the thread dumps should always work17:55
pabelangerclarkb: corvus: when time, do you happen to have thoughts on http://eavesdrop.openstack.org/irclogs/%23zuul/latest.log.html#t2020-01-17T00:25:1217:58
fungiclarkb: yep, but we install it, so we do get that as well17:59
*** zxiiro has joined #zuul17:59
fungihowever, the resulting double-dump with yappi deets is 64kb/1180 lines so it's probably too large for paste.o.o18:00
corvuspabelanger: if you're dealing with things like that, i would suggest restructuring the job hierarchy so you're not setting files stanzas too early18:01
pabelangercorvus: yah, that is possible. What we have today is a little complex, with multiple child jobs.18:02
pabelangerhowever, so far I haven't figured out a better way18:02
corvuspabelanger: to put it succinctly, if you're undoing a files matcher in a pipeline, it shouldn't be on the job in the first place.  think about making 2 versions of the job, or something.18:04
pabelangerYup, that is fair. This job is a little odd, as we run it in ansible-zuul-jobs. But it's really a parent for jobs that run in ansible/ansible. So, we mostly only want to run it if we update playbooks in ansible-zuul-jobs, and always for ansible/ansible. For now, the file matcher in the pipeline has been the way forward18:06
corvuspabelanger: then 3 jobs: abstract parent in azj with no file matcher; child in azj with file matcher - this one runs on azj changes; child in aa with different file matcher - this runs on aa changes18:08
corvusjobs are free.  make as many as you need :)18:08
fungitobiash: here's a thread dump from our hung scheduler... http://paste.openstack.org/show/788548/18:09
fungianything look familiar?18:09
corvuspabelanger: (or, just use the file matchers in the pipelines.  you can use yaml anchors to avoid duplication)18:09
pabelangeryah, that is fair. Mostly want to keep a->b->c order over a->b a->c, but will work18:09
pabelanger++18:09
pabelangerack, thanks for help18:09
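[editor's note: corvus's three-job suggestion, sketched in Zuul job syntax. Every job name and path below is hypothetical, not pabelanger's real config — it only illustrates the shape: an abstract parent with no matcher, plus one child per repo carrying its own files matcher:

```yaml
# In ansible-zuul-jobs: abstract parent, no file matcher.
- job:
    name: ansible-test-base
    abstract: true
    run: playbooks/ansible-test/run.yaml

# In ansible-zuul-jobs: child that runs when its playbooks change.
- job:
    name: ansible-test-self
    parent: ansible-test-base
    files:
      - ^playbooks/ansible-test/.*

# In ansible/ansible: child with a matcher suited to that repo.
- job:
    name: ansible-test
    parent: ansible-test-base
    files:
      - ^lib/ansible/.*
```

Since the abstract parent never runs directly, nothing needs to "undo" a matcher; each child carries only the matcher appropriate for the repo it lives in.]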
corvusfungi: is there a stuck thread?18:09
fungicorvus: what's the best way to tell?18:10
corvusfungi: sorry, i mean, i'm asking what you're debugging18:10
corvusi recall a conversation about a stuck job18:10
fungioh, there's a build hung waiting for over a day, and so went looking and it was handled by ze08 shortly before its service/debug logs went dead silent18:11
fungibut it responded to sigusr2 and wrote a thread dump18:11
corvusso this is the output from ze08 and the question is: "why is it not doing anything?"18:11
fungiright18:12
fungithe hung build is 7f3f0f8e86324b22968daec67d7ef28c18:12
fungiwhich does seem to have a thread at line 26418:12
fungithough there are a number of other build-* threads too18:13
fungibut more curious is that the executor seems to have ceased doing anything at all roughly 48 hours ago18:13
funginot even periodic wakeups in the debug log, making me suspect a stuck thread18:14
corvusfungi: thread 140609457223424 looks suspicious18:16
corvuslike it's involved in a long running git transaction.  it's holding a lock that all other jobs need.18:16
fungiindeed, and the forked zuul daemon process is hung at "read(5, " according to strace18:17
fungiand there are two other `git cat-file ...` child processes as well18:17
fungithough those predate the cessation of logging by a couple hours18:17
fungiso i wasn't certain they were necessarily related18:17
fungibut this does tie all that together a bit18:18
corvusfungi: can you tell what git files are open?18:18
fungichecking18:18
fungiby those specific processes, or across the whole filesystem?18:19
fungibut looks like mostly nova18:19
corvusfungi: at this point, probably any open git files18:19
*** hashar has joined #zuul18:19
fungi/var/lib/zuul/executor-git/opendev.org/openstack/nova/.git/objects/pack/<various>.idx18:19
corvusthis all looks to be local filesystem access, so there could be a bad git repo involved.  i'm not sure how difficult it would be to add a timeout here -- i think it would have to be something pretty high level (like something that forcefully killed the thread).18:19
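[editor's note: the timeout corvus mentions can be sketched at the subprocess level. This hypothetical wrapper is illustrative only — Zuul drives git through GitPython, so a real fix would have to live higher up, e.g. forcefully killing the merger thread as corvus says:

```python
import subprocess


def git_with_timeout(args, cwd, timeout=300):
    # Run a git command but give up after `timeout` seconds instead of
    # blocking forever on a wedged repo; subprocess.run() kills the
    # child process when the timeout expires, raising TimeoutExpired.
    # (Hypothetical helper, not Zuul's implementation.)
    return subprocess.run(
        ["git"] + list(args), cwd=cwd,
        capture_output=True, timeout=timeout)
```

A caller would catch `subprocess.TimeoutExpired` and fail the build rather than hang the whole update loop.]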
corvusfungi: i would suggest stopping the executor, removing /var/lib/zuul/executor-git/opendev.org/openstack/nova/ and restarting18:20
fungialso /var/lib/zuul/executor-git/git.openstack.org/openstack/charm-vault/.git18:20
corvusmaybe that one too then18:20
fungiand yeah, it doesn't seem to be underlying device issues, or at least nothing that the kernel has logged to dmesg18:20
fungii'll move the nova repo out of the way so we can investigate whether its contents are hinky18:22
fungionce the executor is fully stopped18:22
fungishould i also move charm-vault18:22
corvusyeah18:22
fungiokay, i'll move both repos18:22
corvus(and some "kill"ing may be required to get it to stop)18:22
fungisure, i won't be surprised there18:23
*** dtroyer has joined #zuul18:29
*** bhavikdbavishi has quit IRC18:29
fungiafter restart it's cloning those again18:30
fungiso far, so good18:30
*** fbo is now known as fbo|off18:41
tobiashfungi: you wrote that there were hanging git cat-file processes?18:46
tobiashThat would fit to that dump18:46
*** electrofelix has quit IRC18:46
tobiashThere are a lot of blocking threads in the inner update loop18:47
tobiashAka every blocked thread there blocks a job18:47
tobiashSo if git blocks the update loop it blocks the whole executor18:47
tobiashI'm not at computer atm, but we should check where the git calls come from and maybe see if some sort of timeout would help there18:48
tobiashoh I guess I should have read the whole backscroll18:49
fungitobiash: yep, several bits of evidence point to hung `git cat-file ...` commands, possibly as a result of some sort of corrupt on-disk repositories (they seemed to be reading from various pack files before i killed them)18:50
fungii moved them aside and will try a git fsck on a copy shortly to see if that theory pans out18:51
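[editor's note: the check fungi describes — fsck a copy so the live cache stays untouched — looks roughly like this. The scratch repo below stands in for the copied cache repo; the real one would be copied out of /var/lib/zuul/executor-git first:

```shell
set -e
# Build a scratch repo standing in for the copied cache repo.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=test -c user.email=test@example.com \
    commit -q --allow-empty -m 'seed commit'

# Verify object-store integrity; a corrupt pack or index shows up here.
git -C "$repo" fsck --full
```

A clean exit (as fungi reports below) means the packs and indexes are internally consistent, which points away from on-disk corruption.]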
*** pcaruana has quit IRC18:51
fungiof the two repos which were open for read on the filesystem, the smaller one completed fsck with no errors (but most of the open file handles were to the larger one which will take a few minutes to fsck)18:57
tobiashcorvus: I guess git cat-file is only used by cat commands?18:57
tobiashMaybe that interferes with parallel fetches18:58
tobiashI'm not sure if those use the repo locks18:58
fungigit fsck on the larger repo also didn't report any errors, just some ~800 dangling commits but that's probably normal19:00
fungithe git cat-file processes were clearly stuck on a read call though, according to strace. maybe something changed out from under them?19:01
tobiashMaybe a garbage collect triggered by a concurrent fetch19:01
tobiashIf that truncates the pack file while cat is reading...19:02
fungiyeah, seems like there are opportunities for races there at least19:06
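[editor's note: if the gc race theory holds, one common mitigation for long-lived cache repos is switching off automatic gc so repacks only happen deliberately. A sketch — whether this is appropriate for Zuul's executor caches is untested here:

```shell
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
# gc.auto 0 disables object-count-triggered automatic gc;
# gc.autoDetach false keeps any explicit gc in the foreground,
# so it cannot rewrite packs behind a concurrent reader's back.
git -C "$repo" config gc.auto 0
git -C "$repo" config gc.autoDetach false
```

With auto gc off, pack files are only rewritten during an explicit, scheduled `git gc`, which narrows the window for a reader to be left holding a truncated pack.]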
*** hashar has quit IRC19:08
tobiashfungi: hrm, the only thing in the stack trace that reads a stream is actually this: http://paste.openstack.org/show/788549/19:26
openstackgerritJeremy Stanley proposed zuul/zuul master: Add notes on thread dumping and yappi  https://review.opendev.org/70318519:26
tobiashthat's probably reading one of those pack indexes19:26
fungitobiash: yep, that's the same one corvus pointed out too19:27
fungisounds like some consensus at least19:27
tobiashand I cannot find anything else that accesses the repo and isn't protected by the lock19:27
tobiashso could also be infra structure related19:28
fungipossible, though the kernel didn't think there was any problem with the server's storage layer19:33
fungialso, zuul-maint: heads up that in #openstack-infra we're discussing what looks like a rapid scheduler memory leak a day or two after restarting on what was tagged as 3.15.019:34
*** tosky has joined #zuul19:41
openstackgerritMerged zuul/zuul master: Defer setting build result to event queue  https://review.opendev.org/66664319:45
*** michael-beaver has quit IRC20:00
*** openstackstatus has joined #zuul20:02
*** ChanServ sets mode: +v openstackstatus20:02
*** dtroyer has quit IRC20:21
*** armstrongs has joined #zuul20:28
*** dtroyer has joined #zuul20:28
*** armstrongs has quit IRC20:38
*** zxiiro has quit IRC20:42
*** jamesmcarthur has joined #zuul20:51
*** jamesmcarthur has quit IRC20:53
*** jamesmcarthur has joined #zuul20:55
*** jamesmcarthur has quit IRC21:20
openstackgerritMerged zuul/zuul master: Fix release note for a 3.0.2 feature  https://review.opendev.org/70310921:22
openstackgerritMerged zuul/zuul master: Docs: re-order reference index  https://review.opendev.org/70296221:35
*** rlandy has quit IRC21:36
clarkbOne thing I've noticed looking at this memory leak is that we seem to compile new config schemas for every config object21:49
clarkbI think we could probably compile them once and use them over and over since the config object rules don't change without restarting zuul (and getting new code)21:49
openstackgerritMerged zuul/zuul master: Docs: move project config docs to user reference  https://review.opendev.org/70299221:55
*** hashar has joined #zuul22:04
openstackgerritMerged zuul/zuul master: Docs: move overview section to reference  https://review.opendev.org/70299522:12
hasharbah I looked at Zuul doc earlier today and the version from this morning is already obsolete!22:15
corvushashar: yeah sorry, we're moving stuff around.  once the outstanding changes land, i'll re-do the redirects22:18
corvusclarkb: speaking of which, are you interested in a +3 on https://review.opendev.org/703135 ?22:18
hashari usually just  tox -e docs ;]22:18
hasharturns out I have the docs built from earlier this week22:19
corvushashar: oh, then you'll want to use grep!  i did that earlier.... :)22:19
clarkbcorvus: looking22:19
corvusclarkb: re schemas -- that's probably true for most of them except maybe project-pipelines22:20
hasharzuul/zuul: tox-py37 SUCCESS in 41m 57s   , doh that has grown tremendously22:20
openstackgerritMerged zuul/zuul master: Docs: add admin reference section  https://review.opendev.org/70299722:37
*** hashar has quit IRC23:02
openstackgerritMerged zuul/zuul master: Docs: flatten directory structure  https://review.opendev.org/70313523:03
*** mattw4 has quit IRC23:57

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!