Thursday, 2022-10-13

01:00 *** rlandy is now known as rlandy|out
05:55 *** ysandeep|out is now known as ysandeep
07:18 *** jpena|off is now known as jpena
08:09 <slaweq> fungi: hi, can you help me again with the tobiko-upload-git-mirror job? I added an ssh key encrypted with zuul-client in the tobiko repo: https://zuul-ci.org/docs/zuul/latest/project-config.html#encryption
08:09 <slaweq> but the job is still failing without any logs: https://zuul.opendev.org/t/openstack/build/9fd7eeac81354a23a0e09d1dd5a8e150/logs
08:09 <slaweq> can you check what might still be wrong there?
08:09 <slaweq> thx in advance
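
For context, the encryption workflow slaweq is describing uses zuul-client's encrypt subcommand against the project's public key. A rough sketch of the invocation; the key path and the secret/field names here are illustrative placeholders, not taken from the tobiko change:

    # encrypt a private ssh key for use as a zuul secret in the
    # openstack tenant (paths and names are placeholders)
    zuul-client --zuul-url https://zuul.opendev.org encrypt \
        --tenant openstack --project opendev.org/x/tobiko \
        --secret-name git_mirror_ssh_key --field-name ssh_key \
        --infile ~/.ssh/tobiko_mirror_key --outfile secret.yaml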
08:17 *** ysandeep is now known as ysandeep|afk
08:33 * frickler not fungi, but taking a look at logs
09:01 <slaweq> frickler: thx a lot :)
10:12 <frickler> slaweq: still seeing the same: ValueError: Encryption/decryption failed.
10:24 *** rlandy|out is now known as rlandy
10:26 <frickler> slaweq: looking at https://opendev.org/x/tobiko/commit/ae6deb883f6ce87c2d069013d94c753af793974a the new secret seems much shorter than the previous one; not sure if that might be legit or is an indication of corrupted data
10:30 <slaweq> frickler: yeah, maybe I did something wrong the first time
10:30 <slaweq> I will try to merge that new patch and check then
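
frickler's length observation is a useful diagnostic: zuul secrets are encrypted in fixed-size PKCS#1-OAEP chunks, so a multi-kilobyte private key should produce a long list of ciphertext blocks. A sketch of the shape zuul-client emits, with illustrative names and truncated ciphertext:

    - secret:
        name: git_mirror_ssh_key
        data:
          ssh_key: !encrypted/pkcs1-oaep
            - AbC1...   # a multi-kilobyte ssh key typically yields many
            - DeF2...   # such chunks, so a suspiciously short list can
            - GhI3...   # indicate truncated or corrupted input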
10:42 *** ysandeep|afk is now known as ysandeep
11:09 *** rlandy is now known as rlandy|mtg
11:19 <yadnesh> hello all, could someone take a look at this job: https://zuul.opendev.org/t/openstack/build/c39ca54a9167461fb974033a1846a3dc/logs
11:19 <yadnesh> it seems to be timing out while downloading pip modules: RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml@master]
11:19 <opendevreview> Merged openstack/openstack-zuul-jobs master: Remove pike branch from periodic-stable templates  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/851613
11:36 *** yadnesh is now known as yadnesh|afk
11:43 <fungi> yadnesh|afk: the cause is fairly straightforward. notice these messages in the job log:
11:44 <fungi> INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C.
11:44 <fungi> now see this comment in the project's tox configuration: https://opendev.org/openstack/python-aodhclient/src/branch/stable/yoga/tox.ini#L16
11:45 <fungi> yadnesh|afk: the available solutions are to start using constraints like the vast majority of openstack projects, get rid of some of your dependencies, choose much tighter dependency version ranges, or increase your job timeout
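
The constraints approach fungi recommends is the standard openstack tox pattern: feed pip a pinned upper-constraints file so the resolver never has to backtrack. A sketch for a stable/yoga project; the deps block shows the conventional layout, not lines copied from python-aodhclient:

    [testenv]
    deps =
      -c{env:TOX_CONSTRAINTS_FILE:https://releases.openstack.org/constraints/upper/yoga}
      -r{toxinidir}/requirements.txt
      -r{toxinidir}/test-requirements.txt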
11:50 *** rlandy|mtg is now known as rlandy
12:16 *** yadnesh|afk is now known as yadnesh
13:00 *** dasm|off is now known as dasm
13:11 *** ministry is now known as __ministry
13:32 *** blarnath is now known as d34dh0r53
14:19 *** ysandeep is now known as ysandeep|dinner
14:52 *** yadnesh is now known as yadnesh|away
14:55 *** dviroel_ is now known as dviroel
15:18 *** ysandeep|dinner is now known as ysandeep
15:33 *** ysandeep is now known as ysandeep|out
15:36 <clarkb> dansmith, fungi: I spot-checked jobs listed by https://zuul.opendev.org/t/openstack/builds?result=POST_FAILURE&skip=0 and all have had logs so far
15:36 <clarkb> there seem to be some persistently broken jobs too, which likely indicates a problem with those specific jobs
15:36 <fungi> did they log what failed though?
15:37 <clarkb> yes
15:37 <fungi> https://zuul.opendev.org/t/openstack/build/3c40559f664543359c0109b28dc07656 was the example brought up in the tc meeting; i was about to look into it
15:37 <clarkb> for example all the monasca errors are due to python not being found
15:38 <clarkb> that example has logs fwiw
15:39 <fungi> yeah, i haven't dived into them yet, but i wonder if the confusion is that the failed ansible tasks aren't rendered in the console view (because we only render tasks which complete)
15:39 <clarkb> it timed out
15:39 <clarkb> so ya I'm not sure that fits the "no logs post failure" situation
15:40 <clarkb> https://zuul.opendev.org/t/openstack/build/3c40559f664543359c0109b28dc07656/log/job-output.txt#34100-34215 looks like slow log processing, likely due to ansible loop costs
15:41 <clarkb> this is related to the improvements I've already made to stage-output, but I'm sure more can be done
15:42 <clarkb> just processing apache logs there took 13 minutes
15:42 <fungi> yeah, i'm on my netbook at the moment and browsing that build result is bogging down the browser something fierce
15:43 *** tkajinam is now known as Guest2971
15:44 <clarkb> also looks like the first task in stage-output, which stats all the files in a loop, wasn't fast. I didn't update that part of stage-output; I only updated the portion that renames files to have .txt suffixes
15:44 <clarkb> so that may be another target for optimizing
15:46 <dansmith> clarkb: ack, I think the one I got with a slew of post_failures with no logs was last week sometime, but it was just one
15:47 <dansmith> so the post failures this week, which were smaller, were on my radar, but several (including the ones mentioned) were with logs and not super obvious
15:47 <fungi> there have been some ssh host key mismatch errors which look like rogue virtual machines in cloud providers getting into arp overwrite situations with our test nodes, and if that hits at the wrong moment then the executor can break while trying to retrieve logs from the node
15:47 <clarkb> ya my experience has been that as long as there are logs these tend to be debuggable.
15:48 <clarkb> fungi: the issue there is that rsync doesn't share the ssh control persistence with regular ssh, so it is more susceptible to those failures
15:48 <fungi> yeah, that definitely doesn't help matters
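
One way to let rsync ride the same multiplexed connection as regular ssh is to pass control-socket options through rsync's -e flag. A hedged sketch of the idea only; this is not necessarily what zuul's executor does, and the host/paths are placeholders:

    # reuse (or establish) an ssh control socket for rsync's transport
    rsync -av -e "ssh -o ControlMaster=auto \
        -o ControlPath=~/.ssh/cm-%r@%h-%p \
        -o ControlPersist=60s" \
        testnode:/var/log/ collected-logs/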
dansmithclarkb: so what was the failure in the above-quoted one15:49
dansmith?15:49
dansmithlike, I see this: POST-RUN END RESULT_NORMAL15:49
15:50 <fungi> slowness in processing all the logs from the job led to a timeout for the post-run playbooks
15:50 <dansmith> and the successful log upload; seems like POST_FAILURE doesn't fit
15:50 <clarkb> dansmith: I linked to it. the post-run timed out processing all of the devstack log files
15:50 <clarkb> https://zuul.opendev.org/t/openstack/build/3c40559f664543359c0109b28dc07656/log/job-output.txt#34100-34215
15:50 <clarkb> it took 13 minutes processing apache logs
15:50 <clarkb> with a half hour timeout (and there were other bits that were slow too)
15:51 <dansmith> aha, multiple post-runs
15:52 <dansmith> I guess I look for the post-run from the bottom, and expect that to be a post_failure-inducing fail
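
The "multiple post-runs" realization is the key point here: a zuul job can stack several post-run playbooks, and they share a single post-timeout, so a slow log-staging playbook can blow the budget even though a later upload playbook still runs and succeeds. An illustrative job definition, assuming hypothetical playbook names; this is not the actual devstack job:

    - job:
        name: example-devstack-job
        post-run:
          - playbooks/stage-logs.yaml    # can consume most of the budget
          - playbooks/upload-logs.yaml   # may still run and succeed after
        post-timeout: 1800               # the half-hour limit mentioned above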
15:55 <clarkb> https://review.opendev.org/c/zuul/zuul-jobs/+/858961 https://review.opendev.org/c/zuul/zuul-jobs/+/857228 and https://review.opendev.org/c/zuul/zuul-jobs/+/855402 are three changes I've already made to speed these sorts of things up for many jobs
15:55 <clarkb> But basically timeouts like this indicate we need to make more changes like ^
15:56 <fungi> and/or reduce unnecessary log collection where it's reasonable to do so
15:56 <clarkb> ++
16:04 <clarkb> also if anyone happens to know how to convince ansible to treat high task runtime costs as a bug that should be fixed I will owe you a $beverage
16:12 <JayF> clarkb: Red Hat runs ansible. Red Hat has a large number of developers invested in OpenStack. Maybe we can use that as leverage?
16:13 <clarkb> JayF: maybe, it's definitely something we've struggled to convince them of in the past. But part of the issue with those tasks above is that ansible has roughly a 1-2 second per-task minimum cost. When you run a loop in ansible, the number of loop entries multiplied by that base cost sets a floor on the best possible runtime
16:13 <fungi> the ansible core devs have frequently viewed zuul's use case as rather fringe, which hasn't helped when we bring up problems we experience
16:13 <clarkb> in this case, if you have a ton of log files to process and loop over them, it can easily be just 5 minutes of ansible task startup cost when the actual work is a few milliseconds
16:14 <fungi> also ansible-lint really does push people to write inefficient loops over ansible tasks rather than have a single shell task with a loop inside it
16:15 <clarkb> yes, the ansible way to do things is slow
16:17 <fungi> anyway, point being solving this within ansible upstream would involve a change in approach and mindset, not just some simple patching
16:19 <fungi> they've made explicit design choices to trade efficiency for clarity and consistency
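
The per-task overhead clarkb describes is why collapsing an ansible loop into a single shell task pays off so dramatically. An illustrative before/after, assuming hypothetical found_logs and stage_dir variables; this is not the actual stage-output role code:

    # slow: ansible pays its ~1-2s per-task startup cost once per file
    - name: Give each log file a .txt suffix
      command: mv {{ item }} {{ item }}.txt
      loop: "{{ found_logs.files | map(attribute='path') | list }}"

    # fast: one task, the iteration happens inside the shell
    - name: Give each log file a .txt suffix
      shell: |
        find {{ stage_dir }} -name '*.log' \
          -exec sh -c 'for f; do mv "$f" "$f.txt"; done' _ {} +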
16:26 <clarkb> I'm looking at the apache tasks for that job more closely and I think it is taking a minute to run each task
16:27 <clarkb> it's possible that we are deep into swap or something, which is just compounding the problem? It is weird to me that we aren't running those tasks in parallel across the different nodes too
16:27 <clarkb> we run ansible with -f 5 in zuul jobs iirc, which should allow a two-node job to run tasks like that in parallel
16:29 <clarkb> it doesn't seem to be quite so bad earlier in the job, so ya maybe there is some side effect (like swapping) that the job is inducing that causes it to get worse
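
The -f 5 mentioned above is ansible's forks setting, which caps how many hosts execute a given task concurrently; with two nodes and five forks, per-node tasks should overlap rather than serialize. The equivalent in configuration form, for illustration:

    # ansible.cfg equivalent of running `ansible-playbook -f 5 ...`
    [defaults]
    forks = 5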
16:34 *** jpena is now known as jpena|off
19:55 *** dviroel is now known as dviroel|biab
20:51 *** dviroel|biab is now known as dviroel
21:33 *** dviroel is now known as dviroel|afk
22:12 *** rlandy is now known as rlandy|bbl
23:52 *** dasm is now known as dasm|off
23:55 *** rlandy|bbl is now known as rlandy
23:59 *** rlandy is now known as rlandy|out
