Wednesday, 2024-07-24

<opendevreview> melanie witt proposed openstack/tempest master: Add tests for creating servers with (anti-)affinity  https://review.opendev.org/c/openstack/tempest/+/924691 [02:35]
<opendevreview> melanie witt proposed openstack/tempest master: Add tests for creating servers with (anti-)affinity  https://review.opendev.org/c/openstack/tempest/+/924691 [04:15]
<opendevreview> yatin proposed openstack/devstack stable/2023.1: [stable only] Unset the default value of MYSQL_GATHER_PERFORMANCE  https://review.opendev.org/c/openstack/devstack/+/924740 [05:04]
<frickler> weird, I've been seeing a similar failure in kolla-swift jobs, but didn't investigate yet. the swift commands should be part of the basic osc, no plugin needed, so my assumption is that the catalog would be missing the swift endpoint [05:29]
<opendevreview> yatin proposed openstack/devstack master: [DNM] check abandoned patch  https://review.opendev.org/c/openstack/devstack/+/924822 [06:56]
<opendevreview> Martin Kopec proposed openstack/stackviz master: Run the node job with the current node version  https://review.opendev.org/c/openstack/stackviz/+/923751 [07:12]
<opendevreview> Martin Kopec proposed openstack/stackviz master: Run the node job with the current node version  https://review.opendev.org/c/openstack/stackviz/+/923751 [08:58]
<opendevreview> Amit Uniyal proposed openstack/whitebox-tempest-plugin master: verify tls in guest VM  https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/923824 [10:16]
<opendevreview> Amit Uniyal proposed openstack/whitebox-tempest-plugin master: verify tls in guest VM  https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/923824 [10:27]
<opendevreview> Amit Uniyal proposed openstack/whitebox-tempest-plugin master: verify vencrypt feature  https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/923824 [11:17]
<ykarel> dansmith, kopecmartin can you +W again https://review.opendev.org/c/openstack/devstack/+/924740 , depends-on merged [12:44]
<ykarel> thx in advance [12:44]
<dansmith> ykarel: done [13:32]
<clarkb> gmann: kopecmartin: during the recent mass security fixes stuff frickler pointed out that jobs are never stable enough to land 20 changes at once. The way openstack's gate queue is configured we will start jobs for a minimum of 20 changes then that number can grow if things are stable and merging. A problem with that is gate resets with at least 20 changes in the queue mean a [14:58]
<clarkb> lot of wasted resources and slowness getting resources where they are most valuable. frickler's idea was to reduce the minimum queue size from 20 changes to say 10 changes to reduce that impact. The suggestion from corvus is to instead make use of early failure detection on zuul jobs so that zuul can more quickly reallocate resources rather than waiting for long tempest jobs [14:58]
<clarkb> (or other long jobs) to complete before rearranging the queue [14:58]
<clarkb> I think the greatest impact to start is likely going to be with devstack/tempest jobs simply because they run for quite a long time and have a reasonably strong inheritance tree that should allow us to make a more central update to the zuul config and impact a number of jobs [14:59]
<clarkb> Is there any interest from the qa team to implement that on the tempest and maybe devstack jobs? The way it works is you define a regex on the job that scans the job output as it is written to detect strings that indicate failures. The zuul project has been dogfooding the functionality and I can point at jobs there as examples [15:00]
<clarkb> the potential risk is that if our rules are not strict enough we could identify successful jobs as failures inappropriately, so having those somewhat familiar with the jobs involved in the process is important [15:00]
<opendevreview> Merged openstack/devstack stable/2023.1: [stable only] Unset the default value of MYSQL_GATHER_PERFORMANCE  https://review.opendev.org/c/openstack/devstack/+/924740 [15:10]
<opendevreview> Riccardo Pittau proposed openstack/devstack master: Pin openstack cli command in swift tempurls function  https://review.opendev.org/c/openstack/devstack/+/924867 [16:03]
<opendevreview> James Parker proposed openstack/tempest master: Add configurable hostname pattern to filter hosts  https://review.opendev.org/c/openstack/tempest/+/924785 [19:59]
<clarkb> https://opendev.org/zuul/zuul/src/branch/master/.zuul.yaml#L139-L142 is the way zuul is using it for unittest early failure detection. The docs for that are at https://zuul-ci.org/docs/zuul/latest/config/job.html#attr-job.failure-output [20:14]
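The regex-on-job-output idea described above can be illustrated outside of Zuul with a small Python sketch. This is not Zuul's implementation; the patterns and the sample test names are hypothetical, and real rules would need to be tuned per job as discussed in the channel.

```python
import re
from typing import Iterable, Optional, Sequence

# Hypothetical patterns, for illustration only. Real rules must be
# specific to the output format of the job they are attached to.
FAILURE_PATTERNS: Sequence[re.Pattern] = (
    re.compile(r"\.\.\. FAILED$"),
    re.compile(r"^FAILED \("),
)


def first_failure_line(
    lines: Iterable[str],
    patterns: Sequence[re.Pattern] = FAILURE_PATTERNS,
) -> Optional[int]:
    """Return the index of the first line matching any failure pattern.

    Lines are consumed one at a time, so a match is reported as soon as
    it appears in the stream rather than after the whole output is read.
    """
    for index, line in enumerate(lines):
        if any(pattern.search(line) for pattern in patterns):
            return index
    return None


# Hypothetical job output; the test names are made up.
sample_output = [
    "tempest.api.compute.test_a ... ok",
    "tempest.api.compute.test_b ... FAILED",
    "tempest.api.compute.test_c ... ok",
]

if __name__ == "__main__":
    print(first_failure_line(sample_output))
```

The scan stops at the first match, which mirrors the point of early detection: the failure is known long before the remaining output has been produced.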
<gmann> clarkb: I am not sure I understood it completely. '..will start jobs for a minimum of 20 changes ' you mean if there are 19 changes in gate then it would not start the jobs to run on any of the changes? [21:55]
<gmann> and when 20 changes are in gate then failing one job will cause reset to all of the changes? [21:56]
<clarkb> gmann: no, if there are 40 changes in the gate only 20 of them will run jobs [22:33]
<clarkb> gmann: the suggestion is to reduce that number to 10 [22:33]
<clarkb> the underlying motivation is that if you have fewer jobs running then a gate reset is less painful. Another approach to addressing that is to reset more quickly so that you aren't consuming the resources unnecessarily long when you know you will discard them [22:34]
<clarkb> then as you merge things that window size increases. As things fail and reset the window size decreases. It is very similar to TCP slow start. The idea is to moderate resource consumption so that large queues and consistent failures are less painful [22:40]
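The window behaviour described above (only the first N changes run jobs, N grows on merges and shrinks on resets, with a floor) can be sketched as follows. The increment of 1 and the halving on failure are assumptions chosen for illustration, not values taken from this discussion or from the OpenStack gate configuration; only the floor of 20 and the proposed floor of 10 come from the log.

```python
def changes_running(queue_length: int, window: int) -> int:
    """Number of changes at the head of the queue that get jobs started."""
    return min(queue_length, window)


def next_window(
    window: int,
    merged: bool,
    floor: int = 20,          # minimum window mentioned in the log
    increase: int = 1,        # assumed growth per successful merge
    decrease_factor: int = 2, # assumed shrink factor per gate reset
) -> int:
    """Grow the window on a merge, shrink it on a reset, never below floor."""
    if merged:
        return window + increase
    return max(floor, window // decrease_factor)


if __name__ == "__main__":
    # 40 changes queued with a window of 20: only 20 run jobs.
    print(changes_running(40, 20))
    # A reset at window 50 shrinks it; a reset at 30 is clamped to the floor.
    print(next_window(50, merged=False), next_window(30, merged=False))
    # With the proposed floor of 10 the same reset can shrink further.
    print(next_window(30, merged=False, floor=10))
```

Lowering the floor only changes where the shrinking stops, which is why it limits the cost of a reset without changing how quickly a reset is noticed.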
<gmann> clarkb: ohk, thanks I got it now. with early failure my concern is 1. we would not be able to know all the failing tests 2. would not know the result of the test that change is impacting and we might need to re-run it if the change leads to a valid test failure [22:49]
<clarkb> gmann: to clarify early failure detection does not stop the job from running. It merely allows zuul to act on the resulting restart quicker [22:49]
<clarkb> the jobs will still run to completion and report normally [22:49]
<corvus> (ie, zuul will know the job is going to fail before it actually fails) [22:50]
<clarkb> but that means all the jobs running behind it that will be restarted either at the early detection point or at job completion as they do today can be reset onto the new speculative git head and started again [22:51]
<gmann> clarkb: ohk, so no impact from job running perspective, it is just zuul detecting and acting in advance. [22:51]
<clarkb> that means all the resources for those jobs behind are released and reused more quickly [22:51]
<clarkb> gmann: yes pretty much. The one gotcha to that is if the early detection rule is bad then we could possibly inadvertently detect "failures" on successful jobs. This is mostly why I want to get the qa team involved in this to ensure we've got rules that are valid for tempest and won't do that [22:52]
<gmann> you mean other jobs of that change will be reset and get stopped? [22:52]
<clarkb> no, it's still only jobs for changes behind in the queue that are stopped and then moved to a new head just as they would be during normal failure [22:52]
<clarkb> the config option works by matching job output strings and, if you get a match, treats the job as a failure early. If however the job succeeds, now you're in a weird state [22:53]
<clarkb> this is why we don't just have generic rules for this; it really needs to be job specific to match properly [22:53]
<corvus> but if it takes 10 minutes to hit the first test failure, and 1 hour to run the whole job, then it means that zuul can reset the gate queue after 10 minutes instead of waiting an hour [22:53]
<clarkb> yup and that is where the resource savings come from. We're reducing the amount of thrown away test time [22:53]
<corvus> (so if you have a deep gate queue and a bunch of changes failing, that's 10m+10m+10m... instead of 1h+1h+1h, so things move a lot faster) [22:54]
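The 10m+10m+10m versus 1h+1h+1h comparison above is simple arithmetic and can be written out as a small sketch. It assumes the failing changes reset the queue one after another and that each reset happens as soon as the failure is known; the 10-minute and 60-minute figures are the examples from the log, and the count of three failing changes is just the number of terms in that example.

```python
def minutes_until_queue_clears(failing_changes: int, minutes_to_detect: int) -> int:
    """Total wait when each failing change resets the queue in sequence.

    Simplified model: every failing change costs the queue the time it
    takes to learn about the failure, and the resets do not overlap.
    """
    return failing_changes * minutes_to_detect


if __name__ == "__main__":
    early = minutes_until_queue_clears(3, 10)  # failure seen at first failing test
    late = minutes_until_queue_clears(3, 60)   # failure seen at job completion
    print(early, late, late - early)
```

Under this model three consecutive failing changes cost 30 minutes with early detection and 180 minutes without it.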
<gmann> clarkb: corvus: I see. [22:55]
<gmann> I think we can try it in tempest jobs but let me think more on it. I need to go pick up my son from school now. I will ping you guys tomorrow if I have more queries. [22:56]
<clarkb> ok [22:56]
