Wednesday, 2024-07-24

opendevreview	melanie witt proposed openstack/tempest master: Add tests for creating servers with (anti-)affinity https://review.opendev.org/c/openstack/tempest/+/924691	02:35
opendevreview	melanie witt proposed openstack/tempest master: Add tests for creating servers with (anti-)affinity https://review.opendev.org/c/openstack/tempest/+/924691	04:15
opendevreview	yatin proposed openstack/devstack stable/2023.1: [stable only] Unset the default value of MYSQL_GATHER_PERFORMANCE https://review.opendev.org/c/openstack/devstack/+/924740	05:04
frickler	weird, I've been seeing a similar failure in kolla-swift jobs, but didn't investigate yet. the swift commands should be part of the basic osc, no plugin needed, so my assumption is that the catalog would be missing the swift endpoint	05:29
opendevreview	yatin proposed openstack/devstack master: [DNM] check abandoned patch https://review.opendev.org/c/openstack/devstack/+/924822	06:56
opendevreview	Martin Kopec proposed openstack/stackviz master: Run the node job with the current node version https://review.opendev.org/c/openstack/stackviz/+/923751	07:12
opendevreview	Martin Kopec proposed openstack/stackviz master: Run the node job with the current node version https://review.opendev.org/c/openstack/stackviz/+/923751	08:58
opendevreview	Amit Uniyal proposed openstack/whitebox-tempest-plugin master: verify tls in guest VM https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/923824	10:16
opendevreview	Amit Uniyal proposed openstack/whitebox-tempest-plugin master: verify tls in guest VM https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/923824	10:27
opendevreview	Amit Uniyal proposed openstack/whitebox-tempest-plugin master: verify vencrypt feature https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/923824	11:17
ykarel	dansmith, kopecmartin can you +W again https://review.opendev.org/c/openstack/devstack/+/924740 , depends-on merged	12:44
ykarel	thx in advance	12:44
dansmith	ykarel: done	13:32
clarkb	gmann: kopecmartin: during the recent mass security fixes stuff frickler pointed out that jobs are never stable enough to land 20 changes at once. The way openstack's gate queue is configured we will start jobs for a minimum of 20 changes then that number can grow if things are stable and merging. A problem with that is gate resets with at least 20 changes in the queue mean a	14:58
clarkb	lot of wasted resources and slowness getting resources where they are most valuable. frickler's idea was to reduce the minimum queue size from 20 changes to, say, 10 changes to reduce that impact. The suggestion from corvus is to instead make use of early failure detection on zuul jobs so that zuul can more quickly reallocate resources rather than waiting for long tempest jobs	14:58
clarkb	(or other long jobs) to complete before rearranging the queue	14:58
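A rough way to see why the minimum window size matters uses a simplified model, which is an assumption and not Zuul's actual accounting. Every change in the active window has been testing for the same amount of time, and a failure at the head of the queue throws away the work of every change behind it inside the window. The function name `wasted_change_minutes` is hypothetical.

```python
def wasted_change_minutes(window: int, failing_position: int,
                          elapsed_minutes: float) -> float:
    """Change-minutes of testing discarded by a single gate reset.

    Simplified, hypothetical model: every change in the active window has
    been testing for `elapsed_minutes`; when the change at the 1-based
    `failing_position` fails, every change behind it inside the window is
    restarted and its work so far is thrown away. Changes outside the
    window have not started yet, so they lose nothing.
    """
    if not 1 <= failing_position <= window:
        raise ValueError("failing_position must be inside the window")
    return (window - failing_position) * elapsed_minutes


if __name__ == "__main__":
    # Head-of-queue failure after an hour of tempest runtime.
    for window in (20, 10):
        print(window, wasted_change_minutes(window, 1, 60))
```

Halving the window roughly halves the discarded work per reset, which is frickler's argument; corvus's alternative instead shrinks `elapsed_minutes` by resetting as soon as a failure is known.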
clarkb	I think the greatest impact to start is likely going to be with devstack/tempest jobs simply because they run for quite a long time and have a reasonably strong inheritance tree that should allow us to make a more central update to the zuul config and impact a number of jobs	14:59
clarkb	Is there any interest from the qa team to implement that on the tempest and maybe devstack jobs? The way it works is you define a regex on the job that scans the job output as it is written to detect strings that indicate failures. The zuul project has been dogfooding the functionality and I can point at jobs there as examples	15:00
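The mechanism clarkb describes (scan the job output line by line as it is written, and flag the job as failed on the first regex match) can be sketched in Python. The pattern below is a hypothetical guess at a stestr/subunit-trace style per-test failure line, not a rule that Zuul or the QA team ships; `first_failure` and `FAILURE_PATTERNS` are illustrative names.

```python
from __future__ import annotations

import re
from typing import Iterable, Optional, Sequence

# Hypothetical rule: matches a line such as
# "{1} tempest.api.compute.test_x.TestX.test_y [2.0s] ... FAILED".
# Real rules must be checked against actual tempest job output.
FAILURE_PATTERNS: Sequence[re.Pattern[str]] = (
    re.compile(r"\{\d+\} \S+.* \.\.\. FAILED$"),
)


def first_failure(lines: Iterable[str],
                  patterns: Sequence[re.Pattern[str]] = FAILURE_PATTERNS
                  ) -> Optional[int]:
    """Return the index of the first line matching any pattern, else None.

    Lines are consumed one at a time, so a match is reported as soon as the
    offending line is produced, long before the stream (the job) ends.
    """
    for index, line in enumerate(lines):
        if any(p.search(line) for p in patterns):
            return index
    return None


if __name__ == "__main__":
    output = [
        "2024-07-24 15:00:01 | controller | {0} tempest.api.a.T.test_one [1.0s] ... ok",
        "2024-07-24 15:00:02 | controller | {1} tempest.api.a.T.test_two [2.0s] ... FAILED",
        "2024-07-24 15:00:03 | controller | {0} tempest.api.a.T.test_three [0.5s] ... ok",
    ]
    print(first_failure(output))  # 1
```

Using `search` rather than an anchored match lets the rule ignore whatever timestamp or host prefix the log viewer adds in front of each line.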
clarkb	the potential risk is that if our rules are not strict enough we could identify successful jobs as failures inappropriately, so having those somewhat familiar with the jobs involved in the process is important	15:00
opendevreview	Merged openstack/devstack stable/2023.1: [stable only] Unset the default value of MYSQL_GATHER_PERFORMANCE https://review.opendev.org/c/openstack/devstack/+/924740	15:10
opendevreview	Riccardo Pittau proposed openstack/devstack master: Pin openstack cli command in swift tempurls function https://review.opendev.org/c/openstack/devstack/+/924867	16:03
opendevreview	James Parker proposed openstack/tempest master: Add configurable hostname pattern to filter hosts https://review.opendev.org/c/openstack/tempest/+/924785	19:59
clarkb	https://opendev.org/zuul/zuul/src/branch/master/.zuul.yaml#L139-L142 is the way zuul is using it for unittest early failure detection. The docs for that are at https://zuul-ci.org/docs/zuul/latest/config/job.html#attr-job.failure-output	20:14
gmann	clarkb: I am not sure I understood it completely. '..will start jobs for a minimum of 20 changes ' you mean if there are 19 changes in gate then it would not start the jobs to run on any of the changes?	21:55
gmann	and when 20 changes are in gate then failing one job will cause reset to all of the changes?	21:56
clarkb	gmann: no if there are 40 changes in the gate only 20 of them will run jobs	22:33
clarkb	gmann: the suggestion is to reduce that number to 10	22:33
clarkb	the underlying motivation is that if you have fewer jobs running then a gate reset is less painful. Another approach to addressing that is to reset more quickly so that you aren't consuming the resources unnecessarily long when you know you will discard them	22:34
clarkb	then as you merge things that window size increases. As things fail and reset the window size decreases. It is very similar to tcp slowstart. The idea is to moderate resource consumption so that large queues and consistent failures are less painful	22:40
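The grow-on-merge, shrink-on-reset behaviour can be sketched as an additive-increase/multiplicative-decrease loop, the same general shape as TCP congestion control. The class and field names, and the choice of +1 per merge and halving per reset, are assumptions for illustration only; Zuul's real queue window settings should be checked in its documentation.

```python
from dataclasses import dataclass


@dataclass
class GateWindow:
    """Hypothetical sketch of an adaptive gate window, not Zuul's code.

    The window grows by `increase` for every change that merges and is
    divided by `decrease_factor` on every gate reset, but it never drops
    below `floor` (the minimum of 20 changes clarkb mentions).
    """
    size: int = 20
    floor: int = 20
    increase: int = 1
    decrease_factor: int = 2

    def on_merge(self) -> None:
        # Additive increase: each success lets one more change start testing.
        self.size += self.increase

    def on_reset(self) -> None:
        # Multiplicative decrease, clamped at the floor.
        self.size = max(self.floor, self.size // self.decrease_factor)


if __name__ == "__main__":
    window = GateWindow()
    for _ in range(10):
        window.on_merge()
    print(window.size)  # 30
    window.on_reset()
    print(window.size)  # 20, because 15 would be below the floor
```

Under this model frickler's proposal corresponds to `GateWindow(floor=10)`: a reset from 20 then drops the window to 10 instead of staying pinned at 20.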
gmann	clarkb: ohk, thanks, I got it now. with early failure my concerns are 1. we would not be able to know all the failing tests 2. we would not know the result of the tests that the change is impacting, and we might need to re-run it if the change led to a valid test failure	22:49
clarkb	gmann: to clarify, early failure detection does not stop the job from running. It merely allows zuul to act on the resulting restart sooner	22:49
clarkb	the jobs will still run to completion and report normally	22:49
corvus	(ie, zuul will know the job is going to fail before it actually fails)	22:50
clarkb	but that means all the jobs running behind it that will be restarted either at the early detection point or at job completion as they do today can be reset onto the new speculative git head and started again	22:51
gmann	clarkb: ohk, so no impact from the job-running perspective; it is just zuul detecting and acting in advance.	22:51
clarkb	that means all the resources for those jobs behind are released and reused more quickly	22:51
clarkb	gmann: yes pretty much. The one gotcha is that if the early detection rule is bad then we could inadvertently detect "failures" on successful jobs. This is mostly why I want to get the qa team involved, to ensure we've got rules that are valid for tempest and won't do that	22:52
gmann	you mean other jobs of that change will be reset and the gate will stop?	22:52
clarkb	no, it's still only jobs for changes behind in the queue that are stopped and then moved to a new head, just as they would be during a normal failure	22:52
clarkb	the config option works by matching job output strings and, if you get a match, treats the job as a failure early. If, however, the job then succeeds, you're in a weird state	22:53
clarkb	this is why we don't just have generic rules for this; it really needs to be job specific to match properly	22:53
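The false-positive risk is easy to demonstrate: a broad pattern can match harmless noise in a job that ultimately passes, while a pattern tied to one job's result format does not. Both patterns and the sample output lines below are made up for illustration.

```python
import re

# Hypothetical generic rule: anything mentioning "ERROR".
GENERIC = re.compile(r"ERROR")
# Hypothetical job-specific rule tied to a per-test result line format.
SPECIFIC = re.compile(r"\{\d+\} \S+.* \.\.\. FAILED$")

# Made-up output from a job that ultimately succeeds.
passing_job_output = [
    "{0} tempest.api.compute.test_a [1.2s] ... ok",
    "oslo.log: ERROR: retrying connection to message queue",  # noisy but harmless
    "{1} tempest.api.compute.test_b [0.8s] ... ok",
]


def flags_failure(pattern: re.Pattern, lines: list) -> bool:
    """True if any line would trigger early failure under `pattern`."""
    return any(pattern.search(line) for line in lines)


if __name__ == "__main__":
    print(flags_failure(GENERIC, passing_job_output))   # True: a false positive
    print(flags_failure(SPECIFIC, passing_job_output))  # False
```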
corvus	but if it takes 10 minutes to hit the first test failure, and 1 hour to run the whole job, then it means that zuul can reset the gate queue after 10 minutes instead of waiting an hour	22:53
clarkb	yup and that is where the resource savings come from. We're reducing the amount of thrown away test time	22:53
corvus	(so if you have a deep gate queue and a bunch of changes failing, that's 10m+10m+10m... instead of 1h+1h+1h, so things move a lot faster)	22:54
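corvus's numbers as a quick back-of-the-envelope calculation, assuming (as in his example) that failures are hit one after another and the queue waits on each before resetting. `minutes_lost_to_resets` is a hypothetical helper.

```python
def minutes_lost_to_resets(failures: int, minutes_until_known: float) -> float:
    """Time a queue spends waiting on a chain of sequential gate resets.

    Simplified model: each failing change blocks the reset of everything
    behind it until Zuul knows that it has failed.
    """
    return failures * minutes_until_known


if __name__ == "__main__":
    # Three failures in a row: reset at the first failed test vs at job end.
    early = minutes_lost_to_resets(3, 10)  # early failure detection
    late = minutes_lost_to_resets(3, 60)   # wait for the 1 hour job to finish
    print(early, late, late - early)  # 30 180 150
```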
gmann	clarkb: corvus: I see.	22:55
gmann	I think we can try it in tempest jobs but let me think more on it. I need to go pick up my son from school now. I will ping you guys tomorrow if I have more questions.	22:56
clarkb	ok	22:56
