Tuesday, 2022-11-29

*** clarkb is now known as Guest29801:19
*** Guest298 is now known as clarkb01:20
*** atmark is now known as Guest30502:10
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Adds a repoducer for post live migration fail  https://review.opendev.org/c/openstack/nova/+/86380602:38
opendevreviewAmit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration  https://review.opendev.org/c/openstack/nova/+/86405502:38
gmannbauzas: sean-k-mooney[m] gibi : this is for skipping failing tests https://review.opendev.org/c/openstack/nova/+/86592202:43
opendevreviewAmit Uniyal proposed openstack/nova master: Refactoring live_migrate function name  https://review.opendev.org/c/openstack/nova/+/86595408:38
bauzasgibi: sean-k-mooney: your help is appreciated https://review.opendev.org/c/openstack/nova/+/86592208:41
opendevreviewArnaud Morin proposed openstack/nova master: Unbind port when offloading a shelved instance  https://review.opendev.org/c/openstack/nova/+/85368209:30
gibibauzas: on it09:34
gibigmann: thank you09:34
bauzasgibi: thanks09:45
bauzasand I updated the ML thread to notify our gerrit users09:45
gibicool09:45
opendevreviewAmit Uniyal proposed openstack/nova master: Adds check if resized to swap zero  https://review.opendev.org/c/openstack/nova/+/85733909:45
opendevreviewMerged openstack/nova master: Temporary skip some volume detach test in nova-lvm job  https://review.opendev.org/c/openstack/nova/+/86592211:08
opendevreviewMerged openstack/nova stable/wallaby: Adds a repoducer for post live migration fail  https://review.opendev.org/c/openstack/nova/+/86390011:08
opendevreviewManuel Bentele proposed openstack/nova-specs master: Add configuration options to set SPICE compression settings  https://review.opendev.org/c/openstack/nova-specs/+/84948811:13
opendevreviewAmit Uniyal proposed openstack/nova stable/train: add regression test case for bug 1978983  https://review.opendev.org/c/openstack/nova/+/86416811:13
opendevreviewAmit Uniyal proposed openstack/nova stable/train: For evacuation, ignore if task_state is not None  https://review.opendev.org/c/openstack/nova/+/86416911:13
opendevreviewAmit Uniyal proposed openstack/nova stable/train: add regression test case for bug 1978983  https://review.opendev.org/c/openstack/nova/+/86416811:40
opendevreviewAmit Uniyal proposed openstack/nova stable/train: For evacuation, ignore if task_state is not None  https://review.opendev.org/c/openstack/nova/+/86416911:40
opendevreviewMerged openstack/nova stable/wallaby: [compute] always set instance.host in post_livemigration  https://review.opendev.org/c/openstack/nova/+/86390112:01
*** dasm|off is now known as dasm13:05
sahidmorning dansmith, how do you feel regarding spec https://review.opendev.org/c/openstack/nova-specs/+/857838 I also have proposed the implementation13:45
sahidif you have a moment i would be glad to make progress on it, feel free to let me know if there is any point that needs details, thanks a lot13:46
slaweqgibi bauzas and other nova cores: hi, please check https://review.opendev.org/c/openstack/nova/+/855664 when You will have few minutes, thx in advance13:47
pengo_Hello, I am using the wallaby release of openstack and having issues with cinder volume attachments: once I try to delete, resize or unshelve the shelved vms, the volume_attachment entries do not get deleted in the cinder db and therefore the above mentioned operations fail every time. I have to delete these volume_attachment entries manually to make it work. I could only gather logs from nova-compute 13:48
pengo_https://www.irccloud.com/pastebin/kIDKAF54/13:48
pengo_If I want to unshelve the instance it won't work as it has a duplicate entry in the cinder db for the attachment. So I have to delete it manually from the db or via the cli. This is the only choice I have if I want to unshelve the vm. But deleting duplicate volume attachment entries every time for every vm is not a good approach for a production env. Is there any way to fix this issue?13:49
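For reference, a sketch of the same cleanup done through the cinder attachments API (microversion 3.27+) rather than direct DB edits; the credentials and IDs below are placeholders, and this only automates the workaround, it does not fix the underlying bug.

```python
# Not a fix for the underlying bug -- just a safer form of the manual
# workaround described above: remove stale attachments through the cinder
# attachments API instead of editing the database.
from keystoneauth1.identity import v3
from keystoneauth1 import session as ks_session
from cinderclient import client as cinder_client

auth = v3.Password(auth_url='http://keystone:5000/v3',        # placeholder
                   username='admin', password='secret',        # placeholder
                   project_name='admin',
                   user_domain_id='default', project_domain_id='default')
cinder = cinder_client.Client('3.27', session=ks_session.Session(auth=auth))

volume_id = '11111111-1111-1111-1111-111111111111'      # placeholder
stale_server = '22222222-2222-2222-2222-222222222222'   # placeholder

# Delete only the attachment records that point at the stale server.
for att in cinder.attachments.list(search_opts={'volume_id': volume_id}):
    if att.instance == stale_server:
        cinder.attachments.delete(att.id)
```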
opendevreviewAmit Uniyal proposed openstack/nova master: Adds regression functional test for 1980720  https://review.opendev.org/c/openstack/nova/+/86135714:11
opendevreviewAmit Uniyal proposed openstack/nova master: Adds check for VM snapshot fail while quiesce  https://review.opendev.org/c/openstack/nova/+/85217114:11
bauzasslaweq: hah, I remember the context, we discussed this at the PTG right?14:16
pengo_lyes 14:22
slaweq_bauzas: yes, we talked about it at the PTG14:35
slaweq_and we agreed that we can go with this approach14:35
sean-k-mooneyis this related to the mtu advertisement ?14:36
sean-k-mooneyor something else14:36
sean-k-mooneyah yes https://review.opendev.org/c/openstack/nova/+/85566414:37
slaweq_sean-k-mooney: yes14:37
sean-k-mooneyya so we agreed that if dhcp is available on the subnet we can omit the mtu from the metadata14:37
sean-k-mooneythis will allow the mtu to be reduced but not increased, and the vms will clamp the mtu the next time they renew their dhcp lease14:38
sean-k-mooneywhile still not perfect it will help14:38
sean-k-mooneyi can take a look now before i forget about it again14:38
sean-k-mooneyslaweq_: quick question14:41
sean-k-mooneyslaweq_: is the mtu on the network or on the subnet in neutron? its generally an aspect of the network in a real deployment, just wondering how its modeled in neutron14:42
sean-k-mooneyneutron does not support having a different mtu per network segment when using routed networks, correct?14:42
sean-k-mooneyi have asked that in the review https://review.opendev.org/c/openstack/nova/+/855664/3/nova/virt/netutils.py#b26614:47
slaweq_sean-k-mooney: mtu is per network for sure14:48
slaweq_I'm not sure about segments in routed networks14:48
sean-k-mooney"The net-mtu extension allows plug-ins to expose the MTU that is guaranteed to pass through the data path of the segments in the network."14:49
sean-k-mooneythat makes it sound like the network mtu should be the maximum mtu that would be supported on all segments14:49
sean-k-mooneyok we are good14:50
sean-k-mooneyhttps://docs.openstack.org/api-ref/network/v2/index.html?expanded=show-segment-details-detail#show-segment-details14:50
sean-k-mooneythe segment does not have an mtu field14:50
slaweq_sean-k-mooney++14:50
bauzaswas trampled into internal problems, lemme look at the MTU patch14:51
sean-k-mooneynor does the subnet, so ya definitely per network, ill upgrade to +2 so14:51
slaweq_sean-k-mooney: thx a lot14:53
bauzasslaweq: sent to the gate now the gate is back :)15:02
gibislaweq: I have a question https://review.opendev.org/c/openstack/nova/+/855664/3/nova/virt/netutils.py#27415:02
bauzasgibi: I had the same concern, but eventually I said yes because it's an operator question15:06
bauzasgibi: here, that means that we won't provide the MTU in the metadata service if the subnet from the instance is using a dhcp server15:07
gibi"subnet from the instance is using a dhcp server" <- but if the network has two subnets one with dhcp and one without dhcp then we need to check the actual subnet the port uses, not every subnet in the network 15:08
gibior do I miss something15:08
gibi?15:08
gibiI'm OK to not set MTU if the subnet the port uses has DHCP. But the patch does not implement that15:09
gibithat patch does not set MTU if _any_ of the subnets of the network has DHCP15:09
gibinot just the one the port uses15:09
fricklerI need to double-check but I think for v6 the MTU is signaled by RAs, not dhcp?15:10
sean-k-mooneygibi: good catch gibi15:11
gibi(I assume that as the dhcp_server is defined per subnet it can be differently configured per subnet of the same network)15:11
sean-k-mooneyfrickler: for ipv6 mtu is discovered via the neighbour discovery protocol15:11
sean-k-mooneyand it should be automatically negotiated regardless of using RA or DHCP615:12
opendevreviewSylvain Bauza proposed openstack/nova master: Don't provide MTU value in metadata service if DHCP is enabled  https://review.opendev.org/c/openstack/nova/+/85566415:12
sean-k-mooneyRA and DHCPv6 could provide an initial value.15:12
bauzasslaweq: gibi: sean-k-mooney: in order to stop the check pipeline, I created a new revision ^15:12
sean-k-mooneygibi: yes the dhcp option is per subnet not per network15:13
sean-k-mooneybauzas: ack15:13
sean-k-mooneywell you could have just removed the +w15:13
sean-k-mooneygibi: actully15:14
sean-k-mooneygibi: is the set of subnets that it is looping over just the subnets the port is attached to via its fixed ips, or all the ones on the network15:15
sean-k-mooneyi would assume we have not prefiltered them15:15
gibiI don't think we prefilter them15:16
gibibut I haven't checked explicitly15:16
gibiI assumed it is all the subnets of the network as it is under [networ][subnets]15:16
sean-k-mooneyso we need to do an intersection between the subnets of the fixed ips and the subnets of the network and then check only those15:16
sean-k-mooneygibi: ya that is what i would assume too15:16
bauzassean-k-mooney: no, you can't just remove the +W if it was running to the gate15:18
bauzaseven in check pipeline15:19
bauzasgibi: we prefilter only if we opt in for routed networks15:20
sean-k-mooneybauzas: if it was in check you can, if it was in gate then no15:20
bauzassean-k-mooney: anyway, this is done but I'm pretty sure of the other way15:21
bauzasmeh15:21
sean-k-mooneyso this is not related to routed networks15:21
bauzasso, here, we have a list of subnets that's given from a network15:21
sean-k-mooneyyou can have as many subnets on a network as you like to add more ips to the network15:21
sean-k-mooneyits pretty common to just add additional subnets to a network if you run out of ips for ports on that network15:22
bauzasif the instance has an IP address for a subnet that doesn't have a DHCP server running, but other subnets do have DHCP server, then we won't set the MTU15:22
bauzasif we merge this one15:22
bauzassean-k-mooney: I know, I'm clarifying the situation15:22
sean-k-mooneyyes which would be incorrect15:22
bauzasso yeah, agreed, we need to check the fixed IP address subnet, that's it15:23
bauzaswe even don't need an intersection15:23
sean-k-mooneyoh15:23
sean-k-mooneyya good point15:23
sean-k-mooneywe dont need the intersection15:23
bauzasand I think my prefilter already does this somewhere15:23
sean-k-mooneyjust loop over the subnets of the fixed ips15:23
bauzasnot saying we should run the prefilter15:24
bauzasbut the check is the same15:24
bauzasI mean the pattern15:24
sean-k-mooneyright just that there might be existing code that can be copied or shared15:24
* bauzas needs to go getting the kid15:24
sean-k-mooneythis https://github.com/openstack/nova/blob/master/nova/scheduler/request_filter.py#L317-L33115:25
sean-k-mooneyyou are calling neutron there since the info is not in the network info cache at this point15:26
sean-k-mooneyso that wont actually help15:26
sean-k-mooneybut the vif object can give you the fixed ips 15:27
sean-k-mooneyhttps://github.com/openstack/nova/blob/master/nova/network/model.py#L445-L45015:27
sean-k-mooneylooking at that code it might be already filtering 15:28
sean-k-mooneyso the subnets are constructed here 15:30
sean-k-mooneyhttps://github.com/openstack/nova/blob/master/nova/network/neutron.py#L3330-L333215:30
sean-k-mooneywhich internally uses https://github.com/openstack/nova/blob/master/nova/network/neutron.py#L3568-L363715:31
sean-k-mooney_get_subnets_from_port15:31
sean-k-mooneyso we only store the subnets for the current port in the info cache15:32
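A rough sketch of the check being converged on here, using plain dicts instead of nova's network model objects (the helper name is illustrative, not the actual nova/virt/netutils.py code): advertise the MTU unless every subnet backing the port's fixed IPs has DHCP enabled.

```python
# Illustrative only: the info cache already holds just the subnets for the
# current port (see _get_subnets_from_port above), so the check can loop over
# those directly instead of every subnet on the network.

def should_advertise_mtu(port_subnets):
    """port_subnets: the subnets backing this port's fixed IPs."""
    if not port_subnets:
        return True
    # If any of the port's subnets has no DHCP server, the guest cannot learn
    # the MTU from a lease, so keep providing it via the metadata service.
    return any(not subnet.get('enable_dhcp') for subnet in port_subnets)


# Example: the network also has a DHCP subnet, but this port only has a fixed
# IP on the DHCP-less one, so the MTU must still be advertised.
print(should_advertise_mtu([{'cidr': '10.0.1.0/24', 'enable_dhcp': False}]))  # True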
bauzasreminder : nova meeting in 5 mins15:54
bauzas-ish15:54
bauzas#startmeeting nova16:00
opendevmeetMeeting started Tue Nov 29 16:00:07 2022 UTC and is due to finish in 60 minutes.  The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot.16:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.16:00
opendevmeetThe meeting name has been set to 'nova'16:00
bauzashey folks16:00
auniyalO/16:00
gibio/16:00
Ugglao/16:00
elodilleso/16:01
bauzaslet me grab a coffee and we start16:01
bauzasok let's start and welcome16:02
bauzas#topic Bugs (stuck/critical) 16:03
dansmitho/16:03
bauzas#info No Critical bug16:03
sean-k-mooneyo/16:03
bauzas#link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New 16 new untriaged bugs (+5 since the last meeting)16:03
bauzas#info Add yourself in the team bug roster if you want to help https://etherpad.opendev.org/p/nova-bug-triage-roster16:03
bauzasI know this was a busy week16:03
bauzasany bug to discuss ?16:03
bauzas(apart from the gate ones)16:03
bauzaslooks not16:04
bauzaselodilles: can you use the baton for the next bugs ?16:04
elodillesyepp16:04
bauzascool thanks !16:04
bauzas#info bug baton is being passed to elodilles16:04
bauzas#topic Gate status 16:04
bauzas#link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs 16:04
bauzasit was a busy week16:04
bauzas#info ML thread about the gate blocking issues we had https://lists.openstack.org/pipermail/openstack-discuss/2022-November/031357.html16:05
bauzaskudos to the team for the hard work16:05
bauzasit looks now the gate is back16:05
bauzasunfortunately, we had to skip some tests :(16:05
bauzasbut actually maybe they were not necessary :)16:06
bauzasanyway16:06
bauzas#link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&pipeline=periodic-weekly Nova&Placement periodic jobs status16:06
bauzas#info Please look at the gate failures and file a bug report with the gate-failure tag.16:06
bauzas#info STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-test-failures16:06
gibinah no test is necessary :) only the code needs to work :)16:06
opendevreviewArnaud Morin proposed openstack/nova master: Unbind port when offloading a shelved instance  https://review.opendev.org/c/openstack/nova/+/85368216:07
bauzasanything to discuss about the gate ?16:07
bauzas#topic Release Planning 16:07
bauzas#link https://releases.openstack.org/antelope/schedule.html16:07
bauzas#info Antelope-2 is in 5 weeks16:07
bauzasas a reminder, remember that the last December week(s) you could be off :)16:08
bauzasso even if we have 5 weeks until A-2, maybe less for you :)16:09
sean-k-mooneyya i dont know if we want to have another review day before then16:09
bauzasshould we do another spec review day before end of December, btw ?16:09
gibiI vote for someting after 13th of Dec :)16:09
bauzassean-k-mooney: we agreed to have an implementation review day around the end of Dec, like Dec 2016:09
sean-k-mooneywell spec or implementation or both16:09
sean-k-mooneythat might be a bit late16:10
bauzasfor implementations, not really16:10
gibimore specifically between 14th and 19th16:10
bauzasas we only have a deadline for A-316:10
sean-k-mooneybecause of vacation16:10
bauzasfor specs, yes16:10
bauzasah16:10
bauzasthen, we could do a spec review day on Dec 13th16:11
gibi14 please16:11
gibithere is an internal demo on 13th I will be busy with :)16:11
bauzasand when should we be doing a implementation review day ?16:11
sean-k-mooneyya im off from the 19th so 14th-16th for feature review16:11
sean-k-mooneywould be ok16:11
bauzasgibi: haha, I don't see what you're saying :p16:11
gibiyeah I support what sean-k-mooney proposes 14-1616:12
bauzasgibi: as a reminder, last week, I was discussing here during the meeting while adding some slides for some internal meeting :p16:12
bauzasyou surely could do the same :D16:12
gibimaybe a spec review day on 14th and an impl review day on 15th :)16:12
bauzasif people accept to have two upstream days16:13
sean-k-mooneythat would work for me16:13
gibibauzas: I'm nowhere near your ability to multitask16:13
bauzasduring the same week16:13
sean-k-mooneythat leaves the 16th to wrap up stuff before pto16:13
gibiyepp16:13
gibi(I will be here on 19th too but off from 20th)16:13
bauzasgibi: or then I'd prefer to have an implementation day once we're back16:13
bauzasnot all of us work upstream everyday :)16:14
sean-k-mooneywe could but the idea was to have 2 of them16:14
sean-k-mooneyto take the pressure off m316:14
sean-k-mooneyso one before m2 and one before m316:14
gibiI'm back on the 5th of Jan16:14
sean-k-mooneyso we could probably do one on the 10th of january16:14
sean-k-mooneymost will be around by then16:15
bauzassean-k-mooney: I don't disagree, I'm just advocating that some folks might not be able to have two review days in the same week16:15
sean-k-mooneyif we want to keep it aligned to the meeting days16:15
gibi10th works for me16:15
bauzasgibi: we don't really need to align those review days to our meeting16:16
bauzasgibi: but this is nice as a reminder16:16
gibiso I think we are converging on Dec 14th as spec and Jan 10th as a code review day16:17
bauzasI think this works for me16:17
bauzasand we can have another implementation review day later after Jan 10th16:17
sean-k-mooneyya sure that sounds workable16:18
bauzasas a reminder, Antelope-3 (FF) is Feb 16th16:18
bauzasmore than 5 weeks after Jan 10th16:18
sean-k-mooneythere are still a few bits i would hope we can merge by the end of the year however. namely i would like to see us make progress on the pci in placement series16:19
bauzassure16:19
sean-k-mooneyok so i think we can move on for now16:19
bauzaswhat we can do is to tell we can review some changes by Dec 15th if we want so16:19
bauzasthat shouldn't be a specific review day, but people would know that *some* folks can review their changes by this day16:20
bauzasanyway, I think we found a way16:20
gibiyepp16:21
bauzas#agreed Dec-14th will be a spec review day and Jan-10th will be an implementation review day, mark your calendars16:21
bauzas#action bauzas to send an email about it16:21
bauzas#agreed Some nova-cores can review some features changes around Dec 15th, you now know about it16:22
gibi:)16:22
bauzasOK, that's it16:22
bauzasmoving on16:22
bauzas(sorry, that was a long discussion)16:22
bauzas#topic Review priorities 16:22
bauzas#link https://review.opendev.org/q/status:open+(project:openstack/nova+OR+project:openstack/placement+OR+project:openstack/os-traits+OR+project:openstack/os-resource-classes+OR+project:openstack/os-vif+OR+project:openstack/python-novaclient+OR+project:openstack/osc-placement)+(label:Review-Priority%252B1+OR+label:Review-Priority%252B2)16:23
bauzas#info As a reminder, cores eager to review changes can +1 to indicate their interest, +2 for committing to the review16:23
bauzasI'm happy to see people using it16:23
bauzasthat's it for that topic16:23
bauzasnext one16:24
bauzas#topic Stable Branches 16:24
bauzaselodilles: your turn16:24
elodillesack16:24
elodillesthis will be short16:24
elodilles#info stable branches seem to be unblocked / OK16:24
elodilles#info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci16:24
elodillesthat's it16:24
gibinice16:25
bauzaswas quick and awesome16:26
bauzaslast topic but not the least in theory,16:26
bauzas#topic Open discussion 16:26
bauzasnothing in the wikipage16:26
bauzasso16:26
bauzasanything to discuss here by now ?16:27
gibi-16:27
sean-k-mooneydid you merge skipping the failing nova-lvm tests yet16:27
sean-k-mooneyor is the master gate still exploding on that16:27
bauzasI think yesterday we said we could discuss during this meeting about the test skips16:27
bauzasbut given we merged gmann's patch, the ship has sailed16:27
sean-k-mooneyack16:27
sean-k-mooneyso they are disabled currently 16:28
bauzassean-k-mooney: see my ML thread above ^16:28
sean-k-mooneythe failing detach tests16:28
sean-k-mooneyah ok will check after meeting16:28
bauzassean-k-mooney: you'll get the link to the gerrit change16:28
sean-k-mooneynothing else form me16:28
auniyalhand-raise: zuul frequent timeout issues/fails - this seems to be a resource issue, is it possible zuul resources can be increased?16:28
bauzassean-k-mooney: tl;dr: yes we skipped the related tests but maybe they are actually not needed as you said16:29
sean-k-mooneyauniyal: not really, timeouts are not that common in our jobs16:29
bauzasauniyal: see what I said above, we had problems with the gate very recently16:29
sean-k-mooneyauniyal: do you have an example16:29
auniyalin the morning when there are fewer jobs running, if we run the same job it passes16:29
auniyallike fewer than 2016:29
auniyalright now 60 jobs are running16:29
sean-k-mooneythat should not really be a thing16:29
sean-k-mooneyunless we have issues with our ci providers16:30
bauzasauniyal: if you speak about job results telling timeouts, agreed with sean-k-mooney, you should tell which ones so we could investigate16:30
sean-k-mooneywe occasionally have issues with slow providers but its not normally correlated with the number of running jobs16:30
bauzasyup16:30
auniyalack16:30
bauzastimeouts are generally an infra issue16:30
bauzasfrom a ci provider16:30
bauzasbut "generally"16:30
bauzaswhich means sometimes we may have a larger problem16:31
sean-k-mooneyauniyal: do you have a gerrit link to a change where it happened16:31
dansmithare they fips jobs?16:31
clarkbbauzas: I'm not sure I agree with that statement16:31
sean-k-mooneyoh ya it could be that, did we add the extra 30 mins to the job yet16:31
clarkbwe have significant amounts of very inefficient test payload16:31
clarkbyes slow providers make that worse, but we have lots of ability to improve things in the jobs just about every time I look16:31
sean-k-mooneyclarkb: we dont often see timeouts in the jobs that run on the nova gate16:32
sean-k-mooneywe tend to be well within the job timeout interval16:32
sean-k-mooneythat is not necessarily the same for other projects16:32
clarkb(it is common for tempets jobs to dig into swap which slows everything down, devstack uses osc which is super slow because it gets a new token for every request and has python spin up time, ansible loops are costly with large numbers of entries and so on)16:32
auniyalsean, I am trying to find a link but it's taking time16:32
clarkbsean-k-mooney: yes swap is a common cause for the difference in behaviors and that isn't an infra issue16:33
clarkbsean-k-mooney: and devstack runtime could be ~halved if we stopped using osc16:33
clarkbor improved osc's startup and token acquisition time16:33
sean-k-mooneyclarkb: ack16:33
clarkbI just want to avoid the idea its an infra issue so ignore it16:33
sean-k-mooneyya the osc thing is a long running known issue16:33
clarkbthis assertion gets made often then I go looking and there is plenty of job payload that is just slow16:33
sean-k-mooneythe parallel improvements dansmith did helped indirectly16:33
bauzasclarkb: sorry, I was unclear16:33
auniyalalthough, I have experienced this a lot: if my zuul jobs are not passing at night time (IST) even after a recheck, I rerun them in the morning and then they pass16:33
bauzasclarkb: I wasn't advocating about someone else's fault16:34
bauzasclarkb: I was just explaining to some new nova contributor that given the current situation, we only have timeouts with nova jobs due to some ci provider issue16:34
clarkbbauzas: right I disagree with that16:35
bauzasclarkb: but I agree with you on some jobs that are wasting resources16:35
clarkbjobs timeout due to an accumulation of slow steps16:35
clarkbsome of those may be due to a slow provider or slow instance16:35
clarkbbut, it is extremely rare that this is the only problem16:35
sean-k-mooneyclarkb: we tend to be seeing an average runtime at about 75% or less of the job timeout in my experience16:35
clarkband I know nova tempest jobs have a large number of other slowness problems16:35
clarkbsean-k-mooney: yes, but if a job digs deeply into swap its all downhill from there16:35
bauzasclarkb: that's a fair point16:35
sean-k-mooneywe have 2 hour timeouts on our tempest jobs and we rarely go above about 90 mins16:35
clarkbsuddenly your 75% typical runtime can balloon to 200%16:36
bauzasexcept the fips one16:36
sean-k-mooneyclarkb: sure but i dont think we are 16:36
sean-k-mooneybut its something we can look at16:36
sean-k-mooneyauniyal: the best thing you can do is provide us an example and we can look into it16:36
clarkb++ to looking at it16:36
sean-k-mooneyand then see if there is a trend16:36
auniyalack16:37
bauzasI actually wonder how we can track the trend16:38
sean-k-mooneyhttps://zuul.openstack.org/builds?project=openstack%2Fnova&result=TIMED_OUT&skip=016:38
sean-k-mooneythat but its currently loading16:38
sean-k-mooneywe have a couple every few days16:39
bauzassure, but you don't have the time a SUCCESS job runs16:39
bauzaswhich is what we should track 16:39
clarkbyou can show both success and timeouts in a listing16:39
clarkb(and failures, etc)16:39
sean-k-mooneywell we can change the result to filter both16:39
bauzasthe duration field, shit, missed it16:39
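One possible way to track the trend bauzas wonders about, assuming the standard Zuul REST endpoint behind the web UI (/api/builds) and its usual result/duration fields; paths, job names and limits here are assumptions to adjust for the real deployment.

```python
# Sketch: pull recent builds from the Zuul API and compare successful run
# durations against the number of timeouts, to spot a trend over time.
import requests

resp = requests.get('https://zuul.openstack.org/api/builds',
                    params={'project': 'openstack/nova',
                            'job_name': 'nova-next',   # assumed job name
                            'limit': 200},
                    timeout=30)
builds = resp.json()

ok = [b['duration'] for b in builds
      if b.get('result') == 'SUCCESS' and b.get('duration')]
timeouts = [b for b in builds if b.get('result') == 'TIMED_OUT']

if ok:
    print(f"avg SUCCESS duration: {sum(ok) / len(ok):.0f}s over {len(ok)} builds")
print(f"TIMED_OUT builds in sample: {len(timeouts)}")
```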
sean-k-mooneywe also have not fixed the fips job16:40
sean-k-mooneyill create a patch for that i think16:40
bauzassean-k-mooney: I said I should do it16:40
sean-k-mooneybauzas: ok please do16:40
bauzassean-k-mooney: that's simple to do and it's been like 4 weeks since I promised it16:40
bauzassean-k-mooney: you know what ? I'll end this meeting by now so everyone can do what they want, including me writing a zuul patch :)16:41
sean-k-mooney:)16:41
bauzashaving said it,16:41
bauzasthanks folks16:41
bauzas#endmeeting16:41
opendevmeetMeeting ended Tue Nov 29 16:41:43 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)16:41
opendevmeetMinutes:        https://meetings.opendev.org/meetings/nova/2022/nova.2022-11-29-16.00.html16:41
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/nova/2022/nova.2022-11-29-16.00.txt16:41
opendevmeetLog:            https://meetings.opendev.org/meetings/nova/2022/nova.2022-11-29-16.00.log.html16:41
gibio/16:42
chateaulavo/16:42
elodilleso/16:42
* bauzas says again how much he enjoys the zuul v3 web interface and how easily you can get a specific job's details16:44
bauzaslike, https://zuul.openstack.org/job/tempest-centos9-stream-fips16:44
clarkbsean-k-mooney: one thing I've found is fairly consistent for our systemic job timeouts is that we've got a number of steps to each job: base setup (configuring mirrors, configuring ssh keys, configuring git repos), test env setup (tox/devstack/whatever), actual testing, and log collection. Each of these tends to have unnecessary slow bits that add up over the course of a job.16:48
clarkbWhen you then run into something like a slower node or swapping or slowness fetching an external resource it is very easy to tip over the timeout16:48
clarkbsean-k-mooney: imo it would be helpful for us to try and whittle away at that accumulated slowness tech debt to make us less susceptible when we run into an overall slower situation.16:49
clarkbI've worked on that a bit in the base jobs and log collection side of things as that has broad impact. But the downsides here are that it has broad impact so I have to be extremely careful to maintain backward compatibility16:49
clarkbbut the same approaches can be taken to improve things like devstack (did you know it installs tempest 3 times!)16:49
clarkbI think improving memory consumption would also help avoid slowness caused by swapping. privsep is a fairly outsized offender here16:50
bauzasclarkb: I'm curious about privsep being memory greedy16:53
bauzasand I wonder why16:53
clarkbI suspect because it grows buffers to handle all the input and output sent through it16:53
bauzasour internal customers haven't reported such problem, but I guess because of lack of evidence rather than not having a problem16:54
clarkbone way to maybe improve things is to stop running a different privsep for each service which creates a bunch of large buffers. We might be able to get away with one large buffer16:54
clarkbor buffer things with intentionally smaller buffers16:54
bauzasagreed16:54
bauzasa stream is costly16:54
sean-k-mooneyclarkb: yep although we have enough of a buffer in the nova project that we get a timeout failure only once or twice a week16:57
clarkbsean-k-mooney: yes, but you've also set your timeout to two hours16:57
clarkb(one hour was the goal once upon a time)16:57
sean-k-mooneyack 16:57
sean-k-mooneysharing privsep is a security issue16:58
sean-k-mooneyso i dont think we can ever do that16:58
sean-k-mooneynova will have more privsep daemons eventually16:58
clarkbdoes privsep prevent random processes from connecting to it? If not this is equivalent. If so it could also apply restrictions on what a specific process can do (granted this is maybe a larger attack surface than refusing to talk at all)16:59
sean-k-mooneyyou are meant to use file permissions to limit access at the file system level16:59
sean-k-mooneybut no16:59
sean-k-mooneynot as part of privsep itself17:00
sean-k-mooneyclarkb: you need one privsep daemon per privsep context currently17:00
sean-k-mooneyfor it to properly provide the correct permission enforcement/escalation17:00
clarkbsean-k-mooney: in that case my suggestion would be to investigate optimizing privsep memory usage17:00
sean-k-mooneywe might be able to reduce it in ci17:01
sean-k-mooneyby limiting it to one process17:01
clarkbor just bound your buffers17:01
clarkband read in chunks17:01
sean-k-mooneymaybe i havent really looked at the channel implementation closely17:01
clarkbI suspect this is a case of python making it easy to read arbitrary sized buffers into memory without much fuss17:02
clarkbit might also be inefficient compilation of the ruleset17:02
clarkb(regexes aren't free either)17:02
sean-k-mooneythe implementation is here https://github.com/openstack/oslo.privsep/blob/83870bd2655f3250bb5d5aed7c9865ba0b5e4770/oslo_privsep/comm.py17:02
sean-k-mooney   self.writesock.sendall(buf)17:03
sean-k-mooneyso its taking the serialised payload and sending it17:03
sean-k-mooneyusing the msgpack format for serialisation17:04
sean-k-mooneyits using 4k buffers https://github.com/openstack/oslo.privsep/blob/83870bd2655f3250bb5d5aed7c9865ba0b5e4770/oslo_privsep/comm.py#L8117:05
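A sketch of the "bound your buffers and read in chunks" idea clarkb raises, not the actual oslo.privsep code: feed fixed-size chunks into msgpack's streaming Unpacker and handle each message as soon as it is complete, instead of accumulating one arbitrarily large payload first.

```python
# Illustrative only; oslo.privsep's real channel code lives in comm.py above.
import msgpack
import socket


def iter_messages(sock: socket.socket, chunk_size: int = 4096):
    """Yield msgpack-decoded messages from a socket, reading in small chunks."""
    unpacker = msgpack.Unpacker(raw=False)
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:
            return
        unpacker.feed(chunk)
        for msg in unpacker:  # yields every fully-decoded message so far
            yield msg
```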
sean-k-mooneyclarkb: honestly i have looked at privsep a couple of times but dont have enough context of the code to have a feel for how much memory it should be using and if its bounded or not17:08
dansmithalso not sure what we might be calling via privsep that would return large buffers17:09
sean-k-mooneyclarkb: but i suspect that if we are using it for any file operations then it might need to process guest images 17:09
dansmithit's mostly for doing things and maybe pulling the qemu-img info blob17:09
bauzassean-k-mooney: https://review.opendev.org/c/openstack/tempest/+/86604917:09
sean-k-mooneydansmith: the console log or writing images to disk would be the two that come to mind17:09
dansmithI don't think we write images to disk via privsep. console log.. maybe? I thought we can get that via libvirt17:10
clarkbya I'm not sure. Its just in aggregate privsep uses more memory than most other openstack services17:10
sean-k-mooneydansmith: no we read the file that is written to disk im pretty sure17:10
clarkbI think cinder? and neutron use more then privsep is next. Its been a while since I looked at hte memory profiling though17:10
bauzasI don't know if we could somehow pdb the running privsep process thru a backdoor, because if we could, like we do with nova services, then we could monitor the growing memory17:11
sean-k-mooneydansmith: i would hope for the images that we write them to somewhere we own, then move them and change the permissions if needed 17:11
bauzasI personally use tracemalloc to persist the memory state and compare between snapshots17:11
sean-k-mooneylike in most cases i would expect nova to put it in the image cache then use qemu-img to create a qcow with the image as the backing file17:11
bauzasbut this requires access to the process17:11
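A minimal tracemalloc sketch of the snapshot-comparison approach bauzas describes; it assumes you can execute code inside the target process (for example via a debug hook), which is exactly the access limitation mentioned above.

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# ... let the suspect code path run for a while ...

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, 'lineno')[:10]:
    print(stat)  # top 10 allocation deltas, grouped by source line
```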
sean-k-mooneyso privsep should only be needed for invoking qemu-img and not the actual image download17:12
sean-k-mooneybut not sure about the same codepath for raw images17:12
sean-k-mooneybauzas: you could do that locally but i dont think privsep uses eventlet17:13
bauzassean-k-mooney: afaik it doesn't17:13
dansmithsean-k-mooney: we can't possibly be streaming multiple gigs of image data over the privsep socket17:13
sean-k-mooneydansmith: ya that would not make sense to me either17:13
sean-k-mooneyi know the console log is streamed over it as we had a security issue with that in the past17:14
sean-k-mooneybut that is small17:14
dansmithwell, that might be large, but not tens of gigs17:15
dansmithand still not sure why we're doing that really, but AFAIK nova has to eat the whole thing and return it over RPC anyway, right?17:15
sean-k-mooneyin production yes, in ci the vms dont run that long so i would be surprised if we ever got close to a mb17:15
dansmithoh sure17:16
sean-k-mooneyhttps://github.com/openstack/nova/blob/master/nova/config.py#L78-L8017:17
sean-k-mooneyso in production we drop privsep back to info17:17
clarkbLooking at https://6546e150b851cf607d58-90bcb133e16a34d804e19428ec130682.ssl.cf5.rackcdn.com/865479/1/gate/tempest-full-py3/9076120/controller/logs/screen-memory_tracker.txt there are three different nova-cpu privsep processes and together use about 150MB of rss. Similar story for neutron ovn metadata. Buts its ~250MB rss17:17
sean-k-mooneyim not sure if we override that in ci17:17
sean-k-mooneyclarkb: two of the privsep processes i can account for17:18
sean-k-mooneynova has one privsep context for everything and os-vif has one for the linux bridge and ovs plugin but i only expect 1 of those two to be created in any given ci job17:19
clarkbsean-k-mooney: why do we need multiple process per service though?17:19
sean-k-mooneywe need one per security context17:19
sean-k-mooneyand we should have multiple security contexts per service17:19
clarkbsean-k-mooney: but if your security control method is file permissions (and maybe selinux rules) then you can only limit by pid user or pid selinux context17:19
clarkbyou aren't any more secure if you have three different processes that nova cpu can talk to17:20
sean-k-mooneyprivsep drops permissions to only those in the context when it spawns17:20
sean-k-mooneyclarkb: if you have multiple contexts you can limit the permissions of each function17:20
clarkbsean-k-mooney: but if the same entity can tell each of those three to do the work then functionally its the same context17:21
sean-k-mooneythat is the permission model that privsep provides and the big delta from rootwrap17:21
clarkbyou're limited by what you can limit access to on the filesystem17:21
clarkband if I have permission to talk to one I have permission to talk to all since a process shares those attributes17:21
clarkband if I somehow break that security layer to promote myself to nova-cpu user or selinux context I can talk to all of them17:22
sean-k-mooneyyou have to decorate the functions with the context they can use17:22
clarkbright but does that give the system any additional security? since it sounds like we are using filesystem controls I think not17:22
clarkbthe level of scrutiny is process attribute (user, selinux context etc)17:22
sean-k-mooneywe have had this discussion multiple times17:22
sean-k-mooneyprivsep is about limiting the permissions of individual function calls17:23
sean-k-mooneyit is not intended to address all security concerns17:23
sean-k-mooneyin many respects rootwrap was more secure17:24
sean-k-mooneybut it was unmaintained and slow17:24
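For illustration, the per-context model being described, using oslo.privsep's public API with made-up package and function names (this is not nova code): each PrivContext is a separate daemon with its own capability set, and individual functions opt in via the decorator.

```python
from oslo_privsep import capabilities, priv_context

net_admin_pctxt = priv_context.PrivContext(
    'mypkg',                                      # hypothetical package
    cfg_section='mypkg_privsep_net_admin',        # hypothetical config group
    pypath=__name__ + '.net_admin_pctxt',
    capabilities=[capabilities.CAP_NET_ADMIN],
)


@net_admin_pctxt.entrypoint
def set_link_mtu(device, mtu):
    # Runs inside the privileged daemon for this context only; other
    # contexts (and their daemons) never gain CAP_NET_ADMIN.
    ...
```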
clarkbok, the result is that ~5% of system memory on a tempest job run is consumed by privsep for nova cpu and neutron ovn metadata17:24
sean-k-mooneythe nova_ovn_metadata one is kind of interesting, it will need cap_net_admin for setting the iptables rules to nat the traffic17:25
sean-k-mooneybut it should not need much memory for that17:25
bauzasI remember some internal customer bug about the metadata service being greedy17:25
clarkbnova cpu uses about 200MB of memory compared to 150 for privsep for nova cpu17:25
clarkbas a comparison17:25
clarkbit essentially doubles the memory footprint of nova cpu17:25
bauzaswe tried to look into it, but given the issue was not reproducible after a reboot, we were unable to conclude 17:26
sean-k-mooneyyep it has to load many of the same nova dependencies17:26
sean-k-mooneyit is running part of the nova codebase after all17:26
sean-k-mooneybauzas: the customer issue was because they had 0 swap17:27
clarkbsure, but that is why I remember it being outsized. Seems like the current memory tracking continues to show this17:27
sean-k-mooneyso the python interpreter was using more memory than when swap was available17:27
clarkbmysqld then journald are the top consumers. No surprise there given how mysql uses memory and the quantity of logs we are dumping into the journal17:28
sean-k-mooneyi wonder if there is any way to have privsep free memory17:28
sean-k-mooneyi dont know if it suffers from the same fragmentation thing that cinder? glance? had17:28
bauzassean-k-mooney: no, unrelated case17:29
clarkblooks like cinder-backup is also running in the base job. I thought there was no testing of it in the base jobs and we were removing it17:29
sean-k-mooneyoh sorry metadata not nova-compute17:29
sean-k-mooneybauzas: ya metadata memory usage is variable17:30
sean-k-mooneywe need to build the full metadata response in memory for every request17:30
sean-k-mooneywhich is why we store it in memcache17:30
bauzasanyway, I need to quit, my wife feels ill and I need to visit the drugstore17:31
sean-k-mooneyim not sure exactly how we handle user-data and if that is included in the cached value or if we stream that separately17:31
clarkbin that example job we have about 100MB free memory when the log file is collected17:31
sean-k-mooneythe user-data file can be up to 64mb17:32
clarkband we are using about 600MB of swap if I read that correctly17:32
clarkb(out of about 1GB of swap total)17:32
clarkbother than mysqld there isn't any single process getting close to double digit memory consumption. This is death by a thousand cuts (which makes sense given microservices etc)17:41
dansmithI'm pretty sure "death by a thousand cuts" is the actual marketing tagline of the microservice approach :)17:42
clarkboh wait memavailable is distinct from memfree. Its closer to 750MB of memory available. That isn't too bad. I wonder why we're so deep into swap then17:44
sean-k-mooneyfree is actually unallocated17:49
sean-k-mooneyavailable includes caches and maybe buffers17:50
dansmithsean-k-mooney: bauzas if you're still around, have you seen anything from rajat about changing nova to use a service account to look at image locations?19:25
opendevreviewMerged openstack/nova-specs master: spec: allowing target state for evacuate  https://review.opendev.org/c/openstack/nova-specs/+/85783819:52
sean-k-mooneydansmith: i have not seen that but i dont think nova would need to do anything would it20:44
sean-k-mooneythe account we have in the glance section just needs to have the service role20:45
dansmithsean-k-mooney: we have no such account20:45
sean-k-mooneyoh ok because we currently dont use admin for this20:45
dansmithright, everything we do with glance is with the user's token20:46
dansmiththis is a pretty fundamental change from that20:46
sean-k-mooneyya ok im not sure if the service_user stuff would help there although thats really just for working around token expiration20:47
dansmithsean-k-mooney: it would require it20:48
sean-k-mooneythis wouldn't really be that large a change in nova, either a specless blueprint or short spec i guess20:48
dansmithsean-k-mooney: maybe just read the spec :)20:48
sean-k-mooneyoh we have a spec already then sure20:48
dansmithsean-k-mooney: no, the spec on the glance side I copied you on20:49
sean-k-mooneythe current service user config we have is not for that however20:49
dansmithno spec on the nova side,20:49
sean-k-mooneyah righit i saw the email havent looked at it yet20:49
sean-k-mooneyhttps://review.opendev.org/c/openstack/glance-specs/+/86320920:49
sean-k-mooneythe new locations api spec20:49
dansmithbut since everything in nova assumes that the user's token is used for glance interaction, I just want to be super careful that we don't accidentally use the service account for anything related to the image other than the ceph location thing20:49
sean-k-mooneyya so the same way we have for getting an admin client for neutron, we likely need to add a get_service_client function or similar and use it for that call explicitly20:50
dansmithright20:51
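A sketch of what such a helper might look like on the nova side, using keystoneauth1 config loading and a hypothetical [glance_service_user] config group; no such option or helper exists in nova today, and everything else would keep using the end user's token.

```python
from keystoneauth1 import loading as ks_loading
from glanceclient import client as glance_client


def get_service_glance_client(conf):
    # Hypothetical config group; only the image-location lookup would go
    # through this service-credential client.
    auth = ks_loading.load_auth_from_conf_options(conf, 'glance_service_user')
    sess = ks_loading.load_session_from_conf_options(
        conf, 'glance_service_user', auth=auth)
    return glance_client.Client('2', session=sess)
```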
dansmithwe discussed this earlier related to the dual internal/external glance thing was brought up20:52
dansmithand I thought you had a good reason for why there's a gotcha there, but I don't remember what it was20:52
sean-k-mooneythe internal endpoint is also used to provide unmetered access to the api for tenant workloads20:53
sean-k-mooneyin public clouds20:53
sean-k-mooneyso really it would be nice if keystone added a service endpoint20:53
sean-k-mooneyfor service to service communications20:53
dansmithyeah, that's unrelated to this20:54
sean-k-mooneyalthough if this required the service user20:54
dansmithI mean, this is to avoid needing that20:54
sean-k-mooneyright if we have the service role not user20:54
sean-k-mooneythen we can just filter the field based on the role20:54
sean-k-mooneylike we do with server show20:54
sean-k-mooneyso only show the image location if the token has the service role20:55
sean-k-mooneythat would be the nicer way to do this20:55
dansmithyou should read the spec20:55
sean-k-mooneyyou said the policy/rbac stuff in glance has only recently been made capable of supporting something like that right20:55
sean-k-mooneysure ill add it to my list for tomorrow20:56
sean-k-mooneyi was just back briefly to check on something20:56
sean-k-mooneyskimming it, without the nova changes, if this was unconditionally added to glance 20:57
sean-k-mooneyit would silently disable the fast clone support20:58
sean-k-mooneyand even then it would break the grenade upgrade rules if we did not do the upgrade carefully20:58
sean-k-mooneyi.e. you should not need to change the config when you upgrade20:59
dansmithit's a new api, so they'll have to support the old one for a while, and I commented on that in the spec that it needs to hang around for a good while20:59
sean-k-mooneyack21:00
sean-k-mooneyso they are not just doing the filtering on the old one21:00
sean-k-mooneyya ok there is no point in me speculating on this until i have had time to read the spec fully thanks for highlighting it21:00
dansmithI just want to make sure that nova people are aware of when glance people say "we'll just change nova to do X" with no planning on this side, and potentially nobody signing up to do it or review it21:03
*** dasm is now known as dasm|off21:39
opendevreviewMerged openstack/nova master: Handle mdev devices in libvirt 7.7+  https://review.opendev.org/c/openstack/nova/+/83897621:52
opendevreviewMerged openstack/nova master: Don't provide MTU value in metadata service if DHCP is enabled  https://review.opendev.org/c/openstack/nova/+/85566422:02
opendevreviewMerged openstack/nova master: extend_volume of libvirt/volume/fc should not use device_path  https://review.opendev.org/c/openstack/nova/+/85812922:02
opendevreviewmelanie witt proposed openstack/nova stable/xena: Retry attachment delete API call for 504 Gateway Timeout  https://review.opendev.org/c/openstack/nova/+/86608322:54
opendevreviewmelanie witt proposed openstack/nova stable/wallaby: Fix the wrong exception used to retry detach API calls  https://review.opendev.org/c/openstack/nova/+/86608422:57
opendevreviewmelanie witt proposed openstack/nova stable/wallaby: Retry attachment delete API call for 504 Gateway Timeout  https://review.opendev.org/c/openstack/nova/+/86608522:57
opendevreviewmelanie witt proposed openstack/nova stable/victoria: Fix the wrong exception used to retry detach API calls  https://review.opendev.org/c/openstack/nova/+/86608623:09
opendevreviewmelanie witt proposed openstack/nova stable/victoria: Retry attachment delete API call for 504 Gateway Timeout  https://review.opendev.org/c/openstack/nova/+/86608723:09
opendevreviewmelanie witt proposed openstack/nova stable/ussuri: Fix the wrong exception used to retry detach API calls  https://review.opendev.org/c/openstack/nova/+/86608823:16
opendevreviewmelanie witt proposed openstack/nova stable/ussuri: Retry attachment delete API call for 504 Gateway Timeout  https://review.opendev.org/c/openstack/nova/+/86608923:16
opendevreviewmelanie witt proposed openstack/nova stable/train: Fix the wrong exception used to retry detach API calls  https://review.opendev.org/c/openstack/nova/+/86609023:19
opendevreviewmelanie witt proposed openstack/nova stable/train: Retry attachment delete API call for 504 Gateway Timeout  https://review.opendev.org/c/openstack/nova/+/86609123:19
clarkbsean-k-mooney: bauzas: not urgent, but the other piece of info that is probably worth remembering for slow jobs is that we are our own noisy neighbor in some of these clouds. This means our own inefficiencies add up across jobs too not just within them.23:44
