Friday, 2023-03-03

opendevreviewMerged openstack/nova master: Add service version for Antelope  https://review.opendev.org/c/openstack/nova/+/874932 06:15
opendevreviewJorge San Emeterio proposed openstack/nova master: Have host look for CPU controller of cgroupsv2 location.  https://review.opendev.org/c/openstack/nova/+/873127 08:01
bauzaslooks like I shouldn't gamble with poker08:41
bauzasall my rechecks were failing, while the new ones this night (thanks dansmith) got eventually merged08:42
opendevreviewAmit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero  https://review.opendev.org/c/openstack/nova/+/857339 09:30
admin1hi all .. what exactly does this mean ( which came during  yoga -> zed upgrade) and now  nova is down -- https://gist.githubusercontent.com/a1git/6f25cfb53feb2cb3b6d122da5664b462/raw/5bb0563f46c05bcea47fbc2a04f607f1f045f4ab/gistfile1.txt 09:42
admin1Details: Current Nova version does not support computes older than  |", "|   Yoga but the minimum compute service level in your system         |", "|   is 60 and the oldest supported service level is 61"09:45
bauzasadmin1: are you sure *all* your computes are supporting at least Yoga ?10:08
admin1yes10:08
admin1those were upgraded 3 days prior and i had a canary deployed in all of them 10:08
bauzassomething detected a compute having a 60 service version10:08
admin1now yoga -> zed,  ( using openstack ansible) it just failed 10:08
bauzasadmin1: can you please create a bug report ?10:09
bauzasI'll try to look at it10:09
admin1bauzas, is it possible to somehow bypass or disable this check for a bit ? 10:09
bauzasindeed, sec10:09
bauzasadmin1: https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.disable_compute_service_check_for_ffu 10:11
bauzasbut you'll need to find which compute is still using the older service version10:12
bauzasmaybe some of them weren't restarted10:12
bauzas(after upgrading)10:12
admin1output of select host,version  from nova.services  => https://gist.githubusercontent.com/a1git/13ceb2e181dab9532a5b229b2915b478/raw/eee560226905d21159808f36a038c9201d8e2d23/gistfile1.txt 10:14
admin1i think some are affected10:14
bauzasadmin1: what's strange is that 60 is an interim service version10:18
bauzasdid you use a milestone or something for some computes ?10:19
bauzashttps://github.com/openstack/nova/blob/master/nova/objects/service.py#L260-L261 Xena is 57 and Yoga is 61 10:19
bauzashttps://github.com/openstack/nova/blob/master/nova/objects/service.py#L212-L215 and that's what was changed by the RPC version for the 60 service version10:20
bauzasunless you deploy with master, of course10:20
admin1bauzas, so the servers that are 60 were not online  .. 10:25
admin1they were off temporarily 10:25
admin1but this blocked the whole upgrade process 10:25
bauzasthe compute state isn't and shouldn't be checked for safety reasons10:26
admin1it did .. temporarily what i did was  update nova.services set version=61 where version=60  and trying to run the playbook again 10:26
admin1if it works, then i am all good .. else i have to report it here again 10:27
admin1if this works, then i can open a bug report saying unavailable compute node blocked upgrade 10:27
bauzashttps://docs.openstack.org/nova/latest/cli/nova-status.html#nova-status-checks helps to test your upgrade10:27
bauzasadmin1: again, that's by design that we don't allow non-upgraded compute to be left registered10:28
bauzasadmin1: and that's why we have the workaround option for that intent10:28
bauzashttps://github.com/openstack/nova/blob/master/nova/cmd/status.py#L251 10:29
bauzaswhich exactly tests the support contract *before* you upgrade https://github.com/openstack/nova/blob/59f7a524fd4ded3c17b10abcedb0baff769c3a8a/nova/utils.py#L1052 10:30
sean-k-mooneyadmin1: having the server be off is expected to block the upgrade process11:07
sean-k-mooneythat would not be a bug11:08
sean-k-mooneybecause if the service version is 60 it means they never started with the official yoga release (61)11:08
opendevreviewAmit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero  https://review.opendev.org/c/openstack/nova/+/857339 11:09
sean-k-mooneyif you had started them with yoga and then they were stopped it would not cause this issue11:09
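
[Editor's sketch] The behaviour discussed above can be illustrated with a short, hypothetical Python example. It is not nova's actual code: the function names, the exception class, and the `disable_check` flag are assumptions made for illustration (the flag stands in for the workaround option linked earlier). Only the numbers 60 and 61 come from the error message. The sketch deliberately ignores whether a compute is up or down, as bauzas describes, so a powered-off host that last registered with version 60 still blocks the upgrade.

```python
# Hypothetical sketch of a minimum compute service version check.
# All names here are illustrative assumptions, not nova's real API.

OLDEST_SUPPORTED_SERVICE_VERSION = 61  # Yoga, per the error message in the log


class TooOldComputeService(Exception):
    """Raised when a registered compute is below the supported version."""


def stale_computes(services, oldest_supported=OLDEST_SUPPORTED_SERVICE_VERSION):
    """Return the (host, version) rows that are below the supported version.

    `services` is an iterable of (host, version) tuples, for example the
    result of `select host, version from nova.services`. The up/down state
    of each host is intentionally not considered.
    """
    return [(host, version) for host, version in services
            if version < oldest_supported]


def check_compute_versions(services,
                           oldest_supported=OLDEST_SUPPORTED_SERVICE_VERSION,
                           disable_check=False):
    """Fail if any registered compute is too old, unless the check is disabled."""
    stale = stale_computes(services, oldest_supported)
    if stale and not disable_check:
        hosts = ", ".join(host for host, _ in stale)
        raise TooOldComputeService(
            f"minimum compute service level is {min(v for _, v in stale)} "
            f"but the oldest supported level is {oldest_supported}: {hosts}"
        )
    return stale
```

Running `stale_computes` against the rows from the services table is one way to list the hosts that need to be started on the new release (or removed) before retrying the upgrade.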
admin1i think they were off 2 days before when i did wallaby -> yoga 11:09
admin1but wallaby -> yoga did not complain about this .. was this check added in zed ? 11:09
sean-k-mooneythis was always a requirement and we decided to start enforcing it in yoga because of operators violating the upgrade contract11:10
sean-k-mooneyand filing bugs :)11:10
admin1:D 11:10
sean-k-mooneynova before 2023.1/2024.1 only allows n to n+1 upgrades11:11
sean-k-mooneywe put the workaround option in place as an escape hatch11:11
sean-k-mooneyso if you want to run nova in an unsupported state you can but it should never be required if you are upgrading within the upgrade contract11:12
admin1i understand now ..  will make sure no computes are down next time we upgrade 11:13
sean-k-mooneyprovided we do not do an rpc bump it's generally possible for > n->n+1 to function but the first time we tested that was yoga to antelope(2023.1) as a dry run for 2023.1->2024.1 11:15
sean-k-mooneywe will be officially testing that going forward in case you are not aware of this change https://governance.openstack.org/tc/resolutions/20220210-release-cadence-adjustment.html 11:16
opendevreviewAmit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero  https://review.opendev.org/c/openstack/nova/+/857339 11:19
dansmithbauzas: \o/14:24
dansmithbauzas: are you cooking up the rc1 patch?14:33
bauzasI was waiting for the revert to arrive14:34
dansmithah okay, I see it's close14:34
dansmithsorry I didn't recheck that because I thought we were punting14:34
bauzasshit no14:34
bauzashttps://zuul.openstack.org/status#873584 14:34
dansmithoh yep14:34
bauzaspost_failure14:34
bauzasok, so I'll skip it14:35
bauzasgibi: sean-k-mooney: for your sake of knowledge, I'm gonna branch RC1 without the logging revert14:35
bauzaswe'll backport the revert later after GA14:35
bauzashmmmm14:36
bauzasdansmith: actually, it looks like the release team is granting us a short grace period for branching RC1 14:37
bauzas(they're on meeting now)14:37
dansmithokay14:37
gibithe revert is in the gate queue now14:38
sean-k-mooneyok, we modified it to be safe in production, so even having it in RC1 is not terrible14:38
bauzasgibi: yup, but failing14:38
gibiso if we are lucky it might merge today14:38
gibiohh14:38
gibish*t14:38
bauzaswe were so close14:38
sean-k-mooneybut we can ask them to requeue it14:38
bauzasI'll claim for a RC1 patch on Monday14:38
sean-k-mooneyif there is a long delay14:38
bauzasand I'll recheck this revert by the next 4 mins14:38
dansmith18 other things in the gate right now14:39
gibibauzas: I can shepherd the patch during Saturday and a bit on Sunday as well. 14:39
dansmithso it'll be a bit if it re-runs, but it's also not a critical patch14:39
sean-k-mooneypost_failure from nova-next. unfortunate14:40
bauzasdansmith: I don't disagree14:40
bauzasbut it will be a bit of a pain to backport the revert if we go14:40
bauzasif the release team says they're OK with releasing on Monday, then meh, we gonna try this weekend14:40
bauzasgibi: last time you were way luckier than me14:41
sean-k-mooneywe technically didn't run out of memory but it got pretty close memory_tracker low_point: 730 14:42
sean-k-mooney* memory_tracker low_point: 7308 14:43
dansmithsean-k-mooney: has nova-next been OOMing?14:43
sean-k-mooneyMemAvailable:       9152 kB 14:43
sean-k-mooneyi think it's been surviving because of swap14:43
sean-k-mooneyMar 03 13:53:10.114492 np0033355853 memory_tracker.sh[131948]: SwapTotal:       4194300 kB14:44
sean-k-mooneyMar 03 13:53:10.114492 np0033355853 memory_tracker.sh[131948]: SwapFree:              0 kB14:44
sean-k-mooneyhttps://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b/log/controller/logs/screen-memory_tracker.txt#3011 14:44
sean-k-mooneydansmith: are you wondering if this is related to the mariadb tweaks ye did14:44
sean-k-mooneykeystone was giving 503s14:45
dansmithsean-k-mooney: those tweaks are disabled by default in devstack right now14:45
dansmithI'm just saying if we're memory constrained on that job, we might want to enable those tweaks14:45
sean-k-mooneyand when i see that it's often because of the db/service getting oom killed14:45
dansmithit seems to have done well for the ceph one14:46
sean-k-mooneyyep14:46
sean-k-mooneyi'm just looking to see if i can confirm that in the logs14:46
sean-k-mooneybut that is why i was checking the memory tracker. i think we are running very close to out of memory if we have not hit it14:46
dansmithon the ceph job my tweaks dropped mysql to half of what it was using (~800m to ~400m)14:47
sean-k-mooneyMar 03 13:53:09 np0033355853 kernel: sshd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 14:47
bauzashttps://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b 14:47
dansmithbut, less memory usage could impair performance and make other things worse of course14:47
sean-k-mooneyso yes we are14:47
sean-k-mooneyhttps://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b/log/controller/logs/syslog.txt#5871 14:47
bauzaswe had two problems14:47
bauzasan unresponsive API14:48
bauzasand some leaked allocs14:48
sean-k-mooneythe api issues are because mysql got killed14:48
sean-k-mooneyMar 03 13:53:09 np0033355853 kernel: Out of memory: Killed process 47910 (mysqld) total-vm:5223564kB, anon-rss:328112kB, file-rss:0kB, shmem-rss:0kB, UID:116 pgtables:2648kB oom_score_adj:0 14:48
dansmithyeah that's usually how it works14:49
dansmithmysql is killed and then we stop being able to talk to keystone (et al)14:49
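
[Editor's sketch] Confirming an OOM kill like the one above usually means searching syslog for the kernel's "Out of memory: Killed process" line. A minimal Python sketch of that search follows. The regular expression is written against the single line quoted in this log and may need adjusting for other kernel versions; the function name and the returned dict keys are assumptions for illustration.

```python
import re

# Pattern based on the kernel line quoted in the log above; other kernel
# versions may format this message differently.
OOM_KILL_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(lines):
    """Return one dict per OOM-kill line found in an iterable of log lines."""
    kills = []
    for line in lines:
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append({
                "pid": int(match.group("pid")),
                "name": match.group("name"),
                "total_vm_kb": int(match.group("total_vm_kb")),
                "anon_rss_kb": int(match.group("anon_rss_kb")),
            })
    return kills


SAMPLE = (
    "Mar 03 13:53:09 np0033355853 kernel: Out of memory: Killed process "
    "47910 (mysqld) total-vm:5223564kB, anon-rss:328112kB, file-rss:0kB, "
    "shmem-rss:0kB, UID:116 pgtables:2648kB oom_score_adj:0"
)
```

Calling `find_oom_kills([SAMPLE])` yields a single entry naming `mysqld`, which matches the diagnosis that the API failures followed the database being killed.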
sean-k-mooneyyep14:49
sean-k-mooneyso 1 we should enable that devstack feature for nova-next 2 we should consider doing it by default14:50
sean-k-mooneyassuming it does not regress over all job time too much14:50
dansmithwe were going to do it by default after everyone is branched, to see if it is reasonable across the board, but not to break anyone before release14:50
sean-k-mooney* default in jobs14:50
sean-k-mooneynot sure about defaulting in devstack14:50
sean-k-mooneyack14:50
dansmithwe have it enabled in some jobs, so we should do that for -next if it can't make it worse14:51
dansmithand then maybe we'll get the defaulting in a month or so14:51
dansmithI can post a patch for -next14:51
sean-k-mooneysounds good14:51
bauzascool14:51
sean-k-mooneyif we are swapping that hard it's going to really slow down the job too14:52
sean-k-mooneyis the memory reduction much overall14:52
dansmithlike I said above, about 800m to 400m rss for mysql14:53
bauzasthat's not big14:53
bauzasmysqld seems to be a canary14:53
sean-k-mooney400m is a lot when we only have 8192mb of ram in the vms14:54
opendevreviewDan Smith proposed openstack/nova master: Make nova-next reduce mysql memory  https://review.opendev.org/c/openstack/nova/+/876391 14:54
dansmithyeah, it's a lot :)14:54
sean-k-mooneyits like 5%14:54
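
[Editor's sketch] The "like 5%" figure can be checked with a line of arithmetic, using the numbers quoted in the discussion (roughly 400m saved on a VM with 8192mb of RAM). The helper name is a hypothetical choice for illustration.

```python
def share_of_ram_percent(saved_mb, total_mb):
    """Return saved memory as a percentage of total RAM."""
    return 100.0 * saved_mb / total_mb


# ~400 MB saved by the mysql tweaks on an 8192 MB CI node
saving = share_of_ram_percent(400, 8192)
print(f"{saving:.1f}%")  # prints 4.9%
```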
dansmithit's the single biggest user14:54
dansmithand it puts it down closer to many of the other users like rabbit14:55
sean-k-mooneyi just checked and we are already limiting nova to 2 workers as well so we likely won't get much from limiting that more14:55
dansmithyeah14:55
bauzastrue but neutron-api takes its own big piece of cake https://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b/log/controller/logs/screen-memory_tracker.txt#1993 14:56
sean-k-mooney sure but it looks like they are also limiting to two workers14:58
sean-k-mooneyand the reduction in mysql is around the same as both of them combined14:58
dansmithbauzas: if you can find any other 50% reductions and 400m of free ram, please do let me know :)14:58
sean-k-mooneyalso the neutron server acts both as the api and conductor for neutron14:58
sean-k-mooneyand scheduler14:58
sean-k-mooneyas in it implements everything the controller would do for neutron14:59
bauzasdansmith: fwiw, I +2d your patch, 14:59
bauzasso I'm not debating it :)14:59
dansmithI suspect other gains to be had will be much smaller and much harder to enact :)14:59
bauzastrue14:59
dansmithbauzas: I know, you're for it, you're just not impressed, I get it14:59
dansmithI'll just go cry in the corner14:59
bauzas:)14:59
sean-k-mooneyi sent it to the ci14:59
bauzasI wish I would have a magic wand14:59
sean-k-mooneywhat i have wanted to try for a while is enabling zswap15:00
bauzasBibbidi-Bobbidi-Boo !15:00
sean-k-mooneythat should speed up swap usage a bit and help a little with swap size too15:00
bauzas(shit, doesn't work)15:00
dansmithsean-k-mooney: yeah that might be a thing, but we're also legitimately timing out a lot of jobs, so I'm concerned about slowing anything down with a memory boost causing more of those15:01
sean-k-mooneyhttps://www.omgubuntu.co.uk/2022/01/ubuntu-on-raspberry-pi-4-2gb-zswap 15:01
dansmithif we're thrashing I think it will slow us down a lot, if we're stashing bloat we never reference, then it will help15:01
* bauzas is currently investigating https://bugs.launchpad.net/tempest/+bug/1999893 15:01
sean-k-mooneyonce we swap to 22.04 it will be better15:01
bauzasand I suspect this may be related15:01
dansmithsean-k-mooney: what will?15:01
sean-k-mooneydansmith: its much simpler to enable zswap in 22.0415:02
dansmithoh15:02
sean-k-mooneyand they did some performance optimisations15:02
sean-k-mooneyhttps://waldorf.waveform.org.uk/2021/6-months-with-the-pi-desktop.html 15:02
dansmithbut still, you're trading cpu for memory15:02
sean-k-mooneycpu vs disk io performance really15:03
dansmithand right now in addition to running against the memory barrier, we're also running against the cpu barrier15:03
sean-k-mooneycompression effectively makes it like swapping to faster storage15:03
sean-k-mooneyif you have the cpu to do it15:03
dansmithsure, if you're io constrained it will help there, but it's all trading for cpu15:03
sean-k-mooneyyep15:04
sean-k-mooneyi looked into this a while ago when those blogs first came out15:04
sean-k-mooneyit's probably worth doing an experiment with at least15:04
dansmithyeah, any shorter but fatter jobs will benefit from it for sure, and it could reduce our own noisy neighbor effects15:05
sean-k-mooneyhonestly just moving to 16G vms in ci would be the better solution15:05
sean-k-mooneybut since that is not really an option15:05
dansmithI'm just worried about slowing the tests down at all because we're already hitting widespread legit timeouts15:05
dansmithon slower nodes15:05
sean-k-mooneyya which is a concern15:05
sean-k-mooneywow gerrit found https://review.opendev.org/c/openstack/devstack/+/828639 because i mentioned zswap in my last comment on the top level 15:08
* sean-k-mooney was hoping i already coded that up15:08
* bauzas disappears for taxi reasons15:16
* bauzas is back15:46
darkhorseHi team - I read about service tokens in cinder documentation which is really cool. Can service tokens be used in nova and other services too?16:46
dansmithwow, the nova-next mysql patch passed check the first time through17:08
dansmithbauzas: sean-k-mooney ^17:08
dansmithmelwitt: is this something you saw in the gate? https://review.opendev.org/c/openstack/nova/+/875991 17:39
dansmithI don't see that error message in opensearch17:39
