Wednesday, 2023-04-26

opendevreviewMerged openstack/nova master: Reproduce bug 1995153  https://review.opendev.org/c/openstack/nova/+/86296700:01
gibiauniyal, bauzas, sean-k-mooney my recent findings about tests not waiting for sshable before attach/detach are in https://bugs.launchpad.net/nova/+bug/1998148 but I have no cycles to push tempest changes so feel free to jump on it07:55
bauzasgibi: ack, but I'm atm still digesting dansmith's and gouthamr's efforts on ceph/Jammy jobs07:56
gibisure. I just wanted to be explicit about the fact that I won't push a solution08:38
bauzasgibi: I'll try to help this afternoon then08:51
bauzasthanks for the findings again08:51
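(For reference, the pattern gibi's findings call for looks roughly like the sketch below. It is illustrative only: the class and test names are invented, and it assumes tempest's wait_until='SSHABLE' support, which blocks until an SSH login to the guest succeeds rather than returning as soon as the server goes ACTIVE.)

    # Illustrative tempest-style test: make sure the guest is reachable over
    # SSH before exercising volume attach/detach, instead of attaching as
    # soon as the server reports ACTIVE.
    from tempest.api.compute import base


    class VolumeAttachSshableTest(base.BaseV2ComputeTest):

        def test_attach_volume_after_sshable(self):
            validation_resources = self.get_class_validation_resources(
                self.os_primary)
            # wait_until='SSHABLE' only does real work when
            # [validation]/run_validation is enabled in tempest.conf.
            server = self.create_test_server(
                validatable=True,
                validation_resources=validation_resources,
                wait_until='SSHABLE')
            volume = self.create_volume()
            # By now the guest OS is fully up, so the attach is much less
            # likely to race with boot.
            self.attach_volume(server, volume)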
ykarelHi bauzas 11:31
ykarelreported https://blueprints.launchpad.net/nova/+spec/libvirt-tb-cache-size as discussed yesterday, can you please check11:31
bauzasykarel: ack, approving it then11:32
bauzasand done11:34
ykarelThanks bauzas 11:34
bauzasnp11:34
sean-k-mooneyykarel: once we get the ceph/py38 issues resolved I'm fine with prioritising getting this landed too11:37
ykarelsean-k-mooney, ack11:37
opendevreviewElod Illes proposed openstack/nova master: Drop py38 based zuul jobs  https://review.opendev.org/c/openstack/nova/+/88133911:38
dansmithgibi: I'm going to look into that today12:53
dansmithgibi: we're failing that rescue test 100% of the time with the new ceph job I'm trying to get working, so perhaps we're losing that race more often there13:09
sean-k-mooneyodd13:09
sean-k-mooneyI wonder what makes rescue special13:09
dansmithoh we're failing plenty of other things 100% of the time13:10
dansmithbut yeah, I don't know13:10
dansmithI'm working on it today13:10
gibiI think something is wrongly set up in the rescue tests regarding the sshable waiter13:13
gibihence the more frequent failure there13:13
gibidansmith: thanks for looking into it13:13
dansmithgibi: yeah, some of the other failures are clearly after we have already attached and detached once, so I'm sure they're not sshable related,13:15
dansmithbut I haven't dug into the rescue ones yet13:15
opendevreviewSylvain Bauza proposed openstack/nova master: Add a new policy for cold-migrate with host  https://review.opendev.org/c/openstack/nova/+/88156213:16
bauzasdansmith: I can also help with the rescue action13:16
bauzasso you can look at the ceph failures13:17
bauzasI'm done with my small feature now13:17
dansmithbauzas: no, they're related13:18
dansmithbauzas: they pass all the time (except for the occasional failure) on the focal ceph job, and 100% of the time on the jammy/quincy job13:18
bauzasdansmith: sorry I meant about https://bugs.launchpad.net/nova/+bug/1998148/comments/613:19
bauzas-ETOOMANYFAILURES13:19
dansmithbauzas: right that's what I'm talking about13:19
bauzashmmm, then I'm lost out of context13:20
sean-k-mooneyso I have mostly got the downstream thing I was working on in a mostly mergeable state. I need to respond to some feedback but I can try to deploy with ceph later today or tomorrow and take a look if it's still blocking at that point13:20
sean-k-mooneybasically I'm almost at a point where context switching for a day or two would be ok13:20
bauzasI eventually tried this morning to look at the ceph patch's failing logs, but honestly I'm a noob13:20
dansmithmy patch for the ceph job has gotten us almost all of the way there,13:21
dansmithjust give me some time this morning to try to hack in an sshable wait to some of the tests that are failing 100% of the time (if applicable) and then we can go from there.. if they're not applicable, then we've got something worse going on13:21
dansmithif they are, maybe we're just losing that race more often now and this will help the ceph and non-ceph cases13:22
bauzasdansmith: ok, you have priority (because of the blocking gate), so cool with me13:23
dansmiththe gate is unblocked, so I think the pressure is off for the moment, but we obviously have to get this resolved ASAP13:24
dansmithbauzas: maybe you could re+W these: https://review.opendev.org/c/openstack/nova/+/880632/513:24
dansmiththe bottom patch already merged, they just need a kick to get into the gate13:24
bauzasdansmith: I did this kickass13:25
dansmiththanks13:26
bauzasfwiw, I'll send a status email then saying that the gate is back 13:27
bauzasare we sure all the py38-removal releases were reverted?13:28
dansmithwell, we're passing jobs so, I think that's all we need to know13:29
fricklerpretty sure oslo bumps were not reverted yet. but only oslo.db was released13:29
dansmiththere have been other releases but they are not yet in requirements, so we're kinda on the edge13:30
dansmithyeah that^13:30
opendevreviewyatin proposed openstack/nova master: Add config option to configure TB cache size  https://review.opendev.org/c/openstack/nova/+/86841914:44
dansmithyeah, so the rescue tests work for me locally, which I guess is good15:01
dansmithhowever, the teardown does fail for me15:01
dansmithwhich tells me, again, that I don't think ssh'able is the solution for those, because we've already attached a volume and rebooted the instance (into rescue) and then rebooted it back, long after the original attach operation15:02
dansmithsorry, s/rescue/rebuild/15:03
bauzasdansmith: bravo15:07
dansmithlocally that includes the validations, which are disabled in the job (for some reason) so I pushed up a patch to set those to enabled,15:09
dansmithwhich will both likely slow it down but also maybe catch some other issues15:10
dansmiththat causes us to at least wait for the server to be created and sshable in some of the other cases, before we might try to attach a volume, but like I say, the failures I'm seeing don't seem like they'd be impacted there15:10
dansmithand the regular jobs have those enabled, and we see detach failures there still15:11
dansmithalbeit not 100%15:11
dansmithbauzas: this needs re-+W as well: https://review.opendev.org/c/openstack/nova/+/88063315:23
dansmithit's about to pass its recheck15:23
bauzasdansmith: done15:24
dansmiththanks15:24
opendevreviewMerged openstack/nova master: Remove silent failure to find a node on rebuild  https://review.opendev.org/c/openstack/nova/+/88063215:37
bauzas:)15:38
dansmithwell, the distro-based ceph job just passed with validations turned on15:53
dansmiththe cephadm based job is definitely quite a bit slower for some reason, so we're still waiting for that15:54
opendevreviewDan Smith proposed openstack/nova master: DNM: Test new ceph job configuration with nova  https://review.opendev.org/c/openstack/nova/+/88158516:09
opendevreviewDan Smith proposed openstack/nova master: DNM: Test new ceph job configuration with nova  https://review.opendev.org/c/openstack/nova/+/88158516:19
opendevreviewDan Smith proposed openstack/nova master: DNM: Test new ceph job configuration with nova  https://review.opendev.org/c/openstack/nova/+/88158516:21
dansmithokay it timed out, but legitimately, it was making progress just taking much longer because of the validations I think16:24
dansmithso we might need to trim that job down again or we can try with the concurrency upped to 216:24
dansmithah, the base job is not concurrency=1, so that's probably the difference there.. might be getting close then!16:26
dansmithgouthamr: ^16:29
opendevreviewribaudr proposed openstack/nova master: Fix live migrating to a host with cpu_shared_set configured will now update the VM's configuration accordingly.  https://review.opendev.org/c/openstack/nova/+/87777316:52
opendevreviewMerged openstack/nova master: Stop ignoring missing compute nodes in claims  https://review.opendev.org/c/openstack/nova/+/88063317:01
dansmitheharney: gouthamr: the cephadm run just passed.. the distro package one failed with errors that look very different than before, but related to ImageBusy type things17:46
dansmiththe difference here is enabling validations on both, which means we try to ssh to instances before we do certain things to them, which helps a lot with race conditions. I dunno why those were disabled on these jobs before, but we thought they were enabled17:46
dansmithso my plan is to: (1) switch the cephadm job to be the voting primary job and make the distro one non-voting, (2) set the nova DNM to run against cephadm, and get another run of everything17:47
dansmithis there anything else you want to change before we could/should merge this? the cephfs job is still on focal (because it's defined elsewhere) and thus failing because of packages, but I could hard-code that to jammy here to "fix" that if you want17:48
dansmithit's non-voting so we should be able to switch it externally separately, but let me know what you want17:48
dansmithoh another change on the cephadm job was removing the concurrency limit, which was causing it to run too slow to finish when validations was turned on. We need that to run in parallel anyway, so it's good that it passed in that form17:50
dansmiththe cephadm job also finished pretty quickly, which is also a good sign that it was pretty healthy17:53
dansmithah, the ceph-osd was oom-killed in the non-cephadm job, so that explains that failure I think18:15
sean-k-mooneythat would do it18:16
sean-k-mooneyit makes sense why you are seeing rbd busy messages18:16
dansmithyeah18:17
sean-k-mooneyis this with the mariadb and ceph options to reduce memory set18:17
sean-k-mooneyand ideally the job configured for 8GB of swap?18:18
dansmithit does include the mysql and ceph tweaks, not sure about swap,18:18
sean-k-mooneyi think we default to 2GB instead of 818:18
dansmithbut what I care about is the cephadm job at this point I think, so I kinda want another data point before we change too much else18:18
sean-k-mooneyack18:19
dansmithI know18:19
dansmithI'm waiting for the nova one to finish and see if it's the same,18:19
dansmithbut also to switch it to cephadm and get another set of data18:19
dansmithbut this is the first time I've ever seen the cephadm job pass, so I'm encouraged18:19
sean-k-mooneyif memory continues to be an issue we have 3 paths forward: 1.) increase swap, 2.) use the nested virt nodeset, 3.) wait for the qemu cache blueprint to be merged and set it back to 32MB instead of 1G per vm18:20
dansmithwe can also try reducing concurrency.. =1 times out because it takes too long, but we're at =4 right now18:21
dansmiththe cephadm job ran at =4 at top speed with no failures this time18:21
sean-k-mooneyya setting it to 2 or 3 might strike a better balance for jobs with ceph18:22
sean-k-mooneycool18:22
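(Context for option 3 above: the knob in question is the libvirt-tb-cache-size blueprint / change 868419, which would let the QEMU TCG translation-block cache be shrunk from roughly 1G per VM down to something like 32MB. Below is a rough oslo.config-style sketch of such an option; the exact name, group, default and constraints are whatever the patch finally merges with.)

    # Hypothetical oslo.config option along the lines of the TB cache
    # blueprint: a per-guest tb-cache size (in MiB) applied when nova runs
    # instances with unaccelerated QEMU (TCG).
    from oslo_config import cfg

    libvirt_group = cfg.OptGroup('libvirt')

    tb_cache_opt = cfg.IntOpt(
        'tb_cache_size',
        min=0,
        help='Size (in MiB) of the translation-block cache given to each '
             'guest when using QEMU TCG. Lowering this from the QEMU default '
             'reduces the per-instance memory footprint in CI-style '
             'environments.')


    def register_opts(conf):
        conf.register_group(libvirt_group)
        conf.register_opt(tb_cache_opt, group=libvirt_group)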
sean-k-mooneyyou're waiting for this to complete, yeah? https://zuul.openstack.org/stream/677a38a3278c4ced9deee595bb616997?logfile=console.log18:23
dansmithyeah18:23
dansmithI want the logs so I can look while I run it against cephadm as the base18:23
dansmithhoping it's also the same issue as the other one because they look very similar, just watching them18:23
sean-k-mooneyan hour and 19 minutes is not bad18:23
sean-k-mooneyhttps://zuul.opendev.org/t/openstack/build/7493255ead1b4c7abbc2ba45c8f1d9d418:24
dansmithright, it's super fast actually18:24
sean-k-mooneyit's also single node which helps18:24
dansmithso I'm hoping that the newer ceph is somehow making that smoother/faster/leaner :)18:24
dansmithcompared to the distro jobs18:24
sean-k-mooneywell going to python 3.10 will add about 30% to the openstack performance18:25
sean-k-mooneyvs 3.8 i think18:25
sean-k-mooneyoh18:25
sean-k-mooneyyou are comparing to distro packages on 22.04 not focal18:25
dansmithyeah, so that might be helping the speed when it doesn't fail18:25
dansmithsean-k-mooney: correct18:25
dansmiththe cephadm job takes ceph via podman containers from ceph directly, the nova and -py3 jobs are using jammy's ceph packages18:26
sean-k-mooneycool18:26
sean-k-mooneyso wait are you finding containerisation useful for something :P18:26
sean-k-mooneyI like cephadm for what it's worth18:27
dansmithI have no opposition to containerization in general :)18:27
sean-k-mooneyof the 4 ways I have installed ceph it was the least janky18:27
sean-k-mooneyit's also the way I have the least experience with18:27
dansmithokay nova job finished, 11 fails18:29
sean-k-mooneyyep18:29
sean-k-mooneybut that's not bad, it's probably detach related18:29
sean-k-mooneyI'm interested to see the memory stats 18:29
dansmithso gouthamr eharney I'm going to push up patches for my above proposal so I can get another run started, just FYI18:29
gouthamrdansmith: o/ yep18:29
sean-k-mooneydansmith: wait a sec18:30
sean-k-mooneythe job results are not uploaded yet18:30
dansmithsean-k-mooney: I'm not stupid I know :)18:30
gouthamrdansmith: thanks for the explanation on the tempest validations, there were some more things that fpantano was attempting here: https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/834223 18:30
sean-k-mooney:) I just didn't want you to push and think crap I wasted my time18:30
gouthamrdansmith: we can compare notes, and i can comment on the patch you're working on 18:30
dansmithgouthamr: ack18:31
sean-k-mooneyenabling the validations by default will slow things down but it does indeed work around tests that should be waiting and aren't18:31
sean-k-mooneydansmith: I assume that's what you did until we can fix them18:32
dansmithsean-k-mooney: until we can fix what?18:32
dansmiththe sshable checks don't even get run if validations is disabled, that's why I enabled them18:32
sean-k-mooneyah yes18:32
dansmithespecially for a job where we're testing a different storage backend for the instances, we definitely should have those on,18:32
dansmithso we're not passing when instances are totally dead because their disk didn't come up18:33
sean-k-mooneyso I thought tempest had a way to also add validation to tests that did not request them18:33
dansmiththey're on by default in devstack too, AFAICT18:33
dansmithall my local testing has them on and I realized late these jobs opted out of them18:33
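(What "don't even get run" means in practice: tempest's SSH verification is gated on the [validation] settings, roughly as in the sketch below. The helper itself is illustrative rather than tempest's actual code, but RemoteClient and validate_authentication are the real primitives behind the check.)

    # Illustrative guard: with run_validation disabled no keypair, security
    # group or floating IP is created, so there is nothing to SSH to and an
    # "sshable" wait silently degrades to waiting for ACTIVE only.
    from tempest import config
    from tempest.common.utils.linux import remote_client

    CONF = config.CONF


    def wait_until_sshable(ip_address, username, private_key):
        if not CONF.validation.run_validation:
            return  # validations off: skip the SSH check entirely
        client = remote_client.RemoteClient(
            ip_address, username, pkey=private_key)
        # Raises if an SSH login cannot be established within the configured
        # timeouts, failing the test instead of racing ahead.
        client.validate_authentication()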
opendevreviewDan Smith proposed openstack/nova master: DNM: Test new ceph job configuration with nova  https://review.opendev.org/c/openstack/nova/+/88158518:34
dansmithI'll ask everyone here to please cross all crossable appendages18:35
sean-k-mooneyso the nova job with packages that had the failing test had 4G of swap and its memory low point was 130MB with all swap used18:45
dansmithah I hadn't started looking yet, but yeah, interesting18:45
dansmithdid it oom?18:45
sean-k-mooneychecking now18:45
sean-k-mooneyyep18:46
dansmithyeah18:46
sean-k-mooneyApr 26 17:18:11 np0033857622 kernel: /usr/bin/python invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=018:46
dansmithalso ceph-osd18:46
dansmithso man, let's hope there's a memory leak that is fixed now18:46
dansmith1.5GiB resident at OOM time18:46
sean-k-mooneyit was right at the end too18:46
sean-k-mooneyactually it OOM'd a python process then the osd18:47
dansmithah at basically the same time18:48
sean-k-mooneyone thing that is true of both jobs is I don't think we collect any ceph logs18:48
dansmithwait, no,18:48
dansmiththe python process invoked the killer but it didn't kill the python thing first right?18:48
sean-k-mooneyi need to look again but maybe18:49
sean-k-mooneyya so python triggered it18:49
sean-k-mooneyand then ya it killed the osd18:49
sean-k-mooneybecause it presumably had a higher oom score18:50
dansmithright18:50
sean-k-mooneythere are caches that can be tuned in ceph to reduce the osd memory usage, just an fyi18:50
sean-k-mooneyI think bluestore is the (only?) backend format now but it has caches that can be reduced to reduce the osd memory usage if I remember correctly, so we can also try that if needed18:51
sean-k-mooneyI wonder if cephadm uses different defaults than ubuntu18:51
dansmithyeah, could be18:52
sean-k-mooneymemory_tracker low_point: 341952 so the cephadm job used less memory overall18:53
sean-k-mooneyabout 200mb less at its low point18:53
dansmiththat could also be affected by performance of the node too.. if one is much faster cpu than the other, it could have meant more activity at a single point or something18:53
sean-k-mooneyoh and it had swap free too18:53
dansmithbut I agree, that's a strong indicator that it's doing better18:53
sean-k-mooneythe ceph.confs are different between the two18:54
sean-k-mooneyI also found where the ceph logs are; we have them for both jobs so no regression there18:55
dansmiththey're starting tempest now18:56
dansmithwell, one is18:56
dansmithit's also interesting that the nova job has some services disabled that are not disabled in the base jobs AFAIK, which should reduce not only the test load but also the static footprint18:58
sean-k-mooneywe turn off swift for one I believe18:59
sean-k-mooneyand a few other things to manage memory18:59
sean-k-mooneylike heat18:59
dansmithand cinder-backup19:00
dansmithwhich uses a lot of memory for some reason19:00
sean-k-mooneyjust looking at the providers while we wait19:03
sean-k-mooneyboth jobs ran on ovh-bhs119:03
sean-k-mooneyso they hopefully had similar hardware19:03
sean-k-mooneyon a side note my laptop refresh has shipped, which is probably a good thing since my fans are spinning up trying to look at these logs19:05
sean-k-mooneyElapsed time: 956 sec so just over 15 mins, that's ok for devstack on a vm19:07
dansmithgot one ssh timeout failure,19:31
dansmithbut it looks like a normal one, not even specifically volume-related19:31
sean-k-mooneyack so we can probably ignore it19:31
dansmithyeah, hope so19:32
opendevreviewJay Faulkner proposed openstack/nova-specs master: Re-Propose "Ironic Shards" for Bobcat/2023.2  https://review.opendev.org/c/openstack/nova-specs/+/88164319:32
dansmithoh, but...19:32
dansmithit's been three minutes since the last test finished, which might mean it...19:32
dansmithoh yep, just exploded19:32
dansmithdammit19:32
dansmithlooks like everything is failing now, so maybe it just OOMed19:33
sean-k-mooneyif so then I would suggest kicking the swap to 8G for now and we can evaluate other options if that is not enough19:34
dansmithyeah, I can never remember how to do that.. do you have a pointer to a job I can copy?19:35
sean-k-mooneysure ill get it19:37
sean-k-mooney configure_swap_size: 8192 19:38
sean-k-mooneyhttps://github.com/openstack/devstack/blob/master/.zuul.yaml#L56919:38
dansmithah, right outside of devstack_vars19:38
dansmiththanks.. we'll see what the logs say19:38
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/881585/4/.zuul.yaml#60319:38
sean-k-mooneyya so it's set to 4G now, just bump that or do it in the base job19:38
sean-k-mooney*g*19:39
sean-k-mooney... 8G you got the point19:39
dansmithyeah it's hard failing now, so something must have gone boom19:39
dansmithI guess that's better than just random fails because it's something we have _some_ control over19:39
sean-k-mooneyem, I'm going to go eat so I'll check back later o/19:39
dansmitho/19:40
opendevreviewChristophe Fontaine proposed openstack/os-vif master: OVS DPDK tx-steering mode support  https://review.opendev.org/c/openstack/os-vif/+/88164419:58
opendevreviewChristophe Fontaine proposed openstack/os-vif master: OVS DPDK tx-steering mode support  https://review.opendev.org/c/openstack/os-vif/+/88164420:00
dansmithgouthamr: second successful run in a row on the cephadm job20:14
gouthamr\o/20:14
dansmiththe nova one is more complicated and based on it and it seems to have OOMed or some other major failure20:14
dansmithI'll up the swap to 8g when it finishes and we'll get another data point20:14
gouthamrthat's great dansmith; i wanted to check - you're trying to leave "devstack-plugin-ceph-tempest-py3" job alone.. any reason not to switch that to cephadm and delete the special "cephadm" job? 20:17
dansmithonly just so I could continue to have the comparisons, since at every point we're trying to get a grasp on what helps and hurts20:18
dansmithbut yeah, if you want me to just fold them in I guess I can20:18
dansmithI don't have the same feeling that the distro packages are necessarily worse than the upstream ones (although I'm happy if they are and that's a benefit)20:19
dansmithso I'm not in a big hurry to abandon that I guess :)20:19
gouthamryou're being conservative; but there's no bandwidth to maintain both imho :) 20:20
dansmiththe distro-based job OOMed again, so I guess I want to see if upping the swap makes that work or if it just grows further and OOMs there as well20:20
gouthamrand we're in this situation because we tried to split attention, it was tempting for me at least to not touch what was working20:20
dansmithgouthamr: ack, well, I've already marked it as non-voting which doesn't really hurt anything in the short term, but whatever20:21
gouthamrack; we can get you unblocked first and make that call, democratically, on the ML?20:22
dansmithgouthamr: I have this all ready to go as soon as the nova job finishes to capture logs, so let me push it up as it is (with 8G) and then we can swap things around after that so I can see what the distro job does with more20:22
gouthamrack dansmith 20:22
dansmithgouthamr: yes of course.. I'm certainly not arguing to keep it in such that we need a vote or anything, so if you're actively hoping to drop that support from devstack or something I certainly won't argue against it20:23
dansmithjust want one job that works, is all :)20:23
gouthamr++ 20:23
dansmithwow, no oom on the nova job, but just constant fail until it timed out20:41
dansmithand like no errors in n-cpu log20:43
dansmithseems like all cinder fails20:47
*** dmitriis is now known as Guest1227122:01
dansmithnova job failed again22:01
dansmithgouthamr: here's the report from the nova job: https://2db12bf686954cafec50-571baf10d8fb9f8a9cb5a5e38315002b.ssl.cf5.rackcdn.com/881585/4/check/nova-ceph-multistore/4694eec/testr_results.html22:34
dansmithnot nearly as much fail as before22:34
dansmiththe second job is a failure waiting for cinder to detach22:34
dansmiththe first one is a nova detach, I need to check that test to see if it's doing an ssh wait, which might help22:34
dansmithoh actually that first one is in the cinder tempest plugin, not one of ours22:35
dansmithlooks like it's probably not22:36
dansmithgmann: around?22:41
gmanndansmith: hi22:43
dansmithgmann: looks like the scenario.manager create_server in tempest is not waiting for sshableness22:43
dansmithgmann: and the cinder tempest plugin has a test that uses that and may be part of the volume detach issue(s)22:43
dansmithgmann: is it possible to make that create_server() always wait?22:44
gmannthis one right? https://github.com/openstack/cinder-tempest-plugin/blob/d6989d3c1a31f2bebf94f2a7a8dac1d9eb788b1b/cinder_tempest_plugin/scenario/test_volume_encrypted.py#L8622:44
dansmithgmann: yeah, but I'm wondering if create_server() itself can wait so anything that uses it will wait22:45
gmanndansmith: I think it makes sense to wait for SSH by default in scenario tests22:45
dansmithgmann: the current thinking is that a lot of our volume detach problems come from trying to attach a volume before an instance is booted22:45
dansmithgmann: ack, okay22:45
gmannyeah, and with the nature of most scenario tests, fully booting the server is always good before trying things on it22:46
dansmithyeah, okay lemme try to do that22:47
gmanndansmith: ok. we need to create/pass the validation resources from the scenario manager, which are needed for SSH/ping22:47
gmannor maybe via get_remote_client22:48
dansmithyeah22:49
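(A rough sketch of the direction discussed here, not the actual tempest change: make the scenario-manager path block on SSH reachability by reusing the keypair/security-group validation resources via get_remote_client. The wrapper name below is invented; it would live on a tempest.scenario.manager.ScenarioTest subclass.)

    # Illustrative scenario-manager-style helper: create the server, then
    # refuse to hand it back until an SSH login succeeds, so later volume
    # attach/detach calls hit a fully booted guest.
    def create_server_and_wait_sshable(self, keypair, security_group, **kwargs):
        server = self.create_server(
            key_name=keypair['name'],
            security_groups=[{'name': security_group['name']}],
            **kwargs)
        # get_remote_client() builds a RemoteClient from the validation
        # resources and raises if SSH never comes up within the timeout.
        self.get_remote_client(
            self.get_server_ip(server),
            private_key=keypair['private_key'],
            server=server)
        return server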
opendevreviewDan Smith proposed openstack/nova master: DNM: Test new ceph job configuration with nova  https://review.opendev.org/c/openstack/nova/+/88158523:48
