Wednesday, 2023-04-26

opendevreviewMerged openstack/nova master: Reproduce bug 1995153  https://review.opendev.org/c/openstack/nova/+/86296700:01
gibiauniyal, bauzas, sean-k-mooney my recent findings about tests not waiting for sshable before attach/detach are in https://bugs.launchpad.net/nova/+bug/1998148 but I have no cycles to push tempest changes so feel free to jump on it07:55
bauzasgibi: ack, but I'm atm still digesting dansmith's and gouthamr's efforts on ceph/Jammy jobs07:56
gibisure. I just wanted to be explicit about the fact that I won't push a solution08:38
bauzasgibi: I'll try to help this afternoon then08:51
bauzasthanks for the findings again08:51
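(For reference, the pattern gibi's findings call for looks roughly like the sketch below. It is illustrative only: the class and test names are invented, and it assumes tempest's wait_until='SSHABLE' support, which blocks until an SSH login to the guest succeeds rather than returning as soon as the server goes ACTIVE.)

    # Illustrative tempest-style test: make sure the guest is reachable over
    # SSH before exercising volume attach/detach, instead of attaching as
    # soon as the server reports ACTIVE.
    from tempest.api.compute import base


    class VolumeAttachSshableTest(base.BaseV2ComputeTest):

        def test_attach_volume_after_sshable(self):
            validation_resources = self.get_class_validation_resources(
                self.os_primary)
            # wait_until='SSHABLE' only does real work when
            # [validation]/run_validation is enabled in tempest.conf.
            server = self.create_test_server(
                validatable=True,
                validation_resources=validation_resources,
                wait_until='SSHABLE')
            volume = self.create_volume()
            # By now the guest OS is fully up, so the attach is much less
            # likely to race with boot.
            self.attach_volume(server, volume)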
ykarelHi bauzas 11:31
ykarelreported https://blueprints.launchpad.net/nova/+spec/libvirt-tb-cache-size as discussed yesterday, can you please check11:31
bauzasykarel: ack, approving it then11:32
bauzasand done11:34
ykarelThanks bauzas 11:34
bauzasnp11:34
sean-k-mooneyykarel: once we get the ceph/py38 issues resolved I'm fine with prioritising getting this landed too11:37
ykarelsean-k-mooney, ack11:37
opendevreviewElod Illes proposed openstack/nova master: Drop py38 based zuul jobs  https://review.opendev.org/c/openstack/nova/+/88133911:38
dansmithgibi: I'm going to look into that today12:53
dansmithgibi: we're failing that rescue test 100% of the time with the new ceph job I'm trying to get working, so perhaps we're losing that race more often there13:09
sean-k-mooneyodd13:09
sean-k-mooneyI wonder what makes rescue special13:09
dansmithoh we're failing plenty of other things 100% of the time13:10
dansmithbut yeah, I don't know13:10
dansmithI'm working on it today13:10
gibiI think something is wrongly set up in the rescue tests regarding the sshable waiter13:13
gibihence the more frequent failure there13:13
gibidansmith: thanks for looking into it13:13
dansmithgibi: yeah, some of the other failures are clearly after we have already attached and detached once, so I'm sure they're not sshable related,13:15
dansmithbut I haven't dug into the rescue ones yet13:15
opendevreviewSylvain Bauza proposed openstack/nova master: Add a new policy for cold-migrate with host  https://review.opendev.org/c/openstack/nova/+/88156213:16
bauzasdansmith: I can also help with the rescue action13:16
bauzasso you can look at the ceph failures13:17
bauzasI'm done with my small feature now13:17
dansmithbauzas: no, they're related13:18
dansmithbauzas: they pass all the time (except for the occasional failure) on the focal ceph job, and 100% of the time on the jammy/quincy job13:18
bauzasdansmith: sorry I meant about https://bugs.launchpad.net/nova/+bug/1998148/comments/613:19
bauzas-ETOOMANYFAILURES13:19
dansmithbauzas: right that's what I'm talking about13:19
bauzashmmm, then I'm lost out of context13:20
sean-k-mooneyso I have mostly got the downstream thing I was working on in a mostly mergeable state. I need to respond to some feedback but I can try to deploy with ceph later today or tomorrow and take a look if it's still blocking at that point13:20
sean-k-mooneybasically I'm almost at a point where context switching for a day or two would be ok13:20
bauzasI eventually tried this morning to look at the ceph patch's failing logs, but honestly I'm a noob13:20
dansmithmy patch for the ceph job has gotten us almost all of the way there,13:21
dansmithjust give me some time this morning to try to hack in an sshable wait to some of the tests that are failing 100% of the time (if applicable) and then we can go from there.. if they're not applicable, then we've got something worse going on13:21
dansmithif they are, maybe we're just losing that race more often now and this will help the ceph and non-ceph cases13:22
bauzasdansmith: ok, you have priority (because of the blocking gate), so cool with me13:23
dansmiththe gate is unblocked, so I think the pressure is off for the moment, but we obviously have to get this resolved ASAP13:24
dansmithbauzas: maybe you could re+W these: https://review.opendev.org/c/openstack/nova/+/880632/513:24
dansmiththe bottom patch already merged, they just need a kick to get into the gate13:24
bauzasdansmith: I did this kickass13:25
dansmiththanks13:26
bauzasfwiw, I'll send a status email then saying that the gate is back 13:27
bauzasare we sure all the py38-removal releases were reverted?13:28
dansmithwell, we're passing jobs so, I think that's all we need to know13:29
fricklerpretty sure oslo bumps were not reverted yet. but only oslo.db was released13:29
dansmiththere have been other releases but they are not yet in requirements, so we're kinda on the edge13:30
dansmithyeah that^13:30
opendevreviewyatin proposed openstack/nova master: Add config option to configure TB cache size  https://review.opendev.org/c/openstack/nova/+/86841914:44
dansmithyeah, so the rescue tests work for me locally, which I guess is good15:01
dansmithhowever, the teardown does fail for me15:01
dansmithwhich tells me, again, that I don't think ssh'able is the solution for those, because we've already attached a volume and rebooted the instance (into rescue) and then rebooted it back, long after the original attach operation15:02
dansmithsorry, s/rescue/rebuild/15:03
bauzasdansmith: bravo15:07
dansmithlocally that includes the validations, which are disabled in the job (for some reason) so I pushed up a patch to set those to enabled,15:09
dansmithwhich will both likely slow it down but also maybe catch some other issues15:10
dansmiththat causes us to at least wait for the server to be created and sshable in some of the other cases, before we might try to attach a volume, but like I say, the failures I'm seeing don't seem like they'd be impacted there15:10
dansmithand the regular jobs have those enabled, and we see detach failures there still15:11
dansmithalbeit not 100%15:11
dansmithbauzas: this needs re-+W as well: https://review.opendev.org/c/openstack/nova/+/88063315:23
dansmithit's about to pass its recheck15:23
bauzasdansmith: done15:24
dansmiththanks15:24
opendevreviewMerged openstack/nova master: Remove silent failure to find a node on rebuild  https://review.opendev.org/c/openstack/nova/+/88063215:37
bauzas:)15:38
dansmithwell, the distro-based ceph job just passed with validations turned on15:53
dansmiththe cephadm based job is definitely quite a bit slower for some reason, so we're still waiting for that15:54
opendevreviewDan Smith proposed openstack/nova master: DNM: Test new ceph job configuration with nova  https://review.opendev.org/c/openstack/nova/+/88158516:09
opendevreviewDan Smith proposed openstack/nova master: DNM: Test new ceph job configuration with nova  https://review.opendev.org/c/openstack/nova/+/88158516:19
opendevreviewDan Smith proposed openstack/nova master: DNM: Test new ceph job configuration with nova  https://review.opendev.org/c/openstack/nova/+/88158516:21
dansmithokay it timed out, but legitimately, it was making progress just taking much longer because of the validations I think16:24
dansmithso we might need to trim that job down again or we can try with the concurrency upped to 216:24
dansmithah, the base job is not concurrency=1, so that's probably the difference there.. might be getting close then!16:26
dansmithgouthamr: ^16:29
opendevreviewribaudr proposed openstack/nova master: Fix live migrating to a host with cpu_shared_set configured will now update the VM's configuration accordingly.  https://review.opendev.org/c/openstack/nova/+/87777316:52
opendevreviewMerged openstack/nova master: Stop ignoring missing compute nodes in claims  https://review.opendev.org/c/openstack/nova/+/88063317:01
dansmitheharney: gouthamr: the cephadm run just passed.. the distro package one failed with errors that look very different than before, but related to ImageBusy type things17:46
dansmiththe difference here is enabling validations on both, which means we try to ssh to instances before we do certain things to them, which helps a lot with race conditions. I dunno why those were disabled on these jobs before, but we thought they were enabled17:46
dansmithso my plan is to: (1) switch the cephadm job to be the voting primary job and make the distro one non-voting, (2) set the nova DNM to run against cephadm, and get another run of everything17:47
dansmithis there anything else you want to change before we could/should merge this? the cephfs job is still on focal (because it's defined elsewhere) and thus failing because of packages, but I could hard-code that to jammy here to "fix" that if you want17:48
dansmithit's non-voting so we should be able to switch it externally separately, but let me know what you want17:48
dansmithoh another change on the cephadm job was removing the concurrency limit, which was causing it to run too slow to finish when validations was turned on. We need that to run in parallel anyway, so it's good that it passed in that form17:50
dansmiththe cephadm job also finished pretty quickly, which is also a good sign that it was pretty healthy17:53
dansmithah, the ceph-osd was oom-killed in the non-cephadm job, so that explains that failure I think18:15
sean-k-mooneythat would do it18:16
sean-k-mooneyit makes sense why you are seeing rbd busy messages18:16
dansmithyeah18:17
sean-k-mooneyis this with the mariadb and ceph options to reduce memory set18:17
sean-k-mooneyand ideally the job configured for 8GB of swap?18:18
dansmithit does include the mysql and ceph tweaks, not sure about swap,18:18
sean-k-mooneyi think we default to 2GB instead of 818:18
dansmithbut what I care about is the cephadm job at this point I think, so I kinda want another data point before we change too much else18:18
sean-k-mooneyack18:19
dansmithI know18:19
dansmithI'm waiting for the nova one to finish and see if it's the same,18:19
dansmithbut also to switch it to cephadm and get another set of data18:19
dansmithbut this is the first time I've ever seen the cephadm job pass, so I'm encouraged18:19
sean-k-mooneyif memory continues to be an issue we have 3 paths forward: 1.) increase swap, 2.) use the nested virt nodeset, 3.) wait for the qemu cache blueprint to be merged and set it back to 32MB instead of 1G per vm18:20
dansmithwe can also try reducing concurrency.. =1 times out because it takes too long, but we're at =4 right now18:21
dansmiththe cephadm job ran at =4 at top speed with no failures this time18:21
sean-k-mooneyya setting it to 2 or 3 might strike a better balance for jobs with ceph18:22
sean-k-mooneycool18:22
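(Context for option 3 above: the knob in question is the libvirt-tb-cache-size blueprint / change 868419, which would let the QEMU TCG translation-block cache be shrunk from roughly 1G per VM down to something like 32MB. Below is a rough oslo.config-style sketch of such an option; the exact name, group, default and constraints are whatever the patch finally merges with.)

    # Hypothetical oslo.config option along the lines of the TB cache
    # blueprint: a per-guest tb-cache size (in MiB) applied when nova runs
    # instances with unaccelerated QEMU (TCG).
    from oslo_config import cfg

    libvirt_group = cfg.OptGroup('libvirt')

    tb_cache_opt = cfg.IntOpt(
        'tb_cache_size',
        min=0,
        help='Size (in MiB) of the translation-block cache given to each '
             'guest when using QEMU TCG. Lowering this from the QEMU default '
             'reduces the per-instance memory footprint in CI-style '
             'environments.')


    def register_opts(conf):
        conf.register_group(libvirt_group)
        conf.register_opt(tb_cache_opt, group=libvirt_group)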
sean-k-mooneyyou're waiting for this to complete, yeah? https://zuul.openstack.org/stream/677a38a3278c4ced9deee595bb616997?logfile=console.log18:23
dansmithyeah18:23
dansmithI want the logs so I can look while I run it against cephadm as the base18:23
dansmithhoping it's also the same issue as the other one because they look very similar, just watching them18:23
sean-k-mooneyan hour and 19 minutes is not bad18:23
sean-k-mooneyhttps://zuul.opendev.org/t/openstack/build/7493255ead1b4c7abbc2ba45c8f1d9d418:24
dansmithright, it's super fast actually18:24
sean-k-mooneyit's also single node which helps18:24
dansmithso I'm hoping that the newer ceph is somehow making that smoother/faster/leaner :)18:24
dansmithcompared to the distro jobs18:24
sean-k-mooneywell going to python 3.10 will add about 30% to the openstack performance18:25
sean-k-mooneyvs 3.8 i think18:25
sean-k-mooneyoh18:25
sean-k-mooneyyou are comparing to distro packages on 22.04 not focal18:25
dansmithyeah, so that might be helping the speed when it doesn't fail18:25
dansmithsean-k-mooney: correct18:25
dansmiththe cephadm job takes ceph via podman containers from ceph directly, the nova and -py3 jobs are using jammy's ceph packages18:26
sean-k-mooneycool18:26
sean-k-mooneyso wait are you finding containerisation useful for something :P18:26
sean-k-mooneyI like cephadm for what it's worth18:27
dansmithI have no opposition to containerization in general :)18:27
sean-k-mooneyof the 4 ways I have installed ceph it was the least janky18:27
sean-k-mooneyit's also the way I have the least experience with18:27
dansmithokay nova job finished, 11 fails18:29
sean-k-mooneyyep18:29
sean-k-mooneybut that's not bad, it's probably detach related18:29
sean-k-mooneyI'm interested to see the memory stats 18:29
dansmithso gouthamr eharney I'm going to push up patches for my above proposal so I can get another run started, just FYI18:29
gouthamrdansmith: o/ yep18:29
sean-k-mooneydansmith: wait a sec18:30
sean-k-mooneythe job results are not uploaded yet18:30
dansmithsean-k-mooney: I'm not stupid I know :)18:30
gouthamrdansmith: thanks for the explanation on the tempest validations, there were some more things that fpantano was attempting here: https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/834223 18:30
sean-k-mooney:) I just didn't want you to push and think crap I wasted my time18:30
gouthamrdansmith: we can compare notes, and i can comment on the patch you're working on 18:30
dansmithgouthamr: ack18:31
sean-k-mooneyenabling the validations by default will slow things down but it does indeed work around tests that should be waiting and aren't18:31
sean-k-mooneydansmith: I assume that's what you did until we can fix them18:32
dansmithsean-k-mooney: until we can fix what?18:32
dansmiththe sshable checks don't even get run if validations is disabled, that's why I enabled them18:32
sean-k-mooneyah yes18:32
dansmithespecially for a job where we're testing a different storage backend for the instances, we definitely should have those on,18:32
dansmithso we're not passing when instances are totally dead because their disk didn't come up18:33
sean-k-mooneyso I thought tempest had a way to also add validation to tests that did not request them18:33
dansmiththey're on by default in devstack too, AFAICT18:33
dansmithall my local testing has them on and I realized late these jobs opted out of them18:33
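(What "don't even get run" means in practice: tempest's SSH verification is gated on the [validation] settings, roughly as in the sketch below. The helper itself is illustrative rather than tempest's actual code, but RemoteClient and validate_authentication are the real primitives behind the check.)

    # Illustrative guard: with run_validation disabled no keypair, security
    # group or floating IP is created, so there is nothing to SSH to and an
    # "sshable" wait silently degrades to waiting for ACTIVE only.
    from tempest import config
    from tempest.common.utils.linux import remote_client

    CONF = config.CONF


    def wait_until_sshable(ip_address, username, private_key):
        if not CONF.validation.run_validation:
            return  # validations off: skip the SSH check entirely
        client = remote_client.RemoteClient(
            ip_address, username, pkey=private_key)
        # Raises if an SSH login cannot be established within the configured
        # timeouts, failing the test instead of racing ahead.
        client.validate_authentication()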
opendevreviewDan Smith proposed openstack/nova master: DNM: Test new ceph job configuration with nova  https://review.opendev.org/c/openstack/nova/+/88158518:34
dansmithI'll ask everyone here to please cross all crossable appendages18:35
sean-k-mooneyso the nova job with packages that had the failing test had 4G of swap and its memory low point was 130MB with all swap used18:45
dansmithah I hadn't started looking yet, but yeah, interesting18:45
dansmithdid it oom?18:45
sean-k-mooneychecking now18:45
sean-k-mooneyyep18:46
dansmithyeah18:46
sean-k-mooneyApr 26 17:18:11 np0033857622 kernel: /usr/bin/python invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=018:46
dansmithalso ceph-osd18:46
dansmithso man, let's hope there's a memory leak that is fixed now18:46
dansmith1.5GiB resident at OOM time18:46
sean-k-mooneyit was right at the end too18:46
sean-k-mooneyactually it OOM'd a python process then the osd18:47
dansmithah at basically the same time18:48
sean-k-mooneyone thing that is true of both jobs is I don't think we collect any ceph logs18:48
dansmithwait, no,18:48
dansmiththe python process invoked the killer but it didn't kill the python thing first right?18:48
sean-k-mooneyi need to look again but maybe18:49
sean-k-mooneyya so python triggered it18:49
sean-k-mooneyand then ya it killed the osd18:49
sean-k-mooneybecause it presumably had a higher oom score18:50
dansmithright18:50
sean-k-mooneythere are caches that can be tuned in ceph to reduce the osd memory usage, just an fyi18:50
sean-k-mooneyI think bluestore is the (only?) backend format now but it has caches that can be reduced to reduce the osd memory usage if I remember correctly, so we can also try that if needed18:51
sean-k-mooneyI wonder if cephadm uses different defaults than ubuntu18:51
dansmithyeah, could be18:52
sean-k-mooneymemory_tracker low_point: 341952 so the cephadm job used less memory overall18:53
sean-k-mooneyabout 200mb less at its low point18:53
dansmiththat could also be affected by performance of the node too.. if one is much faster cpu than the other, it could have meant more activity at a single point or something18:53
sean-k-mooneyoh and it had swap free too18:53
dansmithbut I agree, that's a strong indicator that it's doing better18:53
sean-k-mooneythe ceph.confs are different between the two18:54
sean-k-mooneyI also found where the ceph logs are; we have them for both jobs so no regression there18:55
dansmiththey're starting tempest now18:56
dansmithwell, one is18:56
dansmithit's also interesting that the nova job has some services disabled that are not disabled in the base jobs AFAIK, which should reduce not only the test load but also the static footprint18:58
sean-k-mooneywe turn off swift for one I believe18:59
sean-k-mooneyand a few other things to manage memory18:59
sean-k-mooneylike heat18:59
dansmithand cinder-backup19:00
dansmithwhich uses a lot of memory for some reason19:00
sean-k-mooneyjust looking at the providers while we wait19:03
sean-k-mooneyboth jobs ran on ovh-bhs119:03
sean-k-mooneyso they hopefully had similar hardware19:03
sean-k-mooneyon a side note my laptop refresh has shipped, which is probably a good thing since my fans are spinning up trying to look at these logs19:05
sean-k-mooneyElapsed time: 956 sec so just over 15 mins, that's ok for devstack on a vm19:07
dansmithgot one ssh timeout failure,19:31
dansmithbut it looks like a normal one, not even specifically volume-related19:31
sean-k-mooneyack so we can probably ignore it19:31
dansmithyeah, hope so19:32
opendevreviewJay Faulkner proposed openstack/nova-specs master: Re-Propose "Ironic Shards" for Bobcat/2023.2  https://review.opendev.org/c/openstack/nova-specs/+/88164319:32
dansmithoh, but...19:32
dansmithit's been three minutes since the last test finished, which might mean it...19:32
dansmithoh yep, just exploded19:32
dansmithdammit19:32
dansmithlooks like everything is failing now, so maybe it just OOMed19:33
sean-k-mooneyif so then I would suggest kicking the swap to 8G for now and we can evaluate other options if that is not enough19:34
dansmithyeah, I can never remember how to do that.. do you have a pointer to a job I can copy?19:35
sean-k-mooneysure ill get it19:37
sean-k-mooney configure_swap_size: 8192 19:38
sean-k-mooneyhttps://github.com/openstack/devstack/blob/master/.zuul.yaml#L56919:38
dansmithah, right outside of devstack_vars19:38
dansmiththanks.. we'll see what the logs say19:38
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/881585/4/.zuul.yaml#60319:38
sean-k-mooneyya so it's set to 4G now, just bump that or do it in the base job19:38
sean-k-mooney*g*19:39
sean-k-mooney... 8G you got the point19:39
dansmithyeah it's hard failing now, so something must have gone boom19:39
dansmithI guess that's better than just random fails because it's something we have _some_ control over19:39
sean-k-mooneyem, I'm going to go eat so I'll check back later o/19:39
dansmitho/19:40
opendevreviewChristophe Fontaine proposed openstack/os-vif master: OVS DPDK tx-steering mode support  https://review.opendev.org/c/openstack/os-vif/+/88164419:58
opendevreviewChristophe Fontaine proposed openstack/os-vif master: OVS DPDK tx-steering mode support  https://review.opendev.org/c/openstack/os-vif/+/88164420:00
dansmithgouthamr: second successful run in a row on the cephadm job20:14
gouthamr\o/20:14
dansmiththe nova one is more complicated and based on it and it seems to have OOMed or some other major failure20:14
dansmithI'll up the swap to 8g when it finishes and we'll get another data point20:14
gouthamrthat's great dansmith; i wanted to check - you're trying to leave "devstack-plugin-ceph-tempest-py3" job alone.. any reason not to switch that to cephadm and delete the special "cephadm" job? 20:17
dansmithonly just so I could continue to have the comparisons, since at every point we're trying to get a grasp on what helps and hurts20:18
dansmithbut yeah, if you want me to just fold them in I guess I can20:18
dansmithI don't have the same feeling that the distro packages are necessarily worse than the upstream ones (although I'm happy if they are and that's a benefit)20:19
dansmithso I'm not in a big hurry to abandon that I guess :)20:19
gouthamryou're being conservative; but there's no bandwidth to maintain both imho :) 20:20
dansmiththe distro-based job OOMed again, so I guess I want to see if upping the swap makes that work or if it just grows further and OOMs there as well20:20
gouthamrand we're in this situation because we tried to split attention, it was tempting for me at least to not touch what was working20:20
dansmithgouthamr: ack, well, I've already marked it as non-voting which doesn't really hurt anything in the short term, but whatever20:21
gouthamrack; we can get you unblocked first and make that call, democratically, on the ML?20:22
dansmithgouthamr: I have this all ready to go as soon as the nova job finishes to capture logs, so let me push it up as it is (with 8G) and then we can swap things around after that so I can see what the distro job does with more20:22
gouthamrack dansmith 20:22
dansmithgouthamr: yes of course.. I'm certainly not arguing to keep it in such that we need a vote or anything, so if you're actively hoping to drop that support from devstack or something I certainly won't argue against it20:23
dansmithjust want one job that works, is all :)20:23
gouthamr++ 20:23
dansmithwow, no oom on the nova job, but just constant fail until it timed out20:41
dansmithand like no errors in n-cpu log20:43
dansmithseems like all cinder fails20:47
*** dmitriis is now known as Guest1227122:01
dansmithnova job failed again22:01
dansmithgouthamr: here's the report from the nova job: https://2db12bf686954cafec50-571baf10d8fb9f8a9cb5a5e38315002b.ssl.cf5.rackcdn.com/881585/4/check/nova-ceph-multistore/4694eec/testr_results.html22:34
dansmithnot nearly as much fail as before22:34
dansmiththe second job is a failure waiting for cinder to detach22:34
dansmiththe first one is a nova detach, I need to check that test to see if it's doing an ssh wait, which might help22:34
dansmithoh actually that first one is in the cinder tempest plugin, not one of ours22:35
dansmithlooks like it's probably not22:36
dansmithgmann: around?22:41
gmanndansmith: hi22:43
dansmithgmann: looks like the scenario.manager create_server in tempest is not waiting for sshableness22:43
dansmithgmann: and the cinder tempest plugin has a test that uses that and may be part of the volume detach issue(s)22:43
dansmithgmann: is it possible to make that create_server() always wait?22:44
gmannthis one right? https://github.com/openstack/cinder-tempest-plugin/blob/d6989d3c1a31f2bebf94f2a7a8dac1d9eb788b1b/cinder_tempest_plugin/scenario/test_volume_encrypted.py#L8622:44
dansmithgmann: yeah, but I'm wondering if create_server() itself can wait so anything that uses it will wait22:45
gmanndansmith: I think it makes sense to wait for SSH by default in scenario tests22:45
dansmithgmann: the current thinking is that a lot of our volume detach problems come from trying to attach a volume before an instance is booted22:45
dansmithgmann: ack, okay22:45
gmannyeah, and with the nature of most scenario tests, fully booting the server is always good before trying things on it22:46
dansmithyeah, okay lemme try to do that22:47
gmanndansmith: ok. we need to create/pass the validation resources from the scenario manager, which are needed for SSH/ping22:47
gmannor maybe via get_remote_client22:48
dansmithyeah22:49
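(A rough sketch of the direction discussed here, not the actual tempest change: make the scenario-manager path block on SSH reachability by reusing the keypair/security-group validation resources via get_remote_client. The wrapper name below is invented; it would live on a tempest.scenario.manager.ScenarioTest subclass.)

    # Illustrative scenario-manager-style helper: create the server, then
    # refuse to hand it back until an SSH login succeeds, so later volume
    # attach/detach calls hit a fully booted guest.
    def create_server_and_wait_sshable(self, keypair, security_group, **kwargs):
        server = self.create_server(
            key_name=keypair['name'],
            security_groups=[{'name': security_group['name']}],
            **kwargs)
        # get_remote_client() builds a RemoteClient from the validation
        # resources and raises if SSH never comes up within the timeout.
        self.get_remote_client(
            self.get_server_ip(server),
            private_key=keypair['private_key'],
            server=server)
        return server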
opendevreviewDan Smith proposed openstack/nova master: DNM: Test new ceph job configuration with nova  https://review.opendev.org/c/openstack/nova/+/88158523:48
