Thursday, 2023-03-30

07:02 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/2023.1: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878836
07:02 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/zed: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878837
07:03 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878838
07:24 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/879025
07:25 <gthiemonge> skraynev: hi, thanks for proposing the backports, can you propose them also on xena and wallaby?
07:27 <skraynev> gthiemonge: sure, I'm just working on fixing the merge conflict for yoga. Unfortunately the patch from master does not backport cleanly to pre-zed releases. Once I fix the merge conflict, I will create the backports for the two other branches :)
07:30 <skraynev> gthiemonge: could you please remind me what I have to do with git-review to update an existing PR? I downloaded the change via git-review -d, fixed the merge conflict, did git commit --amend and git-review... But it creates a new PR instead of updating the existing one...
07:36 <gthiemonge> skraynev: it should not create a new PR if the commit still has the correct "Change-Id"
07:36 <gthiemonge> https://review.opendev.org/c/openstack/octavia/+/879025
07:37 <gthiemonge> skraynev: you added "Fixed merge ..." at the bottom, I think git-review requires that you not change the last lines of the commit message (it gets the Change-Id from the last block)
07:37 <skraynev> oh I see! thank you
07:38 <gthiemonge> skraynev: if the backports are not straightforward, you can ping me, I'll do them for you
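[Editor's note] The behaviour gthiemonge describes, where Gerrit and git-review take the Change-Id from the last block (the trailer paragraph) of the commit message, explains why appending text below it creates a brand-new change. A small illustrative sketch of that rule follows; this is an editorial illustration, not git-review's actual code, and the Change-Id values are placeholders:

```python
def change_id_in_trailer(commit_message: str) -> bool:
    """Return True if a Change-Id line sits in the last paragraph
    of the commit message, which is where Gerrit/git-review look
    for it when matching an amended commit to an existing change."""
    paragraphs = [p for p in commit_message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    return any(line.startswith("Change-Id:")
               for line in paragraphs[-1].splitlines())

# Appending a note *after* the trailer pushes the Change-Id out of
# the last block, so the amended commit is treated as a new change:
good = "Fix ORM caching\n\nChange-Id: I1234deadbeef"
bad = "Fix ORM caching\n\nChange-Id: I1234deadbeef\n\nFixed merge conflict"
```

Keeping any extra notes above the trailer block preserves the link to the existing review.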
07:38 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878838
07:40 <skraynev> gthiemonge: I fixed it. Could you please do a brief review to check that it's correct: https://review.opendev.org/c/openstack/octavia/+/878838 . If it looks ok, I will backport it to xena and wallaby
07:53 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/xena: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878839
07:53 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/wallaby: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878840
07:54 <skraynev> gthiemonge: done. PS: thank you for merging and rechecking the patch yesterday
08:25 <Adri2000> hello, I'm trying to reset octavia quotas for deleted projects... "openstack loadbalancer quota list" lists quotas for some deleted projects, but if I try "openstack loadbalancer quota reset <project_id>" I get an error saying the project doesn't exist. any idea what I can do?
08:29 <Adri2000> additional question: if I try the "reset" on an existing project, that sets all quotas to "None" but the project will still be listed in "openstack loadbalancer quota list" (with everything set to "None")
08:29 <Adri2000> is that expected?
08:52 <frickler> Adri2000: the first issue seems to stem from how OSC handles projects, you could either try to use the SDK directly or make a patch to OSC to allow using deleted projects
08:56 <Adri2000> frickler: ok, indeed I tried running the HTTP API call directly and I don't get an error
08:57 <Adri2000> about the behaviour of quota reset, is it worth filing a bug?
08:58 <Adri2000> https://docs.openstack.org/api-ref/load-balancer/v2/index.html?expanded=reset-a-quota-detail#reset-a-quota I'd really expect that if I "DELETE" a project quota, I don't see that quota anymore in "openstack loadbalancer quota list"
08:59 <gthiemonge> Adri2000: I'm not familiar with the quota code, but I'll take a look at that
09:01 <Adri2000> thanks
09:13 <gthiemonge> Adri2000: yeah the DELETE call deletes the quota, then recreates it with the default values
09:14 <gthiemonge> Adri2000: we need to discuss it with the team, but maybe there's a way to detect that the project no longer exists, and in that case we would not recreate the quota
09:21 <Adri2000> would be nice indeed. the reason I ended up looking at this is that we realized we have a bunch of quotas defined for projects that no longer exist.
09:21 <Adri2000> if you create a bug report, spec or whatever, please share the link :)
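[Editor's note] The direct API call Adri2000 mentions is the "Reset a Quota" operation from the linked Octavia v2 API reference: DELETE /v2/lbaas/quotas/{project_id}. Calling it directly bypasses the OSC-side project lookup that rejects deleted projects. A minimal sketch of composing that request; the endpoint URL, project id, and token below are placeholder values:

```python
def build_quota_reset_request(endpoint: str, project_id: str, token: str):
    """Return (method, url, headers) for a direct Octavia v2
    quota-reset API call (DELETE /v2/lbaas/quotas/{project_id})."""
    url = f"{endpoint.rstrip('/')}/v2/lbaas/quotas/{project_id}"
    headers = {"X-Auth-Token": token}
    return ("DELETE", url, headers)

# Placeholder endpoint/project/token, not real values:
method, url, headers = build_quota_reset_request(
    "https://lb.example.org:9876", "deadbeef0123", "TOKEN")
```

Note that, per gthiemonge's observation above, the server then recreates the quota row with default values, which is why the project still appears in the quota list afterwards.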
10:42 <opendevreview> Tom Weininger proposed openstack/octavia master: DNM: profile using py-spy  https://review.opendev.org/c/openstack/octavia/+/878917
11:46 <opendevreview> Tom Weininger proposed openstack/octavia master: DNM: profile using py-spy  https://review.opendev.org/c/openstack/octavia/+/878917
12:07 <opendevreview> Tom Weininger proposed openstack/octavia stable/2023.1: DNM test Add octavia-grenade-slurp CI job  https://review.opendev.org/c/openstack/octavia/+/878928
12:46 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/wallaby: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878840
12:53 <skraynev> gthiemonge: hi again. could you please tell me what Adam Harwell's IRC nick is? I wanted to ask him about an old commit: https://opendev.org/openstack/octavia/commit/78f1c7b1286873e9c70d075a9f3b90304f54a929
12:54 <skraynev> it looks like I have similar symptoms and I wanted to hear his opinion about it
12:56 <tweining> skraynev: there must have been a merge conflict in https://review.opendev.org/c/openstack/octavia/+/878838. you should probably add a respective note to the commit message there
12:58 <skraynev> tweining: sure, is it free-form? or do I need to follow some pattern?
12:58 <tweining> I think it's only "Merge conflict:" and then the file path on the next line
12:59 <skraynev> got it. give me a sec to add it
12:59 <gthiemonge> skraynev: Adam is rm_work on IRC
13:00 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878838
13:00 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/xena: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878839
13:00 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/wallaby: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878840
13:01 <tweining> thanks :)
13:01 <skraynev> tweining: done, for all pre-yoga PRs. thank you for pointing it out.
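[Editor's note] The convention tweining describes, a "Merge conflict:" note followed by the conflicting file path, would look roughly like this in a backport commit message. The body text, file path, Change-Id, and cherry-pick hash below are all made-up placeholders:

```
Fix ORM caching for with_for_update calls

<original commit message body>

Merge conflict:
path/to/conflicting_file.py

Change-Id: I1234deadbeef
(cherry picked from commit abcdef0)
```

The note must stay above the trailer block: as discussed earlier in the log, git-review reads the Change-Id from the last block of the message.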
13:01 <skraynev> gthiemonge: great! thank you :)
13:06 <skraynev> rm_work: I will add a note about the issue. I tried to create a huge LB (30 pools with 10 members in each pool). After octavia-worker finished the creation, I saw that octavia-health-manager eats almost 100% of CPU while handling the health message from such a big LB. It spends a lot of time updating the operating status of each member, pool and listener. As a result the HM misses health checks from other amphoras, and that leads to a cascade failover of the other amphoras.
13:08 <gthiemonge> skraynev: unfortunately, he is connected
13:09 <gthiemonge> skraynev: do you have the "THIS IS NOT GOOD" log messages? how many seconds do they show?
13:23 <skraynev> gthiemonge: the funniest thing is that I don't have such messages. I saw them later, after this cascade failover started. According to the log it looks like the HM processes only one health message and updates all members, while all the other health messages are ignored. In the logs I could find messages like "Received packet from", so it looks like messages reach the HM but are not processed, because after some period of time I don't see any "Health Update finished"
14:18 <rm_work> skraynev: ok I'll take a look at that commit
14:19 <rm_work> Ok yeah I remember this
14:19 <rm_work> You're still seeing issues?
14:19 <skraynev> rm_work: I don't think that commit is incorrect. I just want to say that I have a similar issue, but with a different root cause, which is not covered by that commit.
14:20 <rm_work> Ah hmm ok
14:21 <rm_work> I have not seen this issue resurface... for you it's because it takes too long to record all the statuses for a single amphora? Because of how many objects it has?
14:22 <skraynev> I only have the issue when I create, in one request, an LB with 30 pools with 10 members in each. The update_health function successfully executes the part right before updating the statuses of the resources. So the check added in your commit passes, but afterwards it looks like it takes a lot of time and CPU resources to process all the LB's inner resources
14:22 <rm_work> Hmm yeah so that blocks other health messages?
14:22 <rm_work> But it should avoid a cascade failure because of the short circuit there
14:23 <skraynev> so around 300 members + 30 pools + 30 listeners + some health monitors (not sure about the last one)
14:23 <rm_work> How many health manager nodes do you have?
14:23 <rm_work> Others should also be able to process things in time to prevent a cascade?
14:23 <skraynev> > Hmm yeah so that blocks other health messages? - yes, it looks so. and it eats a lot of CPU resources, going to the top of the process list
14:25 <rm_work> But even so, the health messages for other amphoras should have time to be processed on other HM nodes?
14:25 <skraynev> > Others should also be able to process things in time to prevent cascade? - yep, if it does not block, everything is fine
14:30 <skraynev> rm_work: potentially yes, it should. But I have seen that health check messages also stopped being processed on the other nodes. However I have not yet found the root cause (maybe they also started processing health messages from the same LB??)
14:31 <skraynev> note that the time at which processing stopped differs per node, and I use the yoga release. I also need to check the differences with later releases... maybe it was fixed there, not sure about it
14:32 <rm_work> Hmm yeah you could add additional log output to show more clearly exactly what is starting to be processed and when
14:33 <rm_work> Let me know if you figure out a more clear cause 🤔
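[Editor's note] The "THIS IS NOT GOOD" message the participants keep referring to appears to be a warning the health manager logs, with the elapsed seconds, when processing a heartbeat takes longer than the health-check interval (hence gthiemonge's "how many seconds do they show?"). A minimal sketch of that timing pattern; this is not Octavia's actual code, and the handlers and interval values are placeholders:

```python
import time

def process_with_timing(handler, message, interval_seconds):
    """Run a heartbeat handler and flag it when processing exceeds
    the health-check interval (the condition behind the health
    manager's slow-processing warning)."""
    start = time.monotonic()
    handler(message)
    elapsed = time.monotonic() - start
    too_slow = elapsed > interval_seconds
    if too_slow:
        # The real log message reports the elapsed seconds here.
        print(f"health update took {elapsed:.2f}s "
              f"(interval {interval_seconds}s) - THIS IS NOT GOOD")
    return elapsed, too_slow

# A handler updating hundreds of member/pool/listener rows can exceed
# the interval; simulate the slow case with a short sleep:
_, slow = process_with_timing(lambda m: time.sleep(0.05), {}, 0.01)
_, fast = process_with_timing(lambda m: None, {}, 10)
```

While one worker process is stuck inside such a slow update, heartbeats queued behind it go stale, which matches the cascade skraynev describes.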
14:44 <johnsom> skraynev: A couple of questions:
14:44 <johnsom> 1. How many cores does the health manager have available to it?
14:45 <johnsom> 2. What is your configuration setting for [health_manager] health_update_threads and stats_update_threads?
14:58 <skraynev> johnsom: 2: 4 for each
15:02 <skraynev> 1: 11, so each process has its own CPU. In total, with the HM, its subprocesses and the process pool executors, it comes to 11 processes: 1 main, 2 processes (the first a health checker, the second a UDP parser), 4 from a ProcessPoolExecutor (UDP update listener stats), 4 from a ProcessPoolExecutor (UDP update health)
15:04 <skraynev> Let me know if you figure out a more clear cause -> sure
15:04 <skraynev> and I forgot to say that I have 3 nodes with HM
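[Editor's note] The two options johnsom asks about live in the [health_manager] section of octavia.conf. With the values skraynev reports, the relevant fragment would look like this (a sketch; all other settings omitted):

```ini
[health_manager]
# Number of worker processes handling incoming health heartbeats
health_update_threads = 4
# Number of worker processes handling incoming statistics updates
stats_update_threads = 4
```

Together with the main process, the health checker, and the UDP parser, these two pools account for the 11 processes skraynev counts above.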
15:05 <johnsom> Typically issues like this are related to database issues, but if you don't have the "THIS IS NOT GOOD" log messages, it is probably keeping up ok. I just wonder if the process count is configured high for a VM/container that only has one core or similar. That would cause CPU contention.
15:10 <skraynev> Yes, before the cascade failover I had not seen the "THIS IS NOT GOOD" log messages, nor "Health Update finished in". I also tried to reproduce it just now and it looks like it passed :) On the previous try, after 1 minute I also ran a second, similar LB creation, so maybe the CPU was overloaded... not sure.
15:13 <skraynev> johnsom: thank you for pointing out CPU contention, I will try to check this case too.
15:14 <johnsom> Yeah, I would expect there to be some CPU load spike with that many pools/members. Also check the RAM usage. All of that processing is going to happen in RAM, so check if the instance is swapping and maybe blocking on IO
15:27 <skraynev> johnsom: so... it looks like 1 physical CPU with 6 cores and multi-threading... that's why I thought about 11(2) cpus :(
15:28 <johnsom> 6 cores with a setting of 4 for the threads should be ok IMO.
16:09 <skraynev_> johnsom: I did one more round of testing. If I create such an LB (300 members, 30 pools) and wait until it converges to ACTIVE, it passes. After that I can run a second one and it also passes. However my initial case was a bit riskier: I created one LB, waited about 1 min (it was still in progress), and ran a second, similar one. It looks like that overloaded the HM and the CPU (like a DDoS). But I still could not find the bottleneck. I will look at it tomorrow. Maybe it's related to CPU and memory...
16:34 <opendevreview> Tom Weininger proposed openstack/octavia stable/2023.1: DNM test Add octavia-grenade-slurp CI job  https://review.opendev.org/c/openstack/octavia/+/878928

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!