Thursday, 2023-03-30

07:02 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/2023.1: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878836
07:02 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/zed: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878837
07:03 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878838
07:24 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/879025
07:25 <gthiemonge> skraynev: hi, thanks for proposing the backports, can you propose them also on xena and wallaby?
07:27 <skraynev> gthiemonge: sure, I'm just working on fixing the merge conflict for yoga. Unfortunately the patch from master does not backport cleanly to pre-zed releases. Once I fix the merge conflict, I will create the backports for the two other branches :)
07:30 <skraynev> gthiemonge: could you please remind me what I have to do with git-review to update an existing PR? I downloaded the change via git-review -d, fixed the merge conflict, did git commit --amend and git-review... But it creates a new PR instead of updating the existing one...
07:36 <gthiemonge> skraynev: it should not create a new PR if the commit still has the correct "Change-Id"
07:36 <gthiemonge> https://review.opendev.org/c/openstack/octavia/+/879025
07:37 <gthiemonge> skraynev: you added "Fixed merge ..." at the bottom, I think git-review requires that you not change the last lines of the commit message (it gets the Change-Id from the last block)
07:37 <skraynev> oh I see! thank you
07:38 <gthiemonge> skraynev: if the backports are not straightforward, you can ping me, I'll do them for you
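[Editor's note] The behaviour gthiemonge describes, where Gerrit and git-review take the Change-Id from the last block (the trailer paragraph) of the commit message, explains why appending text below it creates a brand-new change. A small illustrative sketch of that rule follows; this is an editorial illustration, not git-review's actual code, and the Change-Id values are placeholders:

```python
def change_id_in_trailer(commit_message: str) -> bool:
    """Return True if a Change-Id line sits in the last paragraph
    of the commit message, which is where Gerrit/git-review look
    for it when matching an amended commit to an existing change."""
    paragraphs = [p for p in commit_message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    return any(line.startswith("Change-Id:")
               for line in paragraphs[-1].splitlines())

# Appending a note *after* the trailer pushes the Change-Id out of
# the last block, so the amended commit is treated as a new change:
good = "Fix ORM caching\n\nChange-Id: I1234deadbeef"
bad = "Fix ORM caching\n\nChange-Id: I1234deadbeef\n\nFixed merge conflict"
```

Keeping any extra notes above the trailer block preserves the link to the existing review.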
07:38 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878838
07:40 <skraynev> gthiemonge: I fixed it. Could you please do a brief review to check that it's correct: https://review.opendev.org/c/openstack/octavia/+/878838 . If it looks ok, I will backport it to xena and wallaby
07:53 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/xena: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878839
07:53 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/wallaby: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878840
07:54 <skraynev> gthiemonge: done. PS: thank you for merging and rechecking the patch yesterday
08:25 <Adri2000> hello, I'm trying to reset octavia quotas for deleted projects... "openstack loadbalancer quota list" lists quotas for some deleted projects, but if I try "openstack loadbalancer quota reset <project_id>" I get an error saying the project doesn't exist. any idea what I can do?
08:29 <Adri2000> additional question: if I try the "reset" on an existing project, that sets all quotas to "None" but the project will still be listed in "openstack loadbalancer quota list" (with everything set to "None")
08:29 <Adri2000> is that expected?
08:52 <frickler> Adri2000: the first issue seems to stem from how OSC handles projects, you could either try to use the SDK directly or make a patch to OSC to allow using deleted projects
08:56 <Adri2000> frickler: ok, indeed I tried running the HTTP API call directly and I don't get an error
08:57 <Adri2000> about the behaviour of quota reset, is it worth filing a bug?
08:58 <Adri2000> https://docs.openstack.org/api-ref/load-balancer/v2/index.html?expanded=reset-a-quota-detail#reset-a-quota I'd really expect that if I "DELETE" a project quota, I don't see that quota anymore in "openstack loadbalancer quota list"
08:59 <gthiemonge> Adri2000: I'm not familiar with the quota code, but I'll take a look at that
09:01 <Adri2000> thanks
09:13 <gthiemonge> Adri2000: yeah the DELETE call deletes the quota, then recreates it with the default values
09:14 <gthiemonge> Adri2000: we need to discuss it with the team, but maybe there's a way to detect that the project no longer exists, and in that case we would not recreate the quota
09:21 <Adri2000> would be nice indeed. the reason I ended up looking at this is that we realized we have a bunch of quotas defined for projects that no longer exist.
09:21 <Adri2000> if you create a bug report, spec or whatever, please share the link :)
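[Editor's note] The direct API call Adri2000 mentions is the "Reset a Quota" operation from the linked Octavia v2 API reference: DELETE /v2/lbaas/quotas/{project_id}. Calling it directly bypasses the OSC-side project lookup that rejects deleted projects. A minimal sketch of composing that request; the endpoint URL, project id, and token below are placeholder values:

```python
def build_quota_reset_request(endpoint: str, project_id: str, token: str):
    """Return (method, url, headers) for a direct Octavia v2
    quota-reset API call (DELETE /v2/lbaas/quotas/{project_id})."""
    url = f"{endpoint.rstrip('/')}/v2/lbaas/quotas/{project_id}"
    headers = {"X-Auth-Token": token}
    return ("DELETE", url, headers)

# Placeholder endpoint/project/token, not real values:
method, url, headers = build_quota_reset_request(
    "https://lb.example.org:9876", "deadbeef0123", "TOKEN")
```

Note that, per gthiemonge's observation above, the server then recreates the quota row with default values, which is why the project still appears in the quota list afterwards.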
10:42 <opendevreview> Tom Weininger proposed openstack/octavia master: DNM: profile using py-spy  https://review.opendev.org/c/openstack/octavia/+/878917
11:46 <opendevreview> Tom Weininger proposed openstack/octavia master: DNM: profile using py-spy  https://review.opendev.org/c/openstack/octavia/+/878917
12:07 <opendevreview> Tom Weininger proposed openstack/octavia stable/2023.1: DNM test Add octavia-grenade-slurp CI job  https://review.opendev.org/c/openstack/octavia/+/878928
12:46 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/wallaby: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878840
12:53 <skraynev> gthiemonge: hi again. could you please tell me what Adam Harwell's IRC nick is? I wanted to ask him about an old commit: https://opendev.org/openstack/octavia/commit/78f1c7b1286873e9c70d075a9f3b90304f54a929
12:54 <skraynev> it looks like I have similar symptoms and I wanted to hear his opinion about it
12:56 <tweining> skraynev: there must have been a merge conflict in https://review.opendev.org/c/openstack/octavia/+/878838. you should probably add a respective note to the commit message there
12:58 <skraynev> tweining: sure, is it free-form? or do I need to follow some pattern?
12:58 <tweining> I think it's only "Merge conflict:" and then the file path on the next line
12:59 <skraynev> got it. give me a sec to add it
12:59 <gthiemonge> skraynev: Adam is rm_work on IRC
13:00 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878838
13:00 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/xena: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878839
13:00 <opendevreview> Sergey Kraynev proposed openstack/octavia stable/wallaby: Fix ORM caching for with_for_update calls  https://review.opendev.org/c/openstack/octavia/+/878840
13:01 <tweining> thanks :)
13:01 <skraynev> tweining: done, for all pre-yoga PRs. thank you for pointing it out.
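[Editor's note] The convention tweining describes, a "Merge conflict:" note followed by the conflicting file path, would look roughly like this in a backport commit message. The body text, file path, Change-Id, and cherry-pick hash below are all made-up placeholders:

```
Fix ORM caching for with_for_update calls

<original commit message body>

Merge conflict:
path/to/conflicting_file.py

Change-Id: I1234deadbeef
(cherry picked from commit abcdef0)
```

The note must stay above the trailer block: as discussed earlier in the log, git-review reads the Change-Id from the last block of the message.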
13:01 <skraynev> gthiemonge: great! thank you :)
13:06 <skraynev> rm_work: I will add a note about the issue. I tried to create a huge LB (30 pools with 10 members in each pool). After octavia-worker finished the creation, I saw that octavia-health-manager eats almost 100% of CPU while handling the health message from such a big LB. It spends a lot of time updating the operating status of each member, pool and listener. As a result the HM misses health checks from other amphoras, and that leads to a cascade failover of the other amphoras.
13:08 <gthiemonge> skraynev: unfortunately, he is connected
13:09 <gthiemonge> skraynev: do you have the "THIS IS NOT GOOD" log messages? how many seconds do they show?
13:23 <skraynev> gthiemonge: the funniest thing is that I don't have such messages. I saw them later, after this cascade failover started. According to the log it looks like the HM processes only one health message and updates all members, while all the other health messages are ignored. In the logs I could find messages like "Received packet from", so it looks like messages reach the HM but are not processed, because after some period of time I don't see any "Health Update finished"
14:18 <rm_work> skraynev: ok I'll take a look at that commit
14:19 <rm_work> Ok yeah I remember this
14:19 <rm_work> You're still seeing issues?
14:19 <skraynev> rm_work: I don't think that commit is incorrect. I just want to say that I have a similar issue, but with a different root cause, which is not covered by that commit.
14:20 <rm_work> Ah hmm ok
14:21 <rm_work> I have not seen this issue resurface... for you it's because it takes too long to record all the statuses for a single amphora? Because of how many objects it has?
14:22 <skraynev> I only have the issue when I create, in one request, an LB with 30 pools with 10 members in each. The update_health function successfully executes the part right before updating the statuses of the resources. So the check added in your commit passes, but afterwards it looks like it takes a lot of time and CPU resources to process all the LB's inner resources
14:22 <rm_work> Hmm yeah so that blocks other health messages?
14:22 <rm_work> But it should avoid a cascade failure because of the short circuit there
14:23 <skraynev> so around 300 members + 30 pools + 30 listeners + some health monitors (not sure about the last one)
14:23 <rm_work> How many health manager nodes do you have?
14:23 <rm_work> Others should also be able to process things in time to prevent a cascade?
14:23 <skraynev> > Hmm yeah so that blocks other health messages? - yes, it looks so. and it eats a lot of CPU resources, going to the top of the process list
14:25 <rm_work> But even so, the health messages for other amphoras should have time to be processed on other HM nodes?
14:25 <skraynev> > Others should also be able to process things in time to prevent cascade? - yep, if it does not block, everything is fine
14:30 <skraynev> rm_work: potentially yes, it should. But I have seen that health check messages also stopped being processed on the other nodes. However I have not yet found the root cause (maybe they also started processing health messages from the same LB??)
14:31 <skraynev> note that the time at which processing stopped differs per node, and I use the yoga release. I also need to check the differences with later releases... maybe it was fixed there, not sure about it
14:32 <rm_work> Hmm yeah you could add additional log output to show more clearly exactly what is starting to be processed and when
14:33 <rm_work> Let me know if you figure out a more clear cause 🤔
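[Editor's note] The "THIS IS NOT GOOD" message the participants keep referring to appears to be a warning the health manager logs, with the elapsed seconds, when processing a heartbeat takes longer than the health-check interval (hence gthiemonge's "how many seconds do they show?"). A minimal sketch of that timing pattern; this is not Octavia's actual code, and the handlers and interval values are placeholders:

```python
import time

def process_with_timing(handler, message, interval_seconds):
    """Run a heartbeat handler and flag it when processing exceeds
    the health-check interval (the condition behind the health
    manager's slow-processing warning)."""
    start = time.monotonic()
    handler(message)
    elapsed = time.monotonic() - start
    too_slow = elapsed > interval_seconds
    if too_slow:
        # The real log message reports the elapsed seconds here.
        print(f"health update took {elapsed:.2f}s "
              f"(interval {interval_seconds}s) - THIS IS NOT GOOD")
    return elapsed, too_slow

# A handler updating hundreds of member/pool/listener rows can exceed
# the interval; simulate the slow case with a short sleep:
_, slow = process_with_timing(lambda m: time.sleep(0.05), {}, 0.01)
_, fast = process_with_timing(lambda m: None, {}, 10)
```

While one worker process is stuck inside such a slow update, heartbeats queued behind it go stale, which matches the cascade skraynev describes.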
14:44 <johnsom> skraynev: A couple of questions:
14:44 <johnsom> 1. How many cores does the health manager have available to it?
14:45 <johnsom> 2. What is your configuration setting for [health_manager] health_update_threads and stats_update_threads?
14:58 <skraynev> johnsom: 2: 4 for each
15:02 <skraynev> 1: 11, so each process has its own CPU. In total, with the HM, its subprocesses and the process pool executors, it comes to 11 processes: 1 main, 2 processes (the first a health checker, the second a UDP parser), 4 from a ProcessPoolExecutor (UDP update listener stats), 4 from a ProcessPoolExecutor (UDP update health)
15:04 <skraynev> Let me know if you figure out a more clear cause -> sure
15:04 <skraynev> and I forgot to say that I have 3 nodes with HM
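[Editor's note] The two options johnsom asks about live in the [health_manager] section of octavia.conf. With the values skraynev reports, the relevant fragment would look like this (a sketch; all other settings omitted):

```ini
[health_manager]
# Number of worker processes handling incoming health heartbeats
health_update_threads = 4
# Number of worker processes handling incoming statistics updates
stats_update_threads = 4
```

Together with the main process, the health checker, and the UDP parser, these two pools account for the 11 processes skraynev counts above.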
15:05 <johnsom> Typically issues like this are related to database issues, but if you don't have the "THIS IS NOT GOOD" log messages, it is probably keeping up ok. I just wonder if the process count is configured high for a VM/container that only has one core or similar. That would cause CPU contention.
15:10 <skraynev> Yes, before the cascade failover I had not seen the "THIS IS NOT GOOD" log messages, nor "Health Update finished in". I also tried to reproduce it just now and it looks like it passed :) On the previous try, after 1 minute I also ran a second, similar LB creation, so maybe the CPU was overloaded... not sure.
15:13 <skraynev> johnsom: thank you for pointing out CPU contention, I will try to check this case too.
15:14 <johnsom> Yeah, I would expect there to be some CPU load spike with that many pools/members. Also check the RAM usage. All of that processing is going to happen in RAM, so check if the instance is swapping and maybe blocking on IO
15:27 <skraynev> johnsom: so... it looks like 1 physical CPU with 6 cores and multi-threading... that's why I thought about 11(2) cpus :(
15:28 <johnsom> 6 cores with a setting of 4 for the threads should be ok IMO.
16:09 <skraynev_> johnsom: I did one more round of testing. If I create such an LB (300 members, 30 pools) and wait until it converges to ACTIVE, it passes. After that I can run a second one and it also passes. However my initial case was a bit riskier: I created one LB, waited about 1 min (it was still in progress), and ran a second, similar one. It looks like that overloaded the HM and the CPU (like a DDoS). But I still could not find the bottleneck. I will look at it tomorrow. Maybe it's related to CPU and memory...
16:34 <opendevreview> Tom Weininger proposed openstack/octavia stable/2023.1: DNM test Add octavia-grenade-slurp CI job  https://review.opendev.org/c/openstack/octavia/+/878928

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!