opendevreview | Sergey Kraynev proposed openstack/octavia stable/2023.1: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/878836 | 07:02 |
opendevreview | Sergey Kraynev proposed openstack/octavia stable/zed: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/878837 | 07:02 |
opendevreview | Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/878838 | 07:03 |
opendevreview | Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/879025 | 07:24 |
gthiemonge | skraynev: hi, thanks for proposing the backports, can you propose it also on xena and wallaby? | 07:25 |
skraynev | gthiemonge: sure, I'm just working on fixing the merge conflict for yoga. Unfortunately the patch from master does not backport cleanly to pre-zed releases. Once I fix the merge conflict, I will create the backports for the two other branches :) | 07:27 |
skraynev | gthiemonge: could you please remind me what I have to do with git-review to update an existing PR? I downloaded the change via git-review -d, fixed the merge conflict, did git commit --amend and git-review.. But it creates a new PR instead of updating the existing one... | 07:30 |
gthiemonge | skraynev: it should not create a new PR if the commit still has the correct "Change-Id" | 07:36 |
gthiemonge | https://review.opendev.org/c/openstack/octavia/+/879025 | 07:36 |
gthiemonge | skraynev: you added "Fixed merge ..." at the bottom, I think git-review requires you not to change the last lines of the commit message (it gets the Change-Id from the last block) | 07:37 |
skraynev | oh I see! thank you | 07:37 |
gthiemonge | skraynev: if the backports are not straightforward, you can ping me, I'll do it for you | 07:38 |
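A rough sketch of the flow gthiemonge describes for updating an existing review instead of creating a new one; the change number and file path are placeholders, not the actual backport:

```
# download the existing change from Gerrit (placeholder change number)
git review -d <change-number>
# ... resolve the merge conflict, then stage the affected files
git add <path/to/conflicting_file.py>
# amend the existing commit; edit the message above the footer only and keep
# the original "Change-Id: I..." line in the last block, otherwise Gerrit
# creates a brand new change
git commit --amend
# push the updated patch set back to the same review
git review
```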
opendevreview | Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/878838 | 07:38 |
skraynev | gthiemonge: I fixed it. Could you please do a brief review to check that it's correct: https://review.opendev.org/c/openstack/octavia/+/878838 . If it looks ok, I will backport it to xena and wallaby | 07:40 |
opendevreview | Sergey Kraynev proposed openstack/octavia stable/xena: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/878839 | 07:53 |
opendevreview | Sergey Kraynev proposed openstack/octavia stable/wallaby: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/878840 | 07:53 |
skraynev | gthiemonge: done. ps. thank you for merging and rechecking the patch yesterday | 07:54 |
Adri2000 | hello, I'm trying to reset octavia quotas for deleted projects... "openstack loadbalancer quota list" lists quotas for some deleted projects, but if I try to "openstack loadbalancer quota reset <project_id>" I get an error saying the project doesn't exist. any idea what I can do? | 08:25 |
Adri2000 | additional question: if I try the "reset" on an existing project, that'll set all quotas to "None" but the project will still be listed in "openstack loadbalancer quota list" (with everything set to "None") | 08:29 |
Adri2000 | is that expected? | 08:29 |
frickler | Adri2000: the first issue seems to stem from how OSC handles projects, you could either try to use the sdk directly or make a patch to OSC to allow using deleted projects | 08:52 |
Adri2000 | frickler: ok, indeed I tried running the HTTP API call directly and I don't get an error | 08:56 |
Adri2000 | about the behaviour of quota reset, is it worth filing a bug? | 08:57 |
Adri2000 | https://docs.openstack.org/api-ref/load-balancer/v2/index.html?expanded=reset-a-quota-detail#reset-a-quota I'd really expect that if I "DELETE" a project quota, I don't see that quota anymore in "openstack loadbalancer quota list" | 08:58 |
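For reference, the direct API call Adri2000 mentions is presumably something like the following (the endpoint variable and the project ID are placeholders; DELETE on a project's quota is the documented "reset" operation from the API ref linked above):

```
# obtain a token from Keystone and call the Octavia quota endpoint directly
TOKEN=$(openstack token issue -f value -c id)
curl -X DELETE \
     -H "X-Auth-Token: $TOKEN" \
     "$OCTAVIA_ENDPOINT/v2/lbaas/quotas/<project_id>"
```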
gthiemonge | Adri2000: I'm not familiar with the quota code, but I'll take a look at that | 08:59 |
Adri2000 | thanks | 09:01 |
gthiemonge | Adri2000: yeah the DELETE call deletes the quota, then recreates it with the default values | 09:13 |
gthiemonge | Adri2000: we need to discuss it with the team, but maybe there's a way to detect that the project no longer exists, and in that case we would not recreate the quota | 09:14 |
Adri2000 | would be nice indeed. the reason I ended up looking at this is because we realized we have a bunch of quotas defined for projects that no longer exist. | 09:21 |
Adri2000 | if you create a bug report, spec or whatever, please share the link :) | 09:21 |
opendevreview | Tom Weininger proposed openstack/octavia master: DNM: profile using py-spy https://review.opendev.org/c/openstack/octavia/+/878917 | 10:42 |
opendevreview | Tom Weininger proposed openstack/octavia master: DNM: profile using py-spy https://review.opendev.org/c/openstack/octavia/+/878917 | 11:46 |
opendevreview | Tom Weininger proposed openstack/octavia stable/2023.1: DNM test Add octavia-grenade-slurp CI job https://review.opendev.org/c/openstack/octavia/+/878928 | 12:07 |
opendevreview | Sergey Kraynev proposed openstack/octavia stable/wallaby: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/878840 | 12:46 |
skraynev | gthiemonge: hi again. Could you please tell me which IRC nick Adam Harwell uses? I wanted to ask him about an old commit: https://opendev.org/openstack/octavia/commit/78f1c7b1286873e9c70d075a9f3b90304f54a929 | 12:53 |
skraynev | looks like I have similar symptoms and wanted to hear his opinion about it | 12:54 |
tweining | skraynev: there must have been a merge conflict in https://review.opendev.org/c/openstack/octavia/+/878838. you should probably add a respective note to the commit message there | 12:56 |
skraynev | tweining: sure, is it free form? Or do I need to follow some pattern? | 12:58 |
tweining | I think it's only "Merge conflict:" and then the file path in the next line | 12:58 |
skraynev | got it. give me a sec to add it | 12:59 |
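As an illustration, the amended commit message might then look roughly like this (the description, file path, SHA and Change-Id are placeholders; the conflict note wording follows tweining's suggestion, and the Change-Id footer stays untouched as gthiemonge pointed out earlier):

```
Fix ORM caching for with_for_update calls

<original description from the master patch>

Merge conflict:
<path/to/conflicting_file.py>

(cherry picked from commit <master-sha>)
Change-Id: <unchanged Change-Id of this backport>
```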
gthiemonge | skraynev: Adam is rm_work on IRC | 12:59 |
opendevreview | Sergey Kraynev proposed openstack/octavia stable/yoga: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/878838 | 13:00 |
opendevreview | Sergey Kraynev proposed openstack/octavia stable/xena: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/878839 | 13:00 |
opendevreview | Sergey Kraynev proposed openstack/octavia stable/wallaby: Fix ORM caching for with_for_update calls https://review.opendev.org/c/openstack/octavia/+/878840 | 13:00 |
tweining | thanks :) | 13:01 |
skraynev | tweining: done, for all the pre-yoga PRs. Thank you for pointing it out. | 13:01 |
skraynev | gthiemonge: Great! thank you :) | 13:01 |
skraynev | rm_work: I will add a note about the issue. I tried to create a huge LB (30 pools with 10 members in each pool). After the octavia-worker finished the creation, I see that octavia-health-manager eats almost 100% of CPU while handling the health message from such a big LB. It spends a lot of time updating the operating statuses of each member, pool and listener. As a result HM misses health checks from other amphorae, and that leads to a cascade | 13:06 |
skraynev | failover of other amphorae. | 13:06 |
gthiemonge | skraynev: unfortunately, he is not connected | 13:08 |
gthiemonge | skraynev: do you have the "THIS IS NOT GOOD" log messages? how many seconds do they show? | 13:09 |
skraynev | gthiemonge: the funniest thing is that I don't have such messages. I saw them later, after this cascade failover started. According to the logs it looks like HM processes only one health message and updates all its members, while all the other health messages are ignored. In the logs I can find messages like "Received packet from", so it looks like the messages reach HM but cannot be processed, because after some period of time I don't see any | 13:23 |
skraynev | "Health Update finished" | 13:23 |
rm_work | skraynev: ok I’ll take a look at that commit | 14:18 |
rm_work | Ok yeah I remember this | 14:19 |
rm_work | You’re still seeing issues? | 14:19 |
skraynev | rm_work: I don't think that commit is incorrect. I just want to say that I have a similar issue, but with a different root cause, which is not covered by the commit. | 14:19 |
rm_work | Ah hmm ok | 14:20 |
rm_work | I have not seen this issue resurface… for you it’s because it takes too long to record all the statuses for a single amphora? Because of how many objects it has? | 14:21 |
skraynev | I have the issue only when I try to create, in one request, an LB with 30 pools with 10 members in each. The update_health function successfully executes the part right before updating the statuses of the resources. So the check added in your commit passes, but after that it looks like it takes a lot of time and CPU resources to process all the inner LB resources | 14:22 |
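A hedged sketch of what such a single-request ("fully populated") load balancer create looks like on the API, truncated to a single listener/pool/member; skraynev's actual payload would repeat the pool block 30 times with 10 members each, and all names, IDs and addresses here are placeholders:

```
curl -X POST "$OCTAVIA_ENDPOINT/v2/lbaas/loadbalancers" \
     -H "X-Auth-Token: $TOKEN" -H "Content-Type: application/json" -d '{
  "loadbalancer": {
    "name": "big-lb",
    "vip_subnet_id": "<subnet-id>",
    "listeners": [{
      "name": "listener-1", "protocol": "HTTP", "protocol_port": 80,
      "default_pool": {
        "name": "pool-1", "protocol": "HTTP", "lb_algorithm": "ROUND_ROBIN",
        "members": [
          {"address": "10.0.0.11", "protocol_port": 80}
        ]
      }
    }]
  }
}'
```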
rm_work | Hmm yeah so that blocks other health messages? | 14:22 |
rm_work | But it should avoid a cascade failure because of the short circuit there | 14:22 |
skraynev | so around 300 members + 30 pool resources + 30 listeners + some health monitors (not sure about the last one) | 14:23 |
rm_work | How many health manager nodes do you have? | 14:23 |
rm_work | Others should also be able to process things in time to prevent cascade? | 14:23 |
skraynev | > Hmm yeah so that blocks other health messages? - yes, it looks so, and it eats a lot of CPU resources - it goes to the top of the process list | 14:23 |
rm_work | But even so, the health messages for other amphora should have time to be processed on other HM nodes? | 14:25 |
skraynev | > Others should also be able to process things in time to prevent cascade? - yep, if it does not block, everything is nice | 14:25 |
skraynev | rm_work: potentially yes, it should - I have seen that health check messages also stopped being processed on the other nodes. However I have not yet found the root cause (maybe they also started processing a health message from the same LB??) | 14:30 |
skraynev | note that the time when processing stops is different on each node, and I use the Yoga release. I also need to check the differences with later releases too.. maybe it was fixed there.. not sure about it | 14:31 |
rm_work | Hmm yeah you could add additional log output to show more clearly exactly what is starting to be processed and when | 14:32 |
rm_work | Let me know if you figure out a more clear cause 🤔 | 14:33 |
johnsom | skraynev A couple of questions: | 14:44 |
johnsom | 1. How many cores does the health manager have available to it? | 14:44 |
johnsom | 2. What is your configuration setting for [health_manager] health_update_threads and stats_update_threads? | 14:45 |
skraynev | johnsom: 2: 4 for each | 14:58 |
skraynev | 1: 11 - so each process has its own CPU. In total, with the HM and its subprocesses + process pool executors, it comes to 11 processes: 1 main, 2 processes (the first a health checker, the second a UDP parser), 4 from a ProcessPoolExecutor (UDP update listener stats), 4 from a ProcessPoolExecutor (UDP update health) | 15:02 |
skraynev | Let me know if you figure out a more clear cause -> sure | 15:04 |
skraynev | and forgot to say, that I have 3 nodes with HM | 15:04 |
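For context, the two settings johnsom asks about live in the [health_manager] section of octavia.conf; the values below are the ones skraynev reports (4 each), shown here only as an illustrative excerpt:

```
[health_manager]
# number of processes in the health-update ProcessPoolExecutor
health_update_threads = 4
# number of processes in the stats-update ProcessPoolExecutor
stats_update_threads = 4
```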
johnsom | Typically issues like this are related to database issues, but if you don't have the "THIS IS NOT GOOD" log messages, it is probably keeping up ok. I just wonder if the process count is configured high for a VM/container that only has one core or similar. This would cause CPU contention. | 15:05 |
skraynev | Yes - before the cascade failover I did not see the "THIS IS NOT GOOD" log messages, nor "Health Update finished in". I also tried to repeat it right now and it looks like it passed :) On the previous try, I also started a second similar LB creation after 1 minute, so maybe the CPU was overloaded... not sure. | 15:10 |
skraynev | johnsom: thank you for pointing out CPU contention, I will try to check this case too. | 15:13 |
johnsom | Yeah, I would expect there to be some CPU load spike with that many pools/members. Also check the RAM usage. All of that processing is going to happen in RAM, so check if the instance is swapping and maybe blocking on IO | 15:14 |
skraynev | johnsom: so.. it looks like it is 1 physical CPU with 6 cores and multi-threading... that's why I thought there were 11(2) CPUs :( | 15:27 |
johnsom | 6 cores with a setting of 4 for the threads, should be ok IMO. | 15:28 |
skraynev_ | johnsom: I did one more round of testing. If I create such an LB (300 members, 30 pools) and wait until it converges to ACTIVE, it passes. After that I can run a second one and it also passes. However my initial case was a bit more risky: I created one LB, waited about 1 min (creation was still in progress), and started a second similar one. Looks like it overloaded the HM and the CPU (like a DDoS). But I still | 16:09 |
skraynev_ | could not find the bottleneck. I will look at it tomorrow. Maybe it's related to CPU and memory... | 16:09 |
opendevreview | Tom Weininger proposed openstack/octavia stable/2023.1: DNM test Add octavia-grenade-slurp CI job https://review.opendev.org/c/openstack/octavia/+/878928 | 16:34 |