opendevreview | Ghanshyam proposed openstack/tempest master: Remove nova-network tests https://review.opendev.org/c/openstack/tempest/+/890471 | 00:19 |
---|---|---|
opendevreview | Ghanshyam proposed openstack/tempest master: Remove nova-network tests https://review.opendev.org/c/openstack/tempest/+/890471 | 01:17 |
opendevreview | Ghanshyam proposed openstack/tempest master: Remove nova-network tests https://review.opendev.org/c/openstack/tempest/+/890471 | 01:34 |
opendevreview | melanie witt proposed openstack/tempest master: DNM test rbd retype https://review.opendev.org/c/openstack/tempest/+/890360 | 05:21 |
opendevreview | Arnau Verdaguer proposed openstack/tempest master: [WIP][DNM] Test fix for LP#2027605 https://review.opendev.org/c/openstack/tempest/+/888770 | 10:39 |
opendevreview | Arnau Verdaguer proposed openstack/tempest master: [WIP][DNM] Test fix for LP#2027605 https://review.opendev.org/c/openstack/tempest/+/888770 | 10:42 |
opendevreview | Dan Smith proposed openstack/devstack master: Set acquire=0 on ProxyPass https://review.opendev.org/c/openstack/devstack/+/890526 | 13:42 |
opendevreview | Dan Smith proposed openstack/devstack master: Set acquire=0 on ProxyPass https://review.opendev.org/c/openstack/devstack/+/890526 | 13:43 |
opendevreview | Dan Smith proposed openstack/devstack master: Disable waiting forever for connpool workers https://review.opendev.org/c/openstack/devstack/+/890526 | 14:17 |
dansmith | gmann: so some data.. in the last two weeks we've had no ooms on the nova-ceph-multistore job | 15:11 |
dansmith | that's limited to 3x concurrency | 15:11 |
dansmith | I was going to say, none on nova-live-migration-ceph either, but that's a very limited set of tests, so not very useful | 15:16 |
dansmith | anyway, I'm wondering if the random failures through apache are also related to an increase in the number of things we have flying around now | 15:16 |
opendevreview | Dan Smith proposed openstack/tempest master: Revert "Increase the default concurrency for tempest run" https://review.opendev.org/c/openstack/tempest/+/890535 | 15:24 |
dansmith | just so I can recheck it a bunch of times ^ | 15:25 |
gmann | dansmith: ack, as you know ceph jobs run very limited scenario tests (only 3). let's check with this revert and other jobs too whether more concurrency is causing the ooms | 17:15 |
dansmith | gmann: yeah, but I think the scenarios probably aren't the ones causing the ooms | 17:16 |
dansmith | but yep, I just want to recheck it a bunch | 17:16 |
dansmith | it's about to pass the first attempt, only one real failure, the mkfs stamp_pattern thing | 17:16 |
dansmith | gmann: also, not sure if you saw, but what do you think of this? https://review.opendev.org/c/openstack/tempest/+/890350/1 | 17:17 |
dansmith | it's a bit of a stab in the dark, but might be worth trying for the stamp_pattern thing | 17:17 |
gmann | dansmith: ok, not checked yet. | 17:18 |
gmann | dansmith: agree, at least it is no different from a test scenario perspective, so if it helps that is a bonus | 17:19 |
dansmith | gmann: cool, I'll move that out on its own so maybe we can try to get that in soon | 17:20 |
opendevreview | Dan Smith proposed openstack/tempest master: Use vfat for timestamp https://review.opendev.org/c/openstack/tempest/+/890350 | 17:21 |
gmann | dansmith: cool | 17:21 |
dansmith | that concurrency revert finished in under 2h, so wall time isn't even bad | 17:22 |
dansmith | (logs uploading from the last job now) | 17:23 |
dansmith | the skip if no cinder fix got one of the uber-slow workers where it takes almost 2h to finish devstack, and is now failing lots of tests as expected | 18:02 |
dansmith | it's an n-v job though so it should still go to gate (to probably fail) | 18:03 |
dansmith | gmann: on the concurrency thing, one thing we could try is some sort of semaphore, where a test declares how many servers it's going to create, | 18:05 |
dansmith | and we only run enough tests in parallel to create that many servers | 18:05 |
dansmith | most will be one, but that would potentially allow us some higher concurrency for api tests and things, but still only 4 (or whatever) servers at any given point | 18:05 |
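
(A minimal sketch of the server-count semaphore dansmith describes above. No such mechanism exists in tempest today; `ServerBudget` and `max_servers` are hypothetical names used only to illustrate the idea.)

```python
import threading


class ServerBudget:
    """Hypothetical cap on concurrent servers across parallel test workers.

    Each test declares up-front how many servers it will create; acquire()
    blocks until that many slots are free, so API tests (declaring 0) can
    run at high concurrency while the total server count stays bounded.
    """

    def __init__(self, max_servers=4):
        self._free = max_servers
        self._cond = threading.Condition()

    def acquire(self, count):
        with self._cond:
            self._cond.wait_for(lambda: self._free >= count)
            self._free -= count

    def release(self, count):
        with self._cond:
            self._free += count
            self._cond.notify_all()


budget = ServerBudget(max_servers=4)
budget.acquire(2)      # a scenario test declaring two servers
try:
    pass               # ... create servers, run the test ...
finally:
    budget.release(2)  # slots return to the pool for waiting tests
```
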
opendevreview | Ghanshyam proposed openstack/tempest master: Skip test early to improve memory footprint and time https://review.opendev.org/c/openstack/tempest/+/890469 | 18:20 |
gmann | dansmith: not sure how much benefit it will give, oom was an issue even before at concurrency 4 | 18:20 |
dansmith | gmann: I think much less after my mysql tweaks.. | 18:21 |
dansmith | it is happening a *lot* now | 18:21 |
gmann | but increasing two more workers helped with the job timeout issue | 18:21 |
dansmith | it manifests in a lot of different ways.. sometimes a bunch of tests fail because it takes so long to respawn, but sometimes it's a single test failure and if you go look in syslog, there's a mysql oom | 18:22 |
dansmith | I know, but other things we did also helped with the timeout | 18:22 |
gmann | dansmith: as we know the mysql performance issue, should we run them in a few of the jobs like periodic instead of all by default? | 18:22 |
dansmith | which "them" ? | 18:23 |
gmann | yeah, all the other things also helped | 18:23 |
dansmith | anyway, I'm just saying, we're still failing a ton of things, fewer timeouts, but more OOMs | 18:23 |
dansmith | the ooms I've been looking at have all had six or seven qemu processes running, most using more memory than mysql | 18:24 |
opendevreview | Ghanshyam proposed openstack/tempest master: Add test for assisted volume snapshot https://review.opendev.org/c/openstack/tempest/+/864839 | 18:26 |
gmann | ok | 18:27 |
dansmith | anyway, let's recheck this revert several times and see | 18:27 |
gmann | dansmith: let's do more rechecks on the concurrency revert and see if we see ooms | 18:27 |
gmann | yeah | 18:27 |
dansmith | the way things were going before, we should get plenty of timeouts if we recheck a bunch of times, and the way things are now, we should get a few ooms at least if that's not the problem :) | 18:27 |
gmann | yeah, solving both is a little hard | 18:28 |
dansmith | yeah, and in the panic to improve we changed a lot of things at once | 18:28 |
dansmith | skip if not cinder back in the gate for another go | 18:29 |
dansmith | gmann: unrelated to ooms, I'm wondering if this will help us diagnose the issues where we seem to run out of apache workers: https://review.opendev.org/c/openstack/devstack/+/890526 | 18:31 |
gmann | +1, hoping we get this in at least | 18:32 |
dansmith | I could be wrong, but some of our "timeout talking to $service" things seem almost like we never make it past apache and then timeout | 18:32 |
dansmith | and I think that tweak will make us fail fast with something we can identify as "apache is out of workers" | 18:32 |
dansmith | right now, as I understand it, a connection will wait forever for a thread pool worker to make the proxy call, | 18:32 |
dansmith | which can cause us to timeout on the client side if it waits too long, even though eventually it will get a worker thread and make the call we originally asked for | 18:33 |
dansmith | so if we see some 503s popping up from apache, we can maybe increase the apache workers to help | 18:33 |
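
(For reference, roughly the kind of mod_proxy directive under discussion; the backend URL here is illustrative, and the actual devstack change at https://review.opendev.org/c/openstack/devstack/+/890526 may differ in detail. Per the Apache mod_proxy docs, `acquire` is the maximum time in milliseconds to wait for a free connection from the pool before returning SERVER_BUSY (503), and `retry` is the cooldown in seconds before a backend in the error state is tried again, so `retry=0` retries immediately.)

```apache
# acquire=1: wait at most 1ms for a free connection-pool slot, then fail
#            fast with a 503 instead of queueing indefinitely (the default)
# retry=0:   do not black-list an errored backend; retry it immediately
ProxyPass "/identity" "http://127.0.0.1:8080/" retry=0 acquire=1
```
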
gmann | you mean the case of identity error 500 we see? | 18:33 |
gmann | I think that is the same case here | 18:34 |
dansmith | no, the identity 500s are mysql OOMs ;) | 18:34 |
dansmith | here's one someone just rechecked: https://zuul.opendev.org/t/openstack/build/bbc550d6c8d54aee91a6a5d9166f45b8/log/controller/logs/syslog.txt#7417 | 18:34 |
gmann | this one right? | 18:34 |
gmann | tempest.lib.exceptions.IdentityError: Got identity error | 18:34 |
gmann | Details: Unexpected status code 500 | 18:34 |
dansmith | yep | 18:35 |
gmann | ohk | 18:35 |
dansmith | with that acquire=1 flag we should get a 503 from apache (not 500) if it has no pool workers to take the request after 1ms | 18:36 |
dansmith | if we still had elastic recheck I'd say we should add a check for mysql oom (or really any oom) and have an automatic comment to help mark them | 18:38 |
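
(A toy version of the elastic-recheck-style signature check dansmith wishes for above — hypothetical, not an existing tool. It assumes the kernel's usual "Out of memory: Kill(ed) process NNN (name)" syslog line, and the path argument would be whatever syslog the job archived, e.g. controller/logs/syslog.txt.)

```python
import re
import sys

# Kernel OOM-killer lines look like:
#   kernel: Out of memory: Killed process 12345 (mysqld) ...
# (older kernels say "Kill process" instead of "Killed process")
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process \d+ \((?P<proc>[^)]+)\)")


def find_oom_victims(path):
    """Return the names of processes killed by the OOM killer in a syslog."""
    with open(path, errors="replace") as f:
        return [m.group("proc") for line in f if (m := OOM_RE.search(line))]


if __name__ == "__main__":
    victims = find_oom_victims(sys.argv[1])
    if victims:
        # an elastic-recheck bot would leave a review comment instead
        print("OOM kill(s) detected: " + ", ".join(victims))
```
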
gmann | but it passes retry=0, so at least it will not retry, right? | 18:39 |
dansmith | yeah, but they're different stages of the request | 18:39 |
dansmith | retry= means "the remote side didn't reply to me" but that's a case where we did get a worker, made the request, and got no answer | 18:40 |
dansmith | acquire is for the "how long to wait for our own threadpool to give us a worker" | 18:40 |
dansmith | without a worker, we never even make the request | 18:40 |
gmann | yeah, I mean with that no-retry I think failing early makes sense | 18:41 |
dansmith | okay | 18:41 |
gmann | 1ms is enough also | 18:41 |
gmann | I think I agree with the idea, it will help in debugging | 18:41 |
dansmith | 1ms is the lowest value you can use.. without anything it will wait forever | 18:42 |
dansmith | right, this doesn't fix anything it just helps us identify if we end up exhausting apache | 18:42 |
dansmith | the only thing it might actually help, | 18:42 |
dansmith | is if apache says "sorry no workers" and some client waits and retries, and then makes it through | 18:43 |
dansmith | but the debugging is the goal | 18:43 |
gmann | yeah | 18:44 |
opendevreview | Ghanshyam proposed openstack/tempest master: Skip scenario tests early to avoid unnecessary setup https://review.opendev.org/c/openstack/tempest/+/890573 | 19:16 |
* dansmith | score one green passing run :) https://review.opendev.org/c/openstack/tempest/+/890535 | 19:32 |
dansmith | oops | 19:32 |
opendevreview | Ghanshyam proposed openstack/tempest master: Skip scenario tests early to avoid unnecessary setup https://review.opendev.org/c/openstack/tempest/+/890573 | 19:38 |
dansmith | gmann: all the scenario tests require cinder? | 19:42 |
dansmith | oh nm, | 19:42 |
dansmith | I misread the first change | 19:42 |
gmann | not all | 19:42 |
gmann | the neutron ones especially do not require it | 19:43 |
dansmith | yeah I misread the first change and was thinking that was in the base test | 19:43 |
gmann | ohk | 19:44 |
dansmith | skip if not cinder oomed again in check | 21:45 |