wxy-xiyuan | openEuler mirror (timeout problem) should be fixed by https://review.opendev.org/c/openstack/diskimage-builder/+/878807 Let's wait a bit longer. | 02:50 |
opendevreview | melanie witt proposed openstack/devstack master: DNM Create flavors with disk encryption when enabled https://review.opendev.org/c/openstack/devstack/+/864160 | 03:55 |
*** elodilles_pto is now known as elodilles | 08:04 | |
kopecmartin | #startmeeting qa | 15:00 |
opendevmeet | Meeting started Tue Apr 4 15:00:11 2023 UTC and is due to finish in 60 minutes. The chair is kopecmartin. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:00 |
opendevmeet | The meeting name has been set to 'qa' | 15:00 |
kopecmartin | #link https://wiki.openstack.org/wiki/Meetings/QATeamMeeting#Agenda_for_next_Office_hours | 15:00 |
kopecmartin | agenda ^^ | 15:00 |
lpiwowar | o/ | 15:00 |
frickler | o/ | 15:00 |
kopecmartin | \o | 15:01 |
kopecmartin | #topic Announcement and Action Item (Optional) | 15:03 |
kopecmartin | nothing from my side, other than we've had PTG last week | 15:03 |
kopecmartin | which leads me to | 15:04 |
kopecmartin | #topic OpenStack Events Updates and Planning | 15:04 |
kopecmartin | this was the PTG etherpad | 15:04 |
kopecmartin | #link https://etherpad.opendev.org/p/qa-bobcat-ptg | 15:04 |
kopecmartin | if you couldn't attend the session, feel free to go through that | 15:04 |
kopecmartin | I'll send an email summarizing the discussions some time this week | 15:05 |
lpiwowar | ack +1 | 15:06 |
kopecmartin | #topic Bobcat Priority Items progress | 15:06 |
kopecmartin | this is the previous etherpad with the priority items | 15:06 |
kopecmartin | #link https://etherpad.opendev.org/p/qa-antelope-priority | 15:06 |
kopecmartin | I've moved the unfinished items to the new one | 15:07 |
kopecmartin | #link https://etherpad.opendev.org/p/qa-bobcat-priority | 15:07 |
kopecmartin | gmann: do we still need to/want to track SRBAC? | 15:07 |
kopecmartin | however, i haven't translated the PTG discussions into new priority items yet | 15:07 |
kopecmartin | I'll try to work on it this week too | 15:08 |
kopecmartin | if you see something in the PTG discussion which you know is worth tracking as a priority item, feel free to draft something in the new priority etherpad | 15:08 |
kopecmartin | now business as usual | 15:09 |
kopecmartin | #topic Gate Status Checks | 15:09 |
kopecmartin | #link https://review.opendev.org/q/label:Review-Priority%253D%252B2+status:open+(project:openstack/tempest+OR+project:openstack/patrole+OR+project:openstack/devstack+OR+project:openstack/grenade) | 15:09 |
kopecmartin | nothing there | 15:09 |
kopecmartin | any other urgent reviews? | 15:10 |
lpiwowar | Maybe this when you have time please: https://review.opendev.org/c/openstack/tempest/+/878074 ? | 15:10 |
kopecmartin | sure | 15:10 |
lpiwowar | thanks! | 15:10 |
kopecmartin | #topic Bare rechecks | 15:12 |
kopecmartin | #link https://etherpad.opendev.org/p/recheck-weekly-summary | 15:12 |
kopecmartin | no new data there | 15:12 |
kopecmartin | #topic Periodic jobs Status Checks | 15:12 |
kopecmartin | stable | 15:12 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=tempest-full-yoga&job_name=tempest-full-xena&job_name=tempest-full-wallaby-py3&job_name=tempest-full-victoria-py3&job_name=tempest-full-ussuri-py3&job_name=tempest-full-zed&pipeline=periodic-stable | 15:12 |
kopecmartin | master | 15:12 |
kopecmartin | #link https://zuul.openstack.org/builds?project=openstack%2Ftempest&project=openstack%2Fdevstack&pipeline=periodic | 15:12 |
kopecmartin | very green, nice! \o/ | 15:13 |
kopecmartin | no failures we didn't already know about | 15:13 |
kopecmartin | #topic Distros check | 15:13 |
kopecmartin | cs-9 | 15:14 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=tempest-full-centos-9-stream&job_name=devstack-platform-centos-9-stream&skip=0 | 15:14 |
kopecmartin | fedora | 15:14 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-fedora-latest&skip=0 | 15:14 |
kopecmartin | debian | 15:14 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-debian-bullseye&skip=0 | 15:14 |
kopecmartin | focal | 15:14 |
kopecmartin | #link https://zuul.opendev.org/t/openstack/builds?job_name=devstack-platform-ubuntu-focal&skip=0 | 15:14 |
kopecmartin | rocky | 15:14 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-rocky-blue-onyx | 15:14 |
kopecmartin | openEuler | 15:14 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-openEuler-22.03-ovn-source&job_name=devstack-platform-openEuler-22.03-ovs&skip=0 | 15:14 |
frickler | seems we need a periodic 2023.1 job now? | 15:14 |
kopecmartin | yes | 15:15 |
frickler | for openeuler there was a dib fix mentioned earlier that might help with the timeouts | 15:15 |
frickler | #link https://review.opendev.org/c/openstack/diskimage-builder/+/878807 | 15:15 |
frickler | ianw: ^^ maybe you can have a look | 15:15 |
kopecmartin | frickler: that's great, thanks | 15:17 |
frickler | also seems we don't run wallaby and older anymore, so you could clean up the above periodic link | 15:17 |
kopecmartin | yeah, i'll update the agenda, i did some modifications before the meeting, but haven't updated everything :/ | 15:18 |
kopecmartin | #topic Sub Teams highlights | 15:19 |
kopecmartin | Changes with Review-Priority == +1 | 15:19 |
kopecmartin | #link https://review.opendev.org/q/label:Review-Priority%253D%252B1+status:open+(project:openstack/tempest+OR+project:openstack/patrole+OR+project:openstack/devstack+OR+project:openstack/grenade) | 15:19 |
kopecmartin | nothing there | 15:19 |
kopecmartin | #topic Open Discussion | 15:19 |
kopecmartin | anything for the open discussion? | 15:19 |
frickler | I started another round of abandoning old devstack patches | 15:19 |
frickler | reviving some that don't seem completely outdated | 15:20 |
kopecmartin | perfect, thanks | 15:21 |
frickler | I also started testing devstack on bookworm, now that it is mostly frozen | 15:21 |
frickler | sadly it seems that global pip installs are no longer wanted | 15:21 |
frickler | maybe a good opportunity to finally set up some global venv instead | 15:22 |
frickler | not sure how much time I'll have to invest in that, so anyone joining in the effort would be most welcome | 15:23 |
frickler | https://paste.opendev.org/show/bCElNqBBVUCMOB957ZEj/ is what happens in a test run | 15:24 |
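For context on the bookworm problem above: Debian 12 marks its system Python as externally managed (PEP 668), so a bare system-wide pip install is refused, which is what the pasted test run runs into. A minimal sketch of the failure mode and the venv-style alternative being discussed; the package name and venv path here are only illustrative, not what devstack actually uses:

    # On bookworm a global install is rejected by pip itself:
    sudo pip install -U python-openstackclient
    #   error: externally-managed-environment
    # The "global venv" idea: install everything into one shared virtualenv instead
    python3 -m venv /opt/stack/venv
    /opt/stack/venv/bin/pip install -U python-openstackclient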
frickler | I'll also try to look into dib support so we can actually run things in opendev CI | 15:25 |
clarkb | worth noting the next ubuntu lts will also break global pip installs | 15:27 |
kopecmartin | hm, perfect, something to look forward to :D | 15:28 |
clarkb | also you can test global venv without bookworm images | 15:29 |
frickler | yes, I think if/when we do this, we would switch everything, not make distro specific things | 15:29 |
clarkb | I have/had changes up for it from years ago that didn't get a lot of momentum. I'm happy for others to pick that up and push updates or abandon my changes and push new ones | 15:29 |
clarkb | but I am unlikely to be able to devote dedicated time to it now | 15:29 |
frickler | yes, I should revisit those patches | 15:31 |
frickler | guess that's it for now? | 15:36 |
frickler | maybe that would also help with the ceph jobs | 15:36 |
kopecmartin | i guess so | 15:36 |
* kopecmartin updating the agenda | 15:36 | |
kopecmartin | #topic Bug Triage | 15:36 |
kopecmartin | #link https://etherpad.openstack.org/p/qa-bug-triage-bobcat | 15:36 |
kopecmartin | the new bug number tracker ^ | 15:37 |
kopecmartin | that's all from my side | 15:37 |
kopecmartin | if there isn't anything else, let's end the office hour | 15:37 |
frickler | ack, thx kopecmartin | 15:38 |
kopecmartin | thanks everyone | 15:38 |
kopecmartin | #endmeeting | 15:38 |
opendevmeet | Meeting ended Tue Apr 4 15:38:23 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 15:38 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/qa/2023/qa.2023-04-04-15.00.html | 15:38 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/qa/2023/qa.2023-04-04-15.00.txt | 15:38 |
opendevmeet | Log: https://meetings.opendev.org/meetings/qa/2023/qa.2023-04-04-15.00.log.html | 15:38 |
lpiwowar | thanks kopecmartin | 15:45 |
*** iurygregory_ is now known as iurygregory | 16:24 | |
opendevreview | Merged openstack/tempest master: Add project reader to account-generator https://review.opendev.org/c/openstack/tempest/+/878074 | 17:17 |
gmann | dansmith: if you are around, can you please check this grenade change https://review.opendev.org/c/openstack/grenade/+/879113 | 19:44 |
dansmith | gmann: yep | 19:47 |
JayF | Heads up, just letting you all know we are testing an Ironic upgrade issue with the grenade job, that setting MYSQL_REDUCE_MEMORY:false seems to clear up | 19:54 |
JayF | we've gotten 2 clean passes on an intermittent issue, will recheck the grenade-only change a few times to get more information | 19:54 |
dansmith | did we switch that to default on? | 19:54 |
JayF | https://review.opendev.org/c/openstack/ironic/+/879494 and https://review.opendev.org/c/openstack/ironic/+/879495/ | 19:54 |
gmann | i do not think so, we defaulted that yet | 19:54 |
dansmith | I thought not | 19:54 |
JayF | I thought it was in https://github.com/openstack/devstack/commit/7567359755a105e7278bbf97541332f28228b87d#diff-108978819c05ae183d88ec87959c2341a94cfc3f9465e3aeee82d554217b4f58R701 | 19:55 |
JayF | if that's not true this is a red herring and we just got "lucky" (which is really unlucky) | 19:55 |
dansmith | only for one job | 19:55 |
dansmith | the devstack-multinode job | 19:55 |
JayF | ah | 19:55 |
dansmith | are you inheriting from that job? | 19:55 |
JayF | yeah I see below, it's defaulted to false | 19:55 |
JayF | well, i'm glad I said something because we were well down the creek of bad assumptions based on a misreading the first time | 19:56 |
JayF | iurygregory: ^ | 19:56 |
gmann | yeah, default is false and only the multinode job enables it, plus a few in nova | 19:56 |
JayF | iurygregory: tl;dr it shouldn't be enabled unless either we explicitly enabled it or we're using the multinode job | 19:56 |
JayF | iurygregory: so the passes are probably just coincidence :( | 19:56 |
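For reference, MYSQL_REDUCE_MEMORY is a plain devstack setting, so a local reproduction or a job can pin it explicitly instead of relying on the default. A sketch of a local.conf entry, assuming the stock local.conf layout:

    [[local|localrc]]
    # Defaults to False; per the discussion only the multinode job (and a few
    # nova jobs) turn it on, so ironic-grenade should not have it enabled.
    MYSQL_REDUCE_MEMORY=False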
dansmith | it's certainly possible that flag can cause unintended things resulting in failures, but .. probably only if enabled :) | 19:58 |
iurygregory | hummm | 19:58 |
dansmith | what are the failures you thought were related? | 19:58 |
JayF | The symptom, I believe, is a VIP in neutron that can't be pinged after the upgrade | 19:58 |
JayF | iurygregory has been looking more in depth and might have a log that could be linked | 19:58 |
JayF | I don't have one at hand | 19:58 |
dansmith | and there was some pointer to database issues? | 19:59 |
opendevreview | Martin Kopec proposed openstack/tempest master: Enable file injection tests https://review.opendev.org/c/openstack/tempest/+/879510 | 19:59 |
JayF | dansmith: more of, it prompted us to go spelunking through failure statistics vs recent changes | 19:59 |
JayF | dansmith: it lined up, and anytime something is intermittent I assume performance could be a component | 19:59 |
dansmith | okay | 20:00 |
dansmith | good at least that we didn't find the first data point of that flag breaking things I guess ;) | 20:08 |
JayF | good that I mentioned it in here and got the misreading exposed before we ended up WAY down the wrong path | 20:08 |
JayF | thanks for that :D | 20:09 |
dansmith | my half of the deal seems better, but..sure :) | 20:10 |
iurygregory | so, TheJulia was the one who found out that we were getting a 404 from the DB | 20:25 |
iurygregory | I could only trace it to the point where we see the flavor wasn't created in nova | 20:25 |
iurygregory | https://zuul.opendev.org/t/openstack/build/21790f36aef540d9a7af420356baeafb/log/controller/logs/grenade.sh_log.txt#1903 | 20:26 |
iurygregory | https://zuul.opendev.org/t/openstack/build/21790f36aef540d9a7af420356baeafb/log/controller/logs/screen-n-api.txt#4224 | 20:26 |
JayF | that's different than the failures I saw | 20:27 |
iurygregory | fun | 20:28 |
iurygregory | .-. | 20:28 |
frickler | as I mentioned in #-nova, that is normal OSC behavior | 20:28 |
frickler | it checks whether the value works as ID, then tries to look up the name after that | 20:28 |
iurygregory | I also remember Dmitry was looking and he saw a problem related to a floating IP not being pingable | 20:28 |
JayF | yeah, that's the one I've dug into any at all | 20:29 |
iurygregory | so maybe we have different issues based on cloud-providers... | 20:29 |
TheJulia | But then how is it that the instance creation fails because the flavor isn't found? | 20:29 |
frickler | where do you see that? | 20:31 |
TheJulia | The post response on the job log I linked to a few hours ago | 20:32 |
frickler | in the above run, the failure in grenade is: 2023-04-03 11:09:29.388 | HttpException: 501: Server Error for url: http://10.209.0.254/compute/v2.1/servers/e635780d-a265-4ce8-b350-328f6ee7fc7f/action, The requested functionality is not supported. | 20:33 |
TheJulia | Literally the instance creation post resulted in a failure where the object cannot be found in the db | 20:33 |
TheJulia | Bottom of the n-api log | 20:33 |
frickler | can you post the link again to make sure we are looking at the same job? | 20:34 |
TheJulia | Yeah, give me a few, I stepped away due to a migraine | 20:35 |
TheJulia | And a cat is seeking attention | 20:35 |
frickler | o.k., it's late here, I can take another look my morning then | 20:36 |
TheJulia | frickler: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b1d/879215/3/check/ironic-grenade/b1d6cf4/controller/logs/screen-n-api.txt | 20:40 |
TheJulia | frickler: Apr 03 22:47:03.961414 appears to be the start of the error, which appears to be tied to the last logged POST for the instance creation, which the apache log records as resulting in a 500 at 22:47:04. At least, that is the last apache log entry that seemed to be related. The n-api logs make it seem like something is talking to it directly outside of apache | 20:42 |
TheJulia | so there is a possibility the logs are still a red herring, but nothing for the POST requesting the instance creation seems to be logged, and the timing seems to match up, so just crazy weird in my book | 20:43 |
TheJulia | hmm, the logged request says "request: GET /compute/v2.1/flavors/baremetal" | 20:44 |
TheJulia | the actual /compute/v2.1/servers post is not in the nova-api log at all :\ | 20:45 |
TheJulia | even though it is in the apache logs | 20:45 |
dansmith | some sort of split-brain going on? | 20:49 |
dansmith | like there's two novas and you create the flavor in one but try to boot from the other? | 20:49 |
dansmith | either at the api level or there's two (api) databases? | 20:49 |
dansmith | that would be similar to the rbac "can't see the flavor because of permissions" issue someone else mentioned | 20:50 |
dansmith | and to be clear, this is only on the ironic jobs, not the regular grenade job? | 20:50 |
TheJulia | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b1d/879215/3/check/ironic-grenade/b1d6cf4/controller/logs/apache/access_log.txt the apache log | 20:55 |
TheJulia | so, rbac wise, unless the old policy was removed, it is not enforced on the job | 20:55 |
TheJulia | the ironic job running grenade specifically | 20:55 |
TheJulia | so... there shouldn't be a split brain sort of case, but that *is* what it kind of feels like since nova-api logs don't really seem to jive with the apache logs | 20:56 |
JayF | it'd be interesting to inspect a held node in this broken state, unless we think any potential split-brain is ephemeral | 20:56 |
TheJulia | https://zuul.opendev.org/t/openstack/build/b1d6cf4027924c01a61fe07281ad3dd4 | 20:57 |
TheJulia | https://github.com/openstack/ironic/blob/master/zuul.d/ironic-jobs.yaml#L845-L933 | 20:57 |
TheJulia | JayF: Yeah, I think that is the path forward, tbh | 20:57 |
dansmith | or just throw a couple debug commands into the job, dumping all the processes and the keystone catalog | 20:58 |
dansmith | I feel like holding a node is cheating, but obviously easier for debugging | 20:58 |
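A sketch of the kind of in-job debug dump being suggested, run just before the step that hangs; the output path and the exact set of commands are arbitrary choices here:

    # Dump processes, the keystone catalog, and listening sockets for later comparison
    ps auxww | grep -E 'nova|neutron|apache2' > /tmp/debug-dump.txt
    openstack catalog list >> /tmp/debug-dump.txt
    sudo ss -tlnp >> /tmp/debug-dump.txt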
TheJulia | :) | 20:59 |
TheJulia | JayF: should we just ask opendev to hold nodes from ironic-grenade failing? | 21:06 |
opendevreview | Merged openstack/grenade master: Fix tempest venv constraints for target branch tempest run https://review.opendev.org/c/openstack/grenade/+/879113 | 21:06 |
JayF | I wouldn't say "nodes". A node, sure. I don't want to ask as I don't have time to dedicate to this right now, I can help but don't wanna lead. | 21:06 |
TheJulia | I can, but they are not going to drop it until tomorrow most likely if it catches one today | 21:07 |
TheJulia | just speaking from end-of-day experience | 21:07 |
TheJulia | :) | 21:07 |
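For reference, the "hold" being discussed is a Zuul autohold: an opendev admin flags the job so the node from the next failing build is kept for SSH access instead of being recycled. Roughly this shape, though the exact invocation (and the canonical project name form) is up to infra-root and is an assumption here:

    zuul-client autohold --tenant openstack \
        --project opendev.org/openstack/ironic \
        --job ironic-grenade \
        --reason "debugging ironic-grenade flavor/neutron hang" \
        --count 1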
dansmith | I just want to clarify - you don't see this in any grenade jobs, only the ironic grenade job right? | 21:11 |
dansmith | assuming so, I guess I'd be looking at what you could possibly be doing differently here | 21:11 |
TheJulia | well, we create a baremetal flavor, and grenade is informed to use it | 21:13 |
dansmith | I guess your special flavor is one thing | 21:13 |
TheJulia | yeah, that *should* basically be it off the top of my head | 21:13 |
dansmith | is that done on the old side? | 21:13 |
TheJulia | flavor is created on the old side | 21:13 |
TheJulia | but it immediately fails it looks like | 21:13 |
dansmith | and the failure happens post-upgrade? | 21:13 |
TheJulia | but returns a 200 | 21:13 |
dansmith | oh, still fails on the old side? | 21:14 |
TheJulia | that is what it looks like based upon the nova api logs | 21:14 |
TheJulia | which I think is why this seems so confusing | 21:14 |
dansmith | yeah, old I see | 21:15 |
dansmith | Apr 03 22:41:05.982907 np0033643434 devstack@n-api.service[85463]: DEBUG nova.api.openstack.wsgi [None req-1e785296-d62a-442e-b51a-24c4719a6953 admin admin] Action: 'create', calling method: <bound method FlavorManageController._create of <nova.api.openstack.compute.flavor_manage.FlavorManageController object at 0x7f3eddbf5fc0>>, body: {"flavor": {"name": "baremetal", "disk": 10, "ram": 1024, "rxtx_factor": 1.0, "vcpus": 1, "id": null, | 21:16 |
dansmith | "swap": 0, "os-flavor-access:is_public": true, "OS-FLV-EXT-DATA:ephemeral": 0}} {{(pid=85463) _process_stack /opt/stack/old/nova/nova/api/openstack/wsgi.py:511}} | 21:16 |
dansmith | that's the create | 21:16 |
TheJulia | yup | 21:16 |
dansmith | and yeah, literally the next operation is a show which fails | 21:17 |
TheJulia | ooooh | 21:18 |
TheJulia | different PIDs | 21:18 |
TheJulia | the last not found is 85464 | 21:19 |
dansmith | the first one is the same pid as the post to create it though | 21:19 |
TheJulia | yup | 21:19 |
TheJulia | and 63 is the one the failed get is on as well :\ | 21:20 |
dansmith | that's what I meant.. created on 63, failed immedately on 63 | 21:21 |
dansmith | maybe you mentioned earlier, | 21:21 |
dansmith | but the devstack log shows that it creates the flavor and then sets a bunch of properties on it, which all succeed | 21:22 |
TheJulia | I just noticed the different PIDs in play, but yeah :\ | 21:22 |
dansmith | I believe you recognized it as a distinct lack of necessary jiveness | 21:22 |
TheJulia | I think that was a single post | 21:22 |
TheJulia | meh, a lot goes over my head when I've got a migraine | 21:23 |
dansmith | no, | 21:23 |
dansmith | it's like six different openstack commands | 21:23 |
TheJulia | hmm | 21:24 |
TheJulia | oh, yup, I see it | 21:25 |
dansmith | I wonder if those are failing and osc isn't telling you | 21:25 |
TheJulia | this is a good question | 21:26 |
* TheJulia looks for the apache log | 21:26 | |
TheJulia | apache reports 200 response codes | 21:27 |
dansmith | so the several flavor shows right after that seem to line up, time-wise | 21:27 |
dansmith | ah, okay | 21:29 |
TheJulia | ig ues the client is doing the name matching? | 21:29 |
TheJulia | err, I guess | 21:29 |
dansmith | so the first show that fails, | 21:29 |
dansmith | is it trying to get it by id baremetal, which is wrong, | 21:29 |
TheJulia | at least based upon the query pattern | 21:29 |
dansmith | so then the client does a list, 200 | 21:29 |
dansmith | and then get by uuid, 200 | 21:29 |
dansmith | ah yep, what you said.. the client is resolving the name | 21:29 |
dansmith | tries the name as the id first | 21:29 |
dansmith | note how the base devstack flavors create with an id: | 21:30 |
dansmith | 2023-04-03 22:36:37.827408 | controller | + lib/nova:create_flavors:1258 : openstack --os-region-name=RegionOne flavor create --id 5 --ram 16384 --disk 160 --vcpus 8 --property hw_rng:allowed=True m1.xlarge | 21:30 |
dansmith | ya'll don't, so you only have the name | 21:30 |
dansmith | and the id is auto-assigned as a uuid | 21:31 |
TheJulia | does the type of id impact the ability to reference the name? | 21:34 |
dansmith | and tempest is configured with ids directly not names | 21:34 |
dansmith | tbh, I didn't think we allowed flavors by name in the API, and relied on the client to resolve, but don't quote me on that | 21:34 |
dansmith | but at the very least that might account for the difference | 21:35 |
dansmith | maybe we do, but we suck at it or something | 21:35 |
dansmith | certainly if there are two with the same name somehow, that would be a problem | 21:35 |
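The client-side resolution frickler and dansmith describe is easy to observe with --debug: OSC first tries the supplied value as an ID, and only on a 404 falls back to listing flavors and matching by name. A sketch against the flavor from this job; the grep is only there to surface the relevant requests:

    # First request tries "baremetal" as an ID and 404s, then the client lists
    # flavors and re-fetches the match by its auto-assigned UUID.
    openstack flavor show baremetal --debug 2>&1 | grep -i flavor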
TheJulia | ... okay, regardless of it seeming weird, why now I guess is another question | 21:35 |
TheJulia | so, first() is by result in the logic path on orm query | 21:36 |
dansmith | well, it's not even now, because it's on the grenade old side, but yeah I dunno | 21:36 |
TheJulia | but... if we purely rely upon client | 21:36 |
dansmith | is this on jammy? | 21:36 |
TheJulia | dunno, wouldn't think that would impact it | 21:36 |
dansmith | perhaps something mysql server-side changed some would-have-been-stable-order-ing? | 21:36 |
TheJulia | possibly | 21:36 |
TheJulia | I guess we'll find out soon? :) | 21:36 |
dansmith | I mean, try switching it to id and see if it goes away and if so, then ask "why" :) | 21:37 |
TheJulia | I do think we changed the job to jammy at some point last cycle | 21:37 |
JayF | It might be interesting science, depending on how hard... ^ yes to try that | 21:37 |
dansmith | JayF: it should be trivial since the rest of devstack/tempest uses ids | 21:37 |
dansmith | just change the command to add `--id baremetal` should do it | 21:38 |
dansmith | and maybe mangle the name to something else just for clarity in further debug, since the id and the name don't match on the others | 21:38 |
dansmith | this name and id business is legacy ec2 stuff, AFAIK | 21:38 |
TheJulia | that is a good data point | 21:38 |
dansmith | could also grep out the generated id and use that instead, but it'd be easier to read if you force the id to something, as grepping the logs for 'baremetal' is much easier | 21:39 |
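A sketch of that suggestion, modeled on the m1.xlarge line quoted above and on the flavor body from the nova log (disk 10, ram 1024, vcpus 1); the mangled name is a made-up placeholder and the ironic plugin's extra properties are not reproduced:

    # Hypothetical variant of the ironic flavor creation: force the id so nothing
    # depends on client-side name resolution, and keep the name distinct for grepping.
    openstack flavor create --id baremetal --ram 1024 --disk 10 --vcpus 1 bm-debug-flavor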
TheJulia | ack, well, I guess lets get some more data and figure out the next step from there | 21:40 |
dansmith | our api ref says id or URL to a flavor when booting | 21:43 |
dansmith | so it's possible that tempest is resolving the flavor for you, inconsistently perhaps | 21:44 |
dansmith | I haven't looked at our code to see if we try harder than that | 21:44 |
dansmith | but, prove it fixes the problem and we can chase it down further | 21:44 |
TheJulia | in this case I believe it is an openstack server create call | 21:45 |
TheJulia | not tempest yet afaik | 21:45 |
dansmith | um, what's doing that then? the devstack run on the old side finishes right? | 21:46 |
TheJulia | that is a good question | 21:46 |
dansmith | oh, resource create I see | 21:46 |
TheJulia | ok | 21:46 |
dansmith | where grenade creates some things to see if the survive to the other side | 21:46 |
TheJulia | yup | 21:46 |
dansmith | so, | 21:47 |
dansmith | that's actually failing with a 500: 2023-04-03 22:52:04.438 | Unknown Error (HTTP 500) | 21:47 |
TheJulia | yup, the post returns that per the apache log | 21:47 |
dansmith | this: | 21:48 |
dansmith | 10.208.224.11 - - [03/Apr/2023:22:47:04 +0000] "POST /compute/v2.1/servers HTTP/1.1" 500 803 "-" "python-novaclient" | 21:48 |
* TheJulia cracks open the nova api code | 21:48 | |
TheJulia | yup | 21:48 |
dansmith | okay it's got the flavor id in the server create | 21:51 |
dansmith | and I see the flavor resolution attempt by osc before that | 21:51 |
dansmith | so nova-api or apache seems to die right as the create comes in | 21:51 |
TheJulia | or it freezes, or was frozen | 21:54 |
dansmith | and nothing happens for 5.5 minutes after that | 21:54 |
TheJulia | it could have come in on either process | 21:54 |
TheJulia | we see some possibly unrelated stuff on... I think it was process 64 in the log | 21:54 |
dansmith | nova starts talking to neutron before it freezes | 21:54 |
dansmith | I wonder if we're out of apache threads with us calling back to ourselves so much here | 21:55 |
* TheJulia raises an eyebrow | 21:55 | |
clarkb | I think the default is somewhere around 200 on ubuntu. Not sure if devstack changes that | 21:55 |
TheJulia | hmm, no nova conductor log entries | 21:55 |
clarkb | for apache threads I mean. It uses the combo worker type where it forks X processes and each process has Y threads too by default | 21:56 |
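To check the thread budget on a held node (or any devstack controller), something like the following would show which MPM is loaded and its limits; the file path assumes Ubuntu's apache2 layout:

    # Which MPM apache is running with
    apache2ctl -V | grep -i mpm
    # Ubuntu keeps the per-process thread count and MaxRequestWorkers here
    cat /etc/apache2/mods-enabled/mpm_event.conf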
dansmith | neutron is also doing an arse-ton of DB traffic at the time | 21:57 |
dansmith | so we could try turning dbcounter off | 21:58 |
TheJulia | hmmm... I wonder if you're on to something with apache | 21:58 |
dansmith | like well over 200 SELECT calls over the course of a few seconds right when we're hanging here | 21:59 |
TheJulia | oh... we could be deadlocking on the db then | 21:59 |
TheJulia | somehow | 21:59 |
dansmith | probably not on select calls | 21:59 |
TheJulia | unlikely unless there is also a create in flight on the same table someplace | 22:00 |
dansmith | it is doing a few insert/delete/updates, but at least it's "a lot of activity" | 22:00 |
dansmith | apache is returning 500 though and if it were timing out waiting for nova or neutron, that'd be a 503 | 22:00 |
TheJulia | I'd expect a timeout eventually, maybe | 22:00 |
dansmith | the thing is all that db activity on the neutron side is with no more api requests coming in | 22:01 |
dansmith | so it's like super busy doing I don't know what | 22:01 |
dansmith | actually, | 22:02 |
dansmith | I don't think it *ever* fields another api query after that | 22:02 |
TheJulia | that does seem to be the case | 22:03 |
dansmith | yeah, so we see the POST /compute in the apache log, | 22:04 |
dansmith | we see that hit nova-api, | 22:04 |
dansmith | we see nova say "I'mma go hit neutron" but we never see a request against /network in apache and right around that time, neutron stops logging queries but is doing a ton of db stuff | 22:05 |
TheJulia | just a few seconds later, devstack starts shutting down, fwiw... first command logged at 22:41:24 | 22:05 |
dansmith | okay, so wait, | 22:06 |
dansmith | apache logs that 500 way out of order | 22:06 |
TheJulia | likely, when it is *done* | 22:06 |
TheJulia | or at least finishes the write out | 22:06 |
dansmith | it logs it five minutes late, right | 22:06 |
dansmith | so there is apache activity in between there, | 22:06 |
dansmith | talking to placement and ironic | 22:06 |
dansmith | but not neutron | 22:06 |
dansmith | so I wonder if that's it waiting on nova which is waiting on neutron, and then timeout | 22:06 |
dansmith | this, btw is the last thing nova-api really says: | 22:07 |
dansmith | Apr 03 22:47:04.809417 np0033643434 devstack@n-api.service[85463]: DEBUG nova.network.neutron [None req-0528b41f-c502-4239-bfa8-9c2f40bc440c nova_grenade nova_grenade] validate_networks() for [('0ffa5ccc-6329-4664-ac88-4e778b324d8b', None, None, None, None, None)] {{(pid=85463) validate_networks /opt/stack/old/nova/nova/network/neutron.py:2623}} | 22:07 |
TheJulia | do you have the time when nova connects out to neutron handy? | 22:07 |
dansmith | ^ | 22:08 |
TheJulia | ack thanks! | 22:08 |
dansmith | that's just the last thing we see, so.. probably sometime after that, | 22:08 |
dansmith | but that lines up with the last neutron api call I think | 22:08 |
dansmith | which is: | 22:08 |
dansmith | Apr 03 22:47:05.157665 np0033643434 neutron-server[86238]: INFO neutron.wsgi [req-0528b41f-c502-4239-bfa8-9c2f40bc440c req-36ff5c21-ff3d-4a41-b3b6-f2b2a8867435 nova_grenade nova_grenade] 10.208.224.11 "GET /v2.0/ports?tenant_id=0049ae1d7885406099a43056d3e73ecc&fields=id HTTP/1.1" status: 200 len: 210 time: 0.0663364 | 22:08 |
dansmith | that's the last time neutron.wsgi logs anything | 22:08 |
dansmith | so likely whatever we do after that | 22:08 |
dansmith | https://github.com/openstack/nova/blob/3886f078dea50baa062c732a0bd9f653e35e09cc/nova/network/neutron.py#L2630 | 22:09 |
dansmith | yeah, that code requires special skills to understand, which I don't have | 22:12 |
dansmith | but I think the first call to _ports_needed() is, at least, doing a show on each port | 22:12 |
TheJulia | so, I *think* the bump in selects is the neutron object model at play | 22:12 |
dansmith | which we see the first and only | 22:12 |
TheJulia | I'm semi-curious if we're getting to https://github.com/openstack/nova/blob/3886f078dea50baa062c732a0bd9f653e35e09cc/nova/network/neutron.py#L2654 based upon the neutron log | 22:13 |
dansmith | I mean.. it's a dang lot of selects | 22:13 |
dansmith | I don't think we are | 22:13 |
dansmith | I really need to go to something I'm late for, and now that we've wandered into "neutron's fault" territory I'm losing interest :P | 22:14 |
TheJulia | okay, we have two places which can pull a list of ports it looks like | 22:14 |
dansmith | but maybe tomorrow morning we can rope in a neutron person | 22:14 |
TheJulia | one buried down after line 2630 you linked | 22:14 |
dansmith | maybe they'll recognize what "never answers http again, followed by too much DB traffic" means | 22:15 |
TheJulia | maybe, hopefully we'll have more data in the morning | 22:15 |
dansmith | data from what? | 22:15 |
dansmith | I think the flavor thing is likely not related | 22:15 |
TheJulia | we might have a held node to dig into :) | 22:15 |
dansmith | ah okay | 22:15 |
dansmith | I guess it's not clear to me that it would still be hung when you get it, depending I guess | 22:16 |
TheJulia | it might not be, but we might have a lot more logging to dig through | 22:16 |
dansmith | whether or not it is will tell you ... something | 22:16 |
TheJulia | exactly | 22:16 |
dansmith | I guess seeing if neutron ever comes back will be interesting | 22:16 |
TheJulia | ++ | 22:16 |
dansmith | or if mysql is 100% cpu and neutron is still selecting the world | 22:17 |
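A sketch of the checks implied here for when the held node is available; the neutron service unit name is an assumption, the rest follows devstack defaults seen in the logs:

    # Is mysqld pegged, and is neutron still hammering the DB?
    top -b -n1 | head -n 20
    sudo mysql -e 'SHOW FULL PROCESSLIST;'
    # Check whether neutron-server ever answered HTTP again (unit name assumed)
    sudo journalctl -u 'devstack@q-svc*' | tail -n 50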
* dansmith & | 22:17 | |
TheJulia | g'night | 22:17 |