*** dasm is now known as dasm|off | 01:24 | |
ierdem | Hi folks, is there any way to boot a signed image from volume? I am testing image validation; I can create a VM by using signed images on ephemeral disks, but when I try boot from volume, it throws an exception (https://paste.openstack.org/show/blZen5ID7OIbi47TN8ib/). I am currently working on OpenStack Ussuri, and the image backend is Ceph | 10:01 |
bauzas | gibi: so, there are many ways tricking stestr scheduling | 10:10 |
bauzas | gibi: but given we directly call tempest which eventually calls stestr, the quickest way to change the test scheduling is renaming the test name | 10:10 |
bauzas | https://stestr.readthedocs.io/en/latest/MANUAL.html#test-scheduling | 10:11 |
bauzas | "By default stestr schedules the tests by first checking if there is any historical timing data on any tests. It then sorts the tests by that timing data loops over the tests in order and adds one to each worker that it will launch. For tests without timing data, the same is done, except the tests are in alphabetical order instead of based on timing data. If a group regex is used the same algorithm is used with groups instead of | 10:11 |
bauzas | individual tests." | 10:11 |
bauzas | gibi: https://review.opendev.org/c/openstack/tempest/+/870913 | 10:20 |
bauzas | gibi: could I modify https://review.opendev.org/c/openstack/nova/+/869900/ to add a Depends-On on ^ ? | 10:20 |
gibi | let's keep https://review.opendev.org/c/openstack/nova/+/869900/ alone, as I want to land that regardless of our troubleshooting here. I think some of the other problems might go away after that removes the excessive logging. | 10:24 |
gibi | about https://review.opendev.org/c/openstack/tempest/+/870913 why are we trying to move the test to the front? I thought we wanted to move it later or even disable it temporarily to see if other tests trigger the same OOM behavior, and hence try to establish a pattern causing the OOM | 10:25 |
bauzas | gibi: OK, then I'll add a DNM on nova, np | 10:32 |
bauzas | gibi: good question, I wanted to see whether it was due to this test or not | 10:32 |
bauzas | if we call it first, and if this is due to this test, it would be killed earlier, right? | 10:33 |
opendevreview | Sylvain Bauza proposed openstack/nova master: DNM: Testing the killed test https://review.opendev.org/c/openstack/nova/+/870924 | 10:39 |
kashyap | gibi: Can you have a quick look at this workaround patch when you can (for a change all CI have passed): https://review.opendev.org/c/openstack/nova/+/870794 | 10:44 |
kashyap | (When you get a minute, that is) | 10:44 |
tobias-urdin | gibi: any possibility that we can backport this https://review.opendev.org/c/openstack/nova/+/838976 and the parent reproducer patch? we are currently patching that in production as we're on a newer libvirt with an older nova release (xena right now, probably yoga later this year) | 10:48 |
ierdem | Hi everyone, is there any way to boot a signed image from volume? I am testing image validation; I can create a VM by using signed images on ephemeral disks, but when I try boot from volume, it throws an exception (https://paste.openstack.org/show/blZen5ID7OIbi47TN8ib/). I am currently working on OpenStack Ussuri, and the image backend is Ceph | 11:06 |
gokhanisi | hello folks, after rebooting my compute host, I can't attach my cinder volumes to instances. Nova throws "unable to lock /var/lib/nova/mnt/dgf/volume-xx for metadata change: No locks available". Full logs are at https://paste.openstack.org/show/beicZ71J17WeNwLjghKc/ What could be the reason for this problem? I am on victoria. On this compute node I am also using gpu passthrough. | 11:15 |
opendevreview | Sylvain Bauza proposed openstack/nova master: DNM: Testing the killed test https://review.opendev.org/c/openstack/nova/+/870924 | 11:23 |
gibi | tobias-urdin: regarding https://review.opendev.org/c/openstack/nova/+/838976 I think this is technically backportable but bauzas should know more about it as it is vgpu related | 11:24 |
* bauzas looks | 11:24 | |
bauzas | gibi: tobias-urdin: I already proposed the backports down to wallaby | 11:24 |
sean-k-mooney | tobias-urdin: there is a backport already | 11:25 |
sean-k-mooney | tobias-urdin: https://review.opendev.org/c/openstack/nova/+/866156 is the xena cherry pick | 11:25 |
sean-k-mooney | tobias-urdin: we needed it for wallaby for our downstream product, so all the patches are up for review, but we already merged them downstream at the end of the year | 11:26 |
gibi | ahh I missed the backports as the topic was not set on them | 11:35 |
tobias-urdin | oh great, thanks! | 11:39 |
gibi | bauzas: regarding OOM I can do a parallel experiment moving the test to the latest to see if others before it trigger the OOM or not | 11:39 |
gibi | s/latest/last/ | 11:40 |
bauzas | gibi: sure, do it | 11:40 |
gibi | ack | 11:40 |
bauzas | gibi: I'm starting to look at the functest races | 11:40 |
bauzas | but I'm hungry | 11:40 |
gibi | ack. I start to get hungry too | 11:41 |
gibi | damn biology | 11:41 |
gibi | bauzas: this was the bug https://bugs.launchpad.net/nova/+bug/1946339 I referred to yesterday related to the funct test failures. It might or might not be related :/ | 11:47 |
opendevreview | melanie witt proposed openstack/nova master: imagebackend: Add support to libvirt_info for LUKS based encryption https://review.opendev.org/c/openstack/nova/+/826755 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: imagebackend: Cache the key manager when disk is encrypted https://review.opendev.org/c/openstack/nova/+/826756 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: libvirt: Introduce support for qcow2 with LUKS https://review.opendev.org/c/openstack/nova/+/772273 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: libvirt: Configure and teardown ephemeral encryption secrets https://review.opendev.org/c/openstack/nova/+/870931 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: Support create with ephemeral encryption for qcow2 https://review.opendev.org/c/openstack/nova/+/870932 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: Support resize with ephemeral encryption https://review.opendev.org/c/openstack/nova/+/870933 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: Add encryption support to convert_image https://review.opendev.org/c/openstack/nova/+/870934 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: Add hw_ephemeral_encryption_secret_uuid image property https://review.opendev.org/c/openstack/nova/+/870935 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: Add encryption support to qemu-img rebase https://review.opendev.org/c/openstack/nova/+/870936 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: Support snapshot with ephemeral encryption https://review.opendev.org/c/openstack/nova/+/870937 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: Add reset_encryption_fields() and save_all() to BlockDeviceMappingList https://review.opendev.org/c/openstack/nova/+/870938 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: Update driver BDMs with ephemeral encryption image properties https://review.opendev.org/c/openstack/nova/+/870939 | 12:38 |
opendevreview | melanie witt proposed openstack/nova master: DNM test ephemeral encryption + resize: qcow2, raw https://review.opendev.org/c/openstack/nova/+/862416 | 13:00 |
opendevreview | melanie witt proposed openstack/nova master: DNM test ephemeral encryption + resize: qcow2, raw https://review.opendev.org/c/openstack/nova/+/862416 | 13:03 |
opendevreview | Balazs Gibizer proposed openstack/nova master: DNM: Test OOM killed test https://review.opendev.org/c/openstack/nova/+/870950 | 13:28 |
opendevreview | Balazs Gibizer proposed openstack/nova master: DNM: Test OOM killed test https://review.opendev.org/c/openstack/nova/+/870950 | 13:29 |
gibi | bauzas: my trial is at https://review.opendev.org/c/openstack/tempest/+/870947 and https://review.opendev.org/c/openstack/nova/+/870950 | 13:29 |
opendevreview | Balazs Gibizer proposed openstack/nova master: DNM: Test OOM killed test https://review.opendev.org/c/openstack/nova/+/870950 | 13:33 |
*** dasm|off is now known as dasm | 13:34 | |
kashyap | gibi: Thanks for catching my sloppiness here (I actually was rephrasing it locally). Do my replies seem reasonable to you? - https://review.opendev.org/c/openstack/nova/+/870794 | 13:35 |
* kashyap goes to rework | 13:35 | |
bauzas | gibi: hmm, TIL about dstat and memory_tracker usage from devstack | 13:37 |
sean-k-mooney | ya they run as a service in all the devstack jobs | 13:41 |
sean-k-mooney | we used to have peak_mem_tracker too or something like that | 13:41 |
sean-k-mooney | dstat more or less has all the info you want/need | 13:41 |
kashyap | gibi: When you get a sec, I'm wondering what else is missing in this unit-test diff to check the API is called only once - https://paste.opendev.org/show/b7CXHOkMeuuXuzQD6QtW/ | 13:54 |
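For reference, checking a call count like that usually comes down to unittest.mock's assert_called_once_with. A minimal, self-contained sketch — not kashyap's actual diff, and the start_host/compareCPU names are only illustrative stand-ins:

```python
from unittest import mock


def start_host(conn):
    # hypothetical code under test: one CPU-compatibility check at start-up
    conn.compareCPU("<cpu/>", 0)


def test_compare_cpu_called_once():
    conn = mock.Mock()
    start_host(conn)
    # fails if compareCPU was never called, called more than once,
    # or called with different arguments
    conn.compareCPU.assert_called_once_with("<cpu/>", 0)
```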
gibi | bauzas: this fresh functional failure https://7ffaea22ff93fca2f0ea-bf433abff5f8b85f7f80257b72ac6f67.ssl.cf5.rackcdn.com/869900/7/gate/nova-tox-functional-py38/3b10d8a/testr_results.html is very similar to what we discussed with melwitt in the comments of https://bugs.launchpad.net/nova/+bug/1946339 | 14:15 |
gibi | so probably eventlets are escaping the end of the test case execution where they were born and interfering with later tests | 14:16 |
sean-k-mooney | parsing that statement... | 14:17 |
sean-k-mooney | that sounds kind of familiar | 14:18 |
gibi | yepp we fixed a set of those in the past but not all it seems | 14:18 |
sean-k-mooney | i thought we were explicitly shutting down the event loop between tests globally | 14:18 |
sean-k-mooney | as in via a fixture | 14:19 |
gibi | is there a way to do that? | 14:19 |
sean-k-mooney | well yes | 14:19 |
sean-k-mooney | if we modify the base test case to call into eventlet | 14:19 |
sean-k-mooney | in test cleanup | 14:19 |
sean-k-mooney | i think there is a global kill but there is also a per-greenthread kill | 14:20 |
sean-k-mooney | https://eventlet.net/doc/modules/greenthread.html#eventlet.greenthread.kill that's the per green thread one | 14:21 |
sean-k-mooney | we can also use waitall | 14:21 |
sean-k-mooney | https://eventlet.net/doc/modules/greenpool.html#eventlet.greenpool.GreenPool.waitall | 14:21 |
gibi | we are not pooling our eventlets | 14:22 |
gibi | and as you noted the greenthread.kill assumes we have access to the greenthread to kill | 14:23 |
sean-k-mooney | well we do have a greenthread pool but it's provided by oslo | 14:23 |
sean-k-mooney | i was just looking at the docs to see if we have a way to list the greenthreads | 14:23 |
gibi | when we call spawn or spawn_n we are not using the greenlet from the pool | 14:23 |
sean-k-mooney | ya but i think there is a default pool that is used | 14:24 |
sean-k-mooney | i could be wrong | 14:24 |
gibi | at least I haven't come across it when I originally fixed part of this problem | 14:24 |
sean-k-mooney | i guess if there is one it does not say https://eventlet.net/doc/basic_usage.html#eventlet.spawn | 14:25 |
sean-k-mooney | is there any reason not to just have one global pool | 14:25 |
sean-k-mooney | gibi: i was thinking of https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.executor_thread_pool_size by the way | 14:26 |
sean-k-mooney | Size of executor thread pool when executor is threading or eventlet. | 14:26 |
gibi | as far as I see that is only used by oslo_messaging creating rpc message handler threads / eventlets. but nova uses spawn and spawn_n directly outside of oslo messaging | 14:28 |
gibi | we can try to pool them but I'm not sure both spawn and spawn_n can be pooled in the same way | 14:28 |
gibi | as they are not creating the same entity | 14:29 |
sean-k-mooney | there are spawn and spawn_n function on the pools | 14:29 |
sean-k-mooney | we might need a separate one from the rpc one | 14:29 |
sean-k-mooney | or want a separate one | 14:29 |
sean-k-mooney | but i think from an api point of view it should be fine | 14:29 |
sean-k-mooney | https://eventlet.net/doc/modules/greenpool.html#eventlet.greenpool.GreenPool.spawn and https://eventlet.net/doc/modules/greenpool.html#eventlet.greenpool.GreenPool.spawn_n | 14:30 |
sean-k-mooney | hopefully we could just update it here https://github.com/openstack/nova/blob/master/nova/utils.py#L635-L684 | 14:30 |
sean-k-mooney | so create a module level pool and use that, then in the tests call waitall in the base testcase cleanup | 14:31 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/test.py#L150 we don't currently do that in a cleanup function, but in setUp we can also just add | 14:32 |
sean-k-mooney | self.addCleanup(utils.greenpool.waitall) | 14:32 |
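A rough sketch of the idea sean-k-mooney outlines above — routing nova's utils.spawn()/spawn_n() through one module-level GreenPool so the base test case can wait for stray greenthreads. Names and the pool size are assumptions for illustration; this is not the merged nova code:

```python
import eventlet

# module-level pool shared by spawn() and spawn_n(); the size is an
# assumption and would likely need its own tuning/config in nova
greenpool = eventlet.GreenPool(1000)


def spawn(func, *args, **kwargs):
    # GreenPool.spawn returns a GreenThread whose result can be wait()ed on
    return greenpool.spawn(func, *args, **kwargs)


def spawn_n(func, *args, **kwargs):
    # GreenPool.spawn_n is fire-and-forget, mirroring eventlet.spawn_n
    greenpool.spawn_n(func, *args, **kwargs)
```

The base TestCase setUp would then register `self.addCleanup(utils.greenpool.waitall)` as suggested above, so every test waits for the greenthreads it spawned before the next test starts.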
gibi | I can try to set up a way to reproduce the issue more frequently locally and try to see if the pooling might solve it or not | 14:35 |
bauzas | sorry, was at the hairdresser | 14:40 |
* bauzas scrolling up | 14:40 | |
opendevreview | Balazs Gibizer proposed openstack/nova master: DNM: Test OOM killed test https://review.opendev.org/c/openstack/nova/+/870950 | 14:48 |
gibi | bauzas: one result from your OOM trial is that I noticed that the test in question takes a relatively long time even when it passes: tempest.api.compute.admin.test_aaa_volume.AttachSCSIVolumeTestJSON.test_attach_scsi_disk_with_config_drive [181.818420s] | 14:54 |
bauzas | gibi: yup, I've seen it | 14:54 |
bauzas | maybe we should introspect the memory size of the cached image | 14:54 |
bauzas | gibi: fwiw, since the UT ran successfully in the DNM patch, I looked at n-api log and I found we called it | 15:20 |
bauzas | gibi: while on https://834de1be955e9175dba1-6977f7378e5264bdb9ba9d1465839752.ssl.cf1.rackcdn.com/869900/6/gate/nova-ceph-multistore/f5aa5ed/controller/logs/screen-n-api.txt we were not calling it | 15:20 |
bauzas | so, I think the test was killed during the first glance call | 15:21 |
bauzas | https://github.com/openstack/tempest/blob/master/tempest/api/compute/admin/test_volume.py#L84-L89 | 15:22 |
dansmith | I'm stacking a ceph devstack right now | 15:49 |
dansmith | so when it's done I could try running just that test and see if it behaves properly in isolation | 15:49 |
opendevreview | Sylvain Bauza proposed openstack/nova master: DNM: Testing the killed test https://review.opendev.org/c/openstack/nova/+/870924 | 15:55 |
gibi | bauzas, dansmith: https://bugs.launchpad.net/nova/+bug/2002951/comments/5 based on dstat and the tempest log I'm pretty sure that loading the image data is using up the memory | 16:21 |
bauzas | gibi: I added a few lines | 16:22 |
dansmith | gibi: oh is show_image() eating the whole image? | 16:22 |
bauzas | https://review.opendev.org/c/openstack/tempest/+/870913/2/tempest/api/compute/admin/test_aaa_volume.py | 16:22 |
dansmith | like response.content instead of response.iter_content ? | 16:23 |
bauzas | dansmith: I asked for the image cache size in my next revision | 16:23 |
bauzas | gibi: that was my guess | 16:23 |
bauzas | hence the new rev I created $ | 16:23 |
dansmith | if so I guess that's my bad for making the image size large, although that is exactly why I did it :) | 16:23 |
dansmith | oh, I see, | 16:25 |
gibi | dansmith: I don't see any iter_content involved | 16:25 |
dansmith | it's actually downloading and re-uploading the image? | 16:25 |
gibi | yepp | 16:25 |
gibi | and doing it in memory | 16:25 |
dansmith | riight, okay | 16:25 |
dansmith | so yeah that'll have to turn into a chunked loop | 16:25 |
dansmith | I can take a look at that if you want, but we might want to just disable that test for the moment | 16:26 |
bauzas | dansmith: as you see they copy in memory the whole image | 16:26 |
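The shape of the "chunked loop" fix dansmith is describing, sketched with plain requests. The URLs, token handling and function name are assumptions for illustration, not tempest's actual images client:

```python
import requests

CHUNK = 64 * 1024  # fixed-size read buffer, so peak memory stays ~64 KiB


def copy_image_data(download_url, upload_url, token):
    headers = {'X-Auth-Token': token}
    # stream=True stops requests from buffering the whole body in memory
    with requests.get(download_url, headers=headers, stream=True) as resp:
        resp.raise_for_status()
        chunks = resp.iter_content(chunk_size=CHUNK)  # generator of chunks
        upload_headers = dict(headers)
        upload_headers['Content-Type'] = 'application/octet-stream'
        # passing a generator makes requests stream the upload as well,
        # instead of holding a ~1G image in a single bytes object
        put = requests.put(upload_url, headers=upload_headers, data=chunks)
        put.raise_for_status()
```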
gibi | bauzas: if I'm right, L90 in your modification won't return https://review.opendev.org/c/openstack/tempest/+/870913/2/tempest/api/compute/admin/test_aaa_volume.py#90 | 16:26 |
bauzas | gibi: hmm? | 16:26 |
bauzas | https://docs.openstack.org/oslo.utils/ocata/examples/timeutils.html#using-a-stopwatch-as-a-context-manager | 16:27 |
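For reference, the oslo.utils StopWatch context manager bauzas links above works roughly like this — a minimal standalone example, not the tempest change itself:

```python
import time

from oslo_utils import timeutils

with timeutils.StopWatch() as watch:
    time.sleep(0.1)  # stand-in for the glance call being timed
print("took %.2fs" % watch.elapsed())
```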
gibi | bauzas: OMM will kill the process in the middle | 16:27 |
gibi | OOM | 16:27 |
bauzas | gibi: no, the test is run before | 16:27 |
dansmith | this also means anyone using tempest with a real image (like for verification) will be eating a ton of data | 16:27 |
bauzas | that's still using aaa | 16:27 |
gibi | calling _create_image_with_custom_property will trigger the OOM | 16:27 |
bauzas | gibi: it wasn't the case in the first revision | 16:28 |
bauzas | we waited for 180secs but eventually we didn't get a kill | 16:28 |
gibi | hm, maybe at that run the image fit into memory | 16:28 |
bauzas | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_362/870924/2/check/nova-ceph-multistore/3626391/testr_results.html | 16:29 |
bauzas | gibi: yeah, because the test is called first | 16:29 |
gibi | anyhow dansmith if you look into fixing it then I will propose to disable this test in tempest with a @skip decorator as it is probably dangerous to run in any job | 16:29 |
dansmith | yeah, do it | 16:30 |
gibi | ack | 16:30 |
gibi | bauzas: ahh yeah, that helps to fit | 16:30 |
dansmith | I'm still stacking, had to start over because horizon seems broken :/ | 16:30 |
dansmith | 2023 is off to a *great* start | 16:30 |
bauzas | gibi: that said, I wonder why this test was fine for like the 10 months before | 16:31 |
gibi | my local tox -e functional-py310 env produces strange failures (>1000 failed test on master) so yeah it is *great* :D | 16:31 |
dansmith | bauzas: I recently increased the size of the image used on the ceph job from 16MB to 1G | 16:31 |
dansmith | but that was in like november or so | 16:31 |
dansmith | so I'm pretty surprised we've been holding on this long | 16:31 |
dansmith | probably because that memory never gets touched again and just gets swapped out | 16:31 |
sean-k-mooney | the gate was pretty ok after you went on pto at the start of december | 16:32 |
sean-k-mooney | so i dont think its related to using the 1G image | 16:33 |
dansmith | sean-k-mooney: I'm happy to leave if that's what helps | 16:33 |
dansmith | sean-k-mooney: this job crashing with OOM seems clearly related as the job eats the 1g image and ... swells to 1g before it goes boom | 16:33 |
dansmith | s/job/test worker/ | 16:33 |
sean-k-mooney | ok do we know why it's only happening now or did we get lucky before | 16:33 |
gibi | maybe something else grew a bit in memory usage recently and is pushing the overall worker VM over the line | 16:34 |
gibi | sean-k-mooney: as bauzas showed, if you run this test earlier in the job then it still passes | 16:34 |
dansmith | sean-k-mooney: read the scrollback :) | 16:34 |
sean-k-mooney | ya well ok we could revert back to 512 mb and see if it grows by 512mb | 16:34 |
sean-k-mooney | i was just starting to, ya | 16:34 |
dansmith | sean-k-mooney: gibi identified a test that reads the whole image into memory | 16:34 |
bauzas | gibi: dansmith: added Tempest and glance to the bug report | 16:34 |
sean-k-mooney | oh ok | 16:34 |
sean-k-mooney | so fix that test or revert? i assume fix the test to not do that | 16:35 |
sean-k-mooney | or explicitly use a small image in that test | 16:35 |
bauzas | sean-k-mooney: I have a change that will tell us how much memory it takes https://review.opendev.org/c/openstack/tempest/+/870913/2/tempest/api/compute/admin/test_aaa_volume.py#90 | 16:35 |
* sean-k-mooney start reading back | 16:35 | |
dansmith | no, the test is old there's nothing to revert | 16:35 |
dansmith | we need to fix the test | 16:35 |
bauzas | dansmith: I think we now have an issue because we call more tests | 16:36 |
sean-k-mooney | by revert i meant your image change, but the test is clearly not written as we would like | 16:36 |
sean-k-mooney | i.e. it should not break with a bix image | 16:36 |
sean-k-mooney | *big | 16:36 |
bauzas | while it was working fine before, now the memory is too large | 16:36 |
sean-k-mooney | like downstream i know we use rhel images in some cases | 16:36 |
sean-k-mooney | and those are just under a gig too like 700mb | 16:37 |
dansmith | the images client in tempest already chunks the upload, it just does it from a fixed size buffer, so it just needs to be smarter | 16:37 |
bauzas | sean-k-mooney: again, we'll know how much memory this test uses with my new CI job | 16:37 |
sean-k-mooney | bauzas: just got back to this point in scrollback | 16:38 |
sean-k-mooney | bauzas: ack | 16:38 |
dansmith | the large image was specifically to flush out things like this, so I don't think going back to a small image gets us anything useful | 16:38 |
bauzas | anyway, this is a guess | 16:38 |
bauzas | nothing was changed in this module since Feb 22 | 16:38 |
dansmith | if anything, it makes me think we can make it larger as this might have been the OOM limit I was running into with 2G | 16:38 |
sean-k-mooney | dansmith: i agree | 16:38 |
bauzas | https://github.com/openstack/tempest/blob/master/tempest/api/compute/admin/test_volume.py | 16:39 |
sean-k-mooney | dansmith: the trade-off is we only have 80G of disk space in ci | 16:39 |
sean-k-mooney | so back to 2G perhaps, 20G probably not | 16:39 |
bauzas | oh wait | 16:39 |
dansmith | I understand, disk space is not the issue though | 16:39 |
bauzas | maybe we changed the image ref | 16:39 |
* bauzas verifying whether we modified the option value closely | 16:40 | |
opendevreview | Kashyap Chamarthy proposed openstack/nova master: libvirt: At start-up allow skiping compareCPU() with a workaround https://review.opendev.org/c/openstack/nova/+/870794 | 16:40 |
kashyap | Duh, forgot to commit 2 files | 16:41 |
opendevreview | Kashyap Chamarthy proposed openstack/nova master: libvirt: At start-up allow skiping compareCPU() with a workaround https://review.opendev.org/c/openstack/nova/+/870794 | 16:41 |
bauzas | https://github.com/openstack/devstack/blob/master/lib/tempest#L213-L220 | 16:45 |
bauzas | hmmmm | 16:45 |
gibi | proposed the skip for this test https://review.opendev.org/c/openstack/tempest/+/870974 I checked, no other test uses the _create_image_with_custom_property util function | 16:46 |
bauzas | 2023-01-18 11:54:33.763321 | controller | ++ lib/tempest:get_active_images:155 : '[' cirros-raw = cirros-0.5.2-x86_64-disk ']' | 16:46 |
bauzas | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_362/870924/2/check/nova-ceph-multistore/3626391/job-output.txt | 16:46 |
bauzas | we only get the cirros image | 16:46 |
bauzas | shouldn't be that large | 16:46 |
dansmith | bauzas: you understand that the ceph job uses a 1G cirros image right? | 16:48 |
bauzas | Jan 18 11:57:14.947333 np0032776548 glance-api[110229]: DEBUG glance.image_cache [None req-76dfbfc9-8d31-4a47-a529-e95d8077cfc0 tempest-AttachSCSIVolumeTestJSON-1393201534 tempest-AttachSCSIVolumeTestJSON-1393201534-project-admin] Tee'ing image '0bc12eec-2802-48e8-bedf-0931be582d19' into cache {{(pid=110229) get_caching_iter /opt/stack/glance/glance/image_cache/__init__.py:343}} | 16:49 |
bauzas | dansmith: oh sorry no, I didn't know | 16:49 |
gibi | 950MB image is downloaded in my case | 16:49 |
dansmith | bauzas: [08:31:27] <dansmith> bauzas: I recently increased the size of the image used on the ceph job from 16MB to 1G | 16:50 |
bauzas | missed that line | 16:50 |
dansmith | :) | 16:50 |
bauzas | ok, so we know that we cache 1GB in memory | 16:50 |
gibi | the test is nice as it gets the image metadata first, so in the log there is a "size": 996147200 before the data is downloaded | 16:50 |
bauzas | gibi: I'm waiting for my test job to return but I guess we'll see a size of 1GB in memory for that variable | 16:51 |
bauzas | dansmith: about your question (why do we trigger now the kill and not earlier), my guess is that we were just below the line | 16:52 |
opendevreview | Balazs Gibizer proposed openstack/nova master: DNM: Test that OOM triggering test is skipped https://review.opendev.org/c/openstack/nova/+/870950 | 16:52 |
dansmith | bauzas: yeah, like I said, we're probably just swapping it all and never touching it again, so pressure is high and we're close to the edge :) | 16:53 |
bauzas | one way to alleviate this issue would be to make sure we run that greedy test on a specific test runner worker | 16:53 |
opendevreview | Aaron S proposed openstack/nova master: Add further workaround features for qemu_monitor_announce_self https://review.opendev.org/c/openstack/nova/+/867324 | 16:54 |
bauzas | https://stestr.readthedocs.io/en/latest/MANUAL.html#test-scheduling | 16:54 |
bauzas | tempest exposes the worker configs from stestr | 16:54 |
gibi | bauzas: we just shouldn't load the whole image data in memory at once | 16:54 |
gibi | as in general image size can be way bigger than memory size | 16:54 |
* bauzas tries to understand the reason behind the cache | 16:55 | |
bauzas | gibi: that's true | 16:55 |
bauzas | caching the metadata seems ok to me | 16:55 |
bauzas | caching the data itself seems unnecessary | 16:55 |
bauzas | unless you want to compare bytes per bytes | 16:55 |
gibi | yeah metadata is bounded by the glance API | 16:55 |
gibi | image size is unbounded | 16:55 |
bauzas | but agreed you could and should compare streams and not objects | 16:56 |
bauzas | for the dataz | 16:56 |
dansmith | it's just a naive test not thinking about the world outside a 16MB image | 16:56 |
bauzas | glance team is added on the bug report | 16:56 |
dansmith | anyone using this for verification of a real cloud likely has an even larger image, | 16:56 |
dansmith | so it's clearly not okay to do this | 16:56 |
bauzas | agreed | 16:56 |
bauzas | whoami-rajat: hey, happy new year :) | 16:56 |
opendevreview | Aaron S proposed openstack/nova master: Add further workaround features for qemu_monitor_announce_self https://review.opendev.org/c/openstack/nova/+/867324 | 16:57 |
opendevreview | Aaron S proposed openstack/nova master: Add further workaround features for qemu_monitor_announce_self https://review.opendev.org/c/openstack/nova/+/867324 | 16:58 |
opendevreview | Merged openstack/nova master: Strictly follow placement allocation during PCI claim https://review.opendev.org/c/openstack/nova/+/855650 | 16:59 |
bauzas | hmmm, stackalytics.io is fone | 17:01 |
bauzas | gone* | 17:01 |
* bauzas wonders who to ping | 17:01 | |
gibi | what do you mean by gone? | 17:01 |
gibi | it loads for me with | 17:02 |
gibi | Last updated on 18 Jan 2023 12:44:21 UTC | 17:02 |
bauzas | I got a timeout | 17:02 |
bauzas | oh this works now | 17:02 |
gibi | gmann, dansmith: was there some policy change recently that can cause that the nova functional test locally gets 403 from placement? | 17:04 |
gibi | {'errors': [{'status': 403, 'title': 'Forbidden', 'detail': 'Access was denied to this resource.\n\n placement:allocations:list ', 'request_id': 'req-c5c03029-ba79-4e4a-8ec8-03deadb24ded'}]} | 17:04 |
dansmith | gibi: reliably? | 17:04 |
gibi | yes | 17:04 |
gibi | this is pure nova master functional test | 17:04 |
dansmith | oof, then probably | 17:04 |
gmann | gibi: yeah we changed the default and the placement fixture needed a fix, which is merged i think | 17:04 |
bauzas | gibi: yeah | 17:05 |
gmann | gibi: can you rebase the placement repo? | 17:05 |
bauzas | gibi: we switched to new RBAC policies | 17:05 |
gibi | it is just my nova repo locally | 17:05 |
gmann | this one https://review.opendev.org/c/openstack/placement/+/869525 | 17:05 |
gmann | gibi: because nova functional test use placement fixture from placement repo | 17:05 |
bauzas | which we pull as a dependency, right? | 17:06 |
gmann | yes | 17:06 |
gibi | so nova's tox.ini has | 17:06 |
gibi | openstack-placement>=1.0.0 | 17:06 |
gibi | that should pull in the latest openstack-placement | 17:07 |
gibi | in the functional venv | 17:07 |
bauzas | tox -r ? | 17:07 |
gibi | I have openstack-placement==8.0.0 | 17:07 |
gibi | in the venv | 17:07 |
gibi | I guess we merged the fix in placement but we haven't released it | 17:07 |
gmann | i do not think we released placement with that | 17:08 |
gibi | so nova's tox.ini pulls placement from pypi | 17:08 |
gmann | yeah not released yet, we should do | 17:08 |
gibi | ^^ yepp | 17:08 |
gibi | or change nova's tox.ini to pull placement from github | 17:08 |
gmann | yeah for now this can be a workaround | 17:08 |
gmann | let me push it today unless bauzas you want to do? | 17:08 |
gmann | I feel the placement fixture import from placement in the functional tests should be changed like we do for the cinder/glance fixtures, otherwise we need a new placement release for any change in there | 17:10 |
bauzas | gmann: do the push and I'll +1 | 17:11 |
gmann | bauzas: ok | 17:11 |
gibi | gmann: based on the constraint in tox.ini openstack-placement>=1.0.0 this is the first time we need such a release due to the fixture | 17:12 |
bauzas | gibi: gmann: shall we consider to pull from gh ? | 17:13 |
gibi | I would keep pypi | 17:13 |
gmann | gibi: yeah because this actually changes things like the default policy. but this can occur for any change in policy or config defaults unless we change the nova functional tests to move to those new defaults, for example | 17:14 |
gibi | if this becomes a frequent problem then I would change the placement fixture in nova | 17:14 |
gmann | placement policy needs a different token default than what nova is using for access | 17:14 |
gibi | I just confirmed that switching to gh locally in the tox.ini fixes the problem. Still I vote for releasing a new placement version and bumping the constraint in nova's tox.ini | 17:19 |
gmann | ok. yeah once released we should bump the constraint if we want to use it from pypi | 17:21 |
gmann | I will push the release | 17:21 |
gibi | yepp | 17:21 |
gibi | and thanks | 17:21 |
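The "pull from gh" workaround gibi tried locally would look roughly like the following tox.ini tweak — swapping the PyPI pin for a git URL in the functional env deps. The exact deps layout in nova's tox.ini differs; this only shows the shape of the change:

```ini
[testenv:functional]
deps =
  {[testenv]deps}
  # instead of: openstack-placement>=1.0.0
  git+https://opendev.org/openstack/placement#egg=openstack-placement
```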
bauzas | gmann: I need to disappear soon | 17:27 |
gmann | bauzas: https://review.opendev.org/c/openstack/releases/+/870989 | 17:28 |
*** gmann is now known as gmann_afk | 17:29 | |
bauzas | gmann_afk: gibi: that's where I'm struggling to consider 8.1.0 as a correct number | 17:31 |
bauzas | placement is cycle-with-rc | 17:32 |
gibi | bauzas: you mean 8.0.0 was Zed, so 8.1.0 should come from stable/zed? | 17:32 |
bauzas | gibi: yup | 17:33 |
bauzas | but we have tools for creating YAMLs | 17:33 |
gibi | yeah I'm not sure either if we can release 8.1 from placement master | 17:33 |
gibi | elodilles: ^^? | 17:33 |
bauzas | https://releases.openstack.org/reference/using.html#using-new-release-command | 17:34 |
bauzas | so, I'd say we should call out this release using the tool with "new-release antelope placement milestone" | 17:36 |
bauzas | elodilles: right? | 17:37 |
elodilles | we could just release beta from cycle-with-rc projects | 17:38 |
gibi | that would be 9.0.0 beta I assume | 17:39 |
elodilles | answering in #openstack-release | 17:39 |
elodilles | gibi: yes, 9.0.0.0b1 | 17:39 |
bauzas | elodilles: using new-release, I guess this is 'milestone' arg I presume ? | 17:39 |
*** gmann_afk is now known as gmann | 17:41 | |
gmann | i see, you are right. 8.1.0 is not right | 17:41 |
elodilles | bauzas: yes, 'milestone' generates 9.0.0.0b1 (to answer it here as well :)) | 17:43 |
bauzas | ack, gtk | 17:43 |
mnaser | is there a reason why nova generates device: [] metadata for tagged bdms only? | 17:48 |
mnaser | https://github.com/openstack/nova/blob/702dfd33bb93b7cee8c76e117e26bfe56f637460/nova/virt/libvirt/driver.py#L12092 | 17:49 |
mnaser | and then https://github.com/openstack/nova/blob/702dfd33bb93b7cee8c76e117e26bfe56f637460/nova/virt/libvirt/driver.py#L12107-L12108 | 17:49 |
mnaser | which then https://github.com/openstack/nova/blob/702dfd33bb93b7cee8c76e117e26bfe56f637460/nova/virt/libvirt/driver.py#L12020-L12024 | 17:49 |
mnaser | and if it's supposed to be this way™, how could one figure out what's attached to the system? | 17:50 |
bauzas | mnaser: sorry, calling it a day | 17:51 |
elodilles | bauzas gibi : fyi, tox.ini might need an update to allow to install beta releases of placement. that can be done via adding >1.0.0.0b1 instead of >1.0.0 ... if i remember correctly | 17:52 |
mnaser | bauzas: lol, was that enough nova for you? :p | 17:52 |
bauzas | mnaser: that :) | 17:52 |
bauzas | :D | 17:52 |
bauzas | one day of CI issues, and I quit. | 17:52 |
bauzas | mnaser: maybe artom could help you | 17:52 |
bauzas | artom: tl;dr: mnaser is wondering why we only generate the device metadata for tagged bdms | 17:53 |
artom | mnaser, that's by design IIRC, every other bit of information there is already visible to the guest | 17:53 |
artom | mnaser, it's only the tag that comes from the user | 17:53 |
artom | Without the tag it's pointless | 17:54 |
mnaser | volume uuid? i remember there is a place where it does come through though, i think | 17:54 |
artom | There might be something else exposed there, like the 'trusted' param for NICs | 17:54 |
artom | mnaser, that should show up as the disk serial number | 17:54 |
mnaser | ah yes | 17:54 |
gibi | elodilles: I will do the tox.ini change once the package is on pypi | 18:06 |
*** gmann is now known as gmann_afk | 18:06 | |
sean-k-mooney | mnaser: artom: with that said, we could generate the metadata if we wanted to | 18:17 |
sean-k-mooney | it just won't add extra info as artom mentioned | 18:17 |
sean-k-mooney | you can use lsblk lsusb and lspci to discover it already in the guest | 18:17 |
artom | sean-k-mooney, we could... I just don't see the point? It's already all info the guest has access to | 18:17 |
artom | With lspci, `ip`, etc | 18:18 |
sean-k-mooney | the point would just be to make that part of the metadata no longer optional | 18:18 |
sean-k-mooney | so you would not have to check whether it's available in the guest, it will be | 18:18 |
sean-k-mooney | even if you could get that info elsewhere | 18:18 |
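A small sketch of what artom and sean-k-mooney describe — matching an attached Cinder volume from inside the guest by its serial, which nova sets to the volume UUID. The udev path convention and the 20-character virtio serial truncation are assumptions about typical guests, not a nova API:

```python
import os


def find_volume_device(volume_uuid):
    """Return the /dev node whose exposed serial matches the volume UUID."""
    by_id = "/dev/disk/by-id"
    for name in os.listdir(by_id):
        # udev encodes the serial in the symlink name, e.g.
        # virtio-<first 20 chars of the uuid> or scsi-...<uuid>
        if volume_uuid[:20] in name:
            return os.path.realpath(os.path.join(by_id, name))
    return None
```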
opendevreview | Balazs Gibizer proposed openstack/nova master: Clean up after ImportModulePoisonFixture https://review.opendev.org/c/openstack/nova/+/870993 | 18:24 |
gibi | bauzas: while it is part of the functional instability it is not the case and therefore this is not the fix for it, but it is related and it is a cleanup ^^ | 18:24 |
gibi | s/it is not the case/it is not the root cause/ | 18:25 |
sean-k-mooney | hum interesting | 18:27 |
sean-k-mooney | what were we leaking | 18:27 |
sean-k-mooney | ah the filter | 18:27 |
sean-k-mooney | ok so we were leaking the filters which could increase memory usage | 18:28 |
sean-k-mooney | but it is not sharing state or causing other issues | 18:28 |
gibi | yeah | 18:28 |
sean-k-mooney | it does probably contribute to the oom issues | 18:28 |
gibi | it is part of the functional instability related to https://bugs.launchpad.net/nova/+bug/1946339 | 18:28 |
gibi | as in the recent case it is that import poison that gets called by the late eventlet | 18:29 |
gibi | so I looked at the poison and found a global state | 18:33 |
sean-k-mooney | ya so each executor has what, about 1000 tests per worker | 18:34 |
sean-k-mooney | i would guess this is adding 1-2MB per run | 18:34 |
sean-k-mooney | at most | 18:34 |
gibi | yeah it is not related to the OOM case we saw in tempest | 18:35 |
sean-k-mooney | i have not checked but i would be surprised if adding a filter allocates more than a KB of memory, it should just be a few function pointers and python objects | 18:35 |
sean-k-mooney | well didn't we also have OOM issues in the functional tests | 18:36 |
sean-k-mooney | well not OOM, python interpreter crashes | 18:36 |
sean-k-mooney | gibi: one question however | 18:38 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/870993/1/nova/tests/fixtures/nova.py#1838 | 18:39 |
sean-k-mooney | should we install the filter in setUp not init | 18:39 |
sean-k-mooney | i assume the original intent of doing it in init was to do it only once | 18:40 |
sean-k-mooney | for the lifetime of the fixture and reuse the fixture | 18:40 |
sean-k-mooney | which is not how we are doing it | 18:40 |
sean-k-mooney | so if we are going to clean up in cleanup should we set it up in setUp | 18:40 |
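The pattern sean-k-mooney is asking about, as a generic fixtures sketch — install global state in _setUp() and register the inverse with addCleanup(), rather than touching globals in __init__. The finder class below is a stand-in, not nova's actual ImportModulePoisonFixture:

```python
import sys

import fixtures


class _NoopFinder:
    """Stand-in for an import-poisoning meta path finder."""

    def find_spec(self, name, path, target=None):
        # a real poison finder would raise for banned module names;
        # returning None means "not handled here"
        return None


class PoisonFixture(fixtures.Fixture):

    def __init__(self):
        # build helpers here, but do not touch global state yet
        self.finder = _NoopFinder()

    def _setUp(self):
        # install into the global import machinery for this test only...
        sys.meta_path.insert(0, self.finder)
        # ...and undo it when the test using the fixture finishes,
        # so nothing leaks into later tests
        self.addCleanup(sys.meta_path.remove, self.finder)
```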
*** gmann_afk is now known as gmann | 18:56 | |
opendevreview | Merged openstack/nova master: Use new get_rpc_client API from oslo.messaging https://review.opendev.org/c/openstack/nova/+/869900 | 21:01 |
opendevreview | Merged openstack/nova master: FUP for the scheduler part of PCI in placement https://review.opendev.org/c/openstack/nova/+/862876 | 21:01 |