*** cloudnull1097742 is now known as cloudnull109774 | 03:02 | |
gibi | Hi folks! Was there any infra reconfiguration that could explain why the nova-ceph-multistore job started consistently failing since 2025-07-11 08:33:54 with qemu-img info failing with 'failed to allocate memory for stack: Cannot allocate memory\n'. I filed a tracker bug https://bugs.launchpad.net/nova/+bug/2116852 | 13:14 |
---|---|---|
gibi | The qemu package versions did not change I checked that | 13:14 |
fungi | gibi: can you link to a failing build result in the zuul dashboard? | 13:39 |
sean-k-mooney | hum | 13:40 |
gibi | https://zuul.opendev.org/t/openstack/build/3c87e32281df4101b698b01ae6b1f3d5 this is the first failure | 13:40 |
fungi | ah, the bug has one | 13:40 |
gibi | and this is the list https://zuul.opendev.org/t/openstack/builds?job_name=nova-ceph-multistore&skip=0 | 13:40 |
fungi | (note that those urls only have logs for 30 days, so recording them in a bug long-term isn't terribly useful) | 13:40 |
sean-k-mooney | is it only happening with ceph volumes or also local ones | 13:40 |
gibi | a single tempest test is triggering it: test_extend_attached_encrypted_volume_luksv1 | 13:41 |
gibi | only the ceph jobs are affected | 13:41 |
sean-k-mooney | ok thats a good data point | 13:42 |
fungi | https://zuul.opendev.org/t/openstack/build/3c87e32281df4101b698b01ae6b1f3d5/log/zuul-info/host-info.controller.yaml#474 indicates it's an 8gb node, so same as it's always been | 13:42 |
gibi | yepp, the last succeeded and first failed run has the same amount of memory and similar amount of memory usage | 13:42 |
sean-k-mooney | i wonder if there were any ulimit or ceph changes | 13:43 |
sean-k-mooney | the recent qemu packaging issues were related to the librbd version although that might be unrelated | 13:44 |
fungi | also added 8gb of swap at the start of the build: https://zuul.opendev.org/t/openstack/build/3c87e32281df4101b698b01ae6b1f3d5/console#4/0/15/controller | 13:44 |
sean-k-mooney | i dont think this is actually a vm memory issue | 13:44 |
fungi | so i don't see anything that should make the test node have less total virtual memory than before | 13:44 |
sean-k-mooney | stderr: 'failed to allocate memory for stack: Cannot allocate memory\n' | 13:44 |
sean-k-mooney | failing to allocate enough memory for the stack sounds like a bug in qemu-img to me | 13:45 |
gibi | the memtracker output did not show any bad numbers, the journal has no OOM activity | 13:45 |
gibi | sean-k-mooney: the qemu package version is exactly the same in the passing and in the failing runs | 13:46 |
gibi | 8.2.2 | 13:46 |
sean-k-mooney | right but what about the librbd or the crypto libs | 13:46 |
gibi | I can check.... | 13:46 |
frickler | or kernel version? (just guessing though) | 13:46 |
sean-k-mooney | both the luks and rbd support are dlls | 13:47 |
sean-k-mooney | well .so files i guess | 13:47 |
sean-k-mooney | frickler: that's possible but qemu-img should effectively be userspace only | 13:47 |
gibi | bingo | 13:47 |
gibi | librdb 19.2.0 passes 19.2.1 fails | 13:47 |
gibi | s/db/bd/ | 13:48 |
gibi | do we have an easy way to control this lib version somewhere ? | 13:49 |
sean-k-mooney | you could maybe test that by forcing a downgrade in the post-config devstack step, or local.sh is actually simpler to test with | 13:50 |
sean-k-mooney | you can drop a pre playbook into that job which writes a local.sh and force installs 19.2.0 | 13:50 |
sean-k-mooney | if that works i assume we could add a pre playbook to the job to blacklist that version? im not 100% sure how to do that in apt but i would be surprised if that is not a thing | 13:51 |
sean-k-mooney | you can do it with a manual install and mark the package as held if nothing else | 13:51 |
gibi | OK I will try to craft something. Until that / or in parallel with that I have https://review.opendev.org/c/openstack/tempest/+/954949 to skip the test case | 13:54 |
sean-k-mooney | ack makes sense. there is high overlap with the folks in #openstack-qa but probably best to share that there too | 13:54 |
gibi | yepp, done | 13:59 |
gibi | https://github.com/ceph/ceph/pull/60133/commits/cdbe2c1bb4731395423ab77f3edd2ee4bd053148 this is between 19.2.0 and 19.2.1 and touching stack allocators | 14:16 |
*** losulers is now known as losuler | 14:28 | |
gibi | and testing the package downgrade via https://review.opendev.org/c/openstack/nova/+/954956 | 14:55 |
clarkb | sean-k-mooney: gibi keep in mind that pinning packages like that doesn't make it work for anyone else | 16:30 |
clarkb | the purpose of the CI system is to let us know when things are broken. Papering over the problem may be appropriate temporarily to not hold up the factory floor but is unlikely to be beneficial for anyone outside of ci | 16:31 |
clarkb | so getting that fixed in the package is important | 16:31 |
sean-k-mooney | clarkb: well the intent is to determine if its a distro packaging bug and avoid the buggy package | 16:35 |
sean-k-mooney | clarkb: if we can confirm its actually a bug in 19.2.1 we can report that to the ubuntu or librbd maintainers | 16:36 |
clarkb | right but you cannot install old distro packages reliably (they disappear). | 16:36 |
clarkb | so pinning necessarily is short term is my point | 16:37 |
sean-k-mooney | ya, i was hoping we could just say !19.2.1 | 16:37 |
sean-k-mooney | https://github.com/ceph/ceph/pull/60133/commits/cdbe2c1bb4731395423ab77f3edd2ee4bd053148 looks plausible but my c++ knowledge is a little out of date. | 16:38 |
sean-k-mooney | gibi: that's also in ceph core so im not sure if that will impact librbd | 16:39 |
gibi | clarkb: as soon as I have proof that the package version is the problem I will open a bug upstream | 16:39 |
gibi | to the package maintainer | 16:39 |
sean-k-mooney | but there is a red hatter's email there so perhaps we could reach out to them | 16:39 |
opendevreview | Scott Little proposed openstack/project-config master: add ability for group starlingx-release to do merges on starlingx/docs https://review.opendev.org/c/openstack/project-config/+/955035 | 19:53 |
opendevreview | Merged openstack/project-config master: add ability for group starlingx-release to do merges on starlingx/docs https://review.opendev.org/c/openstack/project-config/+/955035 | 20:06 |
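
The debugging steps discussed above (comparing package versions between the last passing and first failing run, then excluding only the suspect version, as in "!19.2.1") can be sketched in Python. This is a rough illustration, not what the job actually does: it assumes each run's package list is available as plain `name version` lines (for example from `dpkg-query -W -f='${Package} ${Version}\n'`), the function names are hypothetical, and `librbd1` is assumed to be the relevant Ubuntu package name. The stanza uses the apt preferences format, where a negative `Pin-Priority` prevents a matching version from being installed.

```python
def parse_package_list(text):
    """Parse lines of the form 'name version' into a {name: version} dict."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            packages[parts[0]] = parts[1]
    return packages


def changed_packages(passing, failing):
    """Return {name: (passing_version, failing_version)} for packages
    present in both runs whose version differs."""
    return {
        name: (passing[name], failing[name])
        for name in sorted(passing.keys() & failing.keys())
        if passing[name] != failing[name]
    }


def apt_block_stanza(package, bad_version):
    """Build an apt preferences stanza that blocks one version of a package.

    A negative Pin-Priority tells apt never to install a matching version;
    the trailing '*' glob covers any distro revision suffix.
    """
    return (
        f"Package: {package}\n"
        f"Pin: version {bad_version}*\n"
        "Pin-Priority: -1\n"
    )


if __name__ == "__main__":
    # Hypothetical package lists; the '-x' / '-y' suffixes are placeholders,
    # not real Ubuntu revision strings.
    passing = parse_package_list("librbd1 19.2.0-x\nqemu-utils 8.2.2-y\n")
    failing = parse_package_list("librbd1 19.2.1-x\nqemu-utils 8.2.2-y\n")
    for name, (old, new) in changed_packages(passing, failing).items():
        print(f"{name}: {old} -> {new}")
    print(apt_block_stanza("librbd1", "19.2.1"))
```

Such a stanza would typically be written to a file under `/etc/apt/preferences.d/` before packages are installed. As noted in the discussion, this only helps while an older or newer version is still available in the archive, so it is a short-term workaround rather than a fix.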