Tuesday, 2024-09-10

05:36 <opendevreview> Mridula Joshi proposed openstack/glance stable/2024.1: Fixed typo in discovery.py  https://review.opendev.org/c/openstack/glance/+/928757
07:58 <opendevreview> Merged openstack/glance master: Add metadef value for stateless firmware  https://review.opendev.org/c/openstack/glance/+/926599
07:58 <opendevreview> Merged openstack/glance master: api-ref: add 'compressed' container_format  https://review.opendev.org/c/openstack/glance/+/928717
11:34 <opendevreview> Merged openstack/glance stable/2024.1: Fixed typo in discovery.py  https://review.opendev.org/c/openstack/glance/+/928757
11:42 <opendevreview> Mridula Joshi proposed openstack/glance stable/2023.2: Fixed typo in discovery.py  https://review.opendev.org/c/openstack/glance/+/928786
12:19 <opendevreview> Brian Rosmaita proposed openstack/glance master: api-ref: Fix 'name' reference  https://review.opendev.org/c/openstack/glance/+/857735
12:23 <opendevreview> Balazs Gibizer proposed openstack/glance master: [doc]Update hw_vif_model as it supports igb now  https://review.opendev.org/c/openstack/glance/+/928788
14:02 <dansmith> abhishekk: we're failing to delete the image after some of the format testing: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_659/927291/5/check/tempest-integrated-storage-import/659b93e/testr_results.html
14:03 <dansmith> but I'm not sure why that would be... it's just a delete on the image after the import would have failed
14:03 <abhishekk> looking
14:04 <dansmith> the actual fail is this: https://zuul.opendev.org/t/openstack/build/659b93ec4243482b8e95b09e5c729db7/log/controller/logs/screen-g-api.txt#14053
14:04 <dansmith> which seems DB-related, and thus I'm struggling to find the relation to what's going on
14:04 <dansmith> but it's always one of these failed-because-format tests
14:08 <dansmith> all stemming from image_repo.remove(), which is a different request from the failed task and everything
14:09 <dansmith> although it looks like the task may still be running, so maybe there's a bit of a race in glance here between the delete and a running task?
14:09 <abhishekk> I can't recall that we delete the image if conversion fails, is it during cleanup from tempest?
14:10 <abhishekk> I can see a race condition message in the log while deleting a property from the image_properties table
14:10 <abhishekk> Sep 06 16:46:15.982625 np0038417198 devstack@g-api.service[87670]: DEBUG glance.db.sqlalchemy.api [None req-72866585-8bed-424b-a843-0b3f8fc97b49 tempest-ImagesFormatTest-1430253273 tempest-ImagesFormatTest-1430253273-project-member] StaleDataError while deleting property 'os_glance_import_task' from image 'ec12a1f2-46a5-4c0c-82a3-19a57cc895d5' likely means we raced during delete: UPDATE statement on table 'image_properties' expected to update 1 row(s); 0 were matched. {{(pid=87670) _image_property_delete /opt/stack/glance/glance/db/sqlalchemy/api.py:1328}}
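[Editor's note: the debug line above reflects a pattern like the following. This is a hypothetical sketch, not the real glance.db.sqlalchemy.api code: when the per-property delete matches zero rows, assume another request already removed it, log the race, and carry on instead of raising.]

```python
# Hypothetical sketch of the "0 were matched" handling described in the
# StaleDataError log line above; a plain dict stands in for the
# image_properties table.

def delete_image_property(properties, name):
    """Remove `name` from `properties`, tolerating a lost race.

    Returns True if this call deleted the property, False if another
    request got there first (the "0 were matched" case in the log).
    """
    if properties.pop(name, None) is None:
        # Zero rows matched: someone else deleted it; log and move on.
        print("StaleDataError-style race: %r already deleted" % name)
        return False
    return True
```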
14:12 <dansmith> yeah, in the tempest test I delete bad images right away, which is definitely no longer needed since I split them all into different tests I suppose, but it's still a legit cleanup activity that is failing
14:13 <abhishekk> this is a recent change, I think it is racing here?
14:13 <abhishekk> https://github.com/openstack/glance/blob/master/glance/async_/flows/plugins/image_conversion.py#L238
14:13 <abhishekk> I added this to revert the image back to queued state
14:16 <dansmith> ...which we also need
14:16 <dansmith> but IIRC none of those changes get made until the action_wrapper context manager exits, right?
14:16 * dansmith looks
14:18 <abhishekk> right
14:22 <dansmith> so maybe we need a retry only on DB error in the delete path?
14:24 <abhishekk> likely, or should we check task progress before delete?
14:28 <dansmith> I think for delete we should always assume it can work, so we want to delete harder, not say "try delete later"
14:28 <dansmith> the error message is a little confusing though
14:29 <dansmith> says the transaction is closed
14:29 <dansmith> oh actually,
14:29 <dansmith> I wonder if this is *already* retrying, but the retrying is happening within the transaction
14:30 <abhishekk> but we don't have any retry call on delete
14:31 <dansmith> it's retrying on image_update
14:31 <dansmith> https://github.com/openstack/glance/blob/master/glance/db/sqlalchemy/api.py#L166
14:31 <dansmith> and I bet it's re-using the same session from the context passed in or something
14:31 <abhishekk> _retry_on_deadlock
14:32 <abhishekk> remove is internally calling update
14:32 <dansmith> right
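[Editor's note: the `_retry_on_deadlock` decorator mentioned here can be sketched roughly as below. The names and exception type are illustrative; the real Glance/oslo.db code wraps SQLAlchemy deadlock exceptions.]

```python
import functools
import time


class DeadlockError(Exception):
    """Stand-in for the DB deadlock exception the real decorator catches."""


def retry_on_deadlock(retries=3, delay=0.0):
    """Re-run the wrapped DB call if it raises DeadlockError."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return fn(*args, **kwargs)
                except DeadlockError:
                    if attempt == retries:
                        raise          # out of attempts, propagate
                    time.sleep(delay)  # brief backoff before retrying
        return wrapper
    return decorator
```

As the discussion above notes, each retry has to run in a fresh session/transaction; retrying inside an already-closed transaction would just reproduce the InvalidRequestError.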
14:39 <abhishekk> I think image_property_delete is creating a new session when it is called from image_update
14:40 <dansmith> also the deadlock retry logs a message, which isn't in the log, so no retry I think
14:40 <abhishekk> image_update --> set_image_properties --> image_property_delete
14:41 <dansmith> and is _that_ the transaction it's complaining about?
14:41 <abhishekk> let me confirm
14:43 <abhishekk> image_update --> set_image_properties --> _image_property_delete; we are passing the same session throughout, but _image_property_delete is the method where we are getting that error
14:45 <dansmith> we don't keep the same session in image_property_delete, right?
14:45 <dansmith> oh, we call _image_property_delete, I see
14:47 <abhishekk> hmm, and that is the last method call from glance, and then it raises InvalidRequestError
14:49 <abhishekk> and this debug log suggests we returned from that method: Sep 06 16:46:15.982625 np0038417198 devstack@g-api.service[87670]: DEBUG glance.db.sqlalchemy.api [None req-72866585-8bed-424b-a843-0b3f8fc97b49 tempest-ImagesFormatTest-1430253273 tempest-ImagesFormatTest-1430253273-project-member] StaleDataError while deleting property 'os_glance_import_task' from image 'ec12a1f2-46a5-4c0c-82a3-19a57cc895d5' likely means we raced during delete: UPDATE statement on table 'image_properties' expected to update 1 row(s); 0 were matched. {{(pid=87670) _image_property_delete /opt/stack/glance/glance/db/sqlalchemy/api.py:1328}}
14:49 <abhishekk> but how come it raises the error from that same method again
14:54 <dansmith> hmm, are we going through and deleting each property individually?
14:55 <abhishekk> I think so, that is what we are doing
14:57 <dansmith> so we make a list of properties, then go through and delete each one, but if we race with the task revert, one or more of those is gone when we try to delete it and explode?
14:57 <abhishekk> likely, that is the only rational explanation I can see in this scenario
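[Editor's note: the race dansmith describes can be sketched as follows. This is purely illustrative, not Glance code: the purge path snapshots the property list up front, then deletes one by one, and a concurrent request (e.g. the task revert) can remove a property in between, so that per-property delete matches nothing.]

```python
def purge_properties(db_props, racing_delete=None):
    """Delete every property from a snapshot of `db_props`.

    `racing_delete` names a property that a concurrent request removes
    after we took our snapshot; the per-property delete for it then
    matches zero rows, which is the explosion point in the real bug.
    """
    snapshot = list(db_props)          # list of properties made up front
    missed = []
    for name in snapshot:
        if name == racing_delete:
            db_props.pop(name, None)   # the racing request wins first
        if db_props.pop(name, None) is None:
            missed.append(name)        # "0 were matched" for this one
    return missed
```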
14:59 <dansmith> ugh
14:59 <abhishekk> and from remove we are passing purge_props=True
14:59 <dansmith> yeah
15:00 <dansmith> so I wonder if we can just retry on InvalidRequestError as well
15:01 <dansmith> although I assume we need something more specific than just that exception
15:02 <abhishekk> I think for the time being we can retry on that error
15:03 <abhishekk> but we need to refactor the _set_properties_for_image method
15:03 <dansmith> I dunno, I assume we have to retry the whole delete, right? because the transaction is closed, so I assume we're toast
15:06 <dansmith> I found a thread about retrying InvalidRequestError being a common practice
15:09 <opendevreview> Dan Smith proposed openstack/glance master: WIP: Retry image update on closed transaction  https://review.opendev.org/c/openstack/glance/+/928820
15:09 <dansmith> maybe we recheck this ^ a few times and see if we get that log message but it succeeds?
15:09 <dansmith> and if so we can ask for confirmation from mike
15:09 <abhishekk> sounds good
15:14 <dansmith> I'm definitely puzzled about why this is related to the oslo conversion.. it must not be, or the oslo code is a little slower/faster and is opening up the race
15:27 <abhishekk> yeah, I missed the part that it was working earlier when format_inspector was in glance
15:30 <dansmith> it's not a 100% fail, so it's possible it was failing and just got rechecked into the tree, I suppose
15:30 <dansmith> i.e. perhaps it's failing in master
17:01 <dansmith> of course, no fail and no hits of that issue on the first run
17:01 <dansmith> will recheck when it's done
17:26 <abhishekk> ack, let me know if a few checks need to be added tomorrow my time

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!