opendevreview | Mridula Joshi proposed openstack/glance stable/2024.1: Fixed typo in discovery.py https://review.opendev.org/c/openstack/glance/+/928757 | 05:36 |
opendevreview | Merged openstack/glance master: Add metadef value for stateless firmware https://review.opendev.org/c/openstack/glance/+/926599 | 07:58 |
opendevreview | Merged openstack/glance master: api-ref: add 'compressed' container_format https://review.opendev.org/c/openstack/glance/+/928717 | 07:58 |
opendevreview | Merged openstack/glance stable/2024.1: Fixed typo in discovery.py https://review.opendev.org/c/openstack/glance/+/928757 | 11:34 |
opendevreview | Mridula Joshi proposed openstack/glance stable/2023.2: Fixed typo in discovery.py https://review.opendev.org/c/openstack/glance/+/928786 | 11:42 |
opendevreview | Brian Rosmaita proposed openstack/glance master: api-ref: Fix 'name' reference https://review.opendev.org/c/openstack/glance/+/857735 | 12:19 |
opendevreview | Balazs Gibizer proposed openstack/glance master: [doc]Update hw_vif_model as it supports igb now https://review.opendev.org/c/openstack/glance/+/928788 | 12:23 |
dansmith | abhishekk: we're failing to delete the image after some of the format testing: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_659/927291/5/check/tempest-integrated-storage-import/659b93e/testr_results.html | 14:02 |
dansmith | but I'm not sure why that would be... it's just a delete on the image after the import would have failed | 14:03 |
abhishekk | looking | 14:03 |
dansmith | the actual failure is this: https://zuul.opendev.org/t/openstack/build/659b93ec4243482b8e95b09e5c729db7/log/controller/logs/screen-g-api.txt#14053 | 14:04 |
dansmith | which seems DB related and thus I'm struggling to find the relation to what's going on | 14:04 |
dansmith | but it's always one of these failed-because-format tests | 14:04 |
dansmith | all stemming from image_repo.remove(), which is a different request from the failed task and everything | 14:08 |
dansmith | although looks maybe like the task is still running, so maybe there's a bit of a race in glance here between the delete and a running task? | 14:09 |
abhishekk | I can't recall that we delete the image if conversion fails, is it during cleanup from tempest? | 14:09 |
abhishekk | I can see a race condition message in the log while deleting property from image_properties table | 14:10 |
abhishekk | Sep 06 16:46:15.982625 np0038417198 devstack@g-api.service[87670]: DEBUG glance.db.sqlalchemy.api [None req-72866585-8bed-424b-a843-0b3f8fc97b49 tempest-ImagesFormatTest-1430253273 tempest-ImagesFormatTest-1430253273-project-member] StaleDataError while deleting property 'os_glance_import_task' from image 'ec12a1f2-46a5-4c0c-82a3-19a57cc895d5' likely means we raced during delete: UPDATE statement on table 'image_properties' expected to update 1 row(s); 0 were matched. {{(pid=87670) _image_property_delete /opt/stack/glance/glance/db/sqlalchemy/api.py:1328}} | 14:10 |
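For context, the StaleDataError in that pasted log line is SQLAlchemy's way of saying a flush-time UPDATE that was expected to touch exactly one row matched zero, because another transaction removed the row first. A minimal sketch of that pattern, assuming a soft-delete style property table; the model and helper names below are illustrative, not glance's actual code:

```python
# Minimal sketch (illustrative names, not glance's code) of the pattern the
# quoted log line describes: a soft delete whose flush-time UPDATE matches
# zero rows raises StaleDataError, which the caller logs as "we raced".
import logging

from sqlalchemy import Boolean, Column, Integer, String, Text
from sqlalchemy.orm import declarative_base
from sqlalchemy.orm.exc import StaleDataError

LOG = logging.getLogger(__name__)
Base = declarative_base()


class ImageProperty(Base):
    __tablename__ = 'image_properties'
    id = Column(Integer, primary_key=True)
    image_id = Column(String(36))
    name = Column(String(255))
    value = Column(Text)
    deleted = Column(Boolean, default=False)


def property_delete(session, prop):
    try:
        prop.deleted = True   # soft delete
        session.flush()       # UPDATE ... expected to update exactly 1 row
    except StaleDataError:
        # A concurrent writer (e.g. the import task reverting the image)
        # removed the row first, so 0 rows were matched by the UPDATE.
        LOG.debug("StaleDataError while deleting property %r likely means "
                  "we raced during delete", prop.name)
```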
dansmith | yeah, in the tempest test I delete bad images right away, which is definitely no longer needed since I split them all into different tests I suppose, but it's still a legit cleanup activity that is failing | 14:12 |
abhishekk | this is the recent change, I think this is where it is racing? | 14:13 |
abhishekk | https://github.com/openstack/glance/blob/master/glance/async_/flows/plugins/image_conversion.py#L238 | 14:13 |
abhishekk | I added this to revert the image back to queued state | 14:13 |
dansmith | ...which we need also | 14:16 |
dansmith | but IIRC none of those changes get made until the action_wrapper context manager exits, right? | 14:16 |
* dansmith looks | 14:16 |
abhishekk | right | 14:18 |
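A hypothetical illustration of the point being made about the action_wrapper (this is not glance's actual implementation; the buffering dict, extra_properties and image_repo.save usage below are only assumptions used to sketch the idea): property changes made during the task revert are held until the context manager exits, so the exit-time save is the DB write that can land concurrently with the user's image delete.

```python
# Hypothetical sketch of the action_wrapper idea referenced above -- not
# glance's actual code. Property changes recorded during the task revert
# are only written to the database when the context manager exits.
import contextlib


@contextlib.contextmanager
def action_wrapper(image_repo, image):
    pending = {}   # property changes recorded by the revert code
    try:
        yield pending
    finally:
        for name, value in pending.items():
            if value is None:
                image.extra_properties.pop(name, None)   # delete property
            else:
                image.extra_properties[name] = value     # set property
        image_repo.save(image)   # the actual DB write happens here, at exit
```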
dansmith | so maybe we need a retry only on db error in the delete path? | 14:22 |
abhishekk | likely. Or should we check task progress before delete? | 14:24 |
dansmith | I think for delete we should always assume it can work, so we want to delete harder, not say "try delete later" | 14:28 |
dansmith | the error message is a little confusing though | 14:28 |
dansmith | says the transaction is closed | 14:29 |
dansmith | oh actually, | 14:29 |
dansmith | I wonder if this is *already* retrying, but the retrying is happening within the transaction | 14:29 |
abhishekk | but we don't have any retry call on delete | 14:30 |
dansmith | it's retrying on image_update | 14:31 |
dansmith | https://github.com/openstack/glance/blob/master/glance/db/sqlalchemy/api.py#L166 | 14:31 |
dansmith | and I bet it's re-using the same session from the context passed in or something | 14:31 |
abhishekk | _retry_on_deadlock | 14:31 |
abhishekk | remove is internally calling update | 14:32 |
dansmith | right | 14:32 |
abhishekk | I think image_property_delete, which is called from image_update, is creating a new session | 14:39 |
dansmith | also the deadlock retry logs a message when it fires, and that isn't in the log, so no retry I think | 14:40 |
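Roughly what such a deadlock-only retry hook looks like; this is a sketch, not a verbatim copy of the decorator linked above, and the retry parameters and function signature are illustrative. The point is that only DBDeadlock satisfies the predicate, which is why the absence of a deadlock-retry message in the log suggests no retry fired here.

```python
# Sketch of a deadlock-only retry hook like the decorator linked above
# (glance/db/sqlalchemy/api.py); parameters and names are illustrative.
import logging

import retrying
from oslo_db import exception as db_exception

LOG = logging.getLogger(__name__)


def _retry_on_deadlock(exc):
    """Return True only for DB deadlocks, so only those are retried."""
    if isinstance(exc, db_exception.DBDeadlock):
        LOG.warning("Deadlock detected. Retrying...")
        return True
    return False


@retrying.retry(retry_on_exception=_retry_on_deadlock,
                wait_fixed=500, stop_max_attempt_number=50)
def image_update(context, image_id, values, purge_props=False):
    """Placeholder for the decorated DB API call."""
    # A StaleDataError or InvalidRequestError raised deeper inside the same
    # transaction does not match the predicate, so it is not retried and no
    # deadlock-retry warning appears in the log.
```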
abhishekk | image_update --> set_image_properties --> image_property_delete | 14:40 |
dansmith | and is _that_ the transaction it's complaining about? | 14:41 |
abhishekk | let me confirm | 14:41 |
abhishekk | image_update --> set_image_properties --> _image_property_delete, we are passing same session throughout but _image_property_delete is the method where we are getting that error | 14:43 |
dansmith | we don't keep the same session in image_property_delete right? | 14:45 |
dansmith | oh, we call _image_property_delete I see | 14:45 |
abhishekk | hmm and that is the last method call from glance and then it raises InvalidRequestError | 14:47 |
abhishekk | and this debug log suggests we returned from that method: Sep 06 16:46:15.982625 np0038417198 devstack@g-api.service[87670]: DEBUG glance.db.sqlalchemy.api [None req-72866585-8bed-424b-a843-0b3f8fc97b49 tempest-ImagesFormatTest-1430253273 tempest-ImagesFormatTest-1430253273-project-member] StaleDataError while deleting property 'os_glance_import_task' from image 'ec12a1f2-46a5-4c0c-82a3-19a57cc895d5' likely means we raced during delete: UPDATE statement on table 'image_properties' expected to update 1 row(s); 0 were matched. {{(pid=87670) _image_property_delete /opt/stack/glance/glance/db/sqlalchemy/api.py:1328}} | 14:49 |
abhishekk | but then how come it raises the error from that same method again | 14:49 |
dansmith | hmm, are we going through and deleting each property individually? | 14:54 |
abhishekk | I think so, that is what we are doing | 14:55 |
dansmith | so we make a list of properties, then go through and delete each one, but if we race with the task revert, one or more of those is gone when we try to delete it and explode? | 14:57 |
abhishekk | likely, that is the only rational explanation I can see in this scenario | 14:57 |
dansmith | ugh | 14:59 |
abhishekk | and from remove we are passing purge_props=True | 14:59 |
dansmith | yeah | 14:59 |
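To make the suspected race concrete, here is a hypothetical sketch of the purge_props flow being described, reusing the ImageProperty/property_delete sketch from earlier; the names and structure are illustrative, not glance's actual _set_properties_for_image. The existing properties are snapshotted, then each one missing from the new set is deleted, and a row the reverting task already removed makes that per-property delete match zero rows.

```python
# Hypothetical sketch of the purge_props flow under discussion, reusing the
# ImageProperty / property_delete sketch above; not glance's actual
# _set_properties_for_image implementation.
def set_properties_for_image(session, image_id, new_properties,
                             purge_props=True):
    # Snapshot the properties that exist right now.
    original = {
        prop.name: prop
        for prop in session.query(ImageProperty).filter_by(
            image_id=image_id, deleted=False)
    }

    # Create or update everything in the new set.
    for name, value in new_properties.items():
        if name in original:
            original[name].value = value
        else:
            session.add(ImageProperty(image_id=image_id,
                                      name=name, value=value))

    # With purge_props=True (as image_repo.remove() implies), delete every
    # property not in the new set. If the reverting import task has already
    # removed one of these rows, the flush inside property_delete() matches
    # 0 rows and raises StaleDataError.
    if purge_props:
        for name, prop in original.items():
            if name not in new_properties:
                property_delete(session, prop)
```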
dansmith | so I wonder if we can just retry on InvalidRequestError as well | 15:00 |
dansmith | although I assume we need something more specific than just that exception | 15:01 |
abhishekk | I think for time being we can retry on that error | 15:02 |
abhishekk | But we need a refactor of the _set_properties_for_image method | 15:03 |
dansmith | I dunno, I assume we have to retry the whole delete right? because the transaction is closed so I assume we're toast | 15:03 |
dansmith | I found a thread about retrying InvalidRequestErrors being a common practice | 15:06 |
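The idea behind the WIP change below, sketched with the same caveats as the earlier snippets (this is not the actual patch, and the real change may scope the exception check more narrowly): widen the retry predicate so a closed-transaction InvalidRequestError also re-runs the whole decorated DB call on a fresh transaction, rather than only DBDeadlock.

```python
# Sketch of the idea behind the WIP change below -- not the actual patch.
# Widen the retry predicate so an InvalidRequestError (e.g. "this
# transaction is closed" following the StaleDataError race) also re-runs
# the whole decorated DB call on a fresh transaction.
import sqlalchemy.exc
from oslo_db import exception as db_exception


def _retry_on_deadlock_or_closed_txn(exc):
    retriable = (db_exception.DBDeadlock, sqlalchemy.exc.InvalidRequestError)
    return isinstance(exc, retriable)
```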
opendevreview | Dan Smith proposed openstack/glance master: WIP: Retry image update on closed transaction https://review.opendev.org/c/openstack/glance/+/928820 | 15:09 |
dansmith | maybe we recheck this ^ a few times and see if we get that log message but it succeeds? | 15:09 |
dansmith | and if so we can ask for confirmation from mike | 15:09 |
abhishekk | sounds good | 15:09 |
dansmith | I'm definitely puzzled about why this is related to the oslo conversion.. must not be or the oslo code is a little slower/faster and is opening up the race | 15:14 |
abhishekk | yeah, I missed the part that it was working earlier when format_inspector was in glance | 15:27 |
dansmith | it's not a 100% failure, so it's possible it was failing and just got rechecked into the tree, I suppose | 15:30 |
dansmith | i.e. perhaps it's failing in master | 15:30 |
dansmith | of course, no fail and no hits of that issue on the first run | 17:01 |
dansmith | will recheck when it's done | 17:01 |
abhishekk | Ack, let me know if a few checks need to be added; I can do it tomorrow my time | 17:26 |