Tuesday, 2023-04-04

01:25 <opendevreview> Merged openstack/designate-tempest-plugin master: Re-enable the tempest tests and add Antelope  https://review.opendev.org/c/openstack/designate-tempest-plugin/+/879168
03:48 <opendevreview> Erik Olof Gunnar Andersson proposed openstack/designate master: Move to a batch model for incrementing serial  https://review.opendev.org/c/openstack/designate/+/871255
04:57 <opendevreview> Merged openstack/designate master: Fix sharing a zone with the zone owner  https://review.opendev.org/c/openstack/designate/+/879208
15:18 <opendevreview> Michael Johnson proposed openstack/designate stable/2023.1: Fix sharing a zone with the zone owner  https://review.opendev.org/c/openstack/designate/+/879474
15:46 <ozzzo_work> I'm seeing orphaned DNS records created when I delete VMs in my kolla Train cluster. It seems to happen on about 1 in 50 deletions. The deletion is happening here: https://github.com/openstack/designate/blob/60edc59ff765b406e4b936deb4d200a2d9b411ce/designate/notification_handler/base.py#L113
15:46 <ozzzo_work> I added some extra logging to see what is happening: https://paste.openstack.org/show/bM7b0Cd4YDiJDlhqa1HE/
15:48 <ozzzo_work> I loop through the recordset and see a single record. Then the if statement at line 117 (or line 16 in the paste) fails, and we see the "not found" error. Then I loop through it again and the record is still there
15:48 <ozzzo_work> Why is the if failing when the record exists in the recordset?
johnsomozzo_work This check: if record_to_delete not in recordset.records15:59
johnsomIs checking the whole object, not just the ID, so I expect there is something different in the object comparison. Maybe it should be looking for the IDs and not the whole object.16:00
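For illustration, a minimal sketch of the comparison behavior being discussed, using a hypothetical Record dataclass rather than the actual designate objects: membership with "in" compares every field, so a copy of the record whose status (or any other field) has drifted no longer matches, while an ID-based check still does.

    # Hypothetical sketch, not the real designate object code.
    from dataclasses import dataclass

    @dataclass
    class Record:
        id: str
        data: str
        status: str

    records = [Record(id='abc', data='10.0.0.5', status='ACTIVE')]

    # The "same" record as fetched earlier, before a status change.
    record_to_delete = Record(id='abc', data='10.0.0.5', status='PENDING')

    print(record_to_delete in records)                     # False: compares all fields
    print(record_to_delete.id in [r.id for r in records])  # True: compares only the ID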
16:06 <ozzzo_work> johnsom: what would that look like? Would it be: if record_to_delete['id'] not in recordset.records:
16:07 <ozzzo_work> or would it be better to leave out the if and just delete it in a try:?
16:09 <ozzzo_work> something like: try: recordset.records.remove(record_to_delete) except: (log error) ?
16:17 <johnsom> That code is already in a try block, so you probably don't need another one. I would assume the overhead of that remove is low enough that just trying it should be fine.
16:18 <johnsom> I would have to dig into that object code to see if it would be a major performance hit with a large number of records in the recordset. I'm not super familiar with that code yet.
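For illustration, a rough sketch of the removal being discussed, assuming recordset.records behaves like a plain Python list of record objects (the real designate RecordList may differ); delete_record is a hypothetical helper, not designate code. Note that for a plain list, list.remove() uses the same equality test as "in", so matching by ID is the variant that actually sidesteps the whole-object comparison.

    # Hypothetical helper, assuming a plain list of record objects.
    import logging

    LOG = logging.getLogger(__name__)

    def delete_record(recordset, record_to_delete):
        # Match on ID rather than full object equality.
        match = [r for r in recordset.records
                 if r.id == record_to_delete.id]
        if not match:
            LOG.error('Record %s not found in recordset %s',
                      record_to_delete.id, recordset.id)
            return
        recordset.records.remove(match[0])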
16:27 <ozzzo_work> ok I'll try it, ty!
20:00 <opendevreview> Erik Olof Gunnar Andersson proposed openstack/designate master: Secondary zone loops AXFR transfer during zone creation  https://review.opendev.org/c/openstack/designate/+/864131
20:05 <eandersson> What version is this ozzzo_work?
20:07 <eandersson> There are typically two issues with the notifications code
20:07 <eandersson> 1) Race conditions with some older versions of Designate can cause records to be missed. This is usually just when a VM is created and destroyed quickly, but could also happen if you have multiple IPs per record (e.g. many VMs with the same name)
20:07 <eandersson> 2) Missing notifications from Nova / Neutron. This usually happens when the compute is having issues (e.g. hardware issues, service restarted during process)
20:08 <johnsom> The paste output led me to think it was an object comparison issue. Like a status was off or something
20:08 <eandersson> Yea - that is possible
20:08 <eandersson> I have made a ton of improvements around these paths
20:10 <eandersson> Although it looks like ozzzo should have all of those already
20:11 <eandersson> Make sure you are using a coordinator because that is a common issue
20:12 <eandersson> I honestly suspect that my batching PR will solve this too. One of the reasons I went the batching route was to solve issues around designate-sink
20:13 <eandersson> The API layer protects against a lot of these types of issues
20:13 <johnsom> Working on reviewing that patch now.
20:13 <eandersson> Awesome
20:13 <eandersson> Those last few patches were attempts to move the sink code closer to the API code, but it wasn't always enough
20:14 <eandersson> These are the same race conditions we see with the PTR code btw
20:15 <eandersson> (the ones that occasionally cause our PTR functional tests to fail)
20:15 <eandersson> Also, make sure you are using a coordinator that supports locking.
20:16 <eandersson> Which is all of the drivers, so you're fine as long as you have one of these configured :D https://docs.openstack.org/tooz/latest/user/compatibility.html#id4
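For reference, a minimal sketch of what the coordination setup might look like in designate.conf when using the Redis tooz driver; the host and port are placeholders, and the exact option names should be checked against the designate and tooz documentation for your release.

    [coordination]
    # tooz backend URL used for distributed locking; Redis, etcd3, and
    # Zookeeper drivers, among others, support locks (see the tooz
    # compatibility matrix linked above).
    backend_url = redis://192.0.2.10:6379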
23:31 <ozzzo> eandersson: We're running Train. I think we must have patched it, because stable/train has different delete code that was last changed in 2020: https://github.com/openstack/designate/blob/train-eol/designate/notification_handler/base.py
23:31 <ozzzo> What we're running looks like this: https://github.com/openstack/designate/blob/60edc59ff765b406e4b936deb4d200a2d9b411ce/designate/notification_handler/base.py
23:31 <ozzzo> This version includes the 2021 "Fix race condition in the sink when deleting records" patch
23:34 <ozzzo> we are running redis; we set that up last year when we were having a similar issue that happened frequently. That helped a lot but we still see the occasional orphaned record.
23:39 <ozzzo_work> I tried this but it still fails to delete: https://paste.openstack.org/show/bBCWzZXkronbcrwQF3Ar/
23:41 <ozzzo_work> This is happening shortly after the record is created. Could I be hitting some kind of race condition where the record is locked because it hasn't updated on all 3 controllers?
23:42 <ozzzo_work> My script creates the VM, pings it, checks forward and reverse DNS at the NS, then checks at all 3 controllers, and after all that is working (usually 10-20 seconds) then it deletes the VM
23:43 <ozzzo_work> 10 VMs every 10 minutes, per cluster
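For illustration, a hypothetical sketch (using dnspython) of the kind of forward/reverse check such a script might run against each nameserver before deleting the VM; dns_consistent, the FQDN, IP, and nameserver addresses are all placeholders, not part of the script discussed above.

    # Hypothetical pre-delete DNS consistency check using dnspython.
    import dns.resolver
    import dns.reversename

    def dns_consistent(fqdn, ip, nameservers):
        for ns in nameservers:
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [ns]
            # Note: resolve() raises NXDOMAIN/NoAnswer if the record is missing.
            forward = {r.address for r in resolver.resolve(fqdn, 'A')}
            reverse = {str(r) for r in
                       resolver.resolve(dns.reversename.from_address(ip), 'PTR')}
            if ip not in forward or fqdn not in reverse:
                return False
        return True

    # e.g. dns_consistent('vm-01.example.com.', '192.0.2.50',
    #                     ['203.0.113.1', '203.0.113.2', '203.0.113.3'])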
