Thursday, 2021-02-11

michchap eandersson, frickler: I think the intro doc is ready to go in if you have a moment. https://review.opendev.org/c/openstack/designate/+/763779 02:07
*** ianychoi has joined #openstack-dns02:09
*** ircuser-1 has quit IRC06:04
*** ianychoi has quit IRC08:33
*** ianychoi has joined #openstack-dns08:48
*** mgagne has quit IRC12:52
*** mgagne has joined #openstack-dns12:53
-openstackstatus- NOTICE: Recent POST_FAILURE results from Zuul for builds started prior to 15:47 UTC were due to network connectivity issues reaching one of our log storage providers, and can be safely rechecked15:50
nicolasbockHi! I am working on a somewhat puzzling issue with parallel creation of recordsets. I have a 3 node HA Designate with a BIND9 backend on Stein. I create 10 recordsets in parallel and often end up with fewer records in BIND9. I tracked down the issue to how the zone serial is incremented. Since the code is using `time.time()` in seconds the serial is not always incremented and BIND9 is missing some updates because it21:30
nicolasbockbelieves that the serial hasn't changed.21:30
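A minimal sketch of the collision described above, assuming the serial is simply the Unix time truncated to whole seconds. The function name `second_resolution_serial` and the sample timestamps are hypothetical; this is not Designate's actual code, only an illustration of why several updates inside one second can end up with the same serial.

```python
import time


def second_resolution_serial(now=None):
    """Return a zone serial derived from Unix time truncated to whole seconds."""
    if now is None:
        now = time.time()
    return int(now)


# Ten hypothetical recordset updates arriving 50 ms apart,
# all inside the same wall-clock second.
start = 1613079631.0
timestamps = [start + i * 0.05 for i in range(10)]

serials = [second_resolution_serial(t) for t in timestamps]
distinct = set(serials)

# Ten updates, but only one distinct serial value among them.
print(len(serials), len(distinct))
```

If every update is published under the same serial, a secondary that compares serials has no way to tell the later updates apart from the first one.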
nicolasbockI have a fix as well. But for Stein.21:30
nicolasbockI cannot reproduce this issue though starting with Ussuri21:30
nicolasbockThe increment serial code has not changed though which makes this quite puzzling21:31
nicolasbockDoes anyone here have an idea where I should look to see what might have "fixed" the issue?21:31
nicolasbockIs there some better serialization for API calls?21:31
johnsomCould it be a change in behavior from the py2->3 change?21:31
nicolasbockThe Stein env is Python321:31
nicolasbockThe BIND9 version is different as well21:32
nicolasbockStein is Bionic and Ussuri is Focal21:32
nicolasbockI don't see any errors in the BIND9 logs though21:32
nicolasbockIt's queuing the NOTIFYs properly21:32
johnsomWell, a notify should just trigger a serial check and if there are multiple updates to the zone, some with the same serial number, I could see BIND thinking it has the latest data.21:35
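The serial check johnsom describes can be sketched as follows. This is a simplified model of what a secondary does after a NOTIFY, assuming it compares SOA serials using RFC 1982 serial number arithmetic; the helper names `serial_gt` and `needs_transfer` are hypothetical and this is not BIND's implementation.

```python
SERIAL_BITS = 32
HALF_RANGE = 2 ** (SERIAL_BITS - 1)
MODULUS = 2 ** SERIAL_BITS


def serial_gt(a, b):
    """Return True if serial a is newer than serial b (RFC 1982 comparison)."""
    a %= MODULUS
    b %= MODULUS
    return (a < b and b - a > HALF_RANGE) or (a > b and a - b < HALF_RANGE)


def needs_transfer(primary_serial, secondary_serial):
    """A secondary only pulls the zone when the primary's serial is newer."""
    return serial_gt(primary_serial, secondary_serial)


# A newer serial triggers a transfer; an unchanged serial does not,
# even if the zone contents changed underneath it.
print(needs_transfer(1613079632, 1613079631))
print(needs_transfer(1613079631, 1613079631))
```

Under this model, a second update published with an unchanged serial is invisible to the secondary, which matches the missing records seen in BIND9.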
nicolasbockI was just looking at the BIND9 logs again. I have to correlate those with what central is saying about the zone serials21:37
nicolasbockMaybe it's the more recent BIND9 in Focal that "fixes" this21:38
johnsomI guess I would check the precision of the results from time.time() on both bionic and focal, just to rule out a change in the epoch precision.21:38
johnsomYeah, I was just going to look at the bind release notes21:38
nicolasbockAh good point about the precision21:38
nicolasbockIn Stein it's seconds21:38
nicolasbockMaybe that changed in Ussuri21:38
nicolasbock(or Focal)21:39
johnsom https://www.irccloud.com/pastebin/Xuhi8Syd/ 21:39
johnsomThat is from my focal VM21:39
johnsomI don't have a bionic around to check21:39
nicolasbockOh21:39
nicolasbockLet me check21:40
nicolasbock```21:41
nicolasbockubuntu@bionic-lp:~$ python321:41
nicolasbockPython 3.6.9 (default, Oct  8 2020, 12:12:24)21:41
nicolasbock[GCC 8.4.0] on linux21:41
nicolasbockType "help", "copyright", "credits" or "license" for more information.21:41
nicolasbock>>> time.tim()21:41
nicolasbockTraceback (most recent call last):21:41
nicolasbock  File "<stdin>", line 1, in <module>21:41
nicolasbockNameError: name 'time' is not defined21:41
nicolasbock>>> import time21:41
nicolasbock>>> time.time()21:41
nicolasbock1613079631.5469396 21:41
nicolasbock```21:41
nicolasbockLooks the same21:41
nicolasbockWow, am I confused :/21:41
nicolasbockMaybe time for a coffee break ;)21:41
nicolasbockThanks for the help johnsom !21:42
johnsomSure, I will poke around and see if I see something21:42
nicolasbockThanks!21:44
nicolasbockI'll write this up a bit more clearly and open a bug21:44
nicolasbockThat's easier for tracking purposes21:44
johnsom+121:44
johnsomAh, yeah, so designate is using oslo timeutils and stripping the sub-second data by converting it to an int.21:47
nicolasbockAh, that makes sense21:48
johnsom https://github.com/openstack/oslo.utils/blob/master/oslo_utils/timeutils.py#L153 21:49
nicolasbockI'll try to replace the BIND9 package on the backend and downgrade it to the one Bionic is using21:49
nicolasbockMaybe that will show us something21:49
johnsomGiven that, I can certainly see how the bug could happen. Just not sure yet why it wouldn't happen on Ussuri21:50
nicolasbockExactly21:50
nicolasbockLooking at https://opendev.org/openstack/designate/src/branch/master/designate/utils.py#L143 this is certain to break under heavy parallel creation of recordsets21:51
johnsomYeah, that is exactly the code I traced to oslo utils/timeutils21:52
nicolasbockOk, I'll do some surgery on my deployment then to downgrade BIND921:52
johnsomyou could also do a tcpdump on the Ussuri setup and see if BIND is pulling the zone for every NOTIFY.21:53
nicolasbockTrue21:53
johnsomIt could also be some DB optimization where by the time bind does the zone transfer, the new records are there21:54
nicolasbockI have to take a break. The kids need snacks and I have to think about preparing dinner :)21:54
nicolasbockI'll check in tomorrow again21:54
johnsomo/ Ping me whenever you have a bug number and I will add notes there21:55
nicolasbock+121:55
*** gmann is now known as gmann_afk22:10
johnsomThe central coordinated locking might also be playing a factor here: https://review.opendev.org/c/openstack/designate/+/717955 22:13
johnsomThat was new in Ussuri22:14
johnsomThat might be delaying the updates enough that a second has elapsed, at least often enough that it masks the issue with the serial number generator22:17
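A hedged sketch of how per-zone locking could interact with serial generation. The chat only hypothesizes that the lock adds delay; the sketch below shows a related, assumed mechanism instead: if the read-modify-write of the serial happens under a lock and the new serial is bumped whenever the clock has not advanced, parallel updates cannot share a serial. The `Zone` class, `bump_serial`, and the fixed `now` value are hypothetical and do not reflect Designate's actual locking code.

```python
import threading


class Zone:
    """Hypothetical zone holding only a serial and a per-zone lock."""

    def __init__(self, serial):
        self.serial = serial
        self.lock = threading.Lock()


def bump_serial(zone, now):
    """Under the zone lock, advance the serial to `now`, or to serial + 1
    when `now` is not newer than the current serial."""
    with zone.lock:
        zone.serial = max(now, zone.serial + 1)
        return zone.serial


zone = Zone(serial=1613079630)
results = []
results_lock = threading.Lock()


def worker():
    # Every worker sees the same wall-clock second.
    new_serial = bump_serial(zone, now=1613079631)
    with results_lock:
        results.append(new_serial)


threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Ten updates in the same second still produce ten distinct serials.
print(sorted(results))
```

Without the lock, two workers could read the same starting serial and compute the same result, which is the kind of race that would only show up under parallel load.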
nicolasbockIt's backported to Train (https://review.opendev.org/c/openstack/designate/+/736055) but apparently not Stein22:17
*** PrinzElvis has quit IRC22:41
*** PrinzElvis has joined #openstack-dns22:41
*** gmann_afk is now known as gmann23:11
