Friday, 2021-10-15

00:05 <ozzzo> eandersson: Looking at this doc: https://docs.openstack.org/designate/latest/admin/ha.html
00:06 <ozzzo> It appears that only designate-producer uses locking. We installed redis after rebuilding some clusters on train, but the locking doesn't work, and we get "Duplicate record" errors if we delete a VM and re-create it immediately with the same IP
00:06 <ozzzo> so I want to figure out how the locking works, and then figure out why it isn't working for us
00:07 <ozzzo> it looks like what is happening is, one controller starts deleting the old record, and then before it finishes deleting, another controller tries to create the new one
00:07 <johnsom> ozzzo did you confirm the previous steps we provided? Like checking the train version you are running?
00:08 <ozzzo> yes, we have the latest train code
00:08 <johnsom> And the sink question?
00:09 <ozzzo> I lost access to what we discussed. Is openstack-dns archived anywhere?
00:09 <johnsom> Yeah, one minute
00:09 <ozzzo> my work doesn't allow irc so I have to use the garbage web client
00:10 <johnsom> Ah, bummer
00:10 <johnsom> ozzzo https://meetings.opendev.org/irclogs/%23openstack-dns/%23openstack-dns.2021-10-13.log.html#t2021-10-13T18:04:56
00:11 <ozzzo> yes I looked at our code and we have that patch
00:11 <ozzzo> someone told me we're running the latest; how can I verify that we're on 9.0.2?
00:12 <johnsom> Look in your logs for this line: INFO designate.service [-] Starting central service (version: 12.1.0)
00:14 <ozzzo> yes we're on 9.0.2
00:15 <johnsom> If you grep for "coordination" in the central logs what do you see?
00:16 <ozzzo> I see old messages that say "no coordination backend configured"
00:16 <ozzzo> nothing recent
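That warning points at the [coordination] section of designate.conf; when it is unset, Designate logs the message above and locking is not cluster-wide. A minimal sketch of a Redis-backed setting (host, port, and password here are placeholders):

    [coordination]
    backend_url = redis://:<password>@192.0.2.10:6379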
00:17 <johnsom> Are you running with debug logging on?
00:19 <ozzzo> no. how can I enable that for just Designate?
00:19 <johnsom> In your designate configuration file, in the [DEFAULT] section, set "debug = True"
00:20 <ozzzo> in /etc/kolla/designate-central/designate.conf? and then restart the designate_central container?
00:21 <johnsom> I don't know anything about kolla and how it deploys designate, but that seems like a logical place, yeah.
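Applied to the path ozzzo mentions, the change is just the following (the restart mechanics depend on the deployment):

    # /etc/kolla/designate-central/designate.conf
    [DEFAULT]
    debug = True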
00:24 <johnsom> eandersson is the expert on the locking, but I can give some pointers I know about what was implemented.
00:24 <johnsom> This decorator: https://github.com/openstack/designate/blob/f101fab29540ba11481abbf9c7558f976d14b26b/designate/central/service.py#L1260
00:24 <johnsom> does the locking using the configured coordination driver.
00:26 <johnsom> As you can see, each call for create recordset and create record has the wrapper. It will use the lock manager to lock the zone for each create, etc. This limits concurrent access to change the zone, thus ensuring the serial gets updated correctly.
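For illustration, a minimal sketch of the zone-lock pattern that decorator implements, built directly on tooz (the library Designate's coordination uses); the names, backend URL, and simplified signature below are illustrative, not Designate's actual code:

    import functools

    from tooz import coordination

    # Assumption: the URL matches designate.conf's [coordination] backend_url.
    coordinator = coordination.get_coordinator(
        'redis://192.0.2.10:6379', b'central-worker-1')
    coordinator.start()

    def synchronized_zone(func):
        """Serialize all operations that modify the same zone."""
        @functools.wraps(func)
        def wrapper(context, zone_id, *args, **kwargs):
            # One distributed lock per zone: concurrent creates/deletes on
            # the same zone queue up here instead of racing on the records
            # and the zone serial.
            lock = coordinator.get_lock(b'designate-zone-' + zone_id.encode())
            with lock:
                return func(context, zone_id, *args, **kwargs)
        return wrapper

    @synchronized_zone
    def create_recordset(context, zone_id, recordset):
        ...  # insert the recordset and bump the zone serial under the lock

The key point for ozzzo's symptom: if the coordinator is not actually configured and running on every controller, the lock cannot span hosts, and a delete on one controller can interleave with a create on another.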
00:29 <ozzzo> ok reading, ty
00:50 <ozzzo> it looks like this code prevents updates to a zone that is being deleted. our case is slightly different; the forward and reverse records are in the process of being deleted, and we are creating new forward and reverse records for the same name and IP. Is there a mechanism to make the creation wait until the deletion finishes?
05:11 <eandersson> I don't use reverse records, so could be a new / different bug
07:39 <eandersson> The code should prevent any modification to a zone while another record is being updated / deleted / created
07:40 <eandersson> If something is able to happen that shouldn't, we might be missing a lock
07:42 <eandersson> I can't remember the path that neutron takes, maybe frickler remembers
07:42 <eandersson> but maybe we need a lock here?
07:42 <eandersson> https://github.com/openstack/designate/blob/f101fab29540ba11481abbf9c7558f976d14b26b/designate/central/service.py#L2174
07:47 <eandersson> ozzzo would you be able to provide us with more logs? also, ideally your config (without passwords)
07:47 <eandersson> At least the full exception (ideally traced from all logs)
07:51 <frickler> neutron does forward records first and then reverse, all by designate API calls, so nothing special there I'd hope
07:52 <frickler> but yes, we need some instructions to reproduce, otherwise I wouldn't know how to tackle it
09:05 <opendevreview> Vadym Markov proposed openstack/designate-dashboard master: Rename "Floating IP Description" field  https://review.opendev.org/c/openstack/designate-dashboard/+/814121
11:52 <ozzzo> eandersson: I can't; in fact I've been advised that I can't contact the community at all from my Verisign workstation until I get upper management approval
11:52 <ozzzo> the "security" department is totally out of control; they have us locked down to the point where we can barely work
14:01 *** njohnston_ is now known as njohnston
15:03 <ozzzo> after enabling debug logging in designate-central, I see 3 "coordination" messages when I restart the container. Besides that, coordination doesn't appear in the log. When I duplicate the issue I see stack traces in designate-sink.log but nothing about coordination there either
17:08 <ozzzo> I'm trying to delete a record that was created by Designate and I get error "Managed records may not be deleted"
17:08 <ozzzo> how can I work around that?
17:53 <johnsom> Usually it's not a good idea to do that, but if you really need to, https://docs.openstack.org/python-designateclient/latest/cli/index.html#recordset-delete the --edit-managed option should override that.
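With python-designateclient's OpenStack CLI plugin, that override looks like the following (the zone and recordset IDs are placeholders):

    openstack recordset delete <zone-id-or-name> <recordset-id> --edit-managed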
17:58 -opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline starting in 5 minutes, at 18:00 UTC, for scheduled project rename maintenance, which should last no more than an hour (but will likely be much shorter): http://lists.opendev.org/pipermail/service-announce/2021-October/000024.html
17:59 <eandersson> ozzzo: --edit-managed
17:59 <ozzzo> righton, ty!
17:59 <ozzzo> I broke designate and then deleted some VMs
17:59 <eandersson> I'll try to reproduce it this weekend.
18:00 <eandersson> Are you using designate-sink (aka notifications)?
18:01 <ozzzo> yes
18:01 <eandersson> https://github.com/openstack/kolla-ansible/blob/stable/train/ansible/roles/designate/templates/designate.conf.j2#L64
18:01 <eandersson> Looks like it is the default in kolla-ansible
18:01 <eandersson> And is this happening when you are just creating and destroying VMs?
18:01 <ozzzo> designate-sink.log is where I see the stacktraces when it fails
18:01 <eandersson> I see
18:02 <eandersson> We have this error too, but it isn't very frequent
18:02 <ozzzo> yes. To duplicate it I had to create a TF VM with a static IP, taint it, and then run TF apply
18:02 <ozzzo> so it's deleting the record and then immediately creating a new record with the same name and IP
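The taint-and-apply cycle ozzzo describes, sketched as Terraform commands (the resource address is a hypothetical example):

    terraform taint openstack_compute_instance_v2.vm1
    terraform apply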
18:02 <eandersson> I wrote my own notification handler
18:02 <eandersson> but we only do nova, not neutron
18:03 <eandersson> so we don't hit it very often
18:03 <eandersson> like one per 10k VMs or something like that
18:03 <ozzzo> I don't think we've seen it in nova
18:03 <eandersson> Is it always neutron ptr?
18:04 <eandersson> btw sorry when I say nova
18:04 <eandersson> I mean nova here https://github.com/openstack/kolla-ansible/blob/stable/train/ansible/roles/designate/templates/designate.conf.j2#L64
18:04 <eandersson> handler:nova_fixed vs. handler:neutron_floatingip
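For context, those handlers are wired up in designate.conf roughly as sketched below, assuming train-era option names; the zone IDs are placeholders and the linked kolla-ansible template is authoritative for the exact keys:

    [service:sink]
    enabled_notification_handlers = nova_fixed, neutron_floatingip

    [handler:nova_fixed]
    notification_topics = notifications
    control_exchange = nova
    zone_id = <zone-uuid>

    [handler:neutron_floatingip]
    notification_topics = notifications
    control_exchange = neutron
    zone_id = <zone-uuid>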
18:05 <eandersson> Does the exception always come from nova.py or neutron.py?
18:05 <eandersson> https://github.com/openstack/designate/blob/stable/train/designate/notification_handler/neutron.py
18:06 <eandersson> or is it mixed?
18:07 <eandersson> The way kolla-ansible is set up is very aggressive. We run into this issue with ~1 per 10k VMs, but we only create a single A record. kolla-ansible is set up to create 6 records per VM
18:08 <ozzzo> I don't see nova or neutron in the logs. I see pymysql and designate
18:08 <eandersson> I see
18:08 <ozzzo> it tries to create an entry in the mysql database and fails with "Duplicate entry"
18:09 <eandersson> Yea - it's so weird because while we have that, it goes away if I add memcached as a coordinator
18:09 <ozzzo> looks like it happens on both forward and reverse
18:09 <ozzzo> hopefully I will get permission to share logs sometime soon
18:10 <eandersson> johnsom a big todo for testing is to mock nova/neutron notifications and add that as part of the functional testing
18:10 <eandersson> When I designed the coordination stuff I would just craft custom json payloads and send a million of them to rabbitmq
18:11 <johnsom> eandersson Interesting. Yeah, we aren't targeting the sink or notifications in our initial plan, but it will come.
18:12 <eandersson> The way the sink talks to central, it is easier to test central using it
18:12 <eandersson> for race conditions
18:13 <eandersson> (since you don't need to make rest calls, just send 100 messages to rabbitmq and let it consume them instantly)
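A minimal sketch of that approach using oslo.messaging; the transport URL and payload fields are illustrative placeholders, not the exact schema the real handlers parse:

    # Sketch: flood designate-sink with fake nova notifications via RabbitMQ.
    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_notification_transport(
        cfg.CONF, url='rabbit://guest:guest@192.0.2.20:5672/')
    notifier = oslo_messaging.Notifier(
        transport, driver='messaging', publisher_id='compute.fake-host',
        topics=['notifications'])

    # Emit many create events back to back; the sink consumes them as fast
    # as it can, which exercises central's zone locking under contention.
    for i in range(100):
        notifier.info(
            {}, 'compute.instance.create.end',
            {'instance_id': 'fake-%d' % i,
             'hostname': 'vm-%d' % i,
             'fixed_ips': [{'address': '10.0.0.%d' % (i % 254 + 1)}]})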
18:14 <johnsom> Might be worth writing up in a bug
18:15 <eandersson> Hopefully I will have some time later today or this weekend
18:15 <eandersson> I also want to try to reproduce this in general.
18:15 <eandersson> The only thing that is different from when I tested is that kolla-ansible creates 6 records per VM.
18:16 <eandersson> With a coordinator I was able to create 1k fake VMs without an issue
18:16 <eandersson> but I was only creating one record per VM
18:19 <eandersson> I also used memcached and not redis, but I have a hard time thinking that redis wouldn't work :D
18:23 <ozzzo> I can duplicate it consistently. It fails every time, when I use TF to rebuild a VM with a static IP and existing Designate DNS records
18:23 <ozzzo> on the 2nd rebuild it works because the DNS doesn't exist, then fails on the 3rd, etc
19:07 <eandersson> Yea - I mean it really just seems like coordination isn't working for you at all, but not sure how to prove that.
19:25 <johnsom> eandersson When you have a minute, can you look at https://review.opendev.org/c/openstack/designate-tempest-plugin/+/800280 It has a +1 and +2 and would continue our path towards updating our tempest plugin.
20:39 <eandersson> lgtm johnsom
20:40 <johnsom> Thank you
21:27 <opendevreview> Merged openstack/designate-tempest-plugin master: Update service client access in tempest tests  https://review.opendev.org/c/openstack/designate-tempest-plugin/+/800280
