Monday, 2024-02-26

08:15 <mnasiadka> morning
08:28 <opendevreview> Michal Nasiadka proposed openstack/kolla stable/2023.2: toolbox: Bump ansible-core to 2.15  https://review.opendev.org/c/openstack/kolla/+/910148
08:34 <mnasiadka> frickler, kevko: ^^ seems we forgot this in the 2023.2 cycle...
08:42 <opendevreview> Michal Nasiadka proposed openstack/kolla master: WIP: Add support for rpm to repos.yaml  https://review.opendev.org/c/openstack/kolla/+/909879
08:48 <opendevreview> Michal Nasiadka proposed openstack/kolla-ansible master: CI: Break OVN cluster before reconfigure  https://review.opendev.org/c/openstack/kolla-ansible/+/897935
08:51 <frickler> mnasiadka: I'm not sure we can or should do this for a stable branch. let's see what testing says, but at least it'll need a reno?
08:52 <mnasiadka> frickler: now it's the minimum supported version; we usually had a maximum supported version - if it passes, I assume we can create a bug and add a reno
09:16 <opendevreview> Michal Nasiadka proposed openstack/kolla master: WIP: Add support for rpm to repos.yaml  https://review.opendev.org/c/openstack/kolla/+/909879
09:45 <opendevreview> Michal Nasiadka proposed openstack/kolla stable/2023.2: toolbox: Improve retry loop for ansible-galaxy  https://review.opendev.org/c/openstack/kolla/+/910181
09:45 <opendevreview> Michal Nasiadka proposed openstack/kolla stable/2023.1: toolbox: Improve retry loop for ansible-galaxy  https://review.opendev.org/c/openstack/kolla/+/910182
09:45 <opendevreview> Michal Nasiadka proposed openstack/kolla stable/zed: toolbox: Improve retry loop for ansible-galaxy  https://review.opendev.org/c/openstack/kolla/+/910183
09:46 <opendevreview> Michal Nasiadka proposed openstack/kolla stable/zed: toolbox: Improve retry loop for ansible-galaxy  https://review.opendev.org/c/openstack/kolla/+/910183
09:46 <mnasiadka> seems the same thing happened on 2023.1 - some collections were not installed and we didn't fail the build (periodic)
09:48 <frickler> that sounds plausible
09:50 <SvenKieske> mhm
09:51 <frickler> oh, we didn't even retry in zed
09:51 <SvenKieske> ah, we didn't backport this? I remember it.
09:59 <SvenKieske> I should pay more attention to where a backport is needed, I guess. :)
10:07 <mnasiadka> well, a simple script getting the backport flag and backporting would be nice
10:12 <frickler> yeah, except that that patch also didn't have a backport flag set. so we need another script to check that. or a deputy that watches things ;)
10:13 <mnasiadka> well, we need all of us checking whether it needs backporting when we merge it :)
10:13 <mnasiadka> point taken
10:14 <SvenKieske> we could have a script that mandates an active decision on backporting - a vote of either -1 or +1 - and forbids abstaining, no? so a basic gerrit rule for that? it adds more work though, but I guess that's the point? :)
10:15 <mnasiadka> that's also an option
10:16 <SvenKieske> so, basically before a patch is gated and gets Verified +2, the bot checks whether the backport yes/no decision was made.
10:18 <mnasiadka> I think we can just have a gerrit config that doesn't allow W+1 if the backport flag is not one of -2 -1 +1 +2
10:18 <SvenKieske> yeah, sounds good. if you nudge me in the right direction I can spin up a short patch
10:18 <SvenKieske> is that in the project-config settings, or where do I find this?
10:20 <opendevreview> Michal Nasiadka proposed openstack/kolla stable/zed: toolbox: Improve retry loop for ansible-galaxy  https://review.opendev.org/c/openstack/kolla/+/910183
10:21 <SvenKieske> I guess it would go somewhere in here? https://opendev.org/openstack/project-config/src/branch/master/gerrit/acls/openstack/kolla.config#L53
10:37 <mnasiadka> probably, I'm not an expert on those weird ACLs, I just remember they need a tab (not 4 spaces instead of a tab) ;-)
10:42 <frickler> I don't think you can do this with gerrit; iiuc you'd have to renumber things from 0..+4 or so. we'd need some more complicated check job like the release team uses to verify PTL approvals, but I'd think that would be overdoing it. maybe just having a script that does these checks right before the monthly stable releases would work better
10:47 <opendevreview> Will Szumski proposed openstack/kayobe master: WIP: Add podman support  https://review.opendev.org/c/openstack/kayobe/+/909686
10:49 <SvenKieske> frickler: well, I guess you could do this via the prolog rules that gerrit uses to evaluate the ACLs, you can add custom stuff there, but I guess that's too invasive, as it seems we do not use this feature yet.
10:51 <SvenKieske> mhm, seems that's deprecated as well and one should use submit requirements, but I don't see how this can be done with submit requirements just yet: https://gerrit-review.googlesource.com/Documentation/prolog-cookbook.html
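A submit requirement along the lines being discussed would go into the project's Gerrit config roughly like this. This is an unverified sketch only: the label name `Backport-Candidate` and the exact label-operator syntax are assumptions to check against the Gerrit submit-requirements documentation, not something taken from the kolla ACL.

```ini
; project.config fragment (sketch): block submit until someone casts an
; explicit backport vote, positive or negative, on the label.
[submit-requirement "Backport-Decision"]
    description = An explicit backport yes/no vote must be cast before submit
    ; intent: any non-zero vote on the (assumed) Backport-Candidate label
    submittableIf = label:Backport-Candidate>=1 OR label:Backport-Candidate<=-1
    canOverrideInChildProjects = false
```

Whether a `>=`/`<=` range works here, or whether `MAX`/`MIN` predicates are needed instead, would have to be verified against the Gerrit docs before proposing a project-config change.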
10:52 <opendevreview> Michal Nasiadka proposed openstack/kolla master: WIP: Add support for rpm to repos.yaml  https://review.opendev.org/c/openstack/kolla/+/909879
10:55 <opendevreview> Matúš Jenča proposed openstack/kolla-ansible master: Add Redis as caching backend for Keystone  https://review.opendev.org/c/openstack/kolla-ansible/+/909201
11:00 <SvenKieske> mhm, thinking about it, I _guess_ I have a solution..
11:03 <SvenKieske> mnasiadka, frickler: possibly buggy, so please evaluate :) https://review.opendev.org/c/openstack/project-config/+/910212
11:16 <kevko> morning
11:18 <SvenKieske> o/
11:20 <kevko> We were upgrading a quite big OpenStack ... and rabbitmq failed to upgrade.
11:21 <kevko> and I am wondering why ...
11:21 <kevko> we've seen in the logs something like "wait for mnesia ... blabla ... 9 retries left ... fail"
11:21 <SvenKieske> was that the upgrade to zed? or what was it?
11:21 <kevko> it was xena -> yoga
11:24 <SvenKieske> was this with the default config/upgraded from older versions, or a xena release? because I have seen old envs with no HA enabled at all, because of our conservative upgrade policies so far :)
11:25 <kevko> in the upgrade job there is a task which stops the rabbitmqs, but not the first node ... and right after this the first node is stopped ... as it's a huge cloud, I was wondering if this was the problem (while 2 nodes were still in the process of stopping, the first node was stopped too, and because of this the cluster was not stopped correctly)
11:25 <SvenKieske> mhm, taking one step back, what does "failed to upgrade" mean? did the upgrade role throw an error? or did the processes just not start successfully after the upgrade?
11:25 <mnasiadka> kevko: mnesia is getting deprecated - RMQ is moving to Raft - but that doesn't help now :)
11:26 <kevko> mnasiadka: I know ... we fixed it ... but I am very curious why it happened ...
11:26 <SvenKieske> kevko: I was under the impression our upgrade jobs should reboot the rmq nodes one after the other, and not two simultaneously? *looking at the upgrade job*
11:26 <kevko> can my theory be the right one?
11:27 <SvenKieske> what is your theory? 2 nodes in the process of stopping and the first node stopped during this via our playbook? I need to look, but afaik that should not happen
11:27 <kevko> SvenKieske: yeah, but it's only a theory
11:27 <SvenKieske> maybe we have some edge case here for large envs, did you verify that the upgrade logic permits this?
11:29 <kevko> SvenKieske: I really don't know ... we needed to tell the rabbitmq node to force start
11:29 <kevko> rabbitmqctl force_start
11:29 <kevko> or however it's called ...
11:29 <SvenKieske> mhm, interesting
11:29 <SvenKieske> have a meeting now, can take a look later maybe
11:30 <kevko> well, maybe it's only a yoga issue ... I don't know ... because I've already checked newer branches and the way it is upgraded is a little bit different
11:31 <mnasiadka> it had to be reworked due to Ansible fixing a bug that we apparently relied on in the maria/rmq roles :)
11:36 <kevko> mnasiadka: which bug?
11:41 <mnasiadka> it's referenced in the commit message for sure
11:42 <mnasiadka> https://review.opendev.org/c/openstack/kolla-ansible/+/886485
11:42 <mnasiadka> but it was in 2023.1
11:42 <mnasiadka> not zed
11:45 <SvenKieske> the last change in the zed release with respect to upgrades was afaik 3dfca54dfb3be04283e2c5d6800e27cc5a3d8776, which removed the ha-all policy when not required
11:47 <SvenKieske> the last change to the config.yml which is used there, in zed, was in fact by me, but I would rule that one out, that was 159925248375727ae67ef93a606b09a54e83ba2d (re-add rmq config for clustering interface)
11:47 <SvenKieske> well, if you have ipv6 that might actually be a problem
11:48 <kevko> SvenKieske: well, we didn't remove the ha-all policy, of course ... for now it's a no-go to do this ...
11:48 <SvenKieske> but by default, afaik, that changes nothing; it only provides the ability to configure the interface used by rabbitmq for clustering, which was removed by accident
11:49 <SvenKieske> kevko: ok, so I guess you didn't hit a bug related to new code, at least from the few files I looked at.
11:49 <SvenKieske> might be the upgrade logic is not robust enough
11:49 <kevko> it's not :D
11:50 <SvenKieske> for smaller deployments container restarts are rather fast, because there are not that many queues and the queues are mostly empty anyway.
11:50 <SvenKieske> so the fast restart might obscure a race condition
11:50 <kevko> SvenKieske: I know :D ... the testing env upgraded without problems, the LAB on the customer side was OK ... production not :D
11:50 <SvenKieske> maybe write all your information up in a bug report? that would be a good starting point for the investigation (and I can feed it to my time tracker :D)
11:51 <opendevreview> Michal Nasiadka proposed openstack/kolla master: WIP: Add support for rpm to repos.yaml  https://review.opendev.org/c/openstack/kolla/+/909879
11:52 <kevko> SvenKieske: let me provide some logs ... I need to ask them to give me some ..
11:54 <opendevreview> Merged openstack/kolla master: Install ironic-inspector in bifrost  https://review.opendev.org/c/openstack/kolla/+/909865
12:04 <opendevreview> Michal Nasiadka proposed openstack/kolla master: Remove calls to libvirt repo  https://review.opendev.org/c/openstack/kolla/+/910224
12:07 <kevko> btw, xena centos libvirt to yoga centos libvirt can't live migrate even if the libvirt version is the same :D
12:08 <mnasiadka> and the reason is? ;-)
12:08 <kevko> because they patched something ... :D ... but broke the code
12:09 <kevko> 2024-02-23 14:40:11.574+0000: 967670: error : qemuProcessReportLogError:2051 : internal error: qemu unexpectedly closed the monitor: 2024-02-23T14:40:11.549441Z qemu-kvm: Missing section footer for 0000:00:01.3/piix4_pm
12:09 <kevko> 2024-02-23T14:40:11.550219Z qemu-kvm: load of migration failed: Invalid argument
12:11 <kevko> mnasiadka: can I find the patches and build my own version (for debuntu it is easy, as I can download the source for the package ... and debug ...) ... or is it closed? I don't have much experience with rpms
12:11 <mnasiadka> the spec used by c8s/c9s is here: https://git.centos.org/rpms/libvirt
12:11 <mnasiadka> you can build it on your own
12:17 <opendevreview> Michal Nasiadka proposed openstack/kolla master: WIP: Add support for rpm to repos.yaml  https://review.opendev.org/c/openstack/kolla/+/909879
12:18 <jheikkin> Does kayobe/kolla support other container backends than Docker? Does anyone have experience with running openstack services in systemd-nspawn containers?
12:18 <mnasiadka> Kolla supports Podman since the 2023.2 release, Kayobe doesn't support podman yet
12:19 <mnasiadka> systemd-nspawn - I think OpenStack-Ansible supported that, or planned to
12:23 <jheikkin> I saw that the nspawn containers repo under openstack-ansible was archived about 3 years ago, though
12:25 <mnasiadka> so now there's your answer ;-)
12:45 <jheikkin> Thanks for the answers!
12:54 <opendevreview> Merged openstack/kolla stable/2023.1: toolbox: Improve retry loop for ansible-galaxy  https://review.opendev.org/c/openstack/kolla/+/910182
12:54 <opendevreview> Merged openstack/kolla stable/zed: toolbox: Improve retry loop for ansible-galaxy  https://review.opendev.org/c/openstack/kolla/+/910183
13:10 <opendevreview> Verification of a change to openstack/kayobe master failed: Make hooks environment-aware  https://review.opendev.org/c/openstack/kayobe/+/904141
13:30 <kevko> mnasiadka, SvenKieske: I was wondering if we could implement a rabbitmq blue-green upgrade ... federated rabbitmq
13:44 <mnasiadka> switching OpenStack applications to a second cluster? that's going to be fun
14:06 <SvenKieske> I actually had the same idea some weeks ago; yeah, it would be quite some work, but maybe also more robust. you could write some elaborate script using the "shovel" plugin to migrate messages from queue to queue. afaik I even found an online example where some other deployment tooling did that
14:07 <SvenKieske> but that was written in dotNET IIRC :D
15:18 <rohit02> @Team: for Kolla Bobcat (2023.2) with external ceph, is there any ceph version dependency for cinder-volume? We are trying to integrate ceph nautilus and cinder-volume throws an exception
16:13 <kevko> SvenKieske: do you have a link?
16:13 <SvenKieske> for the dotNet blue/green stuff? let me check my history..
16:14 <SvenKieske> ah, it was not dotnet, but I guess the language doesn't really matter: https://github.com/Particular/NServiceBus.RabbitMQ/blob/master/src/NServiceBus.Transport.RabbitMQ.CommandLine/Commands/Queue/QueueMigrateCommand.cs
16:15 <SvenKieske> I stumbled upon it because they have actual docs around queue migration: https://docs.particular.net/transports/rabbitmq/operations-scripting#queue-migrate-to-quorum
16:16 <SvenKieske> rereading it, I'm not sure it's connected to blue-green deployments; just skimmed it
16:17 <greatgatsby__> Hello. Can I get some guidance on the neutron_l3_agent_failover_delay value? If I use the `40 + 3n` method, where n is the number of routers, and I have 100 projects each with a router, then my delay would be 340?
16:18 <greatgatsby__> the docs also mention to "time how long an outage lasts", but I'm not sure what this means
16:22 <SvenKieske> greatgatsby__: well, as the docs at https://docs.openstack.org/kolla-ansible/latest/reference/networking/neutron.html#l3-agent-high-availability say (it could maybe be worded more clearly), during an l3 agent restart the virtual routers need to be recreated; this can cause an outage, depending on the exact scenario
16:23 <SvenKieske> the delay is there to make the failover from agent 1 to agent 2 configurable; kolla-ansible waits for that delay before restarting the next agent (iirc)
16:23 <SvenKieske> in theory that should allow all virtual routers to be migrated
16:23 <greatgatsby__> SvenKieske: for sure, we actually implemented our own in yoga and are now backporting the solution from 2023.1. The only thing we're struggling with now is determining the value we should use for the failover delay.
16:24 <SvenKieske> however, it might be difficult to calculate the exact number of seconds for the failover, because it depends on the recreation duration per router and on the number of routers, which might not be static
16:25 <SvenKieske> greatgatsby__: I believe I already linked you to a different solution in the past? this might have some advantages (no need to calculate the downtime): https://review.opendev.org/c/openstack/kolla-ansible/+/874769
16:26 <SvenKieske> you might also be interested in this throttled restart rewrite of the l3 agent handler: https://review.opendev.org/c/openstack/kolla-ansible/+/904134 (also not merged)
16:27 <greatgatsby__> SvenKieske: thanks for that link. Since that is still a WIP change, we've decided to go with the current solution in 2023.1, for better or worse.
16:27 <SvenKieske> yeah, I think we already talked about it. it's WIP, but I personally happen to know that it runs in production :D
16:28 <SvenKieske> for upstream the code is a little bit unclean, no proper code module etc.
16:28 <SvenKieske> nevertheless, yeah, you need to time how long your environment needs to create a router and then extrapolate from that
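The `40 + 3n` rule of thumb discussed above can be sketched as a tiny helper. The function name is hypothetical; the 40 s base and 3 s per router come from the rule quoted in the question, and the per-router cost should really be measured in your own environment and substituted in.

```python
# Hypothetical helper for estimating neutron_l3_agent_failover_delay (seconds)
# from the "40 + 3n" rule of thumb: a fixed base for the agent restart plus a
# measured per-router recreation cost times the number of virtual routers.
def l3_failover_delay(num_routers: int, base: int = 40, per_router: int = 3) -> int:
    return base + per_router * num_routers

# 100 projects with one router each, as in the question above:
print(l3_failover_delay(100))  # prints 340
```

If recreating a router takes longer than ~3 s in your environment, time it as suggested above and pass that value as `per_router` instead.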
16:29 <greatgatsby__> do you anticipate the 2023.1 solution changing to this in the next KA release?
16:30 <SvenKieske> mhm, I don't know. mnasiadka: someone at your company wanted to replace the current solution with a proper one for neutron_l3_agent graceful restarts, by splitting the processes up into more containers
16:30 <SvenKieske> I don't know the status of that, or when it will be ready, or if it has even started; maybe I'm also misremembering the company, but someone wanted to work on it
16:44 <spatel> Kolla's default setting is 2 DHCP agents per network, but when I created a network it's showing only a single agent running. am I missing something?
16:44 <spatel> https://paste.opendev.org/show/bvdPRCGlA4tUIFTUCgMh/
16:46 <mnasiadka> SvenKieske: it's work in progress, we'll put more effort into this in the coming weeks
16:46 <spatel> I didn't set enable_neutron_agent_ha
16:48 <greatgatsby__> SvenKieske, mnasiadka: thanks for the feedback and the dev effort
16:52 <spatel> Can I enable enable_neutron_agent_ha: "yes" and re-run deploy?
16:52 <spatel> will that make the dhcp-agent HA?
16:53 <spatel> Oh yes.. it did
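For anyone landing here later: the switch spatel flipped lives in kolla-ansible's globals.yml. Fragment below is a sketch; `dhcp_agents_per_network` and its default of 2 are my recollection of the related variable, so verify both against your release's documentation:

```yaml
# /etc/kolla/globals.yml (fragment)
# Schedules redundant neutron agents (e.g. a second DHCP agent per network).
enable_neutron_agent_ha: "yes"
# dhcp_agents_per_network: 2   # believed default once agent HA is enabled
```

As confirmed above, re-running deploy after setting this is enough to get the second DHCP agent scheduled.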
22:34 -opendevstatus- NOTICE: Gerrit on review.opendev.org will be restarted to perform a minor upgrade to the service.

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!