mnasiadka | morning | 08:15 |
opendevreview | Michal Nasiadka proposed openstack/kolla stable/2023.2: toolbox: Bump ansible-core to 2.15 https://review.opendev.org/c/openstack/kolla/+/910148 | 08:28 |
mnasiadka | frickler, kevko: ^^ seems we forgot this in 2023.2 cycle... | 08:34 |
opendevreview | Michal Nasiadka proposed openstack/kolla master: WIP: Add support for rpm to repos.yaml https://review.opendev.org/c/openstack/kolla/+/909879 | 08:42 |
opendevreview | Michal Nasiadka proposed openstack/kolla-ansible master: CI: Break OVN cluster before reconfigure https://review.opendev.org/c/openstack/kolla-ansible/+/897935 | 08:48 |
frickler | mnasiadka: I'm not sure we can or should do this for a stable branch. let's see what testing says, but at least it'll need a reno? | 08:51 |
mnasiadka | frickler: now it's the minimum supported version; we usually had the maximum supported version - if it passes I assume we can create a bug and add a reno | 08:52 |
opendevreview | Michal Nasiadka proposed openstack/kolla master: WIP: Add support for rpm to repos.yaml https://review.opendev.org/c/openstack/kolla/+/909879 | 09:16 |
opendevreview | Michal Nasiadka proposed openstack/kolla stable/2023.2: toolbox: Improve retry loop for ansible-galaxy https://review.opendev.org/c/openstack/kolla/+/910181 | 09:45 |
opendevreview | Michal Nasiadka proposed openstack/kolla stable/2023.1: toolbox: Improve retry loop for ansible-galaxy https://review.opendev.org/c/openstack/kolla/+/910182 | 09:45 |
opendevreview | Michal Nasiadka proposed openstack/kolla stable/zed: toolbox: Improve retry loop for ansible-galaxy https://review.opendev.org/c/openstack/kolla/+/910183 | 09:45 |
opendevreview | Michal Nasiadka proposed openstack/kolla stable/zed: toolbox: Improve retry loop for ansible-galaxy https://review.opendev.org/c/openstack/kolla/+/910183 | 09:46 |
mnasiadka | seems the same thing happened on 2023.1 - some collections were not installed and we didn't fail the build (periodic) | 09:46 |
frickler | that sounds plausible | 09:48 |
SvenKieske | mhm | 09:50 |
frickler | oh, we didn't even retry in zed | 09:51 |
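(The retry-loop change discussed above amounts to something like the following sketch; the requirements file name, retry count and sleep interval are assumptions for illustration, not the actual patch.)

```
# Illustrative sketch only, not the actual patch: retry the install
# a few times and fail the image build if the collections are still
# missing, instead of silently continuing without them.
for attempt in 1 2 3 4 5; do
    ansible-galaxy collection install -r requirements.yml && break
    if [ "$attempt" -eq 5 ]; then
        echo "ansible-galaxy failed after $attempt attempts" >&2
        exit 1
    fi
    sleep 10
done
```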
SvenKieske | ah we didn't backport this? I remember it. | 09:51 |
SvenKieske | I should pay more attention to where a backport is needed, I guess. :) | 09:59 |
mnasiadka | well, a simple script getting the backport flag and backporting would be nice | 10:07 |
frickler | yeah, except that that patch also didn't have a backport flag set. so we need another script to check that. or a deputy that watches things ;) | 10:12 |
mnasiadka | well, we need us all checking if it needs backporting when we merge it :) | 10:13 |
mnasiadka | point taken | 10:13 |
SvenKieske | we could have a script that mandates an active decision on backporting, either -1 or +1, and forbids abstaining, no? so a basic gerrit rule for that? it adds more work though, but I guess that's the point? :) | 10:14 |
mnasiadka | that's also an option | 10:15 |
SvenKieske | so, basically before a patch is gated and gets Verified +2, the bot checks if the backport yes/no decision was made. | 10:16 |
mnasiadka | I think we can just have gerrit config that doesn't allow W+1 if backport flag is not one of -2 -1 +1 +2 | 10:18 |
SvenKieske | yeah, sounds good. if you nudge me in the right direction I can spin up a short patch | 10:18 |
SvenKieske | is that in project-config settings or where do I find this? | 10:18 |
opendevreview | Michal Nasiadka proposed openstack/kolla stable/zed: toolbox: Improve retry loop for ansible-galaxy https://review.opendev.org/c/openstack/kolla/+/910183 | 10:20 |
SvenKieske | I guess it would go somewhere in here? https://opendev.org/openstack/project-config/src/branch/master/gerrit/acls/openstack/kolla.config#L53 | 10:21 |
mnasiadka | probably, I'm not an expert on those weird acls, I just remember they need a tab (not 4 spaces instead of a tab) ;-) | 10:37 |
frickler | I don't think you can do this with gerrit, iiuc you'd have to renumber things from 0..+4 or so. we'd need some more complicated check job like the release team uses to verify PTL approvals, but I'd think that would be overdoing it. maybe a script that does these checks right before the monthly stable releases would work better | 10:42 |
opendevreview | Will Szumski proposed openstack/kayobe master: WIP: Add podman support https://review.opendev.org/c/openstack/kayobe/+/909686 | 10:47 |
SvenKieske | frickler: well I guess you could do this via the prolog rules that gerrit uses to evaluate the ACLs, you can add custom stuff there, but I guess that's too invasive, as it seems we do not use this feature yet. | 10:49 |
SvenKieske | mhm, seems that's deprecated as well and one should use submit requirements, but I don't see how this can be done with submit requirements just yet: https://gerrit-review.googlesource.com/Documentation/prolog-cookbook.html | 10:51 |
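(A hedged sketch of what such a submit requirement could look like in kolla.config; the label name Backport-Candidate and the exact predicate are assumptions to illustrate the idea, not a tested configuration.)

```
# Hypothetical sketch for gerrit/acls/openstack/kolla.config; the
# label name and predicate are assumptions, not a tested config.
# (Indentation in the real file must be a tab, per the note above.)
[submit-requirement "Backport-Decision"]
	description = An explicit backport yes/no vote must be recorded
	submittableIf = label:Backport-Candidate=ANY
```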
opendevreview | Michal Nasiadka proposed openstack/kolla master: WIP: Add support for rpm to repos.yaml https://review.opendev.org/c/openstack/kolla/+/909879 | 10:52 |
opendevreview | Matúš Jenča proposed openstack/kolla-ansible master: Add Redis as caching backend for Keystone https://review.opendev.org/c/openstack/kolla-ansible/+/909201 | 10:55 |
SvenKieske | mhm, thinking about it, I _guess_ I have a solution.. | 11:00 |
SvenKieske | mnasiadka, frickler: possibly buggy, so please evaluate :) https://review.opendev.org/c/openstack/project-config/+/910212 | 11:03 |
kevko | morning | 11:16 |
SvenKieske | o/ | 11:18 |
kevko | We were upgrading a quite big OpenStack ... and RabbitMQ failed to upgrade. | 11:20 |
kevko | and i am wondering why .... | 11:21 |
kevko | we've seen in the logs something like "wait for mnesia ... blabla ... 9 retries left ... fail" | 11:21 |
SvenKieske | that upgrade to zed? or what was it? | 11:21 |
kevko | it was xena - > yoga | 11:21 |
SvenKieske | was this with default config/upgraded from older versions, or a xena release? because I have seen old envs having no ha enabled at all, because of our conservative upgrade policies so far :) | 11:24 |
kevko | in the upgrade job there is a task which stops rabbitmq on all but the first node ... and right after this the first node is stopped ... as it's a huge cloud, I was wondering if this was the problem (while 2 nodes were still in the process of stopping, the first node was stopped during this ... and because of this the cluster was not stopped correctly) | 11:25 |
SvenKieske | mhm, taking one step back, what does "failed to upgrade" mean? the upgrade role did throw an error? or did the processes just not start successfully after the upgrade? | 11:25 |
mnasiadka | kevko: mnesia is getting deprecated - RMQ is moving to Raft - but that doesn't help now :) | 11:25 |
kevko | mnasiadka: I know .... we fixed it ...but I am very curious why it happened ... | 11:26 |
SvenKieske | kevko: I was under the impression our upgrade jobs should reboot the rmq nodes one after the other, and not two simultaneously? *looking at the upgrade job* | 11:26 |
kevko | can be my theory the right one ? | 11:26 |
SvenKieske | what is your theory? 2 nodes in process of stopping and first node stopped during this via our playbook? I need to look but afaik that should not happen | 11:27 |
kevko | SvenKieske: yeah, but it's only theory | 11:27 |
SvenKieske | maybe we have some edge case here for large envs, did you verify that the upgrade logic permits this? | 11:27 |
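(A minimal, hypothetical sketch of the ordering kevko describes; the task names and commands are invented, not the actual kolla-ansible tasks. The point is that if the first task returns before the secondaries have fully stopped, the first node can go down while they are still shutting down, so no node records a clean "last to stop" state.)

```
# Hypothetical task-list sketch of the described ordering (names
# invented, not the actual kolla-ansible tasks). If the first task
# does not wait for the secondaries to finish stopping, the first
# node can be stopped while they are still shutting down.
- name: Stop RabbitMQ on all nodes except the first
  command: docker exec rabbitmq rabbitmqctl stop_app
  when: inventory_hostname != groups['rabbitmq'][0]

- name: Stop RabbitMQ on the first node
  command: docker exec rabbitmq rabbitmqctl stop_app
  when: inventory_hostname == groups['rabbitmq'][0]
```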
kevko | SvenKieske: I really don't know ... we need to tell the rabbitmq node to force start | 11:29 |
kevko | rabbitmqctl force_start | 11:29 |
kevko | or how it is ... | 11:29 |
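(The command kevko is reaching for is presumably rabbitmqctl force_boot; a minimal recovery sketch, assuming the kolla container name:)

```
# Presumably force_boot is the command meant here: it marks the node
# to start even if it was not the last one to shut down, instead of
# waiting for mnesia tables held by peers. Container name assumed.
docker exec rabbitmq rabbitmqctl force_boot
docker restart rabbitmq
```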
SvenKieske | mhm interesting | 11:29 |
SvenKieske | have a meeting now, can take a look later maybe | 11:29 |
kevko | well, maybe it's only a yoga issue ... i don't know ... because i've already checked newer branches and the way it is upgraded is a little bit different | 11:30 |
mnasiadka | it had to be reworked due to Ansible fixing a bug that we apparently relied on in the maria/rmq roles :) | 11:31 |
kevko | mnasiadka: which bug ? | 11:36 |
mnasiadka | it's referenced in the commit message for sure | 11:41 |
mnasiadka | https://review.opendev.org/c/openstack/kolla-ansible/+/886485 | 11:42 |
mnasiadka | but it was in 2023.1 | 11:42 |
mnasiadka | not zed | 11:42 |
SvenKieske | last change in the zed release with respect to upgrades was afaik 3dfca54dfb3be04283e2c5d6800e27cc5a3d8776 which removed the ha-all policy when not required | 11:45 |
SvenKieske | the last change to the config.yml used there in zed was in fact by me, but I would rule that one out; that was 159925248375727ae67ef93a606b09a54e83ba2d (re-adding the rmq config for the clustering interface) | 11:47 |
SvenKieske | well if you have ipv6 that might be a problem, actually | 11:47 |
kevko | SvenKieske: well, we didn't remove the ha-all policy of course ... for now it's a no-go to do this ... | 11:48 |
SvenKieske | but by default afaik that changes nothing, it only provides the ability to configure the interface used by rabbitmq for clustering, which had been removed by accident | 11:48 |
SvenKieske | kevko: ok, so I guess you didn't hit a bug related to new code, at least from the few files I looked at. | 11:49 |
SvenKieske | might be the upgrade logic is not robust enough | 11:49 |
kevko | it's not :D | 11:49 |
SvenKieske | for smaller deployments container restarts are rather fast, because there are not that many queues and queues are mostly empty anyway. | 11:50 |
SvenKieske | so the fast restart might obscure a race condition | 11:50 |
kevko | SvenKieske: I know :D .... testing env upgraded without problems, LAB on customer side was OK .... production not :D | 11:50 |
SvenKieske | maybe write all your information up in a bug report? that would be a good starting point for the investigation (and I can feed it to my time tracker :D) | 11:50 |
opendevreview | Michal Nasiadka proposed openstack/kolla master: WIP: Add support for rpm to repos.yaml https://review.opendev.org/c/openstack/kolla/+/909879 | 11:51 |
kevko | SvenKieske: let me provide some logs ...i need to ask them to give me some .. | 11:52 |
opendevreview | Merged openstack/kolla master: Install ironic-inspector in bifrost https://review.opendev.org/c/openstack/kolla/+/909865 | 11:54 |
opendevreview | Michal Nasiadka proposed openstack/kolla master: Remove calls to libvirt repo https://review.opendev.org/c/openstack/kolla/+/910224 | 12:04 |
kevko | btw, xena centos libvirt to yoga centos libvirt can't live migrate even if the libvirt version is the same :D | 12:07 |
mnasiadka | and the reason is? ;-) | 12:08 |
kevko | because they patched something ... :D ..but broke the code | 12:08 |
kevko | 2024-02-23 14:40:11.574+0000: 967670: error : qemuProcessReportLogError:2051 : internal error: qemu unexpectedly closed the monitor: 2024-02-23T14:40:11.549441Z qemu-kvm: Missing section footer for 0000:00:01.3/piix4_pm | 12:09 |
kevko | 2024-02-23T14:40:11.550219Z qemu-kvm: load of migration failed: Invalid argument | 12:09 |
kevko | mnasiadka: can I find the patches and build my own version (for debuntu it is easy, as I can download the source for a package ... and debug ...) .. or is it closed? I don't have much experience with rpms | 12:11 |
mnasiadka | the spec used by c8s/c9s is here: https://git.centos.org/rpms/libvirt | 12:11 |
mnasiadka | you can build on your own | 12:11 |
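(A hedged sketch of the usual CentOS Stream rebuild workflow; the branch name and the centpkg/mock steps are assumptions, not verified instructions.)

```
# Hedged sketch; branch name and tooling are assumptions.
git clone https://git.centos.org/rpms/libvirt && cd libvirt
git checkout c8s                      # pick the branch matching your release
centpkg sources                       # fetch tarballs from the lookaside cache
# edit SPECS/libvirt.spec or its patches here, then:
rpmbuild --define "_topdir $PWD" -bs SPECS/libvirt.spec
mock -r centos-stream-8-x86_64 SRPMS/libvirt-*.src.rpm
```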
opendevreview | Michal Nasiadka proposed openstack/kolla master: WIP: Add support for rpm to repos.yaml https://review.opendev.org/c/openstack/kolla/+/909879 | 12:17 |
jheikkin | Does kayobe/kolla support other container backends than Docker? Does anyone have experience with running OpenStack services in systemd-nspawn containers? | 12:18 |
mnasiadka | Kolla supports Podman since 2023.2 release, Kayobe doesn't support podman yet | 12:18 |
mnasiadka | systemd-nspawn - I think OpenStack-Ansible supported that or planned to support that | 12:19 |
jheikkin | I saw that the nspawn containers git repo under openstack-ansible was archived about 3 years ago, though | 12:23 |
mnasiadka | so now there's your answer ;-) | 12:25 |
jheikkin | Thanks for the answers! | 12:45 |
opendevreview | Merged openstack/kolla stable/2023.1: toolbox: Improve retry loop for ansible-galaxy https://review.opendev.org/c/openstack/kolla/+/910182 | 12:54 |
opendevreview | Merged openstack/kolla stable/zed: toolbox: Improve retry loop for ansible-galaxy https://review.opendev.org/c/openstack/kolla/+/910183 | 12:54 |
opendevreview | Verification of a change to openstack/kayobe master failed: Make hooks environment-aware https://review.opendev.org/c/openstack/kayobe/+/904141 | 13:10 |
kevko | mnasiadka, SvenKieske: I was wondering if we can implement a rabbitmq blue-green upgrade ... federated rabbitmq | 13:30 |
mnasiadka | switching OpenStack applications to a second cluster? that's going to be fun | 13:44 |
SvenKieske | I actually also had the same idea some weeks ago, but yeah, it would be quite some work, but maybe also more robust. you could write some elaborate script to use the "shovel" plugin to migrate messages from queue to queue. afaik I even found an online example where some other deployment tooling did that | 14:06 |
SvenKieske | but that was written in dotNET IIRC :D | 14:07 |
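(A hypothetical example of the shovel idea: a dynamic shovel that drains one queue into another cluster and stops when the backlog is moved; the shovel name, queue names and URIs are all invented.)

```
# Requires the rabbitmq_shovel plugin; all names/URIs are invented.
# "src-delete-after": "queue-length" makes the shovel stop after
# moving the messages that were in the queue when it started.
rabbitmqctl set_parameter shovel migrate-notifications \
  '{"src-uri": "amqp://old-cluster", "src-queue": "notifications",
    "dest-uri": "amqp://new-cluster", "dest-queue": "notifications",
    "src-delete-after": "queue-length"}'
```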
rohit02 | @Team For Kolla Bobcat (2023.2) with external Ceph, is there any Ceph version dependency for cinder-volume? We are trying to integrate Ceph Nautilus and cinder-volume throws an exception | 15:18 |
kevko | SvenKieske: do you have a link ? | 16:13 |
SvenKieske | for the dotNet blue/green stuff? let me check my history.. | 16:13 |
SvenKieske | ah was not dotnet, but I guess the language doesn't really matter: https://github.com/Particular/NServiceBus.RabbitMQ/blob/master/src/NServiceBus.Transport.RabbitMQ.CommandLine/Commands/Queue/QueueMigrateCommand.cs | 16:14 |
SvenKieske | I stumbled upon it, because they have actual docs around queue migration: https://docs.particular.net/transports/rabbitmq/operations-scripting#queue-migrate-to-quorum | 16:15 |
SvenKieske | rereading it I'm not sure it's connected to blue-green deployments, just skimmed it | 16:16 |
greatgatsby__ | Hello. Can I get some guidance on the neutron_l3_agent_failover_delay value? If I use the `40 + 3n` method, where n is the number of routers, and I have 100 projects each with a router, then my delay would be 340? | 16:17 |
greatgatsby__ | the docs also mention to "time how long an outage lasts" but I'm not sure what this means | 16:18 |
SvenKieske | greatgatsby: well as the docs at https://docs.openstack.org/kolla-ansible/latest/reference/networking/neutron.html#l3-agent-high-availability say (it could maybe be worded more clearly), during an l3 agent restart the virtual routers need to be recreated; this can cause an outage, depending on the exact scenario | 16:22 |
SvenKieske | the delay is there to make the failover from agent 1 to agent 2 configurable; so kolla-ansible waits for that delay before restarting (iirc) | 16:23 |
SvenKieske | in theory that should allow all virtual routers to be migrated | 16:23 |
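(Applied to greatgatsby__'s numbers, the rule of thumb gives 40 + 3 * 100 = 340 seconds; a minimal globals.yml sketch:)

```
# globals.yml: 40 s base + 3 s per router, with ~100 routers
neutron_l3_agent_failover_delay: 340
```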
greatgatsby__ | SvenKieske: for sure, we actually implemented our own in yoga and are now backporting the solution from 2023.1. The only thing we're struggling with now is determining the value we should use for the failover delay. | 16:23 |
SvenKieske | however it might be difficult to calculate the exact number of seconds for the failover, because it depends on the recreation duration per router and the number of routers, which might not be static | 16:24 |
SvenKieske | greatgatsby: I believe I already linked you to a different solution in the past? this might have some advantages (no need to calculate the downtime): https://review.opendev.org/c/openstack/kolla-ansible/+/874769 | 16:25 |
SvenKieske | you might also be interested in this throttled restart rewrite of the l3 agent handler: https://review.opendev.org/c/openstack/kolla-ansible/+/904134 (also not merged) | 16:26 |
greatgatsby__ | SvenKieske: thanks for that link. Since that is still a WIP PR we've decided to go with the current solution in 2023.1, for better or worse. | 16:27 |
SvenKieske | yeah, I think we already talked about it. it's wip, but I personally happen to know that it runs in production :D | 16:27 |
SvenKieske | for upstream the code is a little bit unclean, no proper code module etc. | 16:28 |
SvenKieske | nevertheless, yeah, you need to time how long your environment needs to create a router and then extrapolate that | 16:28 |
greatgatsby__ | do you anticipate the 2023.1 solution changing to this for the next KA release? | 16:29 |
SvenKieske | mhm I don't know, mnasiadka: someone at your company wanted to replace the current solution with a proper solution for neutron_l3_agent graceful restarts by splitting the processes up into more containers | 16:30 |
SvenKieske | I don't know the status of that or when it is ready, or if it even started, maybe I'm also misremembering the company there, but someone wanted to work on it | 16:30 |
spatel | Kolla has a default DHCP-agents-per-network setting of 2, but when I created a network it's showing only a single agent running. am i missing something? | 16:44 |
spatel | https://paste.opendev.org/show/bvdPRCGlA4tUIFTUCgMh/ | 16:44 |
mnasiadka | SvenKieske: it's work in progress, we'll put more effort into this in coming weeks | 16:46 |
spatel | I didn't set enable_neutron_agent_ha | 16:46 |
greatgatsby__ | SvenKieske, mnasiadka: thanks for the feedback and dev effort | 16:48 |
spatel | Can I enable enable_neutron_agent_ha: "yes" and re-run deploy? | 16:52 |
spatel | is it going to make the dhcp-agent HA? | 16:52 |
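(What spatel ends up doing, as a globals.yml sketch; afaik this flag is what raises kolla-ansible's dhcp_agents_per_network above 1:)

```
# globals.yml; afaik this drives dhcp_agents_per_network up from 1
enable_neutron_agent_ha: "yes"
```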
spatel | Oh yes.. it did | 16:53 |
-opendevstatus- NOTICE: Gerrit on review.opendev.org will be restarted to perform a minor upgrade to the service. | 22:34 |