*** tosky has quit IRC | 00:00 | |
*** dwilde has joined #openstack-ansible | 00:03 | |
*** dwilde has quit IRC | 00:19 | |
*** dwilde has joined #openstack-ansible | 01:33 | |
*** dwilde has quit IRC | 01:48 | |
*** evrardjp has quit IRC | 02:33 | |
*** evrardjp has joined #openstack-ansible | 02:33 | |
*** cloudnull8 has quit IRC | 03:02 | |
*** cloudnull8 has joined #openstack-ansible | 03:02 | |
*** fridtjof[m] has quit IRC | 04:20 | |
*** jrosser has quit IRC | 04:20 | |
*** jrosser has joined #openstack-ansible | 04:34 | |
*** fridtjof[m] has joined #openstack-ansible | 04:38 | |
*** SiavashSardari has joined #openstack-ansible | 04:52 | |
*** miloa has joined #openstack-ansible | 05:34 | |
*** SiavashSardari18 has joined #openstack-ansible | 05:34 | |
*** miloa has quit IRC | 05:35 | |
*** SiavashSardari has quit IRC | 05:37 | |
*** dmsimard has quit IRC | 06:48 | |
*** dmsimard has joined #openstack-ansible | 06:51 | |
*** luksky has joined #openstack-ansible | 07:01 | |
*** andrewbonney has joined #openstack-ansible | 07:08 | |
jrosser | morning | 07:31 |
*** dmsimard has quit IRC | 07:32 | |
*** dmsimard has joined #openstack-ansible | 07:33 | |
*** tosky has joined #openstack-ansible | 07:35 | |
*** jbadiapa has joined #openstack-ansible | 07:37 | |
jonher | morning, https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/784885 originated from me seeing the deprecation notice in another job, but it looks like we don't have the same version of tempest in these tests: "tempest run: error: unrecognized arguments: --include-list" | 07:47 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Test dashboard only for horizon scenario https://review.opendev.org/c/openstack/openstack-ansible/+/782376 | 07:53 |
jrosser | jonher: theres a couple of ways you can look at the tempest version, it is in "/openstack/venvs/tempest-22.0.0.0rc2.dev138/bin" -> 138 commits after the 22.0.0.0rc2(?) tag | 07:55 |
jrosser | but ultimately the thing that defines the version installed should be this https://github.com/openstack/openstack-ansible/blob/master/playbooks/defaults/repo_packages/openstack_testing.yml#L20 | 07:55 |
jrosser | (for master branch at least, stable branches are a little different) | 07:56 |
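The `.dev138` suffix jrosser reads off the venv path is a PEP 440 dev version; a small shell sketch of splitting such a string into the base tag and the commit count (the value is just the example from the discussion above):

```shell
# Split a PEP 440 dev version like the one in the tempest venv path into
# the base tag and the number of commits past that tag.
v="22.0.0.0rc2.dev138"
tag="${v%.dev*}"       # strip the .devN suffix -> 22.0.0.0rc2
commits="${v##*.dev}"  # keep only the digits after .dev -> 138
echo "$tag plus $commits commits"
```

Prints `22.0.0.0rc2 plus 138 commits`, i.e. the venv was built 138 commits after the 22.0.0.0rc2 tag.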
* jrosser puzzled at os_keystone breakage | 07:59 | |
jonher | the release where it was announced as deprecated is from 2020-12-24, so the pinned SHA from 2021-01-17 should have the change, hmm | 07:59 |
jonher | the commit* https://github.com/openstack/tempest/blob/7e96c8e854386f43604ad098a6ec7606ee676145/releasenotes/notes/Inclusive-jargon-17621346744f0cf4.yaml | 08:00 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Test dashboard only for horizon scenario https://review.opendev.org/c/openstack/openstack-ansible/+/782376 | 08:01 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Set buster jobs to NV https://review.opendev.org/c/openstack/openstack-ansible/+/782681 | 08:01 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Set buster jobs to NV https://review.opendev.org/c/openstack/openstack-ansible/+/782681 | 08:02 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Set buster jobs to NV https://review.opendev.org/c/openstack/openstack-ansible/+/782681 | 08:03 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Bump SHAs for master https://review.opendev.org/c/openstack/openstack-ansible/+/785800 | 08:03 |
*** rpittau|afk is now known as rpittau | 08:04 | |
*** manti has quit IRC | 09:00 | |
*** SiavashSardari has joined #openstack-ansible | 10:19 | |
admin0 | morning | 10:33 |
noonedeadpunk | o/ | 10:34 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Map dbaas and lbaas with role defaults https://review.opendev.org/c/openstack/openstack-ansible/+/784113 | 10:45 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Map dbaas and lbaas with role defaults https://review.opendev.org/c/openstack/openstack-ansible/+/784113 | 10:48 |
*** macz_ has joined #openstack-ansible | 10:53 | |
*** macz_ has quit IRC | 10:57 | |
*** sri_ has quit IRC | 11:04 | |
*** csmart has joined #openstack-ansible | 11:35 | |
*** rh-jelabarre has joined #openstack-ansible | 12:17 | |
*** SiavashSardari has quit IRC | 12:30 | |
*** manti has joined #openstack-ansible | 12:48 | |
*** cloudnull8 is now known as cloudnull | 12:48 | |
*** spatel_ has joined #openstack-ansible | 13:01 | |
*** spatel_ is now known as spatel | 13:01 | |
*** luksky has quit IRC | 13:11 | |
*** luksky has joined #openstack-ansible | 13:11 | |
*** luksky has quit IRC | 13:12 | |
*** luksky has joined #openstack-ansible | 13:12 | |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_trove master: Update trove configuration https://review.opendev.org/c/openstack/openstack-ansible-os_trove/+/784571 | 13:31 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_trove master: Update trove configuration https://review.opendev.org/c/openstack/openstack-ansible-os_trove/+/784571 | 13:31 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_trove master: [doc] Document how to use separate RabbitMQ cluster https://review.opendev.org/c/openstack/openstack-ansible-os_trove/+/784781 | 13:34 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_trove master: Update trove configuration https://review.opendev.org/c/openstack/openstack-ansible-os_trove/+/784571 | 13:45 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_trove master: Update trove configuration https://review.opendev.org/c/openstack/openstack-ansible-os_trove/+/784571 | 13:46 |
openstackgerrit | Merged openstack/openstack-ansible-os_trove master: Add image upload option https://review.opendev.org/c/openstack/openstack-ansible-os_trove/+/784372 | 13:55 |
openstackgerrit | Merged openstack/openstack-ansible-openstack_hosts master: Decrease TCP retries in case of VIP failover https://review.opendev.org/c/openstack/openstack-ansible-openstack_hosts/+/778028 | 14:17 |
*** macz_ has joined #openstack-ansible | 14:35 | |
*** macz_ has quit IRC | 14:40 | |
manti | I have a problem with lxc containers not having any services. I've run the setup-hosts and setup-infrastructure playbooks successfully, but comparing with an AIO setup, for example the cinder-api container has nothing listening on tcp port 8776, while the AIO setup has uwsgi on tcp port 8776 | 14:57 |
*** dpawlik has quit IRC | 14:58 | |
manti | Same with horizon container, there is no apache2 etc | 14:58 |
manti | It's like containers have been created, but the actual services have not been installed | 14:59 |
jrosser | manti: setup-hosts creates the containers etc, in this context "infrastructure" is rabbitmq/galera/haproxy | 14:59 |
jrosser | you need to run setup-openstack.yml to lay down the actual openstack services | 14:59 |
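For reference, the three top-level plays jrosser describes are run in order from the deployment host (the path assumes a standard /opt/openstack-ansible checkout):

```shell
cd /opt/openstack-ansible/playbooks
openstack-ansible setup-hosts.yml           # base host config, LXC containers
openstack-ansible setup-infrastructure.yml  # haproxy, galera, rabbitmq, repo server
openstack-ansible setup-openstack.yml       # keystone, glance, nova, ... the actual services
```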
manti | ah, I assumed the services should already be there, since haproxy is unhappy: the healthchecks fail because there is nothing that could respond to them | 15:01 |
*** macz_ has joined #openstack-ansible | 15:01 | |
jrosser | manti: thats expected, and those healthchecks should come good one by one as the services are deployed | 15:03 |
manti | But I can't run setup-openstack because the task os_keystone : Create database for service fails with Exception message: (2013, 'Lost connection to MySQL server during query'), which I kind of assumed would be caused by haproxy | 15:06 |
manti | If I go to the galera container I can run mysql and access the database, so the problem is clearly in the connection | 15:07 |
jrosser | if you can paste the error message at paste.openstack.org (with the whole ansible task headers) then we can try to find out whats going on | 15:10 |
manti | http://paste.openstack.org/show/804403/ | 15:12 |
manti | Running the playbook with -vv or more doesn't give any extra info | 15:15 |
jrosser | what is 10.254.17.84 ? | 15:16 |
manti | infra1_utility_container-d225f98e | 15:17 |
jrosser | in the utility container you should have the mysql client, can you check that it can connect to the database? | 15:19 |
manti | "ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11" that doesn't work | 15:20 |
manti | correction, it works when I give the host | 15:21 |
manti | mysql -h infra1_galera_container-8862d63d and mysql -h 10.254.17.117 both work | 15:21 |
manti | mysql -h 10.0.3.112 does not work | 15:22 |
jrosser | can i check which branch you have deployed? | 15:23 |
manti | stable/victoria | 15:23 |
jrosser | it should not be necessary to specify the node http://paste.openstack.org/show/804404/ | 15:23 |
jrosser | there should be a /root/.my.cnf file on the utility container which has the IP address of the haproxy internal endpoint | 15:25 |
jrosser | the connection should be client > loadbalancer > active galera node | 15:25 |
jrosser | so i think that the next thing to check is if haproxy is seeing the galera nodes as up or down | 15:26 |
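One way to check what jrosser suggests: haproxy's stats socket (path is an assumption; see the `stats socket` line in your haproxy.cfg) emits CSV where field 1 is the backend name, field 2 the server, and field 18 the status. A live query would be `echo 'show stat' | socat stdio /var/run/haproxy.stat`; the awk below is demonstrated on a canned sample row:

```shell
# Sample row in haproxy "show stat" CSV layout (field 18 = status);
# the server name is taken from the log above, the rest is illustrative.
sample='galera-back,infra1_galera_container-8862d63d,0,0,0,1,,0,0,0,,0,,0,0,0,0,DOWN'
echo "$sample" | awk -F, '{print $1 "/" $2 " is " $18}'
```

Prints `galera-back/infra1_galera_container-8862d63d is DOWN`, matching what manti finds in the haproxy log shortly after.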
admin0 | how do you troubleshoot/fix it when 2 routers (3x in HA) are active at the same time and the IP ping-pongs every x seconds? | 15:27 |
manti | there is /root/.my.cnf file on the utility container and it has IP address that is used on the haproxy.conf bind clauses (like frontend galera-front-1 bind 10.254.17.11:3306), so I guess that is the haproxy internal endpoint? | 15:28 |
manti | but I think haproxy does not see galera node up since there is " Server galera-back/infra1_galera_container-8862d63d is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue." and " backend galera-back has no server available!" | 15:30 |
manti | I assumed those errors in haproxy come because the healthcheck fails | 15:31 |
jrosser | for a real-world deployment i would expect the ip in .my.cnf to be the one specified in internal_lb_vip_address, here https://github.com/openstack/openstack-ansible/blob/master/etc/openstack_deploy/openstack_user_config.yml.aio#L24 | 15:31 |
jrosser | for galera the mysql protocol is on port 3306, but the healthcheck is done with an http service on port 9200 | 15:32 |
manti | I had to check, but yeah, the IP in .my.cnf is the same as internal_lb_vip_address | 15:34 |
jrosser | root@haproxy1-utility-container-ec8755d0:~# curl 10.11.128.95:9200 | 15:34 |
jrosser | Percona XtraDB Cluster Node is synced. | 15:34 |
jrosser | ^ that is what determines if haproxy is happy with each galera node | 15:34 |
manti | well, it's not then: root@infra1:~# curl 10.254.17.117:9200 | 15:36 |
manti | curl: (56) Recv failure: Connection reset by peer | 15:36 |
jrosser | hmm so this could be some network connectivity issue | 15:38 |
jrosser | time to do basic ping / tcpdump checks between the haproxy node and galera containers | 15:38 |
manti | ping is working, both with IP and name | 15:39 |
jrosser | the galera status check service on port 9200 is run with xinetd on all the galera hosts | 15:40 |
manti | I can see that the xinetd runs mysqlchk, which has only_from definition. Is that list of IPs used to determine from where the connections are allowed? | 15:44 |
jrosser | yes thats correct | 15:45 |
jrosser | it should be populated with this https://opendev.org/openstack/openstack-ansible/src/branch/master/inventory/group_vars/galera_all.yml#L33-L39 | 15:45 |
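A sketch of the check manti describes next: the xinetd definition for the galera health check restricts clients via `only_from`. The file content and IPs below are illustrative, not copied from a real deployment (the real file typically lives at /etc/xinetd.d/mysqlchk on each galera host):

```shell
# Write an illustrative xinetd stanza, then list the allowed source IPs.
cat > /tmp/mysqlchk <<'EOF'
service mysqlchk
{
    disable   = no
    port      = 9200
    server    = /usr/local/bin/clustercheck
    only_from = 10.254.17.11 10.254.17.84 10.254.17.117
}
EOF
awk '$1 == "only_from" {for (i = 3; i <= NF; i++) print $i}' /tmp/mysqlchk
```

Prints the three allowed IPs, one per line; if the address haproxy actually connects from is not in this list, xinetd resets the connection, which is exactly the `curl: (56) Connection reset by peer` seen above.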
manti | There's my problem! It doesn't contain the right IP | 15:49 |
jrosser | ahha, interesting | 15:50 |
jrosser | it should get the IP of all your galera hosts and all your haproxy hosts from this -> groups['galera_all'] | union(groups['haproxy']) | 15:51 |
admin0 | manti, check the mtu also | 15:55 |
admin0 | i also had similar issues when my mtu was not in sync | 15:55 |
admin0 | which you cannot determine by ping | 15:55 |
admin0 | if you do ping, try to ping say 65000 bytes from each instance | 15:55 |
manti | I have a feeling this is happening because the internal_lb_vip_address is 10.254.17.11, but the rest of the /etc/openstack_deploy/openstack_user_config.yml has infra and compute nodes defined with other IP | 15:56 |
admin0 | is .11 even up ? | 15:56 |
manti | yes it is | 15:56 |
admin0 | a vip is different from the other nodes .. but ideally in some network that each controller can reach | 15:56 |
admin0 | so you cannot have 10x in the controllers and 192.x in the vip | 15:57 |
admin0 | the controllers ( default routing table) must be able to reach this vip ip | 15:57 |
admin0 | so ideally its in some common network or via a diff vlan routed via a router | 15:57 |
manti | deployment host is in 172.x and can reach the controller and compute nodes via 172.x, and now it is this 172.x that is shown on the mysqlchk only_from definition | 15:58 |
manti | but it clearly should have the 10.x address | 15:58 |
manti | my openstack_user_config.yml has the 172.x defined for infra_hosts and also for the haproxy host, so that is how it ended up in the mysqlchk config | 16:01 |
jrosser | manti: why is the internal_vip on a different CIDR than the internal mgmt network? usually it would be on the same network | 16:03 |
*** jamesdenton has quit IRC | 16:03 | |
manti | internal mgmt network is 10.254.17.0/24 and the internal vip 10.254.17.11, so they are on the same network | 16:04 |
* manti < https://matrix.org/_matrix/media/r0/download/matrix.org/bCtVfntJUpdMwsmRxGxYQpGb/message.txt > | 16:06 | |
jrosser | and you've got those 10.254.17 IP on the hosts and eth1 of the containers where needed | 16:08 |
jrosser | but 172.x is what you want to use to ssh for out-of-band management and ansible? | 16:08 |
manti | yes | 16:10 |
jrosser | ok, so i think in this case you are going to need to put an override for galera_monitoring_allowed_source in /etc/openstack_deploy/user_variables.yml | 16:11 |
jrosser | it's completely fine to have a separate network for your host management/ansible thats not the openstack mgmt network | 16:11 |
manti | ok, that galera_monitoring_allowed_source sounds like it'll solve my problem | 16:13 |
jrosser | but it will be necessary to make overrides in some cases because of issues like this - you can make whatever network setup you want, but going away from the stock config will require some overrides | 16:13 |
jrosser | fwiw i have 3 deployments set up just like this with a separate management network, it's completely fine | 16:14 |
openstackgerrit | Merged openstack/openstack-ansible stable/victoria: Bump SHAs for stable/victoria https://review.opendev.org/c/openstack/openstack-ansible/+/785803 | 16:14 |
jrosser | just not necessarily working out of the box without some extra settings | 16:14 |
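The override jrosser points at is a single variable in /etc/openstack_deploy/user_variables.yml. The CIDRs below are placeholders (the openstack mgmt network plus whichever network the haproxy/ansible hosts actually connect from); it is demonstrated here against a scratch file rather than the real one:

```shell
# Append the override (scratch copy; the real file is
# /etc/openstack_deploy/user_variables.yml). 172.16.0.0/24 stands in
# for the separate management network discussed above.
cfg=$(mktemp)
cat >> "$cfg" <<'EOF'
galera_monitoring_allowed_source: "10.254.17.0/24 172.16.0.0/24 127.0.0.1"
EOF
grep -c galera_monitoring_allowed_source "$cfg"
# then re-template the xinetd config on the galera hosts, e.g.:
#   openstack-ansible galera-install.yml
```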
manti | that is good to know also, I did notice that the documentation does not recommend this kind of setup, as it says the deployment host should be within the mgmt network (if I recall correctly) | 16:15 |
jrosser | i think it's more like there is a reference deployment assumed throughout the documentation | 16:18 |
*** macz_ has quit IRC | 16:19 | |
*** jamesdenton has joined #openstack-ansible | 16:30 | |
openstackgerrit | Merged openstack/openstack-ansible stable/train: Bump SHAs for stable/train https://review.opendev.org/c/openstack/openstack-ansible/+/785801 | 16:33 |
openstackgerrit | Merged openstack/openstack-ansible stable/ussuri: Bump SHAs for stable/ussuri https://review.opendev.org/c/openstack/openstack-ansible/+/785802 | 16:33 |
*** rpittau is now known as rpittau|afk | 16:59 | |
*** macz_ has joined #openstack-ansible | 17:15 | |
*** macz_ has quit IRC | 17:17 | |
*** macz_ has joined #openstack-ansible | 17:17 | |
*** MrClayPole has quit IRC | 17:35 | |
*** MrClayPole has joined #openstack-ansible | 17:35 | |
*** andrewbonney has quit IRC | 18:47 | |
*** juanoterocas has joined #openstack-ansible | 19:15 | |
*** snapdeal has joined #openstack-ansible | 19:43 | |
*** juanoterocas has quit IRC | 19:59 | |
*** juanoterocas has joined #openstack-ansible | 20:00 | |
*** devtolu1__ has joined #openstack-ansible | 20:04 | |
*** juanoterocas has quit IRC | 20:04 | |
*** snapdeal has quit IRC | 20:46 | |
*** macz_ has quit IRC | 20:47 | |
*** devtolu1__ has quit IRC | 20:56 | |
*** juanoterocas has joined #openstack-ansible | 20:56 | |
*** bjoernt has joined #openstack-ansible | 21:11 | |
*** jralbert has joined #openstack-ansible | 21:25 | |
*** spatel has quit IRC | 21:26 | |
jralbert | Having just upgraded to Train, I'm aware from the release notes that placement is removed from the nova role and gets its own role and container; however I see that the nova-placement-api service is still defined and running from the old Stein nova venv (although evidently receiving no requests). Should I just shut down and mask the service? Should | 21:28 |
jralbert | there be a play in the upgrade to clean up this service? | 21:28 |
*** spatel_ has joined #openstack-ansible | 21:29 | |
*** spatel_ is now known as spatel | 21:29 | |
*** bjoernt has quit IRC | 21:37 | |
*** bjoernt has joined #openstack-ansible | 21:38 | |
jrosser | jralbert: i think that the clean up is left up to you, i'd expect the haproxy backend for nova-placement-api to be removed by the upgrade playbooks as here https://github.com/openstack/openstack-ansible/blob/stable/train/scripts/run-upgrade.sh#L185-L186 | 21:39 |
jrosser | shutting down and masking the service seems like a reasonable next step | 21:40 |
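A sketch of that cleanup on each affected nova container (the unit name is taken from the discussion; verify it with systemctl before masking anything):

```shell
# Stop the orphaned Stein-era placement service and prevent it from
# ever being started again; this does not remove the old venv itself.
systemctl stop nova-placement-api
systemctl disable nova-placement-api
systemctl mask nova-placement-api
systemctl status nova-placement-api --no-pager   # should now report "masked"
```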
jralbert | Yes, the haproxy backend went away, I was just surprised in our post-install review to still see 19-release binaries running in the nova container and then realized they're from the deprecated placement service | 21:41 |
*** bjoernt has quit IRC | 21:41 | |
*** bjoernt has joined #openstack-ansible | 21:41 | |
jrosser | jralbert: were you previously having trouble during your upgrade with the venvs being built repeatedly for each node? | 21:44 |
*** macz_ has joined #openstack-ansible | 21:44 | |
*** sshnaidm|pto has quit IRC | 21:44 | |
jralbert | that was me, yes | 21:44 |
jrosser | did you manage to find what was happening there? | 21:45 |
jralbert | Nope, we tried several variations on limits, including having the repo containers in the limit, and saw no change | 21:46 |
jralbert | at that point we just directed osa to pull nova and neutron source from github and pushed through | 21:46 |
jrosser | i was thinking that was probably a factor in the upgrade taking a very long time | 21:47 |
jralbert | our goal was to run a set of limited plays so we could upgrade infra, control, and compute planes independently and thereby limit the outage to our clients | 21:47 |
jrosser | yes, understood | 21:47 |
jralbert | and that was largely successful, except that compute was heavy to get through | 21:47 |
*** spatel has quit IRC | 21:47 | |
jralbert | it was very late, and I was very tired, but I couldn't see clearly in the venv build role where it decides what host to delegate to | 21:48 |
jralbert | there's a task that identifies the build host, and it does name a repo container, but then the build still happens on the compute node | 21:48 |
jrosser | yeah, it's kind of a bit tortuous to find that | 21:48 |
jrosser | the other thing is that previously gathered facts can time out after 24 hours | 21:49 |
jralbert | but what fact would change the delegation behaviour? | 21:50 |
jrosser | oh, well it uses the repo server cpu architecture and operating system version when selecting the one to use to build wheels | 21:51 |
jrosser | and silently falls back to building on the target host instead if a match isn't found | 21:51 |
jrosser | this should all be handled fine, but i just wonder if theres something underlying which made it always fail to find a suitable repo server for you | 21:53 |
jralbert | Is there a way to add some debug plays to illustrate what it's picking? | 21:54 |
jrosser | this is the interesting bit https://github.com/openstack/ansible-role-python_venv_build/blob/stable/train/tasks/python_venv_wheel_build.yml#L16-L24 | 21:57 |
jrosser | it should be gathering facts about the repo servers | 21:57 |
jrosser | then evaluating {{ venv_build_host }} to decide which to use | 21:57 |
jrosser | if it's possible to reproduce some sort of "run nova/neutron playbook limited to one compute node" you might temporarily stick this in the python_venv_build role just after the "gather build target facts" task http://paste.openstack.org/show/804418/ | 22:02 |
jrosser | which will lead you to here https://github.com/openstack/ansible-role-python_venv_build/blob/dac99008d5c2473ce82f83559024ce0c62e78a9f/defaults/main.yml#L121 | 22:03 |
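As a toy model of the selection logic jrosser describes (not the role's actual Jinja; see the venv_build_host default linked above): pick a repo host whose OS/architecture facts match the target, otherwise silently fall back to building on the target itself:

```shell
# Facts are faked as "host:os-arch" pairs purely for illustration.
target_facts="ubuntu20-x86_64"
repo_hosts="repo1:ubuntu18-x86_64 repo2:ubuntu20-x86_64"

build_host="target"   # fallback: build wheels on the target host itself
for entry in $repo_hosts; do
  if [ "${entry#*:}" = "$target_facts" ]; then
    build_host="${entry%%:*}"   # matching repo server found
    break
  fi
done
echo "building on: $build_host"
```

Prints `building on: repo2`; with no matching repo host (e.g. stale or missing facts), it falls through to `target`, which is the repeated on-compute-node wheel building jralbert saw.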
jrosser | anyway, it's kinda late here now, but if the opportunity to understand what was happening comes up i'd be interested to know if we have something to fix | 22:05 |
jralbert | Ah, okay. Perhaps of interest, here's a little chunk of the log from a limited build of horizon on one of our control nodes, showing a repo container selected to gather facts from, and then delegation to the horizon container itself to do the build anyway: http://paste.openstack.org/show/804419/ | 22:05 |
jralbert | I will see about trying to get some debug info as you describe | 22:06 |
jrosser | i think printing venv_build_targets might also be enlightening if you can make it misbehave as before | 22:09 |
jrosser | jralbert: reading your paste carefully it looks like what you show with horizon is the same lack of delegation, so whatever is easiest for you to add some debug to will be fine, does not need to be a compute node | 22:23 |
jralbert | yes, I think I can trigger this anywhere in our environment. I'll do some testing when I have a bit of time and let you know what I find out | 22:24 |
jrosser | great, thank you | 22:24 |
*** luksky has quit IRC | 22:49 | |
openstackgerrit | Michael Soto proposed openstack/ansible-hardening master: Added pam_auth_password to nullok check https://review.opendev.org/c/openstack/ansible-hardening/+/785984 | 22:52 |
*** juanoterocas has quit IRC | 22:59 | |
*** tosky has quit IRC | 23:24 | |
*** macz_ has quit IRC | 23:58 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!