Wednesday, 2024-12-18

09:07 <derekokeeffe85> Morning all, hope all is well. I've been trying to add a compute node to a current cluster and have run into some issues. If anyone has a few minutes, could you take a look at this please and point me in the right direction? Thanks in advance. https://paste.openstack.org/show/bRfClW31LFWDg7iMWt0K/
09:29 <kleini> First: deleting openstack_inventory.json is a bad idea. It stores the chosen IP addresses for the LXC containers running on the infra nodes. Running dynamic_inventory.py chooses random IPs from the management network and assigns them to the LXC containers, so every container now got a new management network IP. Furthermore, the last part of each container's hostname is a random suffix. Those random suffixes are stored in openstack_inventory.json too, and they are also lost when the inventory is deleted. This means your deployment is not able to find the existing LXC containers and will create new ones, doubling them.
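For reference, each container entry in openstack_inventory.json ties the generated container name (physical host, service, plus that random suffix) to its assigned management IP. The snippet below only illustrates the general shape; the exact keys and values vary by release and deployment:

    "infra1_galera_container-a1b2c3d4": {
        "ansible_host": "172.29.239.17",
        "container_name": "infra1_galera_container-a1b2c3d4",
        "physical_host": "infra1",
        "component": "galera"
    }

Deleting the file throws away both the random suffixes and the IP assignments, which is why the next run generates brand-new containers instead of finding the existing ones.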
09:30 <kleini> The first thing you need to fix is to restore your openstack_inventory.json.
09:31 <kleini> Second: adding a compute node is very well described: https://docs.openstack.org/openstack-ansible/2023.1/admin/scale-environment.html#add-a-compute-host
09:31 <kleini> Just follow that guide.
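In essence, that guide has you add the new node to the compute_hosts stanza in /etc/openstack_deploy/openstack_user_config.yml before running any playbooks. A minimal sketch, with purely illustrative hostnames and management IPs:

    compute_hosts:
      compute1:
        ip: 172.29.236.11
      compute2:
        ip: 172.29.236.12
      compute3:
        ip: 172.29.236.13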
09:53 <derekokeeffe85> Hi kleini, I have done exactly as that link describes, but the services are being skipped; that's what I don't understand and am asking for some help with. Regarding the inventory file, I have a backup of it so I can restore it. So how does that get updated to have the new compute3 added?
09:56 <kleini> The inventory is updated by running any playbook; you don't need to care about it. And sorry to have to say it this way, but you didn't follow that guide. The limit parameter needs to contain localhost, meaning the deployment host.
09:59 <kleini> You must not change the openstack-ansible commands described in that guide. They have essential differences from the openstack-ansible commands you wrote down in your paste.
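For reference, the scale-environment guide runs the standard playbooks with a limit that includes both the deployment host and the new node, roughly like this (compute3 stands in for the new host's name):

    cd /opt/openstack-ansible/playbooks
    openstack-ansible setup-hosts.yml --limit localhost,compute3
    openstack-ansible setup-infrastructure.yml --limit localhost,compute3
    openstack-ansible setup-openstack.yml --limit localhost,compute3

Dropping localhost from the limit is exactly the deviation kleini is pointing at: some plays need the deployment host in scope to generate configuration correctly.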
09:59 <derekokeeffe85> Sorry, I had copied those 3 commands from a document I was writing and had put them in wrong. I did follow that link, and I did in fact run the script as a second alternative. The first two playbooks ran through fine and the 3rd skips the nova install etc.; maybe what you have pointed out was the reason. I also didn't know that the inventory was updated by running the playbooks. I will restore the original inventory and run again (which I did at the very beginning when following that guide). Thanks for your input.
10:35 <kleini> Restore openstack_inventory.json and then follow those instructions. Maybe you need to adjust the openstack-ansible version in the URL of the guide to match the version you're actually running in your deployment.
11:03 <jrosser> derekokeeffe85: if there were no hosts matched, then perhaps your inventory is still broken
11:04 <jrosser> you can use openstack-ansible/scripts/inventory-manage.py to view the inventory and see if your compute hosts appear where you expect them to
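A quick way to do that from the deployment host is sketched below; -f points at the inventory file and -l lists the hosts the dynamic inventory knows about (the exact flags can differ between releases, so check ./scripts/inventory-manage.py --help on your own checkout):

    cd /opt/openstack-ansible
    ./scripts/inventory-manage.py -f /etc/openstack_deploy/openstack_inventory.json -l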
11:47 <derekokeeffe85> Thanks, I'll take a look at that. I did restore the inventory, ran a playbook, checked the inventory and compute3 was there (as I said, I didn't know this was how it worked), but the 3rd command (openstack-ansible setup-openstack.yml --limit localhost,compute3) still skipped the services as far as I can see, unless I'm reading the output wrong. I'll do as you both say and maybe later stick the output in a paste, if you wouldn't mind taking a look.
11:54 <jrosser> derekokeeffe85: you can see one of the tasks that was skipped here: https://opendev.org/openstack/openstack-ansible-plugins/src/branch/master/playbooks/nova.yml#L16-L20
11:54 <jrosser> and you can see that the target group was "nova_all"
11:55 <jrosser> we can maybe infer from this that nova_all combined with --limit compute3 returns no hosts
11:55 <jrosser> i.e. compute3 is not in the nova_all group
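One way to confirm that theory without running the whole play is to ask ansible-playbook which hosts each play would match; --list-hosts is a standard ansible-playbook option that the openstack-ansible wrapper should pass straight through (compute3 here assumes the new node's inventory name):

    cd /opt/openstack-ansible/playbooks
    openstack-ansible setup-openstack.yml --limit localhost,compute3 --list-hosts

If the nova plays come back with "hosts (0)", the inventory grouping is the problem rather than the playbooks themselves.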
11:57 <derekokeeffe85> Hmm, ok jrosser, but it does actually create the nova dir and add nova.conf... it then fails saying there's no nova-compute service. Could it be looking for compute3 in different places? I only have it added in the compute_hosts in openstack_user_config.yml
12:00 <jrosser> have you checked which groups the tasks that do succeed were targeting?
12:05 <derekokeeffe85> I've been trying to dig around in the playbooks but found nothing definite (I don't know how you all developed this :)), so I ran the manage inventory script and can see compute3 there with the same output as the working computes. This is the output from the setup-openstack playbook: https://paste.openstack.org/show/bajR5O4dUtgMqGcX4mEJ/
13:42 <derekokeeffe85> jrosser: there are no hosts at all in nova_all in openstack_user_config: "nova_all": { <OMITTED> "hosts": [] }
13:42 <derekokeeffe85> Sorry, I meant openstack_inventory.json
13:44 <jrosser> have you also replaced the relevant parts of env.d / conf.d in your openstack_deploy directory?
13:46 <kleini> the inventory should look like this: https://paste.opendev.org/show/byZ3gYLEDgQPimyrT0tm/
13:54 <derekokeeffe85> Yep, group_compute_hosts & nova_compute show the 3 computes, jrosser. I didn't change/delete anything in there, kleini; they all have .example or .aio extensions, in both dirs.
13:55 <kleini> nova_all in the inventory should have children, and those should contain nova_compute
13:55 <kleini> and you already confirmed that nova_compute contains your new compute node
14:04 <jrosser> derekokeeffe85: it would be most helpful if you were comparing to a working deployment
14:04 <jrosser> because, for example, nova_all does indeed not have any hosts itself, but it has child groups
14:04 <jrosser> one of which should be nova_compute
14:05 <jrosser> and in the nova_compute group you should find your actual compute hosts under hosts[]
14:05 <jrosser> if you don't, your inventory is still incorrect
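Put together, the relevant part of a healthy openstack_inventory.json looks roughly like the sketch below; the exact list of child groups varies by release, and the host names are illustrative:

    "nova_all": {
        "children": ["nova_api_os_compute", "nova_conductor", "nova_scheduler", "nova_compute"],
        "hosts": []
    },
    "nova_compute": {
        "children": [],
        "hosts": ["compute1", "compute2", "compute3"]
    }

The playbook targets nova_all, so the new node is only picked up if it reaches nova_all through one of those child groups.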
14:17 <derekokeeffe85> kleini: that's correct, in my inventory nova_all contains nova_compute. jrosser: this is the backup of etc/openstack_deploy that deployed this current cluster, which has been working for 2 years now. Maybe we thought it was working but we didn't configure it properly from the get-go. In my inventory right now I do have nova_compute with no children, but the 3 computes are there under hosts :(
14:19 <derekokeeffe85> ahhhhh I think I may have found something!!! Does nova-compute depend on a bridge being configured? My networking might be the problem, could this be something?
14:20 <jrosser> you have to do whatever is specified in "setup the target hosts" in the documentation, adjusting for your own local networking choices
14:21 <jrosser> it's pretty obvious where things are failing in your earlier paste https://paste.openstack.org/show/bajR5O4dUtgMqGcX4mEJ/
14:21 <jrosser> this doesn't really seem to be about either the inventory or the network
14:22 <jrosser> i am very confused about what you're actually asking now :(
14:27 <derekokeeffe85> In my inventory there's a network with container_bridge: br-storage, group_binds: [glance_api, cinder_api, cinder_volume, nova_compute], and my bridge is missing. I don't have br-storage; for some reason all the networking we had built with netplan is messed up. I've taken up too much of your time on this now, so I'll go away and fix this issue first. Thanks for the help today, both.
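For anyone following along: that group_binds entry means hosts in nova_compute are expected to have a br-storage bridge configured on the host itself. A minimal netplan sketch, with a purely hypothetical physical interface name and storage-network address:

    network:
      version: 2
      ethernets:
        ens4: {}
      bridges:
        br-storage:
          interfaces: [ens4]
          addresses: [172.29.244.13/22]

The actual interface, VLAN, and addressing have to match the deployment's own cidr_networks/provider_networks settings in openstack_user_config.yml.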
15:49 <opendevreview> Jonathan Rosser proposed openstack/openstack-ansible-ops master: Test all supported versions of k8s workload cluster with magnum-cluster-api  https://review.opendev.org/c/openstack/openstack-ansible-ops/+/916649
16:48 <opendevreview> Jonathan Rosser proposed openstack/openstack-ansible-ops master: Test all supported versions of k8s workload cluster with magnum-cluster-api  https://review.opendev.org/c/openstack/openstack-ansible-ops/+/916649
17:24 <jrosser> noonedeadpunk: i found the difference between the working/broken rabbitmq installs https://paste.opendev.org/show/bNgQS66qbtwLlvYptGvU/
17:25 <jrosser> the jobs that work have something (doesn't matter what, really) that causes a 'changed' task, and so the handler executes to restart rabbitmq on the upgraded installation
17:25 <jrosser> the jobs that break do not have any changed task; everything is 'ok' instead
17:26 <jrosser> so the problem really starts here: https://github.com/openstack/openstack-ansible-rabbitmq_server/blob/0e009192879ee4f9c07d5cdc08d7071d4cd6cf2c/tasks/rabbitmq_upgrade_prep.yml#L34-L37
17:26 <jrosser> we have a task there that unconditionally stops rabbitmq before an upgrade
17:27 <jrosser> but there is nothing that unconditionally restarts it afterwards, only some "accident" like a config file update, or a new version of the package actually being installed
17:28 <jrosser> i expect that at this point, when we have just branched, the version is equal on the N and N-1 branches, so that's a fairly serious issue: unless the version or something else changes, it will never restart
17:29 <jrosser> likely also why we did not see this before on upgrade jobs, as the versions would have been different on N and N-1
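To illustrate the failure mode being described (a simplified sketch, not the actual role code): a pre-upgrade task stops the service unconditionally, but the only thing that starts it again is a notify-driven handler, so a run where nothing reports 'changed' leaves rabbitmq stopped.

    # Pre-upgrade task: always runs, always stops the service.
    - name: Stop rabbitmq-server before upgrade
      ansible.builtin.service:
        name: rabbitmq-server
        state: stopped

    # Later tasks only notify the restart handler when they actually change something.
    - name: Install rabbitmq-server package
      ansible.builtin.package:
        name: rabbitmq-server
        state: latest
      notify: Restart rabbitmq-server

    # In handlers/main.yml: never fires if every task above reports 'ok'.
    - name: Restart rabbitmq-server
      ansible.builtin.service:
        name: rabbitmq-server
        state: restarted

One possible fix along these lines would be an unconditional task with state: started at the end of the upgrade path, so the service comes back even when no task reports a change.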
