Thursday, 2022-02-10

opendevreviewJonathan Rosser proposed openstack/openstack-ansible-repo_server master: Ensure insist=true is always set for lsyncd  https://review.opendev.org/c/openstack/openstack-ansible-repo_server/+/82867809:47
opendevreviewJonathan Rosser proposed openstack/openstack-ansible-repo_server master: Use ssh_keypairs role to generate keys for repo sync  https://review.opendev.org/c/openstack/openstack-ansible-repo_server/+/82710009:48
opendevreviewJonathan Rosser proposed openstack/openstack-ansible-os_neutron stable/wallaby: Make calico non voting  https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/82865710:29
jrossercalico is now NV on all branches10:30
jrosserit's all but deprecated really, we've just not said that10:30
jrosserevrardjp_: ^ if you are interested in calico it needs some TLC10:30
evrardjp[m]jrosser: while I like the concept around calico, I never had my hands into this. It was all handled by logan-11:22
evrardjp[m]Thanks for the ping11:22
*** dviroel|out is now known as dviroel|ruck11:30
jrosserdo we want a DNM patch to os_neutron to test this? https://review.opendev.org/c/openstack/openstack-ansible/+/82855311:42
jrosseror are we happy that the same patch on the stable branches is OK11:42
jrosserneed to start merging that as it's yet another thing holding up centos-8 removal11:42
jrosserandrewbonney: is this what you saw with libvirt sockets? https://zuul.opendev.org/t/openstack/build/860df46e474d4333a573f6069e7d90b4/log/job-output.txt#2087411:47
andrewbonneyYes, looks like it. My manual solution was 'systemctl stop libvirtd && systemctl start libvirtd-tls'. The play could then be run successfully11:48
andrewbonneyIf you stop libvirtd manually it gives a warning that the sockets may cause it to start again11:49
jrossernoonedeadpunk: ^ feels like there's a proper bug here11:51
jrosserthere are some specific ordering requirements around systemd sockets and the service activated by the socket11:51
noonedeadpunkregarding https://review.opendev.org/c/openstack/openstack-ansible/+/828553 I think it shouldn't hurt for sure. Especially on master where we fix tempest plugins by SHA11:52
jrosseri had the same in the galera xinetd changes, and the only thing i could come up with was this https://github.com/openstack/openstack-ansible-galera_server/blob/master/tasks/galera_server_post_install.yml#L7111:52
noonedeadpunkregarding sockets - we haven't caught this for some reason.11:55
noonedeadpunkSo basically, when we start libvirtd-tcp.socket it starts libvirt and then libvirtd-tls.socket fails?11:55
noonedeadpunkor libvirt started with package installation?11:55
noonedeadpunkbecause we explicitly stop it with previous task?11:56
andrewbonneyI think libvirtd.socket is already started. If you just stop libvirtd on a system and wait for long enough it'll start itself again, which I assume is triggered by one of the sockets given the message it gives when you stop it11:56
jrosserwith galera, iirc if the service started by the socket was not loaded in systemd when you try to make the .socket service, it all goes bad11:58
noonedeadpunkyes, indeed libvirtd is re-activated within 45 seconds tops12:04
noonedeadpunkso it's race condition I believe12:04
andrewbonneyYeah, and the bigger the deployment the more likely it is you'll see it, which matches our experience12:05
noonedeadpunkand that's indeed libvirtd.socket that brings it back12:06
noonedeadpunkwell, should be a trivial fix I believe then12:09
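For reference, a minimal sketch of the manual workaround described above, assuming the stock libvirt socket unit names (libvirtd.socket, libvirtd-ro.socket, libvirtd-admin.socket) shipped by current distro packages; stopping the activation sockets first keeps systemd from re-spawning libvirtd before the TLS listener is brought up:
    # stop the activation sockets first so libvirtd cannot be socket-activated again
    systemctl stop libvirtd.socket libvirtd-ro.socket libvirtd-admin.socket
    systemctl stop libvirtd.service
    # then start the TLS listener that the os_nova play expects
    systemctl start libvirtd-tls.socket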
opendevreviewDmitriy Rabotyagov proposed openstack/openstack-ansible-os_nova master: Drop libvirtd_version identification  https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/82870212:13
jrosserok finally the keypairs stuff is all passing https://review.opendev.org/q/topic:osa%252Fkeypairs12:16
opendevreviewDmitriy Rabotyagov proposed openstack/openstack-ansible-os_nova master: Fix race-condition when libvirt starts unwillingly  https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/82870412:23
noonedeadpunkI guess this should fix the thing ^ ?12:24
*** dviroel is now known as dviroel|ruck13:13
spatelnoonedeadpunk jrosser did you see this error before - https://paste.opendev.org/show/bdmUhBJkdDRfrhKg7Lyt/14:22
spateli am not able to create VMs because my conductor is not happy 14:22
noonedeadpunkRings some bell from R14:22
noonedeadpunkwhere after rabbit member restart queues were lost without HA queues14:23
noonedeadpunkSo had to either restart everything or clean up the remaining queues and basically rebuild the rabbit cluster.14:24
noonedeadpunkBut maybe that is something different14:24
spatelI had a disaster last night and re-built rabbitMQ14:24
spatelnow i can't build any VM :(14:24
noonedeadpunkand conductor restart doesn't help either?14:24
spateltrying that now14:25
spatelYesterday this is what happened: i rebuilt rabbitMQ and then all my nova-compute nodes started throwing this error - https://paste.opendev.org/show/bFtAhaCLfed1R3ptNr1F/14:25
spateli rebuilt rabbitMQ multiple times hoping it would fix things, but no luck :(14:25
noonedeadpunkyeah it happens if conductor is dead14:26
spatelso i shut down all nova-api containers and then rebuilt rabbitMQ 14:26
spatelthen nova-compute stopped throwing errors but now i can't build any VM (this is wallaby so maybe a bug or something)14:27
spatellet me restart conductor and see if it help 14:27
spatelcurrently my rabbitMQ is running without HA queues, is that ok?14:27
noonedeadpunkyou tell me :D14:28
spatellet me restart conductor and see if that helps14:28
noonedeadpunkthe thing that bugged me without HA queues is that when a rabbit member that had gone away re-joined the cluster, it was not aware of any queues that existed, but clients were expecting to be able to get them from it14:30
noonedeadpunkwhich kind of sounds like case here14:30
spatelHow do i bring back HA?14:31
spatelhow do i run the common-rabbitmq tag so it will deploy queues for all my services?14:31
spateli forgot how to run that14:31
spateldo you know command?14:31
noonedeadpunkso conductor restart didn't help?14:32
spatelno still my VM stuck in BUILD14:33
spatelnot seeing any error in rabbitMQ logs 14:34
spatelno error in conductor log14:34
noonedeadpunkit might take like $timeout before conductor error out14:34
spatelhmm14:35
noonedeadpunkbut assuming the overrides are removed, it was smth like `openstack-ansible setup-openstack.yml --tags common-mq`14:35
spatelbut that command will run on all compute nodes correct? it will take a hell of a long time with 200 computes 14:36
spatelhow about i can do  - rabbitmqctl -q -n rabbit set_policy -p /keystone HA '^(?!(amq\\.)|(.*_fanout_)|(reply_)).*' '{"ha-mode": "all"}' --priority 014:36
spatelfor each service14:37
noonedeadpunkwhy not14:37
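A hedged sketch of that per-vhost approach, looping the same policy over each service vhost; the vhost list below is an assumption based on the usual OSA naming and should be adjusted to the deployment:
    # mirror every queue except amq.*, *_fanout_* and reply_ queues in each vhost
    for vhost in /keystone /nova /neutron /glance /cinder /heat; do
        rabbitmqctl -q -n rabbit set_policy -p "$vhost" HA \
            '^(?!(amq\.)|(.*_fanout_)|(reply_)).*' '{"ha-mode": "all"}' --priority 0
    done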
spatellet me try 14:37
noonedeadpunkyou can actually start just from nova14:37
noonedeadpunkto see if that helps14:37
spateli did only for nova/neutron (because that one is more important)14:38
noonedeadpunkstill nova-* will likely need to be restarted 14:38
noonedeadpunkon nova_api at least14:38
spatellet me restart all nova-* service14:40
opendevreviewMerged openstack/openstack-ansible-os_keystone master: Remove legacy policy.json cleanup handler  https://review.opendev.org/c/openstack/openstack-ansible-os_keystone/+/82744314:44
spatelnoonedeadpunk still no luck VM stuck in BUILD 14:44
spatelwhy nova-conductor logs not flowing..14:45
spatelseems like no activity 14:45
spatelthere are 15,235 mesg in Ready queue 14:46
spatelkeep growing.. 14:46
noonedeadpunkand is everything fine with networking on containers?14:47
spatelneutron not throwing any error in logs but i can restart14:48
spatelmesg stuck in this kind of queue scheduler_fanout_5fc638e953cb4dc6b47151fef969d54c14:49
spatelhttps://paste.opendev.org/show/bsQzWFYAsSDyb5EOzbLD/14:51
noonedeadpunkand how does scheduler log look like?14:52
spatelclean :(14:53
spatelno error at all14:53
noonedeadpunkhm14:56
noonedeadpunkwhat if restart nova-compute on 1 node?14:56
noonedeadpunkwill it still complain on rabbit?14:57
spateli am not seeing any error on nova-compute logs 14:58
spatelalso restarting nova-compute taking very long time..14:58
noonedeadpunkas maybe they try to use old fanout queues that are not listened to anymore, as they live for some time before dying14:58
spatelsame with neutron-linuxbridge-agent.service14:58
spatelhow do i use old fanout?14:58
noonedeadpunkis everything really fine with container networking?14:58
noonedeadpunklike there's an ip on eth0 and eth1?14:59
spatelyes.. i can ping everything so far14:59
spatelrabbitMQ is cluster is up 14:59
spatelagent can ping rabbitMQ IP etc14:59
spateldoes rabbit restart help?15:00
spatelfanout should get delete after restart correct?15:00
noonedeadpunknot instantly after restart15:01
noonedeadpunkbut like in 20 minutes or smth...15:01
spatelsomething stuck in queue which is keep growing 15:01
opendevreviewMerged openstack/openstack-ansible-os_keystone master: Drop ProxyPass out of VHost  https://review.opendev.org/c/openstack/openstack-ansible-os_keystone/+/82851915:03
spatelis systemctl restart rabbitmq-server.service enough or should i stop everything and then start again?15:05
opendevreviewMerged openstack/openstack-ansible stable/victoria: Remove enablement of neutron tempest plugin in scenario templates  https://review.opendev.org/c/openstack/openstack-ansible/+/82838615:10
spatelnoonedeadpunk getting this error in conductor now -  Connection pool limit exceeded: current size 30 surpasses max configured rpc_conn_pool_size 3015:12
noonedeadpunkso sounds like it was processing smth15:17
spatelnoonedeadpunk how much time does systemctl restart nova-compute take in your setup ?15:17
spatelin my case it's taking 5 min or more15:17
noonedeadpunkdepends. might be 30 sec15:17
noonedeadpunkoh, well15:17
noonedeadpunkusually it means it wasn't spawned properly15:18
noonedeadpunkor running for tooooo long15:18
noonedeadpunkso things slow on shutting down there15:18
spatellet me try to shutdown and then start 15:18
noonedeadpunkso if some connection is stuck - it will likely wait for a timeout before stopping the service15:19
spatelstop taking longer time 15:19
spatelstart is quick15:19
spateli am thinking to rebuild rabbitMQ again.. :( i have no option left15:20
noonedeadpunkI kind of agree here15:20
spateli am doing apt-get purge rabbitmq-server on container15:20
spatelis that good enough?15:20
noonedeadpunk why not just re-run the playbook? :)15:21
spatelthat won't do anything right?15:21
spatel?15:21
spatellike wiping mnesia etc15:21
noonedeadpunkwith -e rabbitmq_upgrade=true?15:21
noonedeadpunkit kind of will15:22
spateldo you think it will wipe down 15:22
spatelok15:22
noonedeadpunkbut it won't wipe users/vhosts/permissions15:22
noonedeadpunkbut it will clean up all queues15:22
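For context, a hedged sketch of the rebuild being suggested, assuming the standard OSA playbook name and that it is run from the deploy host:
    # re-runs the rabbitmq_server role; rabbitmq_upgrade=true takes the cluster
    # through the upgrade path, which clears the queues, while users, vhosts and
    # permissions are re-created by the role
    openstack-ansible rabbitmq-install.yml -e rabbitmq_upgrade=true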
spatelok 15:22
spatelshould i shut down the neutron-server and nova containers first?15:22
noonedeadpunknah15:22
spatelok i thought it will keep trying and make mess in queue 15:23
noonedeadpunkit shouldn't be mess if rabbit working15:23
spatelok running -e rabbitmq_upgrade=true 15:23
spatelhow about the compute agents and network agents?15:24
spateldo they reconnect by themselves?15:24
noonedeadpunkI'd leave everything as is15:24
noonedeadpunkand yes, they should15:24
spatelok15:24
noonedeadpunknot guaranteed though15:24
spatellast night when i rebuilt RabbitMQ they didn't, so i had to restart all agents 15:24
spatelbut that took hell of time 15:25
spatelI had zero issues rebuilding rabbit on Queens but on wallaby this is a mess... feels like a bug15:26
spateldone15:27
spatelall rabbitnode is up15:27
spatelshould i reboot anything or leave it15:28
spatelsome compute nodes are showing this error so i assume they are not going to re-try - https://paste.opendev.org/show/b3XS6aQwO4q4AuDIRN8X/15:31
opendevreviewMerged openstack/openstack-ansible-repo_server master: Ensure insist=true is always set for lsyncd  https://review.opendev.org/c/openstack/openstack-ansible-repo_server/+/82867815:31
opendevreviewMerged openstack/openstack-ansible stable/wallaby: Remove enablement of neutron tempest plugin in scenario templates  https://review.opendev.org/c/openstack/openstack-ansible/+/82855115:38
opendevreviewMerged openstack/openstack-ansible master: Simplify mount options for new kernels.  https://review.opendev.org/c/openstack/openstack-ansible/+/82746415:38
spatelnoonedeadpunk i am getting this error in conductor - https://paste.opendev.org/show/bRtDZIKivNjbplTFbVuY/15:41
spatelassuming its ok15:42
spatelstill not able to build VM :(15:43
opendevreviewMerged openstack/openstack-ansible master: Remove workaround for OpenSUSE when setting AIO hostname  https://review.opendev.org/c/openstack/openstack-ansible/+/82746515:48
opendevreviewMerged openstack/openstack-ansible master: Rename RBD cinder backend  https://review.opendev.org/c/openstack/openstack-ansible/+/82846315:48
opendevreviewDmitriy Rabotyagov proposed openstack/openstack-ansible stable/xena: Rename RBD cinder backend  https://review.opendev.org/c/openstack/openstack-ansible/+/82866315:49
opendevreviewDmitriy Rabotyagov proposed openstack/openstack-ansible stable/wallaby: Rename RBD cinder backend  https://review.opendev.org/c/openstack/openstack-ansible/+/82866415:50
opendevreviewDmitriy Rabotyagov proposed openstack/openstack-ansible stable/victoria: Rename RBD cinder backend  https://review.opendev.org/c/openstack/openstack-ansible/+/82866515:50
spatelnoonedeadpunk do you think this is mysql issue - https://paste.opendev.org/show/bxSMimrEKVIXI5LePhAF/15:58
noonedeadpunknah, it's fine16:01
spatelhmm i thought may be conductor not able to make connection to mysql 16:01
spateldo you think conductor doesn't have enough threads etc.. 16:02
spateldoes this look ok to you? - https://paste.opendev.org/show/bxSMimrEKVIXI5LePhAF/16:02
spatelrabbitMQ looks healthy, no messages in queues, all queues are at zero 16:03
noonedeadpunkif the conductor drops a connection, such a message would arise16:04
noonedeadpunkand compute service list shows everybody is healthy and online?16:04
spatelmajority are up 16:05
spateleven if a couple are down i should be able to spin up vms16:06
spatelmy VM creation process is stuck in BUILD, which means messages are not passing between components16:06
noonedeadpunkand what in scheduler/conductor logs when you try to spawn VM?16:07
noonedeadpunkdo they even process request?16:07
spatelno activity 16:07
spatellet me go deeper and see16:08
spateli found one log - 'nova.exception.RescheduledException: Build of instance 95c40e2a-2eed-4c75-a131-da55eaac7060 was re-scheduled: Binding failed for port 05272a41-a60c-43ea-9a7a-30a67ed92ddb, please check neutron logs for more information.\n']16:08
spatellet me check neutron logs 16:08
*** priteau_ is now known as priteau16:11
spatelnoonedeadpunk now my vm failed with this error - {"code": 500, "created": "2022-02-10T16:16:36Z", "message": "Build of instance 7fbac40d-2829-47f4-9291-31c07c43d04c aborted: Failed to allocate the network(s), not rescheduling."}16:18
noonedeadpunkwell, it's smth then:)16:18
spatelbut neutron-server logs not saying anything - https://paste.opendev.org/show/blOElBzra5OVjid9uLHH/16:19
spatelno error in neutron-server logs related that UUID16:19
noonedeadpunkbut it's likely one of neutron agent that's failing16:20
spateli have 200 nodes so very odd 16:22
noonedeadpunknova scheduler doesn't really check neutron agent aliveness on the compute where it places the VM16:23
nurdieHey OSA! Does anyone know of an easy method of migrating openstack instances from one production deployment to another? Currently have a cluster deployed on CentOS7 and going to be building a new cluster on Ubuntu that will replace the old16:23
noonedeadpunknurdie: https://github.com/os-migrate/os-migrate16:23
noonedeadpunkor you can just add ubuntu nodes to centos cluster and migrate things16:24
noonedeadpunk(or re-setup them step-by-step)16:24
nurdieWe're going to rebuild from ground up. Controllers + Computes16:24
nurdieI remember I chatted with someone that said it's possible to mix operating systems but it can become problematic, so I'd rather just start anew 16:25
noonedeadpunkwell, here is how I was doing a full re-setup - shut down 1 controller, re-setup it to ubuntu, then re-setup computes one by one with offline vm migration between them16:25
spatellet me create a separate group and try to create a vm in that group16:25
noonedeadpunkand then re-setup the rest of the controllers16:26
noonedeadpunkbasically, the algorithm would be the same as for an OS upgrade16:26
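A hedged sketch of the per-compute drain step in that approach; the host name is hypothetical and the confirm step varies with the client version:
    # take the node out of scheduling before reinstalling it
    openstack compute service set --disable --disable-reason "reinstall to ubuntu" old-compute-01 nova-compute
    # cold-migrate every instance off the node
    openstack server list --all-projects --host old-compute-01 -f value -c ID | while read vm; do
        openstack server migrate "$vm"
    done
    # confirm each instance once it lands on the new host, e.g.
    # openstack server resize confirm <uuid>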
noonedeadpunkbut it's up to you ofc :)16:26
nurdieinteresting o.016:27
noonedeadpunkos-migrate can work with basic set of resources. Nothing like octavia or magnum is supported16:27
spatelnoonedeadpunk very odd, i can see vm is up but in paused state in virsh 16:27
nurdieI'll be honest, my hesitation comes from how wonky C7 has been. I don't blame OSA for that. It was beautiful until RHEL bent the whole world over a table16:27
noonedeadpunkbtw I guess spatel can tell you story of how to migrate from rhel7 to ubuntu :D16:28
spatelhehe16:28
spateldo you think the neutron-agents are reporting the wrong status..16:29
spatelthey are showing as up but they are not16:29
noonedeadpunkare they showing as healthy?16:29
nurdieIt's a very basic deployment. We don't even have designate yet. I'm going to checkout os-migrate though. I'm also going to force uniformity of compute node resources instead of the frankenstein cluster we have of several different server models, CPU counts, RAM counts. Gonna use old servers to setup a CERT/staging/testing OSA environment16:29
spatelyes they are showing healthy16:29
nurdienoonedeadpunk: thanks!16:30
noonedeadpunkhuh16:30
noonedeadpunkyeah, then unlikely they will lie16:30
noonedeadpunkand what's in neutron-ovs-agent logs on compute?16:30
noonedeadpunknurdie: running a cluster of identical computes is only possible at launch :D16:32
spateli have the linuxbridge agent 16:34
nurdieYou're not wrong hahahhaha. Alas, the new cluster should be big enough to support us until we can deploy computes that are exactly double or something more mathematically pretty16:34
noonedeadpunkspatel: you run LXB? O_O16:34
spatelyes LXB16:35
noonedeadpunkdidn't know that16:35
* noonedeadpunk loves lxb16:35
jamesdentonIt Works™16:39
spatelnoonedeadpunk the vm is spinning up but neutron is not attaching the nic16:40
spatelgetting this error in neutron agent - https://paste.opendev.org/show/bJ8O9MnPWXHzwKKlo8sV/16:40
spateljamesdenton any idea16:41
spatelthats all in my log 16:41
noonedeadpunksee no error16:41
spatelthat is why very odd 16:41
jamesdentonanything in nova compute?16:42
spatelvm stuck in paused 16:42
spatelnova created VM and no error in logs but let me check again16:42
noonedeadpunkso basically, compute asks for port through API, it's created, but neutron api reports back failure16:42
spatelno error in nova16:42
noonedeadpunkoh, btw, is the port shown in neutron api?16:43
spatellet me check 16:43
spatelyes - https://paste.opendev.org/show/bfxx05SXqv9GegQe6XoA/16:44
spateli can see the ports but they are in BUILD state16:44
spatelnow the vm is in error state16:44
spatelwhat could be wrong?16:46
spateli restarted neutron multiple times 16:47
jamesdentonis rabbitmq cleared out now?16:47
spatelno error in rabbitmq16:47
spatelwhat could be wrong to attach nic to VM 16:47
noonedeadpunkso basically neutron agent doesn't report back to api16:48
spatelin the agent logs it looks like it got the correct IP etc.. so it should attach16:48
*** priteau is now known as priteau_16:48
spatelnoonedeadpunk ohh is that possible16:48
noonedeadpunkwhich it should do through rabbit16:48
*** priteau_ is now known as priteau16:48
spateldamn it.. again rabbit.. 16:49
spatelhow do i check that16:49
noonedeadpunkand this specific agent shown as alive?16:49
spatelyes its alive16:49
spatellet me check again 16:49
noonedeadpunkand you now have ha-queues for neutron?16:49
noonedeadpunkshouldn't matter though16:49
noonedeadpunkhm16:49
noonedeadpunkany stuck messages in neutron fanout?16:50
spatel| 01647432-727b-43ea-a21a-bba2841bc9bc | Linux bridge agent | ostack-phx-comp-gen-1-33.v1v0x.net   | None              | :-)   | UP    | neutron-linuxbridge-agent |16:50
spatelagent is UP16:50
spatelyes i do have HA queue for neutron16:50
spatelif the agent is showing as up, that means rabbitMQ is also clear, correct?16:51
noonedeadpunkwell, it depends...16:52
noonedeadpunkI think the trick here is that any neutron-server can mark agent as up16:52
noonedeadpunkbut only a specific one should receive the reply from the agent?16:53
noonedeadpunknot sure here tbh16:53
noonedeadpunkI don't recall how nova acts when it comes to interaction with neutron...16:53
spatelhmm this is crazy16:54
noonedeadpunkdoes it just ask the api if the port is ready, or is it waiting for the reply on the same connection...16:54
spateljamesdenton know better16:54
spatelrestarted neutron-server 16:57
spatelstill no luck16:57
spatelany other agent i should be restarting16:59
noonedeadpunkconsidering that you already restarted lxb-agent on compute?16:59
spatelmany time16:59
spatelboth nova-compute and LXB 17:00
noonedeadpunknah, dunno17:00
noonedeadpunkit should work)17:00
noonedeadpunkmaybe smth on net node ofc...17:00
noonedeadpunklike dhcp agent...17:00
noonedeadpunkor metadata...17:00
noonedeadpunkdamn17:00
noonedeadpunkI can recall that happening17:00
spatelreally 17:01
noonedeadpunkit was like a dead dnsmasq or smth17:01
spatelin that case i should lose the IP17:01
noonedeadpunkor some issue with l3 where net comes from17:01
spateli have 500 vms running on this cloud17:01
spatelI have all vlan networking no l3-agent or gateway17:01
spatelhttps://bugs.launchpad.net/neutron/+bug/171901117:02
spatelthis is old bug but may be17:02
noonedeadpunkso basically because of a weird dns-agent state, port creation was just failing17:02
noonedeadpunkI believe it was dns-agent indeed17:02
spateldns-agent?17:03
noonedeadpunk*dhcp17:03
spatelyou are asking to restart - neutron-dhcp-agent.service ?17:03
spatelnot seeing any error in the logs 17:04
spatellet me restart 17:04
spatelis there a good way to delete the nova queues and re-create them (i don't want to rebuild rabbit again)?17:05
noonedeadpunkI don't think you need?17:05
spatellet me restart dhcp and then see17:06
noonedeadpunkthere's no problem with nova imo17:06
spatelrestarting neutron-metadata-agent.service 17:08
noonedeadpunkmetadata is other thing?17:08
spatellets see17:08
spatelwhy would this stuff stop VM creation and interface attachment?17:08
noonedeadpunkbecause interface is not even built properly?17:09
noonedeadpunkor well, it misses last bit I guess17:09
spatelmmmm17:09
spatellets see17:09
spatelrestart taking very long time for systemctl restart neutron-metadata-agent.service17:10
noonedeadpunkwhich is to make dhcp aware of the ip-mac assignment17:10
spatelhmm maybe you are right here... 17:10
spatelsomething is holding back17:10
spatelhope you are correct :)17:10
spatelwaiting for my reboot to finish17:10
spateli meant restart 17:11
noonedeadpunkI can recall this happening for us, with nothing in logs17:11
noonedeadpunkso we as well were restarting everything we saw :D17:11
noonedeadpunk(another point for OVN)17:11
spatel+1 yes OVN is much simpler 17:12
noonedeadpunkuntil it breaks :D17:12
spatelrestarting meta on last node17:13
spatelits taking long time so must be stuck threads17:14
noonedeadpunkI hope you restarted dhcp agent as well :)17:14
spatelyes17:14
spatel systemctl restart neutron-dhcp-agent.service and neutron-metadata-agent.service17:14
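A hedged sketch of doing those restarts across all network-node containers at once from the deploy host; the inventory group names are the usual OSA ones and may differ between releases:
    # run from the OSA playbooks directory so the dynamic inventory is used
    ansible neutron_dhcp_agent -m service -a "name=neutron-dhcp-agent state=restarted" --become
    ansible neutron_metadata_agent -m service -a "name=neutron-metadata-agent state=restarted" --become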
spateldamn this is taking a very long time.. still hanging 17:16
*** dviroel|ruck is now known as dviroel|ruck|afk17:16
spatelno kidding..... it works 17:17
spatelDamn noonedeadpunk 17:18
noonedeadpunkyeah, I know. Sorry for not recalling earlier17:19
spatelOMG!! you saved my life.. 17:19
noonedeadpunkthe most nasty thing is it wasn't logging anything17:19
spatelThis should be a bug... trust me...17:19
spatelfor the last 24 hours i have been hitting the wall 17:19
noonedeadpunkYou should prepare talk about that for summit :D17:20
noonedeadpunkThey conveniently prolonged CFP17:20
spatelI am not there yet :(17:21
spatelstill need to learn lots 17:21
noonedeadpunkone of your blog posts totally could be fit for presentation17:22
noonedeadpunkas you did tons of research in networking17:22
noonedeadpunkand have numbers17:22
spatelI am going to document this incident 17:25
spatelhow i build rabbitMQ how i did HA and how i did everything..17:25
spatelalso i am thinking of posting the IRC log of our discussion 17:26
spateljamesdenton do i need neutron-linuxbridge-agent.service in SRIOV nodes? 17:35
spateli believe SRIOV need only - neutron-sriov-nic-agent.service17:36
jamesdentoni don't think you do17:42
jamesdentonscrollback - glad to see you got it working17:43
spateljamesdenton This is a very obscure issue.. i don't think it's easy to figure out :( Big thanks to noonedeadpunk for sticking around with me 17:55
spatelhe didn't give up :) 17:55
spatelblog is on the way for postmortem 17:57
fridtjof[m]ugh wow, just got bitten by last year's let's encrypt CA thing. it sneakily broke a major upgrade to wallaby :D nothing could clone stuff from opendev until i upgraded ca-certificates...18:08
jrosseri wonder if we should have a task to update ca-certificates specifically to 'latest' rather than just 'present'18:39
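As an illustration only, a hedged sketch of what that could look like as an ad-hoc run against Debian/Ubuntu hosts (the real change would be a role task using state: latest rather than present; the 'all' group and apt module are assumptions here):
    # force ca-certificates to the newest packaged version on every host
    ansible all -m apt -a "name=ca-certificates state=latest update_cache=yes" --become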
opendevreviewMerged openstack/openstack-ansible-rabbitmq_server master: Allow different install methods for rabbit/erlang  https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/82644519:01
opendevreviewMerged openstack/openstack-ansible-rabbitmq_server master: Update used RabbitMQ and Erlang  https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/82644619:07
*** dviroel|ruck|afk is now known as dviroel|ruck19:19
mgariepyanyone here have an idea how bad it is when a memcached server is down ?19:53
mgariepylike what service are really badly impacted by this19:54
spatelI think only horizon use it 19:54
spatelkeystone 19:54
spatelfor token cache i believe19:54
mgariepynova has it in the config https://github.com/openstack/openstack-ansible-os_nova/blob/master/templates/nova.conf.j2#L6819:55
spatelmay be just to see token is in cache or not19:55
spatelhttps://skuicloud.wordpress.com/2014/11/15/openstack-memcached-for-keystone/19:57
spateli believe only keystone use memcache19:58
mgariepyi'm moving racks between 2 DC. 19:59
mgariepyi have 1 controller per rack..20:00
mgariepybut when i do stop a memcache server i do have a lot of issues with nova 20:00
spatelreally..? nova mostly uses it to check cached tokens i believe, to speed up the process and put less load on keystone20:02
mgariepyor it's the nova/neutron interaction that is failing.20:02
mgariepyit's somewhat of a pain really.20:03
spateli went through great pain so i can understand.. but trust me i don't think memcache can cause any issue to nova20:03
spateli never touch memcache in my deployment (you even reminded me today that we have memcache)20:03
mgariepywhen it's up it works. but if you shut one down then there are a lot of delays that start to appear20:04
spatelhmm i never tested shutting down memcached20:09
spatelare you currently down or just observing the behavior 20:09
mgariepyi do have 2 up out of 3.20:10
mgariepyNetworking client is experiencing an unauthorized exception. (HTTP 400)20:13
mgariepywhen doing openstack server list --all20:14
spatelhmm related to keystone / memcache issue20:15
mgariepyi did remove the memcache server that is down from keystone. restarted the processes and flushed the memcached on all nodes.20:25
spatelany improvement?20:26
mgariepynon20:26
mgariepyone in 6-7 calls ends with the error.20:27
mgariepyhmm.. openstack network list takes 1 sec, except once in a while it takes 4 sec.20:29
spatelthat is normal correct20:33
mgariepy1sec is ok but 4 doesn't seems ok ;)20:33
mgariepyit should be consistent i guess20:33
spatelin my case it take 2s20:34
mgariepyi'll dig more tomorrow.20:55
spatelPlease share your finding :)20:56
mgariepyi will probably poke you guys also tomorrow ! haha20:56
mgariepyMoving racks between 2 DC live seems to be a good idea at first ..20:57
mgariepylol20:57
spatelhehe you are tough 20:58
mgariepywell.. the DC are 1 block away. with dual 100Gb links between them .. :)20:58
mgariepyit's almost like a local relocation ahha20:59
spatelwhy do you want to split between two DC20:59
spatelassuming redundancy20:59
mgariepynop20:59
mgariepywe need to move out of dc1.. 20:59
mgariepyhaha20:59
spatel:)21:00
spatelhow many nodes do you have?21:00
jrosserthe services all have keystone stuff in their configs21:06
mgariepy~110 computes,21:06
jrosserand by extension they will be affected by loss of memcached21:06
mgariepyso i should just update the memcache config on all service ?21:07
jrosserit’s something like, when a service needs to auth against keystone it either gets a short circuit via a cached thing in memcached, or keystone has to do expensive crypto thing21:08
jrosserso all your services can be affected, and memcached also is not a cluster21:09
jrosserthere is a slightly opaque layer in oslo for having a pool of memcached servers21:10
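A hedged illustration of checking which memcached endpoints a service is pointed at and whether each answers; the config path and IP are examples only, and the option name differs between keystonemiddleware (memcached_servers) and oslo.cache (memcache_servers):
    # inside a service container, list the configured memcached endpoints
    grep -E 'memcached?_servers' /etc/nova/nova.conf
    # poke an endpoint to see whether it responds (repeat per endpoint)
    echo stats | nc -w1 172.29.236.100 11211 | head -n1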
mgariepyhmm ok i'll update all the services tomorrow then. but i did test that part a bit when i did ubuntu 16>18 upgrades on rocky21:10
mgariepybut it might be a bit different on V.21:11
jrosserI think we struggled during OS upgrade when the performance collapsed21:11
mgariepyi've seen keystone taking like 30s instead of 2s when a memcache was down21:12
mgariepybut never seen errors like i do see now.21:12
damiandabrowski[m]sorry for interrupting you guys but I'm confused by our haproxy config :D is there any point in setting `balance source` anywhere when by default we use `balance leastconn` + `stick store-request src`?21:13
jrosseri am not sure tbh21:21
damiandabrowski[m]thanks, i'll try to have a deeper look at this. But as far as I can see, we define `balance source` only for: adjutant_api, ceph_rgw, cloudkitty_api, glance_api, horizon, nova_console, sahara_api, swift_proxy, zun_console21:29
*** dviroel|ruck is now known as dviroel|out21:49
*** prometheanfire is now known as Guest223:49
