Monday, 2022-07-25

*** ysandeep is now known as ysandeep|afk07:49
jrossermorning08:46
noonedeadpunk\o/09:02
*** ysandeep|afk is now known as ysandeep09:43
damiandabrowskihey!09:49
jrosserdo we know what's happening with these MODULE FAILURES yet?10:13
jrosseri can reproduce it easily10:13
jrosserit is to do with the process that owns the control master socket exiting10:19
jrosseri messed with the controlpath a bit10:37
jrosserANSIBLE_SSH_CONTROL_PATH=/root/.ansible/cp/ansible-ssh-%%h-%%p-%%r10:37
jrosserthen if i run this in another terminal it makes the MODULE_ERRORS stop `watch -n 0.2 ssh -O check -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" 172.29.236.100`10:38
jrosser^ that command is enough to keep the controlmaster alive, and the tasks keep running, shortly after ^C on that command the playbook fails10:38
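A minimal Python sketch of the same keep-alive probe, for anyone who wants to script it rather than use `watch`. The helper names `control_check_cmd` and `keep_master_alive` are hypothetical; the only ssh features assumed are `-O check` and `-o ControlPath=...`, exactly as in the command above.

```python
import subprocess
import time


def control_check_cmd(host, control_path):
    """Build the `ssh -O check` command line used in the chat (hypothetical helper)."""
    return ["ssh", "-O", "check", "-o", "ControlPath=" + control_path, host]


def keep_master_alive(host, control_path, interval=0.2, rounds=5):
    """Probe the control master a few times and collect ssh's exit codes.

    Mirrors `watch -n 0.2 ssh -O check ...`; needs an ssh client on PATH.
    """
    codes = []
    for _ in range(rounds):
        proc = subprocess.run(
            control_check_cmd(host, control_path),
            capture_output=True,
            text=True,
        )
        codes.append(proc.returncode)
        time.sleep(interval)
    return codes


# Building the command is side-effect free, so it is safe to inspect.
cmd = control_check_cmd(
    "172.29.236.100", "/root/.ansible/cp/ansible-ssh-%h-%p-%r"
)
print(" ".join(cmd))
```

Note that the `%%` in the `ANSIBLE_SSH_CONTROL_PATH` value is the escaped form for Ansible's setting; ssh itself is given the single-`%` tokens.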
jrosserwhilst this is going on you can strace the ssh controlmaster process, and this happens when the playbook fails https://paste.opendev.org/show/bLt55zAHbiRdDYa63cNY/10:42
jrosserit returns ansible_facts (line 50), receives $something back (line 52) then does what looks like a pretty clean tidy up and exit, finally deleting the controlmaster socket10:44
mgariepyjrosser, so when pipelining is set to True, ansible does something else and does not keep the ssh alive?10:58
jrosserdoesn't pipelining just reduce the number of SSH that are required?10:59
jrosserssh <foo> <gigantic python command> rather than scp <foo> <python script> / ssh <foo> <run python script>11:00
mgariepyit clearly does something wrong on my tests. since when i set it to false the issue doesn't reproduce.11:01
noonedeadpunkyes, that is exactly what pipelining does: it transfers the python command instead of doing scp/ssh11:02
mgariepyunless it's our ssh config that does mess the run ? like the session expiration ?11:05
jrosserit would be nice to be able to get some verbose on the ssh command used at the ansible end11:05
jrosserbecause this is a looong time ago but feels kind of familiar http://lists.mindrot.org/pipermail/openssh-bugs/2015-July/014957.html11:06
jrosserfrom my strace we see that the last thing that the controlmaster does is unlink the controlpersist socket11:06
mgariepyWhen set to True it does reproduce on my laptop quite often.11:08
jrossermaybe there is a race there with the removal of the controlpersist socket when the timeout happens, but the client still thinks that the socket is good to use and doesn't handle retrying the ssh connection11:08
jrosserwhat was the stuff with some retry decorator last week? i kind of missed that11:08
mgariepycan it be the session logout ? 11:09
mgariepyhttps://github.com/ansible/ansible/blob/fbaea4c269b0a3c8112101754cee808d82bebbee/lib/ansible/plugins/connection/ssh.py#L116311:09
mgariepy-13 is the exit code for ssh.11:10
mgariepyi got to switch desk. i'll be back in an hour or so.11:13
jrosserhttps://github.com/ansible/ansible/blob/fbaea4c269b0a3c8112101754cee808d82bebbee/lib/ansible/plugins/connection/ssh.py#L38811:13
jrosserlike there is a whole class for handling this exact situation when the controlpersist socket closes11:13
mgariepybbl.11:15
anskiyI do remember running into some issues with pipelining (with another basic playbook, not OSA) some time ago. AFAIR, I've found some confirmation that it could be buggy, and so I keep it disabled.11:16
jrosserwell - looks like we totally run our own ssh command directly with no retries https://github.com/openstack/openstack-ansible-plugins/blob/master/plugins/connection/ssh.py#L50211:21
noonedeadpunkwell, we just overwrite original methods that should be covered with retries imo11:23
jrosserright yes so that should take the method from the base class11:24
noonedeadpunkand setting ansible.builtin.ssh issue would be the same 11:24
noonedeadpunk*with11:24
noonedeadpunkso that likely doesn't matter much11:24
jrosseris it the same without our plugin?11:24
noonedeadpunkit was for me, yes11:24
jrosserhmm ok11:25
noonedeadpunkbut you should try as well :)11:25
* jrosser goes to eat - bbl11:25
*** dviroel_ is now known as dviroel11:43
jrosserok so with ansible 2.12.7 it did this `fatal: [aio1]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: muxclient: master hello exchange failed\r\nkex_exchange_identification: read: Connection reset by peer", "unreachable": true}`12:05
mgariepyi'm back !12:10
*** ysandeep is now known as ysandeep|afk12:13
noonedeadpunkI have never seen that ^12:14
jrosseri am having much more difficulty making it fail with ansible-core installed in a venv12:24
jrosserthe other thing i can see is that with regular ansible the PID of the controlmaster process is rolling continuously, so the insides of ansible are handling the socket being deleted / recreated OK12:32
noonedeadpunkansible-core was failing as long as I sourced openstack-ansible.rc with commented out connection plugin override. And it had nothing to do with dynamic-inventory either at very least12:49
noonedeadpunkDamn, I really fail to understand ansible motivation to hack apt module that weirdly12:50
noonedeadpunk(wrt https://github.com/ansible/ansible/pull/78327)12:50
noonedeadpunkas it makes pinning non-persistent anyway, it will be applied only during runtime. And the next run for the next package can just update the recently installed one if it has it mentioned as a dependency12:51
noonedeadpunknot talking about unattended upgrades or anything like that12:52
noonedeadpunkit's so weird....12:52
mgariepythey wanted to be able to force a version from within the playbook. but they did fail to implement it correctly.12:52
mgariepyunless you are only building an app container image in that case you don't really care12:53
mgariepyalso it looks like duplicate from apt directly imo. apt does manage all those case i think.12:55
noonedeadpunkIt really does12:55
noonedeadpunkso what they could do is just add another module to manage preferences.d in a more convenient way12:56
noonedeadpunklike our apt-package-pinning does12:56
noonedeadpunkHuh, so does ppl build container images with ansible? :D12:56
mgariepyyeah and use the apt python module and manage error instead of implementing the logic in this one.12:56
mgariepynoonedeadpunk, maybe ?12:56
mgariepylol12:57
noonedeadpunkwell, we kind of do but it's quite different :D12:57
mgariepywe do system-container.12:57
mgariepybut you could also build app container if you really want i guess.12:57
noonedeadpunkbecause even for VM image that won't really work12:58
noonedeadpunkyeah, application maybe... I just thought that it's mostly smth like dockerfiles12:58
mgariepydoes tower make ansible more like puppet ?13:00
noonedeadpunkmeh, I tried to adopt AWX couple of times and it never had reall flight, as was abandoned quite fast13:01
mgariepyalso if you check the comments on the patch that totally break the apt module. it was not supposed to break preference.d config.13:01
noonedeadpunkwell, yes, and fix is basically to read that config and still provide version when it's not asked to do any of that13:02
noonedeadpunkwhich is smth I really fail to understand13:02
mgariepyisn't there a from apt import apt-just-like-you-would-in-cli ?13:03
noonedeadpunkthere is13:06
noonedeadpunkbut they do `pkg_list.append("'%s=%s'" % (name, version))` for some conditions I can hardly read13:07
cloudnullyo!13:25
mgariepyhey cloudnull it's been a while ! :D13:25
jrossernoonedeadpunk: mgariepy i think the problem is here https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/connection/ssh.py#L116513:26
jrosserbecause 255 != -1313:26
jrosserso it never retries13:26
* cloudnull just stood up an OSA cloud from master on Debian, using mostly baremetal, and it all worked perfectly. So I just wanted to say hi, thanks, and great work!13:26
jrosser\o/ awesome13:27
jrossernoonedeadpunk: i hacked it to check for -13 instead of 255 https://paste.opendev.org/show/bOlt6eL38SZerFVuqSLV/13:27
jrosserand you see it retry - without that it crash/burn when it gets -1313:27
cloudnullhow you been mgariepy jrosser?13:29
mgariepynot too bad yourself?13:29
cloudnulldoing great. chillin :D 13:30
jrosseryeah doing ok13:30
noonedeadpunk\o/13:30
*** ysandeep|afk is now known as ysandeep13:30
noonedeadpunkwell, that is intentional https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/connection/ssh.py#L17913:32
jrosserbut wrong, no?13:33
jrosserwe kind of show that the controlpersist socket closing results in -13, not 25513:33
noonedeadpunkthey don't consider negative though13:33
noonedeadpunkyes, totally. I think it would be valid bug.13:33
noonedeadpunkwith super trivial fix...13:34
jrosseri'm not really understanding why -13 though13:34
noonedeadpunkor not considering tests13:34
noonedeadpunkI actually tried to google for it but haven't found much13:35
jrosserunless it should retry on any failure13:35
jrosserit's not EPIPE13:35
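One possible reading of the -13, offered as a sketch rather than a confirmed diagnosis: if the value comes from Python's `subprocess` `returncode`, a negative number -N means the child was terminated by signal N on POSIX, and on Linux signal 13 is SIGPIPE (which is distinct from the EPIPE errno). The helper name `describe_returncode` is hypothetical.

```python
import signal


def describe_returncode(rc):
    """Explain a subprocess-style return code in plain words (hypothetical helper)."""
    if rc < 0:
        # subprocess reports "terminated by signal N" as returncode -N (POSIX only).
        return "killed by " + signal.Signals(-rc).name
    return "exited with status %d" % rc


print(describe_returncode(-13))
print(describe_returncode(255))
```

Under that assumption, 255 is ssh reporting its own error, while -13 never comes from ssh's exit status at all, which would explain why a check for `== 255` misses it.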
jrosseri'm still unhappy this doesnt reproduce so easily without our connection plugin13:35
noonedeadpunkit feels as a race that depends on execution time as well, and our connection plugin slows things down from what I saw13:37
noonedeadpunkand eventually our retry looks useless13:38
jrosseryeah i think we should get rid of that13:38
jrosseri've copied my debugging lines over to stock ansible venv and i see it retrying there as well now13:40
spatelcloudnull \o/13:42
mgariepyif i add  if p.returncode == 255 or p.returncode == -13: to the ssh.py file it does retry correctly13:44
mgariepyhttps://paste.openstack.org/show/bwdhz61ViT1vT8imKkfO/13:44
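The retry condition being tested here can be sketched in isolation. This is not Ansible's `_ssh_retry` implementation, only a stand-alone illustration of "retry when the return code is 255 or -13"; `RETRYABLE_CODES`, `run_with_retries` and the fake command are all hypothetical names.

```python
# Assumption from the chat: treat both 255 and -13 as "connection went away, try again".
RETRYABLE_CODES = (255, -13)


def run_with_retries(run_once, retries=3):
    """Call run_once() until it returns a non-retryable code or retries run out.

    Returns (last_return_code, attempts_used).
    """
    attempt = 0
    rc = None
    while attempt < retries:
        attempt += 1
        rc = run_once()
        if rc not in RETRYABLE_CODES:
            break
    return rc, attempt


# Fake command: fails with -13, then 255, then succeeds.
codes = iter([-13, 255, 0])
result = run_with_retries(lambda: next(codes))
print(result)
```

With a check for 255 only, the first -13 would be returned to the caller as a hard failure instead of triggering another attempt.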
jrosseri have a ton of debugging lines in and it seems our code takes quite a different path through _bare_run than if you run with stock ansible13:44
jrosserand with stock ansible i see the -13 being returned but it is handled13:45
mgariepyssh exits with the exit status of the remote command or with 255 if an error occurred.13:45
mgariepyit catches the @_ssh_retry ?13:46
jrosseri'm not sure right now13:46
jrossermost obviously, ansible 2.12.7 is mostly using this https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/connection/ssh.py#L91313:48
jrosserbut our plugin makes it go here instead https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/connection/ssh.py#L92613:49
mgariepybecause in_data is None. 13:52
mgariepyhttps://opendev.org/openstack/openstack-ansible-plugins/src/branch/master/plugins/connection/ssh.py#L42213:53
jrosseryeah, i just need to make sure i'm running the code i think i am here13:53
noonedeadpunkjrosser: I'm not sure, but isn't pty.openpty more the path without pipelining?13:54
jrosserit could be yes - it looks like there are many more iterations happening13:54
jrosserwhere is that controlled?13:54
noonedeadpunkhttps://github.com/ansible/ansible/blob/15750aec5265866ae46319cbfbb318e9eec0e083/lib/ansible/plugins/connection/ssh.py#L891-L89413:56
noonedeadpunkit's not where it's controlled, but at least it explains the path13:57
noonedeadpunkbut controlled with env var13:57
noonedeadpunkhttps://opendev.org/openstack/openstack-ansible/src/branch/master/scripts/openstack-ansible.rc#L5113:57
jrosserargh i have a meeting14:01
jrosserright i reproduced it with ansible-core venv with pipelining14:15
mgariepyif i don't set pipelining i can't reproduce it in osa.14:15
mgariepyif i don't set pipelining to True i can't reproduce it in osa.14:16
jrosserhttps://github.com/ansible/ansible/issues/7834414:39
mgariepycool thanks jrosser 14:53
mgariepyit was broken on older release as well.14:53
mgariepywe had the issue on xena 14:54
mgariepy2.11.something iirc14:54
jrossermake a comment :)14:56
*** ysandeep is now known as ysandeep|dinner15:04
mgariepydone15:05
jrossernoonedeadpunk: that ansible apt stuff - do you think it can deal with this https://github.com/openstack/openstack-ansible-openstack_hosts/blob/master/defaults/main.yml#L186-L18915:06
*** dviroel is now known as dviroel|lunch15:08
*** ysandeep|dinner is now known as ysandeep|out15:45
opendevreviewMerged openstack/openstack-ansible-ops master: Update MNAIO to use Ansible Collections  https://review.opendev.org/c/openstack/openstack-ansible-ops/+/85081716:06
opendevreviewMerged openstack/openstack-ansible-os_mistral master: Add release notes for mistral_api_use_uwsgi  https://review.opendev.org/c/openstack/openstack-ansible-os_mistral/+/84980416:13
*** dviroel|lunch is now known as dviroel16:26
mgariepyc9s again .. https://zuul.opendev.org/t/openstack/build/897cce3b5405487aa670b4dce05e3051/log/logs/host/rsyslog.service.journal-14-32-29.log.txt#15-1718:06
mgariepywtf https://zuul.opendev.org/t/openstack/build/897cce3b5405487aa670b4dce05e3051/log/logs/host/keystone-wsgi-public.service.journal-14-32-29.log.txt#1496018:15
jrosserit’s not string vs actual bool is it?18:28
mgariepyno idea18:29
jrosserbecause maybe ‘False’ != False18:29
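The string-versus-bool trap mentioned here is easy to show in plain Python. This is a generic illustration, not taken from the keystone log; `to_bool` is a hypothetical helper and the accepted spellings are an assumption.

```python
def to_bool(value):
    """Convert common string spellings to a real bool (hypothetical helper)."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("1", "true", "yes", "on")


setting = "False"  # e.g. a value that arrived as text from a config file

# The string is not equal to the bool, and any non-empty string is truthy.
print(setting == False)
print(bool(setting))
print(to_bool(setting))
```

So a check like `if setting:` takes the "enabled" branch for the text `False`, which is the kind of mismatch being suspected.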
mgariepyi'm just trying to find out why tempest fails on centos18:29
mgariepyhttps://zuul.opendev.org/t/openstack/build/897cce3b5405487aa670b4dce05e3051/log/logs/host/haproxy.service.journal-14-32-29.log.txt#178418:30
mgariepythis is also a bit troublesome.18:30
mgariepyi wonder if we are not a bit too aggressive on the keystone threads uwsgi config18:30
opendevreviewMarc Gariépy proposed openstack/openstack-ansible master: Set the number of threads for processes to 2  https://review.opendev.org/c/openstack/openstack-ansible/+/85094218:53
opendevreviewMarc Gariépy proposed openstack/openstack-ansible master: Set the number of threads for processes to 2  https://review.opendev.org/c/openstack/openstack-ansible/+/85094218:54
opendevreviewMarc Gariépy proposed openstack/openstack-ansible master: Set the number of threads for processes to 2  https://review.opendev.org/c/openstack/openstack-ansible/+/85094218:55
opendevreviewMerged openstack/openstack-ansible stable/victoria: Add mistra-extra repo  https://review.opendev.org/c/openstack/openstack-ansible/+/84952018:56
*** dviroel is now known as dviroel|afk20:17
*** melwitt_ is now known as melwitt21:10
