Thursday, 2019-05-16

*** hongbin has quit IRC00:01
*** BjoernT has quit IRC00:02
*** ttsiouts has joined #openstack-nova00:08
*** artom has joined #openstack-nova00:15
*** gyee has quit IRC00:36
*** xek has quit IRC00:37
*** dpawlik has joined #openstack-nova00:38
*** ttsiouts has quit IRC00:41
openstackgerritBrin Zhang proposed openstack/nova master: Cap sphinx for py2 to match global requirements  https://review.opendev.org/65920100:42
*** dpawlik has quit IRC00:42
*** ricolin has joined #openstack-nova00:50
openstackgerritBrin Zhang proposed openstack/nova-specs master: Cap sphinx for py2 to match global requirements  https://review.opendev.org/65920500:53
*** dpawlik has joined #openstack-nova00:54
*** awalende has joined #openstack-nova00:57
*** dpawlik has quit IRC00:59
openstackgerritTetsuro Nakamura proposed openstack/nova stable/stein: Skip _exclude_nested_providers() if not nested  https://review.opendev.org/65920600:59
*** d34dh0r53 has joined #openstack-nova01:01
*** awalende has quit IRC01:02
*** bbowen has quit IRC01:27
*** nicolasbock has quit IRC01:36
*** lbragstad has quit IRC01:42
*** guozijn has joined #openstack-nova01:48
openstackgerritTetsuro Nakamura proposed openstack/nova stable/rocky: Skip _exclude_nested_providers() if not nested  https://review.opendev.org/65920701:51
*** lbragstad has joined #openstack-nova01:51
*** BjoernT has joined #openstack-nova02:00
*** whoami-rajat has joined #openstack-nova02:06
*** cfriesen has quit IRC02:10
*** guozijn_ has joined #openstack-nova02:11
*** guozijn has quit IRC02:13
*** guozijn_ is now known as guozijn02:13
*** dklyle has joined #openstack-nova02:14
*** BjoernT has quit IRC02:17
*** BjoernT has joined #openstack-nova02:20
openstackgerritBoxiang Zhu proposed openstack/nova master: Fix failure to boot instances with qcow2 format images  https://review.opendev.org/64027102:22
*** JamesBenson has joined #openstack-nova02:26
openstackgerritMatt Riedemann proposed openstack/nova master: Update usage in RT.drop_move_claim during confirm resize  https://review.opendev.org/64180602:26
openstackgerritMatt Riedemann proposed openstack/nova master: Robustify attachment tracking in CinderFixtureNewAttachFlow  https://review.opendev.org/65890402:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add Migration.cross_cell_move and get_by_uuid  https://review.opendev.org/61401202:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add InstanceAction/Event create() method  https://review.opendev.org/61403602:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add Instance.hidden field  https://review.opendev.org/63112302:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add TargetDBSetupTask  https://review.opendev.org/62789202:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add CrossCellMigrationTask  https://review.opendev.org/63158102:26
openstackgerritMatt Riedemann proposed openstack/nova master: Execute TargetDBSetupTask  https://review.opendev.org/63385302:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add prep_snapshot_based_resize_at_dest compute method  https://review.opendev.org/63329302:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add PrepResizeAtDestTask  https://review.opendev.org/62789002:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add prep_snapshot_based_resize_at_source compute method  https://review.opendev.org/63483202:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add nova.compute.utils.delete_image  https://review.opendev.org/63760502:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add PrepResizeAtSourceTask  https://review.opendev.org/62789102:26
openstackgerritMatt Riedemann proposed openstack/nova master: Refactor ComputeManager.remove_volume_connection  https://review.opendev.org/64218302:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add power_on kwarg to ComputeDriver.spawn() method  https://review.opendev.org/64259002:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add finish_snapshot_based_resize_at_dest compute method  https://review.opendev.org/63508002:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add FinishResizeAtDestTask  https://review.opendev.org/63564602:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add Destination.allow_cross_cell_move field  https://review.opendev.org/61403502:26
openstackgerritMatt Riedemann proposed openstack/nova master: Support cross-cell moves in external_instance_event  https://review.opendev.org/65847802:26
openstackgerritMatt Riedemann proposed openstack/nova master: Add cross-cell resize policy rule and enable in API  https://review.opendev.org/63826902:26
openstackgerritMatt Riedemann proposed openstack/nova master: WIP: Enable cross-cell resize in the nova-multi-cell job  https://review.opendev.org/65665602:27
openstackgerritMerged openstack/nova master: Fix retry of instance_update_and_get_original  https://review.opendev.org/65884502:32
guilhermesphello! I'm currently running an upgrade N -> O. There is something weird going on when I run nova-manage cell_v2 map_instances <id>. It seems that nova_api.instance_mappings stays stuck for a while, incrementing slowly one by one, then after a huge amount of time it starts to re-map the instances in a faster way02:33
guilhermespthe deployment has 160k rows in nova.instances02:33
guilhermespwe currently re-mapped 80K instances, but sometimes the command finishes while there are still rows to remap02:34
guilhermespthe active instances of the deployment are currently getting 404s, which means they don't reside in the cell newly created during the deployment02:34
*** BjoernT has quit IRC02:35
guilhermespwe are sometimes adding --max-count values like 50k or 100k to the command and trying to track the remapping by watching the rows in nova_api.instance_mappings02:35
guilhermespdoes anyone know what could be the trick in that scenario02:35
openstackgerritMerged openstack/nova master: Cap sphinx for py2 to match global requirements  https://review.opendev.org/65920102:35
guilhermespwe have a staging env where this didn't happen02:36
guilhermespbut we used to have around 1.8k rows in nova.instances02:36
guilhermespmaybe the amount of records in the other env could be a possible reason?02:37
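For reference, a hedged sketch of the batched invocation under discussion (the cell UUID is a placeholder; on most releases the flag is spelled --max-count, and the command keeps a marker row in nova_api.instance_mappings so each run resumes where the last one stopped):

    # map the next batch of 50000 instances for the given cell; rerun until
    # no unmapped instances remain (recent releases exit non-zero while
    # instances are still left to map)
    nova-manage cell_v2 map_instances \
        --cell_uuid <cell_uuid> \
        --max-count 50000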
*** ttsiouts has joined #openstack-nova02:38
*** lbragstad has quit IRC02:40
*** lbragstad has joined #openstack-nova02:43
*** dpawlik has joined #openstack-nova02:55
*** dpawlik has quit IRC03:00
*** dpawlik has joined #openstack-nova03:11
*** ttsiouts has quit IRC03:11
*** hongbin has joined #openstack-nova03:14
*** dpawlik has quit IRC03:15
*** licanwei has joined #openstack-nova03:26
*** udesale has joined #openstack-nova03:53
*** hongbin has quit IRC03:55
*** JamesBenson has quit IRC04:01
*** janki has joined #openstack-nova04:43
*** bhagyashris has joined #openstack-nova04:46
*** guozijn has quit IRC04:47
bhagyashrissean-k-mooney: Hi,04:47
*** _d34dh0r53_ has joined #openstack-nova04:55
*** guozijn has joined #openstack-nova04:56
*** lbragstad has quit IRC04:59
*** _d34dh0r53_ has quit IRC05:08
*** ttsiouts has joined #openstack-nova05:09
*** _d34dh0r53_ has joined #openstack-nova05:11
*** dpawlik has joined #openstack-nova05:12
*** dpawlik has quit IRC05:17
*** ratailor has joined #openstack-nova05:21
*** dpawlik has joined #openstack-nova05:28
*** Luzi has joined #openstack-nova05:40
*** ttsiouts has quit IRC05:41
*** ivve has quit IRC05:41
*** hamzy has quit IRC05:44
*** udesale has quit IRC06:05
*** udesale has joined #openstack-nova06:06
*** hamzy has joined #openstack-nova06:21
*** ccamacho has joined #openstack-nova06:31
*** trondham has joined #openstack-nova06:33
*** pmannidi has joined #openstack-nova06:34
*** udesale has quit IRC06:51
*** udesale has joined #openstack-nova06:52
*** xek has joined #openstack-nova06:54
*** ivve has joined #openstack-nova06:58
*** rcernin has quit IRC07:01
*** udesale has quit IRC07:02
*** openstackgerrit has quit IRC07:03
*** udesale has joined #openstack-nova07:03
*** tesseract has joined #openstack-nova07:05
*** jaosorior has quit IRC07:06
*** pcaruana has joined #openstack-nova07:13
*** guozijn_ has joined #openstack-nova07:15
*** rpittau|afk is now known as rpittau07:17
*** guozijn has quit IRC07:18
*** guozijn_ is now known as guozijn07:18
*** openstackgerrit has joined #openstack-nova07:20
openstackgerritTakashi NATSUME proposed openstack/nova stable/stein: Fix assert methods in unit tests  https://review.opendev.org/65770807:20
*** xek has quit IRC07:21
*** helenafm has joined #openstack-nova07:21
openstackgerritTakashi NATSUME proposed openstack/nova stable/rocky: Fix assert methods in unit tests  https://review.opendev.org/65770907:21
*** ralonsoh has joined #openstack-nova07:23
openstackgerritzhangboye proposed openstack/nova-specs master: Cap sphinx for py2 to match global requirements  https://review.opendev.org/65947007:24
kashyapweshay: Was asleep when you pinged.07:26
* kashyap reads the scroll07:28
*** tssurya has joined #openstack-nova07:31
*** udesale has quit IRC07:36
kashyapweshay: I see that melwitt (thanks) has already investigated the issue and posted a patch to skip logging the trace: https://review.opendev.org/#/c/659374/07:37
*** udesale has joined #openstack-nova07:37
*** ttsiouts has joined #openstack-nova07:39
*** slaweq has joined #openstack-nova07:40
*** awalende has joined #openstack-nova07:49
*** tesseract has quit IRC07:52
*** tesseract has joined #openstack-nova07:52
*** ttsiouts has quit IRC07:57
openstackgerritGorka Eguileor proposed openstack/nova master: Fix doc requirements  https://review.opendev.org/65948408:05
*** maciejjozefczyk has joined #openstack-nova08:07
openstackgerritGorka Eguileor proposed openstack/nova master: Use os-brick locking for volume attach and detach  https://review.opendev.org/61419008:07
*** awalende_ has joined #openstack-nova08:15
*** ttsiouts has joined #openstack-nova08:16
*** awalend__ has joined #openstack-nova08:18
*** awalende has quit IRC08:19
*** awalende_ has quit IRC08:20
*** awalende has joined #openstack-nova08:21
*** awalend__ has quit IRC08:24
*** cdent has joined #openstack-nova08:25
*** brinzh has joined #openstack-nova08:26
*** brinzhang has quit IRC08:29
*** tkajinam has quit IRC08:32
*** derekh has joined #openstack-nova08:38
*** dtantsur|afk is now known as dtantsur08:50
*** sridharg has joined #openstack-nova08:53
*** rabel has quit IRC09:03
openstackgerritArnaud Morin proposed openstack/nova master: Always Set dhcp_server in network_info  https://review.opendev.org/65836209:10
*** panda is now known as panda|rover09:17
*** brinzh has quit IRC09:19
*** ttsiouts has quit IRC09:20
*** brinzh has joined #openstack-nova09:20
bhagyashrisstephenfin:  Hi, just want to ask about reserved_host_cpus: since we are setting the reserved value of CPUs to 0 for both PCPU and VCPU, how will the reservation be maintained? I mean, there should be some CPUs reserved for host processes09:28
sean-k-mooneybhagyashris: that is achieved by not listing them in the cpu_shared_set or the cpu_dedicated_set09:32
sean-k-mooneyso on an 8 core system you can do cpu_shared_set=2,3 cpu_dedicated_set=4-7 and that leaves cores 0 and 1 for the host09:33
sean-k-mooneythis is how you reserve cores for the host using vcpu_pin_set today09:34
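A hedged nova.conf sketch of the 8-core example above (option names as used by the cpu-resources work being discussed; with vcpu_pin_set the same effect is achieved by listing only cores 2-7):

    [compute]
    # cores 0 and 1 appear in neither option, so they are left to the host
    cpu_shared_set = 2,3
    cpu_dedicated_set = 4-7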
bhagyashrissean-k-mooney: that means the operator should keep in mind that some CPUs should be left for host processes while setting the cpu_dedicated_set and cpu_shared_set09:34
sean-k-mooneyyes09:34
bhagyashrisokay09:34
bhagyashrissean-k-mooney: thank you for clearing my doubt.09:34
sean-k-mooneyalthough today reserved_host_cpus defaults to 0 and by default vcpu_pin_set, if unset, enables all host cpus09:35
sean-k-mooneyso if they dont do that it wont be a change in behavior09:35
bhagyashrissean-k-mooney: okay, but suppose if operator assign all the cpus to the dedicated set then what will happen in that case, host process won't have any cpu to process the host process.09:38
sean-k-mooneythe hosts will steal cpu time form the pinned instances09:38
sean-k-mooneybut again that is what happens today if you dont ommit cores for the host to use form the vcpu pin set09:39
*** takashin has left #openstack-nova09:40
bhagyashrisokay, so should that case be handled by code or by the operator then09:41
sean-k-mooneyby the operator09:41
bhagyashrisokay09:41
bhagyashrisThank you :)09:42
*** bhagyashris has quit IRC09:43
*** pmannidi has quit IRC10:05
openstackgerritArnaud Morin proposed openstack/nova master: Always Set dhcp_server in network_info  https://review.opendev.org/65836210:08
*** awalende_ has joined #openstack-nova10:11
*** jaosorior has joined #openstack-nova10:12
*** awalende has quit IRC10:12
*** mdbooth_ is now known as mdbooth10:24
mdboothAny reason https://review.opendev.org/658845 shouldn't be backported to stable?10:25
* mdbooth has posted the backports to Queens and they're all clean10:25
*** pcaruana has quit IRC10:27
*** xek has joined #openstack-nova10:27
*** mmethot has quit IRC10:29
*** guozijn has quit IRC10:37
*** ricolin has quit IRC10:39
*** davidsha has joined #openstack-nova10:40
*** tbachman has quit IRC10:50
*** luksky has joined #openstack-nova10:52
*** nicolasbock has joined #openstack-nova10:57
*** panda|rover is now known as panda|rover|eat11:04
*** bbowen has joined #openstack-nova11:11
*** udesale has quit IRC11:15
*** ttsiouts has joined #openstack-nova11:17
*** mmethot has joined #openstack-nova11:19
*** guozijn has joined #openstack-nova11:24
*** yan0s has joined #openstack-nova11:25
*** guozijn has quit IRC11:33
*** pcaruana has joined #openstack-nova11:37
*** dtantsur is now known as dtantsur|brb11:46
*** ttsiouts has quit IRC11:49
*** ttsiouts has joined #openstack-nova11:49
*** florius has joined #openstack-nova11:51
*** rabel has joined #openstack-nova11:58
rabelhi. there seems to be a problem with api microversions. i have an openstack rocky (deployed with packstack), but my openstackclient (newest version, directly from git) tells me that the compute api version is 2.111:59
*** florius has left #openstack-nova12:01
artomrabel, it's a known osc behaviour/bug. While novaclient autonegotiates to the greatest common supported microversion, osc defaults to 2.1, and you have to manually specify which microversion you want12:02
*** tbachman has joined #openstack-nova12:06
*** ttsiouts has quit IRC12:06
*** panda|rover|eat is now known as panda12:06
*** tbachman has quit IRC12:10
*** panda is now known as panda|rover12:11
rabelartom thanks12:13
*** tbachman has joined #openstack-nova12:13
*** awalende has joined #openstack-nova12:27
*** janki has quit IRC12:28
*** awalende_ has quit IRC12:29
*** ttsiouts has joined #openstack-nova12:38
*** tetsuro has joined #openstack-nova12:41
*** tetsuro has quit IRC12:41
*** ttsiouts has quit IRC12:43
alex_xuefried: I think the in_tree takes care of all the cases for https://review.opendev.org/#/c/645520/ also12:56
*** ccamacho has quit IRC12:56
*** ccamacho has joined #openstack-nova12:57
*** ratailor has quit IRC12:58
*** mchlumsky has joined #openstack-nova12:59
*** udesale has joined #openstack-nova13:02
*** mchlumsky has quit IRC13:06
*** mchlumsky has joined #openstack-nova13:08
efriedalex_xu: Okay. I'll take a closer look at that patch soon.13:08
alex_xuefried: thx13:08
*** lbragstad has joined #openstack-nova13:09
*** itlinux has quit IRC13:10
*** ttsiouts has joined #openstack-nova13:12
*** mriedem has joined #openstack-nova13:15
openstackgerritMatt Riedemann proposed openstack/nova master: Stop logging traceback when skipping quiesce  https://review.opendev.org/65937413:16
*** dtantsur|brb is now known as dtantsur13:18
*** artom has quit IRC13:18
*** ttsiouts has quit IRC13:19
*** ttsiouts has joined #openstack-nova13:20
*** tbachman has quit IRC13:20
*** wwriverrat has quit IRC13:22
mriedemdansmith: random idea, but with something like https://bugs.launchpad.net/nova/+bug/1829349 where the RT is constantly failing b/c of conflicting pinned CPUs on the same host, we could leverage the build failure stats thing and if the RT fails after x hits or whatever we try to take that host out of scheduling13:23
openstackLaunchpad bug 1829349 in OpenStack Compute (nova) "Resource Tracker failed to update usage in case numa topology conflict happen" [Undecided,New]13:23
mriedemthey did the naughty and enabled live migration for numa instances, 2 landed on the same host with conflicting pinned CPUs, and now the RT constantly fails but the scheduler continues to pick the host b/c it looks free since its usage isn't getting reported13:24
dansmithoye13:25
dansmithisn't the build failure count going up in that case anyway?13:25
mriedemmaybe, i can ask in the bug report13:25
sean-k-mooneymriedem: did we fix the typo so that that config option works yet13:26
* sean-k-mooney goes to check13:26
mriedemthe qemu vs QEMU thing?13:27
mriedemit wouldn't matter,13:28
*** xek_ has joined #openstack-nova13:28
mriedembecause in this case they enabled the workaround to allow live migrating instances with numa13:28
sean-k-mooneyya we were checking for hypervisor = kvm but it's always qemu13:28
sean-k-mooneyyes we fixed it https://review.opendev.org/#/c/635350/13:29
sean-k-mooneywe never actually released with it broken so we should be good13:29
mriedemwell, this person is on rocky13:29
mriedembut yes https://review.opendev.org/#/q/I5127227a1e3d76dd413a820b048547ba578aff6b13:29
*** xek has quit IRC13:30
*** tbachman has joined #openstack-nova13:31
sean-k-mooneymriedem: i would have hoped the numa topology filter would have prevented new vms from being scheduled, by the way, if there was not space13:31
sean-k-mooneythe one that was migrated would not be fixed but ya13:31
aarentsHi, mdbooth I updated https://review.opendev.org/#/c/659054/ if you want to have a look, maybe also mriedem ?13:32
sean-k-mooneywhich reminds me i have a meeting, then i need to update the docs for the sriov migration13:32
mdboothaarents: Thanks, still +1.13:34
adriancsean-k-mooney: Hi, id like to see if we can converge on the FUP for SR-IOV LM, would be great if you can take a quick look :)13:34
*** brinzh has quit IRC13:41
sean-k-mooneyadrianc: sure i think this was more or less fine so ill take a look at the latest version13:41
adriancsean-k-mooney, i did not upload a new PS as i wanted to reach an agreement on the property thing, anyway... you will see13:44
*** Luzi has quit IRC13:45
*** takashin has joined #openstack-nova13:49
*** mgariepy has quit IRC13:55
efriedNova meeting in 4 minutes in #openstack-meeting13:56
*** pcaruana has quit IRC13:56
*** artom has joined #openstack-nova13:57
mriedemtssurya: i'm going to remove the locked reason stuff from the runways etherpad since the nova and novaclient stuff is done https://etherpad.openstack.org/p/nova-runways-train14:02
*** JamesBenson has joined #openstack-nova14:02
*** mgariepy has joined #openstack-nova14:03
*** JamesBenson has quit IRC14:07
*** UdayTKumar has joined #openstack-nova14:07
efriedsean-k-mooney: Last week's nova meeting we got a request to review14:09
efriedhttps://review.opendev.org/#/c/643023/14:09
efriedhttps://review.opendev.org/#/c/643024/14:09
efriedwhich it looked like you were actively looking into. Would you be able to follow up please?14:09
sean-k-mooneyyes14:10
tssuryamriedem: ack14:10
*** alex_xu has quit IRC14:10
efriedthanks sean-k-mooney14:10
*** JamesBenson has joined #openstack-nova14:14
*** JamesBenson has quit IRC14:18
sean-k-mooneyadrianc: just replied on https://review.opendev.org/#/c/659101/2/nova/objects/instance_pci_requests.py@4714:18
adriancthx sean-k-mooney, ill take a look shortly14:19
sean-k-mooneyim going to go review https://review.opendev.org/#/q/topic:bug/1809095+(status:open+OR+status:merged) before my next meeting14:20
sean-k-mooneywhich is your fix for bug #180909514:21
openstackbug 1809095 in OpenStack Compute (nova) "Wrong representor port was unplugged from OVS during cold migration" [Medium,In progress] https://launchpad.net/bugs/1809095 - Assigned to Adrian Chiris (adrian.chiris)14:21
*** mlavalle has joined #openstack-nova14:22
*** itlinux has joined #openstack-nova14:27
*** itlinux has quit IRC14:31
mriedembtw, i remember mtreinish trying to get glance-api moved to wsgi in pike and it got hung up on threading stuff with eventlet, hence https://review.opendev.org/#/c/549743/14:32
dansmithyup14:32
mriedemoh and https://review.opendev.org/#/c/531498/14:34
mriedem^ taskflow evenlet async stuff + wsgi + glance-api14:35
*** d34dh0r53 has quit IRC14:36
mtreinishyep, but I couldn't get anyone to ever really review that or explain how to use the async glance APIs. And there is no dsvm coverage for any of it14:36
*** d34dh0r53 has joined #openstack-nova14:37
*** pcaruana has joined #openstack-nova14:37
mriedemmtreinish: for context we're going over similar in nova-api https://review.opendev.org/#/c/650172/14:37
cdentI reckon the root of the problems in both glance and nova is that we are trying to do async work in an environment (an HTTP API) where it doesn't make sense. The traditional python solution to this problem is to dump into an offboard async task queue (celery and the like). The scatter/gather thing is not quite in the same boat as the glance stuff (which is more amenable to the celery thing perhaps)14:41
*** d34dh0r53 is now known as Guest2890414:41
mtreinishcdent: yeah, that was the best way to fix it, to just do the async bits in another service14:46
mtreinishbut I assumed just launching another process or thread to handle it should work as a quick workaround14:46
cdentit ought to14:46
cdentbut I would guess that zombie blocked processes would present the same concern as blocked threads14:47
cdentwhich might suggest: that's the way unix goes, deal...14:47
*** JamesBenson has joined #openstack-nova14:51
sean-k-mooneyi mean we could just not set a limit on the worker threads in the thread pool or make it a config option14:53
sean-k-mooneypython will grow and shrink the thread pool automatically if you dont specify the number of workers14:53
dansmiththat's the problem, these are a threadpool, which means they're re-used and capped at some max number, so enough api requests come in during a cell down event, and you immediately exhaust the pool of workers until things time out14:53
dansmithsean-k-mooney: only up to N*CPUs I think I read14:54
dansmithwhere N is like 3 or something14:54
sean-k-mooneydansmith: maybe not sure14:54
sean-k-mooneybut we could leave it as a config option and default to -1 meaning let python figure it out14:55
sean-k-mooneyor set it to X if you care and have measured for your system14:55
mriedemhttps://docs.python.org/3.6/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor14:55
mriedem"If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor."14:55
dansmithright, so with 5 cells, that becomes $CPUs14:56
dansmithand everyone has at least two14:56
cdentwhen calculating the number of threads that will be around, another thing to keep in mind is mod_wsgi will often run with multiple processes and multiple threads in those processes. what is the pool associated with?14:57
cdentevery request or thread or process?14:57
* cdent has lost track of the actual code, reviews14:58
dansmithidk14:58
cdentprocess14:58
sean-k-mooney so we could revert to creating the thread pool per call and set it to one thread per cell14:59
sean-k-mooneybut the issue with that is (a) it's expensive to spawn threads and (b) we no longer have a limit on them15:00
*** ratailor has joined #openstack-nova15:00
sean-k-mooneybut we will not have the issue of a potential exhaustion of the thread pool15:00
sean-k-mooneycdent: the pool is a module global in the current patch so it would be associated with the process15:01
*** takashin has left #openstack-nova15:02
cdentsean-k-mooney: [t 6MVa]15:03
purplerbot<cdent> process [2019-05-16 14:58:44.001687] [n 6MVa]15:03
*** artom has quit IRC15:03
sean-k-mooney?15:03
cdent[l 6MVa]15:04
purplerbothttp://p.anticdent.org/6MVa15:04
cdentin other words, I figured it out by looking at the code a few minutes ago15:04
sean-k-mooneyoh right :)15:04
mriedemefried: on that ironic job failure, it looks like that job has probably been broken since stein when we started reporting capability traits for nodes http://logs.openstack.org/32/634832/29/check/ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa/fba9197/controller/logs/devstacklog.txt.gz#_2019-05-16_03_28_21_59015:06
janguttercdent: I really thought you were mistyping 6 million volt-ampere.15:06
mriedemif i'm reading that correctly it's expecting only the CUSTOM_GOLD trait15:06
sean-k-mooneydansmith: what was the behavior that motivated the scatter-gather code in the first place, do you remember? was it to deal with down cells or was it a performance improvement?15:06
dansmithsean-k-mooney: it's to avoid a sequential walk of all the cell databases, serializing their latencies15:07
sean-k-mooneyah ok so it was to address performance, ok15:07
cdentAre there numbers on those latencies?15:07
dansmithsean-k-mooney: cern has 70 cells, some of which are remote15:07
cdentah. the remote part15:08
dansmithas in, in another country15:08
efriedmriedem: looks like the fix is just to take out the square brackets, nah?15:08
sean-k-mooneyya even if they were all in the same data center it could be a problem as the number of cells grows15:08
dansmithsean-k-mooney: especially if one in the middle is timing out, and we wait for that to complete before moving to the next one and get its response15:09
dansmithand everyone has two of these at minimum so moving to sequential means 2N latency, which is twice as bad, regardless of what N is15:09
sean-k-mooneyi was just wondering if the initial approach to solving the problem was painting us into a corner that could be avoided by revisiting the original use case, cool15:10
sean-k-mooneyeffectively what we want to do is queue up a bunch of async requests and wait for them to complete15:11
dansmith..a thing for which eventlet is ideal15:11
sean-k-mooneyeventlet and thread pools both allow that with different tradeoffs15:11
*** awalende has quit IRC15:11
sean-k-mooneythe issue isn't the use of eventlet itself15:11
*** awalende has joined #openstack-nova15:12
sean-k-mooneyit's that mod_wsgi and uwsgi have both exposed different bugs caused by using eventlet with them15:12
*** cfriesen has joined #openstack-nova15:12
*** ivve has quit IRC15:13
sean-k-mooneydoes wsgi provide a way to run requests like this in the background that we could leverage, and fall back to eventlet when not running under wsgi?15:15
*** artom has joined #openstack-nova15:15
*** awalende_ has joined #openstack-nova15:15
sean-k-mooneygoogle found me https://pypi.org/project/uwsgi-tasks/ but no idea if that would help15:16
*** awalende has quit IRC15:16
cdentthat could work for the uwsgi situation, but not mod_wsgi, presumably15:18
*** luksky has quit IRC15:19
sean-k-mooneyi dont know, i personally would prefer to try melwitt's approach first and maybe ask cern to try it and get feedback15:19
*** JamesBenson has quit IRC15:19
*** awalende_ has quit IRC15:20
*** JamesBenson has joined #openstack-nova15:20
cdentsean-k-mooney: you mean the current code under review? that seems reasonable. I guess the issue is that we don't know if the problem is a problem until it gets hit...15:20
sean-k-mooneycdent: right15:20
cdentmriedem seemed to think it could be caused with some effort in devstack15:21
sean-k-mooneyand yes the current code under review15:21
efriedmriedem: I'm going to propose that fix unless you've arleady got it in the pipe15:21
mriedemefried: i don't, i'm trying to recreate with bash locally15:23
efriedack15:23
mriedemcdent: i know how to manually test multiple cells in a single node devstack and make one or more down, but that doesn't mean i know how to recreate the issue15:24
mriedemwe haven't hit this problem in the gate as far as i know with nova-api under wsgi + 2 cells15:24
mriedemwe also now have the nova-multi-cell job which is effectively 3 cells15:24
mriedemcell0/1/215:25
melwittimacdonn: do you happen to know why/how the uwsgi/mod_wsgi problem doesn't seem to show itself in the gate?15:26
cdentas I remember (which may be wrong) the problem with eventlet needs some work, a long pause, and some attempted work to show up. And the expected problem with threads requires connections to databases (or other I/O) that block weirdly and don't time out in natural ways15:26
mriedemefried: removing the brackets gives me this:15:26
mriedemjq: error (at <stdin>:1): array (["COMPUTE_N...) and string ("CUSTOM_GOLD") cannot have their containment checked15:26
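For context on that error, a hedged aside (not the actual job script): jq's contains() needs both sides to be the same type, so checking an array of traits against a bare string fails, while an array argument or index() works:

    # error: array and string cannot have their containment checked
    echo '["COMPUTE_NET_ATTACH","CUSTOM_GOLD"]' | jq 'contains("CUSTOM_GOLD")'
    # true (array-vs-array contains() does per-element substring matching)
    echo '["COMPUTE_NET_ATTACH","CUSTOM_GOLD"]' | jq 'contains(["CUSTOM_GOLD"])'
    # true (exact membership check)
    echo '["COMPUTE_NET_ATTACH","CUSTOM_GOLD"]' | jq 'index("CUSTOM_GOLD") != null'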
cdentmelwitt: I reckon it's the lack of pause15:26
*** ccamacho has quit IRC15:26
melwittaye15:26
efriedmriedem: ugh, I guess it would help me to know wtf jq is.15:27
mriedemefried: json parser15:27
mriedemefried: i considered just re-writing this with osc-placement and grep15:27
efriedseems like that would be better.15:27
mriedemopenstack --os-placement-api-version 1.6 resource provider trait list $provider_id | grep $trait15:27
mriedemit would definitely be easier to f'ing read15:27
efriedagreed15:27
mriedemi'll push that up15:28
efriedight15:28
efriedmriedem: I made a story15:28
mriedemthis would also suggest all ironic jobs would be busted since stein and if that were the case i'd think we would have heard about it by now...15:28
efriedmriedem: https://storyboard.openstack.org/#!/story/200572515:28
efriedyou would think, yah15:28
mriedemcdent: melwitt: dansmith: you don't think it's a simple as adding an https://eventlet.net/doc/modules/greenthread.html#eventlet.greenthread.sleep between calls in the scatter gather do you?15:29
imacdonnmelwitt: I don't know enough about what happens in the gate ... but the problem manifests when nova-api does something that requires an API call, then goes quiet for more than a minute or so, then gets another request that requires RPC15:30
dansmiththe problem is that we're not monkeypatched properly such that background threads aren't being run and thus the heartbeat for the rabbit connection isn't working15:30
dansmithwhich seems like it should manifest in the gate easily if it's fundamental to how we're doing service setup, and which also means other things shouldn't really be working15:31
dansmithso I'm kinda suspicious of the whole thing, and don't really want to "solve" it by changing our whole threading model for one service15:31
dansmithbut I don't really have time to dig deep, as I said15:31
cdentdansmith: I thought the "more than a minute or so" was a critical factor as it is resulting in mod_wsgi doing $something to the python process that breaks the monkey patching15:32
cdentbut I have to admit that finding the sensible thread through all this is hard so I'm not sure if I'm remembering the right stuff15:32
mriedemgoing quiet for a minute or so is something that wouldn't happen in the gate yeah15:33
imacdonncdent: rabbitmq, in a typical configuration, will drop a connection if it has not seen heartbeats for over a minute15:33
mriedemtempest is all hammer until it's done and then we teardown15:33
imacdonncdent: that's what leads to a "connection reset by peer" when nova-api wakes up again and tries to use the connection that it thinks is still healthy15:33
dansmithso you think we're just not running hearbeat background tasks if no api requests are coming in?15:33
cdentit does sound a bit that way15:33
imacdonnI have observed that with debug logging15:34
dansmithseems like that should be easy to reproduce in a devstack no?15:34
mriedemcould we maybe tickle this in a post test script after tempest is done, sleep a minute or something, and then try to create or do something to a server, like stop it?15:34
*** helenafm has quit IRC15:35
kashyapefried: Thanks for representing on #openstack-meeting (still haven't caught up).  Stuck in consecutive calls, afraid.15:35
dansmithmriedem: or just a devstack15:35
mriedemyeah..i will leave devstack running for many more minutes before coming back to it to do something15:35
*** ttsiouts has quit IRC15:35
dansmithchanging the whole threading model for this seems like a big hammer15:35
mriedemalso note that vexxhost is running stein and hasn't reported this being a problem...15:36
cdentmriedem: is your devstack using mod_wsgi, by default it won't be15:36
mriedemcdent: oh no i use the default (uwsgi)15:36
cdentand it's not clear the problem is the same uwsgi as mod_wsgi15:36
cdentit _might_ be15:36
dansmithI thought imacdonn said that it repros the same in uwsgi?15:36
cdentbut since the only people we've heard of this problem from are tripleo, it's not clear15:36
melwittI think imacdonn said it behaves the same uwsgi and mod_wsgi, IIRC15:37
imacdonnI have easily reproduced the exact symptoms (including absense of heartbeats with debug logging) with both mod_wsgi and uWSGI15:37
cdentah, okay15:37
cdentcool15:37
imacdonnand I'm not tripleo, FWIW15:38
dansmith...said with beaming pride :)15:38
*** _d34dh0r53_ is now known as d34dh0r5315:38
imacdonnheh15:38
cdentso is the other way around the problem to catch the reset by peer and reopen? because expecting an idle web service to make rpc heartbeats is ... odd15:38
cdentimacdonn: sorry, like I said, it's been hard to keep track of the details on this one15:38
dansmithyeah15:38
dansmithso that's the other thing,15:38
*** ricolin has joined #openstack-nova15:38
dansmithwhy is the connection not just immediately re-youknow-connected?15:38
dansmithbecause that would be a lot more sane15:39
* cdent nods15:39
*** macza has joined #openstack-nova15:39
imacdonnthe whole point of heartbeats is to keep connections alive when there's no traffic15:39
imacdonnfor example, a firewall idle timeout could bite you otherwise15:39
cdentsure, but a web server process may not even exist in some deployment scenarios when there are no requests15:39
cdentwho is going to send the heartbeat?15:40
dansmithimacdonn: but only really for things where you might be expecting to hear something from the other side15:40
dansmithimacdonn: the api server receives no messages other than replies,15:40
imacdonnthe wsgi app container, I guess? and if that doesn't exist, there's no problem ... a new one would start a new connection15:40
dansmithso there's really no reason for a connection except when it's processing a request15:40
imacdonnI guess I might have assumed that setting up a new connection incurs some expense ... so it's better to keep it open once you have it15:41
dansmithopen connections that aren't used are pretty expensive in rabbit terms ;)15:42
imacdonnkindof like the REST concept15:42
imacdonnre. "things where you might be expecting to hear something from the other side" - I wonder if there are cases where that could break - maybe an async RPC call that takes a long time to complete ?15:43
dansmithimacdonn: in that case you're processing a request still15:44
dansmithwhat I don't understand is, if the wsgi container has spawned a process to do a thing, and has not killed it off during times of inactivity, what is it doing that prevents the normal threading background stuff from running like any other process?15:44
dansmithunless we've tickled something in the python threading library before patching or whatever, but if that's the problem that's the kind of thing we need to know15:45
dansmithI'd be interested in what the differences are between you and everyone else.. same exact (unmodified) nova, oslo, et al code? same version of deps, etc?15:46
dansmiththese are the things worth chasing, imho15:46
imacdonndo we know of anyone running stein nova-api on wsgi successfully ?15:46
imacdonncos I only know of the other guys who are seeing the same issue as me15:47
cdentit's always running on wsgi, whether you're using nova-api script or using mod_wsgi or uwsgi. the distinguishing factor is standalone eventlet server versus everything else15:47
*** pcaruana has quit IRC15:47
imacdonndansmith: see also https://bugs.launchpad.net/nova/+bug/1825584/comments/7 - I haven't really tried to digest that yet15:48
openstackLaunchpad bug 1825584 in OpenStack Compute (nova) "eventlet monkey-patching breaks AMQP heartbeat on uWSGI" [Undecided,New]15:48
sean-k-mooneycdent: is it running on wsgi when using the console script15:48
sean-k-mooneyi did not think that was the case15:48
cdentyes, wsgi is the protocol by which a web request is translated into an environ dict that is passed to a __callable__ handler with a start_response function15:49
cdentit's the reason why it is possible to run what amounts to the same code with the eventlet server based console script and "everything else"15:50
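A minimal sketch of what that means in practice (a generic WSGI app, not nova's actual entry point):

    # the server (eventlet wsgi, uwsgi, mod_wsgi, ...) builds environ from the
    # HTTP request and calls this with a start_response callable
    def application(environ, start_response):
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'hello\n']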
dansmithimacdonn: okay if it's stopping the interpreter, why does using another real thread change it? does it just pause the main thread's interpreter and release the GIL?\15:50
sean-k-mooneycdent: i did not know https://github.com/openstack/nova/blob/master/nova/cmd/api.py implemented the wsgi protocol but it looks like it does15:50
dansmithif the wsgi server is pausing the interpreter, I take that to mean it expects the thing to stop doing stuff, and thus, shouldn't really be doing heartbeats?15:50
cdentimacdonn: have you tried melwitt's patch? does it "fix it". Because if not, maybe it doesn't...15:51
*** ttsiouts has joined #openstack-nova15:51
cdentand what we said above about reopening the connection is the real fix15:51
imacdonncdent: depends which patch, I guess .. there was a set of 3 related ones, and the last one removed the monkey patching ... I did some basic testing of that, and it seemed to work OK15:52
imacdonnby "that", I mean "all three"15:52
cdentimacdonn: this one: https://review.opendev.org/#/c/650172/15:52
imacdonnI think I did (most of) these: https://review.opendev.org/#/q/topic:cell-scatter-gather-futurist+(status:open+OR+status:merged)15:53
*** wwriverrat has joined #openstack-nova15:54
dansmithso if you disable monkeypatching but still use eventlet, I think everything just becomes synchronous, right?15:54
dansmithin that case, the "it works" is probably just because you're doing everything serially, so unless you apply the other patches to make it actually use the native threading, it's not a relevant test, IMHO15:55
sean-k-mooneydansmith: if you dont call eventlet directly i think so15:55
*** ttsiouts has quit IRC15:55
sean-k-mooneyim not sure what happens if you do call eventlet spawn et al15:55
dansmithagain, I think we need to repro this in a clean devstack and/or the gate before we get too much further.. or figure out why we can't.15:56
sean-k-mooneyyou mean the rabbitmq issues15:57
sean-k-mooneyis the theory that they will start happening if the api is not actively used for a few minutes15:58
*** rpittau is now known as rpittau|afk15:58
sean-k-mooneye.g. it should break if we stack, go have coffee and come back?15:58
*** imacdonn has quit IRC15:58
bnemecThere's some analysis of what was happening in http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005310.html16:01
*** imacdonn has joined #openstack-nova16:02
imacdonnI missed the last ~5 mins ... *^&&*$% stupid VPN16:02
*** sridharg has quit IRC16:02
imacdonnfor repro, I think the best way is to enable debug logging for oslo_messaging and observe heartbeats (or lack thereof)16:03
cdentdansmith, mriedem I put a comment with links to this discussion and some other summaries on the ThreadPoolExecutor patch16:03
imacdonn(after making an API call that requires RPC - like getting the VNC console URL for an instance)16:03
dansmithimacdonn: is it specific to vnc console by chance? because many api calls make rpc calls.. pretty much anything other than get/list instances16:04
fungimriedem et al: i found what was wrong with http://status.openstack.org/reviews/ and got content updating again, though i'm at a loss for explanation as to how it broke (one of the dependencies we pip install for reviewday, cheetah, was somehow not installed on the server and all reasonable explanations have been ruled out)16:04
fungianyway, let us know if you're relying on it and see it go stale again somehow16:04
fungimaybe i'll be able to track it down that way16:05
fungishould update roughly every 30 minutes16:05
imacdonndansmith: I don't believe so .. that's just an easy way to make an RPC call happen16:05
*** dims has quit IRC16:05
*** tesseract has quit IRC16:05
*** yan0s has quit IRC16:07
dansmithconfirming that would be good, because I think we have mucked with console stuff recently16:07
dansmithand would also help narrow down the differences16:07
*** ratailor has quit IRC16:07
imacdonnI have observed other things failing, but didn't make note of what specifically .. is there a particular call that you would suggest?16:09
*** awalende has joined #openstack-nova16:09
dansmithI suggest reproducing this with a devstack16:09
imacdonnworking on that (devstack) now16:09
mriedemfungi: thanks16:11
fungiany time!16:11
*** awalende has quit IRC16:13
*** yonglihe has quit IRC16:14
*** cdent has quit IRC16:17
*** dtantsur is now known as dtantsur|afk16:19
*** luksky has joined #openstack-nova16:21
openstackgerritMatt Riedemann proposed openstack/nova master: Support cross-cell moves in external_instance_event  https://review.opendev.org/65847816:40
openstackgerritMatt Riedemann proposed openstack/nova master: Add cross-cell resize policy rule and enable in API  https://review.opendev.org/63826916:40
openstackgerritMatt Riedemann proposed openstack/nova master: WIP: Enable cross-cell resize in the nova-multi-cell job  https://review.opendev.org/65665616:40
aspierskashyap, efried: are you good with https://review.opendev.org/#/c/655268/ now?16:40
*** jmlowe has quit IRC16:41
mriedemefried: i never +2ed this https://review.opendev.org/#/c/659374/16:41
mriedemwell,16:41
mriedemi did before -1ing it16:41
*** ricolin has quit IRC16:44
mriedembut now i am :)16:44
*** ttsiouts has joined #openstack-nova16:50
imacdonndansmith: I've reproduced the lack of heartbeats on devstack (and confirmed that the heartbeats happen if monkey patching is removed) ... but somehow (so-far) devstack is able to recover from the unexpected loss of amqp connection, where my RDO environment is (sometimes) not16:51
dansmithokay so devstack is behaving like we would expect, assuming the pause-the-interpreter wsgi behavior you describe16:51
imacdonnso-far, I suppose so16:52
imacdonnif the pause is considered "expected"16:52
*** mrch_ has quit IRC16:53
*** bauzas has quit IRC16:54
dansmithif the pause isn't, then the problem is in the configuration of the wsgi container or something else16:56
*** bauzas has joined #openstack-nova16:57
imacdonnbut it only happens when python has been monkeyed-with16:57
dansmithmm, I think you're misunderstanding16:58
dansmithI think that wsgi is pausing regardless,16:58
*** davidsha has quit IRC16:58
dansmithand in the eventlet case, there is only one thing to pause, so it pauses16:58
dansmithand without eventlet, we spawn a real thread that the wsgi container does not pause (only the main thread)16:58
dansmithwhich is why I say that if the design of the wsgi container is to pause the whole thing between requests, then us subverting that with a real thread seems wrong16:59
imacdonnI definitely don't understand the mechanisms well enough ... so my observation is fairly high-level ... in one case, the heartbeats happen, so it seems awake .. in the other case, it goes completely quiet16:59
imacdonnso if it's wrong, the whole oslo_messaging model seems inappropriate ..... ???17:00
*** derekh has quit IRC17:00
imacdonncos it does heartbeats quite intentionally, AFAIK ...17:00
dansmithwell, o.msg _was_ developed with eventlet servers in mind17:00
*** jmlowe has joined #openstack-nova17:01
*** dims has joined #openstack-nova17:03
*** lbragstad has quit IRC17:05
dansmithso here's a random person asking a similar question: https://github.com/benoitc/gunicorn/issues/192417:06
dansmithnot much response there, except that it basically says "don't spawn worker threads from your wsgi app"17:06
dansmithcheck this out: https://docs.newrelic.com/docs/agents/python-agent/web-frameworks-servers/python-agent-uwsgi-web-server17:08
dansmith"By default uWSGI does not enable threading support within the Python interpreter core. This means it is not possible to create background threads from Python code. "17:09
dansmith...which might mean that disabling eventlet and letting threading be unpatched just gets you synchronous behavior17:09
*** tssurya has quit IRC17:10
dansmithI wonder if using --enable-threads but leaving the monkeypatching in place would allow the eventlet background threads to run?17:11
imacdonn"By default the Python plugin does not initialize the GIL. This means your app-generated threads will not run. If you need threads, remember to enable them with enable-threads. Running uWSGI in multithreading mode (with the threads options) will automatically enable threading support. This “strange” default behaviour is for performance reasons, no shame in that."17:12
imacdonn^ from https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html17:12
dansmithyeah, just saw that.. so... do you have threads enabled?17:12
*** dims has quit IRC17:12
imacdonnchecking ... probably not17:12
*** lbragstad has joined #openstack-nova17:13
*** panda|rover is now known as panda|rover|off17:13
*** dims has joined #openstack-nova17:13
dansmithif I were you, I'd really want to find out why devstack recovers properly and yours does not though17:14
imacdonnor...17:14
imacdonn# grep threads /etc/nova/nova-api-uwsgi.ini17:14
imacdonnenable-threads = true17:14
imacdonn#17:14
imacdonn(this is on devstack)17:14
imacdonnthinking about that some more ... I think if enable-threads was not on, the heartbeat thread would not work in the non-patched case (?)17:15
imacdonnguess I could attempt to test that17:15
*** ivve has joined #openstack-nova17:16
imacdonnconfirmed that I do have enable-threads in my "real" (non-devstack) config too17:16
openstackgerritStephen Finucane proposed openstack/nova master: Fix some issues with the newton release notes  https://review.opendev.org/65962317:18
dansmithso,17:18
dansmiththinking about this some more17:19
openstackgerritStephen Finucane proposed openstack/nova master: Use new hotness for upper-constraints  https://review.opendev.org/65962417:19
dansmithI'm not sure how eventlet could really ever work in the wsgi mode, because background tasks really require the main thread to block on the event loop17:19
dansmithI'm struggling to find where the entry point is for nova-api in wsgi mode...17:19
imacdonnthat seems to be the consensus ... eventlet and wsgi containers seem fundamentally incompatible17:21
dansmithwell,17:21
dansmithnot really,17:21
dansmithjust wsgi+eventlet with a background thread17:21
imacdonnok17:21
imacdonnFWIW, with enable-threads turned off, and monkey patching still removed, it looks the same - no heartbeats after the API request that opens an RPC connection17:23
dansmithyeah, not running the event loop means we'll just never do that thing17:23
*** ttsiouts has quit IRC17:24
dansmithtbh, it kinda seems like what we should be doing is splitting nova-api into two pieces in a wsgi world17:25
dansmithwhich is, of course, not how it was designed17:25
imacdonnsomeone else suggested that17:25
dansmithmnaser: you around?17:27
dansmithimacdonn: how instantaneous is the reconnect to rabbit on devstack?17:28
imacdonnno perceptible delay17:28
mnaserdansmith: hiya17:29
dansmithmnaser: you are on stein and running nova-api in wsgi something right?17:29
*** _alastor_ has quit IRC17:29
mnaserCorrect. Uwsgi17:29
dansmithmnaser: and you don't see this rabbit heartbeat timeout freezeup thing correct?17:30
*** xek__ has joined #openstack-nova17:30
mnaserI have not. I thought the ML post thread concluded it was mod_wsgi related ?17:30
dansmithmnaser: nope, both apparently17:30
*** _alastor_ has joined #openstack-nova17:31
dansmithmnaser: imacdonn says it's reproducible in devstack too, but it just reconnects instantaneously and thus isn't a problem (which is what ideal behavior should be)17:31
mnaserHmm, I mean, maybe it could be?  I’ve not seen any issues but maybe our APIs get hit often enough?17:31
dansmithi.e. not requiring the wsgi app to run a daemon thread to maintain a connection17:31
mnaserLet me grep some logs17:32
dansmithmnaser: possible I guess. is there any way you could take one worker out of LB rotation and see if it hangs up after a couple minutes if you hit it directly?17:32
mnaserWhat string should I look for?17:32
dansmithimacdonn: ^17:32
imacdonnQRX ... reproducing again17:32
imacdonnerm, I mean .. standby :)17:32
dansmithimacdonn: ah, so you are the ham I keep finding in my googling :D17:32
*** xek_ has quit IRC17:32
imacdonneven an occasional chirp user ;)17:33
dansmithhah nice17:33
mnasergosh the latency is awful from here17:34
mnaseroh actually17:34
mnaserI have a quiet stein cloud here17:34
*** dklyle has quit IRC17:34
dansmithimacdonn: is it possible that mod_wsgi is the reason you don't get a reconnect and that uwsgi, despite pausing, makes that just work?17:35
imacdonnmnaser: one thing you could look for is, in your rabbitmq server log:17:35
imacdonn=ERROR REPORT==== 16-May-2019::16:45:39 ===17:35
imacdonnclosing AMQP connection <0.4566.0> (x.x.x.x:34178 -> x.x.x.x:5672 - uwsgi:47892:8bc2c402-d1d4-4762-9b83-29a7b7f419fd):17:35
imacdonnmissed heartbeats from client, timeout: 60s17:35
imacdonndansmith: no, because I'm using uWSGI "for real" too17:35
dansmithokay17:35
-openstackstatus- NOTICE: Gerrit is being restarted to add gitweb links back to Gerrit. Sorry for the noise.17:37
mnaserso is it only something logged inside RabbitMQ?17:37
*** sean-k-mooney has quit IRC17:39
*** pcaruana has joined #openstack-nova17:39
imacdonntrying to get the nova-api side logs .. I'll have to clean and pastebin them17:39
*** sean-k-mooney has joined #openstack-nova17:41
mnaserim seeing this on a lightly loaded stein cluster 'missed heartbeats from client, timeout: 60s'17:42
mnasergonna see what nova logs say17:42
imacdonnis there a handy place I can upload 10k lines of log to ?17:43
mnaser2019-05-16 17:38:53.643 3897 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer17:44
dansmithmnaser: but sounds like it's recovering when a new call comes in such that it's not noticeable?17:45
imacdonnmnaser: yes! that's one of symptoms17:45
dansmithbnemec: do you know if there's a way to turn off the heartbeat thread for rabbit?17:45
mnaserbut is there a problem however?17:45
dansmithbnemec: for o.msg's rabbit driver I mean17:45
mnaserlike n-api seems to continue to respond fine afterwards17:45
mnaser2019-05-16 17:38:53.644 3897 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer17:46
bnemecThere used to be. It was considered experimental for a couple of cycles. Let me look.17:46
dansmithmnaser: yeah, so that's the behavior we would expect17:46
mnaserbut at a 17:38:54 a GET /v2.1/servers/uuid worked fine17:46
dansmithdunno why imacdonn isn't getting that proper reconnect behavior17:46
dansmithturning off the heartbeat thread for the wsgi workers would eliminate the log noise I think17:46
*** udesale has quit IRC17:46
mnaserrabbitmq heartbeats are super unreliable in my experience17:47
bnemecdansmith: https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L14517:47
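Per the option bnemec links, the heartbeat thread can be switched off in configuration; a hedged sketch (the option lives in the [oslo_messaging_rabbit] section and 0 disables it):

    [oslo_messaging_rabbit]
    # 0 turns the AMQP heartbeat thread off entirely
    heartbeat_timeout_threshold = 0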
mnaserthere used to be a nice article about configuring timeouts17:47
efriedaspiers: not sure I ever looked at that.17:47
imacdonnyeah, perhaps I need to look closer at that ... but still, I think that heartbeats are there for a reason, and letting connections rot and "fail by design" is kinda icky17:47
mnaserhttps://medium.com/@george.shuklin/rabbit-heartbeat-timeouts-in-openstack-fa5875e0309a17:47
mnaserdunno how relevant it is but17:48
mnasera good read none the less17:48
dansmithimacdonn: yeah, I just don't think I agree :)17:48
imacdonnif nova-api isn't going to use persistent AMQP connections, it should gracefully close the connection before it goes to sleep, IMO17:48
mnasermy question is: is there a problem? are your requests forever 500-ing after?17:49
dansmithnova-api doesn't go to sleep17:49
melwittwhat's the behavior when it doesn't properly reconnect? nova-api responds with a failure?17:49
mnaser^^17:49
imacdonnyeah, I get some sort of 50x error17:49
dansmithpresumably because something is getting stuck and not letting the connections close off, which is why I want to know why imacdonn is different from the rest of the world17:50
mnaserif you get a 50x that means you have aa traceback, right?17:50
imacdonndansmith: hey now .. there was at least one other person experiencing the failures17:51
mnasershould you have a traceback if you're getting a 500?17:51
dansmithimacdonn: I meant your deployment, not you as a person :)17:51
imacdonnwell, yeah, but the other person was using a different deployment, and mod_wsgi17:51
* mnaser still asks if there is a traceback17:52
dansmithimacdonn: ^17:52
efriedmriedem: Were you going to put up a null nova change depending on https://review.opendev.org/659611 to make sure the ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa job passes?17:53
melwittI had thought someone else faced this behavior too, but looking at the bug report, they say it reconnects https://bugs.launchpad.net/tripleo/+bug/182774417:53
openstackLaunchpad bug 1827744 in tripleo "Running eventlet.monkey_patch in nova_api breaks the AMQP heartbeat thread" [High,In progress] - Assigned to Damien Ciabrini (dciabrin)17:53
efriedor does ironic run that job too?17:53
imacdonnI'll have to rebuild my "real" environment with uWSGI and reproduce again, if we need more context around the failure to reconnect ... I'd already started down the path of reverting to running nova-api standalone, since there didn't seem to be a path forward17:53
imacdonnthe "other person" is here: https://bugs.launchpad.net/nova/+bug/1825584/comments/717:54
openstackLaunchpad bug 1825584 in OpenStack Compute (nova) "eventlet monkey-patching breaks AMQP heartbeat on uWSGI" [Undecided,New]17:54
dansmithimacdonn: can you also try disabling rabbit heartbeats on a wsgi api node? just to see if that eliminates the log noise17:54
dansmithimacdonn: even if on the devstack17:54
melwittimacdonn: that's the bug I linked, I think, and the proposed patch is to stop monkey patching nova-api17:54
imacdonnwhat do you mean by "log noise"? or are you just asking me to test if it's possible to disable heartbeats at all ?17:54
melwittbut in that bug report, he says that rabbit reconnects17:55
imacdonnthat's my bug report, but he commented on it17:55
melwittno, the bug I linked you17:55
dansmithimacdonn: I mean the log noise where nova-api complains that the connection was reset before reconnecting (in mnaser's case)17:55
melwittimacdonn: https://bugs.launchpad.net/tripleo/+bug/182774417:55
openstackLaunchpad bug 1827744 in tripleo "Running eventlet.monkey_patch in nova_api breaks the AMQP heartbeat thread" [High,In progress] - Assigned to Damien Ciabrini (dciabrin)17:55
dansmithi.e. this: [10:46:02]  <mnaser>2019-05-16 17:38:53.644 3897 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer17:55
sean-k-mooneyactually in the api we dont really care about heartbeats to rabbitmq right17:56
dansmithsean-k-mooney: I've been shouting that for hours, yes:)17:57
dansmiththere's really no reason to maintain a persistent connection to rabbit from an idle api worker17:57
sean-k-mooneyactually i was going to say we never actually listen for anything from the api but i guess we do if we make a blocking call17:57
imacdonnmelwitt: ah, I see .. OK ... yeah, maybe it was non-fatal in his case17:57
mnaserI think the connection reset by peer is the real issue, I think the warning is log noise17:57
dansmithsean-k-mooney: only in the reply case, which isn't a problem here17:57
sean-k-mooneyya17:57
dansmithmnaser: the connection reset by peer is because we're not running the heartbeat thread, and thus is expected17:58
sean-k-mooneyso even if the connection drops due to inactivity, in the normal case we just fire and forget17:58
dansmithmnaser: it's the "this is dead let me reconnect immediately" signal, which seems to be working fine for you17:58
dansmithsean-k-mooney: right17:58
mnaserI'm in China and it's 2am but for the last time I'll say it17:58
mnaserif you're getting 500s, it probably threw a traceback, and that'll probably contain something useful17:59
mnaser(which will be the real issue)17:59
dansmithmnaser: imacdonn said he needs to rebuild his env to get that traceback17:59
imacdonn^ that17:59
mnaseraaaha17:59
mnaserokay, well in that case, good luck17:59
mnaserfwiw let me share the OSA config17:59
efriedmriedem: I guess it does run it. It failed. And looking closer, the 'contains' match was working properly. The problem is further up: the inventory is never getting set.17:59
dansmiththe other thing I'm interested in is, if you just don't patch, and everything seems to work, is that just because eventlet is being synchronous, or because eventlet  primitives are spawning real threads?18:00
mnaserhttps://opendev.org/openstack/openstack-ansible-os_nova/src/branch/master/templates/nova-uwsgi.ini.j218:01
mnaserthis is our uwsgi config18:01
* mnaser &18:01
melwittI'm also wondering about that18:01
dansmithI think that if disabling the heartbeat thread removes the log noise and we reconnect as we would expect for all cases except the unexplained ones, that'd be my preference as a path forward18:02
*** nicolasbock has quit IRC18:02
dansmithbecause changing our threading model for some of nova, and maintaining a thread off each wsgi worker just to hold a useless connection to rabbit does not seem like an improvement to me18:02
mriedemefried: ack ok just getting back, haven't looked18:03
efriedmriedem: Left comment on https://review.opendev.org/#/c/659611/18:04
imacdonndansmith: not sure I like the idea of turning off heartbeat .. that could have side-effects ... like leaving stale connections on the rabbitmq-server side for ever (??)18:04
dansmithimacdonn: show us that it causes negative effects18:04
imacdonnnot sure that I can come up with every possible scenario off the top of my head18:05
dansmiththe other things we could do are:18:05
dansmith1. establish and drop connections before/after request18:06
dansmith2. leave heartbeats in place, as they are today so that rabbit will close the stale connections, which seems to be working for most people,18:06
efriedmriedem: n-cpu thinks the inventory is being set: http://logs.openstack.org/11/659611/1/check/ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa/9c644a2/controller/logs/screen-n-cpu.txt.gz#_May_16_17_17_55_78856218:06
sean-k-mooneyif i do18:06
sean-k-mooney>>> def func():18:06
dansmith3. change our threading model for part of nova to run a heartbeat thread18:06
sean-k-mooney...   print("test")18:06
sean-k-mooney...18:06
sean-k-mooney>>> eventlet.greenthread.spawn(func)18:06
sean-k-mooneyit never prints anything18:06
dansmith4. split nova-api into the wsgi request processing side and the persistent bit18:06
sean-k-mooneyso if you don't monkeypatch it looks like spawn does nothing but queue the function to run18:07
sean-k-mooneyand never runs it18:07
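A minimal runnable sketch of what is being demonstrated here, assuming stock eventlet with no monkey patching: spawn() only queues the greenthread, and nothing runs until the current greenthread yields to the hub, for example via wait().
    import eventlet

    def func():
        print("test")

    gt = eventlet.greenthread.spawn(func)
    print("spawned; nothing printed yet")  # func() has only been queued on the hub
    gt.wait()                              # yielding to the hub finally runs func()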
dansmithsean-k-mooney: that makes me wonder how disabling monkeypatching does anything18:07
sean-k-mooneydansmith: as long as you don't call the scatter-gather code it should appear to work18:07
dansmithsean-k-mooney: right, but we do that for instance list, and they say that disabling monkeypatching just makes everything work normally, right?18:08
sean-k-mooneythat is the only part of the api that actually uses eventlet18:08
efriedmriedem: oh, reserved is 1 and never gets set to 018:08
mriedemefried: hmm, i wonder if it thinks the node isn't available18:08
sean-k-mooneydansmith: maybe its different on python 218:08
mriedemb/c i think that's when the driver sets reserved=118:08
sean-k-mooneyi just checked with py3618:09
efriedyup18:09
mriedemNode 7c5e9503-1e20-42bf-9ae7-d14cef0f38a9 is not ready for a deployment, reporting resources as reserved for it. Node's provision state is cleaning, power state is power off and maintenance is False.18:09
efriedjust so18:09
mriedemyeah so the node is dead18:10
mriedemhttp://logs.openstack.org/11/659611/1/check/ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa/9c644a2/controller/logs/screen-ir-cond.txt.gz#_May_16_17_28_09_31156418:10
imacdonnsean-k-mooney dansmith: side-note ... my "real" deployment is py2, devstack is py3 ... may be a factor in the failure to reconnect18:10
* efried edits story text18:13
mriedemyeah i just abandoned the change18:13
sean-k-mooneydansmith: if you save the greenthread object and then call wait() it runs all the queued greenthreads18:15
dansmithsean-k-mooney: right that's the main loop part we will never do in wsgi mode18:16
sean-k-mooneyso if i do gt = eventlet.greenthread.spawn(func)18:16
sean-k-mooneyit prints after i call gt.wait()18:16
sean-k-mooneydon't we wait in the scatter-gather however18:16
dansmithoh, sorry your point is maybe that's why it's working if they disable?18:17
dansmithyeah, so we do that in sequence, which means it becomes serialized18:17
dansmithso they just hit cell0, then cell1, etc18:17
dansmiththat's the behavior I had been assuming for non-patched,18:18
dansmithI just wasn't sure if in reality the lower layers would actually use threading.Thread() unpatched for the work18:19
sean-k-mooneywell i think when we call wait it is entering a cooperative green thread mode so we still get concurrency18:19
dansmithwhich hmm, I guess could still be18:19
sean-k-mooneyuntil that greenthread finishes at least18:19
sean-k-mooneyand since we are doing it in a for loop it will complete the scatter-gather as we will wait on all the greenthreads we queued18:19
sean-k-mooneyso we should not need to monkey patch18:20
*** ralonsoh has quit IRC18:20
dansmithwell, it depends on whether it's spawning a real thread and we're just waiting on its future, or if we pass through and just exec the thread function during the wait18:20
sean-k-mooneylet me try a slightly less dumb test function that does a green thread sleep and increments a counter and see if it's serialised or random18:21
dansmithsean-k-mooney: >>> def func2():18:21
dansmith...     print(len(threading.enumerate()))18:21
dansmith...18:21
dansmith>>> t = eventlet.greenthread.spawn(func2)18:21
dansmith>>> t.wait()18:21
dansmithtest18:21
dansmith118:21
dansmithsean-k-mooney: only one thread running from inside the thread function18:21
dansmithso it's sequential18:21
dansmiththe test came from another thread I spawned first (like your example) which hadn't been waited on18:22
dansmithnormally enumerate() returns one for the main thread all the time,18:22
dansmithso 1 means it didn't spawn something else for func2()18:23
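A self-contained restatement of the check above, assuming an unpatched interpreter: threading.enumerate() from inside the spawned function reports only the main OS thread, which is why the execution is sequential rather than truly threaded.
    import threading
    import eventlet

    def func2():
        # only the main OS thread exists; the greenthread is not a new thread
        print(len(threading.enumerate()))

    t = eventlet.greenthread.spawn(func2)
    t.wait()  # prints 1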
*** ccamacho has joined #openstack-nova18:23
dansmithso one thing we could do in the interim is,18:24
dansmithprovide a knob to let them avoid the monkeypatch without having to actually patch the code, if they prefer synchronous behavior18:24
dansmiththat isn't too terrible for people with just the default two cells, and the rest of the people who don't suffer from this can continue to do what we do today18:25
melwittI think that's already available in the form of an env variable. which is what this tripleo patch is proposing to turn off monkey patching https://review.opendev.org/#/c/657168/1/deployment/nova/nova-api-container-puppet.yaml18:26
dansmithmelwitt: ah even better :)18:26
sean-k-mooneydansmith: it is concurrent however18:27
sean-k-mooneyif i do18:27
sean-k-mooney>>> def func():18:27
sean-k-mooney...   eventlet.greenthread.sleep(10)18:27
sean-k-mooney...   global i18:27
sean-k-mooney...   i+=118:27
sean-k-mooney...   print(i)18:27
sean-k-mooney...18:27
sean-k-mooney>>> i = 018:27
sean-k-mooney>>> gt = eventlet.greenthread.spawn(func)18:27
sean-k-mooney>>> gt = eventlet.greenthread.spawn(func)18:27
sean-k-mooney>>> gt = eventlet.greenthread.spawn(func)18:27
sean-k-mooney>>> gt = eventlet.greenthread.spawn(func)18:27
sean-k-mooney>>> gt.wait()18:27
sean-k-mooneyit does not take 40 seconds to print18:27
dansmithsean-k-mooney: because you're using eventlet's sleep right?18:27
sean-k-mooneyyes18:27
imacdonnmnaser (if still awake?), dansmith, sean-k-mooney: log from the failure-to-reconnect case: http://paste.openstack.org/show/751486/18:28
dansmiththat doesn't mean it's concurrent, that means it's warping time for you :)18:28
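A runnable restatement of the paste above (a sketch, not the linked pastebins): eventlet.greenthread.sleep() yields to the hub even without monkey patching, so the four sleeps interleave and the whole run takes roughly 10 seconds rather than 40. As noted, that only proves interleaving for eventlet's own primitives, not for real blocking IO.
    import time
    import eventlet

    i = 0

    def func():
        global i
        eventlet.greenthread.sleep(10)  # yields to the hub, unlike time.sleep()
        i += 1
        print(i)

    start = time.time()
    threads = [eventlet.greenthread.spawn(func) for _ in range(4)]
    for gt in threads:
        gt.wait()
    print("elapsed: %.1fs" % (time.time() - start))  # ~10s, not ~40s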
mnaserLooks like conductor is18:31
mnaserTiming out18:32
mnaserI’ve seen really weird behaviour thy usually gets resolved by deleting every single queue in the vhost18:32
sean-k-mooneydansmith: well in any case i think we now know why it's working if you don't monkeypatch18:33
dansmithyup, as expected18:33
sean-k-mooneywhether it's concurrent or fully synchronous i don't know18:33
dansmithsean-k-mooney: run this and tell me what you think :) https://pastebin.com/XHV9t9cM18:35
imacdonnmnaser: this happens when the monkey patching is used in combination with a wsgi container ... and doesn't happen otherwise18:36
dansmithsean-k-mooney: one single wait, but it clearly runs them in order, and blocking calls block everything18:36
melwittimacdonn: not sure what else to try other than a python2 devstack test maybe, to try and pinpoint what is the difference in your env vs the others. mnaser, are you running stein under py3 or py2?18:36
imacdonnmelwitt: yeah, thought about py2 devstack too18:37
sean-k-mooneyyes it took 53 seconds18:40
sean-k-mooneybut https://pastebin.com/QN5L7yDY took 518:41
sean-k-mooneytime.sleep will suspend the thread18:41
sean-k-mooneyeventlet.greenthread.sleep should yield to the next coroutine18:41
dansmithsean-k-mooney: ah I see what you mean.. we will still get *interleaving* but not actual concurrency18:41
dansmithsean-k-mooney: meaning we're not running multiple concurrent threads, but we will interleave their execution when they do stuff that calls into eventlet18:42
sean-k-mooneywe will get concurrency but not parallelism18:42
sean-k-mooneyyes18:42
dansmithsean-k-mooney: the thing I think you're missing is that if we haven't monkeypatched, then all our IO is blocking and so the first thing will be like time.sleep() and not like eventlet.sleep()18:42
dansmithfirst thing meaning first bit of IO18:43
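A minimal sketch of that point (an illustration, not the pastebins referenced above): an unpatched blocking call such as time.sleep(), standing in for real socket IO, never yields to the hub, so greenthreads that hit it run one after another instead of interleaving.
    import time
    import eventlet

    def blocking_worker(name):
        time.sleep(2)  # blocking: never yields to other greenthreads
        print(name, "done at", round(time.time() - start, 1))

    start = time.time()
    a = eventlet.greenthread.spawn(blocking_worker, "a")
    b = eventlet.greenthread.spawn(blocking_worker, "b")
    a.wait()
    b.wait()
    # unpatched this prints "a done at 2.0" then "b done at 4.0":
    # the two sleeps serialize instead of overlapping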
sean-k-mooneyya i was wondering if we could import eventlet.green or something18:44
sean-k-mooneyhttps://eventlet.net/doc/patching.html#import-green18:44
sean-k-mooneyor use a context manager to patch just the socket lib18:44
dansmithI think that gets us further into the mixing paradigms state and confusion abounds, IMHO18:45
sean-k-mooneyya perhaps18:45
sean-k-mooneyso we originally merged the new scatter-gather stuff that used eventlet without monkey patching18:45
sean-k-mooneythen we noticed we were not monkeypatched and mdbooth fixed that18:46
dansmith*for api running under wsgi18:46
sean-k-mooneyand we started to see these uwsgi errors18:46
sean-k-mooneyya18:46
sean-k-mooneyso setting the env var would allow people to go back to the unmonkeypatched version under wsgi with no code change, but it's no longer parallel18:48
dansmithexactly18:48
sean-k-mooneyalthough in that case uwsgi/apache would still provide its own concurrency18:48
dansmithsean-k-mooney: between requests, but not between cells18:49
dansmithcells would still be sequential18:49
sean-k-mooneyyes18:49
sean-k-mooneyi'm just thinking about what people can do as a workaround in the short term18:49
sean-k-mooneye.g. what can people do that hit this in production18:50
dansmithpeople being the 1.9 people we know that are having this problem right? :)18:50
*** trident has quit IRC18:51
sean-k-mooneywell i think the only reason that we are not seeing this in osp is because the kolla containers use the console script in the container instead of running under wsgi18:51
sean-k-mooneyso i think that is why we are not hitting it downstream18:51
sean-k-mooneybut i don't know what other deployment tools default to18:51
dansmithwell, we got the vexxhost data that they're reconnecting properly18:52
sean-k-mooneywith osa18:52
dansmithso they're clearly seeing the same threading behavior, but not suffering18:52
dansmithi.e. they're getting the best of both worlds18:52
*** trident has joined #openstack-nova18:52
sean-k-mooneyperhaps its because they are installing the nova api under uwsgi and not mod_wsgi18:59
sean-k-mooneyimacdonn: what are you running the nova-api with?18:59
imacdonnsean-k-mooney: uWSGI ... I've also reproduced the heartbeats not happening with mod_wsgi, but I can't say if I've seen the failure to "establish a new connection and move on" in that case19:00
sean-k-mooneywe know that uwsgi and mod_wsgi handle the lifetime of their python interpreters differently, so module level variables are reinitialised in uwsgi but are not in mod_wsgi, because mod_wsgi reuses the interpreter when the app is reloaded and uwsgi does not19:01
sean-k-mooneyimacdonn: if the wsgi app dies and is restarted by uwsgi it will return to a clean state19:02
sean-k-mooneyin the mod_wsgi case it will reuse the interpreter and maybe that impacts the reconnection19:03
sean-k-mooneythat is a stretch however19:03
sean-k-mooneywe are seeing it not reconnect on mod_wsgi correct?19:03
imacdonnI don't see any evidence of the app dying and being restarted19:04
melwittsean-k-mooney: we're not sure what the common denominator is for not reconnecting and 500ing, I think19:04
melwittI have a python2 devstack, going to check whether it reconnects19:05
sean-k-mooneyif it was you would see nova print its config again and all the other startup stuff, so if you are not seeing that it's probably not happening then19:05
sean-k-mooneyok that is testing with uwsgi then19:05
imacdonnyeah .. the 504 error seems to be caused by an RPC timeout following the reconnect ... don't yet know why I'm seeing that where others are not19:05
sean-k-mooneydo we have any periodic task that could be keeping it alive?19:06
imacdonnI can't say I've reproduced the RPC timeout with mod_wsgi - was focused on the absence of heartbeats before19:06
dansmithsean-k-mooney: if you read the backscroll, it has been reprod with both19:07
imacdonnI'm not able to reproduce the RPC timeout with py2 devstack either19:07
melwitthow do you get the connection reset by peer message to show up in nova-api log? just waiting doesn't seem to do it19:08
dansmithrequest after a delay I think19:08
sean-k-mooneyi thought it was 2 different errors, one on each, but ok, in that case i'll stop looking to see what the difference is between how they both run19:08
imacdonnI have an instance already running.. I make a call to get the VNC console URL, wait about 5 mins, then repeat the call to get the console URL19:08
sean-k-mooneythat should not need to hit rabbit19:09
sean-k-mooneythat should just hit the db directly no?19:09
dansmithsean-k-mooney: eh19:09
sean-k-mooneyto get the url19:10
dansmithno, it calls to the driver19:10
sean-k-mooneyto get the console it would have to do an rpc19:10
sean-k-mooneyoh ok i assumed we would have stored the console url in the db19:10
*** imacdonn has quit IRC19:12
melwittit is stored there afterward (base url and token hash, etc) but the driver hands out the connection details initially19:13
*** imacdonn has joined #openstack-nova19:14
melwittok yeah I see now the "Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer" happens on the second request after waiting some minutes. but the request works fine19:14
melwitt(with py2 and uwsgi)19:15
sean-k-mooneyok but this is what get console is actually doing right https://github.com/openstack/nova/blob/master/nova/console/api.py#L27-L2919:15
sean-k-mooneyor rather openstack console url show19:17
melwittI think that's probably the xen-specific service nova-console stephenfin is looking to deprecate19:17
melwittthis is get_vnc_console https://github.com/openstack/nova/blob/master/nova/compute/api.py#L387619:17
sean-k-mooneythat is what's called from here https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/consoles.py#L4819:17
sean-k-mooneyoh ok so that one does a downcall to the compute, cool19:18
melwittI think that's the xen thing that no one except old rax cluster uses19:18
dansmiththis is one reason I suggested using a different call19:19
dansmithbecause console has all manner of stuff wrapped up in it19:19
dansmithI think pause/unpause is a call right? use that19:19
dansmithor just boot/delete19:19
imacdonnlooking at my py2 devstack, it seems the connection reset is observed at a different place19:19
sean-k-mooneyya or the server diagnostics19:19
dansmithwell, boot isn't a call from the api I guess, so pause19:19
dansmithsean-k-mooney: or that19:19
sean-k-mooney... of course they are only in the nova client...19:20
melwittimacdonn: in my py2 devstack, I see the connection reset in nova-api log (for get_vnc_console)19:20
sean-k-mooneynova diagnostics <server> is definitely a down call to the compute but nova get-vnc-console should be too19:22
*** ttsiouts has joined #openstack-nova19:22
*** nicolasbock has joined #openstack-nova19:22
*** imacdonn has quit IRC19:25
*** jmlowe has quit IRC19:29
*** xek__ has quit IRC19:32
*** xek__ has joined #openstack-nova19:32
*** imacdonn has joined #openstack-nova19:42
sean-k-mooneyefried: just finished reviewing https://review.opendev.org/#/q/topic:bug/1809095+(status:open+OR+status:merged)19:43
sean-k-mooneyefried: i'm provisionally +1 on both, i have one question on the second patch but it's an edge case19:43
*** jmlowe has joined #openstack-nova19:44
*** imacdonn_ has joined #openstack-nova19:46
*** maciejjozefczyk has quit IRC19:49
*** imacdonn has quit IRC19:49
*** pcaruana has quit IRC19:52
*** ttsiouts has quit IRC19:54
imacdonn_sean-k-mooney: vnc console does rpc .. you can see it in the trace at http://paste.openstack.org/show/751486/19:55
imacdonn_(the trace for the RPC timeout)19:55
sean-k-mooneythose calls go to the compute node? ya19:56
imacdonn_yes19:56
sean-k-mooneybut it was not clear what api endpoint osc was calling for openstack console url show19:56
sean-k-mooneyif you are using the nova client's get-vnc-console that was definitely downcalling19:57
sean-k-mooneyosc is doing some magic to figure out what console type the vm is using19:57
sean-k-mooneyas it has a single command that works for all console types19:57
sean-k-mooneyso i was not sure if the osc command would actually hit the db or the compute node19:58
imacdonn_yeah, def does ... that stack trace resulted from 'openstack console url show ...'19:59
sean-k-mooneyit's kind of strange20:00
sean-k-mooneywe can see the heartbeat_tick printed20:00
imacdonn_yeah, but only once20:01
imacdonn_it should be periodic from that point on20:01
sean-k-mooneyno its there several times20:01
sean-k-mooneylook at line 140 to 15320:01
imacdonn_for the same thread?20:01
imacdonn_in any case, in the non-patched case, it's continuous .. in the patched case, it goes quiet20:01
*** ccamacho has quit IRC20:02
imacdonn_those are all within the same second20:03
imacdonn_I *may* have just found what's different about my env vs. others and devstack20:04
*** whoami-rajat has quit IRC20:05
*** imacdonn_ is now known as imacdonn20:07
sean-k-mooneyhave you adjusted any of the messaging timeouts?20:07
imacdonnno, but in my uwsgi config, had "threads = 8" ... I've removed that now, and it seems that the reconnect is working20:09
sean-k-mooneythis is the osa one https://opendev.org/openstack/openstack-ansible-os_nova/src/branch/master/templates/nova-uwsgi.ini.j220:10
imacdonnI'm thinking (loosely) that maybe in the threads=X case, the thread making the RPC call has fired before the thread handling heartbeats has noticed the connection failure and reestablished20:10
sean-k-mooneywhich defaults to 120:10
imacdonnyeah. devstack doesn't set it at all20:11
imacdonnI was thinking to figure out what uwsgi does by default .. guessing 120:11
imacdonnI was *trying....20:11
sean-k-mooneyi'm looking for that too now :)20:11
efriedsean-k-mooney: thanks for reviewing.20:20
sean-k-mooneyimacdonn: finally https://github.com/unbit/uwsgi/blob/3149df02ed443131c54ea6afb29fcbb0ed4d1139/core/init.c#L96 it default to 120:21
sean-k-mooneyimacdonn: the default is not in the docs20:22
imacdonnhah, thanks20:23
sean-k-mooneyfor c code uwsgi is actually not too hard to read but finding stuff is a pain20:23
sean-k-mooneyimacdonn: so i think we know why vexxhost are fine20:24
sean-k-mooneyosa explicitly sets it to 120:24
imacdonnyeah, I think so ... the reconnect seems to be consistently working since I've changed that20:25
imacdonnso now I wonder if I'm giving anything up by doing that, in terms of performance .... haven't really found any good guidance on processes/threads sizing for these things20:26
sean-k-mooneywell osa is running with 64 processes and 1 thread per process i think20:26
sean-k-mooneysorry a max of 16 processes20:27
sean-k-mooneyeach with 1 thread20:27
dansmithis this threads per worker process?20:29
dansmithif so, it could be that multiple threads per worker try to share a single connection pool,20:29
sean-k-mooneyyes i believe so20:29
dansmithbut maybe one of them gets responsibility for the eventlet heartbeat thread,20:29
dansmithso if that thread isn't the first scheduled after a period of inactivity, then heartbeat isn't run, we try to use the connection and are stuck?20:30
dansmithsomething like that?20:30
imacdonnyes, "something like that" is what it feels like20:30
sean-k-mooneywell i'm not sure, if we have process=2 thread=2, whether that will result in 2 or 4 python interpreters20:30
sean-k-mooneye.g. are the threads os threads or python threads20:31
melwittso heartbeat threads is the one that reconnects things if detected to not be open20:31
dansmithI think that would mean two interpreters, two threads each20:31
melwitt*thread20:31
dansmithmelwitt: no,20:31
sean-k-mooneyright so the gil would mean that only one of the two threads would actually be running in each python interpreter20:31
dansmithI think any of them would do it, but running or not running the heartbeat but definitely running the other threads is where I can see it getting wonky20:32
*** tbachman has quit IRC20:32
sean-k-mooneyso maybe it is starving the heartbeat thread as you suggest20:32
melwittI guess I'm confused. so far, I thought the heart beat thread doesn't affect anything other than to log a message saying "connection reset" if it's been idle > 60s. how does its starvation prevent reconnection?20:34
dansmithno, I think the heartbeat thread is just the first thing scheduled after a period of inactivity, so it's the first thing that hits the dead connection20:36
imacdonnI don't think reconnection is being prevented .. you can see it happen in the log at http://paste.openstack.org/show/751486/ ... but I think it might matter *when* it happens20:36
imacdonn(line 227 in the log shows reconnected)20:37
melwittoh yeah, I see. so why MessagingTimeout? sorry, I just don't get this20:38
imacdonnmaybe the RPC request got sent to a dead connection, somehow??20:39
*** dave-mccowan has joined #openstack-nova20:39
sean-k-mooneythe "best" respocne in https://www.reddit.com/r/Python/comments/4s40ge/understanding_uwsgi_threads_processes_and_gil/ seam to gel with how i tought this would be working20:39
sean-k-mooneyif we are initalising the connection pool as a module vaiable we are sharing it between the threads and all the therads are being run sequtually under the gil20:40
sean-k-mooneyso with threads=8  you had 8 instance of the api all sharing the same connection pool20:40
sean-k-mooneythere is a follow futher down where there is a code wakthorugh which explaing the init flow20:43
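A toy sketch of the pattern being described (hypothetical names, not nova's actual code): a module level object is created once per Python interpreter, so with uwsgi threads=8 every request thread in that process shares the same instance and its connection pool, and the GIL only interleaves their execution.
    import threading

    _TRANSPORT = None  # module level cache: one per interpreter
    _LOCK = threading.Lock()

    def get_transport():
        global _TRANSPORT
        with _LOCK:
            if _TRANSPORT is None:
                _TRANSPORT = object()  # stand-in for the real transport/connection pool
            return _TRANSPORT

    def handle_request():
        # every wsgi worker thread in this process gets the same object back
        return get_transport()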
*** mchlumsky has quit IRC20:47
artomefried, you added me to that review but I think you'll get through them before I get a chance to even start20:50
artomWe're all putting out CI fires with our upcoming release20:51
artom(Maybe I shouldn't say that in public?)20:51
efriedartom: no worries, I noticed you were on the one above it, which is the important one, but that one's getting a -1 from me in a few minutes here, so you got time :)20:51
artomhttps://photos.app.goo.gl/fRQoJpE5WnqYeP6E8 was my desk not long ago. Yes, I needed both those drinks to get through the bugs20:52
artom... and I need to go pick kids up20:53
artomo/20:53
sean-k-mooneydansmith: imacdonn so i think this is the issue https://github.com/openstack/nova/blob/760fc2de32057817bafc0a1512a735dab1c7c68a/nova/rpc.py#L6920:54
sean-k-mooneywe initialise the transport as a module level variable so that is shared across all wsgi app instances in the same process20:55
sean-k-mooneythat ends up calling https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/transport.py#L19220:55
dansmithsean-k-mooney: well, that's the global thing that does the connectioning to rabbit, yeah20:55
sean-k-mooneywhich calls the driver manager to construct the rabbitmq driver instance, which has a connection pool internally20:56
dansmithyup20:56
sean-k-mooneyso the connection pool becomes global too20:56
dansmithright20:56
*** tbachman has joined #openstack-nova20:57
dansmithsean-k-mooney: this is only part of the equation by the way20:57
sean-k-mooneyand since we have multiple python threads and the gil limits their execution we could starve the heartbeat20:57
dansmithsean-k-mooney: because I think that TRANSPORT is what we use for initializing a connection which we would use if we were going to start a server and listen20:57
*** artom has quit IRC20:57
dansmithsean-k-mooney: but connections to the *cell* rpc brokers happens in nova.context, because we connect per cell once we read that out20:57
dansmithsean-k-mooney: it's effectively the same situation though, I imagine they both suffer from the same behavior20:58
sean-k-mooneyripping out those globals would probably be a pain20:59
*** trident has quit IRC20:59
dansmithsean-k-mooney: https://github.com/openstack/nova/blob/760fc2de32057817bafc0a1512a735dab1c7c68a/nova/context.py#L37620:59
dansmithsean-k-mooney: having implemented multi-cell tests, I can assure you it is20:59
*** cmart has joined #openstack-nova21:00
*** trident has joined #openstack-nova21:00
dansmithbut also, it's the thing that avoids us having to build that stuff up a hundred times, even for a single request where we make a couple rpc calls21:00
sean-k-mooneyand yep we are storing the connection in a module level CELL_CACHE21:00
dansmithright21:00
sean-k-mooneyso yep it's the same issue21:00
melwittso if the heartbeat thread gets starved, then connections get closed down, and then if reconnect in another thread races with an API request coming in and the API request hits before the reconnect, it will get the MessagingTimeout? I'm looking through the oslo.messaging code and from what I can tell, the reconnect is done on-the-fly when a message is attempted to be sent, so I don't understand how a different thread is doing the reconnect21:01
*** ttsiouts has joined #openstack-nova21:01
cmartHello -- can someone help me understand the process by which ephemeral disks get a filesystem label? It's usually `ephemeral0` but I have an instance created today which has `ephemeral0_160_40`21:02
melwittbut if that were true that oslo.messaging automatically tries to reconnect, then the API request shouldn't fail, it would try to reconnect. so, nevermind, I don't know what I'm talking about21:02
sean-k-mooneymelwitt: well that's probably the problem, it likely is not auto reconnecting on making a request21:02
sean-k-mooneywe probably have a periodic task or something that is doing the reconnection21:03
sean-k-mooneythat is just a guess21:03
sean-k-mooneyi haven't looked at this that closely in the past21:03
melwittyeah. I just can't tell what triggers a reconnect. in https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py I see a bunch of stuff where "recoverable error" will try to reconnect21:03
sean-k-mooneyare we using the amqp driver or the rabbitmq one21:04
melwittsorry, wrong link https://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py21:04
melwittthat other link was me trying to find where/how it raises MessagingTimeout21:04
*** tbachman has quit IRC21:05
sean-k-mooney hum i wonder if setting https://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py#L145-L149 to 0 would help?21:05
dansmithsean-k-mooney: that was mentioned earlier21:06
sean-k-mooneyoh ok i thought you were asking if there was a way to disable it earlier21:06
melwittthat is, you think it might allow threads=8 config to reconnect?21:06
dansmithI said it would be good to test with it, *but* I think the downside of it is that the rabbit server side won't terminate the connection because of lack of heartbeats which means more likelihood of just having stale connections that have been forgotten by the conntracks in between21:06
sean-k-mooneyi think it might not close the connection21:06
melwittoh ok21:07
sean-k-mooneye.g. it would only close if the tcp socket closed21:07
melwittyeah, I see21:07
imacdonnpfft, it's not like rabbitmq-server is ever going to run out of connection...............21:07
dansmithproblem is, stateful firewalls in between client and server may do that without you noticing which will require some delay before you realize the socket is dead21:07
sean-k-mooneyright which is probably why the heartbeat is there in the first place21:08
melwittyeah21:08
imacdonnback in the icehouse days, I had lots of problems trying to do AMQP through a firewall, due to lack of heartbeat support21:08
dansmithright and why I think it's fine to let it be there, and if we wake up within a minute, we're good and if not, then we've got an immediate RST waiting for us so we know to reconnect21:08
openstackgerritMatt Riedemann proposed openstack/nova master: Restore connection_info after live migration rollback  https://review.opendev.org/55134921:09
sean-k-mooneymelwitt: this is where the reconnect happens i think21:12
sean-k-mooneyhttps://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py#L897-L92821:12
sean-k-mooneywe call self.ensure_connection() if there was an exception when we called self._heartbeat_check()21:12
sean-k-mooneywhich is a thread we initialise in the driver, which is a global, so there is again only 1 per process21:13
sean-k-mooneyso with threads=8 there is a 1 in 9 chance that it will get scheduled to run21:14
melwittso since it does the reconnecting, if it doesn't run first, other thread will hit a closed connection and get MessagingTimeout21:17
*** slaweq has quit IRC21:18
sean-k-mooneywell the messaging timeout comes from the fact that we did a request and then stopped listening because the connection closed21:18
sean-k-mooneyand we did not reconnect before the timeout expired21:18
sean-k-mooneyso the send was successful21:18
sean-k-mooneythen it closed and they never got the response21:18
melwittok21:18
sean-k-mooneywhich likely means when we reconnect it will be sitting on the queue21:19
sean-k-mooneythings like the instance list would likely be fine in that case21:21
sean-k-mooneysince the api is hitting all the cell dbs but not doing an rpc right21:21
sean-k-mooneybut things like get_vnc_console would fail21:21
efriedcmart: what does your [DEFAULT]virt_mkfs option look like?21:21
sean-k-mooneythings like boot would likely be fine too since that's fire and forget21:23
imacdonnso now I'm wondering if the connection reestablishment is working "by accident" with threads=1, or if I can depend on it........21:23
cmartefried I haven't defined it (on Queens, Ubuntu 16.04) so probably the default21:23
sean-k-mooneywell with threads=1 there is a 50% chance the connection thread will run21:24
sean-k-mooneysince there is the main thread + the heartbeat thread21:24
melwitt:\21:26
sean-k-mooneyi did not say it would definitely work :P21:27
dansmithsean-k-mooney: there's only one thread,21:27
sean-k-mooneybut it's better than 1 in 921:27
imacdonnso if I set threads=0, will it be 100% ? :)21:27
melwittlol21:27
sean-k-mooneydansmith: the rabbitmq driver spawns a thread to run the heartbeat check21:28
dansmithand I think that if we're monkeypatched, any io we do will always guarantee that we run the waiting greenthreads, so we'd consume the dead connection with the heartbeat right?21:28
sean-k-mooneyhttps://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py#L88421:28
dansmithsean-k-mooney:  green thread21:28
sean-k-mooneyah you're right, if it's monkey patched then it's a greenthread21:28
dansmithright,21:29
dansmithso we'll always run that stuff on first io21:29
sean-k-mooneyi guess it depends on how many io operations were pending or had yielded21:29
dansmithI think if it was 50% we'd have heard about it21:29
dansmithhowso?21:29
efriedcmart: It looks like that name is being assigned in this neighborhood: https://opendev.org/openstack/nova/src/branch/stable/queens/nova/virt/libvirt/driver.py#L351021:30
sean-k-mooneywell in the scatter-gather case at least we are queueing a bunch of green threads to execute then waiting21:30
dansmiththe heartbeat gthread is always schedulable work on any io switch if we've not run it in t/2 or whatever it's period is21:30
dansmithsean-k-mooney: right but on the first io that we do, to even start that, we'll run all the greenthreads pending, which is at least the heartbeat gthread, before we do much of anything else21:31
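A minimal sketch of that scheduling claim, assuming eventlet monkey patching and a plain print standing in for the real heartbeat work: the first patched blocking call made by the request yields to the hub, which runs any pending greenthread, such as a heartbeat greenthread, before the request's own IO completes.
    import eventlet
    eventlet.monkey_patch()

    import time

    def heartbeat():
        print("heartbeat greenthread ran before the request's IO finished")

    eventlet.spawn(heartbeat)
    time.sleep(0.1)  # patched: yields to the hub, so heartbeat() runs here
    print("request IO done")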
efriedcmart: that line or L353321:31
sean-k-mooneyya we should21:32
*** hongbin has joined #openstack-nova21:32
cmartefried thank you for the orientation21:33
*** xek has joined #openstack-nova21:33
efriedcmart: Good luck :)21:33
*** ttsiouts has quit IRC21:33
*** xek__ has quit IRC21:35
openstackgerritMerged openstack/nova master: Remove fake_libvirt_utils from connection tests.  https://review.opendev.org/64255721:38
imacdonnI wonder if part of the issue is that the AMQP message is not being resent in the case where the connection reset is detected during publish, vs. during heartbeat21:39
melwittI wonder that too21:40
*** JamesBenson has quit IRC21:40
dansmithimacdonn: meaning if we hit it on publish, we reconnect, but didn't actually send, think we did, and wait for it until timeout?21:40
dansmithif so that would be a giant o.msg bug21:40
imacdonnyeah, smth like that21:41
melwittbut I think we know it won't reconnect on publish right21:41
melwittotherwise the 500 wouldn't happen21:41
dansmithno,21:41
dansmiththe 500 is from timeout right?21:41
dansmithwhich means it's waiting for a reply21:42
melwittyeah21:42
imacdonn504 from timeout, yeah21:42
imacdonnlook at http://paste.openstack.org/show/751486/ - the reset follows the publish failure, not a heartbeat failure21:42
dansmithwhich can happen if we effectively connect to rabbit, subscribe to a reply queue, but never sent the thing and thus time out waiting for a reply to nothing21:42
imacdonnthe reconnect, I mean21:42
sean-k-mooneythis is meant to handle the resend i think https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/impl_rabbit.py#L1088-L110121:43
*** JamesBenson has joined #openstack-nova21:43
sean-k-mooneyfor a direct send we do not pass retry21:43
imacdonnI don't see any "Failed to publish message to topic" logged21:44
melwittyeah that's what confuses me, if publish does a reconnect, why are we left hanging for a timeout? unless we never send a thing like you both mentioned21:44
sean-k-mooneyhttps://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/impl_rabbit.py#L1204-L121321:44
*** JamesBen_ has joined #openstack-nova21:44
dansmithmelwitt: it would be if the act of sending triggered the "huh, this is dead, lemme reconnect" but was not followed by a "okay, now that I have reconnected, let me send that thing I had"21:45
melwittaye21:45
dansmithbecause socket.write() is where you find out it's dead21:45
dansmithwhich is how heartbeat is finding out21:45
dansmithit gets scheduled, and does a "shit I'm late.. socket.write(heartbeat mofo!)"21:45
melwittright21:46
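A hedged illustration of that point using plain sockets (loopback addresses made up, nothing nova specific): once the peer has gone away, the first write usually still appears to succeed, and it is a later write that surfaces the reset, which is how the heartbeat or the next publish discovers the dead connection.
    import socket
    import time

    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    cli = socket.create_connection(srv.getsockname())
    conn, _ = srv.accept()
    conn.close()                      # the peer goes away; the client has no idea yet

    cli.sendall(b"first write")       # typically "succeeds", the data just hits the kernel buffer
    time.sleep(0.2)                   # give the RST time to come back
    try:
        cli.sendall(b"second write")  # the dead connection surfaces here, on write
    except (BrokenPipeError, ConnectionResetError) as exc:
        print("found out on write:", exc)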
*** JamesBenson has quit IRC21:47
*** xek_ has joined #openstack-nova21:47
*** xek has quit IRC21:47
*** JamesBen_ has quit IRC21:49
melwittsean-k-mooney: yeah, I don't know the difference between direct_send and topic_send. though topic_send sounds more like what's probably being used21:51
melwitt(based on the fact that our message queues all have topics, I would think that means those are topic_send, but I don't actually know)21:52
*** mriedem has quit IRC21:53
sean-k-mooneyso we are invoking the get_vnc_console as a "call"21:53
sean-k-mooneyhttps://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/rpc/client.py#L157-L18221:54
sean-k-mooneywhich i'm still tracing to see if it's a topic send or a direct send21:54
sean-k-mooneythat is where we do the actual call on the nova side https://github.com/openstack/nova/blob/eae1f2257a9ee6e851182bf949568b1cfe2af763/nova/compute/rpcapi.py#L635-L64021:55
*** xek has joined #openstack-nova21:56
sean-k-mooneyim getting flashbacks to the openstack bootstrap podcast looking at this code21:56
*** xek_ has quit IRC21:56
*** xek has quit IRC21:58
*** xek has joined #openstack-nova21:59
sean-k-mooneyok interesting so22:01
sean-k-mooney_send_reply uses a direct send22:01
sean-k-mooneyand that is used for this heartbeat https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py#L17422:02
sean-k-mooneybut this is the send function that matters https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py#L58122:03
sean-k-mooneyok so ya its a topic_send https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py#L62822:07
sean-k-mooneywhich means the retry behavior is controlled by the context object22:08
*** mrch_ has joined #openstack-nova22:25
openstackgerritEric Fried proposed openstack/nova master: DNM: Revert "Use external placement in functional tests"  https://review.opendev.org/65968022:26
sean-k-mooneyok dumb question but where does the context object we use to send the rpc call come from22:29
sean-k-mooneycontext = req.environ['nova.context']22:31
sean-k-mooneyso what sets that...22:31
sean-k-mooneythis https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0299/nova/api/auth.py#L95-L105 ?22:32
sean-k-mooneyno that has to be the wrong context, that's the keystone auth context22:36
*** hongbin has quit IRC22:39
sean-k-mooneyok i give up, i can't tell if we actually set retry to true22:39
imacdonnI suppose I could add debug logging to find out22:40
sean-k-mooneywhat i can tell is we don't set it explicitly when we call prepare, in which case retry is checked in the context22:40
melwittI think the keystone middleware does that https://docs.openstack.org/keystonemiddleware/latest/middlewarearchitecture.html22:41
sean-k-mooneyi can see the keystone context being constructed here22:42
sean-k-mooneyhttps://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0299/nova/api/auth.py#L95-L10622:42
sean-k-mooneyand i think that is the context that is passed here as ctxt https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0299/nova/compute/rpcapi.py#L593-L59822:42
openstackgerritEric Fried proposed openstack/nova stable/stein: DNM: Revert "Use external placement in functional tests"  https://review.opendev.org/65968222:42
melwittyeah, and it's the keystone middleware that sets the headers that are used to construct the context you show22:43
sean-k-mooneybut when we construct the rpc client we don't set retry to true as far as i can tell22:44
*** mlavalle has quit IRC22:44
openstackgerritEric Fried proposed openstack/nova stable/stein: Skip _exclude_nested_providers() if not nested  https://review.opendev.org/65920622:45
*** UdayTKumar has quit IRC22:48
melwittsean-k-mooney: looks like it comes from here https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/rpc/client.py#L342-L34322:49
melwitt(maybe, I didn't trace it)22:49
sean-k-mooneyok i found that like 30 mins ago but i misinterpreted the default of None as don't retry22:50
melwittso transport._send() will be done with forever retry22:50
sean-k-mooneyok so it should retry forever22:50
sean-k-mooneyya22:50
sean-k-mooneyi should have read the comment22:51
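For context, a hedged sketch of the client side API being traced here, using oslo.messaging's public RPCClient (names follow the versions linked above; the broker URL, topic and method call are made up, and exact signatures should be treated as assumptions): nova does not pass retry when building the client or calling prepare(), so the default of None applies, which the linked code treats as retry the connection forever.
    import oslo_messaging
    from oslo_config import cfg

    # made-up broker URL, purely illustrative
    transport = oslo_messaging.get_transport(cfg.CONF, url="rabbit://guest:guest@localhost:5672/")
    target = oslo_messaging.Target(topic="compute", server="some-compute-host")

    # retry is not passed here or in prepare(), so it stays None, which
    # transport._send() interprets as "retry the connection forever"
    client = oslo_messaging.RPCClient(transport, target)
    cctxt = client.prepare(server="some-compute-host", version="5.0")
    # cctxt.call(ctxt, "get_vnc_console", instance=instance, console_type="novnc")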
sean-k-mooneyok so this https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L666 is where the retry is implemented22:53
*** tkajinam has joined #openstack-nova22:55
*** macza has quit IRC22:57
sean-k-mooneyok so i have traced that fully through nova into oslo.messaging, and unless there is a bug in the auto retry support in the kombu lib, which i doubt, the issue is not that we think we sent the message on publish23:00
imacdonnseems that that retries ensuring that a connection exists, but doesn't retry sending a message ?23:00
sean-k-mooneyit does, it delegates it to kombu23:00
melwittyeah, I saw that too, just passes everything (including the message) to kombu's autoretry method23:01
sean-k-mooneyhttps://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L745-L75223:01
melwittso you would think after reconnecting, it would send the message23:01
melwittI guess I'm wondering now if what's in these callbacks are what controls how it behaves, whether it would send the message after reconnecting?23:03
*** xek has quit IRC23:03
sean-k-mooneyi assume that is what on_revive=on_reconnection is doing?23:04
imacdonnI'm missing something .. where is the message passed to the autoretry method ?23:04
*** luksky has quit IRC23:04
*** rcernin has joined #openstack-nova23:04
sean-k-mooneythe execute_method is a functools.partial thing23:05
*** _erlon_ has quit IRC23:05
*** slaweq has joined #openstack-nova23:05
melwittimacdonn: here https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1098-L110123:05
sean-k-mooneyhttps://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L109823:05
sean-k-mooneyso the execute_method has the message and all the args curried in the partial callable23:06
*** ttsiouts has joined #openstack-nova23:06
melwittyeah, so the on_reconnection method just reconfigures some things23:07
sean-k-mooneyyes but that is called when the connection is revived, but the auto retry i think would retry?23:07
melwittunfortunately it looks like we're going to have to look at what autoretry source does23:07
openstackgerritMerged openstack/nova master: Test macvtap port with resource request  https://review.opendev.org/64130923:08
sean-k-mooneyunless you think we are ending up here https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L760-L77123:08
melwittyou would think so but symptom suggests it's not23:08
sean-k-mooneywell we set the retry limit to float("inf")23:08
sean-k-mooneyso it could take a while to hit that :)23:09
melwittno, I don't think we're ending up there (we haven't seen that log)23:09
*** slaweq has quit IRC23:10
melwittwe see a reconnect message, and then we see a timeout fire some milliseconds later http://paste.openstack.org/show/751486/23:10
*** ttsiouts has quit IRC23:11
imacdonnthe timeout is one minute later23:11
*** slaweq has joined #openstack-nova23:11
melwittoh guh, sorry23:11
imacdonn;)23:11
melwittI see now23:11
*** cmart has quit IRC23:15
*** slaweq has quit IRC23:16
sean-k-mooneythat's being raised from https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1125 from inside the partial function in the auto retry thing23:17
sean-k-mooneyactually it's from the publish line but i'm looking at master on github23:18
*** JamesBenson has joined #openstack-nova23:18
sean-k-mooneyso ya after the first trace which is the publish failing we see the reconnect23:19
sean-k-mooneyand then i think the timeout for sending that message has expired after we reconnect23:20
sean-k-mooneywe started to send it at 2019-05-16 18:24:09.08923:20
*** dklyle has joined #openstack-nova23:21
sean-k-mooneybut that doesn't make sense23:22
sean-k-mooneythe connection is reestablished at 2019-05-16 18:24:10.10923:22
sean-k-mooneyand then there are no log messages until the timeout 1 minute later23:22
melwittright23:22
sean-k-mooneybut the rpc timeout should be longer than 60 seconds right?23:23
melwittI think it probably defaults to 60s23:23
*** JamesBenson has quit IRC23:23
melwittI can't remember where it is, lemme see23:23
imacdonn2019-05-16 18:16:05.973 23144 DEBUG nova.api.openstack.wsgi_app [-] long_rpc_timeout               = 1800 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:257123:23
imacdonnwait that's not it23:24
melwittdefault to 60 https://github.com/openstack/oslo.messaging/blob/7102eddb1fd799d949ff5dca8aa2637794bb8a43/oslo_messaging/rpc/client.py#L4123:24
sean-k-mooneythe long_rpc_timeout is not used for every rpc23:24
melwittrpc_response_timeout is the one23:24
sean-k-mooneyso this is what kombu is doing https://github.com/celery/kombu/blob/master/kombu/connection.py#L500-L54723:26
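For reference, a hedged sketch of the kombu side pattern oslo.messaging delegates to, using Connection.ensure(), the documented cousin of the autoretry helper linked above (broker URL, exchange and routing key are made up): the wrapped publish is re-invoked after kombu revives the connection and channel on a recoverable error.
    from kombu import Connection, Exchange, Producer

    conn = Connection("amqp://guest:guest@localhost:5672//")  # made-up broker URL
    exchange = Exchange("demo_topic", type="topic")
    producer = Producer(conn.default_channel, exchange=exchange)

    # ensure() returns a wrapper that, on a recoverable connection error,
    # re-establishes the connection/channel and calls publish again
    safe_publish = conn.ensure(producer, producer.publish, max_retries=3)
    safe_publish({"hello": "world"}, routing_key="demo.key", declare=[exchange])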
openstackgerritMerged openstack/nova master: Fix some issues with the newton release notes  https://review.opendev.org/65962323:33
imacdonnattempting repro with kombu debug logging23:34
*** dklyle has quit IRC23:35
melwittgood call, because looking at the code it looks like it should be doing the right thing. no clue what's happening23:38
imacdonnweird .. not getting the connection resets now23:42
sean-k-mooneydoes it not take time for it to happen23:42
sean-k-mooneye.g a few minutes23:42
melwittdid you remember to set threads=8?23:42
imacdonnyeah, I waited 5 mins the first try, then a little over 3 mins second try23:43
imacdonnyes, threads=8, but I should see the connection reset either way23:43
melwitttrue23:43
sean-k-mooneyok it's technically friday here so i'm going to have pizza and then sleep. if ye find anything let me know, but for now i guess we should update the bug and tell people to run with threads=123:45
sean-k-mooneyor maybe with debug logging...23:45
*** artom has joined #openstack-nova23:45
melwittpizza, sounds like a solid plan23:45
imacdonnThanks, sean-k-mooney - pizza/sleep well!23:46
sean-k-mooneyi actually had a c++ bug where debug builds did not have the bug because it was an uninitialised variable and in debug mode the compiler was "kind" enough to initialise it to 023:46
imacdonnahh, got the reset now ... paste in a few mins23:48
imacdonnhttp://paste.openstack.org/show/AZQET5IOgntK785RHtxh/23:51
melwittweird, it says basic_publish_confirm but then one minute later timeout happens23:57
melwittso confused.23:58
