Wednesday, 2021-03-31

clarkbfungi: yup then we should check that the job on the second project sees commit two as master/main/foo and not commit 1  00:00
fungiit's probably late in my day to start putting that test together, but i can try tomorrow if nobody beats me to it00:01
clarkbcool I can help with that too and ya dinner should be arriving soon for me00:03
clarkbfungi: it just occurred to me that there is an element of luck/randomness involved as well since it depends on your job being scheduled to an executor that previously ran jobs for the fast forwardable commit before it merged00:03
clarkbif the executor hadn't seen that commit yet then the rev check would fail and the update would happen as expected00:03
fungiyeah, extra luck that lightning struck on the recheck too00:04
fungii'll recheck it again just to see00:04
clarkbfor example ze01 does not have 874d4776eee5ae7c8c15debbb9e943110be299dd and its master ref still points to 99136e1e164baa7b1d9dac4f64c5fb511b813c19  00:04
clarkbif the job lands there I would expect it to update as expected00:05
clarkbthe other (third) possible state is master points at 874d4776eee5ae7c8c15debbb9e943110be299dd because a job triggered an update there since 874d4776eee5ae7c8c15debbb9e943110be299dd was not already present00:05
clarkb 02 is like 01 and 03 is like 08  00:06
clarkb 04 is like 08, 05 is like 01, 06 is like 08, 07 is like 08, 09 is like 08, 10 is like 01, 11 is like 01, 12 is like 08  00:10
clarkbnone of them are in that third possible state (not sure if that is useful info but there it is)00:10
clarkbfungi: I think you have a 5/12 shot at success given ^  00:13
fungiit got scheduled to ze01  00:13
fungibuild is 1088ace4500b4e808b89b079061e8a5a  00:13
clarkbmaster points at your commit there now so it is now in the third state00:14
clarkb(good to confirm that)00:14
fungi 1088ace4500b4e808b89b079061e8a5a  00:14
fungier, wrong buffer00:14
fungiSuccessfully installed gerritlib-0.10.1.dev1  00:14
fungiso that seems to jive with your theory00:14
clarkbya, really makes me wonder if we've just missed this issue for a long time or if something I don't understand caused the behavior to change00:16
clarkbbecause everything else is lining up00:17
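
For context, a minimal sketch of the failure mode being discussed, assuming GitPython (which zuul's merger builds on) and a hypothetical repo path and revs; this is not zuul's actual merger code. The point is that having the fast-forwarded commit as an object in the executor's cached repo is not enough: the branch ref itself also has to be advanced, otherwise jobs keep building the old tip. The fix clarkb pushes later in the day (change 784142, "Update branches after fast forward merges") addresses exactly that.

    import git

    def ensure_branch_updated(repo_path, branch, expected_rev):
        # hypothetical helper illustrating the check, not zuul's real code
        repo = git.Repo(repo_path)
        try:
            repo.commit(expected_rev)      # object may already exist from an earlier change
        except Exception:
            repo.remotes.origin.fetch()    # a missing object forces a fetch, which also moves refs
        # the subtle part: even when the object is present, the cached branch
        # ref still has to be checked and moved to the new tip
        if repo.heads[branch].commit.hexsha != repo.commit(expected_rev).hexsha:
            repo.heads[branch].commit = repo.commit(expected_rev)
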
openstackgerritKenny Ho proposed zuul/zuul master: Allow post playbook to run when run playbook has unreachable result  https://review.opendev.org/c/zuul/zuul/+/78398600:23
*** y2kenny has quit IRC00:31
*** hamalq_ has quit IRC00:49
openstackgerritKenny Ho proposed zuul/zuul master: Allow post playbook to run when run playbook has unreachable result  https://review.opendev.org/c/zuul/zuul/+/78398600:50
openstackgerritJeremy Stanley proposed zuul/zuul-jobs master: Document algorithm var for remove-build-sshkey  https://review.opendev.org/c/zuul/zuul-jobs/+/78398800:56
*** iurygregory has quit IRC01:16
*** iurygregory has joined #zuul01:17
*** iurygregory has quit IRC01:18
*** iurygregory has joined #zuul01:18
*** iurygregory has quit IRC02:08
*** iurygregory has joined #zuul02:09
*** jangutter has joined #zuul02:13
*** jangutter_ has quit IRC02:15
*** evrardjp has quit IRC02:33
*** evrardjp has joined #zuul02:33
*** nhicher has quit IRC03:09
*** fbo has quit IRC03:09
*** nhicher has joined #zuul03:13
*** saneax has joined #zuul03:34
openstackgerritKenny Ho proposed zuul/zuul master: Allow post playbook to run when run playbook has unreachable result  https://review.opendev.org/c/zuul/zuul/+/78398603:48
*** ykarel|away has joined #zuul04:20
*** jfoufas1 has joined #zuul04:28
*** vishalmanchanda has joined #zuul04:30
*** ykarel|away is now known as ykarel04:39
*** zbr|rover4 has joined #zuul05:04
*** zbr|rover has quit IRC05:06
*** zbr|rover4 is now known as zbr|rover05:06
openstackgerritJames E. Blair proposed zuul/zuul-jobs master: Add upload-logs-azure role  https://review.opendev.org/c/zuul/zuul-jobs/+/78200405:27
*** jangutter has quit IRC06:04
*** jangutter has joined #zuul06:05
*** ajitha has joined #zuul06:11
*** jcapitao has joined #zuul06:27
*** hashar has joined #zuul06:45
*** harrymichal has joined #zuul07:14
*** harrymichal has quit IRC07:17
openstackgerritTobias Henkel proposed zuul/zuul master: Add spec for enhanced regional executor distribution  https://review.opendev.org/c/zuul/zuul/+/66341307:18
*** tosky has joined #zuul07:33
openstackgerritMerged zuul/zuul master: Align enqueue/dequeue events with trigger events  https://review.opendev.org/c/zuul/zuul/+/78355608:00
*** ykarel has quit IRC08:01
*** dpawlik0 is now known as dpawlik08:08
openstackgerritSorin Sbârnea proposed zuul/zuul master: WIP: Document tox environments  https://review.opendev.org/c/zuul/zuul/+/76646008:17
openstackgerritIan Wienand proposed zuul/nodepool master: Bump dib requirement to 3.8.0  https://review.opendev.org/c/zuul/nodepool/+/78402608:29
*** nils has joined #zuul08:32
*** vishalmanchanda has quit IRC08:39
*** ykarel has joined #zuul08:43
*** ykarel is now known as ykarel|lunch08:43
*** jcapitao has quit IRC08:51
*** vishalmanchanda has joined #zuul09:11
*** jcapitao has joined #zuul09:24
*** harrymichal has joined #zuul09:31
openstackgerritSimon Westphahl proposed zuul/zuul master: Dispatch Gerrit events via Zookeeper  https://review.opendev.org/c/zuul/zuul/+/78381609:34
openstackgerritSimon Westphahl proposed zuul/zuul master: Ensure single instance for active event gathering  https://review.opendev.org/c/zuul/zuul/+/78381709:34
*** hashar has quit IRC09:52
*** jcapitao has quit IRC09:56
*** jcapitao has joined #zuul09:56
*** harrymichal has quit IRC09:57
*** harrymichal has joined #zuul09:57
openstackgerritSorin Sbârnea proposed zuul/zuul master: WIP: Document tox environments  https://review.opendev.org/c/zuul/zuul/+/76646009:59
*** harrymichal has quit IRC09:59
*** harrymichal has joined #zuul10:00
*** ykarel|lunch is now known as ykarel10:04
*** jcapitao has quit IRC10:36
*** jcapitao has joined #zuul10:36
*** jcapitao is now known as jcapitao_lunch10:39
*** jcapitao_lunch has quit IRC10:46
*** jcapitao_lunch has joined #zuul10:47
*** jcapitao_lunch has quit IRC10:51
*** jcapitao_lunch has joined #zuul10:51
*** mugsie__ is now known as mugsie11:01
*** jcapitao_lunch has quit IRC11:04
*** jangutter_ has joined #zuul11:36
*** jangutter has quit IRC11:39
*** jcapitao_lunch has joined #zuul11:46
*** sshnaidm|off is now known as sshnaidm11:57
iceyhey, is it possible to limit concurrency both globally (via semaphore) as well as per-change? I have a semaphore that can be completely consumed by a single ref and would like to spread the test resources out a bit more to allow all jobs to continue slowly, rather than all blocking waiting on the first12:02
*** jcapitao_lunch is now known as jcapitao12:02
*** wuchunyang has joined #zuul12:06
openstackgerritGuillaume Chauvel proposed zuul/zuul master: [DNM] run TestSchedulerSSL test with pending gear change  https://review.opendev.org/c/zuul/zuul/+/78026112:06
iceymaybe something like supporting multiple semaphores on a job?12:07
*** wuchunyang has quit IRC12:10
openstackgerritSimon Westphahl proposed zuul/zuul master: Make reporting asynchronous  https://review.opendev.org/c/zuul/zuul/+/69125312:22
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible  https://review.opendev.org/c/zuul/zuul-jobs/+/78066212:42
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible  https://review.opendev.org/c/zuul/zuul-jobs/+/78066212:45
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible  https://review.opendev.org/c/zuul/zuul-jobs/+/78066213:04
*** harrymichal has quit IRC13:10
*** harrymichal has joined #zuul13:10
openstackgerritGuillaume Chauvel proposed zuul/zuul master: [DNM] run TestSchedulerSSL test with pending gear change  https://review.opendev.org/c/zuul/zuul/+/78026113:17
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible  https://review.opendev.org/c/zuul/zuul-jobs/+/78066213:30
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible  https://review.opendev.org/c/zuul/zuul-jobs/+/78066213:34
avassicey: I can't come up with a good way to do that right now14:15
iceyavass: thanks, I've got some gross ideas but nothing that sounds sane14:18
avassicey: I suppose you could limit the number of parallel jobs for a specific label and put different jobs on separate labels but that seems like an awkward way to do it14:19
avassthe label would work like a semaphore in that case14:20
openstackgerritMerged zuul/nodepool master: Bump dib requirement to 3.8.0  https://review.opendev.org/c/zuul/nodepool/+/78402614:24
*** jangutter has joined #zuul14:40
*** ykarel is now known as ykarel|away14:40
*** jangutter_ has quit IRC14:43
*** saneax has quit IRC14:51
clarkbicey: avass a hacky method would be to sequence the jobs per change requiring each to finish in succession with job deps14:51
clarkbthat would lower throughput if you are otherwise idle though14:52
*** jfoufas1 has quit IRC14:53
fungithere's also the serial pipeline manager, not sure if that helps or makes the stated problem worse14:54
fungizuul-maint: i know we don't usually end up doing anything at the open infrastructure project teams gathering event, but just in case we want to this time for some reason, we can probably still book some slots even though it's technically past the deadline for communities to sign up14:56
fungilet me know if there's interest and i can help coordinate. we can of course also do something similar in an unofficial capacity if desired14:57
iceyclarkb: I think that's going to be the direction I look in :-/14:58
*** harrymichal has quit IRC15:00
*** harrymichal has joined #zuul15:00
*** ajitha has quit IRC15:03
openstackgerritGuillaume Chauvel proposed zuul/zuul master: [DNM] run TestSchedulerSSL test with pending gear change  https://review.opendev.org/c/zuul/zuul/+/78026115:13
avassclarkb: oh like using job.dependencies without there actually being a dependency?15:23
*** johanssone has quit IRC15:26
clarkbavass: ya15:28
clarkbjust artificially sequence them15:28
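
A hypothetical sketch of what clarkb is suggesting: chain otherwise-independent jobs with job dependencies in the project pipeline so a single change only runs one of them at a time against the shared resource. The job names are made up.

    - project:
        check:
          jobs:
            - integration-test-a
            - integration-test-b:
                dependencies:
                  - integration-test-a
            - integration-test-c:
                dependencies:
                  - integration-test-b

As clarkb notes, this trades per-change throughput for leaving semaphore capacity free for other changes.
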
*** johanssone has joined #zuul15:29
*** ykarel|away has quit IRC15:37
*** hashar has joined #zuul15:41
clarkbfungi: I've started looking at a test for the update needed behavior15:43
clarkbI'm running tox now for the first time against a sort of base layer setup on the test. Once that is good it shouldn't be too difficult to update the rest of the test to confirm failure in the scenario we've got15:43
openstackgerritFelix Edel proposed zuul/zuul master: Make buildset mandatory on build  https://review.opendev.org/c/zuul/zuul/+/77090015:45
openstackgerritFelix Edel proposed zuul/zuul master: Switch to ZooKeeper backed build result events  https://review.opendev.org/c/zuul/zuul/+/78293915:45
openstackgerritFelix Edel proposed zuul/zuul master: DNM Deactivate statsd collection for online executors and running builds  https://review.opendev.org/c/zuul/zuul/+/78294015:45
openstackgerritFelix Edel proposed zuul/zuul master: Execute builds via ZooKeeper  https://review.opendev.org/c/zuul/zuul/+/77090215:45
fungiclarkb: oh, thanks, my morning got hijacked by a variety of other problems15:45
felixedelcorvus: Here is the latest version of the "ZooKeeper builds API" stack (https://review.opendev.org/c/zuul/zuul/+/770902/18) which also includes the "build result events via ZooKeeper" change (https://review.opendev.org/c/zuul/zuul/+/782939/3)15:49
*** spotz has joined #zuul15:50
*** jangutter_ has joined #zuul15:58
clarkbcorvus: to run zk locally with the ca I first run zuul/tools/zk-ca.sh $PATH zuul-test-zookeeper then start the zuul-test-zookeeper container out of tools/docker-compose.yaml ?16:01
*** jangutter has quit IRC16:01
guillaumecclarkb, ./tools/test-setup-docker.sh16:02
clarkbguillaumec: cool thanks that helps confirm what I'm doing (I don't need a sql server for the subset of tests I'm running so just getting the zk up is good for me and I basically did what that does)16:05
guillaumecclarkb, i can't count the number of times I forgot to start zk/mysql for tests so I made a simple script on top of that to run tox :)  http://paste.openstack.org/show/804082/  16:08
*** hamalq has joined #zuul16:18
*** hamalq_ has joined #zuul16:19
*** hamalq has quit IRC16:22
openstackgerritSorin Sbârnea proposed zuul/zuul master: WIP: Document tox environments  https://review.opendev.org/c/zuul/zuul/+/76646016:26
*** jcapitao has quit IRC16:40
*** hamalq_ has quit IRC16:41
*** hamalq has joined #zuul16:41
clarkbfungi: now I am confused because my test succeeds rather than fails17:03
fungimaybe there's an extra ingredient we're missing17:06
clarkbya I am beginning to suspect the fake repo state I am feeding in doesn't quite match production17:08
clarkbI *think* that I shouldn't send the refs/changes/01/1/1 on the final event because in that case we're simulating just running off of master17:09
*** jangutter has joined #zuul17:09
clarkbyup now I'm reproducing17:10
clarkbseparately I ran into the re2 wheel is stale problem again, do you know if we can specify a dep-specific rule to not use a cached wheel?17:12
clarkbif so that would be nice to add to zuul's dep for re2  17:12
*** jangutter_ has quit IRC17:13
*** harrymichal has quit IRC17:16
corvuso/ i'm getting started on looking into the memory leak now17:19
fungiclarkb: as in pip doesn't rebuild the wheel but your underlying libs have changed?17:22
fungiyou can pass options to pip to not use its cache, but i'm not aware of any ability to do that on a per-package case unless pip is invoked separately for each package17:22
*** harrymichal has joined #zuul17:23
openstackgerritClark Boylan proposed zuul/zuul master: Update branches after fast forward merges  https://review.opendev.org/c/zuul/zuul/+/78414217:26
clarkbfungi: yes exactly, the .so is different and the wheel that is cached no longer works17:27
clarkbfungi: ^ that tests the issue and fixes it. Running that test without the fix fails and passes with it17:27
fungiclarkb: i don't think the pipists have worked out a solution to that, but might be worth asking on the distutils-sig ml17:27
clarkbfungi: since pip caches with hashed dir paths I just rm the entire wheel cache17:28
clarkbwhich is a bit heavy weight but maybe decent hygiene17:28
fungiclarkb: i upgrade my primary python build fairly often, so have scripted a wipe of the wheel cache any time i do that17:28
mordredone would imagine just adding an soname to the wheel metadata would be helpful - but I'm sure there's a good reason that's not practical17:28
avasscorvus: there's a memory leak?17:45
corvusavass: yep, current master; 4.1.0 should be fine.  http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all  17:45
fungilooked like once we ran out of system memory for caches, and started to push into swap even a little, zk disconnects cropped up17:47
*** nils has quit IRC17:47
fungiwhich was the initial symptom we observed17:47
fungia quick scheduler restart to free up memory, and zk connection was back to steady for us again17:48
avassgood thing I've been waiting to upgrade volvo's zuul just in case of issues with >4.1  17:48
corvusyeah, if you want to help find issues, run master; if you want to be a little more conservative, run a point release since we usually only make those when we've seen production use on opendev just in case there are issues like this17:49
*** vishalmanchanda has quit IRC17:51
fungiit's a relatively slow leak in our deployment, takes at least 24 hours under fairly heavy load to consume the 30gb of ram we've allocated17:51
corvusyeah, there's a good chance most production deployments wouldn't notice it17:52
fungiwhich ends up being ~20gb ram and ~8gb cache and ~2gb buffers and then as we get a few 100mb into swap things begin to go bad17:53
avassI usually make sure we follow master during more stable periods but hold off on upgrading when there are bigger changes like this17:55
corvusgood call17:55
*** avass has quit IRC17:56
*** yourname has joined #zuul17:57
*** yourname is now known as avass17:58
avassdoes anyone know if it's possible to gracefully shut down irssi and save the scrollback somehow? :)18:00
*** avass has quit IRC18:00
*** yourname has joined #zuul18:01
*** yourname is now known as avass18:02
clarkbavass: I think if you set up logging it can reload it?18:02
avassclarkb: oh autolog? I'll look at that18:04
clarkbI've been on weechat for a long time and don't remember how to do irssi18:06
*** avass has quit IRC18:07
*** yourname has joined #zuul18:10
*** yourname is now known as avass18:10
*** avass has quit IRC18:17
*** yourname has joined #zuul18:18
*** yourname is now known as avass18:18
*** avass has quit IRC18:19
*** yourname has joined #zuul18:19
*** yourname has quit IRC18:21
*** avass has joined #zuul18:21
corvus1 hour into most common object count with no results yet18:27
clarkbI think it may have paused things too?18:28
clarkbat least my change above hasn't started jobs last I checked18:28
corvusit's definitely slowed, but i don't think completely stopped18:29
corvuscheck has items enqueued 17m ago18:30
corvus(but obv that's a 17m mini-pause)18:30
corvusi've hashtagged all of the changes between the last 2 restarts with "leak"; those are our suspects18:31
corvusit returned18:32
corvus2021-03-31 18:30:16,464 DEBUG zuul.stack_dump: mappingproxy                16835504 +16835504  18:32
corvusthat's it18:32
clarkbhttps://review.opendev.org/q/hashtag:%22leak%22+(status:open%20OR%20status:merged) if anyone wants a link to those18:32
corvusthat is not what i expected18:33
corvusi would typically read that as objgraph telling us that since startup, the only thing that the scheduler has done is create 16,835,504 mappingproxy objects, and no objects of any other type.18:33
corvusnormally, i would expect to run that again and see the delta, but i'm not sure i'd expect any useful info; what do you think clarkb?18:35
clarkbya I think the only reason to run it again would be to turn off yappi?18:36
corvusgood point; i'll go ahead and do that18:36
corvusi think we may also be in the middle of a reconfiguration18:38
corvuswhich normally pauses queues; with that process being slow due to objgraph, effect is exacerbated18:38
corvusreturned faster this time:18:42
corvus2021-03-31 18:38:35,321 DEBUG zuul.stack_dump: dict                      8690210    +12910  18:42
corvusstill only one type18:42
clarkbcorvus: is it interleaved or otherwise hidden in the logs for some reason?18:42
clarkbthat is really odd though18:43
corvuslet me check i'm not messing up the grep18:43
corvusoh yep that's it; it's a multiline output with only one prefixed line18:43
corvusi wonder why the multiline logging fix didn't handle that18:44
corvusinitial: http://paste.openstack.org/show/804091/  20:46
corvussecond: http://paste.openstack.org/show/804092/  20:46
corvus208 layouts currently18:46
corvusi need to run; biab18:47
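
For reference, a sketch of the kind of objgraph probe producing the "+16835504"-style output pasted above: count live objects by type and log the types that grew since the last sample. This is illustrative, not the exact code wired into zuul's stack-dump handler; logging one line per type also sidesteps the multi-line-output problem corvus notes above.

    import logging
    import objgraph

    log = logging.getLogger('zuul.stack_dump')
    _last_counts = {}

    def log_object_growth(limit=20):
        for name, count in objgraph.most_common_types(limit=limit):
            delta = count - _last_counts.get(name, 0)
            _last_counts[name] = count
            if delta > 0:
                # one log call per type so every line carries the log prefix
                log.debug("%-28s %9d %+9d", name, count, delta)
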
clarkbthat seems like a very high nodeset and job count18:50
clarkbthe executor refactor does involve jobs and nodesets18:58
*** y2kenny has joined #zuul19:00
*** hashar has quit IRC19:08
clarkbI've looked over that one a bit and don't think it substantially changed anything from before. It was largely a refactor19:34
clarkbthe semaphores in zk change does stand out though since it updates how configs/layouts/semaphores are managed19:34
clarkbin particular every layout seems to get a new semaphore object which hangs on to the layout? could it be we've created a loop there preventing things from cleaning up?19:35
clarkbin the past we carried the existing semaphore object forward and only ever had the one per config level semaphore19:35
fungiso any layout with a semaphore is effectively still referenced and the gc can't clean it up?19:35
clarkbmaybe. I'm still trying to understand the change better, and my desktop had a crash during updates :(19:36
clarkbI'm going to complete updates and reboot and get to a happy spot then pull that change back up again19:36
fungiproof that updating software is always a bad idea19:36
clarkbthe thing that stood out to me initially in that is we pass tenant.layout to the semaphore object and the old code that passes semaphores forward across layouts was removed (it's a big old red blob in gerrit)19:38
clarkbhttps://review.opendev.org/c/zuul/zuul/+/772513/9/zuul/scheduler.py is where we would carry things forward before. https://review.opendev.org/c/zuul/zuul/+/772513/9/zuul/configloader.py is where we maybe put things into a loop19:49
clarkbthough I'm noticing this is at a tenant level so we should only have one per tenant?19:50
fungione what?19:50
fungilayout?19:50
clarkbone semaphore handler per tenant19:51
clarkbbut I don't know how frequently zuul recreates tenant objects19:51
clarkbtenant refers to layout and semaphore hander, semaphore hander refers to tenant.layout19:51
clarkbis that sufficient to confuse python gc?19:52
fungidoes that include speculative layouts?19:53
clarkbI think so? the pastes above show ~208 total layouts19:55
clarkbbut maybe not since it see we go Config -> Tenants -> TenantProjectConfigs19:56
clarkband the TenantProjectConfigs are the speculative bits I think19:57
corvusclarkb: yeah, i had observed the same thing and it was at the top of my list of suspects.19:57
corvustheoretically, the gc should be able to clear that sort of thing up... but...19:58
corvusi'm back from lunch and will start trying to see if those layouts are orphaned or really in use19:58
corvuszuul's load has leveled off (we're now running at < capacity) and the memory use is not growing20:00
corvusso we may have some headroom to poke at this before running out of ram; otoh, if we have to restart, we may not see significant leaking again until traffic picks up (could be as late as tuesday?)20:00
corvusi'm going to try really hard not to break it :)20:01
clarkbk20:01
avassupdating my config project bumped my scheduler memory usage from 109mb to 126mb (four commits changing job config), another change bumped it to 128mb, and a third bumped it to 130mb. that 109mb had been stable since I started logging at 19:00  20:02
avasswithout any traffic to that zuul instance before I updated the config project that is :)20:04
fungiyeah, i doubt it's leaking at rest, almost certainly related to activity20:04
fungiso i wouldn't be surprised if there was no increase in memory while the scheduler was idle20:05
avassoh and there were no jobs running for that, it was just pushed directly to master without any project config at all20:08
corvusi've collected the known layouts reachable from the abide; there are 26.  objgraph says there are now 262 in memory.20:09
corvusstarting to look like a layout leak20:10
fungiso 10x more than there should be20:10
corvusthe other ones clarkb called out: nodesets and jobs, right?  those are likely hanging off of layouts20:11
corvusso i think we continue poking at layouts20:11
fungithe graph does look reminiscent of the last layout leak we had too, where the growth rate isn't steady but has some occasional stiff slopes20:11
fungiif you get a stack of changes to job configuration then you can eat a bunch of memory fast, but other times it's just one here or there20:13
clarkbcorvus: ya nodesets and jobs seemed high and hanging off of layouts makes sense20:13
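
A sketch of the comparison corvus just described: collect the layouts zuul should still know about (those reachable from the abide's tenants) and diff that against everything objgraph finds in memory; a large gap means orphaned layouts. The attribute names are assumed from zuul's model and may not match exactly.

    import objgraph

    def find_orphaned_layouts(abide):
        # assumed structure: abide.tenants is a dict of Tenant, each with .layout
        known = {id(t.layout) for t in abide.tenants.values() if t.layout}
        in_memory = objgraph.by_type('Layout')
        return [layout for layout in in_memory if id(layout) not in known]

    # with corvus's numbers above: ~26 known vs ~262 in memory
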
corvusi'm starting to do some more intensive objgraph queries; this could end up causing zk disconnects20:17
corvusi'm doing a backref search on one of the leaked layouts with a depth of 10; i seem to remember last time that was about enough to get the info we needed without running for hours20:23
corvus(objgraph default is 20)20:23
clarkbif it is the loop I wrote down then ~2 should be enough to see it?20:24
clarkbmaybe 3?20:25
corvusyeah, was thinking maybe 5 or so might be a better value;  sort of a tradeoff with not wanting to run it twice.20:26
corvusclarkb: if we're not looking at a GC bug, then it's not actually the loop that's the problem, we'd actually expect the whole loop to be hanging off of something else unexpected, so that could end up needing a couple more steps to identify the cause20:27
corvusand apparently i just said we expect the unexpected20:27
clarkbsoftware is all about resetting expectations20:28
corvuss/expectations//20:28
corvusit's done and there are a lot of zk errors, so i think we just lost the zk connection and will be restarting all the jobs20:30
corvushopefully got good data though20:30
fungii have great expectations20:31
clarkbit should recover from that on its own as long as memory didn't spike too high and stay there20:31
avassI think disconnecting from zk could lead to memory leaks as well, at least that's what happened to us when we needed to increase our zk timeout20:32
clarkbavass: in this case based on logs yesterday I'm fairly sure that wasn't the cause but a side effect, though it may have caused things to get worse as it happened more20:33
clarkbthe memory use started climbing well before the disconnects20:33
fungii could see that creating a feedback loop, yeah20:34
avassclarkb: yeah I'm just saying that the memory usage could increase from dropping connection to zk now20:34
fungibasically you leak something based on what builds are running, zk disconnects cause way more builds to end up running total, leaking more and faster20:34
fungichain reaction20:34
corvusthe backref chain was not helpful; it had no other objects.20:35
tobiashavass: the memleak on zk session loss should have been fixed meanwhile: https://review.opendev.org/c/zuul/zuul/+/751170  20:35
corvusso either i did the query wrong, or the leaked layout has no references and we're looking at a gc issue20:35
tobiashcorvus: do the new changes serialize exception tracebacks?20:36
corvustobiash: yes, they should be strings20:36
corvus(i remember thinking about that in review :)20:37
tobiashour last leak was caused by the traceback creation which also was very hard to see in the object graph20:37
corvuspossible we missed a case20:37
tobiashfix was https://review.opendev.org/c/zuul/zuul/+/751171/1 back then20:37
avasstobiash: oh yeah I think we might have been on an older version back then20:38
corvusrunning it again without the extra_ignore argument this time, in case that messed it up20:40
corvusi wonder if we could patch objgraph and give it a "nice" argument20:41
tobiashcorvus: I wonder if that leaks: https://opendev.org/zuul/zuul/src/branch/master/zuul/scheduler.py#L1292  20:45
tobiashthe traceback fix back then needed to explicitly clear the frames on the traceback: https://review.opendev.org/c/zuul/zuul/+/751171/1/zuul/scheduler.py20:45
tobiashI think we should try the same there20:45
tobiashand there: https://opendev.org/zuul/zuul/src/branch/master/zuul/scheduler.py#L1329  20:45
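
A sketch of the pattern tobiash's earlier fix used and is proposing again here: format the exception to a string and explicitly clear the traceback's frames, so an exception caught in a long-lived loop cannot pin frame locals (and anything reachable from them, such as a layout) in memory. The process/report calls are hypothetical stand-ins.

    import traceback

    def forward_event(event):
        try:
            process(event)                    # hypothetical work that may raise
        except Exception as e:
            # format first, then drop the frames so their locals can be collected
            msg = "".join(traceback.format_exception(type(e), e, e.__traceback__))
            traceback.clear_frames(e.__traceback__)
            report_error(event, msg)          # hypothetical: keep only the string
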
clarkbfungi: unrelated to memory issues https://review.opendev.org/c/zuul/zuul/+/784142 passed testing20:46
corvustobiash: but the traceback obj should be deleted in this case20:47
corvusit wasn't then -- we kept tb around20:47
tobiashwe've read that there are some problems around traceback that produce unintended memleaks20:47
corvus(though even then, tb should have eventually been deleted, so maybe whatever that problem was; it's still relevant)20:47
corvushrm20:47
corvustobiash: were you able to see this with objgraph?20:48
tobiashthe problem is that the traceback references the frames which are not necessary for formatting and those frame objects seem to be some kind of special20:48
corvusmaybe if i run find_backref_chain without the is_proper_module check?20:49
tobiashcorvus: not really, or at least it was very hard20:49
tobiashif the objgraph is deep enough you might see 'frame' objects20:49
tobiashbut we've seen them only within a graph that was too large to be rendered in a useful way20:50
corvusright now, i have zero references to the leaked layout20:50
corvusi re-ran without the extra_ignore, and got a single reference from my local variables20:51
corvusso that's confirming that there are no refs to this layout except from things that objgraph thinks are GC roots20:51
clarkbthe most recent python3.8 release was February 19, I suspect we've been running on that for a while now which makes me think GC bug is less likely20:52
clarkbiirc we updated the base images to pick that up due to the c float thing20:53
openstackgerritTobias Henkel proposed zuul/zuul master: Fix possible memory leak when dealing with exceptions  https://review.opendev.org/c/zuul/zuul/+/78416520:53
tobiashcorvus: we could just try something like this and see if that fixes the leak ^20:54
corvustobiash: i'm not sure that we're going to see a noticable leak again until tuesday if we restart; i really want to try to understand the leak based on what's happening now while we have the chance20:55
corvusif we just guess and we're wrong, we'll spend a lot more time on it.  i agree that's something to try if we can't come up with anything else.20:55
corvusbut when we get down to throwing stuff at the wall, i'd also probably remove the semaphore layout loop too :)20:56
tobiashcan you see if objdump finds objects of type 'frame'?20:56
tobiashoh, I guess I missed the semaphore layout loop in the scrollback20:57
clarkbtobiash: its not a proper loop and should be able to be cleaned20:58
corvusit's just a loop -- it shouldn't be a problem.  so like this, it's something to try when we've excluded everything that should be a problem :)  (the sherlock holmes theory)20:58
corvuseven proper loops should be able to be cleaned20:58
clarkbbut it stood out among the change list of potential problems20:58
clarkbcorvus: it tackles them in multiple passes I think? so old tenant would be cleaned up but then tenant.layout would hang around until semaphore handler was also removed20:59
corvustobiash: re exceptions -- we've seen the layout leak grow over the past few hours, but i don't think we've been submitting many management events during that time, so i don't think there would be many chances for that exception handler to run?21:00
corvustobiash: objgraph found 1948 frames21:00
corvusthey may be paramiko frames21:00
corvusoh, no, these are the right kind of frames21:02
corvusthat's a frame from a paramiko stacktrace21:02
corvusgimme a minute; we don't have less installed on our images :/21:02
clarkbI think more may be on them21:03
clarkb(not as useful but sometimes good enough)21:03
corvusyes, i keep scrolling past line 62321:03
clarkbfungi can probably tell me why less is more useful than more21:03
corvuscause you can scroll up?21:04
fungithat21:04
mordredyup21:04
corvusokay, so the first frame object i found is in an exception handler in _addressfamily_host_lookup in '/usr/local/lib/python3.8/site-packages/paramiko/config.py'21:04
corvusso something is keeping exception frames around21:05
fungii usually prefer view to less, but it can eat a bunch of memory for very large files21:05
avassclarkb: because less is more ;)21:05
fungiavass even knows the tagline21:05
clarkbya I use a lot of view too21:06
corvuswe also don't have view21:06
corvusobjgraph does not think that a frame is a GC root, so it should show up in the backref search21:06
clarkbcorvus: can you backref against the frame and see what refers to it?21:07
corvusyeah, will do in a min;21:07
corvusa lot of frames are kazoo.recipe.watchers; we may want to look at that21:08
corvusi'm poking at frames; i think we can reconstruct a backtrace from a frame by looking at the f_back attr, which is another frame.  so maybe every frame with a null f_back is the top of a stacktrace, then work down from there.21:11
corvusi think if we did that, we'd have a full stacktrace for every frame currently; may be able to identify suspicious code locations21:12
corvuslike if one of those frames had tobiash's fixed lines in it, we might know his change would be effective21:13
clarkbneat21:13
mordred++21:14
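
A sketch of the frame-walking corvus describes (not the exact snippet he ran): take the frame objects objgraph can see, pick the ones no other retained frame calls into, and let the traceback module follow f_back from there to render a readable stack for each, so suspicious code locations stand out.

    import traceback
    import objgraph

    def reassemble_stacks():
        frames = objgraph.by_type('frame')
        callers = {id(f.f_back) for f in frames if f.f_back is not None}
        for f in frames:
            if id(f) not in callers:
                # innermost retained frame of a stack; format_stack() walks
                # f_back up toward the top for us
                print(''.join(traceback.format_stack(f)))
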
corvusi verified that objgraph doesn't think frame.f_locals is a GC root, so if a frame links to a layout via locals, it should still show up in objgraph's backrefs -- assuming that objgraph found the frame or the locals object.  maybe one or the other of those are special and don't show up in the backref gather phase21:15
corvusthat comes from gc.get_referrers; i wonder if frames or frame.f_locals are somehow invisible to that21:19
corvushrm, a quick test shows gc.get_referrers is capable of returning frame objects21:22
corvusi'm going to try to get those stacktraces now21:23
corvusthat half-worked.  i got some stacktraces out of it, but a bunch of frames which didn't have a top.21:40
corvusi wonder what the remainder look like; i'll keep poking21:40
corvustobiash: one of my reassembled stack traces is a single stacktrace with a single frame: <frame at 0x7fdbd4039230, file '/usr/local/lib/python3.8/site-packages/zuul/scheduler.py', line 1234, code process_global_management_queue>21:55
corvusthat line number is wrong, but i'm pretty sure it's pointing at one of your fixes21:56
clarkbprocess_global_management_queue is the wrong function too (though if it uses the line numbers to help determine that, it could be wrong too?)21:57
corvusclarkb: it calls _forward_management_event which is one of tobiash's fixes21:57
clarkboh right, that function calls it, ya that explains it21:57
corvusi'm unsure of the significance of this -- it may suggest the fix is necessary, but i'm having trouble convincing myself that it's responsible for all the leaks since it's only a single frame21:57
corvusmeanwhile, there are hundreds of stacktraces with kazoo tracebacks.  i don't know why we have frames for them since they don't seem to be in exception handlers that call traceback.* functions21:58
corvusmaybe those are just bad luck?  maybe i called objgraph.by_type soon enough after the zk disconnect that the gc just hadn't gc'd all the exceptions that caused?22:01
corvusmaybe if i cleared things out and re-did it, there wouldn't be so many?22:01
corvus(but if that's the case, then that would also explain why the global management frame was in there -- it would be innocent after all)22:02
clarkbthat is an interesting idea22:02
*** fsvsbs has joined #zuul22:34
fsvsbshttps://www.irccloud.com/pastebin/79aPsq33  22:36
clarkbtristanC: uses the operator I think22:37
clarkband may be able to help22:37
tristanCfsvsbs: what is the actual exception?22:39
fsvsbsHi tristanC have you come across "failed to create /var/run/zuul/[container].Pid"?  22:40
fsvsbsLockfile failed22:40
openstackgerritTristan Cacqueray proposed zuul/zuul-operator master: DNM: Testing zuul-operator  https://review.opendev.org/c/zuul/zuul-operator/+/78418022:40
fsvsbshttps://usercontent.irccloud-cdn.com/file/4Yx6rci6/irccloudcapture3460549349975442300.jpg22:41
corvustristanC, fsvsbs: i think the operator overrides the container command, but we should no longer do that.22:42
corvusthe container command is now "zuul-* -f" which is correct for a k8s, but the operator runs "zuul-* -d"22:43
*** harrymichal has quit IRC22:47
openstackgerritTristan Cacqueray proposed zuul/zuul-operator master: Remove command args override  https://review.opendev.org/c/zuul/zuul-operator/+/78418122:49
corvusgc.get_referrers() on one of the leaked layouts returns a dictionary that looks a lot like a __dict__ of a zuul model object; i'm trying to suss out which23:01
corvus(that's the only one of the 4 referrers i can't identify)23:02
corvusit's a QueueItem23:03
corvusso we have the gc module telling us that a QueueItem's __dict__ holds a reference to our leaked layout (but for some reason, this got filtered out of objgraph's chain)23:04
corvusgc has given me the next link -- it is in fact a queueitem23:07
corvusi'm basically doing the objgraph algorithm by hand; i'm not sure why it failed23:07
*** tosky has quit IRC23:30
corvusoh... it's show_backrefs i should be looking at, not find_backref_chain23:32
corvusor i do need to call find_backref_chain but with a higher max_depth until it returns something > 1  23:42
corvusanyway, making progress now; just slowly23:42
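
A sketch of the two objgraph calls being compared here: find_backref_chain walks from the leaked object back toward a GC root (a result of length 1 means it only found the object itself, so retry with a larger max_depth), while show_backrefs dumps the surrounding reference graph to a file for offline study. The object choice and filename are hypothetical.

    import objgraph

    leaked = objgraph.by_type('Layout')[-1]      # stand-in for a leaked layout

    chain, depth = [], 5
    while len(chain) <= 1 and depth <= 30:
        chain = objgraph.find_backref_chain(leaked, objgraph.is_proper_module,
                                            max_depth=depth)
        depth += 5

    objgraph.show_backrefs([leaked], max_depth=5, filename='layout-backrefs.dot')
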
