Wednesday, 2021-06-30

*** ykarel|away is now known as ykarel  04:24
*** sshnaidm is now known as sshnaidm|afk  04:39
*** ysandeep|away is now known as ysandeep  05:10
*** jpena|off is now known as jpena  07:39
*** ykarel is now known as ykarel|lunch  09:24
*** ykarel|lunch is now known as ykarel  10:33
*** sshnaidm|afk is now known as sshnaidm  10:53
*** jpena is now known as jpena|lunch  11:27
<zbr> did we have any recent changes made to the logstash instance? it stopped working a couple of days ago.  11:44
<zbr> http://status.openstack.org/elastic-recheck/ -- apparently ALL_FAILS_QUERY may have become broken?  11:45
<zbr> the interesting bit is that I see "There were no results because no indices were found that match your selected time span" for any timespan *shorter* than 7 days, like 2 days.  11:49
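The symptom zbr describes (no indices matching any recent time span) can be checked directly against the cluster by listing which daily indices actually exist. A minimal sketch in Python; the endpoint URL and the logstash-YYYY.MM.DD daily index naming are assumptions based on common Logstash defaults, not confirmed details of this deployment.

    # List existing daily logstash indices and report which recent days are missing.
    # ES_URL is a hypothetical endpoint; the logstash-YYYY.MM.DD pattern is assumed.
    import datetime
    import requests

    ES_URL = "http://elasticsearch.example.org:9200"

    def existing_daily_indices():
        # _cat/indices with format=json returns one JSON record per index
        resp = requests.get(f"{ES_URL}/_cat/indices/logstash-*?format=json", timeout=30)
        resp.raise_for_status()
        return sorted(row["index"] for row in resp.json())

    def missing_recent_days(days=7):
        have = set(existing_daily_indices())
        today = datetime.date.today()
        expected = [f"logstash-{today - datetime.timedelta(d):%Y.%m.%d}" for d in range(days)]
        return [name for name in expected if name not in have]

    if __name__ == "__main__":
        print("missing daily indices:", missing_recent_days())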
*** jpena|lunch is now known as jpena  12:27
*** ministry is now known as __ministry  13:10
*** ysandeep is now known as ysandeep|afk  13:52
<pmatulis> I got a merge for a doc project but the guide encountered a build problem. Anyone here have visibility into that?  14:10
<pmatulis> https://review.opendev.org/c/openstack/charm-deployment-guide/+/798273  14:15
<clarkb> zbr: we have made zero changes as far as I know; the whole thing is in deep hibernation because it needs major updates  14:40
<clarkb> zbr: if there are no results for data less than 7 days old then the processing pipeline has probably crashed  14:42
*** ysandeep|afk is now known as ysandeep  14:43
<zbr> probably crashed somewhere between 2 and 7 days ago, based on what I see.  14:49
<pmatulis> can we fire it up again? :)  14:53
*** ysandeep is now known as ysandeep|away  15:02
*** ykarel is now known as ykarel|away  15:08
<clarkb> pmatulis: I believe your issue is distinct from the one zbr has pointed out  15:28
<clarkb> fungi: ^ I suspect pmatulis is hitting a problem similar to the one you are debugging for cinder-specs  15:30
<clarkb> fungi: do we have a writeup on that we can point people to? or maybe prior art for the type of fixup we want to use for that? I think all of that got finalized while I was out the other week  15:30
<clarkb> pmatulis: your issue is being debugged in #opendev in another context (cinder-specs)  15:33
<clarkb> but I believe they are the same underlying problem  15:33
<pmatulis> ok, looking over there  15:33
<zbr> clarkb: I am aware that the email thread related to logstash ended in limbo; since I really doubt the infra team will have time to address it, I wonder if I could take over and attempt a manual upgrade  15:33
<clarkb> zbr: we are happy for someone else to run such a system, but we wouldn't keep running the resources for it if it isn't managed through our normal systems  15:35
<clarkb> (and even then it's questionable whether we should keep running the system at all due to its large resource consumption)  15:35
<fungi> it needs to be redesigned with more modern technologies and in a more efficient manner  15:36
<zbr> imho the large resource consumption is related to its outdated state; we could likely scale it down and run it more efficiently  15:36
<fungi> it accounts for half of all the RAM consumed by our control plane servers today  15:36
<clarkb> zbr: I don't know that that is actually the case  15:36
<clarkb> we feed it upwards of a billion records a day. It's a huge database  15:36
<clarkb> upgrading elasticsearch will hopefully make that better resource-wise, but not majorly  15:37
<zbr> I have no desire to manage it manually long term, only to perform the upgrade manually and have playbooks to keep it running afterwards.  15:37
<clarkb> But to answer your question, no, we wouldn't hand over the resources for manual management. If people want to work to update the systems through our configuration management we can discuss what that would look like  15:37
<zbr> as we said, the RDO team already has playbooks to deploy it, so I should be able to reuse some of them  15:37
<clarkb> but also anyone can run such a system if they choose, and do so manually or not. The data is public  15:38
<clarkb> This is our preference, since we want to avoid getting saddled with this same situation in a few years when it gets outdated again and there isn't sufficient aid for keeping it running  15:38
<fungi> similar to stackalytics, we tried (and failed) to get that moved into our infrastructure at one point, but ultimately there's nothing it needs privileged access to so any organization can run it for us on the public internet  15:39
<fungi> and in so doing, run it however they wish  15:39
<zbr> clarkb: how often did you successfully upgrade a distro using configuration management?  15:41
<clarkb> zbr: we've done it multiple times recently with other services.  15:41
<clarkb> review is in the process of getting similar treatment  15:42
<clarkb> (but I did all of zuul + nodepool + zookeeper as an example)  15:42
<zbr> my lucky guess is that re-deploying from zero using current versions is likely less effort than an attempted upgrade, clarkb WDYT?  15:45
<clarkb> zbr: when we say "upgrade" we typically mean redeploy on newer software and migrate state/data as necessary. In the case of elasticsearch I don't think we would bother migrating any data because it is already highly ephemeral. However, you cannot simply redeploy using the current software or configuration management, which is why we are in the position we are in  15:47
<clarkb> it needs to be redone  15:47
<zbr> I wonder if a slimmed-down version that runs on a single host would not be ok. if I remember correctly that is what RDO has.  15:49
<clarkb> if there are people interested in figuring that out we are willing to discuss what it looks like and whether or not it is possible for us to keep hosting it with that work done. But it is hard to have that conversation without volunteers and a bit of info on resource sizing and likely future upgrade paths  15:49
<clarkb> zbr: if that can handle the data thrown at it I don't see why not. History says this is unlikely to be the case though  15:49
<zbr> maybe we should be very picky about how much data we allow it to be fed, setting fixed limits per job.  15:50
<clarkb> again, all of that is possible, but it requires someone to do the work. That isn't something that exists already (and is actually somewhat difficult to design in a fair way)  15:51
<zbr> instead of attempting to index as much as possible, we should try to index only what we consider meaningful.  15:51
<clarkb> for the record, we already discard all debug level logs  15:51
<clarkb> if we didn't, our input size would be something like 10x  15:51
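A minimal sketch of the kind of pre-filter clarkb is describing here, dropping DEBUG lines before they ever reach the indexers. The regex assumes the usual oslo.log line layout (timestamp, pid, level, module); the real workers in this pipeline may implement this differently.

    # Drop DEBUG-level lines before submitting logs for indexing.
    # The line format matched here is an assumption, not the pipeline's actual parser.
    import re

    LOG_LINE = re.compile(r"^\S+ \S+ \d+ (?P<level>[A-Z]+) ")

    def keep_line(line: str) -> bool:
        match = LOG_LINE.match(line)
        if match is None:
            return True  # lines that don't look like oslo.log output pass through
        return match.group("level") != "DEBUG"

    def filter_log(lines):
        # lazily yield only the lines worth indexing
        return (line for line in lines if keep_line(line))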
<zbr> indeed, getting the design on this right is very hard; in fact I am more inclined to believe it is the kind of task that can easily become a "full time job"  15:52
<fungi> zbr: what was pitched to the board in the meeting a few hours ago is that it's at least two full-time jobs minimum, probably more  15:53
<clarkb> fungi: I had tried to stay awake for the board meeting but was falling asleep on the couch well before it started :/  15:54
<clarkb> so I gave up and went to bed  15:54
<fungi> the tc made a plea to the openinfra foundation board of directors to find some people within their organizations who can be dedicated to running a replacement for this  15:54
<fungi> the board members are going to mull it over and hopefully try to find people  15:54
<clarkb> right, as far as headcount goes you probably need ~1 person just to keep the massive database happy. Then you also need another person feeding the machine that "scans" the database  15:55
<clarkb> you can share the responsibilities, but when looked at from an effort perspective the combination (and one is not useful without the other) should be ~2 individuals  15:55
<clarkb> elasticsearch02 and elasticsearch06 were not running elasticsearch processes. I have restarted the processes there  16:11
<clarkb> we also ended up with tons of extra shards because elasticsearch02 is where we run the cleanup cron. The cleanup cron has been manually triggered by me now. I'll check on the indexers shortly  16:11
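A rough sketch of what a cleanup job like the one just re-triggered might do: delete daily logstash indices that fall outside a retention window. The 10-day retention, endpoint URL, and index naming pattern are illustrative assumptions, not the actual cron job's settings.

    # Delete daily logstash indices older than an assumed retention window.
    import datetime
    import requests

    ES_URL = "http://elasticsearch.example.org:9200"  # hypothetical endpoint
    RETENTION_DAYS = 10                               # assumed retention window

    def cleanup_old_indices():
        today = datetime.date.today()
        resp = requests.get(f"{ES_URL}/_cat/indices/logstash-*?format=json", timeout=30)
        resp.raise_for_status()
        for row in resp.json():
            name = row["index"]
            try:
                day = datetime.datetime.strptime(name, "logstash-%Y.%m.%d").date()
            except ValueError:
                continue  # skip anything not matching the daily naming pattern
            if (today - day).days > RETENTION_DAYS:
                requests.delete(f"{ES_URL}/{name}", timeout=60).raise_for_status()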
<fungi> with two cluster members out, did we end up corrupting the entire existing dataset?  16:12
<clarkb> fungi: I think we just stopped properly updating it at that time, yeah  16:13
<clarkb> (also looks like our indexing bug that produces timestamps far in the future is resulting in a number of new indexes we don't want)  16:13
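One way to spot the unwanted indices from that timestamp bug is to flag any daily index dated after today, since those can only come from bad event timestamps. This reuses the same assumed endpoint and naming pattern as the sketches above.

    # Report daily indices whose date lies in the future (a symptom of the
    # bad-timestamp indexing bug). ES_URL is a hypothetical endpoint.
    import datetime
    import requests

    ES_URL = "http://elasticsearch.example.org:9200"

    def future_dated_indices():
        today = datetime.date.today()
        resp = requests.get(f"{ES_URL}/_cat/indices/logstash-*?format=json", timeout=30)
        resp.raise_for_status()
        suspect = []
        for row in resp.json():
            try:
                day = datetime.datetime.strptime(row["index"], "logstash-%Y.%m.%d").date()
            except ValueError:
                continue
            if day > today:
                suspect.append(row["index"])
        return suspect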
<clarkb> geard was not running on logstash.o.o and has been restarted (it probably crashed once its queues got too large since nothing was reducing their size)  16:14
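geard implements the standard Gearman protocol, so its admin "status" command shows how deep each function's queue has grown, which is the kind of runaway growth suspected here. A small sketch; the host name is hypothetical and 4730 is just the default Gearman port.

    # Query a gearman server's admin interface and print per-function queue depths.
    import socket

    def gearman_status(host="logstash.example.org", port=4730):
        with socket.create_connection((host, port), timeout=10) as sock:
            sock.sendall(b"status\n")
            data = b""
            while not data.endswith(b".\n"):  # admin responses end with a lone "."
                chunk = sock.recv(4096)
                if not chunk:
                    break
                data += chunk
        # each line: function<TAB>queued<TAB>running<TAB>available_workers
        for line in data.decode().splitlines():
            if line != ".":
                print(line)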
<clarkb> looks like some of the indexers actually stayed up but many did not. I'm getting those restarted as well  16:16
<clarkb> indexers that were not running have been started. That was the vast majority (probably 80%)  16:29
*** jpena is now known as jpena|off  16:49
<opendevreview> Jeremy Stanley proposed openstack/project-config master: Correct targets for afsdocs_secret-deploy-guide  https://review.opendev.org/c/openstack/project-config/+/798932  16:54
*** anbanerj|rover is now known as frenzy_friday  17:07
<opendevreview> Jeremy Stanley proposed openstack/project-config master: Correct branch path in afsdocs_secret-releasenotes  https://review.opendev.org/c/openstack/project-config/+/798938  17:45
<opendevreview> Merged openstack/project-config master: Correct targets for afsdocs_secret-deploy-guide  https://review.opendev.org/c/openstack/project-config/+/798932  18:01
*** gfidente is now known as gfidente|afk  18:11
<clarkb> bike rides always produce great ideas: re resource usage, we can do an experiment and shut down half the processing pipeline to see if we still need it all  19:29
<clarkb> for the elasticsearch servers themselves, cacti gives us an idea of how much free space we have if we want to scale those down. I looked not too long ago and we can't easily scale those down due to disk space requirements  19:29
<clarkb> for the indexer pipeline we can monitor the delay in indexing things (e-r reports this)  19:30
<clarkb> for the elasticsearch servers, if others want to watch it via cacti: we have a total of 6TB of storage across the cluster, 1TB per server. Because we run with a single replica we need to keep 1TB of free disk headroom to accommodate losing any one member of the cluster  19:30
<clarkb> that means our real ceiling there is 5TB even though we have 6TB total  19:31
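The same headroom reasoning as a tiny worked calculation, using the figures clarkb gives above (6 nodes at 1TB each, one replica, so the cluster has to be able to absorb losing any single node):

    # Capacity ceiling for the cluster described above.
    NODES = 6
    DISK_PER_NODE_TB = 1

    total_tb = NODES * DISK_PER_NODE_TB  # 6 TB raw across the cluster
    headroom_tb = DISK_PER_NODE_TB       # keep one node's worth of space free
    usable_tb = total_tb - headroom_tb   # effective ceiling: 5 TB

    print(f"raw={total_tb}TB usable ceiling={usable_tb}TB")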
<opendevreview> Merged openstack/project-config master: Correct branch path in afsdocs_secret-releasenotes  https://review.opendev.org/c/openstack/project-config/+/798938  19:42
<clarkb> fungi: zbr ^ does turning off half of the indexer pipeline and doing a binary search for what we need from there make sense to you? If so I can start that  19:46
<opendevreview> Kendall Nelson proposed openstack/ptgbot master: Add Container Image Build  https://review.opendev.org/c/openstack/ptgbot/+/798025  20:35

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!