Wednesday, 2021-06-30

*** ykarel|away is now known as ykarel  04:24
*** sshnaidm is now known as sshnaidm|afk  04:39
*** ysandeep|away is now known as ysandeep  05:10
*** jpena|off is now known as jpena  07:39
*** ykarel is now known as ykarel|lunch  09:24
*** ykarel|lunch is now known as ykarel  10:33
*** sshnaidm|afk is now known as sshnaidm  10:53
*** jpena is now known as jpena|lunch  11:27
<zbr> did we have any recent changes made to the logstash instance? it stopped working a couple of days ago.  11:44
<zbr> http://status.openstack.org/elastic-recheck/ -- apparently ALL_FAILS_QUERY may have become broken?  11:45
<zbr> the interesting bit is that I see "There were no results because no indices were found that match your selected time span" for any timespan *shorter* than 7 days, like 2 days.  11:49
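The symptom zbr describes (no indices matching any recent time span) can be checked directly against the cluster by listing which daily indices actually exist. A minimal sketch in Python; the endpoint URL and the logstash-YYYY.MM.DD daily index naming are assumptions based on common Logstash defaults, not confirmed details of this deployment.

    # List existing daily logstash indices and report which recent days are missing.
    # ES_URL is a hypothetical endpoint; the logstash-YYYY.MM.DD pattern is assumed.
    import datetime
    import requests

    ES_URL = "http://elasticsearch.example.org:9200"

    def existing_daily_indices():
        # _cat/indices with format=json returns one JSON record per index
        resp = requests.get(f"{ES_URL}/_cat/indices/logstash-*?format=json", timeout=30)
        resp.raise_for_status()
        return sorted(row["index"] for row in resp.json())

    def missing_recent_days(days=7):
        have = set(existing_daily_indices())
        today = datetime.date.today()
        expected = [f"logstash-{today - datetime.timedelta(d):%Y.%m.%d}" for d in range(days)]
        return [name for name in expected if name not in have]

    if __name__ == "__main__":
        print("missing daily indices:", missing_recent_days())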
*** jpena|lunch is now known as jpena  12:27
*** ministry is now known as __ministry  13:10
*** ysandeep is now known as ysandeep|afk  13:52
<pmatulis> I got a merge for a doc project but the guide encountered a build problem. Anyone here have visibility into that?  14:10
<pmatulis> https://review.opendev.org/c/openstack/charm-deployment-guide/+/798273  14:15
<clarkb> zbr: we have made zero changes as far as I know; the whole thing is in deep hibernation because it needs major updates  14:40
<clarkb> zbr: if there are no results for data less than 7 days old then the processing pipeline has probably crashed  14:42
*** ysandeep|afk is now known as ysandeep  14:43
<zbr> probably crashed somewhere between 2 and 7 days ago, based on what I see.  14:49
<pmatulis> can we fire it up again? :)  14:53
*** ysandeep is now known as ysandeep|away  15:02
*** ykarel is now known as ykarel|away  15:08
<clarkb> pmatulis: I believe your issue is distinct from the one zbr has pointed out  15:28
<clarkb> fungi: ^ I suspect pmatulis is hitting a problem similar to the one you are debugging for cinder-specs  15:30
<clarkb> fungi: do we have a writeup on that we can point people to? or maybe prior art for the type of fixup we want to use for that? I think all of that got finalized while I was out the other week  15:30
<clarkb> pmatulis: your issue is being debugged in #opendev in another context (cinder-specs)  15:33
<clarkb> but I believe they are the same underlying problem  15:33
<pmatulis> ok, looking over there  15:33
<zbr> clarkb: I am aware that the email thread related to logstash ended in limbo; since I really doubt the infra team will have time to address it, I wonder if I could take over and attempt a manual upgrade  15:33
<clarkb> zbr: we are happy for someone else to run such a system, but we wouldn't keep running the resources for it if it isn't managed through our normal systems  15:35
<clarkb> (and even then it's questionable whether we should keep running the system at all due to its large resource consumption)  15:35
<fungi> it needs to be redesigned with more modern technologies and in a more efficient manner  15:36
<zbr> imho the large resource consumption is related to its outdated state; we could likely scale it down and run it more efficiently  15:36
<fungi> it accounts for half of all the RAM consumed by our control plane servers today  15:36
<clarkb> zbr: I don't know that that is actually the case  15:36
<clarkb> we feed it upwards of a billion records a day. It's a huge database  15:36
<clarkb> upgrading elasticsearch will hopefully make that better resource-wise, but not majorly  15:37
<zbr> I have no desire to manage it manually long term, only to perform the upgrade manually and have playbooks to keep it running afterwards.  15:37
<clarkb> But to answer your question, no, we wouldn't hand over the resources for manual management. If people want to work to update the systems through our configuration management we can discuss what that would look like  15:37
<zbr> as we said, the RDO team already has playbooks to deploy it, so I should be able to reuse some of them  15:37
<clarkb> but also anyone can run such a system if they choose, and do so manually or not. The data is public  15:38
<clarkb> This is our preference, since we want to avoid getting saddled with this same situation in a few years when it gets outdated again and there isn't sufficient aid for keeping it running  15:38
<fungi> similar to stackalytics, we tried (and failed) to get that moved into our infrastructure at one point, but ultimately there's nothing it needs privileged access to so any organization can run it for us on the public internet  15:39
<fungi> and in so doing, run it however they wish  15:39
<zbr> clarkb: how often did you successfully upgrade a distro using configuration management?  15:41
<clarkb> zbr: we've done it multiple times recently with other services.  15:41
<clarkb> review is in the process of getting similar treatment  15:42
<clarkb> (but I did all of zuul + nodepool + zookeeper as an example)  15:42
<zbr> my lucky guess is that re-deploying from zero using current versions is likely less effort than an attempted upgrade, clarkb WDYT?  15:45
<clarkb> zbr: when we say "upgrade" we typically mean redeploy on newer software and migrate state/data as necessary. In the case of elasticsearch I don't think we would bother migrating any data because it is already highly ephemeral. However, you cannot simply redeploy using the current software or configuration management, which is why we are in the position we are in  15:47
<clarkb> it needs to be redone  15:47
<zbr> I wonder if a slimmed-down version that runs on a single host would not be ok. if I remember correctly that is what RDO has.  15:49
<clarkb> if there are people interested in figuring that out we are willing to discuss what it looks like and whether or not it is possible for us to keep hosting it with that work done. But it is hard to have that conversation without volunteers and a bit of info on resource sizing and likely future upgrade paths  15:49
<clarkb> zbr: if that can handle the data thrown at it I don't see why not. History says this is unlikely to be the case though  15:49
<zbr> maybe we should be very picky about how much data we allow it to be fed, setting fixed limits per job.  15:50
<clarkb> again, all of that is possible, but it requires someone to do the work. That isn't something that exists already (and is actually somewhat difficult to design in a fair way)  15:51
<zbr> instead of attempting to index as much as possible, we should try to index only what we consider meaningful.  15:51
<clarkb> for the record, we already discard all debug level logs  15:51
<clarkb> if we didn't, our input size would be something like 10x  15:51
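A minimal sketch of the kind of pre-filter clarkb is describing here, dropping DEBUG lines before they ever reach the indexers. The regex assumes the usual oslo.log line layout (timestamp, pid, level, module); the real workers in this pipeline may implement this differently.

    # Drop DEBUG-level lines before submitting logs for indexing.
    # The line format matched here is an assumption, not the pipeline's actual parser.
    import re

    LOG_LINE = re.compile(r"^\S+ \S+ \d+ (?P<level>[A-Z]+) ")

    def keep_line(line: str) -> bool:
        match = LOG_LINE.match(line)
        if match is None:
            return True  # lines that don't look like oslo.log output pass through
        return match.group("level") != "DEBUG"

    def filter_log(lines):
        # lazily yield only the lines worth indexing
        return (line for line in lines if keep_line(line))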
<zbr> indeed, getting the design on this right is very hard; in fact I am more inclined to believe it is the kind of task that can easily become a "full time job"  15:52
<fungi> zbr: what was pitched to the board in the meeting a few hours ago is that it's at least two full-time jobs minimum, probably more  15:53
<clarkb> fungi: I had tried to stay awake for the board meeting but was falling asleep on the couch well before it started :/  15:54
<clarkb> so I gave up and went to bed  15:54
<fungi> the tc made a plea to the openinfra foundation board of directors to find some people within their organizations who can be dedicated to running a replacement for this  15:54
<fungi> the board members are going to mull it over and hopefully try to find people  15:54
<clarkb> right, as far as headcount goes you probably need ~1 person just to keep the massive database happy. Then you also need another person feeding the machine that "scans" the database  15:55
<clarkb> you can share the responsibilities, but when looked at from an effort perspective the combination (and one is not useful without the other) should be ~2 individuals  15:55
<clarkb> elasticsearch02 and elasticsearch06 were not running elasticsearch processes. I have restarted the processes there  16:11
<clarkb> we also ended up with tons of extra shards because elasticsearch02 is where we run the cleanup cron. The cleanup cron has been manually triggered by me now. I'll check on the indexers shortly  16:11
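A rough sketch of what a cleanup job like the one just re-triggered might do: delete daily logstash indices that fall outside a retention window. The 10-day retention, endpoint URL, and index naming pattern are illustrative assumptions, not the actual cron job's settings.

    # Delete daily logstash indices older than an assumed retention window.
    import datetime
    import requests

    ES_URL = "http://elasticsearch.example.org:9200"  # hypothetical endpoint
    RETENTION_DAYS = 10                               # assumed retention window

    def cleanup_old_indices():
        today = datetime.date.today()
        resp = requests.get(f"{ES_URL}/_cat/indices/logstash-*?format=json", timeout=30)
        resp.raise_for_status()
        for row in resp.json():
            name = row["index"]
            try:
                day = datetime.datetime.strptime(name, "logstash-%Y.%m.%d").date()
            except ValueError:
                continue  # skip anything not matching the daily naming pattern
            if (today - day).days > RETENTION_DAYS:
                requests.delete(f"{ES_URL}/{name}", timeout=60).raise_for_status()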
<fungi> with two cluster members out, did we end up corrupting the entire existing dataset?  16:12
<clarkb> fungi: I think we just stopped properly updating it at that time, yeah  16:13
<clarkb> (also looks like our indexing bug that produces timestamps far in the future is resulting in a number of new indexes we don't want)  16:13
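One way to spot the unwanted indices from that timestamp bug is to flag any daily index dated after today, since those can only come from bad event timestamps. This reuses the same assumed endpoint and naming pattern as the sketches above.

    # Report daily indices whose date lies in the future (a symptom of the
    # bad-timestamp indexing bug). ES_URL is a hypothetical endpoint.
    import datetime
    import requests

    ES_URL = "http://elasticsearch.example.org:9200"

    def future_dated_indices():
        today = datetime.date.today()
        resp = requests.get(f"{ES_URL}/_cat/indices/logstash-*?format=json", timeout=30)
        resp.raise_for_status()
        suspect = []
        for row in resp.json():
            try:
                day = datetime.datetime.strptime(row["index"], "logstash-%Y.%m.%d").date()
            except ValueError:
                continue
            if day > today:
                suspect.append(row["index"])
        return suspect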
<clarkb> geard was not running on logstash.o.o and has been restarted (it probably crashed once its queues got too large since nothing was reducing their size)  16:14
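geard implements the standard Gearman protocol, so its admin "status" command shows how deep each function's queue has grown, which is the kind of runaway growth suspected here. A small sketch; the host name is hypothetical and 4730 is just the default Gearman port.

    # Query a gearman server's admin interface and print per-function queue depths.
    import socket

    def gearman_status(host="logstash.example.org", port=4730):
        with socket.create_connection((host, port), timeout=10) as sock:
            sock.sendall(b"status\n")
            data = b""
            while not data.endswith(b".\n"):  # admin responses end with a lone "."
                chunk = sock.recv(4096)
                if not chunk:
                    break
                data += chunk
        # each line: function<TAB>queued<TAB>running<TAB>available_workers
        for line in data.decode().splitlines():
            if line != ".":
                print(line)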
<clarkb> looks like some of the indexers actually stayed up but many did not. I'm getting those restarted as well  16:16
<clarkb> indexers that were not running have been started. That was the vast majority (probably 80%)  16:29
*** jpena is now known as jpena|off  16:49
<opendevreview> Jeremy Stanley proposed openstack/project-config master: Correct targets for afsdocs_secret-deploy-guide  https://review.opendev.org/c/openstack/project-config/+/798932  16:54
*** anbanerj|rover is now known as frenzy_friday  17:07
<opendevreview> Jeremy Stanley proposed openstack/project-config master: Correct branch path in afsdocs_secret-releasenotes  https://review.opendev.org/c/openstack/project-config/+/798938  17:45
<opendevreview> Merged openstack/project-config master: Correct targets for afsdocs_secret-deploy-guide  https://review.opendev.org/c/openstack/project-config/+/798932  18:01
*** gfidente is now known as gfidente|afk  18:11
<clarkb> bike rides always produce great ideas: re resource usage, we can do an experiment and shut down half the processing pipeline to see if we still need it all  19:29
<clarkb> for the elasticsearch servers themselves, cacti gives us an idea of how much free space we have if we want to scale those down. I looked not too long ago and we can't easily scale those down due to disk space requirements  19:29
<clarkb> for the indexer pipeline we can monitor the delay in indexing things (e-r reports this)  19:30
<clarkb> for the elasticsearch servers, if others want to watch it via cacti: we have a total of 6TB of storage across the cluster, 1TB per server. Because we run with a single replica we need to keep 1TB of free disk headroom to accommodate losing any one member of the cluster  19:30
<clarkb> that means our real ceiling there is 5TB even though we have 6TB total  19:31
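The same headroom reasoning as a tiny worked calculation, using the figures clarkb gives above (6 nodes at 1TB each, one replica, so the cluster has to be able to absorb losing any single node):

    # Capacity ceiling for the cluster described above.
    NODES = 6
    DISK_PER_NODE_TB = 1

    total_tb = NODES * DISK_PER_NODE_TB  # 6 TB raw across the cluster
    headroom_tb = DISK_PER_NODE_TB       # keep one node's worth of space free
    usable_tb = total_tb - headroom_tb   # effective ceiling: 5 TB

    print(f"raw={total_tb}TB usable ceiling={usable_tb}TB")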
<opendevreview> Merged openstack/project-config master: Correct branch path in afsdocs_secret-releasenotes  https://review.opendev.org/c/openstack/project-config/+/798938  19:42
<clarkb> fungi: zbr ^ does turning off half of the indexer pipeline and doing a binary search for what we need from there make sense to you? If so I can start that  19:46
<opendevreview> Kendall Nelson proposed openstack/ptgbot master: Add Container Image Build  https://review.opendev.org/c/openstack/ptgbot/+/798025  20:35

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!