Wednesday, 2023-12-13

*** jph5 is now known as jph108:19
*** jph1 is now known as jph08:20
dpawlikfungi: so normally we can downscale the OpenSearch cluster. Some time ago we disabled DEBUG messages, so 25% of the cluster is not used (I mean disk space)08:53
dpawlikfungi: I can remove a few visualizations. Maybe someone is periodically watching them, or maybe some of them have a wrong query that takes more resources than expected08:54
dpawlikfungi: I don't have privileges on AWS to say something more. My account is very very limited.08:54
*** elodilles_pto is now known as elodilles09:06
fungidpawlik: yeah, it doesn't look like we're short on disk space or heavy query volume, it looks like that gets covered by the opensearch line item in the bill which hasn't really changed. the more i think about it, the more likely it is that the volume of log lines getting fed to that logstash.logs.openstack.org endpoint is where all this fargate service usage is13:59
fungido you have any visibility into how much data the job log processor you manage has been shipping to it, and whether that's increased significantly in the past few months?14:00
dpawlikchecking14:08
dpawlikso I don't have any statistics in Logsender on how much data it sends, but I see the stats for "Searchable documents" in the AWS OpenSearch cluster health14:14
dpawlikand it looks normal14:14
dpawlikbut when I check other metrics in cluster health, one is strange14:14
dpawlikJVM garbage collection => Young collection and Young collection time => it always grows, never goes down 14:15
dpawlikmaybe we can try an OpenSearch upgrade from 2.5 (current one) to 2.1014:16
dpawlikso all services would be restarted. Should help14:16
dpawlikand if yes, we can send AWS feedback14:16
jrosserisn't fargate generally more expensive than regular ec2 instances14:17
dpawlikgot something. So checking "instance health" for the last 1h: JVM memory pressure for the selected instance is currently 50.6, the minimum value for all instances is 26.8, the maximum value for all instances is 74.2, and the min/max range for the selected instance is 26.8 to 74.2. 14:18
dpawlikwhen I check the last 2 weeks, it shows 73.914:19
fungidpawlik: if it's that we're sending a ton of data to the log ingestion system that's getting filtered out/discarded, that could be increasing processing and memory there without being reflected in the amount of data in opensearch (if it's data that's getting discarded and not indexed into opensearch)14:19
dpawliksame with data nodes. The average value from the last 2 weeks is higher than the one from 1 hour ago14:20
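(The JVM pressure and GC figures above come from the managed-service console; a rough sketch of pulling similar JVM stats straight from the cluster's nodes-stats API follows. The endpoint, credentials, and whether the managed domain even exposes this API are assumptions, not taken from the log.)

    # Sketch: query the OpenSearch nodes-stats API for JVM heap and young-GC
    # counters, roughly the same figures quoted from the cluster-health view.
    # Endpoint and credentials are placeholders; the managed domain may restrict
    # which APIs are reachable.
    import requests

    resp = requests.get(
        "https://opensearch.logs.openstack.org/_nodes/stats/jvm",
        auth=("admin", "changeme"),  # hypothetical credentials
        timeout=30,
    )
    resp.raise_for_status()
    for node_id, node in resp.json()["nodes"].items():
        jvm = node["jvm"]
        young = jvm["gc"]["collectors"]["young"]
        print(node.get("name", node_id),
              "heap_used_percent:", jvm["mem"]["heap_used_percent"],
              "young_gc_count:", young["collection_count"],
              "young_gc_time_ms:", young["collection_time_in_millis"])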
fungijrosser: yes, i think that's part of the problem. the contractor who developed the log ingestion system for this wanted it to be entirely stateless so it would require no maintenance14:20
dpawlikoh my14:21
jrosserthat might be like doubling the cost vs. a long term discount on an ec214:21
fungiso there are no virtual servers, no container management, it's in fargate so that there won't need to be a sysadmin basically14:21
jrosserright14:22
fungibut it also means when it blows up in utilization there's nobody to look into why14:22
dpawlikI don't have enough privileges to see more, maybe you or clarkb do14:24
dpawlikfrom the logscraper/logsender side: everything is working normally14:24
dpawlikI will increase the "sleep" time for logsender, because it does not need to run every 5 seconds (but that parameter was set 1 year ago)14:25
fungii'm not sure we have that access either, i can't even trigger the ssl cert renewals for opensearch, but i'll see if it looks like i have access to something in that regard14:26
dpawlikfungi: first you need to do some AWS certification to touch something there <I'm joking> 14:28
dpawlikon the other hand, the whole situation started in September 2023, right?14:28
dpawlikor last few weeks?14:28
dpawlikor they did not mention anything 14:29
fungidpawlik: ttx was looking at the bill, since aws switched our free credits from yearly to monthly (exceeding the monthly allocation will now result in overage charges). these were the numbers he reported:14:32
fungiAWS Fargate (September): Memory=31893 Gb.hours, vCPU=15946 vCPU.hours14:33
fungiAWS Fargate (November): Memory=58906 Gb.hours, vCPU=29453 vCPU.hours14:33
fungithe billing interface also lists usage for "Amazon OpenSearch Service ESDomain" (general purpose provisioned storage and m6g.xlarge.search instance hours)14:35
fungithat's separate from the fargate stuff, which is what leads me to believe the log ingestion system tom developed is what's responsible for the separate fargate charges14:35
fungihe also noted, "Looking at December, on day 12 we are at 21833 GB.hours and 10916 vCPU.hours" (for fargate charges), so consistent with the usage we saw for november if you project it out through the end of the month14:38
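(As a quick check of that projection, a linear extrapolation of the day-12 figures does land close to November's totals; a minimal sketch, using only the numbers quoted above:)

    # Sketch: linear extrapolation of December Fargate usage from the day-12
    # figures quoted above, assuming a constant daily rate over a 31-day month.
    mem_gb_hours_day12 = 21833
    vcpu_hours_day12 = 10916
    days_elapsed, days_in_month = 12, 31

    projected_mem = mem_gb_hours_day12 / days_elapsed * days_in_month
    projected_vcpu = vcpu_hours_day12 / days_elapsed * days_in_month

    print(f"projected Memory: {projected_mem:.0f} GB.hours")    # ~56400, vs 58906 in November
    print(f"projected vCPU:   {projected_vcpu:.0f} vCPU.hours")  # ~28200, vs 29453 in November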
dpawlikso to understand the whole situation: the AWS fargate is a service available at the address https://opensearch.logs.openstack.org and it later redirects the traffic to the opensearch host logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com, right?14:43
fungiwe have a logstash.logs.openstack.org cname to logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com which i think is related to the fargate resources14:46
fungiis that where the ci log processor sends its data?14:47
fungii think opensearch.logs.openstack.org goes to the opensearch service, which is billed separately from fargate usage14:47
dpawlikit sends to: opensearch.logs.openstack.org14:51
fungii've signed into my aws account and am poking around now, though most of the panels say "access denied"14:51
dpawlikyay :D14:51
dpawlikso you know my pain14:51
fungiin particular, i do not have access to "applications" or "cost and usage"14:52
fungii also don't have access to "security" so probably can't change that14:52
dpawlikoh my. I was hoping that you and clarkb have access, according to Reed's messages 14:53
fungigranted there was never any expectation that i would be responsible for managing any of this, so the lack of access is expected14:53
dpawlikbut not you Reed :P14:53
fungioh, his name was reed not tom. i kept saying tom14:54
fungidunno why i had that name stuck in my head14:54
dpawlikhehe14:54
dpawlikhis name was Reed14:54
fungi(not to be confused with the reed in this channel, who is someone else entirely)14:54
dpawlikoh my, Reed you will be pinged a few times :P14:54
reedyes :)14:54
fungisorry!14:54
dpawlikfungi: I'm not able to reach OpenSearch (logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com)  from the logscraper host 14:56
dpawlikso all the traffic needs to go through opensearch.logs.openstack.org14:56
dpawlikor something needs to be changed14:56
dpawlikor I need to go get some AWS cert14:56
ttxI have access to the top IAM account so I can probably grant rights (just have no idea how)14:56
fungiokay, it's possible logstash.logs.openstack.org is just cruft in dns14:56
dpawlikor, wait for AWS to provide some AI that will do what we ask14:57
fungii'm trying to see if maybe i'm in the wrong dashboard view and need to be looking at an organization, but the "organization" view also tells me my access is denied14:58
ttxbut yeah the managed opensearch service itself runs on classic EC2 instances, but there is something that consumes Fargate resources (container service) and that doubled its footprint mid-October (and has been fairly constant since)14:58
dpawlikah fungi, you sent me the wrong address and I did not verify it :D14:58
dpawlikso direct OpenSearch url is: search-openstack-prod-cluster-hygp6wprdmtikuwtqg3z3rvapm.us-east-1.es.amazonaws.com14:59
dpawlikso I'm able to change that in logsender14:59
*** d34dh0r5- is now known as d34dh0r5314:59
fungii honestly have no idea what is the "right" or "wrong" hostname, all i know is what we have cnames for in openstack.org's dns14:59
fungisearch-openstack-prod-cluster-hygp6wprdmtikuwtqg3z3rvapm.us-east-1.es.amazonaws.com is what opensearch.logs.openstack.org is cname'd to15:00
dpawlikthat makes sense: logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com "loadb" goes to fargate, whereas search-blablabla goes directly to OpenSearch15:00
dpawlikhm15:01
fungiit's possible the other logstash.logs.openstack.org cname is leftover cruft from earlier iterations of work on the log ingestion in aws15:01
fungior maybe the api you're sending to on opensearch.l.o.o is then forwarding the workload to logstash.l.o.o for processing15:02
dpawliklogsender can send data directly to the OpenSearch and skip fargate \o/15:03
dpawlikfargate or whatever it is15:04
fungiyeah, i have no idea what exactly the role of the log processor there is (my understanding is that it accepted data using a logstash protocol and then fed it into opensearch)15:05
dpawlikthe logstash service should have been stopped a long time ago, after we moved to logsender. (r e e d   should do that)15:05
dpawlikif not, that service is just running without any reason and I don't have permissions to remove it :(15:06
fungii wonder why it has any workload at all, in that case15:06
fungiif we're not sending it data15:06
dpawlikso logsender processes the logs and sends them directly to OpenSearch (opensearch.logs.openstack.org), which is a fargate15:07
dpawliklet's see. I changed the url to the direct URL for OpenSearch15:09
dpawlikbut can't say if it will help :(15:09
fungiwhat is the direct url for opensearch?15:09
dpawlikyou already pasted it: https://search-openstack-prod-cluster-hygp6wprdmtikuwtqg3z3rvapm.us-east-1.es.amazonaws.com 15:09
dpawlik<we should remove it from the history>15:09
fungiwhy should we remove it? that name is in public dns15:10
fungiit's what the cname points to15:10
dpawlikhm15:11
fungibut anyway, i still don't know why you think that's more direct. it sounds like all you've done is eliminate a cname lookup in dns15:11
dpawlikright. I see in the history that I ran "dig logstash.logs.openstack.org" instead of "dig opensearch.logs.openstack.org"15:11
dpawlikso yes, nothing will change 15:12
dpawliksorry for making noise15:12
dpawlikso I don't know how to communicate with OpenSearch directly and skip fargate 15:13
fungijudging from the name resolution in dns, logstash.l.o.o is our convenience alias for a load balancer (elb service) though i have no idea what's on the other side of the load balancer nor whether it even still exists15:13
fungiand opensearch.l.o.o is our convenience alias for an "es" cluster, which i think means their elasticsearch service?15:14
dpawlikyup15:15
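(For reference, the two convenience aliases can be compared with a couple of CNAME lookups; a minimal sketch, assuming the dnspython package is installed:)

    # Sketch: resolve both aliases to see which backend each one actually points
    # at (the same thing the "dig" commands above were doing).
    # Requires the dnspython package (pip install dnspython).
    import dns.resolver

    for name in ("logstash.logs.openstack.org", "opensearch.logs.openstack.org"):
        try:
            for rdata in dns.resolver.resolve(name, "CNAME"):
                print(f"{name} -> {rdata.target}")
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            print(f"{name}: no CNAME record found")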
dpawlikI guess I know how to get it. I remember that R eed said that he does everything in yaml, and it was possible to get the yaml content15:16
fungittx: wendar: (if you're around) is it possible to reengage the contractor who set up the elasticsearch service/integration for us to look into the recent usage increases and/or whether we even still need whatever is running on fargate-based resources? it appears none of us have access (or know how to obtain access) to investigate further why we're now needing more resources than amazon15:19
fungihas agreed to donate, and i'd rather not have to simply turn it off in order to avoid overage charges15:19
dpawlikfungi: in AWS console, search for "cloudwatch" and in "logs" there is "log groups" and what is there is IMHO strange15:25
dpawlikespecially because we are not using that service15:25
fungi"This IAM user does not have permission to view Log Groups in this account."15:27
fungi"...no identity-based policy allows the logs:DescribeLogGroups action"15:28
fungianyway, i've got some paperwork i need to get back to, but i'll try to get something out to the openstack-discuss ml later today to talk about what options we might have15:30
dpawlikoh, you got lower permissions than I do O_O15:30
fungiunfortunately i have a very limited amount of time to spend on this (the original expectation was that we shouldn't need to spend any time helping maintain this, but it seems like amazon makes that harder than it should be)15:31
dpawlikI will try to check more carefully https://opendev.org/openstack/ci-log-processing/src/branch/master/opensearch-config/opensearch.yaml . Maybe there is an answer there15:33
dpawlikaha15:33
dpawlikand in the AWS console, when you search for "cloudformation", I see that logstash, opensearch, and dashboards stacks are running15:37
dpawlikIf I knew AWS better, I would click delete on the "logstashstack" stack (of course I would later get a permissions error)15:41
dpawlikbut what if that later somehow affects opensearchstack...15:42
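(Listing the stacks dpawlik describes could look roughly like the sketch below; the stack names are as quoted in the conversation, and it assumes credentials allowed to call cloudformation:DescribeStacks:)

    # Sketch: enumerate the CloudFormation stacks (logstash, opensearch and
    # dashboards were mentioned above) and their status, before deciding whether
    # any of them is safe to delete. Assumes boto3 and read permissions in
    # us-east-1.
    import boto3

    cfn = boto3.client("cloudformation", region_name="us-east-1")
    for stack in cfn.describe_stacks()["Stacks"]:
        print(stack["StackName"], stack["StackStatus"])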
dpawlikI will do some research in the AWS console tomorrow. Now I need to leave. Thanks fungi++ for checking 16:01
fungithanks dpawlik for looking into it!16:01
ttxdpawlik fungi if y'all need extra rights to get enough visibility on this issue, just let me know, and I can try to figure out how to grant that16:11
fungittx: thanks, i'll reach out if dpawlik exhausts his current options16:14
fungiJayF: i've just now accepted the maintainer invitation for eventlet on openstackci's behalf19:07
JayFThank you! I'll note that other openstack contributors, including hberaud and itamarst (the GR-OSS employee who has been posting on the list about eventlet) have been given access in pypi/github as well.19:08
fungihttps://pypi.org/project/eventlet/ reflects the addition19:08
JayFPretty much an ideal situation to get maintenance kicked off again.19:08
fungilet me know if/when you need anything else (other than for someone to fix bugs in eventlet of course, my dance card is pretty full)19:09
JayFI suspect that the reality of automation on that will end up being github-driven; unsure if openstackci account will end up used at all, but I appreciate that we have an organizational account in place and not just individuals :)19:09
fungiright, i or other tact sig folks can at least step in to help do pypi things with it as a fallback19:10
sean-k-mooneyhttps://zuul.opendev.org/t/openstack/build/32cf62df705344cf8693fb547f40b3cd19:12
sean-k-mooneycross posting in infra channel19:12
sean-k-mooney that ssh known host key verification error ^19:12
sean-k-mooneyis not a known issue right19:12
sean-k-mooney"Collect test-results" failed to rsync the logs to, i assume, the executor or log server19:13
sean-k-mooneyi know there were issues in the past with reusing ips in some clouds19:13
sean-k-mooneyso i'm wondering if this is infra related, a random failure, or likely to just work if i recheck19:14
fungithat problem has never really gone away, it rises and falls like the tides19:14
JayFI've 100% seen those before, and seen them clear with a recheck. I believe it's a known issue? 19:14
sean-k-mooneyok the frustrating thing is the review's recheck failed on a timeout and this one failed to upload logs in gate19:15
sean-k-mooneyi'll recheck but i don't want to waste ci time if it's actually an issue affecting others19:16
fungihave a link to the log upload failure? hoping one of our swift donors isn't having another outage19:16
sean-k-mooneywell sorry, i'm not sure it's an upload failure, in fact there are logs19:17
sean-k-mooneyit's from fetch-subunit-output: Collect test-results19:18
sean-k-mooneyhttps://zuul.opendev.org/t/openstack/build/32cf62df705344cf8693fb547f40b3cd/console#6/0/14/ubuntu-focal19:18
fungithat's often indicative of an earlier problem, like something broke before subunit data was written19:19
fungioh, okay that one's a rogue vm probably19:19
fungithe executor tried to scp the testr_results.html file from the test node, but when it connected the ssh host key wasn't the one it expected to see, so usually a second machine in that provider which is also using the same ip address19:20
sean-k-mooneyoh ok19:21
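(A rough way to reproduce that check by hand is to fetch the key the node currently presents and compare it with whatever was recorded at launch; a minimal sketch, where the recorded key value is hypothetical:)

    # Sketch: compare the SSH host key an IP currently presents (via ssh-keyscan)
    # with the key recorded when the node was launched. A mismatch is the symptom
    # the executor hit above, often a sign of another machine reusing the IP.
    import subprocess

    def current_host_key(ip: str, key_type: str = "ed25519") -> str:
        out = subprocess.run(
            ["ssh-keyscan", "-t", key_type, ip],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        # ssh-keyscan lines look like: "<ip> <key-type> <base64-key>"
        return out.split()[2] if out else ""

    recorded_key = "AAAAC3...recorded-at-launch"  # hypothetical stored value
    if current_host_key("23.253.20.246") != recorded_key:
        print("host key changed: possibly another machine is using this IP")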
fungioften times, the ip address (23.253.20.246 in this case) will be consistent across failures, or many failures will indicate a small number of affected addresses19:21
sean-k-mooneyneutron is meant to prevent that...19:21
sean-k-mooneyhttp://tinyurl.com/4xex5xrp19:22
JayFfungi: sean-k-mooney: Anyone wanna take the other side of the bet that this bug disappears when our eventlet dep does?19:22
JayF:)19:22
fungiyes, our working theory is that something happens during server instance deletion and the vm never really stops/goes away, so continues to arp for its old ip address, but neutron thinks the address has been freed and assigns it to a new server instance19:22
sean-k-mooneyJayF: you mean when hell freezes over19:23
JayFsean-k-mooney: it's cold down here today, bring a coat if you get dragged down19:23
* fungi bought a new winter coat just for this19:23
JayFsean-k-mooney: openstackci, hberaud, and Itamar (my co-worker from GR-OSS) all are maintainers on eventlet gh org + pypi now :)19:23
sean-k-mooneyso neutron installs mac anti-spoofing rules19:23
sean-k-mooneythat should prevent that, but depending on how networking is configured and how we connect, perhaps not19:24
fungisean-k-mooney: you're also operating on the assumption that the provider is using a new enough neutron to have that feature, or even using neutron at all as you know it19:24
sean-k-mooneyfungi: it was there when it was called quantum19:24
sean-k-mooneybut that does not mean they have security groups enabled19:24
sean-k-mooneyfungi: also this may happen if we reuse floating ips19:25
fungisean-k-mooney: "rax-ord" https://zuul.opendev.org/t/openstack/build/32cf62df705344cf8693fb547f40b3cd/log/zuul-info/inventory.yaml#2019:25
sean-k-mooneyi would not expect this to happen with ipv6 just due to entropy19:25
fungiso i think they're still using nova-network19:25
sean-k-mooney ...19:25
sean-k-mooneyya that would not have that...19:26
JayFIIRC the neutron bits in that stack are mostly custom plugin anyway19:26
funginot sure rackspace ever migrated to neutron (but i can't say for sure)19:26
sean-k-mooneythey did but they still use linux bridge19:26
JayFfungi: I know they half-migrated for OnMetal to work, dunno if they went further for VM or not19:26
fungiaha19:26
sean-k-mooneyand the thing i was referring to is what ovs does19:26
sean-k-mooneyin the ovs firewall driver19:26
fungianyway, we do still see this in various providers. in some there was a known bug in nova cells which eventually got fixed and that has helped i think19:27
fungii've been told by a number of different providers' operators though that they run periodic scripts to query locally with virsh on all the compute nodes and then stop and remove any virtual machines which nova is unaware of19:28
sean-k-mooneywe have a running-deleted periodic job19:29
sean-k-mooneybut really nova should not leave a vm running when it's deleted19:29
fungibut if we see persistent failures for the same ip address(es) over a span of days, it's helped in the past to assemble a list of those addresses and open a trouble ticket asking to have them hunted down19:29
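(The triage step fungi describes, collecting the repeat offenders before opening a ticket, could be sketched roughly as below; the input lines are hypothetical stand-ins for whatever a log search would return:)

    # Sketch: tally IP addresses seen in host-key-verification failures so that
    # persistently affected addresses can be listed in a trouble ticket.
    import re
    from collections import Counter

    failure_lines = [  # hypothetical examples of collected failure messages
        "Host key verification failed for 23.253.20.246",
        "Host key verification failed for 23.253.20.246",
        "Host key verification failed for 192.0.2.10",
    ]
    ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
    counts = Counter(ip for line in failure_lines for ip in ip_pattern.findall(line))
    for ip, seen in counts.most_common():
        print(ip, "seen", seen, "times")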
sean-k-mooneywe would put it to error, or fall back to a local delete "if the agent is down when it starts again"19:30
sean-k-mooneygiven the age of their cloud, hoping they move to ipv6 is probably wishful thinking19:30
fungithey have working ipv6 but it's not exposed by the api in a way that nodepool/zuul can make use of the info19:31
fungimainly because, i think, they rolled their own ipv6 support at "the beginning" (before quantum)19:32

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!