opendevreview | Matthew Oliver proposed openstack/swift master: Proxy: restructure cached updating shard ranges https://review.opendev.org/c/openstack/swift/+/870886 | 03:50 |
---|---|---|
opendevreview | Matthew Oliver proposed openstack/swift master: updater: add memcache shard update lookup support https://review.opendev.org/c/openstack/swift/+/874721 | 03:51 |
opendevreview | Matthew Oliver proposed openstack/swift master: updater: add memcache shard update lookup support https://review.opendev.org/c/openstack/swift/+/874721 | 05:32 |
opendevreview | Matthew Oliver proposed openstack/swift master: POC: updater: only memcache lookup deferred updates https://review.opendev.org/c/openstack/swift/+/875806 | 05:32 |
opendevreview | Tim Burke proposed openstack/swift master: proxy: Reduce round-trips to memcache and backend on info misses https://review.opendev.org/c/openstack/swift/+/875819 | 07:35 |
opendevreview | Alistair Coles proposed openstack/swift master: sharder: show path and db file in info and debug logs https://review.opendev.org/c/openstack/swift/+/875220 | 15:02 |
opendevreview | Alistair Coles proposed openstack/swift master: sharder: show path and db file in warning and error logs https://review.opendev.org/c/openstack/swift/+/875221 | 15:02 |
reid_g | Hello, I recently did some OS upgrades (18.04 -> 20.04) and now one of my nodes is spitting out tons of reconstructor messages: "Unable to get enough responses (1/N) to reconstruct non-durable" followed by "Unable to get enough responses (X error responses) to reconstruct durable" for the same object. It seems like maybe there is some old data on this server. Now all of the servers in the cluster are showing thousands of handoffs. Any thoughts? | 16:22 |
reid_g | It seems like these one-off fragments are being pushed around to other hosts for some reason. | 16:30 |
opendevreview | Tim Burke proposed openstack/swift master: Add --test-config option to WSGI servers https://review.opendev.org/c/openstack/swift/+/833124 | 17:07 |
opendevreview | Tim Burke proposed openstack/swift master: Add a swift-reload command https://review.opendev.org/c/openstack/swift/+/833174 | 17:07 |
opendevreview | Tim Burke proposed openstack/swift master: systemd: Send STOPPING/RELOADING notifications https://review.opendev.org/c/openstack/swift/+/837633 | 17:07 |
opendevreview | Tim Burke proposed openstack/swift master: Add abstract sockets for process notifications https://review.opendev.org/c/openstack/swift/+/837641 | 17:07 |
opendevreview | Alistair Coles proposed openstack/swift master: WIP: Allow internal container POSTs to not update put_timestamp https://review.opendev.org/c/openstack/swift/+/875982 | 19:30 |
mattoliver | reid_g: has the crc library changed? https://bugs.launchpad.net/swift/+bug/1886088 | 20:52 |
timburke | #startmeeting swift | 21:00 |
opendevmeet | Meeting started Wed Mar 1 21:00:50 2023 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot. | 21:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 21:00 |
opendevmeet | The meeting name has been set to 'swift' | 21:00 |
timburke | who's here for the swift meeting? | 21:01 |
zaitcev | o/ | 21:01 |
indianwhocodes | o/ | 21:01 |
mattoliver | i'm kinda here; I have the day off today, so that means I'm getting the kids ready for school (however that works) :P | 21:01 |
timburke | i didn't get around to updating the agenda, but i think it's mostly going to be a couple updates from last week, maybe one interesting new thing i'm working on | 21:02 |
timburke | #topic ssync, data with offsets, and meta | 21:03 |
acoles | o/ | 21:03 |
timburke | clayg's probe test got squashed into acoles's fix | 21:03 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/874122 | 21:03 |
timburke | we're upgrading our cluster now to include that fix; we should be sure to include feedback about how that went on the review | 21:04 |
timburke | being able to deal with metas with timestamps is still a separate review, but acoles seems to like the direction | 21:05 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/874184 | 21:05 |
acoles | timburke persuaded me that we should fix a future bug while we had this all in our heads | 21:06 |
timburke | the timestamp-offset delimiter business still seems a little strange, but i didn't immediately see a better way to deal with it | 21:06 |
timburke | #topic http keepalive timeout | 21:07 |
timburke | so my eventlet patch merged! gotta admit, seemed easier to get merged than expected :-) | 21:08 |
timburke | #link https://github.com/eventlet/eventlet/pull/788 | 21:08 |
timburke | which means i ought to revisit the swift patch to add config plumbing | 21:09 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/873744 | 21:09 |
timburke | are we all ok with turning it into a pure-plumbing patch, provided i make it clear in the sample config that the new option kinda requires new eventlet? | 21:10 |
acoles | what happens if the option is set without new eventlet? | 21:12 |
timburke | largely the existing behavior: keepalive stays turned on, governed by the general socket timeout (i.e., client_timeout) | 21:13 |
timburke | it would also give the option of setting keepalive_timeout to 0 to turn off keepalive behavior | 21:13 |
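For readers following along, a minimal sketch of what that plumbing might look like in proxy-server.conf, based only on the discussion above; the section placement and the example value are assumptions pending the patch (https://review.opendev.org/c/openstack/swift/+/873744):

```ini
[DEFAULT]
# existing general socket timeout; without a new-enough eventlet,
# keep-alive connections continue to be governed by this value
client_timeout = 60
# hypothetical new option: how long to hold an idle keep-alive
# connection open. Requires an eventlet release that includes
# https://github.com/eventlet/eventlet/pull/788; set to 0 to turn
# keep-alive off entirely.
keepalive_timeout = 15
```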
mattoliver | Yup, do it | 21:13 |
acoles | ok | 21:14 |
timburke | all right then | 21:15 |
timburke | #topic per-policy quotas | 21:15 |
timburke | thanks for the reviews, mattoliver! | 21:15 |
timburke | test refactor is now landed, and there's a +2 on the code refactor | 21:16 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/861487 | 21:16 |
timburke | any reason not to just merge it? | 21:16 |
timburke | i suppose mattoliver's busy ;-) i can poke him more later | 21:17 |
timburke | the actual feature patch needs some docs -- i'll try to get that up this week | 21:18 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/861282 | 21:18 |
timburke | other interesting thing i've been working on (and i should be sure to add it to the PTG etherpad) | 21:19 |
acoles | I just glanced at it (not a review), and the refactor looks nicer than the original | 21:19 |
timburke | thanks -- there were a couple sneaky spots, but the existing tests certainly helped | 21:20 |
timburke | #topic statsd labeling extensions | 21:20 |
mattoliver | Yeah it can probably just land | 21:21 |
timburke | when swift came out, statsd was the basis for a pretty solid monitoring stack | 21:21 |
timburke | these days, though, people generally seem to be coalescing around prometheus, or at least its data model | 21:22 |
timburke | we at nvidia, for example, are running https://github.com/prometheus/statsd_exporter on every node to turn swift's stats into something that can be periodically scraped | 21:23 |
mattoliver | I've been playing with otel metrics and put it as a topic on the PTG etherpad. I've got a basic client to test some infrastructure here at work. Maybe I could at least write up some docs on how that works for extra discussion at the PTG? | 21:24 |
mattoliver | By that i mean how open telemetry works | 21:25 |
timburke | that'd be great, thanks! | 21:25 |
timburke | as it works for us today, there's a bunch of parsing that's required -- a stat like `proxy-server.object.HEAD.200.timing:56.9911003112793|ms` doesn't have all the context we really want in a prometheus metric (like, 200 is the status, HEAD is the request method, etc.) | 21:26 |
timburke | which means that whenever we add a new metric, there's a handoff between dev and ops about what the new metric is, then ops need to go update some yaml file so the new metric gets parsed properly, and *then* they can start using it in new dashboards | 21:27 |
timburke | which all seems like some unnecessary friction | 21:28 |
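As an illustration of the yaml step described above, a statsd_exporter mapping for that dotted metric could look roughly like this; the Prometheus metric name and label names are invented for the example and are not Swift's or any operator's actual mapping:

```yaml
mappings:
  - match: "proxy-server.*.*.*.timing"
    name: "swift_proxy_request_timing"
    labels:
      layer: "$1"   # e.g. account / container / object
      method: "$2"  # e.g. HEAD
      status: "$3"  # e.g. 200
```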
timburke | fortunately, there are already some extensions to add the missing labels for components, and the statsd_exporter even already knows how to eat several of them: https://github.com/prometheus/statsd_exporter#tagging-extensions | 21:29 |
timburke | so i'm currently playing around with emitting metrics like `proxy-server.timing,layer=account,method=HEAD,status=204:41.67628288269043|ms` | 21:30 |
timburke | or `proxy-server.timing:34.14654731750488|ms|#layer:account,method:HEAD,status:204` | 21:30 |
timburke | or `proxy-server.timing#layer=account,method=HEAD,status=204:5.418539047241211|ms` | 21:30 |
timburke | or `proxy-server.timing;layer=account;method=HEAD;status=204:34.639835357666016|ms` | 21:30 |
timburke | (really, "proxy-server" should probably get labeled as something like "service"...) | 21:31 |
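To make the formats above concrete, here is a rough sketch (not Swift's actual statsd client API; the helper is invented for illustration) of serializing a timing stat in the DogStatsD-style variant shown above:

```python
import socket
import time

def emit_labeled_timing(sock, addr, name, value_ms, **labels):
    # DogStatsD-style tagging: <name>:<value>|ms|#key:val,key:val
    tags = ','.join('%s:%s' % (k, v) for k, v in sorted(labels.items()))
    payload = '%s:%s|ms|#%s' % (name, value_ms, tags)
    sock.sendto(payload.encode('utf-8'), addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
start = time.time()
# ... handle a HEAD to the account layer ...
emit_labeled_timing(sock, ('127.0.0.1', 8125), 'proxy-server.timing',
                    (time.time() - start) * 1000,
                    layer='account', method='HEAD', status=204)
```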
timburke | my hope is to have a patch up ahead of the PTG, so... look forward to that! | 21:31 |
acoles | nice! | 21:32 |
acoles | "layer" is a new term to me? | 21:32 |
timburke | idk, feel free to offer alternative suggestions :-) | 21:32 |
acoles | vs tier or resource (I guess tier isn't clear) | 21:33 |
acoles | haha it took us < 1second to get into a naming debate :D | 21:33 |
acoles | let's save that for the PTG | 21:33 |
mattoliver | Oh cool, I look forward to seeing it! | 21:34 |
timburke | if it doesn't mesh well with an operator's existing metrics stack, (1) it's opt-in and they can definitely still do the old-school vanilla statsd metrics, and (2) most collection endpoints (i believe) offer some translation mechanism | 21:34 |
acoles | I'm hoping we might eventually converge this "structured" stats with structured logging | 21:34 |
mattoliver | +1 | 21:35 |
timburke | yes! there's a lot of context that seems like it'd be smart to share between stats and logging | 21:35 |
acoles | e.g. build a "context" data structure, squirt it at a logger and/or a stats client, and you're done | 21:35 |
timburke | that's all i've got | 21:36 |
timburke | #topic open discussion | 21:36 |
timburke | what else should we bring up this week? | 21:36 |
acoles | on that theme, I wanted to draw attention to a change i have proposed to sharder logging | 21:36 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/875220 | 21:37 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/875221 | 21:37 |
acoles | 2 patches currently: https://review.opendev.org/c/openstack/swift/+/875220 and https://review.opendev.org/c/openstack/swift/+/875221 | 21:37 |
acoles | timburke: is so quick! | 21:37 |
mattoliver | Oh yeah, I've been meaning to get to that.. but off for the rest of the week, so won't happen now until next week. | 21:38 |
acoles | I recently had to debug a sharder issue and found the inconsistent log formats very frustrating | 21:38 |
acoles | e.g. sometimes we include the DB path, sometimes the resource path, sometimes both... but worst of all, sometimes neither | 21:38 |
acoles | So the patches ensure that every log message associated with a container DB (which is almost all of them) will consistently get both the db file path and the resource path (i.e., 'a/c') appended to the message | 21:39 |
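As a rough illustration (not the actual patch; the helper name is invented, though ContainerBroker does expose `path` and `db_file`), the consistent suffix acoles describes amounts to something like:

```python
def _annotate(broker, msg):
    # append the resource path ('a/c') and the db file so every sharder
    # log line carries the same identifying context
    return '%s, path: %s, db: %s' % (msg, broker.path, broker.db_file)

# e.g. self.logger.warning(_annotate(broker, 'Failed to find shard ranges'))
```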
acoles | I wanted to flag it up because that includes WARNING and ERROR level messages that I am aware some ops may parse for alerts | 21:40 |
acoles | so this change may break some parsing, but on the whole I believe we'll be better for having consistency | 21:41 |
mattoliver | Sounds good, and as we eventually worker up the sharder it all gets even more important. | 21:41 |
acoles | IDK if we have precedent for flagging up such a change, or if I am worrying too much (I tend to!) | 21:42 |
mattoliver | You're making debugging via log messages easier... and that's a win in my book | 21:43 |
timburke | there's some precedent (e.g., https://review.opendev.org/c/openstack/swift/+/863446) but in general i'm not worried | 21:43 |
acoles | ok so I could add an UpgradeImpact to the commit message | 21:44 |
timburke | if we got to the point of actually emitting structured logs, and then *took that away*, i'd worry. but this, *shrug* | 21:45 |
timburke | fwiw, i did *not* call it out in the changelog | 21:46 |
acoles | well if there's no concerns re. the warnings then I will squash the two patches | 21:46 |
acoles | and then I can look forward to the next sharder debugging session 😜 | 21:47 |
timburke | sounds good | 21:47 |
timburke | all right, i think i'll call it | 21:49 |
timburke | thank you all for coming, and thank you for working on swift! | 21:49 |
timburke | #endmeeting | 21:49 |
opendevmeet | Meeting ended Wed Mar 1 21:49:23 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 21:49 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/swift/2023/swift.2023-03-01-21.00.html | 21:49 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/swift/2023/swift.2023-03-01-21.00.txt | 21:49 |
opendevmeet | Log: https://meetings.opendev.org/meetings/swift/2023/swift.2023-03-01-21.00.log.html | 21:49 |
reid_g | @mattoliver - we did change the library, but we added overrides to the systemd services before anything starts: Environment="LIBERASURECODE_WRITE_LEGACY_CRC=1". We did this for ~20 different clusters without issues. Also, there were no quarantines generated in the cluster. | 21:52 |
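For reference, the override reid_g describes would look something like the following systemd drop-in; the unit name and file path are assumptions, the point being that the variable is exported before the swift services start:

```ini
# e.g. /etc/systemd/system/swift-object@.service.d/legacy-crc.conf (path is illustrative)
[Service]
Environment="LIBERASURECODE_WRITE_LEGACY_CRC=1"
```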
opendevreview | Alistair Coles proposed openstack/swift master: sharder: show path and db file in logs https://review.opendev.org/c/openstack/swift/+/875220 | 21:53 |
reid_g | Gotta head out. Will check chat logs tomorrow if you reply | 22:10 |
timburke | reid_g, what versions of swift were involved? sounds like maybe https://bugs.launchpad.net/swift/+bug/1655608 -- were any object disks out of the cluster for a while, then brought back in? it's a bit of an old bug, but we've seen patches in relation to it as recently as a couple years ago. if your new swift is >= 2.28.0, you might consider setting quarantine_threshold=1 for the reconstructor -- see https://github.com/openstack/swift/commit/46ea3aea | 22:10 |
reid_g | We are on Ussuri, 2.25.2 | 22:11 |
reid_g | I don't think any disks were out for a while since we have monitoring for missing disks and get those taken care of quickly. | 22:12 |
reid_g | What is kind of odd is that if I run swift-object-info on the fragment on the host we think the issue is with, the fragment belongs to that host according to the ring. One particular object has a filesystem date of Jan 2022, but it looks like it was pushed to another node on Feb 18 2023. The other node appears as a handoff according to the ring. | 22:14 |
reid_g | Right now I have a bunch of fragments on all the disks on one host that are being pushed around as handoffs to all other hosts (based on the filesystem dates of the files). | 22:16 |
timburke | are some disks unmounted? or maybe full? | 22:16 |
timburke | https://github.com/openstack/swift/commit/ea8e545a had us start rebuilding to handoffs in 2.21.0 if a primary responds 507 | 22:17 |
reid_g | No, they are all mounted and ~45% used. I don't think they were unmounted previously. | 22:18 |
reid_g | I will check that link tomorrow. I have to head home before my wife gets on me. | 22:19 |
timburke | of course -- good luck! | 22:19 |
reid_g | TBC. Thanks for your input! | 22:20 |
opendevreview | Tim Burke proposed openstack/swift master: proxy: Reduce round-trips to memcache and backend on info misses https://review.opendev.org/c/openstack/swift/+/875819 | 22:58 |
opendevreview | Tim Burke proposed openstack/swift master: proxy: Reduce round-trips to memcache and backend on info misses https://review.opendev.org/c/openstack/swift/+/875819 | 23:02 |