opendevreview | Matthew Oliver proposed openstack/swift master: Proxy: restructure cached updating shard ranges https://review.opendev.org/c/openstack/swift/+/870886 | 03:50 |
---|---|---|
opendevreview | Matthew Oliver proposed openstack/swift master: updater: add memcache shard update lookup support https://review.opendev.org/c/openstack/swift/+/874721 | 03:51 |
opendevreview | Matthew Oliver proposed openstack/swift master: updater: add memcache shard update lookup support https://review.opendev.org/c/openstack/swift/+/874721 | 05:32 |
opendevreview | Matthew Oliver proposed openstack/swift master: POC: updater: only memcache lookup deferred updates https://review.opendev.org/c/openstack/swift/+/875806 | 05:32 |
opendevreview | Tim Burke proposed openstack/swift master: proxy: Reduce round-trips to memcache and backend on info misses https://review.opendev.org/c/openstack/swift/+/875819 | 07:35 |
opendevreview | Alistair Coles proposed openstack/swift master: sharder: show path and db file in info and debug logs https://review.opendev.org/c/openstack/swift/+/875220 | 15:02 |
opendevreview | Alistair Coles proposed openstack/swift master: sharder: show path and db file in warning and error logs https://review.opendev.org/c/openstack/swift/+/875221 | 15:02 |
reid_g | Hello, I recently did some OS upgrades (18.04 -> 20.04) and now one of my nodes is spitting out tons of reconstructor messages: "Unable to get enough responses (1/N) to reconstruct non-durable" followed by "Unable to get enough responses (X error responses) to reconstruct durable" for the same object. It seems like maybe there is some old data on this server. Now all of the servers in the cluster are showing thousands of handoffs. Any thoughts? | 16:22 |
reid_g | It seems like these one-off fragments are being pushed around to other hosts for some reason. | 16:30 |
opendevreview | Tim Burke proposed openstack/swift master: Add --test-config option to WSGI servers https://review.opendev.org/c/openstack/swift/+/833124 | 17:07 |
opendevreview | Tim Burke proposed openstack/swift master: Add a swift-reload command https://review.opendev.org/c/openstack/swift/+/833174 | 17:07 |
opendevreview | Tim Burke proposed openstack/swift master: systemd: Send STOPPING/RELOADING notifications https://review.opendev.org/c/openstack/swift/+/837633 | 17:07 |
opendevreview | Tim Burke proposed openstack/swift master: Add abstract sockets for process notifications https://review.opendev.org/c/openstack/swift/+/837641 | 17:07 |
opendevreview | Alistair Coles proposed openstack/swift master: WIP: Allow internal container POSTs to not update put_timestamp https://review.opendev.org/c/openstack/swift/+/875982 | 19:30 |
mattoliver | reid_g: has the crc library changed? https://bugs.launchpad.net/swift/+bug/1886088 | 20:52 |
timburke | #startmeeting swift | 21:00 |
opendevmeet | Meeting started Wed Mar 1 21:00:50 2023 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot. | 21:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 21:00 |
opendevmeet | The meeting name has been set to 'swift' | 21:00 |
timburke | who's here for the swift meeting? | 21:01 |
zaitcev | o/ | 21:01 |
indianwhocodes | o/ | 21:01 |
mattoliver | i'm kinda here; I have the day off today, so that means I'm getting the kids ready for school (however that works) :P | 21:01 |
timburke | i didn't get around to updating the agenda, but i think it's mostly going to be a couple updates from last week, maybe one interesting new thing i'm working on | 21:02 |
timburke | #topic ssync, data with offsets, and meta | 21:03 |
acoles | o/ | 21:03 |
timburke | clayg's probe test got squashed into acoles's fix | 21:03 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/874122 | 21:03 |
timburke | we're upgrading our cluster now to include that fix; we should be sure to include feedback about how that went on the review | 21:04 |
timburke | being able to deal with metas with timestamps is still a separate review, but acoles seems to like the direction | 21:05 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/874184 | 21:05 |
acoles | timburke persuaded me that we should fix a future bug while we had this all in our heads | 21:06 |
timburke | the timestamp-offset delimiter business still seems a little strange, but i didn't immediately see a better way to deal with it | 21:06 |
timburke | #topic http keepalive timeout | 21:07 |
timburke | so my eventlet patch merged! gotta admit, seemed easier to get merged than expected :-) | 21:08 |
timburke | #link https://github.com/eventlet/eventlet/pull/788 | 21:08 |
timburke | which means i ought to revisit the swift patch to add config plumbing | 21:09 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/873744 | 21:09 |
timburke | are we all ok with turning it into a pure-plumbing patch, provided i make it clear in the sample config that the new option kinda requires new eventlet? | 21:10 |
acoles | what happens if the option is set without new eventlet? | 21:12 |
timburke | largely the existing behavior: keepalive stays turned on, governed by the general socket timeout (i.e., client_timeout) | 21:13 |
timburke | it would also give the option of setting keepalive_timeout to 0 to turn off keepalive behavior | 21:13 |
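For readers following along, a minimal sketch of what that plumbing might look like in proxy-server.conf, based only on the discussion above; the section placement and the example value are assumptions pending the patch (https://review.opendev.org/c/openstack/swift/+/873744):

```ini
[DEFAULT]
# existing general socket timeout; without a new-enough eventlet,
# keep-alive connections continue to be governed by this value
client_timeout = 60
# hypothetical new option: how long to hold an idle keep-alive
# connection open. Requires an eventlet release that includes
# https://github.com/eventlet/eventlet/pull/788; set to 0 to turn
# keep-alive off entirely.
keepalive_timeout = 15
```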
mattoliver | Yup, do it | 21:13 |
acoles | ok | 21:14 |
timburke | all right then | 21:15 |
timburke | #topic per-policy quotas | 21:15 |
timburke | thanks for the reviews, mattoliver! | 21:15 |
timburke | test refactor is now landed, and there's a +2 on the code refactor | 21:16 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/861487 | 21:16 |
timburke | any reason not to just merge it? | 21:16 |
timburke | i suppose mattoliver's busy ;-) i can poke him more later | 21:17 |
timburke | the actual feature patch needs some docs -- i'll try to get that up this week | 21:18 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/861282 | 21:18 |
timburke | other interesting thing i've been working on (and i should be sure to add it to the PTG etherpad) | 21:19 |
acoles | I just glanced at it (not a review), and the refactor looks nicer than the original | 21:19 |
timburke | thanks -- there were a couple sneaky spots, but the existing tests certainly helped | 21:20 |
timburke | #topic statsd labeling extensions | 21:20 |
mattoliver | Yeah it can probably just land | 21:21 |
timburke | when swift came out, statsd was the basis for a pretty solid monitoring stack | 21:21 |
timburke | these days, though, people generally seem to be coalescing around prometheus, or at least its data model | 21:22 |
timburke | we at nvidia, for example, are running https://github.com/prometheus/statsd_exporter on every node to turn swift's stats into something that can be periodically scraped | 21:23 |
mattoliver | I've been playing with otel metrics and put it as a topic on the PTG etherpad. I've got a basic client to test some infrastructure here at work. Maybe I could at least write up some docs on how that works for extra discussion at the PTG? | 21:24 |
mattoliver | By that i mean how open telemetry works | 21:25 |
timburke | that'd be great, thanks! | 21:25 |
timburke | as it works for us today, there's a bunch of parsing that's required -- a stat like `proxy-server.object.HEAD.200.timing:56.9911003112793|ms` doesn't have all the context we really want in a prometheus metric (like, 200 is the status, HEAD is the request method, etc.) | 21:26 |
timburke | which means that whenever we add a new metric, there's a handoff between dev and ops about what the new metric is, then ops need to go update some yaml file so the new metric gets parsed properly, and *then* they can start using it in new dashboards | 21:27 |
timburke | which all seems like some unnecessary friction | 21:28 |
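As an illustration of the yaml step described above, a statsd_exporter mapping for that dotted metric could look roughly like this; the Prometheus metric name and label names are invented for the example and are not Swift's or any operator's actual mapping:

```yaml
mappings:
  - match: "proxy-server.*.*.*.timing"
    name: "swift_proxy_request_timing"
    labels:
      layer: "$1"   # e.g. account / container / object
      method: "$2"  # e.g. HEAD
      status: "$3"  # e.g. 200
```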
timburke | fortunately, there are already some extensions to add the missing labels for components, and the statsd_exporter even already knows how to eat several of them: https://github.com/prometheus/statsd_exporter#tagging-extensions | 21:29 |
timburke | so i'm currently playing around with emitting metrics like `proxy-server.timing,layer=account,method=HEAD,status=204:41.67628288269043|ms` | 21:30 |
timburke | or `proxy-server.timing:34.14654731750488|ms|#layer:account,method:HEAD,status:204` | 21:30 |
timburke | or `proxy-server.timing#layer=account,method=HEAD,status=204:5.418539047241211|ms` | 21:30 |
timburke | or `proxy-server.timing;layer=account;method=HEAD;status=204:34.639835357666016|ms` | 21:30 |
timburke | (really, "proxy-server" should probably get labeled as something like "service"...) | 21:31 |
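To make the formats above concrete, here is a rough sketch (not Swift's actual statsd client API; the helper is invented for illustration) of serializing a timing stat in the DogStatsD-style variant shown above:

```python
import socket
import time

def emit_labeled_timing(sock, addr, name, value_ms, **labels):
    # DogStatsD-style tagging: <name>:<value>|ms|#key:val,key:val
    tags = ','.join('%s:%s' % (k, v) for k, v in sorted(labels.items()))
    payload = '%s:%s|ms|#%s' % (name, value_ms, tags)
    sock.sendto(payload.encode('utf-8'), addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
start = time.time()
# ... handle a HEAD to the account layer ...
emit_labeled_timing(sock, ('127.0.0.1', 8125), 'proxy-server.timing',
                    (time.time() - start) * 1000,
                    layer='account', method='HEAD', status=204)
```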
timburke | my hope is to have a patch up ahead of the PTG, so... look forward to that! | 21:31 |
acoles | nice! | 21:32 |
acoles | "layer" is a new term to me? | 21:32 |
timburke | idk, feel free to offer alternative suggestions :-) | 21:32 |
acoles | vs tier or resource (I guess tier isn't clear) | 21:33 |
acoles | haha it took us < 1second to get into a naming debate :D | 21:33 |
acoles | let's save that for the PTG | 21:33 |
mattoliver | Oh cool, I look forward to seeing it! | 21:34 |
timburke | if it doesn't mesh well with an operator's existing metrics stack, (1) it's opt-in and they can definitely still do the old-school vanilla statsd metrics, and (2) most collection endpoints (i believe) offer some translation mechanism | 21:34 |
acoles | I'm hoping we might eventually converge this "structured" stats with structured logging | 21:34 |
mattoliver | +1 | 21:35 |
timburke | yes! there's a lot of context that seems like it'd be smart to share between stats and logging | 21:35 |
acoles | e.g. build a "context" data structure, squirt it at a logger and/or a stats client, and you're done | 21:35 |
timburke | that's all i've got | 21:36 |
timburke | #topic open discussion | 21:36 |
timburke | what else should we bring up this week? | 21:36 |
acoles | on that theme, I wanted to draw attention to a change i have proposed to sharder logging | 21:36 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/875220 | 21:37 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/875221 | 21:37 |
acoles | 2 patches currently: https://review.opendev.org/c/openstack/swift/+/875220 and https://review.opendev.org/c/openstack/swift/+/875221 | 21:37 |
acoles | timburke: is so quick! | 21:37 |
mattoliver | Oh yeah, I've been meaning to get to that.. but off for the rest of the week, so won't happen now until next week. | 21:38 |
acoles | I recently had to debug a sharder issue and found the inconsistent log formats very frustrating | 21:38 |
acoles | e.g. sometimes we include the DB path, sometimes the resource path, sometimes both... but worst of all, sometimes neither | 21:38 |
acoles | So the patches ensure that every log message associated with a container DB (which is almost all of them) will consistently get both the db file path and the resource path (i.e., 'a/c') appended to the message | 21:39 |
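As a rough illustration (not the actual patch; the helper name is invented, though ContainerBroker does expose `path` and `db_file`), the consistent suffix acoles describes amounts to something like:

```python
def _annotate(broker, msg):
    # append the resource path ('a/c') and the db file so every sharder
    # log line carries the same identifying context
    return '%s, path: %s, db: %s' % (msg, broker.path, broker.db_file)

# e.g. self.logger.warning(_annotate(broker, 'Failed to find shard ranges'))
```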
acoles | I wanted to flag it up because that includes WARNING and ERROR level messages that I am aware some ops may parse for alerts | 21:40 |
acoles | so this change may break some parsing, but on the whole I believe we'll be better for having consistency | 21:41 |
mattoliver | Sounds good, and as we eventually worker up the sharder it all gets even more important. | 21:41 |
acoles | IDK if we have precedent for flagging up such a change, or if I am worrying too much (I tend to!) | 21:42 |
mattoliver | You're making debugging via log messages easier... and that's a win in my book | 21:43 |
timburke | there's some precedent (e.g., https://review.opendev.org/c/openstack/swift/+/863446) but in general i'm not worried | 21:43 |
acoles | ok so I could add an UpgradeImpact to the commit message | 21:44 |
timburke | if we got to the point of actually emitting structured logs, and then *took that away*, i'd worry. but this, *shrug* | 21:45 |
timburke | fwiw, i did *not* call it out in the changelog | 21:46 |
acoles | well if there's no concerns re. the warnings then I will squash the two patches | 21:46 |
acoles | and then I can look forward to the next sharder debugging session 😜 | 21:47 |
timburke | sounds good | 21:47 |
timburke | all right, i think i'll call it | 21:49 |
timburke | thank you all for coming, and thank you for working on swift! | 21:49 |
timburke | #endmeeting | 21:49 |
opendevmeet | Meeting ended Wed Mar 1 21:49:23 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 21:49 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/swift/2023/swift.2023-03-01-21.00.html | 21:49 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/swift/2023/swift.2023-03-01-21.00.txt | 21:49 |
opendevmeet | Log: https://meetings.opendev.org/meetings/swift/2023/swift.2023-03-01-21.00.log.html | 21:49 |
reid_g | @mattoliver - we did change the library, but we added overrides to the systemd services before anything starts: Environment="LIBERASURECODE_WRITE_LEGACY_CRC=1". We did this for ~20 different clusters without issues. Also, there were no quarantines generated in the cluster. | 21:52 |
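For reference, the override reid_g describes would look something like the following systemd drop-in; the unit name and file path are assumptions, the point being that the variable is exported before the swift services start:

```ini
# e.g. /etc/systemd/system/swift-object@.service.d/legacy-crc.conf (path is illustrative)
[Service]
Environment="LIBERASURECODE_WRITE_LEGACY_CRC=1"
```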
opendevreview | Alistair Coles proposed openstack/swift master: sharder: show path and db file in logs https://review.opendev.org/c/openstack/swift/+/875220 | 21:53 |
reid_g | Gotta head out. Will check chat logs tomorrow if you reply | 22:10 |
timburke | reid_g, what versions of swift were involved? sounds like maybe https://bugs.launchpad.net/swift/+bug/1655608 -- were any object disks out of the cluster for a while, then brought back in? it's a bit of an old bug, but we've seen patches in relation to it as recently as a couple years ago. if your new swift is >= 2.28.0, you might consider setting quarantine_threshold=1 for the reconstructor -- see https://github.com/openstack/swift/commit/46ea3aea | 22:10 |
reid_g | We are on Ussuri, 2.25.2 | 22:11 |
reid_g | I don't think any disks were out for a while since we have monitoring for missing disks and get those taken care of quickly. | 22:12 |
reid_g | What is kind of odd is that if I run swift-object-info on the fragment on the host we think the issue is with, the fragment belongs to that host according to the ring. One particular object has a filesystem date of Jan 2022, but it looks like it was pushed to another node on Feb 18 2023. The other node appears as a handoff according to the ring. | 22:14 |
reid_g | Right now I have a bunch of fragments on all the disks on one host that are being pushed around as handoffs to all other hosts (based on the filesystem dates of the files). | 22:16 |
timburke | are some disks unmounted? or maybe full? | 22:16 |
timburke | https://github.com/openstack/swift/commit/ea8e545a had us start rebuilding to handoffs in 2.21.0 if a primary responds 507 | 22:17 |
reid_g | No, they are all mounted and ~45% used. I don't think they were unmounted previously. | 22:18 |
reid_g | I will check that link tomorrow. I have to head home before my wife gets on me. | 22:19 |
timburke | of course -- good luck! | 22:19 |
reid_g | TBC. Thanks for your input! | 22:20 |
opendevreview | Tim Burke proposed openstack/swift master: proxy: Reduce round-trips to memcache and backend on info misses https://review.opendev.org/c/openstack/swift/+/875819 | 22:58 |
opendevreview | Tim Burke proposed openstack/swift master: proxy: Reduce round-trips to memcache and backend on info misses https://review.opendev.org/c/openstack/swift/+/875819 | 23:02 |