timburke | notmyname: it will be if/when https://review.openstack.org/#/c/575568/ lands | 00:00 |
patchbot | patch 575568 - requirements - Blacklist eventlet 0.23.0 | 00:00 |
notmyname | perfect! | 00:00 |
openstackgerrit | Tim Burke proposed openstack/swift master: Only try to fetch or sync shard ranges if the remote supports sharding https://review.openstack.org/573816 | 00:00 |
notmyname | I'll approve ours, and it won't do anything until the other one lands ;-) | 00:00 |
timburke | i just got sick of needing to run down eventlet-caused test failures for people... | 00:01 |
openstackgerrit | Tim Burke proposed openstack/swift master: Use path_qs instead of reinventing it https://review.openstack.org/575572 | 00:08 |
*** mikecmpbll has quit IRC | 00:09 | |
*** germs has joined #openstack-swift | 00:45 | |
*** germs has quit IRC | 00:45 | |
*** germs has joined #openstack-swift | 00:45 | |
*** germs has quit IRC | 00:50 | |
*** two_tired has joined #openstack-swift | 01:06 | |
*** gyee has quit IRC | 01:09 | |
*** lifeless_ has quit IRC | 01:13 | |
*** lifeless has joined #openstack-swift | 01:13 | |
*** lifeless_ has joined #openstack-swift | 01:24 | |
*** lifeless has quit IRC | 01:25 | |
mattoliverau | morning, later start on the interweb this morning, as the wife had a check up | 01:30 |
*** lifeless has joined #openstack-swift | 01:40 | |
*** lifeless_ has quit IRC | 01:41 | |
*** lifeless has quit IRC | 01:54 | |
*** lifeless has joined #openstack-swift | 02:00 | |
*** two_tired has quit IRC | 02:14 | |
*** mwheckmann has joined #openstack-swift | 02:32 | |
*** germs has joined #openstack-swift | 02:46 | |
*** germs has quit IRC | 02:46 | |
*** germs has joined #openstack-swift | 02:46 | |
*** germs has quit IRC | 02:50 | |
*** germs has joined #openstack-swift | 02:50 | |
*** germs has quit IRC | 02:50 | |
*** germs has joined #openstack-swift | 02:50 | |
*** germs has quit IRC | 03:00 | |
*** d0ugal_ has joined #openstack-swift | 03:17 | |
*** d0ugal has quit IRC | 03:17 | |
*** psachin has joined #openstack-swift | 03:28 | |
*** mwheckmann has quit IRC | 03:48 | |
openstackgerrit | Merged openstack/swift master: Experimental swift-ring-composer CLI to build composite rings https://review.openstack.org/451507 | 04:27 |
openstackgerrit | Merged openstack/swift master: Improve path handling in proxy_logging https://review.openstack.org/563354 | 04:33 |
*** germs has joined #openstack-swift | 04:37 | |
*** germs has quit IRC | 04:37 | |
*** germs has joined #openstack-swift | 04:37 | |
*** germs has quit IRC | 04:41 | |
*** links has joined #openstack-swift | 04:55 | |
*** pcaruana has quit IRC | 05:18 | |
openstackgerrit | Merged openstack/swift master: Use path_qs instead of reinventing it https://review.openstack.org/575572 | 05:41 |
*** cbartz has joined #openstack-swift | 05:49 | |
*** cah_link has joined #openstack-swift | 05:49 | |
*** silor has joined #openstack-swift | 06:11 | |
*** silor has quit IRC | 06:36 | |
*** gkadam has joined #openstack-swift | 06:38 | |
*** pcaruana has joined #openstack-swift | 06:44 | |
*** hseipp has joined #openstack-swift | 06:44 | |
*** rcernin has quit IRC | 07:08 | |
*** germs has joined #openstack-swift | 07:08 | |
*** germs has quit IRC | 07:08 | |
*** germs has joined #openstack-swift | 07:08 | |
*** germs has quit IRC | 07:13 | |
*** tesseract has joined #openstack-swift | 07:17 | |
*** silor has joined #openstack-swift | 07:18 | |
*** geaaru has joined #openstack-swift | 07:24 | |
acoles | good morning | 07:36 |
acoles | timburke: torgomatic_ zaitcev - thanks for clarifying re p 575511, I'd not appreciated the lack of support for 100 Continue headers and got too focussed on us just needing to eliminate the non-compliant bytes | 07:36 |
patchbot | https://review.openstack.org/#/c/575511/ - swift - WIP: PUT POST: Use 100 Continue response header to... | 07:36 |
*** ccamacho has joined #openstack-swift | 07:51 | |
*** hseipp has quit IRC | 07:59 | |
*** d0ugal_ has quit IRC | 08:04 | |
*** d0ugal has joined #openstack-swift | 08:04 | |
*** d0ugal has joined #openstack-swift | 08:04 | |
*** mikecmpbll has joined #openstack-swift | 08:06 | |
*** threestrands has quit IRC | 08:20 | |
*** silor has quit IRC | 08:37 | |
*** lifeless has quit IRC | 08:57 | |
*** germs has joined #openstack-swift | 09:09 | |
*** germs has quit IRC | 09:09 | |
*** germs has joined #openstack-swift | 09:09 | |
*** lifeless has joined #openstack-swift | 09:10 | |
*** germs has quit IRC | 09:14 | |
*** silor has joined #openstack-swift | 09:30 | |
*** psachin has quit IRC | 09:38 | |
*** lifeless has quit IRC | 09:38 | |
*** lifeless has joined #openstack-swift | 09:45 | |
*** rcernin has joined #openstack-swift | 10:06 | |
*** cbartz has quit IRC | 10:09 | |
*** silor1 has joined #openstack-swift | 10:09 | |
*** silor has quit IRC | 10:11 | |
*** silor1 is now known as silor | 10:11 | |
*** sundbp has joined #openstack-swift | 10:13 | |
sundbp | I'm trying to understand the expected failure mode for the following: a 9-node cluster, 1 region with 3 zones of 3 servers each, replication factor = 3. 1 zone goes down. I seem to get some failures in this situation, but I expected no service disruption. If I remove the failed zone from the ring all is well. Trying to understand if I've misconfigured things, or if having to update the ring is expected? | 10:16 |
sundbp | (it's not ALL failures, just some. i don't yet see a clear pattern to which requests work and which don't) | 10:16 |
*** lifeless_ has joined #openstack-swift | 10:22 | |
*** lifeless has quit IRC | 10:23 | |
DHE | failure as in undesirable HTTP response code (4xx or 5xx), or the proxy server hanging? | 10:28 |
DHE | also did you set write affinity in the configs? (there's a few places it can go) | 10:29 |
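For context, write affinity lives in the [app:proxy-server] section of proxy-server.conf. A minimal sketch of the knobs DHE is alluding to (the region/zone names r1 and r1z1 are placeholders for a real ring; values are illustrative only):

```ini
# proxy-server.conf -- sketch, not a recommendation
[app:proxy-server]
use = egg:swift#proxy
# prefer nearby replicas when reading
sorting_method = affinity
read_affinity = r1z1=100, r1=200
# send new writes to the local region first...
write_affinity = r1
# ...across this many local candidates before falling back to handoffs
write_affinity_node_count = 2 * replicas
```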
*** lifeless_ has quit IRC | 10:39 | |
*** lifeless has joined #openstack-swift | 10:41 | |
sundbp | oddly i can't find 4xx or 5xx, but that's what I'm assuming | 10:45 |
sundbp | no write affinity | 10:45 |
sundbp | once i update ring to not have the failed zone (the 3 network partitioned servers) all is well. | 10:46 |
sundbp | i do see ConnectTimeout errors in the logs for those 3 hosts, but i expected that. i expected it to fail for those servers but succeed on the other 2 servers in the 3x replication, and all would be well since that's >50% | 10:47 |
sundbp | DHE: (forgot to prefix with your nick) | 10:53 |
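That >50% intuition matches the majority quorum the proxy uses for replicated policies; a minimal sketch of the arithmetic (modelled on swift.common.utils.quorum_size, replicated policies only -- erasure-coded quorums work differently):

```python
def quorum_size(replica_count):
    """Majority quorum: strictly more than half the replicas."""
    return (replica_count // 2) + 1

# With 3 replicas and one zone down, 2 of the 3 primaries can still
# respond, which meets the quorum of 2 -- so requests should succeed.
assert quorum_size(3) == 2
```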
*** lifeless_ has joined #openstack-swift | 10:55 | |
sundbp | DHE: i've got about 300 instances of an app working against this swift cluster. i seemed to have intermittent failures for, say, 15 at a time. they'd come and go. they all do a heartbeat by writing a file to swift and reading it back as part of a test. some of those tests would fail intermittently, until I updated the ring; then everything was solid. | 10:55 |
*** lifeless has quit IRC | 10:56 | |
*** lifeless has joined #openstack-swift | 11:00 | |
*** lifeless_ has quit IRC | 11:01 | |
acoles | sundbp: what is the nature of the failures? failed PUT, failed GET, or unexpected content, or...? | 11:03 |
*** lifeless has quit IRC | 11:05 | |
*** germs has joined #openstack-swift | 11:10 | |
*** germs has quit IRC | 11:10 | |
*** germs has joined #openstack-swift | 11:10 | |
*** lifeless has joined #openstack-swift | 11:10 | |
*** hseipp has joined #openstack-swift | 11:10 | |
*** germs has quit IRC | 11:14 | |
*** silor has quit IRC | 11:15 | |
*** lifeless_ has joined #openstack-swift | 11:15 | |
*** lifeless has quit IRC | 11:15 | |
sundbp | acoles: I'm trying to track that down. What I do know is that the heartbeat checks that upload a file and download it failed intermittently, but i'm really struggling to find any actual log content showing the failures. since updating the ring immediately removed ALL errors (which had been ongoing for 2h), it seems almost a given it was related to swift. | 11:24 |
*** cbartz has joined #openstack-swift | 11:26 | |
sundbp | this is a chart showing the health check. the first few red bars are the network partition causing multiple issues; it goes back to green when i fix non-swift-related things; then the 2 red bars later are intermittent failures related to swift. i see such for all instances (not at the same time - about 5% at any given time in that period). once i removed the 3 partitioned servers from the ring around 08:00 all failures went away. | 11:31 |
sundbp | and i can see that the response times went back down to normal'ish for this health test. | 11:32 |
sundbp | typical log swift-side from heartbeat: https://gist.github.com/sundbp/10c6a8cfd69436b2d5bf6bf1b5ac0dc3 | 11:34 |
sundbp | looks alright to me - the ERRORs are connections to the hosts that are down (network partitioned), but it goes ahead anyway and the PUT/GET returns 2xx. | 11:34 |
acoles | sundbp: do you have more than one proxy server? | 11:40 |
sundbp | acoles: i have 3, 1 which was partitioned, so 2 were running at the time. | 11:42 |
sundbp | acoles: (the logs are very similar for both) | 11:42 |
acoles | I'm just wondering if your client requests are hitting proxies on either side of the network partition, so a PUT goes to one side, then the GET for the same object hits the other side and fails, or gets stale data | 11:43 |
sundbp | acoles: good theory, but unlikely. those health checks started going green precisely because i completely disabled the partitioned DC and had moved/restarted everything to not depend on anything there. | 11:45 |
sundbp | acoles: so the 2 remaining DCs had fine connectivity with each other, and neither had any contact with the 3rd (a broken circuit card of some sort in that DC). | 11:45 |
acoles | oh, so those logs are for the healthy system? | 11:46 |
sundbp | the only thing i can see in the logs now, after digging around a lot, is some intermittent read timeouts on the client-app side, and in the swift logs "proxy-server: Client disconnected on read (txn: tx77877c4ef47d4757a0d44-005b234be2)" | 11:46 |
sundbp | yep, healthy side. | 11:47 |
sundbp | i can't correlate exactly, but it seems to me the 2 logs agree, i.e. they refer to the same timeout. | 11:48 |
sundbp | i think that's the source of the intermittent failures. the question is why there were some timeouts.. | 11:48 |
*** mikecmpbll has quit IRC | 11:50 | |
sundbp | and why did these timeouts go away once I updated ring.. | 11:51 |
acoles | sundbp: I have to leave now, sorry I couldn't help more | 11:52 |
sundbp | acoles: thanks for trying! | 11:52 |
*** hseipp has quit IRC | 12:10 | |
openstackgerrit | Alistair Coles proposed openstack/swift master: WIP: PUT POST: simplify object server POST handler https://review.openstack.org/575512 | 12:13 |
*** mwheckmann has joined #openstack-swift | 12:29 | |
*** kei_yama has quit IRC | 12:37 | |
*** lifeless_ has quit IRC | 12:46 | |
*** rcernin has quit IRC | 13:02 | |
*** germs has joined #openstack-swift | 13:11 | |
*** germs has quit IRC | 13:11 | |
*** germs has joined #openstack-swift | 13:11 | |
*** germs has quit IRC | 13:15 | |
*** links has quit IRC | 13:27 | |
*** cbartz has quit IRC | 14:15 | |
*** mvenesio has joined #openstack-swift | 14:22 | |
*** germs has joined #openstack-swift | 14:23 | |
*** germs has joined #openstack-swift | 14:23 | |
*** cah_link has quit IRC | 14:26 | |
*** mrjk__ has quit IRC | 14:41 | |
*** mrjk has joined #openstack-swift | 14:42 | |
*** pcaruana has quit IRC | 15:01 | |
sundbp | acoles: think i worked out at least 1 part of the mystery. the proxy-server was set to use a 10s connect timeout, and the client had a 10s read timeout set, meaning there was a race between the proxy timing out on a backend and the client giving up on the response. hence no errors in the swift logs, just client disconnects - and the same on the client side: timeouts but no error results. | 15:21 |
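The usual remedy is to keep the proxy's backend timeouts well below any client-side read timeout, so the proxy can fail over to another node and still answer in time. A sketch of the relevant proxy-server.conf settings (values are illustrative, not recommendations):

```ini
[app:proxy-server]
# how long to wait when opening a connection to a backend node
# (swift's default is 0.5s; a 10s value leaves no headroom under
# a 10s client read timeout)
conn_timeout = 0.5
# how long to wait for a backend node to respond once connected
node_timeout = 3
```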
*** armaan has joined #openstack-swift | 15:48 | |
*** gyee has joined #openstack-swift | 15:52 | |
notmyname | good morning | 15:59 |
*** armaan has quit IRC | 16:23 | |
timburke | good morning | 16:33 |
*** gkadam has quit IRC | 16:43 | |
*** tesseract has quit IRC | 17:01 | |
*** germs has quit IRC | 17:24 | |
*** germs has joined #openstack-swift | 17:24 | |
*** germs has quit IRC | 17:24 | |
*** germs has joined #openstack-swift | 17:24 | |
*** ccamacho has quit IRC | 17:24 | |
*** germs has quit IRC | 17:24 | |
openstackgerrit | Tim Burke proposed openstack/swift master: func tests: Rename storage_url to storage_path https://review.openstack.org/574900 | 17:32 |
openstackgerrit | Tim Burke proposed openstack/swift master: Tighten up staticweb redirect test https://review.openstack.org/574901 | 17:32 |
timburke | how old of a python-swiftclient do we reasonably want to support when running swift's func tests? | 17:35 |
timburke | https://github.com/openstack/requirements/blob/master/lower-constraints.txt currently lists 3.2.0, which is about a year and a half old... i feel like we could probably go back further if we wanted, though... | 17:37 |
openstackgerrit | Tim Burke proposed openstack/swift master: Blacklist eventlet 0.23.0 https://review.openstack.org/575569 | 17:43 |
openstackgerrit | Tim Burke proposed openstack/swift master: Set lower bounds on all requirements and test-requirements https://review.openstack.org/575806 | 17:43 |
*** gkadam has joined #openstack-swift | 17:44 | |
*** geaaru has quit IRC | 17:52 | |
notmyname | timburke: referencing the discussion we had in dublin, we talked about having a 2 year support window. I'm not sure if that needs to apply here, but it's a good start, IMO | 17:54 |
notmyname | (yes, that discussion was more about swift, not the client, but as a general guide, I don't think it's bad) | 17:54 |
openstackgerrit | Tim Burke proposed openstack/swift master: Add debugging info to SignatureDoesNotMatch responses https://review.openstack.org/575808 | 17:59 |
*** gkadam_ has joined #openstack-swift | 18:33 | |
openstackgerrit | Samuel Merritt proposed openstack/swift master: Fix socket leak on object-server death https://review.openstack.org/575254 | 18:35 |
*** gkadam has quit IRC | 18:37 | |
openstackgerrit | Tim Burke proposed openstack/swift master: Support long-running multipart uploads https://review.openstack.org/575818 | 18:38 |
*** cah_link has joined #openstack-swift | 18:54 | |
timburke | :-( i forgot to check the s3api func tests when i did https://review.openstack.org/#/c/570604/ | 18:55 |
patchbot | patch 570604 - swift - Fix up insecure behavior for functional tests (MERGED) | 18:55 |
openstackgerrit | Tim Burke proposed openstack/swift master: Properly handle custom metadata upon an object COPY operation https://review.openstack.org/575824 | 19:16 |
*** brimestoned has joined #openstack-swift | 19:29 | |
*** brimestoned has left #openstack-swift | 19:29 | |
*** lifeless has joined #openstack-swift | 19:34 | |
*** mvenesio has quit IRC | 19:36 | |
openstackgerrit | Tim Burke proposed openstack/swift master: Change PUT bucket conflict error https://review.openstack.org/575829 | 19:51 |
openstackgerrit | Tim Burke proposed openstack/swift master: s3api: Stop debug-logging the entire wsgi environment https://review.openstack.org/575834 | 20:02 |
*** lifeless has quit IRC | 20:02 | |
*** lifeless has joined #openstack-swift | 20:03 | |
openstackgerrit | Tim Burke proposed openstack/swift master: Give better errors for malformed credentials https://review.openstack.org/575836 | 20:15 |
openstackgerrit | Tim Burke proposed openstack/swift master: Listing of versioned objects when versioning is not enabled https://review.openstack.org/575838 | 20:24 |
*** geaaru has joined #openstack-swift | 20:28 | |
*** _david_-_ has joined #openstack-swift | 20:30 | |
openstackgerrit | Tim Burke proposed openstack/swift master: Fix the deletion of non-existent keys https://review.openstack.org/575842 | 20:30 |
_david_-_ | I'm having difficulty figuring out why one node in a 9-node cluster has multiple disks that are 10% above the average disk fullness. The disk capacity looks right for the weights set in the object ring. Any ideas? | 20:33 |
_david_-_ | swift-container-replicator is running. I have not changed weights in months, no disks added or deleted. | 20:36 |
notmyname | could be "unlucky" hashing where large objects get placed on the same drive | 20:37 |
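If particular large objects are suspected, the ring can be asked directly where a given name lands; a hypothetical invocation (the account/container/object names are placeholders):

```sh
# map one object name to its primary (and handoff) devices
swift-get-nodes /etc/swift/object.ring.gz AUTH_myacct mycontainer myobject

# compare disk usage across all nodes in the cluster
swift-recon --diskusage
```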
zaitcev | I'd look into the dark data, personally... Maybe that box has some history of reboots or running a disparate version, or ooms often. | 20:43 |
timburke | _david_-_: you said the object ring looks right, but then were talking about the *container* replicator... do you know whether it's object or container data throwing things off? | 20:47 |
*** lifeless has quit IRC | 20:49 | |
timburke | i'd also wonder about what sorts of stats the replicators are emitting... it may well be that all the replication services are running fine on that node, but it isn't deleting handoff data because it has trouble talking to one of the primaries for it | 20:49 |
*** lifeless has joined #openstack-swift | 20:50 | |
_david_-_ | timburke: I du'ed the containers folder on one of the disks; it's only 1.3G, so it must be object data. We typically have large numbers of objects in a smaller number of containers. | 20:57 |
openstackgerrit | Tim Burke proposed openstack/swift master: Support long-running multipart uploads https://review.openstack.org/575818 | 20:59 |
timburke | that's the way it usually goes :-) | 21:01 |
timburke | i was also thinking about singular containers with a whole lot of objects, though, such that the container db itself is on the order of tens of GBs. sounds like that isn't your problem | 21:01 |
timburke | what kind of stats are the object replicators emitting? | 21:01 |
*** gkadam_ has quit IRC | 21:05 | |
_david_-_ | object-replicator: 51931/70612 (73.54%) partitions replicated in 1800.04s (28.85/sec, 10m remaining); object-replicator: 103836 successes, 180 failures; object-replicator: 16411053 suffixes checked - 0.01% hashed, 0.00% synced; object-replicator: Partition times: max 10.4714s, min 0.0096s, med 0.0125s | 21:09 |
_david_-_ | I think I have one or two disks in the cluster unmounted at the moment, so that might explain the failures. The failure number seems to stick at 180 across multiple output cycles | 21:11 |
timburke | and even so, ~0.2% failure rate seems not bad... hmm... | 21:14 |
_david_-_ | It feels almost like one or more drives were mounted on the wrong /srv/node mountpoints and then they were moved to the correct mount points. Even if that were the case I would assume the replicator would 'fix it' | 21:21 |
openstackgerrit | Merged openstack/swift master: Only try to fetch or sync shard ranges if the remote supports sharding https://review.openstack.org/573816 | 21:30 |
openstackgerrit | Merged openstack/swift master: Set lower bounds on all requirements and test-requirements https://review.openstack.org/575806 | 21:31 |
_david_-_ | another issue... I've just discovered that the x-account-object-count and x-account-bytes-used in the account DBs are not being updated. Has anyone heard of that issue before? | 22:00 |
_david_-_ | x-account-container-count does seem to be getting updated | 22:00 |
*** lifeless has quit IRC | 22:01 | |
timburke | are the container-updaters running? they'd be in charge of reporting bytes/counts from the container dbs up to the account dbs | 22:03 |
*** sundbp has quit IRC | 22:03 | |
timburke | and assuming they *are* running, are they more or less keeping up? | 22:03 |
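One quick way to see whether those reports are landing is to HEAD the account and a busy container and compare their counters; a sketch (endpoint, token, and names are placeholders):

```sh
# stale X-Account-Object-Count / X-Account-Bytes-Used on the account,
# while the container's own X-Container-Object-Count looks current,
# points at the container-updaters not reporting upward
curl -sI -H "X-Auth-Token: $TOKEN" http://proxy.example.com:8080/v1/AUTH_myacct
curl -sI -H "X-Auth-Token: $TOKEN" http://proxy.example.com:8080/v1/AUTH_myacct/mycontainer
```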
openstackgerrit | Tim Burke proposed openstack/swift master: Include '-' in multipart ETags https://review.openstack.org/575860 | 22:04 |
_david_-_ | timburke: hmm. the last time we restarted swift-container-updater across the ACO nodes was 1st Apr, which is when the stats stopped updating.. | 22:15 |
_david_-_ | that was when we upgraded from newton to ocata | 22:17 |
timburke | that sounds suspicious... do you have any logs from the updaters? if they failed to restart, would there be something to try to bring them back up? does ps or top say that they're running? | 22:17 |
_david_-_ | timburke: ps shows them running, yes. No entries in the daily logs for the past few days, in both 'main' and 'error' | 22:17 |
timburke | ...which version from ocata? hmm... did we ever tag after backporting https://github.com/openstack/swift/commit/69c715c? | 22:18 |
timburke | also related: https://github.com/openstack/swift/commit/6339989 | 22:19 |
timburke | 'cause https://bugs.launchpad.net/swift/+bug/1722951 sounds *very much* like your problem... | 22:19 |
openstack | Launchpad bug 1722951 in OpenStack Object Storage (swift) "Container updater may be stuck and not make progress" [High,Fix released] - Assigned to Samuel Merritt (torgomatic) | 22:19 |
_david_-_ | in ubuntu's cloudarchive versioning, I have 2.15.1-0ubuntu3~cloud0 | 22:20 |
_david_-_ | I have the change from https://github.com/openstack/swift/commit/69c715c? but I don't have the code change from https://bugs.launchpad.net/swift/+bug/1722951 | 22:23 |
openstack | Launchpad bug 1722951 in OpenStack Object Storage (swift) "Container updater may be stuck and not make progress" [High,Fix released] - Assigned to Samuel Merritt (torgomatic) | 22:23 |
*** mwheckmann has quit IRC | 22:25 | |
timburke | notmyname: we should think about tagging a 2.15.2 and 2.13.2 for pike and ocata, respectively... | 22:25 |
timburke | _david_-_: you can fix it without a code change by ensuring the updaters (ideally, all swift processes) start with EVENTLET_HUB=poll or EVENTLET_HUB=selects in the environment | 22:29 |
_david_-_ | Thanks timburke. Would '$ EVENTLET_HUB=poll swift-init restart <name>' do the job? | 22:32 |
timburke | i believe so? i don't remember for sure | 22:35 |
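That should work for a one-off restart, since the daemons swift-init spawns inherit its environment; making it survive reboots means putting the variable wherever the services are started. A sketch (the systemd unit name is hypothetical and depends on the distro packaging):

```sh
# one-off: force the eventlet hub for this restart
EVENTLET_HUB=poll swift-init container-updater restart

# persistent, on a systemd-managed node (hypothetical unit name):
#   /etc/systemd/system/swift-container-updater.service.d/eventlet.conf
#     [Service]
#     Environment=EVENTLET_HUB=poll
# then: systemctl daemon-reload && systemctl restart swift-container-updater
```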
notmyname | timburke: have we landed many (any?) backports? | 22:40 |
timburke | notmyname: pike has four patches of interest, most notably the one that fixes _david_-_'s updater issue | 22:41 |
notmyname | ah | 22:42 |
timburke | ocata has that same bug fix, though it's less important since it doesn't have the PipeMutex yet | 22:42 |
*** lifeless has joined #openstack-swift | 22:49 | |
*** cah_link has quit IRC | 22:56 | |
*** gyee has quit IRC | 23:24 | |
openstackgerrit | Samuel Merritt proposed openstack/swift master: Rename test_except.py -> test_catch_errors.py https://review.openstack.org/575875 | 23:34 |
openstackgerrit | Samuel Merritt proposed openstack/swift master: Enforce Content-Length in catch_errors https://review.openstack.org/575876 | 23:34 |
openstackgerrit | Tim Burke proposed openstack/swift master: Support long-running multipart uploads https://review.openstack.org/575818 | 23:58 |