*** annegentle has joined #openstack-swift | 00:09 | |
*** zhill_ has joined #openstack-swift | 00:11 | |
*** zhill_ has quit IRC | 00:11 | |
*** annegentle has quit IRC | 00:14 | |
ho | good morning! | 00:18 |
notmyname | good morning ho | 00:18 |
ho | notmyname: hello | 00:19 |
*** remix_tj has quit IRC | 00:21 | |
*** annegentle has joined #openstack-swift | 00:22 | |
ho | acoles: "why did 'w' get missed out :)" <=== (^-^;) | 00:35 |
ho | acoles: in Japan, (^-^;) means I'm embarrassed and breaking out in a cold sweat :-) | 00:43 |
mattoliverau | ho: morning | 00:46 |
ho | mattoliverau: morning! | 00:49 |
*** blmartin has joined #openstack-swift | 00:51 | |
*** blmartin_ has joined #openstack-swift | 00:51 | |
*** chlong has quit IRC | 00:52 | |
*** chlong has joined #openstack-swift | 00:54 | |
*** lpabon has joined #openstack-swift | 01:03 | |
*** lpabon has quit IRC | 01:03 | |
*** annegentle has quit IRC | 01:06 | |
*** kota_ has joined #openstack-swift | 01:13 | |
*** ChanServ sets mode: +v kota_ | 01:13 | |
kota_ | good morning | 01:13 |
notmyname | hello | 01:16 |
kota_ | notmyname: hi :) | 01:18 |
*** hugespoon has left #openstack-swift | 01:49 | |
*** blmartin has quit IRC | 01:56 | |
*** blmartin_ has quit IRC | 01:56 | |
*** jkugel has joined #openstack-swift | 02:12 | |
clayg | hi! | 03:00 |
ho | hello clayg! | 03:06 |
*** zul has quit IRC | 03:08 | |
*** asettle is now known as asettle-afk | 03:10 | |
*** zul has joined #openstack-swift | 03:20 | |
*** asettle-afk is now known as asettle | 03:48 | |
*** kota_ has quit IRC | 04:12 | |
*** jamielennox is now known as jamielennox|away | 04:32 | |
ho | patch #187489 hit a liberasurecode error in the gate: http://paste.openstack.org/show/264868/. | 04:47 |
ho | liberasurecode_backend_open: dynamic linking error libJerasure.so: cannot open shared object file: No such file or directory | 04:47 |
ho | are there any env changes in the gate? | 04:50 |
*** ahonda has quit IRC | 04:58 | |
swifterdarrell | ho: is jerasure installed too? | 04:58 |
*** ppai has joined #openstack-swift | 05:00 | |
ho | swifterdarrell: i'm not sure. do you know how to check it in the gate? | 05:04 |
swifterdarrell | ho: no idea | 05:04 |
swifterdarrell | ho: ask the infra guys? | 05:04 |
ho | swifterdarrell: do you know a gu | 05:05 |
swifterdarrell | ho: but based on the error, I'd first make sure libjerasure is actually installed | 05:05 |
ho | something is wrong with my keyboard... | 05:05 |
swifterdarrell | ho: I think monty is one of them? not sure the full set of infra folks | 05:06 |
swifterdarrell | ho: try #openstack-infra channel? | 05:06 |
ho | swifterdarrell: thanks! I will ask the infra guys about this | 05:06 |
swifterdarrell | ho: np, good luck! | 05:07 |
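A quick way to check whether libJerasure is actually visible to the dynamic loader on a node is to query the linker cache; this is a generic Linux check rather than the gate's actual setup, and the liberasurecode path shown is an assumption about a typical install:

    # is libJerasure registered with the dynamic linker at all?
    ldconfig -p | grep -i jerasure

    # which shared libraries does liberasurecode pull in, and are any missing?
    ldd /usr/local/lib/liberasurecode.so | grep -i 'not found'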
swifterdarrell | portante: good news! I'm clearly seeing that one bad disk per node totally fucks object-server, even with workers = 90 (3x disks). Details to come, but it's night and day | 05:11 |
portante | nice | 05:12 |
swifterdarrell | portante: now testing with 1 bad disk in one of the storage nodes, then I'll do a control group with no bad disks... just normal unfettered swift-object-auditor | 05:12 |
portante | great | 05:12 |
portante | so then you'll do the servers-per-port runs with that | 05:12 |
swifterdarrell | portante: sneak peek https://www.dropbox.com/s/o9sapxkvnr2brip/Screenshot%202015-06-04%2022.12.45.png?dl=0 | 05:12 |
swifterdarrell | portante: ya | 05:12 |
* portante looks | 05:12 | |
swifterdarrell | portante: that graph has 4 runs of servers_per_disk=3 w/one hammered disk per storage node, then 4 runs of workers=90, also w/one hammered disk per node | 05:13 |
swifterdarrell | portante: far right is the first run w/workers=90 with only one hammered drive in one of two storage nodes | 05:13 |
portante | which means the object servers are not all engaged handling requests; a few of them are getting the requests. a top on the system would probably bear that out | 05:14 |
swifterdarrell | portante: I have a threaded python script that uses directio python module to issue random reads directly (O_DIRECT) to the raw block device of one disk; it gets I/O queue full to 128 with await times between 500 and 700+ ms | 05:15 |
portante | pummeled | 05:15 |
portante | where in that graph is the servers-per-port stuff | 05:15 |
swifterdarrell | portante: ya, the longer blocking I/Os to the bad disk should interrupt servicing of I/O to other non-bad disks by that same, unlucky, swift-object-server | 05:15 |
swifterdarrell | portante: servers_per_port=3 are the left 4 runs | 05:16 |
swifterdarrell | portante: (the ones that don't look like shit) | 05:16 |
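The disk-hammering approach swifterdarrell describes can be approximated with stdlib-only O_DIRECT reads. This is not his actual script (his is threaded and uses the directio module; the real one is linked later as io_hammer.py), just a minimal single-threaded sketch for Python 3 on Linux, and the device path, block size, and read size are assumptions:

    import mmap, os, random

    DEV = "/dev/sdb"          # assumption: raw block device of the disk to hammer (needs root)
    BLOCK = 4096              # O_DIRECT wants block-aligned offsets and sizes
    READ_SIZE = 64 * BLOCK    # 256 KiB per read

    # O_DIRECT also needs a memory-aligned buffer; anonymous mmap pages are page-aligned
    buf = mmap.mmap(-1, READ_SIZE)

    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
    dev_size = os.lseek(fd, 0, os.SEEK_END)

    while True:  # hammer until killed
        # pick a random, block-aligned offset somewhere on the device
        offset = random.randrange(0, dev_size - READ_SIZE) // BLOCK * BLOCK
        os.lseek(fd, offset, os.SEEK_SET)
        os.readv(fd, [buf])  # direct (page-cache-bypassing) read into the aligned buffer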
*** SkyRocknRoll has joined #openstack-swift | 05:17 | |
swifterdarrell | portante: this is ssbench-master run-scenario -U bench2 -S http://127.0.0.1/v1/AUTH_bench2 -f Real_base2_simplified.scenario -u 120 -r 1200 | 05:17 |
swifterdarrell | portante: 120 concurrency | 05:17 |
swifterdarrell | portante: I simplified the realistic scenario to reduce noise; cutting down concurrency & whatnot for all background daemons other than swift-object-auditor also reduced noise | 05:17 |
portante | okay | 05:17 |
portante | so in that graph, left y-axis is min/avg/max response time, and right is requests per sec | 05:18 |
swifterdarrell | portante: so I'm seeing even one hammered disk screws things up... this is 1 hammered disk out of ~60 total obj disks in cluster | 05:18 |
swifterdarrell | portante: yup | 05:19 |
swifterdarrell | portante: so lower black line is bad, and higher red line is bad | 05:19 |
openstackgerrit | Kota Tsuyuzaki proposed openstack/swift: Fix the missing SLO state on fast-post https://review.openstack.org/182564 | 05:19 |
swifterdarrell | portante: I'll have the actual numbers later (i have all the raw data for these runs) | 05:19 |
portante | wow, so my guess is that if you were to lower the maxclients you would see less of that effect with 90 workers | 05:19 |
portante | and the flat lines just mean test wasn't running | 05:20 |
swifterdarrell | portante: flat line was dinner + True Detective :) | 05:20 |
portante | cute | 05:20 |
swifterdarrell | =) | 05:20 |
swifterdarrell | portante: so you think lowering the default 1024 max clients will improve those runs on the right? the workers=90? | 05:21 |
swifterdarrell | portante: maybe I'll try that in the morning (still w/one bad disk); that'll be cheap & easy to test | 05:21 |
portante | it should because it will allow more of the workers to participate fully | 05:21 |
portante | eventlet accept greenlet is greedy | 05:22 |
swifterdarrell | portante: i'm not convinced that starvation's happening, necessarily... I think the problem is that any object-server (on the affected node) can get fucked by that bad disk | 05:22 |
portante | it'll just keep posting accepts and gobbling up, as long as that process gets on the run queue | 05:22 |
swifterdarrell | portante: but I'm interested in the experiment :) | 05:22 |
portante | certainly, but the more greedy the object server the larger the effect | 05:22 |
portante | servers-per-port is by nature not greedy | 05:23 |
swifterdarrell | portante: I see CPU consumption of swift-object-server being uneven, but how can I tell there aren't outstanding I/Os blocked for the others? i.e. lack of CPU consumption doesn't mean reqs aren't being processed | 05:23 |
swifterdarrell | portante: makes sense; what value would you suggest? | 05:24 |
portante | I would drop it down to 5 or something | 05:24 |
swifterdarrell | portante: k | 05:24 |
portante | maybe even 1 if 90 workers covers all your requests | 05:25 |
swifterdarrell | portante: let's see... 120 clients w/some PUT amplification is up to ~360 concurrent reqs, divided by 180 total workers betw 2 servers, is 2 | 05:25 |
swifterdarrell | portante: so I'll try it w/2 | 05:25 |
portante | sounds reasonable | 05:26 |
portante | so what percentage of the outstanding requests would be for the bad disk in this controlled experiment? | 05:26 |
swifterdarrell | portante: thanks for the idea! I won't get a chance to run it 'til tomorrow morning | 05:27 |
swifterdarrell | portante: heading to bed soon | 05:27 |
portante | I should be myself | 05:27 |
swifterdarrell | portante: all subject to Swift's dispersion (md5) | 05:27 |
swifterdarrell | portante: 1 bad disk (latest runs) out of 60 | 05:27 |
portante | 60 across all servers, right? | 05:28 |
swifterdarrell | portante: ya | 05:28 |
swifterdarrell | portante: like 29 and 32 i think | 05:28 |
swifterdarrell | portante: so actually 61, but whatever | 05:28 |
portante | so my guess is that what you'll see is that servers-per-disk still works, maybe not too bad, but better than normal by a lot | 05:28 |
swifterdarrell | P that a GET hits it is... 1 / (61 / 3) ~= 5% | 05:29 |
portante | but that is just a guess, it depends on how well the requests get spread out | 05:29 |
*** Triveni has joined #openstack-swift | 05:29 | |
swifterdarrell | P that a PUT hits it is... same, I guess | 05:29 |
portante | this is good stuff | 05:30 |
swifterdarrell | portante: I'll ping you tomorrow when I have something | 05:30 |
swifterdarrell | portante: g'night! | 05:30 |
portante | 'night! | 05:30 |
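For readers following along, the two knobs being discussed live in the object server's [DEFAULT] section. The values below simply mirror the experiment (90 workers, max_clients dropped from its 1024 default to roughly client concurrency divided by total workers); they are illustrative, not a recommendation:

    [DEFAULT]
    # number of object-server worker processes for the node (~3x its ~30 disks)
    workers = 90
    # cap on concurrent connections each worker's eventlet hub will accept;
    # the default is 1024, the experiment tries 2 to spread requests across workers
    max_clients = 2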
*** mitz has quit IRC | 05:41 | |
ho | clayg, notmyname, torgomatic: macaque's info: https://www.youtube.com/watch?v=hWstAme0eQs | 05:56 |
cschwede | Good Morning! | 06:23 |
*** jistr has joined #openstack-swift | 06:57 | |
ho | cschwede: Morning! | 07:03 |
openstackgerrit | Christian Schwede proposed openstack/python-swiftclient: Add connection release test https://review.openstack.org/159076 | 07:19 |
*** remix_tj has joined #openstack-swift | 07:27 | |
*** mmcardle has joined #openstack-swift | 07:42 | |
*** jistr is now known as jistr|biab | 07:47 | |
*** hseipp has joined #openstack-swift | 08:09 | |
*** acoles_away is now known as acoles | 08:27 | |
acoles | ho: i did put :) after my comment. just amused me that the values accidentally ended up using v,x,y,z :P | 08:29 |
*** chlong has quit IRC | 08:34 | |
*** foexle has joined #openstack-swift | 08:35 | |
ho | acoles: i didn't notice it. i exposed my weakness in english :-) | 08:45 |
acoles | ho: did you get any idea why the unit tests failed with the libjerasure.so problem? i see it on other jenkins jobs too | 08:46 |
ho | acoles: I asked the infra guys but i haven't got any response yet. | 08:48 |
ho | acoles: FYI: 14:09:58 (ho) hello, patch #187489 got an error regarding liberasurecode in the gate. I would like to know whether libJerasure is installed or not. http://paste.openstack.org/show/264868/ | 08:50 |
*** jistr|biab is now known as jistr | 08:50 | |
acoles | ho: thanks | 08:51 |
ho | acoles: you are welcome! | 08:51 |
acoles | ho: that was a good addition to the test btw | 08:52 |
ho | acoles: thanks! :) | 08:54 |
*** jistr has quit IRC | 08:59 | |
*** jistr has joined #openstack-swift | 09:19 | |
cschwede | acoles: do you remember if we have an „official“ abandon policy? I wanted to abandon a few stale patches (not mine). | 09:20 |
*** geaaru has joined #openstack-swift | 09:21 | |
*** marzif_ has joined #openstack-swift | 09:21 | |
acoles | cschwede: hi. iirc swift policy is that mattoliverau has a bot that finds old patches (4 weeks with -1 ??), mails the author to warn them of likely abandonment, then after 2 weeks he or notmyname abandons them. | 09:22 |
cschwede | morning :) yes, that’s for swift - but what about swiftclient? i think we don’t do it there | 09:24 |
cschwede | https://review.openstack.org/#/q/status:open+project:openstack/python-swiftclient++label:Code-Review%253C%253D-1,n,z | 09:24 |
cschwede | acoles: ^^ | 09:24 |
acoles | cschwede: i guess an improvement might be to place a comment on the patch when warning is given, then any core could abandon after another 2 weeks. | 09:24 |
acoles | cschwede: hmm, you are most likely right, maybe mattoliverau only does it for swift | 09:24 |
acoles | looking... | 09:25 |
cschwede | https://review.openstack.org/#/q/status:open+AND+project:openstack/python-swiftclient+AND+%28label:Verified-1+OR+label:Code-Review%253C%253D-1%29,n,z | 09:26 |
cschwede | acoles: ^^ is a better view, all patches that got either a -1/-2 from a reviewer or jenkins | 09:27 |
*** SkyRocknRoll has quit IRC | 09:27 | |
cschwede | that’s more than half of all swiftclient patches | 09:27 |
acoles | patch 158701 i will ping author (hp) | 09:27 |
patchbot | acoles: https://review.openstack.org/#/c/158701/ | 09:27 |
cschwede | heh, i was looking at that patch just a few minutes ago :) | 09:28 |
acoles | cschwede: so we don't abandon if WIP, i think | 09:32 |
acoles | cschwede: global requirements notmyname always -2's | 09:33 |
cschwede | makes sense | 09:34 |
*** aix has joined #openstack-swift | 09:34 | |
acoles | patch 148791 appears to be superseded by patch 185629 | 09:36 |
patchbot | acoles: https://review.openstack.org/#/c/148791/ | 09:36 |
acoles | cschwede: shall i abandon 148791? | 09:37 |
cschwede | i was thinking the same | 09:37 |
acoles | doing it now | 09:37 |
cschwede | i abandoned a patch earlier and left the following msg: | 09:37 |
cschwede | „There has been no change to this patch in nearly a year, thus abandoning this patch to clear the list of incoming reviews. | 09:38 |
cschwede | Please feel free to restore this patch in case you are still working on it. Thank you!“ | 09:38 |
cschwede | on patch 109802 | 09:38 |
patchbot | cschwede: https://review.openstack.org/#/c/109802/ | 09:38 |
*** SkyRocknRoll has joined #openstack-swift | 09:40 | |
acoles | cschwede: i will abandon patch 160169 because i think the bug is already fixed | 09:45 |
patchbot | acoles: https://review.openstack.org/#/c/160169/ | 09:45 |
cschwede | yes, and there was also no response to your comment | 09:46 |
acoles | cschwede: does an 'abandon' add to our stackalytics scores :P | 09:48 |
cschwede | i don’t think so? | 09:48 |
acoles | awww | 09:48 |
acoles | cschwede: that leaves patch 116065 that is old and not WIP | 09:51 |
patchbot | acoles: https://review.openstack.org/#/c/116065/ | 09:51 |
*** shlee322 has joined #openstack-swift | 09:51 | |
cschwede | yes, and i agree with Darrell's comment on that patch. probably a candidate for abandoning too | 09:52 |
acoles | go for it | 09:53 |
acoles | author can always restore if they disagree | 09:53 |
cschwede | done | 09:55 |
acoles | patch 172791 is not quite so old so i left a comment asking if more work is planned. also, it has had no human review. | 09:58 |
patchbot | acoles: https://review.openstack.org/#/c/172791/ | 09:58 |
*** dmorita has quit IRC | 09:58 | |
acoles | cschwede: i have fresh coffee and croissant waiting for me. bbiab :) | 09:58 |
openstackgerrit | Christian Schwede proposed openstack/python-swiftclient: Add ability to download objects to particular folder. https://review.openstack.org/160283 | 09:59 |
cschwede | acoles: coffee sounds great - enjoy! | 09:59 |
*** Triveni has quit IRC | 10:13 | |
*** shlee322 has quit IRC | 10:18 | |
*** shlee322 has joined #openstack-swift | 10:32 | |
*** ho has quit IRC | 10:41 | |
*** wasmum has quit IRC | 10:55 | |
*** marzif_ has quit IRC | 11:11 | |
*** marzif_ has joined #openstack-swift | 11:12 | |
*** marzif_ has quit IRC | 11:13 | |
*** marzif_ has joined #openstack-swift | 11:13 | |
*** wasmum has joined #openstack-swift | 11:19 | |
*** blmartin_ has joined #openstack-swift | 11:33 | |
*** jkugel has quit IRC | 11:34 | |
*** shlee322 has quit IRC | 11:56 | |
*** wbhuber has joined #openstack-swift | 11:58 | |
*** geaaru has quit IRC | 12:08 | |
*** geaaru has joined #openstack-swift | 12:09 | |
*** km has quit IRC | 12:11 | |
*** mmcardle has quit IRC | 12:13 | |
openstackgerrit | Alistair Coles proposed openstack/swift: Make test_proxy work independent of evn vars https://review.openstack.org/188756 | 12:15 |
openstackgerrit | Alistair Coles proposed openstack/swift: Make test_proxy work independent of env vars https://review.openstack.org/188756 | 12:15 |
*** zul has quit IRC | 12:16 | |
*** zul has joined #openstack-swift | 12:16 | |
*** thurloat_isgone is now known as thurloat | 12:21 | |
*** sc has quit IRC | 12:23 | |
*** mmcardle has joined #openstack-swift | 12:25 | |
*** MVenesio has joined #openstack-swift | 12:25 | |
*** MVenesio has quit IRC | 12:25 | |
*** sc has joined #openstack-swift | 12:26 | |
*** wbhuber has quit IRC | 12:27 | |
*** annegentle has joined #openstack-swift | 12:32 | |
*** blmartin_ has quit IRC | 12:38 | |
openstackgerrit | Prashanth Pai proposed openstack/swift: Make object creation more atomic in Linux https://review.openstack.org/162243 | 12:59 |
*** jkugel has joined #openstack-swift | 13:00 | |
*** ppai has quit IRC | 13:01 | |
*** jkugel1 has joined #openstack-swift | 13:02 | |
*** jkugel has quit IRC | 13:05 | |
*** kei_yama has quit IRC | 13:07 | |
*** petertr7_away is now known as petertr7 | 13:07 | |
*** SkyRocknRoll has quit IRC | 13:18 | |
*** wbhuber has joined #openstack-swift | 13:19 | |
*** blmartin has joined #openstack-swift | 13:23 | |
*** acoles is now known as acoles_away | 13:37 | |
tdasilva | good morning | 13:38 |
cschwede | Hello Thiago! | 13:42 |
tdasilva | cschwede: hi! | 13:43 |
*** acampbell has joined #openstack-swift | 13:43 | |
*** acampbel11 has joined #openstack-swift | 13:44 | |
*** jrichli has joined #openstack-swift | 14:06 | |
*** esker has joined #openstack-swift | 14:09 | |
*** esker has quit IRC | 14:14 | |
*** esker has joined #openstack-swift | 14:15 | |
*** acoles_away is now known as acoles | 14:15 | |
*** foexle has quit IRC | 14:27 | |
swifterdarrell | portante: initial results for workers=90 + max_clients=2 do not look good | 14:29 |
*** thurloat is now known as thurloat_isgone | 14:30 | |
*** acampbel11 has quit IRC | 14:38 | |
*** thurloat_isgone is now known as thurloat | 14:41 | |
swifterdarrell | portante: ya, we'll see how consistent the 4 runs are, but the first run of workers=90 + max_clients=2 is worse than workers=90 + max_clients=1024 | 14:43 |
*** minwoob has joined #openstack-swift | 14:50 | |
portante | swifterdarrell: bummer | 15:12 |
*** B4rker has joined #openstack-swift | 15:12 | |
swifterdarrell | portante: cluster might have just gotten cold overnight; the 2nd run of max_clients=2 is looking very similar to the max_clients=1024 | 15:13 |
swifterdarrell | portante: PUTs: https://www.dropbox.com/s/kpb598iouccs3cw/Screenshot%202015-06-05%2008.13.43.png?dl=0 | 15:13 |
swifterdarrell | portante: GETs: https://www.dropbox.com/s/rjeq83p95ofy2mq/Screenshot%202015-06-05%2008.14.00.png?dl=0 | 15:14 |
swifterdarrell | portante: far right are the (still going) max_clients=2 runs | 15:14 |
swifterdarrell | portante: they're identical to the middle runs: 1 bad disk out of 61; swift-object-auditor running full speed, workers=90; only difference is max_clients | 15:14 |
swifterdarrell | portante: and servers_per_disk=3 with two bad disks (one per server) is better than all the workers=90 runs w/only 1 bad disk | 15:16 |
swifterdarrell | portante: so I'm still pretty sure that the I/O isolation is really important | 15:16 |
swifterdarrell | portante: (what threads-per-disk and servers_per_port really try to get at ) | 15:16 |
swifterdarrell | portante: funny story: one of our guys is onsite w/Intel doing EC benchmarking on a much larger cluster than I've got for my testing and their results were being interfered with by like 1 or two naughty disks | 15:18 |
*** gyee_ has joined #openstack-swift | 15:38 | |
*** janonymous_ has joined #openstack-swift | 15:41 | |
*** janonymous_ has quit IRC | 15:46 | |
*** david-lyle has quit IRC | 15:46 | |
*** david-lyle has joined #openstack-swift | 15:46 | |
*** SkyRocknRoll has joined #openstack-swift | 15:50 | |
portante | swifterdarrell: yes | 15:50 |
portante | ;) that is the reality, too often we work with "pristine" environments and expect that is what customers have | 15:51 |
portante | ;) | 15:51 |
*** shlee322 has joined #openstack-swift | 15:55 | |
*** zaitcev has joined #openstack-swift | 15:57 | |
*** ChanServ sets mode: +v zaitcev | 15:57 | |
swifterdarrell | portante: little more data for the max_clients=2: https://www.dropbox.com/s/2oijdlsmng4ih0r/Screenshot%202015-06-05%2009.03.16.png?dl=0 | 16:03 |
swifterdarrell | portante: you can see it's very ball-park with max_clients=1024 | 16:03 |
swifterdarrell | portante: (tossing the first run as an outlier) | 16:03 |
*** jistr has quit IRC | 16:03 | |
swifterdarrell | portante: I'm going to halt the max_clients=2 runs and proceed with the rest of my targets | 16:04 |
*** B4rker has quit IRC | 16:07 | |
*** B4rker has joined #openstack-swift | 16:10 | |
*** B4rker has quit IRC | 16:14 | |
*** B4rker has joined #openstack-swift | 16:16 | |
*** jordanP has joined #openstack-swift | 16:29 | |
*** Fin1te has joined #openstack-swift | 16:29 | |
portante | swifterdarrell: so the timeline for the maxclients=1024 is 22:00 - 00:00, and maxclients=2 is 07:00 till end? | 16:38 |
*** breitz has quit IRC | 16:38 | |
*** breitz has joined #openstack-swift | 16:39 | |
portante | swifterdarrell: are there 4 phases to the test which might match those peaks and valleys? | 16:43 |
*** acoles is now known as acoles_away | 16:45 | |
*** annegentle has quit IRC | 16:46 | |
*** jordanP has quit IRC | 16:48 | |
openstackgerrit | Tim Burke proposed openstack/python-swiftclient: Add ability to download objects to particular folder. https://review.openstack.org/160283 | 16:51 |
*** mmcardle has quit IRC | 16:53 | |
*** annegentle has joined #openstack-swift | 16:54 | |
notmyname | good morning | 16:54 |
*** breitz has quit IRC | 16:57 | |
*** SkyRocknRoll has quit IRC | 16:59 | |
*** annegentle has quit IRC | 17:08 | |
*** Fin1te has quit IRC | 17:10 | |
peluse | morning | 17:10 |
*** SkyRocknRoll has joined #openstack-swift | 17:11 | |
*** dimasot has joined #openstack-swift | 17:13 | |
*** zhill_ has joined #openstack-swift | 17:15 | |
*** geaaru has quit IRC | 17:18 | |
peluse | jrichli, you there | 17:27 |
*** harlowja has quit IRC | 17:27 | |
*** SkyRocknRoll has quit IRC | 17:32 | |
*** harlowja has joined #openstack-swift | 17:32 | |
*** lastops has joined #openstack-swift | 17:33 | |
*** SkyRocknRoll has joined #openstack-swift | 17:45 | |
*** marzif_ has quit IRC | 17:45 | |
*** B4rker has quit IRC | 17:46 | |
*** annegentle has joined #openstack-swift | 17:50 | |
*** marzif_ has joined #openstack-swift | 17:51 | |
*** B4rker has joined #openstack-swift | 17:53 | |
*** hseipp has left #openstack-swift | 17:53 | |
*** B4rker has quit IRC | 17:58 | |
*** B4rker has joined #openstack-swift | 17:59 | |
jrichli | peluse: I am back. what can I do for you? | 18:01 |
*** proteusguy has joined #openstack-swift | 18:04 | |
*** Fin1te has joined #openstack-swift | 18:33 | |
*** harlowja has quit IRC | 18:46 | |
*** proteusguy has quit IRC | 18:47 | |
*** gyee_ has quit IRC | 18:47 | |
*** themadcanudist has joined #openstack-swift | 18:48 | |
tdasilva | so i probably missed this conversation, but what's the status with the py27 tests failing? | 18:49 |
themadcanudist | hey guys, i ran a $container->deleteAllObjects(); from the php opencloud SDK and the command translated to "DELETE /v1/AUTH_$tenant_id%3Fbulk-delete%3D1" | 18:50 |
themadcanudist | which ended up deleting the account!! Now I can't do anything.. i get a 403 "recently deleted" message | 18:50 |
themadcanudist | is this supposed to be possible? | 18:50 |
*** harlowja has joined #openstack-swift | 18:53 | |
*** serverascode has quit IRC | 18:54 | |
*** briancurtin has quit IRC | 18:54 | |
*** zhiyan has quit IRC | 18:54 | |
*** nottrobin has quit IRC | 18:54 | |
*** odsail has joined #openstack-swift | 18:55 | |
*** lastops has quit IRC | 19:00 | |
redbo | themadcanudist: what version of swift are you running? It's possible that could happen if you have an old version of swift and you don't have the bulk delete middleware running and you have the proxy set up to do account management. | 19:02 |
redbo | and you're using a client that doesn't use POST for bulk deletes | 19:04 |
portante | POST for bulk deletes, don't tell the REST police! | 19:04 |
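For contrast, when the bulk middleware is in the proxy pipeline, a client-side bulk delete normally targets specific objects: a POST (or, on some releases, DELETE) to the account URL with a ?bulk-delete query parameter and a newline-separated list of container/object names in a text/plain body. A rough curl sketch, with the token, endpoint, and object names as placeholders:

    # delete two named objects via the bulk middleware (all names are placeholders)
    curl -X POST \
         -H "X-Auth-Token: $TOKEN" \
         -H "Content-Type: text/plain" \
         --data-binary $'mycontainer/obj1\nmycontainer/obj2' \
         "http://proxy.example.com/v1/AUTH_test?bulk-delete"

This never issues a bare DELETE on the account itself, which is what bit themadcanudist above.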
swifterdarrell | portante: probably? each "run" was subsequent 20-min ssbench runs w/like 120s sleep in between. So there should have been 4 humps from max_clients=1024 and 2 or 3 from the in-progress max_clients=2 at the time I took those screenshots | 19:04 |
portante | swifterdarrell: k thanks | 19:05 |
*** lastops has joined #openstack-swift | 19:10 | |
*** annegentle has quit IRC | 19:14 | |
*** SkyRocknRoll has quit IRC | 19:14 | |
*** lastops has quit IRC | 19:15 | |
*** odsail has quit IRC | 19:18 | |
*** lastops has joined #openstack-swift | 19:22 | |
*** lastops has quit IRC | 19:25 | |
themadcanudist | redbo: That's likely it | 19:26 |
themadcanudist | can I disable that functionality!? | 19:26 |
themadcanudist | it's *super-dangerous* | 19:26 |
*** acampbell has quit IRC | 19:26 | |
themadcanudist | I just blew away my test account by running a bulkdelete php api call *on a container* | 19:26 |
themadcanudist | not swift's fault, except for the fact that it allows that behaviour | 19:27 |
*** Fin1te has quit IRC | 19:29 | |
openstackgerrit | Minwoo Bae proposed openstack/swift: The hash_cleanup_listdir function should only be called when necessary. https://review.openstack.org/178317 | 19:30 |
*** silor has joined #openstack-swift | 19:31 | |
*** silor has quit IRC | 19:37 | |
*** B4rker has quit IRC | 19:40 | |
*** ptb has joined #openstack-swift | 19:42 | |
ptb | In a multi-region cluster (4 replicas - 2/2), a 404 on HEAD will go over the wire to check the 2 remote disks and 2 secondary locations. Ideally, when read affinity is enabled, would it make sense to have an additional setting to only check the local region for an object, if that setting is enabled? | 19:48 |
notmyname | ptb: even knowing that setting will result in false 404s to the client? | 19:51 |
*** annegentle has joined #openstack-swift | 19:54 | |
ptb | agreed, what I am trying to solve for is the extra latency over the wire to the remote region | 19:54 |
ptb | b4 returning a 404 that is. 8) | 19:56 |
notmyname | ptb: yeah, that makes sense. just wanted to know if you still wanted the config option even knowing the tradeoffs | 19:59 |
notmyname | ptb: so it sounds like you're ok with a 404 for data that has been stored in the system but isn't in the current region | 20:00 |
notmyname | is that true? | 20:00 |
ptb | Exactly! There is an application pattern in use that always checks b4 writing files. | 20:00 |
ptb | In a multi-region replication scenario having such a setting would save a great deal of over the wire traffic to the remote regions with a HEAD 404 | 20:02 |
notmyname | yeah, I understand how it's good for speeding up 404s | 20:03 |
notmyname | but if you have an issue in one region and those 2 replicas aren't available (but they are available in the other region), then you'll ask for it and get a 404 to the client | 20:03 |
*** annegentle has quit IRC | 20:04 | |
notmyname | I'm thinking of data that is actually already in the cluster | 20:04 |
themadcanudist | redbo: Is there a way to disable the functionality that allows for older swift servers to have the account sent a DELETE? | 20:04 |
*** annegentle has joined #openstack-swift | 20:04 | |
ptb | Yes...this is why I think it should be a setting. Or in a multi-region >2 you could even have a single fallback region. | 20:04 |
*** petertr7 is now known as petertr7_away | 20:06 | |
ptb | I have regions A,B,C,D and perhaps could set a fallback region to check if the local lookup fails - as opposed to the current (albeit correct) behavior of checking all locations? | 20:07 |
notmyname | a specific fallback is already accounted for with the read_affinity setting (you can prefer one region or zone over another) | 20:08 |
notmyname | but yeah, you're asking for a config option to return the wrong answer really fast (in some cases) | 20:09 |
notmyname | granted, the common case would just make 404s faster | 20:09 |
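For reference, read affinity is configured per proxy in proxy-server.conf. The snippet below is a generic two-region example (region numbers and priorities are made up); it biases which replicas the proxy tries first but does not stop it from falling through to the remote region on a local miss, which is exactly the extra behavior ptb is asking to be able to switch off:

    [app:proxy-server]
    use = egg:swift#proxy
    sorting_method = affinity
    # prefer replicas in region 1, then region 2 (lower value = higher preference)
    read_affinity = r1=100, r2=200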
ptb | Yes. Ideally that is the problem to address - returning 404s faster in a multi-region cluster during a HEAD operation | 20:10 |
ptb | As we add regions, customers are expecting the same response times...trying to see if there are options to help meet their expectations. 8) | 20:11 |
ptb | Which also sends another +1 for your async container fix! | 20:12 |
notmyname | have you considered not doing the initial HEAD and instead doing a PUT with the "If-None-Match: *" header? | 20:13 |
* notmyname needs to get back to that patch soon | 20:13 | |
ptb | I hadn't...will experiment. | 20:13 |
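The suggestion here is to skip the existence-check HEAD entirely and let the PUT itself refuse to overwrite: Swift honors If-None-Match: * on object PUTs and answers 412 Precondition Failed if the object already exists. An illustrative curl, with the endpoint, token, and names as placeholders:

    # create the object only if it does not already exist;
    # if it does, the proxy returns 412 Precondition Failed
    curl -X PUT \
         -H "X-Auth-Token: $TOKEN" \
         -H "If-None-Match: *" \
         -T ./localfile \
         "http://proxy.example.com/v1/AUTH_test/mycontainer/myobject"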
*** charlesw has joined #openstack-swift | 20:14 | |
redbo | themadcanudist: if allow_account_management is set to "yes" in the proxy configs, that should be removed/set to no. | 20:14 |
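For anyone auditing their own deployment, that setting lives in the proxy app section of proxy-server.conf; a minimal excerpt with illustrative values:

    [app:proxy-server]
    use = egg:swift#proxy
    # when true, clients can PUT/DELETE whole accounts through the proxy;
    # leave it off unless something (e.g. a reseller system) really needs it
    allow_account_management = false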
notmyname | ptb: https://gist.github.com/notmyname/a63de6ce99d6f1df857b | 20:15 |
themadcanudist | last question redbo… if I accidently deleted an account like that, can I udnelete it and will the objects and containers still exist? | 20:15 |
themadcanudist | cuz righ tnow i'm getting a 403 "recently deleted" | 20:15 |
notmyname | themadcanudist: from the swift "ideas" page: "utility to "undelete" accounts, as described in http://docs.openstack.org/developer/swift/overview_reaper.html" | 20:16 |
*** blmartin has quit IRC | 20:16 | |
redbo | themadcanudist: it depends. if you're running the swift-account-reaper, that's what actually deletes things. Oh yeah just read that page. | 20:17 |
*** tellesnobrega_ has joined #openstack-swift | 20:17 | |
themadcanudist | yeah | 20:17 |
notmyname | themadcanudist: unfortunately, the current way to do that is to twiddle the bit in the account db | 20:17 |
themadcanudist | right | 20:17 |
themadcanudist | and mapping the account -> /srv/node/*/accounts/$HASH ? | 20:19 |
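On the mapping question, swift-get-nodes does that lookup from the ring; the ring path and account name below are placeholders. Its output lists the devices and on-disk paths (the account DB is a sqlite file under .../accounts/<partition>/<suffix>/<hash>/<hash>.db) along with ready-made curl and ssh lines. Actually flipping the deleted status back is then a manual edit of that DB (the "twiddle the bit" notmyname mentions) and should be done with care:

    # show which devices and paths hold the account DBs for this account
    swift-get-nodes /etc/swift/account.ring.gz AUTH_test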
notmyname | ptb: can you add that idea of a config option as you described it as a bug on launchpad? https://bugs.launchpad.net/swift | 20:19 |
*** thurloat is now known as thurloat_isgone | 20:20 | |
ptb | Indeed...will do. Thanks! | 20:23 |
notmyname | ptb: thanks | 20:27 |
minwoob | "Invalid arguments passed to liberasurecode_instance_create" -- has anyone seen this error lately, when testing a patch? | 20:34 |
minwoob | It isn't showing up locally, but when I push it to gerrit, the gate test seems to fail. | 20:35 |
minwoob | On my local system it is fine. | 20:35 |
minwoob | http://logs.openstack.org/17/178317/7/check/gate-swift-python27/889998b/console.html | 20:36 |
*** tellesnobrega_ has quit IRC | 20:38 | |
*** tellesnobrega_ has joined #openstack-swift | 20:39 | |
torgomatic | themadcanudist: fwiw, more-recent versions of Swift disallow account DELETE calls with a query string for just that reason | 20:41 |
*** nottrobin has joined #openstack-swift | 20:46 | |
*** serverascode has joined #openstack-swift | 20:51 | |
*** zhiyan has joined #openstack-swift | 20:54 | |
*** themadcanudist has quit IRC | 20:59 | |
peluse | minwoob, just approved so lets see if it happens again... | 20:59 |
minwoob | peluse: All right. Thank you. | 21:01 |
peluse | there've been some issues in the past up there; I didn't deal with them, though, so if it still pukes we'll find someone who did :) it works locally for me as well, though | 21:02 |
*** briancurtin has joined #openstack-swift | 21:05 | |
minwoob | Okay. | 21:07 |
minwoob | peluse: Do you mean that problem has been observed before, or just in general that the gate tests occasionally have exhibited strange behaviors? | 21:10 |
*** tellesnobrega_ has quit IRC | 21:11 | |
peluse | minwoob, in general issues with getting liberasurecode/pyeclib set up correctly on those systems... | 21:13 |
peluse | bah, still failed. | 21:15 |
peluse | probably little chance of getting help til Mon... | 21:15 |
*** lastops has joined #openstack-swift | 21:18 | |
*** doxavore has joined #openstack-swift | 21:21 | |
*** lastops has quit IRC | 21:23 | |
*** shlee322 has quit IRC | 21:26 | |
*** shlee322 has joined #openstack-swift | 21:27 | |
*** esker has quit IRC | 21:29 | |
*** jrichli has quit IRC | 21:31 | |
*** shlee322 has quit IRC | 21:36 | |
swifterdarrell | peluse: speaking of PyECLib & friends, the 1.0.7m versions break w/some distribute/pip/setuptools combination... I don't have anything more specific than that, but just FYI it can cause trouble | 21:38 |
swifterdarrell | peluse: i.e. I had 1.0.7m installed and the requirement ">=1.0.7" was unmet and my proxies wouldn't start | 21:39 |
swifterdarrell | peluse: arguably, I should have just packaged "1.0.7m" as "1.0.7" (a little white lie) | 21:39 |
swifterdarrell | peluse: (this was for PyECLib) | 21:39 |
*** openstack has joined #openstack-swift | 21:43 | |
torgomatic | swifterdarrell: that is pretty darn convincing | 21:43 |
peluse | swifterdarrell, interesting info on pyeclib. tsg is out of the country, I'll pass it on to him and Kevin though | 21:43 |
swifterdarrell | peluse: thx; don't think there's a real fix beyond getting gate to use actual libraries and get rid of the "m" | 21:44 |
swifterdarrell | peluse: (so the PyPi version also just uses liberasurecode vs. some bundled C stuff) | 21:44 |
swifterdarrell | torgomatic: yup... | 21:44 |
swifterdarrell | torgomatic: night and day | 21:44 |
notmyname | swifterdarrell: run "2" is with the failing disk(s)? | 21:44 |
swifterdarrell | notmyname: numbers like "0;" and "2;" indicate how many disks were made slow | 21:45 |
notmyname | ah ok | 21:45 |
swifterdarrell | acoles_away: clayg: cschwede: zaitcev: portante: mattoliverau: notmyname: tdasilva: torgomatic: so comparing "0;..." to "2;..." shows how badly 2 slow disks (1 per storage node, with 61 total disks in cluster) hurt | 21:46 |
notmyname | yup | 21:46 |
notmyname | night and day :) | 21:46 |
zaitcev | V.curious | 21:46 |
notmyname | the max latency is the scariest part (of workers=90) | 21:47 |
openstackgerrit | Darrell Bishop proposed openstack/swift: Allow 1+ object-servers-per-disk deployment https://review.openstack.org/184189 | 21:48 |
torgomatic | swifterdarrell: so that's 61 total disks in a cluster of 2 nodes, where each storage node had 1 slow disk? | 21:49 |
torgomatic | or is that 2 slow disks per storage node? | 21:50 |
peluse | wow | 21:51 |
peluse | we just experienced those same type things here in our cluster... | 21:52 |
notmyname | dfg_: glange: redbo: y'all should definitely check out those results and the patch ^ | 21:52 |
swifterdarrell | acoles_away: clayg: cschwede: zaitcev: portante: mattoliverau: notmyname: tdasilva: torgomatic: updated the gist to clarify | 21:56 |
swifterdarrell | acoles_away: clayg: cschwede: zaitcev: portante: mattoliverau: notmyname: tdasilva: torgomatic: "0;..." is no slow disks (just normal swift-object-auditor) "1;..." is one slow disk (out of 61 total disks in cluster) in one of 2 storage nodes "2;..." is two slow disks (out of 61 total disks in cluster), one per each of the 2 storage nodes | 21:57 |
torgomatic | swifterdarrell: thanks | 21:57 |
swifterdarrell | torgomatic: np | 21:57 |
zaitcev | I saw that | 21:57 |
zaitcev | Thanks | 21:57 |
swifterdarrell | acoles_away: clayg: cschwede: zaitcev: portante: mattoliverau: notmyname: tdasilva: torgomatic: also updated gist to include the make-a-drive-slow script I used; a value of 256 kills a disk pretty good (await between 500 and 1000+ms) | 21:59 |
openstackgerrit | Michael Barton proposed openstack/swift: go: restructure cmd/hummingbird.go https://review.openstack.org/188939 | 21:59 |
swifterdarrell | https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-io_hammer-py | 21:59 |
*** marzif_ has quit IRC | 22:01 | |
swifterdarrell | acoles_away: clayg: cschwede: zaitcev: portante: mattoliverau: notmyname: tdasilva: torgomatic: here's earlier results that illustrate the threads_per_disk overhead: https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md | 22:02 |
mattoliverau | Wow! | 22:04 |
peluse | swifterdarrell, so I guess the question is how many clusters out there are being hit by this but their admins have no idea? | 22:07 |
swifterdarrell | peluse: probably a lot? | 22:07 |
notmyname | all of them? ;-) | 22:07 |
notmyname | peluse: only the ones that have drives that fail | 22:08 |
peluse | safe bet I think! | 22:08 |
peluse | fail or just intermittently crappy perf? | 22:08 |
swifterdarrell | peluse: Intel saw it in their testing w/COSBench way back, like 4 design summits ago? that was what prompted the threads_per_disk change | 22:08 |
peluse | yup, I remember - that was jiangang | 22:08 |
peluse | portland summit | 22:08 |
swifterdarrell | peluse: notmyname: declining perf is worse than total failure, I think | 22:08 |
notmyname | yeah, bad drive. or overloaded drive | 22:08 |
notmyname | swifterdarrell: yeah | 22:08 |
swifterdarrell | peluse: ya! portland | 22:09 |
notmyname | good: working drive. bad: broken drive. worst: drive that isn't broken but is slow | 22:09 |
redbo | was this not common knowledge? | 22:09 |
peluse | I didn't think the extent was common knowledge but could just be me | 22:10 |
swifterdarrell | redbo: which part? the pain of slow disks has been common knowledge since at least the portland summit | 22:10 |
* peluse means by extent the drastic data that swifterdarrell just showed vs 'yeah there's impact' | 22:10 | |
*** charlesw has quit IRC | 22:10 | |
swifterdarrell | redbo: Mercado Libre deployed object servers per disk via ring device port differentiation quite a while ago and talked about it, so that's been common knowledge | 22:11 |
swifterdarrell | redbo: I don't know about common knowledge, but we've seen too-high overhead w/threads_per_disk and no longer recommend it | 22:11 |
peluse | maybe ripping it out would be a good low priority todo item at some point.... | 22:12 |
*** dimasot has quit IRC | 22:12 | |
notmyname | the new thing here is multiple listeners per port when there's one drive per port (right?). the mercado libre situation was 1:1 port:drive | 22:12 |
swifterdarrell | peluse: I've been terrified of blocking I/O calls starving eventlet hub for quite a while now | 22:12 |
swifterdarrell | peluse: servers_per_port has been on my hitlist for a long time... but finally had enough time to actually work on it | 22:12 |
peluse | so they're not keeping you busy enough, is that what you're saying? :) | 22:13 |
swifterdarrell | peluse: haha | 22:13 |
notmyname | and swifterdarrell's results show that the 1 worker per drive isn't always good (or much better than just a lot of workers wrt max latency). but multiple workers per port is great for smoothing out latency and keeping it low | 22:13 |
swifterdarrell | notmyname: ya, servers_per_port=1 didn't cut it | 22:14 |
swifterdarrell | notmyname: 3 was the sweet spot for my 30-disk nodes; not sure how that'd change for a 60 or 80-disk storage node | 22:14 |
notmyname | I think the new thing here is that multiple servers per port where each port is a different drive | 22:14 |
notmyname | s/that// | 22:15 |
portante | swifterdarrell: nice work, I see that maxclients=2 with 90 workers only helps the 99%, but the average is still high, so that really shows how important the server per port method is | 22:17 |
openstackgerrit | Darrell Bishop proposed openstack/swift: Allow 1+ object-servers-per-disk deployment https://review.openstack.org/184189 | 22:17 |
swifterdarrell | portante: ya, the max_clients=2 was a wash | 22:18 |
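For anyone wanting to experiment with the patch under review (184189), the rough shape of a servers-per-port deployment is: give every disk its own port in the object ring, then tell the object server how many server processes to run per ring port. The snippet below is a sketch under that assumption, with IPs, ports, device names, and weights made up:

    # object-server.conf
    [DEFAULT]
    # spawn this many object-server processes for each unique port in the ring
    servers_per_port = 3

    # object ring: one port per disk on the same node, e.g.
    #   swift-ring-builder object.builder add r1z1-10.0.0.10:6001/sdb 100
    #   swift-ring-builder object.builder add r1z1-10.0.0.10:6002/sdc 100
    #   swift-ring-builder object.builder rebalance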
*** zhill__ has joined #openstack-swift | 22:19 | |
*** jkugel1 has quit IRC | 22:19 | |
*** zhill_ has quit IRC | 22:20 | |
*** bi_fa_fu has joined #openstack-swift | 22:21 | |
*** themadcanudist has joined #openstack-swift | 22:24 | |
themadcanudist | torgomatic: thanks! | 22:25 |
*** ptb has quit IRC | 22:28 | |
*** openstackgerrit has quit IRC | 22:37 | |
*** openstackgerrit has joined #openstack-swift | 22:37 | |
*** lcurtis has quit IRC | 22:41 | |
*** wbhuber has quit IRC | 22:46 | |
*** doxavore has quit IRC | 22:48 | |
*** ozialien has joined #openstack-swift | 22:49 | |
*** zhill__ has quit IRC | 22:53 | |
*** zhill_ has joined #openstack-swift | 22:55 | |
*** zhill_ is now known as zhill_mbp | 22:57 | |
*** petertr7_away is now known as petertr7 | 22:58 | |
*** ozialien has quit IRC | 23:02 | |
*** ozialien has joined #openstack-swift | 23:05 | |
*** petertr7 is now known as petertr7_away | 23:07 | |
*** annegentle has quit IRC | 23:18 | |
openstackgerrit | Michael Barton proposed openstack/swift: go: restructure cmd/hummingbird.go https://review.openstack.org/188939 | 23:31 |
*** ozialien has quit IRC | 23:33 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!