clarkb | tests.unit.test_scheduler.TestScheduler.test_crd_branch was very near 2 minutes too | 00:01 |
SpamapS | what I'm seeing is each of the two python processes using 70-90 percent of "a CPU" | 00:02 |
clarkb | test_crd_check_ignore_dependencies just failed with concurrency=1 | 00:05 |
SpamapS | there's a couple timeouts that are shorter than the default 120s one | 00:05 |
clarkb | this is Exception: Timeout waiting for Zuul to settle | 00:06 |
SpamapS | like any single waitUntilSettled I think is timed out at 60s or so | 00:06 |
SpamapS | Yeah | 00:06 |
SpamapS | that's the one | 00:06 |
clarkb | I shouldn't be running with a global timeout as I am testing directly (no tox) | 00:06 |
clarkb | unless maybe tox sets the env stuff in the venv directly? | 00:06 |
SpamapS | 30s | 00:06 |
SpamapS | one test sets it to 120 | 00:06 |
SpamapS | I think it's the behind_dequeue | 00:06 |
SpamapS | yeah | 00:07 |
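For anyone reproducing this outside of tox, a minimal sketch of bumping the global per-test timeout. It assumes the suite honours the usual OS_TEST_TIMEOUT fixture variable and that tox normally injects it from tox.ini; neither is confirmed in the log itself, and the 240s value is only illustrative.

```sh
# Assumption: the test base class reads OS_TEST_TIMEOUT; export it yourself
# when running directly (no tox) to raise the global per-test timeout.
OS_TEST_TIMEOUT=240 python -m testtools.run \
    tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue
```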
SpamapS | so funny thing is, with concurrency=2 I still hit a system load of 5 | 00:10 |
SpamapS | lots of gits getting run | 00:10 |
SpamapS | java poking around | 00:10 |
SpamapS | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND | 00:11 |
SpamapS | 17765 clint 20 0 4610876 2.389g 9232 S 72.8 15.5 15:17.63 python | 00:11 |
SpamapS | that's a healthy sized python! | 00:11 |
clarkb | wow my python is 4.5g resident | 00:12 |
SpamapS | ah you know, I am looking at dmesg and during the last run I got this: | 00:13 |
clarkb | which is ~2x yours and I am concurrency = 1 | 00:13 |
SpamapS | [651803.188168] CPU1: Core temperature above threshold, cpu clock throttled (total events = 4211) | 00:13 |
SpamapS | (my last concurrency=4 run) | 00:13 |
clarkb | ouch, I don't have that problem. cpu is warm but well below thresholds | 00:13 |
SpamapS | clarkb: probably a horrible chain of resources in subunit. | 00:13 |
SpamapS | during this conc=2 run no throttling | 00:13 |
SpamapS | but.. we're also at 16 minutes | 00:14 |
clarkb | tests.unit.test_scheduler.TestScheduler.test_crd_gate is another almost 2 minute test locally | 00:14 |
clarkb | (passes though) | 00:14 |
clarkb | I'm gonna leave this to make my office warm and go and assemble a car seat | 00:15 |
SpamapS | Gah I hate those | 00:16 |
SpamapS | so I guess the moral of this story is that the Zuul test suite wants cores. | 00:17 |
SpamapS | lots and lots of cores | 00:17 |
jlk | so many cores | 00:19 |
jlk | partly why I went back to my 8 core cloud VM | 00:20 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Add 'pr-comment' github webhook event https://review.openstack.org/443959 | 00:21 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Support GitHub PR webhooks https://review.openstack.org/439834 | 00:21 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Add basic Github Zuul Reporter. https://review.openstack.org/443323 | 00:21 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Add 'push' and 'tag' github webhook events. https://review.openstack.org/443947 | 00:21 |
SpamapS | yeah I think I may need to do something similar | 00:39 |
jeblair | ttrun -e py27 tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue | 01:53 |
jeblair | Ran 1 test in 15.328s | 01:53 |
clarkb | jeblair: is that with a tmpfs? | 02:15 |
*** Shrews has quit IRC | 05:43 | |
*** yolanda has joined #zuul | 05:58 | |
*** hashar has joined #zuul | 07:26 | |
*** bhavik1 has joined #zuul | 08:18 | |
*** phschwartz has quit IRC | 08:50 | |
*** bhavik1 has quit IRC | 10:43 | |
*** hashar has quit IRC | 11:53 | |
*** hashar has joined #zuul | 12:07 | |
*** dkranz has quit IRC | 12:17 | |
*** dmellado has quit IRC | 12:26 | |
*** dkranz has joined #zuul | 12:33 | |
*** dmellado has joined #zuul | 12:47 | |
jeblair | clarkb: yes, zookeeper and git on tmpfs. | 14:21 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: move test_disable_at to simple_layout https://review.openstack.org/458252 | 14:38 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Move check_ignore_dependencies to simple_layout https://review.openstack.org/458255 | 14:39 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Move test_repo_deleted to simple_layout https://review.openstack.org/458269 | 14:39 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Remove org_unknown git repo from single-client test config https://review.openstack.org/458270 | 14:39 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Move semaphore tests to their own class https://review.openstack.org/458584 | 14:42 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Move timer tests to commitConfigUpdate https://review.openstack.org/458599 | 14:44 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove commitLayoutUpdate https://review.openstack.org/458648 | 14:50 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Move two more tests to commitConfigUpdate https://review.openstack.org/458647 | 14:50 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix race in test_queue_rate_limiting https://review.openstack.org/458863 | 14:51 |
jeblair | clarkb: can you look at 458863? it fixes a race that showed up in a test run that bumped the second-to-last change from the merge queue. | 14:51 |
mordred | jeblair: so the reason it's a race fix in the test itself is that in production we don't care about subsequent order, so releasing en masse is OK, but because we have to track sequence in the tests it's bad, yes? | 15:31 |
jeblair | mordred: sort-of -- the old fake jenkins was simple enough that it would tend to strictly order test runs, but since we're throwing them at an executor now, they can end up in random order because of context switches. so that used to be deterministic, but isn't anymore. | 15:35 |
mordred | jeblair: nod | 15:35 |
clarkb | and the asserts just below the diff care about that order since it's test1 test2 for the first project then test1 test2 for the second | 15:36 |
mordred | jeblair: I had a similar issue that I _totally_ punted on when we switched off of swiftclient and implemented the threaded object chunk uploads - turns out you can't really assert the sequence of REST calls at the end of the operation when a set of them in the middle are running in parallel on purpose :) | 15:36 |
clarkb | Ran 133 (+132) tests in 6531.664s (+6473.655s) FAILED (id=16, failures=11 (+11), skips=10) was my space heater run | 15:36 |
jeblair | yeah. we have some methods to do assertions of completed builds in random order, but for a lot of the in-progress builds like these, it's a bit more work (especially if you want to pick one of them out to do something with). | 15:37 |
clarkb | I'm gonna grab the above fix, then rerun without sql tests (this time) on tmpfs | 15:38 |
jeblair | clarkb: that is very slow. that run on chocolate was 300s. | 15:38 |
jeblair | clarkb: but i'm glad your office is warm. | 15:39 |
clarkb | this was single process | 15:39 |
clarkb | I wonder how that lines up in total time and not just wall | 15:39 |
SpamapS | jeblair: I'm on the tip of your optimization branch, and I get a lot higher | 15:42 |
SpamapS | Ran 1 test in 32.540s | 15:42 |
SpamapS | jeblair: really wondering now what is different | 15:42 |
jeblair | model name : Intel(R) Core(TM) i7-3770S CPU @ 3.10GHz | 15:43 |
SpamapS | ruh roh.. my tmpfs mount is gone | 15:43 |
SpamapS | that's it | 15:43 |
jeblair | that'll do it | 15:43 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Fix race in test_queue_rate_limiting https://review.openstack.org/458863 | 15:44 |
SpamapS | err no | 15:44 |
SpamapS | Ran 1 test in 31.455s | 15:44 |
clarkb | how big of a tmpfs is necessary? guessing not too large | 15:45 |
SpamapS | My CPU is 0.5GHz slower.. or 17% ... but I'm at 2x slower | 15:46 |
SpamapS | clarkb: I just let it do the default of half of RAM. | 15:46 |
SpamapS | but probably 200MB would do it. | 15:46 |
SpamapS | maybe less | 15:46 |
jeblair | SpamapS: and zk is still on tmpfs? | 15:48 |
SpamapS | yep | 15:48 |
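A rough sketch of the tmpfs arrangement being described. The mount points, the 512m/256m sizes, and the ZUUL_TEST_ROOT variable are illustrative assumptions, not taken from the log; per the sizing discussion above, a couple hundred MB is probably plenty.

```sh
# Assumes the tests honour ZUUL_TEST_ROOT for their scratch git repos and
# that ZooKeeper's dataDir points at /var/lib/zookeeper on this machine.
sudo mount -t tmpfs -o size=512m tmpfs /tmp/zuul-test
sudo mount -t tmpfs -o size=256m tmpfs /var/lib/zookeeper
export ZUUL_TEST_ROOT=/tmp/zuul-test
sudo service zookeeper restart
```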
SpamapS | jeblair: can you try adding '/usr/bin/time' to the front of yours so I can see what the process itself does? | 15:49 |
SpamapS | 19.06user 37.66system 0:33.24elapsed 170%CPU (0avgtext+0avgdata 74896maxresident)k | 15:49 |
SpamapS | 0inputs+0outputs (0major+3259084minor)pagefaults 0swaps | 15:49 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Move two more tests to commitConfigUpdate https://review.openstack.org/458647 | 15:49 |
SpamapS | so much system :-P | 15:49 |
jeblair | 9.00user 17.40system 0:15.77elapsed 167%CPU (0avgtext+0avgdata 70460maxresident)k | 15:50 |
jeblair | 0inputs+0outputs (0major+4652893minor)pagefaults 0swaps | 15:50 |
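Roughly the invocation behind the two outputs above, combining jeblair's earlier ttrun command with the requested /usr/bin/time prefix; the full path avoids the shell builtin `time`, so the user/system split gets printed.

```sh
# Wall time plus user/system CPU split for a single test run.
/usr/bin/time ttrun -e py27 \
    tests.unit.test_scheduler.TestScheduler.test_dependent_behind_dequeue
```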
SpamapS | Your CPU is just 2x faster. It's settled. | 15:52 |
SpamapS | jeblair: OS? | 15:52 |
jeblair | trusty | 15:53 |
SpamapS | interesting.. I wonder if they broke something on xenial | 15:53 |
* SpamapS blames systemd | 15:53 | |
jeblair | Python 2.7.6 | 15:53 |
SpamapS | xenial is 2.7.12 | 15:53 |
SpamapS | but I'd be more suspicious about compiler settings, kernel, and libc things than python 2.7 patch versions | 15:54 |
clarkb | huh testr run locally is only using one process | 15:54 |
SpamapS | clarkb: yeah it only turns on parallelism if you run it via setup.py testr | 15:54 |
SpamapS | I forget the flag to run testr run with the conc auto thing | 15:54 |
clarkb | oh ha ok. Well I guess I will discover if single cpu is stable with tmpfs | 15:55 |
clarkb | and can expand to multi cpu once I have these results | 15:55 |
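The flag SpamapS is reaching for: a plain `testr run` is serial, and testrepository's own parallel switches look like this.

```sh
# testrepository only parallelises when asked:
testr run --parallel                  # concurrency defaults to the CPU count
testr run --parallel --concurrency=2  # or pin it explicitly to limit load
```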
jeblair | i wonder if i can grab a trusty and xenial node from one of our clouds and hope they are suitable for a comparison... | 15:55 |
clarkb | osa, I think it was, did notice some python performance issues | 15:55 |
clarkb | with xenial | 15:55 |
SpamapS | jeblair: if you use the bluebox cloud you can use affinity and get them on the same compute node #worstFeatureEver | 15:56 |
SpamapS | and we may not have turned off affinity on infra cloud.. dunno if it's on by default | 15:56 |
jeblair | the only trusty nodes we have handy are on rax-iad :/ | 15:57 |
jeblair | oh, also osic | 15:57 |
jeblair | i held nodes; they are running jobs. i'll check back in 30 mins or so and see if they're ready. | 15:59 |
clarkb | getting Exception: Timeout waiting for Zuul to settle failures now even with tmpfs, though my zk isn't running on it so I guess thats next step | 16:11 |
clarkb | maybe even run under eatmydata too | 16:12 |
SpamapS | Oh I hadn't even considered eatmydata'ing | 16:13 |
SpamapS | eatmydata did in fact shave a few seconds off | 16:14 |
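For context, eatmydata is an LD_PRELOAD shim that turns fsync()/fdatasync() into no-ops, so the test fixtures' many small git commits stop waiting on the disk; it only needs to be prefixed onto whatever already runs the tests (the runner and concurrency here are just examples).

```sh
# Prefix the existing runner; nothing else changes.
eatmydata testr run --parallel --concurrency=1
```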
clarkb | tests.unit.test_scheduler.TestScheduler.test_crd_multiline took 2 minutes 11 seconds in current run | 16:17 |
*** jkilpatr has quit IRC | 16:28 | |
*** jkilpatr_ has joined #zuul | 16:28 | |
*** hashar has quit IRC | 16:29 | |
clarkb | ok I have it rerunning single process with zk and zuul test root on a tmpfs and under eatmydata for good measure | 16:32 |
* SpamapS got side tracked by his old SSL certs expiring and trying to setup certbot | 16:33 | |
SpamapS | offlineimap doesn't seem to like the letsencrypt CA :-| | 16:33 |
clarkb | if this is still slow then I don't think it's purely an IO issue | 16:33 |
SpamapS | clarkb: no I think it's python perf | 16:33 |
SpamapS | given jeblair's /usr/bin/time output | 16:33 |
jeblair | SpamapS: yeah, i just pinned the hash, but that's not going to last for long | 16:33 |
SpamapS | https://bugs.launchpad.net/ubuntu/+source/python2.7/+bug/1638695 | 16:33 |
openstack | Launchpad bug 1638695 in python2.7 (Ubuntu) "Python 2.7.12 performance regression" [High,Confirmed] - Assigned to Louis Bouchard (louis) | 16:33 |
SpamapS | jeblair: that's what I was about to do.. that will explode every few weeks tho. :-/ | 16:34 |
clarkb | SpamapS: I'm on tumbleweed which would imply its at least not ubuntu's fault if that is the performance issue | 16:34 |
SpamapS | clarkb: what's your gcc --version ? | 16:35 |
SpamapS | xenial is 5.4.0 | 16:35 |
jeblair | gcc-4.8.real (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4 | 16:36 |
mordred | clarkb: "yeah it only turns on parallelism if you run it via setup.py testr" - is that going to be a problem for OpenStack with our push to not use the setup.py testr ? | 16:37 |
mordred | clarkb: or was our push to _use_ setup.py testr and not setup.py test ? | 16:37 |
clarkb | gcc (SUSE Linux) 6.3.1 20170202 [gcc-6-branch revision 245119] | 16:37 |
mordred | wow. gcc 4, 5 and 6 | 16:37 |
clarkb | mordred: it works with both setup.py test and testr, push is to use test | 16:37 |
mordred | ok. cool. so it's not a problem for the gate - just making sure (that could be really unfortunate) | 16:38 |
clarkb | mordred: it's because both entrypoint things check numcpu and default to that concurrency whereas testr itself defaults to no concurrency | 16:38 |
SpamapS | mordred: 'setup.py test' does the parallel thing | 16:38 |
clarkb | because this is a rolling distro I don't actually know if python was compiled with that gcc | 16:38 |
SpamapS | sorry I misspoke. | 16:38 |
clarkb | trying to query python for the info but so far only getting "GCC" | 16:38 |
mordred | clarkb: python should tell you if you launch its repl | 16:38 |
mordred | mordred@solace:~$ python | 16:39 |
mordred | Python 2.7.12 (default, Nov 19 2016, 06:48:10) | 16:39 |
mordred | [GCC 5.4.0 20160609] on linux2 | 16:39 |
SpamapS | so, the analysis on that launchpad bug is legit | 16:39 |
jeblair | [GCC 4.8.4] on linux2 | 16:39 |
jeblair | SpamapS: that's a fascinating bug -- it sure looks like it's saying "python compiled with gcc 4 is better than later versions" | 16:39 |
clarkb | mordred: Python 2.7.13 (default, Jan 03 2017, 17:41:54) [GCC] on linux2 :/ | 16:39 |
SpamapS | http://paste.ubuntu.com/24078834/ <-- the bug explains what each of those 'nopgo' 'nolto' configs are | 16:39 |
mordred | clarkb: oh wow | 16:39 |
mordred | clarkb: I'd file that as a bug against tumbleweed :) | 16:39 |
clarkb | mordred: it's possible gcc 6 breaks something in python's preprocessor funzies | 16:39 |
mordred | yah | 16:39 |
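Two one-liners for the inspection clarkb is attempting; platform.python_compiler() and sysconfig's CONFIG_ARGS are both standard library, though what they report depends entirely on how the distro built the interpreter.

```sh
# What compiler built this interpreter?
python -c 'import platform; print(platform.python_compiler())'
# What configure arguments were recorded at build time?
# (often sparse on distro builds, which configure via packaging scripts)
python -c "import sysconfig; print(sysconfig.get_config_var('CONFIG_ARGS'))"
```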
SpamapS | jeblair: I think they found that the PGO+LTO combination in gcc 5 is hurting python | 16:40 |
jeblair | SpamapS: ah, k, i'm still only halfway through | 16:40 |
SpamapS | the analysis stops late February.. probably around the time doko started looking for a new job. :-/ | 16:40 |
SpamapS | we should try squeeze and stretch | 16:41 |
jeblair | in addition to all the other reasons we should do that, it may help me decide what to upgrade my workstation to. :) | 16:42 |
SpamapS | I've been feeling itchy about it since the layoffs | 16:42 |
SpamapS | I don't think doko was laid off.. but I don't know that he'd stick around either. | 16:43 |
* SpamapS stops speculating | 16:43 | |
jeblair | SpamapS: read this? https://github.com/docker-library/python/issues/160 | 16:44 |
jeblair | (i haven't) | 16:44 |
jeblair | and related https://bugs.python.org/issue26359 | 16:45 |
clarkb | the docker issue says python should be quicker with pgo and lto enabled. However that's for python3 | 16:46 |
SpamapS | have not read them. | 16:47 |
SpamapS | but it sounds to me like xenial needs to go ahead and recompile python with gcc4 | 16:47 |
SpamapS | That's definitely an SRU-worthy issue. | 16:47 |
clarkb | SpamapS: doesn't that present potential binary compatibility issues for people compiling modules? | 16:48 |
clarkb | I guess python-all-dev or whatever it is would have to dep on gcc4 and then have magic to have things compile under that cc by default | 16:48 |
SpamapS | it should not | 16:49 |
SpamapS | it won't change the symbols | 16:49 |
clarkb | I seem to recall a major deal when fedora switched to gcc 5 | 16:49 |
clarkb | it required a full recompile of everything | 16:49 |
SpamapS | Oh god, did they do that? really? :-/ | 16:50 |
SpamapS | (they == gcc) | 16:50 |
SpamapS | would not put it past the optimizer monkeys | 16:50 |
clarkb | it may be that fedora opted into something that required it, but there were binary compat issues not too long ago | 16:50 |
SpamapS | I'd say if ldflags and cflags stay the same (they should) it should be fine | 16:52 |
SpamapS | but I could be wrong | 16:52 |
*** dmellado has quit IRC | 16:52 | |
SpamapS | (just would shock me that compiler upgrades would break your existing dynamically loaded objects) | 16:52 |
SpamapS | that's not really how ELF is supposed to work. But hell.. somebody convinced the world to do systemd.. so they might have done this too. | 16:53 |
clarkb | well ELF doesn't say things about calling conventions and the like | 16:53 |
clarkb | it's possible to introduce things that elf is happy about that would otherwise break | 16:53 |
*** _ari_ has quit IRC | 16:53 | |
*** pabelanger has quit IRC | 16:54 | |
jeblair | those that do not understand computer history are doomed to infinite loop | 16:54 |
clarkb | and confirmed I now have timeout failures | 16:55 |
clarkb | using tmpfs and eatmydata | 16:55 |
jesusaur | jeblair: lol | 16:56 |
SpamapS | oh yay I got offlineimap working | 16:58 |
SpamapS | jesusaur: if you're struggling.. https://letsencrypt.org/certs/isrgrootx1.pem.txt <-- sslcacertfile={that saved locally somewhere} worked for me. | 16:58 |
SpamapS | doh | 16:58 |
SpamapS | jeblair: if you're struggling.. https://letsencrypt.org/certs/isrgrootx1.pem.txt <-- sslcacertfile={that saved locally somewhere} worked for me. | 16:59 |
SpamapS | tabcomplete fail | 16:59 |
SpamapS | clarkb: that's a good point.. I suppose if they really wanted to turn the world on its ear they could 'splode calling conventions. I bet they did. :-/ | 16:59 |
jeblair | SpamapS: oh cool, thanks | 17:00 |
*** pabelanger has joined #zuul | 17:00 | |
SpamapS | oh derp | 17:02 |
SpamapS | jeblair: n/m | 17:02 |
jeblair | SpamapS: i was just starting to dig into why that wasn't working :) | 17:02 |
*** _ari_ has joined #zuul | 17:02 | |
SpamapS | jeblair: at this point I'm reporting it as a github issue and going to get on the hash treadmill | 17:03 |
*** dmellado has joined #zuul | 17:06 | |
*** hashar has joined #zuul | 17:37 | |
*** hashar is now known as hasharAway | 17:38 | |
clarkb | mordred: #define COMPILER "\n[GCC " __VERSION__ "]" <- is how it's defined in python | 17:42 |
clarkb | I wouldn't expect gcc to break that anyways; building for myself and can check | 17:42 |
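A sketch of the comparison clarkb is setting up: one plain CPython build and one with the PGO switch added by the bpo-26359 work linked above. Whether a given 2.7.x tarball actually has --enable-optimizations depends on the point release, so treat the flag, the tarball name, and the install prefixes as assumptions.

```sh
# Build twice from the same source, then time the same test under each.
tar xf Python-2.7.13.tar.xz && cd Python-2.7.13
./configure --prefix=$HOME/py27-plain && make -j4 && make install
make distclean
./configure --prefix=$HOME/py27-pgo --enable-optimizations && make -j4 && make install
```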
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Remove commitLayoutUpdate https://review.openstack.org/458648 | 17:51 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Configurable SSH access to GitHub https://review.openstack.org/444034 | 18:55 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Support for github commit status https://review.openstack.org/444060 | 20:19 |
*** openstackgerrit has quit IRC | 20:48 | |
*** openstackgerrit has joined #zuul | 20:59 | |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Merge pull requests from github reporter https://review.openstack.org/444463 | 20:59 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Support for github commit status https://review.openstack.org/444060 | 21:12 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Merge pull requests from github reporter https://review.openstack.org/444463 | 21:12 |
*** hasharAway has quit IRC | 22:46 |