esberglu_ | #startmeeting powervm_ci_meeting | 13:31 |
openstack | Meeting started Thu Dec 8 13:31:39 2016 UTC and is due to finish in 60 minutes. The chair is esberglu_. Information about MeetBot at http://wiki.debian.org/MeetBot. | 13:31 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 13:31 |
openstack | The meeting name has been set to 'powervm_ci_meeting' | 13:31 |
esberglu_ | Hey guys | 13:31 |
* adreznec waves | 13:32 | |
thorst_ | o/ | 13:33 |
esberglu_ | #topic status | 13:34 |
esberglu_ | So those runs are _slowly_ going through | 13:34 |
adreznec | Yeah | 13:35 |
esberglu_ | The runs themselves seem to be fine | 13:35 |
thorst_ | how slow? | 13:35 |
adreznec | Seeing some scary times out there on the queue | 13:35 |
esberglu_ | thorst_: I can ping you the zuul ip if you want to look | 13:35 |
thorst_ | I have it | 13:36 |
thorst_ | 10 hours? | 13:36 |
esberglu_ | But like 10 hours | 13:36 |
esberglu_ | Yeah | 13:36 |
adreznec | Do we know what's bogging things down yet? | 13:36 |
thorst_ | are any actually going? | 13:36 |
thorst_ | the jenkins has a ton of idle VMs. | 13:36 |
esberglu_ | Yeah. There are 3 going through right now, 20 - 30 have gone through in the last 12 hours | 13:37 |
adreznec | Yeah | 13:37 |
adreznec | It doesn't actually look like anything's been running for all that long | 13:38 |
esberglu_ | That's about the volume I would expect | 13:38 |
adreznec | but things have been in the queue for a long time | 13:38 |
esberglu_ | It's just they sit around in the queue forever first | 13:38 |
esberglu_ | Which means the queue just keeps getting bigger | 13:39 |
adreznec | Ok, so I think we need to nail down exactly what's causing the initial queuing to build up | 13:41 |
adreznec | If it's git issues, then we probably need to invest in mirrors at this point | 13:41 |
thorst_ | did we tell Zuul to only let 3 through at a time? | 13:42 |
thorst_ | wasn't there some gate in zuul about throughput? | 13:43 |
thorst_ | I don't really know how this could be git...what's that train of thought there? | 13:43 |
thorst_ | (not that a mirror is a bad idea...) | 13:44 |
esberglu_ | No the only zuul conf that changed was moving nova from silent to check pipeline | 13:44 |
thorst_ | hmm | 13:44 |
adreznec | Well we were seeing those git performance issues yesterday, and one theory was that we were hitting some kind of internal timeouts doing the clones/fetches | 13:45 |
thorst_ | ahh, cause zuul does some sort of clone | 13:45 |
thorst_ | which I don't understand...I'd have thought that was just in the Jenkins slave VM | 13:45 |
adreznec | Because we could see it attempting to do the same fetch multiple times on different PIDs | 13:45 |
esberglu_ | We were seeing these git fetch <change> | 13:46 |
esberglu_ | That seemed to just be looping | 13:46 |
adreznec | Not sure we have enough data to say that concretely | 13:46 |
adreznec | But it was a theory | 13:46 |
esberglu_ | The only other thing that I thought it might be | 13:46 |
esberglu_ | There are these changes in the queue that depend on like 10 other changes | 13:47 |
esberglu_ | And some of the changes are having merge issues | 13:47 |
thorst_ | why is zuul doing this? ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 powervmci@review.openstack.org git-upload-pack '/openstack/nova' | 13:48 |
esberglu_ | Here's an example of one of those changes https://review.openstack.org/#/c/337789/ | 13:48 |
esberglu_ | Not sure | 13:48 |
thorst_ | that process has been running for a while | 13:49 |
adreznec | Interesting... | 13:51 |
thorst_ | the commit message in zuul about why that runs is "I'll document this later" | 13:51 |
adreznec | That should never really be a particularly long-running command | 13:51 |
thorst_ | I suggest we kill that proc | 13:52 |
thorst_ | and see if we unwedge. | 13:52 |
esberglu_ | Sure | 13:52 |
adreznec | thorst_: How long is a while? | 13:53 |
adreznec | Hours? | 13:53 |
thorst_ | says 08:48 in the ps aux output | 13:53 |
thorst_ | so under 5 min. | 13:54 |
thorst_ | it's done now. | 13:54 |
esberglu_ | Yeah I killed it. Another one just popped up in its place | 13:54 |
thorst_ | did you kill that second one? | 13:55 |
thorst_ | they just seem to be really slow | 13:55 |
esberglu_ | No | 13:55 |
adreznec | Right | 13:56 |
thorst_ | wonder what git-upload-pack does | 13:56 |
thorst_ | needs some investigation, because I don't think a clone would help that... | 13:56 |
thorst_ | well...when in doubt, just run by hand. | 13:58 |
thorst_ | it returns quite the amount of data. | 13:59 |
adreznec | I think it does discovery/fetching of objects from git during a fetch | 14:00 |
adreznec | Not 100% sure on that | 14:00 |
thorst_ | alright...so is that the status? Figure out why we're wedged. | 14:02 |
thorst_ | (since we're over on time in the meeting) | 14:02 |
adreznec | Yeah | 14:03 |
adreznec | Clearly we need longer than 30 minutes to investigate this | 14:03 |
thorst_ | just running the command ourselves may take 30 minutes | 14:03 |
esberglu_ | Yeah. Other than that I put a wiki page up for CI | 14:03 |
esberglu_ | If you guys want to take a look. Still need to finish a few sections and polish it up | 14:03 |
adreznec | Where did it land? | 14:04 |
adreznec | Novalink wiki? | 14:04 |
esberglu_ | Neo dev wiki | 14:05 |
esberglu_ | Subpage under PowerVM CI System | 14:05 |
adreznec | Ok | 14:06 |
esberglu_ | _WIP_ CI System and Deployment | 14:06 |
thorst_ | so that is also for wangqwsh as you train him to be able to redeploy the CI? | 14:06 |
esberglu_ | Yep | 14:06 |
thorst_ | excellent. And if we do need a git mirror, that may be a good project for wangqwsh to drive | 14:06 |
esberglu_ | #endmeeting | 14:07 |
openstack | Meeting ended Thu Dec 8 14:07:24 2016 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 14:07 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/powervm_ci_meeting/2016/powervm_ci_meeting.2016-12-08-13.31.html | 14:07 |
adreznec | Yeah | 14:07 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/powervm_ci_meeting/2016/powervm_ci_meeting.2016-12-08-13.31.txt | 14:07 |
openstack | Log: http://eavesdrop.openstack.org/meetings/powervm_ci_meeting/2016/powervm_ci_meeting.2016-12-08-13.31.log.html | 14:07 |
adreznec | I know we've talked about having a mirror setup before | 14:07 |
adreznec | apt/git/pip | 14:07 |
adreznec | Might be worth investigating again just to take some external network load off and improve times | 14:08 |
adreznec | Might even be able to serve all 3 from the same system realistically | 14:08 |
thorst_ | so... | 14:10 |
thorst_ | I could change the zuul server to virtio | 14:10 |
thorst_ | the nic is not virtio right now | 14:10 |
thorst_ | that may help? But would require a shut down of the zuul server. | 14:11 |
adreznec | Maybe? I guess we should look at how much traffic we're actually driving in that case | 14:12 |
adreznec | virtio is definitely faster than e1000 | 14:12 |
adreznec | just not sure how much bandwidth we're consuming | 14:12 |
adreznec | I don't think restarting should be a huge deal if we had to | 14:13 |
thorst_ | it's not even e1000 | 14:13 |
thorst_ | lol | 14:13 |
adreznec | Well then | 14:14 |
adreznec | Might be worth a shot then | 14:14 |
thorst_ | esberglu_: let me know when you want that done. Just shut down the VM and then let me know | 14:14 |
esberglu_ | thorst_: Okay. I will let the current runs finish then shut it down | 14:15 |
thorst_ | OK | 14:15 |
thorst_ | adreznec: pulling down from pokgsa.ibm.com is super slow | 14:33 |
thorst_ | to the point of stalling | 14:34 |
adreznec | yikes | 14:34 |
thorst_ | so may be worth trying that switch sooner rather than later | 14:35 |
adreznec | What are we pulling from pokgsa? | 14:35 |
thorst_ | I pulled something random | 14:35 |
adreznec | And this is from a machine in POK? | 14:37 |
adreznec | That's crazy... | 14:37 |
thorst_ | well, not in same lab | 14:38 |
thorst_ | but similar area | 14:38 |
thorst_ | also just checked...it's not the disk | 14:39 |
adreznec | Heck, it shouldn't matter even if it was in RCH or AUS | 14:40 |
thorst_ | esberglu_: do we have an ETA on the shut down? | 14:48 |
esberglu_ | One run is still going about 50 min in. | 14:48 |
esberglu_ | So 20ish minutes | 14:48 |
esberglu_ | I can kill it if you want | 14:49 |
thorst_ | I can probably wait | 14:49 |
esberglu | thorst_: The mgmt server is shut down | 15:40 |
thorst_ | esberglu: updating it now | 15:56 |
thorst_ | interesting...it's stuck at 32 KB/s | 16:03 |
thorst_ | I wonder if we got hit by some silly qos rule | 16:03 |
thorst_ | anything in the POK lab itself - 50+ MB/s | 16:06 |
thorst_ | anything out...32 KB/s | 16:06 |
thorst_ | that'll kill us. | 16:06 |
esberglu | Yeah it will | 16:07 |
esberglu | thorst_: I'm seeing 2 MB/s cloning from github? | 16:24 |
thorst_ | it has to be the path to github versus the path to gerrit | 16:29 |
esberglu | What were you seeing 32 KB/s on? | 16:30 |
thorst_ | from openstack gerrit | 16:30 |
thorst_ | and pokgsa | 16:31 |
adreznec | Just curious, have we looked at what the routes look like for github/git.o.o? | 16:33 |
adreznec | See if there are way more hops to reach one or the other, etc? | 16:34 |
adreznec | the pokgsa one is just weird though... | 16:34 |
adreznec | That should be the fastest of all | 16:34 |
thorst_ | I was doing that | 16:38 |
thorst_ | I think it dies on path 4. | 16:38 |
esberglu | thorst_: What can we do to fix that? | 17:07 |
thorst_ | sorry - had to step away | 17:48 |
thorst_ | looking back at it | 17:48 |
thorst_ | esberglu: switched to a local DNS | 18:05 |
thorst_ | seems to be much better | 18:05 |
thorst_ | give that a shot? | 18:05 |
thorst_ | might want to try that DNS with the other servers too... | 18:06 |
thorst_ | (POK went from 32 KB/s to 4 MB/s) | 18:09 |
openstackgerrit | Taylor Jakobson proposed openstack/nova-powervm: WIP: First pass at imagecache https://review.openstack.org/408758 | 18:16 |
esberglu | thorst_: Hmm stuff is still hung up in the zuul queue. Gonna try restarting zuul | 18:43 |
thorst_ | yeah, I saw that... | 18:47 |
thorst_ | maybe ping kmtaylor and see if he knows anyone that has seen similar things | 18:48 |
esberglu | thorst_: A nova-powervm run went straight through after restarting. Gonna sit back and see what happens | 18:51 |
thorst_ | yeah, but the issue seems to be that when nova things come in | 18:53 |
thorst_ | something gets wedged | 18:53 |
thorst_ | and then we're stuck behind that. | 18:53 |
esberglu | thorst_: who is kmtaylor | 19:18 |
thorst_ | kurt taylor | 19:18 |
thorst_ | the PowerKVM lead | 19:18 |
openstackgerrit | Taylor Jakobson proposed openstack/nova-powervm: Retry up to 3 times on disk create https://review.openstack.org/406355 | 19:18 |
thorst_ | he should be able to route us to the right PowerKVM CI owner | 19:18 |
thorst_ | I know that this is also hosted in the POK lab | 19:19 |
adreznec | thorst_: is it still rfalco? | 19:24 |
thorst_ | unsure... | 19:24 |
esberglu | He's offline, anyone else that might know? | 19:25 |
esberglu | And are we assuming nova is taking that much longer just because it's larger? | 19:25 |
thorst_ | not sure... | 19:27 |
adreznec | esberglu: I think the maintainers are rfolco and mmedvede | 19:29 |
adreznec | Mikhail is online, you could try pinging him | 19:30 |
adreznec | Or I can if you want | 19:31 |
tjakobs | thorst_ how much time were you thinking for a max when randomizing the sleep for the retries? Also, did you mean to change the topic in gerrit? | 19:33 |
esberglu | thorst_: KVM basically said the same thing we were thinking. They switched to github and are considering setting up a local mirror | 19:42 |
adreznec | They're also in POK | 19:44 |
thorst_ | esberglu: if they set up a clone, we might as well just make one single clone for us both to use? | 20:09 |
thorst_ | maybe up on jupiter or something. | 20:09 |
thorst_ | esberglu_: why doesn't ours say "non-voting" on this? https://review.openstack.org/#/c/408689/ | 21:10 |
esberglu_ | I think it's because we set up our voting through the zuul pipeline and not through the job definition | 21:12 |
openstackgerrit | Taylor Jakobson proposed openstack/nova-powervm: Retry up to 3 times on disk create https://review.openstack.org/406355 | 21:12 |
esberglu_ | I don't think it's a big deal. Other non-voting CI systems also don't show up as "non-voting" | 21:12 |
esberglu_ | (I know I gave you opposite info on this before when you were asking about whether we were voting) | 21:13 |
esberglu_ | Like Intel PCI CI on that patch you linked. They don't show up as "non-voting" but they don't vote | 21:13 |
thorst_ | ok | 21:17 |
thorst_ | cool | 21:17 |
*** tlian has quit IRC | 22:23 | |
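
Note on the retry discussion above (tjakobs asking about a maximum when randomizing the sleep between retries, for the "Retry up to 3 times on disk create" change): the following is only a minimal sketch of that general idea, not the code in the nova-powervm review. The function name `retry_with_jitter`, the `max_sleep` parameter, and its 5 second default are hypothetical; the log never states what maximum was chosen.

```python
import random
import time


def retry_with_jitter(operation, attempts=3, max_sleep=5.0,
                      sleep=time.sleep, rng=random.uniform):
    """Call operation() up to `attempts` times.

    Between failed attempts, sleep for a random interval in
    [0, max_sleep] seconds so that concurrent callers do not all
    retry at the same moment. `max_sleep` is an assumed value.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:  # a real caller would catch a narrower type
            last_error = exc
            if attempt < attempts:
                # Randomized delay before the next attempt.
                sleep(rng(0, max_sleep))
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

The `sleep` and `rng` arguments are injectable only so the behaviour can be checked without actually waiting; a caller would normally leave them at their defaults.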
