Monday, 2017-10-09

01:05 *** dkranz has quit IRC
01:05 *** xinliang has joined #zuul
02:45 *** bhavik1 has joined #zuul
02:54 *** bhavik1 has quit IRC
03:00 *** zhuli has joined #zuul
06:26 *** isaacb has joined #zuul
08:30 *** hashar has joined #zuul
08:56 *** bhavik1 has joined #zuul
09:01 *** electrofelix has joined #zuul
09:30 *** bhavik1 has quit IRC
10:43 *** jkilpatr has joined #zuul
10:51 *** jkilpatr has quit IRC
11:04 *** jkilpatr has joined #zuul
12:20 *** dkranz has joined #zuul
13:05 *** dkranz has quit IRC
13:14 <mrhillsman> how does nodepool read clouds? i keep getting an error - os_client_config.exceptions.OpenStackConfigException: Cloud fake was not found.
13:14 <mrhillsman> or is that expected behavior
13:15 <mrhillsman> everything else appears to be working
13:15 <mrhillsman> dib has built an image
13:15 <mrhillsman> and nodepool-launcher looks like it is fine as i do not see any errors in the logs either
13:16 <mrhillsman> i see successful requests and responses
13:16 <mrhillsman> for both
13:17 <mrhillsman> but nodepool dib-image-list fails with error re cloud fake
13:21 <rcarrillocruz> mrhillsman: it reads a file called clouds.yaml
13:22 <rcarrillocruz> typically, it should be under ~/.config/openstack/clouds.yaml
13:22 <mrhillsman> ok cool, i have that in place, and using the default
13:22 <mrhillsman> ah
13:22 <mrhillsman> it is in /etc/nodepool/clouds.yaml
13:22 <rcarrillocruz> i think that's ok, it falls back to that location if the user's home does not have it
13:23 <mrhillsman> nope
13:23 <mrhillsman> you are right
13:24 <mrhillsman> moving it there fixed it :)
13:24 <rcarrillocruz> https://docs.openstack.org/os-client-config/latest/
13:24 <rcarrillocruz> yah, it's an oscc thing, not a thing specific to nodepool
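For reference, the sort of clouds.yaml os-client-config looks for under ~/.config/openstack/ (and, if memory serves, /etc/openstack/) is sketched below; the cloud name must match whatever nodepool.yaml references, and the auth values here are placeholders rather than anything taken from this conversation.

```yaml
# Illustrative clouds.yaml sketch -- cloud name and credentials are placeholders.
clouds:
  fake:
    auth:
      auth_url: https://keystone.example.com:5000/v3
      username: nodepool
      password: secret
      project_name: ci
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
```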
13:24 <mrhillsman> hopefully i am almost done then :)
13:25 <mrhillsman> need to get it working against this local openstack cloud
13:25 <rcarrillocruz> if you hit roadblocks, happy to help, i've been installing a nodepool/zuul CI lately (which I think you have too per previous comments from you in this channel)
13:25 <mrhillsman> awesome, thanks
13:25 <mrhillsman> yes i have
13:32 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Add base job and roles for javascript  https://review.openstack.org/510236
13:47 <Shrews> jeblair: I'm curious about the new nodepool issue "When nodepool launchers are restarted with building nodes, the requests can be left in a pending state and end up stuck"
13:48 <Shrews> jeblair: we have code (and a test) to handle unlocked PENDING requests: http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/launcher.py?h=feature/zuulv3#n381
13:48 <Shrews> jeblair: is it possible the request disappeared (by a zuul disconnect) before the nodepool cleanup thread ran?
13:50 <Shrews> launcher logs aren't showing me anything about the request from that pastebin after the zk restart, so it seems like it disappeared before it could be cleaned up properly
13:51 <Shrews> hrm, is it possible an active lock could survive a ZK restart?
13:58 *** isaacb has quit IRC
13:58 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Don't modify project-templates to add job name attributes  https://review.openstack.org/510304
13:58 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Don't use cached config when deleting files  https://review.openstack.org/510317
13:59 <Shrews> jeblair: oh, simple kazoo test shows that a lock CAN survive a normal ZK stop/start! TIL
14:01 <Shrews> so now the paste makes sense b/c nodepool still has a lock on the PENDING request *after* the ZK restart
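A minimal version of the kind of kazoo check described above might look like the sketch below (it assumes a local single-node ZooKeeper with zkServer.sh on the PATH; the znode path is made up). The point is that the lock's contender znode is ephemeral, and ephemerals are tied to the client *session* rather than the server process, so a quick stop/start does not release the lock as long as the session survives.

```python
# Sketch: does a kazoo lock survive a ZooKeeper stop/start?
import subprocess
import time
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()
lock = zk.Lock('/test/lock', identifier='launcher-1')
lock.acquire()

# Restart the (single) ZK server out from under the client.
subprocess.check_call(['zkServer.sh', 'stop'])
subprocess.check_call(['zkServer.sh', 'start'])
time.sleep(2)  # give the server a moment to come back

# If kazoo reconnects before the session times out, the ephemeral
# contender znode -- and therefore the lock -- is still there.
other = KazooClient(hosts='127.0.0.1:2181')
other.start()
print(other.get_children('/test/lock'))  # still lists the contender znode
```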
14:03 <Shrews> actually, hrm... why did the building node lose its lock though?  ugh, confused again
14:03 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Update statsd output for tenants  https://review.openstack.org/510580
14:07 <jeblair> Shrews: yeah, i assume we would lose the lock for both the node and the request, or neither of them.  it's strange to think that we would have the lock on just the request but not the node.  in retrospect, i suppose i should have examined the request node more carefully to determine whether it was locked, and if so, by what.
14:09 <Shrews> jeblair: i'll see about working on a test case for this, but w/o more details, might be kind of hard to fix something where I don't know what was wrong.  :/
14:11 <Shrews> and i'm really confused about lock survival across ZK restarts now. maybe jhesketh can help shed light on that?
14:11 <jeblair> Shrews: maybe you meant harlowja?
14:12 <Shrews> oops, yes.
14:12 <jeblair> Shrews: how often does the thing that checks for unlocked pending requests run?  we can probably verify whether it ran or not from logs.  but i think this request was outstanding for quite a while.
14:12 <Shrews> i knew there was a 'j' and 'h' in there somewhere :)
14:12 <Shrews> jeblair: 60s
14:12 <jeblair> i can attest that all j names are the same
14:20 <jeblair> Shrews: i see the bug
14:21 <jeblair> Shrews: the cleanup thread hit an exception while trying to lock the node allocated to the request: http://paste.openstack.org/show/623109/
14:22 <jeblair> Shrews: there's no protection in http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/launcher.py?h=feature/zuulv3#n391 to unlock the node request in case of an exception
14:22 <jeblair> Shrews: so the request stayed locked after that and nothing touched it again.
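The shape of the fix that follows from this is simply to guard the lost-request handling with try/finally so the request lock is always released, even when locking one of its nodes blows up. A simplified sketch of that pattern (the method and attribute names here are illustrative, not the actual nodepool launcher code):

```python
# Sketch of the try/finally pattern discussed above -- names are illustrative.
def cleanup_lost_request(zk, request):
    zk.lockNodeRequest(request, blocking=False)
    try:
        for node_id in request.nodes:
            node = zk.getNode(node_id)
            zk.lockNode(node, blocking=False)   # this is where the exception hit
            try:
                node.allocated_to = None
                zk.storeNode(node)
            finally:
                zk.unlockNode(node)
        zk.deleteNodeRequest(request)
    finally:
        # Without this, an exception above leaves the request locked forever,
        # which is exactly the stuck PENDING request seen in the paste.
        zk.unlockNodeRequest(request)
```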
14:27 *** hashar is now known as hasharAway
14:27 <Shrews> jeblair: ah. wow, good find
14:28 <jeblair> Shrews: you want to work on sprinkling some try/finally clauses in there?
14:29 <Shrews> jeblair: yup, i'll take care of it
14:29 <jeblair> cool, thx
14:47 <openstackgerrit> David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Improve exception handling around lost requests  https://review.openstack.org/510594
14:47 <Shrews> jeblair: ^^^
14:50 <Shrews> actually, i want to make 1 improvement there
14:55 <openstackgerrit> Merged openstack-infra/zuul-jobs master: Add base job and roles for javascript  https://review.openstack.org/510236
14:57 <openstackgerrit> David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Improve exception handling around lost requests  https://review.openstack.org/510594
15:54 <SpamapS> Shrews: Remember that zookeeper wasn't meant to be "restarted"
15:54 <SpamapS> Shrews: so all the recipes assume you have an ensemble that survives forever.
15:54 <SpamapS> I think they're getting better about pointing out when something is different for a single server use case.
15:55 <SpamapS> But for the most part.. the assumption is that you'll roll a restart through the members.
15:56 <jeblair> i've started investigating a hung git operation on ze06 for build 55769796f30e442d817feb96a6854eb1
15:56 <SpamapS> And while zuul is still a spof..and nodepool-launcher is still active/passive at best..it might make sense to look at a ZK ensemble on 3 boxes for us, just for the "one less thing we have to be careful with"
16:00 <jeblair> GIT_HTTP_LOW_SPEED_TIME=30 and GIT_HTTP_LOW_SPEED_LIMIT=1000 are in the environment.  and it is doing an https fetch.
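Those two variables tell git to abort an HTTP(S) transfer that stays below GIT_HTTP_LOW_SPEED_LIMIT bytes/sec for GIT_HTTP_LOW_SPEED_TIME seconds, so a merely slow fetch should normally kill itself. A rough sketch of how that maps onto a GitPython fetch (the repo path is illustrative); note this only covers stalled transfers, not a git process that deadlocks internally, which is what the hard-timeout change below is for:

```python
# Sketch (repo path illustrative): a fetch with git's low-speed abort settings.
import git

repo = git.Repo('/var/lib/zuul/executor-git/example/project')
with repo.git.custom_environment(GIT_HTTP_LOW_SPEED_LIMIT='1000',  # bytes/sec
                                 GIT_HTTP_LOW_SPEED_TIME='30'):     # seconds
    # git aborts the transfer if it stays below 1000 B/s for 30 seconds;
    # it does nothing for a git process hung outside the transfer itself.
    repo.remotes.origin.fetch()
```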
16:18 *** maxamillion has quit IRC
16:18 *** maxamillion has joined #zuul
16:31 <mrhillsman> sooooo...no python3 support :)
16:36 <SpamapS> mrhillsman: eh?
16:37 <mrhillsman> dib - was wondering why image build kept failing
16:37 <mrhillsman> with no module errors
16:38 <mrhillsman> and some other stuff
16:38 <mrhillsman> turns out python3 was the issue - i was using it by default
16:55 <AJaeger> does anybody know what's up with 510540? - it's been in the gate for three hours waiting for a node - the change after it has already been tested
17:01 *** harlowja has joined #zuul
17:26 *** electrofelix has quit IRC
17:32 <fungi> looks like zuulv3 started swapping a bit around 1.5 hours ago, in an effort to save more memory for cache
17:32 <fungi> that probably isn't translating to performance degradation though
17:32 <fungi> 40mb paged out
17:33 <fungi> but the server would clearly use more cache at this point, if it had some
17:33 <fungi> granted, we're sitting on a check pipeline with 400+ changes now
17:33 <openstackgerrit> Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add a generic stage-artifacts role  https://review.openstack.org/509233
17:33 <openstackgerrit> Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add compress capabilities to stage artifacts  https://review.openstack.org/509234
17:33 <openstackgerrit> Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add a generic process-test-results role  https://review.openstack.org/509459
17:34 <AJaeger> 467 changes in periodic alone...
17:35 <AJaeger> fungi, so 403 in check, 467 in periodic, 261 in periodic-stable; and a few in infra pipelines
17:36 <AJaeger> which also means that it did not manage to run a single periodic job today - far too busy serving higher prio requests
17:36 <fungi> oh, yep i basically can't load the status page well enough to scroll around in it
17:36 <AJaeger> fungi: yeah, that page is verrrrry slow. took me some time to find infra-gate and see 510540 waiting...
17:36 <fungi> so over 1100 changes in all
17:37 <fungi> i'd say given that sort of backlog, memory utilization is remarkably low
17:37 <AJaeger> ;)
17:37 <fungi> i'm trying to figure out how to look into the situation with 510540
17:59 <fungi> 73 million lines in the current zuulv3 scheduler debug log. that certainly takes a while to load
18:00 <AJaeger> wow ;(
18:02 <fungi> the logfile is 9.2gib in size
18:02 <Shrews> fungi: perhaps it is too busy logging? ;)
18:03 <fungi> does lead me to wonder
18:24 <fungi> aside from it checking and not finding any dependencies for 510540,1 the other occasional mentions in the debug log seem to be about it getting reenqueued... which i wonder if that is just how it deals with configuration changes
18:24 <Shrews> fungi: did you manage to find any request #'s for it?
18:25 <fungi> i need to figure out which one it isn't running. that will take a bit for me to parse it back out of the status
18:25 <Shrews> holy smokes. 6300+ requests outstanding for nodepool
18:27 <Shrews> 218 total nodes, only 12 are READY. the rest are BUILDING or IN-USE
18:27 <Shrews> maybe at total capacity?
18:27 <AJaeger> multinode-integration-centos-7 is the job that is missing for 510540
18:28 <fungi> okay, in that case this is the last request:
18:28 <fungi> 2017-10-09 15:21:22,408 INFO zuul.ExecutorClient: Execute job base-integration-centos-7 (uuid: fc4e31ad056c4d64b54b5f3cd7fac86b) on nodes <NodeSet centos-7 OrderedDict([('centos-7', <Node 0000178819 centos-7:centos-7>)])OrderedDict()> for change <Change 0x7fd20240b710 510556,2> with dependent changes [<Change 0x7fd243b52e10 510555,1>, <Change 0x7fd242811b70 510578,1>, <Change 0x7fd20285d4a8 510564,2>, <Change
18:28 <fungi> 0x7fd242ede390 510540,1>]
18:28 <Shrews> the ones in-use and locked don't seem to be too old.
18:28 <AJaeger> might be - but the gate has highest priority, so the changes after it got nodes (and those were added 30 mins later)
18:29 <fungi> oh, wait, that's not the one
18:29 <fungi> that's base-integration-centos-7
18:29 <fungi> looking further back
18:29 <AJaeger> fungi multinode-integration-centos-7 is what you need
18:29 <fungi> yeah
18:29 <fungi> thanks, i'm still waiting for the status page filtering to complete
18:30 <Shrews> fungi: doesn't seem to be a request id in that  :(
18:30 <AJaeger> fungi,  curl -s http://zuulv3.openstack.org/status.json | jq '.' > out
18:30 <AJaeger> uuid is 55769796f30e442d817feb96a6854eb1
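For pulling just the stuck change out of that dump instead of loading the whole file, a jq filter along these lines should work (the status.json layout assumed here -- pipelines → change_queues → heads, each change carrying an id and a jobs list -- is from memory, so treat it as a sketch):

```sh
curl -s http://zuulv3.openstack.org/status.json | \
  jq '.pipelines[].change_queues[].heads[][]
      | select(.id == "510540,1")
      | {id, jobs: [.jobs[].name]}'
```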
18:31 <Shrews> request id is of the form: 200-xxxxxxxxxx or 100-xxxxxxxxxx
18:32 <Shrews> just wanted to see if maybe it was waiting on a node. i can't map request ID to change ID with the info that's stored in ZK
18:32 <fungi> 2017-10-09 15:08:12,415 INFO zuul.ExecutorClient: Execute job multinode-integration-centos-7 (uuid: d1ddd1c5e30842859d299983465dd75e) on nodes <NodeSet OrderedDict([('primary', <Node 0000178676 primary:centos-7>), ('secondary', <Node 0000178677 secondary:centos-7>)])OrderedDict([('switch', <Group switch ['primary']>),
18:32 <fungi> ('peers', <Group peers ['secondary']>)])> for change <Change 0x7fd283f29240 509969,1> with dependent changes [<Change 0x7fd20240b710 510556,2>, <Change 0x7fd243b52e10 510555,1>, <Change 0x7fd242811b70 510578,1>, <Change 0x7fd20285d4a8 510564,2>, <Change 0x7fd242ede390 510540,1>]
18:32 <Shrews> but those lead me to believe that it got its nodes?
18:33 <Shrews> so... *shrug*
18:33 <fungi> no, wait, that's the wrong change
18:34 <fungi> just a sec, i'll go regex on this
18:34 <Shrews> go go gadget regex!
18:36 <fungi> think i'm going to switch to grep, since loading a >9gb file into memory is almost certainly causing undue memory pressure on the server
18:36 *** hasharAway is now known as hashar
18:36 <fungi> oh yeah. cacti confirms
18:37 <AJaeger> and we're out of space nearly, aren't we?
18:38 <fungi> oh, yikes, yep
18:39 <fungi> closing out my filter/viewer pipe freed all of that up
18:46 <jeblair> oh wow, so that memory use spike *isn't* zuul
18:46 <fungi> right, that was me
18:47 <jeblair> neat.  that saves some time ):
18:47 <jeblair> :) even
18:47 <fungi> (:
18:47 <fungi> sorry for the false alarm there
18:48 <jeblair> and the free space is increasing too
18:48 <fungi> also me
18:48 <jeblair> we still plan on recovering swap and using it for the zuul log dirs, because we will probably run out of log space in prod
18:48 <jeblair> but also, i think we can make the scheduler log more efficient, it'll just take some cleanup work.
18:49 <jeblair> (the whole "print out the queue" thing is *extremely* useful once, but not more.  finding the right time to print it out when it's useful is the trick)
18:50 <jeblair> at any rate, that shouldn't be significantly different than v2, so we don't have to block on it
18:51 <fungi> at any rate, i think this was the entry i was originally looking for:
18:51 <fungi> 2017-10-09 13:56:10,708 INFO zuul.ExecutorClient: Execute job multinode-integration-centos-7 (uuid: 55769796f30e442d817feb96a6854eb1) on nodes <NodeSet OrderedDict([('primary', <Node 0000177867 primary:centos-7>), ('secondary', <Node 0000177868 secondary:centos-7>)])OrderedDict([('switch', <Group switch ['primary']>),
18:51 <fungi> ('peers', <Group peers ['secondary']>)])> for change <Change 0x7fd242ede390 510540,1> with dependent changes [<Change 0x7fd2533fc518 510009,1>]
18:52 <fungi> there is no more recent execute for that job on that change, and it's properly from ~5hrs ago
18:53 <jeblair> fungi: you can also grab the build uuid from the console log link
18:53 <fungi> but as mentioned in #openstack-infra, i guess this is the one jeblair was looking at back around 14:00z in scrollback
18:53 <jeblair> fungi: and you can use ansible to grep for that uuid in all the executor logs to find out which one is running it
18:54 <fungi> oh, cool idea
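Something along these lines would do it as an ad-hoc ansible command (the host pattern and executor log path are guesses for illustration, not taken from the infra inventory):

```sh
# Which executor is running build 55769796f30e442d817feb96a6854eb1?
ansible 'ze*' -m shell \
  -a 'grep -l 55769796f30e442d817feb96a6854eb1 /var/log/zuul/executor-debug.log'
```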
18:55 <jeblair> as best as i can tell, git is *not* hung waiting on http traffic; it's in some kind of internal deadlock
18:59 <jeblair> fungi: https://etherpad.openstack.org/p/8aKzRXq6ae
19:01 <jeblair> after lunch, i'm going to dust off the git hard-timeout change from last week
19:09 <jeblair> also, i think i'm done poking at that git process.  i'll kill it so things move again after lunch, unless someone objects.
19:19 <SpamapS> Does look like it is deadlocked
19:25 <mrhillsman> nodepool-builder and nodepool-launcher work fine as long as i have -d; without it they do not run and just close
19:34 <mrhillsman> are there any instructions/examples of how to get it to dump logs somewhere?
19:53 <Shrews> mrhillsman: are you using sudo? it probably can't start up because it can't create its pid file (/var/run/nodepool dir, iirc)
19:54 <mrhillsman> using root
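If the pid-file directory really is the problem, a workaround sketch would be to pre-create it for the daemon user before starting the services without -d (the path and ownership here are assumptions following the "iirc" above, so verify against your install):

```sh
# Assumed workaround -- adjust user and path to your installation.
mkdir -p /var/run/nodepool
chown nodepool: /var/run/nodepool
nodepool-builder
nodepool-launcher   # check --help for the logging-config option to send logs to a file
```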
19:59 <pabelanger> I haven't been around much today (Turkey day here in Canada) but will be on only for the upcoming zuul meeting today
20:14 <mrhillsman> i'll ignore it for now, things appear to be working, zuul and nodepool are talking to each other and local openstack environment, just need a job beyond noop to work and will flush out the minor issues later
20:42 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Add upload-npm role  https://review.openstack.org/510686
20:43 *** hashar has quit IRC
20:45 <jeblair> mordred, pabelanger: in order to use the git hard-timeout function, we need pabelanger's fix here: https://github.com/gitpython-developers/GitPython/commit/1dbfd290609fe43ca7d94e06cea0d60333343838
20:45 <jeblair> that's merged to the gitpython master branch
20:46 <jeblair> unfortunately, the master branch still has this bug: https://github.com/gitpython-developers/GitPython/issues/605
20:47 <jeblair> i proposed the very basic workaround that Byron suggested here: https://github.com/gitpython-developers/GitPython/pull/686
20:47 <jeblair> it's not good, but it's something
20:48 <jeblair> mordred: in doing that, i noticed that it looks like you may have done some work in that area in https://github.com/emonty/GitPython/commit/0533dd6e1dddfaa05549a4c16303ea4b4540c030  but maybe never submitted pull requests for it?
20:49 <jeblair> at any rate, i think for the moment we will have to try to run from my fork, until we get a release with both my and pabelanger's fix
20:50 <pabelanger> Ya, I hope to look at my open PR tomorrow for GitPython. Didn't get much time over the weekend for it
20:50 <jeblair> pabelanger: they merged your encoding fix to master
20:51 <pabelanger> okay cool, I had another for as_process change
20:56 *** jkilpatr has quit IRC
21:29 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add git timeout  https://review.openstack.org/509517
21:30 <jeblair> mordred, pabelanger: ^
21:33 *** jkilpatr has joined #zuul
21:36 *** Shrews has quit IRC
21:39 *** Shrews has joined #zuul
22:00 <jeblair> it's zuul meeting time in #openstack-meeting-alt
22:19 <Shrews> fungi: fyi, when you get time: https://review.openstack.org/510594
22:22 <fungi> thanks Shrews!
22:57 <pabelanger> I'll be dropping offline again for the next 2 hours, but will review issues later this evening
22:59 <jeblair> please see the meeting minutes for important information about the openstack-infra roll-forward: http://eavesdrop.openstack.org/meetings/zuul/2017/zuul.2017-10-09-22.03.html
23:06 <mrhillsman> so i want to move to a production deployment, i know i need a target openstack cloud, have one
23:06 <mrhillsman> how many dedicated servers do i need?
23:10 <jeblair> mrhillsman: depends largely on the number of simultaneous jobs you'll run
23:11 <mrhillsman> got it, we will not rise to as many as infra runs right now of course
23:11 <jeblair> mrhillsman: (which of course is a function of the size of your openstack cloud, and how busy the repos you're testing are)
23:11 <mrhillsman> but maybe like 30-50 starting off
23:11 <mrhillsman> 30-50 VMs
23:12 <mrhillsman> need to prove it in a live environment but want to get the ball rolling
23:12 <mrhillsman> only thing left in this PoC is to see an actual job run (other than noop :) )
23:13 <jeblair> mrhillsman: you *can* do everything on one host.  for that size deployment, i'd probably use 3 -- one host as a nodepool image builder+launcher (it will probably want more disk than anything else).  one zuul scheduler on maybe an 8g vm.  and one zuul executor, also probably around 8g.
23:14 <mrhillsman> ok cool, thanks
23:14 <jeblair> mrhillsman: we've found our 8g executors can handle, we think, maybe around 100 simultaneous jobs.  so those scale linearly like that.
23:15 <jeblair> mrhillsman: as you start to get larger, dedicated mergers can be useful to handle some of the git load.  1 merger for every 2 executors might be a good ratio.  those can be tiny too -- like 2g or even 1g vm.
23:16 <mrhillsman> got it
23:16 <jeblair> mrhillsman: i'd probably separate the nodepool builder onto its own host if your nodepool cloud quota is > 250 nodes.  and add another nodepool launcher after 500 nodes.
23:17 <mrhillsman> noted
23:22 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add git timeout  https://review.openstack.org/509517
23:25 <mrhillsman> in terms of getting the web status page working, do i need to proxy through apache/nginx?
23:25 <mrhillsman> simply running zuul-web and pointing at the port does not work unless i have it configured improperly
23:28 <mrhillsman> if it should work out the box i'll figure it out, just don't want to go on a rabbit chase
23:44 <jeblair> mrhillsman: no, the status page is not straightforward yet.  there's pending work to integrate it into zuul-web.  but for the moment, there's manual installation steps and proxying may be required.
23:45 <mrhillsman> ok cool, i'll work to figure it out, ty sir
23:50 <jeblair> mrhillsman: i think it's all in the etc/ directory of the zuul repo
23:50 <mrhillsman> ah ok, was looking in the wrong dir, thx for that
23:52 <jeblair> yeah, it's.... not like the rest :)
23:53 <mrhillsman> ;)
23:55 <jeblair> fungi, mordred, pabelanger: even with my fix to gitpython, simple unit tests still take 2x as long.  i think there's more at play.  i think i might be inclined to make a new local fork of gitpython that has 2.1.1 with pabelanger's fix, and run with that until we figure out a long-term solution.  but i expect that to take more than just a couple of days.
23:55 <jeblair> all told, i don't think that needs to change anything about our plan -- we still run with a local fork for a bit.  i think it just changes expectations around what step 2 is.
23:57 <fungi> okay
23:58 <fungi> i'm fine with that, even if it's not quite as nice as we'd hoped
