Wednesday, 2017-10-04

openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Speed configuration building  https://review.openstack.org/50930900:32
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Speed configuration building  https://review.openstack.org/50930901:33
*** leifmadsen has quit IRC03:34
*** leifmadsen has joined #zuul03:35
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Use new infra pipelines  https://review.openstack.org/50922404:10
*** mnaser has quit IRC04:15
*** tristanC has quit IRC04:16
*** mnaser has joined #zuul04:23
openstackgerritMerged openstack-infra/zuul-jobs master: Set default on fetch-tox-output to venv  https://review.openstack.org/50917704:38
openstackgerritMerged openstack-infra/zuul-jobs master: Add TODO note about reworking fetch-tox-output  https://review.openstack.org/50923704:38
openstackgerritMerged openstack-infra/zuul master: Use new infra pipelines  https://review.openstack.org/50922304:43
*** tristanC has joined #zuul04:46
*** _ari_ has quit IRC05:54
*** weshay has quit IRC05:55
*** pabelanger has quit IRC05:56
*** weshay has joined #zuul06:20
*** _ari_ has joined #zuul06:20
*** pabelanger has joined #zuul06:21
*** hashar has joined #zuul07:08
*** adrianc has joined #zuul07:17
*** isaacb has joined #zuul07:57
*** adrianc has quit IRC08:08
*** AJaeger has quit IRC08:24
*** isaacb has quit IRC08:32
*** electrofelix has joined #zuul08:57
openstackgerritFabien Boucher proposed openstack-infra/zuul-jobs master: Set zuul_log_path for a periodic job  https://review.openstack.org/50938409:17
*** hashar is now known as hasharAway09:39
*** ricky_ has quit IRC10:26
*** jkilpatr has quit IRC10:35
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Add launcher ID to log messages  https://review.openstack.org/50940611:03
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Add launcher ID to log messages  https://review.openstack.org/50940611:06
*** jkilpatr has joined #zuul11:10
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Add launcher ID to log messages  https://review.openstack.org/50940611:23
Shrewsif we can get 509406 ^^^ in, it would help greatly with future debugging of nodepool11:30
*** dkranz has joined #zuul12:36
mordredShrews: done12:42
dmsimardShrews, mordred: that kind of reminds me, is there a zuul variable in which the executor running the job is available ?12:42
dmsimardin v2, zuul tells us which zuul-launcher is driving the job, right ?12:43
dmsimardis that information relevant ?12:43
pabelangerdmsimard: yes, zuul.executor.hostname12:44
dmsimardpabelanger, mordred, Shrews: ^ how would you feel about adding that information in https://github.com/openstack-infra/zuul-jobs/blob/master/roles/emit-job-header/tasks/main.yaml ?12:46
openstackgerritMerged openstack-infra/nodepool feature/zuulv3: Add launcher ID to log messages  https://review.openstack.org/50940612:46
mordreddmsimard: fine by me12:55
mordredpabelanger, clarkb: ok - this tox-linters / infra-docs tox_envlist thing is WEIRD12:55
mordredif you look at http://logs.openstack.org/21/509221/2/infra-check/tox-linters/4b0f6a5/zuul-info/inventory.yaml ...12:56
mordredyou can see that zuul is passing     tox_envlist: infra-docs12:56
mordred    tox_extra_args: -vv python setup.py build_sphinx12:56
mordredbut those variables are not set anywhere in the inheritance chain for the tox-linters job12:57
mordredjeblair: ^^ when you get up, I think there may be a bug (I'm going to keep looking myself, but it seems like one that might wind up needing your eyeballs)12:57
dmsimardmordred: jeblair fixed it yesterday afaik12:58
dmsimardit was due to config rework12:58
mordredoh! ok cool12:58
dmsimardwe noticed it in https://review.openstack.org/#/c/509254/ first and then he submitted PS2 in https://review.openstack.org/#/c/509309/12:59
dmsimardand it passed after a recheck12:59
dmsimardsince that was last night, I'd suspect it occurred before PS2 was reloaded13:00
mordrednod13:00
mordredso - any known reason why infra-check isn't showing up on http://zuulv3.openstack.org/ ?13:01
dmsimardnot sure, I've noticed that 509254 didn't merge despite passing check last night, maybe related but I was asleep back then13:02
mordredyah13:02
pabelangermordred: I thought I saw infra-check / infra-gate this morning13:10
pabelangerbut I don't see them now13:10
mordredpabelanger: I looked at debug log and it seems to think infracheck is a thing13:12
pabelangerYa, I could see the pipeline in debug.log yesterday too13:13
pabelangerhmm13:14
pabelanger| 0000124504 | rax-ord                | None     | ubuntu-xenial    | 5aefbb28-6219-4347-8e17-d53e71225c28 | in-use   | 00:08:11:55 | locked   |13:14
pabelangerthat doesn't look right13:14
pabelanger8+ hours, in-use13:14
pabelangergoing to see why13:14
openstackgerritJens Harbott (frickler) proposed openstack-infra/zuul feature/zuulv3: Avoid JS error when a change has no id  https://review.openstack.org/50943513:14
*** hasharAway has quit IRC13:16
openstackgerritAndrea Frittoli proposed openstack-infra/zuul-jobs master: Add a generic stage-artifacts role  https://review.openstack.org/50923313:17
openstackgerritAndrea Frittoli proposed openstack-infra/zuul-jobs master: Add compress capabilities to stage artifacts  https://review.openstack.org/50923413:17
openstackgerritDavid Moreau Simard proposed openstack-infra/zuul-jobs master: Add zuul.pipeline and zuul.executor.hostname to job header  https://review.openstack.org/50943613:20
pabelangerdmsimard: see: https://review.openstack.org/483022/ for history. It should be in validate-host, or was13:24
dmsimardpabelanger: hm, I figured that should be in the job header role, I'll look. brb.13:26
pabelangerHmm, ze07 doesn't look happy13:28
pabelangertrying to see why13:28
pabelangerI am seeing a bunch of leaked ssh-agent processes13:28
pabelangerwonder if we should put a timeout on them13:28
mordredpabelanger: we should put that on the todo-list ...13:28
mordredpabelanger: figuring out why they're leaking ... and then having them not leak :)13:29
pabelangerokay13:29
pabelangerI fixed ze07 I think13:29
pabelangerit is because there was a git clone process that was hung13:30
pabelangerand seemed to block everything on the executor13:30
pabelangeronce I killed it, it started processing again13:30
pabelangermordred: okay, updated etherpad with ^13:32
pabelangerdmsimard: ya, we already have the info in inventory files, shouldn't need to duplicate it13:33
mordredpabelanger: do we have an entry already about that git hung process thing too?13:34
pabelangermordred: looking13:34
mordreddmsimard, pabelanger: should we maybe consider including inventory files in logstash uploads? would it help e-r queries?13:34
pabelangermordred: ++13:35
pabelangermordred: I don't see any other entry about hung processes13:35
pabelangermordred: actually, line 106. It could be related13:35
dmsimardmordred: yeah, we could add it with proper indexing per field13:35
mordredpabelanger: k. we should add that to our todo list too - it's happened a few times13:35
mordredpabelanger: out of curiosity - was the repo it was hung on glance-specs13:36
mordreddmsimard: ++13:36
pabelangermordred: project-config13:36
mordredpabelanger: ok. cool13:36
dmsimardmordred: I can probably send a patch for that13:36
pabelangerguessing networking13:36
mordredpabelanger: all the previous times I'd seen it it was glance-specs13:36
pabelangerya13:36
pabelangersame13:36
mordredso I was starting to worry about that repo :)13:36
pabelangerchecking other executors now13:36
pabelangerze06.o.o stuck, fixed13:38
pabelangeropenstack/openstack repo13:38
pabelangerchecking zuul-mergers too13:40
dmsimardmordred: fyi speaking of e-r, zuulv3 support is here https://review.openstack.org/#/c/509313/ (builds on top of the zuul_stream new emit playbook header)13:41
pabelangerzm05 hung :(13:41
pabelangercleaning it up13:41
pabelangerk, just zm05.o.o. So ya, we should see about wrapping our git clone with timeouts13:42
dmsimardmordred, pabelanger: could you point me in the right direction where zuul would be sending metrics to graphite ? grepfu is failing me13:43
dmsimardmordred, pabelanger: nevermind, through statsd13:44
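As dmsimard notes, Zuul reports its metrics through statsd. A minimal sketch of the statsd client pattern, assuming the statsd Python package and a statsd daemon on localhost:8125; the metric names are illustrative, not Zuul's actual keys:

    import statsd

    # Connect to a statsd daemon; statsd metrics are fire-and-forget UDP.
    client = statsd.StatsClient('localhost', 8125, prefix='zuul')
    client.incr('event.gerrit.comment-added')  # counter: one more event seen
    client.timing('job.run-time', 4230)        # timer: duration in milliseconds
    client.gauge('executor.load', 12)          # gauge: instantaneous value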
pabelangermordred: okay, once zuul-executors started running again, the jobs in check pipeline picked up where things left off. So, zuul was able to recover properly13:46
pabelangerwhich is nice :)13:46
mordreddmsimard: looks good! it does make me wonder ...13:50
mordreddmsimard, pabelanger, tobiash, jeblair, jlk, SpamapS, clarkb, fungi: should we consider, before re-rollout, renaming job-output.txt to console.html? or maybe just call the html wrapper thing that consumes job-output.json whenever we write it console.html instead (I know clarkb wants us to leave the pure-text job-output.txt for ease of downloading/grepping)13:52
mordreddmsimard: also - I noticed a http://logs.openstack.org/21/509221/2/infra-check/build-openstack-sphinx-docs/fa00cdc/job-output.txt.gz#_2017-10-03_23_37_59_48434813:53
pabelangermordred: I too prefer job-output.txt to console.html13:53
pabelangerhowever the HTML wrapper seems like a good thing for json file13:53
dmsimardI don't have a strong opinion13:53
mordredpabelanger: yah - the intent is to write an html/javascript that takes job-output.json and makes it look like what job-output.txt looks like now - except with expandable/contractable sections and details13:54
dmsimardmordred: really ?13:54
SpamapSmordred: +1 to html and plain being available.13:54
mordredso that we'll wind up with 3 main presentations of that data ...13:54
dmsimardmordred: plaintext and ara aren't enough ? :p13:54
mordred1) the text file - which is what is streamed to the websocket streaming13:54
dmsimardoh, the full "unadultered" ansible stdout13:55
mordred2) the new console.html which will look like the text file but will have sections - so for folks comfortable with that view but who want a little more ability to poke13:55
pabelangerspeaking of ARA, I did notice it using a bit of CPU for generating reports.  I didn't profile it, but something we should look at on the executors.  However, might not be much of an issue now that SpamapS patch is running13:55
mordred3) ara - for exploring the build from an ansible POV13:56
mordredhopefully between the three of those we should cover folks depending on how they like to think about things13:56
dmsimardpabelanger: I'd expect that it would use cpu capacity when generating reports, yes, but hopefully for a very short burst/duration13:57
mordredI imagine some folks are going to want to ignore the ansible-ness of things and just have ansible run shell scripts - which should be fine13:57
dmsimardfairly high-volume playbooks (such as OSA with >20 hosts >2000 tasks) take no more than a few seconds to generate13:57
mordredthen - once we get tristanC's dashboard in - I would also imagine a way for someone to set a personal preference "show me ara by default" or "show me console.html by default" - and a dashboard job result link could show you the thing that makes sense13:58
mordreddmsimard, pabelanger: now that we have SpamapS patch in - that one seems like just a thing to deal with via capacity planning and executor scale out (also, thanks for that patch SpamapS !!!)13:59
ShrewsI'd like to restart nl01 and nl02 to get the new logging enhancement. Anyone object?14:00
pabelangermordred: Ya, so far executors are working great (minus git clone issue). I'd like to play with our threshold too, right now we are at 20.00 load cap. We likely can bump up a little more based on zuul-launcher history14:01
pabelangerShrews: wfm14:01
dmsimardIt's also worth mentioning that generating static reports is very convenient but really not efficient (in terms of storage), eventually we can consider other ways of providing the ara reports like what pabelanger and I have been discussing14:01
Shrewsholy smokes. >5300 requests in the node request pool14:02
Shrewswth is happening there?14:02
pabelangeryah, I really like how stackviz works. If we could somehow capture the data in ansible, then at a later time have ARA generate it, that would be awesome14:02
openstackgerritAndrea Frittoli proposed openstack-infra/zuul-jobs master: Add a generic process-test-results role  https://review.openstack.org/50945914:03
pabelangerShrews: that is because 2 zuul-executors were stuck for about 8 hours14:03
pabelangerso we didn't process any jobs14:03
pabelangerit should be draining now14:03
dmsimardpabelanger: stackviz generates statically in-job, no ?14:04
dmsimardpabelanger: or you mean openstack-health ?14:04
pabelangerdmsimard: stackviz does today, but nothing stopping us from doing that at a later point in time14:04
dmsimardright14:05
rcarrillocruzheya14:06
rcarrillocruzShrews: around?14:06
rcarrillocruzi'm getting "AttributeError: 'OpenStackNodeRequestHandler' object has no attribute 'launcher_id' " on nodepool14:07
rcarrillocruzsaw a recent change you pushed, like a few hours?14:07
rcarrillocruzis that supposed to be fixed by now14:07
Shrewsrcarrillocruz: yup. good grief you are running on the edge14:07
rcarrillocruzheh14:07
Shrewsrcarrillocruz: just saw that myself. working on a fix14:07
rcarrillocruzthx sir14:07
dmsimardedgy14:08
Shrewsgah. that attribute isn't set immediately in the parent14:09
Shrewsthat stinks14:09
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Log exceptions from MODULE FAILURE more consistently  https://review.openstack.org/50948414:14
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Set log after we have launcher_id  https://review.openstack.org/50948514:14
mordreddmsimard: ^^ that should fix the issue seen here: http://logs.openstack.org/21/509221/2/infra-check/build-openstack-sphinx-docs/fa00cdc/job-output.txt.gz#_2017-10-03_23_37_59_48434814:14
Shrewsmordred: pabelanger: ^^^ can we push this through? nl02 is down until we get that merged14:14
Shrewsthx14:15
Shrewsrcarrillocruz: 509485 is the fix. also, standard warning about running latest code, blah blah blah, bugs, blah blah blah14:16
rcarrillocruzlulz14:16
rcarrillocruzthx14:16
rcarrillocruz+214:16
Shrewsbut it's kinda cool someone is running our code before WE are  :)14:18
rcarrillocruzheh14:18
jeblairpabelanger: what repo was hung with the git clone?14:18
jeblairnm, i see it in scrollback now14:19
rcarrillocruzi'm continuously deploying nodepool, as part of a ci installer for ansible networking14:19
Shrewsrcarrillocruz: neat! risky, but neat.14:20
rcarrillocruzansible-role-nodepool doesn't install nodepool with latest, but as i tore down the environment to bring it up from scratch today i got the issue14:22
pabelanger:)14:23
rcarrillocruzhmm14:25
rcarrillocruzo-k14:25
rcarrillocruzso as zuul v3 has 20 capacity, i better stop looking at zuul dashboard14:26
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Avoid JS error when a change has no id  https://review.openstack.org/50943514:27
fungircarrillocruz: whatcha mean by "20 capacity?"14:33
fungiits capacity is 20% of our aggregate quota14:33
fungiso maybe that?14:34
rcarrillocruzyeah, that's what i meant14:34
fungithat's still something like 200 nodes14:34
pabelangerya, it is a little backed up because of the blockage this morning14:34
pabelangerbut pipeline is moving14:34
rcarrillocruzyeah, but there seems to be a long queue ahead of the change14:35
Shrewsnl02 being down might back it back up again, but hope to have it up soon14:35
pabelangerrcarrillocruz: which change?14:36
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Move test_model.test_job_inheritance_configloader  https://review.openstack.org/50949514:36
rcarrillocruzhttps://review.openstack.org/50948514:36
rcarrillocruzthe thing i was chatting with Shrews about earlier14:37
jeblairthat is not especially urgent, but if i do any more config rework, i'm going to build on that.  because i spent as much time changing that test for the actual work i did earlier as i just now spent moving it so i won't have to do that again.14:37
pabelangerrcarrillocruz: I don't think nodepool is gating by zuulv314:37
jeblairwe should probably move it to the infra queues14:38
Shrewsno, that's waiting on v214:38
rcarrillocruzbut check is on zuulv3 dashboard at least14:38
rcarrillocruzoh wait14:38
rcarrillocruzso14:38
pabelangerYa, we run all checks for zuulv3 but zuulv2.5 is the gate14:38
rcarrillocruzevents14:38
rcarrillocruzare being processed by both zuuls14:38
rcarrillocruzgotcha14:38
pabelangerrcarrillocruz: we did create infra-check and infra-gate for zuulv3 on projects that are gating. eg: zuul, zuul-jobs, openstack-zuul-jobs, which will be high priority14:39
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Speed configuration building  https://review.openstack.org/50930914:41
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Move test_model.test_job_inheritance_configloader  https://review.openstack.org/50949514:41
*** leifmadsen has quit IRC15:01
*** leifmadsen has joined #zuul15:06
pabelangerShrews: I'm seeing about 44 used nodes that are unlocked right now, but nodepool-launcher doesn't appear to be cleaning them up. Based on the discussion the other day, I thought you mentioned nodepool should delete those15:11
Shrewspabelanger: are they owned by nl02?15:12
pabelangerShrews: checking15:12
pabelangerShrews: yah, they appear to be15:12
Shrewsk. nl02 is not running15:13
pabelangerokay, cool! That explains it15:13
Shrewsand the fix doesn't appear to be moving :(15:13
clarkbmordred: my only concern with adding more and more representations of the data like that is log storage and retention15:14
clarkbconsole logs tend to be large ish (though not our largest source of disk consumption)15:14
pabelangerShrews: we could enqueue into gate if needed, or move nodepool to zuulv3 infra-check / infra-gate?15:16
Shrewspabelanger: 1st thing sounds good15:17
Shrewsthen maybe the 2nd thing after for future changes?15:17
pabelangerShrews: 509485 right?15:18
Shrewspabelanger: yes15:19
pabelangerokay, enqueued15:20
Shrewsjeblair: for this double locking thing, acceptNodes() already checks if the request it was given is canceled. Does it need to see if request.uid still exists in self.requests (assuming it was canceled and deleted before it pulled the event off the queue)?15:22
Shrews(just trying to make sure i have the sequence of events right in my head)15:23
openstackgerritMerged openstack-infra/nodepool feature/zuulv3: Set log after we have launcher_id  https://review.openstack.org/50948515:23
pabelangerShrews: ^15:24
Shrewspabelanger: cool. weird that it still shows on status.o.o15:24
pabelangerShrews: ya, it will still run in the check pipeline then report results15:25
Shrewspabelanger: will wait for puppet to update it then restart15:25
jeblairShrews: i think the cancel we're talking about here is zuul itself canceling it.  but in the thing we saw, we lost the zk connection, so it disappeared.15:25
pabelangerokay, relocating to library. will return shortly15:25
Shrewsjeblair: oh, so _updateNodeRequest should be the thing that updates the copy of the request with "this disappeared" so acceptNodes can check that and act accordingly?15:28
Shrewsgah. i need to think through this some more so i can see how the various pieces interact15:30
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove test_job_inheritance  https://review.openstack.org/50950915:30
jeblairShrews: likely so -- i imagine that's getting called but not doing anything useful for the 'znode disappeared' case right now15:31
Shrewsah ha, that's a watcher callback. that's one piece in place15:34
*** yolanda has left #zuul15:34
Shrewsi may not be quick, but i sure am slow15:34
mordredclarkb: yes - it should be no additional representations than we have now15:36
dmsimardmordred: That whole module failure handling, not just your patch but what exists elsewhere for handling module failures seems needlessly complicated to me ? CallbackModule ships methods to handle warnings and exceptions, we could use them ? Or improve them if we feel they're not good enough ?15:36
mordredclarkb: console.html will be an in-broswer javascript rendering of log-output.json15:36
dmsimardmordred: https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/callback/default.py#L52-L53 && https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/callback/__init__.py#L111-L13515:36
mordreddmsimard: yah - the problem is that those use self._display which is not useful to us15:37
dmsimardCallbackModule.display = self.log ?15:37
dmsimardor something15:37
dmsimardI dunno15:37
mordreddmsimard: now - maybe what we should do is define something that behaves like self._display that we can use to override it and re-use base methods15:38
mordreddmsimard: I agree in general - zuul_stream needs a refactor which is coming up very soon on my todo list15:38
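A rough sketch of what mordred describes above: replace the callback's _display with a shim so the stock CallbackModule warning/exception methods can be reused while output goes to a logger. The class and logger names are hypothetical; only Ansible's Display API is assumed.

    import logging
    from ansible.utils.display import Display

    class LoggingDisplay(Display):
        # Route Display output to a logger instead of the console.
        def __init__(self, logger):
            super(LoggingDisplay, self).__init__()
            self.logger = logger

        def display(self, msg, color=None, stderr=False,
                    screen_only=False, log_only=False):
            # Drop color/screen handling; just record the message.
            self.logger.info(msg)

        def warning(self, msg, formatted=False):
            self.logger.warning(msg)

    # Inside a callback plugin's __init__ (illustrative):
    #   self._display = LoggingDisplay(logging.getLogger('zuul.executor.ansible'))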
Shrews| None       | None                   | None     | None             | None                                 | used  | 00:00:00:00 | unlocked |15:38
Shrewsthat's one of our nodes15:38
dmsimardmordred: anyway, I'm not -1 on your patch if it fixes the issue -- I'm just saying we need to keep things as simple as possible :)15:39
Shrewsif someone wants to try to track down why it's all empty, that would be great15:39
dmsimardShrews: ah interesting, that'd explain the None I've been seeing on http://grafana.openstack.org/dashboard/db/nodepool15:40
mordreddmsimard: totally agree15:40
mordredShrews: I managed to get myself nerdsniped from your logging patch to nodepool earlier ...15:40
Shrewsmordred: sorry?15:41
mordredShrews: (the [%s] on the end of the logger name bothered me, because python logging otherwise has facilities for attaching arbitrary data and being structured for flexible consume-side logging config ...)15:41
mordredBUT - I have learned a new thing and have a fun patch coming that I think should be nice in a few different places15:42
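The stdlib facility mordred is alluding to: a logging.LoggerAdapter attaches contextual data to every record, so a launcher id can be injected via the format string rather than baked into the logger name with [%s]. A minimal sketch with illustrative names, not nodepool's actual code:

    import logging

    logging.basicConfig(
        format='%(asctime)s %(levelname)s %(name)s: [%(launcher_id)s] %(message)s')

    # The adapter merges the extra dict into every LogRecord it emits.
    log = logging.LoggerAdapter(logging.getLogger('nodepool.launcher'),
                                {'launcher_id': 'nl01-12345'})
    log.info('Accepting node request %s', '200-0000118158')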
Shrewsk. at any rate, nl02 is now restarted and doing things15:43
Shrewspabelanger: ^^^15:43
Shrewsand the request list is empty. going to restart nl01 now15:44
dmsimardShrews: grafana shows that "None" node as far back as 6 months ago /me shrug15:44
Shrewsnl01 restarted15:46
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Add git timeout  https://review.openstack.org/50951715:48
* SpamapS has coffee now15:51
*** hashar has joined #zuul16:00
pabelangerShrews: great, thanks16:03
jeblairShrews: 2017-10-03 18:39:48,998 INFO zuul.nodepool: Returning nodeset <NodeSet OrderedDict([('ubuntu-xenial', <Node None ubuntu-xenial:ubuntu-xenial>)])OrderedDict()>16:06
pabelangerjust seen this too16:06
pabelanger2017-10-04 16:04:47,782 DEBUG zuul.zk.ZooKeeper: ZooKeeper connection: LOST16:06
pabelangerthere we go, it is back16:06
pabelangererr, trying to connect16:06
jeblairpabelanger: let's not debug zk connection issues while i'm performing memory debugging16:06
pabelangerjeblair: okay!16:07
jeblairShrews: so, somehow, zuul decided to return a nodeset before it was allocated16:08
Shrewsjeblair: i'm beginning to get concerned that you are resubmitting node requests with the same UID. b/c nodepool could update the request out from under zuul if you have resubmitted it.16:10
Shrewsand also, it makes it difficult to match up the node acceptance with the proper request iteration16:11
jeblairShrews: does nodepool use the request.uid?16:12
Shrewsjeblair: only for updating the request. otherwise, no16:12
jeblairShrews: why?  that's internal to zuul...16:13
jeblairShrews: i don't see uid in nodepool's NodeRequest model16:13
Shrewsnot talking about nodepool modifying the uid. but the info for the request (the set of nodes, in particular)16:14
jeblairShrews: okay, you may need to explain the issue to me with more words, sorry.16:14
Shrewsjeblair: i'm not sure that there is an issue yet. just beginning to formulate this in my head16:15
Shrewsjeblair: but in the current thing i'm debugging (the double node locking thing), i'm concerned that the request will enter the queue with one set of data, but could possibly be modified with a different set of data before it can be processed out of the queue16:18
jeblairShrews: who's doing the modifying?16:18
Shrewsjeblair: nodepool (b/c it's processing the request twice). the request watcher you use in zuul would update the request if nodepool changes it.16:19
jeblairShrews: why would nodepool process the request twice?16:20
Shrewsjeblair: because you resubmit it if zuul notices it disappears16:20
openstackgerritMonty Taylor proposed openstack-infra/nodepool feature/zuulv3: Provide resource_name in logging as structured data  https://review.openstack.org/50953116:20
Shrewsjeblair: is this scenario possible: 1) zuul submits NR  2) np assigns nodes 3) zuul enqueues the NR to be handled (but processing of the queue is slow) 4) zk connection is lost, zuul resubmits same NR  5) nodepool assigns new set of nodes to same request (which updates the request in zuul's queue)16:23
Shrewsmy concern is that if 5 happens before the request is processed out of the queue, that we lose the original request data16:24
Shrewscould totally not be a problem, but i'm not yet familiar enough with this code to know if it's a valid concern16:28
jeblairShrews: i see what you're getting at; since we map the znode request id to a nodepool request object, while we know from zk that it's 2 different requests that have been updated, by the time it gets to the zuul event queue, it just appears as the same request twice.  i don't think there's any data corruption, except for the fact that we could end up accepting the same request twice.16:34
jeblairShrews: however, *a lot* of things internally are based on that request object being persistent, so it may be better to just find a way to invalidate queued events if they end up becoming invalid (perhaps by checking that the underlying nodepool request id for the request matches what it was when the event was enqueued)16:35
Shrewsjeblair: yeah, that's what i was (poorly) attempting to get at with the re-used UID16:36
Shrewsjeblair: also, i'm not sure if it's the *same* request object in the queue twice (thus both would get changed), or if they're copies16:38
* Shrews mumbles about python underneath his breath16:39
clarkbid() will tell you16:40
clarkbmight be something to log if it is a concern16:41
*** hashar is now known as hasharAway16:46
*** AJaeger has joined #zuul17:00
jeblairShrews: there is a single zuul NodeRequest object, so it would be added to the queue twice.  but if the connection has been lost the underlying znode object has changed.  so adding the znode id to the event along with the request would let us verify that the event is still valid.17:06
jeblairShrews: the "None" node was from request 100-000011840317:11
jeblairhrm, i'm not sure about that actually17:17
dmsimardLooks like zuulv3.o.o has gone lights out again :(17:40
jeblairShrews: it was request 200-000011815817:40
jeblairShrews: and i think i see the error17:40
jeblairShrews: i think the connection was in the process of going offline when it was updating the request for the fulfilled event.  so it managed to set request.state=fulfilled, but did not manage to update the nodes.17:45
pabelangerjeblair: +2 on 509517 but comment was left. noticed something in API docs for timeout that might affect us17:45
jeblairShrews: the joys of a non-transactional system17:46
pabelangerdmsimard: running slowly, memory debugging17:46
Shrewsjeblair: transactions are supported17:46
Shrewsjeblair: http://kazoo.readthedocs.io/en/latest/api/client.html#kazoo.client.KazooClient.transaction17:47
Shrewsi suspect we should really be taking more advantage of that17:47
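A minimal sketch of the kazoo transaction API Shrews links above: the queued operations are committed atomically, so either all apply or none do. Paths and payloads are illustrative, not nodepool's actual schema.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='127.0.0.1:2181')
    zk.start()

    tx = zk.transaction()
    tx.set_data('/nodepool/requests/200-0000118158', b'{"state": "fulfilled"}')
    tx.create('/nodepool/requests-lock/200-0000118158', b'')
    results = tx.commit()  # one result (or exception) per queued operation
    zk.stop()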
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Update node requests after nodes  https://review.openstack.org/50957117:47
jeblairShrews: i mean locally.  from what i can tell, we aborted halfway through updating our internal data structure.17:48
jeblairShrews: we'd need an internal transaction mechanism to deal with that.17:49
jeblairShrews: anyway ^ that's one possible quick fix; feel free to incorporate it or discard it depending on how it relates to your current work.17:49
jeblairpabelanger: yes, i agree it could be a problem.  if we figure out how to detect that has happened, we should probably delete the repo.  i don't know how to do that at the moment, and this should at least avoid stopping the world.17:53
Shrewsjeblair: where did you see that a znode has an id? I do not see that here: https://kazoo.readthedocs.io/en/latest/api/protocol/states.html#kazoo.protocol.states.ZnodeStat18:54
Shrewssession id might be useful. could compare it to the current connection's session id18:55
Shrewsneed to run a test to see how that works....18:56
jeblairShrews: the request has an id (the znode name); in zuul we store it as NodeRequest.id19:05
jeblairShrews: if zuul resubmits the request, that will be updated with the new znode19:06
Shrewsjeblair: oh! duh, it's a sequence znode.19:09
jeblairyeah, so we always know it's different if it was resubmitted19:12
dmsimardexecutors aren't automatically updated on each zuul merge, right ?19:24
clarkbdmsimard: not the python code they are running no19:26
clarkbdmsimard: anything in the ansible bits should be as that is forked on demand19:26
dmsimardclarkb: what decides when the python code is reloaded ? it's on demand ?19:27
clarkbya I think we do it manually when needed19:27
clarkbwell puppet updates the install, then restarts to pick it up are on demand19:27
fungifor zuul source code, in our environment puppet deploys it periodically from branch tip and then we run the new version whenever the daemon is manually restarted19:27
openstackgerritDavid Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Handle double node locking snafu  https://review.openstack.org/50960319:27
Shrewsjeblair: ^^^ untested, but maybe that's enough?19:28
dmsimardokay, I was wondering why https://review.openstack.org/#/c/509254/ did not seem to be effective yet but that explains it19:28
Shrewsjeblair: not sure how to write a test for that19:28
Shrewsugh, i don't think that will work if the request.id value can change from underneath us at that point (due to re-use of the request object)19:33
pabelangerjeblair: Ya, deleting might work. No issue waiting until it actually happens and dealing with it then19:35
jeblairShrews: ya, i'm starting to like the idea that we embrace that and use it.  so set request.id back to None in the callback when the node is deleted from under us.  then include the request id in any events we enqueue.  don't respond to an event if the request id doesn't match.19:35
jeblairpabelanger: if we can figure out how to force a timeout (maybe i could clone the kernel over dsl with a 1s timeout) we can probably prepare for it19:35
pabelangerFor sure19:36
jeblairShrews: then we invalidate events both in the case that we process it before and after resubmission19:36
jeblairpabelanger: if you want to run with that idea (not sure how fast your new connection is, but if you can clone the kernel in 1s i will be really impressed) feel free, i'm still swamped debugging memory19:37
Shrewsjeblair: what good does setting the request.id to None do if we just resubmit it and it gets a new id?19:37
pabelangerjeblair: sure, i can take a go at it19:37
Shrewstotally get the "include req id" in the queue part19:38
jeblairShrews: oh, hrm, we resubmit it in the callback.  it would only be None for very short while.  still, it narrows a race, and if resubmitting fails, we'd still be in a better state.19:40
Shrewsk19:40
jeblairShrews: also, i think maybe the delete handling in _updateNodeRequest should be first, rather than last?  :)19:41
jeblairi think that erroneously assumes that we can still do something with a fulfilled node request that was lost.  but we can't.19:42
openstackgerritDavid Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Handle double node locking snafu  https://review.openstack.org/50960319:47
Shrewsjeblair: something along the lines of that ^^^ ?19:47
Shrews(that may possibly break all sorts of tests, but covers all the things, i think)19:49
jeblairShrews: ya, left some further thoughts inline19:57
Shrewsoh right, forgot the id comparison19:58
Shrewsi'll wait for the tests to finish before i add that20:00
Shrewsand the other suggestion20:01
jeblairi think the answer to the memory problem may be in a 616 node object graph.  does anyone have a large format printer?20:03
*** ianw is now known as ianw|pto20:04
fungirs232, already loaded up with a fanfold stack of tractor-fed greenbar!20:04
mnaserjeblair: i have access to a plotter though it'll be a little while before the print can get to you20:05
fungijeblair: just |lp20:05
mnaseron second thought, jeblair, time to buy the biggest highest res tv on the market and expense that because large object graphs20:06
mnaser:P20:06
openstackgerritRicardo Carrillo Cruz proposed openstack-infra/nodepool feature/zuulv3: Bring back per label groups in Openstack  https://review.openstack.org/50962020:08
rcarrillocruzhey folks20:08
rcarrillocruzwe lost per label groups in nodepool20:08
clarkbI have  roll of paper drop cloth and crayons20:08
rcarrillocruzi believe on the nodepool openstack driver split20:09
rcarrillocruz^20:09
SpamapSper label?20:09
rcarrillocruzcan I get eyes on it? without that feature, i have my workflow broken20:09
rcarrillocruzyeah20:09
SpamapSrcarrillocruz: note that it disappeared because it's not tested. :)20:10
rcarrillocruzhttps://git.openstack.org/cgit/openstack-infra/nodepool/commit/?h=feature/zuulv3&id=7c3263c7df08bf824a1a8a87279d4e8ca547fd6320:10
SpamapSYou may want to consider adding a test to make sure that doesn't happen in the next refactor :)20:11
rcarrillocruzwhere are the driver tests20:11
rcarrillocruztristanC: around?20:12
rcarrillocruzi guess testnodelaunchmanager20:12
pabelangerjeblair: first issue with kill_after_timeout: https://github.com/gitpython-developers/GitPython/blob/master/git/repo/base.py#L920 hardcodes as_process=True, which doesn't work with kill_after_timeout. Looking to see why that would be20:13
Shrewsrcarrillocruz: test_launcher.py is the place20:17
rcarrillocruzack, let me see20:18
jeblairpabelanger: ok.  let me know what you find.  maybe we need to make our own version of that method without the progress handling stuff (we don't use it anyway).20:19
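A hedged sketch of one possible workaround: GitPython's low-level command interface passes kill_after_timeout through to Git.execute, which the Repo.clone_from path cannot do because (per pabelanger above) it hardcodes as_process=True. Assumes a GitPython release that supports the kill_after_timeout kwarg; the URL and path are illustrative.

    import git

    try:
        git.cmd.Git().clone('https://git.openstack.org/openstack-infra/zuul',
                            '/var/tmp/zuul', kill_after_timeout=300)
    except git.GitCommandError as e:
        # A hung clone is killed after 300s and surfaces as an error here
        # instead of blocking the executor forever.
        print('clone failed or timed out: %s' % e)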
pabelangerjeblair: I'm currently patching GitPython to see if we could make it work20:21
clarkbpabelanger: jeblair that may be something we can configure in a gitconfig too?20:27
dmsimardrcarrillocruz: tristanC is on PTO for the next few weeks20:28
rcarrillocruzk thx, nm i was pointed to the place for writing a test20:29
*** jkilpatr has quit IRC20:44
mordredrcarrillocruz: patch looks sane to me - +2 - but obvs would prefer with test :)20:51
rcarrillocruzyeah, looking at it20:51
pabelangerclarkb: oh, maybe20:51
rcarrillocruzjust spun up a xenial vm20:51
pabelangerI'll have to read up on it20:51
pabelangerclarkb: TIL: https://stackoverflow.com/questions/6458790/is-there-a-way-to-make-git-over-http-timeout20:52
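The approach from that link: git's http.lowSpeedLimit/http.lowSpeedTime settings abort an HTTP(S) transfer whose rate stays below a threshold for a given time, giving a coarse timeout without patching GitPython. Note it only covers the http transport. A sketch with illustrative URL and path:

    import subprocess

    subprocess.check_call([
        'git',
        '-c', 'http.lowSpeedLimit=1000',  # abort below 1000 bytes/sec...
        '-c', 'http.lowSpeedTime=300',    # ...sustained for 300 seconds
        'clone', 'https://git.openstack.org/openstack-infra/zuul',
        '/var/tmp/zuul',
    ])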
mordredpabelanger: ooh - that looks promising20:53
pabelangerI think I got gitpython working (found a bug) but need to better understand the code changes20:53
pabelangerbut, have to run now for family dinner20:53
clarkbpabelanger: cool, I figured there would be something like that in the large list of git config options20:54
pabelangerya20:54
pabelangerI'll see about testing / reading up about it20:54
*** hasharAway has quit IRC20:58
mrhillsmanworking on deploying zuul, got it integrated with a github project, do i need nodepool?21:14
jlkwhat do you want to do with it?21:16
jlkIf you want to be able to execute things inside ephemeral environments, then yes21:16
jlkotherwise there are very limited things you could do directly on zuul executors (basically poke remote URLs)21:16
mrhillsmanok cool21:16
mrhillsmanworking on creating a ci/cd21:17
mrhillsmanenvironment21:17
mrhillsmanreading the docs it is a bit difficult to figure things out21:17
jlkgotcha.21:17
jlkThe typical scenario is that you have a nodepool hooked up to one or more OpenStack providers. Nodepool will generate images of your choice, and satisfy node needs from zuul jobs (each job uses a node)21:18
mrhillsmancorrect me if i am wrong but i need to have like a project-config (configuration project), zookeeper, gearman, nodepool, zuul jobs project21:18
jlkit can make those nodes hot-ready, or provision them on-demand.21:18
clarkbjlk: mrhillsman also keep in mind you need to be very specific about what versions of nodepool/zuul you are deploying21:18
mrhillsmanv321:18
jlkmrhillsman: you do not need to have a specific zuul jobs repository. You could place the jobs in the project-config repo if you wished.21:19
mrhillsmani pulled the feature/zuulv3 branch21:19
mrhillsmanah ok21:19
jlkalthough that does prevent some things from happening21:19
jlkproject-config needs to be a 'trusted' repository, like where secrets are defined21:19
clarkbyou also don't need to run a gearman, zuul-scheduler will fork one for you.21:19
jlkZuul will not attempt to use configuration in a proposed change to test the change to a trusted repository21:20
jlkso if you'd like to be able to test your proposed change, it needs to be in an untrusted repository21:20
clarkbbut yes if using v3 then you'll want a zookeeper + nodepool and a project config repo21:20
jlkwhich is why we would have a "jobs" repo21:20
mrhillsmancool21:21
mrhillsmanmakes sense now21:21
fungialso, the intent (at least eventually) is that basic jobs defined in the upstream openstack-infra/zuul-jobs repo are general-purpose building blocks you can use directly or inherit from in your own custom job definitions21:21
mrhillsmani forked that one21:21
mrhillsmanas well as project-config and just removed a lot of stuff21:21
fungiwe're trying to keep the jobs/roles in that zuul-jobs repo pristine and free of openstackisms21:22
jlkyeah, the project-config upstream is for openstack, so it'll be... huge.21:22
fungibasically, zuul-jobs is intended to be a batteries-included stdlib for zuul21:22
mrhillsmanthe docs are a bit difficult to fully understand but forking those and trial/error got me to the point of a noop job successfully updating github21:22
jlknice21:23
jlkwe're really interested in suggestions on doc updates / restructuring :)21:23
mrhillsmanso figured it was time to start asking questions as we need to move on to real jobs running21:23
mrhillsmanand was not entirely sure i needed nodepool so about to dive into that :)21:24
jlkdo you have an OpenStack available to you?21:24
fungiplease ask, we don't have many early adopters for the unreleased v3 so any feedback is useful21:24
clarkbmrhillsman: theoretically you don't need a nodepool, just something to do the zookeeper node request dance with zuul. But right now the only implementation I know of for that is nodepool21:24
mrhillsmangot it21:26
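A heavily hedged sketch of the "node request dance" clarkb mentions: a requester creates a request znode, nodepool moves it through states, and the requester polls or watches until it is fulfilled. The paths and JSON fields below are simplified illustrations, not the exact Zuul/Nodepool schema.

    import json
    import time

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='127.0.0.1:2181')
    zk.start()

    request = {'node_types': ['ubuntu-xenial'], 'state': 'requested'}
    path = zk.create('/nodepool/requests/100-', json.dumps(request).encode(),
                     sequence=True, makepath=True)

    while True:
        data, _stat = zk.get(path)
        if json.loads(data.decode())['state'] == 'fulfilled':
            break  # nodes are allocated; lock and use them
        time.sleep(1)
    zk.stop()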
mrhillsmani have a small openstack deployment21:26
mrhillsmanbut currently working with VMs to emulate21:27
mrhillsmanless folks hovering over you when you are using 1 dedicated server vs 100 :)21:28
*** jkilpatr has joined #zuul21:28
mrhillsmanit is the openlab work i discussed with clarkb mordred jeblair and a few others at the ptg21:29
mrhillsmanneed to ensure the PoC is working before i push for a fleet of servers21:29
jlkneat!21:30
jlkIf you have any issues / questions / suggestions for the github integration point, I'd love to hear them. I shepherded and (re)wrote most of that code.21:30
clarkbmrhillsman: tristanC (who is on PTO I Guess) has been working to add non dynamic VM backends to nodepool21:31
clarkbcontainers, static instances, etc21:31
mrhillsmansure thing, i'll be sure to capture some more detail notes21:31
mrhillsmanoh, that would be nice21:31
clarkbI think the base set of provider plugin changes got in but I'm not seeing any specific implementations in tree that use it yet21:31
jlkthat's what I hope to work on next myself too.21:31
mrhillsmangreat to hear, the hope is that openlab can be a model for how others can adopt zuul going forward21:36
mrhillsmanso any help we can offer and >guidance< we can get in ensuring it is up and running would be great21:37
mrhillsmani emphasize guidance because i know you all are pretty busy regularly21:37
jeblairmrhillsman: oh hi!  we're also planning to work with leifmadsen who wants to flesh out new user documentation as soon as we're not pulling our hair out over the openstack zuulv3 transition.  we know it's a weak spot.21:40
mrhillsmanthat is great21:40
mrhillsmani have a colleague working in parallel so i am sure he can provide some feedback as well21:41
mrhillsmanhe is in china though and they are on holiday this week21:41
mrhillsmanwill make sure he joins this channel when he is back21:42
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Don't store pipeline references on builds  https://review.openstack.org/50965321:46
jeblairmordred, SpamapS, clarkb, fungi, Shrews: ^ AKA fix memory leak21:47
jeblairmrhillsman: cool, we'll let you know when we resume that21:47
mordredjeblair: ZOMG21:48
clarkbjeblair: aroo21:48
* fungi cheers elatedly21:49
mordredjeblair: my hunch was close to correct - I think https://review.openstack.org/#/c/509653/1/zuul/model.py is somewhat akin to "it's gonna be a comma or a colon" :)21:49
jeblairmordred: i could not have done it without graphviz :)  also, have you used xdot?  it's pretty cool.  i mean, i wish it were a little more jurassic park, but hey.21:50
clarkbjeblair: so you did end up graphing out the entire reference tree?21:50
mordredjeblair: graphviz is the unsung hero of computer science21:51
jeblairclarkb: i mean, not the *entire* tree.... just 10 nodes deep...21:51
jeblairhttps://i.imgur.com/JqtYeG5.jpg21:52
jeblairmordred: the trick is finding the comma or colon in that ^21:52
mordredjeblair: JEEZ21:53
jeblairit is not normally supposed to look like that.  :)21:53
jlkgah...21:54
jlkthat looks about like something that would fall out of the OpenStack community :D21:54
jeblairif you zoom out, you can see the patterns -- there's a bunch of layouts in there, each with its own set of pipelines, but there should only be one set.  so it's like 20x bigger than it should be.  but they're all connected by these individual builds21:57
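The log doesn't say exactly how jeblair produced the graph, but the objgraph library plus graphviz/xdot is one common way to get such a picture: dump the chain of back-references keeping a suspect object alive, then view the resulting dot file. A sketch under that assumption:

    import objgraph

    # Pick one suspect object and dump what is keeping it alive.
    leaked_build = objgraph.by_type('Build')[0]
    objgraph.show_backrefs([leaked_build], max_depth=10,
                           filename='build-backrefs.dot')
    # then view with: xdot build-backrefs.dot
    # or render:      dot -Tpng build-backrefs.dot -o graph.png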
mordredjeblair: yah - between that and your commit message it makes total sense21:58
jeblairoh, let me restack that21:59
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Don't store pipeline references on builds  https://review.openstack.org/50965322:00
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Update node requests after nodes  https://review.openstack.org/50957122:00
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Move test_model.test_job_inheritance_configloader  https://review.openstack.org/50949522:00
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove test_job_inheritance  https://review.openstack.org/50950922:00
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Add git timeout  https://review.openstack.org/50951722:00
mordredjeblair: are we running with that fix now?22:00
jeblairmordred: no i was about to suggest i pull that onto my local branch and restart22:00
jlkDid you fix the pep8 violation?22:00
mordredjeblair: I think that's a GREAT suggestion22:01
jeblairjlk: there was one in a parent patch, so i reordered them by importance... was there also one in that patch?22:01
jlkno, just in the parent22:02
jeblairk; i'll clean those up in a few mins22:02
jlkIt was in https://review.openstack.org/#/c/50949522:03
mordredjeblair: with that fix in place, I would expect v3 to be running well still when we wake up in the morning (zk locking issues notwithstanding)22:03
jeblair++22:04
jeblairi was thinking tomorrow $morning might be a good time to regroup and check on our progress on things in the etherpad too22:06
jeblairclarkb: do you want to +3 https://review.openstack.org/508786 ?22:07
jeblairclarkb: you may also be interested in the child of that change, which i still haven't tested.22:08
clarkbjeblair: ok, I'm still trying to digest the leak fix. Does that imply an old build could get a new layout (and is that ok?)22:08
clarkbI'm guessing it desn't actually matter really22:10
jeblairclarkb: if an item with a running build is re-enqueued into a new layout, the build will be pulled into the new layout along with the item.  nothing about the build should change.  if something about the job it represents did change, then the build gets canceled (and possibly a successor run instead)22:10
jeblairclarkb: none of that really interacts with this pipeline reference though, which, afaict, was mostly there for stats reporting and friendly output.  in fact, i was halfway through changing all the references of "pipeline.name" to "pipeline_name" because that's all it seemed to be used for, when i decided this was simpler and nicer.22:11
clarkbgotcha22:12
clarkbin which case the updated pipeline defs don't really matter much22:12
jeblairclarkb: basically, it's so we can say "this build ran in the check pipeline"22:12
clarkb+3 on the usr2 handler22:13
jeblairclarkb: yeah.  the main things that could change during a reconfig that could affect this are the pipeline and job defs.  as i said above, if the job def changes, then we cancel/relaunch.  as for pipeline defs changing -- we make an assumption that if you redefine a pipeline called "check" that we should re-enqueue any jobs running in the old "check" into the new "check".  it's a compromise -- anything else is gnarly.22:14
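A hedged illustration of the leak pattern and the shape of the fix, using hypothetical classes rather than Zuul's actual model code: a long-lived build holding a reference to its pipeline pins the whole superseded layout (and everything reachable from it) across reconfigurations, while a plain name string retains nothing.

    class LeakyBuild(object):
        def __init__(self, pipeline):
            self.pipeline = pipeline  # keeps the old layout reachable forever

    class FixedBuild(object):
        def __init__(self, pipeline):
            self.pipeline_name = pipeline.name  # a string pins no object graph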
clarkbjeblair: I'll work to reproduce that yappi thing really quickly22:15
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Don't store pipeline references on builds  https://review.openstack.org/50965322:21
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Update node requests after nodes  https://review.openstack.org/50957122:21
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Move test_model.test_job_inheritance_configloader  https://review.openstack.org/50949522:21
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove test_job_inheritance  https://review.openstack.org/50950922:21
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Add git timeout  https://review.openstack.org/50951722:21
jeblairokay that's pep8'd22:21
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Add debug messages and exception handling to stack dump handler  https://review.openstack.org/50878622:22
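That change hardens Zuul's SIGUSR2 stack-dump handler. A hedged sketch of the general technique (not Zuul's exact code): on signal, walk every thread's current frame and log a formatted traceback for it.

    import logging
    import signal
    import sys
    import traceback

    def stack_dump_handler(signum, frame):
        log = logging.getLogger('stack_dump')
        for thread_id, stack in sys._current_frames().items():
            log.debug('Thread %s:\n%s', thread_id,
                      ''.join(traceback.format_stack(stack)))

    signal.signal(signal.SIGUSR2, stack_dump_handler)
    # trigger from a shell with: kill -USR2 <pid>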
clarkbjeblair: confirmed it prevents the exception locally (also checked the old code failed)22:26
clarkbjeblair: I'll +222:26
jeblairclarkb: cool, thx!22:26
clarkbjeblair: should I approve it or do you want a proper test too?22:26
jeblairclarkb: i think it's okay to +322:26
clarkbkk22:27
openstackgerritRicardo Carrillo Cruz proposed openstack-infra/nodepool feature/zuulv3: Bring back per label groups in Openstack  https://review.openstack.org/50962022:27
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Speed configuration building  https://review.openstack.org/50930922:27
rcarrillocruzmordred , Shrews ^22:29
rcarrillocruzthe test for the groups thing22:29
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Change yappi bytesio to stringio  https://review.openstack.org/50878722:38
mordredrcarrillocruz: that's totally going to fail pep8  :)22:38
rcarrillocruzerm, possibly? i just did tox -epy35 -- TestNodeLaunchManager locally22:39
rcarrillocruzlet me tox pep8 it22:39
rcarrillocruznah, i get green locally22:40
rcarrillocruzyou mean the long assert line i assume22:41
rcarrillocruz?22:41
openstackgerritMerged openstack-infra/zuul-sphinx master: Update exception message to include directories  https://review.openstack.org/50540022:42
rcarrillocruzit passes pep8 jobs on zuulv322:45
rcarrillocruzoff-topic: why is there tox-pep8 and openstack-tox-pep8?22:45
*** dkranz has quit IRC22:51
SpamapSjeblair: looks like 509653 may have a legitimate failure or at best a racy test.22:55
SpamapSrcarrillocruz: I'd guess that the openstack- one does some openstack-specific things22:56
SpamapSrcarrillocruz: in fact it does22:57
SpamapS      tox_constraints_file: "{{ ansible_user_dir }}/src/git.openstack.org/openstack/requirements/upper-constraints.txt"22:57
SpamapSnot everybody has a constraints file :)22:57
tristanCclarkb: i'm indeed on vacation till the end of the month, though i'll be checking in intermittently23:09
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Don't store pipeline references on builds  https://review.openstack.org/50965323:27
dmsimardWow 509653 is such a small change yet has such a huge impact, that's an awesome find.23:33
dmsimardjeblair++23:33
