Tuesday, 2017-10-03

01:14 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Clear project config cache later  https://review.openstack.org/509040
02:20 <openstackgerrit> Sam Yaple proposed openstack-infra/zuul feature/zuulv3: Add additional information about secrets  https://review.openstack.org/509047
02:25 <openstackgerrit> Sam Yaple proposed openstack-infra/zuul feature/zuulv3: Add additional information about secrets  https://review.openstack.org/509047
03:25 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Improve scheduler log messages  https://review.openstack.org/509057
03:26 <jeblair> that's not exactly critical, but considering the amount of time i spent filtering those out of logs today, it's probably a net gain if folks have a second.
04:06 *** bhavik1 has joined #zuul
05:13 *** bhavik1 has quit IRC
05:21 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Move shadow layout to item  https://review.openstack.org/509014
06:29 *** isaacb has joined #zuul
06:52 <openstackgerrit> Andreas Jaeger proposed openstack-infra/zuul feature/zuulv3: Improve scheduler log messages  https://review.openstack.org/509057
06:59 <openstackgerrit> Merged openstack-infra/zuul-jobs master: Add content to support translation jobs  https://review.openstack.org/502207
07:32 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Clear project config cache later  https://review.openstack.org/509040
07:33 *** isaacb has quit IRC
07:57 *** hashar has joined #zuul
08:14 *** isaacb has joined #zuul
08:27 <openstackgerrit> Merged openstack-infra/zuul-jobs master: Handle z-c shim copies across filesystems  https://review.openstack.org/508772
09:01 *** hashar has quit IRC
09:04 *** electrofelix has joined #zuul
09:06 *** hashar has joined #zuul
09:23 *** hashar has quit IRC
09:24 *** hashar has joined #zuul
10:03 *** isaacb has quit IRC
10:03 *** isaacb has joined #zuul
10:08 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Fix branch matching logic  https://review.openstack.org/508955
10:34 *** jkilpatr has quit IRC
11:00 *** isaacb has quit IRC
11:03 *** jkilpatr has joined #zuul
11:19 *** jesusaur has quit IRC
11:25 *** jkilpatr has quit IRC
11:33 *** jesusaur has joined #zuul
11:40 *** jkilpatr has joined #zuul
11:46 *** jkilpatr_ has joined #zuul
11:47 *** jkilpatr has quit IRC
11:54 *** jkilpatr_ has quit IRC
12:04 *** isaacb has joined #zuul
12:08 *** jkilpatr_ has joined #zuul
12:14 *** isaacb has quit IRC
12:32 *** isaacb has joined #zuul
12:34 *** isaacb has quit IRC
12:56 *** isaacb has joined #zuul
13:04 *** isaacb has quit IRC
13:36 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Set default on fetch-tox-output to venv  https://review.openstack.org/509177
13:39 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Set default on fetch-tox-output to venv  https://review.openstack.org/509177
14:06 *** dkranz has joined #zuul
15:12 *** ricky_ has joined #zuul
16:06 <openstackgerrit> David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Add change ID to NodeRequest  https://review.openstack.org/509215
16:11 *** hashar is now known as hasharAway
16:16 <jeblair> Shrews: is that something we need now, or can we back-burner it until fires are out ^ ?
16:16 <Shrews> jeblair: total backburner. just getting it off my todo list
16:17 <jeblair> Shrews: okay.  cool.  :)
16:18 <SpamapS> jeblair: I had a thought about how to keep reconfigs from eating all RAM.
16:18 <jeblair> Shrews: we found a fault in zuul's nodepool code yesterday, see line 122 of https://etherpad.openstack.org/p/zuulv3-issues ; are you interested in picking that up?
16:18 <jeblair> SpamapS: cool!  (though we don't know what's eating all the ram)
16:18 <SpamapS> jeblair: what if for reconfig we fork, re-exec in the child, and shove the new tree down a pipe?
16:19 <SpamapS> Like, at the end of a reconfig event, we pretty much get back to square 0 of the scheduler right?
16:20 <SpamapS> So sort of automate what you've been doing with the dump/re-enqueue.
16:20 <jeblair> SpamapS: it's the in-process version of a cron restart to deal with a memory leak?
16:20 <SpamapS> yep!
16:20 <SpamapS> Not sexy at all
16:21 <SpamapS> But until we can figure out why they're not GC'd...
16:21 <jeblair> SpamapS: i suppose it might work, but i fear we could spend more time dealing with that (file descriptors!) than fixing the issue
16:21 <jeblair> SpamapS: well, we've just decided to roll back, so we have a window of time now where we can safely test this at scale.
16:21 <SpamapS> I agree
16:21 <SpamapS> jeblair: k
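To make the idea SpamapS is floating a bit more concrete, here is a minimal sketch of a fork/re-exec restart that hands queue state to the new process over a pipe. It is only an illustration of the approach, not Zuul's code: the --resume-fd flag and the serialized queue_state are hypothetical stand-ins for the existing dump/re-enqueue procedure.

```python
# Sketch only -- not Zuul's reconfiguration code.  The --resume-fd flag and
# the queue_state payload are hypothetical stand-ins for "dump/re-enqueue".
import json
import os
import sys

def respawn_scheduler(queue_state):
    """Fork, re-exec a fresh scheduler in the child, and shove the serialized
    queue state down a pipe so the new process can re-enqueue it."""
    read_fd, write_fd = os.pipe()
    os.set_inheritable(read_fd, True)   # let the fd survive the exec
    if os.fork() == 0:
        # Child: clean heap, so nothing pins the old layout objects.
        os.close(write_fd)
        os.execv(sys.executable,
                 [sys.executable, sys.argv[0], '--resume-fd', str(read_fd)])
    # Parent: hand over the state, then exit -- jeblair's "in-process
    # version of a cron restart".  Inherited sockets and other file
    # descriptors are exactly the messy part he flags above.
    os.close(read_fd)
    with os.fdopen(write_fd, 'w') as pipe:
        json.dump(queue_state, pipe)
    sys.exit(0)
```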
16:22 <Shrews> jeblair: iiuc, before zuul locks the nodes, it needs to make sure the request is still valid. and if not, abort the handling somehow. correct?
16:22 <SpamapS> jeblair: another thing to try is just adding a 'gc.collect()' right after the reconfig.
16:23 <jeblair> SpamapS: yeah -- i started to try to fold in some of the stuff on gc you and dhellman mentioned yesterday, and i ran gc.collect(0), gc.collect(1), and gc.collect(2) manually
16:24 <SpamapS> jeblair: did you see anything in gc.garbage by any chance?
16:24 <SpamapS> I'd expect no
16:24 <jeblair> SpamapS: no it was empty
16:24 <jeblair> SpamapS: those did collect the unknown layout objects though
16:24 <SpamapS> since we don't use any objects from C extensions in there AFAIK
16:25 <jeblair> so after all 3 collection passes, the number of layout objects in memory was exactly equal to the number i would expect
16:25 <jeblair> SpamapS: so i'm developing a theory that because they stay around *so long*, they end up in generation 2
16:25 <jeblair> and we just don't run generation 2 very often, or often enough
16:25 <SpamapS> jeblair: that's certainly possible.
16:26 <jeblair> i have no idea how often the various generations run though
16:26 <SpamapS> It's based on allocations vs. deallocations I know, but I don't know the ratio
16:26 <Shrews> jeblair: left a comment trying to explain my goal with that zuul change above. let me know if that's confusing or just a bad idea in general.
16:26 <SpamapS> gc.set_threshold() lets you change them, but doesn't explain what they are before. Lame.
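For readers following along, the generation mechanics being discussed look roughly like this; it is plain stdlib gc, nothing Zuul-specific:

```python
import gc

# Thresholds come back as (threshold0, threshold1, threshold2); the default
# is (700, 10, 10).  Generation 0 is collected when allocations minus
# deallocations exceed threshold0, generation 1 only after threshold1 gen-0
# passes, and generation 2 only after threshold2 gen-1 passes -- which is why
# long-lived objects like a layout can sit in generation 2 for a long time.
print(gc.get_threshold())

# Forcing each generation by hand, the way jeblair describes doing:
for generation in (0, 1, 2):
    collected = gc.collect(generation)
    print('generation %d: collected %d objects' % (generation, collected))

# Truly uncollectable objects land here; as SpamapS expects, it should stay
# empty for plain Python objects with no exotic C-extension cycles.
print(gc.garbage)
```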
16:29 <jeblair> Shrews: re zuul: yes -- i think there are maybe two approaches, and maybe we should do both.  1) zk has probably notified zuul already that the request is gone.  we're just acting asynchronously from an event a long time ago.  so maybe when zk notifies us the request is gone, we need to set a flag on our copy of the request object.  2) we could re-get the request before we do anything with the node.
16:30 <jeblair> Shrews: yeah, i agree that the buildset uuid/job won't make that task easier, but why are we doing that task?  is it just idle curiosity?  if so, i'm not a fan of adding in potentially confusing information.  it could lead people to believe that the data/request model is much simpler than it actually is and make debugging harder.
16:31 <Shrews> jeblair: To answer the question: "My review doesn't seem to be doing anything. Why is that?"
16:31 <jeblair> Shrews: if we really need to reverse-map this (as opposed to getting the request id out of the zuul log, or somehow asking zuul what request is for what build (ie, maybe adding it to the status.json)) then i think we need to add a lot of data to disambiguate it.
16:31 <Shrews> we can then use request-list to verify if that review is waiting on nodes. otherwise, it's very hard to determine that
16:32 <jeblair> Shrews: yeah, i think if we make this too simplistic, we are likely to make mistakes.
16:33 <Shrews> jeblair: afraid i don't see the issue clearly. do we not have a 1-to-1 mapping of change/patchset to noderequest?
16:33 <jeblair> Shrews: nope.  it's entirely reasonable in our system to have 40 or more requests outstanding for the same change.
16:33 <jeblair> maybe even significantly more than that
16:34 <jeblair> Shrews: the unique key for a node request is pipeline+item(change)+build(job)
16:35 <Shrews> jeblair: ok, then this isn't going to give me what i want. i'll skip it for now
16:36 <jeblair> Shrews: ok.  let's come back to this when we have more time to poke at it
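A rough sketch of the two approaches jeblair outlines above (flag the cached request when ZooKeeper says it is gone, and re-check it immediately before locking nodes), written against kazoo directly; the NodeRequest attributes and the lock_nodes() helper are hypothetical, not Zuul's actual nodepool-facing API.

```python
# Sketch of the two approaches above, using kazoo directly.  request.path,
# request.canceled and lock_nodes() are hypothetical stand-ins, not Zuul's
# actual API.
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

def watch_request(request):
    """Approach 1: when ZooKeeper notifies us the request znode is gone,
    flag our cached copy so later (asynchronous) handling notices."""
    def _watch(data, stat):
        if stat is None:            # znode was deleted
            request.canceled = True
            return False            # stop watching
    zk.DataWatch(request.path, _watch)

def handle_fulfilled_request(request):
    """Approach 2: re-check the request right before acting on its nodes."""
    if request.canceled or not zk.exists(request.path):
        return False                # request vanished; abort the handling
    lock_nodes(request)             # hypothetical: take the node locks
    return True
```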
16:37 *** rcurran_ has joined #zuul
16:37 *** hasharAway has quit IRC
16:41 *** rcurran_ has quit IRC
16:44 <openstackgerrit> Andreas Jaeger proposed openstack-infra/zuul master: Use new infra pipelines  https://review.openstack.org/509223
16:45 <openstackgerrit> Andreas Jaeger proposed openstack-infra/zuul feature/zuulv3: Use new infra pipelines  https://review.openstack.org/509224
16:46 *** hashar has joined #zuul
16:46 *** hashar is now known as hasharAway
17:25 <openstackgerrit> Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add a generic stage-artifacts role  https://review.openstack.org/509233
17:25 <openstackgerrit> Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add compress capabilities to stage artifacts  https://review.openstack.org/509234
17:30 <openstackgerrit> Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add compress capabilities to stage artifacts  https://review.openstack.org/509234
17:42 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Add TODO note about reworking fetch-tox-output  https://review.openstack.org/509237
17:51 <jeblair> oh, hey, i think i can use this to get a handle on when the collector is run: https://docs.python.org/3.5/library/gc.html#gc.callbacks
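The gc.callbacks hook linked above can be used roughly like this to see when each generation actually runs (the logger name is just an example):

```python
import gc
import logging

log = logging.getLogger('zuul.GC')

def gc_callback(phase, info):
    # phase is 'start' or 'stop'; info carries 'generation', plus
    # 'collected' and 'uncollectable' counts when the pass finishes.
    if phase == 'stop':
        log.debug('gc generation %s finished: collected=%s uncollectable=%s',
                  info['generation'], info['collected'], info['uncollectable'])

gc.callbacks.append(gc_callback)
```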
17:53 <jeblair> SpamapS: i've done some more poking today, and have found that sometimes there are still unexpected layout objects even after a full collection.  so it looks like sometimes that helps, and sometimes it doesn't.  i'm still trying to figure out what's holding on to them.
17:58 <jeblair> i haven't seen this before: http://paste.openstack.org/show/622572/
17:59 <SpamapS> I wonder if there are stuck threads holding on or something.
17:59 <SpamapS> I assume objgraph knows about threads and references in each thread's stack.
18:00 <jeblair> SpamapS: hrm.  we don't make a lot of threads, and when we do, we generally only pass the singleton scheduler instance around to facilitate cross-communication.
18:01 <SpamapS> I hadn't even looked at threads in the scheduler.
18:01 <SpamapS> But just wondering out loud really
18:03 <jeblair> one thing i noticed while debugging is that sometimes exceptions hold stack frames with local variables with references.  so one thing i'm concerned about is whether exception handlers are keeping entire layouts alive that way.
18:04 <jeblair> and moreover, the sys.last_exception or whatever it's called is keeping *that* alive
18:04 <jeblair> i cross-referenced memory jumps with config syntax errors and did not see a pattern
18:05 <jeblair> but presumably, even if that's the case, it should only be, at most, one extra layout in memory
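A debugging snippet along the lines of this thread: after a full collection, find the surviving layout objects with objgraph and trace what still refers to them. The 'Layout' type name and the output filename are illustrative; the sys.last_value / sys.last_traceback attributes are the "sys.last_exception or whatever" jeblair mentions, and a saved traceback does keep every frame and its locals alive until it is cleared.

```python
# Assumes objgraph is installed; 'Layout' is the type name of interest here.
import gc
import sys
import objgraph

gc.collect(2)                            # full collection first
layouts = objgraph.by_type('Layout')     # survivors after the collection
print('live layouts: %d' % len(layouts))
if layouts:
    # Render the reference chain back toward a gc root; if an exception
    # handler is the culprit, the chain tends to run through a frame or
    # traceback object.
    objgraph.show_backrefs(layouts[-1], max_depth=5,
                           filename='layout-backrefs.dot')

# Clearing the interpreter's saved "last" exception drops any frames (and
# layouts referenced from their locals) it was keeping alive.
sys.last_value = None
sys.last_traceback = None
```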
18:17 <pabelanger> we're just discussing pipeline changes again in zuulv3, do we need to reload zuul for those (via puppet) or does zuul pick them up as soon as they're merged?
18:18 <mordred> pabelanger: zuul picks them up
18:18 <mordred> pabelanger: the only thing zuul needs a puppet action for is main.yaml
18:19 <pabelanger> mordred: Thanks, I was getting confused
18:26 *** electrofelix has quit IRC
18:39 <dmsimard> mordred: a side effect from truncating the end of the logs is that we no longer get a formal job status in the logs. I'm looking at adding missing zuul v3 bits in elastic-recheck and there's a couple places where it attempts to look for strings like "[Zuul] Job complete" or "Finished: FAILURE"
18:40 <mordred> dmsimard: nod
18:40 <dmsimard> I wonder what kind of workaround we could do. Run a post-job before log collection that anticipates and prints the status of "run"? Use ARA to do it? Something else? Something in the log processor?
18:41 <dmsimard> The problem being that if you search for "build_status:FAILURE" instead of "build_status:FAILURE and message:[Zuul] Job complete" you end up with all the log lines for that job, not just one result that contains that particular line.
18:42 <mordred> dmsimard: well - it doesn't have to be at the end of the log, right?
18:42 <dmsimard> yeah, hence why I proposed a 'post-run' playbook that would be able to tell if any of the 'run' or 'pre-run' playbooks failed
18:42 <mordred> dmsimard: we could emit a message at the end of each playbook that is a single line that indicates success or failure that could be searched for
18:43 <dmsimard> mordred: oh, I like that idea. The executor could print the status code of each playbook.
18:43 <mordred> dmsimard: and have it be "[Zuul] Run Complete" or "[Zuul] Run Failed" (and switch run with pre-run and post-run as well)
18:43 <mordred> yah
18:43 <dmsimard> mordred: wfm, I'll send a patch
18:44 <mordred> so that you could also see [Zuul] Post-Run Failed ... for everything EXCEPT errors in the final uploading of logs
18:44 <mordred> dmsimard: woot!
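The eventual change is 509254 below; purely as a sketch of the idea, the executor-side emitter could be as small as this, writing one greppable line per playbook with its phase and result:

```python
# Sketch of the idea only -- not the actual content of 509254.
def emit_playbook_result(job_log, phase, playbook, success):
    result = 'SUCCESS' if success else 'FAILURE'
    # One line per playbook that elastic-recheck can match on, independent
    # of whether the tail of the log survived truncation.
    job_log.write('[Zuul] Playbook %s (phase: %s) finished: %s\n'
                  % (playbook, phase, result))
```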
19:09 <openstackgerrit> David Moreau Simard proposed openstack-infra/zuul feature/zuulv3: Explicitely print the result status of each playbook run  https://review.openstack.org/509254
19:09 <dmsimard> mordred: ^ not sure if there's more to it than that
19:11 <mordred> dmsimard: I think you may want to add phase in too - that's listed a few lines up
19:12 <mordred> dmsimard: as people might want to elastic-recheck search for Run failures or Pre-Run failures or something
19:12 <dmsimard> mordred: good idea
19:15 <openstackgerrit> David Moreau Simard proposed openstack-infra/zuul feature/zuulv3: Explicitely print the result status of each playbook run  https://review.openstack.org/509254
19:19 <SpamapS> unrelated to current fires. Has anyone looked into how one might use submodules with zuul?
19:19 <SpamapS> I'm currently just stripping them out of a project and replacing relative paths with prefixed paths so I can just point to the other project's src_dir...
19:20 <SpamapS> but it would be pretty awesome if Zuul could just inject them.
19:22 <clarkb> zuul probably needs to have a git submodule init and update step if they are detected
19:22 <SpamapS> the tricky part is that the parent repo will have a ref of where it wants the submodule checked out
19:23 <SpamapS> and Zuul will want to check out the submodule to the place where Zuul has prepared it in the workspace
19:23 <clarkb> they also tend to use relative paths to the repo
19:23 <SpamapS> in my experience they tend to use absolutes... https://github.com/somewhere/something
19:23 <SpamapS> not that it's the right thing, but they do that.
19:24 <clarkb> gerrit at least uses relative
19:24 <SpamapS> so there'd need to be a step rewriting that to the local src_dir, and then another one that checks it out to where the prepared git repo is
19:24 <SpamapS> gerrit uses git in very nice ways. :)
19:31 <mordred> SpamapS: the question has come up a few times, but I don't know that we've got an answer yet - or ultimately even a good write up of what we believe *should* happen in such scenarios - it gets even more complex in that canonical_hostname isn't necessarily guaranteed to be the same as the value in the git submodule reference (using openstack repos as a pathological example)
19:32 <mordred> SpamapS: the openstack/openstack repo, for instance, is a repo with a ton of submodules. for gerrit submodule tracking to work, they must be urls that refer to gerrit's url - so https://review.openstack.org
19:32 <mordred> but zuul knows those repos as git.openstack.org repos
19:33 <mordred> SpamapS: which is all to say - there be a pile of dragons here - and I think the hardest part will be defining semantics
19:33 <clarkb> which is not surprising because, well, that's what has been said about submodules since they were created
19:37 <mordred> SpamapS: that said - it's also possible as a temporary approach to just list a parent repo's submodules in required-projects for jobs, and then make a pre-playbook that does the necessary repo surgery with the contents of src_dir - since on a repo-by-repo basis it should be much easier to know what the right thing is than it is to know that generally
19:38 <mordred> SpamapS: the tox-install-siblings role is sort of like this - it'll inject the contents of a repo from required-projects into a tox virtualenv if it exists, but if it doesn't it will use the normal venv contents
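As an illustration of the "repo surgery" pre-playbook mordred describes, the core of it could look something like the following script: rewrite each submodule URL to the repo Zuul already prepared in the workspace, then update. The paths and the submodule-to-src_dir mapping are made up for the example, and the pinned-SHA problem SpamapS raises is deliberately left unsolved here.

```python
# Illustration of the "repo surgery" approach; paths and mapping are made up.
import subprocess

def wire_submodules(parent_repo, submodule_src_dirs):
    """submodule_src_dirs maps submodule name -> Zuul-prepared src_dir."""
    for name, src_dir in submodule_src_dirs.items():
        # Point the recorded URL at the local, speculative-state checkout.
        subprocess.check_call(
            ['git', 'config', '-f', '.gitmodules',
             'submodule.%s.url' % name, src_dir],
            cwd=parent_repo)
    subprocess.check_call(['git', 'submodule', 'sync'], cwd=parent_repo)
    subprocess.check_call(['git', 'submodule', 'update', '--init'],
                          cwd=parent_repo)
    # Caveat: "update" still checks out the SHA pinned by the parent repo,
    # not Zuul's speculative ref -- reconciling that is the hard part.
```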
19:38 <fungi> i'm mildly curious what travis does with them
19:39 <fungi> not so much because i think they're an example we should follow on things, but more because it'd be interesting to see a ci system's take on how submodules actually get integrated
19:39 <fungi> as in what compromises they ultimately end up making for the sake of sanity
19:40 <fungi> given that they're primarily github-focused and so get to deal with all that "variety"
19:41 <SpamapS> mordred: the pre-surgery route is what I went down for a day. It's dragon infested.
19:41 <mordred> fungi: well - I think for non-zuul systems the answer is fairly easy - just do a 'git submodule update' in the repo after cloning - or clone with the option that says to update the submodules when cloning
19:42 <SpamapS> And as clarkb said, it's not Zuul's fault. It's submodules' fault.
19:42 <SpamapS> I'm recommending eradicating them from the place they're used instead.
19:42 <mordred> but with zuul being multi-repo aware, I could see people wanting to make a change to a submodule and then have zuul test that it'll work in the parent repo
19:42 <mordred> SpamapS: ++
19:42 <clarkb> fungi: I don't think travis really does any speculative future state testing beyond "current patch"
19:43 <SpamapS> Travis gets given a repo and a reference, checks it out, and runs tests.
19:43 <SpamapS> Their git-fu is pretty meh because they're single repo.
19:45 <mordred> also - in considering just the simple "git clone --recurse-submodules" case - there's no good way to tell git to use the local cache on the mergers instead of cloning from the submodule reference source
19:45 <mordred> so even *that* would need special consideration
19:45 <SpamapS> Yeah it's a mess
19:45 <mordred> ++
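For the local-cache point specifically, git's url.<base>.insteadOf rewriting is one possible escape hatch: it redirects any URL with a given prefix, including submodule clone URLs, at another location. The sketch below assumes a merger cache laid out as /var/lib/zuul/git/<hostname>/<project>, which is an assumption for illustration rather than a guarantee about Zuul's layout, and it would still need the semantic consideration mordred mentions.

```python
# Assumes a cache layout of /var/lib/zuul/git/<hostname>/<project>.
import subprocess

def update_submodules_from_cache(repo_path):
    subprocess.check_call([
        'git',
        '-c', 'url.file:///var/lib/zuul/git/git.openstack.org/'
              '.insteadOf=https://git.openstack.org/',
        '-c', 'url.file:///var/lib/zuul/git/review.openstack.org/'
              '.insteadOf=https://review.openstack.org/',
        'submodule', 'update', '--init',
    ], cwd=repo_path)
```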
19:46 <SpamapS> Ultimately, git submodules are a hacky deployment system.
19:46 <SpamapS> They track repo urls, and git shas of said repos.
19:47 <SpamapS> The challenges would be mostly the same to try and have Zuul manage a repo that just had ansible yaml files with the same level of information. Because it makes use of git directly, it steps all over Zuul's domain.
19:47 * SpamapS has finished removing the submodules and moves on
19:49 *** AJaeger has joined #zuul
19:54 <mordred> SpamapS: \o/
20:00 *** jkilpatr_ has quit IRC
20:09 <openstackgerrit> Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add a generic stage-artifacts role  https://review.openstack.org/509233
20:09 <openstackgerrit> Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add compress capabilities to stage artifacts  https://review.openstack.org/509234
20:20 *** jkilpatr has joined #zuul
20:29 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Include tenant in pipeline log messages  https://review.openstack.org/509279
20:30 <clarkb> Shrews: were you working on that thing I found last night? <- jeblair?
20:32 <openstackgerrit> David Moreau Simard proposed openstack-infra/zuul feature/zuulv3: Explicitely print the result status of each playbook run  https://review.openstack.org/509254
20:33 <dmsimard> mordred: ^ I think I like this approach better and it also kinda fixes a bug
20:33 <Shrews> clarkb: what thing did you find last night?
20:34 <clarkb> the double lock after request cancel
20:34 <jeblair> Shrews: the zuul locking thing i mentioned earlier
20:34 <Shrews> clarkb: oh, yeah. put my name beside it on the etherpad
20:41 *** patriciadomin has joined #zuul
20:46 <openstackgerrit> David Moreau Simard proposed openstack-infra/zuul feature/zuulv3: Explicitely print the result status of each playbook run  https://review.openstack.org/509254
21:01 <SpamapS> I'm not loving the way zuul.projects ends up working btw...
21:01 <SpamapS> http://paste.openstack.org/show/622586/
21:01 <SpamapS> Had to use this to grab the path to a particular project.
21:02 <SpamapS> might work better as a mapping
21:02 <clarkb> name: path?
21:06 <SpamapS> EPARSE
21:07 <clarkb> mapping from name to path
21:07 <SpamapS> yeah I think that will be a common use case
21:17 <jeblair> will it still be easy to iterate?
21:18 <jeblair> there are some roles in zuul-jobs that do  with_items: "{{ zuul.projects }}"
21:19 <jeblair> i think the issues are: 1) still need to be able to iterate over the values.  2) the dict key will need to be the fully qualified project name; eg: git.openstack.org/openstack/nova
21:20 <jeblair> if those aren't blockers, i think switching to dict would be fine
21:26 <mordred> jeblair, SpamapS: this is also a location where we could write some additional filter plugins - zuul | zuul_projects_by_name or zuul | zuul_projects_by_canonical_name - etc ... since this is specifically about playbooks that manipulate variables in the zuul datastructure, we're already in the realm where 'pristine' ansible isn't going to be possible anyway
21:30 <mordred> heck - we could do zuul|zuul_project('cloudplatform/openstack-helpers') ... and have that do the zuul matching logic so that it'll match cloudplatform/openstack-helpers if that's unique but otherwise would want git.godaddy.com/cloudplatform/openstack-helpers ... which would otherwise be fairly hard to do just with native jinja filters
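The eventual WIP patch is 509294 below; as a sketch of the shape such a filter plugin could take (the filter names, the matching rule, and the assumption that each zuul.projects entry carries 'name' and 'canonical_name' keys are all illustrative, not necessarily what the patch does):

```python
# Sketch of a possible Ansible filter plugin for zuul.projects.
class FilterModule(object):
    def filters(self):
        return {
            'zuul_projects_by_canonical_name': self.by_canonical_name,
            'zuul_project': self.lookup,
        }

    def by_canonical_name(self, projects):
        # Turn the list into a dict keyed by the fully qualified name,
        # e.g. git.openstack.org/openstack/nova.
        return {p['canonical_name']: p for p in projects}

    def lookup(self, projects, name):
        # Accept a fully qualified name, or a short name when it is
        # unambiguous -- the matching behaviour mordred describes above.
        matches = [p for p in projects
                   if name in (p['canonical_name'], p['name'])]
        if len(matches) != 1:
            raise ValueError('%d projects match %r' % (len(matches), name))
        return matches[0]
```

A playbook could then write something like "{{ zuul.projects | zuul_project('cloudplatform/openstack-helpers') }}" rather than the selection gymnastics in SpamapS's paste, while "{{ zuul.projects }}" stays iterable for the existing zuul-jobs roles.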
21:31 <jeblair> that's a thing
21:32 *** dkranz has quit IRC
21:39 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Add jinja filters for manipulating zuul.projects  https://review.openstack.org/509294
21:40 <mordred> jeblair, SpamapS: quick 5 minute for-instance (fine if we say no to that and abandon that patch - just wanted to show code real quick)
21:40 <mordred> I've marked it WIP for now
21:40 <jeblair> mordred: thx
21:52 <jeblair> i'm switching to speeding up the config loading because having zuul spending all its time on that is making everything else hard to debug
21:53 <mordred> jeblair: ++
21:54 <mordred> dmsimard: zomg. I LOVE your 509254 patch and the fact that it removes that chunk from the callback
22:13 <dmsimard> yay ? :)
22:14 <dmsimard> mordred: when it gets queued it'll be even more awsum
22:24 <jeblair> i think i wedged it debugging.  i want to get this speedup ready before restarting tho
22:45 *** hasharAway has quit IRC
23:06 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Speed configuration building  https://review.openstack.org/509309
23:08 <jeblair> mordred, clarkb, jlk, SpamapS: ^ i'm estimating that will reduce our dynamic config delays from 50s to about 4s.
23:08 <dmsimard> jeblair: that is pretty darn awesome, I would buy you a beer right about now
23:08 <jeblair> there's still more we can do, but that's the easy stuff
23:09 <clarkb> jeblair: how much of that was just not recompiling voluptuous each time?
23:12 <jeblair> clarkb: i think about 75%.  most of the rest was avoiding deepcopy
23:15 <clarkb> jeblair: re the deepcopy where was the modification happening that the comment talked about?
23:15 * clarkb quickly scanning for a pop() and not seeing it
23:16 <jeblair> clarkb: configloader.py line 831 on the old side
23:17 <clarkb> oh derp, it's right there. also looks like the conf exception handler can do a pop, which is why you pass validate=False
23:17 <jeblair> clarkb: now we just tell the template parser not to validate it, since the project parser already has.  then the template parser just ignores it
23:17 <clarkb> lgtm thanks
23:18 <jeblair> clarkb: well, the exception handler (with configuration_exceptions:) actually does make a copy internally
23:18 <clarkb> oh ya it does
23:18 <jeblair> so the validate=false is mostly to avoid erroring on having the 'templates' section present in the template parser
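For context, the two savings being described boil down to something like this sketch (simplified stand-ins, not the actual 509309 diff): build the voluptuous schema once and reuse it, and let a caller that has already validated the data skip re-validation.

```python
# Simplified stand-ins for the configloader classes -- not the actual patch.
import voluptuous as vs

class ProjectTemplateParser(object):
    _schema = None

    @classmethod
    def getSchema(cls):
        # Building a voluptuous Schema is expensive; compile it once and
        # reuse it rather than rebuilding it for every configuration object.
        if cls._schema is None:
            cls._schema = vs.Schema({vs.Required('name'): str},
                                    extra=vs.ALLOW_EXTRA)
        return cls._schema

    @classmethod
    def fromYaml(cls, conf, validate=True):
        # The project parser has already validated the dict it hands over,
        # so it passes validate=False and this parser simply ignores the
        # extra 'templates' section instead of deep-copying or re-checking.
        if validate:
            cls.getSchema()(conf)
        return conf
```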
23:40 <jeblair> oh, here's a way we can further halve the dynamic config time in many cases -- if there are no config changes to config-projects ahead in the queue, do not run phase 1.  similarly, if there are no changes to untrusted-projects, do not run phase 2.  (of course, eventually if there are both in a queue, we'll still end up running both phases for changes after that, but that could be rare in practice)
23:41 <jeblair> i don't have time to do that right now, so if anyone else wants to work on that feel free
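In pseudo-Python, the optimization jeblair sketches might look like the following; the item attributes are hypothetical, the point is just that each phase only needs to run when the corresponding kind of config change sits ahead in the queue.

```python
# Hypothetical item/change attributes -- a sketch of the idea, not Zuul code.
def phases_needed(items_ahead):
    """Decide which dynamic-config phases are actually required, based on
    the config changes ahead of (and including) this item in the queue."""
    phase1 = any(item.updates_config and item.in_config_project
                 for item in items_ahead)       # trusted config-projects
    phase2 = any(item.updates_config and not item.in_config_project
                 for item in items_ahead)       # untrusted-projects
    return phase1, phase2
```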
23:41 <clarkb> I've got to do yardwork before I'm out of town for seagl
23:41 <clarkb> I haven't had to mow in months but alas the rain is back and the grass is happy
23:41 <jeblair> clarkb: what's the conference?
23:42 <clarkb> seattle gnu linux (seagl)
23:42 <clarkb> I'm gonna talk about python packaging
23:42 <clarkb> and probably make everyone on hackernews angry
23:42 <jeblair> oh cool, good luck!
23:42 <jeblair> clarkb: they won't be there cause it says gnu linux
23:42 <clarkb> oh right
23:43 <clarkb> I'm excited; I haven't been able to go in the past due to summit conflicts, and this year there's no conflict
23:46 <SpamapS> jeblair: oh wow, lots of deepcopies gone
23:46 * SpamapS is EOD'ing but will look later
