Friday, 2017-06-23

*** _ari_ has quit IRC00:17
*** _ari_ has joined #zuul00:19
*** dmellado has joined #zuul00:37
*** _ari_ has quit IRC01:01
*** dmellado has quit IRC01:06
*** _ari_ has joined #zuul01:08
*** tobiash_ has left #zuul03:38
jamielennoxinteresting - don't know why yet:04:02
jamielennox  File "/usr/local/lib/python3.5/dist-packages/nodepool/launcher.py", line 1486, in removeCompletedRequests04:02
jamielennox    for label in requested_labels:04:02
jamielennoxRuntimeError: dictionary changed size during iteration04:02
tobiashmordred, jeblair: retested synchronize04:56
tobiashand found out now that it actually worked and the error I saw was from rsync :(04:57
tobiashI had several issues during debugging making me think it was an ansible issue04:58
tobiashsynchronize also seems to be handled by the log streamer (which is not logged on the executor), and with a post failure I didn't know how to get the console log04:58
tobiashfurther, I have the issue that the executor often prepares old versions of the trusted repo (the triggering patch is in a different repo)04:59
tobiashthat fooled my debug iterations05:00
tobiash-> https://review.openstack.org/#/c/469214/ helped now with getting the log from rsync05:00
tobiash-> rsync worked but had some transient errors: http://paste.openstack.org/show/613461/05:02
tobiashso sorry for bothering you with that topic05:02
tobiashbut the non-deterministic playbook repo preparation is an issue I have to look at (I think that fooled me into thinking that shell also doesn't work delegated)05:03
tobiashmy current setup is scheduler + 5 merger + 1 executor for testing how that scales05:07
jamielennoxgah, i should know this but how do i get nodepool/shade to give me a floating ip05:25
clarkbjamielennox: iirc you get one if you get back a non routable ipv4 addr05:28
clarkbor if you specify it in the oscc network config05:29
jamielennoxclarkb: yea, i was looking in the wrong place it seems, we had routes_externally: True in the clouds.yaml which was stopping the assignment05:29
jamielennoxany idea, whilst people are looking, how to add a security group to nodepool nodes?05:30
clarkbI don't know that we support setting security groups05:33
clarkbunless oscc allows you to05:33
clarkbwe have clouds that don't do security groups so our deployments have always opened them up where they exist, then locked down on the host05:34
jamielennoxyea, i seem to remember this once before, i guess given that nodepool is likely creating nodes in its own project it's reasonable to just add ssh to default05:35
mordredjamielennox: yah - that's what we do - just modify the default05:37
mordredjamielennox: that said - it's probably not a terrible idea to add support for specifying a security-group - I can imagine it might be a thing people might want05:38
mordredtobiash: good to know - and no worries - I learned a lot about synchronize in the process :)05:41
mordredtobiash: non-deterministic playbook repo prep certainly doesn't sound good though. jeblair when you're up that sounds like a thing you may also want to look at05:43
mordredjamielennox: if you were to want to add security-group support, it would be basically identical to key-name05:48
mordredjamielennox: so if you did 'git grep key.name' in nodepool, you'll see where you need to add it05:50
mordredalthough man, looking at that our naming there seems very awkward05:50
jamielennoxmordred: it's interesting because it's openstack driver specific and i imagine that once v3 is stable and that kicks off it'll change again05:50
jamielennoxnot even thinking too far afield, it's not useful for static nodes05:51
mordredjamielennox: indeed - but then neither is key-name, min-ram or console-log05:51
jamielennoxtrue, there is a lot to do there05:52
mordredthere's a bunch of cloud-specific things that go into a label definition05:52
mordredthat I think it's likely fine if they're specific to a label definition05:52
mordred(also, my comment about awkward naming was because I was looking at master - we have fixed the naming issue I was concerned about :) )05:52
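
To illustrate the key-name parallel mordred describes, here is a sketch of how a security-groups option might sit next to key-name in a label/provider definition; the class and attribute names below are hypothetical, not the actual nodepool config code.

    # Hypothetical sketch only: a 'security-groups' option parsed next to the
    # existing 'key-name' option. Names are illustrative, not real nodepool code.
    class LabelConfig:
        def __init__(self, label_cfg):
            self.name = label_cfg['name']
            self.key_name = label_cfg.get('key-name')                     # existing option
            self.security_groups = label_cfg.get('security-groups', [])   # proposed addition

    label = LabelConfig({'name': 'ubuntu-xenial',
                         'key-name': 'nodepool',
                         'security-groups': ['default', 'ssh-only']})
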
*** hashar has joined #zuul07:13
tobiashmordred, jeblair: the playbook repos don't seem to be part of the repo state the executor gets07:13
tobiashmordred, jeblair: excerpt of the data structure the executor gets for running the job:07:22
tobiashhttp://paste.openstack.org/show/613474/07:22
tobiashthe playbook repos are also not part of the projects dict08:09
*** jkilpatr has quit IRC10:38
mordredtobiash: oh - there's also a bug I think we have with streaming logs and trusted playbooks10:58
tobiashdue to non-bwrap?10:59
mordredno - due to how we're injecting the jobid so that the remote command knows how to log11:03
mordredI BELIEVE we're doing that in a place that doesn't get installed when the job is trusted11:03
* mordred is writing down bug right now so he doesn't forget to actually look11:04
mordredtobiash: oh. nope. I'm totally wrong11:06
*** jkilpatr has joined #zuul11:10
*** dmellado_ has joined #zuul11:17
*** dmellado_ is now known as dmellado11:19
*** openstackgerrit has quit IRC11:33
Shrewsjamielennox: that's interesting on the dictionary RuntimeError. i think the line above that, which tries to prevent that error, is not correct12:45
Shrewsah, it seems keys() returns a view (not a copy) in py312:47
Shrewsjamielennox: fix coming12:48
*** openstackgerrit has joined #zuul12:51
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix dict key copy operation  https://review.openstack.org/47690212:51
Shrewsjamielennox: ^^^12:51
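
For context on the fix above, a minimal illustration (not the actual launcher.py code) of the py3 behaviour being discussed: keys() returns a live view, so mutating the dict while iterating over it raises the RuntimeError jamielennox hit.

    # Broken pattern: iterating the live view while deleting entries
    # raises "RuntimeError: dictionary changed size during iteration" on py3.
    requested_labels = {'label-a': 1, 'label-b': 2}
    # for label in requested_labels.keys():
    #     del requested_labels[label]

    # Fixed pattern: iterate over a copied list of the keys instead.
    for label in list(requested_labels.keys()):
        del requested_labels[label]
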
tristanCwhat do you think about adding a zuul_comment module so that a job can report custom content to a review?13:10
tristanCi mean, to help users figure out why a job failed without going to the failure-url, it may be convenient to report more content such as the list of unit tests that actually failed13:16
mordredtristanC: yes - SO - funny story,  I'm actually writing up some things related to that use case from discussions we had this week13:34
clarkbI thought you can already put arbitrary text13:50
clarkbwe do for requirements checking jobs for example13:50
tristanCmordred: nice, please let me know how it goes, I'm interested in such capabilities, it may be very useful for custom reporters... e.g. code coverage tests should be able to tell the patch author when the coverage drops below a certain threshold13:51
clarkbI guess it's mostly static though13:51
*** hashar_ has joined #zuul15:12
SpamapSThat would work nicely with the "job drops a json file" thing we have talked about for artifact sharing too.15:43
SpamapS(so jobs would drop a json with some comment text that zuul would dump into the reporter)15:43
*** hashar is now known as hasharAway15:57
*** hashar_ has quit IRC16:04
dmsimardQuestion about nodepool: as I understand it, TripleO is currently sharing the tenant nodepool is using with other processes which manage virtual machines in that tenant. I know it goes against the recommendation of dedicating a tenant to nodepool but what kind of problems could this expose?16:18
ShrewsI would think that the biggest risk you open yourself to is getting reality (available vms) out of sync with expectations (the database or ZK in v3) if something other than nodepool deletes/modifies/etc. the nodepool-owned nodes17:19
Shrewsharlowja: hrm, there might be an issue with the Event change you proposed yesterday. seeing some random job failures with what looks like tests hanging, causing the job to time out17:29
Shrewsbut i'm not seeing anything obviously wrong  :(17:29
harlowjahmmmm, that'd seem odd, logically those should work (since i've been doing this for a long time, lol)17:29
harlowja*doing  similar stuff17:29
harlowjaany more info?17:30
harlowjaon py3 it fails, or py2.7?17:30
Shrewsharlowja: nothing really useful in the logs. failing on coverage job the last few times I checked17:35
Shrewsi'm beginning to wonder about Event.wait() and context switching17:40
tobiashShrews, harlowja: I also saw some random job hangs on my deployment17:40
harlowjahow much of eventlet is that stuff using?17:40
harlowjai mean Event.wait is basically waiting on a condition variable17:40
tobiashlooking at 'ps afx' it looked like ansible didn't start an ssh process17:40
harlowjawhich is basically waiting on a threading primitive whose whole job is to force context switches / waits, lol17:41
harlowjaso ya, it'd be interesting to know more about what happens, tobiash and Shrews17:42
harlowjahow much eventlet are u guys using btw?17:42
harlowjanone, some?17:42
tobiashharlowja: I don't know if we're talking about the same issue17:43
harlowjaperhaps, lol17:43
Shrewsno eventlet used17:43
harlowjakk17:43
harlowjathen ya, i can't think why it wouldn't work17:43
tobiashmy observation: in my zuulv3 deployment sometimes a shell task just hangs with ansible not starting ssh from what I could see17:43
Shrewsharlowja: yeah, a condition variable is used, so you'd think that would work as expected: https://github.com/python/cpython/blob/3.6/Lib/threading.py#L55117:43
harlowjaya, i mean if those don't work, bigger problems exist all over python :-P17:44
tobiashkilling the leaf-ansible-playbook process on the executor made the job continue (even successfully)17:44
ShrewsI have yet to see these hangs on the changes NOT related to the use of Event (e.g., https://review.openstack.org/476902) so it makes me think the event change is causing it17:45
Shrewstobiash: i haven't experienced what you are seeing :(17:46
harlowja`RuntimeError: dictionary changed size during iteration` is a threading thing17:47
harlowjathat's just something u have to know about :-P17:47
harlowjaand not do, lol17:47
tobiashShrews: it was happening in about 5% of the runs and I didn't see anything obvious so far17:47
tobiashcould be anything in zuul, ansible or bwrap17:48
harlowjaShrews let me know what u find, i'd be surprised if it's something busted in threading :-/17:48
harlowjabut weird shit could happen, just would seem hard17:48
harlowja*hard to believe17:49
Shrewsharlowja: i'm not going to spend too much time on it. will explore it a bit today, but may just have to ditch it and stick with our time.sleep()  :(17:51
harlowjak17:51
harlowjathough it's odd, because the whole underlying wait and the things it's using internally are also sort of equivalent to time.sleep17:52
harlowjalol17:52
harlowjajust u aren't seeing the equivalent of time.sleep, lol17:53
Shrewsi wish our logs were more helpful. they are useless for timeouts17:53
harlowja:-/17:53
Shrewsand of course cannot reproduce it locally (yet)17:54
Shrewsaha. got a local hang17:56
harlowjau can try https://gist.github.com/harlowja/5544d84e8e734ea1cc7c163eff00753117:59
harlowjaat least it will show u some <waiting> events17:59
harlowjaevent stuff isn't too crazy :-P18:00
harlowja(i.e. making your own to see what's happening)18:01
harlowjatypically what happens is someone isn't calling `set` somewhere18:01
harlowjathen wait (especially with no timeout) just waits forever, lol18:01
harlowjathe one thing that might be useful to look at is `run` in those changes18:02
harlowjaself._death.clear()18:02
harlowjaif shutdown/stop happens before run18:03
harlowja(for some reason)18:03
harlowjathen run will never die, lol18:03
harlowjaperhaps put some print/logs in https://gist.github.com/harlowja/5544d84e8e734ea1cc7c163eff007531#file-gistfile1-txt-L1018:05
harlowjaand say in clear18:05
harlowjaand see if someone is setting (i.e. shutting down) and then run calls clear, lol18:05
Shrewsharlowja: the stop before run thing is what i'm looking at now, particularly: https://github.com/openstack-infra/nodepool/blob/feature/zuulv3/nodepool/builder.py#L1110-L111818:09
harlowjalol18:10
harlowjaya...18:10
harlowjau can take my latch code18:10
harlowjaand then drop that whole loop18:10
harlowjabut ya, that's usually what hangs, something like this :-P18:11
Shrewsharlowja: i'm not sure how that paste you just gave me helps18:11
harlowjamainly that u can add log statements and shit all over it :-P18:11
harlowja*good shit, not the bad kind, lol18:12
harlowjaShrews so my guess would be to remove that clear in those various `run` methods18:16
harlowjathey are probably re-clearing it18:16
harlowjaand the clear should probably happen outside of `run` in some kind of `reset` method (if those threads are really re-used)18:16
harlowjaif they aren't reused (after shutdown) then meh, just throw them away, and don't add reset18:17
harlowjaso i'd drop that clear from all the places (in run)18:23
harlowjamy 3 cents18:24
harlowjaha18:24
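
A minimal sketch of the stop-before-run race being described, assuming a threading.Event named _death as in the changes harlowja quotes; this is illustrative, not the actual nodepool builder code.

    import threading

    class Worker(threading.Thread):
        def __init__(self):
            super().__init__()
            self._death = threading.Event()

        def run(self):
            # If stop() already ran, this clear() wipes out the shutdown signal
            # and the loop below waits forever.
            self._death.clear()
            while not self._death.wait(timeout=1):
                pass  # periodic work would go here

        def stop(self):
            self._death.set()  # lost if run() calls clear() afterwards
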
Shrewsseems to be hanging on join to one of the worker threads during the shutdown process. so one of them is not stopping properly18:27
* Shrews adds more logging18:27
dmsimardShrews: you happen to know if 'env-vars' defined for a DIB image definition in nodepool end up being set on the node when it actually boots up?18:29
dmsimardor if they are only effective during the image build ?18:29
dmsimarddoesn't look like those env vars are persisted (i.e., written to /etc/profile)18:29
Shrewsdmsimard: they are not set on the launched node18:29
dmsimardShrews: so here's my dilemma -- I'm trying to reproduce the upstream image definitions outside of the openstack-infra environment (RDO's nodepool/zuul). I got the images to build no problem with the upstream elements/scripts except the configure_mirror.sh ready-script borks because it's not picking up NODEPOOL_MIRROR_HOST18:31
dmsimardhttps://github.com/openstack-infra/project-config/blob/master/nodepool/scripts/configure_mirror.sh#L7018:32
dmsimardimage builds fine with https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/nodepool/nodepool.yaml#L13-L34 but then configure_mirror fails18:32
dmsimardI was wondering if it'd make sense to persist those env vars18:32
Shrewsdmsimard: oh, you're not talking about v3 nodepool. i'd have to search the v2 code to know what's happening with those vars18:34
dmsimardyeah we're not running v2.5/v3 yet :/18:34
Shrews(we did away with the ready script codepath in v3)18:34
Shrewsdmsimard: so you're not even using the zk version of the builder, right?18:36
dmsimardShrews: nope, still geard so far as I am aware -- we'll be moving towards 2.5ish in the near future but we're not there yet18:37
Shrewsi'm _fairly_ certain that no env vars were set on the launched node (but we did write some variable info to files on that host)18:37
dmsimardShrews: I think for the time being I won't bother too much and insert an element to persist the env vars18:37
dmsimardyeah there's like /etc/nodepool/provider with some info, amongst other things18:38
Shrewsdmsimard: this is the thing I was thinking of https://github.com/openstack-infra/nodepool/blob/0.3.0/nodepool/nodepool.py#L630 but nothing for mirror host18:42
dmsimardyeah that's /etc/nodepool/provider18:43
dmsimardI half-expected the values defined in 'env-vars' to be carried over there18:43
Shrewsharlowja: k, i think i understand the problem now. we'll have to use a barrier instead of that code i pointed out18:54
harlowjaif u desire :-P18:55
harlowjahttps://github.com/openstack/taskflow/blob/master/taskflow/types/latch.py is all yours18:55
harlowjaif u want it18:55
harlowjalol18:55
harlowjasince py3.x got the latch addition18:55
harlowjamaybe someone backported it, i didn't, lol18:55
Shrewsbecause x.running no longer means "hey, i've started and reached a certain point in the startup"18:55
harlowjaya, fair18:57
Shrewsugh, this might be too much for right now. would have to pass the Latch in to each thread b/c the # of threads that would need to use it is configurable19:14
harlowjayup19:15
harlowjaup to u19:16
Shrewsi think other v3 things are pressing, atm. will revisit later19:16
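
For reference, the barrier approach Shrews mentions could look roughly like the sketch below, under the assumption that the (configurable) worker count is known before the threads start; the names are illustrative.

    import threading

    num_workers = 4  # configurable in the real builder
    # One extra party for the coordinating thread that waits for startup.
    startup_barrier = threading.Barrier(num_workers + 1)

    def worker():
        # ... per-thread initialization ...
        startup_barrier.wait()   # "I've started and reached a known point"
        # ... main work loop ...

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    startup_barrier.wait()       # returns once every worker has initialized
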
tobiashI've noticed that ansible uses quite a lot of cpu when running (also when just waiting for a task to finish)19:43
tobiashsetting "internal_poll_interval = 0.01" in ansible.cfg fixed that for me locally19:43
tobiashany objections to adding this setting to the generated ansible configs?19:47
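
The setting tobiash is proposing would be a two-line addition to the ansible.cfg that zuul generates; placing it in the [defaults] section is my assumption here, and the value is the one from his test.

    [defaults]
    # poll less aggressively so ansible burns less CPU while waiting on tasks
    internal_poll_interval = 0.01
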
tobiashencountered again a job freeze: http://paste.openstack.org/show/613548/19:52
tobiashthe last one is an ansible-playbook process in sleep state without a launched ssh connection19:53
SpamapSruhroh20:33
SpamapStobiash: what does strace think that process is doing?20:33
*** jkilpatr has quit IRC22:05
*** jkilpatr has joined #zuul22:20
*** hasharAway has quit IRC22:30
mordreddmsimard, Shrews: the main issue with sharing an openstack project between nodepool and non-nodepool is quota calculations22:49
mordreddmsimard, Shrews: nodepool should not touch nodes it did not create - but whatever you tell nodepool its quota is will need to take into account the actual quota minus non-nodepool nodes22:50
mordredthe main problem with that will be that nodepool may sit there spinning hitting the nova api over and over trying and failing to create a node, which could put undue load on your control plane22:53
mordredbut if you manage the quotas right, you'll be good22:53
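
(As a worked example of the quota arithmetic mordred describes, with illustrative numbers: if the tenant's nova quota is 50 instances and other tooling keeps 10 VMs there, the provider's max-servers for nodepool should be set to at most 40.)
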

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!