Wednesday, 2016-12-07

<jeblair> jamielennox: i think the runtime would be the same even if we make the triggers smart enough to act like that -- because ultimately the goal of project-change-merged is to put every open change for a project in the queue  [00:00]
<jeblair> jamielennox: so whether that's one event which enqueues multiple changes (your suggestion), or multiple events each enqueuing one change (current), we still need to go ask gerrit for all the changes  [00:01]
*** yolanda has joined #zuul  [00:01]
*** saneax is now known as saneax-_-|AFK  [00:01]
<jeblair> the current approach at least has the advantage of matching up pretty well with the "events map directly to a ref/change" idea, so while creating the synthetic events is weird, the actual event matching is very simple and behaves like the gerrit trigger  [00:02]
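
A minimal sketch of the current approach jeblair describes above -- one synthetic event per open change. The names here (TriggerEvent, query_open_changes, add_event) are illustrative assumptions, not Zuul's actual API:

    # Sketch: on change-merged, emit one synthetic trigger event per open
    # change on the project, so the ordinary per-change event matching can
    # enqueue each one. Either way, gerrit must be asked for all open changes.
    class TriggerEvent(object):
        def __init__(self, type, project, change, patchset):
            self.type = type
            self.project = project
            self.change = change
            self.patchset = patchset

    def on_change_merged(scheduler, connection, project):
        for change in connection.query_open_changes(project):
            scheduler.add_event(TriggerEvent('change-merged', project,
                                             change.number, change.patchset))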
*** pabelanger_ has joined #zuul  [00:05]
*** yolanda has quit IRC  [00:05]
<pabelanger_> webchat FTW  [00:06]
<pabelanger_> can't get to IRC proxy atm  [00:06]
<pabelanger_> ianw: mordred: rbergeron: harlowja: I'm hoping to show up next week with a CORS repo for zookeeper for EPEL7.  Like ianw said, it's not a priority for anybody on the internal list.  [00:08]
<harlowja> cool  [00:08]
<mordred> awesome  [00:09]
<pabelanger_> if that goes well, I'll see what is needed to sync into EPEL7  [00:09]
<harlowja> sweet  [00:09]
<pabelanger_> otherwise, roll the CORS repo until centos8?  [00:09]
<mordred> yah - sounds like a good plan  [00:09]
<pabelanger_> since rawhide has it  [00:09]
<mordred> pabelanger_: is a CORS repo like a PPA?  [00:09]
<pabelanger_> ya  [00:09]
<mordred> neat  [00:09]
<mordred> pabelanger_: I, for one, welcome our CORS repo overlords  [00:10]
<ianw> pabelanger_: yeah, just working with ggillies on it a bit right now :)  [00:10]
* mordred hands pies to pabelanger_ and ianw  [00:10]
<ianw> i think pabelanger_ means COPR  [00:10]
<pabelanger_> my other thought was just to roll the packaging in openstack-infra, using the same process as zigo  [00:10]
<pabelanger_> oh, haha, ya  [00:10]
<pabelanger_> that  [00:10]
<pabelanger_> COPR  [00:10]
<pabelanger_> now I disappear again  [00:11]
*** pabelanger_ has quit IRC  [00:11]
*** yolanda has joined #zuul  [00:19]
*** yolanda has quit IRC  [00:23]
*** yolanda has joined #zuul  [00:26]
*** yolanda has quit IRC  [00:29]
*** yolanda has joined #zuul  [00:30]
<jamielennox> jeblair: sorry to keep coming in and out, any advice on where this event searching would live?  [00:35]
<jeblair> jamielennox: i think the event searching is currently in the gerrit connection and can stay there.  that's called by the onChangeMerged method, which i think should move from zuultrigger into the scheduler  [00:36]
*** yolanda has quit IRC  [00:38]
<jamielennox> so wouldn't the event searching have to be common across connections?  [00:38]
*** yolanda has joined #zuul  [00:44]
*** yolanda has quit IRC  [00:49]
<jeblair> jamielennox: yeah, i think this needs to be scoped to the source/connection associated with the originating pipeline.  i'm starting to think that maybe you should kick this over to me for a bit -- i keep typing and deleting things i'm not 100% sure of, and i'm afraid i may be about to send you astray.  [00:51]
<jamielennox> jeblair: i'm fine to let you play with it for a while  [00:52]
<jamielennox> there do seem to be scoping problems going on that might be bigger issues  [00:52]
<jeblair> jamielennox: yeah, and i think this might have an "interesting" interaction with dynamic layouts  [00:52]
<jamielennox> also, for our usage i'm pretty sure we'd be fine with just deleting the zuultrigger and adding it back if and when it's required  [00:52]
<jeblair> jamielennox: anyway, i'll poke at it tomorrow and see where i get  [00:53]
<jamielennox> jeblair: ok, i'll leave it with you  [00:53]
<jeblair> jamielennox: did you grab a test in storyboard?  [00:55]
<jamielennox> jeblair: i don't think so for zuultrigger, i was trying to see if i knew enough to fix it first, then went down the rabbit hole  [00:56]
<jeblair> ok, i'll grab it tomorrow then  [00:57]
<jamielennox> night - and thanks  [00:57]
*** yolanda has joined #zuul  [00:59]
*** yolanda has quit IRC  [01:07]
*** jamielennox is now known as jamielennox|away  [01:07]
*** jamielennox|away is now known as jamielennox  [01:21]
*** yolanda has joined #zuul  [01:31]
*** yolanda has quit IRC  [01:33]
*** yolanda has joined #zuul  [01:34]
*** yolanda has quit IRC  [01:36]
*** yolanda has joined #zuul  [01:37]
*** yolanda has quit IRC  [01:41]
*** yolanda has joined #zuul  [01:53]
*** yolanda has quit IRC  [01:58]
*** yolanda has joined #zuul  [02:10]
<Shrews> jeblair: my idea for 406411 wasn't really great. let's go with PS10  [02:28]
*** yolanda has quit IRC  [02:33]
*** yolanda has joined #zuul  [02:34]
*** hogepodge has quit IRC  [02:38]
*** yolanda has quit IRC  [02:42]
*** yolanda has joined #zuul  [02:54]
*** yolanda has quit IRC  [02:58]
*** yolanda has joined #zuul  [03:05]
*** yolanda has quit IRC  [03:11]
*** yolanda has joined #zuul  [03:11]
*** yolanda has quit IRC  [03:16]
*** yolanda has joined #zuul  [03:28]
*** yolanda has quit IRC  [03:33]
*** yolanda has joined #zuul  [03:34]
*** yolanda has quit IRC  [03:40]
*** yolanda has joined #zuul  [03:47]
*** yolanda has quit IRC  [03:51]
*** yolanda has joined #zuul  [04:05]
*** yolanda has quit IRC  [04:10]
*** Cibo_ has joined #zuul  [04:17]
*** yolanda has joined #zuul  [04:22]
*** yolanda has quit IRC  [04:26]
*** yolanda has joined #zuul  [04:29]
*** yolanda has quit IRC  [04:35]
*** yolanda has joined #zuul  [04:37]
*** yolanda has quit IRC  [04:41]
*** yolanda has joined #zuul  [04:57]
*** yolanda has quit IRC  [05:02]
*** yolanda has joined #zuul  [05:14]
*** mgagne has quit IRC  [05:15]
*** morgan has quit IRC  [05:15]
*** tflink has quit IRC  [05:16]
*** saneax-_-|AFK has quit IRC  [05:16]
*** jamielennox has quit IRC  [05:16]
*** yolanda has quit IRC  [05:19]
*** tflink has joined #zuul  [05:21]
*** morgan has joined #zuul  [05:23]
*** saneax-_-|AFK has joined #zuul  [05:27]
*** jamielennox has joined #zuul  [05:31]
*** yolanda has joined #zuul  [06:12]
*** saneax-_-|AFK is now known as saneax  [06:24]
*** yolanda has quit IRC  [06:28]
*** yolanda has joined #zuul  [06:30]
*** abregman has joined #zuul  [06:31]
*** yolanda has quit IRC  [06:37]
*** yolanda has joined #zuul  [06:40]
*** yolanda has quit IRC  [06:45]
*** yolanda has joined #zuul  [06:46]
*** willthames has quit IRC  [06:52]
*** yolanda has quit IRC  [07:10]
*** jamielennox is now known as jamielennox|away  [07:11]
*** yolanda has joined #zuul  [07:14]
*** yolanda has quit IRC  [07:23]
*** yolanda has joined #zuul  [07:24]
<openstackgerrit> Joshua Hesketh proposed openstack-infra/nodepool: Merge branch 'master' into feature/zuulv3  https://review.openstack.org/407923  [08:23]
*** Cibo_ has quit IRC  [09:35]
*** bhavik1 has joined #zuul  [09:48]
*** Cibo_ has joined #zuul  [09:50]
*** mgagne has joined #zuul  [10:47]
*** mgagne is now known as Guest2615  [10:47]
*** bhavik1 has quit IRC  [10:48]
*** openstackgerrit has quit IRC  [11:32]
*** hashar has joined #zuul  [11:51]
*** hashar_ has joined #zuul  [11:54]
*** hashar has quit IRC  [11:57]
*** hashar_ is now known as hashar  [13:33]
*** Guest2615 is now known as mgagne  [13:54]
*** mgagne has quit IRC  [13:54]
*** mgagne has joined #zuul  [13:54]
*** saneax is now known as saneax-_-|AFK  [14:16]
*** abregman has quit IRC  [14:55]
*** yolanda has quit IRC  [14:59]
*** yolanda has joined #zuul  [14:59]
*** openstackgerrit has joined #zuul  [15:25]
<openstackgerrit> Paul Belanger proposed openstack-infra/nodepool: Add --checksum support to disk-image-create  https://review.openstack.org/406411  [15:25]
<pabelanger> jeblair: Shrews: clarkb: revert back to PS10 ^ for --checksum  [15:25]
*** rcarrillocruz has quit IRC  [15:31]
<Shrews> pabelanger: +1'd  [15:31]
<Shrews> pabelanger: btw, what's the git or gerrit magic to easily revert to a previous patchset?  [15:32]
<Shrews> oh, maybe review -m  [15:34]
<clarkb> git review -d change,patchset && git commit --amend && git review  (the amend changes the sha, because gerrit won't accept a re-push of an identical commit)  [15:34]
<pabelanger> ya  [15:35]
<pabelanger> I've been known to cherry-pick the previous patchset from the gerrit ui too  [15:36]
*** hashar is now known as hasharAway  [15:36]
<mordred> fascinating - gerrit has re-applied the votes from clarkb and jeblair from ps10 to ps12  [15:40]
<clarkb> yup, I learned it does that when it applied a -1 of mine after an old patchset was pushed and I couldn't figure out why  [15:42]
<clarkb> "I didn't -1 this patchset" ... later ... "oh, it's that old patchset again that I -1'd"  [15:42]
*** hogepodge has joined #zuul  [15:50]
<openstackgerrit> Monty Taylor proposed openstack-infra/zuul: Add reset of watchdog timeout flag  https://review.openstack.org/408194  [15:59]
*** abregman has joined #zuul  [16:04]
<pabelanger> okay, doing some ops things with nb01 and nb02  [16:11]
<pabelanger> image-build ubuntu-precise: build started on nb02  [16:11]
<pabelanger> image-build ubuntu-trusty: build started on nb01  [16:11]
<pabelanger> image-build ubuntu-xenial: nothing listed in dib-image-list  [16:11]
<pabelanger> looks like we don't display pending builds  [16:12]
<pabelanger> only active ones  [16:12]
<openstackgerrit> Monty Taylor proposed openstack-infra/zuul: Add reset of watchdog timeout flag  https://review.openstack.org/408194  [16:16]
<rbergeron> ianw / pabelanger / mordred: re: zookeeper -- i find it odd that someone is ... an approver / owner for it for epel (different from the owner in fedora, which isn't unheard of) -- but not sure if anyone has pinged that human to see if he's ... going to ever do anything on that front or not.  [16:17]
<rbergeron> but i can rustle up package approvers and all that faiiiirly easily  [16:17]
<pabelanger> Right, I think what I was looking for was to get the zookeeper package into some product manager's pipeline at Red Hat, and have some team become responsible for it.  After some emails, it's now clear no such team exists.  The current roadmap is tooz and etcd  [16:18]
<pabelanger> so, guess ianw and I will push on the zookeeper package ourselves and see where to maintain it  [16:19]
<pabelanger> COPR for now, maybe into epel7  [16:19]
<pabelanger> Shrews: ^ questions on image-build when you are free  [16:23]
<openstackgerrit> Monty Taylor proposed openstack-infra/zuul: Add reset of watchdog timeout flag  https://review.openstack.org/408194  [16:26]
*** abregman_ has joined #zuul  [16:28]
*** abregman has quit IRC  [16:31]
<mordred> pabelanger, rbergeron: well - not to take over the channel with red hat things completely - but what I got was that it's not on any _openstack_ team's radar  [16:38]
<mordred> but I'm not convinced that there is no interest anywhere in the company in zookeeper, kafka or mesos (kafka and mesos both use zk too)  [16:38]
<mordred> pabelanger: so we may still be able to find some product team somewhere - maybe over in jboss land?  [16:39]
<Shrews> pabelanger: no, pending image builds are not displayed. the only things stored in ZK are active or past builds  [16:52]
<Shrews> pabelanger: but honestly, pending builds should not be pending for long. builders should build them as soon as they notice they need building  [16:54]
<pabelanger> in our case, it takes about 1h20m to do a build  [16:54]
<Shrews> i'm not sure what that has to do with it  [16:55]
<pabelanger> only issue I have: I don't actually have a way to confirm builds are queued up without using zk-shell (found the key it sets)  [16:55]
<Shrews> pabelanger: are you saying that a build in the 'building' state is not being displayed, even though it is being built?  [16:56]
<pabelanger> Shrews: no, the other way around. I have 2 building images, but issued image-build 3 times  [16:56]
*** hasharAway has quit IRC  [16:58]
<Shrews> pabelanger: i'm failing to grok something. so, you issued 'image-build ubuntu-xenial', it is not actually building, but you see the build request node for it in zk-shell? is that correct?  [16:59]
<pabelanger> Shrews: okay, give me a sec, I'll get a pastebin  [17:00]
<Shrews> k. thx  [17:00]
<pabelanger> current value of dib-image-list: http://paste.openstack.org/show/591687/  [17:01]
<pabelanger> Shrews: from that, we don't actually know we have a pending ubuntu-xenial build  [17:02]
<pabelanger> using zk-shell, I can tell: http://paste.openstack.org/show/591688/  [17:02]
<pabelanger> since I think that is the key we use to trigger it  [17:02]
<pabelanger> the issue I see: if clarkb came along and looked at dib-image-list, he would have no idea I've already queued up the ubuntu-xenial build  [17:03]
<Shrews> pabelanger: is that image paused?  [17:04]
<pabelanger> Hmm,  [17:04]
<pabelanger> I don't think so  [17:04]
<pabelanger> let me check  [17:04]
<pabelanger> no  [17:05]
<Shrews> pabelanger: then that is odd. if there is a builder thread free to build it (i'm assuming there is), then this seems like a bug, since the request shouldn't go unhandled for very long  [17:06]
<jeblair> Shrews, pabelanger: both builders are currently occupied  [17:07]
<pabelanger> yes  [17:07]
<jeblair> based on that pastebin  [17:07]
<pabelanger> nb01 is almost done  [17:07]
<Shrews> how many build threads though?  [17:07]
<pabelanger> 1 per server  [17:07]
<jeblair> Shrews: 1 per build machine  [17:07]
<Shrews> OH!  [17:07]
<Shrews> well, yeah  [17:07]
<Shrews> i assumed we used at least the default # of build workers  [17:08]
<Shrews> which is 4 IIRC  [17:08]
<jeblair> Shrews: i think that is the default  [17:08]
<jeblair> Shrews: 4 is uploaders (we use 16)  [17:08]
<pabelanger> I don't think diskimage-builder will support parallel builds  [17:08]
<Shrews> pabelanger: really? wow  [17:09]
<pabelanger> Shrews: I believe so, I'll have to check again  [17:09]
<Shrews> then perhaps we shouldn't support more than one build worker? or at least warn about it  [17:10]
<clarkb> dib can do it, so I think it's worth having the option  [17:10]
<clarkb> you just have to be extremely careful doing it  [17:10]
<clarkb> mostly in your use of the cache  [17:10]
<clarkb> (tl;dr it's mostly up to your elements, not dib itself, which should be fine as it builds things with all sorts of unique ids and proper loopbacks etc)  [17:11]
<jeblair> so that's probably the warning we should put in the docs :)  [17:11]
<pabelanger> I found this old blueprint a few weeks ago: https://blueprints.launchpad.net/tripleo/+spec/tripleo-diskimage-builder-parallel-builds  that's what I am taking my cue from  [17:11]
<Shrews> pabelanger: so, yeah, unhandled manual build requests will not show up. Would be easy enough to add, but there would be no other information in the output other than image name and some sort of new "pending" status  [17:13]
<Shrews> so, if that's useful, we can add it  [17:14]
<pabelanger> okay, something to think about. Not a blocker  [17:16]
<pabelanger> ubuntu-xenial is now building too  [17:16]
<pabelanger> okay, now to stop zookeeper  [17:16]
<jeblair> yeah, it's an 'image' attribute, not a 'build' attribute.  i'm thinking we may need another set of commands for those...  [17:16]
<jeblair> (like image-list, build-list, upload-list)  [17:17]
<clarkb> maybe even possibly a summarize command too?  [17:18]
<clarkb> summarize ubuntu-xenial outputs "pending: 1 builds: 2 uploaded_to: rax-dfw, ovh-gra1, osic-cloud1" or something  [17:19]
<pabelanger> zookeeper stopped / started  [17:19]
<pabelanger> now to see what happened  [17:19]
*** abregman_ has quit IRC  [17:28]
*** adam_g has quit IRC  [17:33]
<openstackgerrit> Paul Belanger proposed openstack-infra/nodepool: Clean up exception message to use image / provider name  https://review.openstack.org/408239  [17:47]
<pabelanger> Shrews: what we saw on nb02 when zookeeper was stopped / started: http://paste.openstack.org/show/591694/  [17:58]
<pabelanger> currently waiting to see if all ubuntu-precise images will be uploaded  [17:58]
*** adam_g has joined #zuul  [18:00]
<jeblair> Shrews: see reply on 408239  [18:01]
<Shrews> jeblair: see my reply to myself  :)  [18:02]
<jeblair> Shrews: aha :)  [18:02]
<Shrews> pabelanger: that looks normal. i'm very interested in what happens when ZK is killed during a build or during an upload.  [18:02]
<pabelanger> Shrews: that's what i did for this test, we had 2 builds going and uploads. Was stopped for 15 seconds  [18:03]
<pabelanger> so far, I don't see problems  [18:03]
<Shrews> pabelanger: should see something when the upload actually completes  [18:04]
<Shrews> pabelanger: because our upload lock *should* be lost  [18:04]
<pabelanger> hmm  [18:04]
<pabelanger> 2016-12-07 17:28:41,710 INFO nodepool.builder.UploadWorker.7: Image build ubuntu-precise-0000000011 in infracloud-vanilla is ready  [18:05]
<Shrews> pabelanger: did you stop zk gracefully or -9 it?  [18:05]
<pabelanger> that is the first image uploaded after stop / start  [18:05]
<pabelanger> graceful  [18:05]
<pabelanger> stop / start  [18:05]
<pabelanger> I can do -9 next  [18:05]
<jeblair> let's etherpad this: https://etherpad.openstack.org/p/nN69gpYoAO  [18:05]
<Shrews> ah, that could be different. i think zk does some storing of session state (at least, that's what it looked like when i played around with it)  [18:06]
<Shrews> so graceful might not break locks  [18:06]
<pabelanger> let me add some logs  [18:06]
<jeblair> pabelanger: i don't see an UploadWorker.7 from before the shutdown  [18:07]
<openstackgerrit> Merged openstack-infra/nodepool: Add --checksum support to disk-image-create  https://review.openstack.org/406411  [18:08]
<jeblair> pabelanger: oh, are you on nb01 or 02?  [18:08]
<pabelanger> jeblair: ya, this is nb02  [18:08]
<jeblair> ah ok, that's better  [18:08]
<jeblair> pabelanger, Shrews: i agree, it looks like uploadworker.7 survived the graceful zk restart with an upload in progress  [18:12]
<jeblair> neat :)  [18:12]
<pabelanger> I don't think UploadWorker.11 was actually uploading anything  [18:13]
<Shrews> anyone know offhand how long until a session is considered expired by zk/kazoo? those SUSPENDED messages indicate it had NOT expired  [18:13]
<pabelanger> I am not sure  [18:13]
<Shrews> wondering if the results would be different if we waited longer to restart  [18:14]
*** adam_g has quit IRC  [18:14]
<pabelanger> For sure, we should do that too.  [18:14]
<jeblair> pabelanger: yeah, uploadworker.11 (and the others) were likely in their poll loop looking for things to upload  [18:16]
<jeblair> harlowja: ^ do you know the answer to Shrews' question?  [18:17]
<Shrews> kazoo code seems to indicate 10s  [18:17]
<pabelanger> jeblair: on nb01, you can see uploadworker.06 has an exception, but just after zookeeper is back online, it finds an image  [18:17]
*** adam_g has joined #zuul  [18:18]
<jeblair> pabelanger: yeah, i think that agrees with what i said  [18:19]
<pabelanger> \o/  [18:19]
<harlowja> jeblair: not sure, ha  [18:20]
<pabelanger> just waiting for the rax-ord ubuntu-precise image to come online; if that happens we are good  [18:20]
<openstackgerrit> James E. Blair proposed openstack-infra/nodepool: Delete builds when diskimage removed from config  https://review.openstack.org/400421  [18:21]
<jeblair> Shrews: can you take a quick look at https://review.openstack.org/407124 ?  [18:22]
<Shrews> jeblair: seems harmless  [18:23]
<Shrews> jeblair: fyi, 407736 to pluralize things is fine, but it isn't backwards compatible  [18:26]
<Shrews> like, pabelanger couldn't just use a new builder with that. the current ZK nodes would need to change, or else start fresh  [18:26]
<jeblair> Shrews: yeah; i could do a bunch of backwards compat code, or we could just manually fix the production zk db, or start over  [18:27]
<jeblair> i assume we're still the only production db :)  [18:27]
<jeblair> i'd be happy to shepherd that through  [18:27]
<Shrews> i don't mind either way. just mentioning it  [18:27]
<jeblair> i'll leave a note on the review to let me approve it, and i'll fix up the db when it lands  [18:28]
<openstackgerrit> Merged openstack-infra/nodepool: Sort images and providers in zookeeper  https://review.openstack.org/407124  [18:28]
<openstackgerrit> Merged openstack-infra/nodepool: Merge branch 'master' into feature/zuulv3  https://review.openstack.org/407923  [18:34]
<openstackgerrit> James E. Blair proposed openstack-infra/nodepool: Don't use taskmanagers in builder  https://review.openstack.org/405663  [18:34]
<openstackgerrit> Merged openstack-infra/zuul: Add reset of watchdog timeout flag  https://review.openstack.org/408194  [18:43]
<openstackgerrit> Merged openstack-infra/nodepool: Fix zookeeper config in test fixture  https://review.openstack.org/407632  [18:47]
<pabelanger> i'll restart nodepool-builder here shortly, once the git repo is updated on disk  [18:56]
<pabelanger> aside from that, I don't think I have any outstanding issues right now  [18:57]
<pabelanger> \o/  [18:57]
<jeblair> pabelanger: did you want to perform more zk kill tests?  [18:58]
<pabelanger> jeblair: not yet; if we are good with the first round, I'll start another build then -9 zookeeper  [18:58]
<pabelanger> let me pick up the latest code first  [18:58]
<Shrews> pabelanger: lol, your exception cleanup change failed in all sorts of fantastic ways  [19:01]
<Shrews> none of which were related to your patch  [19:01]
<pabelanger> nice, didn't see that  [19:01]
<jeblair> yeah, was just going through those -- regular tests failed with the mysql db timeout thing  [19:02]
<pabelanger> oh, pep8 failed because of the bhs1 mirror issue from this morning  [19:02]
<jeblair> (it's not as obvious now, but i think it's that issue because it hit during a lockfile action in the cleanup)  [19:02]
<pabelanger> nods  [19:03]
<Shrews> is that NodeDeleter exception common?  [19:03]
<jeblair> and i think the coverage job failed with two similar timeouts  [19:04]
<SpamapS> jeblair: fungi: Can we talk about PTG space for Zuul some time soon?  [19:04]
<jeblair> Shrews: the FakeError?  [19:04]
<SpamapS> Or has it already been arranged?  [19:04]
<jeblair> er "Fake Error"  [19:04]
* SpamapS isn't sure where to look.  [19:04]
<Shrews> jeblair: yeah  [19:04]
<Shrews> jeblair: oh, that one is expected, i guess  [19:05]
<jeblair> Shrews: i think that's a test simulation  [19:05]
<Shrews> yeah  [19:05]
* Shrews should have looked at the name of the test more closely  [19:05]
<fungi> SpamapS: we have ptg space for infra, and i'm happy for much of that to be devoted to working on zuul  [19:06]
<fungi> it's by far our most complex deliverable now  [19:06]
<jeblair> that seems useful to me  [19:06]
<jeblair> i *hope* we'll be at a point where we'll be ready to do some work on planning the openstack rollout.  if we're not there, then we will still have plenty of work to do.  [19:08]
<fungi> yeah, either way we won't be done with this undertaking by time for the ptg  [19:09]
<fungi> so it's safe to say there will be plenty of zuulishness to be had there  [19:10]
<fungi> given how far along it's likely to be, i expect we'll want it to be the primary focus for our infra days anyway  [19:10]
<Shrews> fungi: your optimism is amusing to me  [19:13]
<Shrews> (but hopefully it *will* be far along)  [19:13]
<jeblair> SpamapS: are you looking for setting an agenda now, or agreement that "yeah, we do enough zuul things to warrant attending"?  [19:13]
<fungi> it's a survival mechanism  [19:13]
<SpamapS> jeblair: I'm justifying travel budgets right now.  [19:14]
<SpamapS> Which means I need to make sure people have space and something important to do while there. :)  [19:14]
<fungi> SpamapS: i will make sure if you come you can spend as much time collaborating on zuul as you want (at least for the horizontal team days where infra gets a space)  [19:14]
<SpamapS> It will help if agendas start to show zuul when there are, in fact, agendas. :)  [19:15]
<mordred> Shrews: I figure the node launcher work is gonna take like ... a week, right?  [19:15]
<Shrews> mordred: your optimism is even MORE amusing than fungi's  [19:15]
<fungi> SpamapS: ttx seems to want us to play a little fast and loose with "agendas" for this, but to the degree that we can declare the things we intend to work on while we're there, i'll make certain zuul gets a very prominent spot on that list  [19:16]
<fungi> that is unless i'm replaced as ptl before the ptg anyway ;)  [19:17]
<pabelanger> fungi: by a robot fungi?  [19:17]
<fungi> in which case i will merely strongly advise that it should be a major focus  [19:17]
* fungi thought he _was_ the robot fungi  [19:18]
<pabelanger> doh  [19:18]
<openstackgerrit> Merged openstack-infra/nodepool: Clean up exception message to use image / provider name  https://review.openstack.org/408239  [19:25]
*** rcarrillocruz has joined #zuul  [19:47]
<pabelanger> okay, nodepool-builder restarted on nb01.o.o / nb02.o.o  [19:51]
<jeblair> Shrews: https://review.openstack.org/405663 is ready now  [19:52]
<jeblair> wait, maybe not  [19:52]
<jeblair> just noticed the nv tests are failing  [19:52]
*** hashar has joined #zuul  [19:53]
<openstackgerrit> James E. Blair proposed openstack-infra/nodepool: Don't use taskmanagers in builder  https://review.openstack.org/405663  [19:58]
*** adam_g_ has joined #zuul  [20:06]
<pabelanger> okay, all the leaked checksum files have been manually cleaned up  [20:38]
*** harlowja has quit IRC  [20:41]
<jeblair> pabelanger: is now a good time for me to stop the builders and do the manual zk work?  [20:54]
<pabelanger> jeblair: yes, feel free  [20:56]
<jeblair> pabelanger: looks like i stopped nb01 in the middle of a dib run  [20:57]
<jeblair> dib is still running, however... i think we normally expect it to stop  [20:58]
<pabelanger> indeed, looks like fedora-24 is doing checksums now  [20:58]
<jeblair> it's stopped now  [20:59]
<jeblair> maybe it just needed to finish the checksum programs  [20:59]
<pabelanger> ya  [20:59]
<jeblair> "cd nodepool" "cp image images true" "rmr image" is what i'm doing for the move  [21:02]
<openstackgerrit> Merged openstack-infra/nodepool: Pluralize zk nodes with children  https://review.openstack.org/407736  [21:02]
<jeblair> the 'true' in the cp command means 'recursive'  [21:02]
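
A hedged Python/kazoo equivalent of the zk-shell commands jeblair quotes above -- a sketch of the move, not the procedure actually run; the paths are taken from the discussion. Note the caveat in the comments, which is exactly what Shrews asks about next:

    # Sketch: recursively copy /nodepool/image to /nodepool/images, then
    # remove the original (mirroring "cd nodepool; cp image images true;
    # rmr image"). The copy recreates each znode from scratch, so parent
    # stats such as cversion are not preserved.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts='localhost:2181')
    zk.start()

    def copy_tree(src, dst):
        data, _ = zk.get(src)
        zk.ensure_path(dst)
        zk.set(dst, data)
        for child in zk.get_children(src):
            copy_tree(src + '/' + child, dst + '/' + child)

    copy_tree('/nodepool/image', '/nodepool/images')
    zk.delete('/nodepool/image', recursive=True)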
<Shrews> jeblair: hmmm, wonder how that affects sequence node numbers  [21:02]
<Shrews> if at all  [21:02]
<jeblair> Shrews: good question, we should look for that in the next build  [21:03]
<pabelanger> I tried using dstat logs on https://review.openstack.org/#/c/405663 but didn't see much difference with a single provider. But +2'd  [21:05]
<jeblair> pabelanger: hrm, it should remove about 250 threads  [21:06]
<jeblair> pabelanger: oh, *single provider*  [21:07]
<pabelanger> ya  [21:07]
<jeblair> pabelanger: yes, there will be very little difference in that case.  [21:07]
<pabelanger> nods  [21:07]
<pabelanger> that is what I figured  [21:07]
<jeblair> pabelanger: threads = providers * workers  [21:07]
<pabelanger> k  [21:08]
<jeblair> mordred just approved that, so i'll wait till it lands to restart builders  [21:08]
<pabelanger> Yay  [21:08]
<jeblair> (which works well since i have more zk moves to do)  [21:08]
<jeblair> pabelanger: fedora-24/builds/00..10 and 11 are empty but still exist  [21:09]
<jeblair> do you know the story there?  [21:09]
<pabelanger> jeblair: I wonder if that is a result of me stopping nodepool-builder during builds too  [21:09]
<pabelanger> let me check the logs  [21:10]
<mordred> jeblair: double plus bonus points for use of boartty as a library in your storyboard script  [21:10]
<jeblair> 15 and 16 are empty too  [21:10]
<jeblair> mordred: i am lazy :)  [21:10]
<Shrews> jeblair: hrm, i think we'll have problems  [21:10]
<jeblair> Shrews: because the new nodes will have reset version numbers?  [21:11]
<Shrews> jeblair: a quick test, using the method you outlined, makes a subsequent create fail with NodeExistsError  [21:11]
<Shrews> jeblair: and i don't know why  [21:11]
<Shrews> (this is a hacked kazoo script i'm using, not the builder)  [21:11]
<pabelanger> jeblair: fedora-24-00..16 is empty too?  [21:12]
<jeblair> pabelanger: yep, 10, 11, 15, 16 all empty  [21:12]
<pabelanger> jeblair: if so, that must be a result of us stopping nodepool-builder when we have builds in progress  [21:13]
<pabelanger> because that 0016 was the build that was just running  [21:13]
<jeblair> pabelanger: 13 and 14 exist, 12 does not.  [21:13]
<jeblair> pabelanger: well, that one is fine -- there's nothing to clean it up right now  [21:13]
<jeblair> pabelanger: but the others may be an error  [21:13]
<pabelanger> yes, 13 and 14 are valid builds which are ready  [21:13]
<mordred> jeblair: 407135 has 2x+2 but I think it should go in the list of jeblair reviews  [21:14]
<jeblair> mordred: yeah, i'm going to switch to zuul soon  [21:14]
<jeblair> thanks  [21:14]
<mordred> jeblair: I've been trying to clear out things that don't need your attention  [21:15]
<jeblair> mordred: oh, thanks  [21:15]
<pabelanger> I'll also work on more test coverage around stopping nodepool-builder during a diskimage build  [21:16]
<pabelanger> should be easy to expose the issue  [21:16]
<Shrews> jeblair: http://paste.openstack.org/show/591725/  [21:17]
<jeblair> Shrews: hrm, that wfm  [21:20]
<Shrews> O.o  [21:20]
<jeblair> ii  zookeeper  3.4.5+dfsg-1  all  High-performance coordination service for distributed applications  [21:20]
<Shrews> jeblair: check the zk-shell listing. do you have a 'junks2'?  [21:21]
<jeblair> Shrews: no, just junk and junks  [21:22]
<Shrews> i missed a step in that paste... i deleted the first sequence node, then re-ran k.py  [21:22]
<jeblair> (i also did an attempt where i 'rmr junk' after cping.  that also worked)  [21:22]
<jeblair> oh, i'll try that  [21:23]
<Shrews> but, if i don't delete that, i get a junks2  [21:23]
<Shrews> perhaps i'm using an older version of zookeeper  [21:23]
<openstackgerrit> Merged openstack-infra/zuul: Add roadmap to README  https://review.openstack.org/407213  [21:23]
<openstackgerrit> Merged openstack-infra/zuul: Re-enable TestScheduler.test_rerun_on_error  https://review.openstack.org/406416  [21:24]
<openstackgerrit> Merged openstack-infra/zuul: Re-enable test_rerun_on_abort  https://review.openstack.org/407000  [21:24]
<openstackgerrit> Merged openstack-infra/nodepool: Delete builds when diskimage removed from config  https://review.openstack.org/400421  [21:24]
<jeblair> Shrews: aha, if i delete the initial sequence znode i get the error  [21:24]
<Shrews> jeblair: if those tests work against the production zookeeper, then go for it. we might be seeing version differences  [21:24]
<openstackgerrit> Merged openstack-infra/nodepool: Activate virtualenv before running dib  https://review.openstack.org/404487  [21:24]
<Shrews> jeblair: neat  [21:24]
<Shrews> so, that's not going to work  [21:25]
<Shrews> most likely  [21:25]
<jeblair> fascinating :)  [21:25]
<Shrews> jeblair: we are learning more ZK things  [21:26]
<jeblair> yay us!  [21:26]
<Shrews> fyi, i have zk 3.4.8, which is newer  [21:29]
<jeblair> i like that the kazoo docs explicitly state this will never happen  [21:31]
<Shrews> lol! link?  [21:31]
<jeblair> http://kazoo.readthedocs.io/en/latest/api/client.html#kazoo.client.KazooClient.create  [21:31]
<jeblair> "Note that since a different actual path is used for each invocation of creating sequential nodes with the same path argument, the call will never raise NodeExistsError."  [21:31]
<Shrews> something about the 'cp' probably makes it lose some sequence attributes  [21:33]
<jhesketh> Morning  [21:35]
<jeblair> Shrews: i think i understand -- i think it uses the cversion stat to get the next seqno  [21:35]
<jeblair> the new node starts with cversion at 0 and then increments with each new child  [21:35]
<Shrews> so that's the attribute it loses :)  [21:36]
<jeblair> so if you had 4 children, you would have cversion=4.  then delete 1, you still have cversion=4.  copy that to a new node, cversion=3.  [21:36]
<Shrews> ah  [21:37]
<jeblair> so in production, we're going to have cversion=2 on most of these things; we'll probably end up having sequence numbers start at 3.  they'll work for a while until they hit a collision with 11 or whatever we're up to.  unless 11 gets deleted in time.  which it probably would.  except for fedora-24.  [21:38]
<jeblair> Shrews: so i think we can fix this with a script that looks at all the sequence number containers, and adds/removes children until cversion==max(sequence number)  [21:38]
<jeblair> (even though this is ridiculous, i'm glad we're learning this now, before we had to learn it for something that's actually important :)  [21:39]
<Shrews> yeah  [21:40]
<jeblair> i guess we want cversion to be max(sequence number)+1  [21:40]
<jeblair> i'll write a script to do that real quick  [21:41]
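
A hedged sketch of what such a repair script might look like -- not the script jeblair actually ran (that one is in a paste a few lines below). Rather than computing cversion directly, this version burns throwaway sequence numbers until the counter has advanced past every existing child:

    # Sketch: advance a parent znode's sequence counter until the next
    # sequential child would be numbered past max(existing sequence number),
    # avoiding the NodeExistsError collisions discussed above. Assumes a
    # started kazoo client as in the earlier sketch.
    def advance_sequence(zk, path):
        children = zk.get_children(path)
        # Sequential children end in a ten-digit counter, e.g. 0000000016.
        seqnos = [int(c[-10:]) for c in children if c[-10:].isdigit()]
        target = max(seqnos) + 1 if seqnos else 0
        while True:
            # Creating a sequential child consumes one counter value; the
            # probe is deleted immediately and only its number is kept.
            probe = zk.create(path + '/probe-', b'', sequence=True)
            seq = int(probe[-10:])
            zk.delete(probe)
            if seq >= target:
                return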
*** harlowja has joined #zuul  [21:41]
*** jamielennox|away is now known as jamielennox  [21:49]
<jeblair> Shrews: experimentally, it seems the next sequence number is cversion-numChildren+1  [21:54]
<openstackgerrit> Paul Belanger proposed openstack-infra/nodepool: Update waitForBuildDeletion() to protect against delete race  https://review.openstack.org/408324  [21:54]
<pabelanger> jeblair: Shrews: should fix the race condition in: http://logs.openstack.org/63/405663/6/gate/gate-nodepool-python27-ubuntu-xenial/91a1e7d/console.html  [21:54]
<jeblair> Shrews: okay, i ran this: http://paste.openstack.org/show/591733/  [22:06]
<jeblair> i think we should be ready to restart now  [22:06]
<jeblair> wait, what? we decided to go with the activate virtualenv thing?  [22:07]
<jeblair> mordred: fyi https://review.openstack.org/403966  [22:10]
<jeblair> mordred: you may want to read the entire discussion there, and on the linked change  [22:12]
<jeblair> pabelanger, Shrews: i'm going to restart the builders now  [22:13]
<Shrews> jeblair: okie dokie. will keep my fingers and toes crossed  [22:13]
<jeblair> Shrews: i apparently missed something: http://paste.openstack.org/show/591734/  [22:15]
<Shrews> ugh  [22:15]
<jeblair> Shrews: hrm, actually, any chance that's normal for a lock collision?  [22:16]
<Shrews> jeblair: nope. they have their own exceptions  [22:16]
<Shrews> but that's failing on the sequence node create anyway  [22:17]
<Shrews> jeblair: heading out to meet ansible type folks (hi rbergeron!) but will try to pay attention to my phone pings if you need me  [22:22]
<jeblair> Shrews: ok.  i think the situation corrected itself after some deletions or something anyway  [22:24]
<jeblair> the other builder got it.  :)  [22:24]
<openstackgerrit> Merged openstack-infra/nodepool: Don't use taskmanagers in builder  https://review.openstack.org/405663  [22:40]
<jeblair> SpamapS: in 407000, i don't understand why the test is changing.  i can't think of a reason for it to do so, and anyway, the final assertion in that test is basically confirming that we're running one less job than before.  [22:50]
<jeblair> SpamapS: looking at the test output from master, "Launch job project-test1" shows up 5 times, and attempts is set at 4.  that's correct because the 5th attempt is the one that returns RETRY_LIMIT -- so from the user POV, we tried to run the job 4 times.  with 407000, the v3 branch shows that job launching 4 times, and it's the 4th that returns RETRY_LIMIT (so the user told us to run it 4 times and sees it run 3).  [22:51]
<SpamapS> jeblair: yeah, I wasn't sure that changing the test was the right call.  [22:55]
<SpamapS> jeblair: I thought what I saw was just that we ended up not recording one of the tries anymore.  [22:55]
<SpamapS> but I may have misunderstood how things are recorded.  [22:55]
<jeblair> SpamapS: yeah -- that's what i went looking for, but i figured "Launching job project-test1" was a pretty good proxy for "how many times did we run this job", since that's emitted by the pipeline manager when it tells the launcher to run a job (so it shouldn't be as affected by things changing around the launcher)  [22:57]
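
A hedged sketch of that proxy -- counting launch lines in a test's captured log. The exact log text is assumed from the discussion above, not taken from Zuul's test suite:

    # Sketch: count "Launch job <name>" lines as a proxy for how many times
    # the pipeline manager asked the launcher to run a job.
    def count_launches(log_text, job_name):
        needle = 'Launch job %s' % job_name
        return sum(1 for line in log_text.splitlines() if needle in line)

    # Per the discussion: on master with attempts=4 this yields 5 (the 5th
    # returning RETRY_LIMIT); on the v3 branch after 407000 it yields 4.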
<openstackgerrit> Merged openstack-infra/zuul: Re-enable test_client_get_running_jobs  https://review.openstack.org/407135  [23:04]
<jeblair> jhesketh: https://review.openstack.org/406699  [23:06]
* jhesketh looks  [23:07]
<jhesketh> oh, I reviewed that last night... why don't I have a vote?  [23:07]
<jeblair> jhesketh: did you use gertty?  [23:07]
<jhesketh> nope, I suspect it was a human error  [23:07]
* jhesketh tries to regain state  [23:08]
<jeblair> well, there's your problem! ;)  [23:08]
<jhesketh> heh, that's a default for me ;-)  [23:08]
<SpamapS> jeblair: so then it's possible the logic on retries changed and we're actually doing it wrong.  [23:11]
<jeblair> SpamapS: yeah, nothing about what might cause that comes immediately to mind, so it may require spelunking.  [23:12]
<jhesketh> jamielennox: left a question on 406699 if you have a moment  [23:14]
<SpamapS> jeblair: worth doing. I'll take a deeper look.  [23:15]
*** harlowja has quit IRC  [23:16]
<jeblair> SpamapS: thanks.  note that the change has merged.  [23:16]
<jamielennox> jhesketh: yea, it's probably easier to just make them separate git repos on the filesystem; they're already separate pipelines and everything  [23:16]
<SpamapS> jeblair: indeed, I think I just glossed over a bug, rather than introduced one. :-/  [23:17]
<jamielennox> i'll spin a new one today  [23:17]
<jeblair> SpamapS: yeah, odds are i introduced it.  but we've lost our tracking of it, so we might want to do something (revert, propose a todo note, or file a story)  [23:18]
<jhesketh> jamielennox: cool, mostly curious about the second comment though, as I think we need that to make sure it's behaving as expected  [23:19]
<openstackgerrit> Merged openstack-infra/zuul: Re-enable merge-mode config option and add more tests  https://review.openstack.org/406361  [23:19]
<SpamapS> jeblair: I'll add a story with the v3 tag and assign myself  [23:22]
<SpamapS> https://storyboard.openstack.org/#!/story/2000827  [23:24]
<openstackgerrit> James E. Blair proposed openstack-infra/zuul: Remove v3 project template test  https://review.openstack.org/395722  [23:26]
*** jamielennox is now known as jamielennox|away  [23:27]
*** jamielennox|away is now known as jamielennox  [23:28]
*** hashar has quit IRC  [23:36]
*** hashar has joined #zuul  [23:39]
*** Cibo_ has quit IRC  [23:50]
*** hashar has quit IRC  [23:52]
