Thursday, 2016-12-08

00:17 *** willthames has joined #zuul
00:37 *** adam_g has quit IRC
00:37 *** adam_g_ is now known as adam_g
00:45 *** jamielennox is now known as jamielennox|away
01:50 *** jamielennox|away is now known as jamielennox
02:55 *** Cibo_ has joined #zuul
03:49 <openstackgerrit> Merged openstack-infra/zuul: Remove v3 project template test  https://review.openstack.org/395722
04:37 *** harlowja has joined #zuul
05:00 *** bhavik1 has joined #zuul
05:46 *** harlowja has quit IRC
05:58 *** bhavik1 has quit IRC
06:22 *** abregman has joined #zuul
07:59 *** saneax-_-|AFK is now known as saneax
08:23 *** hashar has joined #zuul
08:33 *** jamielennox is now known as jamielennox|away
09:45 *** hogepodge has quit IRC
09:46 *** hogepodge has joined #zuul
10:08 *** saneax is now known as saneax-_-|AFK
10:09 *** saneax-_-|AFK is now known as saneax
11:01 *** hogepodge has quit IRC
11:21 *** hogepodge has joined #zuul
11:26 *** hogepodge has quit IRC
11:38 *** hogepodge has joined #zuul
11:48 *** hogepodge has quit IRC
12:00 *** hogepodge has joined #zuul
13:39 *** bhavik1 has joined #zuul
14:26 *** abregman has quit IRC
14:49 *** abregman has joined #zuul
15:02 *** abregman has quit IRC
15:08 *** abregman has joined #zuul
15:29 *** abregman has quit IRC
15:41 *** saneax is now known as saneax-_-|AFK
17:02 *** hashar is now known as hasharCall
17:46 *** adam_g is now known as adm_g
17:47 *** adm_g is now known as adam_g
17:47 *** adam_g has quit IRC
17:47 *** adam_g has joined #zuul
18:06 <pabelanger> o/
18:07 <pabelanger> just checking nb01, I see we are getting an exception
18:07 *** hasharCall is now known as hashar
18:07 <pabelanger> http://paste.openstack.org/show/591842/
18:10 <Shrews> pabelanger: is that recent?
18:10 <openstackgerrit> Paul Belanger proposed openstack-infra/nodepool: Use diskimage.name for _checkForScheduledImageUpdates exception  https://review.openstack.org/408756
18:10 <pabelanger> Shrews: just happened
18:11 <pabelanger> Shrews: however, it's stopped
18:11 <pabelanger> let me check nb02 to see what happened
18:11 <Shrews> pabelanger: ok. that's due to the change jeblair made yesterday. i thought that was transient
18:11 <pabelanger> oh, no. It is still happening
18:11 <pabelanger> let me check zk-shell
18:12 <pabelanger> Shrews: okay, let's see what jeblair wants to do
18:13 <jeblair> drat
18:13 <Shrews> we might just need to start fresh  :(
18:13 <jeblair> yes, though i think if we land pabelanger's change, we will know which node to go in and fix harder
18:13 <Shrews> which change?
18:14 <jeblair> 756
18:14 <pabelanger> I don't mind a fresh start, rebuilds and uploads are working very well
18:16 *** hashar has quit IRC
18:17 *** harlowja has joined #zuul
18:17 <jeblair> let's restart with 756, poke at that node (mostly i want to see what i missed earlier).  if that's confusing or doesn't work, let's reboot.
18:20 <pabelanger> wfm
18:20 *** harlowja_ has joined #zuul
18:22 <Shrews> maybe our config objects should just implement __repr__()
18:22 *** harlowja has quit IRC
18:27 <jeblair> Shrews: ++
18:27 <pabelanger> jeblair: regarding https://review.openstack.org/#/c/404976/, which moves fake-image-create out of production code, how would you like to see it reworked?
18:29 <jeblair> pabelanger: i'm fine with that change as-is -- the thing i wanted to communicate is that i don't think we should change how we cause failures to happen.  we can set that command in our test configs, but we should not change the command to cause failures; we should continue using the 'should_fail' (or whatever it's called) image metadata attribute to cause build failures.
18:30 <jeblair> pabelanger: (basically, my -1 was in reference to the last sentence of the commit message)
18:30 <pabelanger> okay, I am okay with that
18:30 <pabelanger> I'll update it here in a minute
18:31 <jeblair> (the reason being: by using the metadata, we can say "image A should succeed, image B should fail".  we can't do that if we set the command to fail for all images)
18:34 <pabelanger> That makes sense
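[Aside: a minimal Python sketch of the testing approach jeblair describes, not nodepool's actual test code; 'should_fail' is quoted from the discussion, which itself hedges the exact attribute name. Keying failure off per-image metadata lets one test build image A successfully while image B fails, which a globally swapped failing command cannot express.]

    # Hedged sketch: a fake build command that consults per-image metadata
    # instead of being replaced by a command that always fails.
    class FakeImage(object):  # hypothetical stand-in for a config image object
        def __init__(self, name, metadata=None):
            self.name = name
            self.metadata = metadata or {}

    def fake_image_create(image):
        if image.metadata.get('should_fail'):
            raise RuntimeError('simulated build failure for %s' % image.name)
        return 'built %s' % image.name

    print(fake_image_create(FakeImage('image-a')))                    # succeeds
    # fake_image_create(FakeImage('image-b', {'should_fail': True}))  # raises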
18:36 <openstackgerrit> Merged openstack-infra/nodepool: Use diskimage.name for _checkForScheduledImageUpdates exception  https://review.openstack.org/408756
18:38 <openstackgerrit> Paul Belanger proposed openstack-infra/nodepool: Make diskimage-builder command configurable for testing  https://review.openstack.org/404976
18:39 <openstackgerrit> Paul Belanger proposed openstack-infra/nodepool: Make diskimage-builder command configurable for testing  https://review.openstack.org/404976
18:44 <pabelanger> k, 756 has landed on disk. going to stop / start nodepool-builder
18:47 <pabelanger> http://paste.openstack.org/show/591848/
18:52 <jeblair> okay, i'll poke around in zk-shell
18:54 <pabelanger> ya, just looking now too.
18:54 <pabelanger> /nodepool/images/ubuntu-trusty/builds does exist
18:55 <pabelanger> which I think is the path it should be using
18:57 <openstackgerrit> David Shrewsbury proposed openstack-infra/nodepool: Add __repr__ to ConfigValue objects  https://review.openstack.org/408776
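[Aside: a minimal sketch of what such a __repr__ could look like; ConfigValue here is illustrative, not nodepool's actual class. The payoff is that log lines and exceptions show field values instead of an opaque object address.]

    # Illustrative only: a generic __repr__ on a config base class.
    class ConfigValue(object):
        def __repr__(self):
            fields = ', '.join('%s=%r' % kv for kv in sorted(vars(self).items()))
            return '<%s %s>' % (self.__class__.__name__, fields)

    class DiskImage(ConfigValue):
        def __init__(self, name):
            self.name = name

    print(repr(DiskImage('ubuntu-trusty')))  # <DiskImage name='ubuntu-trusty'>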
18:58 <jeblair> cversion is 17, numchildren is 3, 17-3+1=15 which is > 11....
19:01 <jeblair> (i'm poking at zk -- ignore json object decode errors)
19:02 <pabelanger> ack
19:03 <pabelanger> we should dump threads again too when we're finished with zookeeper; nodepool-builder is still using 139% CPU.  Figured it would have dropped a little
19:04 <jeblair> okay, i did a create/rm cycle 2 times and now it works
19:04 <jeblair> i no longer think i understand the math about how it picks the next sequence number
19:05 <jeblair> i think if we want to be able to understand how to do this in the future, we should probably read some code.
19:05 <pabelanger> indeed, I have no idea why it was failing :)
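[Aside: the math above makes more sense knowing that ZooKeeper derives a sequential child's numeric suffix from the parent znode's cversion, its child-change counter, which increments on every child create *and* delete; that is why a couple of create/rm cycles can push the counter past a collision. A minimal kazoo sketch, assuming a ZooKeeper reachable at 127.0.0.1:2181:]

    # Sketch only: watch the sequence suffix and the parent's cversion
    # advance together as children are created and deleted.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts='127.0.0.1:2181')
    zk.start()
    zk.ensure_path('/demo/builds')

    p1 = zk.create('/demo/builds/build-', sequence=True)  # e.g. .../build-0000000000
    zk.delete(p1)                                          # delete also bumps cversion
    p2 = zk.create('/demo/builds/build-', sequence=True)   # counter keeps advancing
    print(p1, p2)
    print('cversion:', zk.exists('/demo/builds').cversion)
    zk.stop()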
19:07 <jeblair> pabelanger, Shrews: we *could* just let things proceed and if it crops up again, perform a fix like i just did.  i think there's a chance that this may happen a few more times as other images age out and the builders decide they need to be rebuilt.  there's also a chance that things could work now but break unexpectedly in the future because of some quirk of the math.  so if we want to play it safe and scrap the whole thing and restart, that's okay too.
19:08 <Shrews> i feel like us not understanding the knobs we are twiddling is a good reason to just start fresh. but we could wait and see what happens, too.
19:09 <pabelanger> Ya, I think a fresh start is a good idea if we are going live with nodepool.o.o. But same, if we want to see what happens, I'm sure we'll learn more
19:10 <jeblair> okay, let's do a fresh start then.
19:10 <jeblair> pabelanger: do you want to do the honors?
19:11 <pabelanger> jeblair: sure, how do we want to do the fresh start?
19:12 <Shrews> pause everything, delete images, shutdown builder, rmr /nodepool, unpause everything, start builder?
19:13 <pabelanger> ya, that's what I've come up with too
19:13 <jeblair> Shrews: yeah, i think that's it
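[Aside: the agreed sequence, roughly, as shell. The exact nodepool subcommands, service tooling, and paths are assumptions drawn from elsewhere in this log; the per-host steps run on both nb01 and nb02.]

    # Hedged sketch of the fresh-start procedure discussed above.
    nodepool dib-image-list                      # find local image builds
    nodepool dib-image-delete <build-id>         # for each build (placeholder id)
    service nodepool-builder stop                # also aborts in-progress work
    rm -rf /opt/nodepool_dib/*                   # clear leftover DIB files
    zk-shell localhost:2181 --run-once 'rmr /nodepool'   # once, from any host
    service nodepool-builder start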
19:13 <pabelanger> okay, let me put nb01 / nb02 into the emergency file so puppet doesn't do things
19:13 <Shrews> make sure to hit both nb01 and nb02
19:13 <pabelanger> since we don't have a pause via CLI yet
19:15 <Shrews> pabelanger: oh, should probably delete local dib files before the restart, too
19:15 <pabelanger> ++
19:16 <jeblair> Shrews: yeah, any that are left that the deletes don't take care of
19:17 <Shrews> yeah, hopefully there are none
19:22 <pabelanger> okay, deleting centos-7 dibs
19:22 <pabelanger> let's see what happens
19:25 <clarkb> is there a tldr of what the problem is/was?
19:25 <Shrews> pabelanger: image-delete properly did the things?
19:26 <Shrews> (since we had a bug there previously)
19:27 <pabelanger> Shrews: I believe so, looking at logs now
19:28 <Shrews> clarkb: this https://review.openstack.org/407736 was a non-backward-compatible change, requiring manual intervention via zk-shell
19:28 <pabelanger> we have 1 build in progress, ubuntu-xenial
19:29 <pabelanger> going to try deleting that
19:29 <Shrews> clarkb: that caused the zk sequence nodes (used for build and upload IDs) to stop working
19:29 <pabelanger> not sure if it will work
19:33 *** bhavik1 has quit IRC
19:33 <pabelanger> cleanup still working
19:33 <Shrews> pabelanger: hmm, unlikely to work via CLI
19:34 <pabelanger> Ya, we don't have a good way to abort a DIB in progress from CLI
19:35 <pabelanger> aside from kill -9 on disk-image-create
19:35 <Shrews> pabelanger: hrm, the CLI has a bug. we *should* report that we can't delete it b/c it's in progress. i don't have a lock around it  :(
19:37 <pabelanger> same goes for uploads in progress, they didn't stop.
19:37 <pabelanger> k
19:38 <pabelanger> I had to stop nodepool-builder to stop the uploads, started again
19:38 <pabelanger> all uploaded images are now gone
19:39 <pabelanger> we have 3 DIBs stuck in deleting, but I think we expected that
19:39 <pabelanger> both nb01 / nb02 have empty /opt/nodepool_dib
19:40 <pabelanger> I think we are good
19:40 <pabelanger> okay, both builders stopped
19:41 <pabelanger> just for fun, running sudo -H -u nodepool nodepool alien-image-list on nodepool.o.o
19:41 <pabelanger> it should be empty
19:42 <pabelanger> and it is, no images from nb01 / nb02
19:43 <pabelanger> rmr /nodepool done
19:43 <pabelanger> starting builders again
19:44 <pabelanger> and removing nb01 / nb02 from emergency file
19:45 <Shrews> oh, no. i was wrong. code is fine
19:45 <Shrews> can only delete READY builds (or uploads, for that matter)
19:47 <Shrews> ugh, no
19:47 <pabelanger> and diskimage builds started again
19:49 <pabelanger> Shrews: jeblair: we are back online
19:49 <openstackgerrit> David Shrewsbury proposed openstack-infra/nodepool: Check for in progress build/upload in CLI  https://review.openstack.org/408794
19:49 <pabelanger> only 2 issues, if we want to call them issues.
19:49 <pabelanger> 1) cannot abort in-progress diskimage builds, had to kill -9 disk-image-create
19:50 <pabelanger> 2) cannot abort uploads, had to stop nodepool-builder
19:50 <pabelanger> aside from that, dib-image-delete worked great
19:50 <Shrews> 794 should at least fix the CLI and it will say you can't do those things
19:51 <Shrews> being able to *actually* do those things is a new feature. Could you do that in the old builder?
19:51 <Shrews> i don't recall seeing code to handle that (without actually stopping the builder itself)
19:51 <pabelanger> no, I don't think we could
19:52 <Shrews> oh, so this is feature creep  :)
19:52 <jeblair> clarkb: note that we chose to go ahead and do 407736 knowing we might have to restart, as an opportunity to potentially learn some things about zk.  we did.  :)  i think we know enough to come up with a better migration plan next time something like this comes up.
19:52 <pabelanger> however, that was the first time I've ever had to purge all images in nodepool :)
19:52 <Shrews> pabelanger: lol, touché
19:53 <pabelanger> but ya, feature creep :)
19:53 <jeblair> pabelanger: yeah, i think restarting the builder after setting pause in order to abort builds/uploads is the right thing
19:53 <pabelanger> Yup, it wasn't too painful
19:55 <jeblair> it looks like f23 and f24 failed?
19:55 <pabelanger> checking
19:56 <jeblair> 2016-12-08 19:50:28,424 INFO nodepool.image.build.fedora-24: Error: Failed to synchronize cache for repo 'updates'
19:56 <jeblair> they both say that
19:57 <pabelanger> oh, ya
19:57 <pabelanger> that happens from time to time
19:57 <pabelanger> actually
19:57 <pabelanger> 2016-12-08 19:47:43,817 INFO nodepool.image.build.fedora-24: mount: none already mounted or /opt/dib_tmp/dib_build.bIPidWaE/mnt/proc busy
19:58 <jeblair> 'info' huh? :)
19:58 <jeblair> pabelanger: i think killing dib was the wrong thing :(
19:59 <pabelanger> yes
19:59 <jeblair> pabelanger: probably should have let the nodepool-builder shutdown handle stopping it
19:59 <pabelanger> agreed
20:00 <pabelanger> I didn't check the dib_tmp folder
20:02 <Shrews> db/lockfile errors seem to be happening frequently today
20:03 <pabelanger> http://paste.openstack.org/show/591854/
20:03 <jeblair> pabelanger: i don't understand that error -- that path should be for the current build, not something left over
20:03 <pabelanger> when should we expect the cleanup worker to pick up the deleting state?
20:04 <pabelanger> jeblair: ya, this might be something in diskimage-builder, http://logs.openstack.org/88/408288/8/check/gate-dib-dsvm-functests-ubuntu-xenial/9c7656f/console.html#_2016-12-08_19_23_48_471594 has the same message
20:07 <jeblair> pabelanger: we may never delete those builds from zk since there are no files for them :( -- no builder will think it's responsible for them.
20:08 <pabelanger> jeblair: sad face indeed
20:09 <pabelanger> will poke at it shortly
20:10 <jeblair> we may need to make sure the builder field is set correctly and then use that to determine whether a given builder should delete the zk record
20:10 <jeblair> (it looks like the builder got unset when the state changed to deleting)
20:11 <Shrews> touch fedora-23-0000000001.raw should do it
20:11 <jeblair> agreed
20:12 <pabelanger> k, trying that
20:13 <pabelanger> yup, worked
20:13 *** jamielennox|away is now known as jamielennox
20:14 <pabelanger> did fedora-24 too
20:17 <Shrews> i'm not sure what the code path is that would NOT set the hostname
20:17 <Shrews> or would remove it
20:20 <pabelanger> Shrews: Would it be line 451, which uses a new ImageBuild() rather than updating the existing data?
20:20 <pabelanger> in builder.py
20:20 <jeblair> pabelanger: yep
20:20 <Shrews> oh, i see it
20:20 <Shrews> yes
20:21 <pabelanger> cool
20:21 <Shrews> that *should* just be reusing 'build'
20:21 <Shrews> fix coming
20:22 <pabelanger> k
20:22 <pabelanger> getting a coffee
20:24 <openstackgerrit> David Shrewsbury proposed openstack-infra/nodepool: Re-use build data when we set for DELETING  https://review.openstack.org/408808
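[Aside: the bug pattern being fixed, sketched with illustrative names rather than nodepool's exact code: constructing a fresh record when flipping the state drops fields, like the builder hostname, that were already set.]

    # Illustrative sketch of the bug and the fix.
    class ImageBuild(object):
        def __init__(self):
            self.state = None
            self.builder = None      # hostname of the builder that owns the build

    def mark_deleting_buggy(build):
        replacement = ImageBuild()   # fresh object: 'builder' is silently dropped
        replacement.state = 'deleting'
        return replacement

    def mark_deleting_fixed(build):
        build.state = 'deleting'     # reuse the existing record; 'builder' survives
        return build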
20:24 <jeblair> ianw: do you know what went wrong with this build?  http://nb01.openstack.org/dib.fedora-23.log
20:24 <jeblair> ianw: same thing happened to f24 i think
20:26 <ianw> jeblair: 2016-12-08 19:48:19,954 INFO nodepool.image.build.fedora-23: Error: Failed to synchronize cache for repo 'updates'
20:26 <jeblair> Shrews: should we lock around that?
20:26 <jeblair> ianw: yeah, that looked suspicious to me.  but i don't understand what that means or why it would happen 3 times on 2 hosts.
20:26 <ianw> i think upstream mirror issues.  we had a couple of similar failures in dib CI
20:27 <jeblair> ianw: okay, so hopefully it will be fixed by the time the builders cycle around to them again
20:27 <ianw> jeblair: yeah, i think so.  i really need to get rid of f23!  keep getting sidetracked
20:29 <Shrews> jeblair: that really shouldn't be necessary as it's not a current build or in-progress build
20:29 <Shrews> we short-circuit the loop if it is
20:30 <jeblair> Shrews: yeah, but any builder can choose to mark it as deleting, so i'm wondering about two builders doing that at the same time, with the second one possibly issuing the store call after the first one actually deleted the znode.
20:31 <Shrews> jeblair: good point
20:32 <jeblair> Shrews: (also, we should probably only store it if the state has changed so we don't do so more than necessary)
20:33 <Shrews> jeblair: how do you mean?
20:33 <Shrews> oh, if it's not already DELETING
20:33 <jeblair> ya
20:33 <Shrews> k
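[Aside: the race and the guard just discussed, sketched with kazoo; the lock subnode and the JSON schema of the build record are assumptions, not nodepool's actual layout. Take the build lock, re-check, and write only when the state actually changes.]

    # Sketch only: serialize the DELETING transition across builders.
    import json
    from kazoo.exceptions import NoNodeError

    def mark_deleting(zk, image, build_id):
        build_path = '/nodepool/images/%s/builds/%s' % (image, build_id)
        with zk.Lock(build_path + '/lock'):       # serialize across builders
            try:
                raw, _ = zk.get(build_path)
            except NoNodeError:
                return                            # another builder already deleted it
            data = json.loads(raw.decode('utf8'))
            if data.get('state') != 'deleting':   # store only if the state changes
                data['state'] = 'deleting'
                zk.set(build_path, json.dumps(data).encode('utf8'))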
20:40 <openstackgerrit> Clint 'SpamapS' Byrum proposed openstack-infra/zuul: Fix retry accounting off-by-one bug  https://review.openstack.org/408814
20:40 <SpamapS> jeblair: ^ found it ;)
20:42 <openstackgerrit> David Shrewsbury proposed openstack-infra/nodepool: Re-use build data when we set for DELETING  https://review.openstack.org/408808
20:58 <jeblair> SpamapS: \o/
21:07 <SpamapS> jeblair: hey, looking at test_swift_instructions .. looks like that will need some glue written. That's still a thing in v3, yes?
21:26 <jeblair> SpamapS: yes, there's a little bit in the spec about moving it to the new auth section of the job config. so yeah, will require some re-plumbing.
21:29 <jeblair> SpamapS: it looks like the master branch has tries as base 1 as well?  http://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/model.py#n675
21:53 <SpamapS> jeblair: entirely possible. I did not look at master, I looked at why we gave up early in v3.
21:54 <SpamapS> jeblair: Since I have 0 familiarity with master, it's a lot harder for me to go back that far (though I have done it a bit lately)
21:54 <SpamapS> much simpler for me to try and grok the test, and make sure v3 does what the test wanted and the spec wants. :-P
21:56 <SpamapS> so my commit message does in fact contain an assumption that may be untrue
21:57 <SpamapS> equally possible is that in master we don't set tries until we've _tried_ once... and in v3 we set it as part of the model creation
21:57 * SpamapS should be more careful with words
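[Aside: the class of discrepancy SpamapS describes, illustrated with toy code rather than Zuul's actual model: whether 'tries' counts attempts started or attempts finished shifts every retry comparison by one.]

    # Illustrative only: the same limit check behaves differently depending
    # on when 'tries' is incremented.
    MAX_TRIES = 3

    def should_retry(tries):
        return tries < MAX_TRIES

    # If 'tries' starts at 0 and is bumped after each attempt (master-style,
    # as described above), a build gets 3 attempts. If it is set to 1 at
    # model creation (v3-style, as described), the same check allows only 2.
    print(should_retry(0), should_retry(1), should_retry(3))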
21:59 <jeblair> SpamapS: yeah, maybe this is the right fix (it seems to make what we expect happen)... i probably won't be able to sleep until i know what changed so i can update my mental map, but i'm happy to dig into that since that might just be a personal character flaw.  :)
22:03 <SpamapS> jeblair: I get the benefit of having no base on which to build my anxiety .. my condolences to your sleep ;)
22:29 *** jamielennox is now known as jamielennox|away
22:34 *** jamielennox|away is now known as jamielennox
23:12 *** saneax-_-|AFK is now known as saneax
23:25 <openstackgerrit> James E. Blair proposed openstack-infra/zuul: WIP triggers  https://review.openstack.org/408848
23:25 <openstackgerrit> James E. Blair proposed openstack-infra/zuul: WIP organize connections into drivers  https://review.openstack.org/408849
23:29 <jeblair> jamielennox, jhesketh: ^ that isn't remotely ready for review; i'm literally about halfway into the change.  i'm mostly pushing it up so i don't lose it.  the commit message is literally my shorthand notes.  don't worry if you can't make heads or tails of it -- i think it will be clear when i'm done (and i'll add nice docstrings).  but basically i think that direction gives us the ability to have nice singleton objects (drivers, connections) for the things we don't want lots of copies of, but also the ability to have lots of per-tenant/pipeline objects (triggers, reporters, etc).  and to nicely organize all the supporting code.  i think it will make it easier to make it extensible using entrypoints, and provide a nice foundation for new drivers (github, sqlalchemy, etc)
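[Aside: a rough sketch of the layering described above, with illustrative names rather than the eventual Zuul v3 classes: singleton drivers hand out shared connections, while triggers and reporters are cheap per-tenant/pipeline objects built on top of them.]

    # Sketch only: one Driver per driver type, one Connection per configured
    # remote system, many Triggers per tenant/pipeline use.
    class Driver(object):
        name = 'gerrit'  # e.g.; other drivers might cover github, sqlalchemy, ...

        def get_connection(self, name, config):
            return Connection(self, name, config)

        def get_trigger(self, connection, config):
            return Trigger(connection, config)

    class Connection(object):
        def __init__(self, driver, name, config):
            self.driver, self.name, self.config = driver, name, config

    class Trigger(object):
        def __init__(self, connection, config):
            self.connection, self.config = connection, config

    driver = Driver()
    conn = driver.get_connection('review.example.org', {})  # hypothetical name
    trigger = driver.get_trigger(conn, {})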
23:30 <jeblair> jamielennox, jhesketh: i have to run now, but i think i can finish that change tomorrow and maybe have it ready to look at next week
23:31 <jlk> oh sorry, I already peeked at it :)
23:32 <jeblair> jlk: np :)  but hopefully that ^ explains the 'print' statements remaining :)
23:32 <jlk> yup well, that and the giant "WIP"
23:32 <jlk> and part of commenting was so that I see updates, since I've been staring at the 2.5 + github code
23:32 <jlk> and am eager to get to porting it over :)
23:49 *** jamielennox is now known as jamielennox|away
23:58 *** Cibo_ has quit IRC
23:59 *** Cibo_ has joined #zuul
