Monday, 2016-12-19

*** saneax-_-|AFK is now known as saneax01:18
*** saneax is now known as saneax-_-|AFK02:45
*** saneax-_-|AFK is now known as saneax02:58
*** saneax is now known as saneax-_-|AFK03:09
*** bhavik has joined #zuul04:49
*** bhavik has quit IRC04:58
*** saneax-_-|AFK is now known as saneax05:35
openstackgerritTobias Henkel proposed openstack-infra/nodepool: Ignore .idea directory  https://review.openstack.org/41229606:12
*** abregman has joined #zuul06:12
*** abregman has quit IRC06:31
*** abregman has joined #zuul06:36
openstackgerritCao Xuan Hoang proposed openstack-infra/nodepool: Clean imports in code  https://review.openstack.org/40405906:38
*** hashar has joined #zuul09:12
*** bhavik1 has joined #zuul10:31
openstackgerritAndreas Jaeger proposed openstack-infra/zuul: Copy queue_name  https://review.openstack.org/41239810:49
*** bhavik1 has quit IRC11:07
openstackgerritAlexander Evseev proposed openstack-infra/nodepool: Use configured floating IP pool  https://review.openstack.org/41241911:55
*** yolanda has quit IRC11:59
*** yolanda has joined #zuul12:00
*** bhavik1 has joined #zuul12:12
*** bhavik1 has quit IRC12:15
*** jamielennox is now known as jamielennox|away14:56
*** saneax is now known as saneax-_-|AFK15:43
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: Change nodepoold ZK config change logic  https://review.openstack.org/41250715:54
Shrewspabelanger: 412507 ^^^ based on your comments last week. I didn't see a reason for lines 1044-1045, so I removed those. Let me know if you think that's incorrect.15:55
Shrewspabelanger: because if the zk servers are the same, setting them equal to each other seems unnecessary15:56
pabelangerShrews: great, +216:16
pabelangerShrews: jeblair: https://review.openstack.org/#/c/412124/ is an optimization for dib-image-list16:17
pabelangerotherwise, you just se16:18
pabelangersee*16:18
pabelanger| centos-7-0000000136       | centos-7       | nb01    |           | building | 00:00:49:26 |16:18
*** abregman has quit IRC16:35
*** cinerama has joined #zuul16:43
*** cinerama has quit IRC16:44
*** cinerama has joined #zuul16:44
jeblairjamielennox|away: can you take a look at https://review.openstack.org/408849 when you get a chance?17:14
*** hashar has quit IRC17:19
openstackgerritPaul Belanger proposed openstack-infra/nodepool: Expose the ability to build all infra diskimages  https://review.openstack.org/41192917:49
openstackgerritMerged openstack-infra/nodepool: Rename ubuntu-dib to ubuntu-trusty  https://review.openstack.org/41192018:10
openstackgerritMerged openstack-infra/nodepool: Expose the ability to build all infra diskimages  https://review.openstack.org/41192918:25
jeblairclarkb: i know we talked about this on friday -- did you want to review it? https://review.openstack.org/41199918:27
clarkbjeblair: yup will do so now18:28
openstackgerritJames E. Blair proposed openstack-infra/nodepool: Fix some doc typos  https://review.openstack.org/41257118:30
openstackgerritMerged openstack-infra/nodepool: Update documentation on diskimages/images  https://review.openstack.org/41199818:32
openstackgerritMerged openstack-infra/zuul: Remove TestScheduler.test_nonexistent_job  https://review.openstack.org/41057118:36
openstackgerritMerged openstack-infra/zuul: Remove now-unused ZuulTestCase.resetGearmanServer  https://review.openstack.org/41093518:37
openstackgerritPaul Belanger proposed openstack-infra/nodepool: Allow nodepool-builder to only build diskimages  https://review.openstack.org/41216018:38
jeblairmordred: can you take a look at the comment on 411512?  maybe there's some other way to accomplish what you want?18:40
openstackgerritMerged openstack-infra/nodepool: Change nodepoold ZK config change logic  https://review.openstack.org/41250718:42
clarkbjeblair: done. +2'd but didn't approve as that conflicts with a few things and wasn't sure if you want that in first or later18:45
clarkbShrews: pabelanger is http://logs.openstack.org/02/411002/1/gate/gate-nodepool-python27-ubuntu-xenial/1434cc9/console.html.gz a known issue? seems unrelated to my config change so gonna recheck but thought I would point it out18:46
jeblairclarkb: yeah, i'd like to press forward on that; i'll +318:48
pabelangerclarkb: yes, I believe it was hitting: https://review.openstack.org/#/c/408324/ reason for the recheck18:49
pabelangeroh wait18:50
pabelangerthat is not correct18:50
pabelangerhttp://logs.openstack.org/02/411002/1/gate/gate-nodepool-python27-ubuntu-xenial/1434cc9/console.html.gz#_2016-12-16_21_47_09_78006018:51
pabelangerI was looking at that on Friday18:51
pabelangerbut didn't figure out why18:51
pabelangerclarkb: Shrews: ^ first time I seen that18:51
pabelangerI was looking into a leaked lock18:52
openstackgerritMerged openstack-infra/nodepool: Remove script-dir  https://review.openstack.org/41199918:53
openstackgerritMerged openstack-infra/nodepool: Docs: add Zookeeper installation section  https://review.openstack.org/41200918:55
openstackgerritMerged openstack-infra/nodepool: Update operation docs  https://review.openstack.org/41201818:55
mordredjeblair: ah - ok. we can just not remove the FakeProviderManager class but do the other things19:13
mordredjeblair: the biggest problem there is that there were two different ways of getting a ProviderManager that used fake clouds in the test suite19:13
*** openstack has joined #zuul19:17
*** bhavik1 has joined #zuul19:24
openstackgerritMerged openstack-infra/nodepool: Validate configs when used by tests  https://review.openstack.org/41100219:26
*** bhavik1 has quit IRC19:29
*** bhavik1 has joined #zuul20:01
*** bhavik1 has quit IRC20:01
Shrewspabelanger: clarkb: about that failure, i think it happens because the copy of the config values an upload thread may have is no longer valid by the time it has already checked for images that need uploading, and the time it actually begins uploading20:28
Shrewsso the config get replaced in the middle of the process, another thread starts the delete process, deletes the upload, the other thread now sees it needs to upload20:29
Shrewsi think i know how to fix it20:30
*** Cibo_ has joined #zuul20:39
*** Cibo_ has quit IRC20:43
Shrewshrm, it's actually going to require a bit more thought  :(20:47
mordredShrews: do we need to add some MVCC?20:57
clarkbthere are functions to ensure a steady state20:58
clarkbcan you reach steady state, then update config showing it should be deleted and at that point there shouldn't be races?20:59
Shrewsmordred: what we NEED are shared read locks in kazoo (ya know that thing i wish we had way back when?)21:03
*** jamielennox|away is now known as jamielennox21:04
Shrewsthat way we could grab a shared lock on the build number before getting the upload lock, and no one could then delete the build number unless there were no uploads21:04
Shrewsour hierarchy of locks is kind of needing that21:05
Shrewsbut, i guess i have to find a work around21:06
clarkboh so this is an actual run time bug?21:08
clarkbwhere we could upload an unwanted image?21:08
Shrewsclarkb: yes. but it would be pretty difficult to make happen in production21:09
Shrewsbut not impossible21:10
Shrewsthe best solution i've come up with so far only decreases the likelyhood of it happening, which doesn't make me happy21:11
clarkbcouldn't you check if upload is locked and state set to deleting then don't upload?21:12
clarkband maybe vice versa don't delete if uploading?21:12
Shrewsthe "don't delete if uploading" is my other idea, but requires getting all upload locks for each provider. that's doable, and likely the only recourse we have without read locks21:13
Shrewswill require adding non-context-manager style locking, too21:14
clarkbactually if node is locked you shouldn't do any state updates right?21:15
clarkbwhich means that we should already be handling that for the don't delete when uploading case as upload should hold lock21:15
clarkband in the other direction we shouldn't try to upload if locked?21:15
*** Cibo_ has joined #zuul21:15
Shrewsin the other direction, we'd have to actually get the build number lock (we can't just check it as that is still racey). but that prevents other uploads from occuring simultaneously21:16
clarkbah ok so uploads actually want a semaphore on that node21:17
Shrewsuploads want a shared read lock on that node21:17
clarkbI think a semaphore could do it too21:18
clarkbbut I guess the end result is likely the same implementing primitives that kazoo should have btu doesn't?21:18
jeblairShrews: i think your statement at 20:28 utc is concise, and i am not sure it can be solved with locks -- if someone deletes an upload, what lock is there to prevent someone else with an outdated config from thinking it needs to be there?21:19
clarkbwe could checkpoint configs21:20
Shrewsthe uploaders would check the build state (should be DELETING) once acquiring the shared read lock on that build number21:20
clarkb(the tricky thing with that is determining which one is newer/more correct)21:20
Shrewsif it's DELETING, we don't upload21:20
Shrewsif and uploader has the shared lock, we can't delete until it is released since the deleter would want a write lock21:21
jeblairShrews: right, i think that handles this very specific case (deleting a build), but doesn't handle the general case of configuration disagreement21:21
Shrewswell, yah, the entire config disagreement is not solved by that since it could leak to other areas besides this21:22
jeblairi don't know the best approach at the moment -- accept config disagreement and try to button down all the places it could be a problem, or try to prevent config disagreement21:22
jeblairjust throwing it out there21:22
Shrewsi don't know how to solve the config disagreement thing21:23
Shrewsother than maybe a config singleton?21:23
mordredonly have one node be the node that can read config from disk at a time21:23
mordredyah21:23
jeblairand stick it in zk21:23
mordredthen have the config reading node distribute the config to the other ones21:23
mordredyup21:23
*** abregman has joined #zuul21:23
mordredthis is largely how heartbeat v2 worked back in the day ... although it took it one step further and disregarded the config on disk after a bootstrap21:24
mordredwhich I did not like21:24
jeblairand then, i guess, wait for all the workers to get quorum on the new config before any starts using it?21:24
mordredmaybe each worker, before it starts a new operatoin on an image, checks to see if there is a new config version?21:25
mordredit's possible that's still racey though21:26
Shrewsmordred: i would think so, but, dunnno21:26
jeblairShrews: can you use the existing build number write lock to perform their deleting check?21:27
jeblairclarification:21:27
jeblairShrews: can you use the existing build number write lock to have the uploaders perform their check of whether the build is being deleted?21:27
Shrewsjeblair: yes, but we cannot hold it since that prevents other uploads. so it would be a quick check, then upload. which is still a race, but reduces the code in between21:28
Shrewsmy statement at 21:11:25 utc21:28
jeblairmordred: yeah, i think that narrows the race considerably but it's still there (and we could actually just do that today with existing distributed on-disk config -- the complexity there comes when we find something actually has changed right in the middle of our processing loop - maybe processing something that doesn't exist anymore)21:29
mordredyah21:30
jeblairShrews: grab build lock; check build state; create upload record; lock upload; release build lock?21:30
Shrewsjeblair: "lock upload" is locked a long time while the upload is done21:31
jeblairShrews: grab build lock; check build state; lock upload; create upload record; release build lock?21:31
jeblair(sorry i think i got that order wrong)21:31
jeblairShrews: ah, is that then the 'non-context-manager lock' approach?21:31
jeblairgrab build lock; check build state; lock upload; create upload record; release build lock; upload; release upload lock21:32
jeblair(that requires interleaved locks, so can't happen entirely nested)21:32
Shrewsjeblair: no. that approach is used by the deleting thread. grab build lock; for each provider, lock upload; delete from disk and zk; release upload locks21:33
jeblairShrews: right, i was asking about the upload thread though21:34
Shrewsupload thread wouldn't have to change21:34
jeblairoh, i thought that was the problematic one21:35
Shrewsthe deleter just wouldn't delete until it could lock all recorded providers for the build21:35
Shrewsbut even that might be racey21:36
jeblairShrews: i'm imagining that it finishes deleting all the uploads, then an uploadworker comes along and uploads one21:36
Shrewsit does give them more time to hopefully recheck the config, but yeah.21:37
jheskethMorning21:38
Shrewsi still think shared read locks are the best solution for this issue, but we still have the base problem of the config disagreement21:38
jeblairShrews: one quick thing that could help reduce the race: after we build or upload one thing, exit the loop and reload the config21:39
jeblairShrews: since both of those take a long time, it could help us run with generally more recent configs21:39
Shrewsjeblair: it already does that, so i don't follow21:40
jeblairjhesketh: good morning -- i wanted to make sure you saw https://review.openstack.org/41142621:40
jeblairShrews: oh?  i thought we went through the whole configuration before reloading21:41
jeblairShrews: for example, _checkForProviderUploads iterates over all providers and all images.  i'm suggesting having it exit as soon as it performs one upload21:42
Shrewsjeblair: oh, i see21:42
jheskethjeblair: ah yes, I haven't reviewed it yet but I saw it pop up21:43
Shrewsi mean, it's still just reducing the chances of wrong things happening, but still not a bad idea21:43
jheskethI incorrectly reviewed one of the mentioned changes so I wanted to make sure I had time to understand it properly21:43
jeblairShrews: yeah, if we do it, we can add a note saying why, and that we can remove it if we fix underlying problems21:44
jheskethSorry for not getting to it yesterday, I had an internet outage21:44
jeblairShrews: (but it's a couple lines of code, should reduce risk without worrying about behavior change)21:44
Shrewsjeblair: so, the two changes I see to reduce the impact are: 1) add check if build is DELETING just before uploading, 2) reload config after each provider loop21:45
Shrewswon't fix it, but should help a bit21:45
jeblairShrews: ya21:45
*** saneax-_-|AFK is now known as saneax21:51
jeblairit's zuul meeting time22:02
fungizuul meeting time!22:02
fungiindeed22:02
*** abregman has quit IRC22:05
Shrewspabelanger: hey!22:16
Shrewspabelanger: did cpu load drop on the builders?22:16
pabelangerShrews: yes!22:17
Shrewsneat22:18
Shrewsdo we have numbers?22:19
pabelangerShrews: http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=1&leaf_id=547 shows the history22:21
openstackgerritMerged openstack-infra/nodepool: Accept user-home in config validator  https://review.openstack.org/40451923:25

Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!