*** saneax-_-|AFK is now known as saneax | 01:18 | |
*** saneax is now known as saneax-_-|AFK | 02:45 | |
*** saneax-_-|AFK is now known as saneax | 02:58 | |
*** saneax is now known as saneax-_-|AFK | 03:09 | |
*** bhavik has joined #zuul | 04:49 | |
*** bhavik has quit IRC | 04:58 | |
*** saneax-_-|AFK is now known as saneax | 05:35 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool: Ignore .idea directory https://review.openstack.org/412296 | 06:12 |
---|---|---|
*** abregman has joined #zuul | 06:12 | |
*** abregman has quit IRC | 06:31 | |
*** abregman has joined #zuul | 06:36 | |
openstackgerrit | Cao Xuan Hoang proposed openstack-infra/nodepool: Clean imports in code https://review.openstack.org/404059 | 06:38 |
*** hashar has joined #zuul | 09:12 | |
*** bhavik1 has joined #zuul | 10:31 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/zuul: Copy queue_name https://review.openstack.org/412398 | 10:49 |
*** bhavik1 has quit IRC | 11:07 | |
openstackgerrit | Alexander Evseev proposed openstack-infra/nodepool: Use configured floating IP pool https://review.openstack.org/412419 | 11:55 |
*** yolanda has quit IRC | 11:59 | |
*** yolanda has joined #zuul | 12:00 | |
*** bhavik1 has joined #zuul | 12:12 | |
*** bhavik1 has quit IRC | 12:15 | |
*** jamielennox is now known as jamielennox|away | 14:56 | |
*** saneax is now known as saneax-_-|AFK | 15:43 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Change nodepoold ZK config change logic https://review.openstack.org/412507 | 15:54 |
Shrews | pabelanger: 412507 ^^^ based on your comments last week. I didn't see a reason for lines 1044-1045, so I removed those. Let me know if you think that's incorrect. | 15:55 |
Shrews | pabelanger: because if the zk servers are the same, setting them equal to each other seems unnecessary | 15:56 |
pabelanger | Shrews: great, +2 | 16:16 |
pabelanger | Shrews: jeblair: https://review.openstack.org/#/c/412124/ is an optimization for dib-image-list | 16:17 |
pabelanger | otherwise, you just se | 16:18 |
pabelanger | see* | 16:18 |
pabelanger | | centos-7-0000000136 | centos-7 | nb01 | | building | 00:00:49:26 | | 16:18 |
*** abregman has quit IRC | 16:35 | |
*** cinerama has joined #zuul | 16:43 | |
*** cinerama has quit IRC | 16:44 | |
*** cinerama has joined #zuul | 16:44 | |
jeblair | jamielennox|away: can you take a look at https://review.openstack.org/408849 when you get a chance? | 17:14 |
*** hashar has quit IRC | 17:19 | |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Expose the ability to build all infra diskimages https://review.openstack.org/411929 | 17:49 |
openstackgerrit | Merged openstack-infra/nodepool: Rename ubuntu-dib to ubuntu-trusty https://review.openstack.org/411920 | 18:10 |
openstackgerrit | Merged openstack-infra/nodepool: Expose the ability to build all infra diskimages https://review.openstack.org/411929 | 18:25 |
jeblair | clarkb: i know we talked about this on friday -- did you want to review it? https://review.openstack.org/411999 | 18:27 |
clarkb | jeblair: yup will do so now | 18:28 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool: Fix some doc typos https://review.openstack.org/412571 | 18:30 |
openstackgerrit | Merged openstack-infra/nodepool: Update documentation on diskimages/images https://review.openstack.org/411998 | 18:32 |
openstackgerrit | Merged openstack-infra/zuul: Remove TestScheduler.test_nonexistent_job https://review.openstack.org/410571 | 18:36 |
openstackgerrit | Merged openstack-infra/zuul: Remove now-unused ZuulTestCase.resetGearmanServer https://review.openstack.org/410935 | 18:37 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Allow nodepool-builder to only build diskimages https://review.openstack.org/412160 | 18:38 |
jeblair | mordred: can you take a look at the comment on 411512? maybe there's some other way to accomplish what you want? | 18:40 |
openstackgerrit | Merged openstack-infra/nodepool: Change nodepoold ZK config change logic https://review.openstack.org/412507 | 18:42 |
clarkb | jeblair: done. +2'd but didn't approve as that conflicts with a few things and wasn't sure if you want that in first or later | 18:45 |
clarkb | Shrews: pabelanger is http://logs.openstack.org/02/411002/1/gate/gate-nodepool-python27-ubuntu-xenial/1434cc9/console.html.gz a known issue? seems unrelated to my config change so gonna recheck but thought I would point it out | 18:46 |
jeblair | clarkb: yeah, i'd like to press forward on that; i'll +3 | 18:48 |
pabelanger | clarkb: yes, I believe it was hitting: https://review.openstack.org/#/c/408324/ reason for the recheck | 18:49 |
pabelanger | oh wait | 18:50 |
pabelanger | that is not correct | 18:50 |
pabelanger | http://logs.openstack.org/02/411002/1/gate/gate-nodepool-python27-ubuntu-xenial/1434cc9/console.html.gz#_2016-12-16_21_47_09_780060 | 18:51 |
pabelanger | I was looking at that on Friday | 18:51 |
pabelanger | but didn't figure out why | 18:51 |
pabelanger | clarkb: Shrews: ^ first time I seen that | 18:51 |
pabelanger | I was looking into a leaked lock | 18:52 |
openstackgerrit | Merged openstack-infra/nodepool: Remove script-dir https://review.openstack.org/411999 | 18:53 |
openstackgerrit | Merged openstack-infra/nodepool: Docs: add Zookeeper installation section https://review.openstack.org/412009 | 18:55 |
openstackgerrit | Merged openstack-infra/nodepool: Update operation docs https://review.openstack.org/412018 | 18:55 |
mordred | jeblair: ah - ok. we can just not remove the FakeProviderManager class but do the other things | 19:13 |
mordred | jeblair: the biggest problem there is that there were two different ways of getting a ProviderManager that used fake clouds in the test suite | 19:13 |
*** openstack has joined #zuul | 19:17 | |
*** bhavik1 has joined #zuul | 19:24 | |
openstackgerrit | Merged openstack-infra/nodepool: Validate configs when used by tests https://review.openstack.org/411002 | 19:26 |
*** bhavik1 has quit IRC | 19:29 | |
*** bhavik1 has joined #zuul | 20:01 | |
*** bhavik1 has quit IRC | 20:01 | |
Shrews | pabelanger: clarkb: about that failure, i think it happens because the copy of the config values an upload thread may have is no longer valid by the time it has already checked for images that need uploading, and the time it actually begins uploading | 20:28 |
Shrews | so the config get replaced in the middle of the process, another thread starts the delete process, deletes the upload, the other thread now sees it needs to upload | 20:29 |
Shrews | i think i know how to fix it | 20:30 |
*** Cibo_ has joined #zuul | 20:39 | |
*** Cibo_ has quit IRC | 20:43 | |
Shrews | hrm, it's actually going to require a bit more thought :( | 20:47 |
mordred | Shrews: do we need to add some MVCC? | 20:57 |
clarkb | there are functions to ensure a steady state | 20:58 |
clarkb | can you reach steady state, then update config showing it should be deleted and at that point there shouldn't be races? | 20:59 |
Shrews | mordred: what we NEED are shared read locks in kazoo (ya know that thing i wish we had way back when?) | 21:03 |
*** jamielennox|away is now known as jamielennox | 21:04 | |
Shrews | that way we could grab a shared lock on the build number before getting the upload lock, and no one could then delete the build number unless there were no uploads | 21:04 |
Shrews | our hierarchy of locks is kind of needing that | 21:05 |
Shrews | but, i guess i have to find a work around | 21:06 |
clarkb | oh so this is an actual run time bug? | 21:08 |
clarkb | where we could upload an unwanted image? | 21:08 |
Shrews | clarkb: yes. but it would be pretty difficult to make happen in production | 21:09 |
Shrews | but not impossible | 21:10 |
Shrews | the best solution i've come up with so far only decreases the likelyhood of it happening, which doesn't make me happy | 21:11 |
clarkb | couldn't you check if upload is locked and state set to deleting then don't upload? | 21:12 |
clarkb | and maybe vice versa don't delete if uploading? | 21:12 |
Shrews | the "don't delete if uploading" is my other idea, but requires getting all upload locks for each provider. that's doable, and likely the only recourse we have without read locks | 21:13 |
Shrews | will require adding non-context-manager style locking, too | 21:14 |
clarkb | actually if node is locked you shouldn't do any state updates right? | 21:15 |
clarkb | which means that we should already be handling that for the don't delete when uploading case as upload should hold lock | 21:15 |
clarkb | and in the other direction we shouldn't try to upload if locked? | 21:15 |
*** Cibo_ has joined #zuul | 21:15 | |
Shrews | in the other direction, we'd have to actually get the build number lock (we can't just check it as that is still racey). but that prevents other uploads from occuring simultaneously | 21:16 |
clarkb | ah ok so uploads actually want a semaphore on that node | 21:17 |
Shrews | uploads want a shared read lock on that node | 21:17 |
clarkb | I think a semaphore could do it too | 21:18 |
clarkb | but I guess the end result is likely the same implementing primitives that kazoo should have btu doesn't? | 21:18 |
jeblair | Shrews: i think your statement at 20:28 utc is concise, and i am not sure it can be solved with locks -- if someone deletes an upload, what lock is there to prevent someone else with an outdated config from thinking it needs to be there? | 21:19 |
clarkb | we could checkpoint configs | 21:20 |
Shrews | the uploaders would check the build state (should be DELETING) once acquiring the shared read lock on that build number | 21:20 |
clarkb | (the tricky thing with that is determining which one is newer/more correct) | 21:20 |
Shrews | if it's DELETING, we don't upload | 21:20 |
Shrews | if and uploader has the shared lock, we can't delete until it is released since the deleter would want a write lock | 21:21 |
jeblair | Shrews: right, i think that handles this very specific case (deleting a build), but doesn't handle the general case of configuration disagreement | 21:21 |
Shrews | well, yah, the entire config disagreement is not solved by that since it could leak to other areas besides this | 21:22 |
jeblair | i don't know the best approach at the moment -- accept config disagreement and try to button down all the places it could be a problem, or try to prevent config disagreement | 21:22 |
jeblair | just throwing it out there | 21:22 |
Shrews | i don't know how to solve the config disagreement thing | 21:23 |
Shrews | other than maybe a config singleton? | 21:23 |
mordred | only have one node be the node that can read config from disk at a time | 21:23 |
mordred | yah | 21:23 |
jeblair | and stick it in zk | 21:23 |
mordred | then have the config reading node distribute the config to the other ones | 21:23 |
mordred | yup | 21:23 |
*** abregman has joined #zuul | 21:23 | |
mordred | this is largely how heartbeat v2 worked back in the day ... although it took it one step further and disregarded the config on disk after a bootstrap | 21:24 |
mordred | which I did not like | 21:24 |
jeblair | and then, i guess, wait for all the workers to get quorum on the new config before any starts using it? | 21:24 |
mordred | maybe each worker, before it starts a new operatoin on an image, checks to see if there is a new config version? | 21:25 |
mordred | it's possible that's still racey though | 21:26 |
Shrews | mordred: i would think so, but, dunnno | 21:26 |
jeblair | Shrews: can you use the existing build number write lock to perform their deleting check? | 21:27 |
jeblair | clarification: | 21:27 |
jeblair | Shrews: can you use the existing build number write lock to have the uploaders perform their check of whether the build is being deleted? | 21:27 |
Shrews | jeblair: yes, but we cannot hold it since that prevents other uploads. so it would be a quick check, then upload. which is still a race, but reduces the code in between | 21:28 |
Shrews | my statement at 21:11:25 utc | 21:28 |
jeblair | mordred: yeah, i think that narrows the race considerably but it's still there (and we could actually just do that today with existing distributed on-disk config -- the complexity there comes when we find something actually has changed right in the middle of our processing loop - maybe processing something that doesn't exist anymore) | 21:29 |
mordred | yah | 21:30 |
jeblair | Shrews: grab build lock; check build state; create upload record; lock upload; release build lock? | 21:30 |
Shrews | jeblair: "lock upload" is locked a long time while the upload is done | 21:31 |
jeblair | Shrews: grab build lock; check build state; lock upload; create upload record; release build lock? | 21:31 |
jeblair | (sorry i think i got that order wrong) | 21:31 |
jeblair | Shrews: ah, is that then the 'non-context-manager lock' approach? | 21:31 |
jeblair | grab build lock; check build state; lock upload; create upload record; release build lock; upload; release upload lock | 21:32 |
jeblair | (that requires interleaved locks, so can't happen entirely nested) | 21:32 |
Shrews | jeblair: no. that approach is used by the deleting thread. grab build lock; for each provider, lock upload; delete from disk and zk; release upload locks | 21:33 |
jeblair | Shrews: right, i was asking about the upload thread though | 21:34 |
Shrews | upload thread wouldn't have to change | 21:34 |
jeblair | oh, i thought that was the problematic one | 21:35 |
Shrews | the deleter just wouldn't delete until it could lock all recorded providers for the build | 21:35 |
Shrews | but even that might be racey | 21:36 |
jeblair | Shrews: i'm imagining that it finishes deleting all the uploads, then an uploadworker comes along and uploads one | 21:36 |
Shrews | it does give them more time to hopefully recheck the config, but yeah. | 21:37 |
jhesketh | Morning | 21:38 |
Shrews | i still think shared read locks are the best solution for this issue, but we still have the base problem of the config disagreement | 21:38 |
jeblair | Shrews: one quick thing that could help reduce the race: after we build or upload one thing, exit the loop and reload the config | 21:39 |
jeblair | Shrews: since both of those take a long time, it could help us run with generally more recent configs | 21:39 |
Shrews | jeblair: it already does that, so i don't follow | 21:40 |
jeblair | jhesketh: good morning -- i wanted to make sure you saw https://review.openstack.org/411426 | 21:40 |
jeblair | Shrews: oh? i thought we went through the whole configuration before reloading | 21:41 |
jeblair | Shrews: for example, _checkForProviderUploads iterates over all providers and all images. i'm suggesting having it exit as soon as it performs one upload | 21:42 |
Shrews | jeblair: oh, i see | 21:42 |
jhesketh | jeblair: ah yes, I haven't reviewed it yet but I saw it pop up | 21:43 |
Shrews | i mean, it's still just reducing the chances of wrong things happening, but still not a bad idea | 21:43 |
jhesketh | I incorrectly reviewed one of the mentioned changes so I wanted to make sure I had time to understand it properly | 21:43 |
jeblair | Shrews: yeah, if we do it, we can add a note saying why, and that we can remove it if we fix underlying problems | 21:44 |
jhesketh | Sorry for not getting to it yesterday, I had an internet outage | 21:44 |
jeblair | Shrews: (but it's a couple lines of code, should reduce risk without worrying about behavior change) | 21:44 |
Shrews | jeblair: so, the two changes I see to reduce the impact are: 1) add check if build is DELETING just before uploading, 2) reload config after each provider loop | 21:45 |
Shrews | won't fix it, but should help a bit | 21:45 |
jeblair | Shrews: ya | 21:45 |
*** saneax-_-|AFK is now known as saneax | 21:51 | |
jeblair | it's zuul meeting time | 22:02 |
fungi | zuul meeting time! | 22:02 |
fungi | indeed | 22:02 |
*** abregman has quit IRC | 22:05 | |
Shrews | pabelanger: hey! | 22:16 |
Shrews | pabelanger: did cpu load drop on the builders? | 22:16 |
pabelanger | Shrews: yes! | 22:17 |
Shrews | neat | 22:18 |
Shrews | do we have numbers? | 22:19 |
pabelanger | Shrews: http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=1&leaf_id=547 shows the history | 22:21 |
openstackgerrit | Merged openstack-infra/nodepool: Accept user-home in config validator https://review.openstack.org/404519 | 23:25 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!