openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Add entry for ref-replication-scheduled event https://review.openstack.org/446251 | 00:47 |
openstackgerrit | K Jonathan Harker proposed openstack-infra/zuul feature/zuulv3: WIP: Reenable test_build_configuration_conflict https://review.openstack.org/446275 | 02:09 |
*** jasondotstar has quit IRC | 03:25 | |
*** jasondotstar has joined #zuul | 03:26 | |
*** leifmadsen has quit IRC | 03:26 | |
*** leifmadsen has joined #zuul | 03:27 | |
jamielennox | is it possible/are there any examples of shipping custom ansible modules in a common-config? what's the path? | 03:34 |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Re-add the ability to set username on zuul-executor https://review.openstack.org/446308 | 05:58 |
*** isaacb has joined #zuul | 06:26 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool master: Add default logging file argument https://review.openstack.org/446324 | 07:02 |
*** bhavik1 has joined #zuul | 07:03 | |
*** bhavik1 has quit IRC | 07:11 | |
*** isaacb has quit IRC | 07:20 | |
*** isaacb has joined #zuul | 07:23 | |
*** hashar has joined #zuul | 07:43 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul feature/zuulv3: Change mutex to counting semaphore https://review.openstack.org/442563 | 08:36 |
*** hashar has quit IRC | 08:42 | |
*** hashar has joined #zuul | 08:53 | |
*** hashar has quit IRC | 08:55 | |
*** hashar has joined #zuul | 08:59 | |
*** Cibo_ has quit IRC | 08:59 | |
*** Cibo_ has joined #zuul | 09:04 | |
*** Cibo_ has quit IRC | 09:08 | |
*** isaacb has quit IRC | 09:16 | |
*** isaacb has joined #zuul | 09:17 | |
pabelanger | morning | 13:07 |
mordred | morning pabelanger | 13:09 |
*** hashar is now known as hasharNap | 13:30 | |
*** isaacb has quit IRC | 13:58 | |
*** isaacb has joined #zuul | 14:11 | |
pabelanger | jeblair: do you mind looking at 446095 and 446110 this morning? Shrews patches to bring nl01.o.o back online | 14:14 |
*** Cibo_ has joined #zuul | 14:39 | |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Reset lost requests https://review.openstack.org/446095 | 14:40 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Deallocate ready nodes with no requests https://review.openstack.org/446110 | 14:42 |
*** hasharNap is now known as hashar | 14:43 | |
Shrews | pabelanger: let me know when you're ready to restart nl01. want to watch things | 14:47 |
jeblair | pabelanger, Shrews: done :) | 14:47 |
jeblair | jamielennox: i don't think we've experimented with that yet | 14:47 |
pabelanger | jeblair: danke | 14:47 |
pabelanger | Shrews: Sure, I'm ready once we update the git repo on disk, so 15mins or so | 14:48 |
Shrews | pabelanger: there are currently 5 nodes (the max) allocated to non-existing requests. They should get deleted at startup. The first 4 requests are PENDING and they should get reset to REQUESTED. | 14:51 |
*** isaacb has quit IRC | 14:51 | |
* Shrews crosses fingers | 14:51 | |
*** isaacb has joined #zuul | 14:51 | |
pabelanger | ack | 14:53 |
mordred | jeblair, jamielennox: one can put custom modules in roles, and we do deal with roles - so I'd hope we can just get it for free by putting modules in roles | 14:54
mordred | Shrews: I wish you luck | 14:55 |
jeblair | mordred, jamielennox: agreed -- and i think we've implemented enough for it to work at this point if someone wants to try it out | 14:57 |
jeblair | mordred, jamielennox: however, we do still have an outstanding bug (there's a story) where we may not put roles in the correct security context, so be aware of that. | 14:58 |
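Ansible picks up custom modules from a `library/` directory inside a role, which is why shipping them in a common-config repo's roles should work. A minimal sketch of such a module, assuming an invented role and module name (`example` / `zuul_hello`); the path layout is standard Ansible, everything else here is illustrative:

```python
#!/usr/bin/python
# Sketch of a custom module shipped inside a role, e.g.
# common-config/roles/example/library/zuul_hello.py (paths are illustrative).
from ansible.module_utils.basic import AnsibleModule


def main():
    module = AnsibleModule(
        argument_spec=dict(
            name=dict(type='str', required=True),
        )
    )
    # Return a value to the playbook without changing anything on the host.
    module.exit_json(changed=False, greeting='hello %s' % module.params['name'])


if __name__ == '__main__':
    main()
```

A playbook that applies the role can then use `zuul_hello:` like any other task.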
*** Cibo_ has quit IRC | 15:00 | |
Shrews | jamielennox: hrm, this json error on your nodepool change makes no sense to me: http://logs.openstack.org/75/445675/3/check/nodepool-coverage-ubuntu-xenial/45b42f1/console.html#_2017-03-15_20_22_33_272050 | 15:08 |
Shrews | jamielennox: maybe some sort of race condition we (or kazoo) do not account for | 15:09 |
Shrews | Node znodes should always have data | 15:09 |
*** isaacb has quit IRC | 15:10 | |
pabelanger | Hmm, I might need some help. Going a little crazy as to why e1c26d56b541f (^446110) is not the tip of feature/zuulv3: http://git.openstack.org/cgit/openstack-infra/nodepool/log/?h=feature/zuulv3 | 15:14 |
pabelanger | but is locally for me | 15:14 |
pabelanger | because, nl01.o.o /opt/nodepool is also out of sync with local | 15:15 |
Shrews | pabelanger: hrm, a fresh 'git pull' for me does not get those two latest merges | 15:16 |
pabelanger | I think something might be up with our git farm | 15:17 |
Shrews | yeah | 15:17 |
pabelanger | moving to #openstack-infra | 15:17 |
jeblair | okay, that's recovering now | 16:04 |
pabelanger | confirmed, nl01.o.o updating now | 16:05 |
pabelanger | Shrews: let me know when you are ready | 16:06 |
Shrews | pabelanger: ready. do you need to 'pip install' again since that didn't happen last time? | 16:07 |
pabelanger | Shrews: no, all fixed. | 16:07 |
Shrews | k | 16:07 |
pabelanger | okay, starting nodepool-launcher | 16:07 |
pabelanger | cool, nodes were deleted | 16:08 |
Shrews | yep | 16:08 |
Shrews | pending requests being processed | 16:09 |
pabelanger | Oh | 16:10 |
pabelanger | heh | 16:10 |
pabelanger | we'll need to restart zuul to pick up new executor changes too | 16:10 |
pabelanger | we'll let the queue clear out first | 16:10 |
pabelanger | Shrews: I do see a locked ready node however | 16:12 |
Shrews | pabelanger: that can be normal | 16:12 |
Shrews | they get locked satisfying requests | 16:13 |
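The node locks here are ZooKeeper locks taken through kazoo, so a ready-but-locked node simply means another client currently holds the lock znode. A sketch of the idea using the kazoo Lock recipe; the path and identifier are made up for illustration:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# A ready node is locked while it is being used to satisfy a request.
lock = zk.Lock('/nodepool/nodes/0000000584/lock', identifier='launcher-nl01')
if lock.acquire(blocking=False):
    try:
        print('node is free; allocate it to a request')
    finally:
        lock.release()
else:
    print('node is locked by another client (normal while satisfying a request)')
```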
* Shrews tests the new 'request-list' command | 16:14 | |
pabelanger | okay, I'm going to stop zuul-launcher and update to zuul-executor. Jobs are getting requeued until they hit the RETRY_LIMIT | 16:15 |
Shrews | pabelanger: umm, i'm confused why some nodes have AZ 'nova' and others are blank? if there are AZ's, we should list them in nodepool.yaml | 16:15 |
Shrews | oh, they're building. nm | 16:15 |
Shrews | jeblair: we should probably add node.az to the zuul model definition. seems we lose that value when you set the state to IN_USE | 16:16 |
clarkb | fwiw I don't think we have any clouds where az currently matters | 16:17 |
clarkb | (but others may and future might) | 16:17 |
jeblair | Shrews: yeah, though also, we should probably have the model classes transparently keep any unknown values, so we don't stomp on them. | 16:17 |
Shrews | jeblair++ | 16:17 |
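One way model classes could transparently keep unknown values, sketched here rather than taken from Zuul's actual model code: carry anything unrecognized in a side dict and write it back out on serialization so a writer never stomps on fields owned by another component.

```python
class Node:
    """Sketch of a ZooKeeper-backed model object that preserves unknown keys."""

    KNOWN = ('state', 'az', 'allocated_to')

    def __init__(self):
        self.state = None
        self.az = None
        self.allocated_to = None
        self._extra = {}  # anything read from ZK that we do not model explicitly

    def fromDict(self, data):
        for k, v in data.items():
            if k in self.KNOWN:
                setattr(self, k, v)
            else:
                self._extra[k] = v

    def toDict(self):
        # Start from the unknown keys so writing the node back never drops
        # fields set by a newer or different component.
        data = dict(self._extra)
        for k in self.KNOWN:
            data[k] = getattr(self, k)
        return data
```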
Shrews | pabelanger: hrm, seems we have a lock now. :( | 16:19 |
Shrews | jeblair: many nodes stuck in IN_USE | 16:21 |
Shrews | and locked | 16:21 |
jeblair | who owns the lock? | 16:21 |
jeblair | looks like zuul | 16:22 |
pabelanger | okay, zuulv3-dev started again | 16:23 |
jeblair | wait what? | 16:23 |
Shrews | oh, well | 16:23 |
jeblair | why are we restarting things? | 16:23 |
pabelanger | to use new zuul-executor | 16:24 |
pabelanger | should I have not done that? | 16:24 |
jeblair | we were trying to debug some stuck locks | 16:24 |
jeblair | it's possible the executor restart made them stuck | 16:24 |
jeblair | i think we still have a bug with that | 16:24 |
jeblair | pabelanger: if you restarted the executor first, then the main daemon, that may have been the cause | 16:25 |
Shrews | nodepool-launcher is still hung up | 16:25 |
pabelanger | apologies, I missed the locks part | 16:25 |
jeblair | pabelanger: did you configure gerrit debug logging? | 16:25 |
jeblair | in zuul | 16:25 |
pabelanger | jeblair: no, I did rename the launcher-logging.conf file to executor-logging.conf | 16:26 |
pabelanger | checking it now | 16:26 |
Shrews | pabelanger: jeblair: can we shutdown both things now so I can have a stable ZK state to investigate current issues? | 16:26 |
pabelanger | I can stop executor again | 16:26 |
jeblair | go for it | 16:26 |
jeblair | i think i know the logging problem | 16:26 |
pabelanger | executor stopped | 16:27 |
*** Cibo_ has joined #zuul | 16:27 | |
Shrews | hrm, i think we may be hitting some znode versioning issues | 16:31 |
Shrews | 2017-03-16 16:22:31,198 DEBUG nodepool.NodeRequestHandler: Locked existing node 0000000584 for request 100-0000002088 | 16:31 |
Shrews | yet, when I looked at that in zk-shell, the 'allocated_to' was still set for a lost request | 16:32 |
Shrews | only after zuul-exe was stopped did the value of allocated_to change to 2088 | 16:32
Shrews | after n-l was stopped, i guess | 16:33 |
jeblair | pabelanger: did you only stop the executor? | 16:34 |
jeblair | pabelanger: i thought Shrews asked for both components to be stopped. zuul-server is the only part that talks to nodepool. | 16:34 |
pabelanger | jeblair: I also stopped / started zuul too | 16:34 |
pabelanger | zuul is still running, I can stop that | 16:35 |
jeblair | i see zuul-scheduler running right now | 16:35 |
jeblair | Shrews: you have not had a stable environment | 16:35 |
pabelanger | yes, I restarted it when I did executor, but have not stopped it | 16:35 |
Shrews | yeah, wanted both stopped. just so as not to fill the queue | 16:35 |
jeblair | pabelanger: why don't you go ahead and stop it | 16:36 |
pabelanger | scheduler stopped too | 16:36 |
Shrews | so, i need to read up on znode versioning a bit. i'm really confused here | 16:36 |
Shrews | because n-l did what it was supposed to do, but the data it was reading was not what was expected | 16:37 |
*** bhavik1 has joined #zuul | 16:37 | |
pabelanger | while I am looking at launcher-debug on nl01.o.o, does it make sense to expose our cleanup timer to nodepool.yaml (like we did before)? I think 5 seconds might be a little aggressive | 16:44
jeblair | pabelanger: i do not think it should be exposed as an option because it doesn't make sense for a user to tune it. we should hardcode it to a good value. | 16:45 |
Shrews | pabelanger: We want deleted nodes to be cleaned up quickly (thus the short time), but I was thinking it might be a good idea to have that as a separate thread from all of the other cleanup that can be less aggressive | 16:45 |
jeblair | ++ | 16:46 |
pabelanger | Ah, I see | 16:46
jeblair | though, another option would be to set watches on nodes so that we get immediate notification of when to clean them up. | 16:46 |
jeblair | then all of the periodic cleanup tasks can be less aggressive | 16:47 |
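The watch idea would look roughly like this with kazoo's DataWatch; the znode path, the JSON layout, and the callback body are all illustrative assumptions:

```python
from kazoo.client import KazooClient
from kazoo.recipe.watchers import DataWatch

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()


def on_node_change(data, stat):
    # Called once immediately and again on every update; data is None when
    # the znode is removed. Immediate notification means no tight poll loop.
    if data is not None and b'"state": "deleting"' in data:
        print('node marked for deletion, clean it up now')


# Watch a single node znode (path is illustrative).
DataWatch(zk, '/nodepool/nodes/0000000584', on_node_change)
```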
pabelanger | I think the current ServerList that happens every 5 seconds is from our _cleanupLeakedInstances function. Today, we do 60 seconds on nodepool.o.o | 16:47 |
jeblair | pabelanger: on master we get immediate notification | 16:48 |
Shrews | watches are another option | 16:48 |
pabelanger | not a big deal, was just trying to see what nl01.o.o was doing with all the ServerList calls | 16:48 |
Shrews | harlowja: Can you think of any reason why reads from ZK would not return the latest version of a znode? | 16:48 |
Shrews | pabelanger: i'd be fine with first separating out into a separate thread, then doing the watches later | 16:49 |
jeblair | that seems like a tractable plan | 16:50 |
pabelanger | would NodeCleanupWorker() launch the new thread or Nodepool()? | 16:51 |
jeblair | pabelanger, jhesketh: https://review.openstack.org/224336 invalidated our logging configuration. | 16:51 |
jeblair | pabelanger: is puppet running on zuulv3-dev? | 16:51 |
pabelanger | jeblair: no, I made the changes manually | 16:51 |
jeblair | pabelanger, jhesketh: ok, i'm going to start manually making a new logging config there, and we can copy it into puppet for master later | 16:52 |
pabelanger | wfm | 16:52 |
Shrews | pabelanger: i envisioned NodePool doing it (a new DeletedNodeWorker thread and rename NodeCleanupWorker to CleanupWorker) | 16:53 |
Shrews | pabelanger: if you want to take a stab at it, go for it! | 16:53 |
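A sketch of the split described above: two periodic workers sharing a base class, with the aggressive interval reserved for nodes already marked deleted and the expensive server-listing work on the slow loop. The class names follow the chat; the intervals and method bodies are assumptions.

```python
import threading
import time


class BaseCleanupWorker(threading.Thread):
    """Run one cleanup pass on a fixed interval until stopped."""

    def __init__(self, interval):
        super(BaseCleanupWorker, self).__init__()
        self._interval = interval
        self._running = True

    def run(self):
        while self._running:
            self._cleanup()
            time.sleep(self._interval)

    def stop(self):
        self._running = False


class DeletedNodeWorker(BaseCleanupWorker):
    """Aggressive loop: only tears down nodes already marked deleted."""

    def _cleanup(self):
        pass  # delete instances/znodes in the DELETING state


class CleanupWorker(BaseCleanupWorker):
    """Slow loop: leaked instances, stale request locks, etc."""

    def _cleanup(self):
        pass  # the server-listing work lives here, so it runs less often


deleted_worker = DeletedNodeWorker(interval=5)
slow_worker = CleanupWorker(interval=60)
```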
pabelanger | Shrews: sure, would be good to try more threading programming | 16:54 |
pabelanger | Shrews: would _cleanupNodeRequestLocks() and _cleanupNodes() go into DeletedNodeWorker? | 16:59 |
*** hashar has quit IRC | 17:00 | |
Shrews | pabelanger: yes, though I might consider moving the new check in _cleanupNodes() for the allocated_to stuff into the less aggressive thread | 17:02 |
Shrews | not sure that's necessary. haven't given too much thought to it yet | 17:03 |
pabelanger | Shrews: okay, let me propose this over a few patches, so I better understand things. | 17:03
Shrews | pabelanger: the leaked node stuff should go into the less aggressive thread, obviously. that's where the server listing stuff is coming from | 17:05 |
jeblair | pabelanger, Shrews: okay i have a new logging config staged for when we're ready to restart zuul | 17:05 |
Shrews | jeblair: i'm stumped on this versioning stuff | 17:05 |
harlowja | Shrews reads from how many nodes? | 17:05 |
Shrews | maybe my assumption is wrong | 17:06 |
Shrews | harlowja: umm, what? from "nodes"? | 17:06 |
harlowja | sorry, how many zookeeper servers | 17:06 |
Shrews | harlowja: 1 | 17:06 |
harlowja | oh | 17:06 |
harlowja | then, nope can't think of a reason | 17:07 |
harlowja | lol | 17:07 |
Shrews | *sigh* | 17:07 |
Shrews | ok, maybe this is just a bug somewhere then | 17:07 |
Shrews | in my code, that is | 17:07 |
harlowja | if > 1 i think u can get stale reads (due to how the commit/blah blah algorithm works) | 17:07 |
Shrews | obviously a bug :) | 17:07 |
* Shrews tries a new test case | 17:08 | |
* harlowja trying to find if there are any kazoo folks around anymore | 17:13 | |
harlowja | i'm starting to think i'm the last one left in #zookeeper | 17:13
harlowja | lol | 17:13 |
Shrews | ah ha. found at least one bug | 17:14 |
Shrews | this may be the main thing. i've been assuming zk-shell reads the latest version of things. maybe not? | 17:19 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Unpause when we grab a pre-ready node https://review.openstack.org/446642 | 17:20 |
harlowja | it should | 17:20 |
harlowja | didn't think zk-shell was caching anything | 17:20 |
Shrews | jeblair: pabelanger: ^^^ | 17:20 |
harlowja | Shrews https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkGuarantees | 17:20 |
harlowja | may just be useful to read | 17:20 |
harlowja | `Sometimes developers mistakenly assume one other guarantee that ZooKeeper does not in fact make. This is: | 17:20 |
harlowja | Simultaneously Consistent Cross-Client Views | 17:20 |
harlowja | ZooKeeper does not guarantee that at every instance in time, two different clients will have identical views of ZooKeeper data.` | 17:20 |
*** bhavik1 has quit IRC | 17:20 | |
harlowja | it will guarantee `Sequential Consistency` | 17:21 |
harlowja | and blah blah | 17:21 |
Shrews | harlowja: hrm, except our clients (threads in this case) share the same ZK connection, so.... | 17:22 |
harlowja | k | 17:23 |
Shrews | that is, indeed, interesting though. | 17:23 |
Shrews | we could see differences between zuul and nodepool b/c of that | 17:23 |
harlowja | ya, there might be a way to tweak this to make it happen | 17:24 |
jeblair | Shrews: +2 | 17:24 |
Shrews | "The clients view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds)." | 17:25 |
Shrews | tens of seconds? | 17:25 |
Shrews | wonder why so high | 17:26 |
harlowja | likely the time the server has before it kicks itself out of the zookeeper cluster, lol | 17:26 |
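For reference, ZooKeeper's escape hatch for the cross-client consistency gap is sync(), which kazoo exposes; with a single ZK server, as here, stale reads should not be the explanation, so this is context only. A sketch (the path is illustrative):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

path = '/nodepool/nodes/0000000584'

# sync() asks the connected server to catch up with the leader before the
# next read, closing the "sequential but not simultaneous" consistency gap.
zk.sync(path)
data, stat = zk.get(path)
print(stat.version, data)
```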
jesusaur | jeblair: yesterday I started poking test_build_configuration_conflict() in test_scheduler.py, and I discovered that merge conflicts are getting reported as success. Should a merge failure call sendWorkFail instead of sendWorkComplete? or is there another way the merger should notify the server about the merge failure? | 17:26 |
harlowja | Shrews https://github.com/coreos/etcd/issues/741 is an interesting read also | 17:28
jeblair | jesusaur: i think they should report sendworkcomplete but with a null commit, or maybe a boolean flag? | 17:29 |
harlowja | pretty sure (though not 100%) that etcd does the same, in that reads may return slightly different versions of the data, depending on the server the client is connected to | 17:29
jesusaur | jeblair: ok, so then something should check for the null commit and/or merged=False afterwards? | 17:31 |
jeblair | jesusaur: hrm, i *thought* we had re-added merge checking on all changes, but it looks like that hasn't happened yet. i think someone was working on it but i guess they dropped off. | 17:34 |
jeblair | jesusaur: the missing code is in I14e3d96cadec6e4b4cca66137071e8ed67f161a1. things related to prepareRef | 17:36 |
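The reporting style jeblair suggests would look something like this on the merger side: always complete the gearman job, and let the payload carry the outcome. The field names are illustrative, not Zuul's actual result schema.

```python
import json


def report_merge_result(job, commit):
    # Always complete the gearman job; signal a merge conflict with a null
    # commit plus an explicit boolean rather than with a work-fail packet.
    result = {
        'merged': commit is not None,
        'commit': commit,
    }
    job.sendWorkComplete(json.dumps(result))
```

Something on the scheduler side would then check `merged`/`commit` and report the change as failed instead of treating the job as successful.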
SpamapS | Shrews: it's ok for Zuul and Nodepool to see different views, as long as they both eventually see the same view. | 17:37 |
harlowja | also https://aphyr.com/posts/313-strong-consistency-models | 17:37 |
jeblair | the main thing is whether locking works :) | 17:37 |
harlowja | distributed systems are hard | 17:38
harlowja | lol | 17:38 |
jeblair | jesusaur: but prepareLayout does very similar work now; so i think the result needs to be a combination of the two. | 17:39 |
jeblair | jesusaur, SpamapS: sorry my evaluation of that turned out to be wrong. | 17:40 |
jesusaur | jeblair: ah, thanks for the breadcrumbs, I'll dig through this a bit more | 17:43 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Rename NodeCleanupWorker to DeletedNodeWorker https://review.openstack.org/446649 | 17:43 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Create BaseNodeWorker class https://review.openstack.org/446650 | 17:43 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Split DeleteNodeWorker into two threads https://review.openstack.org/446651 | 17:43 |
SpamapS | jeblair: can you disambiguate "that" in your message for me? | 17:46 |
Shrews | SpamapS: yep, "eventually" consistent is what i'd expect | 17:49 |
jeblair | SpamapS: the area that jesusaur is working on -- i thought we had re-added the merge conflict detection, but i think that work got dropped. | 17:51 |
rbergeron | shrews, jeblair, spamaps, pabelanger, anyone else full of comments: I've got the draft of the mail update here -- https://etherpad.openstack.org/p/zuul-v3-first-status-update (at the top, the previous version is at the bottom) | 17:51 |
rbergeron | (yes, i know i said i would do this on my own but just want to double check if anyone has anything specific they want to highlight) | 17:51 |
Shrews | rbergeron: i'm full of something... | 17:51 |
rbergeron | shrews: me too. cheetos puffs, to be specific | 17:52 |
Shrews | rbergeron: i added words (at the same time as someone else) | 17:53 |
jeblair | o/ | 17:53 |
SpamapS | jeblair: OH! ACK | 17:53 |
jeblair | SpamapS: so that probably invalidates some of my evals in that etherpad | 17:54 |
SpamapS | jeblair: well we can always fix that. :) | 17:54 |
jeblair | rbergeron: lgtm thanks! | 17:55 |
jesusaur | SpamapS: and with any luck, I will! | 17:56 |
jeblair | rbergeron, pabelanger: we could mention the launcher rename | 17:56 |
jeblair | ah, pabelanger is on that :) | 17:56 |
Shrews | jeblair: perhaps we should reference your email for that syntax change? | 17:56 |
*** Cibo_ has quit IRC | 17:57 | |
pabelanger | ++ | 17:57 |
jeblair | Shrews: good idea: http://lists.openstack.org/pipermail/openstack-infra/2017-January/005018.html | 17:57 |
jeblair | rbergeron: ^ | 17:57 |
Shrews | added. | 17:58 |
pabelanger | rbergeron: updates left, looks good | 17:59 |
rbergeron | pabelanger, shrews, jeblair: merci beaucoup; looks like y'all added all those things above into the mail directly too (thanks!). will ship it out :) | 18:15 |
jeblair | Shrews: nodepool fix merged | 18:23 |
Shrews | jeblair: yup. pabelanger, ready for restart when nl01 updates | 18:24 |
Shrews | somebody restart it? | 18:25 |
Shrews | b/c it's now running... but i don't think the code updated | 18:25 |
Shrews | oh, it's been running | 18:26 |
Shrews | O.o | 18:26 |
pabelanger | I haven't done anything yet | 18:26 |
Shrews | it was running. i stopped it | 18:26 |
Shrews | not sure why | 18:27 |
pabelanger | let me check if puppet has run | 18:27 |
Shrews | /opt/nodepool hasn't updated | 18:27 |
pabelanger | confirmed | 18:27 |
pabelanger | I can kick.sh it | 18:28 |
Shrews | pabelanger: can i manually 'git pull', or does that muck things up | 18:28 |
pabelanger | Shrews: actually | 18:28 |
pabelanger | Shrews: /opt/system-config/production/tools/kick.sh nl01.openstack.org | 18:28 |
pabelanger | from puppetmaster as root | 18:28 |
Shrews | i'm not touching puppet things. you can do it, or we can wait :) | 18:28 |
pabelanger | k, running | 18:28 |
pabelanger | Shrews: okay, restart when you are ready | 18:30 |
pabelanger | we also need to restart zuulv3-dev.o.o | 18:31 |
Shrews | pabelanger: go for it. the request pool drained | 18:31 |
pabelanger | started | 18:32 |
pabelanger | k, we have no changes in queue | 18:33 |
pabelanger | let me add one | 18:33 |
pabelanger | just rechecked 441332 | 18:34 |
pabelanger | Oh, heh | 18:34 |
pabelanger | configure_mirror.sh is failing, since we removed /etc/nodepool/ folder | 18:34 |
pabelanger | but, I do see an issue | 18:34 |
jeblair | yeah, i reckon it's time to reimplement that in ansible | 18:35 |
pabelanger | jobs should be set to failing, but they are requeuing due to exit code 2 | 18:35 |
pabelanger | need to see why that is a 2 | 18:35 |
Shrews | nodepool-launcher seemed to do its thing. will need some watching though | 18:37 |
jeblair | pabelanger: that's a pre-playbook, right? | 18:37 |
jeblair | pabelanger: we re-queue for pre playbook failures | 18:37 |
Shrews | These nodepool bugs over the last few days were fantastic. Nothing like the drizzle savepoint bug that drove me crazy for weeks and made me curse mordred for getting me involved, but interesting nonetheless | 18:39
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add secret top-level config object https://review.openstack.org/446156 | 18:41 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add per-repo public and private keys https://review.openstack.org/406382 | 18:41 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Associate secrets with jobs https://review.openstack.org/446687 | 18:41 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Decrypt secrets and plumb to Ansible https://review.openstack.org/446688 | 18:41 |
Shrews | wooo, requests | 18:41 |
Shrews | MOAR | 18:42 |
jeblair | i'll type faster! | 18:42 |
* Shrews watches the debug log and sees the matrix | 18:43 | |
pabelanger | jeblair: ya, pre-run. thanks | 18:46 |
Shrews | still chugging... looking pretty good | 18:48 |
Shrews | jeblair: w00t. it scoffed at your attempts to throw it for a loop (literally, a loop) | 18:49 |
jeblair | heh | 18:49 |
pabelanger | jeblair: what are your thoughts on how we get nodepool (provider / cloud / region / az) into zuul? Is it possible to get some of the data out of zookeeper? | 18:49
Shrews | that question confuses me | 18:50 |
pabelanger | I somehow need to access the following data (which we deleted from nodepool) from zuul: https://review.openstack.org/#/c/445572/ | 18:52
pabelanger | we use it to build up our mirror information | 18:52 |
pabelanger | as an example: http://mirror.dfw.rax.openstack.org/ | 18:52 |
Shrews | oh, ok. less confused | 18:52 |
jeblair | ah yeah, we should be able to get all of that. what's the best way to give it to ansible? in vars.yaml? or should it be in the inventory file? | 18:53 |
pabelanger | vars would be the easiest | 18:54 |
pabelanger | but | 18:54 |
* Shrews uses this time of less brokenness to step away for a bit. bbl | 18:54 | |
pabelanger | do we maybe want to setup a FQDN in inventory? | 18:54 |
jeblair | it's per-host information | 18:54 |
pabelanger | then extract the info from it? | 18:54 |
jeblair | (i mean, those things should generally all be the same, but it feels like we should be able to support them being different) | 18:54 |
pabelanger | according to docs there is inventory_hostname (which could be FQDN) and inventory_hostname_short, which is up to first dot | 18:59 |
pabelanger | so, adding it to inventory could be an option | 18:59 |
jeblair | pabelanger: wait, are you suggesting we make up a fqdn for the hosts? | 18:59 |
pabelanger | jeblair: Ya, but that might be too specific to openstack-infra | 19:00 |
jeblair | yes | 19:01 |
jeblair | pabelanger: what i'm asking is whether it's possible to add host-specific metadata to the inventory | 19:01 |
jeblair | or whether we have to put that in variables and then cross-reference them | 19:01 |
jeblair | http://docs.ansible.com/ansible/intro_inventory.html#host-variables | 19:01 |
jeblair | that seems to say yes | 19:01 |
pabelanger | jeblair: right, what is the difference then versus just adding them into vars.yaml? | 19:02
jeblair | pabelanger: so it makes the most sense to me to add "nodepool_provider", etc, to host variables. then each invocation of a 'setup mirror' role would be correct on its local host, i assume. | 19:02 |
jeblair | pabelanger: we can have many hosts. some values like this may end up being different some day. | 19:03 |
pabelanger | I see | 19:04 |
pabelanger | I don't have a preference atm, so I'll defer to you | 19:04 |
jeblair | host variables it is then | 19:05 |
jeblair | pabelanger: i'm heading to lunch now; it's all yours if you want to work on that | 19:06 |
pabelanger | jeblair: sure, I can try | 19:06 |
Shrews | pabelanger: i want to bike shed on https://review.openstack.org/446650 :) | 19:38 |
pabelanger | Shrews: sure! | 19:40 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Decrypt secrets and plumb to Ansible https://review.openstack.org/446688 | 20:04 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Associate secrets with jobs https://review.openstack.org/446687 | 20:04 |
pabelanger | jeblair: I might need some help with putting vars into inventory, it looks like we need to update test/base.py to properly populate getHostList() | 20:13 |
pabelanger | right now vars.yaml looks to be the easier route :) | 20:14 |
pabelanger | Shrews: no objection, fixing | 20:14 |
jeblair | pabelanger: do you think that vars.yaml and inventory variables are equivalent? | 20:17 |
pabelanger | jeblair: good question, I don't really use inventory files to store variables. Maybe jlk is better to ask | 20:18 |
jeblair | pabelanger: i do not, and that's what i was trying to convey earlier | 20:18 |
pabelanger | Maybe I just haven't had a need for them until now | 20:18 |
jeblair | pabelanger: we have a bunch of hosts. we have metadata for them. ansible has a way of associating metadata with hosts. that seems the right way to do it. | 20:18 |
jeblair | pabelanger: if we stick the information in vars, then we have to cross-reference with some index, like the hostname | 20:19 |
pabelanger | jeblair: yes, I agree in that case inventory variables make sense. | 20:19 |
jeblair | pabelanger: so the vars would look like: "hosts: compute1: provider: rax" and then, the role would have to do something like "{{hosts.$fqdn.provider}}" i'm not sure what that actually looks like in jinja | 20:20 |
jeblair | pabelanger: but if we put it in the inventory, then the role just says "{{nodepool_provider}}" because it should be set already, if i'm following the docs | 20:21 |
pabelanger | jeblair: I am happy to continue down inventory variables, I just need some help with test/base.py. | 20:21 |
jeblair | (i'm sure all those specifics are wrong, but i think that would be the general process required) | 20:21 |
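The difference being discussed, sketched as executor-side code that might build an ini-style inventory line with per-host variables; the variable names follow the chat (nodepool_provider, etc.), while the node dict keys and exact format are assumptions. A role can then reference {{ nodepool_provider }} directly instead of indexing a shared vars structure by hostname.

```python
def make_inventory_line(node):
    # One line per host, with host variables inline in ini-style inventory.
    keyvals = {
        'ansible_host': node['interface_ip'],
        'nodepool_provider': node['provider'],
        'nodepool_region': node['region'],
        'nodepool_az': node['az'],
    }
    return '%s %s' % (
        node['name'],
        ' '.join('%s=%s' % (k, v) for k, v in sorted(keyvals.items())))


print(make_inventory_line(dict(
    name='ubuntu-xenial', interface_ip='192.0.2.10',
    provider='rax', region='dfw', az='nova')))
# ubuntu-xenial ansible_host=192.0.2.10 nodepool_az=nova nodepool_provider=rax nodepool_region=dfw
```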
pabelanger | currently getHostList() in test/base.py returns an empty hosts variable. Need some guidance on how to properly mock that information | 20:22
jeblair | pabelanger: i don't think getHostList in tests/base.py needs updating | 20:22 |
jeblair | pabelanger: that's just adding a local connection flag | 20:22 |
jeblair | pabelanger: it calls the real getHostList in executor/server.py | 20:23 |
pabelanger | jeblair: right, I was appending the nodepool_provider info alongside ansible_host, for getHostList() in executor/server.py | 20:23
jeblair | pabelanger: which is where the additional nodepool data should be added. but it will also need to be plumbed through from nodepool itself, which means in nodepool.py, and zk.py. possibly also the fake nodepool in tests/base.py (to add any missing attrs like az) | 20:24 |
pabelanger | because args['nodes'] is empty, the values are not set | 20:24 |
pabelanger | jeblair: mind if I show you what I have now? | 20:24 |
pabelanger | maybe I am on the wrong path | 20:24 |
jeblair | pabelanger: sure | 20:24 |
jeblair | pabelanger: though if nodes is empty, it's probably one of the many test cases which run jobs with no nodes | 20:25 |
pabelanger | maybe that is the problem | 20:25 |
jeblair | (that still needs to work, but obivously isn't going to exercise the code) | 20:25 |
jeblair | pabelanger: i think many of the tests in test_v3 use nodes | 20:26 |
jeblair | pabelanger: certainly the TestAnsible one | 20:28 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: WIP: Create nodepool inventory variables https://review.openstack.org/446727 | 20:28 |
pabelanger | jeblair: ^ | 20:28 |
pabelanger | I haven't added a test case yet | 20:28 |
pabelanger | but of the test_v3 tests I looked at | 20:29
pabelanger | inventory is set to 'localhost ansible_connection=local' | 20:29 |
pabelanger | which, could be because of what you just mentioned about nodes | 20:30 |
pabelanger | checking that now | 20:30 |
jeblair | pabelanger: oh yeah, the test gethostlist also appends a localhost entry, mostly so that if a test doesn't request any nodes, there's still something in the inventory. i think we can just ignore that. | 20:32 |
jeblair | pabelanger: i think what you're seeing is that you're on the right path, and just need to use a test that requests nodes, then you should see it | 20:33 |
pabelanger | yes, I think you are right | 20:33 |
pabelanger | let me do that | 20:33 |
pabelanger | jeblair: yup, that was it. Thanks! | 20:34 |
pabelanger | I'll add a test job now | 20:34 |
jeblair | pabelanger: can you add 'az' while you're at it? | 20:36 |
jeblair | Shrews: hrm, it looks like the zuul model is already set up to avoid clobbering unknown values. | 20:37 |
jeblair | Shrews: but i wonder if we missed a read, so we just wrote old values? | 20:38 |
pabelanger | jeblair: yes | 20:39 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Create nodepool inventory variables https://review.openstack.org/446727 | 21:02 |
pabelanger | jeblair: ready for comments | 21:03 |
jeblair | pabelanger: +2 | 21:04 |
pabelanger | jeblair: cool, sorry about the noise earlier about tests/base.py, I should have checked nodes were defined first | 21:05 |
jeblair | pabelanger: np. it's perhaps not immediately obvious that most of the tests don't actually request nodes. :) | 21:06 |
pabelanger | jeblair: right, is that a valid setup still? I remember something about how we might not want that to happen on the executor | 21:08
jeblair | pabelanger: i'm not sure. perhaps we will want to support jobs that only perform some action on the executor? | 21:09 |
pabelanger | jeblair: unsure also | 21:11 |
pabelanger | if we did that, what motivation would somebody have to run nodepool? | 21:11
Shrews | jeblair: hrm, then i have no immediate explanation | 21:14 |
pabelanger | clarkb: Shrews: either of you mind reviewing 446727? moves nodepool provider variables into a job inventory file | 21:16 |
clarkb | I can try to take a look in a few minutes (finishing up one thing) but then have to head out to kid's doctor appointment so depends on how long it takes me I guess | 21:16 |
Shrews | pabelanger: +2 | 21:25 |
pabelanger | clarkb: when you have time, no rush | 21:26 |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Handle the 'all' group in callback. https://review.openstack.org/446749 | 21:27 |
clarkb | it was simple, I approved | 21:27 |
clarkb | the "nice" thing about doctors appointment is my other thing (a tempest update) should have test results by the time I get back to a computer | 21:28 |
Shrews | jeblair: i think the AZ thing may either be a shade bug or a nova bug. Seems that only sometimes is an AZ value returned: http://paste.openstack.org/show/603028/ | 21:32 |
Shrews | jeblair: that could make our AZ grouping not too efficient | 21:33 |
Shrews | if "nova" is the same as "no AZ", we can't know that | 21:34 |
* Shrews places $$$ on nova bug | 21:34 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Create nodepool inventory variables https://review.openstack.org/446727 | 21:35 |
Shrews | heh, i think i lost $$$ | 21:38 |
clarkb | you dont have to have AZs iirc | 21:38 |
Shrews | mordred: i can see OS-EXT-AZ:availability_zone populated from 'openstack server show', but shade isn't always giving this data back in the location attr | 21:39 |
Shrews | You know what, I bet it's not populated right away by nova, so we need to check it again after waitForServer() | 21:44 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Set node AZ after we're done waiting for it. https://review.openstack.org/446753 | 21:47 |
Shrews | mordred: nevermind! above is a hopeful fix | 21:48 |
*** rattboi is now known as quartermaster | 21:48 | |
*** quartermaster is now known as rattboi | 21:57 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Serve public keys through webapp https://review.openstack.org/446756 | 21:58 |
jeblair | that's the extent of the cryptography related parts of the secrets facility. we still have some config-related changes to make for it. | 22:02 |
jhesketh | Morning | 22:04 |
*** yolanda has quit IRC | 22:16 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Decrypt secrets and plumb to Ansible https://review.openstack.org/446688 | 22:17 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Associate secrets with jobs https://review.openstack.org/446687 | 22:17 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Serve public keys through webapp https://review.openstack.org/446756 | 22:17 |
SpamapS | jeblair: what was the logical conclusion of the discussion around RSA-OAEP yesterday? | 22:35 |
* SpamapS vaguely remembers looking at gcrypt and then shaking in the corner | 22:36 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Remove irrelevant test_merging_queues https://review.openstack.org/446768 | 22:38 |
jeblair | SpamapS: i think it trailed off. but the way it's gelling in my mind is: RSA-OAEP gets us the ability to encrypt something up to a couple hundred bytes long. to do something longer we'd probably switch to some kind of hybrid scheme (pgp or rsa encrypted AES session key). that will produce longer ciphertext. i think we have 3 choices: | 22:39 |
jeblair | SpamapS: 1) stick with pkcs1 and accept the limits. (we can change our minds later). | 22:39 |
jeblair | SpamapS: 2) keep pkcs1 and also add a second system. maybe have the encryption utility automatically select it for you. | 22:40 |
jeblair | SpamapS: 3) drop pkcs1 for the hybrid system and accept that ciphertext blobs will be bigger. | 22:40 |
jeblair | SpamapS: i think that the framework in zuul will be substantially the same, and also, there is highly likely to be some RSA involved either way, so i decided to press on with implementation to spec for the moment. i think little work will end up being wasted if we change. | 22:41 |
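For concreteness, the size limit follows directly from the OAEP overhead: max plaintext = key_bytes - 2*hash_len - 2, so a 2048-bit key with SHA-1 allows 256 - 40 - 2 = 214 bytes, the "couple hundred bytes" mentioned above. A sketch with the cryptography library; the key size and hash choice here are assumptions, not what the spec settles on:

```python
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

private_key = rsa.generate_private_key(
    public_exponent=65537, key_size=2048, backend=default_backend())
public_key = private_key.public_key()

# 2048-bit key, SHA-1 OAEP: at most 256 - 2*20 - 2 = 214 plaintext bytes.
secret = b'a short credential or API key'
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA1()),
                    algorithm=hashes.SHA1(), label=None)

ciphertext = public_key.encrypt(secret, oaep)
plaintext = private_key.decrypt(ciphertext, oaep)
assert plaintext == secret
```

Anything longer than that limit is where the hybrid (RSA-wrapped AES key) schemes come in.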
jeblair | probably we need a good ML thread to hash out what we want. | 22:42 |
jeblair | fungi: ^ | 22:42 |
SpamapS | Always a good idea to go slow and listen well when playing with crypto. | 22:42 |
jeblair | ya | 22:42 |
SpamapS | jeblair: my only concern is that AFAICT, there's nothing to identify a ciphertext as pkcs1 other than attempting decryption, so changing our minds later may be complicated. | 22:44 |
fungi | SpamapS: the function is namespaced as a yaml datatype in the spec | 22:44 |
SpamapS | Though I suppose we can make a good guess since the data length is relatively static. | 22:44 |
jeblair | yeah, we annotate it with a yaml tag, so it's forward-compatible | 22:44 |
SpamapS | Ah, does that still work with safe_load? | 22:45 |
SpamapS | I've never used the feature. | 22:45 |
jeblair | SpamapS: yeah, you can annotate an object as being 'safe' | 22:45 |
fungi | ml thread sounds like a great idea, and also i agree the current plan makes it not particularly hard to swap out or even increase the algorithms supported | 22:45 |
SpamapS | jeblair: neat | 22:45 |
SpamapS | well yeah, then I'm for (1) | 22:45 |
jeblair | SpamapS: https://review.openstack.org/446156 has that bit | 22:45 |
jeblair | configloader.py | 22:46 |
jeblair | (due to metaclass magic, the class defn of EncryptedPKCS1 is all that's required to make pyyaml know to load that) | 22:46 |
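The metaclass magic is stock PyYAML: a yaml.YAMLObject subclass that sets yaml_tag and points yaml_loader at SafeLoader registers itself with safe_load. A sketch with an invented tag and a trivially stored ciphertext (the real class in 446156 will differ):

```python
import yaml


class EncryptedPKCS1(yaml.YAMLObject):
    # Registering against SafeLoader is what keeps safe_load happy.
    yaml_tag = '!encrypted/pkcs1'
    yaml_loader = yaml.SafeLoader

    def __init__(self, ciphertext):
        self.ciphertext = ciphertext

    @classmethod
    def from_yaml(cls, loader, node):
        # The scalar value is kept as-is; decryption happens elsewhere.
        return cls(node.value)


doc = yaml.safe_load("password: !encrypted/pkcs1 c2VjcmV0Y2lwaGVydGV4dA==")
print(type(doc['password']), doc['password'].ciphertext)
```

Because the tag travels with the data, a future scheme can simply register a second tag without breaking existing configs.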
fungi | pkcs#1 gets zuul the ability to handle typical passwords/passphrases and short-ish api keys. something else will be needed to pass along things like asymmetric x.509/rsa key material, i agree | 22:47 |
SpamapS | jeblair: fancy | 22:48 |
fungi | computers are both infuriating and amazing | 22:48 |
SpamapS | fungi: isn't there a somewhat standard encryption format that CSR's use? | 22:49 |
fungi | certificate signing requests aren't encrypted | 22:49 |
SpamapS | I thought they were, by a password, unless you decline to do so. | 22:50 |
fungi | you're probably thinking of pkcs#12 (a.k.a. "pfx") which is commonly used to encrypt x.509 key bundles, but it's symmetric | 22:51 |
fungi | "by a password" is a giveaway that it's symmetric | 22:51 |
SpamapS | Could be | 22:51 |
SpamapS | yeah | 22:51 |
SpamapS | the only arbitrary-length-data public key crypto I'm familiar with are both email standards: PGP and S/MIME | 22:52 |
SpamapS | Though I guess they're just "message" standards, not specifically email. | 22:52 |
fungi | pkcs#7 can, but it's also the underpinning of s/mime | 22:54 |
* fungi sometimes wishes he knew less about these things. shadows of a former career | 22:56 | |
SpamapS | Hey, I feel the same about flipping burgers. ;) | 22:57 |
fungi | the two are alike in soooo many ways | 22:57 |
*** jamielennox is now known as jamielennox|away | 23:03 | |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Set filter according to PR/Change in URL https://review.openstack.org/446782 | 23:05 |
*** jamielennox|away is now known as jamielennox | 23:05 | |
SpamapS | jeblair: so.. stuck jobs.. | 23:21 |
SpamapS | jeblair: I'm looking at how to re-enable that test, and I wonder if we can still get into the situation that it artificially creates. | 23:21 |
jeblair | SpamapS: yeah, something about the word 'stuck' and all the hullabaloo about gearman keeps distracting me from what i think is its main purpose, which is in the comment | 23:23
jeblair | or, rather the docstring | 23:23 |
jeblair | "Test that pending jobs are cleaned up if removed from layout" | 23:23 |
SpamapS | jeblair: the way it works now.. if I remove a job that is in the queue, the scheduler correctly cancels it. | 23:24 |
SpamapS | 2017-03-16 16:19:10,115 zuul.DependentPipelineManager ERROR Unable to find change queue for project org/project | 23:24 |
SpamapS | 2017-03-16 16:19:10,115 zuul.Scheduler WARNING Canceling build <Build 7a09454c0cd34b988de9e276f8712c64 of gate-noop on <Worker Unknown>> during reconfiguration | 23:24 |
SpamapS | that does not end up in history tho | 23:25 |
SpamapS | vs. before, when it did. | 23:25 |
jeblair | hrm. i agree that behavior matches the docstring. i may not fully understand the mechanics of the test as applied to v2... lemme dig further. | 23:26 |
SpamapS | jeblair: so perhaps it's ok to just run that scenario and assert that there's no history? | 23:26 |
jeblair | SpamapS: maybe? | 23:26 |
SpamapS | let me push up what I have and we can see the same code at least | 23:26 |
SpamapS | jeblair: I may be testing something else with the way I refactored the test. | 23:28 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Re-enable test_stuck_job_cleanup https://review.openstack.org/446783 | 23:29 |
SpamapS | there's code | 23:29 |
jlk | Did the zuulv3 install have a hiccup? | 23:30 |
jlk | my change just got a bunch of RETRY_LIMIT, and it doesn't seem to show up at all in review. | 23:30 |
jlk | (other than the vote) | 23:30
jeblair | jlk: yeah, playbooks broken atm | 23:30 |
jlk | okay. | 23:30 |
jeblair | jlk: pabelanger is fixing it, but we'll have some false -1s for a while | 23:30 |
jeblair | SpamapS: ah, it's testing that the gearman job is removed from the gearman queue | 23:31 |
jeblair | SpamapS: the first config has it running 'project-merge'. *that* is the job that it's looking for in self.assertEqual(len(self.gearman_server.getQueue()), 1) | 23:31 |
jeblair | SpamapS: (it doesn't let the job run, just lets the launcher enqueue it) | 23:32 |
pabelanger | jeblair: jlk: we need to restart zuulv3-dev.o.o to pick up inventory variables first | 23:32 |
jeblair | SpamapS: and the final assertion is checking that project-merge still didn't actually run. | 23:32 |
jeblair | pabelanger: oh has that merged? | 23:32 |
SpamapS | jeblair: ah, so it's just making sure the queue is made of the one thing that did run. | 23:33 |
pabelanger | jeblair: yes, I still need to update our playbooks however | 23:33 |
jeblair | SpamapS: well, completed history, but yes. | 23:33 |
Shrews | pabelanger: any chance you can review https://review.openstack.org/446753 ? very easy review. would like to restart n-l with it | 23:34 |
jeblair | SpamapS: so the key to that test is the hold_jobs_in_queue part. we want to keep doing that, and not set hold_jobs_in build. | 23:35 |
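A sketch of the test shape jeblair describes: hold jobs in the gearman queue, drop the job from the layout, then confirm it was cleaned out of the queue and never ran. The gearman_server calls are the ones quoted above; the change-enqueue and reconfiguration steps are illustrative, not the exact v3 test API.

```python
def test_stuck_job_cleanup(self):
    # Hold jobs in the gearman queue so nothing actually runs yet.
    self.gearman_server.hold_jobs_in_queue = True
    A = self.fake_gerrit.addFakeChange('org/project', 'master', 'A')
    self.fake_gerrit.addEvent(A.getPatchsetCreatedEvent(1))
    self.waitUntilSettled()
    self.assertEqual(len(self.gearman_server.getQueue()), 1)

    # Swap in a layout that no longer defines the queued job (however the
    # test framework does that), then reconfigure the scheduler.
    self.sched.reconfigure(self.config)
    self.waitUntilSettled()

    # Release whatever is left; the removed job must be gone from the
    # queue and must never show up in completed history.
    self.gearman_server.hold_jobs_in_queue = False
    self.gearman_server.release()
    self.waitUntilSettled()
    self.assertEqual(len(self.gearman_server.getQueue()), 0)
    self.assertNotIn('project-merge', [b.name for b in self.history])
```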
pabelanger | Shrews: +3 | 23:35 |
jeblair | SpamapS: pointed some things out in review | 23:37 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Set node AZ after we're done waiting for it. https://review.openstack.org/446753 | 23:38 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Replace ssh-keyscan with host keys from node https://review.openstack.org/446785 | 23:40 |
pabelanger | jeblair: ^ would be interested in comments when you are ready to review. Specifically, I made some changes to getHostList() return value | 23:41 |
Shrews | any objections to nodepool-launcher restart? | 23:42 |
pabelanger | none here, going to do the same for zuulv3-dev too | 23:42 |
pabelanger | jeblair: ^ | 23:42 |
jeblair | ++ | 23:42 |
Shrews | oh noes | 23:45 |
jeblair | pabelanger: lgtm (assuming what we get from nodepool is ready to splat into the known_hosts file) | 23:47 |
pabelanger | jeblair: I think so, I see the data in zookeeper using zk-shell | 23:48 |
* jeblair hopes Shrews is okay | 23:48 | |
Shrews | well | 23:49 |
jeblair | Shrews is well. that's good. | 23:49 |
jeblair | all i need to hear | 23:49 |
Shrews | something weird happened | 23:50 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Revert "Configure regional mirrors for our jobs" https://review.openstack.org/446786 | 23:51 |
Shrews | OpenStackCloudException: Multiple matches found for None | 23:51 |
Shrews | i know shade had a release today, and there were some changes to the name/id matching | 23:51 |
Shrews | so... maybe a bug there | 23:51 |
Shrews | but also, we should try to delete a node if the id is None | 23:52 |
jeblair | pabelanger: +3 | 23:52 |
Shrews | not try* | 23:52 |
pabelanger | jeblair: danke | 23:53 |
Shrews | jeblair: pabelanger: we need jamielennox's fix https://review.openstack.org/#/c/445299/ | 23:53 |
pabelanger | that will unblock us for now, I'll start refactoring our mirror logic tomorrow | 23:53
pabelanger | Shrews: jeblair: Did we want a test for that? Not sure I understand the issue enough | 23:54 |
jamielennox | pabelanger, Shrews: i'm not sure how to write a test for that one, there was nothing that really existed similar | 23:54 |
Shrews | pabelanger: i can try and work out a test tomorrow if we approve that tonight | 23:55 |
pabelanger | I am okay with that | 23:55 |
Shrews | jamielennox: we just need to make createServer() fail in the test. i think the framework for that might already exist | 23:56 |
pabelanger | +2, so jeblair can review | 23:56 |
jeblair | +3 | 23:56 |
jamielennox | Shrews: there was something like it but it was on a different thread, but i would need to double check for all the details | 23:56 |
jamielennox | like i think there was a test for create server failure, but i couldn't see a test that was actually running the cleanup thread | 23:57 |
Shrews | jamielennox: look at test_fail_request_on_launch_failure | 23:57 |
jamielennox | however my priority was making my stuff work at the time so i could have missed something | 23:57 |
Shrews | jamielennox: we'd just need to make sure the node is deleted properly. can probably do so in that same test | 23:58 |
jamielennox | Shrews: hmm, ok, i think that was what i was looking at, assume i just missed something | 23:59 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Don't try and delete nodes with no external_id https://review.openstack.org/445299 | 23:59 |