openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: requirements: update ansible version https://review.openstack.org/475228 | 01:10 |
*** lennyb has quit IRC | 03:40 | |
*** lennyb has joined #zuul | 03:51 | |
tristanC | tobiash: how about using command: rsync instead of the synchronize module? | 04:19 |
tobiash | tristanC: at least the shell module also didn't work delegated to localhost | 04:59 |
tobiash | but that's ok as I currently have a working workaround | 05:00 |
jamielennox | tobiash: you have a workaround to command on localhost? | 05:05 |
tobiash | jamielennox: not for command but for pulling files | 05:06 |
jamielennox | i haven't figured out exactly what fails on mine, but currently there is no output from the command | 05:06 |
tobiash | jamielennox: the output is missing as ansible tries to use py3 when delegated to localhost and many modules don't work yet with py3 | 05:06 |
tobiash | (e.g. synchronize, shell, command,...) | 05:07 |
jamielennox | ah ok, pull didn't seem like a problem, it was me trying to rsync from the executor to the log server which i was running via command | 05:07 |
jamielennox | tobiash: at least command should though, i patched it up the other day | 05:07
tobiash | jamielennox: didn't try command, just shell and synchronize | 05:08 |
jamielennox | tobiash: command and shell are basically the same thing | 05:08 |
tobiash | ah, ok | 05:08 |
jamielennox | same code run, a few flags set | 05:08 |
tobiash | jamielennox: for pulling files I currently use a combination of find and fetch modules | 05:09 |
jamielennox | tobiash: synchronize seemed to be pulling ok for me | 05:12 |
jamielennox | with mode: pull | 05:12 |
jamielennox | i'm not sure how much of that runs on client vs server in that mode | 05:12 |
tobiash | hm, for me that didn't work (failed without any helpful messages) | 05:12
jamielennox | i haven't been able to debug it far | 05:16 |
*** smyers has quit IRC | 05:21 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul feature/zuulv3: Fix and test report urls for unknown failures https://review.openstack.org/469214 | 05:24 |
*** smyers has joined #zuul | 05:24 | |
gundalow | Zuul people at AnsibleFest we have booked "Riverview 9" at 11am. The room is on the UK 2nd floor, i.e. the floor above where the main tracks are | 07:21
jlk | gundalow: what's the topics? | 07:58 |
pabelanger | gundalow: thanks | 08:00 |
gundalow | jlk: I'd like to know the next steps for Zuul + Ansible integration. Especially anything that will need Ansible Core Team people so I can add it into the roadmap | 08:02 |
jlk | Cool. I'll be there. | 08:03 |
*** hashar has joined #zuul | 08:06 | |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Add a test to verify push reports only set status https://review.openstack.org/476400 | 08:09 |
jlk | jamielennox: So when working on ^ I ran across something that may bite us. I think our code is trying to create a log path to use with the github status we report, and I think it's trying to use a PR number as part of that URL (and maybe the internal log path). But a push event doesn't have an associated PR. | 08:11 |
pabelanger | remote: https://review.openstack.org/476403 Allow secrets for check pipeline | 08:17 |
pabelanger | so, I think ^ is going to be an issue. | 08:17 |
pabelanger | I'm not sure we want secrets in check today for any untrusted job, but we do need it for trusted atm | 08:19 |
pabelanger | ssh private key for logs.o.o to publish job logs | 08:19 |
jlk | oh eww | 08:26 |
pabelanger | I guess we could move log publishing into a new pipeline | 08:36 |
jamielennox | log publishing is a trusted job, so it runs outside of bwrap on the system | 08:45 |
jamielennox | so if you have an ssh key on the executor you can scp your logs to publish | 08:45 |
jamielennox | eg (because i have it open) https://github.com/BonnyCI/project-config/blob/master/playbooks/roles/base/tasks/post.yaml | 08:45 |
jamielennox | note, that's the delegate_to that currently doesn't work - but it used to | 08:45
pabelanger | jamielennox: right, we could do that but that means we need to manage keys with puppet. We are also going to run trusted in bwrap shortly too: https://review.openstack.org/#/c/474460/ | 08:46
pabelanger | so, with that in mind, we could just store key as secret and have zuul just pass it | 08:47 |
pabelanger | and we talked about that last week; it's part of the reason we added tls to gearman | 08:47
jamielennox | pabelanger: what's the benefit of being trusted if you're still in a bwrap? | 08:51
pabelanger | jamielennox: usage of lookups / ansible things | 08:51 |
jamielennox | also yea, but we're already doing some key management; isn't the first key (the one nodepool puts on it) already going to need to be in the executor? | 08:52
pabelanger | ya, today we need the ssh key for nodepool. but I feel a job could manage the SSH key for logs.o.o under zuulv3. I guess if we think that is going to be an issue, we'll just fall back to having puppet manage that key | 08:55
jamielennox | so i've wondered a bit about that, particularly if you were uploading logs to an api of some sort instead of a straight ssh | 08:57 |
jamielennox | whether you could use like a one time private key from hashicorp vault or something that would also tie the job identifier to the key | 08:57 |
jamielennox | but all of these things will require some sort of trusted key on the executor | 08:58 |
pabelanger | right, swift is a good example. we'd want a trusted job to publish to swift. I don't see us storing something directly on the executor for that, the job would have that info in a secret and we'd just use it | 09:00 |
*** hashar has quit IRC | 09:01 | |
*** smyers has quit IRC | 09:03 | |
*** hashar has joined #zuul | 09:04 | |
jamielennox | right, but temp_urls work because you can create them with your token/username+password | 09:04 |
jamielennox | you'll always have a bootstrap problem of how do you know who can create the upload approval token | 09:05 |
jamielennox | but that's what the trusted jobs have been for | 09:05 |
*** hashar_ has joined #zuul | 09:09 | |
*** hashar has quit IRC | 09:09 | |
*** smyers has joined #zuul | 09:15 | |
*** bhavik1 has joined #zuul | 09:23 | |
jlk | gundalow: The rooms aren't labeled, but we've taken the one straight across from the women's restroom | 09:38
*** hashar_ has quit IRC | 09:40 | |
jlk | mordred: jeblair: we're upstairs in one of the two Ansible rooms, floor 2 above the main snack area of Ansiblefest. | 09:46 |
*** bhavik1 has quit IRC | 10:00 | |
jeblair | pabelanger: yeah, i don't think we want to allow secrets in check. i think we'll have to have the log key installed on the executor and use it from a trusted playbook. when we switch to using bubblewrap for trusted, we'll bind-mount that key in as well. i'd like to be able to use secrets for this, but i think we're going to have to revisit some things, so let's not hold up progress on jobs. | 10:23 |
pabelanger | jeblair: okay, I'll do that now in puppet | 10:24 |
jeblair | (in particular, secrets are currently given to an entire job, and we'll have to think about whether we can make them available to just a single playbook within a job) | 10:28 |
pabelanger | jeblair: so, we do not set zuul_changes today in the zuul dict() in the inventory file. We do have the legacy ZUUL_CHANGES variable, but I think we are (or should now be?) removing them | 10:55
mordred | jamielennox, tobiash: do you have access to logs or anything that shows the issues with synchronize and python3? | 11:01
mordred | I'm sitting with toshio looking at the issue | 11:02 |
*** jkilpatr has joined #zuul | 11:04 | |
*** hashar has joined #zuul | 11:17 | |
tobiash | mordred: I haven't figured out how to get logs for that issue | 11:42 |
tobiash | just see that ansible starts a py3 session | 11:42 |
tobiash | and after that there is nothing in the executor log | 11:42 |
mordred | ahh. kk. I think I can try to reproduce that | 11:43 |
*** hashar has quit IRC | 11:46 | |
*** hashar has joined #zuul | 11:46 | |
tobiash | mordred: http://paste.openstack.org/show/613370/ | 11:48 |
tobiash | that's what I got with verbose executor debug logging | 11:48 |
*** hashar has quit IRC | 11:53 | |
*** hashar has joined #zuul | 12:03 | |
*** hashar has quit IRC | 12:18 | |
*** hashar has joined #zuul | 12:19 | |
mordred | tobiash: awesome! thank you! | 12:52 |
mordred | oh fun | 12:52 |
mordred | pabelanger: "test/runner/ansible-test sanity --local --python 2.7" is they way to run ansible test without sourcing hacking/env-setup | 12:54 |
mordred | pabelanger: --local tells it to use the local env not a virtualenv | 12:54 |
pabelanger | ++ | 12:55 |
mordred | tobiash: can you paste me your playbook with synchronize? | 14:03 |
tobiash | mordred: let me check | 14:03 |
tobiash | mordred: did much rebasing during debugging, have to check reflog... | 14:05 |
tobiash | mordred: http://paste.openstack.org/show/613402/ | 14:08 |
mordred | tobiash: awesome! thank you | 14:09 |
tobiash | prerequisite is that the main or pre job at least creates the logs dir in zuul_workspace_root | 14:09 |
tobiash | (in theory if it would work) | 14:10 |
mordred | tobiash: and which task broke? | 14:10 |
mordred | the publish logs? | 14:10 |
tobiash | the synchronize | 14:11 |
tobiash | (collect node logs) | 14:11
tobiash | mordred: the publish works | 14:11 |
tobiash | the create log dirs also works | 14:12 |
mordred | kk. thanks (we're trying to make it break here so we can debug) | 14:12 |
tobiash | :) | 14:14 |
tobiash | mordred: additional info is that I'm already using the patch to run trusted jobs in bwrap too | 14:18 |
tobiash | if that has an impact on your debugging | 14:19 |
mordred | tobiash: but you had the issue before you deployed that right? | 14:21
tobiash | mordred: I think, but I'm not sure anymore | 14:22 |
tobiash | I was in a "Try and error until it works"-mode... | 14:23 |
tobiash | and the result was a workaround with find and fetch modules as I wasn't able to really debug this... | 14:24 |
mordred | tobiash: yah - we need better debugging structure around that :( | 14:55 |
mordred | tobiash: and ... we can't make it break here :( | 14:55 |
mordred | tobiash: but we're still trying things | 14:55
* SpamapS catching up with all of you in GMT | 15:32 | |
jeblair | tobiash: we've tried running a playbook with that synchronize task inside of bubblewrap under py3.6 and still can't reproduce. if you have a chance to try this procedure to get more info, that would be great: run zuul-executor with --keep-jobdir, run a job, grab the bubblewrap command from the logs, then run it manually from a shell. | 15:33 |
* SpamapS points at zuul-bwrap as a possible aid to debugging | 15:34 | |
SpamapS | I'm sure there's context that has been missed. Have we totally thrown out the idea of just running ansible-playbook using python2.7 ? | 15:35 |
jeblair | tobiash: if that fails, then edit the ansible.cfg to remove zuul_stream and run it again, hopefully if there is a problem where zuul_stream is eating output, that will show more | 15:36 |
jeblair | tobiash: also, i forgot to mention: run 'zuul-executor verbose' before running the job to get -vvv output | 15:36
jlk | I think there were some concerns about the streaming code in python 2 | 15:36 |
mordred | SpamapS: well - we've mostly been focused on trying to reproduce the issue while we had the ansible core team here | 15:37 |
mordred | SpamapS: and have been unable to do so - although we understand basically everything there is to know about synchronize now :) | 15:37 |
SpamapS | hah | 15:37 |
mordred | SpamapS: general consensus is that there should be no issues with synchronize in python3 since the majority of it is run in the action plugin which is core-not-module | 15:38 |
SpamapS | that's exactly what I was hoping we could avoid tying our progress to. | 15:38 |
SpamapS | ahh | 15:38 |
SpamapS | so maybe it's not python3's fault | 15:38 |
mordred | yah | 15:38 |
SpamapS | that explains the debug steps taken | 15:38
mordred | SpamapS: we've also got context in folks heads, so if we can get it to happen again and get a reproducible case, we should be able to debug and figure out what's up | 15:40 |
SpamapS | yeah I don't even have context on when it was seen | 15:41 |
*** hashar is now known as hasharAway | 15:43 | |
*** tobiash_ has joined #zuul | 16:28 | |
tobiash_ | jeblair, mordred: thanks, will try tomorrow again | 16:28 |
tobiash_ | I'm already @home without my work laptop | 16:29 |
tobiash_ | the log I posted earlier was already with verbose logging | 16:30 |
tobiash_ | maybe it could be also a different issue | 16:33 |
tobiash_ | how does zuul setup these module overrides? | 16:33 |
SpamapS | tobiash_: ansible.cfg loads them as plugins IIRC | 16:34 |
tobiash_ | there might also be a possibility of leftovers from earlier module overrides? | 16:34
tobiash_ | I'll clean up that directory and retry the synchronize again tomorrow | 16:35 |
tobiash_ | btw, will this stay like this or is there a plan to integrate localhost restrictions in ansible itself? | 16:38 |
SpamapS | tobiash_: that's a really long long way away.. but I think it's the right path | 16:44 |
SpamapS | have a "safe mode" | 16:44 |
SpamapS | it would need to be designed generically which might get complicated | 16:45 |
tobiash_ | ok, so more a topic for ansible 3 and zuul 5 | 16:47 |
tobiash_ | what comes to my mind now: does zuul inhibit loading custom modules alongside the playbooks? | 16:51
tobiash_ | like http://blog.toast38coza.me/custom-ansible-module-hello-world/ | 16:54 |
Shrews | ugh, i think my signal.pause() is affecting asyncio happenings | 16:59 |
SpamapS | tobiash_: I believe that's disabled in untrusted mode by the action plugin restrictions. Could be wrong. | 17:44 |
SpamapS | Shrews: Highly likely. Signals and async are always dicey. | 17:44 |
tobiash_ | ok | 17:48 |
*** tobiash_ has left #zuul | 18:11 | |
*** tobiash__ has joined #zuul | 18:11 | |
*** tobiash__ has quit IRC | 18:11 | |
*** tobiash__ has joined #zuul | 18:12 | |
*** tobiash__ has quit IRC | 18:12 | |
*** tobiash_ has joined #zuul | 18:12 | |
clarkb | Shrews: I am following up on the "we don't delete uploads on process restart" thing from yesterday. The current cleanup worker checks if the upload is in progress by attempting to get the lock for that upload. If it gets the lock then it isn't in progress and should be deleted. Do we not release the locks when disconnecting from zk? | 18:46
clarkb | Shrews: our init script uses --signal 9, which may be part of it. I wonder if we can gracefully terminate and release locks on sigterm | 18:50 |
clarkb | Shrews: that said, our locks are expected to survive disconnects? or should our lock grabbing code check if whoever has the lock is connected before failing? | 18:50 |
clarkb | harlowja: do you know re ^ and http://kazoo.readthedocs.io/en/latest/api/recipe/lock.html ? | 18:53 |
clarkb | harlowja: if a zk client holds the lock via kazoo's lock recipe there, then disconnects, is that lock still held by that client? | 18:54
clarkb | I would expect not but that appears to be the behavior we are seeing | 18:54 |
harlowja | me too, though potentially after the zk session timeout | 18:54
harlowja | i didn't think there was an explicit on-disconnect release of all locks (or try to before disconnecting) | 18:55
harlowja | locks are basically https://github.com/python-zk/kazoo/blob/master/kazoo/recipe/lock.py#L213-L214 | 18:57 |
clarkb | harlowja: ok it does look like the records that were there yesterday (that I wanted to go away) have gone away on a query today | 18:57 |
clarkb | harlowja: so the session timeout may be it | 18:58 |
harlowja | ya | 18:58 |
harlowja | check whatever that is set to | 18:58 |
clarkb | harlowja: I don't see us setting it explicitly so I think whatever kazoo sets it to | 19:00 |
SpamapS | I thought locks had an ephemeral component | 19:01 |
SpamapS | so if the session goes away -> lock released | 19:01 |
SpamapS | and that was like, why zk is good | 19:01 |
clarkb | SpamapS: that is the behavior I expected but it definitely appears like that isn't the case based on nodepool builder behavior | 19:01 |
SpamapS | so yeah, session timeout is very likely the thing you are seeing. | 19:01 |
harlowja | see https://github.com/python-zk/kazoo/blob/master/kazoo/recipe/lock.py#L213-L214 | 19:02 |
harlowja | `node = self.client.create(self.create_path, self.data, | 19:02 |
harlowja | ephemeral=True, sequence=True)` | 19:02 |
harlowja | thats pretty much what a lock is, ha | 19:02 |
harlowja | an ephemeral node with sequencing | 19:02
SpamapS | right | 19:02 |
SpamapS | so until the session times out, lock will be held | 19:02 |
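A minimal sketch of the kazoo lock recipe being discussed, with a placeholder ZooKeeper address and lock path: the lock is backed by an ephemeral, sequenced znode, so it disappears immediately on a clean session close, but lingers until the session timeout if the holder is killed with SIGKILL.

```python
from kazoo.client import KazooClient

client = KazooClient(hosts='127.0.0.1:2181')
client.start()

# Acquiring the lock creates an ephemeral sequential node under this path.
lock = client.Lock('/nodepool/images/ubuntu-xenial/lock', identifier='nb01')
with lock:
    print('lock held; an ephemeral znode now exists under the lock path')
    # kill -9 here would leave that znode until ZooKeeper expires the session

# A clean stop closes the session, so the ephemeral node (and the lock)
# goes away right away instead of waiting for the session timeout.
client.stop()
```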
clarkb | oh I bet I know what is going on | 19:02
SpamapS | clarkb: and yeah, --signal 9 seems like a poor choice. What's up with that? | 19:02 |
clarkb | because we use sigkill we don't get to send the tcp fun | 19:02 |
clarkb | *fin | 19:03 |
SpamapS | yeah | 19:03 |
clarkb | which means zk has to timeout | 19:03 |
SpamapS | TCP timeout, even worse | 19:03 |
clarkb | SpamapS: I don't know why we use sigkill | 19:03 |
harlowja | SpamapS ya agreed, if the only way an app can die is via signal -9, prob app should be redesigned :-P | 19:03 |
clarkb | but if we sigtermed I think python would gracefully kill that for us | 19:03 |
SpamapS | we should be sending SIGTERM and waiting a bit before SIGKILL | 19:03 |
SpamapS | where's this init script? | 19:03 |
clarkb | it is likely that way because we run long-lived threads and don't mark them daemon? | 19:03
SpamapS | start-stop-daemon will do this for you automatically. | 19:04 |
harlowja | there are many reasons | 19:04 |
harlowja | a daemon thread really isn't a good answer imho either :-P | 19:04 |
clarkb | SpamapS: yes this is start stop daemon | 19:04 |
clarkb | harlowja: it is if this thing is literally supposed to run forever | 19:04 |
clarkb | but you still want to be able to stop it | 19:04 |
harlowja | sorry, i mean, u can use them, but u still need a way to signal to those threads to stop in a nice-happy-fashion | 19:04 |
clarkb | something somewhere is going to be in a tight loop | 19:04 |
harlowja | nice-happy-death preferred | 19:04 |
clarkb | harlowja: sure | 19:04 |
SpamapS | yeah, should be using start-stop-daemon --retry 10 or something | 19:05 |
harlowja | which usually evolves into breaking up your work into small tasks, and then in between tasks checking if some kind of event has been set (ie, for the thread to stop) | 19:05 |
harlowja | blah blah | 19:05 |
harlowja | lol | 19:05 |
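A minimal sketch of the shutdown pattern harlowja describes, assuming nothing about the actual nodepool code: SIGTERM/SIGINT set an event, worker threads check it between small units of work, and the process exits cleanly so any ZooKeeper session is closed rather than left to time out after a kill -9.

```python
import signal
import threading

stop_event = threading.Event()

def request_stop(signum, frame):
    # runs in the main thread when SIGTERM or SIGINT arrives
    stop_event.set()

signal.signal(signal.SIGTERM, request_stop)
signal.signal(signal.SIGINT, request_stop)

def worker():
    while not stop_event.is_set():
        # ... one small unit of work would go here ...
        stop_event.wait(1)   # re-check the stop request at least once a second

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

signal.pause()        # sleep until a signal arrives and the handler runs
for t in threads:
    t.join()          # workers notice the event and exit on their own
# falling off the end here lets sessions and sockets close cleanly
```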
clarkb | SpamapS: https://git.openstack.org/cgit/openstack-infra/puppet-nodepool/tree/files/nodepool-builder.init | 19:06 |
clarkb | we do actually appear to have a graceful stop in the nodepool builder too | 19:06 |
SpamapS | argh, cgit has no annotate? | 19:07 |
clarkb | that marks each running worker to shut down, waits for them, then disconnects from zk | 19:07
clarkb | I think I see the patch needed to make this work | 19:08 |
clarkb | nodepool.cmd.builder handles sigint as a stop | 19:09 |
clarkb | SpamapS: so ya switching the init script to sigint and wait before sigkill should do it | 19:09 |
clarkb | SpamapS: I can review that really quick if you are wanting to push it. Otherwise I'll sort out the right start-stop-daemon incantation | 19:10
clarkb | you know what the problem is going to be? its the block on image upload | 19:11 |
clarkb | because that can take a few hours | 19:11
clarkb | so we might separately have to make that die gracefully on sigint | 19:11 |
clarkb | yup looks like upload image and image build are both synchronous with their workers | 19:14 |
clarkb | and we join() the workers on stop() | 19:14 |
Shrews | image build spawns a subprocess that should receive any signals sent, so should terminate rather quickly. image uploading doesn't work that way. | 19:17 |
clarkb | ah ok so upload is the one to worry about here | 19:17
Shrews | this should pretty much follow the design of the older nodepool in this area | 19:18 |
clarkb | ya, difference is that older nodepool didn't rely on graceful connection closes to properly update state in the database | 19:18
clarkb | (because there was no database) | 19:18
Shrews | (sorry, was out getting new tires. yay fun) | 19:18 |
clarkb | so old nodepool was fine with a forceful kill | 19:19 |
clarkb | Shrews: what do you think about joining only the build workers and not the uploads? | 19:19 |
clarkb | thats the easy mode fix for this I think | 19:19 |
clarkb | (and roughly equivalent to the behavior with kill -9 going on now) | 19:19 |
clarkb | my only concern doing that is upload holds a lock with the zk client and stop() will attempt to stop the client and kill it under the upload | 19:21 |
clarkb | will that raise in the context manager properly? | 19:21
Shrews | well, i'm not worried about the lock handling. that's automatic | 19:23 |
Shrews | i think what would happen is those records will just remain in the UPLOADING state and get cleaned up at some point after restart | 19:24 |
clarkb | Shrews: yup. That is what seems to happen today. The problem today is we kill the tcp connection to zk without letting zk know so we have to wait for that session to time out. Whereas if we were to call nb.stop() properly with sigint we'd close the zk connection which should unlock things at that point | 19:25
clarkb | but for that to be effective we can't wait for three hours for a thread to join. I think I see some other related bugs and will push up a patch shortly and we can look at this with a concrete change | 19:25
Shrews | clarkb: are you finding the zookeeper session timeout too long? | 19:27 |
clarkb | Shrews: I think that is why those uploading records for centos hung around forever yesterday. Either they were waiting on the zk session timeout or the tcp connection to timeout on the zk side or both | 19:28 |
Shrews | clarkb: those hung around because they were part of the latest build | 19:28 |
Shrews | we needed two more builds to age them out | 19:28 |
Shrews | (2 most recent builds and their zk records hang around) | 19:29 |
SpamapS | I'm a little confused | 19:29 |
SpamapS | not looking at the code | 19:29 |
SpamapS | upload threads or upload subprocesses? | 19:29 |
Shrews | the problem with those was that we were not cleaning up not-in-progress UPLOADING records for the most recent builds | 19:29 |
clarkb | SpamapS: threads | 19:29 |
SpamapS | but Shrews just said processes. | 19:30 |
SpamapS | oh | 19:30 |
SpamapS | that's build | 19:30 |
Shrews | SpamapS: image builds are subprocess | 19:30 |
SpamapS | k | 19:30 |
clarkb | Shrews: if you look at the cleanup code it will delete uploads in progress if nothing holds their lock | 19:30 |
clarkb | Shrews: and that is independent of age | 19:30 |
SpamapS | then yeah, don't join those | 19:30 |
SpamapS | exit() should take care of them. | 19:30 |
clarkb | SpamapS: yup | 19:30 |
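A hypothetical sketch of the "easy mode" fix being agreed on here; the class and attribute names are made up, not the actual nodepool builder code. Every worker is asked to stop, but only the (fast) build workers are joined; upload workers are daemon threads, so process exit abandons them and closing the ZooKeeper session releases their locks for the cleanup worker.

```python
import threading

class Worker(threading.Thread):
    """Stand-in for a nodepool build or upload worker thread."""
    def __init__(self, name, interval=1):
        super(Worker, self).__init__(name=name)
        self.daemon = True                  # never blocks interpreter exit
        self._stop_event = threading.Event()
        self._interval = interval

    def run(self):
        while not self._stop_event.is_set():
            # ... a real worker would build or upload an image here ...
            self._stop_event.wait(self._interval)

    def shutdown(self):
        self._stop_event.set()

build_workers = [Worker('builder-%d' % i) for i in range(2)]
upload_workers = [Worker('uploader-%d' % i) for i in range(4)]
for w in build_workers + upload_workers:
    w.start()

def stop():
    for w in build_workers + upload_workers:
        w.shutdown()
    for w in build_workers:
        w.join()        # builds wrap up quickly once signalled
    # Upload workers are intentionally not joined: an in-flight upload can
    # take hours. Exiting and closing the ZooKeeper session releases their
    # locks so the cleanup worker can reap the leftover UPLOADING records.

stop()
```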
Shrews | clarkb: nope. 1 sec... | 19:31 |
SpamapS | clarkb: and the start-stop-daemon incantation is just --retry <timeoutvalue> instead of --signal 9 | 19:31 |
Shrews | clarkb: https://github.com/openstack-infra/nodepool/blob/master/nodepool/builder.py#L441-L442 | 19:31 |
clarkb | Shrews: https://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/builder.py#n376 | 19:31 |
Shrews | that little bit right there keeps us from calling cleanupProvider | 19:31 |
SpamapS | If timeout is specified instead of schedule, then the schedule signal/timeout/KILL/timeout is used, where signal is the signal specified with --signal. | 19:32 |
clarkb | SpamapS: thats a build not an upload | 19:32 |
clarkb | er Shrews ^ | 19:32 |
clarkb | Shrews: the code I linked is called before the section you point out and is explicitly for builds we want to keep | 19:33 |
clarkb | Shrews: basically it is making sure that the uploads corresponding to builds we keep are curated | 19:33 |
clarkb | SpamapS: so we just have to change the signal and set a timeout? | 19:35 |
Shrews | clarkb: yeah, you're right, i was in the wrong place yesterday | 19:38 |
Shrews | clarkb: perhaps we should adjust the session timeout in zookeeper server if it's too long? | 19:39 |
clarkb | Shrews: we could do that too, but right now we aren't able to gracefully shutdown the builder because uploads can take hours | 19:39 |
clarkb | Shrews: which is why I think we use sigkill in the init script | 19:40
clarkb | so fixing that first seems like a good idea | 19:40 |
clarkb | then if we still have problems we can adjust the session timeout | 19:40 |
Shrews | i honestly didn't expect it to be very long, the docs for it are vague, measuring in "ticks": https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkSessions | 19:40 |
clarkb | Shrews: ya I think SpamapS may be right that the tcp timeout is actually at fault because we don't gracefully close the tcp connection | 19:41
clarkb | so zk has to hang around and wait for tcp to give up | 19:41 |
Shrews | ah, that would make sense | 19:41 |
Shrews | what is the tcp timeout? anyone recall? | 19:43 |
clarkb | I don't I think it is defined by the kernel by default | 19:46 |
Shrews | clarkb: hrm, a quick test of a 'kill -9' on a kazoo script holding lock shows the lock disappear relatively quickly | 19:54 |
clarkb | Shrews: it's entirely possible that there are more bugs in the cleanup code too, resulting in the upload hanging around | 19:55
clarkb | (eg it's not just the lock we have to worry about) | 19:56
Shrews | yeah | 19:56 |
clarkb | but I think either way we should avoid using sigkill and instead use sigint | 19:56 |
clarkb | right now stop() is essentially dead code but it shouldn't be | 19:56 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Use private attribute lookup interanlly in builder.py https://review.openstack.org/476677 | 19:56 |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool feature/zuulv3: Don't join image upload workers on stop() https://review.openstack.org/476678 | 19:56 |
clarkb | there I wrote lots of words to explain this | 19:56 |
Shrews | clarkb: oh, is it possible that the original nb* node that built the image was taken offline? | 20:00 |
clarkb | Shrews: no, I just restarted the service but the node was up the whole time | 20:01 |
openstackgerrit | Joshua Harlow proposed openstack-infra/nodepool master: Use threading.Event for quicker shutdowns https://review.openstack.org/476683 | 20:02 |
harlowja | my contribution ^ | 20:03 |
harlowja | lol | 20:03 |
Shrews | clarkb: hrm... so, i think we might have some hostname issues. 'nodepool dib-image-list' shows builds that have 'nb04' AND 'nb04.openstack.org' as the builder hostname. that could cause problems | 20:04 |
Shrews | i do not know how that could happen, but changing the hostname of a builder is a definite thing that jeblair and I have considered as a potential issue | 20:05 |
openstackgerrit | Joshua Harlow proposed openstack-infra/nodepool master: Use threading.Event for quicker shutdowns https://review.openstack.org/476683 | 20:05 |
harlowja | prob a pattern that should be used more ^ | 20:05 |
Shrews | which brings up SpamapS' idea of writing a unique id to disk and using that instead of hostnames | 20:06 |
clarkb | Shrews: I think that may have been related to pabelanger? I recall that he noticed hostname was fqdn in one cloud and just hostname in the other. Looks like they have been changed to both be just hostname | 20:07 |
clarkb | Shrews: happens because different cloud images and metadata retrieval/application methods apply that inconsistently | 20:07 |
Shrews | harlowja: ooh, i like that. though i don't think using that around the SUSPEND_WAIT_TIME portion of code is correct | 20:12 |
harlowja | why not | 20:12 |
harlowja | prob don't want to loop forever waiting for zk to come back up? | 20:13
Shrews | harlowja: oh wait. yeah, that works too | 20:13 |
Shrews | nm | 20:13 |
harlowja | :) | 20:13 |
harlowja | ya, shutting down things quickly is hard, lol | 20:13 |
harlowja | https://github.com/google/python-subprocess32#subprocess32 is also nice, in that it allows the call to timeout | 20:14 |
harlowja | its the backport from py3.3 | 20:14 |
harlowja | i've done stuff like run(cmd, on_timeout_cb=...) | 20:14 |
harlowja | so that u can like stop subprocesses that u start if your app is dying/getting shut down | 20:15
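A small sketch of the timeout support subprocess32 backports from Python 3.3 (the run(cmd, on_timeout_cb=...) call above is harlowja's own wrapper, not the library API); 'sleep 3600' stands in for a long-running child such as a dib image build.

```python
import subprocess32

proc = subprocess32.Popen(['sleep', '3600'])
try:
    proc.wait(timeout=5)           # give the child a deadline
except subprocess32.TimeoutExpired:
    proc.kill()                    # don't let a hung child block shutdown
    proc.wait()                    # reap it so no zombie is left behind
    print('child exceeded its deadline and was terminated')
```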
clarkb | in this case seconds is fine, but hours isn't :) | 20:16 |
harlowja | up to u | 20:16 |
harlowja | a lot of the time i see time.sleep in thread stuff it makes me think a threading.Event is really what's wanted (so the timeout can finish and the thread can die), but anyways, the more u know :-P | 20:17
Shrews | harlowja: no, i definitely like the use of that. i think we can avoid setting very short timeout intervals in our tests using that | 20:18 |
harlowja | prob | 20:18 |
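A minimal illustration of the substitution harlowja describes: a periodic loop built on time.sleep() can take up to a full interval to notice a shutdown request, whereas Event.wait() returns as soon as the event is set. The 60-second interval is arbitrary.

```python
import threading
import time

stop_event = threading.Event()

def periodic_worker(interval):
    while not stop_event.is_set():
        # ... one cleanup/build pass would go here ...
        # time.sleep(interval) here would pin shutdown latency at ~interval;
        # Event.wait() wakes up immediately once stop_event is set.
        stop_event.wait(interval)

t = threading.Thread(target=periodic_worker, args=(60,))
t.start()
time.sleep(1)
stop_event.set()     # worker exits almost at once instead of after ~60s
t.join()
```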
harlowja | https://docs.python.org/3/library/threading.html#barrier-objects is the other stuff that is useful | 20:18 |
harlowja | especially for the code that waits for X threads to start | 20:18
harlowja | i've got a version that i can extract that is equivalent, somewhere in taskflow, lol | 20:18 |
harlowja | ya, https://github.com/openstack/taskflow/blob/master/taskflow/types/latch.py#L22 | 20:19 |
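A small sketch of what harlowja is pointing at: threading.Barrier (Python 3.2+; taskflow's Latch, linked above, is the Python 2 equivalent) lets a coordinator or a test wait until all N worker threads have actually started before continuing.

```python
import threading

NUM_WORKERS = 4
started = threading.Barrier(NUM_WORKERS + 1)   # the workers plus the coordinator

def worker():
    started.wait()       # block until every party has reached the barrier
    # ... real work begins only once all threads are known to be running ...

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

started.wait()           # coordinator: returns once all workers have started
for t in threads:
    t.join()
```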
Shrews | i don't think we have need of barriers (thankfully) | 20:19 |
harlowja | k | 20:19 |
*** jkilpatr has quit IRC | 20:20 | |
Shrews | harlowja: it unfortunately doesn't work :( debugging now | 20:28 |
harlowja | impassible | 20:28 |
harlowja | lol | 20:28 |
clarkb | harlowja: does threading.event force a context switch? | 20:30 |
clarkb | harlowja: I am assuming it does if it sleeps under the covers | 20:30
harlowja | yes | 20:30 |
harlowja | py2.7 def had time.sleeps, lol | 20:30 |
harlowja | i think they made it slightly better in py3 | 20:30 |
harlowja | https://hg.python.org/cpython/file/v2.7.11/Lib/threading.py#l343 | 20:31 |
harlowja | 3.x they got rid of that loop, lol | 20:31 |
harlowja | from what i remember | 20:32 |
harlowja | or at least moved it down a level... | 20:33 |
Shrews | harlowja: found the bug. the running prop needs to do "not is_set" | 20:34 |
harlowja | ah poops | 20:34 |
Shrews | harlowja: i'll push up the fix. trying some other things with the tests, too | 20:34 |
harlowja | thought i did that, ha | 20:34 |
harlowja | kk | 20:34 |
harlowja | thx | 20:34 |
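A guess at the shape of the bug Shrews found, with made-up class and attribute names: when a plain boolean running flag is replaced by a stop Event, the property has to be inverted, since "running" now means the stop event is not set.

```python
import threading

class BaseWorker(object):
    def __init__(self):
        self._stop_event = threading.Event()

    @property
    def running(self):
        # the buggy version returned self._stop_event.is_set(), which is
        # backwards: the worker is running only while the event is unset
        return not self._stop_event.is_set()

    def stop(self):
        self._stop_event.set()
```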
harlowja | not high priority, i just saw this, and had urge to change it | 20:34 |
harlowja | back to writing peer reviews :-P | 20:34 |
clarkb | harlowja: I just know sleep() gets used a lot because it is the documented way to force a context switch | 20:34 |
harlowja | ya | 20:35 |
*** jkilpatr has joined #zuul | 20:40 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Use threading.Event for quicker shutdowns https://review.openstack.org/476683 | 20:42 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Use threading.Event for quicker shutdowns https://review.openstack.org/476683 | 20:45 |
Shrews | harlowja: bug fixed and cleaned up a test. I also removed that TODO b/c I don't feel we need to bring that kind of complexity in for that. | 20:46 |
Shrews | harlowja: thanks! | 20:46 |
harlowja | Shrews np :) | 20:47 |
harlowja | the hard part imho becomes when u start to have a thing that u call into that u want to pass that event into to stop that thing early | 20:48 |
harlowja | u start to end up with coroutines and eventlet at some point, lol | 20:49 |
harlowja | lol | 20:49 |
harlowja | (or something like eventlet/asyncio...) | 20:49 |
Shrews | harlowja: ugh, please don't mention asyncio. i'm using it for some of our websocket streaming stuff and it's causing me to have nightmares | 20:49 |
harlowja | lol | 20:49 |
harlowja | i bet | 20:49 |
harlowja | i would have just liked it if eventlet (at least the concept) was mainlined into python | 20:50
harlowja | i think that would have gone over better | 20:50
harlowja | but meh, is what it is, lol | 20:50 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Use threading.Event for quicker shutdowns https://review.openstack.org/476705 | 20:52 |
Shrews | the feature/zuulv3 cherry pick ^^^ | 20:52 |
Shrews | want to carry that over to the launcher process, too | 20:53 |
*** dmellado has quit IRC | 21:00 | |
*** dmellado has joined #zuul | 21:03 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Use Event for faster shutdown of launcher https://review.openstack.org/476709 | 21:07 |
harlowja | cools | 21:32 |
*** dmellado has quit IRC | 22:33 | |
*** dmellado has joined #zuul | 22:37 | |
*** hasharAway has quit IRC | 22:48 | |
*** dmellado has quit IRC | 23:03 | |
*** dmellado has joined #zuul | 23:06 | |
*** dmellado has quit IRC | 23:15 | |
*** dmellado has joined #zuul | 23:18 | |
*** dmellado has quit IRC | 23:31 | |
jlk | adam_g: thanks for the comments re GHE. Are you all running another zuul somewhere? | 23:43 |
*** dmellado has joined #zuul | 23:51 | |
SpamapS | jlk: they are | 23:52 |
SpamapS | jlk: They're toying with a BonnyCI for an IBM internal project | 23:52 |
jlk | neat! | 23:53 |
SpamapS | along with mestery | 23:53 |
SpamapS | so it's all in the family :) | 23:53 |
jlk | craziness. | 23:54 |
*** dmellado has quit IRC | 23:55 | |
jamielennox | lol | 23:57 |
jamielennox | well, it shows value for bonny long term in ibm | 23:58 |