*** haint has quit IRC | 00:01 | |
*** haint has joined #zuul | 00:02 | |
corvus | clarkb: if you have a sec to +3 https://review.openstack.org/540965 before you retire for the evening, that would be lovely to have in place tomorrow | 00:30 |
---|---|---|
* clarkb looks | 00:30 | |
clarkb | done | 00:35 |
corvus | i've added executor oom errors to the infra meeting tomorrow because it's looking like this issue may mostly be specific to our deployment. just mentioning it as a heads up in case it's still interesting to other folks. | 00:35 |
openstackgerrit | Merged openstack-infra/zuul master: Allow a few more starting builds https://review.openstack.org/540965 | 00:43 |
*** JasonCL has joined #zuul | 01:55 | |
*** harlowja has quit IRC | 02:18 | |
*** jimi|ansible has quit IRC | 04:39 | |
*** harlowja has joined #zuul | 04:46 | |
*** harlowja has quit IRC | 05:11 | |
*** threestrands has quit IRC | 06:40 | |
*** yolanda has joined #zuul | 06:43 | |
*** jpena|off is now known as jpena | 07:53 | |
*** hashar has joined #zuul | 08:24 | |
AJaeger | zuul team, something fishy is happening with zuul and nodepool, see http://grafana.openstack.org/dashboard/db/nodepool and http://grafana.openstack.org/dashboard/db/zuul-status. No root around in #openstack-infra to investigate. | 08:39 |
*** threestrands has joined #zuul | 09:06 | |
*** threestrands has quit IRC | 09:21 | |
*** sshnaidm|bbl is now known as sshnaidm|rover | 09:30 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul-jobs master: role: Inject public keys in case of failure https://review.openstack.org/535803 | 09:30 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Clean held nodes automatically after configurable timeout https://review.openstack.org/536295 | 09:36 |
tobiash | AJaeger: looks like all executors deregistered themselves due to some erratic governor behavior | 09:47 |
tobiash | hrm, the starting builds doesn't decrease | 09:48 |
tobiash | so there might be something broken in starting builds limitation | 09:49 |
tobiash | corvus, mordred, pabelanger: maybe we leak job workers under some circumstances | 09:56 |
tobiash | leaked not yet started job workers would be counted towards starting builds and thus explain the current behavior | 09:57 |
AJaeger | tobiash: see backscroll on #openstack-infra as well, please | 09:58 |
AJaeger | let's discuss there | 09:58 |
*** jaianshu has joined #zuul | 10:14 | |
*** dtruong has quit IRC | 10:35 | |
*** dtruong has joined #zuul | 10:35 | |
openstackgerrit | Merged openstack-infra/nodepool master: Convert nodepool-zuul-functional job https://review.openstack.org/540595 | 11:53 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /node-list to the webapp https://review.openstack.org/535562 | 12:19 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /label-list to the webapp https://review.openstack.org/535563 | 12:19 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Refactor status functions, add web endpoints, allow params https://review.openstack.org/536301 | 12:19 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /node-list to the webapp https://review.openstack.org/535562 | 12:25 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /label-list to the webapp https://review.openstack.org/535563 | 12:25 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Refactor status functions, add web endpoints, allow params https://review.openstack.org/536301 | 12:25 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add a separate module for node management commands https://review.openstack.org/536303 | 12:25 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: webapp: add optional admin endpoint https://review.openstack.org/536319 | 12:26 |
*** sshnaidm|rover is now known as sshnaidm|afk | 12:27 | |
*** hashar is now known as hasharAway | 12:28 | |
*** jaianshu has quit IRC | 12:50 | |
*** jpena is now known as jpena|lunch | 12:51 | |
*** jimi|ansible has joined #zuul | 13:44 | |
*** jimi|ansible has quit IRC | 13:44 | |
*** jimi|ansible has joined #zuul | 13:44 | |
*** jpena|lunch is now known as jpena | 13:47 | |
*** sshnaidm|afk is now known as sshnaidm|rover | 13:52 | |
*** elyezer has quit IRC | 13:58 | |
*** elyezer has joined #zuul | 14:00 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Fix for age calculation on unused nodes https://review.openstack.org/541281 | 14:11 |
*** dkranz has quit IRC | 14:55 | |
*** hasharAway is now known as hashar | 15:00 | |
*** mnaser has quit IRC | 15:11 | |
*** mnaser has joined #zuul | 15:12 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add separate modules for management commands https://review.openstack.org/536303 | 15:20 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: webapp: add optional admin endpoint https://review.openstack.org/536319 | 15:20 |
tobiash | SpamapS: responded on https://review.openstack.org/#/c/540774 | 15:21 |
SpamapS | tobiash: yeah, I'm thinking it might actually be useful as an entry point rather than a template. | 15:30 |
SpamapS | tobiash: or something to import from your debugging template | 15:30 |
SpamapS | I've used it a couple of times now | 15:30 |
SpamapS | and thought "i wish this just took --foo" | 15:31 |
SpamapS | but it's good as-is, which is why I +3'd :) | 15:31 |
tobiash | SpamapS: yeah, I'm open for improvements ;) | 15:31 |
tobiash | it's also bugging me that I have an unclean workspace when using this ;) | 15:31 |
tobiash | but didn't have time yet for polishing this | 15:32 |
SpamapS | Yeah so if nothing else we could make it into a module and import it into debug scripts or even just python cli | 15:33 |
SpamapS | but, just thoughts on some yak shaving for the future | 15:34 |
tobiash | SpamapS: that sounds cool | 15:35 |
pabelanger | tobiash: so, I think we might have missed finishJob() in stopJobByUnique(): http://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/executor/server.py#n2034 | 15:51 |
pabelanger | tobiash: IIUC, if zuul tests the job to stop, either disk is full or timeout, we don't del key from self.job_workers | 15:51 |
pabelanger | s/tests/tells | 15:51 |
pabelanger | but, not 100% sure | 15:52 |
tobiash | pabelanger: have to grok that deeper | 15:52 |
tobiash | But heading home now. Maybe I can look at that later this evening. | 15:53 |
pabelanger | np, I'll see if others can confirm once openstack-infra is more stable | 15:53 |
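
The suspected leak discussed above can be illustrated with a toy model. This is only a sketch under assumptions: `ToyExecutor` and all of its methods are hypothetical names invented for illustration, not Zuul's actual executor code. It shows how a stop path that never removes its entry from `job_workers` would keep counting toward the starting-builds limit.

```python
class ToyExecutor:
    """Hypothetical stand-in for an executor; not Zuul's real class."""

    def __init__(self, max_starting_builds):
        self.max_starting_builds = max_starting_builds
        # Maps a job's unique id to a small state dict.
        self.job_workers = {}

    def start_job(self, unique):
        # A new worker is registered before its build has really started.
        self.job_workers[unique] = {"started": False}

    def finish_job(self, unique):
        # Dropping the entry is what frees the slot.
        self.job_workers.pop(unique, None)

    def stop_job_leaky(self, unique):
        # Assumed bug: the job is aborted but its entry is never removed.
        worker = self.job_workers.get(unique)
        if worker is not None:
            worker["aborted"] = True

    def stop_job_fixed(self, unique):
        # Same abort, followed by the cleanup the leaky path skips.
        self.stop_job_leaky(unique)
        self.finish_job(unique)

    def starting_builds(self):
        # Workers that never reached "started" count as starting builds.
        return sum(1 for w in self.job_workers.values() if not w["started"])

    def accepting_work(self):
        return self.starting_builds() < self.max_starting_builds
```

In this model, stopping two not-yet-started jobs through `stop_job_leaky` on an executor with `max_starting_builds=2` leaves `accepting_work()` false forever, which matches the symptom of a starting-builds count that never decreases.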
openstackgerrit | Merged openstack-infra/zuul master: Fix github connection for standalone debugging https://review.openstack.org/540772 | 15:56 |
*** dkranz has joined #zuul | 15:56 | |
*** Wei_Liu has quit IRC | 16:48 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Do not delete unused but allocated nodes https://review.openstack.org/541375 | 17:02 |
*** jpena is now known as jpena|off | 17:18 | |
openstackgerrit | Merged openstack-infra/zuul master: Enhance github debugging script for apps https://review.openstack.org/540774 | 18:03 |
*** JasonCL has quit IRC | 18:05 | |
*** JasonCL has joined #zuul | 18:06 | |
*** JasonCL has quit IRC | 18:07 | |
*** JasonCL has joined #zuul | 18:08 | |
*** JasonCL has quit IRC | 18:11 | |
*** JasonCL has joined #zuul | 18:12 | |
*** JasonCL has quit IRC | 18:15 | |
*** myoung is now known as myoung|food | 18:15 | |
*** sshnaidm|rover is now known as sshnaidm|bbl | 18:16 | |
*** JasonCL has joined #zuul | 18:20 | |
*** JasonCL has quit IRC | 18:21 | |
*** JasonCL has joined #zuul | 18:22 | |
*** Wei_Liu has joined #zuul | 18:32 | |
*** harlowja has joined #zuul | 18:35 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: [WIP] webapp: add optional admin endpoint https://review.openstack.org/536319 | 18:54 |
*** myoung|food is now known as myoung | 19:03 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging https://review.openstack.org/541434 | 20:13 |
mordred | Shrews: if you get a second, I got that ^^ mostly there, but I'm a little stuck | 20:13 |
mordred | Shrews: just running tests.unit.test_log_streamer.TestStreaming.test_streaming ... it doesn't seem like LogRecordStreamHandler is getting constructed or called | 20:15 |
mordred | corvus: ^^ overall shape should be in place - working on trying to figure out testing | 20:16 |
Shrews | mordred: yeah, will take a look | 20:17 |
mordred | Shrews: also - hopefully this version of that patch makes more sense than the last time I pushed it up | 20:17 |
Shrews | mordred: you're going to finally get me to get zuul tests to run locally, aren't you? ugh | 20:21 |
* Shrews waits for the zuul logs :) | 20:22 | |
mordred | Shrews: you might be waiting for a long time today :) | 20:25 |
mordred | Shrews: this one is pretty easy - ttrun -epy35 tests.unit.test_log_streamer.TestStreaming.test_streaming works easy-peasy for it | 20:26 |
dmsimard | Do we need to initialize a counter anywhere when adding a new one ? or does statsd.incr create it if it doesn't exist ? Trying to find but no luck | 20:30 |
clarkb | I think it creates it if you send it data | 20:32 |
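
For context on the answer above: the statsd wire protocol has no separate "create" step. A counter update is a single UDP datagram of the form `name:value|c`, and the server typically starts tracking the name the first time it sees it. A minimal sketch using only the standard library follows; the metric name, host, and port are illustrative assumptions, and the helper names are hypothetical.

```python
import socket


def counter_packet(name, value=1):
    """Build the statsd counter line for one increment."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Fire-and-forget a counter increment over UDP.

    No prior registration of the counter is sent; the datagram itself
    is the only message. Host and port here are assumed defaults.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(counter_packet(name, value), (host, port))
```

Calling `send_counter("zuul.example.counter")` would emit one increment; the name is made up for the example.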
tobiash | mordred: did I correctly understand your commit message that the node connects back to the executor for log streaming? | 20:32 |
mordred | tobiash: sort of - it connects over a port forwarded over the ssh connection | 20:34 |
tobiash | mordred: ah, that's ok | 20:34 |
mordred | tobiash: yah - the other thing would not work very well :) | 20:34 |
tobiash | my build nodes don't even have a route back to my executor :) | 20:35 |
mordred | tobiash: if we can get it working, it should allow us to delete zuul_console and have things work across remote node reboots | 20:35 |
mordred | tobiash: indeed! | 20:35 |
tobiash | that sounds pretty cool | 20:35 |
mordred | now - if only we can figure why the tests don't work | 20:37 |
Shrews | mordred: so, this is changing things between the executor and the node, right? | 20:47 |
openstackgerrit | Merged openstack-infra/nodepool master: Fix for age calculation on unused nodes https://review.openstack.org/541281 | 20:48 |
Shrews | mordred: you're changing a test that is testing what the streaming daemon would send back to a finger client. maybe we need a new test there | 20:48 |
Shrews | because i think we should test both things still | 20:48 |
Shrews | err, test the original thing still. plus your new thing | 20:48 |
mordred | Shrews: ah - well, yes - I agree | 20:59 |
tobiash | zookeeper is a crazy memory hog... | 21:02 |
tobiash | we have a 5 node cluster, each taking 500mb ram | 21:02 |
tobiash | for managing 40kb of data... | 21:02 |
clarkb | tobiash: I think that is likely due to the way the jvm works | 21:06 |
clarkb | tobiash: you can tune that but by default the jvm uses some heuristics to know how much memory the heap should preallocate for a minimum | 21:06 |
clarkb | and it goes out and grabs that memory when it starts | 21:06 |
clarkb | it will then add on more memory up to the maximum limit | 21:06 |
*** dkranz has quit IRC | 21:07 | |
clarkb | chances are it is actually using much less than that memory if you only have 40kb of data | 21:07 |
tobiash | well, the memory consumption is constant and I don't really care about that :) | 21:08 |
clarkb | but ya thats how java works | 21:08 |
tobiash | btw, having that on tmpfs works really great | 21:08 |
clarkb | having the disk backing zk on tmpfs? | 21:09 |
tobiash | the zk latency is almost all of the time below 5ms | 21:09 |
tobiash | yes | 21:09 |
mordred | just like mysql-cluster | 21:10 |
tobiash | we had some trouble with IO due to ceph cluster restructure and meltdown patching | 21:10 |
tobiash | so we increased the replica to 5 and put the data and datalog onto tmpfs | 21:10 |
tobiash | since then we had no problems with zk anymore | 21:11 |
clarkb | I don't think we've really been having problems with zk | 21:11 |
tobiash | we occasionally had io latencies over 20s | 21:11 |
tobiash | which broke zk sessions | 21:12 |
pabelanger | only issue so far has been when we lost nodepool.o.o, but even then we didn't lose any data | 21:12 |
pabelanger | just few hour outage, then back online | 21:12 |
mordred | Shrews: k. i've got the test redone to be a new test and not mess with the existing test | 21:13 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging https://review.openstack.org/541434 | 21:13 |
*** threestrands has joined #zuul | 21:32 | |
openstackgerrit | David Moreau Simard proposed openstack-infra/zuul master: Add Executor Merger and Ansible execution statsd counters https://review.openstack.org/541452 | 21:46 |
dmsimard | corvus: ^ as per discussed earlier | 21:46 |
jhesketh | Morning | 21:46 |
SpamapS | Has anybody thought about possibly making src_dir absolute rather than relative? | 21:50 |
SpamapS | I find myself doing a lot of 'chdir: "~{{ ansible_user }}/{{ zuul.project.src_dir }}"' after a job fails because I used become: True and suddenly the path isn't relative to where I am on the filesystem. | 21:50 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix stuck node requests across ZK reconnection https://review.openstack.org/541454 | 21:52 |
corvus | Shrews, clarkb, mordred, fungi, dmsimard, pabelanger: ^ that should take care of the scheduler wedge from this morning | 21:53 |
clarkb | cool I'll take a look shortly. Currently trying to figure out why zuul tests are unhappy on my desktop after changing the timeout behavior | 21:54 |
corvus | SpamapS: i think it's that way because the first part of it is image dependent | 21:56 |
dmsimard | corvus: we know the username we'll be using ahead of time though, right ? | 21:57 |
corvus | dmsimard: yes. we don't know if it has a home directory, or if that's where the repos are. | 21:58 |
dmsimard | I guess | 21:58 |
SpamapS | corvus: yeah, maybe we can set a fact in prepare-workspace that is responsive to the image? | 22:11 |
SpamapS | I haven't looked, maybe we even do. | 22:11 |
SpamapS | Another thing is I kind of wish I could just tell Ansible to set its default CWD to something and not change whether becoming or not. | 22:11 |
*** openstackgerrit has quit IRC | 22:16 | |
*** openstackgerrit has joined #zuul | 22:24 | |
openstackgerrit | Merged openstack-infra/nodepool master: Do not delete unused but allocated nodes https://review.openstack.org/541375 | 22:24 |
clarkb | does anyone know if the test_disk_accountant test is flaky? I am getting testtools.matchers._impl.MismatchError: {'/tmp/tmpf7v6_m8n/012345'} != set() which seems like it shouldn't be related to changes in timeouts but still digging | 22:26 |
*** hashar has quit IRC | 22:26 | |
mordred | Shrews: woot. I think I got it | 22:43 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging https://review.openstack.org/541434 | 22:46 |
clarkb | looks like disk accountant tests are not arbitrary fs safe /me working on fix for that | 22:48 |
clarkb | also we leak tmpdirs like crazy | 22:48 |
clarkb | also looking at fixing that | 22:48 |
*** myoung is now known as myoung|off | 22:53 | |
dmsimard | mordred: someone smarter than me once told me to use json instead of pickle due to security concerns but I see there's already a comment about that :p | 22:56 |
mordred | dmsimard: :) | 22:59 |
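
On the pickle concern raised above: the standard library's `logging.handlers.SocketHandler` serializes records with pickle, and unpickling data from an untrusted peer can execute arbitrary code. One possible alternative, sketched here as an assumption and not as what the patch under review does, is to override `makePickle` so the same length-prefixed frame carries JSON instead. `JSONSocketHandler` and `decode_frame` are hypothetical names, and only a few record fields are carried over.

```python
import json
import logging
import logging.handlers
import struct


class JSONSocketHandler(logging.handlers.SocketHandler):
    """SocketHandler variant that frames records as JSON, not pickle."""

    def makePickle(self, record):
        # Keep only plain, JSON-safe fields; the message is pre-formatted.
        payload = {
            "name": record.name,
            "levelno": record.levelno,
            "levelname": record.levelname,
            "msg": record.getMessage(),
            "created": record.created,
        }
        data = json.dumps(payload).encode("utf-8")
        # Same 4-byte big-endian length prefix the stock handler uses.
        return struct.pack(">L", len(data)) + data


def decode_frame(frame):
    """Turn one length-prefixed JSON frame back into a LogRecord."""
    (length,) = struct.unpack(">L", frame[:4])
    fields = json.loads(frame[4:4 + length].decode("utf-8"))
    return logging.makeLogRecord(fields)
```

The receiver then parses data with `json.loads` only, so a malicious sender can at worst produce odd log fields and cannot trigger code execution through deserialization.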
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Make timeout value apply to entire job https://review.openstack.org/541485 | 23:02 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Sync when doing disk accountant testing https://review.openstack.org/541486 | 23:03 |
Shrews | mordred: awesome. i will look more closely again also too for a second time tomorrow | 23:08 |
mordred | Shrews: awesome. the new test works now - there's a few things about it now that it's all hanging together where I think I might need to re-think a chunk of it | 23:11 |
Shrews | corvus: funny... my name is associated with the first part of that fix, but I have very little memory of it. your part looks great though. +2 | 23:12 |
pabelanger | just noticed a statsd error in zuul debug log: http://paste.openstack.org/show/664149/ | 23:16 |
pabelanger | should be straightforward fix | 23:16 |
corvus | Shrews: yeah, pretty fuzzy for me too :) | 23:18 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Use nested tempfile fixture for cleanups https://review.openstack.org/541487 | 23:20 |
clarkb | ok thats a couple fixes for the tests based on local experiences | 23:20 |
clarkb | we already seemed to have a test that covers job timeouts that appears to still be valid after my timeout change so I haven't added a new test | 23:20 |
*** sshnaidm|bbl has quit IRC | 23:21 | |
clarkb | reviewing corvus' fix for the stuck nodes now | 23:22 |
clarkb | corvus: left a thought/comment/question on https://review.openstack.org/#/c/541454/ I think I'm ok with it as is but I think we can possibly simplify and make things a little easier to understand with a minor change | 23:37 |
corvus | clarkb: yep. i have to push up a pep8 fix so i'll incorporate that | 23:39 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix stuck node requests across ZK reconnection https://review.openstack.org/541454 | 23:40 |
corvus | clarkb, Shrews: ^ | 23:40 |
clarkb | ugh browser so slow trying to +2 somehow ended up clicking the cherry pick button | 23:41 |
clarkb | in any case +2'd | 23:41 |
clarkb | now to fix firefox | 23:41 |