-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821 | 00:02 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 00:06 | |
- [zuul/zuul] Make node requests persistent https://review.opendev.org/c/zuul/zuul/+/806280 | ||
- [zuul/zuul] Add node request cache to zk nodepool interface https://review.opendev.org/c/zuul/zuul/+/806639 | ||
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653 | ||
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814 | ||
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816 | ||
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821 | ||
@jim:acmegating.com | swest, tobiash, Clark: i think the nodepool stack with hashtag:sos is ready for review. (starts at 806813) | 00:15 |
---|---|---|
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] Update IRC nics with Matrix IDs https://review.opendev.org/c/zuul/zuul/+/806640 | 00:59 | |
@jim:acmegating.com | there's something weird happening with changes after https://review.opendev.org/806653 reporting too many open files in the unit tests. i don't understand that yet, and i don't see it locally when i run the full test suite. | 02:31 |
@jim:acmegating.com | i'm running python 3.9 locally | 02:31 |
-@gerrit:opendev.org- Simon Westphahl proposed: | 05:58 | |
- [zuul/zuul] Add source interface for setting change attributes https://review.opendev.org/c/zuul/zuul/+/805836 | ||
- [zuul/zuul] Reference change dependencies by key https://review.opendev.org/c/zuul/zuul/+/805844 | ||
- [zuul/zuul] Implement ABC for caching changes in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805835 | ||
- [zuul/zuul] Cache Gerrit refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805837 | ||
- [zuul/zuul] Cache Github refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805838 | ||
- [zuul/zuul] Cache Pagure refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806556 | ||
- [zuul/zuul] Cache Gitlab refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806557 | ||
- [zuul/zuul] Cache Git refs (driver) in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806755 | ||
-@gerrit:opendev.org- Simon Westphahl proposed: | 06:27 | |
- [zuul/zuul] Add source interface for setting change attributes https://review.opendev.org/c/zuul/zuul/+/805836 | ||
- [zuul/zuul] Reference change dependencies by key https://review.opendev.org/c/zuul/zuul/+/805844 | ||
- [zuul/zuul] Implement ABC for caching changes in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805835 | ||
- [zuul/zuul] Cache Gerrit refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805837 | ||
- [zuul/zuul] Cache Github refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805838 | ||
- [zuul/zuul] Cache Pagure refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806556 | ||
- [zuul/zuul] Cache Gitlab refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806557 | ||
- [zuul/zuul] Cache Git refs (driver) in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806755 | ||
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/nodepool] Add Tenant-Scoped Resource Quota https://review.opendev.org/c/zuul/nodepool/+/800765 | 07:07 | |
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul] Disable aliases in inventory.yaml for better readibility https://review.opendev.org/c/zuul/zuul/+/802674 | 07:20 | |
-@gerrit:opendev.org- Simon Westphahl proposed: | 07:38 | |
- [zuul/zuul] Implement ABC for caching changes in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805835 | ||
- [zuul/zuul] Cache Gerrit refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805837 | ||
- [zuul/zuul] Cache Github refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805838 | ||
- [zuul/zuul] Cache Pagure refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806556 | ||
- [zuul/zuul] Cache Gitlab refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806557 | ||
- [zuul/zuul] Cache Git refs (driver) in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806755 | ||
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] Periodically maintain connection caches https://review.opendev.org/c/zuul/zuul/+/806756 | 09:32 | |
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] Reporter: add info on job retries https://review.opendev.org/c/zuul/zuul/+/806722 | 10:08 | |
@westphahl:matrix.org | corvus: re the too many open files issue: you are busy looping in the election since you never clear the event | 12:35 |
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] Reporter: add info on job retries https://review.opendev.org/c/zuul/zuul/+/806722 | 12:44 | |
@jim:acmegating.com | swest: but the only time the event is set, we also set _stopped=True, so it should exit the election thread | 13:22 |
@westphahl:matrix.org | corvus: you are also setting it in case of an exception in _handleNodeRequestEvent | 13:23 |
@jpew:matrix.org | Is there a way to make the nodepool k8s driver add a volume mount when it creates the pod? (It looks like no from my reading). If not, would that be something amenable to be added? | 13:24 |
@avass:vassast.org | jpew: I don't think that's possible at the moment | 13:25 |
@avass:vassast.org | jpew: I guess you mean a persistent volume? | 13:25 |
@jpew:matrix.org | Well, specifically I'd like an `ephemeralVolume` but ya | 13:26 |
@jim:acmegating.com | swest: ah good point -- i don't expect that to happen in the tests, but maybe it does? | 13:26 |
@avass:vassast.org | jpew: oh what's the usecase? | 13:26 |
@westphahl:matrix.org | corvus: I think it has to do with how many tests you run in parallel | 13:27 |
@jpew:matrix.org | I have some "fast" disks I want to use as the workdir when building | 13:27 |
@westphahl:matrix.org | corvus: I can easily reproduce it with my 72 core machine ;) | 13:27 |
@jim:acmegating.com | swest: maybe i just have my max open files set very high | 13:27 |
@avass:vassast.org | it would make sense to add that to the kubernetes driver | 13:28 |
@westphahl:matrix.org | or that | 13:28 |
@jim:acmegating.com | swest: but i'm still surprised we're actually hitting that in the tests | 13:28 |
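A minimal sketch of the busy-loop hazard discussed above, assuming a loop that waits on a `threading.Event` which gets set once (for example by an exception handler) and is never cleared. `count_wakeups` and its parameters are hypothetical names for illustration; this is not Zuul's election code.

```python
import threading


def count_wakeups(clear_after_wake, rounds=5, timeout=0.01):
    """Count how many of `rounds` waits return immediately because the event is set."""
    wake = threading.Event()
    wake.set()  # assumption: something set the event once, e.g. an error path
    wakeups = 0
    for _ in range(rounds):
        # wait() returns True right away for as long as the event stays set
        if wake.wait(timeout):
            wakeups += 1
            if clear_after_wake:
                wake.clear()  # without this, every later wait() returns at once
    return wakeups


print(count_wakeups(clear_after_wake=False))  # every round wakes up
print(count_wakeups(clear_after_wake=True))   # only the first round wakes up
```

If each wakeup re-enters something that allocates resources, the uncleared variant repeats that work on every pass instead of blocking.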
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 13:32 | |
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653 | ||
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814 | ||
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816 | ||
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821 | ||
@gtema:matrix.org | how do I override zuul_return? I need to alter log_url (whatever is produced by upload-logs-swift role) | 13:46 |
@jim:acmegating.com | gtema: if this is your use case: https://review.opendev.org/776677 may be relevant. otherwise, just run zuul_return again and set the new value. | 13:48 |
@gtema:matrix.org | corvus: I need to modify it, and not to simply rewrite (I want to replace domain name) | 13:49 |
@gtema:matrix.org | means I need to get the value first | 13:50 |
@jim:acmegating.com | gtema: sounds like 776677 is your use case | 13:50 |
@gtema:matrix.org | ok, digging. thks | 13:50 |
@jim:acmegating.com | someone else was driving that effort but they stopped working on it when it was almost done :( | 13:50 |
@gtema:matrix.org | uh, sad | 13:51 |
@jim:acmegating.com | just need 2 small changes written then i think we can merge it. | 13:51 |
@jim:acmegating.com | though we might need to re-start the clock on the 2 week deprecation notice with another announcement to zuul-announce | 13:51 |
@gtema:matrix.org | hmm | 13:52 |
@gtema:matrix.org | okay, assuming we get this as a solution, what is an alternative for really getting return value (so that it can be altered)? | 13:52 |
@gtema:matrix.org | is it like roles//pull-from-intermediate-registry/tasks/main.yaml - reading from results.json? | 13:53 |
@jim:acmegating.com | gtema: it's just a json file, you can read it. | 13:53 |
@jim:acmegating.com | but honestly this is like 5 minutes of work to push this over the line. | 13:53 |
@gtema:matrix.org | yeah, but 2 weeks waiting - this is what hurts, since I need this now | 13:53 |
@gtema:matrix.org | should I work on your comments in the change? | 13:54 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-registry] Build images on bionic https://review.opendev.org/c/zuul/zuul-registry/+/799728 | 13:54 | |
@jim:acmegating.com | gtema: we don't need to wait 2 weeks to merge that change, we just need the pre-requisites ready to go | 13:54 |
@gtema:matrix.org | oki | 13:54 |
@avass:vassast.org | gtema: you could also always copy the role, update it, and put it in another project that is loaded before zuul-jobs to shadow the original role | 13:58 |
@gtema:matrix.org | corvus, what does point 3 in this change refer to? which quick-download? | 13:58 |
@gtema:matrix.org | > <@avass:vassast.org> gtema: you could also always copy the role, update it, and put it in another project that is loaded before zuul-jobs to shadow the original role | 13:58 |
sure - the most terrible way of workarounding things :-) | ||
@avass:vassast.org | :) | 13:58 |
@jim:acmegating.com | gtema: i think it's used here: https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/post-logs.yaml#L8 | 14:01 |
@jim:acmegating.com | gtema: that's the thing that broke in opendev when we merged it the first time | 14:01 |
@gtema:matrix.org | oookay, looking | 14:02 |
@jim:acmegating.com | gtema: i think that role is defined in zuul-jobs too? sorry i have to run now | 14:02 |
@gtema:matrix.org | thks - found the role | 14:02 |
@gtema:matrix.org | will try to pick up from here | 14:02 |
@gtema:matrix.org | thanks | 14:03 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-registry] Add a restricted mode (read authentication required) https://review.opendev.org/c/zuul/zuul-registry/+/788389 | 14:12 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-registry] Make TLS optional https://review.opendev.org/c/zuul/zuul-registry/+/788390 | 14:16 | |
@clarkb:matrix.org | corvus: I've approved https://review.opendev.org/c/zuul/zuul-registry/+/799726 to help sort out ianw's issue with buildkit | 14:49 |
@clarkb:matrix.org | Just a heads up in case people notice new unexpected registry behavior once that deploys | 14:49 |
@jim:acmegating.com | Clark: since you're reviewing that stack, would you mind going to the end: https://review.opendev.org/799729 and https://review.opendev.org/803112 ? | 14:56 |
@jim:acmegating.com | fungi: ^ | 14:56 |
@jim:acmegating.com | Clark: fungi also -- do you get highlights/notifications for the messages above, or do i need to do this? | 14:59 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-registry] Add content-length headers and debug messages https://review.opendev.org/c/zuul/zuul-registry/+/791068 | 15:00 | |
@clarkb:matrix.org | corvus: just "Clark" seems to be sufficient | 15:01 |
@clarkb:matrix.org | I'm not sure if there is a difference in element but it highlighted both messages | 15:01 |
@clarkb:matrix.org | and ya I can review things | 15:01 |
@jim:acmegating.com | ok cool; the @ is easy/automatic when i write a msg in element, but less so when i use weechat. thanks :) | 15:02 |
@clarkb:matrix.org | Currently I don't have headphones on which means I missed the audible ping, but after breakfast I tend to put those on and listen to $audio and get the audible pings too | 15:03 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-registry] Add missing size to image configs https://review.opendev.org/c/zuul/zuul-registry/+/799726 | 15:10 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: | 15:24 | |
- [zuul/zuul-registry] Also include missing size attributes in layers https://review.opendev.org/c/zuul/zuul-registry/+/799729 | ||
- [zuul/zuul-registry] Return "scope" in auth challenge https://review.opendev.org/c/zuul/zuul-registry/+/803112 | ||
@gtema:matrix.org | anybody faced merger_failure starting 4.8.0? https://paste.opendev.org/show/808523/ | 15:31 |
@jim:acmegating.com | gtema: might want to look for oom killer messages | 15:33 |
@gtema:matrix.org | yeah, if I would have those - in the k8 log there is nothing | 15:33 |
@gtema:matrix.org | will try 4.8.1 and increase logging. This is happening periodically, but something like after 1 week of normal working | 15:34 |
@jim:acmegating.com | it could be related to a git operation, but if so, there should probably be more zuul logs about it. might be worth seeing if you can get access to system logs to check oom. | 15:35 |
@gtema:matrix.org | ugh, this is going to be tough. Anyway - thanks, will try | 15:36 |
@gtema:matrix.org | sadly I killed pods already | 15:36 |
@gtema:matrix.org | so need to wait for another occurrence | 15:36 |
@jim:acmegating.com | basically the thought process is: if zuul didn't log a reason for the git process crashing, something else probably killed it. | 15:37 |
@jim:acmegating.com | well, not the git process crashing, the zuul git executor-worker subprocess (which could have crashed because git crashed) | 15:37 |
@gtema:matrix.org | in this case we should try better exception handling, since in case of this crash you need to restart executor - no way it recovers anymore | 15:38 |
@jim:acmegating.com | gtema: it should recover | 15:38 |
@gtema:matrix.org | my practice showed it doesn't. I get a half day of failed jobs until I really kill pods | 15:38 |
@gtema:matrix.org | maybe need to wait longer, but that is not really great - users are getting mad | 15:39 |
@jim:acmegating.com | maybe you're encountering a bug then, but it's certainly the intent that it should recover automatically. | 15:40 |
@gtema:matrix.org | good to hear | 15:40 |
@gtema:matrix.org | in my case this message is just happening over and over again | 15:41 |
@fungicide:matrix.org | Clark: i do get highlights, but am temporarily using the webclient until i get time to rebuild my shell server where i run weechat and install the matrix plugin for it, so am not always at the computer where i have the browser up | 15:43 |
@jim:acmegating.com | fungi: ack :) | 15:43 |
@fungicide:matrix.org | (and my shellserver rebuild is tangled in moving to less costly cloud providers) | 15:45 |
@fungicide:matrix.org | corvus: and yeah, the changes at the end of that stack are what prompted me to start reviewing it, but i figured it was better to work my way through them in order, so started with the earliest unmerged changes | 15:54 |
@tobias.henkel:matrix.org | gtema: do you see "Process pool got broken" in your logs? It should log this when recovering the process pool. | 15:55 |
@gtema:matrix.org | yes - this is in my logs of executor | 15:56 |
@tobias.henkel:matrix.org | in that case it creates a new process pool | 15:56 |
@tobias.henkel:matrix.org | if the error frequently returns I guess the sub processes get killed over and over again | 15:56 |
@tobias.henkel:matrix.org | which indicates a too low memory limit of the pod | 15:57 |
@gtema:matrix.org | according to monitoring it has enough memory | 15:57 |
@gtema:matrix.org | and yes - this happens over and over again on merge | 15:57 |
@tobias.henkel:matrix.org | if the pod exceeds the memory limit the oom killer kills random processes in that cgroup | 15:57 |
@tobias.henkel:matrix.org | maybe your merge operation needs more ram for a really short time triggering this behavior | 15:58 |
@jim:acmegating.com | maybe a git merge is using a lot of memory quickly... yeah that | 15:58 |
@gtema:matrix.org | hmm, maybe | 15:58 |
@jim:acmegating.com | too quickly for the monitoring to notice | 15:58 |
@tobias.henkel:matrix.org | gtema: which metrics are you looking at concerning memory? | 15:58 |
@jim:acmegating.com | (and restarting the pod fixes it because it releases memory held elsewhere) | 15:59 |
@tobias.henkel:matrix.org | (and what's the memory limit) | 15:59 |
@gtema:matrix.org | my __not_so_well__ k8 as a service shows me memory consumption and cpu usage graph | 15:59 |
@tobias.henkel:matrix.org | also the ansible processes can take a lot of memory (actually unbounded depending on the job) | 15:59 |
@gtema:matrix.org | but also I have defined limits for the pod | 16:00 |
@gtema:matrix.org | and those apparently do not exceed | 16:00 |
@gtema:matrix.org | otherwise whole pod would be killed | 16:00 |
@tobias.henkel:matrix.org | no, if the container has more than one process, the oom killer mostly chooses not PID 1 which means the pod lives but a child process gets killed | 16:01 |
@tobias.henkel:matrix.org | and that's exactly what you're seeing | 16:01 |
@gtema:matrix.org | hmm | 16:01 |
@jim:acmegating.com | and the oom killer could be acting on a cgroup limit set by k8s? so possibly increasing the ram limit for the pod may help? | 16:02 |
@tobias.henkel:matrix.org | yepp, k8s sets a limit on the cgroup and the kernel kills processes in that cgroup if the processes in there take more memory | 16:04 |
@tobias.henkel:matrix.org | what's your current memory limit for the executors? | 16:04 |
@gtema:matrix.org | cpu:2 mem 2G | 16:04 |
@tobias.henkel:matrix.org | that seems way too low since that covers executor, git calls and all ansible processes of the jobs | 16:05 |
@gtema:matrix.org | what do you set? | 16:05 |
@tobias.henkel:matrix.org | I think I'd recommend a minimum of 8GB | 16:06 |
@tobias.henkel:matrix.org | we're running with 16gb | 16:06 |
@gtema:matrix.org | ok, will try going with 8 first | 16:06 |
@tobias.henkel:matrix.org | ansible can take up a large amount of memory (e.g. a slurp of a file into a variable occupies that amount of memory in the ansible process) | 16:07 |
@jim:acmegating.com | opendev's executors run on 8gb vms. they sustain 6gb usage easily. | 16:07 |
@gtema:matrix.org | yah, but my load is far from opendev ;-) | 16:07 |
@tobias.henkel:matrix.org | I'm seeing short job-caused spikes of 4-6GB in our system sometimes | 16:07 |
@gtema:matrix.org | but also huge projects (with huge git history eventually) | 16:07 |
@tobias.henkel:matrix.org | with less load you could try 4GB as a first step | 16:08 |
@tobias.henkel:matrix.org | and watch out for broken process pool, that's always a sign that the limit was too low | 16:08 |
@gtema:matrix.org | I do request for 1G and limit for 8 - this is not really a problem, since I run in dedicated cluster | 16:08 |
@tobias.henkel:matrix.org | you do overcommitment with memory? | 16:08 |
@jim:acmegating.com | yeah, i think the minimum would be hard to calculate, but there's a base line usage from the process itself, then increase that based on git repo size, then add ansible spikes. | 16:08 |
@gtema:matrix.org | > <@tobias.henkel:matrix.org> you do overcommitment with memory? | 16:09 |
nope - 4 nodes x16G | ||
@gtema:matrix.org | and only 2 executors + scalable merger | 16:09 |
@tobias.henkel:matrix.org | I mean by using different requests and limits | 16:09 |
@jim:acmegating.com | then after that, more jobs can cause more memory usage, but the executor will try to keep that under control. but that will only work above the minimum implied by all those other things. | 16:09 |
@tobias.henkel:matrix.org | (which is overcommitment on k8s level) | 16:09 |
@gtema:matrix.org | > <@tobias.henkel:matrix.org> (which is overcommitment on k8s level) | 16:10 |
yes - doing that: requests: cpu: 100m, mem: 1G, limits: cpu: 2, mem: 8G | ||
@gtema:matrix.org | > <@jim:acmegating.com> yeah, i think the minimum would be hard to calculate, but there's a base line usage from the process itself, then increase that based on git repo size, then add ansible spikes. | 16:11 |
well, looking to all spikes according to build-in metrics - I am always below | ||
@tobias.henkel:matrix.org | note that k8s only uses the requests for scheduling so with different requests and limits you might run into a risk of having too many executors on one node | 16:11 |
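A rough back-of-the-envelope sketch of the overcommitment risk described above, using the figures from this conversation (16G nodes, 1G request, 8G limit). The function name is hypothetical, and the arithmetic deliberately ignores system reservations, other pods, and affinity rules.

```python
def overcommit(node_mem_g, request_g, limit_g):
    """Return (pods that fit by request, memory those pods may use by limit)."""
    # Scheduling only looks at requests, so this many pods can land on one node.
    pods_by_request = node_mem_g // request_g
    # If every one of them grows to its limit, this much memory is needed.
    worst_case_g = pods_by_request * limit_g
    return pods_by_request, worst_case_g


pods, worst_case = overcommit(node_mem_g=16, request_g=1, limit_g=8)
print(pods, worst_case)
```

With requests equal to limits, the worst case can never exceed what the node was scheduled for.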
@gtema:matrix.org | I try to manage this with affinity | 16:12 |
@gtema:matrix.org | anyway - thanks a lot for the hints | 16:13 |
@clarkb:matrix.org | > <@tobias.henkel:matrix.org> which indicates a too low memory limit of the pod | 16:16 |
As a comparison any openstack/nova git operations that opendev does appear to consume 1GB of memory | ||
@gtema:matrix.org | > <@clarkb:matrix.org> As a comparison any openstack/nova git operations that opendev does appear to consume 1GB of memory | 16:16 |
hmm - maybe really the case | ||
@clarkb:matrix.org | tobiash: are you happy with corvus response in https://review.opendev.org/c/zuul/zuul/+/806062/ ? | 16:50 |
@tobias.henkel:matrix.org | yes, thanks | 16:52 |
@jim:acmegating.com | swest: hrm, the updated patch in 806653 that clears the event is still triggering "too many open files" errors. i've also instrumented the tests and run locally, and i can find no instances of the exception handler being triggered, or even any cases where the election loop ran more than once. so while i agree that was a bug, i don't think it was causing the open files problem. | 17:27 |
@westphahl:matrix.org | corvus: strange, it solved the issue for me at least. I added the clear in a different place but that shouldn't really matter | 17:42 |
@erbarr:matrix.org | what ports need to be open for zuul-web? I plan to run that on another VM to display publicly only the web service and corporate is asking me. Just 9000? | 18:05 |
@clarkb:matrix.org | erbarr: The service itself defaults to 9000, but you may want to put a proxy in front of it to do ssl and caching and host that on port 443 | 18:07 |
@clarkb:matrix.org | there is also the fingergw which will listen on port 79 to support finger. | 18:08 |
@erbarr:matrix.org | so zuul-fingergw also needs to run on the same VM as zuul-web? | 18:09 |
@clarkb:matrix.org | I don't think it does, but calling it out as another service with ports that need to be exposed so that things like console streaming work | 18:13 |
@erbarr:matrix.org | ahh, okay thanks! | 18:14 |
@jim:acmegating.com | swest: yeah, i'm stumped. where did you put the clear? | 18:15 |
@jim:acmegating.com | "ulimit -n" locally for me says 1024 | 18:16 |
@westphahl:matrix.org | corvus: right before the election.run(...) | 18:18 |
@jim:acmegating.com | 1024 is the same on the test nodes | 18:18 |
@westphahl:matrix.org | file limit was the same for me | 18:18 |
@jim:acmegating.com | i've logged into the host running https://zuul.opendev.org/t/zuul/stream/d9dc0cc0deeb4158865c1646277c1c18?logfile=console.log now | 18:19 |
@jim:acmegating.com | there really are a lot of open files according to lsof | 18:22 |
@jim:acmegating.com | the process is up to fd #732 | 18:22 |
@jim:acmegating.com | i'm taking snapshots so we can see if we have old threads not ageing out | 18:23 |
@jim:acmegating.com | the threads do seem to be deleted. we're just getting lots of fds | 18:26 |
@clarkb:matrix.org | corvus: if you end up fixing the fds thing you might want to add a fix for my -1 on https://review.opendev.org/c/zuul/zuul/+/806063 when you rebase and push | 18:28 |
@clarkb:matrix.org | it's a minor thing but worth fixing now | 18:28 |
@jim:acmegating.com | 825 fds, 255 are pipes, 105 are eventpoll, 141 are listening tcp sockets, 235 udp sockets | 18:29 |
@jim:acmegating.com | 1010 fds, 332 pipe, 125 eventpoll, 162 listening tcp sockets, 283 udp sockets | 18:31 |
@jim:acmegating.com | everything is increasing | 18:31 |
@jim:acmegating.com | i'm going to hold a node with the change before this one too | 18:32 |
@clarkb:matrix.org | Ya would be good to know if it happens prior to adding the election code | 18:33 |
@clarkb:matrix.org | it's possible the election code adds a few extra fds that tips it over, but we're already leaking | 18:33 |
@jim:acmegating.com | yeah that sounds plausible | 18:33 |
@jim:acmegating.com | and it's started emitting too many open files errors | 18:34 |
@jim:acmegating.com | i count 1045 entries in lsof, with a max fd number of 976 | 18:35 |
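A minimal sketch of how a per-type fd tally like the ones above could be taken from inside a process, assuming a Linux `/proc` filesystem. `classify` and `fd_summary` are hypothetical helper names. Unlike lsof, the `/proc` symlinks only say `socket:[inode]`, so this sketch cannot split TCP from UDP sockets.

```python
import os
from collections import Counter


def classify(target):
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor type."""
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("socket:"):
        return "socket"
    if target == "anon_inode:[eventpoll]":
        return "eventpoll"
    return "other"


def fd_summary(pid="self"):
    """Count open descriptors of a process by type (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were scanning
        counts[classify(target)] += 1
    return counts


if os.path.isdir("/proc/self/fd"):
    print(dict(fd_summary()))
```

Taking this snapshot at intervals shows which descriptor types keep growing.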
@clarkb:matrix.org | corvus: I left a note on 806653 though I don't think it is related to the fd leaks | 18:42 |
@jim:acmegating.com | wow, the pre-election change is hitting a max fd # of 8 so far. | 19:02 |
@jim:acmegating.com | so we're not looking at a last-straw situation, it's something bonkers in that change | 19:03 |
@jim:acmegating.com | oh, strike that, i think i just checked too early | 19:06 |
@jim:acmegating.com | okay, this does seem to be leaking as well, but perhaps more slowly | 19:12 |
@jim:acmegating.com | so this may be a polynomial issue... we may have leaked one more file descriptor times X tests? or something like that | 19:13 |
@jim:acmegating.com | (where X is around 1100/4 or maybe /2 depending on our cpu count / parallelization) | 19:13 |
@jim:acmegating.com | i'm starting to lean toward there being a structural problem with the test harness and this just pushes it over the edge | 19:14 |
@jim:acmegating.com | up to max fd 638 on the pre-election change | 19:36 |
@erbarr:matrix.org | hi clark, I'm looking at the diagram at https://zuul-ci.org/docs/zuul/discussion/components.html#overview that shows web connecting to zookeeper, executor, gearman, database, github. Do each one of those represent a port that needs to be open or just what you pointed out earlier? | 20:46 |
@clarkb:matrix.org | > <@erbarr:matrix.org> hi clark, I'm looking at the diagram at https://zuul-ci.org/docs/zuul/discussion/components.html#overview that shows web connecting to zookeeper, executor, gearman, database, github. Do each one of those represent a port that needs to be open or just what you pointed out earlier? | 20:47 |
each of those represents a network connection. Whether or not you'll need to open those ports depends on your network topology | ||
@erbarr:matrix.org | so if web is on its own VM, and every other service is on a different VM, then web needs those ports open? | 20:50 |
@clarkb:matrix.org | if there is a firewall between the VMs then yes | 20:50 |
@erbarr:matrix.org | okay then yea, thanks! | 20:51 |
@jim:acmegating.com | okay the beginning of a clue -- if i run 2 unit tests then see what files are open, without the election stuff it looks like we leak 17 fds, but with it we leak 21. i have to run 2 tests to see that though; just running one test with and without the election stuff there's no difference. | 21:15 |
@jim:acmegating.com | (an apparent leak of 14 fds either way in that case) | 21:16 |
@jim:acmegating.com | an extra udp socket, an extra tcp listener; none of those make sense. an extra eventpoll -- that does make sense. | 21:23 |
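A minimal sketch of the before/after comparison described above, assuming a Unix system: record which descriptors are open, run some code, and report what is newly open afterwards. `open_fds` and `leaked_by` are hypothetical helpers, not part of Zuul's test harness.

```python
import os
import resource


def open_fds(max_scan=4096):
    """Return the set of descriptor numbers currently open in this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)  # what `ulimit -n` shows
    limit = max_scan if soft == resource.RLIM_INFINITY else min(soft, max_scan)
    fds = set()
    for fd in range(limit):
        try:
            os.fstat(fd)  # raises OSError if fd is not open
        except OSError:
            continue
        fds.add(fd)
    return fds


def leaked_by(func):
    """Run func() and return the descriptors that are open afterwards but were not before."""
    before = open_fds()
    func()
    return open_fds() - before


held = []


def leaky():
    held.extend(os.pipe())  # both pipe ends stay open after returning


def tidy():
    r, w = os.pipe()
    os.close(r)
    os.close(w)


print(len(leaked_by(leaky)), len(leaked_by(tidy)))
```

Wrapping one test and then two tests in `leaked_by` is one way to see whether the leak grows per test.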
@iwienand:matrix.org | corvus: thanks for fixing the registry size issue! | 21:40 |
@jim:acmegating.com | ianw: yw :) | 21:41 |
@iwienand:matrix.org | corvus: hrm | 22:05 |
@iwienand:matrix.org | failed commit on ref "layer-sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1": "layer-sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1" failed size validation: 7670 != 7671: failed precondition | 22:05 |
@iwienand:matrix.org | that's a suspiciously off-by-one value ... | 22:05 |
@iwienand:matrix.org | https://zuul.opendev.org/t/openstack/build/0cd4979893714910ae45b3191b6b3175 was the job | 22:05 |
@jim:acmegating.com | ianw: unfortunately i'm deep into trying to figure out why the zuul unit tests are leaking file descriptors, i don't think i'll be able to dig into that right now | 22:06 |
@iwienand:matrix.org | corvus: that's ok. do you think https://review.opendev.org/c/opendev/base-jobs/+/806818 is ok to make buildset-registry jobs fail so i can put one on hold? | 22:07 |
@iwienand:matrix.org | i found it difficult to get started debugging | 22:07 |
@jim:acmegating.com | ianw: seems reasonable, but if there's a registry protocol error it may be easier to just run the registry and builds locally. also, ansible has a 'fail' module, btw. | 22:09 |
@iwienand:matrix.org | ahh, yes, fail is probably better. i was having trouble getting my environment talking to the intermediate mirror; it was nice the jobs set all that up | 22:10 |
@iwienand:matrix.org | i mean zuul-registry instance i was trying to run as an intermediate mirror, not our intermediate mirror, blerg a lot of parts ... :) | 22:11 |
@jim:acmegating.com | yeah, if you need the whole system, that's the best way. if you can narrow it down to a build and upload command, then local is a lot easier | 22:14 |
@jim:acmegating.com | i found 2 things we're leaking in tests: the statsd udp listener, and the fake prometheus server | 22:14 |
@jim:acmegating.com | none of those have anything to do with the election change, but maybe fixing those will give us headroom back | 22:15 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 22:32 | |
- [zuul/zuul] WIP: only create stop watcher event if necessary https://review.opendev.org/c/zuul/zuul/+/806994 | ||
- [zuul/zuul] Reduce leaked file descriptors in tests https://review.opendev.org/c/zuul/zuul/+/806995 | ||
@jim:acmegating.com | okay, i like the first change anyway, but don't know if it'll do anything, so i pushed it up separately as an experiment. the second change should definitely improve things. after we get results, i'll move the second change to early in the stack and then squash the first change into the election change. (Clark, swest, tobiash FYI) | 22:34 |
@jim:acmegating.com | even after that, we're still leaking some udp socket listeners. i have no idea what they could be. | 22:35 |
@iwienand:matrix.org | https://zuul.opendev.org/t/openstack/build/203c52dfa57d428e993b79b926643dcf/log/docker/buildset_registry.txt#70 is the registry entry of the assets upload | 22:38 |
@iwienand:matrix.org | Update upload metadata chunks: [{'size': 7670, 'md5': '54a131e3f12627a57edf174190aabebf'}, {'size': 0, 'md5': 'd41d8cd98f00b204e9800998ecf8427e'}] | 22:38 |
@jim:acmegating.com | so on the download the client received one byte more than we received on the upload? | 22:39 |
@iwienand:matrix.org | docker is saying "7670 != 7671: failed precondition"; not 100% sure which side is the expected and which the received | 22:39 |
@jim:acmegating.com | ianw: note this line: https://zuul.opendev.org/t/openstack/build/203c52dfa57d428e993b79b926643dcf/log/docker/buildset_registry.txt#247 | 22:40 |
@jim:acmegating.com | cherrypy said it sent 7670 bytes in that get request (but i don't know if that was the docker request) | 22:40 |
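A small sketch that sanity-checks the upload metadata quoted above: it sums the chunk sizes and flags chunks whose MD5 is that of zero bytes. The chunk values are copied from the log line; the helper names are hypothetical and not zuul-registry functions.

```python
import hashlib

# Chunk metadata as quoted from the registry log above.
chunks = [
    {"size": 7670, "md5": "54a131e3f12627a57edf174190aabebf"},
    {"size": 0, "md5": "d41d8cd98f00b204e9800998ecf8427e"},
]

EMPTY_MD5 = hashlib.md5(b"").hexdigest()  # digest of zero bytes of input


def total_size(chunk_list):
    """Sum of the sizes recorded for every uploaded chunk."""
    return sum(chunk["size"] for chunk in chunk_list)


def empty_chunks(chunk_list):
    """Chunks that carry no data: size 0 and the MD5 of empty input."""
    return [c for c in chunk_list if c["size"] == 0 and c["md5"] == EMPTY_MD5]


print(total_size(chunks), len(empty_chunks(chunks)))
```

By this arithmetic the second chunk is empty and the recorded upload totals 7670 bytes, so the extra byte in 7671 is not accounted for by the chunk metadata.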
@iwienand:matrix.org | https://github.com/containerd/containerd/blob/fb589a71331869b141c83afb098152359f44b8c4/metadata/content.go#L626 i think is the code putting out the error | 22:42 |
@iwienand:matrix.org | i really hate it when you use the same error message in different paths | 22:45 |
@jim:acmegating.com | that's why i try not to do that :) | 22:46 |
@jim:acmegating.com | ianw: if you have a held registry node, you could run a GET on that url, or check the filesystem and make sure that it's reporting 7670 and not 71 | 22:46 |
@iwienand:matrix.org | yeah i can put back in the pause and at least fiddle until timeout kicks in | 22:49 |
@iwienand:matrix.org | ok, it failed again and is paused for a while. 198.72.124.20 is the registry and 198.72.124.29 the builder | 23:05 |
@iwienand:matrix.org | just "FROM opendevorg/assets" replicates it, that's good. it must try and download it 3 times | 23:07 |
@clarkb:matrix.org | ianw: let me know if I can help further with that off by one thing. Just got back from a bike ride | 23:10 |
@iwienand:matrix.org | corvus: hrm, i guess you need to setup authentication to run a curl request? | 23:12 |
@clarkb:matrix.org | ianw: ya the docker protocol requires auth tokens for even anonymous requests | 23:14 |
@clarkb:matrix.org | you basically query for a token without giving it creds and it sends a token back that you use and the zuul-registry mimics that iirc | 23:14 |
@iwienand:matrix.org | root@9d5bac229fa4:/storage/_local/blobs/sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1# ls -l | 23:14 |
total 12 | ||
-rw-r--r-- 1 root root 7670 Sep 1 23:02 data | ||
@iwienand:matrix.org | so the registry is seeing 0fb1 as 7670 on disk | 23:14 |
@iwienand:matrix.org | https://hub.docker.com/support/doc/how-do-i-authenticate-with-the-v2-api i guess | 23:15 |
@clarkb:matrix.org | ya that looks right | 23:19 |
@iwienand:matrix.org | hrm, i'm not having much luck | 23:21 |
@iwienand:matrix.org | using the password from /home/zuul/buildset_registry | 23:21 |
@clarkb:matrix.org | and the url was updated to use the buildset registry? | 23:22 |
@clarkb:matrix.org | I wonder if this is a case where the registry test suite might be easier to debug with than the deployed service? | 23:23 |
@iwienand:matrix.org | curl -s -H "Content-Type: application/json" -X POST -d '{"username": "'zuul'", "password": "'...blah...'"}' https://zuul-jobs.buildset-registry:5000/v2/users/login | 23:23 |
@clarkb:matrix.org | Are those extra quotes in the json being sent or is that an artifact of element quoting? | 23:24 |
@clarkb:matrix.org | but that could be it if so? | 23:24 |
@iwienand:matrix.org | curl -s 'https://zuul-jobs.buildset-registry:5000/auth/token?scope=read%3A%3A&scope=repository%3Aopendevorg%2Fassets%3Apull' | 23:29 |
@iwienand:matrix.org | curl -s -H "Authorization: JWT ${TOKEN}" https://zuul-jobs.buildset-registry:5000/v2/opendevorg/assets/blobs/sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1?ns=docker.io | 23:34 |
@iwienand:matrix.org | isn't working | 23:34 |
@clarkb:matrix.org | is the namespace correct? I assume so since we don't upload elsewhere | 23:36 |
@iwienand:matrix.org | ok "Authorization: Bearer ${TOKEN}" seems to work...? | 23:37 |
@iwienand:matrix.org | -rw-r--r-- 1 root root 7670 Sep 1 23:37 test | 23:38 |
@iwienand:matrix.org | ok, it is returning me a file of size 7670 | 23:38 |
@iwienand:matrix.org | this suggests the manifest size is wrong? | 23:38 |
@clarkb:matrix.org | ianw: yes, I think you can request the manifest json data too and confirm | 23:39 |
@clarkb:matrix.org | but I'm not sure what that path is | 23:39 |
@iwienand:matrix.org | [01/Sep/2021:23:41:13] "GET /v2/opendevorg/assets/manifests/latest?ns=docker.io HTTP/1.1" 200 776 "" "containerd/1.4.0+unknown" | 23:43 |
@iwienand:matrix.org | [01/Sep/2021:23:42:56] "GET /v2/opendevorg/assets/manifests/latest?ns=docker.io HTTP/1.1" 404 735 "" "curl/7.68.0 | 23:44 |
@iwienand:matrix.org | containerd gets it ok, i get a 404 | 23:44 |
@clarkb:matrix.org | looking at the code it appears you have to set a content type header | 23:46 |
@clarkb:matrix.org | https://opendev.org/zuul/zuul-registry/src/branch/master/zuul_registry/main.py#L388-L389 | 23:47 |
@clarkb:matrix.org | Do you need to set Accept: for each of those types? | 23:48 |
@iwienand:matrix.org | yep, just doing that :) | 23:48 |
@clarkb:matrix.org | Without the Accept: header it falls through to the 404 path | 23:48 |
@iwienand:matrix.org | curl -H "Accept: application/vnd.docker.distribution.manifest.v2+json" -X GET -H "Authorization: Bearer ${TOKEN}" 'https://zuul-jobs.buildset-registry:5000/v2/opendevorg/assets/manifests/latest?ns=docker.io' | 23:49 |
@iwienand:matrix.org | {"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", "size": 7671, "digest": "sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1"} | 23:49 |
@iwienand:matrix.org | ok, so the manifest is saying 7671 | 23:49 |
@iwienand:matrix.org | we know the file on disk is 7670 | 23:50 |
@iwienand:matrix.org | it seems like it should ultimately be passing back os.stat(path).st_size | 23:52 |
@jim:acmegating.com | it returns what it's given | 23:54 |
@clarkb:matrix.org | https://review.opendev.org/c/zuul/zuul-registry/+/799726/2/zuul_registry/main.py was the change | 23:55 |
@clarkb:matrix.org | size = self.storage.blob_size(namespace, digest) | 23:55 |
@clarkb:matrix.org | ianw: I think it does do what you suggest: https://opendev.org/zuul/zuul-registry/src/branch/master/zuul_registry/filesystem.py#L40-L44 | 23:56 |
@jim:acmegating.com | that's specifically for an image config... | 23:56 |
@jim:acmegating.com | ianw: can you paste the complete output of that curl? | 23:56 |
@jim:acmegating.com | also, what's the ip of the buildset registry? | 23:57 |
@clarkb:matrix.org | https://review.opendev.org/c/zuul/zuul-registry/+/799729/1/zuul_registry/main.py would be for the layers | 23:57 |
@iwienand:matrix.org | https://paste.opendev.org/show/808529/ | 23:57 |
@iwienand:matrix.org | 198.72.124.20 is the registry and 198.72.124.29 the builder | 23:58 |
@jim:acmegating.com | okay, so iff the builder didn't include a size for the layer in the manifest, then we would have figured the size ourselves and added it. | 23:59 |
@jim:acmegating.com | but we don't know whether that happened or not -- you might be able to check the builder and see what it looks like there | 23:59 |
@jim:acmegating.com | if it came with a size, then we didn't touch it. | 23:59 |
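
The check worked out by hand above (fetch the manifest with a Bearer token and an explicit Accept header, then compare each layer's declared size with the size of the blob actually stored) could be scripted roughly as follows. This is only a sketch under assumptions: the helper names `registry_request`, `fetch_manifest`, `size_on_disk` and `find_size_mismatches` are hypothetical and not part of zuul-registry, the token is assumed to have been obtained already from the `/auth/token` endpoint shown in the log, and the manifest is assumed to follow the Docker v2 schema with a top-level `layers` list.

```python
import json
import os
import urllib.request

# Media type passed in the Accept header in the log; without an Accept
# header the manifest request fell through to a 404.
MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def registry_request(url, token, accept=None):
    """Build (but do not send) an authenticated GET request (hypothetical helper)."""
    headers = {"Authorization": f"Bearer {token}"}
    if accept:
        headers["Accept"] = accept
    return urllib.request.Request(url, headers=headers)


def fetch_manifest(url, token):
    """Fetch and decode a manifest; requires a reachable registry."""
    request = registry_request(url, token, MANIFEST_V2)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def size_on_disk(path):
    """Size in bytes of a stored blob file, as suggested in the log."""
    return os.stat(path).st_size


def find_size_mismatches(manifest, actual_sizes):
    """Return (digest, declared, actual) for layers whose sizes disagree.

    actual_sizes maps a digest to the byte count observed on disk or in a
    downloaded blob; layers without an observed size are skipped.
    """
    mismatches = []
    for layer in manifest.get("layers", []):
        actual = actual_sizes.get(layer["digest"])
        if actual is not None and layer["size"] != actual:
            mismatches.append((layer["digest"], layer["size"], actual))
    return mismatches


if __name__ == "__main__":
    # Values taken from the log: the manifest declares 7671 bytes,
    # while the blob on disk is 7670 bytes.
    digest = "sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1"
    manifest = {"layers": [{
        "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
        "size": 7671,
        "digest": digest,
    }]}
    print(find_size_mismatches(manifest, {digest: 7670}))
```

A mismatch found this way only shows that the manifest and the stored blob disagree. As noted at the end of the log, it does not show whether the builder supplied the size or the registry filled it in.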