Wednesday, 2021-09-01

-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821 00:02
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:00:06
- [zuul/zuul] Make node requests persistent https://review.opendev.org/c/zuul/zuul/+/806280
- [zuul/zuul] Add node request cache to zk nodepool interface https://review.opendev.org/c/zuul/zuul/+/806639
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821
@jim:acmegating.comswest, tobiash, Clark: i think the nodepool stack with hashtag:sos is ready for review. (starts at 806813)00:15
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] Update IRC nics with Matrix IDs https://review.opendev.org/c/zuul/zuul/+/806640 00:59
@jim:acmegating.comthere's something weird happening with changes after https://review.opendev.org/806653 reporting too many open files in the unit tests.  i don't understand that yet, and i don't see it locally when i run the full test suite.02:31
@jim:acmegating.comi'm running python 3.9 locally02:31
-@gerrit:opendev.org- Simon Westphahl proposed:05:58
- [zuul/zuul] Add source interface for setting change attributes https://review.opendev.org/c/zuul/zuul/+/805836
- [zuul/zuul] Reference change dependencies by key https://review.opendev.org/c/zuul/zuul/+/805844
- [zuul/zuul] Implement ABC for caching changes in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805835
- [zuul/zuul] Cache Gerrit refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805837
- [zuul/zuul] Cache Github refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805838
- [zuul/zuul] Cache Pagure refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806556
- [zuul/zuul] Cache Gitlab refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806557
- [zuul/zuul] Cache Git refs (driver) in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806755
-@gerrit:opendev.org- Simon Westphahl proposed:06:27
- [zuul/zuul] Add source interface for setting change attributes https://review.opendev.org/c/zuul/zuul/+/805836
- [zuul/zuul] Reference change dependencies by key https://review.opendev.org/c/zuul/zuul/+/805844
- [zuul/zuul] Implement ABC for caching changes in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805835
- [zuul/zuul] Cache Gerrit refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805837
- [zuul/zuul] Cache Github refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805838
- [zuul/zuul] Cache Pagure refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806556
- [zuul/zuul] Cache Gitlab refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806557
- [zuul/zuul] Cache Git refs (driver) in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806755
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/nodepool] Add Tenant-Scoped Resource Quota https://review.opendev.org/c/zuul/nodepool/+/800765 07:07
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul] Disable aliases in inventory.yaml for better readibility https://review.opendev.org/c/zuul/zuul/+/802674 07:20
-@gerrit:opendev.org- Simon Westphahl proposed:07:38
- [zuul/zuul] Implement ABC for caching changes in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805835
- [zuul/zuul] Cache Gerrit refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805837
- [zuul/zuul] Cache Github refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/805838
- [zuul/zuul] Cache Pagure refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806556
- [zuul/zuul] Cache Gitlab refs in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806557
- [zuul/zuul] Cache Git refs (driver) in Zookeeper https://review.opendev.org/c/zuul/zuul/+/806755
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] Periodically maintain connection caches https://review.opendev.org/c/zuul/zuul/+/806756 09:32
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] Reporter: add info on job retries https://review.opendev.org/c/zuul/zuul/+/806722 10:08
@westphahl:matrix.orgcorvus: re the too many open files issue: you are busy looping in the election since you never clear the event12:35
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] Reporter: add info on job retries https://review.opendev.org/c/zuul/zuul/+/806722 12:44
@jim:acmegating.comswest: but the only time the event is set, we also set _stopped=True, so it should exit the election thread13:22
@westphahl:matrix.orgcorvus: you are also setting it in case of an exception in _handleNodeRequestEvent13:23
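A minimal sketch of the busy loop swest describes, assuming the election thread waits on a `threading.Event` between runs. The function name and loop shape are hypothetical and are not taken from the Zuul change; the point is only that an event which is set once and never cleared makes every later `wait()` return immediately.

```python
import threading


def count_immediate_wakeups(clear_after_wake, rounds=5):
    """Count how many wait() calls return True for an event set once."""
    wake = threading.Event()
    wake.set()  # e.g. set once from an exception handler
    wakeups = 0
    for _ in range(rounds):
        # wait() returns True right away while the event stays set,
        # and returns False after the timeout once it has been cleared.
        if wake.wait(timeout=0.01):
            wakeups += 1
            if clear_after_wake:
                wake.clear()
    return wakeups


print(count_immediate_wakeups(clear_after_wake=False))  # 5
print(count_immediate_wakeups(clear_after_wake=True))   # 1
```

Without the `clear()`, every round wakes up and would re-enter whatever work follows the wait; with it, only the first round does.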
@jpew:matrix.orgIs there a way to make the nodepool k8s driver add a volume mount when it creates the pod? (It looks like no from my reading). If not, would that be something amenable to be added?13:24
@avass:vassast.orgjpew: I don't think that's possible at the moment13:25
@avass:vassast.orgjpew: I guess you mean a persistent volume?13:25
@jpew:matrix.orgWell, specifically I'd like an `ephermealVolume` but ya13:26
@jim:acmegating.comswest: ah good point -- i don't expect that to happen in the tests, but maybe it does?13:26
@avass:vassast.orgjpew: oh what's the usecase?13:26
@westphahl:matrix.orgcorvus: I think it has to do with how many tests you run in parallel13:27
@jpew:matrix.orgI have some "fast" disks I want to use as the workdir when building13:27
@westphahl:matrix.orgcorvus: I can easily reproduce it with my 72 core machine ;)13:27
@jim:acmegating.comswest: maybe i just have my max open files set very high13:27
@avass:vassast.orgit would make sense to add that to the kubernetes driver13:28
@westphahl:matrix.orgor that13:28
@jim:acmegating.comswest: but i'm still surprised we're actually hitting that in the tests13:28
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:13:32
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821
@gtema:matrix.orghow do I override zuul_return? I need to alter log_url (whatever is produced by upload-logs-swift role)13:46
@jim:acmegating.comgtema: if this is your use case: https://review.opendev.org/776677 may be relevant.  otherwise, just run zuul_return again and set the new value.13:48
@gtema:matrix.orgcorvus: I need to modify it, and not to simply rewrite (I want to  replace domain name)13:49
@gtema:matrix.orgmeans I need to get the value first13:50
@jim:acmegating.comgtema: sounds like 776677 is your use case13:50
@gtema:matrix.orgok, digging. thks13:50
@jim:acmegating.comsomeone else was driving that effort but they stopped working on it when it was almost done :(13:50
@gtema:matrix.orguh, sad13:51
@jim:acmegating.comjust need 2 small changes written then i think we can merge it.13:51
@jim:acmegating.comthough we might need to re-start the clock on the 2 week deprecation notice with another announcement to zuul-announce13:51
@gtema:matrix.orghmm13:52
@gtema:matrix.orgokay, assuming we get this as a solution, what is an alternative for really getting return value (so that it can be altered)?13:52
@gtema:matrix.orgis it like roles//pull-from-intermediate-registry/tasks/main.yaml - reading from results.json?13:53
@jim:acmegating.comgtema: it's just a json file, you can read it.13:53
@jim:acmegating.combut honestly this is like 5 minutes of work to push this over the line.13:53
@gtema:matrix.orgyeah, but 2 weeks waiting - this is what hurts, since I need this now13:53
@gtema:matrix.orgshould I work on your comments in the change?13:54
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-registry] Build images on bionic https://review.opendev.org/c/zuul/zuul-registry/+/799728 13:54
@jim:acmegating.comgtema: we don't need to wait 2 weeks to merge that change, we just need the pre-requisites ready to go13:54
@gtema:matrix.orgoki13:54
@avass:vassast.orggtema: you could also always copy the role, update it, and put it in another project that is loaded before zuul-jobs to shadow the original role13:58
@gtema:matrix.orgcorvus, what point 3 in this change refers to? which quick-download?13:58
@gtema:matrix.org> <@avass:vassast.org> gtema: you could also always copy the role, update it, and put it in another project that is loaded before zuul-jobs to shadow the original role13:58
sure - the most terrible way of workarounding things :-)
@avass:vassast.org:)13:58
@jim:acmegating.comgtema: i think it's used here: https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/post-logs.yaml#L8 14:01
@jim:acmegating.comgtema: that's the thing that broke in opendev when we merged it the first time14:01
@gtema:matrix.orgoookay, looking14:02
@jim:acmegating.comgtema: i think that role is defined in zuul-jobs too?  sorry i have to run now14:02
@gtema:matrix.orgthks - found the role14:02
@gtema:matrix.orgwill try to pick up from here14:02
@gtema:matrix.orgthanks14:03
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-registry] Add a restricted mode (read authentication required) https://review.opendev.org/c/zuul/zuul-registry/+/788389 14:12
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-registry] Make TLS optional https://review.opendev.org/c/zuul/zuul-registry/+/788390 14:16
@clarkb:matrix.orgcorvus: I've approved https://review.opendev.org/c/zuul/zuul-registry/+/799726 to help sort out ianw's issue with buildkit14:49
@clarkb:matrix.orgJust a heads up in case people notice new unexpected registry behavior once that deploys14:49
@jim:acmegating.comClark: since you're reviewing that stack, would you mind going to the end: https://review.opendev.org/799729 and https://review.opendev.org/803112 ?14:56
@jim:acmegating.comfungi: ^14:56
@jim:acmegating.comClark: fungi also -- do you get highlights/notifications for the messages above, or do i need to do this?14:59
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-registry] Add content-length headers and debug messages https://review.opendev.org/c/zuul/zuul-registry/+/791068 15:00
@clarkb:matrix.orgcorvus: just "Clark" seems to be sufficient15:01
@clarkb:matrix.orgI'm not sure if there is a difference in element but it highlighted both messages15:01
@clarkb:matrix.organd ya I can review things15:01
@jim:acmegating.comok cool; the @ is easy/automatic when i write a msg in element, but less so when i use weechat.  thanks :)15:02
@clarkb:matrix.orgCurrently I don't have headphones on which means I missed the audible ping, but after breakfast I tend to put those on and listen to $audio and get the audible pings too15:03
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-registry] Add missing size to image configs https://review.opendev.org/c/zuul/zuul-registry/+/799726 15:10
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com:15:24
- [zuul/zuul-registry] Also include missing size attributes in layers https://review.opendev.org/c/zuul/zuul-registry/+/799729
- [zuul/zuul-registry] Return "scope" in auth challenge https://review.opendev.org/c/zuul/zuul-registry/+/803112
@gtema:matrix.organybody faced merger_failure starting 4.8.0? https://paste.opendev.org/show/808523/ 15:31
@jim:acmegating.comgtema: might want to look for oom killer messages15:33
@gtema:matrix.orgyeah, if I would have those - in the k8 log there is nothing15:33
@gtema:matrix.orgwill try 4.8.1 and increase logging. This is happening periodically, but something like after 1 week of normal working15:34
@jim:acmegating.comit could be related to a git operation, but if so, there should probably be more zuul logs about it.  might be worth seeing if you can get access to system logs to check oom.15:35
@gtema:matrix.orgugh, this is going to be tough. Anyway - thanks, will try15:36
@gtema:matrix.orgsadly I killed pods already15:36
@gtema:matrix.orgso need to wait for another occurrence15:36
@jim:acmegating.combasically the thought process is: if zuul didn't log a reason for the git process crashing, something else probably killed it.15:37
@jim:acmegating.comwell, not the git process crashing, the zuul git executor-worker subprocess (which could have crashed because git crashed)15:37
@gtema:matrix.orgin this case we should try better exception handling, since in case of this crash you need to restart executor - no way it recovers anymore15:38
@jim:acmegating.comgtema: it should recover15:38
@gtema:matrix.orgmy practice showed it doesn't. I get a half day of failed jobs until I really kill pods15:38
@gtema:matrix.orgmaybe need to wait longer, but that is not really great - users are getting mad15:39
@jim:acmegating.commaybe you're encountering a bug then, but it's certainly the intent that it should recover automatically.15:40
@gtema:matrix.orggood to hear15:40
@gtema:matrix.orgin my case this message is just happening over and over again15:41
@fungicide:matrix.orgClark: i do get highlights, but am temporarily using the webclient until i get time to rebuild my shell server where i run weechat and install the matrix plugin for it, so am not always at the computer where i have the browser up15:43
@jim:acmegating.comfungi: ack :)15:43
@fungicide:matrix.org(and my shellserver rebuild is tangled in moving to less costly cloud providers)15:45
@fungicide:matrix.orgcorvus: and yeah, the changes at the end of that stack are what prompted me to start reviewing it, but i figured it was better to work my way through them in order, so started with the earliest unmerged changes15:54
@tobias.henkel:matrix.orggtema: do you see "Process pool got broken" in your logs? It should log this when recovering the process pool.15:55
@gtema:matrix.orgyes - this is in my logs of executor15:56
@tobias.henkel:matrix.orgin that case it creates a new process pool15:56
@tobias.henkel:matrix.orgif the error frequently returns I guess the sub processes get killed over and over again15:56
@tobias.henkel:matrix.orgwhich indicates a too low memory limit of the pod15:57
@gtema:matrix.orgaccording to monitoring it has enough memory15:57
@gtema:matrix.organd yes - this happens over and over again on merge15:57
@tobias.henkel:matrix.orgif the pod exceeds the memory limit the oom killer kills random processes in that cgroup15:57
@tobias.henkel:matrix.orgmaybe your merge operation needs more ram for a really short time triggering this behavior15:58
@jim:acmegating.commaybe a git merge is using a lot of memory quickly... yeah that15:58
@gtema:matrix.orghmm, maybe15:58
@jim:acmegating.comtoo quickly for the monitoring to notice15:58
@tobias.henkel:matrix.orggtema: which metrics are you looking at concerning memory?15:58
@jim:acmegating.com(and restarting the pod fixes it because it releases memory held elsewhere)15:59
@tobias.henkel:matrix.org(and what's the memory limit)15:59
@gtema:matrix.orgmy __not_so_well__ k8 as a service shows me memory consumption and cpu usage graph15:59
@tobias.henkel:matrix.orgalso the ansible processes can take a lot of memory (actually unbounded depending on the job)15:59
@gtema:matrix.orgbut also I have defined limits for the pod16:00
@gtema:matrix.organd those apparently do not exceed16:00
@gtema:matrix.orgotherwise whole pod would be killed16:00
@tobias.henkel:matrix.orgno, if the container has more than one process, the oom killer mostly chooses not PID 1 which means the pod lives but a child process gets killed16:01
@tobias.henkel:matrix.organd that's exactly what you're seeing16:01
@gtema:matrix.orghmm16:01
@jim:acmegating.comand the oom killer could be acting on a cgroup limit set by k8s?  so possibly increasing the ram limit for the pod may help?16:02
@tobias.henkel:matrix.orgyepp, k8s sets a limit on the cgroup and the kernel kills processes in that cgroup if the processes in there take more memory16:04
@tobias.henkel:matrix.orgwhat's your current memory limit for the executors?16:04
@gtema:matrix.orgcpu:2 mem 2G16:04
@tobias.henkel:matrix.orgthat seems way too low since that covers executor, git calls and all ansible processes of the jobs16:05
@gtema:matrix.orgwhat do you set?16:05
@tobias.henkel:matrix.orgI think I'd recommend a minimum of 8GB16:06
@tobias.henkel:matrix.orgwe're running with 16gb16:06
@gtema:matrix.orgok, will try going with 8 first16:06
@tobias.henkel:matrix.organsible can take up a large amount of memory (e.g. a slurp of a file into a variable occupies that amount of memory in the ansible process)16:07
@jim:acmegating.comopendev's executors run on 8gb vms.  they sustain 6gb usage easily.16:07
@gtema:matrix.orgyah, but my load is far from opendev ;-)16:07
@tobias.henkel:matrix.orgI'm seeing short job-caused spikes of 4-6GB in our system sometimes16:07
@gtema:matrix.orgbut also huge projects (with huge git history eventually)16:07
@tobias.henkel:matrix.orgwith less load you could try 4GB as a first step16:08
@tobias.henkel:matrix.organd watch out for broken process pool, that's always a sign that the limit was too low16:08
@gtema:matrix.orgI do request for 1G and limit for 8 - this is not really a problem, since I run in dedicated cluster16:08
@tobias.henkel:matrix.orgyou do overcommitment with memory?16:08
@jim:acmegating.comyeah, i think the minimum would be hard to calculate, but there's a base line usage from the process itself, then increase that based on git repo size, then add ansible spikes.16:08
@gtema:matrix.org> <@tobias.henkel:matrix.org> you do overcommitment with memory?16:09
nope - 4 nodes x16G
@gtema:matrix.organd only 2 executors + scalable merger16:09
@tobias.henkel:matrix.orgI mean by using different requests and limits16:09
@jim:acmegating.comthen after that, more jobs can cause more memory usage, but the executor will try to keep that under control.  but that will only work above the minimum implied by all those other things.16:09
@tobias.henkel:matrix.org(which is overcommitment on k8s level)16:09
@gtema:matrix.org> <@tobias.henkel:matrix.org> (which is overcommitment on k8s level)16:10
yes - doing that: requests: cpu: 100m, mem: 1G, limits: cpu: 2, mem: 8G
@gtema:matrix.org> <@jim:acmegating.com> yeah, i think the minimum would be hard to calculate, but there's a base line usage from the process itself, then increase that based on git repo size, then add ansible spikes.16:11
well, looking at all spikes according to built-in metrics - I am always below
@tobias.henkel:matrix.orgnote that k8s only uses the requests for scheduling so with different requests and limits you might run into a risk of having too many executors on one node16:11
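A rough arithmetic sketch of tobiash's point, using the figures gtema gave above (16G nodes, 1G memory request, 8G memory limit). It assumes memory is the only constraint and ignores CPU requests, memory reserved for the system, other pods, and the affinity rules gtema mentions; the function name is hypothetical.

```python
def overcommit(node_mem_gib, request_gib, limit_gib):
    """Return (pods that fit by request, worst-case memory if all hit the limit)."""
    # Scheduling is based on requests only, so this many pods could fit.
    max_pods = node_mem_gib // request_gib
    # If every one of them grew to its limit, this much memory would be needed.
    worst_case_gib = max_pods * limit_gib
    return max_pods, worst_case_gib


pods, worst = overcommit(node_mem_gib=16, request_gib=1, limit_gib=8)
print(pods, worst)  # 16 128
```

The gap between 128G of potential usage and 16G of node memory is the overcommitment being discussed.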
@gtema:matrix.orgI try manage this with affinity16:12
@gtema:matrix.organyway - thanks a lot for the hints16:13
@clarkb:matrix.org> <@tobias.henkel:matrix.org> which indicates a too low memory limit of the pod16:16
As a comparison any openstack/nova git operations that opendev does appear to consume 1GB of memory
@gtema:matrix.org> <@clarkb:matrix.org> As a comparison any openstack/nova git operations that opendev does appear to consume 1GB of memory16:16
hmm - maybe really the case
@clarkb:matrix.orgtobiash: are you happy with corvus response in https://review.opendev.org/c/zuul/zuul/+/806062/ ?16:50
@tobias.henkel:matrix.orgyes, thanks16:52
@jim:acmegating.comswest: hrm, the updated patch in 806653 that clears the event is still triggering "too many open files" errors.  i've also instrumented the tests and run locally, and i can find no instances of the exception handler being triggered, or even any cases where the election loop ran more than once.  so while i agree that was a bug, i don't think it was causing the open files problem.17:27
@westphahl:matrix.orgcorvus: strange, it solved the issue for me at least. I added the clear in a different place but that shouldn't really matter17:42
@erbarr:matrix.orgwhat ports need to be open for zuul-web? I plan to run that on another VM to display publicly only the web service and corporate is asking me. Just 9000?18:05
@clarkb:matrix.orgerbarr: The service itself defaults to 9000, but you may want to put a proxy in front of it to do ssl and caching and host that on port 44318:07
@clarkb:matrix.orgthere is also the fingergw which will listen on port 79 to support finger.18:08
@erbarr:matrix.orgso zuul-fingergw also needs to run on the same VM as zuul-web?18:09
@clarkb:matrix.orgI don't think it does, but calling it out as another service with ports that need to be exposed so that things like console streaming work18:13
@erbarr:matrix.orgahh, okay thanks!18:14
@jim:acmegating.comswest: yeah, i'm stumped.  where did you put the clear?18:15
@jim:acmegating.com"ulimit -n" locally for me says 1024 18:16
@westphahl:matrix.orgcorvus: right before the election.run(...)18:18
@jim:acmegating.com1024 is the same on the test nodes18:18
@westphahl:matrix.orgfile limit was the same for me18:18
@jim:acmegating.comi've logged into the host running https://zuul.opendev.org/t/zuul/stream/d9dc0cc0deeb4158865c1646277c1c18?logfile=console.log now18:19
@jim:acmegating.comthere really are a lot of open files according to lsof18:22
@jim:acmegating.comthe process is up to fd #732 18:22
@jim:acmegating.comi'm taking snapshots so we can see if we have old threads not ageing out18:23
@jim:acmegating.comthe threads do seem to be deleted.  we're just getting lots of fds18:26
@clarkb:matrix.orgcorvus: if you end up fixing the fds thing you might want to add a fix for my -1 on https://review.opendev.org/c/zuul/zuul/+/806063 when you rebase and push18:28
@clarkb:matrix.orgits a minor thing but worth fixing now18:28
@jim:acmegating.com825 fds, 255 are pipes, 105 are eventpoll, 141 are listening tcp sockets, 235 udp sockets18:29
@jim:acmegating.com1010 fds, 332 pipe, 125 eventpoll, 162 listening tcp sockets, 283 udp sockets18:31
@jim:acmegating.comeverything is increasing18:31
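A minimal sketch of the kind of per-type fd tally corvus is reading off lsof, done from inside a Python process instead. This is not how the numbers above were produced; the function name and the 65536 cap are assumptions, and it is coarser than lsof (it cannot tell tcp from udp, and eventpoll handles land in "other").

```python
import os
import resource
import stat
from collections import Counter


def count_open_fds():
    """Return a Counter of this process's open file descriptors by kind."""
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Cap the scan so an unlimited or very large limit stays cheap.
    upper = 4096 if soft == resource.RLIM_INFINITY else min(soft, 65536)
    kinds = Counter()
    for fd in range(upper):
        try:
            mode = os.fstat(fd).st_mode
        except OSError:
            continue  # this fd number is not open
        if stat.S_ISFIFO(mode):
            kinds["pipe"] += 1
        elif stat.S_ISSOCK(mode):
            kinds["socket"] += 1
        elif stat.S_ISREG(mode):
            kinds["file"] += 1
        else:
            kinds["other"] += 1  # ttys, directories, epoll handles, ...
    return kinds


before = count_open_fds()
read_end, write_end = os.pipe()
after = count_open_fds()
print(after["pipe"] - before["pipe"])  # 2
os.close(read_end)
os.close(write_end)
```

Taking a snapshot like this before and after a test gives the same "everything is increasing" signal without logging into the node.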
@jim:acmegating.comi'm going to hold a node with the change before this one too18:32
@clarkb:matrix.orgYa would be good to know if it happens prior to adding the election code18:33
@clarkb:matrix.orgits possible the election code adds a few extra fds that tips it over, but we're already leaking18:33
@jim:acmegating.comyeah that sounds plausible18:33
@jim:acmegating.comand it's started emitting too many open files errors18:34
@jim:acmegating.comi count 1045 entries in lsof, with a max fd number of 976 18:35
@clarkb:matrix.orgcorvus: I left a note on 806653 though I don't think it is related to the fd leaks18:42
@jim:acmegating.comwow, the pre-election change is hitting a max fd # of 8 so far.19:02
@jim:acmegating.comso we're not looking at a last-straw situation, it's something bonkers in that change19:03
@jim:acmegating.comoh, strike that, i think i just checked too early19:06
@jim:acmegating.comokay, this does seem to be leaking as well, but perhaps more slowly19:12
@jim:acmegating.comso this may be a polynomial issue... we may have leaked one more file descriptor times X tests?  or something like that19:13
@jim:acmegating.com(where X is around 1100/4 or maybe /2 depending on our cpu count / parallelization)19:13
@jim:acmegating.comi'm starting to lean toward there being a structural problem with the test harness and this just pushes it over the edge19:14
@jim:acmegating.comup to max fd 638 on the pre-election change19:36
@erbarr:matrix.orghi clark, I'm looking at the diagram at https://zuul-ci.org/docs/zuul/discussion/components.html#overview that shows web connecting to zookeeper, executor, gearman, database, github. Do each one of those represent a port that needs to be open or just what you pointed out earlier?20:46
@clarkb:matrix.org> <@erbarr:matrix.org> hi clark, I'm looking at the diagram at https://zuul-ci.org/docs/zuul/discussion/components.html#overview that shows web connecting to zookeeper, executor, gearman, database, github. Do each one of those represent a port that needs to be open or just what you pointed out earlier?20:47
each of those represents a network connection. Whether or not you'll need to open those ports depends on your network topology
@erbarr:matrix.orgso if web is on its own VM, and every other service is on a different VM, then web needs those ports open?20:50
@clarkb:matrix.orgif there is a firewall between the VMs then yes20:50
@erbarr:matrix.orgokay then yea, thanks!20:51
@jim:acmegating.comokay the beginning of a clue -- if i run 2 unit tests then see what files are open, without the election stuff it looks like we leak 17 fds, but with it we leak 21.  i have to run 2 tests to see that though; just running one test with and without the election stuff there's no difference.21:15
@jim:acmegating.com(an apparent leak of 14 fds either way in that case)21:16
@jim:acmegating.coman extra udp socket, an extra tcp listener; none of those make sense.  an extra eventpoll -- that does make sense.21:23
@iwienand:matrix.orgcorvus: thanks for fixing the registry size issue!21:40
@jim:acmegating.comianw: yw :)21:41
@iwienand:matrix.orgcorvus: hrm 22:05
@iwienand:matrix.orgfailed commit on ref "layer-sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1": "layer-sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1" failed size validation: 7670 != 7671: failed precondition22:05
@iwienand:matrix.orgthat's a suspiciously off-by-one value ...22:05
@iwienand:matrix.orghttps://zuul.opendev.org/t/openstack/build/0cd4979893714910ae45b3191b6b3175 was the job22:05
@jim:acmegating.comianw: unfortunately i'm deep into trying to figure out why the zuul unit tests are leaking file descriptors, i don't think i'll be able to dig into that right now22:06
@iwienand:matrix.orgcorvus: that's ok.  do you think https://review.opendev.org/c/opendev/base-jobs/+/806818 is ok to make buildset-registry jobs fail so i can put one on hold?22:07
@iwienand:matrix.orgi found it difficult to get started debugging22:07
@jim:acmegating.comianw: seems reasonable, but if there's a registry protocol error it may be easier to just run the registry and builds locally.  also, ansible has a 'fail' module, btw.22:09
@iwienand:matrix.orgahh, yes, fail is probably better.  i was having trouble getting my environment talking to the intermediate mirror; it was nice the jobs set all that up22:10
@iwienand:matrix.orgi mean zuul-registry instance i was trying to run as an intermediate mirror, not our intermediate mirror, blerg a lot of parts ... :)22:11
@jim:acmegating.comyeah, if you need the whole system, that's the best way.  if you can narrow it down to a build and upload command, then local is a lot easier22:14
@jim:acmegating.comi found 2 things we're leaking in tests: the statsd udp listener, and the fake prometheus server22:14
@jim:acmegating.comnone of those have anything to do with the election change, but maybe fixing those will give us headroom back22:15
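A small sketch of how a leaked test listener shows up as a per-test fd delta, and how closing it removes the delta. `FakeStatsd` here is a hypothetical stand-in, not Zuul's test fixture, and the 4096 scan bound is an assumption.

```python
import os
import socket


def open_fd_count(max_fd=4096):
    """Count open file descriptors below max_fd in this process."""
    count = 0
    for fd in range(max_fd):
        try:
            os.fstat(fd)
        except OSError:
            continue
        count += 1
    return count


class FakeStatsd:
    """Hypothetical fixture: a UDP listener on an ephemeral local port."""

    def __init__(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind(("127.0.0.1", 0))

    def stop(self):
        self.sock.close()


def leaked_fds(run_test):
    """Return how many more fds are open after run_test than before it."""
    before = open_fd_count()
    run_test()
    return open_fd_count() - before


lingering = []  # simulates a reference that outlives the test


def leaky_test():
    lingering.append(FakeStatsd())  # listener is never stopped


def clean_test():
    server = FakeStatsd()
    try:
        pass  # test body would go here
    finally:
        server.stop()


print(leaked_fds(leaky_test))  # 1
print(leaked_fds(clean_test))  # 0

for server in lingering:
    server.stop()
lingering.clear()
```

One leaked socket per test is harmless alone, but multiplied across a large suite in one process it eats into the 1024 fd limit mentioned earlier.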
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:22:32
- [zuul/zuul] WIP: only create stop watcher event if necessary https://review.opendev.org/c/zuul/zuul/+/806994
- [zuul/zuul] Reduce leaked file descriptors in tests https://review.opendev.org/c/zuul/zuul/+/806995
@jim:acmegating.comokay, i like the first change anyway, but don't know if it'll do anything, so i pushed it up separately as an experiment.  the second change should definitely improve things.  after we get results, i'll move the second change to early in the stack and then squash the first change into the election change.  (Clark, swest, tobiash FYI)22:34
@jim:acmegating.comeven after that, we're still leaking some udp socket listeners.  i have no idea what they could be.22:35
@iwienand:matrix.orghttps://zuul.opendev.org/t/openstack/build/203c52dfa57d428e993b79b926643dcf/log/docker/buildset_registry.txt#70 is the registry entry of the assets upload22:38
@iwienand:matrix.orgUpdate upload metadata chunks: [{'size': 7670, 'md5': '54a131e3f12627a57edf174190aabebf'}, {'size': 0, 'md5': 'd41d8cd98f00b204e9800998ecf8427e'}]22:38
@jim:acmegating.comso on the download the client received one byte more than we received on the upload?22:39
@iwienand:matrix.orgdocker is saying "7670 != 7671: failed precondition"; not 100% sure which side is the expected and which the received22:39
@jim:acmegating.comianw: note this line: https://zuul.opendev.org/t/openstack/build/203c52dfa57d428e993b79b926643dcf/log/docker/buildset_registry.txt#247 22:40
@jim:acmegating.comcherrypy said it sent 7670 bytes in that get request (but i don't know if that was the docker request)22:40
@iwienand:matrix.orghttps://github.com/containerd/containerd/blob/fb589a71331869b141c83afb098152359f44b8c4/metadata/content.go#L626 i think is the code putting out the error22:42
@iwienand:matrix.orgi really hate it when you use the same error message in different paths22:45
@jim:acmegating.comthat's why i try not to do that :)22:46
@jim:acmegating.comianw: if you have a held registry node, you could run a GET on that url, or check the filesystem and make sure that it's reporting 7670 and not 71 22:46
@iwienand:matrix.orgyeah i can put back in the pause and at least fiddle until timeout kicks in22:49
@iwienand:matrix.orgok, it failed again and is paused for a while.  198.72.124.20 is the registry and 198.72.124.29 the builder23:05
@iwienand:matrix.orgjust "FROM opendevorg/assets" replicates it, that's good.  it must try and download it 3 times23:07
@clarkb:matrix.orgianw:  let me know if I can help further with that off by one thing. Just got back from a bike ride23:10
@iwienand:matrix.orgcorvus: hrm, i guess you need to setup authentication to run a curl request?23:12
@clarkb:matrix.orgianw: ya the docker protocol requires auth tokens for even anonymous requests23:14
@clarkb:matrix.orgyou basically query for a token without giving it creds and it sends a token back that you use and the zuul-registry mimics that iirc23:14
@iwienand:matrix.orgroot@9d5bac229fa4:/storage/_local/blobs/sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1# ls -l23:14
total 12
-rw-r--r-- 1 root root 7670 Sep 1 23:02 data
@iwienand:matrix.orgso the registry is seeing 0fb1 as 7670 on disk23:14
@iwienand:matrix.orghttps://hub.docker.com/support/doc/how-do-i-authenticate-with-the-v2-api i guess23:15
@clarkb:matrix.orgya that looks right23:19
@iwienand:matrix.orghrm, i'm not having much luck23:21
@iwienand:matrix.orgusing the password from /home/zuul/buildset_registry23:21
@clarkb:matrix.organd the url was updated to use the buildset registry?23:22
@clarkb:matrix.orgI wonder if this is a case where the registry test suite might be easier to debug with than the deployed service?23:23
@iwienand:matrix.orgcurl -s -H "Content-Type: application/json" -X POST -d '{"username": "'zuul'", "password": "'...blah...'"}' https://zuul-jobs.buildset-registry:5000/v2/users/login 23:23
@clarkb:matrix.orgAre those extra quotes in the json being sent or is that an artifact of element quoting?23:24
@clarkb:matrix.orgbut that could be it if so?23:24
@iwienand:matrix.orgcurl -s 'https://zuul-jobs.buildset-registry:5000/auth/token?scope=read%3A%3A&scope=repository%3Aopendevorg%2Fassets%3Apull'23:29
@iwienand:matrix.orgcurl -s -H "Authorization: JWT ${TOKEN}" https://zuul-jobs.buildset-registry:5000/v2/opendevorg/assets/blobs/sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1?ns=docker.io 23:34
@iwienand:matrix.orgisn't working23:34
@clarkb:matrix.orgis the namespace correct? I assume so since we don't upload elsewhere23:36
@iwienand:matrix.orgok "Authorization: Bearer ${TOKEN}" seems to work...?23:37
@iwienand:matrix.org-rw-r--r-- 1 root root 7670 Sep  1 23:37 test23:38
@iwienand:matrix.orgok, it is returning me a file of size 7670 23:38
@iwienand:matrix.orgthis suggests the manifest size is wrong?23:38
@clarkb:matrix.orgianw: yes, I think you can request the manifest json data too and confirm23:39
@clarkb:matrix.orgbut I'm not sure what that path is23:39
@iwienand:matrix.org[01/Sep/2021:23:41:13] "GET /v2/opendevorg/assets/manifests/latest?ns=docker.io HTTP/1.1" 200 776 "" "containerd/1.4.0+unknown"23:43
@iwienand:matrix.org[01/Sep/2021:23:42:56] "GET /v2/opendevorg/assets/manifests/latest?ns=docker.io HTTP/1.1" 404 735 "" "curl/7.68.0 23:44
@iwienand:matrix.orgcontainerd gets it ok, i get a 404 23:44
@clarkb:matrix.orglooking at the code it appears you have to set a content type header23:46
@clarkb:matrix.orghttps://opendev.org/zuul/zuul-registry/src/branch/master/zuul_registry/main.py#L388-L389 23:47
@clarkb:matrix.orgDo you need to set Accept: for each of those types?23:48
@iwienand:matrix.orgyep, just doing that :)23:48
@clarkb:matrix.orgWithout the Accept: header it falls through to the 404 path23:48
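A sketch that pulls together what has worked so far in this exchange: the `/auth/token` URL with a `scope` query parameter, `Authorization: Bearer`, and an `Accept` header for the manifest. It only builds the requests and sends nothing; the helper names are hypothetical and the registry hostname is the one from the held node above.

```python
import urllib.parse
import urllib.request

REGISTRY = "https://zuul-jobs.buildset-registry:5000"
MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def token_url(repository, action="pull"):
    """Build the anonymous token URL for one repository scope."""
    query = urllib.parse.urlencode(
        {"scope": f"repository:{repository}:{action}"}
    )
    return f"{REGISTRY}/auth/token?{query}"


def manifest_request(repository, tag, token, namespace="docker.io"):
    """Build (but do not send) a manifest GET with the headers that matter."""
    url = f"{REGISTRY}/v2/{repository}/manifests/{tag}?ns={namespace}"
    headers = {
        "Accept": MANIFEST_V2,               # without this the GET hit the 404 path
        "Authorization": f"Bearer {token}",  # "JWT" did not work above
    }
    return urllib.request.Request(url, headers=headers)


print(token_url("opendevorg/assets"))
req = manifest_request("opendevorg/assets", "latest", token="TOKEN")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the call; TLS trust for the buildset registry's certificate is left out of this sketch.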
@iwienand:matrix.orgcurl -H "Accept: application/vnd.docker.distribution.manifest.v2+json" -X GET  -H "Authorization: Bearer ${TOKEN}" 'https://zuul-jobs.buildset-registry:5000/v2/opendevorg/assets/manifests/latest?ns=docker.io'23:49
@iwienand:matrix.org {"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", "size": 7671, "digest": "sha256:a65b430471b85457a204da7fcbaf12a5f18ea5c33cc9a31a2749aa43f6290fb1"}23:49
@iwienand:matrix.orgok, so the manifest is saying 7671 23:49
@iwienand:matrix.orgwe know the file on disk is 7670 23:50
@iwienand:matrix.orgit seems like it should ultimately be passing back os.stat(path).st_size23:52
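A minimal sketch of the check ianw is doing by hand: compare each layer's `size` in a manifest against `os.stat(path).st_size` for the stored blob. The function name, the path mapping and the example digest are hypothetical; only the `layers` / `size` / `digest` field names come from the manifest output above.

```python
import os
import tempfile


def size_mismatches(manifest, blob_path_for):
    """Return (digest, declared size, size on disk) for each mismatched layer."""
    mismatches = []
    for layer in manifest.get("layers", []):
        on_disk = os.stat(blob_path_for(layer["digest"])).st_size
        if layer.get("size") != on_disk:
            mismatches.append((layer["digest"], layer.get("size"), on_disk))
    return mismatches


with tempfile.TemporaryDirectory() as storage:
    digest = "sha256:example"  # placeholder, not a real digest
    blob = os.path.join(storage, digest.replace(":", "_"))
    with open(blob, "wb") as f:
        f.write(b"x" * 7670)  # same size as the blob on the held node

    manifest = {"layers": [{"size": 7671, "digest": digest}]}
    result = size_mismatches(
        manifest, lambda d: os.path.join(storage, d.replace(":", "_"))
    )
    print(result)  # [('sha256:example', 7671, 7670)]
```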
@jim:acmegating.comit returns what it's given23:54
@clarkb:matrix.orghttps://review.opendev.org/c/zuul/zuul-registry/+/799726/2/zuul_registry/main.py was the change23:55
@clarkb:matrix.orgsize = self.storage.blob_size(namespace, digest)23:55
@clarkb:matrix.orgianw:  I think it does do what you suggest: https://opendev.org/zuul/zuul-registry/src/branch/master/zuul_registry/filesystem.py#L40-L44 23:56
@jim:acmegating.comthat's specifically for an image config...23:56
@jim:acmegating.comianw: can you paste the complete output of that curl?23:56
@jim:acmegating.comalso, what's the ip of the buildset registry?23:57
@clarkb:matrix.orghttps://review.opendev.org/c/zuul/zuul-registry/+/799729/1/zuul_registry/main.py would be for the layers23:57
@iwienand:matrix.orghttps://paste.opendev.org/show/808529/ 23:57
@iwienand:matrix.org 198.72.124.20 is the registry and 198.72.124.29 the builder23:58
@jim:acmegating.comokay, so iff the builder didn't include a size for the layer in the manifest, then we would have figured the size ourselves and added it.23:59
@jim:acmegating.combut we don't know whether that happened or not -- you might be able to check the builder and see what it looks like there23:59
@jim:acmegating.comif it came with a size, then we didn't touch it.23:59
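A sketch of the behaviour corvus describes, as an illustration only and not zuul-registry's actual code: a size is filled in only for layers that arrive without one, so a size supplied by the builder is passed through untouched even if it is wrong. The function name, the digests and the `blob_size` callable are hypothetical.

```python
def add_missing_sizes(manifest, blob_size):
    """Fill in 'size' for layers that lack it; return True if anything changed."""
    changed = False
    for layer in manifest.get("layers", []):
        if "size" not in layer:
            layer["size"] = blob_size(layer["digest"])
            changed = True
    return changed


stored_sizes = {"sha256:aaa": 7670, "sha256:bbb": 123}  # placeholder digests
manifest = {
    "layers": [
        {"digest": "sha256:aaa", "size": 7671},  # builder supplied a size
        {"digest": "sha256:bbb"},                # builder omitted the size
    ]
}

print(add_missing_sizes(manifest, stored_sizes.__getitem__))  # True
print([layer["size"] for layer in manifest["layers"]])        # [7671, 123]
```

Under that logic a 7671 in the manifest would have to come from the builder's upload, which is why checking the builder side is the suggested next step.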
