Friday, 2023-01-13

-@gerrit:opendev.org- Ian Wienand proposed: [zuul/nodepool] 870036: Dockerfile: remove unstable usage https://review.opendev.org/c/zuul/nodepool/+/87003601:37
@iwienand:matrix.org^ for posterity, https://github.com/docker-library/python/issues/784 is related to this above change; and describes how we ended up mixing system-packaged python with the docker built version03:50
@mbecker12:matrix.orgHi! We currently see that the zuul-scheduler is trying to load some cached entries in the tenant file where it appears to have conflicting states with the actual deployment. Is there a way to run reconfigure on the scheduler without using the cache?12:35
@fungicide:matrix.org> <@mbecker12:matrix.org> Hi! We currently see that the zuul-scheduler is trying to load some cached entries in the tenant file where it appears to have conflicting states with the actual deployment. Is there a way to run reconfigure on the scheduler without using the cache?13:06
are you seeing that happen even with a full-reconfigure? https://zuul-ci.org/docs/zuul/latest/operation.html#reconfiguration
@mbecker12:matrix.orgyes13:07
@fungicide:matrix.org> <@gerrit:opendev.org> Zuul merged on behalf of James E. Blair  https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 862522: Convert openstack driver to statemachine  https://review.opendev.org/c/zuul/nodepool/+/86252213:08
corvus: a user has observed a behavior change from this... apparently the old driver set private_ipv4 to the value of public_ipv4 if it was empty, while the new driver does not. since the documentation says private_ipv4 is optional, this probably doesn't constitute a regression but i wanted to get your input
@fungicide:matrix.orgsince that landed after the 8.0.1 tag, i suppose a release note about it for the next version is probably in order at least13:14
@jim:acmegating.commbecker12: that's likely a bug.  at this point the best way to fix is probably to shut everything down and run `zuul delete-state` in order to delete the cache14:40
@mbecker12:matrix.orgYeah that's what we ended up doing. I was just wondering if there may have been a less disruptive way 😅14:41
@jim:acmegating.comfungi: i agree that's a change, and it wasn't intentional.  openstack now matches the behavior of all the other drivers.... i think the best courses of action are to either (A) make a release note about the change, or (B), change the behavior of all the drivers to the old openstack behavior (privateip = privateip or publicip) and make a release note about that.14:44
@fungicide:matrix.orgi'm happy just calling it out as a behavior change. the fix on the job side was fairly simple to just fall back on public_ipv4 when private_ipv4 is undefined, mostly it was the surprise factor of it so release note makes sense14:45
@jim:acmegating.comsounds good.  if anyone else has opinions, let me know; either way i'll upload a change by the end of the day14:46
@fungicide:matrix.orgthanks!14:46
@jim:acmegating.comthank you! and sorry :)14:46
@fungicide:matrix.orgno need to be sorry. plenty of us reviewed that change!14:47
@jim:acmegating.comi think i get why we did it originally -- in the same way that "interface_ip" is "whatever the best public-ish ip is" we made "private_ipv4" be "the best private-ish ip".  maybe we should add an explicit complement to interface_ip... like "internal_ip" or something.14:48
@fungicide:matrix.orgsensible_address14:50
@fungicide:matrix.org(like all the sensible-* commands in freedesktop-land over recent years)14:51
@jim:acmegating.comi guess option (C) could be to match the behavior of the old openstack driver for now, add internal_ip, then change to the new behavior when we move it into zuul14:51
@jim:acmegating.comthat would give people time to move jobs over, and we can still get to a consistent state, and nobody has to panic :)14:52
@fungicide:matrix.orgas more of the world is waking up and noticing job failures, it looks like we're finding more places we previously assumed private_ipv4 would always be usable with openstack nodes14:53
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 870100: Make openstack private_ipv4 backwards compatible https://review.opendev.org/c/zuul/nodepool/+/87010014:57
@jim:acmegating.comzuul-maint: ^ there's the quick fix14:57
@mbecker12:matrix.org> <@mbecker12:matrix.org> Hi! I've been looking into the openshiftpods driver recently and stumbled across the max-pods quota and how it's evaluated.15:41
> I can see that it boils down to the listNodes() function in provider.py. What I don't get is the [FakeServer class](https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openshiftpods/provider.py#L50). What's the purpose of that? To me it looks like it eventually just resolves to a counter of instances which should [coincide with the number of items in self.k8s_client.list_namespaced_pod()](https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openshiftpods/provider.py#L73-L75).
Could I get some clarification on this? :) And along with my message right below that one I was wondering if the implementation for the pod_names should really be set to the labels.keys, that doesn't really seem correct to me.
@clarkb:matrix.orgcorvus: fungi I think ianw expects https://review.opendev.org/c/zuul/nodepool/+/870036 to address the arm64 python pathing issue too15:43
@mbecker12:matrix.orgAnd then, I was looking into the openshiftpods driver but there were some features that me and my team felt were missing, although those might be very specific to our use case. So we were wondering if it makes sense to extend this one (like here: https://review.opendev.org/c/zuul/nodepool/+/867971) or if it could make sense to support the ability to pass some more generic pod resource to the provider15:44
@clarkb:matrix.orgmbecker12: I think the fake server class is there to satisfy quota calculation interface requirements.15:44
@clarkb:matrix.orgits just a way to match the interface required for the built in accounting15:44
@clarkb:matrix.org> <@mbecker12:matrix.org> And then, I was looking into the openshiftpods driver but there were some features that me and my team felt were missing, although those might be very specific to our use case. So we were wondering if it makes sense to extend this one (like here: https://review.opendev.org/c/zuul/nodepool/+/867971) or if it could make sense to support the ability to pass some more generic pod resource to the provider15:48
Is gpu a standard thing like cpu and memory? If so then I suspect they way you've got it is fine. If it varies by deployment then we should figure out something more generic.
@mbecker12:matrix.orgI guess it depends on the gpu vendor and installation in the cluster when I now read up on it :/ https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins15:50
@mbecker12:matrix.org> <@clarkb:matrix.org> its just a way to match the interface required for the built in accounting15:57
to me it looks like it's only populating a list whose length is the only thing we're interested in and it will coincide with the number of pods in the namespace. it doesn't even seem to be limited to the pods with valid_names. wouldn't it make more sense to base the quota calc on this line directly and have it in a couple of lines instead? https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openshiftpods/provider.py#L56
@clarkb:matrix.org> <@mbecker12:matrix.org> to me it looks like it's only populating a list whose length is the only thing we're interested in and it will coincide with the number of pods in the namespace. it doesn't even seem to be limited to the pods with valid_names. wouldn't it make more sense to base the quota calc on this line directly and have it in a couple of lines instead? https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openshiftpods/provider.py#L56 16:00
Nodepool's quota system is fairly rich and relies on some standard tooling. I don't think we should replace that tooling to make it simpler just because the openshift driver doesn't make much use of it. It creates consistency across drivers and openshift could be expanded to use some of the richer features.
@mbecker12:matrix.orgyeah okay, I was mostly looking into the openshift related code so I'm not that familiar with the underlying quota system16:06
@mbecker12:matrix.orgbut in its current implementation it counts all pods in the same namespace as the current pool. is that intentional? shouldn't it only count the number of pods in that namespace that nodepool has spawned?16:12
@fungicide:matrix.orgi'm not all that familiar with the kubernetes-like drivers, but the cloud drivers try to consider the cloud account's actual quotas in real time in order to not exceed them, even if there are resources created in them by something other than nodepool itself16:15
@fungicide:matrix.orgif my account is allowed to create 10 servers and someone has manually created 2 in it, nodepool detects that it can only actually boot 8 there and does its best not to boot more until that changes16:16
@fungicide:matrix.orgit does that by querying the cloud's api to find out what the quota is, and the max-servers limit in configuration is simply an additional limit you can apply which is taken into consideration separately from the quota polling16:19
@fungicide:matrix.orgso in the earlier example, if i set max-servers:5 in that provider, then nodepool should try to boot no more than 5 servers there even though the quota allows up to 10, and if someone manually boots 6 servers in that account then nodepool should detect that it can then only boot up to 416:21
@mbecker12:matrix.orgalright, thanks for clarifying :)16:27
@mbecker12:matrix.orgI see that in the kubernetes driver, there's support for max-cores and max-ram. that sounds like something we would want in the openshift(pods) driver as well. do you think there's any restriction there, or is it more difficult to obtain the current quota in openshift or was there any other reason why this is not present in those drivers? otherwise that's something we could look into as that could be beneficial for our deployment16:29
@fungicide:matrix.orgwe're generally driving toward increased consistency between drivers, so improvements in that vein sound good to me. as for whether it was left out due to a lack of some support in the openshift pods driver, i don't really know16:47
@clarkb:matrix.orgI think when those drivers were written the simplest possibly implementation for quotas was used.16:47
@clarkb:matrix.org * I think when those drivers were written the simplest possible implementation for quotas was used.16:47
@fungicide:matrix.org * we're generally driving toward increased consistency between drivers, so improvements in that vein sound good to me. as for whether it was left out due to a lack of some support in the openshift api, i don't really know16:48
@fungicide:matrix.org * we're generally working toward increased consistency between drivers, so improvements in that vein sound good to me. as for whether it was left out due to a lack of some support in the openshift api, i don't really know16:52
@jim:acmegating.com870100 failed a couple of jobs in earlier runs potentially due to the bug it's fixing, but it's still in gate in a currently passing state, so we should know if it merges in a few minutes.17:27
@jim:acmegating.comit just failed the quick-start test.  that also had an empty private ipv4.  so far all the failed tests have that in common.17:36
@jim:acmegating.comamong the several test runs, we have a passing result from each job, so i think we should force-merge it17:37
@clarkb:matrix.orgI'm fine with that given the circumstances17:37
@clarkb:matrix.orgthe alternative of reverting opendev's install to 8.0.1 and then rechecking seems like a fair bit of work for a problem that is well understood17:37
@clarkb:matrix.orgwill force merging promte images though?17:38
@jim:acmegating.comthat's a great question i don't think we've ever dealt with.  it should promote images, as long as nodepool-upload-image succeeded in the last gate run, and it did17:39
@clarkb:matrix.orgI guess if it doesn't then we'll have learned something17:40
@jim:acmegating.com(i'm 98% sure the promote job grabs the most recent gate build)17:40
-@gerrit:opendev.org- corvus.admin merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/nodepool] 870100: Make openstack private_ipv4 backwards compatible https://review.opendev.org/c/zuul/nodepool/+/87010017:43
@jim:acmegating.comthe promote job claims to have succeeded17:45
@jim:acmegating.comLast pushed a minute ago by zuulzuul17:46
@jim:acmegating.comthe docs did not promote; they may only look for successful builds to download17:47
@jim:acmegating.comi think that's fine for today, and we'll just file that info away17:47
-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/nodepool] 870036: Dockerfile: remove unstable usage https://review.opendev.org/c/zuul/nodepool/+/87003623:27

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!