Friday, 2023-11-17

opendevreviewRajesh Tailor proposed openstack/nova master: Fix KeyError on assisted snapshot call  https://review.opendev.org/c/openstack/nova/+/90078307:01
tkajinamhttps://zuul.opendev.org/t/openstack/builds?job_name=grenade-skip-level-always&project=openstack/nova07:38
tkajinamI'm afraid grenade-skip-level-always might be broken. I've not yet looked into it though.07:39
fricklerERROR cinder etcd3gw.exceptions.Etcd3Exception: Not Found07:46
tkajinamfrickler, ah, thanks for that info. I saw the same error in a few run so there may be something wrong with cinder07:59
tkajinamit seems cinder is deployed with etcd3gw tooz driver.07:59
tkajinamI vaguely guess the issue was caused by the new tooz release and https://review.opendev.org/c/openstack/tooz/+/891355 might be the cause08:03
tkajinamhmm. do we really need that job in master ? I see the job test upgrade from zed to 2023.208:06
tkajinamand does not look appropriate for master08:06
tkajinamignore above. I noticed I was looking at a wrong branch of grenade...08:08
bauzastkajinam: looking09:30
bauzasyeah this is failing when upgrading cinder09:31
bauzaswe're on Caracal, so it's upgrading from Antelope09:32
bauzashttps://github.com/openstack/grenade/commit/159054a9274f984d8e53e5e5bd9e7ab0fb29708b09:33
bauzas(C is a SLURP release)09:33
bauzasheh, fun, we got a GMR report in https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_73a/900783/2/check/grenade-skip-level-always/73acbef/controller/logs/screen-c-vol.txt09:36
bauzas2023-11-17 06:30:28.118 | ++ /opt/stack/new/grenade/projects/70_cinder/resources.sh:verify:220 :   openstack volume show cinder_grenade_vol2 -f shell -c status 2023-11-17 06:30:29.692 | Internal Server Error (HTTP 500)09:39
bauzasyeah, something got wrong on the cinder side09:39
bauzasand yup yup, good catch tkajinam09:40
bauzashttps://paste.opendev.org/show/br9ep5e0MjNReG9A9ta7/09:40
fricklerbauzas: cf. https://review.opendev.org/c/openstack/devstack/+/90122110:00
bauzasack thanks10:00
bauzasI don't exactly get the patch that broke tooz yet10:01
bauzasbut it was merged on Oct 1610:02
bauzasIMHO we should rather revert it 10:02
bauzashttps://review.opendev.org/c/openstack/tooz/+/891355 was merged on Sep 28 so it's not the culprit10:03
fricklerbauzas: tooz was released yesterday with that patch afaict, added into u-c with https://review.opendev.org/c/openstack/requirements/+/90113110:09
fricklermaybe that default setting change should have triggered a major version bump. also maybe grenade should be patched to handle the etcd upgrade10:11
bauzasfrickler: oh right, tooz is a library10:12
bauzasdon't judge me, it's Friday morning and I'm already tired :)10:13
bauzasand yeah, looks to me it's a behavioural change10:13
fricklerI don't judge anyone on Friday's ;)10:13
bauzas"* Note that this probably means a new minor version of tooz, while   the behaviour can be worked around in config, this is enough to   break existing systems on upgrade."10:15
bauzasindeed it breaks10:15
bauzasfrickler: I'm tempted to require a u-c reverty10:17
bauzaselodilles: ^10:17
bauzaselodilles: fwiw, our grenade skip level job is broken due to the new tooz releasee https://review.opendev.org/c/openstack/releases/+/89996610:18
elodillesreading10:20
bauzasI just sent the usual 'hold your rechecks' email to the wolrd10:24
bauzasfrickler: I feel something is wrong with the tooz patch10:26
bauzasif we want to have SLURP upgrades, we somehow need to keep a backwards compatible behaviour for a second10:26
bauzasor we need to say 'this is a major breaking change' and only once we no longer support older releases that we want to pull it10:27
elodillesbauzas: thanks for the mail o/ (maybe [oslo] team should have been in the subject, too) 10:31
bauzasyeah10:31
bauzaswell, oslo is free to deliver major versions of their libraries that break compatibility10:32
bauzasbut then they need to signal it correctly for the release team appropriately10:32
bauzaselodilles: I tagged with [tooz] I get that filter is maybe too much restricted ?10:33
bauzas(fwiw, I don't do tag filtering on my mailbox)10:33
bauzas(and I hope oslo contributors to not do it as well)10:33
elodillesyes, they probably see the mail with [tooz] tag as well10:36
bauzaselodilles: https://review.opendev.org/c/openstack/requirements/+/90107910:40
elodilles+110:42
elodilles(from release team perspective i'm still happy we are not finding this one week before the final release o:))10:43
bauzasall hail to the grenade-skip-level lord10:44
fricklerand to the "release at rc-1" policy10:44
fricklereh milestone-110:45
elodillesyepp \o/10:45
* bauzas goes outside for sweating needs10:49
opendevreviewRajesh Tailor proposed openstack/nova master: trivial doc fix  https://review.opendev.org/c/openstack/nova/+/90123711:33
pas-ha[m]re tooz/etcd  - may I bring your attention to https://review.opendev.org/c/openstack/etcd3gw/+/887662 - this adds version autodiscovery to etcd3gw, once there, we could switch default version in tooz to 'auto', and everyone is happy, new tooz will be able to work with the same 'backend_url' with old and new etcd311:34
pas-ha[m]this allows to upgrade the etcd and tooz-using components in whatever order w/o changing the configuration11:39
opendevreviewMerged openstack/nova master: Bump jsonschema minimum to 4.0.0  https://review.opendev.org/c/openstack/nova/+/85002112:57
jangutterpas-ha[m]: regarding version autodetect: https://lists.openstack.org/pipermail/openstack-discuss/2023-August/034704.html13:33
jangutterpas-ha[m]: probing for a version also means you don't error on the case where you forgot to upgrade your etcd.13:35
-opendevstatus- NOTICE: Gerrit will be unavailable for a short time starting at 15:30 UTC as it is upgraded to the 3.8 release. https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/14:07
bauzaslemme get it straight14:13
bauzasetcd changed its endpoint by a release14:14
bauzastooz accordingly changed to use the newer endpoint by a specific tooz release14:14
bauzasgiven now, we upgrade tooz but not etcd, tooz talks the new way while it should talk to the other endpoint14:15
bauzasamirite, jangutter pas-ha[m], frickler ?14:15
jangutterbauzas: correct, it's also possible to override the endpoint away from the default, in cinder config. Both old and new tooz can work with old and new etcd if the endpoint is given explicitly.14:17
tkajinamthe problem is that tooz used to use v3beta which is hardcoded and users had to override it in case needed14:18
bauzasjangutter: was cinder using the default, or was it specifically asking an endpoint?14:18
tkajinamthe problem here is that in stable/2023.1 we had old tooz using v1beta by default along with old etcd3 with v1beta. now we upgrade only tooz and the new version uses v1 by default.14:19
tkajinamand because etcd3 is still old after upgrade v3 api is not available14:19
jangutterhttps://opendev.org/openstack/devstack/commit/3a7a3cd8c5a5ac3f1655d6ff17974f8623fb3330 <--- before this, cinder was using default, after this cinder was asking explicitly for v314:20
bauzaswe shouldn't expect operators to upgrade their dependencies when upgrading14:20
tkajinamwe added api_version=/v3 in coordination url in devstack stable/2023.2, so if you first deploy cinder with stable/2023.1 coordination url does not contain api version14:20
opendevreviewJohn Garbutt proposed openstack/nova-specs master: Expose PCI device NUMA using PXB  https://review.opendev.org/c/openstack/nova-specs/+/86941614:20
tkajinamand the same config value is carried over so tooz continue using the default which has been changed from stable/2023.1 to master14:21
janguttertkajinam: and it's worse, tooz was using v3alpha, not v3beta https://review.opendev.org/c/openstack/tooz/+/891355/6/tooz/drivers/etcd3gw.py14:21
bauzasokay, but that's then a devstack config problem14:21
tkajinamah, yeah. I was wrong/ v1alpha14:21
bauzasjangutter: why shouldn't we changing devstack now to use the default?14:22
tkajinambauzas, no we can't use default14:22
bauzasthat's still a breaking change and people would need to change their config when upgrading to Caracal14:22
jangutterthe only way config stays the same is if etcd and tooz is upgraded in lockstep.14:23
tkajinambauzas, the default in tooz stable/2023.1 is v3alpha, the default in tooz master is v3. Because we keep using old etcd without v3 during upgrade process we have to override api version to v3alpha to keep using v3alpha after upgrade14:23
jangutterif etcd and tooz mismatches, the endpoint has to be specified.14:23
bauzasshouldn't tooz be somehow a bit smarter and do a failback call if etcd isn't new enough ?14:24
tkajinamthat's what was proposed above by pas-ha[m]14:24
opendevreviewJohn Garbutt proposed openstack/nova-specs master: Add spec for PCI Groups  https://review.opendev.org/c/openstack/nova-specs/+/89971914:25
bauzaswhat I'm afraid is that operators doing upgrades would face that problem too, right?14:25
bauzasif we start tweaking devstack, there is surely an impact for regular ops that don't deploy with it :)14:26
jangutterthat's technically down to etcd3gw backend... and etcd doesn't have a nice version discovery, so, it's the try-different-versions method.14:26
bauzasthis means they have to change some config on the fly during upgrades 14:26
tkajinamwell, we did document the upgrade impact to suggest setting api version in case old api version should be used.14:26
bauzasjangutter: I'm personnally fine with a try/catch clause14:27
tkajinambut what we are learning here is that just documentation may kill some people.14:27
tkajinamincluding us14:27
bauzasoh yeah14:27
tkajinamhttps://review.opendev.org/c/openstack/tooz/+/891355/6/releasenotes/notes/etcd-3.4-eee8300c942a1263.yaml14:27
bauzasif you really do breaking upgrades, my personal recommandation is to provide tools that people can consume for checking their states and whether they'll break14:28
jangutteryep, the release notes are the last resort :-/14:28
bauzastkajinam: we, in nova, create the nova-status upgrade command for that exact reason14:28
bauzascreated*14:28
bauzasbecause even if we spoil the breaking changes in our relnotes, people may be impacted and miss the point14:29
bauzasanyway, I'm not saying tooz shall do a tooz-status upgrade commande14:29
bauzascommand*14:29
tkajinamthe problem may be that tooz is not the right place to provide that config validation, because tooz itself does not register config options but components using it does and pass the url value14:29
jangutterThe problem is that etcd is sometimes regarded as an _externally_ managed service, and sometimes an _internally_ managed service.14:29
tkajinamwe can provide the base implementation but every project may need to implement the command using that14:30
bauzasI'm just saying that we need some kind of lockstep mechanism that can handle both etcd/tooz upgrades at once14:30
jangutteryeah: for kolla-ansible that happens because in one release all the elements (config, tooz, etcd) is in lockstep. But that's not necessarily the same case for source-level projects.14:31
bauzasproblem is, again, you can't really ask operators to do some OS upgrade while you're upgrading your services14:32
bauzasthat's why grenade is not bumping the reqs14:32
jangutteryeah, _however_ etcd in devstack isn't the os-bundled one.14:33
bauzasjust checked, we pull tooz-3.2.014:35
bauzashttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_73a/900783/2/check/grenade-skip-level-always/73acbef/controller/logs/old/devstacklog.txt14:35
bauzasand we have etcd3gw==2.1.014:36
bauzasoh tonyb +Wd my revert patch https://review.opendev.org/c/openstack/requirements/+/90107914:36
tonybit's clearly broken14:38
*** blarnath is now known as d34dh0r5314:38
bauzasoh heh, https://github.com/openstack/grenade#theory-of-upgrade14:38
bauzastkajinam: jangutter: ^14:38
jangutteryeah I saw that.14:38
tonybwe should block it in g-r and figure out how to handle things for the longer term14:38
bauzas"Any other required changes on upgrade are an exception and must be called out in the release notes." and "Grenade supports per release specific upgrade scripts (from-juno, from-kilo)."14:39
bauzastonyb: I don't disagree14:39
bauzasI still think I'd appreciate if tooz may do some fallback mechanism to an older endpoint if the newer endpoint sounds invalid14:40
tkajinamhttps://review.opendev.org/q/topic:etcd-api-discovery14:40
bauzasjust for the sake of easing people's lifes14:40
tkajinamWe probably need a bug and link these to the bug, but we can make tooz smart and discover the available api version unless user explicitly requires a specific version14:41
bauzastkajinam: wow, kudos14:41
tkajinamwhich is being implemented in these14:41
tkajinam(tooz patch need further work because we first need a new release of etcd3gw...14:41
bauzassounds a good path to me14:41
bauzaswe may ban the current tooz release and just await for a newer release once everything is fine14:42
janguttertkajinam: one question - if etcd gets updated (v3alpha disappears and v3 appears) - would this case be handled?14:42
tkajinamjangutter, yes. with that code etcd checks the etcd server version using version api and use the latest api version available14:44
tkajinamjangutter, though you may need to restart servers because the detected api is cached after the first detection.14:44
tkajinams/servers/services/ I mean14:44
tkajinams/etcd checks/tooz checks/14:45
janguttertakjinam: so it will appear if cinder "remembers" the version of etcd it first connected to and never try a newer version?14:45
tkajinamthere is tradeoff. we can let tooz check api version every time but it always requires additional api call14:46
tkajinamwe may probably implement expire but I'd leave it for now14:46
tkajinamjangutter, I probably had to say this first. yes, you are correct14:46
janguttertkajinam: yeah, gracefully redetecting the endpoint on error is also another option, that adds time to report errors.14:49
jangutterbut this covers 95% of the pain.14:50
tkajinamhttps://bugs.launchpad.net/python-etcd3gw/+bug/204381014:58
tkajinamI've created a bug here... in case you want something you can refer to14:59
tkajinamjangutter, we can discuss how we can realize the automated switch without restart, but I'd prefer leaving it for follow-up.14:59
janguttertkajinam: right, I'm fine with just documenting the behavior (i.e. we expect dependent services to be restarted on etcd upgrade)15:00
jangutter(that's a much more reasonable expectation than requiring operators to update etcd + tooz in lockstep.15:04
tonybbauzas: You probably need to remove your -W on 90107915:15
bauzasack I will15:15
bauzasdansmith: before I shutdown for the weekend and before your PTOs, just a quick question about https://review.opendev.org/c/openstack/nova/+/899625 15:49
bauzasfor non-SRIOV GPUs, shall we cap the total of the inventory by the max_instances or should we rather provide total=0 if the GPU has more than the max ?15:50
* dansmith waits15:50
bauzasfor example, a GPU type supporting 8 vGPUs15:51
bauzasand the operator saying max_instances=215:51
bauzasshall we then accept one GPU and providing a total=2 or should we rather not accept it?15:51
dansmithwe can't *actually* have total=0 right?15:51
bauzaswe can now15:51
bauzaswith https://review.opendev.org/c/openstack/nova/+/899625/4/nova/virt/libvirt/driver.py15:52
bauzasI'm just writing the unittests15:52
dansmithright but that's just internally, placement won't allow it right?15:52
bauzasdansmith: if we would say total=0 then the driver would ask to remove the GPU15:52
dansmithright, so just internal notation for us to know what to do when we update placement15:53
dansmithI guess I'm not sure what you're asking then15:53
bauzasokay, then nevermind15:53
bauzasI'll create a test and we'll discuss with gerrit15:53
dansmithI was also going to ask on that patch why you continue to have providers above the limit, but with total=0, but I thought maybe that was related to cleanup for people who already have more providers than they should15:53
dansmithokay15:54
bauzastbc, sriov gpus only have a total=1 for each of the VFs15:54
bauzasso it's simple15:54
bauzaseither we accept a VF or we don't15:54
bauzasbut for non-sriov gpus, we can have a capacity for say 8 or 1615:54
bauzasso we create an inventory with a total=1615:55
bauzasnow, the question is, what would happen if the operator would add a max_instances value for the type ?15:55
dansmithjust total=min(actual, max_instances) right?15:55
bauzasthat's my question15:56
dansmithwhy not?15:56
bauzasbecause for the moment, we have https://review.opendev.org/c/openstack/nova/+/899625/4/nova/virt/libvirt/driver.py#814415:56
bauzaswe calculate per GPU (or VF)15:56
bauzasand if the number of possible vGPU for it is above the max_instances value, then we eventually say "heh, no, total=0"15:57
bauzasso we don't really cap the GPU, we rather don't accept it15:57
bauzaswhich is fine too15:57
dansmithoh I see, you're saying your current patch doesn't add the inventory _at_all_ if they specify max_instances<total instead of just capping it?15:57
dansmithI didn't pick up on that when I looked, but.. why would we do that?15:58
dansmithisn't the point of this to be able to say "I know it says 16 available, but we can really only allocate 2 of this mdev_type because it takes so much of the card's resource" ?15:58
dansmithif you don't allow them to set max_instances below total, we lose the ability to do that, no?15:58
bauzasyou know what ? I can upload my change15:59
bauzasI'm done with the unitteests15:59
bauzasI can create a functional test (and I want to) but for the sake of discussing, let me just upload it now15:59
opendevreviewSylvain Bauza proposed openstack/nova master: WIP: libvirt: Cap with max_instances GPU types  https://review.opendev.org/c/openstack/nova/+/89962516:00
dansmithokay, but .. the above question is what I need answered to really say anything...16:01
dansmithbut I can put that comment in gerrit if you prefer16:02
opendevreviewSylvain Bauza proposed openstack/nova master: WIP: libvirt: Cap with max_instances GPU types  https://review.opendev.org/c/openstack/nova/+/89962516:04
bauzasdansmith: I added some comments ^16:04
dansmithokay I'll just drop this comment/question in gerrit and we can iterate from there.. I have a call starting in a bit anyway16:04
bauzasdansmith: fwiw, haven't seen your comments from PS4 but we run the periodic just after starting the service16:05
bauzasso this works16:05
bauzasgiven operators would modify the option, they would restart the nova-compute service so it will then delete the non-needed inventories then16:06
dansmithyeah, but we may have boot requests pending for us when we start, and it means we're technically doing the cleanup on each run right?16:07
bauzasdansmith: that's a tradeoff but I see your point16:17
bauzaswe could force a update_rp() right after init_host in the compute manager before we run RPC16:18
bauzasbut I think the problem is the same for all things16:18
bauzasoh wait16:19
bauzashttps://github.com/openstack/nova/blob/master/nova/compute/manager.py#L177816:20
bauzaswe're fine then 16:20
bauzasI just replied16:22
bauzasanyway, I'll need to go errand16:22
bauzasmy patch is up, with a proposition16:22
bauzasan alternative could be to cap the GPU inventory instead of refusing it as a whole, but meh, people can review 16:23
bauzasI'll add functests on Monday when I'm back16:23
* bauzas jumps off the keyboard16:24
dansmithack16:26
-opendevstatus- NOTICE: The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete.16:34
-opendevstatus- NOTICE: Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs.17:13
tonybbauzas: your revert has merged23:36

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!