bjolo | godmorning | 07:43 |
bjolo | goodmorning i mean :) | 07:44 |
jrosser | good morning | 07:55 |
noonedeadpunk | \o/ | 08:16 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible-plugins master: Add ssh_keypairs role https://review.opendev.org/c/openstack/openstack-ansible-plugins/+/825113 | 10:25 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible master: Create ssh certificate authority https://review.opendev.org/c/openstack/openstack-ansible/+/825292 | 10:27 |
*** dviroel|out is now known as dviroel | 10:58 | |
jrosser | noonedeadpunk: did you see the ML stuff about the new RBAC for deployment projects? | 11:38 |
damiandabrowski[m] | he's not fully available today | 12:03 |
damiandabrowski[m] | guys, have you noticed that attaching a cinder volume does not work on our AIO? for some reason, iscsid.socket is not spawning iscsid.service | 12:05 |
damiandabrowski[m] | starting iscsid.service manually or rebooting the whole AIO helps | 12:06 |
damiandabrowski[m] | but I wonder how to properly fix it | 12:06 |
damiandabrowski[m] | (i'm testing it on focal) | 12:09 |
jrosser | that might be why the zun tests fail | 12:14 |
jrosser | (one of the reasons) | 12:14 |
jrosser | it could be an ordering issue, seeing the service state and journal from a fresh AIO might be interesting | 12:15 |
jrosser | to see if it ever tried to start, or if there is some error with the config which then doesn't get reloaded | 12:16 |
damiandabrowski[m] | I just spawned a fresh aio | 12:18 |
damiandabrowski[m] | https://paste.openstack.org/show/812220/ | 12:18 |
damiandabrowski[m] | don't see anything indicating that iscsid.service tried to start before | 12:19 |
andrewbonney | I've seen this before in my own AIOs, but relevant tests have passed in CI. I couldn't work out why there was a difference | 12:27 |
damiandabrowski[m] | i assume that's why this test is currently disabled (it tries to attach a volume to a VM, which does not work, so it fails): | 12:29 |
damiandabrowski[m] | https://github.com/openstack/openstack-ansible/blob/master/inventory/group_vars/utility_all.yml#L96 | 12:29 |
jrosser | damiandabrowski[m]: you might want to look at this | 12:33 |
jrosser | https://review.opendev.org/c/openstack/openstack-ansible-galera_server/+/824042 | 12:33 |
jrosser | it's not about iscsi but i used a socket activated service there to replace xinetd | 12:34 |
jrosser | so you can see the order that the services need to be created / loaded / restarted to make that work | 12:34 |
jrosser | i think that the state: "restarted" was a key thing on the socket service itself | 12:35 |
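The ordering jrosser describes can be sketched as Ansible tasks (a hedged sketch, not the actual galera_server change; the unit name is a placeholder and the `ansible.builtin.systemd` module is assumed):

```yaml
# Hypothetical handler ordering for a socket-activated service:
# refresh the unit files first, then restart the .socket unit itself
# so systemd re-reads its listener configuration.
- name: Reload systemd to pick up new unit files
  ansible.builtin.systemd:
    daemon_reload: true

- name: Restart the socket unit (the key step noted above)
  ansible.builtin.systemd:
    name: example.socket   # placeholder unit name
    state: restarted
    enabled: true
```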
damiandabrowski[m] | thank you, unfortunately restarting iscsid.socket does not help | 12:37 |
damiandabrowski[m] | I'm trying to find out what `ListenStream=@ISCSIADM_ABSTRACT_NAMESPACE` in the socket definition means, because I have literally no idea O.o | 12:38 |
jrosser | which bit? ListenStream? | 12:39 |
damiandabrowski[m] | no, `@ISCSIADM_ABSTRACT_NAMESPACE` :D | 12:39 |
jrosser | `If the address starts with an at symbol ("@"), it is read as abstract namespace socket in the AF_UNIX family.` | 12:40 |
damiandabrowski[m] | ahhh, thanks | 12:41 |
jrosser | https://www.freedesktop.org/software/systemd/man/systemd.socket.html | 12:42 |
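At the socket API level, the `@` prefix jrosser quotes corresponds to a leading NUL byte in the socket name. A minimal Linux-only Python sketch (the socket name is made up):

```python
import socket

# On Linux, binding an AF_UNIX socket to a name starting with a NUL byte
# places it in the abstract namespace -- no filesystem entry is created.
# systemd spells this leading NUL as "@" in ListenStream=.
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.bind("\0demo_abstract_namespace")  # made-up name
print(s.getsockname())  # b'\x00demo_abstract_namespace'
s.close()
```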
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible-os_nova master: Use ssh_keypairs role to generate cold migration ssh keys https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/825306 | 12:45 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible master: Create ssh certificate authority https://review.opendev.org/c/openstack/openstack-ansible/+/825292 | 12:45 |
jrosser | damiandabrowski[m]: this is related but a long time ago https://bugs.launchpad.net/ubuntu/+source/open-iscsi/+bug/1755858 | 12:54 |
damiandabrowski[m] | ah, so we probably hit our issue when this one was fixed | 13:01 |
damiandabrowski[m] | the interesting thing is that when i started and then stopped iscsid.service manually, nova/cinder were able to start it again when i tried to attach a volume (previously they couldn't do that) | 13:03 |
opendevreview | Merged openstack/openstack-ansible master: Move system_crontab_coordination role to collection https://review.opendev.org/c/openstack/openstack-ansible/+/824593 | 14:16 |
*** dviroel is now known as dviroel|lunch | 14:57 | |
DK4 | i'm using this guide for a ceph prod setup: https://docs.openstack.org/openstack-ansible/latest/user/ceph/full-deploy.html and it fails because of a ceph error in the second playbook: https://controlc.com/01090692 any ideas? | 15:45 |
jrosser | DK4: ultimately i think this is the documentation having diverged from reality https://paste.opendev.org/show/812224/ | 15:49 |
jrosser | i would think that in the past cidr_networks used to be available in ansible hostvars but that seems not to be the case any more | 15:50 |
DK4 | jrosser: thanks for the quick response. i think i found the mistake, i forgot the &-anchor in the userconfig | 15:50 |
jrosser | quickest workaround would be to replace those entries in user_variables.yml with the actual address ranges | 15:50 |
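The workaround could look something like this in user_variables.yml (a sketch only: the variable names follow the ceph-ansible integration used by the linked guide, and the address ranges are example values to replace with your own):

```yaml
# Instead of referencing cidr_networks from the inventory (no longer
# visible in hostvars), hard-code the actual ranges (example values):
public_network: 172.29.244.0/22
cluster_network: 172.29.248.0/22
```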
jrosser | DK4: for production deployments, we normally see people deploying separate ceph clusters, rather than integrated tightly with OSA | 15:52 |
jrosser | you have the choice to do it either way, but long term maintenance tends to be easier if they are decoupled | 15:53 |
jrosser | but size / scale / use-case can also play a part | 15:53 |
*** dviroel|lunch is now known as dviroel | 16:15 | |
evrardjp | hello folks. For keepalived I am still testing with ansible 2.9. Should I drop this? | 16:47 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible-os_nova master: Use ssh_keypairs role to generate cold migration ssh keys https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/825306 | 16:47 |
jrosser | evrardjp: we are a long way ahead of that now in these parts | 16:47 |
evrardjp | jrosser: including stable branches? | 16:48 |
jrosser | though it depends how far back you want to cover | 16:48 |
evrardjp | as long as OSA is covering I guess | 16:48 |
evrardjp | for the rest of the folks using the roles I think it's fine to move on. | 16:48 |
evrardjp | alternatively, old osa branches can just not bump keepalived role, which is fine too | 16:49 |
jrosser | ussuri is EM and the last place we used 2.9 | 16:49 |
evrardjp | ok then I should be good | 16:49 |
evrardjp | thanks for confirming jrosser! | 16:49 |
jrosser | we are 2.10 for V & W so that will be around for a while yet | 16:50 |
evrardjp | and happy new year to you, your family, and your team :) | 16:50 |
jrosser | thankyou :) | 16:50 |
evrardjp | good to see damiandabrowski[m] in here :) | 16:51 |
damiandabrowski[m] | hey JP! | 16:52 |
evrardjp | damiandabrowski[m]: unrelated convo, are you using matrix? | 16:55 |
damiandabrowski[m] | I am, is it wrong? :D | 16:56 |
evrardjp | Well, I do have matrix, but I am still using my bouncer. I want to get rid of it tbh | 16:57 |
evrardjp | was wondering if the bridge is nice nowadays. | 16:57 |
jrosser | i've never looked back from irccloud | 16:58 |
evrardjp | jrosser: a bit more context, when I was in TC, before the whole mess with freenode happened: https://governance.openstack.org/ideas/ideas/pylon/synchronous-and-pseudo-synchronous-comms.html#the-proposal | 16:59 |
evrardjp | but yeah irccloud is nice. | 17:00 |
damiandabrowski[m] | I'm using element.io and I'm quite happy with it, but I never used anything else(except some console client years ago) so I can't really compare | 17:01 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: do not include [*-feature-enabled] sections in tempest.conf https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825164 | 17:23 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Implement variable: tempest_endpoint_type https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825156 | 17:26 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Rename [orchestration] section to [heat_plugin] in tempest.conf https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825163 | 17:27 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: do not include [*-feature-enabled] sections in tempest.conf https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825164 | 17:31 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Implement variable: tempest_endpoint_type https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825156 | 17:32 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: do not include [*-feature-enabled] sections in tempest.conf https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825164 | 17:33 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Rename [orchestration] section to [heat_plugin] in tempest.conf https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825163 | 17:35 |
spatel | I have a glusterFS mount point on all my compute nodes and i pointed nova's /var/lib/nova at the glusterfs mount point and all is good, but when i delete a vm i found nova does not delete the disk file, so i have to do it by hand | 17:35 |
evrardjp | Sometimes I want to shoot myself in the head when I see the direction ansible and molecule are taking. Making things incredibly hard for 0 reason... | 17:35 |
spatel | did anyone notice this issue with shared mount points | 17:35 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: do not include [*-feature-enabled] sections in tempest.conf https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825164 | 17:40 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: do not include [*-feature-enabled] sections in tempest.conf https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825164 | 17:40 |
evrardjp | does anyone here have an example repo with molecule testing? I would like to know the recommended way to add the docker collection to make testing work with molecule, by requiring the collection in a new requirements.yml at the root of my repo | 17:46 |
evrardjp | without the docker collection, it all fails, as molecule-docker now requires it since version >1.0 | 17:46 |
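A sketch of what that root-level requirements.yml could contain, assuming the collection in question is community.docker (which recent molecule-docker releases depend on):

```yaml
# requirements.yml at the repo root; install with:
#   ansible-galaxy collection install -r requirements.yml
collections:
  - name: community.docker
```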
jrosser | evrardjp: the tripleo people started on this in the os_tempest role https://github.com/openstack/openstack-ansible-os_tempest/commit/3f4b58bd4133b83c8556c2275875188147d2a58b | 17:48 |
jrosser | but i feel that it really has not gone anywhere | 17:48 |
evrardjp | I see | 17:48 |
jrosser | however i'm not really sure this is going to be helpful | 17:49 |
evrardjp | I was using molecule, but pinned to old versions | 17:49 |
evrardjp | it's the new versions that are a pain, because they are assuming the docker collection is installed on the system | 17:49 |
jrosser | we do have the same problem in OSA, what to do in a post openstack-ansible-tests world | 17:49 |
jrosser | we are very close to dropping all the jobs relying on that repo now, as the maintenance overhead is just too much | 17:49 |
evrardjp | well, I feel that the value of having different repos is nowadays pretty much gone | 17:50 |
jrosser | but it does leave a gap with how to run tests in underlying/utility roles | 17:50 |
evrardjp | jrosser: would it make sense to take a stance, in the OSA community, about where you want to head in terms of testing, and get the ball rolling? | 17:51 |
jrosser | indeed | 17:51 |
evrardjp | If you feel it's not sustainable, that's something you need to fix | 17:51 |
evrardjp | you -> we | 17:52 |
evrardjp | was there any proposition raised in the last PTG? | 17:52 |
jrosser | mostly we have to address tech debt currently | 17:52 |
jrosser | so discussions tend to focus on that | 17:53 |
jrosser | and we do make good progress though | 17:53 |
jrosser | but i guess i mean feature debt rather than process / ci | 17:53 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Allow to create only specific tempest resources. https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/803477 | 17:55 |
jrosser | like i say in terms of sustainability, there is not enough effort to simultaneously keep openstack-ansible and openstack-ansible-tests both functional | 17:55 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Allow to create only specific tempest resources. https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/803477 | 17:55 |
jrosser | but i think we lack someone with insight / ready answer for ansible role testing outside of the AIO | 17:56 |
evrardjp | Then you might want to reconsider the current testing structure to simplify indeed | 17:57 |
evrardjp | moving all the things back to the openstack-ansible repo might help with maintainability | 17:57 |
evrardjp | (as you focus energy on testing scenarios) | 17:58 |
evrardjp | (and reduce the amount of repos) | 17:58 |
evrardjp | I see there is less and less reason to work on separate repos nowadays. | 17:58 |
jrosser | the only time that multiple repos is a big pain is when we want to do refactoring across them all | 17:59 |
evrardjp | maybe there should be some kind of project plan to do such refactors? | 17:59 |
evrardjp | moving to noop jobs isn't that hard ;) | 17:59 |
jrosser | this sort of thing https://review.opendev.org/q/topic:%2522osa/include_vars%2522+(status:open+OR+status:merged) | 17:59 |
evrardjp | I mean, it's all work, so you need to evaluate the end goal and if it's worth it over time | 18:00 |
jrosser | moving the existing roles to a collection would be easy | 18:00 |
jrosser | and with some benefit | 18:00 |
evrardjp | well, you could go as crazy as bringing all roles into the integrated repo, it would make things far simpler. But then you lose the flexibility of overriding easily. That is a price I am not sure the community is ready to pay | 18:01 |
jrosser | more problematic is key things like the new work on pki and ssh where the low level roles lack really any rigorous tests | 18:01 |
evrardjp | yaeh, but that's not something the _structure_ will fix | 18:02 |
jrosser | no indeed | 18:02 |
jrosser | more that we don't have a cookie-cutter pattern to use there yet | 18:02 |
evrardjp | that's understanding the importance of tests, a different topic :) | 18:02 |
evrardjp | we used to have one | 18:02 |
evrardjp | but it wasn't really well maintained | 18:02 |
evrardjp | testability of the roles is the hardest, tbh | 18:03 |
evrardjp | because that needs thinking what needs to run to be efficient | 18:03 |
evrardjp | standalone work can probably be more easily tested.... | 18:03 |
evrardjp | so what's holding those roles up to increase coverage? | 18:03 |
evrardjp | manpower? prioritization? | 18:04 |
evrardjp | or setting expectations for commits? | 18:04 |
jrosser | no, knowing what to do there "here is a great way to test your role in a zuul job -> use it" vs. having to figure that out | 18:04 |
evrardjp | I have to go for dinner, but it sounds to me that you found the solution: Define template, and document that in OSA ;) | 18:06 |
jrosser | and i would prefer that to be something as simple as possible, the openstack-ansible-tests repo was mind-bogglingly complex | 18:06 |
evrardjp | that's for sure | 18:07 |
evrardjp | for standalone you can probably use molecule directly ;) | 18:07 |
jrosser | right, and standalone should == in zuul | 18:07 |
jrosser | to match expectations with openstack-ansible repo | 18:07 |
jrosser | anyway, enjoy your dinner :) | 18:08 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Fix hardcoded flavor_ref and flavor_ref_alt https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/803492 | 18:15 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Fix hardcoded flavor_ref and flavor_ref_alt https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/803492 | 18:15 |
noonedeadpunk | Doh, it feels I missed all fun... | 18:27 |
noonedeadpunk | But I'm not convinced there's no reason to have roles in independent repos nowadays as well | 18:28 |
noonedeadpunk | Like looking at huge ceph-ansible (which is not _that_ big compared to osa) - the repo is really overloaded. | 18:31 |
evrardjp | I think the point was not structure, but test coverage: The need to have a documented "standard" for standalone testing, and simplify the coverage overall | 18:32 |
noonedeadpunk | Well, with roles kept separately it's easier to control coverage imo | 18:33 |
evrardjp | for non standalone, it seems the -tests repo is considered complex, and simplification would be welcomed. | 18:33 |
noonedeadpunk | as it's super easy to miss smth when it's all gathered together | 18:33 |
evrardjp | I agree with you | 18:33 |
evrardjp | it's however easy to "miss something" in all cases | 18:33 |
noonedeadpunk | While we do miss things now, we also kind of know what ) | 18:33 |
noonedeadpunk | but yeah | 18:34 |
evrardjp | I think jrosser is right however on deciding on a standard for standalone roles, which should be "easy to apply" | 18:34 |
noonedeadpunk | also, to continue with ceph-ansible - they run about 20 jobs for each change to cover the scenarios they have | 18:34 |
noonedeadpunk | infra will kill us for that approach :) | 18:34 |
noonedeadpunk | Yes, absolutely | 18:35 |
evrardjp | to simplify the -tests, I feel it _could_ (not saying we should do it) make sense to group the roles that can only be tested "together" | 18:36 |
evrardjp | e.g. use a collection, or make those part of the main repo | 18:36 |
noonedeadpunk | What we miss at the moment is a way of testing the collections themselves | 18:38 |
noonedeadpunk | (not even talking about publishing them) | 18:38 |
noonedeadpunk | so yeah, we have room for improvement, and I was thinking about just another, more unified and simplified way of running tests/test.yml compared to the tests repo tbh | 18:39 |
evrardjp | I am not understanding your proposition :) | 18:41 |
noonedeadpunk | so the idea is kind of to leverage gate-check-commit, but instead of deploying things, run test.yml from the repo | 18:42 |
noonedeadpunk | as the main pita with the tests repo was its own way of deploying things: own a-r-r, inventory, etc | 18:43 |
noonedeadpunk | which was leading to cross-dependencies and being unable to update the ansible version (as it's not possible to update it in 2 places at the same time) | 18:44 |
noonedeadpunk | but it might not be that good an idea, or it might not work out as expected. I wasn't really looking into details yet, just smth that came to mind during the previous meeting | 18:45 |
noonedeadpunk | but again - it's all about what issue we solve... This way it would be really easy to start per-repo tests that are defined only inside that repo, but have the exact same environment prepared as if it was a regular deployment | 18:49 |
jrosser | maybe we can move the bootstrap host role to the plugins collection | 18:50 |
jrosser | then that becomes a common piece | 18:50 |
noonedeadpunk | but we also need bootstrap-ansible? | 18:56 |
noonedeadpunk | I'd even say that bootstrap-ansible might be the key thing for such tests? | 18:57 |
evrardjp | bootstrap-ansible should be restricted to just install ansible in a way we expect | 19:13 |
evrardjp | for me, it _could_ make sense to make the bootstrap_host content the sole purpose of the test repo | 19:15 |
evrardjp | or alternatively, only focus on the integrated repo for _everything_ | 19:16 |
evrardjp | (everything related to the integrated) | 19:16 |
evrardjp | but you're right it all depends on what you want to achieve | 19:16 |
evrardjp | for those large changes, think about the pain, write code, see if it's better, then iterate | 19:16 |
evrardjp | it has happened quite a few times that I rewrote a large chunk of code, thinking it would be easier long term, then abandoned it because it was clever but not simpler | 19:17 |
noonedeadpunk | oh, yes, it's really hard to see the whole picture until you start doing smth... And that leads to work being abandoned :( | 19:25 |
noonedeadpunk | jrosser: answering your question - no, I haven't yet. But I wonder how that aligns with https://review.opendev.org/c/openstack/openstack-ansible-os_glance/+/823009 (the part where I tried to add an extra role to the admin one for services) | 19:26 |
jrosser | i don't know tbh - there seems to be a lot of stuff in the ML thread now | 19:27 |
jrosser | some implemented, some not right now | 19:27 |
noonedeadpunk | `add service role to all service users` :) | 19:28 |
noonedeadpunk | I kind of read their ideas https://review.opendev.org/c/openstack/openstack-ansible-os_glance/+/823009/6/defaults/main.yml#154 | 19:28 |
noonedeadpunk | I will read properly tomorrow and will iterate on that... | 19:28 |
noonedeadpunk | but yes, thread is big now... | 19:29 |
noonedeadpunk | but indeed, overall changes seem big | 19:45 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Bump OpenStack-Ansible master https://review.opendev.org/c/openstack/openstack-ansible/+/825390 | 19:50 |
noonedeadpunk | not sure though what project/system scopes will bring us in terms of changes | 20:00 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/xena: Bump OpenStack-Ansible Xena https://review.opendev.org/c/openstack/openstack-ansible/+/825391 | 20:03 |
DK4 | does osa have any means to recover from a complete mariadb failure? are there any recover functions like in kolla? | 20:17 |
noonedeadpunk | DK4: well, we're not finding out the member with the latest state, that's for sure. But you can trigger a re-bootstrap and define the bootstrap node explicitly https://opendev.org/openstack/openstack-ansible-galera_server/src/branch/master/defaults/main.yml#L21-L24 | 20:25 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/wallaby: Bump OpenStack-Ansible Wallaby https://review.opendev.org/c/openstack/openstack-ansible/+/825395 | 20:27 |
noonedeadpunk | But tbh I won't trust any automation tooling to recover my galera cluster :) | 20:28 |
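A hedged sketch of what an explicit bootstrap override could look like in user_variables.yml. The variable names below are hypothetical placeholders; the real ones are in the linked defaults/main.yml lines of the galera_server role, and the container name is made up:

```yaml
# Hypothetical override: force which node the galera role treats as the
# cluster bootstrap member during recovery (names are placeholders --
# check the linked defaults/main.yml for the actual variables).
galera_server_bootstrap_node: "infra1_galera_container-abcdef12"
galera_force_bootstrap: true
```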
spatel | jamesdenton around | 20:33 |
jamesdenton | maybe | 20:33 |
spatel | I am still dealing with my GPU issue.. take a look here - https://paste.opendev.org/show/812235/ | 20:33 |
jamesdenton | k | 20:33 |
spatel | I have two GPU card in compute node and don't know how my flavor will target them? | 20:33 |
noonedeadpunk | spatel: and you want to do just passthrough? As with the v100 I guess it might make sense to use vgpus instead? | 20:34 |
spatel | if you see, both my GPU PCI cards have the same vendor/device id 10de:1df6 | 20:34 |
spatel | noonedeadpunk passthrough because we don't have a license for vGPU | 20:34 |
spatel | i believe we need to buy it in order to unlock that feature | 20:35 |
noonedeadpunk | passthrough should be really straightforward... But I think it still might be up to placement to report compute capabilities.... | 20:36 |
spatel | if i spin up the first VM it works, but how does the second VM know i need to use the second GPU? | 20:36 |
noonedeadpunk | not sure though | 20:36 |
mgariepy | "pci_passthrough:alias"="tesla-v100:1" will match 1 gpu and assign it to your vm | 20:37 |
jamesdenton | IIUC, the first flavor will assign 1 GPU, and the 2nd flavor will assign 2 GPU | 20:37 |
mgariepy | "pci_passthrough:alias"="tesla-v100:2" will match 2 gpus and assign 2 to your vm | 20:37 |
jamesdenton | ^^^ | 20:37 |
spatel | i have created two flavors, g1.small and g2.small, with tesla-v100:1 and tesla-v100:2 | 20:37 |
jamesdenton | the flavor is targeting the GPUs via the alias you defined | 20:38 |
jamesdenton | which in turn matches vendor/product id | 20:38 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/victoria: Bump OpenStack-Ansible Victoria https://review.opendev.org/c/openstack/openstack-ansible/+/825397 | 20:39 |
jamesdenton | the flavors you defined give you the ability to match a single GPU to a single VM (twice) or both GPUs to a single VM | 20:39 |
mgariepy | you also need to add the scheduling filtering stuff iirc | 20:39 |
spatel | i need single VM with single GPU | 20:39 |
jamesdenton | so, your first flavor with tesla-v100:1 should do that | 20:39 |
spatel | its not doing :( | 20:40 |
jamesdenton | ok, what's the error? | 20:40 |
mgariepy | what is the error ? | 20:40 |
mgariepy | lol @jamesdenton | 20:40 |
jamesdenton | mind-meld | 20:40 |
mgariepy | yep lol | 20:40 |
spatel | first {"code": 500, "created": "2022-01-19T20:39:07Z", "message": "Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 5c70f26e-840a-4336-b975-b8d81d3ef54f. Last exception: XML error: Hostdev already exists in the domain configuration", "details": "Traceback (most recent call last): | 20:41 |
jamesdenton | send me your GPU and i fix for you :D | 20:42 |
spatel | i thought tesla-v100:1 would target the first GPU and tesla-v100:2 would target the second GPU | 20:42 |
jamesdenton | no, it's more to do with scheduling 1 or 2 GPUs to the same VM | 20:42 |
spatel | ohhh | 20:42 |
mgariepy | lspci -nk -s 5e:00.0 | 20:43 |
spatel | lol :) | 20:43 |
mgariepy | make sure you do not have the nvidia or nouveau kernel module loaded for the gpus. | 20:43 |
spatel | i have 12 compute nodes, so 24 GPUs in total :) each node has 2 | 20:43 |
jamesdenton | which flavor did you use in your test? | 20:44 |
jamesdenton | try the single one, first | 20:44 |
spatel | https://paste.opendev.org/show/812236/ | 20:44 |
spatel | let me just use single flavor tesla-v100:1 and try | 20:45 |
spatel | mgariepy that output for you | 20:46 |
mgariepy | what's in your nova.conf for : [pci] passthrough_whitelist ? | 20:46 |
mgariepy | saw it, it seems correct :D | 20:46 |
mgariepy | should have something like: passthrough_whitelist = [{"vendor_id": "10de", "product_id": "1df6"}] | 20:47 |
mgariepy | on your computes with gpus. | 20:48 |
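Putting the pieces from this exchange together, a nova.conf sketch (wallaby-era option names; the alias name must match what the flavor references via "pci_passthrough:alias", here tesla-v100, and the filter list is an example, not a complete one):

```ini
# On compute nodes with GPUs: which PCI devices may be passed through.
[pci]
passthrough_whitelist = [{"vendor_id": "10de", "product_id": "1df6"}]
# The alias referenced by flavors as "pci_passthrough:alias"="tesla-v100:1"
alias = {"vendor_id": "10de", "product_id": "1df6", "device_type": "type-PCI", "name": "tesla-v100"}

# On scheduler nodes: append PciPassthroughFilter to the existing list
# (the filtering mgariepy mentions above).
[filter_scheduler]
enabled_filters = AvailabilityZoneFilter,ComputeFilter,PciPassthroughFilter
```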
spatel | jamesdenton look at this, i created both vms with the single-GPU flavor and the second one errored out - https://paste.opendev.org/show/812238/ | 20:48 |
spatel | i do have passthrough_whitelist | 20:49 |
spatel | let me show you | 20:49 |
spatel | https://paste.opendev.org/show/812239/ on my compute node | 20:50 |
jamesdenton | so, spatel - there was a patch to libvirt about 6 mos ago that introduced that error: https://www.mail-archive.com/libvir-list@redhat.com/msg218688.html | 20:52 |
jamesdenton | if i'm reading that correctly, anyway | 20:53 |
spatel | let me read | 20:53 |
prometheanfire | I'm trying to figure out why my infra nodes are not getting a storage_address in the container networks for the storage bridge | 20:55 |
jamesdenton | I think you have to do that by hand | 20:55 |
prometheanfire | I have the swift_proxy group bind added to the storage network | 20:55 |
spatel | hmm interesting | 20:55 |
spatel | patch by hand ? | 20:56 |
jamesdenton | no that was for prometheanfire | 20:56 |
prometheanfire | I have another cluster with it included, no idea why | 20:56 |
jamesdenton | oh hmm | 20:56 |
spatel | haha | 20:56 |
prometheanfire | it has a storage_hostS stanza, but only on one of the three infra nodes, so probably not it | 20:56 |
mgariepy | spatel, do you have any trace on the compute itself ? | 20:59 |
spatel | trace? | 20:59 |
jamesdenton | is there a traceback or error in the compute log | 20:59 |
spatel | let me look | 21:00 |
jamesdenton | and if you have nova-compute in debug mode, will it print the xml for the domain? i can't recall | 21:00 |
prometheanfire | adding swift_hosts instead of swift_proxy seems to have done it, maybe docs are bad or need updating | 21:00 |
jamesdenton | never, sir. | 21:00 |
prometheanfire | or not, br-storage still not found | 21:01 |
* prometheanfire shrugs | 21:01 | |
mgariepy | or an error in dmesg if the kernel did not allow you to do something for ${REASON} | 21:01 |
mgariepy | or in libvirt | 21:02 |
spatel | https://paste.opendev.org/show/812240/ | 21:03 |
spatel | in libvirt single line error - error : virDomainDefDuplicateHostdevInfoValidate:1082 : XML error: Hostdev already exists in the domain configuration | 21:03 |
spatel | can i open a bug for nova? maybe something is already patched and i am running older code | 21:05 |
spatel | I am running wallaby | 21:05 |
jamesdenton | what version of libvirt? | 21:06 |
spatel | libvirt version: 7.6.0 | 21:06 |
spatel | https://www.mail-archive.com/libvir-list@redhat.com/msg218688.html looks very close to our issue | 21:07 |
spatel | don't know if i have patched version or not | 21:07 |
jamesdenton | 7.6.0 appears to be where it was introduced | 21:08 |
jamesdenton | this may be an unintended side effect | 21:08 |
spatel | or maybe i am missing some config or setting | 21:09 |
spatel | jamesdenton you are correct, when i use the tesla-v100:2 flavor i can see two GPUs attached in my VM | 21:15 |
jamesdenton | that's working? | 21:16 |
jamesdenton | that's the one i would expect not to work :D | 21:16 |
spatel | yes, i can see two GPUs connected to the vm in the lspci output | 21:16 |
jamesdenton | but glad to hear it | 21:16 |
spatel | look for yourself - https://paste.opendev.org/show/812241/ | 21:17 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Remove ANSIBLE_ACTION_PLUGINS override https://review.opendev.org/c/openstack/openstack-ansible/+/824595 | 21:17 |
jamesdenton | very nice. for grins, can you 'virsh dumpxml <domain>'? | 21:17 |
spatel | k | 21:17 |
spatel | let me pull out | 21:17 |
spatel | https://paste.opendev.org/show/812242/ | 21:18 |
spatel | hostdev0 and hostdev1 | 21:19 |
spatel | look like same issue i have :) - https://bugs.launchpad.net/nova/+bug/1628168 | 21:21 |
jamesdenton | i saw that, but being 5+ years old i'm likely to ignore | 21:22 |
spatel | lol | 21:22 |
spatel | thinking of opening a bug for nova to see how it goes | 21:23 |
jamesdenton | good call. i wonder if the same hostdev alias is being used for the second instance and causing any kind of issue (no idea) | 21:25 |
jamesdenton | like, does it compare domains or only the single domain configuration | 21:25 |
spatel | hmm | 21:26 |
spatel | let me open bug and see how it goes | 21:28 |
*** dviroel is now known as dviroel|out | 21:38 | |
spatel | jamesdenton by the way i built this cloud using kolla-ansible with OVN for networking, this cloud has around 50 compute nodes | 21:41 |
spatel | kolla-ansible is a hard requirement from the customer but in the next upgrade i am planning to migrate it to OSA | 21:42 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible master: Remove tempest.api.volume.admin.test_multi_backend test https://review.opendev.org/c/openstack/openstack-ansible/+/825166 | 21:54 |
jamesdenton | i noticed that. which openstack version? | 21:55 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Allow to create only specific tempest resources. https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/803477 | 21:55 |
spatel | wallaby | 21:55 |
spatel | This is an HPC openstack, it has all kinds of cool toys like GPUs and an InfiniBand network with 200Gbps | 21:56 |
spatel | Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6] | 21:58 |
spatel | They are going to use it for Research | 21:58 |
spatel | did you work on mechanism_drivers = mlnx_infiniband? | 22:02 |
spatel | jamesdenton do you know what Partition Keys (PKEY) per network are? | 22:08 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Fix hardcoded flavor_ref and flavor_ref_alt https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/803492 | 22:18 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Add support for both Credential Provider Mechanisms https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825403 | 22:26 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Remove unused variables https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825405 | 22:46 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Do not store unnecessary sections in tempest.conf https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825407 | 23:23 |
opendevreview | Damian Dąbrowski proposed openstack/openstack-ansible-os_tempest master: Fix hardcoded instance_type in [heat_plugin] section https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/825408 | 23:23 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!