clarkb | I managed to get the vast majority of the paperwork I needed to do done today, so I'm hopeful I'll start looking at the Noble paste server tomorrow | 00:09 |
---|---|---|
frickler | infra-root: checking the log on bridge for https://zuul.opendev.org/t/openstack/build/764061aeb33b4d5b8eb9cc17b7274252 it looks like access to opendevci-raxflex is broken while opendevzuul-raxflex (the nodepool tenant) is still fine. is there anything known about that? | 06:21 |
frickler | I can confirm getting a 401 for the former cloud on bridge when testing myself | 06:22 |
frickler | hmm, actually this seems to have started failing on 2024-11-22. oldest log we have is 2024-12-10T02:38:15 and that already shows the failure | 07:46 |
frickler | anyway, seems my cleanup change hasn't been deployed yet because infra-prod-base ran before the repo was updated on bridge, so I'll have to wait another day to confirm it works as expected. today's u-a mails were cluttered anyway due to bind9 updates | 07:57 |
frickler | infra-root: looks like there is an issue with the github token used by zuul that I recently updated, see the discussion in #openstack-infra | 09:51 |
frickler | I've updated the token now and in addition manually updated the zuul.conf on the schedulers, but I'm unsure a) whether this is urgent enough to warrant a manual zuul restart and b) whether https://docs.opendev.org/opendev/system-config/latest/zuul.html#restarting-zuul-services is still up-to-date | 09:53 |
frickler | chandankumar: fyi ^^ I'll wait for feedback from other admins before proceeding | 09:54 |
chandankumar | sure, thank you :-) | 09:54 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: End gating for qdrouterd role https://review.opendev.org/c/openstack/project-config/+/938188 | 11:11 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Remove qdrouterd role from the config https://review.opendev.org/c/openstack/project-config/+/938194 | 11:11 |
opendevreview | Merged openstack/project-config master: End gating for qdrouterd role https://review.opendev.org/c/openstack/project-config/+/938188 | 13:44 |
opendevreview | Merged openstack/project-config master: End gating for Freezer DR project https://review.opendev.org/c/openstack/project-config/+/938180 | 13:44 |
opendevreview | Merged openstack/project-config master: Remove Murano/Senlin/Sahara OSA roles from infra https://review.opendev.org/c/openstack/project-config/+/935683 | 13:49 |
opendevreview | Merged opendev/system-config master: Mirror jaegertracing https://review.opendev.org/c/opendev/system-config/+/938690 | 15:45 |
clarkb | opendevci-raxflex not working but opendevzuul-raxflex working is not known or expected. The credentials are the same as our regular rax stuff iirc but we point at a different keystone or something like that | 15:59 |
clarkb | frickler: also you are correct, that document for restarting zuul still looks to assume the single-scheduler setup, so it isn't accurate | 15:59 |
clarkb | frickler: at this point I think we can safely restart any single zuul service except for the executors; those need to be gracefully shut down to avoid noticeable impact | 16:00 |
clarkb | frickler: https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul_reboot.yaml this is what the weekly upgrades run | 16:00 |
clarkb | looks like it basically does a scheduler stop, then reboot, then start. So I think you can do a stop then start (no reboot) on each scheduler one at a time. May need to do that for the web processes too. I'm not sure if they use the github token | 16:01 |
clarkb | I can confirm that openstack client reproduces the raxflex auth issue. However authenticating to old rax does not fail. Visual inspection of clouds.yaml content doesn't show any obvious differences in the bits that are raxflex specific between opendevci-raxflex and opendevzuul-raxflex, nor do I see any obvious differences in the shared bits between opendevci-raxflex and openstackci-rax | 16:16 |
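
(A minimal sketch of the comparison described above, assuming the clouds.yaml entries on bridge are named as in the discussion; the exact commands clarkb ran aren't shown in the log.)

```shell
# Request a token from each cloud; the broken tenant gets an HTTP 401
# back from POST /v3/auth/tokens while the other two succeed.
openstack --os-cloud opendevci-raxflex token issue     # fails with 401
openstack --os-cloud opendevzuul-raxflex token issue   # nodepool tenant, works
openstack --os-cloud openstackci-rax token issue       # old rax, shared creds, works
```
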
fungi | i wonder if it's a domain caching problem. has anyone tried logging into skyline with that account yet? | 16:16 |
clarkb | the error occurs when doing POST /v3/auth/tokens; you get back a 401 | 16:17 |
clarkb | fungi: I have not. The extent of my debugging is above | 16:17 |
fungi | half suspecting that if we log into the webui once, everything else will start working again | 16:18 |
fungi | but i'm hesitant to do that if we want to give one of the flex folks an opportunity to take a look first | 16:18 |
clarkb | oh right seems like you hit similar when we first set that up | 16:18 |
fungi | right, that's why i'm coming back to it | 16:19 |
clarkb | cardoe: ^ fyi. Doesn't look like klamath is in the channel right now | 16:26 |
frickler | ok, so I'll restart (stop+start) the schedulers on zuul01+02 now. if the web services need a restart, that's fine to wait another two days I'd think | 16:26 |
cardoe | Looking. | 16:26 |
clarkb | frickler: the important thing is you do them one at a time and wait for them to fully come up before doing the next one | 16:26 |
frickler | yes, sure | 16:26 |
clarkb | frickler: I think it may take up to like 20 minutes for a scheduler to fully come up | 16:26 |
clarkb | the https://zuul.opendev.org/components list is a good way to verify it is up after the logs look good | 16:28 |
clarkb | oh ya that is what the playbook does before proceeding too | 16:30 |
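
(For reference, the same information as the components page is available from Zuul's web API; a sketch assuming the usual /api/components endpoint.)

```shell
# Poll the component registry; a freshly restarted scheduler shows up as
# "initializing" and should report "running" before the next one is restarted.
curl -s https://zuul.opendev.org/api/components | python3 -m json.tool
```
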
frickler | ok, if I read those playbooks correctly, it does "cd /etc/zuul-scheduler/; docker-compose down" and then "... up -d", right? | 16:32 |
clarkb | frickler: yes that is my read of the ansible too | 16:32 |
frickler | with some sudo added. ok, will do that on zuul01 now | 16:33 |
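
(Roughly what the manual restart amounts to, based on frickler's reading of the playbook quoted above; a sketch, not the playbook itself.)

```shell
# On zuul01 first; only repeat on zuul02 once 01 is back to "running".
cd /etc/zuul-scheduler/
sudo docker-compose down
sudo docker-compose up -d
# Then tail the scheduler logs and watch https://zuul.opendev.org/components
```
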
clarkb | the components list shows 01 is initializing so that's good. Now we wait | 16:35 |
clarkb | frickler: note that did end up picking up one new commit in zuul. But that commit is a css-only update so it shouldn't affect anything outside of browsers until we restart the webs. And even then the impact is minimal | 16:39 |
clarkb | components list shows the scheduler is running now. That was faster than I expected. | 16:41 |
frickler | ok, logs also look normal to me, doing zuul02 now | 16:48 |
frickler | jobs are running now for https://zuul.opendev.org/t/openstack/status?change=938776 after a recheck, so the new github token is working better, too | 16:53 |
frickler | #status log restarted zuul-scheduler on zuul01+02 in order to pick up a new github api token | 16:54 |
opendevstatus | frickler: finished logging | 16:54 |
clarkb | thank you for taking care of that | 17:01 |
frickler | logs on zuul02 also look fine, so I consider this topic done for now | 17:01 |
fungi | thanks! | 17:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update to Gitea 1.23 https://review.opendev.org/c/opendev/system-config/+/938826 | 17:05 |
clarkb | if ^ passes I'll work on holding a node so that we can check it more interactively | 17:05 |
cardoe | clarkb: did you create a newer account and are trying to use it in that project? | 17:48 |
clarkb | cardoe: no | 17:49 |
cardoe | They did some swizzling such that newer accounts do not get some default permission needed to see projects. | 17:49 |
clarkb | it's the same credentials that we use against rax classic. fungi ran into something similar when we first set this up and apparently logging into skyline at the time got things connected between the two systems. | 17:49 |
fungi | yeah, sounds like whatever broke this was back in november, we only just noticed we can't authenticate to it | 17:51 |
cardoe | What are the actual usernames and/or tenant ID? | 18:21 |
frickler | cardoe: broken one is username: 'openstackci', project_id: 'ed36be9ee2a84911982e95636a85698c' | 18:42 |
frickler | for comparison, the working one is openstackjenkins / f063ac0bb70c486db47bcf2105eebcbd | 18:42 |
fungi | note that we ended up using the local cloud project ids because of the sync problem i hit during early setup where the nnnnn/Flex federated ids were rejected by the api until i first logged into skyline | 18:54 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM intentional Gitea failure to hold a node https://review.opendev.org/c/opendev/system-config/+/848181 | 18:56 |
fungi | i think something may be amiss with nl01, i'm getting random connection resets trying to ssh into it, but sometimes i get through | 18:57 |
clarkb | tkajinam: do you still need your heat-functional held node? | 18:57 |
clarkb | fungi: nl01 pings from here lgtm over ipv4. Could it be ipv6 specific? | 18:58 |
fungi | maybe | 18:58 |
fungi | nope | 18:58 |
frickler | nl01 works fine for me over ipv6 (by default, didn't explicitly choose it) | 18:58 |
fungi | fungi@dhole:~$ ossh -4 nl01 | 18:58 |
fungi | Connection closed by 104.239.144.109 port 22 | 18:58 |
fungi | seems i get through about 50% of the time regardless of whether i'm going via v4 or v6 | 18:59 |
fungi | and nl02 works every time | 18:59 |
frickler | I do see the same when forcing v4 | 19:00 |
fungi | 0% icmp packet loss over both v4 and v6 | 19:00 |
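
(A quick way to quantify the intermittent resets described above; a sketch, with the full hostname assumed to be nl01.opendev.org and nothing more than a no-op run on the remote end.)

```shell
# Rough failure-rate check over IPv4; swap -4 for -6 to compare families.
for i in $(seq 20); do
  ssh -4 -o ConnectTimeout=10 nl01.opendev.org true || echo "attempt $i failed"
done
```
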
fungi | guessing it's something to do with state handling on its hypervisor host | 19:00 |
fungi | probably a noisy neighbor filling up the conntrack tables or something | 19:01 |
clarkb | I just did ~10 ssh connections with no problem | 19:01 |
fungi | weird | 19:01 |
fungi | then i guess it could be further into the network layer | 19:01 |
fungi | maybe their core flow handling has my sessions balanced 50% onto a bad switch and not yours or something | 19:02 |
clarkb | ya it's interesting that it reports the remote is closing the connection | 19:03 |
clarkb | but I guess that could happen due to being unhappy about asymmetrically routed packets | 19:03 |
fungi | it could be just about anything. lots of layer 3/4 middleboxen will spoof tcp/rst packets on behalf of the peer | 19:04 |
fungi | including poorly-behaved rate limiters and packet shapers | 19:04 |
frickler | since clarkb logged in, I'm not getting failures anymore either. problem fixed ;) | 19:05 |
fungi | hah | 19:05 |
fungi | and yeah, it's cleared up for me now too, whatever it was | 19:05 |
fungi | maybe the problem got routed around | 19:06 |
fungi | or a flood somewhere died off | 19:06 |
fungi | splitting here as a tangent from a discussion in the zuul matrix channel, is anyone else surprised that a held jammy node shows an installed version of podman from the original jammy release rather than a newer version available in jammy-security and jammy-updates? | 19:12 |
clarkb | does the job specify the version maybe? | 19:13 |
fungi | oh maybe... dpkg.log shows it being installed during the job at least, so it's not preinstalled | 19:15 |
fungi | yep | 19:17 |
fungi | podman=3.4.4+ds1-1ubuntu1 | 19:17 |
fungi | https://review.opendev.org/c/zuul/zuul-jobs/+/886552 Pin podman on jammy | 19:20 |
fungi | oh, right, that was because of https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2024394 | 19:20 |
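
(To confirm this on a held jammy node, a generic check would look something like the sketch below; it is not a command from the log.)

```shell
# Shows the installed version vs. the candidates in jammy-updates/-security;
# the installed one stays at 3.4.4+ds1-1ubuntu1 because the zuul-jobs role
# installs podman=3.4.4+ds1-1ubuntu1 explicitly on jammy.
apt-cache policy podman
```
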
cardoe | fungi/clarkb: auth should be fixed | 19:50 |
fungi | thanks cardoe!!! | 19:50 |
fungi | frickler: ^ | 19:51 |
clarkb | cardoe: fungi: confirmed osc server list is working for me now | 19:51 |
cardoe | So I misspoke about the "new accounts do not see projects" thing | 19:51 |
clarkb | cardoe: would logging in through skyline have made a difference? Mostly curious if that thought was relevant | 19:55 |
cardoe | Essentially there's some backend-side "this account is authorized for this product type" thing. When they started Flex it was just using the existing rax openstack value. They then made Flex require a new value, so you had to have that to auth against Flex. They tried to use some heuristics as to who used Flex and should get it. And they missed. | 19:57 |
cardoe | So skyline usage wouldn't have fixed it. | 19:57 |
clarkb | gotcha | 19:59 |
frickler | just strange that it seems to have been working for some time, then started failing like 6 weeks ago | 20:02 |
frickler | ah that's likely the point where that heuristic got applied? | 20:04 |
fungi | interestingly my personal rackspace account which has server instances in both rax classic and flex seems to be working fine | 20:05 |
fungi | so i guess we just got lucky with the control plane account | 20:05 |
fungi | (un)lucky | 20:05 |
frickler | anyway, good that it is working now and I'll recheck the results of the periodic jobs tomorrow | 20:06 |
clarkb | https://200.225.47.39:3081/opendev/system-config is the held gitea 1.23.0 | 20:48 |
clarkb | not sure if we want to try and get that in quickly due to it theoretically fixing the memory leak or wait for a .1. Typically they have a .1 not too long after the .0 release | 20:48 |
clarkb | tonyb: I've gotten to the point of getting ready to boot a Noble paste02 server and I don't see the noble image in rax dfw (or ord and iad). I thought you had converted and uploaded noble images everywhere. Am I misremembering? | 20:54 |
clarkb | Looks like on july 25th we discussed this and https://paste.opendev.org/raw/b8QSrgmqdHkbHY0S6801/ was going to be uploaded | 21:01 |
clarkb | but I'm not seeing those images when running image list or image list --shared | 21:02 |
clarkb | makes me wonder if the upload failed or maybe we didn't think we had to do the dance to upload via glance tasks but we do? or maybe the images were removed? | 21:03 |
jaltman | >***fungi sighs at https://bugs.debian.org/1089501 having been open for weeks now with no confirmation (also i hunted for upstream bug reports and changes in their gerrit, but no dice) | 21:04 |
jaltman | https://www.openafs.org/dl/openafs/1.8.13.1/RELNOTES-1.8.13.1 released on 19-Dec-2024 | 21:04 |
jaltman | Linux clients * Add support for Linux 6.12 (15964, 15965, 15966) | 21:04 |
clarkb | tonyb: I do see the image in other clouds that I've spot checked. Hoping you can remember what happened there. | 21:05 |
fungi | jaltman: thanks! i hadn't spotted the .1 release | 21:06 |
fungi | i'll prod the debian package maintainer | 21:06 |
clarkb | I see the vhd image file in tonyb's homedir on bridge so it did get converted from raw at least | 21:06 |
clarkb | I'm suddenly worried we're going to go through quite the experience just to upload an image to rax. But we know it is doable because nodepool does it | 21:11 |
clarkb | cloudnull not sure if this is on your radar yet but rax classic doesn't have Ubuntu Noble images yet (I thought we had uploaded our own, but it doesn't look like we did, hence my discovery that the cloud doesn't provide any yet either) | 21:15 |
cardoe | frickler: yeah 6 weeks sounds about right | 21:18 |
opendevreview | Merged zuul/zuul-jobs master: Add ability to exclude a specific platform https://review.opendev.org/c/zuul/zuul-jobs/+/926164 | 22:55 |
opendevreview | Merged zuul/zuul-jobs master: ensure-podman: add tasks to configure socket group https://review.opendev.org/c/zuul/zuul-jobs/+/925916 | 22:55 |
clarkb | I'm not sure if tonyb is around today or not but I think he is on vacation starting tomorrow? I guess I'll give it until then for feedback on the image, otherwise I'll try to convert a current image to vhd and then proceed to attempting to upload it to the three rax classic regions | 22:56 |
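
(A rough sketch of the conversion and upload steps being considered; the filenames and the final image name are placeholders, the cloud/region names come from earlier in the discussion, and whether the plain upload works or the glance task-based import is needed is still the open question noted above.)

```shell
# Convert raw -> dynamic VHD using qemu-img's vpc driver; force_size keeps
# the exact byte size rather than rounding to a cylinder boundary.
qemu-img convert -f raw -O vpc -o subformat=dynamic,force_size=on \
    noble-server.raw noble-server.vhd

# Simple upload attempt against one rax classic region.
openstack --os-cloud openstackci-rax --os-region-name DFW image create \
    --disk-format vhd --container-format bare \
    --file noble-server.vhd ubuntu-noble
```
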
clarkb | I've started brainstorming a rough outline of topics for a january opendev meetup in https://etherpad.opendev.org/p/opendev-january-2025-meetup | 23:41 |
clarkb | right now there isn't really an order, I'm just braindumping topics. Feel free to add your own and then as we get closer to a schedule/plan I'll try to organize things | 23:42 |