*** cloudnull10 is now known as cloudnull1 | 04:47 | |
*** elodilles is now known as elodilles_pto | 06:14 | |
mnasiadka | morning | 06:25 |
---|---|---|
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:27 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:28 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:29 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:29 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:30 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:32 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:34 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:38 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:38 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add centos-10-stream image definition https://review.opendev.org/c/opendev/zuul-providers/+/953726 | 06:41 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:42 |
mnasiadka | I think https://review.opendev.org/c/opendev/zuul-providers/+/953269 needs another push to be gated | 06:43 |
opendevreview | James E. Blair proposed openstack/project-config master: Reapply "Switch all Zuul tenants to use niz nodesets" https://review.opendev.org/c/openstack/project-config/+/953769 | 13:41 |
corvus | i think we're ready to try that again ^ | 13:50 |
opendevreview | Clark Boylan proposed opendev/system-config master: Replace zk06 with zk01 https://review.opendev.org/c/opendev/system-config/+/951164 | 15:02 |
clarkb | corvus: that change seems fine to me +2. I'd also like us to try and do ^ if we think today is a good day for it. I had to rebase to address a merge conflict with the change to zk12's inventory details though, which is why it just got a new patchset | 15:02 |
clarkb | corvus: fungi: I think the big things today are openstack and starlingx DCO stuff and then zuul-launcher? If we think that we're good to proceed with zk updates in the background while doing those I'd appreciate reviews/+2s | 15:03 |
clarkb | https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2025 is the process notes for that. I just checked and zk05 is still the leader so starting with zk06 should be good | 15:03 |
corvus | lgtm | 15:04 |
fungi | yeah, once i'm out of meetings i can take a closer look | 15:07 |
clarkb | thanks! | 15:07 |
clarkb | seems like it is fairly quiet and I think I should be able to work through the process today if others aren't worried about potential conflicts with other tasks | 15:08 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add a summitspeakers@lists.openinfra.org mailing list https://review.opendev.org/c/opendev/system-config/+/953783 | 15:10 |
fungi | oh that's a neat idea | 15:10 |
mnasiadka | clarkb: Any idea what to do with https://review.opendev.org/c/opendev/zuul-providers/+/953269? It seems it's randomly failing at least one/two/three jobs on source-repositories element (fetching that number of repos must be fun for gitea) | 15:36 |
clarkb | mnasiadka: the latest round of failures were actually due to failures to upload the resulting image to swift | 15:38 |
mnasiadka | uh, now I see that | 15:38 |
mnasiadka | Actually not | 15:38 |
mnasiadka | source-repositories fails but doesn't fail the job properly, and then the job goes to the post phase to publish an image that isn't there | 15:39 |
mnasiadka | see https://zuul.opendev.org/t/opendev/build/9a3728c7cdc242c9ba6158ada739445f | 15:39 |
clarkb | https://zuul.opendev.org/t/opendev/build/03c946441bee4ba79614e3deaa85e39b/log/job-output.txt#9761-9884 this looks successful to me | 15:39 |
clarkb | so anyway I think we have two failure modes going on here. One that is less in our control (but we can provide feedback to jamesdenton__ and dan_with about) and then the "we might be making too many requests to gitea all at once" problem which we have more control over | 15:40 |
jamesdenton__ | clarkb i think there was a maint in dfw prod, if you're having api issues there | 15:40 |
clarkb | jamesdenton__: this is sjc3 I think | 15:40 |
clarkb | jamesdenton__: https://zuul.opendev.org/t/opendev/build/04843a0faeee4888ac32d1acb5f361c9/log/job-output.txt#9342 this specifically | 15:41 |
clarkb | one option for the git fetches would be to create a job semaphore for those builds to reduce the total concurrency between themselves | 15:41 |
corvus | fetches are a small part of those builds | 15:42 |
clarkb | mnasiadka: ^ I think that may be where I would start. In the old system we only ever ran a maximum of 3 builds at the same time. Now we're running like 15. Maybe we set a semaphore of 5? | 15:42 |
clarkb | corvus: true, the semaphore is a very heavyweight solution. | 15:42 |
jamesdenton__ | thanks clarkb - looking into it | 15:43 |
corvus | how about changing the element to retry fetches? | 15:43 |
mnasiadka | that's another option I was thinking about | 15:43 |
corvus | what are the errors? | 15:43 |
corvus | here's one: 2025-06-30 12:40:16.384 | fatal: unable to access 'https://opendev.org/openstack/devstack-plugin-open-cas.git/': GnuTLS recv error (-110): The TLS connection was non-properly terminated. | 15:43 |
clarkb | that may be worth trying as well. One suspicion I have is that this could be related to the total number of connections more than the total load. So if we don't see retries make things better we may need a different approach but I would be open to trying that | 15:43 |
corvus | are they all like that? | 15:43 |
clarkb | corvus: yes I think so | 15:44 |
fungi | clarkb: mnasiadka: skimming that swift error briefly, why are we uploading a vhd there? | 15:44 |
fungi | i thought we only used those for rax classic due to xen | 15:45 |
fungi | oh, wait, this is temporary storage, ignore me! | 15:45 |
clarkb | fungi: yes, this is the intermediate step where we build the image then create a downloadable location for it in swift so that the followup publishing process can fetch and push to clouds as appropriate | 15:45 |
clarkb | yup exactly | 15:45 |
corvus | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-6h&to=now&timezone=utc | 15:45 |
corvus | there is a ramp up in connections from 12:30-1300 | 15:46 |
fungi | but yeah, looks like that error could be some sort of middlebox terminating the upload prematurely, the task starts at 13:31:36 and gets aborted at 13:48:53 (~17.5 minutes into the upload) | 15:47 |
corvus | looking at all the failures in https://zuul.opendev.org/t/opendev/buildset/403de0442fea49d78d5e5a1ded814ad3 it looks like 2 of them are git failures, and 5 of them are swift upload failures? | 15:48 |
clarkb | corvus: mnasiadka I wonder if there is a way to have git keepalive a connection across git repos. Probably not but if there is something like ssh control persistence that may help here | 15:48 |
clarkb | corvus: yes I think the latest buildset is mostly swift failures. But previous buildsets have largely been gitea failures (but fewer of them overall) | 15:48 |
clarkb | corvus: mnasiadka it is also possible that we're not the actual source of the problem we're just more likely to hit issues in a noticeable way if there is external pressure | 15:51 |
clarkb | with the old system the build would fail then go back into the queue and eventually get rebuilt. Now we get a failure and don't retry the entire set until the next trigger? | 15:52 |
clarkb | in that case retries would probably be sufficient | 15:52 |
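For context, retrying the fetch inside the element could look roughly like the sketch below; the wrapper name, attempt count, and delay are illustrative assumptions, not the actual diskimage-builder change.

```bash
# Hedged sketch of a retry wrapper around the element's git calls; the
# function name, attempt count, and delay are assumptions for illustration.
git_retry() {
    local attempt=1 max_attempts=3 delay=30
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "failed after ${max_attempts} attempts: $*" >&2
            return 1
        fi
        echo "attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# e.g. for the repo from the GnuTLS error above:
git_retry git clone https://opendev.org/openstack/devstack-plugin-open-cas.git
```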
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add image-build-jobs semaphore to limit concurrent jobs to 5 https://review.opendev.org/c/opendev/zuul-providers/+/953797 | 15:54 |
mnasiadka | Did this, but will have a look at the retries tomorrow | 15:54 |
mnasiadka | and check why source-repositories is not failing the job properly | 15:55 |
corvus | i mean, i left a -1 on that, but i feel like it's much more of a -2 | 15:55 |
corvus | i'm like -1.9 on that | 15:55 |
mnasiadka | sure, thought it's a fast option - but I guess it's better to fix the DIB element | 15:55 |
corvus | yeah, it's an easy option with a big negative impact on image build jobs | 15:56 |
clarkb | ok https://review.opendev.org/c/opendev/system-config/+/951164 has a +1 again if we want to proceed with replacing the first zookeeper server cc fungi | 15:56 |
corvus | https://zuul.opendev.org/t/opendev/buildsets?pipeline=periodic-image-build are generally doing pretty well. they take about 2 hours to run. with a 5-job semaphore we would double that to 4 hours. | 15:58 |
corvus | fixing the element should reduce the few failures we have there too | 15:58 |
fungi | clarkb: was the latest revision of 951164 just a rebase to resolve a merge conflict? | 16:00 |
clarkb | fungi: yes, ze12 was updated on the lines just above zk01 in the inventory file and that created a conflict | 16:01 |
clarkb | so rereview is more about "are we cool with trying to do this today" than about the actual code change itself I think | 16:01 |
fungi | are you ready for me to approve it now, or want to schedule that? | 16:01 |
clarkb | fungi: I think we either do it now or wait for wednesday (basically I want that change to go in early relative to my day start so that I can try and get through the whole cluster over a single work day) | 16:02 |
clarkb | so yes now is fine, or if you'd prefer we schedule it we can look at wednesday? | 16:02 |
fungi | just making sure you're already on hand and didn't want me to wait a few minutes | 16:03 |
clarkb | yup today works for me if we start now | 16:03 |
fungi | it's in the queue now | 16:03 |
clarkb | thanks! | 16:03 |
clarkb | once we get a little closer to that change merging I'll put zk06 in the emergency file and stop zookeeper there | 16:05 |
opendevreview | Merged openstack/project-config master: Reapply "Switch all Zuul tenants to use niz nodesets" https://review.opendev.org/c/openstack/project-config/+/953769 | 16:08 |
opendevreview | James E. Blair proposed openstack/project-config master: Add zuul-launcher max servers https://review.opendev.org/c/openstack/project-config/+/953803 | 16:12 |
corvus | i see some unexpected errors in the launcher: | 16:26 |
corvus | 2025-06-30 16:24:47,129 ERROR zuul.Launcher: [e: e8290ecbf5e947eba95425f2affb7d21] [req: 120137a5e4f544e6bcf89779b4c23177] Error in creating the server. Compute service reports fault: Build of instance f408fcde-0e98-4c99-95ef-c7a26bb93752 aborted: Image a6cca26c-0dae-46c0-91c8-35ce5b48cec6 is unacceptable: Image not in a supported format | 16:27 |
corvus | 2025-06-30 16:24:47,130 ERROR zuul.Launcher: [e: e8290ecbf5e947eba95425f2affb7d21] [req: 120137a5e4f544e6bcf89779b4c23177] Marking node <OpenstackProviderNode uuid=233c355a9ca14241b2841a668d6ce14a, label=niz-ubuntu-noble-8GB, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/raxflex-dfw3-main> as failed | 16:27 |
corvus | that's an image upload that finished.... right around that time | 16:27 |
corvus | 2025-06-30 16:24:00,950 DEBUG zuul.Launcher: Released upload lock for <ImageUpload 65c1f4978e5f41f9a7cb91592d61606f state: ready endpoint: raxflex/raxflex-DFW3 artifact: 3a52e8133fe542f4a35076804b6a3926 validated: True external_id: a6cca26c-0dae-46c0-91c8-35ce5b48cec6> | 16:28 |
corvus | i wonder if we're not waiting for a post-processing step | 16:28 |
clarkb | image not in a supported format would imply something along those lines. Do you know what format we're uploading to that provider? | 16:31 |
clarkb | if its qcow2 I wonder if the backend is converting to raw | 16:31 |
corvus | qcow2 | 16:33 |
clarkb | a lot of clouds convert to raw on the backend and when we've had problems with them doing that we've just uploaded raw images instead | 16:35 |
clarkb | I know with ceph this is advisable because you get better copy on write behavior from ceph (ironically) | 16:35 |
corvus | https://paste.opendev.org/show/b4UzBVq99xFIloG7gt8T/ | 16:35 |
corvus | that's what the 'image show' says | 16:36 |
clarkb | ya I think the backend conversions are hidden from us. Glance just repeats back what we upload? I think we can confirm something changes by comparing the checksum values though | 16:37 |
corvus | do you know what the "checksum" field means? | 16:37 |
clarkb | iirc the checksum in image show is glance calculated and if it changes the details on the backend that checksum won't match what we uploaded (this is a major flaw in glance that I've brought up before that I gave up on caring about when no one seemed to understand why it was problematic that I couldn't confirm the file I uploaded matched what glance thinks it has) | 16:38 |
clarkb | corvus: I believe it is an md5 checksum of the image | 16:38 |
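To make that comparison concrete, a hedged sketch follows; the local file path is a placeholder and the image ID is the one from the launcher error above.

```bash
# Compare a locally computed md5 against the checksum glance reports for the
# suspect image. Local path is a placeholder; image ID is from the error above.
md5sum /path/to/uploaded-image.qcow2
openstack image show -f value -c checksum a6cca26c-0dae-46c0-91c8-35ce5b48cec6
# If the values differ, the bits glance holds are not the bits that were uploaded.
```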
corvus | interesting! they are supposed to match in rax-flex, but they do not for this image | 16:40 |
corvus | iow, all the other working images have matching checksums, but this one does not | 16:40 |
clarkb | I wonder if this is just a "normal" bitflip type of situation then | 16:40 |
clarkb | or maybe a short write | 16:40 |
clarkb | and the "not in a supported format" error message is how nova / glance's internal checking surfaces that failure | 16:41 |
corvus | we didn't log any errors during the upload | 16:42 |
corvus | unfortunately, i have left https://review.opendev.org/931824 unaddressed, so i don't think we can localize the error further. i will implement that as pennance. | 16:43 |
corvus | meanwhile, i will delete that upload and let it try again. we may get more data that way | 16:43 |
corvus | actually... 1 sec. we can see if we have the same error in sjc3 | 16:44 |
corvus | because they should have come from the same file | 16:44 |
clarkb | can also cross check the md5 checksums | 16:46 |
clarkb | if they don't match then it would be region specific upload issue maybe | 16:46 |
corvus | ++ | 16:46 |
clarkb | I have put zk06.opendev.org in the emergency file. We're a few minutes away from merging the change to replace it. I'll shut down zookeeper there when we're a bit closer | 16:47 |
fungi | thanks! | 16:49 |
corvus | only the sjc3 upload is affected. unfortunately, two different launchers uploaded those. that isn't normally supposed to happen (we're supposed to try to download the image only once), but it can happen as a fallback. | 16:50 |
corvus | grr i mean, only the dfw3 upload is affected. sjc3 is fine. | 16:50 |
corvus | oh i think i see it | 16:54 |
corvus | remember those api errors mentioned earlier? | 16:54 |
clarkb | ya they affected uploads into the cloud regions or maybe we got short writes? | 16:55 |
corvus | yeah. still digging a bit, but we definitely hit one during the upload. we retried the upload, but i'm guessing something persisted, since i think we used the same image name. | 16:55 |
corvus | (that's why we switched launchers -- the second upload happened from the second launcher) | 16:56 |
*** iurygregory__ is now known as iurygregory | 16:58 | |
corvus | i think we need a "delete and replace upload" function, and to call it in cases like this. we probably shouldn't try to repeat a failed upload under the same name without deleting. | 16:58 |
clarkb | makes sense particularly for multi part uploads | 16:59 |
clarkb | it might consider some parts of the upload to be successful even when it shouldn't | 16:59 |
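Roughly the manual equivalent of that "delete and replace upload" idea, sketched with the OpenStack CLI; the local file path and image name are placeholders, not what the launcher actually does internally.

```bash
# Delete the stale/corrupt upload first, then re-upload under the same name,
# rather than retrying on top of whatever glance kept from the aborted attempt.
openstack image delete a6cca26c-0dae-46c0-91c8-35ce5b48cec6
openstack image create --disk-format qcow2 --container-format bare \
    --file /path/to/local-image.qcow2 replacement-image-name
```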
clarkb | ok testinfra tests are running now. The change should merge shortly. I'm going to stop zookeeper on zk06 now | 17:02 |
clarkb | connections appear to have rebalanced according to stat to both 04 and 05 as expected and 05 remains the leader | 17:03 |
opendevreview | Merged opendev/system-config master: Replace zk06 with zk01 https://review.opendev.org/c/opendev/system-config/+/951164 | 17:07 |
clarkb | ok the zookeeper deployment job is finally running. This should bring up zk01 in a half working state. I need to restart zk04 and zk05 to pick up the new server listings (in that order so the leader goes last) after the deployment is done | 17:37 |
clarkb | nodepool zk configs also updated at least on nl05 but services didn't restart there. Once I've got the cluster running in a three node state happily again I plan to restart nodepool services and maybe a zuul merger or two just to sanity check the connectivity to the new shape of the cluster | 17:38 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Replace StarlingX's CLA enforcement with the DCO https://review.opendev.org/c/openstack/project-config/+/953819 | 17:39 |
fungi | ildikov: i'll plan on merging that ^ tomorrow | 17:39 |
clarkb | `echo mntr | nc localhost 2181 | grep followers` shows 2 synced followers with one of them being a non voting follower | 17:40 |
clarkb | which is what I think we expect since zk01 hasn't been inducted into the cluster yet (need to restart the leader for that) | 17:40 |
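For reference, the four-letter-word checks being used through this rotation look roughly like this when run on a cluster member (follower counts only show up in mntr output on the leader):

```bash
# Is this node the leader or a follower?
echo stat | nc localhost 2181 | grep Mode
# On the leader: how many synced followers, and are any still non-voting?
echo mntr | nc localhost 2181 | grep -E 'zk_synced_followers|zk_synced_non_voting_followers'
```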
clarkb | ok cool zookeeper job completed successfully. I'm going to double check all three nodes have the right server list and that zk05 is still the leader then I'll restart zk04, check the state of things again then restart zk05 | 17:41 |
clarkb | yup this looks good I'm going to restart zk04 now | 17:42 |
clarkb | I think that went well. zk05 is still the leader and still reports the same follower counts | 17:44 |
clarkb | ok I'm proceeding with a restart of the zk05 leader next | 17:44 |
clarkb | zk04 is the new leader | 17:45 |
clarkb | there are two synced followers and no nonvoting followers. I think everything has caught up now | 17:46 |
clarkb | I need to edit the etherpad because it assumed zk05 would stay the leader but zk04 is leader so things change. Then I need to push up a change to remove zk05 and replace it with zk02 to get that testing then I can restart nodepool and a zuul merger or three | 17:47 |
opendevreview | Clark Boylan proposed opendev/system-config master: Replace zk05 with zk02 https://review.opendev.org/c/opendev/system-config/+/953821 | 17:51 |
clarkb | infra-root ^ quick review on that is appreciated | 17:52 |
clarkb | I'm going to restart nodepool services now | 17:52 |
clarkb | most new connections are going to zk05 but at least one made it to zk01 | 17:58 |
clarkb | corvus: do you think it makes sense to restart zuul mergers too to try and get connections onto zk01 or just trust that since the connection count there is 2 that we're probably good enough? | 18:00 |
corvus | it really does, by pure coincidence, look like about 50% of the openstack node traffic is going to niz, the other is still nodepool. 53% to be more exact. that's based on the labels we just happened to have switched so far. | 18:01 |
clarkb | nice | 18:01 |
corvus | clarkb: maybe try a merger or two? | 18:01 |
clarkb | on it | 18:01 |
corvus | just as a canary in case there's something we're not seeing | 18:02 |
corvus | i'm going to start merging the label-switch changes to drive more niz traffic. | 18:02 |
opendevreview | Merged opendev/zuul-providers master: Switch centos-9-stream from nodepool to niz https://review.opendev.org/c/opendev/zuul-providers/+/952715 | 18:03 |
clarkb | I did zm01 and zm02. zm01 moved from zk04 to zk05 and zm02 moved from zk04 to zk04 (it reconnected to the same one) | 18:04 |
clarkb | I'll keep doing more to see if I can get at least one to go to zk01 | 18:04 |
clarkb | zm03 connected to zk01 | 18:05 |
clarkb | zm04 connected to zk05. I think that's probably good though. We have nb05 and zm03 talking to zk01 | 18:07 |
*** iurygregory_ is now known as iurygregory | 18:07 | |
corvus | \o/ | 18:08 |
opendevreview | James E. Blair proposed openstack/project-config master: Grafana: update zuul status page for NIZ https://review.opendev.org/c/openstack/project-config/+/953823 | 18:09 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/953821 to keep things moving forward | 18:12 |
clarkb | corvus: I don't see zk01 in the grafana graphs for zuul. Do you know if I have to do anything special to get that data? | 18:13 |
clarkb | oh I see we filter on those node names in the graph definition | 18:13 |
clarkb | I'll just do a followup to update the graphs and not worry about it at the moment | 18:13 |
clarkb | unless you think that is important | 18:13 |
corvus | followup sounds fine | 18:14 |
corvus | we're collecting the data | 18:14 |
corvus | (it's in graphite) | 18:14 |
clarkb | cool | 18:14 |
corvus | https://opendev.org/airship/promenade/src/branch/master/zuul.d/nodesets.yaml | 18:19 |
corvus | i wonder what the story is there | 18:19 |
clarkb | might be like devstack where they are relying on the groups values to do special stuff in the jobs? | 18:20 |
corvus | maybe. surprised to see a group with the same name as a node though | 18:20 |
opendevreview | Merged opendev/zuul-providers master: Switch debian-bookworm to niz https://review.opendev.org/c/opendev/zuul-providers/+/952716 | 18:21 |
opendevreview | Merged opendev/zuul-providers master: Switch debian-bullseye-arm64 nodesets to niz https://review.opendev.org/c/opendev/zuul-providers/+/952717 | 18:21 |
clarkb | corvus: after I've got zk02 enrolled in the cluster and I've checked things out the next step is to do the out of band update for zuul's zookeeper connection to tell it to connect to all three new nodes (while only two are up and running) then do a zuul restart. I may have questions / ask for help at that point depending on how things are going. Also I wonder if we should restart | 18:21 |
clarkb | executors non-gracefully but then do the schedulers one by one to avoid a web ui and event handling outage? | 18:21 |
clarkb | this way we only have to do the big zuul restart once | 18:22 |
corvus | that plan sgtm | 18:22 |
corvus | launchers one at a time too | 18:22 |
clarkb | ack | 18:22 |
corvus | clarkb: do a hard stop of the executors all at once, that way will minimize jobs rolling over to executors that themselves are about to stop. | 18:24 |
clarkb | corvus: best way to do that is with ansible-playbook running zuul_stop.yaml with a limit to the executors? | 18:24 |
corvus | that should work. i have developed a habit of just doing an ad-hoc ansible command that does 'docker-compose down'. :) | 18:25 |
clarkb | happy to reuse what you know works too :) | 18:26 |
clarkb | something like `ansible -f 20 zuul-executor -m shell -a 'cd /etc/zuul-executor && docker-compose down'` then run a similar docker-compose up -d after | 18:27 |
clarkb | that may be best actually because then I can do each class of zuul service one by one more easily and give them the different behavior they want | 18:28 |
corvus | yep exactly | 18:28 |
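Spelled out, the ad-hoc sequence discussed above would look roughly like the sketch below; the group name and compose directory follow the convention in the command quoted earlier in the log.

```bash
# Hard-stop every executor at once, then bring them all back up.
ansible -f 20 zuul-executor -m shell -a 'cd /etc/zuul-executor && docker-compose down'
ansible -f 20 zuul-executor -m shell -a 'cd /etc/zuul-executor && docker-compose up -d'
```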
corvus | that's mostly because i'm too lazy to go look at what the playbook does. not that i necessarily think it's better. :) | 18:29 |
opendevreview | Merged opendev/system-config master: Replace zk05 with zk02 https://review.opendev.org/c/opendev/system-config/+/953821 | 18:29 |
clarkb | ok I'm going to stop zookeeper on zk05 now that ^ has merged | 18:29 |
clarkb | zk04 remains the leader | 18:30 |
opendevreview | Merged opendev/zuul-providers master: Switch ubuntu-jammy-arm64 labels to niz https://review.opendev.org/c/opendev/zuul-providers/+/952718 | 18:39 |
opendevreview | Merged opendev/zuul-providers master: Switch to using niz for ubuntu-noble-arm64 https://review.opendev.org/c/opendev/zuul-providers/+/952719 | 18:39 |
opendevreview | Merged opendev/zuul-providers master: Switch to using niz labels for rockylinux-8 https://review.opendev.org/c/opendev/zuul-providers/+/952720 | 18:39 |
opendevreview | Merged opendev/zuul-providers master: Switch to using niz for rockylinux-9 https://review.opendev.org/c/opendev/zuul-providers/+/952721 | 18:39 |
opendevreview | Clark Boylan proposed opendev/system-config master: Replace zk04 with zk03 https://review.opendev.org/c/opendev/system-config/+/953824 | 18:45 |
clarkb | I realized I could get things rolling with testing of ^ since I know which server is going to be replaced last now | 18:45 |
opendevreview | Clark Boylan proposed openstack/project-config master: Limit grafana graphs for zookeeper to zk01-03 https://review.opendev.org/c/openstack/project-config/+/953825 | 18:58 |
clarkb | zookeeper job has finished. I'm going to do my checks across zk01, zk02, and zk04 now. Then will restart zk01 and zk04 once I'm happy | 19:01 |
clarkb | yup everything looked as I expect it to so I've already restarted zk01 since it was a follower. Going to restart zk04 now which is the current leader | 19:04 |
clarkb | zk02 is the new leader and it has 2 synced followers and no nonvoting followers | 19:05 |
clarkb | so our cluster is now zk01, zk02, and zk04 with zk04 as leader. Most services are only aware of zk04 as a valid connection point. I'm going to go restart nodepool services and zuul mergers like I did last time now which should have them connect to zk01 and zk02 and zk04 as valid options | 19:06 |
clarkb | then we can look into the bigger zuul restart after updating its config | 19:07 |
clarkb | oh sorry zk02 is the leader not zk04 | 19:07 |
opendevreview | Clark Boylan proposed opendev/system-config master: Replace zk04 with zk03 https://review.opendev.org/c/opendev/system-config/+/953824 | 19:10 |
clarkb | there was a test failure in ^ due to my updating the zk04 node in tests to zk99. Should be fixed now | 19:10 |
clarkb | nodepool services have restarted; I believe nl05 is connected to zk02 | 19:16 |
clarkb | I'm going to run my zuul config update playbook against the zuul mergers and do them in bulk | 19:17 |
clarkb | ok I think that lgtm (zuul-mergers are done and are running with config for all three new servers now) | 19:22 |
clarkb | corvus: I think I want to do executors last so that test results for 953824 are not lost and have to restart | 19:23 |
corvus | ack | 19:23 |
clarkb | corvus: I'll do the launchers next. Can I hard stop/start them like mergers? looks like yes? | 19:24 |
clarkb | how much of a delay between each? | 19:24 |
corvus | yep... it only takes them like 10 seconds to come up | 19:25 |
clarkb | got it. Working on that next | 19:25 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: WIP: source-repositories: make git retry https://review.opendev.org/c/openstack/diskimage-builder/+/953829 | 19:27 |
clarkb | zl01 is done. I'll pause for a moment then do zl02 | 19:28 |
clarkb | and now zl02 is done | 19:29 |
clarkb | arg I didn't properly fix the zookeeper tests so I need new test results anyway | 19:29 |
clarkb | considering that I think I'll proceed with executors next now that launchers are done | 19:29 |
clarkb | I'm going to docker compose down all of them then docker compose up -d all of them | 19:30 |
clarkb | that is done | 19:33 |
corvus | i see nothing exploding on the launchers | 19:34 |
clarkb | corvus: you good with me proceeding to do scheduler and web on zuul01 then zuul02 once zuul01 is up again now? | 19:34 |
clarkb | I've updated configs on zuul01 and zuul02. I'll proceed with restarting scheduler and web services on zuul01 now | 19:37 |
corvus | yep | 19:38 |
clarkb | ok zuul01 has had its services restarted | 19:39 |
clarkb | zuul01 web says it is stopped in the components list | 19:40 |
clarkb | now it says it is initializing. I just had to wait a bit I guess | 19:40 |
opendevreview | Clark Boylan proposed opendev/system-config master: Replace zk04 with zk03 https://review.opendev.org/c/opendev/system-config/+/953824 | 19:41 |
clarkb | hopefully that fixes testing | 19:41 |
clarkb | corvus: /components says zuul01 scheduler is running now. That's super quick these days. | 19:44 |
corvus | :) | 19:44 |
clarkb | and now zuul web is running. I'm going to do zuul02 nowish | 19:44 |
clarkb | the hourly run for zuul will reset the zuul.conf to contain zk01,zk02, and zk04 which is fine since zuul doesn't look at that config except at startup so these restarts should have them all looking at zk01,zk02, and zk03. Then by the time the next zuul restart happens we should have all the zookeepers updated in our inventory | 19:46 |
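A quick way to confirm what a given host would pick up on its next restart (sketch: the [zookeeper] section and its hosts option are standard Zuul configuration; the path assumes the usual /etc/zuul layout):

```bash
# Show the ZooKeeper connection string the next zuul restart would use.
sudo grep -A 10 '^\[zookeeper\]' /etc/zuul/zuul.conf | grep '^hosts'
```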
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on timeout https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 19:47 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on timeout https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 19:48 |
clarkb | #status log Restarted all of zuul to pick up new zookeeper server configuration | 19:52 |
opendevstatus | clarkb: finished logging | 19:52 |
clarkb | we are now at the point of merging https://review.opendev.org/c/opendev/system-config/+/953824 to remove zk04 and add zk03. | 19:52 |
clarkb | I think it should pass testing now too :) | 19:52 |
opendevreview | Merged openstack/project-config master: Make scripts DCO compliant https://review.opendev.org/c/openstack/project-config/+/950770 | 19:55 |
corvus | i'm not sure i've seen this keystone error before: https://paste.opendev.org/show/bZd9L8ZsYwelQ1xSz6eY/ | 20:00 |
corvus | i'm not worried about it atm; it looks like it was a one-off. | 20:00 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add new-style node labels https://review.opendev.org/c/opendev/zuul-providers/+/953832 | 20:01 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Switch all nodepool labels to niz https://review.opendev.org/c/opendev/zuul-providers/+/953833 | 20:01 |
corvus | i'm ready to do that ^ to drive up the niz traffic more. | 20:01 |
clarkb | re the error I've not seen that one before | 20:04 |
clarkb | corvus: fungi https://review.opendev.org/c/opendev/system-config/+/953824 passed its zookeeper test job this time. Can I get a review and/or approval if it looks good? | 20:06 |
corvus | lgtm, despite the trailing whitespace :) | 20:07 |
frickler | hmm, I just noticed that I lost all my bash history of zuul-client commands on zuul02. also "sudo zuul-client autohold-list" first downloaded the container and then fails because the config seems broken. is this expected and I need to switch to something new? | 20:08 |
clarkb | frickler: the server was replaced over the weekend | 20:09 |
clarkb | I suspect that explains both behaviors. Do we need to issue a new auth jwt token on those servers? | 20:09 |
corvus | zuul-client autohold-list --tenant openstack works for me? | 20:10 |
frickler | oh, wait, I was missing the --tenant argument. adding that the command works fine indeed. so I just need to re-discover the proper commands I guess | 20:11 |
corvus | yes, if that argument is omitted, it outputs this error: zuulclient.cmd.ArgumentException: Error: the --tenant argument or the 'tenant' field in the configuration file is required | 20:11 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Switch all nodepool labels to niz https://review.opendev.org/c/opendev/zuul-providers/+/953833 | 20:19 |
corvus | okay that's ready, now with fewer config errors :) | 20:20 |
clarkb | corvus: and once those go in we shutdown nodepool launchers? | 20:22 |
clarkb | put another way what is the coordination point for that change | 20:22 |
clarkb | maybe zuul simply prefers niz over nodepool? | 20:22 |
corvus | yeah, zuul prefers niz, no shutdown necessary | 20:25 |
corvus | we may still have a few niz requests (cf those last couple of images -- need to check where we are on that) | 20:26 |
corvus | so next step would be to check the delta again, close the gap by adding any new labels necessary, then we can shutdown nodepool (if desired, but not required). | 20:27 |
corvus | if anything goes wrong, we can just revert that change and it's instantly back to nodepool. | 20:27 |
opendevreview | Merged opendev/zuul-providers master: Add new-style node labels https://review.opendev.org/c/opendev/zuul-providers/+/953832 | 20:28 |
opendevreview | Merged opendev/zuul-providers master: Switch all nodepool labels to niz https://review.opendev.org/c/opendev/zuul-providers/+/953833 | 20:28 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 20:28 |
corvus | yeah those :) | 20:28 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 20:31 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Drop niz- label prefix from nodesets https://review.opendev.org/c/opendev/zuul-providers/+/953835 | 20:33 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 20:33 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 20:34 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 20:36 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Remove "normal" labels, etc https://review.opendev.org/c/opendev/zuul-providers/+/953836 | 20:44 |
clarkb | I've done a first pass update at our meeting agenda. I dropped the centos 10 dib stuff as I think that is largely done now. I added some notes about zuul launcher status and zookeeper and zuul server replacements | 21:00 |
clarkb | anything else to add? | 21:00 |
corvus | that probably covers all the changes i'd ask to make | 21:00 |
fungi | nothing here | 21:03 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Increase launch-timeout for rax-ord https://review.opendev.org/c/opendev/zuul-providers/+/953841 | 21:30 |
corvus | i notice that rax-iad is turned off; what's the status with that? | 21:31 |
clarkb | corvus: I think that may have gotten lost when I ended up afk for a while | 21:32 |
clarkb | corvus: I suspect we can try turning it back on now and see if we get better behavior | 21:32 |
corvus | ack. maybe leave it be for a bit, until we run out of other node-related things to change. :) | 21:34 |
opendevreview | Merged opendev/system-config master: Replace zk04 with zk03 https://review.opendev.org/c/opendev/system-config/+/953824 | 21:36 |
clarkb | I will shutdown zookeeper on zk04 to get ahead of ^ momentarily | 21:37 |
clarkb | that's done. zk02 remains the leader and has one follower currently | 21:38 |
clarkb | after zk03 deploys I'll double check zk01 and zk02 still look good then restart zk01 then zk02 (since 02 is leader it goes last). Then I can restart all of the nodepool services one last time. Then we should be fully on the new cluster and can start looking at followup changes | 21:40 |
clarkb | zookeeper job has completed successfully. I'll check things and then do the necessary restarts | 22:08 |
clarkb | restarting zk01 made zk03 the leader | 22:10 |
clarkb | but the zxid seems appropriate and zuul is returning expected data so I think zookeeper handled that properly | 22:10 |
clarkb | I'm just surprised for it to become the leader before I restarted zk02 | 22:11 |
clarkb | zk02 went into a state where it stopped serving requests after that zk01 restart resulted in zk03 being leader. I believe that is because zk02 was running with a config that didn't allow for zk03. Restarting zk02 to pick up the config change has it working again as a follower | 22:12 |
clarkb | corvus: ^ not sure if there is any other checking you want to do. I think this went well. I'm going to work on restarting nodepool services so they are aware of the new (and final) cluster state | 22:14 |
corvus | clarkb: probably a good idea to do the graphs change now; would be good to double check those | 22:15 |
clarkb | corvus: https://review.opendev.org/c/openstack/project-config/+/953825 it is ready if you are | 22:16 |
corvus | clarkb: i think there's one more; left a comment | 22:17 |
clarkb | fwiw zk02 reported 2 synced followers one of which was nonvoting prior to the zk01 restart. I think when I restarted zk01 it decided that zk03 was the best leader because zk02 and zk03 were synced and zk03 had the higher id value | 22:17 |
clarkb | so I think I've convinced myself that this behavior is expected and not abnormal | 22:17 |
clarkb | bah how did I miss that one | 22:17 |
clarkb | then because zk02 didn't have config that knows about zk03 it basically went idle and stopped doing work | 22:18 |
clarkb | it was a coup amongst the zks | 22:18 |
opendevreview | Clark Boylan proposed openstack/project-config master: Limit grafana graphs for zookeeper to zk01-03 https://review.opendev.org/c/openstack/project-config/+/953825 | 22:18 |
clarkb | corvus: ^ there | 22:19 |
clarkb | nl07 is connected to zk02 (since zk03 became leader lots of stuff ended up over there and it is fine). I just need to finish restarting nl05 and nl06 then I think the application side of this move is done | 22:23 |
clarkb | nl05 also connected to zk02 | 22:24 |
clarkb | all services have been restarted on configs with the three new nodes. All three new nodes are in place and seem to indicate that they are talking to each other via stat and mntr 4 letter commands. Services are stopped on the old servers and they are removed from the inventory | 22:25 |
clarkb | I think confirming that grafana looks sane is the next step then we can do things like cleanup dns. I'll work on a change for that next | 22:25 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Remove zk04-zk06 from DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/953844 | 22:29 |
opendevreview | Clark Boylan proposed opendev/system-config master: Cleanup zookeeper config management https://review.opendev.org/c/opendev/system-config/+/953846 | 22:32 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove docker compose version from zuul services https://review.opendev.org/c/opendev/system-config/+/953848 | 22:35 |
clarkb | ok last call on the meeting agenda. I'm going to send that out in ~10 minutes | 22:36 |
clarkb | corvus: until we have live data we have this screenshot: https://58e37da755b04d7dc54b-c9565b17ee8809c34acc7a1ea3e19ee3.ssl.cf1.rackcdn.com/openstack/e7f70913b8b84f5394eaed9aa16995c6/screenshots/zuul-status.png | 22:39 |
corvus | zk01 latency looks a little bit higher than ould old values (0.4ms) but probably within the margin of error | 22:41 |
corvus | i don't see anything else unusual | 22:41 |
corvus | s/ould/our/ | 22:41 |
opendevreview | Merged opendev/zuul-providers master: Increase launch-timeout for rax-ord https://review.opendev.org/c/opendev/zuul-providers/+/953841 | 22:43 |
corvus | there's a handful of low-priority niz changes at https://review.opendev.org/q/hashtag:opendev-niz+status:open | 22:44 |
clarkb | ack I'll review those as soon as I hit send on this meeting agenda | 22:45 |
corvus | looks like we're at 87% niz and 13% nodepool | 22:48 |
corvus | clarkb: since i had a shell open, i ran the cacti graph create script for zk01-zk03 | 22:52 |
clarkb | thanks! | 22:52 |
opendevreview | Merged openstack/project-config master: Limit grafana graphs for zookeeper to zk01-03 https://review.opendev.org/c/openstack/project-config/+/953825 | 22:54 |
clarkb | corvus: all those niz related stats changes and nodeset/label updates lgtm | 22:54 |
corvus | \o/ | 22:54 |
clarkb | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-15m&to=now&timezone=utc seems to have updated now | 22:57 |
clarkb | oh I'm only looking at the last 15 minutes | 22:57 |
clarkb | you may wish to change the time range if you open that link | 22:57 |
corvus | lgtm | 22:58 |
clarkb | I'll probably leave the old servers up with their DNS records in place until at least tomorrow. I think this is a good pause point. Things seem to be working and we can see how zuul handles the periodic job enqueues later this evening before doing more destructive changes to the old servers | 22:59 |
corvus | clarkb: mnasiadka why is https://review.opendev.org/953269 marked WIP? | 22:59 |
clarkb | I don't know why | 22:59 |
corvus | i don't see an explicit event for it | 23:01 |
corvus | hypothesis: it was set by one of the rebase or web-based editing actions, possibly accidentally or without mnasiadka 's knowledge | 23:02 |
corvus | oh, well it has a depends on anyway | 23:03 |
clarkb | corvus: https://zuul.opendev.org/t/openstack/build/9707f99650374c28854400788ebfc0b1/log/job-output.txt#31-53 | 23:09 |
clarkb | corvus: I suspect that build got a nodeset built from nodes from two different providers | 23:09 |
clarkb | corvus: note the Region values in particular (only one provider value is set) | 23:10 |
clarkb | I noticed because the zookeeper test actually checks that external connectivity to the zookeeper non ssl port is blocked and I think it got confused running that test from bridge to the zk test server? | 23:10 |
clarkb | and that resulted in a socket timeout error rather than socket no route to host error | 23:11 |
clarkb | I think this may impact multinode testing in general since we try to set up vxlan tunnels for that over private ips iirc. But I can't be sure of that without more digging | 23:11 |
corvus | it does look like that was from two different providers. the launcher should prefer to use a single provider | 23:12 |
corvus | but it is capable of not doing so in extenuating circumstances | 23:12 |
corvus | (this is a change from nodepool) | 23:12 |
corvus | i'll try to see what the circumstances were | 23:13 |
clarkb | right nodepool would only ever provide nodes from the same provider or fail | 23:13 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Replace Airship's CLA enforcement with the DCO https://review.opendev.org/c/openstack/project-config/+/953849 | 23:13 |
clarkb | I see openstack grenade multinode jobs are succeeding in the gate so this isn't a hard fail for everything. That would support the extenuating circumstances theory | 23:14 |
corvus | clarkb: the node was a leftover ready node from a previous canceled request (probably a previous patchset of the same change). it looks like we could be better about the interaction of ready nodes and provider preference. | 23:28 |
clarkb | corvus: got it. The other thing I wonder about is why the provider info is lost there | 23:36 |
clarkb | maybe we don't carry that forward when reusing recycled nodes? | 23:36 |
corvus | yeah, an unattached ready node loses its provider info, mostly so that it can float to other providers/tenants/etc that have the same config. we should re-assign it when we select it. | 23:37 |
corvus | (it's a weird artifact of having multiple providers with the same endpoint, but they could be different in different tenants) | 23:38 |
corvus | clarkb: fyi https://review.opendev.org/953851 is the simplest way i can think to improve that. there's a better way, but it'd make a very complicated method even more complicated, so i'd like to try simple first. :) | 23:53 |
corvus | clarkb: i do agree that could end up causing problems for devstack runs, but hopefully not too many before we get that fixed. i think it's likely to be rare enough we can roll forward instead of back. | 23:56 |