Nick | Message | Time |
---|---|---|
corvus | the zuul playbook failed on the first deployment because "cmd": "docker-compose exec -T scheduler zuul-scheduler smart-reconfigure", did not run | 00:14 |
corvus | we should probably wrap that in a conditional to check if the scheduler is running | 00:15 |
corvus | hrm, incidentally, zuul02 is failing with a broken docker package, but i'm not going to bother fixing it since i'm about to delete it. | 00:16 |
corvus | i believe that zuul01 ran far enough to be a workable system, so i'm going to start it now | 00:16 |
corvus | hrm, looks like letsencrypt did not run | 00:28 |
corvus | i'm running it now manually | 00:39 |
Clark[m] | Looks like because infra-prod-base failed a bunch of things were skipped | 00:39 |
Clark[m] | As a hack you can copy the certs since the names don't change and worry about LE later | 00:40 |
corvus | yeah, unfortunately, i don't know the log url for infra-prod-base. that was my fault. i should have stopped the zuul01 scheduler sooner, but the deploy stuff was fast enough it cut off zuul01's database access :) | 00:40 |
corvus | i suppose that's probably on bridge somewhere? | 00:41 |
Clark[m] | corvus: on bridge it's in /var/log/ansible | 00:41 |
corvus | oh it's just called "base" | 00:41 |
corvus | i was looking for like "infra-prod" | 00:41 |
corvus | restart exim on zuul02 failed......... | 00:42 |
corvus | so... also weird... but also, still not going to fix :) | 00:42 |
corvus | anyway, letsencrypt is done, i'm going to stop zuul02 now and see if all the web stuff still looks good | 00:42 |
corvus | yep, looks good, only running on the new zuul01 now. zuul02 is down. | 00:43 |
corvus | hopefully round 2 goes better | 00:43 |
corvus | (still, we had like... a minute or so of a partial outage) | 00:44 |
opendevreview | Merged opendev/zone-opendev.org master: Replace zuul02 https://review.opendev.org/c/opendev/zone-opendev.org/+/953603 | 00:47 |
opendevreview | Merged opendev/system-config master: Replace zuul02 https://review.opendev.org/c/opendev/system-config/+/953605 | 01:34 |
corvus | https://imgur.com/MgD9uNl | 02:19 |
corvus | i'm not sure i've checked the status page at this time of day since the ui update... | 02:19 |
corvus | that's a lot of little boxes | 02:19 |
tkajinam | I wonder if anyone else sees problems with access to zuul build logs | 02:58 |
corvus | tkajinam: yes, i'm working on it... something is wrong with the load balancer | 02:58 |
tkajinam | corvus, thanks ! | 02:58 |
tkajinam | yeah I thought it could be a problem with the lb or network. firefox complains of PR_END_OF_FILE_ERROR while chrome complains of ERR_CONNECTION_CLOSED | 03:02 |
tkajinam | both indicate the response is not returned properly | 03:02 |
opendevreview | James E. Blair proposed opendev/system-config master: Update zuul-lb configuration https://review.opendev.org/c/opendev/system-config/+/953687 | 03:38 |
corvus | infra-root: ^ i completely forgot about that. i'm assuming we thought we had a good reason for that at the time... but i think maybe we should think about making that automatic. | 03:39 |
corvus | tkajinam: should be fixed now | 03:42 |
corvus | #status log replaced zuul01/zuul02 with noble servers; deleted old vms | 03:49 |
opendevstatus | corvus: finished logging | 03:49 |
opendevreview | James E. Blair proposed opendev/zone-opendev.org master: Replace ze01-ze06 https://review.opendev.org/c/opendev/zone-opendev.org/+/953688 | 04:04 |
opendevreview | James E. Blair proposed opendev/system-config master: Replace ze01-ze06 https://review.opendev.org/c/opendev/system-config/+/953689 | 04:04 |
corvus | infra-root: ^ i'd like to start on those tomorrow. | 04:05 |
tkajinam | corvus, thank you! I can access the logs now. | 04:40 |
opendevreview | Merged opendev/system-config master: Update zuul-lb configuration https://review.opendev.org/c/opendev/system-config/+/953687 | 04:59 |
corvus | infra-root: i also forgot about https://review.opendev.org/941468 -- it looks like we left /var/haproxy in place, which confused me since i didn't realize it was unused; /var/lib/haproxy is the new location. | 05:17 |
corvus | next time i'm at a terminal long enough to monitor, i plan to remove those old directories from both of the load balancers (zuul and gitea). | 05:18 |
corvus | in the meantime, i've confirmed that the zuul haproxy has the correct config now. | 05:18 |
opendevreview | Merged opendev/zone-opendev.org master: Replace ze01-ze06 https://review.opendev.org/c/opendev/zone-opendev.org/+/953688 | 14:19 |
corvus | #status log removed unused /var/haproxy from zuul-lb02 and gitea-lb02 | 14:22 |
opendevstatus | corvus: finished logging | 14:23 |
opendevreview | Merged opendev/system-config master: Replace ze01-ze06 https://review.opendev.org/c/opendev/system-config/+/953689 | 14:44 |
corvus | initial deployment of all of those was fine, except for ze05 which had a Failed to lock directory /var/lib/apt/lists/: E:Could not get lock /var/lib/apt/lists/lock. It is held by process 2294 (apt-get) -- hopefully fixed in the next run | 15:36 |
corvus | wonder if it was unattended-upgrades or something | 15:37 |
fungi | almost definitely, nothing else should be locking the package database while you're in the middle of initial setup | 15:42 |
fungi | probably just got lucky with the random timing on it | 15:43 |
corvus | oh weird, second run failed with the same pid holding the lock | 15:43 |
corvus | maybe it's a stuck unattended-upgrades run | 15:43 |
fungi | might be a stuck process in that case? i usually look at `ps afuxww` for a quick tree of parentage | 15:43 |
corvus | yep that's it exactly | 15:44 |
corvus | i'll kill it, then run apt-get update/upgrade manually to make sure all is ok | 15:44 |
fungi | maybe something's blocked on i/o | 15:44 |
fungi | yeah | 15:44 |
fungi | though that doesn't bode well for a newly-launched vm | 15:44 |
corvus | http 2303 _apt 3u IPv4 136119 0t0 TCP ze05.opendev.org:46006->ubuntu-archive-3.ps5.canonical.com:http (CLOSE_WAIT) | 15:45 |
corvus | apt-get upgrade did nothing, and apt-get -f install did nothing as well. | 15:47 |
corvus | running the playbook manually now | 15:48 |
corvus | #status log replaced ze01-ze06 with noble vms; deleted old vms | 16:15 |
opendevstatus | corvus: finished logging | 16:15 |
corvus | i'm going to gracefully stop ze07-ze12 so that for the rest of the day we'll run on 50% capacity, but all new vms. if something blows up, hopefully we see it during that period, and we'll still have the old vms. if everything is okay, i'll swap them out tonight. | 16:17 |
corvus | (so far, the jobs running on the new ze01 look good) | 16:17 |
opendevreview | James E. Blair proposed opendev/zone-opendev.org master: Replace ze07-ze12 https://review.opendev.org/c/opendev/zone-opendev.org/+/953713 | 16:26 |
opendevreview | James E. Blair proposed opendev/system-config master: Replace ze07-ze12 https://review.opendev.org/c/opendev/system-config/+/953714 | 16:26 |
opendevreview | Merged opendev/system-config master: Switch zuul-schedulers/web to noble https://review.opendev.org/c/opendev/system-config/+/952697 | 17:04 |
opendevreview | Merged opendev/system-config master: Switch zuul-executors to noble https://review.opendev.org/c/opendev/system-config/+/952698 | 17:06 |
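
Editor's note: at 00:15 corvus suggests wrapping the `smart-reconfigure` command in a conditional that checks whether the scheduler is running. The log does not show how that was implemented, and the real fix would presumably live in the Ansible playbook. The following is only a minimal Python sketch of the idea. The function names are hypothetical, and it assumes that `docker-compose ps --services --filter status=running` is available and is run from the directory holding the compose file.

```python
import subprocess

# Assumption: invoked from the directory that contains the compose file.
COMPOSE = ["docker-compose"]


def scheduler_is_running(services_output: str, service: str = "scheduler") -> bool:
    """Return True if `service` is listed on its own line among running services."""
    return service in (line.strip() for line in services_output.splitlines())


def smart_reconfigure_if_running(run=subprocess.run) -> bool:
    """Issue smart-reconfigure only when the scheduler container is up.

    Returns True if the command was issued, False if it was skipped.
    `run` is injectable so the logic can be exercised without docker.
    """
    ps = run(
        COMPOSE + ["ps", "--services", "--filter", "status=running"],
        capture_output=True,
        text=True,
        check=True,
    )
    if not scheduler_is_running(ps.stdout):
        # Nothing to reconfigure yet, e.g. on a first deployment.
        return False
    run(
        COMPOSE + ["exec", "-T", "scheduler", "zuul-scheduler", "smart-reconfigure"],
        check=True,
    )
    return True
```

On a fresh host where the scheduler has not been started, the helper returns `False` instead of failing the way the first deployment did.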