Saturday, 2024-04-06

corvus1clarkb: fungi morning14:08
corvus1looks like the restart has finished14:09
corvus1i've started a db dump from the old server14:10
corvus1that means this is the start time for the data loss window (jobs completed between now and when we finish won't be in the db)14:10
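[Editor's note: the dump-copy-restore flow being started here follows a common shape. A minimal sketch, assuming MariaDB/MySQL and hypothetical host, database, and path names (the actual commands are not shown in the log); the `run` helper prints each step instead of executing it:]

```shell
# Sketch of the dump -> copy -> restore flow.  Hostnames, db name, and
# paths are assumptions for illustration; `run` prints instead of executing.
run() { echo "+ $*"; }

DB=zuul
DUMP=/var/backup/zuul.sql.gz

# 1. Dump on the old server (jobs finishing after this opens the loss window)
run "mysqldump --single-transaction ${DB} | gzip > ${DUMP}"
# 2. Copy the dump to the new database server
run "scp ${DUMP} zuul-db01:/var/tmp/zuul.sql.gz"
# 3. Restore on zuul-db01 (run inside screen so it survives disconnects)
run "zcat /var/tmp/zuul.sql.gz | mysql ${DB}"
```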
corvus1i've approved the zuul change to add the index hints (and also one other simple db related fix)14:14
Clark[m]Ack14:22
corvus1dump is finished, copying now14:31
corvus1restore in progress on screen on zuul-db01.14:34
corvus1i think we still need the dburi change, yeah?14:35
corvus1oh heh it's actually just a secret var14:37
Clark[m]Yes I think so14:37
corvus1so i'm going to edit secrets and update it there.14:37
Clark[m]And we run zuul updates hourly so it should go into effect quickly if you don't do it by hand14:38
Clark[m]Just keep that in mind I guess as it may go in before you're ready too?14:39
corvus1okay, the dburi is updated in secret hostvars.  for now, let's see if the ansible deploy gets it before the import is done?  and otherwise we can edit by hand.14:40
corvus1i don't think we're going to automatically restart zuul based on that file changing, right?  so since all the zuul restarts are done now, i think we're good to let it go in any time?14:40
Clark[m]Oh yes that is correct14:41
corvus1i suppose it's possible we might do a full-reconfigure?  and that might reload the connections?  i don't know14:41
Clark[m]We trigger a zuul config reload but that applies to the tenant config not the ini config14:41
corvus1yeah, we only trigger a smart-reconfigure; i don't think that touches connections14:41
fungimorning!14:45
fungilooks like you started just as i was getting ready to grab a shower14:46
corvus1perfect timing!14:47
corvus1i think we're idle for another 90m or so right now.  btw, import is running in screen.14:47
Clark[m]The zuul hourly job will likely run in about a half an hour and again an hour after that 14:48
Clark[m]Just for timing on when the config might update.14:48
Clark[m]The docs explicitly mention tenant config and don't say anything about connections configs in relation to smart reconfiguration14:50
corvus1++  if i'm wrong about what zuul will do with the update, then i would expect it to simply fail inserts due to missing tables; or, if we happen to have all the tables a query needs, it might insert them (but that's okay because the schema has the insert id in it, so it would go at the end); or it might hit a table lock and just freeze until that table is done.14:50
corvus1i went through the code to double check, and it does reload the zuul.conf file, and it does tell the connections that a reconfiguration has happened, but the sql driver ignores that event since it doesn't typically need to do anything.14:51
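[Editor's note: the two reconfigure flavors under discussion, sketched as they would be invoked against a containerized scheduler. The compose service name is an assumption; `run` prints instead of executing:]

```shell
# Sketch: smart vs. full reconfigure.  Service name is an assumption;
# `run` prints instead of executing.
run() { echo "+ $*"; }

# Re-reads only the tenant config; connection config is untouched.
run docker-compose exec scheduler zuul-scheduler smart-reconfigure
# A full reconfigure also re-reads zuul.conf and notifies connections,
# but (per the log) the sql driver ignores that event, so a changed
# dburi still effectively requires a restart.
run docker-compose exec scheduler zuul-scheduler full-reconfigure
```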
corvus1my takeaway is that it shouldn't affect us, but also, if we wanted to make it easy to update the dburi without a restart, that would be a very simple change.  :)14:52
Clark[m]It seems like making the database a bit more an explicit update is useful for understanding data integrity. But I can see how that would be useful in some situations14:53
Clark[m](like with redundant active standby DB setups)14:54
corvus1yup14:54
corvus1artifacts done; halfway through builds15:31
fungithinking either my terminal is too small or i'm on the wrong window of the screen session15:32
fungithis is on zuul-db01 yeah?15:32
fungithere are like 4 windows but the first one seems blank, i'm assuming that's where the progress output is showing up for the rest of you15:33
corvus1yeah, it's a big window sorry15:34
corvus1i've resized mine to 80x24 now15:34
corvus1when it's done, something should show up in the first blank one; that's the import15:34
corvus1i'm using the second one to show processlist to check progress15:35
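[Editor's note: a sketch of the kind of progress monitoring described here, run from a second screen window. Database name is an assumption; `run` prints instead of executing:]

```shell
# Sketch: watching import progress from a second window.
# `run` prints instead of executing.
run() { echo "+ $*"; }

# Each poll shows which table the import is currently inserting into.
run "watch -n 10 mysql -e 'SHOW FULL PROCESSLIST'"
# Row counts per table give a rough progress measure against the old server.
run "mysql -e 'SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema = \"zuul\"'"
```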
corvus1zuul changes merged15:52
fungino worries, i don't think screen handles dynamic resizing anyway. maybe someday we can talk about using tmux for similar purposes, but it's not a big deal15:54
corvus1looks like it's done16:24
corvus1i'm going to docker-compose pull on the schedulers, then check the image tags and also dburi and make sure everything is staged16:26
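[Editor's note: a sketch of the pre-restart staging checks just described: pull the image, verify the tag, verify the dburi landed. Image name and config path are assumptions; `run` prints instead of executing:]

```shell
# Sketch of pre-restart staging checks.  Names/paths are assumptions;
# `run` prints instead of executing.
run() { echo "+ $*"; }

run docker-compose pull                            # fetch the new images
run docker image ls --digests zuul/zuul-scheduler  # confirm the expected tag
run "grep dburi /etc/zuul/zuul.conf"               # confirm the new connection string
```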
Clark[m]++16:28
corvus1the image log looks correct, has the latest change id16:29
corvus1dburi looks correct16:29
corvus1i'm going to restart zuul-web on zuul01 now16:30
fungicool16:32
corvus1we're going from dev82 to dev8516:34
corvus1zuul01 web is up, restarting zuul02 web now16:35
corvus1https://zuul.opendev.org/t/openstack/builds works now and is instantaneous for me16:36
corvus1restarting zuul01 scheduler now16:37
fungi~instantneous for me as well16:37
Clark[m]And me16:37
fungiinstantaneous16:37
corvus1faster than you can spell instantaneous for sure16:37
fungifaster than i can type it at the very least16:39
corvus1both web servers are up now16:41
corvus1scheduler 1 is up, restarting scheduler 2 now16:42
corvus1that means we should be at the end of the data loss period16:42
fungiright16:42
fungiwriting all further results to the new (temporary) database now16:43
corvus1so job results will be missing from the database and webui from about 14:10 to 16:43 on 2024-04-0616:43
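[Editor's note: the window arithmetic above can be checked directly; assumes GNU `date`:]

```shell
# Length of the data-loss window quoted above (UTC, GNU date assumed).
start=$(date -u -d '2024-04-06 14:10' +%s)
end=$(date -u -d '2024-04-06 16:43' +%s)
mins=$(( (end - start) / 60 ))
echo "window: ${mins} minutes"   # 153 minutes, i.e. 2h33m
```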
fungithanks! i was about to calculate that as well, saved me the bother16:44
corvus1how about: status notice Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI.  Recheck changes affected by this if necessary.16:45
fungilgtm16:45
corvus1#status notice Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI.  Recheck changes affected by this if necessary.16:46
corvus1maybe i'm not authed?16:47
fungichecking16:47
fungiit's in channel at least16:47
fungi#status notice Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI.  Recheck changes affected by this if necessary.16:48
opendevstatusfungi: sending notice16:48
-opendevstatus- NOTICE: Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI. Recheck changes affected by this if necessary.16:48
*** corvus is now known as Guest26916:48
*** corvus1 is now known as corvus16:48
fungiyeah, corvus1 isn't logged in according to /whois16:48
corvusi am now :)16:48
corvushttps://zuul.opendev.org/t/openstack/build/7cdf35ccdddb46fd804638d4c8305516 is post-migration16:49
fungiagreed16:49
Clark[m]And now we can see if general responsiveness improves as well16:49
corvusall components are up now16:49
fungiawesome! heating up the wok to make lunch, but will be around to keep an eye out for any problem reports16:49
opendevstatusfungi: finished sending notice16:50
corvusi dequeued the sandbox post items16:51
corvusthere's a periodic item in there now that's stuck on a node request?16:51
corvusdoesn't look like i can dequeue it16:52
Clark[m]That's the change vlotorev got in to reproduce that problem...16:52
Clark[m]Or at least I think it is /me double checks16:53
corvusthe ones i dequeued, yes; not the periodic one though16:53
Clark[m]Ya the sandbox change16:53
corvusthere were 3 queue items.  2 in "post" one in "periodic".  i dequeued the 2 in "post".  the one in "periodic" remains because i am unable to dequeue it through the web ui.16:54
Clark[m]And it will return if we don't revert the change in sandbox first. We should probably do that then figure out the dequeue 16:55
corvusthey will return the next time someone merges 2 changes in sandbox in rapid succession.  but yes, it should all be reverted.16:55
Clark[m]Ya if we revert then the problem goes away for new trigger events16:56
Clark[m]And I plan to update the change to address this monday16:57
Clark[m]Which may also allow them to run normally16:57
corvusit's a real travesty what happened to that exquisite corpse17:06
corvusremote:   https://review.opendev.org/c/opendev/sandbox/+/915197 Revert to an earlier state of the repo [NEW]        17:13
corvusremote:   https://review.opendev.org/c/opendev/sandbox/+/915198 Update test.py for python3 [NEW]        17:13
corvusClark: fungi ^ some cleanup17:13
Clark[m]corvus: I guess there wasn't one specific change we could revert? I don't see a periodic job in the zuul.yaml diff17:18
corvusClark: oh i'm sure that could be smaller, but the repo was a mess.  just doing 2 things at once.17:19
Clark[m]I see17:19
fungialso worth consideration: a dedicated sandbox tenant in zuul for further isolation17:22
corvusmaybe with only check and gate pipelines17:23
fungiyeah, that would make sense. simplicity helps the demonstration17:26
mnaserhttps://releases.openstack.org/ is not responding to me21:30
mnasercc infra-root ^21:32
mnaserit seems like it's responding very slowly now :X21:33
fungilooking22:55
fungiit's responding instantly to me now, but i'll check the system resource graphs for static.o.o22:56
fungihttp://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=71254&rra_id=all22:57
fungiload average spiked right when mnaser was noticing issues22:58
fungithere was also a connection count spike, but that could be a symptom of connections waiting much longer than usual for a response22:58
fungiserver started complaining about losing contact with afs01.dfw (in the same cloud provider and region) at 21:12:12 utc, continuing with some frequency until 21:23:46 utc23:01
fungi[Sat Apr  6 21:31:06 2024] INFO: task jbd2/dm-0-8:549 blocked for more than 120 seconds.23:03
fungithat's the start of errors in dmesg on afs01.dfw23:03
fungi"This message is to inform you that our monitoring systems have detected a problem with the server which hosts your Cloud Block Storage device, afs01.dfw.opendev.org/main04, [...] at 2024-04-06T21:34:09.152730."23:06
fungiwe also got a "returned to service" update on that same ticket at 21:59:21 utc23:07
fungii guess i'll reboot afs01.dfw and then, if lvm comes back up clean, see what might be stuck23:09
mnaseryeah looked very io-y :\23:09
fungii/o errors in dmesg ceased at 21:45:10 utc, looks like23:09
fungiserver's back up now23:20
fungireboot took about 10 minutes23:20
fungilooks like /vicepa came up mounted and writeable23:22
fungi[  OK  ] Finished File System Check on /dev/main/vicepa.23:24
fungiboot.log confirms it did get checked23:24
fungi`vos status` says "No active transactions on 104.130.138.161"23:26
fungisame for the other two fileservers, so nothing to abort23:27
fungii'll start a root screen session and work through manually releasing the list of volumes23:29
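[Editor's note: a sketch of the manual re-release pass that follows. The volume names are invented examples, not the actual opendev volume list; `run` prints instead of executing. `vos release` pushes the read-write volume's contents out to its read-only replica sites, which is why volumes with no replicas can be skipped:]

```shell
# Sketch: releasing each replicated volume after the fileserver recovery.
# Volume names are examples only; `run` prints instead of executing.
run() { echo "+ $*"; }

for vol in mirror.ubuntu docs project.zuul; do
    # Volumes without read-only replicas have nothing to release and
    # were skipped in the actual session.
    run vos release "$vol" -localauth -verbose
done
```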
fungiabout 1/3 of the way through now23:40
fungiabout 2/3 of the way through now23:48
fungiand done23:56
fungiexcept for the volumes without any read-only replicas configured, of course, which i skipped23:56

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!