Friday, 2023-08-18

10:25 <elodilles> hi infra team o/ in the relmgmt team we have a task for this week saying 'Notify the Infrastructure team to generate an artifact signing key (but not replace the current one yet), and begin the attestation process'
10:26 <elodilles> so please start to work on this ^^^ & let us know (@ #openstack-releases if we can help with anything)
12:41 <fungi> elodilles: thanks for the reminder! i'll try to work on that today if i get a few minutes
12:57 <elodilles> fungi: cool, thanks!
15:24 <damiandabrowski> hey everyone, we (openstack-ansible) started getting DISK_FULL errors on rocky9 jobs yesterday, can I ask for a suggestion on what to do about this?
15:24 <damiandabrowski> https://zuul.opendev.org/t/openstack/builds?result=DISK_FULL&skip=0
15:29 <fungi> damiandabrowski: disk_full results are zuul's way of saying the build used too much disk space on the executor. usually it means the job retrieved too much log data or other files from the job nodes at the end of the job, though it can also be caused by other activities that use up executor space
15:32 <fungi> damiandabrowski: unfortunately zuul terminates the build abruptly to prevent it from consuming more disk on the executor, so there is no diagnostic data that can be uploaded. do you have any idea what might have changed in those jobs in the past 24-48 hours?
15:32 <fungi> looks like it's specifically the openstack-ansible-upgrade-aio_metal-rockylinux-9 job
15:34 <fungi> damiandabrowski: looks like it's not happening every time either. all the disk_full results are for stable/2023.1 branch builds, but there are also some successful stable/2023.1 runs
15:35 <damiandabrowski> fungi: hmm, i have no clue what could have potentially changed something in this job. noonedeadpunk, maybe you have some idea?
15:36 <fungi> another thing to look at would be the job duration. it appears the disk_full results ran about as long as or longer than successful builds, which tells me the extra disk is being consumed at the end of the builds, so likely when logs are being fetched from the job nodes
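
The duration comparison fungi describes can be scripted against the public Zuul builds API. A minimal Python sketch, assuming the builds endpoint accepts the same job_name, branch, result and limit filters that the web UI exposes, and using the job name mentioned above:

    # Compare durations of DISK_FULL vs SUCCESS builds of one job.
    import requests

    API = "https://zuul.opendev.org/api/tenant/openstack/builds"
    JOB = "openstack-ansible-upgrade-aio_metal-rockylinux-9"

    for result in ("DISK_FULL", "SUCCESS"):
        builds = requests.get(
            API,
            params={"job_name": JOB, "branch": "stable/2023.1",
                    "result": result, "limit": 25},
            timeout=30,
        ).json()
        durations = [b["duration"] for b in builds if b.get("duration")]
        if durations:
            print(f"{result}: {len(durations)} builds, "
                  f"avg {sum(durations) / len(durations) / 60:.1f} min")

If the averages are close, that supports the theory that the extra disk usage happens during log collection at the end of the build rather than partway through the run.
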
15:36 <noonedeadpunk> disk_full is a zuul-executor issue?
15:37 <fungi> noonedeadpunk: not a zuul executor issue, but a zuul executor safeguard
15:37 <noonedeadpunk> or is it full on the nodepool node?
15:37 <fungi> if the build tries to use too much disk on the executor, the executor abruptly ends the build in order to protect available disk for other builds running at the same time
15:38 <noonedeadpunk> I'm just trying to understand when this potentially happens - somewhere in the POST step, or is it something in the job itself that fills up the node's disk
15:38 <noonedeadpunk> as in the case of POST - I'm really not sure how to debug that...
15:38 <fungi> noonedeadpunk: as i said, the build duration for success and disk_full builds is about the same, which suggests it's happening near or at the end of the job, probably when pulling logs from the nodes
15:39 <noonedeadpunk> yeah, but it's hard to analyze what could take up so much disk space when there are no logs and it's the logs themselves causing the issue...
15:40 <fungi> a random example off the top of my head would be if some process on a job node, let's say neutron, started generating massive amounts of error messages and the logs grew huge. then when the executor is asked to retrieve them, it can discover that they'll require too much space and ends the job rather than risking filling up its own disk
15:41 <noonedeadpunk> yeah, I totally get how it happens... I'm just not sure how to deal with that
15:41 <fungi> i would encourage you not to just give up. for example, have you looked to see how much total space the logs pulled from a successful build of that job on the same branch in the last day consume? how about a failed one that didn't result in disk_full?
15:41 <fungi> maybe these jobs are collecting tons of log data and it's often just under the threshold
15:42 <fungi> the problem can be approached the same way as job timeouts
15:42 <fungi> just because you don't get logs when a build times out doesn't mean it's impossible to work out what's causing it to sometimes do that
15:43 <noonedeadpunk> I'm fetching logs from successful builds right now to check
15:43 <noonedeadpunk> though I'm not sure what to consider big enough to cause problems
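
For the check fungi suggests, once a build's logs have been mirrored locally (for example with a recursive wget, as noonedeadpunk is doing here), a short script can total them and rank the largest files. A rough sketch; the directory path is a placeholder:

    # Total a downloaded log tree and show the ten largest files.
    import os

    LOG_DIR = "./downloaded-logs"  # placeholder for wherever the logs landed

    sizes = []
    for root, _dirs, files in os.walk(LOG_DIR):
        for name in files:
            path = os.path.join(root, name)
            sizes.append((os.path.getsize(path), path))

    total = sum(size for size, _ in sizes)
    print(f"total: {total / 1024 ** 2:.1f} MiB across {len(sizes)} files")
    for size, path in sorted(sizes, reverse=True)[:10]:
        print(f"{size / 1024 ** 2:8.1f} MiB  {path}")
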
15:43 <fungi> i'm going to look up what the threshold is in the executor's disk governor, but i do know it's large enough that jobs rarely hit it
15:44 * noonedeadpunk is seeing that error for the first time
15:45 <fungi> it'll take me a few minutes because i honestly have no idea if it's configurable and, if so, whether we configure it or rely on its built-in default threshold
15:46 <noonedeadpunk> I guess wget has found at least one nasty log file... /var/log/messages is already like 350Mb and it's still fetching
15:46 <fungi> ouch
15:46 <fungi> noonedeadpunk: https://zuul-ci.org/docs/zuul/latest/configuration.html#attr-executor.disk_limit_per_job
15:47 <noonedeadpunk> I wonder if there are plenty of OOMs or something...
15:47 <noonedeadpunk> I'm pretty sure that limit is non-default, judging by what I see from the successful job
15:48 <noonedeadpunk> 520mb for this file alone
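
To see what is bloating a file like that /var/log/messages, a quick frequency count of its lines usually points at the noisy service. A small sketch, with the filename assumed to be the locally fetched copy:

    # Count the most frequent message patterns in a large syslog-style file.
    from collections import Counter

    counts = Counter()
    with open("messages", errors="replace") as fh:  # locally fetched copy
        for line in fh:
            # Drop the month/day/time prefix so identical messages group.
            counts[" ".join(line.split()[3:])[:120]] += 1

    for message, count in counts.most_common(10):
        print(f"{count:8d}  {message}")
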
15:48 <fungi> https://opendev.org/opendev/system-config/src/commit/e8a274e/playbooks/roles/zuul/templates/zuul.conf.j2#L47
15:48 <fungi> noonedeadpunk: wget may be uncompressing it locally?
15:49 <fungi> though it does appear we set it to ~5gb
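
The knob fungi is referring to is disk_limit_per_job in the [executor] section of zuul.conf, which the Zuul documentation linked above describes as a per-build limit (expressed in megabytes, per the docs). A sketch of what a raised limit roughly matching his "~5gb" remark could look like; the exact OpenDev value is in the zuul.conf.j2 template he links, not reproduced here:

    [executor]
    # Abort any build whose working directory on the executor grows past this
    # many megabytes. The value below is an assumption based on the "~5gb"
    # remark above, not copied from the actual OpenDev configuration.
    disk_limit_per_job=5000
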
15:49 <noonedeadpunk> yeah, likely it is...
15:50 <fungi> so anyway, something during log/artifact collection is going over 5gb of local storage on the executor for one build
15:50 <noonedeadpunk> I'd need to check when it gets compressed
15:50 <noonedeadpunk> before sending to the executor or after...
15:50 <clarkb> during/after. it's a monitor check, not a fs quota
15:50 <noonedeadpunk> as I thought, it's after
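
clarkb's distinction between a monitoring check and a filesystem quota can be illustrated with a toy sketch: the executor does not block writes up front, it periodically measures how much space a build is using and aborts it once the limit is exceeded. This is not Zuul's actual governor code, just the general idea, with a placeholder path and the ~5gb figure from above:

    # Toy "disk monitor": sample a build directory's size on an interval and
    # abort once it crosses the limit. Not Zuul's real implementation.
    import os
    import time

    LIMIT_MIB = 5000                    # assumed limit, matching "~5gb" above
    BUILD_DIR = "/tmp/example-build"    # placeholder build working directory

    def dir_size_mib(path):
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # files can vanish while the build is still running
        return total / 1024 ** 2

    while True:
        used = dir_size_mib(BUILD_DIR)
        if used > LIMIT_MIB:
            print(f"over limit: {used:.0f} MiB > {LIMIT_MIB} MiB, aborting")
            break
        time.sleep(30)  # periodic sampling, so brief overshoot is possible

Because usage is only sampled periodically, the overrun is detected during or after the copy to the executor rather than prevented beforehand, which matches the "during/after" answer.
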
15:51 <noonedeadpunk> ok, gotcha, thanks for the answers as usual!
15:52 <fungi> yw
15:54 <fungi> noonedeadpunk: also, obviously, trimming the size of data collected from builds will speed up your job completion times, speed up retrieval time for people diagnosing failures, and waste less space in our donors' swift services
15:54 <fungi> as well as put less load on our executors
15:54 <fungi> so whatever you can do there within reason is beneficial to all of us
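
One way to act on that advice is to compress (or truncate) oversized logs on the job node before they are collected. A rough Python sketch of the idea; the path and threshold are illustrative, and openstack-ansible's real log collection is handled by its own roles and post-run playbooks:

    # Gzip any plain-text log over a size threshold so less raw data has to
    # pass through the executor during collection. Illustrative only.
    import gzip
    import os
    import shutil

    LOG_ROOT = "/var/log"      # example location of logs gathered for upload
    THRESHOLD_MIB = 50         # arbitrary cut-off for "too big to ship raw"

    for root, _dirs, files in os.walk(LOG_ROOT):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(".gz"):
                continue
            if os.path.getsize(path) < THRESHOLD_MIB * 1024 ** 2:
                continue
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)    # keep only the compressed copy for collection
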
