clarkb | Almost meeting time | 18:59 |
---|---|---|
fungi | ahoy! | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Jun 28 19:01:10 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-June/000341.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
ianw | o/ | 19:01 |
clarkb | Next Monday is a big holiday for a few of us, so I would expect it to be quiet-ish early next week. | 19:01 |
clarkb | Additionally I very likely won't be able to make the meeting two weeks from today | 19:02 |
clarkb | More than happy to skip that week or have someone else run the meeting (it's July 12, 2022) | 19:02 |
ianw | i can do 12th july if there's interest | 19:03 |
clarkb | figured I'd let people know early then we can organize with plenty of time | 19:03 |
clarkb | Any other announcements? | 19:03 |
clarkb | #topic Topics | 19:05 |
clarkb | #topic Improving CD throughput | 19:05 |
clarkb | There was a bug in the flock path for the zuul auto upgrade playbook which unfortunately caused last weekend's upgrade and reboots to fail | 19:05 |
clarkb | That issue has since been fixed so the next pass should run | 19:05 |
clarkb | This is the downside to only trying to run it once a week. | 19:05 |
clarkb | But we can always manually run it if necessary at an earlier date. I'm also hoping that I'll be feeling much better next weekend and can pay attention to it as it runs. (I missed the last one because I wasn't feeling well) | 19:06 |
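For context, the lock being discussed is the usual non-blocking flock guard that keeps a scheduled run from starting while a previous one is still in flight. A minimal Python sketch of that pattern follows; the lock path and playbook name are illustrative, not the actual opendev configuration.

```python
import fcntl
import subprocess
import sys

LOCK_PATH = "/var/run/zuul-upgrade.lock"  # illustrative path, not the real one


def main():
    with open(LOCK_PATH, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: if a previous run still holds it,
            # bail out rather than stacking a second upgrade behind the first.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("another upgrade run is already in progress", file=sys.stderr)
            return 1
        # Placeholder for the real upgrade-and-reboot playbook invocation.
        subprocess.run(["ansible-playbook", "zuul_upgrade.yaml"], check=True)
        return 0


if __name__ == "__main__":
    sys.exit(main())
```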
clarkb | Slow progress, but that still counts :) | 19:07 |
clarkb | Anything else on this topic? | 19:08 |
clarkb | #topic Gerrit 3.5 upgrade | 19:09 |
clarkb | #link https://bugs.chromium.org/p/gerrit/issues/detail?id=16041 WorkInProgress always treated as merge conflict | 19:09 |
clarkb | I did some investigating of this problem that frickler called out. | 19:09 |
clarkb | I thought I would dig into that more today and try to write a patch, but what I've realized since is that there isn't a great solution here since WIP changes are not mergeable. But Gerrit overloads mergeable to indicate there is a conflict (which isn't necessarily true in the WIP case) | 19:10 |
clarkb | so now I'm thinking I'll wait a bit and see if any upstream devs have some hints for how we might address this. Maybe it is ok to drop merge conflict in the case of all wips. Or maybe we need a better distinction between the three states and use something other than a binary value | 19:10 |
clarkb | If it's the latter option, that may require them to write a change as I think it requires a new index version | 19:11 |
clarkb | But I do think I understand this enough to say it is largely a non-issue. It looks weird in the UI, but it doesn't indicate a bug in merge conflict checking or the index itself | 19:11 |
clarkb | Which means I think it is fairly low priority and we can focus effort elsewhere | 19:12 |
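The behavior being described can be observed directly through Gerrit's REST API: a WIP change reports mergeable as false even when there is no real git conflict, which is what drives the misleading UI label. A hedged sketch using the standard /changes endpoints (the host and change number are placeholders):

```python
import json

import requests

GERRIT = "https://review.opendev.org"  # example server
CHANGE = "123456"  # placeholder change number


def gerrit_get(path):
    # Gerrit prefixes JSON responses with ")]}'" to defeat XSSI; strip it.
    resp = requests.get(f"{GERRIT}{path}")
    resp.raise_for_status()
    return json.loads(resp.text.split("\n", 1)[1])


change = gerrit_get(f"/changes/{CHANGE}")
mergeable = gerrit_get(f"/changes/{CHANGE}/revisions/current/mergeable")

# work_in_progress is only present in ChangeInfo when it is true.
print("work_in_progress:", change.get("work_in_progress", False))
print("mergeable:", mergeable.get("mergeable"))
```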
ianw | (clarkb becoming dangerously close to a java developer again :) | 19:12 |
clarkb | The other item I wanted to bring up here is whether or not we think we are ready to drop 3.4 images and add 3.6 as well as testing | 19:12 |
clarkb | https://review.opendev.org/q/topic:gerrit-3.4-cleanups | 19:12 |
clarkb | If so there are three changes ^ there that need review. The first one drops 3.4, second adds 3.6, and last one adds 3.5 -> 3.6 upgrade testing. That last one is a bit complicated as there are steps we have to take on 3.5 before upgrading to 3.6 and the old test system for that wasn't able to do that | 19:13 |
clarkb | Considering it has been a week and, of the two discovered issues, one has already been fixed and the other is basically just a UI thing, I'm comfortable saying it is unlikely we'll revert at this point | 19:14 |
clarkb | memory usage has also looked good | 19:14 |
ianw | ++ I think so, I can't imagine we'd go back at this point, but we can always revert | 19:14 |
clarkb | ya our 3.4 images will stay on docker hub for a bit and we can revert without reinstating all the machinery to build new ones | 19:15 |
fungi | on the merge conflict front, maybe just changing the displayed label to be more accurate would suffice? | 19:15 |
clarkb | looks like ianw has already reviewed those changes. Maybe fungi and/or frickler can take a second look. Particularly the changes that add 3.6, just to make sure we don't miss anything. | 19:15 |
clarkb | fungi: that is possible but merge conflict relies on mergeable: false even though it can also mean wip. So it becomes tricky to not break the merge conflict reporting on non-wip changes | 19:16 |
clarkb | But ya maybe we just remove the merge conflict tag entirely on wip things in the UI | 19:16 |
clarkb | that is relatively straightforward | 19:16 |
clarkb | maybe upstream will have a good idea and we can fix it some way I haven't considered | 19:17 |
clarkb | Anything else on this subject? I think we're just about at a place where we can drop it off the schedule (once 3.4 images are removed) | 19:17 |
clarkb | s/schedule/agenda/ | 19:18 |
fungi | well, s/merge conflict/unmergeable/ would be more accurate to display | 19:18 |
fungi | since it's not always a git merge conflict causing it to be marked as such | 19:18 |
frickler | in particular the "needs rebase" msg is wrong | 19:18 |
clarkb | fungi: but that is only true for wip changes aiui | 19:18 |
fungi | right | 19:19 |
clarkb | but ya maybe clarifying that in the case of wip changes is a way to go "unmergeable due to the wip state" | 19:19 |
fungi | well, also changes with outdated parents get marked as being in merge conflict even if they're technically not (though in those cases, rebases are warranted) | 19:19 |
clarkb | oh that is news to me but after reading the code it is not unexpected. Making note of that on the bug I filed would be good | 19:20 |
fungi | also possible i've imagined that, i'll have to double-check | 19:21 |
clarkb | k | 19:21 |
clarkb | We have a few more topics to get through. Any other gerrit upgrade items before we move on? | 19:22 |
clarkb | #topic Improving grafana management tooling | 19:23 |
clarkb | This topic was largely going to talk about the new grafyaml dashboard screenshotting jobs, but those have since merged. | 19:23 |
clarkb | I guess maybe we should catch up on the current state of things and where we think we might be headed? | 19:24 |
clarkb | pulling info from the last meeting: what ianw has discovered is that grafyaml uses old APIs which can't properly express things like threshold levels for colors in graphs. This means success and failure graphs both show green in some cases | 19:24 |
ianw | i'm still working on it all | 19:25 |
ianw | but in doing so i did find one issue | 19:25 |
ianw | #link https://review.opendev.org/c/opendev/system-config/+/847876 | 19:25 |
ianw | this is my fault, missing parts of the config when we converted to ansible | 19:26 |
ianw | in short, we're not setting xFilesFactor to 0 for .wsp files created since the update. this was something corvus fixed many years ago that got reverted | 19:26 |
ianw | as noted, i'll have to manually correct the files on-disk after we fix the configs | 19:27 |
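A rough sketch of what that on-disk correction could look like using the whisper Python library; the data directory is illustrative, and the assumption that setAggregationMethod also accepts an xFilesFactor argument should be checked against the whisper version on the server.

```python
import glob

import whisper  # graphite's .wsp file library

WSP_ROOT = "/opt/graphite/storage/whisper"  # illustrative data directory

for path in glob.iglob(f"{WSP_ROOT}/**/*.wsp", recursive=True):
    info = whisper.info(path)
    if info["xFilesFactor"] != 0:
        # Assumes a whisper release whose setAggregationMethod also takes an
        # xFilesFactor; the existing aggregation method is left unchanged.
        whisper.setAggregationMethod(path, info["aggregationMethod"], 0)
        print(f"reset xFilesFactor on {path}")
```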
clarkb | noted. I've got that on my todo list for after the meeting and lunch | 19:27 |
clarkb | reviewing the change I mean | 19:27 |
ianw | i noticed this because i was not getting sensible results in the screenshots of the graphs we now create when we update graphs | 19:28 |
clarkb | ianw: jrosser_ also noted that the screenshots may be catching a spinning loading wheel in some cases. Is this related? | 19:29 |
clarkb | the info is in #opendev if you want to dig into that more | 19:29 |
ianw | ahh, ok, each screenshot waits 5 seconds, but that may not be long enough | 19:30 |
clarkb | it may depend on the size of the dashboard. I think the OSA dashboards have a lot of content based on the change diff | 19:30 |
ianw | it's quite difficult to tell if the page is actually loaded | 19:30 |
clarkb | Once we've got this fairly stable do we have an idea of what sorts of things we might be looking at to address the grafyaml deficiencies? | 19:31 |
clarkb | or maybe too early to tell since bootstrapping testing has been the focus | 19:32 |
fungi | i wonder if there's a way to identify when the page has completed loading | 19:32 |
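One way to replace the fixed 5 second sleep is an explicit wait on a page condition. A hedged selenium sketch is below; the CSS selector for Grafana's panel spinner is a guess and would need checking against the Grafana version in use.

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def wait_for_dashboard(driver, timeout=60):
    """Block until no panel still shows its loading spinner (or time out)."""
    # document.readyState alone is not enough for a single-page app like
    # grafana, so also wait for the (assumed) spinner element to disappear.
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
    )
    WebDriverWait(driver, timeout).until(
        EC.invisibility_of_element_located((By.CSS_SELECTOR, ".panel-loading"))
    )
```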
ianw | my proposal would be editing directly in grafana and committing the dashboards it exports, using the screenshots as a better way to review changes than trying to be human parsers | 19:32 |
ianw | however, we are not quite at the point I have a working CI example of that | 19:32 |
ianw | so i'd like to get that POC 100%, and then we can move it to a concrete discussion | 19:33 |
clarkb | got it. Works for me | 19:33 |
corvus | will the screenshots show the actual metrics used? | 19:33 |
corvus | by that, i mean the metrics names, formulas applied, etc? | 19:34 |
clarkb | I think grafana can be convinced to show that info, but it may be equivalent to what is in the json (aka just the json backing) | 19:35 |
corvus | (so that a reviewer can see that someone is adding a panel that, say, takes a certain metric and divides by 10 and not 100) | 19:35 |
corvus | okay, so someone reviewing the change for accuracy would need to read the json? | 19:35 |
clarkb | I'm looking at the prod dashboard and to see that info currently it does seem like you have to load the json version (it shows the actual data and stats separately but not how they were formulated) | 19:37 |
ianw | yes, you would want to take a look at the json for say metric functions | 19:38 |
ianw | "the json" looks something like https://review.opendev.org/c/openstack/project-config/+/833213/1/grafana/infra-prod-deployment.json | 19:38 |
corvus | the comment about reviewers not needing to be human parsers made me think that may no longer be the case, but i guess reviews still require reading the source (which will be json instead of yaml) | 19:39 |
corvus | or maybe there's some other way to output that information | 19:40 |
clarkb | one idea was to use a simpler translation tool between the json and yaml to help humans. But not try to encode logic as much as grafyaml does today as that seems to be part of what trips us up. | 19:41 |
clarkb | But I think we can continue to improve the testing. users have already said how helpful it is while using grafyaml so no harm in improving things this way | 19:42 |
clarkb | and we can further discuss the future of managing the dashboards as we learn more about our options | 19:42 |
clarkb | We've got 18 minutes left in the meeting hour and a few more topics. Anything urgent on this subject before we continue on? | 19:42 |
ianw | nope, thanks | 19:43 |
clarkb | #topic URL Shortener Service | 19:43 |
clarkb | frickler: Any updates on this? | 19:43 |
frickler | still no progress here, sorry | 19:43 |
clarkb | no worries | 19:43 |
clarkb | #topic Zuul job POST_FAILUREs | 19:43 |
clarkb | Starting sometime last week openstack ansible and tripleo both noticed a higher rate of POST_FAILURE jobs | 19:44 |
clarkb | fungi did a fair bit of digging last week and I've tried to help out more recently. It isn't easy to debug because these post failures appear related to log uploads, which means we get no log URL and no log links | 19:44 |
clarkb | We strongly suspect that this is related to the executor -> swift upload process, with the playbook timing out in that period of time. | 19:45 |
clarkb | We suspect that it is also related to either the total number of log files, their size or some combo of the two since only OSA and tripleo seem to be affected and they log quite a bit compared to other users/jobs | 19:45 |
fungi | the time required to upload to swift endpoints does seem to account for the majority of the playbook's time, and can vary significantly | 19:46 |
clarkb | We've helped them both identify places they might be able to trim their log content down. The categories largely boiled down to no more ARA, reduce deep nesting of logs since nesting requires an index.html for each dir level, remove logs that are identical on every run (think /etc contents that are fixed and never change), and things like journald binary files. | 19:46 |
clarkb | Doing this cleanup does appear to have helped but not completely removed the issue | 19:47 |
corvus | well, if hypothetically, under some circumstances it takes 4x time to upload, it may simply be that only those jobs are long enough that 4x time is noticeable? | 19:47 |
fungi | yes | 19:47 |
corvus | (but the real issue is surely that under some circumstances it takes 4x time, right?) | 19:47 |
clarkb | corvus: yup, I think that is what we are suspecting | 19:47 |
clarkb | in the OSA case we've seen some jobs take ~2 minutes to upload logs, ~9 minutes, and ~22 minutes | 19:47 |
corvus | so initial steps are good, and help reduce the pain, but underlying problem remains | 19:47 |
fungi | also it's really only impacting tripleo and openstack-ansible changes, so seems to be something specific to their jobs (again, probably the volume of logs they collect) | 19:48 |
clarkb | the problem is we have very little insight into this due to how the issues occur. We lose a lot of info. Even on the executor log side the timeout happens and we don't get info about where we were uploading to | 19:48 |
fungi | unfortunately a lot of the troubleshooting is hampered by blind spots due to ansible not outputting things when it gets aborted mid-task | 19:48 |
clarkb | we could add logging of that to the ansible role but then we set no_log: true which I think may break any explicit logging too | 19:48 |
fungi | so it's hard to even identify which swift endpoint is involved in one of the post_failure results | 19:48 |
clarkb | I think we've managed the bleeding, but now we're looking for ideas on how we might log this better going forward. | 19:49 |
clarkb | One great idea fungi had was to do two passes of uploads. The first can upload the console log and console json and maybe the inventory content. Then a second pass can upload the job specific data | 19:49 |
ianw | yeah -- when i started hitting this, it turned out to be a massive syslog caused by a kernel bug that only hit in one cloud provider, flooding it with backtrace messages. luckily in that case, some of the uploads managed to work, so we could see the massive file. but it could be something obscure and unrelated to the job like this | 19:49 |
corvus | that would help us not have post_failures, but it wouldn't help us have logs, and it wouldn't help us know that we have problems uploading logs. | 19:50 |
clarkb | The problem with this is we generate a zuul manifest with all of the log files and record that for the zuul dashboard, so we'd essentially need to upload those base logs twice to make that work | 19:50 |
corvus | iow, it could sweep this under the rug but not actually make things better | 19:50 |
fungi | i think it's the opposite? | 19:50 |
clarkb | corvus: I don't think it would stop the post failures. The second upload pass would still cause that to happen | 19:50 |
fungi | it wouldn't stop the post_failure results, but we'd have console logs and could inspect things in the dashboard | 19:50 |
corvus | oh i see | 19:51 |
clarkb | it would allow us to, in theory, know where we're slow to upload to | 19:51 |
clarkb | and some other info. | 19:51 |
fungi | basically try to avoid leaving users with a build result that says "oops, i have nothing to show you, but trust me this broke" | 19:51 |
clarkb | But making that shift work in the way zuul's logging system currently works is not trivial | 19:51 |
corvus | that sounds good. the other thing is that all the info is in the executor logs. so if you want to write a script to parse it out, that could be an option. | 19:51 |
clarkb | Mostly calling this out here so people are aware of the struggles and also to brainstorm how we can log better | 19:51 |
clarkb | corvus: the info is only there if we don't time out the task though | 19:51 |
corvus | i suggest that because even if you improve the log upload situation, it still doesn't really point at the problem | 19:52 |
fungi | except a lot of the info we want for debugging this isn't in the executor logs, at least not that i can find | 19:52 |
clarkb | corvus: when we time out the task we kill the task before it can record anything in the executor logs | 19:52 |
fungi | though we can improve that, yes | 19:52 |
clarkb | at least that was my impression of the issue here | 19:52 |
corvus | no i mean all the info is in the executor log. /var/log/zuul/executor-debug.log | 19:52 |
fungi | basically do some sort of a dry-run task before the task which might time out | 19:52 |
clarkb | corvus: yes that file doesn't get the info because the task is forcefully killed before it records the info | 19:52 |
fungi | all of the info we have is in the executor log, but in these cases the info isn't much | 19:53 |
clarkb | ya another approach may be to do the random selection of target, record it in the log, then start the upload, similar to the change jrosser wrote | 19:53 |
clarkb | then we'd at least have that information | 19:53 |
corvus | you're talking about the swift endpoint? | 19:53 |
clarkb | corvus: yes that is the major piece of info | 19:53 |
fungi | that's one piece of data, yes | 19:53 |
clarkb | potentially also the files copied | 19:53 |
fungi | it gets logged by the task when it ends | 19:53 |
fungi | except in these cases, because it isn't allowed to end | 19:54 |
fungi | instead it gets killed by the timeout | 19:54 |
clarkb | the more I think about it the more I think a change like jrosser's could be a good thing here. Basically make random selection, record target, run upload. Then we record some of the useful info before the forceful kill | 19:54 |
fungi | so we get a log that says the task timed out, and no other information that task would normally have logged (for example, the endpoint url) | 19:54 |
fungi | we can explicitly log those other things by doing it before we run that task though | 19:55 |
clarkb | https://review.opendev.org/c/opendev/base-jobs/+/847780 that change | 19:55 |
corvus | yeah that's a good change | 19:55 |
clarkb | that change was initially made so we can do two upload passes, but maybe we start with it just to record the info and do one upload | 19:55 |
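The pattern being discussed reduces to: pick the upload target, persist that choice before the slow step starts, then run the upload, so the target is known even if the task is later killed by the timeout. A minimal Python sketch with illustrative names (not the actual base-jobs role):

```python
import logging
import random

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("upload-logs")

# Illustrative endpoints; the real base job carries its own provider list.
SWIFT_ENDPOINTS = [
    "https://swift.provider-a.example/v1/AUTH_logs",
    "https://swift.provider-b.example/v1/AUTH_logs",
]


def upload_logs(endpoint, log_root):
    """Placeholder for the real (potentially very slow) swift upload."""
    raise NotImplementedError


def main(log_root):
    endpoint = random.choice(SWIFT_ENDPOINTS)
    # Record the selection *before* starting the upload so the information
    # survives even if the upload is killed by a timeout.
    log.info("uploading %s to %s", log_root, endpoint)
    upload_logs(endpoint, log_root)
```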
fungi | another idea was to temporarily increase the timeout for post-logs and then try to analyze builds which took longer in that phase than the normal timeout | 19:56 |
corvus | then you'll have everything in the debug log :) | 19:56 |
clarkb | yup. Ok that's a good next step and we can take debugging from there as it may provide important info | 19:56 |
clarkb | we are almost out of time but do have one more agenda item I'd like to get to | 19:56 |
corvus | in particular, you can look for log upload times by job, and see "normal" and "abnormal" times | 19:56 |
fungi | the risk of temporarily increasing the timeout, of course, is that jobs may end up merging changes that make the situation worse in the interim | 19:56 |
clarkb | #topic Bastion host | 19:57 |
clarkb | ianw put this on the agenda to discuss two items re bridge. | 19:57 |
corvus | yeah, i wouldn't change any timeouts; i'd do the change to get all the data in the logs, then analyze the logs. that's going to be a lot better than doing a bunch of spot checks anyway. | 19:57 |
clarkb | The first is whether or not we should put ansible and openstacksdk in a venv rather than global install | 19:57 |
ianw | this came out of | 19:57 |
ianw | #link https://review.opendev.org/c/opendev/system-config/+/847700 | 19:57 |
ianw | which fixes the -devel job, which uses these from upstream | 19:58 |
ianw | i started to take a different approach, moving everything into a non-privileged virtualenv, but then wondered if there was actually any appetite for such a change | 19:58 |
ianw | do we want to push on that? or do we not care that much | 19:58 |
clarkb | I think putting pip installs into a venv is a good idea simply because not doing that continues to break in fun ways over time | 19:59 |
clarkb | The major downsides are that things are no longer automatically in $PATH (but we can add them explicitly), and that when python upgrades you get really weird errors running stuff out of venvs | 19:59 |
ianw | yeah they basically need to be regenerated | 20:00 |
fungi | i am 100% in favor of avoiding `sudo pip install` in favor of either distro packages or venvs, yes | 20:00 |
fungi | also python upgrades on a stable distro shouldn't need regenerating unless we do an in-place upgrade of the distro to a new major version | 20:00 |
clarkb | ianw: and config management makes that easy if we just rm or move the broken venv aside and let config management rerun (there is a chicken and egg here for ansible specifically though, but I think that is ok if we shift to venvs more and more) | 20:00 |
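A minimal sketch of the venv approach being proposed, assuming a hypothetical /opt/ansible-venv location and unpinned installs; the real change would pick its own paths and version pins.

```python
import subprocess
import venv

VENV_DIR = "/opt/ansible-venv"  # hypothetical location


def build_tool_venv():
    # clear=True makes "delete the broken venv and rerun config management"
    # a single idempotent step after a python upgrade.
    venv.EnvBuilder(with_pip=True, clear=True).create(VENV_DIR)
    subprocess.run(
        [f"{VENV_DIR}/bin/pip", "install", "ansible", "openstacksdk"],
        check=True,
    )
    # The tools are no longer on $PATH automatically, so expose the entry
    # points explicitly instead of relying on a global pip install.
    subprocess.run(
        ["ln", "-sf", f"{VENV_DIR}/bin/ansible-playbook",
         "/usr/local/bin/ansible-playbook"],
        check=True,
    )
```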
clarkb | fungi: yes that is the next question :) | 20:01 |
clarkb | The next question is re upgrading bridge and whether or not we should do it in place or with a new host | 20:01 |
fungi | and to be clear, an in-place upgrade of the distro is fine with me, we just need to remember to regenerate any venvs which were built for the old python versions | 20:01 |
clarkb | I personally really like using new hosts when we can get away with it as it helps us start from a clean slate and delete old cruft automatically. But bridge's IP address might be important? In the past we explicitly allowed root ssh from its IP on hosts iirc | 20:01 |
ianw | this is a longer term thing. one thing about starting a fresh host is that we will probably find all those bits we haven't quite codified yet | 20:02 |
clarkb | I'm not sure if we still do that or not. If we do then doing an in place upgrade is probably fine. But I have a small preference for a new host if we can get away with it | 20:02 |
corvus | clarkb: i think that should be automatic for a replacement host | 20:02 |
clarkb | corvus: ya for all hosts that run ansible during the time frame we have both in the var/list/group | 20:02 |
clarkb | mostly concerned that a host might get missed for one reason or another and get stranded, but we can always manually update that too | 20:03 |
ianw | ok, so i'm thinking maybe we push on the virtualenv stuff for the tools in use on bridge first, and probably end up with the current bridge as a franken-host with things installed everywhere, every which way | 20:03 |
clarkb | Anyway no objection from me shifting ansible and openstacksdk (and more and more of our other tools) into venvs. | 20:03 |
fungi | same here | 20:03 |
ianw | however, we can then look at upgrade/replacement, and we should start fresh with more compartmentalized tools | 20:03 |
clarkb | And preference for new host to do OS upgrade | 20:03 |
fungi | i also prefer a new host, all things being equal, but understand it's a pragmatic choice in some cases to do in-place | 20:05 |
clarkb | We are now a few minutes over time. No open discussion today, but feel free to bring discussion up in #opendev or on the mailing list. Last call for anything else on bastion work | 20:05 |
fungi | thanks clarkb! | 20:05 |
corvus | thanks! | 20:07 |
clarkb | Sounds like that is it. Thank you everyone! | 20:07 |
clarkb | #endmeeting | 20:07 |
opendevmeet | Meeting ended Tue Jun 28 20:07:49 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:07 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-06-28-19.01.html | 20:07 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-06-28-19.01.txt | 20:07 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-06-28-19.01.log.html | 20:07 |