clarkb | Almost meeting time | 18:59 |
---|---|---|
fungi | ahoy! | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Jun 28 19:01:10 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-June/000341.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
ianw | o/ | 19:01 |
clarkb | Next Monday is a big holiday for a few of us, so I would expect it to be quiet-ish early next week. | 19:01 |
clarkb | Additionally I very likely won't be able to make the meeting two weeks from today | 19:02 |
clarkb | More than happy to skip that week or have someone else run the meeting (it's July 12, 2022) | 19:02 |
ianw | i can do 12th july if there's interest | 19:03 |
clarkb | figured I'd let people know early then we can organize with plenty of time | 19:03 |
clarkb | Any other announcements? | 19:03 |
clarkb | #topic Topics | 19:05 |
clarkb | #topic Improving CD throughput | 19:05 |
clarkb | There was a bug in the flock path for the zuul auto upgrade playbook which unfortunately caused last weekend's upgrade and reboots to fail | 19:05 |
clarkb | That issue has since been fixed so the next pass should run | 19:05 |
clarkb | This is the downside to only trying to run it once a week. | 19:05 |
clarkb | But we can always manually run it if necessary at an earlier date. I'm also hoping that I'll be feeling much better next weekend and can pay attention to it as it runs. (I missed the last one because I wasn't feeling well) | 19:06 |
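For context, the lock being discussed is the usual non-blocking flock guard that keeps a scheduled run from starting while a previous one is still in flight. A minimal Python sketch of that pattern follows; the lock path and playbook name are illustrative, not the actual opendev configuration.

```python
import fcntl
import subprocess
import sys

LOCK_PATH = "/var/run/zuul-upgrade.lock"  # illustrative path, not the real one


def main():
    with open(LOCK_PATH, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: if a previous run still holds it,
            # bail out rather than stacking a second upgrade behind the first.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("another upgrade run is already in progress", file=sys.stderr)
            return 1
        # Placeholder for the real upgrade-and-reboot playbook invocation.
        subprocess.run(["ansible-playbook", "zuul_upgrade.yaml"], check=True)
        return 0


if __name__ == "__main__":
    sys.exit(main())
```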
clarkb | Slow progress, but that still counts :) | 19:07 |
clarkb | Anything else on this topic? | 19:08 |
clarkb | #topic Gerrit 3.5 upgrade | 19:09 |
clarkb | #link https://bugs.chromium.org/p/gerrit/issues/detail?id=16041 WorkInProgress always treated as merge conflict | 19:09 |
clarkb | I did some investigating of this problem that frickler called out. | 19:09 |
clarkb | I thought I would dig into that more today and try to write a patch, but what I've realized since is that there isn't a great solution here since WIP changes are not mergeable. But Gerrit overloads mergeable to indicate there is a conflict (which isn't necessarily true in the WIP case) | 19:10 |
clarkb | so now I'm thinking I'll wait a bit and see if any upstream devs have some hints for how we might address this. Maybe it is ok to drop merge conflict in the case of all wips. Or maybe we need a better distinction between the three states and use something other than a binary value | 19:10 |
clarkb | If it's the latter option, that may require them to write a change as I think it requires a new index version | 19:11 |
clarkb | But I do think I understand this enough to say it is largely a non-issue. It looks weird in the UI, but it doesn't indicate a bug in merge conflict checking or the index itself | 19:11 |
clarkb | Which means I think it is fairly low priority and we can focus effort elsewhere | 19:12 |
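The behavior being described can be observed directly through Gerrit's REST API: a WIP change reports mergeable as false even when there is no real git conflict, which is what drives the misleading UI label. A hedged sketch using the standard /changes endpoints (the host and change number are placeholders):

```python
import json

import requests

GERRIT = "https://review.opendev.org"  # example server
CHANGE = "123456"  # placeholder change number


def gerrit_get(path):
    # Gerrit prefixes JSON responses with ")]}'" to defeat XSSI; strip it.
    resp = requests.get(f"{GERRIT}{path}")
    resp.raise_for_status()
    return json.loads(resp.text.split("\n", 1)[1])


change = gerrit_get(f"/changes/{CHANGE}")
mergeable = gerrit_get(f"/changes/{CHANGE}/revisions/current/mergeable")

# work_in_progress is only present in ChangeInfo when it is true.
print("work_in_progress:", change.get("work_in_progress", False))
print("mergeable:", mergeable.get("mergeable"))
```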
ianw | (clarkb becoming dangerously close to a java developer again :) | 19:12 |
clarkb | The other item I wanted to bring up here is whether or not we think we are ready to drop 3.4 images and add 3.6 as well as testing | 19:12 |
clarkb | https://review.opendev.org/q/topic:gerrit-3.4-cleanups | 19:12 |
clarkb | If so there are three changes ^ there that need review. The first one drops 3.4, second adds 3.6, and last one adds 3.5 -> 3.6 upgrade testing. That last one is a bit complicated as there are steps we have to take on 3.5 before upgrading to 3.6 and the old test system for that wasn't able to do that | 19:13 |
clarkb | Considering it has been a week and, of the two discovered issues, one has already been fixed and the other is basically just a UI thing, I'm comfortable saying it is unlikely we'll revert at this point | 19:14 |
clarkb | memory usage has also looked good | 19:14 |
ianw | ++ I think so, I can't imagine we'd go back at this point, but we can always revert | 19:14 |
clarkb | ya our 3.4 images will stay on docker hub for a bit and we can revert without reinstating all the machinery to build new ones | 19:15 |
fungi | on the merge conflict front, maybe just changing the displayed label to be more accurate would suffice? | 19:15 |
clarkb | looks like ianw has already reviewed those changes. Maybe fungi and/or frickler can take a second look. Particularly the changes that add 3.6, just to make sure we don't miss anything. | 19:15 |
clarkb | fungi: that is possible but merge conflict relies on mergeable: false even though it can also mean wip. So it becomes tricky to not break the merge conflict reporting on non-wip changes | 19:16 |
clarkb | But ya maybe we just remove the merge conflict tag entirely on wip things in the UI | 19:16 |
clarkb | that is relatively straightforward | 19:16 |
clarkb | maybe upstream will have a good idea and we can fix it some way I haven't considered | 19:17 |
clarkb | Anything else on this subject? I think we're just about at a place where we can drop it off the schedule (once 3.4 images are removed) | 19:17 |
clarkb | s/schedule/agenda/ | 19:18 |
fungi | well, s/merge conflict/unmergeable/ would be more accurate to display | 19:18 |
fungi | since it's not always a git merge conflict causing it to be marked as such | 19:18 |
frickler | in particular the "needs rebase" msg is wrong | 19:18 |
clarkb | fungi: but that is only true for wip changes aiui | 19:18 |
fungi | right | 19:19 |
clarkb | but ya maybe clarifying that in the case of wip changes is a way to go "unmergeable due to the wip state" | 19:19 |
fungi | well, also changes with outdated parents get marked as being in merge conflict even if they're technically not (though in those cases, rebases are warranted) | 19:19 |
clarkb | oh that is news to me but after reading the code it is not unexpected. Making note of that on the bug I filed would be good | 19:20 |
fungi | also possible i've imagined that, i'll have to double-check | 19:21 |
clarkb | k | 19:21 |
clarkb | We have a few more topics to get through. Any other gerrit upgrade items before we move on? | 19:22 |
clarkb | #topic Improving grafana management tooling | 19:23 |
clarkb | This topic was largely going to talk about the new grafyaml dashboard screenshotting jobs, but those have since merged. | 19:23 |
clarkb | I guess maybe we should catch up on the current state of things and where we think we might be headed? | 19:24 |
clarkb | pulling info from the last meeting: what ianw has discovered is that grafyaml uses old APIs which can't properly express things like threshold levels for colors in graphs. This means success and failure graphs both show green in some cases | 19:24 |
ianw | i'm still working on it all | 19:25 |
ianw | but in doing so i did find one issue | 19:25 |
ianw | #link https://review.opendev.org/c/opendev/system-config/+/847876 | 19:25 |
ianw | this is my fault, missing parts of the config when we converted to ansible | 19:26 |
ianw | in short, we're not setting xFilesFactor to 0 for .wsp files created since the update. this was something corvus fixed many years ago that got reverted | 19:26 |
ianw | as noted, i'll have to manually correct the files on-disk after we fix the configs | 19:27 |
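A rough sketch of what that on-disk correction could look like using the whisper Python library; the data directory is illustrative, and the assumption that setAggregationMethod also accepts an xFilesFactor argument should be checked against the whisper version on the server.

```python
import glob

import whisper  # graphite's .wsp file library

WSP_ROOT = "/opt/graphite/storage/whisper"  # illustrative data directory

for path in glob.iglob(f"{WSP_ROOT}/**/*.wsp", recursive=True):
    info = whisper.info(path)
    if info["xFilesFactor"] != 0:
        # Assumes a whisper release whose setAggregationMethod also takes an
        # xFilesFactor; the existing aggregation method is left unchanged.
        whisper.setAggregationMethod(path, info["aggregationMethod"], 0)
        print(f"reset xFilesFactor on {path}")
```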
clarkb | noted. I've got that on my todo list for after the meeting and lunch | 19:27 |
clarkb | reviewing the change I mean | 19:27 |
ianw | i noticed this because i was not getting sensible results in the screenshots of the graphs we now create when we update graphs | 19:28 |
clarkb | ianw: jrosser_ also noted that the screenshots may be catching a spinning loading wheel in some cases. Is this related? | 19:29 |
clarkb | the info is in #opendev if you want to dig into that more | 19:29 |
ianw | ahh, ok, each screenshot waits 5 seconds, but that may not be long enough | 19:30 |
clarkb | it may depend on the size of the dashboard. I think the OSA dashboards have a lot of content based on the change diff | 19:30 |
ianw | it's quite difficult to tell if the page is actually loaded | 19:30 |
clarkb | Once we've got this fairly stable do we have an idea of what sorts of things we might be looking at to address the grafyaml deficiencies? | 19:31 |
clarkb | or maybe too early to tell since bootstrapping testing has been the focus | 19:32 |
fungi | i wonder if there's a way to identify when the page has completed loading | 19:32 |
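One way to replace the fixed 5 second sleep is an explicit wait on a page condition. A hedged selenium sketch is below; the CSS selector for Grafana's panel spinner is a guess and would need checking against the Grafana version in use.

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def wait_for_dashboard(driver, timeout=60):
    """Block until no panel still shows its loading spinner (or time out)."""
    # document.readyState alone is not enough for a single-page app like
    # grafana, so also wait for the (assumed) spinner element to disappear.
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
    )
    WebDriverWait(driver, timeout).until(
        EC.invisibility_of_element_located((By.CSS_SELECTOR, ".panel-loading"))
    )
```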
ianw | my proposal would be editing directly in grafana and committing the dashboards it exports, using the screenshots as a better way to review changes than trying to be human parsers | 19:32 |
ianw | however, we are not quite at the point I have a working CI example of that | 19:32 |
ianw | so i'd like to get that POC 100%, and then we can move it to a concrete discussion | 19:33 |
clarkb | got it. Works for me | 19:33 |
corvus | will the screenshots show the actual metrics used? | 19:33 |
corvus | by that, i mean the metrics names, formulas applied, etc? | 19:34 |
clarkb | I think grafana can be convinced to show that info, but it may be equivalent to what is in the json (aka just the json backing) | 19:35 |
corvus | (so that a reviewer can see that someone is adding a panel that, say, takes a certain metric and divides by 10 and not 100) | 19:35 |
corvus | okay, so someone reviewing the change for accuracy would need to read the json? | 19:35 |
clarkb | I'm looking at the prod dashboard and to see that info currently it does seem like you have to load the json version (it shows the actual data and stats separately but not how they were formulated) | 19:37 |
ianw | yes, you would want to take a look at the json for say metric functions | 19:38 |
ianw | "the json" looks something like https://review.opendev.org/c/openstack/project-config/+/833213/1/grafana/infra-prod-deployment.json | 19:38 |
corvus | the comment about reviewers not needing to be human parsers made me think that may no longer be the case, but i guess reviews still require reading the source (which will be json instead of yaml) | 19:39 |
corvus | or maybe there's some other way to output that information | 19:40 |
clarkb | one idea was to use a simpler translation tool between the json and yaml to help humans. But not try to encode logic as much as grafyaml does today as that seems to be part of what trips us up. | 19:41 |
clarkb | But I think we can continue to improve the testing. users have already said how helpful it is while using grafyaml so no harm in improving things this way | 19:42 |
clarkb | and we can further discuss the future of managing the dashboards as we learn more about our options | 19:42 |
clarkb | We've got 18 minutes left in the meeting hour and a few more topics. Anything urgent on this subject before we continue on? | 19:42 |
ianw | nope, thanks | 19:43 |
clarkb | #topic URL Shortener Service | 19:43 |
clarkb | frickler: Any updates on this? | 19:43 |
frickler | still no progress here, sorry | 19:43 |
clarkb | no worries | 19:43 |
clarkb | #topic Zuul job POST_FAILUREs | 19:43 |
clarkb | Starting sometime last week openstack ansible and tripleo both noticed a higher rate of POST_FAILURE jobs | 19:44 |
clarkb | fungi did a fair bit of digging last week and I've tried to help out more recently. It isn't easy to debug because these post failures appear related to log uploads, which means we get no log URL and no log links | 19:44 |
clarkb | We strongly suspect that this is related to the executor -> swift upload process, with the playbook timing out in that period of time. | 19:45 |
clarkb | We suspect that it is also related to either the total number of log files, their size or some combo of the two since only OSA and tripleo seem to be affected and they log quite a bit compared to other users/jobs | 19:45 |
fungi | the time required to upload to swift endpoints does seem to account for the majority of the playbook's time, and can vary significantly | 19:46 |
clarkb | We've helped them both identify places they might be able to trim their log content down. The categories largely boiled down to no more ARA, reduce deep nesting of logs since nesting requires an index.html for each dir level, remove logs that are identical on every run (think /etc contents that are fixed and never change), and things like journald binary files. | 19:46 |
clarkb | Doing this cleanup does appear to have helped but not completely removed the issue | 19:47 |
corvus | well, if hypothetically, under some circumstances it takes 4x time to upload, it may simply be that only those jobs are long enough that 4x time is noticeable? | 19:47 |
fungi | yes | 19:47 |
corvus | (but the real issue is surely that under some circumstances it takes 4x time, right?) | 19:47 |
clarkb | corvus: yup, I think that is what we are suspecting | 19:47 |
clarkb | in the OSA case we've seen some jobs take ~2 minutes to upload logs, ~9 minutes, and ~22 minutes | 19:47 |
corvus | so initial steps are good, and help reduce the pain, but underlying problem remains | 19:47 |
fungi | also it's really only impacting tripleo and openstack-ansible changes, so seems to be something specific to their jobs (again, probably the volume of logs they collect) | 19:48 |
clarkb | the problem is we have very little insight into this due to how the issues occur. We lose a lot of info. Even on the executor log side the timeout happens and we don't get info about where we were uploading to | 19:48 |
fungi | unfortunately a lot of the troubleshooting is hampered by blind spots due to ansible not outputting things when it gets aborted mid-task | 19:48 |
clarkb | we could add logging of that to the ansible role but then we set no_log: true which I think may break any explicit logging too | 19:48 |
fungi | so it's hard to even identify which swift endpoint is involved in one of the post_failure results | 19:48 |
clarkb | I think we've managed the bleeding, but now we're looking for ideas on how we might log this better going forward. | 19:49 |
clarkb | One great idea fungi had was to do two passes of uploads. The first can upload the console log and console json and maybe the inventory content. Then a second pass can upload the job specific data | 19:49 |
ianw | yeah -- when i started hitting this, it turned out to be a massive syslog caused by a kernel bug that only hit in one cloud provider, flooding it with backtrace messages. luckily in that case, some of the uploads managed to work, so we could see the massive file. but it could be something obscure and unrelated to the job like this | 19:49 |
corvus | that would help us not have post_failures, but it wouldn't help us have logs, and it wouldn't help us know that we have problems uploading logs. | 19:50 |
clarkb | The problem with this is we generate a zuul manifest with all of the log files and record that for the zuul dashboard, so we'd essentially need to upload those base logs twice to make that work | 19:50 |
corvus | iow, it could sweep this under the rug but not actually make things better | 19:50 |
fungi | i think it's the opposite? | 19:50 |
clarkb | corvus: I don't think it would stop the post failures. The second upload pass would still cause that to happen | 19:50 |
fungi | it wouldn't stop the post_failure results, but we'd have console logs and could inspect things in the dashboard | 19:50 |
corvus | oh i see | 19:51 |
clarkb | it would allow us to, in theory, know where we're slow to upload to | 19:51 |
clarkb | and some other info. | 19:51 |
fungi | basically try to avoid leaving users with a build result that says "oops, i have nothing to show you, but trust me this broke" | 19:51 |
clarkb | But making that shift work in the way zuul's logging system currently works is not trivial | 19:51 |
corvus | that sounds good. the other thing is that all the info is in the executor logs. so if you want to write a script to parse it out, that could be an option. | 19:51 |
clarkb | Mostly calling this out here so people are aware of the struggles and also to brainstorm how we can log better | 19:51 |
clarkb | corvus: the info is only there if we don't time out the task though | 19:51 |
corvus | i suggest that because even if you improve the log upload situation, it still doesn't really point at the problem | 19:52 |
fungi | except a lot of the info we want for debugging this isn't in the executor logs, at least not that i can find | 19:52 |
clarkb | corvus: when we time out the task we kill the task before it can record anything in the executor logs | 19:52 |
fungi | though we can improve that, yes | 19:52 |
clarkb | at least that was my impression of the issue here | 19:52 |
corvus | no i mean all the info is in the executor log. /var/log/zuul/executor-debug.log | 19:52 |
fungi | basically do some sort of a dry-run task before the task which might time out | 19:52 |
clarkb | corvus: yes that file doesn't get the info because the task is forcefully killed before it records the info | 19:52 |
fungi | all of the info we have is in the executor log, but in these cases the info isn't much | 19:53 |
clarkb | ya another approach may be to do the random selection of target, record it in the log, then start the upload, similar to the change jrosser wrote | 19:53 |
clarkb | then we'd at least have that information | 19:53 |
corvus | you're talking about the swift endpoint? | 19:53 |
clarkb | corvus: yes that is the major piece of info | 19:53 |
fungi | that's one piece of data, yes | 19:53 |
clarkb | potentially also the files copied | 19:53 |
fungi | it gets logged by the task when it ends | 19:53 |
fungi | except in these cases, because it isn't allowed to end | 19:54 |
fungi | instead it gets killed by the timeout | 19:54 |
clarkb | the more I think about it the more I think a change like jrosser's could be a good thing here. Basically make random selection, record target, run upload. Then we record some of the useful info before the forceful kill | 19:54 |
fungi | so we get a log that says the task timed out, and no other information that task would normally have logged (for example, the endpoint url) | 19:54 |
fungi | we can explicitly log those other things by doing it before we run that task though | 19:55 |
clarkb | https://review.opendev.org/c/opendev/base-jobs/+/847780 that change | 19:55 |
corvus | yeah that's a good change | 19:55 |
clarkb | that change was initially made so we can do two upload passes, but maybe we start with it just to record the info and do one upload | 19:55 |
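The pattern being discussed reduces to: pick the upload target, persist that choice before the slow step starts, then run the upload, so the target is known even if the task is later killed by the timeout. A minimal Python sketch with illustrative names (not the actual base-jobs role):

```python
import logging
import random

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("upload-logs")

# Illustrative endpoints; the real base job carries its own provider list.
SWIFT_ENDPOINTS = [
    "https://swift.provider-a.example/v1/AUTH_logs",
    "https://swift.provider-b.example/v1/AUTH_logs",
]


def upload_logs(endpoint, log_root):
    """Placeholder for the real (potentially very slow) swift upload."""
    raise NotImplementedError


def main(log_root):
    endpoint = random.choice(SWIFT_ENDPOINTS)
    # Record the selection *before* starting the upload so the information
    # survives even if the upload is killed by a timeout.
    log.info("uploading %s to %s", log_root, endpoint)
    upload_logs(endpoint, log_root)
```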
fungi | another idea was to temporarily increase the timeout for post-logs and then try to analyze builds which took longer in that phase than the normal timeout | 19:56 |
corvus | then you'll have everything in the debug log :) | 19:56 |
clarkb | yup. Ok that's a good next step and we can take debugging from there as it may provide important info | 19:56 |
clarkb | we are almost out of time but do have one more agenda item I'd like to get to | 19:56 |
corvus | in particular, you can look for log upload times by job, and see "normal" and "abnormal" times | 19:56 |
fungi | the risk of temporarily increasing the timeout, of course, is that jobs may end up merging changes that make the situation worse in the interim | 19:56 |
clarkb | #topic Bastion host | 19:57 |
clarkb | ianw put this on the agenda to discuss two items re bridge. | 19:57 |
corvus | yeah, i wouldn't change any timeouts; i'd do the change to get all the data in the logs, then analyze the logs. that's going to be a lot better than doing a bunch of spot checks anyway. | 19:57 |
clarkb | The first is whether or not we should put ansible and openstacksdk in a venv rather than global install | 19:57 |
ianw | this came out of | 19:57 |
ianw | #link https://review.opendev.org/c/opendev/system-config/+/847700 | 19:57 |
ianw | which fixes the -devel job, which uses these from upstream | 19:58 |
ianw | i started to take a different approach, moving everything into a non-privileged virtualenv, but then wondered if there was actually any appetite for such a change | 19:58 |
ianw | do we want to push on that? or do we not care that much | 19:58 |
clarkb | I think putting pip installs into a venv is a good idea simply because not doing that continues to break in fun ways over time | 19:59 |
clarkb | The major downsides are that things are no longer automatically in $PATH (but we can add them explicitly), and that when python upgrades you get really weird errors running stuff out of venvs | 19:59 |
ianw | yeah they basically need to be regenerated | 20:00 |
fungi | i am 100% in favor of avoiding `sudo pip install` in favor of either distro packages or venvs, yes | 20:00 |
fungi | also python upgrades on a stable distro shouldn't need regenerating unless we do an in-place upgrade of the distro to a new major version | 20:00 |
clarkb | ianw: and config management makes that easy if we just rm or move the broken venv aside and let config management rerun (there is a chicken and egg here for ansible specifically though, but I think that is ok if we shift to venvs more and more) | 20:00 |
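A minimal sketch of the venv approach being proposed, assuming a hypothetical /opt/ansible-venv location and unpinned installs; the real change would pick its own paths and version pins.

```python
import subprocess
import venv

VENV_DIR = "/opt/ansible-venv"  # hypothetical location


def build_tool_venv():
    # clear=True makes "delete the broken venv and rerun config management"
    # a single idempotent step after a python upgrade.
    venv.EnvBuilder(with_pip=True, clear=True).create(VENV_DIR)
    subprocess.run(
        [f"{VENV_DIR}/bin/pip", "install", "ansible", "openstacksdk"],
        check=True,
    )
    # The tools are no longer on $PATH automatically, so expose the entry
    # points explicitly instead of relying on a global pip install.
    subprocess.run(
        ["ln", "-sf", f"{VENV_DIR}/bin/ansible-playbook",
         "/usr/local/bin/ansible-playbook"],
        check=True,
    )
```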
clarkb | fungi: yes that is the next question :) | 20:01 |
clarkb | The next question is re upgrading bridge and whether or not we should do it in place or with a new host | 20:01 |
fungi | and to be clear, an in-place upgrade of the distro is fine with me, we just need to remember to regenerate any venvs which were built for the old python versions | 20:01 |
clarkb | I personally really like using new hosts when we can get away with it as it helps us start from a clean slate and delete old cruft automatically. But bridge's IP address might be important? In the past we explicitly allowed root ssh from its IP on hosts iirc | 20:01 |
ianw | this is a longer term thing. one thing about starting a fresh host is that we will probably find all those bits we haven't quite codified yet | 20:02 |
clarkb | I'm not sure if we still do that or not. If we do then doing an in place upgrade is probably fine. But I have a small preference for a new host if we can get away with it | 20:02 |
corvus | clarkb: i think that should be automatic for a replacement host | 20:02 |
clarkb | corvus: ya for all hosts that run ansible during the time frame we have both in the var/list/group | 20:02 |
clarkb | mostly concerned that a host might get missed for one reason or another and get stranded, but we can always manually update that too | 20:03 |
ianw | ok, so i'm thinking maybe we push on the virtualenv stuff for the tools in use on bridge first, and probably end up with the current bridge as a franken-host with things installed everywhere, every which way | 20:03 |
clarkb | Anyway no objection from me shifting ansible and openstacksdk (and more and more of our other tools) into venvs. | 20:03 |
fungi | same here | 20:03 |
ianw | however, we can then look at upgrade/replacement, and we should start fresh with more compartmentalized tools | 20:03 |
clarkb | And preference for new host to do OS upgrade | 20:03 |
fungi | i also prefer a new host, all things being equal, but understand it's a pragmatic choice in some cases to do in-place | 20:05 |
clarkb | We are now a few minutes over time. No open discussion today, but feel free to bring discussion up in #opendev or on the mailing list. Last call for anything else on bastion work | 20:05 |
fungi | thanks clarkb! | 20:05 |
corvus | thanks! | 20:07 |
clarkb | Sounds like that is it. Thank you everyone! | 20:07 |
clarkb | #endmeeting | 20:07 |
opendevmeet | Meeting ended Tue Jun 28 20:07:49 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:07 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-06-28-19.01.html | 20:07 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-06-28-19.01.txt | 20:07 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-06-28-19.01.log.html | 20:07 |