*** tosky has quit IRC | 00:30 | |
tristanC | ianw: when the icon has three vertical dots, it can be referred to as a kebab too | 00:33 |
tristanC | corvus: when you have some time, could you please check the last comment on https://review.opendev.org/c/zuul/zuul/+/775505 ? | 00:34 |
ianw | tristanC: do Americans know what a kebab is? i feel like it's gyros there :) | 00:35 |
clarkb | I think if you say kebab around here people will specifically think of meat on a stick | 00:35 |
clarkb | gyros may have kebab in them but are pita sandwiches | 00:36 |
clarkb | Both delicious | 00:36 |
fungi | looks more like takoyaki to me | 00:40 |
fungi | or maybe kibi dango | 00:40 |
corvus | tristanC: that's 10mb right? | 00:40 |
tristanC | corvus: i think so yes | 00:40 |
corvus | tristanC: cool, replied there, thanks. | 00:45 |
*** mgoddard has quit IRC | 00:51 | |
openstackgerrit | Merged zuul/zuul master: Don't bail on fetchBuildAllInfo if fetchBuildManifest fails https://review.opendev.org/c/zuul/zuul/+/773826 | 00:55 |
openstackgerrit | Merged zuul/zuul master: Don't enforce foreground with -d switch https://review.opendev.org/c/zuul/zuul/+/705189 | 00:55 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Support credentials supplied by nodepool https://review.opendev.org/c/zuul/zuul/+/774362 | 00:57 |
*** mgoddard has joined #zuul | 00:57 | |
*** noonedeadpunk has quit IRC | 01:19 | |
*** noonedeadpunk has joined #zuul | 01:21 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: web: render links and ansi escape sequences in logfile and console https://review.opendev.org/c/zuul/zuul/+/775505 | 01:24 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: web: add benchmark test for logfile https://review.opendev.org/c/zuul/zuul/+/775510 | 01:24 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: web: disable logfile line rendering when the size exceed a threshold https://review.opendev.org/c/zuul/zuul/+/775726 | 01:24 |
tristanC | corvus: i think ^ should do the trick | 01:29 |
*** hamalq has quit IRC | 01:47 | |
*** dmsimard has quit IRC | 02:16 | |
*** dmsimard has joined #zuul | 02:17 | |
openstackgerrit | Narendra Kumar Choudhary proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 02:55 |
*** ykarel has joined #zuul | 03:52 | |
*** ykarel_ has joined #zuul | 03:58 | |
*** ykarel has quit IRC | 04:01 | |
*** ykarel_ has quit IRC | 04:57 | |
*** vishalmanchanda has joined #zuul | 05:28 | |
*** evrardjp has quit IRC | 05:33 | |
*** evrardjp has joined #zuul | 05:33 | |
*** jfoufas1 has joined #zuul | 05:55 | |
*** saneax has joined #zuul | 06:10 | |
*** ianw has quit IRC | 06:18 | |
*** ianw has joined #zuul | 06:19 | |
*** mhu has quit IRC | 06:19 | |
*** mhu has joined #zuul | 06:20 | |
*** mugsie has quit IRC | 06:37 | |
openstackgerrit | Merged zuul/project-config master: Zuul: remove explicit SQL reporters https://review.opendev.org/c/zuul/project-config/+/775719 | 06:43 |
*** jcapitao has joined #zuul | 07:12 | |
*** piotrowskim has joined #zuul | 08:14 | |
*** rpittau|afk is now known as rpittau | 08:37 | |
*** tosky has joined #zuul | 08:44 | |
*** hashar has joined #zuul | 08:45 | |
*** jpena|off is now known as jpena | 08:59 | |
*** nils has joined #zuul | 09:20 | |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion https://review.opendev.org/c/zuul/nodepool/+/775438 | 10:27 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances https://review.opendev.org/c/zuul/nodepool/+/775796 | 10:27 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Log openstack requests https://review.opendev.org/c/zuul/nodepool/+/775797 | 10:27 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion https://review.opendev.org/c/zuul/nodepool/+/775438 | 10:49 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances https://review.opendev.org/c/zuul/nodepool/+/775796 | 10:49 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Log openstack requests https://review.opendev.org/c/zuul/nodepool/+/775797 | 10:49 |
avass | tobiash: if you got time can you take a look at https://review.opendev.org/c/zuul/zuul/+/775334 later? | 11:09 |
tobiash | avass: yes, that's on my todo list for today | 11:10 |
avass | tobiash: thanks! :) | 11:12 |
*** mugsie has joined #zuul | 11:13 | |
*** vishalmanchanda has quit IRC | 11:22 | |
tobiash | corvus: I had to do a second round of that nodepool stack and now it exceeds my expectations during load testing. It topped out at 1.5k node startups per 10min while keeping all node startups below one minute. Tested with a constant load of 200 node requests. | 11:26 |
tobiash | and from the metrics I can see that now the request handling seems to be the next bottleneck since it hit at max 150 building nodes at the same time | 11:27 |
*** vishalmanchanda has joined #zuul | 11:29 | |
*** jcapitao is now known as jcapitao_lunch | 12:12 | |
*** jpena is now known as jpena|lunch | 12:27 | |
*** rlandy has joined #zuul | 12:32 | |
tobiash | avass: +3 with a suggestion for a future improvement | 12:53 |
avass | tobiash: yeah, makes sense | 13:01 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion https://review.opendev.org/c/zuul/nodepool/+/775438 | 13:03 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances https://review.opendev.org/c/zuul/nodepool/+/775796 | 13:03 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Log openstack requests https://review.opendev.org/c/zuul/nodepool/+/775797 | 13:03 |
*** jcapitao_lunch is now known as jcapitao | 13:08 | |
*** jpena|lunch is now known as jpena | 13:28 | |
openstackgerrit | Merged zuul/zuul master: Fix executor errors on faulty .gitmodules file. https://review.opendev.org/c/zuul/zuul/+/775334 | 13:56 |
mordred | tobiash: comment on https://review.opendev.org/c/zuul/nodepool/+/775438 - not sure it really matters as I expect both patches to land at the same time anyway | 14:25 |
tobiash | mordred: thanks, but that flag is also used in the central watcher | 14:29 |
mordred | oh - ok cool | 14:29 |
tobiash | mordred: responded on the comment | 14:35 |
mordred | tobiash, corvus: hrm. you know - with that change, the primary user of the openstacksdk "use list and client-side-filtering instead of explicit get calls" goes away. that behavior is one that is very confusing to all of the non-nodepool users of openstacksdk, but I always aggressively kept it because nodepool depended on it. I wonder if I should raise it with gtema | 14:35 |
tobiash | mordred: nodepool still uses it | 14:35 |
tobiash | mordred: it just calls it later during wait_for_server to offload most of the time to one thread | 14:36 |
tobiash | mordred: and under high load we still have a high rate of finished servers that use that which would otherwise create more calls | 14:37 |
mordred | ah - yeah - nod. | 14:38 |
mordred | good point | 14:38 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion https://review.opendev.org/c/zuul/nodepool/+/775438 | 14:40 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances https://review.opendev.org/c/zuul/nodepool/+/775796 | 14:40 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Log openstack requests https://review.opendev.org/c/zuul/nodepool/+/775797 | 14:40 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Add simple load testing script https://review.opendev.org/c/zuul/nodepool/+/775843 | 14:40 |
tobiash | puts the load test script into its own change so we can spin that separately | 14:40 |
openstackgerrit | Jonas Sticha proposed zuul/nodepool master: aws: add image upload test https://review.opendev.org/c/zuul/nodepool/+/775844 | 14:41 |
tristanC | tobiash: i've started a similar script to load test zuul, e.g. submit x changes with jobs and measure how long it takes to get build results | 14:42 |
tobiash | tristanC: I think both make sense, maybe you want to provide that in the zuul repo? | 14:43 |
tobiash | the nodepool test script is handy if I just want to measure cloud/nodepool performance without impacting a zuul and/or github/gerrit | 14:44 |
tristanC | tobiash: and i was wondering if adding a prometheus time decorator, to be able to measure individual function performance too, would be possible now | 14:44 |
tobiash | and the zuul test script is handy if one wants to know an end-to-end performance which is also important | 14:44 |
tristanC | otherwise i worry the end-to-end measure might have too much variance | 14:46 |
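For illustration, the measurement half of such an end-to-end load test could be little more than polling Zuul's builds REST endpoint until results appear for the submitted changes. The sketch below assumes a placeholder API URL and tenant and leaves the change-submission step out entirely; it is not tristanC's actual script.

```python
import time

import requests

# Placeholder endpoint/tenant; a real script would take these as arguments.
ZUUL_API = 'https://zuul.example.com/api/tenant/example-tenant'


def wait_for_results(change_numbers, poll_interval=10, timeout=3600):
    """Poll Zuul's builds endpoint until each submitted change has at least
    one reported build result, recording the elapsed wall-clock time."""
    start = time.monotonic()
    pending = set(change_numbers)
    elapsed = {}
    while pending and time.monotonic() - start < timeout:
        for change in sorted(pending):
            resp = requests.get('%s/builds' % ZUUL_API,
                                params={'change': change})
            resp.raise_for_status()
            if any(build.get('result') for build in resp.json()):
                elapsed[change] = time.monotonic() - start
                pending.discard(change)
        time.sleep(poll_interval)
    return elapsed
```

Summing or averaging the per-change times then gives the end-to-end number, variance and all.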
tobiash | tristanC: I think the prometheus decorator would also be useful (if we don't let the metrics explode in a way that blows things up) | 14:46 |
tobiash | tristanC: end-to-end always has a lot of variance with that many components involved, but that's also a valuable outcome | 14:47 |
tobiash | especially when working towards scale out scheduler | 14:47 |
tristanC | corvus: would you mind revisiting your -2 on 599209 ? | 14:47 |
tobiash | tristanC: I think 599209 suffered from metrics explosion and would impact the scheduler performance if I recall correctly the discussions from back then | 14:48 |
corvus | tristanC, tobiash: i still think statsd is better for large timing data; prometheus does all the work in-process that statsd offloads | 14:48 |
corvus | (i'm not opposed to a small in-process prometheus server used for healthchecks) | 14:49 |
tobiash | I think if we do that we likely need to limit the high cardinality metrics to statsd | 14:49 |
tristanC | tobiash: corvus: i meant to use prometheus only for performance monitoring, not to replace the existing statsd extensive metrics | 14:50 |
corvus | tristanC: performance monitoring is exactly what statsd was designed for | 14:51 |
corvus | the idea is to emit timing data around everything you're interested in monitoring | 14:51 |
corvus | so we could have *lots* of timing metrics for functions or whatever we want to measure | 14:52 |
tristanC | corvus: ok i'll have a look, it seems like we could embed a statsd server in the benchmark tool to compute the result | 14:54 |
tobiash | zuul-maint: could you have a look on https://review.opendev.org/c/zuul/zuul-jobs/+/775511 to unbreak nodepool gate? | 14:54 |
corvus | tristanC: oh yeah for a benchmark setup that would be a good idea i think. if you just want to get a number out of it, you could probably write a 5 line prometheus server. there's one in the unit test framework. | 14:55 |
corvus | tristanC: er i mean statsd server :) | 14:55 |
corvus | i think having some coarse performance metrics in prometheus makes sense -- like, is the executor up, how many jobs is it running, what is its load average. things that are useful for k8s to know to make health/scale-out decisions. | 14:56 |
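As a rough sketch of what such coarse health metrics could look like with the prometheus_client library (the metric names here are invented for illustration and are not Zuul's real metrics):

```python
from prometheus_client import Gauge, start_http_server

# Hypothetical coarse health metrics an executor process might expose so a
# k8s liveness probe or autoscaler can scrape them; names are illustrative.
running_builds = Gauge('zuul_executor_running_builds',
                       'Number of builds currently running on this executor')
load_average = Gauge('zuul_executor_load_average_1m',
                     '1-minute load average of the executor host')

start_http_server(9090)  # serves the metrics on http://<host>:9090/metrics
running_builds.set(3)
load_average.set(1.42)
```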
tristanC | corvus: yeah, looking at the scheduler code, it's not that easy to add a metric similar to prometheus @SCHEDULER_FUNC_METRIC.time() decorator | 14:57 |
corvus | tristanC: but yeah, for benchmarking, a really simple statsd server to sum the metrics would be a super easy way to get a number out of a benchmark run. | 14:57 |
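A minimal sketch of such a throwaway statsd sink, assuming only the plain "name:value|ms" wire format; this is not the helper that lives in Zuul's unit test framework:

```python
import socket
from collections import defaultdict


def run_statsd_sink(host='127.0.0.1', port=8125):
    """UDP listener that sums statsd timer values per metric name until
    interrupted, then prints one number per metric for the benchmark run."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    try:
        while True:
            data, _ = sock.recvfrom(65535)
            for line in data.decode('utf-8', 'replace').splitlines():
                name, _, rest = line.partition(':')
                value, _, mtype = rest.partition('|')
                if mtype.startswith('ms'):  # timer metrics
                    totals[name] += float(value)
                    counts[name] += 1
    except KeyboardInterrupt:
        for name in sorted(totals):
            print(f'{name}: {totals[name]:.1f}ms over {counts[name]} samples')
    finally:
        sock.close()
```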
corvus | tristanC: we can write a decorator. that's not hard. but if you want to know a bunch of function run times, perhaps consider a profiler? | 14:58 |
corvus | there's one hooked up to sigusr2 | 14:59 |
tristanC | corvus: how would you share the local statsd client instance to the decorator? | 14:59 |
corvus | tristanC: i dunno; gotta run :) | 15:01 |
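One hedged way such a timing decorator could look, with a module-level client as one possible answer to the "how do we share the statsd client" question above; this is an assumption for illustration, not how Zuul actually wires up its statsd client:

```python
import functools
import time

import statsd  # the pystatsd client; Zuul's real client setup differs

# One process-wide client that decorated functions close over; a real
# implementation might read host/port from zuul.conf instead.
_statsd = statsd.StatsClient('localhost', 8125)


def timed(metric_name):
    """Emit a statsd timer around each call of the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                _statsd.timing(metric_name, elapsed_ms)
        return wrapper
    return decorator

# Hypothetical usage on a scheduler method:
# @timed('zuul.scheduler.process_event')
# def process_event(self, event): ...
```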
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: prometheus: add options to start the server and process collector https://review.opendev.org/c/zuul/zuul/+/599209 | 15:01 |
*** jcapitao_ has joined #zuul | 15:03 | |
mordred | tobiash: on the list_servers thing - I just went diving into the sdk wait_for_server - there's one place at the top of it that it calls get_server relying on the list functionality - that could actually be skipped if the server being passed in is already in active state (which is an improvement that really should be made anyway) - and after that all of the calls to get_server later in the stack are actually to get_server_by_id which bypasses the | 15:04 |
mordred | list_servers cache | 15:04 |
*** jcapitao has quit IRC | 15:06 | |
mordred | oh- although that's calling with bare=True - so still not a wain | 15:06 |
mordred | win | 15:06 |
mordred | NM :) | 15:07 |
*** jcapitao_ is now known as jcapitao | 15:09 | |
openstackgerrit | Gomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size https://review.opendev.org/c/zuul/zuul-jobs/+/773474 | 15:12 |
tobiash | mordred: actually I missed the bare=True flag in the first patch resulting in one server list and 1000 fips calls per iteration of that thread | 15:19 |
tobiash | the request log change helped me find and fix that problem | 15:19 |
mordred | ++ | 15:19 |
tobiash | tristanC: how should we continue with the 3.11-only part of 775511? | 15:25 |
*** hashar has quit IRC | 15:29 | |
tristanC | let me see, i think we can fail for older version | 15:30 |
tristanC | or we need a version to repo name mapping... | 15:30 |
tobiash | corvus: do we need a deprecation period for removing all openshift older than 3.11, given that the role has been broken for some time and nobody (except nodepool) complained? | 15:31 |
openstackgerrit | Jonas Sticha proposed zuul/nodepool master: aws: add support for uploading diskimages https://review.opendev.org/c/zuul/nodepool/+/735217 | 15:32 |
openstackgerrit | Jonas Sticha proposed zuul/nodepool master: aws: add image upload test https://review.opendev.org/c/zuul/nodepool/+/775844 | 15:32 |
corvus | tobiash: doesn't seem necessary to me | 15:33 |
tobiash | kk | 15:33 |
tobiash | tristanC: so I guess error on <3.11 is fine | 15:34 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository https://review.opendev.org/c/zuul/zuul-jobs/+/775511 | 15:36 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository https://review.opendev.org/c/zuul/zuul-jobs/+/775511 | 15:38 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository https://review.opendev.org/c/zuul/zuul-jobs/+/775511 | 15:43 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: multi-node debian: Update package lists before installing https://review.opendev.org/c/zuul/zuul-jobs/+/775866 | 15:53 |
*** jfoufas1 has quit IRC | 15:55 | |
avass | tristanC: software factory CI is cating up again :) | 15:55 |
avass | acting* but maybe it's catting also | 15:55 |
*** jcapitao is now known as jcapitao|off | 16:05 | |
tristanC | avass: oops, we are fixing that shortly | 16:07 |
*** rpittau is now known as rpittau|afk | 16:10 | |
*** hashar has joined #zuul | 16:22 | |
tristanC | avass: thank you for reporting the issue, the job has been removed | 16:23 |
corvus | zuul-maint: i restarted the opendev scheduler yesterday just behind the earliest commit that could be 4.0. should i do a full system restart on opendev today to sanity check current HEAD before tagging 4.0, or are we okay with the slight skew? | 16:57 |
corvus | (i lean slightly toward doing one more full restart... i don't really want to do a paper bag 4.0.1....) | 16:57 |
tobiash | ++ | 16:58 |
tristanC | corvus: wfm | 16:58 |
tobiash | corvus: are you planning to include the nodepool optimizations as well, or will those come later? | 16:58 |
clarkb | ++ to restarting everything on what you want to tag | 16:59 |
fungi | corvus: another restart sounds fine though queues are deep in the openstack tenant at the moment, so if we do it now it will probably take a while to reenqueue everything | 16:59 |
corvus | tobiash: you know, we haven't really talked about a nodepool release... technically we don't need one. but maybe we should go ahead and do a 4.0 just to re-sync their versions? regardless, let's defer the optimizations; i want to give those a good review which would delay 4.0 work | 16:59 |
tobiash | corvus: wfm | 17:00 |
corvus | zuul-maint: while not strictly required, it would make it a little bit easier on me if we didn't approve any zuul or nodepool changes until we release 4.0 :) | 17:01 |
clarkb | corvus: related: you emailed opendev about removing the mysql reporters. Do we need to land those changes before this big restart? | 17:02 |
clarkb | or is that a pre 5.0 thing? | 17:02 |
corvus | clarkb: that's an ASAP thing; absolutely before 5.0, but any time after yesterday. we want it asap so we don't have to wade through the warning messages in the logs though. | 17:03 |
fungi | i think we've landed them, but would be good to double-check | 17:03 |
fungi | i approved a bunch yesterday anyway | 17:03 |
corvus | clarkb: like, as an opendev operator, i'd love it if all our tenants merged those this week. | 17:03 |
clarkb | got it, ya I was planning to take a look after meeting today | 17:03 |
clarkb | or between meetings as it appears that weather has caused things to be rescheduled | 17:04 |
openstackgerrit | Albin Vass proposed zuul/zuul master: Fix build page releasenote typo https://review.opendev.org/c/zuul/zuul/+/775880 | 17:27 |
avass | That could be good to merge before releasing ^ | 17:28 |
tobiash | avass: +2 but the fix will also just work after tagging | 17:32 |
corvus | i think we can go ahead and approve that; i'll +3 | 17:34 |
corvus | we don't have any release notes for nodepool, which violates our "no release without a note" rule.... | 17:34 |
corvus | do we want to merge a release note saying "no substantial changes, we're just releasing this for version parity with zuul"? | 17:34 |
*** piotrowskim has quit IRC | 17:35 | |
tristanC | corvus: perhaps add one to 773540? | 17:36 |
avass | tobiash: yeah but the versioned docs would have the typo, wouldn't they? | 17:36 |
avass | in case someone is reading those :) | 17:37 |
corvus | i do link to the versioned docs in announcements... | 17:38 |
fungi | uprevving the nodepool version to match zuul's major version seems worthy of a release note even if for no other reason than to avoid confusion | 17:38 |
clarkb | ++ since our users have come to expect semver there | 17:38 |
corvus | tristanC: oof, that didn't merge? if we landed that now, i'd want to do another nodepool restart | 17:38 |
fungi | i can help with another opendev nodepool restart if it's wanted | 17:38 |
fungi | now that my meetings are on break for a bit | 17:39 |
corvus | fungi: well, if we did a nodepool restart, it's going to be several hours from now | 17:40 |
fungi | i can help with one after 20z as well | 17:41 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Format multi-line log entries https://review.opendev.org/c/zuul/nodepool/+/773540 | 17:46 |
corvus | zuul-maint: ^ that does all the things suggested :) | 17:46 |
tobiash | corvus: what do you think about doing nodepool v4 release a few days after zuul v4 and include aws image handling there? Then we'd have a release note | 17:50 |
tobiash | ah you've put one into the multiline change | 17:51 |
*** jcapitao|off has quit IRC | 17:51 | |
clarkb | corvus: what does ret do in that formatter? it is appended to but not returned (as the name would imply) | 17:51 |
tristanC | i guess we need https://review.opendev.org/c/zuul/zuul-jobs/+/775511 to land changes in nodepool | 17:52 |
tobiash | yes | 17:52 |
corvus | tobiash: i just want to release both things today | 17:54 |
corvus | i would rather merge a release note with no change than spend any more time reviewing actual changes | 17:54 |
avass | clarkb: Was wondering about that too | 17:54 |
corvus | if the log formatting change isn't ready, i'll just make a quick change with a release note | 17:55 |
clarkb | corvus: it looks like we are discarding the results of the formatting | 17:56 |
clarkb | but that is based on ret not being returned or otherwise used other than to write to it (not based on testing evidence or similar) | 17:56 |
corvus | fungi: can you look at https://review.opendev.org/775511 | 17:56 |
fungi | yep | 17:57 |
tobiash | corvus: zuul-upload-image is in retry limit in the gate | 17:57 |
corvus | clarkb, tristanC: it sure looks like the multiline change is faulty, i think it's missing the last line of the function: return '\n'.join(ret) | 17:58 |
corvus | i will WIP that and move the reno to a standalone change | 17:58 |
clarkb | corvus: sounds good | 17:59 |
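For context, a minimal sketch of the kind of multi-line formatter under discussion, assuming the intent is simply to keep continuation lines grouped under one entry; the real change in 773540 may differ, but it shows where the missing return belongs:

```python
import logging


class MultiLineFormatter(logging.Formatter):
    """Hypothetical sketch of a formatter that keeps continuation lines of a
    multi-line message visually attached to the first line of the entry."""

    def format(self, record):
        formatted = super().format(record)
        ret = []
        for i, line in enumerate(formatted.splitlines()):
            # Indent every line after the first so multi-line tracebacks or
            # pasted output stay grouped under one log entry.
            ret.append(line if i == 0 else '  ' + line)
        # Without this return (the bug spotted above), the loop's work is
        # silently discarded.
        return '\n'.join(ret)
```

Attaching it is the usual one-liner, e.g. `handler.setFormatter(MultiLineFormatter('%(asctime)s %(levelname)s %(name)s: %(message)s'))`.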
clarkb | tristanC: left a comment on 775511 if you want to check it (I think we can merge it as is and just do a followup cleanup) | 18:00 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Add 4.0 release note https://review.opendev.org/c/zuul/nodepool/+/775885 | 18:00 |
corvus | zuul-maint: ^ | 18:00 |
tobiash | +2 | 18:01 |
fungi | and approved | 18:01 |
fungi | also approved the openshift fix | 18:01 |
corvus | tobiash: looks like the dockerhub login failed | 18:02 |
corvus | ianw saw this yesterday in opendev | 18:02 |
corvus | i think the assumption was it was transient -- fungi, clarkb -- do you remember anything further? | 18:02 |
fungi | just that we suspected maybe dh rate limits/throttling, but things worked on a recheck | 18:02 |
tobiash | shall we close/reopen 775880 to make it abort and reenter the gate? | 18:02 |
clarkb | yes we assumed it was transient because fungi had some gitea changes up that passed through the same ansible and they were happy | 18:03 |
clarkb | and ya docker hub rate limits/throttling was my suspicion | 18:03 |
corvus | wait i don't know what task failed | 18:03 |
clarkb | "upload-docker-image: Log in to registry" failed on ianw's examples | 18:04 |
tobiash | corvus: something really weird: http://paste.openstack.org/show/802705/ | 18:04 |
corvus | tobiash: yeah i was just homing in on that | 18:04 |
corvus | the stat module failed with 'bad message'? | 18:05 |
tobiash | yeah, no idea what that means, but all other jobs are ok | 18:06 |
corvus | i don't think we're releasing anything today | 18:06 |
*** hashar is now known as hasharDinner | 18:07 | |
corvus | infra-root, zuul-maint, config-core: we just noticed a very strange failure in a zuul job that occurred during the pre playbook when setting up git repos. please keep an eye out for any other builds that fail in that way. | 18:08 |
tobiash | nodepool-upload-image failed due to the request limit | 18:09 |
clarkb | tobiash: the same task from ianw's example: "upload-docker-image: Log in to registry" ? | 18:10 |
tobiash | corvus: the bad message is nothing new: https://zuul.opendev.org/t/zuul/build/c21cd59ba2cc41d4aa25c134e7f964e5 | 18:10 |
corvus | new to me | 18:12 |
tobiash | new to me as well | 18:12 |
tobiash | found three already in the zuul tenant, all airship-kna1 | 18:14 |
tobiash | maybe a corrupted image | 18:14 |
fungi | message:"bad message" turns up a ton of hits in logstash | 18:15 |
fungi | we've also been seeing random nodes in airship-kna1 boot up with a 15gb rootfs instead of the usual 100gb the flavor typically provides | 18:15 |
fungi | leading me to suspect that sometimes the growroot doesn't happen at boot there | 18:16 |
corvus | fungi: can you share the logstash link? is bad message correlated with kna1? | 18:20 |
*** jpena is now known as jpena|off | 18:24 | |
openstackgerrit | Merged zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository https://review.opendev.org/c/zuul/zuul-jobs/+/775511 | 18:25 |
clarkb | corvus: I think fungi was saying he was searching for message:"bad message" in logstash | 18:28 |
clarkb | and ya if I set the time frame to the last 6 hours or so I see a number of them and they don't appear to be cloud or image specific | 18:28 |
clarkb | it seems to universally affect openstack-ansible jobs though | 18:29 |
clarkb | I wonder if this is an ansible regression | 18:29 |
fungi | i'm not clear on how to link to specific queries in logstash unfortunately, so yeah i mentioned the query string here | 18:30 |
clarkb | there is or was a way but I can never remember it either | 18:30 |
avass | clarkb: it seems to come from os.stat | 18:30 |
tobiash | the last 2.9 release is from jan 18 | 18:30 |
avass | or os.lstat | 18:30 |
fungi | right, it seems like "bad message" may be ansible's way of saying it couldn't parse a message from a task? | 18:30 |
avass | I think it's errno 74 EBADMSG | 18:31 |
fungi | oh, neat | 18:31 |
clarkb | ya if you look at the osa failures it's trying to collect journal files and getting that | 18:31 |
clarkb | hrm and that happens in a shell script run by ansible | 18:32 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: remove unused role var https://review.opendev.org/c/zuul/zuul-jobs/+/775888 | 18:32 |
clarkb | so ya unlikely to be ansible I guess, they somehow are just tripping over it in a way that logstash catches it | 18:32 |
clarkb | File corruption detected at /var/log/journal/5a8fcb09881b49f9ae1601fbeee5f3fa/system.journal:74d4b70 (of 125829120 bytes, 97%). then var/log/journal/5a8fcb09881b49f9ae1601fbeee5f3fa/system.journal (Bad message) | 18:33 |
avass | ansible only checks for ENOENT and otherwise fails with whatever exception it got: https://github.com/ansible/ansible/blob/stable-2.9/lib/ansible/modules/files/stat.py#L466 | 18:33 |
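In other words (a stdlib-only sketch of the logic at the linked line, not the module itself): os.lstat raises OSError, only ENOENT is translated into "file does not exist", and everything else, including errno 74 EBADMSG from a corrupted filesystem, propagates as the task's error message:

```python
import errno
import os


def stat_like_ansible(path):
    """Sketch mirroring the linked stat.py behaviour: only ENOENT is treated
    as "file does not exist"; any other OSError (e.g. EBADMSG, errno 74,
    raised when the kernel reports filesystem corruption) bubbles up and
    fails the task with the OS error string, such as "Bad message"."""
    try:
        return os.lstat(path)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return None  # reported back as exists: false
        raise  # -> task failure, e.g. "[Errno 74] Bad message: '/var/log/...'"
```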
clarkb | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_184/775813/3/check/openstack-ansible-upgrade-aio_metal-centos-8/1842a5b/logs/host/journal/5a8fcb09881b49f9ae1601fbeee5f3fa/index.html the file does seem to be there at least | 18:34 |
*** hamalq has joined #zuul | 18:35 | |
clarkb | https://opendev.org/openstack/openstack-ansible/src/branch/master/scripts/log-collect.sh#L193-L195 | 18:36 |
clarkb | it is possible the reason that osa is catching this is they are explicitly validating consistency | 18:36 |
clarkb | and it appears to happen across ubuntu and centos (different versions of both) and in different clouds | 18:37 |
clarkb | tobiash' suspicion of a problem with images seems more likely | 18:38 |
clarkb | seems to have been happening for at least the last week | 18:42 |
clarkb | looking at data more historically and not from the last 6 hours it does seem to correlate with kna1 more | 18:43 |
corvus | clarkb: is it correlated with a provider? | 18:44 |
clarkb | corvus: in the last 6 hours no, but from the 9th to the 10th of february yes its almost all kna1 | 18:44 |
clarkb | I wonder if OSA does explicit verification and they are also near disk limits on $clouds | 18:44 |
corvus | clarkb: i wish i could see what you see | 18:45 |
corvus | i'm really struggling to get logstash to produce something useful | 18:45 |
clarkb | corvus: in the query field put message:"Bad message" then set the time range in the top right to last 6 hours or last 7 days and clikc the refresh button in the top right next to the time range | 18:45 |
corvus | yeah, i see tons of osa jobs across all providers | 18:46 |
clarkb | yup and different images. If you expand to the last 7 days you'll see a spike around the 10th. If you click in the time graph and hold+drag you can select a dynamic window | 18:47 |
corvus | is there anyway not to see the osa jobs? | 18:47 |
clarkb | doing that is what gave me the kna correlates to the ~9th ~10th range | 18:47 |
corvus | cause if they're failing at that rate and they don't care, maybe that's a red herring | 18:47 |
clarkb | corvus: you can add AND NOT project:openstack/openstack-ansible AND NOT project:openstack/openstack-ansible-os_tempest and so on | 18:48 |
clarkb | to the query string | 18:48 |
corvus | it's unresponsive when i do that :( | 18:51 |
clarkb | hrm may need to quote the project string values | 18:52 |
clarkb | (I always forget the quoting rules but I think not quoting may make it more greedy or something) | 18:52 |
corvus | is there some way to say openstack-ansible* | 18:53 |
corvus | cause i dunno, it looks like we're just awash in false positives here. | 18:54 |
corvus | these jobs are all succeeding | 18:55 |
clarkb | yes there is globbing but it only works on some fields and is incredibly expensive iirc | 18:55 |
clarkb | I've been trying to get a globbed version to return over the last 12 hours and not succeeding | 18:55 |
clarkb | corvus: I think they are succeeding because the error is only hit at the end of the job when they validate and verify logs. I suspect that the same issue they are hitting is also hitting your example (which I am beginning to think may be a full root disk) | 18:56 |
corvus | this did hit in kna1 | 18:56 |
corvus | clarkb: https://706d78dc87df4baa7fa3-76ad47885070581f857a540cadaa6a6d.ssl.cf5.rackcdn.com/775880/1/gate/zuul-upload-image/55c548d/zuul-info/zuul-info.ubuntu-bionic.txt says 4.5G available? but that also looks like a 15G drive instead of 100 | 18:58 |
clarkb | we still run a cleanup play on all/most jobs that checks du right? | 18:58 |
clarkb | corvus: agreed that looks like a too small disk | 18:58 |
clarkb | my strong suspicion is osa catches this often because they fill disks regardless of the cloud and they also do journal verification at the end of jobs | 18:59 |
clarkb | other jobs like your example only hit this if the disk is abnormally small | 18:59 |
tobiash | clarkb, corvus: we had a similar issue a year ago when cloud-init was not finished yet sometimes | 18:59 |
corvus | clarkb: http://paste.openstack.org/show/802711/ df from cleanup of that build | 19:00 |
tobiash | clarkb, corvus: we fixed that by doing a cloud-init wait for finish as early as possible in the base job | 19:00 |
corvus | clarkb: /dev/vda1 15455656 9953244 4639704 69% / | 19:00 |
tobiash | we run "if command -v cloud-init > /dev/null; then cloud-init status -wl; fi" in our base job | 19:02 |
clarkb | tobiash: in our case we just use a growrootfs init script I think, but I suppose it is possible that isn't finishing before things start | 19:02 |
tobiash | yeah, so you could do something similar | 19:02 |
clarkb | corvus: that is small but also not completely full | 19:02 |
clarkb | corvus: I wonder if the underlying hypervisor has minimal disk left so it is saying "here use 15GB" and hoping you don't notice | 19:03 |
clarkb | but then when you try writes you're getting an error from all the way down to the host fs | 19:03 |
corvus | wow | 19:03 |
corvus | clarkb, tobiash, fungi: at this point, how confident are you that this is "something is weird with kna1 which has nothing to do with zuul or zuul-jobs"? | 19:04 |
clarkb | corvus: I would say probably 80-90% | 19:04 |
fungi | i've been fairly confident at least the tiny-disk problem is airship-only | 19:05 |
clarkb | I think there is a small chance the osa logging is catching an issue with our image builds as well | 19:05 |
clarkb | and the problem would be more widespread in that case | 19:05 |
fungi | maybe noonedeadpunk could work his magic to get someone there to look at the hosts if we can do a hostid correlation, unfortunately our only official contact for that account is the person at ericsson who is paying for it | 19:05 |
clarkb | we do record the hypervisor id and could see if it correlates to a small number of those in our kna examples at least | 19:06 |
tobiash | hostids are different on the three in zuul tenant I found | 19:06 |
fungi | yeah, then the host-specific theory might not hold water | 19:07 |
*** nils has quit IRC | 19:10 | |
*** jcapitao|off has joined #zuul | 19:23 | |
*** hasharDinner is now known as hashar | 19:27 | |
tobiash | corvus: to me it also looks like some infrastructure or environmental problem rather than zuul or zuul-jobs | 19:32 |
clarkb | tobiash:wee all three of those examples kna1 ? | 19:32 |
clarkb | *were | 19:32 |
tobiash | yes: | 19:33 |
tobiash | https://48b327e1e5d3af62395f-b9f7b4910ad16729218670d60db15b93.ssl.cf2.rackcdn.com/774650/10/check/tox-py38/c21cd59/zuul-info/inventory.yaml | 19:33 |
tobiash | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_996/774650/10/check/zuul-tox-docs/996dbcc/zuul-info/inventory.yaml | 19:33 |
tobiash | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_655/774650/10/check/zuul-jobs-tox-linters/655a450/zuul-info/inventory.yaml | 19:33 |
clarkb | in that case my 80-90% feels more like 90% | 19:33 |
openstackgerrit | Merged zuul/nodepool master: Add 4.0 release note https://review.opendev.org/c/zuul/nodepool/+/775885 | 19:34 |
tobiash | clarkb: and the most recent one as well: https://1bdefef51603346d84af-53302f911195502b1bb2d87ad2b01ca2.ssl.cf5.rackcdn.com/775885/1/gate/tox-pep8/3de257e/zuul-info/inventory.yaml | 19:35 |
tobiash | corvus: was there a reason for holding back the release note typo fix (775880)? | 19:37 |
tobiash | corvus: regarding the retry limit of that, attempts #1 and #2 failed due to the docker issue, the last one due to that (likely) kna1 issue | 19:38 |
tristanC | avass: i've added services configuration to zuul-nix, a local tenant is now configured with a periodic pipeline when the vm starts | 19:44 |
*** hashar has quit IRC | 19:45 | |
*** hashar has joined #zuul | 19:45 | |
avass | tristanC: cool, still getting comfortable with the package manager but I'm probably gonna try to set up something simple to try it out at volvo | 19:45 |
tristanC | my next step is to figure out how to get reliable performance measurements | 19:49 |
avass | tristanC: the executor fails to start btw, it complains that virtualenv is missing | 19:58 |
tristanC | avass: indeed, need to shortcircuit the venv installation | 20:00 |
tristanC | glad you gave it a try and that it worked for you too! :-) | 20:00 |
*** smyers has quit IRC | 20:02 | |
*** smyers has joined #zuul | 20:02 | |
corvus | tobiash: no i was just halfway through reapproving it | 20:07 |
tobiash | ah ok | 20:08 |
*** tjgresha has joined #zuul | 20:19 | |
tjgresha | looking for some help with the scheduler | 20:21 |
tjgresha | please | 20:22 |
fungi | what help do you need with it? | 20:24 |
tjgresha | throwing this error ERROR zuul.Scheduler: AttributeError: 'MergeJob' object has no attribute 'updated' | 20:24 |
tjgresha | running this with inside the docker .. | 20:25 |
fungi | what version? | 20:25 |
avass | tjgresha: that usually means something went wrong with a merge operation in the executor/merger | 20:26 |
tjgresha | docker version or container version? | 20:27 |
fungi | i was asking what version (release) of zuul you're seeing this with | 20:28 |
fungi | wow, this should probably be a faq, i just grepped case-insensitive for 'mergejob.*updated' in my channel log and see so many people reporting the same error over the years | 20:30 |
fungi | dating back to 2017-11-23 | 20:30 |
tjgresha | 3.19.2.dev367 | 20:31 |
fungi | no fewer than 10 people mentioned it in that timeframe | 20:31 |
fungi | i'll see what the causes ended up being | 20:32 |
avass | fungi: from what I remember people have suggested better error messages for that a couple of times :) | 20:32 |
fungi | yeah, but then i'd never be able to find it in my channel history! ;) | 20:32 |
avass | tjgresha: Can you check for errors in the merger/executor? my bet is that they fail to clone a repo or merge a change or something like that | 20:33 |
fungi | according to tobiash it "usually it means that the merger or executor couldn't fetch that repo" | 20:33 |
tjgresha | ok | 20:34 |
tjgresha | one sec | 20:34 |
fungi | in one case it was caused by a failure to connect to zk (startup race, bad cert?) | 20:35 |
*** hashar has quit IRC | 20:35 | |
fungi | also one time tobiash said "that can occur if there is no merger in the system or the merger hits a timeout when cloning a large repo" | 20:35 |
avass | oh yeah that too | 20:36 |
tjgresha | restarted -- letting scheduler do its thing for a minute | 20:36 |
fungi | at least in the past it also seemed to happen if you tried to start the scheduler long before starting an executor/merger (if you start them in the other order, or around the same time, then it should be fine) | 20:38 |
tjgresha | yeah it can't pull a repo | 20:42 |
tjgresha | unable to determine the version | 20:42 |
fungi | in this case the zuul version is probably irrelevant, i was just asking in case it wound up being a signature for a bug fixed in a recent release or something | 20:44 |
tjgresha | namely - for some reason, all of a sudden, it will not connect to our internal gerrit (config and jobs) | 20:46 |
avass | tjgresha: what error do you get? | 20:47 |
tjgresha | ERROR zuul.GerritConnection: Unable to determine remote Gerrit version | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: Traceback (most recent call last): | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: File "/usr/local/lib/python3.8/site-packages/zuul/driver/gerrit/gerritconnection.py", line 1456, in onLoad | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: self._getRemoteVersion() | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: File "/usr/local/lib/python3.8/site-packages/zuul/driver/gerrit/gerritconnection.py", line 1426, in _getRemoteVersion | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: version = self.get('config/server/version') | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: File "/usr/local/lib/python3.8/site-packages/zuul/driver/gerrit/gerritconnection.py", line 635, in get | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: raise Exception("Received response %s" % (r.status_code,)) | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: Exception: Received response 401 | 20:48 |
avass | please use a paste service in the future, like: http://paste.openstack.org/ :) | 20:48 |
tjgresha | sorry | 20:49 |
tjgresha | don't use IRC too much =) | 20:49 |
fungi | okay, so the scheduler is having trouble connecting to your gerrit | 20:49 |
tjgresha | yes - looks like this is the issue to me | 20:50 |
corvus | tjgresha: make sure https://zuul-ci.org/docs/zuul/reference/drivers/gerrit.html#attr-%3Cgerrit%20connection%3E.auth_type is set correctly | 20:50 |
fungi | 401 unauthorized makes me suspect you're using the wrong credentials or | 20:50 |
fungi | that, what corvus just said | 20:50 |
fungi | newer gerrit versions require basic auth instead of digest auth, due to changes in how they store the password internally | 20:51 |
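For reference, a hedged example of the connection stanza this maps to in zuul.conf; the hostname, account and key path are placeholders, and the password is the HTTP password Gerrit generates for the account, not its login password:

```ini
[connection gerrit]
driver=gerrit
server=gerrit.example.com
user=zuul
sshkey=/var/lib/zuul/.ssh/id_rsa
password=<generated-http-password>
auth_type=basic
```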
tjgresha | so do i need to be auth'd to the server at all to pull the version ? | 20:52 |
tjgresha | answered my q -- yes ya do! | 20:53 |
corvus | that's intentional so problems are discovered early | 20:53 |
tjgresha | ok thanks all - | 20:56 |
*** tjgresha has quit IRC | 21:21 | |
openstackgerrit | Merged zuul/zuul master: Fix build page releasenote typo https://review.opendev.org/c/zuul/zuul/+/775880 | 21:26 |
tobiash | yeah it's still on my todo list to improve that error message ;) | 21:56 |
avass | maybe the error message should be "please grep for `object has no attribute 'updated'` in the irc logs" ;) | 22:01 |
avass | bonus points if it does it automatically | 22:01 |
*** jcapitao|off has quit IRC | 22:03 | |
*** saneax has quit IRC | 22:03 | |
*** vishalmanchanda has quit IRC | 22:22 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: ansible: ensure we can delete ansible files https://review.opendev.org/c/zuul/zuul/+/775943 | 22:32 |
*** arxcruz|rover has quit IRC | 22:46 | |
*** arxcruz has joined #zuul | 22:47 | |
tristanC | avass: it took longer than planned; turns out when your system doesn't have a /bin, random things fail unexpectedly, but with the current zuul-nix:head, job results are now green in the database :-) | 23:07 |
*** tosky has quit IRC | 23:57 |