*** pleia2 has quit IRC | 02:43 | |
*** pleia2 has joined #openstack-infra-incident | 02:50 | |
*** tumbarka has quit IRC | 07:02 | |
-openstackstatus- NOTICE: The CI system will be offline starting at 11:00 UTC (in just under an hour) for Zuul v3 rollout: http://lists.openstack.org/pipermail/openstack-dev/2017-October/123337.html | 10:08 | |
-openstackstatus- NOTICE: Due to unrelated emergencies, the Zuul v3 rollout has not started yet; stay tuned for further updates | 13:05 | |
*** rosmaita has joined #openstack-infra-incident | 15:03 | |
jeblair | fungi, clarkb, mordred: looking into the random erroneous merge conflict error, i found some interesting information. | 17:04 |
fungi | all ears | 17:04 |
jeblair | when i warmed the caches on the mergers and executors, i cloned from git.o.o | 17:04 |
jeblair | that left the origin as git.o.o | 17:05 |
jeblair | if *zuul* clones a repo for its merger, it clones from gerrit, and leaves the origin as gerrit | 17:05 |
jeblair | our random merge failures are because we're pulling changes from the git mirrors before they have updated | 17:05 |
fungi | ahh. so the warming process should have used gerrit as the origin? | 17:05 |
jeblair | this also explains why we're seeing git timeouts only in v3 | 17:05 |
jeblair | fungi: yes, or at least switched the origin after cloning | 17:06 |
fungi | but yes, i agree, that does provide a great explanation for the git timeout situation | 17:06 |
clarkb | jeblair: you think the timeout is because it tries to fetch a non-existent ref from git.o.o? | 17:06 |
AJaeger | jeblair: good detective work! | 17:06 |
jeblair | clarkb: i don't know what is causing the timeout, but it's https to a different server, versus ssh to gerrit | 17:06 |
jeblair | clarkb: so there's *lots* of variables that are different :) | 17:07 |
fungi | well, at least the work around improving git remote operation robustness isn't a waste | 17:07 |
clarkb | got it | 17:07 |
jeblair | clarkb: the *merge failure* i tracked down though was because the ref had not updated yet | 17:07 |
jeblair | fungi: oh, yeah, i still think that's good stuff | 17:07 |
jeblair | anyway, i think we have two choices here: | 17:07 |
jeblair | 1) update the origins to review.o.o | 17:08 |
jeblair | 2) make using git.o.o more reliable | 17:08 |
fungi | i'm in favor of #1 for now... that's more or less what the mergers for v2 were doing right? | 17:09 |
clarkb | gerrit does emit replication events now that could possibly be used, but we'd have to have some logic around "are all the git backends updated" which may get gross in zuul (which shouldn't really need to know the dirty details of replication) | 17:09 |
jeblair | 1) is easy, and returns us to v2 status-quo -- mostly. the downside to that is that v3 is much more intensive about fetching stuff from remotes (it merges things way more than necessary) so we are likely to see increased load on gerrit. | 17:09 |
fungi | i mean, having the option of offloading that to git.o.o would be nice, but not a regression over v2 | 17:09 |
jeblair | 2) clarkb just pointed out what would be involved in 2. it's some non-straightforward zuul coding. | 17:09 |
fungi | i agree the increased gerrit traffic is something to keep an eye on, but not necessarily a problem | 17:10 |
fungi | as an aside, how many sessions to the ssh api are we likely to open in parallel from a single ip address? | 17:11 |
fungi | ssh api and/or ssh jgit | 17:11 |
clarkb | I think just one because we'll only grab a single gearman job at a time? | 17:11 |
jeblair | there are 2 things that make it inefficient -- we merge once for every check pipeline, and then we merge once for every build. both of those things can be improved, but not easily. though, having fewer 'check' pipelines will help. v3 has 3 now. | 17:12 |
jeblair | clarkb, fungi: yes, one per ip. | 17:12 |
fungi | just wanting to keep in mind that we're currently testing a connlimit protection on review.o.o to help keep runaway ci systems under 100 concurrent connections, but sounds like this wouldn't run afoul of it anyway | 17:13 |
jeblair | so it's a total of 14 from v3 in our current setup. likely 18 when we're fully moved over. compared to 4 now (but 8 when we were at full strength v2) | 17:13 |
fungi | so around a 2x bump | 17:13 |
fungi | seems safe enough | 17:14 |
*** efried has joined #openstack-infra-incident | 17:15 | |
mordred | it seems to me like trying 1) while in our current state (we won't see gate merge load - but with the check pipelines we should still see a lot) | 17:15 |
efried | jeblair o/ Sorry, didn't even know this channel existed | 17:15 |
jeblair | efried: no i'm sorry; i should have mentioned i was switching | 17:15 |
mordred | would give us an idea of whether or not it's workable to do that | 17:16 |
clarkb | mordred: ++ | 17:16 |
jeblair | mordred: ya good point | 17:16 |
mordred | however - it also seems like since we have a git farm, even if 1 is workable we still may want to consider putting 2 on the backlog | 17:16 |
jeblair | mordred: we'll actually see the full load | 17:16 |
mordred | jeblair: oh - good point | 17:16 |
mordred | that's great | 17:16 |
jeblair | ya i didn't even think about that till you mentioned it | 17:16 |
mordred | I think the results of 1 will tell us how urgent 2 is | 17:17 |
jeblair | this scales with the size of the zuul cluster, not anything else (as long as it's not idle) | 17:17 |
jeblair | okay, so i'll just go ahead and update all the origins | 17:17 |
jeblair | and, um, if gerrit stops then i'll stop zuulv3 :) | 17:17 |
efried | Cool beans guys, thanks for tracking this down! | 17:18 |
mordred | jeblair: cool | 17:19 |
mordred | jeblair: if gerrit stops, maybe just update the origins all back while we re-group :) | 17:19 |
fungi | plan sounds good | 17:21 |
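For context, "update all the origins" amounts to repointing each cached repo on the mergers and executors from the git.openstack.org mirrors back at Gerrit. A minimal sketch, assuming a cache path and remote URL layout that are not stated in the log:

    # Repoint every cached repo at Gerrit; path and URL are illustrative assumptions.
    MERGER_GIT_DIR=/var/lib/zuul/git                     # assumed merger cache location
    GERRIT_BASE=ssh://zuul@review.openstack.org:29418    # assumed Gerrit SSH URL

    for repo in "$MERGER_GIT_DIR"/*/*; do
        [ -d "$repo/.git" ] || continue
        project=${repo#"$MERGER_GIT_DIR"/}               # e.g. openstack/nova
        git -C "$repo" remote set-url origin "$GERRIT_BASE/$project"
    done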
AJaeger | efried: the channel is published on eavesdrop.openstack.org | 17:30 |
efried | AJaeger Yup, thanks, I'm caught up. | 17:30 |
*** rosmaita has quit IRC | 17:50 | |
*** rosmaita has joined #openstack-infra-incident | 17:51 | |
*** efried is now known as efried_nomnom | 18:17 | |
dmsimard | fungi: So just to wrap something up that worried me earlier, because I'm involved in this to some extent... As far as I can understand, "rh1 closing next week" is a misunderstanding. It is not closing next week. Soon, but not next week. | 18:19 |
dmsimard | fungi: Soon, like, a matter of weeks, not months. | 18:19 |
dmsimard | The bulk of the work is already done and many different jobs have already been running off of RDO's Zuul | 18:20 |
fungi | dmsimard: yeah, the irc discussion in #-dev more or less confirmed that as well, but "soon" at least | 18:34 |
*** efried_nomnom is now known as efried | 18:58 | |
* mordred waves to fungi and jeblair | 20:42 | |
jeblair | okay yeah, i thought we'd run into the inode thing before | 20:43 |
jeblair | i guess we reformatted? | 20:43 |
fungi | so the manpage for mkfs.ext4 (shared by ext2 and ext3) says this about the -i bytes-per-inode value: "Be warned that it is not possible to change this ratio on a filesystem after it is created, so be careful deciding the correct value for this parameter. Note that resizing a filesystem changes the number of inodes to maintain this ratio." | 20:43 |
jeblair | i checked the infra status log but did not find anything :( | 20:43 |
*** Shrews has joined #openstack-infra-incident | 20:44 | |
fungi | so if we grew the size of the filesystem, we'd get more inodes, but short of that... | 20:44 |
jeblair | yeah, if only we could, right? | 20:45 |
fungi | jeblair: looking back through my infra channel log now. found an incident from 2013-12-08 | 20:45 |
fungi | oh, docs-draft in 2014-03-03 | 20:46 |
fungi | zm03 filled to max inodes for its rootfs on 2015-10-12 | 20:47 |
fungi | ran out of inodes for /home/gerrit2 on review.o.o on 2016-02-13 | 20:50 |
fungi | concluded at the time that it's unfixable for the fs, and so moved everything to a new cinder volume | 20:51 |
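Background for the exchange above: ext4 fixes the inode count at mkfs time via the bytes-per-inode ratio, and the count only grows if the filesystem itself grows. A quick way to inspect it, using the main-logs device discussed later (the grep patterns assume standard tune2fs output):

    df -i /srv/static/logs     # inode usage alongside block usage
    sudo tune2fs -l /dev/mapper/main-logs | grep -E 'Inode count|Block count|Inode size'
    # The ratio is only settable at creation, e.g. mkfs.ext4 -i 4096 <device>;
    # growing the LV and running resize2fs adds inodes in proportion.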
clarkb | tripleo is still logging all of /etc in places last I checked | 20:52 |
clarkb | do we think its stuff like that using all the inodes? | 20:52 |
clarkb | also ara is tons of little files | 20:52 |
fungi | http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-02-13.log.html#t2016-02-13T14:47:22 | 20:53 |
mordred | the find is doing the gzip pass as well as the prune pass ... should we perhaps do a find pass that doesn't do gzipping ... and maybe add something to prune crazy /etc dirs? | 20:53 |
dmsimard | clarkb: yes, an ara static report is not large but indeed lots of smaller files | 20:54 |
dmsimard | That's why I would like to explore other means of providing the report | 20:54 |
jeblair | fungi: thanks, that archeology helps :) | 20:54 |
fungi | yeah, in past emergencies i've made a version of the log maintenance script which omits the random delay, the docs-draft bit, and drops the compression stanza so it's just a deleter | 20:55 |
fungi | and usually significantly dropped the retention timeframe while at it | 20:55 |
mordred | fungi: do you happen to have any of those versions around still? | 20:55 |
fungi | mordred: static.o.o:~fungi/log_archive_maintenance.sh | 20:56 |
mordred | yah - ~fungi/log_archive_maintenance.sh exists | 20:56 |
fungi | heh, it's like i'm predictable | 20:56 |
mordred | :) | 20:56 |
mordred | infra-root: shall we stop the current cleaner script and run fungi's other version? or do we want to investigate other options? | 20:58 |
fungi | shall i #status alert something for now? | 20:58 |
fungi | mordred: yeah, i would stop it | 20:58 |
jeblair | mordred: wfm | 20:58 |
jeblair | fungi: ++ | 20:58 |
fungi | running more than one at a time is just more i/o traffic slowing both down | 20:58 |
mordred | jeblair, fungi: ok. do I need to do anything special to stop it other than kill? | 20:58 |
fungi | also make sure something else like an mlocate.db update isn't running and having similar performance impact | 20:59 |
ianw | can someone give a 2 sentence summary of what's wrong for those of us who might not have been awake (i.e. me ;) | 20:59 |
jeblair | ianw: logs is full | 20:59 |
fungi | mordred: you can just kill the parent shell process and then kill the find | 20:59 |
fungi | and that way it should wrap up without trying to run any subsequent commands | 20:59 |
mordred | fungi: cool | 20:59 |
*** ChanServ changes topic to "logs volume is full" | 20:59 | |
fungi | and it should release the flock on its own that way | 21:00 |
mordred | fungi: I'm going to start a screen session called repair_logs | 21:00 |
fungi | good idea | 21:00 |
pabelanger | speaking of full volumes, this is from afs01.dfw.o.o: /dev/mapper/main-vicepa 3.0T 3.0T 30G 100% /vicepa | 21:00 |
jeblair | mordred, fungi: and then maybe grab the flock in the custom script? | 21:00 |
jeblair | pabelanger: want to throw another volume at it? | 21:00 |
mordred | jeblair: it grabs the flock | 21:00 |
fungi | jeblair: the custom script already flocks the same file | 21:00 |
jeblair | pabelanger: we can expand vicepa | 21:00 |
jeblair | mordred, fungi: cool | 21:00 |
pabelanger | jeblair: ya, let me see if there is anything to clean up first | 21:01 |
jeblair | pabelanger: just the usual lvm stuff | 21:01 |
jeblair | pabelanger: oh, maybe the docs backup volume? | 21:01 |
jeblair | old-docs or whatever | 21:01 |
mordred | fungi: bash script killed - next kill the find, not the flock? | 21:01 |
ianw | pabelanger: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-10-09.log.html#t2017-10-09T06:15:55 ... that didn't take long :/ | 21:01 |
fungi | mordred: yeah | 21:01 |
mordred | fungi: and let flock just exist | 21:01 |
pabelanger | jeblair: k, let me see | 21:01 |
mordred | k | 21:01 |
fungi | the flock should terminate | 21:01 |
fungi | on its own | 21:01 |
jeblair | ianw, pabelanger: i read that we still have 30G, right? | 21:01 |
jeblair | so, i mean, probably several more hours! | 21:02 |
pabelanger | Yah :) | 21:02 |
mordred | ok. I have started ~fungi/log_archive_maintenance.sh | 21:02 |
pabelanger | jeblair: okay, so docs-old can be deleted? | 21:02 |
ianw | jeblair: yeah, so it was 45gb when i posted, and 30gb now | 21:02 |
jeblair | pabelanger: i think so, but let's check with AJaeger or other docs folks first. i think that would only give us 10G anyway. | 21:03 |
fungi | i checked iotop, and it looks like that find doesn't have major competition (unless we want to also stop apache and lock down the jenkins/zuul accounts) | 21:03 |
ianw | pabelanger / jeblair : i'll take an action item to add a volume to that, after current crisis | 21:03 |
pabelanger | ianw: wfm | 21:03 |
jeblair | ianw: ack, thx | 21:03 |
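For reference, "the usual lvm stuff" for growing /vicepa would look roughly like this once a new cinder volume is attached (the VG/LV names follow /dev/mapper/main-vicepa; the device name and size are assumptions):

    sudo pvcreate /dev/xvdc                 # new cinder volume, device name assumed
    sudo vgextend main /dev/xvdc
    sudo lvextend -L +1T /dev/main/vicepa   # size illustrative
    sudo resize2fs /dev/main/vicepa         # if ext4; xfs_growfs /vicepa for xfs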
fungi | but yeah, if the last find/delete/compress pass ran for 11 days and counting, i expect we've significantly increased inode count beyond just normal block count increases | 21:04 |
pabelanger | I'm going to add mirror-update to emergency, and first manually apply https://review.openstack.org/502316/ | 21:04 |
pabelanger | that will free up some room with the removal of opensuse-42.2 | 21:04 |
jeblair | pabelanger, ianw: i went ahead and asked in #openstack-doc, but ajaeger is gone for the day so we shouldn't expect a complete answer about docs-old until tomorrow | 21:05 |
pabelanger | k | 21:05 |
ianw | ok, yeah one less distro will help | 21:06 |
jeblair | pabelanger, ianw: dhellman asks that we *do* keep docs-old around for a while longer. so let's not do anything with that volume, and just expand via lvm. | 21:08 |
pabelanger | ++ | 21:09 |
dhellmann | ideally we'll be able to delete docs-old by the end of the cycle. I've made a note to coordinate with you all about that | 21:09 |
pabelanger | /dev/mapper/main-vicepa 3.0T 2.9T 80G 98% /vicepa | 21:09 |
pabelanger | back up to 80GB with removal of opensuse 42.2 | 21:09 |
pabelanger | should be enough room until ianw gets the other volume | 21:09 |
ianw | excellent, that's breathing room | 21:09 |
pabelanger | I'll clean up 502316 and get that approved | 21:10 |
mordred | /dev/mapper/main-logs 768M 768M 0 100% /srv/static/logs | 21:10 |
mordred | we are not removing them faster than we are making them - at least not yet | 21:10 |
pabelanger | Yah, last time I ran it, took a little bit to get ahead of the curve | 21:11 |
mordred | welp - nothing I like more than watching a find command sit there and churn :) | 21:11 |
clarkb | vroom vroom | 21:12 |
jeblair | mordred: on the plus side -- it doesn't look like inodes are a problem now! | 21:12 |
jeblair | dhellmann: thanks! | 21:13 |
pabelanger | mordred: I see 95% now :D | 21:14 |
mordred | jeblair: oh - that was df -hi | 21:15 |
jeblair | mordred: oh i never thought to use -h with -i :) | 21:18 |
jeblair | mordred: though, clearly 768M should have clued me in | 21:18 |
pabelanger | we're trying to clean up old logs, I see some movement | 21:26 |
mordred | infra-root: I'm finally seeing non-zero numbers of free inodes | 21:59 |
mordred | /dev/mapper/main-logs 768M 768M 11K 100% /srv/static/logs | 21:59 |
mordred | well - that was short lived | 21:59 |
mordred | seriously - something just ate 11k inodes | 21:59 |
clarkb | we should probably count the ara inode count and tripleo logs inode count | 21:59 |
clarkb | as I think it likely those two are at least partially to blame | 22:00 |
pabelanger | yah, ps show tripleo jobs currently using logs folder | 22:01 |
fungi | should be able to adapt some of our earlier analysis scripts to do average inodes per job et cetera | 22:01 |
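A rough way to get those per-build counts, using the change directory pasted below (the jobname/build-id glob is an assumption about the layout):

    # File (and therefore roughly inode) count per build directory, largest first
    for build in /srv/static/logs/27/505827/14/gate/*/*/; do
        printf '%8d %s\n' "$(sudo find "$build" | wc -l)" "$build"
    done | sort -rn | head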
pabelanger | /srv/static/logs/27/505827/14/gate/gate-tripleo-ci-centos-7-containers-multinode/8bd0f54 | 22:01 |
mordred | mordred@static:/srv/static/logs/27/505827/14/gate/gate-tripleo-ci-centos-7-containers-multinode/8bd0f54$ sudo find . | wc -l | 22:02 |
mordred | 21374 | 22:02 |
mordred | so that's 21k files per job | 22:02 |
mordred | so ... | 22:03 |
mordred | ./logs/undercloud/tmp/ansible/lib64/python2.7/site-packages/ansible/modules/network/f5/.~tmp~ | 22:03 |
dmsimard | mordred: in their defense, there's probably like 3 ara reports in there | 22:03 |
mordred | nope - that's not it | 22:03 |
mordred | an ara report is 445 files | 22:03 |
dmsimard | mordred: 1) devstack gate 2) from oooq-ara and from zuul v3 | 22:04 |
dmsimard | okay | 22:04 |
pabelanger | Um | 22:04 |
pabelanger | /srv/static/logs/27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp | 22:04 |
pabelanger | du -h | 22:04 |
pabelanger | that is glorious | 22:04 |
pabelanger | they are copying back ansible tmp files | 22:04 |
mordred | yah | 22:05 |
mordred | tons of things like logs/subnode-2/etc/pki/ca-trust/source/.~tmp~ | 22:05 |
pabelanger | yup | 22:05 |
pabelanger | I have to run, but can help out when we I get back | 22:05 |
pabelanger | EmilienM: ^ might want to prepare for incoming logging work again | 22:06 |
dmsimard | What I do know, and I saw that recently | 22:06 |
EmilienM | hi | 22:06 |
dmsimard | is that they glob the entirety of /var/log/** | 22:06 |
* EmilienM reads context | 22:06 | |
mordred | EmilienM: we're out of inodes on logs.o.o | 22:06 |
pabelanger | http://logs.openstack.org//27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp | 22:06 |
EmilienM | oh dear | 22:06 |
pabelanger | contains a ton of ansible tmp files | 22:06 |
EmilienM | who added tmp dir | 22:07 |
EmilienM | i noticed that recently, let me look | 22:07 |
pabelanger | okay, have to run now. | 22:07 |
pabelanger | bbiab | 22:07 |
mordred | we should put in a filter to not copy over anything that has '.~tmp~' in the path | 22:07 |
*** weshay|ruck has joined #openstack-infra-incident | 22:07 | |
weshay|ruck | hello | 22:07 |
mordred | $ sudo find logs | grep -v '.~tmp~' | wc -l | 22:08 |
mordred | 3369 | 22:08 |
mordred | $ sudo find logs | grep '.~tmp~' | wc -l | 22:08 |
mordred | 18004 | 22:08 |
weshay|ruck | did tripleo fill up the log server? | 22:08 |
dmsimard | weshay|ruck: can we cherry-pick what we want from /var/log instead of globbing everything ? https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/collect-logs/defaults/main.yml#L5 | 22:08 |
EmilienM | weshay|ruck: since when we have tmp? http://logs.openstack.org//27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp | 22:08 |
mordred | 18k of the files are things with .~tmp~ in the path | 22:08 |
EmilienM | weshay|ruck: out of inodes on logs.o.o | 22:08 |
mordred | so if we can just stop uploading those I think we'll be in GREAT shape | 22:08 |
dmsimard | weshay|ruck: tripleo did not single handedly fill up the log server, but contributes to the problem | 22:08 |
weshay|ruck | we cherry pick quite a bit afaik | 22:08 |
weshay|ruck | but can look further | 22:08 |
mordred | srrsly - just filter '.~tmp~' and I think we'll be golden | 22:09 |
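However the job logs are staged for upload, a filter along these lines would drop ansible's scratch directories; rsync here is an illustration of the pattern, not the actual collect-logs implementation:

    # Skip any path component named .~tmp~ while copying collected logs
    rsync -a --exclude='.~tmp~' collected-logs/ staged-logs/
    # same idea when tarring: tar --exclude='*.~tmp~*' -czf logs.tar.gz collected-logs/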
dmsimard | weshay|ruck: I think the ansible tmpdirs is the most important part to fix | 22:09 |
dmsimard | weshay|ruck: there's no value in logging those directories at all http://logs.openstack.org/27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp/ | 22:10 |
weshay|ruck | that's odd.. I don't remember tmp being there | 22:10 |
weshay|ruck | k.. sec I'll add it to our explicit exclude | 22:10 |
weshay|ruck | that's new though | 22:10 |
EmilienM | yeah | 22:11 |
EmilienM | I'm still going through git log now | 22:11 |
mordred | thing is- apache won't even show the files with .~tmp~ in them | 22:12 |
mordred | http://logs.openstack.org/27/505827/14/gate/gate-tripleo-ci-centos-7-containers-multinode/8bd0f54/logs/undercloud/tmp/ansible/lib64/python2.7/encodings/ is an example ... | 22:12 |
EmilienM | is that https://review.openstack.org/#/c/483867/4/toci-quickstart/config/collect-logs.yml ? | 22:12 |
mordred | there's 210 files in that dir | 22:12 |
dmsimard | EmilienM: that sounds like a good suspect | 22:12 |
EmilienM | yeah i'm not sure, let me keep digging now | 22:13 |
dmsimard | "Collect the files under /tmp/ansible, which are useful to debug mistral executions. This dir contains the generated inventories, the generated playbooks, ssh_keys, etc." | 22:13 |
EmilienM | right | 22:14 |
EmilienM | if you read comment history we were not super happy with this patch | 22:14 |
EmilienM | we can easily revert it and fast approve | 22:14 |
EmilienM | I'm just making sure it's this one | 22:14 |
EmilienM | mordred, dmsimard, weshay|ruck : ok confirmed. I'm reverting and merging | 22:15 |
weshay|ruck | ya.. you guys found the same commit | 22:16 |
EmilienM | https://review.openstack.org/#/c/511347/ | 22:16 |
dmsimard | EmilienM: flaper mentions the possibility of making mistral write the interesting things elsewhere so it looks like it can be workaround' and is not critical | 22:16 |
weshay|ruck | dmsimard, we need to remove it.. I see it there twice | 22:16 |
EmilienM | dmsimard: it's not critical at all | 22:16 |
EmilienM | weshay|ruck: twice? | 22:17 |
EmilienM | mordred: do whatever you can to promote https://review.openstack.org/#/c/511347 if possible | 22:17 |
weshay|ruck | well /tmp/*.yml /tmp/*.yaml AND /tmp/ansible | 22:17 |
EmilienM | weshay|ruck: /tmp/*.yml /tmp/*.yaml would be another patch | 22:18 |
weshay|ruck | ok | 22:18 |
EmilienM | to keep proper git history I prefer a revert + separate patch for /tmp/*.yml /tmp/*.yaml | 22:18 |
EmilienM | weshay|ruck: are you able to send patch for /tmp/*.yml /tmp/*.yaml ? otherwise I'll look when I can, I'm in mtg now | 22:19 |
dmsimard | /tmp/*.yml and /tmp/*.yaml is certainly not as big of a deal | 22:19 |
weshay|ruck | EmilienM, ya.. I'll post one | 22:20 |
EmilienM | k | 22:20 |
EmilienM | TBH I don't see why we do that | 22:20 |
EmilienM | like why do we collect /tmp... | 22:20 |
EmilienM | it's not in my books :) | 22:20 |
pabelanger | made it up to 60k free inodes, something ate it up | 23:38 |
clarkb | could be tripleo jobs that started before the fix merged | 23:38 |
pabelanger | ya, | 23:39 |
clarkb | there is likely going to be a period of time where that happens | 23:39 |
clarkb | since those jobs take up to ~3 hours | 23:39 |
pabelanger | I've found a few large tripleo patches and manually purging the directories | 23:39 |
pabelanger | seems to be helping, up to 108K now | 23:39 |
pabelanger | /dev/mapper/main-logs 768M 768M 205K 100% /srv/static/logs | 23:43 |
pabelanger | heading in the right direction | 23:43 |