*** rlandy|bbl is now known as rlandy | 02:58 | |
*** rlandy has quit IRC | 03:31 | |
*** rosmaita has quit IRC | 04:06 | |
*** panda is now known as panda|lunch | 12:31 | |
*** rosmaita has joined #openstack-infra-incident | 12:56 | |
*** panda|lunch is now known as panda | 13:20 | |
*** rlandy has joined #openstack-infra-incident | 13:37 | |
*** weshay|rover is now known as weshay|ruck | 13:58 | |
*** ChanServ sets mode: +o dmsimard | 14:19 | |
*** frickler has joined #openstack-infra-incident | 14:19 | |
*** dmsimard changes topic to "logs.openstack.org filesystem is read only | OpenStack Infra team incident handling channel" | 14:19 | |
*** dmsimard sets mode: -o dmsimard | 14:19 | |
*** efried has joined #openstack-infra-incident | 14:19 | |
dmsimard | infra-root: Running a fsck *now* would be quite expensive for the gate which is already quite backlogged | 14:20 |
dmsimard | Do we have any ability to spin off a new logs.o.o server somehow, FSCK the previous one in the background and do redirects as appropriate ? I had a plan for something like this to phase out an old RDO logserver but I never got around to doing it | 14:21 |
pabelanger | likely not, we have 14 1TB volumes attached to logs.o.o | 14:21 |
dmsimard | pabelanger: right -- do we have the ability to do anything better than that ? | 14:22 |
pabelanger | at this moment, no | 14:23 |
dmsimard | I know it's kind of a stupid question because if there was anything better we would probably be doing it already | 14:23 |
pabelanger | we've talked about moving logs.o.o to vexxhost | 14:23 |
pabelanger | [Thu Jan 25 13:37:15 2018] end_request: I/O error, dev xvdh, sector 1069712576 | 14:23 |
dmsimard | pabelanger: but for example, does any other provider (OVH, Vexxhost) have bigger volumes | 14:23 |
pabelanger | it does seem limited to xvdh right now | 14:23 |
dmsimard | yeah any one of the volumes being even remotely unavailable will thrash the entire LVM | 14:23 |
AJaeger | pabelanger: unmount it, log to local file system? | 14:24 |
AJaeger | or attach temporary disk, log to it, fix big volume, reattach and move logs over? | 14:25 |
-openstackstatus- NOTICE: We're currently experiencing issues with the logs.openstack.org server which will result in POST_FAILURE for jobs, please stand by and don't needlessly recheck jobs while we troubleshoot the problem. | 14:25 | |
pabelanger | right, looking for disk space now | 14:25 |
pabelanger | I'm going to remove jenkins SSH key for now to stop jobs from uploading | 14:27 |
pabelanger | also going to put logs.o.o into emergency | 14:27 |
dmsimard | I will also put max-servers to zero everywhere across the board so we don't fail all the jobs needlessly | 14:28 |
pabelanger | no, shouldn't need that | 14:29 |
fungi | i'm not really here, but if you just unmount it there's ~100gb free in the parent volume to get by on until you can mount it again and then (optionally) move those accumulated logs back in from a secondary mount | 14:29 |
dmsimard | pabelanger: why ? we're starting jobs we know will fail | 14:30 |
fungi | e.g., logs end up accumulating in the /srv/static volume instead of the /srv/static/logs volume | 14:30 |
fungi | which is already its own volume with nearly 100gb free | 14:31 |
pabelanger | fungi: okay, can do that | 14:31 |
fungi | (hopefully that volume isn't spanning the impacted block device) | 14:31 |
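What fungi is describing, as a rough shell sketch (paths from the discussion above; illustrative, not a transcript of what was actually run):

    # drop the broken mount so new uploads land on the parent volume instead
    umount /srv/static/logs          # may need 'umount -l' if something still holds it open
    # nothing else changes: jobs keep writing to /srv/static/logs/..., which is now just a
    # directory on the /srv/static volume (~100GB free) rather than on the failed LV
    df -h /srv/static                # watch how quickly the parent volume fills up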
* fungi has to disappear to another meatspace meeting now | 14:32 | |
dmsimard | nl02.o.o: -bash: cannot create temp file for here-document: No space left on device | 14:34 |
dmsimard | looking.. | 14:34 |
dmsimard | wow | 14:35 |
dmsimard | # du -sh /var/log/nodepool/ | 14:35 |
pabelanger | /srv/static/logs: root kernel mount /srv/static/logs | 14:35 |
dmsimard | 37G    /var/log/nodepool/ | 14:35 |
pabelanger | looks like kernel is holding mount | 14:35 |
pabelanger | not sure how best to force the umount | 14:35 |
*** corvus has joined #openstack-infra-incident | 14:36 | |
dmsimard | pabelanger: looking | 14:37 |
pabelanger | umount -l worked | 14:37 |
pabelanger | ssh access enabled again | 14:38 |
pabelanger | okay, think we are accepting logs again | 14:38 |
dmsimard | I'm fixing nl02 | 14:38 |
mordred | pabelanger, dmsimard: morning! oh - just missed the fun didn't I? | 14:39 |
pabelanger | mordred: almost, I think we need to fsck our logs volume | 14:39 |
pabelanger | and maybe change out a cinder volume? | 14:39 |
pabelanger | not sure yet | 14:39 |
pabelanger | but /srv/static is our new logs folder | 14:40 |
dmsimard | pabelanger: that's probably going to let log uploads work but likely the vhost isn't looking there | 14:40 |
pabelanger | that is fine | 14:40 |
dmsimard | oh actually it does look there? I dunno http://logs.openstack.org/ seems to be working | 14:40 |
corvus | the vhost is fine | 14:40 |
corvus | logs are being written to the same location, it's just on a different volume now | 14:41 |
dmsimard | ok I free'd up 37GB on nl02.o.o | 14:41 |
corvus | dmsimard: what was taking up space? | 14:41 |
dmsimard | corvus: 37GB worth of logs | 14:41 |
dmsimard | I had to delete some of them because we were really at 100%, and I gzipped the most recent one | 14:42 |
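The nl02 cleanup was roughly this shape (file names are assumptions; the rotated but uncompressed launcher debug logs were the bulk of the 37G):

    cd /var/log/nodepool
    du -sh .                                        # 37G, filesystem at 100%
    rm launcher-debug.log.4 launcher-debug.log.3    # hypothetical names: drop the oldest rotations
    gzip launcher-debug.log.1                       # compress the most recent rotation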
pabelanger | okay, how would we like to handle readonly volume of /srv/static/logs? I've not done the repair of this before | 14:42 |
dmsimard | corvus: logs were rotated but not gzipped -- also we're logging everything twice because we're logging debug to both launcher-debug.log and nodepool-launcher.log ? | 14:42 |
pabelanger | http://logs.openstack.org/38/533638/7/gate/tox-py35/8ebe16a/ just uploaded, so jobs are okay now | 14:42 |
dmsimard | corvus: but yes, logs.o.o is more important | 14:43 |
corvus | let's first check if xvdh is okay now | 14:43 |
corvus | i'm running a dd to read it, just to make sure we're not getting any more io errors | 14:43 |
corvus | dd if=/dev/xvdh of=/dev/null bs=4096 | 14:44 |
dmsimard | pabelanger: from the looks of the history it's an fsck on /dev/mapper/main-logs | 14:44 |
dmsimard | just gotta make sure to start that in a screen, but let's wait to see if xvdh is fine | 14:44 |
Shrews | dmsimard: the nl02 problem is interesting. nl01 and nl03 seem fine | 14:44 |
corvus | dmsimard: any previous fsck commands have any interesting arguments? | 14:44 |
dmsimard | corvus: http://paste.openstack.org/raw/653419/ | 14:45 |
dmsimard | Shrews: yes | 14:45 |
dmsimard | Shrews: now that we have free space on it I'm trying to understand what's going on :) | 14:46 |
dmsimard | because it's still logging like crazy | 14:46 |
pabelanger | okay, so is 'fsck -v -y /dev/mapper/main-logs' what we want to run, or fsck /dev/xvdh directly? | 14:47 |
Shrews | dmsimard: oh my, yes. something is wrong there. | 14:47 |
dmsimard | we want to fsck the aggregate volume | 14:47 |
dmsimard | since it's not mounted, we want to fsck /dev/mapper/main-logs instead of /srv/static/logs | 14:48 |
pabelanger | right | 14:48 |
pabelanger | okay, I have screen up | 14:48 |
pabelanger | and will be running | 14:48 |
pabelanger | fsck -v -y /dev/mapper/main-logs | 14:48 |
pabelanger | we are okay with that? | 14:48 |
dmsimard | Shrews, corvus: this is madness (nl02) http://paste.openstack.org/raw/653420/ | 14:48 |
Shrews | 'nl02-5138-PoolWorker.infracloud-chocolate-main' is being logged like crazy... probably in the declined_by attribute? | 14:48 |
dmsimard | pabelanger: wait for corvus to ack that xvdh is fine | 14:49 |
Shrews | yeah that | 14:49 |
mordred | pabelanger, Shrews: it seems like our logic to handle clouds that no longer exist is not doing the right thing | 14:49 |
Shrews | i've got a guess at what may be happening | 14:49 |
mordred | Shrews: oh, good. cause I was just shooting in the dark | 14:49 |
dmsimard | Shrews: reproducing this would probably be non-trivial so let's troubleshoot it live :/ | 14:50 |
Shrews | since chocolate is gone, i'm betting it keeps declining requests over and over, forever adding to the declined_by attr | 14:50 |
corvus | the dd is slow, so i'm going to have it skip to closer to where the io error is logged. | 14:50 |
Shrews | but just a guess | 14:50 |
pabelanger | corvus: ack | 14:50 |
corvus | dd if=/dev/xvdh of=/dev/null bs=512 skip=1069712570 | 14:51 |
dmsimard | the /srv/static mount point is already at 12% usage and is probably going to run out of space before the fsck ends, do we have a plan B ? | 14:51 |
corvus | dmsimard: continuously delete things there, i guess. | 14:52 |
Shrews | http://paste.openstack.org/raw/653422/ | 14:52 |
dmsimard | Shrews: there you go | 14:52 |
corvus | it'll take 6+ hours to fsck, so we'll probably just need to run a find in a while loop | 14:52 |
Shrews | now what changed recently to break that | 14:52 |
dmsimard | Shrews: we split the nl configuration | 14:53 |
dmsimard | Shrews: also https://review.openstack.org/#/c/536953/ | 14:53 |
dmsimard | Shrews: https://github.com/openstack-infra/project-config/commits/master/nodepool | 14:53 |
corvus | i think the dd has gone over the trouble spot from the logs without error. i think we can fsck now. | 14:53 |
pabelanger | okay | 14:54 |
pabelanger | fsck -v -y /dev/mapper/main-logs | 14:54 |
pabelanger | we are good with that? | 14:54 |
Shrews | dmsimard: yes. we've done that before. this is new | 14:54 |
dmsimard | pabelanger: in a screen, yes | 14:54 |
mordred | corvus, dmsimard: we could also update the log upload job to ignore upload errors, so that the gate doesn't flap | 14:54 |
corvus | pabelanger: hang on just a sec | 14:54 |
pabelanger | yes, screen is running as root | 14:54 |
pabelanger | corvus: sure | 14:54 |
Shrews | i'm guessing https://review.openstack.org/533372 might be causing this | 14:54 |
dmsimard | mordred: hmmmm I like that but I wonder if we would have a clever way of ignoring log upload errors only for successful jobs | 14:55 |
mordred | dmsimard: we have zuul_success variable | 14:55 |
dmsimard | mordred: or even disable log upload for successful jobs altogether, I dunno -- we can just continuously discard logs too.. none of these ideas are awesome but hey | 14:56 |
corvus | pabelanger: i wanted to verify that it was '-y' we wanted, and i have. http://eavesdrop.openstack.org/irclogs/%23openstack-infra-incident/%23openstack-infra-incident.2017-03-31.log.html | 14:57 |
corvus | that was the time i remember where we used the wrong option and had to do it again. | 14:57 |
corvus | mordred: maybe we could even just stop uploading logs? | 14:58 |
dmsimard | corvus: that's what I was suggesting but only for successful jobs | 14:58 |
mordred | yah. or - at least don't upload logs for gate jobs or successful jobs | 14:58 |
mordred | gah | 14:58 |
pabelanger | corvus: okay, just to confirm. command posted is correct | 14:58 |
corvus | pabelanger: yep. | 14:58 |
pabelanger | great | 14:58 |
AJaeger | note that we upload logs and docs to the same volume - so, what to do with docs jobs? | 14:58 |
mordred | AJaeger: don't docs go to afs? | 14:59 |
pabelanger | Hmm, says /dev/mapper/main-logs in use | 14:59 |
AJaeger | mordred: published ones go to afs - draft ones go to logs | 14:59 |
pabelanger | let me see why that is | 14:59 |
AJaeger | mordred: and I care about the draft docs from check pipeline | 15:00 |
dmsimard | mordred, corvus: ok I'll try to send a patch to do the logs only on failure thing | 15:01 |
mordred | AJaeger: k. lemme get the no-log patch written, then let's figure out docs-draft | 15:01 |
mordred | dmsimard: already working on it | 15:01 |
AJaeger | yeah, let's do that... | 15:01 |
dmsimard | mordred: ok I'll review then -- from the state of the gate we might want to accelerate merging that or it won't be effective for a while | 15:01 |
pabelanger | corvus: mind helping with the /dev/mapper/main-logs in-use error | 15:02 |
corvus | pabelanger: it looks like apache is holding files open | 15:02 |
corvus | pabelanger: i will restart it | 15:02 |
* AJaeger needs to go offline for probably rest of day and sends virtual cookies and coffee... | 15:02 | |
pabelanger | ah | 15:02 |
corvus | pabelanger: any better now? | 15:03 |
mordred | dmsimard, corvus: remote: https://review.openstack.org/537929 Only upload logs when jobs fail | 15:03 |
pabelanger | corvus: no, I still see the in-use error from fsck | 15:03 |
pabelanger | corvus: I did use umount -l /srv/static/logs I wonder if that is an issue | 15:03 |
dmsimard | pabelanger: I don't see anything with lsof |grep mapper | 15:03 |
corvus | pabelanger: hrm, i wonder if the lazy unmount from earlier is inhibiting our ability to find out what's using it | 15:04 |
pabelanger | yah, I fear that too | 15:04 |
dmsimard | hang on | 15:04 |
dmsimard | what is that process ? root 26026 0.1 0.0 0 0 ? S Jan15 16:10 [jbd2/dm-0-8] | 15:05 |
corvus | dmsimard: the kernel | 15:05 |
dmsimard | it's holding dm-0 which is the block device for main-logs | 15:05 |
dmsimard | let me try and understand what that process is and if we can kill it safely .. | 15:05 |
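The usual commands for working out what is holding a device that refuses to be fscked or remounted (illustrative; lsof had already come up empty above):

    lsof /dev/mapper/main-logs        # userspace processes with the device node itself open
    fuser -vm /dev/mapper/main-logs   # processes using files on the filesystem, if any
    ls -l /dev/mapper/main-logs       # -> ../dm-0, which matches the jbd2/dm-0-8 thread name
    ps -fp 26026                      # [jbd2/dm-0-8]: the ext4 journal kernel thread for dm-0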
pabelanger | oh, I see a kernel panic in dmesg too | 15:06 |
mordred | EXCELLENT | 15:06 |
pabelanger | [Thu Jan 25 13:34:48 2018] jbd2/dm-0-8 D ffff88020f893ac0 0 26026 2 0x00000000 | 15:06 |
dmsimard | hung tasks | 15:06 |
dmsimard | live migration ? | 15:06 |
dmsimard | https://bugzilla.redhat.com/show_bug.cgi?id=886030 "Unmounting an ext3 filesystem result in a stuck jbd2 process that holds the device in use and prevents the same filesystem to be checked (fsck) or mounted again." | 15:07 |
openstack | bugzilla.redhat.com bug 886030 in kernel "Device in use by a jbd2 process after unmounting an ext3 filesystem" [Unspecified,Closed: duplicate] - Assigned to kernel-maint | 15:07 |
*** cmurphy has joined #openstack-infra-incident | 15:07 | |
dmsimard | https://bugzilla.redhat.com/show_bug.cgi?id=851970 systemd private temp ? | 15:07 |
openstack | bugzilla.redhat.com bug 851970 in systemd "Unable to umount partition as long there are running services that have PrivateTmp set that where started after the partition mount" [High,Closed: currentrelease] - Assigned to mschmidt | 15:07 |
dmsimard | but there's nothing in /tmp | 15:07 |
dmsimard | this is 14.04 so not systemd | 15:08 |
pabelanger | are we able to unload / load that kernel module again or does this mean we need a reboot? | 15:09 |
dmsimard | probably can't reload that live, it's likely in use by the other mounts | 15:09 |
Shrews | dmsimard: i think this is the fix for the launcher: remote: https://review.openstack.org/537932 Handle missing request during a decline. | 15:10 |
pabelanger | okay, I need to grab a coffee real quick. back in a moment | 15:10 |
dmsimard | Shrews: anything to address it in the meantime ? a restart of nodepool-launcher or something ? | 15:11 |
dmsimard | Shrews: if you want to live test it, nl02 is already in emergency file | 15:11 |
Shrews | i think we should restart the launcher on nl02 | 15:11 |
Shrews | i can handle that | 15:11 |
Shrews | we can't really test it live b/c the restart will clear it out | 15:12 |
Shrews | nodepool-launcher restarted on nl02 | 15:12 |
pabelanger | back | 15:13 |
dmsimard | mordred: comment on https://review.openstack.org/#/c/537929/1 | 15:13 |
Shrews | citycloud seems to be at quota too | 15:14 |
Shrews | oh, that's disabled too | 15:14 |
dmsimard | pabelanger: the thing is I'm not sure a reboot would even work -- it's likely a process that isn't killable because it's blocked on I/O, so we'd need a hard reboot ? If we're going to reboot we might as well try to kill it first, it won't make matters worse than a failed reboot/hard reboot | 15:15 |
corvus | i would just reboot | 15:15 |
corvus | i would not recommend trying to kill a kernel process | 15:15 |
corvus | run reboot and if it fails, use the console to reset | 15:15 |
pabelanger | right, we maybe also want to remove entry from /etc/fstab so we don't mount on firstboot? | 15:15 |
dmsimard | corvus: kernel i/o block processes typically can't be killed so we'll need a hard reboot | 15:16 |
dmsimard | yeah | 15:16 |
pabelanger | okay, let me edit /etc/fstab first | 15:16 |
dmsimard | do we have a root password for the console in case we need it ? | 15:16 |
corvus | dmsimard: sure, just don't do that first. :) reboot, then 'reset', which is what you're calling a hard reboot. | 15:16 |
pabelanger | dmsimard: no root password, we'd need to use emergency console I think | 15:17 |
dmsimard | corvus: reset is equivalent to init 0 ? I've never actually used reset before but what I was thinking is a nova reboot --hard | 15:17 |
dmsimard | corvus: which is the equivalent of a reset button on the chassis :p | 15:17 |
corvus | dmsimard: now we're on the same page. | 15:17 |
dmsimard | corvus: oh wait we're saying the same thing.. sorry I'm slow this morning | 15:17 |
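For clarity, the two nova operations being compared (standard nova client commands; the server name is omitted here):

    nova reboot <server>           # soft reboot: asks the guest OS to restart itself
    nova reboot --hard <server>    # hard reboot: hypervisor-level reset, the "reset button" equivalent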
mordred | dmsimard: thanks. you're totally right | 15:17 |
pabelanger | okay, I've commented out /srv/static/logs in /etc/fstab now | 15:17 |
corvus | dmsimard: i agree, now would be a good time to set a root password in case something goes wrong | 15:18 |
mordred | dmsimard: and fixed https://review.openstack.org/#/c/537929/ | 15:18 |
mordred | corvus: ++ | 15:18 |
corvus | dmsimard, pabelanger: how about i set the password and add it to the passwords file? | 15:18 |
pabelanger | sure | 15:18 |
dmsimard | PermitRootLogin is No in /etc/ssh/sshd_config so we're good on that front | 15:18 |
pabelanger | dmsimard: corvus: will also defer to one of you to do reboot | 15:19 |
corvus | okay, password set, saved in file, and i used it to 'su' to root, to verify. | 15:20 |
dmsimard | corvus: where is that password file ? puppetmaster ? | 15:21 |
dmsimard | pabelanger: let me open up the console before we try the reboot, hang on | 15:22 |
dmsimard | it'll be a way to tell if the reboot gets stuck | 15:22 |
corvus | dmsimard: yes... did no one walk you and frickler through that? | 15:23 |
dmsimard | corvus: not familiar with that yet, no -- but we can walk through it later :D | 15:23 |
corvus | dmsimard: yes we should schedule a new root onboarding session :) | 15:24 |
dmsimard | ok I have console on static01 | 15:24 |
dmsimard | we good for reboot ? | 15:24 |
corvus | dmsimard: yes from me | 15:24 |
dmsimard | pabelanger: you good too ? | 15:25 |
pabelanger | yes, reboot when ready | 15:25 |
dmsimard | ok attempting a reboot now | 15:25 |
dmsimard | lost console so this is a good sign | 15:25 |
dmsimard | it's back | 15:26 |
dmsimard | $ uptime | 15:27 |
dmsimard | 15:27:06 up 0 min, 1 user, load average: 0.27, 0.08, 0.03 | 15:27 |
dmsimard | I started a screen called "fsck", starting the fsck | 15:27 |
pabelanger | great | 15:27 |
pabelanger | dmsimard: be sure to use the right flags for the fsck command | 15:28 |
dmsimard | fsck -v -y /dev/mapper/main-logs is running | 15:28 |
dmsimard | I detached from the screen | 15:28 |
pabelanger | thanks | 15:28 |
pabelanger | this will take some time | 15:29 |
dmsimard | need some more reviews on https://review.openstack.org/#/c/537929/ and likely to force submit that | 15:30 |
dmsimard | Shrews: nl02 looks good now, thanks | 15:30 |
dmsimard | Shrews: it's probably worth looking into why we're double logging also | 15:31 |
corvus | i've modified a copy of the log maint script in ~corvus/log_archive_maintenance.sh | 15:31 |
corvus | i'm proposing we delete logs after 120 minutes | 15:32 |
corvus | though it looks like the growth rate is slowing... for some reason? | 15:32 |
dmsimard | Shrews: The DEBUG logging ends up in both nodepool-launcher.log and launcher-debug.log -- we probably either want to take the DEBUG out of nodepool-launcher.log or merge launcher-debug.log into nodepool-launcher.log | 15:33 |
pabelanger | corvus: 120 mins seems fine to me | 15:34 |
dmsimard | corvus: we were at 12% @09:51:43, we're now at 21% @10:33:12 | 15:34 |
dmsimard | (my time) | 15:34 |
dmsimard | ~10% every 30 minutes ? | 15:34 |
pabelanger | dmsimard: nodepool-launcher should only be INFO, would need to check logging.conf | 15:34 |
corvus | dmsimard: so we could double retention to 240 minutes and target 80% | 15:35 |
dmsimard | corvus: sure, there's also nothing preventing us from looking every now and then and change our mind -- that's just a safety net | 15:36 |
corvus | okay, i've written a wrapper script that loops over my modified script and runs it every 60 seconds | 15:38 |
corvus | i've also disabled the normal crontab | 15:38 |
dmsimard | Shrews: btw if we want to keep those logs somewhere safe while we iron this out, I gzipped things in /var/log/nodepool on nl02 | 15:38 |
corvus | i'm going to run my looping script in another window of the fsck screen | 15:39 |
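A minimal sketch of that wrapper loop (the real thing is corvus' modified copy of log_archive_maintenance.sh plus a wrapper; the gzip and retention ages below are the values discussed here, and the find expressions are simplified):

    while true; do
        # compress anything older than ~10 minutes (i.e. zuul has finished uploading it)
        find /srv/static/logs -type f -mmin +10 ! -name '*.gz' -exec gzip {} \;
        # delete anything older than the retention window (120 minutes to start with)
        find /srv/static/logs -type f -mmin +120 -delete
        find /srv/static/logs -type d -empty -delete
        sleep 60
    done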
dmsimard | Shrews: it compresses surprisingly well | 15:39 |
dmsimard | corvus: wfm | 15:39 |
pabelanger | we also need to remember to update /etc/fstab | 15:39 |
pabelanger | I can do that now, if we want | 15:39 |
dmsimard | let's keep that for when we need to do the remount/mv dance | 15:39 |
dmsimard | #status log (dmsimard) fsck started running in a screen on logs.o.o for /dev/mapper/main-logs at around 15:30UTC, logs are being sent straight to /srv/static | 15:42 |
openstackstatus | dmsimard: finished logging | 15:42 |
*** mnaser has joined #openstack-infra-incident | 15:43 | |
dmsimard | #status log (dmsimard) We're running a modified log_archive_maintenance.sh from ~/corvus and https://review.openstack.org/#/c/537929/ as safety nets to keep us from running out of disk space | 15:43 |
openstackstatus | dmsimard: finished logging | 15:43 |
dmsimard | infra-root: if there's no objections I'll submit https://review.openstack.org/#/c/537929/ | 15:44 |
pabelanger | dmsimard: we should be okay for now for zuul to merge it, right? | 15:45 |
corvus | oh interesting. we're gzipping the docs-draft builds. | 15:45 |
corvus | i suspect our normal log processing script is doing that as well, but just slowly enough that we don't notice it | 15:46 |
dmsimard | pabelanger: if we don't submit through gerrit, it means the patch has to go through both check and gate which could be a few hours | 15:46 |
corvus | maybe they'll still work though because of the gzip handling we have on logs.o.o. i guess something to look out for. | 15:46 |
dmsimard | corvus: there's a couple "age" lookups in the find, one for gzip and one for deletion | 15:46 |
corvus | the log processing script still does a *lot* of gzipping, so we may end up with a significant reduction in space | 15:46 |
pabelanger | dmsimard: looks like it might start running shortly, looking at zuul.o.o | 15:47 |
corvus | dmsimard: yeah, the gzip is low, it's 10 minutes. basically enough time to make sure that zuul isn't still uploading files | 15:47 |
dmsimard | pabelanger: there's nothing in the check queue that will tell us if this works or not, it's a trusted playbook | 15:47 |
pabelanger | right, but some of the pressure is off currently and should be able to merge on its own | 15:48 |
pabelanger | dmsimard: also, we can remove nodepool-launcher from emergency? | 15:49 |
dmsimard | pabelanger: maybe, I guess this isn't in the tripleo or integrated gates | 15:49 |
pabelanger | dmsimard: right, once in gate should get nodes right away | 15:49 |
dmsimard | pabelanger: yeah, I wound up not touching the max-servers but I was getting ready to | 15:49 |
dmsimard | pabelanger: I'll remove nl* from emergency (except yours, 03 I think) | 15:49 |
corvus | the current run of the gzip/rm script is printing out files. the next runs will be quiet. | 15:50 |
pabelanger | dmsimard: no, you can also remove that now. Was going to this morning | 15:50 |
dmsimard | pabelanger: 04 is commented by you actually | 15:50 |
dmsimard | pabelanger: ok | 15:50 |
pabelanger | thanks | 15:50 |
dmsimard | pabelanger: nl* removed | 15:50 |
dmsimard | pabelanger: it's a good idea to add comments in the emergency file so we know why things are in there, +1 for that idea | 15:50 |
-openstackstatus- NOTICE: logs.openstack.org is stabilized and there should no longer be *new* POST_FAILURE errors. Logs for jobs that ran in the past weeks until earlier today are currently unavailable pending FSCK completion. We're going to temporarily disable *successful* jobs from uploading their logs to reduce strain on our current limited capacity. Thanks for your patience ! | 16:01 | |
fungi | keep in mind that /srv/static was already 8% full before this (there are a handful of docroots which lack dedicated volumes), though i agree that it's accumulating faster than the fsck is likely to complete (up to 29% now) | 16:16 |
dmsimard | pabelanger: logs.o.o responds a bit erratically.. took me a few seconds to get dmesg back | 16:38 |
pabelanger | load is up | 16:38 |
dmsimard | looks like load from... apache ? fsck has some load | 16:39 |
dmsimard | but apache is the one causing the load | 16:39 |
pabelanger | something hammering server? | 16:40 |
dmsimard | I'm seeing semrush bot, the crawler I banned from RDO but there's not that many hits | 16:40 |
dmsimard | there's a lot of hits like those: 2001:4800:7817:103:be76:4eff:fe04:38d1 - - [25/Jan/2018:16:40:32 +0000] "GET /99/537199/1/gate/trove-scenario-mysql-multi/172fa3c/logs/screen-s-object.txt.gz?level=INFO HTTP/1.1" 200 151 "-" "Python-urllib/2.7" | 16:40 |
dmsimard | ... logstash/elastic-recheck ? | 16:41 |
*** openstackstatus has quit IRC | 16:41 | |
dmsimard | yeah those python-urllib are all logstash workers | 16:41 |
*** openstackstatus has joined #openstack-infra-incident | 16:42 | |
*** ChanServ sets mode: +v openstackstatus | 16:42 | |
dmsimard | lots of i/o wait but I'm trying to narrow it down | 16:43 |
pabelanger | given how busy the gate is, likely too much I/O for the local HDD to keep up with | 16:44 |
dmsimard | pabelanger: we're heavily swapping | 16:44 |
dmsimard | root@static:/var/log/apache2# free -m | 16:45 |
dmsimard | total used free shared buffers cached | 16:45 |
dmsimard | Mem: 7979 7755 224 0 117 31 | 16:45 |
dmsimard | -/+ buffers/cache: 7605 373 | 16:45 |
dmsimard | Swap: 7811 4167 3644 | 16:45 |
pabelanger | pastebin :) | 16:45 |
dmsimard | yeah sorry I meant to only paste the swap line | 16:45 |
dmsimard | the fsck is eating all the ram | 16:45 |
dmsimard | .... why ? | 16:45 |
pabelanger | not sure | 16:46 |
dmsimard | https://serverfault.com/questions/9218/running-out-of-memory-running-fsck-on-large-filesystems "Alternately, there's now an option in e2fsck that'll tell it to store all of it's intermediate results in a directory instead of in RAM, which helps immensely. " | 16:46 |
dmsimard | It's buffering the results in ram | 16:46 |
dmsimard | infra-root ^ | 16:46 |
dmsimard | It might explain why the previous FSCKs were so long actually, we were unknowingly heavily swapping | 16:47 |
corvus | yeah, would be nice to work out how to use that | 16:48 |
dmsimard | Although if you compare swapping to disk vs buffering to disk (instead of ram) the difference is probably not that big.. but it'd make that ram available for apache and.. ssh | 16:48 |
pabelanger | yah, load is up to 30.00 | 16:48 |
pabelanger | so we should consider doing something | 16:48 |
dmsimard | I'm going to interrupt the fsck | 16:49 |
dmsimard | or try to | 16:49 |
pabelanger | wait | 16:49 |
pabelanger | maybe just stop apache? | 16:49 |
corvus | please don't stop the fsck | 16:49 |
dmsimard | sorry, too late :/ | 16:49 |
corvus | seriously? | 16:49 |
corvus | okay this is all you. | 16:49 |
*** corvus has left #openstack-infra-incident | 16:51 | |
pabelanger | okay, we need to come up with a plan. To get fsck running again | 16:52 |
pabelanger | dmsimard: ^ | 16:53 |
dmsimard | fsck is running again with scratch files in /opt/fsck | 16:54 |
pabelanger | okay, next time we should likely wait until everybody is on board with the plan. Can't fix that now, but for the future | 16:55 |
mordred | dmsimard: awesome | 16:55 |
mordred | dmsimard: and TIL about fsck scratchfiles | 16:56 |
dmsimard | yes, I agree that I should have waited for an ack -- I really apologize for not waiting on a consensus before moving forward | 16:56 |
dmsimard | I've dealt with something like this before, it is familiar territory but I won't let that happen again | 16:56 |
mordred | dmsimard: and yah - the performance of the fsck isn't likely to be much different than it would via just swapping - but the memory pressure competition would be much better | 16:57 |
mordred | hopefully | 16:57 |
dmsimard | I have a watch du -sh running inside a window of the screen | 16:58 |
dmsimard | for the /opt/fsck directory | 16:58 |
dmsimard | We have 66GB available in /opt, it should be largely sufficient | 16:58 |
mordred | dmsimard: if it's not sufficient I'm going to be very sad :) | 16:58 |
pabelanger | okay, which flags did we use for fsck now? | 17:04 |
pabelanger | want to better understand the change made | 17:05 |
mordred | dmsimard: yah. agree. also - should we consider putting an /etc/e2fsck.conf on files01 with that setting in it? | 17:08 |
dmsimard | mordred: It's probably worth putting in puppet for servers where we have large volumes | 17:11 |
dmsimard | pabelanger: the change was to configure e2fsck to write its scratch files to disk; it defaults to RAM, which is faster but is not appropriate for very large filesystems | 17:12 |
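The setting in question, as it would appear in /etc/e2fsck.conf (per e2fsck.conf(5); the directory is the one used on logs.o.o here):

    [scratch_files]
    directory = /opt/fsck    # spill e2fsck's pass bookkeeping to disk instead of holding it in RAM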
dmsimard | mordred: I'll send a patch now | 17:12 |
pabelanger | yah, puppet change sounds great | 17:14 |
dmsimard | pabelanger: Do we have any other servers that have very large partitions ? | 17:15 |
pabelanger | dmsimard: openstack volume list would be a way to check | 17:22 |
pabelanger | static.o.o is the largest | 17:22 |
mordred | I don't think we have anything else that's crazypants like static | 17:22 |
dmsimard | ok let's enable it for this one for now with the ability to enable it elsewhere on a need basis | 17:23 |
mordred | dmsimard: ++ | 17:30 |
*** panda is now known as panda|bbl | 17:33 | |
*** weshay|ruck is now known as weshay|ruck|brb | 18:09 | |
*** weshay|ruck|brb is now known as weshay | 18:27 | |
*** myoung is now known as myoung|biab | 18:58 | |
*** myoung|biab is now known as myoung | 19:20 | |
*** myoung is now known as myoung|cheesebur | 19:34 | |
*** myoung|cheesebur is now known as myoung|food | 19:34 | |
*** myoung|food is now known as myoung | 20:02 | |
dmsimard | infra-root: Status update... We're hovering around 55% logs and we've started deleting things with a 240m timeout, we might want to increase that to try and delete less things. Thoughts ? | 20:18 |
dmsimard | mordred: also I suspect we might have been mistaken about the performance impact of on-disk scratch files -- we're still not into phase 2 of the fsck going on 3 hours in, whereas we had reached phase 2 within ~1h30 before restarting | 20:20 |
dmsimard | Also, the e2fsck man mentions that we can send a SIGUSR1 to the process to display a progress bar (and SIGUSR2 to stop displaying it) | 20:22 |
dmsimard | SIGUSR1 This signal causes e2fsck to start displaying a completion bar or emitting progress information. (See discussion of the -C option.) | 20:22 |
dmsimard | Should we enable that to see how we're doing ? | 20:22 |
pabelanger | dmsimard: how long is it taking to delete stuff? | 20:26 |
dmsimard | pabelanger: not sure, this is the output we get in the screen http://paste.openstack.org/raw/653497/ | 20:27 |
dmsimard | I guess anywhere between near instantaneous and 20 minutes ? | 20:27 |
pabelanger | yah, lets give it a little longer, since we are at 55% | 20:28 |
pabelanger | I've never tried SIGUSR1 before with fsck, maybe others know | 20:29 |
pabelanger | but, not being in phase2 yet is troublesome | 20:29 |
dmsimard | I raised it to 300 up from 240 | 20:30 |
dmsimard | the server is swapping just a tiny bit but is otherwise healthy | 20:31 |
dmsimard | without scratch files we were 4GB into swap | 20:31 |
dmsimard | Really the worst timing for this to be happening at all, the other times we were lucky it happened during the night or the weekend and we just left fsck running with logs.o.o offline.. | 20:33 |
dmsimard | pabelanger: I have to step away for a few minutes but I don't think sigusr1 could help us, it seems to imply that the -C <file> argument was already specified.. I might try to test it on a local server to see if it works | 20:34 |
dmsimard | https://www.youtube.com/watch?v=cS4eztFtS0U makes it look like it's possible, I'll test and confirm | 20:36 |
dmsimard | brb | 20:37 |
*** rlandy is now known as rlandy|brb | 20:47 | |
*** myoung is now known as myoung|pto | 20:53 | |
dmsimard | ok, testing this locally on a ubuntu 14.04 vm.. | 20:59 |
*** rlandy|brb is now known as rlandy | 21:12 | |
ianw | the progress options to fsck do work, i've used them before | 21:17 |
dmsimard | ianw: do you have the usage handy ? I just don't want to kill the ongoing fsck mistakenly | 21:18 |
dmsimard | almost done trying to test it locally.. | 21:19 |
dmsimard | yeah I'm not able to reproduce locally, I don't have a big enough disk that gives me enough time to even try it.. fsck completes near instantaneously :/ | 21:21 |
ianw | i'm 99% sure i've sent it sigusr1 before | 21:21 |
ianw | don't worry, you only start it once without the -C , next time you always remember :) | 21:21 |
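For reference, what that looks like (from e2fsck(8); the PID is whichever fsck.ext4 process is actually running):

    e2fsck -C 0 -v -y /dev/mapper/main-logs   # -C 0 prints a completion bar from the start
    kill -USR1 <fsck.ext4-pid>                # toggle progress output on an already-running e2fsck
    kill -USR2 <fsck.ext4-pid>                # turn it back off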
dmsimard | ianw: we didn't see any -C in the history :/ http://paste.openstack.org/raw/653419/ | 21:22 |
ianw | anyway, send it to the fsck.ext4 process, not the parent process, i think that got us once before | 21:24 |
dmsimard | ianw: so we have http://paste.openstack.org/show/653506/ and what we want is... kill -SIGUSR1 27665 ? | 21:27 |
dmsimard | Using a kill command to do this kind of stuff makes me paranoid, whoever thought that was a good idea... | 21:28 |
ianw | dmsimard: yep | 21:29 |
dmsimard | ianw: ok, trying now. | 21:29 |
dmsimard | that worked | 21:29 |
dmsimard | we're at 48% of phase 1.. ouch | 21:29 |
ianw | that said, it doesn't really give you a timeline | 21:30 |
ianw | if you look back through the logs, fungi and i had to do it a little while ago? | 21:30 |
dmsimard | ianw: so to give you some context | 21:30 |
ianw | i was watching it pretty closely, so you could probably correlate the start/finish from the irc logs | 21:30 |
dmsimard | ianw: today is release day and it was kind of a downer to keep logs.o.o down for like 8 hours -- so we're currently sending logs to /srv/static directly and we'll want to do a mv/remount dance once the fsck has finished | 21:31 |
ianw | yep, last time i just bind mounted /srv/static somewhere and rsync'd the intermediate logs back in | 21:32 |
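The dance ianw describes, roughly (mount points are assumptions; this is the shape of it, not the exact commands used previously):

    mount /dev/mapper/main-logs /srv/static/logs         # after fsck finishes, remount the repaired LV
    mkdir -p /mnt/static-root
    mount --bind /srv/static /mnt/static-root             # a non-recursive bind mount exposes the files
                                                           # hidden under the /srv/static/logs mountpoint
    rsync -a /mnt/static-root/logs/ /srv/static/logs/     # move the logs accumulated during the outage back in
    rm -rf /mnt/static-root/logs/*                         # reclaim the space on the parent volume
    umount /mnt/static-root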
dmsimard | ianw: We realized when running the usual logs.o.o workload side by side with the fsck that it was going to be difficult because the fsck process was eating all the RAM (swapping >4GB and causing log uploads to keep failing) | 21:33 |
ianw | that is definitely a new one, i can't remember that happening any of the times we've had to recover before | 21:34 |
dmsimard | ianw: so I had fsck send the scratch files to disk instead of ram -- the patch for it is here https://review.openstack.org/#/q/topic:e2fsck | 21:34 |
dmsimard | I jumped the gun on that one, there might've been another avenue but I had already interrupted the fsck | 21:35 |
ianw | probably worth understanding if that's an artefact of certain types of corruption that we just happened to hit, or something else | 21:37 |
ianw | public holiday here and duty calls, good luck :) | 21:37 |
dmsimard | o/ | 21:38 |
dmsimard | For the record, there seems to be on average between 5GB and 6GB of scratch files per batch, ex: http://paste.openstack.org/show/653511/ | 21:56 |
dmsimard | If we were working with 16GB of RAM instead of our existing 8GB, we might be able to work without writing to disk | 21:56 |
dmsimard | We're currently at 51% of phase 1 :( | 21:57 |
pabelanger | that's about 4 hours in right? | 22:01 |
pabelanger | dmsimard: mordred: do we want to discuss switching back to the original method? 4 hours into the fsck and 51% of phase 1, it feels like we are in for a long haul | 22:03 |
dmsimard | 5 hours in | 22:04 |
pabelanger | logs look stable at 50% /srv/static | 22:04 |
dmsimard | pabelanger: yeah I bumped the timer again | 22:04 |
dmsimard | pabelanger: can't deny it's going very slowly but the flipside is that it allows jobs to complete ? If we go back to RAM we'll start getting load and failures again -- an alternative would be to resize to 16GB RAM and only then resume it back in ram | 22:06 |
pabelanger | well, we've done this before while jobs have been running. Mind you, not during the release week | 22:07 |
dmsimard | yeah the gate is running at max capacity :( | 22:07 |
pabelanger | if we go back to normal fsck, we could consider reducing number of jobs running at once in nodepool. In an effort to keep things moving and bring volume back online | 22:08 |
pabelanger | infra-root: ^ do you have any thoughts? | 22:08 |
dmsimard | Getting any post failures would yield gate resets and at 100 patches in the gate that's crazy expensive, I would do everything we can to avoid that | 22:11 |
mordred | pabelanger: reading | 22:12 |
dmsimard | 5 hours in at 52% of phase 1 fsck is beyond awful.. I would consider bumping ram to 16GB and resuming without on disk scratch files but I don't have any other ideas | 22:14 |
pabelanger | well, we are swapping now: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=143&rra_id=all | 22:21 |
pabelanger | slowly growing | 22:21 |
pabelanger | so, I feel we might just be delaying the unhappiness of swapping | 22:22 |
pabelanger | I'm unsure about resizing logs.o.o either, until others are okay with it. | 22:23 |
dmsimard | Ok I'm going to start a pad to summarize our options | 22:24 |
pabelanger | ok | 22:27 |
dmsimard | working on https://etherpad.openstack.org/p/2tAC4qI2eR | 22:29 |
dmsimard | mordred: do you have an opinion ? | 22:40 |
clarkb | isn't option 5 "don't fsck"? | 22:41 |
dmsimard | clarkb: we admittedly didn't test if the volume was mountable in the first place | 22:42 |
clarkb | it's data with a retention period of 4 weeks | 22:43 |
clarkb | if we lose data I don't think it's a huge deal as long as the fs is functional | 22:43 |
clarkb | but unsure how risky that is | 22:43 |
dmsimard | clarkb: right, I know we don't care that much about the data but it might not let you mount it at all | 22:43 |
dmsimard | But I agree it's worth considering | 22:43 |
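What clarkb's alternative would look like (not what was done, and a dirty ext4 filesystem may refuse to mount or go read-only again on the first error):

    mkdir -p /mnt/logcheck
    mount -o ro /dev/mapper/main-logs /mnt/logcheck   # see whether the filesystem mounts at all
    dmesg | tail -n 50                                # watch for ext4/jbd2 complaints during the attempt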
clarkb | ok I have to afk again, just wanted to call that out | 22:44 |
pabelanger | I'm not sure how else to check it, outside of fsck | 22:45 |
*** corvus has joined #openstack-infra-incident | 22:54 | |
dmsimard | 56% of phase 1.. but most jobs are green when looking at the gate in http://zuul.openstack.org/builds.html .. we're increasingly swapping but load is stable -- it's like there's no good solution to this problem but we're probably slowly running into a wall if we don't do anything | 22:56 |
dmsimard | damned if you do, damned if you don't ? | 22:56 |
dmsimard | There's 5 phases to go through, it's unrealistic to keep going at this rate | 22:57 |
dmsimard | pabelanger: how would you want to reduce the pressure ? reduce max-servers ? | 22:57 |
dmsimard | oh crap, hang on | 22:58 |
*** rlandy is now known as rlandy|biab | 23:26 | |
dmsimard | fsck 65% phase 1 | 23:52 |
mnaser | fsck progress is weirdly non linear | 23:55 |