*** rlandy|bbl is now known as rlandy | 02:58 | |
*** rlandy has quit IRC | 03:31 | |
*** rosmaita has quit IRC | 04:06 | |
*** panda is now known as panda|lunch | 12:31 | |
*** rosmaita has joined #openstack-infra-incident | 12:56 | |
*** panda|lunch is now known as panda | 13:20 | |
*** rlandy has joined #openstack-infra-incident | 13:37 | |
*** weshay|rover is now known as weshay|ruck | 13:58 | |
*** ChanServ sets mode: +o dmsimard | 14:19 | |
*** frickler has joined #openstack-infra-incident | 14:19 | |
*** dmsimard changes topic to "logs.openstack.org filesystem is read only | OpenStack Infra team incident handling channel" | 14:19 | |
*** dmsimard sets mode: -o dmsimard | 14:19 | |
*** efried has joined #openstack-infra-incident | 14:19 | |
dmsimard | infra-root: Running a fsck *now* would be quite expensive for the gate which is already quite backlogged | 14:20 |
dmsimard | Do we have any ability to spin off a new logs.o.o server somehow, FSCK the previous one in the background and do redirects as appropriate ? I had a plan for something like this to phase out an old RDO logserver but I never got around to doing it | 14:21 |
pabelanger | likely not, we have 14 1TB volumes attached to logs.o.o | 14:21 |
dmsimard | pabelanger: right -- do we have the ability to do anything better than that ? | 14:22 |
pabelanger | at this moment, no | 14:23 |
dmsimard | I know it's kind of a stupid question because if there was anything better we would probably be doing it already | 14:23 |
pabelanger | we've talked about moving logs.o.o to vexxhost | 14:23 |
pabelanger | [Thu Jan 25 13:37:15 2018] end_request: I/O error, dev xvdh, sector 1069712576 | 14:23 |
dmsimard | pabelanger: but for example, does any other provider (OVH, Vexxhost) have bigger volumes | 14:23 |
pabelanger | it does seem limited to xvdh right now | 14:23 |
dmsimard | yeah any one of the volumes being even remotely unavailable will thrash the entire LVM | 14:23 |
AJaeger | pabelanger: unmount it, log to local file system? | 14:24 |
AJaeger | or attach temporary disk, log to it, fix big volume, reattach and move logs over? | 14:25 |
-openstackstatus- NOTICE: We're currently experiencing issues with the logs.openstack.org server which will result in POST_FAILURE for jobs, please stand by and don't needlessly recheck jobs while we troubleshoot the problem. | 14:25 | |
pabelanger | right, looking for disk space now | 14:25 |
pabelanger | I'm going to remove jenkins SSH key for now to stop jobs from uploading | 14:27 |
pabelanger | also going to put logs.o.o into emergency | 14:27 |
dmsimard | I will also put max-servers to zero everywhere across the board so we don't fail all the jobs needlessly | 14:28 |
pabelanger | no, shouldn't need that | 14:29 |
fungi | i'm not really here, but if you just unmount it there's ~100gb free in the parent volume to get by on until you can mount it again and then (optionally) move those accumulated logs back in from a secondary mount | 14:29 |
dmsimard | pabelanger: why ? we're starting jobs we know will fail | 14:30 |
fungi | e.g., logs end up accumulating in the /srv/static volume instead of the /srv/static/logs volume | 14:30 |
fungi | which is already its own volume with nearly 100gb free | 14:31 |
pabelanger | fungi: okay, can do that | 14:31 |
fungi | (hopefully that volume isn't spanning the impacted block device) | 14:31 |
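What fungi is describing, as a rough shell sketch (paths from the discussion above; illustrative, not a transcript of what was actually run):

    # drop the broken mount so new uploads land on the parent volume instead
    umount /srv/static/logs          # may need 'umount -l' if something still holds it open
    # nothing else changes: jobs keep writing to /srv/static/logs/..., which is now just a
    # directory on the /srv/static volume (~100GB free) rather than on the failed LV
    df -h /srv/static                # watch how quickly the parent volume fills up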
* fungi has to disappear to another meatspace meeting now | 14:32 | |
dmsimard | nl02.o.o: -bash: cannot create temp file for here-document: No space left on device | 14:34 |
dmsimard | looking.. | 14:34 |
dmsimard | wow | 14:35 |
dmsimard | # du -sh /var/log/nodepool/ | 14:35 |
pabelanger | /srv/static/logs: root kernel mount /srv/static/logs | 14:35 |
dmsimard | 37G    /var/log/nodepool/ | 14:35 |
pabelanger | looks like kernel is holding mount | 14:35 |
pabelanger | not sure how best to force the umount | 14:35 |
*** corvus has joined #openstack-infra-incident | 14:36 | |
dmsimard | pabelanger: looking | 14:37 |
pabelanger | umount -l worked | 14:37 |
pabelanger | ssh access enabled again | 14:38 |
pabelanger | okay, think we are accepting logs again | 14:38 |
dmsimard | I'm fixing nl02 | 14:38 |
mordred | pabelanger, dmsimard: morning! oh - just missed the fun didn't I? | 14:39 |
pabelanger | mordred: almost, I think we need to fsck our logs volume | 14:39 |
pabelanger | and maybe change out a cinder volume? | 14:39 |
pabelanger | not sure yet | 14:39 |
pabelanger | but /srv/static is our new logs folder | 14:40 |
dmsimard | pabelanger: that's probably going to let log uploads work but likely the vhost isn't looking there | 14:40 |
pabelanger | that is fine | 14:40 |
dmsimard | oh actually it does look there? I dunno http://logs.openstack.org/ seems to be working | 14:40 |
corvus | the vhost is fine | 14:40 |
corvus | logs are being written to the same location, it's just on a different volume now | 14:41 |
dmsimard | ok I free'd up 37GB on nl02.o.o | 14:41 |
corvus | dmsimard: what was taking up space? | 14:41 |
dmsimard | corvus: 37GB worth of logs | 14:41 |
dmsimard | I had to delete some of them because we were really at 100%, and I gzipped the most recent one | 14:42 |
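The nl02 cleanup was roughly this shape (file names are assumptions; the rotated but uncompressed launcher debug logs were the bulk of the 37G):

    cd /var/log/nodepool
    du -sh .                                        # 37G, filesystem at 100%
    rm launcher-debug.log.4 launcher-debug.log.3    # hypothetical names: drop the oldest rotations
    gzip launcher-debug.log.1                       # compress the most recent rotation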
pabelanger | okay, how would we like to handle readonly volume of /srv/static/logs? I've not done the repair of this before | 14:42 |
dmsimard | corvus: logs were rotated but not gzipped -- also we're logging everything twice because we're logging debug to both launcher-debug.log and nodepool-launcher.log ? | 14:42 |
pabelanger | http://logs.openstack.org/38/533638/7/gate/tox-py35/8ebe16a/ just uploaded, so jobs are okay now | 14:42 |
dmsimard | corvus: but yes, logs.o.o is more important | 14:43 |
corvus | let's first check if xvdh is okay now | 14:43 |
corvus | i'm running a dd to read it, just to make sure we're not getting any more io errors | 14:43 |
corvus | dd if=/dev/xvdh of=/dev/null bs=4096 | 14:44 |
dmsimard | pabelanger: from the looks of the history it's an fsck on /dev/mapper/main-logs | 14:44 |
dmsimard | just gotta make sure to start that in a screen, but let's wait to see if xvdh is fine | 14:44 |
Shrews | dmsimard: the nl02 problem is interesting. nl01 and nl03 seem fine | 14:44 |
corvus | dmsimard: any previous fsck commands have any interesting arguments? | 14:44 |
dmsimard | corvus: http://paste.openstack.org/raw/653419/ | 14:45 |
dmsimard | Shrews: yes | 14:45 |
dmsimard | Shrews: now that we have free space on it I'm trying to understand what's going on :) | 14:46 |
dmsimard | because it's still logging like crazy | 14:46 |
pabelanger | okay, so is 'fsck -v -y /dev/mapper/main-logs' what we want to run, or fsck /dev/xvdh directly? | 14:47 |
Shrews | dmsimard: oh my, yes. something is wrong there. | 14:47 |
dmsimard | we want to fsck the aggregate volume | 14:47 |
dmsimard | since it's not mounted, we want to fsck /dev/mapper/main-logs instead of /srv/static/logs | 14:48 |
pabelanger | right | 14:48 |
pabelanger | okay, I have screen up | 14:48 |
pabelanger | and will be running | 14:48 |
pabelanger | fsck -v -y /dev/mapper/main-logs | 14:48 |
pabelanger | we are okay with that? | 14:48 |
dmsimard | Shrews, corvus: this is madness (nl02) http://paste.openstack.org/raw/653420/ | 14:48 |
Shrews | 'nl02-5138-PoolWorker.infracloud-chocolate-main' is being logged like crazy... probably in the declined_by attribute? | 14:48 |
dmsimard | pabelanger: wait for corvus to ack that xvdh is fine | 14:49 |
Shrews | yeah that | 14:49 |
mordred | pabelanger, Shrews: it seems like our logic to handle clouds that no longer exist is not doing the right thing | 14:49 |
Shrews | i've got a guess at what may be happening | 14:49 |
mordred | Shrews: oh, good. cause I was just shooting in the dark | 14:49 |
dmsimard | Shrews: reproducing this would probably be non-trivial so let's troubleshoot it live :/ | 14:50 |
Shrews | since chocolate is gone, i'm betting it keeps declining requests over and over, forever adding to the declined_by attr | 14:50 |
corvus | the dd is slow, so i'm going to have it skip to closer to where the io error is logged. | 14:50 |
Shrews | but just a guess | 14:50 |
pabelanger | corvus: ack | 14:50 |
corvus | dd if=/dev/xvdh of=/dev/null bs=512 skip=1069712570 | 14:51 |
dmsimard | the /srv/static mount point is already at 12% usage and is probably going to run out of space before the fsck ends, do we have a plan B ? | 14:51 |
corvus | dmsimard: continuously delete things there, i guess. | 14:52 |
Shrews | http://paste.openstack.org/raw/653422/ | 14:52 |
dmsimard | Shrews: there you go | 14:52 |
corvus | it'll take 6+ hours to fsck, so we'll probably just need to run a find in a while loop | 14:52 |
Shrews | now what changed recently to break that | 14:52 |
dmsimard | Shrews: we split the nl configuration | 14:53 |
dmsimard | Shrews: also https://review.openstack.org/#/c/536953/ | 14:53 |
dmsimard | Shrews: https://github.com/openstack-infra/project-config/commits/master/nodepool | 14:53 |
corvus | i think the dd has gone over the trouble spot from the logs without error. i think we can fsck now. | 14:53 |
pabelanger | okay | 14:54 |
pabelanger | fsck -v -y /dev/mapper/main-logs | 14:54 |
pabelanger | we are good with that? | 14:54 |
Shrews | dmsimard: yes. we've done that before. this is new | 14:54 |
dmsimard | pabelanger: in a screen, yes | 14:54 |
mordred | corvus, dmsimard: we could also update the log upload job to ignore upload errors, so that the gate doesn't flap | 14:54 |
corvus | pabelanger: hang on just a sec | 14:54 |
pabelanger | yes, screen is running as root | 14:54 |
pabelanger | corvus: sure | 14:54 |
Shrews | i'm guessing https://review.openstack.org/533372 might be causing this | 14:54 |
dmsimard | mordred: hmmmm I like that but I wonder if we would have a clever way of ignoring log upload errors only for successful jobs | 14:55 |
mordred | dmsimard: we have zuul_success variable | 14:55 |
dmsimard | mordred: or even disable log upload for successful jobs altogether, I dunno -- we can just continuously discard logs too.. none of these ideas are awesome but hey | 14:56 |
corvus | pabelanger: i wanted to verify that it was '-y' we wanted, and i have. http://eavesdrop.openstack.org/irclogs/%23openstack-infra-incident/%23openstack-infra-incident.2017-03-31.log.html | 14:57 |
corvus | that was the time i remember where we used the wrong option and had to do it again. | 14:57 |
corvus | mordred: maybe we could even just stop uploading logs? | 14:58 |
dmsimard | corvus: that's what I was suggesting but only for successful jobs | 14:58 |
mordred | yah. or - at least don't upload logs for gate jobs or successful jobs | 14:58 |
mordred | gah | 14:58 |
pabelanger | corvus: okay, just to confirm. command posted is correct | 14:58 |
corvus | pabelanger: yep. | 14:58 |
pabelanger | great | 14:58 |
AJaeger | note that we upload logs and docs to the same volume - so, what to do with docs jobs? | 14:58 |
mordred | AJaeger: don't docs go to afs? | 14:59 |
pabelanger | Hmm, says /dev/mapper/main-logs in use | 14:59 |
AJaeger | mordred: published ones go to afs - draft ones go to logs | 14:59 |
pabelanger | let me see why that is | 14:59 |
AJaeger | mordred: and I care about the draft docs from check pipeline | 15:00 |
dmsimard | mordred, corvus: ok I'll try to send a patch to do the logs only on failure thing | 15:01 |
mordred | AJaeger: k. lemme get the no-log patch written, then let's figure out docs-draft | 15:01 |
mordred | dmsimard: already working on it | 15:01 |
AJaeger | yeah, let's do that... | 15:01 |
dmsimard | mordred: ok I'll review then -- from the state of the gate we might want to accelerate merging that or it won't be effective for a while | 15:01 |
pabelanger | corvus: mind helping with the /dev/mapper/main-logs in-use error | 15:02 |
corvus | pabelanger: it looks like apache is holding files open | 15:02 |
corvus | pabelanger: i will restart it | 15:02 |
* AJaeger needs to go offline for probably rest of day and sends virtual cookies and coffee... | 15:02 | |
pabelanger | ah | 15:02 |
corvus | pabelanger: any better now? | 15:03 |
mordred | dmsimard, corvus: remote: https://review.openstack.org/537929 Only upload logs when jobs fail | 15:03 |
pabelanger | corvus: no, I still see the in-use error from fsck | 15:03 |
pabelanger | corvus: I did use umount -l /srv/static/logs I wonder if that is an issue | 15:03 |
dmsimard | pabelanger: I don't see anything with lsof |grep mapper | 15:03 |
corvus | pabelanger: hrm, i wonder if the lazy unmount from earlier is inhibiting our ability to find out what's using it | 15:04 |
pabelanger | yah, I fear that too | 15:04 |
dmsimard | hang on | 15:04 |
dmsimard | what is that process ? root 26026 0.1 0.0 0 0 ? S Jan15 16:10 [jbd2/dm-0-8] | 15:05 |
corvus | dmsimard: the kernel | 15:05 |
dmsimard | it's holding dm-0 which is the block device for main-logs | 15:05 |
dmsimard | let me try and understand what that process is and if we can kill it safely .. | 15:05 |
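The usual commands for working out what is holding a device that refuses to be fscked or remounted (illustrative; lsof had already come up empty above):

    lsof /dev/mapper/main-logs        # userspace processes with the device node itself open
    fuser -vm /dev/mapper/main-logs   # processes using files on the filesystem, if any
    ls -l /dev/mapper/main-logs       # -> ../dm-0, which matches the jbd2/dm-0-8 thread name
    ps -fp 26026                      # [jbd2/dm-0-8]: the ext4 journal kernel thread for dm-0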
pabelanger | oh, I see a kernel panic in dmesg too | 15:06 |
mordred | EXCELLENT | 15:06 |
pabelanger | [Thu Jan 25 13:34:48 2018] jbd2/dm-0-8 D ffff88020f893ac0 0 26026 2 0x00000000 | 15:06 |
dmsimard | hung tasks | 15:06 |
dmsimard | live migration ? | 15:06 |
dmsimard | https://bugzilla.redhat.com/show_bug.cgi?id=886030 "Unmounting an ext3 filesystem result in a stuck jbd2 process that holds the device in use and prevents the same filesystem to be checked (fsck) or mounted again." | 15:07 |
openstack | bugzilla.redhat.com bug 886030 in kernel "Device in use by a jbd2 process after unmounting an ext3 filesystem" [Unspecified,Closed: duplicate] - Assigned to kernel-maint | 15:07 |
*** cmurphy has joined #openstack-infra-incident | 15:07 | |
dmsimard | https://bugzilla.redhat.com/show_bug.cgi?id=851970 systemd private temp ? | 15:07 |
openstack | bugzilla.redhat.com bug 851970 in systemd "Unable to umount partition as long there are running services that have PrivateTmp set that where started after the partition mount" [High,Closed: currentrelease] - Assigned to mschmidt | 15:07 |
dmsimard | but there's nothing in /tmp | 15:07 |
dmsimard | this is 14.04 so not systemd | 15:08 |
pabelanger | are we able to unload / load that kernel module again or does this mean we need a reboot? | 15:09 |
dmsimard | probably can't reload that live, it's likely in use by the other mounts | 15:09 |
Shrews | dmsimard: i think this is the fix for the launcher: remote: https://review.openstack.org/537932 Handle missing request during a decline. | 15:10 |
pabelanger | okay, I need to grab a coffee real quick. back in a moment | 15:10 |
dmsimard | Shrews: anything to address it in the meantime ? a restart of nodepool-launcher or something ? | 15:11 |
dmsimard | Shrews: if you want to live test it, nl02 is already in emergency file | 15:11 |
Shrews | i think we should restart the launcher on nl02 | 15:11 |
Shrews | i can handle that | 15:11 |
Shrews | we can't really test it live b/c the restart will clear it out | 15:12 |
Shrews | nodepool-launcher restarted on nl02 | 15:12 |
pabelanger | back | 15:13 |
dmsimard | mordred: comment on https://review.openstack.org/#/c/537929/1 | 15:13 |
Shrews | citycloud seems to be at quota too | 15:14 |
Shrews | oh, that's disabled too | 15:14 |
dmsimard | pabelanger: the thing is I'm not sure a reboot would even work -- it's likely a process that isn't killable because it's blocked on I/O, so we'd need a hard reboot ? If we're going to reboot we might as well try to kill it first, it won't make matters worse than a failed reboot/hard reboot | 15:15 |
corvus | i would just reboot | 15:15 |
corvus | i would not recommend trying to kill a kernel process | 15:15 |
corvus | run reboot and if it fails, use the console to reset | 15:15 |
pabelanger | right, we maybe also want to remove entry from /etc/fstab so we don't mount on firstboot? | 15:15 |
dmsimard | corvus: kernel i/o block processes typically can't be killed so we'll need a hard reboot | 15:16 |
dmsimard | yeah | 15:16 |
pabelanger | okay, let me edit /etc/fstab first | 15:16 |
dmsimard | do we have a root password for the console in case we need it ? | 15:16 |
corvus | dmsimard: sure, just don't do that first. :) reboot, then 'reset', which is what you're calling a hard reboot. | 15:16 |
pabelanger | dmsimard: no root password, we'd need to use emergency console I think | 15:17 |
dmsimard | corvus: reset is equivalent to init 0 ? I've never actually used reset before but what I was thinking is a nova reboot --hard | 15:17 |
dmsimard | corvus: which is the equivalent of a reset button on the chassis :p | 15:17 |
corvus | dmsimard: now we're on the same page. | 15:17 |
dmsimard | corvus: oh wait we're saying the same thing.. sorry I'm slow this morning | 15:17 |
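For clarity, the two nova operations being compared (standard nova client commands; the server name is omitted here):

    nova reboot <server>           # soft reboot: asks the guest OS to restart itself
    nova reboot --hard <server>    # hard reboot: hypervisor-level reset, the "reset button" equivalent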
mordred | dmsimard: thanks. you're totally right | 15:17 |
pabelanger | okay, I've commented out /srv/static/logs in /etc/fstab now | 15:17 |
corvus | dmsimard: i agree, now would be a good time to set a root password in case something goes wrong | 15:18 |
mordred | dmsimard: and fixed https://review.openstack.org/#/c/537929/ | 15:18 |
mordred | corvus: ++ | 15:18 |
corvus | dmsimard, pabelanger: how about i set the password and add it to the passwords file? | 15:18 |
pabelanger | sure | 15:18 |
dmsimard | PermitRootLogin is No in /etc/ssh/sshd_config so we're good on that front | 15:18 |
pabelanger | dmsimard: corvus: will also defer to one of you to do reboot | 15:19 |
corvus | okay, password set, saved in file, and i used it to 'su' to root, to verify. | 15:20 |
dmsimard | corvus: where is that password file ? puppetmaster ? | 15:21 |
dmsimard | pabelanger: let me open up the console before we try the reboot, hang on | 15:22 |
dmsimard | it'll be a way to tell if the reboot gets stuck | 15:22 |
corvus | dmsimard: yes... did no one walk you and frickler through that? | 15:23 |
dmsimard | corvus: not familiar with that yet, no -- but we can walk through it later :D | 15:23 |
corvus | dmsimard: yes we should schedule a new root onboarding session :) | 15:24 |
dmsimard | ok I have console on static01 | 15:24 |
dmsimard | we good for reboot ? | 15:24 |
corvus | dmsimard: yes from me | 15:24 |
dmsimard | pabelanger: you good too ? | 15:25 |
pabelanger | yes, reboot when ready | 15:25 |
dmsimard | ok attempting a reboot now | 15:25 |
dmsimard | lost console so this is a good sign | 15:25 |
dmsimard | it's back | 15:26 |
dmsimard | $ uptime | 15:27 |
dmsimard | 15:27:06 up 0 min, 1 user, load average: 0.27, 0.08, 0.03 | 15:27 |
dmsimard | I started a screen called "fsck", starting the fsck | 15:27 |
pabelanger | great | 15:27 |
pabelanger | dmsimard: be sure to use the right flags for the fsck command | 15:28 |
dmsimard | fsck -v -y /dev/mapper/main-logs is running | 15:28 |
dmsimard | I detached from the screen | 15:28 |
pabelanger | thanks | 15:28 |
pabelanger | this will take some time | 15:29 |
dmsimard | need some more reviews on https://review.openstack.org/#/c/537929/ and likely to force submit that | 15:30 |
dmsimard | Shrews: nl02 looks good now, thanks | 15:30 |
dmsimard | Shrews: it's probably worth looking into why we're double logging also | 15:31 |
corvus | i've modified a copy of the log maint script in ~corvus/log_archive_maintenance.sh | 15:31 |
corvus | i'm proposing we delete logs after 120 minutes | 15:32 |
corvus | though it looks like the growth rate is slowing... for some reason? | 15:32 |
dmsimard | Shrews: The DEBUG logging ends up in both nodepool-launcher.log and launcher-debug.log -- we probably either want to take the DEBUG out of nodepool-launcher.log or merge launcher-debug.log into nodepool-launcher.log | 15:33 |
pabelanger | corvus: 120 mins seems fine to me | 15:34 |
dmsimard | corvus: we were at 12% @09:51:43, we're now at 21% @10:33:12 | 15:34 |
dmsimard | (my time) | 15:34 |
dmsimard | ~10% every 30 minutes ? | 15:34 |
pabelanger | dmsimard: nodepool-launcher should only be INFO, would need to check logging.conf | 15:34 |
corvus | dmsimard: so we could double retention to 240 minutes and target 80% | 15:35 |
dmsimard | corvus: sure, there's also nothing preventing us from looking every now and then and change our mind -- that's just a safety net | 15:36 |
corvus | okay, i've written a wrapper script that loops over my modified script and runs it every 60 seconds | 15:38 |
corvus | i've also disabled the normal crontab | 15:38 |
dmsimard | Shrews: btw if we want to keep those logs somewhere safe while we iron this out, I gzipped things in /var/log/nodepool on nl02 | 15:38 |
corvus | i'm going to run my looping script in another window of the fsck screen | 15:39 |
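A minimal sketch of that wrapper loop (the real thing is corvus' modified copy of log_archive_maintenance.sh plus a wrapper; the gzip and retention ages below are the values discussed here, and the find expressions are simplified):

    while true; do
        # compress anything older than ~10 minutes (i.e. zuul has finished uploading it)
        find /srv/static/logs -type f -mmin +10 ! -name '*.gz' -exec gzip {} \;
        # delete anything older than the retention window (120 minutes to start with)
        find /srv/static/logs -type f -mmin +120 -delete
        find /srv/static/logs -type d -empty -delete
        sleep 60
    done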
dmsimard | Shrews: it compresses surprisingly well | 15:39 |
dmsimard | corvus: wfm | 15:39 |
pabelanger | we also need to remember to update /etc/fstab | 15:39 |
pabelanger | I can do that now, if we want | 15:39 |
dmsimard | let's keep that for when we need to do the remount/mv dance | 15:39 |
dmsimard | #status log (dmsimard) fsck started running in a screen on logs.o.o for /dev/mapper/main-logs at around 15:30UTC, logs are being sent straight to /srv/static | 15:42 |
openstackstatus | dmsimard: finished logging | 15:42 |
*** mnaser has joined #openstack-infra-incident | 15:43 | |
dmsimard | #status log (dmsimard) We're running a modified log_archive_maintenance.sh from ~/corvus and https://review.openstack.org/#/c/537929/ as safety nets to keep us from running out of disk space | 15:43 |
openstackstatus | dmsimard: finished logging | 15:43 |
dmsimard | infra-root: if there's no objections I'll submit https://review.openstack.org/#/c/537929/ | 15:44 |
pabelanger | dmsimard: we should be okay for now for zuul to merge it, right? | 15:45 |
corvus | oh interesting. we're gzipping the docs-draft builds. | 15:45 |
corvus | i suspect our normal log processing script is doing that as well, but just slowly enough that we don't notice it | 15:46 |
dmsimard | pabelanger: if we don't submit through gerrit, it means the patch has to go through both check and gate which could be a few hours | 15:46 |
corvus | maybe they'll still work though because of the gzip handling we have on logs.o.o. i guess something to look out for. | 15:46 |
dmsimard | corvus: there's a couple "age" lookups in the find, one for gzip and one for deletion | 15:46 |
corvus | the log processing script still does a *lot* of gzipping, so we may end up with a significant reduction in space | 15:46 |
pabelanger | dmsimard: looks like it might start running shortly, looking at zuul.o.o | 15:47 |
corvus | dmsimard: yeah, the gzip is low, it's 10 minutes. basically enough time to make sure that zuul isn't still uploading files | 15:47 |
dmsimard | pabelanger: there's nothing in the check queue that will tell us if this works or not, it's a trusted playbook | 15:47 |
pabelanger | right, but some of the pressure is off currently and should be able to merge on its own | 15:48 |
pabelanger | dmsimard: also, we can remove nodepool-launcher from emergency? | 15:49 |
dmsimard | pabelanger: maybe, I guess this isn't in the tripleo or integrated gates | 15:49 |
pabelanger | dmsimard: right, once in gate should get nodes right away | 15:49 |
dmsimard | pabelanger: yeah, I wound up not touching the max-servers but I was getting ready to | 15:49 |
dmsimard | pabelanger: I'll remove nl* from emergency (except yours, 03 I think) | 15:49 |
corvus | the current run of the gzip/rm script is printing out files. the next runs will be quiet. | 15:50 |
pabelanger | dmsimard: no, you can also remove that now. Was going to this morning | 15:50 |
dmsimard | pabelanger: 04 is commented by you actually | 15:50 |
dmsimard | pabelanger: ok | 15:50 |
pabelanger | thanks | 15:50 |
dmsimard | pabelanger: nl* removed | 15:50 |
dmsimard | pabelanger: it's a good idea to add comments in the emergency file so we know why things are in there, +1 for that idea | 15:50 |
-openstackstatus- NOTICE: logs.openstack.org is stabilized and there should no longer be *new* POST_FAILURE errors. Logs for jobs that ran in the past weeks until earlier today are currently unavailable pending FSCK completion. We're going to temporarily disable *successful* jobs from uploading their logs to reduce strain on our current limited capacity. Thanks for your patience ! | 16:01 | |
fungi | keep in mind that /srv/static was already 8% full before this (there are a handful of docroots which lack dedicated volumes), though i agree that it's accumulating faster than the fsck is likely to complete (up to 29% now) | 16:16 |
dmsimard | pabelanger: logs.o.o responds a bit erratically.. took me a few seconds to get dmesg back | 16:38 |
pabelanger | load is up | 16:38 |
dmsimard | looks like load from... apache ? fsck has some load | 16:39 |
dmsimard | but apache is the one causing the load | 16:39 |
pabelanger | something hammering server? | 16:40 |
dmsimard | I'm seeing semrush bot, the crawler I banned from RDO but there's not that many hits | 16:40 |
dmsimard | there's a lot of hits like those: 2001:4800:7817:103:be76:4eff:fe04:38d1 - - [25/Jan/2018:16:40:32 +0000] "GET /99/537199/1/gate/trove-scenario-mysql-multi/172fa3c/logs/screen-s-object.txt.gz?level=INFO HTTP/1.1" 200 151 "-" "Python-urllib/2.7" | 16:40 |
dmsimard | ... logstash/elastic-recheck ? | 16:41 |
*** openstackstatus has quit IRC | 16:41 | |
dmsimard | yeah those python-urllib are all logstash workers | 16:41 |
*** openstackstatus has joined #openstack-infra-incident | 16:42 | |
*** ChanServ sets mode: +v openstackstatus | 16:42 | |
dmsimard | lots of i/o wait but I'm trying to narrow it down | 16:43 |
pabelanger | given how busy the gate is, likely too much I/O for the local HDD to keep up with | 16:44 |
dmsimard | pabelanger: we're heavily swapping | 16:44 |
dmsimard | root@static:/var/log/apache2# free -m | 16:45 |
dmsimard | total used free shared buffers cached | 16:45 |
dmsimard | Mem: 7979 7755 224 0 117 31 | 16:45 |
dmsimard | -/+ buffers/cache: 7605 373 | 16:45 |
dmsimard | Swap: 7811 4167 3644 | 16:45 |
pabelanger | pastebin :) | 16:45 |
dmsimard | yeah sorry I meant to only paste the swap line | 16:45 |
dmsimard | the fsck is eating all the ram | 16:45 |
dmsimard | .... why ? | 16:45 |
pabelanger | not sure | 16:46 |
dmsimard | https://serverfault.com/questions/9218/running-out-of-memory-running-fsck-on-large-filesystems "Alternately, there's now an option in e2fsck that'll tell it to store all of it's intermediate results in a directory instead of in RAM, which helps immensely. " | 16:46 |
dmsimard | It's buffering the results in ram | 16:46 |
dmsimard | infra-root ^ | 16:46 |
dmsimard | It might explain why the previous FSCKs were so long actually, we were unknowingly heavily swapping | 16:47 |
corvus | yeah, would be nice to work out how to use that | 16:48 |
dmsimard | Although if you compare swapping to disk vs buffering to disk (instead of ram) the difference is probably not that big.. but it'd make that ram available for apache and.. ssh | 16:48 |
pabelanger | yah, load is up to 30.00 | 16:48 |
pabelanger | so we should consider doing something | 16:48 |
dmsimard | I'm going to interrupt the fsck | 16:49 |
dmsimard | or try to | 16:49 |
pabelanger | wait | 16:49 |
pabelanger | maybe just stop apache? | 16:49 |
corvus | please don't stop the fsck | 16:49 |
dmsimard | sorry, too late :/ | 16:49 |
corvus | seriously? | 16:49 |
corvus | okay this is all you. | 16:49 |
*** corvus has left #openstack-infra-incident | 16:51 | |
pabelanger | okay, we need to come up with a plan. To get fsck running again | 16:52 |
pabelanger | dmsimard: ^ | 16:53 |
dmsimard | fsck is running again with scratch files in /opt/fsck | 16:54 |
pabelanger | okay, next time we should likely wait until everybody is on board with the plan. Can't fix that now, but for the future | 16:55 |
mordred | dmsimard: awesome | 16:55 |
mordred | dmsimard: and TIL about fsck scratchfiles | 16:56 |
dmsimard | yes, I agree that I should have waited for an ack -- I really apologize for not waiting on a consensus before moving forward | 16:56 |
dmsimard | I've dealt with something like this before, it is familiar territory but I won't let that happen again | 16:56 |
mordred | dmsimard: and yah - the performance of the fsck isn't likely to be much different than it would via just swapping - but the memory pressure competition would be much better | 16:57 |
mordred | hopefully | 16:57 |
dmsimard | I have a watch du -sh running inside a window of the screen | 16:58 |
dmsimard | for the /opt/fsck directory | 16:58 |
dmsimard | We have 66GB available in /opt, it should be largely sufficient | 16:58 |
mordred | dmsimard: if it's not sufficient I'm going to be very sad :) | 16:58 |
pabelanger | okay, which flags did we use for fsck now? | 17:04 |
pabelanger | want to better understand the change made | 17:05 |
mordred | dmsimard: yah. agree. also - should we consider putting an /etc/e2fsck.conf on files01 with that setting in it? | 17:08 |
dmsimard | mordred: It's probably worth putting in puppet for servers where we have large volumes | 17:11 |
dmsimard | pabelanger: the change was to configure e2fsck to write its scratch files to disk; it defaults to RAM, which is faster but is not appropriate for very large filesystems | 17:12 |
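The setting in question, as it would appear in /etc/e2fsck.conf (per e2fsck.conf(5); the directory is the one used on logs.o.o here):

    [scratch_files]
    directory = /opt/fsck    # spill e2fsck's pass bookkeeping to disk instead of holding it in RAM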
dmsimard | mordred: I'll send a patch now | 17:12 |
pabelanger | yah, puppet change sounds great | 17:14 |
dmsimard | pabelanger: Do we have any other servers that have very large partitions ? | 17:15 |
pabelanger | dmsimard: openstack volume list would be a way to check | 17:22 |
pabelanger | static.o.o is the largest | 17:22 |
mordred | I don't think we have anything else that's crazypants like static | 17:22 |
dmsimard | ok let's enable it for this one for now with the ability to enable it elsewhere on a need basis | 17:23 |
mordred | dmsimard: ++ | 17:30 |
*** panda is now known as panda|bbl | 17:33 | |
*** weshay|ruck is now known as weshay|ruck|brb | 18:09 | |
*** weshay|ruck|brb is now known as weshay | 18:27 | |
*** myoung is now known as myoung|biab | 18:58 | |
*** myoung|biab is now known as myoung | 19:20 | |
*** myoung is now known as myoung|cheesebur | 19:34 | |
*** myoung|cheesebur is now known as myoung|food | 19:34 | |
*** myoung|food is now known as myoung | 20:02 | |
dmsimard | infra-root: Status update... We're hovering around 55% logs and we've started deleting things with a 240m timeout, we might want to increase that to try and delete less things. Thoughts ? | 20:18 |
dmsimard | mordred: also I suspect we might have been mistaken about the performance impact of on-disk scratch files -- we're still not into phase 2 of the fsck going on 3 hours in, whereas we had reached phase 2 within ~1h30 before restarting | 20:20 |
dmsimard | Also, the e2fsck man mentions that we can send a SIGUSR1 to the process to display a progress bar (and SIGUSR2 to stop displaying it) | 20:22 |
dmsimard | SIGUSR1 This signal causes e2fsck to start displaying a completion bar or emitting progress information. (See discussion of the -C option.) | 20:22 |
dmsimard | Should we enable that to see how we're doing ? | 20:22 |
pabelanger | dmsimard: how long is it taking to delete stuff? | 20:26 |
dmsimard | pabelanger: not sure, this is the output we get in the screen http://paste.openstack.org/raw/653497/ | 20:27 |
dmsimard | I guess anywhere between near instantaneous and 20 minutes ? | 20:27 |
pabelanger | yah, lets give it a little longer, since we are at 55% | 20:28 |
pabelanger | I've never tried SIGUSR1 before with fsck, maybe others know | 20:29 |
pabelanger | but, not being in phase2 yet is troublesome | 20:29 |
dmsimard | I raised it to 300 up from 240 | 20:30 |
dmsimard | the server is swapping just a tiny bit but is otherwise healthy | 20:31 |
dmsimard | without scratch files we were 4GB into swap | 20:31 |
dmsimard | Really the worst timing for this to be happening at all, the other times we were lucky it happened during the night or the weekend and we just left fsck running with logs.o.o offline.. | 20:33 |
dmsimard | pabelanger: I have to step away for a few minutes but I don't think sigusr1 could help us, it seems to imply that the -C <file> argument was already specified.. I might try to test it on a local server to see if it works | 20:34 |
dmsimard | https://www.youtube.com/watch?v=cS4eztFtS0U makes it look like it's possible, I'll test and confirm | 20:36 |
dmsimard | brb | 20:37 |
*** rlandy is now known as rlandy|brb | 20:47 | |
*** myoung is now known as myoung|pto | 20:53 | |
dmsimard | ok, testing this locally on a ubuntu 14.04 vm.. | 20:59 |
*** rlandy|brb is now known as rlandy | 21:12 | |
ianw | the progress options to fsck do work, i've used them before | 21:17 |
dmsimard | ianw: do you have the usage handy ? I just don't want to kill the ongoing fsck mistakenly | 21:18 |
dmsimard | almost done trying to test it locally.. | 21:19 |
dmsimard | yeah I'm not able to reproduce locally, I don't have a big enough disk that gives me enough time to even try it.. fsck completes near instantaneously :/ | 21:21 |
ianw | i'm 99% sure i've sent it sigusr1 before | 21:21 |
ianw | don't worry, you only start it once without the -C , next time you always remember :) | 21:21 |
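For reference, what that looks like (from e2fsck(8); the PID is whichever fsck.ext4 process is actually running):

    e2fsck -C 0 -v -y /dev/mapper/main-logs   # -C 0 prints a completion bar from the start
    kill -USR1 <fsck.ext4-pid>                # toggle progress output on an already-running e2fsck
    kill -USR2 <fsck.ext4-pid>                # turn it back off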
dmsimard | ianw: we didn't see any -C in the history :/ http://paste.openstack.org/raw/653419/ | 21:22 |
ianw | anyway, send it to the fsck.ext4 process, not the parent process, i think that got us once before | 21:24 |
dmsimard | ianw: so we have http://paste.openstack.org/show/653506/ and what we want is... kill -SIGUSR1 27665 ? | 21:27 |
dmsimard | Using a kill command to do this kind of stuff makes me paranoid, whoever thought that was a good idea... | 21:28 |
ianw | dmsimard: yep | 21:29 |
dmsimard | ianw: ok, trying now. | 21:29 |
dmsimard | that worked | 21:29 |
dmsimard | we're at 48% of phase 1.. ouch | 21:29 |
ianw | that said, it doesn't really give you a timeline | 21:30 |
ianw | if you look back through the logs, fungi and i had to do it a little while ago? | 21:30 |
dmsimard | ianw: so to give you some context | 21:30 |
ianw | i was watching it pretty closely, so you could probably correlate the start/finish from the irc logs | 21:30 |
dmsimard | ianw: today is release day and it was kind of a downer to keep logs.o.o down for like 8 hours -- so we're currently sending logs to /srv/static directly and we'll want to do a mv/remount dance once the fsck has finished | 21:31 |
ianw | yep, last time i just bind mounted /srv/static somewhere and rsync'd the intermediate logs back in | 21:32 |
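The dance ianw describes, roughly (mount points are assumptions; this is the shape of it, not the exact commands used previously):

    mount /dev/mapper/main-logs /srv/static/logs         # after fsck finishes, remount the repaired LV
    mkdir -p /mnt/static-root
    mount --bind /srv/static /mnt/static-root             # a non-recursive bind mount exposes the files
                                                           # hidden under the /srv/static/logs mountpoint
    rsync -a /mnt/static-root/logs/ /srv/static/logs/     # move the logs accumulated during the outage back in
    rm -rf /mnt/static-root/logs/*                         # reclaim the space on the parent volume
    umount /mnt/static-root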
dmsimard | ianw: We realized when running the usual logs.o.o workload side by side with the fsck that it was going to be difficult because the fsck process was eating all the RAM (swapping >4GB and causing log uploads to keep failing) | 21:33 |
ianw | that is definitely a new one, i can't remember that happening any of the times we've had to recover before | 21:34 |
dmsimard | ianw: so I had fsck send the scratch files to disk instead of ram -- the patch for it is here https://review.openstack.org/#/q/topic:e2fsck | 21:34 |
dmsimard | I jumped the gun on that one, there might've been another avenue but I had already interrupted the fsck | 21:35 |
ianw | probably worth understanding if that's an artefact of certain types of corruption that we just happened to hit, or something else | 21:37 |
ianw | public holiday here and duty calls, good luck :) | 21:37 |
dmsimard | o/ | 21:38 |
dmsimard | For the record, there seems to be on average between 5GB and 6GB of scratch files per batch, ex: http://paste.openstack.org/show/653511/ | 21:56 |
dmsimard | If we were working with 16GB of RAM instead of our existing 8GB, we might be able to work without writing to disk | 21:56 |
dmsimard | We're currently at 51% of phase 1 :( | 21:57 |
pabelanger | that's about 4 hours in right? | 22:01 |
pabelanger | dmsimard: mordred: do we want to discuss switching back to the original method? 4 hours into the fsck and 51% of phase 1, it feels like we are in for a long haul | 22:03 |
dmsimard | 5 hours in | 22:04 |
pabelanger | logs look stable at 50% /srv/static | 22:04 |
dmsimard | pabelanger: yeah I bumped the timer again | 22:04 |
dmsimard | pabelanger: can't deny it's going very slowly but the flipside is that it allows jobs to complete ? If we go back to RAM we'll start getting load and failures again -- an alternative would be to resize to 16GB RAM and only then resume it back in ram | 22:06 |
pabelanger | well, we've done this before while jobs have been running. Mind you, not during the release week | 22:07 |
dmsimard | yeah the gate is running at max capacity :( | 22:07 |
pabelanger | if we go back to normal fsck, we could consider reducing number of jobs running at once in nodepool. In an effort to keep things moving and bring volume back online | 22:08 |
pabelanger | infra-root: ^ do you have any thoughts? | 22:08 |
dmsimard | Getting any post failures would yield gate resets and at 100 patches in the gate that's crazy expensive, I would do everything we can to avoid that | 22:11 |
mordred | pabelanger: reading | 22:12 |
dmsimard | 5 hours in at 52% of phase 1 fsck is beyond awful.. I would consider bumping ram to 16GB and resuming without on disk scratch files but I don't have any other ideas | 22:14 |
pabelanger | well, we are swapping now: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=143&rra_id=all | 22:21 |
pabelanger | slowly growing | 22:21 |
pabelanger | so, I feel we might just be delaying the unhappiness of swapping | 22:22 |
pabelanger | I'm unsure about resizing logs.o.o either, until others are okay with it. | 22:23 |
dmsimard | Ok I'm going to start a pad to summarize our options | 22:24 |
pabelanger | ok | 22:27 |
dmsimard | working on https://etherpad.openstack.org/p/2tAC4qI2eR | 22:29 |
dmsimard | mordred: do you have an opinion ? | 22:40 |
clarkb | isn't option 5 "don't fsck"? | 22:41 |
dmsimard | clarkb: we admittedly didn't test if the volume was mountable in the first place | 22:42 |
clarkb | it's data with a retention period of 4 weeks | 22:43 |
clarkb | if we lose data I don't think it's a huge deal as long as the fs is functional | 22:43 |
clarkb | but unsure how risky that is | 22:43 |
dmsimard | clarkb: right, I know we don't care that much about the data but it might not let you mount it at all | 22:43 |
dmsimard | But I agree it's worth considering | 22:43 |
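What clarkb's alternative would look like (not what was done, and a dirty ext4 filesystem may refuse to mount or go read-only again on the first error):

    mkdir -p /mnt/logcheck
    mount -o ro /dev/mapper/main-logs /mnt/logcheck   # see whether the filesystem mounts at all
    dmesg | tail -n 50                                # watch for ext4/jbd2 complaints during the attempt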
clarkb | ok I have to afk again, just wanted to call that out | 22:44 |
pabelanger | I'm not sure how else to check it, outside of fsck | 22:45 |
*** corvus has joined #openstack-infra-incident | 22:54 | |
dmsimard | 56% of phase 1.. but most jobs are green when looking at the gate in http://zuul.openstack.org/builds.html .. we're increasingly swapping but load is stable -- it's like there's no good solution to this problem but we're probably slowly running into a wall if we don't do anything | 22:56 |
dmsimard | damned if you do, damned if you don't ? | 22:56 |
dmsimard | There's 5 phases to go through, it's unrealistic to keep going at this rate | 22:57 |
dmsimard | pabelanger: how would you want to reduce the pressure ? reduce max-servers ? | 22:57 |
dmsimard | oh crap, hang on | 22:58 |
*** rlandy is now known as rlandy|biab | 23:26 | |
dmsimard | fsck 65% phase 1 | 23:52 |
mnaser | fsck progress is weirdly non linear | 23:55 |