*** pleia2 has quit IRC | 02:43 | |
*** pleia2 has joined #openstack-infra-incident | 02:50 | |
*** tumbarka has quit IRC | 07:02 | |
-openstackstatus- NOTICE: The CI system will be offline starting at 11:00 UTC (in just under an hour) for Zuul v3 rollout: http://lists.openstack.org/pipermail/openstack-dev/2017-October/123337.html | 10:08 | |
-openstackstatus- NOTICE: Due to unrelated emergencies, the Zuul v3 rollout has not started yet; stay tuned for further updates | 13:05 | |
*** rosmaita has joined #openstack-infra-incident | 15:03 | |
jeblair | fungi, clarkb, mordred: looking into the random erroneous merge conflict error, i found some interesting information. | 17:04 |
fungi | all ears | 17:04 |
jeblair | when i warmed the caches on the mergers and executors, i cloned from git.o.o | 17:04 |
jeblair | that left the origin as git.o.o | 17:05 |
jeblair | if *zuul* clones a repo for its merger, it clones from gerrit, and leaves the origin as gerrit | 17:05 |
jeblair | our random merge failures are because we're pulling changes from the git mirrors before they have updated | 17:05 |
fungi | ahh. so the warming process should have used gerrit as the origin? | 17:05 |
jeblair | this also explains why we're seeing git timeouts only in v3 | 17:05 |
jeblair | fungi: yes, or at least switched the origin after cloning | 17:06 |
fungi | but yes, i agree, that does provide a great explanation for the git timeout situation | 17:06 |
clarkb | jeblair: you think the timeout is because it tries to fetch a non-existent ref from git.o.o? | 17:06 |
AJaeger | jeblair: good detective work! | 17:06 |
jeblair | clarkb: i don't know what is causing the timeout, but it's https to a different server, versus ssh to gerrit | 17:06 |
jeblair | clarkb: so there's *lots* of variables that are different :) | 17:07 |
fungi | well, at least the work around improving git remote operation robustness isn't a waste | 17:07 |
clarkb | got it | 17:07 |
jeblair | clarkb: the *merge failure* i tracked down though was because the ref had not updated yet | 17:07 |
jeblair | fungi: oh, yeah, i still think that's good stuff | 17:07 |
jeblair | anyway, i think we have two choices here: | 17:07 |
jeblair | 1) update the origins to review.o.o | 17:08 |
jeblair | 2) make using git.o.o more reliable | 17:08 |
fungi | i'm in favor of #1 for now... that's more or less what the mergers for v2 were doing right? | 17:09 |
clarkb | gerrit does emit replication events now that could possibly be used, but we'd have to have some logic around "are all the git backends updated" which may get gross in zuul (which shouldn't really need to know the dirty details of replication) | 17:09 |
jeblair | 1) is easy, and returns us to v2 status-quo -- mostly. the downside to that is that v3 is much more intensive about fetching stuff from remotes (it merges things way more than necessary) so we are likely to see increased load on gerrit. | 17:09 |
fungi | i mean, having the option of offloading that to git.o.o would be nice, but not a regression over v2 | 17:09 |
jeblair | 2) clarkb just pointed out what would be involved in 2. it's some non-straightforward zuul coding. | 17:09 |
fungi | i agree the increased gerrit traffic is something to keep an eye on, but not necessarily a problem | 17:10 |
fungi | as an aside, how many sessions to the ssh api are we likely to open in parallel from a single ip address? | 17:11 |
fungi | ssh api and/or ssh jgit | 17:11 |
clarkb | I think just one because we'll only grab a single gearman job at a time? | 17:11 |
jeblair | there are 2 things that make it inefficient -- we merge once for every check pipeline, and then we merge once for every build. both of those things can be improved, but not easily. though, having fewer 'check' pipelines will help. v3 has 3 now. | 17:12 |
jeblair | clarkb, fungi: yes, one per ip. | 17:12 |
fungi | just wanting to keep in mind that we're currently testing a connlimit protection on review.o.o to help keep runaway ci systems under 100 concurrent connections, but sounds like this wouldn't run afoul of it anyway | 17:13 |
jeblair | so it's a total of 14 from v3 in our current setup. likely 18 when we're fully moved over. compared to 4 now (but 8 when we were at full strength v2) | 17:13 |
fungi | so around a 2x bump | 17:13 |
fungi | seems safe enough | 17:14 |
*** efried has joined #openstack-infra-incident | 17:15 | |
mordred | it seems to me like trying 1) while in our current state (we won't see gate merge load - but with the check pipelines we should still see a lot) | 17:15 |
efried | jeblair o/ Sorry, didn't even know this channel existed | 17:15 |
jeblair | efried: no i'm sorry; i should have mentioned i was switching | 17:15 |
mordred | would give us an idea of whether or not it's workable to do that | 17:16 |
clarkb | mordred: ++ | 17:16 |
jeblair | mordred: ya good point | 17:16 |
mordred | however - it also seems like since we have a git farm, even if 1 is workable we still may want to consider putting 2 on the backlog | 17:16 |
jeblair | mordred: we'll actually see the full load | 17:16 |
mordred | jeblair: oh - good point | 17:16 |
mordred | that's great | 17:16 |
jeblair | ya i didn't even think about that till you mentioned it | 17:16 |
mordred | I think the results of 1 will tell us how urgent 2 is | 17:17 |
jeblair | this scales with the size of the zuul cluster, not anything else (as long as it's not idle) | 17:17 |
jeblair | okay, so i'll just go ahead and update all the origins | 17:17 |
jeblair | and, um, if gerrit stops then i'll stop zuulv3 :) | 17:17 |
efried | Cool beans guys, thanks for tracking this down! | 17:18 |
mordred | jeblair: cool | 17:19 |
mordred | jeblair: if gerrit stops, maybe just update the origins all back while we re-group :) | 17:19 |
fungi | plan sounds good | 17:21 |
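For context, "update all the origins" amounts to repointing each cached repo on the mergers and executors from the git.openstack.org mirrors back at Gerrit. A minimal sketch, assuming a cache path and remote URL layout that are not stated in the log:

    # Repoint every cached repo at Gerrit; path and URL are illustrative assumptions.
    MERGER_GIT_DIR=/var/lib/zuul/git                     # assumed merger cache location
    GERRIT_BASE=ssh://zuul@review.openstack.org:29418    # assumed Gerrit SSH URL

    for repo in "$MERGER_GIT_DIR"/*/*; do
        [ -d "$repo/.git" ] || continue
        project=${repo#"$MERGER_GIT_DIR"/}               # e.g. openstack/nova
        git -C "$repo" remote set-url origin "$GERRIT_BASE/$project"
    done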
AJaeger | efried: the channel is published on eavesdrop.openstack.org | 17:30 |
efried | AJaeger Yup, thanks, I'm caught up. | 17:30 |
*** rosmaita has quit IRC | 17:50 | |
*** rosmaita has joined #openstack-infra-incident | 17:51 | |
*** efried is now known as efried_nomnom | 18:17 | |
dmsimard | fungi: So just to wrap something up that worried me earlier, because I'm involved in this to some extent... As far as I can understand, "rh1 closing next week" is a misunderstanding. It is not closing next week. Soon, but not next week. | 18:19 |
dmsimard | fungi: Soon, like, a matter of weeks, not months. | 18:19 |
dmsimard | The bulk of the work is already done and many different jobs have already been running off of RDO's Zuul | 18:20 |
fungi | dmsimard: yeah, the irc discussion in #-dev more or less confirmed that as well, but "soon" at least | 18:34 |
*** efried_nomnom is now known as efried | 18:58 | |
* mordred waves to fungi and jeblair | 20:42 | |
jeblair | okay yeah, i thought we'd run into the inode thing before | 20:43 |
jeblair | i guess we reformatted? | 20:43 |
fungi | so the manpage for mkfs.ext4 (shared by ext2 and ext3) says this about the -i bytes-per-inode value: "Be warned that it is not possible to change this ratio on a filesystem after it is created, so be careful deciding the correct value for this parameter. Note that resizing a filesystem changes the number of inodes to maintain this ratio." | 20:43 |
jeblair | i checked the infra status log but did not find anything :( | 20:43 |
*** Shrews has joined #openstack-infra-incident | 20:44 | |
fungi | so if we grew the size of the filesystem, we'd get more inodes, but short of that... | 20:44 |
jeblair | yeah, if only we could, right? | 20:45 |
fungi | jeblair: looking back through my infra channel log now. found an incident from 2013-12-08 | 20:45 |
fungi | oh, docs-draft in 2014-03-03 | 20:46 |
fungi | zm03 filled to max inodes for its rootfs on 2015-10-12 | 20:47 |
fungi | ran out of inodes for /home/gerrit2 on review.o.o on 2016-02-13 | 20:50 |
fungi | concluded at the time that it's unfixable for the fs, and so moved everything to a new cinder volume | 20:51 |
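Background for the exchange above: ext4 fixes the inode count at mkfs time via the bytes-per-inode ratio, and the count only grows if the filesystem itself grows. A quick way to inspect it, using the main-logs device discussed later (the grep patterns assume standard tune2fs output):

    df -i /srv/static/logs     # inode usage alongside block usage
    sudo tune2fs -l /dev/mapper/main-logs | grep -E 'Inode count|Block count|Inode size'
    # The ratio is only settable at creation, e.g. mkfs.ext4 -i 4096 <device>;
    # growing the LV and running resize2fs adds inodes in proportion.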
clarkb | tripleo is still logging all of /etc in places last I checked | 20:52 |
clarkb | do we think its stuff like that using all the inodes? | 20:52 |
clarkb | also ara is tons of little files | 20:52 |
fungi | http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-02-13.log.html#t2016-02-13T14:47:22 | 20:53 |
mordred | the find is doing the gzip pass as well as the prune pass ... should we perhaps do a find pass that doesn't do gzipping ... and maybe add something to prune crazy /etc dirs? | 20:53 |
dmsimard | clarkb: yes, an ara static report is not large but indeed lots of smaller files | 20:54 |
dmsimard | That's why I would like to explore other means of providing the report | 20:54 |
jeblair | fungi: thanks, that archeology helps :) | 20:54 |
fungi | yeah, in past emergencies i've made a version of the log maintenance script which omits the random delay, the docs-draft bit, and drops the compression stanza so it's just a deleter | 20:55 |
fungi | and usually significantly dropped the retention timeframe while at it | 20:55 |
mordred | fungi: do you happen to have any of those versions around still? | 20:55 |
fungi | mordred: static.o.o:~fungi/log_archive_maintenance.sh | 20:56 |
mordred | yah - ~fungi/log_archive_maintenance.sh exists | 20:56 |
fungi | heh, it's like i'm predictable | 20:56 |
mordred | :) | 20:56 |
mordred | infra-root: shall we stop the current cleaner script and run fungi's other version? or do we want to investigate other options? | 20:58 |
fungi | shall i #status alert something for now? | 20:58 |
fungi | mordred: yeah, i would stop it | 20:58 |
jeblair | mordred: wfm | 20:58 |
jeblair | fungi: ++ | 20:58 |
fungi | running more than one at a time is just more i/o traffic slowing both down | 20:58 |
mordred | jeblair, fungi: ok. do I need to do anything special to stop it other than kill? | 20:58 |
fungi | also make sure something else like an mlocate.db update isn't running and having similar performance impact | 20:59 |
ianw | can someone give a 2 sentence summary of what's wrong for those of us who might not have been awake (i.e. me ;) | 20:59 |
jeblair | ianw: logs is full | 20:59 |
fungi | mordred: you can just kill the parent shell process and then kill the find | 20:59 |
fungi | and that way it should wrap up without trying to run any subsequent commands | 20:59 |
mordred | fungi: cool | 20:59 |
*** ChanServ changes topic to "logs volume is full" | 20:59 | |
fungi | and it should release the flock on its own that way | 21:00 |
mordred | fungi: I'm going to start a screen session called repair_logs | 21:00 |
fungi | good idea | 21:00 |
pabelanger | speaking of full volumes, this is from afs01.dfw.o.o: /dev/mapper/main-vicepa 3.0T 3.0T 30G 100% /vicepa | 21:00 |
jeblair | mordred, fungi: and then maybe grab the flock in the custom script? | 21:00 |
jeblair | pabelanger: want to throw another volume at it? | 21:00 |
mordred | jeblair: it grabs the flock | 21:00 |
fungi | jeblair: the custom script already flocks the same file | 21:00 |
jeblair | pabelanger: we can expand vicepa | 21:00 |
jeblair | mordred, fungi: cool | 21:00 |
pabelanger | jeblair: ya, let me see if there is anything to clean up first | 21:01 |
jeblair | pabelanger: just the usual lvm stuff | 21:01 |
jeblair | pabelanger: oh, maybe the docs backup volume? | 21:01 |
jeblair | old-docs or whatever | 21:01 |
mordred | fungi: bash script killed - next kill the find, not the flock? | 21:01 |
ianw | pabelanger: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-10-09.log.html#t2017-10-09T06:15:55 ... that didn't take long :/ | 21:01 |
fungi | mordred: yeah | 21:01 |
mordred | fungi: and let flock just exist | 21:01 |
pabelanger | jeblair: k, let me see | 21:01 |
mordred | k | 21:01 |
fungi | the flock should terminate | 21:01 |
fungi | on its own | 21:01 |
jeblair | ianw, pabelanger: i read that we still have 30G, right? | 21:01 |
jeblair | so, i mean, probably several more hours! | 21:02 |
pabelanger | Yah :) | 21:02 |
mordred | ok. I have started ~fungi/log_archive_maintenance.sh | 21:02 |
pabelanger | jeblair: okay, so docs-old can be deleted? | 21:02 |
ianw | jeblair: yeah, so it was 45gb when i posted, and 30gb now | 21:02 |
jeblair | pabelanger: i think so, but let's check with AJaeger or other docs folks first. i think that would only give us 10G anyway. | 21:03 |
fungi | i checked iotop, and it looks like that find doesn't have major competition (unless we want to also stop apache and lock down the jenkins/zuul accounts) | 21:03 |
ianw | pabelanger / jeblair : i'll take an action item to add a volume to that, after current crisis | 21:03 |
pabelanger | ianw: wfm | 21:03 |
jeblair | ianw: ack, thx | 21:03 |
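For reference, "the usual lvm stuff" for growing /vicepa would look roughly like this once a new cinder volume is attached (the VG/LV names follow /dev/mapper/main-vicepa; the device name and size are assumptions):

    sudo pvcreate /dev/xvdc                 # new cinder volume, device name assumed
    sudo vgextend main /dev/xvdc
    sudo lvextend -L +1T /dev/main/vicepa   # size illustrative
    sudo resize2fs /dev/main/vicepa         # if ext4; xfs_growfs /vicepa for xfs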
fungi | but yeah, if the last find/delete/compress pass ran for 11 days and counting, i expect we've significantly increased inode count beyond just normal block count increases | 21:04 |
pabelanger | I'm going to add mirror-update to emergency, and first manually apply https://review.openstack.org/502316/ | 21:04 |
pabelanger | that will free up some room with the removal of opensuse-42.2 | 21:04 |
jeblair | pabelanger, ianw: i went ahead and asked in #openstack-doc, but ajaeger is gone for the day so we shouldn't expect a complete answer about docs-old until tomorrow | 21:05 |
pabelanger | k | 21:05 |
ianw | ok, yeah one less distro will help | 21:06 |
jeblair | pabelanger, ianw: dhellman asks that we *do* keep docs-old around for a while longer. so let's not do anything with that volume, and just expand via lvm. | 21:08 |
pabelanger | ++ | 21:09 |
dhellmann | ideally we'll be able to delete docs-old by the end of the cycle. I've made a note to coordinate with you all about that | 21:09 |
pabelanger | /dev/mapper/main-vicepa 3.0T 2.9T 80G 98% /vicepa | 21:09 |
pabelanger | back up to 80GB with removal of opensuse 42.2 | 21:09 |
pabelanger | should be enough room until ianw gets the other volume | 21:09 |
ianw | excellent, that's breathing room | 21:09 |
pabelanger | I'll clean up 502316 and get that approved | 21:10 |
mordred | /dev/mapper/main-logs 768M 768M 0 100% /srv/static/logs | 21:10 |
mordred | we are not removing them faster than we are making them - at least not yet | 21:10 |
pabelanger | Yah, last time I ran it, took a little bit to get ahead of the curve | 21:11 |
mordred | welp - nothing I like more than watching a find command sit there and churn :) | 21:11 |
clarkb | vroom vroom | 21:12 |
jeblair | mordred: on the plus side -- it doesn't look like inodes are a problem now! | 21:12 |
jeblair | dhellmann: thanks! | 21:13 |
pabelanger | mordred: I see 95% now :D | 21:14 |
mordred | jeblair: oh - that was df -hi | 21:15 |
jeblair | mordred: oh i never thought to use -h with -i :) | 21:18 |
jeblair | mordred: though, clearly 768M should have clued me in | 21:18 |
pabelanger | we're trying to clean up old logs, I see some movement | 21:26 |
mordred | infra-root: I'm finally seeing non-zero numbers of free inodes | 21:59 |
mordred | /dev/mapper/main-logs 768M 768M 11K 100% /srv/static/logs | 21:59 |
mordred | well - that was short lived | 21:59 |
mordred | seriously - something just ate 11k inodes | 21:59 |
clarkb | we should probably count the ara inode count and tripleo logs inode count | 21:59 |
clarkb | as I think it likely those two are at least partially to blame | 22:00 |
pabelanger | yah, ps show tripleo jobs currently using logs folder | 22:01 |
fungi | should be able to adapt some of our earlier analysis scripts to do average inodes per job et cetera | 22:01 |
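A rough way to get those per-build counts, using the change directory pasted below (the jobname/build-id glob is an assumption about the layout):

    # File (and therefore roughly inode) count per build directory, largest first
    for build in /srv/static/logs/27/505827/14/gate/*/*/; do
        printf '%8d %s\n' "$(sudo find "$build" | wc -l)" "$build"
    done | sort -rn | head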
pabelanger | /srv/static/logs/27/505827/14/gate/gate-tripleo-ci-centos-7-containers-multinode/8bd0f54 | 22:01 |
mordred | mordred@static:/srv/static/logs/27/505827/14/gate/gate-tripleo-ci-centos-7-containers-multinode/8bd0f54$ sudo find . | wc -l | 22:02 |
mordred | 21374 | 22:02 |
mordred | so that's 21k files per job | 22:02 |
mordred | so ... | 22:03 |
mordred | ./logs/undercloud/tmp/ansible/lib64/python2.7/site-packages/ansible/modules/network/f5/.~tmp~ | 22:03 |
dmsimard | mordred: in their defense, there's probably like 3 ara reports in there | 22:03 |
mordred | nope - that's not it | 22:03 |
mordred | an ara report is 445 files | 22:03 |
dmsimard | mordred: 1) devstack gate 2) from oooq-ara and from zuul v3 | 22:04 |
dmsimard | okay | 22:04 |
pabelanger | Um | 22:04 |
pabelanger | /srv/static/logs/27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp | 22:04 |
pabelanger | du -h | 22:04 |
pabelanger | that is glorious | 22:04 |
pabelanger | they are copying back ansible tmp files | 22:04 |
mordred | yah | 22:05 |
mordred | tons of things like logs/subnode-2/etc/pki/ca-trust/source/.~tmp~ | 22:05 |
pabelanger | yup | 22:05 |
pabelanger | I have to run, but can help out when we I get back | 22:05 |
pabelanger | EmilienM: ^ might want to prepare for incoming logging work again | 22:06 |
dmsimard | What I do know, and I saw that recently | 22:06 |
EmilienM | hi | 22:06 |
dmsimard | is that they glob the entirety of /var/log/** | 22:06 |
* EmilienM reads context | 22:06 | |
mordred | EmilienM: we're out of inodes on logs.o.o | 22:06 |
pabelanger | http://logs.openstack.org//27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp | 22:06 |
EmilienM | oh dear | 22:06 |
pabelanger | contains a ton of ansible tmp files | 22:06 |
EmilienM | who added tmp dir | 22:07 |
EmilienM | i noticed that recently, let me look | 22:07 |
pabelanger | okay, have to run now. | 22:07 |
pabelanger | bbiab | 22:07 |
mordred | we should put in a filter to not copy over anything that has '.~tmp~' in the path | 22:07 |
*** weshay|ruck has joined #openstack-infra-incident | 22:07 | |
weshay|ruck | hello | 22:07 |
mordred | $ sudo find logs | grep -v '.~tmp~' | wc -l | 22:08 |
mordred | 3369 | 22:08 |
mordred | $ sudo find logs | grep '.~tmp~' | wc -l | 22:08 |
mordred | 18004 | 22:08 |
weshay|ruck | did tripleo fill up the log server? | 22:08 |
dmsimard | weshay|ruck: can we cherry-pick what we want from /var/log instead of globbing everything ? https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/collect-logs/defaults/main.yml#L5 | 22:08 |
EmilienM | weshay|ruck: since when we have tmp? http://logs.openstack.org//27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp | 22:08 |
mordred | 18k of the files are things with .~tmp~ in the path | 22:08 |
EmilienM | weshay|ruck: out of inodes on logs.o.o | 22:08 |
mordred | so if we can just stop uploading those I think we'll be in GREAT shape | 22:08 |
dmsimard | weshay|ruck: tripleo did not single handedly fill up the log server, but contributes to the problem | 22:08 |
weshay|ruck | we cherry pick quite a bit afaik | 22:08 |
weshay|ruck | but can look further | 22:08 |
mordred | srrsly - just filter '.~tmp~' and I think we'll be golden | 22:09 |
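However the job logs are staged for upload, a filter along these lines would drop ansible's scratch directories; rsync here is an illustration of the pattern, not the actual collect-logs implementation:

    # Skip any path component named .~tmp~ while copying collected logs
    rsync -a --exclude='.~tmp~' collected-logs/ staged-logs/
    # same idea when tarring: tar --exclude='*.~tmp~*' -czf logs.tar.gz collected-logs/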
dmsimard | weshay|ruck: I think the ansible tmpdirs is the most important part to fix | 22:09 |
dmsimard | weshay|ruck: there's no value in logging those directories at all http://logs.openstack.org/27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp/ | 22:10 |
weshay|ruck | that's odd.. I don't remember tmp being there | 22:10 |
weshay|ruck | k.. sec I'll add it to our explicit exclude | 22:10 |
weshay|ruck | that's new though | 22:10 |
EmilienM | yeah | 22:11 |
EmilienM | I'm still going through git log now | 22:11 |
mordred | thing is- apache won't even show the files with .~tmp~ in them | 22:12 |
mordred | http://logs.openstack.org/27/505827/14/gate/gate-tripleo-ci-centos-7-containers-multinode/8bd0f54/logs/undercloud/tmp/ansible/lib64/python2.7/encodings/ is an example ... | 22:12 |
EmilienM | is that https://review.openstack.org/#/c/483867/4/toci-quickstart/config/collect-logs.yml ? | 22:12 |
mordred | there's 210 files in that dir | 22:12 |
dmsimard | EmilienM: that sounds like a good suspect | 22:12 |
EmilienM | yeah i'm not sure, let me keep digging now | 22:13 |
dmsimard | "Collect the files under /tmp/ansible, which are useful to debug mistral executions. This dir contains the generated inventories, the generated playbooks, ssh_keys, etc." | 22:13 |
EmilienM | right | 22:14 |
EmilienM | if you read comment history we were not super happy with this patch | 22:14 |
EmilienM | we can easily revert it and fast approve | 22:14 |
EmilienM | I'm just making sure it's this one | 22:14 |
EmilienM | mordred, dmsimard, weshay|ruck : ok confirmed. I'm reverting and merging | 22:15 |
weshay|ruck | ya.. you guys found the same commit | 22:16 |
EmilienM | https://review.openstack.org/#/c/511347/ | 22:16 |
dmsimard | EmilienM: flaper mentions the possibility of making mistral write the interesting things elsewhere so it looks like it can be workaround' and is not critical | 22:16 |
weshay|ruck | dmsimard, we need to remove it.. I see it there twice | 22:16 |
EmilienM | dmsimard: it's not critical at all | 22:16 |
EmilienM | weshay|ruck: twice? | 22:17 |
EmilienM | mordred: do whatever you can to promote https://review.openstack.org/#/c/511347 if possible | 22:17 |
weshay|ruck | well /tmp/*.yml /tmp/*.yaml AND /tmp/ansible | 22:17 |
EmilienM | weshay|ruck: /tmp/*.yml /tmp/*.yaml would be another patch | 22:18 |
weshay|ruck | ok | 22:18 |
EmilienM | to keep proper git history I prefer a revert + separate patch for /tmp/*.yml /tmp/*.yaml | 22:18 |
EmilienM | weshay|ruck: are you able to send patch for /tmp/*.yml /tmp/*.yaml ? otherwise I'll look when I can, I'm in mtg now | 22:19 |
dmsimard | /tmp/*.yml and /tmp/*.yaml is certainly not as big of a deal | 22:19 |
weshay|ruck | EmilienM, ya.. I'll post one | 22:20 |
EmilienM | k | 22:20 |
EmilienM | TBH I don't see why we do that | 22:20 |
EmilienM | like why do we collect /tmp... | 22:20 |
EmilienM | it's not in my books :) | 22:20 |
pabelanger | made it up to 60k free inodes, something ate it up | 23:38 |
clarkb | could be tripleo jobs that started before the fix merged | 23:38 |
pabelanger | ya, | 23:39 |
clarkb | there is likely going to be a period of time where that happens | 23:39 |
clarkb | since those jobs take up to ~3 hours | 23:39 |
pabelanger | I've found a few large tripleo patches and manually purging the directories | 23:39 |
pabelanger | seems to be helping, up to 108K now | 23:39 |
pabelanger | /dev/mapper/main-logs 768M 768M 205K 100% /srv/static/logs | 23:43 |
pabelanger | heading in the right direction | 23:43 |