Wednesday, 2019-01-02

*** bobh has quit IRC00:04
*** bobh has joined #openstack-infra00:07
openstackgerritTristan Cacqueray proposed openstack-infra/puppet-openstackci master: logserver: set CORS header  https://review.openstack.org/62790300:11
*** bobh has quit IRC00:12
*** bobh has joined #openstack-infra00:25
*** wolverineav has joined #openstack-infra00:26
*** jamesmcarthur has joined #openstack-infra00:26
*** armax has joined #openstack-infra00:27
*** wolverineav has quit IRC00:30
*** bobh has quit IRC00:45
*** bobh has joined #openstack-infra00:54
*** jamesmcarthur has quit IRC00:56
*** _alastor1 has joined #openstack-infra00:58
*** jamesmcarthur has joined #openstack-infra00:59
*** wolverineav has joined #openstack-infra01:01
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: web: add OpenAPI documentation  https://review.openstack.org/53554101:17
*** hwoarang has quit IRC01:17
*** hwoarang has joined #openstack-infra01:19
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: config: add playbooks to job.toDict()  https://review.openstack.org/62134301:22
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: config: add tenant.toDict() method and REST endpoint  https://review.openstack.org/62134401:23
*** jamesmcarthur has quit IRC01:25
*** jamesmcarthur has joined #openstack-infra01:26
*** jamesmcarthur has quit IRC01:29
*** jamesmcarthur_ has joined #openstack-infra01:29
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: config: add tenant.toDict() method and REST endpoint  https://review.openstack.org/62134401:31
*** rfolco has joined #openstack-infra01:31
*** bobh has quit IRC01:36
*** hongbin has joined #openstack-infra01:38
*** bobh has joined #openstack-infra01:42
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: scheduler: add job's parent name to the rpc job_list method  https://review.openstack.org/57347301:43
*** bobh has quit IRC01:44
*** jamesmcarthur_ has quit IRC01:48
*** jamesmcarthur has joined #openstack-infra01:49
*** jamesmcarthur has quit IRC01:53
*** wolverineav has quit IRC01:54
*** wolverineav has joined #openstack-infra01:54
*** wolverineav has quit IRC02:02
*** bobh has joined #openstack-infra02:12
*** jamesmcarthur has joined #openstack-infra02:21
*** dave-mccowan has joined #openstack-infra02:22
*** wolverineav has joined #openstack-infra02:25
*** bhavikdbavishi has joined #openstack-infra02:28
*** wolverineav has quit IRC02:31
*** rkukura has quit IRC02:40
*** bobh has quit IRC02:46
*** bhavikdbavishi has quit IRC02:47
*** hwoarang has quit IRC02:50
*** psachin has joined #openstack-infra02:50
*** bobh has joined #openstack-infra02:53
*** dave-mccowan has quit IRC02:54
*** hwoarang has joined #openstack-infra02:56
*** dave-mccowan has joined #openstack-infra02:58
*** _alastor1 has quit IRC02:58
*** rfolco has quit IRC02:59
openstackgerritMerged openstack-infra/infra-manual master: Fix some URL redirections and broken links  https://review.openstack.org/62258103:08
*** wolverineav has joined #openstack-infra03:11
*** dave-mccowan has quit IRC03:12
*** jamesmcarthur has quit IRC03:14
*** jamesmcarthur has joined #openstack-infra03:15
*** jamesmcarthur has quit IRC03:19
*** jamesmcarthur_ has joined #openstack-infra03:19
*** bhavikdbavishi has joined #openstack-infra03:35
*** ykarel has joined #openstack-infra03:36
*** bhavikdbavishi1 has joined #openstack-infra03:38
*** bhavikdbavishi has quit IRC03:39
*** bhavikdbavishi1 is now known as bhavikdbavishi03:39
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: config: add playbooks to job.toDict()  https://review.openstack.org/62134303:40
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: config: add tenant.toDict() method and REST endpoint  https://review.openstack.org/62134403:40
*** bobh has quit IRC03:46
*** jamesmcarthur_ has quit IRC03:59
*** ramishra has joined #openstack-infra04:05
*** bobh has joined #openstack-infra04:07
*** jamesmcarthur has joined #openstack-infra04:07
*** udesale has joined #openstack-infra04:08
*** wolverineav has quit IRC04:08
*** wolverineav has joined #openstack-infra04:17
*** bobh has quit IRC04:20
*** jamesmcarthur has quit IRC04:21
*** jamesmcarthur has joined #openstack-infra04:22
*** ykarel has quit IRC04:33
*** hongbin has quit IRC04:43
*** ykarel has joined #openstack-infra04:49
*** ykarel has quit IRC04:53
*** jamesmcarthur has quit IRC04:57
*** jamesmcarthur has joined #openstack-infra04:57
*** eernst has joined #openstack-infra05:05
*** jamesmcarthur has quit IRC05:06
*** yboaron has joined #openstack-infra05:23
*** wolverineav has quit IRC05:25
*** wolverineav has joined #openstack-infra05:26
*** wolverineav has quit IRC05:39
*** wolverineav has joined #openstack-infra06:15
*** wolverineav has quit IRC06:25
*** ykarel has joined #openstack-infra06:28
*** eernst has quit IRC06:28
*** hwoarang has quit IRC06:31
*** hwoarang has joined #openstack-infra06:31
*** psachin has quit IRC06:34
*** jbadiapa has joined #openstack-infra06:57
*** rcernin has quit IRC07:00
*** quiquell|off is now known as quiquell07:06
*** jamesmcarthur has joined #openstack-infra07:07
*** jamesmcarthur has quit IRC07:11
*** ykarel_ has joined #openstack-infra07:15
*** ykarel has quit IRC07:17
*** ykarel__ has joined #openstack-infra07:20
*** ykarel_ has quit IRC07:23
*** ykarel_ has joined #openstack-infra07:24
*** ykarel__ has quit IRC07:27
*** ykarel_ is now known as ykarel|lunch07:31
*** ykarel_ has joined #openstack-infra07:35
*** ykarel|lunch has quit IRC07:38
*** pgaxatte has joined #openstack-infra07:45
*** arxcruz|next_yr is now known as arxcruz07:47
*** slaweq has joined #openstack-infra07:54
*** ginopc has joined #openstack-infra07:58
*** rpittau has joined #openstack-infra08:00
*** quiquell is now known as quiquell|brb08:04
*** jtomasek has joined #openstack-infra08:06
*** eumel8 has joined #openstack-infra08:07
*** ccamacho has joined #openstack-infra08:11
*** tosky has joined #openstack-infra08:14
*** psachin has joined #openstack-infra08:26
*** ykarel_ is now known as ykarel08:33
*** janki has joined #openstack-infra08:37
*** rascasoft has joined #openstack-infra08:39
*** hrubi has joined #openstack-infra08:47
*** jpich has joined #openstack-infra08:49
*** quiquell|brb is now known as quiquell08:52
*** janki has quit IRC09:01
*** janki has joined #openstack-infra09:01
*** xek has joined #openstack-infra09:05
*** aojea has joined #openstack-infra09:07
*** iurygregory has joined #openstack-infra09:14
iurygregoryGood morning everyone, Happy New Year =)09:15
iurygregoryanyone know if there is a way to retrieve old logs from a patch that was merged 4 months ago? we would like to compare logs for a job that started failing in stable/pike, and we have a patch where the job was green so we could compare some logs =)09:17
AJaegeriurygregory: we delete old logs after some time since we only have limited storage space, so no way to compare.09:17
iurygregoryAJaeger, tks o/09:18
*** yboaron_ has joined #openstack-infra09:32
*** yboaron has quit IRC09:35
*** shardy has joined #openstack-infra09:47
openstackgerritSimon Westphahl proposed openstack-infra/zuul master: Fix skipped job counted as failed  https://review.openstack.org/62591009:50
*** adriancz has joined #openstack-infra09:58
openstackgerrityolanda.robla proposed openstack/diskimage-builder master: Change phase to check for dracut-regenerate in iscsi-boot  https://review.openstack.org/62794910:06
*** bhavikdbavishi has quit IRC10:09
*** e0ne has joined #openstack-infra10:17
*** yboaron_ has quit IRC10:45
*** gfidente has joined #openstack-infra10:45
*** yboaron_ has joined #openstack-infra10:49
*** pbourke has quit IRC10:59
*** pbourke has joined #openstack-infra11:01
*** quiquell is now known as quiquell|brb11:11
*** udesale has quit IRC11:18
*** dayou has joined #openstack-infra11:26
openstackgerritSorin Sbarnea proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061311:27
*** dayou_ has quit IRC11:28
openstackgerritSorin Sbarnea proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061311:29
*** sshnaidm is now known as sshnaidm|afk11:48
*** whoami-rajat has joined #openstack-infra12:01
fungiiurygregory: to be specific, our current retention on the logs.openstack.org site is 4-5 weeks (because we can't keep more than ~12tb of compressed job logs)12:05
iurygregoryfungi, tks for the information o/12:06
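
As a rough back-of-envelope estimate (assuming log ingest is roughly steady; these figures are derived from the retention numbers above and are not stated in the discussion):

    ~12 TB retained / ~28-35 days of retention  ≈  350-430 GB of compressed job logs ingested per day
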
*** quiquell|brb is now known as quiquell12:11
*** jchhatbar has joined #openstack-infra12:12
*** psachin has quit IRC12:14
*** rpittau is now known as rpittau|lunch12:14
*** janki has quit IRC12:15
openstackgerritSorin Sbarnea proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061312:15
*** Dobroslaw has joined #openstack-infra12:17
openstackgerritTristan Cacqueray proposed openstack-infra/nodepool master: Implement an OpenShift resource provider  https://review.openstack.org/57066712:18
*** udesale has joined #openstack-infra12:26
openstackgerritSorin Sbarnea proposed openstack-infra/zuul-jobs master: Remove world writable umask from /src folder  https://review.openstack.org/62557612:26
*** rlandy has joined #openstack-infra12:39
ssbarneabnemerryxmas: happy new year to you too! -- check https://review.openstack.org/#/c/620613/ and see what else needs to be done; if I remember correctly they asked to clean up some branches.12:41
*** rlandy is now known as rlandy|rover12:42
*** rfolco has joined #openstack-infra12:43
*** sshnaidm|afk is now known as sshnaidm12:43
*** quiquell is now known as quiquell|lunch12:48
*** jcoufal has joined #openstack-infra12:52
*** aojea has quit IRC12:53
*** roman_g has joined #openstack-infra13:06
*** rh-jelabarre has joined #openstack-infra13:07
fungiheads up, i've got a morning appointment so am disappearing for a couple hours (but should be back by ~1600z)13:11
*** boden has joined #openstack-infra13:11
*** rpittau|lunch is now known as rpittau13:15
*** trown|outtypewww is now known as trown13:17
*** markvoelker has quit IRC13:20
*** yboaron_ has quit IRC13:24
*** aojea has joined #openstack-infra13:43
*** quiquell|lunch is now known as quiquell13:48
*** dave-mccowan has joined #openstack-infra13:55
*** dave-mccowan has quit IRC14:00
*** nhicher has joined #openstack-infra14:04
*** mriedem has joined #openstack-infra14:05
*** dave-mccowan has joined #openstack-infra14:06
*** shardy has quit IRC14:07
*** ykarel is now known as ykarel|away14:12
*** ykarel|away has quit IRC14:19
*** ramishra has quit IRC14:20
*** shardy has joined #openstack-infra14:27
*** ykarel|away has joined #openstack-infra14:34
*** slaweq_ has joined #openstack-infra14:41
*** slaweq has quit IRC14:42
sshnaidmwhere are the requirements for IRC channel configuration, if I want to add another channel to OpenStack?14:45
*** dkehn has joined #openstack-infra14:48
openstackgerritSagi Shnaidman proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061314:48
*** bobh has joined #openstack-infra14:50
*** ykarel|away is now known as ykarel14:51
*** bnemerryxmas is now known as bnemec14:51
*** whoami-rajat has quit IRC14:54
*** bhavikdbavishi has joined #openstack-infra14:55
fricklersshnaidm: see https://docs.openstack.org/infra/system-config/irc.html#channel-requirements14:57
sshnaidmfrickler, cool, thanks!14:59
*** slaweq_ has quit IRC15:01
*** efried has joined #openstack-infra15:02
*** ginopc has quit IRC15:02
*** slaweq_ has joined #openstack-infra15:03
*** ginopc has joined #openstack-infra15:06
*** edmondsw has joined #openstack-infra15:09
*** bhavikdbavishi has quit IRC15:10
*** bhavikdbavishi has joined #openstack-infra15:10
*** smarcet has joined #openstack-infra15:16
*** jrist has joined #openstack-infra15:17
*** slittle1 has quit IRC15:17
*** quiquell is now known as quiquell|off15:17
*** rfolco is now known as rfolco|lunch15:18
*** bhavikdbavishi has quit IRC15:19
*** jrist has quit IRC15:22
*** ginopc has quit IRC15:25
openstackgerritSagi Shnaidman proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061315:26
*** jrist has joined #openstack-infra15:26
*** ginopc has joined #openstack-infra15:26
*** whoami-rajat has joined #openstack-infra15:27
openstackgerritSorin Sbarnea proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061315:33
*** e0ne has quit IRC15:41
openstackgerritSorin Sbarnea proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061315:41
openstackgerritSorin Sbarnea proposed openstack-infra/git-review master: Warn which patchset are going to be obsoleted.  https://review.openstack.org/45693715:43
*** ykarel is now known as ykarel|awat15:48
*** ykarel|awat is now known as ykarel|away15:48
*** anteaya has joined #openstack-infra15:49
*** udesale has quit IRC15:53
*** rfolco|lunch is now known as rfolco15:55
*** ykarel|away has quit IRC15:55
*** studarus has joined #openstack-infra16:00
clarkbfungi I'm not quite at a place with ssh keys yet, but opendev.org dns appears to have disappeared?16:02
fungium16:02
fungilooking16:02
clarkbI wonder if adns1 stopped zone transferring to ns*16:02
clarkbsaw this with zuul dns when adns1 crashed16:03
fungithat's what i'm going to check first ;)16:03
mordredfungi, clarkb: I agree that I can't resolve opendev.org16:04
fungiinventory/openstack.yaml says adns1 is 2001:4800:7819:104:be76:4eff:fe04:43d016:05
fungiserver is running at least, looking at logs now16:05
funginamed is running since 2018-12-0616:06
openstackgerritMerged openstack-infra/puppet-openstackci master: logserver: set CORS header  https://review.openstack.org/62790316:07
fungikey reconfiguration is happening (domains all rekeyed 15:34:05 today)16:07
fungii don't see any mention of zone transfers happening, so checking ns1 next16:08
fungi2001:4800:7819:104:be76:4eff:fe04:38f0 according to the inventory16:08
fungiunbound running on ns1 since 2018-12-0516:10
clarkbhrm glue record problem in .org then?16:10
*** gyee has joined #openstack-infra16:12
fungiadns1 returns records but ns1 does not16:13
fungiaccording to `sudo journalctl -u unbound.service` unbound on ns1 hasn't logged any problems, or in fact anything other than startup messages on december 516:16
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Switch back to three columns for mid sized screens  https://review.openstack.org/62799716:16
openstackgerritMerged openstack-dev/hacking master: Don't quote {posargs} in tox.ini  https://review.openstack.org/60919216:17
clarkbfungi: they should run nsd for external dns serving16:17
funginamed on adns1 is logging to syslog not the systemd journal, but has only logged zone key reconfiguration messages for at least the past couple days16:17
clarkbunbound is only for local dns16:17
fungioh, nsd not unbound. thanks :/16:17
clarkband we run bind on adns116:18
fungiCould not tcp connect to 104.239.146.24: No route to host16:18
fungii bet this is a firewall issue16:18
*** rkukura has joined #openstack-infra16:18
fungiindeed, the iptables rules on adns1.opendev.org are only allowing zone transfers from ns1 and ns2.openSTACK.org16:20
fungithis could be fallout from my inventory group regex change. looking16:20
fungiindeed, adns1.opendev.org wouldn't have been covered previously by the puppet global manifest node entry for /^adns\d+\.openstack\.org$/16:23
fungibut is now included by /^adns\d+\.open.*\.org$/16:24
clarkbah16:24
fungisame for /^ns\d+\.openstack\.org$/ to /^ns\d+\.open.*\.org$/ for the authoritative nameservers16:24
clarkbbut firewall rules are in ansible16:24
clarkband we don't puppet on these as they are bionic iirc16:25
clarkbso maybe similar collision on ansible side?16:25
fungithe inventory/groups.yaml entry for the adns group changed from adns* to adns*.open*.org16:26
fungiso that wouldn't have changed them i don't think16:26
fungithe /etc/iptables/rules.v4 file on adns1 was last modified 2018-12-1816:27
fungithe groups change didn't merge until 2018-12-21, five days later, so i think we can rule it out after all16:27
fungiplaybooks/group_vars/adns.yaml lists iptables_extra_allowed_hosts of ns1.openstack.org and ns2.openstack.org for tcp 5316:29
*** ykarel|away has joined #openstack-infra16:29
fungii don't see any similar allowances for opendev nameservers16:30
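
A minimal sketch of the kind of addition needed in playbooks/group_vars/adns.yaml (the exact key layout is an assumption modeled on the existing openstack.org entries described above; the real change is the review linked below):

    iptables_extra_allowed_hosts:
      - hostname: ns1.openstack.org
        protocol: tcp
        port: 53
      - hostname: ns2.openstack.org
        protocol: tcp
        port: 53
      # additionally allow the opendev.org nameservers to pull zone transfers
      - hostname: ns1.opendev.org
        protocol: tcp
        port: 53
      - hostname: ns2.opendev.org
        protocol: tcp
        port: 53
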
clarkbmaybe it was missed after the manual bootstrapping16:30
mordredyeah16:30
* mordred is adding helpful sentences16:30
fungii don't have any concerns with temporarily adding the nameservers for opendev to openstack and vice versa, whipping up a patch for that now16:31
corvusthat's probably the best solution; and i think we can drop openstack soon16:32
*** pgaxatte has quit IRC16:32
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Allow DNS zone transfers from ns1/ns2.opendev.org  https://review.openstack.org/62800116:33
fungiclarkb: mordred: corvus: ^16:33
fungiplain brown wrapper16:33
clarkbwe need to manually add those rules, right? since dns isn't currently working16:34
fungiyeah, i'm going to do that on adns1 now and see about urging nsd to retry16:34
clarkbok16:34
corvuswait 628001 doesn't make sense to me16:35
clarkbI removed my +A16:36
fungisudo iptables -I openstack-INPUT 5 -m state --state NEW  -m tcp -p tcp -s 104.239.140.165 --dport 53 -j ACCEPT16:36
fungiand 162.253.55.1616:36
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Switch back to three columns for mid sized screens  https://review.openstack.org/62799716:37
corvusfungi: ++16:37
fungiand the same with ip6tables for 2001:4800:7819:104:be76:4eff:fe04:38f0 and 2604:e100:1:0:f816:3eff:fe2c:744716:37
fungiokay, iptables and ip6tables -L show them open now16:39
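
Putting the manual workaround together, the rules added on adns1 were along these lines (a sketch assembled from the commands and addresses quoted above; the ip6tables chain name is an assumption that it mirrors the v4 ruleset):

    # allow zone transfers (TCP 53) from ns1/ns2.opendev.org over IPv4
    sudo iptables -I openstack-INPUT 5 -m state --state NEW -m tcp -p tcp -s 104.239.140.165 --dport 53 -j ACCEPT
    sudo iptables -I openstack-INPUT 5 -m state --state NEW -m tcp -p tcp -s 162.253.55.16 --dport 53 -j ACCEPT
    # and the equivalent over IPv6
    sudo ip6tables -I openstack-INPUT 5 -m state --state NEW -m tcp -p tcp -s 2001:4800:7819:104:be76:4eff:fe04:38f0 --dport 53 -j ACCEPT
    sudo ip6tables -I openstack-INPUT 5 -m state --state NEW -m tcp -p tcp -s 2604:e100:1:0:f816:3eff:fe2c:7447 --dport 53 -j ACCEPT
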
*** bobh has quit IRC16:39
fungireloading nsd doesn't seem to have triggered a refresh16:40
corvusfungi, clarkb: okay 628001 makes sense to me now, +316:41
fungireading up on the nsd-control utility16:41
tbarronI'd like to merge this small log collection change https://review.openstack.org/#/c/626921/16:42
*** jchhatbar has quit IRC16:43
fungifor the record, `sudo nsd-control zonestatus` reported all three zones expired, then i issued a `sudo nsd-control transfer` and it seems to have done the trick16:43
fungiinfra-root: ^16:43
clarkbfungi: thanks!16:43
fungiconfirmed, i can look up opendev.org records from ns1.opendev.org now. i'll do the same on ns216:44
fungihttp://paste.openstack.org/show/739824/16:45
corvusworks from my desktop now16:46
mordredsame here16:46
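
For reference, the recovery on the secondary nameservers amounted to something like the following (a sketch based on the commands described above; nsd-control can also transfer a single named zone, but the exact arguments used are not shown in the log):

    sudo nsd-control zonestatus                # showed all three zones as expired
    sudo nsd-control transfer                  # ask nsd to re-fetch the zones from the master (adns1)
    dig @ns1.opendev.org opendev.org SOA       # confirm records are being served again
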
corvusanyone know why the 'test nodes' graph is topping out below the max line?  http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=116:47
clarkbcould be quota handling in nodepool16:49
fungihrm, yeah that does indeed look odd. maybe one of our providers has dropped our quota?16:53
*** sthussey has joined #openstack-infra16:56
clarkbgra1 seems to not be in use?16:57
*** psachin has joined #openstack-infra16:57
corvusaccording to http://zuul.openstack.org/nodes  gra1 only has a handful of deleting nodes -- and those are substitute node records, so they are likely create-errors17:00
corvusoh, they're also really old17:00
corvus(they have very small id numbers)17:00
corvusgra1 is not in use -- max-servers:017:02
clarkbah17:02
funginl04.openstack.org.yaml says17:02
fungiyeah17:02
corvusbut that should mean that it's not contributing to the total17:02
corvusi guess we need to ask graphite where it's getting that number from17:03
fungilooking at the monthly view, the last time the max-servers total changed from grafana's perspective was on december 20th17:05
fungiand indeed, the most recent change to any of the nl*.openstack.org.yaml files in git merged on december 20, lowering max-servers for ovh-gra1 from 79 to 0 per https://review.openstack.org/62650217:06
corvushttp://graphite.openstack.org/render/?width=973&height=777&_salt=1546448829.34&hideLegend=false&target=legendValue(stats.gauges.nodepool.provider.*.max_servers%2C%22last%22)17:07
dmsimardI'm still in heavy catch up mode from the holidays -- is there any known issues with limestone at this time ? It appears that some tripleo jobs are failing exclusively there.17:09
rlandy|roverdmsimard: continuing discussion on multinode jobs failures that seem to occur only with one provider - Hostname: centos-7-limestone-regionone-* .. see https://bugs.launchpad.net/tripleo/+bug/181005417:09
openstackLaunchpad bug 1810054 in tripleo "mulitnode jobs failing on gathering facts from subnode-2" [Critical,Triaged] - Assigned to Ronelle Landy (rlandy)17:09
dmsimard^ See logstash (update duration to 7 days and there's quite a few hits): http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22fatal%3A%20%5Bsubnode-2%5D%3A%20UNREACHABLE!%5C%2217:09
clarkbdmsimard: I think most of us are in that mode. I don't know of any known problems specific to clouds currently17:10
dmsimardclarkb: ok, I'll try to understand what is happening. Thanks !17:10
*** yboaron_ has joined #openstack-infra17:14
clarkbdmsimard: we've seen that particular ssh unreachable issue when clouds reuse IP addrs and then fight for the IP over arp, but limestone uses ipv6 which should avoid that problem entirely17:15
*** rpittau has quit IRC17:15
clarkbdmsimard: is it possible that the jobs are breaking ipv6?17:16
*** Vadmacs has joined #openstack-infra17:17
corvusclarkb, fungi: max - usage deltas are: bhs1: -135, sjc1: -10, ord: -5.  that just about accounts for the discrepancy17:17
clarkbdmsimard: hrm that ansible connection goes over ipv4 and is happening primary to subnode? or is that happening executor to subnode? will only work if it is primary to subnode not executor to subnode as the 10/8 addresses are not globally routable17:18
fungihttp://grafana.openstack.org/d/BhcSH5Iiz/nodepool-ovh?panelId=8&fullscreen&orgId=1&from=now-30d&to=now17:18
dmsimardclarkb: some of the jobs failing have also been successful on limestone so it feels intermittent17:19
corvusi see this in the log for bhs1: Quota exceeded for cores: Requested 8, but already used 512 of 512 cores17:19
fungiyeah, looks like maybe around 2018-12-29 we stopped using most of ovh-bhs1 but the first signs i see indicating we're probably capped there is on 2018-12-3117:19
corvusthat works out to 64 nodes, so i'm not quite sure how we're getting 15 out of that17:19
corvus2019-01-02 10:36:28,468 DEBUG nodepool.driver.openstack.OpenStackProvider: Provider quota for ovh-bhs1: {'compute': {'cores': 512, 'instances': 200, 'ram': 4063232}}17:20
corvus2019-01-02 10:36:29,663 DEBUG nodepool.driver.openstack.OpenStackProvider: Available quota for ovh-bhs1: {'compute': {'cores': 360, 'instances': 181, 'ram': 3911232}}17:20
corvusneither is nodepool apparently :)17:20
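
Spelling out the numbers behind the confusion (plain arithmetic from the two quota lines above, assuming every node uses the 8-vCPU flavor mentioned in the quota error):

    nova's view:      512 of 512 cores used            ->  512 / 8 = 64 node-equivalents consumed
    nodepool's view:  512 quota - 360 available = 152  ->  ~19 nodes that nodepool is tracking
    unexplained gap:  512 - 152 = 360 cores            ->  ~45 node-equivalents held by something nodepool doesn't know about
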
fungiamorin: ^ any idea if our quota in bhs1 got out of sync sometime in the past few days?17:20
* fungi isn't sure who's quite back to the computer yet this week17:23
corvusi'm not sure i am17:23
fungii was about to say the same ;)17:23
*** tosky has quit IRC17:23
dmsimardlots of us doing our best at a keyboard today haha17:24
clarkbits hard to get over that vacation hangover, also real hangovers depending on how hard you celebrate new years :P17:24
openstackgerritMerged openstack-infra/system-config master: Allow DNS zone transfers from ns1/ns2.opendev.org  https://review.openstack.org/62800117:25
clarkbdmsimard: ok its ansible running from primary to subnode and IP addrs for private v4 seem to check out17:26
clarkbdmsimard: next is to check if the firewall is running and whether or not the subnode is possibly crashing its network (or just crashing)17:26
*** jpich has quit IRC17:26
clarkbdmsimard: limestone is the cloud where nested virt caused nodes to crash17:26
dmsimardis it ?17:26
dmsimardyeah I think the ipv4 is just the tunnel ip17:27
clarkbhttp://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/ara-report/result/dc98460a-a68d-4b62-877e-374afe9c0d09/ shows firwall is being configured properly I think17:27
dmsimardclarkb: when the failure is occuring, it looks like it's always after the "tripleo-inventory" role runs -- that's what I'm looking at right now17:28
dmsimardcomparing different job executions to see what's up17:28
clarkbdmsimard: yes the hypervisor kernel and the centos 7.5 kernel did not get along in that cloud about a month ago. logan- updated hypervisor kernels but it's possible that broke again. tripleo did switch to qemu instead of kvm as a result but maybe that got turned off and things regressed again? (mentioning it as a possibility; I don't have evidence this is the cause)17:28
logan-that should not be happening anymore. ubuntu released an updated kernel a few days after the node crash issue started happening and all of the hvs have been upgraded. have not seen any further crashes on our internal nested virt jobs lately17:29
clarkblogan-: thanks17:29
dmsimardlogan-: thanks, we'll let you know if we can't figure it out :)17:29
logan-dmsimard: cool, thanks. with something intermittent like that I always wonder if we're hitting a slow node or something. iirc there was some work being done recently to add hostid to the logstash fields.. is that done clarkb? anything I need to enable cloud-side to expose that info to nodepool?17:31
clarkblogan-: I think it is purely nodepool side (nova api already gives us that data), I'm not sure if we restarted nodepool with that change yet as there were openstacksdk issues that lingered late last month that we didn't want to run into (since fixed)17:32
logan-gotcha17:32
clarkbgive me a few minutes and I'll check if that got in17:32
logan-thanks17:32
funginodepool on nl01 running since 2018-12-05 so hasn't been restarted for nearly a month17:33
fungier, nodepool-launcher daemon specifically17:33
corvusclarkb: the change merged on 12-0717:33
*** ginopc has quit IRC17:33
fungiso not in use yet17:33
clarkbya so we didn't manage to restart things once sdk was fixed (not surprising)17:33
corvusi'd like to perform a scheduler restart sometime soon17:34
clarkbfungi is quicker than me being able to load ssh keys after holidays :)17:34
mordredclarkb: I believe we should be completely fixed/released with sdk things currently17:34
*** weshay|ruck has joined #openstack-infra17:40
corvusi'd also like to restart gerrit -- it seems to have a stuck replication thread (likely due to gitea hanging after its disk was filled)17:42
corvusi could probably perform thread surgery with javamelody, but it's probably safer to restart17:43
mordred++17:43
clarkbya java melody surgery like that always worries me17:43
mordredcorvus: is the ceph rebalance all done? I wasn't sure what state it was in and didn't want to go touching things17:43
clarkbgerrit wasn't built with the intention that threads would be stopped under it17:43
corvusmordred: yes! all done17:44
corvusmordred: one of the osds has a weird name (0,1,3) because i didn't know all the steps to replace them at first.  and they're out of order on the underlying nodes (like osd 3 is on minion 1 or something).  but done.  :)17:45
mordredcorvus: cool. I figure we'll blow the whole thing away and recreate it from scratch at some point anyway, yeah?17:46
corvusmordred: (turns out if you're going to zero out an underlying block device, you also need to edit the configmap for the deployment and remove the info about the osd, otherwise the operator is going to think it still exists and create a new one; that's how we ended up with #3 -- the operator thought there was still a #2)17:47
mordredcorvus: ah - that makes sense17:48
corvusso that's the big bit of info i learned -- the operator maintains configmaps for deployments -- it doesn't rely only on the stuff on the filesystem.17:48
corvus(the configmap largely duplicates what's on the filesystem)17:48
fricklerfungi: corvus: there are still my 20 test nodes in bhs1. I had asked amorin whether I should delete or keep them, but I don't remember the outcome. guess I should remove those now17:48
fungiso that at least accounts for some of the discrepancy, but not most of it17:49
fricklerthere also were 5 or so nodes listed by "openstack server list", but not existing when trying to show them individually, not sure whether those might count for quota, too17:50
mordredcorvus: looks like we're using 124G on the cephfs now according to df17:50
*** kmalloc has joined #openstack-infra17:50
mordreddoes that figure report the size used taking the underlying ceph min block size into account?17:50
clarkbmordred: df should be actual disk consumed17:51
corvusmordred: aiui yes17:51
fungimordred: that seems like rather a lot given the aggregate utilization of /home/gerrit2/review_site/git on review01.o.o is reported as only 10gib17:51
corvuscurrent detailed usage: http://paste.openstack.org/show/739828/17:51
fungimyfs-data0 is the block device the git replication is going into?17:53
fungiif so, 7.2g looks reasonable17:53
corvusfungi: yep.  and then 7.2*3 copies gives us 22GiB17:54
corvusthen... somehowe 22GiB turns into 123.17:55
fungithrough the magic of cephfs17:55
corvusso it's *way* better than a week ago when 22 turned into 230.17:55
corvusit's still... confusing.17:55
mordredI'm guessing that's min block size overhead yeah? - what did you change that from/to?17:56
fungiespecially since myfs-data0 is reporting 17.29% used at 7.2gib but says max available is 34gib17:57
corvus    bluestore_min_alloc_size = 409617:57
*** yboaron_ has quit IRC17:57
mordredyeah. that17:57
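
For context, bluestore_min_alloc_size is an OSD-creation-time setting, which is why changing it meant redeploying the OSDs (as comes up later in the discussion). A minimal sketch of the override corvus describes, expressed as plain ceph.conf (in this rook-managed cluster it is injected through rook's config override rather than a hand-edited file):

    [osd]
    # allocate bluestore space in 4 KiB units instead of the 64 KiB it was defaulting to,
    # since the git repositories contain many small objects
    bluestore_min_alloc_size = 4096
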
fungii mean, i'm bad at math, but even i know 7.2 is not 17.29% of 7.2+3417:58
fungino, wait, it is17:58
corvusmordred: that seems sufficiently small to me that we should be closer to 1:117:58
mordredfungi: maybe math changed over christmas17:58
*** kmalloc has quit IRC17:58
fungi7.2 isn't 17.29% of 34 but it *is* 17.29% of 7.2+3417:59
* fungi probably just hasn't had enough to drink today18:00
*** ykarel|away has quit IRC18:01
fungiand 3*(34+7.2+.2)=124.218:01
funginot sure if that's where the ~123gib for global raw used is coming from?18:02
*** gfidente has quit IRC18:02
fungidoes the max avail for the pools count against the global raw used?18:02
fungilike a preallocation?18:03
corvusfungi: hrm, i wouldn't expect that... but maybe?  maybe the global storage gets allocated to the pools in very large chunks, so we've allocated half of the global storage to those pools, and from that allocation, 34GiB is still available?18:03
*** studarus has quit IRC18:03
corvusfungi: wow, 'preallocation' is a much shorter way of saying what i said :)18:03
corvusanyway, i do not know if it works that way.  but it seems like a good theory that fits the numbers.18:04
fungia "good" theory, perhaps, but confusing implementation if true18:04
corvuss/good// :)18:05
*** sshnaidm is now known as sshnaidm|afk18:05
*** Vadmacs has quit IRC18:05
dmsimardWhat kind of timeframe are we looking at for the logging migration to swift ?18:05
corvusdmsimard: i would personally like to push it so that as many of the things described in http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-July/000501.html as possible are implemented.18:07
corvusdmsimard: almost everything in there mitigates something we lose due to the migration18:07
corvushowever, we can cut over any time18:07
clarkbcost in switching is largely going to be in time spent debugging any unexpected behavior and educating users about the change when it happens18:08
dmsimardcorvus: I'm particularly interested in the switch from the ara-report sqlite middleware to html generation18:08
clarkb(I expect)18:08
dmsimardSome projects (I know of at least OSA) are using the middleware for nested Ansible reports18:08
dmsimardI'll read the ML post first18:09
fungiyeah, accounting for rounding, some used value between 7.15 and 7.20 plus a max available value between 33.5 and 34.0 can give us a value between 122.5 and 123.0 (with enough of a buffer to even account for the used 208 mib for the myfs-metadata pool) so consistent rounding of those for the raw used works out i think18:09
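
Restating fungi's reconciliation as explicit arithmetic (this is the preallocation theory above, not a confirmed description of how ceph computes RAW USED):

    per replica:   34 GiB max avail (shared by the pools) + 7.2 GiB used (myfs-data0) + 0.2 GiB used (myfs-metadata) ≈ 41.4 GiB
    three copies:  3 × 41.4 GiB ≈ 124 GiB, i.e. roughly the ~123 GiB reported as global raw used
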
corvusdmsimard: post doesn't discuss ara.  you can test it out with a change like https://review.openstack.org/59258218:09
dmsimardara is actually mentioned in http://lists.openstack.org/pipermail/openstack-infra/2018-July/006020.html :D18:10
*** trown is now known as trown|lunch18:10
corvusdmsimard: yeah.  that work is all done.18:12
*** josephrsandoval has joined #openstack-infra18:12
dmsimardok so this works: https://object-storage-ca-ymq-1.vexxhost.net/swift/v1/86bbbcfa8ad043109d2d7af530225c72/logs_82/592582/2/check/devstack-platform-opensuse-423/1470481/ara-report/18:12
dmsimardvery nice18:12
*** panda is now known as panda|off18:15
clarkbcorvus: dmsimard: is that something we want to get in today and ease people into as they return to work?18:17
clarkb(I'm happy to help with that if so)18:17
corvusclarkb: get what in today?18:17
clarkbthe switch to logging to swift over the fileserver18:17
corvusclarkb: i would prefer to wait until as much of what is described in http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-July/000501.html as possible exists18:18
corvusparticularly the inline javascript log viewer18:18
clarkboh I misread your comments above. You had said we can cut over any time and I thought you were thinking of using that as a carrot to get those other improvements in18:19
clarkbyou mean that we should add those features ebfore we cut over18:19
corvusi think tristanC has done some fundamental work in that direction (we now fetch the log json and render part of it; it's a smaller step now to fetch the log text and render it)18:20
*** kgiusti has joined #openstack-infra18:20
corvusclarkb: i doubt we'll get volunteers to implement that by switching18:20
corvusclarkb, fungi, mordred: i've read over http://docs.ceph.com/docs/master/rados/operations/monitoring/#checking-a-cluster-s-usage-stats and i still don't understand the numbers18:21
*** wolverineav has joined #openstack-infra18:22
corvushttps://access.redhat.com/solutions/146542318:22
corvusthat's not terribly far from our question18:22
corvusi don't see an answer there though, which makes me sad18:23
dmsimardclarkb, corvus: my only concern with switching to swift is that we need to make sure that users of the sqlite middleware also switch to html generation18:23
dmsimard(for ara)18:23
corvusdmsimard: what users?18:23
dmsimardcorvus: OSA uses it for nested reports, there might be kolla-ansible as well .. would need to use codesearch or something18:24
dmsimardfor example http://logs.openstack.org/98/625898/1/check/openstack-ansible-deploy-aio_lxc-ubuntu-bionic/d5e3799/ara-report/ is from Zuul's perspective while http://logs.openstack.org/98/625898/1/check/openstack-ansible-deploy-aio_lxc-ubuntu-bionic/d5e3799/logs/ara-report/ is from OSA's perspective18:25
clarkbdmsimard: I'm not too worried about that. Its an easy switch and if you really need the ara data in that interim you can render it off of the sqlite file18:25
clarkb(so we haven't lost any data)18:25
dmsimardclarkb: right18:25
mordredyeah - I don't think it's hard - it's just a thing to add to the checklist18:25
dmsimardindeed.18:26
corvuser, um, there is no checklist :)18:26
corvusi disclaim responsibility for this :)18:26
*** wolverineav has quit IRC18:26
fungicorvus: i suppose you have access to the answer to that question? for me it's "subscriber exclusive content" but if i'm a "current customer or partner" i can find out the possible answer about this particular piece of free software :(18:26
corvusso if we need it to happen, we need to figure out how to make it happen :)18:26
corvusfungi: i have no idea18:27
fungi"Distributing any portion of Red Hat Content to a third party, using any Red Hat Content for the benefit of a third party, or using Red Hat Content in connection with software other than Red Hat Software under an active Red Hat subscription are all prohibited."18:29
fungiso if you did have access to it, you couldn't legally tell me what it says anyway18:29
corvuswell, then i guess i won't look :(18:29
fungihttps://access.redhat.com/help/terms/ (linked from the footer of that kb article)18:30
dmsimardclarkb, logan-: re: tripleo failures -- it could be a coincidence (or a red herring) but I've noticed that there appear to be two kinds of CPUs in Limestone, reported as "Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)" and "Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz", and so far all the failures I've looked at seem to have occurred on the E3-12xx variant18:31
corvusdmsimard, clarkb: do you want to find a way to have the nested ara report generated automatically when we switch over, or do you just want to go update jobs after the switch?18:32
clarkbcorvus: I'm fine with updating jobs after the switch (it isn't a lossy switch just an extra step to see the data)18:34
dmsimardcorvus: least we can do is attempt to figure out who uses it and let them know ahead of time I think. At first glance I see OSA, Windmill, puppet-openstackci, and system-config18:34
dmsimardbased on http://codesearch.openstack.org/?q=ara-report18:34
corvusdmsimard: do you have a suggested action for them?18:36
dmsimardI can write something up -- it shouldn't be more than replacing "mkdir ara-report; cp /some/path/ansible.sqlite /some/path/ara-report" with "ara generate html /some/path/ara-report"18:37
dmsimardWhat influences whether or not we do it before the move is the timeframe, I think18:37
dmsimardI mean, we probably don't want every project to start generating html if we're not moving to swift any time soon -- or else we could fill up logs.o.o again :(18:38
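
A minimal sketch of the job-side change being described (paths and variables here are illustrative, not the exact role code any project uses):

    # before: publish the raw sqlite database and rely on the logs.openstack.org ara-report WSGI middleware
    mkdir -p "$LOG_DIR/ara-report"
    cp "$NESTED_ARA_DB" "$LOG_DIR/ara-report/ansible.sqlite"

    # after: generate a static HTML report that can be uploaded to swift as-is
    ara generate html "$LOG_DIR/ara-report"
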
*** wolverineav has joined #openstack-infra18:38
corvusyeah, we should definitely not generate static files before the move; however, we could theoretically write a role which supports both and handles the change transparently (as the existing ara-generate role does)18:39
dmsimardthat logic is embedded in and specific to each project's own nested ansible implementation, though18:39
dmsimardOSA bootstraps their own ansible, kolla too, etc.18:40
*** bobh has joined #openstack-infra18:41
corvusi don't understand why the internal ansible would be responsible for moving zuul log files around, but that's okay, i don't need to understand everything.18:41
*** wolverineav has quit IRC18:42
efriedHowdy folks. Who's administering jetbrains (pycharm) licenses these days? I need to change my email address, and apparently that kills my current license.18:43
efriedI submitted a new request at the link listed in https://wiki.openstack.org/wiki/Pycharm -- but IIRC our license admin recently left OpenStack?18:43
* mordred does not know anything about pycharm licenses18:43
*** wolverineav has joined #openstack-infra18:45
corvusfungi: i think 'ceph osd df' shows us where the 123 comes from: http://paste.openstack.org/show/739832/18:45
dmsimardefried: last I know it was swapnil but I'm not sure if he's still around18:45
efrieddmsimard: Okay, thanks. I guess I'll wait a day or two and see if something comes of that request.18:46
* efried can't live without his pycharm18:46
* efried comes up with something clever about leprechauns and cereal.18:46
*** wolverineav has quit IRC18:46
fungicorvus: i don't really get anything new out of that `ceph osd df` output other than that it's ~3x41gib. where does the 41 come from? my theory there is still the same at least18:47
*** wolverineav has joined #openstack-infra18:48
fungiefried: i'm not after your lucky pycharms, just in case you were concerned18:49
efriedthanks fungi, that's the ticket18:49
corvusfungi, clarkb, mordred: okay, so i still don't understand how all these numbers work.  do we consider that a requirement to moving forward?  or do we just want to keep throwing cinder volumes at the thing to keep it fed?18:51
fungicorvus: i don't consider it a requirement, no. mostly just curious so we can do effective capacity planning in the future18:51
fungibut our capacity planning in the past has been a bit reactionary anyway (like enospc=>time to make more room)18:52
clarkbcorvus: mnaser may know. I can also ask my brother if he understands that18:52
*** wolverineav has quit IRC18:53
*** rkukura has quit IRC18:56
fungii guess calebb hasn't been chillin' in here lately18:58
*** diablo_rojo has joined #openstack-infra18:58
mordredcorvus: yeah - I don't think it's an absolute requirement to move forward - but I would like to understand more better in general18:59
*** wolverineav has joined #openstack-infra18:59
clarkbI've asked calebb to join here and can hopefully shed light on that18:59
corvusclarkb: thanks19:00
corvusmordred: okay, what's the status/next steps?19:00
mordredcorvus: step one is fully regain brain post holiday ...19:02
*** tobiash has quit IRC19:02
*** calebb has joined #openstack-infra19:03
calebbclarkb: o/19:03
clarkbcalebb: hey so corvus is trying to track down disk usage discrepancies in a ceph cluster he set up. Maybe you can help us understand what is going on there19:04
corvuscalebb: http://paste.openstack.org/show/739832/19:04
mordredcorvus: I think we should exercise taking nodes offline, as well as growing the system by adding new nodes, yeah? I feel like there's something else that was on the list pre-holiday but still paging context back in19:04
corvuscalebb: the pool replication size is 3 for both of those pools19:04
mordredcorvus: (oh, we need to walk other infra-root folks through the setup)19:05
corvuscalebb: so i think that explains 7.2*3 == 2219:05
*** bobh has quit IRC19:05
*** tobiash has joined #openstack-infra19:06
calebbcorrect19:06
corvuscalebb: but i don't understand why, in the global section, it says 123 raw used19:06
calebband there are no other pools in the cluster than what is shown there?19:06
corvuscorrect19:07
calebbdid you run any ceph benchmarks by chance?19:07
calebbbecause those don't always clean up after themselves i think19:07
*** whoami-rajat has quit IRC19:07
fungimy terrible theory is that the used for the two pools plus the max available (which seems to be shared between pools? maybe?) adds up to the 41gib utilization per replica19:07
corvusnope; though i have replaced all of the osd's (one at a time).  there hasn't been any significant activity after the osd replacement.19:07
calebbthat should be fine19:08
fungibut if my terrible theory is correct, i don't understand why available space in the pools counts toward used space globally19:08
corvuscalebb: and just for further context, this is running in k8s via rook -- so it's all virtualized and dedicated to one user.19:08
calebbfungi: i think by default it doesn't work that way, or at list didnt used to, but there might be a way to make it do that and rook may do that by default?19:09
calebbat least*19:09
*** dtantsur is now known as dtantsur|afk19:09
fungiinteresting19:10
*** rkukura has joined #openstack-infra19:10
calebbcorvus: just for context what version of ceph is this and is it using filestore or bluestore?19:10
calebbmight be relevant19:10
corvuscalebb: mimic and bluestore19:10
calebband did you create those pools manually or did rook do it?19:11
corvuscalebb: rook created the pools19:11
corvusthis is the input we gave it to create the filesystem: http://paste.openstack.org/show/739833/19:12
calebblooks like someone on the mailing list had a similar question http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023669.html im gonna read up on that and see if there's any useful info in there19:16
calebbit sounds like bluestore is bad at deciding usage19:18
calebbbut those differences seem really high19:20
corvusoh hrm.  i wonder if they did anything about that in the intervening year?19:20
calebbyeah, that looks like that was on Luminous19:21
calebbseems like it should have been fixed so this might be a different issue19:21
corvuscalebb: also worth noting, i set "bluestore_min_alloc_size = 4096" because it was defaulting to 64k, and we have a lot of small files.  that caused our global raw used to be about 2x what it is now.19:21
corvus(i mean the default caused it to be 2x what it is now; 4096 is producing the current values)19:22
*** trown|lunch is now known as trown19:25
*** jamesmcarthur has joined #openstack-infra19:25
calebbok19:27
calebbi assume that's why you had to redeploy all the osds?19:28
*** bobh has joined #openstack-infra19:28
fungimaybe 4096 is still way too large for our dataset and we need something more like 1024?19:29
corvuscalebb: yep19:29
calebbfungi: but i dont think that explains it away either19:30
clarkbhttp://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024801.html is relevant maybe?19:31
clarkb(I don't have the object count to do that math)19:31
calebbbecause if there's 500k objects, and if each one wastes 2K, that only accounts for 1GB i think19:31
calebbidk if that's the right math though19:31
calebbyeah so i think https://github.com/ceph/ceph/pull/19454 might be the thing that fixes the mailing thread i linked19:31
*** bobh has quit IRC19:32
calebbwhich was only merged less than a month ago19:32
corvusclarkb: http://paste.openstack.org/show/739832/ has object count; our average is 10KiB per object.  i think.  :)19:32
*** jamesmcarthur has quit IRC19:32
*** jamesmcarthur has joined #openstack-infra19:32
calebbyeah i think the PR I linked fixes this issue19:32
corvusi'm guessing that's not in mimic19:33
calebbmimic 13.2.2 was released in september it looks like19:33
calebband that's the latest19:33
fungiokay, so even on a typical fs with the traditional 512k block size, our average block count per inode is really tiny19:33
corvusand yeah, our 'ceph df detail' report does not have a "used" column, as described in that pr.19:34
fungier, ignore me. i'm thinking 512b block size, so we're averaging 20 traditional 0.5k blocks per file19:34
fungiso 64k block size was definitely large for a 10k average file size, but not significantly so. 4k means we're averaging ~2.5 blocks per file now19:36
fungii agree decreasing below 4k block size doesn't really make sense19:36
corvusyeah, we don't know the histogram though, so could be more or less important based on the number of 100-byte files :)19:37
calebbheh19:37
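
Rough overhead arithmetic behind that estimate (averages only; as corvus notes, the real impact depends on the file-size histogram):

    ~10 KiB average object rounded up to 4 KiB allocation units  ->  12 KiB allocated, ~2 KiB of padding per object
    ~500,000 objects × ~2 KiB of padding                         ->  ~1 GiB per replica, ~3 GiB across three copies

which is nowhere near enough to explain the ~100 GiB discrepancy on its own.
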
corvusso maybe we should just ignore this for now, feeding the cluster storage as it claims it needs it (via %raw used) and revisit this after the next ceph release?19:38
corvus(since it seems like we can probably handle throwing a bit of extra storage at it to deal with it)19:38
calebbwell, but if it's using 2x more storage than it should then it's not really a bit of extra storage though19:39
mordred++ - I think that sounds like a fine idea for now - especially since we don't expect this data to double in size or anything once we get the initial data loaded fully19:39
calebbunless the extra used space doesn't scale with the actual used19:39
calebbooc what are you storing in it?19:40
corvuscalebb: all the openstack git repos19:40
calebbahh19:40
corvusso we don't expect it to grow too quickly; like, i don't think we'll see it even double in the next 6 months.19:41
*** psachin has quit IRC19:41
calebbyeah19:41
corvusso 'wait for ceph to get better' is pretty viable for us.19:41
* calebb nods19:42
corvusespecially since the env is virtualized and we can swap it out easily19:42
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Reject messages to starlingx-discuss-owner  https://review.openstack.org/62803019:42
fungiinfra-root: ^ please expedite if possible19:43
clarkbdmsimard: fwiw I'm still poking around at those failures and we don't seem to collect logs from the subnode19:44
clarkbdmsimard: at least I can't find them19:44
clarkbdmsimard: and I don't think that is because we can't ssh to it, the log collection just doesn't pull from there?19:44
corvuscalebb: thanks for your help!  :)19:45
calebbcorvus: feel free to poke me again if needed, i always forget to keep my freenode connection authed and stuff but i'll try >.>19:45
calebbnp19:45
dmsimardclarkb: yeah mwhahaha was looking at that19:45
mwhahahalogs don't get collected for teh same reason the job fails19:46
corvusmordred: you want to add a node to the cluster?19:46
mwhahahacan't pull them, ssh is returning 25519:46
clarkbdmsimard: http://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/logs/quickstart_files/ssh.config.ansible.txt.gz is interesting too19:46
calebbcorvus: im excited you guys are using ceph :D I've poked clarkb a lot to try convince him about it19:46
mwhahahaso it either connectivity or access denied19:46
clarkbit says use the heat-admin use19:46
mordredcorvus: yes. that seems like a grand idea19:46
clarkbis something creating the heat-admin user on that instance?19:46
mwhahahaclarkb: yes later, but we use http://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/logs/quickstart_files/hosts.txt.gz19:46
corvuscalebb: other than this mystery, it's been really easy to work with, especially with rook.19:47
mwhahahawhich reuses the zuul user19:47
clarkbmwhahaha: and those two configs aren't in conflict?19:47
mwhahahahttp://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/logs/quickstart_files/ssh.config.local.ansible.txt.gz19:47
mwhahahanot necessarily19:47
calebbcorvus: awesome! I've had some problem with the deploy tools (ceph-deploy and ceph-ansible) but everything else is pretty good imo as far as ops goes19:47
clarkbunfortunately the nested ara doesn't seem to record which user was attempted when the failure happens19:48
*** bobh has joined #openstack-infra19:48
corvusmordred: i'll leave you to get started on that and check in after my lunch19:48
mwhahahaclarkb: zuul, http://logs.openstack.org/92/625692/2/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/b38e0b1/logs/quickstart_collect_logs.log19:48
dmsimardmwhahaha: one of those files has "Hostname 10.4.70.74" and the other doesn't, I'm not sure if that could be related19:49
clarkbmwhahaha: dmsimard: and we should double check that the /etc/nodepool/id_rsa file is set up for the zuul user authorized keys19:49
dmsimardNone of this explains how the issue could occur randomly only on limehost, though ?19:50
logan-dmsimard: good find on the cpu model. the guests where host-passthrough is not enabled could be less performant. i'm not sure why certain HVs don't have that enabled, but I'm working on fixing that now.19:50
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Increase number of k8s nodes to 4  https://review.openstack.org/62803119:50
clarkbdmsimard: not necessarily, but understanding that and ruling it all out is useful19:50
fungiso these nodes may be communicating over ipv4 in limestone? is this working like a legacy d-g job where there's a master node which collects the logs? what are the odds there is one or a small number of ip addresses where that's happening? maybe a zombie instance or several nova has lost track of?19:50
clarkbdmsimard: basically if you confirm that ssh is configured properly (users and keys) then you can continue to narrow down the problem space19:51
clarkbdmsimard: if we assume they are fine we may miss something obvious19:51
dmsimardfair19:51
mwhahahadmsimard: different job results, mine was from a job with .64, clarkb's was where it was .7419:51
*** markvoelker has joined #openstack-infra19:51
fungibut yes, i agree starting from first principles helps avoid missing obvious problems due to tunnel vision19:51
dmsimardmwhahaha: my bad19:51
mwhahahahttp://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/logs/quickstart_collect_logs.log19:52
clarkbfungi: what is odd is we've seen the key errors when we had that problem before, these are just unreachable but maybe that is ansible converting error messages for us19:52
mwhahahait's all good :D19:52
dmsimardlogan-: pleasure -- I have no idea if it's related or not but thanks :p19:52
*** smarcet has quit IRC19:52
fungiclarkb: that's a fair point (unless the zombie node in question doesn't have an sshd running any longer?)19:52
clarkbfungi: ya that could be19:53
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Run k8s-on-openstack to manage k8s control plane  https://review.openstack.org/62696519:53
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Add resources for deploying rook and xtradb to kuberenets  https://review.openstack.org/62605419:53
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Increase number of k8s nodes to 4  https://review.openstack.org/62803119:53
fungibut yeah, could also be that node spontaneously going toes-up19:53
dmsimardlogan-: hang on, to validate my understanding, are you saying that some hypervisors did not expose the right CPU model mistakenly ?19:56
*** wolverineav has quit IRC19:56
*** wolverineav has joined #openstack-infra19:57
logan-dmsimard: yes, I see some are doing host-passthrough (the ones where you see cpu model "Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz"), and others are using host-model (where you see "Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)")19:57
clarkbdmsimard: mwhahaha we copy /home/zuul/.ssh/id_rsa to /etc/nodepool/id_rsa and the public key for that private key is what is put into the zuul user's authorized_keys file according to the top level ara report19:57
*** rkukura has quit IRC19:57
clarkbI think that should be fine as is then19:57
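
A quick way to verify that assumption on a held node would be something along these lines (a sketch; the subnode address is a placeholder and these commands were not run in the log):

    # print the public half of the key the nested ansible uses
    ssh-keygen -y -f /etc/nodepool/id_rsa
    # compare it against the zuul user's authorized_keys on the subnode
    cat ~zuul/.ssh/authorized_keys
    # a manual connection attempt from the primary surfaces the underlying ssh error
    ssh -i /etc/nodepool/id_rsa zuul@<subnode-ip> true
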
*** mriedem has quit IRC19:58
fungidmsimard: logan-: so piecing this together, the ones where host-model was in use were where these failures were observed?19:58
*** aojea has quit IRC20:00
*** bobh has quit IRC20:00
*** shardy has quit IRC20:00
*** kgiusti has left #openstack-infra20:01
*** wolverineav has quit IRC20:01
*** kgiusti has joined #openstack-infra20:01
dmsimardfungi: A pattern I found when comparing failures against successful jobs (for the same set of tripleo jobs) is that the failures seemed to happen exclusively when the subnode had an E3-12xx processor. I didn't come across a failure that ran on the E5 variant. Could be a coincidence or a red herring, though.20:02
clarkbdmsimard: mwhahaha on the logging front the nested ansible does all the log collections not the executor driven ansible?20:03
clarkbdmsimard: mwhahaha: worth noting that we appear to be able to ssh into the subnode from the executor over ipv6 if I am reading logs correctly20:03
mwhahahacorrect for the tripleo jobs it's the nested ansible20:04
*** jamesmcarthur has quit IRC20:04
logan-host-passthrough is set on all of the nodes now.20:05
*** josephrsandoval has quit IRC20:06
dmsimardlogan-: cool, thanks !20:06
*** mriedem has joined #openstack-infra20:08
openstackgerritDavid Moreau Simard proposed openstack-infra/elastic-recheck master: Add query for intermittent tripleo failures in limestone  https://review.openstack.org/62803420:11
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Run k8s-on-openstack to manage k8s control plane  https://review.openstack.org/62696520:13
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Add resources for deploying rook and xtradb to kuberenets  https://review.openstack.org/62605420:13
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Increase number of k8s nodes to 4  https://review.openstack.org/62803120:13
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Add openstack keypair for the root key  https://review.openstack.org/62803520:13
mordredinfra-root: could y'all check out 628035 real quick ^^ ?20:13
* mordred is trying to add the new node from bridge using the patches in gerrit20:14
fungilogan-: dmsimard: could lack of host-passthrough cause nested virt instability maybe?20:14
* fungi has no idea what its behaviors are keyed off of20:14
logan-fungi: it's quite possible that the vmx flag was not present in the host-model guests so they may have been using software virt20:15
clarkbfungi: I'm thinking its more likely an ipv4 issue considering that ipv6 seems to work from the executor20:15
fungimordred: that's the public ssh key for root@bridge.o.o?20:15
mordredfungi: yeah20:15
mordredfungi: I pulled it from the authorized_keys file on review.o.o20:16
openstackgerritMerged openstack-infra/nodepool master: Ignore removed provider in _cleanupLeakedInstances  https://review.openstack.org/60867020:17
fungimordred: and this is specifically in support of magnum?20:17
clarkbfungi: its not using magnum, it uses a set of ansible playbooks/roles aiui20:17
fungiahh, it says k8s-on-openstack so wasn't sure if that was the actual name or just a descriptive term20:18
fungiis this the openstack provider driver for kubernetes which wants to find the keypair via the api?20:19
clarkbI think it's ansible20:20
clarkbso that it can manage the k8s install20:20
fungiseems fine. we already bake that key into our other systems anyway, so including it there isn't going to make anything i can think of less secure20:22
clarkbmy only concern about it is we'd allow ansible from the bridge to run on those nodes and we don't want that to happen with our regular ansible runs20:22
clarkbso we'll have to be extra careful we don't add them to the inventory20:22
fungiwe don't want to manage them with ansible long-term?20:23
clarkbnot with the same playbooks20:23
clarkbsince we don't want the regular user accounts or firewall rules there as on our normal hosts (etc)20:24
dmsimardclarkb, fungi, logan-: thanks for your help, I sent a patch to add an elastic-recheck query to track if it reproduces despite the host-passthrough fix: https://review.openstack.org/#/c/628034/20:26
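elastic-recheck queries are small YAML files keyed by bug number that hold an elasticsearch query string; a hedged sketch of the shape of such a query (bug number and terms are illustrative, not the actual content of 628034):

    # queries/1810000.yaml  -- illustrative bug number
    query: >-
      message:"Timed out during ssh connection"
      AND node_provider:"limestone-regionone"
      AND voting:1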
clarkbdmsimard: if it does we should consider using ipv6 and check if it is a v4 vs v6 problem20:27
clarkbdmsimard: one note on the e-r query20:29
dmsimardclarkb: why do we set up an ovs bridge by default on multinode again ? I forget20:29
openstackgerritDavid Moreau Simard proposed openstack-infra/elastic-recheck master: Add query for intermittent tripleo failures in limestone  https://review.openstack.org/62803420:30
*** panda|off has quit IRC20:30
clarkbdmsimard: because much of the software we test assumes it can "manage" the networking so we create a corner for it to do that in20:30
mordredfungi, clarkb: I think we might still want sysadmin keys added ... but yeah, for now this is just some ansible that creates the nodes using the os_ modules and then installs/manages k8s on them20:30
clarkbdmsimard: it can't do that with the actual host interface as those are managed by our clouds20:30
mnaserits not using magnum... yet :p20:31
dmsimardclarkb: note fixed20:31
mordredyeah - the magnum version of this was a bit unhappy with the cephfs and containerized kubelet20:31
dmsimardclarkb: understood (for bridge)20:31
clarkbdmsimard: another side effect is it gives tests a consistent network view they can use regardless of the underlying networking20:32
clarkbdmsimard: that isn't critical though and could be worked around if we had to20:32
*** panda has joined #openstack-infra20:33
corvusmordred: 628035 lgtm -- though would it be more helpful to call the key 'bridge' or something?20:33
corvusmordred: or rather, call the 'openstack keypair' 'bridge-$date' ?20:34
mordredsure - I could respin it real quick20:34
corvusclarkb, fungi: ^?20:34
dmsimardclarkb: I naively figured devstack (and other projects) ended up installing and setting up ovs anyway :p20:34
clarkbcorvus: ++20:34
mordredI pulled the date from the date in the key comment so that it would match20:34
corvusmordred: existing date is fine -- i was just trying to be clear in my englishing20:34
clarkbdmsimard: devstack for example will set up ovs for neutron. But we are setting up OVS here to provide a virtual switch that neutron can plug into. So it's somewhat separate from neutron itself20:34
mordred++20:34
mordredcorvus: how about "bridge-root-2014-09-15"20:35
clarkbdmsimard: think of it as giving tests a consistent interface scheme and l2 networking via a switch that we bring to the jobs20:35
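The mechanics behind that are roughly: create the same OVS bridge on every node and tunnel the nodes together over the provider network. A sketch of the idea (bridge name, tunnel type, and the peer address variable are assumptions; the real multi-node roles handle this):

    - hosts: all
      become: true
      tasks:
        - name: Create the shared test bridge on every node
          command: ovs-vsctl --may-exist add-br br-infra
        - name: Tunnel the bridge to the peer node so both see one L2 segment
          command: >-
            ovs-vsctl --may-exist add-port br-infra vxlan0
            -- set interface vxlan0 type=vxlan options:remote_ip={{ peer_ip }}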
corvusmordred: wfm20:35
corvusmordred: i think that will avoid me looking at it and saying "of course it's the key for the root user..."  :)20:35
*** wolverineav has joined #openstack-infra20:35
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Add openstack keypair for the bridge root key  https://review.openstack.org/62803520:35
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Run k8s-on-openstack to manage k8s control plane  https://review.openstack.org/62696520:35
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Add resources for deploying rook and xtradb to kuberenets  https://review.openstack.org/62605420:35
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Increase number of k8s nodes to 4  https://review.openstack.org/62803120:35
mordredcorvus: ok. I added that key  manually and am running the ansible to add a new node20:40
corvusgroovy20:40
*** wolverineav has quit IRC20:40
fungisorry, got sidetracked by dirty dishes for a moment. catching up20:41
fungikey namechange is fine by me, +220:44
*** hwoarang has quit IRC20:48
*** jcoufal has quit IRC20:48
*** slaweq_ is now known as slaweq20:49
*** hwoarang has joined #openstack-infra20:49
openstackgerritMerged openstack-infra/system-config master: Reject messages to starlingx-discuss-owner  https://review.openstack.org/62803020:50
openstackgerritMerged openstack-infra/elastic-recheck master: Add query for intermittent tripleo failures in limestone  https://review.openstack.org/62803420:52
openstackgerritGoutham Pacha Ravi proposed openstack-infra/devstack-gate master: Grenade: Allow setting Python3 version  https://review.openstack.org/60737920:52
*** slaweq has quit IRC20:55
corvusmordred, clarkb, fungi: when should we schedule our k8s/rook walkthrough?  how about immediately following the infra meeting next tuesday?20:56
clarkbcorvus: that works for me20:56
mordred++20:56
clarkbthat might be late for frickler though. Maybe 1900UTC (slightly earlier) a week from today is better?20:57
corvusi'll make an ethercalc with options20:59
mordredmnaser: if you're around, I just booted a new node in sjc1 and it's getting name resolution errors trying to resolve archive.ubuntu.com21:00
corvusmordred: ^ does that time work for you as well?21:00
mnasermordred: whats the dns that's configured on it?21:00
mordredmnaser: dunno - it's just an ubuntu node that's trying to install some things via cloud-init21:00
mnasermordred: cat /etc/resolv.conf if you can ssh into it?21:00
mordredmnaser: I'm having issues sshing in as well ... maybe something went sideways with networking?21:01
mordredmnaser: 1bcf108e-85f9-4113-ac7a-0d60daa4e6f6 is the instance uuid in case it's a thing you want to look at on your side21:01
mnaserssh might take a really long time if dns isn't resolving properly21:01
fungicorvus: 20:00z tuesday january 8 wfm, i have no conflicts21:01
mnaserlet me check21:01
mordredmnaser: https://github.com/infraly/k8s-on-openstack/blob/master/roles/openstack-nodes/tasks/main.yaml#L20-L27 is the cloud-init script in question - so I don't think it should be doing much to the node21:02
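For context only (this is not the contents of that file), booting a node whose cloud-init user data installs a package or two might look like:

    - hosts: localhost
      tasks:
        - name: Boot a k8s node with a small cloud-init payload
          os_server:
            cloud: vexxhost-sjc1            # illustrative
            name: k8s-node-4
            image: Ubuntu 18.04             # illustrative image name
            flavor: illustrative-8gb-flavor
            key_name: bridge-root
            userdata: |
              #cloud-config
              package_update: true
              packages:
                - python3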
fungioh, or (catching up) some other time which works well for frickler sure21:03
clarkbwe might just have to do it twice too in order to get ianw and frickler and USAians all in the same spot21:03
mnasermordred: could you possibly try to get another floating ip instead of that one?21:05
mordredmnaser: sure - one sec21:05
corvusclarkb: well, i also plan on recording it.  so if we do it whenever most folks can join, we should be able to get a lively bunch of questions and hit 99% of what we need, and can handle the rest as followups.21:05
clarkbcorvus: ++21:05
*** wolverineav has joined #openstack-infra21:07
corvusclarkb, fungi, mordred: how's this?  https://ethercalc.openstack.org/infra-k8-walkthrough21:09
corvusi'll send an email21:09
clarkbcorvus: lgtm. I've added myself21:09
mordredmnaser: I am able to ssh in with the new floating ip21:09
*** rkukura has joined #openstack-infra21:10
mordredmnaser: and it's able to resolve dns now - so maybe just a hiccup?21:10
mnasermordred: the floating IP seems to be null routed so I’ll follow up with the why later21:11
*** slaweq has joined #openstack-infra21:12
mordredmnaser: awesome21:12
mordredmnaser: want me to leave it in the account? or delete it?21:13
dhellmannconfig-core: the final step of importing osc-summit-counter into gerrit is to add the release jobs (and then to do a release). I would appreciate it if you add https://review.openstack.org/#/c/625627/ to your review queue for this week.21:13
mnasermordred: you can release it for now21:13
mordredmnaser: cool21:13
corvusdhellmann: that tool makes me so happy; i'm really looking forward to using it.  :)21:14
dhellmanncorvus : would you like to be a member of the maintainer team? :-)21:14
corvusdhellmann: sure, i might even be qualified :)21:14
dhellmannthe bar is pretty low ;-)21:15
*** studarus has joined #openstack-infra21:15
corvusdhellmann: (i assume the qualifications include "can recite alphabet")21:15
dhellmanndone21:15
dhellmannyeah, at some point one of you is going to find the math error I put in the first version and then you'll get to be PTL21:15
dhellmann(it's still there)21:15
dhellmannat least I think it is21:16
fungii don't consider myself qualified, i think i've gotten the "how many prior summits have you attended" question on the registration form wrong before21:16
clarkbI stopped trying to get it right :/21:16
corvusfungi: then you should definitely be a user :)21:16
*** slaweq has quit IRC21:17
corvusthis tool, if widely known, could save hundreds of person-hours across our community :)21:17
dhellmannand improve the accuracy of our data collection significantly, too, I'm sure21:17
mordredcorvus: if ansible is to be believed, we should have a 4th node now21:19
*** xek has quit IRC21:20
mordredcorvus: and kubectl get nodes seems to also think so21:20
corvusmordred: \o/21:20
*** efried has quit IRC21:21
corvusmordred: should we run the rook operator?21:21
*** slaweq has joined #openstack-infra21:21
mordredcorvus: I think the rook operator should notice these actions AIUI21:21
corvusoh, hrm. i'll check its log21:21
corvusthere are no log entries21:22
corvuswhich i think means it has been quiet for $some_time and logs have been rotated via $some_mechanism21:22
mordredthat sounds about right21:23
mordredcorvus: https://github.com/rook/rook/blob/master/design/cluster-update.md21:24
openstackgerritMerged openstack-infra/project-config master: add release job for osc-summit-counter  https://review.openstack.org/62562721:25
corvusmordred: since we're doing all devices, we're not updating the cluster crd21:25
*** slaweq has quit IRC21:25
mordredcorvus: but - I think we might need to, to add a new storage node perhaps?21:25
openstackgerritMerged openstack-infra/system-config master: Add openstack keypair for the bridge root key  https://review.openstack.org/62803521:26
mordredcorvus: but also maybe you're right21:26
corvusmordred: we're also using all nodes21:26
mordredyeah21:26
corvusmordred: we have useAllNodes:true and useAllDevices:true so it's all automatic21:26
mordred++21:26
corvusso if it only watches the crd changes, then we'll need to restart; but if it somehow gets 'k8s node added' events, then i agree, it should have what it needs.21:27
corvusi don't know what reality is :)21:27
mordredme either :)21:27
corvusi will say that design doc only mentions crd update events21:27
mordredyeah. well - we could try deleting the operator pod and letting the app respawn it and see if that re-registers things21:29
*** efried has joined #openstack-infra21:29
corvusmordred: yeah, it's had plenty of time to do something on its own.  let's assume that and delete it.  you go?21:29
mordredcorvus: sure21:29
mordredI have deleted it and a new one has been created21:30
*** kgiusti has left #openstack-infra21:30
mordredcorvus: we seem to have 4 osds now21:31
corvushere's the current status: http://paste.openstack.org/show/739837/21:32
corvusit's backfilling21:32
mordredcorvus: neat21:33
mordredcorvus: should we wait for it to finish that before hupping gerrit to restart the replication?21:35
openstackgerritMerged openstack-infra/zuul master: Only reset working copy when needed  https://review.openstack.org/62434321:35
clarkbreading that doc, it seems like if we explicitly listed the nodes rather than setting all, then an update to the crd with a new node would tell the operator to make a change21:36
corvusclarkb: agreed21:36
clarkbbut it's diff based, so all -> all produces no diff and is a noop (even if the underlying hosts have changed)21:36
mordredyah. although restarting the operator pod is kind of like sending something a hup signal - so maybe just doing that on node additions would be fairly easy to do?21:37
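To make the two selection styles under discussion concrete, a hedged sketch of the relevant part of a CephCluster resource (field names per rook's v1 CRD of that era; values are illustrative):

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      storage:
        # current setup: everything automatic, so adding a k8s node
        # produces no CRD diff for the operator to react to
        useAllNodes: true
        useAllDevices: true
        # alternative: list nodes explicitly; adding one is then a CRD
        # update event the operator acts on
        # nodes:
        #   - name: k8s-node-4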
corvusmordred: re gerrit: maybe?  the backfilling process is still very slow; i don't know enough about the issue to know whether we're hitting a performance limitation (in which case replication would slow us down even more), or just artificial limits in backfilling (where ceph is preserving performance for client data reads/writes) in which case it would be fine21:38
corvusmordred: however, we're not in production, so maybe we go ahead and try it :)21:38
mordredcorvus: :)21:38
mordredcorvus: yeah - we might learn something from it21:38
fungiis the ceph cluster split across service providers?21:39
corvus(it's also really hard to predict; over the holiday, backfilling looked like it would take a week based on reported numbers, but finished in about a day)21:39
mordredclarkb, corvus, fungi: to delete the operator pod (and let the application re-spawn it): kubectl -n rook-ceph-system delete pod -l app=rook-ceph-operator21:39
corvusfungi: nope, all in one region21:39
fungididn't know if we were going for cross-provider redundancy or anything. i guess it can be added in the future but better to not complicate the pic21:40
fungipoc21:40
clarkbaiui you don't want to split osds across regions unless they are part of a full replica type setup21:41
corvusfungi: yeah, i'm not sure about either ceph-on-wan or percona-on-wan.  i could see either of them being finicky enough to make it harder.21:41
clarkbwhich is doable but its more like having a copy of all the data in each region21:41
mordredyah21:41
clarkb(I looked into it once upon a time for other raisins and it wasn't really going to win much)21:41
clarkband you have to do all your writes to one of the replicas iirc21:41
fungiand cold-failover is probably plenty for our risk profile anyway, live redundancy could be more hassle than it's worth21:42
corvusone thing we may want to think about before our next deployment -- whether we want to continue using ceph's default replica strategy, or switch to erasure coding21:42
clarkbcorvus: erasure coding might be a nice way to reduce disk overhead when we consider that vexxhost is also running with some number of copies under us21:43
clarkbbut recovery cost is higher aiui21:43
clarkb(since you have to maths more to get the data back)21:43
corvusfungi: well, for this system i'm proposing that the entire gitea system be HA, but within a single cloud provider.  so if vexxhost-sjc1 goes down, we're out; but we should have vm/hypervisor fault tolerance within that constraint.21:44
mordredyeah - I mean, I think our risk profile for losing one of the cinder volumes is actually pretty low - because they are themselves ceph volumes in this case on the underside21:44
corvusmordred: right, we're unlikely to lose data, but we may lose access (for a time)21:45
mordredyah21:45
fungicold failover involving building some new servers in another provider and re-replicating data to them seems reasonable, i meant21:47
mordredyup. this is all re-creatable data21:47
corvusfungi: ah, yes.21:47
corvusand we can do that pretty fast too :)21:48
fungiright, just weighing the likelihood of a provider permanently (or long-term temporarily) becoming unavailable for us against the work involved in relocating services21:48
clarkberasure coding may also make the disk usage numbers harder to reason about, if we want to keep that as simple as possible for now21:49
clarkbyou use 1.3x the size of the data or something iirc21:49
clarkbor was it 1.621:50
mordredclarkb: well, we already don't know how to reason about disk usage :)21:50
fungifrom incomprehensible to incomprehensibler21:51
clarkb1.5x21:52
clarkbin the default profile21:52
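For the arithmetic: with 3-way replication, 1 GiB of logical data consumes about 3 GiB of raw capacity (3.0x). With an erasure-code profile of k data chunks and m coding chunks the raw usage is (k+m)/k, so the long-standing default of k=2, m=1 gives 3/2 = 1.5x, at the cost of more CPU and I/O during recovery (hedging on which profile the deployed Ceph release actually defaults to).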
*** bobh has joined #openstack-infra21:55
*** slaweq has joined #openstack-infra21:55
*** slaweq has quit IRC22:00
corvusk8s walkthrough scheduling email sent22:00
*** bobh has quit IRC22:00
*** trown is now known as trown|outtypewww22:00
*** bobh has joined #openstack-infra22:01
*** jamesmcarthur has joined #openstack-infra22:05
*** jamesmcarthur has quit IRC22:09
corvusany objection to my restarting gerrit now?22:12
clarkbDigging into e-r data, since I've been trying to keep an eye on it before holidays, I notice that pypi errors shot up but that is because kombu removed a release we pinned in global constraints. I wonder if there is a way to make those errors look less like infra problems in e-r when they aren't infra problems22:13
clarkbcorvus: not from me22:13
*** eernst_ has joined #openstack-infra22:13
clarkboverall e-r looks healthy, but that is because test demand is super low. Only 6 integrated gate failures in the last 10 days22:14
clarkbMaybe this should go on the back burner until we have 10 days of data that resemble normalcy22:14
mordredcorvus: go for it22:14
*** smarcet has joined #openstack-infra22:14
clarkbhappily someone else noticed the kombu issue, generate-constraints generated a fix for it automatically, and the change was merged22:15
corvus#status log restarted gerrit to clear stuck gitea replication task22:16
openstackstatuscorvus: finished logging22:16
clarkbso other than limestone ssh weirdness (which maybe was fixed? time will tell) and the ovh bhs1 quota issues I think we are looking healthy22:16
clarkbgerrit web ui is responding for me again22:18
corvusreplications to gitea are happening: http://38.108.68.66/explore/repos22:21
clarkbShrews: do you know if the december 5th nodepool launcher restart was to handle your schema change?22:21
clarkbShrews: or will the next restart need to be a full launcher restart?22:21
corvusnow i'd like to restart the zuul scheduler22:21
clarkbIf we have already done the full launcher restart then I think we are safe to restart them one by one whenever we like (since openstacksdk is now fixed)22:22
Shrewsclarkb: i had a schema change?22:22
corvusand.... since we just merged a merger change, all of zuul, actually.22:22
clarkbShrews: looks like you wrote the change note about it but original change was corvus'22:23
clarkbhttps://review.openstack.org/#/c/623046/22:23
Shrewsclarkb: oh, yeah, the restart was for that22:24
corvusyeah, i think we should be good wrt that22:24
clarkbShrews: great, so launchers can be restarted whenever we like in whatever order we like22:24
mordredI support the restarting of the things22:24
*** wolverineav has quit IRC22:24
clarkbshould I go ahead and restart one launcher, monitor it then do the other four?22:24
clarkber other 322:24
corvusclarkb: sure; i'm about to restart all of zuul; i don't think we need to coordinate, unless you want to?22:24
Shrewsclarkb: any order should be fine22:25
clarkbcorvus: I don't think we need to coordinate22:25
clarkbchange log in nodepool looks pretty safe after that full restart point22:26
corvusstopping zuul22:26
Shrewsthat should get the host_id change pabelanger wants too22:26
clarkbShrews: yup22:26
clarkbthat and double checking sdk is happy is the motivation here22:26
corvusscheduler is back up, executors are still stopping22:27
mordredcorvus: the dashboard seems to be sad22:28
corvusmergers are back up22:28
mordredit's redirecting or something22:28
clarkbI'm going to start with nl0322:28
*** wolverineav has joined #openstack-infra22:28
clarkbnodepool==3.3.2.dev88  # git sha f8bf6af is what is installed on 0322:29
corvusmordred: i think it will clear once the config is loaded.  you can see the error in the console.  i mentioned it to tristanC.22:29
corvusmordred: "s.pipelines is undefined"22:29
clarkband openstacksdk 0.22.022:29
mordrednod22:29
corvusall zuul components are running now22:29
clarkbmordred: is 0.22.0 sdk what we want?22:29
mordredclarkb: yes, I believe so22:29
corvusPlaybook run took 0 days, 0 hours, 2 minutes, 58 seconds22:30
*** tosky has joined #openstack-infra22:30
clarkbnl03 launcher restarted. I'll do the others if this one looks happy22:30
corvushrm.  the scheduler has not loaded its configuration yet, and it's not clear why.22:31
clarkbcorvus: I seem to recall it taking about 5 minutes22:32
corvusclarkb: yeah, that's how long the cat jobs take but i think they are done22:32
mordredcorvus: dashboard is up - maybe it was just taking longer22:33
*** boden has quit IRC22:34
corvushrm.  yeah, looks like there were some slow cat jobs22:34
corvus2019-01-02 22:32:15,920 INFO zuul.TenantParser: Loading configuration from openstack/nova/.zuul.yaml@stable/ocata22:35
corvusmay be time to clear out the nova repos on the mergers :)22:35
*** smarcet has quit IRC22:35
corvus(that was one of the stragglers)22:35
clarkbnl03 seems happy but it hasn't lifecycled a node yet (I expect it will get to that once corvus re-enqueues the queues)22:36
*** smarcet has joined #openstack-infra22:36
clarkbonce I see it has done that successfully I'll use that as the marker to restart the others22:36
*** jtomasek has quit IRC22:36
corvusstarting those now22:36
corvus#status log restarted all of zuul at commit 4540b7122:36
openstackstatuscorvus: finished logging22:36
clarkbI see the hostids in the logs now too22:38
clarkbpabelanger: ^ fyi, and thanks22:38
*** efried has quit IRC22:39
clarkbcorvus: do we repack and gc those repos?22:40
clarkbhrm I guess a GC won't remove all the zuul refs since they have pointers to them22:40
clarkb(so a delete and reclone is a bit more forceful GC)22:41
corvusclarkb: not explicitly; only the default git gc22:41
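A sketch of the "clear out the nova repos on the mergers" idea mentioned earlier, with the merger cache path as an assumption and on the understanding that the merger should be quiesced so zuul simply re-clones on the next job:

    - hosts: zuul-mergers               # hypothetical group
      become: true
      tasks:
        - name: Drop the cached copy so the next merge operation re-clones it
          file:
            path: /var/lib/zuul/git/git.openstack.org/openstack/nova   # path is an assumption
            state: absent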
*** jamesmcarthur has joined #openstack-infra22:42
clarkbnode with id 0001464013 was created, used, and deleted on nl0322:43
clarkbI'm restarting the other launchers now22:43
*** jamesmcarthur has quit IRC22:46
clarkb#status log Restarted nodepool launchers nl01-nl04 to pick up hypervisor host id logging and update openstacksdk. Now running nodepool==3.3.2.dev88  # git sha f8bf6af and openstacksdk==0.22.022:47
openstackstatusclarkb: finished logging22:47
*** diablo_rojo has quit IRC22:47
*** rcernin has joined #openstack-infra22:49
*** jamesmcarthur has joined #openstack-infra22:51
*** e0ne has joined #openstack-infra22:51
mordredcorvus: http://38.108.68.66/openstack/aodh claims to have replicated by the gerrit repl logs, but is still showing empty (was looking at it randomly)22:53
corvuslookin22:53
clarkbwe gained 4 mergers after the zuul restart22:54
mordredcorvus: other things, both before and after it - seem to have replicated properly22:54
clarkbfungi: ^ iirc you had looked into that and found it was network connections between gearman having gone away?22:54
mordredcorvus: fwiw, bifrost and blazar also seem empty - I haven't found any specific patterns yet22:57
corvusmordred: i believe we saw empty repos where we pushed to them before they were created in gitea; and then if we pushed to them again, they showed up correctly; but perhaps if there are no actual changes to the repos, the internal event to refresh doesn't fire?22:59
mordredcorvus: it would be neat if there was a gitea-refresh-repo command or something23:00
*** bobh has quit IRC23:01
*** dkehn_ has joined #openstack-infra23:01
*** studarus has quit IRC23:01
*** bobh has joined #openstack-infra23:01
*** dkehn has quit IRC23:02
*** dkehn_ is now known as dkehn23:02
corvusmordred: i tried "Reinitialize all missing Git repositories for which records exist" but that wasn't it.23:03
corvusmordred: i tried "Execute health checks on all repositories" and it said "23:03
corvusRepository health checks have started.23:03
corvus"23:03
corvusso... ?23:03
*** wolverineav has quit IRC23:03
corvusi think it completed without incident.23:04
corvusso yeah, i don't see an easy way to refresh it, other than to figure out what the internal hook is23:04
mordredI'm looking at https://github.com/go-gitea/gitea/blob/master/cmd/hook.go23:05
corvusi think 'update' is what we want23:05
corvusbased on:23:06
corvus[Macaron] 2019-01-02 23:04:53: Started POST /api/internal/ssh/2/update for [::1]23:06
corvus[Macaron] 2019-01-02 23:04:53: Completed POST /api/internal/ssh/2/update 200 OK in 5.695615ms23:06
*** bobh has quit IRC23:06
mordred++23:07
*** wolverineav has joined #openstack-infra23:07
*** smarcet has quit IRC23:07
mordredcorvus: actually - if I'm reading https://github.com/go-gitea/gitea/blob/master/routers/private/internal.go right23:09
mordredcorvus: I think /api/internal/ssh/2/update has to do with public keys ... but still reading23:10
corvusweird23:10
*** rascasoft has quit IRC23:10
mordredhttps://github.com/go-gitea/gitea/blob/master/models/update.go23:12
corvusmordred: well, it seems like we may just want to make this operate like the current cgit -- create in gitea first, before we create in gerrit23:12
mordredyeah23:14
corvusi'm inclined to just write this off and redirect effort on doing that23:14
mordredyah - I think that's a good call. although I also think, once the current replication is done, figuring out how to trigger gitea to do whatever it needs to do if/when something got missed seems like a good plan23:15
corvusmordred: i'm sure we could delete the repo and re-push; so if that happens for an individual repo, we've got a solution23:15
corvusjust would be annoying at the current error scale23:15
mordredyeah23:15
corvuswe can go ahead and do that for aodh to validate23:15
corvus(but maybe after the current replication backlog clears)23:16
mordredyeah. I think poking too hard while it's still trying to replicate is maybe the wrong choice23:16
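A hedged sketch of pre-creating a repository through gitea's REST API before gerrit ever pushes to it (endpoint path, port, and credentials are assumptions to check against the deployed gitea version):

    - hosts: localhost
      tasks:
        - name: Create the repo in gitea ahead of replication
          uri:
            url: "http://38.108.68.66:3000/api/v1/org/openstack/repos"
            method: POST
            user: root
            password: "{{ gitea_root_password }}"   # hypothetical variable
            force_basic_auth: true
            body_format: json
            body:
              name: aodh
              private: false
            status_code: 201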
mordredcorvus: I see things like this in the gerrit replication logs:23:19
mordredRemoteRefUpdate[remoteName=refs/tags/4.1-eol, NOT_ATTEMPTED, (null)...d2624d975b0f24f6e1beb9bd832420b2bffef7d2, srcRef=refs/tags/4.1-eol, forceUpdate, message=null]23:19
*** e0ne has quit IRC23:19
mordredand23:20
mordredRemoteRefUpdate[remoteName=refs/changes/79/176779/1, NOT_ATTEMPTED, (null)...389afee9636303201b00477a56a1f4156751fa6d, srcRef=refs/changes/79/176779/1, forceUpdate, message=null]23:20
corvusi don't know the meaning of those23:21
mordredcorvus: maybe the refs/changes ones are just gerrit asserting that it's not going to replicate ref/changes ? (although it's super chatty if that's the case)23:22
corvusmordred: yeah; though tag should be there?23:22
corvusalso.. i think it *does* replicate refs/changes23:22
mordredyeah23:22
*** rlandy|rover is now known as rlandy|rover|bbl23:23
mordredalso: http://38.108.68.66/openstack/fuel-library is the repo associated with both of those, and it's empty23:23
mordredalthough I did also see a NOT_ATTEMPTED message for a repo that did have content23:23
*** jamesmcarthur has quit IRC23:23
openstackgerritJames E. Blair proposed openstack-infra/zuul master: WIP: Combine artifact URLs with log_url if empty  https://review.openstack.org/62806823:24
*** jamesmcarthur has joined #openstack-infra23:24
corvusmordred: 812e8662 waiting .... 23:21:10.074      (retry 1) [f3336f21] push git@38.108.68.66:openstack/fuel-library.git23:26
corvusmordred: there are a couple of retry replication events at the bottom of the queue now23:27
mordredcorvus: ah - maybe gitea is backed up trying to process the incoming onslaught?23:27
*** jamesmcarthur has quit IRC23:28
corvusmordred: well, i wouldn't describe it as an onslaught; more of a trickle23:28
mordredyeah. I'm seeing some things in the gitea log ...23:30
mordred2019/01/02 23:24:24 [...tea/models/update.go:270 pushUpdate()] [E] GetBranchCommitID[refs/changes/58/277858/1]: object does not exist [id: refs/heads/refs/changes/58/277858/1, rel_path: ]23:30
corvuswe've added 100MiB of data stored in the course of 1.25 hours, and are halfway through the list of repos23:30
corvusthat's a change to fuel-main23:31
corvusfuel-main is currently being replicated (still)23:34
*** wolverineav has quit IRC23:35
corvusmordred: maybe that error shows up for every refs/changes push because it isn't a branch?23:36
mordredmaybe?23:37
corvusmordred: so it looks like it's emitting one of those errors for every fuel-main change right now, and it takes about a second to do that, and that's what's taking so long23:39
mordredcorvus: wow, awesome23:39
*** wolverineav has joined #openstack-infra23:40
*** wolverineav has quit IRC23:40
*** wolverineav has joined #openstack-infra23:40
*** Jeffrey4l has quit IRC23:41
corvusfuel-main is finished; it did not end up on the retry list23:41
mordredI agree. and I can browse it23:41
mordredand it has tags and whatnot, so yay (also has the switched branch thing)23:42
corvuswe should see if re-pushing a repo causes the refs/changes errors to all happen again23:42
mordred++23:42
corvussince i'm logged in, i can see an event stream; i'll paste23:43
corvushttps://screenshots.firefox.com/XwHxRyU5N4pvz0b0/38.108.68.6623:43
corvuslooking at that second entry; this is the refs/changes url: http://38.108.68.66/openstack/fuel-main/src/branch/refs/changes/67/262167/1  (it 404s)23:43
corvusbut this is the commit sha url below it: http://38.108.68.66/openstack/fuel-main/commit/fc9f3c11f9bc78e03fcb407fd17d7bdd4d906897 (it works)23:43
corvusso i think that's consistent with gerrit pushing refs/changes, the commits ending up in the repo, but gitea not knowing what to do with them because ENOTABRANCH23:44
mordredyeah. so maybe we should tell gerrit to not replicate refs/changes to gitea23:44
mordredand then maybe figure out how to send them a patch to not bomb out on non-branch refs23:45
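One way to express "don't replicate refs/changes" is to list explicit push refspecs for the gitea remote in gerrit's replication.config instead of pushing everything; a sketch (remote name and file location are assumptions):

    - hosts: gerrit
      become: true
      tasks:
        - name: Limit replication to branches, tags and notes
          blockinfile:
            path: /home/gerrit2/review_site/etc/replication.config
            marker: "# {mark} gitea refspecs"
            insertafter: '^\[remote "gitea"\]'
            block: |
              push = +refs/heads/*:refs/heads/*
              push = +refs/tags/*:refs/tags/*
              push = +refs/notes/*:refs/notes/*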
*** diablo_rojo has joined #openstack-infra23:47
corvusbtw, i can't get over how cool it is to have "ls" actually tell you how much data is inside of a directory (this happens in cephfs)23:48
*** gyee has quit IRC23:48
clarkbdoes it complain about git notes too?23:49
clarkbalso branchless refs23:49
corvusclarkb: have't seen any yet23:49
mordredI think I saw a git note complaint previously23:49
mordredin the replication_log23:50
corvusmordred: the number of files in the sessions directory is large and increasing quickly.  (32k currently)23:50
corvusi don't remember how to see git notes on github23:53
mordred[2019-01-02 23:49:23,821] [c7a229db] Push to git@38.108.68.66:openstack/fuel-plugin-congress.git references: [RemoteRefUpdate[remoteName=refs/changes/44/420544/1, NOT_ATTEMPTED, (null)...adfa2db62988649219d64bd53746f2635d95aa43, srcRef=refs/changes/44/420544/1, forceUpdate, message=null], RemoteRefUpdate[remoteName=refs/heads/master, NOT_ATTEMPTED, (null)...adfa2db62988649219d64bd53746f2635d95aa43,23:53
mordredsrcRef=refs/heads/master, forceUpdate, message=null], RemoteRefUpdate[remoteName=refs/notes/review, NOT_ATTEMPTED, (null)...c9851bc1cc42f14946cdd98a35cdb6d1b5a2f207, srcRef=refs/notes/review, forceUpdate, message=null]]23:53
mordredthere's one with refs/notes listed23:53
mordredalthough interestingly enough it also lists refs/heads/master23:54
corvusmordred: afaict, not_attempted doesn't mean anything useful23:55
mordredcorvus: yeah. I agree with that23:55

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!