Wednesday, 2019-01-02

*** bobh has quit IRC00:04
*** bobh has joined #openstack-infra00:07
openstackgerritTristan Cacqueray proposed openstack-infra/puppet-openstackci master: logserver: set CORS header  https://review.openstack.org/62790300:11
*** bobh has quit IRC00:12
*** bobh has joined #openstack-infra00:25
*** wolverineav has joined #openstack-infra00:26
*** jamesmcarthur has joined #openstack-infra00:26
*** armax has joined #openstack-infra00:27
*** wolverineav has quit IRC00:30
*** bobh has quit IRC00:45
*** bobh has joined #openstack-infra00:54
*** jamesmcarthur has quit IRC00:56
*** _alastor1 has joined #openstack-infra00:58
*** jamesmcarthur has joined #openstack-infra00:59
*** wolverineav has joined #openstack-infra01:01
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: web: add OpenAPI documentation  https://review.openstack.org/53554101:17
*** hwoarang has quit IRC01:17
*** hwoarang has joined #openstack-infra01:19
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: config: add playbooks to job.toDict()  https://review.openstack.org/62134301:22
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: config: add tenant.toDict() method and REST endpoint  https://review.openstack.org/62134401:23
*** jamesmcarthur has quit IRC01:25
*** jamesmcarthur has joined #openstack-infra01:26
*** jamesmcarthur has quit IRC01:29
*** jamesmcarthur_ has joined #openstack-infra01:29
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: config: add tenant.toDict() method and REST endpoint  https://review.openstack.org/62134401:31
*** rfolco has joined #openstack-infra01:31
*** bobh has quit IRC01:36
*** hongbin has joined #openstack-infra01:38
*** bobh has joined #openstack-infra01:42
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: scheduler: add job's parent name to the rpc job_list method  https://review.openstack.org/57347301:43
*** bobh has quit IRC01:44
*** jamesmcarthur_ has quit IRC01:48
*** jamesmcarthur has joined #openstack-infra01:49
*** jamesmcarthur has quit IRC01:53
*** wolverineav has quit IRC01:54
*** wolverineav has joined #openstack-infra01:54
*** wolverineav has quit IRC02:02
*** bobh has joined #openstack-infra02:12
*** jamesmcarthur has joined #openstack-infra02:21
*** dave-mccowan has joined #openstack-infra02:22
*** wolverineav has joined #openstack-infra02:25
*** bhavikdbavishi has joined #openstack-infra02:28
*** wolverineav has quit IRC02:31
*** rkukura has quit IRC02:40
*** bobh has quit IRC02:46
*** bhavikdbavishi has quit IRC02:47
*** hwoarang has quit IRC02:50
*** psachin has joined #openstack-infra02:50
*** bobh has joined #openstack-infra02:53
*** dave-mccowan has quit IRC02:54
*** hwoarang has joined #openstack-infra02:56
*** dave-mccowan has joined #openstack-infra02:58
*** _alastor1 has quit IRC02:58
*** rfolco has quit IRC02:59
openstackgerritMerged openstack-infra/infra-manual master: Fix some URL redirections and broken links  https://review.openstack.org/62258103:08
*** wolverineav has joined #openstack-infra03:11
*** dave-mccowan has quit IRC03:12
*** jamesmcarthur has quit IRC03:14
*** jamesmcarthur has joined #openstack-infra03:15
*** jamesmcarthur has quit IRC03:19
*** jamesmcarthur_ has joined #openstack-infra03:19
*** bhavikdbavishi has joined #openstack-infra03:35
*** ykarel has joined #openstack-infra03:36
*** bhavikdbavishi1 has joined #openstack-infra03:38
*** bhavikdbavishi has quit IRC03:39
*** bhavikdbavishi1 is now known as bhavikdbavishi03:39
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: config: add playbooks to job.toDict()  https://review.openstack.org/62134303:40
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: config: add tenant.toDict() method and REST endpoint  https://review.openstack.org/62134403:40
*** bobh has quit IRC03:46
*** jamesmcarthur_ has quit IRC03:59
*** ramishra has joined #openstack-infra04:05
*** bobh has joined #openstack-infra04:07
*** jamesmcarthur has joined #openstack-infra04:07
*** udesale has joined #openstack-infra04:08
*** wolverineav has quit IRC04:08
*** wolverineav has joined #openstack-infra04:17
*** bobh has quit IRC04:20
*** jamesmcarthur has quit IRC04:21
*** jamesmcarthur has joined #openstack-infra04:22
*** ykarel has quit IRC04:33
*** hongbin has quit IRC04:43
*** ykarel has joined #openstack-infra04:49
*** ykarel has quit IRC04:53
*** jamesmcarthur has quit IRC04:57
*** jamesmcarthur has joined #openstack-infra04:57
*** eernst has joined #openstack-infra05:05
*** jamesmcarthur has quit IRC05:06
*** yboaron has joined #openstack-infra05:23
*** wolverineav has quit IRC05:25
*** wolverineav has joined #openstack-infra05:26
*** wolverineav has quit IRC05:39
*** wolverineav has joined #openstack-infra06:15
*** wolverineav has quit IRC06:25
*** ykarel has joined #openstack-infra06:28
*** eernst has quit IRC06:28
*** hwoarang has quit IRC06:31
*** hwoarang has joined #openstack-infra06:31
*** psachin has quit IRC06:34
*** jbadiapa has joined #openstack-infra06:57
*** rcernin has quit IRC07:00
*** quiquell|off is now known as quiquell07:06
*** jamesmcarthur has joined #openstack-infra07:07
*** jamesmcarthur has quit IRC07:11
*** ykarel_ has joined #openstack-infra07:15
*** ykarel has quit IRC07:17
*** ykarel__ has joined #openstack-infra07:20
*** ykarel_ has quit IRC07:23
*** ykarel_ has joined #openstack-infra07:24
*** ykarel__ has quit IRC07:27
*** ykarel_ is now known as ykarel|lunch07:31
*** ykarel_ has joined #openstack-infra07:35
*** ykarel|lunch has quit IRC07:38
*** pgaxatte has joined #openstack-infra07:45
*** arxcruz|next_yr is now known as arxcruz07:47
*** slaweq has joined #openstack-infra07:54
*** ginopc has joined #openstack-infra07:58
*** rpittau has joined #openstack-infra08:00
*** quiquell is now known as quiquell|brb08:04
*** jtomasek has joined #openstack-infra08:06
*** eumel8 has joined #openstack-infra08:07
*** ccamacho has joined #openstack-infra08:11
*** tosky has joined #openstack-infra08:14
*** psachin has joined #openstack-infra08:26
*** ykarel_ is now known as ykarel08:33
*** janki has joined #openstack-infra08:37
*** rascasoft has joined #openstack-infra08:39
*** hrubi has joined #openstack-infra08:47
*** jpich has joined #openstack-infra08:49
*** quiquell|brb is now known as quiquell08:52
*** janki has quit IRC09:01
*** janki has joined #openstack-infra09:01
*** xek has joined #openstack-infra09:05
*** aojea has joined #openstack-infra09:07
*** iurygregory has joined #openstack-infra09:14
iurygregoryGood morning everyone, Happy New Year =)09:15
iurygregoryanyone know if there is a way to retrieve old logs from a patch that was merged 4 months ago? we would like to compare logs for a job that started failing in stable/pike, and we have a patch where the job was green so we could compare some logs =)09:17
AJaegeriurygregory: we delete old logs after some time since we only have limited storage space, so no way to compare.09:17
iurygregoryAJaeger, tks o/09:18
*** yboaron_ has joined #openstack-infra09:32
*** yboaron has quit IRC09:35
*** shardy has joined #openstack-infra09:47
openstackgerritSimon Westphahl proposed openstack-infra/zuul master: Fix skipped job counted as failed  https://review.openstack.org/62591009:50
*** adriancz has joined #openstack-infra09:58
openstackgerrityolanda.robla proposed openstack/diskimage-builder master: Change phase to check for dracut-regenerate in iscsi-boot  https://review.openstack.org/62794910:06
*** bhavikdbavishi has quit IRC10:09
*** e0ne has joined #openstack-infra10:17
*** yboaron_ has quit IRC10:45
*** gfidente has joined #openstack-infra10:45
*** yboaron_ has joined #openstack-infra10:49
*** pbourke has quit IRC10:59
*** pbourke has joined #openstack-infra11:01
*** quiquell is now known as quiquell|brb11:11
*** udesale has quit IRC11:18
*** dayou has joined #openstack-infra11:26
openstackgerritSorin Sbarnea proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061311:27
*** dayou_ has quit IRC11:28
openstackgerritSorin Sbarnea proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061311:29
*** sshnaidm is now known as sshnaidm|afk11:48
*** whoami-rajat has joined #openstack-infra12:01
fungiiurygregory: to be specific, our current retention on the logs.openstack.org site is 4-5 weeks (because we can't keep more than ~12tb of compressed job logs)12:05
iurygregoryfungi, tks for the information o/12:06
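
As a rough back-of-envelope estimate (assuming log ingest is roughly steady; these figures are derived from the retention numbers above and are not stated in the discussion):

    ~12 TB retained / ~28-35 days of retention  ≈  350-430 GB of compressed job logs ingested per day
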
*** quiquell|brb is now known as quiquell12:11
*** jchhatbar has joined #openstack-infra12:12
*** psachin has quit IRC12:14
*** rpittau is now known as rpittau|lunch12:14
*** janki has quit IRC12:15
openstackgerritSorin Sbarnea proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061312:15
*** Dobroslaw has joined #openstack-infra12:17
openstackgerritTristan Cacqueray proposed openstack-infra/nodepool master: Implement an OpenShift resource provider  https://review.openstack.org/57066712:18
*** udesale has joined #openstack-infra12:26
openstackgerritSorin Sbarnea proposed openstack-infra/zuul-jobs master: Remove world writable umask from /src folder  https://review.openstack.org/62557612:26
*** rlandy has joined #openstack-infra12:39
ssbarneabnemerryxmas: happy new year to you too! -- check https://review.openstack.org/#/c/620613/ and see what else needs to be done; if I remember correctly they asked to clean up some branches.12:41
*** rlandy is now known as rlandy|rover12:42
*** rfolco has joined #openstack-infra12:43
*** sshnaidm|afk is now known as sshnaidm12:43
*** quiquell is now known as quiquell|lunch12:48
*** jcoufal has joined #openstack-infra12:52
*** aojea has quit IRC12:53
*** roman_g has joined #openstack-infra13:06
*** rh-jelabarre has joined #openstack-infra13:07
fungiheads up, i've got a morning appointment so am disappearing for a couple hours (but should be back by ~1600z)13:11
*** boden has joined #openstack-infra13:11
*** rpittau|lunch is now known as rpittau13:15
*** trown|outtypewww is now known as trown13:17
*** markvoelker has quit IRC13:20
*** yboaron_ has quit IRC13:24
*** aojea has joined #openstack-infra13:43
*** quiquell|lunch is now known as quiquell13:48
*** dave-mccowan has joined #openstack-infra13:55
*** dave-mccowan has quit IRC14:00
*** nhicher has joined #openstack-infra14:04
*** mriedem has joined #openstack-infra14:05
*** dave-mccowan has joined #openstack-infra14:06
*** shardy has quit IRC14:07
*** ykarel is now known as ykarel|away14:12
*** ykarel|away has quit IRC14:19
*** ramishra has quit IRC14:20
*** shardy has joined #openstack-infra14:27
*** ykarel|away has joined #openstack-infra14:34
*** slaweq_ has joined #openstack-infra14:41
*** slaweq has quit IRC14:42
sshnaidmwhere are the requirements for IRC channel configuration, if I want to add another channel to OpenStack?14:45
*** dkehn has joined #openstack-infra14:48
openstackgerritSagi Shnaidman proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061314:48
*** bobh has joined #openstack-infra14:50
*** ykarel|away is now known as ykarel14:51
*** bnemerryxmas is now known as bnemec14:51
*** whoami-rajat has quit IRC14:54
*** bhavikdbavishi has joined #openstack-infra14:55
fricklersshnaidm: see https://docs.openstack.org/infra/system-config/irc.html#channel-requirements14:57
sshnaidmfrickler, cool, thanks!14:59
*** slaweq_ has quit IRC15:01
*** efried has joined #openstack-infra15:02
*** ginopc has quit IRC15:02
*** slaweq_ has joined #openstack-infra15:03
*** ginopc has joined #openstack-infra15:06
*** edmondsw has joined #openstack-infra15:09
*** bhavikdbavishi has quit IRC15:10
*** bhavikdbavishi has joined #openstack-infra15:10
*** smarcet has joined #openstack-infra15:16
*** jrist has joined #openstack-infra15:17
*** slittle1 has quit IRC15:17
*** quiquell is now known as quiquell|off15:17
*** rfolco is now known as rfolco|lunch15:18
*** bhavikdbavishi has quit IRC15:19
*** jrist has quit IRC15:22
*** ginopc has quit IRC15:25
openstackgerritSagi Shnaidman proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061315:26
*** jrist has joined #openstack-infra15:26
*** ginopc has joined #openstack-infra15:26
*** whoami-rajat has joined #openstack-infra15:27
openstackgerritSorin Sbarnea proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061315:33
*** e0ne has quit IRC15:41
openstackgerritSorin Sbarnea proposed openstack-infra/project-config master: Add openstack-virtual-baremetal to openstack  https://review.openstack.org/62061315:41
openstackgerritSorin Sbarnea proposed openstack-infra/git-review master: Warn which patchset are going to be obsoleted.  https://review.openstack.org/45693715:43
*** ykarel is now known as ykarel|awat15:48
*** ykarel|awat is now known as ykarel|away15:48
*** anteaya has joined #openstack-infra15:49
*** udesale has quit IRC15:53
*** rfolco|lunch is now known as rfolco15:55
*** ykarel|away has quit IRC15:55
*** studarus has joined #openstack-infra16:00
clarkbfungi I'm not quite at a place with ssh keys yet, but opendev.org dns appears to have disappeared?16:02
fungium16:02
fungilooking16:02
clarkbI wonder if adns1 stopped zone transferring to ns*16:02
clarkbsaw this with zuul dns when adns1 crashed16:03
fungithat's what i'm going to check first ;)16:03
mordredfungi, clarkb: I agree that I can't resolve opendev.org16:04
fungiinventory/openstack.yaml says adns1 is 2001:4800:7819:104:be76:4eff:fe04:43d016:05
fungiserver is running at least, looking at logs now16:05
funginamed is running since 2018-12-0616:06
openstackgerritMerged openstack-infra/puppet-openstackci master: logserver: set CORS header  https://review.openstack.org/62790316:07
fungikey reconfiguration is happening (domains all rekeyed 15:34:05 today)16:07
fungii don't see any mention of zone transfers happening, so checking ns1 next16:08
fungi2001:4800:7819:104:be76:4eff:fe04:38f0 according to the inventory16:08
fungiunbound running on ns1 since 2018-12-0516:10
clarkbhrm glue record problem in .org then?16:10
*** gyee has joined #openstack-infra16:12
fungiadns1 returns records but ns1 does not16:13
fungiaccording to `sudo journalctl -u unbound.service` unbound on ns1 hasn't logged any problems, or in fact anything other than startup messages on december 516:16
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Switch back to three columns for mid sized screens  https://review.openstack.org/62799716:16
openstackgerritMerged openstack-dev/hacking master: Don't quote {posargs} in tox.ini  https://review.openstack.org/60919216:17
clarkbfungi: they should run nsd for external dns serving16:17
funginamed on adns1 is logging to syslog not the systemd journal, but has only logged zone key reconfiguration messages for at least the past couple days16:17
clarkbunbound is only for local dns16:17
fungioh, nsd not unbound. thanks :/16:17
clarkband we run bind on adns116:18
fungiCould not tcp connect to 104.239.146.24: No route to host16:18
fungii bet this is a firewall issue16:18
*** rkukura has joined #openstack-infra16:18
fungiindeed, the iptables rules on adns1.opendev.org are only allowing zone transfers from ns1 and ns2.openSTACK.org16:20
fungithis could be fallout from my inventory group regex change. looking16:20
fungiindeed, adns1.opendev.org wouldn't have been covered previously by the puppet global manifest node entry for /^adns\d+\.openstack\.org$/16:23
fungibut is now included by /^adns\d+\.open.*\.org$/16:24
clarkbah16:24
fungisame for /^ns\d+\.openstack\.org$/ to /^ns\d+\.open.*\.org$/ for the authoritative nameservers16:24
clarkbbut firewall rules are in ansible16:24
clarkband we don't puppet on these as they are bionic iirc16:25
clarkbso maybe similar collision on ansible side?16:25
fungithe inventory/groups.yaml entry for the adns group changed from adns* to adns*.open*.org16:26
fungiso that wouldn't have changed them i don't think16:26
fungithe /etc/iptables/rules.v4 file on adns1 was last modified 2018-12-1816:27
fungithe groups change didn't merge until 2018-12-21, five days later, so i think we can rule it out after all16:27
fungiplaybooks/group_vars/adns.yaml lists iptables_extra_allowed_hosts of ns1.openstack.org and ns2.openstack.org for tcp 5316:29
*** ykarel|away has joined #openstack-infra16:29
fungii don't see any similar allowances for opendev nameservers16:30
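
A minimal sketch of the kind of addition needed in playbooks/group_vars/adns.yaml (the exact key layout is an assumption modeled on the existing openstack.org entries described above; the real change is the review linked below):

    iptables_extra_allowed_hosts:
      - hostname: ns1.openstack.org
        protocol: tcp
        port: 53
      - hostname: ns2.openstack.org
        protocol: tcp
        port: 53
      # additionally allow the opendev.org nameservers to pull zone transfers
      - hostname: ns1.opendev.org
        protocol: tcp
        port: 53
      - hostname: ns2.opendev.org
        protocol: tcp
        port: 53
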
clarkbmaybe it was missed after the manual bootstrapping16:30
mordredyeah16:30
* mordred is adding helpful sentences16:30
fungii don't have any concerns with temporarily adding the nameservers for opendev to openstack and vice versa, whipping up a patch for that now16:31
corvusthat's probably the best solution; and i think we can drop openstack soon16:32
*** pgaxatte has quit IRC16:32
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Allow DNS zone transfers from ns1/ns2.opendev.org  https://review.openstack.org/62800116:33
fungiclarkb: mordred: corvus: ^16:33
fungiplain brown wrapper16:33
clarkbwe need to manually add those rules, right? since dns isn't currently working16:34
fungiyeah, i'm going to do that on adns1 now and see about urging nsd to retry16:34
clarkbok16:34
corvuswait 628001 doesn't make sense to me16:35
clarkbI removed my +A16:36
fungisudo iptables -I openstack-INPUT 5 -m state --state NEW  -m tcp -p tcp -s 104.239.140.165 --dport 53 -j ACCEPT16:36
fungiand 162.253.55.1616:36
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Switch back to three columns for mid sized screens  https://review.openstack.org/62799716:37
corvusfungi: ++16:37
fungiand the same with ip6tables for 2001:4800:7819:104:be76:4eff:fe04:38f0 and 2604:e100:1:0:f816:3eff:fe2c:744716:37
fungiokay, iptables and ip6tables -L show them open now16:39
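
Putting the manual workaround together, the rules added on adns1 were along these lines (a sketch assembled from the commands and addresses quoted above; the ip6tables chain name is an assumption that it mirrors the v4 ruleset):

    # allow zone transfers (TCP 53) from ns1/ns2.opendev.org over IPv4
    sudo iptables -I openstack-INPUT 5 -m state --state NEW -m tcp -p tcp -s 104.239.140.165 --dport 53 -j ACCEPT
    sudo iptables -I openstack-INPUT 5 -m state --state NEW -m tcp -p tcp -s 162.253.55.16 --dport 53 -j ACCEPT
    # and the equivalent over IPv6
    sudo ip6tables -I openstack-INPUT 5 -m state --state NEW -m tcp -p tcp -s 2001:4800:7819:104:be76:4eff:fe04:38f0 --dport 53 -j ACCEPT
    sudo ip6tables -I openstack-INPUT 5 -m state --state NEW -m tcp -p tcp -s 2604:e100:1:0:f816:3eff:fe2c:7447 --dport 53 -j ACCEPT
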
*** bobh has quit IRC16:39
fungireloading nsd doesn't seem to have triggered a refresh16:40
corvusfungi, clarkb: okay 628001 makes sense to me now, +316:41
fungireading up on the nsd-control utility16:41
tbarronI'd like to merge this small log collection change https://review.openstack.org/#/c/626921/16:42
*** jchhatbar has quit IRC16:43
fungifor the record, `sudo nsd-control zonestatus` reported all three zones expired, then i issued a `sudo nsd-control transfer` and it seems to have done the trick16:43
fungiinfra-root: ^16:43
clarkbfungi: thanks!16:43
fungiconfirmed, i can look up opendev.org records from ns1.opendev.org now. i'll do the same on ns216:44
fungihttp://paste.openstack.org/show/739824/16:45
corvusworks from my desktop now16:46
mordredsame here16:46
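
For reference, the recovery on the secondary nameservers amounted to something like the following (a sketch based on the commands described above; nsd-control can also transfer a single named zone, but the exact arguments used are not shown in the log):

    sudo nsd-control zonestatus                # showed all three zones as expired
    sudo nsd-control transfer                  # ask nsd to re-fetch the zones from the master (adns1)
    dig @ns1.opendev.org opendev.org SOA       # confirm records are being served again
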
corvusanyone know why the 'test nodes' graph is topping out below the max line?  http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=116:47
clarkbcould be quota handling in nodepool16:49
fungihrm, yeah that does indeed look odd. maybe one of our providers has dropped our quota?16:53
*** sthussey has joined #openstack-infra16:56
clarkbgra1 seems to not be in use?16:57
*** psachin has joined #openstack-infra16:57
corvusaccording to http://zuul.openstack.org/nodes  gra1 only has a handful of deleting nodes -- and those are substitute node records, so they are likely create-errors17:00
corvusoh, they're also really old17:00
corvus(they have very small id numbers)17:00
corvusgra1 is not in use -- max-servers:017:02
clarkbah17:02
funginl04.openstack.org.yaml says17:02
fungiyeah17:02
corvusbut that should mean that it's not contributing to the total17:02
corvusi guess we need to ask graphite where it's getting that number from17:03
fungilooking at the monthly view, the last time the max-servers total changed from grafana's perspective was on december 20th17:05
fungiand indeed, the most recent change to any of the nl*.openstack.org.yaml files in git merged on december 20, lowering max-servers for ovh-gra1 from 79 to 0 per https://review.openstack.org/62650217:06
corvushttp://graphite.openstack.org/render/?width=973&height=777&_salt=1546448829.34&hideLegend=false&target=legendValue(stats.gauges.nodepool.provider.*.max_servers%2C%22last%22)17:07
dmsimardI'm still in heavy catch up mode from the holidays -- is there any known issues with limestone at this time ? It appears that some tripleo jobs are failing exclusively there.17:09
rlandy|roverdmsimard: continuing discussion on multinode jobs failures that seem to occur only with one provider - Hostname: centos-7-limestone-regionone-* .. see https://bugs.launchpad.net/tripleo/+bug/181005417:09
openstackLaunchpad bug 1810054 in tripleo "mulitnode jobs failing on gathering facts from subnode-2" [Critical,Triaged] - Assigned to Ronelle Landy (rlandy)17:09
dmsimard^ See logstash (update duration to 7 days and there's quite a few hits): http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22fatal%3A%20%5Bsubnode-2%5D%3A%20UNREACHABLE!%5C%2217:09
clarkbdmsimard: I think most of us are in that mode. I don't know of any known problems specific to clouds currently17:10
dmsimardclarkb: ok, I'll try to understand what is happening. Thanks !17:10
*** yboaron_ has joined #openstack-infra17:14
clarkbdmsimard: we've seen that particular ssh unreachable issue when clouds reuse IP addrs and then fight for the IP over arp, but limestone uses ipv6 which should avoid that problem entirely17:15
*** rpittau has quit IRC17:15
clarkbdmsimard: is it possible that the jobs are breaking ipv6?17:16
*** Vadmacs has joined #openstack-infra17:17
corvusclarkb, fungi: max - usage deltas are: bhs1: -135, sjc1: -10, ord: -5.  that just about accounts for the discrepancy17:17
clarkbdmsimard: hrm that ansible connection goes over ipv4 and is happening primary to subnode? or is that happening executor to subnode? will only work if it is primary to subnode not executor to subnode as the 10/8 addresses are not globally routable17:18
fungihttp://grafana.openstack.org/d/BhcSH5Iiz/nodepool-ovh?panelId=8&fullscreen&orgId=1&from=now-30d&to=now17:18
dmsimardclarkb: some of the jobs failing have also been successful on limestone so it feels intermittent17:19
corvusi see this in the log for bhs1: Quota exceeded for cores: Requested 8, but already used 512 of 512 cores17:19
fungiyeah, looks like maybe around 2018-12-29 we stopped using most of ovh-bhs1 but the first signs i see indicating we're probably capped there is on 2018-12-3117:19
corvusthat works out to 64 nodes, so i'm not quite sure how we're getting 15 out of that17:19
corvus2019-01-02 10:36:28,468 DEBUG nodepool.driver.openstack.OpenStackProvider: Provider quota for ovh-bhs1: {'compute': {'cores': 512, 'instances': 200, 'ram': 4063232}}17:20
corvus2019-01-02 10:36:29,663 DEBUG nodepool.driver.openstack.OpenStackProvider: Available quota for ovh-bhs1: {'compute': {'cores': 360, 'instances': 181, 'ram': 3911232}}17:20
corvusneither is nodepool apparently :)17:20
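
Spelling out the numbers behind the confusion (plain arithmetic from the two quota lines above, assuming every node uses the 8-vCPU flavor mentioned in the quota error):

    nova's view:      512 of 512 cores used            ->  512 / 8 = 64 node-equivalents consumed
    nodepool's view:  512 quota - 360 available = 152  ->  ~19 nodes that nodepool is tracking
    unexplained gap:  512 - 152 = 360 cores            ->  ~45 node-equivalents held by something nodepool doesn't know about
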
fungiamorin: ^ any idea if our quota in bhs1 got out of sync sometime in the past few days?17:20
* fungi isn't sure who's quite back to the computer yet this week17:23
corvusi'm not sure i am17:23
fungii was about to say the same ;)17:23
*** tosky has quit IRC17:23
dmsimardlots of us doing our best at a keyboard today haha17:24
clarkbits hard to get over that vacation hangover, also real hangovers depending on how hard you celebrate new years :P17:24
openstackgerritMerged openstack-infra/system-config master: Allow DNS zone transfers from ns1/ns2.opendev.org  https://review.openstack.org/62800117:25
clarkbdmsimard: ok its ansible running from primary to subnode and IP addrs for private v4 seem to check out17:26
clarkbdmsimard: next is to check if the firewall is running and whether or not the subnode is possibly crashing its network (or just crashing)17:26
*** jpich has quit IRC17:26
clarkbdmsimard: limestone is the cloud where nested virt caused nodes to crash17:26
dmsimardis it ?17:26
dmsimardyeah I think the ipv4 is just the tunnel ip17:27
clarkbhttp://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/ara-report/result/dc98460a-a68d-4b62-877e-374afe9c0d09/ shows firwall is being configured properly I think17:27
dmsimardclarkb: when the failure is occuring, it looks like it's always after the "tripleo-inventory" role runs -- that's what I'm looking at right now17:28
dmsimardcomparing different job executions to see what's up17:28
clarkbdmsimard: yes the hypervisor kernel and the centos 7.5 kernel did not get along in that cloud about a month ago. logan- updated hypervisor kernels but it's possible that broke again. tripleo did switch to qemu instead of kvm as a result but maybe that got turned off and things regressed again? (mentioning it as a possibility; I don't have evidence this is the cause)17:28
logan-that should not be happening anymore. ubuntu released an updated kernel a few days after the node crash issue started happening and all of the hvs have been upgraded. have not seen any further crashes on our internal nested virt jobs lately17:29
clarkblogan-: thanks17:29
dmsimardlogan-: thanks, we'll let you know if we can't figure it out :)17:29
logan-dmsimard: cool, thanks. with something intermittent like that I always wonder if we're hitting a slow node or something. iirc there was some work being done recently to add hostid to the logstash fields.. is that done clarkb? anything I need to enable cloud-side to expose that info to nodepool?17:31
clarkblogan-: I think it is purely nodepool side (nova api already gives us that data), I'm not sure if we restarted nodepool with that change yet as there were openstacksdk issues that lingered late last month that we didn't want to run into (since fixed)17:32
logan-gotcha17:32
clarkbgive me a few minutes and I'll check if that got in17:32
logan-thanks17:32
funginodepool on nl01 running since 2018-12-05 so hasn't been restarted for nearly a month17:33
fungier, nodepool-launcher daemon specifically17:33
corvusclarkb: the change merged on 12-0717:33
*** ginopc has quit IRC17:33
fungiso not in use yet17:33
clarkbya so we didn't manage to restart things once sdk was fixed (not surprising)17:33
corvusi'd like to perform a scheduler restart sometime soon17:34
clarkbfungi is quicker than me being able to load ssh keys after holidays :)17:34
mordredclarkb: I believe we should be completely fixed/released with sdk things currently17:34
*** weshay|ruck has joined #openstack-infra17:40
corvusi'd also like to restart gerrit -- it seems to have a stuck replication thread (likely due to gitea hanging after its disk was filled)17:42
corvusi could probably perform thread surgery with javamelody, but it's probably safer to restart17:43
mordred++17:43
clarkbya java melody surgery like that always worries me17:43
mordredcorvus: is the ceph rebalance all done? I wasn't sure what state it was in and didn't want to go touching things17:43
clarkbgerrit wasn't built with the intention that threads would be stopped under it17:43
corvusmordred: yes! all done17:44
corvusmordred: one of the osds has a weird name (0,1,3) because i didn't know all the steps to replace them at first.  and they're out of order on the underlying nodes (like osd 3 is on minion 1 or something).  but done.  :)17:45
mordredcorvus: cool. I figure we'll blow the whole thing away and recreate it from scratch at some point anyway, yeah?17:46
corvusmordred: (turns out if you're going to zero out an underlying block device, you also need to edit the configmap for the deployment and remove the info about the osd, otherwise the operator is going to think it still exists and create a new one; that's how we ended up with #3 -- the operator thought there was still a #2)17:47
mordredcorvus: ah - that makes sense17:48
corvusso that's the big bit of info i learned -- the operator maintains configmaps for deployments -- it doesn't rely only on the stuff on the filesystem.17:48
corvus(the configmap largely duplicates what's on the filesystem)17:48
fricklerfungi: corvus: there are still my 20 test nodes in bhs1. I had asked amorin whether I should delete or keep them, but I don't remember the outcome. guess I should remove those now17:48
fungiso that at least accounts for some of the discrepancy, but not most of it17:49
fricklerthere also were 5 or so nodes listed by "openstack server list", but not existing when trying to show them individually, not sure whether those might count for quota, too17:50
mordredcorvus: looks like we're using 124G on the cephfs now according to df17:50
*** kmalloc has joined #openstack-infra17:50
mordreddoes that figure report the size used taking the underlying ceph min block size into account?17:50
clarkbmordred: df should be actual disk consumed17:51
corvusmordred: aiui yes17:51
fungimordred: that seems like rather a lot given the aggregate utilization of /home/gerrit2/review_site/git on review01.o.o is reported as only 10gib17:51
corvuscurrent detailed usage: http://paste.openstack.org/show/739828/17:51
fungimyfs-data0 is the block device the git replication is going into?17:53
fungiif so, 7.2g looks reasonable17:53
corvusfungi: yep.  and then 7.2*3 copies gives us 22GiB17:54
corvusthen... somehowe 22GiB turns into 123.17:55
fungithrough the magic of cephfs17:55
corvusso it's *way* better than a week ago when 22 turned into 230.17:55
corvusit's still... confusing.17:55
mordredI'm guessing that's min block size overhead yeah? - what did you change that from/to?17:56
fungiespecially since myfs-data0 is reporting 17.29% used at 7.2gib but says max available is 34gib17:57
corvus    bluestore_min_alloc_size = 409617:57
*** yboaron_ has quit IRC17:57
mordredyeah. that17:57
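
For context, bluestore_min_alloc_size is an OSD-creation-time setting, which is why changing it meant redeploying the OSDs (as comes up later in the discussion). A minimal sketch of the override corvus describes, expressed as plain ceph.conf (in this rook-managed cluster it is injected through rook's config override rather than a hand-edited file):

    [osd]
    # allocate bluestore space in 4 KiB units instead of the 64 KiB it was defaulting to,
    # since the git repositories contain many small objects
    bluestore_min_alloc_size = 4096
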
fungii mean, i'm bad at math, but even i know 7.2 is not 17.29% of 7.2+3417:58
fungino, wait, it is17:58
corvusmordred: that seems sufficiently small to me that we should be closer to 1:117:58
mordredfungi: maybe math changed over christmas17:58
*** kmalloc has quit IRC17:58
fungi7.2 isn't 17.29% of 34 but it *is* 17.29% of 7.2+3417:59
* fungi probably just hasn't had enough to drink today18:00
*** ykarel|away has quit IRC18:01
fungiand 3*(34+7.2+.2)=124.218:01
funginot sure if that's where the ~123gib for global raw used is coming from?18:02
*** gfidente has quit IRC18:02
fungidoes the max avail for the pools count against the global raw used?18:02
fungilike a preallocation?18:03
corvusfungi: hrm, i wouldn't expect that... but maybe?  maybe the global storage gets allocated to the pools in very large chunks, so we've allocated half of the global storage to those pools, and from that allocation, 34GiB is still available?18:03
*** studarus has quit IRC18:03
corvusfungi: wow, 'preallocation' is a much shorter way of saying what i said :)18:03
corvusanyway, i do not know if it works that way.  but it seems like a good theory that fits the numbers.18:04
fungia "good" theory, perhaps, but confusing implementation if true18:04
corvuss/good// :)18:05
*** sshnaidm is now known as sshnaidm|afk18:05
*** Vadmacs has quit IRC18:05
dmsimardWhat kind of timeframe are we looking at for the logging migration to swift ?18:05
corvusdmsimard: i would personally like to push it so that as many of the things described in http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-July/000501.html as possible are implemented.18:07
corvusdmsimard: almost everything in there mitigates something we lose due to the migration18:07
corvushowever, we can cut over any time18:07
clarkbcost in switching is largely going to be in time spent debugging any unexpected behavior and educating users about the change when it happens18:08
dmsimardcorvus: I'm particularly interested in the switch from the ara-report sqlite middleware to html generation18:08
clarkb(I expect)18:08
dmsimardSome projects (I know of at least OSA) are using the middleware for nested Ansible reports18:08
dmsimardI'll read the ML post first18:09
fungiyeah, accounting for rounding, some used value between 7.15 and 7.20 plus a max available value between 33.5 and 34.0 can give us a value between 122.5 and 123.0 (with enough of a buffer to even account for the used 208 mib for the myfs-metadata pool) so consistent rounding of those for the raw used works out i think18:09
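
Restating fungi's reconciliation as explicit arithmetic (this is the preallocation theory above, not a confirmed description of how ceph computes RAW USED):

    per replica:   34 GiB max avail (shared by the pools) + 7.2 GiB used (myfs-data0) + 0.2 GiB used (myfs-metadata) ≈ 41.4 GiB
    three copies:  3 × 41.4 GiB ≈ 124 GiB, i.e. roughly the ~123 GiB reported as global raw used
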
corvusdmsimard: post doesn't discuss ara.  you can test it out with a change like https://review.openstack.org/59258218:09
dmsimardara is actually mentioned in http://lists.openstack.org/pipermail/openstack-infra/2018-July/006020.html :D18:10
*** trown is now known as trown|lunch18:10
corvusdmsimard: yeah.  that work is all done.18:12
*** josephrsandoval has joined #openstack-infra18:12
dmsimardok so this works: https://object-storage-ca-ymq-1.vexxhost.net/swift/v1/86bbbcfa8ad043109d2d7af530225c72/logs_82/592582/2/check/devstack-platform-opensuse-423/1470481/ara-report/18:12
dmsimardvery nice18:12
*** panda is now known as panda|off18:15
clarkbcorvus: dmsimard: is that something we want to get in today and ease people into as they return to work?18:17
clarkb(I'm happy to help with that if so)18:17
corvusclarkb: get what in today?18:17
clarkbthe switch to logging to swift over the fileserver18:17
corvusclarkb: i would prefer to wait until as much of what is described in http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-July/000501.html as possible exists18:18
corvusparticularly the inline javascript log viewer18:18
clarkboh I misread your comments above. You had said we can cut over any time and I thought you were thinking of using that as a carrot to get those other improvements in18:19
clarkbyou mean that we should add those features ebfore we cut over18:19
corvusi think tristanC has done some fundamental work in that direction (we now fetch the log json and render part of it; it's a smaller step now to fetch the log text and render it)18:20
*** kgiusti has joined #openstack-infra18:20
corvusclarkb: i doubt we'll get volunteers to implement that by switching18:20
corvusclarkb, fungi, mordred: i've read over http://docs.ceph.com/docs/master/rados/operations/monitoring/#checking-a-cluster-s-usage-stats and i still don't understand the numbers18:21
*** wolverineav has joined #openstack-infra18:22
corvushttps://access.redhat.com/solutions/146542318:22
corvusthat's not terribly far from our question18:22
corvusi don't see an answer there though, which makes me sad18:23
dmsimardclarkb, corvus: my only concern with switching to swift is that we need to make sure that users of the sqlite middleware also switch to html generation18:23
dmsimard(for ara)18:23
corvusdmsimard: what users?18:23
dmsimardcorvus: OSA uses it for nested reports, there might be kolla-ansible as well .. would need to use codesearch or something18:24
dmsimardfor example http://logs.openstack.org/98/625898/1/check/openstack-ansible-deploy-aio_lxc-ubuntu-bionic/d5e3799/ara-report/ is from Zuul's perspective while http://logs.openstack.org/98/625898/1/check/openstack-ansible-deploy-aio_lxc-ubuntu-bionic/d5e3799/logs/ara-report/ is from OSA's perspective18:25
clarkbdmsimard: I'm not too worried about that. Its an easy switch and if you really need the ara data in that interim you can render it off of the sqlite file18:25
clarkb(so we haven't lost any data)18:25
dmsimardclarkb: right18:25
mordredyeah - I don't think it's hard - it's just a thing to add to the checklist18:25
dmsimardindeed.18:26
corvuser, um, there is no checklist :)18:26
corvusi disclaim responsibility for this :)18:26
*** wolverineav has quit IRC18:26
fungicorvus: i suppose you have access to the answer to that question? for me it's "subscriber exclusive content" but if i'm a "current customer or partner" i can find out the possible answer about this particular piece of free software :(18:26
corvusso if we need it to happen, we need to figure out how to make it happen :)18:26
corvusfungi: i have no idea18:27
fungi"Distributing any portion of Red Hat Content to a third party, using any Red Hat Content for the benefit of a third party, or using Red Hat Content in connection with software other than Red Hat Software under an active Red Hat subscription are all prohibited."18:29
fungiso if you did have access to it, you couldn't legally tell me what it says anyway18:29
corvuswell, then i guess i won't look :(18:29
fungihttps://access.redhat.com/help/terms/ (linked from the footer of that kb article)18:30
dmsimardclarkb, logan-: re: tripleo failures -- it could be a coincidence (or a red herring) but I've noticed that there appear to be two kinds of CPUs in Limestone, reported as "Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)" and "Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz", and so far all the failures I've looked at seem to have occurred on the E3-12xx variant18:31
corvusdmsimard, clarkb: do you want to find a way to have the nested ara report generated automatically when we switch over, or do you just want to go update jobs after the switch?18:32
clarkbcorvus: I'm fine with updating jobs after the switch (it isn't a lossy switch just an extra step to see the data)18:34
dmsimardcorvus: least we can do is attempt to figure out who uses it and let them know ahead of time I think. At first glance I see OSA, Windmill, puppet-openstackci, and system-config18:34
dmsimardbased on http://codesearch.openstack.org/?q=ara-report18:34
corvusdmsimard: do you have a suggested action for them?18:36
dmsimardI can write something up -- it shouldn't be more than replacing "mkdir ara-report; cp /some/path/ansible.sqlite /some/path/ara-report" with "ara generate html /some/path/ara-report"18:37
dmsimardWhat influences whether or not we do it before the move is the timeframe, I think18:37
dmsimardI mean, we probably don't want every project to start generating html if we're not moving to swift any time soon -- or else we could fill up logs.o.o again :(18:38
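
A minimal sketch of the job-side change being described (paths and variables here are illustrative, not the exact role code any project uses):

    # before: publish the raw sqlite database and rely on the logs.openstack.org ara-report WSGI middleware
    mkdir -p "$LOG_DIR/ara-report"
    cp "$NESTED_ARA_DB" "$LOG_DIR/ara-report/ansible.sqlite"

    # after: generate a static HTML report that can be uploaded to swift as-is
    ara generate html "$LOG_DIR/ara-report"
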
*** wolverineav has joined #openstack-infra18:38
corvusyeah, we should definitely not generate static files before the move; however, we could theoretically write a role which supports both and handles the change transparently (as the existing ara-generate role does)18:39
dmsimardthat logic is embedded in and specific to each project's own nested ansible implementation, though18:39
dmsimardOSA bootstraps their own ansible, kolla too, etc.18:40
*** bobh has joined #openstack-infra18:41
corvusi don't understand why the internal ansible would be responsible for moving zuul log files around, but that's okay, i don't need to understand everything.18:41
*** wolverineav has quit IRC18:42
efriedHowdy folks. Who's administering jetbrains (pycharm) licenses these days? I need to change my email address, and apparently that kills my current license.18:43
efriedI submitted a new request at the link listed in https://wiki.openstack.org/wiki/Pycharm -- but IIRC our license admin recently left OpenStack?18:43
* mordred does not know anything about pycharm licenses18:43
*** wolverineav has joined #openstack-infra18:45
corvusfungi: i think 'ceph osd df' shows us where the 123 comes from: http://paste.openstack.org/show/739832/18:45
dmsimardefried: last I know it was swapnil but I'm not sure if he's still around18:45
efrieddmsimard: Okay, thanks. I guess I'll wait a day or two and see if something comes of that request.18:46
* efried can't live without his pycharm18:46
* efried comes up with something clever about leprechauns and cereal.18:46
*** wolverineav has quit IRC18:46
fungicorvus: i don't really get anything new out of that `ceph osd df` output other than that it's ~3x41gib. where does the 41 come from? my theory there is still the same at least18:47
*** wolverineav has joined #openstack-infra18:48
fungiefried: i'm not after your lucky pycharms, just in case you were concerned18:49
efriedthanks fungi, that's the ticket18:49
corvusfungi, clarkb, mordred: okay, so i still don't understand how all these numbers work.  do we consider that a requirement to moving forward?  or do we just want to keep throwing cinder volumes at the thing to keep it fed?18:51
fungicorvus: i don't consider it a requirement, no. mostly just curious so we can do effective capacity planning in the future18:51
fungibut our capacity planning in the past has been a bit reactionary anyway (like enospc=>time to make more room)18:52
clarkbcorvus: mnaser may know. I can also ask my brother if he understands that18:52
*** wolverineav has quit IRC18:53
*** rkukura has quit IRC18:56
fungii guess calebb hasn't been chillin' in here lately18:58
*** diablo_rojo has joined #openstack-infra18:58
mordredcorvus: yeah - I don't think it's an absolute requirement to move forward - but I would like to understand more better in general18:59
*** wolverineav has joined #openstack-infra18:59
clarkbI've asked calebb to join here and can hopefully shed light on that18:59
corvusclarkb: thanks19:00
corvusmordred: okay, what's the status/next steps?19:00
mordredcorvus: step one is fully regain brain post holiday ...19:02
*** tobiash has quit IRC19:02
*** calebb has joined #openstack-infra19:03
calebbclarkb: o/19:03
clarkbcalebb: hey so corvus is trying to track down disk usage discrepancies in a ceph cluster he set up. Maybe you can help us understand what is going on there19:04
corvuscalebb: http://paste.openstack.org/show/739832/19:04
mordredcorvus: I think we should exercise taking nodes offline, as well as growing the system by adding new nodes, yeah? I feel like there's something else that was on the list pre-holiday but still paging context back in19:04
corvuscalebb: the pool replication size is 3 for both of those pools19:04
mordredcorvus: (oh, we need to walk other infra-root folks through the setup)19:05
corvuscalebb: so i think that explains 7.2*3 == 2219:05
*** bobh has quit IRC19:05
*** tobiash has joined #openstack-infra19:06
calebbcorrect19:06
corvuscalebb: but i don't understand why, in the global section, it says 123 raw used19:06
calebband there are no other pools in the cluster than what is shown there?19:06
corvuscorrect19:07
calebbdid you run any ceph benchmarks by chance?19:07
calebbbecause those don't always clean up after themselves i think19:07
*** whoami-rajat has quit IRC19:07
fungimy terrible theory is that the used for the two pools plus the max available (which seems to be shared between pools? maybe?) adds up to the 41gib utilization per replica19:07
corvusnope; though i have replaced all of the osd's (one at a time).  there hasn't been any significant activity after the osd replacement.19:07
calebbthat should be fine19:08
fungibut if my terrible theory is correct, i don't understand why available space in the pools counts toward used space globally19:08
corvuscalebb: and just for further context, this is running in k8s via rook -- so it's all virtualized and dedicated to one user.19:08
calebbfungi: i think by default it doesn't work that way, or at list didnt used to, but there might be a way to make it do that and rook may do that by default?19:09
calebbat least*19:09
*** dtantsur is now known as dtantsur|afk19:09
fungiinteresting19:10
*** rkukura has joined #openstack-infra19:10
calebbcorvus: just for context what version of ceph is this and is it using filestore or bluestore?19:10
calebbmight be relevant19:10
corvuscalebb: mimic and bluestore19:10
calebband did you create those pools manually or did rook do it?19:11
corvuscalebb: rook created the pools19:11
corvusthis is the input we gave it to create the filesystem: http://paste.openstack.org/show/739833/19:12
calebblooks like someone on the mailing list had a similar question http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023669.html im gonna read up on that and see if there's any useful info in there19:16
calebbit sounds like bluestore is bad at deciding usage19:18
calebbbut those differences seem really high19:20
corvusoh hrm.  i wonder if they did anything about that in the intervening year?19:20
calebbyeah, that looks like that was on Luminous19:21
calebbseems like it should have been fixed so this might be a different issue19:21
corvuscalebb: also worth noting, i set "bluestore_min_alloc_size = 4096" because it was defaulting to 64k, and we have a lot of small files.  that caused our global raw used to be about 2x what it is now.19:21
corvus(i mean the default caused it to be 2x what it is now; 4096 is producing the current values)19:22
*** trown|lunch is now known as trown19:25
*** jamesmcarthur has joined #openstack-infra19:25
calebbok19:27
calebbi assume that's why you had to redeploy all the osds?19:28
*** bobh has joined #openstack-infra19:28
fungimaybe 4096 is still way too large for our dataset and we need something more like 1024?19:29
corvuscalebb: yep19:29
calebbfungi: but i dont think that explains it away either19:30
clarkbhttp://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024801.html is relevant maybe?19:31
clarkb(I don't have the object count to do that math)19:31
calebbbecause if there's 500k objects, and if each one wastes 2K, that only accounts for 1GB i think19:31
calebbidk if that's the right math though19:31
calebbyeah so i think https://github.com/ceph/ceph/pull/19454 might be the thing that fixes the mailing thread i linked19:31
*** bobh has quit IRC19:32
calebbwhich was only merged less than a month ago19:32
corvusclarkb: http://paste.openstack.org/show/739832/ has object count; our average is 10KiB per object.  i think.  :)19:32
*** jamesmcarthur has quit IRC19:32
*** jamesmcarthur has joined #openstack-infra19:32
calebbyeah i think the PR I linked fixes this issue19:32
corvusi'm guessing that's not in mimic19:33
calebbmimic 13.2.2 was released in september it looks like19:33
calebband that's the latest19:33
fungiokay, so even on a typical fs with the traditional 512k block size, our average block count per inode is really tiny19:33
corvusand yeah, our 'ceph df detail' report does not have a "used" column, as described in that pr.19:34
fungier, ignore me. i'm thinking 512b block size, so we're averaging 20 traditional 0.5k blocks per file19:34
fungiso 64k block size was definitely large for a 10k average file size, but not significantly so. 4k means we're averaging ~2.5 blocks per file now19:36
fungii agree decreasing below 4k block size doesn't really make sense19:36
corvusyeah, we don't know the histogram though, so could be more or less important based on the number of 100-byte files :)19:37
calebbheh19:37
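
Rough overhead arithmetic behind that estimate (averages only; as corvus notes, the real impact depends on the file-size histogram):

    ~10 KiB average object rounded up to 4 KiB allocation units  ->  12 KiB allocated, ~2 KiB of padding per object
    ~500,000 objects × ~2 KiB of padding                         ->  ~1 GiB per replica, ~3 GiB across three copies

which is nowhere near enough to explain the ~100 GiB discrepancy on its own.
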
corvusso maybe we should just ignore this for now, feeding the cluster storage as it claims it needs it (via %raw used) and revisit this after the next ceph release?19:38
corvus(since it seems like we can probably handle throwing a bit of extra storage at it to deal with it)19:38
calebbwell, but if it's using 2x more storage than it should then it's not really a bit of extra storage though19:39
mordred++ - I think that sounds like a fine idea for now - especially since we don't expect this data to double in size or anything once we get the initial data loaded fully19:39
calebbunless the extra used space doesn't scale with the actual used19:39
calebbooc what are you storing in it?19:40
corvuscalebb: all the openstack git repos19:40
calebbahh19:40
corvusso we don't expect it to grow too quickly; like, i don't think we'll see it even double in the next 6 months.19:41
*** psachin has quit IRC19:41
calebbyeah19:41
corvusso 'wait for ceph to get better' is pretty viable for us.19:41
* calebb nods19:42
corvusespecially since the env is virtualized and we can swap it out easily19:42
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Reject messages to starlingx-discuss-owner  https://review.openstack.org/62803019:42
fungiinfra-root: ^ please expedite if possible19:43
clarkbdmsimard: fwiw I'm still poking around at those failures and we don't seem to collect logs from the subnode19:44
clarkbdmsimard: at least I can't find them19:44
clarkbdmsimard: and I don't think that is because we can't ssh to it, the log collection just doesn't pull from there?19:44
corvuscalebb: thanks for your help!  :)19:45
calebbcorvus: feel free to poke me again if needed, i always forget to keep my freenode connection authed and stuff but i'll try >.>19:45
calebbnp19:45
dmsimardclarkb: yeah mwhahaha was looking at that19:45
mwhahahalogs don't get collected for teh same reason the job fails19:46
corvusmordred: you want to add a node to the cluster?19:46
mwhahahacan't pull them, ssh is returning 25519:46
clarkbdmsimard: http://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/logs/quickstart_files/ssh.config.ansible.txt.gz is interesting too19:46
calebbcorvus: im excited you guys are using ceph :D I've poked clarkb a lot to try convince him about it19:46
mwhahahaso it either connectivity or access denied19:46
clarkbit says use the heat-admin use19:46
mordredcorvus: yes. that seems like a grand idea19:46
clarkbis something creating the heat-admin user on that instance?19:46
mwhahahaclarkb: yes later, but we use http://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/logs/quickstart_files/hosts.txt.gz19:46
corvuscalebb: other than this mystery, it's been really easy to work with, especially with rook.19:47
mwhahahawhich reuses the zuul user19:47
clarkbmwhahaha: and those two configs aren't in conflict?19:47
mwhahahahttp://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/logs/quickstart_files/ssh.config.local.ansible.txt.gz19:47
mwhahahanot necessarily19:47
calebbcorvus: awesome! I've had some problem with the deploy tools (ceph-deploy and ceph-ansible) but everything else is pretty good imo as far as ops goes19:47
clarkbunfortunately the nested ara doesn't seem to record which user was attempted when the failure happens19:48
*** bobh has joined #openstack-infra19:48
corvusmordred: i'll leave you to get started on that and check in after my lunch19:48
mwhahahaclarkb: zuul, http://logs.openstack.org/92/625692/2/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/b38e0b1/logs/quickstart_collect_logs.log19:48
dmsimardmwhahaha: one of those files has "Hostname 10.4.70.74" and the other doesn't, I'm not sure if that could be related19:49
clarkbmwhahaha: dmsimard: and we should double check that the /etc/nodepool/id_rsa file is set up for the zuul user authorized keys19:49
dmsimardNone of this explains how the issue could occur randomly only on limehost, though ?19:50
logan-dmsimard: good find on the cpu model. the guests where host-passthrough is not enabled could be less performant. i'm not sure why certain HVs don't have that enabled, but I'm working on fixing that now.19:50
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Increase number of k8s nodes to 4  https://review.openstack.org/62803119:50
clarkbdmsimard: not necessarily, but understanding that and ruling it all out is useful19:50
fungiso these nodes may be communicating over ipv4 in limestone? is this working like a legacy d-g job where there's a master node which collects the logs? what are the odds there is one or a small number of ip addresses where that's happening? maybe a zombie instance or several nova has lost track of?19:50
clarkbdmsimard: basically if you confirm that ssh is configured properly (users and keys) then you can continue to narrow down the problem space19:51
clarkbdmsimard: if we assume they are fine we may miss something obvious19:51
dmsimardfair19:51
mwhahahadmsimard: different job results, mine was from a job with .64, clarkb's was where it was .7419:51
*** markvoelker has joined #openstack-infra19:51
fungibut yes, i agree starting from first principles helps avoid missing obvious problems due to tunnel vision19:51
dmsimardmwhahaha: my bad19:51
mwhahahahttp://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/logs/quickstart_collect_logs.log19:52
clarkbfungi: what is odd is we've seen the key errors when we had that problem before, these are just unreachable but maybe that is ansible converting error messages for us19:52
mwhahahait's all good :D19:52
dmsimardlogan-: pleasure -- I have no idea if it's related or not but thanks :p19:52
*** smarcet has quit IRC19:52
fungiclarkb: that's a fair point (unless the zombie node in question doesn't have an sshd running any longer?)19:52
clarkbfungi: ya that could be19:53
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Run k8s-on-openstack to manage k8s control plane  https://review.openstack.org/62696519:53
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Add resources for deploying rook and xtradb to kuberenets  https://review.openstack.org/62605419:53
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Increase number of k8s nodes to 4  https://review.openstack.org/62803119:53
fungibut yeah, could also be that node spontaneously going toes-up19:53
dmsimardlogan-: hang on, to validate my understanding, are you saying that some hypervisors did not expose the right CPU model mistakenly ?19:56
*** wolverineav has quit IRC19:56
*** wolverineav has joined #openstack-infra19:57
logan-dmsimard: yes, I see some are doing host-passthrough (the ones where you see cpu model "Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz"), and others are using host-model (where you see "Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)")19:57
clarkbdmsimard: mwhahaha we copy /home/zuul/.ssh/id_rsa to /etc/nodepool/id_rsa and the public key for that private key is what is put into the zuul user's authorized_keys file according to the top level ara report19:57
*** rkukura has quit IRC19:57
clarkbI think that should be fine as is then19:57
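
A quick way to verify that assumption on a held node would be something along these lines (a sketch; the subnode address is a placeholder and these commands were not run in the log):

    # print the public half of the key the nested ansible uses
    ssh-keygen -y -f /etc/nodepool/id_rsa
    # compare it against the zuul user's authorized_keys on the subnode
    cat ~zuul/.ssh/authorized_keys
    # a manual connection attempt from the primary surfaces the underlying ssh error
    ssh -i /etc/nodepool/id_rsa zuul@<subnode-ip> true
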
*** mriedem has quit IRC19:58
fungidmsimard: logan-: so piecing this together, the ones where host-model was in use were where these failures were observed?19:58
*** aojea has quit IRC20:00
*** bobh has quit IRC20:00
*** shardy has quit IRC20:00
*** kgiusti has left #openstack-infra20:01
*** wolverineav has quit IRC20:01
*** kgiusti has joined #openstack-infra20:01
dmsimardfungi: A pattern I found when comparing failures against successful jobs (for the same set of tripleo jobs) is that the failures seemed to happen exclusively when the subnode had an E3-12xx processor. I didn't come across a failure that ran on the E5 variant. Could be a coincidence or a red herring, though.20:02
clarkbdmsimard: mwhahaha on the logging front the nested ansible does all the log collections not the executor driven ansible?20:03
clarkbdmsimard: mwhahaha: worth noting that we appear to be able to ssh into the subnode from the executor over ipv6 if I am reading logs correctly20:03
mwhahahacorrect for the tripleo jobs it's the nested ansible20:04
*** jamesmcarthur has quit IRC20:04
logan-host-passthrough is set on all of the nodes now.20:05
*** josephrsandoval has quit IRC20:06
dmsimardlogan-: cool, thanks !20:06
*** mriedem has joined #openstack-infra20:08
openstackgerritDavid Moreau Simard proposed openstack-infra/elastic-recheck master: Add query for intermittent tripleo failures in limestone  https://review.openstack.org/62803420:11
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Run k8s-on-openstack to manage k8s control plane  https://review.openstack.org/62696520:13
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Add resources for deploying rook and xtradb to kuberenets  https://review.openstack.org/62605420:13
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Increase number of k8s nodes to 4  https://review.openstack.org/62803120:13
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Add openstack keypair for the root key  https://review.openstack.org/62803520:13
mordredinfra-root: could y'all check out 628035 real quick ^^ ?20:13
* mordred is trying to add the new node from bridge using the patches in gerrit20:14
fungilogan-: dmsimard: could lack of host-passthrough cause nested virt instability maybe?20:14
* fungi has no idea what its behaviors are keyed off of20:14
logan-fungi: it's quite possible that the vmx flag was not present in the host-model guests so they may have been using software virt20:15
clarkbfungi: I'm thinking its more likely an ipv4 issue considering that ipv6 seems to work from the executor20:15
fungimordred: that's the public ssh key for root@bridge.o.o?20:15
mordredfungi: yeah20:15
mordredfungi: I pulled it from the authorized_keys file on review.o.o20:16
openstackgerritMerged openstack-infra/nodepool master: Ignore removed provider in _cleanupLeakedInstances  https://review.openstack.org/60867020:17
fungimordred: and this is specifically in support of magnum?20:17
clarkbfungi: its not using magnum, it uses a set of ansible playbooks/roles aiui20:17
fungiahh, it says k8s-on-openstack so wasn't sure if that was the actual name or just a descriptive term20:18
fungiis this the openstack provider driver for kubernetes which wants to find the keypair via the api?20:19
clarkbI think it's ansible20:20
clarkbso that it can manage the k8s install20:20
fungiseems fine. we already bake that key into our other systems anyway, so including it there isn't going to make anything i can think of less secure20:22
clarkbmy only concern about it is we'd allow ansible from the bridge to run on those nodes and we don't want that to happen with our regular ansible runs20:22
clarkbso we'll have to be extra careful we don't add them to the inventory20:22
fungiwe don't want to manage them with ansible long-term?20:23
clarkbnot with the same playbooks20:23
clarkbsince we don't want the regular user accounts or firewall rules there as on our normal hosts (etc)20:24
dmsimardclarkb, fungi, logan-: thanks for your help, I sent a patch to add an elastic-recheck query to track if it reproduces despite the host-passthrough fix: https://review.openstack.org/#/c/628034/20:26
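elastic-recheck queries are small YAML files keyed by bug number that hold an elasticsearch query string; a hedged sketch of the shape of such a query (bug number and terms are illustrative, not the actual content of 628034):

    # queries/1810000.yaml  -- illustrative bug number
    query: >-
      message:"Timed out during ssh connection"
      AND node_provider:"limestone-regionone"
      AND voting:1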
clarkbdmsimard: if it does we should consider using ipv6 and check if it is a v4 vs v6 problem20:27
clarkbdmsimard: one note on the e-r query20:29
dmsimardclarkb: why do we set up an ovs bridge by default on multinode again ? I forget20:29
openstackgerritDavid Moreau Simard proposed openstack-infra/elastic-recheck master: Add query for intermittent tripleo failures in limestone  https://review.openstack.org/62803420:30
*** panda|off has quit IRC20:30
clarkbdmsimard: because much of the software we test assumes it can "manage" the networking so we create a corner for it to do that in20:30
mordredfungi, clarkb: I think we might still want sysadmin keys added ... but yeah, for now this is just some ansible that creates the nodes using the os_ modules and then installs/manages k8s on them20:30
clarkbdmsimard: it can't do that with the actual host interface as those are managed by our clouds20:30
mnaserits not using magnum... yet :p20:31
dmsimardclarkb: note fixed20:31
mordredyeah - the magnum version of this was a bit unhappy with the cephfs and containerized kubelet20:31
dmsimardclarkb: understood (for bridge)20:31
clarkbdmsimard: another side effect is it gives tests a consistent network view they can use regardless of the underlying networking20:32
clarkbdmsimard: that isn't critical though and could be worked around if we had to20:32
*** panda has joined #openstack-infra20:33
corvusmordred: 628035 lgtm -- though would it be more helpful to call the key 'bridge' or something?20:33
corvusmordred: or rather, call the 'openstack keypair' 'bridge-$date' ?20:34
mordredsure - I could respin it real quick20:34
corvusclarkb, fungi: ^?20:34
dmsimardclarkb: I naively figured devstack (and other projects) ended up installing and setting up ovs anyway :p20:34
clarkbcorvus: ++20:34
mordredI pulled the date from the date in the key comment so that it would match20:34
corvusmordred: existing date is fine -- i was just trying to be clear in my englishing20:34
clarkbdmsimard: devstack for example will set up ovs for neutron. But we are setting up OVS here to provide a virtual switch that neutron can plug into. So it's somewhat separate from neutron itself20:34
mordred++20:34
mordredcorvus: how about "bridge-root-2014-09-15"20:35
clarkbdmsimard: think of it as giving tests a consistent interface scheme and l2 networking via a switch that we bring to the jobs20:35
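The mechanics behind that are roughly: create the same OVS bridge on every node and tunnel the nodes together over the provider network. A sketch of the idea (bridge name, tunnel type, and the peer address variable are assumptions; the real multi-node roles handle this):

    - hosts: all
      become: true
      tasks:
        - name: Create the shared test bridge on every node
          command: ovs-vsctl --may-exist add-br br-infra
        - name: Tunnel the bridge to the peer node so both see one L2 segment
          command: >-
            ovs-vsctl --may-exist add-port br-infra vxlan0
            -- set interface vxlan0 type=vxlan options:remote_ip={{ peer_ip }}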
corvusmordred: wfm20:35
corvusmordred: i think that will avoid me looking at it and saying "of course it's the key for the root user..."  :)20:35
*** wolverineav has joined #openstack-infra20:35
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Add openstack keypair for the bridge root key  https://review.openstack.org/62803520:35
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Run k8s-on-openstack to manage k8s control plane  https://review.openstack.org/62696520:35
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Add resources for deploying rook and xtradb to kuberenets  https://review.openstack.org/62605420:35
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Increase number of k8s nodes to 4  https://review.openstack.org/62803120:35
mordredcorvus: ok. I added that key  manually and am running the ansible to add a new node20:40
corvusgroovy20:40
*** wolverineav has quit IRC20:40
fungisorry, got sidetracked by dirty dishes for a moment. catching up20:41
fungikey namechange is fine by me, +220:44
*** hwoarang has quit IRC20:48
*** jcoufal has quit IRC20:48
*** slaweq_ is now known as slaweq20:49
*** hwoarang has joined #openstack-infra20:49
openstackgerritMerged openstack-infra/system-config master: Reject messages to starlingx-discuss-owner  https://review.openstack.org/62803020:50
openstackgerritMerged openstack-infra/elastic-recheck master: Add query for intermittent tripleo failures in limestone  https://review.openstack.org/62803420:52
openstackgerritGoutham Pacha Ravi proposed openstack-infra/devstack-gate master: Grenade: Allow setting Python3 version  https://review.openstack.org/60737920:52
*** slaweq has quit IRC20:55
corvusmordred, clarkb, fungi: when should we schedule our k8s/rook walkthrough?  how about immediately following the infra meeting next tuesday?20:56
clarkbcorvus: that works for me20:56
mordred++20:56
clarkbthat might be late for frickler though. Maybe 1900UTC (slightly earlier) a week from today is better?20:57
corvusi'll make an ethercalc with options20:59
mordredmnaser: if you're around, I just booted a new node in sjc1 and it's getting name resolution errors trying to resolve archive.ubuntu.com21:00
corvusmordred: ^ does that time work for you as well?21:00
mnasermordred: whats the dns that's configured on it?21:00
mordredmnaser: dunno - it's just an ubuntu node that's trying to install some things via cloud-init21:00
mnasermordred: cat /etc/resolv.conf if you can ssh into it?21:00
mordredmnaser: I'm having issues sshing in as well ... maybe something went sideways with networking?21:01
mordredmnaser: 1bcf108e-85f9-4113-ac7a-0d60daa4e6f6 is the instance uuid in case it's a thing you want to look at on your side21:01
mnaserssh might take a really long time if dns isn't resolving properly21:01
fungicorvus: 20:00z tuesday january 8 wfm, i have no conflicts21:01
mnaserlet me check21:01
mordredmnaser: https://github.com/infraly/k8s-on-openstack/blob/master/roles/openstack-nodes/tasks/main.yaml#L20-L27 is the cloud-init script in question - so I don't think it should be doing much to the node21:02
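For context only (this is not the contents of that file), booting a node whose cloud-init user data installs a package or two might look like:

    - hosts: localhost
      tasks:
        - name: Boot a k8s node with a small cloud-init payload
          os_server:
            cloud: vexxhost-sjc1            # illustrative
            name: k8s-node-4
            image: Ubuntu 18.04             # illustrative image name
            flavor: illustrative-8gb-flavor
            key_name: bridge-root
            userdata: |
              #cloud-config
              package_update: true
              packages:
                - python3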
fungioh, or (catching up) some other time which works well for frickler sure21:03
clarkbwe might just have to do it twice too in order to get ianw and frickler and USAians all in the same spot21:03
mnasermordred: could you possibly try to get another floating ip instead of that one?21:05
mordredmnaser: sure - one sec21:05
corvusclarkb: well, i also plan on recording it.  so if we do it whenever most folks can join, we should be able to get a lively bunch of questions and hit 99% of what we need, and can handle the rest as followups.21:05
clarkbcorvus: ++21:05
*** wolverineav has joined #openstack-infra21:07
corvusclarkb, fungi, mordred: how's this?  https://ethercalc.openstack.org/infra-k8-walkthrough21:09
corvusi'll send an email21:09
clarkbcorvus: lgtm. I've added myself21:09
mordredmnaser: I am able to ssh in with the new floating ip21:09
*** rkukura has joined #openstack-infra21:10
mordredmnaser: and it's able to resolve dns now - so maybe just a hiccup?21:10
mnasermordred: the floating IP seems to be null routed so I’ll follow up with the why later21:11
*** slaweq has joined #openstack-infra21:12
mordredmnaser: awesome21:12
mordredmnaser: want me to leave it in the account? or delete it?21:13
dhellmannconfig-core: the final step of importing osc-summit-counter into gerrit is to add the release jobs (and then to do a release). I would appreciate it if you add https://review.openstack.org/#/c/625627/ to your review queue for this week.21:13
mnasermordred: you can release it for now21:13
mordredmnaser: cool21:13
corvusdhellmann: that tool makes me so happy; i'm really looking forward to using it.  :)21:14
dhellmanncorvus : would you like to be a member of the maintainer team? :-)21:14
corvusdhellmann: sure, i might even be qualified :)21:14
dhellmannthe bar is pretty low ;-)21:15
*** studarus has joined #openstack-infra21:15
corvusdhellmann: (i assume the qualifications include "can recite alphabet")21:15
dhellmanndone21:15
dhellmannyeah, at some point one of you is going to find the math error I put in the first version and then you'll get to be PTL21:15
dhellmann(it's still there)21:15
dhellmannat least I think it is21:16
fungii don't consider myself qualified, i think i've gotten the "how many prior summits have you attended" question on the registration form wrong before21:16
clarkbI stopped trying to get it right :/21:16
corvusfungi: then you should definitely be a user :)21:16
*** slaweq has quit IRC21:17
corvusthis tool, if widely known, could save hundreds of person-hours across our community :)21:17
dhellmannand improve the accuracy of our data collection significantly, too, I'm sure21:17
mordredcorvus: if ansible is to be believed, we should have a 4th node now21:19
*** xek has quit IRC21:20
mordredcorvus: and kubectl get nodes seems to also think so21:20
corvusmordred: \o/21:20
*** efried has quit IRC21:21
corvusmordred: should we run the rook operator?21:21
*** slaweq has joined #openstack-infra21:21
mordredcorvus: I think the rook operator should notice these actions AIUI21:21
corvusoh, hrm. i'll check its log21:21
corvusthere are no log entries21:22
corvuswhich i think means it has been quiet for $some_time and logs have been rotated via $some_mechanism21:22
mordredthat sounds about right21:23
mordredcorvus: https://github.com/rook/rook/blob/master/design/cluster-update.md21:24
openstackgerritMerged openstack-infra/project-config master: add release job for osc-summit-counter  https://review.openstack.org/62562721:25
corvusmordred: since we're doing all devices, we're not updating the cluster crd21:25
*** slaweq has quit IRC21:25
mordredcorvus: but - I think we might need to, to add a new storage node perhaps?21:25
openstackgerritMerged openstack-infra/system-config master: Add openstack keypair for the bridge root key  https://review.openstack.org/62803521:26
mordredcorvus: but also maybe you're right21:26
corvusmordred: we're also using all nodes21:26
mordredyeah21:26
corvusmordred: we have useAllNodes:true and useAllDevices:true so it's all automatic21:26
mordred++21:26
corvusso if it only watches the crd changes, then we'll need to restart; but if it somehow gets 'k8s node added' events, then i agree, it should have what it needs.21:27
corvusi don't know what reality is :)21:27
mordredme either :)21:27
corvusi will say that design doc only mentions crd update events21:27
mordredyeah. well - we could try deleting the operator pod and letting the app respawn it and see if that re-registers things21:29
*** efried has joined #openstack-infra21:29
corvusmordred: yeah, it's had plenty of time to do something on its own.  let's assume that and delete it.  you go?21:29
mordredcorvus: sure21:29
mordredI have deleted it and a new one has been created21:30
*** kgiusti has left #openstack-infra21:30
mordredcorvus: we seem to have 4 osds now21:31
corvushere's the current status: http://paste.openstack.org/show/739837/21:32
corvusit's backfilling21:32
mordredcorvus: neat21:33
mordredcorvus: should we wait for it to finish that before hupping gerrit to restart the replication?21:35
openstackgerritMerged openstack-infra/zuul master: Only reset working copy when needed  https://review.openstack.org/62434321:35
clarkbreading that doc, it seems like if we explicitly listed the nodes rather than setting all, then an update to the crd with a new node would tell the operator to make a change21:36
corvusclarkb: agreed21:36
clarkbbut it's diff based, so all -> all produces no diff and is a noop (even if the underlying hosts have changed)21:36
mordredyah. although restarting the operator pod is kind of like sending something a hup signal - so maybe just doing that on node additions would be fairly easy to do?21:37
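To make the two selection styles under discussion concrete, a hedged sketch of the relevant part of a CephCluster resource (field names per rook's v1 CRD of that era; values are illustrative):

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      storage:
        # current setup: everything automatic, so adding a k8s node
        # produces no CRD diff for the operator to react to
        useAllNodes: true
        useAllDevices: true
        # alternative: list nodes explicitly; adding one is then a CRD
        # update event the operator acts on
        # nodes:
        #   - name: k8s-node-4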
corvusmordred: re gerrit: maybe?  the backfilling process is still very slow; i don't know enough about the issue to know whether we're hitting a performance limitation (in which case replication would slow us down even more), or just artificial limits in backfilling (where ceph is preserving performance for client data reads/writes) in which case it would be fine21:38
corvusmordred: however, we're not in production, so maybe we go ahead and try it :)21:38
mordredcorvus: :)21:38
mordredcorvus: yeah - we might learn something from it21:38
fungiis the ceph cluster split across service providers?21:39
corvus(it's also really hard to predict; over the holiday, backfilling looked like it would take a week based on reported numbers, but finished in about a day)21:39
mordredclarkb, corvus, fungi: to delete the operator pod (and let the application re-spawn it): kubectl -n rook-ceph-system delete pod -l app=rook-ceph-operator21:39
corvusfungi: nope, all in one region21:39
fungididn't know if we were going for cross-provider redundancy or anything. i guess it can be added in the future but better to not complicate the pic21:40
fungipoc21:40
clarkbaiui you don't want to split osds across regions unless they are part of a full replica type setup21:41
corvusfungi: yeah, i'm not sure about either ceph-on-wan or percona-on-wan.  i could see either of them being finicky enough to make it harder.21:41
clarkbwhich is doable but its more like having a copy of all the data in each region21:41
mordredyah21:41
clarkb(I looked into it once upon a time for other raisins and it wasn't really going to win much)21:41
clarkband you have to do all your writes to one of the replicas iirc21:41
fungiand cold-failover is probably plenty for our risk profile anyway, live redundancy could be more hassle than it's worth21:42
corvusone thing we may want to think about before our next deployment -- whether we want to continue using ceph's default replica strategy, or switch to erasure coding21:42
clarkbcorvus: erasure coding might be a nice way to reduce disk overhead when we consider that vexxhost is also running with some number of copies under us21:43
clarkbbut recovery cost is higher aiui21:43
clarkb(since you have to maths more to get the data back)21:43
corvusfungi: well, for this system i'm proposing that the entire gitea system be HA, but within a single cloud provider.  so if vexxhost-sjc1 goes down, we're out; but we should have vm/hypervisor fault tolerance within that constraint.21:44
mordredyeah - I mean, I think our risk profile for losing one of the cinder volumes is actually pretty low - because they are themselves ceph volumes in this case on the underside21:44
corvusmordred: right, we're unlikely to lose data, but we may lose access (for a time)21:45
mordredyah21:45
fungicold failover involving building some new servers in another provider and re-replicating data to them seems reasonable, i meant21:47
mordredyup. this is all re-creatable data21:47
corvusfungi: ah, yes.21:47
corvusand we can do that pretty fast too :)21:48
fungiright, just weighing the likelihood of a provider permanently (or long-term temporarily) becoming unavailable for us against the work involved in relocating services21:48
clarkberasure coding may also make the disk usage numbers harder to reason about, if we want to keep that as simple as possible for now21:49
clarkbyou use 1.3x the size of the data or something iirc21:49
clarkbor was it 1.621:50
mordredclarkb: well, we already don't know how to reason about disk usage :)21:50
fungifrom incomprehensible to incomprehensibler21:51
clarkb1.5x21:52
clarkbin the default profile21:52
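For the arithmetic: with 3-way replication, 1 GiB of logical data consumes about 3 GiB of raw capacity (3.0x). With an erasure-code profile of k data chunks and m coding chunks the raw usage is (k+m)/k, so the long-standing default of k=2, m=1 gives 3/2 = 1.5x, at the cost of more CPU and I/O during recovery (hedging on which profile the deployed Ceph release actually defaults to).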
*** bobh has joined #openstack-infra21:55
*** slaweq has joined #openstack-infra21:55
*** slaweq has quit IRC22:00
corvusk8s walkthrough scheduling email sent22:00
*** bobh has quit IRC22:00
*** trown is now known as trown|outtypewww22:00
*** bobh has joined #openstack-infra22:01
*** jamesmcarthur has joined #openstack-infra22:05
*** jamesmcarthur has quit IRC22:09
corvusany objection to my restarting gerrit now?22:12
clarkbDigging into e-r data, since I've been trying to keep an eye on it before holidays, I notice that pypi errors shot up but that is because kombu removed a release we pinned in global constraints. I wonder if there is a way to make those errors look less like infra problems in e-r when they aren't infra problems22:13
clarkbcorvus: not from me22:13
*** eernst_ has joined #openstack-infra22:13
clarkboverall e-r looks healthy, but that is because test demand is super low. Only 6 integrated gate failures in the last 10 days22:14
clarkbMaybe this should go on the back burner until we have 10 days of data that resemble normalcy22:14
mordredcorvus: go for it22:14
*** smarcet has joined #openstack-infra22:14
clarkbhappily someone else noticed the kombu issue, generate-constraints generated a fix for it automatically, and the change was merged22:15
corvus#status log restarted gerrit to clear stuck gitea replication task22:16
openstackstatuscorvus: finished logging22:16
clarkbso other than limestone ssh weirdness (which maybe was fixed? time will tell) and the ovh bhs1 quota issues I think we are looking healthy22:16
clarkbgerrit web ui is responding for me again22:18
corvusreplications to gitea are happening: http://38.108.68.66/explore/repos22:21
clarkbShrews: do you know if the december 5th nodepool launcher restart was to handle your schema change?22:21
clarkbShrews: or will the next restart need to be a full launcher restart?22:21
corvusnow i'd like to restart the zuul scheduler22:21
clarkbIf we have already done the full launcher restart then I think we are safe to restart them one by one whenever we like (since openstacksdk is now fixed)22:22
Shrewsclarkb: i had a schema change?22:22
corvusand.... since we just merged a merger change, all of zuul, actually.22:22
clarkbShrews: looks like you wrote the change note about it but original change was corvus'22:23
clarkbhttps://review.openstack.org/#/c/623046/22:23
Shrewsclarkb: oh, yeah, the restart was for that22:24
corvusyeah, i think we should be good wrt that22:24
clarkbShrews: great, so launchers can be restarted whenever we like in whatever order we like22:24
mordredI support the restarting of the things22:24
*** wolverineav has quit IRC22:24
clarkbshould I go ahead and restart one launcher, monitor it then do the other four?22:24
clarkber other 322:24
corvusclarkb: sure; i'm about to restart all of zuul; i don't think we need to coordinate, unless you want to?22:24
Shrewsclarkb: any order should be fine22:25
clarkbcorvus: I don't think we need to coordinate22:25
clarkbchange log in nodepool looks pretty safe after that full restart point22:26
corvusstopping zuul22:26
Shrewsthat should get the host_id change pabelanger wants too22:26
clarkbShrews: yup22:26
clarkbthat and double checking sdk is happy is the motivation here22:26
corvusscheduler is back up, executors are still stopping22:27
mordredcorvus: the dashboard seems to be sad22:28
corvusmergers are back up22:28
mordredit's redirecting or something22:28
clarkbI'm going to start with nl0322:28
*** wolverineav has joined #openstack-infra22:28
clarkbnodepool==3.3.2.dev88  # git sha f8bf6af is what is installed on 0322:29
corvusmordred: i think it will clear once the config is loaded.  you can see the error in the console.  i mentioned it to tristanC.22:29
corvusmordred: "s.pipelines is undefined"22:29
clarkband openstacksdk 0.22.022:29
mordrednod22:29
corvusall zuul components are running now22:29
clarkbmordred: is 0.22.0 sdk what we want?22:29
mordredclarkb: yes, I believe so22:29
corvusPlaybook run took 0 days, 0 hours, 2 minutes, 58 seconds22:30
*** tosky has joined #openstack-infra22:30
clarkbnl03 launcher restarted. I'll do the others if this one looks happy22:30
corvushrm.  the scheduler has not loaded its configuration yet, and it's not clear why.22:31
clarkbcorvus: I seem to recall it taking about 5 minutes22:32
corvusclarkb: yeah, that's how long the cat jobs take but i think they are done22:32
mordredcorvus: dashboard is up - maybe it was just taking longer22:33
*** boden has quit IRC22:34
corvushrm.  yeah, looks like there were some slow cat jobs22:34
corvus2019-01-02 22:32:15,920 INFO zuul.TenantParser: Loading configuration from openstack/nova/.zuul.yaml@stable/ocata22:35
corvusmay be time to clear out the nova repos on the mergers :)22:35
*** smarcet has quit IRC22:35
corvus(that was one of the stragglers)22:35
clarkbnl03 seems happy but it hasn't lifecycled a node yet (I expect it will get to that once corvus re-enqueues the queues)22:36
*** smarcet has joined #openstack-infra22:36
clarkbonce I see it has done that successfully I'll use that as the marker to restart the others22:36
*** jtomasek has quit IRC22:36
corvusstarting those now22:36
corvus#status log restarted all of zuul at commit 4540b7122:36
openstackstatuscorvus: finished logging22:36
clarkbI see the hostids in the logs now too22:38
clarkbpabelanger: ^ fyi, and thanks22:38
*** efried has quit IRC22:39
clarkbcorvus: do we repack and gc those repos?22:40
clarkbhrm I guess a GC won't remove all the zuul refs since they have pointers to them22:40
clarkb(so a delete and reclone is a bit more forceful GC)22:41
corvusclarkb: not explicitly; only the default git gc22:41
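A sketch of the "clear out the nova repos on the mergers" idea mentioned earlier, with the merger cache path as an assumption and on the understanding that the merger should be quiesced so zuul simply re-clones on the next job:

    - hosts: zuul-mergers               # hypothetical group
      become: true
      tasks:
        - name: Drop the cached copy so the next merge operation re-clones it
          file:
            path: /var/lib/zuul/git/git.openstack.org/openstack/nova   # path is an assumption
            state: absent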
*** jamesmcarthur has joined #openstack-infra22:42
clarkbnode with id 0001464013 was created, used, and deleted on nl0322:43
clarkbI'm restarting the other launchers now22:43
*** jamesmcarthur has quit IRC22:46
clarkb#status log Restarted nodepool launchers nl01-nl04 to pick up hypervisor host id logging and update openstacksdk. Now running nodepool==3.3.2.dev88  # git sha f8bf6af and openstacksdk==0.22.022:47
openstackstatusclarkb: finished logging22:47
*** diablo_rojo has quit IRC22:47
*** rcernin has joined #openstack-infra22:49
*** jamesmcarthur has joined #openstack-infra22:51
*** e0ne has joined #openstack-infra22:51
mordredcorvus: http://38.108.68.66/openstack/aodh claims to have replicated by the gerrit repl logs, but is still showing empty (was looking at it randomly)22:53
corvuslookin22:53
clarkbwe gained 4 mergers after the zuul restart22:54
mordredcorvus: other things, both before and after it - seem to have replicated properly22:54
clarkbfungi: ^ iirc you had looked into that and found it was network connections between gearman having gone away?22:54
mordredcorvus: fwiw, bifrost and blazar also seem empty - I haven't found any specific patterns yet22:57
corvusmordred: i believe we saw empty repos where we pushed to them before they were created in gitea; and then if we pushed to them again, they showed up correctly; but perhaps if there are no actual changes to the repos, the internal event to refresh doesn't fire?22:59
mordredcorvus: it would be neat if there was a gitea-refresh-repo command or something23:00
*** bobh has quit IRC23:01
*** dkehn_ has joined #openstack-infra23:01
*** studarus has quit IRC23:01
*** bobh has joined #openstack-infra23:01
*** dkehn has quit IRC23:02
*** dkehn_ is now known as dkehn23:02
corvusmordred: i tried "Reinitialize all missing Git repositories for which records exist" but that wasn't it.23:03
corvusmordred: i tried "Execute health checks on all repositories" and it said "23:03
corvusRepository health checks have started.23:03
corvus"23:03
corvusso... ?23:03
*** wolverineav has quit IRC23:03
corvusi think it completed without incident.23:04
corvusso yeah, i don't see an easy way to refresh it, other than to figure out what the internal hook is23:04
mordredI'm looking at https://github.com/go-gitea/gitea/blob/master/cmd/hook.go23:05
corvusi think 'update' is what we want23:05
corvusbased on:23:06
corvus[Macaron] 2019-01-02 23:04:53: Started POST /api/internal/ssh/2/update for [::1]23:06
corvus[Macaron] 2019-01-02 23:04:53: Completed POST /api/internal/ssh/2/update 200 OK in 5.695615ms23:06
*** bobh has quit IRC23:06
mordred++23:07
*** wolverineav has joined #openstack-infra23:07
*** smarcet has quit IRC23:07
mordredcorvus: actually - if I'm reading https://github.com/go-gitea/gitea/blob/master/routers/private/internal.go right23:09
mordredcorvus: I think /api/internal/ssh/2/update has to do with public keys ... but still reading23:10
corvusweird23:10
*** rascasoft has quit IRC23:10
mordredhttps://github.com/go-gitea/gitea/blob/master/models/update.go23:12
corvusmordred: well, it seems like we may just want to make this operate like the current cgit -- create in gitea first, before we create in gerrit23:12
mordredyeah23:14
corvusi'm inclined to just write this off and redirect effort on doing that23:14
mordredyah - I think that's a good call. although I also think, once the current replication is done, figuring out how to trigger gitea to do whatever it needs to do if/when something got missed seems like a good plan23:15
corvusmordred: i'm sure we could delete the repo and re-push; so if that happens for an individual repo, we've got a solution23:15
corvusjust would be annoying at the current error scale23:15
mordredyeah23:15
corvuswe can go ahead and do that for aodh to validate23:15
corvus(but maybe after the current replication backlog clears)23:16
mordredyeah. I think poking too hard while it's still trying to replicate is maybe the wrong choice23:16
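A hedged sketch of pre-creating a repository through gitea's REST API before gerrit ever pushes to it (endpoint path, port, and credentials are assumptions to check against the deployed gitea version):

    - hosts: localhost
      tasks:
        - name: Create the repo in gitea ahead of replication
          uri:
            url: "http://38.108.68.66:3000/api/v1/org/openstack/repos"
            method: POST
            user: root
            password: "{{ gitea_root_password }}"   # hypothetical variable
            force_basic_auth: true
            body_format: json
            body:
              name: aodh
              private: false
            status_code: 201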
mordredcorvus: I see things like this in the gerrit replication logs:23:19
mordredRemoteRefUpdate[remoteName=refs/tags/4.1-eol, NOT_ATTEMPTED, (null)...d2624d975b0f24f6e1beb9bd832420b2bffef7d2, srcRef=refs/tags/4.1-eol, forceUpdate, message=null]23:19
*** e0ne has quit IRC23:19
mordredand23:20
mordredRemoteRefUpdate[remoteName=refs/changes/79/176779/1, NOT_ATTEMPTED, (null)...389afee9636303201b00477a56a1f4156751fa6d, srcRef=refs/changes/79/176779/1, forceUpdate, message=null]23:20
corvusi don't know the meaning of those23:21
mordredcorvus: maybe the refs/changes ones are just gerrit asserting that it's not going to replicate ref/changes ? (although it's super chatty if that's the case)23:22
corvusmordred: yeah; though tag should be there?23:22
corvusalso.. i think it *does* replicate refs/changes23:22
mordredyeah23:22
*** rlandy|rover is now known as rlandy|rover|bbl23:23
mordredalso: http://38.108.68.66/openstack/fuel-library is the repo associated with both of those, and it's empty23:23
mordredalthough I did also see a NOT_ATTEMPTED message for a repo that did have content23:23
*** jamesmcarthur has quit IRC23:23
openstackgerritJames E. Blair proposed openstack-infra/zuul master: WIP: Combine artifact URLs with log_url if empty  https://review.openstack.org/62806823:24
*** jamesmcarthur has joined #openstack-infra23:24
corvusmordred: 812e8662 waiting .... 23:21:10.074      (retry 1) [f3336f21] push git@38.108.68.66:openstack/fuel-library.git23:26
corvusmordred: there are a couple of retry replication events at the bottom of the queue now23:27
mordredcorvus: ah - maybe gitea is backed up trying to process the incoming onslaught?23:27
*** jamesmcarthur has quit IRC23:28
corvusmordred: well, i wouldn't describe it as an onslaught; more of a trickle23:28
mordredyeah. I'm seeing some things in the gitea log ...23:30
mordred2019/01/02 23:24:24 [...tea/models/update.go:270 pushUpdate()] [E] GetBranchCommitID[refs/changes/58/277858/1]: object does not exist [id: refs/heads/refs/changes/58/277858/1, rel_path: ]23:30
corvuswe've added 100MiB of data stored in the course of 1.25 hours, and are halfway through the list of repos23:30
corvusthat's a change to fuel-main23:31
corvusfuel-main is currently being replicated (still)23:34
*** wolverineav has quit IRC23:35
corvusmordred: maybe that error shows up for every refs/changes push because it isn't a branch?23:36
mordredmaybe?23:37
corvusmordred: so it looks like it's emitting one of those errors for every fuel-main change right now, and it takes about a second to do that, and that's what's taking so long23:39
mordredcorvus: wow, awesome23:39
*** wolverineav has joined #openstack-infra23:40
*** wolverineav has quit IRC23:40
*** wolverineav has joined #openstack-infra23:40
*** Jeffrey4l has quit IRC23:41
corvusfuel-main is finished; it did not end up on the retry list23:41
mordredI agree. and I can browse it23:41
mordredand it has tags and whatnot, so yay (also has the switched branch thing)23:42
corvuswe should see if re-pushing a repo causes the refs/changes errors to all happen again23:42
mordred++23:42
corvussince i'm logged in, i can see an event stream; i'll paste23:43
corvushttps://screenshots.firefox.com/XwHxRyU5N4pvz0b0/38.108.68.6623:43
corvuslooking at that second entry; this is the refs/changes url: http://38.108.68.66/openstack/fuel-main/src/branch/refs/changes/67/262167/1  (it 404s)23:43
corvusbut this is the commit sha url below it: http://38.108.68.66/openstack/fuel-main/commit/fc9f3c11f9bc78e03fcb407fd17d7bdd4d906897 (it works)23:43
corvusso i think that's consistent with gerrit pushing refs/changes, the commits ending up in the repo, but gitea not knowing what to do with them because ENOTABRANCH23:44
mordredyeah. so maybe we should tell gerrit to not replicate refs/changes to gitea23:44
mordredand then maybe figure out how to send them a patch to not bomb out on non-branch refs23:45
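One way to express "don't replicate refs/changes" is to list explicit push refspecs for the gitea remote in gerrit's replication.config instead of pushing everything; a sketch (remote name and file location are assumptions):

    - hosts: gerrit
      become: true
      tasks:
        - name: Limit replication to branches, tags and notes
          blockinfile:
            path: /home/gerrit2/review_site/etc/replication.config
            marker: "# {mark} gitea refspecs"
            insertafter: '^\[remote "gitea"\]'
            block: |
              push = +refs/heads/*:refs/heads/*
              push = +refs/tags/*:refs/tags/*
              push = +refs/notes/*:refs/notes/*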
*** diablo_rojo has joined #openstack-infra23:47
corvusbtw, i can't get over how cool it is to have "ls" actually tell you how much data is inside of a directory (this happens in cephfs)23:48
*** gyee has quit IRC23:48
clarkbdoes it complain about git notes too?23:49
clarkbalso branchless refs23:49
corvusclarkb: have't seen any yet23:49
mordredI think I saw a git note complaint previously23:49
mordredin the replication_log23:50
corvusmordred: the number of files in the sessions directory is large and increasing quickly.  (32k currently)23:50
corvusi don't remember how to see git notes on github23:53
mordred[2019-01-02 23:49:23,821] [c7a229db] Push to git@38.108.68.66:openstack/fuel-plugin-congress.git references: [RemoteRefUpdate[remoteName=refs/changes/44/420544/1, NOT_ATTEMPTED, (null)...adfa2db62988649219d64bd53746f2635d95aa43, srcRef=refs/changes/44/420544/1, forceUpdate, message=null], RemoteRefUpdate[remoteName=refs/heads/master, NOT_ATTEMPTED, (null)...adfa2db62988649219d64bd53746f2635d95aa43,23:53
mordredsrcRef=refs/heads/master, forceUpdate, message=null], RemoteRefUpdate[remoteName=refs/notes/review, NOT_ATTEMPTED, (null)...c9851bc1cc42f14946cdd98a35cdb6d1b5a2f207, srcRef=refs/notes/review, forceUpdate, message=null]]23:53
mordredthere's one with refs/notes listed23:53
mordredalthough interestingly enough it also lists refs/heads/master23:54
corvusmordred: afaict, not_attempted doesn't mean anything useful23:55
mordredcorvus: yeah. I agree with that23:55

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!