Thursday, 2019-09-05

clarkbjohnsom: why is it resolving _http._tcp.mirror.regionone.fortnebula.opendev.org. SRV IN ? is that something apt does?00:00
pabelangerclarkb: is there no mirror info for vexxhost there?00:00
pabelangeralso +200:00
clarkbpabelanger: only the mirrors that have been replaced with opendev.org ansible'd mirrors are there00:00
pabelangerkk00:00
johnsomCouldn't tell you. I thought that was odd too.00:00
clarkbalso it is trying both cloudflare and google before it gives up00:01
johnsomYeah, upping the ttl seems like we are just going to fail the download anyway if the IPv6 routing is down.00:01
*** diablo_rojo has joined #openstack-infra00:02
diablo_rojorm_work, yeah they already went out00:02
*** trident has joined #openstack-infra00:03
clarkbI can resolve google.com via those two resolvers right now against the two nameservers that failed from the mirror host in that cloud00:04
clarkbjohnsom: are we sure that the job itself isn't breaking the host's networking or unbound setup?00:04
clarkbjohnsom: we have had jobs do that in the past00:04
clarkbI think either that is happening or we have intermittent routing issues out of the cloud00:04
johnsomThis happens over many different jobs/patches00:04
clarkb(because right now it seems to be fine)00:04
johnsomLooking at the unbound log it was fine at the start of the job too.00:05
clarkbjohnsom: ya then the job runs and does stuff to the host :)00:05
clarkbaccording to that log it had successfully resolved records just 14 seconds prior00:05
clarkbwith this being udp is it deciding the network is unreachable because the timeout is hit?00:07
johnsomWell it blew up devstack on the latest one. The patch doesn't change the devstack config (and frankly we haven't in a long time either).00:07
*** rh-jelabarre has quit IRC00:07
clarkbrtt to those two servers is 8ms currently if that timeout is in play00:07
johnsomOr there is no route, like the RA expired00:07
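A quick way to tell those two failure modes apart (a minimal Python sketch, assuming it runs on the affected node and that these two public resolvers are the forwarders in question): "Network is unreachable" surfaces immediately as an OS error on the UDP send when there is no IPv6 route (e.g. after an RA expires), whereas a slow or silently dropped resolver only shows up as a receive timeout.

    import socket

    RESOLVERS = ["2606:4700:4700::1111", "2001:4860:4860::8888"]
    # A minimal DNS query: ID 0xabcd, RD set, one question for ". NS IN"
    QUERY = bytes.fromhex("abcd010000010000000000000000020001")

    def probe(addr, timeout=3.0):
        s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
        s.settimeout(timeout)
        try:
            s.sendto(QUERY, (addr, 53))   # ENETUNREACH raises here if there is no route
            s.recvfrom(512)               # a routed-but-unresponsive server times out here
            return "ok"
        except socket.timeout:
            return "timed out waiting for a reply"
        except OSError as e:
            return f"send failed: {e}"    # e.g. [Errno 101] Network is unreachable
        finally:
            s.close()

    for r in RESOLVERS:
        print(r, probe(r))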
clarkbpabelanger: its direct via bgp00:10
clarkber I was scrolled up and thought that was familiar, sorry00:10
rm_workdiablo_rojo: what was the title? I don't think I saw mine?00:11
rm_workand approx when?00:11
clarkbjohnsom: fwiw the syslog entries for the time when unbound log starts logging thing shows updates to iptables and ebtables00:11
*** armax has quit IRC00:11
johnsomWe dom00:12
clarkbjohnsom: http://paste.openstack.org/show/770959/00:12
johnsomWe don't use either. We only use neutron SGs00:12
clarkbis it possible neutron is nuking it then?00:12
*** stewie925 has quit IRC00:13
diablo_rojorm_work, I pinged Kendall Waters (she sent them) I'll let you know as soon as she replies.00:14
rm_workhmm k00:14
rm_workwhat's her nick?00:14
clarkbit is interesting that the console log shows only the apt stuff and not other tasks that could change iptables or ebtables or ovs00:14
clarkbmaybe neutron background stuff as its turning on? /me checks neutron logs00:14
johnsomMaybe. There is ovs openflow stuff in there too, so probably neutron devstack stuff.00:14
johnsomIf that was the case, why would the other providers and the py27 run of the same patch pass?00:16
johnsomAnd the two node version on py3, etc...00:16
clarkbit could be a race00:16
clarkbassuming iptables/ebtables/ovs is at fault somehow, if they happen long enough after your apt stuff then apt will succeed and happily move on (and the job may not do anything else external after that point?)00:17
clarkbI don't have enough evidence to say they are at fault but it is highly suspicious that we have network problems at around exactly the same time we muck with the network00:17
clarkbjohnsom: Sep 04 22:04:29.894751 ubuntu-bionic-fortnebula-regionone-0010774925 neutron-dhcp-agent[13633]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-89551ca5-1abb-484e-af0c-1a4f3c8e1d59', 'sysctl', '-w', 'net.ipv6.conf.default.accept_ra=0']00:22
clarkbdo network namespaces mask off sysctls?00:23
clarkbif not, setting default there instead of the specific interface may be the problem00:23
diablo_rojorm_work, sounds like they would have come from Jimmy or Ashlee and would have had "Open Infrastructure Summit Shanghai" in the subject00:25
johnsomThey do mask some of them, yes.00:25
clarkbinternet says the answer is it depends00:26
clarkbya depends on the specific one00:26
rm_workdiablo_rojo: hmm ok i'll search again00:26
clarkbhttps://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt does not say if accept_ra is namespace specific or not00:27
johnsomYeah, they are listed somewhere else.00:27
clarkbI feel like there is a good chance this is the problem00:27
rm_workdiablo_rojo: yeah I don't see mine... who should I contact?00:29
clarkbneat, the k8s docs say it's even trickier than that. Some sysctls can be set within a namespace but take effect globally00:30
clarkbso there are three types I think, those that are namespace specific, those that can be set in a namespace but take effect globally, and those that can only be set at the root namespace level00:31
johnsomI would be really surprised if the accept_ra isn't netns specific. It impacts the routing tables which are a key part of netns functionality. Still digging however.00:31
clarkbjohnsom: yes, however it is specifically the 'default' one00:31
clarkbjohnsom: I would not be surprised if 'default' took effect more globally because it's the default00:32
rm_workdiablo_rojo: ah, I found Kendall Waters' email. I'll shoot her a line.00:32
diablo_rojorm_work, you found your code?00:32
johnsomI don't think so.00:32
diablo_rojorm_work, sounds like it might not have come from her there are a few people that work on them but it should have been around the beginning of the month 8/5 or 8/6?00:33
rm_workdiablo_rojo: no, I just mean I found her email *address*, and will email her directly instead of randomly trying to use you as an intermediary :D00:33
*** bobh has joined #openstack-infra00:35
clarkbdocker default network setup doesn't let me change that sysctl00:35
clarkbprobably have to try and reproduce using netns constructed like neutron00:36
*** dychen has joined #openstack-infra00:38
johnsomclarkb I just tried it. It is locked into the netns00:38
*** zhurong has quit IRC00:38
*** gyee has quit IRC00:39
johnsomTwo different windows, but you can get the idea: http://paste.openstack.org/show/770960/00:40
johnsomOh, and a double paste on the first one. sigh00:40
clarkbjohnsom: is line 79 after line 123 on the wall clock?00:42
*** bobh has quit IRC00:42
johnsomAfter 125 actually00:42
johnsomI wanted to make sure it actually stuck and didn't just silently ignore it00:43
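For reference, the check johnsom and clarkb reproduced by hand looks roughly like this (a sketch only; it assumes root, iproute2 and sysctl on the host, and uses a made-up throwaway namespace name):

    import subprocess

    NS = "accept-ra-test"  # hypothetical scratch namespace
    KEY = "net.ipv6.conf.default.accept_ra"

    def run(*cmd):
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

    run("ip", "netns", "add", NS)
    try:
        before = run("sysctl", "-n", KEY)                             # root namespace value
        run("ip", "netns", "exec", NS, "sysctl", "-w", f"{KEY}=0")    # what neutron-dhcp-agent runs
        inside = run("ip", "netns", "exec", NS, "sysctl", "-n", KEY)
        after = run("sysctl", "-n", KEY)
        print(f"root ns: before={before} after={after}; inside netns: {inside}")
        # If after == before while inside == 0, the sysctl is confined to the namespace.
    finally:
        run("ip", "netns", "delete", NS)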
*** bobh has joined #openstack-infra00:45
clarkbcool I can replicate that too00:46
*** Goneri has quit IRC00:47
openstackgerritClark Boylan proposed opendev/base-jobs master: We need to set the build shard var in post playbook too  https://review.opendev.org/68023900:48
clarkbinfra-root ^ yay for testing. That was missed in my first change to base-test00:48
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Add build shard log path docs to upload-logs(-swift)  https://review.opendev.org/68024000:50
clarkbjohnsom: I'm betting someone like slaweq would be able to wade through these logs quicker to rule neutron stuff in or out00:52
johnsomclarkb FYI: https://ae2e20a98890c5458e58-1659b40e5c03e7f989419d6178d67ae8.ssl.cf5.rackcdn.com/679332/3/check/ironic-standalone-ipa-src/50285fa/controller/logs/devstacklog.txt.gz00:52
johnsom2019-09-04 23:49:55.79300:52
johnsomOther projects seem to be having the same issue00:52
clarkbjohnsom: there are similar syslog'd network changes around the time that one fails dns too00:54
clarkbI think that is our next step. See if slaweq can help us understand what neutron is doing there to either rule it in or out00:54
clarkband we can also have donnyd check for external routing problems00:54
johnsomSeems to only happen with the fortnebula instances00:55
clarkbfortnebula is our only ipv6 cloud right now (or did we turn on limestone again?)00:56
clarkbhowever if it is a timing thing then it could possibly be related to the cloud00:56
johnsomYeah, 100% of the hits in the last 24 hours were all fortnebula00:57
clarkbany idea why the worlddump doesn't seem to be working at the end of these devstack runs? it is supposed to capture things like network state for us00:57
clarkbmaybe we haven't listed that file as one to copy in the job?00:58
johnsomThough I guess I was only looking at the last 500 out of the 986 hits in the last 24 hours00:58
johnsomHmm, I hadn't noticed that it disappeared00:59
*** bobh has quit IRC01:00
*** markvoelker has joined #openstack-infra01:00
johnsomNone of the projects I have looked at have that worlddump file anymore.01:01
donnydclarkb: how can i help01:01
clarkbso I think that's the 3 next steps then. Fix worlddump, see if slaweq understands what neutron is doing and if that might play a part, and work with donnyd to check for ipv6 routing issues01:02
clarkbdonnyd: we have some jobs in fortnebula that fail to resolve against 2606:4700:4700::1111 and 2001:4860:4860::8888 at times01:02
clarkbdonnyd: https://79636ab1f0eaf6899dc0-686811860e75a35eabcecb5697905253.ssl.cf2.rackcdn.com/665029/30/check/octavia-v2-dsvm-scenario/51891fa/controller/logs/unbound_log.txt.gz shows this if you need examples01:03
clarkbthe timestamp there translates to 2019-09-4T22:06:01Z01:03
clarkbjohnsom: worlddump is exiting with return code 2? is that what the last line there means?01:04
donnydwhy do i see no timestamps01:04
donnydi haven't looked at the logs since the switch to swift01:05
clarkbdonnyd: in that file [1567634761] is the timestamp which is an epoch time01:05
*** markvoelker has quit IRC01:05
clarkbit is due to how unbound logs01:05
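Since unbound only logs epoch seconds, lining its entries up with the job-output and syslog timestamps is just a conversion, e.g. in Python:

    from datetime import datetime, timezone

    # [1567634761] from the unbound log linked above
    print(datetime.fromtimestamp(1567634761, tz=timezone.utc).isoformat())
    # -> 2019-09-04T22:06:01+00:00, the failure time quoted earlier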
donnyd[1567634761] unbound[792:0] debug: answer from the cache failed01:05
clarkbthen you get [1567634761] unbound[792:0] notice: sendto failed: Network is unreachable and [1567634761] unbound[792:0] notice: remote address is ip6 2606:4700:4700::1111 port 53 (len 28)01:06
donnyd[1567634761] unbound[792:0] notice: sendto failed: Network is unreachable01:07
clarkbI've got to run to dinner now, but I expect fixing worlddump will help significantly01:07
donnydok01:07
johnsomclarkb I don't see any output from the worlddump.py line in the job I'm looking at now. So I don't know...01:07
donnydPING 2606:4700:4700::1111(2606:4700:4700::1111) 56 data bytes01:07
donnyd64 bytes from 2606:4700:4700::1111: icmp_seq=2 ttl=60 time=6.76 ms01:07
donnyd64 bytes from 2606:4700:4700::1111: icmp_seq=3 ttl=60 time=8.09 ms01:07
donnyd64 bytes from 2606:4700:4700::1111: icmp_seq=4 ttl=60 time=11.5 ms01:07
donnyd64 bytes from 2606:4700:4700::1111: icmp_seq=5 ttl=60 time=10.1 ms01:07
donnyd64 bytes from 2606:4700:4700::1111: icmp_seq=6 ttl=60 time=7.96 ms01:07
donnyd64 bytes from 2606:4700:4700::1111: icmp_seq=7 ttl=60 time=6.65 ms01:07
donnyd64 bytes from 2606:4700:4700::1111: icmp_seq=8 ttl=60 time=6.73 ms01:07
donnyd64 bytes from 2606:4700:4700::1111: icmp_seq=9 ttl=60 time=7.23 ms01:07
donnydi can turn on packet capture at the edge and find out what the dealio is01:08
*** mattw4 has joined #openstack-infra01:08
clarkbdonnyd: ya I was able to resolve to the two resolvers from the mirror just fine01:09
clarkband they both respond in ~8ms rtt so not a timeout issue unless that rtt skyrockets01:09
*** slaweq has joined #openstack-infra01:11
clarkbsean-k-mooney: fyi the sync-devstack-data role is what is supposed to copy the CA stuff around but it runs after devstack runs on the subnodes01:12
clarkbsean-k-mooney: I noticed this when debugging worlddump really quickly01:12
clarkbI can write or test a change tonight but you might just try move that role before the sub node devstack runs01:12
clarkb*I can't01:12
clarkbianw frickler ^ fyi you may be interested in those two things (worlddump not working and sync-devstack-data not running before the subnodes)01:13
clarkband now really dinner01:13
*** slaweq has quit IRC01:15
donnydjohnsom: so resolution works the entire way through the job, but then fails at the end01:20
johnsomIt seems to fail part way through01:21
donnydwhere01:21
*** zhurong has joined #openstack-infra01:21
*** mattw4 has quit IRC01:21
*** calbers has quit IRC01:22
donnydthe resolve failure is at the end of the job, but I am curious if it is from switching between v6 and v4...01:22
clarkbour job setup will switch the ipv6 clouds to ipv6 at the beginning of the job01:23
clarkbbefore that the default is ipv401:23
johnsomdonnyd 2019-09-04 23:49:55.794760 in this ironic job: https://ae2e20a98890c5458e58-1659b40e5c03e7f989419d6178d67ae8.ssl.cf5.rackcdn.com/679332/3/check/ironic-standalone-ipa-src/50285fa/job-output.txt01:23
donnyd2019-09-04 23:49:55.161 | ++ /opt/stack/ironic/devstack/lib/ironic:create_bridge_and_vms:1868 :   sudo ip route replace 10.1.0.0/20 via 172.24.5.21901:24
johnsom2019-09-04 22:06:01.847407 in our Octavia job https://79636ab1f0eaf6899dc0-686811860e75a35eabcecb5697905253.ssl.cf2.rackcdn.com/665029/30/check/octavia-v2-dsvm-scenario/51891fa/job-output.txt01:24
*** yamamoto has joined #openstack-infra01:24
johnsomFor the octavia job, it's still setting up devstack, it hasn't started our tests yet01:25
donnydWell that job is doing the same as the other01:27
donnydfailing at the end01:27
donnyd2019-09-04 21:57:20.422065 | controller | Get:1 http://mirror.regionone.fortnebula.opendev.org/ubuntu bionic-updates/universe amd64 qemu-user amd64 1:2.11+dfsg-1ubuntu7.17 [7,354 kB]01:27
donnydworks at the beginning of the job, but not at the end01:27
donnydi wonder if any of the other projects are having this issue01:27
johnsomWell, the logs end because the devstack start failed....01:27
donnydyes, the logs end01:28
johnsomThose two were the ones I saw in a quick search of the last 24s. The job needs to attempt to pull packages in, i.e. a devstack bindep for this error to report.01:29
johnsom24hrs, sorry01:30
clarkbya and in those two neutron is making network changes around when it starts to fail01:30
clarkbhence my suspicion that may be related01:30
johnsomThey appear time clustered, but the sample size is a bit small given we don't always have jobs running that need to call out01:31
johnsomhttps://usercontent.irccloud-cdn.com/file/CtLQgDRl/image.png01:32
donnydI am also happy to get you some debugging instances if that will help01:32
donnydHow long has it been doing this?01:35
johnsoma month or two from what I remember. The gates were not logging the unbound messages before, so we were not sure why the DNS lookups were failing01:36
*** calbers has joined #openstack-infra01:37
donnyd2019-09-04 22:05:04.371936 | controller | + lib/neutron_plugins/services/l3:_neutron_configure_router_v6:423 :   sudo ip -6 addr replace 2001:db8::2/64 dev br-ex01:37
donnyd2019-09-04 22:05:04.382939 | controller | + lib/neutron_plugins/services/l3:_neutron_configure_router_v6:424 :   local replace_range=fd4d:5343:322e::/5601:37
donnyd2019-09-04 22:05:04.385798 | controller | + lib/neutron_plugins/services/l3:_neutron_configure_router_v6:425 :   [[ -z 163f0224-ac00-44b8-931b-0ec01a216a4c ]]01:37
donnyd2019-09-04 22:05:04.388395 | controller | + lib/neutron_plugins/services/l3:_neutron_configure_router_v6:428 :   sudo ip -6 route replace fd4d:5343:322e::/56 via 2001:db8::1 dev br-ex01:37
donnydwhat are these configuring?01:37
johnsomThat I don't know. That is neutron setup steps in devstack.01:39
donnyddo you have a local.conf for this job?01:39
*** bobh has joined #openstack-infra01:39
donnydOr a way to run it on a test box?01:39
donnydI think that is going to be the fastest way to the bottom01:40
johnsomHere is one from an octavia job: https://ae2e20a98890c5458e58-1659b40e5c03e7f989419d6178d67ae8.ssl.cf5.rackcdn.com/679332/3/check/ironic-standalone-ipa-src/50285fa/controller/logs/local_conf.txt.gz01:41
johnsomI don't think this is easy to reproduce, but I also haven't looked to see if we have had successes at fortnebula.  Let me look at the other jobs01:42
*** bobh has quit IRC01:44
donnydhttp://logstash.openstack.org/#/dashboard/file/logstash.json?query=node_provider:%5C%22fortnebula-regionone%5C%22%20AND%20message:%5C%22Temporary%20failure%20resolving%5C%22&from=3h01:48
donnydSo it looks to me like ironic-standalone-ipa-src and octavia-v2-dsvm-scenario have issues01:50
donnydbut i don't see any other jobs with that issue01:50
donnydi should say octavia-*01:50
johnsomRight, most jobs don't pull in packages later in the stack process01:50
johnsomOk, yes, there have been successful runs at fortnebula: https://821d285ecb2320351bef-f1e24edd0ae51a8de312c1bf83189630.ssl.cf1.rackcdn.com/672477/7/check/octavia-v2-dsvm-scenario-amphora-v2/5f2878d/job-output.txt01:52
donnyddo these jobs only test against ubuntu?01:52
johnsomNo, there are centos as well01:52
johnsomAt least for octavia, I can't talk to ironic01:52
*** bobh has joined #openstack-infra01:53
donnydits looks to me like the failures are 100% ubuntu01:53
*** bobh has quit IRC01:54
johnsomYeah, so it looks like the IPv6 DNS lookups are intermittent; I see a number that were successful over the last two days.01:54
johnsomOur centos job has only landed on fortnebula three times in the last week, so, small sample size.01:58
donnydyea that is a small sample size01:59
donnydHrm... well I surely need to keep an eye on it01:59
*** markvoelker has joined #openstack-infra02:01
johnsomYeah, maybe fire up a script that does an IPv6 lookup of mirror.regionone.fortnebula.opendev.org on DNS server 2001:4860:4860::8888 every 5-10 minutes. See if you get some failures.02:01
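A rough version of that watcher (a sketch, assuming dig is installed and the host has working IPv6; the name, server, and interval are taken from the suggestion above):

    import subprocess
    import time
    from datetime import datetime, timezone

    SERVER = "2001:4860:4860::8888"
    NAME = "mirror.regionone.fortnebula.opendev.org"

    while True:
        r = subprocess.run(
            ["dig", "-6", f"@{SERVER}", "AAAA", NAME, "+time=3", "+tries=1", "+short"],
            capture_output=True, text=True)
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        if r.returncode != 0 or not r.stdout.strip():
            print(f"{stamp} FAIL rc={r.returncode} {r.stderr.strip()}")
        else:
            print(f"{stamp} OK {r.stdout.strip()}")
        time.sleep(300)  # every 5 minutes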
johnsomI also need to step away now. Thanks for your time looking into it.02:02
donnydNP02:02
donnydlmk how else i can be helpful02:02
*** markvoelker has quit IRC02:05
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Add build shard log path docs to upload-logs(-swift)  https://review.opendev.org/68024005:40
AJaegerconfig-core, please put https://review.opendev.org/679850 and https://review.opendev.org/679856 on your review queue05:51
AJaegerconfig-core, and also https://review.opendev.org/679743 and https://review.opendev.org/676430 and https://review.opendev.org/#/c/678573/05:52
AJaegerinfra-root, we have a Zuul error on https://zuul.opendev.org/t/openstack/config-errors: "philpep/testinfra - undefined (undefined)"05:56
*** kjackal has quit IRC06:03
*** dchen has joined #openstack-infra06:04
*** snecker has joined #openstack-infra06:04
openstackgerritOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/68029806:16
openstackgerritMerged openstack/project-config master: Remove broken notification for starlingx  https://review.opendev.org/67985006:46
openstackgerritMerged openstack/project-config master: Finish retiring networking-generic-switch-tempest-plugin  https://review.opendev.org/67974306:48
openstackgerritMerged openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/68029807:38
openstackgerritFabien Boucher proposed zuul/zuul master: Pagure - handle initial comment change event  https://review.opendev.org/68031008:01
amotokihi, when I open https://zuul.opendev.org/t/openstack/build/91b360daabc8453dba13129f78aca17b, I see an error "Payment Required   Access was denied for financial reasons." when trying to open the "Artifacts" links.08:50
amotokiis it a known issue?08:50
frickleramotoki: yes, see notice from yesterday. OVH is working on fixing this, you'll need to recheck to get new logs if you need to see them earlier08:56
fricklerclarkb: worlddump seems to be working fine, the errors that are logged are expected. we just fail to collect its output :(. I didn't find the reference to sync-devstack-data not working on subnodes, does someone have logs for that08:57
amotokifrickler: thanks. I missed that.08:57
*** noama has quit IRC08:58
fricklerAJaeger: infra-root: that looks more like an issue with the github api, the repo itself seems to be in place for me: "404 Client Error: Not Found for url: https://api.github.com/installations/1549290/access_tokens"09:00
ianwyeah i'd note that only recently we enabled the zuul ci side of things there09:24
ianwwe've got the system-config job running on it, but it has always failed; i need to look into it.  should be back tomorrow09:24
ianw(it being testinfra)09:24
*** gfidente has joined #openstack-infra09:29
*** yamamoto has quit IRC09:35
*** xenos76 has joined #openstack-infra09:36
fricklerianw: when did "only recently" happen and where? it seems we have patches in the check queue waiting for nodes since 14h, could that be related? e.g. https://review.opendev.org/68015809:38
*** yamamoto has joined #openstack-infra09:40
*** jamesmcarthur has joined #openstack-infra09:40
fricklerseems we have a high rate of "deleting nodes" in grafana since 21:00 which would match that timeframe09:40
fricklerseeing quite a lot of these in nodepool-launcher.log on nl01: 2019-09-05 09:41:24,333 INFO nodepool.driver.NodeRequestHandler[nl01-21267-PoolWorker.rax-iad-main]: Not enough quota remaining to satisfy request 300-000508906009:42
frickleralso ram quota errors like http://paste.openstack.org/show/771269/09:44
*** jamesmcarthur has quit IRC09:45
fricklerhmm, this one looks like rackspace doesn't even let us use all of the quota. or can this happen with multiple launch requests in parallel? Quota exceeded for instances: Requested 1, but already used 220 of 222 instances09:46
*** bhavikdbavishi has quit IRC09:52
fricklerinfra-root: unless someone has a better idea, I'd suggest to try and gently lower the quota we use on rackspace in an attempt to reduce the ongoing churn10:07
sean-k-mooneyclarkb: thanks for looking into the ca issue. for now I'll stick with turning off the tls_proxy until after m3 but assuming moving sync-devstack-data resolves the issue then we can obviously turn it back on.10:09
*** udesale has joined #openstack-infra10:20
*** udesale has quit IRC10:28
*** udesale has joined #openstack-infra10:29
ianwfrickler: i'd have to look back, but a few weeks.  but i think maybe we might need to restart zuul to pick up the changes as it wasn't authorized to access the testinfra github project when it was launched.  i *think* there may be some bits that don't actively reload themselves10:29
ianwsorry no idea on the quota bits10:29
AJaegerconfig-core, please review https://review.opendev.org/680240 https://review.opendev.org/680239 https://review.opendev.org/676430 , https://review.opendev.org/678573 and https://review.opendev.org/67965210:56
AJaegerinfra-root, I agree, something is wrong with our nodes - we have a requirements change waiting since 14 hours for a node for example...11:53
*** yamamoto has quit IRC11:54
ShrewsAJaeger: That may not necessarily be something wrong with nodepool. If zuul gets hung up on something, it probably already has the nodes it has requested, it just isn't progressing and using them. That's what we've seen most in these cases. Do you have a node request number for the change in question?11:57
AJaegerShrews: see backscroll and comments by frickler - I have no further information and I'm not an admin.11:58
Shrewsfrickler: ^^^11:58
AJaegerShrews: example change https://review.opendev.org/680107 in http://zuul.opendev.org/t/openstack/status11:59
AJaeger"Not enough quota remaining to satisfy request 300-0005089060" - is that what you need, Shrews ?11:59
AJaegerThat is from backscroll12:00
*** yamamoto has joined #openstack-infra12:00
*** jamesmcarthur has joined #openstack-infra12:00
AJaegerShrews: and http://paste.openstack.org/show/771269/ has details12:00
*** roman_g has joined #openstack-infra12:01
Shrews2019-09-05 09:46:34,399 DEBUG nodepool.driver.NodeRequestHandler[nl01-21267-PoolWorker.rax-iad-main]: Fulfilled node request 300-000508906012:01
Shrewsthat request was satisfied hours ago. if the change for that is not doing anything still, that points to a zuul problem12:01
*** markvoelker has joined #openstack-infra12:02
AJaegerI don't know - frickler, do you? ^12:02
Shrewsand if zuul is holding requested nodes longer than it needs to, it just puts excessive pressure on the entire system (causing the quota issues mentioned)12:03
fricklerShrews: it's well possible that zuul would need a restart, possibly related to what ianw mentioned, but I'm not going to touch it myself12:04
ShrewsAJaeger: frickler: looks like the node used in that request was deleted not long after the request was fulfilled12:04
ShrewsAJaeger: frickler: from what I can tell, nodepool never even received that request until 2019-09-05 09:41:14, so it processed it rather quickly.12:15
*** yamamoto has quit IRC12:16
fricklerwe do seem to be slowly progressing, not completely stuck. maybe indeed all is well except that we are running at or above capacity12:18
AJaegerwe have 8 hold nodes - do we still need those?12:19
Shrewsoh, that node request does not even belong to the referenced change12:19
fricklercorvus: pabelanger: mnaser: ianw: mordred: do you still need your held nodes? all of them > 10 days old12:20
pabelangerfrickler: nope, that can be deleted12:21
AJaegerwe have 50 nodes deleting in limestone: http://grafana.openstack.org/d/WFOSH5Siz/nodepool-limestone?orgId=112:22
AJaegerand 201 in ovh: http://grafana.openstack.org/d/BhcSH5Iiz/nodepool-ovh?orgId=112:22
AJaegerbut 0 building. frickler, that might be the problem with the deleting you noticed ^12:22
haleybclarkb: do you still have questions on accept_ra sysctls in the dhcp namespace?  I can't tell from scrollback.  Don't think that should be an issue, but i'd be happy to look into it12:22
pabelangerAJaeger: looks like outages12:23
AJaegerso, that looks like the billing issue with ovh -12:23
AJaegerI'm aware of a swift billing issue, didn't know it affected nodes as well12:23
AJaegeris nodepool getting confused with trying to delete 200 ovh nodes and not succeeding?12:24
*** yamamoto has joined #openstack-infra12:26
*** ociuhandu has joined #openstack-infra12:27
logan-limestone has the host aggregate disabled cloud-side as a precautionary measure while we're ironing out an outage with an upstream carrier. we've disabled the host aggregate this way in the past and it didn't have adverse effects with nodepool, but I wonder if maybe there's a scale issue with us + the ovh nodes?12:28
*** ociuhandu has quit IRC12:32
*** eharney has joined #openstack-infra12:41
fricklerindeed this is what nl04 sees from ovh-gra1: keystoneauth1.exceptions.http.Unauthorized: The account is disabled for user: 6b66bafa4e214d5ab62928c8d7372b2b. (HTTP 401) (Request-ID: req-3e92a20a-058e-4ff6-a1f0-745f341bb8fa)12:42
fricklergoing to disable that zone12:42
*** yamamoto has quit IRC12:43
openstackgerritJens Harbott (frickler) proposed openstack/project-config master: Disable OVH in nodepool due to accounting issues  https://review.opendev.org/68039712:45
fricklerinfra-root: ^^ we might need to force-merge if checks take too long12:45
*** rh-jelabarre has quit IRC12:46
AJaegerfrickler: so, is that really a problem for nodepool? Doesn't it handle that error properly?12:47
AJaegerfrickler: changes ahead of it in the queue got nodes assigned in 2-5 minutes, so shouldn't take too long12:48
pabelangerI'd be surprised if nodepool is having an issue because of it12:49
pabelangerbut, could be possible12:49
amorinhey clarkb12:49
pabelangerin the past, provider outages haven't had an impact on nodepool12:49
amorinhey all12:50
amorinis the tenant on ovh still down?12:51
amorinthe team is supposed to enable it back this morning (French time)12:51
pabelangeramorin: yes, we are just disabling it now12:51
pabelangermight still be related to accounting?12:51
frickleramorin: at least it was 10 minutes ago12:51
amorinpabelanger, frickler  spawn still not possible?12:52
frickleramorin: keystone say account 6b66bafa4e214d5ab62928c8d7372b2b is disabled12:52
pabelangeramorin: correct: http://grafana.openstack.org/dashboard/db/nodepool-ovh12:52
AJaegeramorin, and swift is still asking for payment12:53
amorinchecking12:53
amorinI have a guy on phone call checking ATM12:53
AJaegerthanks, amorin . Much appreciated!12:54
*** Goneri has joined #openstack-infra12:55
amorinfrickler: what is 6b66bafa4e214d5ab62928c8d7372b2b ?12:55
amorinits not a tenant?12:55
amorinwhat we have is: dcaab5e32b234d56b626f72581e3644c12:56
frickleramorin: user id probably12:56
amorinok12:56
amorinyes user is disabled12:56
amorinwe are checking12:56
*** bhavikdbavishi has quit IRC12:56
*** yamamoto has joined #openstack-infra12:58
AJaegerfrickler, pabelanger , I think we should not merge the ovh disable change while amorin is on it- do you agree and want to WIP it?12:58
fricklerAJaeger: ack, done12:59
*** jpena|lunch is now known as jpena13:01
amorinfrickler: AJaeger pabelanger users are ok now13:04
amorincould you check?13:04
frickleramorin: I don't see any change yet13:07
amorinok, I still have some tasks ongoing, wait a minute13:07
*** jchhatba_ has joined #openstack-infra13:08
amorinfrickler: could you test again?13:09
frickleramorin: yep, looks better now13:09
amorinyay13:10
*** pcaruana has quit IRC13:10
*** jchhatbar has quit IRC13:11
frickleramorin: so at least I can manually list servers, our nodepool service seems to have a bit of a hiccup. thanks anyway for fixing this13:11
frickleramorin: the swift issue still seems to persist, though, I'm assuming that will take longer to handle?13:12
amorinah, it's supposed to fix also13:13
amorinchecking13:13
rledisezfrickler: there is a few minutes of cache so now that the tenant is enabled, it should be ok in a few minutes13:13
amorinthanks rledisez13:13
rledisezfrickler: is it the same tenant ?13:13
rledisezjust to double check13:13
amorinrledisez: the user is on the same tenant13:13
fricklerrledisez: amorin: indeed, after another reload this seems to work now, too. thx alot13:14
rledisezfrickler: perfect. let us know if you hit other issues13:14
AJaegeryeah, swift works - thanks amorin and rledisez !13:14
amorinsorry for that13:15
AJaegerfrickler: nodepool is not yet using ovh according to http://grafana.openstack.org/d/BhcSH5Iiz/nodepool-ovh?orgId=1 ?13:15
*** eernst has quit IRC13:16
fricklerAJaeger: nl04 seems to be recovering from suddenly getting useful responses, I'd tend to wait a bit and see what happens13:16
AJaegerok13:17
AJaegerfrickler: want to abandon https://review.opendev.org/#/c/680397/ now?13:18
*** ociuhandu has joined #openstack-infra13:18
fricklerAJaeger: done13:21
*** bhavikdbavishi has quit IRC13:28
AJaegerfrickler, pabelanger: Did either of you remove pabelanger's hold nodes?13:29
pabelangerI have not, I can do in a bit13:30
pabelangeror somebody else can, if they'd like13:30
mnaseri dont need my holds13:30
mnasersorry i didnt update13:31
*** rosmaita has joined #openstack-infra13:32
corvusfrickler: i deleted my nodes, thanks for the reminder13:33
corvusfrickler, AJaeger: i'd like to restart the executors, but that's going to create a bit of nodepool churn; do you think we're stable enough for that now, or should we defer that for a few hours?13:35
fungii am around today, modulo several hours of meetings starting shortly, but am declaring bankruptcy on the last 36 hours of irc scrollback in here13:35
corvusfungi: glad to see you!13:36
fungithanks! glad to be high and dry here in the mountains for the next several days13:36
*** ginopc has quit IRC13:37
smcginnisGood to hear you are safe and dry.13:38
*** pcaruana has joined #openstack-infra13:39
AJaegercorvus: we just enabled ovh again - but they are not picking up nodes yet according to grafana. Let's wait until ovh is healthy13:39
AJaegercorvus: so, my advice: Let's figure out ovh - and then restart. We might need to restart nl04 as well...13:41
*** smarcet has joined #openstack-infra13:42
corvusAJaeger: ack; i'll take a look at the nl04 logs13:42
corvusAJaeger, frickler, Shrews: since frickler said that new commands work, but nothing is currently working on nl04, i wonder if there's something cached in the keystone session that's blocking us... how about i go ahead and restart nl04 and we can maybe learn whether that's the case?13:44
*** jchhatba_ has quit IRC13:44
corvusactually... new theory:13:45
corvusi'm only seeing logs about deleting servers13:45
corvusfrickler: you said you were able to list servers -- do we still have a lot of servers in the account?13:46
AJaegerhttp://grafana.openstack.org/d/BhcSH5Iiz/nodepool-ovh?orgId=1&refresh=1m is confusing, it shows deleting at 013:46
AJaegerbut if you look further down at "Test Node History", it still shows them as deleting13:47
corvusi get a grafana error for that dashboard13:47
*** bnemec has quit IRC13:48
AJaegerdoes http://grafana.openstack.org/d/BhcSH5Iiz/nodepool-ovh  work?13:48
corvus(or maybe a graphite error)13:48
AJaegerworks for me...13:48
*** bnemec has joined #openstack-infra13:48
corvusnope, i get http://paste.openstack.org/13:48
corvusgrr13:48
corvushttp://paste.openstack.org/show/771445/13:48
AJaeger;/13:49
* AJaeger has to step out now13:49
corvusi get that for every nodepool dashboard13:49
corvusoh.  privacy badger :)13:51
*** rh-jelabarre has joined #openstack-infra13:51
corvusone really has to get used to debugging the internet if using pb13:52
corvusi've triggered the debug handler, and the objgraph part of that is taking a long time13:56
corvusprobably because we do in fact have a memory leak, and it has to page everything in13:56
Shrewscorvus: for zuul or the launcher?13:57
corvusnl0413:57
corvusor...13:57
corvusis it possible the debug handler crashed?13:58
fungicorvus: yep, a while back i discovered i needed to okay grafana.openstack.org embedding images from graphite.openstack.org (again, i think it's the .openstack.org cookies causing it for me)13:58
corvuscause the program seems to be running again now, with no iowait/swapping.  but sigusr2 doesn't work anymore, and i still never saw anything after 2019-09-05 13:53:09,563 DEBUG nodepool.stack_dump: Most common types:13:58
fungier, graphite.openDEV.org13:59
fungiso maybe not the domain cookie13:59
*** smarcet has quit IRC13:59
corvusShrews: these are the stack traces i got from the pool workers: http://paste.openstack.org/show/771446/ http://paste.openstack.org/show/771447/14:00
corvusi can't tell if they're stuck there in a deadlock or not (since i can't run the handler twice)14:00
corvusi think we've got all we're going to from nl04 and we should restart it now14:00
corvus(some NodeDeleter threads also have similar tracebacks)14:01
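For context, the debug handler being poked here follows a common pattern: a SIGUSR2 handler that logs every thread's current stack and then an objgraph summary (the "Most common types:" part, which is what can stall for a long time on a large, partly swapped-out heap). A generic sketch of that pattern, not nodepool's actual code:

    import signal
    import sys
    import threading
    import traceback

    def dump_state(signum, frame):
        names = {t.ident: t.name for t in threading.enumerate()}
        for ident, stack in sys._current_frames().items():
            print(f"Thread: {names.get(ident, ident)}")
            print("".join(traceback.format_stack(stack)))
        try:
            import objgraph  # optional dependency; walking the heap is the slow part
            print("Most common types:")
            objgraph.show_most_common_types(limit=25)
        except ImportError:
            pass

    signal.signal(signal.SIGUSR2, dump_state)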
corvusShrews: any objection to a restart, or should i go for it?14:01
Shrewscorvus: go for it. i suspect changing something with the user auth caused some wonkiness within keystoneauth14:02
Shrewsbut those stack traces seem... off14:03
corvusyeah, i'm sure at least one dependency has changed from under us14:03
corvusmaybe even nodepool itself14:04
*** ociuhandu has quit IRC14:04
corvusokay, it's doing stuff now.  running into a lot of quota issues, but hopefully that will even out14:04
corvus(as the old, improperly accounted-for nodes are deleted)14:05
*** lpetrut has quit IRC14:05
corvusand there are nodes in-use now14:05
corvusso looks like things are okay again14:06
*** pcaruana has quit IRC14:12
*** yamamoto has quit IRC14:16
*** yamamoto has joined #openstack-infra14:16
johnsomAny pointers on how I can debug "RETRY_LIMIT" job failures? There appear to be no logs: https://zuul.opendev.org/t/openstack/build/c612b1c80c0c4dd49d39706d2a6cc4a514:19
corvusjohnsom: you might be able to recheck the change and then follow the streaming console log when it starts running14:25
*** yamamoto has joined #openstack-infra14:25
*** ociuhandu has quit IRC14:25
corvuswhen the release queue clears i'm going to restart the executors14:25
*** smarcet has quit IRC14:29
*** lpetrut has joined #openstack-infra14:29
corvus#status log restarted nl04 to deal with apparently stuck keystone session after ovh auth fixed14:32
openstackstatuscorvus: finished logging14:32
corvus#status log restarted zuul executors with commit cfe6a7b985125325605ef192b2de5fe1986ef56914:32
openstackstatuscorvus: finished logging14:32
fricklercorvus: sorry, was afk for a bit. when I checked there were only a few servers listed, 3 in one region and about 10 in the other14:42
fricklercorvus: there are also two ancient puppet apply processes on nl04, shall we just kill those?14:42
corvusfrickler: great, i think that matches the observed behavior14:43
corvusfrickler: ++14:43
*** jamesmcarthur has quit IRC14:43
fricklercorvus: finally, did you look at the config error? ianw mentioned that this might also need a restart to resolve http://zuul.openstack.org/config-errors14:43
corvusfrickler: has our app been installed in that repo?14:44
corvus(or was it installed after we reconfigured to add the repo?)14:45
fricklercorvus: I have no idea. the repo seems to have been in the config for some months if I checked correctly14:45
*** mattw4 has joined #openstack-infra14:46
*** udesale has quit IRC14:46
*** lpetrut has quit IRC14:47
fricklerhttps://review.opendev.org/65746114:47
*** udesale has joined #openstack-infra14:47
*** gyee has joined #openstack-infra14:50
corvusfrickler: let's check with ianw14:51
corvusfrickler: but if it's a transient error, then we should be able to fix it with a full reconfiguration14:51
donnydcan someone take a look at the nodepool logs for FN. I am having an issue with the new labels I created for NUMA15:00
donnydIt would be super helpful to know what nodepool is throwing as an error15:01
*** jamesmcarthur has joined #openstack-infra15:02
*** e0ne has quit IRC15:05
Shrewsdonnyd: all i'm seeing are quota issues15:07
donnydOk, i will just pull the quota entirely15:07
Shrewsdonnyd: what are the new labels?15:08
donnydhttps://github.com/openstack/project-config/blob/master/nodepool/nl02.openstack.org.yaml#L348-L35715:08
donnydmulti-numa-ubuntu-bionic-expanded15:08
donnydmulti-numa-ubuntu-bionic15:08
donnydthose are the ones i am concerned with15:09
Shrewsdonnyd: this is the most recent reference to one of those: http://paste.openstack.org/show/771452/15:09
donnydthanks Shrews15:10
donnydthat is what i needed15:10
clarkbcorvus: fungi can I get review on https://review.opendev.org/#/c/680239/ that is next step for swift container sharding15:11
clarkbif tests look good for ^ I'll push up a change for base-minimal and base15:11
clarkbalso is there a change to add ovh back to those jobs?15:11
corvusclarkb: no change i'm aware of15:12
donnydalso I am not sure how up to date the dashboard is... but it says it's been deleting 29 instances for a while...15:14
donnydthey have been gone since about 37 seconds after the request for delete was placed15:14
openstackgerritClark Boylan proposed opendev/base-jobs master: Revert "Stop storing logs in OVH"  https://review.opendev.org/68044615:14
clarkbcorvus: ^ for OVH15:14
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Add build shard log path docs to upload-logs(-swift)  https://review.opendev.org/68024015:27
*** jamesmcarthur has quit IRC15:27
clarkbinfra-root https://review.opendev.org/#/c/680236/ bumps mirror TTLs up to an hour which was something we noticed with johnsom's dns debugging last night15:29
clarkbslaweq: if you have time today it would be great if you could look at these jobs logs from johnsom to double check that neutron isn't breaking host ipv6 networking which results in broken dns15:29
clarkbslaweq: at around the same time that dns stops working syslog records a bunch of changes by neutron in iptables, ebtables, ovs, and sysctl15:30
slaweqclarkb: sorry, I'm on meeting now and will go offline just after it15:31
slaweqclarkb: I can look at it tomorrow morning15:31
johnsomRepaste: An ironic job example: https://ae2e20a98890c5458e58-1659b40e5c03e7f989419d6178d67ae8.ssl.cf5.rackcdn.com/679332/3/check/ironic-standalone-ipa-src/50285fa/job-output.txt15:31
slaweqis it this link https://github.com/openstack/neutron/commit/f21c7e2851bc99b424bdc5322dcd0e3dee7ee5a3 ?15:31
*** jamesmcarthur has joined #openstack-infra15:31
*** ociuhandu has joined #openstack-infra15:31
johnsomrepaste: an octavia log example: https://79636ab1f0eaf6899dc0-686811860e75a35eabcecb5697905253.ssl.cf2.rackcdn.com/665029/30/check/octavia-v2-dsvm-scenario/51891fa/job-output.txt15:32
clarkbslaweq: the one that johnsom just pasted. Job ends up failing because dns stops working15:32
clarkbslaweq: cross checking timestamps in the unbound log and in syslog I see dns stops working about when neutron is making networking changes15:32
clarkbthere are a lot of moving parts there so I ended up getting lost but figured someone that knows neutron would be able to get through it more easily15:32
johnsomclarkb Any advice on this? https://zuul.opendev.org/t/openstack/build/c612b1c80c0c4dd49d39706d2a6cc4a515:33
fungiclarkb: 680236 needs a serial increase15:33
slaweqclarkb: I will look into it later today or tomorrow morning15:33
clarkbfungi: oh right15:33
clarkbslaweq: thank you15:33
openstackgerritClark Boylan proposed opendev/zone-opendev.org master: Increase mirror ttls to an hour  https://review.opendev.org/68023615:34
clarkbfungi: ^ done and thank you15:34
clarkbjohnsom: did you see the suggestion of watching the job? typically that means it is failing badly enough that we lose network connectivity and fail to copy logs15:35
clarkbjohnsom: but if you watch the console log that is live streamed so you'll get the data before networking goes away15:35
johnsomclarkb I did, I'm just not sure how to guess which job is going to have the RETRY_LIMIT, it seems to move around.15:36
clarkbis it specific to that change?15:36
johnsomno15:36
*** ociuhandu has quit IRC15:36
johnsomI will try to catch one, just hoped there was a better way15:37
clarkbNot really. We could try to set up holds but if host networking is getting nuked we won't be able to login and debug with a hold15:37
clarkbwe can check the executor logs to see if there are any crumbs of info there but other than that catching one live is probably our best bet15:38
clarkbjohnsom: could it be nested virt causing crashes?15:38
johnsomclarkb We disabled that over a year ago15:38
fungirules that out then i guess15:38
johnsomclarkb Due to the nodepool image kernel issue at ovh15:38
johnsomThis just started around the 3rd15:39
*** smarcet has left #openstack-infra15:39
clarkbjohnsom: http://paste.openstack.org/show/771454/ as suspected it appears to be crashing due to connectivity issues15:42
clarkbduring tempest15:42
clarkbif you catch a live one you'll probably be able to narrow it down to the test that ran prior15:43
clarkbmaybe look for changes in your tempest tests the last few days15:44
johnsomI did, we haven't changed them since the 16th15:45
clarkbother things it could be (but usually this is more widespread) zuul OOMing and losing its ZK Locks on nodes15:45
* clarkb checks cacti15:46
clarkbhttp://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all seems ok15:46
clarkbshould I be concerned that the opendev base jobs updates seem to just be sitting in zuul http://zuul.opendev.org/t/opendev/status ?16:00
johnsomI hate to say it, but if you look at the main zuul board, jobs are backed up 18+ hours and other projects are seeing the retries as well.16:05
haleybjohnsom: so does the job failure happen at the same place every time?  looks like it's when it's going to setup diskimage-builder from the link above16:05
openstackgerritPaul Belanger proposed zuul/zuul master: Remove support for ansible 2.5  https://review.opendev.org/65043116:05
openstackgerritPaul Belanger proposed zuul/zuul master: Switch ansible_default to 2.8  https://review.opendev.org/67669516:05
openstackgerritPaul Belanger proposed zuul/zuul master: WIP: Support Ansible 2.9  https://review.opendev.org/67485416:05
clarkbjohnsom: ya I'm wondering if this is executor restart fallout16:05
johnsomThe top job for cinder has more than an handful doing "2. attempt"16:05
clarkbthey aren't running16:05
clarkbjohnsom: ya because executors were stopped16:06
clarkbcorvus: ^ fyi I don't think they came back up again16:06
corvusclarkb: oh, well that makes me sad16:06
*** irclogbot_2 has quit IRC16:07
johnsomhaleyb Hi there. The common "timing" is both jobs install additional packages later in the devstack setup.16:07
*** irclogbot_2 has joined #openstack-infra16:07
clarkbjohnsom: haleyb ya one theory I had was most jobs don't have a problem because dns gets broken after they are done doing external work16:08
haleybi just don't see where any of the neutron agents are actively doing anything that would cause it16:08
clarkbjohnsom: haleyb but if projects like octavia and ironic do extra after neutron comes up and sets up its networking that could explain why they see it16:08
johnsomhaleyb When they attempt to go out and get those, the DNS queries fail. This only occurs when the DNS is using IPv6.16:08
clarkbhaleyb: check syslog16:08
slaweqclarkb: I don't see anything in neutron that could break this IPv616:08
slaweqclarkb: but I have to leave now, sorry16:09
clarkbhaleyb: there is a bunch of ovs, iptables, ebtables and such right around the same time period (starting at the same second and continuing after)16:09
clarkbcorvus: anything I can do to help?16:09
corvusclarkb: i restarted them.  was just a systemd timeout16:09
clarkbah16:09
clarkbjohnsom: haleyb I do still think fixing worlddump will help quite a bit16:11
clarkbsince that should give us the networking state after things break16:11
clarkband we can work backward from there if necessary (or it will say everything is fine)16:12
haleybclarkb: the iptables, etc calls are after the failure from what i see - 2019-09-04 22:06:01.847407 is dns failure, Sep 04 22:06:05 is first "ip netns exec..." call, so not sure it's the culprit yet16:14
clarkbhaleyb: in the agent log that starts earlier (I don't know why there is a mismatch in logs timestamps. maybe one is at start and the other is at completion?)16:15
*** xinranwang has quit IRC16:15
*** mattw4 has joined #openstack-infra16:15
clarkbbut ya we are running worlddump and it does nothing, that should be fixed as it exists specifically to debug these problems16:16
haleybi've seen dns failure locally before, typically when "stack.sh" moves IPs to the OVS bridge, sometimes networkmanager trips over itself, but i just work around that by hardcoding dns servers in resolv.conf16:16
clarkbwe hardcode them in the unbound config then hardcode resolv.conf to resolve against the local unbound16:16
clarkb(it is using the correct servers according to the unbound log, they aren't reachable though)16:16
haleybit would be good to get a snapshot of what the system looks like right before this, like 'ip -6 a' and 'ip -6 r', etc, since i'm assuming we can't get in to look around when it fails?16:19
clarkbhaleyb: we can, that is what I'm trying to say. Worlddump is the tool we have for this16:23
clarkbits broken16:23
clarkbsomeone needs to fix it16:23
*** e0ne has joined #openstack-infra16:23
clarkbI'm working on a quick change to devstack to see if we can identify where it is broken16:24
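In the meantime, the state haleyb asked for can be grabbed with a few commands written into a directory the job already collects; a minimal stand-in for worlddump (the paths and command list here are assumptions, not devstack's actual tool):

    import pathlib
    import subprocess

    OUT = pathlib.Path("/opt/stack/logs")  # assumed log directory; use whatever the job uploads
    CMDS = {
        "ip6-addr.txt": ["ip", "-6", "addr"],
        "ip6-route.txt": ["ip", "-6", "route"],
        "resolv-conf.txt": ["cat", "/etc/resolv.conf"],
        "ip6tables.txt": ["sudo", "ip6tables-save"],
    }

    OUT.mkdir(parents=True, exist_ok=True)
    for name, cmd in CMDS.items():
        r = subprocess.run(cmd, capture_output=True, text=True)
        (OUT / name).write_text(r.stdout + r.stderr)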
*** smarcet has quit IRC16:24
openstackgerritMerged opendev/base-jobs master: We need to set the build shard var in post playbook too  https://review.opendev.org/68023916:25
openstackgerritMerged opendev/base-jobs master: Revert "Stop storing logs in OVH"  https://review.opendev.org/68044616:25
*** yamamoto has joined #openstack-infra16:26
clarkbhaleyb: remote:   https://review.opendev.org/680458 DO NOT MERGE debugging why worlddump logs are not collected16:27
AJaegerconfig-core, please review https://review.opendev.org/679856 and https://review.opendev.org/67965216:27
haleybclarkb: ack, i'll keep an eye on that, and look around some more16:27
openstackgerritJimmy McArthur proposed zuul/zuul-website master: Update to page titles and Users  https://review.opendev.org/68045916:31
*** e0ne has quit IRC16:31
johnsomclarkb After that restart it looks like our jobs aren't RETRY_LIMIT anymore. They seem to have made it to devstack.16:32
*** chandankumar is now known as raukadah16:33
*** smarcet has joined #openstack-infra16:39
openstackgerritJimmy McArthur proposed zuul/zuul-website master: CSS fix for ul/li in FAQ  https://review.opendev.org/68046516:41
openstackgerritMerged zuul/zuul-jobs master: Add build shard log path docs to upload-logs(-swift)  https://review.opendev.org/68024016:41
*** zigo has quit IRC16:43
*** ramishra has quit IRC16:45
openstackgerritMerged openstack/project-config master: New charm for cinder integration with Purestorage  https://review.opendev.org/67965216:46
clarkbAJaeger: ^ there is one of them16:46
AJaegerthanks, clarkb16:48
*** spsurya has quit IRC16:48
*** igordc has joined #openstack-infra16:49
zbrclarkb: AJaeger any chance to persuade you about the emit-job-header improvement? https://review.opendev.org/#/c/677971/16:49
openstackgerritJimmy McArthur proposed zuul/zuul-website master: CSS fix for ul/li in FAQ  https://review.opendev.org/68046516:49
*** ullbeking has joined #openstack-infra16:50
*** jaosorior has quit IRC16:51
AJaegerzbr: you add molecule files, I do not see where those are used - and I agree with pabelanger that this is in the inventory already... So, I'm torn whether this is really beneficial...16:54
*** smarcet has quit IRC16:54
openstackgerritJimmy McArthur proposed zuul/zuul-website master: CSS fix for ul/li in FAQ  https://review.opendev.org/68046516:55
zbrAJaeger: molecule files enable a developer to do local testing (did not want to force you to use that tool). Re the comment, users already remarked that the inventory is not available until the job finishes and also is not visible in the console log. Do we worry about adding a few bytes to the console?16:57
zbrarguably even the summary page does not include any information about ansible version used, or which python interpreter was used.16:58
AJaegerI'm missing the relevance, so will abstain here. Also, adding the molecule files is something we should discuss on #zuul.16:58
clarkba few extra bytes is unlikely to be a problem when compared to the vast quantity of log output by jobs16:59
zbryep, i am for reducing logs size, but starting where it matters.17:00
AJaegermnaser: we merged the swift log changes that clarkb wrote, is your maintenance over and do you want us to use mtl1 again? your change https://review.opendev.org/678440 is still WIP17:00
AJaegerbbl17:00
*** rosmaita has left #openstack-infra17:01
clarkbAJaeger: note its only for the base-test so far and I'm testing that it works as we speak (waiting on test results)17:01
zbrregarding inclusion of test files, i am curious because we already do allow inclusion of stuff that is purely used for local testing, like the `venv` section in tox.ini files, or some other sections which are not run on CI.17:01
clarkbAJaeger: mnaser but ya I'd love to push up a change that enables the build uuid sharding globally today and add vexxhost back to the mix if that is possible17:01
clarkbmnaser: ^ let us know17:01
*** rh-jelabarre has quit IRC17:01
*** ociuhandu has joined #openstack-infra17:01
zbri am not trying to force anyone to adopt the tool, only reason why I included them was because I used them to test the change made to the role.17:02
clarkbzbr: I think the concern is that if we aren't testing it then we'll likely break it via changes made over time17:02
zbrand i do find them handy, but if that is a reason for not merging, i will remove them, it is easy to recreate them.17:02
*** markvoelker has quit IRC17:04
*** igordc has quit IRC17:04
fungizbr: we have historically used the testenv:venv definitions in some jobs, notably when producing release artifacts. at this stage they may be vestigial but i wouldn't wager a guess without additional research17:05
*** tesseract has quit IRC17:06
zbrsure. just put a comment with whatever i need to change, no hard feelings, the only thing really useful was to print python path/version and ansible version (also could prove useful for logstash searches in the future).17:07
*** jpena is now known as jpena|off17:07
*** zigo has joined #openstack-infra17:10
*** smarcet has joined #openstack-infra17:10
clarkbinfra-root: config-core: https://openstack.fortnebula.com:13808/v1/AUTH_e8fd161dc34c421a979a9e6421f823e9/zuul_opendev_logs_141/680178/1/check/tox-py27/1416af4/ the build uuid prefixing seems to work17:10
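A small illustration (not the real upload role) of the sharding visible in that URL: the container name is suffixed with the first three characters of the Zuul build UUID, which spreads builds across up to 4096 containers. The UUID value below is an example only.

BUILD_UUID=1416af4c0ffee                      # example value; real builds use the full UUID
echo "zuul_opendev_logs_${BUILD_UUID:0:3}"    # -> zuul_opendev_logs_141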
*** markvoelker has joined #openstack-infra17:13
*** igordc has joined #openstack-infra17:15
openstackgerritClark Boylan proposed opendev/base-jobs master: Shard build logs with build uuid in all base jobs  https://review.opendev.org/68047617:17
clarkbinfra-root ^ that will take us to production17:17
clarkband if mnaser gives us the go ahead we can enable vexxhost again once that is in?17:17
*** udesale has quit IRC17:18
*** armax has quit IRC17:20
*** jaosorior has joined #openstack-infra17:21
*** larainema has quit IRC17:22
*** smarcet has quit IRC17:25
johnsomclarkb Got one: https://zuul.openstack.org/build/8478f4bf656b412b8c613d19e10b1c2517:25
johnsomhttps://www.irccloud.com/pastebin/5Mb14ZtR/17:25
*** igordc has quit IRC17:26
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Flip the order of the emit-job-header tests  https://review.opendev.org/68047717:28
clarkbjohnsom: ya so something is breaking networking and that is ipv4 (not v6)17:29
*** smarcet has joined #openstack-infra17:29
corvusclarkb: did we delete all the logs_ containers in vexxhost?17:30
clarkbcorvus: I believe mnaser did it for us17:30
clarkbbecause ceph was in a bad way17:30
corvuskk17:31
johnsomI still have that console open if you want me to look at the scrollback. At the surface it looks like ansible was happy before that.17:31
clarkbcorvus: note: I have not double checked that and that would be a good cleanup (we'll want to clean all the logs_ containers in all the providers in ~30 days too)17:31
corvusclarkb: agree; let's just double check it then when we do that17:31
clarkb++17:32
*** ociuhandu_ has joined #openstack-infra17:32
AJaegerclarkb, corvus , change I5af7749fefec61f1e9fe8379266e799184a13807 added minimal retention only to base, but not base-minimal and base-test jobs - could you double check the 1month retention, please? See my comment at 680476...17:33
openstackgerritMerged opendev/zone-opendev.org master: Increase mirror ttls to an hour  https://review.opendev.org/68023617:33
clarkbAJaeger: are you ok with that as a followup?17:33
AJaegerclarkb: it's unrelated to your change ;) I just noticed it when reviewing - so, followup is fine and your change is approved17:34
openstackgerritClark Boylan proposed opendev/base-jobs master: Set container object expiry to 30 days  https://review.opendev.org/68048017:34
clarkbAJaeger: ^ thank you for the reviews17:35
AJaegergreat! thanks17:35
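A hedged sketch of what the 30 day expiry in 680480 means at the swift level (container and object names are examples): uploads carry an X-Delete-After header, in seconds, and swift removes the objects automatically once it elapses.

swift upload zuul_opendev_logs_141 job-output.txt \
    --header "X-Delete-After: $(( 30 * 24 * 60 * 60 ))"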
*** ricolin has quit IRC17:35
*** ociuhandu has quit IRC17:36
*** ociuhandu_ has quit IRC17:36
clarkbjohnsom: looking at that it happens after tempest starts but fairly early17:36
*** smarcet has quit IRC17:36
clarkbjohnsom: that almost implies to me it is either an import time issue (testr will import all the test files to find the tests) or something separate that happens to just line up with that timing wise17:36
johnsomYeah, none of the tests completed. I had four other consoles open at the same time, they all went into the tests.17:37
clarkbjohnsom: unfortunately worlddump won't help us when connectivity has gone away completely17:37
clarkbjohnsom: we probably want to see if it is cloud or region specific as a next step17:40
clarkbjohnsom: if it is then it could also be a provider problem17:41
openstackgerritMerged opendev/base-jobs master: Shard build logs with build uuid in all base jobs  https://review.opendev.org/68047617:41
johnsomThat ssh IP implies RAX IAD17:41
*** smarcet has joined #openstack-infra17:48
clarkboh neat that change merged already17:50
clarkbexciting!17:50
*** jamesmcarthur has quit IRC17:51
clarkbmnaser: when you have a moment your feedback on whether or not these changes make you more comfortable running the log uploads against vexxhost again would be great17:51
mnaserclarkb: i guess the only thing that still feels like a bother is the sheer amount of ara-reports :(17:57
clarkbmnaser: I think corvus may be fiddling with options around taht? dmsimard is also brainstorming fixes here https://etherpad.openstack.org/p/Vz5IzxlWFz17:58
clarkbmnaser: did we see problems with the logs_XY containers in addition to the periodic container?17:58
*** smarcet has quit IRC17:59
corvusi expect to be able to propose a change that removes the top-level ara generation that happens to every job by the end of the week, if that's the way we want to go.  it won't affect nested aras (eg, devstack).  it's not clear to me what the balance between the two is (ie, of the X ara files, what percentage comes from the zuul run versus nested runs)18:00
*** smarcet has joined #openstack-infra18:00
clarkbnote devstack doesn't do a nested ara I don't think18:01
clarkbbut osa tripleo etc all do18:01
*** rh-jelabarre has joined #openstack-infra18:01
*** jamesmcarthur has joined #openstack-infra18:04
*** armax has joined #openstack-infra18:04
*** mattw4 has quit IRC18:04
*** mattw4 has joined #openstack-infra18:05
*** smarcet has quit IRC18:06
*** smarcet has joined #openstack-infra18:07
*** jamesmcarthur has quit IRC18:07
*** diga has quit IRC18:10
*** smarcet has quit IRC18:16
clarkbI've approved the gus2019 change for ssh idle timeout updates18:19
clarkbthat should release the replication disabled on start change too18:20
clarkbonce that gets applied to the gerrit server I'll work on doing some restarts (review-dev first)18:20
*** jamesmcarthur has joined #openstack-infra18:23
paladoxyou may also want to enable change.disablePrivateChanges when you upgrade gerrit.18:24
clarkbpaladox: we effectively disabled them via preventing pushes to refs/for/draft or whatever that meta ref is18:25
clarkbbut I guess drafts went away and there is a new thing now?18:25
paladoxthis is a new thing18:25
paladoxno longer uses refs.18:25
paladoxusers can create an open change and put it as private18:25
paladoxlike wip mode18:25
clarkbah18:25
* paladox enabled that at wikimedia18:25
paladoxas we wanted everything to be open18:26
clarkbya also tends to create confusion when changes depend on a change that has since become private and so on18:26
clarkbhowever it may be useful for security bug fixes18:26
paladoxyup18:26
clarkbwe'll likely have to experiment with it18:26
paladoxi've found that changes are not really as hidden as the feature makes them out to be.18:27
clarkbah18:27
clarkbin that case maybe not great for security bugs :)18:27
paladoxI filed a bug about this somewhere upstream18:27
paladoxhttps://bugs.chromium.org/p/gerrit/issues/detail?id=811118:28
*** kjackal has quit IRC18:28
fungifwiw, that was one of the big problems about the old drafts feature as well, they claimed to be private but they were leaky18:30
*** kjackal has joined #openstack-infra18:31
*** mriedem has quit IRC18:31
*** mriedem has joined #openstack-infra18:33
*** dtantsur is now known as dtantsur|afk18:36
clarkbdid we decide that https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5263-L5267 is the cause of the openstack tenant config error because that repo doesn't have the zuul app on it?18:36
*** igordc has joined #openstack-infra18:36
corvuspaladox: thanks -- added a note to https://etherpad.openstack.org/p/gerrit-upgrade18:37
paladoxyw :)18:37
corvusclarkb: unsure; wanted to check with ianw upon his awakening18:38
clarkbk18:38
paladoxcorvus also native packaging is https://gerrit.googlesource.com/gerrit-installer/ :)18:40
corvuspaladox: yeah, we're going to make our own image builds (mostly so that we have the option to fork if necessary, and can include exactly the plugins we use).  that's working now, i think we just need to tweak some ownership/permissions stuff to match our current server in order to upgrade18:41
*** mattw4 has quit IRC18:41
paladoxah, ok :)18:41
corvusour dockerfile is heavily based on the gerritforce one :)18:41
paladoxheh18:41
corvusgerritforge even18:41
clarkbI want to say the old step 0 plan was to switch to 2.13 via docker18:41
clarkbthen upgrade via docker18:41
corvusclarkb: yeah, i think that's still good18:42
paladoxcorvus upgrading to NoteDB was cool! (dboritz added the notedb.config per my request as we use puppet at wikimedia)18:45
*** pkopec has quit IRC18:47
paladoxAll we did was set https://github.com/wikimedia/puppet/commit/06c8e4122c37508045d84840ac1cb23f4f7d9011#diff-4c58f684fb8a36946bc7616d35570c00 then after the upgrade https://github.com/wikimedia/puppet/commit/d0b08b9675438fe637374a165fdf28c375c3510a#diff-4c58f684fb8a36946bc7616d35570c0018:49
*** eernst has joined #openstack-infra18:51
openstackgerritMerged opendev/puppet-gerrit master: Set a default idle timeout for ssh connections  https://review.opendev.org/67841318:51
paladoxah, nice to see ^^18:54
paladoxwe did similar https://github.com/wikimedia/puppet/commit/cf5c343cc787c46cce2d4d1f91b2ab0c09d3492f#diff-4c58f684fb8a36946bc7616d35570c0018:54
paladox(task is hidden)18:54
*** e0ne has joined #openstack-infra18:54
openstackgerritMerged opendev/puppet-gerrit master: Add support for replicateOnStartup config option  https://review.opendev.org/67848618:55
openstackgerritMerged opendev/system-config master: Don't run replication on gerrit startup  https://review.opendev.org/67848718:55
corvusyeah, that was an amusing moment at the hackathon... most people couldn't understand how our server functioned at all without a timeout set, and we were like "there's a timeout option?"18:55
*** eernst has quit IRC18:56
paladoxcorvus it fixes other issues too :)18:56
paladoxwhich i didn't realise at all could happen18:56
clarkbI'm pretty sure the timeout was set by default in older gerrit18:56
clarkbbecause I remember zuul losing its connection to our idle review-dev server18:56
clarkbI'll watch for those to apply to review-dev and restart it18:57
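Roughly the two knobs just merged above, shown by hand as a sketch (puppet applies them on the real servers; the file paths and the 2d value here are assumptions, not the exact opendev settings):

git config -f /home/gerrit2/review_site/etc/gerrit.config sshd.idleTimeout 2d
git config -f /home/gerrit2/review_site/etc/replication.config gerrit.replicateOnStartup false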
*** eernst has joined #openstack-infra18:58
paladoxcorvus you may also want to enable https://github.com/wikimedia/puppet/commit/0564af76c6067f58d5622c8f81ec36d3793f2ddd#diff-4c58f684fb8a36946bc7616d35570c00 (affects gerrit 2.16+)18:59
openstackgerritJimmy McArthur proposed zuul/zuul-website master: Removing erroneous og images  https://review.opendev.org/68048819:00
*** eernst has quit IRC19:02
*** armstrong has quit IRC19:04
*** e0ne has quit IRC19:07
*** trident has quit IRC19:09
openstackgerritJimmy McArthur proposed zuul/zuul-website master: Replacing OG images with Zuul icon  https://review.opendev.org/68049019:10
*** eernst has joined #openstack-infra19:10
*** e0ne has joined #openstack-infra19:12
*** e0ne has quit IRC19:14
*** eernst has quit IRC19:15
*** e0ne has joined #openstack-infra19:16
*** eernst has joined #openstack-infra19:17
*** e0ne_ has joined #openstack-infra19:18
*** trident has joined #openstack-infra19:20
*** e0ne has quit IRC19:21
*** eernst has quit IRC19:22
*** ralonsoh has quit IRC19:23
openstackgerritJames E. Blair proposed zuul/zuul master: Web: rely on new attributes when determining task failure  https://review.opendev.org/68049819:34
mriedemis it normal to have a patch queued for nearly 20 hours right now?19:34
mriedem(679473,1) Handle VirtDriverNotReady in _cleanup_running_deleted_instances (19h18m/19:34
AJaegermriedem: not normal - we had a cloud failure and needed some restart, so that bit us ;(19:36
AJaegermriedem: so, we have quite a backlog at the moment ;/19:37
corvusthe trend is heading in the right direction: \19:37
*** jamesmcarthur has quit IRC19:37
corvusit was / then - now \  (ascii sparklines)19:37
*** jamesmcarthur has joined #openstack-infra19:38
AJaeger;)19:38
*** rosmaita has joined #openstack-infra19:42
*** jamesmcarthur has quit IRC19:42
*** eernst has joined #openstack-infra19:43
openstackgerritJames E. Blair proposed opendev/base-jobs master: Remove ara from base-test  https://review.opendev.org/68050019:43
openstackgerritJames E. Blair proposed opendev/base-jobs master: Remove ara from base-test  https://review.opendev.org/68050119:43
openstackgerritJames E. Blair proposed opendev/base-jobs master: Remove ara from base  https://review.opendev.org/68050119:44
corvusclarkb, fungi, AJaeger, dmsimard: ^ that's an option i think we can consider now19:45
corvusmnaser: ^19:45
*** eernst has quit IRC19:46
*** eernst has joined #openstack-infra19:47
*** e0ne_ has quit IRC19:49
slittle1now we need additional cores to be added to the ten new repos created for starlingx yesterday19:52
*** e0ne has joined #openstack-infra19:54
slittle1Can I add them directly myself?  Or do I need to go through an admin ?19:55
slittle1hmmm ... I'm not even on the core list.   Guess I need an admin19:55
clarkbwe add the first user then you add the rest19:55
clarkbgive me a minute and I'll add you19:56
slittle1ok, great19:56
clarkbjust have to find the change again to get my list of repos19:57
*** e0ne_ has joined #openstack-infra19:59
*** e0ne has quit IRC20:00
clarkbslittle1: and done. You should be able to self manage the group membership now20:00
slittle1great, thanks20:00
*** eharney has quit IRC20:01
clarkbI have restarted review-dev and it seemed to come up happy20:02
clarkbchecking the replication log there are no entries in it for today20:02
clarkbI think that means the no replication on start flag is working20:03
clarkbI'm going to start a stream events ssh connection and see if it goes away in an hour20:03
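The manual check clarkb describes, roughly: hold an otherwise idle ssh connection open against review-dev and see whether the new idle timeout eventually closes it (the hostname and account below are placeholders).

ssh -p 29418 myaccount@review-dev.opendev.org gerrit stream-events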
*** pkopec has joined #openstack-infra20:03
openstackgerritPaul Belanger proposed zuul/zuul master: WIP: Support Ansible 2.9  https://review.opendev.org/67485420:04
AJaegercorvus: Just checked the zuul UI for a job result - yes, I think we can drop ara20:04
clarkbno objections from me re dropping the root ara. We should send a note to the mailing list if we merge the production change though as people will notice20:07
clarkbbut then maybe we can turn on vexxhost when mnaser is in a spot to keep an eye on his side of things and we monitor it?20:07
paladoxclarkb note that changing replication.config when gerrit is running, effectively breaks replication until gerrit is restarted. (it's a known issue upstream, it bit us at wikimedia)20:08
clarkbpaladox: we've not experienced that (we even have the replication config set to reload the plugin on updates)20:08
clarkband we tested that and have made such changes multiple times20:08
paladoxoh20:08
*** jamesmcarthur has joined #openstack-infra20:09
clarkbcould be the older plugin version is fine?20:09
paladoxit bit us, we didn't realise it was a bug until we investigated furthur.20:09
paladoxpossibly20:09
clarkbI plan on restarting production gerrit today too, though i do need to pop out for a bit now then do phone calls20:09
clarkbwanting to verify the sshd idleTimeout first though20:10
fungiclarkb: i thought what we observed was that if the replication config is modified while gerrit is running, it discards all pending replication events in its queue20:13
fungiand so we noted that maybe we should just disable that "feature" instead20:14
fungi(but i don't recall whether we got around to actually disabling it)20:14
openstackgerritMerged opendev/base-jobs master: Remove ara from base-test  https://review.opendev.org/68050020:16
clarkboh ya that may be20:16
*** e0ne_ has quit IRC20:19
paladoxfungi that's been fixed i think unless it was reverted.20:26
fungiwell, odds are it wasn't fixed back in 2.13 (or we missed picking up a relevant backport)20:29
paladoxfungi https://bugs.chromium.org/p/gerrit/issues/detail?id=1026020:32
*** lpetrut has joined #openstack-infra20:34
openstackgerritPaul Belanger proposed zuul/zuul master: Switch ansible_default to 2.8  https://review.opendev.org/67669520:38
openstackgerritPaul Belanger proposed zuul/zuul master: WIP: Support Ansible 2.9  https://review.opendev.org/67485420:38
*** lpetrut has quit IRC20:38
*** ociuhandu has joined #openstack-infra20:39
*** Goneri has quit IRC20:44
*** ociuhandu has quit IRC20:45
*** ociuhandu has joined #openstack-infra20:46
*** jamesmcarthur has quit IRC20:48
*** iurygregory has quit IRC20:50
*** ociuhandu has quit IRC20:50
*** jamesmcarthur has joined #openstack-infra20:50
clarkbcorvus: what is the next step for https://review.opendev.org/#/c/680501/ do we have a test change for that yet? I guess you can use https://review.opendev.org/#/c/680178/ which I have just rechecked.20:58
clarkbabout now is when I expected review-dev to kill my ssh connection21:02
clarkbit hasn't happened yet.21:03
clarkboh it just happened \o/21:03
clarkbok those two changes are confirmed to be working on review-dev I think21:03
clarkbnext step is restarting gerrit on review.o.o21:03
clarkbare any other roots around? should I just go for it?21:04
clarkbhrm there is one change in the release pipeline I'll go let the release team know21:04
*** markvoelker has quit IRC21:05
*** diablo_rojo has quit IRC21:06
*** diablo_rojo has joined #openstack-infra21:07
clarkbfungi: are you around enough to be a second set of hands/eyeballs if needed for gerrit restart?21:08
*** rh-jelabarre has quit IRC21:08
fungiclarkb: sure, i've just finished up post-election tasks (i hope)21:09
clarkbfungi: maybe you can write up a status notice? and I'll log in and double check configs are in place on that server and proceed with restarting it21:09
*** mattw4 has joined #openstack-infra21:10
clarkbconfigs are in place I'm ready to stop gerrit and start it when you are21:10
*** markvoelker has joined #openstack-infra21:11
clarkbhow about #status notice Gerrit is being restarted to pick up configuration changes. Should be quick. Sorry for the interruption.21:11
fungithat wfm, i had one half-typed21:11
fungibut slow going as i'm not at my usual keyboard21:12
clarkbI'll start asking systemd to do that nicely if you want to send the notice21:12
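"Asking systemd to do that nicely" is roughly the following, assuming the service unit is named gerrit:

sudo systemctl stop gerrit
sudo systemctl start gerrit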
fungi#status notice Gerrit is being restarted to pick up configuration changes. Should be quick. Sorry for the interruption.21:12
openstackstatusfungi: sending notice21:12
-openstackstatus- NOTICE: Gerrit is being restarted to pick up configuration changes. Should be quick. Sorry for the interruption.21:14
clarkbthe log file and systemd think it is running21:14
*** markvoelker has quit IRC21:15
clarkband gerrit queue is largely empty21:15
clarkb(so the replication config appears to have worked here too)21:15
openstackstatusfungi: finished sending notice21:16
clarkbweb ui is up and running for me21:16
openstackgerritJimmy McArthur proposed zuul/zuul-website master: OK, trying again to update with the correct og image tags, removing the erroneous gitlab tags.  https://review.opendev.org/68052021:16
fungiworking for me21:16
clarkbseems like people can push to it too ^ :)21:17
*** pkopec has quit IRC21:17
fungiindeed!21:17
clarkbcontext switching: the ara removal seems to have worked fine. https://33989e35e43da1db0b96-a619ea89024f9935a8230ca8f397a8a1.ssl.cf2.rackcdn.com/680178/1/check/tox-py35/e5c7f47/ maybe want to double check the dashboard renders that once sql reporting happens21:17
clarkbhttps://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f83/680178/1/check/tox-py27/f839885/ as well21:17
openstackgerritJimmy McArthur proposed zuul/zuul-website master: OK, trying again to update with the correct og image tags, removing the erroneous gitlab tags.  https://review.opendev.org/68052021:19
openstackgerritJimmy McArthur proposed zuul/zuul-website master: OK, trying again to update with the correct og image tags, removing the erroneous gitlab tags.  https://review.opendev.org/68052021:20
clarkbcorvus: I +2'd https://review.opendev.org/#/c/680501/2 as well as AJaeger and fungi so I think that can happen as soon as we like21:20
clarkbI can also send email about it to the mailing lists if you would prefer I did that21:20
clarkb(but will wait for the change to be approved before doing that)21:20
*** bdodd_ has joined #openstack-infra21:26
*** bdodd has quit IRC21:27
clarkbfungi: https://review.opendev.org/#/c/680480/ makes base-test and base-minimal match base on their object expiry dates. https://review.opendev.org/#/c/680477/ is a simple order change to make the last emit-job-header url match what will actually be uploaded to21:28
clarkbthose are the last two things I had related to swift and corvus' ara change is the other thing21:28
clarkblooks like corvus just removed his WIP on that change.21:29
*** jamesmcarthur has quit IRC21:29
corvusyep -- though i guess you were thinking of waiting until 680178 reports21:30
clarkbya though I guess I'm not super concerned about that since we can confirm the dashboard works with ara in place too21:31
clarkbat first I was thinking we needed the sql report to confirm that then realized we don't21:31
clarkbsince all jobs with or without ara get that dashboard stuff21:31
corvusyeah, i think the thing to check would be if there's something weird about the file layout without the generated report21:32
corvusit's more of an unknown unknown thing :)21:32
clarkbk happy to wait for that then21:32
*** jcoufal has quit IRC21:32
*** jamesmcarthur has joined #openstack-infra21:50
openstackgerritMerged opendev/base-jobs master: Set container object expiry to 30 days  https://review.opendev.org/68048021:53
*** jamesmcarthur has quit IRC21:55
*** slaweq has quit IRC22:03
openstackgerritMerged opendev/base-jobs master: Remove ara from base  https://review.opendev.org/68050122:06
*** claudiub has joined #openstack-infra22:07
*** AJaeger_ has joined #openstack-infra22:10
*** kjackal has quit IRC22:10
*** slaweq has joined #openstack-infra22:11
*** AJaeger has quit IRC22:14
*** jamesmcarthur has joined #openstack-infra22:16
ianwclarkb: re zuul errors for testinfra @ https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5263-L526722:17
ianwthe project does have the zuul app on it -- but i don't think it did have the app when zuul actually started22:18
clarkbah so if we trigger a full reconfigure that may go away?22:18
ianwmaybe?22:18
clarkbcorvus: on 680178 zuul has decided it needs to run all the tests against that change now?22:19
clarkbodd22:19
pabelangerianw: clarkb: IIRC, you might need to stop / start zuul, in that case.  I've seen that with github, but cannot remember the fix ATM22:20
pabelangerfor now, I always try to make sure github app is installed before adding it to zuul tenant config22:20
*** jamesmcarthur has quit IRC22:21
*** slaweq has quit IRC22:21
corvuswe should try a full reconfiguration.  if that does not fix the problem it's a bug.22:21
corvusa restart should *never* be necessary.22:21
openstackgerritClark Boylan proposed zuul/zuul-website master: Add Zuul FAQ page  https://review.opendev.org/67967022:22
pabelangerianw: have you started working on adding fedora-30 to nodepool-builders? or does that need dib release22:22
ianwpabelanger: i just need to update for that missing gate you mentioned, then i think we can do a dib release22:22
openstackgerritClark Boylan proposed zuul/zuul-website master: CSS fix for ul/li in FAQ  https://review.opendev.org/68046522:23
pabelangerianw: great! will hold off on testing until then22:23
*** eernst has quit IRC22:23
*** jamesmcarthur has joined #openstack-infra22:25
*** prometheanfire has quit IRC22:25
*** prometheanfire has joined #openstack-infra22:26
corvusclarkb: yep, that's pretty weird, it shouldn't be doing that.22:26
openstackgerritIan Wienand proposed openstack/diskimage-builder master: Add fedora-30 testing to gate  https://review.opendev.org/68053122:26
corvusmost of those jobs don't inherit from unittests22:26
ianwclarkb / corvus: there does however seem to be something wrong with running the system-config job against upstream testinfra ... https://zuul.opendev.org/t/openstack/build/a018575d4d4e4710b763978069c9cf12/log/job-output.txt#355922:28
ianw"Cloning into '/opt/system-config'...\nfatal: '/home/zuul/src/opendev.org/opendev/system-config' does not appear to be a git repository\22:28
ianwit doesn't run that often, but i think it's failed like that every time22:28
clarkbianw: you may need to add that repo as a required project on the job22:28
ianwhuh ... yeah, i didn't think of that ... different project22:29
*** jamesmcarthur has quit IRC22:30
corvusyeah, if you look here you can see the command: https://zuul.opendev.org/t/openstack/build/a018575d4d4e4710b763978069c9cf12/console#2/1/2/bridge.openstack.org22:30
corvusit's what clarkb said22:30
ianwhaha, sorry yeah seems obvious now.  here i am thinking that the base jobs have some missing matching or something crazy22:31
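A hedged sketch of the fix ianw pushes next: give the job a required-projects entry so Zuul prepares opendev/system-config on the test node. The job name and snippet file name below are placeholders, not the exact entries in project-config.

cat >> zuul.d/jobs-snippet.yaml <<'EOF'
- job:
    name: testinfra-devel              # placeholder job name
    required-projects:
      - opendev/system-config
EOF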
openstackgerritIan Wienand proposed openstack/project-config master: testinfra : add system-config as required project  https://review.opendev.org/68053422:36
*** kaisers has quit IRC22:37
openstackgerritMerged zuul/zuul-jobs master: Flip the order of the emit-job-header tests  https://review.opendev.org/68047722:37
*** kaisers has joined #openstack-infra22:38
*** eernst has joined #openstack-infra22:43
*** eernst has quit IRC22:44
openstackgerritIan Wienand proposed openstack/project-config master: testinfra : add system-config as required project  https://review.opendev.org/68053422:44
ianw... duur not openstack/system-config ... get with the times :)22:45
*** eernst has joined #openstack-infra22:46
*** igordc has quit IRC22:48
*** mriedem has quit IRC22:50
*** eernst has quit IRC22:51
ianwcorvus / clarkb : so was the conclusion i should SIGHUP zuul?22:52
clarkbianw: or run the zuul-scheduler command that reconfigures it (I think it is a zuul-scheduler command)22:53
clarkbeither way should work22:53
corvuszuul-scheduler full-reconfigure22:53
corvusthat's the future, but sighup should still work for now i think22:53
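For reference, the two ways mentioned here to trigger a full tenant reconfiguration, run on the scheduler host (SIGHUP behaviour depends on the Zuul version):

zuul-scheduler full-reconfigure
# or, on older schedulers:
kill -HUP $(pgrep -f zuul-scheduler)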
*** exsdev has quit IRC22:54
ianwok, just ran full-reconfigure22:54
*** exsdev has joined #openstack-infra22:54
fungihttps://zuul-ci.org/docs/zuul/admin/components.html#operation22:55
fungifor reference22:55
corvusit'll take a while -- it's done when the timestamp on the bottom of the status page updates22:55
clarkbhttps://etherpad.openstack.org/p/ara-removed-from-jobs how does that look for giving people notice of the ara change22:56
corvusclarkb: seems good22:57
fungilooks fine to me22:58
clarkbI'll send that to the zuul airship and starlingx ml too23:01
clarkb(but will send separate emails)23:01
ianwso last reconfigured -> Fri Sep 06 2019 08:59:51 GMT+1000 (Australian Eastern Standard Time)23:02
*** tkajinam has joined #openstack-infra23:02
ianw(that was 2 minutes ago)23:02
corvusianw: so still no joy there.  if a restart fixes it, we have a bug in zuul; if it doesn't, then something something github23:03
corvusit is, however, not a good time for a restart23:03
ianwyeah, just wait for something else to come up i guess.  but it does report (despite the job dependency issue) ... so it is talking to github clearly23:04
clarkbhttps://zuul.opendev.org/t/zuul/build/f83988533a4847b4ad6e7e1948755938/logs has no ara-report https://zuul.opendev.org/t/zuul/build/f83988533a4847b4ad6e7e1948755938/console is happy23:08
clarkbI'll send the email now23:08
*** slaweq has joined #openstack-infra23:11
clarkbmnaser: to catch up, the major changes we've made are to: use an opendev-specific container name prefix so that multiple zuul installs can run against a single ceph install (addressing the global container namespace), suffix container names with the first three characters of the build uuid, sharding all builds into 4096 containers, and remove the top level zuul ara report23:15
clarkbsome jobs will still run ara internally and we haven't changed those, but we did stop creating a report for every job23:16
*** slaweq has quit IRC23:16
openstackgerritMerged openstack/project-config master: testinfra : add system-config as required project  https://review.opendev.org/68053423:21
*** rcernin has joined #openstack-infra23:23
*** threestrands has joined #openstack-infra23:30
ianwcan we trigger rechecks via github comments?23:33
clarkbyes "recheck" should work just like with gerrit23:34
ianwhrm doesn't for testinfra, but perhaps that is related to the config error.  maybe new events work but not update checks?23:36
aspiersIs it possible to configure zuul to fail early if certain jobs fail? The docs suggest that pipeline.failure can do it23:37
aspiersbut as you can tell I haven't a clue about zuul config yet ...23:37
ianwaspiers: this comes up a bit, you can set up dependencies ...23:42
*** dchen has joined #openstack-infra23:42
aspiersI mean, e.g. if the pep8 job fails, then immediately cancel a 2 hour tempest job23:43
aspiersthat kind of thing23:43
aspiersso that CI resources aren't wasted on broken changes23:43
aspiersalthough it would have to be cleverer than that23:43
clarkbwe actually did try this a long time ago23:43
clarkbthe problem is then you end up pushing many patchsets as you work through multiple failures23:44
clarkbrather than getting as complete a picture as possible upfront23:44
aspiersyes, I think there's a trade-off to be had somewhere23:44
clarkbbut zuul does still support that fail fast iirc23:44
clarkb(we never removed the functionality just stopped using it)23:44
aspierse.g. in nova, always run all the shorter running unit/functional test suites, but if any of those fail, then cancel the really slow and expensive tempest / grenade jobs23:45
aspiersI get the impression that 90% of nova failures are caught by the unit/functional tests23:45
aspiersICBW23:45
clarkboh ya for that you have to set up the dependencies23:45
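A hedged sketch of the two mechanisms discussed here, written as a shell heredoc so the Zuul project-pipeline syntax is visible (job names are examples, and fail-fast availability depends on the Zuul version): dependencies make the expensive job wait for the cheap one, and fail-fast cancels the rest of the buildset on the first failure.

cat <<'EOF'
- project:
    check:
      fail-fast: true
      jobs:
        - openstack-tox-pep8
        - tempest-full:
            dependencies:
              - openstack-tox-pep8
EOF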
aspiersthe problem with not cancelling is that the queue gets really backlogged, like it is right now (hence my asking)23:46
aspiersFor example 680296,1 has already failed openstack-tox-pep8 but some other jobs have been running for ALMOST 19 HOURS(!), taking up valuable CI resources23:46
clarkbthey have been in the queue for that long not running for that long23:47
aspiersoh OK23:47
aspiersbut still23:47
aspiersanything waiting in the queue that long is not good23:47
aspiershttps://www.klipfolio.com/blog/cycle-time-software-development23:47
clarkbyes but the problem is due to external factors which have been corrected and we are now waiting to catch back up again23:48
aspiersactually https://codeclimate.com/blog/software-engineering-cycle-time/ is a better read23:48
aspiersohhhh OK23:48
aspiersnow you mention it, I did vaguely notice an IRC announcement about Gerrit earlier23:48
clarkbbut also gate failures have significantly larger impacts on queue times because of the cost of gate resets23:48
aspiersbrain too fried to make the connection ;-)23:48
clarkbthis is why EVERY time this topic comes up I point people to elastic-recheck23:49
clarkband ask people to focus on fixing the bugs in our software23:49
clarkbas step 023:49
aspiersyeah :)23:49
clarkbbecause then we win with better software and shorter queues23:49
clarkbinstead of just shorter queues with just as broken software23:49
aspiersI was assuming the backlog was due to nova being in the last week before feature freeze, hence a flurry of reviews23:49
aspiersThanks a lot for all the info. As usual this channel is like 19 steps ahead of my thoughts ;-)23:51
aspiersBack to writing my first ever tempest tests \o/23:51
*** rcernin is now known as rcernin|brb23:52
clarkbaspiers: the other way to make a large impact is to reduce job runtimes (which is the other thing I've brought up recently with devstack runtimes and OOMing jobs)23:53
clarkbthey get really slow due to lack of memory23:53
aspiersmakes sense23:53
*** jamesmcarthur has joined #openstack-infra23:58
