Monday, 2019-09-09

*** markvoelker has quit IRC00:02
*** Goneri has quit IRC00:06
ianwfedora mirroring timed out.  i'm rerunning on afs01 under localauth ...00:07
ianwlock still held on mirror-update00:08
*** bobh has joined #openstack-infra00:11
*** bobh has quit IRC00:17
*** rcernin has quit IRC00:30
openstackgerritMerged openstack/diskimage-builder master: Add fedora-30 testing to gate  https://review.opendev.org/68053100:32
*** ccamacho has quit IRC00:51
ianwpabelanger: ^ yay, i think that's it for a dib release with f30 support.  will cycle back on it and get it going00:53
johnsomFYI, even a patch with no changes in it (extra carriage return) is failing and retrying. So it's definitely not that one patch.00:59
johnsomhttps://www.irccloud.com/pastebin/LBgx6Mhh/00:59
ianwso it's like 213.32.79.175 just disappeared on you?01:03
johnsomThat is the end of the console stream, then zuul retries it. Sometimes not even one tempest test reports before that happens.01:05
ianwjohnsom: i have debugged such things before, it is a pain :)  in one case we tickled a kernel oops on fedora01:06
johnsomThat ssh isn't part of our test to my knowledge. I think it comes from ansible01:06
ianwjohnsom: i don't think it has been ported to !devstack-gate, but https://opendev.org/openstack/devstack-gate/src/branch/master/devstack-vm-gate-wrap.sh#L446 might be a good thing to try01:06
ianwhttps://opendev.org/openstack/devstack-gate/src/branch/master/functions.sh#L972 is the actual setup01:07
ianwif you have something listening for that, it might let you catch why the host just dies01:07
johnsomHmm, that would be handy actually. I could just hijack the syslog from the nodepool level. I would have to come up with a target. I don't have any cloud instances anymore.01:09
johnsomianw Thanks for the idea!01:09
ianwi should just pull that into a quick role ...01:10
ianwjohnsom: you should be able to pull one with a public address via rdocloud, but i can create one for you if you like01:11
johnsomYeah, I haven't dipped my toe in those waters yet. I should learn the process.01:12
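For reference, the devstack-gate setup linked above boils down to loading the kernel netconsole module pointed at a UDP listener on a host you control (something like `nc -u -l 6666` on the receiver). A rough shell sketch; the receiver address and port are placeholders, and the field parsing below approximates what the devstack-gate function does rather than copying it:

```bash
# Hypothetical receiver that will collect the kernel messages over UDP.
TARGET_IP=203.0.113.10
TARGET_PORT=6666

# Work out the local device, source address and gateway for that target.
ROUTE=$(ip -o route get "$TARGET_IP")
DEV=$(echo "$ROUTE" | sed -n 's/.* dev \([^ ]*\).*/\1/p')
SRC=$(echo "$ROUTE" | sed -n 's/.* src \([^ ]*\).*/\1/p')
GW=$(echo "$ROUTE" | sed -n 's/.* via \([^ ]*\).*/\1/p')
# netconsole needs the MAC address of the next hop.
GW_MAC=$(ip neigh show "$GW" | awk '{print $5; exit}')

# Stream kernel messages to the receiver; run "nc -u -l 6666" there to catch
# them, which can reveal an oops or OOM right before a node goes silent.
sudo modprobe netconsole \
    "netconsole=@${SRC}/${DEV},${TARGET_PORT}@${TARGET_IP}/${GW_MAC}"
```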
*** sgw has quit IRC01:22
*** exsdev has quit IRC01:26
*** exsdev0 has joined #openstack-infra01:26
*** exsdev0 is now known as exsdev01:27
*** exsdev has quit IRC01:39
*** exsdev has joined #openstack-infra01:40
*** sgw has joined #openstack-infra01:44
*** rcernin has joined #openstack-infra01:45
*** yamamoto has joined #openstack-infra01:46
*** gregoryo has joined #openstack-infra01:50
*** yamamoto has quit IRC02:26
*** yamamoto has joined #openstack-infra02:26
*** bobh has joined #openstack-infra02:37
openstackgerritjacky06 proposed openstack/diskimage-builder master: Move doc related modules to doc/requirements.txt  https://review.opendev.org/62846602:40
openstackgerritjacky06 proposed openstack/diskimage-builder master: Move doc related modules to doc/requirements.txt  https://review.opendev.org/62846602:42
openstackgerritIan Wienand proposed zuul/zuul-jobs master: Add a netconsole role  https://review.opendev.org/68090102:46
*** yamamoto has quit IRC03:11
*** ramishra has joined #openstack-infra03:13
*** bobh has quit IRC03:14
*** rh-jelabarre has joined #openstack-infra03:27
*** dklyle has quit IRC03:29
openstackgerritIan Wienand proposed zuul/zuul-jobs master: Add a netconsole role  https://review.opendev.org/68090103:40
*** yamamoto has joined #openstack-infra03:47
*** ykarel has joined #openstack-infra03:49
*** udesale has joined #openstack-infra03:53
*** yamamoto has quit IRC04:11
*** soniya29 has joined #openstack-infra04:12
*** ykarel has quit IRC04:17
*** ykarel has joined #openstack-infra04:36
*** yamamoto has joined #openstack-infra04:40
*** eernst has joined #openstack-infra04:54
*** markvoelker has joined #openstack-infra04:55
*** markvoelker has quit IRC05:00
*** jaosorior has joined #openstack-infra05:03
*** ricolin has joined #openstack-infra05:08
openstackgerritAndreas Jaeger proposed openstack/openstack-zuul-jobs master: Use promote job for releasenotes  https://review.opendev.org/67843005:08
AJaeger_config-core, please review https://review.opendev.org/677158 for cleanup and https://review.opendev.org/#/q/topic:promote-docs+is:open to use promote jobs for releasenotes, and extra tests for zuul-jobs: https://review.opendev.org/67857305:10
openstackgerritJan Kubovy proposed zuul/zuul master: Evaluate CODEOWNERS settings during canMerge check  https://review.opendev.org/64455705:15
*** eernst has quit IRC05:29
openstackgerritIan Wienand proposed zuul/zuul-jobs master: Add a netconsole role  https://review.opendev.org/68090105:29
openstackgerritIan Wienand proposed zuul/zuul-jobs master: Add a netconsole role  https://review.opendev.org/68090105:44
*** raukadah is now known as chandankumar05:46
*** kopecmartin|off is now known as kopecmartin05:56
*** jtomasek has joined #openstack-infra06:02
openstackgerritIan Wienand proposed zuul/zuul-jobs master: Add a netconsole role  https://review.opendev.org/68090106:02
*** e0ne has joined #openstack-infra06:09
*** jbadiapa has joined #openstack-infra06:13
openstackgerritjacky06 proposed openstack/diskimage-builder master: Move doc related modules to doc/requirements.txt  https://review.opendev.org/62846606:13
*** e0ne has quit IRC06:18
*** jtomasek has quit IRC06:22
openstackgerritMerged openstack/project-config master: Add file matcher to releasenotes promote job  https://review.opendev.org/67985606:22
openstackgerritMerged zuul/zuul-jobs master: Switch releasenotes to fetch-sphinx-tarball  https://review.opendev.org/67842906:28
*** snapiri has quit IRC06:35
*** snapiri has joined #openstack-infra06:36
*** snapiri has quit IRC06:36
*** jtomasek has joined #openstack-infra06:40
*** ccamacho has joined #openstack-infra06:42
*** mattw4 has joined #openstack-infra06:44
*** slaweq_ has joined #openstack-infra06:49
*** snapiri has joined #openstack-infra06:49
Tengu~.06:54
*** pgaxatte has joined #openstack-infra06:55
openstackgerritIan Wienand proposed openstack/project-config master: Add Fedora 30 nodes  https://review.opendev.org/68091906:56
*** shachar has joined #openstack-infra06:58
*** mattw4 has quit IRC06:59
*** yamamoto has quit IRC07:00
*** snapiri has quit IRC07:00
ianwinfra-root: appreciate eyes on https://review.opendev.org/680895 to fix testinfra testing for good; after my pebkac issues with how to depends-on: for github it works07:01
*** rcernin has quit IRC07:02
*** slaweq_ has quit IRC07:06
*** slaweq has joined #openstack-infra07:08
*** yamamoto has joined #openstack-infra07:09
*** trident has quit IRC07:10
*** tesseract has joined #openstack-infra07:13
*** hamzy_ has quit IRC07:19
*** trident has joined #openstack-infra07:21
*** hamzy_ has joined #openstack-infra07:29
*** gfidente has joined #openstack-infra07:31
*** ykarel is now known as ykarel|lunch07:31
*** owalsh has quit IRC07:32
*** owalsh has joined #openstack-infra07:33
*** aluria has quit IRC07:33
*** jpena|off is now known as jpena07:35
*** aluria has joined #openstack-infra07:38
*** kjackal has joined #openstack-infra07:45
*** kaiokmo has joined #openstack-infra07:46
*** pcaruana has joined #openstack-infra07:49
*** markvoelker has joined #openstack-infra07:50
*** sshnaidm|afk is now known as sshnaidm07:51
*** sshnaidm is now known as sshnaidm|ruck07:51
*** markvoelker has quit IRC07:55
*** rpittau|afk is now known as rpittau07:55
*** xenos76 has joined #openstack-infra07:56
openstackgerritSimon Westphahl proposed zuul/zuul master: Fix timestamp race occuring on fast systems  https://review.opendev.org/68093708:04
*** apetrich has joined #openstack-infra08:05
openstackgerritSimon Westphahl proposed zuul/zuul master: Fix timestamp race occuring on fast systems  https://review.opendev.org/68093708:06
*** dchen has quit IRC08:07
*** panda has quit IRC08:09
*** panda has joined #openstack-infra08:11
*** rascasoft has quit IRC08:11
*** rascasoft has joined #openstack-infra08:12
openstackgerritIan Wienand proposed zuul/zuul-jobs master: Add a netconsole role  https://review.opendev.org/68090108:14
ianwjohnsom: ^ working for me08:16
*** ralonsoh has joined #openstack-infra08:18
*** e0ne has joined #openstack-infra08:19
jrosserianw: doesn't ansible_default_ipv4 contain the fields you need there rather than a bunch of shell commands?08:19
*** ykarel|lunch is now known as ykarel08:23
*** tkajinam has quit IRC08:42
*** derekh has joined #openstack-infra08:43
*** gregoryo has quit IRC08:48
*** kaiokmo has quit IRC08:48
ianwjrosser: yes, quite likely!  it was just more or less a straight port from the old code that has been working for a long time08:56
openstackgerritSimon Westphahl proposed zuul/zuul master: Fix timestamp race occurring on fast systems  https://review.opendev.org/68093708:59
*** hamzy_ has quit IRC09:07
*** ociuhandu has joined #openstack-infra09:08
*** ociuhandu has quit IRC09:10
*** ociuhandu has joined #openstack-infra09:10
*** ociuhandu has quit IRC09:14
*** ociuhandu_ has joined #openstack-infra09:14
*** ociuhandu_ has quit IRC09:18
zbrianw: do you know who can help with a new bindep release? last one was more than a year ago.09:18
*** soniya29 has quit IRC09:23
*** ociuhandu has joined #openstack-infra09:27
*** arxcruz_pto is now known as arxcruz09:27
openstackgerritSorin Sbarnea proposed opendev/bindep master: Add OracleLinux support  https://review.opendev.org/53635509:35
ianwzbr: i would ask fungi first, i think he is back today his time.  just to see if there was anything in the queue.  but if he doesn't have time, etc. ping me back and i imagine i could tag one tomorrow too09:36
zbrianw: thanks. i think that there are few more patches I could fix, and a bug that prevents running the tests on macos (mocking fails to work), and after this we can make a release.09:37
*** yamamoto has quit IRC09:40
*** kaisers has quit IRC09:48
*** kaisers has joined #openstack-infra09:50
*** markvoelker has joined #openstack-infra09:51
*** shachar has quit IRC09:53
*** snapiri has joined #openstack-infra09:53
*** ricolin_ has joined #openstack-infra09:55
*** lpetrut has joined #openstack-infra09:55
*** ricolin has quit IRC09:57
*** markvoelker has quit IRC10:00
*** rcernin has joined #openstack-infra10:08
openstackgerritSorin Sbarnea proposed opendev/bindep master: Fix test execution failure on Darwin  https://review.opendev.org/68096210:13
*** ricolin_ has quit IRC10:15
*** spsurya has joined #openstack-infra10:21
*** soniya29 has joined #openstack-infra10:26
*** gfidente has quit IRC10:26
*** panda is now known as panda|rover10:42
*** kaisers has quit IRC10:44
*** ociuhandu has quit IRC10:45
openstackgerritMerged openstack/openstack-zuul-jobs master: Remove openSUSE 42.3  https://review.opendev.org/67715810:47
*** dave-mccowan has joined #openstack-infra10:52
*** ricolin_ has joined #openstack-infra10:53
*** dciabrin_ has joined #openstack-infra10:53
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: add-build-sshkey: add centos/rhel-8 support  https://review.opendev.org/67409210:54
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: add-build-sshkey: add centos/rhel-8 support  https://review.opendev.org/67409210:54
*** dciabrin has quit IRC10:54
*** shachar has joined #openstack-infra11:02
*** nicolasbock has joined #openstack-infra11:02
*** snapiri has quit IRC11:03
*** pgaxatte has quit IRC11:05
*** dchen has joined #openstack-infra11:05
*** yamamoto has joined #openstack-infra11:06
*** rpittau is now known as rpittau|bbl11:12
openstackgerritAndreas Jaeger proposed openstack/project-config master: Fix promote-stx-api-ref  https://review.opendev.org/68097711:13
*** udesale has quit IRC11:15
*** sshnaidm|ruck is now known as sshnaidm|afk11:15
AJaeger_config-core, simple fix, please review ^11:15
*** ricolin_ is now known as ricolin11:15
*** ociuhandu has joined #openstack-infra11:17
*** gfidente has joined #openstack-infra11:18
*** ociuhandu has quit IRC11:22
*** kaisers has joined #openstack-infra11:28
*** pgaxatte has joined #openstack-infra11:34
*** jpena is now known as jpena|lunch11:35
openstackgerritMerged opendev/system-config master: Recognize DISK_FULL failure messages (review_dev)  https://review.opendev.org/67389311:37
*** ociuhandu has joined #openstack-infra11:41
*** dchen has quit IRC11:44
*** gfidente has quit IRC11:46
*** shachar has quit IRC11:48
*** gfidente has joined #openstack-infra11:53
*** goldyfruit has quit IRC11:56
*** markvoelker has joined #openstack-infra11:56
*** snapiri has joined #openstack-infra12:02
*** markvoelker has quit IRC12:02
*** rlandy has joined #openstack-infra12:03
*** yamamoto has quit IRC12:05
*** rfolco has joined #openstack-infra12:06
*** yamamoto has joined #openstack-infra12:09
*** rosmaita has joined #openstack-infra12:10
*** hamzy_ has joined #openstack-infra12:11
*** markvoelker has joined #openstack-infra12:12
*** ociuhandu has quit IRC12:21
*** iurygregory has joined #openstack-infra12:22
*** rcernin has quit IRC12:22
ralonsohhello folks, I have one question if you can help me12:27
ralonsohthis is about Neutron rally-tasks12:27
ralonsohfor example: https://d4b9765f6ab6e1413c28-81a8be848ef91b58aa974b4cb791a408.ssl.cf5.rackcdn.com/680427/2/check/neutron-rally-task/01b2c1c/12:30
ralonsohthere are no logs or results, and the task is failing12:30
*** jpena|lunch is now known as jpena12:31
*** sshnaidm|afk is now known as sshnaidm|ruck12:32
fricklerralonsoh: rally seems to be complicating access to logs by generating an index.html file. there are logs showing what I think is the error, see the end of https://d4b9765f6ab6e1413c28-81a8be848ef91b58aa974b4cb791a408.ssl.cf5.rackcdn.com/680427/2/check/neutron-rally-task/01b2c1c/controller/logs/devstacklog.txt.gz12:36
*** rosmaita has quit IRC12:38
*** happyhemant has joined #openstack-infra12:39
ralonsohfrickler, thanks! I'll take a look at this12:40
fricklerralonsoh: this is where the index file comes from, probably for our new logging setup it would work better to place this into a subdir and register the result as an artifact in zuul https://opendev.org/openstack/rally-openstack/src/branch/master/tests/ci/playbooks/roles/fetch-rally-task-results/tasks/main.yaml#L52-L6412:43
*** ociuhandu has joined #openstack-infra12:43
*** aaronsheffield has joined #openstack-infra12:44
ralonsohfrickler, or rename this index file so the browser doesn't read it by default12:45
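Either workaround is a small tweak in the results-collection step of the job; a hypothetical sketch (the directory layout below is illustrative, not the actual rally-openstack playbook paths):

```bash
# Option 1: keep the generated report out of the log root so it does not
# shadow the log browser's own directory index.
mkdir -p "$WORKSPACE/logs/rally-results"
mv "$WORKSPACE/logs/index.html" "$WORKSPACE/logs/rally-results/rally-report.html"

# Option 2: just rename the file in place, as suggested above.
mv "$WORKSPACE/logs/index.html" "$WORKSPACE/logs/rally-report.html"
```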
*** Goneri has joined #openstack-infra12:47
*** ociuhandu has quit IRC12:48
*** rosmaita has joined #openstack-infra12:50
*** rpittau|bbl is now known as rpittau12:55
*** mriedem has joined #openstack-infra12:57
*** roman_g has joined #openstack-infra13:00
*** soniya29 has quit IRC13:06
AJaeger_clarkb: I think all Fedora 28 changes are merged with exception of devstack... Open changes that I'm aware of are https://review.opendev.org/#/q/topic:fedora-latest+is:open13:08
*** e0ne has quit IRC13:19
*** udesale has joined #openstack-infra13:20
*** KeithMnemonic has joined #openstack-infra13:24
*** gfidente has quit IRC13:30
*** gfidente has joined #openstack-infra13:32
*** ociuhandu has joined #openstack-infra13:34
*** Goneri has quit IRC13:34
*** ociuhandu has quit IRC13:39
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: Add prepare-workspace-openshift role  https://review.opendev.org/63140213:45
*** ykarel is now known as ykarel|afk13:45
*** gfidente has quit IRC13:46
*** gfidente has joined #openstack-infra13:48
*** prometheanfire has quit IRC13:49
*** e0ne has joined #openstack-infra13:50
*** prometheanfire has joined #openstack-infra13:50
*** beekneemech is now known as bnemec13:54
*** priteau has joined #openstack-infra13:54
*** AJaeger_ is now known as AJaeger13:54
AJaegerconfig-core, please review https://review.opendev.org/680977 to fix stx API promote job, https://review.opendev.org/#/q/topic:promote-docs+is:open to use promote jobs for releasenotes, and for adding extra tests for zuul-jobs: https://review.opendev.org/67857313:55
*** ykarel|afk is now known as ykarel13:56
*** zzehring has joined #openstack-infra13:57
*** yamamoto has quit IRC14:02
*** eharney has joined #openstack-infra14:04
*** bdodd has joined #openstack-infra14:07
*** ociuhandu has joined #openstack-infra14:13
openstackgerritMerged zuul/nodepool master: Fix Kubernetes driver documentation  https://review.opendev.org/68087914:13
openstackgerritMerged zuul/nodepool master: Add extra spacing to avoid monospace rendering  https://review.opendev.org/68088014:13
openstackgerritMerged zuul/nodepool master: Fix chroot type  https://review.opendev.org/68088114:13
*** Goneri has joined #openstack-infra14:15
*** jamesmcarthur has joined #openstack-infra14:15
*** ociuhandu has quit IRC14:17
*** yamamoto has joined #openstack-infra14:21
openstackgerritSorin Sbarnea proposed opendev/bindep master: Fix test execution failure on Darwin  https://review.opendev.org/68096214:25
dtantsurhi folks! the release notes job on instack-undercloud stable/queens started failing like this: https://e29864d8017a43b5dd67-658295aea286d472989a3acc71bbfe02.ssl.cf5.rackcdn.com/680698/1/check/build-openstack-releasenotes/fae0512/job-output.txt14:25
dtantsurdo you have any ideas?14:25
*** dklyle has joined #openstack-infra14:29
*** ykarel is now known as ykarel|afk14:30
*** lpetrut has quit IRC14:30
*** yamamoto has quit IRC14:30
*** yamamoto has joined #openstack-infra14:34
*** yamamoto has quit IRC14:34
*** yamamoto has joined #openstack-infra14:34
AJaegerdtantsur: let me check, we just switched the implementation...14:35
*** ociuhandu has joined #openstack-infra14:35
dtantsurAJaeger: it may be that you're trying to do something on master. the master branch is deprecated and purged of contents.14:35
AJaegerdtantsur: that explains it...14:36
AJaegerdtantsur: we check out master branch for releasenotes. So, if master is dead, kill the releasenotes job.14:36
dtantsurAJaeger: is there anything to be done except for removing the releasenotes job?14:36
dtantsurah14:36
dtantsurmwhahaha: ^^^14:36
AJaegerdtantsur: releasenotes does basically: git checkout master;tox -e releasenotes14:37
AJaegerSo, master needs to work for this...14:37
dtantsurthis does seem a bit inconvenient since the stable branches are supported and receive changes.14:37
mwhahahajust propose the job removal14:37
*** armax has joined #openstack-infra14:37
AJaegerdtantsur: I don't expect you make many changes to releasenotes on a released branch anymore...14:38
AJaegerdtantsur: to change this, reno would need to be redesigned completely ;(14:39
dtantsurfair enough14:39
mwhahahareleases notes includes a bug fix14:39
*** yamamoto has quit IRC14:39
mwhahahaso i would assume every change should have one14:39
dtantsur++14:39
*** ociuhandu has quit IRC14:39
AJaegernormally you do the bug fix on master and then backport ;) - and have master working...14:40
mwhahahaANYWAY let's just remove the job14:40
AJaegeragreed14:40
dtantsurmwhahaha: wanna me do it?14:40
mwhahahayes plz14:41
AJaegerdtantsur: best on all active branches14:41
dtantsurack14:41
dtantsurthanks AJaeger14:41
*** markvoelker has quit IRC14:41
*** mtreinish has quit IRC14:41
AJaegerglad that we could solve that so quickly - you had me shocked for a moment ;)14:41
*** mtreinish has joined #openstack-infra14:42
AJaegerThis is the change that merged earlier: https://review.opendev.org/#/c/678429/14:42
AJaegerconfig-core, we can switch releasenotes to promote jobs now: Please review https://review.opendev.org/#/c/678430/14:43
dtantsuraha, so rocky has already been fixed by another patch. but not queens.14:43
openstackgerritAndy Ladjadj proposed zuul/zuul master: Fix: prevent usage of hashi_vault  https://review.opendev.org/68104114:50
openstackgerritMatt McEuen proposed openstack/project-config master: Add per-project Airship groups  https://review.opendev.org/68071714:52
*** ociuhandu has joined #openstack-infra14:52
openstackgerritAndy Ladjadj proposed zuul/zuul master: Fix: prevent usage of hashi_vault  https://review.opendev.org/68104114:52
fricklerAJaeger: do you want to amend the comment on 678430? otherwise just +A14:53
openstackgerritAndy Ladjadj proposed zuul/zuul master: Fix: prevent usage of hashi_vault  https://review.opendev.org/68104114:53
AJaegerfrickler: what do you propose?14:53
AJaeger"Build and publish" -> "build and promote"?14:54
fricklerAJaeger: sounds more appropriate I'd think, yes14:54
AJaegerfrickler: let me do a followup - and update the other templates as well...14:55
* AJaeger will +A14:55
fricklerAJaeger: ok14:56
AJaegerthanks!14:56
AJaegerfrickler: could you review https://review.opendev.org/680977 as well, please?14:56
fricklerAJaeger: I did this morning14:58
fricklerwell, before noon anyway ;)14:59
*** jroll has joined #openstack-infra15:00
*** ykarel|afk is now known as ykarel15:00
openstackgerritAndreas Jaeger proposed openstack/openstack-zuul-jobs master: Mention promote in template description  https://review.opendev.org/68104415:01
AJaegerfrickler: ah, thanks. Then I need another core ;)15:01
AJaegerfrickler: here's the change - is that what you had in mind? ^15:01
*** markmcclain has joined #openstack-infra15:02
AJaegerthanks, clarkb !15:03
AJaegerclarkb: care to review https://review.opendev.org/678430 as well? That will switch releasenotes to promote jobs as well...15:04
clarkbAJaeger: ya I'll be working through your scrollback list between meeting stuff15:04
AJaegerclarkb: thanks15:04
*** diablo_rojo__ has joined #openstack-infra15:05
AJaegerclarkb: ignore 678430, just approved ;)15:05
zbrianw: clarkb : bindep: please have a look at https://review.opendev.org/#/c/680954/ and let me know if it is ok. had to find a way to pass unit tests on all platforms so I can test other incoming patches.15:05
openstackgerritMerged openstack/openstack-zuul-jobs master: Use promote job for releasenotes  https://review.opendev.org/67843015:06
AJaegerclarkb: https://review.opendev.org/676430 and https://review.opendev.org/681044 are the two remaining ones I care about, ignore the backscroll ;)15:06
AJaegerclarkb: these are ready as well https://review.opendev.org/680919 and https://review.opendev.org/68083015:06
AJaegerconfig-core as is https://review.opendev.org/#/c/680717/15:07
*** gyee has joined #openstack-infra15:07
AJaegerno, not yet ;(15:07
*** goldyfruit has joined #openstack-infra15:08
AJaegersorry, it IS ready15:08
openstackgerritMerged openstack/project-config master: Fix promote-stx-api-ref  https://review.opendev.org/68097715:10
*** markmcclain has quit IRC15:12
*** jaosorior has quit IRC15:12
*** ykarel is now known as ykarel|away15:13
*** ykarel|away has quit IRC15:22
donnydJust an update on FN. I am currently waiting on a part to come in the mail (should be here in the next few hours), and also I am trying to get someone out to do the propane installation today or tomorrow as well.15:30
*** ociuhandu has quit IRC15:31
donnydI don't really want to do the generator tests with FN at full tilt when the fuel is hooked up... so if it is going to be done in the next 48 hours I will leave it out of nodepool... if it's gonna take a week... then I will put it back in and pull it again when the time is right... That is *if* that makes sense to everyone here15:32
roman_gHello team. Could I ask to force-refresh OpenSUSE mirror synk, please? This one: [repo-update|http://mirror.dfw.rax.opendev.org/opensuse/update/leap/15.0/oss/] Valid metadata not found at specified URL15:33
roman_gOpenSUSE Leap 15.0, Updates, OSS15:33
roman_gsynk -> sync15:33
clarkbdonnyd: wfm15:34
fungiroman_g: any idea if the official obs mirrors ever got fixed after the reorg? they're what's been blocking our automatic opensuse mirroring15:34
openstackgerritFabien Boucher proposed zuul/zuul master: Pagure - handle Pull Request tags (labels) metadata  https://review.opendev.org/68105015:34
AJaegerricolin: could you help merge https://review.opendev.org/678353 , please?15:34
fungidonnyd: are you planning to load-test the generator with a resistor bank or something? otherwise you probably do want to test it out at full tilt15:34
clarkbfungi: maybe we should just remove the obs mirroring for now15:35
roman_gfungi: yes, zypper works perfectly with original repo at http://download.opensuse.org/update/leap/15.0/oss/15:35
fungiroman_g: i said obs15:35
clarkbroman_g: the problem is with obs not the main distro mirrors15:35
clarkbroman_g: we have a list of obs repos to mirror and they are not present in the upstream we were pulling from any longer15:35
AJaegerdirk, can you help, please? ^15:35
clarkbunfortunately we do all the suse mirroring together so if obs fails we don't publish main distro updates15:35
donnydfungi well first I have to function check the system. Make sure the auto transfer switch doesn15:35
roman_gwhat is obs? fungi clarkb15:35
fungiwe have a script which mirrors opensuse distribution files and some select obs package trees. the latter have been blocking the script succeeding15:35
clarkbroman_g: packaging that isn't part of the main distro release. In this case its probably similar to epel15:36
donnyddoesn't blow a hole in my wall or anything.. so it will be initially tested unloaded15:36
fungior similar to uca15:36
roman_gopen build service?15:36
clarkbroman_g: yes15:36
fungiyes15:36
donnydAnd that is like a 1 hour cycle15:36
donnydthen I have to test it loaded (at least what I can do)15:37
roman_gso we mirror not from download.opensuse.org, but from some other (obs) location?15:37
fungidonnyd: sounds like a fun day!15:37
roman_gwhat is that location? I could check if repo is in good condition there15:37
donnydthe UPS places a 17 amp load on the system at idle... with everything at full tilt it's a 21 amp load17:37
donnydplus the rest of my house circuits15:37
donnydSo you are correct I want to test loaded... but usually you start small and make sure it works.. and the only way that is done is to keep FN out of the pool for now15:38
donnydI should know more later today on propane install15:38
clarkbroman_g: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/opensuse-mirror-update#L82-L9615:39
clarkbroman_g: they all originate from suse15:39
clarkbbut they are not all equally mirrored15:39
clarkbthe upstreams we use for OBS no longer have all of the obs repos we want. And after some looking I was unable to find one that did15:40
zbris the os-loganalyze project orphaned or in need of more cores? it seems to have reviews lingering for years, like https://review.opendev.org/#/c/468762/ which was clearly not controversial in any way :p15:40
slittle1RE: starlingx/utilities ... we had one major package that failed to migrate to the new repo.  We would like an admin to help us import the missing content.15:41
clarkbzbr: the move to swift likely did orphan it.15:41
clarkbzbr: in general though for changes like that we all have much higher priority items than removing pypy from tox.ini where it harms nothing15:42
zbrclarkb: maybe we should find a better way to address these. I am asking because once I review a change it stays on my queue until it merges (or is abandoned) as I feel the need not to waste the time already invested.15:43
zbrwhat can I do in such cases? sometimes I remove my vote, but the change will still be lingering there.15:44
AJaegerslittle1: what is the problem? https://opendev.org/starlingx/utilities was imported, wasn't it?15:44
clarkbzbr: probably the best way to address things like that case specifically is to coordinate repo updates when we remove pypy15:44
zbri would find better to even abandon all changes, reducing the noise.15:44
clarkbzbr: but I'm guessing the global pypy removal wasn't that thorough15:44
clarkbzbr: if you click the 'x' next to your name you should be removed from it15:45
slittle1yes, but one of the packages that I was supposed to relocate into starlingx/utilities failed to move15:45
zbrclarkb: that was just one example, as you can imagine I do not care much about pypy. in fact not all all.15:45
zbri was more curious about which approach should I take15:45
slittle1That package has a very long commit history.  I stopped counting at 10015:46
*** ykarel|away has joined #openstack-infra15:46
AJaegerslittle1: I don't understand "failed to move" - let's ask differently: Was the repo you prepared not what you expected, and therefore you want to set it up again from scratch?15:46
slittle1Yes, that's what I'm proposing15:46
AJaegerslittle1: so, do you have a new git repo ready? Then an admin - if he has time - can force update it for you...15:47
AJaeger(note I'm not an admin, just trying to figure out next steps)15:48
slittle1I'm preparing the new git repo now15:48
slittle1Might take a couple hours to test15:48
clarkbzbr: personally I track my priorities external to Gerrit, when those intersect with Gerrit I do reviews15:48
AJaegerslittle1: then please come back once you're ready ;)15:49
AJaegerslittle1: so, prepare the repo, and then ask for somebody here to force push this to starlingx/utilities.15:50
AJaegerslittle1: and better tell starlingx team to keep the repo frozen and not merge anything - and everything proposed to it might need a rebase...15:50
AJaeger(and devs might need to check it out from scratch) - all depending on the changes you do (if they can be fast forwarded, it should be fine)15:51
roman_gclarkb: thank you for your help. I'm not sure I fully understand the code you are referring to, as it mentions repositories I don't use. A few lines above https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/opensuse-mirror-update#L26 there is a MIRROR="rsync://mirror.us.leaseweb.net/opensuse", and this one has broken repository metadata -> seems that we15:51
roman_gmirror broken leaseweb repository.15:51
clarkbroman_g: our mirrors are only as good as what we pull from15:51
clarkbroman_g: I think what we need is for someone to find a reliable upstream and we'll switch to it. If that means we can also continue to mirror OBS then we'll do that otherwise maybe we just turn off OBS mirroring15:52
fungione possibility would be to split opensuse and obs to separate afs volumes and update them with separate scripts, but that would need someone to do the work15:52
*** rlandy is now known as rlandy|brb15:52
*** chandankumar is now known as raukadah15:53
clarkbfungi: and still requires finding a working obs upstream15:53
paladoxcorvus also you'll want to add redirecting /p/ to the upgrade list (only needed for 2.16+)15:53
roman_gclarkb: could leaseweb be changed to something else (original opensuse mirror)? http://provo-mirror.opensuse.org/ this one, for example. or http://download.opensuse.org/ this one.15:53
paladoxi only just remembered15:53
clarkbroman_g: we don't use the main one because they limit rsync connections aggressively and we never update15:53
clarkbroman_g: but we can point to any other second level mirror15:53
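For anyone wanting to vet a candidate second-level mirror before proposing it, a dry-run sync of one of the failing trees is a quick sanity check; a sketch (the mirror hostname is an example, and the flags are a simplification of what the opensuse-mirror-update script actually uses):

```bash
# Hypothetical candidate mirror; any second-level mirror exporting rsync
# for the opensuse module could be substituted here.
CANDIDATE=rsync://mirror.example.org/opensuse

# Dry-run against the Leap 15.0 update/oss tree that is currently failing
# metadata validation, to see whether the candidate carries complete repodata.
rsync -rlptDvz --dry-run --delete-after \
    "$CANDIDATE/update/leap/15.0/oss/" /tmp/leap-15.0-oss-check/
```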
*** rpittau is now known as rpittau|afk15:54
*** igordc has joined #openstack-infra15:54
zbrclarkb: ok, i will start removing myself from such reviews, but I will also post them here with a recommendation to abandon, like: https://review.opendev.org/#/c/385217/15:55
*** mattw4 has joined #openstack-infra15:58
*** sshnaidm|ruck is now known as sshnaidm|afk16:00
*** diablo_rojo__ is now known as diablo_rojo16:00
*** mattw4 has quit IRC16:01
*** mattw4 has joined #openstack-infra16:01
*** mattw4 has quit IRC16:05
*** mattw4 has joined #openstack-infra16:06
*** tesseract has quit IRC16:07
*** igordc has quit IRC16:09
clarkbzbr: another strategy we could probably get better at is trusting cores to single approve simple changes like that (but we've had a long history of two core reviewers so that probably merits more communication to tell people its ok in the simple case)16:09
*** e0ne has quit IRC16:10
*** mattw4 has quit IRC16:10
clarkbinfra-root I'd like to restart the nodepool launcher on nl02 to rule out bad config loads with the numa flavor problem that sean-k-mooney has had. FN is currently set to max-servers of zero so we won't be able to check it until back in the rotation but this way that is done and out of the way16:11
clarkbthat said the problems nodepool has are lack of ssh access16:11
clarkbaccording to the logs anyway16:11
clarkband if we got that far then the config should be working16:11
zbrnot a bad idea, especially for "downsized" projects where maybe there is no longer even a single core active. the same applies to abandons: cores should not be worried about pressing that button, it's not like "delete", anyone can still restore them if needed.16:11
clarkbbut testing a boot with that flavor and our images myself by hand I had no issues getting onto the host via ssh in a relatively quick amount of time16:12
Shrewsclarkb: could you go ahead and make sure that openstacksdk is at the latest (i think it was last i checked) and restart nl01 as well? it contains a swift feature that auto-deletes the image objects after a day16:12
clarkbShrews: I can16:12
Shrewsossum16:13
clarkbI may as well rotate through all 4 so they are running the same code. I'll do that if 02 looks happy16:13
clarkbShrews: openstacksdk==0.35.0 is what is on 02 and that appears to be current16:14
*** virendra-sharma has joined #openstack-infra16:14
sean-k-mooneyclarkb: the ci runs with the old label passed on saturday before FN was scaled down by the way https://zuul.opendev.org/t/openstack/build/dd0f5dad770d40a2afb3c506327d1b3e so it is just the new label that is broken16:14
clarkbShrews: nodepool==3.8.1.dev10  # git sha f2a80ef is the nodepool version there16:14
Shrewsclarkb: yes, that's the version. it just sets an auto-expire header on the objects (so not really a "feature" but a good thing)16:14
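The header in question is standard Swift object expiry, so the same behaviour can be reproduced by hand with the CLI if needed (container and object names here are made up):

```bash
# Upload an object that Swift will remove automatically after one day.
swift upload --header "X-Delete-After: 86400" images test-image.vhd

# Or add an absolute expiry time to an object that already exists.
swift post -H "X-Delete-At: $(date -d '+1 day' +%s)" images test-image.vhd
```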
clarkbsean-k-mooney: ya I saw that. That is what made me suspect maybe config loading16:15
clarkbsean-k-mooney: once FN is back in the ortation we should be able to confirm or deny16:15
sean-k-mooneyack16:15
sean-k-mooneyi probably should tell people FN is offline/disabled currently too16:15
Shrewsclarkb: nodepool version is good. i don't see anything of significance that might cause issues16:15
Shrewssince 3.8.0 release, that is16:16
openstackgerritMohammed Naser proposed zuul/zuul master: spec: use operator-sdk for kubernetes operator  https://review.opendev.org/68105816:16
zbrfungi: there are only 4 more pending reviews on bindep, after we address them, can we make an aniversary release?16:17
clarkbShrews: nl02 has been restarted. Limestone is having trouble (but appeared to have trouble prior to the restart too) going to understand that before doing 01 03 and 0416:17
clarkblogan-: fyi ^ "openstack.exceptions.SDKException: Error in creating the server (no further information available)" is what we get from the sdk16:17
clarkblogan-: I'm going to try a manual boot now16:17
*** virendra-sharma has quit IRC16:20
clarkblogan-: {'message': 'No valid host was found. There are not enough hosts available.', 'code': 500, 'created': '2019-09-09T16:19:42Z'}16:20
clarkbShrews: ^ seems like sdk should be able to bubble that message up? maybe that error isn't available initially?16:21
Shrewsclarkb: did that come from the node.fault.message?16:21
clarkbShrews: yes16:21
*** markvoelker has joined #openstack-infra16:21
Shrewsclarkb: https://review.opendev.org/671704 should begin logging that16:22
Shrewsclarkb: the issue is you need to either list servers with detailed info (which nodepool does not), or refetch the server after failure16:22
clarkboh right cool16:22
clarkbI shall rereview it when this restart expedition is complete16:22
*** dtantsur is now known as dtantsur|afk16:23
*** virendra-sharma has joined #openstack-infra16:23
Shrewsclarkb: yeah, i think it's good to go now, but i can't +2 it since I've added to the review16:24
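In the meantime the fault text has to be fetched with a detailed request after the failure, which is essentially what that review teaches nodepool to do; by hand it looks something like this (the server ID is a placeholder):

```bash
# The fault message for an ERROR'd instance only appears on a detailed
# fetch of the server, not in the brief listing nodepool normally uses.
openstack server show <server-uuid> -f json -c status -c fault
```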
AJaegerpromotion of releasenotes is working fine so far - note that we now only update releasenotes if those are changed and not after each merge.16:25
logan-clarkb: yep, thats expected. i disabled the host aggregate on our end out of caution since we had a transit carrier acting up last week. i am planning to re-enable soon, later today or tomorrow, when i am around for a bit to monitor job results16:25
openstackgerritSorin Sbarnea proposed opendev/bindep master: Add OracleLinux support  https://review.opendev.org/53635516:26
logan-thats how i typically stop nodepool from launching nodes while allowing it to finish up and delete already running jobs. does it mess anything up when I do it that way?16:26
clarkblogan-: thanks for confirming. As a side note you can update the instance quota to zero and nodepool will stop launching against that cloud16:26
clarkbit doesn't mess anything up, just noisy logs16:26
*** spsurya has quit IRC16:27
logan-gotcha16:27
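For reference, the quota route clarkb mentions is just a cloud-side setting and needs no nodepool change at all; a sketch (the project name is an example):

```bash
# Dropping the instance quota to zero makes nodepool treat the region as
# having no capacity, so it quietly stops launching there.
openstack quota set --instances 0 opendev-ci    # project name is an example

# Later, restore the original value to put the region back in service.
openstack quota set --instances 50 opendev-ci
```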
openstackgerritSorin Sbarnea proposed opendev/bindep master: Fix tox python3 overrides  https://review.opendev.org/60561316:32
openstackgerritMohammed Naser proposed zuul/zuul-operator master: Create zookeeper operator  https://review.opendev.org/67645816:32
openstackgerritSorin Sbarnea proposed opendev/bindep master: Add some more default help  https://review.opendev.org/57072116:33
openstackgerritMohammed Naser proposed openstack/project-config master: Add pravega/zookeeper-operator to zuul tenant  https://review.opendev.org/68106316:36
mnaser^ if anyone is around for a trivial patch..16:37
openstackgerritMohammed Naser proposed zuul/zuul-operator master: Create zookeeper operator  https://review.opendev.org/67645816:40
openstackgerritMohammed Naser proposed zuul/zuul-operator master: Deploy Zuul cluster using operator  https://review.opendev.org/68106516:40
*** rlandy|brb is now known as rlandy16:41
*** pgaxatte has quit IRC16:42
openstackgerritMerged openstack/openstack-zuul-jobs master: Mention promote in template description  https://review.opendev.org/68104416:43
openstackgerritMerged zuul/zuul-jobs master: Switch to fetch-sphinx-tarball for tox-docs  https://review.opendev.org/67643016:46
*** ykarel|away has quit IRC16:46
*** jpena is now known as jpena|off16:47
AJaegerinfra-root, we still have 6 hold nodes - are they all needed?16:47
mnaserAJaeger: if any are mine, i dont need them16:48
*** ociuhandu has joined #openstack-infra16:48
clarkbI can take a look as part of my nodepool restarts. I've just done 03 and am confirming things work there since 02 only has offline clouds currently16:48
*** diablo_rojo has quit IRC16:49
Shrewswow, one has been in hold for 49 days16:49
ShrewsAJaeger: i suspect mordred has simply forgetten about that one  :)16:50
AJaeger;)16:50
Shrewsdidn't pabelanger say we could delete his held node the other day?16:51
mnaserplease dont interrupt that one, that's their bitcoin miner :)16:51
Shrewsmnaser: must be a deep mine16:52
AJaegermwhahaha, EmilienM, I see at http://zuul.opendev.org/t/openstack/status that https://review.opendev.org/680780 runs a non-voting job in gate - could you remove that one, please?16:52
AJaegerShrews: he did, so go ahead and delete it...16:52
mwhahahaweshay: -^16:52
AJaegermwhahaha, EmilienM, weshay, job is "tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates16:52
clarkbok there are nodes that went from building to in-use on nl03 now. Going to do 0416:53
AJaegerweshay, mwhahaha, and https://review.opendev.org/674065 has tripleo-build-containers-centos-7-buildah as non-voting in gate16:53
weshayhrm.. k.. thanks for letting me know16:53
*** ykarel|away has joined #openstack-infra16:53
Shrewsmnaser: you have a held node that i'll go ahead and delete now16:53
pabelangerShrews: yes, it can be deleted. Thanks16:54
*** mattw4 has joined #openstack-infra16:56
*** bdodd has quit IRC16:56
ShrewsAJaeger: since that change that mordred's held node is based on has merged, i'm going to assume that is not needed anymore either16:56
Shrewsianw: you now have the only 2 held nodes. will let you decide if they are needed or not16:57
clarkbShrews: for the case where there are ready nodes locked for many many days is there a way to break the lock without restarting the zuul scheduler?16:58
clarkbShrews: 0010553302 for example16:58
*** trident has quit IRC16:58
Shrewsclarkb: we should first verify who owns the lock. let me look16:58
Shrewsthere's been a long standing zuul bug around that16:59
*** derekh has quit IRC17:00
clarkb#status log Restarted nodepool launchers with OpenstackSDK 0.35.0 and nodepool==3.8.1.dev10  # git sha f2a80ef17:00
openstackstatusclarkb: finished logging17:00
roman_gclarkb: I've asked SUSE guys if they could reach out to their OpenSUSE community and find a good CDN to rsync OpenSUSE from.17:01
clarkbroman_g: fwiw AJaeger asked dirk here to do that too in scrollback (though he is CEST so may not happen today)17:01
clarkbroman_g: let us know what you find out17:01
*** diablo_rojo has joined #openstack-infra17:02
ricolinAJaeger, done17:03
roman_gclarkb, AJaeger, just for reference, I was replying to Jean-Philippe Evrard (SUSE) here https://review.opendev.org/#/c/672678/ in comment to patch set 4.17:03
Shrewsclarkb: yeah, zuul owns that lock. there is an out-of-band way to do it (just delete the znode) but that's not recommended17:03
clarkbShrews: ya it should go away if we restart the scheduler but now is a bad time for that due to job backlogs17:04
clarkb(and I was thinking freeing up ~4 more nodes will help a bit with the backlog)17:04
Shrewsi just freed up 4 held nodes, so that might help a small bit17:05
clarkbjohnsom: fungi ianw re job network problems if it is rax specific it could be that we are seeing duplicate IPs there again?17:05
clarkbin the past we would get job failures for those errors because ansible didn't return the correct rc code for network trouble but that hasbeen fixed and so now we retry17:05
clarkbhowever if it happens on more than just rax and it is always triggered during the start of tempest tests I would suspect something in the tempest tests17:06
fungiclarkb: if we can manage to correlate the failures to a specific subset of recurring ip addresses, that can explain it17:06
clarkbfrickler: figured out worlddumping which will help with the dns debugging too I think17:06
johnsomI have seen it fail after 4-5 tests have passed, it's not at the start17:06
*** trident has joined #openstack-infra17:06
clarkbjohnsom: ok then during tempest17:06
clarkbboth ipv6 only clouds are offline currently so the dns issues shouldn't be happening for a bit17:07
clarkbhopefully long enough to get worlddump fix merged in devstack17:07
clarkbjohnsom: the point was more that if it is duplicate IP problem we should see that during any point of time in a job17:08
clarkband it would be cloud/region specific17:08
clarkbif however it happens during a common job location and on variety of clouds we probably don't have that problem17:08
johnsomYes, it is a definite possibility17:08
donnydok electrical is done for FN and it looks like nova may need to use it.. So how about we put it back in service and can work around the gas install17:08
clarkbdonnyd: I will defer to you on that17:09
donnydI am thinking only compute jobs, maybe hold off on swift logs17:09
clarkbwfm17:09
*** ramishra has quit IRC17:09
clarkbhttps://storage.gra1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e47/680797/1/check/openstack-tox-py36/e477716/ is still working17:09
clarkbAJaeger: ^ should we turn ovh back on for swift logs?17:09
fungiclarkb: last word from amorin was that the suspendbot was done for yesterday, but no idea when it would come back today and whether it would be fixed17:10
clarkbfungi: ya its been just over 24 hours17:12
AJaegerit's more than 24 hours since we talked -  i would expect it run again in that time.17:12
AJaegerclarkb: let me remove my WIP from https://review.opendev.org/#/c/680855/17:12
AJaegerwe might want to wait another 24 hours - or just go for it...17:12
clarkbjohnsom: fungi http://zuul.openstack.org/build/74d1733dd44f4233a37ef43039e338e8/log/job-output.txt#36386 that may be a clue (that job was a retry_limit)17:13
AJaegerthanks, ricolin17:14
clarkbjohnsom: fungi ze04 (where that build ran) has disk now and doesn't have leaked builds17:14
openstackgerritDonny Davis proposed openstack/project-config master: Bringing FN back online for CI Jobs  https://review.opendev.org/68107517:15
fungiclarkb: johnsom: oh! so another thing worth noting is that nodes in rackspace have a smallish rootfs with more space mounted at /opt17:15
AJaegerroman_g: yeah, evrardjp is also a good contact - thanks17:15
clarkbfungi: ya though in this case the thing that ran out of disk was the executor17:15
fungiahh, okay, there goes that idea17:15
AJaegerconfig-core, https://review.opendev.org/678356 and https://review.opendev.org/678357 are ready for review, dependencies merged - those remove now unused publish jobs and update promote jobs. Please review17:18
clarkbAJaeger: fwiw I asked airship if they are ready for their global acl updates to go in and haven't gotten an answer yet17:19
clarkbit will prevent them from approving changes while new groups are configured so I figured I would ask17:20
AJaegerclarkb: looking at the commit message: If we add the airship-core group everywhere, it should work just fine, wouldn't it?17:20
AJaegerclarkb: but double checking is better ;) Thanks17:20
donnydclarkb: fungi so even though FN doesn't have any jobs running, the images should still have been loaded from nodepool.. Correct?17:21
fungidonnyd: correct17:21
donnydand the mirror never actually went down17:21
funginodepool continues performing uploads even if max-servers is 017:21
donnydi did have enough juice to keep it up this whole time17:22
fungiawesome. thanks!17:22
clarkbAJaeger: ya though it won't have the effect they want (they will still need to make changes)17:22
donnydso it should also be up to date with whatever has been done...17:22
clarkb23GB is executor-git and 7.9GB is builds in /var/lib/zuul17:23
clarkbgit will hardlink so that 23GB seems large but should be reasonable as its shared across builds17:23
clarkbthere are 36GB available on ze04 represents 53% free on that fs17:24
*** gfidente has quit IRC17:25
clarkbAJaeger: fungi maybe you can reach out to amorin tomorrow CEST time and see if amorin has any more input? then we can enable early tomorrow if amorin is happy with it at that point?17:26
corvuspaladox: can you elaborate on redirecting /p/ ?  do we need to do something like /p/(.*) -> /$1 ?17:27
paladoxyup17:27
paladoxpolygerrit takes over /p/17:27
paladoxso cloning over /p/ breaks17:27
fungiclarkb: hopefully my availability tomorrow morning is better than today was. i've added a reminder for it and will see what i can do17:27
paladoxcorvus https://www.gerritcodereview.com/2.16.html#legacy-p-prefix-for-githttp-projects-is-removed17:28
*** jamesmcarthur has quit IRC17:28
paladoxcorvus https://github.com/wikimedia/puppet/commit/4a2a25f3cbcbabd03b6291459941304e67bbd1c5#diff-4d7f1c048cc827721ef9298a98d1f5d9 is what i did17:28
paladoxto prevent breaking clones && to prevent breaking PolyGerrit project dashboard support.17:29
corvuspaladox: gotcha, so that's targeting just cloning then17:29
corvusthat regex in your change17:29
paladoxyup17:29
clarkbhttp://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1 vexxhost has some weird looking node status graphs17:30
paladoxwe will be upgrading soon :)17:30
paladoxspoke with one of the releng members and with 2.15 going EOL, we will want to be going to 2.16 very soon.17:30
corvuspaladox: cool thanks!  i added it to our list :)17:30
paladox:)17:30
* clarkb figures out boot from volume to test vexxhost17:31
paladoxcorvus you'll also want to convert any of your its-* rules to soy too!17:32
clarkbI think we may have leaked volumes17:33
clarkbI'm digging in17:33
fungipaladox: oh, fun... what's soy?17:34
*** ociuhandu has quit IRC17:34
paladoxSoy is https://github.com/google/closure-templates17:34
Shrewsclarkb: i would not doubt leaked volumes. there is something weird there that i couldn't quite figure out (i thought i had the bug down at one time, but alas, not so much)17:34
paladoxhttps://github.com/wikimedia/puppet/tree/production/modules/gerrit/files/homedir/review_site/etc/its is how it looks under soy :)17:35
*** ociuhandu has joined #openstack-infra17:36
clarkbShrews: well volume list shows we have ~37 volumes at 80GB each. The quota error we got says we are using 6080GB which is enough for 76 instances17:36
Shrewsyowza17:36
clarkbShrews: so the volume leak isn't the only thing going on here from what I can tell (we do apepar to have leaked a small number of volumes which I'm trying to trim now since that is actionable on our end)17:36
paladoxfungi it's also what is used for PolyGerrit's index.html :)17:37
paladoxit helped me to introduce support for base url's17:37
clarkbShrews: its basically 2 times our consumption17:37
fungithanks paladox. luckily we don't have a bunch of its rules: https://opendev.org/opendev/system-config/src/branch/master/modules/openstack_project/manifests/review.pp#L213-L23417:37
Shrewsclarkb: can you capture that info for the one you're deleting for me? at least the a few of the most recent ones17:37
paladoxah!17:37
paladoxok17:37
paladoxi had to do the support for soy in its-base for security reasons :(17:38
clarkbShrews: ya I'll get a list and then get full volume shows for each of them (thats basically how I'm combing through to decide they are really leaked)17:38
Shrewscool17:38
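A sketch of the kind of sweep being done here, for anyone reproducing it later (the filter is illustrative; each candidate still needs a manual look at its attachment and creation data before deleting anything):

```bash
# Detached volumes are the usual leak candidates; inspect each one and only
# delete after confirming it is not referenced by any current server.
openstack volume list --status available -f value -c ID |
while read -r vol; do
    openstack volume show "$vol" -c created_at -c status -c attachments
done
```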
openstackgerritMerged openstack/project-config master: Bringing FN back online for CI Jobs  https://review.opendev.org/68107517:38
corvuspaladox, fungi: if we're using its-storyboard, do we still need to use soy?17:39
paladoxnope17:39
paladoxyou doin't use velocity it seems per the link fungi gave17:39
fungicorvus: i thought its rules were how we configured its-storyboard (well, that and commentlinks, which seem to be integrated with how its does rule lookups)17:39
*** ricolin has quit IRC17:40
fungiis soy replacing the commentlinks feature too?17:40
paladoxthough, its rules can now be configured from the UI :)17:40
paladoxfungi nope17:40
paladoxit's replacing the velocity feature17:40
*** ociuhandu has quit IRC17:40
fungiahh, i have no idea what velocity is anyway17:42
fungijust as well, i suppose17:42
paladoxhttps://velocity.apache.org17:42
clarkbShrews: bridge.o.o:/home/clarkb/vexxhost-leaked-volumes.txt17:42
clarkbShrews: I'm going to delete them now17:43
fungipaladox: okay, so that was providing some sort of macro/templating capability i guess17:43
paladoxyup17:43
clarkbShrews: they are all from august 7th17:43
Shrewsclarkb: thx17:43
fungiclarkb: last time i embarked on a vexxhost leaked image cleanup adventure, i similarly discovered a bunch of images hanging around that all leaked within a few minutes of each other on the same day, weeks prior17:44
fungisomething happens briefly there to cause this, i guess17:44
paladoxdo you set javaOpts through puppet (for gerrit.config)17:44
Shrewsclarkb: fungi: my last exploration of this found they seemed to be associated with the inability to delete a node because of "volume in use" errors, iirc17:45
fungiShrews: correct, same for me. something caused server instances to go away ungracefully and left cinder thinking the volumes for them were still in use17:46
Shrewsmaybe this new batch will have some key that i missed before17:46
fungipaladox: not to my knowledge. i17:46
fungier, i'm not even sure what javaopts are17:46
paladoxok, i think you'll want to set https://github.com/wikimedia/puppet/commit/01bf99d8c72886e876878ade7e99f9081dc313d5#diff-1145a1f82a8b6b5ee2c3238b41d3960117:46
clarkbpaladox: fungi we set heap memory size17:46
clarkbbut that gets loaded via the defaults file in the init script17:47
clarkbso ya we set things but puppet is only partially related there17:47
fungiahh, here: https://opendev.org/opendev/system-config/src/branch/master/modules/openstack_project/manifests/review.pp#L11917:47
paladoxahh, ok.17:47
fungihttps://opendev.org/opendev/puppet-gerrit/src/branch/master/templates/gerrit.config.erb#L62-L6417:48
fungiis where it gets plumbed through to gerrit.config17:49
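Since gerrit.config is in git-config syntax, the values that template renders can also be inspected or tweaked directly with git config against the file; a sketch (the site path and values are examples, not what review.opendev.org actually runs):

```bash
# gerrit.config is git-config formatted, so "git config -f" works on it.
cd /home/gerrit2/review_site
git config -f etc/gerrit.config container.heapLimit 48g
git config -f etc/gerrit.config --add container.javaOptions \
    "-Djava.security.egd=file:/dev/./urandom"
git config -f etc/gerrit.config --get-all container.javaOptions
```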
*** udesale has quit IRC17:49
*** ralonsoh has quit IRC17:49
clarkbmnaser: if you have a moment, vexxhost sjc1 claims we are using ~6080GB of volume quota but volume list shows ~3040GB (curious that it is 2x). Not sure if that is a known thing or somethign we can help debug further but thought I would point it out17:51
*** goldyfruit_ has joined #openstack-infra17:51
*** virendra-sharma has quit IRC17:53
*** goldyfruit has quit IRC17:53
Shrewsclarkb: hrm, aug 7th logs for nodepool no longer exist. that makes investigating near impossible  :(17:53
*** ociuhandu has joined #openstack-infra17:53
clarkbya we keep a month of logs iirc17:54
clarkband that is ~2 days past17:54
Shrewswe only go back to aug 3017:54
clarkboh just 10 days then17:54
clarkbre the retrying jobs the top of the check queue is an ironic job that disappeared in tempest ~2 hours ago on rax-ord and zuul/ansible haven't caught up on that yet18:00
clarkbI am able to hit the ssh port via home and the executor that was running the job there18:01
clarkbso if it is the duplicate ip problem we've got the IP currently18:01
fungiwell, it's rarely just *one* duplicate ip address18:02
fungibut it's usually an identifiably small percent of the addresses used there overall18:03
openstackgerritSorin Sbarnea proposed opendev/bindep master: Add OracleLinux support  https://review.opendev.org/53635518:03
clarkbya its just hard to identify because we don't get logs18:03
openstackgerritMerged openstack/project-config master: Add pravega/zookeeper-operator to zuul tenant  https://review.opendev.org/68106318:03
clarkbI've put an autohold on that job18:03
clarkbtox/tempset is still running there18:08
fungipreviously when i've had to hunt these down i've built up lists of build ids which hit the retry_limit or whatever, and then programmatically tracked those back through zuul executor logs to corresponding node requests and then into nodepool launcher logs18:09
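That chain is mostly grep work across the service logs; a very rough sketch (the log paths, build id, and request id below are placeholders, not taken from a real failure):

```bash
# 1. From the executor debug log, find the node request that backed the
#    failed build.
grep '<build-uuid>' /var/log/zuul/executor-debug.log | grep -i request

# 2. In the launcher logs, see which provider and IP filled that request.
grep '<node-request-id>' /var/log/nodepool/launcher-debug.log
```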
clarkb23.253.173.73 is the host if anyone else wants to take a look18:09
fungimy access is still a bit crippled while my broadband is out18:10
clarkbhttp://paste.openstack.org/show/774344/ is the last thing stestr logged18:11
clarkbabout 37 minutes after ansible stopped getting info18:11
*** e0ne has joined #openstack-infra18:12
johnsomI am going to try Ian's netconsole role plus redirect the rsyslog. See if that catches anything.18:14
clarkbright around the time ansible loses connectivity dmesg says there was an sda gpt block device. This device has GPT errors and does not exist currently18:14
clarkbdtantsur|afk: TheJulia ^ do you know if ironic-standalone is creating that sda device and giving it a GPT table? if so it seems to be buggy18:15
*** jtomasek_ has joined #openstack-infra18:17
clarkbjournalctl is empty for that block of time too18:18
fungioh, i wonder if something to do with block device management is temporarily hanging the kernel long enough to cause ansible to give up18:19
*** jtomasek has quit IRC18:19
fungithat behavior could certainly be provider/hypervisor dependent18:19
clarkbfungi: ya journalctl having missing data around then seems to point at sadness18:19
clarkbsyslog has the data though so not universally a problem18:19
clarkbbut could be just bad enough potentially18:19
*** priteau has quit IRC18:23
*** kjackal has quit IRC18:23
clarkbfungi: wins /dev/xvda1       38G   36G     0 100% /18:25
clarkbrunning du to track down what filled it up18:25
fungiprobably the job needs adjusting to write stuff in /opt18:27
fungihow full is /opt on that node?18:27
*** yamamoto has joined #openstack-infra18:27
clarkbhttp://paste.openstack.org/show/774346/18:28
clarkbopt has 57GB18:28
clarkbjohnsom: ^ is it possible that your jobs are using fat images too and running into the same problem?18:28
clarkbthat could explain why it is rax and ironic + octavia hitting this a lot18:28
clarkbbecause you both run workload in your VMs18:28
clarkband don't cirros18:28
johnsomNot likely, we just shaved another 250MB off them a month or two ago. It should be only one 350MB image18:29
johnsomWe are logging more than we were however, as we added log offloading a few months back. Seems unlikely that it would be failing at the start of the tempest job though.18:30
johnsomIs the DF log in the output at the start or the end of the job?18:31
*** yamamoto has quit IRC18:31
clarkbthats me sshing in after node broke18:31
fungiso what is e.g. /var/lib/libvirt/images/node-1.qcow2 there which is 12gb?18:32
johnsomThis was a run that passed: https://zuul.opendev.org/t/openstack/build/065acaa6544342348b3eef799e627696/log/controller/logs/df.txt.gz18:32
clarkbironics test image18:32
clarkbfungi: ^18:32
fungiaha18:32
fungithat's fairly massive18:32
fungiespecially since it's one of a dozen images in that same directory18:33
clarkbironic at least is clearly broken in this way18:33
clarkbgives us something to fix and check on after18:33
fungibut also it should almost certainly put that stuff in /opt18:33
fungiso as not to fill the rootfs18:33
johnsomYeah, our new logging is only 250K and 2.3K for a full run, so not a big change.18:33
*** e0ne has quit IRC18:34
clarkbnote as suspected that is a side effect of running tempest I think18:35
* clarkb makes a bug for ironic18:37
clarkbjohnsom: "new logging" ?18:37
clarkbjohnsom: also keep in mind the original image size may grow up to the flavor disk size when booted and used18:37
*** jamesmcarthur has joined #openstack-infra18:37
clarkbjohnsom: even if your qcow2 is say only 1GB when handed to glance once booted and blocks start changing it can grow up to that limit18:38
johnsomNew logging == offloading the logging from inside the amphora instances. Looking at the passing runs it's only 250K uncompressed, so certainly not eating the disk.18:39
johnsomOur qcow is ~300MB, the nova flavor it gets is 2GB. Other than getting smaller, it hasn't changed in years.18:39
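For what it's worth, the growth clarkb describes is easy to see with qemu-img on a held node; "virtual size" is the ceiling the file can grow to, "disk size" is what it currently occupies (the path is the ironic example from above, not an octavia one):

    qemu-img info /var/lib/libvirt/images/node-1.qcow2
    # virtual size: the allocated size the file can balloon out to
    # disk size:    the space actually consumed on the host right now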
clarkbok. I'm pointing it out because I've just shown that one of your example jobs is likely failing due to this problem.18:41
clarkbI think it is worthwhile to be able to specifically rule it in or out in the other jobs we know are failing similarly18:41
*** ociuhandu has quit IRC18:41
johnsomYeah, that seems very ironic job related.18:41
clarkboh is ironic in storyboard?18:42
johnsomI hope this netconsole trick sheds light on our job.18:43
fungiclarkb: yes18:43
fungiand i agree, seems like there's a good chance that the ironic and octavia jobs are failing in similarly opaque ways for distinct reasons18:44
johnsomclarkb This test run is going now: https://review.opendev.org/680894 assuming I didn't fumble the Ansible somehow, which is likely given I use it just enough to forget it.18:44
johnsomYep, nevermind, broken ansible. lol18:44
clarkbfungi: yup could be distinct but we should rule out known causes first18:45
clarkbrather than assuming they are fine18:45
clarkboctavia like ironic is failing when tempest runs18:45
clarkboctavia like ironic is being retried becaus ansible fails18:45
clarkbwe know ironic is filling its root disk on rax (which also explains why it may be cloud specific)18:46
clarkbwhy not simply check "is octavia filling the disk"? if yes we've solved it, if no we look elsewhere18:46
fungiand also, predominantly in the same provider regions right? one where the available space on the rootfs is fairly limited compared to other providers18:46
fungier, what you said18:47
fungiwhat's a good way to go about reporting the disk utilization on those nodes just before they fall over? the successful run johnsom linked earlier had fairly minimal disk usage reported by df at the end of the job18:48
fungibut maybe disk utilization spikes during the tempest run and then the resources consuming disk get cleaned up?18:49
clarkbfungi: ya it could be the test ordering that gets us if multiple tests with big disks run concurrently18:49
clarkbfungi: as for how to get the data off, if ansible is failing (likely because it can't write data to /tmp) then we may need to use ansible raw?18:50
clarkbpabelanger: ^ do you know if the raw module avoids needing to copy files to /tmp on the remote host?18:50
*** factor has joined #openstack-infra18:50
*** vesper11 has quit IRC18:52
pabelangerclarkb: I am not sure, would have to look18:52
pabelangerIIRC, raw is pretty minimal18:53
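If raw does skip the remote tmpdir, an ad-hoc invocation along these lines could still pull disk state off a node the normal modules can no longer reach; a minimal sketch, where the ssh user and the exact commands are assumptions (the address is the held node from earlier):

    # -m raw runs the command straight over ssh, without copying a python
    # module to ~/.ansible/tmp on the remote host first
    ansible all -i '23.253.173.73,' -u <ssh-user> -m raw \
      -a 'df -h / /opt && du -xsh /var/lib/libvirt/images'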
clarkbhttps://storyboard.openstack.org/#!/story/2006520 has been filed and I just shared it with the ironic channel18:56
*** diablo_rojo has quit IRC18:56
*** diablo_rojo has joined #openstack-infra18:56
fungii wonder if we could better trap/report ansible remote write failures like that18:58
AJaegerconfig-core, https://review.opendev.org/678356 and https://review.opendev.org/678357 are ready for review, dependencies merged - those remove now unused publish jobs and update promote jobs. Please review18:58
*** jbadiapa has quit IRC19:00
clarkbfungi: looking at the log on ze04 for bb2716ef3894465cb1cfbf1b22d7736c the way it manifests for ironic is the run playbook times out, then the post-run playbook attempts to ssh in and do post-run things but that fails with exit code 4, which is the "networking failed me" problem. In this case I don't think networking at layer 3 or below failed but instead at layer 7, because ansible/ssh needs /tmp to do things19:01
clarkbjohnsom: is there a preexisting octavia change that we can recheck without causing problems for you and then if the job lands on say rax-* we can hold it/investigate?19:03
clarkbjohnsom: if not I can push a no change change19:03
*** ykarel|away has quit IRC19:03
johnsomI am using this one for the netconsole test: https://review.opendev.org/#/c/680894/19:03
johnsomI just can't say if I got the ansible right this time.19:03
clarkbk19:03
*** e0ne has joined #openstack-infra19:03
johnsomLooks like it got OVH19:04
fungihrm, yeah if ansible is overloading the same exit code for both network connection failure and an inability to write to the remote node's filesystem, then discerning which it is could be tough19:07
*** e0ne has quit IRC19:09
clarkbjohnsom: if you notice a run on rax-* ping me and I'll happily add an autohold for it19:10
johnsomOk, sounds good19:10
*** e0ne has joined #openstack-infra19:10
fungior we can set an autohold with a count >119:11
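Something like the following, from memory, so the exact flags are worth double-checking against the installed zuul client; the job name is a placeholder:

    # keep up to two failing nodes for inspection instead of just one
    zuul autohold --tenant openstack --project openstack/octavia \
      --job <failing-job-name> --reason 'debug retry_limit failures' --count 2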
*** jbadiapa has joined #openstack-infra19:13
*** goldyfruit_ has quit IRC19:15
*** eharney has quit IRC19:20
*** bnemec has quit IRC19:27
*** bnemec has joined #openstack-infra19:32
*** goldyfruit has joined #openstack-infra19:33
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: Improve job and node information banner  https://review.opendev.org/67797119:33
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: Improve job and node information banner  https://review.opendev.org/67797119:34
zbrfungi: if is not too late, please take a look at https://review.opendev.org/#/q/project:opendev/bindep+status:open thanks.19:36
*** vesper11 has joined #openstack-infra19:41
openstackgerritClark Boylan proposed opendev/base-jobs master: Add cleanup phase to base(-test)  https://review.opendev.org/68110019:46
clarkbinfra-root ^ that might help us debug these problems better in the future19:46
*** ociuhandu has joined #openstack-infra19:51
*** igordc has joined #openstack-infra20:00
*** jtomasek_ has quit IRC20:01
*** ociuhandu has quit IRC20:03
*** ociuhandu has joined #openstack-infra20:03
*** michael-beaver has joined #openstack-infra20:04
clarkbcorvus: have a quick second for 681100? once that is in I can test it20:09
*** markmcclain has joined #openstack-infra20:10
clarkbsean-k-mooney: I think fn is enabled again if you want to retry your label checks (restarted the nodepool service so that will rule out config problems if it persists)20:10
clarkbor at least configs due to weird reloading of pools20:10
sean-k-mooneyclarkb: ack20:10
sean-k-mooneyi'll kick off the new label test and let you know if that works20:11
*** Lucas_Gray has joined #openstack-infra20:11
*** nicolasbock has quit IRC20:11
sean-k-mooneyrelated question: vexxhost has nested virt, as does FN. did packet ever turn it back on?20:11
fungiwe haven't had a usable packethost environment in... over a year?20:12
openstackgerritMerged opendev/bindep master: Fix emerge testcases  https://review.opendev.org/46021720:12
sean-k-mooneywell i think that answers my question20:12
clarkbya there was a plan to redeploy using osa but I don't think that ever happened20:13
sean-k-mooneymy hope is that if we have 2+ providers that can support the multi numa or single numa labels with nested virt (that would be FN and maybe vexxhost) then we could replace the now defunct intel nfv ci with a first party version that at a minimum did a nightly build20:15
sean-k-mooneyif that is stable then maybe we can promote it to non-voting or even voting in check20:15
*** trident has quit IRC20:15
clarkblimestone may also be able to help20:15
sean-k-mooneyyes perhaps. now that we have 1 provider i can continue to refine the job and we can explore other options as they become available20:16
sean-k-mooneyignoring the label issue the FN job seems to be working well20:17
clarkbI think the biggest gotcha is going to be those periodic updates to kernels that break it and needing to update the various layers to cope20:18
clarkbbut if the job is nonvoting and we communicate when that happens its probably fine?20:18
sean-k-mooneyhopefully20:18
*** nicolasbock has joined #openstack-infra20:18
sean-k-mooneythe intel nfv ci was always using nested virt and it was, at least for the first 2 years when it was staffed, fairly stable20:19
sean-k-mooneymost of the issues we had were not related to nested virt20:19
sean-k-mooneybut rather dns or our proxies/mirrors20:19
sean-k-mooneyi kicked off the new label job 68073820:20
sean-k-mooneyso we will see if that fails20:20
sean-k-mooneyor passes20:20
clarkbya we just know it breaks periodically20:21
clarkblogan-: has given us a bit of insight into that and aiui something in the middle layer's kernel will update, then we need to update the hypervisors to accommodate it and it starts working again20:21
sean-k-mooneyit's on by default now in kernels from 4.1920:21
*** ociuhandu has quit IRC20:22
clarkbya so only about 2 years away :)20:22
*** ociuhandu has joined #openstack-infra20:22
sean-k-mooneyit really bugs me that rhel8 is based on 4.1820:22
sean-k-mooneybecause it's an lts distro on a non-lts kernel20:22
*** iurygregory has quit IRC20:23
sean-k-mooneywell it's a redhat kernel so the version number doesn't really mean anything in relation to the upstream kernel anyway20:23
*** ociuhandu has quit IRC20:27
*** trident has joined #openstack-infra20:27
*** trident has quit IRC20:32
*** notmyname has quit IRC20:32
*** trident has joined #openstack-infra20:33
openstackgerritMerged openstack/project-config master: Add per-project Airship groups  https://review.opendev.org/68071720:35
*** notmyname has joined #openstack-infra20:42
*** ociuhandu has joined #openstack-infra20:44
*** kopecmartin is now known as kopecmartin|off20:45
johnsomclarkb rax-iad https://zuul.openstack.org/stream/cd166c6ecde942259f27abaeef8b89fa?logfile=console.log20:45
johnsomI am getting remote logs for this one too20:46
clarkbjohnsom: placed a hold on it20:47
clarkbI'll also ssh in now to see if my connection dies20:47
johnsomI am going to run grab lunch as it will be at least 20 minutes for devstack.20:48
clarkbk I'll keep an eye on it out of the corner of my eye20:48
openstackgerritMerged opendev/bindep master: Fix apk handling of warnings/output problems  https://review.opendev.org/60243320:50
*** markvoelker has quit IRC21:02
*** ociuhandu has quit IRC21:09
johnsomclarkb Tempest is just starting21:11
clarkbI'm still ssh'd in21:12
*** Goneri has quit IRC21:14
clarkbI don't think this is the problem but I wonder if neutron wants to clean it up: netlink: 'ovs-vswitchd': attribute type 5 has an invalid length.21:15
johnsomYeah I am seeing a lot of those too21:16
johnsomNeutron does seem to do a lot of stuff at the start of tempest. Its log was scrolling pretty fast21:16
*** markvoelker has joined #openstack-infra21:16
fungisean-k-mooney: is that ovs-vswitchd attribute length error something you're aware of?21:17
clarkbI'm running watch against ip addr show eth0 ; df -h ; w ; ps -elf | grep tempest21:19
fungigood call21:22
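A variant of that watch loop which also keeps the samples on disk, so the last snapshot survives if the ssh session dies with the node; the interval and output path here are arbitrary choices:

    while true; do
      { date; ip addr show eth0; df -h; w; ps -elf | grep '[t]empest'; } >> /opt/node-watch.log
      sleep 10
    done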
johnsomboom kernel oops21:23
fungiso we are triggering a panic?21:23
johnsomhttps://www.irccloud.com/pastebin/7VeZaVsl/21:23
clarkbI didn't get that with my watch21:23
fungiand i guess it's hypervisor-specific21:23
fungiso probably only happening on xen guests21:23
clarkblooks like an ovs issue21:24
clarkbexciting21:24
fungineat. sean-k-mooney may be interested in that too21:24
johnsom"exciting" Yes, yes it has been given it's freeze week21:25
fungiany idea what might have changed around that in the relevant timeframe where the behavior seems to have emerged?21:26
johnsomNothing in our project, that is why this has been a pain.21:26
clarkbcould be a kernel update in bionic21:26
clarkbor a xen update in the cloud(s)21:27
fungineutron changes? new ovs version? is it just happening on ubuntu bionic or do we have suspected failures on other distros/versions?21:27
fungibut yeah, at least it's happening around feature freeze week, not release week21:27
clarkbgetting centos version of this test to run on rax to generate a bt (assuming it fails at all) would be a great next step I expect21:28
fungigetting a minimum reproducer might make that easier too21:29
*** jamesmcarthur has quit IRC21:29
clarkbalso could it be related to the invalid length error above?21:29
fungia distinct possibility21:29
johnsomWe started seeing it on the 3rd, so something around that time.21:30
johnsomLooking to see if the centos-7 version of this has ever done this. (though centos has had its own issues)21:30
fungiif the ovs error messages show up in success logs too, then that seems unlikely to be related (though still possible i suppose)21:30
johnsomWell, I don't see them in this successful run: https://zuul.opendev.org/t/openstack/build/065acaa6544342348b3eef799e627696/log/controller/logs/syslog.txt.gz assuming they go into syslog too21:31
johnsomOk, for one patch, I have not seen it on the centos7 version.21:33
clarkbjohnsom: they are in that log you linked21:33
clarkbhttps://zuul.opendev.org/t/openstack/build/065acaa6544342348b3eef799e627696/log/controller/logs/syslog.txt.gz#2399 for example21:33
johnsomOh, I must have typo'd21:33
clarkbso ya likely not related but could potentially still be if xen somehow tickles a corner case around the handling of that21:34
clarkbeither way cleaning it up would aid debugging because its one less thing to catch your eye and be confused about21:34
fungiyeah, maybe that's still a bug but only leads to a fatal condition on xen guests?21:34
fungiwild guesses at this stage though21:35
clarkbhttps://patchwork.ozlabs.org/patch/1012045/ I think it may just be noise21:36
fungiseems that way21:37
fungiso do we know which tempest test seems to have triggered the panic?21:37
clarkbif the node did end up being held I may be able to restart the server and get the stestr log21:38
clarkbI'll work on that now21:38
* fungi shakes fist at broadband provider... nearly 7 hours so far on repairing a fiber cut somewhere on the mainland21:39
clarkbdoes not look like holds apply in this case (because the job is retried?)21:39
fungiyou were doing a ps though... do you have the last one from before it died? can we infer tests from that?21:40
fungior i guess the console log stream also mentioned tests21:40
*** e0ne has quit IRC21:40
clarkbthe ps doesn't tell you what tests are running21:41
clarkbjust that there were 4 processes running tests21:41
*** snapiri has quit IRC21:41
johnsomI was a bit surprised to see it running 4 for the concurrency, but I also tried locking it down to 2 with no luck21:42
fungiahh, yep, you used the f option (so included child processes) but then omitted those lines with the grep for "tempest"21:42
clarkbfungi: we don't start new processes for each test though21:43
clarkbthe 4 child processes run their 1/4 of the tests loaded from stdin iirc21:44
fungisure, thought it might be inferrable from whatever shell commands were appearing in child process command lines, but also probably too short-lived for the actual culprit to be captured by watch21:44
fungii guess if we can reproduce it with tempest concurrency set to 1 that might allow narrowing it down21:45
fungiit happens fairly early in the testset right?21:46
clarkbya I guess that depends on whether or not neutron uses the api or cli commands to interact with ovs there. That said, it almost looks like the failure was in frame processing21:46
clarkbentry_SYSCALL_64_after_hwframe+0x3d/0xa221:46
clarkband crashes in do_output21:46
clarkbso probably not easy to track that back to a specific test21:46
fungiyeah, maybe ethernet frame forwarding?21:47
fungiin which case perhaps any test which produces a fair amount of guest traffic might tickle it21:48
openstackgerritJames E. Blair proposed zuul/zuul master: WIP: Add support for the Gerrit checks plugin  https://review.opendev.org/68077821:49
openstackgerritJames E. Blair proposed zuul/zuul master: WIP: Add enqueue reporter action  https://review.opendev.org/68113221:49
clarkbfungi: ya21:49
clarkbthinking out loud here I wonder if having zuul retry jobs that blow up like this actually makes it harder to debug21:50
clarkbfewer people notice because most of the time the retries probably do work by running on another cloud21:50
clarkband you can't hold the test nodes21:50
clarkbetc21:50
clarkbhttps://launchpad.net/ubuntu/+source/linux/4.15.0-60.67 went out on the third fwiw21:53
fungipotential culprit, yep. is that what our images have?21:54
clarkb4.15.0-60-generic #67-Ubuntu is from the panic so ya I think they are21:55
clarkbhttp://launchpadlibrarian.net/439948149/linux_4.15.0-58.64_4.15.0-60.67.diff.gz is the diff between the code that went out on the 3rd and what was in ubuntu prior21:56
sean-k-mooneywhat does the waiting state mean in the zuul dashboard. how is it different from queued21:57
sean-k-mooneydoes that mean its waiting for quota/nodepool?21:58
sean-k-mooneybut zuul has selected the job to run when the nodeset has been fulfilled?21:58
johnsomhttps://www.irccloud.com/pastebin/Tnfw12ug/21:58
johnsomSo, this is fixed in -6221:58
*** markvoelker has quit IRC21:59
johnsomWonder why the instance wasn't using 62.21:59
* johnsom feels like the canary project for *world*22:00
clarkbsean-k-mooney: from tobiash a few days ago 04:25:11*        tobiash | pabelanger, clarkb, SpamapS: waiting means waiting for a dependency or semaphore (basically no node request created yet), queued means waiting for nodes22:00
clarkbjohnsom: 62 was only published 7 hours ago22:01
*** markvoelker has joined #openstack-infra22:01
johnsomHa, so changelog is much earlier than availability... lol22:01
sean-k-mooneyok so its waiting in the merger?22:01
*** goldyfruit has quit IRC22:01
ianwinfra-root: could i get a +w on https://review.opendev.org/#/c/680895/ so 3rd party testinfra is working please, thanks22:02
clarkbianw trade you https://review.opendev.org/#/c/681100/22:02
johnsomianw Thanks for the role. It found the kernel panic22:03
sean-k-mooneyit went from queued to waiting but i don't see a node in http://zuul.openstack.org/nodes that matches the label. so it's unclear why it transitioned state22:03
pabelangersean-k-mooney: if multinode job, likely waiting for node in nodeset22:04
ianwjohnsom: oh nice!  is it a known issue or unique?22:04
pabelangerotherwise, nodepool is running at capacity and no nodes avaiable22:04
pabelangeravailable*22:04
johnsomianw Evidently the fixed kernel went out 7 hours ago22:05
clarkbjohnsom: note that bug is specific to fragmented packets which is somethign we do on our ovs bridges because nesting. So ya seems like really good match and likely is the fix22:05
ianwjohnsom: haha well nothing like living on the edge!22:05
sean-k-mooneypabelanger: it is a multi node job but neither node has been created. could it be waiting for an io operation on the cloud such as uploading an image?22:05
pabelangersean-k-mooney: http://grafana.openstack.org/dashboard/db/zuul-status gives a good overview of what zuul is doing22:05
clarkbsean-k-mooney: no we never wait for images to upload unless there are no images22:05
johnsomclarkb Path forward?  How soon will that hit the mirrors?22:05
pabelangersean-k-mooney: no, image uploads shouldn't be the issue here. Sounds like maybe a quota issue22:05
clarkbsean-k-mooney: waiting is "normal" I wouldn't worry about it22:05
*** markvoelker has quit IRC22:06
pabelangersean-k-mooney: cannot boot node on cloud, because no space22:06
sean-k-mooneyclarkb: ok22:06
johnsomclarkb matches to the line number in the launchpad bug22:06
sean-k-mooneyya i can see FN is at capacity more or less http://grafana.openstack.org/d/3Bwpi5SZk/nodepool-fortnebula?orgId=122:06
pabelangerthere are some odd interactions with multinode jobs and quota on clouds. It is possible for a job to boot 1 of 2 nodes fine, but by the time the 2nd of 2 boots, there is no quota22:06
pabelangerthen you are stuck on the cloud waiting22:06
pabelangerwe have that issue in zuul.ansible.com22:07
clarkbjohnsom: we update our mirrors every 4 hours and build images every 24 ish hours22:07
clarkbjohnsom: I've just checked the mirror and don't see it at http://mirror.dfw.rax.openstack.org/ubuntu/dists/bionic-updates/main/binary-amd64/Packages22:08
sean-k-mooneyin this case neither node is running (it uses a special label so i can tell from http://zuul.openstack.org/nodes) but it looks like it's waiting for space22:08
clarkbjohnsom: so I would guess anywhere from the next 4-24 hours ish22:08
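One way to watch for it is to poll the Packages index on the mirror clarkb just checked; the package name below assumes the -62 ABI from the changelog:

    # the Version field shows up a few lines under the Package stanza
    curl -s http://mirror.dfw.rax.openstack.org/ubuntu/dists/bionic-updates/main/binary-amd64/Packages \
      | grep -A3 '^Package: linux-image-4.15.0-62-generic'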
pabelangersean-k-mooney: yah, that would be my guess too22:08
*** goldyfruit has joined #openstack-infra22:08
pabelangeryou feel it much more, when running specific labels in a single cloud22:08
clarkbhttp://grafana.openstack.org/d/3Bwpi5SZk/nodepool-fortnebula?orgId=1 you can see that cloud is at capacity22:09
sean-k-mooneypabelanger: well this might still be a good thing. previously that label was going to node_error so this might actually be an improvement22:09
clarkbthe sun just came out so I'm going to sneak out for a quick bike ride before the rain returns22:09
clarkbback in a bit22:10
*** ociuhandu has joined #openstack-infra22:10
fungijohnsom: moral of this story is if you'd just put off debugging another day it would have fixed itself? ;)22:13
johnsomYeah, that probably would have happened if it wasn't freeze week22:14
* fungi suspects that's a terrible moral for a story anyway22:14
*** ociuhandu has quit IRC22:15
johnsomYeah, burned a good week trying to figure out why we couldn't merge anything22:15
fungiwhat's especially interesting is octavia's testing was thorough enough to hit this bug when neutron's was not22:16
johnsomI also wonder if no one is actually testing with floating IPs.... Given this is tied to NAT. It's not like we are load testing either....22:16
johnsomYeah, not the first time22:17
*** slaweq has quit IRC22:18
*** goldyfruit_ has joined #openstack-infra22:21
*** goldyfruit has quit IRC22:24
adriantIs this the irc channel to talk devstack issues?22:25
adriantor does devstack have its own channel?22:25
johnsomadriant General devstack issues are in #openstack-qa22:26
fungiadriant: you're looking for #openstacl-qa22:26
johnsomWhat he said.... grin22:27
fungier, the one johnsom said. you know, with the accurate typing22:27
adriantjohnsom wins :P22:27
adriantty22:27
pabelangerianw: in devstack, is there an option to enable nested virt for centos? Like is it modifying modprobe.d configs?22:28
pabelangernot that I am asking to enable it, want to see how it is done22:29
pabelangercause I need to modprob -r kvm && modprobe kvm to toggle it22:29
pabelangernot sure why RPM  isn't reading modprobe config22:30
Roamer`pabelanger, judging by the fact that devstack's doc/source/guides/devstack-with-nested-kvm.rst explains how to do the rmmod/modprobe dance for different CPU types, and from the fact that there is no mention of rmmod in devstack itself other than this file, I'd say most probably not :/22:34
openstackgerritMerged opendev/system-config master: Set zuul_work_dir for tox testing  https://review.opendev.org/68089522:36
*** markvoelker has joined #openstack-infra22:36
pabelangerRoamer`: thanks! I should read the manual next time. that is exactly what I am doing too22:36
pabelangerrmmod22:37
pabelangerabove was a typo :)22:37
*** bobh has joined #openstack-infra22:37
johnsompabelanger export DEVSTACK_GATE_LIBVIRT_TYPE=kvm is the setting for devstack to setup nova for it.22:39
*** EmilienM is now known as little_script22:39
pabelangerty22:39
Roamer`pabelanger, what do you mean RPM is not reading the modprobe config though? is it possible that the module has been already loaded, maybe even at boot time, and you modifying the config later has no effect without, well, reloading it using rmmod/modprobe? :)22:39
pabelangerRoamer`: so, if I setup /etc/modprobe.d/kvm.conf with 'options kvm_intel nested=1' then yum install qemu-kvm, nested isn't enabled. I need to bounce the module, for it to work22:41
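For reference, the reload-and-verify dance being described looks roughly like this on an Intel host (kvm_amd is the equivalent module on AMD; the reload fails if any guests are running):

    cat /sys/module/kvm_intel/parameters/nested            # N while nested virt is off
    echo 'options kvm_intel nested=1' | sudo tee /etc/modprobe.d/kvm.conf
    sudo modprobe -r kvm_intel && sudo modprobe kvm_intel  # bounce the module so the option takes effect
    cat /sys/module/kvm_intel/parameters/nested            # should now report Y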
*** factor has quit IRC22:41
pabelangernot an issue, but surprised it wasn't loaded properly22:41
*** factor has joined #openstack-infra22:41
*** little_script is now known as EmilienM22:42
*** bobh has quit IRC22:45
Roamer`pabelanger, well, sorry if I'm being obtuse and asking for things that may be obvious and you may have already checked, but still, are you really sure that the kvm_intel module has not been already loaded even before you installed qemu-kvm? I can't really think of anything that would want to load it, it's certainly not loaded by default, but it's usually part of the kernel package, so it will22:45
Roamer`have been *installed* before you install qemu-kvm22:45
*** xenos76 has quit IRC22:46
*** markvoelker has quit IRC22:46
pabelangergood point, I should check that22:46
pabelangerI assumed qemu-kvm was pulling it in22:47
clarkbpabelanger: note you only set nested=1 on the first hypervisor22:48
clarkbon your second you can consume the nested virt without enabling it for the next layer22:48
*** happyhemant has quit IRC22:49
*** rlandy is now known as rlandy|bbl22:50
*** icarusfactor has joined #openstack-infra22:51
*** mriedem has quit IRC22:51
*** goldyfruit_ has quit IRC22:51
clarkbpabelanger: also note you cannot enable nested virt from the middle hypervisor if the first does not have it enabled22:52
clarkband note that it crashes a lot22:52
clarkbbut in linux 4.19 it is finally enabled by default for intel cpus so in theory its a lot better past that point in time22:53
*** factor has quit IRC22:53
*** tkajinam has joined #openstack-infra22:55
pabelangerclarkb: Yup agree! Not going to run this in production, mostly wanted to understand how people enabled it for jobs22:56
*** Lucas_Gray has quit IRC22:57
*** rcernin has joined #openstack-infra22:59
ianwclarkb: :/  ubuntu mirroring failed @  2019-09-05T22:38:09,181762752+00:00 ish ... http://paste.openstack.org/show/774655/23:11
clarkbianw: similar problems to fedora?23:11
*** owalsh has quit IRC23:11
clarkbauth expired because vos release took too long?23:11
ianwyeah, seems likely.  note that's mirror-update.openstack.org so the old server, and also rsync isn't involved there23:12
*** Lucas_Gray has joined #openstack-infra23:14
*** owalsh has joined #openstack-infra23:14
*** threestrands has joined #openstack-infra23:14
ianwistr from scrollback some sort of issues, and davolserver was restarted @ "Sep  6 18:16 /proc/4595"23:18
clarkbya afs02.dfw was out to lunch (really high load and ssh not responding) the console log showed it had a bunch of kernel timeouts for disks23:18
clarkband processes so we rebooted it23:18
clarkbother vos releases were working but we didn't check all of them, its possible we should've checked all of them23:18
ianwhrm, so that failure happened before the reboot, and could be explained by afs02 being in bad state then23:21
clarkboh I didn't notice the date on the failure23:21
ianwmy only concern though is that if i unlock the volume, r/o needs to be completely recreated and maybe *that* will timeout now23:21
clarkbbut ya likely explained by that23:21
clarkbianw: ya in the past when I've manually unlocked I've done a manual sync with the lock held then run the vos release from screen on afs01.dfw using localauth23:22
clarkbthe localauth doesn't timeout23:22
ianwright, it just seems like most volumes are in this state :/23:22
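The localauth recovery clarkb describes is roughly the following, run as root on afs01.dfw, with mirror.ubuntu standing in for whichever volume is locked; -localauth uses the server key directly so a long release can't outlive a token:

    vos unlock mirror.ubuntu -localauth
    vos release mirror.ubuntu -localauth -verbose
    vos examine mirror.ubuntu -localauth    # confirm the RO sites are back in sync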
clarkbaspiers: following up on the zuul backlog. I've tracked at least part of the problem back to ironic retrying tempest jobs multiple times due to filling up disks on rax instances23:25
clarkbaspiers: that means we get 3 attempts * 3 hour timeouts we have to wait for23:25
aspiersclarkb: ouch, nice find!23:25
clarkbso those ironic changes hang out in the queue for forever23:26
sean-k-mooneyclarkb the jobs still ended up in node failure so it looks like the nodepool restart was not the issue.23:27
clarkbsean-k-mooney: ok good to rule out the config loading then23:27
clarkbsean-k-mooney: let me check if we are still having ssh errors23:27
sean-k-mooneythis was the patch that failed if that helps but it's the only job that uses the numa labels23:28
sean-k-mooneyhttps://review.opendev.org/#/c/680738/23:28
clarkbsean-k-mooney: same http error23:28
sean-k-mooneyok23:28
clarkbsean-k-mooney: I'm going to dig through the logs for a specific instance to see if I can napkin math verify we are hitting the timeout23:28
sean-k-mooneyum, for now i'm just going to rework the job to use labels we know work.23:29
sean-k-mooneyi would like to have this working before FF to hopefully help merge the numa migration feature it's testing23:29
*** dchen has joined #openstack-infra23:29
clarkblooks like nova says the instance is active in about 35 seconds then we timeout after 120 seconds23:30
fungislow to boot then?23:30
sean-k-mooneyso it's hitting the ssh timeout23:30
sean-k-mooneyfungi: it should not be. it has a numa topology but that normally makes it faster23:31
clarkbsean-k-mooney: ya the traceback is definitely the ssh timeout. I just wanted to make sure the math wasn't coming up short, but the logging timestamps seem to indicate it isn't23:31
sean-k-mooneyit also has more ram at least on the controller23:31
fungigrabbing a nova console log from a boot attempt ought to help23:31
sean-k-mooney16G instead of 823:31
clarkbspecifically it is trying to scan the ssh hostkeys23:31
clarkbbrainstorming: we could be failing to get entropy to generate host keys?23:31
sean-k-mooneyits really strange23:31
sean-k-mooneyoh23:32
clarkband sshd won't start until it has generated them23:32
sean-k-mooneyyes we could23:32
clarkbdoes numa affect entropy in VMs? also we run haveged23:32
sean-k-mooneywe could enable the hardware random number gen in the guest23:32
fungior it could be timing out waiting for device scanning to settle or a particular block device to appear or any number of other things that it waits for during boot23:32
clarkbfungi: ya23:32
sean-k-mooneyclarkb: it should not23:32
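If a node in that state can ever be reached, a few cheap checks would confirm or rule out the entropy theory; these are just the standard kernel interfaces, nothing job specific:

    cat /proc/sys/kernel/random/entropy_avail    # persistently tiny values suggest starvation
    systemctl status haveged                     # is the entropy daemon actually running?
    cat /sys/class/misc/hw_random/rng_current    # e.g. "virtio_rng.0" if a hw rng is wired into the guest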
clarkbfungi: when I booted by hand though we had none of those issues23:32
clarkbfungi: we can try booting by hand more23:33
* clarkb tries now23:33
sean-k-mooneyfungi: also we have other labels that provide identical flavor with a different name that work23:33
fungioh, that's wacky23:33
sean-k-mooneyya23:33
clarkbthe hostids for all three attempts were different23:34
clarkbdonnyd: 71a9bf7925f98026e8a268d2cda4f8623c4812e3025626a2b395e7b0 is the hostid and it tried booting f5708f13-1918-44ce-8c2f-3676712d12a023:34
clarkbif you want to double check the hypervisor for anything funny23:34
fungicould nodepool be trying the wrong interface's address maybe?23:34
sean-k-mooneyi think it's booting with just one interface and it seems to be accessing the ipv6 ip23:35
clarkbnodepool.exceptions.ConnectionTimeoutException: Timeout waiting for connection to 2001:470:e045:8000:f816:3eff:fe57:3b6c on port 2223:35
sean-k-mooneywell it might have more interfaces23:35
ianwclarkb: so basically every volume of interest is locked ... http://paste.openstack.org/show/774658/23:35
clarkbthat is 387fa4f4-f69a-4654-9c39-c9fd3443bd6a on 247bdf6d16fe8bfc4c76cbe3a03e5933b4215077c60e516b1a26fbd823:35
sean-k-mooneythat's a publicly routable ipv6 address23:35
clarkbianw: bah ok23:35
ianwat this point, i think the best option is to shutdown the two mirror-updates to stop things getting any worse, and do localauth releases of those volumes23:36
clarkbianw: don't we need to run their respective syncs before vos releasing?23:36
clarkbI suppose if we are happy with the RW state then your plan will work23:36
donnydI am afk atm23:37
donnydI can take a look when I get back23:37
sean-k-mooneyfungi: these are the labels/flavor/image/keys that we are using23:37
sean-k-mooneyhttps://github.com/openstack/project-config/blob/master/nodepool/nl02.openstack.org.yaml#L343-L35723:37
sean-k-mooneyubuntu-bionic-expanded works fine23:38
ianwclarkb: yeah, for mine i think we get the volumes back in sync, and then let the normal mirroring process run23:38
*** Lucas_Gray has quit IRC23:38
sean-k-mooneywhen i use multi-numa-ubuntu-bionic-expanded or multi-numa-ubuntu-bionic it does not work23:38
fungisean-k-mooney: okay, so they're using different flavors23:38
sean-k-mooneyfungi: the flavors are identical https://www.irccloud.com/pastebin/FWxMEIqc/23:38
clarkbfungi: the flavors are named differently but have the same attributes. However maybe there are attributes not exposed by flavor show that are different23:39
sean-k-mooneywell the non-expanded one has 8G instead of 16G of ram but otherwise they are the same23:39
clarkbI'm able to get right in on the node I just booted with multi-numa-expanded23:40
sean-k-mooneyclarkb: could you test a multi-numa instance instead of the expanded one. i mean ssh runs fine on a 64mb cirros image but just in case23:41
*** icarusfactor has quit IRC23:41
clarkb23:39:05 is first entry in dmesg -T, 23:39:11 is sshd starting, 23:39:45 is me logging in23:41
clarkbsean-k-mooney: ya though the one I have failure for is the expanded label23:42
sean-k-mooneyoh ok23:42
*** exsdev has quit IRC23:42
*** exsdev0 has joined #openstack-infra23:42
*** exsdev0 is now known as exsdev23:42
sean-k-mooneywell honestly i have done 90% of my openstack dev in multi numa vms for the last 6 years and i have never seen it have any effect on boot time or time to ssh working23:43
clarkbI can also hit it from nl02 via ipv6 to port 22 with telnet23:43
clarkbjust ruling out networking problems between nl02 and fn23:43
* clarkb tests the non expanded just to be safe23:43
clarkbboth flavors seem to work manually23:45
clarkbI did get a brief no route to host then host finished booting and ssh worked23:45
clarkball within a minute, well under the timeout23:46
sean-k-mooneydonnyd: how do you get your ipv6 routing?23:46
sean-k-mooneycould there be a propagation delay in the route being accessible?23:46
sean-k-mooneyalthough that would not explain why the other label seems to work at least most of the time23:47
sean-k-mooneyi think i had some node_failures with ubuntu-bionic-expanded but i think we tracked those to quota23:47
sean-k-mooneyi assume we have not seen other node_failures with FN?23:48
sean-k-mooneyit's just this specific set of labels?23:48
clarkbnot that I know of23:49
clarkbthese are the only labels that are fn specific23:50
clarkbother clouds can pick up the slack if it happens then we don't see a node failure, just another cloud serviing the request23:50
clarkblet me see if I can grep for this happening in fn on other labels23:50
sean-k-mooneywell the ubuntu-bionic-expanded and centos-7-expanded are also23:51
sean-k-mooneybut nothing was previously using them23:51
clarkb2019-09-09 18:36:39,267 ERROR nodepool.NodeLauncher: [node: 0011030429] Launch failed for node opensuse-tumbleweed-fortnebula-regionone-001103042923:51
clarkbit is happening for other labels too23:51
sean-k-mooneyso maybe in other cases it's retrying on a different provider23:52
clarkbya23:52
sean-k-mooneythis is not a host key thing right like when people reused ipv4 addresses23:53
sean-k-mooneyits not able to connect rather than being rejected23:53
clarkbsean-k-mooney: correct my read of it is tcp is failing to connect to port 2223:55
clarkbso ya a networking issue in the cloud could explain it23:55
sean-k-mooneywell we do have that recurring issue in the tempest test where ssh sometimes does not work...23:55
ianw#status log mirror-update.<opendev|openstack>.org shutdown during volume recovery.  script running in root screen on afs01 unlocking and doing localauth releases on all affected volumes23:56
openstackstatusianw: finished logging23:56
sean-k-mooneywhich i believe is somehow related to the l3 agent. i wonder if there's a race with neutron setting up routing for the vm23:56
fungiso if it's just a sometimes network issue, retrying that job should work sometimes and return node_error other times23:58
clarkbfungi: ya though with the way queues have been it might be a while :/23:58
clarkbI've not heard anything from ironic re the disk issues yet. here is hoping that is because of CEST timezones23:59
sean-k-mooneyfungi: ya so it could be bad luck that the old ubuntu expanded label seems to work and this one does not.23:59
