Wednesday, 2018-09-26

*** Tim_ok has quit IRC00:02
*** sthussey has quit IRC00:22
*** felipemonteiro has joined #openstack-infra00:24
openstackgerritMerged openstack-infra/zuul-jobs master: use find instead of ls to list interfaces  https://review.openstack.org/60467700:26
*** longkb has joined #openstack-infra00:31
*** jamesdenton has quit IRC00:36
openstackgerritIan Wienand proposed openstack/diskimage-builder master: Allow debootstrap to cleanup without a kernel  https://review.openstack.org/60469200:42
openstackgerritMerged openstack-infra/zuul-jobs master: Add Gentoo iptables handling  https://review.openstack.org/60468800:50
Shrewsclarkb: not sure if you're still seeing it, but if nodes are locked after restarting all launchers, then zuul has the lock and you shouldn't try to undo that00:51
*** hashar has joined #openstack-infra00:56
*** felipemonteiro has quit IRC00:57
clarkbShrews: even after 14 days?00:59
clarkbmaybe it is a hold?00:59
ianwclarkb: there are a couple of held bhs1 nodes ... that was what i was looking at for the old configs01:02
clarkbah maybe we can clean those up now?01:03
*** rlandy has joined #openstack-infra01:05
ianwclarkb: hrm, none of the bhs1 nodes currently seem to have a comment01:06
ianwhrm, well maybe it never had a comment.  0001975067 is the node i'm talking about from https://review.openstack.org/#/c/603988/ ... that was one of the few nodes that remained after the region was shutdown01:10
*** graphene has joined #openstack-infra01:12
prometheanfirejust one more review is needed for https://review.openstack.org/#/c/602439/ to get gentoo as a usable image01:15
prometheanfire:(01:17
*** rlandy has quit IRC01:19
*** mrsoul has joined #openstack-infra01:19
*** harlowja has quit IRC01:22
*** hashar has quit IRC01:26
openstackgerritIan Wienand proposed openstack-infra/puppet-graphite master: [wip] rpsec: check service running  https://review.openstack.org/60528601:26
*** graphene has quit IRC01:26
*** owalsh_ has joined #openstack-infra01:29
Shrewsclarkb: I don't think nodes in hold are locked. There is a bug in zuul somewhere that keeps nodes locked, but I forget what triggers it.01:31
ShrewsThat might be what you're seeing01:31
*** owalsh has quit IRC01:33
*** rh-jelabarre has quit IRC01:34
pabelangerIIRC, it happens when a dynamic reload happens during a noderequest, if the job is removed from zuul, we leak the noderequest and will never unlock01:35
*** hongbin has joined #openstack-infra01:39
*** jamesdenton has joined #openstack-infra01:46
*** annp has joined #openstack-infra01:47
openstackgerritMatthew Thode proposed openstack-infra/openstack-zuul-jobs master: add Gentoo jobs and vars and also fix install test  https://review.openstack.org/60243901:58
*** rkukura has quit IRC02:00
*** adriant has quit IRC02:05
*** adriant has joined #openstack-infra02:14
*** adriant has quit IRC02:15
*** adriant has joined #openstack-infra02:17
*** adriant has quit IRC02:31
*** psachin has joined #openstack-infra02:39
*** graphene has joined #openstack-infra02:40
*** apetrich has quit IRC02:42
*** Bhujay has joined #openstack-infra02:49
*** imacdonn has quit IRC02:50
*** imacdonn has joined #openstack-infra02:50
*** felipemonteiro has joined #openstack-infra03:02
*** adriant has joined #openstack-infra03:07
*** Bhujay has quit IRC03:07
*** felipemonteiro has quit IRC03:19
*** ijw has joined #openstack-infra03:19
*** diablo_rojo has quit IRC03:19
*** ijw has quit IRC03:24
*** ramishra has joined #openstack-infra03:26
*** jamesmcarthur has joined #openstack-infra03:42
*** dave-mccowan has quit IRC03:44
*** dave-mccowan has joined #openstack-infra03:46
*** jamesmcarthur has quit IRC03:48
*** jamesmcarthur has joined #openstack-infra03:49
*** armax has quit IRC03:53
*** udesale has joined #openstack-infra03:53
openstackgerritMerged openstack-infra/project-config master: Fix not working kolla graphs  https://review.openstack.org/60502603:54
*** ykarel has joined #openstack-infra03:59
*** hongbin has quit IRC03:59
*** felipemonteiro has joined #openstack-infra04:06
*** pcaruana has joined #openstack-infra04:14
*** jamesmcarthur has quit IRC04:21
*** jamesmcarthur has joined #openstack-infra04:23
*** jamesmcarthur has quit IRC04:27
*** yamamoto has quit IRC04:31
*** yamamoto has joined #openstack-infra04:31
*** pcaruana has quit IRC04:38
openstackgerritMerged openstack-infra/zuul master: Web: don't update the status cache more than once  https://review.openstack.org/60524304:57
*** auristor has quit IRC05:03
*** jamesmcarthur has joined #openstack-infra05:05
*** e0ne has joined #openstack-infra05:08
*** jamesmcarthur has quit IRC05:10
*** jamesmcarthur has joined #openstack-infra05:25
*** Bhujay has joined #openstack-infra05:26
*** jamesmcarthur has quit IRC05:31
cloudnullfungi clarkb been afk - still need anything with that instance?05:32
*** Bhujay has quit IRC05:32
*** rkukura has joined #openstack-infra05:33
cloudnullgoing to bed finally however if it needs digging into let me know, i'll tackle it first thing in the morning05:35
*** auristor has joined #openstack-infra05:39
*** dave-mccowan has quit IRC05:39
*** quique|rover|off is now known as quiquell|rover05:40
prometheanfireyarp05:40
*** pcaruana has joined #openstack-infra05:43
*** apetrich has joined #openstack-infra05:46
*** jamesmcarthur has joined #openstack-infra05:47
*** e0ne has quit IRC05:49
*** Bhujay has joined #openstack-infra05:49
*** felipemonteiro has quit IRC05:50
*** rkukura has quit IRC05:50
*** jamesmcarthur has quit IRC05:51
*** jistr has quit IRC05:55
*** jistr has joined #openstack-infra05:56
*** jamesmcarthur has joined #openstack-infra06:08
*** jamesmcarthur has quit IRC06:12
*** jtomasek has joined #openstack-infra06:13
openstackgerritIan Wienand proposed openstack-infra/system-config master: [wip] port graphite setup to ansible  https://review.openstack.org/60533606:14
*** aojea has joined #openstack-infra06:22
*** dpawlik has joined #openstack-infra06:23
*** diablo_rojo has joined #openstack-infra06:27
*** jamesmcarthur has joined #openstack-infra06:28
*** jamesmcarthur has quit IRC06:33
*** quiquell|rover is now known as quique|rover|brb06:43
*** chkumar|off is now known as chkumar|ruck06:44
*** ijw has joined #openstack-infra06:46
*** jamesmcarthur has joined #openstack-infra06:50
*** graphene has quit IRC06:50
*** ijw has quit IRC06:50
*** graphene has joined #openstack-infra06:51
*** jamesmcarthur has quit IRC06:54
*** ginopc has joined #openstack-infra06:57
*** quique|rover|brb is now known as quiquell|rover07:00
*** rcernin has quit IRC07:02
*** alexchadin has joined #openstack-infra07:03
*** hashar has joined #openstack-infra07:06
*** ijw has joined #openstack-infra07:09
*** jamesmcarthur has joined #openstack-infra07:10
*** olivierb has joined #openstack-infra07:13
*** ijw has quit IRC07:13
*** jamesmcarthur has quit IRC07:15
*** olivierb has quit IRC07:17
*** olivierb has joined #openstack-infra07:17
openstackgerritAndreas Jaeger proposed openstack-infra/openstack-zuul-jobs master: Remove tricircle dsvm jobs  https://review.openstack.org/60534407:19
*** psachin has quit IRC07:21
*** shardy has joined #openstack-infra07:23
*** alexchadin has quit IRC07:25
*** psachin has joined #openstack-infra07:26
*** jamesmcarthur has joined #openstack-infra07:31
*** jamesmcarthur has quit IRC07:37
*** ijw has joined #openstack-infra07:44
*** jpich has joined #openstack-infra07:48
*** ijw has quit IRC07:50
*** alexchadin has joined #openstack-infra07:51
*** jpena|off is now known as jpena07:51
*** jamesmcarthur has joined #openstack-infra07:53
*** rcernin has joined #openstack-infra07:56
*** alexchadin has quit IRC07:57
*** jamesmcarthur has quit IRC07:58
*** ykarel is now known as ykarel|lunch07:59
*** alexchadin has joined #openstack-infra07:59
*** Guest42266 has joined #openstack-infra08:05
*** hashar has quit IRC08:06
*** hashar has joined #openstack-infra08:06
*** jamesmcarthur has joined #openstack-infra08:09
*** e0ne has joined #openstack-infra08:10
openstackgerritChandan Kumar proposed openstack-infra/project-config master: Added twine check functionality to python-tarball playbook  https://review.openstack.org/60509608:11
*** jamesmcarthur has quit IRC08:13
*** Emine has joined #openstack-infra08:17
*** e0ne has quit IRC08:20
*** ykarel|lunch is now known as ykarel08:21
*** jamesmcarthur has joined #openstack-infra08:24
*** noama has joined #openstack-infra08:25
*** jamesmcarthur has quit IRC08:29
*** jistr has quit IRC08:30
*** jistr has joined #openstack-infra08:31
*** derekh has joined #openstack-infra08:33
*** derekh has quit IRC08:33
*** derekh has joined #openstack-infra08:34
*** shardy has quit IRC08:35
*** shardy has joined #openstack-infra08:36
*** jamesmcarthur has joined #openstack-infra08:40
*** jamesmcarthur has quit IRC08:45
*** owalsh_ is now known as owalsh08:45
*** olivier__ has joined #openstack-infra08:46
*** olivierb has quit IRC08:46
*** tosky has joined #openstack-infra08:48
*** apetrich has quit IRC08:49
openstackgerritFabien Boucher proposed openstack-infra/zuul master: Doc: executor operations document pause, remove graceful  https://review.openstack.org/60245508:55
openstackgerritFabien Boucher proposed openstack-infra/zuul master: Doc: executor operations - explain jobs will be restarted at restart  https://review.openstack.org/60313608:55
*** jamesmcarthur has joined #openstack-infra08:56
*** chkumar|ruck has quit IRC08:58
*** chandankumar has joined #openstack-infra08:59
*** chandankumar is now known as chkumar|ruck09:00
*** jamesmcarthur has quit IRC09:01
*** ykarel is now known as ykarel|away09:01
*** e0ne has joined #openstack-infra09:04
*** ykarel|away has quit IRC09:05
*** alexchadin has quit IRC09:06
*** dtantsur|afk is now known as dtantsur09:10
*** jamesmcarthur has joined #openstack-infra09:12
*** rcernin has quit IRC09:16
*** jamesmcarthur has quit IRC09:17
*** alexchadin has joined #openstack-infra09:20
*** pbourke has quit IRC09:22
*** pbourke has joined #openstack-infra09:23
*** alexchadin has quit IRC09:25
*** electrofelix has joined #openstack-infra09:26
*** ssbarnea|bkp has quit IRC09:28
*** jamesmcarthur has joined #openstack-infra09:28
*** jamesmcarthur has quit IRC09:33
*** priteau has joined #openstack-infra09:41
*** alexchadin has joined #openstack-infra09:42
*** jamesmcarthur has joined #openstack-infra09:44
*** shardy is now known as shardy_mtg09:48
*** jamesmcarthur has quit IRC09:49
*** hashar is now known as hasharAway09:51
*** gfidente has joined #openstack-infra09:53
*** jamesmcarthur has joined #openstack-infra10:00
*** diablo_rojo has quit IRC10:02
*** jamesmcarthur has quit IRC10:04
openstackgerritMatthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command  https://review.openstack.org/60538610:06
*** longkb has quit IRC10:08
*** shardy_mtg has quit IRC10:11
*** jamesmcarthur has joined #openstack-infra10:15
*** jamesmcarthur has quit IRC10:22
*** e0ne has quit IRC10:27
*** jamesmcarthur has joined #openstack-infra10:33
*** yamamoto has quit IRC10:35
*** jamesmcarthur has quit IRC10:38
*** graphene has quit IRC10:40
*** graphene has joined #openstack-infra10:41
*** jamesmcarthur has joined #openstack-infra10:49
*** alexchadin has quit IRC10:49
*** felipemonteiro has joined #openstack-infra10:49
*** jamesmcarthur has quit IRC10:53
*** felipemonteiro has quit IRC10:56
*** alexchadin has joined #openstack-infra10:58
*** yamamoto has joined #openstack-infra11:00
*** alexchadin has quit IRC11:03
*** jamesmcarthur has joined #openstack-infra11:05
*** ijw has joined #openstack-infra11:09
*** jamesmcarthur has quit IRC11:09
*** udesale has quit IRC11:12
*** psachin has quit IRC11:13
*** ijw has quit IRC11:13
*** pcaruana has quit IRC11:15
*** jamesmcarthur has joined #openstack-infra11:20
*** jpena is now known as jpena|lunch11:21
*** jamesmcarthur has quit IRC11:25
*** psachin has joined #openstack-infra11:27
*** olivier__ has quit IRC11:28
*** olivierb has joined #openstack-infra11:29
*** roman_g has quit IRC11:30
*** oanson has quit IRC11:35
*** jamesmcarthur has joined #openstack-infra11:36
*** emerson has quit IRC11:39
*** jamesmcarthur has quit IRC11:41
*** emerson has joined #openstack-infra11:49
*** roman_g has joined #openstack-infra11:49
*** jamesmcarthur has joined #openstack-infra11:52
*** e0ne has joined #openstack-infra11:53
*** alexchadin has joined #openstack-infra11:57
*** jamesmcarthur has quit IRC11:57
*** rh-jelabarre has joined #openstack-infra11:58
*** dpawlik has quit IRC11:58
*** trown|outtypewww is now known as trown11:59
*** jamesmcarthur has joined #openstack-infra12:08
*** kgiusti has joined #openstack-infra12:11
*** apetrich has joined #openstack-infra12:11
*** jamesmcarthur has quit IRC12:13
*** ijw has joined #openstack-infra12:15
*** jamesmcarthur has joined #openstack-infra12:15
*** alexchadin has quit IRC12:17
*** dtantsur is now known as dtantsur|brb12:18
*** rlandy has joined #openstack-infra12:18
*** shardy has joined #openstack-infra12:19
*** ijw has quit IRC12:19
*** quiquell|rover is now known as quique|rover|lch12:20
*** panda|off is now known as panda12:21
*** agopi has quit IRC12:24
*** alexchadin has joined #openstack-infra12:25
*** jpena|lunch is now known as jpena12:28
*** udesale has joined #openstack-infra12:30
*** jamesmcarthur has quit IRC12:31
*** yamamoto has quit IRC12:35
AJaegerfungi, do we need groups always when storyboard is used for new projects? See https://docs.openstack.org/infra/manual/creators.html#add-the-project-to-the-master-projects-list and https://review.openstack.org/#/c/605193/1/gerrit/projects.yaml , please12:40
*** psachin has quit IRC12:41
*** quique|rover|lch is now known as quiquell|rover12:44
*** jamesmcarthur has joined #openstack-infra12:46
dmsimardJust sharing: Ansible to adopt molecule and ansible-lint projects, https://groups.google.com/forum/m/#!topic/ansible-project/ehrb6AEptzA12:48
*** jcoufal has joined #openstack-infra12:49
*** jamesmcarthur has quit IRC12:52
*** janki has joined #openstack-infra12:55
*** jamesmcarthur has joined #openstack-infra12:55
fungiAJaeger: groups are a convenience option, not mandatory. if a team only has one project then a group may not be warranted12:55
fungiif a team has more than one project, they may want to put them in a project group together so they can query them as a set12:55
*** boden has joined #openstack-infra12:58
*** Guest42266 is now known as florianf13:01
fungiour creators guide also doesn't suggest they're always needed, simply explains what they're used for13:02
*** gfidente has quit IRC13:02
*** alexchadin has quit IRC13:02
*** mriedem has joined #openstack-infra13:03
AJaegerfungi: ah, ok - thanks13:07
*** sshnaidm is now known as sshnaidm|mtg13:07
*** yamamoto has joined #openstack-infra13:13
*** alexchadin has joined #openstack-infra13:17
*** ijw has joined #openstack-infra13:17
openstackgerritMatthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command  https://review.openstack.org/60538613:18
openstackgerritMerged openstack-infra/project-config master: Add new project: sardonic  https://review.openstack.org/60519313:20
*** yamamoto has quit IRC13:21
*** yamamoto has joined #openstack-infra13:21
*** ijw has quit IRC13:22
dulekEverything okay with http://zuul.openstack.org/status.html ? Running times (especially in post) seem a bit scary?13:22
fungidulek: http://lists.openstack.org/pipermail/openstack-dev/2018-September/134867.html13:26
*** chkumar|ruck is now known as chandankumar13:26
dulekfungi: Thanks!13:26
fungiwe're back up to capacity now but i think various openstack bugs are still causing a lot of churn in the gate pipeline starving other lower-priority pipelines13:26
openstackgerritMerged openstack-infra/irc-meetings master: Change congress meeting time  https://review.openstack.org/60527413:26
*** mrsoul has quit IRC13:26
fungihelping the community/qa team identify and fix bugs is probably the best place to focus on improving it13:27
AJaegerfungi, should we kill the periodic jobs? We haven't run any of them for two days now...13:27
*** agopi has joined #openstack-infra13:27
*** smarcet has joined #openstack-infra13:27
fungithey won't really use that much capacity once they do finally run13:29
dulekfungi: I'll prioritize looking at kuryr-kubernetes CI flakiness then. Thanks for info!13:29
AJaegerfungi, I fear we run them only at the weekend...13:32
*** panda has quit IRC13:32
*** dtantsur|brb is now known as dtantsur13:33
AJaegerfungi: oh, we have them only once in the queue - not multiple times as I feared (if I interpret zuul.o.o correctly) - then this is fine.13:33
*** alexchadin has quit IRC13:44
*** lbragstad has quit IRC13:45
*** slaweq has quit IRC13:45
*** alexchadin has joined #openstack-infra13:46
*** jamesmcarthur has quit IRC13:49
*** gfidente has joined #openstack-infra13:50
*** lbragstad has joined #openstack-infra13:50
*** alexchadin has quit IRC13:50
*** jamesmcarthur has joined #openstack-infra13:51
*** sthussey has joined #openstack-infra13:52
*** ijw has joined #openstack-infra13:53
*** jamesmcarthur has quit IRC13:56
*** ijw has quit IRC13:58
openstackgerritMatthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command  https://review.openstack.org/60538613:58
fungiyeah, if memory serves zuul doesn't enqueue multiples in periodic13:59
mordredfungi: I think it will - but we could discuss changing its pipeline manager to supercedent so that it wouldn't14:01
mordredalso - good morning14:02
AJaegermordred: I'm fine with not enqueuing multiples ;)14:02
AJaegergood morning, mordred14:02
*** yamamoto has quit IRC14:03
openstackgerritDmitry Tantsur proposed openstack/diskimage-builder master: Add an element to configure iBFT network interfaces  https://review.openstack.org/39178714:03
chandankumarfungi: AJaeger mordred https://review.openstack.org/#/c/605096/ please have a look , thanks :-)14:03
*** yamamoto has joined #openstack-infra14:04
openstackgerritDmitry Tantsur proposed openstack/diskimage-builder master: Add an element to configure iBFT network interfaces  https://review.openstack.org/39178714:07
*** bobh has joined #openstack-infra14:08
*** yamamoto has quit IRC14:09
mordreddtantsur: ^^ wow, that patch is old14:10
dtantsurmordred: yeah, I hoped that nobody would come again with this problem to me.. I was proved wrong.14:10
dtantsurit's complicated by the fact that I have no idea what I'm typing :)14:11
*** jamesmcarthur has joined #openstack-infra14:12
*** janki has quit IRC14:12
*** bobh has quit IRC14:13
*** udesale has quit IRC14:15
*** jamesmcarthur has quit IRC14:16
mordreddtantsur: my favorite problems!14:18
*** olivierb has quit IRC14:18
*** panda has joined #openstack-infra14:21
mordredchandankumar, fungi: commented on https://review.openstack.org/#/c/605096/14:22
fungii thought we were running that in check/gate?14:23
*** olivierb has joined #openstack-infra14:23
fungiat least AJaeger had pointed to the fact that we're testing sdist and wheel builds now as part of the standard template for projects participating in release management14:24
AJaegermordred: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n255 is the job14:26
AJaegermordred: part of publish-to-pypi template14:26
*** jamesmcarthur has joined #openstack-infra14:28
*** eernst has joined #openstack-infra14:31
*** jamesmcarthur has quit IRC14:32
openstackgerritsebastian marcet proposed openstack-infra/openstackid-resources master: added new endpoint delete my presentation  https://review.openstack.org/60413014:32
njohnstonAJaeger: I have a quick question to clarify your feedback on https://review.openstack.org/#/c/605126/1/.zuul.yaml@a125 "remove these, they are not needed anymore now."  Do you mean that the ansible playbooks are not consulted anymore for jobs depending on legacy-dsvm-base?14:33
*** bobh has joined #openstack-infra14:35
AJaegernjohnston: you remove the job using the roles, so remove the roles as well.14:38
*** armax has joined #openstack-infra14:39
AJaegernjohnston: I think you're confused by the diff ;)14:39
mordredAJaeger, fungi: yes - but we also use those playbooks in the actual release job14:40
mordredah - wait14:41
mordredplaybooks/pti-python-tarball/check.yaml is the one that should get updated with the twine commands14:41
chandankumarmordred: I will update that14:41
mordredchandankumar: thanks!14:42
*** kashyap has left #openstack-infra14:42
*** jamesmcarthur has joined #openstack-infra14:43
*** Bhujay has quit IRC14:44
jrollhi friends, per https://review.openstack.org/#/c/605193/ could I please be added to the core and release groups? easiest to search for jim@jimrollenhagen.com https://review.openstack.org/#/admin/groups/1947,members https://review.openstack.org/#/admin/groups/1948,members14:45
njohnstonAJaeger: Ah!  I thought you meant to remove the lines, not the files.  *facepalm*  Thanks for the clarification!14:46
AJaegernjohnston: sorry for the confusion14:46
*** jamesmcarthur has quit IRC14:47
*** yamamoto has joined #openstack-infra14:52
pabelangermnaser: clarkb: sjc1 doesn't look happy right now: http://grafana.openstack.org/dashboard/db/nodepool-vexxhost14:55
mnaseri know it was unhappy yesterday14:55
mnaserlet me double check14:55
pabelangernothing in nodepool log except cannot create server14:56
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: web: rewrite interface in react  https://review.openstack.org/59160414:58
*** jamesmcarthur has joined #openstack-infra14:59
*** pcaruana has joined #openstack-infra15:01
*** jamesmcarthur has quit IRC15:03
*** bobh has quit IRC15:03
quiquell|roverHello any infra-root here ?15:11
quiquell|roverWe have a review to fix timeouts https://review.openstack.org/#/c/605377/15:12
quiquell|roverNeed to go over the gates15:12
clarkbquiquell|rover: you mean that change needs to be promoted to the top of the gate?15:13
*** jamesmcarthur has joined #openstack-infra15:14
fungii've just about given up trying to correct people on terminology. everybody seems to want to call everything "gates" even when they mean "gating jobs" or "changes in the gate pipeline" or whatever15:14
quiquell|roverHello15:14
clarkbquiquell|rover: are you asking us to promote that change to the top of the gate?15:15
clarkbthis will restart testing for all of the other changes in the tripleo gate pipeline. So want to be sure we are doing that intentionally before we do it15:16
jaosoriorclarkb: restarting the testing of the other patches is fine. The patch he's trying to promote is meant to help out with the timeouts.15:16
*** bobh has joined #openstack-infra15:17
*** janki has joined #openstack-infra15:17
clarkbjaosorior: quiquell|rover another thing I notice when looking at this is that tripleo changes have multiple non voting jobs in the gate. Can we remove those from the gate since they are non voting?15:17
clarkbwill help get things through more quickly for you (fewer jobs to wait on) and gives more resources to other changes15:18
*** jamesmcarthur has quit IRC15:18
*** aojea has quit IRC15:19
jaosoriorclarkb: yes, we're removing those https://review.openstack.org/#/c/603419/15:19
clarkbok I am going to enqueue ^ to the gate, then promote the first change to the top and the second behind that change15:21
quiquell|roverclarkb: non-voting jobs running in gates? will look into that15:22
quiquell|roverclarkb, jaosorior: thanks15:22
* prometheanfire doesn't like fedora-multinode :|15:23
jaosoriorclarkb: thank you!15:24
clarkbquiquell|rover: jaosorior it is done15:25
*** chandankumar is now known as chkumar|off15:26
*** quiquell|rover is now known as quique|rover|off15:27
jaosorioryay :D15:27
*** quique|rover|off is now known as quique|off15:28
mnaserinfra-root: is it ok to just delete a vm that nodepool is trying to launch if its stuck?15:29
*** sshnaidm|mtg is now known as sshnaidm15:29
*** jamesmcarthur has joined #openstack-infra15:29
mnaser(on my side)15:29
clarkbmnaser: it should be, nodepool won't use the VM unless ssh works. And if the api shows it as error'd out then it should handle that fine15:30
openstackgerritMerged openstack-infra/openstackid-resources master: added new endpoint delete my presentation  https://review.openstack.org/60413015:30
clarkbmnaser: if the node completely disappears I think nodepool will treat that as an error on its side too, but if it doesn't we should add that functionality15:30
fungiif nodepool is repeatedly issuing "delete" calls for it to the api and it suddenly disappears, i think nodepool will treat that like business as usual?15:31
clarkbfungi: ya15:32
clarkbI expect it will be fine as is15:32
*** dpawlik has joined #openstack-infra15:32
smarcetfungi: afternoon hope that everything is fine on your side :) i have an issue on openstackid zuul jobs, https://review.openstack.org/#/c/604172/ seems the script failed due to a temporary error, could you re-trigger the job?15:33
clarkbsmarcet: leave a comment that says "recheck" and it will re-enqueue for you15:33
smarcetoh ok15:34
smarcetthx u !15:34
fungiyep, that will cause jobs to rerun automatically15:34
*** jamesmcarthur has quit IRC15:34
*** jtomasek has quit IRC15:34
smarcetfungi: clarkb: thx u  4 info :)15:35
*** dpawlik has quit IRC15:36
*** janki has quit IRC15:36
*** eernst has quit IRC15:36
fungithough it may take some time. we're under a bit of a backlog this week15:37
pabelangermnaser: thrashing in sjc1 seems to have stopped, guess you are still looking into it15:40
mnaseryup.. working on it..15:40
pabelanger++15:40
* prometheanfire thinks the requirements proposal bot update is failing again15:41
clarkbprometheanfire: failing or not running because of the backlog?15:42
*** Tim_ok has joined #openstack-infra15:42
prometheanfirebacklog is 5 hours, I think it's past that by a bit now (double)15:43
prometheanfireand it's been a couple days I think15:43
clarkbprometheanfire: the post backlog is 53 hours15:44
clarkbit has a lower priority than check and gate15:44
prometheanfirewat, wow, ok15:44
*** jtomasek has joined #openstack-infra15:44
prometheanfireguess I wait15:44
*** jamesmcarthur has joined #openstack-infra15:45
*** dklyle has joined #openstack-infra15:45
prometheanfirethere just more activity now?15:45
*** Bhujay has joined #openstack-infra15:46
pabelangerprovider issues i think, ovh looks to also be having an outage: http://grafana.openstack.org/dashboard/db/nodepool-ovh15:46
prometheanfirek15:46
pabelangerand packethost, some launch errors too: http://grafana.openstack.org/dashboard/db/nodepool-packethost15:46
clarkbpabelanger: prometheanfire it's easy to blame provider issues but I think the vast majority of it is that we have a lot of unreliable tests15:47
pabelangerso, less VMs to service jobs15:47
clarkbtripleo timing out with a gate queue of like 50 changes15:47
clarkbneutron functional doesn't work15:47
pabelangerclarkb: yup, gate resets too15:47
clarkbglance can't pass unittests15:47
clarkbtempest also has problems15:47
prometheanfireclarkb: ya, seen some of that15:47
pabelangersorry, wasn't blaming just things adding to backlog15:47
clarkbpabelanger: I just want to get away from the attitude that it is someone elses problem15:47
clarkbopenstack testing is bad right now15:47
*** xyang has joined #openstack-infra15:48
clarkband openstack should work to fix that15:48
fungihorizon only just merged fixes for their completely broken gating jobs15:48
clarkb(and not think infra will fix it by adding capacity)15:48
pabelangerYes, I think that is a fair statement15:48
clarkbthe packethost issue is likely leaked ports15:48
clarkbwe should see if someone from neutron land (slaweq maybe?) can sync up with studarus on debugging that15:49
fungido we yet know what's going on with ovh-bhs1?15:49
mnaserhmm15:49
*** jamesmcarthur has quit IRC15:49
mnaserit looks like nodepool isnt issuing creates anymore15:50
clarkbfungi: no that is new to me15:50
mnaserafter deleting those stuck instances15:50
toskyat least it's a nice stress test for zuul15:50
* tosky hides15:50
fungilooks like things started going sideways in bhs1 around 08:30 utc15:51
mnaseri see vms spawning again15:51
clarkb{"forbidden": {"message": "The number of defined ports: 360 is over the limit: 300", "code": 403}}15:51
mnaserso if someone can kick off nodepool15:52
mnaseroh15:52
mnaseris that in sjc1?15:52
clarkbno that is ovh bhs1 sorry15:52
*** panda is now known as panda|bbl15:52
mnaserafter deleting the vms that were stuck15:52
mnasernodepool hasnt issued any creates15:52
mnaserbut it's working ok now15:52
fungiso neutron has likely leaked ports in ovh-bhs1 for some reason15:52
fungii can try to manually delete them15:52
clarkbsure enough we have many DOWN ports there15:52
prometheanfirefungi: they were waiting on reqs a bit for that (horizon)15:53
clarkbfungi: care to only delete the ones that are DOWN15:53
prometheanfirethey had to cap something :(15:53
pabelangermnaser: clarkb: I can look at nodepool-launcher, 1 sec15:53
clarkbI wonder if all the clouds have upgraded to a buggy version of neutron/nova and now we leak ports all over15:53
clarkbwe should have a port cleanup thing in nodepool though, let me see if we can run that in ovh15:53
clarkbor maybe that is only FIPs15:54
*** kopecmartin|ruck has joined #openstack-infra15:55
*** kopecmartin|ruck has left #openstack-infra15:55
*** kopecmartin|ruck has joined #openstack-infra15:55
fungideleting all the ports marked as DOWN now15:56
*** jamesmcarthur has joined #openstack-infra15:56
*** eglute has joined #openstack-infra15:56
fungistarting out with 357 down15:56
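(For illustration, a minimal openstacksdk sketch of the cleanup fungi describes above: list the ports neutron reports as DOWN and delete them one by one. The cloud name and error handling are assumptions, not the exact loop that was run.)

```python
# Hypothetical sketch of the manual DOWN-port cleanup described above.
# Assumes a clouds.yaml entry named "ovh-bhs1"; not the exact commands used.
import openstack
from openstack import exceptions as sdk_exc

conn = openstack.connect(cloud='ovh-bhs1')

down_ports = list(conn.network.ports(status='DOWN'))
print('starting out with %d down' % len(down_ports))

for port in down_ports:
    try:
        conn.network.delete_port(port, ignore_missing=True)
    except sdk_exc.SDKException as exc:
        # deletes can time out (504) when the backend is struggling; keep going
        print('failed to delete %s: %s' % (port.id, exc))
```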
clarkbfwiw this looks very similar to the problems we have in packethost15:57
fungimight be a bigger issue there... my first `openstack port delete ...` in the loop is hanging15:57
*** jamesmcarthur has quit IRC15:57
clarkbhrm they don't hang in packethost, but they aren't very fast15:57
pabelanger2018-09-26 15:56:36,487 DEBUG nodepool.driver.NodeRequestHandler[nl03-25326-PoolWorker.vexxhost-sjc1-vexxhost-specific]: Declining node request 100-0006323378 because node type(s) [ubuntu-trusty] not available15:57
*** jamesmcarthur has joined #openstack-infra15:57
pabelangeris sjc1 not running all images?15:57
fungiwell, i say "hanging" but i don't know that for sure. it's only been ~60 seconds so far15:57
fungiwithout returning15:57
clarkbthis might be one of those situations where we need to get neutron and nova and our clouds all talking together to figure out why this is painful all of a sudden15:58
clarkbpabelanger: it should be but maybe the config there is buggy?15:58
fungiahh, i guess my deletes aren't hanging, they're just "slow" (for fairly extreme definitions of the word)15:59
fungiwe're down to 340 now15:59
clarkbdpawlik, amorin, studarus, mlavalle get together in a room and fix neutron15:59
fungiand the port delete command in osc doesn't provide any output, so i thought it was still on the first in the set16:00
*** dave-mccowan has joined #openstack-infra16:00
clarkbit looks like something in the background may clean them up in ovh based on usage graphs16:01
clarkbwe'll spike then go quiet then spike again16:01
openstackgerritPaul Belanger proposed openstack-infra/project-config master: Simplify vexxhost nodepool configuration  https://review.openstack.org/60546916:01
pabelangermnaser: clarkb: fungi: not a fix, but should reduce some copypasta for vexxhost ^16:02
pabelanger2018-09-26 15:56:30,317 INFO nodepool.driver.NodeRequestHandler[nl03-25326-PoolWorker.vexxhost-sjc1-main]: Not enough quota remaining to satisfy request 200-000631763516:02
pabelangerseems nodepool thinks it is at quota for sjc116:03
ssbarneaclarkb: can you help with https://review.openstack.org/#/c/603061/ , already 8 days old and without extra pings I doubt it will get merged. thanks.16:03
pabelanger2018-09-26 15:56:30,317 DEBUG nodepool.driver.NodeRequestHandler[nl03-25326-PoolWorker.vexxhost-sjc1-main]: Current pool quota: {'compute': {'ram': inf, 'instances': 0, 'cores': inf}}16:03
pabelangerdon't know why that is 016:03
pabelangermnaser: the instances you deleted, were they in ERROR state?16:04
*** jamesmcarthur has quit IRC16:04
*** jamesmcarthur has joined #openstack-infra16:05
clarkbpabelanger: it is calculating min(quota, max-servers) - number of servers running in nodepool16:07
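(Roughly, the arithmetic clarkb describes can be sketched as below; the numbers are invented, but they show how a pool can report zero available instances while nodepool still counts 46 nodes as building.)

```python
# Simplified illustration (not nodepool's actual code) of the pool quota math:
# available = min(cloud quota, pool max-servers) - nodes nodepool already owns.
def remaining_instances(cloud_quota, max_servers, nodepool_owned):
    return max(0, min(cloud_quota, max_servers) - nodepool_owned)

# If nodepool still counts 46 nodes as "building" against a similarly sized
# cap, the pool quota comes out as 0 instances even though the cloud has room.
print(remaining_instances(cloud_quota=100, max_servers=46, nodepool_owned=46))  # -> 0
```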
*** sthussey has quit IRC16:07
fricklerjroll: seems your request got overlooked, done now16:08
*** dtantsur is now known as dtantsur|afk16:08
pabelangerclarkb: okay, nodepool thinks there are 46 nodes building16:09
pabelangerlet me check openstack api16:09
*** ramishra has quit IRC16:09
*** noama has quit IRC16:10
pabelangerclarkb: mnaser: okay, I see movement now. I think we just needed to wait for the launch-timeout to trigger16:12
clarkbamorin: if you are around, we've noticed that we appear to leak neutron ports in ovh bhs1 now. Manually deleting them appears to work. Thought you may find this information useful as you continue to operate newer openstack16:14
clarkbamorin: let us know if we can provide additional information to help debug or understand the problem16:14
fungiamorin: from our graphs, it looks like the leak may have started around 08:30 utc16:16
*** Emine has quit IRC16:16
fungidown port deletion is slowing considerably... i have a feeling we're continuing to leak new ports as we start to be able to boot new nodes after i delete previous ones16:17
*** florianf is now known as florianf|afk16:17
clarkbfungi: would not surprise me16:17
pabelangerclarkb: mnaser: I see jobs running in sjc1 now16:17
pabelangerthanks!16:17
clarkbssbarnea: we don't maintain stackalytics, it is a third party service16:18
openstackgerritMonty Taylor proposed openstack-infra/zuul-jobs master: WIP Extract pep8 messages for inline comments  https://review.openstack.org/58963416:18
clarkbI do not know why I have approval rights to that repo, but it isn't something we support16:18
clarkbfungi: I hate to suggest it but we could add a leaked port cleaner like we do for floating ips to nodepool16:19
*** tpsilva has joined #openstack-infra16:19
clarkbthis specific type of problem seems important enough to want the clouds to address it though16:19
fungiyeah, i've hit the inflection point in bhs1 now where the count of down ports is rising faster than i'm deleting them16:20
*** kopecmartin|ruck is now known as kopecmartin|off16:20
pabelangeropenstack.exceptions.SDKException: Error in creating the server: Build of instance 305569df-0b6b-4da8-9e85-a2c2273e34a5 aborted: VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed gigabytes quota. Requested 80G, quota is 5120G and 5120G has been consumed.16:20
pabelangermnaser: think we might have leaked volumes ^16:20
*** jpich has quit IRC16:22
ssbarneaclarkb: infra-core is still a member of stackalytics-core which makes me believe it is able to perform reviews in the absence of the main maintainers, right?16:23
clarkbssbarnea: yes I have +2 on the repo. I have no desire to use it16:23
clarkbit isn't a service or a repo we support16:23
AJaegerclarkb: infra-core is now in stackalytics group ;(16:26
clarkbfungi: rereading ovh bhs1 graphs I don't think the port issue is the only problem? we don't seem to have ever transitioned to in use nodes16:27
*** che-arne has joined #openstack-infra16:27
* clarkb looks at logs again16:27
AJaegersorry for duplicate - too slow in reading scrollback16:27
clarkbfungi: or maybe we never got it below the 300 magic number?16:27
fungipossible16:28
jrollfrickler: thanks!16:28
ssbarneaclarkb: sure. i was curious more about the process in general. what happens when a project has only one ore two people in the core team and none of them is available any more? may not be the case here, should CRs be stalling forevever?16:28
fungiclarkb: i mean, my delete loop is still going but we're back up above 300 again as of a few minutes ago16:28
clarkbError in creating the server: Exceeded maximum number of retries. Exceeded max scheduling attempts 6 for instance 8f95c8e4-e70a-4744-b87c-ae7c6cdc57cd. Last exception: Maximum number of ports exceeded16:29
clarkbseems we transitioned to that after some time16:29
fungissbarnea: if it's an official openstack project the answer is different than for an unofficial project16:29
*** Bhujay has quit IRC16:29
fungifor official openstack projects the team in charge of it can be required by the tc to add more reviewers or retire that deliverable16:30
fungias stackalytics is not and never was an official openstack project, the tc has no jurisdiction over it16:30
ssbarneaindeed, i was referring more to unofficial/side/infra projects because this is the only place where I encounter this dilemma, on official ones it is (kinda) easy to find someone.16:30
clarkbssbarnea: in this case there have been multiple discussions of various groups taking it over then no one does16:31
ssbarneahaha, indeed, a good description of reality :D16:31
clarkbit was a mirantis project and continues to be one as far as I know16:32
clarkbif there is renewed interest in taking it over I would reach out to the previous maintainers and see if they will hand over the reins16:32
dmsimardmnaser: looks like we're running some nodes on sjc1 again, thanks16:32
ssbarneai will send another email to the two committers, at some point they will find my email, hopefully.16:33
fungiwe've had contributors in the past interested in (and even getting pretty far on) making various improvements to stackalytics like fixing the persistent analysis store so it doesn't go offline for hours when you need to restart it or ripping out the affiliation static config in favor of querying the foundation profile api16:33
fungibut the team in charge of the project and running the server for it seem to lack the bandwidth for such contribution (or even to discuss it)16:34
*** panda|bbl is now known as panda16:35
clarkbfungi: I added nl04 to the emergency file16:35
corvusa minor correction: *infra* projects *are* official openstack projects :)16:35
clarkbpuppet just ran on it ~4 minutes ago so I think I am safe to go ahead and set max-servers to zero in bhs116:35
clarkbthen we can rerun the port cleanup. Set max-servers to ~5 and see if it ends up in a happier place or not16:36
ssbarneacorvus: with the side note that not everything under infra/ in gerrit is an infra project :D16:36
fungiwell, not everything under openstack/ in gerrit is an official openstack project either16:36
clarkbfungi: ok max-servers is set to 016:36
ssbarneanow, I want to create a new infra project, probably named openstack-helpers, that would be used to host multiple greasemonkey scripts that are useful for openstack devs. i did read the (very) long docs page but it is not clear to me who creates the repository in the first place.16:36
fungissbarnea: a script creates the repository when it sees a new entry appear in the gerrit/projects.yaml file in openstack-infra/project-config16:37
fungiautomation16:37
clarkbI'm going to grab breakfast while waiting on that port cleanup to happen16:37
clarkbfungi: ^ assuming it is still running?16:38
fungiit just finished. i can start another pass16:38
clarkbplease do16:38
openstackgerritJames E. Blair proposed openstack-infra/project-config master: Add zone-opendev.org project  https://review.openstack.org/60509516:38
fungiwe're currently at 295 leaked, so it may be cleaning itself up?16:38
*** jamesmcarthur has quit IRC16:38
clarkbfungi: do we want to wait ~10 minutes and see if that number moves?16:39
fungii'll give it a few, yes16:39
clarkbmax-servers is set to 0 so any servers we are using should clean up (and their ports too I hope)16:39
ssbarneafungi: thanks for this magic hint! i made a note. Before making the CR, does anyone have something against using the "openstack-helpers" name? I could go for "monkeys" if you don't like it.16:39
fungiand then start deleting if it doesn't seem to be doing the cleanup on its own16:39
fungiopenstack-monkeys seems mildly offensive16:40
fungi(or could be taken that way)16:40
*** trown is now known as trown|lunch16:40
AJaegerssbarnea: why not remove openstack from the name? Or use "os"?16:40
mordredos-greasemonkey would be descriptive16:41
ssbarneai do not want to use the full "grease*" name because it refers to a specific extension and there are multiple ones: tampermonkey, greasemonkey, and even others with different names.16:42
mordredI did not know that - nod16:42
ssbarneathat is why i was considering "helpers" as a more multi-purpose name to use that is not directly linked to a browser extension.16:42
fungibrowser-helpers?16:43
fungior are they useful outside a web browser context?16:43
*** e0ne has quit IRC16:43
*** jamesmcarthur has joined #openstack-infra16:44
ssbarneafungi: yep, for the moment web-helpers would be ok but i have some regex patterns inside that are used to highlight console logs. Still, in the future I want to also make a CLI tool to parse logs, one that uses the same patterns as the web extension ... this would make the "web" part confusing in the repo name.16:45
ssbarneai find the inclusion of web in the repo name ... limiting.16:45
*** dave-mccowan has quit IRC16:49
*** rkukura has joined #openstack-infra16:50
corvusyou could call it "ssbarnea's scripts, browser, and regex network emporium area".  maybe abbreviate that somehow.16:50
fungiyeah, useful names for catch-all repos are hard to come up with16:50
*** evrardjp has quit IRC16:50
*** ginopc has quit IRC16:51
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Only replicate openstack* to github  https://review.openstack.org/60548616:53
*** olivierb has quit IRC16:54
*** gfidente has quit IRC16:55
mordredcorvus, fungi: I'm pretty sure that's all we need to let the zone-opendev.org be in opendev/ instead of openstack-infra, should we choose to do such a thing16:56
corvusmordred: sounds reasonable; clarkb, mordred, fungi: should i rework that back to opendev/ ?  or leave it to be (presumably) moved with the rest?16:57
fungireviewing16:57
mordredhttps://gerrit.googlesource.com/plugins/replication/+doc/master/src/main/resources/Documentation/config.md <-- is what I was looking at - search for "remote.NAME.projects"16:58
mordredwe could alternately do [16:58
mordredwe could alternately do ['openstack/*', 'openstack-dev/*', 'openstack-infra/*'] to be clearer16:59
clarkbshould double check against our gerrit docs16:59
*** rkukura has quit IRC16:59
mordredclarkb: I tried looking at the docs for our gerrit - but with them having been moved to plugins ...16:59
pabelangerclarkb: I've deleted the leaked volumes in sjc1, but I think mnaser will need to debug the openstack side, some are stuck deleting. Think I'll work on nodepool at ansiblefest to also try and clean up leaked volumes, I can see there is metadata in the volumes for nodepool_build_id16:59
mordredclarkb: I got to that doc from https://review.openstack.org/Documentation/config-plugins.html#replication16:59
clarkbah16:59
mordredclarkb: https://gerrit.googlesource.com/plugins/replication/+/stable-2.13/src/main/resources/Documentation/config.md17:01
mordredthere we go - there's the 2.13 docs - and it says the same thing17:01
*** derekh has quit IRC17:01
fungiso the leaked ports in bhs1 are still dropping, but slower than when i was deleting them i think. i'll go ahead and augment it with a loop of explicit deletes to speed things up further there17:01
fungiwe're down to 288 leaked so far17:01
mordredfungi: glorious17:02
*** fuentess has joined #openstack-infra17:03
mordredclarkb: so I think if you're ok with that - it's got 2x+217:06
clarkbdo we want to test it on review-dev first? will require a gerrit restart too iirc17:07
mordredclarkb: good points both of those17:07
mordredclarkb: lemme make a review-dev patch17:07
*** jpena is now known as jpena|off17:09
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Only replicate openstack namespaces to github  https://review.openstack.org/60548617:11
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Only replicate gtest-org and kdc  https://review.openstack.org/60549017:11
*** slaweq has joined #openstack-infra17:11
mordredclarkb, corvus, fungi: ^^ there ya go17:11
*** slaweq has quit IRC17:15
*** dpawlik has joined #openstack-infra17:16
*** dpawlik has quit IRC17:16
*** jamesmcarthur has quit IRC17:16
*** dpawlik has joined #openstack-infra17:17
clarkbfungi: only down to 212 ports so far17:18
fungiyeah, but also hitting some like17:19
fungiFailed to delete port with name or ID '3c864749-1664-4af9-8aab-d6dacaba24a4': HttpException: 504: Server Error for url: https://network.compute.bhs1.cloud.ovh.net/v2.0/ports/3c864749-1664-4af9-8aab-d6dacaba24a4, <html><body><h1>504 Gateway Time-out</h1>The server didn't respond in time.</body></html> 1 of 1 ports failed to delete.17:19
*** shardy has quit IRC17:20
clarkbI would not be surprised if that is part of why we are leaking them in the first place17:20
clarkbnova asks neutron to delete them, neutron 504s, and the server is deleted; now we have a leaked port17:20
fungisounds remarkably familiar17:20
* fungi has an overwhelming sense of deja vu17:21
*** jamesmcarthur has joined #openstack-infra17:21
*** harlowja has joined #openstack-infra17:23
*** rkukura has joined #openstack-infra17:26
clarkbya I want to say we saw this issue with the gate and tempest17:28
clarkband initially a lot of the blame was pointed at the apache proxy that was terminating tls17:28
clarkbI don't know if it was ever fixed though17:28
clarkboh actually it was a client thing with holding connections open17:28
fungidown to 205 leaked ports in ovh-bhs1 now but it's been stuck there for a few minutes17:28
clarkbapache by default allows for connections to be reused17:29
clarkbpython requests is buggy in the situation where it closes a connection but races trying to use it for a new request17:29
clarkbwe fixed it by telling requests to use a new connection each request iirc17:29
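(As a rough sketch of the kind of workaround being described, assumed rather than the exact patch: sending Connection: close on every request stops the client from reusing a pooled connection that the server may have already timed out.)

```python
# Hedged example: force a fresh connection per request with python-requests.
# The URL is a placeholder; the real fix may have been applied at the
# keystoneauth/openstacksdk layer instead.
import requests

session = requests.Session()
session.headers['Connection'] = 'close'  # server closes the socket after each response

resp = session.get('https://example.com/')
print(resp.status_code)
```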
clarkbfungi: now 20417:30
fungiindeed17:30
clarkband now 203. Not very quick. That could also explain the leaks (not being able to delete fast enough)17:33
fungiyeah, i'm not seeing too many timeouts17:34
mordredclarkb: amusingly enough I recently set session.keep_alive = False in the openstacksdk functional test suite because of tons of log spam due to "dropped connection, retrying"17:34
fungii have a feeling it's more that we're recycling instance quota faster than the neutron backend can clean up17:34
clarkbmordred: ya python requests isn't great around when those keep-alive connections are killed due to timeouts17:35
clarkbfungi: could be17:36
*** annp has quit IRC17:39
fungigetting soooo slooooow17:41
fungi200 now17:41
fungionly hit 5 timeouts so far17:42
pabelangeris it a load issue on the nodes?17:42
AJaegerconfig-core, two quick cleanup reviews, please: https://review.openstack.org/605076 https://review.openstack.org/60534417:42
fungipabelanger: not sure what you're asking about17:42
pabelangerwe had neutron issue in infra-cloud when CPU was pinned converting qcow2 images to raw17:42
pabelangerfungi: sorry, just jumping in, was asking if the requests were slow due to the remote node not responding fast enough17:43
fungiand that caused port deletion to be slow?17:43
pabelangerfungi: creation17:43
fungithis is just leaked ports. trying to remove them17:43
pabelangerVMs would timeout on network getting created, and fail to boot because of it17:43
*** trown|lunch is now known as trown17:44
clarkbhand-wavy guess: similar to how we have to force dhcp in OVH because the neutron config isn't actually meant to be used, when cleaning up network-related resources neutron may be talking to something external to do the cleanups and that is slow17:46
fungiyeah, probably17:47
*** dpawlik has quit IRC17:47
clarkbnova/neutron may actually have all of the port deletes queued up they just don't happen very fast17:49
clarkbthen on top of that somehow we ended up above the quota limit of 300 so when things caught up a little we still weren't able to boot new instances17:50
clarkbif we can get it to zero we can bump max-servers to say 5 and see if we leak again17:51
fungiyeah, it's just not happening any time soon17:53
fungi196 ports left to delete17:53
Shrewsthat's, like, painfully slow17:54
fungiwe're up to 9 deletion timeouts now17:56
*** electrofelix has quit IRC17:56
clarkbout of ~100 ?17:56
openstackgerritMerged openstack-infra/project-config master: Move glare legacy jobs in-repo  https://review.openstack.org/60507617:57
fungiyeah, something like that17:57
fungiso maybe 10%17:57
mnaserhttp://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1&from=now-3h&to=now fyi i notice only 15 nodes used in sjc1? do we know why?18:03
*** jcoufal has quit IRC18:03
clarkbmnaser: possibly the volume leaks pabelanger was talking about?18:04
openstackgerritsebastian marcet proposed openstack-infra/openstackid-resources master: Fixed bugs on Submit presentation flow  https://review.openstack.org/60549718:04
clarkbmnaser: apparently some of them are stuck in deleting18:04
mnaseroh18:04
mnaserlet me see18:04
mnaserdeleting seems to fluctuate up and down18:04
mnaser"Requested 80G, quota is 5120G and 5120G has been consumed." ok cool let me investigate18:04
mnaserok, let me see18:05
*** graphene has quit IRC18:07
openstackgerritMerged openstack-infra/openstackid-resources master: Fixed bugs on Submit presentation flow  https://review.openstack.org/60549718:10
* mnaser is in a call, will review shortly18:10
*** slaweq has joined #openstack-infra18:12
clarkbfungi: now down to 76 (seems like it sped up)18:13
clarkb6818:13
fungiweird18:16
fungii wonder if they're working on it18:16
fungi33 now18:16
*** diablo_rojo has joined #openstack-infra18:18
fungiand now 018:19
mordredyay!18:19
fungii guess let's try to start ramping it back up again, but i have a feeling the deletion speedup coincided with them fixing some problem in their backend18:19
clarkbok I'm going to set max-servers to 518:20
prometheanfirelol18:20
clarkband we can watch if we leak from there18:20
*** dpawlik has joined #openstack-infra18:21
fungiwfm18:21
*** slaweq has quit IRC18:22
clarkbShrews: following up on zuul potentially leaking nodes in nodepool (locked for ~14 days) any idea on how to clear those out?18:22
clarkbShrews: 0001950609 0001975058 0001975067 are the three nodes if you want to take a look18:22
Shrewsclarkb: would require a zuul restart to delete the locks18:23
Shrewsclarkb: but let me poke around zk a bit18:23
clarkbShrews: could we manually delete the lock?18:24
*** pcaruana has quit IRC18:24
Shrewsclarkb: we'd have to manually delete zk nodes. i'm hesitant to do that18:24
*** lbragstad has quit IRC18:24
clarkbShrews: ok18:25
Shrewsbut i guess we could. they'd be seen as leaked nodes to nodepool. not sure what the zuul impact would be18:25
clarkbfungi: due to the 3 ready nodes that zuul has locked (leaked above) max-servers 5 really means only 2 new instances. I am going to bump to 8 to get 518:25
*** dpawlik has quit IRC18:25
*** lbragstad has joined #openstack-infra18:25
fungik18:26
*** yamamoto has quit IRC18:26
*** yamamoto has joined #openstack-infra18:26
Shrewsclarkb: confirmed zuul is holding the locks  :(18:31
clarkbShrews: zuul should handle if the znode goes away though right?18:35
clarkb| fault                       | {'message': 'Build of instance 960aff55-3795-43cb-ad73-e58816444355 aborted: Failed to allocate the network(s), not rescheduling.', 'code': 500, 'created': '2018-09-26T18:36:13Z'} one of the nodes failed to build18:37
clarkbseems potentially related to our inability to clean up ports18:37
Shrewsclarkb: i'm not sure how zuul would handle that18:41
Shrewswe expect some zk nodes to disappear, but a Node isn't one of them18:41
Shrewsgood question for corvus18:42
mordredcorvus knows everything18:42
*** jamesmcarthur has quit IRC18:42
Shrewswe need a zuul restart at some point anyway for some fixes mentioned earlier (yesterday?). perhaps we should just schedule a time for that18:44
Shrewsthat sql optimization at least18:45
clarkbthat probably depends on whether or not we will declare bankruptcy on the backlog or not18:45
clarkbfungi: fwiw it seems that some nodes error as above and others are just really slow to boot. None have successfully booted yet18:46
corvuswith those node ids, we can trace them in the zuul logs and figure out why they leaked18:46
corvusthat should not preclude us restarting either nodepool or zuul whenever we wish18:47
*** jamesmcarthur has joined #openstack-infra18:47
corvushowever, i'm in need of a sandwich so am not going to trace them now :)18:47
clarkbfungi: I'll let it go a little longer but the oldest node i was watching was deleted by nodepool. Doesn't appear to have errored just taken too long18:50
fungiokay, so may be that bhs1 is still just plain unusable at the moment18:50
clarkbhttp://logs.openstack.org/25/604925/2/check/system-config-run-base/3a474c9/job-output.txt.gz#_2018-09-25_01_31_28_665890 is a fun ci bug. I think that happens beacuse I am trying to change the uid of the zuul user on the test nodes18:54
clarkbmordred: corvus ^ thoughts on creating a zuulcd user instead?18:54
mordredclarkb: oh. HAH18:54
clarkbthen zuul the test node user can coexist with zuulcd the cd user18:54
mordredclarkb: or else - maybe make the creation use the same uid for nodepool and prod and make the creation idempotent?18:54
mordredI haven't thought long enough to have thoughts on whether that's a bad idea or not18:55
clarkbmordred: ya I think we could set the uid to 1000? I'm not sure how difficult it will be to keep that in sync over time18:55
*** jcoufal has joined #openstack-infra18:55
clarkb(I did the last uid + 1 process for normal users which resulted in 203X)18:56
clarkbfungi: ya I'm going to set max-servers back to 0, this isn't making any progress18:57
*** smarcet has quit IRC18:58
clarkbthe gate just reset again18:58
clarkbI think I need to step away from the computer for a bit18:58
* fungi is struggling to whittle down a forum session proposal to fit in the requisite 1k character matchbox18:59
clarkbfungi: you get three of those boxes :)19:00
fungiyeah, but only one goes on the schedule19:01
fungi(i think?)19:01
mordredfungi: "gonna talk about stuff"19:02
mordredfungi: who needs more words than that?19:02
fungi"i will wear a funny shirt and make people talk to each other"19:03
fungiperfect!19:03
clarkb"if this session gets enough upvotes I will wear the orange cantina shirt"19:04
*** rtjure has quit IRC19:05
*** dave-mccowan has joined #openstack-infra19:05
AJaeger"And the green one if not" - give us a choice to vote for one ;)19:05
fungigrr... i'm still 95 characters over19:06
AJaegerRemove all punctuation ;)19:07
clarkbthe down ports list in ovh bhs1 fell to 1 after setting max servers to 019:07
*** rtjure has joined #openstack-infra19:08
clarkbjaosorior: https://review.openstack.org/#/c/603419/ fyi that didn't pass (it also failed the rdo third party check)19:12
*** e0ne has joined #openstack-infra19:15
*** jcoufal has quit IRC19:21
*** Tim_ok has quit IRC19:24
*** slaweq has joined #openstack-infra19:24
*** Emine has joined #openstack-infra19:26
fuentessclarkb: hi, quick question? what is the best way to know if I'm running under a Zuul slave? is there any environment variable that I can check?19:27
*** e0ne has quit IRC19:30
fungifuentess: jobs can set any environment variables they like when executing a shell or command process... can you be more specific about what you're trying to do and where?19:31
mordredyah - there is a zuul ansible host_var your playbooks can do things with19:32
*** jamesmcarthur has quit IRC19:32
mordredbut once you're out of ansible and into some shell that the ansible has spawned, that's all job specific19:32
fuentessfungi, mordred: I will add a change in one of our scripts to run the cri-o tests using overlay (instead of devicemapper) when in a Zuul slave, so I want to check using something like: if [ "$ZUUL_CI" == true ]19:35
mordredyeah - in order for that to work, we'd need to update the job that calls your script to set a variable like ZUUL_CI19:35
mordredalternately, maybe we should just have the zuul job pass something like --overlay to the script?19:36
fuentessadding it here, right? https://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/kata-setup/tasks/main.yaml#n3719:36
Shrewscorvus: attempted to trace node 0001975067 through zuul logs (debug.log.14.gz, btw). For that request (200-0006054530), I see that it gets completed, but the nodes are never set in-use (i.e., no "Setting nodeset ... in use" log entry)19:37
slaweqclarkb: mordred: with big help from both of You I finally managed to do patch to migrate dvr multinode job in neutron to zuulv3 syntax: https://review.openstack.org/#/c/578796/ - thx a lot guys :)19:37
Shrewscorvus: i'm a bit stumped as to why19:37
mordredfuentess: yes- and it looks like CI=true is already being set there19:37
mordredfuentess: so you could just add another line just like that19:38
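A rough sketch of the kind of one-line addition being described; the real task layout in roles/kata-setup/tasks/main.yaml may differ, so the module and path here are illustrative only:

    - name: Export ZUUL_CI so scripts run outside of Ansible can detect the CI environment
      lineinfile:
        path: /etc/environment
        line: "ZUUL_CI=true"
      become: yes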
mordredslaweq: \o/ woohoo!19:38
fuentessmordred: cool, I am not sure if I can submit changes there... or how can I do it, any guidance?19:39
mordredfuentess: you totally can - have you submitted patches to the openstack gerrit before, or will this be your first one?19:40
fuentessmordred: this will be my first one19:40
*** jamesmcarthur has joined #openstack-infra19:41
mordredfuentess: excellent. well - we have a doc here: https://docs.openstack.org/infra/manual/developers.html#accout-setup ... I don't think we require signing the CLA for infra projects (do we clarkb fungi?) so you can probably skip that part19:42
fungichecking on ozj there19:42
mordredfuentess: tl;dr is "make sure you have a launchpad/ubuntu sso account", "log in to review.openstack.org", "add your ssh key to your profile in gerrit", "pip install git-review" ... then in a git clone of openstack-zuul-jobs, once you've made your commit, run "git review"19:43
fuentessmordred: great, thanks, I'll follow the doc19:43
mordredfuentess: but the doc is more complete than messages from me in IRC :)19:43
fungihttps://review.openstack.org/#/admin/projects/openstack-infra/openstack-zuul-jobs says "Require a valid contributor agreement to upload: INHERIT (false)" so no cla required19:44
mordredyay!19:44
fuentessgreat, thanks19:44
mordredlet us know if you have any issues - the initial account setup is more onerous than we'd like, but such is the world we live in19:44
fungithings we're (painfully slowly it seems like) changing for the better over time19:45
*** jamesmcarthur has quit IRC19:46
*** jamesmcarthur has joined #openstack-infra19:48
openstackgerritChandan Kumar proposed openstack-infra/project-config master: Added twine check functionality to python-tarball playbook  https://review.openstack.org/60509619:49
corvusShrews: thanks, i'll start there and see if i can trace further19:50
openstackgerritMatthieu Huin proposed openstack-infra/zuul master: web: add tenant and project scoped, JWT-protected actions  https://review.openstack.org/57690719:51
*** jamesmcarthur has quit IRC19:52
mnaserinfra-root: http://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1&from=now-1h&to=now&var-region=All is this starting to look..healthy?19:52
mnaserit looks like for some reason a lot of instances accumulate as ready and then all get used up at once19:52
*** ijw has joined #openstack-infra19:53
*** jamesmcarthur has joined #openstack-infra19:55
openstackgerritChandan Kumar proposed openstack-infra/project-config master: Added twine check functionality to python-tarball playbook  https://review.openstack.org/60509619:57
corvusclarkb, Shrews: the job that node request was for (openstack-ansible-ironic-ssl-nv) was removed between the time the node request was issued and fulfilled: http://git.openstack.org/cgit/openstack/openstack-ansible-os_ironic/commit/?id=9a2f843dc15750cddba10db73afe381ee378525020:01
corvusi think that's our smoking gun :)20:01
Shrewscorvus: w00t20:01
clarkbmnaser: I think that may be lag induced by the zuul executors throttling themselves. You can compare with the active executor count on the zuul status page20:02
* mnaser clarkb: so it looks like we're at 90ish in-use which seems to tell me things are healthy20:02
mnaserthat shouldn't have been a /me but okay20:03
mnasersorry for the noise/annoyance/problems/etc20:03
corvusthat behavior could also be zuul doing a bunch of reconfigs, or a gate reset20:03
clarkbmnaser: was that a one off thing? eg we don't need nodepool to manage nodes/volumes better?20:04
mnaserclarkb: yes, that was a one-off at our side thing, but i noticed that when i deleted the vms from under nodepool it was a bit confused20:04
mnasernot sure what was done there20:04
*** agopi has quit IRC20:06
clarkbmnaser: I think pabelanger mentioned that it ended up waiting for the api request to timeout or similar20:07
*** bnemec has quit IRC20:10
*** evrardjp has joined #openstack-infra20:11
clarkblooks like inap might also be unhappy (though less unhappy than some of the other clouds)20:14
clarkb(a lot of deleting nodes there and higher error rate in recent hours)20:15
*** bnemec has joined #openstack-infra20:15
mgagneclarkb: how can I make it happy?20:15
clarkbmgagne: reading the nodepool logs the errors appear to be "timeout waiting for node to delete"20:17
mgagneclarkb: could it be that same issue we had a couple of months ago?20:17
clarkbmgagne: I think nodepool is not booting new nodes until it frees up capacity (so slow deletes are preventing new boots)20:17
clarkbmgagne: you'll have to remind me what that was (sorry)20:18
mgagnemaybe I should have documented that SQL query somewhere....20:18
clarkbnodepool.exceptions.ServerDeleteException: Timeout waiting for server 3cc96068-e398-4c2c-a908-ebea018af044 deletion20:18
clarkbis the full message and includes an instance uuid20:18
clarkbif that helps20:18
mgagneclarkb: I think delete task gets killed by restarting conductor or something. So you can't delete it again because database says it's already happening somewhere.20:19
mgagnebut it's stuck in BUILD too so might be something related to build time being higher than expected.20:20
mgagneI'll see what I can do to unstuck that mess ;)20:21
mgagneI see that some instances are successfully deleted without me taking any actions.20:22
clarkbmgagne: so maybe it is just slowness?20:23
mgagneI haven't figured out what's going on. I know that some stuck instances are getting deleted.20:24
pabelangerclarkb: mnaser: I don't think it is zuul-executors, our executor queue looks healthy: http://grafana.openstack.org/dashboard/db/zuul-status20:24
mgagnewhat I would *really* love is a way to know what tasks are running on conductor or compute =)20:25
clarkbmnaser: pabelanger http://paste.openstack.org/show/730964/ is a list of volume ids that claim to be attached to those server ids. At last check none of those server ids actually exist20:27
clarkbmnaser: pabelanger I think that may be part of our leaked volume story20:27
clarkbthe oldest volume is from about two weeks ago and the newest from an hour ago20:27
*** smarcet has joined #openstack-infra20:27
clarkbmnaser: pabelanger I haven't tried to delete any of them yet but I figure that is my next step if mnaser doesn't say otherwise20:27
pabelangerclarkb: Hmm, it is possible I may have deleted those volumes.  I cleaned up some an hour so ago20:28
pabelangerlet me check history20:28
clarkbpabelanger: well they are still there if you tried to delete them :)20:28
pabelangerclarkb: the ones I deleted were available, but unattached20:28
clarkbthese all claim to be attached to those servers but those servers do not exist20:29
*** smarcet has quit IRC20:29
*** smarcet has joined #openstack-infra20:30
*** smarcet has quit IRC20:30
mgagneso I restarted a nova-compute and some instances are getting deleted on that node.20:30
pabelangerclarkb: http://paste.openstack.org/show/730965/20:30
pabelangerthat was a few hours ago when I started clean up20:30
pabelanger22f56d13-c67c-4aac-a6aa-cf58fe57b17720:31
pabelangeris in both pastebins20:31
clarkbya all of mine are those that look like attached to $uuid20:31
pabelangeryah20:31
clarkbthere are a couple extras in mine too20:31
pabelangerI'd expect nodepool to use hostname20:32
clarkbpabelanger: it does see the centos ones in your example20:32
clarkbI think it shows uuid there because the instances don't exist (so it cannot look up the name)20:32
pabelangerright20:32
pabelangerthink so too20:32
*** priteau has quit IRC20:33
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Fix node leak on job removal  https://review.openstack.org/60552720:34
clarkbmgagne: looks like in the last 10 minutes or so things may be improving (possibly related to that restart?)20:34
corvusclarkb, Shrews: ^ fix for that leak20:34
fungioh, that's a fun bug20:34
mgagneI restarted 1 stuck nova-compute service. Could it be it consumed all rpc workers on nova-conductor? ¯\_(ツ)_/¯20:36
mgagneI don't have much time in front of me. lets hope this improves the situation.20:36
*** hasharAway has quit IRC20:37
clarkbmgagne: thank you for looking20:38
mgagneclarkb: will nodepool attempt to retry deletion if instance is now in error state?20:39
clarkbmgagne: yes it should retry every 5 (or is it 15 minutes)20:39
mgagnecool so restarting nova-compute should put the instance in error and nodepool can continue its cleanup from there.20:40
Shrewscorvus: sweet20:40
corvusclarkb: there's definitely a 5 in there.  but i think it's every 5 *seconds* now :)20:43
*** rkukura has quit IRC20:43
clarkbpersistent20:43
fungisomethingsomething5something20:43
clarkbif anyone is wondering the tempest-slow job is really slow20:44
clarkbbut we might actually merge some changes shortly20:44
clarkbcorvus: http://grafana.openstack.org/d/ykvSNcImk/nodepool-inap?orgId=1&from=now-3h&to=now is it normal for zuul to not assign those ready nodes for this significant amount of time?20:45
clarkbthe zuul queue lengths are short20:46
clarkball 11 executors are online and accepting20:46
clarkbthis must be slowness in the scheduler?20:46
corvusclarkb: yeah, if there's no executor queue, then it's the scheduler not getting around to dispatching the jobs20:46
corvusthe scheduler is not currently behind on work20:47
clarkboh wow refresh the graph20:48
clarkbin the last minute or two almost all of those nodes went to in use20:48
corvusi guess we looked just a bit too late20:48
pabelangercould be zookeeper slowing down?20:48
pabelanger(just a guess)20:48
Shrewsnodepool.o.o seems very idle20:49
corvuspabelanger: based on what?20:49
Shrewsoh, wait.20:49
Shrewsjava 170% cpu20:50
Shrewsneat20:50
Shrews320%20:50
pabelangercorvus: there have been issues in base with nodepool not allocating nodes if zookeeper was laggy, based on comments SpamapS has made in past20:50
corvuspabelanger: issues in base?20:50
pabelangersorry, past*20:51
corvuspabelanger: the current issue is that nodepool allocated nodes and zuul did not immediately use them20:51
clarkbspamaps problems were related to disk io I think. SpamapS and tobiash both run on top of tmpfs now20:51
pabelangercorvus: right, that is what SpamapS said in the past20:51
corvusShrews: java, unlike python, is really good at using multiple processors21:51
Shrewsiostat shows mostly idle disk20:51
clarkbI'm booting a clarkb-test instance in bhs1 using the current xenial image. I want to see if it will ever start up without nodepool timing out on it20:54
ianwclarkb: catching up ... lmn if i can help20:55
clarkbianw: mostly at a loss as to why bhs1 stopped working, but it's basically unusable for us now. Instances don't seem to boot at all and we leaked a bunch of ports prior to that20:55
clarkbianw: we think we have the ports cleaned up but now trying to see if anything will boot20:55
clarkbianw: likely our next step is to follow up with amorin during european hours tomorrow20:56
*** agopi has joined #openstack-infra20:56
clarkbother than that I think vexxhost is largely working again (though there are some weird volumes there that would be good to have mnaser glance at before we delete them)20:56
clarkbhttp://paste.openstack.org/show/730964/ is those details20:56
clarkband mgagne just did a thing to make inap happier20:57
openstackgerritMatthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command  https://review.openstack.org/60538620:57
clarkbianw: other than general cloud capacity items like ^ I think the continuing issues are largely related to gate flakiness and the cost of resets20:58
clarkbtripleo has a very deep gate queue and resets are common. Openstack integrated gate isn't quite so large but has also shown signs of flakiness20:58
*** trown is now known as trown|outtypewww20:59
*** bobh has quit IRC20:59
ianwok, thanks for the update :)20:59
*** colonwq has joined #openstack-infra20:59
ianwquick question; i started looking at graphite.o.o upgrade ... the puppet is very procedural (install stuff, template, start service).  i've started looking at making it all ansible running on bionic ... see any particular issues with this?21:00
*** rkukura has joined #openstack-infra21:00
corvusianw: maybe we can use containers?21:00
*** fuentess has quit IRC21:04
ianwcorvus: yeah, i started looking at that; i don't know if i'm sold really ... say i make a job to make a statsd container, a carbon container, etc, then a playbook to plug it all together.  what advantage do we have over just having a playbook putting this stuff on a host?21:04
ianwand instead of unattended-upgrades, we now have to manage the dependencies in all those sub-containers21:04
corvusianw: it's a good question, but i think the spec addresses it.  in short, os-independence.21:05
*** slaweq has quit IRC21:06
corvusShrews, clarkb: i looked at node 0002342552 assigned to request 200-0006321477 which has been ready for 7 minutes in inap.  the request is still 'pending'.  so nodepool-launcher hasn't marked it fulfilled yet, even though the only node in the request is ready.21:07
corvusthat's on nl0321:07
*** smarcet has joined #openstack-infra21:11
fungiis it really os-independent given that there is an os inside every container (albeit a hopefully minimal one)?21:11
corvusShrews, clarkb: nodepool is very busy there; perhaps this is thread starvation21:11
pabelangercorvus: oh, interesting, we have 5 active providers there. Maybe getting close to splitting that out to another launcher?21:12
corvusfungi: i should have said no more than "see the spec"21:12
clarkbwe could rebalance the providers. I think nl02 is particularly quiet recently21:12
fungiheh21:12
mgagneI really have to go. But to summarize, I restarted nova-compute on most of the compute nodes. (not all) It put "BUILD/deleting" instances in ERROR state so Nodepool could clean them up. Hopefully new deletions won't get stuck anymore.21:12
clarkbmgagne: thanks, I think it did get things moving21:12
mgagne+121:12
*** panda is now known as panda|off21:13
fungicpu on nl01 looks pretty much pegged (much of that claimed for system)21:15
corvusin the nicest way possible i'd like to convey the idea that i don't really enjoy having "container-vs-not-container" debates and i think it's important that we go with the consensus we achieved in the spec after much work and deliberation rather than having more container debates.  if we truly want to re-open the discussion we had in the spec (without having even really begun on actual implementation) that's certainly a choice, but let's make that choice clearly.21:15
fungii'm good with the consensus, i just don't even know where to start with trying to containerize something21:16
clarkbcorvus: while I agree I think we also said in the spec we would move there gradually and build up the tooling to make this happen21:16
clarkband so maybe upgrading graphite can move in parallel to getting tooling in place to build container images so that we can deploy graphtie with containers?21:16
fungilike, i'd like to work out how to containerize mailman3 but i feel pretty out of my depth on container standards and paradigms to know what that should look like21:17
corvusclarkb: true.  i read that as "keep puppeting" rather than "convert puppet to ansible then maybe someone will container"21:17
*** jamesmcarthur has quit IRC21:18
clarkbfungi: the big step 0 which we've started for nodepool and zuul is building images in a repeatable manner to keep up with updates21:18
fungii'm still struggling to come to grips with ansible instead of puppet, but since we have people who want to do the ansible and container legwork i'm cool with the direction we settled on21:18
corvus(and to be fair, many services will be easier to run in ansible rather than containers, especially if we're just running os packages.  but graphite is a bunch of pypi packages, so seems like an ideal candidate for containers)21:19
*** jamesmcarthur has joined #openstack-infra21:19
fungiahh, yeah, my main thought exercise was to try and figure out how to containerize gerrit and the various java libs it needs21:20
corvusfungi: there's already a gerrit container image21:20
fungibut maybe approaching python stuff first will be easier to grasp21:20
fungii thought we said reusing existing containers was a no-go and we were going to insist on building all ours from scratch instead?21:20
funginow i'm confused21:20
corvusfungi: well, if that's the case, we at least presumably have a build script21:21
fungii probably misunderstood the answer when i asked that question earlier21:21
fungigot the impression we didn't want to reuse anyone else's container build tooling21:21
fungibut if it's just that we want to regenerate containers using existing tools provided by the upstream maintainers, that's not so tough to grasp21:22
clarkbthere is the build tooling and then the resulting images. I think we should reuse images if possible, but we'll likely have to see how reusable these images are21:22
corvusanyway, i merely suggest that given the spec, perhaps leaving graphite as puppet or working on containerizing it may be better choices at the moment than translating the puppet to ansible21:23
corvusianw: thanks for asking :)21:23
ianwsay i make a job using buildah to produce an image that contains statsd (that's probably the very simplest case, no secrets, nodejs + statsd is all that's required)21:23
ianwas a first step, where does that image go after build?21:23
fungimaybe also a terminology gap for me. when i hear "build tooling" i think the makefiles provided by upstream for installing things into a container image21:23
clarkbianw: dockerhub?21:24
fungii.e. the things we currently have puppet exec21:24
clarkbfungi: there is also a severe case of NIH when it comes to container image build tools. Basically everyone has one21:24
clarkbI think the spec suggests we start with dockerfiles and docker build because it is the most common tool21:25
fungiso, like, would we still use pip for installing python-based packages into a container image or do we need to do something lower-level? is it okay to install pip temporarily into the container and then uninstall it before creating the image?21:26
fungido the images get created under a chroot on a loopback device and then copied?21:26
fungior do we tar up the chroot after it's tied off?21:27
fungii guess a dockerfile is like a makefile that contains the steps for writing things inside the chroot?21:28
clarkbyou would use pip or any other package manager (or make etc) as part of the image build to create the image contents. It is ok to have intermediate steps that don't show through to the final image (though if you want to keep image size down you have to take additional steps to clean those out)21:28
clarkbfungi: yes21:28
clarkbI'm not actually sure what docker uses as a transport format to and from dockerhub, likely something like a tarball21:29
ianwhrm, there's an official docker graphite+statsd+carbon docker21:29
ianwhttps://github.com/graphite-project/docker-graphite-statsd#secure-the-django-admin21:29
fungias opposed to whatever the thing is that tells you how to stack multiple image layers?21:29
ianwis a little worrying "First login at: http://localhost/account/login Then update the root user's profile at: http://localhost/admin/auth/user/1/"21:29
fungidocker does layered images, right?21:29
fungior i guess we can just decide to squash them all into a single layer if we want?21:30
clarkbfungi: yes it does layered images, that is one of the reasons that cleaning out the intermediate stuff you don't need in the end result can be tricky (like say pip being installed to pip install but then is deleted after, by default you'd keep the layer that had pip installed)21:31
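A hypothetical Dockerfile sketch of that point: cleanup only helps if it happens in the same RUN (i.e. the same layer) as the install, otherwise an earlier layer still carries the deleted files. The package names here (gcc, libffi-dev, cffi) are purely illustrative:

    FROM python:3.6-slim
    # Build deps are installed, used, and purged within a single RUN so they
    # never persist as a layer; --no-cache-dir keeps pip's cache out of the image.
    RUN apt-get update \
        && apt-get install -y --no-install-recommends gcc libffi-dev \
        && pip install --no-cache-dir cffi \
        && apt-get purge -y --auto-remove gcc libffi-dev \
        && rm -rf /var/lib/apt/lists/*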
ianwoh, i just thought of one big issue ... ipv621:31
fungiwhy is ipv6 an issue?21:32
clarkbshouldn't be with docker itself, k8s doesn't really support ipv6 yet though21:32
ianwfungi: well, something like https://github.com/graphite-project/docker-graphite-statsd, which at first glance looks like a lot of work is done for you, well it's not really21:32
*** jamesmcarthur has quit IRC21:33
clarkbhrm my ubuntu-xenial test in bhs1 booted just fine looks like21:33
fungibut also for web-based services we can still proxy from an apache running outside the container on the host server to the v4 loopback if we wanted, right?21:33
clarkblet me try centos too21:33
clarkbfungi: yes21:34
clarkbfungi: possibly in another container21:34
ianwfungi: if we're in a world where we're running things in containers for simplicity, and then running external apache to proxy ipv6 into containers instead of processes running on the host listening on ipv6 ... i'm not sure i'd count that as winning :/21:35
fungiwell, it was suggested that we don't have to containerize all the things if it winds up being convenient to, say, run a gerrit container and leave apache in the host system context21:35
*** bobh has joined #openstack-infra21:35
clarkbya not required to do so, but could be done21:35
fungi(in cases where we already run services on the loopback and proxy them from apache anyway)21:35
ianwoh right, yeah that's not the case for say statsd currently, but is for gerrit21:36
*** graphene has joined #openstack-infra21:36
clarkbok bhs1 is working now that I manually boot stuff21:36
clarkbI am going to turn max-servers back up to 8 again21:37
clarkbperhaps this is nodepool specific somehow? we'll have to see if the behavior persists21:37
fungiyeah, maybe they fixed it, or maybe there's some deeper problem there21:38
*** bobh has quit IRC21:40
clarkb| 0002343764 | ovh-bhs1             | ubuntu-xenial          | d427d8d8-4a31-4f02-bab3-eb857e0fcf9b | 158.69.64.222   | 2607:5300:201:2000::26                 | in-use   | 00:00:00:06  | locked   |21:40
clarkbthat looks promising21:40
clarkbianw: ^ fyi nl04 is in the emergency file so that we can tweak the max-servers value there to help debug bhs121:40
ianw++21:42
*** ijw has quit IRC21:46
mordredianw: yah - for some (possibly many) of the services, there is likely a container already built we can use21:50
mordredianw: for python things we need to build, I've been using the python:slim or python:alpine base images, which already have pip and stuff installed - so you can do really simple dockerfiles like "pip install ." or something21:51
mordredand similarly, thinking about things like gerrit, I would expect to use a "java" base image and then just install the war file in there21:51
mordredor something21:51
mordredbut that's also all just in my imagination of course :)21:52
mordredianw: also - for anything we need to make a container of that uses pbr and bindep, we can use pbrx to build images - so like when we start making storyboard containers, for instance21:53
*** jamesmcarthur has joined #openstack-infra21:54
ianwmordred: so as step one, we need an ansible role to install & configure docker on a host, right?  that's not done?21:54
clarkbstill only the one DOWN port in BHS1 too21:54
ianwapt-get install docker seems fine ... reading up on making sure it talks ipv6 seems a little harder21:54
clarkbI've got to go pick up a box of vegetables, any objection to me increasing max-servers to say 80 on bhs1 to see how that goes?21:54
clarkbI won't be able to check on it for about 45 minutes likely21:55
ianwclarkb: just keep an eye on ports?21:56
clarkbianw: ports in a DOWN state and whether or not instances are actually booting21:57
clarkbianw: earlier today after we cleaned up the ports in a DOWN state we ended up not being able to boot anything21:57
*** jamesmcarthur has quit IRC21:58
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Don't report non-live items in stats  https://review.openstack.org/60554022:00
clarkbianw: fungi: you all ok with me making that max-server edit?22:01
ianwclarkb: ok, i have a window up with pre-warmed bash cache for monitoring ovh-bhs1 openstackjenkins22:01
*** dklyle has quit IRC22:02
clarkbok max servers set to 8022:02
ianwwhat's the 3 servers in there that don't have an "image"?22:03
ianwpossibly we just refreshed images?22:03
gouthamrhi infra experts, i switched the nodeset on a set of jobs to ubuntu-bionic, but got a weird side-effect - the post run playbook is ignored - https://review.openstack.org/#/c/604929 - is this a known issue? or pbkac on my end?22:04
gouthamrthe playbook's there in the ARA report, but not executed22:04
gouthamrsample: http://logs.openstack.org/29/604929/2/check/manila-tempest-dsvm-mysql-generic/ab0b8cf/22:04
clarkbgouthamr: I think the devstack base job may expect certain groups in your nodeset22:05
clarkbgouthamr: you'll need to update the nodeset and include those if you haven't22:05
*** boden has quit IRC22:06
gouthamrclarkb: oh.. you mean the "legacy-dsvm-base" job?22:06
*** kgiusti has left #openstack-infra22:06
clarkbhrm that job should be named legacy-manila-tempest-dsvm-mysql-generic if it isn't a v3 native job22:07
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Speed up build list query under mysql  https://review.openstack.org/60517022:07
clarkbno I meant the native v3 devstack job base22:07
clarkbI don't know if legacy-dsvm-base needs anything like that22:08
mordredianw: we have a role for that ...22:09
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Improve docs for inventory_file  https://review.openstack.org/60266522:09
gouthamrclarkb: true! that isn't v3 native; dunno why we changed the name on that22:09
clarkbgouthamr: the legacy jobs use the same devstack-single-node nodeset that the non legacy jobs use22:09
mordredianw: http://git.openstack.org/cgit/openstack-infra/zuul-jobs/tree/roles/install-docker22:09
clarkbgouthamr: and that sets a node to be named controller. look in openstack-dev/devstack/.zuul.yaml22:10
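Roughly what such a nodeset looks like — the key detail is the node named "controller", which the devstack playbooks expect; the name below is illustrative and the canonical definitions live in openstack-dev/devstack/.zuul.yaml:

    - nodeset:
        name: devstack-single-node-ubuntu-bionic
        nodes:
          - name: controller
            label: ubuntu-bionic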
mordredianw: it might be worth us rearranging that a little bit - based on the discussion we had about roles in system-config22:10
mordredianw: this might be another of the ones that wants to live in system-config?22:10
corvusmordred: why system-config?22:10
mordredcorvus: if we're going to use it in production ansible to install docker on a system we want to run services in containers on - then it would be a bit weird for system-config to need to add the zuul-jobs repo to bridge.o.o in order to get the role?22:11
corvusthat's currently zuul-jobs, which is our widest applicable source of roles, so moving to system-config is a scope reduction22:11
corvusmordred: maybe that should be an independent repo then22:12
mordredcorvus: maybe so - maybe this is the first one where being an independent repo has a benefit22:12
clarkbopenstack.exceptions.SDKException: Error in creating the server: Build of instance 7bbfc561-5a4a-4fc7-8f5c-02e60bc61511 aborted: Request to https://network.compute.bhs1.cloud.ovh.net/v2.0/ports.json timed out22:12
corvusmordred: though, *that* role sets up CI-specific mirrors -- so we may need 2 roles22:12
clarkbmaybe all our troubles are not behind us :/22:12
mordredcorvus: oh -that's a great point22:12
ianwclarkb: i also noticed a 500 error when i tried to list the servers, but only once22:13
ianwclarkb: more general intermittent issues?22:13
ianwyeah, lot of ERRORs22:14
*** jamesmcarthur has joined #openstack-infra22:14
clarkbI'll set it back to 8 and then go pick up my groceries22:15
fungisorry, stepped away for dinner. clarkb: increasing max-servers again seems worth trying, i agree22:16
ianwfungi: heh, tried it and it didn't work :)22:16
*** smarcet has quit IRC22:16
mordredianw: corvus has brought me back around to your way  of thinking - we need an install-docker role for our production ansible22:16
mordredianw: I'd argue that it could be _very_ similar to the install-docker role in zuul-jobs though - just maybe not with ci mirrors22:16
clarkbfungi: ya was worth trying, but still broken, I undid it22:17
mordredianw, corvus: actually - sorry - now I'm changing my mind again22:17
mordredhow about I go play with something for a second and get back to the two of you22:17
fungii'm also figuring that rather than completing the mediawiki puppeting now i should be looking at containerizing it instead. is there a base php container it should be built on?22:18
ianwmordred: sure, well knowing how to get docker onto the host seems like the first step in using this.22:19
*** jamesmcarthur has quit IRC22:19
fungilike the alpine python base?22:19
mordredfungi: https://hub.docker.com/_/mediawiki/22:20
mordredfungi: there's even a mediawiki container22:20
fungiyeah, just found https://docs.docker.com/samples/library/mediawiki/22:20
mordred\o/22:20
fungias well as a lot of discussion about the complexity of adding extensions22:21
mordredfungi: but if we want to roll our own - their dockerfile shows: FROM php:7.2-apache22:21
mordredso that seems to maybe be a good base image22:21
clarkbmordred: in the openstacksdk exception above is sdk/shade directly trying to manage the ports?22:21
mordredfungi: https://github.com/wikimedia/mediawiki-docker/blob/41edcc8020aa47823d30c1b35f216b0a2834b2b6/stable/Dockerfile22:21
clarkbI wonder if that is part of the problem. when I boot manually I use openstackclient which probably just talks boring nova api22:22
mordredclarkb: in some cases it might be22:22
mordredclarkb: we query ports after booting a server to find the port id22:22
mordredclarkb: so that we can pass the port_id of the server's fixed ip to the floating_ip create call, creating the fip on the server's port22:22
clarkbmordred: but we don't use floating IPs here22:22
mordredoh - then something is wrong - we shouldn't be listing ports on a non fip cloud22:23
mordredwe MIGHT be trying to list them to get meta info about the server's interfaces22:23
mordredbut we should only ever be listing22:23
clarkbya this looks like its part of listing the server details22:23
clarkbhttp://paste.openstack.org/show/730970/ is full traceback22:24
clarkbif we go back to hunches I wouldn't be surprised if ovh doesn't expect you to talk to neutron so much since they don't really enable any sdn features22:24
fungimordred: aha so we would just have a periodic job slurp down a dockerfile like that and then push the resulting image to a server?22:24
mordredoh - wait22:24
mordredclarkb: that's not the sdk interacting with ports22:24
clarkbok really need to grab veggies now, but maybe that makes sense to you mordred22:25
clarkbmordred: basically anything that seems to directly mess with ports is unreliable22:25
mordredclarkb:                     "Error in creating the server: {reason}".format(22:25
mordred                        reason=server['fault']['message']),22:25
clarkboh so that is from nova?22:25
mordredthat's us printing the fault.message in the server payload from nova22:25
clarkbgotit22:26
mordredwe should clarify that in the error message22:26
mordred"Error in creating the server, nova reports error: {reason}"22:26
mordredwould be clearer, yeah?22:26
clarkb++22:28
mordredclarkb: remote:   https://review.openstack.org/605544 Clarify error message is from nova22:28
mordredfungi: yes, that's what I'm thinking22:29
fungiso this is probably similar to the errors i was getting from openstackclient in that case (nova reporting a problem talking to neutron)?22:29
mordredfungi: or, a periodic job that builds an image based on a dockerfile like that - then publishes the image to either dockerhub or a docker hub we run22:29
fungii guess if we want to fork the dockerfile we can do that to add things like different configuration or additional non-core extension and theme bundles?22:30
mordredfungi: then in the ansible, instead of "apt-get install mediawiki" we'd just do "docker run opendev-ci/mediawiki" or something22:30
mordredfungi: yes.22:30
fungior is that where layered images come into play?22:30
mordredfungi: we could also just build a new image based on the mediawiki image22:30
mordredfungi: so make a dockerfile that says "from mediawiki" at the top, then plop our extensions and themes in22:31
mordredfungi: I think for config, we just put it on the server like normal, and bind-mount the config dir in as a docker volume22:31
fungiahh, so dockerfiles have an inheritance concept. that's what the FROM php:7.2-apache at the top means?22:31
mordredfungi: so like "docker run -v /etc/mediawiki:/etc/mediawiki opendev-ci/mediawiki"22:32
mordredfungi: yes22:32
ianwmordred: it would be more a .service file that does the docker run though?22:32
mordredianw: yah - I'm just spitting commandline docker commands in channel for conversation22:32
mordredjust saying - zuul job publishes to dockerhub (or a private dockerhub) - then on the server we just pull/run the image22:32
fungithe fact that the mediawiki dockerfile runs `apt-get install ...` suggests that the php:7.2-apache image is debianish22:33
mordredfungi: yes indeed it does22:33
fungiand not some stripped-down thing like alpine or coreos i guess?22:33
mordredfungi: if you want - you should be able to run "docker run -it --rm mediawiki -- /bin/bash" and get a shell in a container running that image and see what it's based on and what's there22:34
mordredclarkb, fungi, ianw corvus : speaking of - opendev is taken on dockerhub - what say I grab opendevorg as an account to push things to?22:35
corvusmordred: ++22:35
*** jamesmcarthur has joined #openstack-infra22:35
*** rcernin has joined #openstack-infra22:35
fungithe deeper down this rabbit hole i go, the more it seems like running containerized services means 1. understanding conventions of multiple distros since you won't know which ones a given upstream image is based on, and 2. sort of also being responsible for your own distribution since versions of stuff are hard-coded all over the place needing you to bump them to get security fixes? surely this can't be the actual reality of containerization22:36
fungiopendevorg sgtm22:36
*** eernst has joined #openstack-infra22:37
*** tosky has quit IRC22:37
mordredfungi: yup. that is, in fact, the reality of containerization22:37
fungiat least for the mediawiki image, it's just mediawiki versions themselves which are hard-coded22:38
mordredfungi: however, due to the idea of microservices and service teams - it's frequently only one service and one base image you're ever poking at22:38
fungiso no worse than our (incomplete) configuration management in that regard22:38
mordredfungi: yah22:38
mordredfungi: also - I've been finding that the majority of container images are actually debuntu based - unless someone wants to make it smaller in which case there is also frequently an alpine variation22:39
*** jamesmcarthur has quit IRC22:40
SpamapSFYI, zookeeper's disk IO is *VERY* bursty22:40
fungimailman3 seems to use docker-compose to build their images https://github.com/maxking/docker-mailman/blob/master/docker-compose.yaml22:40
SpamapSit wants to sync all of its writes to disk periodically, and if you did a lot....22:40
SpamapSit gets mad22:40
SpamapSIt never shows as heavy disk IO, it just shows as heavy latency on the syncs22:41
SpamapSand no I don't run on tmpfs, I just run on a dedicated ZK node.22:41
SpamapSpreviously it was battling with other processes for that IO latency and that made it miss checkpoints22:42
fungiaha, FROM python:3.6-alpine https://github.com/maxking/docker-mailman/blob/master/core/Dockerfile22:42
mordredfungi: yeah. that docker compose file is a way to say "please launch this set of images for me"22:43
fungiso probably not too dissimilar from whatever we'll use for other python services22:43
mordredfungi: yah - that's what we're using in the zuul and nodepool images22:43
fungioh, so docker-compose is not a higher-level image build language, it's a deployment language?22:43
mordredah22:43
mordredyah22:43
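Loosely, a compose file just lists which images to run and how to wire them together, along these lines (service names, images, and paths are illustrative, not the actual upstream file):

    version: '3'
    services:
      mailman-core:
        image: maxking/mailman-core:latest
        volumes:
          - /opt/mailman/core:/opt/mailman
      database:
        image: postgres:9.6-alpine
        environment:
          POSTGRES_DB: mailman
          POSTGRES_USER: mailman
          POSTGRES_PASSWORD: changeme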
fungiand again, at least for the mm3 core image, it's only hard-coding mailman package versions and i guess taking whatever latest is on pypi otherwise22:45
fungiso not too onerous22:45
mordredin fact, jesse keating had a docker compose file a while back for booting a zuul on a machine - so you can say "docker compose up" and docker compose will launch all of the processes, in containers, connected together to produce the zuul service22:45
mordredfungi: yah22:45
fungii already worked out how to get mm3 services deployed from debian packages, but translating it to alpine will probably take some time22:46
fungii have a feeling we might stick with distro-packaged exim similarly to how we might stick with distro-packaged apache for some of these things22:47
mordredfungi: well, you could maybe just use the docker mailman images?22:47
mordredfungi: yup. totally agree22:47
fungii mean from a configuring it standpoint22:47
mordredfungi: I don't think it's a win for us for the forseeable future for things like apache22:47
mordredfungi: oh - gotcha22:47
fungithe debian packages preconfigure an awful lot of the wsgi bits and stuff22:48
*** dpawlik has joined #openstack-infra22:48
fungiit's django and wsgi and...22:48
mordredfungi: https://hub.docker.com/r/maxking/mailman-core/ seems to be the main image22:49
fungiand database setup22:49
mordredyeah - seems to be a lot of stuff22:49
mordredfungi: well - if there are well maintained debian packages maybe it's one that we just install with ansible directly? I see the container stuff as more of a win when we're installing wars or installing python stuff from source or anything we're building ourselves22:51
fungihttps://asynchronous.in/docker-mailman/ is the container install walkthrough linked from the mm3 docs22:51
mordredbut - I guess if they have a full walkthrough of that- maybe it's a good learning example?22:51
mordredfungi: ooh - the security section of that is nice - we should do that with our images22:52
*** dpawlik has quit IRC22:52
mordred(and make sure we appropriately sign our images too)22:52
fungiyeah, maybe no need to containerize mm3, but also if we stick with the debian packages filtered through ubuntu we're stuck dealing with a mailman that's still rapidly changing (there are a _lot_ of bugs in mm3 being worked through still, from what i can see)22:53
*** rcernin has quit IRC22:53
mordredfungi: yah. whereas if you just follow these instructions22:54
*** eernst has quit IRC22:54
*** rcernin has joined #openstack-infra22:55
mordredand maybe even use docker-compose for it since that's how they're recommending - and that way we can follow the stuff they're publishing and do it in a way that would enhance our ability to interact with them?22:55
*** jamesmcarthur has joined #openstack-infra22:56
fungiseems worth trying to redo the current poc following my notes from https://etherpad.openstack.org/p/mm3poc and adapting them to their container walkthrough22:56
mordred++22:56
mordredwould at the very least be a learning case where you've got a clear set of instructions for the one way22:56
mordredin the mean time - I will get an install-docker role done by morning22:57
ianwmordred: sorry, where did we end up on "install docker to a host"?  is there something i can do?  i would like to get this bit nailed down (with ipv6) before i start looking at it22:57
ianwoh, jinx22:57
mordredianw: :)22:57
mordredianw: I think I nerd-sniped myself into helping on that bit22:57
openstackgerritMerged openstack-infra/openstackid master: Updated user profile UI  https://review.openstack.org/60417222:57
*** jamesmcarthur has quit IRC22:57
*** jamesmcarthur has joined #openstack-infra22:57
ianwmordred: i'm happy to take an initial stab, working from the existing role, if you're thinking of moving it into system-config?22:58
corvuss/move/copy/22:58
mordredianw: well - I think corvus made a great point, which is that the one in zuul-jobs has ci specific things22:58
mordredso yeah, I think the next thought was copy it22:58
mordredbut then I was thinking - the ci mirror stuff is just generic "use this mirror" and could still be useful - we just need to remove the defaults of zuul_docker_mirror_host22:59
mordredwhich is what I wanted to poke at to see what it looked like22:59
mordredif that does seem reasonable - then I think moving it into its own repo and having the jobs add a roles-path - and adding it to the galaxy install in system-config - could be nice22:59
clarkbI've set bhs1 back to max-servers 022:59
fungiwhat is our story around testing this stuff? should we keep most of the zuul-side automation reusable for check/gate jobs on our container configs?23:00
mordredcorvus, ianw does that above stream-of-consciousness make sense?23:00
mordredor - should we just copy the role to system-config for now23:00
mordredand maybe make the reusable shared role for later?23:00
ianwmordred: sort of, though i'm not sure what advantage having it in a separate repo affords?23:00
fungier, i guess that's basically what you're already discussing ;)23:00
mordredianw: to be able to use it in both zuul-jobs and system-config without having zuul-jobs need system-config or system-config need zuul-jobs23:01
ianwpresumably other people have done this, and we're not using *their* roles23:01
mordredtotally - and it might be a waste of energy :)23:01
ianwgiven that it's not that complex, and we might want to do things in production that we don't want in zuul-jobs without having to worry about it being generic, i'm feeling like putting it in system-config wins, for mine23:01
mordredso maybe we should just start with a copy of install-docker in system-config but without the mirror config?23:02
mordredianw: ++23:02
fungiwe can also run roles from system-config in jobs that test things in or related to system-config if we want, right?23:02
mordredfungi: totally23:02
ianwok, i'll be happy to spin that up with some basic test-infra as step 123:02
openstackgerritClark Boylan proposed openstack-infra/system-config master: Add zuul user to bridge.openstack.org  https://review.openstack.org/60492523:02
openstackgerritClark Boylan proposed openstack-infra/system-config master: Manage user ssh keys from urls  https://review.openstack.org/60493223:02
mordredfungi: but for ci and for docker specifically, we should use the install-docker role in zuul-jobs23:03
mordredfungi: since it properly configures ci mirrors23:03
fungioh, sure23:03
mordredbut yes, in general :)23:03
mordredianw: sweet!23:03
fungiokay, now i get it23:03
fungiso for deployment we have a shadow equivalent of install-docker which doesn't do the ci-specific setup23:03
fungiand stick that in system-config23:04
ianwfungi: yes, and quite possibly might do things production-specific, if we find the need23:04
mordredyah23:04
clarkbwe did leak a bunch of ports out of that short time with max servers 8023:04
mordredlike - we might decide we want some settings in the docker daemon.json23:04
fungiclarkb: yech. are you still worried this is something recently regressed in nodepool/shade?23:05
mordredianw: also - we can simplify that role for production and just keep the 'use_upstream_docker' codepath23:05
ianwyes, like the ipv6 settings i keep harping on about23:05
clarkbfungi: no mordred pointed out that message came from nova itself and shade was just passing it through23:05
mordredfungi: no - I believe it's an openstack-side issue23:05
mordredianw: ++23:05
fungiyeah, makes sense that the errors nodepool reported are the equivalent of some of what i was seeing from osc23:05
mordredianw: so maybe it's "copy install-docker, start deleting, completely change daemon.json.j2" :)23:05
ianwsounds about right23:07
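A rough sketch of what the production-side tasks might reduce to once the CI mirror handling is deleted; repo and package names follow Docker's upstream Ubuntu instructions, and daemon.json.j2 stands in for a local template carrying settings such as IPv6:

    - name: Add Docker's apt signing key
      apt_key:
        url: https://download.docker.com/linux/ubuntu/gpg
        state: present

    - name: Add the upstream Docker apt repository
      apt_repository:
        repo: "deb https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable"
        state: present

    - name: Install docker-ce
      apt:
        name: docker-ce
        state: present
        update_cache: yes

    - name: Install daemon.json (IPv6 settings, etc.)
      template:
        src: daemon.json.j2
        dest: /etc/docker/daemon.json
      notify: restart docker  # assumes a matching handler exists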
ianwclarkb: am i understanding that at lower limits, no port leaks and everything was working, but then at 80 it started going wrong?23:08
clarkbianw: yes, though turning it back down to 8 it was unhappy still. (but plenty of ports leaked so maybe have to clean those up again?)23:09
clarkbI am composing an email to amorin since our timezones don't line up great. I will cc you and fungi as people who have looked at this so far23:10
ianwright, yeah saw those errors.  but individual boots didn't leak ports23:10
ianwanyway, ++ on the email.  it does seem like the error is coming from the other end23:11
fungithanks clarkb!23:11
*** jamesmcarthur has quit IRC23:17
clarkbemail sent23:18
*** jamesmcarthur has joined #openstack-infra23:19
*** tpsilva has quit IRC23:19
ianwhttp://logs.openstack.org/39/602439/11/check/openstack-infra-multinode-integration-fedora-latest/2fa2d84/job-output.txt.gz23:24
ianwtraceroute_v4_output": "git.openstack.org: Name or service not known\nCannot handle \"host\" cmdline arg `git.openstack.org' on position 1 (argc 2)\n",23:24
*** jlvillal has joined #openstack-infra23:24
ianwthat's interesting, did we know dns was the cause of those intermittent errors?23:25
ianwand how fascinating that was a multinode job and they *both* failed.  that suggests it wasn't just a fluke23:26
*** rh-jelabarre has quit IRC23:26
clarkbI think prometheanfire had pointed it out (they were failing when setting up gentoo)23:26
*** dklyle has joined #openstack-infra23:26
clarkbhttp://logs.openstack.org/39/602439/11/check/openstack-infra-multinode-integration-fedora-latest/2fa2d84/zuul-info/host-info.primary.yaml says localhost is the resolver so must be something with unbound?23:27
clarkb(grep dns in that file)23:28
*** graphene has quit IRC23:28
prometheanfireya, it's been happening a bunch23:29
prometheanfiresilly fedora23:29
*** mriedem is now known as mriedem_away23:30
mwhahahaany chance on getting this project-config change merged? https://review.openstack.org/#/c/603489/23:31
ianwclarkb: but unbound isn't setup yet, right, on these integration jobs?23:33
mwhahahagracias23:33
*** graphene has joined #openstack-infra23:33
clarkbianw: it should be but using our default config (which is ipv4 and ipv6 resolvers)23:33
clarkbunless we don't do that config on fedora23:34
ianwbut why would it only sometimes not work ...23:34
ianwclarkb: where did you see the resolver in info yaml?23:34
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Update Zuul service documentation  https://review.openstack.org/60555623:35
clarkbianw: it is under ansible_dns23:36
clarkbsays nameservers: - 127.0.0.123:36
prometheanfireis the service running?23:38
ianwhrm, right, so that would be pre reconfiguration23:38
ianwa working log is http://logs.openstack.org/39/602439/11/check/openstack-infra-base-integration-fedora-latest/08950a4/zuul-info/host-info.fedora-28.yaml and it has the same23:38
clarkbprometheanfire: I don't think we have that information, that would certainly be something to check. Maybe it isn't starting on boot on fedora reliably23:39
prometheanfiremaybe provider related? iirc, 2-3/4 of my rechecks for adding gentoo are for fedora failing23:40
ianwwe should capture unbound.log23:40
ianwand a ps dump at least23:41
*** rcernin_ has joined #openstack-infra23:41
pabelangerprometheanfire: yes, it will fail on ipv6 hosts I believe23:43
*** rcernin has quit IRC23:43
clarkbpabelanger: it shouldn't23:43
pabelangerclarkb: unbound needs to be configured before validate-hosts23:44
pabelangerhttp://logs.openstack.org/39/602439/11/check/openstack-infra-multinode-integration-fedora-latest/2fa2d84/job-output.txt.gz23:44
clarkbit is configured23:44
prometheanfireit's that race thing?23:44
pabelangerthat was the whole reason fedora-28 nodes were flakey in the gate23:44
openstackgerritMerged openstack-infra/project-config master: Add ansible-role-chrony project  https://review.openstack.org/60348923:44
pabelangerclarkb: right, but configure-unbound hasn't run yet for base-minimal jobs23:44
ianwpabelanger: it should be minimally configured in dib -> https://git.openstack.org/cgit/openstack-infra/project-config/tree/nodepool/elements/nodepool-base/finalise.d/89-unbound#n3723:44
clarkbyes, but the unbound config that is there should work23:45
pabelangerianw: yup, agree. for some reason, it is broken on ipv6. Fix was to first do configure-unbound and reload unbound service, and started working23:45
clarkbwe boot with working dns, if we don't that is a bug23:45
clarkbwe are only optimizing it from there on the jobs23:45
pabelangerhttp://git.openstack.org/cgit/openstack-infra/project-config/tree/playbooks/base/pre.yaml#n623:45
clarkbya I wonder if unbound is just not running at boot there23:46
pabelangerthe only difference I can think of, is for DIB we put both ipv6/ipv4 forwarders, for configure-unbound role, we choose the right one based on ipv6 / ipv423:46
clarkband restarting it is what made it happy23:46
ianwok, this seems like something i can test ... if we boot a fedora node in rax the suggestion is that unbound isn't immediately working, right?23:47
pabelangerclarkb: should be able to manually boot a fedora-28 to confirm23:47
pabelangerin inap23:47
clarkbianw: yes23:47
pabelangerbut, i think I did manually test this in rax or ipv623:47
pabelangerbut happy to see ianw try23:48
pabelangercould reproduce the issue23:48
pabelangercouldn't*23:48
clarkbpabelanger: the error above was on rax is why ianw is looking at rax I think23:48
ianwyeah ... but we have rax logs where it passes and where it fails23:48
pabelanger++23:49
*** openstackgerrit has quit IRC23:49
ianwlet me try creating a rax node manually and we can boot-cycle it and see if we hit something23:49
pabelangermaybe add port tcp/53 localhost check to validate-host too?23:50
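That check could be a one-task addition, roughly like this (assuming it slots into the existing validate-host playbook; the timeout is arbitrary):

    - name: Verify the local unbound resolver answers on tcp/53
      wait_for:
        host: 127.0.0.1
        port: 53
        timeout: 30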
*** jamesmcarthur has quit IRC23:50
ianwa hold on this job might help, although i think i tried that and got distracted because we never hit it to hold the node23:50
ianwand then zuul got restarted etc23:51
pabelangerianw: yah, I think that is what I did, autohold. then when I finally went back to check, everything was working.23:51
pabelangeroh, I remember, maybe it was a race starting unbound, but journald logs were too old to confirm23:52
ianwclassic heisenbug23:52
ianwwe do start unbound from rc.local right ...23:52
ianwmaybe we should drop in a .service file23:53
ianwno, what we do is "'echo 'nameserver 127.0.0.1' > /etc/resolv.conf" in rc.local23:55
