*** gyee has quit IRC | 00:14 | |
*** jamesmcarthur has joined #openstack-infra | 00:17 | |
*** openstackgerrit has joined #openstack-infra | 00:24 | |
openstackgerrit | MarcH proposed openstack-infra/git-review master: Make it possible to configure draft as default push mode https://review.openstack.org/220426 | 00:24 |
*** smarcet has joined #openstack-infra | 00:26 | |
*** longkb has joined #openstack-infra | 00:55 | |
*** jamesmcarthur has quit IRC | 00:58 | |
*** jamesmcarthur has joined #openstack-infra | 01:01 | |
*** diablo_rojo has quit IRC | 01:07 | |
*** rlandy|bbl is now known as rlandy | 01:09 | |
*** longkb has quit IRC | 01:10 | |
*** carl_cai has joined #openstack-infra | 01:18 | |
*** mrsoul has quit IRC | 01:19 | |
*** diablo_rojo has joined #openstack-infra | 01:20 | |
*** smarcet has quit IRC | 01:20 | |
*** hongbin has joined #openstack-infra | 01:21 | |
*** jamesmcarthur has quit IRC | 01:23 | |
*** smarcet has joined #openstack-infra | 01:26 | |
*** efried has quit IRC | 01:49 | |
*** jamesmcarthur has joined #openstack-infra | 01:51 | |
*** jamesmcarthur has quit IRC | 01:56 | |
*** efried has joined #openstack-infra | 02:01 | |
*** smarcet has quit IRC | 02:02 | |
*** anteaya has quit IRC | 02:04 | |
*** felipemonteiro has joined #openstack-infra | 02:08 | |
*** tinwood has quit IRC | 02:10 | |
*** tinwood has joined #openstack-infra | 02:11 | |
*** bobh has joined #openstack-infra | 02:13 | |
*** agopi has joined #openstack-infra | 02:13 | |
*** apetrich has quit IRC | 02:16 | |
*** longkb has joined #openstack-infra | 02:18 | |
*** jamesmcarthur has joined #openstack-infra | 02:20 | |
*** jamesmcarthur_ has joined #openstack-infra | 02:27 | |
*** jamesmcarthur has quit IRC | 02:27 | |
*** munimeha1 has quit IRC | 02:30 | |
*** roman_g_ has quit IRC | 02:47 | |
*** psachin has joined #openstack-infra | 02:53 | |
*** jamesmcarthur_ has quit IRC | 03:10 | |
*** bhavikdbavishi has joined #openstack-infra | 03:18 | |
*** diablo_rojo has quit IRC | 03:21 | |
*** jesusaur has joined #openstack-infra | 03:21 | |
*** lpetrut has joined #openstack-infra | 03:30 | |
*** bobh has quit IRC | 03:32 | |
*** felipemonteiro has quit IRC | 03:32 | |
*** ramishra has joined #openstack-infra | 03:37 | |
*** cfriesen has quit IRC | 03:50 | |
*** ykarel|away has joined #openstack-infra | 03:54 | |
*** ykarel|away is now known as ykarel | 03:54 | |
*** lpetrut has quit IRC | 03:56 | |
*** lbragstad has quit IRC | 04:00 | |
*** hongbin has quit IRC | 04:08 | |
*** janki has joined #openstack-infra | 04:12 | |
*** udesale has joined #openstack-infra | 04:27 | |
*** felipemonteiro has joined #openstack-infra | 04:28 | |
*** ykarel has quit IRC | 04:30 | |
*** armax has quit IRC | 04:38 | |
openstackgerrit | Merged openstack-infra/irc-meetings master: Remove Glare meeting https://review.openstack.org/612693 | 04:40 |
*** ykarel has joined #openstack-infra | 04:49 | |
*** spsurya has joined #openstack-infra | 04:50 | |
*** larainema has joined #openstack-infra | 04:55 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul-jobs master: WIP: Add sar logging roles https://review.openstack.org/613112 | 04:57 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul-jobs master: Pin flake8 https://review.openstack.org/613194 | 05:07 |
*** armax has joined #openstack-infra | 05:08 | |
*** carl_cai has quit IRC | 05:08 | |
openstackgerrit | Ian Wienand proposed openstack-infra/nodepool master: Prepend exception output with time, date and thread https://review.openstack.org/613196 | 05:10 |
*** kjackal has joined #openstack-infra | 05:16 | |
*** felipemonteiro has quit IRC | 05:16 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul-jobs master: WIP: Add sar logging roles https://review.openstack.org/613112 | 05:22 |
*** rlandy has quit IRC | 05:31 | |
openstackgerrit | Simon Westphahl proposed openstack-infra/zuul master: Fix issue in Github connection with large diffs https://review.openstack.org/612989 | 05:32 |
*** armax has quit IRC | 05:39 | |
openstackgerrit | Simon Westphahl proposed openstack-infra/zuul master: Fix issue in Github connection with large diffs https://review.openstack.org/612989 | 05:49 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul-jobs master: Add sar logging roles https://review.openstack.org/613112 | 06:01 |
*** tobiash_ has quit IRC | 06:03 | |
*** lpetrut has joined #openstack-infra | 06:03 | |
*** tobiash has joined #openstack-infra | 06:04 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/zuul-jobs master: Fix flake8 3.6.0 errors https://review.openstack.org/613205 | 06:11 |
openstackgerrit | OpenStack Proposal Bot proposed openstack-infra/project-config master: Normalize projects.yaml https://review.openstack.org/613206 | 06:13 |
*** gfidente has joined #openstack-infra | 06:26 | |
*** kjackal_v2 has joined #openstack-infra | 06:34 | |
*** kopecmartin|off is now known as kopecmartin | 06:34 | |
*** slaweq has joined #openstack-infra | 06:35 | |
*** kjackal has quit IRC | 06:37 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul-jobs master: Add sar logging roles https://review.openstack.org/613112 | 06:42 |
*** AJaeger has quit IRC | 06:43 | |
*** aojea has joined #openstack-infra | 06:45 | |
*** quiquell|off is now known as quiquell | 06:49 | |
*** xek has joined #openstack-infra | 06:49 | |
*** AJaeger has joined #openstack-infra | 06:57 | |
*** xek has quit IRC | 06:59 | |
*** rcernin has quit IRC | 07:00 | |
*** apetrich has joined #openstack-infra | 07:00 | |
*** pcaruana has joined #openstack-infra | 07:04 | |
*** cfriesen has joined #openstack-infra | 07:08 | |
*** ccamacho has joined #openstack-infra | 07:09 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul-jobs master: DNM: Run tox with eatmydata https://review.openstack.org/613221 | 07:11 |
*** SpamapS has quit IRC | 07:11 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: DNM: Enable sar logging for unit tests https://review.openstack.org/613117 | 07:11 |
openstackgerrit | Merged openstack-infra/project-config master: Normalize projects.yaml https://review.openstack.org/613206 | 07:15 |
*** jpena|off is now known as jpena | 07:15 | |
*** ginopc has joined #openstack-infra | 07:16 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: DNM: pass LD_PRELOAD and LD_LIBRARY_PATH vars https://review.openstack.org/613222 | 07:20 |
*** SpamapS has joined #openstack-infra | 07:24 | |
*** hashar has joined #openstack-infra | 07:25 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul-jobs master: DNM: Run tox with eatmydata https://review.openstack.org/613221 | 07:37 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Fix flake8 3.6.0 errors https://review.openstack.org/613205 | 07:37 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: DNM: Enable sar logging for unit tests https://review.openstack.org/613117 | 07:37 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: DNM: pass LD_PRELOAD and LD_LIBRARY_PATH vars https://review.openstack.org/613222 | 07:37 |
*** carl_cai has joined #openstack-infra | 07:42 | |
*** Emine has joined #openstack-infra | 07:46 | |
*** cfriesen has quit IRC | 07:48 | |
*** ykarel is now known as ykarel|lunch | 07:59 | |
*** jpich has joined #openstack-infra | 08:00 | |
*** ccamacho has quit IRC | 08:00 | |
*** ccamacho has joined #openstack-infra | 08:01 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul-jobs master: DNM: Run tox with eatmydata https://review.openstack.org/613221 | 08:03 |
*** bhavikdbavishi has quit IRC | 08:11 | |
openstackgerrit | Natal Ngétal proposed openstack/gertty master: [Documentation] Add a link for aur. https://review.openstack.org/613238 | 08:21 |
*** roman_g has joined #openstack-infra | 08:32 | |
*** ykarel|lunch is now known as ykarel | 08:34 | |
*** e0ne has joined #openstack-infra | 08:42 | |
*** electrofelix has joined #openstack-infra | 08:54 | |
*** ccamacho has quit IRC | 08:54 | |
*** ccamacho has joined #openstack-infra | 08:55 | |
*** xek has joined #openstack-infra | 08:57 | |
*** dtantsur|afk is now known as dtantsur | 09:09 | |
*** tosky has joined #openstack-infra | 09:16 | |
ssbarnea|bkp2 | hi! i want to test some commands on the f28 image we use in CI. How can I do this? | 09:26 |
quiquell | ianw: ^ | 09:26 |
quiquell | ianw: Can we get the image and start it up at a local host ? | 09:27 |
ssbarnea|bkp2 | f28 images have some customizations that affect what we do, and I cannot really wait for CI for these. Currently I am using a ~clean f28, which is good for the generic use case, but i need to cover CI too. | 09:27 |
*** ccamacho has quit IRC | 09:27 | |
ianw | quiquell ssbarnea|bkp2 : you can grab the images from https://nb01.openstack.org/images/ ... they boot with config-drive + glean, so should pick up root keys via that | 09:30 |
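To boot one of those images locally as ianw describes, glean picks the root ssh key up from a config-drive. A minimal sketch of building that config-drive tree (an assumption based on the OpenStack config-drive convention; the uuid and key values are placeholders):

```shell
# Build a config-drive tree that glean can read on first boot.
# Assumption: glean consumes openstack/latest/meta_data.json and installs
# the keys found under "public_keys" for root.
mkdir -p /tmp/cfgdrive/openstack/latest
cat > /tmp/cfgdrive/openstack/latest/meta_data.json <<'EOF'
{"uuid": "00000000-0000-0000-0000-000000000001",
 "public_keys": {"root": "ssh-rsa AAAA...placeholder-key... test@example"}}
EOF
grep -q '"public_keys"' /tmp/cfgdrive/openstack/latest/meta_data.json && echo "metadata ok"
# Then (outside this sketch): pack the tree as an iso labeled config-2 and
# attach it to the downloaded qcow2, e.g.
#   genisoimage -R -V config-2 -o cfgdrive.iso /tmp/cfgdrive
#   qemu-system-x86_64 -m 2048 -drive file=image.qcow2 -cdrom cfgdrive.iso
```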
quiquell | ianw: Do they have the exclusion in dnf.conf ? | 09:33 |
quiquell | ianw: Or is this done later on in some ansible role ? | 09:33 |
ianw | quiquell: that will all be in the base image | 09:33 |
quiquell | ianw: ack | 09:34 |
quiquell | ykarel: ^ | 09:34 |
*** psachin has quit IRC | 09:35 | |
ykarel | quiquell, yes, that's what i referred to | 09:35 |
*** panda|off has quit IRC | 09:35 | |
ykarel | in #oooq | 09:35 |
ykarel | ok, you referred to the dnf.conf thing | 09:36 |
*** kopecmartin is now known as kopecmartin|afk | 09:38 | |
*** panda has joined #openstack-infra | 09:38 | |
*** derekh has joined #openstack-infra | 09:39 | |
*** psachin has joined #openstack-infra | 09:40 | |
ykarel | quiquell, so if i got it correct https://github.com/openstack/diskimage-builder/blob/e796b3bc1884cbb0a7259be486d835ca114cca9e/diskimage_builder/elements/pip-and-virtualenv/install.d/pip-and-virtualenv-source-install/04-install-pip#L29-L30 and https://github.com/openstack/diskimage-builder/blob/e796b3bc1884cbb0a7259be486d835ca114cca9e/diskimage_builder/elements/pip-and-virtualenv/install.d/pip-and-virtualenv-source-install/04-install-pip#L156 do add excludes | 09:41 |
ykarel | and the images in upstream are build using diskimage builder iirc, ianw ? | 09:41 |
quiquell | ykarel: I think that's it, and it's by design so we don't mess around with those | 09:42 |
ykarel | quiquell, yes if we know what we are doing, we can hack i think | 09:43 |
*** yamamoto has quit IRC | 09:44 | |
quiquell | ykarel: You mean changing dnf.conf at our jobs ? | 09:45 |
ianw | ykarel: yes, it's using those images | 09:45 |
*** yamamoto has joined #openstack-infra | 09:45 | |
ykarel | ianw, ack | 09:45 |
ianw | i mean elements | 09:45 |
ykarel | quiquell, yes if it's really required when using nodepool images | 09:45 |
ykarel | ianw, ack | 09:46 |
ianw | it might be that "yum install python-virtualenv" when it's held does nothing, and "dnf install python-virtualenv" fails | 09:46 |
quiquell | ykarel: We can fix our stuff by just not installing things that are already present on the system, so we use the nodepool versions | 09:46 |
quiquell | ykarel: I mean, for example, if virtualenv is already installed, don't install python*-virtualenv, and the same for pip and setuptools | 09:47 |
quiquell | ssbarnea|bkp2: ^ | 09:47 |
ykarel | quiquell, that's what i said, if really required; if we can fix it another way that's fine | 09:47 |
ykarel | but remember we need to add support for non nodepool images | 09:47 |
quiquell | ykarel: I will try to do that at my review | 09:47 |
ykarel | quiquell, ack | 09:48 |
quiquell | ykarel: yep, just want to make the job for f28 pass and then productify the changes so they work for all the environments | 09:48 |
ykarel | quiquell, cool | 09:48 |
quiquell | ykarel: Puff let's see | 09:48 |
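The behavior ianw describes above (a held package silently no-oping under yum but failing under dnf) can be probed without a CI round-trip. A sketch, assuming the image's dnf.conf carries an `exclude=` line added at image-build time; the exact package list here is illustrative:

```shell
# Parse the exclude list out of a dnf.conf (a sample file stands in for the
# image's real /etc/dnf/dnf.conf).
cat > /tmp/dnf.conf <<'EOF'
[main]
gpgcheck=1
exclude=python2-pip python2-virtualenv python3-pip python3-virtualenv
EOF
excluded=$(grep '^exclude=' /tmp/dnf.conf | cut -d= -f2)
echo "held packages: $excluded"
# quiquell's approach: skip the package install entirely when the tool is
# already present, so jobs work on both nodepool images and a clean f28.
if command -v virtualenv >/dev/null 2>&1; then
    echo "virtualenv already present; not installing python*-virtualenv"
fi
```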
*** yamamoto has quit IRC | 09:49 | |
*** carl_cai has quit IRC | 09:52 | |
*** longkb has quit IRC | 09:58 | |
*** jbadiapa has quit IRC | 10:02 | |
*** psachin has quit IRC | 10:03 | |
*** jbadiapa has joined #openstack-infra | 10:04 | |
*** psachin has joined #openstack-infra | 10:05 | |
*** yamamoto has joined #openstack-infra | 10:17 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool master: Add tox functional testing for drivers https://review.openstack.org/609515 | 10:20 |
*** bhavikdbavishi has joined #openstack-infra | 10:20 | |
*** psachin has quit IRC | 10:25 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool master: Implement a Kubernetes driver https://review.openstack.org/535557 | 10:25 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool master: Add tox functional testing for drivers https://review.openstack.org/609515 | 10:25 |
*** psachin has joined #openstack-infra | 10:27 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul-jobs master: Add prepare-workspace-git role https://review.openstack.org/613036 | 10:31 |
*** yamamoto has quit IRC | 10:34 | |
*** yamamoto has joined #openstack-infra | 10:35 | |
*** udesale has quit IRC | 10:36 | |
*** Emine has quit IRC | 10:38 | |
*** Emine has joined #openstack-infra | 10:39 | |
*** yamamoto has quit IRC | 10:39 | |
ssbarnea|bkp2 | what is the best place to talk about bindep? ... and its inability to have conditions based on distro version. | 10:41 |
openstackgerrit | Benoît Bayszczak proposed openstack-infra/zuul master: Disable Nodepool nodes lock for SKIPPED jobs https://review.openstack.org/613261 | 10:43 |
ssbarnea|bkp2 | https://storyboard.openstack.org/#!/story/2004176 -- bindep no support for disto versions | 10:44 |
openstackgerrit | Benoît Bayszczak proposed openstack-infra/zuul master: Disable Nodepool nodes lock for SKIPPED jobs https://review.openstack.org/613261 | 10:47 |
*** pbourke has quit IRC | 10:48 | |
*** pbourke has joined #openstack-infra | 10:48 | |
*** psachin has quit IRC | 11:02 | |
*** ccamacho has joined #openstack-infra | 11:07 | |
*** dave-mccowan has joined #openstack-infra | 11:15 | |
*** carl_cai has joined #openstack-infra | 11:17 | |
*** florianf is now known as florianf|pto | 11:19 | |
*** yamamoto has joined #openstack-infra | 11:24 | |
*** adriancz has quit IRC | 11:31 | |
*** jpena is now known as jpena|lunch | 11:34 | |
*** rh-jelabarre has joined #openstack-infra | 11:34 | |
*** jesusaur has quit IRC | 11:43 | |
*** lpetrut has quit IRC | 11:44 | |
*** jesusaur has joined #openstack-infra | 11:46 | |
*** bhavikdbavishi has quit IRC | 11:56 | |
*** ldnunes has joined #openstack-infra | 12:01 | |
*** haleyb has joined #openstack-infra | 12:02 | |
*** fuentess has joined #openstack-infra | 12:03 | |
*** quiquell is now known as quiquell|lunch | 12:05 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool master: Add tox functional testing for drivers https://review.openstack.org/609515 | 12:05 |
*** fuentess has quit IRC | 12:09 | |
*** adrianreza has quit IRC | 12:09 | |
*** jistr_ is now known as jistr | 12:14 | |
*** jamesmcarthur has joined #openstack-infra | 12:14 | |
*** jamesmcarthur has quit IRC | 12:19 | |
*** zul has joined #openstack-infra | 12:19 | |
*** boden has joined #openstack-infra | 12:21 | |
*** pcaruana has quit IRC | 12:26 | |
*** rlandy has joined #openstack-infra | 12:27 | |
*** ykarel is now known as ykarel|afk | 12:28 | |
*** beagles is now known as beagles_mtg | 12:29 | |
*** janki has quit IRC | 12:31 | |
*** janki has joined #openstack-infra | 12:31 | |
*** ykarel|afk has quit IRC | 12:33 | |
*** jpena|lunch is now known as jpena | 12:33 | |
fungi | ssbarnea|bkp2: here is probably the best place to talk about bindep (or on the infra ml) | 12:34 |
*** pcaruana has joined #openstack-infra | 12:39 | |
Shrews | corvus: clarkb: before we merge the zk cluster stuff to the launchers and zuul, I think we need a plan of action on how to handle the current provider instances. If we just switch, we'll have a LOT of instances we'll have to manually clean up (and rather quickly to free up quota). | 12:41 |
Shrews | corvus: clarkb: nodepool won't see those as leaked instances since they'll have the right metadata | 12:41 |
Shrews | maybe we should first set max-servers to 0 for all providers and let most of them go away naturally? | 12:41 |
*** gfidente has quit IRC | 12:42 | |
*** gfidente has joined #openstack-infra | 12:42 | |
fungi | ssbarnea|bkp2: i've commented on your story | 12:42 |
fungi | ssbarnea|bkp2: we've made extensive use of that feature in the past when, say, packages were renamed, split, combined, et cetera between different distro versions | 12:43 |
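For reference, the distro/version selectors fungi describes look roughly like this in a bindep.txt (a sketch; the exact profile names should be verified against `bindep --profiles` output on the target node):

```text
# pick packages per package-manager family...
libffi-dev [platform:dpkg]
libffi-devel [platform:rpm]
# ...and per distro release, e.g. when a package was renamed between versions
python2-dnf [platform:fedora-27]
python3-dnf [platform:fedora-28]
```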
*** quiquell|lunch is now known as quiquell | 12:43 | |
sshnaidm|ruck | fungi, clarkb do you know if it's possible to check whether the docker proxy works fine? We have some jobs (but not all) failing because container preparation takes a long time; I'd like to ensure we still download them from the proxy, not from docker.io | 12:44 |
sshnaidm|ruck | fungi, clarkb or maybe you know a way to check it in jobs - whether we download from the proxy or from docker.io directly | 12:45 |
* Shrews steps away momentarily for bfast before the mass zoo migration | 12:46 | |
fungi | i'm not familiar enough with what sort of debug output docker commands provide. does it not tell you the urls it's using? | 12:46 |
fungi | Shrews: clarkb: i'm similarly going to go try to catch early voting while it's hopefully quiet, and then be back as quickly as possible. what time did we say the zk migration was starting? | 12:46 |
*** smarcet has joined #openstack-infra | 12:47 | |
*** smarcet has quit IRC | 12:48 | |
*** jcoufal has joined #openstack-infra | 12:50 | |
*** ykarel has joined #openstack-infra | 12:50 | |
corvus | Shrews, clarkb: or if we set min-ready to 0 then stop zuul, [almost] all of the nodes should be deleted | 12:50 |
*** ansmith has joined #openstack-infra | 12:50 | |
*** janki has quit IRC | 12:56 | |
*** bnemec has joined #openstack-infra | 12:56 | |
*** yamamoto has quit IRC | 12:56 | |
*** yamamoto has joined #openstack-infra | 12:56 | |
*** bobh has joined #openstack-infra | 13:02 | |
*** lpetrut has joined #openstack-infra | 13:03 | |
*** rascasoft has quit IRC | 13:05 | |
*** _ari_ has quit IRC | 13:05 | |
*** rascasoft has joined #openstack-infra | 13:05 | |
*** kgiusti has joined #openstack-infra | 13:05 | |
*** agopi has quit IRC | 13:12 | |
*** hashar is now known as hasharAway | 13:12 | |
*** kgiusti has quit IRC | 13:15 | |
*** rascasoft has quit IRC | 13:15 | |
*** kgiusti has joined #openstack-infra | 13:17 | |
*** rascasoft has joined #openstack-infra | 13:17 | |
*** eharney has joined #openstack-infra | 13:19 | |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: WIP - Pagure driver https://review.openstack.org/604404 | 13:20 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Filter file comments for existing files https://review.openstack.org/613161 | 13:21 |
*** felipemonteiro has joined #openstack-infra | 13:23 | |
Shrews | fungi: i think t-35 minutes? | 13:25 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Collect docker logs after quick-start run https://review.openstack.org/613027 | 13:25 |
Shrews | corvus: yes, that would be faster i think. then we'd just have to delete the ready nodes that are left | 13:26 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Add opendev nameservers (2/2) https://review.openstack.org/610066 | 13:27 |
*** ansmith has quit IRC | 13:28 | |
*** ansmith_ has joined #openstack-infra | 13:28 | |
*** jistr is now known as jistr|call | 13:29 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: WIP: support foreign required-projects https://review.openstack.org/613143 | 13:30 |
*** tobberydberg has joined #openstack-infra | 13:30 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/project-config master: Disable all providers in nodepool launchers https://review.openstack.org/613329 | 13:31 |
Shrews | clarkb: corvus: ^^^ in case that's the route we choose | 13:31 |
*** rascasoft has quit IRC | 13:31 | |
*** yamamoto has quit IRC | 13:31 | |
*** yamamoto has joined #openstack-infra | 13:31 | |
*** d0ugal has quit IRC | 13:33 | |
*** lbragstad has joined #openstack-infra | 13:33 | |
*** rascasoft has joined #openstack-infra | 13:33 | |
*** d0ugal has joined #openstack-infra | 13:34 | |
clarkb | we need to restart zuul before we restart the launchers then? | 13:35 |
clarkb | (thats fine, making sure I understand) | 13:36 |
*** pcaruana has quit IRC | 13:37 | |
clarkb | fungi: re docker, it doesn't even use urls, it is more than a bit frustrating | 13:37 |
clarkb | with current docker it's just a hostname iirc and images must be served relative to the http root at that location | 13:38 |
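A note on what that implies for the proxy question sshnaidm|ruck raised earlier: docker's daemon-level mirror setting accepts only scheme+host(+port) endpoints with no path component, which is exactly the "served relative to the http root" constraint clarkb describes. A sketch of /etc/docker/daemon.json (the mirror hostname is made up; a plain-http mirror generally also needs an insecure-registries entry):

```json
{
  "registry-mirrors": ["http://mirror.example.openstack.org:8082"],
  "insecure-registries": ["mirror.example.openstack.org:8082"]
}
```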
fungi | okay, i'm back | 13:39 |
fungi | with 20 minutes to spare | 13:40 |
*** agopi has joined #openstack-infra | 13:40 | |
Shrews | clarkb: i think we (1) set max-servers to 0, (2) stop zuul, (3) delete (or record) any instances we need to manually cleanup, (4) merge your zk change to launchers & zuul, (5) revert max-servers change, (6) start launchers & zuul | 13:40 |
Shrews | i *think* ??? | 13:41 |
* Shrews would like a logic check there | 13:41 | |
Shrews | or if someone has a better plan... | 13:41 |
clarkb | Shrews: I think there is a 2.5 of stop launchers | 13:42 |
Shrews | clarkb: oh yes | 13:42 |
Shrews | actually, if we stop launchers, do we need to set max-servers to 0? | 13:42 |
Shrews | oh, yes. we need them to delete the USED instances | 13:43 |
Shrews | so we need a 2.1 step to wait for that to happen | 13:43 |
clarkb | yup | 13:45 |
clarkb | should I go ahead and put everything in them emergency file now, then we can approve and merge stuff and use kick.sh to apply things? | 13:45 |
clarkb | we won't be able to rely on zuul merging stuff while we do the work | 13:45 |
clarkb | (we may also want to set max-servers: 0 by hand, since that is a short temporary state?) | 13:46 |
Shrews | i'm fine with setting max-servers by hand | 13:48 |
Shrews | will be shorter downtime | 13:48 |
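For context on the "set max-servers to 0" step: the knob lives in each launcher's provider-pool config, and min-ready (corvus's alternative above) is its per-label counterpart. A hedged sketch with made-up provider, label, and flavor names:

```yaml
# nodepool.yaml fragment -- provider and label names are illustrative
labels:
  - name: ubuntu-xenial
    min-ready: 0          # keep no warm "ready" nodes (corvus's option)

providers:
  - name: example-cloud
    pools:
      - name: main
        max-servers: 0    # boot nothing new; used nodes drain and get deleted
        labels:
          - name: ubuntu-xenial
            diskimage: ubuntu-xenial
            flavor-name: general-purpose   # placeholder flavor
```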
*** jcoufal has quit IRC | 13:48 | |
Shrews | maybe we need an announcement? | 13:48 |
clarkb | nl*, ze*, zm*, and zuul* are in the emergency file now | 13:48 |
clarkb | Shrews: ya we can #status notice as soon as we start rolling and I'll let the release team know | 13:49 |
clarkb | as long as we capture queues and restore them only state changes in gerrit that happen while zuul-scheduler is off will be a problem | 13:49 |
*** d0ugal has quit IRC | 13:49 | |
clarkb | https://review.openstack.org/#/c/612443/ and https://review.openstack.org/#/c/612442/ should be safe to approve now with those hosts in the emergency file. Any objections to doing that now? | 13:50 |
*** beagles_mtg is now known as beagles | 13:51 | |
Shrews | let's start Operation Cattle Drive \o/ | 13:51 |
*** d0ugal has joined #openstack-infra | 13:51 | |
*** jcoufal has joined #openstack-infra | 13:51 | |
clarkb | note I did use the * glob in the emergency file which I think works? | 13:51 |
corvus | we will find out | 13:52 |
clarkb | heh I can list them too if we want | 13:52 |
*** rpittau has quit IRC | 13:52 | |
clarkb | pretty sure the * should work | 13:52 |
Shrews | should we go ahead and set max-servers to 0 in the configs? | 13:56 |
*** efried has quit IRC | 13:56 | |
clarkb | Shrews: we should let those two changes merge first (they are waiting on node allocations) | 13:56 |
Shrews | o rite | 13:56 |
* Shrews enables zuul --turbo option | 13:57 | |
*** efried has joined #openstack-infra | 13:57 | |
*** smarcet has joined #openstack-infra | 13:58 | |
clarkb | sshnaidm|ruck: when we are done with this zuul and nodepool work, I can take a look at docker things | 13:58 |
sshnaidm|ruck | clarkb, thanks | 13:59 |
clarkb | looks like the tripleo gate just did an almost full restart ahead of our changes :/ | 13:59 |
clarkb | we might consider direct merging if we are on a tight time schedule, I think corvus was the one with the time bounds? | 14:01 |
clarkb | corvus: do you think we should bypass the gate on those two changes? they did both pass check | 14:01 |
corvus | the good news is it just did another partial reset | 14:02 |
clarkb | I expect we'll move fairly quickly once those two changes merge. The biggest time sink is likely waiting for executors to stop and launcher to delete nodes | 14:03 |
corvus | so we're getting nodes now; i think we can just let them merge | 14:03 |
clarkb | wfm | 14:03 |
*** smarcet has quit IRC | 14:04 | |
openstackgerrit | Simon Westphahl proposed openstack-infra/zuul master: Use branch for grouping in supercedent manager https://review.openstack.org/613335 | 14:07 |
*** ykarel is now known as ykarel|away | 14:08 | |
clarkb | ok I don't think the * glob in the emergency file worked | 14:09 |
clarkb | I'm going to list out the nodes instead | 14:09 |
clarkb | (puppet just ran on zuul01) | 14:09 |
clarkb | (which is fine at this point, nothing has merged yet) | 14:10 |
corvus | they might be regexes, but listing is good now i think :) | 14:12 |
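On the glob question: ansible host patterns do accept fnmatch-style globs in play targets, but whether a bare `zuul*` line in a static file matches depends on how that file is consumed, so listing hosts explicitly (as clarkb ended up doing) is the unambiguous form. A sketch of the explicit style; hostnames are illustrative:

```ini
# "emergency" disable list as a static group -- explicit hosts, no patterns
[disabled]
zuul01.openstack.org
ze01.openstack.org
zm01.openstack.org
nl01.openstack.org
nl02.openstack.org
```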
clarkb | though project-config merging might make things interesting with the launchers racing ansible, arg | 14:13 |
clarkb | the launchers should puppet in about 15 minutes and project config will merge before then | 14:13 |
fungi | just a heads up, the etherpad system cpu bump from yesterday has returned as of 14:00z from the looks of it | 14:13 |
openstackgerrit | Simon Westphahl proposed openstack-infra/zuul master: Use branch for grouping in supercedent manager https://review.openstack.org/613335 | 14:13 |
openstackgerrit | Simon Westphahl proposed openstack-infra/zuul master: Use branch for grouping in supercedent manager https://review.openstack.org/613335 | 14:15 |
fungi | i suspect it's memory pressure and etherpad is using a bunch of cache memory for its operation | 14:16 |
clarkb | fungi: ya that is what made me think about the hwe kernels because memory was weird on the xenial kernel on our executors and switching to hwe fixed it | 14:16 |
openstackgerrit | Merged openstack-infra/project-config master: Switch nodepool launchers to use new zk cluster https://review.openstack.org/612442 | 14:16 |
fungi | looking at the graph, restarting nodejs yesterday the cache memory usage spiked back up to ~3.5 out of 4 gb immediately, and as of nowish it's about out of free memory | 14:16 |
clarkb | ok ^ may or may not apply to the launchers depending on whether or not the globs work for the launchers | 14:16 |
clarkb | Shrews: ^ fyi, I'm not sure there is anything we can do to control that other than to force stop ansible on bridge right now | 14:17 |
fungi | so we may want to think about resizing etherpad.o.o to 8gb? | 14:17 |
clarkb | oh nevermind | 14:17 |
clarkb | I did my math wrong and puppet ran on the launchers a few minutes ago? | 14:17 |
Shrews | launchers still point to nodepool.o.o | 14:17 |
clarkb | Shrews: yup I think we ended up having the timing just work out afterall | 14:18 |
clarkb | so now just waiting on the zuul config update and we can do the manual steps after that | 14:18 |
clarkb | fungi: what is odd is the version of etherpad and the version of nodejs hasn't changed and we kept the flavor fixed on the upgrade | 14:18 |
mwhahaha | is gerrit ssh broken or is it just me? | 14:19 |
clarkb | mwhahaha: I can ssh to gerrit from here | 14:19 |
clarkb | the ls-projects command in particular works for me | 14:19 |
mwhahaha | hmmm ok | 14:19 |
corvus | ditto | 14:19 |
mwhahaha | hrm it seems to be trying ipv6 | 14:20 |
mwhahaha | that's odd | 14:20 |
corvus | mwhahaha: v6 wfm; maybe your v6 route is sad? | 14:21 |
clarkb | fungi: before we bump the memory I'd be inclined to try the hwe kernel | 14:21 |
clarkb | fungi: then if that doesn't help a rebuild on bigger flavor will give us the normal kernel | 14:21 |
mwhahaha | corvus: yea something was odd, i started a ping which was delayed for a bit, then when it kicked in it worked. | 14:21 |
mwhahaha | sorry for the false alarm | 14:21 |
corvus | false alarms better than real ones | 14:22 |
dmsimard | clarkb: out of curiosity, have we tried etherpad on 18.04 ? | 14:22 |
clarkb | dmsimard: no, because we don't have deployment stuff for etherpad that works on 18.04 currently | 14:22 |
dmsimard | ack | 14:22 |
clarkb | if someone wants to invest in that nowish we could do that too. It isn't a terribly complicated system once you get the nodejs and npm stuff working (I have no idea if that is a solved problem in ansible land, but containers theoretically make that better too) | 14:23 |
clarkb | Shrews: sounds like you are watching the launchers, are you planning to set max-servers to 0 by hand? corvus did you want to do the zuul shutdown? I can run the kick.sh commands and help watch the cleanup that happens | 14:24 |
fungi | clarkb: i agree, trying hwe kernel next would be good | 14:24 |
clarkb | also how does this look: #status notice Zuul and Nodepool services are being restarted to migrate them to a new Zookeeper cluster. This brings us an HA database running on newer servers. | 14:26 |
fungi | lgtm | 14:26 |
clarkb | I'll send that as soon as we start making changes to the running services | 14:27 |
Shrews | clarkb: i can do the launcher configs | 14:27 |
fungi | i guess it would be redundant to also say we're taking our quotas down to zero | 14:27 |
Shrews | clarkb: are we ready to set max-servers to 0 now? | 14:27 |
clarkb | Shrews: lets let the last job finish just in case it has to restart or something | 14:27 |
Shrews | clarkb: awaiting the go signal... | 14:27 |
corvus | i can do the zuul shutdown | 14:29 |
clarkb | I think that job is currently compiling afs modules | 14:30 |
clarkb | might be a couple minutes more if so | 14:30 |
corvus | i should just do a full system restart, yeah? | 14:31 |
corvus | just to go ahead and get everything current | 14:31 |
clarkb | corvus: yes, but do a stop, then we'll pause for a sec to make sure configs are updated then we'll do a start | 14:31 |
corvus | (we only *need* to do the scheduler, but since that's the disruptive one) | 14:31 |
clarkb | corvus: but this way we have good data on current zuul tree so maybe zuul can do a release next week | 14:32 |
gnuoy | Hi, does https://review.openstack.org/#/c/608866/ need re-approval now that the dependent change has landed? | 14:32 |
corvus | clarkb: ack. we still have the zuul-web pid bug, so i'll run the restart playbook and wait to remove the pidfile until we're ready. | 14:32 |
clarkb | corvus: shrews didn't want to apply the zk changes to running processes. So we are stopping everything, updating config, then starting everything | 14:32 |
*** gfidente has quit IRC | 14:32 | |
corvus | ++ | 14:33 |
clarkb | gnuoy: a recheck will work too. | 14:33 |
fungi | clarkb: do rechecks work now even when there's already a verified +1? | 14:33 |
fungi | did zuul v3 solve that? | 14:34 |
clarkb | fungi: they should, I think it was gerrit 2.13 that fixed that | 14:34 |
fungi | oh, interesting | 14:34 |
clarkb | fungi: the problem before was that older gerrit only sent vote deltas. So if you reapplied a +1 that info wasn't sent to zuul | 14:34 |
clarkb | zaro fixed it so that gerrit sends the entire event content | 14:34 |
gnuoy | clarkb, excellent, thanks, will do | 14:35 |
openstackgerrit | Merged openstack-infra/system-config master: Switch zuul scheduler to new zk cluster https://review.openstack.org/612443 | 14:36 |
clarkb | Shrews: ^ I think you can set max-servers to 0 now. | 14:36 |
Shrews | ok | 14:36 |
clarkb | #status notice Zuul and Nodepool services are being restarted to migrate them to a new Zookeeper cluster. This brings us an HA database running on newer servers. | 14:38 |
openstackstatus | clarkb: sending notice | 14:38 |
Shrews | ok, done | 14:38 |
Shrews | good to stop zuul now | 14:38 |
clarkb | corvus: ^ | 14:38 |
corvus | stopping zuul | 14:38 |
clarkb | I'm going to update system-config on bridge.o.o so that we are ready to run kick.sh | 14:38 |
-openstackstatus- NOTICE: Zuul and Nodepool services are being restarted to migrate them to a new Zookeeper cluster. This brings us an HA database running on newer servers. | 14:39 | |
corvus | oh neat all the zuul hosts are disabled... | 14:40 |
corvus | trying again | 14:41 |
openstackstatus | clarkb: finished sending notice | 14:41 |
corvus | scheduler stopped | 14:41 |
clarkb | I'm watching nodepool list now to see nodes hopefully get cleaned up | 14:42 |
Shrews | same | 14:42 |
clarkb | yup a lot of deleting in there now | 14:42 |
*** maciejjozefczyk has quit IRC | 14:42 | |
corvus | clarkb: want to go ahead and kick the zuul servers? | 14:42 |
Shrews | we should have only READY and HOLD nodes left eventually | 14:43 |
corvus | or, well, at least zuul01 | 14:43 |
fungi | gnuoy: the reason i recommended having the change reapproved is that it's non-urgent (just bookkeeping), so mnaser or dhellmann will get to it when they're available | 14:43 |
*** otherwiseguy has joined #openstack-infra | 14:43 | |
*** munimeha1 has joined #openstack-infra | 14:43 | |
Shrews | corvus: all requests will be declined | 14:43 |
fungi | a reapproval will run fewer jobs since the change already passed the check pipeline once | 14:43 |
clarkb | Shrews: corvus in that case maybe we wait for the launchers to move over first? | 14:43 |
Shrews | yeah | 14:44 |
clarkb | ok I will wait on the kick.sh then | 14:44 |
corvus | Shrews: i'm asking clarkb to kick zuul01 so it has the correct config in place. i was not planning on starting zuul. | 14:44 |
gnuoy | fungi, ah ,ok. I didn't appreciate there was a mechanism for requesting re-approval | 14:44 |
corvus | it takes a long time to kick | 14:44 |
Shrews | oh | 14:44 |
clarkb | corvus: do we know if puppet will start the scheduler? | 14:45 |
corvus | clarkb: it .... well better not. | 14:45 |
*** kopecmartin|afk is now known as kopecmartin | 14:46 | |
*** gfidente has joined #openstack-infra | 14:46 | |
*** ramishra has quit IRC | 14:46 | |
*** smarcet has joined #openstack-infra | 14:47 | |
clarkb | we are down to 28 nodes in the launcher. I expect we can just wait since we are almost ready to stop the launchers then kick and start them? | 14:47 |
corvus | we have a policy of not having our config management start user-facing services. i really hope we have not decided to violate that. | 14:47 |
Shrews | one more to delete | 14:48 |
clarkb | corvus: I'm skimming the puppet and I think it will actually do the right thing | 14:48 |
clarkb | corvus: we ensure => undef but enable => true in the scheduler service definition | 14:48 |
clarkb | corvus: however I'm not sure if ensure => undef has weird puppet default behavior like ensure running? | 14:48 |
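The ensure/enable combination clarkb is describing can be sketched as a minimal Puppet service resource. The resource name here is assumed for illustration; the real definition lives in the project's puppet modules:

```puppet
# Hypothetical sketch of the scheduler service resource under discussion.
# Leaving ensure unset (undef) means Puppet does not manage the running
# state at all: an agent run will neither start a stopped scheduler nor
# stop a running one. enable => true only controls startup at boot.
service { 'zuul-scheduler':
  ensure => undef,
  enable => true,
}
```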
*** Swami has joined #openstack-infra | 14:49 | |
corvus | clarkb: can you just run it and we'll find out? | 14:49 |
Shrews | wow, we've created (or attempted to create) over 3 million nodes since running nodepool v3 | 14:49 |
clarkb | corvus: I can | 14:49 |
clarkb | corvus: doing that now | 14:49 |
*** jistr|call is now known as jistr | 14:50 | |
Shrews | hrm, vexxhost is being slow with that last delete | 14:50 |
*** Swami has quit IRC | 14:50 | |
clarkb | Shrews: I've recorded the nodepool list output and since there are held and ready nodes to delete anyway maybe lets move ahead with stopping the launchers now? | 14:50 |
clarkb | Shrews: then I can kick.sh the launchers too | 14:51 |
Shrews | clarkb: yeah, we can get that last one manually too if we need to | 14:51 |
clarkb | Shrews: ya lets do that | 14:51 |
Shrews | stopping launchers... | 14:51 |
Shrews | just fyi, http://paste.openstack.org/raw/733050/ | 14:52 |
clarkb | corvus: puppet says it is done on zuul01 | 14:52 |
corvus | clarkb: i agree. config looks good, no procs running | 14:52 |
clarkb | corvus: let me know when you think I should run it on ze* and zm* | 14:52 |
Shrews | clarkb: corvus: launchers stopped | 14:52 |
clarkb | Shrews: ok kicking launchers now | 14:52 |
corvus | clarkb: they won't use it so it's not important to run on the other z servers | 14:53 |
*** gema has joined #openstack-infra | 14:53 | |
clarkb | corvus: oh right. We can let normal puppet update that then | 14:53 |
Shrews | clarkb: that should take care of resetting max-servers too, right? | 14:53 |
clarkb | Shrews: it should | 14:53 |
Shrews | i'll make sure | 14:53 |
clarkb | Shrews: the max-servers thing ended up working really well. Much smaller list of things to cleanup this way | 14:54 |
Shrews | clarkb: yeah | 14:55 |
*** bobh has quit IRC | 14:55 | |
clarkb | Shrews: kick.sh is done | 14:55 |
Shrews | clarkb: we would have quickly had quota issues, too | 14:55 |
clarkb | Shrews: I think you are good to start launchers when you are happy with their configs | 14:56 |
Shrews | checking configs... | 14:56 |
*** tobberydberg has quit IRC | 14:56 | |
Shrews | clarkb: configs look good. i'm going to start nl02 first since it has the lowest setting for max-servers | 14:57 |
clarkb | Shrews: ok | 14:57 |
Shrews | Marking for delete leaked instance ubuntu-bionic-limestone-regionone-0002677659 (819013fc-1051-4655-bc61-1769bdc1af4d) in limestone-regionone (unknown node id 0002677659) | 14:59 |
Shrews | oh, maybe we validate node IDs??? | 14:59 |
* Shrews looks | 14:59 | |
clarkb | ok not much activity on nl02 because we set min ready with nl01 | 14:59 |
clarkb | nl01 should be the next one to start | 14:59 |
clarkb | nl02 looks happy in its idling though | 15:00 |
Shrews | clarkb: ok, maybe this max-servers step was unnecessary | 15:00 |
Shrews | starting nl01 now | 15:00 |
clarkb | ah the alien cleanup is more sophisticated than anticipated? | 15:01 |
corvus | well, it helped reduce churn. but yeah, nodepool should do the cleanup for us. | 15:01 |
*** cfriesen has joined #openstack-infra | 15:01 | |
*** jtomasek has quit IRC | 15:01 | |
clarkb | corvus: thats a good point, we avoid the shock of it having to do it all at once | 15:02 |
clarkb | we have a bunch of building nodes now. Do we want to wait to see them go ready before starting the other launchers? | 15:02 |
corvus | (we set metadata with the nodepool id, and if that id isn't in the db, it's a leaked instance) | 15:02 |
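The leak detection corvus describes can be sketched roughly as below. This is an illustrative approximation, not nodepool's actual code; the metadata key, data shapes, and names are assumptions.

```python
# Sketch of nodepool-style leak detection: each instance nodepool boots is
# tagged with its node id in the server metadata, and a periodic cleanup
# marks any instance whose id is absent from the ZooKeeper db as leaked.
def find_leaked(cloud_servers, known_node_ids):
    """Return instances whose nodepool_node_id metadata is unknown."""
    leaked = []
    for server in cloud_servers:
        node_id = server.get("metadata", {}).get("nodepool_node_id")
        # Instances with no nodepool metadata (e.g. manually booted) are
        # left alone; only tagged-but-unknown instances count as leaks.
        if node_id is not None and node_id not in known_node_ids:
            leaked.append(server)
    return leaked

servers = [
    {"name": "ubuntu-bionic-0002677659",
     "metadata": {"nodepool_node_id": "0002677659"}},
    {"name": "ubuntu-bionic-0002677700",
     "metadata": {"nodepool_node_id": "0002677700"}},
    {"name": "manually-booted", "metadata": {}},
]
known = {"0002677700"}
print([s["name"] for s in find_leaked(servers, known)])
# prints ['ubuntu-bionic-0002677659']
```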
Shrews | clarkb: yeah, give it a minute | 15:02 |
clarkb | Shrews: these first boots may be slower than normal because the images are new and haven't been used yet (which caches them on hypervisors) | 15:03 |
*** eernst has joined #openstack-infra | 15:03 | |
*** jpena is now known as jpena|brb | 15:04 | |
Shrews | clarkb: yeah, i just wanted to validate some stuff first. good to start the others now | 15:04 |
*** tobberydberg has joined #openstack-infra | 15:04 | |
clarkb | Shrews: are you going to start them or should I help with that? | 15:04 |
*** ccamacho has quit IRC | 15:05 | |
Shrews | i can do it | 15:05 |
clarkb | ok | 15:05 |
Shrews | 03 and 04 started now | 15:05 |
clarkb | we have ready nodes | 15:05 |
corvus | shall i continue with zuul? | 15:06 |
clarkb | corvus: Shrews I think we can start zuul now that ^ is in place | 15:06 |
Shrews | i see some ready nodes now | 15:06 |
clarkb | corvus: I'm good to start zuul if shrews is | 15:06 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64704&rra_id=all is going to be a fun graph to watch | 15:07 |
corvus | zuul is starting, but we're still a few mins away from node requests. | 15:07 |
Shrews | up to corvus at this point. nodepool is priming, but ready | 15:07 |
Shrews | clarkb: unless ianw updated the graphs for the new stat names, some may be empty | 15:08 |
clarkb | Shrews: that is cacti network bw usage on the current zk leader. Independent of the statsd stuff | 15:08 |
Shrews | oh, that's cacti | 15:08 |
Shrews | yeah | 15:08 |
*** yamamoto has quit IRC | 15:09 | |
*** yamamoto has joined #openstack-infra | 15:10 | |
*** smarcet has quit IRC | 15:10 | |
clarkb | corvus: how is zuul doing? | 15:12 |
clarkb | iirc its about a 5 minute startup for zuul? should start seeing stuff zoon? | 15:12 |
corvus | clarkb: scheduler is running, executors are stopping | 15:12 |
clarkb | heh zoon | 15:12 |
corvus | node requests are being submitted | 15:12 |
clarkb | see them on the nodepool side, lots of building requests | 15:13 |
corvus | we're up to 145 node reqs | 15:13 |
*** tobberydberg has quit IRC | 15:13 | |
corvus | i'll re-enqueue now | 15:13 |
clarkb | there are in-use nodes now too | 15:13 |
corvus | that's surprising | 15:13 |
corvus | oh, they're marked in use in nodepool, but zuul hasn't actually begun execution yet | 15:14 |
corvus | (it has no executors online) | 15:14 |
clarkb | huh | 15:14 |
corvus | it's normal -- zuul has claimed the nodes | 15:15 |
corvus | executors have started | 15:15 |
Shrews | clarkb: so i don't think the nodes in 'hold' state from before the migration will be cleaned up by nodepool | 15:15 |
Shrews | so those will have to be tracked | 15:15 |
*** jpena|brb is now known as jpena | 15:15 | |
*** jamesmcarthur has joined #openstack-infra | 15:15 | |
clarkb | Shrews: noted, we also need to clean up the old image builds on the builders | 15:15 |
corvus | Shrews: i think they should be deleted as leaks too | 15:15 |
corvus | zuul is completely restarted; re-enqueue is in progress | 15:16 |
*** yamamoto has quit IRC | 15:16 | |
Shrews | corvus: oh, i think you're right | 15:16 |
Shrews | i hope they weren't needed! :) | 15:16 |
clarkb | Shrews: we can always hold new ones | 15:17 |
corvus | our max servers line in grafana is lower than before | 15:17 |
Shrews | yup, i know. just sayin... | 15:17 |
corvus | max went from 1034 -> 934 | 15:18 |
clarkb | that may be ovh gra1 | 15:18 |
clarkb | we've been sort of manually editing it and if the kick.sh undid that we'd lose almost a hundred nodes | 15:18 |
* clarkb looks | 15:18 | |
corvus | oh there we go it's back now | 15:18 |
clarkb | nope gra1 is correct | 15:18 |
corvus | if you refresh, you can see it's back at 1034 after a step up | 15:19 |
clarkb | oh that must've been stats lag due to starting launchers one at a time? | 15:19 |
clarkb | so it stepped up for each launcher started | 15:19 |
*** quiquell is now known as quiquell|off | 15:21 | |
clarkb | the in-use number keeps going up according to grafana | 15:22 |
clarkb | zk stats continue to look happy | 15:22 |
*** smarcet has joined #openstack-infra | 15:22 | |
clarkb | zk network graph spiking but not crazy | 15:22 |
clarkb | corvus: Shrews thoughts on removing all of these hosts from the emergency file at this point? | 15:23 |
Shrews | i think nodepool is happy | 15:23 |
clarkb | `echo stat | nc localhost 2181` is the zk monitoring hack I learned if anyone else wants to look at zk too | 15:24 |
clarkb | the Mode: and Outstanding: fields are interesting. Mode tells you if the cluster is set up right (followers and leaders) and Outstanding shows you if you have a sync backlog I think | 15:25 |
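A small sketch of parsing that `stat` output. In practice you would capture it from the server with `echo stat | nc localhost 2181`; here a fabricated sample is parsed instead, and the sample values are made up.

```python
# Parse ZooKeeper's four-letter-word "stat" response into a dict so the
# interesting fields (Mode, Outstanding) can be checked programmatically.
def parse_stat(text):
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

sample = """\
Zookeeper version: 3.4.5
Latency min/avg/max: 0/1/45
Outstanding: 0
Mode: leader
Node count: 12345
"""
info = parse_stat(sample)
print(info["Mode"], info["Outstanding"])  # prints: leader 0
```

A Mode of "leader" or "follower" (rather than "standalone") confirms the cluster formed; a growing Outstanding count suggests a request backlog.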
*** tobberydberg has joined #openstack-infra | 15:25 | |
clarkb | If we are good removing the nodes from the emergency file I'll send #status notice The Zuul and Nodepool database transition is complete. | 15:27 |
corvus | wfm. reenqueue is still proceeding (busy mergers), but that's fine. | 15:27 |
clarkb | ok | 15:27 |
Shrews | i'm now going to find a fedex drop off point to send a laptop back to HQ | 15:29 |
clarkb | #status notice The Zuul and Nodepool database transition is complete. Changes updated during the Zuul outage may need to be rechecked. | 15:29 |
openstackstatus | clarkb: sending notice | 15:29 |
*** bobh has joined #openstack-infra | 15:29 | |
clarkb | emergency file is updated. I left the builders in it due to the need for an sdk release | 15:29 |
-openstackstatus- NOTICE: The Zuul and Nodepool database transition is complete. Changes updated during the Zuul outage may need to be rechecked. | 15:31 | |
clarkb | corvus: the zuul status page doesn't seem to want to load for me. Is that reconfigure induced slowness that we've seen before? I expect it is | 15:31 |
*** armax has joined #openstack-infra | 15:32 | |
corvus | clarkb: yes, i expect that to continue as long as the re-enqueue is happening -- the same gearman worker handles both things | 15:32 |
clarkb | corvus: roger | 15:32 |
corvus | clarkb: it should eventually load -- like, it shouldn't take more than a few minutes (you may need some amount of refreshing due to js stuff) | 15:32 |
openstackstatus | clarkb: finished sending notice | 15:33 |
clarkb | corvus: ya it did eventually reload | 15:33 |
corvus | re-enqueue is finished | 15:33 |
corvus | i think that's everything then | 15:33 |
clarkb | ya the only other outstanding item on my list is cleaning out the old images from the nodepool builders | 15:34 |
clarkb | that isn't urgent but I shall try to get around to it today while shrews is around | 15:34 |
clarkb | then I need to delete nodepool.o.o (say on monday?) | 15:34 |
clarkb | infra-root ^ if you have anything on nodepool.o.o you want to keep please grab it now :) | 15:34 |
clarkb | corvus: did you write down the zuul sha1 that is running? | 15:35 |
*** bobh has quit IRC | 15:35 | |
clarkb | corvus: probably want to grab that for a possible zuul reelase? | 15:35 |
corvus | clarkb: it's on the status page now | 15:35 |
corvus | (it's the scheduler sha) | 15:35 |
clarkb | oh nice | 15:35 |
clarkb | thank you everyone for helping with this. Also pabelanger did a bunch of the prep work a while back | 15:36 |
clarkb | sshnaidm|ruck: hey, I'm about to context switch to docker things. I did notice that https://review.openstack.org/608319 is failing pep8 in the gate due to a name not being valid | 15:38 |
*** ccamacho has joined #openstack-infra | 15:38 | |
clarkb | sshnaidm|ruck: can you point me to a specific job that you'd like to learn more about the docker setup for? | 15:38 |
clarkb | sshnaidm|ruck: I'd like to start with the logs to either confirm we log the important bits or if not, understand what is missing | 15:38 |
fungi | mtreinish: do you think any of the security fixes mentioned in recent releases at https://github.com/eclipse/mosquitto/blob/master/ChangeLog.txt are relevant to our occasional crashes? especially the cve-2017-7651 fix in 1.4.15 looks suspicious | 15:38 |
*** bobh has joined #openstack-infra | 15:39 | |
fungi | debian just backported a bunch of security fixes per https://security-tracker.debian.org/tracker/DSA-4325-1 https://security-tracker.debian.org/tracker/DLA-1409-1 https://security-tracker.debian.org/tracker/DLA-1334-1 | 15:40 |
fungi | in theory ubuntu ought to be able to import those updates | 15:40 |
sshnaidm|ruck | clarkb, I'd like to ensure that afs docker proxies really cache the image, we have this config: http://logs.openstack.org/87/610087/4/gate/tripleo-ci-centos-7-scenario001-multinode-oooq-container/2e409e1/logs/undercloud/etc/docker/daemon.json.txt.gz | 15:44 |
sshnaidm|ruck | clarkb, I think it's enough to use docker proxy, right? But I can't really check where I download the image from in reality | 15:45 |
*** apetrich has quit IRC | 15:45 | |
clarkb | sshnaidm|ruck: in the job logs can you point me to where the images are fetched? we can then cross check against the mirror configured there | 15:45 |
clarkb | sshnaidm|ruck: the ability to check where you downloaded the image from would be a logging function of whatever you use to pull the image | 15:45 |
mordred | clarkb: it would require starting the docker daemon with debug logging I believe | 15:45 |
mordred | then there will be http trace entries in the logs that will indicate from where docker actually fetched things | 15:46 |
clarkb | sshnaidm|ruck: but we can check the mirror node logs too since I guess docker doesn't tell you by default | 15:46 |
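If you do turn on daemon debug logging as mordred suggests, the relevant `/etc/docker/daemon.json` would look roughly like this. The mirror URL is copied from the job config linked above; treat the exact shape as a sketch:

```json
{
  "registry-mirrors": ["http://mirror.bhs1.ovh.openstack.org:8081/registry-1.docker/"],
  "debug": true
}
```

After restarting the docker daemon, its log should include the HTTP requests it makes, showing whether pulls hit the mirror or fall through to docker.io.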
sshnaidm|ruck | clarkb, yeah, for example here: http://logs.openstack.org/87/610087/4/gate/tripleo-ci-centos-7-scenario001-multinode-oooq-container/2e409e1/logs/undercloud/home/zuul/install-undercloud.log.txt.gz#_2018-10-25_03_07_03_975 | 15:46 |
sshnaidm|ruck | docker.io/tripleomaster/centos-binary-rsyslog-base in 2018-10-25 03:07:03.975 from http://mirror.bhs1.ovh.openstack.org:8081/registry-1.docker/ | 15:47 |
clarkb | sshnaidm|ruck: sha256:7810f63ac7ce7026eb5bcb308fd485fb7aa3224707bb2c57c24d2dedd7992cbb looks like the hash for that image right ? (all of the image serving is from hashes so I'll be grepping that in the logs) | 15:47 |
clarkb | oh that is specifically the centos base image. I'll check that one to start | 15:47 |
clarkb | sha256:a55bd98df50363f394ecbb21d19aade7e250590211dd64e83019f8b9cc5273ea looks like a layer for rsyslog specifically | 15:48 |
clarkb | neither sha256 is in the apache logs | 15:50 |
*** yamamoto has joined #openstack-infra | 15:50 | |
*** eernst has quit IRC | 15:51 | |
sshnaidm|ruck | clarkb, I see this, maybe it's it: "docker.io/tripleomaster/centos-binary-rsyslog-base@sha256:19ff38dcdc12a167bcf8dcbef4cb55247194b101d8fc1c4aff781ce73a794756" | 15:51 |
sshnaidm|ruck | and this: "Id": "sha256:5455eec0649474d22cd21dc3a08f9a80659973551c4c5ecbf675609926489c80" | 15:52 |
sshnaidm|ruck | so many shas | 15:52 |
mordred | sshnaidm|ruck: don't you know - shas make everything better :) | 15:53 |
clarkb | sshnaidm|ruck: sha256:5455eec0649474d22cd21dc3a08f9a80659973551c4c5ecbf675609926489c80 shows up a bunch in the logs as cache hits. But the previous one does not | 15:54 |
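The cross-check clarkb is doing here, grepping the mirror's apache logs for a layer digest, amounts to counting digest occurrences. A toy sketch with fabricated log lines (the real access log format differs):

```python
# Count how often an image layer digest shows up in mirror access logs.
# A non-zero count for the digest the client reports pulling is evidence
# the proxy cache actually served (or fetched) that layer.
import re

def count_digest_hits(log_lines, digest):
    pattern = re.compile(re.escape(digest))
    return sum(1 for line in log_lines if pattern.search(line))

log = [
    'GET /registry-1.docker/v2/blobs/sha256:5455eec0... HTTP/1.1" 200',
    'GET /registry-1.docker/v2/blobs/sha256:5455eec0... HTTP/1.1" 200',
    'GET /registry-1.docker/v2/blobs/sha256:deadbeef... HTTP/1.1" 200',
]
print(count_digest_hits(log, "sha256:5455eec0"))  # prints: 2
```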
sshnaidm|ruck | clarkb, ok, let's hope it is :D | 15:54 |
sshnaidm|ruck | clarkb, thanks! | 15:54 |
*** sshnaidm|ruck is now known as sshnaidm|bbl | 15:54 | |
fungi | the system cpu spike on the etherpad server has died back off. i wonder if it corresponded to the board call which also just wrapped up? | 15:55 |
clarkb | fungi: that would be an unfortunate regression if using the service made it slow :) | 15:56 |
clarkb | fungi: but that should be testable at least | 15:56 |
fungi | especially concerning since there weren't _that_ many people using that particular pad | 15:56 |
smcginnis | If we need a few folks to all hit an etherpad around the same time, I can help. | 15:56 |
clarkb | sshnaidm|bbl: as far as I can tell given that id it should be doing what we expect. If you want to be double sure adding the extra docker logging that mordred pointed out is probably worthwhile | 15:56 |
clarkb | we keep using steadily less quota in bhs1 | 15:57 |
clarkb | I wonder if the port cleanups failed there? | 15:57 |
*** derekh has quit IRC | 15:58 | |
clarkb | #status log Zuul and Nodepool running against the new three node zookeeper cluster at zk01 + zk02 + zk03 .openstack.org. Old server at nodepool.openstack.org will be deleted in the near future | 15:59 |
openstackstatus | clarkb: finished logging | 15:59 |
*** e0ne has quit IRC | 15:59 | |
clarkb | also inap doesn't look happy. I'm going to start with inap since bhs1 is still mostly working | 16:00 |
clarkb | the inap errors appear to be timeouts, possibly related to our switch to new images? | 16:01 |
mgagne | clarkb: we redeployed a new minor version of Nova. Didn't expect that much impact. Is there anything I can look at for now? | 16:01 |
mgagne | new packages were promoted ~55m ago | 16:02 |
*** carl_cai has quit IRC | 16:02 | |
clarkb | mgagne: from our side it looks like we may have timed out some boots because we transitioned to new images globally. Then those boots are now timing out trying to delete | 16:02 |
clarkb | mgagne: I can get you uuids in just a momment | 16:02 |
mgagne | clarkb: I never remember what I do to fix those issues :-/ | 16:03 |
clarkb | e064b5bf-dfca-48aa-8b02-b3da37509688 bdf1a01f-1e95-47ec-8e72-827d0180140a 637d76be-930a-4ea2-b145-96c8501d03f4 | 16:03 |
clarkb | are three examples | 16:03 |
mgagne | checking | 16:03 |
clarkb | mgagne: thank you! | 16:04 |
mgagne | I think it was restarting nova-compute? | 16:04 |
clarkb | mgagne: ya that sounds familiar | 16:04 |
mgagne | clarkb: ok, one is now in error, I think it's now in a state where nodepool can retry its delete and it will work | 16:05 |
clarkb | mgagne: great, nodepool should do that automatically | 16:06 |
mgagne | I guess I will restart the whole region in that case | 16:06 |
clarkb | I don't know enough about your cloud to advise one way or the other. But your help is greatly appreciated :) | 16:07 |
mgagne | hehe | 16:07 |
clarkb | openstack.exceptions.SDKException: Error in creating the server. Compute service reports fault: No valid host was found. There are not enough hosts available. is the ovh bhs1 usage reduction cause | 16:11 |
clarkb | amorin: dpawlik ^ fyi if you happen to be around (this may be some side effect of us restarting zuul which creates a rush of demand) | 16:12 |
clarkb | I wonder if the next nodepool feature is going to be a launch throttle | 16:13 |
*** smarcet has quit IRC | 16:14 | |
*** ginopc has quit IRC | 16:14 | |
*** bhavikdbavishi has joined #openstack-infra | 16:15 | |
*** ccamacho has quit IRC | 16:15 | |
*** gyee has joined #openstack-infra | 16:16 | |
clarkb | other than those two cloud related issues (which could theoretically also be related to new openstacksdk?) we appear to be quite stable | 16:16 |
clarkb | zookeeper seems to be keeping up with the demand in its new 3 node configuration as well | 16:16 |
*** weshay has joined #openstack-infra | 16:18 | |
clarkb | mgagne: I see the deleting count falling in inap | 16:19 |
mgagne | yea, instances are now in ERROR state. | 16:19 |
*** emine__ has joined #openstack-infra | 16:21 | |
*** dtantsur is now known as dtantsur|afk | 16:22 | |
*** Emine has quit IRC | 16:24 | |
*** jhesketh has joined #openstack-infra | 16:25 | |
*** shardy has quit IRC | 16:25 | |
*** mnaser has quit IRC | 16:26 | |
*** yamamoto has quit IRC | 16:26 | |
*** yamamoto has joined #openstack-infra | 16:26 | |
*** mnaser has joined #openstack-infra | 16:26 | |
*** jhesketh_ has quit IRC | 16:27 | |
*** jpich has quit IRC | 16:28 | |
clarkb | for the bhs1 thing we have 116 instances according to quota but only ~68 instances according to server list | 16:30 |
clarkb | I think quota may have gotten out of sync there and so the hypervisors think they are used (and possibly are) | 16:31 |
*** trown is now known as trown|lunch | 16:32 | |
fungi | clarkb: amorin said yesterday that was a known issue in bhs1 i think? they're still working on trying to get gra1 back in okay shape | 16:32 |
*** smarcet has joined #openstack-infra | 16:32 | |
clarkb | gotcha | 16:32 |
*** panda is now known as panda|off | 16:38 | |
openstackgerrit | Hervé Beraud proposed openstack/gertty master: Introduce security checks with bandit and fix it https://review.openstack.org/613371 | 16:38 |
clarkb | we have 1 node available in inap. I think we may be turning the corner there. | 16:39 |
mgagne | clarkb: most are stuck in building right? | 16:40 |
clarkb | mgagne: ya I think that is due to new images? | 16:40 |
mgagne | yea | 16:40 |
mgagne | just making sure there is no other issue I can fix | 16:41 |
clarkb | mgagne: we should know in another 10-15 minutes. | 16:41 |
clarkb | 2 available now. Makes me think in 10-15 minutes we'll be operating normally | 16:41 |
mgagne | +1 | 16:41 |
*** fuentess has joined #openstack-infra | 16:42 | |
*** gfidente has quit IRC | 16:42 | |
*** imacdonn has quit IRC | 16:42 | |
*** imacdonn has joined #openstack-infra | 16:43 | |
ssbarnea|bkp2 | fungi clarkb: question regarding basepython=python3 : please read https://github.com/tox-dev/tox/issues/1072 -- I am curious how openstack plans to cover this aspect. | 16:44 |
clarkb | ssbarnea|bkp2: you have to set basepython on the docs and linting to python3 then let py35/py36 etc do the right thing | 16:47 |
ssbarnea|bkp2 | centos-7 concerns me because on it python3 -> python3.4, which was dropped by ansible, see https://github.com/ansible/ansible/blob/devel/setup.py#L247 | 16:48 |
fungi | clarkb: ssbarnea|bkp2: where's the context? you likely also need to set ignore_basepython_conflict = True | 16:48 |
clarkb | fungi: the context is centos7 using python3.4 I guess? | 16:49 |
ssbarnea|bkp2 | clarkb yeah, this is where I saw the failure to install ansible on python3, because it was incompatible. | 16:49 |
clarkb | fwiw that seems more like a distro problem | 16:49 |
fungi | otherwise setting basepython = python3 if python3 is 3.4 will result in the implicit py35 and py36 testenvs using 3.4 | 16:49 |
clarkb | not a tox problem | 16:49 |
clarkb | fungi: aha | 16:49 |
clarkb | ssbarnea|bkp2: this is why we carefully run things on a variety of distros to make sure that their python versions line up with what we expect | 16:50 |
clarkb | (it can be painful at times, but does work) | 16:50 |
fungi | https://github.com/tox-dev/tox/issues/477 | 16:51 |
ssbarnea|bkp2 | for testing purposes it is a PITA; not on my main (macos) machine where I have the freedom to juggle them, but if you want to run just "tox" across multiple platforms, you soon realize that conflict. | 16:51 |
clarkb | ssbarnea|bkp2: but then setting basepython to python3 means we can explicitly run the docs job on xenial to get 3.5, on bionic to get 3.6. Then when the next release comes out we don't have to update tox.ini just add the job to run on that distro release | 16:51 |
fungi | fixed by https://github.com/tox-dev/tox/pull/841 | 16:52 |
clarkb | ssbarnea|bkp2: right we don't support tox on multiple platforms generally. We support specific versions of python informed by what is on distros (which we use to test) then you have to get the right version of python | 16:52 |
fungi | stephenfin did excellent work there | 16:52 |
clarkb | mgagne: up to 5 available now. Slow going but trending the right direction | 16:53 |
ssbarnea|bkp2 | clarkb: just to be clear: I am not trying to say that setting it to python3 is bad. i am going to test the ignore_basepython_conflict | 16:53 |
clarkb | ssbarnea|bkp2: ya. I'm just trying to point out that just because an openstack tox.ini says python3 doesn't mean it will work with any python3. We do that for convenience to avoid needing to update tox.ini frequently. You still need a valid python3 version | 16:54 |
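A hedged tox.ini sketch of the pattern discussed (env names and commands are illustrative). `ignore_basepython_conflict`, added in tox 3.1 by the fix linked above, lets a blanket `basepython = python3` coexist with versioned envs like py35/py36:

```ini
[tox]
envlist = pep8,docs,py35,py36
# tox >= 3.1: env names like py35/py36 keep their implied interpreter even
# though [testenv] below sets basepython = python3
ignore_basepython_conflict = true

[testenv]
basepython = python3

[testenv:pep8]
commands = flake8 {posargs}

[testenv:docs]
commands = sphinx-build -b html doc/source doc/build/html
```

With this, `tox -e docs` uses whatever python3 the platform provides, while `tox -e py36` still insists on python3.6.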
*** lpetrut has quit IRC | 16:54 | |
*** lpetrut has joined #openstack-infra | 16:55 | |
ssbarnea|bkp2 | clarkb: we are not in conflict here :D | 16:55 |
ssbarnea|bkp2 | now I only need to explain to others that we still have to use basepython for some tasks, like https://review.openstack.org/#/c/613083/2/tox.ini | 16:58 |
*** jpena is now known as jpena|off | 16:58 | |
*** xek has quit IRC | 16:58 | |
*** xek has joined #openstack-infra | 16:59 | |
*** chandankumar is now known as chkumar|off | 17:02 | |
*** ccamacho has joined #openstack-infra | 17:02 | |
*** pcaruana has joined #openstack-infra | 17:03 | |
ssbarnea|bkp2 | now I have a cosmetic question about zuul html output not wrapping at screen width; doing horizontal scrolling in the browser sucks. Is this by design, or a known bug? | 17:06 |
*** jamesmcarthur has quit IRC | 17:06 | |
clarkb | ssbarnea|bkp2: at least on mobile it does one column without horizontal scrolling. I also don't have horizontal scrolling on current browser /me tries resizing | 17:07 |
*** jamesmcarthur has joined #openstack-infra | 17:07 | |
clarkb | ssbarnea|bkp2: it seems to resize without doing horizontal scrolling on firefox for me | 17:07 |
ssbarnea|bkp2 | clarkb firefox on http://logs.openstack.org/83/613083/2/check/openstack-tox-linters/10deb24/job-output.txt.gz -- desktop | 17:07 |
clarkb | oh the job logs not the zuul status web page | 17:08 |
ssbarnea|bkp2 | to be exact http://logs.openstack.org/83/613083/2/check/openstack-tox-linters/10deb24/job-output.txt.gz#_2018-10-24_21_57_28_549089 | 17:08 |
clarkb | ssbarnea|bkp2: my personal opinion on that is that is desired. It is a txt file not an html file | 17:08 |
clarkb | it is the raw output | 17:08 |
*** jamesmcarthur has quit IRC | 17:08 | |
clarkb | if we want something to render that differently we should do that on top of the raw data | 17:08 |
ssbarnea|bkp2 | i think it has wrapping only on spaces which prevents the wrapping from occurring. | 17:09 |
clarkb | it will be however firefox wraps text file lines | 17:09 |
clarkb | (might even be configurable?) | 17:09 |
ssbarnea|bkp2 | clarkb i am sure css can change behavior but i wanted to know if this was desired or fixable :D | 17:09 |
clarkb | I think we want to make the raw data available, but if we also render it nicely for people with browsers that is good too | 17:10 |
clarkb | the reason I say that is some log files are massive (hundreds of meg) and i have to view them with vim locally | 17:10 |
clarkb | we also index the raw data in elasticsearch so you want to be able to support use cases like that | 17:10 |
ssbarnea|bkp2 | this is 100% css issue, i do not expect the lines to be wrapped server side. | 17:11 |
ssbarnea|bkp2 | as you said: they should be as close as possible to raw | 17:11 |
clarkb | ssbarnea|bkp2: except there is no css in txt files | 17:11 |
*** eharney has quit IRC | 17:11 | |
clarkb | oh except that os loganalyze is sending some. I understand now | 17:12 |
clarkb | so ya you could update os-loganalyze to change the html rendering. Sorry I've been using vim a lot lately because too many large files. | 17:12 |
clarkb | os-loganalyze will serve the raw data if you don't set accept encoding to html | 17:13 |
clarkb | or accept-type? whatever the header is | 17:13 |
fungi | infra-root: just a heads up, i have to disappear for a few hours to deal with insurance company stuff in person, but will be back on later today | 17:13 |
clarkb | fungi: gl | 17:13 |
ssbarnea|bkp2 | I found the fix, is missing: word-break: break-all; | 17:13 |
ssbarnea|bkp2 | now i only have to find the place to add that code. | 17:14 |
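The fix being described would be a small stylesheet change along these lines. The selector is illustrative; the real change would target the `<pre>` output os-loganalyze renders:

```css
/* Force long tokens without spaces (paths, shas, urls) to wrap at the
   viewport edge instead of triggering horizontal scrolling. */
pre {
  white-space: pre-wrap;
  word-break: break-all;
}
```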
fungi | clarkb: supposedly they'll be handing me a briefcase full of unmarked bills at the end, so totally worth it (okay, really just a paper check, but regardless...) | 17:14 |
clarkb | fungi: ah you are past the point of arguing over what was insured then :) | 17:14 |
fungi | yep! | 17:14 |
clarkb | ssbarnea|bkp2: look in openstack-infra/os-loganalyze | 17:15 |
fungi | well, except for the wind damage claim which we haven't finished yet. but flood and care are done | 17:15 |
*** sshnaidm|bbl is now known as sshnaidm|off | 17:15 | |
fungi | er, flood and car are done | 17:15 |
* fungi vanishes in a puff of errands | 17:15 | |
*** apetrich has joined #openstack-infra | 17:17 | |
*** apetrich has quit IRC | 17:17 | |
*** apetrich has joined #openstack-infra | 17:18 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/os-loganalyze master: Assures that wrapping on PRE occurs on any kind of characters https://review.openstack.org/613383 | 17:18 |
ssbarnea|bkp2 | this reminded me that i hate the timestamp column, too much screen real estate taken by it. I would personally prefer to transform it into a line-number and have the time value as tooltip. | 17:20 |
clarkb | ssbarnea|bkp2: I find the timestamps to be invaluable | 17:20 |
ssbarnea|bkp2 | but obviously that I would need support for such change. | 17:20 |
*** smarcet has quit IRC | 17:21 | |
ssbarnea|bkp2 | it is valuable, but not sure if it needs to be visible by default and all the time. maybe expandable or something similar. | 17:21 |
clarkb | ssbarnea|bkp2: I think if you want to do something like that then we want a render layer that allows you to toggle things like that. I don't think we should remove that from the raw txt | 17:22 |
clarkb | it is really useful to understand when things happen in a distributed system | 17:22 |
ssbarnea|bkp2 | clarkb: sure I was referring to the display layer | 17:22 |
clarkb | to the point where it is the one requirement I push on people to use the elasticsearch/logstash system | 17:22 |
clarkb | ssbarnea|bkp2: I also think it is important because it helps remind people that their jobs have a time cost | 17:23 |
clarkb | that time cost impacts everyone else's ability to use those test resources | 17:23 |
ssbarnea|bkp2 | 50% of timestamp is spam = first and last. year and month are useless as we don't even keep logs for so long, and sub second divisions ... | 17:24 |
clarkb | sub second is very useful. The year may not be necessary. But the rest of it is I think | 17:24 |
clarkb | we want to keep logs for ~6 months again which is why the swift work is happening | 17:24 |
ssbarnea|bkp2 | another thing that I could fix with css alone, almost for sure. | 17:25 |
*** diablo_rojo has joined #openstack-infra | 17:26 | |
*** eharney has joined #openstack-infra | 17:26 | |
*** lpetrut has quit IRC | 17:28 | |
clarkb | ya I think we can fiddle with overlay type stuff to make it toggleable to user preference, but I also think being clear about how long jobs are taking and how long specific job tasks take is important particularly when we run behind the curve with constricted resources | 17:29 |
clarkb | otherwise as soon as I ask someone to make their jobs run faster the response will be but I can't tell where the time is spent | 17:29 |
ssbarnea|bkp2 | i am building now a proposal, and I will show it to you. | 17:30 |
clarkb | mgagne: hrm it seems to have gone back to unhappy deleting nodes again | 17:30 |
ssbarnea|bkp2 | i got the idea, i will try to cover all use cases | 17:30 |
*** eharney has quit IRC | 17:35 | |
clarkb | mgagne: I'm going to find lunch/breakfast soon but let me know if I can help with any debugging | 17:36 |
hogepodge | I'm looking through some Loci code, and there are notes saying things like "Remove this when infra starts signing thier mirrors" for the apt repositories. | 17:40 |
hogepodge | Just curious, is this something that infra is now doing or plans on doing? | 17:40 |
clarkb | hogepodge: it is not something we are doing now, and I know of no plan to do so. The problem there is apt repo updates are race prone and can lead to broken repos/clients. What happens is you can have packages removed from disk that are still in the index, then your clients fail to install the package. The other failure mode is you update the index on a client, then the package is removed from the repo | 17:41 |
clarkb | hogepodge: to address this we use reprepro to build a new valid index based on what is on disk ( and tell it to not clean up old packages for some hours ). Unfortunately this means the indexes we produce are different than those from upstream and so the upstream keys aren't valid | 17:42 |
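For readers unfamiliar with reprepro, the workflow clarkb describes roughly corresponds to a config like the following. This is a hypothetical sketch, not the actual openstack-infra setup: the codename, components, paths, and placeholder key are illustrative only.

```
# conf/distributions -- the locally re-indexed mirror being served
Codename: xenial
Architectures: amd64
Components: main universe
Update: upstream-xenial

# conf/updates -- where packages are pulled from
Name: upstream-xenial
Method: http://archive.ubuntu.com/ubuntu
Suite: xenial
Components: main universe
Architectures: amd64
VerifyRelease: <upstream-archive-key-id>
```

Running `reprepro update` then regenerates the Packages/Release indexes locally from the on-disk pool, which is why the upstream signatures no longer apply; delaying removal of superseded packages (e.g. via reprepro's option to keep unreferenced files) is what gives clients those extra hours of grace.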
hogepodge | Ok, thanks. I'm thinking I'm going to make that bit configurable so we're doing secure by default, but do insecure in the gate | 17:42 |
clarkb | hogepodge: We could sign our repos and you could trust the keys, but we also want to avoid people treating those repos as consumable outside of testing | 17:42 |
clarkb | or if someone can figure out a way to use the upstream signed indexes and mirror them without breaking clients we'd probably do that | 17:43 |
hogepodge | No, I'm just going through our notes and trying to get TODOs out of code. | 17:43 |
hogepodge | It's not a strong requirement, just wanted to see if the note reflected reality, and it kind of doesn't. :-) We don't require it to be signed. | 17:43 |
hogepodge | But I can imagine a downstream user not wanting to trust unsigned repos for producing production packages. | 17:44 |
hogepodge | In the gate, it's not critical ¯\_(ツ)_/¯ | 17:44 |
hogepodge | thanks clarkb | 17:45 |
*** smarcet has joined #openstack-infra | 17:46 | |
clarkb | ok food is here. I'm out for a bit to eat | 17:47 |
*** eharney has joined #openstack-infra | 17:49 | |
mgagne | checking | 17:52 |
*** felipemonteiro has quit IRC | 17:52 | |
*** jamesmcarthur has joined #openstack-infra | 17:54 | |
*** trown|lunch is now known as trown | 17:54 | |
openstackgerrit | Aakarsh proposed openstack-infra/project-config master: Move openstack-browbeat zuul jobs to project repository https://review.openstack.org/613092 | 17:54 |
*** betherly has joined #openstack-infra | 17:55 | |
*** jamesmcarthur has quit IRC | 17:58 | |
*** zzzeek_ has joined #openstack-infra | 17:59 | |
*** betherly has quit IRC | 17:59 | |
*** eumel8 has joined #openstack-infra | 18:04 | |
*** apetrich has quit IRC | 18:07 | |
*** tung_comnets has joined #openstack-infra | 18:08 | |
tung_comnets | Can someone give one more +2 to this patch: https://review.openstack.org/#/c/612962/ | 18:10 |
tung_comnets | Thanks :) | 18:10 |
*** jamesmcarthur has joined #openstack-infra | 18:10 | |
openstackgerrit | Aakarsh proposed openstack-infra/project-config master: Move openstack-browbeat zuul jobs to project repository https://review.openstack.org/613092 | 18:11 |
*** jamesmcarthur has quit IRC | 18:13 | |
*** apetrich has joined #openstack-infra | 18:20 | |
*** jamesmcarthur has joined #openstack-infra | 18:23 | |
openstackgerrit | Pete Birley proposed openstack-infra/project-config master: New Repo - OpenStack-Helm Images https://review.openstack.org/611892 | 18:28 |
mgagne | clarkb: so I'm not sure what to do next. centos image looks fine, maybe because there aren't many instances based on it. but xenial is having a hard time. | 18:29 |
*** bnemec has quit IRC | 18:29 | |
openstackgerrit | Pete Birley proposed openstack-infra/project-config master: New Repo: OpenStack-Helm Docs https://review.openstack.org/611893 | 18:30 |
*** felipemonteiro has joined #openstack-infra | 18:30 | |
*** electrofelix has quit IRC | 18:31 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Support node caching in the nodeIterator https://review.openstack.org/604648 | 18:32 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Support node caching in the nodeIterator https://review.openstack.org/604648 | 18:35 |
*** munimeha1 has quit IRC | 18:37 | |
*** bhavikdbavishi has quit IRC | 18:37 | |
*** felipemonteiro has quit IRC | 18:37 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: DNM: Enable sar logging for unit tests https://review.openstack.org/613117 | 18:38 |
*** bobh has quit IRC | 18:40 | |
clarkb | mgagne: is it just timing out? | 18:45 |
clarkb | mgagne: maybe we should be patient with it and see if the caching is able to get in place? | 18:47 |
mgagne | could be it, some are active now | 18:47 |
clarkb | normally we don't rotate all images at once so this shouldn't be a common thing | 18:47 |
clarkb | (we did it in this case because it was easier when we moved zookeeper clusters to not migrate the data) | 18:47 |
clarkb | Shrews: I'm going to look at nb01 now | 18:48 |
clarkb | for cleaning up old images | 18:48 |
Shrews | k k | 18:48 |
clarkb | I'm just going to delete the stuff in /opt/nodepool_dib that is old | 18:49 |
clarkb | Shrews: then after that we need to delete the images on the cloud side | 18:51 |
clarkb | mordred: re ^ if you get a chance could you look at rax swift and glance to see if those are all cleaned up properly? I worry the sdk issue caused extra weirdness there | 18:54 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: quick-start: add a note about github https://review.openstack.org/613398 | 19:00 |
clarkb | #status log Old dib images cleared out of /opt/nodepool_dib on nb01, nb02, and nb03. Need to remove them from cloud providers next. | 19:02 |
openstackstatus | clarkb: finished logging | 19:02 |
clarkb | I'm going to start looking at image cleanup in $clouds | 19:02 |
Shrews | clarkb: finishing up a required training thing, but can assist when i'm done (if you haven't finished by then) | 19:05 |
clarkb | Shrews: ok. | 19:06 |
mordred | clarkb: heya - once the images are imported the swift objects that are created are no longer needed - so we can clean any of those out - I can do a cleanup pass tomorrow | 19:06 |
clarkb | One thing I've noticed is that in bhs1 some images can't be deleted because they have snapshots. Odd | 19:07 |
clarkb | these are images older than the ones I expected to need to clean. The images I expected to clean appear to delete ok | 19:07 |
clarkb | mordred: ya, I just have no idea if that was working when we were having that error happen | 19:07 |
clarkb | I want to say we don't catch the exception in the image create path and so it may not happen automatically | 19:07 |
*** hasharAway is now known as hashar | 19:07 | |
*** ykarel|away has quit IRC | 19:11 | |
*** smarcet has quit IRC | 19:12 | |
*** rlandy is now known as rlandy|brb | 19:13 | |
*** bobh has joined #openstack-infra | 19:15 | |
clarkb | BHS1 is done, except for all the images that can't be deleted because they have snapshots (I expect that is something cloud side we should look into later) | 19:17 |
clarkb | I don't think we made any snapshots ourselves | 19:17 |
mordred | clarkb: yeah - that's weird, I can't think of any reason we'd make snapshots of images | 19:17 |
*** bobh has quit IRC | 19:20 | |
AJaeger | config-core, could you review these two changes, please? https://review.openstack.org/612820 and https://review.openstack.org/612962 | 19:23 |
clarkb | GRA1 list of images to delete is running now | 19:25 |
clarkb | it seems to be failing less with snapshots than BHS1 | 19:26 |
*** jcoufal_ has joined #openstack-infra | 19:26 | |
*** jcoufal has quit IRC | 19:27 | |
*** bobh has joined #openstack-infra | 19:31 | |
*** rlandy|brb is now known as rlandy | 19:31 | |
clarkb | gra1 is done now too. Going to do inap next | 19:40 |
*** lbragstad has quit IRC | 19:43 | |
*** lbragstad has joined #openstack-infra | 19:43 | |
clarkb | inap is going to take a while. I may run a few of these in parallel. I'll look at vexxhost sjc1 next | 19:50 |
clarkb | what I'm doing is an openstack image list --private, trimming out any images we want to keep and putting that in a file, then doing a for loop catting that file and running openstack image delete | 19:50 |
clarkb | it's not very elegant, but I'm finding there are just enough new corner cases in each cloud that trying to automate it would take all day | 19:51 |
clarkb | like for some reason there are cloud specific private images we didn't upload | 19:51 |
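The manual cleanup clarkb describes can be sketched as a small shell loop. The image names below are made-up stand-ins; the real list would come from `openstack image list --private` and then be hand-edited to drop images that must be kept.

```shell
#!/bin/sh
# Step 1 (real workflow, commented out here): dump private image names,
# then hand-edit the file to remove anything that must be kept:
#   openstack image list --private -f value -c Name > images-to-delete.txt
# We fake that step with two illustrative names:
printf 'ubuntu-xenial-0000000001\ncentos-7-0000000001\n' > images-to-delete.txt

# Step 2: delete each remaining image serially, one at a time.
# DRY_RUN=1 (the default here) only prints what would happen;
# set DRY_RUN=0 to actually call the openstack CLI.
DRY_RUN=${DRY_RUN:-1}
while read -r name; do
    if [ "$DRY_RUN" = "1" ]; then
        echo "would delete: $name"
    else
        openstack image delete "$name"
    fi
done < images-to-delete.txt
```

Running serially is deliberate: as clarkb notes later, it keeps the load on the provider low while still draining the leaked images.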
*** irclogbot_1 has joined #openstack-infra | 20:01 | |
*** mriedem has joined #openstack-infra | 20:03 | |
*** kgiusti has left #openstack-infra | 20:03 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: WIP: support foreign required-projects https://review.openstack.org/613143 | 20:09 |
*** smarcet has joined #openstack-infra | 20:16 | |
mordred | clarkb: that might be osc not having full support for the new shared state? | 20:16 |
clarkb | mordred: oh maybe | 20:16 |
clarkb | I'm through all clouds but rax, packethost, citycloud (I think we upload but don't boot there), and inap | 20:18 |
clarkb | inap is in progress | 20:18 |
clarkb | the arm clouds were nice and tidy. Only the two we leaked by changing DBs had to be deleted looks like | 20:18 |
* clarkb does packethost next | 20:18 | |
mgagne | clarkb: I think there is a network bottleneck somewhere. But there is little I can do except trying to tell our netadmin that it's "normal". | 20:18 |
*** jcoufal_ has quit IRC | 20:19 | |
clarkb | mgagne: what's odd is it didn't do that before. So either our start-it-all-at-once shock to the system or your update (or some other change?) must've changed the behavior? | 20:19 |
clarkb | mgagne: I'm happy to help however we can | 20:19 |
mgagne | clarkb: the package contained unrelated changes to some management tools. | 20:20 |
clarkb | ah | 20:20 |
mgagne | maybe the network gear is much more overloaded than last time all images were updated. | 20:20 |
*** jcoufal has joined #openstack-infra | 20:21 | |
*** irclogbot_1 has quit IRC | 20:22 | |
clarkb | possible | 20:23 |
mgagne | ;) | 20:26 |
*** smarcet has quit IRC | 20:27 | |
*** hashar has quit IRC | 20:29 | |
clarkb | mordred: on the glance side of things we did seem to leak a bunch of images | 20:31 |
*** imacdonn has quit IRC | 20:31 | |
clarkb | mordred: I'm going to go ahead and delete all but the ones we are using now since it should be safe to cleanup swift later by just clearing things out | 20:32 |
clarkb | starting with rax-iad | 20:32 |
mordred | clarkb: ++ | 20:32 |
mgagne | clarkb: we would need to put a hold on mtl01, this is affecting some other critical systems. | 20:33 |
clarkb | mgagne: ok, if you write the change I'll go ahead and put it in place manually | 20:34 |
clarkb | mgagne: max-servers: 0? (or I can write the change too) | 20:34 |
mgagne | yes, 0 please | 20:34 |
*** anteaya has joined #openstack-infra | 20:34 | |
clarkb | ok I've put that in place manually and will make sure puppet doesn't undo it while we wait for the change to merge | 20:35 |
mgagne | thanks | 20:36 |
openstackgerrit | Mathieu Gagné proposed openstack-infra/project-config master: Disable inap-mtl01 provider https://review.openstack.org/613418 | 20:36 |
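The change above sets the provider's max-servers to zero. In nodepool's YAML configuration that looks roughly like the following; the pool name and surrounding fields are illustrative, not the actual project-config contents.

```yaml
providers:
  - name: inap-mtl01
    # ...cloud, image, and network settings unchanged...
    pools:
      - name: main
        max-servers: 0  # was a positive value; 0 stops new node launches
```

Setting max-servers to 0 disables new launches without removing the provider, which is why clarkb can also apply it manually on the nodepool host while the review merges.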
clarkb | mgagne: I also have an out of band image cleanup running against inap. Should I stop that too? It is running openstack image delete serially one after another to cleanup some images that we leaked (some were stuck in saving and others are from us changing DBs) | 20:38 |
clarkb | (I don't expect this is doing much to your cloud since it is running one at a time serially and cleaning things up, but happy to stop it too if you think it will help) | 20:38 |
mgagne | clarkb: I don't think this will affect the network performance as this shouldn't pull much bandwidth | 20:38 |
*** ansmith_ has quit IRC | 20:39 | |
clarkb | ya I don't expect it would cause that | 20:41 |
*** xek has quit IRC | 20:42 | |
*** jcoufal has quit IRC | 20:54 | |
*** betherly has joined #openstack-infra | 20:56 | |
clarkb | rax-iad is done. Now on to ord | 20:57 |
*** betherly has quit IRC | 21:01 | |
*** larainema has quit IRC | 21:02 | |
clarkb | heh I've been deleting by name. A few of the delete failures in rax were due to unique names. I'll make a second pass on iad and ord | 21:13 |
clarkb | Shrews: ^ is that a nodepool bug? I wouldn't expect us to reuse a name | 21:14 |
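The failure clarkb hit is that image names are not guaranteed unique, while IDs are. A sketch of selecting by name but deleting by ID; `list_private_images` is a stand-in for `openstack image list --private -f value -c ID -c Name`, and all IDs and names are made up.

```shell
#!/bin/sh
# Fake listing: "<id> <name>" pairs, standing in for the openstack CLI.
list_private_images() {
    echo "11111111-aaaa ubuntu-xenial-1540000000"
    echo "22222222-bbbb ubuntu-xenial-1540000000"   # duplicate name!
    echo "33333333-cccc centos-7-1540000000"
}

# Names can collide, IDs never do: match on the name but delete the ID.
list_private_images | awk '/ubuntu-xenial/ {print $1}' | while read -r id; do
    echo "openstack image delete $id"   # drop the echo to really delete
done
```

A second pass by ID, as clarkb does for iad and ord, catches the leftovers that a by-name delete either skipped or refused as ambiguous.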
*** fuentess has quit IRC | 21:14 | |
*** irclogbot_1 has joined #openstack-infra | 21:15 | |
*** trown is now known as trown|outtypewww | 21:15 | |
*** bobh has quit IRC | 21:16 | |
*** betherly has joined #openstack-infra | 21:16 | |
*** ldnunes has quit IRC | 21:18 | |
*** betherly has quit IRC | 21:21 | |
*** kjackal_v2 has quit IRC | 21:22 | |
*** kjackal has joined #openstack-infra | 21:22 | |
*** tung_comnets has quit IRC | 21:28 | |
*** kjackal has quit IRC | 21:36 | |
*** jamesmcarthur has quit IRC | 21:37 | |
*** betherly has joined #openstack-infra | 21:37 | |
*** betherly has quit IRC | 21:42 | |
*** kopecmartin is now known as kopecmartin|off | 21:43 | |
*** efried is now known as pot | 21:43 | |
*** pot is now known as efried | 21:43 | |
*** jamesmcarthur has joined #openstack-infra | 21:44 | |
clarkb | ok doing rax-dfw now then I think I am done | 21:46 |
*** jamesmcarthur has quit IRC | 21:48 | |
*** bobh has joined #openstack-infra | 21:51 | |
*** bobh has quit IRC | 21:56 | |
*** betherly has joined #openstack-infra | 21:58 | |
*** betherly has quit IRC | 22:02 | |
*** armax has quit IRC | 22:03 | |
*** armax has joined #openstack-infra | 22:03 | |
*** boden has quit IRC | 22:13 | |
*** betherly has joined #openstack-infra | 22:18 | |
*** gema has quit IRC | 22:18 | |
*** mriedem has quit IRC | 22:21 | |
*** betherly has quit IRC | 22:23 | |
*** emine__ has quit IRC | 22:24 | |
openstackgerrit | Merged openstack-infra/project-config master: New Airship project - Utils https://review.openstack.org/612820 | 22:25 |
ianw | clarkb: does the drop in http://grafana.openstack.org/d/8wFIHcSiz/nodepool-rackspace?panelId=15&fullscreen&orgId=1&from=now-7d&to=now correlate about when something nodepoolish was restarted? | 22:25 |
openstackgerrit | Merged openstack-infra/project-config master: Add release tag and remove python jobs for Apmec https://review.openstack.org/612962 | 22:27 |
clarkb | ianw: yes | 22:28 |
clarkb | ianw: I sent the notice we were taking zuul down about 14:38UTC and we were done about an hour later | 22:29 |
ianw | clarkb: hrm, well i guess i have something to look into now then :) | 22:30 |
clarkb | ianw: part of the rename in stats that you did? | 22:31 |
ianw | clarkb: this was certainly not an intended result of that, but yeah, it's the suspect | 22:32 |
ianw | hrm, although these stats are coming from openstacksdk via ... magic ... i wonder if this task thread etc has changed things | 22:33 |
clarkb | #status log Old nodepool images cleared out of cloud providers as part of the post ZK db transition cleanup. | 22:33 |
openstackstatus | clarkb: finished logging | 22:33 |
clarkb | ianw: possible, openstacksdk did update | 22:33 |
clarkb | ianw: also I've disabled inap at the request of mgagne | 22:35 |
clarkb | ovh is looking happy | 22:35 |
clarkb | seems something to do with asking inap to use a bunch of new images all at once made networking there sad | 22:36 |
ianw | ahh, i wondered what that drop was. yeah, occasionally we cleanup a few ports on ovh, but not much | 22:36 |
mgagne | clarkb: we will perform some tests tomorrow to see how we can improve the network performance for mtl01. For now, it should stay disabled. | 22:36 |
clarkb | mgagne: ok, if it would help we can also turn it back on with a lower max-servers value to reduce thrashing but still induce the behavior if you need it | 22:37 |
clarkb | say to 5 or 10. Not sure if that is desirable on your end | 22:37 |
ianw | like in the last 2 hours on ovh gra1, we found 1 DOWN port that had been sitting around for 3+ minutes | 22:37 |
clarkb | ianw: not bad | 22:37 |
mgagne | clarkb: we need to avoid sending traffic to a specific piece of network hardware. so we will test on our end first and enable it back when we are sure the problem is mitigated. | 22:38 |
clarkb | mgagne: roger | 22:38 |
ianw | amorin: we might be at a point where it would make sense for us to modify the script to keep track of the leaked id's? it might be practical from your side to trace through just one port allocation and see why it leaked | 22:38 |
ianw | 2018-10-25 12:12:27,814 DEBUG nodepool.TaskManager: Manager rax-iad ran task ComputeGetServersDetail in 1.608656644821167s | 22:40 |
ianw | so the name is being mangled correctly ... this leaves the possibility that stats are being produced but not making it to statsd | 22:41 |
*** tosky has quit IRC | 22:45 | |
*** tpsilva has quit IRC | 22:47 | |
ianw | E..P..@.@.....c=h........<r.nodepool.task.rax-ord.ComputePostServers:0.000000|ms | 22:48 |
*** eharney has quit IRC | 22:48 | |
ianw | there's your problem ... it's sending zeros | 22:48 |
clarkb | that will do it | 22:49 |
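The capture ianw pasted is a raw statsd UDP datagram; statsd's wire format is plain `name:value|type` text, so the payload can be pulled apart with ordinary shell parameter expansion. This parses the exact line from the capture above:

```shell
#!/bin/sh
# The statsd metric line from ianw's packet capture.
line='nodepool.task.rax-ord.ComputePostServers:0.000000|ms'

name=${line%%:*}    # everything before the first ':' is the metric name
rest=${line#*:}
value=${rest%%|*}   # the reported value -- here a suspicious 0.000000
type=${rest##*|}    # the metric type: "ms" marks a statsd timer

echo "$name $value $type"
# -> nodepool.task.rax-ord.ComputePostServers 0.000000 ms
```

This confirms ianw's diagnosis: the metrics are reaching statsd fine, but the timer values themselves are zero, so the problem is on the producing side rather than in transit.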
*** betherly has joined #openstack-infra | 22:49 | |
clarkb | possibly related to the sdk update in that case | 22:49 |
clarkb | fwiw I've yet to see anything that would indicate a problem with the new zk cluster | 22:49 |
ianw | HA FTW | 22:50 |
clarkb | ya the only two spofs now are gerrit and zuul scheduler | 22:50 |
clarkb | (I guess technically log copies too, but that is being worked with swift uploads) | 22:50 |
clarkb | I had to get up at a ridiculously early hour this morning so I may begin to call it a day at this point. Anything else I should look at or help with before doing so? I did AJaeger's review requests | 22:51 |
ianw | no ... vice versa anything i should watch particularly? | 22:53 |
*** betherly has quit IRC | 22:53 | |
clarkb | ianw: I would keep an eye on zk periodically just to make sure it hasn't done anything weird (cacti is probably good enough for that). Otherwise I don't think so | 22:53 |
ianw | clarkb: ok, no worries. result will probably be gate grinding to a halt, so that's also a good canary :) | 22:54 |
clarkb | indeed | 22:54 |
*** yamamoto has quit IRC | 23:01 | |
*** yamamoto has joined #openstack-infra | 23:01 | |
*** yamamoto has quit IRC | 23:06 | |
*** betherly has joined #openstack-infra | 23:09 | |
*** markmcd has joined #openstack-infra | 23:09 | |
*** betherly has quit IRC | 23:14 | |
*** rlandy has quit IRC | 23:23 | |
*** carl_cai has joined #openstack-infra | 23:39 | |
*** yamamoto has joined #openstack-infra | 23:44 | |
*** agopi is now known as agopi|brb | 23:56 | |
*** gyee has quit IRC | 23:57 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!