jhesketh | dmsimard: No, there isn't a spec (afaik). Happy to work on one with you though if you think it's useful. Otherwise hopefully the direction those patches are going help explain the goals. | 00:46 |
---|---|---|
* jhesketh really needs to return to that :-s | 00:46 | |
pabelanger | oh, interesting, in one of my tests, nodepool-builder lost access to zookeeper, it seems during a build. This is on ovh, so suspect node is slow, however I wouldn't have expected nodepool-builder to delete the already completed DIB and generate it again: http://logs.openstack.org/04/622604/1/check/windmill-src-fedora-latest/88c57a7/logs/nb01/var/log/nodepool/builder-debug.log | 00:51 |
*** rlandy has quit IRC | 00:59 | |
pabelanger | Shrews: Looking at https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/builder.py#n795 if DIB was successful, what zk info could we have lost, if we lose the connection to zk? | 01:05 |
pabelanger | Shrews: basically, trying to understand why we set zk.FAILED there if connection is lost, and not wait for it to recover | 01:06 |
pabelanger | like we did at https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/builder.py#n781 | 01:06 |
*** j^2 has joined #zuul | 02:15 | |
tristanC | Shrews: not sure what changed between 3.3.0 and 3.3.1, but "deleting" node are leaking, we now have 16 of those... here are the logs: http://paste.openstack.org/show/736679/ | 02:17 |
clarkb | tristanC: ya there is a fix for that. though I don't think it was an issue in 3.3.1? came after iirc | 02:18 |
clarkb | basically nodepool created delete records for nodes that had exceptions booting (previously we leaked them without a node record) but the first fix didn't include pool info so we still couldn't delete them | 02:19 |
tristanC | clarkb: i also thought the missing pool info was the issue because of the "OpenStackProvider: Cannot find provider pool for node" message | 02:21 |
tristanC | but the NodeDeleter doesn't seem to use the pool info to delete the node | 02:21 |
tristanC | then I looked into the "node_exists" attribute, but Shrews said those nodes shall have an id, e.g.: https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/launcher.py#n93 | 02:23 |
dmsimard | tristanC: that's another fix that landed | 02:23 |
dmsimard | Not sure if it's been released | 02:23 |
dmsimard | https://review.openstack.org/#/c/622403/ | 02:26 |
dmsimard | Saw this issue in our logs a while back | 02:27 |
tristanC | dmsimard: iiuc, that's about empty zknode, which also are an issue, but a different one | 02:27 |
tristanC | perhaps https://review.openstack.org/#/c/621301/ would help figure out what's the issue | 02:27 |
tristanC | commit message doesn't really help though... | 02:27 |
clarkb | tristanC I thought it needed the pool info as it was short circuiting otherwise | 02:28 |
tristanC | clarkb: alright, thanks for the tip. I'll give the fix a try | 02:30 |
pabelanger | tristanC: clarkb: https://review.openstack.org/621040/ is the change I think will fix leaked nodes for vexxhost in ansible-network... as for deleting nodes: because this is rdocloud, i think you need to have a cloud admin reset the state of the VM. I believe nodepool is saying delete but openstack isn't | 02:42 |
pabelanger | tristanC: you can test, by trying to manually delete the node with openstack client | 02:42 |
pabelanger | if that fails, then the state needs to be toggled on openstack side | 02:42 |
tristanC | pabelanger: there are some Unauthorized exception too, not sure if they came with nodepool-3.3.1 though... here is an example: http://paste.openstack.org/show/736680/ | 02:47 |
pabelanger | tristanC: if you trace the log, was that to limestone? If so, we did a password reset today and updated clouds.yaml with dmsimard | 02:49 |
tristanC | pabelanger: the provider is not logged, and those are quite frequent though, about 3000 since 2018-12-01 | 02:51 |
pabelanger | tristanC: are they still happening? | 02:51 |
pabelanger | yah, we should log the provider to help debug | 02:51 |
pabelanger | tristanC: it is likely limestone auth was down for a few days also | 02:52 |
tristanC | pabelanger: last one was 2018-12-04 16:52:09,287 | 02:52 |
tristanC | (utc) | 02:52 |
pabelanger | tristanC: I think that is around the time dmsimard updated clouds.yaml on sf.io, you likely can confirm with timestamp on the file | 02:52 |
pabelanger | tristanC: I am not sure of the process on the sf.io side or where the password is stored | 02:53 |
tristanC | indeed ~nodepool/.config/openstack/clouds.yaml is Dec 4 16:52 | 02:53 |
pabelanger | cool | 02:54 |
*** bhavikdbavishi has joined #zuul | 02:56 | |
*** bhavikdbavishi has quit IRC | 03:01 | |
*** bhavikdbavishi has joined #zuul | 03:18 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 04:04 |
*** bjackman has joined #zuul | 04:05 | |
*** mordred has quit IRC | 04:57 | |
*** mordred has joined #zuul | 04:57 | |
tobiash | tristanC: commented ^ | 05:09 |
tristanC | tobiash: thanks, i agree that tests (and better commit message too) would help understand the new failures we are having with nodepool-3.3.1 | 05:18 |
tristanC | tobiash: but for the moment, i'm just trying to mitigate node leaking with the last released version... | 05:19 |
tristanC | fwiw, here is the list of backports that seem to help: https://softwarefactory-project.io/r/14521 | 05:21 |
tobiash | tristanC: I'm sure there is already a server create failed scenario. That probably just misses the quota call afterwards | 05:23 |
*** j^2 has quit IRC | 05:29 | |
tobiash | tristanC: if you don't have time maybe I can help with the test later | 05:32 |
tristanC | tobiash: https://review.openstack.org/622101 seems critical though, providers with a failing node are not able to launch new nodes because of the type IndexError exception | 05:45 |
*** dmellado has quit IRC | 05:46 | |
*** gouthamr has quit IRC | 05:46 | |
tobiash | One more argument for a test. I'll look later | 05:46 |
tristanC | this was not an issue without https://review.openstack.org/621681, because the estimatedNodepoolQuotaUsed was skipped as the node didn't have a pool set. | 05:47 |
openstackgerrit | Merged openstack-infra/nodepool master: Add cleanup routine to delete empty nodes https://review.openstack.org/622616 | 05:47 |
*** dmellado has joined #zuul | 05:48 | |
*** gouthamr has joined #zuul | 05:53 | |
*** dmellado has quit IRC | 05:55 | |
*** dmellado has joined #zuul | 05:57 | |
*** dmellado has quit IRC | 06:02 | |
*** dmellado has joined #zuul | 06:14 | |
*** gouthamr has quit IRC | 06:15 | |
*** gouthamr has joined #zuul | 06:18 | |
*** njohnston has quit IRC | 06:27 | |
*** njohnston_ has joined #zuul | 06:28 | |
*** gouthamr has quit IRC | 06:29 | |
*** gouthamr has joined #zuul | 06:32 | |
*** njohnston_ has quit IRC | 06:52 | |
*** njohnston has joined #zuul | 06:55 | |
*** gouthamr has quit IRC | 06:58 | |
*** gouthamr has joined #zuul | 07:07 | |
*** bhavikdbavishi has quit IRC | 07:09 | |
*** bhavikdbavishi1 has joined #zuul | 07:09 | |
*** pcaruana has joined #zuul | 07:10 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 07:12 | |
quiquell|off | tobiash: What a brain fart my review :-( | 07:16 |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Add default value for relative_priority https://review.openstack.org/622175 | 07:17 |
*** quiquell|off is now known as quiquell | 07:17 | |
tobiash | quiquell: thanks | 07:17 |
quiquell | tobiash: fixed | 07:18 |
quiquell | thanks to you | 07:18 |
*** gouthamr has quit IRC | 07:40 | |
*** gouthamr has joined #zuul | 07:42 | |
*** quiquell is now known as quiquell|brb | 07:48 | |
*** gtema has joined #zuul | 08:01 | |
*** themroc has joined #zuul | 08:15 | |
*** quiquell|brb is now known as quiquell | 08:22 | |
*** bhavikdbavishi has quit IRC | 08:33 | |
*** jpena|off is now known as jpena | 08:39 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 08:44 |
tobiash | tristanC, corvus, Shrews: ^ | 08:44 |
tobiash | there will be a second change that makes the quota calculation resilient to this so that operators of already wedged nodepools don't have to manually delete broken znodes | 08:45 |
*** bhavikdbavishi has joined #zuul | 09:06 | |
*** bhavikdbavishi has quit IRC | 09:07 | |
*** bhavikdbavishi has joined #zuul | 09:07 | |
*** njohnston has quit IRC | 09:09 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 09:30 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Make estimatedNodepoolQuotaUsed more resilient https://review.openstack.org/622906 | 09:30 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Make estimatedNodepoolQuotaUsed more resilient https://review.openstack.org/622906 | 09:32 |
*** bhavikdbavishi has quit IRC | 09:40 | |
*** njohnston has joined #zuul | 09:44 | |
*** quiquell is now known as quiquell|brb | 10:21 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: update status page layout based on screen size https://review.openstack.org/622010 | 10:22 |
*** dkehn has quit IRC | 10:22 | |
*** quiquell|brb is now known as quiquell | 10:52 | |
bjackman | Any chance of a Workflow+1 on https://review.openstack.org/#/c/620838/ ? | 10:56 |
*** bhavikdbavishi has joined #zuul | 10:58 | |
quiquell | tobiash, corvus: The role fetch-zuul-cloner is legacy stuff ? | 11:09 |
*** bhavikdbavishi has quit IRC | 11:09 | |
*** bhavikdbavishi has joined #zuul | 11:09 | |
*** bhavikdbavishi has quit IRC | 11:16 | |
*** bhavikdbavishi has joined #zuul | 11:16 | |
*** bhavikdbavishi has quit IRC | 11:20 | |
tristanC | quiquell: zuul-cloner is no longer needed in zuulv3, the repos are pushed from the executor instead | 11:24 |
quiquell | tristanC: So if we have something that depends on that we have to fix it, is that right ? | 11:26 |
tristanC | quiquell: iiuc, the fetch-zuul-cloner let you not fix that, but it's recommended to use the new zuul.projects workspace yes | 11:28 |
quiquell | tristanC: ack, thanks! | 11:29 |
tobiash | quiquell: yes, that's legacy ;) | 11:29 |
quiquell | sshnaidm: ^ looks like we cannot depend on the setup install fetch-zuul-cloner does | 11:29 |
quiquell | tobiash: thanks, we have been using the install it does over python projects without knowing | 11:30 |
tobiash | quiquell: you still can use it but it's only there for backwards compatibility and mainly used in openstack land | 11:30 |
quiquell | tobiash, tristanC: broken requirements to using /home/zuul/src... | 11:30 |
quiquell | cool cool thanks | 11:31 |
sshnaidm | quiquell, the patch with requirement is in gates, so should be fine | 11:31 |
quiquell | sshnaidm: Just thinking about removing it from the base of our repro, so we can catch those issues | 11:32 |
quiquell | sshnaidm: Like the patch do we have to also include all the projects there ? | 11:33 |
sshnaidm | quiquell, well, it's not really the issue while it's working | 11:33 |
quiquell | sshnaidm: ack | 11:33 |
*** gtema has quit IRC | 11:40 | |
*** bhavikdbavishi has joined #zuul | 12:02 | |
openstackgerrit | Merged openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema https://review.openstack.org/620838 | 12:04 |
openstackgerrit | Brendan proposed openstack-infra/zuul master: Fix urllib imports in Gerrit HTTP form auth code https://review.openstack.org/622942 | 12:05 |
*** gtema has joined #zuul | 12:14 | |
*** jpena is now known as jpena|lunch | 12:25 | |
*** bjackman has quit IRC | 12:52 | |
*** themroc has quit IRC | 13:21 | |
*** themroc has joined #zuul | 13:23 | |
*** jpena|lunch is now known as jpena | 13:33 | |
*** dkehn has joined #zuul | 13:35 | |
*** rlandy has joined #zuul | 13:36 | |
mordred | tristanC: refactor stack lgtm - there's an oops in the commit message in https://review.openstack.org/#/c/621396 | 14:02 |
*** sshnaidm has quit IRC | 14:06 | |
*** njohnston has quit IRC | 14:08 | |
*** njohnston_ has joined #zuul | 14:09 | |
Shrews | pabelanger: re: image build, not sure what zk info would have been lost. i'd have to go back through the code again | 14:09 |
pabelanger | Shrews: ack, thanks. For now, I've bumped the job timeout to account for the slow nodes, but might be a good optimization to look into. | 14:10 |
*** quiquell is now known as quiquell|off | 14:11 | |
Shrews | pabelanger: oh, i think it's because we no longer hold a lock for building that particular image (another builder could be actively building it now) | 14:11 |
Shrews | pabelanger: not so efficient if you only have one builder with that image, but no way for us to know that atm | 14:12 |
*** sshnaidm has joined #zuul | 14:14 | |
pabelanger | Shrews: oh, right. that makes sense | 14:15 |
tobias-urdin | anybody running zuul with gerrit 2.16, planning upgrade but can't see any supported versions statement | 14:31 |
pabelanger | tobias-urdin: I've seen 1 fix for gerrit 2.16 in zuul so far: https://review.openstack.org/619533/ | 14:41 |
pabelanger | I figure that user is, but do not know the irc handle | 14:42 |
tobias-urdin | pabelanger: thanks, i'll wait for a while and stop at 2.15 for now, 2.16 seems pretty fresh overall | 14:43 |
pabelanger | tobias-urdin: I know openstack has plans to upgrade, but not sure the timeline. I know corvus is also working on gerrit for opendev, so assume that will maybe use a newer version of gerrit. | 14:45 |
pabelanger | tobias-urdin: we should see what version of gerrit zuul quickstart is using | 14:45 |
pabelanger | https://git.zuul-ci.org/cgit/zuul/tree/doc/source/admin/examples/docker-compose.yaml#n6 | 14:46 |
pabelanger | seems like latest version | 14:46 |
tobias-urdin | cool | 14:47 |
tobias-urdin | nobody remembers a coward :) | 14:48 |
tobias-urdin | i'll give it a try | 14:48 |
pabelanger | ++ | 14:49 |
*** sshnaidm has quit IRC | 14:57 | |
*** quiquell|off has quit IRC | 14:59 | |
*** ParsectiX has joined #zuul | 15:16 | |
*** sshnaidm has joined #zuul | 15:18 | |
*** sshnaidm is now known as sshnaidm|afk | 15:36 | |
*** jesusaur has quit IRC | 15:37 | |
tobiash | tobias-urdin: there is a second change for 2.16 too: https://review.openstack.org/620838 | 15:41 |
tobiash | tobias-urdin: jackman uses 2.16 so you could ask him how good it works atm | 15:41 |
tobiash | s/jackman/bjackman | 15:41 |
*** jesusaur has joined #zuul | 15:42 | |
tobias-urdin | just saw some plugins we are using haven't cut any stable-2.16 branches yet so i'll have to wait on 2.15 for a while :( | 15:51 |
*** hashar has joined #zuul | 16:04 | |
tobiash | fungi: I approved 554352 but it is in merge conflict | 16:09 |
fungi | tobiash: unsurprising. that change has been waiting for ages. i'll rebase it and fix up whatever conflicts it has nowish | 16:11 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/zuul master: Add instructions for reporting vulnerabilities https://review.openstack.org/554352 | 16:12 |
fungi | tobiash: ^ not bad. it was just the index, unsurprisingly | 16:12 |
fungi | thanks! | 16:14 |
tobiash | now we just need more contact persons... | 16:14 |
openstackgerrit | Merged openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 16:14 |
openstackgerrit | Merged openstack-infra/nodepool master: Make estimatedNodepoolQuotaUsed more resilient https://review.openstack.org/622906 | 16:14 |
*** j^2 has joined #zuul | 16:15 | |
*** sshnaidm|afk has quit IRC | 16:15 | |
*** pcaruana has quit IRC | 16:18 | |
tobiash | tristanC: ^ | 16:19 |
tobiash | with tests :) | 16:19 |
tobiash | Shrews, corvus: were these the last fixes needed for 3.3.1 ^^ ? | 16:20 |
tobiash | maybe it makes sense to do a release probably next week? | 16:21 |
Shrews | tobiash: that's probably mostly your call since you've worked more with tristanC on his issues | 16:22 |
Shrews | i'm not aware of anything outstanding atm | 16:22 |
corvus | we'd probably have the same issue if we restarted more :) | 16:23 |
corvus | so i think we should get all that in and restart openstack-infra, then when it looks good release | 16:24 |
tobiash | ++ | 16:24 |
*** sshnaidm|afk has joined #zuul | 16:30 | |
pabelanger | would be great to see 3.3.1 release | 16:32 |
*** themroc has quit IRC | 16:37 | |
*** gtema has quit IRC | 16:49 | |
*** bjackman has joined #zuul | 16:57 | |
*** bhavikdbavishi has quit IRC | 17:02 | |
*** bhavikdbavishi has joined #zuul | 17:02 | |
*** hashar has quit IRC | 17:08 | |
*** bjackman has quit IRC | 17:09 | |
*** rlandy is now known as rlandy|brb | 17:10 | |
tobiash | corvus, Shrews: as far as I can see all recent fixes to the last release have been merged | 17:25 |
openstackgerrit | Merged openstack-infra/zuul master: Add instructions for reporting vulnerabilities https://review.openstack.org/554352 | 17:25 |
*** rlandy|brb is now known as rlandy | 17:42 | |
*** jpena is now known as jpena|off | 17:42 | |
*** mrhillsman has quit IRC | 17:51 | |
*** mrhillsman has joined #zuul | 17:52 | |
mrhillsman | any thoughts on why web would only show Loading... on the Builds tab? | 18:25 |
tobiash | mrhillsman: typically if the api requests failed | 18:26 |
tobiash | mrhillsman: you should check requests in the browser debugging window | 18:26 |
tobiash | mrhillsman: does zuul-web have the sql connection configured? | 18:27 |
mrhillsman | let me confirm that | 18:27 |
mrhillsman | did not know that was required | 18:27 |
mrhillsman | but makes sense | 18:28 |
tobiash | mrhillsman: zuul-web directly queries the database without asking the scheduler | 18:28 |
mrhillsman | thx | 18:32 |
mrhillsman | tobiash which daemon handles writing to the db | 18:41 |
tobiash | mrhillsman: the scheduler, but only if the sql reporter is added to the pipeline | 18:41 |
mrhillsman | yeah, it is there, and i can login manually | 18:42 |
mrhillsman | and pymysql is there | 18:42 |
mrhillsman | i did not have mysql-client thought | 18:42 |
mrhillsman | though | 18:42 |
mrhillsman | not sure if that mattered | 18:43 |
Shrews | corvus: tobiash: looks like we successfully removed around 430 empty zk nodes with the latest update | 18:43 |
tobiash | Shrews: cool :) | 18:43 |
mrhillsman | oh, i got it | 18:43 |
mrhillsman | thx tobiash | 18:43 |
mrhillsman | i have found the error of my ways | 18:43 |
tobiash | mrhillsman: you're welcome | 18:44 |
*** manjeets_ is now known as manjeets | 18:47 | |
*** ParsectiX has quit IRC | 18:49 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add an upgrade release note for schema change https://review.openstack.org/623046 | 18:53 |
Shrews | corvus: tobiash: ^^^ | 18:53 |
tobiash | +2 | 18:54 |
*** bhavikdbavishi has quit IRC | 19:15 | |
panda | jhesketh: ping re: https://review.openstack.org/#/q/topic:freeze_job may I ask how you expect this will be used ? something like runner --job <job-name> and will run the playbooks locally ? | 19:49 |
Shrews | So, it seems we have a race in the test_handler_poll_session_expired nodepool unit test. I've seen it fail enough to make me begin to look into it. Not sure where the race is yet, though. | 21:06 |
Shrews | but was able to reproduce it locally | 21:07 |
tobiash | Shrews: cool, I also thought about that but don't have an idea either | 21:09 |
openstackgerrit | Merged openstack-infra/nodepool master: Add an upgrade release note for schema change https://review.openstack.org/623046 | 21:17 |
*** rlandy is now known as rlandy|bbl | 23:25 | |
*** j^2 has quit IRC | 23:26 | |
*** j^2 has joined #zuul | 23:31 | |
*** dkehn has quit IRC | 23:33 | |
*** dkehn has joined #zuul | 23:34 | |
*** j^2 has quit IRC | 23:47 |
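
Note on the quota discussion above (05:45 to 08:45): tristanC reports that a node record with no type makes the quota estimate fail with an IndexError, and tobiash describes a follow-up change that makes the calculation tolerate such records. The sketch below only illustrates that idea and is not the nodepool implementation: the `Node` dataclass, `estimated_cores_used`, and the `cores_by_label` mapping are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a node record; not the real nodepool class."""
    id: str
    provider: str
    pool: Optional[str] = None
    type: List[str] = field(default_factory=list)


def estimated_cores_used(nodes: Iterable[Node], provider: str,
                         cores_by_label: Dict[str, int]) -> int:
    """Sum the cores used by a provider, skipping incomplete node records."""
    used = 0
    for node in nodes:
        # Ignore other providers, and nodes with no pool set (as described
        # in the log, such nodes were already skipped by the estimate).
        if node.provider != provider or node.pool is None:
            continue
        # Assumption: an errored node may carry an empty type list, in which
        # case node.type[0] would raise IndexError. Skip it instead.
        if not node.type:
            continue
        used += cores_by_label.get(node.type[0], 0)
    return used
```

With this guard, one broken record is left out of the estimate rather than blocking every later launch for that provider.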