Monday, 2022-05-16

@ssbarnea:matrix.orghello folks! Who has a love for the zuul-jobs repo? I need one or two people willing to help maintain the ansible-lint integration with zuul-jobs. See https://github.com/ansible/ansible-lint/discussions/1403#discussioncomment-2757108 for details06:58
@fzzfh:matrix.orgfungi: Clark hi, I found that the 'wait_for_server' method of the SDK, as invoked by nodepool, doesn't actually finish binding the floating IP. As a result, instances booted by nodepool have no floating IP, so the 'auto-floating-ip' option doesn't work.11:02
https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L341
https://docs.openstack.org/openstacksdk/rocky/user/connection.html#openstack.connection.Connection.wait_for_server
As a temporary workaround I invoke the SDK's add_auto_ip method in provider.py to attach a floating IP to the instance.
This problem can also be seen here: https://storyboard.openstack.org/#!/story/2009877
@fzzfh:matrix.orghttps://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_floating_ip.py#L995 . My instances booted by nodepool already have a public_v4, so auto_ip won't work.11:07
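
A minimal sketch of the workaround described above, assuming openstacksdk's cloud-layer wait_for_server and add_auto_ip calls; the connection setup and server name are placeholders, not nodepool's actual code:

    import openstack

    # Placeholder cloud name; nodepool builds its connection from the
    # provider configuration rather than calling openstack.connect().
    conn = openstack.connect(cloud='mycloud')

    server = conn.get_server('example-node')  # placeholder server name

    # wait_for_server() blocks until the server is ACTIVE, but per the
    # report above it can return before the floating IP is bound.
    server = conn.wait_for_server(server, auto_ip=True, timeout=180)

    # Explicitly allocate and attach a floating IP, waiting for it to
    # show up on the server before continuing.
    server = conn.add_auto_ip(server, wait=True, timeout=120)
    print(server['public_v4'])
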
@hanson76:matrix.orgHi, I'm trying to debug a problem we encountered with the AWS provider after upgrading Zuul and Nodepool to 6.0.012:56
We receive a lot of node_error results after the upgrade.
I've tried to track it down, and one common thing is that it appears to happen when we get the following log: "WARNING kazoo.client: Connection dropped: socket connection broken"
It looks like all ongoing node requests fail when this happens.
Not sure why, but it looks like DeletedNodeWorker is marking all node requests that are not complete as failed and tries to delete them.
Even when the requests are fine and currently in "complete", waiting for the keyscan to be done.
Not sure why the ZK connection is lost; everything is running in a k8s cluster in AWS, and we can't see any network issues between the nodes running ZK and Nodepool.
@hanson76:matrix.orgWe are currently running "latest" Nodepool to see if the fixes done after 6.0.0 help, but no difference (the announced version is "6.0.1.dev11")12:57
@fungicide:matrix.orgHanson: do you have resource tracking for the zk server cluster? Could it be running out of memory or restarting? anything at all in the zk logs on any of the cluster members?13:37
@hanson76:matrix.org@fungi I found out that there is a zookeeper-timeout property that defaults to 10 seconds; that looks to be too small a timeout for our setup, even though our setup is small.13:48
I changed it to 60 seconds, and that got rid of all the lost connections so far.
But that does not fix the node_error problem. The logs look very strange.
I'm trying to dig a bit more. The strange thing is that it looks like all nodes created for a job are reported as ready, but the job got declined with
"Declining node request because node type(s) [centos-8-standard,de-mockhost] not available"
@fungicide:matrix.orgdo you have multiple node providers? if so, keep in mind that nodepool (currently) requires all the labels in the nodeset to be available in a provider, since all nodes for the node request have to be satisfied by the same provider13:50
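
A hypothetical AWS provider pool illustrating that constraint, with both requested labels satisfiable by the same provider (the names, images, and instance types below are placeholders):

    providers:
      - name: example-aws          # placeholder provider name
        driver: aws
        region-name: eu-west-1
        pools:
          - name: main
            max-servers: 10
            labels:
              - name: centos-8-standard
                cloud-image: centos-8-image     # placeholder image reference
                instance-type: t3.medium
                key-name: zuul
              - name: de-mockhost
                cloud-image: mockhost-image     # placeholder image reference
                instance-type: t3.small
                key-name: zuul
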
@hanson76:matrix.orgI can see the following in the logs.13:51
2022-05-16 13:37:31,911 DEBUG nodepool.NodePool: No more active min-ready requests for label de-mockhost
2022-05-16 13:37:31,919 DEBUG nodepool.NodePool: No more active min-ready requests for label centos-8-standard
2022-05-16 13:37:44,050 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0000052438 (state: used, allocated_to: 200-0000075662)
2022-05-16 13:37:44,055 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0000052439 (state: used, allocated_to: 200-0000075662)
It looks like the nodes are deleted because of the "No more active min-ready" condition for some reason.
@hanson76:matrix.orgBoth nodes are in the same provider; it sometimes works, sometimes not. It worked every time with Nodepool 5.0.013:52
@fungicide:matrix.org200-0000075662 is the node request those nodes were used in. did the build which requested them run?13:53
@hanson76:matrix.org 200-0000075662 is the request. No, the job did not run on those nodes; it looks like it did a reattempt later.13:55
@fungicide:matrix.orgthe logs on your scheduler(s) should indicate the build id which was associated with that node request, and why the nodes for it were eventually unlocked14:00
@hanson76:matrix.orgMm, strange. I looked in the zuul-executor logs now and I can actually see that those two nodes were returned and that the job did run but failed in 10 s, but there are no build logs available.14:00
@fungicide:matrix.orgyeah, collecting build logs requires the jobs to work at least enough to run whatever post-run phase playbook is doing your log collection14:01
@fungicide:matrix.orgthe executor logs should hopefully have some detail as to why the job was terminated though14:02
@hanson76:matrix.orgGuess I need to enable debug logging on the executor; the only log I can see is the "Checking out", then nothing for 10 s, and then "Held status set to False"14:05
@fungicide:matrix.orgcheck for python tracebacks in the log14:06
@hanson76:matrix.orgNo traceback around that time. Anyway, it looks like the zookeeper-timeout change got rid of most of our node_error results.14:08
Most jobs work now, but require 2 or 3 attempts to complete for some odd reason.
@fungicide:matrix.orgi'd definitely check the executor logs for the initial attempts before retrying. also if you're quick and your jobs have console log streaming turned on, you might be able to pull up the web stream of the job stdout/stderr and spot a clue in there14:10
@fungicide:matrix.orgi'm assuming the retried builds didn't get far enough to actually collect logs, so the build result pages for them in the web dashboard probably aren't giving you much to go on, but it doesn't hurt to check there just in case14:11
@fungicide:matrix.orgif you go to the builds tab you should see entries for the builds which got ignored and retried14:12
@fungicide:matrix.orgusually though, retrying a job automatically is a sign that there was a failure in a pre-run playbook or that ansible returned an error which indicated a probable network disconnection from a node14:13
@fungicide:matrix.organsible does sometimes mistake things like its process getting oom-killed, or there not being enough space on the node's fs to write its execution scripts, for a "network problem"14:14
@hanson76:matrix.orgThink I found the problem but no clue why it happens. "Failed to connect to the host via ssh: ssh: connect to host X port"14:14
@hanson76:matrix.orgWe did set "host-key-checking: False" in nodepool for troubleshooting; I guess nodepool is returning the node to Zuul a bit too early when that is done.14:16
@hanson76:matrix.orgWill remove that and try again.14:16
@fungicide:matrix.orgthat definitely sounds like ansible at least thinking the host is "unreachable" (which in ansible's terms just means it wasn't able to run things on it for undetermined reasons)14:16
@hanson76:matrix.orgOoh, that might have solved it.14:22
We tried to track down the problem with deleted nodes and the ZK disconnects, and set "host-key-checking" to false in that process.
The actual problem was the zookeeper-timeout being too low; fixing that did the trick, but skipping host-key-checking made Nodepool
return nodes before sshd was started (as soon as the AWS state was "running"), and that failed the actual jobs that tried to ssh to them.
I reverted the host-key-checking config and have zookeeper-timeout set to 60 seconds instead of 10, and it looks to work!
Thanks for helping troubleshoot this.
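
A sketch of the pool setting in question; leaving host-key-checking at its default of true makes the launcher scan the node's SSH host keys before handing it to Zuul, which doubles as an "sshd is up" readiness check (the pool name is a placeholder):

    pools:
      - name: main
        host-key-checking: true   # the default; false skips the keyscan and
                                  # returns nodes as soon as the cloud reports
                                  # them running
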
@fungicide:matrix.orgawesome, yeah i hadn't considered that as i forgot the config allows you to disable the launcher's host key checking (which as you've observed is generally necessary in order to gauge whether a node is ready to receive ssh connections). glad you worked it out!14:32
@fungicide:matrix.orgi wonder if we have an appropriate warning in the docs about that. checking... will propose the addition if not14:33
@clarkb:matrix.orgI think that feature is aimed at systems that don't have sshd14:55
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [zuul/nodepool] 841908: Clarify reasons for host-key-checking setting https://review.opendev.org/c/zuul/nodepool/+/84190814:55
@fungicide:matrix.orgHanson: see if that ^ would have made it more obvious14:56
@fungicide:matrix.orgClark: the documentation for the option only calls out situations where the launcher and nodes are on different networks (so the launcher is unable to ssh to the nodes to check them), but yes i suppose non-ssh nodes could be another reason to disable it14:57
@fzzfh:matrix.orgfungi: Clark do you have a moment to read my earlier message about nodepool auto-floating-ip? thanks15:04
@fungicide:matrix.orgfzzf: it's a little outside my scope of knowledge, sounds like you're having trouble getting openstacksdk to understand how you've set up networking in your cloud15:08
@clarkb:matrix.orgfzzf: fungi: if the issue is that wait_for_server isn't waiting long enough, those timeouts are configurable in nodepool.15:20
@clarkb:matrix.orgBut if it is failing because it will never become available, then that is a separate issue15:20
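
A sketch of the OpenStack provider timeouts this likely refers to; the names come from the nodepool OpenStack driver docs, and the values are illustrative, not recommendations:

    providers:
      - name: example-cloud      # placeholder provider name
        driver: openstack
        cloud: mycloud           # placeholder clouds.yaml entry
        boot-timeout: 300        # seconds to wait for the server to go ACTIVE
        launch-timeout: 3600     # overall cap on a single launch attempt
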
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [zuul/nodepool] 841908: Clarify reasons for host-key-checking setting https://review.opendev.org/c/zuul/nodepool/+/84190815:25
@fzzfh:matrix.org> <@fzzfh:matrix.org> fungi: Clark hi, I found that the 'wait_for_server' method of the SDK, as invoked by nodepool, doesn't actually finish binding the floating IP. [...]15:25
fungi: I mean that the SDK's wait_for_server method, as nodepool invokes it, may not finish attaching the floating IP. Someone else has hit the same problem: https://storyboard.openstack.org/#!/story/2009877
@fungicide:matrix.orgsince folks in here may be generally interested, a reminder that the cfp for ansiblefest 2022 is now open until july 15 and is always a great place to talk about zuul: https://www.ansible.com/ansiblefest15:34
@fzzfh:matrix.orghttps://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_floating_ip.py#L990  for auto_ip to act, the server needs to have a private_v4 and no public_v4, and needs 'private' to be false. The instances my nodepool boots don't match this: they have the same address for private_v4 and public_v4. Is this problem accidental?15:38
@clarkb:matrix.orgfzzf: I think the idea is that if you already have a public ip then you don't automatically need a floating ip, and requesting one should be explicit15:39
@clarkb:matrix.orgauto floating IP really only makes sense if the cloud only provides private IPs to the instance but you still want connectivity15:39
@fzzfh:matrix.orgClark: my public_v4 is the same as my private_v4 according to what it shows. they are both private IPs. I don't know why15:42
@fzzfh:matrix.orgwhen I add the add_auto_ip call, its public_v4 becomes the floating IP.15:43
@clarkb:matrix.orgfzzf: it detects that based on attributes of the network itself, or you can override it with network config in clouds.yaml iirc16:11
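
A sketch of the clouds.yaml override being described; the cloud and network names are placeholders:

    clouds:
      mycloud:
        auth:
          auth_url: https://keystone.example.com:5000/v3   # placeholder
        networks:
          - name: internal-net
            routes_externally: false   # addresses here are treated as private
            nat_destination: true      # floating IPs attach via this network
          - name: external-net
            routes_externally: true    # addresses here are treated as public
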
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 840540: Make note of python_version being a string value https://review.opendev.org/c/zuul/zuul-jobs/+/84054016:16
@clarkb:matrix.orgZuulians https://review.opendev.org/c/zuul/zuul-jobs/+/840539 is a python3.10 cleanup and https://review.opendev.org/c/zuul/zuul-jobs/+/840540 is a followup to that to avoid a problem the python3.10 change ran into, by being more clear in the docs and examples. Not urgent but straightforward, and would be good to land eventually16:19
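
The pitfall 840540 documents, sketched with an illustrative job definition; an unquoted YAML number with a trailing zero loses it when parsed as a float:

    - job:
        name: example-py310-job
        vars:
          python_version: 3.10     # YAML float: becomes 3.1 -- wrong
    - job:
        name: example-py310-job-fixed
        vars:
          python_version: "3.10"   # string: stays "3.10" -- correct
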
@jim:acmegating.comClark: note my +2 comment on that last one16:32
@clarkb:matrix.orgGood to know. ianw suggested that update16:37
@clarkb:matrix.orghttps://review.opendev.org/c/zuul/zuul/+/840356 is the other change related to that but I think we may want to hold off on that as python3.10 wheels don't exist yet. Not a major issue for Zuul but a bigger impact for nodepool on arm64. Work is in progress to get wheels built though16:43
@clarkb:matrix.orgSounds like paramiko 2.11.0 should happen soon as well to address that blowfish warning. Hasn't happened yet according to pypi though16:46
@jim:acmegating.comClark: i agree, sounds like waiting for wheels so that our nodepool builds don't slow to a crawl would be good16:47
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul-jobs] 840539: Switch py3.10 testing to Ubuntu Jammy https://review.opendev.org/c/zuul/zuul-jobs/+/84053916:58
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul-jobs] 840540: Make note of python_version being a string value https://review.opendev.org/c/zuul/zuul-jobs/+/84054017:27
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 840246: Add a spec for global semaphores https://review.opendev.org/c/zuul/zuul/+/84024617:38
-@gerrit:opendev.org- Sunny Cai proposed: [zuul/zuul-website] 841986: test blog article https://review.opendev.org/c/zuul/zuul-website/+/84198618:10
@fungicide:matrix.orgcorvus: i updated 840979 per your suggestion, anything else you feel it needs? i think sunny was hoping to link to the published article in social media with the reminder promotions about the impending summit ticket price increase18:54
@jim:acmegating.comfungi: ah thanks! lgtm.  also, that felt like a good application of review to blog posts :)19:37
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [zuul/zuul-website] 840979: Meet the Zuul community in Berlin blog article https://review.opendev.org/c/zuul/zuul-website/+/84097919:41
@fungicide:matrix.org> <@jim:acmegating.com> fungi: ah thanks! lgtm.  also, that felt like a good application of review to blog posts :)19:59
yep, agreed. thanks!
@iwienand:matrix.orgClark: yeah frickler double checked but we did add types to rolevars at one point https://review.opendev.org/c/zuul/zuul-sphinx/+/64116821:59
@jim:acmegating.comianw: Clark my apologies, it looks like my comment raced with the build results for ps422:17
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 842010: Add global semaphore support https://review.opendev.org/c/zuul/zuul/+/84201023:26
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:23:45
- [zuul/zuul] 804177: Add include- and exclude-branches tenant config options https://review.opendev.org/c/zuul/zuul/+/804177
- [zuul/zuul] 841336: Add always-dynamic-branches option https://review.opendev.org/c/zuul/zuul/+/841336

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!