Monday, 2022-05-16

@ssbarnea:matrix.orghello folks! Who has a love for the zuul-jobs repo? I need one or two people willing to help maintain the ansible-lint integration with zuul-jobs. See https://github.com/ansible/ansible-lint/discussions/1403#discussioncomment-2757108 for details06:58
@fzzfh:matrix.orgfungi: Clark hi, I found that the 'wait_for_server' method of the SDK, as invoked by nodepool, doesn't actually finish binding the floating IP. As a result, instances booted by nodepool have no floating IP, so the 'auto-floating-ip' option doesn't work.11:02
https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/provider.py#L341
https://docs.openstack.org/openstacksdk/rocky/user/connection.html#openstack.connection.Connection.wait_for_server
As a temporary workaround I invoke the SDK's add_auto_ip method in provider.py to attach a floating IP to the instance.
This problem can also be seen here: https://storyboard.openstack.org/#!/story/2009877
@fzzfh:matrix.orghttps://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_floating_ip.py#L995 . My instances booted by nodepool already have a public_v4, so auto_ip won't work.11:07
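
A minimal sketch of the workaround described above, assuming openstacksdk's cloud-layer wait_for_server and add_auto_ip calls; the connection setup and server name are placeholders, not nodepool's actual code:

    import openstack

    # Placeholder cloud name; nodepool builds its connection from the
    # provider configuration rather than calling openstack.connect().
    conn = openstack.connect(cloud='mycloud')

    server = conn.get_server('example-node')  # placeholder server name

    # wait_for_server() blocks until the server is ACTIVE, but per the
    # report above it can return before the floating IP is bound.
    server = conn.wait_for_server(server, auto_ip=True, timeout=180)

    # Explicitly allocate and attach a floating IP, waiting for it to
    # show up on the server before continuing.
    server = conn.add_auto_ip(server, wait=True, timeout=120)
    print(server['public_v4'])
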
@hanson76:matrix.orgHi, I'm trying to debug a problem we encountered with the AWS provider after upgrading Zuul and Nodepool to 6.0.012:56
We receive a lot of node_error results after the upgrade.
I've tried to track it down, and one common thing is that it appears to happen when we get the following log: "WARNING kazoo.client: Connection dropped: socket connection broken"
It looks like all ongoing node requests fail when this happens.
Not sure why, but it looks like DeletedNodeWorker is marking all node requests that are not complete as failed and tries to delete them.
Even when the requests are fine and currently in "complete", waiting for the keyscan to be done.
Not sure why the ZK connection is lost; everything is running in a k8s cluster in AWS, and we can't see any network issues between the nodes running ZK and Nodepool.
@hanson76:matrix.orgWe are currently running "latest" Nodepool to see if the fixes done after 6.0.0 help, but no difference (the announced version is "6.0.1.dev11")12:57
@fungicide:matrix.orgHanson: do you have resource tracking for the zk server cluster? Could it be running out of memory or restarting? anything at all in the zk logs on any of the cluster members?13:37
@hanson76:matrix.org@fungi I found out that there is a zookeeper-timeout property that defaults to 10 seconds; that looks to be too small a timeout for our setup, even though our setup is small.13:48
I changed it to 60 seconds, and that got rid of all the lost connections so far.
But that does not fix the node_error problem. The logs look very strange.
I'm trying to dig a bit more. The strange thing is that it looks like all nodes created for a job are reported as ready, but the job got declined with
"Declining node request because node type(s) [centos-8-standard,de-mockhost] not available"
@fungicide:matrix.orgdo you have multiple node providers? if so, keep in mind that nodepool (currently) requires all the labels in the nodeset to be available in a provider, since all nodes for the node request have to be satisfied by the same provider13:50
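
A hypothetical AWS provider pool illustrating that constraint, with both requested labels satisfiable by the same provider (the names, images, and instance types below are placeholders):

    providers:
      - name: example-aws          # placeholder provider name
        driver: aws
        region-name: eu-west-1
        pools:
          - name: main
            max-servers: 10
            labels:
              - name: centos-8-standard
                cloud-image: centos-8-image     # placeholder image reference
                instance-type: t3.medium
                key-name: zuul
              - name: de-mockhost
                cloud-image: mockhost-image     # placeholder image reference
                instance-type: t3.small
                key-name: zuul
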
@hanson76:matrix.orgI can see the following in the logs.13:51
2022-05-16 13:37:31,911 DEBUG nodepool.NodePool: No more active min-ready requests for label de-mockhost
2022-05-16 13:37:31,919 DEBUG nodepool.NodePool: No more active min-ready requests for label centos-8-standard
2022-05-16 13:37:44,050 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0000052438 (state: used, allocated_to: 200-0000075662)
2022-05-16 13:37:44,055 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0000052439 (state: used, allocated_to: 200-0000075662)
It looks like the nodes are deleted because of the "No more active min-ready" condition for some reason.
@hanson76:matrix.orgBoth nodes are in the same provider; it sometimes works, sometimes not. It worked every time with Nodepool 5.0.013:52
@fungicide:matrix.org200-0000075662 is the node request those nodes were used in. did the build which requested them run?13:53
@hanson76:matrix.org 200-0000075662 is the request. No, the job did not run on those nodes; it looks like it did a reattempt later.13:55
@fungicide:matrix.orgthe logs on your scheduler(s) should indicate the build id which was associated with that node request, and why the nodes for it were eventually unlocked14:00
@hanson76:matrix.orgMm, strange. I looked in the zuul-executor logs now and I can actually see that those two nodes were returned and that the job did run but failed in 10 s, but there are no build logs available.14:00
@fungicide:matrix.orgyeah, collecting build logs requires the jobs to work at least enough to run whatever post-run phase playbook is doing your log collection14:01
@fungicide:matrix.orgthe executor logs should hopefully have some detail as to why the job was terminated though14:02
@hanson76:matrix.orgGuess I need to enable debug logging on the executor; the only log I can see is the "Checking out", then nothing for 10 s, and then "Held status set to False"14:05
@fungicide:matrix.orgcheck for python tracebacks in the log14:06
@hanson76:matrix.orgNo traceback around that time. Anyway, it looks like the zookeeper-timeout change got rid of most of our node_error results.14:08
Most jobs work now, but require 2 or 3 attempts to complete for some odd reason.
@fungicide:matrix.orgi'd definitely check the executor logs for the initial attempts before retrying. also if you're quick and your jobs have console log streaming turned on, you might be able to pull up the web stream of the job stdout/stderr and spot a clue in there14:10
@fungicide:matrix.orgi'm assuming the retried builds didn't get far enough to actually collect logs, so the build result pages for them in the web dashboard probably aren't giving you much to go on, but it doesn't hurt to check there just in case14:11
@fungicide:matrix.orgif you go to the builds tab you should see entries for the builds which got ignored and retried14:12
@fungicide:matrix.orgusually though, retrying a job automatically is a sign that there was a failure in a pre-run playbook or that ansible returned an error which indicated a probable network disconnection from a node14:13
@fungicide:matrix.organsible does sometimes mistake things like its process getting oom-killed, or there not being enough space on the node's fs to write its execution scripts, for a "network problem"14:14
@hanson76:matrix.orgThink I found the problem but no clue why it happens. "Failed to connect to the host via ssh: ssh: connect to host X port"14:14
@hanson76:matrix.orgWe did set "host-key-checking: False" in nodepool for troubleshooting; I guess nodepool is returning the node to Zuul a bit too early when that is done.14:16
@hanson76:matrix.orgWill remove that and try again.14:16
@fungicide:matrix.orgthat definitely sounds like ansible at least thinking the host is "unreachable" (which in ansible's terms just means it wasn't able to run things on it for undetermined reasons)14:16
@hanson76:matrix.orgOoh, that might have solved it.14:22
We tried to track down the problem with deleted nodes and the ZK disconnects, and set "host-key-checking" to false in that process.
The actual problem was the zookeeper-timeout being too low; fixing that did the trick, but skipping host-key-checking made Nodepool
return nodes before sshd was started (as soon as the AWS state was "running"), and that failed the actual jobs that tried to ssh to them.
I reverted the host-key-checking config and have zookeeper-timeout set to 60 seconds instead of 10, and it looks to work!
Thanks for helping troubleshoot this.
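
A sketch of the pool setting in question; leaving host-key-checking at its default of true makes the launcher scan the node's SSH host keys before handing it to Zuul, which doubles as an "sshd is up" readiness check (the pool name is a placeholder):

    pools:
      - name: main
        host-key-checking: true   # the default; false skips the keyscan and
                                  # returns nodes as soon as the cloud reports
                                  # them running
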
@fungicide:matrix.orgawesome, yeah i hadn't considered that as i forgot the config allows you to disable the launcher's host key checking (which as you've observed is generally necessary in order to gauge whether a node is ready to receive ssh connections). glad you worked it out!14:32
@fungicide:matrix.orgi wonder if we have an appropriate warning in the docs about that. checking... will propose the addition if not14:33
@clarkb:matrix.orgI think that feature is aimed at systems that don't have sshd14:55
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [zuul/nodepool] 841908: Clarify reasons for host-key-checking setting https://review.opendev.org/c/zuul/nodepool/+/84190814:55
@fungicide:matrix.orgHanson: see if that ^ would have made it more obvious14:56
@fungicide:matrix.orgClark: the documentation for the option only calls out situations where the launcher and nodes are on different networks (so the launcher is unable to ssh to the nodes to check them), but yes i suppose non-ssh nodes could be another reason to disable it14:57
@fzzfh:matrix.orgfungi: Clark do you have a moment to read my earlier message about nodepool auto-floating-ip? thanks15:04
@fungicide:matrix.orgfzzf: it's a little outside my scope of knowledge, sounds like you're having trouble getting openstacksdk to understand how you've set up networking in your cloud15:08
@clarkb:matrix.orgfzzf: fungi: if the issue is that wait_for_server isn't waiting long enough, those timeouts are configurable in nodepool.15:20
@clarkb:matrix.orgBut if it is failing because it will never become available, then that is a separate issue15:20
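
A sketch of the OpenStack provider timeouts this likely refers to; the names come from the nodepool OpenStack driver docs, and the values are illustrative, not recommendations:

    providers:
      - name: example-cloud      # placeholder provider name
        driver: openstack
        cloud: mycloud           # placeholder clouds.yaml entry
        boot-timeout: 300        # seconds to wait for the server to go ACTIVE
        launch-timeout: 3600     # overall cap on a single launch attempt
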
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [zuul/nodepool] 841908: Clarify reasons for host-key-checking setting https://review.opendev.org/c/zuul/nodepool/+/84190815:25
@fzzfh:matrix.org> <@fzzfh:matrix.org> fungi: Clark hi, I found that the 'wait_for_server' method of the SDK, as invoked by nodepool, doesn't actually finish binding the floating IP. [...]15:25
fungi: I mean that the SDK's wait_for_server method, as nodepool invokes it, may not finish attaching the floating IP. Someone else has hit the same problem: https://storyboard.openstack.org/#!/story/2009877
@fungicide:matrix.orgsince folks in here may be generally interested, a reminder that the cfp for ansiblefest 2022 is now open until july 15 and is always a great place to talk about zuul: https://www.ansible.com/ansiblefest15:34
@fzzfh:matrix.orghttps://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_floating_ip.py#L990  for auto_ip to act, the server needs to have a private_v4 and no public_v4, and needs 'private' to be false. The instances my nodepool boots don't match this: they have the same address for private_v4 and public_v4. Is this problem accidental?15:38
@clarkb:matrix.orgfzzf: I think the idea is that if you already have a public ip then you don't automatically need a floating ip, and requesting one should be explicit15:39
@clarkb:matrix.orgauto floating IP really only makes sense if the cloud only provides private IPs to the instance but you still want connectivity15:39
@fzzfh:matrix.orgClark: my public_v4 is the same as my private_v4 according to what it shows. they are both private IPs. I don't know why15:42
@fzzfh:matrix.orgwhen I add the add_auto_ip call, its public_v4 becomes the floating IP.15:43
@clarkb:matrix.orgfzzf: it detects that based on attributes of the network itself, or you can override it with network config in clouds.yaml iirc16:11
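
A sketch of the clouds.yaml override being described; the cloud and network names are placeholders:

    clouds:
      mycloud:
        auth:
          auth_url: https://keystone.example.com:5000/v3   # placeholder
        networks:
          - name: internal-net
            routes_externally: false   # addresses here are treated as private
            nat_destination: true      # floating IPs attach via this network
          - name: external-net
            routes_externally: true    # addresses here are treated as public
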
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 840540: Make note of python_version being a string value https://review.opendev.org/c/zuul/zuul-jobs/+/84054016:16
@clarkb:matrix.orgZuulians https://review.opendev.org/c/zuul/zuul-jobs/+/840539 is a python3.10 cleanup and https://review.opendev.org/c/zuul/zuul-jobs/+/840540 is a followup to that to avoid a problem the python3.10 change ran into, by being more clear in the docs and examples. Not urgent but straightforward, and would be good to land eventually16:19
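
The pitfall 840540 documents, sketched with an illustrative job definition; an unquoted YAML number with a trailing zero loses it when parsed as a float:

    - job:
        name: example-py310-job
        vars:
          python_version: 3.10     # YAML float: becomes 3.1 -- wrong
    - job:
        name: example-py310-job-fixed
        vars:
          python_version: "3.10"   # string: stays "3.10" -- correct
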
@jim:acmegating.comClark: note my +2 comment on that last one16:32
@clarkb:matrix.orgGood to know. ianw suggested that update16:37
@clarkb:matrix.orghttps://review.opendev.org/c/zuul/zuul/+/840356 is the other change related to that but I think we may want to hold off on that as python3.10 wheels don't exist yet. Not a major issue for Zuul but a bigger impact for nodepool on arm64. Work is in progress to get wheels built though16:43
@clarkb:matrix.orgSounds like paramiko 2.11.0 should happen soon as well to address that blowfish warning. Hasn't happened yet according to pypi though16:46
@jim:acmegating.comClark: i agree, sounds like waiting for wheels so that our nodepool builds don't slow to a crawl would be good16:47
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul-jobs] 840539: Switch py3.10 testing to Ubuntu Jammy https://review.opendev.org/c/zuul/zuul-jobs/+/84053916:58
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul-jobs] 840540: Make note of python_version being a string value https://review.opendev.org/c/zuul/zuul-jobs/+/84054017:27
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 840246: Add a spec for global semaphores https://review.opendev.org/c/zuul/zuul/+/84024617:38
-@gerrit:opendev.org- Sunny Cai proposed: [zuul/zuul-website] 841986: test blog article https://review.opendev.org/c/zuul/zuul-website/+/84198618:10
@fungicide:matrix.orgcorvus: i updated 840979 per your suggestion, anything else you feel it needs? i think sunny was hoping to link to the published article in social media with the reminder promotions about the impending summit ticket price increase18:54
@jim:acmegating.comfungi: ah thanks! lgtm.  also, that felt like a good application of review to blog posts :)19:37
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [zuul/zuul-website] 840979: Meet the Zuul community in Berlin blog article https://review.opendev.org/c/zuul/zuul-website/+/84097919:41
@fungicide:matrix.org> <@jim:acmegating.com> fungi: ah thanks! lgtm.  also, that felt like a good application of review to blog posts :)19:59
yep, agreed. thanks!
@iwienand:matrix.orgClark: yeah frickler double checked but we did add types to rolevars at one point https://review.opendev.org/c/zuul/zuul-sphinx/+/64116821:59
@jim:acmegating.comianw: Clark my apologies, it looks like my comment raced with the build results for ps422:17
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 842010: Add global semaphore support https://review.opendev.org/c/zuul/zuul/+/84201023:26
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:23:45
- [zuul/zuul] 804177: Add include- and exclude-branches tenant config options https://review.opendev.org/c/zuul/zuul/+/804177
- [zuul/zuul] 841336: Add always-dynamic-branches option https://review.opendev.org/c/zuul/zuul/+/841336

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!