Friday, 2021-10-29

@clarkb:matrix.orghttps://sdv.eclipse.org/ was announced today. Considering zuul's existing use in the space I thought people here might be interested00:06
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com:00:29
- [zuul/zuul] 814281: Remove toDict from FrozenJob https://review.opendev.org/c/zuul/zuul/+/814281
- [zuul/zuul] 814243: Make FrozenJob a ZKObject https://review.opendev.org/c/zuul/zuul/+/814243
- [zuul/zuul] 814329: Implement frozen job serialization/deserialization https://review.opendev.org/c/zuul/zuul/+/814329
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 815911: Remove false-negative in zk test https://review.opendev.org/c/zuul/zuul/+/81591101:44
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul] 815343: CI image requires consistency cleanup https://review.opendev.org/c/zuul/zuul/+/81534301:44
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 814679: Store FrozenJob data in separate znodes https://review.opendev.org/c/zuul/zuul/+/81467902:09
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 814544: Cleanup stale items after refreshing a pipeline https://review.opendev.org/c/zuul/zuul/+/81454406:03
-@gerrit:opendev.org- Simon Westphahl proposed:06:27
- [zuul/zuul] 815787: Refresh pipelines in tests when settled https://review.opendev.org/c/zuul/zuul/+/815787
- [zuul/zuul] 815788: wip: Allow refreshing project branches https://review.opendev.org/c/zuul/zuul/+/815788
- [zuul/zuul] 815278: DNM: execute tests with two schedulers https://review.opendev.org/c/zuul/zuul/+/815278
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 814996: Make the ConfigLoader work independently of the Scheduler https://review.opendev.org/c/zuul/zuul/+/81499608:04
-@gerrit:opendev.org- Zuul merged on behalf of Felix Edel: [zuul/zuul] 815844: Provide zstat version when updating Node Request in ZooKeeper https://review.opendev.org/c/zuul/zuul/+/81584408:25
@ecsantos:matrix.orgHello folks08:52
@ecsantos:matrix.orgI have a question regarding Zuul playbooks and roles08:52
@ecsantos:matrix.orgFor example, this line on [1]: `    path: "{{ cached_repos_root }}/{{ zj_project.canonical_name }}"`08:52
[1] https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/prepare-workspace-git/tasks/main.yaml#L3
@ecsantos:matrix.orgWhere are these kinds of variables declared (such as `zj_project`)?08:53
@avass:vassast.orgecsantos:  `zj_project` is set by `loop_control { loop_var: zj_project }` and is there because of the loop var policy: https://www.zuul-ci.org/docs/zuul-jobs/policy.html#loops-in-roles09:49
@avass:vassast.orgecsantos: `cached_repos_root` is set as a default var in the role: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/prepare-workspace-git/defaults/main.yaml09:49
-@gerrit:opendev.org- Simon Westphahl proposed:10:11
- [zuul/zuul] 814772: Allow passing extra attributes to ZKObject.fromZK https://review.opendev.org/c/zuul/zuul/+/814772
- [zuul/zuul] 814571: Update pipeline state when modifying attributes https://review.opendev.org/c/zuul/zuul/+/814571
- [zuul/zuul] 814570: Reference active change queues in pipeline state https://review.opendev.org/c/zuul/zuul/+/814570
- [zuul/zuul] 814862: Bail out when a project moves between connections https://review.opendev.org/c/zuul/zuul/+/814862
- [zuul/zuul] 814773: Move re-enqueue to pipeline processing https://review.opendev.org/c/zuul/zuul/+/814773
- [zuul/zuul] 814899: Delete old build sets immediately https://review.opendev.org/c/zuul/zuul/+/814899
- [zuul/zuul] 815309: Cancel jobs before resetting builds https://review.opendev.org/c/zuul/zuul/+/815309
- [zuul/zuul] 815111: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/815111
- [zuul/zuul] 815276: Add change queues to change queue managers https://review.opendev.org/c/zuul/zuul/+/815276
- [zuul/zuul] 815277: Refresh pipelines before checking for empty queues https://review.opendev.org/c/zuul/zuul/+/815277
- [zuul/zuul] 815428: Fix GitHub PR (de-)serialization https://review.opendev.org/c/zuul/zuul/+/815428
- [zuul/zuul] 815429: Add missing logger to Build and BuildSet classes https://review.opendev.org/c/zuul/zuul/+/815429
- [zuul/zuul] 815450: Create bundle items during queue deserialization https://review.opendev.org/c/zuul/zuul/+/815450
- [zuul/zuul] 815495: Fix Gerrit change (de-)serialization https://review.opendev.org/c/zuul/zuul/+/815495
- [zuul/zuul] 815616: Only reset the pipeline state if needed https://review.opendev.org/c/zuul/zuul/+/815616
- [zuul/zuul] 815617: Ensure same layout UUID across schedulers https://review.opendev.org/c/zuul/zuul/+/815617
- [zuul/zuul] 815787: Refresh pipelines in tests when settled https://review.opendev.org/c/zuul/zuul/+/815787
- [zuul/zuul] 815788: wip: Allow refreshing project branches https://review.opendev.org/c/zuul/zuul/+/815788
- [zuul/zuul] 815278: DNM: execute tests with two schedulers https://review.opendev.org/c/zuul/zuul/+/815278
-@gerrit:opendev.org- Simon Westphahl proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com:10:11
- [zuul/zuul] 815154: Update test_inventory to be ZK-friendly https://review.opendev.org/c/zuul/zuul/+/815154
- [zuul/zuul] 815565: Remove unecessary assignment in re-enqueue https://review.opendev.org/c/zuul/zuul/+/815565
- [zuul/zuul] 815744: Use a metaclass to deserialize event objects https://review.opendev.org/c/zuul/zuul/+/815744
- [zuul/zuul] 815764: Add a pipeline change list object to ZK https://review.opendev.org/c/zuul/zuul/+/815764
- [zuul/zuul] 815916: Reduce use of OrderedDict in PipelineState https://review.opendev.org/c/zuul/zuul/+/815916
- [zuul/zuul] 815917: Update Pipeline for symmetry https://review.opendev.org/c/zuul/zuul/+/815917
@westphahl:matrix.orgzuul-maint: reordered the sos stack a bit since 814570 had legit test failures. 814772 is now the first change in the stack10:21
@mordred:inaugust.comswest: it's so exciting seeing all of that10:38
@westphahl:matrix.orgmordred: yeah I'm looking forward to having multiple schedulers running. we don't have much runway left running with a single scheduler10:42
@ecsantos:matrix.org> <@avass:vassast.org> ecsantos:  `zj_project` is set by `loop_control { loop_var: zj_project }` and is there because of the loop var policy: https://www.zuul-ci.org/docs/zuul-jobs/policy.html#loops-in-roles10:46
Albin Vass: Thanks! That made it clearer for me
@avass:vassast.orgecsantos: no problem!10:50
@ecsantos:matrix.orgSorry to bother but I have one more question :p10:56
@ecsantos:matrix.orgI'm hitting the following error on multiple plays (e.g., "Check growroot logs", "configure-mirrors : Update apt cache", "persistent-firewall : List current ipv4 rules"): `Timeout exception waiting for the logger. Please check connectivity to [<IP address>:19885]`10:56
@ecsantos:matrix.orgWhat is this "logger" Zuul is referring to?10:56
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 814772: Allow passing extra attributes to ZKObject.fromZK https://review.opendev.org/c/zuul/zuul/+/81477211:44
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 814571: Update pipeline state when modifying attributes https://review.opendev.org/c/zuul/zuul/+/81457111:48
@fungicide:matrix.orgecsantos: it's the daemon which streams the ansible output (console log), started by this role: https://www.zuul-ci.org/docs/zuul-jobs/general-roles.html#role-start-zuul-console12:47
@fungicide:matrix.orgdoes your job maybe adjust firewall rules to block access to 19885/tcp?12:48
@fungicide:matrix.orgspecifically, the executor will need to be able to reach that port12:50
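A minimal sketch for checking, from the executor host, whether the console streaming port on a job node accepts TCP connections. The helper name `can_reach_console` and its defaults are illustrative assumptions and not part of Zuul; only the port number 19885 comes from the error message above.

```python
import socket

# Port named in the "waiting for the logger" error message.
ZUUL_CONSOLE_PORT = 19885


def can_reach_console(host, port=ZUUL_CONSOLE_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: run it on the executor against a job node address.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A `False` result from the executor, combined with the daemon running on the node, would point at a firewall or routing problem rather than at the role itself.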
@fungicide:matrix.orgalso i notice we don't do a great job of documenting that, the only actual documentation reference to that port number  seems to be in https://www.zuul-ci.org/docs/zuul/howtos/nodepool_static.html#node-requirements12:54
@fungicide:matrix.orgif i get a moment later today i'll try to leave some breadcrumbs in the start-zuul-console role readme and the components chapter of the zuul documentation12:55
@ecsantos:matrix.org> <@fungicide:matrix.org> does your job maybe adjust firewall rules to block access to 19885/tcp?13:08
fungi: I'm running the same base job as OpenDev, same playbooks and roles, only difference is that I upload my logs to a logserver and not Swift
@ecsantos:matrix.orgFor example, the play `Check growroot logs` for me results in `localhost | Timeout exception waiting for the logger` for the localhost, and `controller | ok` for my Ubuntu node13:10
@ecsantos:matrix.orgInspecting my logs I see13:15
```
2021-10-29 08:25:30.915301 | TASK [start-zuul-console : Start zuul_console daemon.]
2021-10-29 08:25:31.889466 | controller | ok
```
It's only running for the Ubuntu node and not the localhost. Maybe I need to run this role on my localhost as well (Zuul is running on localhost)?
@fungicide:matrix.orgecsantos: in opendev, we set up default firewall rules on our nodes during the image build process, and so are doing it there: https://opendev.org/openstack/project-config/src/commit/30fd4b45491c4d0aa054be66dd8763a7ca89c1ec/nodepool/elements/nodepool-base/install.d/20-iptables#L6013:39
@fungicide:matrix.orgstart-zuul-console should only apply to the job nodes, not "localhost" (which is the executor)13:39
@ecsantos:matrix.orgfungi: Yeah, I'm using the same DIB elements on my diskimage, including `nodepool-base`13:41
@ecsantos:matrix.orgOh okay, I thought "all" included the localhost, my mistake, I'm new to Ansible as well :)13:41
@spamaps:spamaps.ems.hostI'm returning to zuul-land from a long hiatus. Is there a thing like nodepool-builder, but that builds using Dockerfiles and pushes container images instead? I really want that (I may just build it using periodic jobs)13:42
@jim:acmegating.comspamaps: not that pushes images; i think i'd just use periodic jobs for that.  opendev does its base container image building all in zuul jobs.13:43
@jim:acmegating.comspamaps: (nodepool can use dockerfiles for building vm images; not what you're asking for, but an interesting related new feature)13:44
@spamaps:spamaps.ems.hostthat's.. weird!13:45
@spamaps:spamaps.ems.hostI like it? ;)13:45
@jim:acmegating.comspamaps: but cool!  it isolates the build environment, so you don't need a builder host with tooling for every platform (which is occasionally not possible)13:46
@jim:acmegating.com(also, it's a little lighter weight to write a dockerfile than an element)13:47
@jim:acmegating.comit's all implemented with an element of course, called 'containerfile'13:47
@fungicide:matrix.orgwell, it's not all as simple as it seems. for example we need podman from debian/unstable to get a new enough version to be able to handle the glibc in latest fedora13:48
@spamaps:spamaps.ems.host🍿13:49
@spamaps:spamaps.ems.hostGlad to be home. I hope it's not a short visit.13:49
@fungicide:matrix.orgspamaps:  details of that particular slice of cake can be found here: https://review.opendev.org/81576613:49
@fungicide:matrix.orgcontainers! magic that lets you run pretty much anything anywhere, right? well, nope, as it turns out ;)13:50
@fungicide:matrix.orgcan anybody spot what's going sideways with the tenant quota in this failed test? seems to crop up at random (we just hit it on the podman version change above): https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_677/815766/3/gate/nodepool-tox-py38/677848b/testr_results.html14:02
@jim:acmegating.comfungi: it looks like either the last node was never deleted from zk, or the internal node cache is out of sync with zk.  i can't tell more about the problem from that bit of logging.14:12
@jim:acmegating.comwill probably need to add a bit more logging and run locally14:16
@clarkb:matrix.orgcorvus: the next few unapproved changes in the zuul sos stack were quick easy reviews that I've snuck in before my meeting today. I can pick reviews up again after the meeting (I'm at https://review.opendev.org/c/zuul/zuul/+/815111). I think that there is enough in the gate queue that I should get ahead of it this morning14:20
@jim:acmegating.comClark: awesome, thanks!14:20
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 814570: Reference active change queues in pipeline state https://review.opendev.org/c/zuul/zuul/+/81457014:42
@goneri:matrix.orgHi, Can I get the final +A on https://review.opendev.org/c/zuul/zuul-jobs/+/815901?14:43
@goneri:matrix.orgthanks! :-)14:53
@clarkb:matrix.orgswest: corvus few consistency questions in https://review.opendev.org/c/zuul/zuul/+/815111 not worthy of a -1 so I +2'd feel free to approve if you don't have any additional concerns from my comments15:37
@clarkb:matrix.orgcorvus:  I guess I didn't realize that pipeline queues are branch specific? https://review.opendev.org/c/zuul/zuul/+/815276/8/zuul/model.py In the case of say openstack's gate queue we set that to None because we run all the branches in that queue?15:45
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 815979: Use activeContext instead of explicit _save calls https://review.opendev.org/c/zuul/zuul/+/81597915:48
@jim:acmegating.comClark: ^ replied on 111 and made that followup ^15:48
@jim:acmegating.comClark: they are optionally branch-specific15:48
@fungicide:matrix.orgClark: i think that changed somewhat recently in order to support users who want separate queues per branch15:48
@jim:acmegating.comi believe the default is they aren't15:48
@clarkb:matrix.orgI see I think I missed that15:49
@clarkb:matrix.orgcorvus: swest question on https://review.opendev.org/c/zuul/zuul/+/81527615:49
@clarkb:matrix.orgI've approved https://review.opendev.org/c/zuul/zuul/+/815111/ and +2'd the followup thanks15:51
@jim:acmegating.comClark: does https://review.opendev.org/814773 address your question?15:58
@jim:acmegating.comin general, pipeline processing should make the zk data eventually consistent wrt configuration changes and re-enqueues.  i don't know if that addresses your specific concern though.15:59
@clarkb:matrix.orgI think that half answers my question. We're unlikely to suffer major problems from my concern on 81527616:00
@clarkb:matrix.orgThe other half is whether or not we should be doing inferred associations at all? and instead directly address them in the database16:00
@clarkb:matrix.orgWhen everything was in memory our python references were our direct associations and we lose that in this instance16:00
@clarkb:matrix.orgbut that would require making the queues object more complex to refer to a change queue manager? or the other way around. The change queue manager/pipeline state will have to refer to specific queues?16:01
@jim:acmegating.comi think what should happen is that for a project to move between queues, you would need a re-enqueue pass, so everything would need to move from old_queues to queues.  and the new queues would be constructed with the new set of shared projects.16:02
@clarkb:matrix.orggotcha. So sched 1 might fail but you haven't moved yet. Then sched2 can recover and do the move16:05
@jim:acmegating.comClark: yep, that's my understanding of the theory.16:06
@jim:acmegating.comi think the pipeline.state.layout_uuid lets us know that the objects in zk for a pipeline match the current config; if that doesn't match, re-enqueue needs to happen16:07
@clarkb:matrix.orgI have approved 815276 and left a note with some hints to this information16:07
@jim:acmegating.comand re-enqueue will always happen with the current config16:07
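The recovery idea discussed above can be sketched as follows. This is an illustration only: `PipelineState` and `process_pipeline` are hypothetical stand-ins, not Zuul's classes, and the ordering (marking the layout UUID last) is an assumption made so that a second scheduler can repeat an interrupted pass.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PipelineState:
    """Hypothetical stand-in for the pipeline state kept in ZooKeeper."""
    layout_uuid: Optional[str] = None
    old_queues: List[str] = field(default_factory=list)
    queues: List[str] = field(default_factory=list)


def process_pipeline(state, current_layout_uuid):
    """Re-enqueue when the stored state was built for a different layout.

    Returns True if a re-enqueue pass ran, False if nothing was needed.
    """
    if state.layout_uuid == current_layout_uuid:
        return False  # stored objects already match the current config
    # Park everything in old_queues, then rebuild under the current config.
    state.old_queues.extend(state.queues)
    state.queues = []
    while state.old_queues:
        state.queues.append(state.old_queues.pop(0))
    # Assumption: record the UUID only after the move has completed, so a
    # scheduler that sees a mismatch knows the pass still has to happen.
    state.layout_uuid = current_layout_uuid
    return True
```

If one scheduler stops before the final assignment, the UUID mismatch remains visible and another scheduler can run the same pass against the current configuration.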
@clarkb:matrix.orgMy goal today is to get through the end of the stack that is ready to land. Then reward myself with a bike ride while zuul gates everything :)16:30
@jim:acmegating.comsolid plan16:30
@clarkb:matrix.orgtobiash: you don't have a recent review on https://review.opendev.org/c/zuul/zuul/+/815450/ (this is my next change to review). Did you want to review that before it gets approved?16:38
@tobias.henkel:matrix.orgClark: I had +2 on that in the past16:38
@tobias.henkel:matrix.orglooks like that vanished during rebase16:38
@clarkb:matrix.orgah ok so probably fine to approve, thanks for checking16:39
@jim:acmegating.comi think there was a small fix to that since then; not substantial16:39
@clarkb:matrix.orgI've approved it16:43
@spamaps:spamaps.ems.hostI seem to have caused nodepool to go to plaid... 16:47
```
launcher_1 | 2021-10-29 16:45:18,535 ERROR nodepool.NodePool: if isinstance(other, GCEProviderConfig):
launcher_1 | 2021-10-29 16:45:18,535 ERROR nodepool.NodePool: RecursionError: maximum recursion depth exceeded while calling a Python object
launcher_1 | 2021-10-29 16:45:38,613 ERROR nodepool.NodePool: Exception in main loop:
launcher_1 | 2021-10-29 16:45:38,613 ERROR nodepool.NodePool: Traceback (most recent call last):
launcher_1 | 2021-10-29 16:45:38,613 ERROR nodepool.NodePool: File "/usr/local/lib/python3.9/site-packages/nodepool/launcher.py", line 1095, in run
launcher_1 | 2021-10-29 16:45:38,613 ERROR nodepool.NodePool: self.updateConfig()
```
@spamaps:spamaps.ems.hostSeems there's a way that GCEProviderConfig might self-reference.16:55
@spamaps:spamaps.ems.hostIt's happening when Nodepool tries to determine if new/old are different16:58
@spamaps:spamaps.ems.hostI think there may be somewhere that parts of the config are shared between new and old.16:59
@spamaps:spamaps.ems.hosthappens on startup too. :(17:00
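A minimal sketch of how an equality check can recurse without end when config objects refer back to each other. The classes below are hypothetical and are not nodepool's `GCEProviderConfig`; they only illustrate the general mechanism behind a `RecursionError` raised from `__eq__`, which is one possible explanation for the traceback above.

```python
class ProviderConfig:
    """Hypothetical provider config that owns a list of pools."""

    def __init__(self, name):
        self.name = name
        self.pools = []

    def __eq__(self, other):
        if isinstance(other, ProviderConfig):
            # Comparing the pool lists calls PoolConfig.__eq__ per element.
            return self.name == other.name and self.pools == other.pools
        return False


class PoolConfig:
    """Hypothetical pool config holding a back-reference to its provider."""

    def __init__(self, name, provider):
        self.name = name
        self.provider = provider  # back-reference creates a cycle

    def __eq__(self, other):
        if isinstance(other, PoolConfig):
            # Comparing providers calls ProviderConfig.__eq__ again.
            return self.name == other.name and self.provider == other.provider
        return False


def make_config():
    """Build a provider with one pool that points back at it."""
    provider = ProviderConfig("gce")
    provider.pools.append(PoolConfig("main", provider))
    return provider


def configs_differ(old, new):
    """Mimic an old/new config comparison; recurses forever on a cycle."""
    return old != new
```

Comparing two separately built configs with `configs_differ` bounces between the two `__eq__` methods until Python's recursion limit is hit. Excluding the back-reference from the comparison is one common way to break such a cycle.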
@tobias.henkel:matrix.orgcorvus: q on 81576417:01
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 814862: Bail out when a project moves between connections https://review.opendev.org/c/zuul/zuul/+/81486217:03
@jim:acmegating.comtobiash: replied17:04
@tobias.henkel:matrix.orgthanks, lgtm17:05
@clarkb:matrix.orgswest: corvus can you check my comment on https://review.opendev.org/c/zuul/zuul/+/815617 it isn't clear to me how we are setting an initial layout uuid to a valid uuid17:09
@clarkb:matrix.orghrm maybe I just figured it out. We're setting it in the Layout object then that gets passed to the LayoutState17:10
@clarkb:matrix.orgI was reading it as LayoutState always sets the uuid on Layout but that is only true if a LayoutState exists already. If it doesn't exist we get it from Layout /me updates the review17:10
@clarkb:matrix.orgI've approved https://review.opendev.org/c/zuul/zuul/+/815617 but left a comment about a small test update17:13
@clarkb:matrix.orgspamaps: I think the gerrit zuul installation is using nodepool with gce. I wonder why it doesn't hit that17:13
@clarkb:matrix.orgtobiash: if you are able to review https://review.opendev.org/c/zuul/zuul/+/815744/ that would be helpful as I'm the only current reviewer on it17:15
@jim:acmegating.comClark: tobiash previously reviewed that one and there have been only insubstantial changes since then, so i think we can carry it over17:17
@jim:acmegating.com(variable names/comments were the only changes)17:17
@tobias.henkel:matrix.orgI've just re-reviewed it17:17
@jim:acmegating.comthat works too :)17:18
@tobias.henkel:matrix.orgjust when I thought the gate got more stable the next reset happened...17:18
@jim:acmegating.comyeah looking at the failure now17:19
@westphahl:matrix.orgClark: responded in 81561717:19
@tobias.henkel:matrix.orgcorvus: the first thing I'm seeing there is a zk session loss: https://ddb474164b5093260c27-91acd78cd46015ce54b5f888f723113e.ssl.cf1.rackcdn.com/814773/14/gate/zuul-tox-py36/7e35fa2/testr_results.html17:20
@tobias.henkel:matrix.orgif that's the real reason we could look into reducing the parallel tests by one or maybe increase the zk session timeout more in the tests17:21
@jim:acmegating.comwell, that's an intentional zk disconnect test17:21
@tobias.henkel:matrix.orgoh, then scratch that17:21
@jim:acmegating.comi already "fixed" that once though...17:21
@spamaps:spamaps.ems.host> <@clarkb:matrix.org> spamaps: I think the gerrit zuul installation is using nodepool with gce. I wonder why it doesn't hit that17:22
I'm wondering if maybe there's cruft in zk from before I wrote the config ... I can't figure out how to log in to zk though.. zkCli gives SASL errors.
@clarkb:matrix.orgswest: yup after I left my comment I realized we are passing that all the way down to Layout during tenant layout parsing to have it generate a new uuid for us17:22
@jim:acmegating.comspamaps: zk is not relevant for that problem17:22
@spamaps:spamaps.ems.hostah ok17:22
@clarkb:matrix.orgour zk has local no ssl connectivity to work around that (if it were a zk issue)17:22
@jim:acmegating.comthose should only be local python objects17:22
@spamaps:spamaps.ems.host17:23
```
zookeeper-servers:
  - host: zk
    port: 2281
zookeeper-tls:
  cert: /var/certs/certs/client.pem
  key: /var/certs/keys/clientkey.pem
  ca: /var/certs/certs/cacert.pem
labels:
  - name: ubuntu-focal
    min-ready: 2
providers:
  - name: static-vms
    driver: static
    pools:
      - name: main
        nodes:
          - name: node
            labels: ubuntu-focal
            host-key: "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOgHJYejINIKzUiuSJ2MN8uPc+dfFrZ9JH1hLWS8gI+g"
            python-path: /usr/bin/python3
            username: root
  - name: gce-uscentral1
    driver: gce
    project: spotify-zuul-ci
    region: us-central1
    zone: us-central1-a
    cloud-images:
      - name: ubuntu-focal
        image-project: ubuntu-os-cloud
        image-family: ubuntu-2004-lts
        username: ubuntu
    pools:
      - name: ubuntu-focal-vms-gce-uscentral1a
        use-internal-ip: true
        labels:
          - name: ubuntu-focal
            cloud-image: ubuntu-focal
            instance-type: e2-standard-8
            volume-type: pd-standard
            volume-size: 500
```
@spamaps:spamaps.ems.hostThere's the config (that host key is the static one in the docs ;)17:23
@jim:acmegating.comspamaps: for reference though, in opendev we allow non-ssl connections to zk on localhost only (protected by firewall) to aid in debugging with zk-shell.17:23
@jim:acmegating.comtobiash: it got stuck on self.stats_thread.join() weird17:27
@tobias.henkel:matrix.orgCorvus: maybe because of the crashed stats reporter election17:32
@jim:acmegating.comtobiash: yeah, i think it's a race with the election17:32
@tobias.henkel:matrix.orgThere is an exception about that in the log17:32
@jim:acmegating.comlike it may have started running the election after the stop signal17:33
@jim:acmegating.comi think it's a pretty rare race.  we should probably have scheduler.stop cancel the election after setting the stop event.  i think that would take care of it.  i'll do that next time i have a clean working tree :)17:33
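The shutdown ordering proposed above (set the stop event, then cancel the election) can be sketched like this. `FakeElection` and `Scheduler` are hypothetical stand-ins written for illustration and do not reflect Zuul's actual scheduler code; the election here never wins and only returns when cancelled.

```python
import threading


class FakeElection:
    """Hypothetical election whose run() blocks until it is cancelled."""

    def __init__(self):
        self._cancelled = threading.Event()

    def run(self, callback):
        # In this sketch the election is never won, so callback is unused;
        # run() only returns once cancel() has been called.
        self._cancelled.wait()

    def cancel(self):
        self._cancelled.set()


class Scheduler:
    """Hypothetical scheduler with a stats thread that joins an election."""

    def __init__(self):
        self.stop_event = threading.Event()
        self.election = FakeElection()
        self.stats_thread = threading.Thread(
            target=self._run_stats, daemon=True)

    def _report_stats(self):
        pass

    def _run_stats(self):
        while not self.stop_event.is_set():
            self.election.run(self._report_stats)

    def start(self):
        self.stats_thread.start()

    def stop(self):
        self.stop_event.set()
        # Without this cancel, a thread already blocked in election.run()
        # would never re-check stop_event and join() could hang.
        self.election.cancel()
        self.stats_thread.join(timeout=5)
```

Setting the event before cancelling matters: once `run()` returns, the loop re-checks `stop_event` and exits instead of entering the election again.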
@spamaps:spamaps.ems.host> <@jim:acmegating.com> spamaps: for reference though, in opendev we allow non-ssl connections to zk on localhost only (protected by firewall) to aid in debugging with zk-shell.17:40
I may submit a patch for the quickstart to do the same thing
@jim:acmegating.comspamaps: i don't think we should do that17:43
@jim:acmegating.comspamaps: no end-user of zuul should ever have to touch zk.  i just said that to you as a zuul developer expert17:44
@jim:acmegating.comand exposing un-encrypted zk could compromise the system and therefore the integrity of the code that's tested.  it's super dangerous17:45
@jim:acmegating.com(we actually added a zuul cli tool so that if something goes wrong with zk, you can just delete all of zuul's state and start over; that's the level of end-user zk interaction i expect)17:47
@fungicide:matrix.orgyeah, the risk in the quickstart is that there are lots of additional components being started on the same system as the zk service. if you have zk confined to entirely separate servers it's less of a concern to do that17:52
@fungicide:matrix.org(and that's the case in opendev's deployment, for example)17:53
@fungicide:matrix.orgin particular, a job could run on the co-located executor to connect to the unprotected loopback port17:55
@spamaps:spamaps.ems.hostOk.. I figured quick start was more of "try this out, if you like it use a real install method"18:04
@fungicide:matrix.orgit should still be made reasonably "safe" for small scale use cases18:05
@jim:acmegating.comsince we know people would do it anyway, we designed it to be able to 'evolve' to production use, mostly by adjusting the nodepool config.  either way, i'd want to set a good example there.  :)18:09
@jim:acmegating.com(which is why it goes to all the trouble to use an encrypted zk connection on an all-in-one self-contained isolated-network localhost-only system)18:10
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl:18:19
- [zuul/zuul] 814773: Move re-enqueue to pipeline processing https://review.opendev.org/c/zuul/zuul/+/814773
- [zuul/zuul] 814899: Delete old build sets immediately https://review.opendev.org/c/zuul/zuul/+/814899
-@gerrit:opendev.org- Clark Boylan proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 815979: Use activeContext instead of explicit _save calls https://review.opendev.org/c/zuul/zuul/+/81597918:21
@clarkb:matrix.orgThat ^ fixes a linter error on the last change in the stack18:21
@clarkb:matrix.orgThe stack is largely approved now. There are still a variety of failures but I think corvus has been tracking those down and the majority are races or load issues and not directly related to the stack (though the stack does more zk stuff so can make races and load worse)18:22
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 815154: Update test_inventory to be ZK-friendly https://review.opendev.org/c/zuul/zuul/+/81515418:33
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 815309: Cancel jobs before resetting builds https://review.opendev.org/c/zuul/zuul/+/81530918:33
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl:19:52
- [zuul/zuul] 815111: Store builds in Zookeeper https://review.opendev.org/c/zuul/zuul/+/815111
- [zuul/zuul] 815276: Add change queues to change queue managers https://review.opendev.org/c/zuul/zuul/+/815276
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 815277: Refresh pipelines before checking for empty queues https://review.opendev.org/c/zuul/zuul/+/81527719:52
@tobias.henkel:matrix.orgcorvus: this is the next failure: https://11a58606918618718cd5-a21c5a2f7ec31a719791313ddc031133.ssl.cf5.rackcdn.com/815764/4/gate/zuul-tox-py38/a1e5f86/testr_results.html21:22
in this case iterate timeout timed out waiting for a build to be in starting phase because the load governor paused the executor
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 815428: Fix GitHub PR (de-)serialization https://review.opendev.org/c/zuul/zuul/+/81542821:33
@tobias.henkel:matrix.orgyet another gate reset, this time it seems to have hit a very slow node21:37
@tobias.henkel:matrix.orghowever I've also seen a real zk session loss further down in the queue21:38
@jim:acmegating.comtobiash: regarding the governor, it spent 30s over load average of 20, so 815389 combined either with a load multiplier increase or reducing concurrency would probably fix it.21:40
@jim:acmegating.comi'm leaning increasingly toward reducing concurrency21:40
@tobias.henkel:matrix.orgeither that or slightly increasing the load multiplier21:41
@tobias.henkel:matrix.orgI guess both would probably work21:41
@jim:acmegating.commost of the time that job was around LA=12, which is high but not crazy for an 8 core machine21:41
@jim:acmegating.comso i think it's still reasonable to keep trying concurrency=-1 (what we have now) and increase the multiplier for tests.21:42
@tobias.henkel:matrix.orgreducing concurrency would probably also reduce the risk of zk session losses21:42
@jim:acmegating.comi don't actually care that the governor work in tests, i just want the code running.21:42
@jim:acmegating.comtobiash: can you link to the build page of the job where the session was lost?21:43
@jim:acmegating.com(by work, i mean i don't want the governor to stop jobs, i just want it running)21:43
@tobias.henkel:matrix.orgneed to check my browser history, just a sec21:43
@jim:acmegating.com(so as far as i'm concerned, we can change the multiplier to 100x in tests)21:44
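A small worked example of the arithmetic in this exchange, assuming the governor's limit is the CPU count times the load multiplier. The 8 cores and the limit of 20 come from the messages above, which together imply a multiplier of 2.5; the function names are illustrative and not Zuul's API.

```python
def load_threshold(cpu_count, load_multiplier):
    """Load average above which the executor stops accepting jobs.

    Assumption: the limit is simply cpu_count * load_multiplier.
    """
    return cpu_count * load_multiplier


def should_unregister(load_avg, cpu_count, load_multiplier):
    """Return True if the load average exceeds the computed limit."""
    return load_avg > load_threshold(cpu_count, load_multiplier)


# 8 cores with a multiplier of 2.5 gives the limit of 20 mentioned above.
limit_now = load_threshold(8, 2.5)
# With a 100x multiplier in tests the limit becomes far higher than any
# load the test nodes are likely to reach.
limit_tests = load_threshold(8, 100)
```

Under that assumption, a load average around 12 stays below the current limit, a spike above 20 trips it, and a 100x multiplier keeps the governor code running without it pausing the executor.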
@tobias.henkel:matrix.orgcorvus: https://9646a7fb82b47fbe6288-a22e2178400a1d74c0dfc0d0570ba9cf.ssl.cf2.rackcdn.com/815744/3/gate/zuul-tox-py36/e30809a/testr_results.html21:44
@tobias.henkel:matrix.orgsecond failed test21:44
@jim:acmegating.comthanks for the link -- fwiw i find the build page more helpful than the direct link21:45
@tobias.henkel:matrix.orgI didn't find that anymore in my history21:46
@jim:acmegating.comno prob, here it is https://zuul.opendev.org/t/zuul/build/e30809a40c5a44ab8bdbee8181c2e3be21:46
@tobias.henkel:matrix.orgthe timed out job (https://zuul.opendev.org/t/zuul/build/b0dc7a8e8ec94bb89b0e46afe25aedca) shows a lot of steal time in dstat, so likely an overloaded compute node21:48
@jim:acmegating.comtobiash: that one is due to load too21:48
@jim:acmegating.comin the e30809a build:   2021-10-29 20:32:11,139 zuul.ExecutorServer              INFO     Unregistering due to high system load 22.06 > 20.0  21:49
@tobias.henkel:matrix.orgso I guess we should go with 815389 and reducing concurrency by one?21:49
@jim:acmegating.comthat's the first failure, and i don't always trust connection losses, etc, after the first failure (they may be real, or they may be a side effect of sloppy test teardown)21:50
@jim:acmegating.comtobiash: we could try 815389 with a multiplier adjustment before reducing concurrency if you want21:50
@jim:acmegating.comi do think we know that 815389 alone isn't enough at this point21:50
@jim:acmegating.comactually, if we're going to increase the multiplier we probably don't really need 81538921:51
@tobias.henkel:matrix.orgyeah21:51
@jim:acmegating.combut it's harmless; i could go either way.  might speed up the actual governor tests :)21:52
-@gerrit:opendev.org- Tobias Henkel proposed: [zuul/zuul] 816072: Increase load_multiplier in tests https://review.opendev.org/c/zuul/zuul/+/81607221:57
@jim:acmegating.comClark: ^22:00
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 816073: Cancel stats election on shutdown https://review.opendev.org/c/zuul/zuul/+/81607322:14
@jim:acmegating.comtobiash, Clark: that's the change i promised earlier ^22:14
@clarkb:matrix.orgtobiash: corvus do we think we need to poll the sensors more often too?22:31
@clarkb:matrix.orgI've got a change up that does that if so22:31
@jim:acmegating.comClark: i don't think it's necessary, but i'm happy to merge the change regardless22:32
@jim:acmegating.com(basically, if the multiplier is so high it never trips, it shouldn't matter how often we poll)22:32
@clarkb:matrix.orgwell now that I see the multiplier is 100 I agree it shouldn't matter :)22:32
@clarkb:matrix.orgyup exactly22:32
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 815429: Add missing logger to Build and BuildSet classes https://review.opendev.org/c/zuul/zuul/+/81542922:48
