*** noorul has joined #zuul | 00:15 | |
*** noorul has quit IRC | 00:28 | |
*** noorul has joined #zuul | 00:30 | |
*** noorul has quit IRC | 00:40 | |
*** noorul has joined #zuul | 00:40 | |
*** noorul has quit IRC | 00:55 | |
*** wxy-xiyuan has joined #zuul | 01:08 | |
*** jamesmcarthur has joined #zuul | 01:22 | |
*** bhavikdbavishi has joined #zuul | 01:30 | |
*** noorul has joined #zuul | 01:32 | |
*** noorul has quit IRC | 01:38 | |
*** threestrands has joined #zuul | 01:39 | |
*** bhavikdbavishi has quit IRC | 01:52 | |
*** jamesmcarthur has quit IRC | 01:54 | |
*** noorul has joined #zuul | 01:54 | |
*** noorul has quit IRC | 01:59 | |
openstackgerrit | Jeff Liu proposed zuul/zuul-operator master: Add PerconaXDB Cluster to Zuul-Operator https://review.opendev.org/677315 | 02:07 |
*** spsurya has joined #zuul | 02:11 | |
*** noorul has joined #zuul | 02:15 | |
*** noorul has quit IRC | 02:20 | |
*** noorul has joined #zuul | 02:23 | |
*** noorul has quit IRC | 02:31 | |
*** noorul has joined #zuul | 02:36 | |
*** bhavikdbavishi has joined #zuul | 03:14 | |
*** bhavikdbavishi1 has joined #zuul | 03:19 | |
*** bhavikdbavishi has quit IRC | 03:20 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 03:20 | |
*** ianychoi has quit IRC | 03:29 | |
*** ianychoi has joined #zuul | 03:30 | |
*** rfolco has quit IRC | 03:32 | |
*** noorul has quit IRC | 03:50 | |
*** noorul has joined #zuul | 04:22 | |
*** raukadah is now known as chkumar|rover | 04:47 | |
*** bjackman has joined #zuul | 05:03 | |
*** noorul has quit IRC | 05:06 | |
*** jhesketh has quit IRC | 05:34 | |
*** jhesketh has joined #zuul | 05:42 | |
openstackgerrit | Merged zuul/zuul master: web: test trailing slash are removed from renderTree https://review.opendev.org/676824 | 06:14 |
*** sanjayu_ has joined #zuul | 06:16 | |
openstackgerrit | Benedikt Löffler proposed zuul/zuul master: Report retried builds via sql reporter. https://review.opendev.org/633501 | 06:18 |
*** sanjayu_ has quit IRC | 06:50 | |
openstackgerrit | Monty Taylor proposed zuul/zuul master: Add linter rule disallowing use of var https://review.opendev.org/673841 | 07:08 |
yoctozepto | hey folks, any thoughts on: https://review.opendev.org/678273 ? please review before it rots :-) | 07:10 |
*** themroc has joined #zuul | 07:34 | |
*** jpena|off is now known as jpena | 07:40 | |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Zuul Web: add /api/user/authorizations endpoint https://review.opendev.org/641099 | 07:56 |
*** mhu has joined #zuul | 08:01 | |
*** jangutter has joined #zuul | 08:03 | |
*** yolanda__ is now known as yolanda | 08:30 | |
*** sanjayu_ has joined #zuul | 09:09 | |
*** sshnaidm|afk is now known as sshnaidm | 10:02 | |
tobiash | corvus: do you want to have a look at 678895 (the ref fix)? Or shall we +a it? It has +2 from me and tristanC. | 10:44 |
*** hashar has joined #zuul | 11:22 | |
corvus | tobiash: +3 thx | 11:29 |
*** hashar has quit IRC | 11:33 | |
*** hashar has joined #zuul | 11:33 | |
*** jpena is now known as jpena|lunch | 11:35 | |
*** rlandy has joined #zuul | 11:52 | |
*** rlandy is now known as rlandy|ruck | 11:53 | |
*** rfolco has joined #zuul | 12:15 | |
*** rlandy|ruck is now known as rlandy|ruck|mtg | 12:19 | |
openstackgerrit | Merged zuul/zuul master: Check refs and revs for repo needing updates https://review.opendev.org/678895 | 12:21 |
*** jamesmcarthur has joined #zuul | 12:22 | |
*** jamesmcarthur has quit IRC | 12:29 | |
*** jpena|lunch is now known as jpena | 12:32 | |
openstackgerrit | Merged zuul/zuul master: Add linter rule disallowing use of var https://review.opendev.org/673841 | 12:35 |
*** jamesmcarthur has joined #zuul | 12:48 | |
*** bjackman has quit IRC | 12:49 | |
*** bjackman has joined #zuul | 12:52 | |
*** dkehn has quit IRC | 12:53 | |
*** nhicher has joined #zuul | 13:21 | |
mhu | Shrews, no prob, happy to be of help | 13:30 |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: WIP: Add autohold delete/info commands to web API https://review.opendev.org/679057 | 13:32 |
Shrews | mhu: am I on the right track with that? ^^^^ | 13:32 |
mhu | Shrews, I'm in meetings right now but I'll have a look ASAP | 13:32 |
Shrews | mhu: sure, no hurry. thx | 13:33 |
*** sanjayu_ has quit IRC | 13:34 | |
*** bjackman has quit IRC | 13:34 | |
*** sanjayu_ has joined #zuul | 13:34 | |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: WIP: Add autohold delete/info commands to web API https://review.opendev.org/679057 | 13:39 |
tobiash | are opendev's executors backed by ssds? | 13:51 |
*** bhavikdbavishi has quit IRC | 13:59 | |
*** bjackman has joined #zuul | 14:04 | |
*** dkehn_ has joined #zuul | 14:14 | |
*** sshnaidm has quit IRC | 14:17 | |
*** brennen is now known as brennen|afk | 14:18 | |
*** sshnaidm has joined #zuul | 14:19 | |
*** openstackgerrit has quit IRC | 14:22 | |
tristanC | Shrews: it seems like changes to the diskimage structs (e.g. adding a python-path) are not picked up by providers, which still register the previous python-path. It seems like we need to restart the launcher process | 14:23 |
fungi | tobiash: i don't believe so, looks like they're on whatever rackspace's default is for the rootfs and the ephemeral disk where we mount /var/lib/zuul | 14:23 |
fungi | presumably "spinning rust" (via sata) | 14:23 |
tristanC | Shrews: any idea how to make provider reload diskimage definition when they change? | 14:24 |
tobiash | fungi: I was wondering as your executors seem to perform much better than ours (which are ceph backed) | 14:24 |
fungi | tobiash: just a sec and i'll get more specifics | 14:24 |
tobiash | but we're currently in process of moving them to nvme disks | 14:24 |
tobiash | tristanC: it should reload automatically afaik | 14:25 |
tobiash | if not that seems like a bug | 14:26 |
fungi | tobiash: we've booted them from rackspace's "8 GB Performance" flavor in their dfw region, and this is not one of their special "ssd" flavors | 14:26 |
tobiash | fungi: ah thanks | 14:26 |
tristanC | tobiash: IIUC the openstack.config module keeps a copy of the diskimage but doesn't check for changes (the diskimages list is global vs the openstack provider's local list) | 14:26 |
tobiash | tristanC: hrm, we should probably fix this | 14:27 |
fungi | tobiash: additional details, they're using ubuntu bionic (18.04.2 LTS) with linux 4.15.0-46-generic and the filesystems are formatted ext4 | 14:27 |
tobiash | fungi: cool, thanks! | 14:27 |
fungi | please ask if you want any other details. i don't think anything besides the passwords is meant to be particularly secret ;) | 14:28 |
tristanC | Shrews: tobiash: oh my bad, the python-path change requires a new image upload | 14:28 |
*** jamesmcarthur has quit IRC | 14:29 | |
*** jamesmcarthur has joined #zuul | 14:29 | |
tobiash | fungi: thanks, I was interested mainly in io performance characteristics | 14:29 |
*** jeliu_ has joined #zuul | 14:29 | |
fungi | our cacti graphs should show some i/o metrics if you haven't looked yet | 14:30 |
tobiash | fungi: so your executors average about 4k iops | 14:32 |
fungi | neat, i hadn't looked but that sounds like a lot | 14:33 |
tobiash | sounds like ssd ;) | 14:33 |
tobiash | http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=64197&rra_id=0&view_type=tree&graph_start=1566916407&graph_end=1567002807 for reference | 14:34 |
*** amotoki_ has quit IRC | 14:34 | |
*** amotoki has joined #zuul | 14:35 | |
*** jamesmcarthur has quit IRC | 14:42 | |
*** openstackgerrit has joined #zuul | 14:47 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Switch to fetch-sphinx-tarball for tox-docs https://review.opendev.org/676430 | 14:47 |
*** jamesmcarthur has joined #zuul | 14:49 | |
fungi | tobiash: one thing we've observed is that the storage in a lot of public providers uses write-back caching instead of write-through, so we could simply be seeing numbers reflecting writes to memory there | 14:57 |
*** igordc has joined #zuul | 14:57 | |
*** bjackman has quit IRC | 15:00 | |
*** jamesmcarthur has quit IRC | 15:02 | |
tobiash | fungi: according to flavor description it seems to be (undefined) ssd: https://developer.rackspace.com/docs/cloud-servers/v2/general-api-info/flavors/ | 15:04 |
*** themroc has quit IRC | 15:04 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: Add phoronix-test-suite job https://review.opendev.org/679082 | 15:06 |
fungi | huh, neat. maybe it's just their default block storage which is on sata then, i know you have to request a special flavor of that to get ssd | 15:06 |
fungi | (or at least you used to, maybe they've upgraded all their storage?) | 15:07 |
clarkb | note /var/lib/zuul is the ephemeral device | 15:08 |
clarkb | which may be different hardware than the root disk | 15:08 |
fungi | yep | 15:08 |
fungi | i haven't seen where they indicate what hardware serves their ephemeral disks | 15:08 |
mhu | Shrews: I've commented on the review, you're close :) Next I'll help with the tests if you'd like | 15:11 |
Shrews | mhu: trying to figure out the tests now :) | 15:11 |
mhu | Shrews, I think the existing ones for autoholding from the REST API should be a good starting point | 15:12 |
mhu | although they require auth | 15:12 |
*** jamesmcarthur has joined #zuul | 15:13 | |
Shrews | mhu: what portion is the "boilerplate authentication/authorization code" ? | 15:14 |
Shrews | mhu: the part beginning with: rawToken = cherrypy.request.headers['Authorization'][len('Bearer '):] ? | 15:15 |
mhu | Shrews, from "basic_error ..." | 15:16 |
mhu | https://opendev.org/zuul/zuul/src/branch/master/zuul/web/__init__.py#L244 to https://opendev.org/zuul/zuul/src/branch/master/zuul/web/__init__.py#L262 for example | 15:17 |
*** jamesmcarthur has quit IRC | 15:18 | |
mhu | this would be better factored in a single method ... I thought of having this as a decorator but was advised against it | 15:18 |
*** tosky has joined #zuul | 15:21 | |
Shrews | mhu: that portion of code calls is_authorized() which requires a tenant parameter. How does that affect your suggestion to not use tenant in the api url? | 15:22 |
mhu | Shrews, IIUC autohold-info gets you the tenant info for the autohold request | 15:23 |
mhu | so I'd suggest calling autohold-info first, fetch the tenant, call is_authorized, then proceed | 15:24 |
Shrews | ok | 15:24 |
mhu | this way you can also catch errors when the request id is incorrect and return a 404 | 15:24 |
tosky | anyone up for re-reviewing https://review.opendev.org/#/c/674334/ ? | 15:29 |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: WIP: Add autohold delete/info commands to web API https://review.opendev.org/679057 | 15:31 |
*** chkumar|rover is now known as raukadah | 15:31 | |
Shrews | mhu: does that mean you'd suggest i use 404 instead of 500 within _autohold_info() if the rpc call fails? | 15:33 |
Shrews | i just copy-pasted that code from elsewhere | 15:33 |
mhu | Shrews: that depends on the type of error | 15:37 |
mhu | 4XX HTTP statuses are generally used for user-induced errors | 15:37 |
mhu | for example 404 (Not Found) is an adequate return code if the request ID is incorrect | 15:38 |
mhu | 401 means Unauthorized, ie the user needs more privileges in order to perform an action | 15:38 |
*** sanjayu_ has quit IRC | 15:38 | |
mhu | 500 is a catch-all code for server-side errors | 15:39 |
*** sanjayu_ has joined #zuul | 15:39 | |
mhu | so in your case 500 is correct, just not very informative | 15:39 |
mhu | but maybe the RPC itself won't give much info | 15:39 |
Shrews | oh, i think i need to check for an empty dict and THEN return 404 | 15:40 |
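The status-code logic mhu and Shrews settle on above can be sketched as follows. All names here (`HTTPError`, `get_autohold`, `is_authorized`) are illustrative stand-ins, not Zuul's actual web code: 404 when the autohold request id is unknown, 401 when the user lacks privileges, 500 left for genuine server-side failures.

```python
# Hypothetical sketch of the autohold-info/delete flow discussed above.
# Not Zuul's real API: get_autohold, delete_autohold, and is_authorized
# are stand-ins for the RPC calls and authz check in zuul-web.

class HTTPError(Exception):
    def __init__(self, status):
        self.status = status

def autohold_info(rpc, request_id):
    info = rpc.get_autohold(request_id)  # assumed to return {} if not found
    if not info:
        raise HTTPError(404)             # user error: bad request id
    return info

def autohold_delete(rpc, request_id, user, is_authorized):
    # As mhu suggests: fetch the autohold info first to learn the tenant,
    # authorize against that tenant, then proceed with the delete.
    info = autohold_info(rpc, request_id)
    if not is_authorized(user, info["tenant"]):
        raise HTTPError(401)
    rpc.delete_autohold(request_id)
```

This also naturally yields the 404-on-empty-dict behavior Shrews mentions, since a missing request surfaces before the authorization step.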
*** rlandy|ruck|mtg is now known as rlandy | 15:43 | |
mhu | Shrews, yep | 15:44 |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: WIP: Add autohold delete/info commands to web API https://review.opendev.org/679057 | 15:44 |
Shrews | that should do it ^^ | 15:44 |
*** rlandy is now known as rlandy|brb | 15:46 | |
*** panda is now known as panda|rover | 15:47 | |
*** sshnaidm is now known as sshnaidm|afk | 15:49 | |
*** noorul has joined #zuul | 16:02 | |
*** jpena is now known as jpena|off | 16:05 | |
noorul | How does the log collection work in Zuul? | 16:07 |
*** rlandy|brb is now known as rlandy | 16:07 | |
noorul | Where is it actually stored? | 16:07 |
clarkb | noorul: whereever you configure it is the short answer. There are roles to upload logs to openstack swift storage locations (what we currently use) as well as rsync onto filesystems which you can serve with a webserver (what we used previously) | 16:10 |
noorul | clar | 16:11 |
noorul | clarkb: Actually the script are triggered remotely using Ansible. So the logs are stored on the node where the Ansible runs. Is this the executor? | 16:12 |
clarkb | noorul: there is a bit of coordination between the executor and test nodes to make this work. Basically there are log collection roles that pull logs onto the executor from the test nodes, then roles which publish those logs from the executor to say swift or a fileserver | 16:15 |
clarkb | so there are two steps. Collect logs into publication source dir, run publication role to publish publication source dir | 16:16 |
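A rough sketch of the two-step flow clarkb describes, with hypothetical helper names. The real implementation is a set of Ansible roles in zuul-jobs (fetch/synchronize tasks plus an upload role), not Python; this just illustrates the shape: phase 1 collects logs from each node into a staging dir on the executor, phase 2 publishes that dir.

```python
import shutil
from pathlib import Path

# Hypothetical two-phase log handling, mirroring the flow above.

def collect_logs(node_log_dirs, staging_dir):
    # Phase 1: gather logs from each test node into a publication
    # source dir on the executor. In a real job this is an ansible
    # synchronize/fetch task against the remote node, not a local copy.
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    for node, log_dir in node_log_dirs.items():
        shutil.copytree(log_dir, staging / node)
    return staging

def publish_logs(staging_dir, upload):
    # Phase 2: publish the staged files. upload() stands in for the
    # swift-upload or rsync-to-fileserver role.
    for path in sorted(Path(staging_dir).rglob("*")):
        if path.is_file():
            upload(path)
```

In Zuul terms, both phases belong in the base job's post-run playbooks, which is what clarkb points noorul at below.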
noorul | Am I missing anything in the config http://paste.openstack.org/show/766638/ ? | 16:18 |
clarkb | for logging? no. The logging happens as part of your job config. Typically you will put it in your base job | 16:19 |
noorul | This is the base log http://paste.openstack.org/show/766639/ | 16:20 |
clarkb | for opendev this is the chain of things that gets you logs: https://opendev.org/opendev/base-jobs/src/branch/master/zuul.d/jobs.yaml#L55 https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/post.yaml#L3-L4 https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/post-logs.yaml | 16:21 |
clarkb | the first bit is where the base job includes the playbooks then the first playbook collects logs and the second publishes them | 16:21 |
clarkb | noorul: ya so your post-run playbook(s) should coordinate publishing of logs | 16:24 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: Add phoronix-test-suite job https://review.opendev.org/679082 | 16:27 |
*** hashar has quit IRC | 16:33 | |
mordred | tristanC: ^^ interesting. how are you thinking of using that? | 16:34 |
clarkb | mordred: I'm guessing that will be used to benchmark nodepool nodes | 16:36 |
mhu | clarkb, yep | 16:37 |
mordred | clarkb: ah - yeah. makes sense | 16:37 |
tristanC | mordred: to test nodepool label performance, here is how: https://softwarefactory-project.io/r/#/c/16145/ | 16:37 |
mhu | and more generally cloud providers | 16:37 |
mordred | for some reason I was reading it as a test of nodepool itself which didn't make any sense. that makes much more sense | 16:37 |
clarkb | opendev has used tempest for years for that since it isn't artificial and tends to map to our needs well | 16:37 |
clarkb | it is actually a really good test of cpu and disk and network | 16:38 |
tristanC | clarkb: well, we'd like to know what is causing a difference, e.g. check cpu, memory, io, network, ... | 16:38 |
clarkb | ya phoronix test sutie is likely best if you want to examine specific items rather than a holistic "is this node fast enough to run our jobs" | 16:40 |
mordred | yeah. seems like a good tool in the toolbox | 16:41 |
tristanC | mordred: yeah, we figured that would be a nice addition to zuul-jobs :) | 16:43 |
AJaeger_ | mordred: the two week waiting period for https://review.opendev.org/676430 is over - and I just pushed an update for it to fix the problems I noticed. In case you want to ask for reviews ;) | 16:44 |
*** jeliu_ has quit IRC | 16:45 | |
mordred | tristanC: left a question/comment on it | 16:46 |
*** jamesmcarthur has joined #zuul | 17:00 | |
noorul | If I make changes to the config-projects repo, will they get automatically loaded? | 17:01 |
*** jeliu_ has joined #zuul | 17:06 | |
clarkb | no changes to config projects have to be merged before they take effect | 17:06 |
clarkb | (this is for security reasons you don't want to expose secrets for example) | 17:06 |
noorul | I did not get that | 17:10 |
clarkb | changes to config projects must be merged before they change how zuul operates. This ensures that humans can review the changes prior to implementing them which helps to avoid security problems with privileged info | 17:11 |
noorul | clarkb: I directly pushed to master | 17:12 |
noorul | My main.yaml is here http://paste.openstack.org/show/766643/ | 17:16 |
noorul | I am not seeing all the roles under zuul-jobs under the tenant | 17:16 |
clarkb | if you've pushed directly to master then I would expect zuul to pick it up. However I don't know how the bitbucket driver will handle that case | 17:17 |
tristanC | noorul: iirc, zuul may miss a direct push and thus skip reloading the config | 17:17 |
noorul | tristanC: So a restart of scheduler might help? | 17:18 |
clarkb | tristanC: on Gerrit at least there should be an event for that case | 17:19 |
tristanC | noorul: if no ref-updated or change merged happened in the scheduler log, then restarting the service will force using the latest config | 17:19 |
clarkb | and I would expect zuul to pick it up | 17:19 |
clarkb | (but you'd have to push through gerrit not directly onto disk) | 17:19 |
noorul | clarkb: I am not using Gerrit but Bitbucket server instead | 17:20 |
clarkb | I know, I'm explaining that it should work in the gerrit case but may not in the other driver cases | 17:21 |
noorul | Oh I see | 17:21 |
*** panda|rover is now known as panda|rover|off | 17:23 | |
*** noorul has quit IRC | 17:43 | |
*** tosky_ has joined #zuul | 17:45 | |
*** tosky has quit IRC | 17:45 | |
*** tosky_ is now known as tosky | 17:45 | |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: Add autohold-info CLI command https://review.opendev.org/662487 | 18:00 |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: Record held node IDs with autohold request https://review.opendev.org/662498 | 18:00 |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: Auto-delete expired autohold requests https://review.opendev.org/663762 | 18:00 |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: Mark nodes as USED when deleting autohold https://review.opendev.org/664060 | 18:00 |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: WIP: Add autohold delete/info commands to web API https://review.opendev.org/679057 | 18:00 |
Shrews | yay bugs | 18:00 |
Shrews | mhu: ok, i have tests now. The only one that fails is test_autohold_delete() and that's because of authz failure. How do I do that correctly? | 18:01 |
Shrews | mhu: oh! the authz tests are in a different class. | 18:03 |
Shrews | yay, it works | 18:06 |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: Add autohold delete/info commands to web API https://review.opendev.org/679057 | 18:08 |
Shrews | corvus: mordred: that should tie up the loose ends of the autohold revamp stuff and should be gtg now ^^^ | 18:14 |
clarkb | zuulians I'm wondering if we should consider a release this week to fix that zuul-tests-the-wrong-commit bug for people consuming releases? | 18:19 |
clarkb | I'm deploying that fix on opendev now so we should know if it is working | 18:19 |
clarkb | or at least doesn't regress further | 18:19 |
clarkb | (the conditions under which it happens are somewhat specific) | 18:20 |
*** michael-beaver has joined #zuul | 18:23 | |
Shrews | that bug merges things into the wrong branch, yeah? If so, then yeah, a release sounds advisable | 18:23 |
*** igordc has quit IRC | 18:23 | |
clarkb | Shrews: it causes zuul to checkout the wrong branch in the jobs | 18:23 |
clarkb | so the jobs run against the wrong commit if they trip over the bug | 18:23 |
Shrews | ah | 18:24 |
clarkb | I think the correct commits are actually there too | 18:24 |
Shrews | still seems worthy | 18:24 |
clarkb | ya | 18:24 |
*** armstrongs has joined #zuul | 18:36 | |
*** jamesmcarthur has quit IRC | 18:43 | |
*** armstrongs has quit IRC | 18:45 | |
*** tosky has quit IRC | 18:56 | |
openstackgerrit | Clark Boylan proposed zuul/nodepool master: Use fedora-29 instead of fedora-28 https://review.opendev.org/679116 | 19:01 |
openstackgerrit | Clark Boylan proposed zuul/nodepool master: Use fedora-29 instead of fedora-28 https://review.opendev.org/679116 | 19:06 |
openstackgerrit | Clark Boylan proposed zuul/nodepool master: Use fedora-29 instead of fedora-28 https://review.opendev.org/679116 | 19:15 |
clarkb | tristanC: ^ can you review that change | 19:17 |
EmilienM | hi there, what is the "2. attempt" thing in zuul? | 19:18 |
EmilienM | (I probably missed the feature announcement) | 19:18 |
EmilienM | is it like an auto-recheck or? | 19:19 |
openstackgerrit | Ronelle Landy proposed zuul/zuul-jobs master: Only use RHEL8 deps repo on Red Hat systems newer than 7 https://review.opendev.org/679126 | 19:19 |
clarkb | EmilienM: there are two major cases for it: either the job fails in pre-run playbook so is restarted or zuul identifies the failure as something external to the job so retries it | 19:20 |
clarkb | EmilienM: in this case I've restarted all of the zuul executors which kills the jobs running on the executor that was stopped and reschedules them to another | 19:20 |
EmilienM | in case #2,w here is the list of known issues? | 19:20 |
clarkb | (this was to update the deployment of our executors) | 19:20 |
fungi | also not a new feature, but the fact that we're surfacing it in the builds dashboard is new | 19:21 |
clarkb | EmilienM: I don't think there is a list as much as "this exit code from ansible means it has a network failre" type of deal | 19:21 |
clarkb | + gearman worker went away | 19:21 |
EmilienM | ok | 19:21 |
EmilienM | thanks! | 19:21 |
fungi | EmilienM: when you see a build result of RETRY_LIMIT that means that zuul saw failures it thought meant it should abort and requeue the build, but tried that repeatedly and finally gave up | 19:22 |
EmilienM | it makes sense | 19:23 |
EmilienM | nicely done! | 19:23 |
clarkb | ya zuul has done this since the jenkins days | 19:25 |
clarkb | its just always been a bit transparent to people unless they hit retry limits | 19:25 |
fungi | because: clouds (and the internet) | 19:25 |
fungi | stuff has a tendency to just spontaneously go away and never come back | 19:26 |
fungi | building a castle on a foundation of sand | 19:26 |
clarkb | I want to say the jenkins behavior of losing its ssh connection and not trying to reconnect but instead simply failing is what precipitated the feature | 19:27 |
EmilienM | fungi: right i just didn't see it before in the UI | 19:27 |
mnaser | uh | 19:40 |
mnaser | no bucket sharding is happening for periodic jobs uploaded to swift btw | 19:40 |
mnaser | ex: https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/logs_periodic/opendev.org/openstack/operations-guide/master/propose-translation-update/ba10bde/ | 19:40 |
mnaser | this has contributed to doing really bad things in our swift :\ | 19:40 |
clarkb | because one container is larger than the others? | 19:47 |
timburke | mnaser, out of curiosity, roughly how many objects are in the container? | 19:49 |
clarkb | timburke: note mnasers cloud is ceph not swift | 19:49 |
mnaser | ^ | 19:49 |
clarkb | havent heard complaints from the swift clouds | 19:50 |
mnaser | clarkb: yes, and rados eventually needs to reshard buckets automatically | 19:50 |
timburke | 👍 | 19:50 |
timburke | still, just curious :-) | 19:50 |
mnaser | so it hits a limit then starts a reshard which takes forevers | 19:50 |
mnaser | let me check | 19:50 |
clarkb | timburke: ya me too | 19:50 |
mnaser | the stats is sitting around for a while.. | 19:52 |
clarkb | mnaser: the way the prefix sharding works is it takes the zuul log path eg: logs/periodic/opendev.org and logs/68/678968/3/check and replaces the first / with a _ and that component becomes the container name | 20:05 |
clarkb | so it is sharding the periodic logs, it is just sharding them into the same container | 20:05 |
mnaser | oh hm | 20:06 |
mnaser | right | 20:06 |
clarkb | we could change the zuul log path for periodic jobs to include their day of the month maybe? | 20:06 |
clarkb | eg logs/periodic_$DoM/opendev.org | 20:06 |
clarkb | then you'd get 31 periodic job shards | 20:06 |
clarkb | there may be other methods that would work better? | 20:07 |
clarkb | let me figure out how to push that up so we have a change we can poke at at least | 20:10 |
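The sharding scheme clarkb describes above (first two path components joined with `_` become the container name), plus the proposed day-of-month variant, might look roughly like this. These are hypothetical helpers; the actual logic lives in the zuul-jobs swift upload role.

```python
# Hypothetical sketch of the container-name scheme described above:
# logs/68/54368/3/...  -> container "logs_68", object "54368/3/..."
# logs/periodic/...    -> container "logs_periodic" (the hot spot)

def shard_container(log_path):
    first, second, rest = log_path.split("/", 2)
    return "%s_%s" % (first, second), rest

def periodic_log_path(day_of_month, rest):
    # clarkb's proposal: inject the day of the month so periodic
    # logs spread over 31 containers instead of one.
    return "logs/periodic_%02d/%s" % (day_of_month, rest)
```

The variant using seconds instead of day (discussed just below) works the same way, only with 60 shards and no human-meaningful prefix.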
timburke | might produce a bit of a hot container -- surely no worse than we've got now, but you might want to consider using seconds instead of day. plus you'd get about twice as many | 20:15 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: WIP: Add day of month to periodic logs for swift sharding https://review.opendev.org/679135 | 20:16 |
clarkb | timburke: thats a good point. I avoided hour and minute because we launch the periodic jobs at the same time | 20:17 |
clarkb | but seconds should give us enough variance there | 20:17 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: WIP: Add current date seconds to periodic logs for swift sharding https://review.opendev.org/679135 | 20:18 |
clarkb | timburke: mnaser ^ there | 20:18 |
clarkb | from the user standpoint my big concern is people not using swift for logs may rely on webserver indexes that sort by name to present periodic jobs. I'm not sure if we want to be able to assume the zuul dashboard is the primary consumption point for this stuff yet | 20:19 |
clarkb | (fwiw I think it should be the primary point as it adds a bunch of functionality but we may not quite be there yet) | 20:19 |
clarkb | in any case this may be good enough while people transition. I'll let others chime in | 20:19 |
timburke | yeah, that's a fair point. might be a point in favor of day, as it would have *some* sort of useful-ish meaning for a human | 20:22 |
clarkb | tristanC: ^ you may have thoughts on that, will it be bad for softwarefactory for example | 20:41 |
clarkb | pabelanger: ^ you too | 20:41 |
*** jamesmcarthur has joined #zuul | 20:44 | |
fungi | i suppose we could reorder the path so that the build id comes first, which should be fairly entropic | 20:47 |
clarkb | I think that becomes even harder for people to navigate without the dashboard though | 20:48 |
fungi | is there a good reason that's a bad idea? (i assume it is or someone would have already suggested it as an obvious option) | 20:48 |
clarkb | ya its the hitting logs.openstack.org/ type webserver root problem | 20:48 |
clarkb | if you are looking for periodic jobs finding them would be hard if it was just build uuids | 20:48 |
clarkb | (granted digging through 60 different dirs isn't easy either) | 20:49 |
fungi | yeah, i see that as no worse than the lets-inject-seconds plan | 20:49 |
fungi | and it would allow us to keep the paths no longer than they currently are | 20:49 |
clarkb | we'd probably go to full uuids so they would be longer | 20:50 |
clarkb | to avoid collisions | 20:51 |
clarkb | currently we avoid them by being at the end of the path | 20:51 |
clarkb | so 7 chars is enough | 20:51 |
fungi | why would collisions become a problem if they're not already with the parameters reordered? | 20:51 |
clarkb | because right now they are uniquely identified by branch project name and job name and pipeline | 20:52 |
fungi | i don't propose we change that | 20:52 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: Add phoronix-test-suite job https://review.opendev.org/679082 | 20:53 |
fungi | but you could just as uniquely identify them by build id, branch, project name, job name, and pipeline | 20:53 |
fungi | as by branch, project name, job name, pipeline, and build id | 20:53 |
clarkb | that gets weird to me if you go to the first dir and find multiple entries for the same id | 20:53 |
fungi | are people going to the first dir now? | 20:53 |
clarkb | yes that is what makes periodic special | 20:54 |
clarkb | the way you navigate periodic jobs on the old style log server is to go to the root, sort by date, and find wat you want | 20:54 |
clarkb | this is why we reverted the swift logs stuff forever ago the first time | 20:54 |
fungi | i thought the upshot of object storage was that we started requiring folks to rely on the zuul dashboard as an index to the logs | 20:54 |
clarkb | because we neglected the periodic jobs use case | 20:54 |
clarkb | fungi: yes that is the question I had above, can we currently expect people to use the dashboard | 20:55 |
fungi | because we no longer guarantee that the log urls are predictable | 20:55 |
clarkb | fungi: because this is a change to zuul-jobs it affects more than opendev | 20:55 |
fungi | oh, that ordering is hard-coded, not parameterized? hrm... | 20:55 |
clarkb | for opendev I have no problem changing those urls to arbitrary random strings | 20:55 |
clarkb | beause the dashboard is the primary consumption point | 20:55 |
fungi | seems like it could be solved by templating the parameters and their order | 20:56 |
fungi | fwiw, i think injecting seconds early in the path creates basically the same problem | 20:56 |
clarkb | ya day of month sort of avoids it if you know the scheme | 20:57 |
clarkb | which is ps1 | 20:57 |
fungi | day of month still creates load clustering, i expect. we end up slamming the same shard with writes over the course of a day | 20:58 |
clarkb | fungi: ya that was timburke's point | 20:58 |
clarkb | however if the problem is total count of objects this may help | 20:58 |
clarkb | mnaser: ^ would probably need to weigh in on that aspect of it | 20:58 |
fungi | right, i'd defer to someone operating an impacted storage environment on that | 20:59 |
*** jamesmcarthur has quit IRC | 20:59 | |
clarkb | other options include using a fork of that role in opendev/base-jobs or similar for a bit and change it to whatever | 21:02 |
clarkb | since we have the dashboard it is safe ofr us | 21:02 |
clarkb | or just accept that periodic logs via old webserver isn't going to be nice | 21:02 |
clarkb | and change it to whatever in zuul-jobs | 21:02 |
fungi | i do think if we made the path component ordering configurable, it would allow opendev to do something like build-id first and get better path-indicated sharding on platforms which do that sort of thing | 21:03 |
fungi | even just 7 hex digits allows for a fairly insane number of shards | 21:04 |
fungi | 268 million | 21:04 |
clarkb | which might be a problem itself | 21:05 |
clarkb | we probably only want to do 2 or 3 digits that way | 21:05 |
clarkb | to avoid creating too many containers | 21:05 |
fungi | <sagan>billions and billions...</sagan> | 21:05 |
fungi | oh, that's container names? | 21:05 |
clarkb | yes | 21:06 |
clarkb | the way it creates the container name is to take the first two components of the path and to combine them | 21:06 |
fungi | in that case we could just prefix with 2 hex digits truncated from the build-id i guess | 21:06 |
clarkb | logs/68/54368/3 -> container logs_68 with object 54368/3/... | 21:06 |
clarkb | the problem with periodic jobs is that becomes logs_periodic for all periodic jobs | 21:07 |
fungi | for some reason i thought it as that ceph/radosgw was using the first two path components to decide on the sharding | 21:07 |
clarkb | all other jobs either have a change number or ref prefix | 21:07 |
clarkb | fungi: no we are deciding that in our swift upload role | 21:07 |
fungi | yeah, if we're creating containers based on those, i agree even distribution over 268 million possibilities ultimately probably means that over the course of the month we have roughly as many containers as builds | 21:08 |
clarkb | and then ceph is sharding within the container, AIUI | 21:08 |
clarkb | and that becomes a problem when a single container has too many objects? | 21:08 |
clarkb | that was my understanding of what mnaser said | 21:09 |
clarkb | so if we divide the object count by 31 or 60 maybe we reduce the object count sufficiently to not be a problem | 21:09 |
fungi | if we use a 2-hex-digit truncation then we top out at 256 possibilities which is a more reasonable container count probably | 21:10 |
clarkb | yup, but would be shared across all builds not just periodic builds (that should more evenly distribute the objects which is a good thing) | 21:10 |
clarkb | mnaser: ^ if you get a chance your input on what would help your ceph install would be valuable for figuring out the next step here | 21:19 |
fungi | what i like about the truncated build uuid in the container name is that we get a clear upper bound on container count in each provider that way | 21:20 |
fungi | though the quantization jump in either direction is to go to 16 or 4096 which might be extreme | 21:21 |
clarkb | my guess is 4096 is probably fine but 16 too small | 21:22 |
clarkb | swift should be able to handle thousands of containers | 21:22 |
fungi | so of those three options (16, 256, 4096) the middle one seems the most reasonable | 21:22 |
fungi | and yeah, 4096 may be as well | 21:22 |
fungi | 65k containers, the next jump past 4k, is probably not | 21:23 |
clarkb | ya | 21:23 |
fungi | granted, we *could* reencode and then truncate the build-id in whatever base we want, so sticking to powers of 16 is not absolutely necessary either | 21:24 |
fungi | but as much as i might like to spend the rest of my afternoon on modular arithmetic, i probably need to get to mowing the lawn at some point | 21:25 |
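The two bounding strategies discussed above (hex-digit truncation vs. re-encoding the build id modulo an arbitrary shard count) might look like this, assuming a dash-free hex build UUID (helper names are made up for illustration):

```python
def shard_prefix(build_id, hex_digits=2):
    """Truncate the build UUID to a fixed number of hex digits,
    capping the container count at 16**hex_digits (256 for 2)."""
    return build_id.replace("-", "")[:hex_digits]

def shard_mod(build_id, n_shards=1000):
    """Treat the UUID as an integer and take it mod n_shards, so the
    shard count need not be a power of 16."""
    return int(build_id.replace("-", ""), 16) % n_shards

bid = "c18994fac2ab4b788eeb3fa54e0d8b73"
print(shard_prefix(bid))      # 'c1' -> one of at most 256 containers
print(shard_mod(bid, 4096))   # one of exactly 4096 shards
```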
tristanC | clarkb: we do have users relying on $logserver/periodic/ to collect periodic logs, but we can break that by documenting zuul builds interface | 21:37 |
tristanC | clarkb: or perhaps this new behavior can be toggled by a set-zuul-log-path-fact attribute? | 21:37 |
clarkb | tristanC: ya that is what I'm working on now to make it a toggle | 21:38 |
tristanC | using a unified build-id based path sounds like a good idea, and we would likely enable this by default if it's optional | 21:39 |
tristanC | our prune log script does take into account the different log path scheme, i'd be happy to make it more simple | 21:40 |
fungi | i do feel like that path is a bit of an api contract we shouldn't break without warning, but thankfully there are options to allow it to continue working as-is by default | 21:40 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Add option for object store friendly log paths https://review.opendev.org/679145 | 21:40 |
clarkb | I've only removed the previous sharding prefixes and kept the rest of the paths as is | 21:41 |
clarkb | there is some useful info in there about what job and change and stuff that helps people when sharing urls that I don't think should go away | 21:41 |
fungi | the main argument i see for sharding by date is that storage schemes which want to expire old logs can far more easily prune old paths that way | 21:41 |
fungi | if we were stuck doing opendev on attached storage, i would have advocated for something like logical volume per day mounted at those subtrees, and then we could just umount and lvremove them at expiration | 21:43 |
fungi | which would have been trivial compared to the days-long find -exec rm cronjobs we ran | 21:43 |
fungi | of course, logging to swift, we can just set expiration times at the object level and forget about it | 21:44 |
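The object-level expiration fungi mentions uses Swift's `X-Delete-After` (seconds from now) or `X-Delete-At` (unix timestamp) headers, passed at upload time; a minimal sketch of building those headers, assuming a 30-day retention policy:

```python
import time

RETENTION_DAYS = 30  # assumed retention policy

def expiry_headers(days=RETENTION_DAYS):
    """Relative form: swift deletes the object once this many seconds
    elapse, so no pruning cron job is needed."""
    return {"X-Delete-After": str(days * 24 * 3600)}

def expiry_at_headers(days=RETENTION_DAYS):
    """Equivalent absolute form using a unix timestamp."""
    return {"X-Delete-At": str(int(time.time()) + days * 24 * 3600)}

print(expiry_headers())  # {'X-Delete-After': '2592000'}
```

A real upload would pass these as `headers` to e.g. `swiftclient`'s `put_object`.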
pabelanger | clarkb: I think we'd be okay with the change, most humans use builds UI to fetch periodic jobs in swift | 21:46 |
pabelanger | in fact, we'd love to iterate on http://lists.zuul-ci.org/pipermail/zuul-discuss/2019-June/000961.html for logs | 21:47 |
clarkb | pabelanger: filtering does exist for start and end iirc | 21:48 |
clarkb | but it may not be exposed with a filter option in the list | 21:48 |
pabelanger | yah, can't remember off the top of my head the issue there. But we'd want to be able to create weekly reports, and filter specific periodic jobs in that range | 21:50 |
clarkb | yup I think that is doable today you just have to know what the parameter names are /me double checks | 21:51 |
clarkb | which admittedly should be made easier | 21:51 |
*** jeliu_ has quit IRC | 21:52 | |
clarkb | ah nope its offset and limit that I'm thinking of so its the pagination problem | 21:52 |
clarkb | no time bounding currently | 21:53 |
clarkb | hrm manually setting skip and limit doesn't seem to work | 21:59 |
clarkb | did react change that? | 21:59 |
tristanC | clarkb: the webui doesn't know about skip or limit filters, only the json endpoints interpret those | 22:03 |
clarkb | I see | 22:03 |
clarkb | seems like before you could just manipulate the builds url and it worked | 22:03 |
clarkb | but I guess that only works now if tlaking to the api directly? | 22:03 |
tristanC | perhaps the old angular code did forward the query args | 22:04 |
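Hitting the JSON endpoint directly, as tristanC suggests, amounts to constructing a query string against the builds API; a sketch with hypothetical base URL and tenant names (Zuul's builds endpoint accepts parameters such as `job_name`, `limit`, and `skip`):

```python
from urllib.parse import urlencode

def builds_url(base, tenant, **params):
    """Build a query against the JSON builds endpoint; the web UI
    does not forward these filters, but the API interprets them."""
    qs = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}/api/tenant/{tenant}/builds?{qs}"

print(builds_url("https://zuul.example.org", "mytenant",
                 job_name="periodic-job", limit=50, skip=0))
```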
openstackgerrit | Merged zuul/nodepool master: Use fedora-29 instead of fedora-28 https://review.opendev.org/679116 | 22:06 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Add option for object store friendly log paths https://review.opendev.org/679145 | 22:13 |
clarkb | ianw: ^ not much shorter than your suggestion but is namespaced now | 22:13 |
*** armstrongs has joined #zuul | 22:44 | |
*** armstrongs has quit IRC | 22:48 | |
ianw | clarkb: did you want to go with it and run some test jobs? not sure how urgent it is | 23:20 |
clarkb | I think it isn't super urgent because we pulled vexxhost out already | 23:21 |
clarkb | next step may be to have mnaser confirm it should help then work on testing | 23:22 |
clarkb | no rush | 23:22 |
*** rfolco has quit IRC | 23:29 | |
*** rlandy has quit IRC | 23:43 |