*** tosky has quit IRC | 00:26 | |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: reconcile config objects https://review.opendev.org/c/zuul/nodepool/+/779898 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: Handle IPv6 https://review.opendev.org/c/zuul/nodepool/+/780400 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: don't require full subnet id https://review.opendev.org/c/zuul/nodepool/+/780402 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: add quota support https://review.opendev.org/c/zuul/nodepool/+/780439 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: implement launch retries https://review.opendev.org/c/zuul/nodepool/+/780682 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: implement support for diskimages https://review.opendev.org/c/zuul/nodepool/+/781187 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: handle leaked image upload resources https://review.opendev.org/c/zuul/nodepool/+/781855 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: use rate limiting https://review.opendev.org/c/zuul/nodepool/+/781856 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: fix race in leaked resources https://review.opendev.org/c/zuul/nodepool/+/781924 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: replace driver with state machine driver https://review.opendev.org/c/zuul/nodepool/+/781925 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: update documentation https://review.opendev.org/c/zuul/nodepool/+/781926 | 01:54 |
openstackgerrit | Merged zuul/zuul-jobs master: Revert "Pin to npm4 until npm 5.6.0 comes out" https://review.opendev.org/c/zuul/zuul-jobs/+/781966 | 02:13 |
openstackgerrit | Merged zuul/zuul-jobs master: Clone nvm from official source https://review.opendev.org/c/zuul/zuul-jobs/+/781970 | 02:17 |
fungi | ianw: mordred: i just exercised those two ^ successfully with https://zuul.opendev.org/t/openstack/build/042af4adb20c472583c278cd365a12ab so looks like it's finally working. thanks!!! | 02:54 |
ianw | fungi: oh good, 4 months is a long time in the javascript world, let alone 4 years :) | 02:55 |
fungi | yup! | 02:57 |
*** evrardjp has quit IRC | 03:33 | |
*** evrardjp has joined #zuul | 03:33 | |
*** raukadah is now known as chkumar|ruck | 05:02 | |
*** vishalmanchanda has joined #zuul | 05:11 | |
*** ykarel has joined #zuul | 05:28 | |
*** ykarel has quit IRC | 05:29 | |
*** ykarel has joined #zuul | 05:29 | |
*** jfoufas1 has joined #zuul | 05:36 | |
*** ajitha has joined #zuul | 05:45 | |
*** hashar has joined #zuul | 07:54 | |
*** jcapitao has joined #zuul | 08:01 | |
*** rpittau|afk is now known as rpittau | 08:19 | |
*** wxy has joined #zuul | 08:47 | |
*** tosky has joined #zuul | 08:49 | |
*** jpena|off is now known as jpena | 08:57 | |
*** phildawson has joined #zuul | 08:58 | |
*** holser has joined #zuul | 09:02 | |
*** ykarel is now known as ykael|lunch | 09:21 | |
*** saneax has joined #zuul | 09:25 | |
*** nils has joined #zuul | 09:28 | |
*** harrymichal has joined #zuul | 09:48 | |
*** ykael|lunch is now known as ykael | 10:03 | |
*** ykael is now known as ykarel | 11:25 | |
*** rlandy has joined #zuul | 11:41 | |
*** holser has quit IRC | 11:54 | |
*** holser has joined #zuul | 11:57 | |
*** ykarel_ has joined #zuul | 11:58 | |
*** ykarel has quit IRC | 12:00 | |
*** phildawson has quit IRC | 12:05 | |
*** ykarel_ is now known as ykarel | 12:09 | |
*** jcapitao is now known as jcapitao_lunch | 12:13 | |
*** hashar is now known as hasharLunch | 12:17 | |
*** ykarel_ has joined #zuul | 12:50 | |
*** jpena is now known as jpena|lunch | 12:50 | |
*** ykarel has quit IRC | 12:50 | |
*** hasharLunch is now known as hashar | 12:55 | |
*** jcapitao_lunch is now known as jcapitao | 13:07 | |
*** ykarel_ has quit IRC | 13:37 | |
*** ykarel_ has joined #zuul | 13:37 | |
*** jpena|lunch is now known as jpen | 13:51 | |
*** jpen is now known as jpena | 13:51 | |
*** jhesketh has quit IRC | 14:02 | |
*** jhesketh has joined #zuul | 14:04 | |
openstackgerrit | Jeremy Stanley proposed zuul/zuul-jobs master: DNM: test role job triggering https://review.opendev.org/c/zuul/zuul-jobs/+/782263 | 14:13 |
*** kmalloc has joined #zuul | 14:18 | |
mordred | fungi: woot | 14:22 |
corvus | i'm going to work on restarting opendev's zuul | 14:26 |
corvus | and once that's done, i'll send out the release announcements | 14:30 |
openstackgerrit | Jeremy Stanley proposed zuul/zuul-jobs master: Revert "Temporarily stop running Gentoo base role tests" https://review.opendev.org/c/zuul/zuul-jobs/+/771106 | 14:34 |
*** wxy has quit IRC | 14:36 | |
fungi | thanks corvus! | 14:38 |
*** jangutter_ has joined #zuul | 14:39 | |
*** jangutter has quit IRC | 14:42 | |
*** ykarel_ is now known as ykarel | 14:49 | |
corvus | tristanC: looking at https://grafana.opendev.org/d/5Imot6EMk/zuul-status?orgId=1 we're graphing the new event processing time; occasionally zuul stalls for 5 minutes, presumably it's doing a bunch of reconfiguration/re-enqueue work then. it may be tricky to separate out "normal" event queue performance from those spikes. but we could probably try to see if the baseline increases. | 14:51 |
*** harrymichal has quit IRC | 14:52 | |
avass | corvus: some of those peaks could also be python gc right? | 14:53 |
avass | not sure how long that can stall the system tbh | 14:53 |
corvus | avass: i think probably not -- most of python's memory mgmt is ref-counter based so it happens more gradually and synchronously (the gc runs rarely and only cleans up cycles). but even so, that would still be fractions of a second, not 5 minutes. | 14:56 |
corvus | 5 minutes is definitely "reconfigure the openstack tenant and re-enqueue everything" :) | 14:56 |
corvus | like, full reconfiguration | 14:56 |
corvus | that's about how long a startup takes | 14:56 |
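corvus's point above (refcounting frees objects the moment the last reference goes away; the rarely-run cyclic collector only reclaims reference cycles, and neither explains a 5-minute stall) can be demonstrated in a short standalone CPython sketch; `Node` is just an illustrative class:

```python
import gc
import weakref

# Deterministic demo: disable the cyclic collector so only an
# explicit gc.collect() call reclaims cycles.
gc.disable()

class Node:
    pass

# Refcount-based cleanup: the object is freed the instant its last
# reference disappears, with no collector pass involved.
obj = Node()
ref = weakref.ref(obj)
del obj
assert ref() is None          # already gone

# A reference cycle outlives its last external reference; only the
# cyclic collector can reclaim it.
a, b = Node(), Node()
a.other, b.other = b, a
ref_a = weakref.ref(a)
del a, b
assert ref_a() is not None    # the cycle keeps it alive
gc.collect()                  # explicit cycle collection
assert ref_a() is None        # now reclaimed

gc.enable()
```

On CPython the first object dies synchronously with its last reference, while the cycle survives until the collector runs; either way the pauses involved are far below five minutes.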
tristanC | corvus: i see, perhaps we should differentiate management from regular event | 14:56 |
avass | ah make sense | 14:56 |
corvus | tristanC: well, i think our trigger events will just be sitting in the queue during that time | 14:57 |
corvus | tristanC: so i think it's more like filtering out the data during times we know that the system is frozen | 14:57 |
corvus | anyway, i don't know if we need to automate that, or if we can just look at the graph and see the patterns | 14:58 |
corvus | just something to keep in mind | 14:58 |
fungi | i suppose it also depends on what you expect from that graph. regardless of what's causing the delays in event processing, it's an accurate depiction of how long events waited to be processed | 14:58 |
corvus | fungi: yeah, though i know one hope is to see if moving the queues to zk affected processing time | 14:59 |
corvus | but i think that was an implicit "processing time during the times where zuul was not frozen due to reconfigurations" :) | 14:59 |
fungi | we could probably still compare averages, main concern is probably the limited pre-upgrade sample size | 15:00 |
corvus | fungi: yep | 15:00 |
corvus | also, we could probably discard times > 60 seconds | 15:00 |
avass | well those peaks could also shoot up | 15:01 |
corvus | as long as the baseline isn't near that cutoff, we could probably filter like that | 15:01 |
fungi | that might help, though you'd still end up skewing results with the events which arrived in the last minute of the pause | 15:01 |
corvus | yeah. for now, i think i'll rely on wetware visual pattern matching :) | 15:02 |
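the >60s cutoff corvus suggests can be sketched with invented numbers (none of these are real opendev measurements; the 300s samples stand in for events stuck behind a ~5 minute reconfiguration):

```python
import statistics

# Hypothetical event-processing wait times in seconds.
samples = [0.4, 0.6, 0.5, 300.0, 0.7, 0.5, 300.0, 0.6]

CUTOFF = 60.0  # discard anything that waited out a reconfiguration
baseline = [s for s in samples if s < CUTOFF]

print(f"raw mean:      {statistics.mean(samples):.2f}s")   # skewed by stalls
print(f"baseline mean: {statistics.mean(baseline):.2f}s")  # steady-state
```

As fungi notes, events arriving in the final minute of a pause would still slip under the cutoff and skew the baseline slightly, so this is a rough filter, not a clean separation.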
avass | oh you don't want to model it like a control system and come up with a non-linear equation describing how the system behaves before and after? :) | 15:03 |
*** harrymichal has joined #zuul | 15:03 | |
corvus | we should get an intern | 15:03 |
fungi | an intern who can also stand up two copies of infrastructure for side-by-side benchmarking | 15:04 |
fungi | i'm sure it would make an excellent thesis | 15:04 |
corvus | ++ | 15:05 |
corvus | i don't see any anomalies so far; i'm going to grab some breakfast and check back in a bit | 15:07 |
tristanC | corvus: not sure how to translate that in grafyaml, but in graphite there is a `std` version which might be useful to smooth the spikes, e.g.: https://graphite.opendev.org/?width=1272&height=837&lineMode=connected&yUnitSystem=msec&target=stats.timers.zuul.tenant.openstack.event_enqueue_processing_time.mean&target=stats.timers.zuul.tenant.openstack.event_enqueue_processing_time.std | 15:19 |
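for context, those two graphite targets are the per-flush mean and standard deviation that statsd computes for the timer; a toy illustration (invented numbers) of why the std series is a handy stall indicator alongside the mean:

```python
import statistics

# Two hypothetical statsd flush windows of event timings in ms.
quiet = [420, 460, 440, 430]
spiky = [420, 460, 440, 300_000]  # one ~5 minute stall in the window

for name, window in (("quiet", quiet), ("spiky", spiky)):
    mean = statistics.mean(window)
    std = statistics.pstdev(window)  # population std, as statsd reports
    print(f"{name}: mean={mean:.0f}ms std={std:.0f}ms")
```

A window containing a stall blows up the std by orders of magnitude while quiet windows keep it near zero, so plotting both series makes the spikes easy to tell apart from baseline drift.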
corvus | ooh nice | 15:19 |
*** harrymichal has quit IRC | 15:24 | |
*** harrymichal has joined #zuul | 15:24 | |
*** ykarel has quit IRC | 15:26 | |
*** jfoufas1 has quit IRC | 15:53 | |
*** vishalmanchanda has quit IRC | 16:11 | |
*** kmalloc has quit IRC | 16:37 | |
*** sean-k-mooney has joined #zuul | 16:53 | |
sean-k-mooney | o/ | 16:53 |
sean-k-mooney | am i correct in saying that https://zuul-ci.org/docs/zuul/reference/developer/specs/zuul-runner.html was never implemented | 16:54 |
fungi | sean-k-mooney: ot | 16:54 |
corvus | sean-k-mooney: it's in progress | 16:54 |
fungi | it's... yeah | 16:54 |
sean-k-mooney | context is we were talking about downstream ci | 16:54 |
* fungi curses the adjacency of backspace and return on this keyboard | 16:54 | |
sean-k-mooney | and one of the things we are looking at is using zuul more | 16:54 |
sean-k-mooney | but one question that was asked was: what is the simplest way to write/debug a zuul job locally | 16:55 |
fungi | yeah, zuul-runner should help | 16:56 |
sean-k-mooney | i have always done that by either deploying zuul locally or just submitting a patch to the upstream gate with all other jobs disabled | 16:56 |
sean-k-mooney | fungi: does the design still require a zuul server? | 16:56 |
sean-k-mooney | i know it's fairly trivial with the docker compose file to get one | 16:56 |
fungi | sean-k-mooney: the current design requires you to ask the zuul server to resolve the collection of interdependent ansible playbooks/roles, yes | 16:57 |
sean-k-mooney | but i was wondering, could you run the scheduler in the cli tool and pass in the main.yaml etc | 16:57 |
corvus | it sort of depends on what you're writing/debugging -- if you want to write/test/debug the job's playbook, then you can run that (or whatever shell scripts it runs). if you're debugging the rendered zuul job which is composed of a complex hierarchy of playbooks from inhertited jobs, then, yeah, something like zuul-runner is necessary | 16:58 |
corvus | if you design jobs/roles/playbooks with reusability in mind, then it's pretty simple to run the main playbook of a job locally. if you have a deep system of inheritance (like devstack), then it can be hard to figure out what playbooks and roles to run | 16:59 |
sean-k-mooney | corvus: well part of the challenge with running the job playbook is all the infrastructure for generating the set of playbooks | 16:59 |
sean-k-mooney | corvus: right and ooo jobs make that much worse | 17:00 |
corvus | sean-k-mooney: so the context is downstream inheriting from tripleo jobs? | 17:00 |
sean-k-mooney | corvus: basically downstream for nova we have zuul running the tox jobs and jenkins running ooo via infrared | 17:00 |
sean-k-mooney | corvus: yes | 17:00 |
sean-k-mooney | we would like to run the upstream ooo job downstream with internal repos instead of rdo | 17:01 |
sean-k-mooney | but have a way to debug it | 17:01 |
fungi | out of curiosity how do you locally reproduce the jenkins jobs to debug them? or are you doing like we used to do and just avoiding relying on any fancy jenkins plugins so it's merely calling a shell script? | 17:01 |
sean-k-mooney | corvus: e.g. have a way to run the downstream job on our laptops | 17:01 |
sean-k-mooney | fungi: i don't, i try to do all backports upstream :) or else trigger it manually and have it hold a ci host | 17:02 |
corvus | sean-k-mooney: gotcha, yeah, that's definitely more in zuul-runner's wheelhouse. even the first patch in the zuul-runner series, which i think may be close to approval, would probably help with that, as it at least enables querying zuul for the list of playbooks to run. | 17:02 |
sean-k-mooney | corvus: would it be possible to use it to render a local directory with the playbooks etc and then run them manually | 17:03 |
sean-k-mooney | maybe i should read the patch | 17:03 |
corvus | sean-k-mooney: topic:freeze_job | 17:03 |
sean-k-mooney | thanks :) | 17:03 |
corvus | sean-k-mooney: it's a long series; starts with server-side api changes, then later implements cli tool to use them | 17:04 |
corvus | tristanC: ^ | 17:04 |
sean-k-mooney | we are in a somewhat strange situation downstream where all the upstream engineers know to some degree how zuul and particularly the in-project jobs work (based on devstack), whereas our qe understand the jobs they wrote in jenkins that are ooo/infrared based | 17:06 |
corvus | well, zuulv3 was certainly designed for downstream re-usability. being able to reuse some of that directly would be great. | 17:07 |
sean-k-mooney | yep we would like to reuse the upstream ooo jobs directly in our downstream patches ci which is basically like upstream openstack ci, e.g. based on patches to our downstream gerrit | 17:08 |
sean-k-mooney | which is basically just backports 99% of the time | 17:08 |
corvus | sean-k-mooney: also, you mentioned it in passing, but i sort of wouldn't underestimate the power of "throw up a patch and let zuul do its thing". that can be really nice in a downstream system where developers can hold their own nodes. | 17:08 |
fungi | though it seems like zuul was really going for reusability in the other direction (write playbooks you can run locally/in production, use those in your job definitions). if you design in the other direction (write playbooks for ci, then try to run them locally) that's a tougher nut to crack | 17:08 |
corvus | mhu has patches in flight to add buttons to the web ui for things like holding nodes | 17:09 |
sean-k-mooney | fungi: the issue is that there really is no documentation of how to run them locally | 17:09 |
corvus | well, i mean, they're playbooks, you just run them :) | 17:09 |
fungi | sean-k-mooney: right, well, that's part of writing them for local running first. you can add readmes/documentation | 17:09 |
tristanC | tobiash: would you mind checking the last comments of https://review.opendev.org/c/zuul/zuul/+/607078 | 17:09 |
sean-k-mooney | corvus: the issue is the onion-like nesting and figuring out how to generate the inventory files | 17:10 |
sean-k-mooney | corvus: just running them is not so trivial | 17:10 |
fungi | yeah, that's what zuul-runner should help with. you can look at the inventory for a reported build and then create something similar | 17:10 |
sean-k-mooney | fungi: i think that works for roles not for jobs | 17:11 |
corvus | sean-k-mooney: yeah, maybe i misunderstood -- i thought you were saying "zuul has no docs on how to run playbooks locally" and i don't think we can write that. but maybe you were saying "tripleo has no docs on how to run their playbooks locally" | 17:11 |
fungi | sean-k-mooney: it works for plays/playbooks as well as roles | 17:11 |
sean-k-mooney | fungi: not really | 17:11 |
sean-k-mooney | i know where you're coming from | 17:11 |
*** rpittau is now known as rpittau|afk | 17:11 | |
fungi | you can write playbooks which work locally, then add those to your job definition | 17:11 |
sean-k-mooney | but as a user of zuul my entry point is generally a job | 17:11 |
tristanC | sean-k-mooney: the zuul-runner spec is implemented through https://review.opendev.org/c/632064, the changes are mostly waiting for reviews | 17:11 |
sean-k-mooney | but to recreate the job i need to run all the pre and post playbooks in order around the run playbook that is listed in the job or its parents | 17:12 |
sean-k-mooney | + flatten all role vars into the appropriate input | 17:12 |
sean-k-mooney | so i can't really just run tempest-full | 17:13 |
sean-k-mooney | https://github.com/openstack/tempest/blob/master/zuul.d/integrated-gate.yaml#L32-L60 for example, it defines no pre, run, or post playbooks for me to run; they all come from the parent | 17:14 |
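the nesting sean-k-mooney describes can be modeled roughly like this (a toy sketch, not zuul's real freeze logic or data structures, and the filenames are hypothetical): pre-run playbooks nest parent-first, post-run child-first, and the most-derived job's run playbook wins.

```python
def freeze_playbooks(chain):
    """chain is ordered base job -> most-derived job."""
    # pre-run playbooks accumulate parent-first...
    pre = [p for job in chain for p in job.get("pre_run", [])]
    # ...post-run playbooks unwind child-first...
    post = [p for job in reversed(chain) for p in job.get("post_run", [])]
    # ...and the most-derived job that defines a run playbook wins.
    run = next((job["run"] for job in reversed(chain) if "run" in job), None)
    return pre + ([run] if run else []) + post

base = {"pre_run": ["base-pre.yaml"], "post_run": ["base-post.yaml"]}
middle = {"pre_run": ["devstack-pre.yaml"], "run": "devstack.yaml"}
leaf = {"run": "tempest.yaml", "post_run": ["collect-logs.yaml"]}

print(freeze_playbooks([base, middle, leaf]))
```

Even this toy version shows why a job like tempest-full can list no playbooks of its own yet still expand to a whole sequence, and why the real frozen result (variants, branches, roles, vars) has to come from asking zuul.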
tristanC | corvus: fungi: unfortunately it can be quite complex to "just run the playbooks" | 17:15 |
corvus | sean-k-mooney: yeah, that's basically a result of the way the tripleo jobs were designed. with deep nesting and playbooks at every level, you need a tool. it *is* possible to design the other direction, which is i think what fungi was saying. if you design for real-world use and then run that in ci, then you end up with a single "real-world" playbook nestled between some ci-specific pre and post run | 17:15 |
corvus | playbooks. | 17:15 |
corvus | tristanC: yeah, i mean, i think we're talking past each other here | 17:15 |
corvus | i totally agree that you need zuul-runner to figure out how to run a tripleo job | 17:15 |
sean-k-mooney | corvus: correct also the devstack jobs | 17:15 |
fungi | tristanC: it can be complex to "just run the playbooks" if they were not written with local running as a primary goal, i agree | 17:16 |
corvus | but since this conversation started in a more generic vein, we've also touched on the general problem of reusability | 17:16 |
sean-k-mooney | corvus: basically anything other than the tox jobs is very hard to figure out | 17:16 |
fungi | the jobs can be improved in that regard | 17:16 |
corvus | and in a world where you design for dev/prod reusability, zuul-runner isn't needed, because of that reusability. | 17:16 |
sean-k-mooney | i think the issue is most of the jobs that are listed in the pipelines are not written to be reused locally | 17:17 |
fungi | sean-k-mooney: bingo | 17:17 |
corvus | so i just wanted to make that point, that the use of zuul in tripleo/devstack is not the only way to use zuul. i may just be on a soapbox and that's fine. :) | 17:17 |
fungi | sean-k-mooney: at least for the projects you're talking about anyway | 17:17 |
sean-k-mooney | fungi: even to reuse the jobs in a third party ci env i had to do some deep hacks to make it work | 17:17 |
corvus | i totally agree that zuul-runner is right for sean-k-mooney's situation :) | 17:18 |
sean-k-mooney | corvus: hehe i was just hoping it could work without a zuul server | 17:18 |
sean-k-mooney | e.g. also provide it with the config file that the zuul server would have and have it embed a zuul-scheduler in it | 17:18 |
corvus | sean-k-mooney: fundamentally, the question you're asking is "how will zuul construct a list of playbooks and inputs" and the only way to answer that is to ask zuul, so that's the zuul-runner design | 17:19 |
sean-k-mooney | to have it render the playbooks into a directory that could then be run | 17:19 |
fungi | there is an alternative approach, which is to reimplement all of zuul's logic around indexing and parsing configuration. which is basically akin to writing a second zuul | 17:20 |
corvus | like, you could do it "without running a server" but only by starting up all of the zuul components to do all of the same work and then just exiting at the end. it's a distinction without a difference. | 17:20 |
sean-k-mooney | fungi: well not really, i was more thinking just start the zuul api in process | 17:20 |
corvus | and, mind you, in openstack's case, that process would take about 10 minutes. | 17:20 |
sean-k-mooney | then asking it via its api | 17:20 |
corvus | with a fleet of 30 servers | 17:20 |
corvus | so on a desktop, would probably take like an hour | 17:20 |
fungi | sean-k-mooney: and a merger process (or many) to collect and feed in all the distributed configuration | 17:21 |
sean-k-mooney | fungi: yep | 17:21 |
sean-k-mooney | it's currently embedding the executor, right | 17:21 |
sean-k-mooney | which acts as a merger | 17:21 |
sean-k-mooney | but anyway that's out of scope for now | 17:21 |
sean-k-mooney | it's not that hard to run zuul | 17:22 |
sean-k-mooney | especially with the docker compose. and if we have one downstream and we are just editing an existing job we can ask it | 17:22 |
avass | fungi, sean-k-mooney: something like: https://review.opendev.org/c/zuul/zuul-jobs/+/728684, it worked really well for library projects like zuul-jobs since everything is contained to the repo. though it wouldn't be able to handle complex logic to figure out which variant to use if there are multiple branches etc | 17:23 |
fungi | it may be that someone can eventually work out a way to slim down a lightweight zuul deployment aimed solely at providing a local api to query for zuul-runner use, but i expect that's nontrivial | 17:23 |
fungi | and as pointed out, you'd still want to run it independent of your individual zuul-runner calls, because of the startup indexing cost | 17:24 |
fungi | i don't think that can be optimized away, though maybe a persistent cache could allow you to stop and start it | 17:25 |
sean-k-mooney | fungi: ya it's normally not that bad for a small deployment but even my slimmed-down third party ci was 5-10 minutes | 17:25 |
* sean-k-mooney given we have downstream jobs that take more than 6 hours to run, not sure i would notice :) | 17:25 | |
sean-k-mooney | fungi: a big part of the initial start up time is cloning the repos, which would be done once | 17:26 |
sean-k-mooney | granted, all the internal zuul representation of the job would still have to be built | 17:27 |
sean-k-mooney | and unless you serialised that and cached it you would always have to pay that | 17:27 |
fungi | well, even just restarting the zuul scheduler in opendev takes 10+ minutes to ask the mergers to find any updated refs and pass the updated configuration back | 17:27 |
fungi | and that's with caches of repositories on the mergers | 17:27 |
sean-k-mooney | for a zuul-runner you probably don't need to ask it to update the refs on startup | 17:28 |
avass | yeah we take something like 15min to restart | 17:28 |
sean-k-mooney | avass: yep which is why having a zuul running with docker compose is not that unreasonable an overhead | 17:29 |
sean-k-mooney | or pointing it at our downstream ci | 17:29 |
fungi | v5 will solve a lot of that if you have a scheduler cluster, since it will be rare that you restart the entire cluster at once and so don't have to worry about missing events and getting out of sync | 17:30 |
sean-k-mooney | v5? a joke or an actual thing | 17:30 |
* sean-k-mooney wonders if i missed a v4 | 17:30 | |
fungi | you did | 17:31 |
avass | oh that's 15 min running a couple executors, that would probably take a lot longer starting up locally since it's pretty much one big repo with close to 1000 branches that is the reason for that startup cost | 17:31 |
sean-k-mooney | avass: you don't need all the repos | 17:31 |
fungi | v5 is where we move from gearman to zk for coordinating zuul service interactions, so we can scale out the scheduler | 17:31 |
fungi | work is well in progress on it | 17:31 |
*** jcapitao has quit IRC | 17:31 | |
sean-k-mooney | my third party ci is offline currently due to capacity issues | 17:31 |
avass | sean-k-mooney: ah in that case it would be faster yes | 17:31 |
sean-k-mooney | but it starts up with enough for devstack in under 5 minutes, less than 10 for ooo | 17:32 |
fungi | gerrit's checks implementation will probably also allow us to make some assumptions because it will cache events server side, so maybe we can assume cached gerrit configuration is in sync at restart and then roll forward the stashed events, but i don't know that the same could be said for other connection drivers | 17:33 |
sean-k-mooney | http://paste.openstack.org/show/803783/ | 17:33 |
sean-k-mooney | that is what my third party ci was using | 17:33 |
sean-k-mooney | but to avoid all the repos i had to do some hacks in the tempest job | 17:34 |
sean-k-mooney | https://github.com/SeanMooney/ci-sean-mooney/blob/main/zuul.d/jobs.yaml | 17:34 |
sean-k-mooney | basically i needed to define my own version of tempest-all/full | 17:35 |
corvus | sean-k-mooney: there have been 2 4.x releases; if you aren't subscribed to the zuul-announce mailing list, you may want to consider it. there is important information for operators about the v4/v5 roadmap. | 17:38 |
sean-k-mooney | corvus: sorry, was distracted by the fact tempest seems to have fixed their viral import mess | 17:40 |
sean-k-mooney | importing jobs from tempest no longer imports half of openstack | 17:41 |
corvus | neat | 17:41 |
sean-k-mooney | it used to import all the plugins, which imported every project that had a tempest plugin, if i remember correctly | 17:42 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: reconcile config objects https://review.opendev.org/c/zuul/nodepool/+/779898 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: Handle IPv6 https://review.opendev.org/c/zuul/nodepool/+/780400 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: don't require full subnet id https://review.opendev.org/c/zuul/nodepool/+/780402 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: add quota support https://review.opendev.org/c/zuul/nodepool/+/780439 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: implement launch retries https://review.opendev.org/c/zuul/nodepool/+/780682 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: implement support for diskimages https://review.opendev.org/c/zuul/nodepool/+/781187 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: handle leaked image upload resources https://review.opendev.org/c/zuul/nodepool/+/781855 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: use rate limiting https://review.opendev.org/c/zuul/nodepool/+/781856 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: fix race in leaked resources https://review.opendev.org/c/zuul/nodepool/+/781924 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: replace driver with state machine driver https://review.opendev.org/c/zuul/nodepool/+/781925 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: update documentation https://review.opendev.org/c/zuul/nodepool/+/781926 | 17:43 |
corvus | i seem to have misplaced a patch in that series :/ | 17:45 |
fungi | checked under the sofa cushions? | 17:45 |
corvus | yep, that's where it was | 17:46 |
corvus | i'm going to have to clean it off | 17:46 |
sean-k-mooney | by the way i assume the openstack ci is moving to zuulv4 or already has? | 17:47 |
sean-k-mooney | and will continue to track a recent major version of zuul | 17:47 |
fungi | 4.1.0 currently, yes | 17:47 |
corvus | 4.1.0++ even | 17:47 |
sean-k-mooney | cool reading 4.1.0 release notes | 17:47 |
avass | speaking of, I should probably upgrade my own instance :) | 17:47 |
fungi | sean-k-mooney: https://zuul.opendev.org/t/openstack/status "Zuul version: 4.1.1.dev23 68ba89c6" | 17:48 |
corvus | avass: yeah, if not tracking master, at least tracking releases would probably be a really good idea :) | 17:48 |
sean-k-mooney | ah at the bottom | 17:48 |
sean-k-mooney | ya i go to that page often but didn't know the version was there | 17:49 |
avass | now running 4.1.1.dev23 68ba89c6! | 17:49 |
sean-k-mooney | fungi: well, belated congratulations to all of you on v4 | 17:49 |
sean-k-mooney | i think nova may have finally got rid of our last non v3 native job this cycle | 17:50 |
sean-k-mooney | grenade was the main last blocker | 17:50 |
sean-k-mooney | although we still have a slight parity gap with ceph that is pending | 17:50 |
avass | corvus: heh yeah we're like two people using it so it's quite easy to keep it up to date since downtime has pretty much no consequence :) | 17:51 |
fungi | yes, the openstack tact sig is excited at the prospect of letting devstack-gate finally die quietly | 17:51 |
corvus | avass: oh *that* instance, i thought you meant that other instance with more than 2 people. :) | 17:52 |
sean-k-mooney | ... we still haven't moved one job https://github.com/openstack/nova/blob/master/.zuul.yaml#L288-L305 which is nova-grenade-multinode | 17:52 |
sean-k-mooney | but that is close | 17:52 |
avass | corvus: that one very often keeps up to date with master as well ;) | 17:53 |
fungi | okay, i rechecked zuul-jobs change 771106 and the scheduler seems to think we shouldn't run any jobs on that... or am i misreading the debug log? http://paste.openstack.org/show/803785 | 17:54 |
fungi | worried that if i try to add debug reporting temporarily, that will heisenbug and trigger jobs because of matching different file filters | 17:55 |
fungi | log there shows it matched the comment pattern for enqueuing into the check pipeline | 17:57 |
fungi | or is "Project zuul/zuul-jobs not in pipeline <Pipeline check> for change <Change 0x7f270e53f9d0 zuul/zuul-jobs 771106,2>" not expected i wonder | 17:58 |
avass | fungi: I think it should run jobs but the scheduler for some reason doesn't agree :) | 17:59 |
fungi | d'oh! i filtered on the wrong event id i think, not the zuul tenant | 18:02 |
corvus | was about to say the same | 18:03 |
* fungi starts over | 18:03 | |
fungi | i totally forgot there would be more than one event id because that project is in multiple tenants | 18:03 |
*** jpena is now known as jpena|off | 18:04 | |
corvus | i think the event id should be the same | 18:05 |
fungi | yeah, looks like it | 18:05 |
*** hashar has quit IRC | 18:05 | |
corvus | but maybe that paste is just filled with the openstack/opendev lines | 18:05 |
fungi | so why was there no zuul.Pipeline.zuul.check evaluation, i wonder | 18:05 |
corvus | oh, nope, that's the whole thing. | 18:06 |
avass | could this be relevant? https://zuul.opendev.org/t/zuul/config-errors | 18:06 |
fungi | indeed it could, i didn't see that alarming earlier | 18:06 |
fungi | is this a recurrence of the ghost errors i saw a week or two back, which disappeared after a smart reconfig? | 18:08 |
corvus | that tenant appears to have a check pipeline with the zuul-jobs repo attached to it | 18:08 |
corvus | this could be an issue: http://paste.openstack.org/show/803787/ | 18:09 |
fungi | er, yeah | 18:10 |
fungi | i always forget to filter for tracebacks since they lack event association | 18:10 |
fungi | so change was None? that can't be right | 18:11 |
corvus | i don't know if that's related | 18:11 |
fungi | er, yeah that was earlier | 18:11 |
fungi | i rechecked at 19:10:26 | 18:12 |
corvus | i think that's probably harmless (but needs to be fixed) | 18:12 |
corvus | i think that happens on replication events | 18:12 |
avass | that has actually been a really big plus for the splunk reporter so far, event_id and build is attached to every log entry (where it's applicable) | 18:12 |
fungi | scheduler restart was around 14:38 | 18:13 |
fungi | er, sorry, my recheck comment was at 15:46:19 | 18:14 |
fungi | so hours prior to that exception | 18:14 |
corvus | my concern is that if it happens on replication events, it may happen on others too (like tags) | 18:14 |
corvus | nope, this should only be for replication events | 18:17 |
corvus | tristanC: i have good news! there is a longstanding behavior in zuul related to replication events which i think means that no one could be using them in pipeline triggers. | 18:18 |
mordred | corvus: oh? | 18:19 |
fungi | hah, fortuitous bug ;) | 18:19 |
corvus | yeah, i'm double checking that now | 18:19 |
corvus | mordred, fungi, tristanC: the ref-* events from gerrit all hit this line: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L745 | 18:20 |
corvus | which means there's nothing for the pipelines to match on | 18:21 |
corvus | so i think we can merge tristanC's change to remove those immediately | 18:21 |
mordred | corvus: well - in that case removing the replication events - yeah - should make logs better | 18:22 |
fungi | i concur | 18:22 |
corvus | and then also, we need to fix the new event handling to handle a null return (which should fix that traceback in the scheduler) | 18:22 |
avass | what does this comment refer to then? https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L719 | 18:22 |
corvus | fungi: and i'm pretty sure none of this has anything to do with your job issue, so back to that... | 18:22 |
fungi | corvus: yeah, i'm running down the config errors now | 18:22 |
corvus | avass: sorry i was sloppy -- i meant ref-replicated-* events | 18:23 |
fungi | i strongly suspect this is a recurrence of what we saw briefly last week or the week before | 18:23 |
corvus | avass: ref-updated is legit and correctly handled | 18:23 |
avass | ah | 18:23 |
corvus | avass: er, i mean ref-replication-* :) | 18:23 |
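A minimal sketch of the fix corvus describes above, with hypothetical names (not Zuul's real classes or API): gerrit `ref-replication-*` events map to no change, so the two-stage event forwarder should tolerate a `None` result from the connection instead of raising the traceback seen in the paste.

```python
# Hedged sketch, hypothetical names -- not Zuul's actual code.
# Replication events yield no change object, so a None return from the
# connection must simply drop the event rather than crash forwarding.

def forward_trigger_event(event, tenant_queues, event_to_change):
    """Forward one global trigger event to every tenant queue.

    event_to_change maps an event to a change-like object, or None when
    the event (e.g. a replication event) has nothing to match on.
    Returns the number of tenant queues the event was forwarded to.
    """
    change = event_to_change(event)
    if change is None:
        # Nothing for pipelines to match on; drop the event quietly.
        return 0
    forwarded = 0
    for queue in tenant_queues:
        queue.append((event, change))
        forwarded += 1
    return forwarded
```

Under this shape, the existing traceback disappears because the `None` case is handled before any tenant queue is touched.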
fungi | https://opendev.org/openstack/project-config/src/branch/master/zuul/main.yaml#L1631-L1640 | 18:23 |
fungi | we clearly include [] from openstack/project-config in the zuul tenant | 18:24 |
fungi | so it should not be complaining about pipeline definitions from that repo | 18:24 |
corvus | tristanC: if you would care to un-minus-2 your change https://review.opendev.org/780809 i think we can merge it asap. sorry i missed that earlier. | 18:25 |
fungi | is it somehow assembling configuration incorrectly at start, which then gets corrected at the next smart-reconfigure? | 18:26 |
* fungi checks whether there was a scheduler restart fairly close to when i saw this last time | 18:27 | |
fungi | 2021-03-01 23:07:49 is when i mentioned it in here last time | 18:28 |
fungi | most recent mention of a zuul restart in our status log prior to that was on 2021-02-20 so not a smoking gun | 18:29 |
avass | fungi: maybe a full-reconfig would be enough? | 18:29
fungi | yeah, really anything's possible i guess | 18:30 |
corvus | fungi: if you want to try a smart-reconfig and see if it changes i have no objection | 18:30 |
corvus | fungi: i don't think the lack of jobs is specific to that gentoo change; the nodepool changes i uploaded a while ago are not enqueued either. so we're probably looking at a systemic issue, either related to the config errors you and avass observed, or an as-yet-unidentified bug in the trigger-event-queue-in-zk changes | 18:32 |
corvus | i have to run now; i will be back after lunch | 18:33 |
*** jpenag has joined #zuul | 18:39 | |
fungi | i've initiated one now | 18:41 |
fungi | 2021-03-22 18:31:49,126 DEBUG zuul.CommandSocket: Received b'smart-reconfigure' from socket | 18:41 |
*** jpena|off has quit IRC | 18:41 | |
fungi | it logged "Reconfiguration complete" at 18:31:49,336 | 18:41 |
fungi | oh, that was "Smart reconfiguration of tenants []..." so it didn't reconfigure anything | 18:41 |
fungi | full-reconfigure maybe? | 18:41 |
fungi | https://zuul.opendev.org/t/zuul/config-errors is still clearly reporting errors for things which shouldn't happen | 18:41 |
fungi | corvus: yep, i agree, it seems like it's not evaluating any zuul tenant events because of the configuration errors for the tenant | 18:42 |
fungi | i'm about to start heating up some food myself, but will keep poking at logs to see if i can spot any other hints | 18:45 |
openstackgerrit | Albin Vass proposed zuul/nodepool master: WIP: Digitalocean driver https://review.opendev.org/c/zuul/nodepool/+/759559 | 18:51 |
avass | corvus: would you recommend moving the digital ocean driver to use the state machine driver, or is the simple driver fine? | 18:54
fungi | the restart begins at 2021-03-22 14:30:47,174 in our debug log. at 14:37:27,654 (during startup config loading) it complained "WARNING zuul.ConfigLoader: Zuul encountered a syntax error while parsing its configuration in the repo openstack/project-config on branch master. The error was: Pipelines may not be defined in untrusted repos..." | 18:59 |
fungi | so it definitely was concerned about this at start | 19:00 |
fungi | stat reporting exception, almost certainly unrelated, but curious if we have a patch up for it yet: http://paste.openstack.org/show/803789/ | 19:03 |
fungi | looks like arrived is None | 19:04 |
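A sketch of the guard that stats exception suggests is missing (hypothetical helper, not Zuul's real function): changes enqueued out-of-band, e.g. via the enqueue RPC, may carry no arrival timestamp, so the timing stat should be skipped when `arrived` is `None` instead of subtracting from it.

```python
import time

# Hedged sketch, hypothetical function name. If an item was enqueued
# without an arrival timestamp (e.g. via RPC re-enqueue), report no
# timing stat rather than raising on None arithmetic.

def event_processing_ms(arrived, now=None):
    if arrived is None:
        return None  # no timestamp to measure against; skip the stat
    if now is None:
        now = time.time()
    return int((now - arrived) * 1000)
```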
corvus | avass: state machine; i'd like to move gce to it and drop simple | 19:06 |
corvus | avass: should be easy | 19:06 |
corvus | fungi: i'm still somewhat inclined to think that these may be 2 separate issues; occams razor says if we're not seeing trigger events then the change to move trigger events into zk may be relevant. :) | 19:08 |
fungi | corvus: yep, absolutely | 19:08 |
corvus | (i'm not excluding the idea they may be related; just thinking right now it's like 75% probability they are distinct) | 19:08 |
fungi | i guess it *should* enqueue for any repository not in a syntax error state? | 19:09 |
fungi | unless one of the error repositories is a required-project? | 19:09 |
corvus | looking at the running config, i don't see any reason why it shouldn't be enqueued right now | 19:10 |
corvus | (the error about pipelines is from a repo where we don't expect pipelines anyway) | 19:10 |
fungi | but it will refuse to add that project, right? so if it's declared as a required-project it can't be used in that state? not saying that's the case for every zuul-jobs job so something still should have been enqueued | 19:11 |
corvus | i don't think a config error on a project should prevent it from being used as a required project; at most, it would cause zero config objects to be loaded from that project. | 19:12 |
fungi | ahh, okay, so even less likely to be related | 19:13 |
corvus | yep, since zero of the projects which are showing config errors in the zuul tenant are used by zuul-jobs | 19:13 |
avass | corvus: ok, I'll take a look at that soon then. I realized that building a game client inside the same k8s as zookeeper and zuul on cheap nodes isn't a good idea :) | 19:13 |
fungi | building a game client is always a good idea | 19:14 |
avass | well you don't actually get to build anything if something always gets OOM killed | 19:14 |
fungi | and yeah, so far the only exceptions i'm seeing logged are the one from replication events, and the stats emitting one i pasted above | 19:15 |
fungi | now wondering if the stats exception is related to enqueuing things with zuul-enqueue rpc | 19:17 |
fungi | seems to consistently follow "Adding change" lines, and looks related to our bulk scripted reenqueue | 19:18 |
*** harrymichal has quit IRC | 19:19 | |
*** harrymichal has joined #zuul | 19:20 | |
*** harrymichal has joined #zuul | 19:20 | |
fungi | interesting, our opendev-prod-hourly pipeline (timer triggered) is raising run handler exceptions with "AttributeError: 'Branch' object has no attribute 'number'" | 19:20 |
*** hamalq has joined #zuul | 19:27 | |
avass | okay weird, I seem to only be able to run jobs for a PR once per scheduler restart | 19:50 |
corvus | avass: 4.1.0 or master? | 19:51 |
avass | master | 19:51 |
corvus | avass: can you try 4.1.0? | 19:52 |
avass | sure | 19:52 |
fungi | i see opendev's openstack tenant enqueuing things, but https://zuul.opendev.org/t/zuul/builds suggests nothing has been enqueued for the zuul tenant for 16+ hours | 19:52 |
corvus | avass: i suspect it will still fail and i would start with I55df1cc28279bb6923e51686dde8809421486c6a as a suspected cause | 19:53 |
corvus | fungi: i think i've got it | 19:55 |
fungi | regression in today's restart? | 19:55 |
avass | corvus: 4.1.0 works | 19:55 |
corvus | avass: oh nice -- runs twice after restart on 4.1.0? | 19:56 |
avass | yeah | 19:56 |
corvus | avass: ok, it *could* be the same thing we're seeing in opendev; let me explain, then let's see if your situation could fit | 19:56 |
fungi | yay for not releasing with whatever brought in that bug ;) | 19:56 |
corvus | fungi, avass: there's an exception in our scheduler logs: AttributeError: 'Branch' object has no attribute 'number' | 19:57 |
corvus | that's being hit during the new event forwarding code (where we have a 2-stage event queue system: a global queue, then we forward events to individual tenant queues) | 19:58 |
fungi | okay, so that's the same one i saw for one of our opendev hourly pipelines | 19:59 |
corvus | if one of the tenants has an item in a pipeline that is not a change (ie, a branch tip or tag or other ref -- probably a post or periodic pipeline), then we hit that exception and abort forwarding the event to any other tenants | 19:59 |
avass | I'm not seeing that error however | 19:59 |
avass | or I don't think I do | 20:00 |
corvus | so basically, the new event forwarding is stopping for us at the opendev tenant. any tenant in our list of tenants after (and somewhat including) opendev is not getting events | 20:00 |
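The failure mode corvus describes can be sketched with hypothetical classes (not Zuul's real model code): a pipeline item may reference a Change (which has `.number`) or a plain ref such as a Branch or Tag (which does not), and comparing via `getattr` avoids the `AttributeError` that aborted forwarding to the remaining tenants.

```python
# Hedged sketch with hypothetical classes -- not Zuul's model code.
# A post/periodic pipeline item holds a Branch/Tag/Ref with no .number;
# accessing item_change.number directly raised AttributeError and
# stopped event forwarding for every tenant after the failing one.

class Change:
    def __init__(self, number):
        self.number = number

class Branch:
    def __init__(self, name):
        self.name = name

def same_change(item_change, event_change):
    # getattr with a default keeps non-change refs from raising.
    a = getattr(item_change, "number", None)
    b = getattr(event_change, "number", None)
    return a is not None and a == b
```

With a guard like this (or a per-tenant try/except around forwarding), one tenant's non-change item cannot starve the tenants listed after it.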
* avass does another restart | 20:00 | |
corvus | avass: hrm. any other exceptions in your scheduler log? maybe a slightly different error is being hit and causing it. | 20:00 |
avass | corvus: no, no error at all | 20:03 |
corvus | avass: what's the sequence? are you hitting the recheck button in github for the already-run check? | 20:06 |
avass | recheck -> job runs and gets NODE_FAILURE -> recheck -> nothing, and the logs don't really say anything either | 20:07
avass | working on paste but I guess you don't want 800 lines of raw scheduler logs :) | 20:07 |
corvus | the more the better as far as i'm concerned, but opendev's pastebin does have a size limit; if you hit it, you can split it in 2 :) | 20:08 |
fungi | yeah, i've used the split -l trick to get a log into multiple pastes plenty of times | 20:13 |
fungi | i think the limit in our paste server is actually the max field size in the mysql db backing it | 20:14 |
avass | sec, manjaro seems to want to upgrade all system packages. | 20:15 |
avass | also had to fill out an impossible captcha because my paste contains spam :) | 20:16 |
avass | http://paste.openstack.org/show/803796/ | 20:16 |
avass | I suppose that silently truncated my logs | 20:18 |
avass | 2/4: http://paste.openstack.org/show/803799/ | 20:19 |
avass | 3/4: http://paste.openstack.org/show/803800/ | 20:20 |
avass | 4/4: http://paste.openstack.org/show/803801/ | 20:20 |
corvus | avass: and when was the second recheck event? | 20:20
avass | corvus: last paste. somewhere around line 10 | 20:21 |
corvus | avass: are you sure it's not somewhere between paste 3 and 4? | 20:22 |
avass | corvus: probably line 18: [e: 98be8bb0-8b49-11eb-935a-fabeee8c4b40] Github Webhook Received | 20:22 |
avass | corvus: that's the first event probably | 20:23 |
corvus | avass: okay, so you're leaving a "recheck" comment? | 20:23 |
avass | yeah | 20:23 |
avass | first one works, second one doesn't | 20:23 |
fungi | and second one was after the first reported? | 20:23 |
avass | I believe so yes | 20:23 |
avass | fungi: just left another comment and that produces the same result | 20:24 |
fungi | got it | 20:25 |
corvus | avass: i see a "Run handler awake" log in #3 and nothing after that; suspect your scheduler may be stuck in result event processing | 20:26 |
corvus | avass: the node request always fails? | 20:27 |
avass | corvus: trying to figure out why digital ocean doesn't like the image it's configured to use, so yep | 20:28
corvus | okay, so the reproducer for this might be as simple as "enqueue a change with a node request that fails"; that does also seem to point to the bug being related to event-queues in zk. | 20:29 |
corvus | i'll work on fixing both of these | 20:31 |
corvus | fungi: for opendev, i think we need to restart on 4.1.0 | 20:32 |
corvus | there's no good way to get an image with fixes without doing that | 20:33 |
corvus | fungi: do you have time to do that, or should i? | 20:33 |
fungi | corvus: sorry, had to step away for a moment, but catching back up now | 20:48 |
corvus | fungi: it's a little more complex so -> #opendev | 20:49 |
fungi | yup | 20:49 |
*** ajitha has quit IRC | 21:15 | |
*** jangutter has joined #zuul | 21:22 | |
*** jangutter_ has quit IRC | 21:25 | |
corvus | avass: you weren't running any unmerged patches on that instance, right? the bug you saw was with plain old upstream master? | 22:42 |
corvus | avass: (in particular, no async reporting patch?) | 22:42 |
corvus | avass: also, it really looks like there are some lines missing between pastes #3 and #4 -- can you check on that? | 22:45 |
*** harrymichal_ has joined #zuul | 22:51 | |
*** harrymichal has quit IRC | 22:52 | |
*** harrymichal_ is now known as harrymichal | 22:52 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Fix trigger event forwarding bug https://review.opendev.org/c/zuul/zuul/+/782335 | 22:59 |
openstackgerrit | James E. Blair proposed zuul/zuul master: WIP: Try to repro recheck failure https://review.opendev.org/c/zuul/zuul/+/782336 | 22:59 |
corvus | fungi, avass, tobiash, swest: 782335 is a successful fix of the issues we saw in production today; i think with that merged, we can restart opendev's zuul on master again. (you can read the commit msg for the tl;dr on issues) | 23:00 |
*** nils has quit IRC | 23:00 | |
corvus | fungi, avass, tobiash, swest: 782336 is an unsuccessful attempt to repro the issue avass described; i think we need more data there. | 23:01 |
*** holser has quit IRC | 23:38 | |
*** harrymichal has quit IRC | 23:39 | |
*** rlandy has quit IRC | 23:43 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!