Monday, 2021-03-22

*** tosky has quit IRC  00:26
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: reconcile config objects  https://review.opendev.org/c/zuul/nodepool/+/779898  01:54
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: Handle IPv6  https://review.opendev.org/c/zuul/nodepool/+/780400  01:54
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: don't require full subnet id  https://review.opendev.org/c/zuul/nodepool/+/780402  01:54
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: add quota support  https://review.opendev.org/c/zuul/nodepool/+/780439  01:54
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: implement launch retries  https://review.opendev.org/c/zuul/nodepool/+/780682  01:54
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: implement support for diskimages  https://review.opendev.org/c/zuul/nodepool/+/781187  01:54
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: handle leaked image upload resources  https://review.opendev.org/c/zuul/nodepool/+/781855  01:54
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: use rate limiting  https://review.opendev.org/c/zuul/nodepool/+/781856  01:54
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: fix race in leaked resources  https://review.opendev.org/c/zuul/nodepool/+/781924  01:54
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: replace driver with state machine driver  https://review.opendev.org/c/zuul/nodepool/+/781925  01:54
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: update documentation  https://review.opendev.org/c/zuul/nodepool/+/781926  01:54
<openstackgerrit> Merged zuul/zuul-jobs master: Revert "Pin to npm4 until npm 5.6.0 comes out"  https://review.opendev.org/c/zuul/zuul-jobs/+/781966  02:13
<openstackgerrit> Merged zuul/zuul-jobs master: Clone nvm from official source  https://review.opendev.org/c/zuul/zuul-jobs/+/781970  02:17
<fungi> ianw: mordred: i just exercised those two ^ successfully with https://zuul.opendev.org/t/openstack/build/042af4adb20c472583c278cd365a12ab so looks like it's finally working. thanks!!!  02:54
<ianw> fungi: oh good, 4 months is a long time in the javascript world, let alone 4 years :)  02:55
<fungi> yup!  02:57
*** evrardjp has quit IRC  03:33
*** evrardjp has joined #zuul  03:33
*** raukadah is now known as chkumar|ruck  05:02
*** vishalmanchanda has joined #zuul  05:11
*** ykarel has joined #zuul  05:28
*** ykarel has quit IRC  05:29
*** ykarel has joined #zuul  05:29
*** jfoufas1 has joined #zuul  05:36
*** ajitha has joined #zuul  05:45
*** hashar has joined #zuul  07:54
*** jcapitao has joined #zuul  08:01
*** rpittau|afk is now known as rpittau  08:19
*** wxy has joined #zuul  08:47
*** tosky has joined #zuul  08:49
*** jpena|off is now known as jpena  08:57
*** phildawson has joined #zuul  08:58
*** holser has joined #zuul  09:02
*** ykarel is now known as ykael|lunch  09:21
*** saneax has joined #zuul  09:25
*** nils has joined #zuul  09:28
*** harrymichal has joined #zuul  09:48
*** ykael|lunch is now known as ykael  10:03
*** ykael is now known as ykarel  11:25
*** rlandy has joined #zuul  11:41
*** holser has quit IRC  11:54
*** holser has joined #zuul  11:57
*** ykarel_ has joined #zuul  11:58
*** ykarel has quit IRC  12:00
*** phildawson has quit IRC  12:05
*** ykarel_ is now known as ykarel  12:09
*** jcapitao is now known as jcapitao_lunch  12:13
*** hashar is now known as hasharLunch  12:17
*** ykarel_ has joined #zuul  12:50
*** jpena is now known as jpena|lunch  12:50
*** ykarel has quit IRC  12:50
*** hasharLunch is now known as hashar  12:55
*** jcapitao_lunch is now known as jcapitao  13:07
*** ykarel_ has quit IRC  13:37
*** ykarel_ has joined #zuul  13:37
*** jpena|lunch is now known as jpen  13:51
*** jpen is now known as jpena  13:51
*** jhesketh has quit IRC  14:02
*** jhesketh has joined #zuul  14:04
<openstackgerrit> Jeremy Stanley proposed zuul/zuul-jobs master: DNM: test role job triggering  https://review.opendev.org/c/zuul/zuul-jobs/+/782263  14:13
*** kmalloc has joined #zuul  14:18
<mordred> fungi: woot  14:22
<corvus> i'm going to work on restarting opendev's zuul  14:26
<corvus> and once that's done, i'll send out the release announcements  14:30
<openstackgerrit> Jeremy Stanley proposed zuul/zuul-jobs master: Revert "Temporarily stop running Gentoo base role tests"  https://review.opendev.org/c/zuul/zuul-jobs/+/771106  14:34
*** wxy has quit IRC  14:36
<fungi> thanks corvus!  14:38
*** jangutter_ has joined #zuul  14:39
*** jangutter has quit IRC  14:42
*** ykarel_ is now known as ykarel  14:49
<corvus> tristanC: looking at https://grafana.opendev.org/d/5Imot6EMk/zuul-status?orgId=1 we're graphing the new event processing time; occasionally zuul stalls for 5 minutes, presumably it's doing a bunch of reconfiguration/re-enqueue work then.  it may be tricky to separate out "normal" event queue performance from those spikes.  but we could probably try to see if the baseline increases.  14:51
*** harrymichal has quit IRC  14:52
<avass> corvus: some of those peaks could also be python gc right?  14:53
<avass> not sure how long that can stall the system tbh  14:53
<corvus> avass: i think probably not -- most of python's memory mgmt is ref-count based, so it happens more gradually and synchronously (the gc runs rarely and only cleans up cycles).  but even so, that would still be fractions of a second, not 5 minutes.  14:56
<corvus> 5 minutes is definitely "reconfigure the openstack tenant and re-enqueue everything" :)  14:56
<corvus> like, full reconfiguration  14:56
<corvus> that's about how long a startup takes  14:56
<tristanC> corvus: i see, perhaps we should differentiate management from regular events  14:56
<avass> ah, makes sense  14:56
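The distinction corvus draws (reference counting frees most objects immediately; the collector only handles cycles) can be seen in a few lines of python -- a minimal illustration, not zuul code:

```python
import gc


class Node:
    pass


# Ordinary objects are freed immediately by reference counting as
# soon as their last reference goes away -- no collector pause needed.
plain = Node()
del plain

# A reference cycle keeps the refcount from reaching zero, so the
# object survives until the cyclic collector runs.
cyclic = Node()
cyclic.self_ref = cyclic
del cyclic

unreachable = gc.collect()  # explicitly run the cycle collector
print(unreachable)  # at least 1 unreachable object found in cycles
```

Either way, a collector pass is far too quick to explain a 5-minute stall.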
<corvus> tristanC: well, i think our trigger events will just be sitting in the queue during that time  14:57
<corvus> tristanC: so i think it's more like filtering out the data during times we know that the system is frozen  14:57
<corvus> anyway, i don't know if we need to automate that, or if we can just look at the graph and see the patterns  14:58
<corvus> just something to keep in mind  14:58
<fungi> i suppose it also depends on what you expect from that graph. regardless of what's causing the delays in event processing, it's an accurate depiction of how long events waited to be processed  14:58
<corvus> fungi: yeah, though i know one hope is to see if moving the queues to zk affected processing time  14:59
<corvus> but i think that was an implicit "processing time during the times where zuul was not frozen due to reconfigurations" :)  14:59
<fungi> we could probably still compare averages, main concern is probably the limited pre-upgrade sample size  15:00
<corvus> fungi: yep  15:00
<corvus> also, we could probably discard times > 60 seconds  15:00
<avass> well, those peaks could also shoot up  15:01
<corvus> as long as the baseline isn't near that cutoff, we could probably filter like that  15:01
<fungi> that might help, though you'd still end up skewing results with the events which arrived in the last minute of the pause  15:01
<corvus> yeah.  for now, i think i'll rely on wetware visual pattern matching :)  15:02
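The cutoff filtering floated above could be as simple as this sketch (the 60-second threshold and all names are just illustrative, taken from the conversation):

```python
CUTOFF_SECONDS = 60.0


def baseline(samples, cutoff=CUTOFF_SECONDS):
    """Mean event-processing time, ignoring reconfiguration stalls.

    Samples above the cutoff are assumed to be scheduler freezes
    (tenant reconfiguration / re-enqueue), not queue performance.
    """
    kept = [s for s in samples if s <= cutoff]
    return sum(kept) / len(kept) if kept else 0.0


# e.g. baseline([0.5, 1.2, 0.8, 300.0]) ignores the 300s stall
```

As fungi notes, events that arrived during the final minute of a stall would still leak into the kept samples and skew the average slightly.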
<avass> oh, you don't want to model it like a control system and come up with a non-linear equation describing how the system behaves before and after? :)  15:03
*** harrymichal has joined #zuul  15:03
<corvus> we should get an intern  15:03
<fungi> an intern who can also stand up two copies of infrastructure for side-by-side benchmarking  15:04
<fungi> i'm sure it would make an excellent thesis  15:04
<corvus> ++  15:05
<corvus> i don't see any anomalies so far; i'm going to grab some breakfast and check back in a bit  15:07
<tristanC> corvus: not sure how to translate that in grafyaml, but in graphite there is a `std` version which might be useful to smooth the spikes, e.g.: https://graphite.opendev.org/?width=1272&height=837&lineMode=connected&yUnitSystem=msec&target=stats.timers.zuul.tenant.openstack.event_enqueue_processing_time.mean&target=stats.timers.zuul.tenant.openstack.event_enqueue_processing_time.std  15:19
<corvus> ooh nice  15:19
*** harrymichal has quit IRC  15:24
*** harrymichal has joined #zuul  15:24
*** ykarel has quit IRC  15:26
*** jfoufas1 has quit IRC  15:53
*** vishalmanchanda has quit IRC  16:11
*** kmalloc has quit IRC  16:37
*** sean-k-mooney has joined #zuul  16:53
<sean-k-mooney> o/  16:53
<sean-k-mooney> am i correct in saying that https://zuul-ci.org/docs/zuul/reference/developer/specs/zuul-runner.html was never implemented?  16:54
<fungi> sean-k-mooney: ot  16:54
<corvus> sean-k-mooney: it's in progress  16:54
<fungi> it's... yeah  16:54
<sean-k-mooney> context is we were talking about downstream ci  16:54
* fungi curses the adjacency of backspace and return on this keyboard  16:54
<sean-k-mooney> and one of the things we are looking at is using zuul more  16:54
<sean-k-mooney> but one question that was asked was: what is the simplest way to write/debug a zuul job locally?  16:55
<fungi> yeah, zuul-runner should help  16:56
<sean-k-mooney> i have always done that by either deploying zuul locally or just submitting a patch to the upstream gate with all other jobs disabled  16:56
<sean-k-mooney> fungi: does the design still require a zuul server?  16:56
<sean-k-mooney> i know it's fairly trivial with the docker compose file to get one  16:56
<fungi> sean-k-mooney: the current design requires you to ask the zuul server to resolve the collection of interdependent ansible playbooks/roles, yes  16:57
<sean-k-mooney> but i was wondering, could you run the scheduler in the cli tool and pass in the main.yaml etc.?  16:57
<corvus> it sort of depends on what you're writing/debugging -- if you want to write/test/debug the job's playbook, then you can run that (or whatever shell scripts it runs).  if you're debugging the rendered zuul job which is composed of a complex hierarchy of playbooks from inherited jobs, then, yeah, something like zuul-runner is necessary  16:58
<corvus> if you design jobs/roles/playbooks with reusability in mind, then it's pretty simple to run the main playbook of a job locally.  if you have a deep system of inheritance (like devstack), then it can be hard to figure out what playbooks and roles to run  16:59
<sean-k-mooney> corvus: well, part of the challenge with running the job playbook is all the infrastructure for generating the set of playbooks  16:59
<sean-k-mooney> corvus: right, and ooo jobs make that much worse  17:00
<corvus> sean-k-mooney: so the context is downstream inheriting from tripleo jobs?  17:00
<sean-k-mooney> corvus: basically, downstream for nova we have zuul running the tox jobs and jenkins running ooo via infrared  17:00
<sean-k-mooney> corvus: yes  17:00
<sean-k-mooney> we would like to run the upstream ooo job downstream with internal repos instead of rdo  17:01
<sean-k-mooney> but have a way to debug it  17:01
<fungi> out of curiosity, how do you locally reproduce the jenkins jobs to debug them? or are you doing like we used to do and just avoiding relying on any fancy jenkins plugins so it's merely calling a shell script?  17:01
<sean-k-mooney> corvus: e.g. have a way to run the downstream job on our laptops  17:01
<sean-k-mooney> fungi: i don't; i try to do all backports upstream :) otherwise trigger it manually and have it hold a ci host  17:02
<corvus> sean-k-mooney: gotcha, yeah, that's definitely more in zuul-runner's wheelhouse.  even the first patch in the zuul-runner series, which i think may be close to approval, would probably help with that, as it at least enables querying zuul for the list of playbooks to run.  17:02
<sean-k-mooney> corvus: would it be possible to use it to render a local directory with the playbooks etc. and then run them manually?  17:03
<sean-k-mooney> maybe i should read the patch  17:03
<corvus> sean-k-mooney: topic:freeze_job  17:03
<sean-k-mooney> thanks :)  17:03
<corvus> sean-k-mooney: it's a long series; starts with server-side api changes, then later implements a cli tool to use them  17:04
<corvus> tristanC: ^  17:04
<sean-k-mooney> we are in a somewhat strange situation downstream where all the upstream engineers know to some degree how zuul and particularly the in-project jobs work (based on devstack), whereas our qe understand the jobs they wrote in jenkins that are ooo/infrared based  17:06
<corvus> well, zuulv3 was certainly designed for downstream re-usability.  being able to reuse some of that directly would be great.  17:07
<sean-k-mooney> yep, we would like to reuse the upstream ooo jobs directly in our downstream patch ci, which is basically like upstream openstack ci, e.g. based on patches to our downstream gerrit  17:08
<sean-k-mooney> which is basically just backports 99% of the time  17:08
<corvus> sean-k-mooney: also, you mentioned it in passing, but i sort of wouldn't underestimate the power of "throw up a patch and let zuul do its thing".  that can be really nice in a downstream system where developers can hold their own nodes.  17:08
<fungi> though it seems like zuul was really going for reusability in the other direction (write playbooks you can run locally/in production, use those in your job definitions). if you design in the other direction (write playbooks for ci, then try to run them locally) that's a tougher nut to crack  17:08
<corvus> mhu has patches in flight to add buttons to the web ui for things like holding nodes  17:09
<sean-k-mooney> fungi: the issue is that there really is no documentation of how to run them locally  17:09
<corvus> well, i mean, they're playbooks, you just run them :)  17:09
<fungi> sean-k-mooney: right, well, that's part of writing them for local running first. you can add readmes/documentation  17:09
<tristanC> tobiash: would you mind checking the last comments of https://review.opendev.org/c/zuul/zuul/+/607078  17:09
<sean-k-mooney> corvus: the issue is the onion-like nesting and figuring out how to generate the inventory files  17:10
<sean-k-mooney> corvus: just running them is not so trivial  17:10
<fungi> yeah, that's what zuul-runner should help with. you can look at the inventory for a reported build and then create something similar  17:10
<sean-k-mooney> fungi: i think that works for roles, not for jobs  17:11
<corvus> sean-k-mooney: yeah, maybe i misunderstood -- i thought you were saying "zuul has no docs on how to run playbooks locally" and i don't think we can write that.  but maybe you were saying "tripleo has no docs on how to run their playbooks locally"  17:11
<fungi> sean-k-mooney: it works for plays/playbooks as well as roles  17:11
<sean-k-mooney> fungi: not really  17:11
<sean-k-mooney> i know where you're coming from  17:11
*** rpittau is now known as rpittau|afk  17:11
<fungi> you can write playbooks which work locally, then add those to your job definition  17:11
<sean-k-mooney> but as a user of zuul my entry point is generally a job  17:11
<tristanC> sean-k-mooney: the zuul-runner spec is implemented through https://review.opendev.org/c/632064, the changes are mostly waiting for reviews  17:11
<sean-k-mooney> but to recreate the job i need to run all the pre and post playbooks in order around the run playbook that is listed in the job or its parent  17:12
<sean-k-mooney> + flatten all role vars into the appropriate input  17:12
<sean-k-mooney> so i can't really just run tempest-full  17:13
<sean-k-mooney> https://github.com/openstack/tempest/blob/master/zuul.d/integrated-gate.yaml#L32-L60 for example: it defines no pre, run or post playbooks for me to run; they all come from the parent  17:14
<tristanC> corvus: fungi: unfortunately it can be quite complex to "just run the playbooks"  17:15
<corvus> sean-k-mooney: yeah, that's basically a result of the way the tripleo jobs were designed.  with deep nesting and playbooks at every level, you need a tool.  it *is* possible to design the other direction, which is i think what fungi was saying.  if you design for real-world use and then run that in ci, then you end up with a single "real-world" playbook nestled between some ci-specific pre and post run playbooks.  17:15
<corvus> tristanC: yeah, i mean, i think we're talking past each other here  17:15
<corvus> i totally agree that you need zuul-runner to figure out how to run a tripleo job  17:15
<sean-k-mooney> corvus: correct, also the devstack jobs  17:15
<fungi> tristanC: it can be complex to "just run the playbooks" if they were not written with local running as a primary goal, i agree  17:16
<corvus> but since this conversation started in a more generic vein, we've also touched on the general problem of reusability  17:16
<sean-k-mooney> corvus: basically anything other than the tox jobs is very hard to figure out  17:16
<fungi> the jobs can be improved in that regard  17:16
<corvus> and in a world where you design for dev/prod reusability, zuul-runner isn't needed, because of that reusability.  17:16
<sean-k-mooney> i think the issue is most of the jobs that are listed in the pipelines are not written to be reused locally  17:17
<fungi> sean-k-mooney: bingo  17:17
<corvus> so i just wanted to make that point, that the use of zuul in tripleo/devstack is not the only way to use zuul.  i may just be on a soapbox and that's fine.  :)  17:17
<fungi> sean-k-mooney: at least for the projects you're talking about anyway  17:17
<sean-k-mooney> fungi: even to reuse the jobs in a third party ci env i had to do some deep hacks to make it work  17:17
<corvus> i totally agree that zuul-runner is right for sean-k-mooney's situation :)  17:18
<sean-k-mooney> corvus: hehe, i was just hoping it could work without a zuul server  17:18
<sean-k-mooney> e.g. also provide it with the config file that the zuul server would have and have it embed a zuul-scheduler in it  17:18
<corvus> sean-k-mooney: fundamentally, the question you're asking is "how will zuul construct a list of playbooks and inputs" and the only way to answer that is to ask zuul, so that's the zuul-runner design  17:19
<sean-k-mooney> to have it render the playbooks into a directory that could then be run  17:19
<fungi> there is an alternative approach, which is to reimplement all of zuul's logic around indexing and parsing configuration. which is basically akin to writing a second zuul  17:20
<corvus> like, you could do it "without running a server" but only by starting up all of the zuul components to do all of the same work and then just exiting at the end.  it's a distinction without a difference.  17:20
<sean-k-mooney> fungi: well, not really; i was more thinking just start the zuul api in process  17:20
<corvus> and, mind you, in openstack's case, that process would take about 10 minutes.  17:20
<sean-k-mooney> then asking it via its api  17:20
<corvus> with a fleet of 30 servers  17:20
<corvus> so on a desktop, would probably take like an hour  17:20
<fungi> sean-k-mooney: and a merger process (or many) to collect and feed in all the distributed configuration  17:21
<sean-k-mooney> fungi: yep  17:21
<sean-k-mooney> it's currently embedding the executor, right?  17:21
<sean-k-mooney> which acts as a merger  17:21
<sean-k-mooney> but anyway, that's out of scope for now  17:21
<sean-k-mooney> it's not that hard to run zuul  17:22
<sean-k-mooney> especially with the docker compose. and if we have it downstream and we are just editing an existing job we can ask it  17:22
<avass> fungi, sean-k-mooney: something like: https://review.opendev.org/c/zuul/zuul-jobs/+/728684, it worked really well for library projects like zuul-jobs since everything is contained to the repo. though it wouldn't be able to handle complex logic to figure out which variant to use if there are multiple branches etc  17:23
<fungi> it may be that someone can eventually work out a way to slim down a lightweight zuul deployment aimed solely at providing a local api to query for zuul-runner use, but i expect that's nontrivial  17:23
<fungi> and as pointed out, you'd still want to run it independent of your individual zuul-runner calls, because of the startup indexing cost  17:24
<fungi> i don't think that can be optimized away, though maybe a persistent cache could allow you to stop and start it  17:25
<sean-k-mooney> fungi: ya, it's normally not that bad for a small deployment, but even my slimmed-down third party ci was 5-10 minutes  17:25
* sean-k-mooney given we have downstream jobs that take more than 6 hours to run, not sure i would notice :)  17:25
<sean-k-mooney> fungi: a big part of the initial start up time is cloning the repos, which would be done once  17:26
<sean-k-mooney> granted, all the internal zuul representation of the jobs would still have to be built  17:27
<sean-k-mooney> and unless you serialised that and cached it you would always have to pay that  17:27
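That serialise-and-cache idea could be sketched like this (purely hypothetical; zuul has no such interface, and `build_fn` stands in for the expensive full-configuration build):

```python
import hashlib
import json
import pathlib
import pickle


def cache_key(repo_states):
    """repo_states: {repo_name: commit_sha}; a stable digest of them."""
    blob = json.dumps(repo_states, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def load_or_build(repo_states, build_fn, cache_dir):
    """Return the cached layout for these repo states, building it once.

    If every repo is at the same commit as a previous run, the layout
    is loaded from disk instead of being rebuilt from scratch.
    """
    path = pathlib.Path(cache_dir) / cache_key(repo_states)
    if path.exists():
        return pickle.loads(path.read_bytes())
    layout = build_fn()  # the expensive part you want to pay only once
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(layout))
    return layout
```

In practice the hard part is invalidation: any branch creation, merge, or config repo update changes the key, which is exactly the indexing work fungi describes below.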
<fungi> well, even just restarting the zuul scheduler in opendev takes 10+ minutes to ask the mergers to find any updated refs and pass the updated configuration back  17:27
<fungi> and that's with caches of repositories on the mergers  17:27
<sean-k-mooney> for a zuul-runner you probably don't need to ask it to update the refs on startup  17:28
<avass> yeah, we take something like 15min to restart  17:28
<sean-k-mooney> avass: yep, which is why having a zuul running with docker compose is not that unreasonable an overhead  17:29
<sean-k-mooney> or pointing it at our downstream ci  17:29
<fungi> v5 will solve a lot of that if you have a scheduler cluster, since it will be rare that you restart the entire cluster at once and so don't have to worry about missing events and getting out of sync  17:30
<sean-k-mooney> v5? a joke or an actual thing?  17:30
* sean-k-mooney wonders if i missed a v4  17:30
<fungi> you did  17:31
<avass> oh, that's 15 min running a couple executors; that would probably take a lot longer starting up locally since it's pretty much one big repo with close to 1000 branches that is the reason for that startup cost  17:31
<sean-k-mooney> avass: you don't need all the repos  17:31
<fungi> v5 is where we move from gearman to zk for coordinating zuul service interactions, so we can scale out the scheduler  17:31
<fungi> work is well in progress on it  17:31
*** jcapitao has quit IRC  17:31
<sean-k-mooney> my third party ci is offline currently due to capacity issues  17:31
<avass> sean-k-mooney: ah, in that case it would be faster, yes  17:31
<sean-k-mooney> but it starts up with enough for devstack in under 5 minutes, less than 10 for ooo  17:32
<fungi> gerrit's checks implementation will probably also allow us to make some assumptions because it will cache events server side, so maybe we can assume cached gerrit configuration is in sync at restart and then roll forward the stashed events, but i don't know that the same could be said for other connection drivers  17:33
<sean-k-mooney> http://paste.openstack.org/show/803783/  17:33
<sean-k-mooney> that is what my third party ci was using  17:33
<sean-k-mooney> but to avoid all the repos i had to do some hacks in the tempest job  17:34
<sean-k-mooney> https://github.com/SeanMooney/ci-sean-mooney/blob/main/zuul.d/jobs.yaml  17:34
<sean-k-mooney> basically i needed to define my own version of tempest-all/full  17:35
<corvus> sean-k-mooney: there have been 2 4.x releases; if you aren't subscribed to the zuul-announce mailing list, you may want to consider it.  there is important information for operators about the v4/v5 roadmap.  17:38
<sean-k-mooney> corvus: sorry, was distracted by the fact tempest seems to have fixed their viral import mess  17:40
<sean-k-mooney> importing jobs from tempest no longer imports half of openstack  17:41
<corvus> neat  17:41
<sean-k-mooney> it used to import all the plugins, which imported every project that had a tempest plugin, if i remember correctly  17:42
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: reconcile config objects  https://review.opendev.org/c/zuul/nodepool/+/779898  17:43
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: Handle IPv6  https://review.opendev.org/c/zuul/nodepool/+/780400  17:43
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: don't require full subnet id  https://review.opendev.org/c/zuul/nodepool/+/780402  17:43
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: add quota support  https://review.opendev.org/c/zuul/nodepool/+/780439  17:43
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: implement launch retries  https://review.opendev.org/c/zuul/nodepool/+/780682  17:43
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: implement support for diskimages  https://review.opendev.org/c/zuul/nodepool/+/781187  17:43
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: handle leaked image upload resources  https://review.opendev.org/c/zuul/nodepool/+/781855  17:43
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: use rate limiting  https://review.opendev.org/c/zuul/nodepool/+/781856  17:43
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: fix race in leaked resources  https://review.opendev.org/c/zuul/nodepool/+/781924  17:43
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: replace driver with state machine driver  https://review.opendev.org/c/zuul/nodepool/+/781925  17:43
<openstackgerrit> James E. Blair proposed zuul/nodepool master: Azure: update documentation  https://review.opendev.org/c/zuul/nodepool/+/781926  17:43
<corvus> i seem to have misplaced a patch in that series :/  17:45
<fungi> checked under the sofa cushions?  17:45
<corvus> yep, that's where it was  17:46
<corvus> i'm going to have to clean it off  17:46
<sean-k-mooney> by the way, i assume the openstack ci is moving to zuul v4 or already has?  17:47
<sean-k-mooney> and will continue to track a recent major version of zuul  17:47
<fungi> 4.1.0 currently, yes  17:47
<corvus> 4.1.0++ even  17:47
<sean-k-mooney> cool, reading the 4.1.0 release notes  17:47
<avass> speaking of, I should probably upgrade my own instance :)  17:47
<fungi> sean-k-mooney: https://zuul.opendev.org/t/openstack/status "Zuul version: 4.1.1.dev23 68ba89c6"  17:48
<corvus> avass: yeah, if not tracking master, at least tracking releases would probably be a really good idea :)  17:48
<sean-k-mooney> ah, at the bottom  17:48
<sean-k-mooney> ya, i go to that page often but didn't know the version was there  17:49
<avass> now running 4.1.1.dev23 68ba89c6!  17:49
<sean-k-mooney> fungi: well, belated congratulations to all of you on v4  17:49
<sean-k-mooney> i think nova may have finally got rid of our last non-v3-native job this cycle  17:50
<sean-k-mooney> grenade was the main last blocker  17:50
<sean-k-mooney> although we still have a slight parity gap with ceph that is pending  17:50
<avass> corvus: heh, yeah, we're like two people using it so it's quite easy to keep it up to date since downtime has pretty much no consequence :)  17:51
<fungi> yes, the openstack tact sig is excited at the prospect of letting devstack-gate finally die quietly  17:51
<corvus> avass: oh *that* instance, i thought you meant that other instance with more than 2 people. :)  17:52
<sean-k-mooney> ... we still haven't moved one job https://github.com/openstack/nova/blob/master/.zuul.yaml#L288-L305 which is nova-grenade-multinode  17:52
<sean-k-mooney> but that is close  17:52
<avass> corvus: that one very often keeps up to date with master as well ;)  17:53
<fungi> okay, i rechecked zuul-jobs change 771106 and the scheduler seems to think we shouldn't run any jobs on that... or am i misreading the debug log? http://paste.openstack.org/show/803785  17:54
<fungi> worried that if i try to add debug reporting temporarily, that will heisenbug and trigger jobs because of matching different file filters  17:55
<fungi> log there shows it matched the comment pattern for enqueuing into the check pipeline  17:57
<fungi> or is "Project zuul/zuul-jobs not in pipeline <Pipeline check> for change <Change 0x7f270e53f9d0 zuul/zuul-jobs 771106,2>" not expected, i wonder  17:58
<avass> fungi: I think it should run jobs but the scheduler for some reason doesn't agree :)  17:59
<fungi> d'oh! i filtered on the wrong event id i think, not the zuul tenant  18:02
<corvus> was about to say the same  18:03
* fungi starts over  18:03
<fungi> i totally forgot there would be more than one event id because that project is in multiple tenants  18:03
*** jpena is now known as jpena|off  18:04
<corvus> i think the event id should be the same  18:05
<fungi> yeah, looks like it  18:05
*** hashar has quit IRC  18:05
<corvus> but maybe that paste is just filled with the openstack/opendev lines  18:05
<fungi> so why was there no zuul.Pipeline.zuul.check evaluation, i wonder  18:05
<corvus> oh, nope, that's the whole thing.  18:06
<avass> could this be relevant? https://zuul.opendev.org/t/zuul/config-errors  18:06
<fungi> indeed it could, i didn't see that alarming earlier  18:06
<fungi> is this a recurrence of the ghost errors i saw a week or two back, which disappeared after a smart reconfig?  18:08
<corvus> that tenant appears to have a check pipeline with the zuul-jobs repo attached to it  18:08
<corvus> this could be an issue: http://paste.openstack.org/show/803787/  18:09
<fungi> er, yeah  18:10
<fungi> i always forget to filter for tracebacks since they lack event association  18:10
<fungi> so change was None? that can't be right  18:11
<corvus> i don't know if that's related  18:11
<fungi> er, yeah, that was earlier  18:11
<fungi> i rechecked at 19:10:26  18:12
<corvus> i think that's probably harmless (but needs to be fixed)  18:12
<corvus> i think that happens on replication events  18:12
<avass> that has actually been a really big plus for the splunk reporter so far: event_id and build is attached to every log entry (where it's applicable)  18:12
<fungi> scheduler restart was around 14:38  18:13
<fungi> er, sorry, my recheck comment was at 15:46:19  18:14
<fungi> so hours prior to that exception  18:14
<corvus> my concern is that if it happens on replication events, it may happen on others too (like tags)  18:14
<corvus> nope, this should only be for replication events  18:17
<corvus> tristanC: i have good news!  there is a longstanding behavior in zuul related to replication events which i think means that no one could be using them in pipeline triggers.  18:18
<mordred> corvus: oh?  18:19
<fungi> hah, fortuitous bug ;)  18:19
<corvus> yeah, i'm double checking that now  18:19
<corvus> mordred, fungi, tristanC: the ref-* events from gerrit all hit this line: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L745  18:20
<corvus> which means there's nothing for the pipelines to match on  18:21
<corvus> so i think we can merge tristanC's change to remove those immediately  18:21
<mordred> corvus: well - in that case removing the replication events - yeah - should make logs better  18:22
<fungi> i concur  18:22
<corvus> and then also, we need to fix the new event handling to handle a null return (which should fix that traceback in the scheduler)  18:22
<avass> what does this comment refer to then? https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L719  18:22
<corvus> fungi: and i'm pretty sure none of this has anything to do with your job issue, so back to that...  18:22
<fungi> corvus: yeah, i'm running down the config errors now  18:22
<corvus> avass: sorry, i was sloppy -- i meant ref-replicated-* events  18:23
<fungi> i strongly suspect this is a recurrence of what we saw briefly last week or the week before  18:23
<corvus> avass: ref-updated is legit and correctly handled  18:23
<avass> ah  18:23
<corvus> avass: er, i mean ref-replication-* :)  18:23
<fungi> https://opendev.org/openstack/project-config/src/branch/master/zuul/main.yaml#L1631-L1640  18:23
<fungi> we clearly include [] from openstack/project-config in the zuul tenant  18:24
<fungi> so it should not be complaining about pipeline definitions from that repo  18:24
<corvus> tristanC: if you would care to un-minus-2 your change https://review.opendev.org/780809 i think we can merge it asap.  sorry i missed that earlier.  18:25
<fungi> is it somehow assembling configuration incorrectly at start, which then gets corrected at the next smart-reconfigure?  18:26
* fungi checks whether there was a scheduler restart fairly close to when i saw this last time  18:27
<fungi> 2021-03-01 23:07:49 is when i mentioned it in here last time  18:28
<fungi> most recent mention of a zuul restart in our status log prior to that was on 2021-02-20, so not a smoking gun  18:29
<avass> fungi: maybe it's enough with a full-reconfig?  18:29
<fungi> yeah, really anything's possible i guess  18:30
<corvus> fungi: if you want to try a smart-reconfig and see if it changes i have no objection  18:30
<corvus> fungi: i don't think the lack of jobs is specific to that gentoo change; the nodepool changes i uploaded a while ago are not enqueued either.  so we're probably looking at a systemic issue, either related to the config errors you and avass observed, or an as-yet-unidentified bug in the trigger-event-queue-in-zk changes  18:32
<corvus> i have to run now; i will be back after lunch  18:33
*** jpenag has joined #zuul  18:39
<fungi> i've initiated one now  18:41
<fungi> 2021-03-22 18:31:49,126 DEBUG zuul.CommandSocket: Received b'smart-reconfigure' from socket  18:41
*** jpena|off has quit IRC  18:41
<fungi> it logged "Reconfiguration complete" at 18:31:49,336  18:41
<fungi> oh, that was "Smart reconfiguration of tenants []..." so it didn't reconfigure anything  18:41
<fungi> full-reconfigure maybe?  18:41
<fungi> https://zuul.opendev.org/t/zuul/config-errors is still clearly reporting errors for things which shouldn't happen  18:41
<fungi> corvus: yep, i agree, it seems like it's not evaluating any zuul tenant events because of the configuration errors for the tenant  18:42
<fungi> i'm about to start heating up some food myself, but will keep poking at logs to see if i can spot any other hints  18:45
openstackgerritAlbin Vass proposed zuul/nodepool master: WIP: Digitalocean driver  https://review.opendev.org/c/zuul/nodepool/+/75955918:51
avasscorvus: would you recommend moving the digital ocean driver to use the state machine driver or is the simple driver fine?18:54
fungithe restart begins at 2021-03-22 14:30:47,174 in our debug log. at 14:37:27,654 (during startup config loading) it complained "WARNING zuul.ConfigLoader: Zuul encountered a syntax error while parsing its configuration in the repo openstack/project-config on branch master.  The error was: Pipelines may not be defined in untrusted repos..."18:59
fungiso it definitely was concerned about this at start19:00
fungistat reporting exception, almost certainly unrelated, but curious if we have a patch up for it yet: http://paste.openstack.org/show/803789/19:03
fungilooks like arrived is None19:04
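Guessing at the shape of that stats exception (purely hypothetical names; the actual traceback is in the paste above): a processing-time metric computed from an `arrived` timestamp that was never set, e.g. for items enqueued out of band. A defensive sketch:

```python
import time


def report_processing_time(arrived):
    """Hypothetical sketch of emitting an event-processing-time stat.

    If an item was enqueued without going through the normal trigger
    path (e.g. a scripted bulk re-enqueue), its 'arrived' timestamp may
    be None and the subtraction would raise TypeError; guard and skip
    the stat instead of crashing.
    """
    if arrived is None:
        return None  # skip the stat rather than raise
    return time.time() - arrived
```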
corvusavass: state machine; i'd like to move gce to it and drop simple19:06
corvusavass: should be easy19:06
corvusfungi: i'm still somewhat inclined to think that these may be 2 separate issues; occams razor says if we're not seeing trigger events then the change to move trigger events into zk may be relevant.  :)19:08
fungicorvus: yep, absolutely19:08
corvus(i'm not excluding the idea they may be related; just thinking right now it's like 75% probability they are distinct)19:08
fungii guess it *should* enqueue for any repository not in a syntax error state?19:09
fungiunless one of the error repositories is a required-project?19:09
corvuslooking at the running config, i don't see any reason why it shouldn't be enqueued right now19:10
corvus(the error about pipelines is from a repo where we don't expect pipelines anyway)19:10
fungibut it will refuse to add that project, right? so if it's declared as a required-project it can't be used in that state? not saying that's the case for every zuul-jobs job so something still should have been enqueued19:11
corvusi don't think a config error on a project should prevent it from being used as a required project; at most, it would cause zero config objects to be loaded from that project.19:12
fungiahh, okay, so even less likely to be related19:13
corvusyep, since zero of the projects which are showing config errors in the zuul tenant are used by zuul-jobs19:13
avasscorvus: ok, I'll take a look at that soon then. I realized that building a game client inside the same k8s as zookeeper and zuul on cheap nodes isn't a good idea :)19:13
fungibuilding a game client is always a good idea19:14
avasswell you don't actually get to build anything if something always gets OOM killed19:14
fungiand yeah, so far the only exceptions i'm seeing logged are the one from replication events, and the stats emitting one i pasted above19:15
funginow wondering if the stats exception is related to enqueuing things with zuul-enqueue rpc19:17
fungiseems to consistently follow "Adding change" lines, and looks related to our bulk scripted reenqueue19:18
*** harrymichal has quit IRC19:19
*** harrymichal has joined #zuul19:20
*** harrymichal has joined #zuul19:20
fungiinteresting, our opendev-prod-hourly pipeline (timer triggered) is raising run handler exceptions with "AttributeError: 'Branch' object has no attribute 'number'"19:20
*** hamalq has joined #zuul19:27
avassokay weird, I seem to only be able to run jobs for a PR once per scheduler restart19:50
corvusavass: 4.1.0 or master?19:51
avassmaster19:51
corvusavass: can you try 4.1.0?19:52
avasssure19:52
fungii see opendev's openstack tenant enqueuing things, but https://zuul.opendev.org/t/zuul/builds suggests nothing has been enqueued for the zuul tenant for 16+ hours19:52
corvusavass: i suspect it will still fail and i would start with I55df1cc28279bb6923e51686dde8809421486c6a as a suspected cause19:53
corvusfungi: i think i've got it19:55
fungiregression in today's restart?19:55
avasscorvus: 4.1.0 works19:55
corvusavass: oh nice -- runs twice after restart on 4.1.0?19:56
avassyeah19:56
corvusavass: ok, it *could* be the same thing we're seeing in opendev; let me explain, then let's see if your situation could fit19:56
fungiyay for not releasing with whatever brought in that bug ;)19:56
corvusfungi, avass: there's an exception in our scheduler logs: AttributeError: 'Branch' object has no attribute 'number'19:57
corvusthat's being hit during the new event forwarding code (where we have a 2-stage event queue system: a global queue, then we forward events to individual tenant queues)19:58
fungiokay, so that's the same one i saw for one of our opendev hourly pipelines19:59
corvusif one of the tenants has an item in a pipeline that is not a change (ie, a branch tip or tag or other ref -- probably a post or periodic pipeline), then we hit that exception and abort forwarding the event to any other tenants19:59
avassI'm not seeing that error however19:59
avassor I don't think I do20:00
corvusso basically, the new event forwarding is stopping for us at the opendev tenant.  any tenant in our list of tenants after (and somewhat including) opendev is not getting events20:00
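corvus's description can be sketched roughly like this (a hypothetical illustration with made-up names, not the actual Zuul code, which lives in the scheduler's event-forwarding path):

```python
class Branch:
    """A post/periodic pipeline item: a branch tip with no change number."""
    ref = "refs/heads/master"


class Change:
    """A code-review pipeline item: carries a change number."""
    def __init__(self, number):
        self.number = number


def forward_event_buggy(event_number, tenants):
    """Forward one global trigger event to each tenant queue, in order.

    Bug: assumes every enqueued item is a Change.  A Branch item in any
    tenant raises AttributeError ('Branch' object has no attribute
    'number') and aborts forwarding to all later tenants.
    """
    delivered = []
    for name, items in tenants:
        for item in items:
            if item.number == event_number:  # blows up on a Branch
                pass  # here the event would be enqueued for this tenant
        delivered.append(name)
    return delivered


def forward_event_fixed(event_number, tenants):
    """Guarded version: items without a change number are skipped."""
    delivered = []
    for name, items in tenants:
        for item in items:
            if getattr(item, "number", None) == event_number:
                pass  # enqueue as above
        delivered.append(name)
    return delivered
```

With tenants ordered like opendev's list, the buggy version never gets past the first tenant holding a non-change item, which matches the observation that the openstack tenant kept enqueuing while the zuul tenant saw nothing.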
* avass does another restart20:00
corvusavass: hrm.  any other exceptions in your scheduler log?  maybe a slightly different error is being hit and causing it.20:00
avasscorvus: no, no error at all20:03
corvusavass: what's the sequence?  are you hitting the recheck button in github for the already-run check?20:06
avassrecheck -> job runs and gets NODE_FAILURE -> recheck -> nothing, and the logs don't really say anything either20:07
avassworking on paste but I guess you don't want 800 lines of raw scheduler logs :)20:07
corvusthe more the better as far as i'm concerned, but opendev's pastebin does have a size limit; if you hit it, you can split it in 2 :)20:08
fungiyeah, i've used the split -l trick to get a log into multiple pastes plenty of times20:13
fungii think the limit in our paste server is actually the max field size in the mysql db backing it20:14
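The `split -l` trick fungi mentions, as a quick sketch (file names are illustrative):

```shell
# Make a sample oversized log, then split it into fixed-size chunks;
# in practice you'd pick a count like 2000 lines to fit a pastebin limit.
printf '%s\n' line1 line2 line3 line4 line5 > debug.log
split -l 2 debug.log part_
wc -l part_*   # part_aa and part_ab get 2 lines each, part_ac gets 1
```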
avasssec, manjaro seems to want to upgrade all system packages.20:15
avassalso had to fill out an impossible captcha because my paste contains spam :)20:16
avasshttp://paste.openstack.org/show/803796/20:16
avassI suppose that silently truncated my logs20:18
avass2/4: http://paste.openstack.org/show/803799/20:19
avass3/4: http://paste.openstack.org/show/803800/20:20
avass4/4: http://paste.openstack.org/show/803801/20:20
corvusavass: and when was the second recheck event?20:20
avasscorvus: last paste. somewhere around line 1020:21
corvusavass: are you sure it's not somewhere between paste 3 and 4?20:22
avasscorvus: probably line 18: [e: 98be8bb0-8b49-11eb-935a-fabeee8c4b40] Github Webhook Received20:22
avasscorvus: that's the first event probably20:23
corvusavass: okay, so you're leaving a "recheck" comment?20:23
avassyeah20:23
avassfirst one works, second one doesn't20:23
fungiand second one was after the first reported?20:23
avassI believe so yes20:23
avassfungi: just left another comment and that produces the same result20:24
fungigot it20:25
corvusavass: i see a "Run handler awake" log in #3 and nothing after that; suspect your scheduler may be stuck in result event processing20:26
corvusavass: the node request always fails?20:27
avasscorvus: trying to figure out why digital ocean doesn't like the image it's configured to use, so yep20:28
corvusokay, so the reproducer for this might be as simple as "enqueue a change with a node request that fails"; that does also seem to point to the bug being related to event-queues in zk.20:29
corvusi'll work on fixing both of these20:31
corvusfungi: for opendev, i think we need to restart on 4.1.020:32
corvusthere's no good way to get an image with fixes without doing that20:33
corvusfungi: do you have time to do that, or should i?20:33
fungicorvus: sorry, had to step away for a moment, but catching back up now20:48
corvusfungi: it's a little more complex so -> #opendev20:49
fungiyup20:49
*** ajitha has quit IRC21:15
*** jangutter has joined #zuul21:22
*** jangutter_ has quit IRC21:25
corvusavass: you weren't running any unmerged patches on that instance, right?  the bug you saw was with plain old upstream master?22:42
corvusavass: (in particular, no async reporting patch?)22:42
corvusavass: also, it really looks like there are some lines missing between pastes #3 and #4 -- can you check on that?22:45
*** harrymichal_ has joined #zuul22:51
*** harrymichal has quit IRC22:52
*** harrymichal_ is now known as harrymichal22:52
openstackgerritJames E. Blair proposed zuul/zuul master: Fix trigger event forwarding bug  https://review.opendev.org/c/zuul/zuul/+/78233522:59
openstackgerritJames E. Blair proposed zuul/zuul master: WIP: Try to repro recheck failure  https://review.opendev.org/c/zuul/zuul/+/78233622:59
corvusfungi, avass, tobiash, swest: 782335 is a successful fix of the issues we saw in production today; i think with that merged, we can restart opendev's zuul on master again. (you can read the commit msg for the tl;dr on issues)23:00
*** nils has quit IRC23:00
corvusfungi, avass, tobiash, swest: 782336 is an unsuccessful attempt to repro the issue avass described; i think we need more data there.23:01
*** holser has quit IRC23:38
*** harrymichal has quit IRC23:39
*** rlandy has quit IRC23:43

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!