*** tosky has quit IRC | 00:26 | |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: reconcile config objects https://review.opendev.org/c/zuul/nodepool/+/779898 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: Handle IPv6 https://review.opendev.org/c/zuul/nodepool/+/780400 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: don't require full subnet id https://review.opendev.org/c/zuul/nodepool/+/780402 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: add quota support https://review.opendev.org/c/zuul/nodepool/+/780439 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: implement launch retries https://review.opendev.org/c/zuul/nodepool/+/780682 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: implement support for diskimages https://review.opendev.org/c/zuul/nodepool/+/781187 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: handle leaked image upload resources https://review.opendev.org/c/zuul/nodepool/+/781855 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: use rate limiting https://review.opendev.org/c/zuul/nodepool/+/781856 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: fix race in leaked resources https://review.opendev.org/c/zuul/nodepool/+/781924 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: replace driver with state machine driver https://review.opendev.org/c/zuul/nodepool/+/781925 | 01:54 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: update documentation https://review.opendev.org/c/zuul/nodepool/+/781926 | 01:54 |
openstackgerrit | Merged zuul/zuul-jobs master: Revert "Pin to npm4 until npm 5.6.0 comes out" https://review.opendev.org/c/zuul/zuul-jobs/+/781966 | 02:13 |
openstackgerrit | Merged zuul/zuul-jobs master: Clone nvm from official source https://review.opendev.org/c/zuul/zuul-jobs/+/781970 | 02:17 |
fungi | ianw: mordred: i just exercised those two ^ successfully with https://zuul.opendev.org/t/openstack/build/042af4adb20c472583c278cd365a12ab so looks like it's finally working. thanks!!! | 02:54 |
ianw | fungi: oh good, 4 months is a long time in the javascript world, let alone 4 years :) | 02:55 |
fungi | yup! | 02:57 |
*** evrardjp has quit IRC | 03:33 | |
*** evrardjp has joined #zuul | 03:33 | |
*** raukadah is now known as chkumar|ruck | 05:02 | |
*** vishalmanchanda has joined #zuul | 05:11 | |
*** ykarel has joined #zuul | 05:28 | |
*** ykarel has quit IRC | 05:29 | |
*** ykarel has joined #zuul | 05:29 | |
*** jfoufas1 has joined #zuul | 05:36 | |
*** ajitha has joined #zuul | 05:45 | |
*** hashar has joined #zuul | 07:54 | |
*** jcapitao has joined #zuul | 08:01 | |
*** rpittau|afk is now known as rpittau | 08:19 | |
*** wxy has joined #zuul | 08:47 | |
*** tosky has joined #zuul | 08:49 | |
*** jpena|off is now known as jpena | 08:57 | |
*** phildawson has joined #zuul | 08:58 | |
*** holser has joined #zuul | 09:02 | |
*** ykarel is now known as ykael|lunch | 09:21 | |
*** saneax has joined #zuul | 09:25 | |
*** nils has joined #zuul | 09:28 | |
*** harrymichal has joined #zuul | 09:48 | |
*** ykael|lunch is now known as ykael | 10:03 | |
*** ykael is now known as ykarel | 11:25 | |
*** rlandy has joined #zuul | 11:41 | |
*** holser has quit IRC | 11:54 | |
*** holser has joined #zuul | 11:57 | |
*** ykarel_ has joined #zuul | 11:58 | |
*** ykarel has quit IRC | 12:00 | |
*** phildawson has quit IRC | 12:05 | |
*** ykarel_ is now known as ykarel | 12:09 | |
*** jcapitao is now known as jcapitao_lunch | 12:13 | |
*** hashar is now known as hasharLunch | 12:17 | |
*** ykarel_ has joined #zuul | 12:50 | |
*** jpena is now known as jpena|lunch | 12:50 | |
*** ykarel has quit IRC | 12:50 | |
*** hasharLunch is now known as hashar | 12:55 | |
*** jcapitao_lunch is now known as jcapitao | 13:07 | |
*** ykarel_ has quit IRC | 13:37 | |
*** ykarel_ has joined #zuul | 13:37 | |
*** jpena|lunch is now known as jpen | 13:51 | |
*** jpen is now known as jpena | 13:51 | |
*** jhesketh has quit IRC | 14:02 | |
*** jhesketh has joined #zuul | 14:04 | |
openstackgerrit | Jeremy Stanley proposed zuul/zuul-jobs master: DNM: test role job triggering https://review.opendev.org/c/zuul/zuul-jobs/+/782263 | 14:13 |
*** kmalloc has joined #zuul | 14:18 | |
mordred | fungi: woot | 14:22 |
corvus | i'm going to work on restarting opendev's zuul | 14:26 |
corvus | and once that's done, i'll send out the release announcements | 14:30 |
openstackgerrit | Jeremy Stanley proposed zuul/zuul-jobs master: Revert "Temporarily stop running Gentoo base role tests" https://review.opendev.org/c/zuul/zuul-jobs/+/771106 | 14:34 |
*** wxy has quit IRC | 14:36 | |
fungi | thanks corvus! | 14:38 |
*** jangutter_ has joined #zuul | 14:39 | |
*** jangutter has quit IRC | 14:42 | |
*** ykarel_ is now known as ykarel | 14:49 | |
corvus | tristanC: looking at https://grafana.opendev.org/d/5Imot6EMk/zuul-status?orgId=1 we're graphing the new event processing time; occasionally zuul stalls for 5 minutes, presumably it's doing a bunch of reconfiguration/re-enqueue work then. it may be tricky to separate out "normal" event queue performance from those spikes. but we could probably try to see if the baseline increases. | 14:51 |
*** harrymichal has quit IRC | 14:52 | |
avass | corvus: some of those peaks could also be python gc right? | 14:53 |
avass | not sure how long that can stall the system tbh | 14:53 |
corvus | avass: i think probably not -- most of python's memory mgmt is ref-counter based so it happens more gradually and synchronously (the gc runs rarely and only cleans up cycles). but even so, that would still be fractions of a second, not 5 minutes. | 14:56 |
corvus | 5 minutes is definitely "reconfigure the openstack tenant and re-enqueue everything" :) | 14:56 |
corvus | like, full reconfiguration | 14:56 |
corvus | that's about how long a startup takes | 14:56 |
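corvus's point above (refcounting frees objects the moment the last reference goes away; the rarely-run cyclic collector only reclaims reference cycles, and neither explains a 5-minute stall) can be demonstrated in a short standalone CPython sketch; `Node` is just an illustrative class:

```python
import gc
import weakref

# Deterministic demo: disable the cyclic collector so only an
# explicit gc.collect() call reclaims cycles.
gc.disable()

class Node:
    pass

# Refcount-based cleanup: the object is freed the instant its last
# reference disappears, with no collector pass involved.
obj = Node()
ref = weakref.ref(obj)
del obj
assert ref() is None          # already gone

# A reference cycle outlives its last external reference; only the
# cyclic collector can reclaim it.
a, b = Node(), Node()
a.other, b.other = b, a
ref_a = weakref.ref(a)
del a, b
assert ref_a() is not None    # the cycle keeps it alive
gc.collect()                  # explicit cycle collection
assert ref_a() is None        # now reclaimed

gc.enable()
```

On CPython the first object dies synchronously with its last reference, while the cycle survives until the collector runs; either way the pauses involved are far below five minutes.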
tristanC | corvus: i see, perhaps we should differentiate management from regular event | 14:56 |
avass | ah make sense | 14:56 |
corvus | tristanC: well, i think our trigger events will just be sitting in the queue during that time | 14:57 |
corvus | tristanC: so i think it's more like filtering out the data during times we know that the system is frozen | 14:57 |
corvus | anyway, i don't know if we need to automate that, or if we can just look at the graph and see the patterns | 14:58 |
corvus | just something to keep in mind | 14:58 |
fungi | i suppose it also depends on what you expect from that graph. regardless of what's causing the delays in event processing, it's an accurate depiction of how long events waited to be processed | 14:58 |
corvus | fungi: yeah, though i know one hope is to see if moving the queues to zk affected processing time | 14:59 |
corvus | but i think that was an implicit "processing time during the times where zuul was not frozen due to reconfigurations" :) | 14:59 |
fungi | we could probably still compare averages, main concern is probably the limited pre-upgrade sample size | 15:00 |
corvus | fungi: yep | 15:00 |
corvus | also, we could probably discard times > 60 seconds | 15:00 |
avass | well those peaks could also shoot up | 15:01 |
corvus | as long as the baseline isn't near that cutoff, we could probably filter like that | 15:01 |
fungi | that might help, though you'd still end up skewing results with the events which arrived in the last minute of the pause | 15:01 |
corvus | yeah. for now, i think i'll rely on wetware visual pattern matching :) | 15:02 |
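the >60s cutoff corvus suggests can be sketched with invented numbers (none of these are real opendev measurements; the 300s samples stand in for events stuck behind a ~5 minute reconfiguration):

```python
import statistics

# Hypothetical event-processing wait times in seconds.
samples = [0.4, 0.6, 0.5, 300.0, 0.7, 0.5, 300.0, 0.6]

CUTOFF = 60.0  # discard anything that waited out a reconfiguration
baseline = [s for s in samples if s < CUTOFF]

print(f"raw mean:      {statistics.mean(samples):.2f}s")   # skewed by stalls
print(f"baseline mean: {statistics.mean(baseline):.2f}s")  # steady-state
```

As fungi notes, events arriving in the final minute of a pause would still slip under the cutoff and skew the baseline slightly, so this is a rough filter, not a clean separation.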
avass | oh you don't want to model it like a control system and come up with a non-linear equation describing how the system behaves before and after? :) | 15:03 |
*** harrymichal has joined #zuul | 15:03 | |
corvus | we should get an intern | 15:03 |
fungi | an intern who can also stand up two copies of infrastructure for side-by-side benchmarking | 15:04 |
fungi | i'm sure it would make an excellent thesis | 15:04 |
corvus | ++ | 15:05 |
corvus | i don't see any anomalies so far; i'm going to grab some breakfast and check back in a bit | 15:07 |
tristanC | corvus: not sure how to translate that in grafyaml, but in graphite there is a `std` version which might be useful to smooth the spikes, e.g.: https://graphite.opendev.org/?width=1272&height=837&lineMode=connected&yUnitSystem=msec&target=stats.timers.zuul.tenant.openstack.event_enqueue_processing_time.mean&target=stats.timers.zuul.tenant.openstack.event_enqueue_processing_time.std | 15:19 |
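for context, those two graphite targets are the per-flush mean and standard deviation that statsd computes for the timer; a toy illustration (invented numbers) of why the std series is a handy stall indicator alongside the mean:

```python
import statistics

# Two hypothetical statsd flush windows of event timings in ms.
quiet = [420, 460, 440, 430]
spiky = [420, 460, 440, 300_000]  # one ~5 minute stall in the window

for name, window in (("quiet", quiet), ("spiky", spiky)):
    mean = statistics.mean(window)
    std = statistics.pstdev(window)  # population std, as statsd reports
    print(f"{name}: mean={mean:.0f}ms std={std:.0f}ms")
```

A window containing a stall blows up the std by orders of magnitude while quiet windows keep it near zero, so plotting both series makes the spikes easy to tell apart from baseline drift.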
corvus | ooh nice | 15:19 |
*** harrymichal has quit IRC | 15:24 | |
*** harrymichal has joined #zuul | 15:24 | |
*** ykarel has quit IRC | 15:26 | |
*** jfoufas1 has quit IRC | 15:53 | |
*** vishalmanchanda has quit IRC | 16:11 | |
*** kmalloc has quit IRC | 16:37 | |
*** sean-k-mooney has joined #zuul | 16:53 | |
sean-k-mooney | o/ | 16:53 |
sean-k-mooney | am i correct in saying that https://zuul-ci.org/docs/zuul/reference/developer/specs/zuul-runner.html was never implemented | 16:54 |
fungi | sean-k-mooney: ot | 16:54 |
corvus | sean-k-mooney: it's in progress | 16:54 |
fungi | it's... yeah | 16:54 |
sean-k-mooney | context is we were talking about downstream ci | 16:54 |
* fungi curses the adjacency of backspace and return on this keyboard | 16:54 | |
sean-k-mooney | and one of the things we are looking at is using zuul more | 16:54 |
sean-k-mooney | but one question that was asked was: what is the simplest way to write/debug a zuul job locally | 16:55 |
fungi | yeah, zuul-runner should help | 16:56 |
sean-k-mooney | i have always done that by either deploying zuul locally or just submitting a patch to the upstream gate with all other jobs disabled | 16:56 |
sean-k-mooney | fungi: does the design still require a zuul server? | 16:56 |
sean-k-mooney | i know it's fairly trivial with the docker compose file to get one | 16:56 |
fungi | sean-k-mooney: the current design requires you to ask the zuul server to resolve the collection of interdependent ansible playbooks/roles, yes | 16:57 |
sean-k-mooney | but i was wondering, could you run the scheduler in the cli tool and pass in the main.yaml etc | 16:57 |
corvus | it sort of depends on what you're writing/debugging -- if you want to write/test/debug the job's playbook, then you can run that (or whatever shell scripts it runs). if you're debugging the rendered zuul job which is composed of a complex hierarchy of playbooks from inhertited jobs, then, yeah, something like zuul-runner is necessary | 16:58 |
corvus | if you design jobs/roles/playbooks with reusability in mind, then it's pretty simple to run the main playbook of a job locally. if you have a deep system of inheritance (like devstack), then it can be hard to figure out what playbooks and roles to run | 16:59 |
sean-k-mooney | corvus: well part of the challenge with running the job playbook is all the infrastructure for generating the set of playbooks | 16:59 |
sean-k-mooney | corvus: right and ooo jobs make that much worse | 17:00 |
corvus | sean-k-mooney: so the context is downstream inheriting from tripleo jobs? | 17:00 |
sean-k-mooney | corvus: basically downstream for nova we have zuul running the tox jobs and jenkins running ooo via infrared | 17:00 |
sean-k-mooney | corvus: yes | 17:00 |
sean-k-mooney | we would like to run the upstream ooo job downstream with internal repos instead of rdo | 17:01 |
sean-k-mooney | but have a way to debug it | 17:01 |
fungi | out of curiosity how do you locally reproduce the jenkins jobs to debug them? or are you doing like we used to do and just avoiding relying on any fancy jenkins plugins so it's merely calling a shell script? | 17:01 |
sean-k-mooney | corvus: e.g. have a way to run the downstream job on our laptops | 17:01 |
sean-k-mooney | fungi: i don't, i try to do all backports upstream :) or else trigger it manually and have it hold a ci host | 17:02 |
corvus | sean-k-mooney: gotcha, yeah, that's definitely more in zuul-runner's wheelhouse. even the first patch in the zuul-runner series, which i think may be close to approval, would probably help with that, as it at least enables querying zuul for the list of playbooks to run. | 17:02 |
sean-k-mooney | corvus: would it be possible to use it to render a local directory with the playbooks etc and then run them manually | 17:03 |
sean-k-mooney | maybe i should read the patch | 17:03 |
corvus | sean-k-mooney: topic:freeze_job | 17:03 |
sean-k-mooney | thanks :) | 17:03 |
corvus | sean-k-mooney: it's a long series; starts with server-side api changes, then later implements cli tool to use them | 17:04 |
corvus | tristanC: ^ | 17:04 |
sean-k-mooney | we are in a somewhat strange situation downstream where all the upstream engineers know to some degree how zuul and particularly the in-project jobs work (based on devstack), whereas our qe understand the jobs they wrote in jenkins that are ooo/infrared based | 17:06 |
corvus | well, zuulv3 was certainly designed for downstream re-usability. being able to reuse some of that directly would be great. | 17:07 |
sean-k-mooney | yep we would like to reuse the upstream ooo jobs directly in our downstream patches ci which is basically like upstream openstack ci, e.g. based on patches to our downstream gerrit | 17:08 |
sean-k-mooney | which is basically just backports 99% of the time | 17:08 |
corvus | sean-k-mooney: also, you mentioned it in passing, but i sort of wouldn't underestimate the power of "throw up a patch and let zuul do its thing". that can be really nice in a downstream system where developers can hold their own nodes. | 17:08 |
fungi | though it seems like zuul was really going for reusability in the other direction (write playbooks you can run locally/in production, use those in your job definitions). if you design in the other direction (write playbooks for ci, then try to run them locally) that's a tougher nut to crack | 17:08 |
corvus | mhu has patches in flight to add buttons to the web ui for things like holding nodes | 17:09 |
sean-k-mooney | fungi: the issue is that there really is no documentation of how to run them locally | 17:09 |
corvus | well, i mean, they're playbooks, you just run them :) | 17:09 |
fungi | sean-k-mooney: right, well, that's part of writing them for local running first. you can add readmes/documentation | 17:09 |
tristanC | tobiash: would you mind checking the last comments of https://review.opendev.org/c/zuul/zuul/+/607078 | 17:09 |
sean-k-mooney | corvus: the issue is the onion-like nesting and figuring out how to generate the inventory files | 17:10 |
sean-k-mooney | corvus: just running them is not so trivial | 17:10 |
fungi | yeah, that's what zuul-runner should help with. you can look at the inventory for a reported build and then create something similar | 17:10 |
sean-k-mooney | fungi: i think that works for roles not for jobs | 17:11 |
corvus | sean-k-mooney: yeah, maybe i misunderstood -- i thought you were saying "zuul has no docs on how to run playbooks locally" and i don't think we can write that. but maybe you were saying "tripleo has no docs on how to run their playbooks locally" | 17:11 |
fungi | sean-k-mooney: it works for plays/playbooks as well as roles | 17:11 |
sean-k-mooney | fungi: not really | 17:11 |
sean-k-mooney | i know where you're coming from | 17:11 |
*** rpittau is now known as rpittau|afk | 17:11 | |
fungi | you can write playbooks which work locally, then add those to your job definition | 17:11 |
sean-k-mooney | but as a user of zuul my entry point is generally a job | 17:11 |
tristanC | sean-k-mooney: the zuul-runner spec is implemented through https://review.opendev.org/c/632064, the changes are mostly waiting for reviews | 17:11 |
sean-k-mooney | but to recreate the job i need to run all the pre and post playbooks in order around the run playbook that is listed in the job or its parents | 17:12 |
sean-k-mooney | + flatten all role vars into the appropriate input | 17:12 |
sean-k-mooney | so i can't really just run tempest-full | 17:13 |
sean-k-mooney | https://github.com/openstack/tempest/blob/master/zuul.d/integrated-gate.yaml#L32-L60 for example, it defines no pre, run, or post playbooks for me to run; they all come from the parent | 17:14 |
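the nesting sean-k-mooney describes can be modeled roughly like this (a toy sketch, not zuul's real freeze logic or data structures, and the filenames are hypothetical): pre-run playbooks nest parent-first, post-run child-first, and the most-derived job's run playbook wins.

```python
def freeze_playbooks(chain):
    """chain is ordered base job -> most-derived job."""
    # pre-run playbooks accumulate parent-first...
    pre = [p for job in chain for p in job.get("pre_run", [])]
    # ...post-run playbooks unwind child-first...
    post = [p for job in reversed(chain) for p in job.get("post_run", [])]
    # ...and the most-derived job that defines a run playbook wins.
    run = next((job["run"] for job in reversed(chain) if "run" in job), None)
    return pre + ([run] if run else []) + post

base = {"pre_run": ["base-pre.yaml"], "post_run": ["base-post.yaml"]}
middle = {"pre_run": ["devstack-pre.yaml"], "run": "devstack.yaml"}
leaf = {"run": "tempest.yaml", "post_run": ["collect-logs.yaml"]}

print(freeze_playbooks([base, middle, leaf]))
```

Even this toy version shows why a job like tempest-full can list no playbooks of its own yet still expand to a whole sequence, and why the real frozen result (variants, branches, roles, vars) has to come from asking zuul.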
tristanC | corvus: fungi: unfortunately it can be quite complex to "just run the playbooks" | 17:15 |
corvus | sean-k-mooney: yeah, that's basically a result of the way the tripleo jobs were designed. with deep nesting and playbooks at every level, you need a tool. it *is* possible to design the other direction, which is i think what fungi was saying. if you design for real-world use and then run that in ci, then you end up with a single "real-world" playbook nestled between some ci-specific pre and post run | 17:15 |
corvus | playbooks. | 17:15 |
corvus | tristanC: yeah, i mean, i think we're talking past each other here | 17:15 |
corvus | i totally agree that you need zuul-runner to figure out how to run a tripleo job | 17:15 |
sean-k-mooney | corvus: correct also the devstack jobs | 17:15 |
fungi | tristanC: it can be complex to "just run the playbooks" if they were not written with local running as a primary goal, i agree | 17:16 |
corvus | but since this conversation started in a more generic vein, we've also touched on the general problem of reusability | 17:16 |
sean-k-mooney | corvus: basically anything other than the tox jobs is very hard to figure out | 17:16 |
fungi | the jobs can be improved in that regard | 17:16 |
corvus | and in a world where you design for dev/prod reusability, zuul-runner isn't needed, because of that reusability. | 17:16 |
sean-k-mooney | i think the issue is most of the jobs that are listed in the pipelines are not written to be reused locally | 17:17 |
fungi | sean-k-mooney: bingo | 17:17 |
corvus | so i just wanted to make that point, that the use of zuul in tripleo/devstack is not the only way to use zuul. i may just be on a soapbox and that's fine. :) | 17:17 |
fungi | sean-k-mooney: at least for the projects you're talking about anyway | 17:17 |
sean-k-mooney | fungi: even to reuse the jobs in a third party ci env i had to do some deep hacks to make it work | 17:17 |
corvus | i totally agree that zuul-runner is right for sean-k-mooney's situation :) | 17:18 |
sean-k-mooney | corvus: hehe i was just hoping it could work without a zuul server | 17:18 |
sean-k-mooney | e.g. also provide it with the config file that the zuul server would have and have it embed a zuul-scheduler in it | 17:18 |
corvus | sean-k-mooney: fundamentally, the question you're asking is "how will zuul construct a list of playbooks and inputs" and the only way to answer that is to ask zuul, so that's the zuul-runner design | 17:19 |
sean-k-mooney | to have it render the playbooks into a directory that could then be run | 17:19 |
fungi | there is an alternative approach, which is to reimplement all of zuul's logic around indexing and parsing configuration. which is basically akin to writing a second zuul | 17:20 |
corvus | like, you could do it "without running a server" but only by starting up all of the zuul components to do all of the same work and then just exiting at the end. it's a distinction without a difference. | 17:20 |
sean-k-mooney | fungi: well not really, i was more thinking just start the zuul api in process | 17:20 |
corvus | and, mind you, in openstack's case, that process would take about 10 minutes. | 17:20 |
sean-k-mooney | then asking it via its api | 17:20 |
corvus | with a fleet of 30 servers | 17:20 |
corvus | so on a desktop, would probably take like an hour | 17:20 |
fungi | sean-k-mooney: and a merger process (or many) to collect and feed in all the distributed configuration | 17:21 |
sean-k-mooney | fungi: yep | 17:21 |
sean-k-mooney | it's currently embedding the executor, right | 17:21 |
sean-k-mooney | which acts as a merger | 17:21 |
sean-k-mooney | but anyway that's out of scope for now | 17:21 |
sean-k-mooney | it's not that hard to run zuul | 17:22 |
sean-k-mooney | especially with the docker compose. and if we have one downstream and we are just editing an existing job we can ask it | 17:22 |
avass | fungi, sean-k-mooney: something like: https://review.opendev.org/c/zuul/zuul-jobs/+/728684, it worked really well for library projects like zuul-jobs since everything is contained to the repo. though it wouldn't be able to handle complex logic to figure out which variant to use if there are multiple branches etc | 17:23 |
fungi | it may be that someone can eventually work out a way to slim down a lightweight zuul deployment aimed solely at providing a local api to query for zuul-runner use, but i expect that's nontrivial | 17:23 |
fungi | and as pointed out, you'd still want to run it independent of your individual zuul-runner calls, because of the startup indexing cost | 17:24 |
fungi | i don't think that can be optimized away, though maybe a persistent cache could allow you to stop and start it | 17:25 |
sean-k-mooney | fungi: ya it's normally not that bad for a small deployment but even my slimmed-down third party ci was 5-10 minutes | 17:25 |
* sean-k-mooney given we have downstream jobs that take more than 6 hours to run, not sure i would notice :) | 17:25 | |
sean-k-mooney | fungi: a big part of the initial start up time is cloning the repos, which would be done once | 17:26 |
sean-k-mooney | granted, all the internal zuul representation of the job would still have to be built | 17:27 |
sean-k-mooney | and unless you serialised that and cached it you would always have to pay that | 17:27 |
fungi | well, even just restarting the zuul scheduler in opendev takes 10+ minutes to ask the mergers to find any updated refs and pass the updated configuration back | 17:27 |
fungi | and that's with caches of repositories on the mergers | 17:27 |
sean-k-mooney | for a zuul-runner you probably don't need to ask it to update the refs on startup | 17:28 |
avass | yeah we take something like 15min to restart | 17:28 |
sean-k-mooney | avass: yep which is why having a zuul running with docker compose is not that unreasonable an overhead | 17:29 |
sean-k-mooney | or pointing it at our downstream ci | 17:29 |
fungi | v5 will solve a lot of that if you have a scheduler cluster, since it will be rare that you restart the entire cluster at once and so don't have to worry about missing events and getting out of sync | 17:30 |
sean-k-mooney | v5? a joke or an actual thing | 17:30 |
* sean-k-mooney wonders if i missed a v4 | 17:30 | |
fungi | you did | 17:31 |
avass | oh that's 15 min running a couple executors, that would probably take a lot longer starting up locally since it's pretty much one big repo with close to 1000 branches that is the reason for that startup cost | 17:31 |
sean-k-mooney | avass: you don't need all the repos | 17:31 |
fungi | v5 is where we move from gearman to zk for coordinating zuul service interactions, so we can scale out the scheduler | 17:31 |
fungi | work is well in progress on it | 17:31 |
*** jcapitao has quit IRC | 17:31 | |
sean-k-mooney | my third party ci is offline currently due to capacity issues | 17:31 |
avass | sean-k-mooney: ah in that case it would be faster yes | 17:31 |
sean-k-mooney | but it starts up with enough for devstack in under 5 minutes, less than 10 for ooo | 17:32 |
fungi | gerrit's checks implementation will probably also allow us to make some assumptions because it will cache events server side, so maybe we can assume cached gerrit configuration is in sync at restart and then roll forward the stashed events, but i don't know that the same could be said for other connection drivers | 17:33 |
sean-k-mooney | http://paste.openstack.org/show/803783/ | 17:33 |
sean-k-mooney | that is what my third party ci was using | 17:33 |
sean-k-mooney | but to avoid all the repos i had to do some hacks in the tempest job | 17:34 |
sean-k-mooney | https://github.com/SeanMooney/ci-sean-mooney/blob/main/zuul.d/jobs.yaml | 17:34 |
sean-k-mooney | basically i needed to define my own version of tempest-all/full | 17:35 |
corvus | sean-k-mooney: there have been 2 4.x releases; if you aren't subscribed to the zuul-announce mailing list, you may want to consider it. there is important information for operators about the v4/v5 roadmap. | 17:38 |
sean-k-mooney | corvus: sorry, was distracted by the fact tempest seems to have fixed their viral import mess | 17:40 |
sean-k-mooney | importing jobs from tempest no longer imports half of openstack | 17:41 |
corvus | neat | 17:41 |
sean-k-mooney | it used to import all the plugins, which imported every project that had a tempest plugin, if i remember correctly | 17:42 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: reconcile config objects https://review.opendev.org/c/zuul/nodepool/+/779898 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: Handle IPv6 https://review.opendev.org/c/zuul/nodepool/+/780400 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: don't require full subnet id https://review.opendev.org/c/zuul/nodepool/+/780402 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: add quota support https://review.opendev.org/c/zuul/nodepool/+/780439 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: implement launch retries https://review.opendev.org/c/zuul/nodepool/+/780682 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: implement support for diskimages https://review.opendev.org/c/zuul/nodepool/+/781187 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: handle leaked image upload resources https://review.opendev.org/c/zuul/nodepool/+/781855 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: use rate limiting https://review.opendev.org/c/zuul/nodepool/+/781856 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: fix race in leaked resources https://review.opendev.org/c/zuul/nodepool/+/781924 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: replace driver with state machine driver https://review.opendev.org/c/zuul/nodepool/+/781925 | 17:43 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Azure: update documentation https://review.opendev.org/c/zuul/nodepool/+/781926 | 17:43 |
corvus | i seem to have misplaced a patch in that series :/ | 17:45 |
fungi | checked under the sofa cushions? | 17:45 |
corvus | yep, that's where it was | 17:46 |
corvus | i'm going to have to clean it off | 17:46 |
sean-k-mooney | by the way i assume the openstack ci is moving to zuulv4 or already has? | 17:47 |
sean-k-mooney | and will continue to track a recent major version of zuul | 17:47 |
fungi | 4.1.0 currently, yes | 17:47 |
corvus | 4.1.0++ even | 17:47 |
sean-k-mooney | cool reading 4.1.0 release notes | 17:47 |
avass | speaking of, I should probably upgrade my own instance :) | 17:47 |
fungi | sean-k-mooney: https://zuul.opendev.org/t/openstack/status "Zuul version: 4.1.1.dev23 68ba89c6" | 17:48 |
corvus | avass: yeah, if not tracking master, at least tracking releases would probably be a really good idea :) | 17:48 |
sean-k-mooney | ah at the bottom | 17:48 |
sean-k-mooney | ya i go to that page often but didn't know the version was there | 17:49 |
avass | now running 4.1.1.dev23 68ba89c6! | 17:49 |
sean-k-mooney | fungi: well, belated congratulations to all of you on v4 | 17:49 |
sean-k-mooney | i think nova may have finally got rid of our last non v3 native job this cycle | 17:50 |
sean-k-mooney | grenade was the main last blocker | 17:50 |
sean-k-mooney | although we still have a slight parity gap with ceph that is pending | 17:50 |
avass | corvus: heh yeah we're like two people using it so it's quite easy to keep it up to date since downtime has pretty much no consequence :) | 17:51 |
fungi | yes, the openstack tact sig is excited at the prospect of letting devstack-gate finally die quietly | 17:51 |
corvus | avass: oh *that* instance, i thought you meant that other instance with more than 2 people. :) | 17:52 |
sean-k-mooney | ... we still haven't moved one job https://github.com/openstack/nova/blob/master/.zuul.yaml#L288-L305 which is nova-grenade-multinode | 17:52 |
sean-k-mooney | but that is close | 17:52 |
avass | corvus: that one very often keeps up to date with master as well ;) | 17:53 |
fungi | okay, i rechecked zuul-jobs change 771106 and the scheduler seems to think we shouldn't run any jobs on that... or am i misreading the debug log? http://paste.openstack.org/show/803785 | 17:54 |
fungi | worried that if i try to add debug reporting temporarily, that will heisenbug and trigger jobs because of matching different file filters | 17:55 |
fungi | log there shows it matched the comment pattern for enqueuing into the check pipeline | 17:57 |
fungi | or is "Project zuul/zuul-jobs not in pipeline <Pipeline check> for change <Change 0x7f270e53f9d0 zuul/zuul-jobs 771106,2>" not expected i wonder | 17:58 |
avass | fungi: I think it should run jobs but the scheduler for some reason doesn't agree :) | 17:59 |
fungi | d'oh! i filtered on the wrong event id i think, not the zuul tenant | 18:02 |
corvus | was about to say the same | 18:03 |
* fungi starts over | 18:03 | |
fungi | i totally forgot there would be more than one event id because that project is in multiple tenants | 18:03 |
*** jpena is now known as jpena|off | 18:04 | |
corvus | i think the event id should be the same | 18:05 |
fungi | yeah, looks like it | 18:05 |
*** hashar has quit IRC | 18:05 | |
corvus | but maybe that paste is just filled with the openstack/opendev lines | 18:05 |
fungi | so why was there no zuul.Pipeline.zuul.check evaluation, i wonder | 18:05 |
corvus | oh, nope, that's the whole thing. | 18:06 |
avass | could this be relevant? https://zuul.opendev.org/t/zuul/config-errors | 18:06 |
fungi | indeed it could, i didn't see that alarming earlier | 18:06 |
fungi | is this a recurrence of the ghost errors i saw a week or two back, which disappeared after a smart reconfig? | 18:08 |
corvus | that tenant appears to have a check pipeline with the zuul-jobs repo attached to it | 18:08 |
corvus | this could be an issue: http://paste.openstack.org/show/803787/ | 18:09 |
fungi | er, yeah | 18:10 |
fungi | i always forget to filter for tracebacks since they lack event association | 18:10 |
fungi | so change was None? that can't be right | 18:11 |
corvus | i don't know if that's related | 18:11 |
fungi | er, yeah that was earlier | 18:11 |
fungi | i rechecked at 19:10:26 | 18:12 |
corvus | i think that's probably harmless (but needs to be fixed) | 18:12 |
corvus | i think that happens on replication events | 18:12 |
avass | that has actually been a really big plus for the splunk reporter so far, event_id and build is attached to every log entry (where it's applicable) | 18:12 |
fungi | scheduler restart was around 14:38 | 18:13 |
fungi | er, sorry, my recheck comment was at 15:46:19 | 18:14 |
fungi | so hours prior to that exception | 18:14 |
corvus | my concern is that if it happens on replication events, it may happen on others too (like tags) | 18:14 |
corvus | nope, this should only be for replication events | 18:17 |
corvus | tristanC: i have good news! there is a longstanding behavior in zuul related to replication events which i think means that no one could be using them in pipeline triggers. | 18:18 |
mordred | corvus: oh? | 18:19 |
fungi | hah, fortuitous bug ;) | 18:19 |
corvus | yeah, i'm double checking that now | 18:19 |
corvus | mordred, fungi, tristanC: the ref-* events from gerrit all hit this line: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L745 | 18:20 |
corvus | which means there's nothing for the pipelines to match on | 18:21 |
corvus | so i think we can merge tristanC's change to remove those immediately | 18:21 |
mordred | corvus: well - in that case removing the replication events - yeah - should make logs better | 18:22 |
fungi | i concur | 18:22 |
corvus | and then also, we need to fix the new event handling to handle a null return (which should fix that traceback in the scheduler) | 18:22 |
avass | what does this comment refer to then? https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/gerrit/gerritconnection.py#L719 | 18:22 |
corvus | fungi: and i'm pretty sure none of this has anything to do with your job issue, so back to that... | 18:22 |
fungi | corvus: yeah, i'm running down the config errors now | 18:22 |
corvus | avass: sorry i was sloppy -- i meant ref-replicated-* events | 18:23 |
fungi | i strongly suspect this is a recurrence of what we saw briefly last week or the week before | 18:23 |
corvus | avass: ref-updated is legit and correctly handled | 18:23 |
avass | ah | 18:23 |
corvus | avass: er, i mean ref-replication-* :) | 18:23 |
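A minimal sketch of the fix corvus describes above, with hypothetical names (not Zuul's real classes or API): gerrit `ref-replication-*` events map to no change, so the two-stage event forwarder should tolerate a `None` result from the connection instead of raising the traceback seen in the paste.

```python
# Hedged sketch, hypothetical names -- not Zuul's actual code.
# Replication events yield no change object, so a None return from the
# connection must simply drop the event rather than crash forwarding.

def forward_trigger_event(event, tenant_queues, event_to_change):
    """Forward one global trigger event to every tenant queue.

    event_to_change maps an event to a change-like object, or None when
    the event (e.g. a replication event) has nothing to match on.
    Returns the number of tenant queues the event was forwarded to.
    """
    change = event_to_change(event)
    if change is None:
        # Nothing for pipelines to match on; drop the event quietly.
        return 0
    forwarded = 0
    for queue in tenant_queues:
        queue.append((event, change))
        forwarded += 1
    return forwarded
```

Under this shape, the existing traceback disappears because the `None` case is handled before any tenant queue is touched.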
fungi | https://opendev.org/openstack/project-config/src/branch/master/zuul/main.yaml#L1631-L1640 | 18:23 |
fungi | we clearly include [] from openstack/project-config in the zuul tenant | 18:24 |
fungi | so it should not be complaining about pipeline definitions from that repo | 18:24 |
corvus | tristanC: if you would care to un-minus-2 your change https://review.opendev.org/780809 i think we can merge it asap. sorry i missed that earlier. | 18:25 |
fungi | is it somehow assembling configuration incorrectly at start, which then gets corrected at the next smart-reconfigure? | 18:26 |
* fungi checks whether there was a scheduler restart fairly close to when i saw this last time | 18:27 | |
fungi | 2021-03-01 23:07:49 is when i mentioned it in here last time | 18:28 |
fungi | most recent mention of a zuul restart in our status log prior to that was on 2021-02-20 so not a smoking gun | 18:29 |
avass | fungi: maybe a full-reconfig would be enough? | 18:29
fungi | yeah, really anything's possible i guess | 18:30 |
corvus | fungi: if you want to try a smart-reconfig and see if it changes i have no objection | 18:30 |
corvus | fungi: i don't think the lack of jobs is specific to that gentoo change; the nodepool changes i uploaded a while ago are not enqueued either. so we're probably looking at a systemic issue, either related to the config errors you and avass observed, or an as-yet-unidentified bug in the trigger-event-queue-in-zk changes | 18:32 |
corvus | i have to run now; i will be back after lunch | 18:33 |
*** jpenag has joined #zuul | 18:39 | |
fungi | i've initiated one now | 18:41 |
fungi | 2021-03-22 18:31:49,126 DEBUG zuul.CommandSocket: Received b'smart-reconfigure' from socket | 18:41 |
*** jpena|off has quit IRC | 18:41 | |
fungi | it logged "Reconfiguration complete" at 18:31:49,336 | 18:41 |
fungi | oh, that was "Smart reconfiguration of tenants []..." so it didn't reconfigure anything | 18:41 |
fungi | full-reconfigure maybe? | 18:41 |
fungi | https://zuul.opendev.org/t/zuul/config-errors is still clearly reporting errors for things which shouldn't happen | 18:41 |
fungi | corvus: yep, i agree, it seems like it's not evaluating any zuul tenant events because of the configuration errors for the tenant | 18:42 |
fungi | i'm about to start heating up some food myself, but will keep poking at logs to see if i can spot any other hints | 18:45 |
openstackgerrit | Albin Vass proposed zuul/nodepool master: WIP: Digitalocean driver https://review.opendev.org/c/zuul/nodepool/+/759559 | 18:51 |
avass | corvus: would you recommend moving the digital ocean driver to use the state machine driver, or is the simple driver fine? | 18:54
fungi | the restart begins at 2021-03-22 14:30:47,174 in our debug log. at 14:37:27,654 (during startup config loading) it complained "WARNING zuul.ConfigLoader: Zuul encountered a syntax error while parsing its configuration in the repo openstack/project-config on branch master. The error was: Pipelines may not be defined in untrusted repos..." | 18:59 |
fungi | so it definitely was concerned about this at start | 19:00 |
fungi | stat reporting exception, almost certainly unrelated, but curious if we have a patch up for it yet: http://paste.openstack.org/show/803789/ | 19:03 |
fungi | looks like arrived is None | 19:04 |
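A sketch of the guard that stats exception suggests is missing (hypothetical helper, not Zuul's real function): changes enqueued out-of-band, e.g. via the enqueue RPC, may carry no arrival timestamp, so the timing stat should be skipped when `arrived` is `None` instead of subtracting from it.

```python
import time

# Hedged sketch, hypothetical function name. If an item was enqueued
# without an arrival timestamp (e.g. via RPC re-enqueue), report no
# timing stat rather than raising on None arithmetic.

def event_processing_ms(arrived, now=None):
    if arrived is None:
        return None  # no timestamp to measure against; skip the stat
    if now is None:
        now = time.time()
    return int((now - arrived) * 1000)
```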
corvus | avass: state machine; i'd like to move gce to it and drop simple | 19:06 |
corvus | avass: should be easy | 19:06 |
corvus | fungi: i'm still somewhat inclined to think that these may be 2 separate issues; occams razor says if we're not seeing trigger events then the change to move trigger events into zk may be relevant. :) | 19:08 |
fungi | corvus: yep, absolutely | 19:08 |
corvus | (i'm not excluding the idea they may be related; just thinking right now it's like 75% probability they are distinct) | 19:08 |
fungi | i guess it *should* enqueue for any repository not in a syntax error state? | 19:09 |
fungi | unless one of the error repositories is a required-project? | 19:09 |
corvus | looking at the running config, i don't see any reason why it shouldn't be enqueued right now | 19:10 |
corvus | (the error about pipelines is from a repo where we don't expect pipelines anyway) | 19:10 |
fungi | but it will refuse to add that project, right? so if it's declared as a required-project it can't be used in that state? not saying that's the case for every zuul-jobs job so something still should have been enqueued | 19:11 |
corvus | i don't think a config error on a project should prevent it from being used as a required project; at most, it would cause zero config objects to be loaded from that project. | 19:12 |
fungi | ahh, okay, so even less likely to be related | 19:13 |
corvus | yep, since zero of the projects which are showing config errors in the zuul tenant are used by zuul-jobs | 19:13 |
avass | corvus: ok, I'll take a look at that soon then. I realized that building a game client inside the same k8s as zookeeper and zuul on cheap nodes isn't a good idea :) | 19:13 |
fungi | building a game client is always a good idea | 19:14 |
avass | well you don't actually get to build anything if something always gets OOM killed | 19:14 |
fungi | and yeah, so far the only exceptions i'm seeing logged are the one from replication events, and the stats emitting one i pasted above | 19:15 |
fungi | now wondering if the stats exception is related to enqueuing things with zuul-enqueue rpc | 19:17 |
fungi | seems to consistently follow "Adding change" lines, and looks related to our bulk scripted reenqueue | 19:18 |
*** harrymichal has quit IRC | 19:19 | |
*** harrymichal has joined #zuul | 19:20 | |
*** harrymichal has joined #zuul | 19:20 | |
fungi | interesting, our opendev-prod-hourly pipeline (timer triggered) is raising run handler exceptions with "AttributeError: 'Branch' object has no attribute 'number'" | 19:20 |
*** hamalq has joined #zuul | 19:27 | |
avass | okay weird, I seem to only be able to run jobs for a PR once per scheduler restart | 19:50 |
corvus | avass: 4.1.0 or master? | 19:51 |
avass | master | 19:51 |
corvus | avass: can you try 4.1.0? | 19:52 |
avass | sure | 19:52 |
fungi | i see opendev's openstack tenant enqueuing things, but https://zuul.opendev.org/t/zuul/builds suggests nothing has been enqueued for the zuul tenant for 16+ hours | 19:52 |
corvus | avass: i suspect it will still fail and i would start with I55df1cc28279bb6923e51686dde8809421486c6a as a suspected cause | 19:53 |
corvus | fungi: i think i've got it | 19:55 |
fungi | regression in today's restart? | 19:55 |
avass | corvus: 4.1.0 works | 19:55 |
corvus | avass: oh nice -- runs twice after restart on 4.1.0? | 19:56 |
avass | yeah | 19:56 |
corvus | avass: ok, it *could* be the same thing we're seeing in opendev; let me explain, then let's see if your situation could fit | 19:56 |
fungi | yay for not releasing with whatever brought in that bug ;) | 19:56 |
corvus | fungi, avass: there's an exception in our scheduler logs: AttributeError: 'Branch' object has no attribute 'number' | 19:57 |
corvus | that's being hit during the new event forwarding code (where we have a 2-stage event queue system: a global queue, then we forward events to individual tenant queues) | 19:58 |
fungi | okay, so that's the same one i saw for one of our opendev hourly pipelines | 19:59 |
corvus | if one of the tenants has an item in a pipeline that is not a change (ie, a branch tip or tag or other ref -- probably a post or periodic pipeline), then we hit that exception and abort forwarding the event to any other tenants | 19:59 |
avass | I'm not seeing that error however | 19:59 |
avass | or I don't think I do | 20:00 |
corvus | so basically, the new event forwarding is stopping for us at the opendev tenant. any tenant in our list of tenants after (and somewhat including) opendev is not getting events | 20:00 |
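The failure mode corvus describes can be sketched with hypothetical classes (not Zuul's real model code): a pipeline item may reference a Change (which has `.number`) or a plain ref such as a Branch or Tag (which does not), and comparing via `getattr` avoids the `AttributeError` that aborted forwarding to the remaining tenants.

```python
# Hedged sketch with hypothetical classes -- not Zuul's model code.
# A post/periodic pipeline item holds a Branch/Tag/Ref with no .number;
# accessing item_change.number directly raised AttributeError and
# stopped event forwarding for every tenant after the failing one.

class Change:
    def __init__(self, number):
        self.number = number

class Branch:
    def __init__(self, name):
        self.name = name

def same_change(item_change, event_change):
    # getattr with a default keeps non-change refs from raising.
    a = getattr(item_change, "number", None)
    b = getattr(event_change, "number", None)
    return a is not None and a == b
```

With a guard like this (or a per-tenant try/except around forwarding), one tenant's non-change item cannot starve the tenants listed after it.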
* avass does another restart | 20:00 | |
corvus | avass: hrm. any other exceptions in your scheduler log? maybe a slightly different error is being hit and causing it. | 20:00 |
avass | corvus: no, no error at all | 20:03 |
corvus | avass: what's the sequence? are you hitting the recheck button in github for the already-run check? | 20:06 |
avass | recheck -> job runs and gets NODE_FAILURE -> recheck -> nothing, and the logs don't really say anything either | 20:07
avass | working on paste but I guess you don't want 800 lines of raw scheduler logs :) | 20:07 |
corvus | the more the better as far as i'm concerned, but opendev's pastebin does have a size limit; if you hit it, you can split it in 2 :) | 20:08 |
fungi | yeah, i've used the split -l trick to get a log into multiple pastes plenty of times | 20:13 |
fungi | i think the limit in our paste server is actually the max field size in the mysql db backing it | 20:14 |
avass | sec, manjaro seems to want to upgrade all system packages. | 20:15 |
avass | also had to fill out an impossible captcha because my paste contains spam :) | 20:16 |
avass | http://paste.openstack.org/show/803796/ | 20:16 |
avass | I suppose that silently truncated my logs | 20:18 |
avass | 2/4: http://paste.openstack.org/show/803799/ | 20:19 |
avass | 3/4: http://paste.openstack.org/show/803800/ | 20:20 |
avass | 4/4: http://paste.openstack.org/show/803801/ | 20:20 |
corvus | avass: and when was the second recheck event? | 20:20
avass | corvus: last paste. somewhere around line 10 | 20:21 |
corvus | avass: are you sure it's not somewhere between paste 3 and 4? | 20:22 |
avass | corvus: probably line 18: [e: 98be8bb0-8b49-11eb-935a-fabeee8c4b40] Github Webhook Received | 20:22 |
avass | corvus: that's the first event probably | 20:23 |
corvus | avass: okay, so you're leaving a "recheck" comment? | 20:23 |
avass | yeah | 20:23 |
avass | first one works, second one doesn't | 20:23 |
fungi | and second one was after the first reported? | 20:23 |
avass | I believe so yes | 20:23 |
avass | fungi: just left another comment and that produces the same result | 20:24 |
fungi | got it | 20:25 |
corvus | avass: i see a "Run handler awake" log in #3 and nothing after that; suspect your scheduler may be stuck in result event processing | 20:26 |
corvus | avass: the node request always fails? | 20:27 |
avass | corvus: trying to figure out why digital ocean doesn't like the image it's configured to use, so yep | 20:28
corvus | okay, so the reproducer for this might be as simple as "enqueue a change with a node request that fails"; that does also seem to point to the bug being related to event-queues in zk. | 20:29 |
corvus | i'll work on fixing both of these | 20:31 |
corvus | fungi: for opendev, i think we need to restart on 4.1.0 | 20:32 |
corvus | there's no good way to get an image with fixes without doing that | 20:33 |
corvus | fungi: do you have time to do that, or should i? | 20:33 |
fungi | corvus: sorry, had to step away for a moment, but catching back up now | 20:48 |
corvus | fungi: it's a little more complex so -> #opendev | 20:49 |
fungi | yup | 20:49 |
*** ajitha has quit IRC | 21:15 | |
*** jangutter has joined #zuul | 21:22 | |
*** jangutter_ has quit IRC | 21:25 | |
corvus | avass: you weren't running any unmerged patches on that instance, right? the bug you saw was with plain old upstream master? | 22:42 |
corvus | avass: (in particular, no async reporting patch?) | 22:42 |
corvus | avass: also, it really looks like there are some lines missing between pastes #3 and #4 -- can you check on that? | 22:45 |
*** harrymichal_ has joined #zuul | 22:51 | |
*** harrymichal has quit IRC | 22:52 | |
*** harrymichal_ is now known as harrymichal | 22:52 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Fix trigger event forwarding bug https://review.opendev.org/c/zuul/zuul/+/782335 | 22:59 |
openstackgerrit | James E. Blair proposed zuul/zuul master: WIP: Try to repro recheck failure https://review.opendev.org/c/zuul/zuul/+/782336 | 22:59 |
corvus | fungi, avass, tobiash, swest: 782335 is a successful fix of the issues we saw in production today; i think with that merged, we can restart opendev's zuul on master again. (you can read the commit msg for the tl;dr on issues) | 23:00 |
*** nils has quit IRC | 23:00 | |
corvus | fungi, avass, tobiash, swest: 782336 is an unsuccessful attempt to repro the issue avass described; i think we need more data there. | 23:01 |
*** holser has quit IRC | 23:38 | |
*** harrymichal has quit IRC | 23:39 | |
*** rlandy has quit IRC | 23:43 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!