Thursday, 2021-02-18

corvuszuul likes it00:00
corvusah, i'm guessing it's not quite tuned for ubuntu -- from the nodepool-functional-k8s job: ubuntu-bionic |   "msg": "No package matching 'java-latest-openjdk' is available"00:01
corvus'default-jdk-headless' maybe?00:02
clarkbya or default-jdk00:02
clarkbnot sure what the difference is but headless seems appropriate here00:02
openstackgerritJames E. Blair proposed zuul/zuul-jobs master: ensure-zookeeper: add use_tls role var  https://review.opendev.org/c/zuul/zuul-jobs/+/77629000:06
openstackgerritJames E. Blair proposed zuul/nodepool master: WIP: Require TLS.  https://review.opendev.org/c/zuul/nodepool/+/77628600:07
clarkbcorvus: looking at the WIP change it seems to have a lot of the same stuff as the zj role. Is the intent that nodepool would bootstrap itself?00:09
corvusclarkb: i don't want to require 'ensure-zookeeper' for the tox jobs00:10
corvusi like the idea of test-setup.sh doing that in a way that works for the zuul tox jobs and devs00:11
clarkbok just making sure I understand the various bits there00:11
corvuswe could switch the functional jobs to use test-setup, but those are running k8s and openshift, so there's a pretty good argument for not running something in docker in those jobs :)00:11
corvusand they are not especially standard; they already have their own playbooks00:12
clarkbcorvus: in test-setup.sh why use docker-compose rm -sf over docker-compose down? (If you use down you can drop the condition and I think down does roughly the same thing?)00:13
corvusclarkb: no idea, copied that from zuul's test-setup-docker.sh00:13
corvusi'll incorporate that in the next rev00:13
clarkbcorvus: found a minor thing on the WIP change (left it as a comment)00:15
corvusah thx00:16
corvusthat will probably fail some unit tests but not all of them; i'm going to let it run a bit more before i push up a fix00:17
*** tosky has quit IRC00:30
corvusi'm confused about why the py3x tests timed out; they seem to be mostly running; i'm not sure what they were doing when time was up01:02
clarkbcorvus: https://9e7a0909972c63991fa4-da3822d63841e990242061d65cb4e6c4.ssl.cf5.rackcdn.com/776286/8/check/tox-py36/be3b5ed/tmpqv4jwhl2 is an attempt at deciphering that01:03
clarkbthat is the partial subunit stream01:03
* clarkb pulls it up01:04
corvusyeah, it ends with a successful test01:04
corvusi think i have it running the bad test locally, i just still don't know which one it is01:06
corvusi don't see a test name in my stestr output01:06
clarkbcorvus: grep ^test: $thatsubunitfile | wc -l is 9101:06
corvusclarkb: i don't know how to use that info01:09
clarkbmostly just pointing out it only ran 91 tests before it timed out (I think nodepool has several hundred tests overall)01:10
corvus27701:11
corvusthe 3 tests around line 91 of  'python -m testtools.run discover --list' are fine01:11
corvusi'm really confused because we have a TESTING.rst file that says if i run "stestr run" it will print out the name of each test as it runs01:12
corvusbut all i see is log output01:12
clarkbcorvus: the invocation in the job log uses --no-subunit-trace which suppressed the behavior you want I think01:13
corvusok.  well i'm running plain "stestr run" now, and i do see some job names among the logs01:14
corvusso maybe if it hangs i'll see it this time01:14
clarkbhttps://opendev.org/zuul/nodepool/src/branch/master/tox.ini#L2101:14
corvusokay, running that way and then hitting ctrl-c when it hung has given me the names of the 7 running tests that were hanging01:16
corvusso i can iterate now01:16
clarkband ya it seems like the file only writes complete subunit results (to avoid interleaving?) but the proper tracing should print them as they go01:16
corvusokay, i think i have fixes for the latest errors in the k8s/openshift jobs and these tests; i'm just running locally again to find any more01:24
openstackgerritJames E. Blair proposed zuul/nodepool master: WIP: Require TLS  https://review.opendev.org/c/zuul/nodepool/+/77628601:32
corvusat this point, i expect some of these jobs to start passing01:32
*** harrymichal has quit IRC01:49
corvusclarkb, tobiash: the disadvantage of using docker for the unit tests is: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit01:52
corvusmaybe we should leave the docker setup for devs only and switch the tox jobs to ensure-zookeeper01:53
corvustobiash, felixedel, swest: if you start to look at moving zk connection inside zuul services, note the work going on in nodepool; see the lines above ^  and also https://review.opendev.org/776290 and https://review.opendev.org/77628602:02
*** dmsimard8 has joined #zuul02:56
*** dmsimard has quit IRC02:59
*** aluria has quit IRC02:59
*** SpamapS has quit IRC02:59
*** gouthamr has quit IRC02:59
*** tflink has quit IRC02:59
*** dmsimard8 is now known as dmsimard02:59
*** ianw has quit IRC02:59
*** tflink has joined #zuul03:00
*** SpamapS has joined #zuul03:00
*** ianw has joined #zuul03:00
*** gouthamr has joined #zuul03:03
*** aluria has joined #zuul03:04
*** ykarel has joined #zuul05:00
*** evrardjp has quit IRC05:33
*** evrardjp has joined #zuul05:33
*** jfoufas1 has joined #zuul06:00
*** gouthamr has joined #zuul06:01
*** rlandy|bbl has quit IRC06:17
*** vishalmanchanda has joined #zuul06:34
*** reiterative has quit IRC06:48
*** reiterative has joined #zuul06:48
openstackgerritDinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos  https://review.opendev.org/c/zuul/zuul-jobs/+/76735407:01
openstackgerritDinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos  https://review.opendev.org/c/zuul/zuul-jobs/+/76735407:03
openstackgerritDinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos  https://review.opendev.org/c/zuul/zuul-jobs/+/76735407:04
*** jpena|off is now known as jpena07:36
*** harrymichal has joined #zuul07:58
*** jcapitao has joined #zuul08:13
*** rpittau|afk is now known as rpittau08:14
*** hashar has joined #zuul08:21
*** tosky has joined #zuul08:45
openstackgerritDinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos  https://review.opendev.org/c/zuul/zuul-jobs/+/76735408:51
*** jfoufas1 has quit IRC08:56
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Allow customization of helm charts repos  https://review.opendev.org/c/zuul/zuul-jobs/+/76735409:05
*** ykarel_ has joined #zuul09:05
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Allow customization of helm charts repos  https://review.opendev.org/c/zuul/zuul-jobs/+/76735409:07
*** ykarel has quit IRC09:08
*** nils has joined #zuul09:22
*** ykarel_ is now known as ykarel09:49
*** wuchunyang has joined #zuul10:39
*** wuchunyang has quit IRC10:44
openstackgerritMerged zuul/zuul-jobs master: Allow customization of helm charts repos  https://review.opendev.org/c/zuul/zuul-jobs/+/76735411:31
*** harrymichal has quit IRC11:38
*** harrymichal has joined #zuul11:39
*** jcapitao is now known as jcapitao_lunch12:08
*** rlandy has joined #zuul12:28
*** jpena is now known as jpena|lunch12:30
*** ikhan has joined #zuul12:50
*** jcapitao_lunch is now known as jcapitao13:19
*** jpena|lunch is now known as jpena13:31
*** iurygregory has quit IRC13:47
*** iurygregory has joined #zuul13:50
*** ykarel_ has joined #zuul14:14
*** ykarel has quit IRC14:17
pabelangerI know some folks are using prometheus with zuul metrics, is that right? if so, do people have it documented any place14:27
*** ykarel_ has quit IRC14:27
pabelangerwe'll likely have to use that given no one offers graphite much anymore14:27
corvuspabelanger: you run your own control plane right?  or are you moving to a hosted control plane?14:28
corvus(not sure who needs to "offer" graphite in this scenario)14:29
pabelangeryah, we still run our own. But we don't want to manage more services, so looking to other teams to provide something14:30
pabelangerand prometheus is the thing, not graphite14:30
pabelangerand basically don't have the time / effort to stand up our own now14:30
corvuspabelanger: well, fwiw, graphite isn't required... you can use grafana with influxdb as your statsd receiver as well.14:31
pabelangerwe're kind of at a crossroads, of looking for managed CI again. Which may or may not be zuul14:31
pabelanger:(14:31
corvuspabelanger: if you do need prometheus, i think tobiash has a patch in review under zuul/zuul for a prometheus statsd exporter14:33
avasswhere is logging configured in zuul and what decides what variables are passed when zuul (python?) logs something?14:47
avassnot that well versed in how logging in python actually works14:47
corvusavass: uses standard python logging library: https://docs.python.org/3/library/logging.html14:47
avassanyhow, I don't think buildset gets logged nicely and if it is I can't find it except for in the actual message14:47
corvusavass: that configures the formatting, etc.  but if you want extra messages or info/metadata along with the messages, those are source code changes14:48
avasscorvus: oh nvm, think I found it. the get_annotated_logger function is probably what I'm looking for14:49
avassit would be nice if buildset was part of the logs somehow as well14:50
tristanCcorvus: note that influxdb statsd support is a bit tricky, we couldn't use the graphite query in grafana and we had to configure continuous queries for nodepool14:50
*** tosky has quit IRC14:51
*** tosky_ has joined #zuul14:51
corvusavass: ack... i wonder if it would be better to have it in every line related to a build, or just have a clear "starting build X for buildset Y" for cross-referencing14:51
*** tosky_ is now known as tosky14:51
avasscorvus: for console logs that should be enough but for our splunk logs it would be nice to have that as part of the data for each log entry14:52
avassbecause doing buildset=<hash> levelname=ERROR could be convenient :)14:53
avassbut event_id should do pretty much the same thing14:55
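
(A minimal stdlib sketch of the pattern being discussed: attaching structured fields such as an event id or buildset to every log record so a splunk-style query can filter on them. This is the general logging.LoggerAdapter idea that Zuul's get_annotated_logger builds on, not Zuul's exact helper, and the field names are only illustrative.)

    import logging

    logging.basicConfig(
        format='%(asctime)s %(levelname)s %(name)s: '
               '[e: %(event_id)s] [buildset: %(buildset)s] %(message)s')

    log = logging.getLogger('zuul.Example')
    # The adapter injects these fields into every record it emits.
    annotated = logging.LoggerAdapter(
        log, {'event_id': 'abc123', 'buildset': 'deadbeef'})

    annotated.error('something went wrong')
    # A log indexer could then filter on buildset=<hash> levelname=ERROR
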
tristanCpabelanger: we have been using the node_exporter text file feature to get zuul metrics in prometheus15:01
pabelangersorry, I got pulled into a meeting. catching up15:02
tristanCpabelanger: here are the rules https://softwarefactory-project.io/cgit/software-factory/sf-infra/tree/monitoring/rules-zuul.yaml  and the script to generate the metrics is: https://softwarefactory-project.io/cgit/software-factory/sf-infra/tree/roles/custom-exporter/files/exporter.py15:02
pabelangerso, the feedback I am getting is 'we need more zuul metric', which is solved already. But I have a very limited amount of time or ability to launch new services to support it.  Given teams already have prometheus (I am unsure about influxdb), the plan would be to use those services, over asking them to run graphite.15:05
pabelangerstatsd is already in place, so we don't have to provision that (aside from exporters)15:05
pabelangertristanC: thanks, I'll look15:06
tristanCpabelanger: the links i gave you are more about alerting when something breaks, you'll need to look at the tobiash's statsd exporter configuration for metrics15:13
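
(A small sketch of the node_exporter textfile approach tristanC mentions: a periodic script writes Prometheus exposition-format lines into the textfile collector directory. The metric name, value and directory here are hypothetical, not what sf-infra's exporter.py actually emits.)

    import os
    import tempfile

    TEXTFILE_DIR = '/var/lib/node_exporter/textfile_collector'  # assumption

    def write_metric(name, value, help_text):
        # Write atomically so node_exporter never reads a half-written file.
        fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR)
        with os.fdopen(fd, 'w') as f:
            f.write('# HELP {0} {1}\n'.format(name, help_text))
            f.write('# TYPE {0} gauge\n'.format(name))
            f.write('{0} {1}\n'.format(name, value))
        os.rename(tmp, os.path.join(TEXTFILE_DIR, name + '.prom'))

    write_metric('zuul_queued_changes', 42, 'Changes waiting in check')
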
pabelangerhttps://review.opendev.org/c/zuul/zuul/+/66047215:18
pabelangerthanks15:18
*** hashar has quit IRC15:18
pabelangerthat looks like it might be easy to setup and test15:18
*** rlandy is now known as rlandy|training15:39
corvuspabelanger: yep that's the one15:43
*** ykarel_ has joined #zuul15:48
tristanCcorvus: about ensure-zookeeper tls support, if i understand correctly it fixed the k8s/openshift integration jobs, but we now need to use it for the regular tox job?15:48
tristanCthank you for taking over the change by the way15:49
corvustristanC: yes.  i wanted to use docker in test-setup.sh so that it was the same for devs and CI, and we wouldn't have to add a playbook to the tox jobs.  however, with the dockerhub rate limits, i think your idea of using it for the tox jobs is better.15:50
corvustristanC: so i think we just need to update 776286 to run ensure-zookeeper and then set env variables for the unit test tox runs15:50
corvusi can do that in a little bit15:50
corvustristanC: and thank you for the ensure-zk change -- i love timezone teamwork :)15:51
clarkbcorvus: no objection from me using ensure-zk for unittests too16:02
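
(A rough sketch of what the unit-test setup could look like once ensure-zookeeper provides a TLS endpoint: read the connection details from environment variables and open a TLS connection with kazoo. The environment variable names and cert paths here are hypothetical, not the ones the job will actually export.)

    import os
    from kazoo.client import KazooClient

    zk = KazooClient(
        hosts=os.environ.get('NODEPOOL_ZK_HOST', 'localhost:2281'),
        use_ssl=True,
        verify_certs=True,
        ca=os.environ.get('NODEPOOL_ZK_CA', '/opt/zookeeper/ca/cacert.pem'),
        certfile=os.environ.get('NODEPOOL_ZK_CERT',
                                '/opt/zookeeper/ca/client.pem'),
        keyfile=os.environ.get('NODEPOOL_ZK_KEY',
                               '/opt/zookeeper/ca/clientkey.pem'),
    )
    zk.start()
    print(zk.server_version())
    zk.stop()
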
*** vishalmanchanda has quit IRC16:02
*** rlandy|training is now known as rlandy16:16
*** ykarel_ is now known as ykarel16:21
*** mvadkert has joined #zuul16:22
mvadkerttristanC: hi, I looked at your Fedora lightning talk, do you have examples where you are now using Dhall in Zuul and for what?16:23
mvadkerttristanC: we are looking to use cuelang.org for some of our services, but we want to look at Dhall also ...16:24
avassmvadkert: it's widely used for the zuul-operator: https://opendev.org/zuul/zuul-operator16:24
mvadkertavass: thanks, will check that out16:26
*** ykarel has quit IRC16:29
clarkbcorvus: looks like unit tests passed in your latest recheck (but not some of the functional jobs), though it has not finished and reported yet16:36
tristanCmvadkert: fbo wrote a blog post about it here: https://www.softwarefactory-project.io/using-dhall-to-generate-fedora-ci-zuul-config.html16:44
tristanCmvadkert: you can also find several examples in the binding documentation here: https://github.com/softwarefactory-project/dhall-zuul16:45
tristanCmvadkert: and finally, you can also find an ambitious work in progress expression to manage a full tenant configuration there: https://github.com/softwarefactory-project/bootstrap-your-zuul16:47
*** jpena is now known as jpena|off16:51
*** ikhan has quit IRC16:52
*** hashar has joined #zuul17:16
mvadkerttristanC: cool ty!17:20
*** jcapitao has quit IRC17:24
tristanCmvadkert: you're welcome, i'd be interested to see a zuul configuration in cue if you already made one17:36
mvadkerttristanC: not for zuul, we plan to use it for other services, but the work has just started17:38
mvadkerttristanC: so we are mostly looking around for similar use cases17:38
*** rpittau is now known as rpittau|afk17:47
*** rlandy is now known as rlandy|training18:01
openstackgerritJames E. Blair proposed zuul/nodepool master: WIP: Require TLS  https://review.opendev.org/c/zuul/nodepool/+/77628618:06
*** akrpan-pure has joined #zuul18:13
akrpan-pureHello! I have a zuul run working fine on opendev/ci-sandbox, but I just switched it over to also run on openstack/cinder and the runs are working fine, but uploading the result mysteriously fails with a "Gerrit error executing gerrit review <rest of command>" and no comment on the change request18:14
akrpan-pureAnyone either seen this before, or know how I could turn on more debugging info for response reporting through gerrit?18:14
corvusakrpan-pure: you may be able to run zuul with the "-d" argument to get more debugging info18:15
clarkbakrpan-pure: are your jobs trying to vote +/-1 verified?18:15
*** wuchunyang has joined #zuul18:15
clarkbI think that most openstack projects don't allow third party CI systems to vote and the reviews can fail if you try to vote18:16
akrpan-pure.... oh I bet I know what it is, I made a new account for our third party CI, I bet it hasn't been approved18:17
akrpan-pure*as a CI account yet, because our old one could just fine18:17
clarkbakrpan-pure: cinder-ci has no members, they don't want anyone voting18:17
akrpan-pureOh indeed, I'm not sure what I was remembering but I could've sworn I saw votes from other CI systems including ours in the past18:19
akrpan-pureLooking back now I see none18:19
akrpan-pureThanks for pointing that out! Easy enough to fix18:20
*** wuchunyang has quit IRC18:20
*** hashar has quit IRC18:32
*** akrpan-pure has quit IRC19:00
*** mgoddard has quit IRC19:17
*** mgoddard has joined #zuul19:18
avasstristanC: oh yep overrides are starting to click a bit now20:01
tristanCi've published an haskell library to interact with zuul, it comes with a cli to list the nodepool labels used in projects pipeline configuration: https://hackage.haskell.org/package/zuul20:01
tristanCavass: good, that seems the best solution to modify a set, if you look in zuul-nix, override could be used to tweak the existing requirements instead of creating new ones when there is a mismatch20:04
*** ikhan has joined #zuul20:05
*** nils has quit IRC20:14
*** jamesmcarthur has joined #zuul20:26
*** rlandy|training is now known as rlandy20:33
pabelangeranybody seen the following decode issue before?20:58
pabelangerhttp://paste.openstack.org/show/802799/20:58
pabelangertrying to stand up new executor and hitting it20:58
tristanCpabelanger: perhaps a jwt library update?20:59
tristanCpabelanger: without https://review.opendev.org/c/zuul/zuul/+/768312  , you need to pin it to <221:00
pabelangertristanC: thanks! that look to be it21:01
pabelangermanually downgrading to 1.7.1 fixed it21:01
tristanCi guess pip installing zuul 3.19.1 is not working since december21:02
pabelangeryah, possible21:02
pabelangerat least for github21:02
clarkbyou can always use a contraints file if necessary21:04
pabelangertrue21:08
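
(For context, a small illustration of the PyJWT 1.x -> 2.x API changes that the <2 pin avoids; this is not the exact call Zuul's GitHub driver makes.)

    import jwt

    token = jwt.encode({'iss': 'example'}, 'secret', algorithm='HS256')
    # PyJWT 1.x: encode() returns bytes; PyJWT 2.x: it returns str

    claims = jwt.decode(token, 'secret', algorithms=['HS256'])
    # PyJWT 1.x tolerated omitting algorithms=; 2.x raises if it is missing
    print(claims)
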
*** mgagne has quit IRC21:08
*** jamesmcarthur has quit IRC21:12
*** jamesmcarthur has joined #zuul21:14
*** jamesmcarthur has quit IRC21:26
*** jamesmcarthur has joined #zuul21:37
clarkbcorvus: left a couple of notes on the WIP nodepool change.21:39
*** jamesmcarthur has quit IRC21:39
*** jamesmcarthur has joined #zuul21:40
clarkbcorvus: and for the functional jobs I see where ensure-zookeeper started zookeeper but then later logs complain about being unable to connect. Not quite sure what is going on there21:42
clarkblooks like it is connecting to port 218121:44
clarkbok left another comment to address ^ I think.21:45
avasstristanC: I think overrides might not make much sense after all...21:58
avasswell, at least I tried to build a newer rustc version so I tried to override pkgs.rustc.version, for some reason that produces logs as if it downloaded a 1.50.0 tarball but it turns out it built 1.41.0 anyway.. :(22:00
openstackgerritJames E. Blair proposed zuul/nodepool master: WIP: Require TLS  https://review.opendev.org/c/zuul/nodepool/+/77628622:01
corvusclarkb: thx updated22:01
tristanCavass: it is confusing indeed :-) I think it's necessary when a function expects a package set, for example if you want to rebuild zuul-nix with a modified set, then you need to create it with an override22:01
avasstristanC: yeah I read through the pills explaining overrides and config.nix and followed along with the examples and that made sense. but I feel like I have no idea why that didn't work for a real package22:02
tristanCavass: the override implementation may vary between package sets, it's just a nix function after all22:02
avassit would help if the packages were better documented with what can be overridden22:04
tristanCavass: for sure, i would recommend to check the source directly :) for rust, have you tried https://github.com/mozilla/nixpkgs-mozilla#using-in-nix-expressions ?22:06
avasstristanC: that's what I'm doing currently. but having to read through the source to use a package manager will be a bit hard to motivate when people are used to the simplicity of containers22:07
avasswill take a look at that22:07
tristanCavass: that's true, though once the expression is working, it's quite easy to use. And unless you are doing complicated things, you shouldn't have to use an override just to use a different toolchain, for rust mozilla provides every set already.22:15
openstackgerritTristan Cacqueray proposed zuul/zuul master: scheduler: add statsd metric for enqueue time  https://review.opendev.org/c/zuul/zuul/+/77628722:21
clarkbcorvus: just noticed another thing that is important but maybe do that update after we see how the functional jobs do?22:22
clarkbthat actually had me really confused for a minute too. Was wondering why the new pre run wasn't used22:22
corvusclarkb: derp yes thx.  updated locally, will wait for results to push.22:23
clarkbnodepool is just starting now in the functional openstack job22:31
clarkbit isn't spamming about connection problems so I think it is looking good.22:32
clarkbwill have to see if image builds and launches all agree22:32
avasstristanC: thanks that worked a lot better :)22:36
*** jamesmcarthur has quit IRC22:38
tristanCavass: that's good to know! note that you can find similar package sets for other toolchains outside of nixpkgs, for example you would pull zuul-nix using an equivalent import expression22:45
clarkbcorvus: the functional openstack jobs succeeded22:48
clarkbI think this is very close once you swap out the unittest jobs22:48
openstackgerritJames E. Blair proposed zuul/zuul master: Add python-logstash-async to container images  https://review.opendev.org/c/zuul/zuul/+/77655122:49
tristanCcorvus: thank you for the suggestion, i've added a statsd service to my benchmark and the metrics are more finegrained, for enqueue time from 776287 i get: mean 0.00139 std dev 0.00018  (compared to the mqtt measure: mean 0.01406 std dev 0.00120)22:50
tristanCfor reference, here is the updated script: https://github.com/TristanCacqueray/zuul-nix/commit/2083b33005e568115a5e71b2847a953b1dd5d62c22:51
corvusoh interesting, i wasn't expecting them to necessarily be different as much as being something that could be compared with prod.  that's good to know and i'm glad that worked out22:52
corvusclarkb: looks like maybe it's far enough along i should just push up the fix and see if it passes for real?22:55
tristanCwell the timestamps are probably not collected at the same place, the statsd one measures from sched.addEvent to the end of manager.addChange22:55
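
(Roughly how a timing metric like the enqueue-time one in 776287 would be emitted with the python statsd client; the metric name, prefix and measurement points here are illustrative, not copied from the change.)

    import time
    import statsd

    client = statsd.StatsClient('localhost', 8125, prefix='zuul')

    start = time.monotonic()
    # ... work between sched.addEvent() and the end of manager.addChange() ...
    elapsed_ms = (time.monotonic() - start) * 1000
    client.timing('tenant.example.event_enqueue_time', elapsed_ms)
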
clarkbcorvus: ya I think so22:55
openstackgerritJames E. Blair proposed zuul/nodepool master: Require TLS  https://review.opendev.org/c/zuul/nodepool/+/77628622:55
*** harrymichal has quit IRC22:57
*** harrymichal has joined #zuul22:57
pabelangerI maybe have raised this before, but cannot remember. We have a lot of multi-node jobs, and sometimes when we are close to quota capacity we end up with a lot of unlocked ready nodes.  Here is an example: http://paste.openstack.org/show/802801/23:05
pabelangernodes 0001147143 and 0001147144 are ready, but because node 0001147128 failed, they seem to be idle now (using quota).23:07
pabelangerand given we've hit quota, they sometime sit idle for upwards of 8 hours23:07
*** jamesmcarthur has joined #zuul23:07
pabelangerI am curious if we could add something so that if the whole nodeset isn't allocated, we also delete the unlocked ready nodes23:08
pabelangerand give them back to the pool23:08
clarkbpabelanger: from memory I believe they are technically part of the pool already. If another job came by looking for those labels they could be used for them instead23:10
clarkbbut they aren't gone so do consume those resources until they timeout or are used23:10
clarkbyou can reduce the timeout for cleaning ready nodes iirc23:11
pabelangerHmm, so I need to confirm but we are not seeing them get used by other jobs. That would actually be good if they did23:12
pabelangerwe do set a low 'max-ready-age' but that doesn't seem to work in this case23:12
pabelangereg: max-ready-age: 60023:13
pabelangerbut, they will stay online well over 8 hours23:13
pabelangerhttps://dashboard.zuul.ansible.com/t/ansible/nodes23:13
pabelangeris an example of the amount right now23:14
*** harrymichal has quit IRC23:15
*** jamesmcarthur has quit IRC23:15
*** jamesmcarthur_ has joined #zuul23:15
clarkbthat doesn't show locked state though23:16
clarkbis it possible the ready nodes are locked?23:16
pabelanger| 0001147027 | limestone-us-dfw-1 | centos-8-1vcpu            | ad65b3b4-aa9a-4778-9d12-14f58577a11b | 162.253.43.28  | 2607:ff68:100:a::13e              | ready    | 00:01:05:32 | unlocked |23:16
pabelangerno23:16
pabelangerI _think_ they are still part of original node request23:17
pabelangerand not free, until the missing node comes online23:17
pabelangerwhich is a chicken and egg issue, as the ready unlocked nodes, is taking up quota23:17
clarkboh the original node request is still alive in that provider?23:18
clarkbthen ya I think that would hold those resources for that request (though they should stay locked until then?)23:18
pabelangeryah, it is23:18
pabelangerjust confirmed23:18
pabelanger| 200-0000346078 | 2        | requested | zs01.sjc1.vexxhost.zuul.ansible.com | centos-8-1vcpu,vmware-vcsa-7.0.0,esxi-6.7.0-without-nested,esxi-6.7.0-without-nested |       | nl01-26418-PoolWorker.ec2-us-east-2-main,nl01-26418-PoolWorker.limestone-us-dfw-1-s1.small,nl01-26418-PoolWorker.limestone-us-slc-s1.small,nl02-10946-PoolWorker.vexxhost-sjc1-v2-highcpu-1 | 224c7760-7232-11eb-86ac-5a443aaf3adf |23:18
pabelangerI need to confirm what the Priority value (2) means23:19
pabelangerif higher or lower means it gets serviced sooner23:19
clarkbpabelanger: I think it would be helpful to specifically figureout what state the node request is in while those nodes sit ready23:23
clarkbreading the max-ready-age cleanup code it seems the only checks are: is this ready, and is the age > the max age; if so lock it then delete it23:23
clarkb(that is why I asked about being locked because an unlocked node for a label with a ready age that has passed should be cleanable)23:24
clarkbI wonder if the cleanup doesn't run often enough23:24
pabelangersure, let me see if I can figure it out23:24
pabelangerI believe I can see the cleanup handler running23:25
clarkbif you grep 'exceeds max ready age' in debug logs you should see if it is ever trying to delete a node23:26
clarkblooks like those cleanups just run in a loop one after another so it should be serviced fairly frequently23:27
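
(A paraphrase of the max-ready-age check described above, not nodepool's actual CleanupWorker code: an unlocked READY node whose age exceeds the pool's max-ready-age is locked and then deleted.)

    import time

    def should_clean(node, max_ready_age):
        # node.state / node.state_time mirror what nodepool stores in
        # ZooKeeper; this helper and its names are only illustrative.
        if node.state != 'ready':
            return False
        age = time.time() - node.state_time
        return age >= max_ready_age

    # The wrinkle being debugged here: a node can stay allocated to a pending
    # node request, so it may not become eligible (or be released) until the
    # request itself is fulfilled or fails on every provider.
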
pabelangeryah, I can see it for other nodes23:27
pabelanger2021-02-18 23:13:50,840 DEBUG nodepool.CleanupWorker: Node 0001147424 exceeds max ready age: 1300.8019075393677 >= 60023:27
pabelangerso that is odd23:28
pabelangerthat node was in another node-request, which didn't fully come online: 200-000034616323:29
clarkbpabelanger: looking at your original paste that request was released and I would expect 0001147127 0001147128 0001147143 and 0001147144 to be cleaned up by the max-ready-age cleanup23:30
clarkb2021-02-18 22:24:02,138 DEBUG nodepool.PoolWorker.limestone-us-slc-s1.small: [e: 224c7760-7232-11eb-86ac-5a443aaf3adf] [node_request: 200-0000346078] Removing request handler <- is the line that I think says I'm giving up23:31
clarkband so those 4 nodes should either be used by subsequent requests or get cleaned up23:31
pabelangerokay, I am not going to touch anything and will grep for them in a bit23:31
clarkbmight be good to double check those 4 nodes and see what happened with them23:31
clarkband work from there23:31
*** tosky has quit IRC23:31
pabelangerI see a lot of exceeds max ready age well above 60023:32
pabelanger2021-02-18 20:17:19,868 DEBUG nodepool.CleanupWorker: Node 0001145171 exceeds max ready age: 4105.30643081665 >= 60023:32
clarkbya I suspect that may be the logging of what you are seeing and now need to work backward from there to see why the age is so high before it gets removed23:33
clarkbmaybe it was part of a request that lived for over an hour and it wasn't until the request went away that you get the cleanup23:34
pabelangermaybe23:39
pabelangerI can still see 200-0000346078 in the request-list23:39
pabelangerbut nothing in logs since23:39
pabelanger2021-02-18 22:24:02,970 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0001147128 (state: failed, allocated_to: 200-0000346078)23:39
pabelangerover an hour ago23:40
clarkb200-0000346078 should stay in the request list until all providers fail it or one succeeds23:40
clarkbthat log line may be the breadcrumb you need23:40
clarkbI would expect that to be deletable because that provider isn't handling that request anymore but maybe there is a bug where if the request is still outstanding on any provider it is considered alive23:41
pabelangeryah23:44
pabelangerI can't see anything specific now23:44
pabelangerI think I just have to wait, and see what happens23:44
pabelangerthen try to debug after the fact23:44
pabelangerokay, have to run23:45
pabelangerwill let you know23:45
*** jamesmcarthur_ has quit IRC23:53
*** jamesmcarthur has joined #zuul23:54
*** jamesmcarthur has quit IRC23:59
