Monday, 2023-03-06

08:06 *** jpena|off is now known as jpena
10:25 <dpawlik> dansmith: after 4 days of collecting logs from TIMED_OUT builds, it looks like...
10:25 <dpawlik> 1. rax-ORD - 21 times
10:25 <dpawlik> 2. rax-IAD - 13 times
10:25 <dpawlik> 3. rax-DFW - 12 times
10:25 <dpawlik> 4. ovh-GRA1 - 9 times
10:25 <dpawlik> 5. ovh-BHS1 - 6 times
12:40 <fungi> that looks like it could approximately correlate to the proportions of quota we have in each region, implying a relatively even distribution and not implicating any particular provider/region
12:47 <fungi> would need to look at the effective peak utilization on the graphs for each region to be absolutely certain, since the max-servers value set for them in nodepool can be a lie, but regardless those numbers are all within an order of magnitude of each other, so we're probably nowhere near statistical significance with that number of samples
12:53 <opendevreview> Elod Illes proposed openstack/openstack-zuul-jobs master: Add stable/2023.1 to periodic-stable templates  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/876568
12:57 <fungi> just going by max-servers though: rax-ord=195, rax-iad=145, rax-dfw=140, ovh-gra1=79, ovh-bhs1=120
12:58 <fungi> so if we ignore the lack of statistical significance, it seems like maybe ovh-bhs1 performs twice as well as all the others
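
[For anyone wanting to redo the comparison fungi is making by eye, here is a minimal sketch using the counts and max-servers figures quoted above; the chi-square check via scipy is an assumption about tooling, not something anyone in channel actually ran:]

    # Sketch: normalize TIMED_OUT counts by each region's max-servers and
    # check whether the distribution differs from quota-proportional.
    from scipy.stats import chisquare

    timeouts = {"rax-ord": 21, "rax-iad": 13, "rax-dfw": 12,
                "ovh-gra1": 9, "ovh-bhs1": 6}
    max_servers = {"rax-ord": 195, "rax-iad": 145, "rax-dfw": 140,
                   "ovh-gra1": 79, "ovh-bhs1": 120}

    for region in timeouts:
        rate = timeouts[region] / max_servers[region]
        print(f"{region}: {rate:.3f} timeouts per configured node")

    # Expected counts if timeouts were simply proportional to max-servers;
    # with samples this small the chi-square approximation is rough at best,
    # which is fungi's point about statistical significance.
    total = sum(timeouts.values())
    quota = sum(max_servers.values())
    expected = [total * max_servers[r] / quota for r in timeouts]
    print(chisquare(list(timeouts.values()), f_exp=expected))

[The normalized rates come out around 0.09-0.11 everywhere except ovh-bhs1 at 0.05, which is the "twice as well" reading above.]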
13:00 <fungi> however, looking at the graphs, something's wrong in rax-ord and we're not actually booting anywhere near our max-servers value there; most of it either sits unused or is taken up by nodes in a "building" state, suggesting maybe we have serious boot issues there. as a result, the used count is far lower than for other regions, making its comparatively high timeout count
13:01 <fungi> a bit more significant of an indicator
13:12 <fungi> strangely, we have enough server quota there for 204 instances, enough ram quota for 200 instances, unlimited core quota... though totalInstancesUsed is reporting 10x more than what's currently on nodepool's graph, leading me to wonder if there's a ghost fleet hiding there
13:15 <fungi> yeah, there are a ton of nodes there in an active state which nodepool's own count doesn't reflect
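
[The totalInstancesUsed figure fungi mentions comes from nova's limits API (e.g. `openstack limits show --absolute`). A minimal sketch of the cross-check being described, using openstacksdk; the cloud name "rax-ord" is assumed to exist in a local clouds.yaml:]

    # Sketch: tally servers by status in a region to compare against what
    # nodepool believes it is running there.
    from collections import Counter

    import openstack

    conn = openstack.connect(cloud="rax-ord")
    statuses = Counter(server.status for server in conn.compute.servers())
    for status, count in sorted(statuses.items()):
        print(f"{status}: {count}")
    # A large ACTIVE count here alongside a much smaller in-use count on
    # nodepool's graphs would point at leaked ("ghost") instances that
    # nodepool no longer tracks.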
13:48 <fungi> moving to #opendev with further investigation of what's going on in that provider
14:42 <dansmith> dpawlik: ack, but also two of those days were mostly weekend
14:42 <dansmith> dpawlik: but yeah, that's the sort of data I want to be able to gather, so thanks very much for getting that fixed!
14:47 <fungi> anyway, there do seem to be enough examples outside rax-ord to suggest that the underlying problems aren't exclusive to a specific provider, nor all a result of whatever is going on in rax-ord that we need to sort out
14:48 <fungi> also we still have inmotion-iad3 offline until we can track down what's been causing nova to turn off the mirror server in there
14:49 <fungi> so if it was contributing to timeouts previously, we won't have numbers to say until it's returned to the pool
15:09 <dpawlik> dansmith: you're welcome
16:09 <ganso> hi! I'm not sure this is the place to ask this, but is there anything blocking this patch? https://review.opendev.org/c/openstack/grenade/+/871946
16:12 <dansmith> gmann: kopecmartin I can +W this ^ but since it's been sitting a while I just want to make sure you weren't waiting for the release or something
16:15 <ganso> I've seen this tempest issue blocking the CI on the heat stable/yoga branch. Has anyone seen a fix for it that could be backported? "AttributeError: type object 'Draft4Validator' has no attribute 'FORMAT_CHECKER'"
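
[For context on that traceback: it is a jsonschema version mismatch. Validator.FORMAT_CHECKER was only added in jsonschema 4.5.0, so code written against newer releases fails on the older jsonschema pinned on stable branches. A version-tolerant shim might look like the sketch below; this is illustrative only, not the fix the tempest revert took:]

    # Sketch: tolerate both old and new jsonschema releases.
    # Draft4Validator.FORMAT_CHECKER exists only in jsonschema >= 4.5.0;
    # older releases expose a module-level draft4_format_checker instead.
    import jsonschema

    try:
        format_checker = jsonschema.Draft4Validator.FORMAT_CHECKER
    except AttributeError:
        format_checker = jsonschema.draft4_format_checker

    validator = jsonschema.Draft4Validator(
        {"type": "string", "format": "date-time"},
        format_checker=format_checker,
    )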
16:16 <kopecmartin> dansmith: feel free to +W, not waiting for anything
16:16 <dansmith> done
16:16 <kopecmartin> ganso: we have reverted the patches which caused that, all should be good now
16:16 <kopecmartin> dansmith: thanks
16:17 <kopecmartin> ganso: oh, the revert is only almost merged :/
16:17 <kopecmartin> https://review.opendev.org/c/openstack/tempest/+/876218
16:17 <kopecmartin> rechecking
16:17 <kopecmartin> it's been rechecked, so we're waiting now
16:17 <ganso> kopecmartin: thank you so much! =)
16:41 <fungi> ganso: as far as where to ask such questions in the future, grenade is a deliverable of the qa team, so #openstack-qa is your best bet (but still reasonably on topic for here as well)
16:41 <ganso> fungi: thanks!
16:58 <opendevreview> Jeremy Stanley proposed openstack/project-config master: Increase boot-timeout for rax-ord  https://review.opendev.org/c/openstack/project-config/+/876592
17:13 *** jpena is now known as jpena|off
17:38 <opendevreview> Merged openstack/project-config master: Increase boot-timeout for rax-ord  https://review.opendev.org/c/openstack/project-config/+/876592
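
[boot-timeout is a per-provider option in nodepool's launcher configuration, controlling how long the launcher waits for a server to reach ACTIVE before counting the boot as failed. A sketch of the shape of such a change; the values here are illustrative, not the ones actually used in review 876592:]

    # Illustrative nodepool provider stanza; only boot-timeout is the
    # point here, and these numbers are made up rather than taken from
    # the merged change.
    providers:
      - name: rax-ord
        region-name: ORD
        boot-timeout: 300   # seconds to wait for a node to become ACTIVE
        pools:
          - name: main
            max-servers: 195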
18:31 <gmann> ganso: yes, we reverted those changes. let me check if they are merged
18:32 <gmann> ganso: this one, it is in the gate; once it merges, feel free to recheck https://review.opendev.org/c/openstack/tempest/+/876218
18:33 <gmann> just saw kopecmartin already mentioned these
18:37 <gmann> kopecmartin: ganso sent an email on openstack-discuss also. I thought the revert had merged
