fungi | internet finally came back! taking a look now | 03:07 |
---|---|---|
opendevreview | Merged openstack/project-config master: Update jeepyb triggered Gerrit builds to Gerrit 3.10 https://review.opendev.org/c/openstack/project-config/+/937268 | 03:17 |
opendevreview | Merged opendev/system-config master: Update gerrit image build job dependencies https://review.opendev.org/c/opendev/system-config/+/937288 | 03:23 |
kevko | Hi, it looks like something changed, as we don't have ansible_host available in our kolla pipelines | 09:57 |
kevko | some known issue ? | 09:57 |
frickler | infra-root: ^^ zuul was restarted 3h ago, I assume that has updated to the latest version, but I cannot see a change in the last week that looks to me like it could cause this? I also don't think it could be related to the gerrit update | 10:30 |
frickler | also it seems builds other than kolla are not affected, at least from what has been running so far. two tempest jobs on reqs passed | 10:31 |
kevko | hmm | 10:41 |
kevko | currently analyzing | 10:41 |
kevko | frickler: it's weird because ansible_host is there ... my debug run printed hostvars to a file | 10:49 |
kevko | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3cd/937302/3/check/kolla-ansible-ubuntu-lets-encrypt/3cde069/primary/logs/kolla_configs/inventory-kevko-debug | 10:50 |
frickler | so I think that this is due to a new ansible-core version getting used in zuul, containing the fix for https://access.redhat.com/security/cve/cve-2024-11079 https://github.com/ansible/ansible/commit/70e83e72b43e05e57eb42a6d52d01a4d9768f510 | 13:41 |
fungi | this was the fifth day since we turned on mailman's verp probe feature and reenabled all subscriptions for openstack-discuss which had been disabled by bounces. 227 subscriptions have been re-disabled today, and spot checks of the dsns indicate that they were bounces of the separate verp probes (not mailing list posts) and that those addresses probably are really and truly dead | 15:10 |
fungi | so now i watch to see if more subscriptions get disabled, and possibly also manually disable some that are generating "uncaught" bounces because mailman can't parse the dsn from their corresponding mailservers | 15:12 |
kevko | fungi: frickler: fix https://review.opendev.org/c/openstack/kolla-ansible/+/937302 | 16:41 |
kevko | aaa ..you were fast :D ..i was debugging on mobile :D | 16:44 |
fungi | kevko: yeah, the zuul upstream maintainers already got contacted earlier in the week by a user trying to track down a weird behavior in tasks iterating over the hostvars array, which turned out to be that ansible behavior change: https://review.opendev.org/c/zuul/zuul-jobs/+/937071 | 16:48 |
fungi | rather odd to see that behavior change backported in a stable point release with no explanation and seemingly no connection to the security fix which dragged it in. my guess is a dirty backport | 16:49 |
kevko | Thank you | 17:00 |
frickler | seems the patch was only applied to stable-2.18-16, not in devel | 17:00 |
frickler | and this does indeed look like a regression to me, too | 17:01 |
clarkb | that change for kolla ansible runs more than 60 jobs. I wonder if that is a record | 17:20 |
clarkb | and many of the jobs are multinode. It wouldn't surprise me if it is over 100 nodes per check queue run | 17:21 |
clarkb | frickler: is the "can't find LABEL=kolla" mount error a known issue? seems like a fair bit of job retries are due to that | 17:24 |
frickler | clarkb: yes, it seems another "raxflex is different" issue, mnasiadka was going to send a fix for that | 17:25 |
frickler | and the number of jobs is because the inventory preparation is central, so it really triggers the full set, which we don't often do otherwise | 17:26 |
fungi | i've so far manually disabled 12 subscriptions for openstack-discuss which have been omitting "uncaught" bounces indicating their addresses are inactive/deleted | 17:27 |
fungi | (one at ibm, 11 at red hat) | 17:27 |
fungi | s/omitting/emitting/ | 17:30 |
frickler | fungi: we've seen before that rh does seem to have some "special" kind of mail setup, so out of curiosity: were there other accounts that got disabled through automation for that site? or only these manual ones? | 17:30 |
frickler | (I can check myself if you don't know offhand) | 17:31 |
Clark[m] | frickler: it's weird that it would be a cloud issue since the label is set by Kolla jobs repartitioning and formatting the device | 17:35 |
fungi | frickler: no, a lot of old red hat subscriptions were getting "Recipient address rejected: User unknown in local recipient table" from a mail relay somewhere in aws | 17:36 |
Clark[m] | But also I would caution against blaming the cloud region for bugs like "we tried to use ipv6 when we didn't have ipv6 and it broke" | 17:36 |
Clark[m] | Sure we didn't have ipv6 because it ran in a specific location, but that's not a bug caused by the cloud region. The code was broken | 17:36 |
fungi | the "uncaught" bounces are coming from mimecast servers | 17:36 |
fungi | oh. some of the successfully caught bounces for rh addresses instead had "Diagnostic-Code: X-Postfix; unknown user" | 17:38 |
fungi | anyway, the variations probably have to do with how forwards and aliases were set up for different users, perhaps specific to their vintage or department/division | 17:40 |
fungi | seems like the addresses mimecast is configured to route to gmail inboxes are the ones mailman isn't successfully parsing bounce messages from | 17:42 |
Clark[m] | You'd expect Gmail bounces to be well understood at this point | 17:43 |
fungi | those include details from gmail's servers saying things like "The email account that you tried to reach is inactive." (disabled account) or "The email account that you tried to reach does not exist." (no such account) | 17:45 |
fungi | i expect the issue is how mimecast is formatting the dsn, since it's mimecast generating the bounce when trying to hand off the message to gmail | 17:48 |
fungi | gmail's servers are rejecting the message at rcpt, causing mimecast to bounce the message back to mailman | 17:51 |
fungi | so it's not really a gmail bounce | 17:51 |
mnasiadka | clarkb: it's weird but it only happens on raxflex (the label thing) - if it happened everywhere because of some kernel bug we would not point at raxflex | 19:02 |
Clark[m] | mnasiadka: is it a syncing problem? I would expect partitioning software to sync pretty aggressively though | 19:07 |
Clark[m] | I don't think we have seen similar issues with devstack jobs which also reformat though so maybe it is related to the software used to write the table? | 19:08 |
frickler | I didn't want to blame raxflex, I was just trying to convey that the subtle differences we experience there are good at surfacing issues in our jobs caused by invalid assumptions or possibly weird bugs (I didn't look at this specific issue in detail yet) | 19:59 |
fungi | my first guess would be baked-in assumptions about device names | 20:55 |
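The kolla-ansible breakage discussed in the log (the ansible-core backport of the CVE-2024-11079 fix) came down to templated connection variables like ansible_host no longer appearing in serialized hostvars. A minimal sketch of the defensive pattern jobs can use; the dicts here are stand-ins for Ansible's real hostvars structure, not its API:

```python
# Defensive lookup for connection addresses: after the ansible-core
# behavior change discussed above, ansible_host may be missing from a
# host's vars, so fall back to the inventory hostname itself.

def connection_address(hostvars, host):
    """Return ansible_host for `host`, or the hostname if it's absent."""
    return hostvars.get(host, {}).get("ansible_host", host)

# Example inventory-like data: one host still exposes ansible_host,
# the other (post-change) does not.
hostvars = {
    "primary": {"ansible_host": "203.0.113.10"},
    "secondary": {},
}

print(connection_address(hostvars, "primary"))    # 203.0.113.10
print(connection_address(hostvars, "secondary"))  # secondary
```

The same idea in Jinja would be a `default()` filter on the hostvars lookup, which is roughly what the linked zuul-jobs and kolla-ansible fixes do.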
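The verp probes fungi mentions work by encoding each subscriber's address into the probe's envelope sender, so a bounce identifies the failing recipient without any dsn parsing. A rough illustration of that round trip; the exact template is site configuration in Mailman (its VERP format settings), so the format string here is an assumption:

```python
# Sketch of VERP-style envelope-sender encoding, as used by mailing list
# probe features like Mailman's. The "+mailbox=host" convention below is
# illustrative; the real delimiters are configurable.

def verp_encode(list_bounces, subscriber):
    """Embed the subscriber address in the list's bounce address."""
    mailbox, _, host = subscriber.partition("@")
    local, _, domain = list_bounces.partition("@")
    return f"{local}+{mailbox}={host}@{domain}"

def verp_decode(envelope_sender):
    """Recover the subscriber address from a VERP envelope sender."""
    local = envelope_sender.split("@", 1)[0]
    _, _, encoded = local.partition("+")
    mailbox, _, host = encoded.partition("=")
    return f"{mailbox}@{host}"

probe = verp_encode("openstack-discuss-bounces@lists.openstack.org",
                    "alice@example.com")
print(probe)               # openstack-discuss-bounces+alice=example.com@lists.openstack.org
print(verp_decode(probe))  # alice@example.com
```

Because the subscriber is recoverable from the envelope alone, a probe bounce can disable the right subscription even when the remote server's dsn is unparseable.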
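The "uncaught" bounces in the log are dsns that mailman's bounce runner failed to parse. For context, a minimal stdlib-only sketch of what parsing an RFC 3464 delivery status notification involves; the message below is a hand-written stand-in, not a real mimecast dsn:

```python
# Extract failed recipients from a multipart/report DSN using only the
# stdlib email package, which parses a message/delivery-status body into
# a list of header blocks.
from email import message_from_string

DSN = """\
Content-Type: multipart/report; report-type=delivery-status; boundary="B"

--B
Content-Type: text/plain

The email account that you tried to reach does not exist.
--B
Content-Type: message/delivery-status

Reporting-MTA: dns; relay.example.net

Final-Recipient: rfc822; bob@example.com
Action: failed
Status: 5.1.1
Diagnostic-Code: smtp; 550 5.1.1 user unknown
--B--
"""

def failed_recipients(raw):
    """Return addresses whose per-recipient block reports Action: failed."""
    msg = message_from_string(raw)
    failed = []
    for part in msg.walk():
        if part.get_content_type() == "message/delivery-status":
            # The payload is a list of header blocks: one per-message
            # block, then one block per recipient.
            for block in part.get_payload():
                if block.get("Action", "").lower() == "failed":
                    rcpt = block.get("Final-Recipient", "")
                    failed.append(rcpt.partition(";")[2].strip())
    return failed

print(failed_recipients(DSN))  # ['bob@example.com']
```

When a relay emits a bounce that doesn't follow this structure (as appears to happen with the mimecast-to-gmail path described above), parsers have nothing reliable to match, which is why those subscriptions had to be disabled by hand.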
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!