Saturday, 2024-12-07

03:07 <fungi> internet finally came back! taking a look now
03:17 <opendevreview> Merged openstack/project-config master: Update jeepyb triggered Gerrit builds to Gerrit 3.10  https://review.opendev.org/c/openstack/project-config/+/937268
03:23 <opendevreview> Merged opendev/system-config master: Update gerrit image build job dependencies  https://review.opendev.org/c/opendev/system-config/+/937288
09:57 <kevko> Hi, it looks like something changed .. as we don't have ansible_host available in our kolla pipelines
09:57 <kevko> some known issue?
10:30 <frickler> infra-root: ^^ zuul was restarted 3h ago, I assume that it has updated to the latest version, but I cannot see a change in the last week that looks to me like it could cause this? I also don't think it could be related to the gerrit update
10:31 <frickler> also seems builds other than kolla are not affected, at least from what has been running so far. two tempest jobs on reqs passed
10:41 <kevko> hmm
10:41 <kevko> currently analyzing
10:49 <kevko> frickler: it's weird because ansible_host is there .... my debug with hostvars printed to file
10:50 <kevko> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3cd/937302/3/check/kolla-ansible-ubuntu-lets-encrypt/3cde069/primary/logs/kolla_configs/inventory-kevko-debug
13:41 <frickler> so I think that this is due to a new ansible-core version getting used in zuul, containing the fix for https://access.redhat.com/security/cve/cve-2024-11079 https://github.com/ansible/ansible/commit/70e83e72b43e05e57eb42a6d52d01a4d9768f510
15:10 <fungi> this was the fifth day since we turned on mailman's verp probe feature and reenabled all subscriptions for openstack-discuss which had been disabled by bounces. 227 subscriptions have been re-disabled today, and spot checks of the dsns indicate that they were bounces of the separate verp probes (not mailing list posts) and that those addresses probably are really and truly dead
15:12 <fungi> so now i watch to see if more subscriptions get disabled, and possibly also manually disable some that are generating "uncaught" bounces because mailman can't parse the dsn from their corresponding mailservers
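[editor's note: the bounce handling discussed above hinges on parsing RFC 3464 delivery status notifications. A minimal sketch of that parsing with Python's stdlib email package follows; Mailman's real bounce detector handles many non-standard formats, and the sample DSN and function name here are synthetic, for illustration only.]

```python
# Minimal sketch of RFC 3464 DSN parsing, in the spirit of what a list
# manager like Mailman does when scoring bounces. Only well-formed
# multipart/report messages are handled; the sample DSN is synthetic.
from email import message_from_string

SAMPLE_DSN = """\
From: MAILER-DAEMON@relay.example.org
To: listname-bounces+user=example.com@lists.example.org
Subject: Undelivered Mail Returned to Sender
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status; boundary="B"

--B
Content-Type: text/plain

The following address failed.
--B
Content-Type: message/delivery-status

Reporting-MTA: dns; relay.example.org

Final-Recipient: rfc822; user@example.com
Action: failed
Status: 5.1.1
Diagnostic-Code: smtp; 550 5.1.1 The email account that you tried to reach does not exist.
--B--
"""

def failed_recipients(raw):
    """Return addresses with Action: failed from message/delivery-status parts."""
    msg = message_from_string(raw)
    failed = []
    for part in msg.walk():
        if part.get_content_type() != "message/delivery-status":
            continue
        # The email package parses a delivery-status body into a list of
        # per-message / per-recipient header blocks.
        for block in part.get_payload():
            if block.get("Action", "").lower().startswith("fail"):
                addr = block.get("Final-Recipient", "")
                # Final-Recipient looks like "rfc822; user@example.com"
                failed.append(addr.split(";", 1)[-1].strip())
    return failed

print(failed_recipients(SAMPLE_DSN))  # ['user@example.com']
```

An "uncaught" bounce, as described above, is one where this kind of structured parsing fails because the reporting MTA does not emit a standard delivery-status part.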
16:41 <kevko> fungi: frickler: fix https://review.opendev.org/c/openstack/kolla-ansible/+/937302
16:44 <kevko> aaa .. you were fast :D .. i was debugging on mobile :D
16:48 <fungi> kevko: yeah, the zuul upstream maintainers already got contacted earlier in the week by a user trying to track down a weird behavior in tasks iterating over the hostvars array, which turned out to be that ansible behavior change: https://review.opendev.org/c/zuul/zuul-jobs/+/937071
16:49 <fungi> rather odd to see that behavior change backported in a stable point release with no explanation and seemingly no connection to the security fix which dragged it in. my guess is a dirty backport
17:00 <kevko> Thank you 🙏
17:00 <frickler> seems the patch was only applied to stable-2.18-16, not in devel
17:01 <frickler> and this does indeed look like a regression to me, too
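[editor's note: the workaround pattern for this kind of hostvars change is to stop assuming every host entry carries a templated ansible_host and fall back to the inventory hostname. A hypothetical sketch follows; the function name, sample inventory, and logic are illustrative and not taken from the actual kolla-ansible fix linked above.]

```python
# Hypothetical sketch: after the ansible-core behavior change discussed
# above, code iterating over hostvars can no longer rely on ansible_host
# being present for every host, so resolve the connection address with a
# fallback. resolve_address and the sample data are illustrative only.
def resolve_address(hostname, hostvars):
    """Return a host's connection address, tolerating a missing ansible_host."""
    return hostvars.get(hostname, {}).get("ansible_host") or hostname

hostvars = {
    "primary": {"ansible_host": "203.0.113.10"},
    "secondary": {},  # no ansible_host exposed for this host
}

print(resolve_address("primary", hostvars))    # 203.0.113.10
print(resolve_address("secondary", hostvars))  # secondary
```

The equivalent defensive idiom inside a playbook would be something like `{{ hostvars[host].ansible_host | default(host) }}`, again assuming the fallback to the inventory name is acceptable for the job.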
17:20 <clarkb> that change for kolla ansible runs more than 60 jobs. I wonder if that is a record
17:21 <clarkb> and many of the jobs are multinode. It wouldn't surprise me if it is over 100 nodes per check queue run
17:24 <clarkb> frickler: is the "can't find LABEL=kolla" mount error a known issue? seems like a fair bit of job retries are due to that
17:25 <frickler> clarkb: yes, it seems to be another "raxflex is different" issue, mnasiadka was going to send a fix for that
17:26 <frickler> and the number of jobs is because the inventory preparation is central, so it really triggers the full set, which we don't often do otherwise
17:27 <fungi> i've so far manually disabled 12 subscriptions for openstack-discuss which have been emitting "uncaught" bounces indicating their addresses are inactive/deleted
17:27 <fungi> (one at ibm, 11 at red hat)
17:30 <frickler> fungi: we've seen before that rh does seem to have some "special" kind of mail setup, so out of curiosity: were there other accounts that got disabled through automation for that site? or only these manual ones?
17:31 <frickler> (I can check myself if you don't know offhand)
17:35 <Clark[m]> frickler: it's weird that it would be a cloud issue since the label is set by Kolla jobs repartitioning and formatting the device
17:36 <fungi> frickler: no, a lot of old red hat subscriptions were getting "Recipient address rejected: User unknown in local recipient table" from a mail relay somewhere in aws
17:36 <Clark[m]> But also I would caution against blaming the cloud region for bugs like "we tried to use ipv6 when we didn't have ipv6 and it broke"
17:36 <Clark[m]> Sure we didn't have ipv6 because it ran in a specific location, but that's not a bug caused by the cloud region. The code was broken
17:36 <fungi> the "uncaught" bounces are coming from mimecast servers
17:38 <fungi> oh. some of the successfully caught bounces for rh addresses instead had "Diagnostic-Code: X-Postfix; unknown user"
17:40 <fungi> anyway, the variations probably have to do with how forwards and aliases were set up for different users, perhaps specific to their vintage or department/division
17:42 <fungi> seems like the addresses mimecast is configured to route to gmail inboxes are the ones mailman isn't successfully parsing bounce messages from
17:43 <Clark[m]> You'd expect Gmail bounces to be well understood at this point
17:45 <fungi> those include details from gmail's servers saying things like "The email account that you tried to reach is inactive." (disabled account) or "The email account that you tried to reach does not exist." (no such account)
17:48 <fungi> i expect the issue is how mimecast is formatting the dsn, since it's mimecast generating the bounce when trying to hand off the message to gmail
17:51 <fungi> gmail's servers are rejecting the message at rcpt, causing mimecast to bounce the message back to mailman
17:51 <fungi> so it's not really a gmail bounce
19:02 <mnasiadka> clarkb: it's weird but it only happens on raxflex (the label thing) - if it happened always because of some kernel bug, we would not point at raxflex
19:07 <Clark[m]> mnasiadka: is it a syncing problem? I would expect partitioning software to sync pretty aggressively though
19:08 <Clark[m]> I don't think we have seen similar issues with devstack jobs, which also reformat, so maybe it is related to the software used to write the table?
19:59 <frickler> I didn't want to blame raxflex, I was just trying to convey that the subtle differences we experience there are good at finding issues in our jobs caused by invalid assumptions or possibly weird bugs (I didn't look at this specific issue in detail yet)
20:55 <fungi> my first guess would be baked-in assumptions about device names
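[editor's note: if the LABEL=kolla failure is a timing issue between mkfs writing the label and udev creating the by-label symlink, one defensive pattern is a short settle-and-retry loop before mounting. The sketch below is hypothetical; wait_for_label and the "kolla" label usage are illustrative, not the actual fix mnasiadka was preparing.]

```python
# Hypothetical sketch of a settle-and-retry guard before mounting by label.
# If mkfs has just written the filesystem label, udev may not have created
# the /dev/disk/by-label symlink yet, so poll briefly instead of failing on
# the first mount attempt. All names here are illustrative.
import os
import time

def wait_for_label(label, timeout=10.0, interval=0.5):
    """Poll for /dev/disk/by-label/<label>; return its path, or None on timeout."""
    path = os.path.join("/dev/disk/by-label", label)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return path
        time.sleep(interval)
    return None

# In a job this would gate the mount step, for example:
#   dev = wait_for_label("kolla")
#   if dev is None:
#       raise RuntimeError("LABEL=kolla never appeared; check partitioning step")
```

This would not help if the label is never written at all (for example, a baked-in device-name assumption picking the wrong disk on raxflex), which is why checking the formatting step itself comes first.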

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!