fungi | internet finally came back! taking a look now | 03:07 |
---|---|---|
opendevreview | Merged openstack/project-config master: Update jeepyb triggered Gerrit builds to Gerrit 3.10 https://review.opendev.org/c/openstack/project-config/+/937268 | 03:17 |
opendevreview | Merged opendev/system-config master: Update gerrit image build job dependencies https://review.opendev.org/c/opendev/system-config/+/937288 | 03:23 |
kevko | Hi, it looks like something changed, as we don't have ansible_host available in our kolla pipelines | 09:57 |
kevko | some known issue ? | 09:57 |
frickler | infra-root: ^^ zuul was restarted 3h ago, I assume that has updated to the latest version, but I cannot see a change in the last week that looks to me like it could cause this? I also don't think it could be related to the gerrit update | 10:30 |
frickler | also it seems builds other than kolla are not affected, at least from what has been running so far. two tempest jobs on reqs passed | 10:31 |
kevko | hmm | 10:41 |
kevko | currently analyzing | 10:41 |
kevko | frickler: it's weird because ansible_host is there ... my debug run printed hostvars to a file | 10:49 |
kevko | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3cd/937302/3/check/kolla-ansible-ubuntu-lets-encrypt/3cde069/primary/logs/kolla_configs/inventory-kevko-debug | 10:50 |
frickler | so I think that this is due to a new ansible-core version getting used in zuul, containing the fix for https://access.redhat.com/security/cve/cve-2024-11079 https://github.com/ansible/ansible/commit/70e83e72b43e05e57eb42a6d52d01a4d9768f510 | 13:41 |
fungi | this was the fifth day since we turned on mailman's verp probe feature and reenabled all subscriptions for openstack-discuss which had been disabled by bounces. 227 subscriptions have been re-disabled today, and spot checks of the dsns indicate that they were bounces of the separate verp probes (not mailing list posts) and that those addresses probably are really and truly dead | 15:10 |
fungi | so now i watch to see if more subscriptions get disabled, and possibly also manually disable some that are generating "uncaught" bounces because mailman can't parse the dsn from their corresponding mailservers | 15:12 |
kevko | fungi: frickler: fix https://review.opendev.org/c/openstack/kolla-ansible/+/937302 | 16:41 |
kevko | aaa ..you were fast :D ..i was debugging on mobile :D | 16:44 |
fungi | kevko: yeah, the zuul upstream maintainers already got contacted earlier in the week by a user trying to track down a weird behavior in tasks iterating over the hostvars array, which turned out to be that ansible behavior change: https://review.opendev.org/c/zuul/zuul-jobs/+/937071 | 16:48 |
fungi | rather odd to see that behavior change backported in a stable point release with no explanation and seemingly no connection to the security fix which dragged it in. my guess is a dirty backport | 16:49 |
kevko | Thank you | 17:00 |
frickler | seems the patch was only applied to stable-2.18-16, not in devel | 17:00 |
frickler | and this does indeed look like a regression to me, too | 17:01 |
clarkb | that change for kolla ansible runs more than 60 jobs. I wonder if that is a record | 17:20 |
clarkb | and many of the jobs are multinode. It wouldn't surprise me if it is over 100 nodes per check queue run | 17:21 |
clarkb | frickler: is the "can't find LABEL=kolla" mount error a known issue? seems like a fair bit of job retries are due to that | 17:24 |
frickler | clarkb: yes, it seems another "raxflex is different" issue, mnasiadka was going to send a fix for that | 17:25 |
frickler | and the number of jobs is because the inventory preparation is central, so it really triggers the full set, which we don't often do otherwise | 17:26 |
fungi | i've so far manually disabled 12 subscriptions for openstack-discuss which have been omitting "uncaught" bounces indicating their addresses are inactive/deleted | 17:27 |
fungi | (one at ibm, 11 at red hat) | 17:27 |
fungi | s/omitting/emitting/ | 17:30 |
frickler | fungi: we've seen before that rh does seem to have some "special" kind of mail setup, so out of curiosity: were there other accounts that got disabled through automation for that site? or only these manual ones? | 17:30 |
frickler | (I can check myself if you don't know offhand) | 17:31 |
Clark[m] | frickler: it's weird that it would be a cloud issue since the label is set by Kolla jobs repartitioning and formatting the device | 17:35 |
fungi | frickler: no, a lot of old red hat subscriptions were getting "Recipient address rejected: User unknown in local recipient table" from a mail relay somewhere in aws | 17:36 |
Clark[m] | But also I would caution against blaming the cloud region for bugs like "we tried to use ipv6 when we didn't have ipv6 and it broke" | 17:36 |
Clark[m] | Sure we didn't have ipv6 because it ran in a specific location, but that's not a bug caused by the cloud region. The code was broken | 17:36 |
fungi | the "uncaught" bounces are coming from mimecast servers | 17:36 |
fungi | oh. some of the successfully caught bounces for rh addresses instead had "Diagnostic-Code: X-Postfix; unknown user" | 17:38 |
fungi | anyway, the variations probably have to do with how forwards and aliases were set up for different users, perhaps specific to their vintage or department/division | 17:40 |
fungi | seems like the addresses mimecast is configured to route to gmail inboxes are the ones mailman isn't successfully parsing bounce messages from | 17:42 |
Clark[m] | You'd expect Gmail bounces to be well understood at this point | 17:43 |
fungi | those include details from gmail's servers saying things like "The email account that you tried to reach is inactive." (disabled account) or "The email account that you tried to reach does not exist." (no such account) | 17:45 |
fungi | i expect the issue is how mimecast is formatting the dsn, since it's mimecast generating the bounce when trying to hand off the message to gmail | 17:48 |
fungi | gmail's servers are rejecting the message at rcpt, causing mimecast to bounce the message back to mailman | 17:51 |
fungi | so it's not really a gmail bounce | 17:51 |
mnasiadka | clarkb: it's weird but it only happens on raxflex (the label thing) - if it happened everywhere because of some kernel bug we would not point at raxflex | 19:02 |
Clark[m] | mnasiadka: is it a syncing problem? I would expect partitioning software to sync pretty aggressively though | 19:07 |
Clark[m] | I don't think we have seen similar issues with devstack jobs which also reformat though so maybe it is related to the software used to write the table? | 19:08 |
frickler | I didn't want to blame raxflex, I was just trying to convey that the subtle differences we experience there are good at surfacing issues in our jobs caused by invalid assumptions or possibly weird bugs (I didn't look at this specific issue in detail yet) | 19:59 |
fungi | my first guess would be baked-in assumptions about device names | 20:55 |
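The kolla-ansible breakage discussed in the log (the ansible-core backport of the CVE-2024-11079 fix) came down to templated connection variables like ansible_host no longer appearing in serialized hostvars. A minimal sketch of the defensive pattern jobs can use; the dicts here are stand-ins for Ansible's real hostvars structure, not its API:

```python
# Defensive lookup for connection addresses: after the ansible-core
# behavior change discussed above, ansible_host may be missing from a
# host's vars, so fall back to the inventory hostname itself.

def connection_address(hostvars, host):
    """Return ansible_host for `host`, or the hostname if it's absent."""
    return hostvars.get(host, {}).get("ansible_host", host)

# Example inventory-like data: one host still exposes ansible_host,
# the other (post-change) does not.
hostvars = {
    "primary": {"ansible_host": "203.0.113.10"},
    "secondary": {},
}

print(connection_address(hostvars, "primary"))    # 203.0.113.10
print(connection_address(hostvars, "secondary"))  # secondary
```

The same idea in Jinja would be a `default()` filter on the hostvars lookup, which is roughly what the linked zuul-jobs and kolla-ansible fixes do.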
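The verp probes fungi mentions work by encoding each subscriber's address into the probe's envelope sender, so a bounce identifies the failing recipient without any dsn parsing. A rough illustration of that round trip; the exact template is site configuration in Mailman (its VERP format settings), so the format string here is an assumption:

```python
# Sketch of VERP-style envelope-sender encoding, as used by mailing list
# probe features like Mailman's. The "+mailbox=host" convention below is
# illustrative; the real delimiters are configurable.

def verp_encode(list_bounces, subscriber):
    """Embed the subscriber address in the list's bounce address."""
    mailbox, _, host = subscriber.partition("@")
    local, _, domain = list_bounces.partition("@")
    return f"{local}+{mailbox}={host}@{domain}"

def verp_decode(envelope_sender):
    """Recover the subscriber address from a VERP envelope sender."""
    local = envelope_sender.split("@", 1)[0]
    _, _, encoded = local.partition("+")
    mailbox, _, host = encoded.partition("=")
    return f"{mailbox}@{host}"

probe = verp_encode("openstack-discuss-bounces@lists.openstack.org",
                    "alice@example.com")
print(probe)               # openstack-discuss-bounces+alice=example.com@lists.openstack.org
print(verp_decode(probe))  # alice@example.com
```

Because the subscriber is recoverable from the envelope alone, a probe bounce can disable the right subscription even when the remote server's dsn is unparseable.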
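The "uncaught" bounces in the log are dsns that mailman's bounce runner failed to parse. For context, a minimal stdlib-only sketch of what parsing an RFC 3464 delivery status notification involves; the message below is a hand-written stand-in, not a real mimecast dsn:

```python
# Extract failed recipients from a multipart/report DSN using only the
# stdlib email package, which parses a message/delivery-status body into
# a list of header blocks.
from email import message_from_string

DSN = """\
Content-Type: multipart/report; report-type=delivery-status; boundary="B"

--B
Content-Type: text/plain

The email account that you tried to reach does not exist.
--B
Content-Type: message/delivery-status

Reporting-MTA: dns; relay.example.net

Final-Recipient: rfc822; bob@example.com
Action: failed
Status: 5.1.1
Diagnostic-Code: smtp; 550 5.1.1 user unknown
--B--
"""

def failed_recipients(raw):
    """Return addresses whose per-recipient block reports Action: failed."""
    msg = message_from_string(raw)
    failed = []
    for part in msg.walk():
        if part.get_content_type() == "message/delivery-status":
            # The payload is a list of header blocks: one per-message
            # block, then one block per recipient.
            for block in part.get_payload():
                if block.get("Action", "").lower() == "failed":
                    rcpt = block.get("Final-Recipient", "")
                    failed.append(rcpt.partition(";")[2].strip())
    return failed

print(failed_recipients(DSN))  # ['bob@example.com']
```

When a relay emits a bounce that doesn't follow this structure (as appears to happen with the mimecast-to-gmail path described above), parsers have nothing reliable to match, which is why those subscriptions had to be disabled by hand.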
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!