Friday, 2020-06-26

*** jamesmcarthur has joined #openstack-infra00:00
*** Lucas_Gray has quit IRC00:09
*** Lucas_Gray has joined #openstack-infra00:12
*** jamesmcarthur_ has joined #openstack-infra00:18
*** hamalq_ has quit IRC00:20
*** jamesmca_ has joined #openstack-infra00:21
*** jamesmcarthur has quit IRC00:22
*** jamesmcarthur_ has quit IRC00:23
*** jamesmca_ has quit IRC00:24
*** tetsuro has joined #openstack-infra00:25
*** yamamoto has joined #openstack-infra00:36
*** ricolin has joined #openstack-infra00:37
*** jamesmcarthur has joined #openstack-infra00:39
*** yamamoto has quit IRC00:41
*** jamesmcarthur has quit IRC00:42
*** jamesden_ has joined #openstack-infra00:42
*** ricolin has quit IRC00:42
*** jamesden_ has quit IRC00:42
*** jamesden_ has joined #openstack-infra00:43
*** jamesdenton has quit IRC00:43
*** jamesmcarthur has joined #openstack-infra00:44
*** armax has quit IRC00:50
*** markvoelker has joined #openstack-infra00:54
*** ociuhandu has joined #openstack-infra00:56
*** armax has joined #openstack-infra00:58
*** ociuhandu has quit IRC00:59
*** markvoelker has quit IRC00:59
*** ociuhandu has joined #openstack-infra00:59
*** ociuhandu has quit IRC01:03
*** markvoelker has joined #openstack-infra01:09
*** yamamoto has joined #openstack-infra01:12
*** markvoelker has quit IRC01:14
*** yamamoto has quit IRC01:38
*** ricolin has joined #openstack-infra01:46
*** jamesmcarthur has quit IRC01:58
*** jamesmcarthur has joined #openstack-infra01:59
*** rlandy|ruck|bbl is now known as rlandy|ruck02:03
*** Lucas_Gray has quit IRC02:04
*** jamesmcarthur has quit IRC02:04
*** rfolco has quit IRC02:09
*** rlandy|ruck has quit IRC02:21
*** vishalmanchanda has joined #openstack-infra02:29
*** jamesmcarthur has joined #openstack-infra02:32
*** yamamoto has joined #openstack-infra02:42
*** jamesmcarthur has quit IRC02:42
*** yamamoto has quit IRC02:43
*** yamamoto has joined #openstack-infra02:43
*** hongbin has joined #openstack-infra02:59
*** ricolin has quit IRC03:02
*** yolanda has quit IRC03:02
*** yolanda has joined #openstack-infra03:03
*** smarcet has quit IRC03:07
*** jamesmcarthur has joined #openstack-infra03:17
*** jamesmcarthur has quit IRC03:22
*** psachin has joined #openstack-infra03:28
*** hongbin has quit IRC03:33
*** jamesmcarthur has joined #openstack-infra03:50
*** armax has quit IRC04:05
*** ykarel|away is now known as ykarel04:27
*** evrardjp has quit IRC04:33
*** evrardjp has joined #openstack-infra04:33
*** ysandeep|away is now known as ysandeep04:40
*** jamesmcarthur has quit IRC04:52
*** jamesmcarthur has joined #openstack-infra04:52
*** jtomasek has joined #openstack-infra04:56
*** jtomasek has quit IRC04:56
*** jtomasek has joined #openstack-infra04:57
*** jamesmcarthur has quit IRC04:58
*** matt_kosut has joined #openstack-infra05:00
AJaegerconfig-core, could you review https://review.opendev.org/737987 and https://review.opendev.org/737995 , please? - further retirement changes05:05
*** jamesmcarthur has joined #openstack-infra05:26
*** jamesmcarthur has quit IRC05:33
*** udesale has joined #openstack-infra05:40
*** lmiccini has joined #openstack-infra05:45
openstackgerritOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/738150 06:10
*** danpawlik is now known as dpawlik|EoD06:15
*** ralonsoh has joined #openstack-infra06:17
*** dklyle has quit IRC06:20
*** rpittau|afk is now known as rpittau06:29
openstackgerritMerged openstack/project-config master: Retire networking-onos, openstack-ux, solum-infra-guest-agent: Step 1  https://review.opendev.org/737987 06:50
*** slaweq has joined #openstack-infra06:57
*** ysandeep is now known as ysandeep|afk07:04
*** marcosilva has joined #openstack-infra07:17
*** jcapitao has joined #openstack-infra07:18
*** hashar has joined #openstack-infra07:20
*** ysandeep|afk is now known as ysandeep07:23
*** bhagyashris|afk is now known as bhagyashris07:27
*** amoralej|off is now known as amoralej07:31
*** jpena|off is now known as jpena07:31
*** jtomasek has quit IRC07:32
*** dtantsur|afk is now known as dtantsur07:33
*** jtomasek has joined #openstack-infra07:35
*** tosky has joined #openstack-infra07:35
*** ociuhandu has joined #openstack-infra07:37
openstackgerritMerged openstack/openstack-zuul-jobs master: Remove legacy-tempest-dsvm-networking-onos  https://review.opendev.org/737995 07:38
*** marcosilva has quit IRC07:50
openstackgerritAndreas Jaeger proposed openstack/openstack-zuul-jobs master: Run openafs promote job only if gate job run  https://review.opendev.org/738155 07:50
*** jtomasek has quit IRC08:10
*** jtomasek has joined #openstack-infra08:14
*** hashar_ has joined #openstack-infra08:21
*** hashar has quit IRC08:22
*** hashar_ is now known as hashar08:29
*** pkopec has quit IRC08:31
*** derekh has joined #openstack-infra08:43
*** ykarel is now known as ykarel|lunch08:48
*** markvoelker has joined #openstack-infra08:49
openstackgerritCarlos Goncalves proposed openstack/project-config master: Add nested-virt-centos-8 label  https://review.opendev.org/738161 08:50
*** jistr has quit IRC08:53
*** markvoelker has quit IRC08:54
*** jistr has joined #openstack-infra08:54
*** gfidente has joined #openstack-infra09:02
*** kaiokmo has joined #openstack-infra09:06
*** ysandeep is now known as ysandeep|lunch09:06
*** udesale has quit IRC09:07
openstackgerritShivanand Tendulker proposed openstack/project-config master: Removes py35 and py27 jobs for proliantutils  https://review.opendev.org/738168 09:09
*** ramishra has quit IRC09:09
*** xek has joined #openstack-infra09:10
openstackgerritCarlos Goncalves proposed openstack/project-config master: Add nested-virt-centos-8 label  https://review.opendev.org/738161 09:13
openstackgerritShivanand Tendulker proposed openstack/project-config master: Removes py35 and py27 jobs for proliantutils  https://review.opendev.org/738168 09:14
*** udesale has joined #openstack-infra09:15
*** pkopec has joined #openstack-infra09:17
*** ysandeep|lunch is now known as ysandeep09:22
*** eolivare has joined #openstack-infra09:26
*** Lucas_Gray has joined #openstack-infra09:41
*** ramishra has joined #openstack-infra09:52
*** ykarel|lunch is now known as ykarel09:55
*** priteau has joined #openstack-infra10:03
*** rpittau is now known as rpittau|bbl10:04
*** tetsuro has quit IRC10:08
*** pkopec has quit IRC10:09
*** jcapitao has quit IRC10:21
*** jcapitao has joined #openstack-infra10:23
*** jcapitao is now known as jcapitao_lunch10:34
*** slaweq has quit IRC10:40
*** ccamacho has quit IRC10:42
*** slaweq has joined #openstack-infra10:42
*** markvoelker has joined #openstack-infra10:50
*** markvoelker has quit IRC10:54
*** Lucas_Gray has quit IRC11:14
openstackgerritThierry Carrez proposed zuul/zuul-jobs master: upload-git-mirror: use retries to avoid races  https://review.opendev.org/738187 11:21
*** jaicaa has quit IRC11:23
zbrwhat is happening with "Web Listing Disabled" on log servers?11:24
*** Lucas_Gray has joined #openstack-infra11:26
*** jpena is now known as jpena|lunch11:33
*** ryohayakawa has quit IRC11:35
*** tinwood has quit IRC11:37
*** kopecmartin is now known as kopecmartin|pto11:37
*** tinwood has joined #openstack-infra11:38
*** jaicaa has joined #openstack-infra11:49
fricklerdirk: cmurphy: would one of you be interested in fixing opensuse for stable/stein in devstack? see https://review.opendev.org/735640 , the other option would be to just drop that job until someone cares or has time12:01
dirkfrickler: iirc AJaeger was looking at it12:04
dirkthere has been a short conversation about it12:04
dirkfrickler: I'll poke that you'll get a colleague looking at it12:04
AJaegerdirk: I was looking and failed ;(12:07
*** jcapitao_lunch is now known as jcapitao12:07
AJaegerdirk: so, we were able to fix train but stein is a different beast12:08
dirkAJaeger: ok, I'll ask internally further, thanks12:08
AJaegerthanks, dirk!12:10
*** rpittau|bbl is now known as rpittau12:12
*** rlandy has joined #openstack-infra12:13
*** rlandy is now known as rlandy|ruck12:13
*** ociuhandu has quit IRC12:14
*** rfolco has joined #openstack-infra12:18
*** udesale has quit IRC12:29
*** derekh has quit IRC12:32
*** ociuhandu has joined #openstack-infra12:34
*** rlandy|ruck is now known as rlandy|ruck|mtg12:35
*** smarcet has joined #openstack-infra12:37
fungizbr: you'll have to be more specific, though that usually is an indication that no index was uploaded for the url you're visiting (possibly nothing at all). have an example build which links to a listing error?12:43
fungizbr: also possible you're following an old link and the logs have already expired and been deleted?12:44
*** jpena|lunch is now known as jpena12:44
zbrfungi: i did a recheck, my guess is that current retention is too small. sometimes we need logs available for long time before we make a decision12:45
zbralso, i observed that updating the commit message on my controversial wrap/unwrap change reset votes, so we cannot rely on gerrit to track support for that change.12:46
fungiyep, however we generate something like 2-3tb of compressed logs every month12:46
zbrany idea what we can use to track it?12:46
fungilast i looked anyway12:46
zbrprobably we need different rules based on project or based on size of logs per job12:46
zbrwhy to scrap jobs that produce low logs due to ones that are heavy on them?12:47
fungii'm not sure i want to be in the position of deciding what project is more important than what other project and who deserves to be able to monopolize our ci log storage12:47
zbrprobably a rule based on size would be unbiased12:47
zbrany log > X is scrapped at 30 days, any log > Y is scrapped at 3 months.12:48
fungithat might be doable, expiration is decided at upload time, though it's also much easier to communicate a single retention period12:48
zbra rotating rule would make more sense to me, start to scrap old stuff, instead of guessing at upload time12:50
zbri am not sure we can always make an informed decision about removal date when we upload, yep we should have a default.12:51
fungiwell, the expiration is a swift feature. when we had a log server we used to have a running process go through all the old logs to decide what to get rid of, and turns out the amount of logs we're keeping is so large that if you run something like that continuously you still can't keep up with the upload rate12:51
zbrcan we compute total log size before uploading them?12:51
*** markvoelker has joined #openstack-infra12:51
zbrif so, we could make the expiration bit longer for small logs.12:51
zbrthis could play well with less active projects too12:52
fungithat's why i was saying it might be possible since in theory we know the aggregate log quantity for any build12:52
zbrand avoid extra "rechecks"12:52
zbrlets see what others think about it12:52
fungithough based on previous numbers, tripleo would wind up with a one-week expiration for most of its job logs ;)12:52
zbrfungi: (me hiding)12:53
fungiwe expire lots of smaller logs to make room for those12:53
fungicurrently12:53
fungibut yeah, i don't know what the real upshot would be, or whether we could just increase our retention period, across the board depending on what our overall utilization and donated object storage quotas look like12:54
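For reference, a minimal sketch of the size-based expiration idea being discussed, assuming logs are uploaded to Swift and that retention is chosen per build at upload time via the standard X-Delete-After header; the thresholds, container name and paths are invented for illustration and are not the real job settings:

```bash
#!/bin/bash
# Hypothetical sketch: choose a retention period from the aggregate size of a
# build's log directory, then let Swift expire the objects itself.
LOGDIR=${1:-logs}
CONTAINER=${2:-example_logs_container}   # made-up container name

# Aggregate size of everything about to be uploaded, in megabytes.
total_mb=$(du -sm "$LOGDIR" | awk '{print $1}')

# Bigger log sets expire sooner; small ones keep the longest retention.
if   [ "$total_mb" -gt 1024 ]; then ttl=$((7  * 24 * 3600))   # > 1 GB: one week
elif [ "$total_mb" -gt 100  ]; then ttl=$((30 * 24 * 3600))   # > 100 MB: 30 days
else                                ttl=$((90 * 24 * 3600))   # otherwise: 90 days
fi

# X-Delete-After is a standard Swift feature: the cluster removes the objects
# once the TTL elapses, so no separate cleanup pass is needed.
swift upload --header "X-Delete-After: ${ttl}" "$CONTAINER" "$LOGDIR"
```

The tradeoff fungi raises still applies: a per-size policy is harder to communicate to users than a single retention period.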
*** markvoelker has quit IRC12:56
*** jamesden_ is now known as jamesdenton12:59
*** amoralej is now known as amoralej|lunch13:04
*** derekh has joined #openstack-infra13:04
*** gfidente has quit IRC13:06
*** ykarel is now known as ykarel|afk13:14
*** smarcet has quit IRC13:16
*** gfidente has joined #openstack-infra13:16
*** dtantsur is now known as dtantsur|afk13:21
mwhahahahey i'm trying to look into the RETRY_LIMIT crashing issue but I don't seem to be able to figure out when it might have started. Is there a way to get more history out of zuul or some other tool?13:21
mwhahahaI don't seem to be able to find them in logstash (probably cause no logs)13:22
mwhahahaI have a feeling it's a bug in centos8 because it's occurring on different jobs/branches but I don't know when it started13:24
EmilienMinfra-root: we're having a gate issue and we have a mitigation to reduce the failures with a revert: https://review.opendev.org/#/c/738025/ - would it be possible to force push that patch please?13:28
*** smarcet has joined #openstack-infra13:29
rlandy|ruck|mtgmwhahaha: we had a discussion about the RETRY_LIMIT on wednesday - if it's the same issue13:30
mwhahahayea it's that one13:30
mwhahahai think it's happening when we run container image prepare which relies on multiprocessing because it seems to be happening ~30-40 mins into the job consistently13:30
AJaegerEmilienM: you mean: promote to head of queue?13:30
rlandy|ruck|mtg we got as far as finding out that the failure hits in tripleo-ci/toci_gate_test.sh13:31
rlandy|ruck|mtgand mostly leaves no logs13:31
rlandy|ruck|mtgany test running toci can basically hit it13:31
mwhahaharlandy|ruck|mtg: I don't think it's the shell script tho, it's more likely what we're running inside it13:31
mwhahahathat's just our entry point into quickstart13:31
rlandy|ruck|mtgwe didn't trace it back to any particular provider13:31
mwhahahayea i think it's a centos8 bug13:32
mwhahahabecause i think it started about the time we got 8.213:32
mwhahahahttps://zuul.opendev.org/t/openstack/builds?result=RETRY_LIMIT&project=openstack%2Ftripleo-heat-templates 13:32
mwhahahapoints to something durring the job13:32
EmilienMAJaeger: no, force merge13:32
mwhahaharlandy|ruck|mtg: the timing indicates it's during the standalone deploy itself because we start it ~30 mins into a job13:33
rlandy|ruck|mtgAJaeger: here is the related bug ... https://bugs.launchpad.net/tripleo/+bug/1885279 13:33
openstackLaunchpad bug 1885279 in tripleo "TestVolumeBootPattern.test_volume_boot_pattern tests on master are failing on updating to cirros-0.5.1 image" [Critical,In progress] - Assigned to Ronelle Landy (rlandy)13:33
fricklerrlandy|ruck|mtg: you need > 64MB for cirros 0.5.1. devstack uses 128MB13:34
*** mordred has quit IRC13:34
rlandy|ruck|mtgchandankumar: ^^13:35
rlandy|ruck|mtgfrickler: that may be - but we need more qualification here and we'd like to revert to do that13:35
chandankumarrlandy|ruck|mtg, let me check the default size13:36
AJaegerEmilienM: I don't have those permissions, just asking. Why do you think a force-merge is needed?13:36
rlandy|ruck|mtgAJaeger: it's taking time to get the patch through the gate13:36
rlandy|ruck|mtgin the mean time, other jobs are failing13:36
EmilienMAJaeger: our CI stats isn't good. We're dealing with multiple issues at this time and we think this one is one of them13:37
AJaegerNormally, we just promote them to head of gate to speed up...13:37
*** dave-mccowan has quit IRC13:37
EmilienMyeah things haven't been normal for us this week :-/13:37
AJaegerbbl13:37
rlandy|ruck|mtgI guess we'll take what we can get - top of the queue then - pls13:38
*** dave-mccowan has joined #openstack-infra13:38
chandankumarrlandy|ruck|mtg, https://opendev.org/osf/python-tempestconf/src/branch/master/config_tempest/constants.py#L3013:38
chandankumarit is 6413:38
chandankumarwe need to increase that13:38
rlandy|ruck|mtgchandankumar: let's discuss back on our channels13:39
chandankumaryes13:39
corvusEmilienM, AJaeger, infra-root: hi, i can promote 73802513:40
EmilienMthanks corvus13:41
corvusEmilienM: if it does improve things, then of course changes behind it in the gate will automatically receive the benefit of that improvement; you can make other changes in check benefit from it before it lands by adding a Depends-On13:43
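For reference, the Depends-On corvus mentions is just a footer in the commit message pointing at the unmerged change; a hypothetical example for a change in check that should be tested on top of the revert (the subject line is invented):

```bash
# Hypothetical: amend a change so Zuul prepares its workspace with the
# unmerged revert included. Gerrit's commit-msg hook re-adds the Change-Id.
git commit --amend -F - <<'EOF'
Work around standalone gate failures until the cirros revert lands

Depends-On: https://review.opendev.org/738025
EOF
git review
```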
EmilienMright13:44
corvusit's at the top now13:44
EmilienMI saw, thanks a lot13:44
rlandy|ruck|mtgcorvus: thank you13:45
corvusno problem :)  hth13:45
*** udesale has joined #openstack-infra13:47
*** amoralej|lunch is now known as amoralej13:47
*** jamesmcarthur has joined #openstack-infra13:49
rlandy|ruck|mtgmwhahaha: https://bugs.launchpad.net/tripleo/+bug/1885286 - so we have a place to track the investigation of RETRY_LIMIT errors13:53
openstackLaunchpad bug 1885286 in tripleo "Increase in RETRY_LIMIT errors in zuul.openstack.org is preventing jobs from passing check/gate" [Critical,Triaged] - Assigned to Ronelle Landy (rlandy)13:53
*** rlandy|ruck|mtg is now known as rlandy|ruck13:54
mwhahaharlandy|ruck: yea i think it's happening during container-image-prepare  based on the timings ~30-40 mins13:54
*** yamamoto has quit IRC13:54
rlandy|ruckmwhahaha: so then the failure is later then ...13:55
mwhahahayea13:55
rlandy|ruckon Wed we were looking at failure around 1-1 mins13:55
mwhahahahttps://zuul.opendev.org/t/openstack/builds?result=RETRY_LIMIT&project=openstack%2Ftripleo-heat-templates 13:55
rlandy|ruck10-1513:55
rlandy|ruckand no logs13:55
mwhahahacheck the failures for tripleo jobs13:55
mwhahahai think it's a multiprocessing bug in python in centos813:55
mwhahahawe've seen stack traces in ansible with it too previously13:55
mwhahahai'll try and dig up my logs later, i asked about it in #ansible-devel like 2 weeks ago13:56
*** yamamoto has joined #openstack-infra13:56
rlandy|ruckmwhahaha: k, thanks13:56
mwhahaharlandy|ruck: http://paste.openstack.org/show/794407/ was the ansible crash i saw14:03
rlandy|ruckat least we have some trace to go on now14:04
mwhahahamay not be related but i saw it shortly after we got 2.9.914:04
mwhahahabut since the failure is in python itself, i'm wondering if there's another issue14:05
*** dklyle has joined #openstack-infra14:08
fungisomething seems to be breaking in such a way that at least ssh from the executor ceases working... whether that's a kernel panic, network interface getting unconfigured, sshd hanging... can't really tell14:09
fungiyou could open a bunch of log streams for jobs you think are more likely to hit that condition, and see where they stop14:10
fungi(if you do it with finger protocol you could probably record them to separate local files pretty easily, and not have to depend on browser websockets)14:10
mwhahahait doesn't happen enough :/14:11
rlandy|ruckmwhahaha: on that paste the date logged is Jun 05 09:44:1514:11
mwhahahayea that's not from one of these14:11
rlandy|rucktwenty days ago?14:11
mwhahahathat's just something i noticed that was happening14:11
fungiwell, the retry_limit doesn't happen that often, because the job has to hit a similar condition three builds in a row... i imagine isolated instances of this which aren't hit on a second or third rebuild may be much more common14:12
mwhahahawhere python was segfaulting in the multiprocessing bits in ansible. since container image prepare uses multiprocessing it might be a similar root cause14:12
rlandy|ruckit's happening often enough now to impact the rate jobs get through gates14:12
openstackgerritShivanand Tendulker proposed openstack/project-config master: Removes py35, tox and cover jobs for proliantutils  https://review.opendev.org/73816814:22
*** armax has joined #openstack-infra14:24
*** ykarel|afk is now known as ykarel14:37
*** priteau has quit IRC14:44
*** priteau has joined #openstack-infra14:47
*** markvoelker has joined #openstack-infra14:52
*** priteau has quit IRC14:52
*** markvoelker has quit IRC14:57
clarkbmwhahaha: rlandy|ruck: may also want to add logging of the individual steps as they happen in that script14:57
mwhahahait's not that script and we do14:57
*** lmiccini has quit IRC14:57
mwhahahabut since we don't get any logs we have no idea what's happening14:57
mwhahahathat script just invokes other things that do log, but no logs are captured14:57
clarkbmwhahaha: I know it's not the script but it's something the script runs isn't it? and getting that emitted to the console log would be useful rather than trying to infer based on time to failure14:57
clarkbright I'm saying write the logs to the console and then you'll get them14:58
mwhahahawe don't seem to be recording the console anywhere14:58
clarkbits available while the job runs14:58
mwhahahawhich is not helpful14:58
clarkbpermanent storage requires that the host be available at the end of the job for archival14:58
clarkbwhy isn't that helpful? you can start a number of them, open the logs (via browser or finger) wait for one to fail, save logs, debug from there14:59
mwhahahazuul doesn't have a call back to write out the console and always ship that off the executor?14:59
clarkbmwhahaha: the console log is from the test node not the executor14:59
clarkbif the test node is gone there is no more console log to copy14:59
openstackgerritMerged openstack/project-config master: Removes py35, tox and cover jobs for proliantutils  https://review.opendev.org/73816814:59
mwhahahamaybe i'm missing how zuul is invoking ansible on that, but shouldn't there be a way to ship the output off the node w/o needing the node such that it can be captured even if the node dies15:00
clarkbno because those logs are on the disk of the node15:00
* mwhahaha shrugs15:00
clarkbwe could potentially set up a hold and keep nodepool from deleting the instance (though I'm not sure that would trigger on a retry failure? that may be a hold bug), then reboot --hard the instance via the nova api and hope it comes back15:01
mwhahahait's not a single job that hits this, it's like any of them. so trying to open up something that can capture all the console output all the time and then figure out which one RETRY_LIMITs isn't as simple as you make it seem15:01
weshay_ptothere has to be some amount of tracking jobs that hit RETRY_LIMIT right?15:02
*** jcapitao has quit IRC15:02
clarkbwe track it15:02
clarkbthe problem is in accessing the logs after the fact15:02
clarkbmwhahaha: you could run a bunch of fingers with their output tee'd to files15:02
clarkbI'm not saying its ideal, but this particular class of failure is difficult to deal with15:04
fungimwhahaha: also you don't need to wait for a retry_limit result, as i keep saying, the retry_limit happens when a particular job fails in a similar way three builds in a row, so the odds there are failures of these builds happening only once or twice in a row is likely much higher, statistically15:04
mwhahahai think the issue is identifying that15:04
mwhahahawhile it's running15:04
mwhahahaanyway i'll look at it later15:04
clarkbre holding a node I'm 99% certain we won't hold if the job is retried. Whether or not the 3rd pass failing would trigger the hold I'm not sure15:06
fungiand yeah, i also suggested using finger protocol and redirecting it to a file... you could grab a snapshot of the zuul status.json, parse out the list of any running builds which are likely to hit that issue, spawn individual netcats in the background to each of the finger addresses for them and redirect those to local files... later grep those dump files for an indication the job did not succeed15:06
fungi(perhaps lacks the success message) and that narrows the pool significantly15:06
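A rough sketch of what fungi describes, in case it helps: snapshot the status JSON, pick out running builds whose job name matches a pattern, and background one netcat per build against the executors' finger port. The jq selection, the executor list and the assumption that a console stream can be fetched by sending the build UUID to port 79 are unverified guesses about this deployment, not a tested recipe:

```bash
#!/bin/bash
# Hedged sketch of the "netcat the finger streams to local files" idea.
STATUS_URL=https://zuul.opendev.org/api/tenant/openstack/status
PATTERN=${1:-tripleo-ci-centos-8-standalone}
EXECUTORS="ze02 ze03 ze04"   # made-up subset; the real deployment has 12

# Collect UUIDs of currently running builds whose job name matches PATTERN.
# The recursive jq walk avoids depending on the exact status.json layout.
uuids=$(curl -s "$STATUS_URL" | jq -r --arg p "$PATTERN" '
  .. | objects
  | select(has("uuid") and has("name"))
  | select(.uuid != null and (.name | type == "string"))
  | select(.name | test($p))
  | .uuid')

for uuid in $uuids; do
  for ze in $EXECUTORS; do
    # finger protocol: send the build UUID as the query and record the reply;
    # executors not running that build should just return nothing and close.
    echo "$uuid" | nc "${ze}.openstack.org" 79 > "stream-${uuid}-${ze}.log" &
  done
done
wait
# Afterwards, grep the stream-*.log files for builds that stopped without a
# success message to narrow down where the node went unreachable.
```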
clarkbhold_list = ["FAILURE", "RETRY_LIMIT", "POST_FAILURE", "TIMED_OUT"] 15:07
clarkbwe would hold the third failure15:07
clarkbso that is another option, though relies on a reboot producing a working instance after the fact15:08
clarkbI'15:08
clarkber15:08
clarkbI'm happy to add a hold if we can give a rough set of criteria for it15:08
fungii doubt the hold would help, because the set of jobs it's impacting is fairly large, and the frequency with which one of those builds hits retry_failure is likely statistical noise compared to other failure modes15:09
clarkbI guess even if reboot fails we can ask the cloud for the instance console and that may give clues15:10
clarkbfungi: ya, it may require several attempts to get one. I wonder, can we tell hold we only want RETRY_LIMIT jobs?15:10
clarkblooks like no15:11
fungiwe do have code we can switch on to grab nova console from build failures right? or was that nodepool launch failures?15:11
clarkbfungi: that is nodepool launch failures15:11
clarkbbut if we get a hold we can manually run it15:11
fungiahh, right, executors lack the credentials for that anyway15:11
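For the record, the hold clarkb offers is normally set with the zuul autohold command on the scheduler, and the follow-up steps discussed above go through the cloud API; a hedged sketch (the reason string is a placeholder, exact option names should be checked against the installed Zuul release, and the server UUID shown is the inap instance examined later in this log):

```bash
# Hypothetical autohold for the job class hitting RETRY_LIMIT; run on the
# Zuul scheduler. Option names depend on the Zuul version in use.
zuul autohold --tenant openstack \
    --project opendev.org/openstack/tripleo-heat-templates \
    --job tripleo-ci-centos-8-standalone \
    --reason "debugging RETRY_LIMIT network drops" \
    --count 1

# Once a failed node is held, try to revive it and pull its console from the
# cloud side (UUID is the inap instance noted below, purely as an example).
SERVER=a0c808c9-481e-44bc-8e64-9cfe8b90e1f2
openstack server reboot --hard "$SERVER"
openstack console log show "$SERVER" > held-node-console.log
```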
mwhahahaso i just got a crashed one15:15
mwhahahahttp://paste.openstack.org/show/795271/ 15:16
mwhahahait just disappears15:16
mwhahahano console output15:16
fungithat was fast15:16
mwhahahaso it's like it crashed15:16
fungiyep, that's all i was finding in the executor debug logs too15:16
clarkbmwhahaha: is standalone deploy a task that runs ansible?15:17
mwhahahait's an ansible task that invokes shell15:17
mwhahahato run python/ansible stuff15:17
clarkbis that nested ansible or zuul's top level ansible?15:17
fungiand that crash wasn't in the same script i was seeing before15:17
mwhahahadoesn't really matter because the node went poof15:17
mwhahahanested15:18
clarkbmwhahaha: well what would be potentially useful is figuring out where in that 11 minutes the script breaks15:18
mwhahahazuul -> toci sh -> quickstart -> ansible -> shell -> python -> ansible15:18
clarkbhrm nested means we aren't able to get streaming shell out of ansible (at least not easily)15:18
mwhahahai know where it likely breaks based on timing (as previously mentioned)15:18
mwhahahawhich is a python process that uses multiprocess to do container fetching/processing15:18
mwhahahahence i think there's a bug in either python or the kernel15:18
mwhahahabut w/o any other info on the node it's going to be impossible to track down at the moment15:19
mwhahahasame deal with this one http://paste.openstack.org/show/795272/ 15:19
fungipreviously i was seeing it happen while toci_gate_test.sh was running, so it could be anything common between what that does and what the tripleo.operator.tripleo_deploy : Standalone deploy task does15:19
* mwhahaha knows what it does15:20
mwhahahawhat i need is the vm console output15:20
mwhahahato see if it's kernel panicing15:20
fungioh, and that one happened during tripleo.operator.tripleo_undercloud_install : undercloud install15:20
clarkbmwhahaha: yes and as mentioend above I've suggested how we might get that15:20
mwhahahahttps://zuul.opendev.org/t/openstack/stream/bf54d4fdf5c040d590372a7cbfbd3c53?logfile=console.log will likely crash15:21
mwhahahait's on 2. attempt at the moment15:21
mwhahaha737774,2 tripleo-ci-centos-8-standalone (2. attempt)15:21
clarkbmwhahaha: k, autohold is in place for that one, if the 3rd attempt fails we'll get it. Separately we can try grabbing the console log for it ahead of time while it is running the job15:23
clarkbmwhahaha: any idea how far away from failing it would be now?15:23
mwhahahait fails ~30 mins in15:23
mwhahahai don't know how long it's running let me look at the console15:23
mwhahahamaybe 10 mins15:23
mwhahahaoh no probably like 5-10 from now15:24
mwhahahait just started the deploy15:24
clarkba0c808c9-481e-44bc-8e64-9cfe8b90e1f2 is the instance uid in inap15:24
fungicentos-8-inap-mtl01-0017417464 15:24
*** priteau has joined #openstack-infra15:26
clarkbI've managed to console log show it, nothing exciting yet15:26
fungii can also paste the web console url, i know that would normally be sensitive but this is a throwaway vm15:27
fungithe vnc token should be unique, right?15:27
clarkbfungi: I have no idea (and assuming that about openstack seems potentially dangerous)15:28
clarkbfungi: but I guess if you open that locally you'll get the running console log and won't have to time it like my console log show15:28
clarkbso maybe just open it locally and see if we catch anything?15:28
fungii do have it open locally, but am also polling console log show out to a local file just in case15:28
fungiso far it's just iptables drop logs though15:29
clarkband selinux bookkeeping15:29
fungiyep15:29
fungidevice br-ctlplane entered promiscuous mode15:30
*** priteau has quit IRC15:30
fungiin case anyone wondered15:30
*** priteau has joined #openstack-infra15:30
fungiooh, "loop: module loaded"15:31
fungiyeah, very much not exciting so far15:31
mwhahahalike watching paint dry15:31
fungii really wish openstack console log show had something like --follow but that's probably tough to implement15:32
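In the absence of --follow, a crude polling loop like the one fungi is running by hand might look like this; the interval and file naming are arbitrary:

```bash
# Hypothetical stand-in for the missing --follow: repeatedly dump the nova
# console log to timestamped files so the last capture before the instance
# dies (or is deleted) is preserved.
SERVER=a0c808c9-481e-44bc-8e64-9cfe8b90e1f2   # instance under observation
while openstack console log show "$SERVER" > "console-$(date -u +%H%M%S).log" 2>/dev/null; do
    sleep 15
done
```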
mwhahahait might succeed at this point, let me see if there's another one15:35
*** mordred has joined #openstack-infra15:35
mwhahahacrashed15:35
clarkbthe console log doesn't show any panic15:36
mwhahahahrm15:36
fungiyeah, it's still logging iptables drops to15:36
fungitoo15:36
mwhahahai wonder if it's an ovs bug15:36
mwhahahaso if it's still up but isn't reachable via zuul that's weird?15:37
*** jtomasek has quit IRC15:37
clarkbI can confirm it doesn't seem to ping or be sshable from there15:38
fungiyeah, network seems to be dead, dead, deadski15:38
clarkbwe've already determined this isn't cloud specific so unlikely that we're colliding a specific network range15:38
fungithe node's ipv4 addy is/was 104.130.253.140 15:39
AJaegerclarkb,fungi, can either of you add me to the ACLs of openstack-ux and solum-infra-guestagent so that I can retire these two repos, or do you want to abandon changes and approve the retirement change, please?15:39
fungiand the instance just got deleted15:39
*** ricolin has joined #openstack-infra15:39
fungiwant me to stick the recorded console log somewhere?15:39
clarkbfungi: 198.72.124.67 is what I had according to the job log15:40
fungioh, yep nevermind, i grabbed that address out of the wrong console window15:40
clarkb104.130.253.140 looks like a rax IP but this was an inap node15:40
fungiwhere i was troubleshooting something unrelated15:40
mwhahaha2020-06-26 15:37:59.536990 | primary |   "msg": "Failed to connect to the host via ssh: ssh: connect to host 198.72.124.67 port 22: Connection timed out",15:41
fungianyway, i have the console log from the correct instance15:41
clarkbfungi: probably doesn't hurt to share just in case there is some clue possibly in the iptables logging15:41
clarkbif ping was working I'd suspect something crashed sshd15:41
clarkbbut ping doesn't seem to work either so more likely the network stack under sshd is having trouble15:41
clarkblack of kernel panic in the log implies it isn't a catastrophic failure15:42
fungigrumble, it's slightly too long for paste.o.o15:42
fungii'll trim the first few lines from boot15:42
clarkbalso if the third pass of that job fails we should get a node hold and we can try a reboot and see if any of the logs on the host give us clues15:42
*** yamamoto has quit IRC15:44
fungiokay, i split the log in half between http://paste.openstack.org/show/795276 and http://paste.openstack.org/show/795277 15:44
fungimwhahaha: ^15:45
mwhahahayea nothing out of the ordinary15:45
fungii wonder if we should stick something in systemd to klog the system time so we have something to generate log line offsets against15:47
openstackgerritThierry Carrez proposed openstack/project-config master: [DNM] Define maintain-github-mirror job  https://review.opendev.org/738228 15:47
clarkbfungi: you might be able to find some other reference point like ssh user log in compared to zuul logs?15:47
fungioh, i know, but thinking for the future it might be nice not to have to15:48
fungiin this case there wasn't anything worth calibrating anyway, but if we'd snagged a kernel panic we'd be able to tell how long after a particular job log line that happened15:49
fungiwhich in some cases could help narrow things down to a narrower set of operations15:49
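One way to do what fungi suggests, sketched as a systemd service plus timer that stamps the kernel ring buffer with the wall-clock time once a minute; the unit names and interval are invented and this is not something currently baked into the images:

```bash
# Hypothetical units: write the current UTC time into /dev/kmsg periodically
# so nova console output can be correlated with job log timestamps.
sudo tee /etc/systemd/system/kmsg-timestamp.service <<'EOF'
[Unit]
Description=Stamp the kernel log with the current UTC time

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo "wallclock: $(date -u +%%Y-%%m-%%dT%%H:%%M:%%SZ)" > /dev/kmsg'
EOF

sudo tee /etc/systemd/system/kmsg-timestamp.timer <<'EOF'
[Unit]
Description=Stamp the kernel log every minute

[Timer]
OnCalendar=*-*-* *:*:00

[Install]
WantedBy=timers.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now kmsg-timestamp.timer
```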
fungiAJaeger: openstack-ux-core and i guess... solum-core?15:51
clarkbmwhahaha: (this is me just thinking crazy ideas) Do you know if you have these failures on rackspace? Rackspace gives us two interfaces a public and a private interface. Most other clouds give us a single interface where we have public only, or private ipv4 that is NAT'd with a fip. Now for the crazy idea. A lot of jobs use the "private_ip" which is actually the public_ip in clouds without a private15:51
clarkbip to do things like set up multinode networking. On rax that would be on a completely separate interface so anything that may break that would be isolated from breaking zuul's connectivity via the public interface. However on basically all other clouds breaking that interface would also break Zuul's connectivity15:51
mwhahahano idea15:52
mwhahahacan you query zuul for the RETRY_LIMIT stuff?15:52
mwhahahait's not in logstash15:52
fungiAJaeger: i've added you to those, let me know if that wasn't what you needed15:52
mwhahahaw/o the logs i don't know where these are running15:53
clarkbmwhahaha: ya I think we can ask zuul for that. Its not in logstash because there were no log files to index :/15:53
AJaegerfungi: thanks,letme check15:54
clarkbhrm zuul build records don't have nodeset info15:54
clarkbI guess we have to look at zuul logs15:55
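For the history question, the page mwhahaha linked is backed by the Zuul REST builds endpoint, which can be queried directly; a hedged example of pulling recent RETRY_LIMIT results (the jq field names are the usual build-record keys but can vary between Zuul versions, and as noted the records carry no node or provider information):

```bash
# Hypothetical query for recent RETRY_LIMIT builds of tripleo-heat-templates.
curl -s 'https://zuul.opendev.org/api/tenant/openstack/builds?result=RETRY_LIMIT&project=openstack/tripleo-heat-templates&limit=100' |
  jq -r '.[] | [.end_time, .pipeline, .branch, .job_name] | @tsv'
```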
fungiclarkb: yeah, i'm seeing what i can parse out of the scheduler debug log first15:55
clarkbfungi: thanks15:55
fungiyeah, we'll have to glue scheduler and executor logs together15:59
fungithe scheduler doesn't log node info, and the executor doesn't know when the result is retry_limit15:59
clarkbfungi: we could just ask the executor for ssh connection problems15:59
*** xek has quit IRC15:59
*** rpittau is now known as rpittau|afk15:59
clarkband assume that is close enough15:59
fungiyep, that's what i'm doing now16:00
fungiRESULT_UNREACHABLE is what the executor has, which will actually be a lot more hits anyway16:00
AJaegerfungi: I removed myself again from solum-core. Now waiting for slaweq to +A the networking-onos change and then those three repos can finish retiring16:02
*** ykarel is now known as ykarel|away16:04
fungii've worked out a shell one-liner to pull the node names for each RESULT_UNREACHABLE failure, running this against all our executors now16:07
fungioh, right, this won't work on ze01 because containery, but i'll just snag the other 1116:12
fungi1514 result_unreachable builds across ze02-12 in today's debug log16:12
*** ricolin has quit IRC16:13
*** vishalmanchanda has quit IRC16:15
*** psachin has quit IRC16:18
*** yamamoto has joined #openstack-infra16:20
fungithe distribution looks like it may favor inap a lot more than the proportional quotas would account for: http://paste.openstack.org/show/795279 16:20
fungia little under a third of the unreachable failures occurred there, when they account for a lot less than a third of our aggregate quota16:21
fungithe node label distribution indicates we see more on ubuntu then centos too: http://paste.openstack.org/show/795280 16:22
fungier, than16:22
*** yamamoto has quit IRC16:27
*** Lucas_Gray has quit IRC16:28
clarkbfungi: we probably want to filter for centos to isolate the tripleo case since it seems consistent16:31
fungiyeah, i can also try to filter for tripleo jobs, i suppose16:32
*** mordred has quit IRC16:33
*** gyee has joined #openstack-infra16:35
*** ociuhandu_ has joined #openstack-infra16:36
*** mordred has joined #openstack-infra16:38
*** hamalq has joined #openstack-infra16:38
*** ociuhandu has quit IRC16:38
*** ociuhandu_ has quit IRC16:40
*** hamalq_ has joined #openstack-infra16:40
*** amoralej is now known as amoralej|off16:40
fungigrep $(grep $(grep '\[e: .* result RESULT_UNREACHABLE ' /var/log/zuul/executor-debug.log | sed 's/.*\[e: \([0-9a-f]\+\)\].*/-e \1.*Beginning.job.*tripleo/') /var/log/zuul/executor-debug.log | sed 's/.*\[e: \([0-9a-f]\+\)\].*/-e \1.*Provider:/') /var/log/zuul/executor-debug.log | sed 's/.*\\\\nProvider: \(.*\)\\\\nLabel: \(.*\)\\\\nInterface .*/\1 \2/' > nodes 16:40
fungiin case you wondered16:40
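Unrolled purely for readability, that pipeline does roughly the following; this is a restatement of the one-liner above, not a tested drop-in replacement:

```bash
LOG=/var/log/zuul/executor-debug.log

# 1. Event IDs of builds that ended with RESULT_UNREACHABLE, rewritten into
#    grep -e patterns that also require a tripleo job name at job start.
events=$(grep '\[e: .* result RESULT_UNREACHABLE ' "$LOG" |
         sed 's/.*\[e: \([0-9a-f]\+\)\].*/-e \1.*Beginning.job.*tripleo/')

# 2. Keep only the events whose "Beginning job" line matched a tripleo job,
#    and rewrite those into patterns matching the same events' node info.
nodes_pat=$(grep $events "$LOG" |
            sed 's/.*\[e: \([0-9a-f]\+\)\].*/-e \1.*Provider:/')

# 3. Extract "provider label" pairs from the node-info lines for counting.
#    ($events and $nodes_pat are intentionally unquoted: each expands to many
#    separate -e arguments.)
grep $nodes_pat "$LOG" |
  sed 's/.*\\\\nProvider: \(.*\)\\\\nLabel: \(.*\)\\\\nInterface .*/\1 \2/' > nodes

sort nodes | uniq -c | sort -rn   # per provider/label counts like the pastes
```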
*** smarcet has quit IRC16:40
*** jpena is now known as jpena|off16:42
fungi861 tripleo jobs with unreachable results in the logs today so far16:43
fungibreakdown by provider-region: http://paste.openstack.org/show/795283 16:44
*** hamalq has quit IRC16:44
fungiand by node label: http://paste.openstack.org/show/795284 16:44
fungithis was a crude match for any result_unreachable builds with "tripleo" in the job name16:45
weshay_ptofungi, afaict.. there was a large event on 6/14 where this peaked and has been an issue since.. not as many hits as 6/14 though..16:45
weshay_ptoyou seeing anything similar?16:46
weshay_ptow/ when this started16:46
fungii only analyzed today's debug log16:46
AJaegerregarding these numbers: How does that compare to all runs? I mean: Do we run 3 times as many CentOS8 jobs as bionic for tripleo - and therefore the 3 times failure is not significant?16:46
fungiAJaeger: yeah, that's likely the case. i don't think these ratios are telling us much on the node label side. on the provider-region side it suggests that inap is getting a disproportionately larger number of these, i think16:47
*** priteau has quit IRC16:48
fungiinteresting though that i'm getting some airship-kna1 node hits in here for tripleo jobs. that may mean i'm not filtering the way i thought. investigating16:49
mwhahahaweshay_pto: we updated openvswitch on 6/16, perhaps that's the issue?16:50
mwhahahaopenvswitch-2.12.0-1.1.el8.x86_64.rpm  2020-06-16 07:57  2.0M 16:50
fungiahh, nevermind, airship-kna1 also hosts a small percentage of normal node labels16:50
fungiso that's expected16:50
weshay_ptomwhahaha, could be part of the issue for sure.. given it's openvswitch.. but it would not explain the spike on 6/1416:51
mwhahahai don't know if that's related16:51
rlandy|ruckcould we revert that upgrade?16:52
mwhahahagiven that the networking goes poof, how we configure the interfaces on the nodes, and the lack of like a kernel panic, it seems to be openvswitch16:52
mwhahahawe don't see retry_failure on centos7 jobs right?16:53
mwhahahathat didn't get updated16:53
weshay_pto6/14 spike looks like the mirror outage 2020-06-14 06:42:29.732260 | primary | Cannot download 'https://mirror.kna1.airship-citycloud.opendev.org/centos/8/AppStream/x86_64/os/': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried.16:53
*** markvoelker has joined #openstack-infra16:53
weshay_ptoso we can ignore that spike16:53
mwhahahayea that was the day of the mirror outages i think16:53
mwhahahai really think it's openvswitch16:53
fungii'm putting together some trends for RESULT_UNREACHABLE on "tripleo" named jobs per day over the past month16:54
mwhahahacentos 8.2 came out on 6-1116:55
weshay_ptoya.. mwhahaha looking at each day.. we start to see network go down after openvswitch update16:56
mwhahahado we know when the first infra image switched to it?16:56
weshay_pto6/18 is the first day.. I see network going down16:56
weshay_pto1 hit on 6/17.. so not sure how long mirrors take to update..16:57
funginormally they update every 2 hours16:57
*** markvoelker has quit IRC16:58
fungibut we pull from a mirror to make our mirror, so can be delayed by however long the mirror we're pulling from takes to reflect updates too16:58
*** derekh has quit IRC17:00
mwhahahahrm the openvswitch release was a build w/o dpdk17:00
mwhahahaso maybe not17:00
fungiparsing 30 days of compressed executor logs is taking a bit of time, but i should have something soonish17:06
mwhahahamy only other thought is that there is a known issue with iptables on the 8.2 kernel and we end up configuring the network/iptables about the time it fails17:09
*** smarcet has joined #openstack-infra17:09
mwhahahathough i would have thought there to be a stack trace on the console if that was the case17:10
*** ociuhandu has joined #openstack-infra17:11
*** smarcet has quit IRC17:17
*** kaiokmo has quit IRC17:17
*** ociuhandu has quit IRC17:18
openstackgerritMerged openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/738150 17:23
*** jamesmcarthur has quit IRC17:23
*** jamesmcarthur has joined #openstack-infra17:23
*** jamesmcarthur has quit IRC17:27
fungi~14k unreachable results in the past month across 11 of our 12 executors17:38
*** udesale has quit IRC17:39
fungimwhahaha: weshay_pto: here's what the hourly breakdown looks like for the past month: http://paste.openstack.org/show/795290 17:42
fungiand here's the daily breakdown: http://paste.openstack.org/show/795291 17:42
funginote these are not scaled by the number of jobs run, these are simply counts of result_unreachable builds for jobs with "tripleo" in their names17:43
*** gfidente has quit IRC17:44
fungier, fixed daily aggregates paste, the previous one had a bit of cruft at the beginning: http://paste.openstack.org/show/795292 17:45
*** jamesmcarthur has joined #openstack-infra17:50
*** mtreinish has quit IRC18:04
*** mtreinish has joined #openstack-infra18:05
*** jamesmcarthur has quit IRC18:08
clarkbfungi: re airship kna we did that to help ensure things were running normally there18:13
clarkbfungi: something we learned doing the tripleo clouds: having the resources dedicated to a specific purpose makes it harder to understand what is going on there when things break18:14
rlandy|ruckfrom the pastes above, it looks like we have better days and worse days18:15
mwhahahaprobably related to the number of patches, though those numbers seem really high18:16
clarkbfungi: also re ze01 the container should log to the same location as the non container runs18:16
rlandy|ruck    819 2020-06-23  18:16
rlandy|ruck    800 2020-06-24  18:16
rlandy|ruck    846 2020-06-25  18:16
fungii concur, those could also indicate days where you simply had higher change activity18:16
rlandy|ruck^^ consistent bad though18:16
fungias i said, that's not scaled by the overall build count for those jobs18:16
rlandy|ruckmwhahaha: do we have a next step here? something we can try on our end?18:20
mwhahahait's kinda hard because the layers of logging here18:21
mwhahahawe really need to either reproduce it or get a node that it failed on18:21
* fungi checks to see if clarkb's hold caught anything18:22
clarkbfungi: I was just checking and I don't think it did but double check as I'm trying to eat a sandwich too :)18:22
funginope, it's still set18:23
mwhahahasudo make me a sandwich18:23
*** ralonsoh has quit IRC18:25
clarkbI think we can set up some holds on jobs likely to hit the issue then see if we catch any. Another approach could be to try and reproduce it outside of nodepool and zuul's control with a VM (or three) in inap18:32
*** markvoelker has joined #openstack-infra18:49
*** markvoelker has quit IRC18:54
*** smarcet has joined #openstack-infra19:04
*** lbragstad_ has joined #openstack-infra19:04
*** lbragstad has quit IRC19:06
*** eolivare has quit IRC19:10
openstackgerritAndreas Jaeger proposed openstack/project-config master: Finish retirement of openstack-ux,solum-infra-guestagent  https://review.opendev.org/737992 19:16
openstackgerritAndreas Jaeger proposed openstack/project-config master: Finish retirement of networking-onos  https://review.opendev.org/738263 19:16
AJaegerconfig-core, 737992 is ready to merge - onos is waiting for final approval. I thus split these, please review 992. 19:16
clarkb+219:17
AJaegerthanks, clarkb19:17
EmilienMrlandy|ruck: https://review.opendev.org/#/c/738025/ failed again :/19:22
EmilienMinfra-code: I would really request a force merge if it's possible for you19:22
EmilienMinfra-core^ sorry19:22
rlandy|ruckEmilienM: don't worry about it - it may fail on the retry_limit19:24
clarkbEmilienM: does it fix a gate bug? it isn't clear from the commit message why that would be a priority19:24
clarkbrlandy|ruck: it did fail on retry limit but also something else19:25
rlandy|ruckclarkb: no - it failed because we updated the cirros image for tempest but not the space requirements. our fault19:25
rlandy|ruckit's a revert - it does fail gates though19:26
rlandy|ruckbut it could fail again on retry_limit so I'll just try the regular route of getting patches in19:26
clarkbright but usually when we force merge something it's something that will fix gate failures. There is no indication in the commit message that it does this (note I don't expect it to fix the retry limits but an indication that it fixes a testing bug hence the bypass of testing would be nice)19:26
rlandy|ruckclarkb: yeah - updating the commit message with the bug details19:27
rlandy|ruckok - patch updated - but let's let it run through the regular channels19:32
clarkbok, that helps. Let us know if force merge is appropriate after it tries the normal route19:33
rlandy|ruckclarkb: thanks19:33
*** smarcet has quit IRC19:47
*** smarcet has joined #openstack-infra19:56
*** smarcet has quit IRC20:01
*** slaweq has quit IRC20:02
*** smarcet has joined #openstack-infra20:05
*** slaweq has joined #openstack-infra20:06
*** slaweq has quit IRC20:10
*** yamamoto has joined #openstack-infra20:25
*** yamamoto has quit IRC20:30
*** markvoelker has joined #openstack-infra20:35
*** markvoelker has quit IRC20:39
*** smarcet has quit IRC20:40
*** hashar has quit IRC21:02
*** armax has quit IRC21:10
*** armax has joined #openstack-infra21:26
*** paladox has quit IRC21:33
*** paladox has joined #openstack-infra21:37
*** lbragstad_ has quit IRC21:38
*** markvoelker has joined #openstack-infra22:35
*** markvoelker has quit IRC22:40
*** tosky has quit IRC23:00
openstackgerritMerged openstack/openstack-zuul-jobs master: Run openafs promote job only if gate job run  https://review.opendev.org/738155 23:51
