Tuesday, 2021-02-16

*** tosky has quit IRC00:30
tristanCianw: when the icon has three vertical dots, it can be referred to as a kebab too00:33
tristanCcorvus: when you have some time, could you please check the last comment on https://review.opendev.org/c/zuul/zuul/+/775505 ?00:34
ianwtristanC: do Americans know what a kebab is?  i feel like it's gyros there :)00:35
clarkbI think if you say kebab around here people will specifically think of meat on a stick00:35
clarkbgyros may have kebab in them but are pita sandwiches00:36
clarkbBoth delicious00:36
fungilooks more like takoyaki to me00:40
fungior maybe kibi dango00:40
corvustristanC: that's 10mb right?00:40
tristanCcorvus: i think so yes00:40
corvustristanC: cool, replied there, thanks.00:45
*** mgoddard has quit IRC00:51
openstackgerritMerged zuul/zuul master: Don't bail on fetchBuildAllInfo if fetchBuildManifest fails  https://review.opendev.org/c/zuul/zuul/+/77382600:55
openstackgerritMerged zuul/zuul master: Don't enforce foreground with -d switch  https://review.opendev.org/c/zuul/zuul/+/70518900:55
openstackgerritJames E. Blair proposed zuul/zuul master: Support credentials supplied by nodepool  https://review.opendev.org/c/zuul/zuul/+/77436200:57
*** mgoddard has joined #zuul00:57
*** noonedeadpunk has quit IRC01:19
*** noonedeadpunk has joined #zuul01:21
openstackgerritTristan Cacqueray proposed zuul/zuul master: web: render links and ansi escape sequences in logfile and console  https://review.opendev.org/c/zuul/zuul/+/77550501:24
openstackgerritTristan Cacqueray proposed zuul/zuul master: web: add benchmark test for logfile  https://review.opendev.org/c/zuul/zuul/+/77551001:24
openstackgerritTristan Cacqueray proposed zuul/zuul master: web: disable logfile line rendering when the size exceed a threshold  https://review.opendev.org/c/zuul/zuul/+/77572601:24
tristanCcorvus: i think ^ should do the trick01:29
*** hamalq has quit IRC01:47
*** dmsimard has quit IRC02:16
*** dmsimard has joined #zuul02:17
openstackgerritNarendra Kumar Choudhary proposed zuul/zuul-jobs master: Allow customization of helm charts repos  https://review.opendev.org/c/zuul/zuul-jobs/+/76735402:55
*** ykarel has joined #zuul03:52
*** ykarel_ has joined #zuul03:58
*** ykarel has quit IRC04:01
*** ykarel_ has quit IRC04:57
*** vishalmanchanda has joined #zuul05:28
*** evrardjp has quit IRC05:33
*** evrardjp has joined #zuul05:33
*** jfoufas1 has joined #zuul05:55
*** saneax has joined #zuul06:10
*** ianw has quit IRC06:18
*** ianw has joined #zuul06:19
*** mhu has quit IRC06:19
*** mhu has joined #zuul06:20
*** mugsie has quit IRC06:37
openstackgerritMerged zuul/project-config master: Zuul: remove explicit SQL reporters  https://review.opendev.org/c/zuul/project-config/+/77571906:43
*** jcapitao has joined #zuul07:12
*** piotrowskim has joined #zuul08:14
*** rpittau|afk is now known as rpittau08:37
*** tosky has joined #zuul08:44
*** hashar has joined #zuul08:45
*** jpena|off is now known as jpena08:59
*** nils has joined #zuul09:20
openstackgerritTobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion  https://review.opendev.org/c/zuul/nodepool/+/77543810:27
openstackgerritTobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances  https://review.opendev.org/c/zuul/nodepool/+/77579610:27
openstackgerritTobias Henkel proposed zuul/nodepool master: Log openstack requests  https://review.opendev.org/c/zuul/nodepool/+/77579710:27
openstackgerritTobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion  https://review.opendev.org/c/zuul/nodepool/+/77543810:49
openstackgerritTobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances  https://review.opendev.org/c/zuul/nodepool/+/77579610:49
openstackgerritTobias Henkel proposed zuul/nodepool master: Log openstack requests  https://review.opendev.org/c/zuul/nodepool/+/77579710:49
avasstobiash: if you got time can you take a look at https://review.opendev.org/c/zuul/zuul/+/775334 later?11:09
tobiashavass: yes, that's on my todo list for today11:10
avasstobiash: thanks! :)11:12
*** mugsie has joined #zuul11:13
*** vishalmanchanda has quit IRC11:22
tobiashcorvus: I had to do a second round of that nodepool stack and now it exceeds my expectations during load testing. It topped out at 1.5k node startups per 10min while keeping all node startups below one minute. Tested with keeping a constant amount of 200 node requests.11:26
tobiashand from the metrics I can see that now the request handling seems to be the next bottleneck since it hit at max 150 building nodes at the same time11:27
*** vishalmanchanda has joined #zuul11:29
*** jcapitao is now known as jcapitao_lunch12:12
*** jpena is now known as jpena|lunch12:27
*** rlandy has joined #zuul12:32
tobiashavass: +3 with a suggestion for a future improvement12:53
avasstobiash: yeah, makes sense13:01
openstackgerritTobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion  https://review.opendev.org/c/zuul/nodepool/+/77543813:03
openstackgerritTobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances  https://review.opendev.org/c/zuul/nodepool/+/77579613:03
openstackgerritTobias Henkel proposed zuul/nodepool master: Log openstack requests  https://review.opendev.org/c/zuul/nodepool/+/77579713:03
*** jcapitao_lunch is now known as jcapitao13:08
*** jpena|lunch is now known as jpena13:28
openstackgerritMerged zuul/zuul master: Fix executor errors on faulty .gitmodules file.  https://review.opendev.org/c/zuul/zuul/+/77533413:56
mordredtobiash: comment on https://review.opendev.org/c/zuul/nodepool/+/775438 - not sure it really matters as I expect both patches to land at the same time anyway14:25
tobiashmordred: thanks, but that flag is also used in the central watcher14:29
mordredoh - ok cool14:29
tobiashmordred: responded on the comment14:35
mordredtobiash, corvus: hrm. you know - with that change, the primary user of the openstacksdk "use list and client-side-filtering instead of explicit get calls" goes away. that behavior is one that is very confusing to all of the non-nodepool users of openstacksdk, but I always aggressively kept it because nodepool depended on it. I wonder if I should raise it with gtema14:35
tobiashmordred: nodepool still uses it14:35
tobiashmordred: it just calls it later during wait_for_server to offload most of the time to one thread14:36
tobiashmordred: and under high load we still have a high rate of finished servers that use that which would otherwise create more calls14:37
mordredah - yeah - nod.14:38
mordredgood point14:38
openstackgerritTobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion  https://review.opendev.org/c/zuul/nodepool/+/77543814:40
openstackgerritTobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances  https://review.opendev.org/c/zuul/nodepool/+/77579614:40
openstackgerritTobias Henkel proposed zuul/nodepool master: Log openstack requests  https://review.opendev.org/c/zuul/nodepool/+/77579714:40
openstackgerritTobias Henkel proposed zuul/nodepool master: Add simple load testing script  https://review.opendev.org/c/zuul/nodepool/+/77584314:40
tobiashputs the load test script into its own change so we can spin that separately14:40
openstackgerritJonas Sticha proposed zuul/nodepool master: aws: add image upload test  https://review.opendev.org/c/zuul/nodepool/+/77584414:41
tristanCtobiash: i've started a similar script to load test zuul, e.g. submit x changes with jobs and measure how long it takes to get build results14:42
tobiashtristanC: I think both make sense, maybe you want to provide that in the zuul repo?14:43
tobiashthe nodepool test script is handy if I just want to measure cloud/nodepool performance without impacting a zuul and/or github/gerrit14:44
tristanCtobiash: and i was wondering if adding a prometheus time decorator to measure individual function performance would be possible now14:44
tobiashand the zuul test script is handy if one wants to know an end-to-end performance which is also important14:44
tristanCotherwise i worry the end-to-end measure might have too much variance14:46
tobiashtristanC: I think the prometheus decorator would also be useful (if we don't let the metrics explode in a way that it blows up)14:46
tobiashtristanC: end-to-end always has a lot of variance with that many components involved, but that's also a valuable outcome14:47
tobiashespecially when working towards scale out scheduler14:47
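The end-to-end benchmark tristanC describes above (submit N changes, then measure how long it takes until every build result arrives) boils down to a polling loop. A minimal sketch, where `submit_change` and `get_result` are hypothetical placeholders for whatever Gerrit/Zuul API calls a real harness would make:

```python
import time

def measure_end_to_end(submit_change, get_result, count, poll_interval=1.0):
    """Submit `count` changes and return per-change seconds until a result.

    submit_change(i) -> change handle; get_result(change) -> result or None.
    Both are placeholders, not real Gerrit/Zuul client calls.
    """
    start = time.monotonic()
    changes = [submit_change(i) for i in range(count)]
    pending = set(changes)
    durations = {}
    while pending:
        for change in list(pending):
            # A non-None result means the build reported back.
            if get_result(change) is not None:
                durations[change] = time.monotonic() - start
                pending.discard(change)
        if pending:
            time.sleep(poll_interval)
    return durations
```

As noted in the discussion, wall-clock numbers from a loop like this will carry a lot of variance, since every component of the system is in the measurement path.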
tristanCcorvus: would you mind revisiting your -2 on 599209 ?14:47
tobiashtristanC: I think 599209 suffered from metrics explosion and would impact the scheduler performance if I recall correctly the discussions from back then14:48
corvustristanC, tobiash: i still think statsd is better for large timing data; prometheus does all the work in-process that statsd offloads14:48
corvus(i'm not opposed to a small in-process prometheus server used for healthchecks)14:49
tobiashI think if we do that we likely need to limit the high cardinality metrics to statsd14:49
tristanCtobiash: corvus: i meant to use prometheus only for performance monitoring, not to replace the existing statsd extensive metrics14:50
corvustristanC: performance monitoring is exactly what statsd was designed for14:51
corvusthe idea is to emit timing data around everything you're interested in monitoring14:51
corvusso we could have *lots* of timing metrics for functions or whatever we want to measure14:52
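The statsd approach corvus describes (emit a timer sample around each function of interest) can be hand-rolled in a few lines. This is an illustrative sketch, not Zuul's actual statsd wiring: the class and metric names are invented, and the module-level client is one possible answer to the later question of how a decorator gets at a shared client.

```python
import functools
import socket
import time

class StatsdTimer:
    """Tiny fire-and-forget statsd timer client (illustrative, not Zuul's API)."""

    def __init__(self, host='127.0.0.1', port=8125, prefix='zuul'):
        self.addr = (host, port)
        self.prefix = prefix
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def timing(self, name, ms):
        # statsd wire format for a timer sample: "<name>:<value>|ms"
        payload = '%s.%s:%d|ms' % (self.prefix, name, ms)
        self.sock.sendto(payload.encode('utf8'), self.addr)

    def timed(self, name):
        # Decorator that emits the wrapped function's wall time.
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.monotonic()
                try:
                    return func(*args, **kwargs)
                finally:
                    self.timing(name, int((time.monotonic() - start) * 1000))
            return wrapper
        return decorator

# One shared module-level client that decorators can reference.
statsd_client = StatsdTimer()
```

Because statsd uses UDP, the emit side stays cheap and does no aggregation in-process, which is the property corvus contrasts with prometheus here.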
tristanCcorvus: ok i'll have a look, it seems like we could embed a statsd server in the benchmark tool to compute the result14:54
tobiashzuul-maint: could you have a look on https://review.opendev.org/c/zuul/zuul-jobs/+/775511 to unbreak nodepool gate?14:54
corvustristanC: oh yeah for a benchmark setup that would be a good idea i think.  if you just want to get a number out of it, you could probably write a 5 line prometheus server.  there's one in the unit test framework.14:55
corvustristanC: er i mean statsd server :)14:55
corvusi think having some coarse performance metrics in prometheus makes sense -- like, is the executor up, how many jobs is it running, what is its load average.  things that are useful for k8s to know to make health/scale-out decisions.14:56
tristanCcorvus: yeah, looking at the scheduler code, it's not that easy to add a metric similar to prometheus @SCHEDULER_FUNC_METRIC.time() decorator14:57
corvustristanC: but yeah, for benchmarking, a really simple statsd server to sum the metrics would be a super easy way to get a number out of a benchmark run.14:57
corvustristanC: we can write a decorator.  that's not hard.  but if you want to know a bunch of function run times, perhaps consider a profiler?14:58
corvusthere's one hooked up to sigusr214:59
tristanCcorvus: how would you share the local statsd client instance to the decorator?14:59
corvustristanC: i dunno; gotta run :)15:01
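The "really simple statsd server" corvus suggests for benchmarking only needs to read UDP packets and sum the timer samples. A minimal sketch (the parsing covers just the basic name:value|type wire format, nothing more):

```python
import socket
from collections import defaultdict

def collect_stats(sock, n_packets):
    """Read n_packets statsd datagrams from sock and sum the timers.

    Returns {metric_name: (total_ms, sample_count)} so a benchmark run
    can report one number per metric at the end.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for _ in range(n_packets):
        packet = sock.recv(4096).decode('utf8')
        for line in packet.splitlines():
            name, rest = line.split(':', 1)
            value, mtype = rest.split('|', 1)
            if mtype == 'ms':  # only aggregate timer samples
                totals[name][0] += float(value)
                totals[name][1] += 1
    return {name: (s, c) for name, (s, c) in totals.items()}
```

A fixed packet count keeps the sketch simple; a real benchmark harness would more likely read until the run ends and then dump the totals.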
openstackgerritTristan Cacqueray proposed zuul/zuul master: prometheus: add options to start the server and process collector  https://review.opendev.org/c/zuul/zuul/+/59920915:01
*** jcapitao_ has joined #zuul15:03
mordredtobiash: on the list_servers thing - I just went diving into the sdk wait_for_server - there's one place at the top of it that it calls get_server relying on the list functionality - that could actually be skipped if the server being passed in is already in active state (which is an improvement that really should be made anyway) - and after that all of the calls to get_server later in the stack are actually to get_server_by_id which bypasses the15:04
mordredlist_servers cache15:04
*** jcapitao has quit IRC15:06
mordredoh- although that's calling with bare=True - so still not a wain15:06
mordredwin15:06
mordredNM :)15:07
*** jcapitao_ is now known as jcapitao15:09
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size  https://review.opendev.org/c/zuul/zuul-jobs/+/77347415:12
tobiashmordred: actually I missed the bare=True flag in the first patch resulting in one server list and 1000 fips calls per iteration of that thread15:19
tobiashthe request log change helped me to find and fix that problem15:19
mordred++15:19
tobiashtristanC: how should we continue with the 3.11-only part of 775511?15:25
*** hashar has quit IRC15:29
tristanClet me see, i think we can fail for older version15:30
tristanCor we need a version to repo name mapping...15:30
tobiashcorvus: do we need a deprecation period for removing all openshift older than 3.11, given that the role has been broken for some time and nobody (except nodepool) complained?15:31
openstackgerritJonas Sticha proposed zuul/nodepool master: aws: add support for uploading diskimages  https://review.opendev.org/c/zuul/nodepool/+/73521715:32
openstackgerritJonas Sticha proposed zuul/nodepool master: aws: add image upload test  https://review.opendev.org/c/zuul/nodepool/+/77584415:32
corvustobiash: doesn't seem necessary to me15:33
tobiashkk15:33
tobiashtristanC: so I guess error on <3.11 is fine15:34
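The "error on <3.11" behaviour agreed on here could look like the following sketch; the mapping and repo name are illustrative placeholders, not what the ensure-openshift role actually installs:

```python
# Versions the role still supports, mapped to the repo providing them.
# The repo name is a hypothetical placeholder.
SUPPORTED_OPENSHIFT_REPOS = {'3.11': 'origin-3.11-release-repo'}

def openshift_repo(version):
    """Return the package repo for a supported version, else fail loudly."""
    try:
        return SUPPORTED_OPENSHIFT_REPOS[version]
    except KeyError:
        raise ValueError(
            'ensure-openshift no longer supports OpenShift %s; '
            'supported versions: %s'
            % (version, sorted(SUPPORTED_OPENSHIFT_REPOS)))
```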
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository  https://review.opendev.org/c/zuul/zuul-jobs/+/77551115:36
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository  https://review.opendev.org/c/zuul/zuul-jobs/+/77551115:38
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository  https://review.opendev.org/c/zuul/zuul-jobs/+/77551115:43
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: multi-node debian: Update package lists before installing  https://review.opendev.org/c/zuul/zuul-jobs/+/77586615:53
*** jfoufas1 has quit IRC15:55
avasstristanC: software factory CI is cating up again :)15:55
avassacting* but maybe it's catting also15:55
*** jcapitao is now known as jcapitao|off16:05
tristanCavass: oops, we are fixing that shortly16:07
*** rpittau is now known as rpittau|afk16:10
*** hashar has joined #zuul16:22
tristanCavass: thank you for reporting the issue, the job has been removed16:23
corvuszuul-maint: i restarted the opendev scheduler yesterday just behind the earliest commit that could be 4.0.  should i do a full system restart on opendev today to sanity check current HEAD before tagging 4.0, or are we okay with the slight skew?16:57
corvus(i lean slightly toward doing one more full restart... i don't really want to do a paper bag 4.0.1....)16:57
tobiash++16:58
tristanCcorvus: wfm16:58
tobiashcorvus: are you planning to include the nodepool optimizations as well, or will those come later?16:58
clarkb++ to restarting everything on what you want to tag16:59
fungicorvus: another restart sounds fine though queues are deep in the openstack tenant at the moment, so if we do it now it will probably take a while to reenqueue everything16:59
corvustobiash: you know, we haven't really talked about a nodepool release... technically we don't need one.  but maybe we should go ahead and do a 4.0 just to re-sync their versions?  regardless, let's defer the optimizations; i want to give those a good review which would delay 4.0 work16:59
tobiashcorvus: wfm17:00
corvuszuul-maint: while not strictly required, it would make it a little bit easier on me if we didn't approve any zuul or nodepool changes until we release 4.0 :)17:01
clarkbcorvus: related: your email to opendev about removing the mysql reporters. Do we need to land those changes before this big restart?17:02
clarkbor is that a pre 5.0 thing?17:02
corvusclarkb: that's an ASAP thing; absolutely before 5.0, but any time after yesterday.  we want it asap so we don't have to wade through the warning messages in the logs though.17:03
fungii think we've landed them, but would be good to double-check17:03
fungii approved a bunch yesterday anyway17:03
corvusclarkb: like, as an opendev operator, i'd love it if all our tenants merged those this week.17:03
clarkbgot it, ya I was planning to take a look after meeting today17:03
clarkbor between meetings as it appears that weather has caused things to be rescheduled17:04
openstackgerritAlbin Vass proposed zuul/zuul master: Fix build page releasenote typo  https://review.opendev.org/c/zuul/zuul/+/77588017:27
avassThat could be good to merge before releasing ^17:28
tobiashavass: +2 but the fix will also just work after tagging17:32
corvusi think we can go ahead and approve that; i'll +317:34
corvuswe don't have any release notes for nodepool, which violates our "no release without a note" rule....17:34
corvusdo we want to merge a release note saying "no substantial changes, we're just releasing this for version parity with zuul"?17:34
*** piotrowskim has quit IRC17:35
tristanCcorvus: perhaps add one to 773540?17:36
avasstobiash: yeah but the versioned docs would have the typo wouldn't they?17:36
avassin case someone is reading those :)17:37
corvusi do link to the versioned docs in announcements...17:38
fungiuprevving the nodepool version to match zuul's major version seems worthy of a release note even if for no other reason than to avoid confusion17:38
clarkb++ since our users have come to expect semver there17:38
corvustristanC: oof, that didn't merge?  if we landed that now, i'd want to do another nodepool restart17:38
fungii can help with another opendev nodepool restart if it's wanted17:38
funginow that my meetings are on break for a bit17:39
corvusfungi: well, if we did a nodepool restart, it's going to be several hours from now17:40
fungii can help with one after 20z as well17:41
openstackgerritJames E. Blair proposed zuul/nodepool master: Format multi-line log entries  https://review.opendev.org/c/zuul/nodepool/+/77354017:46
corvuszuul-maint: ^ that does all the things suggested :)17:46
tobiashcorvus: what do you think about doing nodepool v4 release a few days after zuul v4 and include aws image handling there? Then we'd have a release note17:50
tobiashah you've put one into the multiline change17:51
*** jcapitao|off has quit IRC17:51
clarkbcorvus: what does ret do in that formatter? it is appended to but not returned (as the name would imply)17:51
tristanCi guess we need https://review.opendev.org/c/zuul/zuul-jobs/+/775511 to land changes in nodepool17:52
tobiashyes17:52
corvustobiash: i just want to release both things today17:54
corvusi would rather merge a release note with no change than spend any more time reviewing actual changes17:54
avassclarkb: Was wondering about that too17:54
corvusif the log formatting change isn't ready, i'll just make a quick change with a release note17:55
clarkbcorvus: it looks like we are discarding the results of the formatting17:56
clarkbbut that is based on ret not being returned or otherwise used other than to write to it (not based on testing evidence or similar)17:56
corvusfungi: can you look at https://review.opendev.org/77551117:56
fungiyep17:57
tobiashcorvus: zuul-upload-image is in retry limit in the gate17:57
corvusclarkb, tristanC: it sure looks like the multiline change is faulty, i think it's missing the last line of the function: return '\n'.join(ret)17:58
corvusi will WIP that and move the reno to a standalone change17:58
clarkbcorvus: sounds good17:59
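The bug clarkb and corvus identify here, building a `ret` list and never returning it, is easy to see in miniature. A sketch of a multi-line log formatter with the missing final return in place (the indentation scheme is illustrative, not necessarily what the reviewed change does):

```python
import logging

class MultiLineFormatter(logging.Formatter):
    """Indent continuation lines so multi-line entries stay grouped."""

    def format(self, record):
        ret = []
        for i, line in enumerate(super().format(record).split('\n')):
            ret.append(line if i == 0 else '  ' + line)
        # Without this final line, everything appended to `ret` is
        # silently discarded -- the bug spotted in review.
        return '\n'.join(ret)
```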
clarkbtristanC: left a comment on 775511 if you want to check it (I think we can merge it as is and just do a followup cleanup)18:00
openstackgerritJames E. Blair proposed zuul/nodepool master: Add 4.0 release note  https://review.opendev.org/c/zuul/nodepool/+/77588518:00
corvuszuul-maint: ^18:00
tobiash+218:01
fungiand approved18:01
fungialso approved the openshift fix18:01
corvustobiash: looks like the dockerhub login failed18:02
corvusianw saw this yesterday in opendev18:02
corvusi think the assumption was it was transient -- fungi, clarkb -- do you remember anything further?18:02
fungijust that we suspected maybe dh rate limits/throttling, but things worked on a recheck18:02
tobiashshall we close/reopen 775880 to make it abort and reenter the gate?18:02
clarkbyes we assumed it was transient because fungi had some gitea changes up that passed through the same ansible and they were happy18:03
clarkband ya docker hub rate limits/throttling was my suspicion18:03
corvuswait i don't know what task failed18:03
clarkb"upload-docker-image: Log in to registry" failed on ianw's examples18:04
tobiashcorvus: something really weird: http://paste.openstack.org/show/802705/18:04
corvustobiash: yeah i was just homing in on that18:04
corvusthe stat module failed with 'bad message'?18:05
tobiashyeah, no idea what that means, but all other jobs are ok18:06
corvusi don't think we're releasing anything today18:06
*** hashar is now known as hasharDinner18:07
corvusinfra-root, zuul-maint, config-core: we just noticed a very strange failure in a zuul job that occurred during the pre playbook when setting up git repos.  please keep an eye out for any other builds that fail in that way.18:08
tobiashnodepool-upload-image failed due to the request limit18:09
clarkbtobiash: the same task from ianw's example: "upload-docker-image: Log in to registry" ?18:10
tobiashcorvus: the bad message is nothing new: https://zuul.opendev.org/t/zuul/build/c21cd59ba2cc41d4aa25c134e7f964e518:10
corvusnew to me18:12
tobiashnew to me as well18:12
tobiashfound three already in the zuul tenant, all airship-kna118:14
tobiashmaybe a corrupted image18:14
fungimessage:"bad message" turns up a ton of hits in logstash18:15
fungiwe've also been seeing random nodes in airship-kna1 boot up with a 15gb rootfs instead of the usual 100gb the flavor typically provides18:15
fungileading me to suspect that sometimes the growroot doesn't happen at boot there18:16
corvusfungi: can you share the logstash link?  is bad message correlated with kna1?18:20
*** jpena is now known as jpena|off18:24
openstackgerritMerged zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository  https://review.opendev.org/c/zuul/zuul-jobs/+/77551118:25
clarkbcorvus: I think fungi was saying he was searching for message:"bad message" in logstash18:28
clarkband ya if I set the time frame to the last 6 hours or so I see a number of them and they don't appear to be cloud or image specific18:28
clarkbit seems to universally affect openstack-ansible jobs though18:29
clarkbI wonder if this is an ansible regression18:29
fungii'm not clear on how to link to specific queries in logstash unfortunately, so yeah i mentioned the query string here18:30
clarkbthere is or was a way but I can never remember it either18:30
avassclarkb: it seems to come from os.stat18:30
tobiashthe last 2.9 release is from jan 1818:30
avassor os.lstat18:30
fungiright, it seems like "bad message" may be ansible's way of saying it couldn't parse a message from a task?18:30
avassI think it's errno 74 EBADMSG18:31
fungioh, neat18:31
clarkbya if you look at the osa failures its trying to collect journal files and getting that18:31
clarkbhrm and that happens in a shell script run by ansible18:32
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: remove unused role var  https://review.opendev.org/c/zuul/zuul-jobs/+/77588818:32
clarkbso ya unlikely to be ansible I guess, they somehow are just tripping over it in a way that logstash catches it18:32
clarkbFile corruption detected at /var/log/journal/5a8fcb09881b49f9ae1601fbeee5f3fa/system.journal:74d4b70 (of 125829120 bytes, 97%). then var/log/journal/5a8fcb09881b49f9ae1601fbeee5f3fa/system.journal (Bad message)18:33
avassansible only checks for ENOENT and otherwise fails with whatever exception it got: https://github.com/ansible/ansible/blob/stable-2.9/lib/ansible/modules/files/stat.py#L46618:33
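The "Bad message" text is simply the strerror for EBADMSG, which `stat()` can return on a corrupted filesystem object such as a damaged journal file; since Ansible's stat module special-cases only ENOENT, every other errno surfaces with its raw message. A small sketch of that behaviour (`classify_stat_error` is a hypothetical helper, not Ansible code):

```python
import errno
import os

def classify_stat_error(path):
    """Stat a path and report errors the way Ansible's stat module does."""
    try:
        os.stat(path)
        return 'ok'
    except OSError as e:
        if e.errno == errno.ENOENT:
            return 'missing'            # the only case handled specially
        return os.strerror(e.errno)     # e.g. 'Bad message' for EBADMSG
```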
clarkbhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_184/775813/3/check/openstack-ansible-upgrade-aio_metal-centos-8/1842a5b/logs/host/journal/5a8fcb09881b49f9ae1601fbeee5f3fa/index.html the file does seem to be there at least18:34
*** hamalq has joined #zuul18:35
clarkbhttps://opendev.org/openstack/openstack-ansible/src/branch/master/scripts/log-collect.sh#L193-L19518:36
clarkbit is possible the reason that osa is catching this is they are explicitly validating consistency18:36
clarkband it appears to happen across ubuntu and centos (different versions of both) and in different clouds18:37
clarkbtobiash' suspicion of a problem with images seems more likely18:38
clarkbseems to have been happening for at least the last week18:42
clarkblooking at data more historically and not from the last 6 hours it does seem to correlate with kna1 more18:43
corvusclarkb: is it correlated with a provider?18:44
clarkbcorvus: in the last 6 hours no, but from the 9th to the 10th of february yes its almost all kna118:44
clarkbI wonder if OSA does explicit verification and they are also near disk limits on $clouds18:44
corvusclarkb: i wish i could see what you see18:45
corvusi'm really struggling to get logstash to produce something useful18:45
clarkbcorvus: in the query field put message:"Bad message" then set the time range in the top right to last 6 hours or last 7 days and clikc the refresh button in the top right next to the time range18:45
corvusyeah, i see tons of osa jobs across all providers18:46
clarkbyup and different images. If you expand to the last 7 days you'll see a spike around the 10th. If you click in the time graph and hold+drag you can select a dynamic window18:47
corvusis there anyway not to see the osa jobs?18:47
clarkbdoing that is what gave me the kna correlates to the ~9th ~10th range18:47
corvuscause if they're failing at that rate and they don't care, maybe that's a red herring18:47
clarkbcorvus: you can add AND NOT project:openstack/openstack-ansible AND NOT project:openstack/openstack-ansible-os_tempest and so on18:48
clarkbto the query string18:48
corvusit's unresponsive when i do that :(18:51
clarkbhrm may need to quote the project string values18:52
clarkb(I always forget the quoting rules but I think not quoting may make it more greedy or something)18:52
corvusis there some way to say openstack-ansible*18:53
corvuscause i dunno, it looks like we're just awash in false positives here.18:54
corvusthese jobs are all succeeding18:55
clarkbyes there is globbing but it only works on some fields and is incredibly expensive iirc18:55
clarkbI've been trying to get a globbed version to return over the last 12 hours and not succeeding18:55
clarkbcorvus: I think they are succeeding because the error is only hit at the end of the job when they validate and verify logs. I suspect that the same issue they are hitting is also hitting your example (which I am beginning to think may be a full root disk)18:56
corvusthis did hit in kna118:56
corvusclarkb: https://706d78dc87df4baa7fa3-76ad47885070581f857a540cadaa6a6d.ssl.cf5.rackcdn.com/775880/1/gate/zuul-upload-image/55c548d/zuul-info/zuul-info.ubuntu-bionic.txt  says 4.5G available?  but that also looks like a 15G drive instead of 10018:58
clarkbwe still run a cleanup play on all/most jobs that checks du right?18:58
clarkbcorvus: agreed that looks like a too small disk18:58
clarkbmy strong suspicion is osa catches this often because they fill disks regardless of the cloud and they also do journal verification at the end of jobs18:59
clarkbother jobs like your example only hit this if the disk is abnormally small18:59
tobiashclarkb, corvus: we had a similar issue a year ago when cloud-init sometimes had not yet finished18:59
corvusclarkb: http://paste.openstack.org/show/802711/ df from cleanup of that build19:00
tobiashclarkb, corvus: we fixed that by doing a cloud-init wait for finish as early as possible in the base job19:00
corvusclarkb: /dev/vda1       15455656 9953244   4639704  69% /19:00
tobiashwe run "if command -v cloud-init > /dev/null; then cloud-init status -wl; fi" in our base job19:02
clarkbtobiash: in our case we just use a growrootfs init script I think, but I suppose it is possible that isn't finishing before things start19:02
tobiashyeah, so you could do something similar19:02
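tobiash's guard can be wrapped for reuse; this sketch mirrors the quoted one-liner (the `run` parameter exists only to make the sketch testable, it is not part of any real base job):

```python
import shutil
import subprocess

def wait_for_cloud_init(run=subprocess.run):
    """Block until cloud-init reaches a terminal state, if it is in use.

    Mirrors: if command -v cloud-init > /dev/null; then cloud-init status -wl; fi
    so that growroot and friends have finished before the job touches the node.
    """
    if shutil.which('cloud-init') is None:
        # Image doesn't use cloud-init; nothing to wait for.
        return None
    # -w waits for cloud-init to finish, -l prints the long-form status.
    return run(['cloud-init', 'status', '-wl'],
               capture_output=True, text=True).returncode
```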
clarkbcorvus: that is small but also not completely full19:02
clarkbcorvus: I wonder if the underlying hypervisor has minimal disk left so it is saying "here use 15GB" and hoping you don't notice19:03
clarkbbut then when you try writes you're getting an error from all the way down to the host fs19:03
corvuswow19:03
corvusclarkb, tobiash, fungi: at this point, how confident are you that this is "something is weird with kna1 which has nothing to do with zuul or zuul-jobs"?19:04
clarkbcorvus: I would say probably 80-90%19:04
fungii've been fairly confident at least the tiny-disk problem is airship-only19:05
clarkbI think there is a small chance the osa logging is catchn an issue with our image builds as well19:05
clarkband the problem would be more widespread in that case19:05
fungimaybe noonedeadpunk could work his magic to get someone there to look at the hosts if we can do a hostid correlation, unfortunately our only official contact for that account is the person at ericsson who is paying for it19:05
clarkbwe do record the hypervisor id and could see if it correlates to a small number of those in our kna examples at least19:06
tobiashhostids are different on the three in zuul tenant I found19:06
fungiyeah, then the host-specific theory might not hold water19:07
*** nils has quit IRC19:10
*** jcapitao|off has joined #zuul19:23
*** hasharDinner is now known as hashar19:27
tobiashcorvus: to me it also looks likely like some infrastructure or environmental problem rather zuul or zuul-jobs19:32
clarkbtobiash:wee all three of those examples kna1 ?19:32
clarkb*were19:32
tobiashyes:19:33
tobiashhttps://48b327e1e5d3af62395f-b9f7b4910ad16729218670d60db15b93.ssl.cf2.rackcdn.com/774650/10/check/tox-py38/c21cd59/zuul-info/inventory.yaml19:33
tobiashhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_996/774650/10/check/zuul-tox-docs/996dbcc/zuul-info/inventory.yaml19:33
tobiashhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_655/774650/10/check/zuul-jobs-tox-linters/655a450/zuul-info/inventory.yaml19:33
clarkbin that case my 80-90% feels more like 90%19:33
openstackgerritMerged zuul/nodepool master: Add 4.0 release note  https://review.opendev.org/c/zuul/nodepool/+/77588519:34
tobiashclarkb: and the most recent one as well: https://1bdefef51603346d84af-53302f911195502b1bb2d87ad2b01ca2.ssl.cf5.rackcdn.com/775885/1/gate/tox-pep8/3de257e/zuul-info/inventory.yaml19:35
tobiashcorvus: was there a reason for holding back the release note typo fix (775880)?19:37
tobiashcorvus: regarding the retry limit of that, attempts #1 and #2 failed due to the docker issue, the last one due to that (likely) kna1 issue19:38
tristanCavass: i've added services configuration to zuul-nix, a local tenant is now configured with a periodic pipeline when the vm starts19:44
*** hashar has quit IRC19:45
*** hashar has joined #zuul19:45
avasstristanC: cool, still getting comfortable with the package manager but I'm probably gonna try to set up something simple to try it out at volvo19:45
tristanCmy next step is to figure out how to get reliable performance measurements19:49
avasstristanC: the executor fails to start btw, it complains that virtualenv is missing19:58
tristanCavass: indeed, need to shortcircuit the venv installation20:00
tristanCglad you gave it a try and that it worked for you too! :-)20:00
*** smyers has quit IRC20:02
*** smyers has joined #zuul20:02
corvustobiash: no i was just halfway through reapproving it20:07
tobiashah ok20:08
*** tjgresha has joined #zuul20:19
tjgreshalooking for some help with the scheduler20:21
tjgreshaplease20:22
fungiwhat help do you need with it?20:24
tjgreshathrowing this error   ERROR zuul.Scheduler:   AttributeError: 'MergeJob' object has no attribute 'updated'20:24
tjgresharunning this inside the docker container ..20:25
fungiwhat version?20:25
avasstjgresha: that usually means something went wrong with a merge operation in the executor/merger20:26
tjgreshadocker version or container version?20:27
fungii was asking what version (release) of zuul you're seeing this with20:28
fungiwow, this should probably be a faq, i just grepped case-insensitive for 'mergejob.*updated' in my channel log and see so many people reporting the same error over the years20:30
fungidating back to 2017-11-2320:30
tjgresha 3.19.2.dev36720:31
fungino fewer than 10 people mentioned it in that timeframe20:31
fungii'll see what the causes ended up being20:32
avassfungi: from what I remember people have suggested better error messages for that a couple of times :)20:32
fungiyeah, but then i'd never be able to find it in my channel history! ;)20:32
avasstjgresha: Can you check for errors in the merger/executor? my bet is that they fail to clone a repo or merge a change or something like that20:33
fungiaccording to tobiash it "usually it means that the merger or executor couldn't fetch that repo"20:33
tjgreshaok20:34
tjgreshaone sec20:34
fungiin one case it was caused by a failure to connect to zk (startup race, bad cert?)20:35
*** hashar has quit IRC20:35
fungialso one time tobiash said "that can occur if there is no merger in the system or the merger hits a timeout when cloning a large repo"20:35
avassoh yeah that too20:36
tjgresharestarted -- letting scheduler do its thing for a minute20:36
fungiat least in the past it also seemed to happen if you tried to start the scheduler long before starting an executor/merger (if you start them in the other order, or around the same time, then it should be fine)20:38
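[Editor's note: the troubleshooting tips above boil down to "find the underlying git failure in the merger/executor log." A minimal sketch of that triage, assuming a hypothetical log path and the usual Zuul log line shape; the sample lines below are invented for illustration:]

```python
# Scan merger/executor log text for lines that hint at why a merge job
# never completed (failed clone/fetch/merge, or a timeout on a big repo).
import re

def find_merge_failures(log_text):
    """Return ERROR lines mentioning a clone/fetch/merge problem."""
    pattern = re.compile(r'(clone|fetch|merge|timeout)', re.IGNORECASE)
    return [line for line in log_text.splitlines()
            if 'ERROR' in line and pattern.search(line)]

# Invented sample log lines, shaped like typical Zuul output:
sample = (
    "2021-02-16 INFO zuul.ExecutorServer: starting\n"
    "2021-02-16 ERROR zuul.Merger: Unable to fetch repo org/project\n"
)
print(find_merge_failures(sample))
# → ['2021-02-16 ERROR zuul.Merger: Unable to fetch repo org/project']
```

In practice you would feed this the contents of the executor debug log rather than a sample string.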
tjgreshayeah it can't pull a repo20:42
tjgreshaunable to determine the version20:42
fungiin this case the zuul version is probably irrelevant, i was just asking in case it wound up being a signature for a bug fixed in a recent release or something20:44
tjgreshanamely - for some reason, all of a sudden it will not connect to our internal gerrit (config and jobs)20:46
avasstjgresha: what error do you get?20:47
tjgreshaERROR zuul.GerritConnection: Unable to determine remote Gerrit version20:47
tjgreshascheduler_1  | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection:   Traceback (most recent call last):20:47
tjgreshascheduler_1  | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection:     File "/usr/local/lib/python3.8/site-packages/zuul/driver/gerrit/gerritconnection.py", line 1456, in onLoad20:47
tjgreshascheduler_1  | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection:       self._getRemoteVersion()20:47
tjgreshascheduler_1  | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection:     File "/usr/local/lib/python3.8/site-packages/zuul/driver/gerrit/gerritconnection.py", line 1426, in _getRemoteVersion20:47
tjgreshascheduler_1  | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection:       version = self.get('config/server/version')20:47
tjgreshascheduler_1  | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection:     File "/usr/local/lib/python3.8/site-packages/zuul/driver/gerrit/gerritconnection.py", line 635, in get20:47
tjgreshascheduler_1  | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection:       raise Exception("Received response %s" % (r.status_code,))20:47
tjgreshascheduler_1  | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection:   Exception: Received response 40120:48
avassplease use a paste service in the future, like: http://paste.openstack.org/ :)20:48
tjgreshasorry20:49
tjgreshadon't use IRC too much =)20:49
fungiokay, so the scheduler is having trouble connecting to your gerrit20:49
tjgreshayes - looks like this is the issue to me20:50
corvustjgresha: make sure https://zuul-ci.org/docs/zuul/reference/drivers/gerrit.html#attr-%3Cgerrit%20connection%3E.auth_type is set correctly20:50
fungi401 unauthorized makes me suspect you're using the wrong credentials or20:50
fungithat, what corvus just said20:50
funginewer gerrit versions require basic auth instead of digest auth, due to changes in how they store the password internally20:51
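[Editor's note: a sketch of the relevant zuul.conf stanza for the fix corvus and fungi describe. The connection name, host, and password value are placeholders; `auth_type` is the documented option linked above:]

```ini
# zuul.conf -- example Gerrit connection
[connection gerrit]
driver=gerrit
server=gerrit.example.com
user=zuul
; HTTP password generated in Gerrit under Settings -> HTTP Credentials
password=http-password-from-gerrit
; Newer Gerrit (NoteDB-era) stores HTTP passwords such that digest auth
; no longer works; basic auth is required.
auth_type=basic
```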
tjgreshaso do i need to be auth'd to the server at all to pull the version ?20:52
tjgreshaanswered my q -- yes ya do!20:53
corvusthat's intentional so problems are discovered early20:53
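[Editor's note: the call that failed is an authenticated GET of `config/server/version` on Gerrit's REST API (the `/a/` prefix requires credentials, hence the 401). Gerrit prefixes all REST JSON responses with `)]}'` to defeat XSSI, so a client has to strip it before parsing. A minimal sketch of that parsing step, with an invented sample response body:]

```python
# Parse a Gerrit REST API response body: strip the ")]}'"  anti-XSSI
# prefix, then decode the remaining JSON. A 401 before you even get a
# body means the credentials or auth_type are wrong, which is exactly
# what the scheduler traceback above is reporting.
import json

GERRIT_MAGIC_PREFIX = ")]}'"

def parse_gerrit_response(body):
    if body.startswith(GERRIT_MAGIC_PREFIX):
        body = body[len(GERRIT_MAGIC_PREFIX):]
    return json.loads(body)

# config/server/version returns a bare JSON string, e.g.:
print(parse_gerrit_response(")]}'\n\"3.2.5\""))
# → 3.2.5
```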
tjgreshaok thanks all  -20:56
*** tjgresha has quit IRC21:21
openstackgerritMerged zuul/zuul master: Fix build page releasenote typo  https://review.opendev.org/c/zuul/zuul/+/77588021:26
tobiashyeah it's still on my todo list to improve that error message ;)21:56
avassmaybe the error message should be "please grep for `object has no attribute 'updated'` in the irc logs" ;)22:01
avassbonus points if it does it automatically22:01
*** jcapitao|off has quit IRC22:03
*** saneax has quit IRC22:03
*** vishalmanchanda has quit IRC22:22
openstackgerritTristan Cacqueray proposed zuul/zuul master: ansible: ensure we can delete ansible files  https://review.opendev.org/c/zuul/zuul/+/77594322:32
*** arxcruz|rover has quit IRC22:46
*** arxcruz has joined #zuul22:47
tristanCavass: it took longer than planned, turns out when your system doesn't have a /bin, random things fail unexpectedly, but with the current zuul-nix:head, job results are now green in the database :-)23:07
*** tosky has quit IRC23:57

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!