*** tosky has quit IRC | 00:30 | |
tristanC | ianw: when the icon has three vertical dots, it can be referred to as a kebab too | 00:33 |
tristanC | corvus: when you have some time, could you please check the last comment on https://review.opendev.org/c/zuul/zuul/+/775505 ? | 00:34 |
ianw | tristanC: do Americans know what a kebab is? i feel like it's gyros there :) | 00:35 |
clarkb | I think if you say kebab around here people will specifically think of meat on a stick | 00:35 |
clarkb | gyros may have kebab in them but are pita sandwiches | 00:36 |
clarkb | Both delicious | 00:36 |
fungi | looks more like takoyaki to me | 00:40 |
fungi | or maybe kibi dango | 00:40 |
corvus | tristanC: that's 10mb right? | 00:40 |
tristanC | corvus: i think so yes | 00:40 |
corvus | tristanC: cool, replied there, thanks. | 00:45 |
*** mgoddard has quit IRC | 00:51 | |
openstackgerrit | Merged zuul/zuul master: Don't bail on fetchBuildAllInfo if fetchBuildManifest fails https://review.opendev.org/c/zuul/zuul/+/773826 | 00:55 |
openstackgerrit | Merged zuul/zuul master: Don't enforce foreground with -d switch https://review.opendev.org/c/zuul/zuul/+/705189 | 00:55 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Support credentials supplied by nodepool https://review.opendev.org/c/zuul/zuul/+/774362 | 00:57 |
*** mgoddard has joined #zuul | 00:57 | |
*** noonedeadpunk has quit IRC | 01:19 | |
*** noonedeadpunk has joined #zuul | 01:21 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: web: render links and ansi escape sequences in logfile and console https://review.opendev.org/c/zuul/zuul/+/775505 | 01:24 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: web: add benchmark test for logfile https://review.opendev.org/c/zuul/zuul/+/775510 | 01:24 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: web: disable logfile line rendering when the size exceed a threshold https://review.opendev.org/c/zuul/zuul/+/775726 | 01:24 |
tristanC | corvus: i think ^ should do the trick | 01:29 |
*** hamalq has quit IRC | 01:47 | |
*** dmsimard has quit IRC | 02:16 | |
*** dmsimard has joined #zuul | 02:17 | |
openstackgerrit | Narendra Kumar Choudhary proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 02:55 |
*** ykarel has joined #zuul | 03:52 | |
*** ykarel_ has joined #zuul | 03:58 | |
*** ykarel has quit IRC | 04:01 | |
*** ykarel_ has quit IRC | 04:57 | |
*** vishalmanchanda has joined #zuul | 05:28 | |
*** evrardjp has quit IRC | 05:33 | |
*** evrardjp has joined #zuul | 05:33 | |
*** jfoufas1 has joined #zuul | 05:55 | |
*** saneax has joined #zuul | 06:10 | |
*** ianw has quit IRC | 06:18 | |
*** ianw has joined #zuul | 06:19 | |
*** mhu has quit IRC | 06:19 | |
*** mhu has joined #zuul | 06:20 | |
*** mugsie has quit IRC | 06:37 | |
openstackgerrit | Merged zuul/project-config master: Zuul: remove explicit SQL reporters https://review.opendev.org/c/zuul/project-config/+/775719 | 06:43 |
*** jcapitao has joined #zuul | 07:12 | |
*** piotrowskim has joined #zuul | 08:14 | |
*** rpittau|afk is now known as rpittau | 08:37 | |
*** tosky has joined #zuul | 08:44 | |
*** hashar has joined #zuul | 08:45 | |
*** jpena|off is now known as jpena | 08:59 | |
*** nils has joined #zuul | 09:20 | |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion https://review.opendev.org/c/zuul/nodepool/+/775438 | 10:27 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances https://review.opendev.org/c/zuul/nodepool/+/775796 | 10:27 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Log openstack requests https://review.opendev.org/c/zuul/nodepool/+/775797 | 10:27 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion https://review.opendev.org/c/zuul/nodepool/+/775438 | 10:49 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances https://review.opendev.org/c/zuul/nodepool/+/775796 | 10:49 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Log openstack requests https://review.opendev.org/c/zuul/nodepool/+/775797 | 10:49 |
avass | tobiash: if you got time can you take a look at https://review.opendev.org/c/zuul/zuul/+/775334 later? | 11:09 |
tobiash | avass: yes, that's on my todo list for today | 11:10 |
avass | tobiash: thanks! :) | 11:12 |
*** mugsie has joined #zuul | 11:13 | |
*** vishalmanchanda has quit IRC | 11:22 | |
tobiash | corvus: I had to do a second round of that nodepool stack and now it exceeds my expectations during load testing. It topped out at 1.5k node startups per 10min while keeping all node startups below one minute. Tested with a constant load of 200 node requests. | 11:26 |
tobiash | and from the metrics I can see that now the request handling seems to be the next bottleneck since it hit at max 150 building nodes at the same time | 11:27 |
*** vishalmanchanda has joined #zuul | 11:29 | |
*** jcapitao is now known as jcapitao_lunch | 12:12 | |
*** jpena is now known as jpena|lunch | 12:27 | |
*** rlandy has joined #zuul | 12:32 | |
tobiash | avass: +3 with a suggestion for a future improvement | 12:53 |
avass | tobiash: yeah, makes sense | 13:01 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion https://review.opendev.org/c/zuul/nodepool/+/775438 | 13:03 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances https://review.opendev.org/c/zuul/nodepool/+/775796 | 13:03 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Log openstack requests https://review.opendev.org/c/zuul/nodepool/+/775797 | 13:03 |
*** jcapitao_lunch is now known as jcapitao | 13:08 | |
*** jpena|lunch is now known as jpena | 13:28 | |
openstackgerrit | Merged zuul/zuul master: Fix executor errors on faulty .gitmodules file. https://review.opendev.org/c/zuul/zuul/+/775334 | 13:56 |
mordred | tobiash: comment on https://review.opendev.org/c/zuul/nodepool/+/775438 - not sure it really matters as I expect both patches to land at the same time anyway | 14:25 |
tobiash | mordred: thanks, but that flag is also used in the central watcher | 14:29 |
mordred | oh - ok cool | 14:29 |
tobiash | mordred: responded on the comment | 14:35 |
mordred | tobiash, corvus: hrm. you know - with that change, the primary user of the openstacksdk "use list and client-side-filtering instead of explicit get calls" goes away. that behavior is one that is very confusing to all of the non-nodepool users of openstacksdk, but I always aggressively kept it because nodepool depended on it. I wonder if I should raise it with gtema | 14:35 |
tobiash | mordred: nodepool still uses it | 14:35 |
tobiash | mordred: it just calls it later during wait_for_server to offload most of the time to one thread | 14:36 |
tobiash | mordred: and under high load we still have a high rate of finished servers that use that which would otherwise create more calls | 14:37 |
mordred | ah - yeah - nod. | 14:38 |
mordred | good point | 14:38 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Offload waiting for server creation/deletion https://review.opendev.org/c/zuul/nodepool/+/775438 | 14:40 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Optimize list_servers call in cleanupLeakedInstances https://review.opendev.org/c/zuul/nodepool/+/775796 | 14:40 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Log openstack requests https://review.opendev.org/c/zuul/nodepool/+/775797 | 14:40 |
openstackgerrit | Tobias Henkel proposed zuul/nodepool master: Add simple load testing script https://review.opendev.org/c/zuul/nodepool/+/775843 | 14:40 |
tobiash | puts the load test script into its own change so we can spin that separately | 14:40 |
openstackgerrit | Jonas Sticha proposed zuul/nodepool master: aws: add image upload test https://review.opendev.org/c/zuul/nodepool/+/775844 | 14:41 |
tristanC | tobiash: i've started a similar script to load test zuul, e.g. submit x changes with jobs and measure how long it takes to get build results | 14:42 |
tobiash | tristanC: I think both make sense, maybe you want to provide that in the zuul repo? | 14:43 |
tobiash | the nodepool test script is handy if I just want to measure cloud/nodepool performance without impacting a zuul and/or github/gerrit | 14:44 |
tristanC | tobiash: and i was wondering if adding a prometheus time decorator, to be able to measure individual function performance too, would be possible now | 14:44 |
tobiash | and the zuul test script is handy if one wants to know an end-to-end performance which is also important | 14:44 |
tristanC | otherwise i worry the end-to-end measure might have too much variance | 14:46 |
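For illustration, the measurement half of such an end-to-end load test could be little more than polling Zuul's builds REST endpoint until results appear for the submitted changes. The sketch below assumes a placeholder API URL and tenant and leaves the change-submission step out entirely; it is not tristanC's actual script.

```python
import time

import requests

# Placeholder endpoint/tenant; a real script would take these as arguments.
ZUUL_API = 'https://zuul.example.com/api/tenant/example-tenant'


def wait_for_results(change_numbers, poll_interval=10, timeout=3600):
    """Poll Zuul's builds endpoint until each submitted change has at least
    one reported build result, recording the elapsed wall-clock time."""
    start = time.monotonic()
    pending = set(change_numbers)
    elapsed = {}
    while pending and time.monotonic() - start < timeout:
        for change in sorted(pending):
            resp = requests.get('%s/builds' % ZUUL_API,
                                params={'change': change})
            resp.raise_for_status()
            if any(build.get('result') for build in resp.json()):
                elapsed[change] = time.monotonic() - start
                pending.discard(change)
        time.sleep(poll_interval)
    return elapsed
```

Summing or averaging the per-change times then gives the end-to-end number, variance and all.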
tobiash | tristanC: I think the prometheus decorator would also be useful (if we don't let the metrics explode in a way that blows things up) | 14:46 |
tobiash | tristanC: end-to-end always has a lot of variance with that many components involved, but that's also a valuable outcome | 14:47 |
tobiash | especially when working towards scale out scheduler | 14:47 |
tristanC | corvus: would you mind revisiting your -2 on 599209 ? | 14:47 |
tobiash | tristanC: I think 599209 suffered from metrics explosion and would impact the scheduler performance if I recall correctly the discussions from back then | 14:48 |
corvus | tristanC, tobiash: i still think statsd is better for large timing data; prometheus does all the work in-process that statsd offloads | 14:48 |
corvus | (i'm not opposed to a small in-process prometheus server used for healthchecks) | 14:49 |
tobiash | I think if we do that we likely need to limit the high cardinality metrics to statsd | 14:49 |
tristanC | tobiash: corvus: i meant to use prometheus only for performance monitoring, not to replace the existing statsd extensive metrics | 14:50 |
corvus | tristanC: performance monitoring is exactly what statsd was designed for | 14:51 |
corvus | the idea is to emit timing data around everything you're interested in monitoring | 14:51 |
corvus | so we could have *lots* of timing metrics for functions or whatever we want to measure | 14:52 |
tristanC | corvus: ok i'll have a look, it seems like we could embed a statsd server in the benchmark tool to compute the result | 14:54 |
tobiash | zuul-maint: could you have a look on https://review.opendev.org/c/zuul/zuul-jobs/+/775511 to unbreak nodepool gate? | 14:54 |
corvus | tristanC: oh yeah for a benchmark setup that would be a good idea i think. if you just want to get a number out of it, you could probably write a 5 line prometheus server. there's one in the unit test framework. | 14:55 |
corvus | tristanC: er i mean statsd server :) | 14:55 |
corvus | i think having some coarse performance metrics in prometheus makes sense -- like, is the executor up, how many jobs is it running, what is its load average. things that are useful for k8s to know to make health/scale-out decisions. | 14:56 |
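As a rough sketch of what such coarse health metrics could look like with the prometheus_client library (the metric names here are invented for illustration and are not Zuul's real metrics):

```python
from prometheus_client import Gauge, start_http_server

# Hypothetical coarse health metrics an executor process might expose so a
# k8s liveness probe or autoscaler can scrape them; names are illustrative.
running_builds = Gauge('zuul_executor_running_builds',
                       'Number of builds currently running on this executor')
load_average = Gauge('zuul_executor_load_average_1m',
                     '1-minute load average of the executor host')

start_http_server(9090)  # serves the metrics on http://<host>:9090/metrics
running_builds.set(3)
load_average.set(1.42)
```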
tristanC | corvus: yeah, looking at the scheduler code, it's not that easy to add a metric similar to prometheus @SCHEDULER_FUNC_METRIC.time() decorator | 14:57 |
corvus | tristanC: but yeah, for benchmarking, a really simple statsd server to sum the metrics would be a super easy way to get a number out of a benchmark run. | 14:57 |
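A minimal sketch of such a throwaway statsd sink, assuming only the plain "name:value|ms" wire format; this is not the helper that lives in Zuul's unit test framework:

```python
import socket
from collections import defaultdict


def run_statsd_sink(host='127.0.0.1', port=8125):
    """UDP listener that sums statsd timer values per metric name until
    interrupted, then prints one number per metric for the benchmark run."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    try:
        while True:
            data, _ = sock.recvfrom(65535)
            for line in data.decode('utf-8', 'replace').splitlines():
                name, _, rest = line.partition(':')
                value, _, mtype = rest.partition('|')
                if mtype.startswith('ms'):  # timer metrics
                    totals[name] += float(value)
                    counts[name] += 1
    except KeyboardInterrupt:
        for name in sorted(totals):
            print(f'{name}: {totals[name]:.1f}ms over {counts[name]} samples')
    finally:
        sock.close()
```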
corvus | tristanC: we can write a decorator. that's not hard. but if you want to know a bunch of function run times, perhaps consider a profiler? | 14:58 |
corvus | there's one hooked up to sigusr2 | 14:59 |
tristanC | corvus: how would you share the local statsd client instance to the decorator? | 14:59 |
corvus | tristanC: i dunno; gotta run :) | 15:01 |
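One hedged way such a timing decorator could look, with a module-level client as one possible answer to the "how do we share the statsd client" question above; this is an assumption for illustration, not how Zuul actually wires up its statsd client:

```python
import functools
import time

import statsd  # the pystatsd client; Zuul's real client setup differs

# One process-wide client that decorated functions close over; a real
# implementation might read host/port from zuul.conf instead.
_statsd = statsd.StatsClient('localhost', 8125)


def timed(metric_name):
    """Emit a statsd timer around each call of the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                _statsd.timing(metric_name, elapsed_ms)
        return wrapper
    return decorator

# Hypothetical usage on a scheduler method:
# @timed('zuul.scheduler.process_event')
# def process_event(self, event): ...
```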
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: prometheus: add options to start the server and process collector https://review.opendev.org/c/zuul/zuul/+/599209 | 15:01 |
*** jcapitao_ has joined #zuul | 15:03 | |
mordred | tobiash: on the list_servers thing - I just went diving into the sdk wait_for_server - there's one place at the top of it that it calls get_server relying on the list functionality - that could actually be skipped if the server being passed in is already in active state (which is an improvement that really should be made anyway) - and after that all of the calls to get_server later in the stack are actually to get_server_by_id which bypasses the | 15:04 |
mordred | list_servers cache | 15:04 |
*** jcapitao has quit IRC | 15:06 | |
mordred | oh- although that's calling with bare=True - so still not a wain | 15:06 |
mordred | win | 15:06 |
mordred | NM :) | 15:07 |
*** jcapitao_ is now known as jcapitao | 15:09 | |
openstackgerrit | Gomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size https://review.opendev.org/c/zuul/zuul-jobs/+/773474 | 15:12 |
tobiash | mordred: actually I missed the bare=True flag in the first patch resulting in one server list and 1000 fips calls per iteration of that thread | 15:19 |
tobiash | the request log change helped me find and fix that problem | 15:19 |
mordred | ++ | 15:19 |
tobiash | tristanC: how should we continue with the 3.11-only part of 775511? | 15:25 |
*** hashar has quit IRC | 15:29 | |
tristanC | let me see, i think we can fail for older version | 15:30 |
tristanC | or we need a version to repo name mapping... | 15:30 |
tobiash | corvus: do we need a deprecation period for removing all openshift older than 3.11, given that the role has been broken for some time and nobody (except nodepool) complained? | 15:31 |
openstackgerrit | Jonas Sticha proposed zuul/nodepool master: aws: add support for uploading diskimages https://review.opendev.org/c/zuul/nodepool/+/735217 | 15:32 |
openstackgerrit | Jonas Sticha proposed zuul/nodepool master: aws: add image upload test https://review.opendev.org/c/zuul/nodepool/+/775844 | 15:32 |
corvus | tobiash: doesn't seem necessary to me | 15:33 |
tobiash | kk | 15:33 |
tobiash | tristanC: so I guess error on <3.11 is fine | 15:34 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository https://review.opendev.org/c/zuul/zuul-jobs/+/775511 | 15:36 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository https://review.opendev.org/c/zuul/zuul-jobs/+/775511 | 15:38 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository https://review.opendev.org/c/zuul/zuul-jobs/+/775511 | 15:43 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: multi-node debian: Update package lists before installing https://review.opendev.org/c/zuul/zuul-jobs/+/775866 | 15:53 |
*** jfoufas1 has quit IRC | 15:55 | |
avass | tristanC: software factory CI is cating up again :) | 15:55 |
avass | acting* but maybe it's catting also | 15:55 |
*** jcapitao is now known as jcapitao|off | 16:05 | |
tristanC | avass: oops, we are fixing that shortly | 16:07 |
*** rpittau is now known as rpittau|afk | 16:10 | |
*** hashar has joined #zuul | 16:22 | |
tristanC | avass: thank you for reporting the issue, the job has been removed | 16:23 |
corvus | zuul-maint: i restarted the opendev scheduler yesterday just behind the earliest commit that could be 4.0. should i do a full system restart on opendev today to sanity check current HEAD before tagging 4.0, or are we okay with the slight skew? | 16:57 |
corvus | (i lean slightly toward doing one more full restart... i don't really want to do a paper bag 4.0.1....) | 16:57 |
tobiash | ++ | 16:58 |
tristanC | corvus: wfm | 16:58 |
tobiash | corvus: are you planning to include the nodepool optimizations as well, or will those come later? | 16:58 |
clarkb | ++ to restarting everything on what you want to tag | 16:59 |
fungi | corvus: another restart sounds fine though queues are deep in the openstack tenant at the moment, so if we do it now it will probably take a while to reenqueue everything | 16:59 |
corvus | tobiash: you know, we haven't really talked about a nodepool release... technically we don't need one. but maybe we should go ahead and do a 4.0 just to re-sync their versions? regardless, let's defer the optimizations; i want to give those a good review which would delay 4.0 work | 16:59 |
tobiash | corvus: wfm | 17:00 |
corvus | zuul-maint: while not strictly required, it would make it a little bit easier on me if we didn't approve any zuul or nodepool changes until we release 4.0 :) | 17:01 |
clarkb | corvus: related: you emailed opendev about removing the mysql reporters. Do we need to land those changes before this big restart? | 17:02 |
clarkb | or is that a pre 5.0 thing? | 17:02 |
corvus | clarkb: that's an ASAP thing; absolutely before 5.0, but any time after yesterday. we want it asap so we don't have to wade through the warning messages in the logs though. | 17:03 |
fungi | i think we've landed them, but would be good to double-check | 17:03 |
fungi | i approved a bunch yesterday anyway | 17:03 |
corvus | clarkb: like, as an opendev operator, i'd love it if all our tenants merged those this week. | 17:03 |
clarkb | got it, ya I was planning to take a look after meeting today | 17:03 |
clarkb | or between meetings as it appears that weather has caused things to be rescheduled | 17:04 |
openstackgerrit | Albin Vass proposed zuul/zuul master: Fix build page releasenote typo https://review.opendev.org/c/zuul/zuul/+/775880 | 17:27 |
avass | That could be good to merge before releasing ^ | 17:28 |
tobiash | avass: +2 but the fix will also just work after tagging | 17:32 |
corvus | i think we can go ahead and approve that; i'll +3 | 17:34 |
corvus | we don't have any release notes for nodepool, which violates our "no release without a note" rule.... | 17:34 |
corvus | do we want to merge a release note saying "no substantial changes, we're just releasing this for version parity with zuul"? | 17:34 |
*** piotrowskim has quit IRC | 17:35 | |
tristanC | corvus: perhaps add one to 773540? | 17:36 |
avass | tobiash: yeah but the versioned docs would have the typo, wouldn't they? | 17:36 |
avass | in case someone is reading those :) | 17:37 |
corvus | i do link to the versioned docs in announcements... | 17:38 |
fungi | uprevving the nodepool version to match zuul's major version seems worthy of a release note even if for no other reason than to avoid confusion | 17:38 |
clarkb | ++ since our users have come to expect semver there | 17:38 |
corvus | tristanC: oof, that didn't merge? if we landed that now, i'd want to do another nodepool restart | 17:38 |
fungi | i can help with another opendev nodepool restart if it's wanted | 17:38 |
fungi | now that my meetings are on break for a bit | 17:39 |
corvus | fungi: well, if we did a nodepool restart, it's going to be several hours from now | 17:40 |
fungi | i can help with one after 20z as well | 17:41 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Format multi-line log entries https://review.opendev.org/c/zuul/nodepool/+/773540 | 17:46 |
corvus | zuul-maint: ^ that does all the things suggested :) | 17:46 |
tobiash | corvus: what do you think about doing nodepool v4 release a few days after zuul v4 and include aws image handling there? Then we'd have a release note | 17:50 |
tobiash | ah you've put one into the multiline change | 17:51 |
*** jcapitao|off has quit IRC | 17:51 | |
clarkb | corvus: what does ret do in that formatter? it is appended to but not returned (as the name would imply) | 17:51 |
tristanC | i guess we need https://review.opendev.org/c/zuul/zuul-jobs/+/775511 to land changes in nodepool | 17:52 |
tobiash | yes | 17:52 |
corvus | tobiash: i just want to release both things today | 17:54 |
corvus | i would rather merge a release note with no change than spend any more time reviewing actual changes | 17:54 |
avass | clarkb: Was wondering about that too | 17:54 |
corvus | if the log formatting change isn't ready, i'll just make a quick change with a release note | 17:55 |
clarkb | corvus: it looks like we are discarding the results of the formatting | 17:56 |
clarkb | but that is based on ret not being returned or otherwise used other than to write to it (not based on testing evidence or similar) | 17:56 |
corvus | fungi: can you look at https://review.opendev.org/775511 | 17:56 |
fungi | yep | 17:57 |
tobiash | corvus: zuul-upload-image is in retry limit in the gate | 17:57 |
corvus | clarkb, tristanC: it sure looks like the multiline change is faulty, i think it's missing the last line of the function: return '\n'.join(ret) | 17:58 |
corvus | i will WIP that and move the reno to a standalone change | 17:58 |
clarkb | corvus: sounds good | 17:59 |
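For context, a minimal sketch of the kind of multi-line formatter under discussion, assuming the intent is simply to keep continuation lines grouped under one entry; the real change in 773540 may differ, but it shows where the missing return belongs:

```python
import logging


class MultiLineFormatter(logging.Formatter):
    """Hypothetical sketch of a formatter that keeps continuation lines of a
    multi-line message visually attached to the first line of the entry."""

    def format(self, record):
        formatted = super().format(record)
        ret = []
        for i, line in enumerate(formatted.splitlines()):
            # Indent every line after the first so multi-line tracebacks or
            # pasted output stay grouped under one log entry.
            ret.append(line if i == 0 else '  ' + line)
        # Without this return (the bug spotted above), the loop's work is
        # silently discarded.
        return '\n'.join(ret)
```

Attaching it is the usual one-liner, e.g. `handler.setFormatter(MultiLineFormatter('%(asctime)s %(levelname)s %(name)s: %(message)s'))`.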
clarkb | tristanC: left a comment on 775511 if you want to check it (I think we can merge it as is and just do a followup cleanup) | 18:00 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: Add 4.0 release note https://review.opendev.org/c/zuul/nodepool/+/775885 | 18:00 |
corvus | zuul-maint: ^ | 18:00 |
tobiash | +2 | 18:01 |
fungi | and approved | 18:01 |
fungi | also approved the openshift fix | 18:01 |
corvus | tobiash: looks like the dockerhub login failed | 18:02 |
corvus | ianw saw this yesterday in opendev | 18:02 |
corvus | i think the assumption was it was transient -- fungi, clarkb -- do you remember anything further? | 18:02 |
fungi | just that we suspected maybe dh rate limits/throttling, but things worked on a recheck | 18:02 |
tobiash | shall we close/reopen 775880 to make it abort and reenter the gate? | 18:02 |
clarkb | yes we assumed it was transient because fungi had some gitea changes up that passed through the same ansible and they were happy | 18:03 |
clarkb | and ya docker hub rate limits/throttling was my suspicion | 18:03 |
corvus | wait i don't know what task failed | 18:03 |
clarkb | "upload-docker-image: Log in to registry" failed on ianw's examples | 18:04 |
tobiash | corvus: something really weird: http://paste.openstack.org/show/802705/ | 18:04 |
corvus | tobiash: yeah i was just homing in on that | 18:04 |
corvus | the stat module failed with 'bad message'? | 18:05 |
tobiash | yeah, no idea what that means, but all other jobs are ok | 18:06 |
corvus | i don't think we're releasing anything today | 18:06 |
*** hashar is now known as hasharDinner | 18:07 | |
corvus | infra-root, zuul-maint, config-core: we just noticed a very strange failure in a zuul job that occurred during the pre playbook when setting up git repos. please keep an eye out for any other builds that fail in that way. | 18:08 |
tobiash | nodepool-upload-image failed due to the request limit | 18:09 |
clarkb | tobiash: the same task from ianw's example: "upload-docker-image: Log in to registry" ? | 18:10 |
tobiash | corvus: the bad message is nothing new: https://zuul.opendev.org/t/zuul/build/c21cd59ba2cc41d4aa25c134e7f964e5 | 18:10 |
corvus | new to me | 18:12 |
tobiash | new to me as well | 18:12 |
tobiash | found three already in the zuul tenant, all airship-kna1 | 18:14 |
tobiash | maybe a corrupted image | 18:14 |
fungi | message:"bad message" turns up a ton of hits in logstash | 18:15 |
fungi | we've also been seeing random nodes in airship-kna1 boot up with a 15gb rootfs instead of the usual 100gb the flavor typically provides | 18:15 |
fungi | leading me to suspect that sometimes the growroot doesn't happen at boot there | 18:16 |
corvus | fungi: can you share the logstash link? is bad message correlated with kna1? | 18:20 |
*** jpena is now known as jpena|off | 18:24 | |
openstackgerrit | Merged zuul/zuul-jobs master: ensure-openshift: workaround missing ansible26 repository https://review.opendev.org/c/zuul/zuul-jobs/+/775511 | 18:25 |
clarkb | corvus: I think fungi was saying he was searching for message:"bad message" in logstash | 18:28 |
clarkb | and ya if I set the time frame to the last 6 hours or so I see a number of them and they don't appear to be cloud or image specific | 18:28 |
clarkb | it seems to universally affect openstack-ansible jobs though | 18:29 |
clarkb | I wonder if this is an ansible regression | 18:29 |
fungi | i'm not clear on how to link to specific queries in logstash unfortunately, so yeah i mentioned the query string here | 18:30 |
clarkb | there is or was a way but I can never remember it either | 18:30 |
avass | clarkb: it seems to come from os.stat | 18:30 |
tobiash | the last 2.9 release is from jan 18 | 18:30 |
avass | or os.lstat | 18:30 |
fungi | right, it seems like "bad message" may be ansible's way of saying it couldn't parse a message from a task? | 18:30 |
avass | I think it's errno 74 EBADMSG | 18:31 |
fungi | oh, neat | 18:31 |
clarkb | ya if you look at the osa failures it's trying to collect journal files and getting that | 18:31 |
clarkb | hrm and that happens in a shell script run by ansible | 18:32 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: ensure-openshift: remove unused role var https://review.opendev.org/c/zuul/zuul-jobs/+/775888 | 18:32 |
clarkb | so ya unlikely to be ansible I guess, they somehow are just tripping over it in a way that logstash catches it | 18:32 |
clarkb | File corruption detected at /var/log/journal/5a8fcb09881b49f9ae1601fbeee5f3fa/system.journal:74d4b70 (of 125829120 bytes, 97%). then var/log/journal/5a8fcb09881b49f9ae1601fbeee5f3fa/system.journal (Bad message) | 18:33 |
avass | ansible only checks for ENOENT and otherwise fails with whatever exception it got: https://github.com/ansible/ansible/blob/stable-2.9/lib/ansible/modules/files/stat.py#L466 | 18:33 |
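In other words (a stdlib-only sketch of the logic at the linked line, not the module itself): os.lstat raises OSError, only ENOENT is translated into "file does not exist", and everything else, including errno 74 EBADMSG from a corrupted filesystem, propagates as the task's error message:

```python
import errno
import os


def stat_like_ansible(path):
    """Sketch mirroring the linked stat.py behaviour: only ENOENT is treated
    as "file does not exist"; any other OSError (e.g. EBADMSG, errno 74,
    raised when the kernel reports filesystem corruption) bubbles up and
    fails the task with the OS error string, such as "Bad message"."""
    try:
        return os.lstat(path)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return None  # reported back as exists: false
        raise  # -> task failure, e.g. "[Errno 74] Bad message: '/var/log/...'"
```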
clarkb | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_184/775813/3/check/openstack-ansible-upgrade-aio_metal-centos-8/1842a5b/logs/host/journal/5a8fcb09881b49f9ae1601fbeee5f3fa/index.html the file does seem to be there at least | 18:34 |
*** hamalq has joined #zuul | 18:35 | |
clarkb | https://opendev.org/openstack/openstack-ansible/src/branch/master/scripts/log-collect.sh#L193-L195 | 18:36 |
clarkb | it is possible the reason that osa is catching this is they are explicitly validating consistency | 18:36 |
clarkb | and it appears to happen across ubuntu and centos (different versions of both) and in different clouds | 18:37 |
clarkb | tobiash' suspicion of a problem with images seems more likely | 18:38 |
clarkb | seems to have been happening for at least the last week | 18:42 |
clarkb | looking at data more historically and not from the last 6 hours it does seem to correlate with kna1 more | 18:43 |
corvus | clarkb: is it correlated with a provider? | 18:44 |
clarkb | corvus: in the last 6 hours no, but from the 9th to the 10th of february yes its almost all kna1 | 18:44 |
clarkb | I wonder if OSA does explicit verification and they are also near disk limits on $clouds | 18:44 |
corvus | clarkb: i wish i could see what you see | 18:45 |
corvus | i'm really struggling to get logstash to produce something useful | 18:45 |
clarkb | corvus: in the query field put message:"Bad message" then set the time range in the top right to last 6 hours or last 7 days and clikc the refresh button in the top right next to the time range | 18:45 |
corvus | yeah, i see tons of osa jobs across all providers | 18:46 |
clarkb | yup and different images. If you expand to the last 7 days you'll see a spike around the 10th. If you click in the time graph and hold+drag you can select a dynamic window | 18:47 |
corvus | is there anyway not to see the osa jobs? | 18:47 |
clarkb | doing that is what gave me the kna correlates to the ~9th ~10th range | 18:47 |
corvus | cause if they're failing at that rate and they don't care, maybe that's a red herring | 18:47 |
clarkb | corvus: you can add AND NOT project:openstack/openstack-ansible AND NOT project:openstack/openstack-ansible-os_tempest and so on | 18:48 |
clarkb | to the query string | 18:48 |
corvus | it's unresponsive when i do that :( | 18:51 |
clarkb | hrm may need to quote the project string values | 18:52 |
clarkb | (I always forget the quoting rules but I think not quoting may make it more greedy or something) | 18:52 |
corvus | is there some way to say openstack-ansible* | 18:53 |
corvus | cause i dunno, it looks like we're just awash in false positives here. | 18:54 |
corvus | these jobs are all succeeding | 18:55 |
clarkb | yes there is globbing but it only works on some fields and is incredibly expensive iirc | 18:55 |
clarkb | I've been trying to get a globbed version to return over the last 12 hours and not succeeding | 18:55 |
clarkb | corvus: I think they are succeeding because the error is only hit at the end of the job when they validate and verify logs. I suspect that the same issue they are hitting is also hitting your example (which I am beginning to think may be a full root disk) | 18:56 |
corvus | this did hit in kna1 | 18:56 |
corvus | clarkb: https://706d78dc87df4baa7fa3-76ad47885070581f857a540cadaa6a6d.ssl.cf5.rackcdn.com/775880/1/gate/zuul-upload-image/55c548d/zuul-info/zuul-info.ubuntu-bionic.txt says 4.5G available? but that also looks like a 15G drive instead of 100 | 18:58 |
clarkb | we still run a cleanup play on all/most jobs that checks du right? | 18:58 |
clarkb | corvus: agreed that looks like a too small disk | 18:58 |
clarkb | my strong suspicion is osa catches this often because they fill disks regardless of the cloud and they also do journal verification at the end of jobs | 18:59 |
clarkb | other jobs like your example only hit this if the disk is abnormally small | 18:59 |
tobiash | clarkb, corvus: we had a similar issue a year ago when cloud-init was not finished yet sometimes | 18:59 |
corvus | clarkb: http://paste.openstack.org/show/802711/ df from cleanup of that build | 19:00 |
tobiash | clarkb, corvus: we fixed that by doing a cloud-init wait for finish as early as possible in the base job | 19:00 |
corvus | clarkb: /dev/vda1 15455656 9953244 4639704 69% / | 19:00 |
tobiash | we run "if command -v cloud-init > /dev/null; then cloud-init status -wl; fi" in our base job | 19:02 |
clarkb | tobiash: in our case we just use a growrootfs init script I think, but I suppose it is possible that isn't finishing before things start | 19:02 |
tobiash | yeah, so you could do something similar | 19:02 |
clarkb | corvus: that is small but also not completely full | 19:02 |
clarkb | corvus: I wonder if the underlying hypervisor has minimal disk left so it is saying "here use 15GB" and hoping you don't notice | 19:03 |
clarkb | but then when you try writes you're getting an error from all the way down to the host fs | 19:03 |
corvus | wow | 19:03 |
corvus | clarkb, tobiash, fungi: at this point, how confident are you that this is "something is weird with kna1 which has nothing to do with zuul or zuul-jobs"? | 19:04 |
clarkb | corvus: I would say probably 80-90% | 19:04 |
fungi | i've been fairly confident at least the tiny-disk problem is airship-only | 19:05 |
clarkb | I think there is a small chance the osa logging is catching an issue with our image builds as well | 19:05 |
clarkb | and the problem would be more widespread in that case | 19:05 |
fungi | maybe noonedeadpunk could work his magic to get someone there to look at the hosts if we can do a hostid correlation, unfortunately our only official contact for that account is the person at ericsson who is paying for it | 19:05 |
clarkb | we do record the hypervisor id and could see if it correlates to a small number of those in our kna examples at least | 19:06 |
tobiash | hostids are different on the three in zuul tenant I found | 19:06 |
fungi | yeah, then the host-specific theory might not hold water | 19:07 |
*** nils has quit IRC | 19:10 | |
*** jcapitao|off has joined #zuul | 19:23 | |
*** hasharDinner is now known as hashar | 19:27 | |
tobiash | corvus: to me it also looks like some infrastructure or environmental problem rather than zuul or zuul-jobs | 19:32 |
clarkb | tobiash:wee all three of those examples kna1 ? | 19:32 |
clarkb | *were | 19:32 |
tobiash | yes: | 19:33 |
tobiash | https://48b327e1e5d3af62395f-b9f7b4910ad16729218670d60db15b93.ssl.cf2.rackcdn.com/774650/10/check/tox-py38/c21cd59/zuul-info/inventory.yaml | 19:33 |
tobiash | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_996/774650/10/check/zuul-tox-docs/996dbcc/zuul-info/inventory.yaml | 19:33 |
tobiash | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_655/774650/10/check/zuul-jobs-tox-linters/655a450/zuul-info/inventory.yaml | 19:33 |
clarkb | in that case my 80-90% feels more like 90% | 19:33 |
openstackgerrit | Merged zuul/nodepool master: Add 4.0 release note https://review.opendev.org/c/zuul/nodepool/+/775885 | 19:34 |
tobiash | clarkb: and the most recent one as well: https://1bdefef51603346d84af-53302f911195502b1bb2d87ad2b01ca2.ssl.cf5.rackcdn.com/775885/1/gate/tox-pep8/3de257e/zuul-info/inventory.yaml | 19:35 |
tobiash | corvus: was there a reason for holding back the release note typo fix (775880)? | 19:37 |
tobiash | corvus: regarding the retry limit of that, attempts #1 and #2 failed due to the docker issue, the last one due to that (likely) kna1 issue | 19:38 |
tristanC | avass: i've added services configuration to zuul-nix, a local tenant is now configured with a periodic pipeline when the vm starts | 19:44 |
*** hashar has quit IRC | 19:45 | |
*** hashar has joined #zuul | 19:45 | |
avass | tristanC: cool, still getting comfortable with the package manager but I'm probably gonna try to set up something simple to try it out at volvo | 19:45 |
tristanC | my next step is to figure out how to get reliable performance measurements | 19:49 |
avass | tristanC: the executor fails to start btw, it complains that virtualenv is missing | 19:58 |
tristanC | avass: indeed, need to shortcircuit the venv installation | 20:00 |
tristanC | glad you gave it a try and that it worked for you too! :-) | 20:00 |
*** smyers has quit IRC | 20:02 | |
*** smyers has joined #zuul | 20:02 | |
corvus | tobiash: no i was just halfway through reapproving it | 20:07 |
tobiash | ah ok | 20:08 |
*** tjgresha has joined #zuul | 20:19 | |
tjgresha | looking for some help with the scheduler | 20:21 |
tjgresha | please | 20:22 |
fungi | what help do you need with it? | 20:24 |
tjgresha | throwing this error ERROR zuul.Scheduler: AttributeError: 'MergeJob' object has no attribute 'updated' | 20:24 |
tjgresha | running this with inside the docker .. | 20:25 |
fungi | what version? | 20:25 |
avass | tjgresha: that usually means something went wrong with a merge operation in the executor/merger | 20:26 |
tjgresha | docker version or container version? | 20:27 |
fungi | i was asking what version (release) of zuul you're seeing this with | 20:28 |
fungi | wow, this should probably be a faq, i just grepped case-insensitive for 'mergejob.*updated' in my channel log and see so many people reporting the same error over the years | 20:30 |
fungi | dating back to 2017-11-23 | 20:30 |
tjgresha | 3.19.2.dev367 | 20:31 |
fungi | no fewer than 10 people mentioned it in that timeframe | 20:31 |
fungi | i'll see what the causes ended up being | 20:32 |
avass | fungi: from what I remember people have suggested better error messages for that a couple of times :) | 20:32 |
fungi | yeah, but then i'd never be able to find it in my channel history! ;) | 20:32 |
avass | tjgresha: Can you check for errors in the merger/executor? my bet is that they fail to clone a repo or merge a change or something like that | 20:33 |
fungi | according to tobiash it "usually it means that the merger or executor couldn't fetch that repo" | 20:33 |
tjgresha | ok | 20:34 |
tjgresha | one sec | 20:34 |
fungi | in one case it was caused by a failure to connect to zk (startup race, bad cert?) | 20:35 |
*** hashar has quit IRC | 20:35 | |
fungi | also one time tobiash said "that can occur if there is no merger in the system or the merger hits a timeout when cloning a large repo" | 20:35 |
avass | oh yeah that too | 20:36 |
tjgresha | restarted -- letting scheduler do its thing for a minute | 20:36 |
fungi | at least in the past it also seemed to happen if you tried to start the scheduler long before starting an executor/merger (if you start them in the other order, or around the same time, then it should be fine) | 20:38 |
tjgresha | yeah it can't pull a repo | 20:42 |
tjgresha | unable to determine the version | 20:42 |
fungi | in this case the zuul version is probably irrelevant, i was just asking in case it wound up being a signature for a bug fixed in a recent release or something | 20:44 |
tjgresha | namely - for some reason, all of a sudden, it will not connect to our internal gerrit (config and jobs) | 20:46 |
avass | tjgresha: what error do you get? | 20:47 |
tjgresha | ERROR zuul.GerritConnection: Unable to determine remote Gerrit version | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: Traceback (most recent call last): | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: File "/usr/local/lib/python3.8/site-packages/zuul/driver/gerrit/gerritconnection.py", line 1456, in onLoad | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: self._getRemoteVersion() | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: File "/usr/local/lib/python3.8/site-packages/zuul/driver/gerrit/gerritconnection.py", line 1426, in _getRemoteVersion | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: version = self.get('config/server/version') | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: File "/usr/local/lib/python3.8/site-packages/zuul/driver/gerrit/gerritconnection.py", line 635, in get | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: raise Exception("Received response %s" % (r.status_code,)) | 20:47 |
tjgresha | scheduler_1 | 2021-02-16 20:41:50,053 ERROR zuul.GerritConnection: Exception: Received response 401 | 20:48 |
avass | please use a paste service in the future, like: http://paste.openstack.org/ :) | 20:48 |
tjgresha | sorry | 20:49 |
tjgresha | don't use IRC too much =) | 20:49 |
fungi | okay, so the scheduler is having trouble connecting to your gerrit | 20:49 |
tjgresha | yes - looks like this is the issue to me | 20:50 |
corvus | tjgresha: make sure https://zuul-ci.org/docs/zuul/reference/drivers/gerrit.html#attr-%3Cgerrit%20connection%3E.auth_type is set correctly | 20:50 |
fungi | 401 unauthorized makes me suspect you're using the wrong credentials or | 20:50 |
fungi | that, what corvus just said | 20:50 |
fungi | newer gerrit versions require basic auth instead of digest auth, due to changes in how they store the password internally | 20:51 |
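For reference, a hedged example of the connection stanza this maps to in zuul.conf; the hostname, account and key path are placeholders, and the password is the HTTP password Gerrit generates for the account, not its login password:

```ini
[connection gerrit]
driver=gerrit
server=gerrit.example.com
user=zuul
sshkey=/var/lib/zuul/.ssh/id_rsa
password=<generated-http-password>
auth_type=basic
```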
tjgresha | so do i need to be auth'd to the server at all to pull the version ? | 20:52 |
tjgresha | answered my q -- yes ya do! | 20:53 |
corvus | that's intentional so problems are discovered early | 20:53 |
tjgresha | ok thanks all - | 20:56 |
*** tjgresha has quit IRC | 21:21 | |
openstackgerrit | Merged zuul/zuul master: Fix build page releasenote typo https://review.opendev.org/c/zuul/zuul/+/775880 | 21:26 |
tobiash | yeah it's still on my todo list to improve that error message ;) | 21:56 |
avass | maybe the error message should be "please grep for `object has no attribute 'updated'` in the irc logs" ;) | 22:01 |
avass | bonus points if it does it automatically | 22:01 |
*** jcapitao|off has quit IRC | 22:03 | |
*** saneax has quit IRC | 22:03 | |
*** vishalmanchanda has quit IRC | 22:22 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: ansible: ensure we can delete ansible files https://review.opendev.org/c/zuul/zuul/+/775943 | 22:32 |
*** arxcruz|rover has quit IRC | 22:46 | |
*** arxcruz has joined #zuul | 22:47 | |
tristanC | avass: it took longer than planned; turns out when your system doesn't have a /bin, random things fail unexpectedly, but with the current zuul-nix:head, job results are now green in the database :-) | 23:07 |
*** tosky has quit IRC | 23:57 |