Tuesday, 2020-01-07

*** jamesmcarthur has joined #zuul  [00:02]
<openstackgerrit> Merged zuul/zuul-jobs master: Test swift upload logs without encoding type for gzip files  https://review.opendev.org/701283  [00:04]
<mnaser> corvus: neat, i just did a little bit of changes to make those fixes, i wrote those jobs without sudo in mind  [00:17]
*** jamesmcarthur has quit IRC00:58
*** jamesmcarthur has joined #zuul01:15
*** igordc has joined #zuul01:49
*** jamesmcarthur has quit IRC02:02
*** zbr_ has joined #zuul02:04
*** bhavikdbavishi has quit IRC02:04
*** zbr has quit IRC02:07
*** zbr has joined #zuul02:09
*** zbr_ has quit IRC02:11
*** Goneri has quit IRC02:19
*** igordc has quit IRC02:43
*** bhavikdbavishi has joined #zuul02:52
*** bhavikdbavishi1 has joined #zuul02:55
*** bhavikdbavishi has quit IRC02:56
*** bhavikdbavishi1 is now known as bhavikdbavishi02:56
*** michael-beaver has quit IRC03:05
*** chandankumar has joined #zuul03:55
*** spsurya has joined #zuul04:32
*** chandankumar has quit IRC04:34
*** sshnaidm|afk has quit IRC04:38
*** bolg has joined #zuul04:52
*** chandankumar has joined #zuul05:13
*** sshnaidm has joined #zuul05:30
*** evrardjp has quit IRC05:33
*** evrardjp has joined #zuul05:33
*** chandankumar has quit IRC05:52
*** sshnaidm has quit IRC05:54
*** pcaruana has joined #zuul06:00
*** chandankumar has joined #zuul06:02
*** bolg has quit IRC06:02
*** pcaruana has quit IRC06:14
*** AJaeger has quit IRC06:55
*** AJaeger has joined #zuul07:02
<openstackgerrit> Felix Schmidt proposed zuul/zuul master: Report retried builds via sql reporter.  https://review.opendev.org/633501  [07:18]
*** avass has joined #zuul07:22
*** themroc has joined #zuul07:26
*** bolg has joined #zuul08:18
*** saneax has joined #zuul08:21
*** tosky has joined #zuul08:23
*** arxcruz|off is now known as arxcruz08:24
*** pcaruana has joined #zuul08:29
*** sshnaidm has joined #zuul08:32
*** chandankumar has quit IRC08:51
*** chandankumar has joined #zuul08:53
*** sshnaidm has quit IRC09:02
*** dmellado has quit IRC09:07
*** dmellado has joined #zuul09:08
*** bhavikdbavishi has quit IRC09:12
*** sshnaidm has joined #zuul09:42
<avass> Hmm, the fetch-output role fails with a 'chown invalid argument' message if we try to preserve uid and gid for some reason.  [09:54]
*** rlandy has quit IRC10:10
*** chandankumar has quit IRC10:39
*** sshnaidm has quit IRC11:32
*** sshnaidm has joined #zuul11:33
*** bolg has quit IRC11:53
*** bolg has joined #zuul11:55
*** themroc has quit IRC12:03
*** themroc has joined #zuul12:03
*** bolg has quit IRC12:05
*** themr0c has joined #zuul12:09
*** themroc has quit IRC12:09
*** zbr has quit IRC12:11
*** zbr_ has joined #zuul12:11
*** bolg has joined #zuul12:29
*** bhavikdbavishi has joined #zuul12:43
*** rfolco has joined #zuul12:45
*** rlandy has joined #zuul12:57
*** bolg has quit IRC13:06
*** bolg has joined #zuul13:08
*** bolg has quit IRC13:27
*** sshnaidm is now known as sshnaidm|mtg13:33
*** bhavikdbavishi has quit IRC13:34
*** Goneri has joined #zuul13:42
*** bolg has joined #zuul13:47
*** bolg has quit IRC13:56
*** bolg has joined #zuul13:59
*** bolg has quit IRC14:16
*** pcaruana has quit IRC14:20
*** themroc has joined #zuul14:29
*** bolg has joined #zuul14:29
*** themr0c has quit IRC14:31
*** pcaruana has joined #zuul14:32
*** bolg has quit IRC14:36
<fungi> i don't think rsync can preserve uid/gid unless it's run as root  [14:43]
<fungi> maybe we need a note about that  [14:44]
<mordred> avass: also - I don't see anything in fetch-output about preserving uid/gid - where is that issue happening?  [14:49]
*** zbr_ is now known as zbr|rover  [14:50]
<Shrews> mordred: i think the synchronize module automatically does that unless you give the option not to (was looking at it earlier)  [14:52]
<Shrews> archive=yes is the default  [14:53]
<mordred> ah - nod  [14:54]
*** sshnaidm|mtg is now known as sshnaidm|afk  [14:55]
*** avass has quit IRC  [14:55]
*** themroc has quit IRC  [14:55]
<mordred> so if content is put into the fetch-output directory with preserved uid/gid from its original source, fetch-output is going to have a sad  [14:55]
<pabelanger> is anybody working on a patch to add volume quota support to nodepool? We often see NODE_FAILURE on zuul.a.c if we've leaked volumes in our providers. Last I checked, we don't account for volumes in our quota calculations  [14:55]
<mordred> Shrews: seems like we should maybe either set archive=no in fetch-output or add a flag somewhere or something  [14:56]
*** themroc has joined #zuul  [14:56]
<Shrews> mordred: yah  [14:56]
<mordred> pabelanger: not to my knowledge, but my knowledge should be considered suspect  [14:56]
<pabelanger> ack  [14:57]
<Shrews> pabelanger: leaking volumes is an openstack cinder issue that i don't know how to safely solve in nodepool (for automatic cleanup)  [14:57]
<Shrews> but adding general quota support should be doable, i would think  [14:58]
<Shrews> mordred: i can throw something up for fetch-output unless you're already on it  [14:58]
<pabelanger> Shrews: it would be nice if we added a clean-volumes flag, like we do for clean-floating-ips, but not sure how we'd track them. Think that logic is in openstacksdk?  [14:58]
<pabelanger> mordred: is that right?  [14:59]
<Shrews> pabelanger: that's what i meant. tracking those is the key  [14:59]
*** saneax has quit IRC  [15:00]
<pabelanger> basically, there are 2 issues. a) the provider has leaked volumes and we're not able to delete them; for that you need admin access. b) there are leaked 'available' volumes, which are valid but just taking up quota space; in that case, nodepool won't delete or use them  [15:00]
<Shrews> i haven't seen (b) i don't think?  [15:00]
<smcginnis> It's been years, but still waiting for someone to show some data on "cinder leaking volumes" to show it actually is cinder and give a clue how to reproduce and fix. If anyone has logs or anything, it would be useful to get an issue filed to capture that.  [15:01]
<pabelanger> mnaser: not to call you out, but is that something you may be able to help with? logs for smcginnis  [15:01]
<Shrews> smcginnis: i think mnaser and fungi had a pretty good handle on the cause. the difficulty (i think) was being able to communicate with nova from cinder  [15:02]
<smcginnis> Deja vu and motion sickness from going in circles. :)  [15:02]
<smcginnis> Right, the closest I've seen in the past was that nova was not telling cinder to delete a volume. Yet it keeps getting put out there that "cinder leaks volumes".  [15:03]
<Shrews> smcginnis: sorry, not trying to lay blame on the wrong thing. that's just how i refer to the issue in general :)  [15:04]
<smcginnis> Cinder will happily leak as many volumes as it's not told to delete. ;)  [15:04]
<pabelanger> as for tracking volumes, can a volume hold metadata?  [15:04]
<pabelanger> I am guessing, no  [15:04]
<smcginnis> Yeah, I get it. Just a little frustrating to keep seeing that repeated when we've at least narrowed down some of the cases to another service's issue, yet still getting framed as a cinder problem.  [15:04]
<Shrews> perhaps "openstack volume leak" is a better term  [15:05]
*** themroc has quit IRC  [15:05]
<smcginnis> Nova volume leak, from the cases where we've actually gotten data, but sure. :P  [15:05]
<Shrews> or "that dang volume thing"  [15:05]
<pabelanger> volume_image_metadata looks promising  [15:05]
<mordred> Shrews: I'm not already on fetch-output  [15:05]
<Shrews> mordred: k  [15:06]
<pabelanger> Shrews: couldn't we do the same thing for servers and metadata, but with volumes? eg: inject zk info, then have a periodic loop delete it if it doesn't exist?  [15:06]
<mordred> yeah - nova volume leak or openstack volume leak - because the issue at hand is with volumes created by nova as part of a boot-from-volume  [15:07]
<Shrews> pabelanger: iirc, the issue was from automatic volume creation when booting a server?  [15:07]
<pabelanger> yah, this is BFV  [15:08]
<mordred> if the user (nodepool) had initial control over the volume rather than delegated control, it's not a problem - but that's a bit harder coding-wise  [15:08]
<Shrews> pabelanger: yeah, see boot-from-volume per above  [15:08]
<mordred> I don't remember - can we do BFV with a pre-existing volume?  [15:08]
<mordred> smcginnis: ^^ ??  [15:08]
<smcginnis> Yes, and that's usually the safest way.  [15:09]
<smcginnis> Nova has pushed back in the past on doing anything more with orchestrating, so really the best practice is to create a volume first, then tell nova to use it to boot from.  [15:09]
<mordred> ah. ok. so - we maybe need to recode nodepool's bfv support to do that  [15:09]
<smcginnis> That would probably be safer and help avoid this.  [15:10]
<mordred> smcginnis: that's making me think I should recode openstacksdk to create the volume itself rather than using nova-create  [15:10]
<mordred> smcginnis: we can attach metadata to a volume, right?  [15:10]
<pabelanger> Shrews: sorry, not seeing it.  [15:11]
<mordred> the problem with having sdk do this for people automatically would be deleting the server and not knowing which volume to delete - but if we can put metadata on a volume that's solvable  [15:11]
<mordred> Shrews: do we want to try to solve this in the sdk layer? or add logic to nodepool?  [15:12]
<Shrews> pabelanger: sorry, just meant we use the bfv option, which you already know  [15:12]
<pabelanger> mordred: yah, openstack volume set --property foo=bar e86ab129-1915-4574-9e67-09b1d9823f48 works  [15:12]
<mordred> pabelanger: awesome  [15:12]
<Shrews> mordred: sdk seems right? but not sure what implications that might have for non-nodepool users  [15:13]
<pabelanger> so, if we tagged all volumes with the zk id, that is a step in the right direction  [15:13]
<smcginnis> mordred: Yep, that sounds like a good plan too. It would be great if the SDK just took care of that and didn't make it so the end user had to take extra steps.  [15:13]
<mordred> pabelanger: so with that - we could store something like "openstacksdk-delete-with-server: true"  [15:13]
<smcginnis> So someone used to the "old" workflow just starts noticing better results.  [15:13]
<mordred> for volumes sdk auto-created  [15:13]
<mordred> smcginnis: ++  [15:13]
<pabelanger> mordred: ++  [15:13]
<smcginnis> And Cinder finally fixed its leak problem. :D  [15:13]
<mordred> smcginnis: as long as I get cinder points :)  [15:13]
*** bhavikdbavishi has joined #zuul  [15:14]
<Shrews> smcginnis: aha! you admit it! it's now documented ;)  [15:14]
<smcginnis> Extra bonus points even.  [15:14]
<smcginnis> Shrews: Haha! ;)  [15:14]
<mordred> \o/  [15:14]
<mordred> ok. I can work on a patch for that - the concepts at least make sense  [15:14]
<smcginnis> mordred: If you happen to remember, please add me as a reviewer or ping me with it. I'd love to see that go through.  [15:15]
<mordred> also - thanks smcginnis! it's super helpful when we get good info from services like that  [15:15]
<pabelanger> mordred: awesome, tyty  [15:15]
<mordred> smcginnis: I will *definitely* ping you with it  [15:15]
<mordred> smcginnis: if you don't watch out I'll make you sdk core :)  [15:15]
<Shrews> mordred: i think i just heard smcginnis request core status  [15:15]
*** avass has joined #zuul  [15:16]
<smcginnis> I knew I should have just gone back to my coffee!  [15:16]
<avass> mordred: yeah I fixed it by setting owner=no group=no  [15:18]
*** bolg has joined #zuul  [15:18]
<Shrews> openstack: punishing developers with core status since 2010  [15:18]
*** bhavikdbavishi1 has joined #zuul  [15:19]
<openstackgerrit> Albin Vass proposed zuul/zuul-jobs master: Don't preserve logs uid and guid  https://review.opendev.org/701381  [15:20]
*** bhavikdbavishi has quit IRC  [15:20]
*** bhavikdbavishi1 is now known as bhavikdbavishi  [15:20]
<avass> Gotta go but I'll check in later  [15:21]
<Shrews> avass: oh, mordred and i were just discussing a similar change. perhaps you could change 701381 to use a variable to control permissions via the 'archive' option?  [15:21]
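A minimal sketch of what that fix looks like in role terms, assuming fetch-output copies files back to the executor with Ansible's synchronize module (whose archive=yes default implies owner/group preservation); the variable and path names here are illustrative:

    # Hypothetical fetch-output task: pull collected output back to the
    # executor without trying to chown files to their original uid/gid.
    - name: Collect logs, artifacts and docs
      synchronize:
        src: "{{ zuul_output_dir }}/"
        dest: "{{ zuul.executor.log_root }}/"
        mode: pull
        # archive (the default) implies owner/group preservation, which is
        # what produces the "chown: invalid argument" failure when the
        # destination user cannot map the source uid/gid
        owner: "{{ fetch_output_preserve_ownership | default(false) }}"
        group: "{{ fetch_output_preserve_ownership | default(false) }}"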
<avass> Shrews: Sure I'll take a look at it when I get home :)  [15:22]
<Shrews> avass: ++  [15:22]
<Shrews> thanks!  [15:22]
*** avass has quit IRC  [15:22]
<fungi> smcginnis: Shrews: yes, the issue i'm familiar with seemed to be nova ignoring deletion errors returned from cinder (or possibly never managing to request deletion due to connectivity errors?) and then going ahead and deleting the instance while cinder still thinks the volume is attached (to a now nonexistent instance)  [15:24]
<fungi> which leaves nobody but the cloud operator able to clean up the mess, since there's no way to get nova to tell cinder about a detach at that point, and cinder (sanely) won't let normal users delete attached volumes  [15:27]
<mordred> fungi, smcginnis: sdk still might have problems working around that  [15:28]
*** michael-beaver has joined #zuul  [15:28]
<fungi> i can't remember why telling nova not to proceed with the instance deletion if it can't confirm from cinder that the volume got detached was a problem, but maybe that's what smcginnis means by "more orchestration in nova"  [15:29]
<mordred> because if the issue is that a server delete isn't communicating the detach fully - then the situation fungi describes will still occur  [15:29]
<pabelanger> mordred: other question I had, do you think we could bubble up warnings / deprecation notices to console logs for zuul users to see? When moving jobs between ansible versions, it is sometimes hard to see what is deprecated. Today, only the zuul operator has access to that info in executor logs.  [15:30]
<mordred> because I don't think as an API consumer sdk can say "please detach this volume" to a volume that is the root volume for a running server  [15:30]
<fungi> could it if the server were stopped first?  [15:30]
<corvus> iiuc, pabelanger described both situations -- user-level leakage, and admin-level leakage. could be 2 problems?  [15:30]
<fungi> yeah, we have seen both. some volumes are successfully detached but never deleted  [15:30]
<mordred> fungi: I don't know - we could try that  [15:31]
<corvus> oh, so maybe it's more or less the same problem, but the interruption in communication happens at different phases?  [15:31]
<fungi> those we could certainly clean up with normal credentials  [15:31]
<fungi> corvus: it wouldn't surprise me  [15:31]
<pabelanger> corvus: yup, I'm able to manually clean up 'available' volumes which block quota. For leaked (still attached) ones I need to ask the admin  [15:31]
<mordred> smcginnis: do you happen to know if a normal user can detach a volume from a stopped server if the volume in question is the root volume of the server?  [15:32]
<mordred> fungi, corvus, Shrews: also - upon further reflection, I do not believe we can fix this in sdk - or, at least, not in a way that nodepool can use  [15:32]
<Shrews> yah  [15:32]
<mordred> and that's because fixing it sdk-side would turn create_server into a synchronous call (we'd have to wait for the volume create to pass the volume to the server create)  [15:33]
<Shrews> also yah  [15:34]
<corvus> or else some kind of chain of closures or something  [15:34]
<mordred> so I think there's two things here - is there a user-level sequence of API operations we can do instead of BFV via nova that will avoid the attachment issue - and then how best to implement it - which might be a feature in sdk that nodepool doesn't use, as well as a feature in nodepool  [15:34]
<mordred> yeah  [15:34]
<mordred> maybe we do the sdk support in such a way that nodepool can consume it piecemeal  [15:34]
<mordred> but first we need to find the appropriate api sequence  [15:35]
<Shrews> or we could just add a new sdk API method specifically for this?  [15:35]
<openstackgerrit> James E. Blair proposed zuul/zuul-jobs master: Add cri-o support to use-buildset registry  https://review.opendev.org/701237  [15:36]
<Shrews> create_server_with_volume() which automatically assumes wait_for_volume() / wait_for_server()  [15:36]
<Shrews> or similar nonsense words  [15:37]
<mordred> Shrews: yeah - but I think nodepool couldn't use that - so we'd still need to figure out how best to expose the capability in a way nodepool could use ... I'm liking the corvus chain-of-closures idea as I think about it - but it might go to the dark place  [15:38]
<corvus> yeah, pieces that nodepool can use sounds like a great first step without turning the taskmanager into a turing machine :)  [15:39]
*** bolg has quit IRC  [15:39]
<smcginnis> mordred: No, you are not able to detach a boot volume from an instance, even when it's been stopped.  [15:40]
<smcginnis> I think Matt had started looking at allowing that to be able to do some other shelving or something, but it sounded like a not insignificant effort and I don't think it's gotten anywhere yet.  [15:40]
<mordred> smcginnis: gotcha. so without that we'd be unable to avoid the proposed issue of "nova sent a detach which didn't work but then deleted the instance anyway"  [15:42]
<mordred> since we'd be reliant on nova doing the detach  [15:42]
<smcginnis> Hmm, yeah.  [15:43]
<mordred> that said - maybe that's a more tractable bug? when instances are deleted, sometimes the cinder metadata about volume attachment state is stale?  [15:44]
<mordred> it's not really more orchestration as much as simply completing an action that nova is the only actor able to perform  [15:45]
<smcginnis> Yeah, I think in that case nova would blow away its information about it, but cinder would still have an entry in the attachments table for the volume and instance.  [15:45]
<smcginnis> So in theory something external should be able to reconcile it.  [15:45]
<mordred> yeah  [15:45]
<mordred> smcginnis: if you don't mind ... how does this work today? is there an internal api or something that nova is calling as part of its delete process to tell cinder it's no longer attached?  [15:46]
<smcginnis> Yep. Nova calls a detach API to tell cinder to remove any mappings or whatever backend specific thing that exposes the volume to the host and remove the entry from the database.  [15:48]
<smcginnis> Some backends need to explicitly say "let initiator XYZ access volume 123".  [15:49]
<smcginnis> So detach removes that.  [15:49]
<mordred> smcginnis: nod. and that API is one that is not available to normal users, because if it's something that's called on a volume that is actually still in use, it would potentially create havoc  [15:51]
<smcginnis> mordred: Well, it is exposed as a regular API, no special secret handshake needed.  [15:51]
<smcginnis> It's just not publicized as something an end user should use normally.  [15:52]
<smcginnis> Most should be using this newer microversion - https://docs.openstack.org/api-ref/block-storage/v3/index.html?expanded=delete-attachment-detail#delete-attachment  [15:52]
<mordred> smcginnis: oh - wait - a non-privileged end-user *could* use it?  [15:52]
<mordred> aha!  [15:52]
<smcginnis> I believe if they have access to the volume, they have rights to make that call with the default policies.  [15:52]
<mordred> pabelanger: do you have any volumes in the leaked state needing admin deletion right now?  [15:53]
<smcginnis> John Griffith and Matt R did a lot of the work to implement this API and get a better handle on the whole workflow.  [15:53]
<smcginnis> I would need to find the specs from that work to get more details. It's not always clear which calls come in which order without swapping some of that back in.  [15:53]
<pabelanger> mordred: yup  [15:54]
<mordred> pabelanger: cool. I think we should try that api call above to see if it removes the attachment from the volume and then lets you delete it directly  [15:54]
<mordred> pabelanger: if you like, i can try to cook up a script that would do it - but I can't test it myself  [15:54]
<smcginnis> Some docs in case anyone is interested - http://specs.openstack.org/openstack/cinder-specs/specs/ocata/add-new-attach-apis.html  [15:54]
<mordred> if the api call works, then I think we've got a path forward  [15:55]
<pabelanger> mordred: sure, happy to test  [15:55]
<mordred> pabelanger: k. I'll get you a thing in just a bit  [15:55]
<pabelanger> http://paste.openstack.org/show/788131/  [15:55]
<pabelanger> is what I see if I try to directly delete  [15:55]
<pabelanger> http://paste.openstack.org/show/788132/  [15:56]
<pabelanger> is server info  [15:56]
<smcginnis> You can also force reset a volume status, but I think that would leave some crud in some of the tables.  [15:56]
<smcginnis> https://ask.openstack.org/en/question/66918/how-to-delete-volume-with-available-status-and-attached-to/  [15:57]
<Shrews> gotta step out for an appointment, but will catch up on this convo when i return.  [15:57]
<mordred> pabelanger: http://paste.openstack.org/show/788133/  [16:01]
<mordred> pabelanger: obvs change the cloud name  [16:01]
<mordred> pabelanger: but that's your attachment id  [16:01]
<mordred> pabelanger: if that works, then you should be able to try deleting the volume again  [16:02]
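The paste itself isn't preserved in this log, but the call being tested is roughly the following; a hypothetical sketch using openstacksdk, with the cloud name and volume id as placeholders, against the cinder attachments API smcginnis linked above (microversion 3.27):

    # Hypothetical reconstruction of the "delete the stale attachment" script.
    import openstack

    conn = openstack.connect(cloud='my-cloud')
    volume = conn.block_storage.get_volume('VOLUME_UUID')

    for attachment in volume.attachments or []:
        # DELETE /v3/{project_id}/attachments/{attachment_id}; the
        # attachments API needs the 3.27 block-storage microversion.
        conn.block_storage.delete(
            '/attachments/{id}'.format(id=attachment['attachment_id']),
            headers={'OpenStack-API-Version': 'volume 3.27'})

    # With the attachment record gone, a normal volume delete should work.
    conn.block_storage.delete_volume(volume)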
<pabelanger> mordred: yup, that worked! I tested with 2 leaked volumes  [16:05]
<mordred> pabelanger: WOOT!  [16:07]
<mordred> I have a path forward  [16:07]
<pabelanger> Yay  [16:07]
<mordred> also - Shrews re: the above - it should be possible to also code a safe leak checker if we have server ids - logic being "do a get on /attachments ; look to see if it's for a server we know about but which we think is deleted, and if so, force a detach"  [16:08]
<mordred> but I think we can make a system that doesn't leak  [16:08]
<mordred> fungi, clarkb, corvus: ^^ see scrollback and paste for information on how to resolve leaked volumes without need of an admin (the hacky/script version)  [16:08]
<clarkb> sometimes they fail to delete and have no attachment  [16:09]
<clarkb> we may even have a few in that state in vexxhost right now  [16:09]
<clarkb> iirc we did pre holidays  [16:09]
<mordred> I'll go look  [16:10]
<mordred> or - rather  [16:10]
<clarkb> do you know how to identify those?  [16:10]
<clarkb> mordred: volume list | grep available, then volume show on those and look for ones that are older than a few days  [16:10]
<clarkb> then try and delete them and see if they go away  [16:11]
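Roughly, that manual sweep looks like this with the standard openstackclient commands (cloud name and volume id are placeholders):

    # find volumes sitting idle in the "available" state
    openstack --os-cloud my-cloud volume list --status available

    # inspect a candidate; created_at shows whether it is old enough to be
    # a leak rather than something a job is about to use
    openstack --os-cloud my-cloud volume show <volume-uuid>

    # try to delete it; if this refuses, it is in the "ask an admin" bucket
    openstack --os-cloud my-cloud volume delete <volume-uuid>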
<pabelanger> clarkb: yah, that is b) from above  [16:11]
<pabelanger> we also see that  [16:11]
<pabelanger> for that, I think we could add the zk id into metadata, then compare, and if the zk id is missing, delete?  [16:12]
<clarkb> the problem is they don't delete  [16:12]
<pabelanger> well, not missing zk id, the zk id doesn't exist  [16:12]
<clarkb> some do, some don't  [16:12]
<mordred> yah - clarkb is suggesting there is a c) state  [16:12]
<pabelanger> ack, so far I haven't seen that issue  [16:12]
<pabelanger> mind you, I often delete 2 weeks later  [16:12]
<mordred> I think I'm going to focus on seeing if we can eliminate a and b  [16:12]
<mordred> since we've got new info on how to handle those  [16:13]
<clarkb> there is a fourth state too where your quota usage recorded by cinder is completely out of sync and you can't bfv because you have no more volume quota  [16:13]
<Shrews> mordred: problem with us "having server ids" is we don't keep that info after the server is deleted. Can discuss more later  [16:15]
*** rfolco is now known as rfolcOUT  [16:16]
<mordred> Shrews: good point.  [16:18]
<mordred> clarkb: is it common for these volumes to take my entire life to delete?  [16:19]
<clarkb> mordred: ya it's not fast, but that may mean they won't delete? which is the state I was pointing out  [16:20]
*** electrofelix has joined #zuul  [16:23]
*** rishabhhpe has joined #zuul  [16:23]
<mordred> clarkb: it was outbound http hanging because of a stupid captive portal :)  [16:24]
<mordred> clarkb: all leaked volumes in vexxhost deleted with no issue. I'll keep my eye out for that so we can try to trick smcginnis into helping us with them  [16:29]
<fungi> he's surprisingly easy to bait into being helpful  [16:35]
<mordred> fungi: it's almost like he's a helpful person  [16:36]
<fungi> oh, maybe that's it ;)  [16:40]
<mordred> smcginnis, Shrews: ok - so I can confirm that I can do the delete attachments call on a volume that has been created by nova as part of a BFV and is a boot volume for a server (I stopped the server first)  [16:46]
<mnaser> pabelanger, smcginnis: indeed i actually don't agree it's cinder leaking volumes, it's more like nova noop-ing volume deletes *however* i have seen cases where quota does get out of sync  [16:46]
<mordred> mnaser: yah - I think we've got a fix in the works  [16:46]
<mnaser> reproducers are hard obviously for the quota out of sync  [16:46]
<mnaser> i will say: 99% of the time everything works  [16:47]
<mordred> so I *think* the safe flow would be "create volume ; boot server on volume.id" for create - then for delete: "stop server ; delete attachment ; delete server ; delete volume"  [16:48]
<pabelanger> mnaser: yah, it only happens every few months  [16:48]
<pabelanger> all other times, works every time :)  [16:48]
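A sketch of what that flow could look like with openstacksdk's cloud layer; the names, sizes and flavor are placeholders, and whether nodepool should actually do it this way is exactly what is being debated here:

    import openstack

    conn = openstack.connect(cloud='my-cloud')

    # create: make the boot volume ourselves so we own its lifecycle
    volume = conn.create_volume(size=80, image='ubuntu-bionic',
                                bootable=True, wait=True)
    server = conn.create_server('node-0001', flavor='v2-standard-8',
                                boot_volume=volume.id, wait=True)

    # delete: stop the server, then remove the server and finally the volume;
    # if the volume refuses to delete, clear the stale attachment with the
    # attachment-delete call shown in the earlier sketch
    conn.compute.stop_server(server.id)
    conn.compute.wait_for_server(conn.compute.get_server(server.id),
                                 status='SHUTOFF')
    conn.delete_server(server.id, wait=True)
    conn.delete_volume(volume.id, wait=True)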
<mordred> we could also leave things as they are, since we can clean up the situation with the attachment delete - and simply add a flag like "nodepool.autocreated" to the volume metadata after the server create is finished  [16:49]
<mordred> then add a cleanup thread that looks for volumes in an attached state with nodepool.autocreated set to True where the server id listed in the volume's attachments does not exist  [16:50]
<mordred> and clean those up by deleting the attachment and then deleting the volume  [16:50]
<mordred> might not need the nodepool.autocreated flag - any volume in an attached state with a non-existent server is a leak and should be safe to clean  [16:50]
<mordred> yeah?  [16:50]
<mnaser> i would scan for attachments for nodepool-created volumes instead, just in case there is some other thing going on  [16:51]
<mnaser> i don't know if we enforce a "nodepool owns an entire tenant" rule  [16:51]
<mordred> we don't  [16:51]
<mordred> so a nodepool flag is safer  [16:52]
<pabelanger> +1  [16:52]
<mordred> although it's pretty much always a volume leak if a volume has a bogus attach - even if it's not a nodepool leak  [16:52]
<mordred> but let's start with the nodepool filter  [16:52]
<mnaser> i agree with that too  [16:53]
*** mattw4 has joined #zuul  [16:53]
<mnaser> but when deleting things it's always scary with automation :p  [16:53]
<mnaser> so the more barriers...  [16:53]
<mordred> yup  [16:53]
<pabelanger> nodepool, deleter of all things  [16:53]
<mordred> I'll make an SDK "cleanup_leaked_volumes" method that takes an optional thing to look for in metadata  [16:54]
<clarkb> any busy nodepool effectively owns a tenant due to quota utilization fwiw  [16:54]
<clarkb> (also isn't this what tenants are for?)  [16:55]
<mordred> s/tenants/projects/ yes  [16:55]
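A sketch of the cleanup thread mordred describes; cleanup_leaked_volumes is the name from the chat, while the metadata key and the rest of the logic are illustrative:

    import openstack

    def cleanup_leaked_volumes(conn, metadata_key='nodepool.autocreated'):
        """Delete volumes whose attachments all point at servers that no
        longer exist, but only if they carry our metadata flag."""
        live_servers = {server.id for server in conn.list_servers()}
        for volume in conn.list_volumes():
            if volume.metadata.get(metadata_key) != 'true':
                continue
            attachments = volume.attachments or []
            stale = [a for a in attachments
                     if a['server_id'] not in live_servers]
            # only act when every attachment is stale; anything else might
            # still be in use
            if not attachments or len(stale) != len(attachments):
                continue
            for attachment in stale:
                conn.block_storage.delete(
                    '/attachments/{id}'.format(
                        id=attachment['attachment_id']),
                    headers={'OpenStack-API-Version': 'volume 3.27'})
            conn.delete_volume(volume.id)

    # usage: cleanup_leaked_volumes(openstack.connect(cloud='my-cloud'))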
<rishabhhpe> hello all, on the community page and also in ciwatch i can see status N, which means not registered .. what does it signify  [16:55]
<rishabhhpe> and what can be done to overcome that  [16:55]
<fungi> rishabhhpe: i don't know what you're talking about, though also we have nothing to do with ciwatch  [16:56]
<fungi> rishabhhpe: it sounds like maybe you're running a third-party ci system for testing an openstack driver? i would start by asking people from the project that driver integrates with  [16:57]
<fungi> they're the ones who will be able to tell you what they expect of your ci system  [16:57]
<rishabhhpe> OK i will ask in the third party ci channel then  [16:58]
*** pcaruana has quit IRC  [17:05]
<openstackgerrit> Mohammed Naser proposed zuul/zuul-jobs master: Add basic Helm jobs  https://review.opendev.org/700222  [17:12]
*** tosky has quit IRC  [17:23]
*** bhavikdbavishi has quit IRC  [17:23]
<rishabhhpe> Hi .. can anyone check the error pasted at "http://paste.openstack.org/show/788139/" and let me know if i am missing anything while configuring zuul v3 using this link: https://zuul-ci.org/docs/zuul/admin/quick-start.html  [17:24]
*** bhavikdbavishi has joined #zuul  [17:24]
<clarkb> fungi: I don't think we need to update the docs for the upload roles  [17:27]
<clarkb> fungi: for the non-swift upload role you'll only want to compress if the fileserver can handle it (and if not you'd disable stage_compress_logs too)  [17:27]
<clarkb> and for swift the idea is to make it transparent to the user  [17:27]
*** pcaruana has joined #zuul  [17:28]
<fungi> yeah, just wondering if the behavior change provided by the stage_compress_logs variable is documented (i confused it with zuul_log_compress)  [17:29]
<fungi> https://zuul-ci.org/docs/zuul-jobs/general-roles.html#rolevar-stage-output.stage_compress_logs  [17:30]
<fungi> not sure if there's benefit to explaining in the rolevar description what the implications of that option are in combination with swift storage  [17:31]
<clarkb> fungi: https://review.opendev.org/#/c/701282/2/roles/stage-output/README.rst updates that  [17:31]
<clarkb> maybe that update is too brief  [17:32]
<fungi> aha, nope, that's plenty  [17:32]
*** evrardjp has quit IRC  [17:33]
<fungi> i was only looking at 701284  [17:33]
*** evrardjp has joined #zuul  [17:33]
<fungi> and i guess https://etherpad.openstack.org/p/wwgBLSq1Rq is announcing 701282  [17:34]
<clarkb> yes  [17:34]
<fungi> makes much more sense now, thanks  [17:34]
<clarkb> I don't think 84 needs to be announced as we've always hidden the behavior from the user (except for when it leaks through as in this bug and does the wrong thing)  [17:34]
<fungi> yeah  [17:35]
<pabelanger> okay, using mordred's script, I've been able to clean up all leaked volumes (without admin powers)  [17:35]
<fungi> clarkb: do those need to merge in any particular order?  [17:35]
<fungi> i guess they're independent-but-related changes  [17:35]
<mordred> pabelanger: woot!  [17:36]
* mordred has helped something today  [17:36]
<clarkb> fungi: I don't think order matters too much as long as they are adjacent  [17:39]
*** sgw has quit IRC  [17:39]
<fungi> yeah, i'm +2 on both. thanks for working through those  [17:39]
<openstackgerrit> Mohammed Naser proposed zuul/zuul-helm master: Added nodepool to charts  https://review.opendev.org/700462  [17:41]
<corvus> clarkb: 284 lgtm, +3; etherpad lgtm; do you want to wip 282 until next week?  [17:42]
<clarkb> corvus: that works, wasn't sure how long of a warning you wanted to give  [17:44]
*** saneax has joined #zuul  [17:44]
<clarkb> I'll add a link to 282 to the email draft too  [17:44]
<clarkb> WIP'd and sending email now  [17:46]
<fungi> rishabhhpe: which error are you referring to? if it's "Cannot send SSH key added message to admin@example.com" that's probably benign since it's not a reachable e-mail address  [17:46]
<fungi> rishabhhpe: yep, it shows up in our quickstart job log too and that doesn't result in any failure: http://zuul.opendev.org/t/zuul/build/c422829e153640b68bd983f1bd2e120f/log/container_logs/gerrit.log#135  [17:48]
<clarkb> corvus: once 284 lands we should merge something to zuul to get a new js tarball  [17:49]
<clarkb> https://review.opendev.org/#/c/697055/ maybe? /me reviews this change  [17:49]
<clarkb> oh you've got a question on that one  [17:50]
<corvus> clarkb: https://review.opendev.org/698774 looks ready  [17:50]
<corvus> or we could ask someone else to review https://review.opendev.org/691715  [17:51]
<rishabhhpe> fungi: so this docker-compose up command is a never-ending shell? and can i run it in the background if required to keep my services up all the time?  [17:52]
<clarkb> corvus: that first one looks like a good one to get in. I'm reviewing it  [17:54]
<openstackgerrit> Merged zuul/zuul-jobs master: Swift upload logs without encoding type for gzip files  https://review.opendev.org/701284  [17:56]
<clarkb> https://review.opendev.org/#/c/698774/ is headed in now  [18:00]
<clarkb> that should update the js tarball which should be retrieved with no decompression now  [18:00]
*** saneax has quit IRC  [18:01]
<fungi> rishabhhpe: you mean the command at https://zuul-ci.org/docs/zuul/admin/quick-start.html#start-zuul-containers i guess?  [18:03]
<fungi> i thought `docker-compose up` returned control to the calling shell  [18:03]
<clarkb> it does not. It follows the containers  [18:05]
<clarkb> and when you kill the up command it kills the containers  [18:05]
<clarkb> docker-compose start runs them in the background  [18:05]
<clarkb> or up -d  [18:05]
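In other words, the options being described look roughly like this, run from the directory holding the quick-start's docker-compose.yaml:

    # foreground: follows the containers and stops them when interrupted
    docker-compose up

    # detached: start the containers in the background instead
    docker-compose up -d

    # ...and follow the debug logs separately when wanted
    docker-compose logs -f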
<rishabhhpe> clarkb: it worked for me but i can see after this some of my containers are exiting  [18:08]
<clarkb> rishabhhpe: some of them are one-shot containers to do setup  [18:10]
<fungi> aha, my grasp of containery stuff is still a bit tenuous  [18:10]
<fungi> thanks  [18:10]
<fungi> i suppose where the quickstart says "All of the services will be started with debug-level logging sent to the standard output of the terminal where docker-compose is running." should have been a tip-off  [18:11]
<fungi> obviously leaking output from a background process to an interactive terminal would be a poor choice, so that does imply the shell is blocked on that running process  [18:12]
<openstackgerrit> Mohammed Naser proposed zuul/zuul-helm master: Added nodepool to charts  https://review.opendev.org/700462  [18:13]
<rishabhhpe> fungi: i did not get what you are trying to say  [18:13]
*** michael-beaver has quit IRC  [18:18]
<fungi> rishabhhpe: i was conversing with clarkb about how in retrospect it should have been obvious to me from the quickstart document that the `docker-compose up` command will block the calling shell and stay in the foreground indefinitely  [18:21]
<rishabhhpe> fungi: clarkb: is there any documentation available for setting up third-party ci using zuul v3?  [18:25]
<fungi> rishabhhpe: i don't know of any. like i said earlier, you should talk to the projects who are asking you to do this testing and find out what they expect from you  [18:26]
<fungi> rishabhhpe: the opendev infra team has a help-wanted spec up about what that documentation might cover: https://docs.opendev.org/opendev/infra-specs/latest/specs/zuulv3-3rd-party-ci.html  [18:27]
<Shrews> tristanC: a few weeks back (which seems like years now), you suggested a web site that explained how technical documentation should be organized. do you recall what that was?  [18:29]
<tristanC> Shrews: it was: https://www.divio.com/en/blog/documentation/  [18:30]
<Shrews> yes! thank you  [18:31]
<tristanC> Shrews: you're welcome, i find this structure quite transformative :)  [18:33]
<Shrews> tristanC: do you have any documentation using this structure?  [18:34]
<Shrews> that's public  [18:34]
<tristanC> Shrews: https://docs.djangoproject.com/en/3.0/#how-the-documentation-is-organized  [18:35]
<Shrews> oh, yeah. that works. thx again  [18:35]
<tristanC> Shrews: https://docs.dhall-lang.org/ and my own attempt at it is: https://github.com/podenv/podenv  [18:36]
*** sgw has joined #zuul  [18:37]
<Shrews> ++  [18:38]
<fungi> https://zuul-ci.org/docs/zuul/developer/specs/multiple-ansible-versions.html is fully implemented now, right? is there any process we're using to mark completion of a specification in zuul?  [18:38]
<fungi> do we delete it? or remove the admonition from the top of the page, or...?  [18:38]
*** rishabhhpe has quit IRC  [18:40]
<tobiash> fungi: yes, that's fully implemented  [18:49]
<tobiash> I guess we don't have a process for marking completed specs yet?  [18:50]
<Shrews> i think also https://zuul-ci.org/docs/zuul/developer/specs/container-build-resources.html  [18:50]
<fungi> right, i was using multi-ansible as an example. that was really my question... i guess we need to decide how to signal when a spec is implemented  [18:51]
<Shrews> maybe just sections: Proposed and Completed  [18:51]
<fungi> i'll push up a possible solution  [18:51]
<fungi> well, and also removing or changing the admonition  [18:51]
<fungi> for the completed specs  [18:51]
<Shrews> yeah  [18:52]
<openstackgerrit> Mohammed Naser proposed zuul/zuul-helm master: Added nodepool to charts  https://review.opendev.org/700462  [18:52]
<fungi> also, do we expect specs to be updated to reflect the implementation, if the proposed plan differed somewhat from what we wound up with? that will impact how we present a completed spec to the reader, i guess  [18:52]
<openstackgerrit> Merged zuul/zuul master: Don't set ansible_python_interpreter if in vars  https://review.opendev.org/698774  [18:52]
<AJaeger> Regarding specs: I think with Zuul the team does a great job of updating docs, so do we need completed specs at all? Or isn't the documentation good enough that the spec can be removed from our published site?  [18:54]
<fungi> that seems like the simplest option to me. delete specs once implemented  [18:54]
<fungi> which is what i had assumed was going on before i started looking closer at what was in the specs directory  [18:55]
<fungi> the text isn't lost, since we'll have it in the git history forever  [18:55]
<AJaeger> exactly  [18:55]
<AJaeger> and a completed spec should be of interest only to developers implementing Zuul, shouldn't it? So, they have access to git history...  [18:56]
<fungi> agreed  [18:56]
<fungi> i'll start with that and we can use it at least as a springboard for talking about more complicated options  [18:56]
<AJaeger> If we delete specs, we might add a note to the specs site: Implemented specs get deleted, use git if you need them - and read our docs for full treatment  [18:56]
<fungi> good idea, i'll include that  [18:57]
<fungi> https://zuul-ci.org/docs/zuul/developer/specs/logs.html is done too, right?  [18:59]
<clarkb> the zuul js tarball is actually compressed now \o/  [18:59]
* fungi cheers  [19:00]
*** tjgresha has joined #zuul  [19:02]
<AJaeger> \o/  [19:02]
*** bhavikdbavishi has quit IRC19:04
*** tjgresha has quit IRC19:07
*** tjgresha has joined #zuul19:07
<openstackgerrit> Jeremy Stanley proposed zuul/zuul master: Remove implemented specs  https://review.opendev.org/701435  [19:19]
*** spsurya has quit IRC  [19:25]
<openstackgerrit> Mohammed Naser proposed zuul/zuul-helm master: Added nodepool to charts  https://review.opendev.org/700462  [19:28]
*** electrofelix has quit IRC  [19:33]
*** armstrongs has joined #zuul  [20:08]
*** armstrongs has quit IRC  [20:14]
*** tjgresha has quit IRC  [20:17]
*** tjgresha_ has joined #zuul  [20:17]
<openstackgerrit> Mohammed Naser proposed zuul/zuul-helm master: Add Zuul charts  https://review.opendev.org/700460  [20:25]
<mnaser> corvus: fyi i'm slowly progressing on the helm charts, nodepool seems ok and passing, i'll iterate on the rest but i don't have a lot of bandwidth these days so feel free to pick things up  [20:26]
<pabelanger> mnaser: haven't been following along, but is ansible-operator zuul not a thing anymore?  [20:27]
<mnaser> pabelanger: given time constraints and a bit of limbo in deciding where to go and take it, i settled for working on helm charts which are a bit less "invasive" in terms of work needed to deploy  [20:28]
<mnaser> ideally we can take the work we build there and convert it to an operator easily though  [20:28]
<corvus> in the end, i don't think it'll hurt to have both, if people need different deployment methods  [20:28]
<mnaser> yep that too  [20:29]
<pabelanger> yup, mostly curious if the other was dropped  [20:30]
<mnaser> yeah it's a lot more work for the other one and this works enough  [20:31]
<corvus> mnaser: thanks, i'm hoping to resume work on the gerrit-project zuul work soon where i think we can use the helm charts. but i'm blocked waiting on account creation right now, so probably not this week.  [20:33]
<mnaser> corvus: cool, i have a basic deployment running fine with no problems. the only issue is the runc zombie processes  [20:33]
<corvus> mnaser: what runc processes?  [20:33]
<mnaser> corvus: or maybe it was bwrap processes?  [20:34]
<corvus> yeah, probably bwrap.  [20:34]
<corvus> but i haven't seen that yet -- any theories about the cause?  [20:34]
* mnaser looks to check  [20:37]
<mnaser> i'm not really sure, i'm curious if it's the fact that it's a container-inside-container thing  [20:37]
<mnaser> ok i have a few here, let me get a paste  [20:37]
*** tjgresha_ has quit IRC  [20:37]
<mnaser> https://www.irccloud.com/pastebin/2qNlcLxM/  [20:38]
<mnaser> corvus: i wonder if it has to do with zuul running as uid 0 in the container..  [20:38]
<mnaser> in the case of https://opendev.org/zuul/zuul/src/branch/master/Dockerfile -- we don't seem to make a non-privileged user like in nodepool  [20:40]
<fungi> we're definitely not getting that on production executors in opendev, so could be container-related  [20:41]
<mordred> mnaser: that process list doesn't seem to show dumb-init  [20:41]
<mnaser> i don't think the zuul images have dumb-init, or do they  [20:41]
<fungi> looks like crio is running the executor directly?  [20:41]
*** pcaruana has quit IRC  [20:41]
<mnaser> correct, that's pretty much using the k8s helm charts that are up for review  [20:42]
<mnaser> so https://review.opendev.org/#/c/700460/5/charts/zuul/templates/executor/statefulset.yaml  [20:42]
<fungi> but yeah, maybe dumb-init is needed to clean up terminated subprocess forks there?  [20:42]
<mordred> mnaser: https://opendev.org/opendev/system-config/src/branch/master/docker/python-base/Dockerfile#L23  [20:42]
<corvus> the executor does not wait() on processes it kills, so if there is no init, that would explain zombie processes in those cases (aborted/timed out jobs)  [20:42]
<mnaser> o  [20:42]
<fungi> that makes sense  [20:43]
<mordred> mnaser: maybe doing command zuul-executor in the chart is also overriding the base image entrypoint?  [20:43]
<mnaser> "Note: The command field corresponds to entrypoint in some container runtimes. Refer to the Notes below."  [20:43]
<mnaser> https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#notes  [20:43]
<corvus> ah yeah, is there an option to do the other thing? where it adds extra args to the entrypoint?  [20:44]
<corvus> mnaser: also, why set "command" in the helm chart?  [20:44]
<mordred> ah - but you could also just supply it as args  [20:44]
<mnaser> i wrote this a while back so i'm not sure if it didn't work using args only  [20:45]
<mnaser> let me try it out right now  [20:45]
<mordred> nod  [20:45]
<mnaser> i def remember there was a reason behind that  [20:45]
<mordred> I'd start with corvus' question - shouldn't need it at all  [20:45]
<mordred> but if it is needed, I'd try just doing args  [20:45]
<corvus> maybe to add the "-d" ?  [20:45]
<mordred> ah - that's a good point  [20:46]
<corvus> but yeah, let's try switching that to "args: .../zuul-executor -d"  [20:46]
<mordred> so either args: or just add /usr/bin/dumb-init -- to the beginning of the command list  [20:46]
<mnaser> yeah -d is all i need it for  [20:46]
<corvus> and if "args" fails, do mordred's full command suggestion  [20:46]
<mnaser> it might be leftovers from like different iterations  [20:46]
<corvus> and let's, someday, add an env variable to toggle debug logs to make this nicer  [20:46]
<mnaser> well this goes back to an old patch i never had time to get around to, which is  [20:47]
<mnaser> running zuul in the foreground implies debug mode i think  [20:47]
<mnaser> it failed  [20:47]
<mnaser> [dumb-init] -d: No such file or directory  [20:47]
<mordred> what did you do for args?  [20:47]
<mnaser> no command, just: args: ['-d']  [20:48]
<mnaser> args: ['zuul-executor', '-d'] could be a thing but i feel like that's wrong(tm)  [20:48]
<mordred> that's the thing  [20:48]
<mordred> I think it needs to be args: ['/usr/local/bin/zuul-executor', '-d']  [20:48]
<mordred> because command == entrypoint and args == cmd  [20:48]
<mordred> you know ...  [20:49]
<mnaser> i assume this is largely because zuul requires a sort of init?  [20:49]
<mordred> yeah - you need an init to reap child processes  [20:49]
<mnaser> okay so this is the sort of thing where it'd be useful to have a note explaining why zuul-executor needs this  [20:49]
<mordred> nod  [20:49]
<mnaser> ok, let me try with those args and update that chart if that's the case  [20:50]
*** tjgresha has joined #zuul  [20:50]
<mordred> mnaser: we have it in the entrypoint for all of the images just so that things work right in general - but didn't know until just now that k8s did this with entrypoints  [20:50]
*** tjgresha has quit IRC  [20:51]
<mordred> so - lesson definitely learned for the day  [20:51]
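The working spec mnaser pastes next isn't preserved in this log, but the fix being described amounts to something like this sketch of the executor container (image tag and paths illustrative): leave command unset so the image's dumb-init ENTRYPOINT survives, and pass the program plus -d as args.

    # hypothetical excerpt of the executor StatefulSet template
    containers:
      - name: executor
        image: zuul/zuul-executor:latest
        # no "command:" here -- that would replace the image ENTRYPOINT
        # (dumb-init), and without an init the executor's killed job
        # processes are never reaped and accumulate as zombies
        args:
          - /usr/local/bin/zuul-executor
          - "-d"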
<mnaser> https://www.irccloud.com/pastebin/DYtOn7Ul/  [20:52]
<mnaser> ok, so i'll monitor things and update the chart with that and whatever has failed the chart linter  [20:52]
<fungi> does zuul-merger need it as well (for git subprocesses)?  [20:52]
<mnaser> i haven't seen any zombie processes in the merger but i assume that's less likely to have things go 'wrong'  [20:53]
<mnaser> or maybe it hasn't in my case  [20:53]
<corvus> i think we should do the same for all the images; let's run dumb-init everywhere  [20:54]
<corvus> and, basically, if we're overriding the command, override the full thing  [20:54]
<corvus> *alternatively* we could set entrypoint to "dumb-init zuul-executor" in zuul's dockerfile then args could be "-d"  [20:56]
<corvus> but i think we'd have to make that change carefully with announcements, so let's shelve that for now.  [20:56]
<openstackgerrit> Mohammed Naser proposed zuul/zuul-helm master: Add Zuul charts  https://review.opendev.org/700460  [20:56]
<mnaser> corvus: ok, updated all services to use dumb-init with the above, addressed tristanC's comments and fixed some lint stuff  [20:56]
<corvus> cool!  [20:57]
<mnaser> i split out zuul and nodepool into two different charts  [20:58]
<mnaser> and then my idea was to have some zuul-ci chart that has both included inside of it  [20:58]
<mnaser> in the case that someone wants to use one or the other for a specific use case, and the 'zuul-ci' chart would be more of a 'distro' (aka it includes zookeeper, mysql, etc)  [20:58]
<corvus> or "zuul-system"  [20:58]
<mnaser> yeah, something like that!  [20:58]
<mnaser> that's a way better name heh  [20:59]
<corvus> btw, i think we can make a job that spins up a k8s and runs the helm chart for real  [20:59]
<corvus> using speculative registries and can cross-test changes to zuul images  [20:59]
<mordred> corvus: ++  [21:00]
*** Goneri has quit IRC  [21:00]
<corvus> i just wrote a change to add crio support to that, so it'll work with fully-qualified image names in non-dockerhub registries in k8s: https://review.opendev.org/701237  [21:00]
<mnaser> corvus: there's a few tricks with that if you're trying to depend on a helm chart that hasn't merged yet (it has to live in a repo) -- but i think we can overcome those  [21:04]
<corvus> i'm looking forward to digging into it :)  [21:04]
<mnaser> also, totally my next step is to run a full zuul and k8s job, but also if we want to start publishing those for users, we *really* need tagged images for zuul  [21:04]
<mnaser> that way we can leverage the appVersion inside the helm chart and point it directly at the right image (and makes those reliably consumable)  [21:05]
<mordred> it seems like appVersion in the chart would make speculative images harder to work with  [21:05]
<mordred> but it'll be fun to poke at and figure out  [21:05]
<corvus> mnaser: we should get tagged images i agree -- however, as a potential user of the helm charts, i'm actually only interested in running master CD  [21:06]
<mordred> yeah  [21:06]
<corvus> just some user-level input fyi  [21:06]
<pabelanger> clarkb: we just flipped zuul.a.c to ansible 2.9, 1 issue found with the version_compare() filter. It was renamed to version()  [21:06]
<mordred> I think it'll be interesting to figure out how to support CD-from-master as well as tagged releases and to do both reliably  [21:06]
<mordred> pabelanger: yay  [21:07]
<clarkb> pabelanger: yup I had to fix that in zuul-jobs  [21:07]
<pabelanger> clarkb: configure-mirrors?  [21:08]
<pabelanger> (we fork that today)  [21:08]
<clarkb> yes, for distro version checks  [21:08]
<pabelanger> yah  [21:08]
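For reference, the rename looks like this in a task's when clause (a sketch; the version number is illustrative):

    - name: Do the bionic-or-newer thing
      debug:
        msg: "running on Ubuntu 18.04 or newer"
      # up to Ansible 2.8 this was written as:
      #   when: ansible_distribution_version is version_compare('18.04', '>=')
      # with Ansible 2.9 the test is named 'version':
      when: ansible_distribution_version is version('18.04', '>=')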
<mnaser> mordred, corvus: totally possible and very easy, we simply define (imagine this as a yaml tree): image.tag: "{{ .Chart.appVersion }}"  [21:29]
<mnaser> so by default we use the one defined by the chart, but a user can use their own values file to override image.tag to "latest"  [21:29]
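As a sketch of that pattern (file names and image name are illustrative); the template falls back to the chart's appVersion unless a values override supplies image.tag:

    # values.yaml (chart defaults)
    image:
      repository: zuul/zuul-executor
      tag: ""   # empty means "use the chart's appVersion"

    # templates/executor/statefulset.yaml (excerpt, a Go template line)
    #   image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"

    # my-values.yaml (a user tracking master CD instead)
    image:
      tag: latest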
<mordred> cool  [21:34]
*** rlandy is now known as rlandy|bbl23:08
*** mattw4 has quit IRC23:45
