ACM.95 Creating a script to delete a stack of resources so you can redeploy from scratch
This is a continuation of my series on Automating Cybersecurity Metrics.
Deleting stacks might not be as easy as you think!
As I explained in the last post, my resources got into a mangled state in CloudFormation that I couldn’t fix, because Amazon edits policies behind the scenes when principals used in KMS and trust policies are edited. 🙁
I thought well, I will “just” create that delete script I’ve been thinking about. A friend of mine said any time you say “just” in a sentence you’re probably in trouble. This deletion script took far too long and was way more complicated than I expected.
Not only did I have issues deleting resources due to dependencies, I had issues re-creating resources due to hidden dependencies not apparent when applying on top of existing resources. Then somehow when I re-deployed I had errors in scripts that previously worked. Hmm.
This post got really long, and given the amount of time I spent on it, I’ll explain some of the issues you may face and we’ll draft a first cut here. I’ll optimize it all and fix the remaining issues in the next post.
Creating a delete script for AWS resources
Because this is a test POC I can just delete everything and start over. I created a delete script for this purpose in the root of the GitHub directory called delete.sh.
This script uses an all-powerful delete CLI profile, which is obviously pretty dangerous in a production environment. I would not have it set up or active unless you absolutely need it. For a test environment it is useful.
Common stack delete function
The first thing I did was create an optional check before deleting each stack which can be turned on or off at the start:
I created a common delete function of course. Don’t want to repeat code over and over again. This is where I added the possibility to skip deleting a single resource if stepping through the code.
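As a rough illustration of the idea, here is a minimal sketch of a common delete function with an optional per-stack check. The names CONFIRM_EACH, PROFILE, confirm_delete and delete_stack are my own assumptions for this example and are not taken from the actual delete.sh in the repository.

```shell
#!/bin/bash
# Sketch only: names below are illustrative assumptions, not the real delete.sh.

CONFIRM_EACH="${CONFIRM_EACH:-true}"   # set to "false" to delete without prompting
PROFILE="${PROFILE:-delete}"           # hypothetical name of the delete CLI profile

# Returns 0 if the stack should be deleted, 1 if the user chose to skip it.
confirm_delete() {
  local stack="$1" answer
  if [ "$CONFIRM_EACH" != "true" ]; then
    return 0
  fi
  printf 'Delete stack %s? (y to delete, anything else to skip): ' "$stack" >&2
  read -r answer
  [ "$answer" = "y" ]
}

# Deletes one CloudFormation stack and waits for the delete to finish.
delete_stack() {
  local stack="$1"
  if ! confirm_delete "$stack"; then
    echo "Skipping $stack"
    return 0
  fi
  echo "Deleting $stack"
  aws cloudformation delete-stack --stack-name "$stack" --profile "$PROFILE"
  aws cloudformation wait stack-delete-complete --stack-name "$stack" --profile "$PROFILE"
}
```

Answering anything other than y skips that one stack and moves on, which is handy when stepping through the script one resource at a time.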
KMS Keys need special handling. Recall that only the KMS administrator can delete them. If we remove our KMS admins before we delete the key we’d be stuck with the keys unless we contact AWS support. I currently have an account in this state. I may just delete the whole account.
When a KMS key stack can’t be deleted it doesn’t indicate that the delete failed in the status. That’s probably good because what’s annoying is that sometimes a stack gets into a DELETE_FAILED state and you can’t fix it without redeploying the resource. That could cause a lot of problems with KMS. However, in this case it’s frustrating that we can’t see there was an error. I had to dig into each stack to see if it even processed my delete at all.
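Rather than clicking into each stack in the console, one option is to query the stack events from the command line. This is a sketch that assumes the problem shows up as an event whose status contains FAILED. The function name and the PROFILE variable are hypothetical.

```shell
#!/bin/bash
# Sketch only: assumes failures appear as events with FAILED in the status.

PROFILE="${PROFILE:-delete}"   # hypothetical CLI profile name

# Prints timestamp, resource, status and reason for failed events in a stack.
show_stack_failures() {
  local stack="$1"
  aws cloudformation describe-stack-events \
    --stack-name "$stack" \
    --profile "$PROFILE" \
    --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[Timestamp,LogicalResourceId,ResourceStatus,ResourceStatusReason]" \
    --output text
}
```

Empty output only means no matching events were returned. It is not proof that the delete was processed.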
Recall that we have a lot of policies referencing KMS keys, so we’ll have to delete those first.
Here’s the error even though the stack says “UPDATE COMPLETE”:
I had to create a special function for deleting keys since we have to run some commands as the KMS admin user outside of CloudFormation to delete the alias and schedule key deletion. I had to ignore some errors when the key doesn’t exist. I am being lazy here. I really shouldn’t redirect errors to /dev/null. I should check whether the key exists and only delete it if it does. But there’s no simple “if-exists” for AWS resources on the command line. So many little things take so much time.
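A minimal sketch of what such a key delete function could look like, using the lazy approach of discarding errors. The function name, the KMS_PROFILE variable and the 7-day window (the shortest waiting period KMS allows) are assumptions chosen for this example.

```shell
#!/bin/bash
# Sketch only: names and the 7-day window are illustrative assumptions.

KMS_PROFILE="${KMS_PROFILE:-kms}"   # hypothetical CLI profile for the KMS admin

# Deletes the alias and schedules the key for deletion, discarding errors
# such as "alias not found" or "key already pending deletion".
delete_key() {
  local alias_name="$1" key_id="$2"
  aws kms delete-alias --alias-name "alias/$alias_name" \
    --profile "$KMS_PROFILE" 2>/dev/null || true
  if [ -n "$key_id" ]; then
    aws kms schedule-key-deletion --key-id "$key_id" \
      --pending-window-in-days 7 --profile "$KMS_PROFILE" 2>/dev/null || true
  fi
}
```

The downside of this shortcut is that it also hides real errors, such as an access denied message.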
I had to create a special function to get a stack export outside our common functions because I just want to ignore errors instead of warn on error.
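Something along these lines would return an export value if it exists and an empty string otherwise, without printing warnings. The function name and PROFILE variable are hypothetical.

```shell
#!/bin/bash
# Sketch only: function and variable names are illustrative assumptions.

PROFILE="${PROFILE:-delete}"   # hypothetical CLI profile name

# Prints the value of a CloudFormation export, or nothing if the lookup fails
# or the export does not exist. Errors are discarded on purpose.
get_export_or_empty() {
  local name="$1"
  aws cloudformation list-exports \
    --profile "$PROFILE" \
    --query "Exports[?Name=='$name'].Value" \
    --output text 2>/dev/null || true
}
```

A caller can then test the result with [ -n "$value" ] before trying to delete the resource the export points to.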
Recall that we’ve been using profiles like IAM, KMS, Developer, etc. Each of these profiles has the following stacks:
- Group Policy
- Role Policy
Initially I wrote a line for each of those but then I saw I was repeating myself…you know what to do if you have been following along. I created a function. The order of deleting these resources is important:
Then I can just delete all the profiles like this (except KMS until after deleting KMS resources):
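Here is a sketch of how the per-profile function and the loop might fit together. The stack naming scheme and the Role and Group stacks are my assumptions, since only the Group Policy and Role Policy stacks are listed above. The policy stacks go first on the assumption that they import values exported by the role and group stacks, and CloudFormation will not delete a stack whose exports are still imported elsewhere.

```shell
#!/bin/bash
# Sketch only: stack names and ordering are illustrative assumptions.

PROFILE="${PROFILE:-delete}"   # hypothetical CLI profile name

delete_stack() {
  aws cloudformation delete-stack --stack-name "$1" --profile "$PROFILE"
  aws cloudformation wait stack-delete-complete --stack-name "$1" --profile "$PROFILE"
}

# Deletes the stacks for one profile: policies first, then what they reference.
delete_profile_stacks() {
  local name="$1"
  delete_stack "${name}GroupPolicy"
  delete_stack "${name}RolePolicy"
  delete_stack "${name}Role"
  delete_stack "${name}Group"
}

# Deletes every profile passed in except KMS, which has to wait until the
# KMS resources themselves are gone.
delete_non_kms_profiles() {
  local name
  for name in "$@"; do
    if [ "$name" = "KMS" ]; then
      echo "Skipping $name until KMS resources are deleted"
      continue
    fi
    delete_profile_stacks "$name"
  done
}

# Example: delete_non_kms_profiles IAM KMS Developer
```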
Things deleted pretty quickly.
I added code to triple check the user wants to delete the KMS profile:
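One way to read “triple check” is to require three explicit confirmations before continuing. This sketch does that. The function name and the confirmation word are assumptions for this example.

```shell
#!/bin/bash
# Sketch only: one possible interpretation of a triple check.

# Returns 0 only if the user types "delete" three times in a row.
confirm_kms_profile_delete() {
  local answer attempt
  for attempt in 1 2 3; do
    printf 'Confirmation %s of 3: type "delete" to remove the KMS profile: ' "$attempt" >&2
    read -r answer || return 1
    if [ "$answer" != "delete" ]; then
      echo "KMS profile delete cancelled" >&2
      return 1
    fi
  done
  return 0
}

# Example: confirm_kms_profile_delete && echo "Proceeding with KMS profile delete"
```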
I probably should just go verify the keys and aliases do not exist. That would be better.
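A verification step could look something like the following sketch. It assumes the caller knows the alias name and key ID, and the function name and KMS_PROFILE variable are hypothetical. Note that a failed lookup (for example, missing permissions) is treated the same as “not found” here, which is a limitation of this quick approach.

```shell
#!/bin/bash
# Sketch only: names are illustrative; failed lookups are treated as "gone".

KMS_PROFILE="${KMS_PROFILE:-kms}"   # hypothetical CLI profile for the KMS admin

# Returns 0 if the alias is gone and the key is missing or pending deletion.
kms_cleanup_done() {
  local alias_name="$1" key_id="$2" found state
  found=$(aws kms list-aliases --profile "$KMS_PROFILE" \
    --query "Aliases[?AliasName=='alias/$alias_name'].AliasName" \
    --output text 2>/dev/null || true)
  if [ -n "$found" ]; then
    echo "Alias alias/$alias_name still exists"
    return 1
  fi
  state=$(aws kms describe-key --key-id "$key_id" --profile "$KMS_PROFILE" \
    --query "KeyMetadata.KeyState" --output text 2>/dev/null || true)
  if [ -n "$state" ] && [ "$state" != "PendingDeletion" ]; then
    echo "Key $key_id is still in state $state"
    return 1
  fi
  return 0
}
```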
I didn’t finish network deletion yet because I don’t need or want to delete networking just yet but the concept would be the same and pretty simple to add if you need it before I add it.
I also did not delete users because I don’t want to have to re-create all my AWS CLI profiles. They can’t do anything without the associated policies or other resources.
Next I redeployed everything and by doing so verified my test scripts work.
Because I deleted my IAM Admins and roles, I have to set up an IAM profile with another user in order to create my initial IAM users. Let’s say you deleted all your IAM admins by default. You may have to start over, log in as root, and create a temporary user to run that initial portion of the test script. Then it pauses and you can go reconfigure the IAM user created by the framework in your CLI profile. It might make sense to break those into two scripts, but I wanted to keep it simple for people looking at the GitHub repo.
Fix dependencies not evident when deploying on top of existing resources
I also realized I have a dependency issue:
My primary IAM Admin Group Role policy cannot reference a secret that hasn’t been created yet. This is why testing from scratch is important. When dependencies already exist you might not see errors.
I have a couple of options for addressing this. First of all, I could create a separate policy and deploy it later. I could also grant permission for some other group to create these keys. For now I’m just creating a separate policy.
I ran into the same issue with KMS dependencies. I have IAM policies referencing KMS keys, including the IAM Admins policy. That can’t be created when the referenced KMS Key doesn’t exist. I’ll explain how I fixed that in the next post.
Note on deploying on top of existing stacks
Just as I uncovered errors when deploying from scratch, if you only test deploying from scratch your deployment may not work when you go to deploy it on top of existing resources. You have to do both. The existing resource configuration may have conflicts you won’t notice when deploying from scratch.
You should likely have a staging environment, in addition to Dev and QA, to test this functionality. Deployments should be tagged or in branches in your source control system so you can redeploy the existing production configuration, or perhaps you left it up and running from a previous test. Then you deploy your new code on top of that to see if you get any errors.
More efficient testing with optional deployments
One other thing I added that I’ve done in the past is add the ability to skip over certain deployments that I know are working in my test script and proceed to the next item in the list. For example, I don’t want to re-deploy my IAM user over and over again if I know that is working. I echo out a question to the user asking if they want to deploy the IAM user. If no, skip to the next group of resources.
In other words, I added an if-then to each resource deployment type like this:
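The pattern might look roughly like this. The function names are hypothetical and deploy_iam_user is only a placeholder for the real deployment steps.

```shell
#!/bin/bash
# Sketch only: function names are illustrative assumptions.

# Asks a yes/no question; returns 0 only when the user answers "y".
ask_deploy() {
  local what="$1" answer
  printf 'Deploy %s? (y/n): ' "$what" >&2
  read -r answer
  [ "$answer" = "y" ]
}

# Hypothetical placeholder for the real IAM user deployment.
deploy_iam_user() {
  echo "Deploying IAM user"
}

# Runs the given deployment function only if the user agrees.
run_step() {
  local what="$1" fn="$2"
  if ask_deploy "$what"; then
    "$fn"
  else
    echo "Skipping $what"
  fi
}

# Example: run_step "the IAM user" deploy_iam_user
```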
By adding the ability to skip over resources, I can skip over the ones I know are already implemented and working correctly and focus on issues with the ones that are failing. That said, I remind you again to delete everything and redeploy it and make sure it still works end to end before you declare it is “done.”
I can further refine my delete script to abstract out some code and I need to sort out some dependencies. Follow for updates.
If you liked this story please clap and follow:
Medium: Teri Radichel or Email List: Teri Radichel
Twitter: @teriradichel or @2ndSightLab
Requests services via LinkedIn: Teri Radichel or IANS Research
© 2nd Sight Lab 2022