ACM.69 When you go to bed and your code works and wake up to find it doesn’t anymore ~ what it’s like to troubleshoot CloudFormation
This is a continuation of my series on Automating Cybersecurity Metrics.
I’m going to write about things you can do to deal with errors, using an error that is driving me nuts as I start writing this post.
If I have to spend this much time on it, I figure it’s worth a blog post. Maybe it will help you troubleshoot CloudFormation faster. Or maybe I’m just documenting time spent on a single error message while trying to complete code. Either way, this is what to look out for and how I went about troubleshooting this particular error.
As it turns out, some of it was a waste of time because the error message is completely 100% misleading. I wrote about the importance of error messages here. It’s important for security and also to be nice to your customers, software users, and future developers of your code (including yourself when you can’t remember what the error means later).
The first thing is to understand where to find CloudFormation errors. When you deploy a template using CloudFormation and it fails, you can print out a list of events on the command line using the command AWS provides with the failure (which is really nice, by the way!).
That’s great but sometimes I find it easier to view the events in the AWS Console. Navigate to your list of CloudFormation stacks (in the same region as your CLI is configured to deploy them of course!). Click on your stack and then click on “Events” at the top of the list.
The information you need is not necessarily the first thing in red. You have to scroll down in the list to find the problem that triggered the deployment failure.
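If you would rather stay on the command line, a rough sketch of filtering the event list down to failures might look like this. The function name `failed_events` is made up for illustration; `describe-stack-events` and the JMESPath `ends_with` function are standard. Events come back newest first, so the event that triggered the failure is usually near the bottom of the table.

```shell
# Hypothetical helper: show only the events whose status ends in FAILED
# (CREATE_FAILED, UPDATE_FAILED, DELETE_FAILED) along with their reasons.
# $1 is the stack name.
failed_events() {
  aws cloudformation describe-stack-events \
    --stack-name "$1" \
    --query "StackEvents[?ends_with(ResourceStatus, 'FAILED')].[Timestamp, LogicalResourceId, ResourceStatus, ResourceStatusReason]" \
    --output table
}
```

Usage would be something like `failed_events my-stack`, run against the same region and profile your deployment used.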
One thing I always do is print out each CLI command my scripts execute, along with the parameters passed into the command. That way, when something fails, I have the actual command the CLI was trying to run at the time of failure. I don’t have to re-run my entire stack; I can re-run just the command that failed to troubleshoot it.
That’s what I’m doing in this framework I’m writing for batch jobs, though the code also contains a framework for how an organization might start using AWS and structuring deployments. I don’t repeat the print command over and over. Instead, I created common functions for error handling to reduce the amount of code I have to maintain and troubleshoot.
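As an illustration of the idea (not the actual function from my framework), a minimal print-then-run wrapper could look like the sketch below. The name `run_cmd` is hypothetical. It prints each argument in brackets so you can see exactly where one argument ends and the next begins, then runs the command unchanged.

```shell
# Hypothetical wrapper: log a command one bracketed argument at a time,
# then execute it. The log goes to stderr so captured stdout stays clean.
run_cmd() {
  printf 'Running:' >&2
  for arg in "$@"; do
    printf ' [%s]' "$arg" >&2
  done
  printf '\n' >&2
  "$@"
}
```

For example, `run_cmd aws cloudformation deploy --stack-name my-stack --template-file template.yaml` logs the full argument list before deploying. If you want a line you can paste straight back into a terminal, bash’s `printf '%q '` is an alternative way to print the arguments.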
Here’s what I get at the end of my script above the CloudFormation failure message using my common deployment functions:
I can copy and paste that code and execute it on the command line separately from all the rest of my complicated stack code to see what error I get.
Well first of all I get this error:
An error occurred (ValidationError) when calling the CreateChangeSet operation: Stack:arn:aws:cloudformation:us-east-1:xxxxx:stack/Network-SGRules-RemoteAccessPublicVPC-Default-Rules/xxxxx is in ROLLBACK_COMPLETE state and can not be updated.
That essentially tells me that there’s a stack in a bad state. I need to delete it before I can continue. Sometimes this causes a real hassle because you can’t delete a stack that has a whole bunch of other dependencies, but it’s stuck in a bad state. This is a good reason to fully test your deployments before you attempt to roll them out in production. It’s also a good idea to build stacks in small pieces, like I’ve done in my code above, so it’s easier to delete and re-deploy individual components. However, you might get into this situation and get stuck. At that point you may try various options to force deployments.
Unfortunately you might get mixed results with those options as well. Apparently disable-rollback only exists for create-stack and not deploy methods:
The problem with create-stack is that you have to test whether a stack exists before you deploy it; if it does, you have to update it instead. I had to write a lot of logic for that in prior code. Deploy makes it much easier to deploy stacks because it handles that logic for you. Unfortunately the rollback option doesn’t exist, and I’d rather deal with failures manually than test every stack to see if I need to create or update.
So I delete the stack and try again.
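If you hit ROLLBACK_COMPLETE often while testing, the delete step can be scripted. The following is only a sketch with a made-up function name; it assumes you really do want the failed stack removed, so I would not point it at anything in production.

```shell
# Hypothetical helper: delete a stack only when it is stuck in
# ROLLBACK_COMPLETE, then wait for the deletion to finish.
# $1 is the stack name. A missing stack yields an empty status and is skipped.
delete_if_rollback_complete() {
  status=$(aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query 'Stacks[0].StackStatus' \
    --output text 2>/dev/null)
  if [ "$status" = "ROLLBACK_COMPLETE" ]; then
    echo "Deleting $1 (status: $status)"
    aws cloudformation delete-stack --stack-name "$1"
    aws cloudformation wait stack-delete-complete --stack-name "$1"
  fi
}
```

Calling `delete_if_rollback_complete my-stack` right before a deploy leaves healthy stacks alone and clears only the ones that can no longer be updated.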
This time it tells me the following:
That’s odd because my parameter does have a value. I don’t get that error when my full stack executes. If you are familiar with CloudFormation parameter overrides you might notice something odd in the way I am passing in the parameters. Typically you pass them in this way:
So I test the code above and it works. Why was I including the brackets before? Because at some point CloudFormation produced an error message that told me to formulate my parameters in the bracketed manner. I got that error when I was trying to figure out how to pass parameters with spaces into CloudFormation.
The first method was the only way I could get parameters that had values with spaces to not crash my scripts, but the parameters still didn’t work correctly and show up in the console with the full value including spaces.
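For reference, here is a sketch of how the plain Key=Value form can carry a value with spaces, as far as I understand it: quote each pair so the shell hands it to the CLI as a single word. The function name and the parameter names in the usage line are hypothetical, not the ones from my templates.

```shell
# Hypothetical helper: deploy a stack, forwarding every remaining argument
# as one Key=Value pair. "$@" preserves each quoted pair as a single word,
# so a value containing spaces is not split apart.
# $1 is the stack name, $2 the template file, the rest are Key=Value pairs.
deploy_with_params() {
  stack_name="$1"
  template="$2"
  shift 2
  aws cloudformation deploy \
    --stack-name "$stack_name" \
    --template-file "$template" \
    --parameter-overrides "$@"
}
```

A call might look like `deploy_with_params my-stack template.yaml "NameParam=value with spaces" "OtherParam=abc"`. Check the Parameters tab afterward to confirm the full value arrived.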
OK, so what’s going on here? Why is it working in my script but not on the command line? Other resources are deploying correctly. Maybe that’s not actually what my command looks like. Back to my common stack deployment function that prints out this command.
I’m printing out exactly what the script executes:
Does my other code still work that worked before that uses that parameter structure? Let’s delete the last stack before this one that got deployed by our script and see if running that command from the command line works.
Here’s the stack prior to the one I’m getting errors on currently.
A test without deleting it works:
Let’s delete it and see what happens.
Here’s the problem. Route tables have dependencies. They are slow to create, update, and delete. If you try to delete them prematurely they might get into a DELETE_FAILED state.
The other problem is that once your route table gets into a DELETE_FAILED state, you can’t update it:
There’s actually nothing wrong with my route table so I could just leave it like that. It’s just kind of ugly. The problem is — what if I need to update that route table? You might be able to add a new one and delete the existing one but that’s just messy.
You could try to force deletion of your route table using this command:
aws cloudformation delete-stack --stack-name my-stack --retain-resources myresource1 myresource2
In that case, you will need to know all the resources dependent on your route table.
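One way to find candidate resource names is to ask CloudFormation which resources failed to delete. This is a sketch with a hypothetical function name; note that, per the AWS documentation, `--retain-resources` applies to stacks that are already in the DELETE_FAILED state.

```shell
# Hypothetical helper: print the logical IDs of resources in a stack
# that are currently in DELETE_FAILED. $1 is the stack name.
delete_failed_resources() {
  aws cloudformation describe-stack-resources \
    --stack-name "$1" \
    --query "StackResources[?ResourceStatus=='DELETE_FAILED'].LogicalResourceId" \
    --output text
}
```

The IDs it prints are the kind of values you would pass after `--retain-resources` in the delete-stack command above. Review the list before retaining anything, because retained resources keep existing (and billing) outside the stack.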
You can achieve the same result by attempting to delete a stack twice in the AWS console, which will pop up the following dialog box.
In my case, I’m doing something to work around a CloudFormation issue with route tables that I’ll be publishing in a later post. That is causing my issue here.
I can go ahead and delete my stack and choose to leave the underlying resource in place, and then try to go manually delete it. Just remember you have to try to delete twice to get this option. You should clean up any resources you don’t need.
If I run the script again on the command line I get the same error. Clearly CloudFormation scripts executing from the CLI are not parsing the values the same way as when I manually run the command, which is very strange.
I’m running the exact same command in a bash script that I’m running manually from the console, except that in my script I’m using the $( [command] ) syntax to execute the command after formulating it as a string. Basically I can emulate the functionality in my script to execute the command like this (from the same directory as my deploy script to get the correct relative path to my templates):
And, that works. Do not ask me why. I don’t know. I don’t care. I do know that I want to get this code out of bash as soon as possible.
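I can’t say for certain this is what bit me, but one common reason a command behaves differently when it is built as a string is word splitting: when the shell expands an unquoted variable, quote characters inside the string are treated as literal text rather than as grouping. A small self-contained demonstration, with hypothetical names throughout:

```shell
# Print how many arguments were received.
count_args() { echo "$#"; }

# A command stored as a single string, with quotes around one value.
cmd='count_args "value with spaces"'

# Expanding the string unquoted splits on whitespace; the embedded
# quotes are ordinary characters, so the value becomes three arguments.
via_split() { $cmd; }

# eval re-parses the string as shell code, so the quotes group again.
via_eval() { eval "$cmd"; }

via_split   # prints 3
via_eval    # prints 1
```

Because `eval` runs whatever ends up in the string, passing arguments as separate quoted words (for example through "$@") is generally the safer way to get the second behavior.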
Now back to our failing stack. First I have to redeploy the stacks I deleted.
Successfully back to our failure. Ugh.
Now I can test as I did above and same result.
The error message is a bit misleading because it says the value of GroupId needs to be a string. This template works just fine for other stacks, so the GroupId is not the problem. The problem is the parameter I am passing in. It is the name of an output from another stack, which the template uses with an ImportValue function to get a security group ID. I shouldn’t have any errors in my template because this already worked before. Or was I delusional when I looked at the prior results of my templates that all ran successfully last night?
The template takes an SGExportParam parameter. It uses that Parameter to get the output. I don’t see any typos here.
OK next, let’s see what I’m getting out of my function call based on the command I printed out.
That looks correct. I think that is the correct output name. Let’s make sure it got passed into our CloudFormation stack correctly by looking at the Parameters tab for the failed stack.
I see a value that looks correct. I think. Right?
Return to the VPC stack where I am outputting that value and check the outputs.
Am I blind or does that match? I mean, I realize I should probably get glasses but I’m not seeing a difference.
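Rather than comparing by eye, you can let the shell do an exact match against the export names in the region. This is a sketch; `export_exists` is a made-up name and it assumes the value you are checking is meant to be a CloudFormation export name.

```shell
# Hypothetical helper: succeed only if an export with exactly this name
# exists in the current region. $1 is the export name to look for.
export_exists() {
  aws cloudformation list-exports \
    --query 'Exports[].Name' \
    --output text |
    tr '\t' '\n' |
    grep -Fxq -- "$1"
}
```

Something like `export_exists "$name" && echo match || echo "no match"` then answers the question without squinting. The `-x` flag requires the whole line to match, so a stray leading or trailing character counts as a mismatch.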
Debug Output from CloudFormation (and other AWS tools)
What I’m going to do next is add a debug command to my stack.
This is where I stopped to write the prior post with a warning about credentials in debug output as well as in the AWS console.
Now let’s go ahead and run the command and see if we can find any useful information in our debug output. After weeding through the logs it looks like they just contain a bunch of retries to check the status of CloudFormation until the stack finally reports the failure, so the CLI or Boto3 logs don’t really help us.
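If you do turn on debug output, it is worth sending it to a file only you can read, given the credentials warning above. A minimal sketch, assuming the command you pass accepts the AWS CLI’s global `--debug` flag at the end; the function name is hypothetical.

```shell
# Hypothetical helper: run a command with --debug appended and write its
# stderr (where the AWS CLI sends debug output) to a log file.
# $1 is the log file path; the remaining arguments are the command.
run_with_debug_log() {
  log_file="$1"
  shift
  # The subshell keeps the restrictive umask from leaking into the caller;
  # a newly created log file gets owner-only permissions.
  ( umask 077; "$@" --debug 2> "$log_file" )
}
```

For example: `run_with_debug_log deploy-debug.log aws cloudformation deploy --stack-name my-stack --template-file template.yaml`. Delete the log when you are done with it.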
Hmm. The problem has to do with not getting a Security Group Id from that output. I explained earlier how I created a function to get an output from a stack. Let’s test getting that output independently.
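For readers who don’t have my function handy, a stand-in for this kind of lookup might look like the sketch below. It is not the function from my repository; the name `stack_output` and its arguments are assumptions for illustration.

```shell
# Hypothetical helper: print the value of one output from a stack.
# $1 is the stack name, $2 is the output key.
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
    --output text
}
```

Capture the result with quotes, as in `value="$(stack_output my-vpc-stack SomeOutputKey)"`, so any unexpected whitespace is preserved and you can inspect it.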
I’m going to navigate to my Functions directory and create a test script.
I navigate to my stack with the output values and copy and paste them into my script. Whenever possible I copy and paste when testing or coding to avoid typos.
And here is where I see a problem even before I attempt to run my code. I tested this a couple of times. When I copy and paste the output for the Default Security Group ID it appears to have a space in it. What is causing this? Is this a red herring or no?
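A quick way to settle whether a pasted value really contains whitespace is to test it directly instead of trusting what the terminal shows. The function below is a small sketch with a made-up name.

```shell
# Hypothetical helper: succeed if the value contains any whitespace
# character (space, tab, newline, and so on).
has_whitespace() {
  case "$1" in
    *[[:space:]]*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `has_whitespace "$value" && echo "found whitespace"`. To see exactly which characters are there, `printf '%s' "$value" | od -c` prints every byte, including ones that are invisible on screen.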
First let’s check the VPC template to see if I have an extra space inside some quotes somewhere. I can just look at the template for the stack in the AWS Console:
Maybe I’m just not seeing it but I don’t see how the above causes those spaces and I’m doing the same thing here that I’m doing in other templates. Let’s continue our test of the output parameters with the standalone function.
Let’s think about this for a minute:
What happens with my parameters when I try to pass them into a command line function like this? How are my bash commands going to interpret these parameters?
If there are extra spaces it’s not going to recognize all the values I’m passing into my function, right?
Because it makes each value between spaces a separate parameter… so let’s see what happens if we do this:
unary operator expected
What’s on line 7?
The spaces are causing us grief.
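In bash, "unary operator expected" typically shows up when an unquoted variable inside a [ ] test expands to nothing, and unquoted values with spaces arrive as several arguments instead of one. Quoting every expansion avoids both. A guard like the following sketch (hypothetical name) also makes a wrong argument count fail loudly instead of producing a confusing error later.

```shell
# Hypothetical guard: fail early when the number of arguments received
# differs from the number expected.
# $1 is the expected count; the remaining arguments are the ones to check.
require_args() {
  expected="$1"
  shift
  if [ "$#" -ne "$expected" ]; then
    echo "expected $expected argument(s), got $#" >&2
    return 1
  fi
}
```

Inside a function you would call it as `require_args 2 "$@" || return 1`. With `value="a b"`, passing `"$value"` counts as one argument while an unquoted `$value` counts as two.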
I tried a number of other variations I won’t bore you with. It’s related to this same post linked above:
Let’s verify that there are really spaces in that output. I don’t see the spaces here when I query the outputs of this stack and that would break tons of people’s code on AWS so I find it hard to believe that would ever be the problem in the first place:
So perhaps our query will work after all. Just out of curiosity I went back to the console and tried my copy and paste method on other outputs and I did not get the extra spaces. Then I tried on the original parameter again. I’m not getting the same result. OK weird. Anyway let’s see if we can get our parameter as an output.
Test it. This works:
Alright back to our stack. Out of curiosity I’m going to run the stack one more time with no changes because this is so odd. Same error.
Now again, for sanity, I’m going to copy and paste and test the value from the failed stack parameters:
The next thing I did was alter my stack to manually hard code in the name of the output with no spaces. I click on the template designer.
I hardcoded the name I know works in my function above into the part of the template that uses the ImportValue statement. I tried with and without a space. I uploaded and deployed it:
By the way, if you don’t want people manually editing templates in the AWS console you need to limit that access in your IAM Policies.
Now again, for sanity, I deployed my working stacks that use those same templates to check that the problem is not with the template itself, but rather the export parameter not being correctly retrieved from CloudFormation using Fn::ImportValue.
At this point I realize my script that I am sure was working last night is no longer working. I know I ran all the stacks and got no errors prior to adding the functionality in the next post which did not alter this template. It simply uses it. What in the world is going on?
Alright, what’s the difference between this and a script that is working? I’m passing in a variable and resolving the output to a security group. Here’s one of the alternate, working scripts.
Compare that to my failing script:
Do you see a difference? I looked at this code about 100 times before I saw it. Maybe you saw it sooner.
IpProtocol is misaligned.
The error message is 100% misleading and a big time-waster because it’s talking about the export value when that is not at all the problem so I wasted hours on a stupid space.
This is where I wish the error messages would be a bit smarter at giving us the correct problem, instead of trying to force us to use an IDE in the cloud, if that would even help.
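Until the error messages improve, a local check before deploying may catch some of these. The sketch below assumes you have the open source cfn-lint tool installed; if not, it falls back to `aws cloudformation validate-template`, which checks template syntax only. I have not verified that either check flags this specific indentation mistake, so treat it as an extra safety net rather than a fix.

```shell
# Hypothetical helper: check a template before deploying it.
# Uses cfn-lint when it is on the PATH, otherwise asks CloudFormation
# to validate the template. $1 is the template file path.
lint_template() {
  if command -v cfn-lint >/dev/null 2>&1; then
    cfn-lint "$1"
  else
    aws cloudformation validate-template \
      --template-body "file://$1" >/dev/null
  fi
}
```

Running `lint_template template.yaml && echo "looks OK"` before each deploy costs a few seconds and could save hours.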
Fix the problem:
This is why you pay me to write this blog — haha (it’s free). To find the dumb little errors like this and give you working code so you don’t feel the pain.
It’s also why I publish my bugs to this site to help anyone who suffers this fate in other areas of programming if it aligns to whatever I’m working on at the moment.
The oddest thing is that I know I ran the script and checked all the CloudFormation stacks last night and there were no errors. I don’t know. Maybe I was delusional. At any rate, I’m 99% sure this is going to fix the problem, and I’m completing this post as it runs.
All stacks functional prior to my last addition that uses this template (without modification).
Let’s add in the last stack to use the template and test. Yes. It works.
By the way, I’ve been having this strange issue with CloudFormation parameters that I 100% know is not my fault. It happens when I don’t even change the code and goes away when I don’t change the code. I don’t know the source. It may be something on my local machine or AWS. But just so you are aware, sometimes you have to pinpoint if the problem is in your code or somewhere else:
If we could just get rid of these weird errors and get really specific error messages, all the programmers in the universe would be able to work much more efficiently! We’d probably prevent some security bugs too.
© 2nd Sight Lab 2022