Warren Parad

Originally published at Medium

Can’t connect to service running in EC2

One of the most annoying and often challenging tasks is solving the dreaded connection problems with services running in EC2 (in AWS, and with virtual machines in general). Spinning up an EC2 instance in a public or private subnet is easy, but then you want an Application Load Balancer (ALB) to handle TLS termination, plus auto-scaling groups. You want Security Groups (SGs) set up to allow connections only from the public ALB. And then, when everything looks like it should work, there is a TCP timeout.

There is a huge cost associated with running technology on non-serverless infrastructure, even if that tech is open source, and that cost comes from operations and maintenance. You often find yourself in this position when you have selected an open source solution instead of a SaaS one. These solutions may or may not have accounted for running in a container or behind an ALB, but all the same, you will have the same connection problems as if you had built the service yourself.

I want to SSH

The first step in any good investigation is reproduction. Even if everything looks right from the outside and looks correct from the ALB and SG perspective, removing confounding variables and narrowing down the problem is critical. Start by cutting out the ALB and the SGs and proving that the service on the machine is actually running correctly: you can connect to it, you can ping it, and so on. With EC2 these checks are likely blocked by your SG. Pings are blocked by default; if you want to enable them, allow ICMP in the SG.
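If you do want pings while debugging, here is a rough sketch of the ICMP rule, assuming the AWS CLI is installed and configured. The helper name allow_ping, the security group ID, and the CIDR range are placeholders for illustration, not values from any real setup.

```shell
# Allow all ICMP types (including ping) into a security group from one CIDR range.
# allow_ping is a hypothetical helper name.
# $1 = security group ID, $2 = source CIDR (keep it narrow, e.g. your VPC range)
allow_ping() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --ip-permissions "IpProtocol=icmp,FromPort=-1,ToPort=-1,IpRanges=[{CidrIp=$2}]"
}

# Example with placeholder values:
# allow_ping sg-0123456789abcdef0 10.0.0.0/16
```

Remove the rule again once you are done; it is only there to take one variable out of the investigation.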

But SSH still requires access from the public internet, and your instance is likely not accessible there. (Note: If it is accessible there, stop what you are doing and move it to a private subnet immediately. There are zero good reasons to put a machine in the DMZ and allow connections from anywhere; these machines aren’t designed with that kind of exposure in mind.)

You don’t actually want SSH; what you care about is connecting to the machine to verify that the service is running. Assuming we can’t trust the SGs and the ALB connections, running commands from the instance itself is the salient step. Welcome to SSM, or AWS Systems Manager. There are often long guides detailing how to set this up, but in reality it is quite simple 99% of the time. If you have an off-the-shelf AMI, just make sure that the AmazonSSMManagedInstanceCore policy is attached to the role in the instance profile for the instance. When the instance starts up, you should be able to go to the connection console in EC2 and see this:

(Screenshot: EC2 Connection Manager, SSM set up successfully)

Click Connect and move on to the next step. Instead of needing to configure keys, sshd, and so on, you are already done. Now on to the real work of figuring out why your open source service or custom app isn’t working. Remember, this is the real value of using containers, Lambdas, or SaaS: stop worrying about this work and start delivering.
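If you prefer the CLI over the console for the SSM setup, a minimal sketch could look like the following. It assumes the AWS CLI is configured and that your instance profile already references an IAM role. The function names and the role name are hypothetical; only the policy name comes from the setup described above.

```shell
# AWS managed policy that lets the SSM agent on the instance register itself.
SSM_POLICY_ARN="arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"

# Attach the policy to the IAM role behind the instance profile.
# $1 = role name (hypothetical; use the role your instance profile references)
attach_ssm_policy() {
  aws iam attach-role-policy --role-name "$1" --policy-arn "$SSM_POLICY_ARN"
}

# Ask Systems Manager whether the instance has registered.
# $1 = EC2 instance ID; prints the reported PingStatus (for example Online).
ssm_ping_status() {
  aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$1" \
    --query 'InstanceInformationList[0].PingStatus' \
    --output text
}

# Example with placeholder values:
# attach_ssm_policy my-service-instance-role
# ssm_ping_status i-0123456789abcdef0
```

If the status does not come back as Online, the usual suspects are a missing policy or an instance that cannot reach the SSM endpoints from its subnet.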

Container Services

If you have graduated to partial serverless with something like ECS (or, unfortunately, EKS) with Fargate, then there is no instance to connect to. But that doesn’t mean everything is working. Setting up container logging goes a long way, but it isn’t the end of the story if requests are still timing out.

You can actually pull the same trick with containers by using SSM from the CLI (a similar command exists for EC2):

aws ssm start-session --target="ecs:${clusterName}_${taskId}_${containerRuntimeId}"

This gets you into the container, where you can start verifying that everything is running as expected.
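Here is a hedged sketch of how the pieces of that target string could be assembled, along with the EC2 variant. It assumes the AWS CLI and the Session Manager plugin are installed locally, and for Fargate it typically also needs ECS Exec enabled on the task. All function names are made up for illustration, and the lookup only takes the first container in the task.

```shell
# Build the SSM target string for a container, in the format used above.
# $1 = cluster name, $2 = task ID, $3 = container runtime ID
ecs_ssm_target() {
  printf 'ecs:%s_%s_%s\n' "$1" "$2" "$3"
}

# Look up the runtime ID of the first container in a task.
# $1 = cluster name, $2 = task ID
container_runtime_id() {
  aws ecs describe-tasks --cluster "$1" --tasks "$2" \
    --query 'tasks[0].containers[0].runtimeId' --output text
}

# Open a session on the first container of a task.
# $1 = cluster name, $2 = task ID
ecs_session() {
  aws ssm start-session \
    --target "$(ecs_ssm_target "$1" "$2" "$(container_runtime_id "$1" "$2")")"
}

# EC2 variant: the target is just the instance ID.
# $1 = EC2 instance ID
ec2_session() {
  aws ssm start-session --target "$1"
}
```

Calling ecs_session with your cluster name and task ID saves you from copying the runtime ID out of the console by hand.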

Connected: Now what?

Now that you are connected to the instance, continue debugging where you left off.

Health check failing: run curl http://localhost:port

Need some metrics: htop, netcat (and there are hundreds more)

Service not doing what it should: curl http://localhost:port/api/path
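The curl checks above can be wrapped into a small helper so that a timeout and an error response are easy to tell apart. This is a minimal sketch to run on the instance itself; the function names are hypothetical, and the port and path in the example are placeholders for whatever your service exposes.

```shell
# Probe a URL and print only the HTTP status code.
# curl prints 000 when no HTTP response arrived (connection refused, timeout).
# $1 = URL
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Turn a status code into a short verdict.
# $1 = status code as printed by http_status
describe_status() {
  case "$1" in
    000) echo "no response: is the service listening on this port?" ;;
    2??) echo "healthy ($1)" ;;
    *)   echo "responding, but with status $1" ;;
  esac
}

# Example with a placeholder port and path:
# describe_status "$(http_status http://localhost:8080/health)"
```

A 000 here means the problem is on the machine, not in the ALB or the SGs, which is exactly the distinction you are trying to make at this point.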

Verify to yourself that the service is doing exactly what it needs to. Validating this from your local machine means going through your local network, geocache, VPC, ALB, SGs, and ASG, which is a long list of places where there can be a problem.

You may think, “I ran this exact same stuff on my machine and it works,” but if you aren’t testing directly at the source, the result isn’t useful. It is different tech, a different stack, a different location. Once everything works on the instance, you can move on to verifying the SGs and the ALB.
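For that next step, a rough sketch of what checking the ALB and SGs from the CLI might look like, assuming the AWS CLI is configured. The function names are hypothetical, and the ARN and security group ID in the example are placeholders.

```shell
# List each target behind the ALB with its health state and failure reason.
# $1 = target group ARN
target_health() {
  aws elbv2 describe-target-health --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
    --output text
}

# Show the inbound rules of a security group, to confirm the ALB's SG is allowed in.
# $1 = security group ID
sg_ingress() {
  aws ec2 describe-security-groups --group-ids "$1" \
    --query 'SecurityGroups[0].IpPermissions' --output json
}

# Example with placeholder values:
# target_health arn:aws:elasticloadbalancing:...
# sg_ingress sg-0123456789abcdef0
```

If the service answers locally but the target still shows as unhealthy, the problem is almost certainly between the ALB and the instance: the health check port or path, or an SG rule.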

TLS termination when using open source containers/services is frequently a source of consternation. It is one of the reasons the TCO of building with open source is so high, and why so many solutions just make sense as SaaS.
