Problems I Faced and How I Solved Them on ECS
Deploying a multi-service application like Taiga on ECS was not without challenges. Here are some of the issues I encountered and how I resolved them:
1. Tasks Failing ECS Health Checks
Initially, some backend and frontend tasks repeatedly failed their ECS container-level health checks, causing ECS to mark them as unhealthy and restart the tasks. By inspecting CloudWatch Logs and ECS task events, I realized that the containers were not fully initialized when the health check requests hit them.
Solution: I added a startup grace period (startPeriod) to the container-level health check in the task definition, giving containers enough time to become ready before failed checks counted against them. Here’s the key part of the configuration in Terraform:
healthCheck = {
  command     = ["CMD-SHELL", "curl -f http://localhost:8000/ || exit 1"]
  interval    = 30 # seconds between health check attempts
  retries     = 3  # consecutive failures before marking the container unhealthy
  startPeriod = 60 # grace period for container startup before failures count
  timeout     = 5  # seconds before a single check times out
}
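For context, this block sits inside the container definition, which Terraform serializes with jsonencode() — which is why the keys must be the camelCase names ECS expects rather than Terraform's usual snake_case. Here is a rough sketch of the surrounding resource from my setup; the resource, family, and sizing values below are placeholders:

```hcl
resource "aws_ecs_task_definition" "backend" {
  family                   = "taiga-backend" # placeholder name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  execution_role_arn       = aws_iam_role.task_execution.arn # placeholder role

  container_definitions = jsonencode([
    {
      name      = "backend"
      image     = aws_ecr_repository.backend.repository_url # placeholder image
      essential = true

      # Container-level health check; startPeriod is the startup grace
      # period during which failed checks are not counted.
      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:8000/ || exit 1"]
        interval    = 30
        retries     = 3
        startPeriod = 60
        timeout     = 5
      }
    }
  ])
}
```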
2. IAM Permission Issues with ECS Exec
While enabling ECS Exec to interact with containers directly, some tasks failed with permission errors. ECS Exec requires specific IAM policies for both the task role and the user invoking exec. Without these permissions, attempts to run commands inside containers would fail.
Solution: I updated the task role to include the necessary permissions for ECS Exec. Specifically, the role needed access to the SSM messages channels that ECS uses under the hood:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ECSExec",
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
I also ensured that my IAM user had ecs:ExecuteCommand permission to invoke ECS Exec. After applying these changes, I was able to run commands inside containers seamlessly, inspect logs in real-time, and troubleshoot tasks interactively.
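With both sides in place, opening a shell in a running container looks roughly like this; the cluster name, container name, and task ID below are placeholders for my setup (note that the ECS service must also have execute command enabled):

```shell
aws ecs execute-command \
  --cluster taiga-cluster \
  --task <task-id> \
  --container backend \
  --interactive \
  --command "/bin/sh"
```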
3. Environment Variables and Secrets Not Loading
Some containers failed at startup because required environment variables and secrets were unavailable at runtime. CloudWatch Logs indicated missing configuration rather than application errors.
Solution: I reviewed the ECS task definitions and verified that all required environment variables were explicitly defined per container. For sensitive values, I passed secrets using AWS Secrets Manager and ensured the task execution role had permission to retrieve them at startup (secretsmanager:GetSecretValue).
Once the execution role permissions and variable mappings were corrected, the containers started reliably and remained stable across deployments.
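As a sketch of the two pieces involved (the secret resource and environment variable names below are placeholders): the container definition maps each secret to an environment variable via a secrets entry, and the execution role is explicitly allowed to read it.

```hcl
# Inside the container definition (within jsonencode):
# each entry becomes an env var populated from Secrets Manager at task start.
secrets = [
  {
    name      = "POSTGRES_PASSWORD"              # env var name seen by the app
    valueFrom = aws_secretsmanager_secret.db.arn # placeholder secret resource
  }
]
```

```hcl
# Execution role policy letting ECS fetch that secret at startup.
resource "aws_iam_role_policy" "read_secrets" {
  name = "read-taiga-secrets" # placeholder
  role = aws_iam_role.task_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["secretsmanager:GetSecretValue"]
      Resource = [aws_secretsmanager_secret.db.arn]
    }]
  })
}
```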
4. Unable to Pull Images from Amazon ECR
During initial deployments, some tasks failed immediately with errors indicating that the container image could not be pulled from Amazon ECR. ECS task events showed image pull failures, even though the images existed in the repository.
Solution: I verified that the task execution role had the required permissions to authenticate and pull images from ECR, including ecr:GetAuthorizationToken, ecr:BatchGetImage, and ecr:GetDownloadUrlForLayer. I also confirmed that the ECS tasks were running in private subnets with outbound internet access via a NAT Gateway, allowing them to reach the ECR endpoints.
After correcting the execution role permissions and validating network egress, the tasks were able to pull images successfully and start without issues.
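In Terraform, the simplest way to grant those ECR permissions is to attach AWS's managed AmazonECSTaskExecutionRolePolicy, which bundles the ECR pull actions together with the CloudWatch Logs permissions the execution role typically needs anyway; the role resource name below is a placeholder:

```hcl
resource "aws_iam_role_policy_attachment" "task_execution" {
  role       = aws_iam_role.task_execution.name # placeholder role resource
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
```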