loading...
Heroku

Using the Scientific Method to Debug Containerized Applications

mbogan profile image Michael Bogan ・13 min read

Ah yes—“It works fine on my machine!” Perhaps the most famous saying in our industry. Even with the advent of containers that provide consistent environments across the SDLC, we _still _hear developers fall back to this claim when a defect is found. But in the end, if the code doesn't work in test or production, it doesn't work—even if it works locally. So as a developer, being able to deep dive into your containerized application to fix the problem—regardless of the environment—is a critical skill we must all learn.

Let's look at a sample app that works locally, but fails when deployed, and how you might use the scientific method (you remember that from school, right?) to debug these failures in containerized apps.

Our Example

For our example, we'll use a simple Flask application—deployed in a Kubernetes container—that displays the famous people that have used our app. It works flawlessly on our local system!

method1

Now let's deploy this to our Heroku service and see if it’ll behave the same (hint: it won't).

 heroku container:push worker -a heroku-troubleshooting
 heroku container:release worker -a heroku-troubleshooting
Enter fullscreen mode Exit fullscreen mode

method2

Drats! Foiled by the production environment powers that be. Now we can start digging into what happened and see what tools we have at our disposal.

Our Approach

When your app fails, it's helpful to take an unbiased and objective approach to triage the issue and understand the faults. One such classical approach is the Scientific Method. Harnessed through hundreds of years of rigorous repetition and review, I've found that this method works great for modern debugging.

Too often developers and DevOps engineers take a "let's see what this does" scattered approach to debugging. We grab the first thing that comes to mind that _might _be wrong, change our code/config around, deploy, and see if that worked. And if that doesn't work, we just keep trying, sometimes forgetting what we've already tried, sometimes spending hours or even days going down a path that if we just stopped and thought about it for a minute, clearly is the wrong path.

But if we use a more reasoned, structured approach - like the scientific method - our debugging process could be more productive, faster, and in the end, better.

The scientific method, at its most basic, involves forming a hypothesis, conducting experiments, observing the results, and then refining your hypothesis. The scientific method as I apply it to debugging containerized apps looks like this:

  1. Ask "What should be happening?"
  2. Determine reasons that might explain why it's not happening
  3. Design a test to validate/invalidate those claims by tuning a config or using a tool
  4. Observe the outcome
  5. Draw a conclusion
  6. Repeat until the outcome is what should be happening

The Scientific Method in Action

Now let's see that approach in action.

1. Ask "What should be happening?"

First, we need to understand what our application is trying to do. To start investigating, we need to understand the details of the page rendering. Let’s see how the Dockerfile describes our service:

          FROM python:3

WORKDIR /usr/src/app

COPY requirements.txt ./

RUN pip install --no-cache-dir -r requirements.txt

RUN echo "healthy" >> /tmp/healthy && \

        cat /tmp/healthy

ENV FLASK_ENV=development

ENV FLASK_APP=myapp.py

COPY myapp.py .

CMD [ "python3", "myapp.py" ]

Enter fullscreen mode Exit fullscreen mode

Pretty simple. We have a Python3 application, it installs some requirements, and then starts myapp.py. Let’s take a quick look at what our Python application is doing and then we can decide how to best proceed.

from flask import Flask, Response

import os

import psycopg2

app = Flask(__name__)

query = ("select * from actor order by random() limit 5;")

host = os.getenv("host")

password = os.getenv("PSQL_PASSWORD")

user = os.getenv("PSQL_USERNAME")

dbname = os.getenv("PSQL_DBNAME")

print(host)

print(user)

print(dbname)

def callpsql():

    conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host, port="5432")

    cur = conn.cursor()

    cur.execute(query)

    results = cur.fetchall()

    cur.close()

    conn.close()

    return results

@app.route('/')

def index():

    response = callpsql()

    print("Getting some juicy logs")

    results = "I've sold to these famous people from server "+os.getenv("DYNO")+": <br>\n"

    for row in response:

        results = results+"<br>\n"+str(row[1])+"\t"+str(row[2])

    return results

@app.route('/health')

def health():

    f = open("/tmp/healthy", "r")

    print(f)

    health = f.readline().rstrip()

    resp = Response(health)

    if health != "healthy":

        resp.status=500

    resp.headers['health-check'] = health

    return resp

if __name__ == "__main__":

    port = 5000

    app.run(debug=True, host='0.0.0.0', port=port)

Enter fullscreen mode Exit fullscreen mode

Another pretty simple application. This takes a few environment variables for authentication and executes a SQL query against a Postgres database when the root page is called.

2. Determine reasons that might explain why it's not happening

Now that we know what the application is trying to do, we move to the next step—explain why it's not happening in our new environment.

Here we can look at the Heroku logs --tail command that’s suggested on the error page. This should give us a good idea of what our service is doing at any point in time and also help us form a hypothesis as to what might be the problem.

$ heroku logs --tail -a heroku-troubleshooting

...

2020-08-22T22:45:34.805282+00:00 heroku[router]: at=error code=H14 desc="No web processes running" method=GET path="/" host=heroku-troubleshooting.herokuapp.com request_id=09292b76-b372-49e9-858b-e8611c020fd5 fwd="xx.xx.xx.xx" dyno= connect= service= status=503 bytes= protocol=https

…

Enter fullscreen mode Exit fullscreen mode

We see the error “No web processes running” appear every time we load the page. It seems as though Heroku doesn’t know that we’ve started Flask. Great! We have our hypothesis - "Flask probably isn't running in Heroku's environment."

3. Design a test to validate/invalidate those claims by tuning a config or using a tool

Now let's move to the next steps: create a test to validate what we think is happening, observe the outcome, and draw a conclusion.

So let's test our hypothesis - we'll hop onto our dyno and validate that the Flask service is (or is not) actually running. Our Dockerfile specifies python should be a running, listening service.

$ heroku ps:exec -a heroku-troubleshooting

Establishing credentials... error

 ▸    Could not connect to dyno!

 ▸    Check if the dyno is running with `heroku ps'
Enter fullscreen mode Exit fullscreen mode

4. Observe the outcome

We released our container, so it should be running. However, the logs from out test have shown us that we couldn’t connect to the dyno that should be running the code. Executing the suggested command may give us another clue.

$ heroku ps -a heroku-troubleshooting
No dynos on ⬢ heroku-troubleshooting
Enter fullscreen mode Exit fullscreen mode

5. Draw a conclusion

Well, this is embarrassing. The scientific method conclusion? We were wrong in our initial assertion that our released code left us with a running dyno to serve the code. We definitely pushed and released our image but we didn’t add any dynos to our application. It appears we have validated that the application is not running and that our hypothesis was incorrect.

6. Repeat until the outcome is what should be happening

So we move to the next step, which is repeat the process—and we'll continue to repeat this process throughout this article until our problem is solved.

So let's form a new hypothesis and test it. Our new hypothesis, based on the last test, is that with the dyno running, we should be able to reach our Flask application from the outside world. We'll scale a dyno up to start the application and check the logs again to test that assertion.

$ heroku dyno:scale -a heroku-troubleshooting worker=1

Scaling dynos... done, now running worker at 1:Standard-1X

$ heroku logs --tail -a heroku-troubleshooting

2020-08-22T23:29:05.207217+00:00 app[api]: Scaled to worker@1:Standard-1X by user

2020-08-22T23:29:18.202160+00:00 heroku[worker.1]: Starting process with command `python3 myapp.py`

2020-08-22T23:29:18.760500+00:00 heroku[worker.1]: State changed from starting to up

2020-08-22T23:29:21.559386+00:00 app[worker.1]: None

2020-08-22T23:29:21.559547+00:00 app[worker.1]: None

2020-08-22T23:29:21.559548+00:00 app[worker.1]: None

2020-08-22T23:29:21.559549+00:00 app[worker.1]: * Serving Flask app "myapp" (lazy loading)

2020-08-22T23:29:21.559556+00:00 app[worker.1]: * Environment: development

2020-08-22T23:29:21.559556+00:00 app[worker.1]: * Debug mode: on

2020-08-22T23:29:21.636027+00:00 app[worker.1]: * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

2020-08-22T23:29:21.637965+00:00 app[worker.1]: * Restarting with stat

2020-08-22T23:29:21.897821+00:00 app[worker.1]: * Debugger is active!

2020-08-22T23:29:21.929543+00:00 app[worker.1]: * Debugger PIN: 485-744-571
Enter fullscreen mode Exit fullscreen mode

There we go. Now the application has started! It's time to test the site again to observe if it’s working.

method3

Our conclusion now, is it’s definitely not. We should go over our commands again and see if the status has changed at all so that we can create a new hypothesis.

$ heroku logs -a heroku-troubleshooting
2020-08-22T23:32:12.098568+00:00 heroku[router]: at=error code=H14 desc="No web processes running" method=GET path="/" host=heroku-troubleshooting.herokuapp.com request_id=4d3cbafa-cb31-4452-bfc6-7ee18ce72eb0 fwd="xx.xx.xx.xx" dyno= connect= service= status=503 bytes= protocol=https
$ heroku ps -a heroku-troubleshooting
=== worker (Standard-1X): python3 myapp.py (1)
worker.1: up 2020/08/22 16:31:59 -0700 (~ 3m ago)
Enter fullscreen mode Exit fullscreen mode

It looks like we’re running a service but we can’t get any more data using only the logs. Hypothetically we can get into the service and make sure it’s working.

$ heroku ps:exec -d worker.1 -a heroku-troubleshooting

Establishing credentials... error

 ▸    Could not connect to dyno!

 ▸    Check if the dyno is running with `heroku ps'
Enter fullscreen mode Exit fullscreen mode

Here we can observe that even with the dyno running, we can't connect to it. Looking back at our Dockerfile and some documentation, we can conclude that there are a few requisites needed for the Heroku environment to support exec into the deployed containers. We should add a few utility commands we might need while we’re in here to test future hypotheses while we test if this new configuration will allow us to connect.


...

RUN apt-get update && \

        apt-get install -y \

                curl \

                openssh-server \

                           net-tools \

                           dnsutils \

                           iproute2

RUN mkdir -p /app/.profile.d && \

        touch /app/.profile.d/heroku-exec.sh && \

        echo '[ -z "$SSH_CLIENT" ] && source <(curl --fail --retry 3 -sSL "$HEROKU_EXEC_URL")' >> /app/.profile.d/heroku-exec.sh && \

        chmod +x /app/.profile.d/heroku-exec.sh

...

Enter fullscreen mode Exit fullscreen mode

One more push and release later and we've validated our conclusion was correct and we can finally connect:

$ heroku ps:exec -d worker.1 -a heroku-troubleshooting

Establishing credentials... done

Connecting to worker.1 on ⬢ heroku-troubleshooting...

The programs included with the Debian GNU/Linux system are free software;

the exact distribution terms for each program are described in the

individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent

permitted by applicable law.

~ $

Enter fullscreen mode Exit fullscreen mode

Now that we’re in the application, we can check our hypothesis that the port is listening as it should be and that our application is responsive from within the container itself. We can utilize “net-tools” netstat command as a test to show us a list of ports that are open on our container and listening for calls to open a socket.

~ $ netstat -apn | grep 5000 | grep LIST

(Not all processes could be identified, non-owned process info

 will not be shown, you would have to be root to see it all.)

tcp        0      0 0.0.0.0:5000            0.0.0.0:*               LISTEN      3/python3

Enter fullscreen mode Exit fullscreen mode

Our analysis shows that the port is open as expected and our hypothesis is correct.

Because nothing has changed, we must design a new test. “Curl” can act as our command-line internet browser to show us a simplified view into the port we now verified is open and listening properly.

~ $ curl -I localhost:5000

HTTP/1.0 500 INTERNAL SERVER ERROR

Content-Type: text/html; charset=utf-8

X-XSS-Protection: 0

Server: Werkzeug/1.0.1 Python/3.8.5

Date: Sun, 23 Aug 2020 00:27:14 GMT

~ $ curl localhost:5000

...

psycopg2.OperationalError: could not connect to server: No such file or directory

        Is the server running locally and accepting

        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?

...

Enter fullscreen mode Exit fullscreen mode

Ah-ha! Our analysis of this error proves that the calls are failing because we’re getting a 500 error in our application. Our new hypothesis is that this is surely at least _one _reason why Heroku won’t send traffic to our application. If we look back at our Python code, we can see that it requires some environment variables. In our local host we have them exported as part of a development virtual environment. We can now conclude these need to be added to our application in Heroku as well.

Our new hypothesis becomes "The service should work as expected in both the local and production environments."

To test this hypothesis and pick up the new change, we'll need to restart our worker which should let the application pick up the new environment variables.

$ heroku ps:restart worker.1 -a heroku-troubleshooting

Restarting worker.1 dyno on ⬢ heroku-troubleshooting... done

$ heroku ps:exec -d worker.1 -a heroku-troubleshooting

Establishing credentials... done

Connecting to worker.1 on ⬢ heroku-troubleshooting...

The programs included with the Debian GNU/Linux system are free software;

the exact distribution terms for each program are described in the

individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent

permitted by applicable law.

~ $ curl -I localhost:5000

HTTP/1.0 200 OK

Content-Type: text/html; charset=utf-8

Content-Length: 148

Server: Werkzeug/1.0.1 Python/3.8.5

Date: Sun, 23 Aug 2020 00:50:35 GMT

~ $ curl localhost:5000

I've sold to these famous people from server worker.1: <br>

<br>

Julianne        Dench<br>

Sylvester       Dern<br>

Greta   Keitel<br>

Dustin  Tautou<br>

Penelope        Monroe

Enter fullscreen mode Exit fullscreen mode

We’ve now validated that the local container is working as we expect after re-running our previous tests. Still, our logs and browser conclude that the app is still not functioning as expected.

Logs from our internal curls

2020-08-23T00:50:35.931678+00:00 app[worker.1]: 127.0.0.1 - - [23/Aug/2020 00:50:35] "[37mHEAD / HTTP/1.1[0m" 200 -
2020-08-23T00:50:41.807629+00:00 app[worker.1]: 127.0.0.1 - - [23/Aug/2020 00:50:41] "[37mGET / HTTP/1.1[0m" 200 -
Enter fullscreen mode Exit fullscreen mode

Logs from our external browser

2020-08-23T00:50:47.151656+00:00 heroku[router]: at=error code=H14 desc="No web processes running" method=GET path="/" host=heroku-troubleshooting.herokuapp.com request_id=a9421f09-94f2-4822-9608-f2f63ffe5123 fwd="68.251.62.251" dyno= connect= service= status=503 bytes= protocol=https
Enter fullscreen mode Exit fullscreen mode

method4

With the application working in the container, it seems that everything should be working fine, hypothetically. Utilizing our prior re-run of tests and taking a closer analysis of the error “No web processes running” gives us another hint. There are two types of dynos, and we’ve deployed this as a “worker” dyno. Looking at the documentation, we can conclude that perhaps we’ve deployed to the wrong resource type that does not get an ingress. Hypothetically Heroku should provide a dynamic external port if we configure the dyno as "web" instead of "worker". Time to test redeploying as a “web” dyno.

heroku ps:scale worker=0 -a heroku-troubleshooting

heroku container:push web -a heroku-troubleshooting

heroku container:release web -a heroku-troubleshooting

heroku ps:scale web=1 -a heroku-troubleshooting

Enter fullscreen mode Exit fullscreen mode

Browsing to the site still gives the same conclusion as our initial tests, but now our observed logs are giving us our next thread to investigate.

2020-08-23T01:03:12.890612+00:00 app[web.1]:  * Serving Flask app "myapp" (lazy loading)

2020-08-23T01:03:12.890619+00:00 app[web.1]:  * Environment: development

2020-08-23T01:03:12.890620+00:00 app[web.1]:  * Debug mode: on

2020-08-23T01:03:12.905580+00:00 app[web.1]:  * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

2020-08-23T01:03:12.906860+00:00 app[web.1]:  * Restarting with stat

2020-08-23T01:03:13.661451+00:00 app[web.1]:  * Debugger is active!

2020-08-23T01:03:13.662395+00:00 app[web.1]:  * Debugger PIN: 196-723-097

2020-08-23T01:04:07.990856+00:00 heroku[web.1]: Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch

2020-08-23T01:04:08.016515+00:00 heroku[web.1]: Stopping process with SIGKILL

2020-08-23T01:04:08.224860+00:00 heroku[web.1]: Process exited with status 137

2020-08-23T01:04:08.290383+00:00 heroku[web.1]: State changed from starting to crashed

2020-08-23T01:04:08.293139+00:00 heroku[web.1]: State changed from crashed to starting

“Web process failed to bind to $PORT”

Enter fullscreen mode Exit fullscreen mode

We can conclude that we need to bind our process to a variable rather than our default Flask port. This is simple enough to fix in our Python application. If we have the correct variable mapped in our production environment, and our application is utilizing that as its port configuration, then we should have a fully functioning site. To test this we can pass in the environment variable from the host and use it in our app.run argument.

myapp.py

if __name__ == "__main__":

    port = int(os.getenv("PORT", 5000))

    app.run(debug=True, host='0.0.0.0', port=port)
Enter fullscreen mode Exit fullscreen mode

method5

And there you have it! Our analysis of the test shows that the app works on both locally and when deployed. One final validation to ensure that our changes continue to work in our development environment as well.

method6

Putting it Together

In this workflow we used the scientific method to investigate an application that was failing to render in production. Debugging a managed infrastructure is a daunting task, but only because of the sum of its parts. By breaking down each layer of the product, reviewing how it behaves with standardized tooling, taking clues from the outputs of the logs and application outputs, and reviewing various product documentation, we can break down these complex structures to meaningful and simple conclusions that will resurrect any service.

If you want to further analyze the app that we deployed, you can see it here. Next time you find yourself investigating a difficult problem, remember to take each component one layer at a time and apply methodical, scientific, approaches to the process. By doing this you can remove much of the personal bias that would otherwise potentially derail you into a wild goose chase of StackOverflow articles and random community forums.

Discussion

pic
Editor guide