DEV Community: Simon Bracegirdle

React anti-patterns that lead to unnecessary complexity

Simon Bracegirdle — Fri, 23 Feb 2024 10:41:16 +0000

As called out by the legend of the craft, Grug, complexity can be the bane of your existence as a software developer. Unnecessary complexity leads to code that is hard to understand and reason about, and makes it easy to introduce bugs.

I have been doing React long enough to know that it's not exempt from having complex, hard to read code. Whether it's old school Redux, class components, or newer hooks and server components, complexity can creep in at any point if we're not careful.

What patterns can we look out for that might flag that a problem is ahead? In this post I'll cover what I think are some common anti-patterns and indicators that your React code might be more complex than it needs to be.

Anti-pattern 1 — Unnecessary effects

The React paradigm is all about writing reactive code — code that produces output (rendered elements) in response to input (props, state). useEffect allows us to do some side action that doesn't directly impact the rendered output. This could be updating the window title when a certain prop changes, or focusing an input field on first render.

It's an escape hatch, so we need to be cautious with how we use it to avoid issues. Let's look at an example of how that can happen:

function BadUseEffectComponent() {
  const {loading, error, data} = useQuery(GET_DATA);
  const [records, setRecords] = useState([]);

  useEffect(() => {
    if (data) {
      setRecords(data.records);
    }
  }, [data]);

  const handleEdit = (id, newValue) => {
    setRecords(records.map(record => 
      record.id === id ? { ...record, value: newValue } : record
    ));
  };

  if (loading) return <p>Loading...</p>;
  if (error) return <p>Error :(</p>;

  return records.map(({ id, value }) => (
    <div key={id}>
      <input 
        type="text" 
        value={value} 
        onChange={e => handleEdit(id, e.target.value)} 
      />
    </div>
  ));
}

Here we have a component that queries data from a GraphQL API (useQuery), and then uses an effect to copy that data into state (records). When the user edits a record (input onChange), we override the state value for that data record (handleEdit).

I can understand why people want to do this; they want a single variable containing the values they're going to render, it's a model that makes sense.

But the presence of the useEffect here can make the code harder to read, because we have to understand the conditions under which the effect fires and the flow-on effect it has on state and rendering. Oversights in following this logic can lead to bugs, of which I have experienced too many.

Returning to the code above, if the query were to run again, such as due to props changing, the effect could fire and override the user's edited data! The use of an effect to copy data into state has created a bug.

Here's how we could re-write the code without an effect:

function BetterComponent() {
  const {loading, error, data} = useQuery(GET_DATA);
  const [editedRecords, setEditedRecords] = useState({});

  const handleEdit = (id, newValue) => {
    setEditedRecords({ ...editedRecords, [id]: newValue });
  };

  if (loading) return <p>Loading...</p>;
  if (error) return <p>Error :(</p>;

  return data.records.map(({ id, value }) => (
    <div key={id}>
      <input 
        type="text" 
        value={editedRecords[id] ?? value} 
        onChange={e => handleEdit(id, e.target.value)} 
      />
    </div>
  ));
}

This time we still have state to hold the user's edited values, but we do not copy the back-end data into state. We can combine the back-end data with the user state in the render body. This is also easier to test because our function has fewer effects. If the back-end query is re-run, we retain our unsaved edited values, removing a critical bug.
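To make the "combine in the render body" idea concrete, here is a minimal sketch in plain JavaScript with no React involved. The helper name applyEdits and the sample data are assumptions made up for illustration; they are not part of the component above.

```javascript
// Hypothetical helper: overlay the user's unsaved edits on back-end records.
function applyEdits(records, editedRecords) {
  return records.map((record) => ({
    ...record,
    // ?? (rather than ||) keeps an edit that is an empty string
    value: editedRecords[record.id] ?? record.value,
  }));
}

// Made-up sample data: the user edited record 2, then the query re-ran.
const edits = { 2: 'my edit' };
const refetched = [
  { id: 1, value: 'server v2' },
  { id: 2, value: 'server v2' },
];

console.log(applyEdits(refetched, edits));
// Record 1 shows the fresh server value; record 2 keeps 'my edit'.
```

Because the merge is recomputed from both sources each time, a refetch cannot overwrite the edits.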

My recommendation here is to avoid useEffect as much as possible. In general, don't use it to set derived state, and don't use it to do mapping. Instead of using it to fetch back-end data, look at a robust query library that provides hooks like React Query, SWR, or Apollo Client.

Sometimes useEffect is necessary, but consider it a last resort when other options aren't possible.

Extreme variant — effect chain hell

To take the above to the extreme, chained effects with interdependencies can combine to create the ultimate in complexity hell:

function ThisIsHell({ propA, propB }) {
  const { data, loading, error } = useQuery(SOME_QUERY);
  const [state, setState] = useState(null);
  const [derivedState, setDerivedState] = useState(null);
  const [finalState, setFinalState] = useState(null);

  // First useEffect based on Apollo query result
  useEffect(() => {
    if (!loading && data) {
      setState(data.someField);
    }
  }, [data, loading]);

  // Second useEffect based on the state set by the first useEffect
  useEffect(() => {
    if (state) {
      setDerivedState(`Derived: ${state}`);
    }
  }, [state]);

  // Third useEffect based on the state set by the second useEffect and propA
  useEffect(() => {
    if (derivedState && propA) {
      setFinalState(`${derivedState} and ${propA}`);
    }
  }, [derivedState, propA]);

  // Fourth useEffect based on the state set by the third useEffect and propB
  useEffect(() => {
    if (finalState && propB) {
      console.log(`Final state: ${finalState} and ${propB}`);
    }
  }, [finalState, propB]);

  if (loading) return <p>Loading...</p>;
  if (error) return <p>Error :(</p>;

  return <div>{finalState}</div>;
}

The above is a contrived example, but it represents a real world problem. The effects are partially dependent on each other, creating spaghetti code that is difficult to follow. Code like this is going to be hard to understand, hard to test, and riddled with bugs.

I think this can be the result of overcomplicating the problem space in our head, which is easy to do when we're solving a non-trivial problem. A useful idea here might be to take a step away from the code, return to it fresh and look for alternative designs that lead to simpler code — can we break up the components in a way that avoids the effects? Can we move logic from front-end to back-end that avoids the problem? Can we simplify the data model somehow?

Anti-pattern 2 — Unnecessary state

State is an important concept in React, allowing us to hold values entered by the user before we're ready to send them to the back-end for persistence. But, a common issue is accidental misuse. Let's look at an example of that:

function UnnecessaryState() {
  const [value1, setValue1] = useState('');
  const [value2, setValue2] = useState('');
  const [sum, setSum] = useState(0);

  const handleValue1Change = (e) => {
    setValue1(e.target.value);
    setSum(parseInt(e.target.value) + parseInt(value2));
  };

  const handleValue2Change = (e) => {
    setValue2(e.target.value);
    setSum(parseInt(value1) + parseInt(e.target.value));
  };

  return (
    <div>
      <input type="number" value={value1} onChange={handleValue1Change} />
      <input type="number" value={value2} onChange={handleValue2Change} />
      <p>The sum is: {sum}</p>
    </div>
  );
}

Here we have two state values, which change when the user updates the two number inputs. We also have a sum state, which updates when either of the two values change. Then we show the sum below the two inputs.

But we don't need to put sum in state at all, since we can calculate it on the fly in our render:

function SumInBody() {
  const [value1, setValue1] = useState('');
  const [value2, setValue2] = useState('');

  const handleValue1Change = (e) => {
    setValue1(e.target.value);
  };

  const handleValue2Change = (e) => {
    setValue2(e.target.value);
  };

  return (
    <div>
      <input type="number" value={value1} onChange={handleValue1Change} />
      <input type="number" value={value2} onChange={handleValue2Change} />
      <p>The sum is: {Number(value1) + Number(value2)}</p>
    </div>
  );
}

Again this is a contrived example, but as components get complex it's easy for this pattern to creep into code and cause issues. For example, what if we add a third number value and forget to update the sum state in that change handler? Putting data in state unnecessarily opens up our code to bugs, especially if another engineer needs to make changes later on.

In general we don't need to put derived data in state, we should prefer to use simple inline statements, or move the mapping logic into a separate function that we call from our component:

function sum(value1, value2) {
  // State holds the input values as strings, so convert before adding
  return Number(value1) + Number(value2);
}

// ...

<p>The sum is: {sum(value1, value2)}</p>

sum is now easier to test since it's a pure function that returns a value based on some input, without any side effects.
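As a rough illustration of how little a test for such a function needs, here is a sketch that runs in plain Node. It carries its own copy of the function, named sumValues here to mark it as a hypothetical variant, and assumes the values arrive as strings, as they do from number inputs.

```javascript
// Hypothetical variant of sum that accepts the string values held in state.
function sumValues(value1, value2) {
  // Number('') is 0, so an empty input counts as zero
  return Number(value1) + Number(value2);
}

// Plain calls: no React rendering, no mocking.
console.log(sumValues('2', '3')); // 5
console.log(sumValues('', '3'));  // 3
```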

If we're concerned about performance, we can memoise sum to make it efficient, but as we'll discuss in the next section, we should be hesitant to do that.

Side note — prefer state in URL

When you do need to use state for holding values the user has entered, it's often a good idea to hold that value in the URL query parameters, instead of using plain useState. The reason for this is the user can then share the link with colleagues, friends, or your technical support in case they encounter an issue. The URL they share conveniently holds the state of their page, which someone else can then reproduce.

An example of this could be to hold the searchTerm in the URL, after the user has typed in a search query. The code below achieves that by using the React library use-query-params, which provides some useful hooks for putting state in query parameters:

function MySearchComponent() {
  const [searchTerm, setSearchTerm] = useQueryParam('searchTerm', StringParam);

  const handleChange = (event) => {
    setSearchTerm(event.target.value);
  };

  return (
    <div>
      <input type="text" value={searchTerm || ''} onChange={handleChange} />
    </div>
  );
}

Anti-pattern 3 — Premature memoisation

Memoisation is a powerful tool that builds on the idea of caching to prevent re-runs of a function unless a given set of dependencies change. If they don't change then React returns a cached value instead, potentially saving computation.

But, I think we are overusing this tool in the React community — we should start by putting the computation in the component body, we can add memoisation later when it's needed.

Let's look at an example of premature memoisation:

function PrematureMemo() {
  const {loading, error, data} = useQuery(GET_DATA);

  const myData = useMemo(() => 
    data?.data?.map(item => ({
      ...item,
      value: item.value * 2,
    }))
  , [data]);

  if (loading) return <p>Loading...</p>;
  if (error) return <p>Error :(</p>;

  return (
    <div>
      {myData?.map(item => (
        <p key={item.id}>Data: {item.value}</p>
      ))}
    </div>
  );
}

In this component, we have a useQuery (from Apollo GraphQL client) hook that we use to query data from our back-end. We then have a useMemo for performing some mapping operation on the resulting data and memoising it. We then render our elements based on that mapped data.

Instead we could have re-written the above like so:

function SimpleMapping() {
  const { loading, error, data } = useQuery(GET_DATA);

  if (loading) return <p>Loading...</p>;
  if (error) return <p>Error :(</p>;

  return (
    <div>
      {data?.data?.map(item => (
        <p key={item.id}>Data: {item.value * 2}</p>
      ))}
    </div>
  );
}

The biggest change here is that we have removed the memo, and do the mapping in the render body instead.

Some might object: "But that's not efficient, it'll be re-calculated on each render!". But an O(n) mapping operation isn't necessarily computationally significant; it depends on n! In this context, we're talking about a handful of entries, which even slower devices can compute fast.

The other assumption is that rendering is happening all the time, but that depends on whether state changes, props change, or if the parent is re-rendered. The rendering lifecycle of React already acts as a kind of memoisation, and we should leverage that before adding another layer.

By adding memoisation prematurely we could be adding a lot of unnecessary noise to our code, or even bugs if we don't get our dependency array right (like if we forgot [data] in the first example). Instead, observe real world performance, and only when performance is unsatisfactory should we look at optimisation.

That isn't to say we should write inefficient code by default — but don't assume any kind of loop or mapping is going to be slow, unless you have a high degree of confidence.
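One rough way to build that confidence is to time the mapping on realistic data before reaching for useMemo. The sketch below is an assumption-laden illustration: the item count, the run count, and the doubleValues helper are made up, and a browser profiler would give a truer picture than a loop in Node.

```javascript
// Made-up data set: 1,000 items shaped like the ones in the example above.
const items = Array.from({ length: 1000 }, (_, i) => ({ id: i, value: i }));

function doubleValues(list) {
  return list.map((item) => ({ ...item, value: item.value * 2 }));
}

// Average over many runs, since a single run is too quick to measure well.
const runs = 100;
const start = performance.now();
let result;
for (let i = 0; i < runs; i++) {
  result = doubleValues(items);
}
const averageMs = (performance.now() - start) / runs;

console.log(`Mapped ${result.length} items in ~${averageMs.toFixed(3)} ms per run`);
```

If the measured cost is a tiny fraction of a frame, the memo is likely adding noise rather than speed.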

Anti-pattern 4 — Lots of large inline functions

This one is more of a readability problem as components get larger, rather than something that can directly cause bugs. Having a lot of inline functions can make a mess of your component, making it hard to follow the logic and trace the data flow. Let's look at the following:

const LargeInlineFunction = () => {
  const [response, setResponse] = React.useState(null);

  return (
    <div>
      <button onClick={() => {
        fetch('https://api.example.com/data')
          .then(response => response.json())
          .then(data => {
            // Perform some complex transformations on the data
            let transformedData = data;
            for (let i = 0; i < data.length; i++) {
              transformedData[i] = {
                ...data[i],
                extraProperty: 'extraValue'
              };
            }

            setResponse(transformedData);
          })
          .catch(error => console.error(error));
      }}>Do something</button>
      {response ? response.map(item => <div key={item.id}>{item.name}</div>) : null}
    </div>
  );
};

This particular example isn't too bad because it's a small component, but if you can imagine a component hundreds of lines long, with half a dozen large inline functions, it'll be hard to read and follow — more so if you add state and effects into the mix.

It's also hard to write tests for functions like this since they're buried inside the component. We'd need to mock out a bunch of things to get the code to trigger.

As a habit, I find moving these out into separate functions a good idea:

// Separate function for fetching data
async function fetchData() {
  const response = await fetch('https://api.example.com/data');
  const data = await response.json();
  return data;
}

function transformData(data) {
  // Perform some complex transformations on the data
  let transformedData = [];
  for (let i = 0; i < data.length; i++) {
    transformedData[i] = {
      ...data[i],
      extraProperty: 'extraValue'
    };
  }

  return transformedData;
}

const FunctionsMovedOut = () => {
  const [data, setData] = React.useState(null);

  return (
    <div>
      <button onClick={() => {
        fetchData()
          .then(transformData)
          .then(setData)
          .catch(console.error);
      }}>Do something</button>
      {data ? data.map(item => <div key={item.id}>{item.name}</div>) : 'Loading...'}
    </div>
  );
};

This means we can now write tests for transformData, without the hassle of mocking, since it's a pure function that produces output from input, without any side effects.
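A minimal sketch of such a test, runnable in plain Node, might look like the following. It includes its own copy of the transformation (written with map rather than a loop) and a made-up input record, so nothing here depends on fetch or on the component.

```javascript
// Copy of the transformation from the example above, written with map.
function transformData(data) {
  return data.map((item) => ({ ...item, extraProperty: 'extraValue' }));
}

// Made-up input; no fetch, component, or mocks involved.
const input = [{ id: 1, name: 'First' }];
const output = transformData(input);

console.log(output[0].extraProperty);      // extraValue
console.log('extraProperty' in input[0]);  // false, the input is not mutated
```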

Conclusion

Simplicity is a virtue in software development, and unnecessary complexity is going to be an impediment. In this post we've had a look at React anti-patterns that can be painful in my experience.

We explored the pitfalls of overusing useEffect, we discussed the unnecessary state usage, and encouraged developers to calculate derived data in the render body or use separate pure functions.

By avoiding these anti-patterns, I hope it puts you on the path of simpler, more readable, and more maintainable React code.

I'd be keen to hear from you if you have any thoughts on patterns that help or cause harm in your experience.

Cheers.

Remembering the important bits to log

Simon Bracegirdle — Thu, 18 May 2023 09:13:14 +0000

Logging can be a mixed bag, I've seen it done well and not well, and I've been guilty of both myself. Even though these days people tend to rely on traces for observability, I still think logs have an important part to play for engineering teams operating products and services.

When there's not enough logs, or other kinds of observability telemetry, then it can be difficult to understand what's going on inside a system, which can be painful when trying to debug an issue. If our database connections are failing and we didn't log the error, it might take us longer to understand what's happening.

When logs are too verbose and noisy, it can make it difficult to search for and find the information we need, and it can increase costs depending on how we're ingesting and storing logs.

Achieving the right quantity and quality of logging is difficult to master, and I'm not claiming to be a master myself, but the guidance in this post helped me, so I hope it helps you too.

Today I present a handy acronym — REDIT (in case we needed another one) — to help us remember what I think are the key bits of information to log. Let's explore it further.

R — Request/response

Log key context from the request and response payloads.

Capturing the input and output of a system is fundamental to understanding its usage. This might include capturing the requesting user ID, the record ID, date range, user's browser, or source IP address. This isn't an exhaustive list!

Add anything that might be conceivably useful when diagnosing a production issue, or trying to understand how users use the system.

Whether we log the entire request or a subset depends on context. We don't want to log the entire payload when a user is submitting large content, such as for a blog post or multimedia, but perhaps logging the content length or key attributes would be helpful.

Be careful to avoid logging personally identifiable information (PII), or anything sensitive or private — this could lead to regulatory issues, privacy issues, or losing the user's trust. It's important that we're mindful of the sensitivity of what we put into logs.

Below is an example of setting up Python's Flask framework to log every request. Note that we don't log every header, or the request body, as they may contain sensitive information such as authentication tokens or private data. We'd also log the response and attributes like status code, response time, and error code, but I've excluded an example of that for brevity.

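import json
import logging

from flask import Flask, request

app = Flask(__name__)
# Without this, INFO messages are dropped outside debug mode
app.logger.setLevel(logging.INFO)

@app.before_request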
def log_request_info():
    headers = request.headers
    # Selectively log safe headers to avoid leaking sensitive information into the log
    required_headers = {k: headers.get(k) for k in ["User-Agent", "Accept-Language"]}

    log_data = {
        "remote_address": request.remote_addr,
        "url": request.path,
        "method": request.method,
        "headers": required_headers,
    }
    app.logger.info(json.dumps(log_data))

# ... also log after_request ...

# Requests to the following route will be logged:
@app.route('/')
def hello_world():
    return 'Hello, World!'

CALLOUT — All code samples in this post are un-tested pseudo code for demonstration purposes.

E — Errors

Log errors, and any other surrounding context such as stack traces and identifiers.

When something goes wrong, we need to understand the problem so that we can work to resolve the underlying issue. If we don't understand the problem then our hands are tied and we don't know where to look, or what even happened.

A swallowed error is a disaster. We'll only find out about it when a user reports a problem with the system, and we'll be powerless to solve it because the logs don't indicate anything is wrong. In this situation we must resort to trial and error to isolate the issue, or start by fixing the logs.

Logging errors is essential and the bare minimum for any sane production system.

Context is important with errors too, so be sure to include the error message itself along with any pertinent IDs, stack traces and any other context that helps us understand what was happening, where, and when the issue happened.

Logging errors is one place where we can afford to be a bit more verbose too, since they shouldn't, in theory, be happening too often, so there's more value in maximising information about the error, with little cost.

Here's an example of logging an error with a good amount of context in Python:

import json
import logging
import traceback

def create_user(user):
    try:
        # ... Create user code
        ...

    except Exception as e:
        log_data = {
            "event": "user::create::error",
            "message": str(e),
            "user_id": user.id if user else None,
            "username": user.username if user else None,
            "error": repr(e),
            # Full formatted traceback of the exception being handled
            "stack_trace": traceback.format_exc(),
            # Add any other useful context you want to log
        }

        logging.error(json.dumps(log_data))

        # Handle the error etc...

D — Dependencies

Log calls to third-party dependencies such as external APIs or cloud services.

Trying to understand an issue in a distributed system can be challenging, so it's critical to understand any interaction with third party APIs such as AWS, SendGrid, GitHub, or anything else you're using.

Logging the entry and exit points can be a big help when trying to understand the flow of data in the system and across systems. Of course, traces are, in general, a better tool for this job. But having logs is also a good idea for local debugging, or in case we sample out the span, or the span doesn't contain the attributes we need. Redundancy is nice.

To give you a concrete example — I encountered an issue in a call to the AWS SQS.batchMessage endpoint. The call site wasn't checking the response, but instead expected it to throw an error when an issue occurred. But this is a batch endpoint: it doesn't throw errors, but instead returns them in the payload. This led to a bug, and we didn't have any logging to help us understand what was happening, so the investigation took longer than it should have.

This highlights the need to log key attributes from the response at the call site. Not everything will bubble up into an error when something goes wrong.
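As a rough sketch of what checking a batch response can look like, the JavaScript below inspects a hand-written object shaped like an SQS SendMessageBatch result, which reports per-entry outcomes in Successful and Failed lists. The helper logBatchResult, the event name, and the sample IDs and error code are all assumptions for illustration; no AWS call is made.

```javascript
// Hand-written sample shaped like an SQS SendMessageBatch result:
// the call itself "succeeds" but individual entries can still fail.
const sampleResponse = {
  Successful: [{ Id: 'msg-1', MessageId: 'abc-123' }],
  Failed: [
    { Id: 'msg-2', SenderFault: true, Code: 'ExampleErrorCode', Message: 'Example failure' },
  ],
};

// Hypothetical helper: log key attributes of the response at the call site.
function logBatchResult(response, log = console) {
  const failed = response.Failed ?? [];
  const successful = response.Successful ?? [];
  const entry = {
    event: 'queue::send_batch::result',
    successful_count: successful.length,
    failed_count: failed.length,
    failures: failed.map(({ Id, Code, Message }) => ({ id: Id, code: Code, message: Message })),
  };
  // Partial failures are logged as errors so they are not silently dropped
  if (failed.length > 0) {
    log.error(JSON.stringify(entry));
  } else {
    log.info(JSON.stringify(entry));
  }
  return entry;
}

logBatchResult(sampleResponse);
```

The point is that the failure is detected and logged where the call is made, rather than relying on an exception that never arrives.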

Here's an example of logging an API call site:

def create_user(user):
    log_data = {
        "event": "user::create::api_call",
        "message": "Calling the user service API.",
        "user_id": user.id,
        "username": user.username,
        # Add any other useful context you want to log
    }
    logging.info(json.dumps(log_data))

    response = user_service.call_api('create', user)

    # ALSO LOG THE RESPONSE HERE!
    response_log_data = { 
        # Any useful context from the response
    }
    logging.info(json.dumps(response_log_data))

    # ...

I — Important events

Log any important system or business events that occur.

It can also be useful to log business events that occur. For example, if you're building a book management system, you might log business events such as — user left a review, book created, user signed up, etc.

These events provide context that helps debug problems. If the user created a book, and then later failed to cancel the book, finding the original create event could help to understand why the cancel failed. It's rare that anything happens in isolation, so having extra context is helpful. Perhaps the book creation used values we didn't expect; capturing pertinent attributes would be helpful in this case.

An example:

def create_book(book):
    # Book creation process goes here...

    # If successful:
    log_data = {
        "event": "book::created",
        "message": "Book successfully created",
        "book_id": book.get('id'),
        "title": book.get('title'),
        "author": book.get('author'),
        "genre": book.get('genre'),
        # Add any other useful context you want to log — but nothing SENSITIVE or PRIVATE.
    }
    logging.info(json.dumps(log_data))

T — Trace IDs

Log any IDs that help you to trace a request as it passes across a distributed system.

Logging is just one tool to achieve observability within your operational systems. Traces are another tool that is quite good for understanding how a request passes across services and layers of a service.

In fact, people are now claiming that traces are the only thing you need, due to the idea of wide events. These are traces packed with enough metadata that they're useful for diagnosing issues on their own. I think there's some validity to this, but logs can still be helpful and complementary.

Traces and logs can work together by linking them with what's called a correlation ID or trace ID. Some libraries provide this functionality for you. OpenTelemetry for JS and the winston instrumentation package provide an option to inject trace and span IDs into logs, which are then correlated in your monitoring tool of choice. As an example, DataDog provides a logs tab within their trace viewer.
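To show the basic idea without any library, here is a small sketch of a logger that stamps every line with the same trace ID. The factory name createRequestLogger and the sample ID are assumptions for illustration; in practice the instrumentation packages mentioned above would supply the real trace and span IDs.

```javascript
// Hypothetical logger factory: every line it writes carries the same trace_id.
function createRequestLogger(traceId, write = console.log) {
  return function log(level, event, fields = {}) {
    // trace_id is set last so caller-supplied fields cannot overwrite it
    const line = JSON.stringify({ level, event, ...fields, trace_id: traceId });
    write(line);
    return line;
  };
}

// Made-up trace ID; a tracing library would normally provide this value.
const log = createRequestLogger('4bf92f3577b34da6a3ce929d0e0e4736');
log('info', 'book::created', { book_id: 42 });
log('error', 'book::cancel::error', { book_id: 42 });
```

Searching the logs for that one ID then returns every line written while handling the request.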

In OpenTelemetry specifically, there's the concept of events, which are simple log-like objects nestable within spans, co-locating the data, which reduces the chance of missing logs due to differing sampling rules by data type. I think the guidance in this post also applies to events.

Summary

To recap, the REDIT acronym stands for:

R — Capture request/response metadata
E — Capture errors
D — Capture calls to external dependencies
I — Capture important business and system events
T — Capture trace IDs and link traces to logs to increase observability even further

I hope this can serve as a useful mnemonic for remembering the important bits to log and send you on the way to observability nirvana. Thanks for reading.

Ship the thing — what's getting in the way?

Simon Bracegirdle — Wed, 17 May 2023 09:16:52 +0000

Now and then I think it's worth taking a step back and reflecting on what's stopping us from reaching our goal of shipping a new product or feature to customers. The stated goal "ship the thing," seems simple, so why is it hard in practice?

Sometimes it's the sheer volume of work — we need to write code, test with users, iterate after receiving feedback. Everyone has a long list of tasks that they need to complete.

Hidden amongst any product journey is work that's not really needed, things that get in our way, distract us, or slow us down. In some of my experiences in the past, most of the work of the project fell into this bucket. A lot of the time it's accepted as okay, normal, or even expected.

The Google Site Reliability Engineering (SRE) book introduced a similar idea that they call toil — repetitive manual work necessary to keep the system running, but doesn't contribute to the system's stability or strategic development. In other words, it has no value.

In Lean Manufacturing and the Toyota Production System they talk about continuous improvement and the need to remove waste. They define seven forms of waste in manufacturing, most of which are also relevant to software — transportation, inventory, motion, waiting, over-production, over-processing, defects.

Going even further back, Marcus Aurelius of the Stoics made this astounding statement that's still relevant to our industry today:

"The impediment to action advances action. What stands in the way becomes the way".

What things get in our way? Sometimes it's obvious — we'll experience some friction or frustration that needs fixing up. But other times it's not obvious. That's why it can be valuable to set time aside to reflect on our experience and map out the tools, systems and processes that we use to analyse where the pain points are.

This is starting to sound a lot like systems thinking — a method for looking at systems, such as organisations, software teams, or CI/CD pipelines, as a whole and breaking them down into sub-systems and components. Deming wrote about this in the 1980s.

The other way that we can get slowed down is when we've adapted to the inefficiency of the tool or system. I'm calling this "learned waste" — waste that we don't even realise is there because everyone's been doing it this way for so long they forget that there's better ways to work. I suspect sometimes the entire industry has learned waste with certain tools and methods.

We've talked a lot about waste and friction at this point, so let's look at a few examples from my own experience.

Do we need that overcomplicated or overabstracted architecture? A lot has been said about the whole microservices versus monolith debate, but I think it's entirely pointless arguing about it without considering context.

If we don't have any users yet, or have less than X engineers, then a complicated microservices architecture with 30+ repositories, 20 data stores and 10000 lines of AWS CloudFormation code should be a massive red flag. But if we're Google and we need to handle 10 quadrillion requests per second, then the opposite could be true.

What's the simplest possible architecture that's fit for purpose and has the least friction in our context? Earlier in the product lifecycle we need to focus on shipping fast. Build the "dream architecture" later when we're actually making money with real users and need to handle large numbers of requests per minute.

Do we need those bugs? When we're building a prototype, a personal project or an early stage product, bugs are acceptable because we don't know if the thing is going to survive at all. But for established teams bugs slow us down and frustrate users. There's well documented ways to increase code quality if you find yourself in this situation.

I'm a proponent of writing tests, but again it depends on the context. A mature product with real users needs to be stable and reliable, and the team of half a dozen engineers needs to be able to make regular changes with confidence that they won't break the system — tests are paramount for this team. We can do this by writing tests either first (TDD), or after. I don't have time for dogma, I want to know what's going to help us ship and continue to ship.

It's worth noting that even if we're building a prototype, code written without tests can make writing tests harder later. This is why people will advocate for discarding a prototype and re-writing it once we know this is going to be a real product with real users. Context matters, remove the barrier or burden that's slowing you down.

Do we need slow CI? This is a common one. If we make code changes throughout the day and need to wait 20 minutes for each code change to build, test, and deploy, how much do we think that's going to add up to over time? It takes some simple math to realise it's worth spending time optimising this process.
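The simple math can be sketched in a few lines. Only the 20-minute wait comes from the paragraph above; the number of changes per day, the team size, and the working days per year are assumptions picked for illustration, so substitute your own.

```javascript
// Only the 20-minute wait comes from the text; the other numbers are assumptions.
const waitMinutesPerChange = 20;
const changesPerDay = 5;        // assumed, per engineer
const engineers = 6;            // assumed
const workingDaysPerYear = 230; // assumed

const hoursPerDay = (waitMinutesPerChange * changesPerDay * engineers) / 60;
const hoursPerYear = hoursPerDay * workingDaysPerYear;

console.log(`${hoursPerDay} hours of waiting per day across the team`); // 10 hours
console.log(`${hoursPerYear} hours of waiting per year`);               // 2300 hours
```

Not all of that waiting is idle time, but even a fraction of it is a strong argument for speeding up the pipeline.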

Is our language, library or framework choice fit for purpose? Is there a lot of boilerplate or overhead that will slow down the building of features? Does the framework tend to lead to verbose or complex code that's harder to work with? Does it support easy testing? A lot's been said about the Ruby on Rails/Laravel vs React debate, but I think both are valid (or poor) choices depending on context.

There can be waste on a personal level too. I know personally that social media and collaboration tools can be a distraction, so I try to close or block them as much as possible during periods of focused work. I find having breaks from the computer, going for walks, or talking through the problem with others helpful for tackling tricky issues that I'm stuck with.

In summary, we've all got friction points, distractions and waste in our environment that impede shipping. Let's use intuition to find and remove them, but also consider using systems thinking and other analytical methods. The impediments should become "the way," work hard to remove them, but with the goal of enabling us to ship the thing.

OpenTelemetry and the future of monitoring and observability

Simon Bracegirdle — Thu, 23 Mar 2023 09:04:35 +0000

OpenTelemetry is a collection of standards and tools designed to help you add telemetry to your application or service. By telemetry, I mean metrics, traces, and logs, which are critical for understanding the state of any deployed software system.

The key benefits that OpenTelemetry brings are standardisation and vendor agnosticism. You can instrument your code once and then, with the help of the OpenTelemetry Collector, create configurations that pipe your telemetry data to one or more back-ends. There are integrations with popular SaaS vendors in the space such as DataDog, AWS CloudWatch, Honeycomb, and others.

For example, let's say you're deploying a new web application as a container to a cloud service such as AWS ECS, Fly.io, GCP containers, or one of the other options. To understand the behaviour of that application in production, help debug issues, and receive alerts when something is going wrong, telemetry is essential.

To start with OpenTelemetry you would add instrumentation to your code. A good starting point is to use the auto instrumentation tooling provided by the community. These are available via the SDK and API libraries provided for each language supported by OpenTelemetry.

You may also decide to run an OpenTelemetry Collector, which is a separate process that can add processing such as batching and sampling, to help prepare, filter, and massage your data before sending it to your monitoring and observability back-end. A common configuration is to deploy the collector as a sidecar container on the same host as your application. The two then communicate using OTLP (the OpenTelemetry protocol) over HTTP or gRPC.

With that context in place, let's take a look at the limitations and challenges with OpenTelemetry.

Limitations and challenges

The biggest issue for me so far is that some tooling isn't yet mature and has some rough edges. For instance, I encountered a strange issue with exporting logs with the DataDog exporter, and ended up debugging the problem myself and submitting a PR to the contrib repository. It's notable that this is also a benefit of the ecosystem, since the community can contribute fixes and improvements to the project. I think over time the community will achieve a stable and reliable tool, but there are some challenges as it stands today in 2023.

To illustrate this further, if you browse through the Collector contrib repository, you'll notice that many components in the ecosystem are in beta or alpha state. These components are evolving fast, with breaking changes occurring. This can make it difficult to keep your dependencies on the ecosystem up to date.

To manage that, it's wise to set a regular cadence for applying updates, and to set time aside for working through any major changes that impact you.

Another challenge is that adding instrumentation to new services can be a bit more work than using libraries from specific vendors. Those vendors have had time to optimise their on-boarding and developer experience, which is understandable given how critical it is to their business. But I think once you have deployed OpenTelemetry to at least one service in your environment, you'll have established a pattern that makes it easier to copy to other services, plus the generative text tools available these days can help to cut down on boilerplate.

For example, setting up the infrastructure for the OpenTelemetry Collector can be a bit of added work. First you need to choose a distribution for the collector, such as the contrib distribution, one from a vendor such as AWS, or one you build yourself with the builder tool. Then you need to add a YAML configuration for your collector, which defines how to receive, process and export telemetry data. Then you need to build an image for the collector that embeds your configuration and publish it to a container repository in your ecosystem. Finally, you deploy the collector alongside the application, which you configure to pipe telemetry data to the collector via OTLP, and the collector forwards the data on to a backend.

This is a lot of steps and moving parts, and I'm sure it's daunting for newcomers, but once you've stepped through the process and familiarised with it, I don't think it's too arduous or complex.

Highlights and strengths

I think the biggest strength of OpenTelemetry is the amount of flexibility and power in the tools provided. For instance, the community has provided a lot of components for the Collector that allow processing data in a range of ways. For example, if we want to sample our traces before sending them off to DataDog, which can be quite expensive if you're ingesting every single trace and span, then it's a matter of adding the tail_sampling component and adding a configuration like below:

receivers:
  # ...

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 1000
    expected_new_traces_per_sec: 10
    policies:
      - name: sample
        type: string_attribute
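        # Note: only the block matching `type` takes effect, so this
        # rate_limiting block needs a policy with type: rate_limiting
        rate_limiting:
          spans_per_second: 100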
        string_attribute:
          key: "sample-me"
          values: ["true"]

exporters:
  # ...

service:
  pipelines:
    # ...

Using tail_sampling, we can create a wide range of policies that will sample traces with specific attributes. An example could be to include fewer traces from health checks, but ensure errors are always included, or that slow requests are always included.
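
To illustrate the kind of decision such policies encode, here is a minimal Python sketch. It is not how the Collector implements tail sampling; the span fields (status, duration_ms, route), the thresholds and the rates are all hypothetical. The point is that the decision is made once every span in the trace is available.

```python
import random


def keep_trace(spans, slow_ms=1000, health_check_rate=0.01,
               default_rate=0.2, rng=random.random):
    """Decide whether to keep a finished trace, given all of its spans."""
    if not spans:
        return False
    # Always keep traces that contain an error
    if any(span.get("status") == "error" for span in spans):
        return True
    # Always keep slow requests
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True
    # Keep only a small fraction of health checks
    if any(span.get("route") == "/health" for span in spans):
        return rng() < health_check_rate
    # Everything else is sampled at the default rate
    return rng() < default_rate


print(keep_trace([{"status": "error", "duration_ms": 12}]))  # True
print(keep_trace([{"status": "ok", "duration_ms": 2500}]))   # True
```

Because the error and latency checks run first, a failing or slow health check is still kept even though health checks are otherwise sampled heavily.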

Another powerful feature of OpenTelemetry is the ability to test locally, and to export to open source tools such as Jaeger for traces, without the need to work with a third party vendor that might charge you for data ingestion. This can help you to iterate faster during development before deploying.

The OpenTelemetry project has collaboration and community involvement at its core. Being an open-source project, it encourages developers and organisations to actively participate in its growth and development. The project maintains a strong presence on GitHub, where developers can contribute to the codebase, report issues, and suggest improvements under the Apache License, Version 2.0, which is a permissive license that allows commercial use.

The community's involvement has led to the development of integrations, exporters, and instrumentation libraries for a range of programming languages and platforms. Examples of community-driven contributions include OpenTelemetry SDKs for languages like Java, Python, Go, and JavaScript, as well as exporters for popular backends like Jaeger, Prometheus, Zipkin, and DataDog. The community also contributes documentation, tutorials, and best practices to help other developers adopt OpenTelemetry.

Conclusion

In conclusion, the current state of OpenTelemetry shows great promise in standardising and democratising observability and monitoring across different platforms and languages. Challenges exist, such as adding instrumentation to new services being more involved than with equivalent libraries from vendors. But despite these on-boarding challenges, I think the benefits of vendor agnosticism, and the powerful tools available, will compound in the long run.

As the OpenTelemetry ecosystem continues to evolve and mature, we can anticipate further improvements in its tooling and developer experience, solidifying its position in the world of observability and monitoring. I predict, and hope, that in a few years, it'll become the default method for software teams instrumenting their code.

Thanks for reading! Please reach out on the socials if you'd like to discuss further.

Analysing AWS VPC Flow logs with Python and Pandas

Simon Bracegirdle — Tue, 14 Feb 2023 00:49:04 +0000

Context

Recently, I encountered an AWS EC2 bill that was higher than expected and I suspected that traffic flowing in and out of the NAT Gateway was the culprit. In this post, I will share my journey of using Python and its powerful data analytics ecosystem to analyze VPC flow logs and gain insights into AWS networking costs.

Before diving into a solution, I always strive to have a good understanding of the problem to avoid wasting precious engineering time optimizing the wrong thing. In this case, I needed to gather more information about the traffic flow within the private network. To achieve this, I leveraged VPC flow logs, which contain a record of network activity within an AWS VPC.

Analysis

Python notebook in VS Code

I've had some experience doing simple data analysis in Python before, specifically with Pandas, Matplotlib, Numpy, and other popular data science libraries, so it made sense that I leverage those skills rather than trying to learn something like AWS Athena.

I went for a Jupyter notebook, which is a popular development environment for data analysis. It allows you to run Python code in small chunks known as cells, which can be interwoven with text, visualisations and other content. With the VS Code Python extension, you can treat any Python file as a pseudo-notebook by marking the start of a cell with the #%% character string. You can then execute that cell directly inside VS Code and get instant feedback.

For example, here's a cell to import the libraries we need:

#%%
import pandas as pd
import os
import boto3

With that up and running, I moved on to retrieving the data set.

Pulling down logs from AWS S3

In this case, the VPC flow logs were already stored in AWS S3, so I was able to download them in compressed format directly.

I only wanted a subset of one day's logs. I didn't need an entire day or month, as long as the sample was representative of the whole. This also helps to keep the transfer and computation time down.

The cell below downloads the first 20 files from an S3 path and stores them locally:

#%%
s3 = boto3.resource('s3')

# Download the first 20 files from an S3 path
def download_files_from_s3(bucket, s3_path, local_path):
    # Create the local directory if it doesn't exist yet
    os.makedirs(os.path.dirname(local_path), exist_ok=True)

    bucket = s3.Bucket(bucket)
    count = 0
    for obj in bucket.objects.filter(Prefix=s3_path):
        if count >= 20: # First 20 files please
            break
        count += 1

        # Strip any path separators from the file name
        filename = obj.key.split('/')[-1]

        # Download the file if it doesn't already exist locally
        if not os.path.exists(local_path + filename):
            print('Downloading', obj.key)
            bucket.download_file(obj.key, local_path + filename)
        else:
            print('Skipping', obj.key)

download_files_from_s3('my-vpc-logs-bucket', 'path-to-the-logs/2023/02/07', 'data/')

Loading the data into a dataframe

Now that I had my data stored locally, I wanted to get it into a data structure in memory that would make analysis of the data easier. The most common data structure for analysis like this in Python is the Pandas data frame, which is a two dimensional structure that allows for easy aggregation, grouping, filtering and visualisation. Data frames are even more powerful when used with other libraries in the Python ecosystem such as matplotlib and numpy.

The following cell reads the first 20 files it finds in a directory, decompresses them and combines them into a single data frame:

# %%
# Process log.gz files into a single dataframe
def process_log_files(local_path):
    columns = ['version', 'account_id', 'interface_id', 'src_addr', 'dst_addr', 'src_port', 'dst_port', 'protocol', 'packets', 'bytes', 'start', 'end', 'action', 'log_status']
    frames = []
    # Only process the first 20 files
    for file in os.listdir(local_path):
        if file.endswith(".gz") and len(frames) < 20:
            print('Processing', file)
            frames.append(pd.read_csv(local_path + file, compression='gzip', header=None, sep=' ', names=columns))
    if not frames:
        return pd.DataFrame(columns=columns)
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    return pd.concat(frames, ignore_index=True)

df = process_log_files('data/')

# %%
# Print the first rows of the table to verify the data looks right
df.head()

Now that there's a sampling of the data in memory, we can commence analysis.

Looking for the largest destination of data

The first question I had for the data was to find out which host was receiving the most bytes.

That meant converting the bytes column into a numeric format that we can use in aggregations:

df['bytes'] = pd.to_numeric(df['bytes'], errors='coerce')

Then I grouped by destination address, tallied the bytes, and sorted in descending order:

# %%
# Group by destination address, sum bytes as a new column
# (select the bytes column before summing so text columns aren't aggregated)
result = df.groupby('dst_addr')['bytes'].sum().reset_index()
# Sort by bytes descending
result = result.sort_values(by=['bytes'], ascending=False)
result.head()

There was one IP address that stood out by a large margin, so I was curious to learn more about it. It fell within the VPC CIDR, so I queried ENIs in AWS to see if there was a match:

# %%
# Query the resource attached to a destination address
def query_resource(dst_addr):
    client = boto3.client('ec2')
    response = client.describe_network_interfaces(
        Filters=[
            {
                'Name': 'addresses.private-ip-address',
                'Values': [
                    dst_addr,
                ]
            },
        ],
    )
    return response

print(query_resource('IP_ADDRESS_HERE'))

This returned a large amount of metadata about the ENI, but the description made it clear that the interface belonged to the NAT Gateway, confirming the hunch I mentioned earlier.

Looking for the largest sender of data

I wanted to understand where this NAT traffic was originating from, as I hoped it would lead to optimisations that can trim down the AWS bill.

I grouped by source address, where the destination address was the NAT gateway, then tallied the bytes and sorted in descending order:

# %%
# Find the source that sends the most bytes to the destination in question
result = df[df['dst_addr'] == 'NAT_IP_ADDRESS_HERE'].groupby('src_addr')['bytes'].sum().reset_index().sort_values(by='bytes', ascending=False)
result.head()

This revealed IP addresses that weren't in the VPC — the traffic was coming from the interwebs.
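
One way to check whether an address belongs to the VPC is Python's standard library ipaddress module. This is a sketch rather than the code I ran at the time, and the CIDR block below is a placeholder.

```python
import ipaddress

# Hypothetical VPC range; substitute the CIDR block of your own VPC
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")


def in_vpc(addr, network=VPC_CIDR):
    """Return True if the address falls inside the given network."""
    try:
        return ipaddress.ip_address(addr) in network
    except ValueError:
        # Flow logs can contain placeholders such as '-' instead of an address
        return False


print(in_vpc("10.0.12.34"))   # True
print(in_vpc("52.95.110.1"))  # False
```

Applied to the data frame, something like result[~result['src_addr'].map(in_vpc)] would keep only the rows whose source address sits outside the VPC.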

I installed the ipwhois library, which would allow me to lookup metadata about an IP address, such as which ISP or network it belongs to:

#%%
# pip3 install ipwhois
from ipwhois import IPWhois, IPDefinedError

#%%
# Use ipwhois to lookup metadata about the IP address
def call_ipwhois(ip):
    # Catch IPDefinedError
    try:
        result = IPWhois(ip).lookup_rdap(depth=1)
    except IPDefinedError as e:
        result = None

    return result

print(call_ipwhois('SUSPECT_IP_HERE'))

I did this for the top 25 source addresses to see if there were any patterns in ISP or network. This is slow since each API call takes a second or two. If I was going to do this more than once I'd optimise it, but this is a once-off task so I didn't bother.

#%%
# Create in memory cache for ipwhois results to make re-querying faster
ipwhoiscache = {}

#%%
count = 0

# Iterate over dataframe
for index, row in result.iterrows():
    # Call ipwhois for first 25 rows
    if count < 25:
        # Print index
        print(count, index, row['src_addr'])

        if row['src_addr'] in ipwhoiscache:
            ipmeta = ipwhoiscache[row['src_addr']]
        else:
            ipmeta = call_ipwhois(row['src_addr'])
            ipwhoiscache[row['src_addr']] = ipmeta

        if ipmeta is not None:
            result.loc[index, 'network'] = ipmeta['network']['name']

        count += 1

# Print result, show top 25
result.head(25)
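
If this lookup did need to run regularly, one option would be to issue the calls concurrently rather than one at a time. The sketch below is an assumption about how that could look, not something I ran for this analysis. lookup_all is a hypothetical helper that takes the lookup function as a parameter, so call_ipwhois from above could be passed in.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup_all(ips, lookup, cache=None, max_workers=8):
    """Look up each IP once, in parallel, reusing any cached results."""
    cache = {} if cache is None else cache
    # Deduplicate while preserving order, and skip anything already cached
    pending = [ip for ip in dict.fromkeys(ips) if ip not in cache]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs
        for ip, meta in zip(pending, pool.map(lookup, pending)):
            cache[ip] = meta
    return cache


# Example with a stand-in lookup function instead of a real network call
demo = lookup_all(["192.0.2.1", "192.0.2.2", "192.0.2.1"],
                  lambda ip: {"network": {"name": "NET-" + ip}})
print(sorted(demo))
```

In the notebook this might be called as lookup_all(result['src_addr'].head(25), call_ipwhois, ipwhoiscache). It's probably wise to keep max_workers small, since lookup services may throttle bursts of requests.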

I grouped by network name, tallied the bytes and sorted descending to find the network sending the most bytes to the NAT:

# %%
# Group by network name and sum bytes
result_network = result.groupby('network')['bytes'].sum().reset_index().sort_values(by='bytes', ascending=False)
result_network.head(25)

It turns out that most of the traffic was coming from the network AT-88-Z, owned by Amazon Technologies Inc. In other words, this is traffic flowing between AWS services and the NAT Gateway.

Retrospective

This simple analysis provided enough information to identify which AWS resource was sending this data, which led me to make config changes that drastically reduced the AWS networking bill.

I think this demonstrates the power of Pandas and Python for quick analysis jobs like this. If I was going to productionise this analysis, for example with a regular report to management, or if I needed to crunch larger amounts of data, I'd consider using something like AWS SageMaker or AWS Athena. But for this particular ad-hoc case with a smaller data set, Pandas and Python in a locally running notebook was the perfect choice.

Thanks for reading, please get in contact with me on Twitter or LinkedIn if you have any comments or questions.

Doesn't look good to me — a requiem for thorough code reviews

Simon Bracegirdle — Mon, 14 Mar 2022 01:25:11 +0000

A lot of development teams are doing code reviews these days; it's become the industry norm. But are they done well? The idea of the lazy code review ("LGTM, ship it") has become a widespread meme, and it's funny because there's an element of truth in it that we've experienced ourselves.

In this post I aim to convince you of the value of thorough code reviews, and I'll provide some pointers for bringing rigour and method to your reviews to increase the value you get out of them.

Let's be honest though, when we're working towards deadlines, it's easy to deprioritise or simplify what we see as low priority tasks to ensure we can meet our commitments. This isn't malicious or lazy, it's human nature.

What's the solution then? There's no quick fix of course — it comes down to individuals and teams, and how much value they place in the benefits of code review. But even if you value code reviews highly, you still need a disciplined, well articulated and well understood approach to overcome the habits of individuals and sustain a high level of review quality over time.

What do I mean by code review?

I'm focusing on asynchronous code reviews specifically, also known as pull requests, often done when merging a code change from a branch back to the mainline in git.

There are other forms of code review, such as pairing, code walkthroughs and code review meetings, each with their own advantages and drawbacks, but comparing them is not the focus of this post. The approaches I suggest may or may not be applicable to those other forms.

In fact, pairing is probably a more effective approach to code review than pull requests. With pairing, two individuals are working on the same change simultaneously and with a shared understanding of the context. The individual on the keyboard edits the code whilst the other provides feedback and asks questions. But pull requests are more commonplace in the industry, so I won't go into the nuance of pairing in this post.

Why do code reviews?

This may be well covered ground, but I think it's always worth revisiting the fundamentals to refresh our appreciation for practices we've adopted. Having a practice without a good reason is not an approach that leads to success in my experience, unless you get lucky.

Let's explore the rationale...

Reason 1 — Learning opportunity

Code reviews present an opportunity to learn something for both the reviewer and the author. Here are some examples:

As a reviewer, you may learn about the behaviour of an upcoming feature that you had some assumptions about, which turned out to be wrong.
As an author, you may receive some suggestions about how to improve error handling that result in code with fewer bugs.
As a reviewer, you're confused about the code structure, so you ask a question about it and learn about a new approach.
As an author, you may ask the reviewer what they think about the expected behaviour in the tests you've written, and they may respond with feedback and ideas.

Code reviews should be a conversation. Conversations happen to be a good tool to achieve common understanding between two or more individuals.

Those conversations should also be psychologically safe for both parties — we should be able to speak up with questions and ideas without fear of punishment, blame or judgement. Blameless environments allow members to learn from mistakes and be more engaged in continuous improvement.

But even then success isn't guaranteed. We need to build the habit of asking good questions, providing suggestions and other forms of feedback that move code reviews from a chore into a conversation. Reviews with a reasonable attention to detail and tactful use of tone and language will have this benefit.

This learning opportunity is critically important when one of the participants is more senior than the other. Seniors have a responsibility to provide guidance to their less experienced peers, and code reviews are an opportunity to do so.

Reason 2 — Product quality

Thorough code reviews can lead to higher code quality, and code that is more maintainable and robust will allow the sustained delivery of a higher quality product.

A good reviewer will look for issues that impact the robustness, security or performance of the system. For example, an error that is not handled could cause a catastrophic bug that leads to data loss or a system outage, both of which would be a poor experience for users.

Nobody is perfect, and a second set of eyes can pick up potential issues that we missed ourselves during development. A fresh reviewer can see the code differently than you do after a tiring development stint. Everyone brings a different set of experiences and a different world view; leverage that to get the most out of your reviews.

Even if the change under review doesn't directly contain any bugs, missing tests, bad naming or hard-to-understand code could lead to bugs in future changes when confused developers try to work out what's going on.

Who's responsible for code quality though? I think everyone in the team holds the same level of responsibility for quality. But in the context of a specific change, the author is ultimately responsible for ensuring that it meets the agreed standard. The reviewers are there to help by asking questions, making suggestions and sharing knowledge, but they won't be making the necessary changes and clicking the merge button.

Who should the author pick to review then? Ideally it's the person that will give the best feedback by asking good questions, making insightful suggestions and identifying potential issues. But we also need to consider the impact of creating bottlenecks in the team. For example, if the senior engineer receives all the code reviews, then they're not going to be able to work on their own tasks and will block merging of other pull requests.

We also need to give opportunities to less experienced team members so they can grow their own skills and learn from others. With the use of some of the suggestions in this post we can help them to uplift their code review game.

Reason 3 — Sustained delivery over time

Code that is buggy or hard to understand will slow down the delivery of value to customers, so code reviewers should look for changes that might work against these goals.

For example, if tests are not added to a change, it might be hard to enhance or fix that code next time. It becomes worse if the developer is new and unfamiliar with the intended purpose of the code.

Small paper cuts add up over time to form what is now well known as technical debt. Code bases that have accumulated enough technical debt can exceed a threshold of no return, beyond which they become unmaintainable messes that everyone's scared to touch.

To control the fallout of tech debt, management often bring in manual gates and other heavy approval processes to attempt to limit the damage of any changes. This increases the time needed to make changes (lead time) and reduces the frequency that we deliver to customers.

If you're working on a system that needs to continue to serve real customers for the foreseeable future, play the long game and optimise for testability, maintainability and changeability. Code reviews are a critical tool to help sustain that discipline over time.

Aside — Why not do code reviews?

As with everything in software, there are no universal laws; practices that are useful in some cases are harmful in others. Code reviews are no exception: they aren't necessarily suitable for every situation and team.

Here are some cases where you wouldn't do code reviews and why:

You are experimenting or prototyping, in which case code quality is not a concern. Be careful that your prototype doesn't become a production application overnight. Ideally throw away any prototype code and start again.
You are a one person show and don't have a team. You have no choice here unless you can afford to bring on another person.

Okay, with those points aside and assuming you're interested in code reviews, let's look at some approaches for adding rigour.

How can we improve our code review approach?

Approach 1 — Review in context

Most teams use Git these days, so the popular process for code reviews is to create a feature branch and then open a pull request from that feature branch back to the main line. In Git platforms such as GitHub, Bitbucket and GitLab, the pull request is where the code review takes place.

Code review tools emphasise the "diff" view, which highlights the files and lines of code that have changed in the branch when compared to the main line. This is a fantastic tool as it makes it clear what the contents of the change are. The expander that allows us to view more of the file is also useful for getting more context for the changed file.

But the "diff" has a limited view of the source code. For more complex changes, or changes in a complex context, or if you want to browse around to get an understanding of the code base, opening the change in a full IDE can help to provide the bigger picture of the impact of a change.

For example, if you make a one-line change to the return statement of a function, what's missing in the diff view is how the callers of that function interact with the return value. Without seeing the full picture, you could miss a side effect leading to a bug. Opening the change in an IDE allows us to use its powerful search features, or the "search for dependencies" feature.
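
To make that concrete, here is a small hypothetical Python sketch. The function names are invented for illustration. The diff would only show the changed return line in the lookup function, while the code that depends on the old behaviour lives elsewhere.

```python
def find_user(users, name):
    """Original version: returns None when no user matches."""
    for user in users:
        if user["name"] == name:
            return user
    return None


def find_user_changed(users, name):
    """After the one-line change: returns an empty dict instead of None."""
    for user in users:
        if user["name"] == name:
            return user
    return {}


def greeting(lookup, users, name):
    """Calling code in another file, so it never appears in the diff."""
    user = lookup(users, name)
    if user is None:
        return "Unknown user"
    return "Hello " + user["name"]  # raises KeyError when user is {}


print(greeting(find_user, [], "sam"))  # Unknown user
```

Calling greeting(find_user_changed, [], "sam") raises a KeyError, because the None check no longer catches the missing user. Searching for usages of the function in an IDE is what surfaces that kind of dependency.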

Tooling in this space is always improving. In 2021 GitHub released https://github.dev, which allows you to open up any file in a web-based Visual Studio Code editor by pressing the . key in a GitHub file or PR view.

The code author should provide as much context as possible in the description of the PR itself. For example:

Provide a reason for the change — what is the rationale and motivation behind it?
Explain the scope of the change: does it cover the entirety of the ticket, or are there more changes to follow?
Explain the contents of the change — what have you changed and how does it work?

It's unreasonable to expect authors to remember these points for every change, so leverage pull request templates to establish a format to follow.

Approach 2 — Use of language and tone

Code reviews are a great opportunity for feedback, but if you're not careful they can also be a source of arguments, defensiveness and frustration. We've all experienced it, and I'm guilty of it too.

It's understandable. It's hard receiving critical feedback on work we've put a lot of effort into and are under pressure to deliver. If we feel like a reviewer is being overly pedantic, we naturally go into defensive mode.

I think it's critical that the reviewer makes careful choice of words when giving feedback. The author has this responsibility too, but the reviewer is the one that sets the initial tone.

Thankfully, there is an approach that can help us: conventional comments. Conventional comments work by prefixing a label to each comment posted by a reviewer. The label describes the intent of the comment. For example: "question: could you explain the intent behind this function?". Stating the intent up front provides insight into the motivation behind the comment and allays our fears of harshness.

A question indicates a curiosity to learn more about the author's approach. A suggestion shows a willingness to have a discussion and be flexible with the author instead of dictating changes. A nitpick indicates something minor that the reviewer doesn't feel strongly about.

There's no label for do-as-i-say or this-is-trash. Code reviews should be a conversation about the change. When it's a two way dialogue where both parties feel respected and listened to we get the benefits we talked about earlier.

What about as the author? What if you're not getting the feedback that you hoped for? Don't be afraid to ask for it. Be clear about what kind of feedback you want and from whom.

For example:

Hi Sam! Could you please take a look at this change and tell me what you think about the way I've structured the code? With your experience in design patterns I'm interested to get your input.

Approach 3 — Checklists

As covered extensively in The Checklist Manifesto, checklists have many benefits including:

They help us to achieve, and exceed, a given benchmark
They supplement memory recall
They provide structure in a complex space

Anecdotally, I once worked in a team that had a well defined code review approach with something resembling a checklist.

When I followed the checklist I gave high quality actionable feedback to authors. But over time, I paid less attention to the checklist and I became lazier with my feedback. I became reactive, responding to the code in the diff and how I intuitively perceived it, instead of being proactive and thinking of a wide range of concerns such as security, performance and testability.

Sticking to the checklist changes that. It helps the reviewer to drive the discussion and maintain a high standard, provided you remain disciplined over time. Even after we've been doing reviews for a long time, and we've internalised the checklist, we're human and can forget considerations. The checklist helps us to be consistent.

I've defined my own personal code review checklist, and identified four key areas that I think are critical:

Testability
Maintainability
Security/privacy
Robustness

Below is my full checklist, based on what I think is important, and also inspired by other checklists out there. If you're looking to adopt a checklist in your team, I suggest you use one that encapsulates the properties your team thinks are important. There are plenty of good checklists around the web to use as source material.

Simon's Code Review Checklist

Checklist questions grouped by four key areas...

Testability

If you're running production code, and that code is changing frequently due to new features and improvements, having an automated test suite is critical for being able to deploy fast and make changes without fear. Below are the items I look for:

[ ] Implementation can change without breaking tests (black-box tests)
[ ] I can understand the expected behaviour of the system by reading tests
[ ] Uses a variety of testing approaches for robustness (unit + integration etc)
[ ] Has good code coverage and covers a good number of core and edge cases
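
As an illustration of the first item, the sketch below contrasts a black-box test with one that is coupled to the implementation. The Cart class and both tests are hypothetical examples written for this post, not taken from a real code base.

```python
class Cart:
    def __init__(self):
        self._items = {}  # internal detail, free to change

    def add(self, sku, price, qty=1):
        _, old_qty = self._items.get(sku, (price, 0))
        self._items[sku] = (price, old_qty + qty)

    def total(self):
        return sum(price * qty for price, qty in self._items.values())


def test_total_black_box():
    # Only uses the public interface, so it survives a change of storage
    cart = Cart()
    cart.add("apple", 2, qty=3)
    cart.add("pear", 5)
    assert cart.total() == 11


def test_total_white_box():
    # Coupled to the private dict layout; breaks if _items becomes a list
    cart = Cart()
    cart.add("apple", 2, qty=3)
    assert cart._items == {"apple": (2, 3)}


test_total_black_box()
test_total_white_box()
```

Both tests pass today, but only the first would keep passing if the internal storage were refactored, which is the property the checklist item is looking for.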

Maintainability

Being able to continue to make changes to production software over its lifespan is critical for continuing to keep customers happy. Team members will change over time and knowledge about software is potentially lost. Tests and well organised code help to capture this knowledge in code.

Here are some questions that cover what I think are important maintainability concerns:

[ ] Does the file structure follow a consistent pattern? (e.g. ports and adapters)
[ ] Do file / function / variable / object names reflect their purpose?
[ ] Is there any unnecessary coupling that would make refactoring or testing harder (e.g. email function and sms function linked together)?
[ ] Is the code split into appropriate concerns / layers if necessary? E.g. API code does not interact with database.
[ ] Is there any unnecessary complexity that makes the code harder to reason about?
[ ] Comments provide extra context (the "why?") where necessary
[ ] Is the API, architecture, setup and usage documented (e.g. README, OpenAPI, etc)?
[ ] Is the code configurable? Can you change config in one place without re-factoring?

Security and privacy

The risk and cost of security breaches these days is too high to make compromises on security or privacy. These questions focus on keeping our security posture tight:

[ ] Are there any opportunities for abuse? E.g. large volume of requests? Bad input?
[ ] Are all entry points authenticated and authorised appropriately?
[ ] Does any process, resource or user have more access than they need?
[ ] Is all PII and sensitive data handled appropriately (not logged, not in plain text, not checked in)?
[ ] Are all third-party dependencies vetted and pinned?
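As one example of what "not logged" can look like in practice, here's a rough sketch of masking sensitive fields before an object reaches the logger. The key list and the `redact` helper are assumptions for illustration, not a standard API, and your team would need to agree on its own list.

```javascript
// Assumed list of sensitive keys; a real team would maintain its own.
const SENSITIVE_KEYS = new Set(["password", "email", "token", "ssn"]);

// Returns a copy with sensitive fields masked, recursing into nested
// objects and arrays. The input is left untouched.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        SENSITIVE_KEYS.has(key.toLowerCase())
          ? [key, "[REDACTED]"]
          : [key, redact(inner)]
      )
    );
  }
  return value;
}

const event = {
  userId: 42,
  email: "a@example.com",
  profile: { token: "abc", plan: "pro" },
};
console.log(JSON.stringify(redact(event)));
```

This only handles plain objects and arrays. Values such as `Date` or class instances would need extra handling, so treat it as a starting point for the review conversation rather than a complete solution.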

Robustness

If your code is secure and maintainable, that's a good start. If it's constantly falling over in production or riddled with bugs, it's going to result in a poor user experience and frustrate the team. These questions focus on robustness concerns:

[ ] Can you think of any errors or conditions that would cause an unexpected state?
[ ] Are there any statements that won't scale well to large data sets?
[ ] Is there anything that can have an impact on system load / service limits / costs?
[ ] Are all read and write operations logged to assist debugging and support?
[ ] Are error messages readable and assist debugging and support (e.g. shows key non-sensitive details)?

Other considerations

We could introduce other checklist items depending on the code change under review. For example, if we're reviewing a user interface code change, we may ask for a screenshot of the UI and have some checklist items to check for certain styling issues (e.g. bad whitespace).

If we're reviewing an API change, we may have some checklist items for reviewing the OpenAPI schema, for example to check the correct use of verbs and nouns.

Approach 4 — Get consensus on approach

Dictating practices from top down is not generally a good idea. If people aren't invested in your practices, or don't understand the motivations behind them and are instead forced to adopt them, they may resent it and not commit to the approach. You won't get the feedback and engagement you're looking for.

Instead, let the team choose its own path. Discuss together what you think are important attributes of code, and share thoughts on why code review is valuable in your context. Decide as a team the details of how you'll conduct code reviews. For example: will you use a checklist? What questions are most important during review?

Write down your decisions and the rationale behind them so people that join later have the context of that original decision. That'll inform those people to make their own suggestions for improvement. Writing down decisions also helps for reviewing decisions at a later date to see if the reasons are still valid.

Approach 5 — Don't review the trivial or the major

I don't believe we should spend any effort reviewing tasks that can be trivially automated. Automated tasks can save a lot of hassle and wasted effort in looking for small nits during code review.

For example:

Automated formatters that ensure we format code according to an agreed upon style guide
Static analysis tools and code linters can pick up issues such as unused variables and syntax errors
Code build in CI can uncover syntax issues and misconfiguration
Code coverage reports can find testing gaps
Pull request templates for defaulting the pull request description to the agreed upon structure

We also shouldn't be making or changing major design decisions in a code review. Major architectural design changes late in the process are going to incur a lot of waste. We'd ideally catch these earlier in the process in an architectural review step or similar. This is another reason why pairing is a good option: we can catch design issues earlier and give that feedback immediately.

If you find that changes with poor design are ending up in code review, it could be an opportunity to run a blameless post-mortem to understand where your process is going wrong.

Summary

Code reviews give you an opportunity to learn through knowledge sharing, increase product quality and sustain delivery over time. Some approaches you can use for success are: viewing the full context of the change, using conventional comments, using a code review checklist and getting team consensus.

The ideal code review for me depends on what kind of change we're reviewing, but would involve:

Clear context and description of the change by the author.
A respectful and insightful discussion between author and reviewers.
Both the author and reviewers learning something that they can bring to future changes.
A high quality code change that brings value to the customer.

With the approaches outlined in this post, we can trend towards that ideal.

Applying DevOps Principles to Robotics

Simon Bracegirdle — Mon, 20 Dec 2021 07:10:42 +0000

The robotics industry continues to evolve as shown by the growing adoption of commercial quadruped robots. As interest in the field grows so does the number of teams building their own robotics software.

As adoption of their product grows, robotics teams may look to automate their tooling so that they can maintain a high delivery rate whilst sustaining product stability, safety and reliability.

DevOps is also increasing in popularity, with the Google State of DevOps report indicating an increasing number of high performing teams adopting DevOps habits. The DevOps research shows that the four key metrics — Deployment Frequency, Lead Time, Mean Time to Recovery (MTTR) and Change Failure Rate — are indicative of teams that deliver more value to customers.
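To show how the four key metrics can be derived, here's a small worked sketch over a made-up deployment log. The record shape (`committedAt`, `deployedAt`, `failed`, `restoredAt`) and the numbers are assumptions for illustration, and the definitions are simplified compared with how the DevOps research measures them.

```javascript
// Hypothetical deployment records; times are hours since the window started.
const deployments = [
  { committedAt: 0, deployedAt: 4, failed: false },
  { committedAt: 10, deployedAt: 12, failed: true, restoredAt: 13 },
  { committedAt: 30, deployedAt: 36, failed: false },
  { committedAt: 40, deployedAt: 48, failed: true, restoredAt: 51 },
];

// Arithmetic mean, returning 0 for an empty list.
const mean = (values) =>
  values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;

// Simplified versions of the four key metrics over a window of `windowDays`.
function fourKeyMetrics(records, windowDays) {
  const failures = records.filter((r) => r.failed);
  return {
    deploymentsPerDay: records.length / windowDays,
    leadTimeHours: mean(records.map((r) => r.deployedAt - r.committedAt)),
    changeFailureRate:
      records.length === 0 ? 0 : failures.length / records.length,
    meanTimeToRecoveryHours: mean(
      failures.map((r) => r.restoredAt - r.deployedAt)
    ),
  };
}

console.log(fourKeyMetrics(deployments, 2));
```

For this sample the lead times are 4, 2, 6 and 8 hours (mean 5), two of four deployments failed (rate 0.5), and recovery took 1 and 3 hours (mean 2).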

But can we apply DevOps principles to Robotics teams to help increase the cadence of their development — a RoboOps approach? Can we apply the principles in a way that will result in more value for robotics customers?

What challenges does robotics face?

Let's start by looking at the challenges unique to robotics.

One unique aspect to robotics is the need for integration with hardware that carries out complex tasks. This makes testing more challenging, because hardware is subject to the physical, practical and financial constraints of the real world.

Robots can do real world harm and so we have a responsibility to be mindful of that and do everything we can to minimise the risk.

For example, if you're building a robot that autonomously navigates around the customer's house, you'll want to test the reliability of that system by running a series of test scenarios for each change.

But running a set of fixed scenarios is not enough. Robots are not deterministic systems and have a high degree of variability. Test suites must take this into account by repeating tests with an appropriate level of random input noise.
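As a rough sketch of that idea, the snippet below repeats a single scenario many times with seeded random noise added to a sensor reading, and counts how often the behaviour is wrong. The `shouldBrake` function is a toy stand-in for a real system under test, and the distances, noise amplitudes and seed are invented for illustration.

```javascript
// Small seeded pseudo-random generator (mulberry32 style) so that a failing
// run can be reproduced from its seed. Returns numbers in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Toy system under test: brake when the measured range is within 1 metre.
function shouldBrake(measuredDistance, brakingDistance = 1.0) {
  return measuredDistance <= brakingDistance;
}

// Repeats one scenario with uniform noise on the sensor reading. The obstacle
// is truly inside braking distance, so every run is expected to brake.
function runNoisyScenario({ trueDistance, noiseAmplitude, runs, seed }) {
  const random = mulberry32(seed);
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    const noise = (random() * 2 - 1) * noiseAmplitude;
    if (!shouldBrake(trueDistance + noise)) failures++;
  }
  return { runs, failures, failureRate: failures / runs };
}

const report = runNoisyScenario({
  trueDistance: 0.8,
  noiseAmplitude: 0.5,
  runs: 1000,
  seed: 7,
});
console.log(report);
```

With a small noise amplitude such as 0.1 the scenario passes every time, while at 0.5 some runs miss the obstacle. A single fixed test would never show that difference, which is the point of repeating with noise.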

Testing and developing on hardware can also be difficult because a team may be sharing a limited set of test devices, often constrained to a dedicated test area such as the office garage or car park. This makes it difficult to adequately stress the system with conditions that are representative of their target environment.

In the following sections we'll take a look at some tools that we can utilise in combination with DevOps principles, that may help with teams looking to scale up their development and customer base whilst addressing robotics challenges.

Tool #1 — Use of simulation in CI

Simulation is a tool for executing tests on robotics behaviour and systems in a software environment that emulates the physical properties of the real world as closely as possible. Simulations are only a representation and never precisely match the real world, so don't treat them as ground truth, but they're still a useful testing tool.

The key benefit of leveraging robotics simulation is that you can execute tests without any hardware dependencies. Without this constraint, simulations can run on a variety of compute environments: developer laptops, cloud based environments such as AWS RoboMaker, Linux containers and more.

This freedom of choice and ease of use frees up team members to develop and test when they like, or as part of an automated testing / continuous integration (CI) toolchain.

Since simulations are not constrained by the physical limits of hardware, you can run them in parallel and at accelerated time rates. This can reduce the time to receive feedback when running test suites.

If a simulation is well designed and runs on every code change it can help to increase deployment frequency and reduce the change failure rate. Teams that are confident in their changes will commit without fear and trust the tool to do its job.

Are there any downsides to simulation? Well, it does take some effort to set up and can be computationally expensive in some cases, so it's worth assessing its value for your context.

For example, if you're building a simple system with minimal failure modes, then it may be hard to justify the investment in setting up simulation.

In other cases there are some platforms that do not have good simulation support, in which case you may not have a choice.

How do you integrate simulation into your workflow?

By analogy with traditional software testing, simulation is best characterised as a form of integration testing, since it involves testing the robotics software as a whole, or a subset of the whole, including the impact of software on hardware.

Whilst simulation is faster than real world testing, it's still slower than unit testing and can be computationally expensive.

Given this, if you're planning to run it automatically as part of a CI/CD toolchain, then it's best suited to run after or parallel to unit testing, but before deployment to your non-production environment.

Tool #2 — Automate hardware-in-the-loop testing in CI

What is hardware-in-the-loop testing?

Automating hardware testing involves creating a dedicated space and a scheduled time for use of that space in which test suites run against real hardware.

This may involve talking with others that share this area and agreeing on its use. This would include deciding when and how testing takes place, and publishing a schedule if necessary.

No matter how good your simulation, for products that include a hardware component, nothing is going to be as good for testing as the real thing.

If your team is making a lot of changes to the product, it can get tedious to set up that testing manually. This is where kicking off testing on real hardware automatically from a CI/CD pipeline can be useful.

For example, say you make a small code change and commit it to master. Unit tests and simulation run automatically. If they pass and the test area is available, the pipeline deploys the latest build to your test hardware and runs tests in the real world.
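That ordering can be sketched as a tiny pipeline runner. This is not the configuration of any particular CI tool. The stage names, the `canStart` hook and the `testAreaFree` flag are all hypothetical, and each `run` function stands in for a real CI job.

```javascript
// Runs stages in order and stops at the first one that fails or cannot start.
function runPipeline(stages) {
  const results = [];
  for (const stage of stages) {
    // Optional gate, e.g. "is the shared test area free right now?"
    if (stage.canStart && !stage.canStart()) {
      results.push({ name: stage.name, status: "waiting" });
      break;
    }
    const passed = stage.run();
    results.push({ name: stage.name, status: passed ? "passed" : "failed" });
    if (!passed) break;
  }
  return results;
}

// Example wiring: cheap checks first, hardware last and only when available.
function buildStages({ testAreaFree }) {
  return [
    { name: "unit-tests", run: () => true },
    { name: "simulation", run: () => true },
    {
      name: "hardware-in-the-loop",
      canStart: () => testAreaFree,
      run: () => true,
    },
  ];
}

console.log(runPipeline(buildStages({ testAreaFree: false })));
```

Putting the cheapest checks first means the shared hardware is only booked for changes that have already passed unit tests and simulation.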

How does automated hardware-in-the-loop benefit robotics teams?

The key benefit of integrating this process with CI is to support a high rate of change, making it suitable for teams wanting to write smaller changes and commit more frequently.

Adding hardware testing to your pipeline, alongside unit testing and simulation will form a solid foundation for a comprehensive test suite that will stop issues leaking out to your fleet.

Integrating on-hardware testing into your pipeline will support increasing deployment frequency and reducing change failure rate, by allowing you to push your change and forget it. Set up alerting to let you know if a pipeline fails.

Teams with confidence in their test suite and pipeline will be happy to push and rely on the tooling to do its job, knowing that it's unlikely a serious bug will get into production, and if it does then it'll get rolled back fast.

How do you integrate automated hardware-in-the-loop into your workflow?

To run tests on real hardware you must first deploy your change to that hardware. Ideally you would also run unit tests and simulation tests before that deployment takes place.

If you separate your fleet into staging (or non-production, or the test environment, whatever you want to call it) and production, then you could first deploy to your staging environment, run the tests on real hardware, and then proceed to deploying to production.

It may also be worth running the tests again after you deploy to the live environment.

Tool #3 — Continuous fleet deployment

What are fleet deployments?

If you deploy software to a group of robots or IoT devices, that's a fleet deployment.

Fleet deployments differ from conventional software deployments in that devices and robots are not co-located. They're often spread out, or on different sites and different networks entirely.

Fleet devices can have unreliable connectivity, due to moving through wifi hot spots or using mobile connections. This means fleet deployments are less reliable than common software deployments. You may get more reliable connectivity at one site versus another.

Deployments can happen manually, or automatically through a CI/CD pipeline. With the latter, developers will commit changes to a branch, run tests and deploy.

Ideally the changes deploy to a staging or non production fleet first, which allows for on-device testing to take place before deployment to the live production fleet.

A manual gate may block the final deployment step for teams that don't have a high degree of automation. But ideally you would automate and build confidence in your testing so you can deploy automatically to production after tests pass.

Don't forget to test in production too. Whilst it's good practice to keep environments the same, there are still differences in data and usage. Testing after deployment will uncover any of those further issues and verify the deployment was successful.

One approach is to use a canary-style deployment: apply updates to a subset of devices at a time and increase the deployment linearly or exponentially. This opens the opportunity to identify issues and roll them back before they impact the entire fleet.

Being able to roll back failed changes automatically is also important. You need a path back to operation without human intervention. This can save phone calls, reduce time to recovery and limit impact to customers.
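Here's a minimal sketch of a canary rollout with automatic rollback. It isn't the API of any fleet management tool. The `deploy`, `isHealthy` and `rollback` hooks are hypothetical callbacks you would back with your own tooling, and "exponential" is assumed to mean the batch size doubles at each step.

```javascript
// Splits a fleet into rollout batches. "linear" keeps every batch at the base
// size; "exponential" doubles the batch size at each step.
function planBatches(deviceIds, { strategy, baseSize }) {
  if (!Number.isInteger(baseSize) || baseSize < 1) {
    throw new RangeError("baseSize must be a positive integer");
  }
  const batches = [];
  let size = baseSize;
  let index = 0;
  while (index < deviceIds.length) {
    batches.push(deviceIds.slice(index, index + size));
    index += size;
    if (strategy === "exponential") size *= 2;
  }
  return batches;
}

// Deploys batch by batch. If any device in a batch is unhealthy afterwards,
// every device updated so far is rolled back and the rollout stops.
function canaryDeploy(deviceIds, options, { deploy, isHealthy, rollback }) {
  const updated = [];
  for (const batch of planBatches(deviceIds, options)) {
    for (const id of batch) {
      deploy(id);
      updated.push(id);
    }
    if (!batch.every(isHealthy)) {
      updated.forEach(rollback);
      return { status: "rolled_back", rolledBack: updated };
    }
  }
  return { status: "complete", updated };
}

const fleet = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9"];
const sizes = planBatches(fleet, { strategy: "exponential", baseSize: 1 }).map(
  (batch) => batch.length
);
console.log(sizes); // [ 1, 2, 4, 3 ]
```

The first batch is deliberately tiny so that a bad build only reaches one robot before the health check has a chance to stop the rollout.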

How does automated fleet deployment benefit robotics teams?

The key benefit of automating deployment is to support an increase in deployment frequency, and to reduce the mean time to recovery.

By having a process that takes you from code commit to production deployment, verifying the change at each step, and by using that process frequently, you will build confidence in it and move towards pushing smaller changes more frequently.

This allows you to respond faster to bug fixes, to incidents, to small improvements, to everything. You can deliver more value to your customer with less lead time.

How do you automate fleet deployment in your workflow?

The first step to automating fleet deployment is to integrate it with your CI/CD tool, for example GitHub Actions or AWS CodePipeline.

The deployment steps conventionally take place after build and test. First deploy to a staging environment, conduct further testing and then deploy to your live production environment.

To assist in deployment and management of workloads in your fleet, it's worth taking advantage of a fleet or device management tool such as AWS IoT Greengrass, Formant or Rocos.

For example, Greengrass V2 components can deploy software to your robot or device fleet. It supports rolling deployments with options for linear or exponential steps.

Summary

In this post we introduced three tools that help to support a DevOps approach to development when integrated with your CI/CD pipeline: simulation, automated hardware testing and continuous fleet deployments.

These tools aren't silver bullets — on their own they won't solve your scaling or delivery woes and they're not suitable for all teams. But they can be powerful tools that can help you take positive steps towards a DevOps approach and delivering more value to your customers.