<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Grunet</title>
    <description>The latest articles on DEV Community by Grunet (@grunet).</description>
    <link>https://dev.to/grunet</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F841064%2Ff6a1df07-6795-42f3-99eb-9df786d82249.jpeg</url>
      <title>DEV Community: Grunet</title>
      <link>https://dev.to/grunet</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/grunet"/>
    <language>en</language>
    <item>
      <title>Making a Totally Free Uptime Monitor using a Worker Runtime and OpenTelemetry</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Mon, 03 Jun 2024 17:09:33 +0000</pubDate>
      <link>https://dev.to/grunet/making-a-totally-free-uptime-monitor-using-a-worker-runtime-and-opentelemetry-1bha</link>
      <guid>https://dev.to/grunet/making-a-totally-free-uptime-monitor-using-a-worker-runtime-and-opentelemetry-1bha</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What is an Uptime Monitor and When to Use One?&lt;/li&gt;
&lt;li&gt;Traditional Options&lt;/li&gt;
&lt;li&gt;
Using a Worker Runtime and OpenTelemetry

&lt;ul&gt;
&lt;li&gt;The High-Level Solution&lt;/li&gt;
&lt;li&gt;The High-Level Setup Steps&lt;/li&gt;
&lt;li&gt;Comparison to the Other Options&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Takeaway&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is an Uptime Monitor and When to Use One?
&lt;/h2&gt;

&lt;p&gt;An uptime monitor is a tool that periodically (e.g. every minute) checks your application or API to gauge if it’s up and healthy.&lt;/p&gt;

&lt;p&gt;If you have true observability and are using SLOs effectively you probably don’t need to use one. But if you’re not at that level yet, an uptime monitor can be a valuable information source regarding the reliability of your application or API.&lt;/p&gt;
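&lt;p&gt;To make the idea concrete, here’s a minimal sketch of what a single check might look like (assuming a Node 18+ runtime with a global &lt;code&gt;fetch&lt;/code&gt;; the endpoint is a placeholder):&lt;/p&gt;

```javascript
// Minimal sketch of one uptime check. The endpoint URL is a placeholder.

// A common alerting condition: any 4xx/5xx response is unhealthy.
function isUnhealthy(statusCode) {
  return statusCode >= 400;
}

async function checkOnce(endpoint) {
  try {
    const resp = await fetch(endpoint);
    return { up: !isUnhealthy(resp.status), status: resp.status };
  } catch (_err) {
    // Network-level failures (DNS, TLS, timeouts) also count as down.
    return { up: false, status: null };
  }
}
```

&lt;p&gt;An actual monitor runs a check like this on a schedule and notifies someone whenever &lt;code&gt;up&lt;/code&gt; comes back false.&lt;/p&gt;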

&lt;h2&gt;
  
  
  Traditional Options
&lt;/h2&gt;

&lt;p&gt;There are a number of ways to run an uptime monitor. For example,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running a cron job on a server/VM and using bash, curl, and webhooks&lt;/li&gt;
&lt;li&gt;Setting up an EventBridge cron with Container/Lambda targets and webhooks&lt;/li&gt;
&lt;li&gt;Paying for a 3rd party service (e.g. Pingdom)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these comes with its own downsides, though:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintenance (e.g. security patching, keeping away from end-of-life states)&lt;/li&gt;
&lt;li&gt;Complexity (e.g. setting up IaC, CI/CD)&lt;/li&gt;
&lt;li&gt;Cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Is there an option that avoids these downsides?&lt;/p&gt;

&lt;h2&gt;
  
  
  Using a Worker Runtime and OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;I contend there is: using a &lt;a href="https://workers.js.org/"&gt;worker runtime&lt;/a&gt; together with &lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The High-Level Solution
&lt;/h3&gt;

&lt;p&gt;The solution maps out at a high level as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use a cron from a worker runtime&lt;/li&gt;
&lt;li&gt;Have the worker hit the application or API endpoint&lt;/li&gt;
&lt;li&gt;Gather instrumentation about the network call with OpenTelemetry&lt;/li&gt;
&lt;li&gt;Send that OpenTelemetry instrumentation to an observability backend&lt;/li&gt;
&lt;li&gt;Use the observability backend to alert on unhealthy traffic &lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The High-Level Setup Steps
&lt;/h3&gt;

&lt;p&gt;These steps will use Cloudflare Workers for the worker runtime, but something similar can be done with Deno Deploy as well.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dash.cloudflare.com/sign-up/workers-and-pages"&gt;Create a free Cloudflare account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://developers.cloudflare.com/workers/get-started/guide/"&gt;Create a worker&lt;/a&gt; with the following code and the &lt;a href="https://developers.cloudflare.com/workers/runtime-apis/nodejs/#enable-nodejs-from-the-cloudflare-dashboard"&gt;Node.js compatibility flag&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;instrument&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@microlabs/otel-cf-workers&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;scheduled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ENDPOINT_TO_MONITOR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;_trigger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.honeycomb.io/v1/traces&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-honeycomb-team&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;HONEYCOMB_API_KEY&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ENDPOINT_NAME&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.cloudflare.com/workers/configuration/environment-variables/#add-environment-variables-via-the-dashboard"&gt;Add an environment variable&lt;/a&gt; named “ENDPOINT_TO_MONITOR” with the endpoint to check and add another environment variable named “ENDPOINT_NAME” with a friendly name for the endpoint&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.notion.so/Making-a-Totally-Free-Uptime-Monitor-using-a-Worker-Runtime-and-OpenTelemetry-0a4636936b3c40f38dd8c4a474145aec?pvs=21"&gt;Create a free Honeycomb account&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create an environment named “Uptime Monitors” and &lt;a href="https://docs.honeycomb.io/get-started/configure/environments/manage-api-keys/"&gt;create an ingest key&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Back in Cloudflare, take that ingest key and copy-paste it into a &lt;a href="https://developers.cloudflare.com/workers/configuration/secrets/#via-the-dashboard"&gt;Cloudflare Workers secret&lt;/a&gt; named “HONEYCOMB_API_KEY”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.cloudflare.com/workers/configuration/cron-triggers/#via-the-dashboard"&gt;Add a cron&lt;/a&gt; of “* * * * *” to the worker&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;(Confirm that traces are appearing every minute in Honeycomb)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In Honeycomb, &lt;a href="https://docs.honeycomb.io/notify/alert/triggers/create/"&gt;create a trigger&lt;/a&gt; (alert) based on the query&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;COUNT&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="nx"&gt;where&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Route the trigger’s notifications as needed (e.g. to Slack)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You should now have a functioning uptime monitor for your endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison to the Other Options
&lt;/h3&gt;

&lt;p&gt;Compared to the other options outlined earlier, this solution has&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimal maintenance (just a single npm package and its dependencies to monitor for security vulnerabilities)&lt;/li&gt;
&lt;li&gt;Minimal complexity (just the steps outlined above)&lt;/li&gt;
&lt;li&gt;Totally free (usage falls well within the &lt;a href="https://developers.cloudflare.com/workers/platform/pricing/#workers"&gt;Cloudflare Workers free tier&lt;/a&gt; and &lt;a href="https://www.honeycomb.io/pricing"&gt;Honeycomb free tier&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Paying for an uptime monitor service is probably preferable to this (if you’re able to).&lt;/p&gt;

&lt;p&gt;The real takeaway is that this newer form of compute (worker runtimes) has a cost model that can be taken advantage of in situations like this one.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Change Plans: A Subtle Superpower</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Tue, 21 May 2024 16:14:35 +0000</pubDate>
      <link>https://dev.to/grunet/change-plans-a-subtle-superpower-52km</link>
      <guid>https://dev.to/grunet/change-plans-a-subtle-superpower-52km</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What is a Change Plan?&lt;/li&gt;
&lt;li&gt;
What Benefits Do Change Plans Bring?

&lt;ul&gt;
&lt;li&gt;Force You To Think&lt;/li&gt;
&lt;li&gt;Peer Review&lt;/li&gt;
&lt;li&gt;Drive Clarification&lt;/li&gt;
&lt;li&gt;Facilitate Discussion&lt;/li&gt;
&lt;li&gt;Discoverable&lt;/li&gt;
&lt;li&gt;Auditable&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
A Change Plan Template in Detail

&lt;ul&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;li&gt;Impact&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Communication Plan&lt;/li&gt;
&lt;li&gt;Test plan&lt;/li&gt;
&lt;li&gt;Before the change&lt;/li&gt;
&lt;li&gt;Steps&lt;/li&gt;
&lt;li&gt;Monitoring&lt;/li&gt;
&lt;li&gt;Backout plan&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Takeaway&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is a Change Plan?
&lt;/h2&gt;

&lt;p&gt;A change plan is a document describing a plan to make a nonstandard change to a production environment. For example, manually making changes to hand-curated virtual machines.&lt;/p&gt;

&lt;p&gt;The outline of a change plan might look something like this&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;li&gt;Impact&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Communication Plan&lt;/li&gt;
&lt;li&gt;Test Plan&lt;/li&gt;
&lt;li&gt;Before the change&lt;/li&gt;
&lt;li&gt;Steps&lt;/li&gt;
&lt;li&gt;Monitoring&lt;/li&gt;
&lt;li&gt;Backout Plan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before diving into each of these sections, let’s discuss why change plans are helpful to begin with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Benefits Do Change Plans Bring?
&lt;/h2&gt;

&lt;p&gt;There are several distinct benefits change plans bring to the table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Force You To Think
&lt;/h3&gt;

&lt;p&gt;Like any template, a change plan template prompts you to consider each section carefully, determine whether it’s applicable to the change at hand, and fill out the section if so.&lt;/p&gt;

&lt;p&gt;Without a change plan template, it can be easy to forget important aspects of a change, e.g. having a backout plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Peer Review
&lt;/h3&gt;

&lt;p&gt;Just as with code review, peer review of change plans can be powerful. Not only do you get extra scrutiny of the plan, but you can also get bidirectional knowledge sharing between the participants.&lt;/p&gt;

&lt;p&gt;Without a change plan, there is no knowledge sharing and there’s increased risk of the change going awry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drive Clarification
&lt;/h3&gt;

&lt;p&gt;The steps of a change plan need to be detailed down to a point where anyone could follow them. This forces any ambiguous language to be clarified, increasing the likelihood of the steps being followed correctly and with the correct outcomes.&lt;/p&gt;

&lt;p&gt;Without a change plan, the steps might be determined on-the-fly and may result in mistakes being made.&lt;/p&gt;

&lt;h3&gt;
  
  
  Facilitate Discussion
&lt;/h3&gt;

&lt;p&gt;With a change plan external participants can comment on and engage in discussions about the change. For example, a Product Manager might request the date of a change be moved since it overlaps with a big feature release.&lt;/p&gt;

&lt;p&gt;Without a change plan there’s no artifact to structure discussion around and, worse, external stakeholders might not be aware a change is happening at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discoverable
&lt;/h3&gt;

&lt;p&gt;With a change plan, the change exists in documented history and can be examined by people in the future. For example, people trying to make a similar change might review and learn from it.&lt;/p&gt;

&lt;p&gt;Without a change plan, the details of the change are lost after it’s performed and no one other than the performer knows about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auditable
&lt;/h3&gt;

&lt;p&gt;A change plan can serve as an artifact for auditors to confirm that you’re following change management procedures correctly.&lt;/p&gt;

&lt;p&gt;Without a change plan some other artifact needs to be created for auditing purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Change Plan Template in Detail
&lt;/h2&gt;

&lt;p&gt;What follows is an example of a change plan template used at a previous job of mine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;A few sentences to let the reviewer know what we are doing and why we are doing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact
&lt;/h3&gt;

&lt;p&gt;How many customers or internal users will be impacted if things go right? How about if things go south / pear-shaped / blow up?&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;This change increases | temporarily decreases | decreases | has no impact on security. Another sentence or two to justify why temporarily or permanently decreasing security is a good idea.&lt;/p&gt;

&lt;h3&gt;
  
  
  Communication Plan
&lt;/h3&gt;

&lt;p&gt;How are you going to communicate to internal users, support, etc. that the change is happening? Consider that if the change impacts users or causes downtime, you may need to communicate the change weeks in advance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test plan
&lt;/h3&gt;

&lt;p&gt;If this is a high risk or complex change (or not easy to back out), how are you going to test this first? If you are not going to test it first, justify that it is either easy to back out or otherwise low risk. A reviewer might have suggestions of what needs to be tested or how to test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before the change
&lt;/h3&gt;

&lt;p&gt;What steps are you going to take to prepare for the change, or to stage or test things before you make the change?&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Backup or save current state…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Do something&lt;/p&gt;

&lt;p&gt;Run a command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;this is a command that you run that you will copy/paste during the change&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Test that the change worked!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;This is some expected output you should see&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If it didn’t work, do the backout steps and try again… be specific about what happens if things go wrong.&lt;/p&gt;

&lt;p&gt;Any follow-up or cleanup steps&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;p&gt;What can we monitor to know that this change worked?&lt;/p&gt;

&lt;p&gt;How can we check for unexpected side effects on the application?&lt;/p&gt;

&lt;p&gt;What other parts of the application could be affected by this change?&lt;/p&gt;

&lt;h3&gt;
  
  
  Backout plan
&lt;/h3&gt;

&lt;p&gt;The same as Steps but specifically how you would back out changes. If you can’t back out the change, note it here.&lt;/p&gt;

&lt;p&gt;Undo some stuff&lt;/p&gt;

&lt;p&gt;Undo some other stuff&lt;/p&gt;

&lt;p&gt;Check that the undoing worked&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Change plans are an excellent tool to help you manage your ever changing production environments. While they may seem like pure overhead at first, they shine when faced with uncertain, complex, or risky changes.&lt;/p&gt;

</description>
      <category>operations</category>
      <category>devops</category>
      <category>codereview</category>
    </item>
    <item>
      <title>Gushing Over AWS Application Load Balancer Access Logs</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Wed, 15 May 2024 20:26:49 +0000</pubDate>
      <link>https://dev.to/grunet/gushing-over-aws-application-load-balancer-access-logs-2bf</link>
      <guid>https://dev.to/grunet/gushing-over-aws-application-load-balancer-access-logs-2bf</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What They Are&lt;/li&gt;
&lt;li&gt;
Why They Are Great

&lt;ul&gt;
&lt;li&gt;Chock Full of Details&lt;/li&gt;
&lt;li&gt;Close to End-User Behavior and Pain&lt;/li&gt;
&lt;li&gt;Non-Invasive to Enable&lt;/li&gt;
&lt;li&gt;Supported by Vendors&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Some Frustrations

&lt;ul&gt;
&lt;li&gt;Batching&lt;/li&gt;
&lt;li&gt;Poor Integrations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Takeaway&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What They Are
&lt;/h2&gt;

&lt;p&gt;Auto-generated logs that capture details on each request that passes through an Application Load Balancer (ALB) and onward to a backend target. (&lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html" rel="noopener noreferrer"&gt;Here is their main AWS doc page&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Why They Are Great
&lt;/h2&gt;

&lt;p&gt;There are multiple reasons to get excited about these logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla2pg6giyus13kozvzj1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla2pg6giyus13kozvzj1.jpg" alt="An anime character with mouth open wide and eyes widened in fascination of something. The character is Boji from Ranking of Kings."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Chock Full of Details
&lt;/h3&gt;

&lt;p&gt;ALB access logs include a huge amount of information on each request, for example&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request Method&lt;/li&gt;
&lt;li&gt;Request Path (including query parameters)&lt;/li&gt;
&lt;li&gt;Client IP Address&lt;/li&gt;
&lt;li&gt;User Agent&lt;/li&gt;
&lt;li&gt;Request Start Time&lt;/li&gt;
&lt;li&gt;Request End Time&lt;/li&gt;
&lt;li&gt;Request Duration&lt;/li&gt;
&lt;li&gt;Load Balancer Status Code&lt;/li&gt;
&lt;li&gt;Target Status Code&lt;/li&gt;
&lt;li&gt;Target Internal IP Address&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s just to mention a few! (&lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-entry-syntax" rel="noopener noreferrer"&gt;Here is the full list of attributes&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;With just these attributes you can do things like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Look for requests coming from a particular ip address&lt;/li&gt;
&lt;li&gt;Look for requests blocked by the Web Application Firewall (403s for the load balancer status code with no target status code)&lt;/li&gt;
&lt;li&gt;Look for a service that temporarily went down (502s for the load balancer status code with no target status code)&lt;/li&gt;
&lt;/ul&gt;
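&lt;p&gt;Those checks can be run directly against raw log lines. Here’s a simplified sketch of a parser (field positions follow the documented entry syntax; the quoted-field handling is naive and assumes no escaped quotes):&lt;/p&gt;

```javascript
// Simplified sketch: pull the two status-code fields out of an ALB access
// log line. Fields are space-delimited, with some fields wrapped in quotes.
function parseAlbLine(line) {
  // Match either a quoted field or a run of non-space characters.
  const fields = line.match(/"[^"]*"|\S+/g) ?? [];
  return {
    elbStatusCode: fields[8],    // load balancer status code
    targetStatusCode: fields[9], // target status code ("-" if no target responded)
  };
}

// Label the two example cases from the list above.
function classify(entry) {
  const noTargetResponse = entry.targetStatusCode === "-";
  if (noTargetResponse) {
    if (entry.elbStatusCode === "403") return "blocked-before-target"; // e.g. a WAF block
    if (entry.elbStatusCode === "502") return "target-unreachable";    // e.g. service down
  }
  return "other";
}
```

&lt;p&gt;Running something like this over a batch of log files gives a quick census of firewall blocks and backend blips without any extra infrastructure.&lt;/p&gt;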

&lt;h3&gt;
  
  
  Close to End-User Behavior and Pain
&lt;/h3&gt;

&lt;p&gt;Access logs from a public-facing ALB give the closest representation of what end-users are doing with your application (client-side instrumentation aside) since they record every single request (no sampling).&lt;/p&gt;

&lt;p&gt;They also give the best representation of the pain your end-users are facing (e.g. looking at 5xx’s the load balancer is returning).&lt;/p&gt;

&lt;p&gt;Backend instrumentation alone will always be missing part of the picture (e.g. when a backend service is hard down, or because of sampling).&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-Invasive to Enable
&lt;/h3&gt;

&lt;p&gt;No application code changes or instrumentation with 3rd party agents/libraries are required to turn on ALB access logs.&lt;/p&gt;

&lt;p&gt;Just &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/enable-access-logging.html" rel="noopener noreferrer"&gt;enable them&lt;/a&gt; like you configure the other parts of your infrastructure via IaC, the CLI, or ClickOps. Then watch the log files start to show up in S3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Supported by Vendors
&lt;/h3&gt;

&lt;p&gt;Most observability vendors offer a solution for ingesting ALB access log files into their platform for querying there (e.g. a Lambda that can trigger off the log bucket’s new-object-created event).&lt;/p&gt;

&lt;p&gt;Alternatively, Athena can be used to query them from within AWS.&lt;/p&gt;
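&lt;p&gt;The vendor-ingestion route usually boils down to a small Lambda that fires on each new log object. A sketch of its skeleton (only the event parsing is shown concretely; the fetching and forwarding steps are left as comments, and the key decoding follows the standard S3 event notification encoding):&lt;/p&gt;

```javascript
// Skeleton of a log-forwarding Lambda triggered by S3 object-created events.

// S3 event notification keys are URL-encoded, with spaces encoded as "+".
function extractS3Ref(event) {
  const record = event.Records[0].s3;
  return {
    bucket: record.bucket.name,
    key: decodeURIComponent(record.object.key.replace(/\+/g, " ")),
  };
}

async function handler(event) {
  const { bucket, key } = extractS3Ref(event);
  // 1. GetObject(bucket, key) with the AWS SDK
  // 2. gunzip the body (ALB delivers access logs gzip-compressed)
  // 3. split into lines and POST them to the vendor's intake endpoint
}
```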

&lt;h2&gt;
  
  
  Some Frustrations
&lt;/h2&gt;

&lt;p&gt;To be honest, these logs aren't perfect and have a few notable downsides.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jzvs04m7utffe9m8foy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jzvs04m7utffe9m8foy.png" alt="A slightly chubby yellow creature that looks like a duck without a neck standing upright on its back flippers. It's holding its front flippers to the sides of its head. It's Psyduck the pokemon having a headache."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Batching
&lt;/h3&gt;

&lt;p&gt;An ALB batches the access logs it generates and sends them to their S3 bucket once every 5 minutes. This means they’re not suitable for situations where human response times to changes need to be faster than that. This is one area where backend instrumentation wins out.&lt;/p&gt;

&lt;p&gt;(I believe GCP’s analogous logs don’t have this limitation.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Poor Integrations
&lt;/h3&gt;

&lt;p&gt;Web Application Firewalls (WAFs) can be configured to record their own logs of every request sent to a load balancer they protect, but as far as I know there’s no way to integrate those with ALB access logs. They are a totally separate source of info.&lt;/p&gt;

&lt;p&gt;Also, ALB access logs can’t take part in OpenTelemetry tracing as far as I know (though it would be pretty cool if they did).&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;If you’re stranded on an island and can only have one piece of telemetry, I’d recommend picking ALB access logs. They have their limitations, but in my opinion they deliver the most value for the least investment.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Handling Concurrent Load During an AWS Outage: A Tradeoff To Consider</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Sun, 06 Aug 2023 01:39:05 +0000</pubDate>
      <link>https://dev.to/grunet/handling-concurrent-load-during-an-aws-outage-a-tradeoff-to-consider-28h2</link>
      <guid>https://dev.to/grunet/handling-concurrent-load-during-an-aws-outage-a-tradeoff-to-consider-28h2</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The Main Compute Primitives for AWS&lt;/li&gt;
&lt;li&gt;
Control Planes vs Data Planes

&lt;ul&gt;
&lt;li&gt;Static Stability&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
How Each Primitive Handles Concurrency During a Control Plane Outage

&lt;ul&gt;
&lt;li&gt;EC2&lt;/li&gt;
&lt;li&gt;Fargate&lt;/li&gt;
&lt;li&gt;Lambda&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The Takeaway&lt;/li&gt;
&lt;li&gt;References&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Main Compute Primitives for AWS
&lt;/h2&gt;

&lt;p&gt;There are 3 compute primitives in AWS (Amazon Web Services) that almost all of its other compute offerings are built on top of&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Virtual Machines (i.e. EC2, Elastic Compute Cloud)&lt;/li&gt;
&lt;li&gt;Containers (e.g. Fargate for ECS, Elastic Container Service, or EKS, Elastic Kubernetes Service)&lt;/li&gt;
&lt;li&gt;Functions (i.e. Lambda)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each comes with its own set of tradeoffs, but there is one subtle tradeoff that only manifests during certain AWS outages.&lt;/p&gt;

&lt;p&gt;To understand that tradeoff, we first need to understand the concepts of “control planes” and “data planes” of AWS services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Control Planes vs Data Planes
&lt;/h2&gt;

&lt;p&gt;Every AWS compute service is separated into 2 logical components, a control plane and a data plane.&lt;/p&gt;

&lt;p&gt;The data plane is responsible for actually running the hardware and software powering the compute. Think of the physical server running a virtual machine, for example.&lt;/p&gt;

&lt;p&gt;The control plane is responsible for making changes to the data plane. If you want to add a new virtual machine to the data plane, you have to make a request to the control plane for it to do so on your behalf.&lt;/p&gt;

&lt;h3&gt;
  
  
  Static Stability
&lt;/h3&gt;

&lt;p&gt;Services are designed this way in part to be more fault tolerant. If an outage occurs in the control plane, the data plane will continue working without issue. &lt;/p&gt;

&lt;p&gt;And in general, outages in control planes are more common than outages in data planes.&lt;/p&gt;

&lt;p&gt;This leads to the concept of “static stability”, where as long as your workload doesn’t depend on control planes, it will remain stable during most AWS outages.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Each Primitive Handles Concurrency During a Control Plane Outage
&lt;/h2&gt;

&lt;p&gt;But your existing workloads being stable during an AWS outage might not be enough. What if there’s a surge in load that they need to respond to? Will they be able to scale up to meet that demand?&lt;/p&gt;

&lt;p&gt;Specifically, there’s the question of the maximum concurrency a workload can support during an AWS service control plane outage.&lt;/p&gt;

&lt;p&gt;The answer to this question (perhaps surprisingly) depends on the compute primitive involved.&lt;/p&gt;

&lt;h3&gt;
  
  
  EC2
&lt;/h3&gt;

&lt;p&gt;In normal times, in the face of increased concurrency a workload can autoscale up to handle it (e.g. an ASG, Autoscaling Group, can bring up more virtual machines).&lt;/p&gt;

&lt;p&gt;However, during an outage of the EC2 control plane this isn’t possible, since autoscaling requires requests to the control plane.&lt;/p&gt;

&lt;p&gt;This means that during the outage, the maximum concurrency a workload can support is fixed and cannot be increased. Any requests exceeding this limit will fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fargate
&lt;/h3&gt;

&lt;p&gt;Fargate behaves similarly to EC2, as starting new tasks requires a request to the ECS or EKS control plane.&lt;/p&gt;

&lt;p&gt;So during a control plane outage, any requests exceeding the fixed maximum concurrency of the workload will fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lambda
&lt;/h3&gt;

&lt;p&gt;Lambda is the odd duck out. &lt;/p&gt;

&lt;p&gt;In normal times, in the face of increased concurrency a workload can start up multiple new Lambda execution environments to handle the load.&lt;/p&gt;

&lt;p&gt;But the subtlety here is that this behavior is part of the Lambda data plane, NOT the control plane.&lt;/p&gt;

&lt;p&gt;This means that during an outage of the Lambda control plane, a workload can still handle essentially arbitrary concurrency of requests (only limited by your account’s quota on concurrent executions).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iPIUKoVS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yn5hpw73lvw3099mls3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iPIUKoVS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yn5hpw73lvw3099mls3t.png" alt="The AWS lambda icon on top of a sports podium for awarding medals like you'd find in the Olympics. There is a 1 beneath the lambda icon, then a 2 to the left of it, and a 3 to the right of it. Nothing is on top of the 2nd or 3rd place spots, they are empty." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;If you need to handle arbitrary concurrency while the control plane of a service is impaired, Lambda provides the best tradeoff.&lt;/p&gt;

&lt;p&gt;Fargate or EC2 (or any other more managed service built on top of them, e.g. Elastic Beanstalk) will not be able to meet the need.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/pdfs/whitepapers/latest/aws-fault-isolation-boundaries/aws-fault-isolation-boundaries.pdf"&gt;AWS whitepaper on fault isolation boundaries&lt;/a&gt; that defines “static stability” and references the lack of ability of EC2-based workloads to autoscale during control plane outages&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/security-overview-aws-lambda/lambda-executions.html"&gt;Lambda whitepaper that says scaling occurs at the level of the data plane&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>operations</category>
      <category>sre</category>
    </item>
    <item>
      <title>Operational Challenges for SCIM Servers</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Sun, 18 Jun 2023 17:17:47 +0000</pubDate>
      <link>https://dev.to/grunet/operational-challenges-for-scim-servers-176a</link>
      <guid>https://dev.to/grunet/operational-challenges-for-scim-servers-176a</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What is SCIM?&lt;/li&gt;
&lt;li&gt;
The Key Operational Challenges

&lt;ul&gt;
&lt;li&gt;No Load Limits&lt;/li&gt;
&lt;li&gt;All Requests Must be Synchronously Handled&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
Downstream Consequences when Faced with “Scale”

&lt;ul&gt;
&lt;li&gt;Performance Problems with 3rd Party SCIM Libraries&lt;/li&gt;
&lt;li&gt;ORM Problems Leading to Database Performance Problems Leading to Other ORM Problems&lt;/li&gt;
&lt;li&gt;Inability to Throttle Requests&lt;/li&gt;
&lt;li&gt;Inability to Queue Requests&lt;/li&gt;
&lt;li&gt;Inability to Horizontally Scale&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
Takeaways

&lt;ul&gt;
&lt;li&gt;Targeted Avoidance of the ORM-to-SCIM-to-ORM Pattern is Valuable&lt;/li&gt;
&lt;li&gt;Threading is Potentially Valuable, Maybe&lt;/li&gt;
&lt;li&gt;Pre-Production Load Testing of SCIM Servers is Valuable&lt;/li&gt;
&lt;li&gt;Recording Failed SCIM Requests in Production Telemetry is Valuable&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is SCIM?
&lt;/h2&gt;

&lt;p&gt;SCIM (System for Cross-domain Identity Management) is an open standard for user provisioning.&lt;/p&gt;

&lt;p&gt;For example, it allows an organization that is a customer of a SaaS product to easily sync all of their users’ identity information into the SaaS’s databases.&lt;/p&gt;

&lt;p&gt;The organization’s users’ identity information will often be stored in a 3rd party identity provider like Okta or Azure Active Directory (Azure AD). The identity provider will act as a SCIM client, sending requests with provisioning data to a SCIM server managed by the SaaS product.&lt;/p&gt;

&lt;p&gt;However, building a conformant and operationally sound SCIM server is a non-trivial task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Operational Challenges
&lt;/h2&gt;

&lt;p&gt;There are 2 inherent operational difficulties all SCIM server implementations must face.&lt;/p&gt;

&lt;h3&gt;
  
  
  No Load Limits
&lt;/h3&gt;

&lt;p&gt;A SCIM client has no restrictions on how quickly it can send requests to your SCIM server.&lt;/p&gt;

&lt;p&gt;In theory, this means that your SCIM server needs to be able to handle arbitrary requests at arbitrary concurrency.&lt;/p&gt;

&lt;p&gt;In practice, this means that your SCIM server needs to be able to handle the load of “the worst” SCIM client out there (Azure AD’s SCIM client has been observed to send just shy of 1,000 requests per minute at peak load).&lt;/p&gt;

&lt;h3&gt;
  
  
  All Requests Must be Synchronously Handled
&lt;/h3&gt;

&lt;p&gt;SCIM clients rely on the status code of responses to determine whether or not to retry a request (e.g. a client may retry all 500-level errors until it succeeds).&lt;/p&gt;

&lt;p&gt;This means that all request processing has to be done synchronously (i.e. it can’t be offloaded to consumers of a separate queue).&lt;/p&gt;

&lt;h2&gt;
  
  
  Downstream Consequences when Faced with “Scale”
&lt;/h2&gt;

&lt;p&gt;When faced with increased load and larger customers, several different types of problems can arise for a SCIM server implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Problems with 3rd Party SCIM Libraries
&lt;/h3&gt;

&lt;p&gt;Handling SCIM requests means writing the logic for applying a request’s changes to a SCIM resource (e.g. PATCH-ing a group).&lt;/p&gt;

&lt;p&gt;Depending on your runtime, there may be existing libraries that already have this logic ready for re-use (e.g. &lt;a href="https://www.npmjs.com/package/scim-patch"&gt;scim-patch&lt;/a&gt; for PATCH requests in Node.js). However, a library’s code may not necessarily be optimized for performance in every case. &lt;/p&gt;

&lt;p&gt;For example, a library function’s execution time may sometimes scale quadratically with the size of the request and/or SCIM resource (e.g. for groups with a large number of users). This can drastically slow down response times (think 10s of seconds) and hog server CPU (a lethal problem for single-threaded runtimes like Node.js).&lt;/p&gt;

&lt;h3&gt;
  
  
  ORM Problems Leading to Database Performance Problems Leading to Other ORM Problems
&lt;/h3&gt;

&lt;p&gt;If you’re using an ORM (Object-Relational Mapping), all of your PATCH or PUT SCIM endpoints may work as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load the ORM’s representation of the SCIM resource from the database&lt;/li&gt;
&lt;li&gt;Transform the ORM’s representation into a SCIM representation expected by the SCIM library&lt;/li&gt;
&lt;li&gt;Apply the request to the SCIM representation using the SCIM library&lt;/li&gt;
&lt;li&gt;Transform the updated SCIM representation back into an ORM representation of the resource&lt;/li&gt;
&lt;li&gt;Save the ORM representation back to the database&lt;/li&gt;
&lt;/ol&gt;
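&lt;p&gt;The steps above can be sketched roughly like this (every name here, e.g. loadGroup, toScim, saveGroup, is a hypothetical stand-in for your ORM and SCIM library):&lt;/p&gt;

```typescript
// Toy in-memory "database" standing in for the ORM's backing store.
const db = new Map([
  ["g1", { id: "g1", displayName: "Engineering", memberIds: ["u1"] }],
]);

// Step 1: load the ORM representation of the SCIM resource
const loadGroup = (id: string) => structuredClone(db.get(id)!);

// Step 2: transform it into the SCIM representation the library expects
const toScim = (g: { id: string; displayName: string; memberIds: string[] }) => ({
  id: g.id,
  displayName: g.displayName,
  members: g.memberIds.map((value) => ({ value })),
});

// Step 3: apply the request (a real server would use a library like scim-patch)
const applyAddMember = (g: any, userId: string) => ({
  ...g,
  members: [...g.members, { value: userId }],
});

// Steps 4 and 5: transform back and save. The ORM only sees the full
// member list, not the delta, which is where the trouble starts.
const fromScim = (g: any) => ({
  id: g.id,
  displayName: g.displayName,
  memberIds: g.members.map((m: any) => m.value),
});
const saveGroup = (g: any) => db.set(g.id, g);

saveGroup(fromScim(applyAddMember(toScim(loadGroup("g1")), "u2")));
console.log(db.get("g1")!.memberIds); // [ "u1", "u2" ]
```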

&lt;p&gt;The problem occurs in Step 5: the ORM has no idea what exactly changed, and so in certain cases it can end up recreating the database records from scratch.&lt;/p&gt;

&lt;p&gt;For example, consider PATCH-ing a very large group (10,000 or more users) to add 1 user to the group, where in the database users in a group are represented in a join table between the Groups and Users tables.  In Step 5, the ORM isn’t aware that 1 user was added, so instead it generates SQL to delete all the existing users from the group and then re-insert them all plus the 1 new user.&lt;/p&gt;

&lt;p&gt;This can become extremely problematic when these requests happen at high concurrency (e.g. Azure AD sending a huge number of near parallel requests to add 1 user at a time to a group). Every request requires an exclusive lock on the join table because of the deletes, so the transactions end up being processed 1 at a time at the database-level, leading to very slow query times.&lt;/p&gt;

&lt;p&gt;Because of these delays in database processing, ORM operations will start to back up in their queue and time out. ORM database connection pools will become maxed out as well. This will cause a massive percentage (e.g. 95+%) of the requests to fail; so many that no amount of retrying from the SCIM client will fix things.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y_Ol_sLm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bqsy0rh0iv3xzb5m7gqp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y_Ol_sLm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bqsy0rh0iv3xzb5m7gqp.jpg" alt="The &amp;quot;This is fine&amp;quot; popular internet meme. A two panel comic. The first panel shows an anthropomorphized dog sitting on a chair next to a table with a coffee mug on it. The room the dog is in is covered in flames and smoke covers the ceiling. The dog's eyes do not seem to betray any fear of the situation. In the second panel we zoom in on the dog's face where the dog says this is fine in a speech bubble that appears above their face. " width="561" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Inability to Throttle Requests
&lt;/h3&gt;

&lt;p&gt;Because of the severe slowdown and serialization of request processing under this high concurrency, throttling is ineffective; it just leads to timeouts at the gateway instead (e.g. 504s from the load balancer fronting the SCIM server replicas).&lt;/p&gt;

&lt;h3&gt;
  
  
  Inability to Queue Requests
&lt;/h3&gt;

&lt;p&gt;Because SCIM requires that requests be handled synchronously, putting them onto a separate queue (e.g. an SQS queue) to avoid all of the aforementioned problems won’t work: if a queue consumer fails to process a request, there’s no way to signal that failure back to the SCIM client so it knows to retry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inability to Horizontally Scale
&lt;/h3&gt;

&lt;p&gt;SCIM clients’ barrage of requests can start suddenly and end just as quickly (think minutes), which isn’t enough time for traditional autoscaling tools to respond by creating more replicas of the SCIM servers.&lt;/p&gt;

&lt;p&gt;Also, in the case of the database lock contention issue mentioned before, more replicas (and hence more available database connections) can actually make things worse, as the queue of concurrent transactions waiting on the database lock will grow even longer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;Stepping back, there are a few high-level things to take away from the previous pessimism. &lt;/p&gt;

&lt;h3&gt;
  
  
  Targeted Avoidance of the ORM-to-SCIM-to-ORM Pattern is Valuable
&lt;/h3&gt;

&lt;p&gt;In the large group scenario from before, suppose every request happens to add or remove a single user. Making that 1 change directly to the database (via the ORM), rather than creating an intermediate SCIM representation of the group that the ORM then has to save back, avoids all of the aforementioned problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Threading is Potentially Valuable, Maybe
&lt;/h3&gt;

&lt;p&gt;As an alternative to using a separate queue, using separate runtime threads and an in-memory request queue may help avoid saturation in the specific case of a server CPU bottleneck coming from request processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-Production Load Testing of SCIM Servers is Valuable
&lt;/h3&gt;

&lt;p&gt;Simulating the type and concurrency of requests that SCIM clients deliver is a valuable exercise, as it can unearth bottlenecks in the SCIM server that may otherwise not become apparent until production.&lt;/p&gt;

&lt;p&gt;You might try forking &lt;a href="https://github.com/wso2-incubator/scim2-compliance-test-suite"&gt;https://github.com/wso2-incubator/scim2-compliance-test-suite&lt;/a&gt; and adjusting it to be able to send requests at high concurrency towards this end.&lt;/p&gt;
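&lt;p&gt;A load test mainly needs a way to fire requests at a fixed concurrency. Here is a minimal sketch of such a helper (the stand-in task is a placeholder; swap in real SCIM requests against a test environment):&lt;/p&gt;

```typescript
// Run `total` async tasks with at most `concurrency` in flight at once.
async function runWithConcurrency(
  total: number,
  concurrency: number,
  task: (i: number) => Promise<void>,
) {
  let next = 0;
  const worker = async () => {
    while (next < total) {
      const i = next++;
      await task(i);
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
}

// Stand-in task; replace the body with a real SCIM request, e.g.
// await fetch(scimUrl, { method: "PATCH", headers, body }) against a test server.
let completed = 0;
await runWithConcurrency(100, 16, async () => {
  await new Promise((r) => setTimeout(r, 1));
  completed++;
});
console.log(completed); // 100
```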

&lt;h3&gt;
  
  
  Recording Failed SCIM Requests in Production Telemetry is Valuable
&lt;/h3&gt;

&lt;p&gt;The first step in addressing a novel operational problem happening with your SCIM servers in production is usually to develop some understanding of it. &lt;/p&gt;

&lt;p&gt;On top of your usual sources of telemetry (e.g. OpenTelemetry, Application Performance Monitoring tools, RDS Performance Monitoring), recording the raw SCIM request data of failed requests in your telemetry (e.g. logs, traces) can be very helpful in figuring out what exactly is going on.&lt;/p&gt;
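&lt;p&gt;As a rough sketch (the request and handler shapes here are hypothetical, not tied to any framework), a wrapper that records failed requests might look like this:&lt;/p&gt;

```typescript
// Wrap a request handler so failed SCIM requests are recorded with their
// raw bodies. The ScimRequest/Handler shapes are illustrative.
type ScimRequest = { method: string; url: string; body: string };
type Handler = (req: ScimRequest) => Promise<{ status: number }>;

function withFailureLogging(handler: Handler, log: (entry: unknown) => void): Handler {
  return async (req) => {
    const res = await handler(req);
    if (res.status >= 400) {
      // Consider scrubbing PII before shipping this to your telemetry backend.
      log({ method: req.method, url: req.url, status: res.status, rawBody: req.body });
    }
    return res;
  };
}

const failures: unknown[] = [];
const handler = withFailureLogging(async () => ({ status: 500 }), (e) => failures.push(e));
await handler({ method: "PATCH", url: "/scim/v2/Groups/g1", body: "{}" });
console.log(failures.length); // 1
```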

</description>
      <category>devops</category>
      <category>scim</category>
      <category>performance</category>
    </item>
    <item>
      <title>The Hidden Tradeoff of Keyless Auth</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Thu, 18 May 2023 02:19:46 +0000</pubDate>
      <link>https://dev.to/grunet/the-hidden-tradeoff-of-keyless-auth-48o7</link>
      <guid>https://dev.to/grunet/the-hidden-tradeoff-of-keyless-auth-48o7</guid>
      <description>&lt;h2&gt;
  
  
  What is Keyless Auth and Why Should I Care?
&lt;/h2&gt;

&lt;p&gt;Keyless auth refers to being able to authenticate to a system without using any long-lived credentials.&lt;/p&gt;

&lt;p&gt;This means getting access to a non-public system without a username/password, a public/private key pair, an access key, etc… while (somewhat magically) maintaining security.&lt;/p&gt;

&lt;p&gt;Here are a few places using it today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting AWS access from a Github Actions workflow&lt;/li&gt;
&lt;li&gt;SSH-ing into VMs using Teleport&lt;/li&gt;
&lt;li&gt;Signing artifacts using cosign and sigstore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should care because any long-lived credential you can get rid of is 1 less target for attackers to compromise.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do Keyless Auth Systems Work?
&lt;/h2&gt;

&lt;p&gt;There are usually 3 parties involved in an auth interaction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Requestor (e.g. a Github Actions workflow run)&lt;/li&gt;
&lt;li&gt;The Identity Provider (e.g. Github’s OIDC provider)&lt;/li&gt;
&lt;li&gt;The Resource Provider (e.g. an AWS account)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The flow then goes something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Requestor wants to access the Resource Provider&lt;/li&gt;
&lt;li&gt;The Requestor asks the Identity Provider for a token capturing the identity of the Requestor&lt;/li&gt;
&lt;li&gt;The Identity Provider vends it a token&lt;/li&gt;
&lt;li&gt;The Requestor sends the token to the Resource Provider&lt;/li&gt;
&lt;li&gt;The Resource Provider then sends the token back to the Identity Provider, asking if this is a valid request&lt;/li&gt;
&lt;li&gt;The Identity Provider confirms it just made that token and it’s expected&lt;/li&gt;
&lt;li&gt;The Resource Provider allows the Requestor time-limited access via temporary credentials&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The critical part here is the Identity Provider (e.g. Github’s OIDC provider) and the Resource Provider (e.g. an AWS account) have already previously established a trust relationship via configuration inside the Resource Provider. That’s what enables the Resource Provider to trust that the token isn’t malicious.&lt;/p&gt;
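&lt;p&gt;To make the trust relationship concrete, here is a toy simulation of the flow. A shared HMAC secret stands in for real OIDC signature verification (real systems use asymmetric signatures and published JWKS public keys; every name here is illustrative):&lt;/p&gt;

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";
import { Buffer } from "node:buffer";

// The pre-established trust: the Resource Provider knows how to verify
// tokens from the Identity Provider (here, a shared secret).
const IDP_SECRET = "demo-only-secret";

// Steps 2-3: the Identity Provider vends a signed token for the Requestor.
function issueToken(subject: string) {
  const payload = Buffer.from(JSON.stringify({ sub: subject })).toString("base64url");
  const sig = createHmac("sha256", IDP_SECRET).update(payload).digest("base64url");
  return `${payload}.${sig}`;
}

// Steps 5-6: the Resource Provider checks that the token really came from
// the trusted Identity Provider before granting temporary access.
function verifySubject(token: string) {
  const [payload, sig] = token.split(".");
  const expected = createHmac("sha256", IDP_SECRET).update(payload).digest("base64url");
  if (sig.length !== expected.length) return null;
  if (!timingSafeEqual(Buffer.from(sig), Buffer.from(expected))) return null;
  return JSON.parse(Buffer.from(payload, "base64url").toString()).sub as string;
}

const token = issueToken("repo:my-org/my-repo"); // Step 4: sent to the Resource Provider
console.log(verifySubject(token)); // "repo:my-org/my-repo"
console.log(verifySubject("tampered." + token.split(".")[1])); // null
```

&lt;p&gt;The takeaway from the toy: verification works only because of the pre-established trust material, which is exactly what an attacker gains by compromising the Identity Provider.&lt;/p&gt;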

&lt;p&gt;But there’s a catch to this that no one seems to talk about. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden, Unspoken Tradeoff
&lt;/h2&gt;

&lt;p&gt;Imagine that Github’s OIDC provider were to get compromised (it’s not inconceivable; it’s a massive target, just like LastPass and CircleCI were). This would mean that malicious actors could also get access to any AWS accounts configured for keyless auth from Github Actions.&lt;/p&gt;

&lt;p&gt;The same exploit is not necessarily possible if you’re managing your own long-lived AWS access keys. You can make securing them fully independent of the security posture of Github’s OIDC provider.&lt;/p&gt;

&lt;p&gt;So the tradeoff, in general, is placing trust in the Identity Provider’s security at the expense of giving up control over part of your Resource Provider’s security surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;I will continue to use keyless auth solutions as I think the tradeoff is almost always worth it.&lt;/p&gt;

&lt;p&gt;However, I will now think twice about the vendors involved before jumping for it.&lt;/p&gt;

</description>
      <category>security</category>
      <category>cicd</category>
    </item>
    <item>
      <title>The Lack of Disabled Peoples' Experiences in Web Accessibility is Concerning</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Thu, 18 May 2023 01:51:35 +0000</pubDate>
      <link>https://dev.to/grunet/the-lack-of-disabled-peoples-experiences-in-web-accessibility-is-concerning-5d8g</link>
      <guid>https://dev.to/grunet/the-lack-of-disabled-peoples-experiences-in-web-accessibility-is-concerning-5d8g</guid>
      <description>&lt;p&gt;If you've ever done frontend work around accessibility, odds are the following are true&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are abled&lt;/li&gt;
&lt;li&gt;You never met an affected disabled person in the course of the work&lt;/li&gt;
&lt;li&gt;You never learned if your changes actually helped disabled users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You may have closed a ticket, or remediated a finding from an auditor, but any learning from real disabled peoples' experiences likely didn't happen.&lt;/p&gt;

&lt;p&gt;We all mostly assume this status quo is fine. That as long as the VPAT (Voluntary Product Accessibility Template) or the ACR (Accessibility Conformance Report) look good to customers, there's nothing else to worry about.&lt;/p&gt;

&lt;p&gt;Note that this mindset would be bizarre when applied to any other measurement of a product's functionality (e.g. the revenue it generates). Having someone come in yearly to disclose to you how much money your product is or isn't making would be unacceptable on several grounds. Yet we accept it for accessibility.&lt;/p&gt;

&lt;p&gt;It's easy to say teams should do more. Teams should involve disabled people at every phase of the SDLC. Companies should hire more inclusively. But this seldom happens, as it's hard from multiple socio-organizational angles.&lt;/p&gt;

&lt;p&gt;There need to be more ways to draw from real disabled user experiences when creating on the web. Not just for large corporations with deep pockets, but small businesses too.&lt;/p&gt;

&lt;p&gt;Until something changes, the practice of accessibility will always remain concerning.&lt;/p&gt;

</description>
      <category>a11y</category>
      <category>frontend</category>
    </item>
    <item>
      <title>Leveraging OpenTelemetry in Deno</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Sat, 08 Apr 2023 00:41:14 +0000</pubDate>
      <link>https://dev.to/grunet/leveraging-opentelemetry-in-deno-45bj</link>
      <guid>https://dev.to/grunet/leveraging-opentelemetry-in-deno-45bj</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Background&lt;/li&gt;
&lt;li&gt;Goal&lt;/li&gt;
&lt;li&gt;
A Minimal Interesting Example

&lt;ul&gt;
&lt;li&gt;
The Boilerplate

&lt;ul&gt;
&lt;li&gt;Autoinstrumentation&lt;/li&gt;
&lt;li&gt;
Span Processing and Exporting

&lt;ul&gt;
&lt;li&gt;Exporting to the Console and to a Tracing Vendor &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

The App Code

&lt;ul&gt;
&lt;li&gt;OpenTelemetry’s API Surface&lt;/li&gt;
&lt;li&gt;Handling Concurrent Requests and the Need for Async Context&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

Future Directions

&lt;ul&gt;
&lt;li&gt;Adding Structured Data to Spans&lt;/li&gt;
&lt;li&gt;To Deploy and Beyond&lt;/li&gt;
&lt;li&gt;Actually Knowing CPU Time per Request&lt;/li&gt;
&lt;li&gt;Autoinstrumenting Deno-specific APIs and Libraries&lt;/li&gt;
&lt;li&gt;Putting the “Distributed” in Distributed Tracing&lt;/li&gt;
&lt;li&gt;All the Pillars&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Parting Thoughts&lt;/li&gt;

&lt;li&gt;References&lt;/li&gt;

&lt;li&gt;Console Output&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Deno is a runtime for JavaScript, TypeScript, and WebAssembly that is based on the V8 JavaScript engine and the Rust programming language.&lt;/p&gt;

&lt;p&gt;OpenTelemetry (OTEL) is a collection of tools, APIs, and SDKs. It's used to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.&lt;/p&gt;

&lt;p&gt;Until relatively recently, it wasn’t possible to bring the power of OpenTelemetry to bear on Deno. All of us were missing out on the information OpenTelemetry can gather, in particular for tracing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goal
&lt;/h2&gt;

&lt;p&gt;This article will go over, in depth, 1 simplified example of using OpenTelemetry for tracing in Deno. &lt;/p&gt;

&lt;p&gt;The aim is primarily to serve as an introduction to OpenTelemetry concepts for folks already somewhat familiar with Deno.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Minimal Interesting Example
&lt;/h2&gt;

&lt;p&gt;Here is a visualization of 1 trace emitted by the example code (taken from Honeycomb, an observability vendor).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floahutlkxrpum3p6xjn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floahutlkxrpum3p6xjn4.png" alt="3 rows, with 1 horizontal bar in each of varying lengths. The top bar is the longest and it says 16.34ms inside the bar. The 2nd bar says 12.00 ms inside the bar, and it starts a little after the top bar and ends a little after it. The 3rd bar says 0.1948ms and is very short, starting and ending just before the end of the top bar. To the left of each row there's a name given to each bar. The top bar is named handler. The second bar is named HTTP GET. The third bar is named construct body. There is a tree structure that indicates that the HTTP GET row and the contruct body row are both children of the handler row."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The "HTTP GET" span ending after its parent is due to &lt;a href="https://github.com/open-telemetry/opentelemetry-js/issues/3719" rel="noopener noreferrer"&gt;a small bug in otel-js&lt;/a&gt;. For now, just pretend it ended before its parent did.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note the following things you can tell without even looking at the code&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different parts of the code have their duration measured&lt;/li&gt;
&lt;li&gt;Outgoing HTTP requests are captured&lt;/li&gt;
&lt;li&gt;The structure of the code is probably reflected in the structure of the diagram&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And now here is the code that was used to generate that telemetry.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;registerInstrumentations&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/instrumentation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;FetchInstrumentation&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;npm:@opentelemetry/instrumentation-fetch&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeTracerProvider&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/sdk-trace-node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/resources&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;SemanticResourceAttributes&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/semantic-conventions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ConsoleSpanExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/sdk-trace-base&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OTLPTraceExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/exporter-trace-otlp-proto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;opentelemetry&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/api&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;serve&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://deno.land/std@0.180.0/http/server.ts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// autoinstrumentation.ts&lt;/span&gt;

&lt;span class="nf"&gt;registerInstrumentations&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FetchInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Monkeypatching to get past FetchInstrumentation's dependence on sdk-trace-web, which has runtime dependencies on some browser-only constructs. See https://github.com/open-telemetry/opentelemetry-js/issues/3413#issuecomment-1496834689 for more details&lt;/span&gt;
&lt;span class="c1"&gt;// Specifically for this line - https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/opentelemetry-sdk-trace-web/src/utils.ts#L310&lt;/span&gt;
&lt;span class="nx"&gt;globalThis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt; 

&lt;span class="c1"&gt;// tracing.ts&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="nx"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;SemanticResourceAttributes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SERVICE_NAME&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deno-demo-service&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;SemanticResourceAttributes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SERVICE_VERSION&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0.1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeTracerProvider&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;consoleExporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ConsoleSpanExporter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;consoleExporter&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;traceExporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Application code&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;opentelemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getTracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;deno-demo-tracer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// This call will be autoinstrumented&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://www.example.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`constructBody`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`Your user-agent is:\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unknown&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Helper code&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;instrumentedHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startActiveSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;handler&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

      &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;instrumentedHandler&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It’s a lot! But let’s take a look at it piece by piece.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Boilerplate
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Autoinstrumentation
&lt;/h4&gt;

&lt;p&gt;This is the code for the fetch autoinstrumentation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;registerInstrumentations&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/instrumentation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;FetchInstrumentation&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;npm:@opentelemetry/instrumentation-fetch&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="nf"&gt;registerInstrumentations&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FetchInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This monkeypatches the global &lt;code&gt;fetch&lt;/code&gt; so that all network calls made with it are instrumented (i.e. have data on them recorded in telemetry).&lt;/p&gt;

&lt;p&gt;It saves you the trouble of needing to instrument every &lt;code&gt;fetch&lt;/code&gt; call in application code yourself. And it also instruments &lt;code&gt;fetch&lt;/code&gt; calls your dependencies are making, which may have been otherwise impossible to track.&lt;/p&gt;
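To make the monkeypatching concrete, here is a hand-rolled, simplified sketch of the mechanism, assuming nothing beyond the standard globals. The span shape and attribute names are illustrative stand-ins, not what `FetchInstrumentation` actually emits.

```typescript
// Illustrative stand-in for the data an instrumented fetch call records
interface RecordedSpan {
  name: string;
  attributes: { [key: string]: string | number };
  durationMs: number;
}

const spans: RecordedSpan[] = [];

// Replace the global fetch with a wrapper that records a pseudo-span
// for every call, whether it came from app code or a dependency
function patchFetch(): void {
  const originalFetch = globalThis.fetch;
  globalThis.fetch = async function (input: any, init?: any) {
    const start = performance.now();
    try {
      return await originalFetch(input, init);
    } finally {
      spans.push({
        name: "HTTP " + ((init && init.method) || "GET"),
        attributes: { "http.url": String(input) },
        durationMs: performance.now() - start,
      });
    }
  };
}
```

The real instrumentation additionally records status codes, injects trace context headers, and parents the span correctly; this only shows why a single `registerInstrumentations` call can cover every `fetch` in the process.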

&lt;h4&gt;
  
  
  Span Processing and Exporting
&lt;/h4&gt;

&lt;p&gt;This is the code for setting up the span processing and exporting pipelines.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ConsoleSpanExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/sdk-trace-base&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OTLPTraceExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/exporter-trace-otlp-proto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;consoleExporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ConsoleSpanExporter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;consoleExporter&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;traceExporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once a span has ended, it will be sent to a span processor, which will decide what to do with it and eventually pass it on to an exporter.&lt;/p&gt;

&lt;p&gt;In this case &lt;code&gt;BatchSpanProcessor&lt;/code&gt; is being used, meaning that spans are queued up in-memory and &lt;a href="https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/opentelemetry-sdk-trace-base/src/export/BatchSpanProcessorBase.ts#L218" rel="noopener noreferrer"&gt;flushed in batches&lt;/a&gt; via a &lt;code&gt;setTimeout&lt;/code&gt; &lt;a href="https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/opentelemetry-core/src/utils/environment.ts#L153" rel="noopener noreferrer"&gt;every 5 seconds&lt;/a&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  Exporting to the Console and to a Tracing Vendor
&lt;/h5&gt;

&lt;p&gt;The first “endpoint” spans are exported to is the console, via the &lt;code&gt;ConsoleSpanExporter&lt;/code&gt;. This is useful to have for debugging purposes, especially when you’re not seeing traces show up in your vendor but you are seeing them in the console.&lt;/p&gt;

&lt;p&gt;The second endpoint spans are exported to is your tracing vendor (e.g. Honeycomb, NewRelic, etc...), via the &lt;code&gt;OTLPTraceExporter&lt;/code&gt;. It will depend on the vendor, but specifying your vendor’s remote endpoint and auth credentials as environment variables should be enough:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your vendor HTTP OTLP ingest endpoint&amp;gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;vendor specific auth credentials&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This should enable the exporter code to send telemetry via OTLP (the OpenTelemetry Protocol) over HTTP to your tracing vendor.&lt;/p&gt;
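To illustrate how those two variables are consumed: per the OTLP/HTTP conventions, trace spans are POSTed to the base endpoint plus `/v1/traces`, and `OTEL_EXPORTER_OTLP_HEADERS` holds comma-separated `key=value` pairs. A simplified sketch of that resolution (the real SDK also honors per-signal variables such as `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`):

```typescript
// Resolve where trace spans get POSTed, from the two env vars above
function resolveOtlpTracesTarget(env: { [key: string]: string }) {
  const base = env["OTEL_EXPORTER_OTLP_ENDPOINT"].replace(/\/$/, "");
  const headers: { [key: string]: string } = {};
  const raw = env["OTEL_EXPORTER_OTLP_HEADERS"] || "";
  for (const pair of raw.split(",")) {
    const eq = pair.indexOf("=");
    if (eq > 0) {
      headers[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
    }
  }
  // The exporter appends the signal-specific path for traces
  return { url: base + "/v1/traces", headers };
}
```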

&lt;h3&gt;
  
  
  The App Code
&lt;/h3&gt;

&lt;h4&gt;
  
  
  OpenTelemetry’s API Surface
&lt;/h4&gt;

&lt;p&gt;Everything so far has been part of the OpenTelemetry SDK, and should not be touched directly by application code.&lt;/p&gt;

&lt;p&gt;Application code should only ever have to interact with the OpenTelemetry API. The API then hooks up to the SDK behind the scenes to do all of the things previously discussed.&lt;/p&gt;

&lt;p&gt;To create spans and other instrumentation, application code should use the OpenTelemetry API, as shown below.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;opentelemetry&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/api&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;opentelemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getTracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;deno-demo-tracer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`constructBody`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`Your user-agent is:\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unknown&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;instrumentedHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startActiveSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;handler&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

      &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A key, subtle difference here is between &lt;code&gt;startSpan&lt;/code&gt; and &lt;code&gt;startActiveSpan&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;startSpan&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;Creates a span, finds the currently “active” span, and adds the newly created span as a child of it&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;startActiveSpan&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;Creates a span, finds the currently “active” span, and adds the newly created span as a child of it.&lt;/li&gt;
&lt;li&gt;Makes the new span the “active” span, so all spans created in the function passed in the 2nd parameter will be added as child spans of it&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The difference is that &lt;code&gt;startActiveSpan&lt;/code&gt; makes its new span the active parent for the code it wraps by default, whereas &lt;code&gt;startSpan&lt;/code&gt; does not.&lt;/p&gt;
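One way to picture that bookkeeping is with a synchronous stand-in, where "active" is a literal stack (the real SDK tracks the active span through its context mechanism instead, and these names are illustrative):

```typescript
// Stand-in span: just a name and the name of its parent (if any)
interface FakeSpan {
  name: string;
  parent?: string;
}

const activeStack: FakeSpan[] = [];

// startSpan: parent off the current active span, but do NOT become active
function startSpan(name: string): FakeSpan {
  const parent = activeStack[activeStack.length - 1];
  return { name, parent: parent ? parent.name : undefined };
}

// startActiveSpan: same parenting, but also active for the callback's duration
function startActiveSpan(name: string, fn: (span: FakeSpan) => void): void {
  const span = startSpan(name);
  activeStack.push(span);
  try {
    fn(span);
  } finally {
    activeStack.pop();
  }
}
```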

&lt;p&gt;In order to have a span created by &lt;code&gt;startSpan&lt;/code&gt; become “active” and the parent of any child spans, you have to do this&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;opentelemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;opentelemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="nx"&gt;wantsToBeAParentSpan&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then passing &lt;code&gt;ctx&lt;/code&gt; as the context argument of subsequent &lt;code&gt;startSpan&lt;/code&gt; calls (or running them inside &lt;code&gt;opentelemetry.context.with(ctx, ...)&lt;/code&gt;) will create the new spans as children of &lt;code&gt;wantsToBeAParentSpan&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Handling Concurrent Requests and the Need for Async Context
&lt;/h4&gt;

&lt;p&gt;Imagine 2 requests (say request A and request B) come in near simultaneously. Both create their own parent spans (say parent span A and parent span B) and both end up waiting on the asynchronous fetch call to resolve at the same time.&lt;/p&gt;

&lt;p&gt;Say request A’s fetch call finishes first. How does the &lt;code&gt;tracer.startSpan&lt;/code&gt; call know to attach itself to parent span A and not parent span B?&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://www.example.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`constructBody`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`Your user-agent is:\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unknown&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We need some way to keep the context of the request across asynchronous events, so that &lt;code&gt;tracer.startSpan&lt;/code&gt; can know that this is still for request A, and it should make the new span a child of parent span A.&lt;/p&gt;

&lt;p&gt;OpenTelemetry-JS handles this differently based on the situation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser - Uses zone.js to keep track of these asynchronous contexts&lt;/li&gt;
&lt;li&gt;Node.js - Uses AsyncLocalStorage to keep track of asynchronous contexts&lt;/li&gt;
&lt;/ul&gt;
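Here is a tiny demo of the `AsyncLocalStorage` mechanism itself, using a request id string in place of a span context (Deno supports this API through its Node.js compatibility layer):

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

const requestContext = new AsyncLocalStorage();

// Simulate a handler that awaits a slow call mid-request
async function handleRequest(id: string) {
  return requestContext.run(id, async function () {
    // Yield to the event loop, like awaiting a real fetch
    await new Promise(function (resolve) { setTimeout(resolve, 5); });
    // Even after the await, getStore() still returns THIS request's id
    return requestContext.getStore();
  });
}

// Two overlapping "requests"; each keeps its own context across the await
const results = await Promise.all([handleRequest("A"), handleRequest("B")]);
```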

&lt;p&gt;Thanks to Deno’s Node.js compatibility efforts, all that’s needed to benefit from the Node.js async context management approach is to use the Node SDK:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeTracerProvider&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:@opentelemetry/sdk-trace-node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeTracerProvider&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will automatically make sure &lt;code&gt;tracer.startSpan&lt;/code&gt; attaches the span it creates to the correct parent span in the above situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Directions
&lt;/h2&gt;

&lt;p&gt;This example is just scratching the surface of using OpenTelemetry in Deno. Here are some ways to take it further.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding Structured Data to Spans
&lt;/h3&gt;

&lt;p&gt;One thing the example didn't highlight is the ability to add structured data to spans, just like you would with logs (&lt;a href="https://opentelemetry.io/docs/instrumentation/js/instrumentation/#attributes" rel="noopener noreferrer"&gt;this OTEL doc&lt;/a&gt; covers how to do this in more detail).&lt;/p&gt;

&lt;p&gt;In certain use cases, you could potentially get away with only using traces and not using logs at all.&lt;/p&gt;
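The real API for this is `span.setAttribute(key, value)` and `span.setAttributes({...})` from `@opentelemetry/api`; the sketch below uses a minimal recording span so it stays self-contained, with made-up `app.*` attribute names:

```typescript
// Minimal stand-in for a span that accepts structured attributes
interface RecordingSpan {
  name: string;
  attributes: { [key: string]: string | number | boolean };
  setAttribute(key: string, value: string | number | boolean): void;
}

function startRecordingSpan(name: string): RecordingSpan {
  return {
    name,
    attributes: {},
    setAttribute(key, value) {
      this.attributes[key] = value;
    },
  };
}

// Instead of interpolating facts into a log string, attach them as
// individually queryable fields on the span
const bodySpan = startRecordingSpan("constructBody");
bodySpan.setAttribute("app.user_agent", "Deno/1.31");
bodySpan.setAttribute("app.body_bytes", 42);
```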

&lt;h3&gt;
  
  
  To Deploy and Beyond
&lt;/h3&gt;

&lt;p&gt;As of this writing, Deno Deploy doesn’t support &lt;code&gt;npm:&lt;/code&gt; specifiers, so it’s not possible to use the OTEL example above there.&lt;/p&gt;

&lt;p&gt;But once that support lands, this example should be good to go!&lt;/p&gt;

&lt;h3&gt;
  
  
  Actually Knowing CPU Time per Request
&lt;/h3&gt;

&lt;p&gt;Deno Deploy (and other edge function services) limit or bill based on the milliseconds of CPU time your functions use.&lt;/p&gt;

&lt;p&gt;However, there’s no easy way to profile or measure this today (as far as I’m aware).&lt;/p&gt;

&lt;p&gt;With OpenTelemetry traces, you should be able to drop spans into CPU-intensive parts of your code to zero in on what’s eating up CPU time.&lt;/p&gt;

&lt;p&gt;And with autoinstrumentation, you could even measure the underlying framework you’re using too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autoinstrumenting Deno-specific APIs and Libraries
&lt;/h3&gt;

&lt;p&gt;There are many &lt;a href="https://deno.land/api@v1.31.1" rel="noopener noreferrer"&gt;Deno APIs&lt;/a&gt; beyond &lt;code&gt;fetch&lt;/code&gt; that could benefit from autoinstrumentation (e.g. Cache), especially the Deno-specific ones, like the filesystem APIs, that don’t already have a browser autoinstrumentation package available.&lt;/p&gt;

&lt;p&gt;There are also a wide variety of Deno-specific libraries (e.g. oak, Fresh) that could benefit from autoinstrumentation too, either via a separate autoinstrumentation package or by building it into the library itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting the “Distributed” in Distributed Tracing
&lt;/h3&gt;

&lt;p&gt;This example involved only one service, but you could imagine running several distributed services as well.&lt;/p&gt;

&lt;p&gt;In that case, trace context propagation across the network is critical to producing one unbroken distributed trace: Service B needs to learn about the parent span Service A created before sending the request, so that Service B can attach its spans to the correct parent.&lt;/p&gt;

&lt;p&gt;I haven't yet tried to see if this just works out-of-the-box, but I’m guessing it may need some effort to get firing on all cylinders.&lt;/p&gt;
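For reference, the wire format that makes this work is the W3C Trace Context `traceparent` header, which OpenTelemetry propagates by default. A sketch of building and parsing it (simplified; the spec's handling of versions and flags is stricter):

```typescript
// The traceparent header has the shape:
//   "00" (version) - 32-hex trace id - 16-hex parent span id - 2-hex flags
function buildTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
}

function parseTraceparent(header: string) {
  const parts = header.split("-");
  if (parts.length !== 4 || parts[1].length !== 32 || parts[2].length !== 16) {
    return undefined; // malformed header
  }
  return { traceId: parts[1], parentSpanId: parts[2], sampled: parts[3] === "01" };
}
```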

&lt;h3&gt;
  
  
  All the Pillars
&lt;/h3&gt;

&lt;p&gt;This example was just of tracing, but OpenTelemetry covers metrics and logging as well (enabling cool things like finding trace exemplars that are contributing to a metric).&lt;/p&gt;

&lt;p&gt;It would be interesting to see if those work out-of-the-box for Deno too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parting Thoughts
&lt;/h2&gt;

&lt;p&gt;I personally didn't anticipate that the Node.js compat work done by the Deno team would impact OpenTelemetry support, so this came as a pleasant surprise to me. I have no idea what that involved but hats off to the Deno folks for their efforts.&lt;/p&gt;

&lt;p&gt;And I hope there is more to come in making Deno the best server-side JS runtime around for production workloads!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Deno_(software)" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Deno_(software)&lt;/a&gt; is where I got the definition of Deno from&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;https://opentelemetry.io/&lt;/a&gt; is where I got the definition of OpenTelemetry from&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://opentelemetry.io/docs/instrumentation/js/instrumentation/" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/instrumentation/js/instrumentation/&lt;/a&gt; and the remaining OTEL-JS docs have a lot of information on what more you can do with spans and how to go about doing it&lt;/li&gt;
&lt;li&gt;For more tips on troubleshooting beyond ConsoleSpanExporter, including turning on OpenTelemetry-JS's diagnostic Debug-level logging, see &lt;a href="https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/#troubleshooting" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/#troubleshooting&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/open-telemetry/opentelemetry-js/issues/2293#issuecomment-1485395549" rel="noopener noreferrer"&gt;As of this writing, Deno supports Node.js's AsyncLocalStorage API for async context management except for setTimeout&lt;/a&gt; per Luca Casonato&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/grunet/can-you-measure-the-duration-of-a-promise-3a6h"&gt;This older article I wrote collects some notes on the history of async context management in JS&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Console Output
&lt;/h2&gt;

&lt;p&gt;For the curious, here is the output that the &lt;code&gt;ConsoleSpanExporter&lt;/code&gt; generates after the server handles a request&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;

&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;9a9a625e1562e9847ffe97e09fcf1bea&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;parentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;f80a96f14dc334d8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;traceState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;constructBody&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;5a1974974e18fcf6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1680829418679000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="na"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;9a9a625e1562e9847ffe97e09fcf1bea&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;parentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;traceState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;handler&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;f80a96f14dc334d8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1680829418663000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;16340&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="na"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;9a9a625e1562e9847ffe97e09fcf1bea&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;parentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;f80a96f14dc334d8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;traceState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HTTP GET&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;7ec3993d6f40671c&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1680829418669000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fetch&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.method&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GET&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.url&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://www.example.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.status_code&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.status_text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OK&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.host&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;www.example.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.scheme&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.user_agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Deno/1.32.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="na"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;5086d5f4412c11a533d81044998ac7d6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;parentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;traceState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HTTP POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;698b36fbc35e09c9&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1680829423694000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;94000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fetch&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.method&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.url&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://api.honeycomb.io/v1/traces&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.status_code&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.status_text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OK&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.host&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;api.honeycomb.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.scheme&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http.user_agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Deno/1.32.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="na"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>deno</category>
      <category>javascript</category>
      <category>observability</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How to Think About Software Supply Chain Security - Part 2</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Sat, 01 Apr 2023 13:55:43 +0000</pubDate>
      <link>https://dev.to/grunet/how-to-think-about-software-supply-chain-security-part-2-964</link>
      <guid>https://dev.to/grunet/how-to-think-about-software-supply-chain-security-part-2-964</guid>
      <description>&lt;p&gt;(If you haven’t already, read or skim &lt;a href="https://dev.to/grunet/how-to-think-about-software-supply-chain-security-28eg"&gt;Part 1 of this series&lt;/a&gt; first for background.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracking Confidence and Risk
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PNXM9GfG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nr06vaxcwz3d5tbmqwjs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PNXM9GfG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nr06vaxcwz3d5tbmqwjs.png" alt="It looks like a histogram of a narrow bell curve, rotated 90 degrees clockwise so the base is at the far left and the tip is at the far right. Inside the bell curve near the left is the text High Confidence in a large font size. Inside the tip near the right is the text Low Confidence in a small font size. To the right of each bar of the histogram is a phrase. The phrases are the heading level threes in the article below. Each one is a software supply chain security risk, and the idea is they're chipping away at confidence. There's an arrow at the bottom indicating the risks to the left occur closer to the development phase, whereas the risks to the right occur closer to the deployment to production phase, roughly speaking." width="880" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are an extremely large number of software supply chain security risks. Each one of these risks can reduce confidence in the security of the software development process. &lt;/p&gt;

&lt;p&gt;When someone has an idea for a software change, it’s at its most secure. People’s brains cannot be infected or manipulated that easily (by software).&lt;/p&gt;

&lt;p&gt;However the change then has to go through design, then development, then validation, and ultimately deployment or release. At each stage there are a multitude of software supply chain security risks that can erode confidence. &lt;/p&gt;

&lt;p&gt;If left unmitigated, these risks can add up to a complete loss of confidence in the integrity of the final product and the security of the process itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Diving Deeper Into Individual Risks
&lt;/h2&gt;

&lt;p&gt;What follows is a brief exposition of each of the risks included in the above diagram.&lt;/p&gt;

&lt;p&gt;Note that what the diagram covers is only a small sample of all software supply chain security risks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lack of 2FA on Github
&lt;/h3&gt;

&lt;p&gt;If your Github password gets compromised, an attacker can now act as you.&lt;/p&gt;

&lt;p&gt;For example, they might write a Github workflow to exfiltrate all of your build-time secrets.&lt;/p&gt;

&lt;p&gt;Enforcing 2FA on all user accounts mitigates this risk.&lt;/p&gt;
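&lt;p&gt;As a minimal sketch, an organization owner can flip this on via the GitHub REST API (the &lt;code&gt;my-org&lt;/code&gt; name below is a placeholder):&lt;/p&gt;

```shell
# Require 2FA for all members of an organization (org name is a placeholder).
# Caution: members without 2FA enabled are removed from the org when this is turned on.
gh api --method PATCH /orgs/my-org \
  -F two_factor_requirement_enabled=true
```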

&lt;h3&gt;
  
  
  Long-Lived Github Personal Access Tokens
&lt;/h3&gt;

&lt;p&gt;If a PAT (personal access token) gets compromised, an attacker gains all of the token’s permissions on the Github account it belongs to, regardless of 2FA.&lt;/p&gt;

&lt;p&gt;For example, they could use the PAT to steal all of your confidential source code, and then use that information in a subsequent attack.&lt;/p&gt;

&lt;p&gt;There is no general mitigation for this (that I can think of) outside of avoiding use of PATs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Signed Commits Not Required
&lt;/h3&gt;

&lt;p&gt;If signed commits aren’t required in your Github repository or aren't in use by your team, an attacker who has already compromised some Github account can modify your commits after they’ve been made.&lt;/p&gt;

&lt;p&gt;For example, they might modify a commit you had previously made on a PR and that a reviewer had already reviewed, sneaking in subtle runtime secrets exfiltration code.&lt;/p&gt;

&lt;p&gt;Requiring signed commits in your repository eliminates this risk. And if you use Github Codespaces, your commits will automatically be signed.&lt;/p&gt;
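&lt;p&gt;For local setups, commit signing can also be turned on in Git itself. A minimal sketch using an SSH key (the key path is a placeholder; SSH signing requires Git 2.34+):&lt;/p&gt;

```shell
# Sign every commit with an SSH key instead of GPG (Git 2.34+).
git config --global gpg.format ssh
git config --global user.signingkey ~/.ssh/id_ed25519.pub
git config --global commit.gpgsign true

# Signatures on past commits can later be inspected with:
#   git log --show-signature -1
```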

&lt;h3&gt;
  
  
  Main Branch Not Protected
&lt;/h3&gt;

&lt;p&gt;If the main branch of your repository isn’t protected, an attacker who has already compromised some Github account can directly commit changes to it without anyone noticing.&lt;/p&gt;

&lt;p&gt;For example, they could add in some subtle runtime secrets exfiltration code.&lt;/p&gt;

&lt;p&gt;Protecting the main branch eliminates this risk.&lt;/p&gt;
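&lt;p&gt;As a sketch, branch protection can be applied through the GitHub REST API (the owner/repo names are placeholders). Requiring an approving review in the same call also addresses the code review risk covered next:&lt;/p&gt;

```shell
# Protect the main branch: require one approving review, apply the rules
# to admins too, and disallow force pushes (owner/repo are placeholders).
gh api --method PUT /repos/my-org/my-repo/branches/main/protection \
  --input - <<'EOF'
{
  "required_status_checks": null,
  "enforce_admins": true,
  "required_pull_request_reviews": { "required_approving_review_count": 1 },
  "restrictions": null,
  "allow_force_pushes": false
}
EOF
```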

&lt;h3&gt;
  
  
  Code Review Not Required
&lt;/h3&gt;

&lt;p&gt;If code review is not required in your repository, an attacker who has already compromised some Github account can make a PR (pull request) and merge it into main all by themselves.&lt;/p&gt;

&lt;p&gt;For example, they could add in some subtle runtime secrets exfiltration code via the PR.&lt;/p&gt;

&lt;p&gt;Requiring code review on all PRs eliminates this risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Builds Not Fully Automated using a CI Service
&lt;/h3&gt;

&lt;p&gt;If builds are not automated in an ephemeral environment, then malware lingering in the environment can infect the builds.&lt;/p&gt;

&lt;p&gt;For example, if a container image is built on someone’s computer, existing malware running on that computer could modify what’s included in the container image, inserting a backdoor for when it’s running in production later on.&lt;/p&gt;

&lt;p&gt;Using a build service like Github Actions prevents this, since each build runs on a new, clean VM (virtual machine).&lt;/p&gt;

&lt;h3&gt;
  
  
  Dependencies’ Versions Not Pinned
&lt;/h3&gt;

&lt;p&gt;A dependency could be compromised by an attacker and a new malicious version of the dependency published. &lt;/p&gt;

&lt;p&gt;If dependencies aren’t pinned, the next build will pull in the malicious version of the dependency.&lt;/p&gt;

&lt;p&gt;Pinning the dependency ensures the same code is used each time unless someone explicitly chooses to change it.&lt;/p&gt;
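&lt;p&gt;A couple of hedged examples of what pinning can look like in practice (the package name, version, and commit SHA below are illustrative placeholders):&lt;/p&gt;

```shell
# npm: record exact versions (no ^ or ~ ranges) on every future install,
# e.g. "left-pad": "1.3.0" rather than "^1.3.0" in package.json.
npm config set save-exact true

# GitHub Actions: reference a full commit SHA instead of a mutable tag
# in workflow files (the SHA below is a placeholder):
#   uses: actions/checkout@8f4b7f84864484a7bf31766abe9204da3cbe65b3
```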

&lt;h3&gt;
  
  
  Dependencies Neither Cached in CI Nor Vendored
&lt;/h3&gt;

&lt;p&gt;If every build fetches dependencies from the internet, then that increases the chances of pulling in an existing version of a dependency that’s been corrupted.&lt;/p&gt;

&lt;p&gt;Caching dependencies in CI helps reduce the number of times fetching from the internet is required. Vendoring dependencies (i.e. including their code in your source code) erases this problem altogether.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dependencies Updated Too Often
&lt;/h3&gt;

&lt;p&gt;If you update your dependencies anytime a dependency publishes a new version, you’re at increased risk that one of those dependency updates has been compromised and you'll now be pulling in its malicious code.&lt;/p&gt;

&lt;p&gt;Updating only when there’s a new major version, while still taking all security patches, is one way to strike a safer balance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secrets Not Restricted to Protected Branches
&lt;/h3&gt;

&lt;p&gt;If you have Github Secrets that are accessible outside of protected branches, anyone (e.g. a disgruntled employee) can write a Github Workflow in a throwaway branch to exfiltrate those secrets.&lt;/p&gt;

&lt;p&gt;Environment-based secrets in Github Actions can restrict secrets to protected branches and eliminate this risk.&lt;/p&gt;
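&lt;p&gt;As a sketch, an environment whose secrets are only available to protected branches can be set up via the REST API (the owner/repo and environment names are placeholders):&lt;/p&gt;

```shell
# Create (or update) a "production" environment whose secrets are only
# exposed to workflow runs on protected branches (names are placeholders).
gh api --method PUT /repos/my-org/my-repo/environments/production \
  --input - <<'EOF'
{
  "deployment_branch_policy": {
    "protected_branches": true,
    "custom_branch_policies": false
  }
}
EOF
```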

&lt;h3&gt;
  
  
  Secrets Not Least Permissioned
&lt;/h3&gt;

&lt;p&gt;If secrets include things like AWS access keys, and the permissions behind them are very broad (e.g. Administrator-level on an AWS account), then if/when the secrets are exfiltrated an attacker will have wide access to your AWS accounts.&lt;/p&gt;

&lt;p&gt;Restricting these kinds of secrets to the least permissions required to perform their functions (e.g. only enough for pushing to a container registry) is one mitigation.&lt;/p&gt;

&lt;p&gt;Another (imo easier) mitigation if the tool supports it is to use ephemeral access keys via Github’s OIDC provider. These can be configured so the access keys only exist for a few minutes, so even if they are exfiltrated they are hard to abuse.&lt;/p&gt;
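&lt;p&gt;A minimal sketch of what this looks like with GitHub Actions and AWS; the role ARN, account id, region, and action version are all placeholder assumptions:&lt;/p&gt;

```shell
# Write a workflow that trades a GitHub OIDC token for short-lived AWS
# credentials; no long-lived access keys are stored as repository secrets.
mkdir -p .github/workflows
cat > .github/workflows/deploy.yml <<'EOF'
on:
  push:
    branches: [main]
permissions:
  id-token: write   # lets the job request an OIDC token from GitHub
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy
          aws-region: us-east-1
EOF
```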

&lt;h3&gt;
  
  
  Outbound Requests Made during CI
&lt;/h3&gt;

&lt;p&gt;If arbitrary outbound requests are allowed during CI, malicious code that’s already infiltrated your CI environment can exfiltrate secrets. Disallowing them prevents this (similar to the idea of “air gapping” a build).&lt;/p&gt;

&lt;p&gt;In practice this can be difficult to pull off, or even monitor for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploys/Releases Not Automated
&lt;/h3&gt;

&lt;p&gt;If deployments or releases aren’t automated from an ephemeral environment, then malware lingering in the environment can affect the deployment or release.&lt;/p&gt;

&lt;p&gt;For example, if changes to your cloud IAC (infrastructure as code) are done from someone’s computer, existing malware on their computer could include extra cloud resources into changesets (e.g. cryptominers).&lt;/p&gt;

&lt;p&gt;Outside of fully automating deploys or releases, one mitigation for this is to do the process from a clean machine (e.g. Cloud Shell in AWS or GCP) every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Access Outside of Deployment/Release Branch
&lt;/h3&gt;

&lt;p&gt;This is the same as “Secrets Not Restricted to Protected Branches” for the special (and much worse) case where the secrets contain access credentials to your production environments.&lt;/p&gt;

&lt;p&gt;The same mitigation about using Environments in Github Actions to restrict the branches the secrets are accessible from applies here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;Supply chain security is hard to grok because it requires you to be skeptical about things you also have to trust heavily. This tension is difficult to grapple with.&lt;/p&gt;

&lt;p&gt;Thinking about things from a risk-first (or equivalently confidence-first) perspective has proven useful to me in dealing with this tension.&lt;/p&gt;

&lt;p&gt;To take a deeper dive into the world of software supply chain security, check out &lt;a href="https://slsa.dev/"&gt;https://slsa.dev/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
    </item>
    <item>
      <title>Modern Accessibility</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Sat, 25 Mar 2023 23:26:43 +0000</pubDate>
      <link>https://dev.to/grunet/modern-accessibility-51kn</link>
      <guid>https://dev.to/grunet/modern-accessibility-51kn</guid>
      <description>&lt;h2&gt;
  
  
  Guessing at What the State of the Art in Web Accessibility Will Look Like in 5 Years
&lt;/h2&gt;

&lt;p&gt;Web design and development evolve at a rapid pace. It's reasonable to assume web accessibility will evolve similarly over the next few years.&lt;/p&gt;

&lt;p&gt;This is my attempt to guess at what the best organizations will be doing for it in 5 years time (inspired by &lt;a href="https://www.moderntesting.org/"&gt;the Modern Testing principles&lt;/a&gt; and the &lt;a href="https://www.devops-research.com/research.html"&gt;DevOps Research and Assessment studies&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Protecting Abled Usage and Optimizing for Experimentation with Disabled Experiences
&lt;/h2&gt;

&lt;p&gt;There are 2 high-level aspects to my guess at what the top organizations will be doing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Protecting Abled (i.e. non-disabled) Experiences&lt;/li&gt;
&lt;li&gt;Optimizing for Experimentation with Disabled Experiences &lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Protecting Abled Experiences
&lt;/h3&gt;

&lt;p&gt;Before organizations can focus on optimizing their disabled users' experiences, they'll have to first make sure they avoid impacting or regressing their abled users' experiences.&lt;/p&gt;

&lt;h4&gt;
  
  
  Avoidance of Designs that Cannot Be Made Accessible Without Affecting Abled Experiences
&lt;/h4&gt;

&lt;p&gt;At the design level, this means avoiding design patterns that cannot be made accessible without changing how abled users experience them.&lt;/p&gt;

&lt;p&gt;For example, take ephemeral toast notifications. There's no way to make these accessible after-the-fact without creating a drawer containing all of the notifications. If there's no room for such a drawer in the design without impacting the abled experience, the design can't be made accessible.&lt;/p&gt;

&lt;p&gt;For a contrary example, take icon buttons that end up not having accessible labels. They can be given accessible labels without needing to adjust the abled experience at all. The design can be made accessible without impacting the abled experience.&lt;/p&gt;

&lt;h4&gt;
  
  
  Functional and Visual Regression Tests of Abled Workflows Derived From Telemetry
&lt;/h4&gt;

&lt;p&gt;Automated functional tests protecting abled user flows can give teams experimenting with accessibility changes extra confidence that their changes won't break abled use cases.&lt;/p&gt;

&lt;p&gt;Layering on automated visual regression tests can enhance that confidence to another level.&lt;/p&gt;

&lt;p&gt;And deriving the tests from production telemetry makes sure that the right abled user flows are being encoded into automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimizing for Experimentation with Disabled Experiences
&lt;/h3&gt;

&lt;p&gt;All of that protection should enable teams to experiment at will when it comes to disabled user experiences, without fear of side-effects.&lt;/p&gt;

&lt;h4&gt;
  
  
  Lead Times on the Order of Seconds
&lt;/h4&gt;

&lt;p&gt;A key prerequisite to this is having an extremely short time in-between having an idea for an experiment and getting real feedback on it.&lt;/p&gt;

&lt;p&gt;Bringing down lead times to a few seconds helps to this end. Tests can be shifted to run as synthetic monitors against production to help with this.&lt;/p&gt;

&lt;h4&gt;
  
  
  Living in Production
&lt;/h4&gt;

&lt;p&gt;At this point, teams will be effectively "living in production" and can focus on experimentation.&lt;/p&gt;

&lt;h5&gt;
  
  
  Experiment-first Mentality
&lt;/h5&gt;

&lt;p&gt;At the end of the day, the only people who can tell if an experience is accessible are the disabled users who experience it. No amount of prior team experience or knowledge can substitute for this.&lt;/p&gt;

&lt;p&gt;Teams will construct experiments on how to improve disabled user experiences and measure them through several means in production. The successful experiments will live on, and the team will continue to iterate via experiments.&lt;/p&gt;

&lt;h5&gt;
  
  
  Analytics-first Mentality
&lt;/h5&gt;

&lt;p&gt;Teams will leverage anonymized, aggregated metrics derived from their analytics that serve as indicators for disabled user experiences.&lt;/p&gt;

&lt;p&gt;A common tactic will be to compare metrics (e.g. conversion rates) for disabled user groups against abled user groups. If the disabled user groups are performing more poorly than their abled counterparts, it will indicate more experimentation is needed.&lt;/p&gt;

&lt;h5&gt;
  
  
  Zero-Effort Generation of Strong Ties to the Disability Community
&lt;/h5&gt;

&lt;p&gt;Analytics alone won’t generate enough useful information. Teams will leverage 3rd parties to connect them with their disabled users, so they can apply user research techniques to better understand those users and drive future experiments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;My guess is the organizations that will be doing the best at accessibility will be the ones that work the closest with their disabled users, and then optimize all of their processes towards experimenting to find the best solutions for those users.&lt;/p&gt;

&lt;p&gt;(Slack already does parts of this today to my understanding, which is why it doesn't seem too farfetched to me.)&lt;/p&gt;

</description>
      <category>a11y</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Think About Software Supply Chain Security - Part 1</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Sat, 25 Mar 2023 02:54:08 +0000</pubDate>
      <link>https://dev.to/grunet/how-to-think-about-software-supply-chain-security-28eg</link>
      <guid>https://dev.to/grunet/how-to-think-about-software-supply-chain-security-28eg</guid>
      <description>&lt;h2&gt;
  
  
  How Software Supply Chain Security Differs from Normal Security
&lt;/h2&gt;

&lt;p&gt;With normal security, the concern is primarily with malicious, external actors probing your software looking for direct exploits.&lt;/p&gt;

&lt;p&gt;With software supply chain security, the concern is more about malicious actors exploiting backdoors in the creation of your software.&lt;/p&gt;

&lt;p&gt;The terminology is different too. Whereas it makes sense to talk about "trust" in the context of normal security, that concept loses usefulness in the context of software supply chain security.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to be Concerned about When it Comes to Software Supply Chain Security
&lt;/h2&gt;

&lt;p&gt;There are 2 main areas to be concerned about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Artifact Integrity&lt;/li&gt;
&lt;li&gt;Exfiltration&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Artifact Integrity
&lt;/h3&gt;

&lt;p&gt;Artifact integrity has to do with trying to make sure that the software that was delivered and/or is running in production is actually what was intended to be made.&lt;/p&gt;

&lt;p&gt;An example of when this would fail is when a malicious actor is able to modify the source code used to build a container, including a backdoor that lets them collect sensitive user information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exfiltration
&lt;/h3&gt;

&lt;p&gt;Exfiltration in this context is concerned with trying to make sure sensitive parts of the supply chains themselves aren't stolen (e.g. source code, build secrets).&lt;/p&gt;

&lt;p&gt;An example of when this has happened is the &lt;a href="https://circleci.com/blog/jan-4-2023-incident-report/"&gt;CircleCI security incident&lt;/a&gt; from a few months ago, where all customer build secrets were compromised by a malicious actor.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Think About the Techniques That Address The Concerns
&lt;/h2&gt;

&lt;p&gt;There are many techniques available to address these two concerns, but one helpful way to categorize them is by how they impact the risks involved. The 3 most prominent categories are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Risk Elimination&lt;/li&gt;
&lt;li&gt;Risk Mitigation&lt;/li&gt;
&lt;li&gt;Risk Awareness&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Risk Elimination
&lt;/h3&gt;

&lt;p&gt;These techniques eliminate certain classes of risk altogether. For example,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Signing all Git operations (e.g. commits, tags)&lt;/li&gt;
&lt;li&gt;Automating builds and running them in an isolated, ephemeral environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The former prevents malicious actors from impersonating valid contributors.&lt;/p&gt;

&lt;p&gt;The latter prevents any long-lived malicious software from living in the build environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Risk Mitigation
&lt;/h3&gt;

&lt;p&gt;These techniques reduce the chances of certain classes of risk. For example,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peer review of code changes&lt;/li&gt;
&lt;li&gt;Using a dedicated build service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The former mitigates the case of 1 disgruntled employee trying to submit malicious code, but it does nothing for the case of 2 disgruntled employees colluding to submit malicious code.&lt;/p&gt;

&lt;p&gt;The latter will generally improve the security of the secrets kept inside the build service. However, as the CircleCI incident showed, all build service platforms are still fallible when it comes to secrets exfiltration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Risk Awareness
&lt;/h3&gt;

&lt;p&gt;These techniques give you more insight into the risk profile of certain classes of risk. For example,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gathering all manifests (e.g. Software Bills of Material, SBOMs) of all of your dependencies&lt;/li&gt;
&lt;li&gt;Checking Reddit before updating a dependency in case there's a well-known compromise in flight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The former helps increase awareness of all the pieces comprising the software, as well as their individual security vulnerabilities (notice how this overlaps with normal security concerns). &lt;/p&gt;
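
&lt;p&gt;As a sketch of the first technique, assuming the open-source Syft and Grype CLIs are installed, generating an SBOM and then scanning it for known vulnerabilities can look like this:&lt;/p&gt;

```shell
# Generate an SPDX-format SBOM for the current project directory
syft dir:. -o spdx-json > sbom.json

# Scan that SBOM against known vulnerability databases
grype sbom:./sbom.json
```

&lt;p&gt;The same pair of commands works against container images, which is handy when the thing you ship isn't just source code.&lt;/p&gt;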

&lt;p&gt;The latter can alert you, before you merge a Dependabot dependency update PR, that the new version may contain malicious code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;Risk is the primitive to use when thinking about software supply chain security.&lt;/p&gt;

&lt;p&gt;Thinking about which risks a technique or tool impacts makes the technique itself easier to reason about.&lt;/p&gt;

&lt;p&gt;For more on this, including more practical examples, check out &lt;a href="https://dev.to/grunet/how-to-think-about-software-supply-chain-security-part-2-964"&gt;Part 2 of this series&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To take a deeper dive into the world of software supply chain security, check out &lt;a href="https://slsa.dev/"&gt;https://slsa.dev/&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
    </item>
    <item>
      <title>How to Maximize User Privacy When Using Google Analytics 4</title>
      <dc:creator>Grunet</dc:creator>
      <pubDate>Sat, 04 Mar 2023 23:42:57 +0000</pubDate>
      <link>https://dev.to/grunet/how-to-maximize-user-privacy-when-using-google-analytics-4-4cd7</link>
      <guid>https://dev.to/grunet/how-to-maximize-user-privacy-when-using-google-analytics-4-4cd7</guid>
      <description>&lt;h2&gt;
  
  
  What is Google Analytics 4?
&lt;/h2&gt;

&lt;p&gt;Web analytics is the practice of gathering information about how your users are using your websites, for the purposes of marketing, sales, improving product offerings, etc...&lt;/p&gt;

&lt;p&gt;Google Analytics is one such tool that aids in this practice. It is by far the most popular one.&lt;/p&gt;

&lt;p&gt;Google Analytics 4 is the latest iteration of the tool. It was created in large part as a response to GDPR (General Data Protection Regulation) privacy legislation in the EU (European Union) that Universal Analytics ("Google Analytics 3") couldn't support.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Google Analytics 4 Treat Privacy By Default?
&lt;/h2&gt;

&lt;p&gt;Despite the increased focus on privacy, it doesn't look great. &lt;/p&gt;

&lt;p&gt;It has several defaults that are unnecessary for ordinary website analytics, exposing way more information than needed back to Google. &lt;/p&gt;

&lt;h2&gt;
  
  
  Steps to Take to Maximize Your Users' Privacy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Turn Off All Account Data Sharing Settings
&lt;/h3&gt;

&lt;p&gt;By default, you're opted in to sharing your users' analytics data with these 4 other entities&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Products &amp;amp; Services&lt;/li&gt;
&lt;li&gt;Modeling Contributions &amp;amp; Business Insights&lt;/li&gt;
&lt;li&gt;Technical Support&lt;/li&gt;
&lt;li&gt;Account Specialists (aka Google salespeople)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Uncheck all of them during the onboarding flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use "Device-based" for Reporting Identity
&lt;/h3&gt;

&lt;p&gt;Reporting Identity refers to how Google Analytics tracks your users across different websites and different devices.&lt;/p&gt;

&lt;p&gt;By default this is set to "Blended", which includes&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google signals (aka tracking your users based on their having logged into their Google account on any browser or device)&lt;/li&gt;
&lt;li&gt;Modeling (&lt;a href="https://support.google.com/analytics/answer/11161109"&gt;aka using machine learning based on users who accepted tracking to infer behaviors of users who declined tracking, so they can be tracked...&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of these are overkill for basic website analytics, and overreach into your users' privacy.&lt;/p&gt;

&lt;p&gt;The best alternative is actually hidden. You have to hit "Show All" to uncover the "Device-based" option.&lt;/p&gt;

&lt;p&gt;Go to Admin, then under the Property column find Reporting Identity, then hit "Show All", then select "Device-based".&lt;/p&gt;

&lt;h4&gt;
  
  
  Actually Make the "Device-based" Choice Useful
&lt;/h4&gt;

&lt;p&gt;Even with this choice, &lt;a href="https://support.google.com/analytics/answer/11593727"&gt;Google is still able to track your users across your site by setting a first-party cookie&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;While the cookie may be first-party, Google is most certainly not. Any tracking at this level should at best be done by solutions that let you be the steward of your users' data, not Google.&lt;/p&gt;

&lt;p&gt;To stop this tracking, you need to deny tracking by default on behalf of your users (as if every user had declined all such tracking via your cookie notice).&lt;/p&gt;

&lt;p&gt;The details vary by platform and integration, but &lt;a href="https://developers.google.com/tag-platform/devguides/consent#implementation_example"&gt;seem to be eventually findable in the docs (e.g. for web)&lt;/a&gt;.&lt;/p&gt;
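
&lt;p&gt;For the web integration, those docs boil down to setting the consent defaults to "denied" before any gtag config call runs. A minimal sketch (using globalThis in place of the browser's window object so the snippet runs anywhere):&lt;/p&gt;

```javascript
// In a real page this would be window.dataLayer, exactly as in Google's
// standard snippet; globalThis keeps the sketch runnable outside a browser.
globalThis.dataLayer = globalThis.dataLayer || [];
function gtag() { dataLayer.push(arguments); }

// Deny storage for ads and analytics up front, before gtag('js', ...) or
// gtag('config', ...) ever runs, so no tracking cookies get set.
gtag('consent', 'default', {
  ad_storage: 'denied',
  analytics_storage: 'denied'
});
```

&lt;p&gt;The key detail is ordering: this has to appear above the Google tag snippet on the page, or the first config call will set cookies before the denial takes effect.&lt;/p&gt;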

&lt;h2&gt;
  
  
  What You're Left With
&lt;/h2&gt;

&lt;p&gt;After all that, Google Analytics 4 should be a tool that&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Captures anonymous analytics about how your users are using your site&lt;/li&gt;
&lt;li&gt;Lets you add your own custom anonymous instrumentation to capture events it doesn't by default&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or at least I hope so...&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Google Analytics 4 is not an analytics tool. It's an advertising and marketing tool.&lt;/p&gt;

&lt;p&gt;That's the only framing I can make that explains why its defaults are the way they are.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>privacy</category>
    </item>
  </channel>
</rss>
