<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Richard Li</title>
    <description>The latest articles on DEV Community by Richard Li (@richarddli).</description>
    <link>https://dev.to/richarddli</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F31455%2F37e66a45-999e-4c18-a09f-941c482a828f.jpeg</url>
      <title>DEV Community: Richard Li</title>
      <link>https://dev.to/richarddli</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/richarddli"/>
    <language>en</language>
    <item>
      <title>Bullish on AI infrastructure, bearish on AI developer frameworks</title>
      <dc:creator>Richard Li</dc:creator>
      <pubDate>Fri, 31 Jan 2025 13:30:07 +0000</pubDate>
      <link>https://dev.to/richarddli/bullish-on-ai-infrastructure-bearish-on-ai-developer-frameworks-571p</link>
      <guid>https://dev.to/richarddli/bullish-on-ai-infrastructure-bearish-on-ai-developer-frameworks-571p</guid>
      <description>&lt;p&gt;When I said “&lt;a href="https://dev.to/blog/lessons-from-ai#dont-buy-the-ai-library-hype"&gt;don’t buy the AI library hype&lt;/a&gt;”, one of the more common responses was “Did you try $NEW_AI_FRAMEWORK instead?”&lt;/p&gt;

&lt;p&gt;After thinking about these comments, I realized that my framing of the developer AI framework space as a maturity issue is not quite correct. I see a &lt;em&gt;structural&lt;/em&gt; deficiency in this market, and I believe that many (most?) developer AI frameworks suffer from this gap.&lt;/p&gt;

&lt;p&gt;Let me explain.&lt;/p&gt;

&lt;h2&gt;An AI Application (should be) an event-driven, asynchronous application&lt;/h2&gt;

&lt;p&gt;As context, I believe most AI applications should adopt an event-driven, asynchronous architecture. This is because most AI operations (e.g., calculating an embedding, or calling an LLM) have high latency. In a synchronous architecture, a call to the LLM can occupy a thread for tens of seconds (or longer!) while it waits for a response.&lt;/p&gt;

&lt;p&gt;In an asynchronous architecture, the application sends a remote procedure call (RPC) to the LLM and moves on to other tasks while waiting for the response. This non-blocking approach ensures that threads are not tied up, allowing the application to handle other requests or workflows simultaneously. Once the LLM responds, an event listener or callback mechanism processes the response, reintegrating it into the workflow.&lt;/p&gt;
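As a minimal sketch of this non-blocking pattern (using Python's asyncio; `call_llm` is a hypothetical stand-in for a real LLM client, not any specific API):

```python
import asyncio

async def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a high-latency LLM RPC; a real client
    # would await an HTTP request here.
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

async def handle_request(prompt: str) -> str:
    # The await suspends this coroutine without tying up the thread,
    # so the event loop can serve other requests in the meantime.
    return await call_llm(prompt)

async def main() -> list:
    # Several in-flight LLM calls share a single thread.
    return await asyncio.gather(*(handle_request(f"q{i}") for i in range(3)))

print(asyncio.run(main()))
```

The three simulated calls complete in roughly the time of one, because none of them blocks the thread while waiting.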

&lt;h2&gt;The core value of AI developer frameworks: syntactic sugar and standardization&lt;/h2&gt;

&lt;p&gt;AI developer frameworks such as &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;, &lt;a href="https://www.llamaindex.ai/" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt;, and &lt;a href="https://mirascope.com/" rel="noopener noreferrer"&gt;Mirascope&lt;/a&gt; generally make it easier to create these RPCs, which can get complicated. For example, a typical RPC could include all of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A "system" prompt, which sets the overall context for the AI interaction&lt;/li&gt;
&lt;li&gt;A “user” prompt, which contains the actual question at hand&lt;/li&gt;
&lt;li&gt;Historical context, which includes any relevant conversational history&lt;/li&gt;
&lt;li&gt;Question context, which includes any data that might be relevant to the prompt itself (e.g., a PDF document)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But while assembling all of these details can be complicated and time-consuming for a developer, the workflow is fairly straightforward (in most cases, you’re populating a JSON object to send with your RPC). The reality is that these developer frameworks codify common design patterns by providing thin wrappers around a bunch of other lower-level libraries.&lt;/p&gt;
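To make that concrete, assembling the pieces listed above is mostly dict-building. The field names below follow the common chat-completion shape and are for illustration only; the model name is made up:

```python
import json

def build_payload(system_prompt, user_prompt, history, context):
    # history: prior turns as {"role": ..., "content": ...} dicts;
    # context: any question-specific data, e.g. text extracted from a PDF.
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)
    messages.append(
        {"role": "user", "content": user_prompt + "\n\nContext:\n" + context}
    )
    return json.dumps({"model": "example-model", "messages": messages})

payload = build_payload(
    "You are a helpful analyst.",
    "Summarize the attached report.",
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
    "Q3 revenue grew 12 percent.",
)
print(payload)
```

Tedious, yes, but there is no deep engineering problem being solved here.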

&lt;p&gt;Thus, the core value of these libraries is not “developer productivity”. The core value of these libraries is &lt;em&gt;standardization&lt;/em&gt; (&lt;a href="https://x.com/eyfriis" rel="noopener noreferrer"&gt;H/T Erick Friis&lt;/a&gt;) through &lt;a href="https://en.wikipedia.org/wiki/Syntactic_sugar" rel="noopener noreferrer"&gt;syntactic sugar&lt;/a&gt;. For organizations with lots of developers, consistency is key to maintainability, and these libraries provide consistency. Outside of consistency, I don’t believe that these frameworks offer much value.&lt;/p&gt;

&lt;h2&gt;So what AI frameworks and tools &lt;em&gt;do&lt;/em&gt; add value?&lt;/h2&gt;

&lt;p&gt;While AI developer libraries are not valuable for most developers (standardization is an organizational benefit), I believe there are three categories of libraries &amp;amp; infrastructure that AI developers do use and that will have enduring value: event-driven infrastructure, model training, and model inference. The software in these three categories is all non-trivial to implement, and comprises important parts of an AI application.&lt;/p&gt;

&lt;h3&gt;Event-driven infrastructure&lt;/h3&gt;

&lt;p&gt;Virtually all AI applications adopt an event-driven architecture, given the high latency of LLM responses. Asynchronous I/O libraries, message queues, and other infrastructure software created for event-driven architectures are even more relevant today in the age of AI.&lt;/p&gt;

&lt;p&gt;As an entrepreneur, I believe premature scaling is a common mistake that startups make. Prior to building my own small-scale AI application, I would have put much of this event-driven infrastructure into the “premature scaling” category. Then I ran into some of these challenges at very small scale, and that forced me to reconsider.&lt;/p&gt;

&lt;p&gt;I now believe that AI applications are forcing a reordering of technology priorities, and event-driven infrastructure will see a renaissance. The winners of this space will be the organizations that do the best job of creating a great developer experience for AI applications. &lt;/p&gt;

&lt;h3&gt;Model training&lt;/h3&gt;

&lt;p&gt;Training pipelines are the backbone of any real-world AI application. Building these pipelines requires significant engineering effort to create high-quality, reproducible models. Training pipelines typically include steps for data processing, fine-tuning, and evaluation. Fortunately, a robust ecosystem of libraries and tools is available to make it easier to build these pipelines. Some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data preprocessing and manipulation: Libraries like &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt; solve for the messy, real-world challenge of efficiently wrangling and cleaning large datasets. Without it, you'd be reinventing functionality for basic tasks like merging, filtering, or aggregating data.&lt;/li&gt;
&lt;li&gt;Fine-tuning models: Libraries like &lt;a href="https://huggingface.co/docs/transformers/en/index" rel="noopener noreferrer"&gt;HuggingFace Transformers&lt;/a&gt;  offer pre-built functionality for adapting large language models to your domain. Implementing this on your own would require deep familiarity with tokenization, optimizer setups, and memory management.&lt;/li&gt;
&lt;li&gt;Experiment tracking and reproducibility: Tools like &lt;a href="https://wandb.ai" rel="noopener noreferrer"&gt;Weights &amp;amp; Biases&lt;/a&gt; solve the hard problem of managing hundreds of experiments with varying hyperparameters, dataset splits, and evaluation results. This is critical for teams working collaboratively on model improvements.&lt;/li&gt;
&lt;/ul&gt;
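For instance, a few lines of Pandas cover the merging, filtering, and aggregating that would otherwise be hand-rolled (the columns and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical raw training examples with missing values.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "tokens": [120.0, None, 80.0, 40.0, 200.0],
    "label": ["good", "good", "bad", "good", None],
})

# Typical cleanup: drop unlabeled rows, fill missing token counts,
# then aggregate per user.
clean = df.dropna(subset=["label"]).fillna({"tokens": 0})
per_user = clean.groupby("user_id", as_index=False).agg(
    total_tokens=("tokens", "sum"),
    examples=("label", "count"),
)
print(per_user)
```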

&lt;h3&gt;Model inference&lt;/h3&gt;

&lt;p&gt;Serving a model in production is deceptively complex. While running a single inference locally might seem trivial, scaling that process to handle latency requirements, cost constraints, and diverse use cases exposes deep infrastructure needs. Some example tools for model inference include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inference engines: Tools like &lt;a href="https://docs.vllm.ai/en/latest/" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; address the core problem of high-latency inference by optimizing token streaming and GPU utilization. These aren't just "nice-to-haves"—without them, response times can make your application unusable.&lt;/li&gt;
&lt;li&gt;KV caching: Specialized caches for LLMs such as &lt;a href="https://lmcache.ai/" rel="noopener noreferrer"&gt;LMcache&lt;/a&gt; can dramatically reduce LLM latency and improve scalability.&lt;/li&gt;
&lt;li&gt;Structured output generation: Generating reliable structured data isn't trivial. Libraries like &lt;a href="https://dottxt-ai.github.io/outlines/welcome/" rel="noopener noreferrer"&gt;Outlines&lt;/a&gt; address this by wrapping generation with schema validation, saving significant development time and reducing downstream errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(There's also a shift taking place where allocating 💰 to inference rather than training can provide better bang for the buck on accuracy, but that's yet another post.)&lt;/p&gt;

&lt;h2&gt;The evolving AI ecosystem&lt;/h2&gt;

&lt;p&gt;While I’m optimistic about the AI ecosystem as a whole, I’m pessimistic about the future of AI developer libraries in this ecosystem. Even the LangChain team is moving beyond its core library, with the introduction of &lt;a href="https://docs.smith.langchain.com/" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt; (AI observability) and &lt;a href="https://langchain-ai.github.io/langgraph/" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt; (durable workflows). The tools that will thrive will be the ones that solve hard problems, made accessible through a great developer experience. We’re still in the early innings, with lots more to come!&lt;/p&gt;

&lt;p&gt;(Thanks to &lt;a href="https://x.com/williambakst" rel="noopener noreferrer"&gt;William Bakst&lt;/a&gt; and &lt;a href="https://x.com/eyfriis" rel="noopener noreferrer"&gt;Erick Friis&lt;/a&gt; for feedback.) &lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Data-driven customer acquisition: Machine Learning applied to Customer Lifetime Value</title>
      <dc:creator>Richard Li</dc:creator>
      <pubDate>Wed, 03 Apr 2024 15:45:51 +0000</pubDate>
      <link>https://dev.to/richarddli/data-driven-customer-acquisition-machine-learning-applied-to-customer-lifetime-value-2p7i</link>
      <guid>https://dev.to/richarddli/data-driven-customer-acquisition-machine-learning-applied-to-customer-lifetime-value-2p7i</guid>
      <description>&lt;p&gt;In today's data-centric world, insights from customer data can identify new opportunities to drive growth. Instead of staring endlessly at dashboards (and driving data analysts crazy with requests), we'll explore using regression analysis on your data to calculate lifetime value (LTV). We'll then analyze this model to find new ideas to try to drive LTV.&lt;/p&gt;

&lt;p&gt;LTV, also called customer lifetime value (CLTV), is a pivotal metric that quantifies the total revenue a business can expect from a single customer over their entire relationship. Understanding your LTV is crucial to analyzing your overall go-to-market performance, so that's where we'll start.&lt;/p&gt;

&lt;h2&gt;Traditional strategy&lt;/h2&gt;

&lt;p&gt;Calculating LTV traditionally involves a "top-down" approach, where the average revenue per customer and the average lifespan of a customer are calculated to derive the LTV. This method provides a broad overview of the potential value of a customer to the business over their lifetime. However, the high-level nature of this calculation doesn’t give any indication of what actually drives LTV (and thus, how you can affect it).&lt;/p&gt;

&lt;p&gt;For instance, different customer acquisition strategies can attract different customer segments, which lead to differences in LTV. And, different strategies have different acquisition costs. Typically, segmentation is then used to answer this question. You can segment based on demographics, geography, company size, and other relevant factors and then calculate the corresponding payback periods.&lt;/p&gt;

&lt;p&gt;Segmentation helps in understanding the nuances of LTV among different customer groups, allowing for more targeted and effective acquisition strategies. However, traditional segmentation may not always capture the full complexity of customer behavior and preferences.&lt;/p&gt;

&lt;p&gt;This raises the question: what if you want to delve deeper and explore segments beyond the traditional categories? We’re going to next explore a long-standing mathematical tool that can help us identify new opportunities: regression analysis.&lt;/p&gt;

&lt;h2&gt;Regression analysis and machine learning&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.toblog/time-series-vs-regression"&gt;Regression analysis&lt;/a&gt; is a statistical method used to understand the relationship between a dependent variable and one or more independent variables. Traditionally, regression analysis is employed to predict values of the dependent variable based on the values of the independent variables. However, in the context of LTV analysis, we can focus on building a directionally accurate model that can help identify the customer characteristics and behaviors that contribute to higher value for the business.&lt;/p&gt;

&lt;p&gt;Moreover, we can use machine learning to enhance regression analysis. Machine learning allows for more complex and nuanced modeling of the relationship between variables. Instead of fitting a simple linear model, machine learning algorithms can capture non-linear relationships and interactions among variables, providing a more accurate representation of the data. In LTV analysis, machine learning can be used to build regression models that not only predict LTV but also identify the key factors driving LTV for different customer segments. &lt;/p&gt;

&lt;h2&gt;Shape of Data&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.kaggle.com/datasets/shibumohapatra/customer-life-time-value" rel="noopener noreferrer"&gt;VahanBima insurance dataset&lt;/a&gt; consists of thousands of individual insurance customers and their customer lifetime value. The data set also includes some demographic information (e.g., area, gender, education, income) and behavioral information (e.g., number of policies, past claims).&lt;/p&gt;

&lt;h2&gt;Training a model&lt;/h2&gt;

&lt;h3&gt;Encoding&lt;/h3&gt;

&lt;p&gt;Regression analysis can only be performed on numerical values, so non-numerical values need to be encoded as numbers. In this particular situation, we will use n-hot encoding for all the categorical values. In n-hot encoding, we convert each unique value in a column into its own column. For example, the “qualification” column can be “High School”, “Bachelor”, or “Other”. We convert this single column to three columns: “High_School”, “Bachelor”, and “Other”, with true/false values in each column.&lt;/p&gt;
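In pandas this is what `get_dummies` does; a minimal sketch with a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"qualification": ["High School", "Bachelor", "Other", "Bachelor"]})

# Each unique value in the column becomes its own true/false column.
encoded = pd.get_dummies(df, columns=["qualification"])
print(sorted(encoded.columns))
```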

&lt;h3&gt;Training&lt;/h3&gt;

&lt;p&gt;To illustrate the core concepts of ML and regression analysis, we’ll start with a simple model. &lt;a href="https://chat.openai.com" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; (the free version) creates something that works with this prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Load a df with a CSV file of data. The df has a target of "cltv", an index of "id", and feature variables "gender", "area", "qualification", "income", "marital_status", "vintage", "claim_amount", "num_policies", "policy", and "type_of_policy", n-hot encode the type_of_policy, income, qualification columns. Convert the gender, area, and num_policies column to boolean by mapping one value to 0 and the second value to 1. Gender is Male/Female, area is Urban/Rural, and Num_policies is "more than 1" or 1.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;ChatGPT gives the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Load the data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train_BRCpofr.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# N-hot encode the 'type_of_policy', 'income', 'policy' and 'qualification' columns
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type_of_policy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;income&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;policy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;qualification&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Map 'gender', 'area', and 'num_policies' columns to boolean
&lt;/span&gt;&lt;span class="n"&gt;gender_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Male&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Female&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;area_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Urban&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rural&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;num_policies_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;more than 1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gender&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gender&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gender_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;area&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;area&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;area_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_policies&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_policies&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_policies_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste the above into a Jupyter notebook and run it. You can then ask ChatGPT to use a gradient-boosted decision tree to analyze this dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.impute&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleImputer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GradientBoostingRegressor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;

&lt;span class="c1"&gt;# Assuming X_train, X_test, y_train, y_test are your training and test sets
&lt;/span&gt;
&lt;span class="c1"&gt;# Create a pipeline with an imputer and the GradientBoostingRegressor
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# You can change the strategy as needed
&lt;/span&gt;    &lt;span class="nc"&gt;GradientBoostingRegressor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fit the pipeline on the training data
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Predict on the test data
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate the mean squared error
&lt;/span&gt;&lt;span class="n"&gt;mse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mean Squared Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mse&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
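One gap worth noting: the snippet imports `train_test_split` but never calls it, so `X_train`, `X_test`, `y_train`, and `y_test` are undefined. Assuming `cltv` is the target, the missing step might look like this (the toy dataframe, 80/20 split, and seed are illustrative choices):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the encoded dataframe from the earlier step; in the article,
# df comes from pd.get_dummies on the insurance data with target 'cltv'.
df = pd.DataFrame({
    "claim_amount": range(10),
    "gender": [0, 1] * 5,
    "cltv": [x * 1000 for x in range(10)],
})

X = df.drop(columns=["cltv"])
y = df["cltv"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```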



&lt;h2&gt;Interpreting the model&lt;/h2&gt;

&lt;p&gt;The first thing to look at is how well our model fits the data. This is typically measured using metrics like &lt;a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation" rel="noopener noreferrer"&gt;Root Mean Squared Error (RMSE)&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Mean_absolute_percentage_error" rel="noopener noreferrer"&gt;Mean Absolute Percentage Error (MAPE)&lt;/a&gt;. If the model doesn't fit the data well, you may need to revisit your model and data preprocessing steps. The RMSE on this data set is 87,478 with a MAPE of 71.4%. The average LTV in the data set is $97,953. This suggests that the model’s predictions have a large spread with a significant error percentage.&lt;/p&gt;

&lt;p&gt;Most data scientists would say the model isn’t useful at this point, as its predictive power is poor. And they’d be right: you don’t want to use this model to predict the actual LTV of a customer based on these variables. But our goal was not to build a highly accurate model for prediction. Our goal was to identify customer characteristics and behaviors that are high value to the business. So let’s explore this a little bit more.&lt;/p&gt;

&lt;h3&gt;Feature importance&lt;/h3&gt;

&lt;p&gt;We want to understand the relative importance of each of these independent variables on the output. Machine learning models typically have ways to compute “feature importance”, which is the relative importance of each independent variable to the final prediction’s &lt;em&gt;accuracy&lt;/em&gt;. In other words, features (independent variables) with high importance contribute more to the accuracy of the prediction (and the prediction isn’t really all that accurate right now).&lt;/p&gt;

&lt;p&gt;But what we really care about is not which features contribute to the accuracy; we care about which features actually impact the output. And so traditional feature importance can easily lead us astray. Instead, we will turn to Shapley values.&lt;/p&gt;

&lt;p&gt;By leveraging concepts from cooperative game theory, &lt;a href="https://christophm.github.io/interpretable-ml-book/shapley.html" rel="noopener noreferrer"&gt;Shapley values&lt;/a&gt; provide a more nuanced understanding of feature importance than traditional methods. They quantify the marginal contribution of each feature to the prediction by considering all possible combinations of features and their impact on the model's output. This approach captures interactions between features and ensures that each feature is credited proportionally to its true contribution, even in complex models where features may interact in non-linear ways. By doing so, Shapley values offer deeper insights into the factors driving the model's decisions, making them a powerful tool for interpreting machine learning models.&lt;/p&gt;

&lt;p&gt;Computing Shapley values is computationally intensive and can take hours. We’ll use the &lt;a href="https://shap.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Python SHAP library&lt;/a&gt; with some sampling to compute the following feature importance chart:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju8ucge5w6oioqylhxqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju8ucge5w6oioqylhxqf.png" alt="SHAP Feature Importance" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This tells us the major factors that are driving this particular computation.&lt;/p&gt;

&lt;h3&gt;What-If Analysis&lt;/h3&gt;

&lt;p&gt;We now know the number of policies is the biggest factor, followed by claim amount, followed by area. So, how exactly does the number of policies affect the average LTV?&lt;/p&gt;

&lt;p&gt;We can use our predictive model to answer this question. We take our original input data, and change the average number of policies up and down in 10% intervals. This lets us create a what-if chart like below, which shows the relative impact of changing the number of policies, area, and claim amounts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyh2w8l7a59dr0f5d89z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyh2w8l7a59dr0f5d89z.png" alt="What-If Chart" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As the average number of policies increases, the average LTV &lt;em&gt;decreases&lt;/em&gt;, as shown by the purple line. A 50% increase in policies causes a 10.7% decrease in average LTV.&lt;/li&gt;
&lt;li&gt;Claim amounts are represented by the teal line. As claim amounts increase, the average LTV modestly decreases: a 50% increase in claim amount causes a 1% decrease in average LTV.&lt;/li&gt;
&lt;li&gt;Area is shown by the red line. Urban customers are slightly more valuable than rural customers. As the mix shifts towards urban by 50%, a 1.1% increase in average LTV is noted.&lt;/li&gt;
&lt;/ul&gt;
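A minimal sketch of that what-if procedure, using synthetic data and a generic gradient-boosted model in place of the article's dataset (the feature names, coefficients, and 50% scaling factors are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "num_policies": rng.integers(1, 5, 500).astype(float),
    "claim_amount": rng.normal(1000, 200, 500),
})
# Synthetic target in which more policies lowers LTV, echoing the chart above.
y = 100000 - 5000 * X["num_policies"] + 10 * X["claim_amount"] + rng.normal(0, 500, 500)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
baseline = model.predict(X).mean()

# What-if: scale one feature up and down, re-predict, compare the averages.
for factor in (0.5, 1.0, 1.5):
    X_mod = X.assign(num_policies=X["num_policies"] * factor)
    delta = model.predict(X_mod).mean() / baseline - 1
    print(f"num_policies x{factor}: {delta:+.1%} change in avg predicted LTV")
```

The same loop, applied feature by feature to the real model and data, produces the curves in the chart.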

&lt;h2&gt;Next&lt;/h2&gt;

&lt;p&gt;Traditional models for LTV have historically focused on segmentation and aligning customer acquisition channels to specific segments. While effective, this approach often relies on predefined assumptions and can limit the scope of analysis. In contrast, regression analysis offers a powerful tool for at-scale data exploration, allowing for the analysis of possibilities that may not have been considered initially. Dashboards excel in the traditional model as they align with predefined questions, but regression analysis, especially when augmented with machine learning, can reveal insights that go beyond the scope of predefined queries. By embracing machine learning to enhance regression analysis, businesses can uncover new levers for customer acquisition and optimize their strategies for long-term success in a data-driven landscape.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>machinelearning</category>
      <category>saasmetrics</category>
    </item>
    <item>
      <title>When Metrics Go Awry: Analyzing KPIs using machine learning, regression analysis, and Shapley values</title>
      <dc:creator>Richard Li</dc:creator>
      <pubDate>Fri, 08 Mar 2024 13:25:21 +0000</pubDate>
      <link>https://dev.to/richarddli/when-metrics-go-awry-analyzing-kpis-using-machine-learning-regression-analysis-and-shapley-values-3kgo</link>
      <guid>https://dev.to/richarddli/when-metrics-go-awry-analyzing-kpis-using-machine-learning-regression-analysis-and-shapley-values-3kgo</guid>
      <description>&lt;p&gt;Every week, management teams at companies around the world gather to review the weekly dashboard. Inevitably, one of these key metrics … is red. How do we turn it from red to green?&lt;/p&gt;

&lt;p&gt;There are two typical reactions to a metrics incident:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Wait another week to see if it’s a trend.&lt;/li&gt;
&lt;li&gt;Schedule a meeting to investigate this metric.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Neither of these approaches is particularly satisfying. The flaw of the first is that if the problem is a trend, you’ve just lost a week of time to address the issue. The flaw of the second is that traditional analysis techniques don’t tell you why, and they frequently lead to a round of expensive and inconclusive data analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Segmentation: Traditional analysis
&lt;/h2&gt;

&lt;p&gt;The management team elects to go with Option 2. What happens in that meeting?&lt;/p&gt;

&lt;p&gt;The traditional tool used is segmentation. Typically, the KPI is decomposed into different segments: geography, customer size, gender, and so forth to see if there is a common segment that is underperforming.&lt;/p&gt;

&lt;p&gt;The data team works steadily for the next day writing SQL queries to produce a set of dashboards that show the metric in question by segment. They figure it’s worth codifying these segments the next time something happens.&lt;/p&gt;

&lt;p&gt;The dashboards show that the metric has gone down for most segments, but it’s most pronounced in N. America. Now what?&lt;/p&gt;

&lt;p&gt;The team has run into the core problem with segmentation: segmentation only identifies what segment(s) are affected by the KPI change. Segmentation does not provide any insight into what might be driving the change. Conversations with marketing &amp;amp; product ensue to ask if they’ve changed anything in the past week that might affect most customers, but particularly N. America customers. This is inconclusive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regression analysis
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.amorphousdata.com/blog/time-series-vs-regression" rel="noopener noreferrer"&gt;Regression analysis&lt;/a&gt; is a statistical method used to model the relationship between a dependent variable (in this case, the KPI) and one or more independent variables). The goal of regression analysis is to understand how the independent variables impact the dependent variable.&lt;/p&gt;

&lt;p&gt;The earliest forms of regression analysis were published in 1805 (!). Since then, research in regression analysis has resulted in sophisticated algorithms that can find patterns in noisy and non-linear data. The most recent of these algorithms fall into the “machine learning” category, and take advantage of the awesome computational power of modern CPUs &amp;amp; GPUs. We’ll see how these algorithms can be used to analyze a KPI.&lt;/p&gt;
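&lt;p&gt;The earliest of those techniques, ordinary least squares, is still a useful mental model for what regression does. Here is a minimal sketch with NumPy on synthetic data (the variable names are illustrative, not from the article's dataset): the fit recovers how each independent variable impacts the KPI.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: a KPI driven by two independent variables plus noise.
n = 500
claim_amount = rng.uniform(1_000, 10_000, n)
num_policies = rng.integers(1, 6, n).astype(float)
kpi = 0.02 * claim_amount - 150.0 * num_policies + rng.normal(0, 10, n)

# Ordinary least squares: solve for the coefficients that minimize
# the squared error between the model and the observed KPI.
X = np.column_stack([np.ones(n), claim_amount, num_policies])
coef, *_ = np.linalg.lstsq(X, kpi, rcond=None)

intercept, beta_claim, beta_policies = coef
print(beta_claim, beta_policies)  # close to the true 0.02 and -150.0
```

&lt;p&gt;The recovered coefficients land close to the true values, which is exactly the "how do the independent variables impact the dependent variable" question. Machine-learning regressors answer the same question for noisy, non-linear relationships.&lt;/p&gt;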

&lt;h2&gt;
  
  
  Insurance and LTV
&lt;/h2&gt;

&lt;p&gt;We’ll start with the VahanBima customer lifetime value dataset. We previously used this data set to &lt;a href="https://www.amorphousdata.com/shape/lifetime-value" rel="noopener noreferrer"&gt;build a regression model for LTV&lt;/a&gt;. This data set has approximately 90,000 insurance customers and their lifetime value. To simulate a drop in LTV, we’ll add another 10,000 synthetic customers based on this dataset. If you want to try this yourself, you can download the &lt;a href="https://www.kaggle.com/datasets/rad1al/insurance-cltv-dataset-with-a-decreasing-ltv/data" rel="noopener noreferrer"&gt;updated dataset here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First, let’s take a look at the LTV of this dataset. We can graph the trend line of LTV using a simple moving average, and show it here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl0iq9z6abiim8r662hy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl0iq9z6abiim8r662hy.png" alt="LTV Trend" width="550" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why is LTV down? This data set has multiple independent variables: number of policies, claim amount, area, marital status, and so forth. We'll now explore which of these variables, if any, could be causing this drop in LTV.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://www.amorphousdata.com/shape/lifetime-value" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, we trained a regression model on the same dataset to find ways to increase LTV, so we won’t re-train a model here. Instead, we’ll start where we left off, which was with the feature importance chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwz3hmd2gh6dluo13s0yb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwz3hmd2gh6dluo13s0yb.png" alt="Feature Importance" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s look at each independent variable, in descending order of feature importance, to see if we can identify what might be driving LTV down. For each variable, we’ll look at three factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The feature importance, which we have previously computed using Shapley values.&lt;/li&gt;
&lt;li&gt;The trend line, which we can compute using a simple moving average (SMA).&lt;/li&gt;
&lt;li&gt;The expected impact of the variable on LTV as the variable increases &amp;amp; decreases, which we can visualize using Shapley values.&lt;/li&gt;
&lt;/ol&gt;
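&lt;p&gt;Shapley values themselves do not require a heavyweight library to understand. For a model with a handful of features they can be computed exactly by enumerating feature coalitions; libraries such as SHAP approximate the same quantity efficiently for real models. The sketch below is pure Python, and the toy model and numbers are illustrative, not taken from this article:&lt;/p&gt;

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for prediction f(x), attributing
    f(x) - f(baseline) across features. Features outside a coalition
    are held at their baseline value. Exponential in the number of
    features, so only usable for small toy models."""
    n = len(x)
    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)
    phis = []
    for i in range(n):
        phi = 0.0
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for s in combinations(others, k):
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                phi += w * (value(set(s) | {i}) - value(set(s)))
        phis.append(phi)
    return phis

# Toy LTV model: claim amount helps, policy count hurts (illustrative only).
f = lambda z: 0.02 * z[0] - 150.0 * z[1]
x = [8000.0, 4.0]          # one customer
baseline = [5000.0, 2.0]   # dataset average
phi = shapley_values(f, x, baseline)
print(phi)  # approximately [60.0, -300.0]
```

&lt;p&gt;A useful sanity check is the efficiency property: the Shapley values sum exactly to the difference between the prediction and the baseline prediction.&lt;/p&gt;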

&lt;p&gt;The variable with the highest feature importance in this data set is number of policies. Here’s the trend line:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zw2jx9q3laz7csbu7lw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zw2jx9q3laz7csbu7lw.png" alt="Policies Trend" width="544" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no discernible pattern in this trend line, so we can dismiss a change in the number of policies issued as the driver of the drop in LTV.&lt;/p&gt;
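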

&lt;p&gt;Let’s look at the next variable, claim amount. Here’s the trend line:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwhkkiopw2mnnj1t5pv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwhkkiopw2mnnj1t5pv8.png" alt="Claim Amount Trend" width="542" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There’s an obvious drop in the claim amount for more recent customers. We didn't need any fancy regression analysis or machine learning to identify this trend. But our model lets us go one step further, because a drop in an independent variable does not necessarily correspond to a drop in the target variable. We can look at the Shapley chart for claim amount to understand this further:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9btfehn4eggjfmulrj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9btfehn4eggjfmulrj2.png" alt="Claim Amount Shapley" width="531" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Shapley scatter charts&lt;/strong&gt; are a powerful tool that gives a sense of how changes in the independent variable impact the predicted value. In this chart, each dot represents a single prediction from the data set. The x-axis is the specific value of the independent variable, while the y-axis is the change in the target for a given x-axis value.&lt;/p&gt;

&lt;p&gt;Shapley scatter charts also help identify situations where independent variables are not completely independent. When the predicted values on the y-axis show wide dispersion for the same x-axis value, other factors besides the independent variable are likely in play. Conversely, a narrow dispersion suggests the independent variable itself is driving the prediction.&lt;/p&gt;
&lt;/blockquote&gt;
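&lt;p&gt;The dispersion intuition above can be reproduced with a toy two-feature model (a hypothetical sketch, not this article's model): without an interaction, every point sharing the same x-axis value gets the same attribution, while an interaction spreads the attributions out vertically.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = np.repeat(np.array([1.0, 2.0, 3.0, 3.0]), 75)  # 300 points, mean 2.25
x2 = rng.normal(0.0, 1.0, size=300)
b1, b2 = x1.mean(), x2.mean()  # baseline: feature means

def shap_x1(f):
    # Exact two-feature Shapley value for x1: average the marginal
    # contribution of revealing x1, with x2 at its value vs. its baseline.
    return 0.5 * ((f(x1, x2) - f(b1, x2)) + (f(x1, b2) - f(b1, b2)))

additive = lambda a, b: 2.0 * a + 3.0 * b  # no interaction
interact = lambda a, b: 2.0 * a * b        # x1's effect depends on x2

# Dispersion: spread of attributions among points sharing x1 == 2.0.
disp_add = shap_x1(additive)[x1 == 2.0].std()
disp_int = shap_x1(interact)[x1 == 2.0].std()
print(disp_add, disp_int)  # zero for additive, positive with interaction
```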

&lt;p&gt;The Shapley scatter chart for claim amount is suggestive. As the claim amounts increase, so does the LTV (presumably via increased premiums). There is relatively narrow dispersion in the chart, which suggests that claim amount is fairly independent.&lt;/p&gt;

&lt;p&gt;We now have a hypothesis! We can quickly scan the trend lines of the other independent variables, and our eye settles on the spike in vintage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9si8bgz9topb1pwb37rd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9si8bgz9topb1pwb37rd.png" alt="Vintage Trend" width="546" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Vintage is the number of years a customer has been a customer. We look at the Shapley chart:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flc907q10vczouq17k99a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flc907q10vczouq17k99a.png" alt="Vintage Shapley" width="532" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we see wide vertical dispersion for each vintage, and no clear changes in value. This suggests that there are high &amp;amp; low LTV customers at every vintage, and as such, the increase in vintage is unlikely to be causing any impact on LTV. We thus conclude that the recent drop in LTV is driven by a drop in claim amounts by customers. (This also suggests that LTV may not be the ideal metric for measuring insurance customers, as the increase in LTV associated with claim activity does not necessarily mean an increase in &lt;em&gt;profitability&lt;/em&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Segmentation &amp;amp; dashboards are insufficient
&lt;/h2&gt;

&lt;p&gt;When key performance indicators go down, traditional approaches like segmentation can only go so far in pinpointing the root cause. While segmentation identifies which segments are affected, it often falls short in providing actionable insights into why the change occurred.&lt;/p&gt;

&lt;p&gt;More generally, dashboards are poor tools for identifying relationships in data. Modern regression analysis can help bridge this gap between problem and action. By modeling the relationship between the KPI and various independent variables, regression analysis can uncover hidden patterns and factors driving the KPI change.&lt;/p&gt;

&lt;p&gt;The answer could be in your data. If it’s there, regression analysis will help you find it.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>saasmetrics</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Six months with GitHub Copilot</title>
      <dc:creator>Richard Li</dc:creator>
      <pubDate>Mon, 05 Feb 2024 14:34:33 +0000</pubDate>
      <link>https://dev.to/richarddli/six-months-with-github-copilot-286e</link>
      <guid>https://dev.to/richarddli/six-months-with-github-copilot-286e</guid>
      <description>&lt;p&gt;I’ve been coding with GitHub Copilot over the past six months, and I’ve steadily seen it improve my productivity. Because the Copilot UX is seamlessly integrated into my IDE, I didn't think initially there was much to using Copilot. But after six months of steady usage, I've found that my way of using Copilot has evolved and improved over time.&lt;/p&gt;

&lt;p&gt;TL;DR: Copilot really improves productivity -- especially if you invest in adapting your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;I’m a 2x founder, and with my current startup, I’m doing all the programming. As such, I have to be a generalist programmer. On any given day, I could be working on the web UI, some backend business logic, machine learning, or writing documentation.&lt;/p&gt;

&lt;p&gt;I’m building a &lt;a href="https://www.amorphousdata.com" rel="noopener noreferrer"&gt;cloud application&lt;/a&gt; that builds predictive models from large data sets. I’ve worked to make the stack as boring as possible, but there's a lot of different technologies that need to be used.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python. The ecosystem of data and machine learning frameworks built on Python made this an obvious choice.&lt;/li&gt;
&lt;li&gt;Flask. Simple, minimalist web framework.&lt;/li&gt;
&lt;li&gt;SQL. State needs to be stored in a relational database.&lt;/li&gt;
&lt;li&gt;Pandas. While Polars and DuckDB get a lot of attention, the reality is that most research and documentation start with Pandas.&lt;/li&gt;
&lt;li&gt;HTML &amp;amp; CSS. Hard to avoid these two if you have a web application.&lt;/li&gt;
&lt;li&gt;Bootstrap. I have minimal front end skills, so I picked Bootstrap because it has professional-looking components that are easy for me to use.&lt;/li&gt;
&lt;li&gt;JavaScript. I use small amounts of JavaScript to improve the UX. In particular, ML jobs can take tens of minutes or longer to run, so having a UI that can poll and update status is helpful.&lt;/li&gt;
&lt;li&gt;PyTorch and numerous ML libraries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, my “simple” application uses five languages (Python, SQL, JavaScript, HTML, CSS) and four major frameworks (Pandas, Bootstrap, PyTorch, Flask) that I interact with on a daily basis.&lt;/p&gt;

&lt;h2&gt;
  
  
  My workflow
&lt;/h2&gt;

&lt;p&gt;I’m using VSCode with the Copilot plug-in. Initially, when I tried Copilot, I used it as a smarter auto-complete. I would write code, and Copilot would sometimes suggest code. I would read the code suggestion and if it made sense, I would accept it. This was super helpful for writing tedious boilerplate code, for example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0esr11kseg22gh4jerpr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0esr11kseg22gh4jerpr.png" alt="tabs" width="690" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While I appreciated the reduction in tedium, this felt like an incremental improvement in productivity. Then, I discovered the Copilot chat. Over time, I've found that using this has created a step function increase in productivity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6qley6o8pd155l892e6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6qley6o8pd155l892e6.png" alt="rewrite sql" width="497" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After experimentation, my workflow today is something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Figure out requirements. For example, a number of users asked for native support for downloading data directly from BigQuery.&lt;/li&gt;
&lt;li&gt;Design the solution. Draw a UI mockup on my whiteboard, look at the code, and figure out how I want to add this functionality to the code in a maintainable way. If I know that the problem I’m trying to tackle is something others have run into, I ask Copilot. This is a fairly iterative approach where I might write out some pseudo-code that helps me formulate my thoughts.&lt;/li&gt;
&lt;li&gt;Implementation. This is where the Copilot chat is really great. I now know that I need to create a function call that does X. I go to the relevant bit of code, and type into the chat “create a function that does X”. And Copilot cranks out a function that does X!&lt;/li&gt;
&lt;li&gt;Review &amp;amp; test. I read the code to make sure I understand it, adding comments and fixes as I go along.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’ve found that the real value of Copilot is no longer the auto-complete. Instead, it’s Copilot’s ability to prototype entire functions based on my design that saves the most time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Copilot represents conventional wisdom
&lt;/h2&gt;

&lt;p&gt;For me, Copilot represents the conventional wisdom of how to tackle a specific programming problem. If I'm tackling a programming problem and I have a sense that someone must have run into this problem before, I turn to Copilot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F003c8cy3jknm05wxzh68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F003c8cy3jknm05wxzh68.png" alt="rewrite func" width="502" height="847"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before Copilot, I relied on Google to give me the conventional wisdom. Today, many of the top results on Google are commercial websites that are optimized for ranking and not user experience. Copilot gives me contextual results without the spam!&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Copilot chat
&lt;/h2&gt;

&lt;p&gt;The most important thing to realize with Copilot chat is that it takes two inputs: your code and whatever you type into the chat box. &lt;strong&gt;Make sure your cursor is in a relevant part of the code before you ask your question.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's an example where Copilot shines. The stack trace tells me where the error occurs, so I open the file and place my cursor on the line. I then paste in the error message. This is what Copilot tells me:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnjthtsfe8xkf9sg4l7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnjthtsfe8xkf9sg4l7y.png" alt="error msg" width="800" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;While Copilot does represent conventional wisdom, you don't always want to follow conventional wisdom. In my personal experience, Copilot is heavily biased toward using SQLAlchemy in Python (which I don't really use) and jQuery for all UI styling (even though I use Bootstrap). The other limitation I've encountered is that I can only supply one piece of context with my question: the code at hand. In reality, a web application has three pieces of context: the database schema, the HTML template, and the actual business logic. I'm sure this is something that will be addressed in the future, though.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Copilot for programming
&lt;/h2&gt;

&lt;p&gt;Here's what I would tell my six-month-ago self about programming with Copilot:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Think of Copilot as conventional wisdom, and turn to it when you suspect someone has solved your problem before.&lt;/li&gt;
&lt;li&gt;Use the chat and place your cursor on the code you want it to comment on.&lt;/li&gt;
&lt;li&gt;Once you design a feature, write it out in pseudo-code, and then prompt Copilot to flesh out the implementation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I've found that Copilot gives me more time to spend on requirements &amp;amp; design, which is where I should be spending my time.&lt;/p&gt;

&lt;p&gt;One thing that hasn't changed for me is the feeling of satisfaction when the code works. Whether I write the code or Copilot does, I get the same sense of accomplishment.&lt;/p&gt;

&lt;p&gt;Recommended.&lt;/p&gt;

&lt;p&gt;This article was originally published &lt;a href="https://www.thelis.org/blog/github-copilot" rel="noopener noreferrer"&gt;on my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>githubcopilot</category>
      <category>programming</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Using data for predictive analytics</title>
      <dc:creator>Richard Li</dc:creator>
      <pubDate>Fri, 12 Jan 2024 15:13:03 +0000</pubDate>
      <link>https://dev.to/richarddli/using-data-for-predictive-analytics-49o9</link>
      <guid>https://dev.to/richarddli/using-data-for-predictive-analytics-49o9</guid>
      <description>&lt;p&gt;At &lt;a href="https://www.getambassador.io" rel="noopener noreferrer"&gt;Ambassador&lt;/a&gt;, the weekly/monthly/quarterly metrics review was a core part of the operating cadence of the company. We'd meet and review metrics that spanned the business: usage, customer satisfaction, revenue, pipeline. When a metric was not tracking to plan, we'd task a team to investigate and recommend actions to get back on track.&lt;/p&gt;

&lt;p&gt;Yet, as the business grew in complexity, these teams got slower, as they were trying to reconcile data from different systems: Salesforce, HubSpot, Google Analytics, product data. Luckily, I was introduced to the world of modern data engineering, which promised a better approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modern data infrastructure, Extract-Load-Transform, and cloud data warehouses
&lt;/h2&gt;

&lt;p&gt;Over the past decade, two macro trends have powered innovation in the modern data ecosystem:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Unlimited demand for data. Modern applications such as machine learning and business intelligence are fueled by data -- the more, the better.&lt;/li&gt;
&lt;li&gt;Plummeting cost of storage and compute, particularly in the cloud. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These trends have given rise to the "Extract-Load-Transform" (ELT) &lt;a href="https://a16z.com/emerging-architectures-for-modern-data-infrastructure/" rel="noopener noreferrer"&gt;architecture&lt;/a&gt;, an evolution of the traditional "Extract-Transform-Load" approach. In an ELT architecture, data is stored in its original format and transformed as needed, instead of being transformed prior to load. Because no information is ever thrown away, an ELT architecture adapts far more easily to future (unanticipated) needs. The tradeoff of ELT versus ETL is higher storage cost, but with cloud storage being cheap, this is almost always worthwhile.&lt;/p&gt;
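&lt;p&gt;The load-raw-then-transform idea can be sketched in miniature with SQLite standing in for the warehouse (the table and column names are hypothetical; this uses SQLite's JSON functions, which modern builds include):&lt;/p&gt;

```python
import sqlite3, json

db = sqlite3.connect(":memory:")  # stand-in for the cloud data warehouse

# Extract + Load: land the source records as-is, one raw JSON blob per row.
db.execute("CREATE TABLE raw_signups (payload TEXT)")
events = [
    {"email": "A@Example.com", "plan": "pro", "seats": 5},
    {"email": "b@example.com", "plan": "free", "seats": 1},
]
db.executemany("INSERT INTO raw_signups VALUES (?)",
               [(json.dumps(e),) for e in events])

# Transform: derive a clean view on demand. The raw rows are never
# discarded, so new transformations can be added later without
# re-extracting anything from the source system.
db.execute("""
    CREATE VIEW signups AS
    SELECT lower(json_extract(payload, '$.email')) AS email,
           json_extract(payload, '$.plan')         AS plan,
           json_extract(payload, '$.seats')        AS seats
    FROM raw_signups
""")
rows = db.execute("SELECT email, plan FROM signups ORDER BY email").fetchall()
print(rows)
```

&lt;p&gt;Because the raw payloads are kept, a new transformation, say one that extracts seat counts for a revenue model, is just another view over the same table.&lt;/p&gt;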

&lt;h2&gt;
  
  
  Business intelligence is not a dashboard
&lt;/h2&gt;

&lt;p&gt;We adopted a modern data infrastructure solution: &lt;a href="https://cloud.google.com/bigquery" rel="noopener noreferrer"&gt;Google BigQuery&lt;/a&gt; (cloud data warehouse), &lt;a href="https://www.fivetran.com/" rel="noopener noreferrer"&gt;Fivetran&lt;/a&gt; (ELT), &lt;a href="https://www.metabase.com/" rel="noopener noreferrer"&gt;Metabase&lt;/a&gt; (business intelligence / dashboards), &lt;a href="https://www.getdbt.com" rel="noopener noreferrer"&gt;dbt&lt;/a&gt; (modeling), and &lt;a href="https://hightouch.com/" rel="noopener noreferrer"&gt;Hightouch&lt;/a&gt; (Reverse ETL). This worked great, as we were able to aggregate data from our CRM, marketing, product, support, and other systems in one place. And we started building dashboards, which illuminated parts of the business we had never seen before.&lt;/p&gt;

&lt;p&gt;But the success of the data warehouse created more demand for dashboards, and we had dashboards upon dashboards. Our small data analyst team couldn't keep up with the demand. We also struggled to keep all of our dashboards up-to-date, and track which dashboards were being used.&lt;/p&gt;

&lt;p&gt;Our business "intelligence" solution of dashboards was pretty dumb. That's when we realized that we were overusing dashboards as both a reporting tool and a communication and alignment tool. If we went back to first principles, the reality is that we wanted &lt;strong&gt;data to make better decisions&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Models
&lt;/h2&gt;

&lt;p&gt;Around this time, I was introduced to a CFO who was looking for her next opportunity. We weren't actively looking for a CFO, but we really liked her, and I wanted to figure out if we could get her on board. And she said something that stuck with me: "What I like to do is figure out how to turn the business into a spreadsheet."&lt;/p&gt;

&lt;p&gt;And I realized: We didn't need more dashboards. We needed more models.&lt;/p&gt;

&lt;p&gt;What's a model? A model is a representation of a thing that is smaller in scale than the original. Models, being smaller in scale, are easier to manipulate. In the context of business, models are everywhere: financial models, funnels, and sales productivity models are common examples. Models are super-powerful because assumptions can be quickly tested. At a previous company, we had agreed on a top-line revenue goal for the year, but when we plugged that number into the sales productivity model, it showed how we would have to more than double rep productivity. Needless to say, we added more sales reps to the plan.&lt;/p&gt;

&lt;p&gt;We had a financial model, productivity model, and funnel model -- but what we started to do after that was to build even more granular models. Our first model was our signup to product-qualified lead model, which accounted for all the steps (and decisions) a user needed to make from signup all the way to activation all the way to lead. We then built a dashboard that tracked the different steps of this model, and, as our understanding of the model evolved, so did the dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data teams and the evolution of business intelligence
&lt;/h2&gt;

&lt;p&gt;I've talked to dozens of data teams at B2B SAAS companies since then, and I've realized our journey was typical. The modern data journey looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdqwc6qb90go0x507y61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdqwc6qb90go0x507y61.png" alt="Data Roadmap" width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Businesses usually start with a homegrown business metrics dashboard, which can be implemented in slides or spreadsheets. This dashboard is an aggregation of key metrics from each function. These metrics are pulled from each function's critical systems: sales will pull pipeline data from Salesforce, marketing will pull lead data from marketing automation, support from Zendesk, and so forth. And this strategy can work well for a long time.&lt;/p&gt;

&lt;p&gt;At some point, companies realize they need to do more cross-functional data analysis. It's not enough to know that marketing drove X signups, or product drove Y activations, or sales created Z opportunities. Organizations realize that all of these functions need to work together, and knowing where to find signups that are most likely to result in activations and opportunities requires the integration of different data systems. This drives Phase 2: the creation of a central data team, which then drives the adoption of an ELT architecture and dashboards.&lt;/p&gt;

&lt;p&gt;Data is addictive: the more you have, the more you want. And this will lead to the data team being overwhelmed with requests to create more dashboards and analysis. The typical answer is "self-service dashboards" from a business intelligence vendor, where technologies such as Looker, or more recently, LLM-powered dashboards, have become popular. This leads to Phase 3: self-service dashboards for the rest-of-the-organization.&lt;/p&gt;

&lt;p&gt;Self-service dashboards have a problem: they make the easy problems easy, and they don't make hard problems any easier. What typically follows from self-service is an explosion of unmaintained dashboards (because creating them is easy), while the bottleneck on analysis remains. The business functions inevitably hire their own data analysts: a marketing data analyst, a product analyst, and so on. This is because the real questions can't be answered in a chat session with an LLM. This leads to Phase 4: function-specific analysts.&lt;/p&gt;

&lt;p&gt;All these analysts querying the same data warehouse causes a different type of problem: multiple competing definitions of critical KPIs. Different teams use similar but different definitions of revenue. Marketing runs on a Sunday weekly start, while sales starts on Monday. All these different definitions start to create concerns around data integrity. The solution proposed is to standardize these definitions in a metrics layer (&lt;a href="https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70" rel="noopener noreferrer"&gt;AirBnb's version of this story&lt;/a&gt;). And thus, the organization is in Phase 5: implementing metrics.&lt;/p&gt;

&lt;p&gt;Yet with all this investment, the original goal of the business, using data to make better decisions, is taking longer than ever. The cycle loops back to Phase 2, this time in an attempt to standardize access to the metrics layer so that everyone can be more productive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ambassador, and a different way
&lt;/h2&gt;

&lt;p&gt;At Ambassador, we got to Phase 2. The data team was a bottleneck, but we weren't big enough to justify a big investment in self-service. We were forced to innovate, and that's when we started building models. Our data evolution ended up looking like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrk5e6jc5q641o5klsbr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrk5e6jc5q641o5klsbr.png" alt="Data Roadmap v2" width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Different functional teams who needed dashboards started by building models in spreadsheets and slides. They shared these models with the data teams, who helped them build dashboards that let them track the performance of these models.&lt;/p&gt;

&lt;p&gt;We found that models became powerful tools for aligning the data team, the functional teams, and the broader organization on the purpose of the dashboards. Beyond the alignment, we were able to test and question assumptions and use these models in planning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Business intelligence is about communication
&lt;/h2&gt;

&lt;p&gt;Since then, I've come to view models as a critical abstraction between analysis &amp;amp; dashboard tools and the data itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08lu226oyq28ssedy3ex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08lu226oyq28ssedy3ex.png" alt="Data stack" width="403" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the context of the data stack, these models are a distinct abstraction from the modeling done by tools such as dbt or LookML. These models represent how the business operates, and capture the business processes themselves.&lt;/p&gt;

&lt;p&gt;More generally, I've come to see business intelligence as a &lt;em&gt;communication&lt;/em&gt; function. With BI, you're using data, visualizations, and models to align the organization around the challenge at hand. Models are a crucial part of this communication function. In the process of building a model, you'll find holes in your data and assumptions you're making; that's when you should go validate those assumptions and fill in those holes. Most importantly, models create alignment between all the different functions in the organization, from data to engineering to go-to-market.&lt;/p&gt;

&lt;p&gt;This post was originally published &lt;a href="https://www.amorphousdata.com/blog/data-is-not-a-strategy" rel="noopener noreferrer"&gt;on the Amorphous Data blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>data</category>
    </item>
    <item>
      <title>Developing Python Applications on Kubernetes</title>
      <dc:creator>Richard Li</dc:creator>
      <pubDate>Tue, 02 Mar 2021 21:29:59 +0000</pubDate>
      <link>https://dev.to/ambassadorlabs/developing-python-applications-on-kubernetes-5339</link>
      <guid>https://dev.to/ambassadorlabs/developing-python-applications-on-kubernetes-5339</guid>
      <description>&lt;p&gt;Kubernetes has become the de-facto standard for running cloud applications. With Kubernetes, users can deploy and scale containerized applications at any scale: from one service to thousands of services. The power of Kubernetes is not free — the learning curve is particularly steep, especially for application developers. Knowing what to do is just half the battle, then you have to choose the best tools to do the job. So how do Python developers create a development workflow on Kubernetes that is fast and effective?&lt;/p&gt;

&lt;p&gt;There are two unique challenges with creating productive development workflows on Kubernetes.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Most development workflows are optimized for local development, and Kubernetes applications are designed to be native to the cloud.&lt;/li&gt;
&lt;li&gt;Most Kubernetes applications either start off or evolve into a microservices architecture. Thus, your development environment becomes more complex as every microservice adds additional dependencies to test code. And in turn, these &lt;a href="https://www.getambassador.io/resources/eliminate-local-resource-constraints/" rel="noopener noreferrer"&gt;services quickly become too resource-intensive&lt;/a&gt; and exceed the limits of your local machine.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this tutorial, we’ll walk through how to set up a realistic development environment for Kubernetes. Typically, we’d have to wait for a container build, push to registry and deploy to see the impact of our change. Instead, we’ll use Telepresence and see the results of our change instantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getambassador.io/products/telepresence/" rel="noopener noreferrer"&gt;Telepresence&lt;/a&gt; is an open source project that lets you run your microservice locally, while creating a bi-directional network connection to your Kubernetes cluster. This approach enables the microservice running locally to communicate to other microservices running in the cluster, and vice versa. Since you’re running the microservice locally, you’re able to &lt;a href="https://www.getambassador.io/use-case/local-kubernetes-development/" rel="noopener noreferrer"&gt;benefit&lt;/a&gt; from any workflow or tool that you run locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Deploy a Sample Microservices Application
&lt;/h2&gt;

&lt;p&gt;For our example, we’ll make code changes to a Python service running between a resource-intensive Java service and a large datastore. We’ll start by deploying a sample microservice application consisting of 3 services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;VeryLargeJavaService&lt;/code&gt; A memory-intensive service written in Java that generates the front-end graphics and web pages for our application&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DataProcessingService&lt;/code&gt; A Python service that manages requests for information between the two services.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;VeryLargeDataStore&lt;/code&gt; A large datastore service that contains the sample data for our Edgey Corp store.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: We’ve called these services "VeryLarge" to emphasize that your local environment may not have enough CPU and RAM, or that you may simply not want to pay for all that extra overhead for every developer.&lt;/p&gt;
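To make the moving parts concrete, here's a minimal sketch of what a service like DataProcessingService could look like. This is an assumption for illustration, not the repository's actual code, though the /color endpoint and DEFAULT_COLOR variable mirror what we'll interact with later in the tutorial:

```python
# Hypothetical minimal app.py for a service like DataProcessingService
# (the actual code in the edgey-corp-python repo may differ): a Flask app
# that reports its configured color on /color.
from flask import Flask

DEFAULT_COLOR = "blue"  # the value the tutorial later changes to "orange"

app = Flask(__name__)

@app.route("/color")
def get_color():
    return DEFAULT_COLOR

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=3000)  # so curl localhost:3000/color works
```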

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjs5j7k6m3etv8xno5pc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjs5j7k6m3etv8xno5pc.png" alt="image" width="800" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this architecture diagram, you’ll notice that requests from users are routed through an ingress controller to our services. For simplicity’s sake, we’ll skip the step of &lt;a href="https://www.getambassador.io/docs/latest/topics/install/install-ambassador-oss/#kubernetes-yaml" rel="noopener noreferrer"&gt;deploying an ingress controller&lt;/a&gt; in this tutorial. If you’re ready to use Telepresence in your own setup and need a simple way to set up an ingress controller, we recommend checking out the &lt;a href="https://www.getambassador.io/products/edge-stack/" rel="noopener noreferrer"&gt;Ambassador Edge Stack&lt;/a&gt; which can be easily configured with the &lt;a href="https://app.getambassador.io/initializer" rel="noopener noreferrer"&gt;K8s Initializer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s deploy the sample application to your Kubernetes cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f https://raw.githubusercontent.com/datawire/edgey-corp-python/master/k8s-config/edgey-corp-web-app-no-mapping.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: This tutorial assumes you have access to a Kubernetes cluster with &lt;code&gt;kubectl&lt;/code&gt; configured. If you don’t, options include MicroK8s and the Kubernetes cluster built into Docker Desktop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Set up your local Python development environment
&lt;/h2&gt;

&lt;p&gt;We’ll need a local development environment so that we can edit the &lt;code&gt;DataProcessingService&lt;/code&gt; service. As you can see in the architecture diagram above, the &lt;code&gt;DataProcessingService&lt;/code&gt; is dependent on both the &lt;code&gt;VeryLargeJavaService&lt;/code&gt; and the &lt;code&gt;VeryLargeDataStore&lt;/code&gt;, so in order to make a change to this service, we’ll have to interact with these other services as well. Let’s get started!&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository for this application from GitHub:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/datawire/edgey-corp-python.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Install the application dependencies with &lt;code&gt;pip&lt;/code&gt; (use &lt;code&gt;pip3&lt;/code&gt; if &lt;code&gt;pip&lt;/code&gt; on your system points to Python 2):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd edgey-corp-python/DataProcessingService/
pip install flask requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Run the application (use &lt;code&gt;python3&lt;/code&gt; if &lt;code&gt;python&lt;/code&gt; on your system points to Python 2):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Test the application. In another terminal window, we’ll send a request to the service, which should return blue.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl localhost:3000/color
blue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Make Code Changes with Telepresence
&lt;/h2&gt;

&lt;p&gt;To test a code change with Kubernetes, you typically need to build a container image, push the image to a registry, and deploy it to the Kubernetes cluster. This takes minutes.&lt;/p&gt;

&lt;p&gt;Telepresence is an open source Cloud Native Computing Foundation project that solves exactly this problem. By creating a bidirectional network connection between your local development environment and the Kubernetes cluster, Telepresence enables fast, local development.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download Telepresence (~60MB):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Mac OS X
sudo curl -fL https://app.getambassador.io/download/tel2/darwin/amd64/latest/telepresence -o /usr/local/bin/telepresence
# Linux
sudo curl -fL https://app.getambassador.io/download/tel2/linux/amd64/latest/telepresence -o /usr/local/bin/telepresence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Make the binary executable:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo chmod a+x /usr/local/bin/telepresence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Test Telepresence by connecting to the remote Kubernetes cluster:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;telepresence connect
```


4. Send a request to the Kubernetes API server:



```
curl -ik https://kubernetes.default.svc.cluster.local
HTTP/1.1 401 Unauthorized
Cache-Control: no-cache, private
Content-Type: application/json
Www-Authenticate: Basic realm="kubernetes-master"
Date: Tue, 09 Feb 2021 23:21:51 GMT
```



Congratulations! You’ve successfully configured Telepresence. Telepresence is intercepting the request you’re making to the Kubernetes API server, and routing over its direct connection to the cluster instead of over the Internet.

## Step 4: Set up an Intercept
An intercept is a routing rule for Telepresence. We can create an intercept that will route traffic intended for the `DataProcessingService` in the cluster and route all the traffic to the local version of the DataProcessingService running on port 3000.
1. Create the intercept:


```
telepresence intercept dataprocessingservice --port 3000
```


2. Access the application directly with Telepresence. In your browser, go to http://verylargejavaservice:8080. Again, Telepresence is intercepting requests from your browser and routing them directly to the Kubernetes cluster.
3. Now, let’s make a code change. Open `edgey-corp-python/DataProcessingService/app.py` and change `DEFAULT_COLOR` from `blue` to `orange`. Save the file.
4. Reload the page in your browser, and see how the color has changed from blue to orange.

That’s it! With Telepresence we saw how quickly we can go from editing a local service to seeing how these changes will look when deployed with the larger application. When you compare it to our original process of building and deploying a container after every change, it’s very easy to see how much time you can save especially as we make more complex changes or run even larger services.


## Learn More about Telepresence
Typically, developers at organizations adopting Kubernetes face challenges slow feedback loops from inefficient local development environments. Today, we’ve learned how to use Telepresence to set up fast, efficient development environments for Kubernetes and get back to the instant feedback loops you had with your legacy applications.

If you want to learn more about Telepresence, check out the following resources:
* Watch a [demo video](https://www.youtube.com/watch?v=W_a3aErN3NU), which shows more details on different features in Telepresence
* Check out the [Python Quickstart for Telepresence](http://docs/latest/telepresence/quick-start/qs-python/)
* Learn about [Preview URLs](https://www.getambassador.io/docs/pre-release/telepresence/howtos/preview-urls/#collaboration-with-preview-urls)  for easy collaboration with teammates
* [Join our Slack channel](https://d6e.co/slack) to connect with the Telepresence community

In our next tutorial, we’ll use Telepresence to set up a local Kubernetes development environment and then use Pycharm to set breakpoints and debug a broken service. To be notified when more tutorials are available, make sure to check out our [website](https://www.getambassador.io) or follow us on [Twitter](http://www.twitter.com/ambassadorlabs).

*This post was originally published on [Python Pandemonium](https://medium.com/python-pandemonium/developing-python-applications-on-kubernetes-75be68a3f0f9).*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>python</category>
      <category>kubernetes</category>
      <category>telepresence</category>
      <category>microservices</category>
    </item>
    <item>
      <title>SRE vs Platform Engineering</title>
      <dc:creator>Richard Li</dc:creator>
      <pubDate>Tue, 16 Feb 2021 21:08:00 +0000</pubDate>
      <link>https://dev.to/ambassadorlabs/the-rise-of-cloud-native-engineering-organizations-4dge</link>
      <guid>https://dev.to/ambassadorlabs/the-rise-of-cloud-native-engineering-organizations-4dge</guid>
      <description>&lt;p&gt;Over the past decade, engineering and technology organizations have converged on a common set of best practices for building and deploying cloud-native applications. These best practices include continuous delivery, containerization, and building observable systems.&lt;/p&gt;

&lt;p&gt;At the same time, cloud-native organizations have radically changed how they’re organized, moving from large departments (development, QA, operations, release) to smaller, independent development teams. These application development teams are supported by two new functions: site reliability engineering and platform engineering. SRE and platform engineering are the spiritual successors of traditional operations teams, bringing the discipline of software engineering to different aspects of operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j8q19xyfmslnilcxoe1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j8q19xyfmslnilcxoe1.png" alt="image" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Site Reliability Engineering and Platform Engineering
&lt;/h1&gt;

&lt;p&gt;Platform engineering teams apply software engineering principles to accelerate software delivery. Platform engineers ensure application development teams are productive in all aspects of the software delivery lifecycle.&lt;/p&gt;

&lt;p&gt;Site reliability engineering teams apply software engineering principles to improve reliability. Site reliability engineers minimize the frequency and impact of failures that can impact the overall reliability of a cloud application.&lt;/p&gt;

&lt;p&gt;These two teams are frequently confused and the terms are sometimes used interchangeably. Indeed, some organizations consolidate SRE and platform engineering into the same function. This occurs because both roles apply a common set of principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform as product. These teams spend time understanding their internal customers, building roadmaps, having a planned release cadence, writing documentation, and doing all the things that go into a software product.&lt;/li&gt;
&lt;li&gt;Self-service platforms. These teams build their platforms for internal use. Best practices are encoded into these platforms, so that their users don’t need to worry about them -- they just push the button. In the &lt;a href="https://puppet.com/resources/report/2020-state-of-devops-report/" rel="noopener noreferrer"&gt;Puppet Labs 2020 State of DevOps report&lt;/a&gt;, Puppet Labs found that high-functioning DevOps organizations had more self-service infrastructure than organizations at lower stages of DevOps evolution.&lt;/li&gt;
&lt;li&gt;A constant focus on &lt;a href="https://sre.google/sre-book/eliminating-toil/" rel="noopener noreferrer"&gt;eliminating toil&lt;/a&gt;. As defined in the Google SRE book, toil is manual, repetitive, automatable, tactical work. The best SRE and platform teams identify toil, and work to eliminate it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Platform Engineering
&lt;/h1&gt;

&lt;p&gt;Platform engineers constantly examine the entire software development lifecycle from source to production. From this introspective process, they build a workflow that enables application developers to rapidly code and ship software. A basic workflow typically includes a source control system connected with a continuous integration system, along with a way to deploy artifacts into production.&lt;/p&gt;

&lt;p&gt;As the number of application developers using the workflow grows, the needs of the platform evolve. Different teams of application developers need similar but different workflows, so self-service infrastructure becomes important. Common platform engineering targets for self-service include CI/CD, alerting, and deployment workflows.&lt;/p&gt;

&lt;p&gt;In addition to self-service, education and collaboration become challenges. Platform engineers find they increasingly spend time educating application developers on best practices and how to best use the platform. Application developers also find that they depend on other teams of application developers, and look to the platform engineering team to give them the tools to collaborate productively with different teams.&lt;/p&gt;

&lt;h1&gt;
  
  
  Site Reliability Engineering
&lt;/h1&gt;

&lt;p&gt;Site reliability engineers create and evolve systems to automatically run applications, reliably. The concept of site reliability engineering originated at Google, and is documented in detail in the Google SRE Book. Ben Treynor Sloss, the SVP at Google responsible for technical operations, described SRE as “what happens when you ask a software engineer to design an operations team.” &lt;/p&gt;

&lt;p&gt;SREs define service level objectives and build systems to help services achieve these objectives. These systems evolve into a platform and workflow that encompass monitoring, incident management, eliminating single points of failure, failure mitigation, and more.&lt;/p&gt;
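As a hypothetical illustration of how an SLO becomes something a system can act on, here's a sketch of an error-budget check for a 99.9% availability objective (the function and numbers are invented for illustration, not from any SRE team's tooling):

```python
# Toy error-budget calculation for an availability SLO (illustrative only).
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means the budget is blown."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures

# With a 99.9% SLO, 1,000,000 requests allow up to 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

When the remaining budget trends toward zero, an SRE team might freeze risky releases and invest in reliability work instead.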

&lt;p&gt;A key part of SRE culture is to treat every failure as a failure in the reliability system. Rigorous post-mortems are critical to identifying the root cause of the failure, and corrective actions are introduced into the automatic system to continue to improve reliability.&lt;/p&gt;

&lt;h1&gt;
  
  
  SRE and Platform Engineering at New Relic
&lt;/h1&gt;

&lt;p&gt;One of us (Bjorn Freeman-Benson) managed the engineering organization at New Relic until 2015 as it grew from a handful of customers to tens of thousands of customers, all sending millions of requests per second into the cloud. New Relic had independent SRE and platform engineering teams that followed the general principles outlined above.&lt;/p&gt;

&lt;p&gt;One of the reasons these teams were built separately was that the people who thrived in these roles differed. While both SREs and platform engineers need strong systems engineering skills in addition to classic programming skills, the roles dictate very different personality types. SREs tend to enjoy crisis management and get an adrenaline rush out of troubleshooting an outage. SRE managers thrive under intense pressure and are good at recruiting and managing similarly minded folks. On the other hand, platform engineers are more typical software engineers, preferring to work without interruption on big, complex problems. Platform engineering managers prefer to operate on a consistent cadence.&lt;/p&gt;

&lt;h1&gt;
  
  
  DevOps and GitOps
&lt;/h1&gt;

&lt;p&gt;Over the past decade, DevOps has become a popular term to describe many of these practices. More recently, GitOps has also emerged as a popular term. How do DevOps and GitOps relate to platform and SRE teams?&lt;/p&gt;

&lt;p&gt;Both DevOps and GitOps are a loosely codified set of principles of how to manage different aspects of infrastructure. The core principles of both of these philosophies -- automation, infrastructure as code, application of software engineering -- are very similar.&lt;/p&gt;

&lt;p&gt;DevOps is a broad movement that began with a focus on eliminating traditional silos between development and operations. Over time, strategies such as infrastructure automation and engineering applications with operations in mind have gained widespread acceptance as better ways to build highly reliable applications.&lt;/p&gt;

&lt;p&gt;GitOps is an approach for application delivery. In GitOps, declarative configuration is used to codify the desired state of the application at any moment in time. This configuration is managed in a versioned source control system as the single source of truth. This ensures auditability, reproducibility, and consistency of configuration.&lt;/p&gt;
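For instance, the declarative desired state might be a Kubernetes manifest like this minimal sketch (the service name and image are hypothetical); a GitOps agent continuously reconciles the cluster toward whatever is committed to Git:

```yaml
# Desired state, versioned in Git: "run three replicas of checkout v1.4.2".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: example.com/checkout:1.4.2
```

Rolling back is then just reverting a commit, and the audit trail is the Git history itself.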

&lt;blockquote&gt;
&lt;p&gt;DevOps is a set of guiding principles for SRE, while GitOps is a set of guiding principles for platform engineering.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Unlocking application development productivity
&lt;/h1&gt;

&lt;p&gt;Site reliability engineering and platform engineering are two functions that are critical to optimizing engineering organizations for building cloud-native applications. The SRE team works to deliver infrastructure for highly reliable applications, while the platform engineering team works to deliver infrastructure for rapid application development. Together, these two teams unlock the productivity of application development teams.&lt;/p&gt;

&lt;p&gt;This story was originally published on the &lt;a href="https://www.getambassador.io/resources/rise-of-cloud-native-engineering-organizations/" rel="noopener noreferrer"&gt;Ambassador blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Optimize the Kubernetes Developer Experience with Version 0</title>
      <dc:creator>Richard Li</dc:creator>
      <pubDate>Fri, 10 Jul 2020 13:39:46 +0000</pubDate>
      <link>https://dev.to/ambassadorlabs/optimize-the-kubernetes-developer-experience-with-version-0-28n7</link>
      <guid>https://dev.to/ambassadorlabs/optimize-the-kubernetes-developer-experience-with-version-0-28n7</guid>
      <description>&lt;p&gt;One of the core promises of microservices is development team autonomy, which should, in theory, translate into faster and better decision making. But sometimes, this theory doesn’t translate into reality.&lt;/p&gt;

&lt;p&gt;Why is this the case?&lt;/p&gt;

&lt;p&gt;There are a multitude of reasons why microservices don’t work well. Microservices, cloud-native, and Kubernetes represent a new approach and a culture shift, and there are good ways and bad ways to approach the challenge.&lt;/p&gt;

&lt;p&gt;One of the keys to success is enabling a consistent developer experience for each microservice from day 0, which is critical for unlocking team autonomy and development velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bootstrapping a Microservice
&lt;/h2&gt;

&lt;p&gt;Creating microservices should be cheap and easy. This enables app dev teams to quickly build and ship new microservices to address specific business needs without being encumbered by preexisting code. At the same time, this agility and flexibility do come at a cost — applications become distributed, dynamic organisms that can be harder to develop, test, and debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Developer Experience == Better Customer Experience
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://www.getambassador.io/podcasts/gene-kim-on-developer-productivity-the-five-ideals-and-platforms/" rel="noopener noreferrer"&gt;recent Ambassador podcast&lt;/a&gt;, Gene Kim spoke about how a great developer experience is critical to delivering value to customers. By creating a great developer experience, developers can ship more code, which results in happier customers.&lt;/p&gt;

&lt;p&gt;We’ve seen a similar trend in organizations that successfully adopt microservices: an emphasis on the developer experience. While it may not be a “strategic” initiative in the organization, usually there’s someone at the company who is passionate about creating a great developer workflow and is able to spend time working on continuously improving that developer workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Microservices Developer Experience
&lt;/h2&gt;

&lt;p&gt;With a monolith, there’s a common application that is the target for the development workflow. With microservices, there is no longer a single common application. Every new microservice requires a developer workflow. Without due care, it’s easy to have a smorgasbord of microservices, all with poor developer workflows. In this situation, velocity actually decreases since microservices can’t be easily and rapidly shipped. This defeats the entire rationale for adopting microservices in the first place, and development slows.&lt;/p&gt;

&lt;p&gt;At the same time, &lt;a href="https://blog.getambassador.io/why-it-ticketing-systems-dont-work-with-microservices-18e2be509bf6" rel="noopener noreferrer"&gt;microservices presents an opportunity for improving the developer experience&lt;/a&gt;. By optimizing the developer experience of each microservice, teams can build the best possible developer experience for the team (and not the organization), and continue to optimize that experience as the application and team evolve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer Experience, Defined
&lt;/h2&gt;

&lt;p&gt;A developer experience is the workflow a developer uses to develop, test, deploy, and release software. The developer experience typically consists of both an inner dev loop and an outer dev loop. &lt;a href="https://blog.getambassador.io/four-approaches-for-microservice-testing-inner-dev-loops-in-kubernetes-bcf779668179" rel="noopener noreferrer"&gt;The inner dev loop&lt;/a&gt; is a single developer workflow. A single developer should be able to set up and use an efficient inner dev loop to code and test changes quickly. The inner dev loop is typically used for pre-commit changes. The outer dev loop is a shared developer workflow that is orchestrated by a continuous integration system. The outer dev loop is used for post-commit changes and includes automated builds, tests, and deploys.&lt;/p&gt;

&lt;p&gt;Engineering a good inner and outer dev loop is key to a great developer experience and unlocking the potential of microservices. So how can an engineer help in building a great developer experience?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Version 0 Strategy
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.getambassador.io/learn/kubernetes-glossary/version-0/" rel="noopener noreferrer"&gt;version 0 strategy&lt;/a&gt; involves shipping an end-to-end development and deployment workflow as the first milestone — before any features are coded. A good test of a version 0 milestone is if a developer on a different team is able to independently code, test, and release a change to the microservice without consulting the original team. This implies a version 0 has a development environment, a deployment workflow, and documentation that explains how to get started and ship. With a version 0 in place, the microservices team then begins with feature development, knowing that their ability to rapidly iterate and ship is already in place.&lt;/p&gt;

&lt;p&gt;The Version 0 approach works well for a number of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The codebase is very simple, so there is no reverse engineering of obscure dependencies, monkey patches, or any other gremlins to get a working environment&lt;/li&gt;
&lt;li&gt;With no features, there is less pressure from external parties who want to implement changes and adjustments to the roadmap&lt;/li&gt;
&lt;li&gt;A great developer experience accrues benefits over time, so Version 0 maximizes the payback period&lt;/li&gt;
&lt;li&gt;Most importantly, version 0 sets the tone for the microservice, which is that developer experience is important!&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Version 0 for Engineers
&lt;/h2&gt;

&lt;p&gt;Any engineer can adopt the version 0 practice (and should!). &lt;a href="https://www.getambassador.io/resources/enabling-full-cycle-development" rel="noopener noreferrer"&gt;A development team should have full autonomy over a microservice&lt;/a&gt;, which includes the development timeline and workflow! So starting with a Version 0 will help the team rapidly bootstrap the microservice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Version 0 for Managers
&lt;/h2&gt;

&lt;p&gt;Managers can support Version 0 across the organization by asking engineering teams that are creating new microservices to start with a Version 0. As engineering organizations grow, the organization could choose to assign platform engineers focused on development workflows. These platform engineers should not implement Version 0, but instead provide tools, templates, and best practices to the microservice teams on how best to build a version 0. The Netflix engineering team adopted &lt;a href="https://netflixtechblog.com/full-cycle-developers-at-netflix-a08c31f83249#0df4" rel="noopener noreferrer"&gt;this approach&lt;/a&gt; to developer empowerment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Every engineer has felt the pain of a bad developer workflow. A trivial one-line fix takes a half-day to complete. Microservices can exacerbate this problem. The Version 0 strategy is a simple but powerful strategy that will help integrate developer experience into your organization’s development workflow.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on the &lt;a href="https://blog.getambassador.io/k8s-might-slow-you-down-but-theres-one-thing-you-can-do-about-it-the-version-0-strategy-c81a1a0ff6e" rel="noopener noreferrer"&gt;Ambassador blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>microservices</category>
      <category>ambassador</category>
    </item>
  </channel>
</rss>
