<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yitaek Hwang</title>
    <description>The latest articles on DEV Community by Yitaek Hwang (@yitaek).</description>
    <link>https://dev.to/yitaek</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F576618%2F5a20f7e5-706a-4ecb-8f3e-aa4d000038fc.jpeg</url>
      <title>DEV Community: Yitaek Hwang</title>
      <link>https://dev.to/yitaek</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yitaek"/>
    <language>en</language>
    <item>
      <title>Creating an AI-enabled Slackbot with AWS Bedrock Knowledge Base</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Mon, 19 Jan 2026 22:35:31 +0000</pubDate>
      <link>https://dev.to/aws-builders/creating-a-ai-enabled-slackbot-with-aws-bedrock-knowledge-base-4pdm</link>
      <guid>https://dev.to/aws-builders/creating-a-ai-enabled-slackbot-with-aws-bedrock-knowledge-base-4pdm</guid>
      <description>&lt;p&gt;One of the lowest-friction, highest-ROI applications of large language models (LLMs) so far has been the internal AI assistant. Yes, AI doesn't have to be all about customer-facing chatbots or fully autonomous agents. Just a simple interface for users to ask questions like the following can be a powerful tool: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How do I deploy this service?"&lt;/li&gt;
&lt;li&gt;"What's the on-call runbook for this alert?"&lt;/li&gt;
&lt;li&gt;"Where is the latest diagram for the design doc?" &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These questions already have answers — scattered across Confluence pages, Google Docs, GitHub READMEs, and Slack threads. The problem isn’t generation. It’s retrieval.&lt;/p&gt;

&lt;p&gt;Out of the box, LLMs are great at reasoning and summarization, but they’re completely disconnected from your organization’s institutional knowledge. Prompt stuffing helps a bit. Fine-tuning helps in very narrow cases. But neither scales when your knowledge base changes weekly, or when correctness actually matters.&lt;/p&gt;

&lt;p&gt;This is the void that retrieval-augmented generation (RAG) fills.&lt;/p&gt;

&lt;p&gt;RAG bridges the gap between probabilistic language models and deterministic internal knowledge. Instead of asking an LLM to guess, you retrieve relevant documents first, then ask the model to synthesize an answer grounded in that context. The result is an assistant that feels intelligent without being reckless — and, crucially, one that stays up to date without constant retraining.&lt;/p&gt;

&lt;p&gt;If you're already on AWS, Amazon Bedrock Knowledge Bases provides an easy way to create, deploy, and integrate a RAG pipeline into your existing infrastructure. In this post, we'll walk through how to set up an AWS Bedrock Knowledge Base and connect it to a Slackbot for a realistic internal, AI-enabled assistant use case. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ca44vu37os1l2oonwyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ca44vu37os1l2oonwyx.png" alt=" " width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up AWS Bedrock Knowledge Base
&lt;/h2&gt;

&lt;p&gt;From the AWS console, navigate to Amazon Bedrock. Under Build, choose Knowledge Bases. As of the time of writing, AWS supports indexing unstructured data by creating a custom vector store, using the Kendra GenAI service, or enabling semantic search over structured data (e.g., databases, tables). &lt;/p&gt;

&lt;p&gt;Since most internal data is likely to be unstructured (e.g., Confluence documentation, markdown files, etc.), we'll choose the "Create knowledge base with vector store" option. As of the time of writing, AWS supports Confluence, Salesforce, SharePoint, and web crawlers on top of S3 (note: there is currently a limit of five data sources). For the purposes of this demo, let's choose Confluence. To connect, we'll need to store credentials in AWS Secrets Manager as described in the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/confluence-data-source-connector.html" rel="noopener noreferrer"&gt;detailed guide&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Next, we need to configure our data source parsing strategy (either the AWS default parser or a foundation model like Claude acting as the parser) as well as the chunking strategy for our vector database. Based on these configurations, Bedrock will automatically chunk documents, generate embeddings, and store the vectors in Amazon OpenSearch Serverless. RAG performance depends heavily on these parameters, but for a quick demo, we can start with default chunking and Amazon Titan embeddings. &lt;/p&gt;

&lt;p&gt;Once the vector store is set up, we just have to manually sync the data source to trigger ingestion. You can imagine adding SharePoint for internal PDFs, crawling open source library documentation sites, and pulling in internally hosted S3 files. &lt;/p&gt;
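&lt;p&gt;A sync can also be triggered programmatically. As a sketch, assuming the boto3 &lt;code&gt;bedrock-agent&lt;/code&gt; client and placeholder knowledge base and data source IDs:&lt;/p&gt;

```python
def start_sync(bedrock_agent, kb_id, data_source_id):
    """Kick off an ingestion job that re-chunks and re-embeds the data source."""
    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=data_source_id,
    )
    job = response["ingestionJob"]
    return job["ingestionJobId"], job["status"]

# Example wiring (requires AWS credentials and real IDs):
# import boto3
# client = boto3.client("bedrock-agent", region_name="us-east-1")
# job_id, status = start_sync(client, "MY_KB_ID", "MY_DS_ID")
```

&lt;p&gt;You could run this on a schedule (e.g., a nightly Lambda) so the knowledge base tracks the source documents without manual syncs.&lt;/p&gt;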

&lt;h2&gt;
  
  
  Setting up a Slack bot
&lt;/h2&gt;

&lt;p&gt;With the "hard" part out of the way, we need to set up a Slack App via the Slack Admin Console. The key things we need are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enabling Socket Mode&lt;/li&gt;
&lt;li&gt;Granting, at minimum, the &lt;code&gt;chat:write&lt;/code&gt;, &lt;code&gt;app_mentions:read&lt;/code&gt;, and &lt;code&gt;channels:history&lt;/code&gt; OAuth scopes&lt;/li&gt;
&lt;li&gt;Grabbing the app-level token under the "Basic Information" page and the bot token under "OAuth &amp;amp; Permissions"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The final part is to actually code up the Slack bot. We can use the Slack Bolt SDK to quickly spin up a bot in Python. At a high level, we want the bot to do three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse Slack events (e.g., respond to mentions, slash commands, etc.)&lt;/li&gt;
&lt;li&gt;Query the Knowledge Base&lt;/li&gt;
&lt;li&gt;Generate a response&lt;/li&gt;
&lt;/ol&gt;
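&lt;p&gt;The first step, pulling the user's question out of an &lt;code&gt;app_mention&lt;/code&gt; event, can be sketched like this (the event shape follows Slack's Events API; the helper name is ours):&lt;/p&gt;

```python
import re


def extract_question(event):
    """Strip the leading bot mention (e.g. '<@U123ABC>') from an app_mention event."""
    text = event.get("text", "")
    # Slack encodes user mentions as <@MEMBER_ID>; remove them to get the question
    return re.sub(r"<@[A-Z0-9]+>", "", text).strip()
```

&lt;p&gt;In a Bolt app, this helper would be called inside an &lt;code&gt;app_mention&lt;/code&gt; event handler before querying the Knowledge Base.&lt;/p&gt;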

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faphvux2wpb1a54nx4fna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faphvux2wpb1a54nx4fna.png" alt=" " width="593" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In pseudocode, the handler could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# bedrock_agent = boto3.client("bedrock-agent-runtime")
# bedrock_runtime = boto3.client("bedrock-runtime")

def handler(event, context):
    text = extract_slack_message(event)

    # 1. Query the Knowledge Base for relevant chunks
    retrieval = bedrock_agent.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": text},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}}
    )

    # 2. Ground the prompt in the retrieved context
    prompt = build_prompt(text, retrieval["retrievalResults"])

    # 3. Generate and post the response
    response = bedrock_runtime.invoke_model(
        modelId="arn:aws:bedrock:us-east-1:...:inference-profile/us.anthropic.claude-sonnet-4-5-20250929-v1:0",
        body=prompt
    )

    post_to_slack(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
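&lt;p&gt;As an alternative to separate retrieve and generate calls, Bedrock also exposes a combined &lt;code&gt;retrieve_and_generate&lt;/code&gt; API that does both in one round trip. A minimal sketch (the client is injected; the knowledge base ID and model ARN are placeholders):&lt;/p&gt;

```python
def ask_knowledge_base(client, kb_id, model_arn, question):
    """Retrieve relevant chunks and generate a grounded answer in one call."""
    response = client.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    )
    return response["output"]["text"]

# Example wiring (requires AWS credentials and real IDs):
# import boto3
# client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
# answer = ask_knowledge_base(client, "MY_KB_ID", "arn:aws:bedrock:...", "How do I deploy?")
```

&lt;p&gt;The separate-call flow gives you more control over prompting, while &lt;code&gt;retrieve_and_generate&lt;/code&gt; is less code for simple Q&amp;amp;A.&lt;/p&gt;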



&lt;h2&gt;
  
  
  Tuning for performance
&lt;/h2&gt;

&lt;p&gt;Now for the real magic. Because LLMs are non-deterministic, we need to guide them with some context for better performance. While RAG provides most of our "internal" knowledge, we can still use prompt engineering to steer the generation side. &lt;/p&gt;

&lt;p&gt;You can include a prompt like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an internal engineering assistant.

Answer the question using ONLY the provided context.
If the answer is not in the context, say you do not know.

&amp;lt;context&amp;gt;
{{retrieved_chunks}}
&amp;lt;/context&amp;gt;

Question: {{user_question}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and pass it along with the user's question to dictate what the LLM will do. &lt;/p&gt;
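&lt;p&gt;The &lt;code&gt;build_prompt&lt;/code&gt; helper from the earlier pseudocode could fill this template from the retrieval results. A sketch (the result shape follows the Bedrock retrieve response; the helper itself is ours):&lt;/p&gt;

```python
def build_prompt(question, results):
    """Assemble the grounding prompt from retrieved Knowledge Base chunks."""
    # Each retrieval result carries its chunk text under content.text
    chunks = "\n\n".join(r["content"]["text"] for r in results)
    return (
        "You are an internal engineering assistant.\n\n"
        "Answer the question using ONLY the provided context.\n"
        "If the answer is not in the context, say you do not know.\n\n"
        f"<context>\n{chunks}\n</context>\n\n"
        f"Question: {question}"
    )
```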

&lt;p&gt;The other dial we can turn is how we embed and store our internal knowledge. AWS has a &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/kb-chunking.html" rel="noopener noreferrer"&gt;great guide on how content chunking works&lt;/a&gt; for knowledge bases. The key takeaway is that different chunking schemes perform better depending on how the data is structured. For example, much Confluence documentation follows a natural hierarchical pattern of headings and body text, so hierarchical chunking can preserve the links between sections and improve retrieval performance. &lt;/p&gt;
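&lt;p&gt;For illustration, a hierarchical chunking configuration passed when creating the data source might look like the following sketch (token sizes are illustrative, not recommendations):&lt;/p&gt;

```python
# Sketch of the vectorIngestionConfiguration for a Bedrock data source.
# Parent chunks carry broad context; the smaller child chunks are what
# get embedded and matched at retrieval time.
hierarchical_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [
                {"maxTokens": 1500},  # parent chunks
                {"maxTokens": 300},   # child chunks
            ],
            "overlapTokens": 60,
        },
    }
}
```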

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;AI-enabled Slackbots are quickly becoming the front door to internal knowledge. With Amazon Bedrock Knowledge Bases, AWS has made it easy to build a RAG pipeline without, for the most part, having to operate and maintain a vector database. &lt;/p&gt;

&lt;p&gt;With powerful LLMs like ChatGPT and Claude, creating a Slack bot is easier than ever. But if you would like to compare your solution against a working model, there is a slightly outdated yet functional example from the AWS team on &lt;a href="https://github.com/aws-samples/amazon-bedrock-knowledgebase-slackbot/blob/main/lambda/BedrockKbSlackbotFunction/index.py" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; that you can follow. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Setting up AWS Bedrock with Claude</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Mon, 22 Dec 2025 19:43:05 +0000</pubDate>
      <link>https://dev.to/aws-builders/setting-up-aws-bedrock-with-claude-5f67</link>
      <guid>https://dev.to/aws-builders/setting-up-aws-bedrock-with-claude-5f67</guid>
      <description>&lt;p&gt;As 2025 draws to a close, the AI race is still not showing any signs of slowing down. We have new foundational models coming out between OpenAI, Anthropic, and Google to name a few, and every day, there's yet another strategic partnership being announced by major players up and down the stack. &lt;/p&gt;

&lt;p&gt;It's interesting to look at what the major hyperscalers have done with AI. Google, of course, has its own model (i.e., Gemini) as well as significant infrastructure investments in TPU chips. Microsoft still has deep ties to OpenAI while making new agreements to expand its offerings. Until recently, Amazon was lagging in the AI race, aside from its initial deal with Anthropic in 2023 and investments in Trainium chips. &lt;/p&gt;

&lt;p&gt;But with Anthropic forming a &lt;a href="https://blogs.microsoft.com/blog/2025/11/18/microsoft-nvidia-and-anthropic-announce-strategic-partnerships" rel="noopener noreferrer"&gt;new partnership with Microsoft&lt;/a&gt; and &lt;a href="https://www.cnbc.com/2025/12/16/openai-in-talks-with-amazon-about-investment-could-top-10-billion.html" rel="noopener noreferrer"&gt;OpenAI in talks with Amazon about investment&lt;/a&gt;, exclusive access is looking like a thing of the past. This puts AWS in an interesting position as it can leverage its lead in the cloud space to offer the latest AI models and features to its customers. &lt;/p&gt;

&lt;p&gt;In this blog post, we'll quickly go over how to set up Anthropic models with AWS Bedrock and configure the Claude Code VS Code extension to go through AWS Bedrock. &lt;/p&gt;

&lt;h2&gt;
  
  
  But why?
&lt;/h2&gt;

&lt;p&gt;You might be wondering why someone would go through AWS Bedrock instead of going directly to Anthropic. For personal use, getting a Pro or Max plan is likely the cheaper and easier route. However, for enterprise use cases, leveraging existing AWS infrastructure is often easier in terms of compliance, security, and billing. Also, unless you are a large enterprise customer, establishing an enterprise relationship with Anthropic is a slow process at the moment as they scale up. &lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up AWS Bedrock
&lt;/h2&gt;

&lt;p&gt;Setting up AWS Bedrock to use Anthropic's models is straightforward. Navigate to AWS Bedrock via the Console. Previously, you needed to enable specific models under &lt;code&gt;Configure and learn&lt;/code&gt; &amp;gt; &lt;code&gt;Model access&lt;/code&gt;, but now foundation models are automatically enabled when first invoked. Instead, jump to the &lt;code&gt;Chat / Text playground&lt;/code&gt; and select any Anthropic model, such as Sonnet 4.5. &lt;/p&gt;

&lt;p&gt;Anthropic requires first-time users to submit a use case form. Fill out the form, and you should promptly get an email from AWS confirming AWS Marketplace subscription for the model you chose. Subsequent models that are selected will be automatically enabled without having to submit another form.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuum4ro52k4jjhs6yduvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuum4ro52k4jjhs6yduvb.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring AWS credentials for Claude Code
&lt;/h2&gt;

&lt;p&gt;Now that Anthropic models are enabled, we just need to configure AWS credentials for Claude Code. Since Claude Code uses the default AWS SDK credentials, we can leverage existing methods to authenticate with AWS (as long as we have IAM permissions to access Bedrock). &lt;/p&gt;
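&lt;p&gt;For reference, a minimal IAM policy for invoking Bedrock models might look like the following sketch (in practice you'd scope &lt;code&gt;Resource&lt;/code&gt; to specific model or inference profile ARNs rather than using a wildcard):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "*"
    }
  ]
}
```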

&lt;p&gt;&lt;strong&gt;Option A: AWS CLI configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B: Environment variables (access key)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export AWS_ACCESS_KEY_ID=your-access-key-id
export AWS_SECRET_ACCESS_KEY=your-secret-access-key
export AWS_SESSION_TOKEN=your-session-token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option C: Environment variables (SSO profile)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws sso login --profile=&amp;lt;your-profile-name&amp;gt;

export AWS_PROFILE=your-profile-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The other option is to leverage &lt;a href="https://aws.amazon.com/blogs/machine-learning/accelerate-ai-development-with-amazon-bedrock-api-keys/" rel="noopener noreferrer"&gt;Bedrock API keys&lt;/a&gt; to authenticate directly with Bedrock. To generate an API key, navigate back to the AWS Bedrock console and select &lt;code&gt;API keys&lt;/code&gt;. You can then generate a short-term or long-term API key with the desired expiration time. &lt;/p&gt;
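&lt;p&gt;If you go the API key route, the key is surfaced as a bearer token via an environment variable (to our understanding, the variable name below is what the AWS SDKs and Claude Code look for):&lt;/p&gt;

```shell
# Short-term or long-term Bedrock API key, used as a bearer token
export AWS_BEARER_TOKEN_BEDROCK=your-bedrock-api-key
```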

&lt;p&gt;Finally, you need to set two environment variables to let Claude Code know you want to authenticate with Bedrock instead of going directly to Anthropic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export CLAUDE_CODE_USE_BEDROCK=1
export AWS_REGION=us-east-1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also set other environment variables, for example to override models or disable prompt caching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Using inference profile ID
export ANTHROPIC_MODEL='global.anthropic.claude-sonnet-4-5-20250929-v1:0'
export ANTHROPIC_SMALL_FAST_MODEL='us.anthropic.claude-haiku-4-5-20251001-v1:0'

# Using application inference profile ARN
export ANTHROPIC_MODEL='arn:aws:bedrock:us-east-2:your-account-id:application-inference-profile/your-model-id'

# Optional: Disable prompt caching if needed
export DISABLE_PROMPT_CACHING=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting up VS Code Plugin
&lt;/h2&gt;

&lt;p&gt;At this point, you can use Claude Code with Bedrock in any terminal shell. However, since the VS Code plugin launches its own shell, we need to configure VS Code so the Claude Code extension knows where to pull the right credentials. &lt;/p&gt;

&lt;p&gt;To enable this, open VS Code and type &lt;code&gt;Preferences: Open User Settings (JSON)&lt;/code&gt; in the command palette (&lt;code&gt;Ctrl+Shift+P&lt;/code&gt; on Windows or &lt;code&gt;Cmd+Shift+P&lt;/code&gt; on Mac). &lt;/p&gt;

&lt;p&gt;Depending on your previous interactions with the Claude Code extension, there may already be some &lt;code&gt;claudeCode&lt;/code&gt;-related settings (e.g., the selected model). The crucial part here is to add &lt;code&gt;claudeCode.environmentVariables&lt;/code&gt; as an array like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    "claudeCode.environmentVariables": [
        {
            "name": "AWS_PROFILE",
            "value": "AWS_PROFILE_NAME"
        },
        {
            "name": "AWS_REGION",
            "value": "us-east-1"
        },
        {
            "name": "CLAUDE_CODE_USE_BEDROCK",
            "value": "1"
        }
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can add other environment variables or use your preferred authentication method (including the API keys) there. Once configured, reload VS Code and relaunch the Claude Code extension. &lt;/p&gt;

&lt;p&gt;Now you're ready to use Claude Code without being prompted to log in:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d9r9ko0x4c5ruglzkjm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d9r9ko0x4c5ruglzkjm.png" alt=" " width="800" height="1039"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>llm</category>
      <category>vscode</category>
    </item>
    <item>
      <title>Understanding the True Cost of Ownership: ECS vs. EKS</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Sun, 02 Mar 2025 19:50:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/understanding-the-true-cost-of-ownership-ecs-vs-eks-2l0</link>
      <guid>https://dev.to/aws-builders/understanding-the-true-cost-of-ownership-ecs-vs-eks-2l0</guid>
      <description>&lt;p&gt;While there are plenty of articles already on the &lt;strong&gt;Total Cost of Ownership (TCO)&lt;/strong&gt; between a fully-managed service like ECS vs. one that shares the responsibility more with its users like EKS, the discussion is almost always very high-level, geared towards C-level executives. There's certainly value in having those discussions, but problem I see over and over again, is more at the ground-level between developers and DevOps teams struggling to internalize what it really means for them on a day-to-day basis. &lt;/p&gt;

&lt;p&gt;I recently went through an exercise that highlighted some of these key points, so I wanted to walk through how TCO actually plays out in practice in terms of concrete workstreams for both dev and infra teams. &lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;To lay out some context: there is a homegrown, legacy ETL system that has been running on ECS for years. The system was developed when there were no embedded DevOps engineers on the team, so some developers wrote bespoke Terraform code and chose ECS because it required lower DevOps overhead upfront. &lt;/p&gt;

&lt;p&gt;While the system is fairly simple (e.g., it moves files from S3 to a data lake and does some simple transformations), it became such a critical component of the entire data pipeline that it turned into one of those "don't break what works" systems: always on the backlog for migration, but never with enough momentum to carry it through. &lt;/p&gt;

&lt;p&gt;During this time, the DevOps team grew in size and EKS became the norm at the company for container orchestration. All of the new workloads were deployed onto EKS, and all the internal tooling (not just for managing the cluster itself but also for adding controls onto the applications) was geared towards supporting Kubernetes workloads (e.g., network policies, security, etc.). &lt;/p&gt;

&lt;p&gt;At every quarterly planning event, the question of "why aren't we using a single container orchestration system?" would come up. Every now and then, the DevOps team would run an analysis showing that ECS was actually costing more in operational and management terms, since backporting new EKS features to ECS was expensive in time and internal resources. This would in turn trigger the dev teams to do their due diligence in estimating the migration effort, but because things were "still working," it would always fall behind in priority, and the issue would go stale and be forgotten until the next time the TCO discussion bubbled up. &lt;/p&gt;

&lt;h2&gt;
  
  
  Problems Bubbling Up
&lt;/h2&gt;

&lt;p&gt;Cracks started showing when there were finally new feature requests to add to the legacy ETL system. From the dev side, this was a well-scoped problem. For example, instead of storing data in CSV, this system would now convert the format into Parquet for other systems to efficiently ingest. After the feature was developed, the dev team worked with infra teams to run some preliminary scaling analysis and pushed to prod with no problem.&lt;/p&gt;

&lt;p&gt;Or so they thought. &lt;/p&gt;

&lt;p&gt;After a few weeks, the team was getting paged for two reasons. First, sometimes the pods would eat up too many resources on the node and prevent other pods, including observability agents, from being scheduled. Second, the finance team noticed a huge uptick in network costs as soon as the feature was released. &lt;/p&gt;

&lt;p&gt;Both the dev team and the infra teams were confused. After all, they had done some scalability testing, and nothing they were doing was ground-breaking (meaning these exact problems had already been solved on the EKS side). What they found was that even though best practices like anti-affinity rules, container limits, and S3 private endpoints were thought to be in place, bespoke Terraform code and subtle differences between ECS and EKS meant they were not actually working as intended (e.g., the S3 private endpoint was only enabled for the VPCs hosting EKS, not ECS). &lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;This "incident" finally illustrated to the dev teams what the hidden operational and maintenances costs are and how it can manifest in practice. Even though ECS is easier to manage and requires very little input from developers, there is a hidden cost to maintaining two difference infrastructure systems across teams. So the argument of "ECS is so easy to use and it's working" is true, it does not diminish the fact that it is masking a TCO problem that can bubble up in the future. &lt;/p&gt;

&lt;p&gt;Most of the TCO discussion is often focused on how running EKS adds on more operational burden, but this can be a nuanced discussion as this case study shows. If the rest of the team is running on EKS and has more expertise, maintaining a more "fully-managed" solution can bring on more challenges as well. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>ecs</category>
      <category>eks</category>
    </item>
    <item>
      <title>Running Jobs in a Container via GitHub Actions Securely</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Sun, 04 Aug 2024 00:38:22 +0000</pubDate>
      <link>https://dev.to/aws-builders/running-jobs-in-a-container-via-github-actions-securely-p0c</link>
      <guid>https://dev.to/aws-builders/running-jobs-in-a-container-via-github-actions-securely-p0c</guid>
      <description>&lt;p&gt;Like any modern CI/CD platform, GitHub allows users to run CI jobs in a container. This is great for running consistent and reproducible CI jobs as well as reducing the amount of setup steps that is required for the job to run (e.g., running &lt;code&gt;actions/setup-python&lt;/code&gt; to install Python environment and installing necessary packages via &lt;code&gt;pip&lt;/code&gt;) as those environments and dependencies can be baked into the container.&lt;/p&gt;

&lt;p&gt;To make use of this feature, specify the container in which to run a job's steps via &lt;code&gt;jobs.&amp;lt;job_id&amp;gt;.container&lt;/code&gt; in the workflow YAML file. This tells GitHub to spin up a container and run every step in that job inside it. If you have both scripts and container actions, GitHub will run the container actions as sibling containers on the same network with the same volume mounts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;container-test-job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node:18&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;NODE_ENV&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;development&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check for dockerenv file&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;(ls /.dockerenv &amp;amp;&amp;amp; echo Found dockerenv) || (echo No dockerenv)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While using public images is great, for most non-open-source use cases you'll need to pull from private registries. To do so, you can pass in a &lt;code&gt;map&lt;/code&gt; of &lt;code&gt;username&lt;/code&gt; and &lt;code&gt;password&lt;/code&gt; like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-registry/my-image&lt;/span&gt;
  &lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.actor }}&lt;/span&gt;
     &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.github_token }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Easy, right? But let's take a look at when the above approach can become problematic. &lt;/p&gt;

&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;GitHub's current approach works great if you already have a static password that you can pass in securely via GitHub's secrets mechanism. However, if you are dealing with temporary credentials, there is currently no way to pass them in securely. &lt;/p&gt;

&lt;p&gt;To illustrate, let's take AWS ECR as an example. To grab a private image from ECR, you might have two steps like &lt;code&gt;login-to-amazon-ecr&lt;/code&gt; and &lt;code&gt;run-tests&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;In the first step, you can use the &lt;code&gt;aws-actions&lt;/code&gt; to configure credentials and login to ECR. Finally, you will have to set the username and password in the output to send to the next job &lt;code&gt;run-tests&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;login-to-amazon-ecr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/my-github-actions-role&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to Amazon ECR&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;login-ecr&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/amazon-ecr-login@v2&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;mask-password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;false'&lt;/span&gt;
    &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.login-ecr.outputs.registry }}&lt;/span&gt;
      &lt;span class="na"&gt;docker_username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.login-ecr.outputs.docker_username_123456789012_dkr_ecr_us_east_1_amazonaws_com }}&lt;/span&gt; 
      &lt;span class="na"&gt;docker_password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.login-ecr.outputs.docker_password_123456789012_dkr_ecr_us_east_1_amazonaws_com }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the flag &lt;code&gt;mask-password: 'false'&lt;/code&gt;. This is because in order for GitHub to make use of this output, it needs to be unmasked. As of the time of writing, masked outputs cannot be passed to separate jobs (see &lt;a href="https://github.com/actions/runner/issues/1498#issuecomment-1066836352" rel="noopener noreferrer"&gt;this issue&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This means that while it technically works, this approach is insecure: the &lt;code&gt;docker_password&lt;/code&gt; output will be logged unmasked if debug logging is enabled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;run-tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;login-to-amazon-ecr&lt;/span&gt;
    &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ needs.login-to-amazon-ecr.outputs.registry }}/my-ecr-repo:latest&lt;/span&gt;
      &lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ needs.login-to-amazon-ecr.outputs.docker_username }}&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ needs.login-to-amazon-ecr.outputs.docker_password }}&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run steps in container&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo "run steps in container"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Solutions
&lt;/h2&gt;

&lt;p&gt;Until GitHub either supports passing masked values as outputs or offers a different way to authenticate and pull private images, we have a few options.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disable debug logging
&lt;/h3&gt;

&lt;p&gt;Currently, anyone who has access to run a workflow can enable step debug logging for a workflow re-run. You could either remove human access to trigger workflows or disable re-runs. While this technically solves the issue, it severely hurts developer productivity and experience. &lt;/p&gt;

&lt;h3&gt;
  
  
  Limit private repos runners can access
&lt;/h3&gt;

&lt;p&gt;We could instead accept the risk of having temporary docker credentials printed to debug logs for a short duration. As a compromise, you can limit which ECR repositories GitHub Actions can pull from. The rationale here is that if your private container does not contain any confidential IP (e.g., it mostly just runs tests and setup scripts), then temporarily giving attackers the ability to download or list container images may be acceptable. To take this to the extreme, you could also consider just using a public repo. &lt;/p&gt;
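&lt;p&gt;As a rough sketch, a scoped-down IAM policy for the runner role might look like the following (the repository ARN is illustrative and reuses the example account from above; note that &lt;code&gt;ecr:GetAuthorizationToken&lt;/code&gt; does not support resource-level restrictions, so it stays on &lt;code&gt;*&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchCheckLayerAvailability"
            ],
            "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/my-ecr-repo"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;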

&lt;h3&gt;
  
  
  Run custom runner images
&lt;/h3&gt;

&lt;p&gt;If neither of those quick-fix solutions is acceptable, then you will need to create custom runner images and bake in the docker login step yourself. &lt;/p&gt;

&lt;p&gt;Continuing with our AWS example, we could use the &lt;a href="https://github.com/awslabs/amazon-ecr-credential-helper" rel="noopener noreferrer"&gt;amazon-ecr-credential-helper&lt;/a&gt; to automatically set credentials. Just download the binary to the runner image and mount a &lt;code&gt;~/.docker/config.json&lt;/code&gt; file with the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "credsStore": "ecr-login"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, specify that runner image for repos that need to pull private container images; once the workflow hits the login step, the credential helper will take care of authentication behind the scenes. &lt;/p&gt;

&lt;p&gt;The only potential downside of this approach is that the login step is now abstracted away from developers, and other docker-login GitHub Actions might conflict with it. &lt;/p&gt;

</description>
      <category>github</category>
      <category>aws</category>
      <category>cicd</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Cyclops: Simple Kubernetes Deployment Manager</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Tue, 05 Dec 2023 01:39:19 +0000</pubDate>
      <link>https://dev.to/yitaek/cyclops-simple-kubernetes-deployment-manager-9gc</link>
      <guid>https://dev.to/yitaek/cyclops-simple-kubernetes-deployment-manager-9gc</guid>
      <description>&lt;p&gt;A user-friendly dashboard for developers to leverage existing Kubernetes frameworks to deploy and manage applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F0%2AzXOx-wle6vFleZxw" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F0%2AzXOx-wle6vFleZxw"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When it comes to deploying applications to Kubernetes, the division of responsibilities between developers and infrastructure teams at various organizations is always an interesting (or contentious) topic. Some teams prefer their developers to fully buy into the DevOps ethos and own the entire stack. In this model, developers are expected to not only write their application code, but also the necessary Kubernetes components to run their applications. On the other end of the spectrum, I’ve also seen teams with embedded DevOps or SREs who take on this burden of writing Helm charts and owning everything after code has been merged into their main branch.&lt;/p&gt;

&lt;p&gt;Obviously, the “best” option for dividing these responsibilities will depend on the makeup of your team, especially their familiarity with Kubernetes concepts and their willingness to learn. The latter point is especially important, as I’ve seen some developers simply refuse to touch anything YAML. The learning curve going from Docker to Kubernetes has always been a huge challenge.&lt;/p&gt;

&lt;p&gt;The community has not been ignoring this problem either. In fact, there have been many attempts to address this tension. Some tools have focused on the templating and generation of Kubernetes manifests that developers can leverage. Others have focused more on developer environments that promote remote development and execution, thereby having developers interact with Kubernetes earlier in the development lifecycle. I’m also sure there are tons of internal tools that address some aspect of this problem in very opinionated ways (as I can attest, with lots of tools I’ve written in the past).&lt;/p&gt;

&lt;p&gt;This brings me to an interesting new open-source project called &lt;a href="https://github.com/cyclops-ui/cyclops" rel="noopener noreferrer"&gt;Cyclops&lt;/a&gt; that takes yet another crack at this problem. I was able to speak with Petar Cvitanovic, who is one of the main contributors, and decided to give it a try.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Cyclops
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/cyclops-ui/cyclops" rel="noopener noreferrer"&gt;Cyclops&lt;/a&gt; is a web-based tool that focuses on making Kubernetes deployment and configuration into a more developer-friendly experience. Instead of reinventing the wheel, Cyclops decided to adopt Helm charts that have been the de facto standard for packaging and supporting templatization. Cyclops takes in pre-built Helm charts and translates all those fields into a form that developers can easily adjust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1400%2Fformat%3Awebp%2F0%2ApfULRFCA_AnBVn8s" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1400%2Fformat%3Awebp%2F0%2ApfULRFCA_AnBVn8s"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of having developers override variables in YAML or via command line arguments, Cyclops exposes all those fields into a form that is easier to grok for developers. For most applications, developers simply have to change a few things: container image, service ports, configuration files, and secrets. Cyclops is targeting those use cases to streamline that process, although it can handle more complex applications as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing Cyclops
&lt;/h2&gt;

&lt;p&gt;To give Cyclops a try, I spun up a minikube cluster and installed Cyclops using the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;minikube start

kubectl apply -f https://raw.githubusercontent.com/cyclops-ui/cyclops/v0.0.1-alpha.5/install/cyclops-install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, Cyclops components are installed into the cyclops namespace: &lt;code&gt;cyclops-ctrl&lt;/code&gt; and &lt;code&gt;cyclops-ui&lt;/code&gt;. The backend pod &lt;code&gt;cyclops-ctrl&lt;/code&gt; takes a while to come up, so wait for the pod to be healthy. Then expose both the backend and the frontend locally (the frontend expects the backend to be available on localhost:8080).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl port-forward svc/cyclops-ui 3000:3000 -n cyclops
kubectl port-forward svc/cyclops-ctrl 8080:8080 -n cyclops
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding a Module
&lt;/h2&gt;

&lt;p&gt;Cyclops groups Kubernetes deployments into what it calls &lt;code&gt;modules&lt;/code&gt;. Click the &lt;code&gt;Add module&lt;/code&gt; button on localhost:3000 to create one.&lt;/p&gt;

&lt;p&gt;The Cyclops team has some templates we can leverage at &lt;a href="https://github.com/cyclops-ui/templates" rel="noopener noreferrer"&gt;https://github.com/cyclops-ui/templates&lt;/a&gt;. We’ll load the &lt;code&gt;demo&lt;/code&gt; template and fill out the module details with a simple nginx container.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1400%2Fformat%3Awebp%2F0%2AF9cmeHr4jH7a_7oQ" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1400%2Fformat%3Awebp%2F0%2AF9cmeHr4jH7a_7oQ"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After clicking save, Cyclops will interpret those details and deploy those pods accordingly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1400%2Fformat%3Awebp%2F0%2AO05o-5BGkcKHT_eY" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1400%2Fformat%3Awebp%2F0%2AO05o-5BGkcKHT_eY"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that any Helm chart is supported, but for Cyclops to work, you’ll need to provide a &lt;code&gt;values.schema.json&lt;/code&gt; file in the format described at &lt;a href="https://cyclops-ui.com/docs/templates/" rel="noopener noreferrer"&gt;https://cyclops-ui.com/docs/templates/&lt;/a&gt;.&lt;/p&gt;
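&lt;p&gt;To give a rough idea of the shape, here is a minimal sketch of such a schema for a chart that only exposes an image and a replica count. The field names here are illustrative, not taken from a real Cyclops template; consult the linked docs for the exact format Cyclops expects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "properties": {
        "image": {
            "type": "string",
            "description": "Container image to deploy"
        },
        "replicas": {
            "type": "number",
            "description": "Number of replicas"
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;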

&lt;h2&gt;
  
  
  Thoughts
&lt;/h2&gt;

&lt;p&gt;Cyclops is an early-stage Kubernetes tooling project with a currently limited set of features. However, that limited focus may actually benefit some teams. Some of the more mature tools, by necessity, grow bloated to satisfy security, compliance, and various feature requests. But if you are looking for a lightweight tool that lets developers quickly override templates in a familiar form format, Cyclops could be a better alternative.&lt;/p&gt;

&lt;p&gt;It is important to note that Cyclops does make some assumptions about how you divide responsibilities. To use Cyclops effectively, someone on the team must first create the Helm charts and the schema file. If your team is more in the camp of “you write it then you own it”, then having developers buy into GitOps tooling might be sufficient. On the other hand, Cyclops caters to DevOps and infrastructure teams who want to expose a more curated experience for developers. You can control which fields the devs can override, and they can then use the Cyclops UI to view basic status and logs.&lt;/p&gt;

&lt;p&gt;I can see Cyclops being a good building block for teams looking to create an internal development platform. There are lots of commercial and large OSS projects dealing with cluster management, CI/CD, etc., but those may be overkill for local development setups or smaller developer platforms. Once Cyclops supports more templates (e.g., sample schemas that work with popular Helm charts like various databases or monitoring tools), it’ll be even more useful.&lt;/p&gt;

&lt;p&gt;If you’re interested, check out Cyclops on GitHub: &lt;a href="https://github.com/cyclops-ui/cyclops" rel="noopener noreferrer"&gt;https://github.com/cyclops-ui/cyclops&lt;/a&gt;. The team is also very responsive to feedback, so open issues or leave comments if you are looking to adopt Cyclops!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>Ingesting Financial Tick Data Using a Time-Series Database</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Tue, 25 Apr 2023 00:48:16 +0000</pubDate>
      <link>https://dev.to/yitaek/ingesting-financial-tick-data-using-a-time-series-database-59l3</link>
      <guid>https://dev.to/yitaek/ingesting-financial-tick-data-using-a-time-series-database-59l3</guid>
<description>&lt;p&gt;Compared to traditional financial markets, crypto markets are more volatile, with prices swinging in either direction at a quicker pace. The price of each cryptocurrency also tends to vary across exchanges. Given such a dynamic nature, investors and traders looking to navigate the market need fast and reliable data from various crypto exchanges. In this tutorial, we’ll take a look at three different ways to ingest crypto market data into QuestDB for further analysis:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using the &lt;a href="https://github.com/bmoscon/cryptofeed"&gt;Cryptofeed&lt;/a&gt; library&lt;/li&gt;
&lt;li&gt;Writing a custom data pipeline&lt;/li&gt;
&lt;li&gt;Via Change Data Capture (CDC)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NtNCGhRi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/juz9gv31puuwz1hsmtrb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NtNCGhRi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/juz9gv31puuwz1hsmtrb.png" alt="Image description" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;We will be using QuestDB to ingest and store crypto market data. Create a new directory and, from that directory, run the following to start a local instance of QuestDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;cryptofeed-questdb
&lt;span class="nb"&gt;cd &lt;/span&gt;cryptofeed-questdb
docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 9000:9000 &lt;span class="nt"&gt;-p&lt;/span&gt; 9009:9009 &lt;span class="nt"&gt;-p&lt;/span&gt; 8812:8812 &lt;span class="nt"&gt;-p&lt;/span&gt; 9003:9003 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;:/var/lib/questdb"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  questdb/questdb:7.0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Method 1: ingesting data using the Cryptofeed library
&lt;/h2&gt;

&lt;p&gt;One of the easiest ways to ingest market data is to use an open-source tool called Cryptofeed. The Python library establishes websocket connections to various exchanges, including Binance, Coinbase, Gemini, and Kraken, and returns trade, market, and book update data in a standardized format. Cryptofeed also has native integration with QuestDB, making it a great choice for ingesting data rapidly.&lt;/p&gt;

&lt;p&gt;To get started, create a virtual environment with Python 3.8+. We will use &lt;a href="https://docs.python.org/3/library/venv.html"&gt;venv&lt;/a&gt;, but you can use conda, poetry, or virtualenv as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv cryptofeed
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;cryptofeed/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then install cryptofeed: &lt;code&gt;pip install cryptofeed&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Navigate into the &lt;code&gt;cryptofeed&lt;/code&gt; directory and create a new file &lt;code&gt;questdb.py&lt;/code&gt;. Then paste the following to ingest trade data for the BTC-USD pair from Coinbase and Gemini:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cryptofeed&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FeedHandler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cryptofeed.backends.quest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TradeQuest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cryptofeed.defines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TRADES&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cryptofeed.exchanges&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Coinbase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Gemini&lt;/span&gt;


&lt;span class="n"&gt;QUEST_HOST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'127.0.0.1'&lt;/span&gt;
&lt;span class="n"&gt;QUEST_PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;9009&lt;/span&gt;




&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
   &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FeedHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_feed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Coinbase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TRADES&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'BTC-USD'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TRADES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TradeQuest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEST_HOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEST_PORT&lt;/span&gt;&lt;span class="p"&gt;)}))&lt;/span&gt;
   &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_feed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TRADES&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'BTC-USD'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TRADES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TradeQuest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEST_HOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEST_PORT&lt;/span&gt;&lt;span class="p"&gt;)}))&lt;/span&gt;
   &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;




&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run this code, it will automatically create websocket connections to the Coinbase and Gemini APIs and push data to QuestDB. Note that it may take a while to see data populated (especially from Gemini).&lt;/p&gt;

&lt;p&gt;Navigate to localhost:9000 to access the web console. We can query data from Coinbase via &lt;code&gt;SELECT * FROM 'trades-COINBASE'&lt;/code&gt; (the table name contains a hyphen, so it must be quoted):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IMwHpSLy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/legmuulr4h59v0jkgd12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IMwHpSLy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/legmuulr4h59v0jkgd12.png" alt="Image description" width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see all the supported exchanges and channels (e.g., L1/L2/L3 books, trades, ticker, candles, open interest, etc.) on the &lt;a href="https://github.com/bmoscon/cryptofeed"&gt;Cryptofeed GitHub&lt;/a&gt; page.&lt;/p&gt;

&lt;p&gt;If you want to modify the structure of the data ingested into QuestDB, you can override the callback handler. For example, to change the name of the table it writes to or the columns, you can override the &lt;code&gt;write&lt;/code&gt; function. In fact, the &lt;a href="https://github.com/questdb/demo-data"&gt;QuestDB demo site&lt;/a&gt; uses cryptofeed to ingest data into the &lt;code&gt;trades&lt;/code&gt; table with the following custom callback function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cryptofeed&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FeedHandler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cryptofeed.backends.backend&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BackendCallback&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cryptofeed.backends.socket&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SocketCallback&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cryptofeed.defines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TRADES&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cryptofeed.exchanges&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Coinbase&lt;/span&gt;


&lt;span class="n"&gt;QUEST_HOST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'127.0.0.1'&lt;/span&gt;
&lt;span class="n"&gt;QUEST_PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;9009&lt;/span&gt;




&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QuestCallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SocketCallback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'127.0.0.1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9009&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"tcp://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;numeric_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
       &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;none_to&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


   &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
           &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="nb"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
               &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                   &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
               &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                   &lt;span class="nb"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TradeQuest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;QuestCallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BackendCallback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="n"&gt;default_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'trades'&lt;/span&gt;


   &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,symbol=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,side=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"side"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; price=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,amount=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1_000_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
       &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;




&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
   &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FeedHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_feed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Coinbase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TRADES&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'BTC-USD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'ETH-USD'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                             &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TRADES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TradeQuest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEST_HOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEST_PORT&lt;/span&gt;&lt;span class="p"&gt;)}))&lt;/span&gt;
   &lt;span class="n"&gt;hanlder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;




&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, the Cryptofeed library pushes data to QuestDB over plain socket connections using the &lt;a href="https://dev.to/docs/reference/api/ilp/overview/"&gt;Influx Line Protocol&lt;/a&gt; (ILP). As such, the write callback must supply the raw ILP string.&lt;/p&gt;
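&lt;p&gt;For reference, a raw ILP line has the shape &lt;code&gt;measurement,tag_set field_set timestamp&lt;/code&gt;, with the timestamp in nanoseconds. Below is a minimal, illustrative Python sketch of building such a line; the table name and fields mirror the write callback above and are assumptions for illustration, not part of the Cryptofeed API:&lt;/p&gt;

```python
def to_ilp(table, symbol, side, price, amount, epoch_seconds):
    # measurement and tags, then a space, then fields, then the
    # timestamp converted from epoch seconds to nanoseconds
    ns = int(epoch_seconds * 1_000_000_000)
    return f'{table},symbol={symbol},side={side} price={price},amount={amount} {ns}'

line = to_ilp("trades", "BTC-USD", "buy", 42000.5, 0.01, 1700000000)
# 'trades,symbol=BTC-USD,side=buy price=42000.5,amount=0.01 1700000000000000000'
```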

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v6oqYyDz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uv43bvzwwoqsljyx5s8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v6oqYyDz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uv43bvzwwoqsljyx5s8v.png" alt="Image description" width="800" height="632"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The biggest advantage of using Cryptofeed is the large number of preconfigured integrations with various exchanges. The library does the heavy lifting of normalizing the data so ingesting it into QuestDB is very simple. However, if you need more control over the type or format of the data, you may need to call the exchange API directly.&lt;/p&gt;
&lt;h2&gt;
  
  
  Method 2: Build a custom market data pipeline with your own data fetcher
&lt;/h2&gt;

&lt;p&gt;If Cryptofeed does not support the exchange you are interested in, or if you need more control over the type or format of the data, you can write your own data ingestion function. With QuestDB, you can use either the PostgreSQL wire protocol or ILP. Since ILP is faster and supports schemaless ingestion, we will show an example of using the InfluxDB Line Protocol via the QuestDB Node.js SDK to ingest price data from Binance and Gemini:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const axios = require("axios")
const { Sender } = require("@questdb/nodejs-client");


async function main() {
 // create a sender with a 4k buffer
 const sender = new Sender({ bufferSize: 4096 });


 // connect to QuestDB
 // host and port are required in connect options
 await sender.connect({ port: 9009, host: "localhost" });


 async function getBinanceData() {
   const { data } = await axios.get(
     "https://api.binance.us/api/v3/avgPrice?symbol=BTCUSD",
   )


   // add rows to the buffer of the sender
   sender
     .table("prices")
     .symbol("pair", "BTCUSD")
     .stringColumn("exchange", "Binance")
     .floatColumn("bid", parseFloat(data.price))
     .atNow();


   await sender.flush();


   setTimeout(getBinanceData, 1000)
 }


 async function getGeminiData() {
   const { data } = await axios.get("https://api.gemini.com/v1/pricefeed")
   const { price } = data.find((i) =&amp;gt; i.pair === "BTCUSD")


   // add rows to the buffer of the sender
   sender
     .table("prices")
     .symbol("pair", "BTCUSD")
     .stringColumn("exchange", "Gemini")
     .floatColumn("bid", parseFloat(price))
     .atNow();


   await sender.flush();
   setTimeout(getGeminiData, 1000)
 }

 getBinanceData()
 getGeminiData()
}


main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above polls the REST endpoints of Binance and Gemini API and writes the data to a table called &lt;code&gt;prices&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZARKBOu8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mmqiohbxyzfif97bvcmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZARKBOu8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mmqiohbxyzfif97bvcmq.png" alt="Image description" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While writing a custom data ingestion function is more work than simply using Cryptofeed, it can be a great option if you need to customize the fields or run some preprocessing logic prior to sending it to QuestDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 3: Ingest market data using Change Data Capture (CDC)
&lt;/h2&gt;

&lt;p&gt;Finally, you can ingest data via Change Data Capture (CDC) if you have an external data stream or database that you can listen to. For example, an external market data team might publish price data on Kafka or push updates to a relational database. Instead of polling this data directly, you could leverage CDC patterns to stream changes to QuestDB.&lt;/p&gt;

&lt;p&gt;An example of this architecture is detailed in&lt;br&gt;
&lt;a href="https://dev.to/blog/realtime-crypto-tracker-with-questdb-kafka-connector/"&gt;Realtime crypto tracker with QuestDB Kafka Connector&lt;/a&gt;.&lt;br&gt;
This reference architecture has a function that polls the Coinbase API for the latest price data and publishes it to Kafka topics. The QuestDB Kafka Connector in turn writes that data to QuestDB.&lt;/p&gt;
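&lt;p&gt;As a rough sketch of the producer side (the field names and message shape here are assumptions for illustration; the actual column mapping lives in the connector config), the poller would publish JSON messages whose fields the QuestDB Kafka Connector maps onto table columns:&lt;/p&gt;

```python
import json
import time

def build_price_message(symbol, price, exchange="Coinbase"):
    # Hypothetical payload published to a Kafka topic; the QuestDB Kafka
    # Connector would map each JSON field to a column in the target table.
    return json.dumps({
        "symbol": symbol,
        "price": price,
        "exchange": exchange,
        "timestamp": int(time.time() * 1000),  # epoch millis
    })

msg = build_price_message("BTC-USD", 42000.5)
```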

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HxZ-SFM_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/di2gdacqxmzwtmme3i6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HxZ-SFM_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/di2gdacqxmzwtmme3i6b.png" alt="Image description" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;QuestDB offers various ways to ingest crypto market data quickly. As a starting point, use the Cryptofeed library to connect to the exchanges it already supports, optionally customizing the ingestion by implementing your own callback. If you need to integrate with a data feed not supported by Cryptofeed, you can write a custom data ingestor and publish data over the InfluxDB Line Protocol to QuestDB. Finally, if there’s an existing data feed that Debezium supports (e.g., Kafka, PostgreSQL), then using CDC can be a great choice to minimize the infrastructure burden.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/time-series-monitoring-dashboard-grafana-questdb/"&gt;How to start a time-series dashboard with Grafana&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/processing-time-series-with-questdb-apache-kafka/"&gt;Processing time-series with QuestDB and Apache Kafka&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/2022/04/12/demo-live-crypto-data-streamed-with-questdb-and-grafana/"&gt;Demo of live crypto data streamed with QuestDB and Grafana&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/2022/03/15/cryptocurrency-grafana-questdb/"&gt;Visualizing cryptocurrency data with Python, Grafana, and QuestDB&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>tutorial</category>
      <category>database</category>
      <category>python</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Data Lifecycle with QuestDB</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Wed, 02 Nov 2022 23:13:54 +0000</pubDate>
      <link>https://dev.to/yitaek/data-lifecycle-with-questdb-8mk</link>
      <guid>https://dev.to/yitaek/data-lifecycle-with-questdb-8mk</guid>
      <description>&lt;p&gt;For most applications dealing with time series data, the value of each data point diminishes over time as the granularity of the dataset loses relevance as it gets stale. For example, when applying a real-time anomaly detection model, more granular data (e.g., data collected at second resolution), would yield better results. However, to train forecasting models afterwards, recording data at such high frequency may not be needed and would be costly in terms of storage and compute.&lt;/p&gt;

&lt;p&gt;When I was working for an IoT company, to combat this issue, we stored data in three separate databases. To show the most up to date value, latest updates were pushed to a NoSQL realtime database. Simultaneously, all the data was appended to both a time series database storing up to 3 months of data for quick analysis and to an OLAP database for long-term storage. To stop the time series database from exploding in size, we also ran a nightly job to delete old data. As the size of the data grew exponentially with IoT devices, this design caused operational issues with maintaining three different databases.&lt;/p&gt;

&lt;p&gt;QuestDB solves this by providing easy ways to downsample the data and also detach or drop partitions when old data is no longer necessary. This helps to keep all the data in a single database for most operations and move stale data to cheaper storage in line with a mature data retention policy.&lt;/p&gt;

&lt;p&gt;To illustrate, let’s revisit the &lt;a href="https://questdb.io/blog/2021/02/05/streaming-heart-rate-data-with-iot-core-and-questdb/" rel="noopener noreferrer"&gt;IoT application involving heart rate data&lt;/a&gt;. Unfortunately, Google decided to &lt;a href="https://techcrunch.com/2022/08/17/google-cloud-will-shutter-its-iot-core-service-next-year/" rel="noopener noreferrer"&gt;shut down its Cloud IoT Core service&lt;/a&gt;, so we’ll use randomized data for this demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Populating heart rate data
&lt;/h2&gt;

&lt;p&gt;Let’s begin by running &lt;a href="https://questdb.io/docs/get-started/docker/" rel="noopener noreferrer"&gt;QuestDB via Docker&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -p 9000:9000 \
-p 9009:9009 \
-p 8812:8812 \
-p 9003:9003 \
-v "$(pwd):/var/lib/questdb" \
questdb/questdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We’ll create a simple heart-rate data table with a timestamp, heart rate, and sensor ID, partitioned by month, via the console at localhost:9000:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE heart_rate AS(
  SELECT
    x ID,
    timestamp_sequence(
      to_timestamp('2022-10-10T00:00:00', 'yyyy-MM-ddTHH:mm:ss'),
      rnd_long(1, 10, 0) * 100000L
    ) ts,
    rnd_double(0) * 100 + 60 heartrate,
    rnd_long(0, 10000, 0) sensorId
  FROM
    long_sequence(10000000) x
) TIMESTAMP(ts) PARTITION BY MONTH;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We now have randomized data from 10,000 sensors over a roughly two-month time frame (10M data points). Suppose we are continuously appending to this dataset from a data stream; such frequent updates are useful for detecting anomalies in heart rate and alerting on potential health issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Downsampling the data
&lt;/h2&gt;

&lt;p&gt;However, if no anomalies are detected, a dataset with heart rate collected every second is not useful if we simply want to note general trends over time. Instead, we can record the average heart rate in one-hour intervals to compact the data. For example, if we’re interested in the min, max, and avg heart rate of a specific sensor, sampled every hour, we can invoke:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  min(heartrate),
  max(heartrate),
  avg(heartrate),
  ts
FROM
  heart_rate
WHERE
  sensorId = 1000 SAMPLE BY 1h FILL(NULL, NULL, PREV);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once we are happy with the downsampled results, we can store them in a separate sampled_data table for the data science team to create forecasting models or do further analysis:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE sampled_data (ts *timestamp*, min_heartrate *double*, max_heartrate *double*, avg_heartrate *double*, sensorId *long*) *timestamp*(ts);

INSERT INTO sampled_data (ts, min_heartrate, max_heartrate, avg_heartrate, sensorId);

SELECT ts, min(heartrate), max(heartrate), avg(heartrate), sensorId FROM heart_rate SAMPLE BY 1h FILL(NULL, NULL, PREV);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This downsampling operation can be done periodically (e.g., daily, monthly) to populate the new table. This way the data science team does not have to import the massive raw dataset and can simply work with sampled data with appropriate resolution.&lt;/p&gt;
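&lt;p&gt;A sketch of such a periodic job is shown below; it only builds the daily statement (the &lt;code&gt;ts IN '&lt;date&gt;'&lt;/code&gt; clause uses QuestDB's timestamp interval search), while actually executing it against QuestDB, e.g. over the PostgreSQL wire on port 8812 with any Postgres driver from a daily scheduler, is assumed and not shown:&lt;/p&gt;

```python
import datetime

def daily_downsample_sql(day: datetime.date) -> str:
    # Build the INSERT ... SELECT that appends one day's hourly aggregates
    # from heart_rate into sampled_data; run it from a daily cron/scheduler.
    return (
        "INSERT INTO sampled_data "
        "SELECT ts, min(heartrate), max(heartrate), avg(heartrate), sensorId "
        f"FROM heart_rate WHERE ts IN '{day.isoformat()}' "
        "SAMPLE BY 1h FILL(NULL, NULL, PREV);"
    )

sql = daily_downsample_sql(datetime.date(2022, 10, 10))
```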

&lt;h2&gt;
  
  
  Data retention strategy
&lt;/h2&gt;

&lt;p&gt;Downsampling alone, however, does not solve the growing data size. The raw sensor heart_rate table will continue to grow in size. In this case, we have some options in QuestDB to detach or even drop partitions.&lt;/p&gt;

&lt;p&gt;Since we partitioned the original dataset by month, we have 3 partitions: 2022-10, 2022-11, and 2022-12. These can be seen under the /db/heart_rate/ directory, along with other files holding metadata.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/db/heart_rate
├── 2022-10
├── 2022-11
├── 2022-12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After we have downsampled the data, we probably no longer need data from older months. In this case, we can &lt;a href="https://questdb.io/docs/reference/sql/alter-table-detach-partition/" rel="noopener noreferrer"&gt;DETACH&lt;/a&gt; this partition to make it unavailable for reads.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE ‘heart_rate’ DETACH PARTITION LIST ‘2022–10’;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now the 2022-10 partition is renamed to 2022-10.detached, and queries on the heart_rate table return data from 2022-11 onwards:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM ‘heart_rate’ LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F0%2AXdjQoRUjy_6czOvG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F0%2AXdjQoRUjy_6czOvG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can then compress this data and move it to a cheaper object storage option like S3 or GCS:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tar cfz — ‘/db/heart_rate/2022–10.detached’ | aws s3 cp — s3://my-data-backups/2022–10.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If we need to restore this partition for further analysis, we can re-download the tar file into a new directory with the .attachable suffix (e.g., 2022-10.attachable) under /db/heart_rate (or wherever the rest of the QuestDB data lives) and uncompress it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir 2022–02.attachable | aws s3 cp s3:/my-data-backups/2022–10.tar.gz — | tar xvfz — -C 2022–10.attachable — strip-components 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the data in place, simply use the &lt;a href="https://questdb.io/docs/reference/sql/alter-table-attach-partition/" rel="noopener noreferrer"&gt;ATTACH&lt;/a&gt; command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE heart_rate ATTACH PARTITION LIST ‘2022–10’;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can verify the partition has been attached back by running the count query and seeing 10M records:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT count() FROM heart_rate;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Alternatively, if we want to simply delete partitions, we can use the &lt;a href="https://questdb.io/docs/reference/sql/alter-table-drop-partition/" rel="noopener noreferrer"&gt;DROP&lt;/a&gt; command to do so. Unlike the DETACH command, this operation is irreversible:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE heart_rate DROP PARTITION LIST ‘2022–10’;&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Simplifying operations
&lt;/h2&gt;

&lt;p&gt;As the volume of data continues to explode, it’s important to consider a data retention strategy that is both cost-effective and useful to teams depending on that data. Time series data, by its nature, lends itself well to aggregation and partitioning by time. We can leverage this quality to serve raw data for teams requiring real-time decisions and then move to a downsampled dataset for other analytic needs. Finally, to control the cost and performance of the time series database, we can detach and store partitions in cheaper, long-term storage options.&lt;/p&gt;

&lt;p&gt;QuestDB makes these operations simple with built-in support so that teams don’t have to build custom data pipelines to manually delete and replicate data into different databases. The use case shown here with heart rate data can easily be applied to other industries with high-frequency, time series data (e.g., financial markets, infrastructure monitoring).&lt;/p&gt;

</description>
      <category>database</category>
      <category>devops</category>
      <category>tutorial</category>
      <category>programming</category>
    </item>
    <item>
      <title>Simplifying Kubernetes CI/CD With Devtron</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Thu, 23 Jun 2022 21:57:33 +0000</pubDate>
      <link>https://dev.to/yitaek/simplifying-kubernetes-cicd-with-devtron-34gm</link>
      <guid>https://dev.to/yitaek/simplifying-kubernetes-cicd-with-devtron-34gm</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1j1ynv2rlcgxrzywr6j8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1j1ynv2rlcgxrzywr6j8.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When organizations think of fostering a DevOps culture, building out an effective continuous integration (CI) and continuous deployment (CD) pipeline is usually brought up as the first step to success. Nowadays, infrastructure teams have a plethora of both open-source and licensed tools such as Jenkins, CircleCI, Github Actions, and ArgoCD to implement various CI/CD pipelines and deployment strategies. However, most of these tools rely on complex YAML templating to trigger the pipelines, which may discourage developers who simply want an easy way to build and deploy their application to cloud-native environments.&lt;/p&gt;

&lt;p&gt;Devtron is an open-source software delivery workflow orchestrator for Kubernetes with a built-in CI/CD builder to address this issue. In this article, we’ll review how to configure some common CI/CD steps via Devtron.&lt;/p&gt;

&lt;h1&gt;
  
  
  CI Pipeline
&lt;/h1&gt;

&lt;p&gt;In Devtron, CI pipelines can be created via a CI Workflow Editor (trigger from a code repository), linked to an existing pipeline (e.g. templates), or be integrated with an external provider via an incoming webhook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lv91fqo1hhicrbmidex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lv91fqo1hhicrbmidex.png" alt="Image description"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;To create a new CI pipeline, choose the “Continuous integrations” option to open up the Workflow Editor:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn62aa05tu2vabdf82wp7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn62aa05tu2vabdf82wp7.png" alt="Image description"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Instead of specifying various branch types and triggers via a YAML file, developers can simply choose the source type (e.g. branch, PR, tag) or branch name to trigger the pipeline. Devtron provides three simple stages in the CI steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pre-build stage: tasks to run before building the container image (e.g. linting, unit tests)&lt;/li&gt;
&lt;li&gt;Build stage: creating the container image&lt;/li&gt;
&lt;li&gt;Post-build stage: tasks to run after image creation (e.g. scanning for vulnerabilities)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zlzf7df8uhkufgi0oqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zlzf7df8uhkufgi0oqf.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The build wizard guides through setting up each of these configuration parameters. If the team already has an existing CI template, developers can opt to link that pipeline or integrate with external tools if the team is migrating from legacy providers (e.g. Jenkins).&lt;/p&gt;

&lt;p&gt;If the pipeline is set to trigger automatically, either commit to the branch or submit a PR to trigger the action. Alternatively, users can click on “Select Material” to trigger the builds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghoszdiyods8cj1t5uud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghoszdiyods8cj1t5uud.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under the Build History tab, developers can also see vulnerabilities if that feature was enabled under the advanced options. This built-in integration is a nice way to avoid having to add in open-source scanners (e.g. Anchore, Clair, Trivy) or paid-tools (e.g. Jfrog Xray) manually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55grfkmam3yu9yljxpil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55grfkmam3yu9yljxpil.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  CD Pipeline
&lt;/h1&gt;

&lt;p&gt;Once the CI pipeline is set, we can extend the pipeline to include the CD portion. Simply click on the (+) sign of the pipeline via the Workflow Editor and select the deployment environment (i.e. target namespace/cluster) and deployment strategy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2k61qp85ywy455kpx4a9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2k61qp85ywy455kpx4a9.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As with the CI portion, CD comes with three different stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pre-deployment stage: useful to carry out DB/schema migrations or config setup before the application deployment&lt;/li&gt;
&lt;li&gt;Deployment stage: step to deploy utilizing one of four strategies (recreate, canary, blue-green, and rolling upgrades) that can be configured per use case&lt;/li&gt;
&lt;li&gt;Post-deployment stage: runs after the deployment to either update Jira ticket, send notifications, or run clean up tasks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of these stages can be configured using the Workflow Editor. Since the CD step is more open-ended, more complex workflows will require writing some YAML, but the config for each stage is relatively minimal.&lt;/p&gt;

&lt;p&gt;The pre-deployment stage can be configured to run automatically or manually. The example below uses Flyway to manage database migrations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo78qt82bnfa9uib3py44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo78qt82bnfa9uib3py44.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Post-deployment stage works similarly as the pre-deployment stage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d26kf1u335jgjf7jugc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d26kf1u335jgjf7jugc.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can run smoke tests or end-to-end tests on lower environments after deploying. As shown above, you can use a Node or Cypress Docker image to run the tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run -it -v $PWD:/e2e -w /e2e --entrypoint=cypress cypress/base
Finally, these pipelines can be linked to create sequential pipelines if multiple deployments or special jobs must trigger in order. To create a sequential pipeline, click on the + sign on the right side of the existing pipeline components to add new jobs:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These pipelines can be linked to sequentially deploy to multiple environments (e.g. dev → QA → UAT → prod).&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;With so many choices in the market today, most teams struggle to create a cohesive CI/CD experience without cobbling together a multitude of tools. While the flexibility of each of these tools provides tremendous value, for some teams, just setting up a simple pipeline is all that is needed. This is where Devtron can provide value in guiding developers through an intuitive widget to set up a pipeline that is ready for cloud-native applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dzone.com/articles/zero-to-hero-on-kubernetes-with-devtron" rel="noopener noreferrer"&gt;Full end-to-end demo of deploying a sample application using the CI/CD pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.devgenius.io/devtron-open-source-software-delivery-workflow-for-k8s-23bd136efe06" rel="noopener noreferrer"&gt;Installing Devtron and deploying a sample Helm chart&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cloud</category>
    </item>
    <item>
      <title>How to Provision and Manage Amazon EKS with Ease</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Thu, 05 May 2022 15:32:15 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-to-provision-and-manage-amazon-eks-with-ease-3eda</link>
      <guid>https://dev.to/aws-builders/how-to-provision-and-manage-amazon-eks-with-ease-3eda</guid>
      <description>&lt;p&gt;AWS is the unquestioned leader of the $180-billion cloud market today, with a 33% overall market share according to &lt;a href="https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/"&gt;Synergy Research Group&lt;/a&gt;. Their dominance also extends to the managed Kubernetes space. Both the &lt;a href="https://www.cncf.io/wp-content/uploads/2020/08/CNCF_Survey_Report.pdf"&gt;Cloud Native Computing Foundation 2019 survey&lt;/a&gt; and a more recent &lt;a href="https://www.logicata.com/blog/amazon-eks-elastic-kubernetes-service-is-the-favourite-paas-kubernetes-orchestration-tool/"&gt;Logicata Kubernetes poll results&lt;/a&gt; show EKS with the lead in terms of popularity amongst its competition (e.g. GKE, AKS, etc).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JkcnKGLD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/2KUeNBw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JkcnKGLD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/2KUeNBw.png" alt="" width="880" height="446"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://www.cncf.io/wp-content/uploads/2020/08/CNCF_Survey_Report.pdf"&gt;CNCF Survey 2019&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, as more applications are onboarded onto EKS, managing multiple clusters and workloads remains a challenge. In this post, we'll discuss a few ways to provision an EKS cluster and how to use KubeSphere as a platform layer to securely deploy and maintain containerized applications on Kubernetes. &lt;/p&gt;
&lt;h2&gt;
  
  
  eksctl
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://eksctl.io/"&gt;eksctl&lt;/a&gt; is an open-source tool jointly developed by the AWS and &lt;a href="https://www.weave.works/"&gt;Weaveworks&lt;/a&gt; to create and manage EKS clusters. Behind the scenes, eksctl creates a CloudFormation stack to provision and update AWS artifacts. &lt;/p&gt;

&lt;p&gt;After &lt;a href="https://eksctl.io/introduction/#installation"&gt;installing eksctl&lt;/a&gt;, a cluster can be bootstrapped imperatively with command-line flags or declaratively via a config file:&lt;/p&gt;

&lt;p&gt;via command-line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;eksctl create cluster --name=cluster-1 --nodes=4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or via config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eksctl.io/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterConfig&lt;/span&gt;

&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;basic-cluster&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-2&lt;/span&gt;

&lt;span class="na"&gt;nodeGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ng-1&lt;/span&gt;
    &lt;span class="na"&gt;instanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;m5.large&lt;/span&gt;
    &lt;span class="na"&gt;desiredCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;volumeSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;ssh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt; &lt;span class="c1"&gt;# will use ~/.ssh/id_rsa.pub as the default ssh key&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ng-2&lt;/span&gt;
    &lt;span class="na"&gt;instanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;m5.xlarge&lt;/span&gt;
    &lt;span class="na"&gt;desiredCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;volumeSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="na"&gt;ssh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;publicKeyPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~/.ssh/ec2_id_rsa.pub&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to &lt;code&gt;kubectl&lt;/code&gt; commands, config file changes can be applied via the &lt;code&gt;-f&lt;/code&gt; flag: &lt;code&gt;eksctl create cluster -f &amp;lt;file-name.yaml&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;eksctl maintains a comprehensive documentation website covering configuration parameters such as enabling CloudWatch logging, Fargate nodegroups, and EKS addons (e.g. VPC CNI, EBS CSI driver, CoreDNS), as well as nice integrations like &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-enable-IAM.html"&gt;IRSA support&lt;/a&gt;. The &lt;a href="https://www.eksworkshop.com/030_eksctl/"&gt;EKS workshop&lt;/a&gt; also hosts a step-by-step guide with a video on bootstrapping an EKS cluster. &lt;/p&gt;
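As a rough sketch of how those options fit into the same config file, the ClusterConfig below enables the OIDC provider and IRSA alongside managed addons. Field names follow eksctl's documented schema; the service account name and policy ARN here are illustrative, not prescriptive:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: basic-cluster
  region: us-east-2

# Enable the OIDC provider and create an IAM role for a
# service account (IRSA); the account below is illustrative.
iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: ebs-csi-controller-sa
        namespace: kube-system
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy

# Manage EKS addons from the same config file
addons:
  - name: vpc-cni
  - name: coredns
```

Applying this with `eksctl create cluster -f` keeps the IAM wiring and addons in the same declarative file as the nodegroups.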

&lt;p&gt;If you are familiar with CloudFormation or would like a nice wrapper that is jointly maintained by the AWS team, eksctl is a great choice for maintaining EKS clusters. &lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform EKS Module
&lt;/h2&gt;

&lt;p&gt;Another popular option is to use Terraform to provision an EKS cluster, either with the &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks"&gt;official EKS module&lt;/a&gt; or with submodules for VPC, EKS, and/or nodegroups separately. The &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/examples/complete"&gt;complete example&lt;/a&gt; for the EKS module will bootstrap an EKS cluster with self-managed and AWS-managed nodes with KMS encryption enabled by default. &lt;/p&gt;

&lt;p&gt;Since Terraform is one of the more popular IaC providers, the advantage of using it to manage EKS is keeping everything within the same tech stack. If you are running multi-cloud or plan to branch out into AKS or GKE in the future, Terraform would also be more desirable than eksctl. &lt;/p&gt;
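For orientation, a minimal invocation of the official module might look like the fragment below. This is a hedged sketch: the inputs follow the v18 schema of terraform-aws-modules/eks, and the cluster name, version, and the referenced `module.vpc` outputs are placeholders to adapt to your environment:

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  # Placeholder names and versions; adjust to your environment
  cluster_name    = "my-cluster"
  cluster_version = "1.22"

  # Assumes a VPC defined elsewhere (e.g. the terraform-aws-modules/vpc module)
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    default = {
      instance_types = ["m5.large"]
      min_size       = 1
      max_size       = 5
      desired_size   = 2
    }
  }
}
```

From there, the usual `terraform init`, `plan`, and `apply` workflow provisions the cluster alongside the rest of your infrastructure.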

&lt;p&gt;Finally, for a deep-dive into designing and provisioning a production-ready EKS cluster, you can check out &lt;a href="https://itnext.io/how-to-design-and-provision-a-production-ready-eks-cluster-f24156ac29b2"&gt;some tips in this article&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Installing KubeSphere on Amazon EKS
&lt;/h2&gt;

&lt;p&gt;Now that we have a functional EKS cluster, we can install KubeSphere using kubectl. &lt;/p&gt;

&lt;p&gt;First, we need to update the kubeconfig to match our newly created cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;aws eks --region &amp;lt;my-region&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;update-kubeconfig &lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;my-cluster-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can apply the kubectl manifests to install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl apply -f https://github.com/kubesphere/ks-installer/releases/download/v3.2.1/kubesphere-installer.yaml

kubectl apply -f https://github.com/kubesphere/ks-installer/releases/download/v3.2.1/cluster-configuration.yaml
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the installation is complete, you should see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#####################################################
###              Welcome to KubeSphere!           ###
#####################################################
Account: admin
Password: P@88w0rd
NOTES：
1. After logging into the console, please check the
   monitoring status of service components in
   the "Cluster Management". If any service is not
   ready, please wait patiently until all components
   are ready.
2. Please modify the default password after login.
#####################################################
https://kubesphere.io             2020-xx-xx xx:xx:xx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, KubeSphere has partnered with AWS to easily install KubeSphere as an &lt;a href="https://aws.amazon.com/quickstart/architecture/qingcloud-kubesphere/"&gt;AWS Quick Start&lt;/a&gt;. This will use a CloudFormation template to deploy an EKS cluster and install KubeSphere automatically. You can edit the CloudFormation template to remove VPC and EKS creation and only trigger KubeSphere installation with an existing cluster. &lt;/p&gt;

&lt;h2&gt;
  
  
  Managing Amazon EKS with KubeSphere
&lt;/h2&gt;

&lt;p&gt;Although Amazon provides some add-on services such as VPC CNI, CoreDNS, EBS CSI, and kube-proxy on top of the core EKS offering, it is pretty barebones in terms of the extra tooling needed for a production-ready Kubernetes platform. It does not come with any ingress controllers (e.g. NGINX, Traefik), autoscalers (e.g. Karpenter, Cluster Autoscaler), logging and monitoring agents, or common tools like ExternalDNS or cert-manager. If you plan to support multitenancy on EKS, you are also responsible for configuring namespaces and the necessary RBAC components yourself. &lt;/p&gt;

&lt;p&gt;This is where KubeSphere can help ease the burden by providing a platform layer on top. KubeSphere comes prepackaged with integrations for Jenkins, logging/monitoring, service mesh, ingress controllers, and more to deliver a complete application management experience. User management is also built in via workspaces and projects, which assign users and RBAC roles to the namespaces corresponding to each project. The main dashboard can also be used to deploy new microservices with Jenkins pipelines or to utilize the App Store to deploy popular Helm charts like etcd, redis, tomcat, and postgresql. &lt;/p&gt;

&lt;p&gt;KubeSphere also shines when multiple clusters (e.g. multi-region or multi-environment) are involved. KubeSphere follows a federation model in which the KubeSphere instance running in the host cluster controls downstream member clusters. Using KubeSphere, platform teams can consistently install, upgrade, and manage not only the infrastructure components but also their applications across multiple clusters. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vlTrh1op--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/ItoeOCS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vlTrh1op--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/ItoeOCS.png" alt="" width="880" height="503"&gt;&lt;/a&gt;&lt;br&gt;
Image Credit: &lt;a href="https://thenewstack.io/tutorial-use-kubesphere-to-manage-digitalocean-kubernetes-and-amazon-eks/"&gt;New Stack&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The learning curve for mastering Kubernetes is steep. But with Amazon EKS, users can offload the management of the control plane and core addon components to Amazon. With eksctl and Terraform, teams can easily provision many Kubernetes clusters at scale. To go a step further, utilize the rich ecosystem of integrations that KubeSphere provides to further ease the burden of managing Kubernetes clusters and their applications in a cloud-native way. &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>Running QuestDB on GKE Autopilot</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Fri, 26 Mar 2021 13:37:41 +0000</pubDate>
      <link>https://dev.to/yitaek/running-questdb-on-gke-autopilot-3cfg</link>
      <guid>https://dev.to/yitaek/running-questdb-on-gke-autopilot-3cfg</guid>
      <description>&lt;p&gt;Extending the QuestDB Helm chart with monitoring and automated backups for a production-ready setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ADpRugft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3368/1%2Ay-VWDVTYPDQmW0-Mp_S0UA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ADpRugft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3368/1%2Ay-VWDVTYPDQmW0-Mp_S0UA.png" alt="" width="880" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently, I’ve been experimenting with QuestDB as the primary time-series database to stream and analyze IoT/financial data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://levelup.gitconnected.com/streaming-heart-rate-data-with-iot-core-and-questdb-84304069592e"&gt;Streaming Heart Rate Data with IoT Core and QuestDB&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/swlh/realtime-crypto-tracker-with-kafka-and-questdb-b33b19048fc2"&gt;Realtime Crypto Tracker with Kafka and QuestDB&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While I was able to validate the power of QuestDB in storing massive amounts of data and querying them quickly in those two projects, I was mostly running them on my laptop via Docker. In order to scale my experiments, I wanted to create a more production-ready setup, including monitoring and disaster recovery on Kubernetes. So in this guide, we’ll walk through setting up QuestDB on GKE with Prometheus and Velero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/"&gt;GCP account&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/sdk/docs/install"&gt;gcloud CLI&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://v3.helm.sh/docs/intro/install/"&gt;Helm 3&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up GKE Autopilot
&lt;/h2&gt;

&lt;p&gt;As a DevOps engineer/SRE, I’m a huge fan of GKE since, compared to other managed Kubernetes options, it provides a lot of features out of the box such as cluster autoscaling, network policy plugins, and managed Istio. Recently Google Cloud announced &lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/introducing-gke-autopilot"&gt;GKE Autopilot&lt;/a&gt;, a new mode that further automates Kubernetes operations, including node management, security/hardening, and resource optimization. It brings together the serverless experience of Cloud Run with the flexibility and features of GKE. In practice, this means you are charged for pod usage rather than paying for the compute and storage of the underlying Kubernetes nodes, making it a great choice for projects with unknown resource utilization.&lt;/p&gt;



&lt;p&gt;Creating a new GKE Autopilot cluster is also extremely simple. There’s no need to set up Terraform or VPCs/autoscalers/node groups. Simply create a Google Cloud account, navigate to &lt;a href="https://cloud.google.com/"&gt;Google Kubernetes Engine&lt;/a&gt;, enable the Kubernetes Engine API, and click on “Create Cluster”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gLj2-NWX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2804/1%2A369tWdShI_HC4jy0WMkJcA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gLj2-NWX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2804/1%2A369tWdShI_HC4jy0WMkJcA.png" alt="" width="880" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on the “Configure” button under “Autopilot mode”, give the cluster a name, and wait a few minutes for the cluster to be ready. I went with the default name autopilot-cluster-1 in the us-east1 region (feel free to skip the other sections for now).&lt;/p&gt;

&lt;p&gt;To configure your Kubernetes context to interact with the cluster via Helm, fetch the credentials using the following gcloud command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gcloud container clusters get-credentials autopilot-cluster-1 --region us-east1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Deploying QuestDB
&lt;/h2&gt;

&lt;p&gt;QuestDB provides an official Helm chart that deploys a single StatefulSet pod with 50Gi PVC by default. To install the chart, add the QuestDB Helm repo and deploy it by running the following:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ helm repo add questdb [https://helm.questdb.io/](https://helm.questdb.io/)
$ helm upgrade --install questdb questdb/questdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since GKE Autopilot does not pre-provision node capacity, the pod will initially be marked as unschedulable. Give it a few seconds until the cluster autoscaler triggers a scale-up event and schedules the questdb pod.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Of4X_cI0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2832/1%2AR1k-AVPDzVRfES84Fm5NMw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Of4X_cI0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2832/1%2AR1k-AVPDzVRfES84Fm5NMw.png" alt="" width="880" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a quick smoke test, port-forward the HTTP endpoint and interact with the web console UI:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl port-forward questdb-0 9000:9000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hsgHOD_S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5088/1%2A4y3ykUUAyIIw-mXmVHW7CA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hsgHOD_S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5088/1%2A4y3ykUUAyIIw-mXmVHW7CA.png" alt="" width="880" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Postgres Metrics
&lt;/h2&gt;

&lt;p&gt;At this point, we have a healthy, running instance of QuestDB. However, without inspecting the application logs, there is no easy way to determine the overall system health or grab metrics from QuestDB. For performance reasons, the QuestDB team decided to decouple the REST endpoint from a &lt;a href="https://questdb.io/docs/operations/health-monitoring/"&gt;minimal HTTP server that serves a simple health status on port 9003&lt;/a&gt;. This port is currently not exposed via the Helm chart, so the StatefulSet and Service sections need to be changed to allow probes to periodically check this endpoint.&lt;/p&gt;
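As an illustration of that change, probes against port 9003 could be added to the questdb container spec along the following lines. This is a hedged sketch: the exact placement depends on the chart's templates, and the timing values are arbitrary starting points:

```yaml
# Probes against QuestDB's min health server on port 9003.
# Added to the questdb container in the StatefulSet spec.
livenessProbe:
  httpGet:
    path: /
    port: 9003
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /
    port: 9003
  periodSeconds: 10
```

The corresponding Service would also need a port entry for 9003 if you want to reach the health endpoint from outside the pod.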

&lt;p&gt;Unfortunately, QuestDB currently does not provide a comprehensive metrics endpoint that natively integrates with Prometheus or Postgres. The open GitHub issues tracking this support are listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/questdb/questdb/issues/532"&gt;[#532] Add Prometheus metrics endpoint&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/questdb/questdb/issues/837"&gt;[#837] Add support for exposing database state similar to pg_stat_database and pg_stat_activity&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still, I wanted to deploy the Prometheus Postgres Exporter to at least track QuestDB up/down status and set up the framework to monitor Postgres metrics once these features become available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Expose Postgres Port
&lt;/h3&gt;

&lt;p&gt;First, we need to modify the default QuestDB chart to expose port 8812 (the Postgres endpoint). Create a new YAML file called questdb-values.yaml:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;service:
  expose:
    postgresql:
      enabled: true
      port: 8812
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Update the existing deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ helm upgrade --install questdb questdb/questdb -f questdb-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Install Prometheus Postgres Exporter
&lt;/h3&gt;

&lt;p&gt;Prometheus Helm charts are managed by prometheus-community:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, we need to modify the default values to scrape QuestDB and disable the default metrics, since QuestDB doesn’t currently expose pg_stat_database or pg_stat_activity. We also want Prometheus to scrape our metrics, so we need to add annotations specifying our scrape port.&lt;/p&gt;

&lt;p&gt;Create another YAML file called exporter-values.yaml:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config:
  datasource:
    host: questdb-0.questdb-headless
    user: admin
    password: quest
    port: '8812'
    database: qdb
    sslmode: disable
  disableDefaultMetrics: true

annotations: 
  prometheus.io/scrape: 'true'
  prometheus.io/port: '9187'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now we can install Postgres Exporter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ helm upgrade -i postgres-exporter prometheus-community/prometheus-postgres-exporter -f exporter-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally, we can install Prometheus to collect these metrics. Create a new YAML file to hold our Prometheus configurations, prometheus-values.yaml:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alertmanager:
  enabled: false
nodeExporter:
  enabled: false
pushgateway:
  enabled: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since we don’t have real metrics from QuestDB yet, we will only deploy the Prometheus server and accept the default configurations.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ helm upgrade -i prometheus prometheus-community/prometheus -f prometheus-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Access the dashboard to run our queries by port-forwarding:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ export POD_NAME=$(kubectl get pods --namespace default -l \ "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace default port-forward $POD_NAME 9090
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Open up localhost:9090 and check that we’re able to scrape pg_up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LlW50J2h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4248/1%2ARYC-gxAIDQmPir1s7mZEvQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LlW50J2h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4248/1%2ARYC-gxAIDQmPir1s7mZEvQ.png" alt="" width="880" height="677"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: If you are looking for a production-ready monitoring setup, please refer to the four-part series “&lt;a href="https://yitaek.medium.com/practical-monitoring-with-prometheus-grafana-part-i-22d0f172f993"&gt;Practical Monitoring with Prometheus and Grafana&lt;/a&gt;”.)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Automated Backups with Velero
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://velero.io/"&gt;Velero&lt;/a&gt; is an open-source tool to back up and restore Kubernetes resources and persistent volumes. This is useful for disaster recovery (taking snapshots) or for data migration. Velero runs inside the Kubernetes cluster and integrates with various storage providers (e.g. AWS S3, GCP Storage, Minio) as well as &lt;a href="https://velero.io/docs/v1.5/restic/"&gt;restic&lt;/a&gt; to take snapshots either on-demand or on a schedule.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;Velero can be installed via &lt;a href="https://vmware-tanzu.github.io/helm-charts/"&gt;Helm&lt;/a&gt; or via the &lt;a href="https://velero.io/docs/v1.5/basic-install/"&gt;CLI&lt;/a&gt; tool. In general, it seems like the CLI gets the latest updates, and the Helm chart lags behind slightly with compatible Docker images.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# for MacOS
$ brew install velero

# for Windows
$ choco install velero
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Configuring Server Components
&lt;/h3&gt;

&lt;p&gt;To set up Velero on GCP, we need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/vmware-tanzu/velero-plugin-for-gcp#Create-an-GCS-bucket"&gt;Create an GCS bucket&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/vmware-tanzu/velero-plugin-for-gcp#Set-permissions-for-Velero"&gt;Set permissions for Velero&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/vmware-tanzu/velero-plugin-for-gcp#Install-and-start-Velero"&gt;Install and start Velero&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
Create a GCS Bucket
&lt;/h3&gt;

&lt;p&gt;Choose a unique bucket name and use the gsutil tool to create the bucket (replace &amp;lt;YOUR_BUCKET&amp;gt; with the name of your bucket):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PROJECT_ID=$(gcloud config get-value project)
BUCKET=&amp;lt;YOUR_BUCKET&amp;gt;

gsutil mb gs://$BUCKET/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Set GCP Permissions
&lt;/h3&gt;

&lt;p&gt;Create a service account:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud iam service-accounts create velero \
    --display-name "Velero service account"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Attach the compute permissions to the service account:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SERVICE_ACCOUNT_EMAIL=$(gcloud iam service-accounts list \
  --filter="displayName:Velero service account" \
  --format 'value(email)')

ROLE_PERMISSIONS=(
    compute.disks.get
    compute.disks.create
    compute.disks.createSnapshot
    compute.snapshots.get
    compute.snapshots.create
    compute.snapshots.useReadOnly
    compute.snapshots.delete
    compute.zones.get
)

gcloud iam roles create velero.server \
    --project $PROJECT_ID \
    --title "Velero Server" \
    --permissions "$(IFS=","; echo "${ROLE_PERMISSIONS[*]}")"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member serviceAccount:$SERVICE_ACCOUNT_EMAIL \
    --role projects/$PROJECT_ID/roles/velero.server

gsutil iam ch serviceAccount:$SERVICE_ACCOUNT_EMAIL:objectAdmin gs://${BUCKET}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Download the service account key and save it as credentials-velero:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud iam service-accounts keys create credentials-velero \
    --iam-account $SERVICE_ACCOUNT_EMAIL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now we can install Velero with the GCS plugin enabled:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;velero install \
    --provider gcp \
    --plugins velero/velero-plugin-for-gcp:v1.1.0 \
    --bucket $BUCKET \
    --secret-file ./credentials-velero
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally, we can create a backup schedule using a cron expression:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;velero schedule create questdb --schedule "0 7 * * *" -l "app.kubernetes.io/instance=questdb" --include-namespaces default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
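When it comes time to restore, backups created by a schedule follow a predictable naming convention of schedule name plus timestamp. A hedged sketch (the timestamp below is illustrative; the velero commands are shown as comments since they require a configured cluster):

```shell
# Backups created by a Velero schedule are named "<schedule-name>-<timestamp>"
SCHEDULE_NAME="questdb"
EXAMPLE_BACKUP="${SCHEDULE_NAME}-20210326070000"  # illustrative timestamp

# With a configured velero CLI, you could then inspect and restore:
#   velero backup get
#   velero restore create --from-backup "$EXAMPLE_BACKUP"
echo "$EXAMPLE_BACKUP"
```

On-demand backups work the same way via `velero backup create` with the same label selector used in the schedule above.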

&lt;p&gt;We can verify that backups are being made in our bucket:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wF2-9D04--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4572/1%2Ahqon6Z2WCYpVlnzJ5QD7ag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wF2-9D04--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4572/1%2Ahqon6Z2WCYpVlnzJ5QD7ag.png" alt="" width="880" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(&lt;em&gt;Note: For a deep-dive on Velero, refer to “&lt;a href="https://medium.com/dev-genius/disaster-recovery-on-kubernetes-98c5c78382bb"&gt;Disaster Recovery on Kubernetes&lt;/a&gt;.”&lt;/em&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;To optimize for performance, QuestDB opted to build a time-series database from scratch with ANSI SQL compatibility, as opposed to building on top of Postgres like TimescaleDB. Also, as a newer product, QuestDB is missing some enterprise features (e.g. replication, high availability, recovery from snapshot) and integrations with other popular projects. Still, with the Prometheus Postgres Exporter and Velero, we can configure a production-ready QuestDB deployment on Kubernetes. I look forward to enhancing this setup in future releases as monitoring and ops features become available.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>tutorial</category>
      <category>googlecloud</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Realtime Crypto Tracker with Kafka and QuestDB</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Thu, 18 Feb 2021 17:58:14 +0000</pubDate>
      <link>https://dev.to/yitaek/realtime-crypto-tracker-with-kafka-and-questdb-11f3</link>
      <guid>https://dev.to/yitaek/realtime-crypto-tracker-with-kafka-and-questdb-11f3</guid>
      <description>&lt;p&gt;Analyze cryptocurrency price trends in realtime via Kafka and store for further investigation in a timeseries database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QNzpTQM0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/12000/0%2AlvU5t2nwTjIS5TOA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QNzpTQM0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/12000/0%2AlvU5t2nwTjIS5TOA" alt="Photo by [M. B. M.](https://unsplash.com/@m_b_m?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="880" height="587"&gt;&lt;/a&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@m_b_m?utm_source=medium&amp;amp;utm_medium=referral"&gt;M. B. M.&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  “Bitcoin soars past $50,000 for the first time” — &lt;a href="https://www.cnn.com/2021/02/16/investing/bitcoin-50000-price-record/index.html"&gt;CNN&lt;/a&gt;
&lt;/h1&gt;
&lt;h1&gt;
  
  
  “Tesla invests $1.5 billion in bitcoin, will start accepting it as payment” — &lt;a href="https://www.washingtonpost.com/business/2021/02/08/tesla-bitcoin-musk-dogecoin/"&gt;Washington Post&lt;/a&gt;
&lt;/h1&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not a day goes by without some crypto news stealing the headlines these days. From institutional support of Bitcoin to central banks around the world exploring some form of digital currency, interest in cryptocurrency has never been higher. This is also reflected in the daily exchange volume:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7JhSbnP---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2532/1%2AngFuLwi6QU4ibmzSR3EQVw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7JhSbnP---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2532/1%2AngFuLwi6QU4ibmzSR3EQVw.png" alt="" width="880" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As someone interested in the future of DeFi (&lt;a href="https://yitaek.medium.com/intro-to-defi-b4ab2ec0f156"&gt;decentralized finance&lt;/a&gt;), I wanted to better track the price of different cryptocurrencies and store them into a timeseries database for further analysis. I found an interesting talk by Ludvig Sandman and Bruce Zulu at Kafka Summit London 2019, “&lt;a href="https://www.confluent.io/kafka-summit-lon19/using-kafka-streams-analyze-trading-crypto-exchanges/"&gt;Using Kafka Streams to Analyze Live Trading Activity for Crypto Exchanges&lt;/a&gt;”, so I decided to leverage Kafka and modify it for my own use. In this tutorial, we will use Python to send real-time cryptocurrency metrics into Kafka topics, store these records in QuestDB, and perform moving average calculations on this time series data with numpy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Setup
&lt;/h2&gt;

&lt;p&gt;At a high level, this project polls the public Coinbase API for the price of Bitcoin, Ethereum, and Chainlink. This information is then published onto individual topics on Kafka (e.g. topic_BTC). The raw price information is sent to QuestDB via Kafka Connect to populate the time-series database. At the same time, a separate consumer also pulls that data and calculates a moving average for a quick trend analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lzeVgetw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4384/1%2ArmhaHVmus3WFNZxa1kqUtg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lzeVgetw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4384/1%2ArmhaHVmus3WFNZxa1kqUtg.png" alt="" width="880" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The codebase is organized into three parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;docker-compose&lt;/strong&gt;: holds docker-compose file to start Kafka (zookeeper, broker, kafka connect), QuestDB, and JSON file to initialize Kafka Connect&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;docker&lt;/strong&gt;: Dockerfile to build Kafka Connect image (pre-built image is available via docker-compose)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python files&lt;/strong&gt;: grabs the latest pricing information from Coinbase, publishes it to Kafka, and calculates a moving average&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you would like to analyze different cryptocurrencies or extend the simple moving average example with a more complicated algorithm like relative strength index analysis, feel free to fork the repo on &lt;a href="https://github.com/Yitaek/kafka-crypto-questdb"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Docker (with at least 4GB memory): if using Docker Desktop, go to Settings -&amp;gt; Resources -&amp;gt; Memory and increase the default limit from 2GB to 4GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python 3.7+&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up Kafka &amp;amp; QuestDB
&lt;/h2&gt;

&lt;p&gt;Before pulling data from Coinbase, we need a running instance of a Kafka cluster and QuestDB. In the repo, I have a working docker-compose file with Confluent Kafka components (i.e. zookeeper, broker, Kafka Connect) and QuestDB. If you would like to set up the Kafka components yourself, either on the cloud or locally, follow the instructions on the &lt;a href="https://docs.confluent.io/platform/current/quickstart/index.html"&gt;Confluent website&lt;/a&gt;. Otherwise, simply apply the docker-compose file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd docker-compose
docker-compose up -d 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The docker-compose file runs the following services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Zookeeper&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kafka Broker&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kafka Connect with JDBC driver&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;QuestDB&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Kafka Connect image is based on confluentinc/cp-kafka-connect-base:6.1.0 . If you wish to modify this image (e.g. add a new connector to MongoDB or modify the bootup process), you can override the &lt;a href="https://github.com/Yitaek/kafka-crypto/blob/main/docker/Dockerfile"&gt;Dockerfile&lt;/a&gt; and build it locally.&lt;/p&gt;

&lt;p&gt;Wait for the Kafka cluster to come up. Watch the logs in the connect container until you see the following messages:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2021-02-17 01:55:54,456] INFO [Worker clientId=connect-1, groupId=compose-connect-group] Starting connectors and tasks using config offset -1 (org.apache.kafka.connect.runtime.distributed.DistributedHerder)

[2021-02-17 01:55:54,456] INFO [Worker clientId=connect-1, groupId=compose-connect-group] Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder)

[2021-02-17 01:55:54,572] INFO [Worker clientId=connect-1, groupId=compose-connect-group] Session key updated (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Configuring Postgres Sink
&lt;/h2&gt;

&lt;p&gt;At this point, we have a healthy Kafka cluster and a running instance of QuestDB, but they are not connected. Since QuestDB supports the Kafka Connect JDBC driver, we can leverage the PostgreSQL sink to populate our database automatically. Post this connector definition to our Kafka Connect container:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Make sure you're inside the docker-compose directory

$ curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" --data @postgres-sink-btc.json http://localhost:8083/connectors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
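To confirm the sink registered correctly, Kafka Connect also exposes a status endpoint. Here is a minimal stdlib-only Python check; it assumes Kafka Connect is reachable on localhost:8083 as in the docker-compose file, and the helper names are mine, not part of the repo:

```python
import json
from urllib.request import urlopen

CONNECT_URL = "http://localhost:8083"


def status_endpoint(connector, base=CONNECT_URL):
    """Build the Kafka Connect REST status URL for a named connector."""
    return base + "/connectors/" + connector + "/status"


def connector_is_running(connector, base=CONNECT_URL):
    """Return True if the connector and all of its tasks report RUNNING."""
    with urlopen(status_endpoint(connector, base)) as resp:
        status = json.load(resp)
    states = [status["connector"]["state"]]
    states.extend(task["state"] for task in status["tasks"])
    return all(state == "RUNNING" for state in states)


if __name__ == "__main__":
    print(connector_is_running("postgres-sink-btc"))
```

If the connector or a task is in FAILED state, the same JSON response includes a stack trace that usually points at the misconfigured field.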

&lt;p&gt;postgres-sink-btc.json holds the following configuration details:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "postgres-sink-btc",
  "config": {
    "connector.class":"io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max":"1",
    "topics": "topic_BTC",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "connection.url": "jdbc:postgresql://questdb:8812/qdb?useSSL=false",
    "connection.user": "admin",
    "connection.password": "quest",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "true",
    "auto.create": "true",
    "insert.mode": "insert",
    "pk.mode": "none"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Some important fields to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;topics&lt;/strong&gt;: Kafka topic to consume and convert into Postgres format&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;connection&lt;/strong&gt;: Using default credentials for QuestDB (admin/quest) on port 8812&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;value.converter:&lt;/strong&gt; This example uses JSON with schema, but you can also use Avro or raw JSON. If you would like to override the default configuration, you can refer to &lt;a href="https://docs.mongodb.com/kafka-connector/v1.3/kafka-sink-data-formats/"&gt;Kafka Sink Connector Guide&lt;/a&gt; from MongoDB.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Poll Coinbase for Latest Crypto Prices
&lt;/h2&gt;

&lt;p&gt;Now that our Kafka-QuestDB connection is made, we can start pulling data from Coinbase. The Python code requires numpy, kafka-python, and pandas to run. Using pip, install those packages and run the getData.py script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install -r requirements.txt
$ python getData.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
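For reference, here is a minimal sketch of what getData.py does; the actual script in the repo may differ, and the helper names here are mine. It assumes a local broker on localhost:9092 and uses the public Coinbase spot price endpoint:

```python
import json
from datetime import datetime

# Envelope matching the sink config above ("value.converter.schemas.enable": "true")
SCHEMA = {
    "type": "struct",
    "name": "coinbase",
    "optional": False,
    "fields": [
        {"type": "string", "optional": False, "field": "currency"},
        {"type": "float", "optional": False, "field": "amount"},
        {"type": "string", "optional": False, "field": "timestamp"},
    ],
}


def build_record(currency, amount, ts):
    """Wrap one price sample in the schema+payload envelope Kafka Connect expects."""
    return {"schema": SCHEMA,
            "payload": {"timestamp": ts, "currency": currency, "amount": amount}}


def main():
    # Non-stdlib dependencies: pip install kafka-python requests
    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))
    for currency in ("BTC", "ETH", "LINK"):
        # Public Coinbase spot price endpoint, e.g. /v2/prices/BTC-USD/spot
        url = "https://api.coinbase.com/v2/prices/" + currency + "-USD/spot"
        amount = float(requests.get(url).json()["data"]["amount"])
        record = build_record(currency, amount, str(datetime.utcnow()))
        producer.send("topic_" + currency, record)
    producer.flush()


if __name__ == "__main__":
    main()
```

The schema block is what lets the JDBC sink auto-create the table (auto.create is true in the connector config), so every record carries it alongside the payload.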

&lt;p&gt;It will now print out debug messages with pricing information as well as the schema we’re using to populate QuestDB:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Initializing Kafka producer at 2021-02-17 14:38:18.655069
Initialized Kafka producer at 2021-02-17 14:38:18.812354
API request at time 2021-02-17 14:38:19.170623

Record: {'schema': {'type': 'struct', 'fields': [{'type': 'string', 'optional': False, 'field': 'currency'}, {'type': 'float', 'optional': False, 'field': 'amount'}, {'type': 'string', 'optional': False, 'field': 'timestamp'}], 'optional': False, 'name': 'coinbase'}, 'payload': {'timestamp': datetime.datetime(2021, 2, 17, 14, 38, 19, 170617), 'currency': 'BTC', 'amount': 50884.75}}

API request at time 2021-02-17 14:38:19.313046
Record: {'schema': {'type': 'struct', 'fields': [{'type': 'string', 'optional': False, 'field': 'currency'}, {'type': 'float', 'optional': False, 'field': 'amount'}, {'type': 'string', 'optional': False, 'field': 'timestamp'}], 'optional': False, 'name': 'coinbase'}, 'payload': {'timestamp': datetime.datetime(2021, 2, 17, 14, 38, 19, 313041), 'currency': 'ETH', 'amount': 1809.76}}

API request at time 2021-02-17 14:38:19.471573
Record: {'schema': {'type': 'struct', 'fields': [{'type': 'string', 'optional': False, 'field': 'currency'}, {'type': 'float', 'optional': False, 'field': 'amount'}, {'type': 'string', 'optional': False, 'field': 'timestamp'}], 'optional': False, 'name': 'coinbase'}, 'payload': {'timestamp': datetime.datetime(2021, 2, 17, 14, 38, 19, 471566), 'currency': 'LINK', 'amount': 31.68216}}

API request at time 2021-02-17 14:38:23.978928
Record: {'schema': {'type': 'struct', 'fields': [{'type': 'string', 'optional': False, 'field': 'currency'}, {'type': 'float', 'optional': False, 'field': 'amount'}, {'type': 'string', 'optional': False, 'field': 'timestamp'}], 'optional': False, 'name': 'coinbase'}, 'payload': {'timestamp': datetime.datetime(2021, 2, 17, 14, 38, 23, 978918), 'currency': 'BTC', 'amount': 50884.75}}

...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Query Data on QuestDB
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://questdb.io/"&gt;QuestDB&lt;/a&gt; is a fast, open-source, timeseries database with SQL support. This makes it a great candidate to store financial market data for further historical trend analysis and generating trade signals. By default, QuestDB ships with a console UI exposed on port 9000. Navigate to localhost:9000 and query Bitcoin tracking topic topic_BTC to see price data stream in:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h_G1yqIs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2008/1%2AefUTBhIkifU5Vgg4lX3ecA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h_G1yqIs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2008/1%2AefUTBhIkifU5Vgg4lX3ecA.png" alt="" width="880" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can repeat this process for the other topics as well. If you prefer to run without a UI, you can also use the REST API to check:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -G \
--data-urlencode "query=select * from topic_BTC" \
http://localhost:9000/exp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
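The same export endpoint is just as easy to hit from Python. A small stdlib-only sketch (the helper names are mine; it assumes QuestDB on localhost:9000 as above):

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def export_url(query, host="http://localhost:9000"):
    """Build a QuestDB /exp URL that returns query results as CSV."""
    return host + "/exp?" + urlencode({"query": query})


def fetch_csv(query, host="http://localhost:9000"):
    """Run a query against QuestDB and return the raw CSV text."""
    with urlopen(export_url(query, host)) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    print(fetch_csv("select * from topic_BTC"))
```

From here the CSV can be loaded straight into pandas for offline analysis.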

&lt;p&gt;The QuestDB console UI also provides the ability to generate basic graphs. Click on the Chart tab underneath the Tables panel. Select line as the chart type, timestamp as the label, and click Draw:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OprEy_yd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2548/1%2Acn0zn94ZieOLuYbdnpjkYQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OprEy_yd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2548/1%2Acn0zn94ZieOLuYbdnpjkYQ.png" alt="" width="880" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, the QuestDB native charting capabilities are currently limited. For more advanced visualization, check out my previous guide on &lt;a href="https://yitaek.medium.com/streaming-heart-rate-data-with-iot-core-and-questdb-84304069592e"&gt;streaming heart rate data to QuestDB&lt;/a&gt; under the “Visualizing Data with Grafana” section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calculate Moving Average
&lt;/h2&gt;

&lt;p&gt;While we store the raw data on QuestDB for more sophisticated analysis, we can also consume from the same topics to calculate a quick moving average. This may be useful if you want to also post these records to another Kafka topic that you may use on a dashboard or to set alerts on pricing trends.&lt;/p&gt;

&lt;p&gt;On a separate terminal, run the moving average script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python movingAverage.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It will print out the moving average of 25 data points and post it to topic_BTC_ma_25:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Starting Apache Kafka consumers and producer
Initializing Kafka producer at 2021-02-17 16:28:33.584649
Initialized Kafka producer at 2021-02-17 16:28:33.699208

Consume record from topic 'topic_BTC' at time 2021-02-17 16:28:34.933318
Produce record to topic 'topic_BTC_ma_25' at time 2021-02-17 16:28:35.072581
Produce record to topic 'topic_BTC_ma_25' at time 2021-02-17 16:28:35.075352
Produce record to topic 'topic_BTC_ma_25' at time 2021-02-17 16:28:35.077106
Produce record to topic 'topic_BTC_ma_25' at time 2021-02-17 16:28:35.088821
Produce record to topic 'topic_BTC_ma_25' at time 2021-02-17 16:28:35.091865
Produce record to topic 'topic_BTC_ma_25' at time 2021-02-17 16:28:35.094458
Produce record to topic 'topic_BTC_ma_25' at time 2021-02-17 16:28:35.096814
Produce record to topic 'topic_BTC_ma_25' at time 2021-02-17 16:28:35.098512
Produce record to topic 'topic_BTC_ma_25' at time 2021-02-17 16:28:35.100150
Produce record to topic 'topic_BTC_ma_25' at time 2021-02-17 16:28:35.103512
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
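The rolling-average logic behind those log lines can be sketched as follows. This is a hedged simplification (the class name is mine): movingAverage.py computes the same window mean with numpy and wires it to Kafka consumers and producers, while this stays dependency-free:

```python
from collections import deque


class MovingAverage:
    """Rolling mean over the last `size` prices.

    movingAverage.py uses numpy for the same computation; plain Python
    is shown here to keep the sketch self-contained.
    """

    def __init__(self, size=25):
        # deque with maxlen drops the oldest price automatically
        self.window = deque(maxlen=size)

    def update(self, price):
        """Add a price and return the mean of the current window."""
        self.window.append(price)
        return sum(self.window) / len(self.window)


if __name__ == "__main__":
    ma = MovingAverage(size=25)
    for price in (50884.75, 50890.10, 50901.33):
        # In the real pipeline, each result is produced to topic_BTC_ma_25
        print(ma.update(price))
```

Until the window fills, the mean is taken over however many points have arrived, which matches the burst of early records in the log output above.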

&lt;p&gt;If you wish to also populate these data points into QuestDB, supplement the JSON data with schema information in movingAverage.py, similar to the way it is defined in the new_data JSON block in getData.py. Then create another Postgres sink via curl with the topic set to topic_BTC_ma_25.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;To stop streaming data, simply stop the Python scripts. To destroy the Kafka cluster and QuestDB, run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker-compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;While this is a simple example, you can extend this to optimize the data format with Avro, connect it with your Coinbase account to execute trades based on trading signals, or test out different statistical methods on the raw data. Feel free to submit a PR to make this repo more useful.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>kafka</category>
      <category>crypto</category>
    </item>
    <item>
      <title>Stream heart rate data into QuestDB via Google IoT Core</title>
      <dc:creator>Yitaek Hwang</dc:creator>
      <pubDate>Thu, 11 Feb 2021 18:40:11 +0000</pubDate>
      <link>https://dev.to/yitaek/stream-heart-rate-data-into-questdb-via-google-iot-core-lc5</link>
      <guid>https://dev.to/yitaek/stream-heart-rate-data-into-questdb-via-google-iot-core-lc5</guid>
      <description>&lt;p&gt;An end-to-end demo of a simple IoT system to stream and visualize heart rate data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1Yug8xAX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/10368/0%2Am_6w4yHF3UKsXlLA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1Yug8xAX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/10368/0%2Am_6w4yHF3UKsXlLA" alt="Photo by [Louis Reed](https://unsplash.com/@_louisreed?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="880" height="587"&gt;&lt;/a&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@_louisreed?utm_source=medium&amp;amp;utm_medium=referral"&gt;Louis Reed&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thanks to the growing popularity of fitness trackers and smartwatches, more people are tracking their biometrics data closely and integrating IoT into their everyday lives. In my search for a DIY heart rate tracker, I found an excellent walkthrough from Brandon Freitag and Gabe Weiss, using Google Cloud services to stream data from a Raspberry Pi with a heart rate sensor to BigQuery via IoT Core and Cloud Dataflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T-TO1Y8w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2APOR1pmV-c868f6KM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T-TO1Y8w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2APOR1pmV-c868f6KM.png" alt="Image Credit: [Google Codelab](https://codelabs.developers.google.com/codelabs/iotcore-heartrate#0)" width="668" height="224"&gt;&lt;/a&gt;&lt;em&gt;Image Credit: &lt;a href="https://codelabs.developers.google.com/codelabs/iotcore-heartrate#0"&gt;Google Codelab&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Although Cloud Dataflow supports streaming inserts to BigQuery, I wanted to take this opportunity to try out a new time-series database I came across called &lt;a href="https://questdb.io/"&gt;QuestDB&lt;/a&gt;. QuestDB is a fast open-source time-series database with Postgres compatibility. The &lt;a href="http://try.questdb.io:9000/"&gt;live demo&lt;/a&gt; on the website queried the NYC taxi rides dataset with over 1.6 billion rows in milliseconds, so I was excited to give this database a try. To round out the end-to-end demo, I used Grafana to pull and visualize data from QuestDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jQjNcGZr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2APcFIaFkLgtTSkqMduXZmSg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jQjNcGZr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2APcFIaFkLgtTSkqMduXZmSg.png" alt="Data Pipeline" width="795" height="170"&gt;&lt;/a&gt;&lt;em&gt;Data Pipeline&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NodeJS v14+&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Docker&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://console.developers.google.com/billing/freetrial?hl=en&amp;amp;pli=1"&gt;Google Cloud Account&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/sdk/docs/install"&gt;gcloud sdk&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Optional&lt;/em&gt;: &lt;a href="https://www.arrow.com/en/research-and-events/articles/codelabs-using-iot-core-to-stream-heart-rate-data"&gt;Raspberry Pi kit&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this tutorial, we will use a Debian image to run simulated heart rate data through IoT Core. If you wish to send real sensor data from Raspberry Pi, purchase the optional kit listed above and follow the &lt;a href="https://codelabs.developers.google.com/codelabs/iotcore-heartrate#6"&gt;install instructions&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Cloud Setup
&lt;/h2&gt;

&lt;p&gt;In order to use Cloud IoT Core and Cloud Pub/Sub, you need to first create a Google Cloud Platform account and a new project (mine is called questdb-iot-demo ). Then, enable IoT Core, Compute Engine, and Pub/Sub APIs under APIs &amp;amp; Services -&amp;gt; Enable APIs and Services -&amp;gt; Search for APIs &amp;amp; Services:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TksbinGe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AJASmiIHrAIw-ZNWktXmE0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TksbinGe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AJASmiIHrAIw-ZNWktXmE0g.png" alt="" width="670" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  IoT Core Setup
&lt;/h3&gt;

&lt;p&gt;IoT Core is Google’s fully-managed IoT service to help securely connect and manage IoT devices. In this demo, we will create a registry called heartrate and send MQTT data. Click on “Create Registry” and set the Registry ID and Region based on the geographic region closest to you (for me it was us-central1):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9ASDNep1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ApI-2zJUOaNOi6NPeVVJ1mw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9ASDNep1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ApI-2zJUOaNOi6NPeVVJ1mw.png" alt="" width="548" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we need to configure a Pub/Sub topic to publish device data to. Under “Select a Cloud Pub/Sub topic”, click on “Create a Topic” and give it the Topic ID heartratedata :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K2ETwaYg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2082/1%2Az18ffFaCdBURZO60qX1Y7A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K2ETwaYg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2082/1%2Az18ffFaCdBURZO60qX1Y7A.png" alt="" width="880" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the “Registry properties” and “Cloud Pub/Sub topics” are configured, click on “Create”.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compute Engine Setup
&lt;/h3&gt;

&lt;p&gt;Now it’s time to add our simulated device. In order for our device to communicate with IoT Core, we need to add a public key. Head over to Compute Engine -&amp;gt; Create.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Pd_TmHAD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AE37dquvwXKb0xf28dNFVyg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Pd_TmHAD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AE37dquvwXKb0xf28dNFVyg.png" alt="" width="826" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The default options (e2-medium, Debian 10 image, us-central1) will work for our simulator (&lt;em&gt;Note: make sure to match the region with the IoT Core registry region if you chose something other than us-central1&lt;/em&gt;). Once the VM is ready, click on the SSH button under “Connect” and install the project code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install git
sudo apt-get update
sudo apt-get install git

# Clone project code
git clone [https://github.com/googlecodelabs/iotcore-heartrate](https://github.com/googlecodelabs/iotcore-heartrate)
cd iotcore-heartrate

# Install all the core packages
chmod +x initialsoftware.sh
./initialsoftware.sh

# Generate the keys
chmod +x generate_keys.sh
./generate_keys.sh

# View the keys (highlighting will copy the contents)
cat ../.ssh/ec_public.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  IoT Device Setup
&lt;/h3&gt;

&lt;p&gt;Once you have the ec_public.pem key, head back to the IoT Core Registry. Under “Devices”, click on “Create a Device”. For Device ID, enter raspberryHeartRate and expand the “Communication, Cloud Logging, Authentication” pull down:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4NSQdBbA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AZeqjhCAVqyJ3VhKEbdNc1Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4NSQdBbA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AZeqjhCAVqyJ3VhKEbdNc1Q.png" alt="" width="543" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under Authentication, change the “Public key format” to ES256 and paste in the key from our VM or Raspberry Pi and click “Create”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S1dCVSxo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AvXv2Qb2xSqKZnvcKjD_3BQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S1dCVSxo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AvXv2Qb2xSqKZnvcKjD_3BQ.png" alt="" width="531" height="385"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Pub/Sub Setup
&lt;/h3&gt;

&lt;p&gt;Finally, we need to create a subscription to our Pub/Sub topic to pull messages and insert into QuestDB. Head over to “Pub/Sub” and click on our heartratedata topic. Name the subscription questdb and click create.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_idYP6ns--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4428/1%2AE8RlFLMkdfJVDvtVk4bE1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_idYP6ns--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4428/1%2AE8RlFLMkdfJVDvtVk4bE1w.png" alt="" width="880" height="112"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  QuestDB Setup
&lt;/h2&gt;

&lt;p&gt;At this point, we have everything on Google Cloud to send data to our Pub/Sub topic. Now we need to write some code to take those messages and insert them into QuestDB. Let’s begin by starting QuestDB via Docker.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -p 9000:9000 -p 8812:8812 questdb/questdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The above command pulls the latest QuestDB image (v5.0.6) and maps port 9000 for the console UI and port 8812 for Postgres operations. Open up the QuestDB console at (&lt;a href="http://127.0.0.1:9000/"&gt;http://127.0.0.1:9000/&lt;/a&gt;) and create our heart_rate table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE heart_rate(sensorID STRING, uniqueID STRING, timecollected TIMESTAMP, heartrate DOUBLE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1RcAXV_v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5056/1%2A4zKbzzD3sM8Cf8Zn426SiA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1RcAXV_v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5056/1%2A4zKbzzD3sM8Cf8Zn426SiA.png" alt="" width="880" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(&lt;em&gt;Note: the UI doesn’t automatically refresh so if you don’t see the heart_rate table populated on the tables panel, click on the refresh icon above the tables.&lt;/em&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Pub/Sub to QuestDB
&lt;/h3&gt;

&lt;p&gt;Since there’s no native integration for Pub/Sub, we will need to write a simple program to listen to new Pub/Sub messages and insert the data into QuestDB. I’m using NodeJS v14.15.4, but you can use similar client libraries for Pub/Sub and Postgres to achieve the same.&lt;/p&gt;

&lt;p&gt;First, configure the gcloud sdk to authenticate with your GCP project without having to download a service account (see &lt;a href="https://medium.com/dev-genius/simple-gcp-authentication-with-service-accounts-6b877c2e2649"&gt;Simple GCP Authentication with Service Accounts&lt;/a&gt; for more details).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Set default profile
$ gcloud auth application-default login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, create a new NodeJS workspace and install @google-cloud/pubsub and pg libraries. You can use the code below to listen to Pub/Sub and stream to QuestDB.&lt;/p&gt;
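The original NodeJS snippet did not survive the feed import. As noted above, any Pub/Sub and Postgres client pair works; below is a hedged Python equivalent of the same listener (pip install google-cloud-pubsub psycopg2-binary). The project ID, subscription name, and table match the earlier setup, and the helper names are mine:

```python
import json

# Non-stdlib dependencies: pip install google-cloud-pubsub psycopg2-binary


def to_row(data):
    """Turn a Pub/Sub message body into a row for the heart_rate table."""
    msg = json.loads(data)
    return (msg["sensorID"], msg["uniqueID"],
            msg["timecollected"], msg["heartrate"])


def main():
    import psycopg2
    from google.cloud import pubsub_v1

    # QuestDB speaks the Postgres wire protocol on port 8812 (admin/quest)
    conn = psycopg2.connect(host="localhost", port=8812, user="admin",
                            password="quest", dbname="qdb")
    conn.autocommit = True
    cur = conn.cursor()

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path("questdb-iot-demo", "questdb")

    def callback(message):
        cur.execute(
            "INSERT INTO heart_rate (sensorID, uniqueID, timecollected, heartrate) "
            "VALUES (%s, %s, %s, %s)", to_row(message.data))
        message.ack()

    # Block indefinitely, streaming messages into QuestDB as they arrive
    subscriber.subscribe(path, callback=callback).result()


if __name__ == "__main__":
    main()
```

The subscriber runs until interrupted, acknowledging each message only after its row is written.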

&lt;h3&gt;
  
  
  Sending Data
&lt;/h3&gt;

&lt;p&gt;Finally, we are ready to send the simulated data. Switch back to the Compute Engine and ssh into the VM again. Issue the command below to send the data to our IoT Core device:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python heartrateSimulator.py --project_id=questdb-iot-demo --registry_id=heartrate --device_id=raspberryHeartRate --private_key_file=../.ssh/ec_private.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If successful, you should see some logs like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
Publishing message #544: '{"sensorID": "heartrate.raspZero", "heartrate": 72.56881801680139, "uniqueID": "c1ca9656-671f-4fa7-8c03-12fdfb4f422f-heartrate.raspZero", "timecollected": "2018-07-07 20:54:50"}'Publishing message #545: '{"sensorID": "heartrate.raspZero", "heartrate": 72.8324264524384, "uniqueID": "8d6337b7-204f-4209-88c0-46a79d1911bb-heartrate.raspZero", "timecollected": "2018-07-07 20:54:59"}'
Finished.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now run the NodeJS listener, and we should see data populate in QuestDB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--epGa0a8c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4204/1%2AWtWKc4mVd8S96CTyj8obeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--epGa0a8c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4204/1%2AWtWKc4mVd8S96CTyj8obeg.png" alt="" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing Data with Grafana
&lt;/h2&gt;

&lt;p&gt;Although the QuestDB console provides some default visualizations out of the box, to simulate a more realistic scenario in which all metrics are combined in Grafana, we’ll set up a Postgres data source and visualize our heart rate data.&lt;/p&gt;

&lt;p&gt;Start Grafana and log in at &lt;a href="http://localhost:3000/login"&gt;http://localhost:3000/login&lt;/a&gt; with the default admin/admin credentials:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run -p 3000:3000 grafana/grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under “Configuration” -&amp;gt; “Data Sources”, search for PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ErcW7PP8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3812/1%2AaOiJOAGcOGa-ebyaCb3P3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ErcW7PP8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3812/1%2AaOiJOAGcOGa-ebyaCb3P3A.png" alt="" width="880" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the PostgreSQL connection, enter the following details (password: quest) and save:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iV-mYNEQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2080/1%2ALd-vyY53sO0IeFzVkL8vaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iV-mYNEQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2080/1%2ALd-vyY53sO0IeFzVkL8vaw.png" alt="" width="880" height="357"&gt;&lt;/a&gt;&lt;/p&gt;
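&lt;p&gt;For reference, a default local QuestDB instance exposes the Postgres wire protocol with these settings (adjust host and port if your deployment differs):&lt;/p&gt;

```
Host:     localhost:8812
Database: qdb
User:     admin
Password: quest
TLS/SSL:  disabled
```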

&lt;p&gt;Finally, let’s create a dashboard and add a panel with the following SQL query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  timecollected AS "time",
  heartrate
FROM heart_rate 
ORDER BY time;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
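&lt;p&gt;As a refinement, Grafana's Postgres data source supports a $__timeFilter macro so the panel honors the dashboard time picker. A variant of the query above (same table and columns) would be:&lt;/p&gt;

```sql
SELECT
  timecollected AS "time",
  heartrate
FROM heart_rate
WHERE $__timeFilter(timecollected)
ORDER BY time;
```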

&lt;p&gt;Now we see the sample heart rate data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uR3nH9S6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2800/1%2AyD7pPkHLvsPUjRc_5HAVrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uR3nH9S6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2800/1%2AyD7pPkHLvsPUjRc_5HAVrg.png" alt="" width="880" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a side note, the same chart can be drawn on QuestDB console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kdJgnQ5f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4012/1%2ASlCNCxOXEWbJW31WUtmf6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kdJgnQ5f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4012/1%2ASlCNCxOXEWbJW31WUtmf6w.png" alt="" width="880" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, we have an end-to-end system: a device securely sending data via IoT Core and streaming it into QuestDB. We can extend this example to multiple devices by registering them under IoT Core, and scale our server by using pooled connections to insert data into QuestDB more efficiently. At scale, we can also look at average heart rates instead of raw data points (e.g. avg(heartrate) AS average_heartrate).&lt;/p&gt;
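&lt;p&gt;For example, QuestDB's SAMPLE BY extension can compute per-minute averages directly, assuming timecollected is the table's designated timestamp (a sketch against the same heart_rate table):&lt;/p&gt;

```sql
SELECT
  timecollected,
  avg(heartrate) AS average_heartrate
FROM heart_rate
SAMPLE BY 1m;
```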

</description>
      <category>tutorial</category>
      <category>node</category>
      <category>postgres</category>
      <category>iot</category>
    </item>
  </channel>
</rss>
