<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Flo Comuzzi</title>
    <description>The latest articles on DEV Community by Flo Comuzzi (@flopi).</description>
    <link>https://dev.to/flopi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F105190%2F57233942-823a-428d-9e9c-7bfab50b266e.png</url>
      <title>DEV Community: Flo Comuzzi</title>
      <link>https://dev.to/flopi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/flopi"/>
    <language>en</language>
    <item>
      <title>PART 3 A Helm Chart for Ephemeral Environments</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Fri, 03 Oct 2025 17:46:29 +0000</pubDate>
      <link>https://dev.to/flopi/part-3-a-helm-chart-for-ephemeral-environments-4m4d</link>
      <guid>https://dev.to/flopi/part-3-a-helm-chart-for-ephemeral-environments-4m4d</guid>
      <description>&lt;h3&gt;
  
  
  Part 3 — Ephemeral Environments with Helm and Argo CD (Starry IDP)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: &lt;em&gt;We package a complete, isolated preview stack (two frontends, one backend, Redis, two DBs) into a single Helm chart&lt;/em&gt; and reconcile it with Argo CD. Each preview lives in its own namespace, gets secrets from External Secrets, exposes stable ingress with managed TLS, and tears down cleanly via prune/TTL—shrinking feedback loops and avoiding collisions in shared dev/stage.&lt;/p&gt;

&lt;p&gt;Note: I use the terms &lt;em&gt;preview&lt;/em&gt; and &lt;em&gt;ephemeral environment&lt;/em&gt; interchangeably; both refer to short-lived environment instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who this is for
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform teams&lt;/strong&gt; adopting GitOps on GKE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;App teams&lt;/strong&gt; wanting per-PR previews with minimal toil.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites and assumptions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;GKE with the GCE Ingress controller and ManagedCertificate support available.&lt;/li&gt;
&lt;li&gt;Artifact Registry for images; Google Secret Manager for secrets.&lt;/li&gt;
&lt;li&gt;Argo CD installed in (or able to reach) the cluster.&lt;/li&gt;
&lt;li&gt;Optional: External Secrets Operator (GSM integration), Workload Identity.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Why previews (recap)
&lt;/h3&gt;

&lt;p&gt;Shared environments create collisions, noisy logs, version skew, and review friction. Previews isolate each change into its own namespace with predictable URLs, allowing fast, deterministic testing without blocking teammates.&lt;/p&gt;




&lt;h3&gt;
  
  
  Architecture (at a glance)
&lt;/h3&gt;

&lt;p&gt;A preview environment has two frontends that connect to a backend. The backend connects to a Redis instance and two databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlhd9566f6qi4q6apk76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlhd9566f6qi4q6apk76.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  From PR to preview (sequence)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rg2zlgqb10hyd7cw1ar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rg2zlgqb10hyd7cw1ar.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Helm chart below is what Argo CD renders when creating an ephemeral/preview environment.&lt;/p&gt;




&lt;h3&gt;
  
  
  File hierarchy (Helm chart)
&lt;/h3&gt;

&lt;p&gt;You can create a Git repository with a similar file hierarchy. Each file is a template for a Kubernetes resource needed for an ephemeral environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ephemeral-environment-helm/
├─ Chart.yaml
├─ values.yaml
├─ values.preview.yaml            # optional per-env defaults (e.g., TTL/resource caps)
├─ values.schema.json             # optional, validates user-provided values
├─ charts/                        # optional, vendored dependencies
├─ templates/
│  ├─ _helpers.tpl               # names/labels templates
│  ├─ configmap.yaml             # non‑secret settings
│  ├─ externalsecret.yaml        # pulls secrets from GSM (or your vault)
│  ├─ serviceaccount.yaml        # workload identity
│  ├─ rbac-role.yaml             # minimal namespace Role
│  ├─ rbac-rolebinding.yaml
│  ├─ ingress.yaml               # GCE Ingress with hosts per app/env
│  ├─ managedcertificate.yaml    # TLS for Ingress hosts (GKE)
│  ├─ service-backend.yaml
│  ├─ deployment-backend.yaml
│  ├─ hpa-backend.yaml           # optional autoscaling
│  ├─ service-frontend1.yaml
│  ├─ deployment-frontend1.yaml
│  ├─ hpa-frontend1.yaml         # optional
│  ├─ service-frontend2.yaml
│  ├─ deployment-frontend2.yaml
│  ├─ hpa-frontend2.yaml         # optional
│  ├─ redis-statefulset.yaml
│  ├─ redis-service.yaml
│  ├─ db1-statefulset.yaml       # ephemeral DB1 (init/fixtures optional)
│  ├─ db1-service.yaml
│  ├─ db2-statefulset.yaml       # ephemeral DB2
│  ├─ db2-service.yaml
│  ├─ cronjob-cleanup.yaml       # TTL enforcement / garbage collection
│  ├─ networkpolicy.yaml         # optional, if using NetworkPolicies
│  └─ NOTES.txt                  # optional Helm install notes
└─ README.md                     # chart overview and values reference
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
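&lt;p&gt;For reference, the &lt;code&gt;Chart.yaml&lt;/code&gt; at the chart root can stay minimal. The name, description, and versions below are placeholders, not values from a real chart:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v2
name: ephemeral-environment
description: Preview/ephemeral environment stack (two frontends, backend, Redis, two DBs)
type: application
version: 0.1.0        # chart version, bumped on template changes
appVersion: "1.0.0"   # informational; per-preview image tags come from values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;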



&lt;h3&gt;
  
  
  Minimal configuration snippets
&lt;/h3&gt;

&lt;p&gt;Small, copy‑pasteable examples to get a preview running.&lt;/p&gt;

&lt;h4&gt;
  
  
  values.yaml (minimal)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ttlMinutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;starry.env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;preview-123&lt;/span&gt;

&lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gce&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sample-be.preview-123.starry.mycompany.com&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sample-fe-1.preview-123.starry.mycompany.com&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sample-fe-2.preview-123.starry.mycompany.com&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;managedCertificate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-docker.pkg.dev/myproj/sample-be&lt;/span&gt;
    &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pr-123-abcd123&lt;/span&gt;
    &lt;span class="na"&gt;pullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;REDIS_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis://starry-redis-master:6379&lt;/span&gt;
    &lt;span class="na"&gt;DB1_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres://db1:5432/app&lt;/span&gt;
    &lt;span class="na"&gt;DB2_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres://db2:5432/app&lt;/span&gt;

&lt;span class="na"&gt;frontend1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-docker.pkg.dev/myproj/sample-fe-1&lt;/span&gt;
    &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pr-123-abcd123&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;VITE_API_BASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://sample-be.preview-123.starry.mycompany.com&lt;/span&gt;

&lt;span class="na"&gt;frontend2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-docker.pkg.dev/myproj/sample-fe-2&lt;/span&gt;
    &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pr-123-abcd123&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;VITE_API_BASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://sample-be.preview-123.starry.mycompany.com&lt;/span&gt;

&lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;databases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;db1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;db2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;externalSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gsm-store&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterSecretStore&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
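&lt;p&gt;To show how these values flow into the templates, here is a sketch of what &lt;code&gt;deployment-backend.yaml&lt;/code&gt; might look like. The helper names (&lt;code&gt;starry.fullname&lt;/code&gt;, &lt;code&gt;starry.labels&lt;/code&gt;) are illustrative and would be defined in &lt;code&gt;_helpers.tpl&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "starry.fullname" . }}-backend
  labels: {{- include "starry.labels" . | nindent 4 }}
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "starry.fullname" . }}-backend
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ include "starry.fullname" . }}-backend
    spec:
      serviceAccountName: {{ include "starry.fullname" . }}
      containers:
        - name: backend
          image: "{{ .Values.backend.image.repository }}:{{ .Values.backend.image.tag }}"
          imagePullPolicy: {{ .Values.backend.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.backend.service.port }}
          env:
            # render the free-form env map from values.yaml
            {{- range $key, $value := .Values.backend.env }}
            - name: {{ $key }}
              value: {{ $value | quote }}
            {{- end }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;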



&lt;h4&gt;
  
  
  values.schema.json (guardrails excerpt)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://json-schema.org/draft-07/schema#"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"global"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ttlMinutes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minimum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"maximum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;240&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ttlMinutes"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"backend"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"repository"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minLength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"tag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minLength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"repository"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tag"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Argo CD Application (preview)
&lt;/h4&gt;

&lt;p&gt;Starry can make a request to the Kubernetes API to create an &lt;code&gt;Application&lt;/code&gt; resource that will manage a preview environment instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;preview-123&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
  &lt;span class="na"&gt;finalizers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;resources-finalizer.argocd.argoproj.io&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;starry&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/myorg/environment-helm.git&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;valueFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;values.yaml&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;preview-123&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
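&lt;p&gt;Rather than committing a values file per preview, Starry can inject the PR-specific bits when it creates the &lt;code&gt;Application&lt;/code&gt;. This is a sketch of the &lt;code&gt;spec.source.helm&lt;/code&gt; section using Argo CD's &lt;code&gt;parameters&lt;/code&gt; overrides; the keys mirror the values file above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;helm:
  valueFiles:
    - values.yaml
  parameters:                       # override per-preview image tags at sync time
    - name: backend.image.tag
      value: pr-123-abcd123
    - name: frontend1.image.tag
      value: pr-123-abcd123
    - name: frontend2.image.tag
      value: pr-123-abcd123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;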



&lt;h4&gt;
  
  
  ExternalSecret (GSM)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExternalSecret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-secrets&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;preview-123&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
  &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gsm-store&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterSecretStore&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-secrets&lt;/span&gt;
  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;projects/123/secrets/db-password/versions/latest&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;REDIS_PASSWORD&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;projects/123/secrets/redis-password/versions/latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
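&lt;p&gt;When &lt;code&gt;ingress.tls.managedCertificate&lt;/code&gt; is enabled, &lt;code&gt;managedcertificate.yaml&lt;/code&gt; can render a GKE &lt;code&gt;ManagedCertificate&lt;/code&gt; covering the preview hosts, and the Ingress then references it via the &lt;code&gt;networking.gke.io/managed-certificates&lt;/code&gt; annotation. The resource name below is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: preview-123-cert
  namespace: preview-123
spec:
  domains:   # must match the Ingress hosts exactly
    - sample-be.preview-123.starry.mycompany.com
    - sample-fe-1.preview-123.starry.mycompany.com
    - sample-fe-2.preview-123.starry.mycompany.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;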






&lt;h3&gt;
  
  
  Security hardening
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Workload Identity on ServiceAccounts to access GSM; least‑privilege IAM on secrets.&lt;/li&gt;
&lt;li&gt;Namespace‑scoped Roles/RoleBindings; deny cluster‑wide privileges by default.&lt;/li&gt;
&lt;li&gt;No secrets in Git; inject via External Secrets only.&lt;/li&gt;
&lt;li&gt;Optional NetworkPolicies to confine pod traffic and egress.&lt;/li&gt;
&lt;/ul&gt;
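&lt;p&gt;As a sketch of the optional &lt;code&gt;networkpolicy.yaml&lt;/code&gt;, a default-deny ingress policy combined with a same-namespace allow keeps preview traffic confined to its own namespace. The selectors here are assumptions, not the chart's actual labels:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: preview-123
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # allow traffic only from pods in this namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;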

&lt;h3&gt;
  
  
  Cost and quotas
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Defaults: &lt;code&gt;ttlMinutes: 30&lt;/code&gt;, HPA &lt;code&gt;minReplicas: 1&lt;/code&gt;, narrow requests/limits for density.&lt;/li&gt;
&lt;li&gt;Cap concurrent previews per team/namespace with ResourceQuota/LimitRange.&lt;/li&gt;
&lt;li&gt;Cleanup CronJob as a watchdog for stragglers.&lt;/li&gt;
&lt;/ul&gt;
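&lt;p&gt;The watchdog &lt;code&gt;cronjob-cleanup.yaml&lt;/code&gt; can be as simple as a CronJob that deletes the Argo CD &lt;code&gt;Application&lt;/code&gt; once the namespace is older than &lt;code&gt;ttlMinutes&lt;/code&gt;. This sketch assumes an image containing &lt;code&gt;kubectl&lt;/code&gt; and a &lt;code&gt;cleanup&lt;/code&gt; ServiceAccount allowed to read namespaces and delete Applications; adapt it to your setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  name: preview-ttl-cleanup
spec:
  schedule: "*/5 * * * *"           # check every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cleanup
          restartPolicy: Never
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest   # any image with kubectl works
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # delete the Application once the namespace exceeds its TTL
                  created=$(kubectl get ns {{ .Release.Namespace }} -o jsonpath='{.metadata.creationTimestamp}')
                  age=$(( $(date +%s) - $(date -d "$created" +%s) ))   # GNU date assumed
                  if [ "$age" -gt $(( {{ .Values.global.ttlMinutes }} * 60 )) ]; then
                    kubectl delete application {{ .Release.Name }} -n argocd
                  fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;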

&lt;h3&gt;
  
  
  Observability and troubleshooting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Standard labels (&lt;code&gt;app.kubernetes.io/*&lt;/code&gt;, &lt;code&gt;starry.env&lt;/code&gt;) and probes for all pods.&lt;/li&gt;
&lt;li&gt;Common issues and quick checks:

&lt;ul&gt;
&lt;li&gt;Cert Pending: DNS/host mismatch or quota; verify &lt;code&gt;ManagedCertificate&lt;/code&gt; status.&lt;/li&gt;
&lt;li&gt;404 after sync: NEG endpoints warming; check Service → Endpoints and pod readiness.&lt;/li&gt;
&lt;li&gt;Image pull back‑off: tag/registry typo; confirm Artifact Registry permissions.&lt;/li&gt;
&lt;li&gt;RBAC denied: verify namespace Role/RoleBinding and ServiceAccount name.&lt;/li&gt;
&lt;li&gt;Quota exceeded: review ResourceQuota and HPA limits.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;By shipping previews as a Helm chart reconciled by Argo CD, we get isolation by default, secure secret handling, fast spin‑up/tear‑down, and a fully auditable GitOps trail. CI stays simple—build and push images on PR branches—while the platform provides predictable URLs, ephemeral data stores, and least‑privilege access.&lt;/p&gt;

&lt;p&gt;In future articles, we’ll automate the PR lifecycle end‑to‑end: auto‑create on PR open, auto‑destroy on merge/close, add quotas and policies for cost control, layer optional E2E tests and data seeding for production‑like previews, and discuss Terraform for infrastructure setup.&lt;/p&gt;

</description>
      <category>idp</category>
      <category>cloud</category>
      <category>devex</category>
      <category>devops</category>
    </item>
    <item>
      <title>PART 2 Starry: An Internal Developer Platform (IDP) for Ephemeral Environments</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Mon, 29 Sep 2025 21:07:24 +0000</pubDate>
      <link>https://dev.to/flopi/part-2-starry-an-internal-developer-platform-idp-for-ephemeral-environments-192b</link>
      <guid>https://dev.to/flopi/part-2-starry-an-internal-developer-platform-idp-for-ephemeral-environments-192b</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/flopi/starry-an-internal-developer-platform-idp-for-ephemeral-environments-part-1-26ac"&gt;last post&lt;/a&gt;, I introduced &lt;em&gt;Starry&lt;/em&gt; as a custom internal developer platform (IDP) that creates ephemeral environments for merge requests.  The platform uses well known, interoperable tools (Kubernetes, Helm, ArgoCD, Terraform) which makes the solution practical and adaptable to a range of needs. In this article, I will dive deeper into the architecture, a simplified CICD flow, and the relationship between &lt;em&gt;Starry&lt;/em&gt; and Argo CD. &lt;/p&gt;




&lt;p&gt;Let's simplify the system further by supposing that a user will manually create environments through the IDP. In a follow-up post I will introduce what is required to fully automate environment creation from merge requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Push To Ephemeral Environment
&lt;/h2&gt;

&lt;p&gt;I am running CICD in GitHub for the repos &lt;code&gt;sample-be&lt;/code&gt;, &lt;code&gt;sample-fe-1&lt;/code&gt;, and &lt;code&gt;sample-fe-2&lt;/code&gt;. When I have a merge request open and push to the feature branch, an image is built and pushed to Artifact Registry. This approach is deliberately simple: the CICD pipeline, triggered by a push to a branch with an open pull request, only needs to build a container image, tag it properly, and push it to the image registry. The developer can then create an environment with any tag pointing to &lt;code&gt;sample-be&lt;/code&gt;, &lt;code&gt;sample-fe-1&lt;/code&gt;, and &lt;code&gt;sample-fe-2&lt;/code&gt; images.&lt;/p&gt;
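&lt;p&gt;As a sketch (the action versions, registry path, and secret name are placeholders, not the actual pipeline), the GitHub Actions workflow for each repo can be as small as this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: build-preview-image
on:
  pull_request:
    types: [opened, synchronize, reopened]
jobs:
  build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: google-github-actions/setup-gcloud@v2
      - name: Build and push
        run: |
          # tag encodes the PR number and short SHA, e.g. pr-123-abcd123
          TAG="pr-${{ github.event.pull_request.number }}-${GITHUB_SHA::7}"
          gcloud auth configure-docker us-docker.pkg.dev --quiet
          docker build -t "us-docker.pkg.dev/myproj/sample-be:${TAG}" .
          docker push "us-docker.pkg.dev/myproj/sample-be:${TAG}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;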

&lt;h3&gt;
  
  
  Ephemeral Environment Management Workflow
&lt;/h3&gt;

&lt;p&gt;A user can go to the &lt;em&gt;Starry&lt;/em&gt; Internal Developer Platform (IDP) and create an ephemeral environment by passing in an image tag for the &lt;code&gt;sample-be&lt;/code&gt; backend. Optionally, a user may also include &lt;code&gt;sample-fe-1&lt;/code&gt;, &lt;code&gt;sample-fe-2&lt;/code&gt;, or both by passing in image tags for those. The backend and frontends become available at ephemeral URLs like &lt;code&gt;sample-env-backend.ephemeral.mycompany.com&lt;/code&gt; and &lt;code&gt;sample-env-fe-1.ephemeral.mycompany.com&lt;/code&gt;. The user can then run tests manually by clicking around and checking endpoints, or the IDP can offer additional features like automated end-to-end tests that span the backend and frontends. Once an environment reaches its time-to-live (&lt;code&gt;ttl&lt;/code&gt;), say 30 minutes after creation, it is destroyed automatically; the user can also trigger deletion earlier through &lt;em&gt;Starry&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1ahn37duomg97uuh212.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1ahn37duomg97uuh212.png" alt="CICD and Environment flow" width="800" height="894"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How is the &lt;em&gt;Starry&lt;/em&gt; application itself managed in Kubernetes? How does &lt;em&gt;Starry&lt;/em&gt; manage an environment? Understanding the Argo CD setup is key, so let's go into that next.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;starry-helm&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;To understand &lt;em&gt;Starry&lt;/em&gt;, it is important to understand how it lives in Kubernetes. To begin with, &lt;em&gt;Starry&lt;/em&gt; is a Python FastAPI app that uses some frontend technologies and can be bundled into an image like any other app. We have a Dockerfile, and the CICD pipeline builds an image and pushes it to the image registry when there is a push to the &lt;code&gt;main&lt;/code&gt; branch.&lt;/p&gt;

&lt;p&gt;Here is a diagram of the Kubernetes resources the &lt;em&gt;Starry&lt;/em&gt; Helm chart templates out:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F102lh95as5jcwryw7qqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F102lh95as5jcwryw7qqa.png" alt=" " width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Starry&lt;/em&gt; Helm chart provisions a simple ingress–service–deployment architecture suited for ephemeral use: a stateless web app runs as a &lt;code&gt;Deployment&lt;/code&gt; scaled by a &lt;code&gt;HorizontalPodAutoscaler&lt;/code&gt; (2–10 replicas) and configured via a &lt;code&gt;ConfigMap&lt;/code&gt;, using a &lt;code&gt;ServiceAccount&lt;/code&gt; annotated for (GCP) Workload Identity. It’s exposed internally by a &lt;code&gt;Service&lt;/code&gt; (ClusterIP with NEG) and externally by a GCE &lt;code&gt;Ingress&lt;/code&gt; on host &lt;code&gt;starry.mycompany.com&lt;/code&gt;, with TLS handled by a &lt;code&gt;ManagedCertificate&lt;/code&gt; so the Google L7 HTTP(S) Load Balancer terminates TLS and forwards HTTP to the &lt;code&gt;Service&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Caching/coordination is provided by a single‑replica Redis &lt;code&gt;StatefulSet&lt;/code&gt; using &lt;code&gt;emptyDir&lt;/code&gt; storage (no persistence) and reached via the &lt;code&gt;starry-redis-master&lt;/code&gt; ClusterIP (and a headless service for stable DNS). A &lt;code&gt;CronJob&lt;/code&gt; runs cleanup tasks and talks to the app through the in‑cluster &lt;code&gt;Service&lt;/code&gt;. RBAC (&lt;code&gt;ClusterRole&lt;/code&gt;, &lt;code&gt;ClusterRoleBinding&lt;/code&gt;, &lt;code&gt;Role&lt;/code&gt;, &lt;code&gt;RoleBinding&lt;/code&gt;) grants the app and Argo CD the required permissions.&lt;/p&gt;

&lt;p&gt;Networking flows &lt;code&gt;Ingress&lt;/code&gt; → &lt;code&gt;Service&lt;/code&gt; (NEG) → pods by label selector, pods reach Redis via service DNS, and outbound internet access uses the cluster’s standard egress/NAT. This favors fast spin‑up/tear‑down with minimal state and operational overhead.&lt;/p&gt;
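&lt;p&gt;To make that concrete, here is a hedged sketch of the knobs such a chart might expose in its &lt;code&gt;values.yaml&lt;/code&gt;. The field names and values are illustrative assumptions, not the actual chart:&lt;/p&gt;

```yaml
# Illustrative values.yaml for the starry chart (all names are assumptions)
image:
  repository: us-docker.pkg.dev/my-project/starry/starry
  tag: "1.4.2"
autoscaling:
  minReplicas: 2
  maxReplicas: 10
ingress:
  host: starry.mycompany.com
  managedCertificate: true        # provision a GCE ManagedCertificate for TLS
redis:
  persistence: false              # emptyDir only; Redis state is disposable
serviceAccount:
  annotations:
    iam.gke.io/gcp-service-account: starry@my-project.iam.gserviceaccount.com
cleanup:
  schedule: "*/5 * * * *"         # CronJob that deletes expired environments
```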

&lt;p&gt;Note that the &lt;code&gt;CronJob&lt;/code&gt; cleanup tasks work by deleting environment resources for any environment that has exceeded the time-to-live setting.&lt;/p&gt;
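&lt;p&gt;The TTL check at the heart of that cleanup reduces to a timestamp comparison. A minimal Python sketch (the function name and the idea of a stored creation timestamp per environment are assumptions, not &lt;em&gt;Starry&lt;/em&gt;'s actual code):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def is_expired(created_at: datetime, ttl_minutes: int) -> bool:
    """Return True once an environment has outlived its time-to-live."""
    return datetime.now(timezone.utc) - created_at >= timedelta(minutes=ttl_minutes)

# An environment created 45 minutes ago with a 30-minute TTL is expired.
created = datetime.now(timezone.utc) - timedelta(minutes=45)
print(is_expired(created, ttl_minutes=30))  # True
```

&lt;p&gt;Each CronJob run would list environments, apply this check, and delete the expired ones.&lt;/p&gt;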

&lt;h3&gt;
  
  
  Argo CD and the &lt;code&gt;starry-helm&lt;/code&gt; Chart
&lt;/h3&gt;

&lt;p&gt;To let Argo CD manage &lt;em&gt;Starry&lt;/em&gt; and its environments safely, the Helm chart is designed to be fully declarative and follow GitOps best practices. It gives just enough permissions for Argo CD and the &lt;em&gt;Starry&lt;/em&gt; app to do their jobs—nothing more.&lt;/p&gt;

&lt;p&gt;The chart includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;RBAC setup: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;Role&lt;/code&gt; and &lt;code&gt;RoleBinding&lt;/code&gt; in the &lt;code&gt;argocd&lt;/code&gt; namespace let the &lt;em&gt;Starry&lt;/em&gt; app create and update Argo CD Application resources—these represent the ephemeral environments.&lt;/li&gt;
&lt;li&gt;A read-only &lt;code&gt;ClusterRoleBinding&lt;/code&gt; lets the app watch core Kubernetes resources so it can monitor environment status.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Argo CD sync hints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sync annotations are added to guide Argo CD on how to apply changes safely:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Replace=true&lt;/code&gt; on the Redis &lt;code&gt;StatefulSet&lt;/code&gt; to handle immutable field changes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PruneLast=true&lt;/code&gt; on the &lt;code&gt;HorizontalPodAutoscaler&lt;/code&gt; so it’s cleaned up last.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ServerSideApply=true&lt;/code&gt; and related annotations on the cleanup &lt;code&gt;CronJob&lt;/code&gt; to preserve managed fields.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Stable naming and labeling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes labels and names follow predictable patterns (like &lt;code&gt;app.kubernetes.io/name&lt;/code&gt;) so Argo CD can track changes cleanly and avoid drift.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
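&lt;p&gt;These sync hints are plain annotations on the rendered manifests. The annotation key and option values below are real Argo CD sync options; their placement here is an illustrative sketch of what the chart might template out:&lt;/p&gt;

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: starry-redis
  annotations:
    argocd.argoproj.io/sync-options: Replace=true   # recreate on immutable-field changes
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: starry
  annotations:
    argocd.argoproj.io/sync-options: PruneLast=true  # clean up after everything else
```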

&lt;h2&gt;
  
  
  &lt;code&gt;argocd-apps&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;In the &lt;code&gt;argocd-apps&lt;/code&gt; repository, we declare the Argo CD setup. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd21yky33iy58btkwo485.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd21yky33iy58btkwo485.png" alt=" " width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How the repo is structured
&lt;/h3&gt;

&lt;p&gt;This repo uses Argo CD’s App of Apps pattern to keep environments simple and consistent. There’s one root &lt;code&gt;Application&lt;/code&gt; per environment (dev and prod). Each root points at a Kustomize overlay directory that lists the child Applications to deploy. This keeps environment differences limited to overlays, while the core app definitions live in a shared base.&lt;/p&gt;

&lt;p&gt;At the project level, an &lt;code&gt;AppProject&lt;/code&gt; named &lt;code&gt;starry&lt;/code&gt; defines what Git repos are allowed, where resources can be deployed, and who can operate them. Think of the &lt;code&gt;AppProject&lt;/code&gt; as the guardrails: it scopes access and keeps all Applications operating within an approved perimeter.&lt;/p&gt;
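&lt;p&gt;A minimal sketch of such an &lt;code&gt;AppProject&lt;/code&gt;; the repo URLs and namespace pattern are placeholders, while the field names are standard Argo CD ones:&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: starry
  namespace: argocd
spec:
  sourceRepos:                      # only these Git repos may be deployed from
    - https://gitlab.com/mycompany/starry-helm.git
    - https://gitlab.com/mycompany/argocd-apps.git
  destinations:                     # and only into these cluster/namespace targets
    - server: https://kubernetes.default.svc
      namespace: "starry-*"         # preview namespaces share a predictable prefix
```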

&lt;h3&gt;
  
  
  What gets deployed first (and why)
&lt;/h3&gt;

&lt;p&gt;Sync waves ensure the right order. Operators that everything else depends on come first: External Secrets and cert-manager are deployed early so secrets and certificates exist before workloads need them. External Secrets pulls a GitLab token from Google Secret Manager and materializes an Argo CD “repository” Secret; this gives Argo CD access to your Git repos without hardcoding credentials.&lt;/p&gt;
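&lt;p&gt;As a sketch, the ExternalSecret behind that repository credential might look like the following. The store name, secret names, and repo URL are assumptions; the &lt;code&gt;argocd.argoproj.io/secret-type: repository&lt;/code&gt; label is what makes Argo CD treat the resulting Secret as a repo credential:&lt;/p&gt;

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: argocd-repo-creds
  namespace: argocd
spec:
  secretStoreRef:
    name: gcp-secret-manager        # a ClusterSecretStore defined elsewhere
    kind: ClusterSecretStore
  target:
    template:
      metadata:
        labels:
          argocd.argoproj.io/secret-type: repository
      data:
        type: git
        url: https://gitlab.com/mycompany/argocd-apps.git
        username: git
        password: "{{ .token }}"
  data:
    - secretKey: token
      remoteRef:
        key: gitlab-argocd-token    # secret name in Google Secret Manager
```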

&lt;h3&gt;
  
  
  Workload and automation
&lt;/h3&gt;

&lt;p&gt;The main workload, &lt;code&gt;starry&lt;/code&gt;, is deployed via its Helm chart. Image updates are automated by Argo CD Image Updater, which watches your container registry and, when a new allowed tag appears, writes the tag back to Git. That Git change drives Argo CD to reconcile the updated chart, keeping deployments declarative and auditable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling environment drift and platform quirks
&lt;/h3&gt;

&lt;p&gt;Some Kubernetes fields are mutated by the platform (e.g., GKE Autopilot) or are immutable in StatefulSets. &lt;code&gt;ignoreDifferences&lt;/code&gt; rules are applied where needed so Argo CD focuses on meaningful drift and doesn’t fight expected, safe mutations. Overall, the combination of App of Apps, overlays, operators-first ordering, External Secrets, and Image Updater results in a clean, environment-aware, and fully GitOps-driven deployment flow.&lt;/p&gt;
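&lt;p&gt;For illustration, an &lt;code&gt;ignoreDifferences&lt;/code&gt; stanza on an Application might look like this; the exact paths depend on what your platform mutates, so treat these as assumptions:&lt;/p&gt;

```yaml
spec:
  ignoreDifferences:
    - group: apps
      kind: StatefulSet
      jsonPointers:
        - /spec/volumeClaimTemplates            # immutable after creation
    - group: apps
      kind: Deployment
      jqPathExpressions:
        - .spec.template.spec.containers[].resources  # Autopilot may adjust requests
```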

&lt;h2&gt;
  
  
  Starry-Managed Environments via Argo CD
&lt;/h2&gt;

&lt;p&gt;When a new ephemeral environment is requested (for example, from a pull/merge request), the &lt;em&gt;starry&lt;/em&gt; service programmatically creates an Argo CD &lt;code&gt;Application&lt;/code&gt; custom resource by talking directly to the Kubernetes API. Running in‑cluster with a dedicated &lt;code&gt;ServiceAccount&lt;/code&gt;, it authenticates using the standard Kubernetes client flow and has RBAC to manage &lt;code&gt;applications.argoproj.io&lt;/code&gt; resources in the &lt;code&gt;argocd&lt;/code&gt; namespace. The service calculates a unique preview name (e.g., &lt;code&gt;starry-pr-123&lt;/code&gt;), target namespace, and value overrides (hostnames, image tag, replica counts), then submits a new &lt;code&gt;Application&lt;/code&gt; object pointing to the environment Helm chart and Git revision for that preview. The &lt;code&gt;spec.destination.namespace&lt;/code&gt; is set to the preview namespace, and the Helm/Kustomize parameters embed any per‑environment differences.&lt;/p&gt;
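&lt;p&gt;As a sketch, the &lt;code&gt;Application&lt;/code&gt; manifest the service submits might be built like this. The chart repo URL, value keys, and labels are illustrative assumptions; the finalizer and &lt;code&gt;syncPolicy&lt;/code&gt; fields are standard Argo CD ones:&lt;/p&gt;

```python
def build_preview_application(pr_number: int, image_tag: str) -> dict:
    """Build an Argo CD Application manifest for one preview environment.

    Hypothetical sketch: repo URL, Helm value names, and labels are made up;
    only the Application schema itself mirrors Argo CD's API.
    """
    name = f"starry-pr-{pr_number}"
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {
            "name": name,
            "namespace": "argocd",
            # With this finalizer, deleting the Application cascades to
            # everything Argo CD deployed for the preview.
            "finalizers": ["resources-finalizer.argocd.argoproj.io"],
            "labels": {"starry/preview": str(pr_number)},
        },
        "spec": {
            "project": "starry",
            "source": {
                "repoURL": "https://gitlab.com/mycompany/env-chart.git",
                "targetRevision": "main",
                "path": "chart",
                "helm": {
                    "parameters": [
                        {"name": "backend.image.tag", "value": image_tag},
                        {"name": "ingress.host",
                         "value": f"{name}.ephemeral.mycompany.com"},
                    ]
                },
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": name,          # one namespace per preview
            },
            "syncPolicy": {
                "automated": {"prune": True, "selfHeal": True},
                "syncOptions": ["CreateNamespace=true"],
            },
        },
    }

app = build_preview_application(123, "sha-abc123")
print(app["metadata"]["name"])  # starry-pr-123
```

&lt;p&gt;In-cluster, a dict like this could be submitted with the official Kubernetes Python client via &lt;code&gt;CustomObjectsApi.create_namespaced_custom_object(group="argoproj.io", version="v1alpha1", namespace="argocd", plural="applications", body=app)&lt;/code&gt;.&lt;/p&gt;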

&lt;p&gt;Once the &lt;code&gt;Application&lt;/code&gt; object is created, the Argo CD controller does the heavy lifting. It watches the new &lt;code&gt;Application&lt;/code&gt;, pulls the referenced Git content, renders the manifests (via Helm/Kustomize), and applies them to the preview namespace. Sync waves ensure platform prerequisites (like secrets or certs, if included) are present before the workload rolls out. If needed, the starry service can pre‑create the target namespace or supporting resources, but generally it relies on Argo CD with &lt;code&gt;CreateNamespace=true&lt;/code&gt; to keep the flow simple and declarative.&lt;/p&gt;

&lt;p&gt;The lifecycle is symmetrical. To tear down a preview, the starry service deletes the corresponding &lt;code&gt;Application&lt;/code&gt; object through the Kubernetes API. Because Applications are created with Argo CD finalizers and pruning enabled, Argo CD prunes everything it deployed for that preview and the environment disappears cleanly. To keep operations safe and repeatable, the service uses idempotent “apply”-style calls (or server‑side apply), sets labels/annotations that encode the preview context, and ensures the &lt;code&gt;ServiceAccount&lt;/code&gt; only holds the minimal RBAC needed to create, update, and delete Application resources and the preview namespace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Starry&lt;/em&gt; keeps ephemeral environments simple, fast, and consistent by leaning on proven building blocks: Kubernetes for runtime, Helm for packaging, and Argo CD for declarative GitOps. CI’s job is minimal—build and push images when a branch with an open PR changes. Users then create environments in the IDP by choosing which images to run (backend and optional frontends), and &lt;em&gt;Starry&lt;/em&gt; provisions a short‑lived, isolated stack at predictable URLs. Because the chart is stateless by default and scoped to a unique namespace, environments spin up quickly and tear down cleanly when TTL is reached or the user deletes them. We will go more into the ephemeral environment Helm chart in another article.&lt;/p&gt;

&lt;p&gt;Under the hood, Argo CD provides the control loop while the chart and RBAC give just enough permission for &lt;em&gt;Starry&lt;/em&gt; to create/update Application CRs safely. Sync waves ensure platform dependencies (External Secrets, cert‑manager) land first, and optional image automation (Argo CD Image Updater) keeps updates auditable by writing tags back to Git. Drift rules suppress harmless platform mutations so reconciles stay focused on meaningful changes.&lt;/p&gt;

&lt;p&gt;This approach is practical and adaptable: each piece is interchangeable, the flow is fully declarative, and environments are reproducible across dev and prod. Today, creation is explicit and user‑driven for clarity; in a follow‑up, we’ll layer on automation to go from “merge request opened” to “environment ready” without manual steps. We will also go over what an ephemeral environment Helm chart could look like.&lt;/p&gt;

</description>
      <category>idp</category>
      <category>devex</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Starry: An Internal Developer Platform (IDP) for Ephemeral Environments Part 1</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Mon, 15 Sep 2025 20:22:07 +0000</pubDate>
      <link>https://dev.to/flopi/starry-an-internal-developer-platform-idp-for-ephemeral-environments-part-1-26ac</link>
      <guid>https://dev.to/flopi/starry-an-internal-developer-platform-idp-for-ephemeral-environments-part-1-26ac</guid>
      <description>&lt;p&gt;Suppose that your company has a large backend application that supplies information to several frontend applications. Right now, when a developer makes changes to one of the applications, they must merge their changes to a &lt;code&gt;develop&lt;/code&gt; branch to deploy the changes to the development environment shared by all developers. This setup is fraught with issues. A developer could deploy a change that breaks the development environment thereby getting in the way of other developers' testing until the change is fixed or removed from the branch. Developers must test their changes along with other changes at the same time so it could be hard to isolate where errors are coming from. How many issues can you think of when testing a set of apps in limited environments like only development, staging, and prod?&lt;/p&gt;

&lt;p&gt;By creating a short-lived ephemeral environment based on a developer's feature branch, changes can be isolated for better testing. That's where an internal developer platform (IDP) comes in. What is an IDP?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An Internal Developer Platform (IDP) is built by a platform team to build golden paths and enable developer self-service. An IDP consists of many different techs and tools, glued together in a way that lowers cognitive load on developers without abstracting away context and underlying technologies. Following best practices, platform teams treat their platform as a product and build it based on user research, maintaining and continuously improving it. &lt;a href="https://internaldeveloperplatform.org/what-is-an-internal-developer-platform/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An IDP makes it easy for a developer to perform some task that combines actions that interact with various systems. In our case, I will walk you through an IDP design that enables easy creation of ephemeral environments. &lt;/p&gt;

&lt;p&gt;So, let's go back to the setup I presented in the beginning. Your company has a large backend application that supplies information to several frontend applications. Users interact with the frontends through their browsers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyl6tkqar2o6bphl78zm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyl6tkqar2o6bphl78zm.png" alt=" " width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could design a system, we'll call it &lt;em&gt;starry&lt;/em&gt;, that creates ephemeral environments such that, for an environment called &lt;code&gt;test-00&lt;/code&gt;, a developer can access the apps at &lt;code&gt;test-00.{app-name}.starry.mycompany.com&lt;/code&gt;. The &lt;code&gt;test-00&lt;/code&gt; environment could be testing code changes made to the &lt;code&gt;sample-be&lt;/code&gt; repo under the &lt;code&gt;feat/myfeature-00&lt;/code&gt; branch. To see how those changes affect the frontends, instances of the frontends are also spun up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjll9u55ucq8bb1oz7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjll9u55ucq8bb1oz7w.png" alt=" " width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tooling and Artifacts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kubernetes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt; makes spinning up applications and managing resources simpler, and it has a rich ecosystem of tools and resources, so let's use it to manage our system. Suppose we are running on GCP: our GKE cluster will interact with Artifact Registry for backend and frontend container images and with Secret Manager for secrets used by &lt;em&gt;starry&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bmbuir88edxklz1ie31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bmbuir88edxklz1ie31.png" alt=" " width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This has been simplified for the sake of this article. For example, VPCs and networking are not included here.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Helm
&lt;/h3&gt;

&lt;p&gt;To encapsulate application definitions, we will use &lt;strong&gt;Helm charts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1) For environments, we can create a custom Helm chart that manages backend and frontend deployments, ingress, databases, and cache. The Helm chart will also manage the services needed for the frontend apps to connect to the backend as well as secrets and service accounts. The chart creates a unique namespace for each environment. The environment Helm chart defines a single instance of an environment.&lt;/p&gt;

&lt;p&gt;Any environment, whether that be development, staging, production, or an ephemeral one, has two frontend apps, one backend app, two database instances and one cache instance. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceqo8bads1moizhbp14d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceqo8bads1moizhbp14d.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) We will also need a Helm chart for our &lt;em&gt;starry&lt;/em&gt; application which will provide a frontend for our users to create environments, view details, and delete environments after testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  ArgoCD
&lt;/h3&gt;

&lt;p&gt;The &lt;em&gt;starry&lt;/em&gt; application as well as the environment applications have Kubernetes resources that have to be managed. When you merge changes to &lt;em&gt;starry&lt;/em&gt;'s &lt;code&gt;main&lt;/code&gt; branch, you want the changes reflected in production. When you delete an ephemeral environment, all associated resources, like config maps, need to be deleted, not just deployments, and we want to easily track the status of individual resources. Even for platform developers, we want an interface to our applications that is easy to understand, complete with details about individual resource status. We may also want to expose some of these details to developers through a custom, well-thought-out UI for their own troubleshooting. A system like &lt;strong&gt;ArgoCD&lt;/strong&gt; makes all of this easier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terraform
&lt;/h3&gt;

&lt;p&gt;Finally, all of this infrastructure needs to be bootstrapped, and we can use the &lt;em&gt;Infrastructure-as-Code (IaC)&lt;/em&gt; tool &lt;strong&gt;Terraform&lt;/strong&gt; for that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstxqxuqz6xpk6wb64oui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstxqxuqz6xpk6wb64oui.png" alt=" " width="800" height="835"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Artifacts
&lt;/h3&gt;

&lt;p&gt;We will end up with backend, frontend, Helm chart, ArgoCD, and Terraform &lt;strong&gt;repositories&lt;/strong&gt; in GitHub which will define our system:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhs2g7p0zbayn3jyfd6pu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhs2g7p0zbayn3jyfd6pu.png" alt=" " width="800" height="952"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There will be four container image repositories: &lt;code&gt;sample-be&lt;/code&gt;, &lt;code&gt;sample-fe-1&lt;/code&gt;, &lt;code&gt;sample-fe-2&lt;/code&gt;, and &lt;code&gt;starry&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There we have our tools and artifacts.&lt;/p&gt;




&lt;p&gt;After choosing tools based on our needs, we are ready to think through more specific patterns that we will use, more granular interactions between services, as well as the repository structures that will shape our implementation. &lt;em&gt;Stay tuned.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l4ylqbbefmk4tb773x2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l4ylqbbefmk4tb773x2.png" alt=" " width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devex</category>
      <category>idp</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Recent Platform Engineer Interview Questions</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Sun, 31 Aug 2025 02:50:03 +0000</pubDate>
      <link>https://dev.to/flopi/recent-platform-engineer-interview-questions-g3o</link>
      <guid>https://dev.to/flopi/recent-platform-engineer-interview-questions-g3o</guid>
      <description>&lt;p&gt;I have been interviewing for roles for several months now (ask me why if you want to!) and thought I would write down some of the questions that came up.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A question I got from a recruiter recently is: what is the difference between a Dockerfile and a container? I wonder what they were filtering for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What do I know about networking? VPC, BGP, firewall, subnets, IPs, cross-network communication&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another company gave me a take-home assignment with instructions to showcase my proficiency in Terraform and Google Cloud Platform (GCP) by developing a Terraform module for provisioning a GCP environment, including a Virtual Private Cloud (VPC) and a Google Kubernetes Engine (GKE) cluster, along with all the necessary prerequisites detailed in the document. I later used the setup to test some cPanel, which broke the build, but everything worked for the follow-up interview in which we went over my submission. Check &lt;a href="https://github.com/florenciacomuzzi/k8s-environment-terraform" rel="noopener noreferrer"&gt;it&lt;/a&gt; out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In yet another interview, I was asked to write code so that engineers can queue up functions. How would I ensure that functions don't run past some timeout? Don't use &lt;code&gt;inspect&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write a Dockerfile for a known technology.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Review this Terraform module. What can you tell me about it? What is it doing? What about naming conventions? Data sources? Backend?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How does cherry-picking work in Git? Can you cherry-pick a branch?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Show me how you would Google the answer to this question...&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explain Kubernetes fundamental resources. What about Helm?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
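&lt;p&gt;For the function-queue question, here is one answer sketch using only the Python standard library. It runs queued functions in order and bounds the wait for each result by a timeout; note that a timed-out thread keeps running in the background, so truly killing runaway work would require a process pool instead:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

def run_queue(funcs, timeout_s: float):
    """Run queued zero-argument functions in order, bounding each by a timeout."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:  # one worker = FIFO queue
        for fn in funcs:
            future = pool.submit(fn)
            try:
                results.append(("ok", future.result(timeout=timeout_s)))
            except FutureTimeout:
                # The worker thread itself cannot be killed; we just stop waiting.
                results.append(("timeout", None))
    return results

jobs = [lambda: 1 + 1, lambda: time.sleep(2) or "slow"]
print(run_queue(jobs, timeout_s=0.5))  # [('ok', 2), ('timeout', None)]
```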

</description>
      <category>career</category>
      <category>devex</category>
    </item>
    <item>
      <title>Intro to Data Ingestion and Data Lakes</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Fri, 09 Aug 2019 20:30:41 +0000</pubDate>
      <link>https://dev.to/flopi/intro-to-data-ingestion-and-data-lakes-3fdc</link>
      <guid>https://dev.to/flopi/intro-to-data-ingestion-and-data-lakes-3fdc</guid>
      <description>&lt;p&gt;I landed in the data engineering space by a bit of luck and a bit of blind faith. As graduation was approaching, I had landed a job through a new grad program. When asked about areas I'd be interested in working with, I mentioned "Big Data" because a friend had told me her mentor advised her to pursue it. It's the hot thing right now, she said. Now, years later, I'm glad I chose this route because data engineering is a superset of software engineering with a focus on performance that can be super fun. &lt;/p&gt;

&lt;p&gt;In this first post of this series, I'll go through what a data lake is and how it relates to data ingestion. I start with data ingestion because it gives a look into the work that is commonly done on data engineering teams and trends in the field.&lt;/p&gt;

&lt;p&gt;In the rest of the series, I look to give you a view into the concerns that mire my work life and go through how to plan, design, and build data pipelines. I hope you'll get a solid mix of business and engineering perspective. I haven't seen much writing about data engineering that is accessible to many folks so I also hope to provide some of that here and, of course, I am open to feedback!&lt;/p&gt;

&lt;p&gt;This series is motivated by and dedicated to my greatest mentors. I am grateful to them for walking alongside this long road with me.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a data ingestion pipeline?
&lt;/h1&gt;

&lt;p&gt;With any data pipeline, the objective is getting data from A to B and sometimes even C, D, E, etc. &lt;em&gt;Ingestion&lt;/em&gt; is a term specific to &lt;strong&gt;data lakes&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a data lake?
&lt;/h2&gt;

&lt;p&gt;Well, a data &lt;em&gt;warehouse&lt;/em&gt; is usually a small set of curated datasets for specific purposes. Data in data warehouses typically lives for a short time, e.g. 30 days, and is used very often. A data warehouse may fit into a traditional database.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To have good query performance, the working set (the set of records your query touches) should typically be able to fit into memory.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In comparison, what we think of as "Big Data" (think any amount of data that doesn't all fit into memory at once) would need to be processed in some distributed way on several machines. We have indeed developed frameworks for this kind of distributed processing, like &lt;code&gt;MapReduce&lt;/code&gt; and &lt;code&gt;Spark&lt;/code&gt;. Because "Big Data" by definition does not fit into the memory of a single machine, it should not reside in a traditional database. It should live in another location... So goes the idea of a data lake. &lt;/p&gt;

&lt;p&gt;Data Lakes store massive amounts of data, typically historical data going back really far in time. Whereas we would call data in a data warehouse &lt;em&gt;hot&lt;/em&gt; because it is the most relevant and therefore the most used/queried, data in a data lake may not be so hot. This &lt;em&gt;colder&lt;/em&gt; data can perhaps be stored on storage volumes that have slower retrieval times. Slower hardware (usually) = cheaper hardware...&lt;/p&gt;

&lt;p&gt;For now, &lt;strong&gt;think of a data lake as a place where you store large amounts of data.&lt;/strong&gt; Yes, there is more to the concept of a data lake, and you can read about it in The Enterprise Big Data Lake by Alex Gorelik.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ok, so what does ingestion refer to?
&lt;/h2&gt;

&lt;p&gt;Most data lakes are organized into several zones. For now, we will think of 3 zones: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;landing/dump zone&lt;/li&gt;
&lt;li&gt;raw zone&lt;/li&gt;
&lt;li&gt;transformed zone. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A common pattern is to dump data in the dump zone and have &lt;em&gt;something&lt;/em&gt; &lt;strong&gt;ingest&lt;/strong&gt; the data into the raw zone. Data in the raw zone should remain as close to its original form as possible. Datasets that you have changed should go in the transformed zone.&lt;/p&gt;
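&lt;p&gt;These zones often appear directly as prefixes in the object store. A purely illustrative layout (the bucket and dataset names are made up):&lt;/p&gt;

```
s3://company-lake/landing/sales/2019-08-09/orders.csv      # as dumped by the source
s3://company-lake/raw/sales/dt=2019-08-09/orders.parquet   # ingested, near-original form
s3://company-lake/transformed/sales_daily/dt=2019-08-09/   # curated, built from raw
```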

&lt;p&gt;So,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ingestion refers to the process of bringing in data from some location into the raw zone of the data lake &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;where it can be queried, using technology like Presto and Hive, for understanding so that other datasets can be built off of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ok, now show me how to do the thing...
&lt;/h2&gt;

&lt;p&gt;In the next post, I'll go through how to think through the design of a data ingestion pipeline! &lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datalake</category>
      <category>dataingestion</category>
    </item>
    <item>
      <title>Live notetaking as I learn about distributed computing</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Sun, 21 Apr 2019 20:28:46 +0000</pubDate>
      <link>https://dev.to/flopi/live-notetaking-as-i-learn-about-distributed-computing-3j4b</link>
      <guid>https://dev.to/flopi/live-notetaking-as-i-learn-about-distributed-computing-3j4b</guid>
      <description>&lt;p&gt;In my previous &lt;a href="https://dev.to/floinnyc_/live-notetaking-as-i-learn-spark-odj"&gt;post&lt;/a&gt;, Live notetaking as I learn Spark, I learned some of the basics of Spark:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Spark is a distributed programming model in which the user specifies transformations. Multiple transformations build up a directed acyclic graph of instructions. An action begins the process of executing that graph of instructions, as a single job, by breaking it down into stages and tasks to execute across the cluster. The logical structures that we manipulate with transformations and actions are DataFrames and Datasets. To create a new DataFrame or Dataset, you call a transformation. To start computation or convert to native language types, you call an action." Spark: The Definitive Guide&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I liked the live notetaking format because it pushed me to write down what I was learning and kept me accountable. I have so many drafted posts that I haven't published because I haven't finished my thoughts. Live notetaking takes the pressure off of constantly thinking about a possible pitch as I'm learning. I am using this live learning pattern again here.&lt;/p&gt;

&lt;p&gt;I am now up to chapter 4 of Spark: The Definitive Guide. In the time between my last live notetaking post and this one, I have done some reading on my own about the anatomy of database systems and &lt;code&gt;git&lt;/code&gt; as a distributed system. Seeing Spark described as a &lt;em&gt;distributed programming model&lt;/em&gt; got my attention since I haven't seen Spark described in this way before, so I definitely want to be clear on what the term means. Julia Evans describes the importance of identifying what you don't understand in her &lt;a href="https://jvns.ca/blog/2018/09/01/learning-skills-you-can-practice/" rel="noopener noreferrer"&gt;post&lt;/a&gt; on how to teach yourself hard things.&lt;/p&gt;

&lt;p&gt;In this post, I will put together an outline of concepts related to distributed computing programming models. I have an additional goal as well: recognize patterns I use to learn a new concept. Is this the most effective way to learn? Can I optimize these patterns? What motivates me to use these patterns?&lt;/p&gt;




&lt;p&gt;The first thing I did was enter "distributed programming model" into Google. I chose the third result because it looked like it could be the syllabus to a class on this topic. I like syllabi. They usually contain readings and homework assignments I can complete for a topic. I also typically compare the textbooks that appear in different syllabi. If a textbook is used in several courses, I look into it to see if it is &lt;em&gt;the&lt;/em&gt; book on a topic. &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://heather.miller.am/teaching/cs7680/" rel="noopener noreferrer"&gt;syllabus&lt;/a&gt; is for a class titled "SPECIAL TOPICS IN COMPUTER SYSTEMS:&lt;br&gt;
Programming Models for Distributed Computing" at Northeastern University. I'd definitely take this course if I could: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Topics we will cover include, promises, remote procedure calls, message-passing, conflict-free replicated datatypes, large-scale batch computation à la MapReduce/Hadoop and Spark, streaming computation, and where eventual consistency meets language design, amongst others."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;I got some feedback from a colleague to focus on the Spark data structures next&lt;/strong&gt; and this course covers conflict-free replicated datatypes. I don't know what that means yet, but I want to know. Over the semester, the class authors a literature review on the landscape of programming models for distributed computation. So cool.&lt;/p&gt;

&lt;h2&gt;
  
  
  RPC
&lt;/h2&gt;

&lt;p&gt;I have already read a bit on RPC in &lt;a href="https://www.amazon.com/Distributed-Systems-Maarten-van-Steen/dp/1543057381" rel="noopener noreferrer"&gt;Distributed Systems&lt;/a&gt; by Tanenbaum. Tanenbaum describes the Remote Procedure Call proposal made by Birrell and Nelson in Implementing Remote Procedure Calls (1984):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When a process on machine A calls a procedure on machine B, the calling process on A is suspended, and execution of the called procedure takes place on B. Information can be transported from the caller to the callee in the parameters and can come back in the procedure result. No message passing at all is visible to the programmer." p. 173&lt;/p&gt;
&lt;/blockquote&gt;
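&lt;p&gt;Tanenbaum's description can be sketched with Python's built-in &lt;code&gt;xmlrpc&lt;/code&gt; module. This is my own minimal example, not from the book, and the &lt;code&gt;add&lt;/code&gt; procedure is illustrative: the client invokes it like a local function, while the message passing stays hidden behind the proxy object.&lt;/p&gt;

```python
# A minimal sketch of the RPC idea, using Python's built-in xmlrpc module.
# The procedure name `add` is illustrative, not from the paper.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(a, b):
    # The "remote" procedure: it executes in the server process,
    # but the caller writes an ordinary function call.
    return a + b

# Start a server on localhost; port 0 lets the OS pick a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(add, "add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# No message passing is visible to the programmer: the proxy marshals the
# arguments, ships them over HTTP, and unmarshals the result.
client = ServerProxy(f"http://127.0.0.1:{port}")
result = client.add(2, 3)
server.shutdown()
```

&lt;p&gt;Running this, &lt;code&gt;result&lt;/code&gt; is &lt;code&gt;5&lt;/code&gt;, computed in the "remote" process.&lt;/p&gt;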

&lt;ul&gt;
&lt;li&gt;Implementing Remote Procedure Calls (1984)&lt;/li&gt;
&lt;li&gt;A Distributed Object Model for the Java System (1996)&lt;/li&gt;
&lt;li&gt;A Note on Distributed Computing (1994)&lt;/li&gt;
&lt;li&gt;A Critique of the Remote Procedure Call Paradigm (1988)&lt;/li&gt;
&lt;li&gt;Convenience Over Correctness (2008)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Futures, promises
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Multilisp: A language for concurrent symbolic computation (1985)&lt;/li&gt;
&lt;li&gt;Promises: linguistic support for efficient asynchronous procedure calls in distributed systems (1988)&lt;/li&gt;
&lt;li&gt;Oz dataflow concurrency. Selected sections from the textbook Concepts, Techniques, and Models of Computer Programming. 
Sections to read: 1.11: Dataflow, 2.2: The single-assignment store, 4.93-4.95: Dataflow variables as communication channels ...etc.&lt;/li&gt;
&lt;li&gt;The F# asynchronous programming model (2011)&lt;/li&gt;
&lt;li&gt;Your Server as a Function (2013)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Message passing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Concurrent Object-Oriented Programming (1990)&lt;/li&gt;
&lt;li&gt;Concurrency among strangers (2005)&lt;/li&gt;
&lt;li&gt;Scala actors: Unifying thread-based and event-based programming (2009)&lt;/li&gt;
&lt;li&gt;Erlang (2010)&lt;/li&gt;
&lt;li&gt;Orleans: cloud computing for everyone (2011)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Distributed Programming Languages
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Distributed Programming in Argus (1988)&lt;/li&gt;
&lt;li&gt;Distribution and Abstract Types in Emerald (1987)&lt;/li&gt;
&lt;li&gt;The Linda alternative to message-passing systems (1994)&lt;/li&gt;
&lt;li&gt;Orca: A Language For Parallel Programming of Distributed Systems (1992)&lt;/li&gt;
&lt;li&gt;Ambient-Oriented Programming in AmbientTalk (2006)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Consistency, CRDTs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services (2002)&lt;/li&gt;
&lt;li&gt;Conflict-free Replicated Data Types (2011)&lt;/li&gt;
&lt;li&gt;A comprehensive study of Convergent and Commutative Replicated Data Types (2011)&lt;/li&gt;
&lt;li&gt;CAP Twelve Years Later: How the "Rules" Have Changed (2012)&lt;/li&gt;
&lt;li&gt;Cloud Types for Eventual Consistency (2012)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Languages &amp;amp; Consistency
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Consistency Analysis in Bloom: a CALM and Collected Approach (2011)&lt;/li&gt;
&lt;li&gt;Logic and Lattices for Distributed Programming (2012)&lt;/li&gt;
&lt;li&gt;Consistency Without Borders (2013)&lt;/li&gt;
&lt;li&gt;Lasp: A language for distributed, coordination-free programming (2015)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Languages Extended for Distribution
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Towards Haskell in the Cloud (2011)&lt;/li&gt;
&lt;li&gt;Alice Through the Looking Glass (2004)&lt;/li&gt;
&lt;li&gt;Concurrency Oriented Programming in Termite Scheme (2006)&lt;/li&gt;
&lt;li&gt;Type-safe distributed programming with ML5 (2007)&lt;/li&gt;
&lt;li&gt;MBrace

&lt;ul&gt;
&lt;li&gt;MBrace: cloud computing with monads (2013)&lt;/li&gt;
&lt;li&gt;MBrace Programming Model (Tutorial)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Large-scale parallel processing (batch)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;MapReduce: simplified data processing on large clusters (2008)&lt;/li&gt;
&lt;li&gt;DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (2008)&lt;/li&gt;
&lt;li&gt;Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing (2012)&lt;/li&gt;
&lt;li&gt;Spark SQL: Relational Data Processing in Spark (2015)&lt;/li&gt;
&lt;li&gt;FlumeJava: Easy, Efficient Data-Parallel Pipelines (2010)&lt;/li&gt;
&lt;li&gt;GraphX: A Resilient Distributed Graph System on Spark (2013)&lt;/li&gt;
&lt;li&gt;Dremel: Interactive Analysis of Web-Scale Datasets (2010)&lt;/li&gt;
&lt;li&gt;Pig latin: a not-so-foreign language for data processing (2008)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Large-scale parallel processing (streaming)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;TelegraphCQ: continuous dataflow processing (2003)&lt;/li&gt;
&lt;li&gt;Naiad: A Timely Dataflow System (2013)&lt;/li&gt;
&lt;li&gt;Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters (2012)&lt;/li&gt;
&lt;li&gt;The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (2015)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From these &lt;a href="https://www.eecs.harvard.edu/~michaelm/postscripts/ReadPaper.pdf" rel="noopener noreferrer"&gt;tips&lt;/a&gt; on how to read a research paper and some common sense, I know I won't be able to read all these papers quickly, nor am I interested in doing that right now. &lt;/p&gt;




&lt;p&gt;From &lt;a href="https://blog.ably.io/what-is-a-distributed-systems-engineer-f6c1d921acf8" rel="noopener noreferrer"&gt;https://blog.ably.io/what-is-a-distributed-systems-engineer-f6c1d921acf8&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;understanding of a hash ring: Cassandra, Riak, Dynamo, Couchbase Server&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;protocols for keeping track of changes in cluster topology, in response to network partitions, failures, and scaling events: &lt;br&gt;
 "Various protocols exist to ensure that this can happen, with varying levels of consistency and complexity. This needs to be dynamic and real time because nodes come and go in elastic systems, failures need to be detected quickly, and load and state needs to be rebalanced in real time." &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gossip protocol&lt;/li&gt;
&lt;li&gt;Paxos protocol&lt;/li&gt;
&lt;li&gt;Raft consensus algorithm&lt;/li&gt;
&lt;li&gt;Popular consensus backed systems like &lt;code&gt;etcd&lt;/code&gt; and Zookeeper, and gossip backed systems like Serf.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Eventually consistent data types and read/write consistencies&lt;br&gt;
Locks are impractical to implement and impossible to scale. As a result, trade-offs need to be made between the consistency and availability of data. In many cases, for example, availability can be prioritised, and consistency guarantees weakened to eventual consistency, with data structures such as CRDTs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;familiar with CRDT or Operational Transform, the concepts of variable consistencies for queries or writes to data in a distributed data store

&lt;ul&gt;
&lt;li&gt;Operational Transform — implemented by Google originally in their Wave product and now in Google Docs. It has uses in collaboration apps, but OTs are complex and not widely implemented.&lt;/li&gt;
&lt;li&gt;Conflict-free Replicated Data Types, or CRDTs, provide an eventually consistent result so long as the available data types are used. Used by the Riak distributed database and by Presence in Phoenix.&lt;/li&gt;
&lt;li&gt;Consistency levels for both read and writes in distributed databases like Cassandra&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;At each layer, be confident in your understanding and ability to debug problems at a packet or frame level:&lt;br&gt;
WebSockets example&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DNS protocol and UDP for address lookup.&lt;/li&gt;
&lt;li&gt;File descriptors (on *nix) and buffers used for connections, NAT tables, conntrack tables etc.&lt;/li&gt;
&lt;li&gt;IP to route packets between hosts&lt;/li&gt;
&lt;li&gt;TCP to establish a connection&lt;/li&gt;
&lt;li&gt;TLS handshakes, termination and certificate authentication&lt;/li&gt;
&lt;li&gt;HTTP/1.1 or, more recently, HTTP/2, used extensively by gRPC.&lt;/li&gt;
&lt;li&gt;WebSocket upgrades over HTTP.

&lt;ul&gt;
&lt;li&gt;higher level protocols such as HTTP, WebSockets, gRPC and TCP sockets and the full stack of protocols they rely on all the way down to the OS itself&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;also be a solid systems engineer&lt;/strong&gt;: have the fundamentals such as programming languages, general design patterns, version control, infrastructure management, continuous integration and deployment systems already in place.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
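&lt;p&gt;The CRDT idea above can be sketched with one of the simplest examples: a grow-only counter (G-Counter). This is my own illustration, not from the linked post. Each replica increments only its own slot, and merging takes the per-replica maximum, so merges are commutative, associative, and idempotent and concurrent updates converge without coordination.&lt;/p&gt;

```python
# A sketch of a grow-only counter (G-Counter), one of the simplest CRDTs.
# All names here are illustrative.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count observed from that replica

    def increment(self, n=1):
        # A replica only ever increments its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise maximum: safe to apply in any order, any number of times.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

# Two replicas update independently, then exchange state and converge.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

&lt;p&gt;The assertion at the end shows convergence: both replicas agree on 5 regardless of merge order.&lt;/p&gt;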

</description>
      <category>livelearning</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Reflect as You Work: My Python Project Workflow</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Sat, 13 Apr 2019 02:38:56 +0000</pubDate>
      <link>https://dev.to/flopi/reflect-as-you-work-my-python-project-workflow-49he</link>
      <guid>https://dev.to/flopi/reflect-as-you-work-my-python-project-workflow-49he</guid>
      <description>&lt;p&gt;One of the apprenticeship patterns in &lt;a href="http://shop.oreilly.com/product/9780596518387.do" rel="noopener noreferrer"&gt;Apprenticeship Patterns&lt;/a&gt; is &lt;strong&gt;Reflect as You Work&lt;/strong&gt;. This pattern is about introspecting on how you work regularly. Doing this often allows developers to notice how their practices have changed and even how they haven't. This isn't just about observing yourself. As the book says, &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Unobtrusively watch the journeymen and master craftsmen on your team. Reflect on the practices, processes, and techniques they use to see if they can be connected to other parts of your experiences. Even as an apprentice, you can discover novel ideas simply by closely observing more experienced craftsmen as they go about their work." p. 36&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have been thinking about my own practices and those of others around me. The workflow I follow when I create new Python projects particularly stands out because I learned it from sitting with another engineer. I noted what they did and asked questions. Then, I went back to my desk and tried it myself while taking more notes. I followed the resulting workflow so many times that the steps now flow from my fingertips with ease. &lt;/p&gt;

&lt;p&gt;I think there could be ways to optimize even this workflow, but first I am going to note it down here for the potential future reader and for future me to look back on!&lt;/p&gt;

&lt;p&gt;P.S. Many of the extra details I included here I learned from my colleagues. A big thank you to them for sharing what they know with me 💓&lt;/p&gt;



&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pyenv&lt;/code&gt; is installed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  New Python Project Checklist
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Install a specific Python version.&lt;/li&gt;
&lt;li&gt;Create a project directory. Go to the directory.&lt;/li&gt;
&lt;li&gt;Set the Python version for the project.&lt;/li&gt;
&lt;li&gt;Create a virtual environment.&lt;/li&gt;
&lt;li&gt;Activate the virtual environment.&lt;/li&gt;
&lt;li&gt;Install dependencies.&lt;/li&gt;
&lt;li&gt;Save packages.&lt;/li&gt;
&lt;li&gt;Run the code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note: this workflow should work on macOS. &lt;/p&gt;



&lt;h2&gt;
  
  
  PREREQUISITE: Install &lt;code&gt;pyenv&lt;/code&gt;.
&lt;/h2&gt;

&lt;p&gt;Mac OS X comes with Python 2.7 out of the box. If you haven't fiddled with anything, you should be able to open up a Terminal window, type in &lt;code&gt;python --version&lt;/code&gt;, and get some 2.7 variant. You probably don't want to use the version of Python that comes shipped with your OS (Operating System), though. There are many reasons for this, such as the version being out of date. I have even come across an important library that was missing. &lt;/p&gt;

&lt;p&gt;Not only do you want to avoid the version of Python shipped with your machine; in your work you will also need to have several different versions of Python installed at once. For example, perhaps one codebase is using an older version of Python due to some library dependency. Upgrading the Python version for that project could require refactoring that you haven't prioritized. At the same time, you may be using a newer Python version on other projects because you want to take advantage of shiny new features.&lt;/p&gt;

&lt;p&gt;Having several Python versions installed on your machine is a realistic scenario for a Python developer. Managing these versions effectively is important.&lt;/p&gt;

&lt;p&gt;There are instructions on how to install &lt;code&gt;pyenv&lt;/code&gt; &lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When you run a command like &lt;code&gt;python&lt;/code&gt; or &lt;code&gt;pip&lt;/code&gt;, your operating system searches through a list of directories to find an executable file with that name. This list of directories lives in an environment variable called &lt;code&gt;PATH&lt;/code&gt;, with each directory in the list separated by a colon...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;pyenv&lt;/code&gt; works by inserting a directory of &lt;em&gt;shims&lt;/em&gt; at the front of your &lt;code&gt;PATH&lt;/code&gt; so that when you call &lt;code&gt;python&lt;/code&gt; or &lt;code&gt;pip&lt;/code&gt;, these shims are the first thing your OS finds. The commands you enter are then intercepted and sent to &lt;code&gt;pyenv&lt;/code&gt;, which decides which version of Python to use for your command based on some rules. &lt;/p&gt;
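&lt;p&gt;You can sketch that &lt;code&gt;PATH&lt;/code&gt; search yourself. The snippet below is my own simplified model of what the OS does (it is not pyenv code): scan each directory in order and run the first executable with the requested name. A shim directory "wins" simply by being first in the list.&lt;/p&gt;

```python
# A simplified model of PATH lookup: the first matching executable wins.
import os
import stat
import tempfile

def resolve(command, path):
    """Return the first executable named `command` found along `path`."""
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

# Simulate a shim directory sitting in front of a "system" directory,
# each containing an executable called `python`.
shims = tempfile.mkdtemp()
system = tempfile.mkdtemp()
for d in (shims, system):
    exe = os.path.join(d, "python")
    with open(exe, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(exe, os.stat(exe).st_mode | stat.S_IXUSR)

# The shim wins because its directory comes first in the search list.
winner = resolve("python", os.pathsep.join([shims, system]))
assert winner == os.path.join(shims, "python")
```

&lt;p&gt;This is exactly why the installation instructions have you prepend the shim directory to &lt;code&gt;PATH&lt;/code&gt; in your shell startup file.&lt;/p&gt;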

&lt;p&gt;Follow the instructions to install &lt;code&gt;pyenv&lt;/code&gt;. Make sure you follow the rest of the post-installation steps under Basic GitHub Checkout even if you use Homebrew to install. When I was installing, I found that I had a &lt;code&gt;.bashrc&lt;/code&gt; AND a &lt;code&gt;.bash_profile&lt;/code&gt;. &lt;a href="http://www.joshstaiger.org/archives/2005/07/bash_profile_vs.html" rel="noopener noreferrer"&gt;Here&lt;/a&gt; is an article on the difference between them and when either file is used. If, after following the instructions, you type in &lt;code&gt;pyenv&lt;/code&gt; and do not get something like the following, go back and make sure you set up the other bash file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flo at MacBook-Pro &lt;span class="k"&gt;in&lt;/span&gt; ~ &lt;span class="nv"&gt;$ &lt;/span&gt;pyenv
pyenv 1.2.8
Usage: pyenv &amp;lt;&lt;span class="nb"&gt;command&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&amp;lt;args&amp;gt;]

Some useful pyenv commands are:
   commands    List all available pyenv commands
   &lt;span class="nb"&gt;local       &lt;/span&gt;Set or show the &lt;span class="nb"&gt;local &lt;/span&gt;application-specific Python version
   global      Set or show the global Python version
   shell       Set or show the shell-specific Python version
   &lt;span class="nb"&gt;install     &lt;/span&gt;Install a Python version using python-build
   uninstall   Uninstall a specific Python version
   rehash      Rehash pyenv shims &lt;span class="o"&gt;(&lt;/span&gt;run this after installing executables&lt;span class="o"&gt;)&lt;/span&gt;
   version     Show the current Python version and its origin
   versions    List all Python versions available to pyenv
   which       Display the full path to an executable
   whence      List all Python versions that contain the given executable

See &lt;span class="sb"&gt;`&lt;/span&gt;pyenv &lt;span class="nb"&gt;help&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;command&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;' for information on a specific command.
For full documentation, see: https://github.com/pyenv/pyenv#readme
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Install a specific Python version.
&lt;/h2&gt;

&lt;p&gt;Suppose I'm creating a script that will open the latest xkcd comic in a web browser. I'm going to run it with Python 3.7.0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flo at MacBook-Pro in ~ $ pyenv install 3.7.0
python-build: use openssl from homebrew
python-build: use readline from homebrew
Downloading Python-3.7.0.tar.xz...
-&amp;gt; https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tar.xz
Installing Python-3.7.0...
python-build: use readline from homebrew
Installed Python-3.7.0 to /Users/flo/.pyenv/versions/3.7.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Create a project directory. Go to the directory.
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flo at MacBook-Pro in ~ $ mkdir Documents/comic-creator
flo at MacBook-Pro in ~ $ cd Documents/comic-creator/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Set the Python version for the project.
&lt;/h2&gt;

&lt;p&gt;First, look at the files in the folder, even the hidden files (&lt;code&gt;-la&lt;/code&gt; will show hidden files).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flo at MacBook-Pro in .../comic-creator $ ls -la
total 0
drwxr-xr-x   2 flo  staff    64 Apr 12 21:12 .
drwx------+ 33 flo  staff  1056 Apr 12 21:12 ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, set the Python version for the project. Afterward, you can see a new hidden file (hidden files start with a dot). When you look inside &lt;code&gt;.python-version&lt;/code&gt;, you can see the version we set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flo at MacBook-Pro in .../comic-creator $ pyenv local 3.7.0
flo at MacBook-Pro in .../comic-creator $ ls -la
total 8
drwxr-xr-x   3 flo  staff    96 Apr 12 21:16 .
drwx------+ 33 flo  staff  1056 Apr 12 21:12 ..
-rw-r--r--   1 flo  staff     6 Apr 12 21:16 .python-version
flo at MacBook-Pro in .../comic-creator $ cat .python-version 
3.7.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Create a virtual environment.
&lt;/h2&gt;

&lt;p&gt;Just as you may have several Python versions installed on your machine, you may also have different versions of Python packages installed. Imagine the dependency graph for one of your projects looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;requests==2.21.0
  - certifi [required: &amp;gt;=2017.4.17, installed: 2019.3.9]
  - chardet [required: &amp;gt;=3.0.2,&amp;lt;3.1.0, installed: 3.0.4]
  - idna [required: &amp;gt;=2.5,&amp;lt;2.9, installed: 2.8]
  - urllib3 [required: &amp;gt;=1.21.1,&amp;lt;1.25, installed: 1.24.1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In another project, you may be using a different version of &lt;code&gt;requests&lt;/code&gt; which depends on a different version of &lt;code&gt;certifi&lt;/code&gt;. By using virtual environments, we can keep package installations isolated by project. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A virtual environment is a Python environment such that the Python interpreter, libraries and scripts installed into it are isolated from those installed in other virtual environments, and (by default) any libraries installed in a “system” Python, i.e., one which is installed as part of your operating system. &lt;a href="https://docs.python.org/3/library/venv.html" rel="noopener noreferrer"&gt;Python venv docs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, first, you can verify (again) that we correctly set the Python version for the project. Then, create a virtual environment with the &lt;code&gt;venv&lt;/code&gt; module and name that new environment &lt;code&gt;venv&lt;/code&gt;. You can now see the environment is created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flo at MacBook-Pro in .../comic-creator $ python --version
Python 3.7.0
flo at MacBook-Pro in .../comic-creator $ python -m venv venv
flo at MacBook-Pro in .../comic-creator $ ls
venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Activate the virtual environment.
&lt;/h2&gt;

&lt;p&gt;Look inside &lt;code&gt;venv&lt;/code&gt;. Then, look inside &lt;code&gt;venv/bin&lt;/code&gt;. &lt;code&gt;bin&lt;/code&gt; stands for &lt;em&gt;binary&lt;/em&gt;. In Linux/Unix-like systems, executable programs needed to run the system are found in &lt;a href="http://linfo.org/bin.html" rel="noopener noreferrer"&gt;&lt;code&gt;/bin&lt;/code&gt;&lt;/a&gt;. Similarly, Python executable programs are stored in &lt;code&gt;bin&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Activate the virtual environment with &lt;code&gt;source&lt;/code&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;source&lt;/code&gt; is a Unix command that evaluates the file following the command executed in the current context... Frequently the "current context" is a terminal window into which the user is typing commands during an interactive session. The &lt;code&gt;source&lt;/code&gt; command can be abbreviated as just a dot (.) in Bash and similar POSIX-ish shells. &lt;a href="https://en.wikipedia.org/wiki/Source_(command)" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means that if you open a new Terminal window, you will need to source the &lt;code&gt;activate&lt;/code&gt; file again to activate the virtual environment in that window! Also note that you can type in &lt;code&gt;. venv/bin/activate&lt;/code&gt; and it will do the exact same thing as &lt;code&gt;source venv/bin/activate&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flo at MacBook-Pro in .../comic-creator $ ls venv/
bin        include    lib        pyvenv.cfg
flo at MacBook-Pro in .../comic-creator $ ls venv/bin/
activate         activate.csh     activate.fish    easy_install     easy_install-3.7 pip              pip3             pip3.7           python           python3
flo at MacBook-Pro in .../comic-creator $ source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's look at &lt;code&gt;activate&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flo at MacBook-Pro in .../comic-creator using virtualenv: venv $ cat venv/bin/activate
# This file must be used with "source bin/activate" *from bash*
# you cannot run it directly

deactivate () {
    # reset old environment variables
    if [ -n "${_OLD_VIRTUAL_PATH:-}" ] ; then
        PATH="${_OLD_VIRTUAL_PATH:-}"
        export PATH
        unset _OLD_VIRTUAL_PATH
    fi
    if [ -n "${_OLD_VIRTUAL_PYTHONHOME:-}" ] ; then
        PYTHONHOME="${_OLD_VIRTUAL_PYTHONHOME:-}"
        export PYTHONHOME
        unset _OLD_VIRTUAL_PYTHONHOME
    fi

    # This should detect bash and zsh, which have a hash command that must
    # be called to get it to forget past commands.  Without forgetting
    # past commands the $PATH changes we made may not be respected
    if [ -n "${BASH:-}" -o -n "${ZSH_VERSION:-}" ] ; then
        hash -r
    fi

    if [ -n "${_OLD_VIRTUAL_PS1:-}" ] ; then
        PS1="${_OLD_VIRTUAL_PS1:-}"
        export PS1
        unset _OLD_VIRTUAL_PS1
    fi

    unset VIRTUAL_ENV
    if [ ! "$1" = "nondestructive" ] ; then
    # Self destruct!
        unset -f deactivate
    fi
}

# unset irrelevant variables
deactivate nondestructive

VIRTUAL_ENV="/Users/flo/Documents/comic-creator/venv"
export VIRTUAL_ENV

_OLD_VIRTUAL_PATH="$PATH"
PATH="$VIRTUAL_ENV/bin:$PATH"
export PATH

# unset PYTHONHOME if set
# this will fail if PYTHONHOME is set to the empty string (which is bad anyway)
# could use `if (set -u; : $PYTHONHOME) ;` in bash
if [ -n "${PYTHONHOME:-}" ] ; then
    _OLD_VIRTUAL_PYTHONHOME="${PYTHONHOME:-}"
    unset PYTHONHOME
fi

if [ -z "${VIRTUAL_ENV_DISABLE_PROMPT:-}" ] ; then
    _OLD_VIRTUAL_PS1="${PS1:-}"
    if [ "x(venv) " != x ] ; then
        PS1="(venv) ${PS1:-}"
    else
    if [ "`basename \"$VIRTUAL_ENV\"`" = "__" ] ; then
        # special case for Aspen magic directories
        # see http://www.zetadev.com/software/aspen/
        PS1="[`basename \`dirname \"$VIRTUAL_ENV\"\``] $PS1"
    else
        PS1="(`basename \"$VIRTUAL_ENV\"`)$PS1"
    fi
    fi
    export PS1
fi

# This should detect bash and zsh, which have a hash command that must
# be called to get it to forget past commands.  Without forgetting
# past commands the $PATH changes we made may not be respected
if [ -n "${BASH:-}" -o -n "${ZSH_VERSION:-}" ] ; then
    hash -r
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
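&lt;p&gt;Reading the script above, activation mostly just prepends &lt;code&gt;venv/bin&lt;/code&gt; to &lt;code&gt;PATH&lt;/code&gt; and exports &lt;code&gt;VIRTUAL_ENV&lt;/code&gt;. A small check I find handy (my own addition, not from the workflow): from inside Python, you can tell whether a venv is active by comparing &lt;code&gt;sys.prefix&lt;/code&gt; with &lt;code&gt;sys.base_prefix&lt;/code&gt;.&lt;/p&gt;

```python
# Detect whether the current interpreter is running inside a virtual
# environment: in a venv, sys.prefix points at the environment directory
# while sys.base_prefix still points at the base Python installation.
import sys

def in_virtualenv():
    return sys.prefix != sys.base_prefix

print("virtualenv active:", in_virtualenv())
```

&lt;p&gt;If the activated prompt prefix ever disappears (say, in a fresh Terminal window), this check tells you whether you actually need to &lt;code&gt;source venv/bin/activate&lt;/code&gt; again.&lt;/p&gt;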



&lt;h2&gt;
  
  
  Step 6: Install dependencies.
&lt;/h2&gt;

&lt;p&gt;If you haven't come across "dependencies", this word is often used to say that something is dependent on something else... Makes sense. In our case, our Python project will depend on installing various libraries that don't come already bundled with Python 3.7.0.&lt;/p&gt;

&lt;p&gt;This is what our code looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webbrowser&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="c1"&gt;# url of latest xkcd comic
&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://xkcd.com/info.0.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Comic is located at {}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;img&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="n"&gt;webbrowser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;img&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; {}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a file &lt;code&gt;comic_popup.py&lt;/code&gt; in the project and add this code. If you try to run it now, you will get an error: the &lt;code&gt;requests&lt;/code&gt; module isn't installed yet. Let's install it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flo at MacBook-Pro in .../comic-creator using virtualenv: venv $ touch comic_popup.py
flo at MacBook-Pro in .../comic-creator using virtualenv: venv $ ls
comic_popup.py venv
flo at MacBook-Pro in .../comic-creator using virtualenv: venv $ pip install requests
Collecting requests
  Using cached https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl
Collecting chardet&amp;lt;3.1.0,&amp;gt;=3.0.2 (from requests)
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting urllib3&amp;lt;1.25,&amp;gt;=1.21.1 (from requests)
  Using cached https://files.pythonhosted.org/packages/62/00/ee1d7de624db8ba7090d1226aebefab96a2c71cd5cfa7629d6ad3f61b79e/urllib3-1.24.1-py2.py3-none-any.whl
Collecting idna&amp;lt;2.9,&amp;gt;=2.5 (from requests)
  Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Collecting certifi&amp;gt;=2017.4.17 (from requests)
  Using cached https://files.pythonhosted.org/packages/60/75/f692a584e85b7eaba0e03827b3d51f45f571c2e793dd731e598828d380aa/certifi-2019.3.9-py2.py3-none-any.whl
Installing collected packages: chardet, urllib3, idna, certifi, requests
Successfully installed certifi-2019.3.9 chardet-3.0.4 idna-2.8 requests-2.21.0 urllib3-1.24.1
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 7: Save packages.
&lt;/h2&gt;

&lt;p&gt;Notice what is printed when you enter &lt;code&gt;pip freeze&lt;/code&gt;. This command outputs installed packages in requirements format ({library-name}=={version}). In the next line, &lt;em&gt;redirect&lt;/em&gt; that output to a file called &lt;code&gt;requirements.txt&lt;/code&gt; using &lt;a href="https://en.wikipedia.org/wiki/Redirection_(computing)" rel="noopener noreferrer"&gt;&lt;code&gt;&amp;gt;&lt;/code&gt;&lt;/a&gt;. A single &lt;code&gt;&amp;gt;&lt;/code&gt; will overwrite the contents of the file if the file already exists. Using &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; would append to an already existing file.&lt;/p&gt;
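As a quick sanity check of the redirection behavior described above (the file name here is arbitrary), here is a minimal shell sketch:

```shell
echo "first"  > demo.txt    # '>' creates the file (or overwrites it)
echo "second" > demo.txt    # overwrites again: only "second" remains
echo "third" >> demo.txt    # '>>' appends instead: now two lines
cat demo.txt
```

Running `pip freeze > requirements.txt` therefore always leaves a fresh snapshot of your packages, while `>>` would stack old and new snapshots in the same file.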

&lt;p&gt;You don't have to call the file &lt;code&gt;requirements.txt&lt;/code&gt; but that is what most Python developers use so follow the convention! More on requirements files &lt;a href="https://pip.pypa.io/en/stable/user_guide/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You may also notice that &lt;code&gt;requests&lt;/code&gt; isn't the only library output by &lt;code&gt;pip freeze&lt;/code&gt;. The others are libraries that &lt;code&gt;requests&lt;/code&gt; depends on, so when you install &lt;code&gt;requests&lt;/code&gt; they must be installed too for &lt;code&gt;requests&lt;/code&gt; to work. These other libraries are referred to as &lt;em&gt;transitive dependencies&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flo at MacBook-Pro in .../comic-creator using virtualenv: venv $ pip freeze
certifi==2019.3.9
chardet==3.0.4
idna==2.8
requests==2.21.0
urllib3==1.24.1
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
flo at MacBook-Pro in .../comic-creator using virtualenv: venv $ pip freeze &amp;gt; requirements.txt
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
flo at MacBook-Pro in .../comic-creator using virtualenv: venv $ cat requirements.txt 
certifi==2019.3.9
chardet==3.0.4
idna==2.8
requests==2.21.0
urllib3==1.24.1
flo at MacBook-Pro in .../comic-creator using virtualenv: venv $ pip --help

Usage:   
  pip &amp;lt;command&amp;gt; [options]

Commands:
  install                     Install packages.
  download                    Download packages.
  uninstall                   Uninstall packages.
  freeze                      Output installed packages in requirements format.
  list                        List installed packages.
  show                        Show information about installed packages.
  check                       Verify installed packages have compatible dependencies.
  config                      Manage local and global configuration.
  search                      Search PyPI for packages.
  wheel                       Build wheels from your requirements.
  hash                        Compute hashes of package archives.
  completion                  A helper command used for command completion.
  help                        Show help for commands.

General Options:
  -h, --help                  Show help.
  --isolated                  Run pip in an isolated mode, ignoring environment variables and user configuration.
  -v, --verbose               Give more output. Option is additive, and can be used up to 3 times.
  -V, --version               Show version and exit.
  -q, --quiet                 Give less output. Option is additive, and can be used up to 3 times (corresponding to WARNING, ERROR, and CRITICAL logging levels).
  --log &amp;lt;path&amp;gt;                Path to a verbose appending log.
  --proxy &amp;lt;proxy&amp;gt;             Specify a proxy in the form [user:passwd@]proxy.server:port.
  --retries &amp;lt;retries&amp;gt;         Maximum number of retries each connection should attempt (default 5 times).
  --timeout &amp;lt;sec&amp;gt;             Set the socket timeout (default 15 seconds).
  --exists-action &amp;lt;action&amp;gt;    Default action when a path already exists: (s)witch, (i)gnore, (w)ipe, (b)ackup, (a)bort).
  --trusted-host &amp;lt;hostname&amp;gt;   Mark this host as trusted, even though it does not have valid or any HTTPS.
  --cert &amp;lt;path&amp;gt;               Path to alternate CA bundle.
  --client-cert &amp;lt;path&amp;gt;        Path to SSL client certificate, a single file containing the private key and the certificate in PEM format.
  --cache-dir &amp;lt;dir&amp;gt;           Store the cache data in &amp;lt;dir&amp;gt;.
  --no-cache-dir              Disable the cache.
  --disable-pip-version-check
                              Don't periodically check PyPI to determine whether a new version of pip is available for download. Implied with --no-index.
  --no-color                  Suppress colored output

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 8: Run the code.
&lt;/h2&gt;

&lt;p&gt;That's it. You should be able to run the code now. Strictly speaking, it was runnable as soon as its dependencies were installed, but don't forget to save your requirements!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flo at MacBook-Pro in .../comic-creator using virtualenv: venv $ python comic_popup.py 
Comic is located at https://imgs.xkcd.com/comics/election_commentary.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if you save your code to a repo, anyone can pull the code and run it. Add a &lt;code&gt;README.md&lt;/code&gt; and include which version of Python to use to run the code. The next developer will set up the right Python version and install the requirements by running &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Don't include the &lt;code&gt;.python-version&lt;/code&gt; file in the repo because the file is &lt;code&gt;pyenv&lt;/code&gt; specific and other developers may have their own way to manage Python versions. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As a rule of thumb, I don't include files that are specific to me like configuration files for various IDEs (Integrated Development Environments) because they clutter up the repository. Ignore these files in your repository by adding and configuring a &lt;code&gt;.gitignore&lt;/code&gt; file.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;That's my development workflow when I start a Python project! I included some developer best practices where I felt they fit, along with as much context as seemed appropriate. I encourage you to try out different ways of doing the same thing to see the pros and cons of each.&lt;/p&gt;

&lt;p&gt;Feel free to ask any questions! I'd love to chat about best practices and what works for you as well. So many parts of our workflows are by convention or because that's the way we first learned it or we don't know any better. I'd love to hear from you!&lt;/p&gt;

</description>
      <category>python</category>
      <category>developerworkflow</category>
    </item>
    <item>
      <title>Live notetaking as I learn Spark</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Sat, 06 Apr 2019 16:00:20 +0000</pubDate>
      <link>https://dev.to/flopi/live-notetaking-as-i-learn-spark-odj</link>
      <guid>https://dev.to/flopi/live-notetaking-as-i-learn-spark-odj</guid>
      <description>&lt;h1&gt;
  
  
  What is this?
&lt;/h1&gt;

&lt;p&gt;I would love to get in the mind of other developers. I want to see how they think and I want to watch how they learn &lt;em&gt;live&lt;/em&gt;. So, &lt;strong&gt;this is an experiment.&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I have a somewhat vague goal: learn the theoretical foundations of Spark so I can look at a program and optimize it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I will put my notes here as I go. My hope is that I will start to glean some patterns from these notes. For example, so far I have noticed that I categorize the questions I have and define concepts I don't understand as I come across them, sometimes putting off defining them because I want to understand a larger concept first. This is valuable information for me because I want to learn in the most efficient manner! There is so much I am curious about. I want to live life to the fullest and explore topics that give me joy.&lt;/p&gt;

&lt;p&gt;This experiment is inspired by Jessie Frazelle's blog post, &lt;a href="https://blog.jessfraz.com/post/digging-into-risc-v-and-how-i-learn-new-things/" rel="noopener noreferrer"&gt;Digging into RISC-V and how I learn new things&lt;/a&gt;. By inspired I mean that I felt that same excitement I felt when I first started learning to code when I read this post. It brought that feeling of wonder and reverence back. Romance is not dead :p &lt;/p&gt;

&lt;p&gt;If you have been feeling burned out or like your work does not matter, I urge you to think outside your immediate situation (whether a shitty job or overwhelming schoolwork) and learn about areas that &lt;strong&gt;you&lt;/strong&gt; are curious about. See how other people are learning. Try new things. Find inspiration.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wwahlhckgat8nlusioq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wwahlhckgat8nlusioq.jpg" alt="Fuzzy pink pen" width="236" height="270"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Notes
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;From Spark: The Definitive Guide (recommended to me!)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you want to download and run Spark locally, the first step is to make sure that you have Java installed on your machine (available as &lt;code&gt;java&lt;/code&gt;), as well as a Python version if you would like to use Python. Next, visit the project’s official download page, select the package type of “Pre-built for Hadoop 2.7 and later,” and click “Direct Download.” This downloads a compressed TAR file, or tarball, that you will then need to extract.

&lt;ul&gt;
&lt;li&gt;I already installed the pre-built version for 2.9&lt;/li&gt;
&lt;li&gt;I moved the uncompressed folder to my Applications folder but couldn't find the folder thru the terminal. What have I done?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Small snafu
&lt;/h3&gt;

&lt;p&gt;There is the folder:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogyta79br58cplblz5if.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogyta79br58cplblz5if.png" alt="screenshot of Finder showing spark install in Applications folder" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But I don't see any content in the folder thru the Terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; ~/Applications/
Chrome Apps.localized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, I Googled and found that &lt;a href="https://apple.stackexchange.com/questions/44475/access-applications-directory-in-terminal" rel="noopener noreferrer"&gt;Applications on Mac is at the root&lt;/a&gt; and I had been looking at my user's Applications. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spark can run locally without any distributed storage system, such as Apache Hadoop, so I won't install Hadoop since the last time I did it was incredibly slow on my machine. (although I was using a different machine with shittier specs)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"You can use Spark from Python, Java, Scala, R, or SQL. Spark itself is written in Scala, and runs on the Java Virtual Machine (JVM), so therefore to run Spark either on your laptop or a cluster, all you need is an installation of Java. If you want to use the Python API, you will also need a Python interpreter (version 2.7 or later)." &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I'll be using PySpark. I use &lt;code&gt;pyenv&lt;/code&gt; to manage installations of Python and create virtual environments. I see that I'll need a virtual environment because the book says, "In Spark 2.2, the developers also added the ability to install Spark for Python via &lt;code&gt;pip install pyspark&lt;/code&gt;. This functionality came out as this book was being written, so we weren’t able to include all of the relevant instructions." So I know that I'll need a virtual environment because I want to manage the version of &lt;code&gt;pyspark&lt;/code&gt; by project instead of installing it globally. I also now know that I may need to do more Googling since this book doesn't have all the instructions I may need. I don't know exactly how many instructions are missing -- hope this doesn't derail me for too long.  I'm anticipating it though and I wonder if that gets in the way of me pushing thru on other projects.&lt;/p&gt;

&lt;p&gt;This is how I'm installing &lt;code&gt;pyspark&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mkdir spark-trial
$ cd spark-trial/
$ python --version
Python 2.7.10
$ pyenv local 3.6.5
$ python -m venv venv
$ source venv/bin/activate
$ python --version
Python 3.6.5
$ pip install pyspark
$ pip freeze &amp;gt; requirements.txt
$ cat requirements.txt 
py4j==0.10.7
pyspark==2.4.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, I want to make sure I can run the thing. The book says to run the following from the home directory of the Spark installation: &lt;code&gt;$ ./bin/pyspark&lt;/code&gt;.&lt;br&gt;
 Then, type “spark” and press Enter. You’ll see the SparkSession object printed. &lt;/p&gt;

&lt;p&gt;So I went back to the directory where I installed Spark and ran those commands. This was the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.6.5 (default, Nov 20 2018, 15:26:21) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Applications/spark-2.4.1-bin-hadoop2.7/jars/spark-unsafe_2.11-2.4.1.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/04/06 12:44:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/

Using Python version 3.6.5 (default, Nov 20 2018 15:26:21)
SparkSession available as 'spark'.
&amp;gt;&amp;gt;&amp;gt; spark
&amp;lt;pyspark.sql.session.SparkSession object at 0x10e262ac8&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I don't know why it is running Python 3.6.5 even though the system default is 2.7. I know &lt;code&gt;pyenv&lt;/code&gt; manages which version of Python is set for a project by creating a &lt;code&gt;.python-version&lt;/code&gt; file, so I'm going to exit this shell and look for that file in this spark directory.&lt;/p&gt;

&lt;p&gt;When I run &lt;code&gt;python --version&lt;/code&gt; from the same directory that Spark is installed in I see the version is 3.6.5 -- the same version as my virtual environment. I set this bash theme that displays this when virtual environments are activated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;○ flo at MacBook-Pro in .../-2.4.1-bin-hadoop2.7 using virtualenv: venv $ python --version
Python 3.6.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I forgot that the virtual environment I created above is activated. I thought &lt;code&gt;pyenv&lt;/code&gt; worked by reading the &lt;code&gt;.python-version&lt;/code&gt; file to determine which Python version is set for a project but I did not put together how Python virtual environments work... what does the &lt;code&gt;activate&lt;/code&gt; script that comes with virtual environments do? After looking at the script, I see that it sets the &lt;code&gt;VIRTUAL_ENV&lt;/code&gt; environment variable to the path of this virtual environment, prepends &lt;code&gt;$VIRTUAL_ENV/bin&lt;/code&gt; to the system &lt;code&gt;PATH&lt;/code&gt;, and unsets &lt;code&gt;PYTHONHOME&lt;/code&gt; if it is set, so that the first Python the system finds is the activated virtual environment's Python. Cool!&lt;/p&gt;
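To make that concrete, here is a simplified sketch of what `bin/activate` does (the real script also saves the old values so `deactivate` can undo all of this; the venv path below is hypothetical):

```shell
VIRTUAL_ENV="$HOME/comic-creator/venv"   # hypothetical path to this venv
export VIRTUAL_ENV

# Prepend the venv's bin directory so its python is found first.
PATH="$VIRTUAL_ENV/bin:$PATH"
export PATH

# Make sure nothing overrides which standard library the interpreter uses.
unset PYTHONHOME
```

After this, `python` resolves to `$VIRTUAL_ENV/bin/python` (assuming the venv actually exists at that path), which is exactly why my shell kept reporting 3.6.5.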

&lt;h2&gt;
  
  
  Spark Architecture
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Single machines do not have enough power and resources to perform computations on huge amounts of information (or the user probably does not have the time to wait for the computation to finish). A &lt;em&gt;cluster&lt;/em&gt;, or group, of computers, pools the resources of many machines together, giving us the ability to use all the cumulative resources as if they were a single computer. Now, a group of machines alone is not powerful, you need a framework to coordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers." Spark the definite guide&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have a textbook, Distributed Systems, by Andrew Tanenbaum (my favorite textbook author -- yes, I have a fav!). This morning, I read about different types of distributed computing, like cluster computing vs grid computing, so I am going to go back to the book to clarify if &lt;em&gt;cluster&lt;/em&gt; in the quote above has a connection to &lt;em&gt;cluster computing&lt;/em&gt; or if it's an overloaded term.&lt;/p&gt;

&lt;p&gt;There are many classes of distributed systems; one class of distributed systems is for &lt;strong&gt;high-performance computing tasks&lt;/strong&gt;. There are two subgroups within this class: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cluster computing: 

&lt;ul&gt;
&lt;li&gt;underlying hardware consists of a &lt;strong&gt;collection of similar workstations&lt;/strong&gt; or PCs closely &lt;strong&gt;connected by&lt;/strong&gt; means of a high-speed &lt;strong&gt;local-area network&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;each node runs the same operating system &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;grid computing:

&lt;ul&gt;
&lt;li&gt;often constructed as a federation of computer systems, where each system may fall under a different administrative domain, and may be very different when it comes to hardware, software, and deployed network technology&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;I have only come in contact with Spark within single companies which I assume each have a single network??? so for now I am going to assume that Spark falls under cluster computing.&lt;/p&gt;

&lt;p&gt;Distributed Systems textbook then goes into examples of cluster computers. For example, the &lt;em&gt;MOSIX system&lt;/em&gt;. MOSIX tries to provide a &lt;strong&gt;single-system image&lt;/strong&gt; of a cluster which means that a cluster computer tries to appear as a single computer to a process. I've come across my first distributed system gem: IT IS IMPOSSIBLE TO PROVIDE A SINGLE SYSTEM IMAGE UNDER ALL CIRCUMSTANCES.&lt;/p&gt;

&lt;p&gt;I am finally starting to connect the dots between Spark and distributed systems concepts. I am going back in the textbook to design goals of distributed systems to learn more about transparency and adding a section to my notes called Design Goals since I am now getting what these design goals are all about.&lt;/p&gt;

&lt;p&gt;Back to Spark, &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, YARN, or Mesos. We then submit Spark Applications to these cluster managers, which will grant resources to our application so that we can complete our work."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Cluster manager
&lt;/h3&gt;

&lt;p&gt;The cluster manager controls physical machines and allocates resources to Spark Applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spark Applications
&lt;/h3&gt;

&lt;p&gt;Spark Applications consist of a &lt;em&gt;driver process&lt;/em&gt; and a set of &lt;em&gt;executor processes&lt;/em&gt;.  The driver and executors are simply processes, which means that they can live on the same machine or different machines. In local mode, the driver and executors run (as threads) on your individual computer instead of a cluster. &lt;/p&gt;

&lt;h4&gt;
  
  
  Driver
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;called the &lt;code&gt;SparkSession&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;1:1 &lt;code&gt;SparkSession&lt;/code&gt; to Spark Application&lt;/li&gt;
&lt;li&gt;runs &lt;code&gt;main()&lt;/code&gt; function&lt;/li&gt;
&lt;li&gt;sits on a node in the cluster&lt;/li&gt;
&lt;li&gt;responsible for three things: 

&lt;ul&gt;
&lt;li&gt;maintaining information about the Spark Application during the lifetime of the application&lt;/li&gt;
&lt;li&gt;responding to a user’s program or input&lt;/li&gt;
&lt;li&gt;analyzing, distributing, and scheduling work across the executors &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Executors
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;responsible for only two things: 

&lt;ul&gt;
&lt;li&gt;executing code assigned to it by the driver&lt;/li&gt;
&lt;li&gt;reporting the state of the computation on that executor back to the driver node&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;I've done quite a bit of learning about concepts. I am itching to build before my attention wavers. There are still some parts in the text that I gleaned and seem relevant, so I'm going to keep going and hopefully will get to build soon. Otherwise, I'll pivot myself.&lt;/p&gt;

&lt;h1&gt;
  
  
  Spark APIs
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;language APIs and structured vs unstructured APIs&lt;/li&gt;
&lt;li&gt;Each &lt;em&gt;language&lt;/em&gt; API maintains the same core concepts described (driver, executors, etc.?). 

&lt;ul&gt;
&lt;li&gt;There is a &lt;code&gt;SparkSession&lt;/code&gt; object available to the user, which is the entrance point to running Spark code.&lt;/li&gt;
&lt;li&gt;When using Spark from Python or R, you don’t write explicit JVM instructions; instead, &lt;strong&gt;you write Python and R code that Spark translates into code that it then can run on the executor JVMs.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Starting the Spark shell is how I can send commands to Spark so that Spark can then send them to executors. So, starting a Spark shell creates an interactive Spark Application. The shell starts Spark in local mode. I can also send standalone applications to Spark via the &lt;code&gt;spark-submit&lt;/code&gt; process, whereby I submit a precompiled application to Spark. &lt;/p&gt;

&lt;h2&gt;
  
  
  Distributed collections of data
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;core data structures are &lt;em&gt;immutable&lt;/em&gt; so they cannot change after they are created&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DataFrames
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;a table of data with rows and columns &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;schema&lt;/em&gt; is list of columns and types&lt;/li&gt;
&lt;li&gt;parts of the DataFrame can reside on different machines

&lt;ul&gt;
&lt;li&gt;A &lt;em&gt;partition&lt;/em&gt; is a collection of &lt;em&gt;rows&lt;/em&gt; that sit on one physical machine in your cluster. A DataFrame’s partitions represent how the data is physically distributed across the cluster of machines during execution. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Python/R DataFrames mostly exist on a single machine but can convert to Spark DataFrame&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;myRange = spark.range(1000).toDF("number")&lt;/code&gt; in Python creates a DataFrame with one column containing 1,000 rows with values from 0 to 999. This range of numbers represents a distributed collection. When run on a cluster, each part of this range of numbers exists on a different executor. This is a Spark DataFrame.&lt;/li&gt;

&lt;li&gt;most efficient and easiest to use&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transformations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;narrow: those for which each input partition will contribute to only one output partition

&lt;ul&gt;
&lt;li&gt;Spark will automatically perform an operation called &lt;em&gt;pipelining&lt;/em&gt;, meaning that if we specify multiple filters on DataFrames, they’ll all be performed in-memory. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;wide: a wide dependency (or wide transformation) has input partitions contributing to many output partitions, aka a &lt;em&gt;shuffle&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;when a shuffle is performed, Spark writes the results to disk&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
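To build intuition about narrow vs. wide, here is a toy model in plain Python (not Spark code) where each inner list stands in for one partition living on one machine:

```python
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Narrow: each input partition feeds exactly one output partition, so a
# filter and a map can be pipelined in memory, partition by partition.
narrow = [[x * 10 for x in part if x % 2 == 0] for part in partitions]

# Wide (a shuffle): rows are regrouped by key, so every input partition
# may contribute to every output partition.
num_out = 2
shuffled = [[] for _ in range(num_out)]
for part in partitions:
    for x in part:
        shuffled[x % num_out].append(x)  # the key (x % num_out) picks the target

print(narrow)    # [[20], [40, 60], [80]]
print(shuffled)  # [[2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

The narrow step never moves data between "machines", which is why Spark can keep it in memory; the wide step regroups rows across all partitions, which is the part Spark writes to disk.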

&lt;h3&gt;
  
  
  Actions
&lt;/h3&gt;

&lt;h2&gt;
  
  
  Theoretical foundations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design Goals
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Making distribution transparent/invisible: hide that processes and resources are physically distributed across multiple computers

&lt;ul&gt;
&lt;li&gt;access: hide differences in data representation and how a process or resource is accessed&lt;/li&gt;
&lt;li&gt;location: hide where a process or resource is located&lt;/li&gt;
&lt;li&gt;relocation: hide that a resource or process may be moved to another location while in use&lt;/li&gt;
&lt;li&gt;migration: hide that a resource or process may move to another location&lt;/li&gt;
&lt;li&gt;replication: hide that a resource or process is replicated&lt;/li&gt;
&lt;li&gt;concurrency: hide that a resource or process may be shared by several independent users&lt;/li&gt;
&lt;li&gt;failure: hide the failure and recovery of a resource or process &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;What are the main problems in distributed systems?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiset&lt;/li&gt;
&lt;li&gt;working set: the amount of memory that a process requires in a given time interval, typically the units of information in question are considered to be memory pages. This is suggested to be an approximation of the set of pages that the process will access in the future and more specifically is suggested to be an indication of what pages ought to be kept in main memory to allow most progress to be made in the execution of that process.&lt;/li&gt;
&lt;li&gt;cluster computing paradigms&lt;/li&gt;
&lt;li&gt;distributed shared memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Architectural foundations&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDD: resilient distributed dataset, a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way&lt;/li&gt;
&lt;li&gt;The Dataframe API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cluster Computing Paradigm&lt;/p&gt;

&lt;p&gt;Limitations of MapReduce cluster computing paradigm&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm.&lt;/li&gt;
&lt;li&gt;MapReduce forces a particular linear dataflow structure on distributed programs: 

&lt;ul&gt;
&lt;li&gt;MapReduce programs read input data from disk&lt;/li&gt;
&lt;li&gt;map a function across the data &lt;/li&gt;
&lt;li&gt;reduce the results of the map&lt;/li&gt;
&lt;li&gt;store reduction results on disk&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.&lt;/li&gt;

&lt;li&gt;Spark facilitates the implementation of both iterative algorithms, which visit their data set multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications may be reduced by several orders of magnitude compared to Apache Hadoop MapReduce implementation. &lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Apache Spark requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a cluster manager: standalone (native Spark cluster), Hadoop YARN, Apache Mesos&lt;/li&gt;
&lt;li&gt;a distributed storage system: Alluxio, HDFS, MapR-FS, Cassandra, OpenStack Swift, Amazon S3, Kudu, a custom solution, pseudo-distributed local mode where distributed storage is not required and the local file system can be used instead (Spark is run on a single machine with one executor per CPU core in this case)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why use RDDs?&lt;br&gt;
RDDs are lower-level abstractions because they reveal physical execution characteristics, like partitions, to end users. You might use RDDs to parallelize raw data stored in memory on the driver machine. &lt;/p&gt;
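&lt;p&gt;As a rough plain-Python analogy (not Spark code), parallelizing raw driver-side data amounts to slicing it into partitions that can be processed independently; the &lt;code&gt;numSlices&lt;/code&gt; name in the comment mirrors the PySpark parameter, but the code itself is a hypothetical sketch.&lt;/p&gt;

```python
# Plain-Python analogy for what SparkContext.parallelize does conceptually:
# data held in driver memory is sliced into partitions, and a function is
# then applied within each partition independently (in Spark, by executors).

def partition(data, num_partitions):
    """Slice a list into roughly equal contiguous partitions."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

raw = list(range(10))       # "raw data stored in memory on the driver"
parts = partition(raw, 3)   # analogous to sc.parallelize(raw, numSlices=3)
print(parts)                # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# A map over the "RDD" is just the function applied within each partition.
squared = [[x * x for x in p] for p in parts]
print(squared)
```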

&lt;p&gt;Scala RDDs are &lt;strong&gt;not&lt;/strong&gt; equivalent to Python RDDs.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Chapter 4: Structured API Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark is a distributed programming model in which the user specifies transformations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Spark is a distributed programming model in which the user specifies transformations. Multiple transformations build up a directed acyclic graph of instructions. An action begins the process of executing that graph of instructions, as a single job, by breaking it down into stages and tasks to execute across the cluster. The logical structures that we manipulate with transformations and actions are DataFrames and Datasets. To create a new DataFrame or Dataset, you call a transformation. To start computation or convert to native language types, you call an action."&lt;/p&gt;
&lt;/blockquote&gt;
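&lt;p&gt;The lazy-transformation/eager-action split in the quote can be loosely mimicked with Python generators (an analogy, not Spark itself): chaining generators builds up a pipeline without doing any work, and a terminal call like &lt;code&gt;list()&lt;/code&gt; plays the role of the action that triggers execution.&lt;/p&gt;

```python
# Analogy for lazy transformations vs. eager actions, using generators.
# Chaining generator expressions builds a pipeline without computing
# anything -- like chaining DataFrame/RDD transformations. A terminal
# call such as list() acts like a Spark action: it walks the whole
# pipeline and produces concrete values.

executed = []

def traced(x):
    executed.append(x)   # record when work actually happens
    return x * 10

data = range(5)
mapped = (traced(x) for x in data)         # "transformation": nothing runs yet
filtered = (x for x in mapped if x >= 20)  # another lazy "transformation"

print(executed)          # [] -- no work has been done so far

result = list(filtered)  # the "action": triggers the whole pipeline
print(result)            # [20, 30, 40]
print(executed)          # [0, 1, 2, 3, 4] -- work happened at action time
```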

</description>
      <category>spark</category>
      <category>distributedsystems</category>
      <category>learningplan</category>
    </item>
    <item>
      <title>A technical leadership lesson from interacting with folks outside of engineering</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Fri, 08 Mar 2019 02:30:46 +0000</pubDate>
      <link>https://dev.to/flopi/a-technical-leadership-lesson-from-interacting-with-folks-outside-of-engineering-416n</link>
      <guid>https://dev.to/flopi/a-technical-leadership-lesson-from-interacting-with-folks-outside-of-engineering-416n</guid>
      <description>&lt;p&gt;At work, I have the privilege of interacting with folks from other (non-engineering) teams often. I really enjoy this part of the job! I studied computer science in college and have worked as an engineer since so being in a role that gives me visibility into other roles is refreshing. I see every interaction with someone new as a chance to learn something and perhaps &lt;em&gt;build together&lt;/em&gt;? 😊  &lt;/p&gt;

&lt;p&gt;So when I came across this blog post I was immediately intrigued:&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/samjarman" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F19747%2F97b2fe78-0de2-4c81-91aa-a41493c58b60.png" alt="samjarman"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/samjarman/daniele-bernardi-on-working-with-non-technical-people-and-getting-recognition-right-533l" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Daniele Bernardi on Working with Non-Technical People and Getting Recognition Right&lt;/h2&gt;
      &lt;h3&gt;Sam Jarman 👨🏼‍💻 ・ Mar 5 '19&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#career&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#nontechnicalpeople&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#softskills&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;This quote resonated with me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I'm not afraid to say I struggled with this aspect at the beginning of my career. I thought technical leadership would simply mean displaying your knowledge by painstakingly listing all the details in a project, but I soon found out that aspect is actually perceived as overzealous. Technical leadership is really about using your best judgement to condense the most difficult aspects into a straightforward statement; it also means hiding details you know don't require broad consensus among the key stakeholders of your project."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I recently created a document whose primary audience is not engineers. I was proud of having started the conversation that led to this very document being written. I truly believed this document was taking my team one step forward on the road to success. So, I included several hundred pages of sample records, which took many evenings during off-hours to put together. I wanted to be as precise as possible, and I felt good about my effort right up until I received feedback from the other team. They had removed the sample records because they didn't add to the discussion. &lt;/p&gt;

&lt;p&gt;I realize I did not practice enough self-awareness when writing this document. Yes, I was concerned with how the readers would perceive me and my knowledge but now I see that I was more concerned with proving I knew what I was talking about. I missed the point.&lt;/p&gt;

&lt;p&gt;The blog post I reference above is not the first time I have heard about the delivery of simplified messages as a key aspect of technical leadership. Other people I look up to and respect have shared this with me. This is, however, the first time that it really hit home for me as I reflected on this experience with another team outside engineering. So I will be mulling over this lesson, practicing empathy in future interactions, and actively seeking feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I am curious about your thoughts! Please send me a message or comment below.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;P.S. I hate referring to folks as "non-technical" but I couldn't help how someone else titled their blog post. April Wensel explains some of the issues I have with the term on her &lt;a href="https://medium.com/compassionate-coding/if-you-can-use-a-fork-youre-technical-352e21d92c87" rel="noopener noreferrer"&gt;blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technicalleadership</category>
    </item>
    <item>
      <title>Why you should join a new grad program</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Tue, 05 Mar 2019 02:22:25 +0000</pubDate>
      <link>https://dev.to/flopi/why-you-should-join-a-new-grad-program-4aec</link>
      <guid>https://dev.to/flopi/why-you-should-join-a-new-grad-program-4aec</guid>
      <description>&lt;p&gt;I immigrated to the United States when I was seven years old. Life was hard for my family. They were not working glamorous jobs. They were making ends meet the best they could. My mom never had a Bring Your Kid to Work Day at her low wage job. I did not get exposure to folks working in corporate settings either. So when I started working in the tech industry after college, I was very nervous -- I didn't know what to expect but I knew that I really &lt;em&gt;needed&lt;/em&gt; to succeed at this. This was my chance to make a better life for myself. &lt;strong&gt;I had been waiting for this moment my whole life.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenu1gzyiof9294qux1mx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenu1gzyiof9294qux1mx.jpg" alt="Fairly Odd Parents sequence switch corporate" width="480" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I consider myself very fortunate. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;My first job was through J.P. Morgan's &lt;a href="https://careers.jpmorgan.com/us/en/students/programs/software-engineer-fulltime" rel="noopener noreferrer"&gt;Software Engineer Program&lt;/a&gt;. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I remember getting the call with the job offer. Compensation was several times more than what my mom made in a year. I was thrilled. I was terrified. Many people I looked up to told me that this was a place I could learn best practices and grow. Also, the money... so I accepted. I had many restless nights after that. I was worried that any second the offer would be rescinded. I did not rest until the first day went by without a hitch. I am happy to say today that everything worked out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fs1.dmcdn.net%2FjeS0o%2Fx480-_W9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fs1.dmcdn.net%2FjeS0o%2Fx480-_W9.jpg" alt="Welcome junior business leaguers" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The program offers many benefits:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You’ll have access to continuous training both on-the-job and via courses to build your technical and business skills. We’ll cover topics ranging from cybersecurity to presentation skills to further your career development. Our teams are dedicated to your support and advocacy throughout the two years of the program."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They're not kidding. I got a lot of support here. I had access to mentors and trainings, and I built a strong network. My manager was cool and I learned more from them than I ever thought imaginable. &lt;strong&gt;I highly recommend joining a program for new grads if you want extra support during your transition from student to professional developer.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;New grad programs set the expectation that you will need extra support to succeed. When you tell more senior colleagues (the good ones!) you are in the program, they will make themselves available to you for help. Many will offer to mentor you. Take them up on it. &lt;/p&gt;

&lt;p&gt;Programs also tend to have policies to ensure that those in charge of your success are doing right by you. Managers often have to attend trainings to learn how to best support you. Seeing how seriously my manager at the time took the trainings increased my trust in them. They often shared what they learned at the trainings and kept me in the loop about what was coming up like the dreaded performance reviews (which they prepared me for!). &lt;/p&gt;

&lt;p&gt;On the flip side, Jesse Jiryu Davis writes about getting his interns taken away at one point: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"My managers were watching me founder and they issued ultimata: get involved, make goals, get the project on track. I never did. With only weeks left in the summer, Intern Protective Services reassigned my apprentices to Mike O'Brien, who maintained the MongoDB Connector for Hadoop at the time."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Jesse &lt;a href="https://emptysqua.re/blog/mentoring/#disaster" rel="noopener noreferrer"&gt;did eventually triumph as a mentor (yay!)&lt;/a&gt; but what I'd like to point to here is the structure set in place to ensure folks are supported. You want to know that your employer will rectify the situation if you find yourself on a team that is not working for you and you want to be sure it is easy to say something to someone who can help you before the situation gets too dire. &lt;strong&gt;Communicating difficult issues will be much easier in a new grad program where regular check-ins with program coordinators are the norm.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvignette.wikia.nocookie.net%2Ffairlyoddparents%2Fimages%2Fc%2Fc4%2FPixiesInc048.png%2Frevision%2Flatest%3Fcb%3D20110321022504%26path-prefix%3Den" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvignette.wikia.nocookie.net%2Ffairlyoddparents%2Fimages%2Fc%2Fc4%2FPixiesInc048.png%2Frevision%2Flatest%3Fcb%3D20110321022504%26path-prefix%3Den" alt="Fairly Odd Parents Pixie complaint desk" width="636" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I remember running into my program coordinator at the time and them saying, "Hi Flo, how are you?". They remember my name?! They must have interviewed hundreds of candidates! In a firm of over 400,000, do not underestimate how many people you will get to know. They will be rooting for you. That's pretty special and it will make a difference in your life. As you grow in your career, you will still get support but the expectation will be that you will seek out what you need. &lt;strong&gt;The support you get at this stage will model the support you will most likely seek in the future.&lt;/strong&gt; Make sure you have high standards for yourself!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ldb3iwf3sz6wuszfbfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ldb3iwf3sz6wuszfbfo.png" alt="Close up of Pixie Boss saying you're in charge" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, in conclusion, if you are uneasy about the transition to professional life, consider a new grad program. You will learn what you need to succeed, have a structure in place to make sure you are supported, and you will have dedicated folks helping you every step of the way. I am really grateful my career started out like this!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I am happy to talk to you if you are considering a new grad program :)&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>newgrad</category>
    </item>
    <item>
      <title>How I'm developing my learning plan this year</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Sun, 03 Mar 2019 19:49:38 +0000</pubDate>
      <link>https://dev.to/flopi/how-im-developing-my-learning-plan-this-year-1pj5</link>
      <guid>https://dev.to/flopi/how-im-developing-my-learning-plan-this-year-1pj5</guid>
      <description>&lt;h1&gt;
  
  
  Motivation
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;My grandfather took my sister and me to the library every week as kids. I remember being in awe of the large books older folks would pick up. I remember telling myself that one day I too would be able to read such long books.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have wanted to be part of a &lt;a href="https://www.recurse.com/about" rel="noopener noreferrer"&gt;Recurse Center&lt;/a&gt; batch ever since I found out the center exists. The thought of spending extended time learning about what I want brings me joy. Getting myself to a place where I feel comfortable embarking in self-guided learning &lt;em&gt;for hard things&lt;/em&gt; is also a huge motivation for me.&lt;/p&gt;

&lt;p&gt;Working at something you want to get better at and &lt;a href="https://www.healthyplace.com/blogs/buildingselfesteem/2015/11/feel-motivated-and-confident-with-this-dbt-skill" rel="noopener noreferrer"&gt;building mastery&lt;/a&gt; is a Dialectical Behavioral Therapy skill used to increase self-confidence. Through years of DBT, I have learned that when you want to achieve something, you need clear, actionable steps to get there; otherwise, you're setting yourself up for failure. I know I want to be able to learn any difficult topic, so I must practice learning a difficult topic, reflect on what worked and what didn't, and continue.&lt;/p&gt;

&lt;h1&gt;
  
  
  Expectations
&lt;/h1&gt;

&lt;p&gt;I looked to what the Recurse Center &lt;a href="https://www.recurse.com/what-we-look-for" rel="noopener noreferrer"&gt;looks for in applicants&lt;/a&gt; for a good model of possible habits to strive towards. I created the Daily Affirmations graphic below and set it as my screen background. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy0jivmgxag1loqg87kx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy0jivmgxag1loqg87kx.png" alt="My affirmations" width="800" height="1030"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To be clear, I don't think you need all of these to succeed. For example, I don't think you need to enjoy programming to get better at it; however, these aspirations align with my interests. I do enjoy programming! Doing activities that bring us happiness often increases well-being. What can I do to feed this interest? I find this reminder grounding whenever I am feeling frustrated by mundane work or feeling external pressure that doesn't align with my values.&lt;/p&gt;

&lt;p&gt;Also, note that one of my values is being &lt;em&gt;intellectually honest&lt;/em&gt;. I don't pretend to know something really well if I don't! To me, this isn't about moral superiority but rather the opportunities that open up when you are honest with yourself. When you fill in what you know about a topic you can see where the gaps are in your understanding and seek help. One of my fears when I started in this field was stagnation. I have learned over time that it is rare for things to take you by surprise when you are honest with yourself and practice self-awareness. Being honest with yourself also means being kind to yourself and that is so much easier to do when you know that you don't understand pointers because you are still fuzzy on references, for example, instead of rejecting C altogether because you've been struggling for a while.&lt;/p&gt;

&lt;h1&gt;
  
  
  Learning Goals
&lt;/h1&gt;

&lt;p&gt;At first, I knew I wanted to learn something thoroughly but I wasn't sure what exactly so I wrote down a list of interests in a Google doc. This is that list:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What are my interests?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementation of different database types i.e NoSQL, SQL, graph&lt;/li&gt;
&lt;li&gt;Seeing how changes to implementation impact performance&lt;/li&gt;
&lt;li&gt;Database performance&lt;/li&gt;
&lt;li&gt;Systems&lt;/li&gt;
&lt;li&gt;Optimization of code at the lowest level i.e. assembly&lt;/li&gt;
&lt;li&gt;Drivers&lt;/li&gt;
&lt;li&gt;How networks work&lt;/li&gt;
&lt;li&gt;Physics of wi-fi&lt;/li&gt;
&lt;li&gt;How engines work e.g. storage engine, what does engine mean?&lt;/li&gt;
&lt;li&gt;How does JVM work? What is Java bytecode? What does that mean?&lt;/li&gt;
&lt;li&gt;Regex and state machines&lt;/li&gt;
&lt;li&gt;Designing distributed systems&lt;/li&gt;
&lt;li&gt;Assembler commands to machine commands, CPU understands binary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Designing Data Intensive Applications&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database algorithms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Database Reliability Engineering&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SLAs&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is a lot going on in this list. To know something well, you must first know it not so well. I currently use Python at work so I decided to learn this language thoroughly. I also noticed that the JavaScript community is welcoming and there is lots of accessible learning material out there. Learning JavaScript alongside Python should give me a chance to touch on some of the topics I am interested in like performance, low level details of languages, and how engines work.&lt;/p&gt;

&lt;h1&gt;
  
  
  Desired Outcomes
&lt;/h1&gt;

&lt;p&gt;I know I want to know Python and JavaScript thoroughly, but because I haven't created a learning plan of this size and scope yet, there are still many unknowns. &lt;/p&gt;

&lt;p&gt;I know I need to reinforce my learning, so I will be blogging about what I learn along the way. I am also gathering all my notes in the same place so I can clearly see where the gaps in my knowledge are. I decided to go with &lt;a href="https://www.literatureandlatte.com/scrivener/overview" rel="noopener noreferrer"&gt;Scrivener&lt;/a&gt;, a word processor used for putting together literary works. I like it because it allows you to (re-)organize your thoughts into sections and subsections easily and integrates with &lt;em&gt;BibTeX&lt;/em&gt; for citation management. &lt;/p&gt;

&lt;p&gt;This is what the project structure looks like right now:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5zh1ake8el678cxpk1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5zh1ake8el678cxpk1g.png" alt="Project structure" width="648" height="966"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I add subtopics as I go. I am still looking for a good language implementation book. I am thinking about getting "the dragon book". &lt;strong&gt;If you have any recommendations, please let me know!&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Progress so far
&lt;/h1&gt;

&lt;p&gt;I am making good progress! Learning about JavaScript in conjunction with Python has made it easier to recognize language implementation patterns and what the lingo for those patterns is. For example, I came across this excellent &lt;a href="https://blog.bitsrc.io/understanding-execution-context-and-execution-stack-in-javascript-1c9ea8642dd0" rel="noopener noreferrer"&gt;JavaScript execution context post&lt;/a&gt;. I realized that though I knew of the concept of an execution context, I had not thought about it formally. &lt;strong&gt;Knowing what keywords to search for is so important.&lt;/strong&gt; By looking up Python execution context info, I learned more about PYTHONPATH and why some code of mine a while ago was acting the way it was. Now I know what to search for when learning &lt;em&gt;any&lt;/em&gt; new programming language. &lt;/p&gt;
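&lt;p&gt;For instance, the module search path that &lt;code&gt;PYTHONPATH&lt;/code&gt; feeds into can be inspected directly. This small sketch is my own illustration (not from the linked post) of where Python actually looks when resolving imports:&lt;/p&gt;

```python
import os
import sys

# Python resolves imports by scanning sys.path in order; the PYTHONPATH
# environment variable (if set) contributes entries near the front of
# that list. Inspecting it can explain surprising behavior, e.g. when
# an unexpected module shadows the one you meant to import.

print(sys.path[:3])  # the first few search locations

extra = os.environ.get("PYTHONPATH")
if extra:
    print("PYTHONPATH contributes:", extra.split(os.pathsep))
else:
    print("PYTHONPATH is not set; sys.path still has its defaults.")
```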

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Making a plan for myself and starting with the basics like creating motivational content for myself has been helpful. I found something to aspire to (joining a Recurse Center batch) that already had a basic guide on the habits I need to get to my goal. I chose topics to focus on and created a structure that lets me see what I'm missing in order to fully understand a concept.&lt;/p&gt;

&lt;p&gt;I am actively writing down what I learn and reflecting on both the content and execution (no pun intended!). I have found that learning this way is super fun. I don't feel burdened with completing an entire textbook before proceeding to the next topic. I can switch from JavaScript to Python and vice versa when I get bored or when a concept is difficult to understand in one language. I constantly find new things to try out, like profiling Python code or deploying my own vanilla JS site to my new domain (!), that give me a quick feeling of satisfaction in between the difficult concepts like EBNF grammar files and lexical environments. &lt;/p&gt;

&lt;p&gt;Importantly, I notice that I am making connections between the material I learn &lt;em&gt;for fun&lt;/em&gt; and the material I learn &lt;em&gt;for work&lt;/em&gt; without the imposter syndrome anxiety. &lt;strong&gt;I see that I am growing as a person and developing interests that are completely my own and not fueled by a paycheck which has increased my feelings of self-efficacy.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;I'd love to hear about your learning plans and reflections! I have seen how some of you on this platform use blogging to keep yourselves accountable in your learning and it's super motivating! Keep up the good work, folks :)&lt;/p&gt;

</description>
      <category>python</category>
      <category>javascript</category>
      <category>learningplan</category>
    </item>
    <item>
      <title>I get nervous as a mentor, too!</title>
      <dc:creator>Flo Comuzzi</dc:creator>
      <pubDate>Fri, 01 Mar 2019 02:26:35 +0000</pubDate>
      <link>https://dev.to/flopi/i-get-nervous-as-a-mentor-too-2954</link>
      <guid>https://dev.to/flopi/i-get-nervous-as-a-mentor-too-2954</guid>
      <description>&lt;p&gt;Dear Reader, &lt;br&gt;
I want to be candid with you about my experiences as a mentor. I have mentored steadily in some capacity since I started working in industry. Mentoring caused complex emotions for me in the beginning. I love helping people and I took on the responsibility knowing that giving anything less than 100% would be difficult for me. I put so much pressure on myself. &lt;em&gt;I was nervous.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you have a mentor, you may also be nervous about your interactions with them. I remember fearing that I was wasting my mentors' time. I feared that their offers to answer any questions or meet up in the future were only courtesies. Now, as a mentor, I know that when I offer my time to someone, I mean it. I &lt;em&gt;love&lt;/em&gt; when someone reaches out after I offer help. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If I enthusiastically tell you to email me anytime and give you my card, please know that I have already decided that you are an intelligent person with important things to say and I am incredibly interested in hearing more from you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I see incredible potential in you. I think you are special. I want to see you succeed. You have made a great impression on me. Please do not forget this!&lt;/p&gt;

&lt;p&gt;If you are a new mentor, you should remind yourself that your mentee is likely incredibly nervous because they think you're a cool person. Recently, I met a wonderful and smart young person who, as soon as they met me, expressed how awesome they think I am because I work at [cool place]. I was so struck by the idea that someone looks up to me for being at a job that they would like to be at that I froze for a few seconds. I instantly saw the worry in their face from my lack of response. I remembered how inspiring/intimidating it was for me to meet engineers working in industry. &lt;em&gt;As a mentor, cherish the moments your mentee expresses genuine emotion, whether happiness, sadness, fear, or anger.&lt;/em&gt; It is difficult to be honest with people we respect for fear that we may come off as inappropriate or unprofessional. Remember to practice empathy! If your mentee expresses something that throws you off and you don't know what to say, think back to when you were in a similar position as your mentee and share some tidbit about what that was like for you. That's usually enough to comfort your mentee and let them know they haven't said something egregious. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Though I still get anxious from time to time, these feelings have mostly decreased in intensity as I have come to understand the dynamics of mentoring relationships more.  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You'll get there too! Stay honest with yourself throughout the process, validate your feelings, and know that you'll get better at this thing. The lessons you'll learn and the relationships you'll build by sticking with it are truly worth it.&lt;/p&gt;

&lt;p&gt;Signed, &lt;br&gt;
A Very Eager Mentor&lt;/p&gt;

</description>
      <category>mentorship</category>
    </item>
  </channel>
</rss>
