Apache Spark Basics

#beginners #productivity #bitesize

This is a basic cheat sheet and glossary for getting started with Apache Spark. Every time we share a new post with terms or code snippets, they will also appear here in a generic form.

If you work with Apache Spark and are looking for a cheat sheet, this is for you as well!

First things first:

-1- the workspace:

First, we need to create the workspace. We are using a Databricks workspace, and here is a tutorial for creating it.

-2- Basic Apache Spark vocabulary:

DataFrame

A DataFrame is a distributed collection of data organized into named columns. It provides operations to filter, group, or compute aggregates. The data is often distributed across multiple machines, and it can be held in memory or on disk.

Dataset

A Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row objects.
Datasets are "lazy", meaning that computations are only triggered when an action is invoked.

RelationalGroupedDataset

A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup.

This is an evolving page and more terms, code snippets and architecture design will be added.
