DEV Community

Cover image for Rye Tables vs Python/Pandas: A Different Way to Wrangle Data
Refaktor
Refaktor

Posted on • Originally published at ryelang.org

Rye Tables vs Python/Pandas: A Different Way to Wrangle Data

If you’ve ever filtered a CSV in Pandas or written multiple nested for loops to group data in Python, you’ll probably find Rye’s take on the problem interesting.

Rye is a small, homoiconic language inspired by Rebol and Factor. One of its most interesting value types is the Table. A first-class, immutable data structure for tabular data.

Tables are mostly manipulated with pure, composable (endomorphic1) functions. By using Rye's pipe and op-words, common wrangling tasks become small, chainable expressions.

This post compares how three different approaches handle the same data-wrangling problems:

  • Rye tables
  • Python with Pandas
  • Plain Python (no libraries)

Along the way we’ll look at what makes Rye’s model unique, where Pandas shines, and why pure Python becomes verbose as complexity grows.


Rye, in this case, is a programming language - not the package manager with the same name.

I worked on Rye for years before I saw a post about Rye - the package manager on HN. That tool seems to have been superseded by uv… so time fixes some confusions on its own. :P

Rye aims to be a higher higher-level language with value types that reflect more how people work with information vs. how computers work. One such value type is a Table.

To learn more about tables, visit the Cookbook page: Working with Tables.


#1: Filtering and Transforming Sci-fi books

Scenario: Given a dataset of sci-fi books, find all books published after 1980, add a column showing how old each book is, sort by age, and limit to top 2 results.

This first example represents a common data processing pattern: filtering, transforming, and sorting a dataset.

Rye

books: table { 'title 'author 'year }
{ "Neuromancer" "William Gibson" 1984
  "Snow Crash" "Neal Stephenson" 1992
  "Foundation" "Isaac Asimov" 1951
  "Dune" "Frank Herbert" 1965 }

books .where-greater 'year 1980
|add-column 'age { year } { 2025 - year }
|order-by 'age 'asc
|columns? { 'title 'author 'age }
|display

; Prints:
; | title       | author          | age |
; +-------------+-----------------+-----+
; | Snow Crash  | Neal Stephenson | 33  |
; | Neuromancer | William Gibson  | 41  |
Enter fullscreen mode Exit fullscreen mode

Rye's example is written using pipe-words ( |word ), very common to Rye, and a fact that base Table functions are endomorphic. They accept a table and return a new table. This means you can compose and combine them in any order you need them to.

Find out more about Rye's op and pipe-words

Python Pandas

Pandas supports very similar concepts, so the code is very similar in shape and size.

import pandas as pd

books = pd.DataFrame([
    {"title": "Neuromancer", "author": "William Gibson", "year": 1984},
    {"title": "Snow Crash", "author": "Neal Stephenson", "year": 1992},
    {"title": "Foundation", "author": "Isaac Asimov", "year": 1951},
    {"title": "Dune", "author": "Frank Herbert", "year": 1965}
])

result = (books[books['year'] > 1980]
         .assign(age=lambda x: 2025 - x['year'])
         .sort_values('age')
         [['title', 'author', 'age']])
print(result)
Enter fullscreen mode Exit fullscreen mode

Pandas does introduce some unusual syntax for Python that is not that self-explanatory. It seems to
heavily utilize the accessor syntax.

  • books[books['year'] > 1980]
  • [['title', 'author', 'age']]

Python

Contrary to the above two, pure Python is a staple of imperative code, for its good and its bad. It's a little
more verbose, but it's very basic and not hard to see what exactly is going on.

books = [
    {"title": "Neuromancer", "author": "William Gibson", "year": 1984},
    {"title": "Snow Crash", "author": "Neal Stephenson", "year": 1992},
    {"title": "Foundation", "author": "Isaac Asimov", "year": 1951},
    {"title": "Dune", "author": "Frank Herbert", "year": 1965}
]

# Filter books after 1980
filtered_books = []
for book in books:
    if book["year"] > 1980:
        filtered_books.append(book)

# Add age column
for book in filtered_books:
    book["age"] = 2025 - book["year"]

# Sort by age
filtered_books.sort(key=lambda x: x["age"])

# Display a specific table
print("| title | author | age |")
print("+-------+--------+-----+")
for book in filtered_books:
    print(f"| {book['title']:<12} | {book['author']:<15} | {book['age']:2} |")
Enter fullscreen mode Exit fullscreen mode

Despite its simplicity, contrary to the above examples this code doesn't display intent. We see 3 for loops but each is
used for an absolutely different purpose. So comments that explain intent make sense here.

#2: Joining and Aggregating Space Colonies

Scenario: Join colony data with planet information, then group by planet type to calculate total population and average colony size, and save the result as an Excel (.xlsx) file.

Moving beyond simple transformations, this example tackles relational operations—joining datasets and performing aggregations.

Rye

colonies: table { 'id 'name 'planet_id 'population }
{ 1 "New Terra" 1 50000
  2 "Olympus" 1 75000
  3 "Titan Base" 2 12000 }

planets: table { 'id 'name 'type }
{ 1 "Mars" "Terrestrial"
  2 "Titan" "Moon" }

colonies .inner-join planets 'planet_id 'id
|group-by 'type { 'type count 'population sum 'population avg }
|save\xlsx %colonies.xlsx
|display

; Prints:
; | type        | type_count | population_sum | population_avg |
; +------------------------------------------------------------+
; | Terrestrial | 2          | 125000.000000  | 62500.000000   |
; | Moon        | 1          | 12000.000000   | 12000.000000   |
Enter fullscreen mode Exit fullscreen mode

In Rye even what are usual special forms are just functions, and this example just uses a few more functions.
Like add-column in the first example, that accepted a table and two blocks of code/data. Here we have a group-by
function with its third argument as its own little microdialect for specifying column to aggregate on and aggregate functions to use.

Rye dialects often rely on Rye's many word types. Here a lit word ( 'word ) or string would denote a column and a word ( word ) denotes
an aggregate function to apply.

Python Pandas

import pandas as pd

colonies = pd.DataFrame([
    {"id": 1, "name": "New Terra", "planet_id": 1, "population": 50000},
    {"id": 2, "name": "Olympus", "planet_id": 1, "population": 75000},
    {"id": 3, "name": "Titan Base", "planet_id": 2, "population": 12000}
])

planets = pd.DataFrame([
    {"id": 1, "name": "Mars", "type": "Terrestrial"},
    {"id": 2, "name": "Titan", "type": "Moon"}
])

result = (colonies.merge(planets, left_on='planet_id', right_on='id')
         .groupby('type')['population']
         .agg(['count', 'sum', 'mean'])
         .reset_index())

# Save to Excel file (requires: pip install openpyxl)
result.to_excel('colonies.xlsx', index=False)
print(result)
Enter fullscreen mode Exit fullscreen mode

Again, Pandas example is of similar shape as Rye one. And again it seems we use some magical accessor syntax:

.groupby('type')['population']

Python

colonies = [
    {"id": 1, "name": "New Terra", "planet_id": 1, "population": 50000},
    {"id": 2, "name": "Olympus", "planet_id": 1, "population": 75000},
    {"id": 3, "name": "Titan Base", "planet_id": 2, "population": 12000}
]

planets = [
    {"id": 1, "name": "Mars", "type": "Terrestrial"},
    {"id": 2, "name": "Titan", "type": "Moon"}
]

# Manual join operation
joined_data = []
for colony in colonies:
    for planet in planets:
        if colony["planet_id"] == planet["id"]:
            joined_row = {**colony, **planet}
            joined_row["planet_name"] = planet["name"]
            joined_data.append(joined_row)

# Manual grouping and aggregation
from collections import defaultdict
groups = defaultdict(list)
for row in joined_data:
    groups[row["type"]].append(row)

results = []
for planet_type, group_data in groups.items():
    total_pop = sum(row["population"] for row in group_data)
    avg_pop = total_pop / len(group_data)
    results.append({
        "type": planet_type,
        "count": len(group_data),
        "total_population": total_pop,
        "avg_population": avg_pop
    })

# Display results
print("| type | count | total_pop | avg_pop |")
print("+------+-------+-----------+---------+")
for result in results:
    print(f"| {result['type']:<11} | {result['count']:3} | {result['total_population']:8} | {result['avg_population']:7.0f} |")

# Note: Excel export would require additional libraries like openpyxl or xlsxwriter.
# We don't save to Excel here.
Enter fullscreen mode Exit fullscreen mode

This example highlights how complexity scales differently across approaches. The manual Python implementation becomes significantly more verbose when dealing with joins and aggregations, while both Pandas and Rye maintain relatively concise syntax.

Interactively solving Examples #1 and #2

You can visit asciinema to see the first two examples run interactively in a Rye console (REPL) while displaying the intermediate results.

Rye focuses on being especially useful in an interactive setting.

asciicast

#3: Nesting multi-table Exoplanet data

Scenario: Read 3 CSV files containing exoplanet data, observations, and station information. For each exoplanet, collect the top 3 observations (by confidence) and sum the population of all stations stationed near that exoplanet. Output as multi-layered JSON.

This example demonstrates more complex relational operations across multiple tables, joining data, nested aggregations, and structured output generation.

For better information organisation we load all data into a data context. All functions here again are endomorphic.

Rye

data: context {
    load\csv %exoplanets.csv |autotype 0.95 :exoplanets
    load\csv %observations.csv |autotype 0.95 :observations
    load\csv %stations.csv |autotype 0.95 :stations
} 

data/exoplanets
|add-column 'observations { id } {
    data/observations
    |where-equal 'exoplanet_id id
    |order-by 'confidence 'desc
    |columns? { 'date 'notes }
    |head 3
}
|add-column 'stationed_population { id } {
    data/stations
    |where-equal 'stationed_near id
    |column? 'population
    |sum
}
|drop-column 'id
|to-json
|print
Enter fullscreen mode Exit fullscreen mode

This example uses the fact that Table is a basic Rye value type, so it only makes sense that table cells can also hold tables. We add a column to the base table
that holds tables to the top 3 observations, and we add another where we manually aggregate a value to its sum.

Python Pandas

import pandas as pd
import json

# Load CSV files with automatic type inference
exoplanets = pd.read_csv('exoplanets.csv')
observations = pd.read_csv('observations.csv')
stations = pd.read_csv('stations.csv')

# Functional approach with `apply()` requires helper functions for nested operations
def get_observations_for_exoplanet(exoplanet_id):
    """Get top 3 observations for an exoplanet, ordered by confidence desc"""
    filtered_obs = observations[observations['exoplanet_id'] == exoplanet_id]
    sorted_obs = filtered_obs.sort_values('confidence', ascending=False)
    top_obs = sorted_obs[['date', 'notes']].head(3)
    return top_obs.to_dict('records')

def get_stationed_population(exoplanet_id):
    """Get sum of population for stations near an exoplanet"""
    filtered_stations = stations[stations['stationed_near'] == exoplanet_id]
    return int(filtered_stations['population'].sum())

# Add nested observations and stationed population data
exoplanets['observations'] = exoplanets['id'].apply(get_observations_for_exoplanet)
exoplanets['stationed_population'] = exoplanets['id'].apply(get_stationed_population)

# Remove the id column and convert to JSON
result_json = exoplanets.drop('id', axis=1).to_json(orient='records', indent=2)
print(result_json)
Enter fullscreen mode Exit fullscreen mode

In this case the code between Pandas and Rye is not so similar. It seems Pandas requires the use of
helper functions, that are then applied to get the aggregated / nested values.

Python

import csv
import json
from collections import defaultdict

# Load exoplanets
exoplanets = []
with open('exoplanets.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        row['id'] = int(row['id'])
        row['mass'] = float(row['mass'])
        row['radius'] = float(row['radius'])
        row['temp'] = float(row['temp'])
        row['discovery_year'] = int(row['discovery_year'])
        exoplanets.append(row)

# Load observations
observations = []
with open('observations.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        row['exoplanet_id'] = int(row['exoplanet_id'])
        row['confidence'] = float(row['confidence'])
        observations.append(row)

# Load stations
stations = []
with open('stations.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        row['population'] = int(row['population'])
        row['stationed_near'] = int(row['stationed_near'])
        stations.append(row)

# Group observations by exoplanet_id for efficient lookup
obs_by_planet = defaultdict(list)
for obs in observations:
    obs_by_planet[obs['exoplanet_id']].append(obs)

# Group stations by stationed_near for efficient lookup
stations_by_planet = defaultdict(list)
for station in stations:
    stations_by_planet[station['stationed_near']].append(station)

# Process each exoplanet
for exoplanet in exoplanets:
    planet_id = exoplanet['id']

    # Get top 3 observations by confidence
    planet_observations = obs_by_planet[planet_id]
    planet_observations.sort(key=lambda x: x['confidence'], reverse=True)
    top_observations = []
    for obs in planet_observations[:3]:
        top_observations.append({
            'date': obs['date'],
            'notes': obs['notes']
        })
    exoplanet['observations'] = top_observations

    # Sum stationed population
    total_population = sum(station['population'] 
                          for station in stations_by_planet[planet_id])
    exoplanet['stationed_population'] = total_population

    # Remove the id column
    del exoplanet['id']

# Convert to JSON and print
result_json = json.dumps(exoplanets, indent=2)
print(result_json)
Enter fullscreen mode Exit fullscreen mode

The pure-Python version quickly becomes verbose, so we would have to heavily refactor it or of course use
a library like Pandas.

Example inputs

The three csv files we use in the example above:

exoplanets.csv

id,name,mass,radius,temp,discovery_year
1,Kepler-296e,1.79,1.5,251,2014
2,TOI-715b,3.02,1.55,280,2023
3,Gliese-667Cc,3.8,1.5,277,2011
Enter fullscreen mode Exit fullscreen mode

observations.csv

exoplanet_id,date,confidence,notes
1,2015-01-20,0.95,Primary transit detected
1,2015-02-15,0.92,Secondary eclipse confirmed
1,2020-03-10,0.98,Follow-up observation
2,2015-01-18,0.89,Initial detection
3,2014-05-12,0.94,Multiple transits observed
2,2014-08-30,0.96,Orbital period confirmed
3,2014-05-12,0.88,Shallow transit
3,2014-11-15,0.90,Period validation
3,2023-02-14,0.97,Recent discovery
Enter fullscreen mode Exit fullscreen mode

stations.csv

name,population,stationed_near
Alpha Station,15000,1
Beta Outpost,8500,2
Gamma Research Base,2300,1
Delta Mining Station,12000,1
Epsilon Colony,45000,3
Zeta Trading Hub,22000,3
Enter fullscreen mode Exit fullscreen mode

Example output

This is the JSON output that running a Rye script for example 3 produces.

[
  {
    "name": "Kepler-296e",
    "mass": 1.79,
    "radius": 1.5,
    "temp": 251,
    "discovery_year": 2014,
    "observations": [
      {
        "date": "2020-03-10",
        "notes": "Follow-up observation"
      },
      {
        "date": "2015-01-20",
        "notes": "Primary transit detected"
      },
      {
        "date": "2015-02-15",
        "notes": "Secondary eclipse confirmed"
      }
    ],
    "stationed_population": 29300
  },
  {
    "name": "TOI-715b",
    "mass": 3.02,
    "radius": 1.55,
    "temp": 280,
    "discovery_year": 2023,
    "observations": [
      {
        "date": "2014-08-30",
        "notes": "Orbital period confirmed"
      },
      {
        "date": "2015-01-18",
        "notes": "Initial detection"
      }
    ],
    "stationed_population": 8500
  },
  {
    "name": "Gliese-667Cc",
    "mass": 3.8,
    "radius": 1.5,
    "temp": 277,
    "discovery_year": 2011,
    "observations": [
      {
        "date": "2023-02-14",
        "notes": "Recent discovery"
      },
      {
        "date": "2014-05-12",
        "notes": "Multiple transits observed"
      },
      {
        "date": "2014-11-15",
        "notes": "Period validation"
      }
    ],
    "stationed_population": 67000
  }
]
Enter fullscreen mode Exit fullscreen mode

More information

If you find Rye tables interesting, you can read more about them in the Cookbook page:

Cookbook / Working with Tables

To read more about the concepts and specifics of Rye visit:

For code, visit: Rye's Github

Conclusion

Should you ditch Python and Pandas and use Rye? Of course not. Rye is still a work in progress programming language and it's nowhere near
stability and functionality compared to Python or Pandas
.

I would love feedback on the concepts and whether Rye's table abstractions could fit into your workflow. I’d also love suggestions on how to make it even more useful and internally consistent.

Here, it means table functions accept a table and return a new table, so you can stack them together without losing type consistency.


Updates after publishing

15.Sep.2025 - A lot of comments we got mentioned Python Polars library, that also made some similar design decisions (immutability, endomorphism). I will look into it.

15.Sep.2025 - A user n1000 from Lobste.rs wrote the same examples in R language. R is a popular lanuauge for statistics, data science, visualisation, and has a DataFrame value type, which was also one of inspirations for Tables. Examples in R

16.Sep.2025 - Comments on lobste.rs are interesting and provide more ideas and information lobste.rs thread


  1. Endomorphic means “mapping a set onto itself.”  

Top comments (0)