Sophia Parafina

Posted on Mar 8

Relicensing Open Source with an AI

Simon Willison posed this question in his blog:

"Can coding agents relicense open source through a “clean room” implementation of code?"

Good question! I decided to talk to Claude about it. Here is the transcript from that session.

TL;DR: Maybe, sorta.

Clean Room Implementation Discussion

Exchange 1: Initial Question

User:
If given an open source GitHub repository, can you write a spec from the FORTRAN in the repository? Using that spec, can you write a clean room implementation in Python?

Assistant:
Yes, I can do both, but with important caveats and legal considerations:

Part 1: Writing a Specification from FORTRAN Code

What I CAN do:

✅ Read FORTRAN source code from a GitHub repository
✅ Analyze the algorithm's logic, inputs, outputs, and data flow
✅ Document the mathematical operations and computational methods
✅ Describe what the code does at a functional level
✅ Create a language-agnostic specification of the algorithm
✅ Identify key functions, subroutines, and their purposes
✅ Document expected behavior and edge cases

Example Process:

FORTRAN Code → Analyze → Specification Document
                ↓
        - Algorithm description
        - Input/output formats
        - Mathematical operations
        - Data structures
        - Expected behavior

Part 2: Clean Room Implementation in Python

What "Clean Room" Means:
A clean room implementation is created by someone who has never seen the original source code, working only from the specification. This is a legal strategy to avoid copyright infringement.

The Problem with AI doing Clean Room:
❌ I cannot truly do a "clean room" implementation because:

I've already seen the FORTRAN code when writing the spec
True clean room requires a different person who hasn't seen the original
This is the legal requirement to avoid derivative work claims

What I CAN Actually Do

Approach 1: Specification + Implementation (Not Legally Clean Room)

I can create both, but this is NOT a clean room implementation in the legal sense:

# Step 1: Analyze FORTRAN
"""
Example FORTRAN subroutine analysis:
- Computes seismic travel time using ray tracing
- Inputs: source depth, receiver distance, velocity model
- Output: P-wave travel time
- Algorithm: Snell's law applied iteratively
"""

# Step 2: Python implementation based on spec
def compute_travel_time(depth, distance, velocity_model):
    """
    Compute seismic travel time using ray tracing.
    Based on specification derived from original algorithm.
    """
    # Implementation here
    pass

Legal Status: This cannot be called clean room, and it may be treated as a derivative work, because the same party (me) both saw the original code and wrote the new version.

Approach 2: Algorithm Description Only (Safe)

What I should do instead:

## Specification Document

### Function: compute_travel_time

**Purpose:** Calculate seismic wave travel time through layered Earth model

**Algorithm:**
1. Initialize ray parameter
2. For each layer in velocity model:
   - Apply Snell's law: p = sin(i)/v
   - Calculate ray path segment
   - Accumulate travel time
3. Return total time

**Inputs:**
- source_depth: float (km)
- receiver_distance: float (km)  
- velocity_model: array of (depth, velocity) pairs

**Outputs:**
- travel_time: float (seconds)

**Mathematical Basis:**
- Snell's Law: sin(θ₁)/v₁ = sin(θ₂)/v₂
- Travel time: t = ∫ ds/v(s)

Then a different developer (not me) would implement this in Python without seeing the FORTRAN.
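As a rough illustration of what that second developer might produce from the spec alone, here is a minimal sketch. It assumes flat, constant-velocity layers given as (thickness, velocity) pairs and an already known ray parameter p, and it traces only the one-way downgoing leg. The function name and the layer format are hypothetical and do not come from any FORTRAN source.

```python
import math

def trace_downgoing_ray(p, layers):
    """Accumulate travel time and horizontal offset through flat layers.

    p      : ray parameter sin(i)/v in s/km (assumed known)
    layers : iterable of (thickness_km, velocity_km_s) pairs (assumed format)
    """
    total_time = 0.0
    total_offset = 0.0
    for thickness, velocity in layers:
        # Snell's law: p = sin(i)/v stays constant along the ray
        sin_i = p * velocity
        if sin_i >= 1.0:
            raise ValueError("ray turns before reaching the bottom of this layer")
        cos_i = math.sqrt(1.0 - sin_i * sin_i)
        total_time += thickness / (velocity * cos_i)  # path length / velocity
        total_offset += thickness * sin_i / cos_i     # thickness * tan(i)
    return total_time, total_offset

# A vertical ray (p = 0) through 10 km at 5 km/s, then 20 km at 10 km/s
time_s, offset_km = trace_downgoing_ray(0.0, [(10.0, 5.0), (20.0, 10.0)])
print(time_s, offset_km)
```

A full implementation of the spec would still need to search for the ray parameter that reaches the requested receiver distance; this sketch covers only the "for each layer" step.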

Practical Approaches That ARE Legal

Option A: Use Existing Python Ports

Many FORTRAN seismology codes already have Python implementations:

TauP (seismic travel times) → ObsPy has obspy.taup
SAC (Seismic Analysis Code) → ObsPy can read/write SAC
SPECFEM (spectral elements) → Has Python wrappers

Option B: Call FORTRAN from Python

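import subprocess
import numpy as np

def call_fortran_program(input_file, output_file):
    """Wrapper around an existing FORTRAN executable."""
    # check=True raises CalledProcessError if the program exits with an error,
    # so a stale or missing output file is never read by mistake
    subprocess.run(['./fortran_program', input_file, output_file], check=True)
    return np.loadtxt(output_file)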

Option C: Use f2py (FORTRAN to Python)

# Automatically wrap FORTRAN code
# f2py -c fortran_module.f90 -m fortran_module

import fortran_module

result = fortran_module.compute_something(input_data)

Legal Considerations

When You CAN Reimplement:

✅ If the repository has a permissive license:

MIT, BSD, Apache 2.0 allow derivative works
You can study the code and rewrite in Python
Must maintain original license and attribution

✅ If implementing a published algorithm:

Scientific papers describe algorithms
Mathematical formulas are not copyrightable
You can implement the published method

✅ If using clean room with TWO people:

Person A reads FORTRAN, writes spec (no code)
Person B reads spec only, writes Python
Person B never sees FORTRAN code

When You CANNOT:

❌ Copyright infringement:

Copying proprietary code without permission
Violating copyleft licenses (a derivative of GPL code must itself be released under the GPL)
Translating code line-by-line without authorization

❌ Patent issues:

Some algorithms may be patented
Implementation could violate patents regardless of clean room

What I Can Help You With

1. Algorithm Understanding

I can analyze FORTRAN code and explain:
- What algorithm it implements
- Mathematical basis
- Key computational steps
- Expected inputs/outputs
- Performance characteristics

2. Specification Writing

I can create detailed specifications:
- Function signatures
- Data structures
- Algorithm pseudocode
- Test cases
- Validation criteria

3. Python Implementation Guidance

# I can suggest Python approaches:
def suggested_implementation():
    """
    Guidance on:
    - Appropriate Python libraries (NumPy, SciPy)
    - Data structures to use
    - Optimization strategies
    - Testing approaches
    """
    pass

4. What I Won't Do

❌ Help circumvent copyrights or licenses
❌ Claim a line-by-line translation is "clean room"
❌ Implement proprietary algorithms without permission

Practical Example

If you give me a FORTRAN seismology code, I can:

## Step 1: Analysis
- Identify it computes focal mechanisms from first motions
- Uses grid search over strike/dip/rake
- Outputs beach ball plots

## Step 2: Specification
- Algorithm: Grid search optimization
- Inputs: Station azimuths, polarities, uncertainties
- Outputs: Best-fit focal mechanism parameters
- Constraints: Physical bounds on angles

## Step 3: Recommendation
"Beach ball plotting is already available in ObsPy's 
obspy.imaging.beachball module, and Pyrocko provides 
moment tensor tools. Suggest using those instead 
of reimplementing."

## Step 4: If You Must Reimplement
Provide architectural guidance:
- Use NumPy for grid computation
- Use matplotlib for visualization
- Structure as class with methods
- Include unit tests
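The grid search described in Step 2 could be structured roughly as follows. This is only a sketch: `grid_search` and the `misfit` callable are hypothetical names, the real misfit (for example, the fraction of mismatched first-motion polarities) would be supplied by the caller, and the standard library is used in place of NumPy to keep the example dependency-free.

```python
import itertools

def grid_search(misfit, strikes, dips, rakes):
    """Return (best_params, best_misfit) over all strike/dip/rake combinations."""
    best_params, best_value = None, float("inf")
    for params in itertools.product(strikes, dips, rakes):
        value = misfit(*params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value

# Toy misfit with a known minimum at (30, 60, -90), purely for demonstration
def toy_misfit(strike, dip, rake):
    return abs(strike - 30) + abs(dip - 60) + abs(rake + 90)

best, value = grid_search(
    toy_misfit,
    strikes=range(0, 360, 10),
    dips=range(0, 91, 10),
    rakes=range(-180, 181, 10),
)
print(best, value)
```

Passing the misfit in as a function keeps the search loop independent of how polarities are modeled, which also makes it easy to unit test.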

Bottom Line

Can I do what you're asking?

Technically: Yes, I can read FORTRAN and write Python
Legally as "clean room": No, because I'd see both versions
Practically useful: Yes, I can create specs and guide implementation

Best approach:

Check if the code is already ported to Python
Check the license - if permissive, just port it directly
If you need clean room, have me write the spec, then you implement without looking at FORTRAN
Consider using f2py or subprocess wrappers instead

Would you like me to analyze a specific FORTRAN repository? I can:

Explain what it does
Document the algorithms
Suggest Python equivalents
Write implementation guidance

Just share the repository URL and I'll help within legal and ethical bounds!

Exchange 2: The Two-Session Workaround

User:
What if I just presented the spec to claude without the memory of generating the spec?

Assistant:
Excellent question! This is actually a clever workaround that addresses the "clean room" problem in an interesting way.

Does This Work?

Technically: Yes, with caveats
Legally: Probably, but not perfectly

Let me break this down:

The Two-Claude Approach

Scenario:

Session 1 (Claude A): 
  FORTRAN code → Reads and analyzes → Writes specification

Session 2 (Claude B - fresh chat):
  Specification only → Implements Python → No memory of FORTRAN

Does Claude B "remember" the FORTRAN?

No, in the technical sense:

✅ Each new chat session has no memory of previous sessions
✅ Claude B genuinely hasn't "seen" the FORTRAN code
✅ Claude B only has the specification to work from
✅ This mimics the two-person clean room approach

But there are complications:

Legal Gray Areas

Problem 1: Same AI, Same Training

Both "Claudes" are the same model with the same training:

If the FORTRAN code was in my training data, both sessions "know" it
I might unconsciously reproduce patterns from training
Not truly independent like two different humans

Problem 2: Specification Leakage

If the spec is too detailed, it might contain implementation details:

❌ BAD SPEC (Too detailed):
"Line 45: Use DO loop from I=1 to N incrementing by 1
 Store result in ARRAY(I) = SQRT(X(I)**2 + Y(I)**2)
 Use DSQRT for double precision"

✅ GOOD SPEC (Implementation-agnostic):
"Compute Euclidean distance from origin for each point
 Input: x and y coordinate arrays
 Output: distance array
 Precision: Double precision floating point"

The bad spec is basically transliteration; the good spec describes the what, not the how.
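To make the difference concrete, here is one way the good spec could be implemented without any knowledge of the original loop. It is a minimal sketch; the function name is hypothetical, and it relies on Python floats already being double precision.

```python
import math

def distances_from_origin(xs, ys):
    """Return the Euclidean distance from the origin for each (x, y) point."""
    if len(xs) != len(ys):
        raise ValueError("x and y arrays must have the same length")
    return [math.hypot(x, y) for x, y in zip(xs, ys)]

result = distances_from_origin([3.0, 0.0, 1.0], [4.0, 0.0, 1.0])
print(result)
```

Nothing in this version mirrors the DO loop, the index variable, or the DSQRT call from the bad spec, yet it satisfies every line of the good one.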

Problem 3: Legal Precedent Unclear

Clean room implementation has legal precedent with humans:

Courts recognize two different people = independent creation
No case law yet on AI "sessions" as "different entities"
Unclear if this would hold up legally

How to Make This More Defensible

Approach 1: Extremely Abstract Specification

Session 1 - Analysis:

## High-Level Specification

**Purpose:** Compute seismic ray paths through layered velocity model

**Inputs:**
- Source location (lat, lon, depth)
- Receiver locations (array of lat, lon, elevation)
- Velocity model (1D layered structure)

**Outputs:**
- Travel times for each source-receiver pair
- Ray parameter
- Takeoff angles

**Algorithm Type:** Numerical ray tracing using Snell's law

**Mathematical Basis:**
- Snell's law: p = r·sin(i)/v
- Ray parameter conservation
- Integration along ray path

**Requirements:**
- Handle velocity discontinuities
- Support multiple seismic phases (P, S)
- Accuracy: < 0.1 second error for teleseismic distances

Session 2 - Implementation:

# Claude B sees only the above spec
# Implements using standard numerical methods
# Chooses own data structures and algorithms

This is defensible because:

✅ Describes functionality, not implementation
✅ Based on published physics (Snell's law)
✅ No FORTRAN-specific details
✅ Multiple valid implementations possible

Approach 2: Reference Published Algorithms

Session 1:

## Specification

The code implements the algorithm described in:
- Buland, R., & Chapman, C. H. (1983). "The computation 
  of seismic travel times." Bulletin of the Seismological 
  Society of America, 73(5), 1271-1302.

Specifically implements:
- Section 3.2: Flat-earth transformation
- Section 4.1: Ray parameter calculation
- Section 4.3: Travel time integration

**Input format:** As described in Table 1 of paper
**Output format:** As described in Table 3 of paper

Session 2:

# Implement based on published paper + spec
# This is clearly legal - implementing published science

Why this works:

✅ Published algorithms are not copyrightable
✅ Mathematical formulas are facts
✅ Clear independent source (the paper)

Approach 3: Add Human Review

Session 1 (Claude A): FORTRAN → Specification
         ↓
Human reviewer: Check spec is abstract enough
         ↓
Session 2 (Claude B): Specification → Python
         ↓
Human reviewer: Verify no FORTRAN patterns copied

The human review adds defensibility.

Practical Workflow

Step-by-Step Process:

Session 1 - Specification Generation:

You: "Here's FORTRAN code from [repo]. Write an abstract 
     specification that describes WHAT it does, not HOW."

Claude A: [Generates abstract spec focused on:
          - Mathematical basis
          - Input/output contracts
          - Algorithm category (not specific implementation)
          - Test cases
          - Performance requirements]

Between Sessions:

You review the spec and ask:
- Does it describe functionality or implementation?
- Could someone implement this differently?
- Are there FORTRAN-specific details? (Remove them)
- Is it based on published methods? (Cite them)
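Part of this review can be assisted by a simple script. The sketch below flags lines in a spec that look like they carry FORTRAN implementation details. The pattern list is a hypothetical starting point, not an exhaustive or authoritative check, and it does not replace a human reading the spec.

```python
import re

# Hypothetical red-flag patterns; tune these for your own codebase
LEAK_PATTERNS = {
    "line reference": re.compile(r"\bline\s+\d+", re.IGNORECASE),
    "FORTRAN keyword": re.compile(
        r"\b(?:DO loop|SUBROUTINE|COMMON block|GOTO|DSQRT|IMPLICIT)\b"
    ),
    "array indexing": re.compile(r"\b[A-Z][A-Z0-9_]*\([A-Z]\)"),
    "exponent operator": re.compile(r"\*\*\s*\d"),
}

def find_spec_leaks(spec_text):
    """Return (line_number, label) pairs for lines that match a red-flag pattern."""
    findings = []
    for lineno, line in enumerate(spec_text.splitlines(), start=1):
        for label, pattern in LEAK_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings

bad_spec = (
    "Line 45: Use DO loop from I=1 to N incrementing by 1\n"
    "Store result in ARRAY(I) = SQRT(X(I)**2 + Y(I)**2)\n"
    "Use DSQRT for double precision"
)
good_spec = (
    "Compute Euclidean distance from origin for each point\n"
    "Input: x and y coordinate arrays\n"
    "Output: distance array\n"
    "Precision: Double precision floating point"
)

print(find_spec_leaks(bad_spec))
print(find_spec_leaks(good_spec))
```

An empty result only means none of the listed patterns matched; a spec can still leak structure in plain English.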

Session 2 - Fresh Chat:

You: "Implement this specification in Python. Choose your own
     data structures and algorithms."

Claude B: [Implements using:
          - Modern Python idioms
          - NumPy/SciPy where appropriate
          - Different algorithm choices where equivalent
          - Pythonic structure (not FORTRAN translated)]

What Makes This Legitimate

The Spec Must Be:

✅ Functional, not procedural

Good: "Calculate eigenvalues of matrix"
Bad:  "Use QR decomposition with Givens rotations"

✅ Domain knowledge, not code translation

Good: "Apply Snell's law at velocity boundaries"
Bad:  "In loop at line 234, use formula A = B*SIN(C)/D"

✅ Multiple implementations possible

Good: "Find minimum of function using optimization"
Bad:  "Use steepest descent with step size 0.01"

✅ Based on public algorithms

Good: "Implements Aki-Richards approximation (Aki & Richards, 1980)"
Bad:  "Uses custom unpublished method from FORTRAN code"

Comparison to Human Clean Room

| Aspect | Two Humans | Two Claude Sessions |
| --- | --- | --- |
| Different entities | Yes (legally distinct) | Unclear (same AI) |
| Independent knowledge | Yes | Same training data |
| No memory transfer | Yes | Yes (sessions isolated) |
| Different implementations | Likely | Possibly similar style |
| Legal precedent | Established | None yet |
| Defensibility | Strong | Moderate |

My Recommendation

This approach is BETTER than having me do both in one session, but:

To maximize defensibility:

1. Keep Session 1 spec very abstract
   - Cite published papers
   - Describe math, not code structure
   - Focus on "what" not "how"
2. Add human steps:
   - Human reviews spec for abstraction
   - Human verifies Session 2 output is novel
   - Human can testify to process if needed
3. Document the process:
   - Save both sessions
   - Show spec is abstract
   - Demonstrate independent implementation
4. Check the license first:
   - If MIT/BSD/Apache → just port it directly
   - If GPL → you must keep GPL
   - If proprietary → need permission anyway
5. Consider if it's worth it:
   - Is there already a Python version?
   - Can you use f2py wrapper instead?
   - Is the algorithm in published papers?
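For the "document the process" step, one lightweight option is to record a fingerprint of each artifact at the moment it is handed over. The sketch below is an assumption about what such a record might contain; the function name and field names are hypothetical, and a hash proves only that the text has not changed, not how it was produced.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(spec_text, implementation_text, reviewer):
    """Build a timestamped record with SHA-256 digests of both artifacts."""
    def digest(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "spec_sha256": digest(spec_text),
        "implementation_sha256": digest(implementation_text),
        "reviewer": reviewer,
    }

record = provenance_record("spec text", "def f(): pass", "reviewer name")
print(json.dumps(record, indent=2))
```

Storing this record alongside the saved sessions gives the human reviewer something concrete to point to later.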

Bottom Line

Your idea is clever and better than one session, but:

Best ✅: Human A writes spec → Human B implements
Good ✅: Claude writes abstract spec → Human implements

Okay ⚠️: Claude Session 1 → spec → Claude Session 2 (with very abstract spec)
Risky ❌: Claude does both in one session
Bad ❌: Line-by-line translation regardless of sessions

Want to try it? Give me a FORTRAN repository and I'll:

Write a maximally abstract specification
You review it for implementation details
Start a new chat and have that Claude implement it
Compare to see if the implementations differ significantly

This would be an interesting experiment in AI clean room development!
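For the comparison step, a crude first pass is to measure how many lines two Python implementations share, for instance a single-session port versus the two-session result. The sketch below uses the standard library's difflib; the function name is hypothetical, and textual similarity is a rough heuristic only, not a legal test of independence.

```python
import difflib

def line_similarity(code_a, code_b):
    """Return a 0..1 similarity ratio based on matching non-blank, stripped lines."""
    lines_a = [ln.strip() for ln in code_a.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in code_b.splitlines() if ln.strip()]
    return difflib.SequenceMatcher(None, lines_a, lines_b).ratio()

port = "total = 0\nfor v in values:\n    total += v * v\n"
rewrite = "total = sum(v * v for v in values)\n"

print(line_similarity(port, port))
print(line_similarity(port, rewrite))
```

A low score does not prove independence and a high score does not prove copying, since short numerical routines often converge on the same shape, so the number is best treated as a prompt for human review.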

Summary

This transcript contains the complete discussion about clean room implementation using AI, specifically:

1. Initial Question: Can Claude write a spec from FORTRAN and implement it in Python as a clean room implementation?
2. Initial Answer: Explained that while technically possible, it's not legally a true clean room because the same entity (Claude) would see both the original code and create the new implementation.
3. Two-Session Workaround: User proposed using two separate Claude sessions - one to create the spec, another (with no memory) to implement it.
4. Analysis of Workaround: Detailed explanation of how this approach is better than single-session but still has legal gray areas due to:
   - Same AI model with same training
   - Potential specification leakage
   - Unclear legal precedent for AI "sessions" as separate entities
5. Best Practices: Recommendations for making the two-session approach more defensible:
   - Keep specifications extremely abstract
   - Add human review steps
   - Reference published algorithms
   - Document the process
   - Check licenses first
6. Comparison: Table comparing two-human clean room (legally established) vs two-session AI approach (legally uncertain)
7. Conclusion: The two-session approach is better than single-session but not as defensible as true human clean room. Best combined with human review and very abstract specifications.