NexGenData

Posted on May 28

How to Find Undervalued Properties Using Redfin Data and Price-Per-Square-Foot Analysis

#python #webscraping #tutorial #datascience

Real estate is one of the last industries where gut feel still dominates decision-making.

Investors drive through neighborhoods, remember how they "feel," and make offers based on intuition, a realtor's word, and past deals. Many overpay because they're comparing apples to oranges without realizing it.

But much of real estate data is public and quantifiable. Most listings show square footage, price, lot size, age, and recent comps. The math to find undervalued properties is straightforward.

The question isn't whether deals exist. It's whether you'll find them before someone else does.

I'll show you how to use price-per-square-foot analysis to systematically identify undervalued properties, and how to automate the entire process.

Why Price Per Square Foot Matters

Price per square foot ($/sqft) is the great equalizer in real estate. It lets you compare unlike properties.

A 1,200 sqft house selling for $360,000 ($300/sqft) looks expensive until you see the neighborhood average is $350/sqft ($420,000 for the same size). Suddenly it's a deal.
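The arithmetic behind the 1,200 sqft example can be checked in a few lines. This is a minimal sketch; the helper names `price_per_sqft` and `discount_vs_benchmark` are made up for illustration.

```python
def price_per_sqft(price: float, sqft: float) -> float:
    """Return price divided by living area."""
    if sqft <= 0:
        raise ValueError("sqft must be positive")
    return price / sqft


def discount_vs_benchmark(ppsf: float, benchmark_ppsf: float) -> float:
    """Percent by which ppsf sits below the benchmark (negative means above)."""
    return (benchmark_ppsf - ppsf) / benchmark_ppsf * 100


# The 1,200 sqft house selling for $360,000 in a $350/sqft neighborhood
house_ppsf = price_per_sqft(360_000, 1_200)
discount = discount_vs_benchmark(house_ppsf, 350)
implied_price = 1_200 * 350  # what the same size costs at the neighborhood average

print(f"${house_ppsf:.0f}/sqft, {discount:.1f}% below the $350/sqft average")
print(f"Implied price at the neighborhood average: ${implied_price:,}")
```

The house works out to $300/sqft, roughly 14% under the neighborhood figure, which is the gap the rest of the analysis looks for.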

A $500,000 house in one area might be overpriced. The same house in another neighborhood might be a steal. Raw price tells you little on its own. $/sqft puts every property on the same scale.

Here's why it's the metric that matters most:

1. Immediate comparability

New construction and older homes compared on the same basis
Different home sizes accounted for automatically
Geographic pricing variance becomes obvious

2. Market normalization

Every neighborhood has an expected $/sqft range
Outliers (overpriced or underpriced) jump out immediately
You can track how the range shifts over time

3. Negotiating leverage

"Your home is about 19% below the neighborhood average: $285/sqft vs. a $350/sqft norm"
That's data, not opinion
Backs up your offer logic

4. Predictive power

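Properties priced below the neighborhood average tend to sell faster or leave more margin
Deviations from comps often narrow over time as prices revert toward the neighborhood norm
That reversion is where the profit comes from, provided the discount isn't explained by a defect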

The Analysis Framework

Let me walk you through the exact process for identifying deals.

Step 1: Define your market
Pick a geographic area: a ZIP code, city, or neighborhood. Get recent sales data (last 6-12 months). You need a large enough sample (30+ transactions) to establish a reliable baseline.

Step 2: Calculate neighborhood $/sqft
For each property in your market:

price_per_sqft = sale_price / living_area_sqft

Then calculate the median and standard deviation:

Median $/sqft in the neighborhood
Standard deviation of $/sqft (useful for outlier detection)
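One way to put the standard deviation to work is a rough outlier score: how many standard deviations a property's $/sqft sits from the neighborhood median. The sketch below assumes a cutoff of 1.5, which is an arbitrary starting point, and uses made-up $/sqft figures. Note that `statistics.stdev` measures spread around the mean, so the score is only an approximation.

```python
import statistics


def flag_outliers(ppsf_values, threshold=1.5):
    """Return (value, score) pairs more than `threshold` std devs from the median.

    The score is (value - median) / stdev; the 1.5 cutoff is an assumed default.
    """
    median = statistics.median(ppsf_values)
    stdev = statistics.stdev(ppsf_values)  # sample std dev, needs 2+ values
    if stdev == 0:
        return []
    flagged = []
    for value in ppsf_values:
        score = (value - median) / stdev
        if abs(score) > threshold:
            flagged.append((value, round(score, 2)))
    return flagged


# Hypothetical $/sqft figures for one neighborhood
sample = [300, 305, 310, 295, 302, 298, 240, 307]
for value, score in flag_outliers(sample):
    print(f"${value}/sqft is {score} std devs from the median")
```

A negative score marks a property priced under the neighborhood norm, which is the side worth researching.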

Step 3: Identify candidates
Filter properties that are:

Below the neighborhood median $/sqft by 10% or more
Built in the last 20 years (condition is easier to predict)
Sold within the last 3 months (comps are fresh)
Not distressed sales (foreclosures and short sales follow different logic)
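The four filters above could be combined into a single check. This is a sketch under assumptions: the `Sale` record, its field names, and the example addresses are hypothetical, and the date is fixed so the example is reproducible.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Sale:
    address: str
    ppsf: float
    year_built: int
    sale_date: date
    distressed: bool  # foreclosure or short sale


def is_candidate(sale, median_ppsf, today, min_discount=0.10,
                 max_age_years=20, max_days_since_sale=90):
    """Apply the four Step 3 filters to a single sale."""
    below_median = sale.ppsf <= median_ppsf * (1 - min_discount)
    recent_build = today.year - sale.year_built <= max_age_years
    fresh = today - sale.sale_date <= timedelta(days=max_days_since_sale)
    return below_median and recent_build and fresh and not sale.distressed


today = date(2025, 6, 1)  # fixed date for a reproducible example
median_ppsf = 300.0
sales = [
    Sale("10 Example St", 260.0, 2012, date(2025, 4, 15), False),
    Sale("20 Example St", 255.0, 2015, date(2025, 5, 1), True),    # distressed
    Sale("30 Example St", 290.0, 2018, date(2025, 5, 10), False),  # about 3% below
    Sale("40 Example St", 250.0, 1980, date(2025, 5, 20), False),  # too old
]
candidates = [s for s in sales if is_candidate(s, median_ppsf, today)]
print([s.address for s in candidates])
```

Keeping the thresholds as keyword arguments makes it easy to loosen or tighten them per market.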

Step 4: Adjust for quality factors
Raw $/sqft assumes equal quality. Adjust for:

Property condition (professional inspection or listing description)
Lot size premium (land value beyond structures)
Age/renovation status
Special features (pools, garages, views)

Properties in better condition should command a higher $/sqft. If a well-kept home is below the neighborhood average $/sqft, that's your target.
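One rough way to fold quality into the comparison is to scale the neighborhood median by a multiplier before measuring the discount. Every factor below is a hypothetical placeholder, not a market figure; calibrate them against local comps.

```python
# Hypothetical adjustment factors; calibrate against local comps before relying on them
CONDITION_FACTOR = {"fair": 0.95, "good": 1.00, "very good": 1.03, "excellent": 1.06}


def expected_ppsf(median_ppsf, condition, has_pool=False, has_garage=False):
    """Scale the neighborhood median by assumed quality multipliers."""
    factor = CONDITION_FACTOR[condition]
    if has_pool:
        factor *= 1.03  # assumed pool premium
    if has_garage:
        factor *= 1.02  # assumed garage premium
    return median_ppsf * factor


def quality_adjusted_discount(actual_ppsf, median_ppsf, condition, **features):
    """Percent below the quality-adjusted expectation (negative means above it)."""
    expected = expected_ppsf(median_ppsf, condition, **features)
    return (expected - actual_ppsf) / expected * 100


# Same $285/sqft price, different condition, $300/sqft neighborhood median
for condition in ("fair", "excellent"):
    pct = quality_adjusted_discount(285, 300, condition)
    print(f"{condition}: {pct:.1f}% below quality-adjusted expectation")
```

With these assumed factors, the same $285/sqft is no discount at all for a home in fair condition but a meaningful one for a home in excellent condition.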

Real Example: Median Neighborhood Analysis

Let's work through a concrete example. Say you're analyzing a neighborhood with these recent sales (five here to keep it short; a real baseline needs the 30+ from Step 1):

import statistics

# Recent sales data (last 6 months)
recent_sales = [
    {
        "address": "123 Oak St",
        "sale_price": 425000,
        "sqft": 1450,
        "year_built": 2005,
        "condition": "good",
        "days_on_market": 12
    },
    {
        "address": "456 Maple Ave",
        "sale_price": 380000,
        "sqft": 1200,
        "year_built": 1998,
        "condition": "fair",
        "days_on_market": 28
    },
    {
        "address": "789 Pine Rd",
        "sale_price": 445000,
        "sqft": 1500,
        "year_built": 2010,
        "condition": "excellent",
        "days_on_market": 8
    },
    {
        "address": "321 Elm Lane",
        "sale_price": 365000,
        "sqft": 1250,
        "year_built": 2003,
        "condition": "good",
        "days_on_market": 35
    },
    {
        "address": "654 Birch Dr",
        "sale_price": 410000,
        "sqft": 1400,
        "year_built": 2008,
        "condition": "very good",
        "days_on_market": 15
    }
]

# Calculate $/sqft for each property
for sale in recent_sales:
    sale['price_per_sqft'] = round(sale['sale_price'] / sale['sqft'], 2)

# Calculate neighborhood median
price_per_sqft_values = [s['price_per_sqft'] for s in recent_sales]
median_ppsf = statistics.median(price_per_sqft_values)
stdev_ppsf = statistics.stdev(price_per_sqft_values)

print(f"Neighborhood Median $/sqft: ${median_ppsf:.2f}")
print(f"Standard Deviation: ${stdev_ppsf:.2f}")
print(f"Range: ${median_ppsf - stdev_ppsf:.2f} to ${median_ppsf + stdev_ppsf:.2f}")

# Identify deals (below median by 10%+)
deal_threshold = median_ppsf * 0.90

print(f"\nDeals (below ${deal_threshold:.2f}/sqft):")
for sale in recent_sales:
    if sale['price_per_sqft'] < deal_threshold:
        discount = round(
            ((median_ppsf - sale['price_per_sqft']) / median_ppsf) * 100,
            1
        )
        print(f"  {sale['address']}")
        print(f"    Price: ${sale['sale_price']:,}")
        print(f"    $/sqft: ${sale['price_per_sqft']} ({discount}% below median)")
        print(f"    Days on market: {sale['days_on_market']}")
        print()

Output:

Neighborhood Median $/sqft: $293.10
Standard Deviation: $10.45
Range: $282.65 to $303.55

Deals (below $263.79/sqft):

No property clears the 10% threshold, so the script lists no deals. The cheapest sale, 321 Elm Lane at $292.00/sqft, is only 0.4% below the median, and 456 Maple Ave at $316.67/sqft is 8.0% above it. Deals aren't obvious in this neighborhood right now, but the method works: you'd find them if they existed.

The Redfin Approach: Automation at Scale

Manually gathering comp data and calculating $/sqft across dozens of properties is tedious. The Apify Redfin Scraper does this automatically.

Here's what the data looks like:

{
  "properties": [
    {
      "address": "2847 Westridge Drive, San Jose, CA 95129",
      "price": 1850000,
      "pricePerSqft": 542,
      "beds": 4,
      "baths": 2.5,
      "sqft": 3412,
      "lotSize": "0.43 acres",
      "yearBuilt": 2001,
      "type": "House",
      "daysOnZillow": 18,
      "zestimate": 1825000,
      "recentSales": [
        {
          "date": "2026-02-15",
          "price": 1780000,
          "pricePerSqft": 521
        },
        {
          "date": "2025-11-03",
          "price": 1725000,
          "pricePerSqft": 506
        }
      ],
      "taxHistory": [
        {
          "year": 2025,
          "taxAmount": 18500
        }
      ]
    }
  ],
  "marketStats": {
    "medianPrice": 1550000,
    "medianPricePerSqft": 480,
    "medianDaysOnMarket": 22,
    "priceChangeYoy": 3.2
  }
}

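The sample includes the market median $/sqft in marketStats.medianPricePerSqft, so you don't have to compute it yourself. Some field names in the sample, such as daysOnZillow and zestimate, are Zillow terms, so confirm the exact schema against your own run output. Now you can run your deal-finding logic: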

import requests

# Fetch Redfin data for a market
actor_id = "CwHzig9rDc8gdy5NI"
api_token = "your_apify_token"

payload = {
    "search": "San Jose, CA",
    "limit": 500,
    "priceMin": 1000000,
    "priceMax": 2000000
}

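response = requests.post(
    f"https://api.apify.com/v2/acts/{actor_id}/runs",
    json=payload,
    headers={"Authorization": f"Bearer {api_token}"},  # token sent as a Bearer header
    timeout=30,
)
response.raise_for_status()  # stop early on auth or input errors

run_id = response.json()["data"]["id"]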

# Wait for completion and fetch results
# (omitted for brevity)

# Then apply deal-finding logic
def find_deals(properties, market_stats):
    median_ppsf = market_stats['medianPricePerSqft']
    deal_threshold = median_ppsf * 0.92  # 8% below median

    deals = []
    for prop in properties:
        is_deal = (
            prop['pricePerSqft'] < deal_threshold
            and prop['sqft'] > 2500          # Minimum size
            and prop['yearBuilt'] > 1995     # Not too old
            and prop['daysOnZillow'] < 45    # Recent listing
        )
        if not is_deal:
            continue
        discount = (median_ppsf - prop['pricePerSqft']) / median_ppsf * 100
        deals.append({
            'address': prop['address'],
            'price': prop['price'],
            'ppsf': prop['pricePerSqft'],
            'discount': round(discount, 1),
            # What the home would cost at the market median $/sqft
            'implied_value': round(prop['sqft'] * median_ppsf),
        })

    # Largest discount first
    return sorted(deals, key=lambda x: x['discount'], reverse=True)

Now you're identifying undervalued properties automatically that would take hours to find manually.
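The "wait for completion and fetch results" step that the script skips could look roughly like this. It is a sketch based on my understanding of the Apify API v2 (a run record at `/actor-runs/{runId}` with `status` and `defaultDatasetId` fields, and items at `/datasets/{datasetId}/items`); check those details against the Apify docs. The helper names are made up, and how dataset items map onto the `properties` and `marketStats` shape shown earlier depends on the actor's output.

```python
import json
import time
import urllib.request

API_BASE = "https://api.apify.com/v2"
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def api_get(path, api_token):
    """GET a JSON document from the Apify API using a bearer token."""
    request = urllib.request.Request(
        f"{API_BASE}{path}",
        headers={"Authorization": f"Bearer {api_token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_run(fetch_run, poll_seconds=5, timeout_seconds=600, sleep=time.sleep):
    """Poll fetch_run() until the run reaches a terminal status, then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        run = fetch_run()
        if run["status"] in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError("actor run did not finish in time")
        sleep(poll_seconds)


def fetch_results(run_id, api_token):
    """Wait for a run, then download the items in its default dataset."""
    run = wait_for_run(lambda: api_get(f"/actor-runs/{run_id}", api_token)["data"])
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"run ended with status {run['status']}")
    return api_get(f"/datasets/{run['defaultDatasetId']}/items?format=json", api_token)
```

Calling `fetch_results(run_id, api_token)` with the `run_id` from the request above would return the scraped items once the run succeeds. `wait_for_run` takes the fetch function as an argument so the polling loop can be exercised without touching the network.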

The Investor Workflow

Here's how successful real estate investors use this:

Week 1: Set up a tracker on your target market

Run the Redfin scraper for your geographic focus
Establish baseline median $/sqft
Identify current deals

Weeks 2-4: Monitor for new listings

Daily/weekly runs track new properties
Listings are often mispriced in the first 48 hours
Early alert = first offer advantage
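Spotting new listings between runs comes down to a set difference. A minimal sketch, assuming each listing is a dict and that the address is a stable enough key; a listing ID would be better if the data has one. The addresses are hypothetical.

```python
def new_listings(previous_run, current_run, key="address"):
    """Return listings in current_run whose key was absent from previous_run."""
    seen = {listing[key] for listing in previous_run}
    return [listing for listing in current_run if listing[key] not in seen]


# Hypothetical results from two consecutive runs
monday = [
    {"address": "10 Example St", "pricePerSqft": 510},
    {"address": "20 Example St", "pricePerSqft": 495},
]
tuesday = [
    {"address": "10 Example St", "pricePerSqft": 510},
    {"address": "20 Example St", "pricePerSqft": 495},
    {"address": "30 Example St", "pricePerSqft": 430},
]

for listing in new_listings(monday, tuesday):
    print(f"New: {listing['address']} at ${listing['pricePerSqft']}/sqft")
```

Feeding only the new listings into the deal filter keeps each alert focused on properties you haven't already reviewed.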

When you find a candidate:

Pull recent comps (3-5 properties of similar size, condition, and age)
Verify the $/sqft math (the calculation never lies)
Get a professional inspection
Verify rental income potential (if applicable)
Make an offer at the neighborhood-adjusted price
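Pulling comps can be sketched as a similarity filter on size and age. The tolerances here (20% on square footage, 10 years on build date) are assumptions, the subject property and the Cedar Ct sale are hypothetical, and the other sales reuse the figures from the neighborhood example.

```python
import statistics


def select_comps(subject, sales, max_comps=5, sqft_tolerance=0.20, age_tolerance=10):
    """Pick the sales closest in size to the subject, within size and age tolerances."""
    eligible = [
        s for s in sales
        if abs(s["sqft"] - subject["sqft"]) <= subject["sqft"] * sqft_tolerance
        and abs(s["year_built"] - subject["year_built"]) <= age_tolerance
    ]
    eligible.sort(key=lambda s: abs(s["sqft"] - subject["sqft"]))
    return eligible[:max_comps]


subject = {"sqft": 1300, "year_built": 2004}  # hypothetical candidate
sales = [
    {"address": "123 Oak St", "sale_price": 425000, "sqft": 1450, "year_built": 2005},
    {"address": "456 Maple Ave", "sale_price": 380000, "sqft": 1200, "year_built": 1998},
    {"address": "789 Pine Rd", "sale_price": 445000, "sqft": 1500, "year_built": 2010},
    {"address": "321 Elm Lane", "sale_price": 365000, "sqft": 1250, "year_built": 2003},
    {"address": "654 Birch Dr", "sale_price": 410000, "sqft": 1400, "year_built": 2008},
    {"address": "900 Cedar Ct", "sale_price": 600000, "sqft": 2400, "year_built": 1975},
]

comps = select_comps(subject, sales, max_comps=3)
comp_ppsf = statistics.median(c["sale_price"] / c["sqft"] for c in comps)
print([c["address"] for c in comps])
print(f"Comp median: ${comp_ppsf:.2f}/sqft")
```

Condition isn't part of this filter; it still needs a manual check or a quality adjustment.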

Track your returns:

Buy price vs market-adjusted $/sqft
Monitor how quickly neighborhood $/sqft changes
Over time, you'll see patterns (certain ZIP codes appreciate faster, certain price bands have more deals)
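Tracking how the neighborhood $/sqft moves could start with a monthly median. This sketch assumes each sale is a dict with an ISO-format `date`, a `price`, and a `sqft` field; the figures are made up.

```python
from collections import defaultdict
import statistics


def median_ppsf_by_month(sales):
    """Group sales by YYYY-MM and return {month: median $/sqft}, sorted by month."""
    buckets = defaultdict(list)
    for sale in sales:
        month = sale["date"][:7]  # ISO dates: "2025-03-14" -> "2025-03"
        buckets[month].append(sale["price"] / sale["sqft"])
    return {month: statistics.median(values) for month, values in sorted(buckets.items())}


# Hypothetical sales
history = [
    {"date": "2025-02-05", "price": 320000, "sqft": 1000},
    {"date": "2025-01-10", "price": 300000, "sqft": 1000},
    {"date": "2025-01-22", "price": 310000, "sqft": 1000},
]

for month, ppsf in median_ppsf_by_month(history).items():
    print(f"{month}: ${ppsf:.2f}/sqft")
```

Comparing your purchase $/sqft against the median for the month you bought, and then against later months, shows whether the discount you paid for has closed.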

Why This Works Better Than "Gut Feel"

A realtor might say "this is a good deal." They're selling you. The data says whether they're right.

A listing that's 12% below the neighborhood $/sqft median is a gap large enough to investigate. It usually means one of three things:

The property has a hidden problem (condition, title issue, location within the ZIP)
It's genuinely undervalued and will appreciate or sell quickly
The seller is uninformed

In every case the buyer with data is better placed: the first is a risk you can price in or walk away from, and the other two are opportunities.

Realtors don't share comps data generously. The market incentivizes opacity. But much of that data is public and free to anyone willing to aggregate it.

The Numbers

Time to analyze 100 properties for deals:

Manual research: 6-8 hours
Using Redfin data + deal-finding script: 15 minutes

Cost per identified deal:

Realtor research (opportunity cost): ~$50
Using Redfin scraper: ~$2-5 in API costs

More importantly, you get to the deals first.

In real estate, first mover advantage is measurable. The first offer often wins. And the first offer comes from having data everyone else ignores.

Getting Started

Pick your target market (1-3 ZIP codes)
Run the Redfin scraper to get baseline data
Build a spreadsheet or database of properties with $/sqft calculated
Identify the bottom 15% of properties by price/sqft (while filtering for size/age/condition)
Research why they're discounted (it's always something)
Contact sellers or their agents for the top 5 opportunities
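The "bottom 15%" step could be a sort and a slice. This is a sketch: it assumes each property carries a `pricePerSqft` field like the sample data, rounds the count up so at least one property comes back, and uses made-up addresses and prices.

```python
import math


def bottom_percent(properties, percent=15, key="pricePerSqft"):
    """Return the cheapest `percent` of properties by $/sqft (rounded up, at least one)."""
    ranked = sorted(properties, key=lambda p: p[key])
    count = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:count]


# Ten hypothetical properties priced from $400 to $490 per sqft
properties = [
    {"address": f"{n} Example St", "pricePerSqft": 400 + 10 * n}
    for n in range(10)
]

for prop in bottom_percent(properties):
    print(f"{prop['address']}: ${prop['pricePerSqft']}/sqft")
```

Apply the size, age, and condition filters before this step so the cheapest slice isn't dominated by properties you'd never buy.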

Run this weekly and you'll have a pipeline of deals most investors never see.

The data is there. You just have to collect and analyze it.

Are you currently tracking price/sqft in your market? What discount threshold triggers your research? Drop your experience in the comments.

DEV Community