KazKN

Posted on Jun 4

I Pulled 1,234 Dallas CRE Listings from LoopNet + Crexi. Deduping Was the Real Problem.

#apify #webscraping #realestate #automation

A commercial real estate broker asked me a simple question:

"Can you pull LoopNet and Crexi into one file so I do not have to check both every morning?"

At first, that sounds like a scraping problem.

It is not.

The hard part is what happens after the rows arrive.

📍 For a Dallas test run, I pulled public LoopNet + Crexi listings into one Apify dataset:

| Metric | Result |
| --- | --- |
| Market | Dallas, TX |
| Sources | LoopNet + Crexi |
| Scope | Sale + lease |
| Rows exported | 1,234 |
| Runtime | 72.469 seconds |
| Actor charge estimate | ~$6.22 |
| Effective actor cost | ~$5.04 / 1,000 rows |
| Cross-platform duplicate signals | 15 |
| Rows with broker company context | 1,209 |
| Rows with days-on-market context | 982 |

The run worked.

But the interesting part was not "can we scrape listings?"

The interesting part was:

How do you turn two messy listing sources into a market file a broker or analyst can actually trust?

🧩 The naive version breaks fast

The first version of this workflow is usually:

1. Scrape LoopNet
2. Scrape Crexi
3. Append both arrays
4. Export CSV

That gives you more rows.

It does not give you better data.

In CRE, the same property can appear on more than one platform. Sometimes the fields are nearly identical. Sometimes they are not.

Example of why exact matching fails:

1200 East 6th Street
1200 E 6th St
1200 E. Sixth Street

Same street, different string.
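
One way to make those three strings compare equal is to normalize tokens before matching. The sketch below is a minimal illustration, not the actor's actual code: `normalizeStreet` and the `STREET_TOKENS` table are hypothetical names, and the abbreviation list is deliberately short.

```typescript
// Hypothetical lookup table: maps long forms and spelled-out ordinals to one
// canonical token. Illustrative only; a production table would be much larger.
const STREET_TOKENS = new Map<string, string>(
  Object.entries({
    east: "e", west: "w", north: "n", south: "s",
    street: "st", avenue: "ave", boulevard: "blvd", road: "rd",
    drive: "dr", highway: "hwy", freeway: "fwy", expressway: "expy",
    first: "1st", second: "2nd", third: "3rd", fourth: "4th", fifth: "5th",
    sixth: "6th", seventh: "7th", eighth: "8th", ninth: "9th", tenth: "10th",
  }),
);

// Lowercase, strip punctuation that varies between sources, then canonicalize
// each token. Tokens missing from the table pass through unchanged.
function normalizeStreet(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[.,#]/g, " ")
    .split(/\s+/)
    .filter(Boolean)
    .map((token) => STREET_TOKENS.get(token) ?? token)
    .join(" ");
}
```

With this table, all three variants above normalize to `1200 e 6th st`. A lookup table only catches the variants it knows about, so unusual abbreviations still slip through.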

Square footage can drift too:

7,500 SF
7,480 SF
7.5K SF
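
A rough sketch of how values like these could be parsed and then bucketed, so that small differences do not split a match. `parseSqft`, `sqftBucket`, and the 500 SF bucket width are assumptions made for illustration, not the actor's real settings.

```typescript
// Parse strings such as "7,500 SF" or "7.5K SF" into whole square feet.
// Returns null when the string contains no number.
function parseSqft(raw: string): number | null {
  const match = raw.replace(/,/g, "").match(/(\d+(?:\.\d+)?)\s*(k)?/i);
  if (!match) return null;
  const value = Number(match[1]);
  // A trailing "K" means thousands.
  return Math.round(match[2] ? value * 1000 : value);
}

// Hypothetical bucketing rule: round to the nearest multiple of `width`.
// The 500 SF default is an assumption chosen for this example.
function sqftBucket(sqft: number, width: number = 500): number {
  return Math.round(sqft / width) * width;
}
```

Under these assumptions all three strings above land in the 7,500 bucket. Fixed buckets have an edge problem: 7,240 SF and 7,260 SF fall into different buckets (7,000 and 7,500) even though they are only 20 SF apart, so the width is a trade-off between missed and false matches.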

Asset class can be inconsistent:

Office
Creative Office
Mixed Use

If your dedupe logic is too strict, you miss duplicates.

If it is too loose, you merge different properties and quietly damage the dataset.

That is worse than having duplicates.

The goal is not "one giant scrape"

For this actor, I changed the mental model.

I do not want to sell this as a generic scraper.

The output should feel like a market proof file:

What is listed?
What is stale?
What is cross-posted?
Who represents it?
What pricing / cap-rate context exists?
Where did each row come from?

That means the dedupe layer has to preserve provenance, not hide it.

🔑 The dedupe key

The current strategy groups listings with a normalized key:

transaction_type + normalized_street + city + state + sqft_bucket + asset_class

Why include transaction_type?

Because sale and lease records for the same building should not collapse into one row.

A building can be for sale and also have suites for lease. Those are different workflows for a broker.

The simplified TypeScript shape looks like this:

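// Base key: the normalized location, size bucket, and asset class parts of the key.
const baseKey = buildDedupKey({
  address: listing.address,
  sqft: listing.sqft,
  asset_class: listing.asset_class,
});

// Prefix the transaction type so sale and lease rows never share a key.
const dedupKey = `${listing.transaction_type}:${baseKey}`;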
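
Once every row carries a key, grouping is a single pass over the rows. This is a minimal sketch under the assumption that the key is stored on each row as `dedup_key`; `KeyedListing` and `groupByDedupKey` are hypothetical names.

```typescript
// Minimal row shape for this sketch; real rows carry many more fields.
interface KeyedListing {
  source: string;
  dedup_key: string;
}

// Group rows that share a dedup key. Input order is preserved inside each group.
function groupByDedupKey<T extends KeyedListing>(rows: T[]): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const row of rows) {
    const group = groups.get(row.dedup_key);
    if (group) {
      group.push(row);
    } else {
      groups.set(row.dedup_key, [row]);
    }
  }
  return groups;
}
```

Groups with more than one row are the duplicate candidates. Groups of one pass through untouched.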

Then each group gets scored.

The most complete listing becomes the primary record.

// Completeness score: one point for each populated field, from 0 to 10.
function scoreListing(listing) {
  let score = 0;

  if (listing.asking_price_usd != null) score++;
  if (listing.noi_usd != null) score++;
  if (listing.cap_rate_listed != null) score++;
  if (listing.sqft != null) score++;
  if (listing.broker?.name) score++;
  if (listing.broker?.email) score++;
  if (listing.broker?.phone) score++;
  if (listing.photo_urls?.length) score++;
  if (listing.description) score++;
  if (listing.listed_at) score++;

  return score;
}
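
Choosing the primary record from a group could then look like the sketch below. `pickPrimary` is a hypothetical helper, and the tie-break rule (the earliest row wins) is an assumption, since the scoring function alone does not say what happens when two rows score the same.

```typescript
// Return the highest-scoring row in a group. On a tie the earlier row is kept,
// so the result is deterministic for a given input order.
function pickPrimary<T>(group: T[], score: (row: T) => number): T {
  if (group.length === 0) {
    throw new Error("cannot pick a primary record from an empty group");
  }
  return group.reduce((best, row) => (score(row) > score(best) ? row : best));
}
```

Passing the scoring function as a parameter keeps the selection rule separate from the definition of "complete", which makes either one easier to change later.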

This is not fancy ML.

It is boring, explainable, and good enough for a first-pass broker dataset.

✅ Preserve the duplicate signal

The mistake I wanted to avoid was deleting useful source context.

If a property appears on both LoopNet and Crexi, that is not only a duplicate problem.

It is a signal.

So the output keeps fields like:

{
  "source": "loopnet",
  "listing_url": "https://...",
  "address_full": "Example property, Dallas, TX",
  "asset_class": "retail",
  "transaction_type": "sale",
  "dedup_key": "sale:example-key",
  "also_listed_on": ["crexi"],
  "also_listed_on_text": "crexi",
  "data_quality_notes": [
    "cross_platform_duplicate:crexi"
  ]
}

That gives the broker one primary row, while still showing that the property has exposure elsewhere.
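
A sketch of how the primary row could pick up those fields from the rest of its group. `annotatePrimary` and `SourcedRow` are hypothetical names, and the `", "` separator used for `also_listed_on_text` is an assumption, since the sample above only shows a single value.

```typescript
interface SourcedRow {
  source: string;
  also_listed_on?: string[];
  also_listed_on_text?: string;
  data_quality_notes?: string[];
}

// Copy the primary row and record every other source seen in its group.
function annotatePrimary<T extends SourcedRow>(primary: T, group: T[]): T {
  const others = Array.from(new Set(group.map((row) => row.source)))
    .filter((source) => source !== primary.source)
    .sort();

  return {
    ...primary,
    also_listed_on: others,
    // Flat text version for CSV or Excel exports (separator is an assumption).
    also_listed_on_text: others.join(", "),
    data_quality_notes: [
      ...(primary.data_quality_notes ?? []),
      ...others.map((source) => `cross_platform_duplicate:${source}`),
    ],
  };
}
```

The function returns a copy instead of mutating the input, so the original group stays available for auditing why a row was marked as cross-posted.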

📍 The Dallas run made the problem concrete

Here are a few sample rows from the Dallas run:

| Source | Type | Address | Asset | Price | Cap rate | DOM | Broker company |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Crexi | Sale | 9300 Central Expressway, Dallas, TX | Industrial | $1,599,990 | 5.9% | 247 | Transworld Commercial Real Estate |
| Crexi | Sale | 434 E Hwy 67, Duncanville, TX | Retail | $2,677,950 | 6.8% | 54 | Venture Commercial |
| Crexi | Sale | 8010 Stemmons Freeway, Dallas, TX | Retail | $2,700,000 | 6.8% | 409 | ISL Commercial Real Estate |
| Crexi | Sale | 5243 Naaman Forest Blvd, Garland, TX | Unknown | $10,998,000 | - | 61 | Matthews |
| Crexi | Sale | 2833 Irving Blvd, Dallas, TX | Retail | - | - | 178 | Capstone Commercial Real Estate Group |

The dataset is not valuable because every field is perfect.

It is valuable because the uncertainty is visible.

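For example, here are the financial fields for a row where the source declared neither NOI nor cap rate, so both values are estimates: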

{
  "noi_declared_usd": null,
  "noi_implied_usd": 94399,
  "noi_source": "estimated_from_asset_class_median",
  "noi_estimated": true,
  "cap_rate_listed": null,
  "cap_rate_normalized": 5.9,
  "cap_rate_estimated": true,
  "cap_rate_source": "asset_class_median"
}

That distinction matters.

A broker or analyst should not confuse declared NOI with estimated context.

If the source gives a real cap rate, keep it.

If the system estimates a cap rate from asset-class assumptions, label it clearly.
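
The labeling rule can be sketched as a small function. This is an illustration under stated assumptions: `buildFinancialContext` is a hypothetical name, the median cap rate is assumed to come from some asset-class lookup, and the `"listed"`, `"declared"`, and `"implied_from_listed_cap_rate"` labels are invented here. Only the two median labels appear in the sample output above.

```typescript
interface FinancialContext {
  noi_declared_usd: number | null;
  noi_implied_usd: number | null;
  noi_source: string | null;
  noi_estimated: boolean;
  cap_rate_listed: number | null;
  cap_rate_normalized: number | null;
  cap_rate_estimated: boolean;
  cap_rate_source: string | null;
}

// Cap rates are percentages (5.9 means 5.9%). The median is an assumed input.
function buildFinancialContext(
  askingPriceUsd: number | null,
  capRateListed: number | null,
  noiDeclaredUsd: number | null,
  medianCapRate: number,
): FinancialContext {
  const capRateEstimated = capRateListed == null;
  const capRate = capRateListed ?? medianCapRate;
  // Implied NOI = asking price x cap rate, rounded to whole dollars.
  const noiImplied =
    askingPriceUsd == null ? null : Math.round((askingPriceUsd * capRate) / 100);

  let noiSource: string | null = null;
  if (noiDeclaredUsd != null) {
    noiSource = "declared"; // hypothetical label
  } else if (noiImplied != null) {
    noiSource = capRateEstimated
      ? "estimated_from_asset_class_median"
      : "implied_from_listed_cap_rate"; // hypothetical label
  }

  return {
    noi_declared_usd: noiDeclaredUsd,
    noi_implied_usd: noiImplied,
    noi_source: noiSource,
    noi_estimated: noiDeclaredUsd == null && noiImplied != null,
    cap_rate_listed: capRateListed,
    cap_rate_normalized: capRate,
    cap_rate_estimated: capRateEstimated,
    cap_rate_source: capRateEstimated ? "asset_class_median" : "listed",
  };
}
```

The arithmetic is consistent with the sample values: a $1,599,990 asking price at a 5.9% median cap rate implies an NOI of about $94,399. Declared and listed values are never overwritten, and the estimate lives in separate fields next to them.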

A useful CRE row needs provenance

For broker workflows, I care about these fields more than raw HTML:

{
  "source": "crexi",
  "listing_url": "https://www.crexi.com/properties/...",
  "address_full": "9300 Central Expressway, Dallas, TX, 75241",
  "city": "Dallas",
  "state": "TX",
  "asset_class": "industrial",
  "transaction_type": "sale",
  "asking_price_usd": 1599990,
  "days_on_market": 247,
  "days_on_market_source": "listed_at",
  "broker_company": "Transworld Commercial Real Estate",
  "also_listed_on": []
}

This is the difference between:

I scraped some listings.

and:

I built a market file my team can scan, filter, and route into the next workflow.
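
Filtering a file like that depends on fields such as `days_on_market`. As a rough sketch, assuming `listed_at` is an ISO 8601 timestamp, it could be derived as shown below. `daysOnMarket`, `isStale`, and the 180-day threshold are hypothetical and not taken from the actor.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between the listing date and a reference date.
// Returns null when listed_at cannot be parsed, so a gap stays visible as a gap.
function daysOnMarket(listedAt: string, asOf: Date): number | null {
  const listed = Date.parse(listedAt);
  if (Number.isNaN(listed)) return null;
  return Math.max(0, Math.floor((asOf.getTime() - listed) / MS_PER_DAY));
}

// Hypothetical staleness rule: the threshold is an assumption and will vary by market.
function isStale(days: number | null, thresholdDays: number = 180): boolean {
  return days != null && days >= thresholdDays;
}
```

Passing the reference date in explicitly, instead of calling `new Date()` inside the function, keeps reruns of the same export reproducible.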

Deduping is also a product decision

There is no perfect universal dedupe rule.

For this use case, I care about a few product constraints:

| Decision | Why |
| --- | --- |
| Keep sale and lease separate | Same property, different broker workflow |
| Preserve `also_listed_on` | Cross-platform exposure is useful context |
| Pick the most complete primary row | Brokers want the richest visible record first |
| Label estimated financials | Avoid mixing declared and inferred numbers |
| Keep source URLs | Users need to verify the original listing |

This is also why I think "scraper" is the wrong positioning for this kind of tool.

The product is the structured market file.

The scraper is just one layer underneath it.

The real use case

A broker does not wake up thinking:

I need a web scraper.

They think:

I need to know what changed in my market before I call owners, investors, or other brokers.

That is the workflow I am trying to support:

LoopNet + Crexi search
        |
normalized public listing rows
        |
dedupe / provenance / data quality notes
        |
CSV, Excel, JSON, or API
        |
broker-ready market proof file
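
For the CSV step of that flow, a minimal serializer might look like the sketch below. `toCsv` and `csvCell` are hypothetical helpers, array fields are joined with `"; "` as an assumption, and quoting follows the usual double-quote escaping convention.

```typescript
type Cell = string | number | boolean | null | undefined | string[];

// Render one value as a CSV cell. Quote it when it contains a comma, quote, or newline.
function csvCell(value: Cell): string {
  if (value == null) return "";
  const text = Array.isArray(value) ? value.join("; ") : String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize rows using an explicit column list so the column order is stable.
function toCsv(rows: Record<string, Cell>[], columns: string[]): string {
  const lines = [columns.map(csvCell).join(",")];
  for (const row of rows) {
    lines.push(columns.map((column) => csvCell(row[column])).join(","));
  }
  return lines.join("\n");
}
```

An explicit column list matters here because sparse rows would otherwise shift columns between runs, and broker company names often contain commas.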