CRE Listing Intelligence Series
- Best CoStar alternatives for small CRE brokers
- How to scrape LoopNet and Crexi listings into one CRE dataset
- Commercial real estate API for brokers: cheap CoStar alternative
- LoopNet vs Crexi data: how to deduplicate listings
- How to build a daily CRE deal-flow dashboard
A commercial real estate broker asked me a simple question:
"Can you pull LoopNet and Crexi into one file so I do not have to check both every morning?"
At first, that sounds like a scraping problem.
It is not.
The hard part is what happens after the rows arrive.
📍 For a Dallas test run, I pulled public LoopNet + Crexi listings into one Apify dataset:
| Metric | Result |
|---|---|
| Market | Dallas, TX |
| Sources | LoopNet + Crexi |
| Scope | Sale + lease |
| Rows exported | 1,234 |
| Runtime | 72.469 seconds |
| Actor charge estimate | ~$6.22 |
| Effective actor cost | ~$5.04 / 1,000 rows |
| Cross-platform duplicate signals | 15 |
| Rows with broker company context | 1,209 |
| Rows with days-on-market context | 982 |
The run worked.
But the interesting part was not "can we scrape listings?"
The interesting part was:
How do you turn two messy listing sources into a market file a broker or analyst can actually trust?
🧩 The naive version breaks fast
The first version of this workflow is usually:
1. Scrape LoopNet
2. Scrape Crexi
3. Append both arrays
4. Export CSV
That gives you more rows.
It does not give you better data.
In CRE, the same property can appear on more than one platform. Sometimes the fields are nearly identical. Sometimes they are not.
Example of why exact matching fails:
1200 East 6th Street
1200 E 6th St
1200 E. Sixth Street
Same street, different string.
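One way to collapse those three variants into the same string is token-level normalization. The sketch below is a minimal illustration, not the actor's actual code: `normalizeStreet` and the `TOKEN_MAP` table are hypothetical names, and the abbreviation list is deliberately short.

```typescript
// Hypothetical lookup table: direction words, street suffixes, and spelled-out
// ordinals mapped to one canonical short form. Illustrative, not exhaustive.
const TOKEN_MAP = new Map<string, string>([
  ["north", "n"], ["south", "s"], ["east", "e"], ["west", "w"],
  ["street", "st"], ["avenue", "ave"], ["boulevard", "blvd"], ["road", "rd"],
  ["drive", "dr"], ["highway", "hwy"], ["freeway", "fwy"],
  ["first", "1st"], ["second", "2nd"], ["third", "3rd"], ["fourth", "4th"],
  ["fifth", "5th"], ["sixth", "6th"], ["seventh", "7th"], ["eighth", "8th"],
]);

function normalizeStreet(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[.,#]/g, " ") // punctuation differs by source, so drop it
    .split(/\s+/)
    .filter(Boolean)
    .map((token) => TOKEN_MAP.get(token) ?? token)
    .join(" ");
}

const variants = ["1200 East 6th Street", "1200 E 6th St", "1200 E. Sixth Street"];
console.log(variants.map(normalizeStreet));
// All three normalize to "1200 e 6th st"
```

A table like this is blunt: it would also shorten a street that is literally named "East Street", so a production version would likely need position-aware rules.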
Square footage can drift too:
7,500 SF
7,480 SF
7.5K SF
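A rough sketch of how those three values could land in the same bucket follows. `parseSqft`, `sqftBucket`, and the 500 SF bucket size are assumptions made for illustration; the actor's real bucket width is not stated here.

```typescript
// Parse strings like "7,500 SF" or "7.5K SF" into a whole number of square feet.
function parseSqft(raw: string): number | null {
  const match = raw.replace(/,/g, "").match(/(\d+(?:\.\d+)?)\s*(k)?/i);
  if (!match) return null;
  const value = parseFloat(match[1]);
  return Math.round(match[2] ? value * 1000 : value);
}

// Snap to the nearest multiple of bucketSize (assumed 500 SF) so that small
// measurement drift does not produce a different dedupe key.
function sqftBucket(sqft: number, bucketSize = 500): number {
  return Math.round(sqft / bucketSize) * bucketSize;
}

for (const raw of ["7,500 SF", "7,480 SF", "7.5K SF"]) {
  const sqft = parseSqft(raw);
  console.log(raw, "->", sqft, "->", sqft === null ? null : sqftBucket(sqft));
}
// Each of the three lands in the 7500 bucket
```

Bucketing only softens the problem. Two values that straddle a bucket edge (7,240 and 7,260 with this width) still split, which is one reason the key should not be the only duplicate signal.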
Asset class can be inconsistent:
Office
Creative Office
Mixed Use
If your dedupe logic is too strict, you miss duplicates.
If it is too loose, you merge different properties and quietly damage the dataset.
That is worse than having duplicates.
The goal is not "one giant scrape"
For this actor, I changed the mental model.
I do not want to sell this as a generic scraper.
The output should feel like a market proof file:
What is listed?
What is stale?
What is cross-posted?
Who represents it?
What pricing / cap-rate context exists?
Where did each row come from?
That means the dedupe layer has to preserve provenance, not hide it.
🔑 The dedupe key
The current strategy groups listings with a normalized key:
transaction_type + normalized_street + city + state + sqft_bucket + asset_class
Why include transaction_type?
Because sale and lease records for the same building should not always collapse into one row.
A building can be for sale and also have suites for lease. Those are different workflows for a broker.
The simplified TypeScript shape looks like this:
const baseKey = buildDedupKey({
  address: listing.address,
  sqft: listing.sqft,
  asset_class: listing.asset_class,
});
const dedupKey = `${listing.transaction_type}:${baseKey}`;
Then each group gets scored.
The most complete listing becomes the primary record.
// One point per populated field; the highest-scoring listing in a group wins.
function scoreListing(listing) {
  let score = 0;
  if (listing.asking_price_usd != null) score++;
  if (listing.noi_usd != null) score++;
  if (listing.cap_rate_listed != null) score++;
  if (listing.sqft != null) score++;
  if (listing.broker?.name) score++;
  if (listing.broker?.email) score++;
  if (listing.broker?.phone) score++;
  if (listing.photo_urls?.length) score++;
  if (listing.description) score++;
  if (listing.listed_at) score++;
  return score;
}
This is not fancy ML.
It is boring, explainable, and good enough for a first-pass broker dataset.
✅ Preserve the duplicate signal
The mistake I wanted to avoid was deleting useful source context.
If a property appears on both LoopNet and Crexi, that is not only a duplicate problem.
It is a signal.
So the output keeps fields like:
{
  "source": "loopnet",
  "listing_url": "https://...",
  "address_full": "Example property, Dallas, TX",
  "asset_class": "retail",
  "transaction_type": "sale",
  "dedup_key": "sale:example-key",
  "also_listed_on": ["crexi"],
  "also_listed_on_text": "crexi",
  "data_quality_notes": [
    "cross_platform_duplicate:crexi"
  ]
}
That gives the broker one primary row, while still showing that the property has exposure elsewhere.
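The grouping step behind that output could look roughly like the sketch below. It is an illustration under assumptions rather than the actor's code: `mergeDuplicates` and `completeness` are hypothetical names, the `Listing` type is trimmed to a few fields, and `completeness` is a two-field stand-in for the fuller score.

```typescript
interface Listing {
  source: string;
  listing_url: string;
  dedup_key: string;
  asking_price_usd?: number | null;
  broker_company?: string | null;
  also_listed_on?: string[];
  data_quality_notes?: string[];
}

// Stand-in for the fuller completeness score: counts populated fields.
function completeness(listing: Listing): number {
  let score = 0;
  if (listing.asking_price_usd != null) score++;
  if (listing.broker_company) score++;
  return score;
}

function mergeDuplicates(listings: Listing[]): Listing[] {
  // Group rows that share a dedupe key.
  const groups = new Map<string, Listing[]>();
  for (const listing of listings) {
    const group = groups.get(listing.dedup_key) ?? [];
    group.push(listing);
    groups.set(listing.dedup_key, group);
  }

  const merged: Listing[] = [];
  groups.forEach((group) => {
    // Highest score first; ties keep input order (sort is stable in ES2019+).
    const ranked = [...group].sort((a, b) => completeness(b) - completeness(a));
    const primary = ranked[0];
    // Unique sources of the non-primary rows, excluding the primary's own source.
    const otherSources = ranked
      .slice(1)
      .map((l) => l.source)
      .filter((s, i, all) => s !== primary.source && all.indexOf(s) === i);
    merged.push({
      ...primary,
      also_listed_on: otherSources,
      data_quality_notes: [
        ...(primary.data_quality_notes ?? []),
        ...otherSources.map((s) => `cross_platform_duplicate:${s}`),
      ],
    });
  });
  return merged;
}
```

The non-primary rows are folded into `also_listed_on` rather than silently dropped, so a reader of the merged file can still see that a second source carried the property.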
📍 The Dallas run made the problem concrete
Here are a few sample rows from the Dallas run:
| Source | Type | Address | Asset | Price | Cap rate | DOM | Broker company |
|---|---|---|---|---|---|---|---|
| Crexi | Sale | 9300 Central Expressway, Dallas, TX | Industrial | $1,599,990 | 5.9% | 247 | Transworld Commercial Real Estate |
| Crexi | Sale | 434 E Hwy 67, Duncanville, TX | Retail | $2,677,950 | 6.8% | 54 | Venture Commercial |
| Crexi | Sale | 8010 Stemmons Freeway, Dallas, TX | Retail | $2,700,000 | 6.8% | 409 | ISL Commercial Real Estate |
| Crexi | Sale | 5243 Naaman Forest Blvd, Garland, TX | Unknown | $10,998,000 | - | 61 | Matthews |
| Crexi | Sale | 2833 Irving Blvd, Dallas, TX | Retail | - | - | 178 | Capstone Commercial Real Estate Group |
The dataset is not valuable because every field is perfect.
It is valuable because the uncertainty is visible.
For example:
{
  "noi_declared_usd": null,
  "noi_implied_usd": 94399,
  "noi_source": "estimated_from_asset_class_median",
  "noi_estimated": true,
  "cap_rate_listed": null,
  "cap_rate_normalized": 5.9,
  "cap_rate_estimated": true,
  "cap_rate_source": "asset_class_median"
}
That distinction matters.
A broker or analyst should not confuse declared NOI with estimated context.
If the source gives a real cap rate, keep it.
If the system estimates a cap rate from asset-class assumptions, label it clearly.
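As a sketch of that labeling rule, the function below keeps a listed cap rate when one exists, falls back to an assumed asset-class median otherwise, and flags which path was taken. The function name, the `"listing"` source label, and the median table are assumptions for illustration. The 5.9 figure is copied from the example above; whether it is the actor's real industrial median is not something this sketch claims.

```typescript
interface FinancialInput {
  asking_price_usd: number | null;
  noi_declared_usd: number | null;
  cap_rate_listed: number | null; // percent, e.g. 6.8
  asset_class: string;
}

interface FinancialContext {
  noi_declared_usd: number | null;
  noi_implied_usd: number | null;
  noi_estimated: boolean;
  cap_rate_listed: number | null;
  cap_rate_normalized: number | null;
  cap_rate_estimated: boolean;
  cap_rate_source: "listing" | "asset_class_median" | null;
}

// Hypothetical median table, in percent. Values are placeholders.
const ASSUMED_MEDIAN_CAP_RATES = new Map<string, number>([["industrial", 5.9]]);

function labelFinancials(input: FinancialInput): FinancialContext {
  const median = ASSUMED_MEDIAN_CAP_RATES.get(input.asset_class) ?? null;
  const capRate = input.cap_rate_listed ?? median;
  const capRateEstimated = input.cap_rate_listed == null && capRate != null;

  // Imply NOI only when the source declared none and both inputs exist:
  // implied NOI = asking price x cap rate.
  const noiImplied =
    input.noi_declared_usd == null && input.asking_price_usd != null && capRate != null
      ? Math.round(input.asking_price_usd * (capRate / 100))
      : null;

  return {
    noi_declared_usd: input.noi_declared_usd,
    noi_implied_usd: noiImplied,
    noi_estimated: noiImplied != null,
    cap_rate_listed: input.cap_rate_listed,
    cap_rate_normalized: capRate,
    cap_rate_estimated: capRateEstimated,
    cap_rate_source: capRate == null ? null : capRateEstimated ? "asset_class_median" : "listing",
  };
}

console.log(
  labelFinancials({
    asking_price_usd: 1599990,
    noi_declared_usd: null,
    cap_rate_listed: null,
    asset_class: "industrial",
  }).noi_implied_usd,
); // 94399, since 1,599,990 x 0.059 is about 94,399.41
```

Declared and implied NOI live in separate fields, so a filter on `noi_declared_usd` never picks up an estimate by accident.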
A useful CRE row needs provenance
For broker workflows, I care about these fields more than raw HTML:
{
  "source": "crexi",
  "listing_url": "https://www.crexi.com/properties/...",
  "address_full": "9300 Central Expressway, Dallas, TX, 75241",
  "city": "Dallas",
  "state": "TX",
  "asset_class": "industrial",
  "transaction_type": "sale",
  "asking_price_usd": 1599990,
  "days_on_market": 247,
  "days_on_market_source": "listed_at",
  "broker_company": "Transworld Commercial Real Estate",
  "also_listed_on": []
}
This is the difference between:
I scraped some listings.
and:
I built a market file my team can scan, filter, and route into the next workflow.
Deduping is also a product decision
There is no perfect universal dedupe rule.
For this use case, I care about a few product constraints:
| Decision | Why |
|---|---|
| Keep sale and lease separate | Same property, different broker workflow |
| Preserve also_listed_on | Cross-platform exposure is useful context |
| Pick the most complete primary row | Brokers want the richest visible record first |
| Label estimated financials | Avoid mixing declared and inferred numbers |
| Keep source URLs | Users need to verify the original listing |
This is also why I think "scraper" is the wrong positioning for this kind of tool.
The product is the structured market file.
The scraper is just one layer underneath it.
The real use case
A broker does not wake up thinking:
I need a web scraper.
They think:
I need to know what changed in my market before I call owners, investors, or other brokers.
That is the workflow I am trying to support:
LoopNet + Crexi search
|
normalized public listing rows
|
dedupe / provenance / data quality notes
|
CSV, Excel, JSON, or API
|
broker-ready market proof file
Try it
I packaged this workflow as an Apify actor:
Commercial Real Estate Brokerage Intel
The demo is here:
I am also building market-specific proof files for Dallas, Austin, and Phoenix because the best proof is not "it scrapes."
The best proof is:
Here are the exact rows a CRE team can use.