DEV Community

George Kioko

I build the scrapers the data vendors run on, and I need to tell you what you are actually paying for

I have built the scraping guts behind more "real time business intelligence" products than I would ever admit on a CV. So when a contact data company tells you their database is fresh, I am not impressed, because I know how that word gets made.

Here is the part nobody puts in the sales deck. These companies pull public pages on a schedule, pour the rows into a giant table, and sell you a seat to the table. That table is a photograph. It was true the day it was scraped and it has been aging ever since. You pay every month like it refreshes every month. It does not. A row captured in February is still February sitting in your CRM in June, looking exactly as confident as a row pulled this morning, because nothing on the screen tells you which is which.
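To make the "which is which" problem concrete, here is a minimal sketch of what a row would need to carry for its age to be visible. The `ContactRow` type, the `scraped_on` field, and the 90-day threshold are all assumptions made up for illustration, not any vendor's actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ContactRow:
    name: str
    current_company: str
    scraped_on: date  # hypothetical field: the day this row was captured


def age_in_days(row: ContactRow, today: date) -> int:
    """Days elapsed since the row was captured."""
    return (today - row.scraped_on).days


def label(row: ContactRow, today: date, stale_after_days: int = 90) -> str:
    """Tag a row so an old capture stops looking like a fresh one.

    The 90-day default is an arbitrary illustrative threshold.
    """
    age = age_in_days(row, today)
    status = "STALE" if age > stale_after_days else "fresh"
    return f"{row.name} @ {row.current_company} [{status}, captured {age} days ago]"


# A February capture viewed in June, next to a row pulled the same morning.
today = date(2024, 6, 1)
old_row = ContactRow("Ada", "Acme Inc", date(2024, 2, 1))
new_row = ContactRow("Bo", "Acme Inc", date(2024, 6, 1))
print(label(old_row, today))
print(label(new_row, today))
```

Without something like `scraped_on` surfaced next to the value, both rows render identically, which is the whole problem.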

And the field that goes bad first is the one you bought the thing for. Where a person works right now. People quit and get promoted and get poached and get pushed out, and the company page is the last place on earth any of that shows up. So "current company," the whole reason the list has any value, is the column most likely to be lying by the time your rep hits send.

The industry knows this. Of course they know it. They built the pipelines. They just figured out that decay is invisible, and you cannot be angry about a number you never see. So the pitch drifts back to coverage. Millions of contacts. Look how big the table is. Freshness is the expensive part. It makes the table look smaller, and a smaller honest number loses sales calls to a bigger dishonest one.

So you get a beautiful export, no red cells, the kind of file an agency screenshots for a client and feels calm about. Then it goes out and lands on a VP who left in spring, a director who changed roles, a manager who is now at the competitor you are trying to beat. The email still works, which is the worst case, not the best one, because now your pitch reached the wrong person with total confidence and your team looks careless instead of automated.

I will say the thing the vendors will not. Pulling the names is the easy half. It has been easy for years. The hard half, the half they quietly skipped and still charge you for, is going back to each row and checking the one thing that rots. Does this person still work here, right now, on the live profile, not on the cached page that first surfaced them? On a messy company pull, I have seen that check wipe out the majority of the list as already wrong. Calling that check a data cleaning chore is how vendors hide the miss. It is not a chore. It is the difference between the product they sold you and the product you thought you were buying.
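The shape of that check is simple even if running it at scale is not. Below is a rough sketch under stated assumptions: `Contact`, `normalize`, and `reverify` are names invented here, and `fetch_live_company` is a hypothetical lookup the caller supplies. It stands in for whatever actually reads the live profile and is stubbed with a dictionary in the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Contact:
    name: str
    profile_url: str
    listed_company: str  # what the vendor table claims


def normalize(company: str) -> str:
    """Crude normalization so 'Acme Inc.' and 'acme inc' compare equal."""
    return " ".join(company.lower().replace(".", "").replace(",", "").split())


def reverify(
    contacts: list[Contact],
    fetch_live_company: Callable[[str], Optional[str]],
) -> dict[str, list[Contact]]:
    """Split contacts by whether the live profile still matches the table.

    fetch_live_company is a hypothetical, caller-supplied lookup: it takes a
    profile URL and returns the employer shown there now, or None if the
    profile could not be read.
    """
    result: dict[str, list[Contact]] = {"confirmed": [], "moved": [], "unknown": []}
    for contact in contacts:
        live = fetch_live_company(contact.profile_url)
        if live is None:
            result["unknown"].append(contact)  # could not check, so not confirmed
        elif normalize(live) == normalize(contact.listed_company):
            result["confirmed"].append(contact)
        else:
            result["moved"].append(contact)
    return result


# Stubbed live data for illustration only.
live_profiles = {
    "https://example.com/in/ada": "Acme Inc.",
    "https://example.com/in/bo": "Rival Corp",
}
contacts = [
    Contact("Ada", "https://example.com/in/ada", "acme inc"),
    Contact("Bo", "https://example.com/in/bo", "Acme Inc"),
    Contact("Cy", "https://example.com/in/cy", "Acme Inc"),
]
buckets = reverify(contacts, live_profiles.get)
for verdict, rows in buckets.items():
    print(verdict, [c.name for c in rows])
```

The "unknown" bucket is kept separate on purpose in this sketch. A row that could not be checked is not a row that passed.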

I still build scrapers. I think they are great. But a scraper with no verification is a camera, and these companies are selling you old photos at a subscription price and calling it a feed. The tell is simple. Ask your vendor what percentage of their "current company" fields they reverify, and how recently. Watch them change the subject to coverage. That answer, the one they will not give you, is the whole game.
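If a vendor did answer that question, the number would look something like the output of the sketch below. The function name, the per-row "last verified" date, and the 30-day window are assumptions chosen for illustration.

```python
from datetime import date, timedelta
from typing import Optional


def reverified_share(
    last_verified: list[Optional[date]],
    today: date,
    window_days: int = 30,
) -> float:
    """Fraction of rows whose 'current company' was rechecked within the window.

    Each entry is the date a row was last reverified, or None if it never was.
    The 30-day default window is an illustrative assumption.
    """
    if not last_verified:
        return 0.0
    cutoff = today - timedelta(days=window_days)
    recent = sum(1 for d in last_verified if d is not None and d >= cutoff)
    return recent / len(last_verified)


# Four rows: two rechecked inside the window, one stale, one never checked.
checks = [date(2024, 5, 20), date(2024, 2, 1), None, date(2024, 5, 2)]
share = reverified_share(checks, today=date(2024, 6, 1))
print(f"{share:.0%} of current-company fields reverified in the last 30 days")
```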

You are not paying for data. You are paying for the confidence that it is current. And that is the one thing in the box they never actually put in.
