I've scripted CRM migrations from Honeybook, Dubsado, and 17hats. The CSV exports are dirtier than you'd expect. Here are the scripts I use.
Related reading on owning your CRM stack - https://seedlycrm.com/blog/gohighlevel-alternatives-where-you-own-the-code
Why this post exists
I run an agency for home service businesses. A lot of my onboarding work isn't the fun stuff. It's getting a client's data out of whatever boutique CRM they're stuck in and into a schema that doesn't make me want to punch a wall.
Each of these three platforms exports differently, and each one has its own brand of "yeah we technically gave you your data" energy.
Export reality: Honeybook
Honeybook's export is a zip with multiple CSVs. Contacts, projects, payments, files. Quoting is inconsistent across older exports, so a naive pd.read_csv call will choke on it.
Things to know going in:
- Phone numbers come through as strings with formatting characters baked in.
- Project status uses internal codes that aren't documented in the export.
- Notes can contain raw HTML from rich-text inputs.
- Date fields are ISO-ish, but they don't tell you the timezone is UTC.
import pandas as pd

contacts = pd.read_csv(
    "honeybook_export/contacts.csv",
    engine="python",        # the C engine chokes on inconsistent quoting
    dtype=str,              # never trust pandas type inference on dirty CSVs
    keep_default_na=False,
    encoding="utf-8-sig",   # strip the BOM Honeybook adds
)
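Those ISO-ish dates need the timezone attached explicitly, or every downstream conversion is silently wrong. A minimal sketch, assuming a created_at column — the actual column names vary by export version:

```python
import pandas as pd

def parse_honeybook_dates(df, cols=("created_at",)):
    # The timestamps are UTC but carry no offset; attach UTC explicitly
    # so later timezone conversions are honest. Unparseable values -> NaT.
    for col in cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce", utc=True)
    return df
```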
Export reality: Dubsado
Dubsado makes you export each section separately. Clients, projects, leads, invoices. There's no "give me everything" button. You'll be clicking for ten minutes.
The clients CSV has a tags column that's a single string of comma-separated tags. So now you have commas inside a comma-delimited file. The export wraps that field in quotes most of the time, but I've seen cases where it doesn't.
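When the field does come through quoted, pandas hands you the raw tag string intact and splitting it is trivial. A sketch — note that when the export drops the quotes, the row is genuinely ambiguous and there's no clean recovery:

```python
def split_dubsado_tags(raw):
    # "web design, rush, referral " -> ["web design", "rush", "referral"]
    # Empty cells and stray whitespace produce no tags rather than "" entries.
    if not raw:
        return []
    return [t.strip() for t in raw.split(",") if t.strip()]
```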
The bigger problem is custom fields. Dubsado lets clients add custom fields per-project, and those come through as columns named whatever the user typed. I once opened an export with 60+ columns because someone got creative with their intake form.
def load_dubsado_clients(path):
    df = pd.read_csv(path, dtype=str, keep_default_na=False)
    core = {"name", "email", "phone", "company", "address", "tags", "status"}
    custom_cols = [c for c in df.columns if c.lower() not in core]
    # collapse unknown columns into a single notes blob
    df["_custom_blob"] = df[custom_cols].apply(
        lambda row: "\n".join(
            f"{k}: {v}" for k, v in row.items() if v.strip()
        ),
        axis=1,
    )
    return df.drop(columns=custom_cols)
If the client wants those custom fields preserved, you map them to your target CRM's custom field schema. If they don't care (most don't), it goes in notes.
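When they do want them preserved, the mapping is just a dict from intake-form column to target field key. A sketch with hypothetical column names — yours will be whatever the client typed into their form builder:

```python
# Hypothetical intake-form columns mapped to target custom-field keys.
# These names are illustrative; pull the real ones from the export header.
CUSTOM_FIELD_MAP = {
    "How did you hear about us?": "referral_source",
    "Preferred contact time": "contact_window",
}

def map_custom_fields(row, mapping=CUSTOM_FIELD_MAP):
    # Keep mapped fields as structured data; anything unmapped falls
    # through to the notes blob instead.
    return {dst: row[src] for src, dst in mapping.items() if row.get(src, "").strip()}
```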
Export reality: 17hats
17hats gives you contacts and projects as separate CSVs. The contact identifier isn't called contact_id. It's called Contact #. With a space and a hash symbol. Welcome to migration.
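First order of business is renaming those headers to sane identifiers. "Contact #" is the real header; the other source names below are from memory and may differ between export versions:

```python
import pandas as pd

# "Contact #" really does have a space and a hash. The rest are assumptions;
# check them against your actual export header row.
SEVENTEENHATS_RENAMES = {
    "Contact #": "contact_id",
    "First Name": "first_name",
    "Last Name": "last_name",
    "Email Address": "email",
}

def rename_17hats_columns(df):
    return df.rename(columns=SEVENTEENHATS_RENAMES)
```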
The phone column can contain multiple numbers in one cell, separated by a slash or an "alt:" prefix. So (804) 555-1234 / alt: 804-555-9999 is a real value I've parsed.
import re

PHONE_RE = re.compile(r"[\d\(\)\-\s\+]+")

def split_17hats_phones(raw):
    if not raw:
        return []
    parts = re.split(r"[/;]|alt:", raw, flags=re.IGNORECASE)
    cleaned = []
    for p in parts:
        match = PHONE_RE.search(p)
        if match:
            digits = re.sub(r"\D", "", match.group())
            if len(digits) >= 10:
                cleaned.append(digits[-10:])
    return cleaned
You'll thank me when your SMS sender isn't trying to text "alt:" as a destination.
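Downstream, the SMS sender usually wants E.164 rather than bare digits. A minimal US/Canada-only sketch that pairs with the splitter above:

```python
def to_e164(digits10):
    # Assumes North American numbers: prepend +1 to a clean 10-digit
    # national number, reject anything else rather than guess.
    if len(digits10) == 10 and digits10.isdigit():
        return "+1" + digits10
    return None
```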
The CSV pitfalls that bite everyone
I keep a checklist I run before I touch anything:
- BOM at the start of the file. UTF-8 with BOM is common in older exports. Use encoding="utf-8-sig".
- Mixed line endings. Windows \r\n and Unix \n in the same file. pandas usually handles it; csv.reader sometimes doesn't.
- Embedded newlines in fields. Notes with line breaks will split into multiple "rows" if you read line-by-line. Always use a real CSV parser.
- Smart quotes. Curly quotes from someone who pasted in from Word will not match your search filters. Normalize to straight quotes.
- Trailing whitespace in emails. "john@example.com " (with a trailing space) is not john@example.com to most CRMs. Strip.
- Duplicate contacts. Same person entered twice with different formatting. Match on lowercase trimmed email first, then on (name, phone) as a fallback.
- Empty strings vs nulls. Some CSVs use "", some use NULL, some use N/A. Normalize on load.
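Several of those items collapse into one generic cell-cleanup pass. A sketch — the sentinel list is mine, so extend it with whatever your exports actually use:

```python
# Spellings of "empty" I've seen in the wild; extend as needed.
NULL_SENTINELS = {"", "null", "n/a", "none"}

def clean_cell(s):
    # Collapse the various empty-value spellings to None and straighten
    # curly quotes so the text matches search filters later.
    if s is None:
        return None
    s = str(s).strip().translate(str.maketrans("\u2018\u2019\u201c\u201d", "''\"\""))
    return None if s.lower() in NULL_SENTINELS else s
```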
def normalize_email(s):
    if not s:
        return None
    s = s.strip().lower()
    return s if "@" in s else None

def normalize_phone(s):
    if not s:
        return None
    digits = re.sub(r"\D", "", str(s))
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip US country code
    return digits if len(digits) == 10 else None
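The duplicate-contact rule from the checklist, as a sketch. It assumes email, name, and phone have already been through the normalizers above, so string comparison is safe:

```python
import pandas as pd

def dedupe_contacts(df):
    # First pass: same normalized email is the same person.
    # Fallback for rows with no email: same (name, phone) pair.
    has_email = df["email"].notna()
    by_email = df[has_email].drop_duplicates(subset=["email"])
    no_email = df[~has_email].drop_duplicates(subset=["name", "phone"])
    return pd.concat([by_email, no_email], ignore_index=True)
```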
The reusable script template
Every migration I write follows the same shape. Load, normalize, map fields, validate, write. I keep it boring on purpose.
# migrate.py
import json
import sys
from pathlib import Path

import pandas as pd

def load(source, path):
    loaders = {
        "honeybook": load_honeybook,
        "dubsado": load_dubsado,
        "17hats": load_17hats,
    }
    return loaders[source](path)

def normalize(df):
    df["email"] = df["email"].apply(normalize_email)
    df["phone"] = df["phone"].apply(normalize_phone)
    df["name"] = df["name"].str.strip()
    return df

FIELD_MAP = {
    # source_column : target_column
    "email": "primary_email",
    "phone": "primary_phone",
    "name": "full_name",
    "company": "company_name",
    "tags": "tag_list",
    "_custom_blob": "notes",
}

def map_fields(df, mapping=FIELD_MAP):
    out = pd.DataFrame()
    for src, dst in mapping.items():
        out[dst] = df[src] if src in df.columns else None
    return out

def validate(df):
    errors = []
    no_contact = df[df["primary_email"].isna() & df["primary_phone"].isna()]
    if len(no_contact):
        errors.append(f"{len(no_contact)} rows have no email or phone")
    dupes = df[df["primary_email"].duplicated(keep=False) & df["primary_email"].notna()]
    if len(dupes):
        errors.append(f"{len(dupes)} duplicate emails")
    return errors

def write(df, out_path):
    df.to_csv(out_path, index=False, encoding="utf-8")

if __name__ == "__main__":
    source, in_path, out_path = sys.argv[1], sys.argv[2], sys.argv[3]
    df = load(source, in_path)
    df = normalize(df)
    df = map_fields(df)
    errs = validate(df)
    if errs:
        print("VALIDATION FAILED:")
        for e in errs:
            print(" -", e)
        sys.exit(1)
    write(df, out_path)
    print(f"wrote {len(df)} rows to {out_path}")
Usage:
python migrate.py honeybook ./exports/contacts.csv ./out/contacts.csv
Validate before you import. Always.
The thing that has saved me from "hey Andrew our CRM thinks every customer has the same email" is dry-running the import against a staging instance and diffing the counts.
My validation pass checks:
- row count matches the source (no silent drops)
- every row has at least one contact method
- no duplicate primary emails
- phone numbers are exactly 10 digits after normalization
- timestamps parse cleanly
- tag list has no empty strings
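The timestamps check is the one people skip, so here's a sketch of it on its own. The created_at column name is an assumption; point it at whatever date columns survived your field mapping:

```python
import pandas as pd

def check_timestamps(df, col="created_at"):
    # Return rows whose timestamp is non-empty but fails to parse.
    # A non-empty result should fail the validation run.
    raw = df[col].astype(str).str.strip()
    parsed = pd.to_datetime(raw.where(raw != ""), errors="coerce", utc=True)
    return df[(raw != "") & parsed.isna()]
```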
If any of those fail, the script exits non-zero and I don't touch the prod CRM. If I'm feeling fancy I pipe the validation report to a JSON file the client can review before I press the button.
def write_report(df, errs, path):
    Path(path).write_text(json.dumps({
        "total_rows": len(df),
        "rows_with_email": int(df["primary_email"].notna().sum()),
        "rows_with_phone": int(df["primary_phone"].notna().sum()),
        "errors": errs,
    }, indent=2))
Client sees the report. Client signs off. Then I import. Has saved my ass more than once when someone went "wait, we don't have 4,000 contacts, we have 400" and we caught it before it hit production.
What I'd do differently starting today
Skip the CSV middle step entirely if the source has a real API. Honeybook does, kind of. Dubsado does not in any meaningful sense. 17hats is a no.
If you're stuck with CSVs, version-control them. Drop the raw export into a git-ignored folder, commit the normalized output, and you'll have an audit trail when the client emails you in six months asking why a contact looks different.
Also: write the script for your worst client, not your best one. The one with 14 years of half-finished projects and free-text status fields where someone typed "idk?" 47 times. If your script handles them, it handles anyone.
If you've done one of these migrations and hit something I didn't cover, drop it in the comments. I'm building my own CRM right now specifically so I never have to do another Dubsado export, and I'd love to know what I'm missing on the import side.
