Jason Eidam

Posted on • Originally published at news.ycombinator.com

Biggest data integration challenges not addressed today? Why?

Brief Plug:

This post is related to what my company is building at General Commission, where we are leveraging AI to autonomously structure disparate data inputs, regardless of source, type, or format. We are looking for feedback from the user community on what's most important to you.

Intro

Data availability is increasing exponentially and is expected to continue on that trajectory over the next 5-10 years. This will continue to exacerbate data engineering challenges around discovery, curation, integration, and synthesis.

Today's Engineering Tools

The market has produced several data integration and engineering solutions to assist in structuring and integrating disparate data sources (see: Fivetran, PreCog, dbt, Mozart Data, Palantir, y42, etc.).

IMO, these are somewhat limited in scope and generally built around a library of pre-built connectors (or, in Palantir's case, extremely expensive with a long ramp-up).

What other technical challenges remain in using these solutions? What sets them apart? Why?

From what we've heard, users want a common but configurable schema output they can query, generated from many disparate data sources. They want to ingest data from structured and unstructured sources, web-based or local: APIs, databases, flat files, websites, etc. But they are spending too much time trying to normalize it all.
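To make the normalization pain point concrete, here is a minimal sketch of what teams end up hand-writing today: per-source mapping functions that coerce differently shaped records into one common, queryable schema. All source shapes and field names below are illustrative assumptions, not any particular vendor's format.

```python
# Hypothetical common schema the downstream queries expect.
COMMON_SCHEMA = ("customer_id", "email", "signup_date")

def normalize_api_record(rec: dict) -> dict:
    # e.g. a REST API that nests fields under an "attributes" key
    attrs = rec.get("attributes", {})
    return {
        "customer_id": rec.get("id"),
        "email": attrs.get("email_address"),
        "signup_date": attrs.get("created_at"),
    }

def normalize_csv_row(row: dict) -> dict:
    # e.g. a flat-file export with its own column names
    return {
        "customer_id": row.get("CustID"),
        "email": row.get("Email"),
        "signup_date": row.get("SignupDate"),
    }

api_record = {"id": "42", "attributes": {"email_address": "a@b.com",
                                         "created_at": "2021-01-05"}}
csv_row = {"CustID": "43", "Email": "c@d.com", "SignupDate": "2021-02-10"}

# Both sources now land in the same shape and can be queried together.
rows = [normalize_api_record(api_record), normalize_csv_row(csv_row)]
```

Every new source means another hand-written mapper like these, which is exactly the maintenance burden users are describing.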

What's Ahead

I believe AI and machine learning can address some of these limitations, making existing SaaS solutions more open-ended in what data sources they can accept as inputs for integration.
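As a toy illustration of what could be automated (this uses simple string similarity from the standard library, not an actual ML model), consider mapping arbitrary source column names onto a canonical schema. A production system would add learned embeddings and value profiling; the schema and column names here are assumptions for the example.

```python
from difflib import get_close_matches

# Hypothetical canonical schema to map source columns onto.
CANONICAL = ["customer_id", "email", "signup_date"]

def infer_mapping(source_columns, canonical=CANONICAL, cutoff=0.4):
    """Guess which canonical field each source column corresponds to,
    using string similarity as a stand-in for a learned matcher."""
    mapping = {}
    for col in source_columns:
        # normalize separators before comparing names
        candidate = col.lower().replace(" ", "_").replace("-", "_")
        match = get_close_matches(candidate, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[col] = match[0]
    return mapping

mapping = infer_mapping(["Customer ID", "E-mail", "Signup-Date"])
```

The interesting engineering questions start where this sketch ends: handling ambiguous matches, validating guesses against sampled values, and letting users override the inferred mapping.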

What are the biggest points to consider from a technical / engineering perspective?
