Rey Riel

Posted on • Updated on • Originally published at citadelid.com

Standardizing Data — Making data consistent with 30+ data sources

When you offer an employment/income verification product like we do at Citadel, one of the most important steps in delivering a top-tier developer experience is ensuring a high-quality integration with multiple data sources, in our case payroll providers. There are multiple ways to accomplish this, and in our opinion data standardization is the way to go.

What is data standardization?

So what is data standardization? By definition, it's the process of converting data to a common format so that users can process and analyze it. In a normal setup this is a fairly straightforward process: functions and checks ensure that the data a user enters into the system conforms to certain formats, and data is only admitted when it matches them. This matters because once data is in the system in the right format, it can be extracted with confidence that it will all look the same.
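To make that concrete, here's a minimal sketch of such a check in Python. The field name and the required date format are hypothetical, not Citadel's actual schema:

```python
# A minimal sketch of an entry-point format check (hypothetical field).
from datetime import datetime

def validate_pay_date(raw: str) -> str:
    """Admit a date only if it already matches the canonical YYYY-MM-DD format."""
    try:
        datetime.strptime(raw, "%Y-%m-%d")
    except ValueError:
        raise ValueError(f"pay_date must be YYYY-MM-DD, got {raw!r}")
    return raw
```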

At Citadel, however, we faced new challenges, since we're not able to impose rules on our multitude of data suppliers when they hand the data to us. We still got the job done, but it wasn't easy.

What to think about when standardizing data

When we decided to standardize the payroll data we receive from payroll providers, there were many processes to think through and many lessons to learn along the way. Here are the most important ones we figured out.

Just a small sample of data from the Citadel API

Integrating with each provider is a largely manual process. From the beginning you need a product manager and a software engineer to individually assess each payroll provider. There are different data formats (not everybody sends JSON, you see) and no two providers use the same field names. Because of this, it's very difficult to create a process that can be largely automated.
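In practice that means writing a field map by hand for every provider. Here's a rough sketch of what this can look like, with made-up provider and field names rather than anything from a real payroll API:

```python
# Hypothetical per-provider field maps; real provider payloads and
# internal field names will differ.
FIELD_MAPS = {
    "provider_a": {"employeeName": "full_name", "payRate": "rate"},
    "provider_b": {"worker_full_name": "full_name", "hourly_rate": "rate"},
}

def map_fields(provider: str, payload: dict) -> dict:
    """Rename a provider's fields to the common schema, keeping only mapped keys."""
    mapping = FIELD_MAPS[provider]
    return {ours: payload[theirs] for theirs, ours in mapping.items() if theirs in payload}
```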

It's not just about mapping fields, but massaging data. Not all data is free-form entry, and as engineers we know that enumerated data is much easier to work with: it's consistent and predictable. Unfortunately this isn't quite so easy when dealing with multiple data sources. Take pay frequency, for example. While provider A may call it “bi-weekly”, provider B might say “every two weeks”. As a result, we need to look at what each provider passes through for the fields we enumerate and, if needed, create a translation between the data they provide and the values we store.
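A translation like that can be as simple as a lookup table per enumerated field. The values below are illustrative, not Citadel's actual enumeration:

```python
# Map each provider's free-form wording onto one enumerated set.
PAY_FREQUENCY = {
    "bi-weekly": "biweekly",        # provider A's wording
    "every two weeks": "biweekly",  # provider B's wording
    "semi-monthly": "semimonthly",
    "twice a month": "semimonthly",
}

def normalize_pay_frequency(raw: str) -> str:
    try:
        return PAY_FREQUENCY[raw.strip().lower()]
    except KeyError:
        # An unmapped value means a new provider quirk to investigate,
        # not data to pass through silently.
        raise ValueError(f"Unmapped pay frequency: {raw!r}")
```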

Not all providers are going to provide the data we store. At Citadel we strive to give all the data needed to effectively verify employment and income through payroll providers, because we believe payroll providers are the ultimate source of truth for this type of verification. We provide over 100 data points to allow accurate and efficient verification, but unfortunately not every provider supplies all of them. Some providers don't give a basis of pay. Some don't provide job titles. Some do provide job titles, but some employers don't pass that information along. If the data comes from the provider, we definitely capture it, but sometimes it's just not available.
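One way to handle that in a normalized record is to make the optional fields explicit, so a missing data point is an honest null rather than a guess. A hypothetical sketch:

```python
# Hypothetical normalized record; field names are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class NormalizedEmployment:
    full_name: str
    job_title: Optional[str] = None   # some providers/employers omit this
    base_pay: Optional[float] = None  # some providers don't give a basis of pay

# A provider that only supplied a name leaves the other fields as None.
record = NormalizedEmployment(full_name="Jane Doe")
```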

All providers are going to provide more data than what we store. With support for over 30 of the largest payroll providers in the US, covering over 85% of Americans who have a payroll provider, there are hundreds of data points you couldn't even imagine a payroll provider would supply. It's important to distinguish which data points are worth storing, which are consistent between providers, and which aren't necessary for a verification or are only given by a few providers.

Providing new fields means going back and investigating all providers. Every week we get new field requests from developers, and we're more than happy to oblige; more data for the developer makes for better-informed verifications. But when a new field request comes in, it means we need to go back through each provider and map that data point.

So why even standardize data?

With all of the above to think about, some would ask: why even standardize the data? Why not simply store it as-is and spit it out to the developer when they request it? The answer is actually simple.

We love the developer. We want developers to get up and running with our APIs fast, and we want them to be happy while working with Citadel. With Citadel APIs being built by developers, for developers, we know that if we skipped all the headaches above ourselves, our developer community would have to put their time and effort into them instead, and we just don't want to subject them to that.

We're happy to go back through each provider to map new fields because we know you won't have to. We're happy to massage the data for each enumerated field so that you, the developers, can be confident the data we provide is in the shape you expect.

Data standardization is not a straightforward process when dealing with multiple data providers, and each integration takes real time and effort to produce quality data developers can rely on. But we're happy to do it to make your development with Citadel a piece of cake.

Learn more about how Citadel’s APIs can make employment and income verification easy and affordable at https://citadelid.com

Cover image provided by https://www.thebluediamondgallery.com/wooden-tile/d/data.html
