DEV Community: annabelvandaalen

A Brief Introduction to Data Portals

annabelvandaalen — Wed, 01 Jul 2020 11:07:56 +0000

A crucial tool for any organisation, data portals perform a range of functions, from providing an easily-searchable catalog of your data to enabling data visualisations and enhancement. This article is a must-read for anyone looking to unlock their data’s potential, from NGOs to the Fortune 100.

Photo by Matthew T. Rader on Unsplash

What is a data portal?

A data portal is software that catalogs datasets. There are two main types of data portal: open data portals for sharing public data, and internal data portals for sharing data within an organisation. They serve as a single “point of truth” for an organisation’s data or for data relating to a certain topic. Along with basic catalog features, data portals can incorporate an extensive range of functionality for organising, structuring and presenting data.

Background

The rise of data portals reflects the increase in the volume and variety of data being collected by organisations. For governments, this could be data on tax, crime and geolocations; for enterprises, data on sales, customer preferences and costs. Even the simplest organisations may have dozens of data assets, ranging from cloud spreadsheets to web analytics, while large organisations can have very complex data arrangements, ranging from Hadoop clusters and data warehouses to CRM systems. The more data you collect, the more robust your storage needs to be and the more sophisticated your system for managing it.

Why are data portals useful?

Data portals have five main functions. These are listed below.

1. Data discovery -

organisations wanting to get the most out of their data and use it to drive fact-based decision-making first need to overcome a basic obstacle: working out what data they actually own. Without a data portal, they might have to rely on word of mouth, or on calling around the office to ask whether anyone knows the whereabouts (or even the existence) of a certain dataset or file.

2. Data access -

common metadata, data showcases and data APIs make data easily and quickly accessible to technical and non-technical users. Data previews also allow users to work out whether the data is what they are looking for or not without having to open it.

3. Data lineage -

without a portal, it is easy to forget or lose track of data created a long time ago, by colleagues who have now left, or even just by other colleagues in the office. If you can’t locate the data, you might assume it doesn’t exist, re-invest in collecting it and have your data engineers re-transform it, which is a costly and time-consuming process.

4. Data integration -

often, organisations keep their data across different systems, devices and clouds. This means that data becomes ‘siloed’, i.e. not accessible to certain people or devices, leading to cumbersome document sharing across departments or staff. A data portal brings these sources together in a single catalog. Some data portals can also take data directly from the web, transform it into the correct format, and include it in the portal. This lets you integrate public data with your own.

5. Data visualisation & analysis -

one of the key motivations behind organising data is that you can use it to generate insight. Some data portals allow you to create graphs or other visual tools to monitor and analyse patterns or anomalies.
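
The first two functions, discovery and access, can be illustrated with a toy in-memory catalog. This is only a sketch in Python: the `DatasetEntry` fields, the `search` and `preview` helpers and the sample records are all assumptions made for illustration, not the data model of any particular portal.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record: metadata about a dataset (hypothetical fields)."""
    name: str
    description: str
    owner: str
    tags: list[str] = field(default_factory=list)
    sample_rows: list[dict] = field(default_factory=list)


def search(catalog: list[DatasetEntry], keyword: str) -> list[DatasetEntry]:
    """Data discovery: match a keyword against name, description and tags."""
    needle = keyword.lower()
    return [
        entry for entry in catalog
        if needle in entry.name.lower()
        or needle in entry.description.lower()
        or any(needle in tag.lower() for tag in entry.tags)
    ]


def preview(entry: DatasetEntry, limit: int = 2) -> list[dict]:
    """Data access: return the first few rows so users can judge relevance."""
    return entry.sample_rows[:limit]


# Illustrative records only.
catalog = [
    DatasetEntry(
        name="sales-2019",
        description="Monthly sales totals by region",
        owner="finance",
        tags=["sales", "revenue"],
        sample_rows=[
            {"month": "2019-01", "region": "EU", "total": 1200},
            {"month": "2019-02", "region": "EU", "total": 1350},
            {"month": "2019-03", "region": "EU", "total": 990},
        ],
    ),
    DatasetEntry(
        name="web-analytics",
        description="Daily page views for the public website",
        owner="marketing",
        tags=["traffic"],
    ),
]

hits = search(catalog, "sales")
print([entry.name for entry in hits])
print(preview(hits[0]))
```

A real portal would back the same two operations with a search index and a data store, but the idea is the same: search the metadata first, then preview before opening the data.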

Using your portal as a scaffold

With your data standardised and uniformly accessible, you can start to discover new purposes for it. That is to say, aside from helping you get your data organised, data portals act as scaffolds to start building with your data. Much of the new value for data comes from unexpected or unplanned applications that are made possible by combining existing data from across divisions and systems.

You can also use data portals to start applying principles of progressive enhancement to your data. Only once your data is standardised and uniformly accessible can you begin to enrich it. This may involve the following additions: data dictionaries (the location column refers to cities); data mappings (by using this city as a look-up against a different set, we can know that Sydney is a city in Australia, along with other information about Sydney); and data validation (are these locations correct?).
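
As a rough illustration of these three enrichments, the Python sketch below applies a data dictionary, a look-up mapping and a validation check to a couple of rows. The names `DATA_DICTIONARY`, `CITY_LOOKUP` and `enrich`, and the sample values, are hypothetical and chosen only for this example.

```python
# Data dictionary: what each column means (hypothetical wording).
DATA_DICTIONARY = {
    "location": "Name of the city the record refers to",
}

# Data mapping: look-up table keyed by city (illustrative values only).
CITY_LOOKUP = {
    "Sydney": {"country": "Australia"},
    "Amsterdam": {"country": "Netherlands"},
}


def enrich(row: dict) -> dict:
    """Return a copy of the row with look-up info and a validity flag added."""
    city_info = CITY_LOOKUP.get(row.get("location"))
    enriched = dict(row)
    # Data validation: a location is valid only if the look-up knows it.
    enriched["location_valid"] = city_info is not None
    enriched["country"] = city_info["country"] if city_info else None
    return enriched


rows = [{"location": "Sydney"}, {"location": "Sidney"}]
for row in rows:
    print(enrich(row))
```

The second row uses a misspelt city name on purpose: the look-up finds no match, so validation flags it rather than silently passing it through.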

What type of data portal is CKAN?

CKAN is the leading data portal software. It is usable both ‘out of the box’ and as a powerful framework for creating more tailored systems. CKAN’s combination of open-source codebase and enterprise support makes it uniquely attractive for organizations looking to build customized, enterprise-grade solutions.

FAQs

Are data portals always either open or closed?

Data portals are not necessarily either open or closed, but can fall somewhere in between. Some organisations, particularly those in the fields of research or philanthropy (i.e. those wanting to help others with their data), might use a portal for internal data management while allowing external organisations to search certain datasets.

By Paul Walsh and Annabel van Daalen, with graphics by Monika Popova.

Want to work with Datopian? We empower government and enterprise to unlock their data's potential through outstanding data management strategy and implementation. Check our website for more information or contact us.

Subscribe to Datasets: New CKAN Feature Explained

annabelvandaalen — Tue, 16 Jun 2020 15:20:45 +0000

Last month, we announced the launch of a new CKAN feature developed by Datopian that allows users to subscribe to datasets. This is an opt-in feature that sends users an email notification when a dataset to which they are subscribed is changed or updated. Let’s take a look at the feature in more detail.

Photo by Dele Oke on Unsplash

Why subscribe to datasets with CKAN?

The subscribe to datasets feature designed by Datopian was born out of the needs of our enterprise customers. In order to provide clients with a robust messaging system, we needed to build a feature outside of the main application process.

Before Datopian developed a subscribe to datasets feature, data portal users had no good way of finding out about changes to datasets. Existing approaches to notifying users of changes include using RSS feeds or CKAN’s built-in email integration. However, these approaches did not suit our client’s context because:

Some datasets and resources can change rapidly, and many different types of stakeholders can subscribe to change notifications. This means that anywhere from 50,000 to 200,000 notifications may be broadcast in a given month.
Our client wants to extend the notification feature to support additional notification channels as well as email. A next iteration will add SMS notifications, giving users the choice to receive notifications by SMS, email, or both.

Another advantage of the feature is that the granularity is high. Users can currently receive the following information via email notifications:

The names of the datasets in which a change has taken place.
Whether the change was applied to a whole dataset, or a single resource.
Whether there were changes to the metadata.

Here’s an example notification:

Screenshot section of an example email notification

Overview

Fig 1.1. Diagram demonstrates that data curators edit the metadata and data of a dataset or resource to which a user is subscribed.

Fig 1.2. Diagram shows, at a high level, the technical design of the data subscription service, including how it interacts with CKAN.

Current features

Configure notification frequency - system administrators can determine the frequency with which users receive email notifications. This is particularly helpful for users subscribed to very large datasets that are updated multiple times per minute or per hour.
Disable notifications for certain datasets - system administrators may opt to disable notifications for certain datasets for a number of reasons. In particular, companies using CKAN data portals may choose to disable notifications for datasets that are updated frequently, should the cost of mass emailing become too high.
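
To make these two settings concrete, here is a minimal Python sketch of how change events might be grouped into one digest per subscriber each time the configured interval elapses, skipping datasets for which notifications are disabled. It is not the service's actual implementation: `ChangeEvent`, `build_digests` and all sample names and addresses are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """One recorded change (hypothetical structure)."""
    dataset: str
    scope: str              # "dataset" or "resource"
    metadata_changed: bool


def build_digests(events, subscriptions, disabled_datasets=frozenset()):
    """Group pending change events into one digest per subscriber.

    events: changes collected since the last send
    subscriptions: mapping of user email -> set of dataset names
    disabled_datasets: datasets for which admins turned notifications off
    """
    digests = defaultdict(list)
    for event in events:
        if event.dataset in disabled_datasets:
            continue
        for user, datasets in subscriptions.items():
            if event.dataset in datasets:
                digests[user].append(event)
    return dict(digests)


# Illustrative data only.
events = [
    ChangeEvent("air-quality", "resource", False),
    ChangeEvent("air-quality", "dataset", True),
    ChangeEvent("live-sensor-feed", "resource", False),
]
subscriptions = {
    "ana@example.org": {"air-quality", "live-sensor-feed"},
    "ben@example.org": {"live-sensor-feed"},
}

digests = build_digests(events, subscriptions, disabled_datasets={"live-sensor-feed"})
for user, changes in digests.items():
    print(user, len(changes))
```

A scheduler would call something like `build_digests` once per configured interval, so a dataset that changes many times within that interval still produces a single email per subscriber.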

Upcoming features

Subscribe to new datasets - soon, CKAN users will be able to receive emails notifying them when new datasets are added to the portal. This is particularly helpful for users monitoring all portal activity.

How can I get the new feature?

The data subscriptions service is currently available for use. If you are interested in deploying it against your existing CKAN installation, please reach out to us by visiting the project on GitHub here and creating an issue. Alternatively, contact Datopian to discuss how we can deploy a data subscription integration for your platform.

Call to Action!

CKAN is open-source software that relies on collaboration to develop functionality. If you extend this new feature, we would be really interested in using your code to improve CKAN and thereby encourage others to opt for open-source solutions.

By Annabel van Daalen and Irio Musskopf, with graphics by Monika Popova.

Datopian Presents: Headless DMS

annabelvandaalen — Fri, 12 Jun 2020 10:51:42 +0000

By Annabel van Daalen and Rufus Pollock, with graphics by Monika Popova

In a previous article, we drew an analogy between CMS (that’s ‘content management system’) and DMS (‘data management system’) to show how the two types of software share a similar structure. Now, in this follow-up piece, we’re going to show how DMSs have always been one step ahead of the game when it comes to a novel software trend: headlessness.

Photo by Mika on Unsplash

Introduction

If you’ve not heard of Headless DMS before, that’s because you’re reading it here for the first time. However, while the term may be new, the concept itself is not. The open-source DMS CKAN has been operating headlessly for years - the term headlessness just didn’t exist yet. It wasn’t until CMSs began calling themselves headless that the name gained traction. Here at Datopian, though, we were working with headless software long before it became cool.

In order to understand how Headless DMS - and specifically CKAN - is significantly improving the ways in which organisations manage their data, we first need to clarify a key term: headlessness.

What does it mean for software to be ‘headless’?

To understand headless software, we first need to know some basic information about how software is structured. In software engineering, a distinction is made between the ‘frontend’ of a piece of software, the part seen by the user, and the ‘backend’, the behind-the-scenes part. The backend is made up of a storage component (e.g. a content repository or database), an editor (e.g. an admin user interface) and an API, which is a tool for delivering stored content to the frontend. The frontend acts as a renderer, turning stored content into a themed display (e.g. a webpage). Traditional DMSs, in which the frontend and backend run in the same process, are known as monolithic DMSs.

Sometimes, software engineers choose to decouple the frontend of the software (the ‘head’) from the backend (the ‘body’). There are many reasons for doing this, and these will be explored further on in the article.

You may remember the software stick-person from the precursor article - here they are again, but this time they have been decoupled, leaving a headless part and a head:

DMS head

Headless DMS

Why go headless?

Datopian believes that shifting focus away from monolithic DMS to a decoupled DMS could significantly benefit data-driven organisations. There are two main reasons for this.

Reason 1: greater specialisation

Using a monolithic DMS, in which the frontend and backend are tied together, limits the extent to which each function of the DMS can be customised. The following table demonstrates the limitations of monolithic DMSs:

Limitation	Description
Frontend and backend development require different programming languages	In the case of CKAN, frontend developers would have to install Python just to be able to do a small amount of HTML and CSS work.
Updating the frontend means updating the backend	Updating the frontend alone takes much less time than updating the backend, so what should be a speedy process takes an unnecessary amount of time.
You can’t choose the frontend	Considering there are multiple frontend frameworks out there, why not be able to choose the best one to suit your needs?
Heavy-weight instances	Scaling through replication means replicating the whole instance, not just the frontend.

By decoupling the head and headless part of a DMS, we can build both parts using the latest technologies and practices specialised for each purpose. It also makes life easier for data portal developers, who now don’t have to worry about tackling the backend to make changes to the frontend. This way, it’s easier to find developers for either end, and data portals are cheaper, faster and more flexible to build.

Reason 2: more options

Monolithic DMSs can no longer keep up with the changing demands of users. Nowadays, users want to be able to integrate multiple sources, or push data from one database to multiple systems and devices. This is not possible with a monolithic DMS, which can only provide one backend and one frontend.

For example, imagine a company using a DMS no longer just wants to display their data through the ‘attached’ data portal, but also wants to push this data to smartphone or smartwatch apps, or a website. It can’t do this with a monolithic DMS. Neither could the company suddenly decide they wanted to have one frontend (e.g. a data portal) that integrated information from multiple sources.
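
The two scenarios in this example can be sketched in a few lines of Python once the head is decoupled from the body. Everything here is a toy stand-in: `dms_api`, `cms_api` and the three `*_head` functions are hypothetical names, and the functions return hard-coded data instead of calling a real service.

```python
import json


# "Bodies": backends that expose their contents only through an API-like call.
def dms_api(dataset_id: str) -> dict:
    datasets = {"air-quality": {"title": "Air quality", "rows": 3}}
    return datasets[dataset_id]


def cms_api(page_id: str) -> dict:
    pages = {"about": {"title": "About this portal", "body": "Welcome."}}
    return pages[page_id]


# One body, multiple heads: two renderers consume the same DMS API.
def web_head(dataset: dict) -> str:
    return f"<h1>{dataset['title']}</h1><p>{dataset['rows']} rows</p>"


def mobile_head(dataset: dict) -> str:
    return json.dumps({"label": dataset["title"], "count": dataset["rows"]})


# One head, multiple bodies: one renderer combines CMS content and DMS data.
def portal_head(dataset: dict, page: dict) -> str:
    return f"<h1>{page['title']}</h1><p>{page['body']}</p>" + web_head(dataset)


dataset = dms_api("air-quality")
print(web_head(dataset))
print(mobile_head(dataset))
print(portal_head(dataset, cms_api("about")))
```

Because each head only depends on what the API returns, adding a new device or a new source means writing one more small function, not changing the backend.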

Let’s look at the different options presented by decoupling in turn.

Option 1: push data from one database to multiple devices (the ‘one body, multiple heads’ approach).

"One body, multiple heads" diagram by Monika Popova

Option 2: integrate data and content from multiple sources (the ‘one head, multiple bodies’ approach).

"One head, multiple bodies" diagram by Monika Popova

Where does CKAN fit into all of this?

We mentioned earlier that CKAN was operating headlessly before headlessness became cool. Back in 2010, CKAN was used in headless mode to build data.gov.uk. Nowadays, most of our clients use CKAN as a monolithic DMS, that is, with the backend and the frontend unified as one system.

Recently, however, we at Datopian have been building a decoupled head for CKAN in JavaScript, called frontend v2. This is already in production with a number of Datopian clients, and allows us to deliver CKAN in two pieces: the headless component and the head. We are currently working hard to make the head even better using the latest frontend technologies, React and Next.js.

From the perspective of our clients, not much has changed in terms of the way they use CKAN. However, deploying CKAN in decoupled mode improves their overall experience with the software. This is because frontend v2 makes it easier for clients to integrate content and data from both DMS and CMS for unified display via the same ‘head’. This is all made possible by CKAN’s rich API.
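
To give a flavour of what a decoupled head does with that API, the Python sketch below builds a request URL for CKAN's `package_search` action and pulls dataset titles out of a decoded response. The portal address and the sample response are made up for illustration, and the helper names `package_search_url` and `dataset_titles` are not part of CKAN or of frontend v2; no network call is made.

```python
from urllib.parse import urlencode, urljoin


def package_search_url(portal_url: str, query: str, rows: int = 10) -> str:
    """Build a URL for CKAN's package_search action."""
    params = urlencode({"q": query, "rows": rows})
    endpoint = urljoin(portal_url.rstrip("/") + "/", "api/3/action/package_search")
    return endpoint + "?" + params


def dataset_titles(response: dict) -> list[str]:
    """Extract dataset titles from a decoded package_search response."""
    if not response.get("success"):
        raise ValueError("CKAN reported a failed request")
    # Fall back to the dataset name when no title has been set.
    return [pkg.get("title") or pkg["name"] for pkg in response["result"]["results"]]


# Hand-written stand-in for a decoded JSON response (illustrative data only).
sample_response = {
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "air-quality", "title": "Air quality"},
            {"name": "road-traffic", "title": ""},
        ],
    },
}

print(package_search_url("https://demo.example.org", "air"))
print(dataset_titles(sample_response))
```

A head written in any language can do the same thing: fetch JSON from the API, then render it however the display requires.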

This is the future

As mentioned earlier, through our work on CKAN, we at Datopian were fine-tuning our approach to headless software years before it became cool. Thanks to recent developments in content management, we’ve now been able to give it a name: headless DMS. You heard it here first.

Technical Appendix:

There is currently no straightforward way to create unified front ends that integrate content and data. Here are some of the possible options:

You start to develop a CMS (backend) in the DMS. This is far from ideal, as CMSs are good at what they do - they already have a rich admin UI and a good structure.
You put the DMS in front of the CMS. This would mean having to replicate content into the DMS and develop theming for that content in the DMS.
You put a CMS in front of the DMS. This is even worse, as data portal functionality is data focused, so you now have to replicate that functionality into the CMS.
Or, you run the two side by side. This would mean maintaining two themes and accepting a bifurcated user experience (and you may have to replicate things like user accounts).

As CMS is to Content, DMS is to Data

annabelvandaalen — Wed, 10 Jun 2020 12:51:49 +0000

You've heard of a content management system (CMS), but have you heard of a data management system (DMS)? In this article, we show that the two aren't all that different. Just as many companies turn instinctively to CMS to manage their content, we'll explain why DMS should be the natural go-to for any data-driven organisation.

Photo by Stephen Phillips on Unsplash

Many companies are familiar with the term Content Management System (CMS). For those producing large amounts of content, investing in a CMS is the established practice. Type ‘content management’ into a search engine and well-worn software like WordPress and Contentful appears at the top of the results.

Far fewer companies count Data Management System (DMS) among their passive vocabulary. Enterprises producing large amounts of data, unlike their content-producing counterparts, have not traditionally enjoyed go-to solutions for managing their assets. They have only really had access to ad-hoc solutions for storing their data, like Dropbox or SharePoint.

At least, until now. As a range of organisations begin collecting more and more data, the term DMS is starting to escape expert circles. So, too, is knowledge about the open-source data management software CKAN.

What is a CMS?

A content management system is software that can be used to manage the creation, modification and display of website content. In other words, it is a tool that allows non-expert users to create a website without having to code one from scratch. A well-known CMS is WordPress.

A traditional CMS allows users to:

Store content, such as blog posts and images
Edit, update and add content
Display themed content on a website
Share content internally

What is a DMS?

A data management system is software that can be used to manage the storage, modification and display of data(sets) in a data portal. These could be internal data portals, used to manage private organisation data, or open data portals, used to share data with the public.

DMSs are becoming increasingly popular among a wide range of data-driven organisations, from governments to enterprises. This is because DMSs allow organisations to do much more with their data than simply store it across ad-hoc solutions like Dropbox or OneDrive. A traditional DMS allows you to:

Store metadata about data stored elsewhere
Discover data
Edit, update and add data and/or metadata
Display and visualise data
Share data internally and externally

CKAN is an open-source DMS, which means it can be extended to provide new features based on different user needs.

Comparing CMS and DMS

The design and function of a DMS are very similar to those of a CMS. Both systems are made up of the same basic components:

A system for storing information
An interface for creating and editing information
A component for rendering the stored information in a user interface (UI) and often in an API

You might think of a CMS and a DMS in terms of a human body, with the head as the front end and the body and legs as the back end (note: in software engineering, a distinction is made between the ‘front end’, the part of the software seen by the user, and the ‘back end’, the behind-the-scenes part of the software). Each part of the body serves a certain function.

TIP

It is important to clarify here that the types of CMS and DMS under discussion in this article are traditional, or ‘monolithic’, CMSs/DMSs. For this software to function properly, each part of the body has to work together as a unified system. More recently, new approaches to management system software (known as headlessness or decoupling), in which the different parts operate independently of one another, have been gaining traction. Look out for an upcoming post on this from us soon!

Let’s compare the two management systems in more detail.

Part	Monolithic DMS	Monolithic CMS
storage	data catalog storing metadata about data stored elsewhere (and sometimes the data itself)	content repository
API	delivers data	delivers content
admin interface	allows users to edit and add to datasets	allows users to edit content
renderer	displays datasets on a data portal	displays themed content on a webpage

Whereas a CMS publishes web pages, a DMS publishes datasets. That being said, new approaches to data management make it possible to display both content (such as blog posts) and datasets via the same front end. This is thanks to the ‘headless’ movement within content and data management, which forms the subject of an upcoming Datopian post.

Article by Annabel van Daalen and Rufus Pollock, with graphics by Monika Popova.