vcavanna

Interactions Tracker, Part 2: Revising the Data Model

For my next article on the interactions tracker, I had planned to show how the model I had created could be implemented. Unfortunately, the data model I made isn't ready for an RDBMS. That's okay! This mistake gives me the opportunity to go over the exact sticking point, examine both my thought process and how relational modeling actually works, and show how I can implement this in a way that better adheres to the warehousing principles drawn from Kimball.

So what exactly was the mistake that I made in my data model?

Well, let's re-evaluate the diagram that I had produced earlier:

[Diagram: the original data model, with interactions and events as two separate fact tables]

This diagram has two fact tables at two different levels of detail, or grain, so we're not really practicing dimensional modeling.

This became apparent as I began making the SQL tables for this dimensional model:

import sqlite3
conn = sqlite3.connect("rez_life.db")

# Column definitions for the event-grain fact table
event_columns = [
    "event_key INTEGER PRIMARY KEY",
    "staff_key INTEGER",
    "student_keys ??????",  # What do I do here??
    "conversation_type VARCHAR",
    ...
]

student_keys can't be represented as a single element of data. This matters because student_keys is meant to be a foreign key to the dim_members table. One of my business questions was "Who is attending each event?" To answer it, I need a query along the lines of SELECT student_keys FROM fct_events in order to get the list of students.
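To make the sticking point concrete, here's a minimal sketch of what happens if you try to stuff a Python list of attendee keys into that single column with sqlite3 (the table and key values here are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fct_events (event_key INTEGER PRIMARY KEY, student_keys INTEGER)"
)

# A list of attendee keys can't be bound to a single column parameter;
# sqlite3 raises an error because a Python list isn't a supported type.
conn.execute("INSERT INTO fct_events VALUES (?, ?)", (1, [101, 102, 103]))

A relational column holds exactly one value per row, so the list has to go somewhere else.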

Events should not be a fact table

Of course, it's hypothetically possible to store a list of names in a single column, as I wanted in my first attempt at data modeling, and keep the "event" level of grain:

>>> residentKeyList = ["Jeffrey", "Zack", "Cassandra"]
>>> residentKeysAsString = ""
>>> for key in residentKeyList:
...     residentKeysAsString += key + "_"
...
>>> print(residentKeysAsString)
Jeffrey_Zack_Cassandra_

And now I have it all together, like I wanted. The issue with this is twofold: 1) now I have to create a custom parser, which is prone to breaking (see the sketch below), and 2) in choosing a coarser grain, I'm imposing limits on what I can do with this data.
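Here's a minimal sketch of the kind of parser this design forces on every consumer of the data. The function name and the underscore delimiter are my own inventions for illustration:

def parse_resident_keys(concatenated):
    """Split a delimiter-joined string of resident keys back into a list."""
    return [key for key in concatenated.split("_") if key]

print(parse_resident_keys("Jeffrey_Zack_Cassandra_"))
# ['Jeffrey', 'Zack', 'Cassandra']

# The parser breaks the moment a stored value happens to contain the delimiter:
print(parse_resident_keys("Mary_Beth_Zack_"))
# ['Mary', 'Beth', 'Zack'] -- "Mary_Beth" has silently become two residents

Every piece of code that reads this table has to carry a copy of this logic, and the database can do nothing to keep it honest.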

For the second issue, let's suppose I don't just want a list of residents that attended the event. Suppose I only want the residents that live in one particular dormitory! This makes sense from a residence life standpoint: while it's great to offer events to non-residents, you want to make sure that events mostly bring in the students you're asked to serve.

All of a sudden, our handy concatenated string of residents is close to useless! We would need to fetch the string through SQL, parse it in Python, and then check each student in Python to see whether they reside in the dormitory of choice, as sketched below. Already I sense the data analysts shaking their fists!
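Here's roughly what that round trip could look like (the table and column names are assumptions, not the final design):

import sqlite3

conn = sqlite3.connect("rez_life.db")

# Step 1: pull the concatenated attendee string out of SQL and parse it in Python.
row = conn.execute(
    "SELECT student_keys FROM fct_events WHERE event_name = ?",
    ("Chili Cookout",),
).fetchone()
student_keys = [key for key in row[0].split("_") if key]

# Step 2: filter in Python, one extra lookup per student, instead of a single JOIN.
clark_hall_attendees = []
for key in student_keys:
    residence = conn.execute(
        "SELECT residence FROM dim_members WHERE member_key = ?",
        (key,),
    ).fetchone()
    if residence and residence[0] == "Clark Hall":
        clark_hall_attendees.append(key)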

Remodeling to make events a dimension table is the answer

Wouldn't it be so much better if we could answer this question in a single query? For instance:

SELECT dim_members.resident_name
FROM fct_interactions
INNER JOIN dim_members ON fct_interactions.member_key = dim_members.member_key
WHERE fct_interactions.event_key = 'Chili Cookout'
  AND dim_members.residence = 'Clark Hall'

This gets the same data as the complicated process above simply by joining two tables together and filtering the results. Keeping things at the lowest possible grain is standard Kimball practice, and trying anything else was my mistake.

Luckily, the remodel is very simple:

[Diagram: the revised data model, with events as a dimension table]

Simply treating the events table as a dimension table, rather than making it a separate fact table, keeps the fact data at a single grain, meaning that queries like the one above can be written fairly easily.
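For reference, here's a rough sketch of what the revised tables could look like in sqlite3. The exact columns are assumptions on my part rather than the final design; the key point is that fct_interactions holds one row per member per event, so every foreign key is a single value:

import sqlite3

conn = sqlite3.connect("rez_life.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS dim_events (
        event_key INTEGER PRIMARY KEY,
        event_name VARCHAR,
        conversation_type VARCHAR
    );
    CREATE TABLE IF NOT EXISTS dim_members (
        member_key INTEGER PRIMARY KEY,
        resident_name VARCHAR,
        residence VARCHAR
    );
    CREATE TABLE IF NOT EXISTS fct_interactions (
        interaction_key INTEGER PRIMARY KEY,
        event_key INTEGER REFERENCES dim_events(event_key),
        member_key INTEGER REFERENCES dim_members(member_key),
        staff_key INTEGER
    );
""")

With that shape, the join-and-filter query above works as written.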

Key Takeaways

  1. Unless you want angry analysts, keep the data to one value per cell. Storing a delimited list in a single column defeats the point of an RDBMS. Seems like a clear point in retrospect.
  2. The events table will be treated as a dimension table from now on, keeping the fact table at the lowest grain. We'll see more about how this impacts the design in the next post, when I implement the SQL database.
