Thomas Brittain

Posted on Dec 26, 2016 • Originally published at ladvien.com

HMIS, R, and SQL -- Basics

#eto #hmis #r #sql

Hacker Introduction

I'm a hacker. If you find errors, please leave comments below. If you have an opinion I'll hear it, but I'm often not likely to agree without some argument.

Joins (Merging Data)

Probably the best part of R and SQL is their ability to quickly combine data around a key. For example, in HMIS CSVs the Client.csv contains a lot of demographic information and the Enrollment.csv contains a lot of assessment information. This makes it difficult to count the participants who are both veterans and disabled, since the veteran information is in Client.csv and the disability information is in Enrollment.csv. Fortunately, both R and SQL provide join functions for exactly this.

Joins are a hugely expansive topic; I'm not going to try to cover all their quirks, but here are some videos I found helpful:

The two most useful joins for HMIS data are LEFT JOIN and INNER JOIN. The left join keeps every row from the left table plus any matching data from the right table, while the inner join keeps only the rows that match in both tables.

Here's an example in the context of the Client.csv and Enrollment.csv:

Client.csv

PersonalID	FirstName	VeteranStatus
12345	Jane	Yes
54321	Joe	No

Enrollment.csv

PersonalID	FirstName	DisablingCondition
12345	Jane	Yes
54321	Joe	No
45321	Sven	Yes

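Here are the two join statements and their results for the data above.

SELECT * 
   FROM enrollment a 
   LEFT JOIN client b ON a.PersonalID=b.PersonalID

This join should result in the following (with the duplicated PersonalID and FirstName columns collapsed):

PersonalID	FirstName	VeteranStatus	DisablingCondition
12345	Jane	Yes	Yes
54321	Joe	No	No
45321	Sven	NULL	Yes

Notice Sven was kept, even though he had no entry in the Client.csv. That's because enrollment is the left table here, and a left join keeps every row from the left table. After the join, since he had no matching client row, his VeteranStatus is NULL.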

And the inner join would look like this:

SELECT * 
   FROM client a 
   INNER JOIN enrollment b ON a.PersonalID=b.PersonalID

This join should result in the following:

PersonalID	FirstName	VeteranStatus	DisablingCondition
12345	Jane	Yes	Yes
54321	Joe	No	No
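To check the two joins against each other, here's a minimal sketch that loads the sample rows above into an in-memory SQLite database. A few assumptions: I'm using Python's built-in sqlite3 module instead of R's sqldf (sqldf runs on SQLite by default, so the SQL should carry over), every column is stored as text, the Enrollment table sits on the left of both joins, and I list the columns explicitly rather than using SELECT * so the duplicated PersonalID and FirstName columns don't show up twice.

```python
import sqlite3

# In-memory database holding the sample Client.csv and Enrollment.csv rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client (PersonalID TEXT, FirstName TEXT, VeteranStatus TEXT);
    CREATE TABLE enrollment (PersonalID TEXT, FirstName TEXT, DisablingCondition TEXT);
    INSERT INTO client VALUES ('12345', 'Jane', 'Yes'), ('54321', 'Joe', 'No');
    INSERT INTO enrollment VALUES
        ('12345', 'Jane', 'Yes'), ('54321', 'Joe', 'No'), ('45321', 'Sven', 'Yes');
""")

# Explicit column list (an assumption, to avoid duplicate columns from SELECT *).
columns = "a.PersonalID, a.FirstName, b.VeteranStatus, a.DisablingCondition"

# LEFT JOIN: every enrollment row survives; unmatched client columns become NULL.
left_rows = conn.execute(f"""
    SELECT {columns}
      FROM enrollment a
      LEFT JOIN client b ON a.PersonalID = b.PersonalID
     ORDER BY a.FirstName
""").fetchall()

# INNER JOIN: only rows whose PersonalID appears in both tables survive.
inner_rows = conn.execute(f"""
    SELECT {columns}
      FROM enrollment a
      INNER JOIN client b ON a.PersonalID = b.PersonalID
     ORDER BY a.FirstName
""").fetchall()

for row in left_rows:
    print("LEFT ", row)
for row in inner_rows:
    print("INNER", row)
```

Sven shows up only in the left join output, with None (Python's stand-in for NULL) as his VeteranStatus.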

Counts

PersonalID <- sqldf("SELECT DISTINCT PersonalID FROM client")

The statement above creates a one-column data frame of all the PersonalIDs in the client data-frame, which came from the Client.csv. The DISTINCT keyword keeps only one copy of an ID when two or more rows share it. In short, it creates a de-duplicated list of participants.

For example,

PersonalID	OtherData
12345	xxxxxxxxx
56839	xxxxxxxxx
12345	xxxxxxxxx
32453	xxxxxxxxx

Should result in the following,

PersonalID
12345
56839
32453
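Here's a small sketch of that de-duplication, again assuming Python's built-in sqlite3 in place of sqldf and an in-memory table named client holding the four example rows. I added an ORDER BY (not in the original query) because SQL doesn't promise any particular row order without one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE client (PersonalID TEXT, OtherData TEXT)")
conn.executemany(
    "INSERT INTO client VALUES (?, ?)",
    [
        ("12345", "xxxxxxxxx"),
        ("56839", "xxxxxxxxx"),
        ("12345", "xxxxxxxxx"),  # duplicate ID on purpose
        ("32453", "xxxxxxxxx"),
    ],
)

# DISTINCT collapses the two 12345 rows into one.
unique_ids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT PersonalID FROM client ORDER BY PersonalID"
    )
]
print(unique_ids)
```

Four rows go in and three IDs come out, since 12345 is only listed once.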

This is useful in creating a key vector, given other CSVs have a one-to-many relationship for the PersonalID. For example,

The Enrollment.csv looks something like this:

PersonalID	ProjectEntryID	EntryDate
12345	34523	2016-12-01
56839	24523	2015-09-23
12345	23443	2014-01-10
32453	32454	2015-12-30

This reflects a client (i.e., 12345) entering a project twice, once on 2014-01-10 and again on 2016-12-01.

Count of Total Participants:

SELECT COUNT(PersonalID) as 'Total Participants' FROM client

This query should give a one-row output, counting the number of rows in the data-frame that have a PersonalID.

Total Participants
1	1609

However, if there are duplicate PersonalIDs, each duplicate row is counted as if it were a separate client. To get a count of unique clients in a data-frame, add the DISTINCT keyword.

SELECT COUNT(DISTINCT(PersonalID)) as 'Unique Total Participants' FROM client
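To see the difference between the two counts, here's a hedged sketch that runs both against the four sample Enrollment.csv rows shown earlier (not the client table, since the enrollment sample is the one with a repeated ID). As an assumption, I'm using Python's built-in sqlite3 with an in-memory table rather than sqldf.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollment (PersonalID TEXT, ProjectEntryID TEXT, EntryDate TEXT)"
)
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [
        ("12345", "34523", "2016-12-01"),
        ("56839", "24523", "2015-09-23"),
        ("12345", "23443", "2014-01-10"),  # same client, earlier entry
        ("32453", "32454", "2015-12-30"),
    ],
)

# Plain COUNT counts rows; COUNT(DISTINCT ...) counts unique clients.
total, unique_total = conn.execute(
    "SELECT COUNT(PersonalID), COUNT(DISTINCT PersonalID) FROM enrollment"
).fetchone()
print("Total Participants:", total)
print("Unique Total Participants:", unique_total)
```

The plain count reports 4 because client 12345 has two enrollments, while the distinct count reports 3.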

Conditional Data

Often in HMIS data it is necessary to find a collection of participants which meet a specific requirement. For example, "How many people in this data-set are disabled?" This is where the WHERE clause helps a lot.

SELECT PersonalID FROM clientAndEnrollment WHERE disability = 'Yes'

This statement will return the PersonalIDs of all participants who stated they were disabled. To turn that into a count, the total participant query could be combined with the same WHERE clause, but there is an alternative method.

SELECT SUM(CASE WHEN 
               disability = 'Yes' THEN 1 ELSE 0 
           END) as DisabledCount
       FROM clientAndEnrollment

The above statement uses the CASE WHEN END statement, which I understand as SQL's version of the IF statement. Here's a rough C equivalent:

for(int i = 0; i < total_participants; i++){
    // Check this participant's answer, just as CASE WHEN does for each row
    if(disability[i] == true){
       disabilityCounter++;
    }
}
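Here's a quick sketch comparing the two counting methods side by side. The assumptions: Python's built-in sqlite3 stands in for sqldf, clientAndEnrollment is an in-memory table with only the two columns needed, and the three rows are illustrative values borrowed from the join example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clientAndEnrollment (PersonalID TEXT, disability TEXT)")
conn.executemany(
    "INSERT INTO clientAndEnrollment VALUES (?, ?)",
    [("12345", "Yes"), ("54321", "No"), ("45321", "Yes")],  # illustrative rows
)

# Method 1: add 1 for every row where the CASE condition holds, 0 otherwise.
case_count = conn.execute("""
    SELECT SUM(CASE WHEN disability = 'Yes' THEN 1 ELSE 0 END) AS DisabledCount
      FROM clientAndEnrollment
""").fetchone()[0]

# Method 2: filter with WHERE first, then count what is left.
where_count = conn.execute("""
    SELECT COUNT(PersonalID) AS DisabledCount
      FROM clientAndEnrollment
     WHERE disability = 'Yes'
""").fetchone()[0]

print(case_count, where_count)
```

Both methods agree here. One reason to prefer the CASE form is that several conditional sums can sit in the same SELECT, whereas a WHERE clause filters the whole query. One thing to watch for is that SUM returns NULL on an empty table, while COUNT returns 0.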

BOOL!

Boolean operators can be used to get more complex conditional data:

SELECT PersonalID FROM clientAndEnrollment 
       WHERE disability = 'Yes' 
       AND gender = 'Female'

This statement will provide all the PersonalIDs for clients who are both disabled and female.
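As a rough illustration of how the boolean operators change the result, here's a sketch that runs the same filter with AND and then with OR. Everything about the data is an assumption: the clientAndEnrollment table is built in memory with Python's built-in sqlite3, and the four rows (including the gender values and the ID 67890) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clientAndEnrollment (PersonalID TEXT, disability TEXT, gender TEXT)"
)
conn.executemany(
    "INSERT INTO clientAndEnrollment VALUES (?, ?, ?)",
    [
        ("12345", "Yes", "Female"),  # hypothetical rows for illustration
        ("54321", "No", "Male"),
        ("45321", "Yes", "Male"),
        ("67890", "No", "Female"),
    ],
)

# AND: both conditions must hold for a row to be returned.
disabled_and_female = [
    row[0]
    for row in conn.execute("""
        SELECT PersonalID FROM clientAndEnrollment
         WHERE disability = 'Yes'
           AND gender = 'Female'
         ORDER BY PersonalID
    """)
]

# OR: either condition is enough.
disabled_or_female = [
    row[0]
    for row in conn.execute("""
        SELECT PersonalID FROM clientAndEnrollment
         WHERE disability = 'Yes'
            OR gender = 'Female'
         ORDER BY PersonalID
    """)
]

print(disabled_and_female)
print(disabled_or_female)
```

AND narrows the list to the one client who meets both conditions, while OR widens it to everyone who meets at least one.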

Ok, good stopping point for now.
