Austin

Posted on Apr 7, 2023

Crash City

#perl #soql #bash #curl

Analyzing NYC Traffic Collision Data on the Linux Command Line with SoQL and Curl

New York City and the NYPD publish a dataset of traffic collisions, related fatalities, injuries and other details here.
This was first published in July 2012, and is updated regularly to this date.

NYC Commuters, especially pedestrians and cyclists, have to endure many hazards just to get to and from work. Crime is one issue, but it's not as treacherous as crossing Queens Boulevard during rush hour, or cycling in downtown Manhattan when some clown driving an SUV the size of an Sherman tank is taking up half the road.

Previously I did some analysis using their downloadable CSV dataset.

Here I’m going to use the curl utility along with the SODA or SoQL Query Language, to try and make some sense out of this published data.

Using the Socrata Query Language, SoQL

Dataset App Token
Socrata functions

How many collisions since July 2012

curl --get --data-urlencode "\$\$app_token=uvwxyz" --data-urlencode "\$select=count(*)" https://data.cityofnewyork.us/resource/h9gi-nx95.json 
[{"count":"1977803"}]

Explanation

curl --get or -G
Use the GET verb as we are ‘getting’ data
v
Lots of verbose output as you can see from the above output.
d
Request data to pass to the API in ASCII format
--data-urlencode
URL-Encode the data. Safer than just using -d
$$app_token
Users personal authorization. Not really necessary for ad-hoc requests
Socrata open data API [App-Token]](https://dev.socrata.com/docs/app-tokens.html)
I replaced my actual token with ‘uvwxyz’ for fairly obvious reasons
"$select=count(*)"
Similar to the SQL SELECT and SQL count aggregate function
SoQL $select
SoQL count

Observation

1,977,803 Collisions from July 2012 to March 2023 seems like a lot to me. You’d wonder what’s the point of driving tests if we still end up with this many collisions.

Getting all the collision records between two arbitrary dates, June 30th 2022 to December 31 2022.

This time I’ll use the -v switch for curl to get a much more verbose output.

> curl --get -v  --data-urlencode "\$\$app_token=xyz" --data-urlencode "\$select=*" --data-urlencode "\$where=crash_date between '2022-06-30T00:00:00.000' and '2022-12-31T00:00:00.000'"  https://data.cityofnewyork.us/resource/h9gi-nx95.json

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.206.68.26:443...
* Connected to data.cityofnewyork.us (52.206.68.26) port 443 (#0)
...
> GET /resource/h9gi-nx95.json?$$app_token=xyz&$select=%2A&$where=crash_date+between+%272022-06-30T00%3A00%3A00.000%27+and+%272022-12-31T00%3A00%3A00.000%27 HTTP/1.1
> Host: data.cityofnewyork.us
> User-Agent: curl/7.81.0
...
< HTTP/1.1 200 OK
....
< X-SODA2-Fields: ["crash_date","crash_time","borough","zip_code","latitude","longitude","location","on_street_name","off_street_name","cross_street_name","number_of_persons_injured","number_of_persons_killed","number_of_pedestrians_injured","number_of_pedestrians_killed","number_of_cyclist_injured","number_of_cyclist_killed","number_of_motorist_injured","number_of_motorist_killed","contributing_factor_vehicle_1","contributing_factor_vehicle_2","contributing_factor_vehicle_3","contributing_factor_vehicle_4","contributing_factor_vehicle_5","collision_id","vehicle_type_code1","vehicle_type_code2","vehicle_type_code_3","vehicle_type_code_4","vehicle_type_code_5"]
< X-SODA2-Types: ["floating_timestamp","text","text","text","number","number","location","text","text","text","number","number","number","number","number","number","number","number","text","text","text","text","text","number","text","text","text","text","text"]
< X-SODA2-Data-Out-Of-Date: false
...
{ [14733 bytes data]
...
[{"crash_date":"2022-06-30T00:00:00.000","crash_time":"14:01","cross_street_name":"101       EAST DRIVE","number_of_persons_injured":"1","number_of_p
ersons_killed":"0","number_of_pedestrians_injured":"0","number_of_pedestrians_killed":"0","number_of_cyclist_injured":"1","number_of_cyclist_killed":
"0","number_of_motorist_injured":"0","number_of_motorist_killed":"0","contributing_factor_vehicle_1":"Pedestrian/Bicyclist/Other Pedestrian Error/Con
fusion","collision_id":"4542318","vehicle_type_code1":"Bike"}
...
,{"crash_date":"2022-07-03T00:00:00.000","crash_time":"22:30","borough":"BRONX","zip_code":"10458","latitude":"40.866802","longitude":"-73.88444","location":{"latitude":"40.866802","longitude":"-73.88444","human_address":"{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}"},"on_street_name":"WEBSTER AVENUE","off_street_name":"EAST 199 STREET","number_of_persons_injured":"0","number_of_persons_killed":"0","number_of_pedestrians_injured":"0","number_of_pedestrians_killed":"0","number_of_cyclist_injured":"0","number_of_cyclist_killed":"0","number_of_motorist_injured":"0","number_of_motorist_killed":"0","contributing_factor_vehicle_1":"Driver Inattention/Distraction","contributing_factor_vehicle_2":"Unspecified","collision_id":"4543075","vehicle_type_code1":"Station Wagon/Sport Utility Vehicle","vehicle_type_code2":"Station Wagon/Sport Utility Vehicle"}]

Explanation

1000 records
When no $limit is set, this is the default maximum rows returned
curl --get or -G
Use the GET verb as we are ‘getting’ data
-v
Lots of verbose output as you can see
-d
Request data to pass to the API in ASCII format
--data-urlencode
URL-Encode the data. Safer than just using -d

$$app_token
Users personal authorization. Not really necessary for ad-hoc requests
"$select=*"
Similar to an SQL SELECT
Selecting all columns. This is the default and can be omitted
SoQL $select
$where
Similar to SQL WHERE to filter down data.
SoQL $where
between … and …
SoQL between
Narrow our results down to collisions between the two *inclusive ‘crash_date’ values

Observation

It turns out after piping this request to a wc command, that the API only returns 1000 rows, which is the default maximum amount if the $limit clause isn’t specified. With the $limit clause, the maximum amount that can be returned with one call is 50,000 rows. To get more, you will need to order and page through the data.
One other thing to note here is that when using the -v, verbose switch, you get to see the column names and their data types.

The NYC dataset column names

"crash_date","crash_time","borough","zip_code","latitude","longitude","location","on_street_name","off_street_name","cross_street_name","number_of_persons_injured","number_of_persons_killed","number_of_pedestrians_injured","number_of_pedestrians_killed","number_of_cyclist_injured","number_of_cyclist_killed","number_of_motorist_injured","number_of_motorist_killed","contributing_factor_vehicle_1","contributing_factor_vehicle_2","contributing_factor_vehicle_3","contributing_factor_vehicle_4","contributing_factor_vehicle_5","collision_id","vehicle_type_code1","vehicle_type_code2","vehicle_type_code_3","vehicle_type_code_4","vehicle_type_code_5"

Corresponding Field Data Types

"floating_timestamp","text","text","text","number","number","location","text","text","text","number","number","number","number","number","number","number","number","text","text","text","text","text","number","text","text","text","text","text"

SoQL Query Clauses from the Docs

Parameter	Description	Default	In $query
$select	The set of columns to be returned, similar to a SELECT in SQL	All columns, equivalent to $select=*	SELECT
$where	Filters the rows to be returned, similar to WHERE	No filter	WHERE
$order	Column to order results on, similar to ORDER BY in SQL	Unspecified order	ORDER BY
$group	Column to group results on, similar to GROUP BY in SQL	No grouping	GROUP BY
$having	Filters the rows that result from an aggregation, similar to HAVING	No filter	HAVING
$limit	Maximum number of results to return	1000 (2.0 endpoints: maximum of 50,000; 2.1: unlimited »)	LIMIT
$offset	Offset count into the results to start at, used for paging	0	OFFSET
$q	Performs a full text search for a value.	No search	N/A
$query	A full SoQL query string, all as one parameter	N/A	N/A
$$bom	Prepends a UTF-8 Byte Order Mark to the beginning of CSV output	false	N/A

Get all the collisions for zip code 10036, Times Square NYC, for Feb 2023

Save it into file times_square_july_2022.json

curl --get --data-urlencode "\$\$app_token=uvwxyz"  --data-urlencode "\$select=*" / --data-urlencode "\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'" --data-urlencode "zip_code=10036"  https://data.cityofnewyork.us/resource/h9gi-nx95.json >    collisions_z10036_feb_2023.json

Explanation

"\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'" --data-urlencode "zip_code=10036"

Specify dates between and including February 1st to the 28th.
zip_code=10036 to narrow down our results.

Count how many collisions using the Linux wc command with our newly created file, times_square_july_2022.json.

wc -l collisions_z10036_feb_2023.json 
25 collisions_z10036_feb_2023.json

Double check that count of 25 collisions, using the SoQl count(*) function.

> curl --get --data-urlencode "\$\$app_token=uvwxyz"  --data-urlencode "\$select=count(*)"  --data-urlencode "\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'" --data-urlencode "zip_code=10036"  https://data.cityofnewyork.us/resource/h9gi-nx95.json 
[{"count":"25"}]

Explanation

\$select=count(*)
Similar to the SQL count function, this uses the SoQL count function to count the number of rows that match our search criteria.
[{"count":"25"}], which matches the number of records in the collisions_z10036_feb_2023.json file

Observation

25 collisions in one midtown zip code for February is almost 1 collision a day. I’m sure that's lower than many other zip codes.

Get the 10 worst zip codes for collisions in February 2023

> curl --get --silent  ‘$$app_token=uvwxyz’  --data-urlencode "\$select=count(*), zip_code"   --data-urlencode "\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'"  --data-urlencode '$group=zip_code'   https://data.cityofnewyork.us/resource/h9gi-nx95.json | jq -r '.[] | .zip_code + " " + .count' | sort  -k 2,2nr -k 1n | head -n10
11207 105
11212 85
11208 79
11226 75
11234 72
11236 72
11101 71
11203 67
11368 67
11211 62

Explanation

OK, I threw in a lot of commands here.
"\$select=count(*), zip_code"
Selecting the count and zip_code
SoQL count function to count the number of rows that match our search criteria.
$group=zip_code
Similar to the SQL GROUP BY
Returns aggregate rows grouped by the zip_code
jq -r '.[] | .zip_code + " " + .count'
Using the very useful jq to do additional filtering
jq bills itself as, “a lightweight and flexible command-line JSON processor”
I extract the zip_code and collision count for each zip code and concatenate them using the bash +, concatenation operator
sort -k 2,2nr -k 1n
Using the bash sort command, we do a reverse numerical sort by the second field, which is the count. We also do a numerical sort on the zip_code for zip_codes with identical collision counts
head -n10
This gets the first 10, which are the 10 zip codes with the most collisions, starting with the very worst.

Observation

I could have used SoQL $sort and $limit to do some of this work, but I chose the bash sort, just because ...
Zip code 11207, East New York, Brooklyn, emerges as the zip with the most collisions in February.
This zip has a lot of issues with traffic safety, as you could also check here .
105 collisions in one month. 3.75 a day? There’s something seriously wrong there. You’d probably need some kind of armor suit just to cross the street there.

As the queries get more complex, these one line commands start to get long and hard to manage.

Curl has an option to create a config file. On a Linux system the default config is usually ~/.curlrc. You can specify a config file with the -K or --config switch.

I created the below config file for these requests
The config file sets the NYC API URL, the $$app_token parameter, a GET request, as well as asking for verbose output

##### The ./.nyc_curlrc file contents
# --- NYC Collision Data ---
get
url = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"
data-urlencode  =  "\$\$app_token=uvwxyz"
verbose

The previous example can now be rewritten to use the .nyc_curlrc config file. I also broke up the commands into separate lines using the bash continuation ‘\’ . Enclosing some of the commands in single quotes also means that the ‘$’ doesn’t need to be escaped.

> curl -K ./.nyc_curlrc \
 --data-urlencode '$select=count(*), zip_code' \
 --data-urlencode '$where=crash_date between "2023-02-01T00:00:00.000" and "2023-02-28T00:00:00.000"' \
 --data-urlencode '$group=zip_code'  \
  | jq -r '.[] | .zip_code + " " + .count' | sort  -k 2,2nr -k 1n | head -n10

This is a little more concise than the previous version, and yields the same result.

### Now to find how many cyclists and pedestrians were killed over the duration of this dataset

 curl -K ./.nyc_curlrc \
  --data-urlencode "\$select=date_extract_y(crash_date) AS year, SUM(number_of_pedestrians_killed) AS tot_pedestrians_killed, SUM(number_of_cyclist_killed) AS tot_cyclist_killed"  \
  --data-urlencode "\$group=year" \
  --data-urlencode "\$order=tot_pedestrians_killed DESC"  | jq .

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   885    0   885    0     0   1616      0 --:--:-- --:--:-- --:--:--  1614
[
  {
    "year": "2013",
    "tot_pedestrians_killed": "176",
    "tot_cyclist_killed": "11"
  },
  {
    "year": "2016",
    "tot_pedestrians_killed": "149",
    "tot_cyclist_killed": "18"
  },
  {
    "year": "2014",
    "tot_pedestrians_killed": "133",
    "tot_cyclist_killed": "20"
  },
  {
    "year": "2015",
    "tot_pedestrians_killed": "133",
    "tot_cyclist_killed": "15"
  },
  {
    "year": "2022",
    "tot_pedestrians_killed": "132",
    "tot_cyclist_killed": "18"
  },
  {
    "year": "2019",
    "tot_pedestrians_killed": "131",
    "tot_cyclist_killed": "31"
  },
  {
    "year": "2021",
    "tot_pedestrians_killed": "129",
    "tot_cyclist_killed": "19"
  },
  {
    "year": "2017",
    "tot_pedestrians_killed": "127",
    "tot_cyclist_killed": "27"
  },
  {
    "year": "2018",
    "tot_pedestrians_killed": "123",
    "tot_cyclist_killed": "10"
  },
  {
    "year": "2020",
    "tot_pedestrians_killed": "101",
    "tot_cyclist_killed": "29"
  },
  {
    "year": "2012",
    "tot_pedestrians_killed": "72",
    "tot_cyclist_killed": "6"
  },
  {
    "year": "2023",
    "tot_pedestrians_killed": "18",
    "tot_cyclist_killed": "8"
  }
]

Explanation

date_extract_y(crash_date) AS year
Will extract ‘2023’ from ‘2023-02-03T00:00:00.000’
SoQL date_extract_y
SUM(number_of_pedestrians_killed) AS tot_pedestrians_killed
SUM
Similar to SQL SUM aggregate function.
AS
Give these aggregate results a meaningful label
$group=year and $order=tot_pedestrians_killed
Similar to the SQL GROUP BY and ORDER BY
Returns aggregate rows grouped by the year they occurred.
Sorted having the year with most pedestrian fatalities first
jq .
This is the most basic jq command
It just prints the JSON output in it’s default “pretty” format
We could have added --silent to the curl command or config file, to not print the curl download statistics.

Observation

2012 and the current year, 2023,  can be omitted as both years have incomplete data.

Run the previous query minus years 2012 and 2023

curl -K ./.nyc_curlrc   --data-urlencode '$select=date_extract_y(crash_date) AS year, SUM(number_of_pedestrians_killed) AS tot_pedestrians_killed, SUM(number_of_cyclist_killed) AS tot_cyclists_killed' \
  --data-urlencode '$where=year not in ("2012", "2023")' \
  --data-urlencode '$group=year' \
  --data-urlencode '$order=tot_pedestrians_killed DESC, tot_cyclists_killed'

[{"year":"2013","tot_pedestrians_killed":"176","tot_cyclists_killed":"11"}
,{"year":"2016","tot_pedestrians_killed":"149","tot_cyclists_killed":"18"}
,{"year":"2015","tot_pedestrians_killed":"133","tot_cyclists_killed":"15"}
,{"year":"2014","tot_pedestrians_killed":"133","tot_cyclists_killed":"20"}
,{"year":"2022","tot_pedestrians_killed":"132","tot_cyclists_killed":"18"}
,{"year":"2019","tot_pedestrians_killed":"131","tot_cyclists_killed":"31"}
,{"year":"2021","tot_pedestrians_killed":"129","tot_cyclists_killed":"19"}
,{"year":"2017","tot_pedestrians_killed":"127","tot_cyclists_killed":"27"}
,{"year":"2018","tot_pedestrians_killed":"123","tot_cyclists_killed":"10"}
,{"year":"2020","tot_pedestrians_killed":"101","tot_cyclists_killed":"29"}]

Explanation

$where=year not in ("2012", "2023")
Added a WHERE clause to omit years 2012 and 2023 from the query
SoQL not in (...)

Observation

Well, it’s not that safe being a pedestrian or cyclist in New York City. Checking the injury count would yield much higher numbers.

Run a query to get a yearly total of injured pedestrians and cyclists

Our query string was getting a little bit out of hand and difficult to manage.
I created a dedicated config file, .nyc_ped_cyc_injured_yearly_curlrc for our next request.

The Config

> cat .nyc_ped_cyc_injured_yearly_curlrc 
# --- NYC Collision Data - Injured List  ---
get
url = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"
data-urlencode  = "\$\$app_token=uvwxyz"
data-urlencode  = "\$select=date_extract_y(crash_date) AS year, SUM(number_of_pedestrians_injured) AS tot_pedestrians_injured, SUM(number_of_cyclist_injured) AS tot_cyclists_injured"
data-urlencode  = "\$where=year not in ('2012','2023')"
data-urlencode  = "\$group=year" 
data-urlencode  = "\$order=tot_pedestrians_injured DESC, tot_cyclists_injured DESC"
silent

The Query using the config file

>  curl --config ./.nyc_ped_cyc_injured_yearly_curlrc 
[{"year":"2013","tot_pedestrians_injured":"11988","tot_cyclists_injured":"4075"}
,{"year":"2017","tot_pedestrians_injured":"11151","tot_cyclists_injured":"4889"}
,{"year":"2018","tot_pedestrians_injured":"11123","tot_cyclists_injured":"4725"}
,{"year":"2016","tot_pedestrians_injured":"11090","tot_cyclists_injured":"4975"}
,{"year":"2014","tot_pedestrians_injured":"11036","tot_cyclists_injured":"4000"}
,{"year":"2019","tot_pedestrians_injured":"10568","tot_cyclists_injured":"4986"}
,{"year":"2015","tot_pedestrians_injured":"10084","tot_cyclists_injured":"4281"}
,{"year":"2022","tot_pedestrians_injured":"8963","tot_cyclists_injured":"5025"}
,{"year":"2021","tot_pedestrians_injured":"7503","tot_cyclists_injured":"4961"}
,{"year":"2020","tot_pedestrians_injured":"6691","tot_cyclists_injured":"5576"}]

Observation

Looks like the config file worked as expected. While the number of pedestrians injured is declining a little, the number of cyclists injured is going in the opposite direction.

Using jq to do additional filtering

Similar to the previous query, extract the yearly totals of injured cyclists. This time we’ll use jq to filter the output.

> curl --config ./.nyc_ped_cyc_injured_yearly_curlrc \
  | jq -r '.[] | .year + "," + .tot_cyclists_injured' | sort -k 1n \
  | column -t -s, --table-columns=Year,CyclistsInjured 
Year  CyclistsInjured
2013  4075
2014  4000
2015  4281
2016  4975
2017  4889
2018  4725
2019  4986
2020  5576
2021  4961
2022  5025

Explanation

This is similar to the previous query except I used jq to extract the injured cyclist data only from the returned results.

sort -k 1n
- Sort the Year numerically

sort -k 1n
Sort the Year, numerically
column -t -s, --table-columns=Year,CyclistsInjured
Add column headers for readability
The jq command already created comma separated results

Observation

2020 and 2022 were the worst years for bicyclist injuries. 2020 was a year where cycling became more popular. The injuries dropped a little in 2021, maybe because cyclists got a little scared after the slaughter in 2020. The upward trend may be returning, based on the 2022 results.

Get the 10 worst zip codes for collisions in January 2023

Previously I got the [10 worst Zip codes for collisions in February][Get the 10 worst zip codes for collisions in February 2023]. I used some bash commands to fine tune results. Here I will use SoQL to do most of the heavy lifting.

Config file `.nyc_jan_coll_curlrc`

> cat .nyc_jan_coll_curlrc 
# --- NYC Collision Data - January Collisions  ---
get
url = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"
data-urlencode  = "\$\$app_token=uvwxyz"
data-urlencode = "\$select=zip_code,count(zip_code) AS collision_count"
data-urlencode = "\$where=crash_date between '2023-01-01' AND '2023-01-31' "
data-urlencode = "\$group=zip_code"
data-urlencode = "\$order=collision_count DESC, zip_code"
data-urlencode = "\$limit=10"
silent

> curl --config ./.nyc_jan_coll_curlrc \
  | jq -r '.[] | .zip_code + ", " + .collision_count' \
  | column -t -s, --table-columns=ZipCode,CollisionCount
ZipCode  CollisionCount
11207     124
11236     83
11208     82
11212     77
11203     69
11385     67
11234     66
11206     64
10002     63
11101     61

Explanation

Most of this is similar to our earlier request for February stats. This time we are using a new config file ..

Instead of sorting the results using the bash sort, we sort using the SoQL [$order](https://dev.socrata.com/docs/queries/order.html). We get the 10 worst using the $limit clause.
\$order=collision_count DESC, zip_code
Sort the collision count from worst to “least worst”.
zip_code ascending sort
$limit=10
Get the first 10 after the sort using $limit

Observation

Zip Code 11207, again emerges as a collision prone area with 124 collisions in January. That’s 4 collisions a day. Every day is a regular demolition derby day in that part of Brooklyn.

Analyzing NYC Traffic Collision Data on the Linux Command Line with SoQL and Curl

Using the Socrata Query Language, SoQL

How many collisions since July 2012

Explanation

Observation

Getting all the collision records between two arbitrary dates, June 30th 2022 to December 31 2022.

Explanation

Observation

The NYC dataset column names

Corresponding Field Data Types

SoQL Query Clauses from the Docs

Get all the collisions for zip code 10036, Times Square NYC, for Feb 2023

Explanation

Explanation

Observation

Get the 10 worst zip codes for collisions in February 2023

Explanation

Observation

As the queries get more complex, these one line commands start to get long and hard to manage.

Explanation

Observation

Run the previous query minus years 2012 and 2023

Explanation

Observation

Run a query to get a yearly total of injured pedestrians and cyclists

The Config

The Query using the config file

Observation

Using jq to do additional filtering

Explanation

Observation

Get the 10 worst zip codes for collisions in January 2023

Config file .nyc_jan_coll_curlrc

Explanation

Observation

Some Perl CLI Resources

Some NYC Street Resources

Me

Read next

Can AI finally generate best practice code? I think so.

De cero a Ingeniero de Software

Interface Funcional: EVITE esse uso!

Police Department Management Simulator

Config file `.nyc_jan_coll_curlrc`