DEV Community

MarcelGeo

Problems reading a CSV file with the numpy Python library

Input data

The input data is the CSV file for 5.5.2020 from the Mobility Trends Reports database published by Apple. The reports are published daily and reflect requests for directions in Apple Maps.

The input CSV file has four string columns: geo_type, region, transportation_type and alternative_name. Every following column is named after a day and holds a float that compares the mobility on that day with the mobility on 13.1.2020. And this is where the problems begin.

Reading the CSV into a numpy array

I used genfromtxt, a function built into the numpy library. My initial, naive code for reading the file downloaded from Mobility Trends Reports ended with an encoding error.

import numpy as np

def getData():
  path = "data/applemobilitytrends-2020-05-05.csv"
  npcsv = np.genfromtxt(path, delimiter=',')
  print(npcsv)

getData()

Error:

Exception has occurred: UnicodeDecodeError
'charmap' codec can't decode byte 0x98 in position 5961: character maps to <undefined>
  File "C:\Users\marcel.kocisek\Documents\marcel\covid\examples\csv.py", line 5, in getData
    npcsv = np.genfromtxt(path, delimiter=',')
  File "C:\Users\marcel.kocisek\Documents\marcel\covid\examples\csv.py", line 8, in <module>
    getData()
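The 'charmap' codec in the traceback suggests the file was decoded with the Windows default code page rather than UTF-8. Here is a minimal sketch of that failure, assuming the code page was cp1250 (byte 0x98 is unassigned there) and using "Ø" purely as an illustrative character; neither the code page nor the character is confirmed by the traceback. The helper try_decode is a hypothetical name.

```python
# Illustrative only: "Ø" is one of several characters whose UTF-8 form ends in byte 0x98.
raw = "Ø".encode("utf-8")  # b'\xc3\x98'


def try_decode(data, codec):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as err:
        return f"error: {err}"


as_utf8 = try_decode(raw, "utf-8")      # decodes cleanly
as_cp1250 = try_decode(raw, "cp1250")   # assumed default code page; 0x98 is unassigned

print(as_utf8)
print(as_cp1250)
```

The second call fails with the same kind of 'charmap' message as the traceback, which is why naming the encoding explicitly helps.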

The fix was easy: pass the encoding parameter.

import numpy as np

def getDataV2():
  path = "data/applemobilitytrends-2020-05-05.csv"
  npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8')
  print(npcsv)

getDataV2()

The result has lost the header and the first four string columns, which all come back as nan:

[[   nan    nan    nan ...    nan    nan    nan]
 [   nan    nan    nan ...  36.    43.69  42.61]
 [   nan    nan    nan ...  43.41  49.59  46.44]
 ...
 [   nan    nan    nan ... 128.55 110.19 107.62]
 [   nan    nan    nan ... 113.52 104.54 104.41]
 [   nan    nan    nan ...  82.94  72.42  72.63]]

Why? Because the default data type for the resulting 2D numpy array is float, and the string names of states/regions cannot be converted to float (logical). Therefore, add the dtype parameter to switch from float to str. Older NumPy releases also accepted the alias np.str, but it was removed in NumPy 1.24, so the builtin str is the safe choice.

npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8', dtype=str)

The result is a numpy array that includes the header, with every value stored as a string (the important thing is that we have values :) )

[['geo_type' 'region' 'transportation_type' ... '2020-05-03' '2020-05-04'
  '2020-05-05']
 ['country/region' 'Albania' 'driving' ... '36.0' '43.69' '42.61']
 ['country/region' 'Albania' 'walking' ... '43.41' '49.59' '46.44']
 ...
 ['sub-region' 'Östergötland County' 'driving' ... '128.55' '110.19'
  '107.62']
 ['sub-region' 'Ústí nad Labem Region' 'driving' ... '113.52' '104.54'
  '104.41']
 ['sub-region' 'Žilina Region' 'driving' ... '82.94' '72.42' '72.63']]
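To see the effect of dtype without downloading the file, here is a small sketch that parses an in-memory sample. The sample is a reduced, hypothetical layout (two label columns instead of four), with the Albania values taken from the output above.

```python
import io

import numpy as np

# Tiny in-memory stand-in for the real file (reduced, hypothetical layout).
SAMPLE = (
    "region,transportation_type,2020-05-04,2020-05-05\n"
    "Albania,driving,43.69,42.61\n"
    "Albania,walking,49.59,46.44\n"
)

# Default dtype is float: every field that is not a number becomes nan.
as_float = np.genfromtxt(io.StringIO(SAMPLE), delimiter=",", encoding="utf8")

# dtype=str keeps every field, header included, as text.
as_text = np.genfromtxt(io.StringIO(SAMPLE), delimiter=",", encoding="utf8", dtype=str)

print(as_float)
print(as_text)
```

In the float version the whole header row and both label columns turn into nan, while the string version keeps them at the cost of storing the numbers as text.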

To ignore the header, you can use the skip_header parameter of the genfromtxt function:

npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8', dtype=str, skip_header=1)

or use numpy 2D array indexing to drop the first row with [1:, :]:

npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8', dtype=str)
print(npcsv[1:, :])

Result:

[['country/region' 'Albania' 'driving' ... '36.0' '43.69' '42.61']
 ['country/region' 'Albania' 'walking' ... '43.41' '49.59' '46.44']
 ['country/region' 'Argentina' 'driving' ... '16.44' '32.01' '33.63']
 ...
 ['sub-region' 'Östergötland County' 'driving' ... '128.55' '110.19'
  '107.62']
 ['sub-region' 'Ústí nad Labem Region' 'driving' ... '113.52' '104.54'
  '104.41']
 ['sub-region' 'Žilina Region' 'driving' ... '82.94' '72.42' '72.63']]
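Both ways of dropping the header should give the same array. The sketch below checks that on a small in-memory sample (a reduced, hypothetical layout with two label columns), and also shows one possible follow-up that is not part of the original code: reading the label columns as text and the date columns as floats with the usecols parameter. The helper read is a hypothetical name.

```python
import io

import numpy as np

# Reduced, hypothetical sample; the real file has four label columns.
SAMPLE = (
    "region,transportation_type,2020-05-04,2020-05-05\n"
    "Albania,driving,43.69,42.61\n"
    "Albania,walking,49.59,46.44\n"
)


def read(**kwargs):
    """Parse the sample with the options used in this post plus any extras."""
    return np.genfromtxt(io.StringIO(SAMPLE), delimiter=",", encoding="utf8", **kwargs)


skipped = read(dtype=str, skip_header=1)  # header dropped while reading
sliced = read(dtype=str)[1:, :]           # header dropped afterwards

# Labels stay text, while the date columns are read directly as floats.
labels = read(dtype=str, skip_header=1, usecols=(0, 1))
values = read(skip_header=1, usecols=(2, 3))

print(np.array_equal(skipped, sliced))
print(labels)
print(values)
```

Reading the numeric columns separately keeps them as floats, so they can be used in calculations without converting from strings.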
