DEV Community

MarcelGeo

Problems reading a CSV file with the NumPy Python library

Input data

The input data is the CSV file for 5.5.2020 from the Mobility Trends Reports database published by Apple. The reports are published daily and reflect requests for directions in Apple Maps.

The input CSV file has four string columns: geo_type, region, transportation_type, alternative_name. Every following column is named after a day and holds a float: the mobility on that day compared with 13.1.2020. And this is where the problems begin.

Reading CSV to numpy array

I used NumPy's built-in genfromtxt function. My initial, naive code for reading the file downloaded from Mobility Trends Reports ended with an encoding error.

import numpy as np

def getData():
  path = "data/applemobilitytrends-2020-05-05.csv"
  npcsv = np.genfromtxt(path, delimiter=',')
  print(npcsv)

getData()

Error:

Exception has occurred: UnicodeDecodeError
'charmap' codec can't decode byte 0x98 in position 5961: character maps to <undefined>
  File "C:\Users\marcel.kocisek\Documents\marcel\covid\examples\csv.py", line 5, in getData
    npcsv = np.genfromtxt(path, delimiter=',')
  File "C:\Users\marcel.kocisek\Documents\marcel\covid\examples\csv.py", line 8, in <module>
    getData()

The fix was easy: pass the encoding as a parameter.

import numpy as np

def getDataV2():
  path = "data/applemobilitytrends-2020-05-05.csv"
  npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8')
  print(npcsv)

getDataV2()
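To see why the first attempt failed, here is a minimal sketch of the same kind of error without any file. It assumes the platform default was a legacy Windows code page such as cp1250; the traceback only says 'charmap', so the exact code page is a guess. The character used is just an illustration of text whose UTF-8 encoding contains the byte 0x98.

```python
# "\u00d8" (Ø) is an illustrative character: in UTF-8 it is the two bytes C3 98.
raw = "\u00d8".encode("utf-8")

try:
    # cp1250 is an assumed default code page, not confirmed by the traceback.
    raw.decode("cp1250")
    decoded_with_legacy = True
except UnicodeDecodeError as err:
    # The byte 0x98 has no character assigned in cp1250, so decoding fails.
    decoded_with_legacy = False
    legacy_error = str(err)

# Decoding the same bytes as UTF-8 works, which is what encoding='utf8' does for the file.
decoded_with_utf8 = raw.decode("utf-8")
```

Without an explicit encoding, the file is opened with whatever the platform default is, so the same script can work on one machine and fail on another.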

The result is missing the first four string columns and the header; all of them come back as nan:

[[   nan    nan    nan ...    nan    nan    nan]
 [   nan    nan    nan ...  36.    43.69  42.61]
 [   nan    nan    nan ...  43.41  49.59  46.44]
 ...
 [   nan    nan    nan ... 128.55 110.19 107.62]
 [   nan    nan    nan ... 113.52 104.54 104.41]
 [   nan    nan    nan ...  82.94  72.42  72.63]]

Why? Because the default data type of the resulting 2D NumPy array is float, and the string names of states and regions cannot be converted to float, so they become nan. Therefore, add the dtype parameter to change float to str (older NumPy versions also accepted the alias np.str, which has since been removed):

npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8', dtype=str)

The result is a NumPy array that includes the header, with all values as strings (the important thing is that we have the values :) )

[['geo_type' 'region' 'transportation_type' ... '2020-05-03' '2020-05-04'
  '2020-05-05']
 ['country/region' 'Albania' 'driving' ... '36.0' '43.69' '42.61']
 ['country/region' 'Albania' 'walking' ... '43.41' '49.59' '46.44']
 ...
 ['sub-region' 'Östergötland County' 'driving' ... '128.55' '110.19'
  '107.62']
 ['sub-region' 'Ústí nad Labem Region' 'driving' ... '113.52' '104.54'
  '104.41']
 ['sub-region' 'Žilina Region' 'driving' ... '82.94' '72.42' '72.63']]
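The difference between the two dtypes can be reproduced on a tiny in-memory sample, which may be easier to experiment with than the full file. This is only a sketch: the sample string is a cut-down stand-in for the real CSV (same first four columns, just two date columns, values copied from the output above), and StringIO replaces the file path.

```python
from io import StringIO

import numpy as np

# Cut-down stand-in for the real file: same layout, only two date columns.
sample = (
    "geo_type,region,transportation_type,alternative_name,2020-05-04,2020-05-05\n"
    "country/region,Albania,driving,,43.69,42.61\n"
    "country/region,Albania,walking,,49.59,46.44\n"
)

# Default dtype is float: every field that is not a number becomes nan,
# which includes the whole header row and the first four columns.
as_float = np.genfromtxt(StringIO(sample), delimiter=",", encoding="utf8")

# dtype=str keeps every field, header included, as text.
as_text = np.genfromtxt(StringIO(sample), delimiter=",", encoding="utf8", dtype=str)

print(as_float)
print(as_text)
```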

To ignore the header, you can use the skip_header parameter of genfromtxt:

npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8', dtype=str, skip_header=1)

or use NumPy 2D array indexing, [1:, :], to drop the first row:

npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8', dtype=str)
print(npcsv[1:, :])

Result:

[['country/region' 'Albania' 'driving' ... '36.0' '43.69' '42.61']
 ['country/region' 'Albania' 'walking' ... '43.41' '49.59' '46.44']
 ['country/region' 'Argentina' 'driving' ... '16.44' '32.01' '33.63']
 ...
 ['sub-region' 'Östergötland County' 'driving' ... '128.55' '110.19'
  '107.62']
 ['sub-region' 'Ústí nad Labem Region' 'driving' ... '113.52' '104.54'
  '104.41']
 ['sub-region' 'Žilina Region' 'driving' ... '82.94' '72.42' '72.63']]
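Since every value is now a string, a possible next step is to split the text columns from the numeric ones. The sketch below is one way to do it, not the only one: it reads the labels with dtype=str and reads the date columns a second time with the default float dtype, using the usecols parameter of genfromtxt. The sample string is again a cut-down stand-in for the real file, and the names header, labels and values are chosen here for illustration.

```python
from io import StringIO

import numpy as np

# Cut-down stand-in for the real file: four text columns, two date columns.
sample = (
    "geo_type,region,transportation_type,alternative_name,2020-05-04,2020-05-05\n"
    "country/region,Albania,driving,,43.69,42.61\n"
    "country/region,Albania,walking,,49.59,46.44\n"
)

# Read everything as text once to get the header and the label columns.
text = np.genfromtxt(StringIO(sample), delimiter=",", encoding="utf8", dtype=str)
header = text[0]        # column names, including the dates
labels = text[1:, :4]   # the four text columns without the header row

# Read only the date columns as floats; cells that cannot be parsed become nan.
date_cols = tuple(range(4, text.shape[1]))
values = np.genfromtxt(
    StringIO(sample), delimiter=",", encoding="utf8",
    skip_header=1, usecols=date_cols,
)

print(header[4:])
print(values)
```

Row i of labels then describes row i of values, and header[4:] names the columns of values.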
