Ali Sherief

Posted on Mar 30, 2020

Numpy arrays at lightspeed ⚡ Part 4

#python

More array creation functions

Here I will detail more of the functions that create arrays. We already saw np.zeros(), np.ones() and np.empty(). There is also np.full(), which creates an array but, instead of filling it with zeros or ones, fills it with the value you specify in the second argument.
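As a quick illustration of np.full(), here is a minimal sketch; the variable names are only for illustration:

```python
import numpy as np

# 2x3 array in which every element is 7
sevens = np.full((2, 3), 7)
print(sevens)

# A float fill value gives a float array unless dtype is passed explicitly
halves = np.full(4, 0.5)
print(halves.dtype)
```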

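np.eye() and np.identity() both create an identity matrix. np.eye() accepts more arguments than np.identity(): it can also produce a non-square matrix and place the ones on a diagonal other than the main one.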
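The difference between the two can be sketched as follows (variable names are illustrative):

```python
import numpy as np

square = np.identity(3)     # always square, ones on the main diagonal
same = np.eye(3)            # identical result
wide = np.eye(2, 3)         # 2 rows, 3 columns
shifted = np.eye(3, k=1)    # ones on the first diagonal above the main one

print(wide)
print(shifted)
```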

np.indices() returns the index arrays of a grid with a given shape, stacked into a single array with one extra dimension. For a 2D shape, the first sub-array holds the row index of every position and the second holds the column index. This is useful when a function takes several arguments of the same shape, one per axis, and you want to evaluate it at every position of the grid. You can also index an array such as sample with the index array generated by np.indices(). Every index then selects a row of sample, so in the first half of the result the selected row changes along the first dimension, and in the second half it changes along the second dimension.

A practical use for this function would be to make a 3D plot of a function evaluated on a 3D mesh. Each point of the mesh has three coordinates, and np.indices() supplies the X, Y and Z index of every point so the function can be evaluated on all of them at once.

The argument of np.indices() is the shape of the array, as a tuple or list.

In [1]: np.indices((3,3))                                                       
Out[1]: 
array([[[0, 0, 0],
        [1, 1, 1],
        [2, 2, 2]],

       [[0, 1, 2],
        [0, 1, 2],
        [0, 1, 2]]])

In [2]: sample = np.linspace(1., 4., 9).reshape(3,3)                                

In [3]: sample
Out[3]: 
array([[1.   , 1.375, 1.75 ],
       [2.125, 2.5  , 2.875],
       [3.25 , 3.625, 4.   ]])

In [4]: sample[np.indices((3,3))]                                              
Out[4]: 
array([[[[1.   , 1.375, 1.75 ],
         [1.   , 1.375, 1.75 ],
         [1.   , 1.375, 1.75 ]],

        [[2.125, 2.5  , 2.875],
         [2.125, 2.5  , 2.875],
         [2.125, 2.5  , 2.875]],

        [[3.25 , 3.625, 4.   ],
         [3.25 , 3.625, 4.   ],
         [3.25 , 3.625, 4.   ]]],


       [[[1.   , 1.375, 1.75 ],
         [2.125, 2.5  , 2.875],
         [3.25 , 3.625, 4.   ]],

        [[1.   , 1.375, 1.75 ],
         [2.125, 2.5  , 2.875],
         [3.25 , 3.625, 4.   ]],

        [[1.   , 1.375, 1.75 ],
         [2.125, 2.5  , 2.875],
         [3.25 , 3.625, 4.   ]]]])
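A more common pattern is to unpack the result of np.indices() into one index array per axis. The sketch below reuses the sample array from above; the names rows, cols, grid, same and swapped are only illustrative:

```python
import numpy as np

sample = np.linspace(1., 4., 9).reshape(3, 3)
rows, cols = np.indices((3, 3))   # row-index grid and column-index grid

# Evaluate a function of both indices at every position of the grid
grid = rows * 3 + cols            # same values as np.arange(9).reshape(3, 3)

# Passing both index arrays selects one element per (row, column) pair
same = sample[rows, cols]         # reproduces sample
swapped = sample[cols, rows]      # rows and columns exchanged, i.e. sample.T
```

Passing the two index arrays separately picks single elements, whereas passing the stacked array, as in the transcript above, picks whole rows.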

Reading arrays from file

Very large arrays are usually stored in files because they would be too big and tedious to write in code. Plain-text formats such as CSV can be read by numpy itself, for example with np.genfromtxt(), which is covered below. Arrays stored as HDF5, FITS or image pixels cannot be read directly into numpy; third-party libraries are required to import the file and create a numpy array out of it. I hope to cover those in due time.

In particular, h5py reads HDF5 files, Astropy reads FITS files and Pillow reads image files.

Reading arrays from a string

np.genfromtxt() reads an array from text in a tabular format. It expects a file name or a file-like object, so a plain string is wrapped in io.StringIO first.

In [1]: from io import StringIO 

In [2]: data = u"1, 2, 3\n4, 5, 6" 
   ...: np.genfromtxt(StringIO(data), delimiter=",")                           
Out[2]: 
array([[1., 2., 3.],
       [4., 5., 6.]])

In [3]: print(data)                                                            
1, 2, 3
4, 5, 6

The data string has the same layout as the numpy array that was created from it. Also, notice that you can set the delimiter keyword argument to whatever delimiter your string or file uses. By default, fields are separated by any run of consecutive whitespace.
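For instance, a semicolon-separated string and a whitespace-separated one can be read like this (a small sketch with made-up data):

```python
from io import StringIO
import numpy as np

# Explicit delimiter
semicolons = np.genfromtxt(StringIO("1;2;3\n4;5;6"), delimiter=";")

# No delimiter given: any run of spaces or tabs separates the fields
whitespace = np.genfromtxt(StringIO("1   2\t3\n4 5    6"))

print(semicolons)
print(whitespace)
```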

You can also remove leading and trailing whitespace with the autostrip keyword argument:

>>> data = u"1, abc , 2\n 3, xxx, 4"
>>> # Without autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")
array([['1', ' abc ', ' 2'],
       ['3', ' xxx', ' 4']], dtype='<U5')
>>> # With autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5", autostrip=True)
array([['1', 'abc', '2'],
       ['3', 'xxx', '4']], dtype='<U5')

There is also a comments argument: everything from the comment marker, such as #, to the end of the line is ignored. If this is None, nothing is treated as a comment. By default, the marker is #, so all hash comments are removed.

>>> data = u"""#
... # Skip me !
... # Skip me too !
... 1, 2
... 3, 4
... 5, 6 #This is the third line of the data
... 7, 8
... # And here comes the last line
... 9, 0
... """
>>> np.genfromtxt(StringIO(data), comments="#", delimiter=",")
array([[1., 2.],
       [3., 4.],
       [5., 6.],
       [7., 8.],
       [9., 0.]])

The skip_header and skip_footer arguments exclude a certain number of lines at the beginning or end, respectively. Both of these are 0 by default.


In [1]: data = u"\n".join(str(i) for i in range(10))                          

In [2]: print(data)                                                           
0
1
2
3
4
5
6
7
8
9

In [3]: np.genfromtxt(StringIO(data), 
   ...:               skip_header=3, skip_footer=5)                           
Out[3]: array([3., 4.])

usecols allows us to select particular columns to be imported into the array. Indices are specified as a tuple and behave like normal Python list indices. In particular, negative indices count from the end, so -1 means the last column, -2 means the second to last, etc. You can't select a column more than once.

This can be combined with the names argument, which gives names to all the columns, to select columns by name. Names and indices can be mixed in usecols, but either way, columns must not be selected more than once.

>>> data = u"1 2 3\n4 5 6"
>>> np.genfromtxt(StringIO(data), usecols=(0, -1))
array([[ 1.,  3.],
       [ 4.,  6.]])
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a", "c"))
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a", 2))   # Same as above
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a, c"))   # Notice they're all in one string
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a, 2"))   # Fails, 2 is not the name of a column
ValueError: '2' is not in list
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a", 0))   # Fails, column 0 named "a" is imported more than once
ValueError: field 'a' occurs more than once

In this function, dtype is allowed to be None, which causes np.genfromtxt to guess the dtype of the elements. But this is a lot slower than specifying the dtype explicitly.
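A small sketch of the difference, using made-up data with an integer, a float and a text column. The encoding="utf-8" argument is an addition here so that text fields come back as str rather than bytes on older numpy versions:

```python
from io import StringIO
import numpy as np

data = "1 2.5 abc\n4 5.5 def"

# dtype=None: the type of each column is worked out from its contents,
# and columns of different types produce a structured array
guessed = np.genfromtxt(StringIO(data), dtype=None, encoding="utf-8")
print(guessed.dtype)

# The same layout spelled out explicitly, so nothing has to be guessed
explicit = np.genfromtxt(StringIO(data), dtype="i8,f8,U3", encoding="utf-8")
print(explicit["f2"])
```

The fields get the automatic names f0, f1 and f2 because no names argument was given.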

deletechars is a string containing all the characters that should be deleted from field names (not elements). By default, the deleted characters are ~!@#$%^&*()-=+~\|]}[{';: /?.>,<. Again, these are not deleted from elements, given an appropriate dtype:

In [1]: data = u"(, 0, )\n<, ., d>2d" 
   ...: np.genfromtxt(StringIO(data), usecols=(1, 2), dtype='U')      # Note: delimiter is whitespace
Out[1]: 
array([['0,', ')'],
       ['.,', 'd>2d']], dtype='<U4')
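To see deletechars act on field names, the names have to come from the data. The sketch below assumes a header line and reads it with names=True; the data is made up:

```python
from io import StringIO
import numpy as np

data = "a-b, c.d\n1, 2\n3, 4"

# names=True takes the field names from the first line;
# the default deletechars removes both "-" and "."
default = np.genfromtxt(StringIO(data), delimiter=",", names=True)
print(default.dtype.names)

# A custom deletechars string replaces the default set
custom = np.genfromtxt(StringIO(data), delimiter=",", names=True,
                       deletechars="-")
print(custom.dtype.names)
```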

excludelist appends an underscore to a field name if it matches one of the values in this list (which by default is None). This only applies to field names, not elements.

In [1]: data = u"(, 0, )\n<, ., return" 
   ...: np.genfromtxt(StringIO(data), usecols=(1, 2), dtype='U', excludelist=[
   ...: 'return'])                                                            
Out[1]: 
array([['0,', ')'],
       ['.,', 'return']], dtype='<U6')

The case_sensitive argument determines the case of the field names: converted to uppercase (case_sensitive=False or case_sensitive='upper'), converted to lowercase (case_sensitive='lower'), or left alone (case_sensitive=True, the default value).
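The effect of excludelist and case_sensitive on field names can be checked with a header line read through names=True. This is a sketch with made-up column names:

```python
from io import StringIO
import numpy as np

data = "Total, Count\n1, 2\n3, 4"
kwargs = dict(delimiter=",", names=True)   # field names come from the first line

as_is = np.genfromtxt(StringIO(data), **kwargs)
lower = np.genfromtxt(StringIO(data), case_sensitive="lower", **kwargs)
upper = np.genfromtxt(StringIO(data), case_sensitive=False, **kwargs)
excluded = np.genfromtxt(StringIO(data), excludelist=["Total"], **kwargs)

for result in (as_is, lower, upper, excluded):
    print(result.dtype.names)
```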

Adjusting values during import

Suppose you have a date field or a percentage in your tabular data. Numpy can't convert them by itself, so what do you do? Luckily, genfromtxt has another argument called converters. This is a dictionary with the column name or index as the key and, as the value, a function (def or lambda) that takes one element from the column in string form as its only argument and returns a value corresponding to the column's dtype. Whether that string arrives as str or as bytes depends on the encoding argument: the example below passes encoding="utf-8" so that the converter receives str, whereas older numpy versions, whose default was encoding="bytes", hand the converter a bytes object.

# This converter turns percentages to floats
>>> convertfunc = lambda x: float(x.strip("%"))/100.
>>> data = u"1, 2.3%, 45.\n6, 78.9%, 0"
>>> names = ("i", "p", "n")
>>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
...               converters={1: convertfunc},    # Convert second column
...               encoding="utf-8")               # Pass str, not bytes, to the converter
array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

Also, something like convert = lambda x: float(x.strip() or -999), or any expression of the form <result> or <default-value>, can be used to provide a default value in case an empty element is passed to the converter. Converters should never assume that the input data is well-formed.
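Here is how such a converter behaves on a table with an empty field (a minimal sketch; the data and the -999 sentinel are only illustrative):

```python
from io import StringIO
import numpy as np

data = "1, , 3\n4, 5, 6"

# An empty (whitespace-only) field strips to an empty value, which is falsy,
# so the expression falls back to -999
convert = lambda x: float(x.strip() or -999)

result = np.genfromtxt(StringIO(data), delimiter=",", converters={1: convert})
print(result)
```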

The missing_values argument offers a cleaner solution to this. It takes a list of strings, one for each column; an element that exactly matches its column's string is considered missing. Instead of a list, missing_values can also take a dictionary that maps column indices or names (or None, to represent all columns) to missing strings. It can even take a single string that marks missing elements in every column of the table. By default, missing_values=None, so the only value treated as missing is the empty string.

Now, to fill those missing values with default values, we must set the filling_values argument. It has exactly the same format as missing_values, but with default values of a column type instead of missing strings. Everything said above about missing_values applies here too.

In [1]: data = u"2, 2, 3\n1, , 3" 
   ...: np.genfromtxt(StringIO(data), dtype='i', delimiter=",")               
Out[1]: 
array([[ 2,  2,  3],
       [ 1, -1,  3]], dtype=int32)

These are the default filling values for each of the numeric dtypes:

Type	Default Value
bool	False
int	-1
float	np.nan
complex	np.nan+0j
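To replace these defaults, missing_values and filling_values can be passed together. The sketch below uses made-up data: first a single marker and fill value for the whole table, then one marker and fill value per column via dictionaries:

```python
from io import StringIO
import numpy as np

# One marker and one fill value for every column
data = "1, N/A, 3\n4, 5, N/A"
result = np.genfromtxt(StringIO(data), delimiter=",", dtype=int,
                       missing_values="N/A", filling_values=-999)
print(result)

# Per-column markers and fill values, keyed by column index or name
data = "N/A, 2, 3\n4, ,???"
per_column = np.genfromtxt(StringIO(data), delimiter=",", dtype=int,
                           names="a,b,c",
                           missing_values={0: "N/A", "b": " ", 2: "???"},
                           filling_values={0: 0, "b": 0, 2: -999})
print(per_column)
```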

Finally, setting the usemask argument to True returns a masked array that shows which elements were labeled missing. By default it's False:

In [1]: data = u"2, 2, 3\n1, , 3" 
   ...: np.genfromtxt(StringIO(data), dtype='i', delimiter=",", usemask=True) 
Out[1]: 
masked_array(
  data=[[2, 2, 3],
        [1, --, 3]],
  mask=[[False, False, False],
        [False,  True, False]],
  fill_value=999999,
  dtype=int32)

This masked array is a Python object, and the fields listed here (such as data, mask and fill_value) can be accessed as attributes with dot notation or getattr().
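For example, the mask can be inspected and the missing elements filled in afterwards. This is a sketch based on the same data as above; filled() is a standard masked-array method and the name masked is illustrative:

```python
from io import StringIO
import numpy as np

data = "2, 2, 3\n1, , 3"
masked = np.genfromtxt(StringIO(data), dtype=int, delimiter=",", usemask=True)

print(masked.mask)                    # True where an element was missing
print(getattr(masked, "fill_value"))  # same as masked.fill_value

# Plain ndarray with the missing elements replaced by 0
filled = masked.filled(0)
print(filled)
```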

And we're done

Much material was sourced from the numpy manual.

If you see anything incorrect here please let me know so I can fix it.

Image by Arek Socha from Pixabay
