# Numpy arrays at lightspeed ⚡ Part 4

### Ali Sherief ・7 min read

python-numpy (5 Part Series)

## More array creation functions

Here I will detail all the functions that are responsible for creating arrays. We already saw `np.zeros()`, `np.ones()` and `np.empty()`. There is also `np.full()`, which creates an array but, instead of filling it with zeros or ones, fills it with a number you specify in the second argument.

`np.eye()` and `np.identity()` create an identity matrix. `eye` accepts more arguments than `identity`.
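A quick sketch of these constructors (the shapes and fill value here are arbitrary examples):

```python
import numpy as np

# np.full fills every element with the value you pass
a = np.full((2, 3), 7)     # 2x3 array of sevens

# np.identity builds a square identity matrix;
# np.eye also accepts a column count and a diagonal offset k
i3 = np.identity(3)        # 3x3 identity
e = np.eye(3, 4, k=1)      # 3x4, ones on the first superdiagonal

print(a)
print(e)
```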

`np.indices()` is useful in the following scenario: suppose you have an array `sample` and a function such as `np.sin` that can take several arguments, all the same shape (or you could concatenate all those arguments into a `dimension+1` array, which is exactly what `np.indices()` returns). Instead of passing completely different arrays as arguments to the function, you want to pass the same array with rows treated as the first dimension in argument 1, columns treated as the first dimension in argument 2, and so on, so that the first dimension of each resulting array comes from a different axis of the `sample` array. Indexing the `sample` array with the index array generated by `np.indices()` gives exactly this result.

A practical use for this function would be to make a 3D plot of a function evaluated on a 3D mesh. These have three coordinates per element and it's useful to run the function on all X, Y and Z axes.

The argument of `np.indices()` is the shape of the array, as a tuple or list.

```
In [1]: np.indices((3,3))
Out[1]:
array([[[0, 0, 0],
        [1, 1, 1],
        [2, 2, 2]],

       [[0, 1, 2],
        [0, 1, 2],
        [0, 1, 2]]])

In [2]: sample = np.linspace(1., 4., 9).reshape(3,3)

In [3]: sample
Out[3]:
array([[1.   , 1.375, 1.75 ],
       [2.125, 2.5  , 2.875],
       [3.25 , 3.625, 4.   ]])

In [4]: sample[np.indices((3,3))]
Out[4]:
array([[[[1.   , 1.375, 1.75 ],
         [1.   , 1.375, 1.75 ],
         [1.   , 1.375, 1.75 ]],

        [[2.125, 2.5  , 2.875],
         [2.125, 2.5  , 2.875],
         [2.125, 2.5  , 2.875]],

        [[3.25 , 3.625, 4.   ],
         [3.25 , 3.625, 4.   ],
         [3.25 , 3.625, 4.   ]]],

       [[[1.   , 1.375, 1.75 ],
         [2.125, 2.5  , 2.875],
         [3.25 , 3.625, 4.   ]],

        [[1.   , 1.375, 1.75 ],
         [2.125, 2.5  , 2.875],
         [3.25 , 3.625, 4.   ]],

        [[1.   , 1.375, 1.75 ],
         [2.125, 2.5  , 2.875],
         [3.25 , 3.625, 4.   ]]]])
```
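The 3D-mesh idea can be sketched in a lighter 2D form: `np.indices` gives one coordinate grid per axis, and a function of several same-shaped arguments can then be evaluated over the whole grid at once (the grid size and the function here are arbitrary choices):

```python
import numpy as np

# One coordinate grid per axis: rows[i, j] == i, cols[i, j] == j
rows, cols = np.indices((5, 5))

# Evaluate f(i, j) = sin(i) + cos(j) at every grid point in one shot
z = np.sin(rows) + np.cos(cols)
print(z.shape)  # (5, 5)
```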

### Reading arrays from file

Very large arrays are usually stored in files because they would be too big and tedious to write out in code. NumPy can read simple text formats such as CSV directly, but formats like HDF5, FITS and image pixels require third-party libraries to import the file and build a NumPy array from it. I hope to cover those in due time.

In particular, h5py reads HDF5 files, Astropy reads FITS files and Pillow reads image files.

### Reading arrays from a string

`np.genfromtxt()` enables us to read an array from a string (or file) representing an array in tabular format.

```
In [1]: from io import StringIO

In [2]: data = u"1, 2, 3\n4, 5, 6"
   ...: np.genfromtxt(StringIO(data), delimiter=",")
Out[2]:
array([[1., 2., 3.],
       [4., 5., 6.]])

In [3]: print(data)
1, 2, 3
4, 5, 6
```

The data string looks exactly like the NumPy array that was created from it. Also, notice that you can set the `delimiter` keyword argument to whatever delimiter your file uses. By default, the delimiter is any amount of whitespace.

You can also remove leading and trailing whitespace with the `autostrip` keyword argument:

```
>>> data = u"1, abc , 2\n 3, xxx, 4"
>>> # Without autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")
array([['1', ' abc ', ' 2'],
       ['3', ' xxx', ' 4']], dtype='<U5')
>>> # With autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5", autostrip=True)
array([['1', 'abc', '2'],
       ['3', 'xxx', '4']], dtype='<U5')
```

There is also a `comments` argument: everything in a line from a given character, such as `#`, to the end of the line is excluded. If this is None, nothing is treated as a comment. By default this is `#`, so all hash comments are removed.

```
>>> data = u"""#
... # Skip me !
... # Skip me too !
... 1, 2
... 3, 4
... 5, 6 #This is the third line of the data
... 7, 8
... # And here comes the last line
... 9, 0
... """
>>> np.genfromtxt(StringIO(data), comments="#", delimiter=",")
array([[1., 2.],
       [3., 4.],
       [5., 6.],
       [7., 8.],
       [9., 0.]])
```

The `skip_header` and `skip_footer` arguments exclude a certain number of lines at the beginning or end, respectively. Both default to `0`.

```
In [1]: data = u"\n".join(str(i) for i in range(10))
In [2]: print(data)
0
1
2
3
4
5
6
7
8
9
In [3]: np.genfromtxt(StringIO(data),
...: skip_header=3, skip_footer=5)
Out[3]: array([3., 4.])
```

`usecols` allows us to select particular columns to be imported into the array. Indices are specified as a tuple and behave like normal Python list indices; in particular, negative indices work the same way as in Python, so -1 means the last column, -2 the second to last, and so on. *You can't select a column more than once.*

This can be combined with the `names` argument, which gives names to all the columns, to select columns by name. Names and indices can be mixed in `usecols`, but at any rate, a column must **not** be selected more than once.

```
>>> data = u"1 2 3\n4 5 6"
>>> np.genfromtxt(StringIO(data), usecols=(0, -1))
array([[ 1., 3.],
       [ 4., 6.]])
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a", "c"))
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a", 2)) # Same as above
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a, c")) # Notice they're all in one string
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a, 2")) # Fails, 2 is not the name of a column
ValueError: '2' is not in list
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a", 0)) # Fails, column 0 named "a" is selected more than once
ValueError: field 'a' occurs more than once
```

In this function, `dtype` is allowed to be None, which causes `np.genfromtxt` to guess the dtype of each column. But this is a lot slower than specifying the dtype explicitly.
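A small sketch of dtype guessing (the sample data is made up; `encoding="utf-8"` is passed so the string column comes back as `str` rather than `bytes`):

```python
import numpy as np
from io import StringIO

data = u"1, 2.5, abc\n4, 5.5, def"
# dtype=None asks genfromtxt to infer a dtype per column,
# producing a structured array with one field per column
arr = np.genfromtxt(StringIO(data), delimiter=",", dtype=None,
                    autostrip=True, encoding="utf-8")
print(arr.dtype)  # e.g. [('f0', '<i8'), ('f1', '<f8'), ('f2', '<U3')]
```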

`deletechars` deletes all the characters provided as a string to the `deletechars` argument from *field names* (not elements). By default, the deleted characters are `~!@#$%^&*()-=+~\|]}[{';: /?.>,<`. Again, these are *not* deleted from elements, as you can see with a string dtype:

```
In [1]: data = u"(, 0, )\n<, ., d>2d"
   ...: np.genfromtxt(StringIO(data), usecols=(1, 2), dtype='U') # Note: delimiter is whitespace
Out[1]:
array([['0,', ')'],
       ['.,', 'd>2d']], dtype='<U4')
```

`excludelist` will prepend an underscore to a field name if it matches one of the values in this list (which by default contains reserved names such as `return`, `file` and `print`). This only applies to field names, not elements.

```
In [1]: data = u"(, 0, )\n<, ., return"
   ...: np.genfromtxt(StringIO(data), usecols=(1, 2), dtype='U', excludelist=[
   ...: 'return'])
Out[1]:
array([['0,', ')'],
       ['.,', 'return']], dtype='<U6')
```

The `case_sensitive` argument determines the case of the field names: uppercase (`case_sensitive=False` or `case_sensitive='upper'`), lowercase (`case_sensitive='lower'`), or left alone (`case_sensitive=True`, the default value).
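For example (the field names here are arbitrary):

```python
import numpy as np
from io import StringIO

data = u"1 2 3\n4 5 6"

# case_sensitive=False (or 'upper') upper-cases the field names
up = np.genfromtxt(StringIO(data), names="Alpha, Beta, Gamma",
                   case_sensitive=False)
print(up.dtype.names)   # ('ALPHA', 'BETA', 'GAMMA')

# case_sensitive='lower' lower-cases them instead
low = np.genfromtxt(StringIO(data), names="Alpha, Beta, Gamma",
                    case_sensitive='lower')
print(low.dtype.names)  # ('alpha', 'beta', 'gamma')
```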

### Adjusting values during import

Suppose you have a date field or a percentage in your tabular data. NumPy can't convert them by itself, so what do you do? Luckily, `genfromtxt` has another argument called `converters`: a dictionary whose keys are column names or indices, and whose values are functions (def or lambda) that take an element of the column in string form as their only argument (as `bytes` by default, which is why the example below strips `b"%"`) and return a value corresponding to the column's dtype.

```
# This converter turns percentages to floats
>>> convertfunc = lambda x: float(x.strip(b"%"))/100.
>>> data = u"1, 2.3%, 45.\n6, 78.9%, 0"
>>> names = ("i", "p", "n")
>>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
... converters={1: convertfunc}) # Convert second column
array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])
```

Also, something like `convert = lambda x: float(x.strip() or -999)`, or anything of the form `<result-evaluation> or <default-value>`, can be used to provide a default value in case an invalid element is passed to the converter. *Converters should never assume that the input data is well-formed.*
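A sketch of that fallback pattern in action (the data and the -999 sentinel are arbitrary; `encoding="utf-8"` makes the converter receive `str` rather than `bytes`):

```python
import numpy as np
from io import StringIO

# Empty fields strip to "", which is falsy, so -999 is used instead
convert = lambda x: float(x.strip() or -999)
data = u"1, , 3\n4, 5, 6"
arr = np.genfromtxt(StringIO(data), delimiter=",",
                    converters={1: convert}, encoding="utf-8")
print(arr)  # the empty field in column 1 becomes -999.0
```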

The `missing_values` argument offers a cleaner solution to this. It takes a list of strings, one per column; elements in a column that exactly match that column's string are considered missing. `missing_values` can also take a dictionary mapping column indices/names (or `None` to represent all columns) to missing strings instead of a list, or even a single string that represents missing elements in all columns, across the entire table. By default `missing_values=None`, so the only value treated as missing is the empty string.

Now, to fill those missing values with default values, we must set the `filling_values` argument. It has exactly the same format as `missing_values`, but with default values of the column's type instead of missing strings. Everything said above about `missing_values` applies here too.

```
In [1]: data = u"2, 2, 3\n1, , 3"
   ...: np.genfromtxt(StringIO(data), dtype='i', delimiter=",")
Out[1]:
array([[ 2,  2,  3],
       [ 1, -1,  3]], dtype=int32)
```
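Here is a sketch with explicit missing markers and per-column fills (the markers `"N/A"` and `"???"` and the fill values are arbitrary choices):

```python
import numpy as np
from io import StringIO

data = u"1,N/A,3\n4,5,???"
# Keys are column indices; "N/A" in column 1 and "???" in column 2
# are treated as missing and replaced by 0 and 99 respectively
arr = np.genfromtxt(StringIO(data), delimiter=",", dtype='i',
                    missing_values={1: "N/A", 2: "???"},
                    filling_values={1: 0, 2: 99})
print(arr)
```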

These are the default filling values for each of the basic dtypes:

| Type | Default value |
|---|---|
| bool | False |
| int | -1 |
| float | np.nan |
| complex | np.nan+0j |

Finally, the `usemask` argument, when set to True, allows us to inspect which of the elements were labeled missing. By default it's False:

```
In [1]: data = u"2, 2, 3\n1, , 3"
   ...: np.genfromtxt(StringIO(data), dtype='i', delimiter=",", usemask=True)
Out[1]:
masked_array(
  data=[[2, 2, 3],
        [1, --, 3]],
  mask=[[False, False, False],
        [False, True, False]],
  fill_value=999999,
  dtype=int32)
```

This `masked_array` is a Python object, and the fields listed here can be explored with dot (`.`) notation or `getattr()`.
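For instance, continuing with the array above (the attribute names are the standard `numpy.ma` ones):

```python
import numpy as np
from io import StringIO

data = u"2, 2, 3\n1, , 3"
m = np.genfromtxt(StringIO(data), dtype='i', delimiter=",", usemask=True)

print(m.mask)                    # True marks the entries that were missing
print(m.filled(0))               # replace masked entries with 0
print(getattr(m, "fill_value"))  # same attribute access via getattr()
```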

## And we're done

Much material was sourced from the numpy manual.

If you see anything incorrect here please let me know so I can fix it.

Image by Arek Socha from Pixabay
