This tutorial is meant to provide a quick intro to a couple of useful subjects: generating polynomial data, introducing noise to that data, creating a fit with the least squares method, and graphing the fit and data together with an R^{2} value. First, we are going to make some numpy arrays representing x and y coordinates for a line:

```
import numpy as np
x = np.linspace(0, 50, num=50)
y = np.linspace(0, 30, num=50)
```
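As a quick sketch of an alternative layout, these two arrays can also be combined into a single 50x2 array where each row is one (x, y) point:

```python
import numpy as np

x = np.linspace(0, 50, num=50)
y = np.linspace(0, 30, num=50)

# Each row of the stacked array is one (x, y) point.
points = np.column_stack((x, y))
```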

Here x is an array that has 50 evenly spaced points between 0 and 50, inclusive, and y has 50 evenly spaced points between 0 and 30. This all could just as easily be represented by one 2D array. Now we are going to introduce some variation to the values:

```
import random
def noise(k):
    # shift k by a uniform random offset in (-1.5, 1.5)
    return k + (random.random()*2 - 1) * 1.5
x = np.vectorize(noise)(x)
y = np.vectorize(noise)(y)
```
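As an aside, NumPy can also draw all the offsets at once without the random module or np.vectorize; a sketch of an equivalent, loop-free approach:

```python
import numpy as np

x = np.linspace(0, 50, num=50)

# Draw 50 uniform offsets in (-1.5, 1.5) in one call, mirroring
# (random.random()*2 - 1) * 1.5 applied to every element.
rng = np.random.default_rng()
x_noisy = x + rng.uniform(-1.5, 1.5, size=x.size)
```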

np.vectorize evaluates our noise function for each value of x and y. Inside noise, random.random() generates a random number between 0 and 1, which gets rescaled to an offset between -1.5 and 1.5, making the points (x,y) arranged a bit less linearly. Changing 1.5 to a larger or smaller value will introduce more or less noise. Now we want to create a line to fit our data:

```
lin_fit = np.polyfit(x, y, 1)
lin_model = np.poly1d(lin_fit)
lin_R_squared = np.corrcoef(x, y)[0, 1]**2
```

np.polyfit uses the least squares method to create a line matching the points (x, y). If we wanted to match a higher order polynomial to the points, we'd just need a different value for the third parameter, the degree of the polynomial created.
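For example, fitting a cubic to the same points only changes that degree argument; since the data is essentially linear, the higher-order coefficients just come out near zero:

```python
import numpy as np

x = np.linspace(0, 50, num=50)
y = np.linspace(0, 30, num=50)

# Degree 3 instead of 1: polyfit now returns four coefficients,
# still ordered from highest power to lowest.
cubic_fit = np.polyfit(x, y, 3)
```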

Polyfit returns an array containing the line's coefficients in order from highest degree to lowest - this is important to remember when comparing it with polyfit's modern replacement, numpy.polynomial.polynomial.Polynomial.fit, which returns the coefficients in order from lowest degree to highest.
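A quick sketch of that ordering difference (note that Polynomial.fit works in a scaled domain internally, so convert() is needed to get coefficients in plain powers of x):

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.linspace(0, 50, num=50)
y = 2 * x + 5

# Legacy np.polyfit: highest degree first -> [slope, intercept]
old_style = np.polyfit(x, y, 1)

# Polynomial.fit: convert() maps back to plain powers of x,
# lowest degree first -> [intercept, slope]
new_style = Polynomial.fit(x, y, 1).convert().coef
```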

np.poly1d takes an array of polynomial coefficients and creates a handy function for us that turns x values into y values. It wouldn't be hard to use `m, b = np.polyfit(x, y, 1)` and later write `m*x + b` instead of `lin_model(x)`, but poly1d gets more useful the more coefficients are needed.
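For instance, with a made-up cubic (coefficients chosen arbitrarily here), poly1d stays one call while the hand-written version grows a term per degree:

```python
import numpy as np

# A hypothetical cubic: coefficients ordered highest degree to lowest.
coeffs = [2.0, -1.0, 3.0, 5.0]
model = np.poly1d(coeffs)

# One model(x) call replaces writing out every term by hand.
by_hand = 2.0 * 2.0**3 - 1.0 * 2.0**2 + 3.0 * 2.0 + 5.0
```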

np.corrcoef returns a correlation coefficient matrix, which isn't really worth getting into. The important thing is we are accessing it to get R^{2}, a value that will tell us how well our line fits.
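For anyone curious anyway, a small sketch of the matrix being indexed (with made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.0])

# corrcoef returns a symmetric 2x2 matrix: ones on the diagonal
# (each variable correlated with itself), and the x-y correlation
# coefficient in both off-diagonal spots - hence the [0, 1] index.
r_matrix = np.corrcoef(x, y)
r_squared = r_matrix[0, 1] ** 2
```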

Fitting linear data is great, but we might as well fit a parabola while we're at it. We can use the same x values, but we need some new y values:

```
yq = np.empty(shape=(50,))  # uninitialized values
with np.nditer(yq, flags=['f_index'], op_flags=['readwrite']) as it:
    for yVal in it:
        yVal[...] = it.index ** 2
```

Numpy expects us to make an array of the right size and replace values instead of appending, which takes much longer. np.empty is the fastest way to make an array, here a 1D array with 50 values. It is full of arbitrary uninitialized values, but we are going to replace them anyway.
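As an aside, for this particular fill pattern NumPy can also build the array in one vectorized line; the explicit loop below is worth knowing because it generalizes to arbitrary replacement logic:

```python
import numpy as np

# Equivalent result in one line: each element is its index squared.
yq = np.arange(50) ** 2
```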

The next lines are a standard way to iterate through an array and replace values. We need f_index in the flags to access the index, and we need readwrite in op_flags to change the values in yq. Each value in yq is now its index squared, which should look exactly like the values of y = x^{2} for x from 0 to 49. Now to add noise and create a fit:

```
yq = np.vectorize(noise)(yq)
quad_fit = np.polyfit(x, yq, 2, full=True)
quad_model = np.poly1d(quad_fit[0])
mean_yq = yq.sum()/yq.size
quad_tss = np.sum((yq - mean_yq)**2)
quad_R_squared = 1 - quad_fit[1]/quad_tss
quad_R_squared = quad_R_squared[0]
```
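Since full=True changes what polyfit hands back, here is a sketch of unpacking the return value by name (the variable names are my own) using a noiseless quadratic:

```python
import numpy as np

x = np.linspace(0, 50, num=50)
yq = np.arange(50) ** 2.0

# With full=True, polyfit returns a tuple rather than just coefficients;
# residuals holds the RSS (sum of squared fit errors) as a 1-element array.
coefficients, residuals, rank, singular_values, rcond = np.polyfit(x, yq, 2, full=True)

tss = np.sum((yq - yq.mean()) ** 2)   # total sum of squares
r_squared = 1 - residuals[0] / tss    # 1 - RSS/TSS
```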

Adding noise works the same, but now we are creating a fit with degree 2 instead of 1 and setting full to True. This means polyfit returns a tuple whose first element is our array of coefficients like before, while the remaining elements contain some information about how good our fit is. We need this because corrcoef only works for lines. quad_fit having these extra elements is why we index it with [0] when invoking np.poly1d. The next lines are a way to get R^{2}: we get the average value of yq by summing it and dividing by its number of elements; we get the TSS (total sum of squares) by squaring the difference between each y element and the average y and summing them all up; then we get R^{2} in the second-to-last line using the RSS (residual sum of squares), which is the second element of quad_fit. quad_R_squared comes out as an array with one element, so the last line just replaces it with that element. Now we can use Matplotlib to look at our data and fits!

```
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.set_title('Linear regression')
ax1.scatter(x, y)
ax1.plot(x, lin_model(x), '--k')
ax1.text(0.05, 0.95,
         f'y = mx + b\nm = {lin_fit[0]}\nb = {lin_fit[1]}\nR\u00b2 = {lin_R_squared}',
         verticalalignment='top',
         transform=ax1.transAxes,
         fontsize=14,
         bbox={'facecolor': 'cyan'})
ax2.set_title('Quadratic regression')
ax2.scatter(x, yq)
ax2.plot(x, quad_model(x), '--k')
ax2.text(0.05, 0.95,
         f'y = ax\u00b2 + bx + c\na = {quad_fit[0][0]}\nb = {quad_fit[0][1]}\nc = {quad_fit[0][2]}\nR\u00b2 = {quad_R_squared}',
         verticalalignment='top',
         transform=ax2.transAxes,
         fontsize=14,
         bbox={'facecolor': 'cyan'})
plt.show()
```

Here we are making one figure which contains our two graphs, called ax1 and ax2. The arguments of plt.subplots arrange them horizontally: one row, two columns. Calling scatter on ax1 lets us plot the points with coordinates from our x and y arrays. `ax1.plot(x, lin_model(x), '--k')` has x as x coords and the values generated by our fit as y coords. The last argument makes our fit a black dashed line - we could have avoided using scatter for (x, y) by setting the style to have no connecting line at all. The arguments for the text function just control where our info box is and what it looks like - we are using it to show the fits' coefficients and R^{2} values on our graphs. Running the file lets plt.show display our graphs.
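A sketch of that scatter-free variant - a format string with a marker character but no line character draws unconnected points:

```python
import matplotlib
matplotlib.use('Agg')  # render without opening a window
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 50, num=50)
y = np.linspace(0, 30, num=50)

fig, ax = plt.subplots()
# 'o' means circular markers with no connecting line, so a single
# plot() call stands in for scatter().
ax.plot(x, y, 'o')
```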
