Linear Algebra in Data Science
Vectors and matrices are heavily used in computer science (and data science) to store and perform multiple mathematical operations on a data set.Although, the concept of vector as used in physics (example a quantity with both magnitude and direction (2 dimensional vector)) and vector used in computers (example list of features recorded on a observation) have very different meanings in real life, when performing mathematical operations, both of these vectors follow same rules. Therefore, the branch of discrete mathematics, that deal with manipulation of vectors has applications in diverse fields like computer programming, storing big data set, cryptography, image editing, graph theory and many more. Hence, a through understanding of fundamental concepts of vectors and matrix operation is necessary to understand multiple algorithms used in data science. In in this article, I am trying to highlight the importance of vectors and matrices in data science by showing some matrix and vector operations in Python. ( Code in R and data used are available here.)
Vectors
In data science, any list of values is a vector and as long as the manipulations on this list follows linear operations (not exponential or higher order transformations), the theories of linear algebra can be used to perform complex calculations. For example to store a 28X28 pixel gray scale image in a computer, we can make a 28*28=784 dimensional vector which will hold values for intensity of black at each pixel. Following code chuck in Python reads a csv holding such data for images of 150 handwritten letters one letter per row.
##load necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Read handwriting data set
Handwriting = pd.read_csv("DigitMatrix.csv")
print("Summary of DataFrame " )
## Summary of DataFrame
Handwriting.describe()
## y X1 X2 X3 ... X781 X782 X783 X784
## count 150.000000 150.0 150.0 150.0 ... 150.0 150.0 150.0 150.0
## mean 4.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
## std 2.953783 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
## min 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
## 25% 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
## 50% 5.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
## 75% 7.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
## max 7.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
##
## [8 rows x 785 columns]
print("The first column is the label of these digits and below is the frequecy distribution ")
## The first column is the label of these digits and below is the frequecy distribution
Handwriting['y'].value_counts()
#Make a DataFrame excluding labels
## 7 50
## 0 50
## 5 50
## Name: y, dtype: int64
Handwriting_matrix = Handwriting.iloc[:,1:786]
print("View of a single vector in this data set")
## View of a single vector in this data set
Handwriting_matrix.iloc[1]
## X1 0
## X2 0
## X3 0
## X4 0
## X5 0
## ..
## X780 0
## X781 0
## X782 0
## X783 0
## X784 0
## Name: 1, Length: 784, dtype: int64
This view of first row in this handwriting_matrix shows a representation of familiar column vector which has 784 dimensions.
#visualize some examples
print("Visualization of some example images")
## Visualization of some example images
fig, (ax0, ax1, ax2) = plt.subplots(ncols=3)
ax0.imshow((np.array(Handwriting_matrix.iloc[1])).reshape(28,28), cmap="binary")
ax1.imshow((np.array(Handwriting_matrix.iloc[55])).reshape(28,28), cmap="binary")
ax2.imshow((np.array(Handwriting_matrix.iloc[105])).reshape(28,28), cmap="binary")
fig.tight_layout()
plt.show()
The figure above shows three example images from 150 images present in this handwriting data. Each row in this dataframe can be considered a vector of 784 dimensions.
Vector addition and scalar multiplication
Lets consider a situation, How can we find average intensity of each pixel for each of these three digits ? Instead of finding aveage of one pixel at time, we can perform simple vector addition of all 50 vectors and divide the sum vector by the total number of images(50) of that category. This operations is just the extension to 784 dimensions of simple vector addition and scalar multiplication illustrated by 2 dimensional vectors below:
\[Sum(u,v)= \begin{bmatrix}a \\ b \end{bmatrix} +\begin{bmatrix}c \\ d \end{bmatrix}=\begin{bmatrix}a+c \\ b+d \end{bmatrix}\] \[k*u= \begin{bmatrix}ka \\ kb \end{bmatrix}\]Vector addition and multiplication performed in Python to find average pixel intesity for each digit.
#Find average of each pixel for each category of number
Sum_Sevens= (Handwriting_matrix.iloc[0:50]).sum(0)
Average_Sevens= Sum_Sevens*1/50
#Can also use mean function in pandas
Average_zeros= Handwriting_matrix.iloc[50:100].mean(0)
Average_fives= Handwriting_matrix.iloc[100:150].mean(0)
#Visualize the average pixel for each category
fig, (ax0, ax1, ax2) = plt.subplots(ncols=3)
ax0.imshow((np.array(Average_Sevens)).reshape(28,28), cmap="binary")
ax1.imshow((np.array(Average_zeros)).reshape(28,28), cmap="binary")
ax2.imshow((np.array(Average_fives)).reshape(28,28), cmap="binary")
fig.tight_layout()
plt.show()
Vector subtraction
Subtraction of vectors can be performed similar to vector addition, by subtracting each item of second vector from the first vector.
\[Subtraction(u,v)= \begin{bmatrix}a \\ b \end{bmatrix} -\begin{bmatrix}c \\ d \end{bmatrix}=\begin{bmatrix}a-c \\ b-d \end{bmatrix}\]To see some examples of vector subtraction lets consider only one example of image from our 150 images above. A single digit image is composed of 784 pixels and location of each of these pixels in Cartesian coordinate can be represented by a 2d vector where the first row corresponds to x-coordinate and second row corresponds to y-coordinate. In order to completely describe each pixel we need to add a 3rd dimension to our pixel vector which will hold the value of intensity of black. Remember this 3rd dimension has nothing to do with the 3d dimension in space, vector dimensions in computer science keeps on increasing as new information is collected about that vector. For example if each of these pixels were RGB colored pixels in addition to x,y coordinates, there would be 3 more dimensions one each for R, G and B value, making each pixel a 5-dimensional vector.
The code chuck below shows how a image can be represented by 874 vectors of dimensions = 3.
#Take a image with row index = 105
SingleImage = np.array(Handwriting_matrix.iloc[105, :])
Vectors = pd.DataFrame({"x":np.tile(np.arange(1,29),28), # 1st dim is x-coordinate
"y":np.repeat(np.arange(28,0,-1),28), #2nd dim is y-coordinate
"z":SingleImage.ravel()}) #3rd dim is intensity of black
Vectors.describe()
#write a function to visualize these 784 vectors
## x y z
## count 784.000000 784.000000 784.000000
## mean 14.500000 14.500000 26.001276
## std 8.082904 8.082904 69.831479
## min 1.000000 1.000000 0.000000
## 25% 7.750000 7.750000 0.000000
## 50% 14.500000 14.500000 0.000000
## 75% 21.250000 21.250000 0.000000
## max 28.000000 28.000000 255.000000
def DigitImage(df):
fig, ax = plt.subplots()
ax.grid()
ax.axhline(y=0)
ax.axvline(x=0)
ax.scatter("x","y", data=df, c="z", marker="s", cmap= "YlOrRd", vmin=0, vmax=255)
ax.set_xlim(left=-28, right=28)
ax.set_ylim(bottom=-28, top= 28)
ax.set_axisbelow(True)
plt.show()
DigitImage(Vectors)
The figure above is generated by plotting 784 points (shaped square to mimic square pixels) and color intensity mapped to a color gradient from yellow to red (yellow= 0, red= 255). Hence each of the square is a vector of 3 dimensions.
Now, if we want to move this image so that it is centered at origin we need to perform some vector subtraction. The vector \(\begin{bmatrix}14 \\ 14 \\249 \end{bmatrix}\) is the midpoint of this image (shown in the image below). If we subtract this vector from all 784 vectors the image should be centered at origin.
fig, ax = plt.subplots()
ax.grid()
ax.axhline(y=0)
ax.axvline(x=0)
ax.scatter("x","y", data=Vectors, c="z", marker="s", cmap= "YlOrRd", vmin=0, vmax=255)
ax.arrow(0,0,14,14, width=0.3, length_includes_head=True, color='black')
ax.text(14,14, " Vector [14,14,249]", color="blue")
ax.set_xlim(left=-28, right=28)
## (-28.0, 28.0)
ax.set_ylim(bottom=-28, top= 28)
## (-28.0, 28.0)
ax.set_axisbelow(True)
plt.show()
The figure above show the pixel that occur at the center of this image as an arrow. Now the transformation of this image to the center can be represented as a vector subtraction as below:
\[\text{Transform to origin}= \begin{bmatrix}x \\ y \\z \end{bmatrix} -\begin{bmatrix}14 \\ 14 \\0 \end{bmatrix}=\begin{bmatrix}x-14 \\ y-14 \\z-0 \end{bmatrix}\]Centered = Vectors.apply(lambda x: (x) - [14,14,0], axis=1,)
DigitImage(Centered)
Matrix
The simplest definition of a matrix is, it is collection of vectors. For example the identity matrix for a 2d vector is:
\[\begin{pmatrix} 1 & 0 \\0 &1 \end{pmatrix}\]Here the first and second columns are simply the basis vector in x-direction and y-direction after the transformation by this matrix. Since this is matrix that does not bring about any transformation both columns are same as the original basis vectors of this system. The matrix multiplication with vector is formulated in such a way that when a matrix is multiplied with the transformation matrix it is transformed according the values of new basis vector coded in the matrix.
Skew
The following matrix can skew the image by 45 degree towards left without making any change to the 3rd dimension which stores information for color.
\[\begin{pmatrix} 1 & -1 &0 \\0 &1 &0\\0&0&1 \end{pmatrix}\]The basis on x-direction is unchanged while for basis in y-direction now has a x-component of -1.
##write a function which takes df and matrix and gives transformed df
def matrixTrans(Df, matrix):
Transformed =pd.DataFrame(np.matmul(matrix, pd.DataFrame.to_numpy(Df).transpose() ))
Transformed = Transformed.transpose()
Transformed.columns = ['x','y','z']
return(Transformed)
mat1=np.array([[1, -1, 0],
[0, 1, 0],
[0, 0, 1]])
VecTrans = matrixTrans(Centered, mat1)
DigitImage(VecTrans)
Skew in y direction
import math
mat1=np.array([[1, 0, 0],
[-1, 1, 0],
[0, 0, 1]])
VecTrans = matrixTrans(Centered, mat1)
DigitImage(VecTrans)
Rotation 45 counterclockwise
import math
mat1=np.array([[math.cos(45),-math.sin(45),0],
[math.sin(45), math.cos(45), 0],
[0, 0, 1]])
VecTrans = matrixTrans(Centered, mat1)
DigitImage(VecTrans)
#### Rotation 90 counterclockwise
import math
mat1=np.array([[math.cos(90),-math.sin(90),0],
[math.sin(90), math.cos(90), 0],
[0, 0, 1]])
VecTrans = matrixTrans(Centered, mat1)
DigitImage(VecTrans)
Flip along y-axis
mat1=np.array([[-1,0,0],
[0, 1, 0],
[0, 0, 1]])
VecTrans = matrixTrans(Centered, mat1)
DigitImage(VecTrans)
Flip along x-axis
mat1=np.array([[1,0,0],
[0, -1, 0],
[0, 0, 1]])
VecTrans = matrixTrans(Centered, mat1)
DigitImage(VecTrans)
Decrease color intensity by 50%
mat1=np.array([[1,0,0],
[0, 1, 0],
[0, 0, 0.5]])
VecTrans = matrixTrans(Centered, mat1)
DigitImage(VecTrans)
Matrix multiplication
Matrix multiplication is simply a way to represent a sequence of transformation. For example \(B.A\) , simply represents apply transformation by A first and then B. This step by step transformation will give same result at apply transformation by BA. For example lets first flip the letter along y-axis and then apply skew left.
#Flip
mat1=np.array([[-1,0,0],
[0, 1, 0],
[0, 0, 1]])
#skew left
mat2=np.array([[1,-1,0],
[0, 1, 0],
[0, 0, 1]])
VecTrans = matrixTrans(Centered, mat1)
VecTrans2= matrixTrans(VecTrans, mat2)
DigitImage(VecTrans2)
#Flip
mat1=np.array([[-1,0,0],
[0, 1, 0],
[0, 0, 1]])
#skew left
mat2=np.array([[1,-1,0],
[0, 1, 0],
[0, 0, 1]])
comb = np.matmul(mat2,mat1)
VecTrans = matrixTrans(Centered, comb)
DigitImage(VecTrans)
Inverse matrix
This is the matrix which will revert the transformation caused by its complementary matrix.
#inverse of comb matrix above is
invMat = np.linalg.inv(comb)
#apply this inv transformation to VecTrans
Orig = matrixTrans(VecTrans, invMat)
DigitImage(Orig)