This is a chapter of the in-progress book on linear algebra, “A bird’s eye view of linear algebra”. The table of contents so far:
Stay tuned for future chapters.
Here, we’ll describe the operations we can do with two matrices, keeping in mind that they are only representations of linear maps.
I) Why care about matrix multiplication?
Almost any kind of information can be embedded in a vector space: images, video, language, speech, biometric information and whatever else you can imagine. And all of the applications of machine learning and artificial intelligence (like the recent chat-bots, text-to-image models, etc.) work on top of those vector embeddings. Since linear algebra is the science of dealing with high-dimensional vector spaces, it is an indispensable building block.
A lot of these techniques involve taking input vectors from one space and mapping them to other vectors in another space.
But why the emphasis on “linear” when most interesting functions are non-linear? It’s because the problem of making our models high-dimensional and that of making them non-linear (general enough to capture all kinds of complex relationships) become orthogonal to one another. Many neural network architectures work by using linear layers with simple one-dimensional non-linearities in between them. And there is a theorem (the universal approximation theorem) that says this kind of architecture can model any function.
Since the way we manipulate high-dimensional vectors is primarily matrix multiplication, it isn’t a stretch to say it’s the bedrock of the modern AI revolution.

II) Algebra on maps
In chapter 2, we learned how to quantify linear maps with determinants. Now, let’s do some algebra with them. We’ll need two linear maps and a basis.

II-A) Addition
If we can add matrices, we can add linear maps, since matrices are the representations of linear maps. And matrix addition isn’t very interesting once you know scalar addition. Just as with vectors, it’s only defined if the two matrices are the same size (same number of rows and columns) and involves lining them up and adding element by element.
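As a quick illustration, here is a minimal numpy sketch (the matrices are arbitrary numbers chosen just for the example):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

# Element-by-element addition; only defined because A and B have the same shape.
print(A + B)
# [[11 22]
#  [33 44]]
```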

So, we’re just doing a bunch of scalar additions, which means that the properties of scalar addition carry over.
Commutative: when you switch, the result won’t twitch
Associative: in a chain, don’t refrain, take any 2 and proceed
Identity: And here I am where I started! That’s no way to treat a person!
The identity property refers to the presence of a special element that, when added to anything, results in the same thing. In the case of scalars, it’s the number 0. In the case of matrices, it’s a matrix filled with zeros.
Also, it is possible to start at any element and end up at any other element via addition. So it must be possible to start at a matrix A and end up at the additive identity, 0. The thing that must be added to A to achieve this is the additive inverse of A, and it’s called -A.
For matrices, you just go to each scalar element in the matrix and replace it with its additive inverse (switching the signs if the scalars are numbers) to get the additive inverse of the matrix.
II-B) Subtraction
Subtraction is just addition with the additive inverse of the second matrix instead.
II-C) Multiplication
We could have defined matrix multiplication just as we defined matrix addition: take two matrices that are the same size (rows and columns) and then multiply the scalars element by element. There is a name for that kind of operation, the Hadamard product.
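In numpy, for instance, the `*` operator computes the Hadamard product while `@` computes the matrix product we are about to discuss (a minimal sketch with arbitrary matrices):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A * B)  # Hadamard (element-wise) product:
# [[ 5 12]
#  [21 32]]

print(A @ B)  # matrix product, a very different operation:
# [[19 22]
#  [43 50]]
```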
But no, matrix multiplication is defined as a much more convoluted operation, far more “exotic” than addition. And it isn’t complex just for the sake of it. It’s the most important operation in linear algebra by far.
It enjoys this special status because it’s the means by which linear maps are applied to vectors, building on top of dot products.
The way it actually works requires a dedicated section, so we’ll cover that in section III. Here, let’s list some of its properties.
Commutative
Unlike addition, matrix multiplication isn’t always commutative, which means that the order in which you apply linear maps to your input vector matters.
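A small counterexample (the matrices are arbitrary; a numpy sketch):

```python
import numpy as np

A = np.array([[1, 1],
              [0, 1]])
B = np.array([[1, 0],
              [1, 1]])

print(A @ B)
# [[2 1]
#  [1 1]]

print(B @ A)
# [[1 1]
#  [1 2]]
# AB != BA: applying the two maps in a different order gives a different result.
```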
Associative
It is still associative: (AB)C = A(BC).
And there is a lot of depth to this property, as we’ll see in section IV.
Identity
Just like addition, matrix multiplication also has an identity element, an element that, when multiplied with any matrix, results in the same matrix. The big caveat is that this element is itself square, and it acts as the identity on both sides only for square matrices.
Now, because of the importance of matrix multiplication, “the identity matrix” is in general defined as the identity element of matrix multiplication (not that of addition or the Hadamard product, for instance).
The identity element for addition is a matrix composed of 0’s and that of the Hadamard product is a matrix composed of 1’s. The identity element of matrix multiplication is:
I = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}
So, 1’s on the main diagonal and 0’s everywhere else. What kind of definition of matrix multiplication would lead to an identity element like this? We’ll have to describe how it actually works to see, but first let’s get to the final operation.
II-D) Division
Just as with addition, the presence of an identity matrix suggests that any matrix A (at least any invertible one) can be multiplied with another matrix and brought to the identity. That other matrix is called the inverse of A, written A^{-1}. Since matrix multiplication isn’t commutative, there are two ways to do this: A A^{-1} and A^{-1} A. Thankfully, both result in the identity matrix.
So, “dividing” a matrix A by another matrix B is simply multiplication with the second one’s inverse, B^{-1}. If matrix multiplication is very important, then this operation is as well, since it’s its inverse. It is also related to how we historically developed (or perhaps stumbled upon) linear algebra. But more on that in the next chapter (4).
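Here is a minimal numpy sketch of the idea (the matrices are arbitrary, with B chosen to be invertible); `np.linalg.inv` computes the inverse:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
B = np.array([[1.0, 2.0],
              [3.0, 5.0]])

B_inv = np.linalg.inv(B)

print(B @ B_inv)   # close to the identity matrix
print(B_inv @ B)   # also close to the identity matrix
print(A @ B_inv)   # "dividing" A by B: multiplying A with the inverse of B
```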
Another property we’ll be using, a combined property of addition and multiplication, is the distributive property. It applies to every kind of matrix multiplication, from the traditional one to the Hadamard product:

A(B + C) = AB + AC, \qquad (B + C)A = BA + CA
III) Why is matrix multiplication defined this way?
We have finally arrived at the section where we’ll answer the question in the title, the meat of this chapter.
Matrix multiplication is the way linear maps act on vectors. So, we get to motivate it that way.
III-A) How are linear maps applied in practice?
Consider a linear map that takes m-dimensional vectors (from ℝ^m) as input and maps them to n-dimensional vectors (in ℝ^n). Let’s call the m-dimensional input vector x.
At this point, it might be helpful to imagine yourself actually coding up this linear map in some programming language. It should be a function that takes the m-dimensional vector x as input and returns the n-dimensional vector y.
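Here is a minimal first attempt in Python (the function name and the output dimension n are placeholders assumed for the sketch); note that it doesn't use the input at all yet:

```python
import numpy as np

def linear_map(x, n=5):
    """Takes an m-dimensional vector x and returns an n-dimensional vector.
    For now, it ignores x entirely and just returns a random vector."""
    return np.random.standard_normal(n)
```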
The linear map has to take this vector and turn it into an n-dimensional vector somehow. In the function above, you’ll notice we just generated some vector at random. But this completely ignored the input vector, x. That’s unreasonable; x must have some say. Now, x is just an ordered list of scalars x_1, x_2, ..., x_m. What do scalars do? They scale vectors. And the output vector we want should be n-dimensional. How about we take some fixed vectors (pulled out of thin air, each n-dimensional), v_1, v_2, ..., v_m. Then, scale v_1 by x_1, v_2 by x_2 and so on and add them all up. This leads to an equation for our linear map (with the output on the left).
y = f(x) = x_1 v_1 + x_2 v_2 + \dots + x_m v_m \tag{1}
Make a note of equation (1) above since we’ll be using it again.
Since the vectors v_1, ..., v_m are all n-dimensional, so is y. And all the elements of x have an influence on the output, y. The idea in equation (1) is implemented below. We take some randomly generated vectors for the v_i’s, but with a fixed seed (ensuring that the vectors are the same across every call of the function).
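(A minimal numpy sketch; the function name, the output dimension n and the seed value are illustrative choices.)

```python
import numpy as np

def linear_map(x, n=5, seed=42):
    """Apply the linear map of equation (1): y = x_1*v_1 + ... + x_m*v_m.
    The vectors v_i are random but fixed via the seed, so every call uses the same map."""
    m = len(x)
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n, m))   # column i of V is the n-dimensional vector v_i
    y = np.zeros(n)
    for i in range(m):
        y += x[i] * V[:, i]           # scale v_i by x_i and accumulate
    return y

print(linear_map(np.array([1.0, 2.0, 3.0])))
```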
We now have a way to “map” m-dimensional vectors (x) to n-dimensional vectors (y). But does this “map” satisfy the properties of a linear map? Recall from chapter 1, section II the properties of a linear map (here, u and w are vectors and c is a scalar):

f(u + w) = f(u) + f(w), \qquad f(c\,u) = c\, f(u)
It’s clear that the map specified by equation (1) satisfies the above two properties of a linear map: f(u + w) = \sum_i (u_i + w_i) v_i = f(u) + f(w), and f(c\,u) = \sum_i c\, u_i v_i = c\, f(u).


The vectors v_1, ..., v_m are arbitrary, and no matter what we choose for them, the function f defined in equation (1) is a linear map. So, different choices for those vectors lead to different linear maps. Furthermore, for any linear map you can imagine, there will be some vectors that can be plugged into equation (1) to represent it.
Now, for a given linear map, we can collect the vectors v_1, ..., v_m into the columns of a matrix. Such a matrix will have n rows and m columns. This matrix represents the linear map, and its multiplication with an input vector x represents the application of the linear map f to x. And this application is where the definition of matrix multiplication comes from.
A x = \begin{bmatrix} v_1 & v_2 & \cdots & v_m \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} = x_1 v_1 + x_2 v_2 + \dots + x_m v_m

Animation (1): multiplying a matrix with a vector, column by column. Image by author.
We can now see why the identity element for matrix multiplication is the way it is:
I x = x_1 \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} + x_2 \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix} + \dots + x_m \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix} = x
We start with a column vector, x, and end with a column vector, y (so just one column for each of them). And since the elements of x must align with the column vectors of the matrix representing the linear map, the number of columns of the matrix must equal the number of elements in x. More on this in section III-C.
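To make this concrete, here is a small numpy check (with arbitrary numbers) that the matrix-vector product is exactly the sum of the columns scaled by the elements of x:

```python
import numpy as np

A = np.array([[1, 0, 2],
              [0, 1, 3]])     # n = 2 rows, m = 3 columns
x = np.array([4, 5, 6])       # m = 3 elements, matching the number of columns of A

print(A @ x)                                      # [16 23]
print(4 * A[:, 0] + 5 * A[:, 1] + 6 * A[:, 2])    # [16 23], the same thing
```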
III-B) Matrix multiplication as a composition of linear maps
Now that we’ve described how a matrix is multiplied with a vector, we can move on to multiplying a matrix with another matrix.
The definition of matrix multiplication is much more natural when we think of the matrices as representations of linear maps.
Linear maps are functions that take a vector as input and produce a vector as output. Let’s say the linear maps corresponding to two matrices A and B are f and g respectively. How would you think of adding these maps, f + g?
The natural definition is (f + g)(x) = f(x) + g(x). This is reminiscent of the distributive property of addition, where the argument goes inside the bracket to both functions and we add the results. And if we fix a basis, this corresponds to applying both linear maps to the input vector and adding the results. By the distributive property of matrix-vector multiplication, this is the same as adding the matrices corresponding to the linear maps and applying the result to the vector: Ax + Bx = (A + B)x.
Now, let’s consider multiplying the two maps.
Since linear maps are functions, the most natural interpretation of multiplication is to compose them (apply them one after the other, in sequence, to the input vector).
When two matrices are multiplied, the resulting matrix represents the composition of the corresponding linear maps. Consider matrices A and B; the product AB embodies the transformation achieved by applying the linear map represented by B to the input vector first and then applying the linear map represented by A.
So we have a linear map f corresponding to the matrix A and a linear map g corresponding to the matrix B. We’d like to know the matrix C corresponding to the composition of the two linear maps. That is, applying g to any vector x first and then applying f to the result should be equivalent to just applying C.
In the last section, we learned how to multiply a matrix and a vector. Let’s do that twice for A(Bx). Say the columns of B are the column vectors b_1, b_2, ..., b_m. From equation (1) in the previous section,
B x = x_1 b_1 + x_2 b_2 + \dots + x_m b_m \quad \Longrightarrow \quad A(Bx) = x_1 A b_1 + x_2 A b_2 + \dots + x_m A b_m
And what if we applied the linear map corresponding to C directly to the vector x? Say the column vectors of the matrix C are c_1, c_2, ..., c_m. Then,
C x = x_1 c_1 + x_2 c_2 + \dots + x_m c_m
Comparing the two equations above, we get
c_i = A\, b_i \quad \text{for } i = 1, 2, \dots, m \tag{2}
So, the columns of the product matrix C = AB are obtained by applying the linear map corresponding to matrix A to each of the columns of the matrix B. And collecting those resulting vectors into a matrix gives us C.
We have just extended our matrix-vector multiplication result from the previous section to the multiplication of two matrices. We just break the second matrix into a collection of column vectors, multiply the first matrix with each of them, and collect the resulting vectors into the columns of the result matrix.
C = AB = A \begin{bmatrix} b_1 & b_2 & \cdots & b_m \end{bmatrix} = \begin{bmatrix} A b_1 & A b_2 & \cdots & A b_m \end{bmatrix}
The i-th element of A b_j is the dot product of the i-th row of A with b_j. So the entry in the first row and first column of the result matrix C is the dot product of the first row of A and the first column of B. And in general, the entry in the i-th row and j-th column of C is the dot product of the i-th row of A and the j-th column of B. This is the definition of matrix multiplication most of us first learn.

Proof of associativity
We can also show now that matrix multiplication is associative. Instead of the single vector x, let’s apply the product C = AB individually to a group of vectors, x_1, x_2, ..., x_p. Let’s say the matrix that has these as its column vectors is D. We can use the exact same trick as above to show:

(AB)\,D = \begin{bmatrix} (AB)\,x_1 & \cdots & (AB)\,x_p \end{bmatrix} = \begin{bmatrix} A(B x_1) & \cdots & A(B x_p) \end{bmatrix} = A\,(BD)
This is because (AB)\,x_1 = A(B x_1), and the same holds for all the other vectors. So (AB)D = A(BD), which is exactly the statement that matrix multiplication is associative.
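A quick numerical sanity check of associativity (random matrices, sketched in numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
D = rng.standard_normal((4, 5))   # plays the role of the stacked vectors x_1, ..., x_p

print(np.allclose((A @ B) @ D, A @ (B @ D)))   # True: the grouping doesn't matter
```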
Sum of outer products
Say we’re multiplying two matrices, A (with n rows and k columns) and B (with k rows and m columns):
C = AB = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ \vdots & & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \end{bmatrix} \begin{bmatrix} b_{11} & \cdots & b_{1m} \\ \vdots & & \vdots \\ b_{k1} & \cdots & b_{km} \end{bmatrix}, \qquad C_{11} = a_{11} b_{11} + a_{12} b_{21} + \dots + a_{1k} b_{k1} \tag{3}
Equation (3) can be generalized to show that the (i, j) element of the resulting matrix, C, is:
C_{ij} = \sum_{l=1}^{k} a_{il}\, b_{lj} \tag{4}
We have a sum over k terms. What if we took each of those terms and created individual matrices out of them? For example, the first matrix will have as its (i, j)-th entry a_{i1} b_{1j}. These matrices and their relationship to C:
C = \sum_{l=1}^{k} M^{(l)}, \qquad \text{where } M^{(l)}_{ij} = a_{il}\, b_{lj}
This process of summing over matrices can be visualized as follows (reminiscent of the animation in section III-A that visualized a matrix multiplied with a vector):

We see here a sum over k matrices, all of the same size (n x m), which is the same size as the result matrix, C. Notice in equation (4) how, within each term, the column index of the first matrix, A, stays the same while the row index of the second matrix, B, stays the same (both are equal to l). So the matrices we’re summing are the matrix products of the l-th column of A and the l-th row of B.
Matrix multiplication as a sum of outer products. Image by author.
Inside the summation, two vectors are multiplied to produce matrices. This is a special case of matrix multiplication applied to vectors (which are themselves special cases of matrices), and it is called the “outer product”. Here is one more animation to show this sum-of-outer-products process:

This tells us why the number of row vectors in B has to be the same as the number of column vectors in A: they have to be paired up to form the individual matrices.
We’ve seen a lot of visualizations and some math; now let’s see the same thing via code for the special case where A and B are square matrices. This is based on section 4.2 of the book “Introduction to Algorithms”, [2].
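Below is a minimal Python/numpy sketch in that spirit (CLRS gives pseudocode; the function names here are mine). The first function is the classic triple loop over rows, columns and the inner index; the second computes the very same product as a sum of outer products, one per column of A and row of B:

```python
import numpy as np

def square_matrix_multiply(A, B):
    """Triple-loop multiplication of two n x n matrices (the classic definition)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for l in range(n):
                C[i, j] += A[i, l] * B[l, j]   # accumulating the dot product of row i of A and column j of B
    return C

def square_matrix_multiply_outer(A, B):
    """The same product as a sum of outer products: column l of A times row l of B."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for l in range(n):
        C += np.outer(A[:, l], B[l, :])
    return C

A = np.random.standard_normal((4, 4))
B = np.random.standard_normal((4, 4))
assert np.allclose(square_matrix_multiply(A, B), A @ B)
assert np.allclose(square_matrix_multiply_outer(A, B), A @ B)
```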
III-C) Matrix multiplication: the structural decisions

Matrix multiplication appears to be structured in a weird way. It’s clear that we need to take a bunch of dot products, so one of the dimensions has to match. But why make the number of columns of the first matrix equal to the number of rows of the second?
Wouldn’t it make things more straightforward if we redefined it so that the number of rows of the two matrices had to be the same (or the number of columns)? This would make it much easier to tell when two matrices can be multiplied.
The traditional definition, where we require the columns of the first matrix to align with the rows of the second, has more than one advantage. Let’s look first at matrix-vector multiplication. Animation (1) in section III-A showed us how the traditional version works. Let’s visualize what happens if we instead required the rows of the matrix to align with the number of elements in the vector.

We see that we’d have to start with a column vector (m rows and one column) and end up with a row vector (one row and n columns). This is awkward and makes defining an identity element for matrix multiplication difficult, since the input and output vectors can never have the same shape. With the traditional definition, this isn’t a problem: the input is a column vector and the output is also a column vector (see animation (1)).
Another consideration is multiplying a chain of matrices. With the traditional method, it is very easy to see at a glance that the chain of matrices below can be multiplied together, based on their dimensionalities.
A_{n_1 \times n_2} \; B_{n_2 \times n_3} \; C_{n_3 \times n_4} \; D_{n_4 \times n_5}
Further, we can tell right away that the output matrix will have n_1 rows and n_5 columns.
In the framework where the rows of the two matrices have to line up, this quickly becomes a mess. For the first two matrices, we can tell that the rows should align and work out how many rows and columns the result will have. But visualizing the shape of that intermediate result and then reasoning about whether it will be compatible with the next matrix in the chain, and so on, becomes a nightmare.

And that is why we require the columns of the first matrix to align with the rows of the second matrix. But perhaps I missed something. Maybe there is an alternative definition that is “cleaner” and manages to side-step these two challenges. Would love to hear ideas in the comments 🙂
III-D) Matrix multiplication as a change of basis
So far, we’ve considered matrix multiplication with vectors as a linear map that takes a vector as input and returns another vector as output. But there is another way to think about matrix multiplication: as a way to change perspective.
Let’s consider two-dimensional space, ℝ^2. We represent any vector in this space with two numbers. What do those numbers represent? The coordinates along the x-axis and y-axis. A unit vector that points just along the x-axis is (1, 0) and one that points along the y-axis is (0, 1). These are our basis for the space. Every vector now has an address. For example, a vector like (2, 3) means we scale the first basis vector by 2 and the second by 3.
But this isn’t the only basis for the space. Someone else (say, he who shall not be named) might want to use two other vectors as their basis; call them b_1 and b_2 (any two linearly independent vectors will do). Any vector in the space can also be expressed in their basis. The same vector would have different representations in our basis and in theirs. Like different addresses for the same house (perhaps based on different postal systems).
Once we’re in the basis of he who shall not be named, the vector b_1 becomes (1, 0) and the vector b_2 becomes (0, 1) (they are the basis vectors from his perspective, by definition of basis vectors). And the functions that translate vectors from our basis system to that of he who shall not be named and vice-versa are linear maps. So the translations can be represented as matrix multiplications. Let’s call the matrix that takes vectors from our representation to his M_1, and the matrix that does the opposite M_2. How do we find these matrices?

We know that the vectors we call b_1 and b_2, he who shall not be named calls (1, 0) and (0, 1). Let’s collect our versions of the vectors into the columns of a matrix:
B = \begin{bmatrix} b_1 & b_2 \end{bmatrix}
And also collect the versions of he who shall not be named, (1, 0) and (0, 1), into the columns of another matrix. This is just the identity matrix:
I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
Since matrix multiplication operates independently on the columns of the second matrix, and M_1 must send our version of each of his basis vectors (the columns of B) to his versions (the columns of I), we get
M_1 B = \begin{bmatrix} M_1 b_1 & M_1 b_2 \end{bmatrix} = I
Multiplying both sides by B^{-1} (on the right) gives us
M_1 = B^{-1}
Doing the same thing in reverse gives us
M_2 = M_1^{-1} = B
This can all be generalized into the following statement: a matrix with column vectors b_1, b_2, ..., b_n translates vectors expressed in the basis where those are the basis vectors into our basis.
And the inverse of that matrix translates vectors from our basis to the one where b_1, b_2, ..., b_n are the basis.
All invertible square matrices can hence be regarded as “basis changers”.
Note: in the special case of an orthonormal matrix (where every column is a unit vector and orthogonal to every other column), the inverse is the same as the transpose. So, changing to the basis of the columns of such a matrix becomes equivalent to taking the dot product of the vector with each of those columns (the rows of the transpose).
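Here is a minimal numpy sketch of both points; the basis vectors b_1 = (1, 1) and b_2 = (-1, 1) are an arbitrary choice made just for this example:

```python
import numpy as np

# Columns of B are "his" basis vectors, written in our coordinates.
B = np.array([[1.0, -1.0],
              [1.0,  1.0]])

v_his = np.array([2.0, 3.0])        # a vector expressed in his basis
v_ours = B @ v_his                   # B translates his representation into ours
print(v_ours)                        # [-1.  5.]
print(np.linalg.inv(B) @ v_ours)     # back to his representation: [2. 3.]

# Orthonormal case: normalize the columns, and the inverse becomes the transpose.
B_hat = B / np.linalg.norm(B, axis=0)
print(np.allclose(np.linalg.inv(B_hat), B_hat.T))   # True
```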
For more on this, see the 3B1B video, [1].
Conclusion
Matrix multiplication is arguably one of the most important operations in modern computing, and it shows up in almost every data science field. Understanding deeply how it works is essential for any data scientist. Most linear algebra textbooks describe the “what” but not why it’s structured the way it is. Hopefully this blog post filled that gap.
[1] 3B1B video on change of basis: https://www.youtube.com/watch?v=P2LTAUO1TdA&t=2s
[2] Introduction to Algorithms by Cormen et al., third edition.
[3] Matrix multiplication as a sum of outer products: https://math.stackexchange.com/questions/2335457/matrix-at-a-as-sum-of-outer-products
[4] Catalan numbers, Wikipedia article: https://en.wikipedia.org/wiki/Catalan_number