The BLAS3 routines, which involve
operations (and only n
data) are where the great gains from cache may be had. Write the
routines below (same prototype as mult()), and uncomment the rest of
the main program. Note that, for debugging, the main program checks
your two matrix-matrix multiplications routines against each other.
This is useful because block-matrix multiply is somewhat intricate and
prone to coding bugs.
Hint: Pseudo-code for the block-matrix multiplication follows:
double b1[Nb][Nb],b2[Nb][Nb],b3[Nb][Nb]; /* Regular C arrays are faster
-- use them! */
for (I=0; I<n/Nb; I++)
for (J=0; J<n/Nb; J++) { /* Compute IJ block of M3: */
-> YOUR CODE: zero out b3, an Nb x Nb array to store IJ block of M3
for (K=0; K<n/Nb; K++) { /* Add to b3 product of IK block of M1
with KJ block of M2 */
/* Copy IK block of M1 into b1, an Nb x Nb matrix */
for (i=0; i<Nb; i++)
for (k=0; k<Nb; k++)
b1[i][k]=M1[I*Nb+i][K*Nb+k];
/* Copy KJ block of M2 into b2, an Nb x Nb matrix */
-> YOUR CODE
/* Do b1*b2, adding result into b3 */
for (i=0; i<Nb; i++)
for (j=0; j<Nb; j++)
for (k=0; k<Nb; k++)
b3[i][j]+=b1[i][k]*b2[k][j];
}
-> YOUR CODE: Copy b3 into IJ block of M3
}
}