I am Xianyi, a co-author of the AUGEM paper and the developer of OpenBLAS.
We used the AUGEM-generated assembly code in the OpenBLAS Sandy Bridge kernel (OpenBLAS/kernel/x86_64/dgemm_kernel_4x8_sandy.S).
However, we still had to add some hand-written code to handle the tail (the leftover part when a dimension is not evenly divisible by the block size).
Because OpenBLAS already uses the AUGEM-generated kernel, we didn't compare AUGEM with OpenBLAS. Instead, we compared its performance with Intel MKL and AMD ACML.
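
To illustrate what "handling the tail" means, here is a minimal sketch in plain C. It is not the actual OpenBLAS or AUGEM code: the function name `dgemm_blocked_with_tail`, the block size `MR = 4`, and the column-major layout with tight leading dimensions are all assumptions made for the example. The blocked loop stands in for the generated kernel, and the scalar loop after it plays the role of the hand-written tail code.

```c
#include <stddef.h>

enum { MR = 4 };  /* hypothetical block size along the M dimension */

/* Computes C += A * B for column-major matrices:
 * A is m x k, B is k x n, C is m x n, with no padding between columns. */
void dgemm_blocked_with_tail(size_t m, size_t n, size_t k,
                             const double *a, const double *b, double *c)
{
    size_t m_main = m - (m % MR);  /* rows covered by full MR-row blocks */

    for (size_t j = 0; j < n; ++j) {
        /* Main part: full blocks of MR rows (stand-in for the generated kernel). */
        for (size_t i = 0; i < m_main; i += MR) {
            double acc[MR] = {0.0};
            for (size_t p = 0; p < k; ++p) {
                double bpj = b[p + j * k];
                for (size_t r = 0; r < MR; ++r)
                    acc[r] += a[(i + r) + p * m] * bpj;
            }
            for (size_t r = 0; r < MR; ++r)
                c[(i + r) + j * m] += acc[r];
        }

        /* Tail: the m % MR leftover rows, handled one row at a time. */
        for (size_t i = m_main; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * m] * b[p + j * k];
            c[i + j * m] += acc;
        }
    }
}
```

With `m = 5` and `MR = 4`, for example, the first four rows go through the blocked loop and the fifth row goes through the tail loop.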
Xianyi