Apr 20, 2021
AFAIK, in the reference implementation the core BLAS routines are actually written in Fortran, and CBLAS is just a (thin) C (not C++) wrapper around them (EDIT: in 'optimized' implementations of BLAS, time-critical routines are even written in assembly).
So he's actually comparing the performance of a Fortran-based library against his own C++ code.
And yes, it would probably be fairer to compare C++ using BLAS to Python using BLAS (via numpy). That would probably make the point I think he's making stand out even more…
The point being, IMHO, that in almost all cases using time- and battle-tested libraries instead of trying to reinvent the wheel yourself is your best option.
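
To make that concrete, here is a rough sketch (mine, not from the article) that puts a hand-rolled matrix multiply next to numpy's `@` operator, which for 2-D float arrays normally hands the work to whichever BLAS numpy was built against. The names `naive_matmul` and `compare`, the matrix size and the seed are all made up for illustration, and the timings will depend heavily on your machine and BLAS build.

```python
import time

import numpy as np


def naive_matmul(a, b):
    """Hand-rolled triple loop over plain Python lists (the 'reinvented wheel')."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                out[i][j] += aik * b[k][j]
    return out


def compare(n=60, seed=0):
    """Return (results_agree, naive_seconds, numpy_seconds) for two n x n matrices."""
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))
    b = rng.random((n, n))

    t0 = time.perf_counter()
    slow = naive_matmul(a.tolist(), b.tolist())
    t1 = time.perf_counter()
    fast = a @ b  # library call; typically dispatched to BLAS
    t2 = time.perf_counter()

    return bool(np.allclose(slow, fast)), t1 - t0, t2 - t1


if __name__ == "__main__":
    agree, t_naive, t_numpy = compare()
    print(f"results agree: {agree}")
    print(f"naive loop: {t_naive:.4f} s, numpy: {t_numpy:.6f} s")
```

Both paths should produce the same product; the only thing that differs is who wrote the inner loop. A single timed run like this is of course not a proper benchmark, just an illustration of the comparison being discussed.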