Books on scientific computing, efficient NN inference, and matrix multiplication

I’m trying to learn more about how inference, matrix multiplication, and scientific computing (primarily with tensors/matrices) are implemented. I’m not sure what the classics are here or which sources are good. I’m primarily looking for books, but classic texts of any kind are welcome (including papers, blogs, and articles on real-world implementations).

I’d like to understand both how to implement algorithms like GEMM as efficiently as BLAS implementations do and how to perform inference on neural networks efficiently. By "efficiency" I mean latency and throughput, as is classically meant, but also energy efficiency, which seems to be covered less.

What are good references/books in this area?