I wish to transpose a square matrix, permanently overwriting it.
This is not the same as creating a 2nd matrix with the transposed contents of the 1st matrix.
I call the procedure with 3 parameters: the address of the original matrix, its rank (the number of rows, which equals the number of columns), and the address of a scratch buffer of at least Rank × Rank bytes. The elements are single bytes.
First, the elements of each row are spread out over the scratch buffer, one Rank apart, so that every row of the original becomes a column in the scratch buffer. Then the scratch buffer is copied back over the original storage.
How can I optimize the code below?

; TransposeSquareMatrix(Address, Rank, Scratch)
Q:
        push    ebp
        mov     ebp, esp
        push    ecx edx esi edi
        mov     esi, [ebp+8]    ; Address
        mov     edx, [ebp+12]   ; Rank
        mov     edi, [ebp+16]   ; Scratch buffer
        lea     eax, [edx-1]    ; Additional address increment
.a:     push    edi             ; (1)
        mov     ecx, [ebp+12]   ; Rank
.b:     movsb                   ; Spreading out the elements of one row
        add     edi, eax
        dec     ecx
        jnz     .b
        pop     edi             ; (1)
        inc     edi
        dec     edx
        jnz     .a              ; Repeating it for every row
        mov     edi, [ebp+8]    ; Address
        mov     ecx, [ebp+12]   ; Rank
        imul    ecx, ecx
        mov     esi, [ebp+16]   ; Scratch buffer
        rep movsb               ; Overwriting the original matrix
        pop     edi esi edx ecx
        pop     ebp
        ret     12
; --------------------------
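
For reference, the two passes of the procedure can be modeled in C roughly as follows. This is only a sketch of what the routine is meant to do, not part of the code under review. The name `transpose_square_bytes` and its signature are hypothetical, and the sketch assumes one-byte elements stored row by row, as the `movsb` instructions imply.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C model of TransposeSquareMatrix; the name and signature are
   illustrative only. Elements are assumed to be single bytes in row-major
   order, and scratch must hold at least n * n bytes. */
void transpose_square_bytes(unsigned char *matrix, size_t n,
                            unsigned char *scratch)
{
    /* Pass 1: read the source sequentially and store element (row, col)
       at position (col, row) of the scratch buffer, a stride of n apart. */
    for (size_t row = 0; row < n; ++row)
        for (size_t col = 0; col < n; ++col)
            scratch[col * n + row] = matrix[row * n + col];

    /* Pass 2: copy the transposed contents back over the original matrix. */
    memcpy(matrix, scratch, n * n);
}
```

The inner loop corresponds to the `.b` loop (one `movsb` followed by an extra advance of Rank - 1), and the `memcpy` corresponds to the final `rep movsb`.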