Transposing a square matrix (n x n)

I wish to transpose a square matrix so that the result permanently overwrites the original.
This is not the same as creating a second matrix that holds the transposed contents of the first.

I call the procedure with 3 parameters: the address of the original matrix, its rank (used here to mean the dimension n), and the address of a scratch buffer that is large enough to hold all n x n elements.
The elements are single bytes. First, each row of the matrix is spread out over the scratch buffer so that it becomes a column. Then the scratch buffer is copied back over the original storage.
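For readers who prefer to see the index mapping spelled out, here is a minimal C sketch of the same two-pass idea. It is only a reference model, not the assembly routine itself: the function name `transpose_square_bytes` is hypothetical, and it assumes one byte per element (as the `movsb` instruction suggests), row-major storage, and a scratch buffer of at least n x n bytes that does not overlap the matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reference model of the two-pass approach described above.
 * Assumptions: 1-byte elements, row-major layout, scratch holds n*n bytes
 * and does not overlap the matrix. */
void transpose_square_bytes(unsigned char *matrix, size_t n, unsigned char *scratch)
{
    assert(matrix != NULL && scratch != NULL);

    /* Pass 1: the element at (row, col) lands at (col, row) in scratch,
     * so each source row is spread out with a stride of n bytes. */
    for (size_t row = 0; row < n; row++) {
        for (size_t col = 0; col < n; col++) {
            scratch[col * n + row] = matrix[row * n + col];
        }
    }

    /* Pass 2: overwrite the original storage with the transposed copy. */
    memcpy(matrix, scratch, n * n);
}
```

Calling it on a 3 x 3 matrix stored as the bytes 1 to 9 should leave the first row as 1, 4, 7. A model like this can serve as a test oracle when checking an optimized assembly version.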

How can I optimize this code?

    ; TransposeSquareMatrix(Address, Rank, Scratch)
    Q:  push    ebp
        mov     ebp, esp
        push    ecx edx esi edi
        mov     esi, [ebp+8]    ; Address
        mov     edx, [ebp+12]   ; Rank
        mov     edi, [ebp+16]   ; Scratch buffer
        lea     eax, [edx-1]    ; Additional address increment
    .a: push    edi             ; (1)
        mov     ecx, [ebp+12]   ; Rank
    .b: movsb                   ; Spreading out the elements of one row
        add     edi, eax
        dec     ecx
        jnz     .b
        pop     edi             ; (1)
        inc     edi
        dec     edx
        jnz     .a              ; Repeating it for every row
        mov     edi, [ebp+8]    ; Address
        mov     ecx, [ebp+12]   ; Rank
        imul    ecx, ecx
        mov     esi, [ebp+16]   ; Scratch buffer
        rep movsb               ; Overwriting the original matrix
        pop     edi esi edx ecx
        pop     ebp
        ret     12
    ; --------------------------