Fast approach to do summation in Compile[]?

My code does a large amount of summation and matrix multiplication.

Compile[] has sped it up noticeably. However, from some literature related to my program, it seems there are approaches that would make it even faster, perhaps by optimizing either the Mathematica code or the algorithm itself.

My first piece of code is below.

    MomentComputing =
      Compile[{{Mmax, _Integer}, {k, _Integer}, {image, _Real, 2},
        {xLegendreP, _Real, 2}, {yLegendreP, _Real, 2}},
       Block[{m, n, width, momentMatrix},
        width = Length[image];
        momentMatrix = Table[0., {Mmax + 1}, {Mmax + 1}];
        Do[
         momentMatrix[[m + 1, n + 1]] =
          ((2. (m - n) + 1.) (2. n + 1.)/((k k)*width width))*
           xLegendreP[[m - n + 1]].image.yLegendreP[[n + 1]],
         {m, 0, Mmax}, {n, 0, m}];
        momentMatrix],
       CompilationTarget -> "C", RuntimeAttributes -> {Listable},
       Parallelization -> True, RuntimeOptions -> "Speed"]

It would probably be better to avoid explicit loops, but I cannot figure out another approach. The matrix-vector multiplications are probably time-consuming as well.
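One direction that might remove the repeated matrix-vector products is sketched below. It has not been tested or benchmarked against the compiled version. Every inner product `xLegendreP[[i + 1]].image.yLegendreP[[j + 1]]` can be computed at once as the matrix product `xLegendreP.image.Transpose[yLegendreP]`, after which the loop only has to scale entries. The names `momentDot` and `g` are hypothetical, and the sketch assumes both polynomial tables have at least `mmax + 1` rows. It computes roughly twice as many inner products as the triangular loop needs, so whether it is faster depends on the sizes involved.

```mathematica
(* Hypothetical helper: same result as MomentComputing, with the Dot calls hoisted out of the loop. *)
(* Assumes xLegendreP and yLegendreP each have at least mmax + 1 rows. *)
momentDot[mmax_, k_, image_, xLegendreP_, yLegendreP_] :=
  Module[{width = Length[image], g},
    (* g[[i + 1, j + 1]] == xLegendreP[[i + 1]].image.yLegendreP[[j + 1]] *)
    g = xLegendreP[[;; mmax + 1]] . image . Transpose[yLegendreP[[;; mmax + 1]]];
    (* Fill the lower triangle (n <= m); entries with n > m stay 0., as in the original. *)
    Table[
      If[n <= m,
        (2. (m - n) + 1.) (2. n + 1.)/(k k width width)*g[[m - n + 1, n + 1]],
        0.],
      {m, 0, mmax}, {n, 0, mmax}]
  ]
```

It would be called with the same arguments as the compiled function, for example `momentDot[Mmax, k, image, xLegendreP, yLegendreP]`.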

Second piece.

    Reconstruction =
      Compile[{{lambdaMatrix, _Real, 2}, {lPoly, _Real, 2}},
       Block[{Mmax, width, x, y, m, n, reconstructedImage},
        {Mmax, width} = Dimensions[lPoly];
        reconstructedImage = Table[0., {width}, {width}];
        Do[
         reconstructedImage[[x, y]] =
          Sum[lambdaMatrix[[m + 1, n + 1]]*lPoly[[m - n + 1, x]]*
            lPoly[[n + 1, y]], {m, 0, Mmax - 1}, {n, 0, m}],
         {x, 1, width}, {y, 1, width}];
        reconstructedImage],
       CompilationTarget -> "C", RuntimeAttributes -> {Listable},
       Parallelization -> True, RuntimeOptions -> "Speed"];

Likewise, I would like to avoid the Do[] loop here. In addition, I suspect that Sum[] is a slow function to call inside the loop.
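One direction that might remove both the Do[] loop and Sum[] is sketched below. It has not been tested or benchmarked against the compiled version. Substituting i = m - n and j = n turns the double sum into a product of three matrices, `Transpose[lPoly].c.lPoly`, where `c[[i + 1, j + 1]]` is `lambdaMatrix[[i + j + 1, j + 1]]` when `i + j <= Mmax - 1` and 0 otherwise. The names `reconstructionDot` and `c` are hypothetical, and the sketch assumes `lambdaMatrix` has at least as many rows and columns as `lPoly` has rows.

```mathematica
(* Hypothetical helper: the same double sum as Reconstruction, written as matrix products. *)
(* Assumes lambdaMatrix has at least Length[lPoly] rows and columns. *)
reconstructionDot[lambdaMatrix_, lPoly_] :=
  Module[{mmax = Length[lPoly], c},
    (* Reindex with i = m - n, j = n, so that m = i + j and the constraint *)
    (* m <= mmax - 1 becomes i + j <= mmax - 1; other entries are zero. *)
    c = Table[
      If[i + j <= mmax - 1, lambdaMatrix[[i + j + 1, j + 1]], 0.],
      {i, 0, mmax - 1}, {j, 0, mmax - 1}];
    (* result[[x, y]] == Sum over i, j of lPoly[[i + 1, x]] c[[i + 1, j + 1]] lPoly[[j + 1, y]] *)
    Transpose[lPoly] . c . lPoly
  ]
```

It would be called as `reconstructionDot[lambdaMatrix, lPoly]` and returns a width by width matrix. The only remaining loop is the small Table that builds `c`, which has Mmax by Mmax entries and does not depend on the image size.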

I can post all of my code if necessary.

Edit 1:

Following Micheal’s suggestion, the first part is fast enough and does not need further acceleration. The second part is the main time-consuming part, and I believe it can still be sped up.