On a parallel machine with $ n$ processors we can compute the sum (or product, or the result of any associative operation) on $ n$ numbers in $ \log n$ steps. In the first step combine neighbors to get $ n/2$ numbers left, then combine them so that $ n/4$ numbers are left and so on until a single number, the result is left which happens after $ \lceil \log n \rceil$ steps. With this is it obvious that we can compute the matrix product of two $ n \times n$ matrices $ A = (A_{ij}), B = (B_{ij})$ with $ n^3$ processors in $ \log n + 1$ steps, in a first step compute all the $ n^3$ products of the of the entries $ A_{ik} \cdot B_{kj}$ for $ i,j,k \in \{1,\ldots, n\}$ . Then with $ n^2$ processors compute the sum of the $ n$ numbers $ A_{i1}\cdot B_{1j} + \ldots + A_{ik}\cdot B_{kj}$ . Hence we get by with $ 1 + \log n$ parallel steps.

Now the following argument is from C. Papadimitriou, *Computational Complexity*, page 361, to show that we get time $ O(\log n)$ using just $ n^2 / \log n$ processors (the argument used is called *Brent’s principle* according to the book).

We compute the $ n^3$ products not in a single step as before, but rather in $ \log n$ ”shifts” using $ \lceil n^3 / \log \rceil$ processors at each shift. We use shifts of the same $ \lceil n^3 / \log n \rceil$ processors to compute the first $ \log \log n$ parallel addition steps, where more than $ \frac{n^3}{\log n}$ processors would be ordinary needed. The total number of parallel steps is now no more than $ 2\log n$ , with $ \frac{n^3}{\log n}$ processors […].

(those arguments could also be found on these slides).

But I get more parallel steps as $ 2\log n$ .

1) First we compute the $ n^3$ products $ A_{ik}\cdot B_{kj}$ with $ \log n$ shifts, hence we need $ \log n$ parallel steps for this.

2) If we combine the results as written in the introduction, then in the $ k$ -th step we have $ n/2^k$ numbers to combine. Hence if $ 2^k < \log n$ we do not have enough processors to do this in a single step, hence we have to ”shift” again, this time $ \lceil \log n / 2^k \rceil$ times. If $ 2^k \ge \log n$ we can use a single parallel step as we have enough processors at hand.

Combining the above reasoning we get the following equation (I omit the rounding) for the number of parallel steps: $ $ \log n + \sum_{k=1}^{\log \log n} \frac{\log n}{2^k} + \log n – \log \log n. $ $ The first summand commes from the shift over the $ n^3$ numbers, the second summand for the shifts needed if in the combination/summation of the product we still have to many numbers for our processors, and the last for the number of steps if we do have enough processors. This equals \begin{align*} & \log n + \log n \left( 1 – \frac{1}{2^{\log \log n}} \right) + \log n – \log \log n \ & = \log n + \log n – 1 + \log n – \log \log n \ & = 3 \log n – \log \log n – 1 \end{align*} The last is in general more than $ 2\log n$ , so what am I missing. Why does they argue they get by with $ 2\log n$ steps. Do I miss anything here?