Say I have an array a[1..n], and I want to output an array s[1..n] with s[i] = a[1]+…+a[i]. What is the best (or at least standard) way to do so *in parallel*?

The way I can think of doing it, given m processors, is to split the array a into m blocks of equal size M, compute b[j] = a[floor((j-1)/M)*M+1]+…+a[j-1]+a[j] (in parallel), then compute s[iM] for 1<=i<=m by summing values of b[jM] (serially), and then, lastly, compute s[jM+i] = s[jM]+b[jM+i] for all 0<i<M, 0<=j<m (in parallel). That feels a little wasteful, though the time complexity is right – is there a better way?