I’ve run into a challenging algorithm puzzle while trying to generate a large amount of test data. The problem is as follows:

We have $ N$ buckets, $ B_1$ through $ B_N$ . Each bucket $ B_i$ maps to a unique item $ a_i$ and a count $ k_i$ . Altogether, the collection holds $ T=\sum_{i=1}^{N}{k_i}$ items. This is a more compact representation of a vector of $ T$ items where each $ a_i$ is repeated $ k_i$ times.

We want to output a shuffled list of the $ T$ items, all permutations equally probable, using only $ O(N)$ space and minimal time complexity. (Assume a perfect RNG.)

$ N$ is fairly large and $ T$ is much larger; they were 5,000 and 5,000,000 respectively in the problem that led me to this investigation.
Now clearly the time complexity is $ \Omega(T)$ , since we have to output that many items. But how closely can we approach that lower bound? Some algorithms:

Algorithm 1: Expand the buckets into a vector of $ T$ items and use a Fisher-Yates shuffle. This takes $ O(T)$ time, but also $ O(T)$ space, which we want to avoid.
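For concreteness, a minimal Python sketch of Algorithm 1, assuming the buckets arrive as a list of `(item, count)` pairs (that representation is my assumption, not part of the problem statement):

```python
import random

def shuffle_expand(buckets):
    """Algorithm 1: expand (item, count) buckets into a flat list of T
    items, then Fisher-Yates shuffle it. O(T) time and O(T) space."""
    flat = [a for a, k in buckets for _ in range(k)]
    random.shuffle(flat)  # random.shuffle is an in-place Fisher-Yates
    return flat
```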

Algorithm 2: For each step, choose a random number $ R$ from $ [0,T-1]$ . Traverse the buckets, subtracting $ k_i$ from $ R$ each time, until $ R<0$ ; then output $ a_i$ and decrement $ k_i$ and $ T$ . This is correct and uses no extra space. However, it takes $ O(NT)$ time, which is quite slow when $ N$ is large.
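A sketch of Algorithm 2 in the same form. (It copies the counts rather than mutating the input in place, so it uses $ O(N)$ scratch space here; the in-place version needs none.)

```python
import random

def shuffle_scan(buckets):
    """Algorithm 2: draw R in [0, T-1], scan the buckets subtracting
    each k_i until R goes negative, then emit that bucket's item.
    O(N*T) time overall."""
    items = [a for a, _ in buckets]
    counts = [k for _, k in buckets]  # mutable copy of the counts
    total = sum(counts)
    out = []
    while total > 0:
        r = random.randrange(total)
        for i, k in enumerate(counts):
            r -= k
            if r < 0:           # landed inside bucket i
                out.append(items[i])
                counts[i] -= 1
                total -= 1
                break
    return out
```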

Algorithm 3: Convert the vector of buckets into a balanced binary tree with the buckets at the leaf nodes; the depth should be close to $ \log_2{N}$ . Each node stores the total count of all the buckets under it. To shuffle, choose a random number $ R$ from $ [0,T-1]$ , then descend into the tree accordingly, decrementing each node count as we go; when descending to the right, reduce $ R$ by the left child's count. When we reach a leaf node, output its item. This uses $ O(N)$ space and $ O(T\log{N})$ time.
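A sketch of Algorithm 3, realizing the balanced tree as an array-based segment tree (an implementation choice of mine, not from the problem statement); internal node $ i$ has children $ 2i$ and $ 2i+1$ , and leaves sit at offsets `size + i`:

```python
import random

def shuffle_tree(buckets):
    """Algorithm 3: balanced binary tree over the buckets, each node
    holding the total count beneath it. Each draw descends from the
    root in O(log N); total O(T log N) time, O(N) space."""
    items = [a for a, _ in buckets]
    n = len(buckets)
    size = 1
    while size < n:          # round up to a power of two
        size *= 2
    tree = [0] * (2 * size)
    for i, (_, k) in enumerate(buckets):
        tree[size + i] = k   # leaves hold the bucket counts
    for i in range(size - 1, 0, -1):
        tree[i] = tree[2 * i] + tree[2 * i + 1]
    out = []
    while tree[1] > 0:       # tree[1] is the root: remaining item count
        r = random.randrange(tree[1])
        node = 1
        while node < size:
            tree[node] -= 1  # decrement counts on the way down
            left = tree[2 * node]
            if r < left:
                node = 2 * node
            else:
                r -= left    # going right: skip past the left subtree
                node = 2 * node + 1
        tree[node] -= 1      # decrement the leaf itself
        out.append(items[node - size])
    return out
```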

Algorithm 3a: Same as Algorithm 3, but with a Huffman tree; this should be faster if the $ k_i$ values vary widely, since the most frequently visited nodes will be closer to the root. The performance is harder to assess, but looks like it would range from $ O(T)$ to $ O(T\log{N})$ depending on the distribution of the $ k_i$ .
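A sketch of Algorithm 3a. The Huffman tree is built once from the initial counts and not rebalanced as counts shrink (my reading of the idea); the descent is the same count-decrementing walk as in Algorithm 3:

```python
import heapq
import itertools
import random

class Node:
    """Tree node: leaves carry an item; internal nodes carry children.
    Every node stores the remaining count beneath it."""
    __slots__ = ("count", "item", "left", "right")
    def __init__(self, count, item=None, left=None, right=None):
        self.count, self.item = count, item
        self.left, self.right = left, right

def build_huffman(buckets):
    """Standard Huffman construction: repeatedly merge the two
    lowest-count nodes, so high-count buckets end up near the root."""
    tie = itertools.count()  # tie-breaker so heapq never compares Nodes
    heap = [(k, next(tie), Node(k, item=a)) for a, k in buckets]
    heapq.heapify(heap)
    while len(heap) > 1:
        k1, _, n1 = heapq.heappop(heap)
        k2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (k1 + k2, next(tie),
                              Node(k1 + k2, left=n1, right=n2)))
    return heap[0][2]

def shuffle_huffman(buckets):
    """Algorithm 3a: descend a Huffman-shaped tree per draw,
    decrementing counts on the way down."""
    root = build_huffman(buckets)
    out = []
    while root.count > 0:
        r = random.randrange(root.count)
        node = root
        while node.item is None:       # internal node
            node.count -= 1
            if r < node.left.count:
                node = node.left
            else:
                r -= node.left.count
                node = node.right
        node.count -= 1                # leaf
        out.append(node.item)
    return out
```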
Algorithm 3 is the best I’ve come up with so far.

Does anyone know of a more efficient algorithm? I tried searching with various terms but could not find any discussion of this particular task.