I’m performing a correlation assessment à la NIST SP 800-90B, *Recommendation for the Entropy Sources Used for Random Bit Generation*, § 5.1.

You take a test sequence and compress it with a standard compression algorithm. You then shuffle that sequence randomly using a PRNG, and re-compress. We expect the randomly shuffled sequence to be harder to compress, as any and all redundancy and correlations will have been destroyed. Its entropy will have increased.
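A minimal sketch of that procedure in Python, assuming the data is a byte string and using `bz2` as the compressor (the helper name and the toy correlated sample are my own, not from SP 800-90B):

```python
import bz2
import random

def compression_ratio(data: bytes, seed: int = 0) -> float:
    """Ratio of (compressed shuffled size) / (compressed original size).

    A ratio > 1 suggests the original sequence had order-dependent
    structure (redundancy/correlation) that the shuffle destroyed.
    """
    original_size = len(bz2.compress(data))
    shuffled = bytearray(data)
    random.Random(seed).shuffle(shuffled)  # PRNG shuffle breaks ordering, keeps symbol counts
    return len(bz2.compress(bytes(shuffled))) / original_size

# Toy example: a strongly correlated sequence (long runs of each value)
# compresses far better before shuffling, so the ratio is well above 1.
correlated = bytes([i // 64 for i in range(4096)])
print(compression_ratio(correlated))
```

Note the shuffle preserves the symbol histogram exactly, so any change in compressed size is attributable to ordering alone.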

So if there is any autocorrelation, $\frac{\text{size of compressed shuffled}}{\text{size of compressed original}} > 1$.

This works using NIST’s recommended bz2 algorithm: on my data samples, the ratio is ~1.03, indicating a slight correlation within the data. When I switch to LZMA, however, the ratio is ~0.99, i.e. below 1. And this holds over hundreds of runs, so it’s not just a stochastic fluke.
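The comparison over many shuffles can be sketched as below; the sample data here is a stand-in for my actual samples (so it will not reproduce the ~1.03 vs ~0.99 figures), but it shows the shape of the experiment with both compressors from the standard library:

```python
import bz2
import lzma
import random
import statistics

def mean_shuffle_ratio(data: bytes, compress, runs: int = 20) -> float:
    """Mean of (compressed shuffled size / compressed original size) over many shuffles."""
    base = len(compress(data))
    rng = random.Random(1234)  # fixed seed so the experiment is repeatable
    ratios = []
    for _ in range(runs):
        buf = bytearray(data)
        rng.shuffle(buf)  # fresh independent shuffle each run
        ratios.append(len(compress(bytes(buf))) / base)
    return statistics.mean(ratios)

data = bytes([i // 64 for i in range(4096)])  # placeholder correlated data
print("bz2 :", mean_shuffle_ratio(data, bz2.compress))
print("lzma:", mean_shuffle_ratio(data, lzma.compress))
```

Averaging over independent shuffles is what lets a small but systematic deviation from 1 stand out against run-to-run noise.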

What would cause the LZMA algorithm to consistently compress a randomly shuffled sequence (slightly) better than the unshuffled one?