Why isn’t it more popular to increase the p (parallelization) parameter of scrypt?

First of all, my understanding of the p parameter in scrypt is that it multiplies the amount of work to do, but in such a way that the additional workloads are independent of each other and can be run in parallel. With that interpretation of p out of the way, why is the recommended value still 1? More generally, why is it a good thing that key stretching algorithms are not parallelizable?

From the point of view of an attacker trying to crack a password, it doesn’t matter whether an algorithm is parallelizable. After all, even if the entire algorithm is sequential, the attacker can just crack several different passwords in parallel.

I understand that scrypt's memory-hardness makes it difficult to use GPUs for cracking. A GPU has far more combined computational power across its many weak cores than a CPU does, but its memory bus is about the same speed, so memory-hardness levels the playing field between legitimate users on a CPU and attackers on a GPU.

However, splitting an scrypt workload that accesses 256MB of RAM into 4 parallel scrypt workloads that access 64MB each would still cost an attacker the same total memory bandwidth, and therefore give the same cracking throughput, while running up to 4 times faster for a legitimate user on a CPU with four or more cores.
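To make the comparison concrete, here is a rough sketch of the arithmetic I have in mind. It assumes the usual scrypt cost model, where each of the p lanes fills a table of N blocks of 128 * r bytes, and it ignores the small PBKDF2 buffers. The function name `scrypt_cost` and the choice r = 8 are my own, picked only for illustration, and "work units" is just a proportional measure (N * r * p), not a timing.

```python
MIB = 2**20


def scrypt_cost(n: int, r: int, p: int) -> dict:
    """Approximate scrypt costs under a simplified model (illustrative only)."""
    lane_memory = 128 * r * n          # bytes in one lane's lookup table
    return {
        "lane_memory_mib": lane_memory // MIB,
        # Peak memory only reaches p times the lane size if all lanes run at once.
        "parallel_memory_mib": lane_memory * p // MIB,
        # Total block mixing (and memory traffic) scales with N * r * p.
        "work_units": n * r * p,
    }


# One sequential 256 MiB workload versus four independent 64 MiB lanes.
single = scrypt_cost(n=2**18, r=8, p=1)
split = scrypt_cost(n=2**16, r=8, p=4)

print("p=1:", single)
print("p=4:", split)
```

Under this model both configurations do the same total work and touch the same total amount of memory when all four lanes run at once. The difference is that each p = 4 lane needs only 64 MiB on its own, so the lanes can also be run one after another in a quarter of the memory.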

Is there any fundamental flaw in my logic? Why is the recommended value for p still p = 1? Is there any downside I can’t see to increasing p?