I’m pretty new to Postgres. I have a few related questions about the performance benefits of partitioning.
Background: I am trying to fix slow batch queries on a 4 column table with over 300 million rows. The PK is on all columns and the 3 columns that aren’t at the beginning of the PK also have single column indexes.
The indexes are growing out of control and are collectively larger than all RAM, and 3x the size of the underlying data. Batches regularly read and write a couple hundred million rows and they are slow.
What I have already done/understand: I’ve analyzed the code and determined that only the first and last columns are used in
SELECT WHERE clauses. Other than
INSERTs, no other queries hit this table (besides a
DELETE that will be replaced with partition dropping). So I already know I can drop two of the three indexes since they are unused, which will lead to reduced index size, fewer indexes, and hopefully improved
In addition to dropping the unused indexes, I am going to use partitioning to replace a lengthy
DELETE statement (which has a
WHERE on the two middle columns) with, instead,
LIST partitioning on those columns so I can
Where I have questions: My proposed
LIST partition also breaks up the table somewhat, so the partitions are no more than 1/4 the size of the original table. However, even that size is very large. I am considering adding a second hierarchical partition layer with a
HASH partition on the 4th column, which could further considerably reduce partition size. My hope is that this would further reduce RAM consumption and/or improve performance in other ways.
However, I have a few questions about this:
- While I would be reducing partition size, my queries actually select on nearly every value in the column where I would be using
HASHpartitioning. I don’t know the distribution of those selected values, so there might be hot and cold partitions, but few totally cold partitions. In this case, would
HASHpartitioning even help, or would it be no improvement over the single level of partitioning? It’s kind of confusing to me, because since hashing is random by design, I would expect this to be a problem whenever
The only reason I think it might help is by reducing the height of the index trees, since the indexes exist within partitions. But all indexes would still be used. Basically what I’m wondering is, is it still an improvement to reduce index height even if all the indexes may be used, versus one huge index? Is there a best practice as to when
HASH might help?
The upper level of the partition hierarchy is a
LISTon the two middle columns because that’s how my
DELETE(that I am replacing) is defined. But those columns are never used in
WHEREclauses. Is it a problem if the top partition hierarchy level isn’t even used in
WHEREclauses? I could reverse the order so the
HASHis the parent partition, since I actually select by that, but then I’d need to drop multiple partitions when I
DROP. Should I just bite the bullet and switch the order?
Is it best for the order of the partition hierarchy to match the order of the PK? In other words, if my parent partitioning is
LIST(col2, col3)and my child partitioning is
HASH(col4), should I change my PK from 1,2,3,4 to 2,3,4,1 to match, or does it not matter?