I’m trying to index a collection of roughly 127K files using a PostgreSQL 9.6 server on RDS. I expected writing these documents to the DB to take about 8 hours, but over time the write rate decays toward zero and the process never completes (eventually, inserts start timing out). Unfortunately, I don’t have much DBA/PostgreSQL background, so I’m struggling to debug this.
On average, indexing one file means inserting 125 rows and 0.5 MB of data into a table (some files are significant outliers, yielding ~40K rows). I have several indices on the table (I don’t think I can avoid these, due to other requirements).
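For context, each file's indexing step is roughly equivalent to the minimal sketch below, assuming a psycopg2-style client; the table, columns, and connection string are placeholders, not my real schema:

```python
# Minimal sketch of the per-file write path. All names (documents table,
# columns, connection string) are placeholders for illustration only.
import psycopg2
from psycopg2.extras import execute_values

def index_file(conn, file_id, chunks):
    """Insert the ~125 extracted rows for one file in a single transaction."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO documents (file_id, chunk_no, body) VALUES %s",
            [(file_id, i, body) for i, body in enumerate(chunks)],
            page_size=1000,  # rows per round trip; outlier files (~40K rows) get batched
        )
    conn.commit()  # one commit per file

conn = psycopg2.connect("dbname=corpus host=my-rds-endpoint user=indexer")
index_file(conn, 42, ["chunk one", "chunk two"])  # toy usage
```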
Initially, max_wal_size on the database was set to 2 GB. With that setting, write throughput was acceptable for roughly the first hour, and my automation processed about 25K files before insert performance became unusable. Increasing max_wal_size to 30 GB helped, but didn’t completely solve the problem: with this configuration the system indexed about 85K files before the insert rate degraded. In the database logs I see mostly checkpoints starting due to the configured timeout (15 minutes). On the RDS console, I see a fairly consistent average write throughput of 5–15 MB/s.
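In case it's useful, this is how I've been sampling checkpoint activity between runs (pg_stat_bgwriter is a standard statistics view in 9.6; the connection string is again a placeholder):

```python
# Sample checkpoint/bgwriter counters; deltas between samples show whether
# checkpoints are timed or forced by WAL volume.
import psycopg2

QUERY = """
SELECT checkpoints_timed,    -- checkpoints started by checkpoint_timeout
       checkpoints_req,      -- requested checkpoints (e.g. WAL volume hit max_wal_size)
       buffers_checkpoint,   -- buffers written by the checkpointer
       buffers_backend,      -- buffers written directly by backends
       maxwritten_clean      -- bgwriter halted for writing too many buffers
FROM pg_stat_bgwriter;
"""

with psycopg2.connect("dbname=corpus host=my-rds-endpoint user=indexer") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        cols = [c[0] for c in cur.description]
        print(dict(zip(cols, cur.fetchone())))
```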
Eventually I may need to index a larger corpus of ~635K files, so I’d like to find settings where I get consistent write throughput.
My current setup (parameters applied via an RDS parameter group; see the sketch after this list):

- PostgreSQL 9.6.15 (on RDS)
- 6 TB disk
- 4 CPUs
- 16 GB RAM
- max_wal_size: 30 GB
- checkpoint_timeout: 15 min
- checkpoint_completion_target: 0.9
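For completeness, the sketch below shows roughly how I applied these settings with boto3. The parameter group name is a placeholder, and the values assume (if I've read the 9.6 docs correctly) that max_wal_size is counted in 16 MB WAL segments and checkpoint_timeout in seconds; please correct me if those units are wrong for RDS:

```python
# Hedged sketch of applying the WAL/checkpoint parameters with boto3.
# "my-pg96-params" is a placeholder group name; value units are my best
# reading of the 9.6 docs (max_wal_size in 16 MB segments, so 1920 ≈ 30 GB).
import boto3

rds = boto3.client("rds")
rds.modify_db_parameter_group(
    DBParameterGroupName="my-pg96-params",
    Parameters=[
        # All three are dynamic parameters, so "immediate" should apply
        # without a reboot; static parameters would need "pending-reboot".
        {"ParameterName": "max_wal_size",
         "ParameterValue": "1920", "ApplyMethod": "immediate"},
        {"ParameterName": "checkpoint_timeout",
         "ParameterValue": "900", "ApplyMethod": "immediate"},   # 15 min
        {"ParameterName": "checkpoint_completion_target",
         "ParameterValue": "0.9", "ApplyMethod": "immediate"},
    ],
)
```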
My questions:

- Do I need to increase max_wal_size again? Is there a rule of thumb for how large it should be?
- Why did a 15x increase in max_wal_size only increase the number of files I could index by 3-4x?
- Are there other places I should look for diagnostic information?
- If I pause the write process and restart it hours later, write performance temporarily improves. Why is this?