I am trying to perform some mathematical operations in PostgreSQL that involve calculating multiple values from each row, then aggregating, without requiring multiple Seq Scans over the whole table. Performance is critical for my application, so I want this to run as efficiently as possible on large data sets. Are there any optimizations I can make so that PostgreSQL uses only a single Seq Scan?
Here’s a simplified example:
Given this test data set:
```
postgres=> CREATE TABLE values (value int);
postgres=> INSERT INTO values (value) SELECT * FROM generate_series(-500000, 500000);
postgres=> SELECT * FROM values;
  value
---------
 -500000
 -499999
 -499998
 -499997
 -499996
 ...
  499996
  499997
  499998
  499999
  500000
```
And I want to perform this query, which counts 2 instances of each row: once by the `value` column and once by `abs(value)`. I'm currently accomplishing this with a `CROSS JOIN`:
```sql
SELECT
    CASE idx WHEN 0 THEN value WHEN 1 THEN abs(value) END,
    COUNT(value)
FROM values
CROSS JOIN LATERAL unnest(ARRAY[0, 1]) idx
GROUP BY 1;
```
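(For reference, the same two-copies expansion can also be written with a `VALUES` list instead of `unnest`. It's semantically equivalent, and I'd expect the planner to produce the same nested-loop shape, so this is just an alternative spelling rather than a fix:)

```sql
-- Equivalent formulation using a two-row VALUES list instead of unnest(ARRAY[0,1]).
SELECT
    CASE idx WHEN 0 THEN value WHEN 1 THEN abs(value) END,
    COUNT(value)
FROM values
CROSS JOIN (VALUES (0), (1)) AS t(idx)
GROUP BY 1;
```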
Here's the `EXPLAIN ANALYZE` result for this query. Notice the `loops=2` in the Seq Scan line:
```
postgres=> EXPLAIN ANALYZE SELECT ....
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=82194.40..82201.40 rows=400 width=12) (actual time=997.448..1214.576 rows=1000001 loops=1)
   Group Key: CASE idx.idx WHEN 0 THEN "values".value WHEN 1 THEN abs("values".value) ELSE NULL::integer END
   ->  Nested Loop  (cost=0.00..70910.65 rows=2256750 width=8) (actual time=0.024..390.070 rows=2000002 loops=1)
         ->  Function Scan on unnest idx  (cost=0.00..0.02 rows=2 width=4) (actual time=0.005..0.007 rows=2 loops=1)
         ->  Seq Scan on "values"  (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.012..82.584 rows=1000001 loops=2)
 Planning Time: 0.073 ms
 Execution Time: 1254.362 ms
```
I compared this to a baseline that counts only 1 instance of each row rather than 2. The 1-instance query performs a single Seq Scan and runs ~50% faster, as expected:
```
postgres=> EXPLAIN ANALYZE SELECT
postgres->     value,
postgres->     COUNT(value)
postgres-> FROM values
postgres-> GROUP BY 1;
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=21350.62..21352.62 rows=200 width=12) (actual time=444.381..662.952 rows=1000001 loops=1)
   Group Key: value
   ->  Seq Scan on "values"  (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.015..84.494 rows=1000001 loops=1)
 Planning Time: 0.044 ms
 Execution Time: 702.806 ms
(5 rows)
```
I want to scale this up to a much larger data set, so performance is critical. Are there any optimizations I can make to cause my original query to run with only 1 Seq Scan? I've tried tweaking query plan settings (`enable_nestloop`, `work_mem`, etc.) without success.
Other Attempts
Here are some other approaches I tried:
- Using `UNION` still performs 2 Seq Scans:
```sql
SELECT value, COUNT(value)
FROM (
    SELECT value FROM values
    UNION
    SELECT abs(value) AS value FROM values
) tbl
GROUP BY 1;
```
```
postgres=> EXPLAIN ANALYZE ...
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=130150.31..130152.31 rows=200 width=12) (actual time=1402.221..1513.000 rows=1000001 loops=1)
   Group Key: "values".value
   ->  HashAggregate  (cost=73731.56..96299.06 rows=2256750 width=4) (actual time=892.904..1112.867 rows=1000001 loops=1)
         Group Key: "values".value
         ->  Append  (cost=0.00..68089.69 rows=2256750 width=4) (actual time=0.025..343.921 rows=2000002 loops=1)
               ->  Seq Scan on "values"  (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.024..86.299 rows=1000001 loops=1)
               ->  Seq Scan on "values" values_1  (cost=0.00..18529.69 rows=1128375 width=4) (actual time=0.013..110.885 rows=1000001 loops=1)
 Planning Time: 0.067 ms
 Execution Time: 1598.531 ms
```
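(A side note on this attempt: `UNION` deduplicates the combined row set, so as written this variant doesn't actually compute the same counts as the original query. Matching the original semantics would need `UNION ALL`, which I'd still expect to plan as an `Append` over two Seq Scans, so it doesn't solve the scan problem either:)

```sql
-- UNION ALL keeps duplicate rows, matching the original query's semantics,
-- but the table is still read twice.
SELECT value, COUNT(value)
FROM (
    SELECT value FROM values
    UNION ALL
    SELECT abs(value) AS value FROM values
) tbl
GROUP BY 1;
```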
- Using PL/pgSQL. This performs only 1 Seq Scan, but array operations in PL/pgSQL are very slow, so it actually executes slower than the original:
```sql
CREATE TEMP TABLE result (value int, count int);

DO LANGUAGE PLPGSQL $$
DECLARE
    counts int8[];
    row record;
BEGIN
    counts = array_fill(0, ARRAY[500000]);

    FOR row IN (SELECT value FROM values) LOOP
        counts[row.value] = counts[row.value] + 1;
        counts[abs(row.value)] = counts[abs(row.value)] + 1;
    END LOOP;

    FOR i IN 0..500000 LOOP
        CONTINUE WHEN counts[i] = 0;
        INSERT INTO result (value, count) VALUES (i, counts[i]);
    END LOOP;
END $$;

SELECT value, count FROM result;
```
```
postgres=> \timing
Timing is on.
postgres=> DO LANGUAGE PLPGSQL $$ ...
DO
Time: 2768.611 ms (00:02.769)
```
- Tweaking query plan configuration. I tried changing `enable_seqscan`, `enable_nestloop`, `work_mem`, and the planner cost constants, and could not find a configuration that performed better than the original.
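(For reference, the kind of session-level tweaks I tried looked roughly like the following. The specific values here are illustrative examples, not a recommendation:)

```sql
-- Example planner tweaks, applied per session before re-running EXPLAIN ANALYZE.
SET enable_nestloop = off;     -- discourage the Nested Loop over the Function Scan
SET work_mem = '256MB';        -- give the HashAggregate more memory to work with
SET cpu_tuple_cost = 0.03;     -- one of the planner cost constants (default 0.01)
```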