Aggregate Multiple Instances of Each Row Without Multiple Seq Scans

I am trying to perform some mathematical operations in PostgreSQL that involve calculating multiple values from each row, then aggregating, without requiring multiple Seq Scans over the whole table. Performance is critical for my application, so I want this to run as efficiently as possible on large data sets. Are there any optimizations I can make so that PostgreSQL only uses a single Seq Scan?

Here’s a simplified example:

Given this test data set:

postgres=> CREATE TABLE values (value int);
postgres=> INSERT INTO values (value) SELECT * from generate_series(-500000,500000);
postgres=> SELECT * FROM values;
  value
---------
 -500000
 -499999
 -499998
 -499997
 -499996
...
  499996
  499997
  499998
  499999
  500000

I want to run a query that counts 2 instances of each row: once grouped by the value column and once grouped by abs(value). I’m currently accomplishing this with a CROSS JOIN LATERAL:

SELECT
  CASE idx
    WHEN 0 THEN value
    WHEN 1 THEN abs(value)
  END,
  COUNT(value)
FROM values
CROSS JOIN LATERAL unnest(ARRAY[0,1]) idx
GROUP BY 1;
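To make the mechanics explicit: the CROSS JOIN LATERAL pairs every row with idx = 0 and idx = 1, so each source row feeds two rows into the aggregate. Here is a minimal illustration of that expansion for a single stand-in row (purely illustrative, not part of the real workload):

SELECT
  idx,
  CASE idx WHEN 0 THEN value WHEN 1 THEN abs(value) END AS grouped_value
FROM (VALUES (-2)) AS v(value)              -- a single stand-in row
CROSS JOIN LATERAL unnest(ARRAY[0,1]) idx;
-- Expected output:
--  idx | grouped_value
-- -----+---------------
--    0 |            -2
--    1 |             2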

Here’s the EXPLAIN ANALYZE result for this query. Notice the loops=2 in the Seq Scan line:

postgres=> EXPLAIN ANALYZE SELECT ....
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=82194.40..82201.40 rows=400 width=12) (actual time=997.448..1214.576 rows=1000001 loops=1)
   Group Key: CASE idx.idx WHEN 0 THEN "values".value WHEN 1 THEN abs("values".value) ELSE NULL::integer END
   ->  Nested Loop  (cost=0.00..70910.65 rows=2256750 width=8) (actual time=0.024..390.070 rows=2000002 loops=1)
         ->  Function Scan on unnest idx  (cost=0.00..0.02 rows=2 width=4) (actual time=0.005..0.007 rows=2 loops=1)
         ->  Seq Scan on "values"  (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.012..82.584 rows=1000001 loops=2)
 Planning Time: 0.073 ms
 Execution Time: 1254.362 ms

I compared this to the case of only using 1 instance of each row rather than 2. The 1 instance query performs a single Seq Scan and runs ~50% faster (as expected):

postgres=> EXPLAIN ANALYZE SELECT
postgres->   value,
postgres->   COUNT(value)
postgres-> FROM values
postgres-> GROUP BY 1;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=21350.62..21352.62 rows=200 width=12) (actual time=444.381..662.952 rows=1000001 loops=1)
   Group Key: value
   ->  Seq Scan on "values"  (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.015..84.494 rows=1000001 loops=1)
 Planning Time: 0.044 ms
 Execution Time: 702.806 ms
(5 rows)

I want to scale this up to a much larger data set, so performance is critical. Are there any optimizations that would make my original query run with only 1 Seq Scan? I’ve tried tweaking query plan settings (enable_nestloop, work_mem, etc.) without success; see Other Attempts below.

Other Attempts

Here are some other approaches I tried:

  1. Using UNION still performs 2 Seq Scans:
SELECT
  value,
  COUNT(value)
FROM (
  SELECT value FROM values
  UNION
  SELECT abs(value) AS value FROM values
) tbl
GROUP BY 1;
postgres=> EXPLAIN ANALYZE ...
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=130150.31..130152.31 rows=200 width=12) (actual time=1402.221..1513.000 rows=1000001 loops=1)
   Group Key: "values".value
   ->  HashAggregate  (cost=73731.56..96299.06 rows=2256750 width=4) (actual time=892.904..1112.867 rows=1000001 loops=1)
         Group Key: "values".value
         ->  Append  (cost=0.00..68089.69 rows=2256750 width=4) (actual time=0.025..343.921 rows=2000002 loops=1)
               ->  Seq Scan on "values"  (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.024..86.299 rows=1000001 loops=1)
               ->  Seq Scan on "values" values_1  (cost=0.00..18529.69 rows=1128375 width=4) (actual time=0.013..110.885 rows=1000001 loops=1)
 Planning Time: 0.067 ms
 Execution Time: 1598.531 ms
  2. Using PL/pgSQL. This performs only 1 Seq Scan, but ARRAY operations in PL/pgSQL are very slow, so this actually executes slower than the original:
CREATE TEMP TABLE result (value int, count int);

DO LANGUAGE PLPGSQL $$
  DECLARE
    counts int8[];
    row record;
  BEGIN
    counts = array_fill(0, ARRAY[500000]);

    FOR row IN (SELECT value FROM values) LOOP
      counts[row.value] = counts[row.value] + 1;
      counts[abs(row.value)] = counts[abs(row.value)] + 1;
    END LOOP;

    FOR i IN 0..500000 LOOP
      CONTINUE WHEN counts[i] = 0;
      INSERT INTO result (value, count) VALUES (i, counts[i]);
    END LOOP;
  END
$$;

SELECT value, count FROM result;
postgres=> \timing
Timing is on.
postgres=> DO LANGUAGE PLPGSQL $$ ...
DO
Time: 2768.611 ms (00:02.769)
  3. Tweaking Query Plan Configuration. I tried changing enable_seqscan, enable_nestloop, work_mem, and the planner cost constants, and could not find a configuration that performed better than the original (a representative sketch of these settings follows below).
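For completeness, the configuration tweaks were along these lines; the specific values shown are only representative examples, not settings I’m recommending or the exact ones I used:

SET enable_nestloop = off;    -- discourage the Nested Loop over the lateral unnest
SET enable_seqscan = off;     -- tried for completeness; there is no index to fall back on
SET work_mem = '256MB';       -- example value, to give the HashAggregate more memory
SET seq_page_cost = 1.0;      -- planner cost constants (example values)
SET random_page_cost = 4.0;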