Why does this correlated subquery return an error in PostgreSQL?

Here is the query in question:

SELECT name, totals1.all_purchases
FROM accounts,
     (SELECT orders.account_id, SUM(total) AS all_purchases
      FROM orders
      GROUP BY orders.account_id) totals1
WHERE accounts.id = totals1.account_id
  AND totals1.all_purchases >=
      ALL (SELECT totals1.all_purchases FROM totals1);

The error text returned is the following:

ERROR:  syntax error at or near ")"
LINE 7: ALL (SELECT totals1.all_purchases FROM LATERAL totals1)

I know I can easily solve this problem using a Postgres CTE, and that it would be the better approach. But I want to understand why this code behaves the way it does, and I’d like someone to point out the flaw in my logic. By the Wikipedia definition of a correlated subquery:

a correlated subquery (also known as a synchronized subquery) is a subquery (a query nested inside another query) that uses values from the outer query.

because the subquery is correlated with a column of the outer query, it must be re-executed for each row of the result.

My logic is: since a correlated subquery is driven by the outer query, it should execute after the outer query. Therefore the inner query should have no problem recognizing the totals1 table.

For example, this code fails as expected, because when the FROM clause is first evaluated, the subquery that creates totals1 has not yet been executed.

SELECT name, totals1.all_purchases
FROM accounts, totals1
WHERE accounts.id = totals1.account_id
  AND totals1.all_purchases >=
      ALL (SELECT totals1.all_purchases
           FROM (SELECT orders.account_id, SUM(total) AS all_purchases
                 FROM orders
                 GROUP BY orders.account_id) totals1);

I am also aware of the LATERAL keyword. I don’t think it applies to this situation, though; I believe it was meant for correlated subqueries in the outer FROM clause. Case in point: I tried it in every way possible, to no avail.

Is this a limitation of PostgreSQL, or does this behavior come from the SQL standard itself?

PS: here is the schema. You might recognize it from Udacity’s SQL for Data Analysts course.
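For reference, here is a minimal sketch of the CTE rewrite mentioned above (not part of the original question). It works because a CTE name, unlike a derived-table alias, is visible to every subquery in the statement:

WITH totals1 AS (
    SELECT orders.account_id, SUM(total) AS all_purchases
    FROM orders
    GROUP BY orders.account_id
)
SELECT name, totals1.all_purchases
FROM accounts
JOIN totals1 ON accounts.id = totals1.account_id
WHERE totals1.all_purchases >= ALL (SELECT all_purchases FROM totals1);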

subquery uses ungrouped column “shops.id” from outer query

I’m writing a PostgreSQL query and I’m encountering this error:

ERROR:  subquery uses ungrouped column "shops.id" from outer query
LINE 8:             WHERE target_id = shops.id AND type = 'started_d...

My query is:

SELECT localities.name AS "City",
       COUNT(shops) AS "Shops",
       CAST(AVG(shops.rating_cache) AS decimal(10, 2)) AS "Rating",
       SUM(shops.product_count_cache) AS "Products",
       (
           SELECT COUNT(*)
           FROM customer_events
           WHERE target_id = shops.id AND type = 'started_directions'
       ) AS "Visites"
FROM shops
LEFT JOIN localities ON localities.id = shops.locality_id
WHERE shops.locality_id IN (
    SELECT cast(unnest as uuid)
    FROM unnest(string_to_array('9c57227a-8f4e-44e0-a3a8-1439c25bf2e5,8f285bca-baec-442e-8a21-e067b75d8f13', ','))
)
AND shops.onboarding_status = 'ready'
GROUP BY localities.name

The first four selected and calculated columns work, but the fifth, which counts the number of customer_events for the current row’s shops.id, does not.

Any idea how to make this count column work?

Best regards,

EDIT: To clarify one thing: the column target_id is a foreign key to a shop’s id.
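A sketch of one possible fix (not part of the original question): because the query groups by localities.name, a correlated subquery in the SELECT list may only reference grouped columns. Wrapping the per-shop count in SUM() makes shops.id an input to the aggregate, which is evaluated once per pre-grouping row, so the error goes away (the locality filter is omitted here for brevity):

SELECT localities.name AS "City",
       COUNT(shops) AS "Shops",
       CAST(AVG(shops.rating_cache) AS decimal(10, 2)) AS "Rating",
       SUM(shops.product_count_cache) AS "Products",
       SUM((SELECT COUNT(*)
            FROM customer_events
            WHERE target_id = shops.id
              AND type = 'started_directions')) AS "Visites"
FROM shops
LEFT JOIN localities ON localities.id = shops.locality_id
WHERE shops.onboarding_status = 'ready'
GROUP BY localities.name;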

Why is this IN-clause with subquery materialization slow?

Can someone help me understand why the performance of these two queries is so vastly different? (Setup code is at the end; DB Fiddle is here: https://www.db-fiddle.com/f/eEuPWqR6gZcjbeSeWk4tu2/0)

1.

select id from texts WHERE id not in (select doc_id from details);

2.

select id from texts WHERE not exists (select 1 from details where details.doc_id = texts.id);

When running query 1 (select id from texts WHERE id not in (select doc_id from details);), it seems to run "forever". The query plan looks like this:

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Gather  (cost=1000.00..3703524012.67 rows=400000 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on texts  (cost=0.00..3703483012.67 rows=166667 width=8)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..20220.86 rows=799991 width=8)
                 ->  Seq Scan on details  (cost=0.00..13095.91 rows=799991 width=8)
 JIT:
   Functions: 8
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(10 rows)

Time: 1.319 ms

The costs already hint at a much longer execution time, but I do not understand why they are so much bigger. Why does the Parallel Seq Scan on texts take so long? What is Postgres doing here?

With fewer rows in the tables, I get the following query plan:

explain (analyse) select id from texts WHERE id not in (select doc_id from details);
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Seq Scan on texts  (cost=11466.00..23744.18 rows=338247 width=8) (actual time=174.488..325.775 rows=10 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 599990
   SubPlan 1
     ->  Seq Scan on details  (cost=0.00..9937.20 rows=611520 width=8) (actual time=0.014..56.549 rows=599990 loops=1)
 Planning Time: 0.079 ms
 Execution Time: 326.372 ms
(7 rows)

Why does it degrade that much with more rows? If I understand the output correctly, the materialization itself is not the problem. Also, how does the second query avoid this problem?

Schema/Data generation:

create table texts (
  id bigint primary key,
  url text);

create table details (
  id bigserial primary key,
  doc_id bigint,
  content text);

insert into details (doc_id, content) select generate_series(1,800000), 'foobar';
insert into texts (id, url) select generate_series(1,800000), 'something';

-- Delete some values
delete from details where doc_id IN (
  307531, 630732, 86402, 584950, 835230,
  334934, 673047, 772541, 239455, 763671);
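Not part of the original question, but a quick experiment suggested by the two plans above: the fast plan's "hashed SubPlan 1" is only chosen when the planner believes the subquery's result fits into work_mem; past that threshold it falls back to re-scanning the materialized subplan for every outer row. Raising work_mem for the session (an experiment to verify the hypothesis, not a definitive fix) should bring the hashed plan back:

-- Give the session more memory for the in-memory hash, then re-check the plan;
-- the "hashed SubPlan" from the small-table case should reappear.
SET work_mem = '64MB';
EXPLAIN select id from texts WHERE id not in (select doc_id from details);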

One-row postgres query as CTE/subquery much slower when passing subquery field into function / maybe related to inlining?

I’m using Postgres 13.3, with inner and outer queries that both produce only a single row (just some stats about row counts).

I can’t figure out why Query2 below is so much slower than Query1 (they should basically be almost exactly the same, maybe a few ms difference at most)…

Query1: This query takes 49 seconds:

WITH t1 AS (
    SELECT
        (SELECT COUNT(*) FROM racing.all_computable_xformula_bday_combos) AS all_count,
        (SELECT COUNT(*) FROM racing.xday_todo_all) AS todo_count,
        (SELECT COUNT(*) FROM racing.xday) AS xday_row_count
    OFFSET 0 -- this is to prevent inlining
)
SELECT
    t1.all_count,
    t1.all_count - t1.todo_count AS done_count,
    t1.todo_count,
    t1.xday_row_count
FROM t1

Query2: This query takes 4 minutes and 30 seconds (only one line difference):

WITH t1 AS (
    SELECT
        (SELECT COUNT(*) FROM racing.all_computable_xformula_bday_combos) AS all_count,
        (SELECT COUNT(*) FROM racing.xday_todo_all) AS todo_count,
        (SELECT COUNT(*) FROM racing.xday) AS xday_row_count
    OFFSET 0 -- this is to prevent inlining
)
SELECT
    t1.all_count,
    t1.all_count - t1.todo_count AS done_count,
    t1.todo_count,
    t1.xday_row_count,
    -- the line below is the only difference to Query1:
    util.divide_ints_and_get_percentage_string(todo_count, all_count) AS todo_percentage
FROM t1
  • Before this point, and with some extra columns in the outer query (which should have made almost zero difference), the whole query was insanely slow, like 25 minutes, which I think was due to inlining. Hence the OFFSET 0 being added into both queries (which does help a lot).
  • I’ve also been swapping between using the above CTEs and subqueries, but with the OFFSET 0 included it doesn’t seem to make any difference.

Here are the definitions of the functions being called in Query2:

CREATE OR REPLACE FUNCTION util.ratio_to_percentage_string(FLOAT, INTEGER)
RETURNS TEXT AS $$
BEGIN
    RETURN ROUND($1::NUMERIC * 100, $2)::TEXT || '%';
END;
$$ LANGUAGE plpgsql IMMUTABLE;


CREATE OR REPLACE FUNCTION util.divide_ints_and_get_percentage_string(BIGINT, BIGINT)
RETURNS TEXT AS $$
BEGIN
    RETURN CASE
        WHEN $2 > 0 THEN util.ratio_to_percentage_string($1::FLOAT / $2::FLOAT, 2)
        ELSE 'divide_by_zero'
    END;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
  • As you can see it’s a very simple function, which is only called once, from the single row the whole thing produces. How can this cause such a massive slowdown? And why does it affect whether Postgres inlines the initial subquery/WITH (or whatever else might be going on here)?

EXPLAIN ANALYZE outputs:

  • Query1: https://explain.depesz.com/s/bq7u
  • Query2: https://explain.depesz.com/s/9w3rY
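Not part of the original question, but since the OFFSET 0 hack came up: from Postgres 12 on, the documented way to prevent CTE inlining is the MATERIALIZED keyword, which could replace the trick in both queries above. A sketch of Query2 rewritten that way:

WITH t1 AS MATERIALIZED ( -- MATERIALIZED forces the CTE to be evaluated once, like OFFSET 0
    SELECT
        (SELECT COUNT(*) FROM racing.all_computable_xformula_bday_combos) AS all_count,
        (SELECT COUNT(*) FROM racing.xday_todo_all) AS todo_count,
        (SELECT COUNT(*) FROM racing.xday) AS xday_row_count
)
SELECT
    t1.all_count,
    t1.all_count - t1.todo_count AS done_count,
    t1.todo_count,
    t1.xday_row_count,
    util.divide_ints_and_get_percentage_string(t1.todo_count, t1.all_count) AS todo_percentage
FROM t1;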

Why do two queries run faster than combined subquery?

I’m running Postgres 11 on Azure.

If I run this query:

select min(pricedate) + interval '2 days' from pjm.rtprices 

It takes 0.153 sec and has the following explain:

    "Result  (cost=2.19..2.20 rows=1 width=8)"     "  InitPlan 1 (returns $  0)"     "    ->  Limit  (cost=0.56..2.19 rows=1 width=4)"     "          ->  Index Only Scan using rtprices_pkey on rtprices  (cost=0.56..103248504.36 rows=63502562 width=4)"     "                Index Cond: (pricedate IS NOT NULL)" 

If I run this query:

    select pricedate, hour, last_updated, count(1) as N
    from pjm.rtprices
    where pricedate <= '2020-11-06 00:00:00'
    group by pricedate, hour, last_updated
    order by pricedate desc, hour

it takes 5 seconds with the following explain:

    "GroupAggregate  (cost=738576.82..747292.52 rows=374643 width=24)"     "  Group Key: pricedate, hour, last_updated"     "  ->  Sort  (cost=738576.82..739570.68 rows=397541 width=16)"     "        Sort Key: pricedate DESC, hour, last_updated"     "        ->  Index Scan using rtprices_pkey on rtprices  (cost=0.56..694807.03 rows=397541 width=16)"     "              Index Cond: (pricedate <= '2020-11-06'::date)" 

However when I run

    select pricedate, hour, last_updated, count(1) as N
    from pjm.rtprices
    where pricedate <= (select min(pricedate) + interval '2 days' from pjm.rtprices)
    group by pricedate, hour, last_updated
    order by pricedate desc, hour

I get impatient after 2 minutes and cancel it.

The explain on the long running query is:

    "Finalize GroupAggregate  (cost=3791457.04..4757475.33 rows=3158115 width=24)"     "  Group Key: rtprices.pricedate, rtprices.hour, rtprices.last_updated"     "  InitPlan 2 (returns $  1)"     "    ->  Result  (cost=2.19..2.20 rows=1 width=8)"     "          InitPlan 1 (returns $  0)"     "            ->  Limit  (cost=0.56..2.19 rows=1 width=4)"     "                  ->  Index Only Scan using rtprices_pkey on rtprices rtprices_1  (cost=0.56..103683459.22 rows=63730959 width=4)"     "                        Index Cond: (pricedate IS NOT NULL)"     "  ->  Gather Merge  (cost=3791454.84..4662729.67 rows=6316230 width=24)"     "        Workers Planned: 2"     "        Params Evaluated: $  1"     "        ->  Partial GroupAggregate  (cost=3790454.81..3932679.99 rows=3158115 width=24)"     "              Group Key: rtprices.pricedate, rtprices.hour, rtprices.last_updated"     "              ->  Sort  (cost=3790454.81..3812583.62 rows=8851522 width=16)"     "                    Sort Key: rtprices.pricedate DESC, rtprices.hour, rtprices.last_updated"     "                    ->  Parallel Seq Scan on rtprices  (cost=0.00..2466553.08 rows=8851522 width=16)"     "                          Filter: (pricedate <= $  1)" 

Clearly, the last query is doing a very expensive Gather Merge, so how do I avoid that?

I tried a different approach:

    with lastday as (select distinct pricedate from pjm.rtprices order by pricedate limit 3)
    select rtprices.pricedate, hour, last_updated - interval '4 hours' as last_updated, count(1) as N
    from pjm.rtprices
    right join lastday on rtprices.pricedate = lastday.pricedate
    where rtprices.pricedate <= lastday.pricedate
    group by rtprices.pricedate, hour, last_updated
    order by rtprices.pricedate desc, hour

which took just 2 sec with the following explain:

    "GroupAggregate  (cost=2277449.55..2285769.50 rows=332798 width=32)"     "  Group Key: rtprices.pricedate, rtprices.hour, rtprices.last_updated"     "  CTE lastday"     "    ->  Limit  (cost=0.56..1629038.11 rows=3 width=4)"     "          ->  Result  (cost=0.56..105887441.26 rows=195 width=4)"     "                ->  Unique  (cost=0.56..105887441.26 rows=195 width=4)"     "                      ->  Index Only Scan using rtprices_pkey on rtprices rtprices_1  (cost=0.56..105725202.47 rows=64895517 width=4)"     "  ->  Sort  (cost=648411.43..649243.43 rows=332798 width=16)"     "        Sort Key: rtprices.pricedate DESC, rtprices.hour, rtprices.last_updated"     "        ->  Nested Loop  (cost=0.56..612199.22 rows=332798 width=16)"     "              ->  CTE Scan on lastday  (cost=0.00..0.06 rows=3 width=4)"     "              ->  Index Scan using rtprices_pkey on rtprices  (cost=0.56..202957.06 rows=110933 width=16)"     "                    Index Cond: ((pricedate <= lastday.pricedate) AND (pricedate = lastday.pricedate))" 

This last one is all well and good, but if my subquery weren’t amenable to this hack, is there a better way to get the subquery version to perform similarly to the one-query-at-a-time approach?
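Not from the original post, but one generic workaround when such a rewrite is not available: compute the scalar first and feed it back as a literal, so the planner can estimate the row count from the actual cutoff value instead of treating it as an unknown parameter. A sketch using psql’s \gset (the variable name cutoff is made up for the example):

    -- Step 1: compute the cutoff and store it in a psql variable named "cutoff".
    select (min(pricedate) + interval '2 days')::date as cutoff from pjm.rtprices \gset

    -- Step 2: the planner now sees a constant and can choose the index scan.
    select pricedate, hour, last_updated, count(1) as N
    from pjm.rtprices
    where pricedate <= :'cutoff'
    group by pricedate, hour, last_updated
    order by pricedate desc, hour;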

Converting a MySQL Subquery to a JOIN for performance

I took a look at my old accounting system, and it seems that performance is affecting the daily work of the employees using it. I discovered that a subquery was the problem; I’ve been reading and testing, and it seems that using a JOIN can be around 100x faster, since the data in our databases is huge now.

How can I convert this subquery into a JOIN?

I’m asking for help because I’ve been trying without success, and I’m starting to think that this is not possible.

$sql = "SELECT orders.order_id, orders.order_time, orders.order_user, orders.payment_state,
               orders.order_state, orders.area_name,
               (SELECT COUNT(*) FROM order_item
                WHERE order_item.order_id = orders.order_id) AS items_number
        FROM orders
        WHERE orders.order_state = 1
          AND order_time BETWEEN DATE_SUB(NOW(), INTERVAL 365 DAY) AND NOW()";

To be specific, the data we are retrieving here is all the rows created in the last year from the orders table, plus the number of items purchased in each order, which the subquery returns as items_number by counting the rows of order_item whose order_id matches each order.
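A sketch of the JOIN rewrite being asked for (assuming, as described above, that order_item.order_id references orders.order_id): pre-aggregate the item counts in a derived table and LEFT JOIN it, so orders with no items still appear with a count of 0:

SELECT orders.order_id, orders.order_time, orders.order_user,
       orders.payment_state, orders.order_state, orders.area_name,
       COALESCE(oi.items_number, 0) AS items_number
FROM orders
LEFT JOIN (
    -- one row per order_id with its item count
    SELECT order_id, COUNT(*) AS items_number
    FROM order_item
    GROUP BY order_id
) oi ON oi.order_id = orders.order_id
WHERE orders.order_state = 1
  AND orders.order_time BETWEEN DATE_SUB(NOW(), INTERVAL 365 DAY) AND NOW();

Whether this actually beats the correlated subquery depends on the optimizer and indexes; an index on order_item(order_id) usually helps both forms.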

NOT EXISTS with two subquery fields that match 2 fields in main query

Background: two different document types in a document management system. Both Doc Type A and Doc Type B have a Ticket # and a Ticket Date. What we’re looking for: Doc Type A docs that don’t have a matching Doc Type B doc (NOT EXISTS) with the same Ticket # and Ticket Date. There are likely Doc Type B docs that have the same Ticket # but NOT the same Ticket Date; we want to ignore those. Seems simple, but I am stuck. So far what I have is something like this:

SELECT distinct ki110.keyvaluechar AS "Ticket #", ki101.keyvaluedate AS "Ticket Date"
FROM itemdata
left outer join hsi.keyitem110 ki110 on ( itemdata.itemnum = ki110.itemnum )
left outer join hsi.keyitem101 ki101 on ( itemdata.itemnum = ki101.itemnum )
WHERE ki101.keyvaluedate BETWEEN '01-01-2021' AND '01-31-2021'
AND ( itemdata.itemtypenum = 178 ) -- this is Doc Type A
AND NOT EXISTS (select ki110.keyvaluechar, ki101.keyvaluedate
                from itemdata, keyitem110 ki110, keyitem101 ki101
                where --(itemdata.itemnum = ki110.itemnum) --Ticket #

-- ** the problem is here for Date: I need to say that the Date in the Doc Type B doc is not the same as the Date in the Doc Type A doc, using ki101.keyvaluedate)

AND itemdata.itemtypenum = 183) -- this is Doc Type B
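Not from the original post, but a sketch of the shape this NOT EXISTS usually takes: give the Doc Type B tables their own aliases so that both the Ticket # and the Ticket Date can be correlated back to the outer (Doc Type A) row. All table and column names are assumed from the fragment above:

SELECT DISTINCT ki110.keyvaluechar AS "Ticket #", ki101.keyvaluedate AS "Ticket Date"
FROM itemdata
LEFT OUTER JOIN hsi.keyitem110 ki110 ON itemdata.itemnum = ki110.itemnum
LEFT OUTER JOIN hsi.keyitem101 ki101 ON itemdata.itemnum = ki101.itemnum
WHERE ki101.keyvaluedate BETWEEN '01-01-2021' AND '01-31-2021'
  AND itemdata.itemtypenum = 178                    -- Doc Type A
  AND NOT EXISTS (
      SELECT 1
      FROM itemdata idb
      JOIN hsi.keyitem110 kb110 ON idb.itemnum = kb110.itemnum
      JOIN hsi.keyitem101 kb101 ON idb.itemnum = kb101.itemnum
      WHERE idb.itemtypenum = 183                   -- Doc Type B
        AND kb110.keyvaluechar = ki110.keyvaluechar -- same Ticket #
        AND kb101.keyvaluedate = ki101.keyvaluedate -- same Ticket Date
  );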

Query to find the second highest row in a subquery

The goal is to send notifications about customer updates in a ticketing system, but only for the first one if there are consecutive updates from the customer.

This is the simplified query that I’m using to get the data that I need. There are a few more columns in the original query, and this subquery for threads is more or less required so I can also identify whether this is a new ticket or an existing one was updated (in case of an update, the role for the latest thread will be 'customer'):

SELECT t.ref, m.role
  FROM tickets t
  LEFT JOIN threads th ON (t.id = th.ticket_id)
  LEFT JOIN members m ON (th.member_id = m.id)
 WHERE th.id IN ( SELECT MAX(id)
                    FROM threads
                   WHERE ticket_id = t.id
                )

It will return a list of tickets so the app can send notifications based on that:

+------------+----------+
| ref        | role     |
+------------+----------+
| 210117-001 | customer |
| 210117-002 | staff    |
+------------+----------+

Now, I want to send only a single notification if there are multiple consecutive updates from the customer.

Question:

How can I pull the last row and also the one before it, to identify whether this is a consecutive reply from the customer?

I was thinking about GROUP_CONCAT and then parsing the output in the app, but tickets can have many threads, so that’s not optimal, and there are also a few more fields in the query, so it would violate the ONLY_FULL_GROUP_BY SQL mode.

db<>fiddle here
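Not from the original post: on MySQL 8+, one way to get the last and the next-to-last thread per ticket is ROW_NUMBER() over the thread ids, then a self-join to compare the two roles. A sketch against the assumed tickets/threads/members schema above:

WITH ranked AS (
    SELECT t.id AS ticket_id, t.ref, m.role,
           ROW_NUMBER() OVER (PARTITION BY th.ticket_id ORDER BY th.id DESC) AS rn
    FROM tickets t
    JOIN threads th ON th.ticket_id = t.id
    JOIN members m ON m.id = th.member_id
)
SELECT cur.ref,
       cur.role  AS latest_role,
       prev.role AS previous_role -- NULL when the ticket has only one thread
FROM ranked cur
LEFT JOIN ranked prev ON prev.ticket_id = cur.ticket_id AND prev.rn = 2
WHERE cur.rn = 1;

A consecutive customer reply would then show up as latest_role = 'customer' AND previous_role = 'customer'.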

MySQL – multiple counts on relations based on conditions – JOIN VS. SUBQUERY

I don’t want to share my exact DB structure, so let’s assume this analogy:

-- categories --
id
name

-- products --
id
name
cat_id

I then have SQL like this:

SELECT categories.*,
       count(CASE WHEN products.column1 = something1 AND products.column2 = something2 THEN 1 END) AS count1,
       count(CASE WHEN products.column3 = something3 THEN 1 END) AS count2
FROM categories
LEFT JOIN products ON products.cat_id = categories.id
GROUP BY categories.id

The problem here is that the GROUP BY is taking too long; it’s the difference between a 0.2 s query and a 2.5 s query.
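Not in the original post, but since the title asks JOIN vs. SUBQUERY: the correlated-subquery form of the same counts avoids joining and grouping all product rows, which can be faster when categories are few and products(cat_id) is indexed. A sketch against the analogy schema (something1 etc. are the placeholder conditions from above):

SELECT categories.*,
       (SELECT COUNT(*)
          FROM products
         WHERE products.cat_id = categories.id
           AND products.column1 = something1
           AND products.column2 = something2) AS count1,
       (SELECT COUNT(*)
          FROM products
         WHERE products.cat_id = categories.id
           AND products.column3 = something3) AS count2
FROM categories;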

How do you generate random row order in a subquery?

I know other answers here (and here) say to order by newid(). However, if I am selecting top 1 in a subquery, so as to generate a random selection per row in the outer query, using newid() yields the same result each time.

That is:

select *,
       (select top 1 [value]
        from lookupTable
        where [code] = 'TEST'
        order by newid())
from myTable

… yields the same lookupTable.value on each row returned from myTable.
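Not from the original question, but the usual explanation and a commonly suggested fix: since the subquery references nothing from the outer row, SQL Server is free to evaluate it once and reuse the result. Adding a reference to the outer row, for instance via CROSS APPLY, pushes it to re-evaluate per row. A sketch, assuming myTable has an id column; the extra predicate is a deliberate no-op whose only job is to correlate the subquery (verify the plan, as the optimizer may still simplify it away):

select mt.*, x.[value]
from myTable mt
cross apply (
    select top 1 lt.[value]
    from lookupTable lt
    where lt.[code] = 'TEST'
      and mt.id = mt.id -- no-op reference to the outer row to defeat caching
    order by newid()
) x;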