Non-deterministic performance of a SELECT query, from 1s to 60s, on a table with 1 billion rows

I’m trying to investigate why the performance of this query is so non-deterministic. It can take anywhere from 1 second to 60 seconds and above. The nature of the query is to select a "time window" and get all rows from within that time window.

Here’s the query in question, running on a table of approximately 1 billion rows:

SELECT CAST(extract(EPOCH FROM ts) * 1000000 AS bigint) AS ts,
       ticks,
       quantity,
       side
FROM order_book
WHERE ts >= TO_TIMESTAMP(1618882633073383 / 1000000.0)
  AND ts < TO_TIMESTAMP(1618969033073383 / 1000000.0)
  AND zx_prod_id = 0
ORDER BY ts ASC, del DESC

The values within TO_TIMESTAMP will keep sliding forward as I walk the whole table. Here is the EXPLAIN ANALYZE output for the same query on two different time windows:

Slow Performance

Gather Merge  (cost=105996.20..177498.48 rows=586308 width=18) (actual time=45196.559..45280.769 rows=539265 loops=1)
  Workers Planned: 6
  Workers Launched: 6
  Buffers: shared hit=116386 read=42298
  ->  Sort  (cost=104996.11..105240.40 rows=97718 width=18) (actual time=45169.717..45176.775 rows=77038 loops=7)
        Sort Key: (((date_part('epoch'::text, _hyper_16_214_chunk.ts) * '1000000'::double precision))::bigint), _hyper_16_214_chunk.del DESC
        Sort Method: quicksort  Memory: 9327kB
        Worker 0:  Sort Method: quicksort  Memory: 8967kB
        Worker 1:  Sort Method: quicksort  Memory: 9121kB
        Worker 2:  Sort Method: quicksort  Memory: 9098kB
        Worker 3:  Sort Method: quicksort  Memory: 9075kB
        Worker 4:  Sort Method: quicksort  Memory: 9019kB
        Worker 5:  Sort Method: quicksort  Memory: 9031kB
        Buffers: shared hit=116386 read=42298
        ->  Result  (cost=0.57..96897.07 rows=97718 width=18) (actual time=7.475..45131.932 rows=77038 loops=7)
              Buffers: shared hit=116296 read=42298
              ->  Parallel Index Scan using _hyper_16_214_chunk_order_book_ts_idx on _hyper_16_214_chunk  (cost=0.57..95187.01 rows=97718 width=18) (actual time=7.455..45101.670 rows=77038 loops=7)
                    Index Cond: ((ts >= '2021-04-22 01:34:31.357179+00'::timestamp with time zone) AND (ts < '2021-04-22 02:34:31.357179+00'::timestamp with time zone))
                    Filter: (zx_prod_id = 0)
                    Rows Removed by Filter: 465513
                    Buffers: shared hit=116296 read=42298
Planning Time: 1.107 ms
JIT:
  Functions: 49
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 9.273 ms, Inlining 0.000 ms, Optimization 2.008 ms, Emission 36.235 ms, Total 47.517 ms
Execution Time: 45335.178 ms

Fast Performance

Gather Merge  (cost=105095.94..170457.62 rows=535956 width=18) (actual time=172.723..240.628 rows=546367 loops=1)
  Workers Planned: 6
  Workers Launched: 6
  Buffers: shared hit=158212
  ->  Sort  (cost=104095.84..104319.16 rows=89326 width=18) (actual time=146.702..152.849 rows=78052 loops=7)
        Sort Key: (((date_part('epoch'::text, _hyper_16_214_chunk.ts) * '1000000'::double precision))::bigint), _hyper_16_214_chunk.del DESC
        Sort Method: quicksort  Memory: 11366kB
        Worker 0:  Sort Method: quicksort  Memory: 8664kB
        Worker 1:  Sort Method: quicksort  Memory: 8986kB
        Worker 2:  Sort Method: quicksort  Memory: 9116kB
        Worker 3:  Sort Method: quicksort  Memory: 8858kB
        Worker 4:  Sort Method: quicksort  Memory: 9057kB
        Worker 5:  Sort Method: quicksort  Memory: 6611kB
        Buffers: shared hit=158212
        ->  Result  (cost=0.57..96750.21 rows=89326 width=18) (actual time=6.145..127.591 rows=78052 loops=7)
              Buffers: shared hit=158122
              ->  Parallel Index Scan using _hyper_16_214_chunk_order_book_ts_idx on _hyper_16_214_chunk  (cost=0.57..95187.01 rows=89326 width=18) (actual time=6.124..114.023 rows=78052 loops=7)
                    Index Cond: ((ts >= '2021-04-22 01:34:31.357179+00'::timestamp with time zone) AND (ts < '2021-04-22 02:34:31.357179+00'::timestamp with time zone))
                    Filter: (zx_prod_id = 4)
                    Rows Removed by Filter: 464498
                    Buffers: shared hit=158122
Planning Time: 0.419 ms
JIT:
  Functions: 49
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 10.405 ms, Inlining 0.000 ms, Optimization 2.185 ms, Emission 39.188 ms, Total 51.778 ms
Execution Time: 274.413 ms

I interpreted this output as most of the blame lying with the parallel index scan.

At first, I tried to raise work_mem to 1 GB and shared_buffers to 24 GB, thinking that maybe it couldn’t fit all the stuff it needed in RAM, but that didn’t seem to help.
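For reference, the configuration changes were along these lines (the exact commands may have differed; shared_buffers only takes effect after a restart):

ALTER SYSTEM SET work_mem = '1GB';
ALTER SYSTEM SET shared_buffers = '24GB';
SELECT pg_reload_conf();  -- work_mem is picked up here; shared_buffers needs a restart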

Next, I tried creating an index on (zx_prod_id, ts), thinking that the filter on the parallel index scan might be taking a while, but that didn’t seem to do anything either.
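The index looked roughly like this (the index name is a placeholder; on a TimescaleDB hypertable it gets created on every chunk):

CREATE INDEX order_book_prod_ts_idx ON order_book (zx_prod_id, ts);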

I’m no database expert, and so I’ve kind of exhausted the limits of my knowledge.

Thanks in advance for any suggestions!

What is best practice for referencing data rows within stored procedures – via PK, code column or data value?

Suppose you have a table for colours with columns:

  • id = automatically incrementing integer, primary key
  • code = short code reference for the colour, unique
  • colour = human-readable name of colour, unique

Example values might be:

  • 1, BL, Blue
  • 2, GR, Green

Now imagine you have a stored procedure that, at some point, needs to reference this table. Let’s say the business logic says to obtain the colour "Green". To achieve this, you could have any of the following three WHERE clauses:

  • WHERE id = 2
  • WHERE code = 'GR'
  • WHERE colour = 'Green'

Now, if the system is designed such that it is agreed that a code value, once created, never changes, then, in my view, that is the best column to reference because:

  • It is an alternate key
  • It is human-readable for people who maintain the code
  • It will not be impacted when the business decides to change the colour value to ‘Sea Green’

However, if a legacy table lacks such code values, what, in your opinion, is best practice? To reference the id column, or the colour column?

If you reference the id column, the code is not readable unless you also add comments – and you shouldn’t have to comment simple things like this. It sucks figuring out what statements like WHERE id NOT IN (1, 7, 17, 24, 56) mean.

I’m not sure how often the id value might change in reality – but consider this: during development you run a script that inserts new colours, then deletes some of them and inserts more. If your stored procedure references the id values from that last set of inserted colours, but when you set up your next environment you skip the step that inserted the colours which ended up deleted, then the id values won’t match in that environment. Bad practice, but it can happen – a developer writes their script on a dev instance without considering that the id values will conflict with production (which, for example, may have had additional colours created manually by the business before your colour creation script runs).

If you reference the colour column, you run the risk that, if the business does ask to update the description from ‘Green’ to ‘Sea Green’, your procedure will begin to fail.

I suppose a further solution is to implement the code column when you need it, if it isn’t there already – probably the best solution?
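To make that concrete, retrofitting a code column onto the legacy table could look something like this sketch (the table name colours and the column size are placeholders):

ALTER TABLE colours ADD code VARCHAR(10);

UPDATE colours SET code = 'BL' WHERE colour = 'Blue';
UPDATE colours SET code = 'GR' WHERE colour = 'Green';

ALTER TABLE colours ADD CONSTRAINT uq_colours_code UNIQUE (code);

Procedures can then use WHERE code = 'GR', which survives both a rename to ‘Sea Green’ and id drift between environments.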

Select all rows that have all of the following column values

I have the following tables: Genres, Films and Directors. They have the following schema:

CREATE TABLE Genres(
    GID INTEGER PRIMARY KEY,
    Genre VARCHAR(20) UNIQUE NOT NULL
);

CREATE TABLE Directors(
    DID INTEGER PRIMARY KEY,
    First_Name VARCHAR(20) NOT NULL,
    Last_Name VARCHAR(20) NOT NULL
);

CREATE TABLE Films(
    FID INTEGER PRIMARY KEY,
    Title VARCHAR(45) UNIQUE NOT NULL,
    DID INTEGER NOT NULL,
    GID INTEGER NOT NULL,
    FOREIGN KEY (DID) REFERENCES Directors(DID),
    FOREIGN KEY (GID) REFERENCES Genres(GID)
);

I want to write a query that selects all of the director information for every director who has made at least one movie in the same genre(s) as another director. For example, if Stanley Kubrick has made films in the genres ‘Sci-Fi’, ‘Thriller’ and ‘Crime’, I want to select all the directors who have made at least 1 sci-fi AND 1 thriller AND 1 crime film.

I’ve tried the query seen below, but it gives me directors who have made at least 1 sci-fi OR 1 thriller OR 1 crime film.

SELECT DISTINCT D.DID, D.First_Name, D.Last_Name
FROM Directors D
LEFT JOIN Films F ON F.DID = D.DID
LEFT JOIN Genres G ON G.GID = F.GID
WHERE G.Genre IN (
  SELECT DISTINCT G1.Genre
  FROM Genres G1
  LEFT JOIN Films F1 ON F1.GID = G1.GID
  LEFT JOIN Directors D1 ON F1.DID = D1.DID
  WHERE D1.First_Name = 'Stanley'
    AND D1.Last_Name = 'Kubrick'
);

Additionally, I am not able to check beforehand which genres the director in question has been involved with. The query should work with the only given information being the director’s first and last name.
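From what I’ve read, the shape of query I think I need is "relational division": count Kubrick’s distinct genres, then keep only the directors whose films cover all of them. Something like this sketch (untested; note it would also return Kubrick himself):

SELECT D.DID, D.First_Name, D.Last_Name
FROM Directors D
JOIN Films F ON F.DID = D.DID
JOIN Genres G ON G.GID = F.GID
WHERE G.Genre IN (SELECT G1.Genre
                  FROM Genres G1
                  JOIN Films F1 ON F1.GID = G1.GID
                  JOIN Directors D1 ON D1.DID = F1.DID
                  WHERE D1.First_Name = 'Stanley'
                    AND D1.Last_Name = 'Kubrick')
GROUP BY D.DID, D.First_Name, D.Last_Name
HAVING COUNT(DISTINCT G.Genre) = (SELECT COUNT(DISTINCT G2.Genre)
                                  FROM Genres G2
                                  JOIN Films F2 ON F2.GID = G2.GID
                                  JOIN Directors D2 ON D2.DID = F2.DID
                                  WHERE D2.First_Name = 'Stanley'
                                    AND D2.Last_Name = 'Kubrick');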

Combine duplicate rows on column1 sum column2

I have the following data.

Id      ParentId    ProductId   Quantity
215236  19297       16300319    60
215221  19297       16314611    6
215234  19297       16314670    1    <- duplicate ProductId
215235  19297       16314670    2    <- duplicate ProductId
215195  19297       16314697    20
215205  19297       16321820    75
215216  19297       16329252    10
215233  19297       16331834    9
215224  19297       16519280    40

Selecting unique records is easy, and grouping the data on ProductId is also possible. What I need is a result that contains each ProductId only once, with the same ParentId and the quantities summed up.

The result should look like this.

Id      ParentId    ProductId   Quantity
215236  19297       16300319    60
215221  19297       16314611    6
215234  19297       16314670    3
215195  19297       16314697    20
215205  19297       16321820    75
215216  19297       16329252    10
215233  19297       16331834    9
215224  19297       16519280    40

The quantities of the duplicates

215234  19297       16314670    1    <- duplicate ProductId
215235  19297       16314670    2    <- duplicate ProductId

should result in

215234  19297       16314670    3 
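Roughly what I have in mind is a GROUP BY over (ParentId, ProductId), keeping the first Id of each set (assuming the table is called Orders; untested):

SELECT MIN(Id)       AS Id,        -- keeps 215234 for the duplicate pair
       ParentId,
       ProductId,
       SUM(Quantity) AS Quantity
FROM Orders
GROUP BY ParentId, ProductId
ORDER BY ProductId;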

Split data across multiple rows in a single statement

I have data in a table like this:

Customer  Invoice No  Date        Value  Paid  Balance
ABC       1           01/12/2020  25     0     25
ABC       2           01/12/2020  50     0     50
XYZ       3           02/12/2020  200    0     200
XYZ       4           04/12/2020  100    0     100
ABC       5           04/12/2020  500    0     500

Now I received amounts for customers as below

Customer  Amount
ABC       540
XYZ       210

After receiving the amounts, my table should look like this:

Customer  Invoice No  Date        Value  Paid  Balance
ABC       1           01/12/2020  25     25    0
ABC       2           01/12/2020  50     50    0
XYZ       3           02/12/2020  200    200   0
XYZ       4           04/12/2020  100    10    90
ABC       5           04/12/2020  500    465   35

I got some clues, but they only work for date-based allocation; I need it based on customer and date.
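The clue I found was a running-total approach; adapted to partition per customer, I imagine it looking something like this (the table and column names are my own, and it assumes a DBMS with window functions):

SELECT i.Customer,
       i.InvoiceNo,
       i.Value,
       -- whatever is left of the payment when this invoice's turn comes,
       -- capped at the invoice value and floored at zero
       LEAST(i.Value,
             GREATEST(p.Amount
                      - (SUM(i.Value) OVER (PARTITION BY i.Customer
                                            ORDER BY i.InvoiceDate, i.InvoiceNo)
                         - i.Value),
                      0)) AS Paid
FROM Invoices i
JOIN Payments p ON p.Customer = i.Customer;

Balance would then be Value - Paid, and joining this result back in an UPDATE would write it to the table.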

Any help will be appreciated

Periodical sum of rows by a certain step

I’ve been struggling with one problem. I have a classic matrix, in this example 12×4:

matrix = Table[i, {i, 12}, {4}]

TableForm@matrix

I need to sum the rows according to the example below.

Total@matrix[[{1, 5, 9}]]

Total@matrix[[{2, 6, 10}]]

Total@matrix[[{3, 7, 11}]]

I tried using MapAt, but without any real success so far. I need to make a function out of it, because later I will be using it on a 1036×37 matrix. Any tips?
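One idea I’m experimenting with is to partition the rows into blocks of the step size and total the blocks elementwise; it reproduces the sums above when the row count divides evenly by the step (untested beyond this example):

periodicTotal[m_, step_] := Total@Partition[m, step]

periodicTotal[matrix, 4] (* row i is Total@matrix[[{i, i + 4, i + 8}]] *)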

Storing timeseries data with dynamic number of columns and rows to a suitable database

I have a timeseries pandas dataframe which dynamically gains new columns every minute, as well as a new row:

Initial:

timestamp                100     200     300
2020-11-01 12:00:00        4       3       5

Next minute:

timestamp                100     200     300   500
2020-11-01 12:00:00        4       3       5     0
2020-11-01 12:01:00       14       3       5     4

The dataframe is updated like this every minute, and so on.

So ideally, I want to design a database solution that supports such a dynamic column structure. The number of columns could grow to 20-30k+, and since it’s a one-minute timeseries, it will have 500k+ rows per year.

I’ve read that relational DBs have a limit on the number of columns, so that might not work here. Also, since I am setting the data for new columns and assigning a default value (0) to previous timestamps, I lose out on the DEFAULT parameter that MySQL offers.

Eventually, I will be querying data for 1 day, 1 month to get the data for the columns and their values.
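One layout I’ve been considering is a long/narrow table with one row per (timestamp, column) pair, so a new ‘column’ just becomes new rows (a sketch in generic SQL; the names are mine):

CREATE TABLE readings (
    ts        TIMESTAMP NOT NULL,
    series_id INTEGER   NOT NULL,           -- the would-be column header: 100, 200, 500, ...
    value     INTEGER   NOT NULL DEFAULT 0, -- keeps the default-0 behaviour
    PRIMARY KEY (ts, series_id)
);

-- a day or a month is then an ordinary range query:
SELECT ts, series_id, value
FROM readings
WHERE ts >= '2020-11-01' AND ts < '2020-11-02';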

Please suggest a suitable database solution for this type of dynamic row and column data.

Flexible table columns width, rows independent?

Hello,

I have an HTML table with 3 rows. The third row contains, in the middle, three columns with extraordinarily long phrases, which forces the columns above them to become wide too, and that looks bad, because the rows above contain images, so the spacing is different.

How can I make the HTML table column widths flexible, so that the wide columns in the third row do not influence the width of the columns in the other rows of the same table?


MySQL: transform multiple rows into a single row in the same table (reduce by merging with GROUP BY)

Hi, I want to reduce my table by updating it in place (group and sum some columns, and delete rows).

Source table "table_test":

+----+-----+-------+----------------+
| id | qty | user  | isNeedGrouping |
+----+-----+-------+----------------+
|  1 |   2 | userA |              1 | <- row to group + user A
|  2 |   3 | userB |              0 |
|  3 |   5 | userA |              0 |
|  4 |  30 | userA |              1 | <- row to group + user A
|  5 |   8 | userA |              1 | <- row to group + user A
|  6 |   6 | userA |              0 |
+----+-----+-------+----------------+

Wanted table (obtained by):

DROP TABLE table_test_grouped;
SET @increment = 2;

CREATE TABLE table_test_grouped
SELECT id, SUM(qty) AS qty, user, isNeedGrouping
FROM table_test
GROUP BY user, IF(isNeedGrouping = 1, isNeedGrouping, @increment := @increment + 1);

SELECT * FROM table_test_grouped;

+----+------+-------+----------------+
| id | qty  | user  | isNeedGrouping |
+----+------+-------+----------------+
|  1 |   40 | userA |              1 | <- rows grouped + user A
|  3 |    5 | userA |              0 |
|  6 |    6 | userA |              0 |
|  2 |    3 | userB |              0 |
+----+------+-------+----------------+

Problem: I can use another (temporary) table, but I want to update the initial table, in order to:

  • group by user and sum qty
  • replace/merge each group’s rows into a single row

The result must be a reduced version of the initial table, grouped by user, with qty summed.

This is a minimal example: I don’t want to fully replace the initial table with table_test_grouped, because in my real case I have another column (isNeedGrouping) that decides whether or not a row should be grouped.

For rows flagged with isNeedGrouping, I need grouping. For this example, one way to do it sequentially is:

CREATE TABLE table_test_grouped
SELECT id, SUM(qty) AS qty, user, isNeedGrouping
FROM table_test
WHERE isNeedGrouping = 1
GROUP BY user;

DELETE FROM table_test WHERE isNeedGrouping = 1;

INSERT INTO table_test SELECT * FROM table_test_grouped;
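I also sketched an in-place variant that folds each user’s flagged rows into the one with the lowest id and then deletes the leftovers, but I’m not sure it’s actually simpler (untested):

UPDATE table_test t
JOIN (
    SELECT user, MIN(id) AS keep_id, SUM(qty) AS total_qty
    FROM table_test
    WHERE isNeedGrouping = 1
    GROUP BY user
) g ON t.id = g.keep_id
SET t.qty = g.total_qty;

DELETE t
FROM table_test t
JOIN (
    SELECT user, MIN(id) AS keep_id
    FROM table_test
    WHERE isNeedGrouping = 1
    GROUP BY user
) g ON t.user = g.user
WHERE t.isNeedGrouping = 1
  AND t.id <> g.keep_id;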

Any suggestion for a simpler way?

Is there a performance loss with out-of-sequence inserted rows? (MySQL InnoDB)

I am trying to migrate from a bigger MySQL AWS RDS instance to a smaller one, and data migration is the only method. There are four tables in the range of 330GB-450GB, and executing mysqldump in a single thread, piped directly to the target RDS instance, is estimated by pv to take about 24 hours (copying at 5 mbps).

I wrote a bash script that launches multiple mysqldump processes, each backgrounded with ‘ & ’ and given a calculated --where parameter, to simulate multithreading. This works and currently takes less than an hour with 28 threads.

However, I am concerned about any potential loss of performance while querying in the future, since I’ll not be inserting in the sequence of the auto_increment id columns.

Can someone confirm whether this would be the case, or whether I am being paranoid for no reason?
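If it does turn out to matter, my fallback idea is to rebuild the tables after the import so InnoDB rewrites them in primary-key order (the table name is a placeholder):

OPTIMIZE TABLE big_table;            -- for InnoDB this maps to a full table rebuild
-- or, equivalently:
ALTER TABLE big_table ENGINE=InnoDB;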

What solution did you use for a single table that is in the 100s of GBs? For particular reasons I want to avoid using AWS DMS, and I definitely don’t want to use tools that haven’t been maintained in a while.