Benefits of not having a clustered index on tables (Heaps)

What are the benefits of not having a clustered index on a table in SQL Server? Will a

SELECT * INTO TABLE_A FROM TABLE_B 

be faster if TABLE_A is a heap? Which operations benefit if the table is a heap? I am quite sure UPDATEs and DELETEs will benefit from a clustered index. What about INSERTs? My understanding is that INSERTs "might" benefit from the table being a heap, both in terms of speed and in terms of other resources and hardware (I/O, CPU, memory and storage).

What is the scarcest resource in terms of hardware? In terms of storage, will a heap occupy less space? Isn't disk storage the least expensive resource? If so, is it rational to keep a table as a heap in order to save disk space? How will a heap affect CPU and I/O for SELECT, INSERT, UPDATE and DELETE? Which costs go up when a table is a heap and we SELECT, UPDATE and DELETE from it?
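To make the comparison concrete, this is roughly the kind of side-by-side test I have in mind (a sketch only; TABLE_B and its ID column are placeholders, not my real schema):

-- Sketch: compare loading into a heap vs. a table with a clustered index.
-- TABLE_B and the ID column are placeholders for illustration.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- SELECT ... INTO always creates the new table as a heap.
SELECT * INTO TABLE_A_HEAP FROM TABLE_B;

-- Same load into a pre-created table that already has a clustered index.
CREATE TABLE TABLE_A_CLUSTERED (ID int NOT NULL);
CREATE CLUSTERED INDEX CIX_TABLE_A ON TABLE_A_CLUSTERED (ID);
INSERT INTO TABLE_A_CLUSTERED (ID)
SELECT ID FROM TABLE_B;

-- Compare the space used by the two copies.
EXEC sys.sp_spaceused @objname = N'TABLE_A_HEAP';
EXEC sys.sp_spaceused @objname = N'TABLE_A_CLUSTERED';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;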

Thanks

MongoDB TTL index not expiring documents from collection

I have a TTL index on the collection fct_in_ussd, created as follows:

db.fct_in_ussd.createIndex(
    { "xdr_date": 1 },
    { "background": true, "expireAfterSeconds": 259200 }
)

{
    "v" : 2,
    "key" : {
        "xdr_date" : 1
    },
    "name" : "xdr_date_1",
    "ns" : "appdb.fct_in_ussd",
    "background" : true,
    "expireAfterSeconds" : 259200
}

with an expiry of 3 days (259200 seconds). A sample document in the collection looks like this:

{     "_id" : ObjectId("5f4808c9b32ewa2f8escb16b"),     "edr_seq_num" : "2043019_10405",     "served_imsi" : "",     "ussd_action_code" : "1",     "event_start_time" : ISODate("2020-08-27T19:06:51Z"),     "event_start_time_slot_key" : ISODate("2020-08-27T18:30:00Z"),     "basic_service_key" : "TopSim",     "rate_event_type" : "",     "event_type_key" : "22",     "event_dir_key" : "-99",     "srv_type_key" : "2",     "population_time" : ISODate("2020-08-27T19:26:00Z"),     "xdr_date" : ISODate("2020-08-27T19:06:51Z"),     "event_date" : "20200827" } 

Problem statement: documents are not getting removed from the collection. The collection still contains documents that are 15 days old.

MongoDB server version: 4.2.3

The block compression strategy is zstd:

storage.wiredTiger.collectionConfig.blockCompressor: zstd

The field xdr_date is also part of another compound index.

MySQL performance issue with ST_Contains not using spatial index

We are having what seems to be a fairly large MySQL performance issue when trying to run a fairly simple UPDATE statement. We have a table of houses (1.8 million rows) that contains a lat/long geometry point column (geo), and a table of schools (6 thousand rows) that has a boundary geometry polygon column (boundary). We have spatial indexes on both, and with the update we are trying to write the id of the school whose boundary contains the point into the house table. The update takes 1 hour and 47 minutes to update 1.6 million records. In other systems I have used in my past experience, something like that would take just a few minutes. Any recommendations?

I have posted this same question on the GIS SE site as well, since it is very much both a GIS and a DBA question.

CREATE TABLE houses (
  ID int PRIMARY KEY NOT NULL,
  Latitude float DEFAULT NULL,
  Longitude float DEFAULT NULL,
  geo point GENERATED ALWAYS AS (st_srid(point(ifnull(`Longitude`,0),ifnull(`Latitude`,0)),4326)) STORED NOT NULL,
  SPATIAL INDEX spidx_houses(geo)
) ENGINE = INNODB, CHARACTER SET utf8mb4, COLLATE utf8mb4_0900_ai_ci;

CREATE TABLE schoolBound (
  ID int PRIMARY KEY NOT NULL,
  BOUNDARY GEOMETRY NOT NULL,
  reference VARCHAR(200) DEFAULT NULL,
  type bigint DEFAULT NULL,
  INDEX idx_reference(reference),
  INDEX idx_type(type),
  SPATIAL INDEX spidx_schoolBound(BOUNDARY)
) ENGINE = INNODB, CHARACTER SET utf8mb4, COLLATE utf8mb4_0900_ai_ci;
-- type 4 means it's an elementary school
UPDATE houses hs
    INNER JOIN schoolBound AS sb
        ON ST_Contains(sb.boundary, hs.geo) AND sb.type = 4
SET hs.elementary_nces_code = sb.reference;

The EXPLAIN output seems to show that it is not going to use the spatial index on schoolBound.

+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra                                          |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
|  1 | SIMPLE      | sb    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    6078 |    10.00 | Using where                                    |
|  1 | UPDATE      | hs    | NULL       | ALL  | spidx_houses  | NULL | NULL    | NULL | 1856567 |   100.00 | Range checked for each record (index map: 0x4) |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
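For anyone who wants to reproduce the plan without running the update, a read-only version of the same join (just a sketch, using the same tables as above) can be EXPLAINed instead:

-- Read-only form of the same join, to compare EXPLAIN output more easily
EXPLAIN
SELECT sb.ID AS school_id, hs.ID AS house_id
FROM schoolBound AS sb
INNER JOIN houses AS hs
    ON ST_Contains(sb.BOUNDARY, hs.geo)
WHERE sb.type = 4;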

Range Query Question: Index Compression and Updates

Given an array of up to 200000 elements, my task is to process up to 200000 queries, which either ask me to update a single value within the array or ask me to find the number of elements that lie in a given range.

My current idea is to first use index compression on the given array, then keep another array that contains the number of occurrences of each number. Both kinds of queries could then be processed using a sum segment tree.

However, I ran into a problem while trying to implement this approach. I realized that updating a single array value could force me to shift the compressed array.

For example, given an array [1, 5, 3, 3, 2], I would define a compression function C such that

C[1] = 0; C[2] = 1; C[3] = 2; C[5] = 3; 

Then, the occurrence array would be [1, 1, 2, 1], and processing sum queries would be efficient. However, if I were instructed to update a value, say, change the third element to 4, then that throws everything out of balance. The compression function would have to change to

C[1] = 0; C[2] = 1; C[3] = 2; C[4] = 3; C[5] = 4; 

which would force me to reconstruct my occurrence array, resulting in O(N) update time.

Since N can be up to 200000, my approach will not work efficiently enough to solve the problem, although I think I have the right idea with index compression. Can somebody please point me in the right direction?

Why is PostgreSQL not using a composite index on JSONB fields?

Two of my PostgreSQL DB columns are JSONB fields (J1, J2). My queries are on (J1->>'f1')::int and (J2->>'f2'), for example:

SELECT * from table where (J1->>'f1')::int = 1 AND (J2->>'f2') IS NULL;

Since I have millions of records, I created a composite index:

CREATE INDEX f1_f2 ON table (((J1->>'f1')::int), (J2->>'f2'));

However, looking at EXPLAIN, I see that a Seq Scan on the table is still being used. Is this because PG does not handle composite indexes on JSONB fields? Is there any other reason I should check?
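One check I can still run (a sketch, using the same placeholder table/column names as above) is whether the planner is able to use the index at all once sequential scans are discouraged for the session:

-- Refresh statistics, then discourage seq scans for this session only
ANALYZE "table";
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT *
FROM "table"
WHERE (J1->>'f1')::int = 1
  AND (J2->>'f2') IS NULL;

RESET enable_seqscan;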

Is there yet an “auto index” feature in PostgreSQL?

I’ve struggled all my database life with indexes. Am I supposed to apply one to the ORDER BY column(s), to the ones being compared, or to a combination of them (and if so, in what combination/order)? Which kind of index should it be? And so on, and so on…

Yet indexes are necessary for anything but the most trivial test application. So I don’t get to ignore them.

As I’m sitting here once again, utterly unable to "see" which column(s) should get indexes to speed the sluggish queries up, in which order, of which type, and all the other specifics of indexes, it strikes me that maybe it’s now a thing that PG can analyze itself automatically (just like it already does with query planning and whatnot) and add/delete indexes as it sees fit, rather than having me, a human, sit there and try to predict what indexes it will need, based on my incomplete (to say the least) knowledge of the subject?

(Yeah, right, fat chance…)

This frankly appears to be one of those "job security" things. And while I don’t have some deep-rooted hate towards DBAs earning a living, I do have a complete lack of money to spend on any such support. And besides the purely monetary issue, I also have privacy concerns: I wouldn’t want a third party to have intimate knowledge of my tables and database structure, and for them to do a proper job, they’d have to get this insight before they can get to work. This is perhaps a bigger issue than the money, actually.

Why does adding an index increase the execution time in SQLite?

I’ll just show you an example. Here candidates is a table of 1000000 candidates from 1000 teams and their individual scores. We want a list of all teams and whether the total score of all candidates within each team is within the top 50. (Yeah this is similar to the example from another question, which I encourage you to look at, but I assure you that it is not a duplicate)

Note that all CREATE TABLE results AS ... statements are identical, and the only difference is the presence of indices. These tables are created (and dropped) to suppress the query results so that they won’t make a lot of noise in the output.

------------
-- Set up --
------------

.open delete-me.db    -- A persistent database file is required

.print ''
.print '[Set up]'

DROP TABLE IF EXISTS candidates;

CREATE TABLE candidates AS
WITH RECURSIVE candidates(team, score) AS (
    SELECT ABS(RANDOM()) % 1000, 1
    UNION
    SELECT ABS(RANDOM()) % 1000, score + 1
    FROM candidates
    LIMIT 1000000
)
SELECT team, score FROM candidates;


-------------------
-- Without Index --
-------------------

.print ''
.print '[Without Index]'

DROP TABLE IF EXISTS results;

ANALYZE;

.timer ON
.eqp   ON
CREATE TABLE results AS
WITH top_teams_verbose(top_team, total_score) AS (
    SELECT team, SUM(score)
    FROM candidates
    GROUP BY team
    ORDER BY 2 DESC
    LIMIT 50
), top_teams AS (
    SELECT top_team
    FROM top_teams_verbose
)
SELECT team, SUM(team IN top_teams)
FROM candidates
GROUP BY team;
.eqp   OFF
.timer OFF


------------------------------
-- With Single-column Index --
------------------------------

.print ''
.print '[With Single-column Index]'

CREATE INDEX candidates_idx_1 ON candidates(team);

DROP TABLE IF EXISTS results;

ANALYZE;

.timer ON
.eqp   ON
CREATE TABLE results AS
WITH top_teams_verbose(top_team, total_score) AS (
    SELECT team, SUM(score)
    FROM candidates
    GROUP BY team
    ORDER BY 2 DESC
    LIMIT 50
), top_teams AS (
    SELECT top_team
    FROM top_teams_verbose
)
SELECT team, SUM(team IN top_teams)
FROM candidates
GROUP BY team;
.eqp   OFF
.timer OFF


-----------------------------
-- With Multi-column Index --
-----------------------------

.print ''
.print '[With Multi-column Index]'

CREATE INDEX candidates_idx_2 ON candidates(team, score);

DROP TABLE IF EXISTS results;

ANALYZE;

.timer ON
.eqp   ON
CREATE TABLE results AS
WITH top_teams_verbose(top_team, total_score) AS (
    SELECT team, SUM(score)
    FROM candidates
    GROUP BY team
    ORDER BY 2 DESC
    LIMIT 50
), top_teams AS (
    SELECT top_team
    FROM top_teams_verbose
)
SELECT team, SUM(team IN top_teams)
FROM candidates
GROUP BY team;
.eqp   OFF
.timer OFF

Here is the output

[Set up]

[Without Index]
QUERY PLAN
|--SCAN TABLE candidates
|--USE TEMP B-TREE FOR GROUP BY
`--LIST SUBQUERY 3
   |--CO-ROUTINE 1
   |  |--SCAN TABLE candidates
   |  |--USE TEMP B-TREE FOR GROUP BY
   |  `--USE TEMP B-TREE FOR ORDER BY
   `--SCAN SUBQUERY 1
Run Time: real 0.958 user 0.923953 sys 0.030911

[With Single-column Index]
QUERY PLAN
|--SCAN TABLE candidates USING COVERING INDEX candidates_idx_1
`--LIST SUBQUERY 3
   |--CO-ROUTINE 1
   |  |--SCAN TABLE candidates USING INDEX candidates_idx_1
   |  `--USE TEMP B-TREE FOR ORDER BY
   `--SCAN SUBQUERY 1
Run Time: real 2.487 user 1.108399 sys 1.375656

[With Multi-column Index]
QUERY PLAN
|--SCAN TABLE candidates USING COVERING INDEX candidates_idx_1
`--LIST SUBQUERY 3
   |--CO-ROUTINE 1
   |  |--SCAN TABLE candidates USING COVERING INDEX candidates_idx_2
   |  `--USE TEMP B-TREE FOR ORDER BY
   `--SCAN SUBQUERY 1
Run Time: real 0.270 user 0.248629 sys 0.014341

While the covering index candidates_idx_2 does help, it seems that the single-column index candidates_idx_1 makes the query significantly slower, even after I ran ANALYZE;. It’s only 2.5x slower in this case, but I think the factor can be made greater if you fine-tune the number of candidates and teams.

Why is that?
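In case it helps anyone reproduce the effect, one experiment I can think of (a sketch using SQLite's NOT INDEXED qualifier) is to time the inner aggregate with candidates_idx_1 still present but explicitly ignored:

-- Same inner aggregate, but the scan is forced to ignore any index
SELECT team, SUM(score)
FROM candidates NOT INDEXED
GROUP BY team
ORDER BY 2 DESC
LIMIT 50;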

Postgres not using index on citext column

The following query is doing a Seq Scan instead of using the index.

select file_name from myschemadb.my_files where file_name = 'djsaidjasdjoasdjoaidad' 

Engine

Postgres 11.5

My table:

CREATE TABLE myschemadb.my_files (
    id int4 NOT NULL,
    file_name myschemadb.citext NOT NULL,
    status_id int4 NOT NULL,
    file_key myschemadb.citext NOT NULL,
    is_fine bool NOT NULL DEFAULT true,
    create_date timestamptz NOT NULL DEFAULT now(),
    update_date timestamptz NULL,
    CONSTRAINT pk_my_files PRIMARY KEY (id)
);

The created index:

CREATE INDEX my_files_file_name_idx ON myschemadb.my_files USING btree (file_name); 

Execution Plan

[    {       "Plan": {          "Node Type": "Gather",          "Parallel Aware": false,          "Startup Cost": 1000,          "Total Cost": 70105.63,          "Plan Rows": 1,          "Plan Width": 41,          "Actual Startup Time": 109.537,          "Actual Total Time": 110.638,          "Actual Rows": 0,          "Actual Loops": 1,          "Output": [             "file_name"          ],          "Workers Planned": 2,          "Workers Launched": 2,          "Single Copy": false,          "Shared Hit Blocks": 58326,          "Shared Read Blocks": 0,          "Shared Dirtied Blocks": 0,          "Shared Written Blocks": 0,          "Local Hit Blocks": 0,          "Local Read Blocks": 0,          "Local Dirtied Blocks": 0,          "Local Written Blocks": 0,          "Temp Read Blocks": 0,          "Temp Written Blocks": 0,          "I/O Read Time": 0,          "I/O Write Time": 0,          "Plans": [             {                "Node Type": "Seq Scan",                "Parent Relationship": "Outer",                "Parallel Aware": true,                "Relation Name": "my_files",                "Schema": "myschemadb",                "Alias": "my_files",                "Startup Cost": 0,                "Total Cost": 69105.53,                "Plan Rows": 1,                "Plan Width": 41,                "Actual Startup Time": 107.42,                "Actual Total Time": 107.42,                "Actual Rows": 0,                "Actual Loops": 3,                "Output": [                   "file_name"                ],                "Filter": "((my_files.file_name)::text = 'djsaidjasdjoasdjoaidad'::text)",                "Rows Removed by Filter": 690443,                "Shared Hit Blocks": 58326,                "Shared Read Blocks": 0,                "Shared Dirtied Blocks": 0,                "Shared Written Blocks": 0,                "Local Hit Blocks": 0,                "Local Read Blocks": 0,                "Local Dirtied Blocks": 0,                "Local Written Blocks": 0,                "Temp Read Blocks": 0,                "Temp Written Blocks": 0,                "I/O Read Time": 0,                "I/O Write Time": 0,                "Workers": [                   {                      "Worker Number": 0,                      "Actual Startup Time": 106.121,                      "Actual Total Time": 106.121,                      "Actual Rows": 0,                      "Actual Loops": 1,                      "Shared Hit Blocks": 15754,                      "Shared Read Blocks": 0,                      "Shared Dirtied Blocks": 0,                      "Shared Written Blocks": 0,                      "Local Hit Blocks": 0,                      "Local Read Blocks": 0,                      "Local Dirtied Blocks": 0,                      "Local Written Blocks": 0,                      "Temp Read Blocks": 0,                      "Temp Written Blocks": 0,                      "I/O Read Time": 0,                      "I/O Write Time": 0                   },                   {                      "Worker Number": 1,                      "Actual Startup Time": 106.821,                      "Actual Total Time": 106.821,                      "Actual Rows": 0,                      "Actual Loops": 1,                      "Shared Hit Blocks": 26303,                      "Shared Read Blocks": 0,                      "Shared Dirtied Blocks": 0,                      "Shared Written Blocks": 0,                      "Local Hit Blocks": 0,                      "Local Read Blocks": 0,                    
  "Local Dirtied Blocks": 0,                      "Local Written Blocks": 0,                      "Temp Read Blocks": 0,                      "Temp Written Blocks": 0,                      "I/O Read Time": 0,                      "I/O Write Time": 0                   }                ]             }          ]       },       "Planning Time": 0.034,       "Triggers": [],       "Execution Time": 110.652    } ] 

I guess the problem is here:

"Filter": "((my_files.file_name)::text = 'djsaidjasdjoasdjoaidad'::text)", 

This implicit conversion could be the problem. But when I use an explicit conversion, it doesn't work either:

select file_name from myschemadb.my_files where file_name = 'djsaidjasdjoasdjoaidad'::myschemadb.citext

I found this link: Why does a comparison between CITEXT and TEXT fail?

but it didn't help me.
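One idea I have not tried yet (just a sketch, based on the ::text cast that shows up in the filter) would be an expression index on the text cast of the column:

-- Sketch: expression index matching the (file_name)::text cast seen in the plan
CREATE INDEX my_files_file_name_text_idx
    ON myschemadb.my_files ((file_name::text));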