Indexing very similar character varying field in postgres

I have a table with a column "name"; the table describes a media asset, and name is a character varying field containing the file name. The table is generated and used by a CMS (Strapi), and I can’t really tweak how the columns are used or the SQL being executed. What I’m hoping to do is slap on an index (or two) and get a bit better performance.

Our file names are very similar, pretty much XYZ12345-Q2.png, where XYZ is the same for about 80% of the files. So what I’m wondering is what kind of index (if any) would help speed up a query such as:

select count(*) as "count" from "upload_file"
where ("upload_file"."name"::text ILIKE '%some_string%' or "upload_file"."id"::text ILIKE '%some_string%')

The id is the primary key, and it’s an auto incrementing positive integer.

My concern regarding the name string is that an index won’t do much when the file names are so similar. Or would it actually make a difference? In that case, what would be the best index type to use? My understanding of GIN is that it wouldn’t really suit this case because there are no words (none of the file names contain spaces).
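
For reference, the kind of trigram-based GIN index I’ve seen suggested for ILIKE '%…%' searches would look roughly like the sketch below (assuming the pg_trgm extension can be installed; the index name is a placeholder):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- trigram GIN index on the file name column; can support ILIKE '%...%' lookups
CREATE INDEX upload_file_name_trgm_idx
    ON upload_file USING gin (name gin_trgm_ops);

I’m also unsure whether the OR with the id::text comparison would prevent any index from being used at all.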

Read specific fields from Postgres jsonb

I have data like

{"name": "a", "scope": "1", "items": [{"code": "x", "description": "xd"}, {"code": "x2", "description": "xd2"}]} {"name": "b", "scope": "2", "items": [{"code": "x", "description": "xd"}]} {"name": "c", "scope": "3", "items": [{"code": "x", "description": "xd"}]} {"name": "d", "scope": "4", "items": [{"code": "x", "description": "xd"}]} 

Now I want the result to look like:

{"name": "a","items": [{"code": "x"}, {"code": "x2"}]} {"name": "b","items": [{"code": "x"}]} {"name": "c","items": [{"code": "x"}]} {"name": "d","items": [{"code": "x"}]} 

Postgres Query conversion (Count, updated date)

I currently run the query below in Snowflake for some of my reconciliations. I have never worked in PostgreSQL, but we just adopted it.

Select count(1), min (LAST_UPDATED_DATE), max(LAST_UPDATED_DATE) from "SOURCE"."SCHEMA"."TABLE"

I’m looking to do the same thing in PostgreSQL. It’s a query we can run against a table to get the last updated date and a row count when checking that movements have completed.

I know Postgres can be time-heavy on counts and some of these tables are massive. That aside, I’m not sure how to rewrite this either. Any pointers/tips would be amazing.
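
From the little I’ve read, the aggregate syntax itself should carry over, and only the three-part "database"."schema"."table" qualifier needs to change, since a Postgres connection is already tied to one database. A sketch of what I think the equivalent looks like (schema and table names are placeholders):

SELECT count(*), min(last_updated_date), max(last_updated_date)
FROM my_schema.my_table;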

thanks,

zach

I am trying to convert my SQLite database into a Postgres database, but I don’t know how to convert name TEXT (32) NOT NULL.

CREATE TABLE integrals (
    id    SERIAL PRIMARY KEY NOT NULL UNIQUE,
    name  TEXT (32) NOT NULL,
    ip    TEXT (32),
    posx  REAL NOT NULL,
    posy  REAL NOT NULL,
    port  INTEGER,
    image INT,
    zones TEXT,
    pass  TEXT
);

This is my SQLite generated database, and I am trying to convert it, but I am very new to Postgres.
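
From what I’ve pieced together so far, the Postgres version would look roughly like the sketch below, with TEXT (32) becoming varchar(32) (Postgres has no length modifier for text) and the NOT NULL UNIQUE dropped because PRIMARY KEY already implies both, but I’d appreciate confirmation:

CREATE TABLE integrals (
    id    SERIAL PRIMARY KEY,
    name  varchar(32) NOT NULL,
    ip    varchar(32),
    posx  real NOT NULL,
    posy  real NOT NULL,
    port  integer,
    image integer,
    zones text,
    pass  text
);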

Postgres Specific table access

We have a requirement in our Postgres database: we want to give permissions on specific tables to a particular set of users. We also have Airflow in our environment, which syncs tables, but sometimes new columns are added to a table, so we have to drop and recreate it, and the table-specific access for those users is lost. Access to a specific table is given through GRANT, as sketched below. Can you suggest a way in which access to a specific table can be given so that it survives the table being dropped and recreated?
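
For context, the grants we re-apply today look roughly like this (table and role names are placeholders); as far as I understand, DROP TABLE removes the table’s privileges along with it, which is why the access disappears:

-- currently re-applied manually after every drop/recreate
GRANT SELECT, INSERT ON my_schema.my_table TO reporting_user;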

Postgres REINDEX time estimate

I’ve got an older DB (Postgres 10.15) that’s not yet been upgraded. One problematic table had a few large indexes on it, some of which were corrupt and needed reindexing. Since it’s not on version 12+, I can’t reindex the table concurrently (which means I need to do it non-concurrently, which requires a table write lock), so I wanted to know how I could do some rough calculations on how long the reindex would take so I can plan some maintenance. Most of my research ends up in "just use pg_stat_progress_create_index!" (which isn’t available in 10), or in people just saying to use CONCURRENTLY.

The table is ~200 GB, and there are 7 indexes of ~14 GB each (as per pg_relation_size). I can get a ~900 MB/s sustained read rate on the DB for this task. Is there a simple metric I can use to determine how much data will need to be read to reindex fully?
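
For reference, the sizes above come from something like the query below (the table name is a placeholder); what I’m missing is how to turn those numbers plus the read rate into a time estimate:

-- per-index size for one table, as reported by pg_relation_size
SELECT c.relname,
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE i.indrelid = 'problem_table'::regclass;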

Interpreting startup time and varying plans for Postgres seq scans

While asking a recent question, some mysterious startup-time components came up in my EXPLAIN ANALYZE output. Playing around further, I discovered that the startup time drops to near 0 if I remove the regex WHERE clause.

I ran the following bash script as a test:

for i in $(seq 1 10)
do
    if (( $RANDOM % 2 == 0 ))
    then
        echo "Doing plain count"
        psql -e -c "EXPLAIN ANALYZE SELECT count(*) FROM ui_events_v2"
    else
        echo "Doing regex count"
        psql -e -c "EXPLAIN ANALYZE SELECT count(*) FROM ui_events_v2 WHERE page ~ 'foo'"
    fi
done

The first query returns a count of ~30 million, and the second counts only 7 rows. They are running on a PG 12.3 read replica in RDS with minimal other activity. Both versions take roughly the same amount of time, as I’d expect. Here is some output filtered with grep:

Doing plain count
               ->  Parallel Seq Scan on ui_events_v2  (cost=0.00..3060374.07 rows=12632507 width=0) (actual time=0.086..38622.215 rows=10114306 loops=3)
Doing regex count
               ->  Parallel Seq Scan on ui_events_v2  (cost=0.00..3091955.34 rows=897 width=0) (actual time=16856.679..41398.062 rows=2 loops=3)
Doing plain count
               ->  Parallel Seq Scan on ui_events_v2  (cost=0.00..3060374.07 rows=12632507 width=0) (actual time=0.162..39454.499 rows=10114306 loops=3)
Doing plain count
               ->  Parallel Seq Scan on ui_events_v2  (cost=0.00..3060374.07 rows=12632507 width=0) (actual time=0.036..39213.171 rows=10114306 loops=3)
Doing regex count
               ->  Parallel Seq Scan on ui_events_v2  (cost=0.00..3091955.34 rows=897 width=0) (actual time=12711.308..40015.734 rows=2 loops=3)
Doing plain count
               ->  Parallel Seq Scan on ui_events_v2  (cost=0.00..3060374.07 rows=12632507 width=0) (actual time=0.244..39277.683 rows=10114306 loops=3)
Doing regex count
^CCancel request sent

So, a few questions:

  1. What goes into this startup component of "actual time" in the regex scan, and why is it so much larger? (10-20s vs 0-1s)

  2. Although "cost" and "time" aren’t comparable units, the planner seems to think the startup cost should be 0 in all cases – is it being fooled?

  3. Why do the strategies seem different? Both plans mention Partial Aggregate, but the regex query reports actual rows of 2, while the plain version reports ~10 million (I guess this is some kind of average across 2 workers and 1 leader, summing to ~30 million). If I had to implement this myself, I would probably add up the results of several count(*) operations instead of merging rows and counting – do the plans indicate how exactly it’s doing that?

So that I don’t hide anything, below are the full query plans for each:

 EXPLAIN ANALYZE SELECT count(*) FROM ui_events_v2
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=3093171.59..3093171.60 rows=1 width=8) (actual time=39156.499..39156.499 rows=1 loops=1)
   ->  Gather  (cost=3093171.37..3093171.58 rows=2 width=8) (actual time=39156.356..39157.850 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=3092171.37..3092171.38 rows=1 width=8) (actual time=39154.405..39154.406 rows=1 loops=3)
               ->  Parallel Seq Scan on ui_events_v2  (cost=0.00..3060587.90 rows=12633390 width=0) (actual time=0.033..38413.690 rows=10115030 loops=3)
 Planning Time: 7.968 ms
 Execution Time: 39157.942 ms
(8 rows)

 EXPLAIN ANALYZE SELECT count(*) FROM ui_events_v2 WHERE page ~ 'foo'
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=3093173.83..3093173.84 rows=1 width=8) (actual time=39908.495..39908.495 rows=1 loops=1)
   ->  Gather  (cost=3093173.61..3093173.82 rows=2 width=8) (actual time=39908.408..39909.848 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=3092173.61..3092173.62 rows=1 width=8) (actual time=39906.317..39906.318 rows=1 loops=3)
               ->  Parallel Seq Scan on ui_events_v2  (cost=0.00..3092171.37 rows=897 width=0) (actual time=17250.058..39906.308 rows=2 loops=3)
                     Filter: (page ~ 'foo'::text)
                     Rows Removed by Filter: 10115028
 Planning Time: 0.803 ms
 Execution Time: 39909.921 ms
(10 rows)
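
If it helps, I can rerun these with VERBOSE added, which as far as I understand breaks the row counts and timings down per worker:

EXPLAIN (ANALYZE, VERBOSE)
SELECT count(*) FROM ui_events_v2 WHERE page ~ 'foo';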

AWS Postgres setting pg_trgm.word_similarity_threshold

I’m trying to set pg_trgm.word_similarity_threshold on an RDS Postgres instance (Postgres 12, if it matters):

ALTER SYSTEM SET pg_trgm.word_similarity_threshold = 0.3; 

I get the error: must be superuser to execute ALTER SYSTEM command

I don’t see this in the parameter groups even though pg_trgm is a supported extension. Is there something I’m missing?
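
For now, the workaround I’m looking at is setting it below the system level instead; a sketch of what I mean (the role name is a placeholder):

-- per-session
SET pg_trgm.word_similarity_threshold = 0.3;

-- as a per-role default, which I believe an ordinary role can set for itself
ALTER ROLE my_app_role SET pg_trgm.word_similarity_threshold = 0.3;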

Dropped tables but space not reclaimed in Postgres 12

I upgraded PostgreSQL 9.5 to PostgreSQL 12.4 a few days back using the pg_upgrade utility with the link (-k) option.

So basically I have two data directories: the old one (v9.5) and the current one that is running (v12.4).

Yesterday I dropped two tables of 700 MB and 300 MB.

After connecting to Postgres with the psql utility I can see that the size of the database whose tables were dropped has decreased (with \l+), but what worries me is that only a few MB have been freed on the storage partition.

I have run vacuumdb on that database only, but no luck. I have also checked at the OS level with lsof whether any deleted file is still held open, but there is none.

I’m looking for a solution.
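
For reference, this is roughly how I’ve been comparing what Postgres reports against what the partition shows (a sketch):

-- database sizes as Postgres sees them, largest first
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;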

Is it safe to pg_dump and pg_restore a new postgres database that has malware?

I’m pretty sure there is a crypto bot eating up my CPU through a Postgres script. I would like to create an entirely new VM and move my database to it using pg_dump and pg_restore. I already checked my Postgres for new users, tables, and databases, and couldn’t find anything odd there that could compromise me if I move my data. I’m a little worried, however, because the bot is somehow getting access to my Postgres and nothing else on my VM.

Thank you for the help.