I have a fairly large table of HTML documents (about 500K rows, average text length ~150K). I need a fast search (under 1 s) for rows that match a pattern such as ‘%<meta name="twitter:app%’. The first 10 results are enough.
Possible solutions I have tried (mainly on Postgres):
- Full-text search: the results do not look relevant and may miss matches (I could not find a way to improve this).
- Trigram index (...USING GIN (t gin_trgm_ops)...), a cool feature of PostgreSQL, by the way. It was fast in my synthetic tests, but when I applied it to my HTML set, queries took minutes(!). I ran EXPLAIN ANALYZE, which showed that the index is used (which is nice), but then Postgres rechecks every row matched by the index, and this is quite slow because it has to do a linear text search through large texts, one by one. The synthetic tests were fast because the texts there were relatively small.
- I also tried Elastic (I used the ‘wildcard’ field type; Elastic uses a trigram index as well in that case), but performance was even worse than with Postgres.
I still believe that it is possible, but how?
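
To make the recheck problem from the trigram bullet concrete, here is a minimal Python sketch of how a trigram index narrows the search and why every candidate still needs a full scan. It is a simplified model, not pg_trgm's actual implementation: it takes every 3-character substring of the lowercased text, whereas pg_trgm extracts trigrams from alphanumeric words. The function names (trigrams, build_index, candidates, search) and the sample documents are hypothetical.

```python
def trigrams(text: str) -> set[str]:
    """All 3-character substrings of the lowercased text (simplified model)."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(docs: list[str]) -> dict[str, set[int]]:
    """Inverted index: trigram -> ids of the documents that contain it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in enumerate(docs):
        for gram in trigrams(text):
            index.setdefault(gram, set()).add(doc_id)
    return index


def candidates(index: dict[str, set[int]], needle: str, n_docs: int) -> set[int]:
    """Documents containing every trigram of the needle (may include false positives)."""
    grams = trigrams(needle)
    if not grams:
        # Needle shorter than 3 characters: the index cannot narrow anything.
        return set(range(n_docs))
    return set.intersection(*(index.get(g, set()) for g in grams))


def search(docs: list[str], index: dict[str, set[int]], needle: str, limit: int = 10):
    """Return (matching ids, number of documents rechecked)."""
    hits: list[int] = []
    rechecked = 0
    for doc_id in sorted(candidates(index, needle, len(docs))):
        rechecked += 1
        # The recheck: a linear scan of the whole document text.
        if needle.lower() in docs[doc_id].lower():
            hits.append(doc_id)
            if len(hits) == limit:
                break
    return hits, rechecked


docs = [
    '<html><head><meta name="twitter:app:id:iphone" content="1"></head></html>',
    '<html><head><meta name="twitter:card"></head><a href="user:app">x</a></html>',
    '<html><head><title>plain</title></head></html>',
]
index = build_index(docs)
hits, rechecked = search(docs, index, '<meta name="twitter:app')
print(hits, rechecked)  # prints: [0] 2
```

The second document contains every trigram of the pattern but not the pattern itself, so the index alone cannot exclude it. With large HTML documents that share the same common tags, many rows can end up as candidates of this kind, and each one costs a scan over its full text.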