I’m using an sqlite database to store manually created labels for some data automatically queried from a live system. The data from the live system consists primarily of an address, comprised of 3 parts. Let’s use URLs an example, the three parts being the protocol, the domain and the path. Initially, I would load a couple 100k worth of addresses into a table with each field of the address being a column and together building the primary key. The labels are then in additional columns.
CREATE TABLE OldWebsites ( protocol VARCHAR (255) NOT NULL, domain VARCHAR (255) NOT NULL, path VARCHAR (255) NOT NULL, label1 INTEGER, label2 TEXT, CONSTRAINT address PRIMARY KEY ( protocol, domain, path ) );
I found myself repeating labels over and over based on certain patterns that the address would match. Since I would always extend this table with new data and remove old data, this became too much of a hazzle, so I tried a different approach, namely just loading the existing addresses into one table and then have other tables for the data where I would write regex matchers for the address components
CREATE TABLE Websites ( protocol VARCHAR (255) NOT NULL, domain VARCHAR (255) NOT NULL, path VARCHAR (255) NOT NULL, CONSTRAINT address PRIMARY KEY ( protocol, domain, path ) ); CREATE TABLE Label1 ( protocol_re VARCHAR (255) NOT NULL, domain_re VARCHAR (255) NOT NULL, path_re VARCHAR (255) NOT NULL, label1 INTEGER CONSTRAINT address PRIMARY KEY ( protocol, domain, path ) ); CREATE TABLE Label2 ( protocol_re VARCHAR (255) NOT NULL, domain_re VARCHAR (255) NOT NULL, path_re VARCHAR (255) NOT NULL, label2 TEXT, CONSTRAINT address PRIMARY KEY ( protocol, domain, path ) );
Assume that I have already (using other queries) guaranteed, that there is exactly one match in each label table for each address in the Websites
table. I would now like to write a query that reconstructs a table like the original OldWebsites
one by matching labels and automatically queried data.
Something like this
SELECT Websites.*, Label1.*, Label2.* FROM Websites JOIN Label1 ON (Websites.protocol REGEXP '^' || Label1.protocol_re || '$ ' AND Websites.domain REGEXP '^' || Label1.domain_re || '$ ' AND Websites.path REGEXP '^' || Label1.path_re || '$ ') JOIN Label2 ON (Websites.protocol REGEXP '^' || Label2.protocol_re || '$ ' AND Websites.domain REGEXP '^' || Label2.domain_re || '$ ' AND Websites.path REGEXP '^' || Label2.path_re || '$ ');
Now.. this is really slow, especially for more label tables, using PCRE sqlite3 extension for the REGEXP function.
I would like to know if there’s way to optimize this query using either parallelization (the query should run ideally from python) or using the knowledge that there is exactly 1 match in each Label table.
From my understanding, multiple inner joins should take at most the sum of the individual joins, correct?
Perhaps indexes are also helpful, but I have only a basic idea of what they are and no idea whether they would be of help here.