I have a large and growing table I’m maintaining in sqlite (250M+ rows). I am using this as archival backup for working data on a different system (in postgres). Generally speaking, archive size is more important than archive performance. Therefore, the sqlite db is unindexed.
Every record has a timestamp (stored in epoch ms), and data is never inserted out of date order. Multiple records can have the same timestamp, but the timestamp will only increase over time. Occasionally records will be deleted.
To validate that the archive is tracking with my production data, I perform a "checksum" of sorts, counting the number of records in a given date range, and comparing the archive to the production dataset. Doing a
select count(*) where ts > nnn is slow on the archive (as expected), but it occurs to me that I might be able to use rowid to do a binary search for the first record where ts = nnn and then (perhaps) subtract rowids to get my count of records much faster than the ordinary select would, kind of like a home grown index (or, at very least, allow me to restrict the part of the database that needs to be searched by adding
where rowid > mmm and rowid < ooo to the query.
So, some questions:
- Is rowid maintained in "insertion order" such that if my data is inserted in order of timestamp, I can assume that a higher rowid will never have a lower timestamp?
- Can I therefore implement my own binary search to rapidly find a record with a particular ts? (or use max(rowid) to find the maximum timestamp)?
- Will deleting records create "holes" that sqlite will attempt to "fill in" with subsequent insertions?
- What events would cause the rowid of a given record to change?
- If I do decide to insert data out of date order, how can I sort the data and reset the rowids to account for that?
- Is there a less janky way to leverage the fact that my data is sorted to get at least some of the benefits of indexing without rolling my own?
- Is there a better way than counting to validate that records in a specified date range stored in databases on two different systems, one in postgres, one in sqlite, contain the same data? (I presume they do, but I’m a paranoid sort, especially around the idea that my syncing process might somehow drop records. I’m less worried that two corresponding records would contain different values).
A few more probably irrelevant details: The postgres production system is on a very space-constrained machine, and can only hold about 6 months of data. The sqlite database is on a very slow machine with tons of disk space, and holds the total archive going back several years. Because I occasionally share the entire archive with other researchers, sqlite is a convenient container, but I don’t index because I want the files to be small when I back them up and ship them around. The only work I do on the archival dataset is add records and verify that the data has been properly backed up, as described above. Copying the archive locally (to, say, make a replica and strip out an index before compressing) is a very slow operation on the archive machine.