I have a bunch of data that consists of public records from the state government dating back to the early 90s. Along the way, the data organization and attributes have changed significantly. I put together an Excel sheet containing the headers in each year’s file to make sense of it and it came out like this:
As you can see by looking at my checksum column on the left, there are 8 different schemas from 1995 through 2019. You can also see that the data varies quite a bit between them. I've color-coded columns that are logically similar. Sometimes the data is mostly the same but the names of the columns have changed; sometimes columns appear or disappear altogether.
I think it is pretty clear that the best goal here is one table combining all of this information rather than 8 disparate tables, since I want to be able to query across all of it efficiently. Each yearly file contains ~150,000 rows, so the combined table would have around 4 million records. Each file has approximately 55-60 fields.
I’ve been struggling for a few days with how to tackle it. Half of the files were fixed-width text files, not even CSVs, so it took me a long time to properly convert those. The rest are thankfully already CSVs or XLSX. From here, I would like to end up with a table that:
- includes a superset of all available logically distinct columns – meaning that the ID number and ID Nbr columns would map to a single column in the final table, not 2 separate columns
- has no loss of data
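For context on the fixed-width conversion I mentioned: pandas' `read_fwf` ended up being workable for this. A minimal sketch – the field boundaries and header names below are invented, since each year's layout differs:

```python
# Sketch of converting one fixed-width file with pandas.read_fwf.
# Field boundaries and header names are made up for illustration.
import io

import pandas as pd

# Two hypothetical fixed-width records: 9-char ID, 30-char name, 10-char date
raw = (
    "000123456" + "John Q Public".ljust(30) + "1996-03-14\n"
    + "000123457" + "Jane Doe".ljust(30) + "1996-04-02\n"
)

colspecs = [(0, 9), (9, 39), (39, 49)]     # (start, end) offsets per field
names = ["ID Nbr", "Name", "Filed Date"]

# dtype=str keeps leading zeros and stops pandas guessing integer types
df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names, dtype=str)
print(df.to_csv(index=False))
```

In the real files I read from a path instead of a `StringIO`, but the shape of it is the same.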
Additionally, there are other caveats such as:
- random Filler columns (like in dark red) that serve no purpose
- no consistency with naming, presence/absence of data, etc.
- data is heavily denormalized but does not need to be normalized
- there’s a lot of data, 2 GB worth just as CSV/XLS/XLSX files
I basically just want to stack the tables top to bottom into one big table, more or less.
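To make "stack them" concrete, here's roughly what I have in mind in pandas terms: rename each year's headers to a canonical name, then concatenate. All the names below are made up for illustration; the point is that columns a given year lacks just come through as NaN, so nothing is dropped.

```python
import pandas as pd

# Hypothetical per-schema rename map: variant header -> canonical name
rename_map = {
    "ID number": "id_nbr",
    "ID Nbr": "id_nbr",
    "Filer Name": "name",
    "Name": "name",
}

# Stand-ins for two years' files with different schemas
df_1995 = pd.DataFrame({"ID number": ["1"], "Filer Name": ["Acme"]})
df_2005 = pd.DataFrame({"ID Nbr": ["2"], "Name": ["Bolt"], "County": ["Kent"]})

frames = [d.rename(columns=rename_map) for d in (df_1995, df_2005)]

# Union of all columns; rows missing a column get NaN, so no data is lost
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined.columns.tolist())  # superset: ['id_nbr', 'name', 'County']
```

Scaled up, I'd loop this over all 25 files with one rename map per schema variant.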
I’ve considered a few approaches:
- Create a separate table for each year, import the data, and then try to merge all of the tables together
- Create one table that contains a superset of the columns and add data to it appropriately
- Try pre-processing the data as much as possible until I have one large file with 4 million rows that I can convert into a database
I’ve tried importing just the first table into both SQL Server and Access, but ran into problems with their import wizards misparsing the data (e.g. rejecting duplicate column names, typing columns that contain text as integers). In any case, it’s not practical to manually deal with schema issues for each file. My next inclination was to patchwork this together in Excel, which seems the most intuitive, but Excel can’t handle a spreadsheet that large, so that’s a no-go as well.
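For what it's worth, both of those parse problems seem avoidable in a Python pre-processing pass, which is partly why that option appeals to me. A small sketch with invented headers:

```python
import io

import pandas as pd

# A file with a duplicated header and an ID column that looks numeric
raw = "ID Nbr,Filler,Filler,Amount\n000042,x,y,17\n"

# dtype=str stops the importer deciding that '000042' is the integer 42;
# pandas de-duplicates repeated headers as 'Filler' and 'Filler.1'
# instead of rejecting the file outright
df = pd.read_csv(io.StringIO(raw), dtype=str)
print(df.columns.tolist())  # ['ID Nbr', 'Filler', 'Filler.1', 'Amount']
print(df.loc[0, "ID Nbr"])  # '000042' -- leading zeros preserved
```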
The ultimate goal is to have one large (probably multi-GB) SQL file that I can copy to the database server and run, maybe using LOAD DATA INFILE or something of that sort – but with the data all ready to go, since it would be unwieldy to modify afterwards.
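The last step would then just be dumping the combined frame to one big CSV for bulk loading. Something like this is what I'm picturing – the table name and file path are placeholders, and the LOAD DATA INFILE statement is MySQL-specific:

```python
import io

import pandas as pd

# Stand-in for the final combined frame
combined = pd.DataFrame({"id_nbr": ["000042", "000043"], "name": ["Acme", None]})

# One big CSV, ready for server-side bulk loading; \N is MySQL's NULL marker
buf = io.StringIO()
combined.to_csv(buf, index=False, na_rep="\\N")
print(buf.getvalue())

# Then, on the MySQL side (table name and path are placeholders):
# LOAD DATA INFILE '/var/lib/mysql-files/records_all.csv'
# INTO TABLE records
# FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
# IGNORE 1 LINES;
```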
Which approach would be best, and what tools should I be using? Basically, the problem is "standardizing" this data under a uniform schema without losing any data and with as little redundancy as possible. On the one hand, it doesn’t seem practical to go through all 25 tables manually and massage each one’s schema for import. On the other, I’m not sure about designing the final schema up front and then transforming the data to fit it, since I can’t work with all of it at once. Any advice from people who have done this kind of thing before? Much appreciated!