Best way to combine many disparate schemas for database table creation?

I have a bunch of data that consists of public records from the state government dating back to the early 90s. Along the way, the data organization and attributes have changed significantly. I put together an Excel sheet containing the headers in each year’s file to make sense of it and it came out like this:

[screenshot of the Excel sheet: one row of headers per year, with logically similar columns color-coded]

As you can see from the checksum column on the left, there are 8 different schemas from 1995 through 2019. You can also see that the data between them can vary quite a bit. I’ve color-coded columns that are logically similar. Sometimes the data is mostly the same but the names of the columns have changed; sometimes there is different data altogether that appears or disappears.

I think it is pretty clear that the goal here should be 1 table combining all of this information rather than 8 disparate tables, since I want to be able to query across all of it efficiently. Each yearly file contains ~150,000 rows, so the combined table would have around 4 million records, and each file has approximately 55-60 fields.

I’ve been struggling for a few days with how to tackle it. Half of the files were fixed-width text files, not even CSVs, so it took me a long time to properly convert those. The rest are thankfully already CSVs or XLSX. From here, I would like to end up with a table that:

  • includes a superset of all available logically distinct columns – meaning that, for example, the ID Number and ID Nbr columns would map to a single column in the final table, not 2 separate columns
  • has no loss of data

Additionally, there are other caveats such as:

  • random Filler columns (like in dark red) that serve no purpose
  • no consistency with naming, presence/absence of data, etc.
  • data is heavily denormalized but does not need to be normalized
  • there’s a lot of data, 2 GB worth just as CSV/XLS/XLSX files

I basically just want to stack the tables top to bottom into one big table, more or less.

I’ve considered a few approaches:

  • Create a separate table for each year, import the data, and then try to merge all of the tables together
  • Create one table that contains a superset of the columns and add data to it appropriately (see the sketch after this list)
  • Try pre-processing the data as much as possible until I have one large file with 4 million rows that I can convert into a database
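
To make the second option concrete, here is roughly what I have in mind, sketched in Python with pandas. The file paths and the COLUMN_MAP entries are placeholders; the real mapping would come out of the header spreadsheet above.

    # Rough sketch of the "one superset table" approach.  File paths and the
    # COLUMN_MAP entries are made up; the real mapping comes from the header sheet.
    import glob
    import pandas as pd

    # Map every header variant to one canonical column name, e.g. both
    # "ID Number" and "ID Nbr" become "id_number"; None means drop the column.
    COLUMN_MAP = {
        "ID Number": "id_number",
        "ID Nbr": "id_number",
        "Filler": None,
        # ... one entry per header variant ...
    }

    frames = []
    for path in sorted(glob.glob("converted/*.csv")):
        df = pd.read_csv(path, dtype=str, low_memory=False)   # keep everything as text for now
        df = df.drop(columns=[c for c, new in COLUMN_MAP.items() if new is None and c in df.columns])
        df = df.rename(columns={c: new for c, new in COLUMN_MAP.items() if new})
        df["source_file"] = path                               # keep provenance so nothing is lost
        frames.append(df)

    # Stack everything; columns missing in a given year simply come out as NULL/NaN.
    combined = pd.concat(frames, ignore_index=True, sort=False)
    combined.to_csv("combined.csv", index=False)               # ready for a bulk load

The appeal of this is that the column mapping lives in one dictionary, so fixing a bad header would be a one-line change rather than re-touching 25 files.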

I’ve tried importing just the first table into both SQL Server and Access, but I’ve run into issues with their inability to parse the data correctly (e.g. duplicate columns, columns of textual data being typed as integers). In any case, it’s not practical to manually deal with schema issues for each file. My next inclination was to patchwork this together in Excel, which seems the most intuitive, but Excel can’t handle a spreadsheet that large (it tops out at roughly a million rows), so that’s a no-go as well.

The ultimate goal is to have one large (probably multi-GB) SQL or CSV file that I can copy to the database server and load, maybe using LOAD DATA INFILE or something of that sort – but with the data all ready to go, since it would be unwieldy to modify afterwards.

Which approach would be best, and what tools should I be using? Basically, the problem is "standardizing" this data under a uniform schema without losing any data and with as little redundancy as possible. It doesn’t seem practical to go through all 25 tables manually and try to import or reshape each one, but I’m also not sure about fixing the schema up front and then modifying the data to fit it, since I can’t work with it all at once. Any advice from people who have done something like this before? Much appreciated!

The PostgreSQL database cannot start after modifying the IP

Here is the sequence of operations:

    [root@freelab ~]# su - postgres
    Last login: Sat Jul  4 10:16:58 EDT 2020 on pts/1
    -bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /postgres -l logfile start
    waiting for server to start.... stopped waiting
    pg_ctl: could not start server
    Examine the log output.
    -bash-4.2$
    -bash-4.2$ date
    Sat Jul  4 10:35:13 EDT 2020

But the log does not seem to have been updated accordingly:

    [root@freelab log]# tail -f postgresql-Sat.log
    2020-07-04 09:23:14.930 EDT [1832] LOG:  background worker "logical replication launcher" (PID 1840) exited with exit code 1
    2020-07-04 09:23:14.930 EDT [1835] LOG:  shutting down
    2020-07-04 09:23:14.943 EDT [1832] LOG:  received SIGHUP, reloading configuration files
    2020-07-04 09:23:14.943 EDT [1832] LOG:  parameter "data_directory" cannot be changed without restarting the server
    2020-07-04 09:23:14.944 EDT [1832] LOG:  configuration file "/postgres/postgresql.conf" contains errors; unaffected changes were applied
    2020-07-04 09:23:15.052 EDT [1832] LOG:  database system is shut down

Efficiently storing and modifying a reorderable data structure in a database

I’m trying to create a list curation web app. One thing that’s important to me is being able to drag-and-drop to reorder items in the list easily. At first I thought I could just store the order of each item, but with the reordering requirement, that would mean renumbering every item with a higher order (further down the list) than the place where the moved item was removed or inserted. So I started looking at data structures that are friendly to reordering, or to both deletion and insertion. I’m looking at binary trees, probably red-black trees or something like that. I feel like, with great effort, I could probably implement the algorithms for manipulating them.
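
To be concrete about the renumbering concern, this is roughly the behaviour I’m trying to avoid (a toy sketch; in a real database each renumbered item would be its own UPDATE):

    # Naive scheme: store an integer position per item.  Moving one item means
    # rewriting the position of every item that shifted, which is the cost
    # I'm trying to avoid for long lists.
    items = [
        {"id": "a", "position": 0},
        {"id": "b", "position": 1},
        {"id": "c", "position": 2},
        {"id": "d", "position": 3},
    ]

    def move(items, item_id, new_index):
        moved = next(i for i in items if i["id"] == item_id)
        items.remove(moved)
        items.insert(new_index, moved)
        # Renumber everything -- in a database this is one UPDATE per shifted row.
        for position, item in enumerate(items):
            item["position"] = position
        return items

    move(items, "d", 1)   # b, c and d all get new positions written back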

So here’s my actual question. All the tree tutorials assume you’re creating these trees in memory, usually through instantiating a simple Node class or whatever. But I’m going to be using a database to persist these structures, right? So one issue is how do I represent this data, which is kind of a broad question, sure. I would like to not have to read the whole tree from the database to update the order of one item in the list. But then again if I have to modify a lot of nodes that are stored as separate documents, that’s going to be expensive too, even if I have an index on the relevant fields, right? Like, every time I need to manipulate a node I need to actually find it in the database first, I think.

Also, I obviously need the content and order of each node on the front end in order to display the list, but do I need the client to know the whole tree structure in order to be able to send updates to the server, or can I just send the item id and its new position in the list or something?
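
On that last point, this is the kind of minimal update message I’m hoping the client could get away with sending (field names are made up):

    # Ideally the client only sends which item moved and where it landed,
    # and the server translates that into whatever the storage scheme needs.
    update = {
        "list_id": "groceries",
        "item_id": "d",
        "new_index": 1,
    }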

I’m sorry if this veers too practical but I thought someone might want to take a crack at it.

What’s the best way to encrypt and store text in a MongoDB database?

I have a "cloud service", which consists of 2 parts:

  • Web application, written in Next.js;
  • MongoDB database (uses MongoDB Atlas).

I allow users to sign in with GitHub and handle authentication using JWT. Users can create & delete text files, which are saved in the database like so:

    {
        "name": string,
        "content": string,
        "owner": number    <-- User ID
    }

I would like to encrypt the content so that I can’t see it in the database. I was thinking about using the Web Crypto API, but I’m not sure how I’m going to store the encryption/decryption key securely.
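
Just to show the shape of what I’m imagining (sketched in Python purely because it’s compact; the real app would use node:crypto or the Web Crypto API), it would be something like this, where key handling is exactly the part I’m unsure about:

    # Minimal sketch of encrypting the file content before it is stored.
    # The key here is generated on the fly purely for illustration; in reality
    # it would have to live somewhere outside the database (env var, KMS, a
    # per-user derived key, ...), which is the part I'm asking about.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()          # random 32-byte key, base64-encoded
    f = Fernet(key)

    ciphertext = f.encrypt(b"contents of the user's text file")
    document = {
        "name": "notes.txt",
        "content": ciphertext.decode(),  # store the ciphertext instead of plain text
        "owner": 42,
    }

    plaintext = f.decrypt(document["content"].encode())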

What’s the best way to handle this case and which encryption algorithm should I use?

Stop UUID injection in MySQL Database

I have a Cordova app that logs users in based on their device’s model+platform+uuid. For example: Pixel 2Android39798721218. The way this works when a user uses a new device is detailed in the following:

  1. User opens the app
  2. App sends uuid code to checking page like: login-uuid?id=(uuid_here)
  3. If the uuid does not exist in the database, the user is directed to a login page with the url: login?uuid=(uuid_here)
  4. User logs in and the uuid is sent to the login backend where it gets stored in a database
  5. When the user opens the app again they are logged in because their uuid is in the database

My question is basically: if someone knows a user’s login details, they can navigate to login?uuid=foo, and then even if the user changes their password the attacker can still log in by navigating to login-uuid?id=foo. Is there any way to mitigate this, or will simply removing all logged-in devices when a user resets their password be enough?
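
To make my own mitigation idea concrete, and to sketch the alternative I’ve been wondering about (issuing a random server-side token instead of trusting the device string), roughly this; the store and function names are invented:

    # Sketch of the two mitigations I can think of.  The in-memory dict stands
    # in for the devices table; everything here is invented for illustration.
    import hashlib
    import secrets

    device_tokens = {}   # token_hash -> user_id

    def register_device(user_id):
        # Instead of trusting the predictable model+platform+uuid string from
        # the app, issue a random token at login and store only its hash.
        token = secrets.token_urlsafe(32)
        device_tokens[hashlib.sha256(token.encode()).hexdigest()] = user_id
        return token   # the app stores this and sends it instead of the uuid

    def login_with_token(token):
        return device_tokens.get(hashlib.sha256(token.encode()).hexdigest())

    def on_password_reset(user_id):
        # My original idea: wipe every stored device entry for the user so
        # that old tokens/uuids stop working after a reset.
        for key in [k for k, v in device_tokens.items() if v == user_id]:
            del device_tokens[key]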

Sign records in a database

I have a table in my database where I store records to be processed later (basically orders that need to be invoiced or something similar, but this is not the important part).

Since this software runs on-premises, pro users control the database and are able to insert records directly into it. They usually do this to make my system process records in an unsupported way. Obviously, this leads to problems that I often need to deal with: inconsistency, invalid domain values, missing fields, etc.

To avoid this problem, I’d like to know what my options are for "signing records", that is, identifying the records generated by my system in a way that others cannot reproduce.

Several approaches come to mind when I think about this problem:

  • Create some undocumented record hash (that can be reverse engineered);
  • Use a digital certificate to sign records (where to store the digital certificate? the system runs offline on-premises);
  • Use some kind of blockchain approach: linking a record with the previous + some proof of work (maybe too hard to implement and error prone).

Are there other approaches I am not considering? If not, among the ones I listed, is there an approach I should stick with or avoid?
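
For reference, this is the kind of thing I mean by a record hash/signature, sketched as an HMAC over the meaningful fields; where the key lives is exactly the open question, and every name here is made up:

    # Sketch of signing a record with an HMAC over its fields.  The weak point
    # is the one I listed above: the key has to live somewhere on-premises
    # where a determined admin could find it.
    import hashlib
    import hmac
    import json

    SECRET_KEY = b"application-held key, NOT stored in the database"

    def sign_record(record: dict) -> str:
        payload = json.dumps(record, sort_keys=True).encode()
        return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()

    def verify_record(record: dict, signature: str) -> bool:
        return hmac.compare_digest(sign_record(record), signature)

    order = {"id": 123, "customer": "ACME", "total": "99.90"}
    stored_signature = sign_record(order)          # saved alongside the row
    assert verify_record(order, stored_signature)  # fails if the row was hand-edited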

What is a suitable file structure for a database if queries are selection (relational algebra) operations only?

A relation R(A, B, C, D) has to be accessed under the query σB=10(R). Out of the following possible file structures, which one should be chosen, and why?

  i. R is a heap file.
  ii. R has a clustered hash index on B.
  iii. R has an unclustered B+ tree index on (A, B).

Do non-PostgreSQL database systems use roughly the same “structure” for communicating with them?

Basically, I have developed a PostgreSQL-based application which "in theory" could have its database software swapped out, but which would probably cause a million headaches if I actually attempted it. I’m trying to determine whether the other SQL database systems (I frankly don’t care about non-SQL ones in the least, because they seem too different for me to bother with them in this life) have the following concepts:

  1. "hostname"
  2. "port"
  3. "username"
  4. "password"
  5. "handle database" (such as "postgres", which must be used to connect when there is no other database or when certain operations are to be done to the actual database)
  6. "database name"

I guess I’m fairly sure already about all the points except for the 5th. The concept of a "handle database" seems like it might be PG-only. If that’s the case, I’m not sure how I should handle it, but I’m awaiting your answers before I make a decision.

I have a good mind to just forget about ever supporting other database systems, but the way my system is structured basically forces me to at least try to "genericize" the communication with the database using functions named "database_" rather than "PostgreSQL_". (Even when the queries sent to these functions would only work on PG…)
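
For what it’s worth, here is the spirit of that genericizing, sketched in Python against two drivers I actually know; the database_connect name and the config keys are my own invention:

    # Sketch of a generic "database_connect" that maps the six concepts above
    # onto two concrete drivers.  Concepts 1-4 and 6 translate directly; only
    # the "handle database" (5) looks PostgreSQL-specific -- with MySQL you can
    # simply connect without selecting a database at all.
    def database_connect(cfg, flavor="postgresql"):
        if flavor == "postgresql":
            import psycopg2
            return psycopg2.connect(
                host=cfg["hostname"], port=cfg["port"],
                user=cfg["username"], password=cfg["password"],
                dbname=cfg.get("database", "postgres"),  # "postgres" as the handle database
            )
        elif flavor == "mysql":
            import mysql.connector
            kwargs = dict(host=cfg["hostname"], port=cfg["port"],
                          user=cfg["username"], password=cfg["password"])
            if cfg.get("database"):
                kwargs["database"] = cfg["database"]     # optional: no handle database needed
            return mysql.connector.connect(**kwargs)
        raise ValueError(f"unsupported flavor: {flavor}")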