Set up a pipeline to analyze data stored in a web app DB

Background:

  • There is a Ruby web app with a production Postgres DB, hosted in the cloud.
  • I would like to run some machine learning algorithms on the production data in Python and ultimately deploy the resulting model to production in the cloud.
  • I only know how to run these algorithms locally on, say, a NumPy array that fits in memory, with the training data fixed.
  • Let us say the dataset of interest would ultimately be too large to fit in memory, so the data would need to be accessed in batches (a rough sketch of what I mean follows this list).
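
To make that last point concrete, here is a minimal sketch of what I mean by batch-wise access, assuming the data has already been exported to a local .npy file (the file name and batch size are placeholders):

    import numpy as np

    def iter_batches(path, batch_size=10_000):
        """Yield successive row batches without loading the whole file into RAM."""
        data = np.load(path, mmap_mode="r")                  # memory-mapped, stays on disk
        for start in range(0, data.shape[0], batch_size):
            yield np.array(data[start:start + batch_size])   # copy one batch into memory

    for batch in iter_batches("training_data.npy"):
        pass  # hand each batch to an algorithm that can learn incrementally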

My general question is:

What is a good way to go about setting up the pipeline to run the algorithms on the production data?

To be more specific, here is my current reasoning, which may or may not make sense, together with more concrete questions:

  • Since the algorithms will need to pass over the data repeatedly, read speed will be important. We cannot afford to access the data over the network, and we cannot keep querying the web app's production DB anyway. What is the best way to store the data and make it available to the machine learning algorithms? Copy everything into another relational DB that the Python code can access locally? (A sketch of how batch reads from such a copy might look follows below.)
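
For illustration, this is roughly how I picture reading batches from such a local copy with psycopg2 and a server-side cursor; the connection string, table, and column names are placeholders:

    import numpy as np
    import psycopg2

    conn = psycopg2.connect("dbname=analytics_copy user=ml")
    cur = conn.cursor(name="batch_reader")     # named cursor => server-side, streams rows
    cur.itersize = 10_000                      # rows fetched per round trip
    cur.execute("SELECT feature_1, feature_2, label FROM training_rows")

    while True:
        rows = cur.fetchmany(10_000)
        if not rows:
            break
        batch = np.array(rows, dtype=float)    # one manageable chunk in memory
        # ... feed `batch` to the learning code ...

    cur.close()
    conn.close()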

  • Finding the right model is probably easiest if done locally on a sample of the data that fits in memory. Once a good candidate is found, we can retrain it on all the data we have. Should we do this second step locally as well, or should you generally try to set up a complete production pipeline that lets you work with the full amount of data at this stage already? (A sketch of the two-step idea follows below.)
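
Here is a rough sketch of how I imagine the two steps, assuming scikit-learn; SGDClassifier is only one example of an estimator that can learn batch by batch via partial_fit, and the file names are placeholders:

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    # Step 1: model selection on a sample that fits in memory.
    X_sample, y_sample = np.load("X_sample.npy"), np.load("y_sample.npy")
    search = GridSearchCV(SGDClassifier(), {"alpha": [1e-4, 1e-3, 1e-2]}, cv=5)
    search.fit(X_sample, y_sample)

    # Step 2: retrain the chosen configuration on all of the data, one batch at a time.
    def iter_xy_batches(batch_size=10_000):
        """Hypothetical generator yielding (X, y) batch pairs for the full dataset."""
        X = np.load("X_full.npy", mmap_mode="r")
        y = np.load("y_full.npy", mmap_mode="r")
        for start in range(0, X.shape[0], batch_size):
            yield (np.array(X[start:start + batch_size]),
                   np.array(y[start:start + batch_size]))

    model = SGDClassifier(**search.best_params_)
    classes = np.unique(y_sample)              # partial_fit needs the full label set up front
    for X_batch, y_batch in iter_xy_batches():
        model.partial_fit(X_batch, y_batch, classes=classes)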

  • Let us say new data is being written regularly. If you do the initial training by visiting batches of the data you have at time 0 and then stop, do you have to retrain from scratch on all of the data available at some later time t? Is this retraining something that is reasonable to automate in production? (A sketch of what I mean follows below.)
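
To make the question concrete, here is one way I could imagine folding new rows into an already-trained model instead of retraining from scratch, assuming an estimator with partial_fit and a created_at column on the table; all names here are placeholders:

    import numpy as np
    import psycopg2

    def update_model(model, conn, last_trained_at, batch_size=10_000):
        """Fold rows written after `last_trained_at` into an already-trained model."""
        cur = conn.cursor(name="new_rows")     # server-side cursor for large result sets
        cur.execute(
            "SELECT feature_1, feature_2, label FROM training_rows WHERE created_at > %s",
            (last_trained_at,),
        )
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            batch = np.array(rows, dtype=float)
            model.partial_fit(batch[:, :-1], batch[:, -1])   # feature columns vs. label column
        cur.close()
        return model

Whether something like this, run on a schedule, is preferable to periodic full retraining is exactly the part I am unsure about.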

General hints and sources that help with these kinds of questions are appreciated.