Let us say we have:
- a web app with a Postgres DB that produces data over time,
- a second DB, optimized for analytics, that we would like to populate over time.
My goal is to build and monitor an ETL pipeline that will transform the data and write it to the analytics DB, preferably using a streaming approach.
My idea was to:
- have the web app produce a JSON representation of the relevant data and write it to a message broker of some kind,
- use Airflow to structure and monitor the transformation pipeline.
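As a concrete sketch of the first step, the web app could wrap each relevant change in a JSON envelope and hand it to the broker after the Postgres write commits. Everything below is illustrative: the envelope fields and the `publish` call are assumptions, not part of any particular broker's API.

```python
import json
from datetime import datetime, timezone

def build_event(table: str, row: dict) -> str:
    """Serialize a changed row into a JSON envelope for the ETL pipeline.
    The envelope fields (table, emitted_at, payload) are an assumed schema."""
    return json.dumps({
        "table": table,
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "payload": row,
    })

# In the web app, after a successful write to Postgres:
event = build_event("orders", {"id": 42, "total": 9.99})
# publish("etl-events", event)  # stand-in for your broker client's producer call
```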
The part I am confused about is:
1. How can I have Airflow subscribe to the message broker so that a given pipeline is triggered every time a message is received?
2. (If 1 is possible) does this approach make sense? (In particular, using Airflow in a stream-like fashion rather than in a scheduled, batch fashion.)
3. (If 1 is not possible) I have a few follow-up questions below.
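On question 1: core Airflow does not subscribe to message brokers out of the box (its scheduler is built around time-based DAG runs), but a small consumer process can bridge the gap by calling Airflow 2's stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`) to trigger one DAG run per message. A sketch, where the `consume()` generator is a hypothetical stand-in for your broker client and the URL/auth details are assumptions:

```python
import json
from urllib import request

AIRFLOW_BASE = "http://localhost:8080"  # assumed Airflow webserver address

def build_trigger_request(dag_id: str, message: str) -> request.Request:
    """Build the HTTP request that triggers one DAG run, passing the
    broker message through as the run's conf (auth headers omitted)."""
    body = json.dumps({"conf": {"message": json.loads(message)}}).encode()
    return request.Request(
        url=f"{AIRFLOW_BASE}/api/v1/dags/{dag_id}/dagRuns",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Bridge loop (consume() is a stand-in for your broker's consumer API):
# for message in consume("etl-events"):
#     request.urlopen(build_trigger_request("transform_dag", message))
```

Note that this puts the streaming part outside Airflow: Airflow still only sees discrete, externally triggered DAG runs, which it can monitor and retry as usual.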
Follow-up questions if the architecture above is not feasible:
- Why is it not possible?
- What would be an alternative way to construct a streaming approach that would allow some form of monitoring/management of tasks out of the box?
- Would it be more standard to let Airflow schedule retrieval of the data in small batches, and what are the trade-offs between that approach and a streaming approach?
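On the small-batch alternative: the conventional Airflow pattern is a DAG on a short schedule whose extract task pulls only rows newer than a stored watermark. The trade-off is latency (data arrives at the schedule interval rather than instantly) against much simpler operations and Airflow's built-in retries, backfills, and monitoring. A minimal sketch of the incremental-extract logic, with the table and watermark storage simplified to in-memory stand-ins:

```python
from datetime import datetime

def extract_batch(rows: list[dict], watermark: datetime) -> tuple[list[dict], datetime]:
    """Return rows updated after the watermark, plus the advanced watermark.
    In a real DAG this would be a SELECT ... WHERE updated_at > %s query
    against Postgres, with the watermark persisted between runs."""
    batch = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in batch), default=watermark)
    return batch, new_watermark

rows = [
    {"id": 1, "updated_at": datetime(2023, 1, 1, 10)},
    {"id": 2, "updated_at": datetime(2023, 1, 1, 11)},
]
batch, wm = extract_batch(rows, datetime(2023, 1, 1, 10, 30))
# only row 2 is newer than the watermark, and wm advances to its timestamp
```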
Any feedback on these questions is most welcome.