Our team is trying to build a predictive maintenance system that looks at a set of events and predicts whether those events match any of a set of known anomalies.
We are in the design phase, and the current system design is as follows:
- Events may originate from multiple sources in an IoT system (such as the cloud platform, edge devices, or any intermediate platforms).
- The events are pushed by the data sources into a message queuing system (we currently use Apache Kafka).
- Each data source has its own queue (Kafka topic).
- From these queues, the data is consumed by multiple inference engines (which are neural networks).
- Depending on its feature set, an inference engine subscribes to multiple Kafka topics and streams data from them to continuously produce inferences.
- The overall architecture follows the single-responsibility principle: every component is separate and runs inside its own Docker container.
To classify a set of events as an anomaly, the events have to occur within the same time window. For example, say three data sources push their respective events into Kafka topics, but for some reason the data is not synchronized. One of the inference engines then pulls the latest entries from each Kafka topic, but the corresponding events in the pulled data do not belong to the same time window (say, 1 hour). That results in invalid predictions due to out-of-sync data.
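To make the failure mode concrete, here is a minimal sketch (plain Python with epoch-second timestamps; the sensor names and helper functions are made up for illustration, not part of our codebase) that buckets each source's latest event into a fixed 1-hour window and checks whether they line up:

```python
WINDOW_S = 3600  # 1-hour time window, in seconds

def window_id(epoch_s: int, window_s: int = WINDOW_S) -> int:
    """Map an epoch-second timestamp to the index of its fixed time window."""
    return epoch_s // window_s

def latest_events_aligned(latest_by_source: dict) -> bool:
    """True only if every source's latest event falls into the same window."""
    windows = {window_id(ts) for ts in latest_by_source.values()}
    return len(windows) == 1

now = 1_700_000_000  # an arbitrary epoch-second timestamp
# sensor_a and sensor_b are current, but sensor_c lags by 3 hours:
latest = {"sensor_a": now, "sensor_b": now - 600, "sensor_c": now - 3 * 3600}
print(latest_events_aligned(latest))  # False -> predictions would be invalid
```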
We need to figure out how to make sure that the data from all three sources is pushed in order, so that when an inference engine requests entries (say, the last 100) from multiple Kafka topics, the corresponding entries in each topic belong to the same time window.
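To state the requirement precisely, the behaviour we are after on the consumer side looks roughly like the following sketch (plain Python, deliberately not tied to the Kafka API; the class and source names are hypothetical): buffer records per event-time window and release a window to the inference engine only once every expected source has contributed to it:

```python
from collections import defaultdict

WINDOW_S = 3600  # 1-hour time window, in seconds

class WindowAligner:
    """Buffers (source, event_time, payload) records and releases a window
    only once every expected source has at least one record in it."""

    def __init__(self, sources, window_s=WINDOW_S):
        self.sources = set(sources)
        self.window_s = window_s
        # window index -> {source -> list of payloads}
        self.buffers = defaultdict(lambda: defaultdict(list))

    def add(self, source, event_time_s, payload):
        """Add one record; return (window, records) if the window is complete."""
        win = event_time_s // self.window_s
        self.buffers[win][source].append(payload)
        if self.sources <= self.buffers[win].keys():  # all sources present?
            return win, dict(self.buffers.pop(win))
        return None

# The engine only ever sees windows in which all three sources are represented:
aligner = WindowAligner(["sensor_a", "sensor_b", "sensor_c"])
t = 1_700_000_000
print(aligner.add("sensor_a", t, {"temp": 71}))        # None (incomplete)
print(aligner.add("sensor_b", t + 60, {"rpm": 1200}))  # None (incomplete)
win, records = aligner.add("sensor_c", t + 120, {"vib": 0.4})
print(sorted(records))  # ['sensor_a', 'sensor_b', 'sensor_c']
```

The sketch keys on event time rather than arrival order, since the sources may push out of sync; whether the guarantee should instead be enforced on the producer side is exactly what we are unsure about.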