Skip to content

Anees0711/Real-Time-Data-Lakehouse-Streaming-ML-Platform

Repository files navigation

End-to-End Real-Time Transport Lakehouse and Machine Learning Platform

Project Overview

This project demonstrates a complete real-time data platform that collects live transport data, processes it automatically, stores it in a structured data lake, and trains machine learning models to generate insights.

It is designed to simulate how modern companies build reliable data systems at scale. The platform combines real-time streaming, cloud storage, automated pipelines, and machine learning into one production-style architecture. The system receives live data events, processes them continuously, organizes them into clean datasets, and prepares them for analytics and machine learning. Automated workflows ensure that data is always updated and reliable.

The platform also includes a machine learning pipeline that retrains models automatically and serves predictions through an API.

This project is built to demonstrate real-world data engineering and MLOps practices used in modern technology teams.


Python Kafka Spark Airflow AWS S3 Snowflake BigQuery FastAPI Docker Terraform

How the Platform Works

  1. Live events are streamed into the system
  2. Spark processes the data in real time
  3. Data is stored in a cloud lakehouse (raw → cleaned → analytics-ready)
  4. Automated workflows prepare datasets
  5. Data is loaded into analytics warehouses
  6. Machine learning models are trained automatically
  7. Predictions are served through an API

The architecture supports scalability, monitoring, and production reliability.


Key Capabilities

  • Real-time data ingestion and processing
  • Automated data pipelines
  • Structured lakehouse architecture
  • Cloud storage and analytics integration
  • Data quality validation
  • Machine learning retraining workflows
  • Model versioning and tracking
  • API-based prediction service
  • Containerized deployment
  • Infrastructure automation
  • CI/CD-ready repository

Technology Stack

Streaming and Processing

  • Apache Kafka
  • Apache Spark

Storage and Analytics

  • AWS S3 lakehouse
  • Snowflake
  • BigQuery
  • Parquet data format

Orchestration

  • Apache Airflow

Machine Learning

  • Python
  • Feature engineering pipeline
  • Automated model training

Serving

  • FastAPI prediction API

Infrastructure

  • Docker
  • Kubernetes
  • Terraform

Data Quality

  • Great Expectations

Running the Project

Start the system:

docker compose up --build

Access services:

Airflow interface: http://localhost:8090

Prediction API: http://localhost:8000


Machine Learning Pipeline

The ML pipeline automatically:

  • loads cleaned data
  • builds features
  • trains models
  • evaluates performance
  • saves model artifacts
  • tracks metrics

This ensures models stay updated without manual intervention.


Project Structure

src/ streaming/ loaders/ ml/ airflow/ infrastructure/ docs/ tests/

The structure follows production standards used in real data engineering teams.


Purpose of the Project

This platform demonstrates:

  • real-time data engineering
  • scalable cloud architecture
  • automated analytics workflows
  • production-ready machine learning pipelines
  • modern MLOps practices

It is designed as a portfolio project to showcase practical skills in building large-scale data systems.


Author

Built as a professional data engineering portfolio project demonstrating real-time architecture and machine learning systems.

Anees Ahmad Abbasi

About

Real-time data lakehouse + production ML pipeline using Kafka, Spark, Airflow, AWS S3, Snowflake, BigQuery, FastAPI, Docker, Terraform & Kubernetes

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors