GitHub - Anees0711/Real-Time-Data-Lakehouse-Streaming-ML-Platform: Real-time data lakehouse + production ML pipeline using Kafka, Spark, Airflow, AWS S3, Snowflake, BigQuery, FastAPI, Docker, Terraform & Kubernetes

End-to-End Real-Time Transport Lakehouse and Machine Learning Platform

Project Overview

This project demonstrates a complete real-time data platform that collects live transport data, processes it automatically, stores it in a structured data lake, and trains machine learning models to generate insights.

It simulates how companies build reliable data systems at scale, combining real-time streaming, cloud storage, automated pipelines, and machine learning in one production-style architecture. Incoming events are processed continuously, organized into clean datasets, and prepared for analytics and machine learning. Automated workflows keep those datasets up to date.

The platform also includes a machine learning pipeline that retrains models automatically and serves predictions through an API.

This project is built to demonstrate real-world data engineering and MLOps practices used in modern technology teams.

How the Platform Works

1. Live events are streamed into the system
2. Spark processes the data in real time
3. Data is stored in a cloud lakehouse (raw → cleaned → analytics-ready)
4. Automated workflows prepare datasets
5. Data is loaded into analytics warehouses
6. Machine learning models are trained automatically
7. Predictions are served through an API

The architecture supports scalability, monitoring, and production reliability.
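
The raw → cleaned → analytics-ready layering can be illustrated with a minimal in-memory sketch. It is not the project's Spark code: the event fields (`route_id`, `delay_seconds`, `ts`) and the function names are assumptions chosen for illustration, since this README does not show the actual schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events as they might arrive from a stream.
# Field names are illustrative assumptions, not the project's schema.
RAW_EVENTS = [
    {"route_id": "R1", "delay_seconds": 120, "ts": "2024-01-01T08:00:00"},
    {"route_id": "R1", "delay_seconds": None, "ts": "2024-01-01T08:01:00"},
    {"route_id": "R2", "delay_seconds": 30, "ts": "2024-01-01T08:00:30"},
    {"route_id": "R1", "delay_seconds": 60, "ts": "2024-01-01T08:02:00"},
    {"route_id": "", "delay_seconds": 45, "ts": "2024-01-01T08:03:00"},
]


def clean(events):
    """Raw -> cleaned: drop records with a missing route or delay."""
    return [e for e in events if e["route_id"] and e["delay_seconds"] is not None]


def aggregate(cleaned):
    """Cleaned -> analytics-ready: average delay per route."""
    by_route = defaultdict(list)
    for e in cleaned:
        by_route[e["route_id"]].append(e["delay_seconds"])
    return {route: mean(delays) for route, delays in sorted(by_route.items())}


cleaned_events = clean(RAW_EVENTS)
avg_delay_by_route = aggregate(cleaned_events)
print(avg_delay_by_route)
```

In the real platform each layer would presumably be a separate storage location written as Parquet, so that a failed downstream step can be rerun from the layer before it.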

Key Capabilities

Real-time data ingestion and processing
Automated data pipelines
Structured lakehouse architecture
Cloud storage and analytics integration
Data quality validation
Machine learning retraining workflows
Model versioning and tracking
API-based prediction service
Containerized deployment
Infrastructure automation
CI/CD-ready repository

Technology Stack

Streaming and Processing

Apache Kafka
Apache Spark

Storage and Analytics

AWS S3 lakehouse
Snowflake
BigQuery
Parquet data format

Orchestration

Apache Airflow

Machine Learning

Python
Feature engineering pipeline
Automated model training

Serving

FastAPI prediction API

Infrastructure

Docker
Kubernetes
Terraform

Data Quality

Great Expectations
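
To give a sense of what data quality validation checks for, here is a small sketch in plain Python. It stands in for the idea only and does not use the Great Expectations API; the field names, the allowed range, and the `validate_records` helper are hypothetical.

```python
def validate_records(records, required_fields, numeric_ranges):
    """Return a list of failure messages; an empty list means the batch passed."""
    failures = []
    for i, rec in enumerate(records):
        # Completeness check: required fields must be present and non-empty.
        for field in required_fields:
            if rec.get(field) in (None, ""):
                failures.append(f"row {i}: missing {field}")
        # Range check: numeric fields must fall inside an inclusive interval.
        for field, (low, high) in numeric_ranges.items():
            value = rec.get(field)
            if value is not None and not (low <= value <= high):
                failures.append(f"row {i}: {field}={value} outside [{low}, {high}]")
    return failures


# Illustrative batch; the schema is an assumption, not the project's.
batch = [
    {"route_id": "R1", "delay_seconds": 120},
    {"route_id": "", "delay_seconds": 30},
    {"route_id": "R2", "delay_seconds": -5},
]
problems = validate_records(
    batch,
    required_fields=["route_id", "delay_seconds"],
    numeric_ranges={"delay_seconds": (0, 86_400)},
)
for p in problems:
    print(p)
```

A pipeline would typically stop or quarantine the batch when the returned list is non-empty, rather than loading it into the warehouse.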

Running the Project

Start the system:

docker compose up --build

Access services:

Airflow interface: http://localhost:8090

Prediction API: http://localhost:8000
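
Once the stack is up, the prediction API can be called from any HTTP client. The sketch below only builds the request so it can be inspected without a running service. The `/predict` path and the payload shape are assumptions, as this README does not document the routes; FastAPI applications usually publish their real routes at `/docs` unless that page has been disabled.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # address given in this README


def build_predict_request(features, path="/predict"):
    """Build (but do not send) a JSON POST request.

    The "/predict" path and the {"features": ...} payload are assumptions;
    check the running service for the actual route and schema.
    """
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        API_BASE + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical feature names, for illustration only.
req = build_predict_request({"route_id": "R1", "hour": 8})
print(req.get_method(), req.full_url)

# To send it once the services are running:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(json.load(resp))
```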

Machine Learning Pipeline

The ML pipeline automatically:

loads cleaned data
builds features
trains models
evaluates performance
saves model artifacts
tracks metrics

This ensures models stay updated without manual intervention.
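
The six steps above can be sketched end to end with the standard library alone. This is a toy stand-in, not the project's training code: the data, the single `hour` feature, the one-variable least-squares model, and the `model.json` artifact name are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean


def load_cleaned_data():
    # Stand-in for reading cleaned records from the lakehouse.
    return [
        {"hour": 6, "delay_seconds": 20},
        {"hour": 7, "delay_seconds": 40},
        {"hour": 8, "delay_seconds": 60},
        {"hour": 9, "delay_seconds": 80},
    ]


def build_features(rows):
    xs = [float(r["hour"]) for r in rows]
    ys = [float(r["delay_seconds"]) for r in rows]
    return xs, ys


def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return {"slope": slope, "intercept": y_bar - slope * x_bar}


def evaluate(model, xs, ys):
    # Mean absolute error on the training data; a real pipeline would
    # evaluate on held-out data instead.
    preds = [model["slope"] * x + model["intercept"] for x in xs]
    return {"mae": mean([abs(p - y) for p, y in zip(preds, ys)])}


def run_pipeline(artifact_dir):
    xs, ys = build_features(load_cleaned_data())
    model = train(xs, ys)
    metrics = evaluate(model, xs, ys)
    # Save the model and its metrics together so each artifact is traceable.
    artifact_path = Path(artifact_dir) / "model.json"
    artifact_path.write_text(json.dumps({"model": model, "metrics": metrics}))
    return artifact_path, model, metrics


with tempfile.TemporaryDirectory() as tmp:
    path, model, metrics = run_pipeline(tmp)
    saved = json.loads(path.read_text())
    print(model, metrics)
```

Storing metrics next to the model is one simple way to compare a retrained model against the previous one before promoting it.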

Project Structure

src/
  streaming/
  loaders/
  ml/
airflow/
infrastructure/
docs/
tests/

The layout keeps streaming jobs, warehouse loaders, ML code, orchestration, and infrastructure in separate directories, a convention common in data engineering repositories.

Purpose of the Project

This platform demonstrates:

real-time data engineering
scalable cloud architecture
automated analytics workflows
production-ready machine learning pipelines
modern MLOps practices

It is designed as a portfolio project to showcase practical skills in building large-scale data systems.

Author

Built as a professional data engineering portfolio project demonstrating real-time architecture and machine learning systems.

Anees Ahmad Abbasi
