This project demonstrates a complete real-time data platform that collects live transport data, processes it automatically, stores it in a structured data lake, and trains machine learning models to generate insights.
It simulates how companies build reliable data systems at scale, combining real-time streaming, cloud storage, automated pipelines, and machine learning in one production-style architecture. Live events are received and processed continuously, organized into clean datasets, and prepared for analytics and machine learning. Automated workflows keep these datasets up to date.
The platform also includes a machine learning pipeline that retrains models automatically and serves predictions through an API.
This project is built to demonstrate real-world data engineering and MLOps practices used in modern technology teams.
- Live events are streamed into the system
- Spark processes the data in real time
- Data is stored in a cloud lakehouse (raw → cleaned → analytics-ready)
- Automated workflows prepare datasets
- Data is loaded into analytics warehouses
- Machine learning models are trained automatically
- Predictions are served through an API
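The raw → cleaned → analytics-ready flow listed above can be sketched in plain Python. This is only an illustration of the layering idea, not the project's Spark code: the event fields (`vehicle_id`, `route`, `delay_seconds`) and the function names `clean` and `aggregate` are assumptions made for the example, since this README does not specify the event schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events; the real schema is not specified in this README.
RAW_EVENTS = [
    {"vehicle_id": "bus-1", "route": "A", "delay_seconds": "120"},
    {"vehicle_id": "bus-2", "route": "A", "delay_seconds": "60"},
    {"vehicle_id": "bus-3", "route": "B", "delay_seconds": None},  # missing value
    {"vehicle_id": "bus-4", "route": "B", "delay_seconds": "30"},
]


def clean(events):
    """Raw -> cleaned: drop records with a missing delay and cast it to int."""
    cleaned = []
    for event in events:
        if event.get("delay_seconds") is None:
            continue
        cleaned.append({**event, "delay_seconds": int(event["delay_seconds"])})
    return cleaned


def aggregate(cleaned):
    """Cleaned -> analytics-ready: average delay per route."""
    delays = defaultdict(list)
    for event in cleaned:
        delays[event["route"]].append(event["delay_seconds"])
    return {route: mean(values) for route, values in delays.items()}


cleaned_events = clean(RAW_EVENTS)
route_delays = aggregate(cleaned_events)
print(route_delays)
```

Keeping each layer as a separate function mirrors the idea that every stage writes its own dataset, so a later stage can be rerun without re-ingesting the raw events.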
The architecture supports scalability, monitoring, and production reliability.
- Real-time data ingestion and processing
- Automated data pipelines
- Structured lakehouse architecture
- Cloud storage and analytics integration
- Data quality validation
- Machine learning retraining workflows
- Model versioning and tracking
- API-based prediction service
- Containerized deployment
- Infrastructure automation
- CI/CD-ready repository
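To illustrate the data quality validation item above, the sketch below runs a few rule-based checks over a batch of records. It is written in plain Python rather than against any particular validation library, and the field names and the six-hour range rule are assumptions chosen for the example.

```python
def validate_batch(records, required_fields=("vehicle_id", "route", "delay_seconds")):
    """Return a list of problems found; an empty list means the batch passed."""
    problems = []
    if not records:
        problems.append("batch is empty")
    for index, record in enumerate(records):
        # Completeness rule: every required field must be present and not None.
        for field in required_fields:
            if record.get(field) is None:
                problems.append(f"record {index}: missing {field}")
        # Hypothetical range rule: delays are expected within +/- 6 hours.
        delay = record.get("delay_seconds")
        if isinstance(delay, (int, float)) and abs(delay) > 6 * 3600:
            problems.append(f"record {index}: delay_seconds out of range ({delay})")
    return problems


good = [{"vehicle_id": "bus-1", "route": "A", "delay_seconds": 120}]
bad = [{"vehicle_id": "bus-2", "route": None, "delay_seconds": 999999}]

print(validate_batch(good))
print(validate_batch(bad))
```

Returning the list of problems, instead of raising on the first one, lets a workflow decide whether to fail the run or only log a warning.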
- Apache Kafka
- Apache Spark
- AWS S3 lakehouse
- Snowflake
- BigQuery
- Parquet data format
- Apache Airflow
- Python
- Feature engineering pipeline
- Automated model training
- FastAPI prediction API
- Docker
- Kubernetes
- Terraform
- Great Expectations
Start the system:
docker compose up --build
Access services:
Airflow interface: http://localhost:8090
Prediction API: http://localhost:8000
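Containers can take a little while to become ready after startup, so a small reachability check can be handy. The sketch below only tests whether something answers HTTP at the two URLs above; it assumes nothing about specific routes, and the helper names `is_up` and `check_services` are made up for this example.

```python
import http.client
import urllib.error
import urllib.request

# URLs taken from the service list above.
SERVICES = {
    "airflow": "http://localhost:8090",
    "prediction_api": "http://localhost:8000",
}


def is_up(url, timeout=2.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just with an error status (for example 404 on "/").
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, malformed reply, or invalid URL.
        return False


def check_services(services=SERVICES):
    """Map each service name to whether it is currently reachable."""
    return {name: is_up(url) for name, url in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```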
The ML pipeline automatically:
- loads cleaned data
- builds features
- trains models
- evaluates performance
- saves model artifacts
- tracks metrics
This ensures models stay updated without manual intervention.
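The steps above can be sketched end to end with the standard library. This README does not state which model the project trains, so the example substitutes a deliberately simple baseline (mean delay per route). The column names, the `model.json` and `metrics.json` file names, and all function names are assumptions for illustration.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path
from statistics import mean


def build_features(rows):
    """Turn cleaned rows into (feature, target) pairs; the only feature here is the route."""
    return [(row["route"], row["delay_seconds"]) for row in rows]


def train(pairs):
    """Fit the stand-in baseline: mean observed delay per route."""
    by_route = defaultdict(list)
    for route, delay in pairs:
        by_route[route].append(delay)
    return {route: mean(delays) for route, delays in by_route.items()}


def evaluate(model, pairs):
    """Mean absolute error; unseen routes fall back to the mean of the per-route means."""
    fallback = mean(model.values())
    errors = [abs(model.get(route, fallback) - delay) for route, delay in pairs]
    return mean(errors)


def run_pipeline(train_rows, test_rows, output_dir):
    """Build features, train, evaluate, then save the artifact and its metrics."""
    model = train(build_features(train_rows))
    mae = evaluate(model, build_features(test_rows))
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "model.json").write_text(json.dumps(model))
    (output_dir / "metrics.json").write_text(json.dumps({"mae": mae}))
    return model, mae


# Hypothetical cleaned data, split into training and held-out rows.
train_rows = [
    {"route": "A", "delay_seconds": 100},
    {"route": "A", "delay_seconds": 140},
    {"route": "B", "delay_seconds": 30},
]
test_rows = [
    {"route": "A", "delay_seconds": 110},
    {"route": "B", "delay_seconds": 40},
]

with tempfile.TemporaryDirectory() as tmp:
    model, mae = run_pipeline(train_rows, test_rows, tmp)
    saved_metrics = json.loads((Path(tmp) / "metrics.json").read_text())

print(model, mae)
```

Because `run_pipeline` takes its inputs and output directory as arguments, a scheduler could call it on each run with fresh data and a versioned output path.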
- src/
- streaming/
- loaders/
- ml/
- airflow/
- infrastructure/
- docs/
- tests/
The structure follows production standards used in real data engineering teams.
This platform demonstrates:
- real-time data engineering
- scalable cloud architecture
- automated analytics workflows
- production-ready machine learning pipelines
- modern MLOps practices
It is designed as a portfolio project to showcase practical skills in building large-scale data systems.
Built as a professional data engineering portfolio project demonstrating real-time architecture and machine learning systems.
Anees Ahmad Abbasi