AI Research (PhD) → ML Engineering → Cloud Infrastructure for AI
(AWS · Terraform · Observability)
Started in academia studying how systems fail — modeling SSD and HDD reliability at HPC scale, published at ARCS 2024, funded by the EU Horizon 2020 IO-SEA project. That work pushed me toward engineering resilient systems at scale.
Today I bridge research and deployment: building reliable, observable, scalable systems that turn cutting-edge ML into production-grade infrastructure.
The goal: speak fluent ML and fluent AWS — own the full path from research to production.
AI & Machine Learning
Cloud & Infrastructure
Languages, Data & Tooling
Production-grade observability stack on AWS. An order-processing API (FastAPI · ECS Fargate · RDS) with structured JSON logging, 8 custom CloudWatch metrics, Golden-Signal dashboards, tiered SNS alerting, Lambda auto-remediation, FinOps cost monitoring, and AI-powered incident analysis via CrewAI — all infrastructure-as-code with Terraform and shipped through GitHub Actions.
Highlight: three injected failure scenarios (error flood, high latency, CPU spike), each diagnosed from the CloudWatch correlation view and remediated automatically by Lambda, closing the alert loop without human intervention.
FastAPI ECS Fargate RDS CloudWatch Lambda SNS Terraform GitHub Actions CrewAI
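As a rough illustration of the alert-to-remediation loop described above, here is a minimal sketch of a Lambda handler that receives CloudWatch alarm notifications through SNS and picks a remediation. The alarm names, the action names, and the `REMEDIATIONS` table are hypothetical placeholders, not taken from the project. The sketch only reports the decision and does not call any AWS API.

```python
from __future__ import annotations

import json

# Hypothetical mapping from alarm name to remediation action. The project's
# real alarm names and actions are not listed in this README.
REMEDIATIONS = {
    "orders-api-error-flood": "rollback_last_deployment",
    "orders-api-high-latency": "scale_out_service",
    "orders-api-cpu-spike": "restart_tasks",
}


def choose_action(alarm_name: str, state: str) -> str | None:
    """Return the remediation for an alarm in ALARM state, else None."""
    if state != "ALARM":
        return None
    return REMEDIATIONS.get(alarm_name)


def handler(event: dict, context: object = None) -> list[dict]:
    """Lambda entry point for SNS-delivered CloudWatch alarm notifications."""
    results = []
    for record in event.get("Records", []):
        # SNS delivers the CloudWatch alarm payload as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        alarm = message.get("AlarmName", "")
        action = choose_action(alarm, message.get("NewStateValue", ""))
        # A real handler would call AWS APIs at this point (for example via
        # boto3). This sketch only records which action was selected.
        results.append({"alarm": alarm, "action": action})
    return results
```

Keeping the decision in a plain lookup table makes it possible to unit-test the routing without AWS credentials. Alarms that return to `OK` map to no action.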
Production-grade AWS 3-tier architecture: internet-facing ALB → 6 Node.js EC2 instances across 2 AZs with Auto Scaling → isolated data tier. Custom VPC with network segmentation, security-group chaining, and CloudWatch monitoring.
Highlight: automatic scaling and high availability across two Availability Zones.
AWS VPC EC2 Auto Scaling ALB CloudWatch Terraform
ML-driven reliability analysis of SSD and HDD failure in HPC burst buffers. Uses SMART telemetry from ~1M Alibaba SSDs and Backblaze HDDs to predict Mean Time to Failure with Random Forest and LSTM models.
Highlight: 94% prediction accuracy — published at ARCS 2024, funded by EU Horizon 2020 IO-SEA.
Python MongoDB scikit-learn XGBoost LSTM
End-to-end ML application forecasting monthly road-accident occurrences from Munich open traffic data — trained, serialized, and served via a REST API, containerized with Docker and deployed to the cloud.
Highlight: the full ML pipeline in one repo — preprocessing → training → API serving → deployment.
Python Flask Docker scikit-learn
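The pipeline stages above (preprocessing, training, serialization, serving) can be sketched with the standard library alone. This is an assumption-laden stand-in: it uses a seasonal-mean baseline and JSON serialization instead of the project's scikit-learn model, and all function names and the `(year, month, count)` row layout are hypothetical. In the real service a Flask route would wrap a function like `predict`.

```python
from __future__ import annotations

import json
from collections import defaultdict
from statistics import mean


def preprocess(rows: list[tuple[int, int, int]]) -> dict[int, list[int]]:
    """Group (year, month, count) rows into month -> list of counts."""
    by_month = defaultdict(list)
    for _year, month, count in rows:
        by_month[month].append(count)
    return dict(by_month)


def train(by_month: dict[int, list[int]]) -> dict[int, float]:
    """Fit a seasonal-mean baseline: the average count per calendar month."""
    return {month: mean(counts) for month, counts in by_month.items()}


def save_model(model: dict[int, float], path: str) -> None:
    """Serialize the fitted baseline to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh)


def load_model(path: str) -> dict[int, float]:
    """Load the baseline; JSON keys are strings, so convert them back to int."""
    with open(path, encoding="utf-8") as fh:
        return {int(key): value for key, value in json.load(fh).items()}


def predict(model: dict[int, float], month: int) -> float:
    """Return the forecast accident count for a calendar month (1 to 12)."""
    return model[month]
```

A baseline like this is also a useful reference point: a trained model should beat it before it is worth serving.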
- 📜 Pursuing the HashiCorp Terraform Associate certification
- 🏗️ Building production-ready AI services on AWS — containerized, observable, infrastructure-as-code
- 🔍 Going deeper on MLOps: model serving, cost optimization, and distributed training in the cloud
Open to ML Engineering · MLOps · Cloud Infrastructure roles
Last updated June 2026 · View all repositories →


