Eric Borba EricBorba

I work where AI meets infrastructure — because great models need great plumbing.

The Stack I'm Building

AI Research (PhD)  →  ML Engineering  →  Cloud Infrastructure for AI
                                         (AWS · Terraform · Observability)

I started in academia studying how systems fail: modeling SSD and HDD reliability at HPC scale, in work published at ARCS 2024 and funded by the EU Horizon 2020 IO-SEA project. That work pushed me toward engineering resilient systems at scale.

Today I bridge research and deployment: building reliable, observable, scalable systems that turn cutting-edge ML into production-grade infrastructure.

The goal: speak fluent ML and fluent AWS — own the full path from research to production.

Tech Stack

AI & Machine Learning

Cloud & Infrastructure

Languages, Data & Tooling

Featured Projects

📊 Instrumented & Monitored Cloud Service

Production-grade observability stack on AWS. An order-processing API (FastAPI · ECS Fargate · RDS) with structured JSON logging, 8 custom CloudWatch metrics, Golden-Signal dashboards, tiered SNS alerting, Lambda auto-remediation, FinOps cost monitoring, and AI-powered incident analysis via CrewAI — all infrastructure-as-code with Terraform and shipped through GitHub Actions.
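As a rough illustration of the structured JSON logging mentioned above, here is a minimal sketch using only the Python standard library. The logger name `orders`, the `context` field, and the sample order values are assumptions made for this example, not taken from the project.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        # `context` is a hypothetical attribute name chosen for this sketch.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Attach a JSON-formatting handler that writes to `stream`."""
    logger = logging.getLogger("orders")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def demo() -> dict:
    """Log one event and parse the emitted line back into a dict."""
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info(
        "order created",
        extra={"context": {"order_id": "o-123", "latency_ms": 42}},
    )
    return json.loads(buffer.getvalue())
```

One JSON object per line keeps each event machine-parseable, so fields such as `latency_ms` can be filtered or turned into metrics downstream without regex parsing.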

Highlight: three injected failure scenarios (error flood, high latency, CPU spike), each diagnosed from the CloudWatch correlation view and remediated automatically by Lambda — closing the alert loop with no human in it.
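A hedged sketch of how an alarm notification might be mapped to a remediation step inside a Lambda handler. The SNS envelope (`Records[].Sns.Message`) and the `AlarmName` / `NewStateValue` fields follow the shape of CloudWatch alarm notifications delivered through SNS; the alarm names, action labels, and the `escalate` fallback are hypothetical and only illustrate the dispatch idea.

```python
import json

# Hypothetical alarm names and action labels. The project description
# names only the three scenarios, not how each one is remediated.
REMEDIATIONS = {
    "error-flood": "restart_service",
    "high-latency": "scale_out",
    "cpu-spike": "scale_out",
}


def choose_action(alarm: dict) -> str:
    """Pick a remediation label for one parsed alarm notification."""
    if alarm.get("NewStateValue") != "ALARM":
        return "none"  # OK or INSUFFICIENT_DATA: nothing to remediate
    # Unknown alarms fall through to a hypothetical "escalate" label.
    return REMEDIATIONS.get(alarm.get("AlarmName", ""), "escalate")


def handler(event: dict, context=None) -> list:
    """Lambda-style entry point: one action label per SNS record."""
    actions = []
    for record in event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        actions.append(choose_action(alarm))
    return actions
```

A real handler would call AWS APIs where this sketch returns labels; keeping the decision in a pure function makes the mapping easy to test without cloud credentials.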

FastAPI ECS Fargate RDS CloudWatch Lambda SNS Terraform GitHub Actions CrewAI

☁️ Three-Tier Architecture on AWS

Production-grade AWS 3-tier architecture: internet-facing ALB → 6 Node.js EC2 instances across 2 AZs with Auto Scaling → isolated data tier. Custom VPC with network segmentation, security-group chaining, and CloudWatch monitoring.
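Security-group chaining can be pictured as each tier accepting traffic only from the tier directly in front of it. The sketch below models that rule in plain Python; the tier names and the `INGRESS` table are assumptions for illustration, not the project's actual Terraform.

```python
# Map each tier to the set of sources allowed to open connections to it.
# Tier names are hypothetical stand-ins for the three security groups.
INGRESS = {
    "alb": {"internet"},  # load balancer is the only internet-facing tier
    "app": {"alb"},       # app instances accept traffic only from the ALB
    "data": {"app"},      # data tier accepts traffic only from the app tier
}


def can_connect(source: str, target: str) -> bool:
    """Return True if `source` may open a connection directly to `target`."""
    return source in INGRESS.get(target, set())


def direct_targets(source: str) -> list:
    """List the tiers that `source` can reach in a single hop."""
    return sorted(t for t, allowed in INGRESS.items() if source in allowed)
```

Because each rule references the previous tier rather than an IP range, the internet can reach only the load balancer, and the data tier stays reachable only through the app tier.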

Highlight: fully automated scaling and high availability across multiple availability zones.

AWS VPC EC2 Auto Scaling ALB CloudWatch Terraform

🔬 Storage Failure Predictor

ML-driven reliability analysis of SSD and HDD failure in HPC burst buffers. Uses SMART telemetry from ~1M Alibaba SSDs and Backblaze HDDs to predict Mean Time to Failure with Random Forest and LSTM models.
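For readers unfamiliar with the target quantity, the classic fleet-level estimate of Mean Time to Failure is total observed operating time divided by the number of failures. The sketch below shows only that arithmetic, with made-up numbers; it is not the project's Random Forest or LSTM model.

```python
def mean_time_to_failure(operating_hours, failures):
    """Fleet estimate: total observed operating hours / observed failures."""
    if failures <= 0:
        raise ValueError("need at least one observed failure")
    return sum(operating_hours) / failures


# Made-up operating hours for four drives, two of which failed.
hours = [1200.0, 800.0, 1500.0, 500.0]
mttf = mean_time_to_failure(hours, failures=2)
```

With 4,000 total drive-hours and two failures, the estimate is 2,000 hours per failure.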

Highlight: 94% prediction accuracy — published at ARCS 2024, funded by EU Horizon 2020 IO-SEA.

Python MongoDB scikit-learn XGBoost LSTM

🚗 Accident Predictor App

End-to-end ML application forecasting monthly road-accident occurrences from Munich open traffic data — trained, serialized, and served via a REST API, containerized with Docker and deployed to the cloud.

Highlight: the full ML pipeline in one repo — preprocessing → training → API serving → deployment.
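The preprocessing → training → serving steps above can be sketched end to end with the standard library. This is a toy stand-in under stated assumptions: a hand-rolled least-squares line replaces scikit-learn, a plain function replaces the Flask endpoint, and the monthly counts are invented for illustration.

```python
import json
import pickle


def preprocess(rows):
    """Drop rows with missing counts; return (month_index, count) pairs."""
    return [(float(m), float(c)) for m, c in rows if c is not None]


def train(pairs):
    """Fit count = slope * month + intercept by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def serve(model_bytes, request_body):
    """Load the serialized model and answer one JSON prediction request."""
    model = pickle.loads(model_bytes)
    month = float(json.loads(request_body)["month"])
    prediction = model["slope"] * month + model["intercept"]
    return json.dumps({"month": month, "prediction": round(prediction, 2)})


# Invented monthly counts; month 3 is missing and gets dropped.
rows = [(1, 10), (2, 12), (3, None), (4, 16), (5, 18)]
model_bytes = pickle.dumps(train(preprocess(rows)))  # "serialized" step
response = serve(model_bytes, '{"month": 6}')
```

Keeping each stage as a separate function mirrors the repo layout described above: the same serialized artifact produced by training is what the serving layer loads.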

Python Flask Docker scikit-learn

Currently

📜 Pursuing the HashiCorp Terraform Associate certification
🏗️ Building production-ready AI services on AWS — containerized, observable, infrastructure-as-code
🔍 Going deeper on MLOps: model serving, cost optimization, and distributed training in the cloud

Open to ML Engineering · MLOps · Cloud Infrastructure roles

Last updated June 2026 · View all repositories →