AI Infrastructure & MLOps
Tags: advanced · mlops · ai-infrastructure · kubernetes · gpu · devops
Operate machine learning models in production with proper CI/CD, monitoring, and cost management for AI workloads
Learning Objectives
- Understand the MLOps lifecycle and how it extends traditional DevOps practices
- Serve machine learning models in production with Kubernetes and GPU scheduling
- Build CI/CD pipelines for model versioning, testing, and deployment
- Monitor model performance in production: latency, throughput, drift, and cost
- Manage the cost of GPU compute as an engineering discipline
Requirements
You are required to build and operate the infrastructure for running AI/ML workloads in production. You do not need to train models - the focus is entirely on operating them:
- MLOps Architecture and Model Registry
- Deploy MLflow as a model registry and experiment tracker:
- Configure a PostgreSQL backend for metadata
- Configure S3 or GCS for artifact storage (model files, datasets)
- Register at least two versions of a pre-trained model (use any public model from Hugging Face)
- Document the MLOps lifecycle:
- Model development → experiment tracking → model registration → staging → production
- How this differs from and integrates with the existing software delivery pipeline
- Define model promotion criteria:
- A model is promoted to production only if: accuracy > baseline, latency p95 < 200ms, no regression on a validation dataset
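The promotion criteria above can be expressed as a plain gate function that the pipeline runs before moving a model to production. This is a minimal sketch: the metric names and the meaning of "no regression" (validation score at least as good as the baseline's) are assumptions for this exercise, not a fixed MLflow API.

```python
def meets_promotion_criteria(candidate: dict, baseline: dict,
                             latency_budget_ms: float = 200.0) -> bool:
    """Return True only if every promotion criterion passes."""
    accuracy_ok = candidate["accuracy"] > baseline["accuracy"]
    latency_ok = candidate["latency_p95_ms"] < latency_budget_ms
    # "No regression" here means the candidate scores at least as well as
    # the baseline on the same held-out validation dataset.
    no_regression = candidate["validation_score"] >= baseline["validation_score"]
    return accuracy_ok and latency_ok and no_regression

baseline = {"accuracy": 0.91, "latency_p95_ms": 180.0, "validation_score": 0.90}
candidate = {"accuracy": 0.93, "latency_p95_ms": 150.0, "validation_score": 0.92}
```

Keeping the gate as a single function makes the promotion decision testable and auditable, independent of whichever CI system invokes it.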
- Model Serving Infrastructure
- Deploy a model serving stack on Kubernetes:
- Option A: BentoML - package and serve a Hugging Face model as an HTTP API
- Option B: Triton Inference Server - serve an ONNX or TensorRT model
- Option C: Seldon Core or KServe - full model serving platform on Kubernetes
- Configure the serving deployment with:
- Readiness and liveness probes
- Resource requests/limits (CPU for non-GPU workloads; document GPU requirements if relevant)
- Horizontal scaling based on inference request queue depth (KEDA)
- Rolling deployment strategy for model updates
- Expose the model via an API Gateway with rate limiting and authentication
- Verify the serving endpoint: send inference requests and validate responses
- GPU Scheduling (Conceptual + Practical where possible)
- If GPU nodes are available (GKE Autopilot with GPU, AWS EC2 g4dn, or local GPU):
- Deploy the NVIDIA Device Plugin for Kubernetes
- Schedule a model inference Pod with a GPU resource request (nvidia.com/gpu: 1)
- Measure inference latency on GPU vs. CPU
- If GPU nodes are not available:
- Document the complete GPU scheduling setup in gpu-infrastructure.md
- Use a CPU-optimized quantized model (GGUF format with llama.cpp or similar) as a substitute
- Document cost comparison: GPU instance cost vs. CPU instance cost for equivalent throughput
- Document how time-slicing or MIG (Multi-Instance GPU) would be configured for shared GPU utilization
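The GPU-vs-CPU cost comparison reduces to cost per 1,000 requests at a given sustained throughput. The sketch below uses illustrative on-demand prices and throughputs (the instance names and numbers are examples, not measurements); substitute your own benchmarks.

```python
def cost_per_1k(instance_cost_per_hour: float, throughput_rps: float) -> float:
    """Cost of serving 1,000 inference requests at full utilization."""
    requests_per_hour = throughput_rps * 3600
    return instance_cost_per_hour / requests_per_hour * 1000

# Illustrative prices and throughputs only -- measure your own.
gpu = cost_per_1k(instance_cost_per_hour=0.526, throughput_rps=200)  # e.g. a g4dn.xlarge
cpu = cost_per_1k(instance_cost_per_hour=0.192, throughput_rps=20)   # e.g. a c5.xlarge
```

The usual finding is that the GPU wins on cost per request only above some break-even throughput; below it, the idle GPU's hourly rate dominates, which is exactly the trade-off the documentation should quantify.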
- CI/CD for Models
- Implement a model deployment pipeline triggered on model registry promotion:
- Pull the new model version from MLflow registry
- Run automated evaluation: inference latency test, accuracy on a held-out dataset
- Canary deploy: route 10% of traffic to the new model, 90% to the current production model
- Monitor for 30 minutes: if no degradation, promote to 100%
- Automatic rollback if p95 latency degrades > 30% or error rate > 1%
- Implement model versioning in the serving layer:
- Multiple model versions can run simultaneously
- Traffic split is configurable without redeployment
- Each version is independently observable
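The rollback rules above are mechanical enough to encode as a single decision function that the pipeline evaluates at the end of the 30-minute canary window. The thresholds mirror the criteria in this section; the return strings are an assumed convention for this sketch.

```python
def canary_verdict(prod_p95_ms: float, canary_p95_ms: float,
                   canary_error_rate: float,
                   max_latency_degradation: float = 0.30,
                   max_error_rate: float = 0.01) -> str:
    """Decide the canary's fate: rollback on >1% errors or >30% p95 degradation."""
    if canary_error_rate > max_error_rate:
        return "rollback: error rate"
    if canary_p95_ms > prod_p95_ms * (1 + max_latency_degradation):
        return "rollback: latency"
    return "promote"
```

Making the verdict a pure function of observed metrics keeps the rollback automatic and removes any judgment call from the deployment path.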
- Model Observability and Cost Management
- Deploy a monitoring stack for model inference:
- Request metrics: requests/second, p50/p95/p99 latency, error rate per model version
- Resource metrics: CPU/GPU utilization, memory usage, batch size distribution
- Business metrics: number of inferences served, cost per 1,000 inferences
- Implement data drift detection:
- Use Evidently AI or WhyLogs to monitor input feature distributions
- Alert when drift exceeds a defined threshold
- Document what drift means for model reliability and when retraining is needed
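Evidently AI and WhyLogs compute drift metrics for you, but the underlying idea is simple enough to sketch. Below is a minimal Population Stability Index (PSI) over one feature, with the common rule-of-thumb thresholds (PSI < 0.1 stable, 0.1–0.25 moderate drift, > 0.25 significant); this is a hand-rolled illustration, not either library's implementation.

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def bin_fractions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Small epsilon avoids log(0) for empty bins.
        return [(c / len(sample)) or 1e-4 for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

An alert fires when the live window's PSI against the training reference exceeds the chosen threshold, which is the signal that retraining should be considered.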
- FinOps for AI:
- Calculate cost per 1,000 inference requests for both CPU and GPU serving
- Implement scale-to-zero for non-critical model endpoints (using KEDA)
- Document the cost-latency trade-off: when is GPU serving worth the cost?
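The saving from scale-to-zero can be estimated with a simple cost model: without it you pay the instance rate around the clock, with it you pay only for active hours (plus cold-start latency, which is the trade-off). The hourly rate and activity pattern below are illustrative assumptions.

```python
def monthly_cost(hourly_rate: float, active_hours_per_day: float,
                 scale_to_zero: bool, days: int = 30) -> float:
    """Monthly serving cost; with scale-to-zero you pay only for active hours."""
    hours = active_hours_per_day * days if scale_to_zero else 24 * days
    return hourly_rate * hours

# Illustrative: an endpoint that is busy ~4 hours per day.
always_on = monthly_cost(0.526, active_hours_per_day=4, scale_to_zero=False)
on_demand = monthly_cost(0.526, active_hours_per_day=4, scale_to_zero=True)
```

For a spiky, non-critical endpoint this is where most of the FinOps saving comes from; for a latency-sensitive one, the cold-start penalty may rule scale-to-zero out, and the report should say so explicitly.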
Stretch Goals
- Implement a full RAG (Retrieval-Augmented Generation) pipeline with a vector database (Qdrant or Weaviate) and measure infrastructure requirements
- Set up a model fine-tuning pipeline that triggers automatically when drift is detected above threshold
- Benchmark quantized models (GGUF/GGML) vs. full-precision serving and document the accuracy-cost trade-off
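For the RAG stretch goal, the core operation a vector database performs is nearest-neighbor search over embeddings. A toy cosine-similarity retrieval over 2-D vectors, shown below, is enough to reason about the infrastructure requirement (memory scales with corpus size × embedding dimension); real deployments use Qdrant or Weaviate with approximate indexes, not this brute-force scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query embedding."""
    ranked = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Brute force is O(corpus size) per query, which is exactly why dedicated vector databases (with HNSW or similar indexes) become an infrastructure component worth measuring.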
Deliverables
- MLflow deployment with model registry, two registered model versions, and promotion criteria documented
- Model serving deployment on Kubernetes with scaling, probes, and API gateway configured
- GPU scheduling documentation (or live GPU deployment if available)
- CI/CD pipeline for model deployment with canary rollout and automatic rollback
- Grafana dashboard for model observability: request metrics, resource metrics, and drift alerts
- FinOps report: cost per 1,000 inferences for CPU vs. GPU, scale-to-zero configuration
References
Books
- Designing Machine Learning Systems - Chip Huyen (essential - the best book on ML in production)
- Machine Learning Engineering - Andriy Burkov
- Practical MLOps - Noah Gift & Alfredo Deza
Courses
- Machine Learning Engineering for Production (MLOps) - Andrew Ng, Coursera
- Full Stack Deep Learning
- Made With ML - Goku Mohandas
Tools and Documentation
- MLflow Documentation
- BentoML Documentation
- Triton Inference Server
- KServe Documentation
- KEDA HTTP Add-on
- Evidently AI - ML Monitoring
- Hugging Face Model Hub
- NVIDIA Device Plugin for Kubernetes
Once you complete this task you will operate AI workloads with the same rigor as any other production service - with versioning, canary deployments, drift detection, and cost accountability. This is what separates a platform engineer who understands AI infrastructure from one who just deploys containers.
Submit Your Solution
Completed this project? Share your solution with the community!
- Push your code to a GitHub repository
- Open an issue on our GitHub repo with your solution link
- Share on X with the hashtag #DevOpsDiary
