AI Infrastructure & MLOps
Tags: advanced · mlops · ai-infrastructure · kubernetes · gpu · devops
Operate machine learning models in production with proper CI/CD, monitoring, and cost management for AI workloads
Learning Objectives
- Understand the MLOps lifecycle and how it extends traditional DevOps practices
- Serve machine learning models in production with Kubernetes and GPU scheduling
- Build CI/CD pipelines for model versioning, testing, and deployment
- Monitor model performance in production: latency, throughput, drift, and cost
- Manage the cost of GPU compute as an engineering discipline
Requirements
You are required to build and operate the infrastructure for running AI/ML workloads in production. You do not need to train models - the focus is entirely on operating them:
- MLOps Architecture and Model Registry
- Deploy MLflow as a model registry and experiment tracker:
- Configure a PostgreSQL backend for metadata
- Configure S3 or GCS for artifact storage (model files, datasets)
- Register at least two versions of a pre-trained model (use any public model from Hugging Face)
- Document the MLOps lifecycle:
- Model development → experiment tracking → model registration → staging → production
- How this differs from and integrates with the existing software delivery pipeline
- Define model promotion criteria:
- A model is promoted to production only if: accuracy > baseline, latency p95 < 200ms, no regression on a validation dataset
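The promotion criteria above can be expressed as a plain gate function that the pipeline runs before moving a model to production. This is a minimal sketch: the metric names and the meaning of "no regression" (validation score at least as good as the baseline's) are assumptions for this exercise, not a fixed MLflow API.

```python
def meets_promotion_criteria(candidate: dict, baseline: dict,
                             latency_budget_ms: float = 200.0) -> bool:
    """Return True only if every promotion criterion passes."""
    accuracy_ok = candidate["accuracy"] > baseline["accuracy"]
    latency_ok = candidate["latency_p95_ms"] < latency_budget_ms
    # "No regression" here means the candidate scores at least as well as
    # the baseline on the same held-out validation dataset.
    no_regression = candidate["validation_score"] >= baseline["validation_score"]
    return accuracy_ok and latency_ok and no_regression

baseline = {"accuracy": 0.91, "latency_p95_ms": 180.0, "validation_score": 0.90}
candidate = {"accuracy": 0.93, "latency_p95_ms": 150.0, "validation_score": 0.92}
```

Keeping the gate as a single function makes the promotion decision testable and auditable, independent of whichever CI system invokes it.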
- Model Serving Infrastructure
- Deploy a model serving stack on Kubernetes:
- Option A: BentoML - package and serve a Hugging Face model as an HTTP API
- Option B: Triton Inference Server - serve an ONNX or TensorRT model
- Option C: Seldon Core or KServe - full model serving platform on Kubernetes
- Configure the serving deployment with:
- Readiness and liveness probes
- Resource requests/limits (CPU for non-GPU workloads; document GPU requirements if relevant)
- Horizontal scaling based on inference request queue depth (KEDA)
- Rolling deployment strategy for model updates
- Expose the model via an API Gateway with rate limiting and authentication
- Verify the serving endpoint: send inference requests and validate responses
- GPU Scheduling (Conceptual + Practical where possible)
- If GPU nodes are available (GKE Autopilot with GPU, AWS EC2 g4dn, or local GPU):
- Deploy the NVIDIA Device Plugin for Kubernetes
- Schedule a model inference Pod with a GPU resource request (nvidia.com/gpu: 1)
- Measure inference latency on GPU vs. CPU
- If GPU nodes are not available:
- Document the complete GPU scheduling setup in gpu-infrastructure.md
- Use a CPU-optimized quantized model (GGUF format with llama.cpp or similar) as a substitute
- Document cost comparison: GPU instance cost vs. CPU instance cost for equivalent throughput
- Document how time-slicing or MIG (Multi-Instance GPU) would be configured for shared GPU utilization
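The GPU-vs-CPU cost comparison reduces to cost per 1,000 requests at a given sustained throughput. The sketch below uses illustrative on-demand prices and throughputs (the instance names and numbers are examples, not measurements); substitute your own benchmarks.

```python
def cost_per_1k(instance_cost_per_hour: float, throughput_rps: float) -> float:
    """Cost of serving 1,000 inference requests at full utilization."""
    requests_per_hour = throughput_rps * 3600
    return instance_cost_per_hour / requests_per_hour * 1000

# Illustrative prices and throughputs only -- measure your own.
gpu = cost_per_1k(instance_cost_per_hour=0.526, throughput_rps=200)  # e.g. a g4dn.xlarge
cpu = cost_per_1k(instance_cost_per_hour=0.192, throughput_rps=20)   # e.g. a c5.xlarge
```

The usual finding is that the GPU wins on cost per request only above some break-even throughput; below it, the idle GPU's hourly rate dominates, which is exactly the trade-off the documentation should quantify.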
- CI/CD for Models
- Implement a model deployment pipeline triggered on model registry promotion:
- Pull the new model version from MLflow registry
- Run automated evaluation: inference latency test, accuracy on a held-out dataset
- Canary deploy: route 10% of traffic to the new model, 90% to the current production model
- Monitor for 30 minutes: if no degradation, promote to 100%
- Automatic rollback if p95 latency degrades > 30% or error rate > 1%
- Implement model versioning in the serving layer:
- Multiple model versions can run simultaneously
- Traffic split is configurable without redeployment
- Each version is independently observable
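The rollback rules above are mechanical enough to encode as a single decision function that the pipeline evaluates at the end of the 30-minute canary window. The thresholds mirror the criteria in this section; the return strings are an assumed convention for this sketch.

```python
def canary_verdict(prod_p95_ms: float, canary_p95_ms: float,
                   canary_error_rate: float,
                   max_latency_degradation: float = 0.30,
                   max_error_rate: float = 0.01) -> str:
    """Decide the canary's fate: rollback on >1% errors or >30% p95 degradation."""
    if canary_error_rate > max_error_rate:
        return "rollback: error rate"
    if canary_p95_ms > prod_p95_ms * (1 + max_latency_degradation):
        return "rollback: latency"
    return "promote"
```

Making the verdict a pure function of observed metrics keeps the rollback automatic and removes any judgment call from the deployment path.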
- Model Observability and Cost Management
- Deploy a monitoring stack for model inference:
- Request metrics: requests/second, p50/p95/p99 latency, error rate per model version
- Resource metrics: CPU/GPU utilization, memory usage, batch size distribution
- Business metrics: number of inferences served, cost per 1,000 inferences
- Implement data drift detection:
- Use Evidently AI or WhyLogs to monitor input feature distributions
- Alert when drift exceeds a defined threshold
- Document what drift means for model reliability and when retraining is needed
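Evidently AI and WhyLogs compute drift metrics for you, but the underlying idea is simple enough to sketch. Below is a minimal Population Stability Index (PSI) over one feature, with the common rule-of-thumb thresholds (PSI < 0.1 stable, 0.1–0.25 moderate drift, > 0.25 significant); this is a hand-rolled illustration, not either library's implementation.

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def bin_fractions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Small epsilon avoids log(0) for empty bins.
        return [(c / len(sample)) or 1e-4 for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

An alert fires when the live window's PSI against the training reference exceeds the chosen threshold, which is the signal that retraining should be considered.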
- FinOps for AI:
- Calculate cost per 1,000 inference requests for both CPU and GPU serving
- Implement scale-to-zero for non-critical model endpoints (using KEDA)
- Document the cost-latency trade-off: when is GPU serving worth the cost?
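The saving from scale-to-zero can be estimated with a simple cost model: without it you pay the instance rate around the clock, with it you pay only for active hours (plus cold-start latency, which is the trade-off). The hourly rate and activity pattern below are illustrative assumptions.

```python
def monthly_cost(hourly_rate: float, active_hours_per_day: float,
                 scale_to_zero: bool, days: int = 30) -> float:
    """Monthly serving cost; with scale-to-zero you pay only for active hours."""
    hours = active_hours_per_day * days if scale_to_zero else 24 * days
    return hourly_rate * hours

# Illustrative: an endpoint that is busy ~4 hours per day.
always_on = monthly_cost(0.526, active_hours_per_day=4, scale_to_zero=False)
on_demand = monthly_cost(0.526, active_hours_per_day=4, scale_to_zero=True)
```

For a spiky, non-critical endpoint this is where most of the FinOps saving comes from; for a latency-sensitive one, the cold-start penalty may rule scale-to-zero out, and the report should say so explicitly.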
Stretch Goals
- Implement a full RAG (Retrieval-Augmented Generation) pipeline with a vector database (Qdrant or Weaviate) and measure infrastructure requirements
- Set up a model fine-tuning pipeline that triggers automatically when drift is detected above threshold
- Benchmark quantized models (GGUF/GGML) vs. full-precision serving and document the accuracy-cost trade-off
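For the RAG stretch goal, the core operation a vector database performs is nearest-neighbor search over embeddings. A toy cosine-similarity retrieval over 2-D vectors, shown below, is enough to reason about the infrastructure requirement (memory scales with corpus size × embedding dimension); real deployments use Qdrant or Weaviate with approximate indexes, not this brute-force scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query embedding."""
    ranked = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Brute force is O(corpus size) per query, which is exactly why dedicated vector databases (with HNSW or similar indexes) become an infrastructure component worth measuring.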
Deliverables
- MLflow deployment with model registry, two registered model versions, and promotion criteria documented
- Model serving deployment on Kubernetes with scaling, probes, and API gateway configured
- GPU scheduling documentation (or live GPU deployment if available)
- CI/CD pipeline for model deployment with canary rollout and automatic rollback
- Grafana dashboard for model observability: request metrics, resource metrics, and drift alerts
- FinOps report: cost per 1,000 inferences for CPU vs. GPU, scale-to-zero configuration
References
Books
- Designing Machine Learning Systems - Chip Huyen (essential - the best book on ML in production)
- Machine Learning Engineering - Andriy Burkov
- Practical MLOps - Noah Gift & Alfredo Deza
Courses
- Machine Learning Engineering for Production (MLOps) - Andrew Ng, Coursera
- Full Stack Deep Learning
- Made With ML - Goku Mohandas
Tools and Documentation
- MLflow Documentation
- BentoML Documentation
- Triton Inference Server
- KServe Documentation
- KEDA HTTP Add-on
- Evidently AI - ML Monitoring
- Hugging Face Model Hub
- NVIDIA Device Plugin for Kubernetes
Once you complete this task you will operate AI workloads with the same rigor as any other production service - with versioning, canary deployments, drift detection, and cost accountability. This is what separates a platform engineer who understands AI infrastructure from one who just deploys containers.
Submit Your Solution
Completed this project? Share your solution with the community!
- Push your code to a GitHub repository
- Open an issue on our GitHub repo with your solution link
- Share on X with the hashtag #DevOpsDiary
