MLOps Engineer - CI/CD for ML Models
Position Overview
We are seeking an MLOps Engineer to build and maintain CI/CD pipelines for machine learning models and scripts. This role bridges the gap between data science and production engineering, ensuring ML models are deployed reliably, monitored effectively, and updated seamlessly in production environments.
Key Responsibilities
- Build and deploy ML applications on Databricks (end-to-end)
- Develop CI/CD pipelines for ML workflows and data pipelines
- Work with Databricks (Delta Lake, notebooks, jobs, workflows)
- Build APIs (Python/FastAPI) to serve ML models
- Containerize and deploy applications using Docker & Kubernetes
- Implement monitoring, logging, and model performance tracking
- Collaborate with data scientists to productionize models
Required Qualifications
Technical Skills
Programming & Scripting:
- Python (advanced) - Primary language for ML and automation
- Bash/Shell scripting for automation
- YAML for configuration management
- Understanding of software engineering best practices
CI/CD Tools:
- GitHub Actions, GitLab CI/CD, or Jenkins - Building automated pipelines
- Experience with pipeline-as-code concepts
- Automated testing frameworks (pytest, unittest)
Containerization & Orchestration:
- Docker - Container creation and management (required)
- Kubernetes - Container orchestration (intermediate level)
- Docker Compose for local development
- Container registries (Docker Hub, ECR, ACR, GCR/Artifact Registry)
Cloud Platforms:
- Experience with AWS, Azure, or GCP (at least one)
- Cloud ML services (SageMaker, Azure ML, Vertex AI)
- Cloud storage (S3, Blob Storage, GCS)
- Compute services (EC2, Azure Virtual Machines, Compute Engine, Cloud Run)
MLOps Tools:
- MLflow - Experiment tracking and model registry
- DVC (Data Version Control) - Data and model versioning
- Weights & Biases, Neptune.ai, or similar (nice to have)
Infrastructure as Code:
- Terraform, CloudFormation, or ARM templates
- Experience managing infrastructure through code
- Understanding of state management
Version Control:
- Git (advanced) - Branching strategies, merge workflows
- GitHub/GitLab/Bitbucket repository management
ML Knowledge
Understanding of ML Workflows:
- Familiarity with ML model training and inference
- Understanding of model formats (pickle, ONNX, SavedModel, TorchScript)
- Knowledge of ML frameworks (scikit-learn, TensorFlow, PyTorch) - you will not be required to build models, but you must understand how these frameworks work
- Awareness of ML lifecycle (training, validation, deployment, monitoring)
Model Serving:
- FastAPI or Flask - Building REST APIs for model serving
- TensorFlow Serving, TorchServe, or ONNX Runtime (nice to have)
- Understanding of model optimization (quantization, pruning)
Monitoring & Observability
Monitoring Tools:
- Prometheus & Grafana - Metrics and dashboards
- ELK Stack (Elasticsearch, Logstash, Kibana) or similar for logging
- Cloud monitoring (CloudWatch, Azure Monitor, Google Cloud Monitoring, formerly Stackdriver)
ML-Specific Monitoring:
- Model drift detection (Evidently AI, Arize, WhyLabs)
- Data quality monitoring
- Performance metrics tracking
DevOps & Software Engineering
Best Practices:
- Agile/Scrum methodologies
- Code review processes
- Documentation standards
- Security best practices for ML systems
Testing:
- Unit testing, integration testing
- Test-driven development (TDD) concepts
- Data validation and schema testing
Experience Requirements
- At least 3 years (ideally 5 or more) in DevOps, MLOps, or software engineering
- At least 1 year (ideally 2 or more) working specifically on ML model deployment and CI/CD
- Proven track record of building and maintaining production ML systems
- Experience with cloud platforms and containerization
- Hands-on experience with CI/CD pipeline development