Job Summary
As a Senior DevOps Engineer, you will lead the design, implementation, and optimization of cloud and hybrid infrastructure that supports high-availability application and machine learning workloads. You will own the DevOps lifecycle — from architecting secure, scalable environments to managing CI/CD pipelines, container orchestration, and real-time monitoring systems.
This is a hands-on, senior individual contributor role. You will work closely with software engineers, machine learning researchers, and security teams to build robust, automated deployment systems and enforce best practices in infrastructure management, security, and reliability.
Key Responsibilities
- Infrastructure Architecture Design: Lead the design and implementation of secure, scalable, and resilient infrastructure across AWS, GCP, and on-premises environments.
- Cloud Infrastructure Management: Deploy, manage, and continuously optimize cloud services (AWS/GCP), ensuring high availability, cost efficiency, and maintainability.
- Security & Access Management: Enforce Zero Trust principles and manage fine-grained IAM roles, access policies, and service identities to maintain secure infrastructure access.
- CI/CD Ownership: Design and maintain robust CI/CD pipelines for backend and machine learning workloads, integrating automated testing, security scanning, and progressive deployment strategies.
- Containerization & Orchestration: Lead containerization efforts using Docker and manage Kubernetes clusters for reliable, scalable service deployment.
- Monitoring & Observability: Set up, maintain, and fine-tune observability stacks using Grafana, Prometheus, Cloud Monitoring, and logging frameworks, enabling proactive issue detection and resolution.
- ML Model Deployment: Collaborate with ML teams to deploy, monitor, and manage machine learning models in production, including rollback, performance tracking, and versioning.
- Infrastructure as Code (IaC): Apply Terraform or similar tools to automate infrastructure provisioning, standardize configurations, and support repeatable, auditable infrastructure changes.
- Cybersecurity & Risk Management: Conduct continuous vulnerability assessments, manage patching and secret rotation, and implement threat mitigation strategies across environments.
- Process & Workflow Optimization: Identify bottlenecks in deployment workflows, build pipelines, and developer environments, and optimize them to improve delivery velocity and reduce failure rates.
- Collaboration & Mentorship: Collaborate closely with cross-functional teams and mentor mid-level and junior DevOps and software engineers on DevOps tooling, practices, and standards.
General Qualifications
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or cloud infrastructure management.
- Demonstrated ability to design and operate production-grade infrastructure at scale.
Experience and Skills
Technical Skills
- Deep experience with AWS and Google Cloud Platform (GCP) infrastructure services.
- Strong knowledge of Infrastructure as Code (IaC) using Terraform, Cloud Deploy, or similar tools.
- Expertise in CI/CD design and operations, including test automation, security scanning, and artifact management.
- Proven experience with Docker and Kubernetes for containerized application deployment and orchestration.
- Advanced skills in monitoring and logging tools such as Grafana, Prometheus, the ELK stack, and Cloud Monitoring.
- Strong scripting skills in Python, Bash, or similar for automation and tooling.
- Experience implementing Zero Trust security models, IAM configurations, and secure service-to-service communication.
- Background in ML model deployment, scaling, and versioning in production environments is highly preferred.
Soft Skills
- Strong problem-solving and incident response capabilities in high-availability systems.
- Excellent communication and cross-functional collaboration skills.
- Ability to drive technical decisions independently while aligning with broader engineering goals.
- Detail-oriented with a strong focus on infrastructure security, stability, and efficiency.
- Eagerness to stay current with evolving cloud and DevOps trends.
Preferred Experience
- AWS and/or GCP cloud engineering certification.
- Hands-on experience deploying and maintaining ML infrastructure and MLOps workflows.
- Experience with DevSecOps practices and integration of security in the software delivery lifecycle.
- Familiarity with GitOps tools and workflows (e.g., ArgoCD, Flux).
Work Environment
- Office-based with flexibility for hybrid or remote work, depending on company policy.
- Collaborative engineering environment with strong DevOps culture and ownership-driven workflows.
Reporting Line
This role reports directly to the Lead Software Engineer.