Cloud Ops Engineer

Date: 23 Oct 2025
Company: Power International Holding

Job Summary

A Cloud Ops Engineer is responsible for architecting, implementing, and managing highly available, scalable, and secure infrastructure across cloud platforms. A key focus of this role is infrastructure automation, ensuring consistent, repeatable, and efficient provisioning and configuration of environments using Infrastructure as Code (IaC) and other DevOps and AI/ML Ops best practices. The engineer enables seamless continuous integration and delivery (CI/CD), leverages AI/ML-driven monitoring tools for predictive analytics and system health, and collaborates cross-functionally to align infrastructure with agile development needs. This role plays a pivotal part in accelerating deployment velocity, improving system reliability, and maintaining a resilient, production-grade infrastructure across hybrid and multi-cloud platforms.

Job Responsibilities

Design, develop, and maintain automated infrastructure solutions across multi-cloud environments (Azure, GCP) and on-premises systems, ensuring high availability, scalability, and security.

Develop and integrate AI/ML-based automation and AI agents to support infrastructure operations, including real-time monitoring, anomaly detection, auto-remediation, and self-healing capabilities.

Leverage AI-driven agents for incident triage and resolution, automating common support tasks and enabling intelligent decision-making during outages and performance issues.

Automate routine operational tasks and infrastructure workflows using scripting languages (Bash, Python, PowerShell) to reduce manual overhead and improve response times.

Implement infrastructure as code using tools like Terraform, Bicep, Ansible, Puppet, and Chef to provision, configure, and manage infrastructure in a repeatable and efficient manner.

Build and manage CI/CD pipelines using Azure DevOps, GitLab CI/CD, or Jenkins to automate the end-to-end delivery lifecycle for infrastructure and application code.

Deploy and orchestrate containerized workloads using Docker, Kubernetes, and NKP, supporting microservices-based architectures and scalable infrastructure deployments.

Implement AI-based predictive analytics for infrastructure capacity planning, performance tuning, and preemptive fault detection.

Configure and manage cloud networking components including VPNs, firewalls, and load balancers, ensuring secure and optimized connectivity.

Administer identity and access controls (IAM, RBAC) and manage secrets and encryption using Azure Key Vault and GCP KMS, aligning with security and compliance standards.

Monitor infrastructure health and performance using Prometheus, Grafana, and the ELK Stack, and integrate AI-powered observability to enhance root cause analysis and alert accuracy.

Optimize cloud costs using automation and AI-powered insights, including resource tagging, automated scaling, budgeting, and rightsizing recommendations.

Collaborate cross-functionally with development, IT, and security teams to support automated infrastructure provisioning, pipeline integration, and deployment orchestration.

Troubleshoot infrastructure and deployment issues across all environments, utilizing AI agents where possible to automate diagnostics and resolution.

Ensure that all development and testing environments are fully automated, secured, and aligned with production standards to maintain environment consistency.

Maintain detailed documentation of infrastructure designs, automation workflows, AI agent integrations, CI/CD processes, and operational procedures.

Continuously evaluate and integrate emerging tools and technologies in infrastructure automation, AIOps, and DevSecOps to improve performance, reliability, and operational efficiency.

Job Knowledge & Skills

A solid understanding of cloud platforms like Azure and GCP, including how to deploy, manage, and monitor resources.

Familiarity with automation tools and the ability to write scripts that automate repetitive testing tasks.

Awareness of AI/ML-based tools and AI agents used in infrastructure automation, monitoring, and self-healing operations.

Good knowledge of containerization using Docker and orchestration platforms like Kubernetes for managing microservices.

Familiarity with SecOps practices to ensure infrastructure is secure by design.

Understanding of cloud cost management, including tagging, budgeting, and resource rightsizing.

Strong documentation and collaboration skills to work effectively with development, IT, and security teams.

Job Experience

Minimum of 5 years of working experience and 3 years of relevant working experience. Two years of GCC experience is a plus.

Competencies

Agility
Build High-Performing Teams
Cloud Specific Skills L3
Data Center Network Architecture L3
IT Infrastructure and Application Integration L3
LAN Network Security L3
Leadership
Network Security L3
Provide Direction
Quality
Resilience

Education

Bachelor's Degree in Information Technology or Computer Science