Site Reliability Engineer
HighRadius is seeking a highly skilled and adaptable Site Reliability Engineer with 7+ years of experience to join our Cloud Engineering team in Hyderabad, Telangana, India. In this critical role, you will be instrumental in designing and refining our cloud infrastructure, with a strong emphasis on reliability, security, and scalability. You will apply software engineering principles to solve complex operational challenges, ensuring the overall resilience and continuous stability of our systems. This position involves managing live production environments while actively contributing to engineering efforts like automation and system improvements.
Responsibilities:
- Cloud Infrastructure Architecture and Management: Design, build, and maintain resilient cloud infrastructure solutions supporting scalable and reliable application development and deployment. Optimize cloud platforms for high availability, performance, and cost efficiency.
- Enhancing Service Reliability: Lead reliability best practices by establishing and managing robust monitoring and alerting systems to proactively detect and respond to anomalies and performance issues. Utilize SLI, SLO, and SLA concepts to measure and drive reliability improvements. Identify and resolve potential bottlenecks.
- Driving Automation and Efficiency: Contribute to the automation, provisioning, and standardization of infrastructure resources and system configurations. Implement automation for repetitive tasks to significantly reduce operational overhead. Develop Standard Operating Procedures (SOPs) and automate workflows using tools like Rundeck or Jenkins.
- Incident Response and Resolution: Participate in and lead the resolution of major incidents, conduct thorough root cause analyses, and implement permanent solutions. Effectively manage incidents within the production environment using a systematic problem-solving approach.
- Collaboration and Innovation: Work closely with diverse stakeholders and cross-functional teams, including software engineers, to integrate cloud solutions, gather requirements, and execute Proofs of Concept (POCs). Foster strong collaboration and communication. Guide designs and processes with a focus on resilience and minimizing manual effort. Promote the adoption of common tooling and components, and implement software and tools to enhance resilience and automate operations. Embrace new tools and approaches as needed.
Required Skills:
- Cloud Platforms: Demonstrated expertise in at least one major cloud platform (AWS, Azure, or GCP). Extensive experience with containerization (Docker) and orchestration (Kubernetes) technologies.
- Automation & IaC: Proficiency in scripting languages (shell and Python). Experience with configuration management tools (Ansible or Puppet). Exposure to Infrastructure as Code (IaC) tools (Terraform or CloudFormation) is required.
- Monitoring & Observability: Experience setting up and configuring monitoring tools (Prometheus, Grafana, or the ELK stack). Hands-on experience implementing OpenTelemetry for observability. Familiarity with monitoring and logging tools for cloud-based applications.
- Service Reliability Concepts: A strong understanding of SLI, SLO, SLA, and error budgeting.
- Infrastructure Management: Proven proficiency in on-premises hosting and virtualization platforms (VMware, Hyper-V, or KVM). Solid understanding of storage internals (NAS, SAN, EFS, NFS) and protocols (FTP, SFTP, SMTP, NTP, DNS, DHCP). Experience with networking and firewall technologies. Strong hands-on experience with Linux internals and operating systems (RHEL, CentOS, Rocky Linux). Experience with Windows operating systems to support varied environments.
- Soft Skills & Mindset: Excellent communication and interpersonal skills for effective teamwork. A proactive attitude and eagerness to learn and adapt in a dynamic environment. A pragmatic, adaptable mindset, with a willingness to step outside your comfort zone and acquire new skills. Ability to consider the broader system impact of your work. Readiness to act as a change advocate for reliability initiatives.
Preferred Skills:
- Experience with DevOps toolchain elements like Git, Jenkins, Rundeck, ArgoCD, or Crossplane.
- Experience managing databases and data platforms, particularly MySQL and Hadoop.
- Knowledge of cloud cost management and optimization strategies.
- Exposure to generative AI (Gen AI).
- Understanding of cloud security best practices, including data encryption, access controls, and identity management.
- Experience implementing disaster recovery and business continuity plans.
- Familiarity with ITIL (Information Technology Infrastructure Library) processes.
Company
HighRadius
HighRadius is a leading provider of an AI-powered platform designed specifically for the Office of the CFO. Its solution integrates over 180 intelligent agents to streamline and orchestra...