
PhonePe•2h ago
InstaHyre
Site Reliability Engineer 3
Bangalore
Full Time
Senior Level
Experience Level: Senior Level
Full Job Description
Site Reliability Engineer 3 - Big Data
Responsibilities:
- Oversee and maintain Linux/Unix environments, managing incremental changes.
- Lead on-call rotations and incident response, including root cause analysis and postmortem processes.
- Design and implement automation for big data infrastructure, covering provisioning, scaling, upgrades, and patching.
- Resolve complex production issues, identify root causes, and implement mitigating strategies.
- Architect and review scalable and reliable system designs.
- Collaborate with teams to optimize overall system performance.
- Enforce security standards across systems and infrastructure.
- Set technical direction, drive standardization, and operate with autonomy.
- Ensure system and service availability, performance, and scalability through proactive monitoring, maintenance, and capacity planning.
- Analyze and respond to system outages, implementing measures to prevent recurrence.
- Develop tools and scripts to automate operational tasks, enhancing efficiency and resilience.
- Monitor and optimize system performance and resource utilization, addressing bottlenecks and implementing best practices.
- Partner with development teams to embed reliability, scalability, and performance best practices in the SDLC.
- Stay abreast of industry technology trends and contribute to internal technology communities.
- Develop and enforce SRE best practices and principles.
- Align cross-functional teams on priorities and deliverables.
- Drive automation initiatives to boost operational efficiency.
Requirements:
- 7+ years of experience managing distributed big data ecosystems.
- Strong Linux expertise, including IP networking, iptables, and IPsec.
- Proficiency in scripting/programming languages such as Perl, Golang, or Python.
- Hands-on experience with the Hadoop stack and related big data tools: HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot.
- Familiarity with open-source configuration management and deployment tools (Puppet, Salt, Chef, Ansible).
- Solid understanding of networking, open-source technologies, and related tools.
- Excellent communication and collaboration skills.
- Experience with DevOps tools: SaltStack, Ansible, Docker, Git.
- Experience with SRE logging and monitoring tools: ELK stack, Grafana, Prometheus, OpenTSDB, OpenTelemetry.
- Experience managing infrastructure on public cloud platforms (AWS, Azure, GCP).
- Experience designing and reviewing system architectures for scalability and reliability.
- Experience with observability tools for visualizing and alerting on system performance.
Company
PhonePe
PhonePe: Revolutionizing Digital Payments in India
PhonePe is dedicated to making digital payments effortless, secure, and universally accessible, aiming to eliminate the need for physical cash and car...
Posted on InstaHyre