
Site Reliability Engineer
Experience Level: Mid Level
Full Job Description
About the Role: Site Reliability Engineer at ValueFirst
ValueFirst is seeking a Site Reliability Engineer (SRE) to join our team in Gurugram, India. As an SRE, you will be responsible for the reliability, scalability, and performance of our telecom and CPaaS platforms. The position blends software engineering principles with systems operations: you will architect and implement resilient, observable, and automated infrastructure that supports our high-throughput messaging services in a 24/7 environment. You will collaborate closely with our Engineering, Customer Experience (CX), and Product teams to uphold carrier-grade service reliability.
Key Responsibilities:
- Guarantee the high availability, peak performance, and steadfast reliability of CPaaS production systems deployed across multiple cloud and data center locations.
- Take ownership of and continuously improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for our messaging platforms and associated services.
- Proactively monitor system health, latency, transactions per second (TPS), error rates, and delivery metrics utilizing advanced observability tools.
- Participate in on-call rotations and handle production incidents efficiently, with an emphasis on rapid recovery and thorough root cause analysis.
- Deploy, configure, and optimize systems for high-throughput messaging across various channels.
- Troubleshoot complex telecom-specific issues, including Delivery Report (DLR) failures, encoding problems, TPS fluctuations, and routing anomalies.
- Engage directly with diverse teams for integrations, rigorous testing, and swift incident resolution.
- Perform in-depth packet-level analysis using tools such as tcpdump and Wireshark to diagnose intricate network and protocol-level challenges.
- Develop and maintain efficient shell scripts and automation to eliminate repetitive operational tasks and minimize manual intervention.
- Contribute to infrastructure automation initiatives leveraging tools like Ansible and CI/CD pipelines where applicable.
- Enhance deployment, configuration, and rollback processes for all messaging services.
- Design and refine monitoring, alerting, and dashboarding solutions using industry-leading tools like Datadog, Site24x7, ELK, and Grafana.
- Administer and troubleshoot Linux-based servers in production environments.
- Manage and optimize MySQL and MongoDB databases, focusing on performance tuning, robust backups, and reliable recovery strategies.
- Work extensively with APIs and webhooks across our products and services, with a focus on enhancing and troubleshooting them.
- Maintain and optimize web and application servers including Apache, Nginx, and JBoss (WildFly).
- Support cloud-based and virtualized environments, gaining exposure to auto-scaling and containerization concepts.
- Collaborate with engineering teams on release planning, production deployments, and comprehensive post-release validation.
- Lead or contribute significantly to incident response and Root Cause Analysis (RCA), with a strategic focus on long-term reliability improvements.
- Track issues, changes, and reliability-focused tasks using Jira and related project management tools.
Qualifications:
- Bachelor of Technology (B.Tech) or Bachelor of Engineering (B.E.) in Computer Science or a related field, coupled with 2-3 years of hands-on experience in SRE, DevOps, telecom, or CPaaS operations.
- Practical, hands-on experience with SMS gateways and messaging workflows.
- A strong foundational understanding of Linux systems, core networking fundamentals, and production troubleshooting techniques.
- Demonstrable experience with MySQL and MongoDB administration, query optimization, and performance tuning.
- Proficiency in shell scripting with a proactive mindset geared towards automation and reliability engineering.
- Experience using tcpdump and Wireshark for protocol-level troubleshooting.
- Familiarity with monitoring, logging, and alerting systems such as Datadog, ELK, Grafana, and Site24x7.
- Working knowledge of configuration management tools like Ansible and version control systems like Git.
- Experience with cloud platforms, virtualization, auto-scaling, and containerization concepts.
- Excellent incident management capabilities, strong analytical thinking, and effective communication skills.
- Relevant certifications such as RHCE, AWS certifications, or other SRE-related credentials are a significant advantage.