micro1•2h ago
LinkedIn
Site Reliability Engineer
APAC
Contract
Full Job Description
About the Role: Site Reliability Engineer (Contractor)
Location: Remote | Type: Contractor
In this critical role, you will leverage your expertise to train next-generation AI systems. Your work directly shapes how models learn, reason, and perform through high-quality, real-world input. No prior AI industry experience is required; what matters most is your deep domain knowledge.
Key Responsibilities:
- Infrastructure Management: Design, implement, and maintain scalable infrastructure using Linux, Kubernetes, and Prometheus.
- System Health & Performance: Monitor system health, analyze performance metrics in real time, and proactively address bottlenecks or potential failures before they impact users.
- Automation: Automate operational processes to minimize manual intervention and improve system reliability against defined service level objectives (SLOs).
- Incident Response: Respond swiftly to incidents, conduct thorough root cause analysis (RCA), and drive continuous improvements in incident response procedures.
- Cross-Functional Collaboration: Work closely with development and operations teams to deliver seamless deployments and ensure high system availability.
- Documentation & Knowledge Sharing: Create comprehensive documentation and clear runbooks for operational excellence, fostering a culture of knowledge sharing within the team.
- Best Practices: Champion best practices in Site Reliability Engineering (SRE), security, and compliance across the customer's ecosystem.
Required Skills & Qualifications:
- Linux Mastery: Expert-level hands-on experience with Linux system administration, configuration management, and troubleshooting.
- Kubernetes Proficiency: Advanced proficiency in Kubernetes (k8s), including cluster deployment, day-to-day operations, scaling strategies, and management of complex services.
- Prometheus Expertise: Deep knowledge of Prometheus for effective monitoring, metrics collection, visualization, and alerting strategy implementation.
- Scripting & Automation: Strong scripting abilities in Bash, Python, or similar languages to build automation scripts, tooling, and CI/CD pipelines.
- Communication: Excellent written and verbal communication skills, with the ability to document technical concepts clearly and share knowledge effectively with diverse teams.
- SRE Background: Proven track record in Site Reliability Engineering or similar roles within high-availability environments (e.g., cloud-native ecosystems).
- Mindset: Demonstrated commitment to proactive problem-solving, ownership of outcomes, and collaborative teamwork.
Preferred Qualifications:
- Familiarity with other cloud-native tools such as Grafana for visualization, Helm for Kubernetes package management, or Istio for service mesh management.
- Certifications in Kubernetes (CKA/CKS), Linux (LPIC/RHCE), or major cloud platforms (AWS/Azure/GCP).
- Experience operating within high-growth startups or large-scale production environments handling millions of requests per second.
Company
micro1: The leading AI platform designed to amplify human intelligence by connecting over a billion individuals with their dream roles.