micro1
2h ago
LinkedIn

Site Reliability Engineer

APAC
Contract

Full Job Description

About the Role: Site Reliability Engineer (Contractor)

Location: Remote | Type: Contractor

In this critical role, you will leverage your expertise to train next-generation AI systems. Your work directly shapes how models learn, reason, and perform through high-quality, real-world input. No prior AI industry experience is required; what matters most is your deep domain knowledge.

Key Responsibilities:

  • Infrastructure Management: Design, implement, and maintain scalable infrastructure using Linux, Kubernetes, and Prometheus.
  • System Health & Performance: Monitor system health, analyze performance metrics in real time, and address bottlenecks or potential failures before they impact users.
  • Automation: Automate operational processes to minimize manual intervention and improve system reliability against defined service level objectives (SLOs).
  • Incident Response: Respond swiftly to incidents, conduct thorough root cause analysis (RCA), and drive continuous improvements in incident response procedures.
  • Cross-Functional Collaboration: Work closely with development and operations teams to deliver seamless deployments and ensure high system availability.
  • Documentation & Knowledge Sharing: Create comprehensive documentation and clear runbooks for operational excellence, fostering a culture of knowledge sharing within the team.
  • Best Practices: Champion best practices in Site Reliability Engineering (SRE), security, and compliance across the customer's ecosystem.

Required Skills & Qualifications:

  • Linux Mastery: Expert-level hands-on experience with Linux system administration, configuration management, and troubleshooting.
  • Kubernetes Proficiency: Advanced proficiency in Kubernetes (k8s), including cluster deployment, day-to-day operations, scaling strategies, and management of complex services.
  • Prometheus Expertise: Deep knowledge of Prometheus for effective monitoring, metrics collection, visualization, and alerting strategy implementation.
  • Scripting & Automation: Strong scripting abilities in Bash, Python, or similar languages to build automation scripts, tooling, and CI/CD pipelines.
  • Communication: Excellent written and verbal communication skills, with the ability to document technical concepts clearly and share knowledge effectively with diverse teams.
  • SRE Background: Proven track record in Site Reliability Engineering or similar roles within high-availability environments (e.g., cloud-native ecosystems).
  • Mindset: Demonstrated commitment to proactive problem-solving, ownership of outcomes, and collaborative teamwork.

Preferred Qualifications:

  • Familiarity with other cloud-native tools such as Grafana for visualization, Helm for Kubernetes package management, or Istio for service mesh management.
  • Certifications in Kubernetes (CKA/CKS), Linux (LPIC/RHCE), or major cloud platforms (AWS/Azure/GCP).
  • Experience operating within high-growth startups or large-scale production environments handling millions of requests per second.

Company

micro1

micro1: The leading AI platform designed to amplify human intelligence by connecting over a billion individuals with their dream roles.