Senior Site Reliability Engineer
Role Description
This role involves joining an Identity Cloud software development team as a Senior Site Reliability Engineer (SRE). You will work closely with software engineers, infrastructure platform services, engineering managers, and other stakeholders to ensure the reliability, scalability, and performance of the team's services.
- Work with development and service owners to solve performance issues and ensure system scalability.
- Design and implement solutions to improve reliability, availability, performance, and scalability of systems.
- Create alerts and dashboards in collaboration with technical leaders and infrastructure platform services.
- Own and improve key operational metrics (SLIs, SLOs, Error Budgets, monitoring and alerting).
- Drive continuous improvement through post-incident reviews and blameless postmortems of non-functional issues.
- Implement and maintain comprehensive monitoring and alerting to proactively identify and resolve issues.
- Create and maintain dashboards, reviewing them regularly to identify and close gaps.
- Collaborate with technical leads, DevOps/SRE, and infra teams on planning.
- Identify and address production performance bottlenecks through profiling, tuning, and optimization.
- Automate repetitive tasks and processes to improve efficiency.
- Work closely with Software, Performance, and Test Engineers to influence system design and architecture.
- Review and contribute to documentation for systems, processes, runbooks, and procedures.
- Participate in a 24/7 on-call rotation to provide subject matter expertise.
- Support incident postmortem efforts, ensuring timely compilation of reports.
- Apply strong diagnostic and problem-solving skills to analyze systems and data.
Qualifications
- Bachelor’s degree in computer science, a related field, or equivalent practical experience.
- 5+ years of proven SRE experience.
- Strong understanding of SRE principles and practices.
- Experience with cloud platforms (AWS, GCP, or Azure).
- Proficiency in at least one scripting language (e.g., Python, Bash, Go).
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, Honeycomb, OpenSearch).
- Coding experience beyond simple scripts in a programming language such as Go, Java, or Python.
- Experience with containerization and orchestration technologies (e.g., Kubernetes).
- Understanding of network protocols and best practices.
- Familiarity with DevOps culture and practices and experience with CI/CD toolchains (e.g., Jenkins, ArgoCD).
- Experience with Incident Response tools and processes.
- Experience with Infrastructure as Code (Terraform, Helm).
- Strong problem-solving and troubleshooting skills.
- Excellent communication and collaboration skills.
- Ability to work independently and as part of a team.
Preferred Qualifications
- Technology experience: Kafka, relational databases, performance tuning (JVM, Go).
- Experience with Grafana k6 for performance testing.
Onboarding Timeline
- In the first 30 days you will:
  - Meet the team and understand its mission and vision.
  - Gain clarity on various roles and expectations.
  - Complete development environment setup.
  - Read guides and documentation, and complete mandatory training.
  - Learn company processes and benefits.
- By 6 months you should:
  - Understand team goals and OKRs for the quarter and beyond.
  - Complete initial analysis and implementation of SRE team assignments.
  - Be comfortable with tools, systems, and processes used on a day-to-day basis.
  - Complete project work, both supervised and unsupervised.