Description:
The Sr. Site Reliability Engineer is a technical Subject Matter Expert who proactively drives the technical stability and performance of the applications in the provider technology portfolio. They combine software and systems engineering to design solutions in physical, virtual, and cloud environments that automate fault detection, containment, and resolution without customer impact or human intervention. These solutions typically involve software development for metrics and event collection/correlation across distributed architectures, automation, monitoring, intelligent alerting, random fault injection, and self-healing. Focus areas include:
- High Availability, Disaster Recovery, Sustained Resiliency, Chaos Engineering
- Service and Operational Level Agreements
- SRE standards and best practices
- Performance Engineering
- Application scalability/Capacity Management
- Technical debt reduction
- Logging, monitoring, intelligent alerting, self-healing
- Security Vulnerabilities and Compliance
- Solution design
- Application Knowledge Support Artifacts, etc.
Primary Responsibilities:
- Lead the Site Reliability Engineering practices – improve availability / reliability, latency, performance, efficiency, monitoring, emergency response and capacity planning / forecasting / management
- Implement self-healing and resiliency patterns
- Responsible for running production systems - ensure applications are available per business SLAs.
- Participate in the resolution of high / critical business impact issues, and conduct blameless post-mortems and Root Cause Analysis.
- Collaborate with other IT teams on solutions, enhancements, and process improvements.
- Responsible for production best practices, technical and operating standards, design and implementation of performance and operational enhancements.
- Work with engineering teams across SDLC activities to implement best practices to make applications secure and reliable.
- Integrate security/compliance tools in deployment pipelines. Perform vulnerability assessments using tools and remediate vulnerabilities within established timeframes.
- Responsible for coordination, technical planning, and implementation of product life cycle upgrades, production maintenance, and technology debt reduction activities.
- Drive technical features, including intake, prioritization, creation, grooming, and implementation
- Drive Chaos Engineering practices to test under real-world conditions
- Participate in architectural and design decisions
- Design and implement end-to-end monitoring solutions for application and infrastructure components, based on cutting-edge SLO-based telemetry tools
- Participate in on-call rotations across continents, using a follow-the-sun model
You will be rewarded and recognized for your performance in an environment that will challenge you, give you clear direction on what it takes to succeed in your role, and provide development for other roles you may be interested in.
Required Qualifications:
- BS or MS in Computer Science, a related field, or equivalent experience
- 4+ years of experience in site reliability engineering practices
- Experience in supporting and operating large-scale production systems
- Experience programming in Java with Spring Boot and APIs
- Knowledge of the Unix/Linux shell, ability to write shell scripts, and understanding of Linux internals
- Experience with CI/CD and infrastructure automation tools - Jenkins, Terraform, etc.
- Experience in infrastructure and application logging, monitoring and observability tools, intelligent alerting, and automated self-healing
Preferred Qualifications:
- Experience in public cloud ecosystems – AWS
- Experience in Elasticsearch
- Experience in Kafka Streaming
- Experience with containers and container orchestration, such as Kubernetes
- Experience with Chaos Engineering
Horizontal is proud to be an Equal Opportunity and Affirmative Action Employer. We seek to provide employment opportunities to talented, qualified candidates regardless of race, color, sex/gender including gender identity and/or expression, national origin, religion, sexual orientation, disability, marital status, citizen status, veteran status, or any other protected classification under federal, state or local law.
In addition, Horizontal will provide reasonable accommodations for qualified individuals with disabilities. If you need to request a reasonable accommodation in order to complete the application or interview process, please contact hr@horizontal.com.
All applicants must be legally authorized to work in the country of employment.