SRE (LATAM)

Remote
- Buenos Aires, Buenos Aires, Argentina
- Bogotá, Distrito Capital de Bogotá, Colombia
- São Paulo, São Paulo, Brazil
- Lima, Lima, Peru
- Santiago, Región Metropolitana de Santiago, Chile
Product

We are looking for a motivated and detail-oriented SRE to join our Infrastructure team. You will focus on incident response, system monitoring, and maintaining the reliability of our services.

Job description

Site Reliability Engineer

About RebelMouse

RebelMouse is the always-modern SaaS CMS where more than 100 enterprise brands and media companies grow their digital audiences. Websites running on RebelMouse serve more than half a billion page views per month thanks to powerful tools and incredible distribution across search and social. We blend technology and strategy to move the needle where it matters most: traffic, loyalty, and revenue.

Our People

Our fully distributed team lives in 33 countries around the world. Led by Andrea Breanna, our Mexican-American, gender-fluid founder and CEO, we are a very safe, positive, and loving environment where diversity matters. We enjoy interesting tasks and strong challenges, value a sense of humor, and strive for work-life balance.

Job Summary

We are looking for a motivated and detail-oriented Site Reliability Engineer (SRE) to join our Infrastructure team. In this role, you will focus on incident response, system monitoring, and maintaining the reliability of our services. Over time, you will have the opportunity to take on broader responsibilities within the SRE function. We are seeking someone who is passionate about infrastructure, eager to learn, and ready to grow by supporting and improving the stability and performance of our platform.

Key Responsibilities

Assist with incident investigation and root cause analysis
Design and implement preventive measures based on incident patterns
Create and update runbooks and documentation for operational procedures
Develop automation to prevent recurring incidents
Monitor service health and implement proactive improvements
Collaborate with existing SRE team members to enhance system reliability
Identify and address technical debt related to infrastructure stability
Help reduce alert noise by refining monitoring thresholds and rules

Growth Opportunities

Develop expertise in cloud infrastructure management
Learn advanced Kubernetes orchestration
Gain experience with performance optimization
Contribute to automation and tooling development
Participate in system architecture discussions

Benefits Package

Remote work forever
Monthly wellness subsidy
Flexible work hours
Flexible paid time off (PTO) with 12 national holidays and 20 days of vacation per year, as well as paid sick days and personal celebration days :)

RebelMouse is committed to providing a diverse work environment. We appreciate the unique competencies that each person brings to the company, and we provide equal employment opportunity to all applicants and employees without regard to race, color, religion, age, sex, sexual orientation, gender identity/expression, protected veteran status, or disability status.

Job requirements

Technical Environment

You'll be working with a hybrid infrastructure including:

AWS services (EC2, EKS, RDS, ElastiCache, DocumentDB, OpenSearch)
Kubernetes for production applications
Multiple database technologies (MongoDB, Redis, Memcached, MySQL, PostgreSQL)
Monitoring systems (ELK Stack, Prometheus, Grafana, ClickHouse, OpenTelemetry)

Qualifications and Skills

At least 2 years of experience in IT operations, DevOps, or a related field
Basic knowledge of AWS cloud services (EC2, EKS, RDS)
Familiarity with Kubernetes and container orchestration
Experience with at least one database technology (MongoDB, Redis, MySQL, or PostgreSQL)
Understanding of monitoring systems (ELK, Prometheus, Grafana)
Experience with Linux systems administration
Basic scripting skills (Bash, Python)
Problem-solving mindset and ability to work under pressure
Good written and verbal communication skills in English

Nice to Have

Experience with OpenTelemetry or distributed tracing
Knowledge of ClickHouse or time-series databases
Experience with infrastructure-as-code tools (Terraform, Ansible)
Understanding of CI/CD pipelines
Previous experience in incident management
Familiarity with self-hosted and managed database environments