
SRE (LATAM)
- Remote
- Buenos Aires, Buenos Aires, Argentina
- Bogotá, Distrito Capital de Bogotá, Colombia
- São Paulo, São Paulo, Brazil
- Lima, Lima, Peru
- Santiago, Región Metropolitana de Santiago, Chile
- Product
We are looking for a motivated and detail-oriented SRE to join our Infrastructure team. You will focus on incident response, system monitoring, and maintaining the reliability of our services.
Job description
Site Reliability Engineer
About RebelMouse
RebelMouse is the always-modern SaaS CMS where more than 100 enterprise brands and media companies grow their digital audience. Websites running on RebelMouse serve more than half a billion page views per month thanks to powerful tools and incredible distribution across search and social. We blend technology and strategy together to move the needle where it matters most to increase traffic, loyalty, and revenue.
Our People
Our fully distributed team lives in 33 countries around the world. Led by Andrea Breanna, our Mexican-American, gender-fluid founder and CEO, we are a very safe, positive, and loving environment where diversity matters. We enjoy interesting tasks and strong challenges, value a sense of humor, and strive for work-life balance.
Job Summary
We are looking for a motivated and detail-oriented Site Reliability Engineer (SRE) to join our Infrastructure team. In this role, you will focus on incident response, system monitoring, and maintaining the reliability of our services. Over time, you will have the opportunity to take on broader responsibilities within the SRE function. We are seeking someone who is passionate about infrastructure, eager to learn, and ready to grow by supporting and improving the stability and performance of our platform.
Key Responsibilities:
Assist with incident investigation and root cause analysis
Design and implement preventive measures based on incident patterns
Create and update runbooks and documentation for operational procedures
Develop automation to prevent recurring incidents
Monitor service health and implement proactive improvements
Collaborate with existing SRE team members to enhance system reliability
Identify and address technical debt related to infrastructure stability
Help reduce alert noise by refining monitoring thresholds and rules
Growth Opportunities
Develop expertise in cloud infrastructure management
Learn advanced Kubernetes orchestration
Gain experience with performance optimization
Contribute to automation and tooling development
Participate in system architecture discussions
Benefits Package
Remote work forever
Monthly wellness subsidy
Flexible work hours
Flexible paid time off (PTO) with 12 national holidays and 20 days of vacation per year, as well as paid sick days and personal celebration days :)
RebelMouse is committed to providing a diverse work environment. We appreciate the unique competencies that each person brings to the company, and we provide equal employment opportunity to all applicants and employees without regard to race, color, religion, age, sex, sexual orientation, gender identity/expression, protected veteran status, or disability status.
Job requirements
Technical Environment
You'll be working with a hybrid infrastructure including:
AWS services (EC2, EKS, RDS, ElastiCache, DocumentDB, OpenSearch)
Kubernetes for production applications
Multiple database technologies (MongoDB, Redis, Memcached, MySQL, PostgreSQL)
Monitoring systems (ELK Stack, Prometheus, Grafana, ClickHouse, OpenTelemetry)
Qualifications and Skills:
At least 2 years of experience in IT operations, DevOps, or a related field
Basic knowledge of AWS cloud services (EC2, EKS, RDS)
Familiarity with Kubernetes and container orchestration
Experience with at least one database technology (MongoDB, Redis, MySQL, or PostgreSQL)
Understanding of monitoring systems (ELK, Prometheus, Grafana)
Experience with Linux systems administration
Basic scripting skills (Bash, Python)
Problem-solving mindset and ability to work under pressure
Good written and verbal communication skills in English
Nice to Have
Experience with OpenTelemetry or distributed tracing
Knowledge of ClickHouse or time-series databases
Experience with infrastructure-as-code tools (Terraform, Ansible)
Understanding of CI/CD pipelines
Previous experience in incident management
Familiarity with self-hosted and managed database environments