Senior Site Reliability Engineer San Francisco, CA
Company: Tbwa Chiat/Day Inc
Location: San Francisco
Posted on: November 20, 2024
Job Description:
Location: San Francisco - hybrid (1-2 days per week)
Salary: $165-175k + stock
Company Description
Focal Systems is the industry
leader in retail AI solutions. We are a Silicon Valley based
startup that has more than doubled in size every year since
inception. We are a Deep Learning first company. Our mission is to
automate and optimize brick and mortar retail using deep learning
computer vision. Focal Systems has been deployed at scale with the
top retailers in the world. We are looking for smart, creative and
passionate people who want to help build a great and enduring
company and deploy Deep Learning to the world!
Mission of the role:
To enable us to scale from 200k to 1 million cameras
Job Summary
As a
Sr. DevOps/Site Reliability Engineer (SRE) at our company, you will
play a pivotal role in ensuring the smooth operation and continuous
improvement of our infrastructure, deployment processes, and
overall system reliability.
Responsibilities
- Set up and manage blue/green and canary deployments to ensure
smooth launches without downtime.
- Operate multiple large GCP Kubernetes clusters and fine-tune
them for reliability versus cost.
- Manage the company's various distributed services, always
ensuring graceful updates, comprehensive test coverage, log
tracking, and 99.9% uptime.
- Work with Backend, Frontend and Deep Learning teams and write
infrastructure automation code for their needs.
- Identify scalability bottlenecks through load testing and plan
infrastructure architecture.
- Create tools to provide transparency/ease of access into the
company's rich datasets stored across varying geographic locations
and data formats.
- Design, build, and manage a robust Continuous Integration and
Continuous Deployment (CI/CD) pipeline.
- Lead uptime improvement processes, including postmortem review
and on-call setup.
Requirements
- Solid experience in an infrastructure or Site Reliability
Engineer (SRE) role.
- Hands-on experience with containerization (Docker) and
orchestration platforms (Kubernetes) required.
- Experience in cloud cost management.
- Strong understanding of SQL, networking, distributed systems,
operating systems (Debian) and software engineering practices.
- Experience with messaging systems.
- Experience with Terraform or another Infrastructure as Code
automation solution.
- Experience operating relational SQL databases and Redis at
terabyte scale.
- Proven experience with setting up monitoring/alerting and
reliability engineering.
- Scripting skills in Python.
Nice to have experience:
- GitOps.
- Setting up automation for complex load testing scenarios.
- Tuning Deep Learning pipelines with Python, PyTorch and
multiprocessing.
- Backend programming with Python.
Why Focal Systems
Strong Values and Mission - We are a tightly-knit team with an
ambitious mission
and a strong set of core values, which define our approach to
business and have successfully guided us since inception.
Exceptional Team - We are a team of hard-working,
fun-loving professionals from some of the most eminent
universities, research labs, and tech companies of our time. We
pride ourselves on recruiting exceptional individuals to help us
redefine the state of the art.
Outstanding Partners - We work with
10+ of the largest retailers in the world and have a world-class
roster of investors, advisors and partners to support & advise us
in our endeavors.
We care deeply about the health, happiness, and
wellbeing of all of our employees. We offer:
- Paid Time Off
- Quarterly Team Retreats