Staff Software Engineer - Machine Learning Platform (San Francisco)
Company: Replicate, Inc.
Location: San Francisco
Posted on: May 3, 2025
Job Description:
Replicate makes it easy for software engineers to run and
customize machine learning models in the cloud. With a library of
thousands of open-source models, you can get started with one line
of code, or fine-tune and deploy your own models when you need
something custom. We handle the infrastructure, so you can focus on
building. Our team comes from places like Docker, GitHub, and
NVIDIA, and we're obsessed with making AI as intuitive as deploying
a web app. We build in public, ship fast, and care about getting
the details right.

The Platform team at Replicate oversees the
entire lifecycle of models, from packaging and deployment to
serving, scaling, and monitoring. You'll be developing the
infrastructure that supports thousands of models and powers
millions of predictions daily. This is a chance to build something
truly innovative, where each decision you make has a tangible
impact and allows your creativity to shine.

What you'll be doing:
- Designing and building our deployment and model-serving
platform.
- Building technology to operate the latest advancements in the
ML and AI space.
- Designing systems to maximize the utilization and reliability
of our Kubernetes clusters and GPUs, including multi-regional
traffic shifting and failover capabilities.
- Owning and optimizing fair and reliable task allocation and
queuing across a diverse set of customers with heterogeneous
workloads.
- Working with our Models team to speed up model inference
through techniques like caching, weights management, machine
configurations, and runtime optimizations in Python and
PyTorch.

Technologies you'll be working with:
- Python, Go, and Node.js
- Kubernetes and Terraform
- Redis, Google BigQuery, and PostgreSQL

We're looking for the right person, not just someone who checks
boxes, but it's likely you have:
- Experience building platforms at scale.
- Experience working in complex systems with many moving parts; you
have opinions on monoliths vs. services.
- Experience designing and implementing developer-friendly APIs to
enable scalable and reliable integration.
- Hands-on experience setting up and operating Kubernetes.
- A passion for building tools that empower developers.
- Strong communication and collaboration skills, with the ability
to understand customer needs and distill complex topics into clear,
actionable insights. We believe that programming is about more than
writing code; building a platform requires a collaborative
approach.
- At least 10 years of full-time software engineering
experience.

These aren't hard requirements, but we definitely want
to talk with you if:
- You have worked on machine learning platform teams in the
past.
- You have experience working with or on teams that have put
ML/AI into production, even though this role does not entail
building ML models directly.
- You have some exposure to serving Generative AI features where
GPUs are costly commodities and workloads can take significant time
to finish.

You'll be working from our beautiful office in the
Mission, San Francisco, at least 3 days a week.