
Staff Software Engineer - Backend Distributed Systems

Job description

At Snorkel AI, we’re redefining how people and organizations build AI applications. Snorkel started as a research project in the Stanford AI Lab in 2016, creating a higher-level interface to machine learning through programmatically labeled and managed training data. From deploying in some of the world’s largest and most sophisticated tech organizations, to empowering scientists, doctors, and journalists — we’ve seen firsthand how this approach democratizes and accelerates AI. Now, we’re building Snorkel Flow to bring our technology to everyone!
Building Snorkel Flow requires outstanding engineers and technologies across the stack, including scalable data pipelines, elegant and intuitive interfaces (both visual and programmatic), state-of-the-art ML modeling techniques, and best practices for seamless deployment. Modern AI approaches require large labeled training datasets to learn from. While traditional approaches typically rely on armies of human annotators to label by hand, Snorkel Flow empowers users to programmatically label and build training datasets to drive a radically faster, more flexible, and higher-quality end-to-end AI development process. Snorkel Flow is an end-to-end development platform, complete with a GUI and powerful programmatic interfaces for driving the development process for full AI application workflows: from preprocessing, to programmatic training data creation, to ML model training, to analysis and deployment. It's the data-first platform for enterprise AI.
Snorkel AI is looking for a staff distributed systems architect and backend engineer. The company’s flagship product is a cloud-based enterprise software platform used by data scientists and ML engineers. Large enterprises use Snorkel products to solve their most impactful problems in today’s data-centric AI world.

You will be part of the backend team that is building a scalable and reliable distributed system that empowers users to solve their most pressing needs in a data-centric AI world. The team has a variety of technical backgrounds, from machine learning PhDs to full-stack engineers who are building large-scale production systems. You will become one of these pragmatic, high-output, product-focused engineers.
Hybrid schedule with 1 or 2 days per week in our Redwood City HQ.
Main Responsibilities

  • Prototype, optimize, and maintain scalable back-end services that will power new ML development workflows

  • Design extensible and testable interfaces between internal services including the underlying storage and data models

  • Own the architecture, design, development, and operations of large-scale systems designed for AI/ML tasks, including data management systems, data engineering workflow systems, and distributed compute systems, and connect them to the front-end components

  • Work with customers to understand their product use cases, desired capabilities, and scale requirements, and translate those into engineering specifications and code

  • Be an engaged team player in a customer-focused cross-functional environment where you will feel excited to take on whatever is most impactful for the company and product

  • Work a hybrid schedule with one or two days per week in our Redwood City HQ and work remotely with "No Meeting" Tuesdays and Thursdays

Must haves

  • Bachelor's degree in Computer Science or related field

  • 4+ years of experience delivering distributed systems and services in a production setting for cloud-native applications

  • Ability to design and build efficient, scalable data storage and retrieval systems for AI/ML tasks

  • Strong communication and coding skills with emphasis on designing for scale and robustness

  • Proactive and positive attitude to lead, learn, troubleshoot, and take ownership of both shipping large, multi-quarter features and immediately debugging issues and unblocking customers

Nice to haves

  • 8+ years of professional software engineering experience

  • Experience architecting and developing production web-scale systems (monitoring, telemetry, performance, reliability, triage, and debugging)

  • Strong development and debugging skills in Python

  • Experience developing enterprise software products for machine learning and/or data science applications

  • Experience with distributed compute frameworks and/or deep learning frameworks

  • Experience building and maintaining large-scale, distributed, high-performance data pipelines