✨ About The Role
- The role involves developing scalable and fault-tolerant distributed machine learning libraries that support leading ML platforms.
- You will be responsible for creating an exceptional end-to-end experience for training machine learning models.
- The position requires solving complex architectural challenges and turning them into practical solutions.
- Collaboration with the open-source community, including ML researchers and engineers, is a key aspect of the job.
- The role also includes working directly with end users to improve the product based on their feedback.
⚡ Requirements
- The ideal candidate will have over 5 years of experience in building, scaling, and maintaining software systems in production environments.
- A strong foundation in algorithms, data structures, and system design is essential for success in this role.
- Proficiency with popular machine learning frameworks and libraries such as PyTorch, TensorFlow, and XGBoost is required.
- Experience in designing fault-tolerant distributed systems will be a significant advantage.
- Candidates with a background in maintaining open-source libraries will be highly regarded.