How do you manage the dependencies of a large-scale data science project? How do you migrate that project from a laptop to cloud infrastructure or utilize GPUs and multiple instances in parallel? This week on the show, Savin Goyal returns to discuss the updates to the open-source framework Metaflow.
Savin briefly describes the Metaflow platform and the goal of simplifying engineering overhead for data scientists and programmers. We discuss how the platform captures snapshots of a project as you work, allowing you to go back in time or share the state of your project with another team member.
We dig into the complicated process of managing dependencies for machine learning and data science projects. Savin describes how the required external libraries can be specified within a flow with the new @pypi or @conda decorators. This allows a project to scale from a local machine to the cloud or multiple instances with all dependencies included.
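To make the idea of declaring dependencies next to the step that needs them more concrete, here is a minimal sketch in plain Python. It is not Metaflow's implementation and does not import Metaflow: the `pin_packages` decorator, the `TrainingFlow` class, and the `requirements_for` helper are hypothetical names invented for illustration, and the package versions are arbitrary examples. In Metaflow itself, the real @pypi and @conda decorators are imported from the `metaflow` package and do considerably more, including building the environment the step runs in.

```python
from typing import Callable, Dict, List


def pin_packages(packages: Dict[str, str]) -> Callable:
    """Hypothetical stand-in for a per-step dependency decorator.

    It only records the pinned versions on the function object;
    it does not install anything or create an environment.
    """
    def decorate(func: Callable) -> Callable:
        func.pinned_packages = dict(packages)
        return func
    return decorate


class TrainingFlow:
    """Toy flow: each step declares the libraries it needs."""

    @pin_packages({"pandas": "2.1.1"})
    def start(self) -> None:
        pass  # data loading would go here

    @pin_packages({"scikit-learn": "1.3.1", "pandas": "2.1.1"})
    def train(self) -> None:
        pass  # model training would go here

    def end(self) -> None:
        pass  # no extra dependencies declared


def requirements_for(step: Callable) -> List[str]:
    """Render a step's pins as pip-style requirement strings, sorted by name."""
    pins = getattr(step, "pinned_packages", {})
    return [f"{name}=={version}" for name, version in sorted(pins.items())]


if __name__ == "__main__":
    for step_name in ("start", "train", "end"):
        step = getattr(TrainingFlow, step_name)
        print(step_name, requirements_for(step))
```

Because the pins travel with the step definition, whatever executes the step, whether a laptop or a remote instance, can read the same declaration and rebuild the same set of libraries. That is the property the episode attributes to the decorators.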
He talks about starting a new company, Outerbounds, with co-workers from Netflix. Their vision is to continue to build the Metaflow open-source platform and offer customers scalable enterprise-grade infrastructure.
This week’s episode is brought to you by Intel.
Course Spotlight: Everyday Project Packaging With pyproject.toml
In this Code Conversation video course, you’ll learn how to package your everyday projects with pyproject.toml. Playing on the same team as the import system means you can call your project from anywhere, ensure consistent imports, and have one file that’ll work for many build systems.
Topics:
00:00:00 – Introduction
00:02:25 – Update on Metaflow
00:04:13 – What is Outerbounds?
00:07:26 – An ML platform to serve data scientists' needs
00:13:02 – Dependency reproducibility via @conda and @pypi decorators
00:26:18 – Sponsor: Intel
00:27:10 – Storing lock files along with snapshots
00:29:17 – Working alongside code and dependency management systems
00:34:03 – Scaling a project from laptop to the cloud
00:40:13 – Video Course Spotlight
00:41:41 – Getting visibility on processes
00:47:23 – Adjusting your project due to GPU availability
00:52:27 – Example of jumping back into a project one year later
00:55:54 – What are you excited about in the world of Python?
00:57:39 – What do you want to learn next?
00:59:35 – How can people follow your work online?
01:00:19 – Thanks and goodbye
Show Links:
Metaflow - a framework for real-life ML, AI, and data science
Human-Friendly, Production-Ready Data Science with Metaflow - Savin Goyal | SciPy 2022 - YouTube
New in Metaflow: The Long-Awaited @pypi Decorator - Outerbounds
Seamless Data and ML Pipelines with Airflow and Metaflow - Outerbounds
Episode #142: Orchestrating Large and Small Projects With Apache Airflow – The Real Python Podcast
Level up your Python skills with our expert-led courses:
Support the podcast & join our community of Pythonistas