Snowflake Notebooks: What I Love and What I Hope to See Next
I recently explored Snowflake Notebooks, which became generally available a short while ago, and found them quite interesting. In this article, I’ll cover the following topics:
- What Snowflake Notebooks are.
- The 4 key features I love in them.
- 3 additional features I’d like to see.
- Ideas for integrating it into our data pipelines.
What Are Snowflake Notebooks
Snowflake Notebooks, part of Snowflake Projects alongside tools like Worksheets, Streamlit, and Dashboards, provide an interactive, cell-based environment within Snowflake. Much like Jupyter notebooks, they let you analyze data, build data pipelines, and train machine learning models using both SQL and Python, all on Snowflake’s infrastructure and without moving data outside the platform.
Snowflake Notebooks offer a wide range of features. In the next section, I’ll share the top 4 features I love from among the ones I’ve tried so far.
My Favorite Features in Snowflake Notebooks
Feature #1: Compute Control
One of the features I appreciate most in Snowflake Notebooks is the ability to fully control the compute instance used by the notebook. You can choose between:
Existing Snowflake Warehouses (WH):
These are the classic virtual warehouses that come in a variety of sizes, from X-Small to 6X-Large, allowing you to scale compute to your needs.
Container Compute Pools:
For even more flexibility, you can leverage compute pools that offer diverse configurations, such as:
- Basic CPUs for general-purpose tasks.
- Memory-Intensive CPUs for data-heavy operations.
- GPUs for machine learning and other computationally intensive workloads.
Configuring these options is remarkably simple, thanks to Snowflake’s high level of abstraction. As with most Snowflake features, the user experience is intuitive, so managing and adjusting these settings takes only a few clicks.
Feature #2: Enhanced Cell Interaction
Another feature I really like in Snowflake Notebooks is the ability to work with SQL cells alongside the standard Python and Markdown cells. This adds powerful functionality for data analysis and exploration. Here’s what makes it stand out:
Interaction Between Cells:
- Referencing SQL Cells in Other SQL Cells: You can use the name of a SQL cell as if it were a Common Table Expression (CTE), querying the output of one SQL cell directly in another without re-running the query or exporting data.
- Accessing SQL Output in Python: The output of a SQL cell can be accessed from a Python cell simply by referencing the SQL cell’s name, and then converted into either a Snowpark DataFrame or a pandas DataFrame for further processing, as sketched below.
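For example, here is a minimal Python-side sketch, assuming a SQL cell named daily_orders has already run (the cell name and column are hypothetical; in another SQL cell the same result could be referenced with the Jinja-style {{daily_orders}} syntax):

```python
# Python cell: work with the output of a SQL cell named "daily_orders".
# The cell name and column below are hypothetical; adapt them to your notebook.

# Convert the SQL result to a Snowpark DataFrame (processing stays in Snowflake).
orders_df = daily_orders.to_df()

# Or pull it into memory as a pandas DataFrame for local-style analysis.
orders_pd = daily_orders.to_pandas()

orders_df.group_by("ORDER_STATUS").count().show()
print(orders_pd.describe())
```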
Accessing Python DataFrames in SQL
To use a Python DataFrame in a SQL cell, you can simply save it as a temporary table. Once saved, the temporary table can be queried in any SQL cell within the same session.
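Here is a minimal sketch of that round trip, assuming a hypothetical RAW_ORDERS table and using the Snowpark session that notebooks expose:

```python
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Build or transform a DataFrame in Python (table and filter are made up).
shipped_df = session.table("RAW_ORDERS").filter("ORDER_STATUS = 'SHIPPED'")

# Persist it as a temporary table; it exists only for the current session.
shipped_df.write.mode("overwrite").save_as_table(
    "TMP_SHIPPED_ORDERS", table_type="temporary"
)

# Any SQL cell in the same session can now query it, e.g.:
#   SELECT COUNT(*) FROM TMP_SHIPPED_ORDERS;
```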
Using Python Variables in SQL Queries
Python string variables can be seamlessly integrated into SQL queries within SQL cells. This allows for dynamic and parameterized queries, making the interaction between Python and SQL even more flexible.
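As a small illustration (the variable names are invented, and I’m assuming the Jinja-style {{...}} reference syntax), a Python cell can define values that a later SQL cell picks up:

```python
# Python cell: plain string variables defined for downstream SQL cells.
region_filter = "EMEA"
start_date = "2024-01-01"

# A later SQL cell can then reference them directly, for example:
#   SELECT *
#   FROM SALES
#   WHERE REGION = '{{region_filter}}'
#     AND SALE_DATE >= '{{start_date}}'
```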
Feature #3: Customized Conda Environments
Another feature I appreciate in Snowflake Notebooks is the ability to easily define Conda package dependencies for each notebook. This allows you to create a customized Conda environment tailored to the specific requirements of your notebook.
- Custom Environments for Each Notebook: Each notebook can have its own unique set of dependencies, making it easy to work on multiple projects with varying library requirements without conflicts.
- Support for Custom Packages: In addition to using the pre-defined packages from the Snowflake Conda channel, you can also add your own custom packages, providing flexibility to include specialized libraries or tools.
- Current Limitation: At the moment, package selection is otherwise limited to the Snowflake Conda channel, so some third-party libraries may not be available.
Feature #4: Flexible Execution Options
One of the features I really like about Snowflake Notebooks is the flexibility in how they can be executed. Snowflake provides several ways to run notebooks, making it easy to integrate them into different workflows:
- Interactive Execution: You can open a notebook in the Snowflake UI and run its cells directly, which makes for an interactive experience when exploring and analyzing data.
- SQL-Based Execution: Notebooks can be run from SQL using the EXECUTE NOTEBOOK command. This command can be issued either directly from the Snowflake UI or from external tools such as dbt or Airflow, allowing seamless integration into automated data pipelines (a sketch follows after this list).
- Scheduled Execution: Snowflake Notebooks can be scheduled as Snowflake Tasks, so they run periodically on a defined schedule. A standout capability here is the ability to pass parameters to a notebook when it runs as a task. For example, during interactive execution you might run the notebook on a sample of the data for faster testing, while the scheduled task passes a parameter telling the notebook to process the entire dataset for full-scale production runs.
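To make the SQL-based and scheduled options concrete, here is a hedged sketch issued through a Snowpark session; the database, schema, notebook, warehouse, and task names are all placeholders:

```python
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Run the notebook once, on demand (the same SQL works from dbt, Airflow, or a worksheet).
session.sql('EXECUTE NOTEBOOK MY_DB.MY_SCHEMA."Daily Metrics"()').collect()

# Or wrap the same command in a task so it runs every morning at 06:00 UTC.
session.sql("""
    CREATE OR REPLACE TASK MY_DB.MY_SCHEMA.DAILY_METRICS_TASK
        WAREHOUSE = MY_WH
        SCHEDULE = 'USING CRON 0 6 * * * UTC'
    AS
        EXECUTE NOTEBOOK MY_DB.MY_SCHEMA."Daily Metrics"()
""").collect()

# Tasks are created suspended; resume the task to activate the schedule.
session.sql("ALTER TASK MY_DB.MY_SCHEMA.DAILY_METRICS_TASK RESUME").collect()
```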
Features I'd Like to See in Snowflake Notebooks
Feature #1: Real-Time Collaboration
I would love to see robust real-time collaboration capabilities in Snowflake Notebooks, allowing multiple users to edit the same notebook simultaneously. Features like shared cursors, presence indicators, and real-time updates would greatly improve teamwork and enhance the overall user experience.
Feature #2: Expanded Package Management
I would like to see the ability to support additional Conda channels and install packages from pip repositories in Snowflake Notebooks. This feature would provide greater flexibility in accessing specialized libraries while maintaining robust environment management and effectively handling dependency conflicts.
Feature #3: Parameter Passing for SQL Execution
Adding the ability to pass parameters to a notebook when executing it through SQL would allow greater flexibility. This would enable external tools, not just the Snowflake scheduler, to dynamically send parameters, making integrations with third-party platforms more powerful.
Integrating Snowflake Notebooks into Our Data Pipelines
Interactive Development and Testing for Data Pipelines
One way we can leverage Snowflake Notebooks in our data pipeline is by using them to develop and test subsets of our workflows interactively, allowing us to incorporate stakeholder feedback before deploying them to production.
Building Interactive Analytics Dashboards
Another use for Snowflake Notebooks is building custom interactive data analytics dashboards, leveraging the Streamlit integration to provide a rich, interactive user experience. These dashboards can also be refreshed as part of our data pipelines, ensuring they always show stakeholders up-to-date information.
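As a rough sketch of what that can look like (the SALES table and its columns are invented), Streamlit widgets can be mixed into a Python cell so the output reacts to user input:

```python
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import col, sum as sum_

session = get_active_session()

# Let the viewer pick a region; the cell re-runs and the chart updates.
region = st.selectbox("Region", ["EMEA", "AMER", "APAC"])

daily_revenue = (
    session.table("SALES")                  # hypothetical table
    .filter(col("REGION") == region)
    .group_by("SALE_DATE")
    .agg(sum_("AMOUNT").alias("TOTAL_AMOUNT"))
    .sort("SALE_DATE")
    .to_pandas()
)

st.line_chart(daily_revenue.set_index("SALE_DATE")["TOTAL_AMOUNT"])
```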
Building and Experimenting with Machine Learning Models
Another use case for Snowflake Notebooks is building machine learning models and running experiments without moving data outside Snowflake, while visualizing model results and interacting with them directly in the notebook environment.
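As a bare-bones illustration (the CUSTOMER_FEATURES table and its columns are placeholders), one pattern is to pull a training set into pandas and experiment with scikit-learn right in a Python cell:

```python
from snowflake.snowpark.context import get_active_session
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

session = get_active_session()

# Hypothetical feature table; everything runs on Snowflake's compute.
df = session.table("CUSTOMER_FEATURES").to_pandas()

X = df[["TENURE_MONTHS", "MONTHLY_SPEND"]]
y = df["LIFETIME_VALUE"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```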
Prototyping and Scheduling Intermediate Pipeline Components
Another use for Snowflake Notebooks is to write quick, experimental components of our data pipelines, such as data models or data transformations, and schedule them to run periodically using tools like dbt and Airflow. These components can serve as an intermediate stage, allowing us to test and refine them before fully integrating them into our production pipelines.
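For the Airflow route, a rough sketch might look like the following; it assumes a recent Airflow with the Snowflake provider package installed, an existing connection named snowflake_default, and a placeholder notebook name:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# Hypothetical DAG that triggers a prototype notebook once a day.
with DAG(
    dag_id="run_prototype_notebook",
    start_date=datetime(2024, 1, 1),
    schedule="0 7 * * *",  # daily at 07:00
    catchup=False,
) as dag:
    run_notebook = SnowflakeOperator(
        task_id="execute_notebook",
        snowflake_conn_id="snowflake_default",
        sql='EXECUTE NOTEBOOK MY_DB.MY_SCHEMA."Prototype Transform"()',
    )
```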
Conclusion
Snowflake Notebooks provide an exciting opportunity to streamline data workflows, enhance collaboration, and bridge the gap between experimentation and production. By building on their current features and introducing additional functionality, such as broader package support, advanced collaboration tools, and better integration with external platforms, they can support a wider range of use cases. These enhancements would make Snowflake Notebooks easier to use and an even more powerful tool for managing and optimizing our data workflows.