Introducing the Anyscale Snowflake Connector

By Eric Greene   

Simple Data Transfer and Efficient AI Workflows with Anyscale and Snowflake

The Anyscale Snowflake Connector is a new Ray Datasets capability that simplifies data transfer between Snowflake warehouses and Anyscale-hosted Ray clusters. It lets data scientists do machine learning discovery and development within Anyscale while using their existing Snowflake data lake. It also gives machine learning engineers a simpler way to build end-to-end workloads, because the entire ML pipeline can run within a single Python script.

Because Ray and Ray Datasets scale out across a cluster, machine learning workloads such as training, tuning and batch serving jobs can run faster and at lower cost, while drawing on the latest advancements in AI through Ray's integrations with Hugging Face, XGBoost, LightGBM and many other AI frameworks and libraries.

Key Benefits of Snowflake

  • Simple access to data - Access Snowflake data through SQL queries from Anyscale Workspaces in Visual Studio Code or JupyterHub environments, empowering data science discovery and development.

  • Improved data security and governance - Ensure data security and governance by directly copying data from Snowflake to Ray clusters, eliminating the need for intermediate steps and maintaining separate controls over sensitive data. All data is encrypted in transit and at rest.

  • Highly scalable data exchange - Take advantage of the parallel read and write capabilities of Ray datasets, enabling the exchange of terabytes of data in minutes. This capability significantly speeds up data-intensive operations and enhances scalability while reducing job run times and overall costs.

  • Simplified workload development - Simplify machine learning workflows by consolidating all the necessary logic into a single script. This script can query features, train and tune models, and score and materialize results back into the data lake, streamlining the entire process.

  • Unlock the latest AI innovations - Leverage the power of Snowflake for querying and joining data, while benefiting from the scalability and simplicity of Ray AIR for machine learning and AI development. Ray AIR provides integrations with the latest AI innovations, such as pretrained Hugging Face language models.

Simple, Secure and Scalable Data Access

Using the Anyscale Snowflake connector, large datasets can be queried from a Snowflake warehouse, and quickly transferred into a Ray dataset distributed across the Ray cluster. The Snowflake connector reads query results in chunks of data, in parallel across the Ray cluster. The size of the dataset and the speed at which it can be transferred scales based on the size of your Ray cluster and the number of simultaneous requests the underlying Snowflake SQL warehouse supports.
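The chunked, parallel read described above can be pictured with a small standard-library sketch. This is not the connector's implementation: `ROWS`, `split_into_chunks`, `fetch_chunk` and `parallel_read` are hypothetical names, an in-memory list stands in for a Snowflake query result, and a thread pool stands in for the workers of a Ray cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a Snowflake query result (not real connector data).
ROWS = [{"id": i, "value": i * i} for i in range(1000)]


def split_into_chunks(num_rows, chunk_size):
    """Return (start, end) row ranges that together cover the whole result set."""
    return [(start, min(start + chunk_size, num_rows))
            for start in range(0, num_rows, chunk_size)]


def fetch_chunk(bounds):
    """Hypothetical chunk reader; a real worker would download one result chunk."""
    start, end = bounds
    return ROWS[start:end]


def parallel_read(num_rows, chunk_size, max_workers):
    """Fetch all chunks concurrently and return them in their original order."""
    chunks = split_into_chunks(num_rows, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even though chunks are fetched concurrently.
        return list(pool.map(fetch_chunk, chunks))


blocks = parallel_read(num_rows=len(ROWS), chunk_size=128, max_workers=4)
total_rows = sum(len(block) for block in blocks)
print(f"{len(blocks)} chunks, {total_rows} rows")
```

In this toy model, raising `max_workers` plays the role of adding Ray nodes, and the number of chunks that can be fetched at once plays the role of the warehouse's limit on simultaneous requests.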

Anyscale Snowflake Parallel Read and Write

With the Anyscale Snowflake connector’s parallel reads and writes, speeds will scale with the size of the Ray cluster. The benchmarks for reading and writing demonstrate how the cluster scales out to support larger data sets. 

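| Table (TPCDS 1000) | Size (GB) | Read Time (s) | Write Time (s) |
|--------------------|-----------|---------------|----------------|
| CUSTOMER_ADDRESS   | 7         | 17            | 160            |
| INVENTORY          | 22        | 21            | 595            |
| LINEITEM           | 755       | 198           | 9060           |
| STORE_RETURNS      | 3229      | 200           | 35700          |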
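To make the benchmark figures easier to compare, the short sketch below converts them into effective throughput. It only does arithmetic on the numbers in the table above, taking the size column to be in GB as labelled; `BENCHMARKS`, `throughput_gb_per_s` and `rates` are names introduced here for illustration.

```python
# Benchmark figures copied from the table above: (table, size in GB, read s, write s).
BENCHMARKS = [
    ("CUSTOMER_ADDRESS", 7, 17, 160),
    ("INVENTORY", 22, 21, 595),
    ("LINEITEM", 755, 198, 9060),
    ("STORE_RETURNS", 3229, 200, 35700),
]


def throughput_gb_per_s(size_gb, seconds):
    """Effective transfer rate: data volume divided by elapsed time."""
    return size_gb / seconds


# Map each table name to its (read rate, write rate) in GB/s.
rates = {
    name: (throughput_gb_per_s(size, read_s), throughput_gb_per_s(size, write_s))
    for name, size, read_s, write_s in BENCHMARKS
}

for name, (read_rate, write_rate) in rates.items():
    print(f"{name:<17} read {read_rate:6.2f} GB/s   write {write_rate:6.3f} GB/s")
```

On these figures the effective read rate rises with table size, from roughly 0.4 GB/s for CUSTOMER_ADDRESS to roughly 16 GB/s for STORE_RETURNS, which is consistent with larger tables being spread over more parallel chunks.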

Simplified ML Development

Using the Snowflake connector from within an Anyscale Workspace, data scientists and machine learning engineers have a unified experience while developing workloads. The data query, feature engineering, training, tuning and inference can all be executed within a single Python script that scales across a Ray cluster. Ray integrates with the latest AI and machine learning libraries, enabling advanced ML workloads that use third-party libraries like Hugging Face, XGBoost and LightGBM, as well as most PyTorch and TensorFlow model architectures.

Anyscale Workspace Snowflake Connector
Working with Snowflake

Unlock new use cases with AI and ML

Anyscale and Ray integrate with most open source AI and ML libraries, enabling the latest innovations in AI to be applied to Snowflake data. Ray makes working with Hugging Face, XGBoost, LightGBM, TensorFlow, PyTorch and scikit-learn a unified experience, with the added benefit of scaling through distributed training, tuning and serving.

A typical ML workload to train a LightGBM model can be implemented in 20 lines of code.

Snowflake and Ray Diagram
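import ray
from ray.air.config import ScalingConfig
from ray.data.preprocessors import Categorizer
from ray.train.lightgbm import LightGBMTrainer

# SnowflakeDatasource is provided by the Anyscale Snowflake connector, and
# connect_props is a dict of Snowflake connection properties; both are defined elsewhere.
SQL = "SELECT * FROM samples.tpch.lineitem"
dataset = ray.data.read_datasource(SnowflakeDatasource(**connect_props), sql=SQL)

# Hold out 30% of the rows for validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

# Train a distributed LightGBM regressor that predicts l_tax.
results = LightGBMTrainer(
  scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
  label_column="l_tax",
  params={"objective": "regression", "metric": ["rmse", "mae"]},
  datasets={"train": train_dataset, "valid": valid_dataset},
  preprocessor=Categorizer(["l_shipmode"]),
  num_boost_round=10
).fit()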
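The `metric` entry in `params` asks LightGBM to report root mean squared error ("rmse") and mean absolute error ("mae") on the evaluation data. As a reminder of what those two numbers measure, here is a minimal standard-library sketch of the textbook definitions. The `actual` and `predicted` values are made up for illustration and are not results from the workload above.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Made-up tax rates, for illustration only.
actual = [0.02, 0.04, 0.06, 0.08]
predicted = [0.03, 0.04, 0.05, 0.10]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"rmse={rmse_value:.4f} mae={mae_value:.4f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, and the gap between the two widens when a few predictions are far off.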

Learn more

We’ll be demonstrating this capability and diving into a wide range of AI use cases with many of the world’s top AI pioneers from OpenAI, Netflix, Pinterest, Verizon, Instacart and others at Ray Summit 2023 this Sept 18-19 in San Francisco. 

If you’d like to learn more about building ML workflows with Ray, just register for your complimentary copy of “Learning Ray” from O’Reilly Publishing.
