Cloud Data Warehouse - Sep 19 2021

Does Snowflake Use Presto?

Snowflake and Presto are both popular tools for cloud analytics, but they are fundamentally different. Snowflake is a fully managed, almost serverless cloud data warehouse, whereas Presto is an open-source, self-hosted SQL query engine. The short answer to the question in the title is no: Snowflake runs on its own proprietary query engine and does not use Presto. We have talked a lot about Snowflake in previous articles, so let’s focus on Presto.

How are Snowflake and Presto different?

Presto is a distributed SQL query engine designed to run interactive queries against multiple data sources. It was designed from the ground up for fast analytics across Hive, relational databases, Cassandra, Hadoop clusters, and even proprietary data sources, and a single query can combine data from several of them. Presto is open-source and self-hosted.
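To make the idea of one engine querying several sources concrete, here is a toy sketch in Python. It is not Presto's connector API: `InMemoryConnector`, `federated_join`, the catalog names `hive` and `mysql`, and the sample rows are all hypothetical, chosen only to illustrate how an engine can read from two different sources and join them in one step.

```python
from typing import Dict, Iterator, List

Row = Dict[str, object]


class InMemoryConnector:
    """Hypothetical stand-in for a data source such as Hive or MySQL."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterator[Row]:
        # Every source exposes the same interface: stream the rows of a table.
        yield from self._tables[table]


def federated_join(
    catalogs: Dict[str, InMemoryConnector],
    left: str,
    right: str,
    left_key: str,
    right_key: str,
) -> List[Row]:
    """Hash-join two tables addressed as 'catalog.table'."""
    left_cat, left_table = left.split(".", 1)
    right_cat, right_table = right.split(".", 1)

    # Build an in-memory hash index on the right-hand table.
    index: Dict[object, List[Row]] = {}
    for row in catalogs[right_cat].scan(right_table):
        index.setdefault(row[right_key], []).append(row)

    # Probe the index with each row of the left-hand table.
    joined: List[Row] = []
    for row in catalogs[left_cat].scan(left_table):
        for match in index.get(row[left_key], []):
            joined.append({**row, **match})
    return joined


# Two "sources" with different contents behind the same scan() interface.
catalogs = {
    "hive": InMemoryConnector({"page_views": [
        {"customer_id": 1, "url": "/pricing"},
        {"customer_id": 2, "url": "/docs"},
        {"customer_id": 1, "url": "/signup"},
    ]}),
    "mysql": InMemoryConnector({"customers": [
        {"id": 1, "name": "Ada"},
        {"id": 2, "name": "Grace"},
    ]}),
}

joined = federated_join(
    catalogs, "hive.page_views", "mysql.customers", "customer_id", "id"
)
for row in joined:
    print(row["name"], row["url"])
```

In a real deployment the equivalent would be a single SQL statement that references tables from two configured catalogs; the engine, not the user, takes care of reading from each source.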

Snowflake differs significantly from Presto. With Presto, the user is responsible for provisioning servers and configuring the Presto cluster. Presto also stores no data of its own; it queries data where it already lives. Snowflake, by contrast, is a fully managed service that stores the data itself and completely separates data storage from compute.

In this context, Presto can be considered a competitor to Snowflake for SQL analytics, as it provides similar query functionality while being open-source. Presto is also widely regarded as one of the faster interactive query engines currently available.
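As a rough illustration of what "creating the configuration" involves when self-hosting, the sketch below renders a minimal `etc/config.properties` for a coordinator node and for a worker node. The property names follow the Presto deployment documentation as we understand it, so verify them against the version you run. The helper `render_config`, the host `example.net`, and the port are placeholders, and memory limits and other settings are deliberately left out.

```python
def render_config(role: str, discovery_uri: str, port: int = 8080) -> str:
    """Return the text of a minimal config.properties for one node.

    `role` is either "coordinator" or "worker". This helper is hypothetical;
    it only assembles text and does not talk to a Presto cluster.
    """
    if role not in ("coordinator", "worker"):
        raise ValueError(f"unknown role: {role!r}")

    is_coordinator = role == "coordinator"
    props = {"coordinator": "true" if is_coordinator else "false"}
    if is_coordinator:
        # Keep query processing off the coordinator and run discovery on it.
        props["node-scheduler.include-coordinator"] = "false"
        props["discovery-server.enabled"] = "true"
    props["http-server.http.port"] = str(port)
    # Every node points at the same discovery service (placeholder URI).
    props["discovery.uri"] = discovery_uri

    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


coordinator_conf = render_config("coordinator", "http://example.net:8080")
worker_conf = render_config("worker", "http://example.net:8080")
print(coordinator_conf)
print(worker_conf)
```

With Snowflake there is no equivalent step, since the service provisions and configures compute on the user's behalf.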

Why is Presto so fast?

Speed is one of the main reasons Presto is popular, and it gives Presto an advantage over other SQL-on-Hadoop engines. Presto was initially developed at Facebook to handle its own data query requirements, which meant it had to be interactive while still completing queries on massive volumes of data quickly.

The entire design of Presto is centered on delivering high performance and high concurrency at low latency. The fundamental reason for its speed is that it processes queries in memory, whereas many competitors rely on MapReduce. While MapReduce offers high throughput, it does so at the expense of higher latency.

MapReduce operates using a “pull” model, in which each task pulls its input from the stored output of previous tasks, whereas Presto opts for a “push” model. In a push model, a SQL query is processed in multiple stages that run concurrently. A downstream stage receives data directly from the upstream stage that produces it, so intermediate data passes between stages significantly faster. MapReduce also writes intermediate results back to disk between steps. Presto, on the other hand, keeps intermediate data in memory and compiles parts of the query on the fly. While in-memory processing does limit fault tolerance, the result is significantly faster query processing.
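The push model described above can be sketched with three concurrent stages connected by small in-memory queues. This is a simplified illustration and not Presto's actual execution code: the stage functions, the queue size, and the use of Python threads are all assumptions made for the example. The point is that each stage pushes a row downstream as soon as it is produced, and nothing is written to disk between stages.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream


def scan_stage(rows, out_q):
    """Upstream stage: push each row downstream as soon as it is read."""
    for row in rows:
        out_q.put(row)
    out_q.put(_DONE)


def filter_stage(in_q, out_q, predicate):
    """Middle stage: forward matching rows without waiting for the full input."""
    while True:
        row = in_q.get()
        if row is _DONE:
            out_q.put(_DONE)
            return
        if predicate(row):
            out_q.put(row)


def sum_stage(in_q, result):
    """Final stage: aggregate rows as they arrive."""
    total = 0
    while True:
        row = in_q.get()
        if row is _DONE:
            result.append(total)
            return
        total += row


def run_pipeline(rows):
    # Small bounded queues hold only a few rows in flight, all in memory.
    q1 = queue.Queue(maxsize=4)
    q2 = queue.Queue(maxsize=4)
    result = []
    stages = [
        threading.Thread(target=scan_stage, args=(rows, q1)),
        threading.Thread(target=filter_stage, args=(q1, q2, lambda r: r % 2 == 0)),
        threading.Thread(target=sum_stage, args=(q2, result)),
    ]
    for stage in stages:
        stage.start()  # all stages run at the same time
    for stage in stages:
        stage.join()
    return result[0]


total = run_pipeline(range(10))
print(total)
```

A MapReduce-style run of the same query would instead finish the scan, store its full output, and only then start the filter. The sketch also shows the fault-tolerance trade-off: if any stage fails, there is no stored intermediate output to resume from.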

What is a typical deployment of Presto?

Typically, a deployment of Presto includes one Presto Coordinator and multiple Presto Workers. The Coordinator accepts the queries that clients submit and handles parsing, planning, and scheduling their execution across the Workers. The Workers, in turn, do the actual query processing. In general, the more Workers deployed, the faster queries can be processed, because the work is spread over more machines.
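The division of labor between Coordinator and Workers can be sketched as follows. This is a toy model under stated assumptions, not Presto code: `plan_splits`, `worker_task`, and `coordinator_avg` are hypothetical names, and a thread pool stands in for separate worker machines. The coordinator divides the input into splits, each worker computes a partial aggregate, and the coordinator merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, n_splits):
    """Coordinator side: divide the input into roughly equal splits."""
    return [rows[i::n_splits] for i in range(n_splits)]


def worker_task(split):
    """Worker side: compute a partial aggregate (count, sum) over one split."""
    return len(split), sum(split)


def coordinator_avg(rows, n_workers):
    """Plan the work, schedule it on the workers, and merge partial results."""
    splits = plan_splits(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_task, splits))
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return total / count


avg = coordinator_avg(list(range(1, 101)), n_workers=4)
print(avg)
```

Python threads will not speed up this particular computation; the sketch only shows why adding workers helps in a real cluster, where each split is processed on a different machine.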

Who uses Presto?

While Presto was initially developed by Facebook, it was later open-sourced and now has its own development foundation. It was built to query very large datasets spread across systems such as Hadoop, MySQL, and Hive.

Among the early adopters were large organizations like Netflix, Uber, and Airbnb. In short, Presto is used in production environments that operate at a very large scale and deal with multiple petabytes of data on a daily basis.

Given Presto’s capabilities and how it is architected, the ideal candidates for it are organizations that prioritize query speed above other metrics. Organizations that routinely run queries against hundreds of petabytes of data can benefit from Presto. Beyond high-speed query handling, Presto can also be a great option for organizations looking for an open-source distributed query engine. In most cases, an open-source solution has significant financial advantages: it can help organizations reduce costs, avoid long-term vendor lock-in, and retain the flexibility to extend the solution at a later date.

Have questions? We help companies like yours, every day.

Email us at hello@nextphase.ai

About NextPhase.ai

NextPhase.ai is a data cloud services provider specializing in Snowflake, cloud data management and analytics technologies. We accelerate enterprise digital transformation initiatives by leveraging our innovative cloud data management technology, “NextPhase.ai DATAFLO” to optimize and rationalize disparate enterprise data into relevant insights. “DATAFLO” is designed to automate the lifecycle of data management transformation using AI and ML along with expeditious on-ramps to the Snowflake data cloud infrastructure. NextPhase.ai provides a range of technology consulting services for the Financial Services, Biotech and Technology industry sectors combining our platform-based services, seasoned talent, and industry proven methodology so our customers can harness more from their data. We are a Silicon Valley based company with global presence having delivered high value service engagements for numerous Global 2000 enterprises.

 
