What is Presto, Facebook Presto Database, PrestoSQL, or PrestoDB? A powerful SQL query engine

Published in

Openbridge

8 min read · May 23, 2019

If you have heard of the Amazon Athena interactive query service, then you are already familiar with Presto. PrestoDB is the open-source SQL query engine that powers the AWS Athena service, making data lakes easy to analyze with columnar formats like Apache Parquet.

While Athena is one of the more visible commercial offerings, it certainly is not the only path for those interested in the software.

Facebook Presto History

Presto has its technical roots in the Hadoop world at Facebook. Performance challenges with its existing tooling drove Facebook to develop the software. As a result, the project was born in 2012 and rolled out company-wide in 2013. Later in 2013, Facebook open-sourced it under the Apache License.

Here is what Facebook said of its pursuit of the project:

For the analysts, data scientists, and engineers who crunch data, derive insights, and work to continuously improve our products, the performance of queries against our data warehouse is important. Being able to run more queries and get results faster improves their productivity.

Facebook noted vital differences in how it approaches certain operations:

In contrast, the Presto engine does not use MapReduce. It employs a custom query and execution engine with operators designed to support SQL semantics. In addition to improved scheduling, all processing is in memory and pipelined across the network between stages. This avoids unnecessary I/O and associated latency overhead.
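To make the idea of pipelined, in-memory processing more concrete, here is a minimal Python sketch. It illustrates the general technique only and is not Presto's actual implementation; the operator names (scan, filter_rows, sum_by_key) and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, Tuple

Row = Tuple[str, int]  # (country, clicks): a made-up schema for illustration


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Source operator: yields one row at a time to the next stage."""
    for row in rows:
        yield row


def filter_rows(rows: Iterator[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Filter operator: passes matching rows downstream as soon as they arrive."""
    return (row for row in rows if predicate(row))


def sum_by_key(rows: Iterator[Row]) -> Dict[str, int]:
    """Aggregation operator: consumes the stream and keeps only running totals in memory."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


sample = [("US", 3), ("DE", 0), ("US", 4), ("FR", 2)]
# Rows flow through scan -> filter -> aggregate; no intermediate
# result is written out between the stages.
result = sum_by_key(filter_rows(scan(sample), lambda row: row[1] > 0))
print(result)
```

Because each stage is a generator, a row moves through the whole chain before the next one is read, which is the opposite of writing a full intermediate result between steps as a batch job would.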

Facebook also provided a simplified architecture overview.

One of the key features is that it allows you to make analytic queries against data in different sources of varying sizes. As a result of this model, Presto is a query engine designed with a lot of data connectors.

It supports querying data in RDBMS, Hive, and other data stores. This includes non-relational sources like Hadoop HDFS, Amazon S3, HBase, and relational sources such as MySQL, PostgreSQL, Redshift, SQL Server, and others.

Another goal was to support standard ANSI SQL, including ad hoc aggregations, joins, left/right outer joins, sub-queries, distinct counts, and many others.

As a result, it can act as a SQL query proxy, allowing you to combine data from multiple sources across your organization using familiar SQL. Depending on your architecture, this can be a complement to data warehouses, especially for organizations that use a federated model where having these connectors adds value.
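As a rough illustration of the federated model, the Python sketch below assembles a single SQL statement that joins a table in a Hive catalog with a table in a MySQL catalog. Presto addresses tables as catalog.schema.table; the catalog, schema, table, and column names here are hypothetical, and the helper only builds the query text rather than sending it to a cluster.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified table name in catalog.schema.table form."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query(events_table: str, customers_table: str) -> str:
    """Build one SQL statement that joins tables living in two different catalogs."""
    return (
        "SELECT c.region, count(DISTINCT e.user_id) AS active_users\n"
        f"FROM {events_table} AS e\n"
        f"LEFT JOIN {customers_table} AS c ON e.user_id = c.user_id\n"
        "GROUP BY c.region\n"
        "ORDER BY active_users DESC"
    )


# Hypothetical names: a Hive catalog over data lake files and a MySQL catalog.
events = qualified("hive", "web", "page_views")
customers = qualified("mysql", "crm", "customers")
sql = build_federated_query(events, customers)
print(sql)
```

The resulting text could then be submitted through whichever Presto client you already use. The point of the sketch is that one statement spans both sources, with the join, distinct count, and aggregation expressed in ordinary SQL.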

What is prestodb or prestosql? Fork or Official

In 2019, three of the original Facebook Presto team members, Martin Traverso, Dain Sundstrom, and David Phillips, formed the “Presto Software Foundation.” This foundation is meant to oversee their fork of the official project. The Presto fork is often referred to as prestosql online.

On GitHub, the fork is located at prestosql/presto while the official project is prestodb/presto. As you can imagine, this is leading to confusion as both projects seem to be synonymous with each other. For example, here are project descriptions for each on GitHub:

prestodb/presto: “The official home of the Presto distributed SQL query engine for big data https://prestodb.github.io.”
prestosql/presto: “Official home of Presto, the distributed SQL query engine for big data https://prestosql.io.”

Unfortunately, it is not clear why the prestosql/presto fork, or foundation, references itself as being “official.” They should own the fact that they left Facebook and forked their project rather than cast themselves as the official Presto distribution. The prestosql team has the heritage and credentials to tell a great story, so the efforts to package their fork as the official project, including on Wikipedia, are unfortunate. It seems like a missed opportunity to go down that path. This posture contributes to a level of confusion and serves no benefit to the broader Presto community.

For now, we would suggest focusing your development efforts on the core project rather than the fork. People should start with http://prestodb.github.io/ and https://github.com/prestodb/presto as two principal official resources for the project. This will ensure you are not mistakenly investing time and energy in the wrong places.

Facebook Presto Performance

Presto was designed for running interactive analytic queries fast. Query execution runs in parallel, with most results returning in seconds. The expectation is that the query engine will deliver response times ranging from sub-second to minutes. In addition to speed, making it easy to analyze large datasets in sources such as Amazon S3 and supporting standard SQL were important goals.

Another performance consideration is the data consumption pattern you have. For example, let’s say data is resident within Parquet files in a data lake on the Amazon S3 file system. You wrap Presto (or Amazon Athena) as a query service on top of that data. Lastly, you leverage Tableau to run scheduled queries that will store a “cache” of your data within the Tableau Hyper Engine.

In this model, Tableau acts as an ad hoc query cache for Presto. This allows you to store data locally to the Tableau Hyper Engine vs. live calls to Presto/Athena each time. As a result, all subsequent queries in a Tableau visualization happen against the data resident in Hyper rather than the query engine. This results in high-speed analytics and reduced costs, essential for users of business intelligence and data visualization software.
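The caching pattern can be sketched in a few lines of Python. This is a simplified model and not how the Tableau Hyper Engine works internally: fake_engine is a hypothetical stand-in for a call to Presto or Athena, and ExtractCache, the query text, and the sample rows are illustrative names and values.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[Tuple[str, int]]


class ExtractCache:
    """Holds a local copy of query results, refreshed on a schedule rather than per request."""

    def __init__(self, run_on_engine: Callable[[str], Rows]) -> None:
        self._run_on_engine = run_on_engine
        self._extracts: Dict[str, Rows] = {}
        self.engine_calls = 0  # how many times the remote engine was actually queried

    def refresh(self, sql: str) -> None:
        """Scheduled step: query the engine once and store the result locally."""
        self._extracts[sql] = self._run_on_engine(sql)
        self.engine_calls += 1

    def read(self, sql: str) -> Rows:
        """Dashboard step: serve from the local extract; refresh only if none exists yet."""
        if sql not in self._extracts:
            self.refresh(sql)
        return self._extracts[sql]


def fake_engine(sql: str) -> Rows:
    """Hypothetical stand-in for Presto/Athena that returns fixed rows."""
    return [("2019-05-22", 120), ("2019-05-23", 135)]


cache = ExtractCache(fake_engine)
query = "SELECT day, orders FROM sales_daily"  # hypothetical table
cache.refresh(query)                             # the scheduled extract
views = [cache.read(query) for _ in range(100)]  # 100 dashboard interactions
print(cache.engine_calls)
```

In this sketch the hundred dashboard reads are all served from the local extract, so the engine is queried only by the scheduled refresh. With a live connection, each of those reads would have been a separate query.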

See the post Building A Serverless Business Intelligence Stack With Apache Parquet, Tableau, and Amazon Athena.

Who uses Presto?

Facebook, Nasdaq, Airbnb, Netflix, Atlassian, and many more have indicated they are using the query engine. However, it is likely many others are also running the software when you factor in the AWS offerings in EMR and Athena. For example, we are working with Fortune 500 companies that have deployed serverless data analytics stacks using Athena, Tableau, and Apache Parquet. As a result, the number of actual Presto users may be underreported.

The broader community can be found here or on Facebook.

Commercial Presto Solutions

As we referenced earlier, the software is commonly deployed in the cloud, though using Docker means you can run it locally or on-premises. However, it was designed so that it can easily be paired with cloud infrastructure for scaling. This allows a Presto query to deliver exceptional performance, scalability, reliability, availability, and economies of scale for data ranging from gigabytes to petabytes in size.

Amazon Web Services

If you are currently a Redshift user, you may be interested in our Redshift Spectrum vs Athena comparison.

Both Amazon EMR and Amazon Athena are examples of cloud-based deployments. Like most things AWS, they handle the bulk of set up, infrastructure, operations, and testing for you.

We mentioned Amazon Athena a few times already. Amazon Athena is a leading commercial offering of the software. It lets you deploy the query engine within AWS as a serverless platform. This means no servers, virtual machines, or clusters to set up, manage, or tune. Athena automatically parallelizes interactive queries and dynamically scales resources as needed.

With Athena, you pay only for the queries that you run. Another benefit is that many existing Business Intelligence (BI) tools, like Tableau, support Athena natively.
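Since Athena bills according to the amount of data each query scans, a back-of-the-envelope estimate is simple arithmetic. The Python sketch below assumes a price of 5 USD per terabyte scanned and a binary terabyte; both are assumptions to check against current AWS pricing for your region. The byte counts are made-up numbers, chosen to show why reading only the needed columns of a columnar format such as Parquet tends to cost less than scanning full rows.

```python
TB = 1024 ** 4  # bytes per terabyte (binary convention; an assumption)


def estimate_query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate the cost of one query from the bytes it scans.

    usd_per_tb is an assumed list price, not a value taken from this article.
    """
    return bytes_scanned / TB * usd_per_tb


full_scan = 2 * TB        # hypothetical: the query reads every column of a 2 TB dataset
columnar_scan = TB // 10  # hypothetical: the query reads only the columns it needs

print(round(estimate_query_cost(full_scan), 2))
print(round(estimate_query_cost(columnar_scan), 2))
```

The same function can be applied to the bytes-scanned figure that Athena reports for a finished query to see how partitioning or column pruning changes the bill.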

Starburst Presto and Ahana

Other companies, like Starburst Data and Ahana, provide the ability for you to launch a Presto cluster in minutes without complicated setup, maintenance, or tuning. For example, on AWS, Starburst’s CloudFormation and AMI provide the tools to get started quickly. Ahana offers AWS and Docker Hub options. Ahana also offers enterprise Presto support options for those that want to go beyond a self-service model.

Whether you go the AWS, Starburst, or “roll your own” path, Presto is a great technology for those seeking performance, flexibility, and a non-intrusive technical layer within their data stack.

Facebook Presto Data Visualization, Reporting, and Analytics

Support for the query engine is gaining traction across a wide variety of data visualization and business intelligence tools. Today, there are several options available to analysts for tapping into your data via Presto.

Tableau
Looker
Mode Analytics
Amazon QuickSight (Athena & EMR)
Microsoft Power BI has a connection offered through a third-party driver
MicroStrategy
Redash
You can also use SQL tools like SQL Workbench to connect to Presto via third-party drivers

There are many other options in addition to the ones listed above. The point is that Presto is a first-class citizen in data analytics and visualization tooling.

Looking To Get Started With Presto?

Want a quick start with Presto? You can get the benefits of Presto with AWS Athena. Try our fully automated, code-free, zero administration AWS Athena data ingestion service. It has never been easier to get your data into Amazon Athena for use with Tableau or other leading BI platforms.

Ready to go Presto with AWS Athena?

We have launched a code-free, zero-admin, fully automated data pipeline that automates database and table creation, Parquet file conversion, Snappy compression, partitioning, and more.

Get started with Amazon Athena for free!

Want to discuss Presto or Amazon Athena for your organization? Need a platform and team of experts to kickstart your data and analytics efforts? We can help! Getting traction when adopting new technologies can be difficult, especially if it means your team is working in different and unfamiliar ways. This is especially true in a self-service only world. If you want to discuss a proof-of-concept, pilot, project, or any other effort, the Openbridge platform and team of data experts are ready to help.

Reach out to us at hello@openbridge.com. Prefer to talk to someone? Set up a call with our team of data experts.
