This is the first in a series of posts on the architectures of analytic databases. This is relational database technology that has been “supercharged” in some way to handle large amounts of data such as typical data warehouse workloads. The series will examine massively parallel processing (MPP), columnar databases, appliances and in-database analytics. Our purpose is to help those evaluating analytic database technologies understand some of the alternative approaches so they can differentiate between different vendors’ offerings. There is no single best solution for all analytics for all types of applications; usually the decision involves a series of trade-offs. Understanding what you might be giving up or gaining, you may be able to make a better decision about which solution is best for your organization’s needs.
Analytic databases exist because of the need for speed and scale. Our benchmark research on business intelligence and performance management show that performance and scalability are key considerations seeking improvement in this area. Some years ago operating on summary data via OLAP cubes could provide speed to end users, but that approach is no longer sufficient for many types of analyses.
In this post we’ll look at MPP concepts including how these techniques speed up processing, what some of the challenges are and the different approaches that vendors use. Fundamentally, massively parallel processing is about splitting a database workload across multiple servers or nodes and executing a portion of the workload in parallel on each of the nodes. Once the parallel activities are completed, the results from each node are stitched together to deliver the end results.
MPP systems eliminate the need to manually split workloads across multiple machines. The manual technique known as sharding became a common way to overcome limitations of single-server database systems such as MySQL. Many organizations found that a single instance of their database could not deliver the performance or scale they needed, so they created application level schemes to split the data manually across multiple machines. For instance, all data for customers whose last name began with A-M would be stored in one database instance on one machine and all data for N-Z on another. With this approach and a lot of effort (i.e., coding) you could get theoretically almost twice the performance and twice the scale out of your system. However, the amount of coding and maintenance in such a system can become significant. Instead many organizations have opted to purchase an MPP system with the built-in ability to split workloads.
As you explore different MPP architectures you’ll need to understand how a vendor segments the data across the nodes in the system as well as how it segments the workload. Both are necessary to accomplish the maximum degree of parallelization in the system. If the data is not evenly distributed across the nodes, it can affect the speed of the queries. For example, suppose you have a cluster consisting of five nodes. If 80% of your data is on one node and 5% of the data is on each of the other nodes, it’s obvious that a query would have to wait for the node with 80% of the data to finish its work while the nodes with only 5% of the data probably sat idle. If the data were split evenly with 20% on each node, it’s likely that all five nodes would finish their work at approximately the same time and considerably faster than the unbalanced arrangement.
When it comes to how the workload is segmented across the nodes, the data distribution dictates which nodes must be involved in queries, but vendors have adopted different styles of workload distribution. MPP systems can be designed with either homogeneous or heterogeneous workload distribution. In the case of homogeneous nodes, the workload – whether loading or querying data – is distributed to all nodes in the system. In the case of heterogeneous nodes, each node is designated for a specific type of workload. For instance, queries are directed to one set of nodes and load processes are directed to a different set of nodes.
It is not obvious that one approach is better than the other. However, in the case of homogeneous nodes, the burden is on the software vendor to design a system that can ensure an optimized and balanced workload. In the case of heterogeneous nodes, the burden is shifted somewhat to the enterprise deploying the system. If the load processes are slow, you can alter the number of nodes dealing with the load processes, by adding more nodes to the system or changing the designation of some of the query nodes to data loading. I like to think of this trade-off as the difference between an automatic transmission and a standard transmission. Some people like the control offered by the standard transmission, while others don’t want to have to think about shifting.
Another issue to understand is where queries are initiated within the system. The initiation process can become a bottleneck if all queries must funnel through one or a few nodes of the system. The reason is that the node initiating the query has extra responsibilities. Some amount of post-processing is necessary by the query initiator. For instance, if the user requests the top 10 products based on sales, each node is asked to produce its list of the top 10 within its segment of the data and ships the results to the initiator. Then the initiator compares each of the top 10 results sets to come up with the top 10 overall. Ideally the bulk of the work is done in parallel on each of the nodes but then a small amount of work is done on the aggregated results on the initiating node alone.
Unfortunately it’s not always that simple, and the load on the initiator can become significant. Not all problems can be neatly divided and performed on a specific segment of the data and then combined at the end of the process for some finishing touches. Some calculations require data to move back and forth across multiple nodes. For example, supposed data is segmented by across nodes by customer ID and you want to calculate each customer’s revenue as a percentage of the best-selling product in each region, state or city – in other words, which customers are contributing the most to the products that are selling well in that area. The data for one customer may all be on the same node, but to calculate total product sales in each region city or state you have to look at sales from all the nodes since any customer could have made a purchase in any region.
The simplest approach is to ship all the product sales data to the query initiator and have the initiator perform the rest of the processing. While this approach may be the best strategy, it probably is not. The query optimizer needs to evaluate a series of trade-offs to get the best response time for the query. Should it ship all the data to one node and do all the processing on that one node? While not highly parallel, that strategy might work if the total amount of data that needs to be combined is relatively small, for example, if the sales are restricted to just a few products and a few time periods.
Alternatively, should the query optimizer ship some portion of data to each of the nodes to speed up processing? This scenario might work if there is a relatively small set of data that can be constructed (for example, product sales by region) that would be combined with large amounts of data on each node. These examples oversimplify the complexities that the optimizers must deal with. For examples of some of the considerations in MPP query plan optimization and the way some vendors have tackled these issues, see these posts by EMC Greenplum and Vertica.
Queries can be much more complicated than the examples given above; data distribution schemes vary and the speed of moving data across nodes is affected by interconnect speeds, storage medium and volume of data to be moved. Query optimizers need to consider many points and must have algorithms that produce efficient plans.
An IT organization considering an MPP database product needs to understand a few things in order to evaluate and maximize the performance of different database systems, starting with the data distribution scheme. Typically, the person designing the database implementation must provide some guidance or information to the MPP database product (such as which keys or columns to use in its distribution scheme) to help it distribute data in a way that produces a highly parallel and evenly distributed workload.
I recommend that you test your typical query workload against the database to see what kind of results you get. If your workload never changes, this approach may be sufficient. However, if your workload has an ad-hoc component or if it changes over time, odds are you will encounter some sort of challenge in query tuning sooner or later. In that case, you need to understand the types of techniques the optimizer uses in its query plan design.
Here are some of the issues to consider:
• Where do queries get initiated? Are there enough nodes processing queries?
• Similarly, are there enough nodes loading data?
• How does data loading impact query processing? If the nodes are separate, this might be easier to understand and address a conflict between loading and query processing.
• How much data and which data (if any) should be replicated across nodes? Keeping frequently accessed data on all nodes can minimize the movement of data across nodes and speed processing.
• Can redundant copies of the data (perhaps used to achieve high availability) be used in the query plans to minimize data movement across nodes and facilitate more local joins?
• What tools are available to you to understand the query execution plans? Can you use these tools to identify the bottlenecks?
The bottom line is that you need to understand the “secrets” of the MPP database systems you are considering or using. One is the data distribution strategy and the extent to which you can or need to influence it. Another key is the workload distribution strategy and what causes it difficulties. Evaluate those difficulties and determine how they might affect your typical workloads. Finally, there are additional tricks or techniques that can be used to maximize parallelization and minimize data movement. Evaluate whether those additional techniques are well documented and easily applied when needed.