1. How Much Memory Is Required?
|
Answer» Although Impala is not an in-memory database, when dealing with large tables and large result sets, you should expect to dedicate a substantial portion of physical memory to the impalad daemon. The recommended physical memory for an Impala node is 128 GB or higher. If practical, devote approximately 80% of physical memory to Impala. The amount of memory required for an Impala operation depends on several factors, as the following examples illustrate.
For example, if you execute the query select * from giant_table order by some_column limit 1000; on a 50-node cluster, each of those 50 nodes transmits at most 1000 rows back to the coordinator node. The coordinator therefore needs enough memory to sort LIMIT * cluster_size rows (50,000 here), even though the final result set contains at most LIMIT rows, 1000 in this case. Likewise, if you execute the query select * from giant_table where test_val > 100 order by some_column; each node filters the rows matching the WHERE clause, sorts them (with no size limit), and sends the sorted intermediate rows back to the coordinator. The coordinator might need substantial memory to sort the final result set, and so might use a temporary disk work area for that final phase of the query.
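As a concrete sketch, here are the two query shapes above as you might run them from impala-shell. The MEM_LIMIT query option is Impala's standard per-query memory cap and is included as one way to bound the memory an unbounded sort can consume on each node; the 2g value and the giant_table schema (columns some_column and test_val) are illustrative assumptions, not recommendations.

  -- Optionally cap the memory each impalad may use for this session's
  -- queries; queries that exceed the cap spill to disk or fail rather
  -- than exhausting the node's memory. The 2g value is an assumption.
  SET MEM_LIMIT=2g;

  -- Top-N query: each of the 50 nodes sends at most 1000 rows, so the
  -- coordinator sorts up to LIMIT * cluster_size = 50,000 rows.
  SELECT * FROM giant_table ORDER BY some_column LIMIT 1000;

  -- Unbounded sort: every node sorts its full set of matching rows and
  -- the coordinator merges them all, possibly using a temporary disk
  -- work area for the final phase.
  SELECT * FROM giant_table WHERE test_val > 100 ORDER BY some_column;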