1.

How Much Memory Is Required?

Answer»

Although Impala is not an in-memory database, when dealing with large tables and large result sets, you should expect to dedicate a substantial portion of physical memory to the impalad daemon. Recommended physical memory for an Impala node is 128 GB or higher. If practical, devote approximately 80% of physical memory to Impala.
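
As a rough sketch of that guideline, one common way to apply the cap is the impalad mem_limit startup option. The 80% figure below is an assumption for a hypothetical dedicated 128 GB node, and it assumes a release that accepts percentage values for the option; an absolute byte value works as well:

impalad --mem_limit=80%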

The amount of memory required for an Impala operation depends on several factors:

  • The file format of the table. Different file formats represent the same data in more or fewer data files. The compression and encoding for each file format might require a different amount of temporary memory to decompress the data for analysis.
  • Whether the operation is a SELECT or an INSERT. For example, Parquet tables require relatively little memory to query, because Impala reads and decompresses data in 8 MB chunks. Inserting into a Parquet table is a more memory-intensive operation because the data for each data file (potentially hundreds of megabytes, depending on the value of the PARQUET_FILE_SIZE query option) is stored in memory until it is encoded, compressed, and written to disk (see the Parquet sketch below).
  • Whether the table is partitioned or not, and whether a query against a partitioned table can take advantage of partition pruning (see the partitioning sketch below).
  • Whether the final result set is sorted by an ORDER BY clause. Each Impala node scans and filters a portion of the total data, and applies the LIMIT to its own portion of the result set. In Impala 1.4.0 and higher, if the sort operation requires more memory than is available on any particular host, Impala uses a temporary disk work area to perform the sort. The intermediate result sets are all sent back to the coordinator node, which does the final sorting and then applies the LIMIT clause to the final result set.

For example, if you execute the query:

select * from giant_table order by some_column limit 1000;

and your cluster has 50 nodes, then each of those 50 nodes will transmit a maximum of 1000 rows back to the coordinator node. The coordinator node needs enough memory to sort (LIMIT * cluster_size) rows, although in the end the final result set is at most LIMIT rows, 1000 in this case.

Likewise, if you execute the query:

select * from giant_table where test_val > 100 order by some_column;

then each node filters out a set of rows matching the WHERE conditions, sorts the results (with no size limit), and sends the sorted intermediate rows back to the coordinator node. The coordinator node might need substantial memory to sort the final result set, and so might use a temporary disk work area for that final phase of the query.

  • Whether the query contains any join clauses, GROUP BY clauses, analytic functions, or DISTINCT operators. These operations all require in-memory work areas whose size varies with the volume and distribution of the data. In Impala 2.0 and later, these kinds of operations spill to temporary disk work areas if memory usage grows too large.
  • The size of the result set. When intermediate results are passed between nodes, the amount of data depends on the number of columns returned by the query. For example, it is more memory-efficient to query only the columns that are actually needed in the result set rather than always issuing SELECT *.
  • The mechanism by which work is divided for a join query. Use the COMPUTE STATS statement, and query hints in the most difficult cases, to help Impala pick the most efficient execution plan (see the join sketch below).
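
For example, a minimal sketch of a memory-hungry Parquet load (the table names sales_parquet and sales_staging are hypothetical); lowering the PARQUET_FILE_SIZE query option shrinks the buffer that must be held in memory for each data file being written:

set PARQUET_FILE_SIZE=134217728;   -- hold roughly 128 MB per data file in memory while writing
insert into sales_parquet select * from sales_staging;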
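
Likewise, a sketch of partition pruning (the events table and its columns are hypothetical); a query that filters on the partition key only reads, decompresses, and buffers the matching partitions:

create table events (event_id bigint, payload string) partitioned by (event_year int) stored as parquet;
select count(*) from events where event_year = 2024;   -- only the event_year=2024 partition is scanned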
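
Finally, a sketch of the join-related points (the orders and customers tables are hypothetical, and the [SHUFFLE] hint is shown only as the kind of hint you might add in a difficult case). Gathering statistics helps Impala size the in-memory join and aggregation work areas sensibly, and selecting only the needed columns keeps the intermediate result set small:

compute stats orders;
compute stats customers;
select c.region, sum(o.total_price)
from orders o join [SHUFFLE] customers c on o.customer_id = c.customer_id
group by c.region;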
