1. Why Does My Insert Statement Fail?

Answer»

When an INSERT statement fails, it is usually the result of exceeding some limit within a Hadoop component, typically HDFS.

  • An INSERT into a partitioned table can be a strenuous operation due to the possibility of opening many files and associated threads simultaneously in HDFS. Impala 1.1.1 includes some improvements to distribute the work more efficiently, so that the values for each partition are written by a single node, rather than as a separate data file from each node.
  • Certain expressions in the SELECT part of the INSERT statement can complicate the execution planning and result in an inefficient INSERT operation. Try to make the column data types of the source and destination tables match up, for example by doing ALTER TABLE ... REPLACE COLUMNS on the source table if necessary. Try to avoid CASE expressions in the SELECT portion, because they make the result values harder to predict than transferring a column unchanged or passing the column through a built-in function. (See the SQL sketch after this list.)
  • Be prepared to raise some limits in the HDFS configuration settings, either temporarily during the INSERT or permanently if you frequently run such INSERT statements as part of your ETL pipeline. (See the configuration sketch after this list.)
  • The resource usage of an INSERT statement can vary depending on the file format of the destination table. Inserting into a Parquet table is memory-intensive, because the data for each partition is buffered in memory until it reaches 1 gigabyte, at which point the data file is written to disk. Impala can distribute the work for an INSERT more efficiently when statistics (gathered with COMPUTE STATS) are available for the source table that is queried during the INSERT statement.
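
The sketch below pulls these suggestions together as Impala SQL. The table names (staging_events, events_parquet), column names, and partition value are hypothetical placeholders, so adapt them to your own schema; the point is only to show type alignment with ALTER TABLE ... REPLACE COLUMNS, gathering source statistics with COMPUTE STATS, and a simple, CASE-free INSERT ... SELECT that writes one partition at a time.

    -- Hypothetical Parquet destination table, partitioned by date.
    CREATE TABLE events_parquet (
      event_id BIGINT,
      event_ts TIMESTAMP,
      payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET;

    -- Align the source column types with the destination so the INSERT
    -- does not have to convert values on the fly.
    ALTER TABLE staging_events REPLACE COLUMNS (
      event_id BIGINT,
      event_ts TIMESTAMP,
      payload  STRING
    );

    -- Statistics on the source table help Impala distribute the INSERT work.
    COMPUTE STATS staging_events;

    -- Keep the SELECT list simple (no CASE expressions) and write one
    -- partition per statement where practical.
    INSERT INTO events_parquet PARTITION (event_date = '2014-01-01')
    SELECT event_id, event_ts, payload
    FROM staging_events
    WHERE to_date(event_ts) = '2014-01-01';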

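For the HDFS limits, the setting most often raised for wide partitioned INSERTs is the DataNode transfer-thread count (dfs.datanode.max.transfer.threads, the newer name for the older dfs.datanode.max.xcievers setting), along with the operating-system open-file-descriptor limit on each node. A minimal hdfs-site.xml sketch follows; the value shown is only an illustrative starting point to tune for your own cluster.

    <!-- hdfs-site.xml: raise the DataNode thread limit so a wide,
         partitioned INSERT can keep many files and streams open at once.
         The value below is illustrative, not a recommendation. -->
    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>8192</value>
    </property>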


