In what scenarios can Sqoop be used, and what are its features? How does Sqoop work to move data into Hive and HDFS?

Answer»

Basically, Sqoop is used to get data from a relational database such as DB2, MySQL, or Oracle and load it into Hadoop (HDFS, Hive, HBase, etc.), or vice versa. This process is called ETL, for Extract, Transform and Load. In other words, Sqoop can import and export data between a relational database and Hadoop.
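For example, a minimal import and the corresponding export might look like the commands below (the connection string, table, and directory simply reuse values that appear in the examples later in this answer):

sqoop import --connect jdbc:mysql://db.bib.com/sales --table EMPLOYEES --target-dir /user/bib/sales

sqoop export --connect jdbc:mysql://db.bib.com/sales --table EMPLOYEES --export-dir /user/bib/sales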

Below are some of the important features that Sqoop provides:

  • Full Load: Sqoop can load a single table, or all the tables in a database, with a single Sqoop command.
  • Incremental Load: Sqoop can perform incremental loads, meaning it retrieves only rows newer than a previously imported set of rows (see the sketch after this list).
  • Parallel import/export: Sqoop uses the YARN framework to import and export data. YARN provides parallelism, since multiple nodes read and write in parallel, and fault tolerance comes from HDFS replication, which is enabled by default.
  • Import results of SQL query: Sqoop can import the result set of an arbitrary SQL query into HDFS.
  • Compression: Sqoop can compress the data it imports from a database and offers several options for doing so. If you specify --compress while importing, Sqoop compresses the output files with gzip by default, producing files with a .gz extension; if you also pass --compression-codec, you can choose a different codec such as bzip2 (the sketch after this list shows both options).
  • Connectors for all major RDBMS databases: Sqoop ships with connectors for almost all popular relational databases.
  • Kerberos Security Integration: Sqoop supports Kerberos authentication, a protocol that uses tickets or keytabs to authenticate users and services before they connect to services such as HDFS or Hive.
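As a hedged sketch of the incremental-load and compression options (the check column, last value, and codec class below are assumptions chosen for illustration; the connection details reuse the ones from the examples later in this answer):

sqoop import --connect jdbc:mysql://db.bib.com/sales --table EMPLOYEES --incremental append --check-column employee_id --last-value 1000

sqoop import --connect jdbc:mysql://db.bib.com/sales --table EMPLOYEES --compress --compression-codec org.apache.hadoop.io.compress.BZip2Codec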
How Sqoop works:

Internally, Sqoop creates a SQL query for each mapper that ingests data from the source table into HDFS. By default 4 mappers are generated, but you can change the number of mappers based on your logic and requirements. The number of mappers works together with the split-by column: the split-by column is used in a WHERE condition, so each mapper handles one logical slice of the source table and writes its own partition of the target directory. For example, suppose we use three mappers, a split-by column, and a table with 1,000,000 records. Sqoop determines the boundaries with a MIN/MAX call to the database on the split-by column; the first mapper would then pull records 0 to 333333, the second mapper 333334 to 666666, and the last 666667 to 1000000.
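As a hedged illustration of what this looks like under the hood (the exact queries and boundary conditions vary by connector; the id column and the ranges below simply mirror the example above), Sqoop first runs a bounding query and then one range query per mapper:

-- bounding query, run once against the split-by column
SELECT MIN(id), MAX(id) FROM EMPLOYEES;

-- one range query per mapper (3 mappers in this example)
SELECT * FROM EMPLOYEES WHERE id >= 0 AND id <= 333333;
SELECT * FROM EMPLOYEES WHERE id >= 333334 AND id <= 666666;
SELECT * FROM EMPLOYEES WHERE id >= 666667 AND id <= 1000000;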

Sqoop runs a map-only job: a reduce phase is only needed for aggregations, and Apache Sqoop just imports and exports data without performing any aggregation. The map job launches multiple mappers, depending on the number defined by the user (3 in the example above). For a Sqoop import, each mapper task is assigned a part of the data to be imported, and Sqoop distributes the input evenly among the mappers to get high performance. Each mapper then creates a connection to the database using JDBC, fetches the part of the data assigned to it by Sqoop, and writes it into HDFS, Hive, or HBase based on the arguments provided on the CLI. In other words, the mappers drop the data in the target directory as files named part-m-00000, part-m-00001, part-m-00002, and so on.
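To check what the mappers produced, assuming the /user/bib/sales target directory used in the examples below, you could list and preview the part files (illustrative commands only):

hdfs dfs -ls /user/bib/sales

hdfs dfs -cat /user/bib/sales/part-m-00000 | head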

How Sqoop works to move data into Hive/HDFS:

In this scenario we discuss how Sqoop imports data from a relational database into Hive. Sqoop can only import the data into a Hive database as text files or sequence files. If you want to use the ORC file format, you must follow a two-stage approach: in the first stage, Sqoop lands the data in HDFS in text or sequence file format; in the second stage, Hive converts the data into the ORC file format (see the sketch below).
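A minimal sketch of that two-stage approach, assuming a hypothetical staging directory /user/bib/staging/employees and hypothetical Hive tables employees_stage and employees_orc (the column list reuses the one from the examples below):

Stage 1, Sqoop lands delimited text in HDFS:

sqoop import --connect jdbc:mysql://db.bib.com/sales --table EMPLOYEES --columns "employee_id,first_name,last_name,job_title" --target-dir /user/bib/staging/employees --fields-terminated-by ','

Stage 2, Hive converts the staged text to ORC:

CREATE EXTERNAL TABLE employees_stage (employee_id INT, first_name STRING, last_name STRING, job_title STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/bib/staging/employees';
CREATE TABLE employees_orc STORED AS ORC AS SELECT * FROM employees_stage;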
Please find the steps below.

  • You need all the source connection details to connect to the relational database:
    • Database URL: db.bib.com
    • Database name: sales
    • Connection protocol: jdbc:mysql
    • Source database username and password
    • Optionally, a file where the password is stored

Example: 

sqoop import --connect jdbc:mysql://db.bib.com/sales --table EMPLOYEES --username <username> --password-file ${user.home}/.password
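As a side note, the password file referenced above is typically created ahead of time with restrictive permissions, for example (assuming a file in the local home directory; it can also be stored on HDFS):

echo -n "yourpassword" > $HOME/.password

chmod 400 $HOME/.password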
  • Below are a few considerations when using parallelism during the import.
    • You can import the entire source table
    • You can select only specific columns from the table
    • You can import only the latest records by restricting them with a WHERE clause
    • You can set the number of map tasks to control write parallelism; in that case, Sqoop evenly splits the primary-key range of the source table

Example:

sqoop import --connect jdbc:mysql://db.bib.com/sales --table EMPLOYEES --columns "employee_id,first_name,last_name,job_title" --where "start_date > '2010-01-01'" --num-mappers 8

  • Below are a few scenarios where you can specify a split-by column.
    • We can choose the split key with --split-by
    • We can combine --split-by with a free-form query that includes the $CONDITIONS placeholder

Example:

sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' --split-by a.id --target-dir /user/bib/sales
  • The destination can be specified as an HDFS directory or a Hive table.
    Examples:

1. HDFS as target directory

sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' --split-by a.id --target-dir /user/bib/sales

2. Hive as target table

sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import
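If you need more control over the destination, you can also name the Hive table explicitly; the database and table names below are hypothetical:

sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import --hive-table corp.employees --hive-overwrite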

