1.

Explain the ETL process with respect to Big Data.

Answer»

ETL stands for Extract-Transform-Load. Big Data is mostly unstructured, comes in very large quantities, and accumulates at a very fast pace.

So, at the time of extraction, it becomes very difficult to transform the data because of its sheer volume, velocity, and variety. Also, we cannot afford to lose Big Data, so it needs to be stored as-is and then transformed and analysed later, as per the business requirements.

The process of extraction of Big Data involves the retrieval of data from various data sources.

The enterprises extract data for various reasons such as:

  1. For further processing
  2. Migrate it to some other data repository such as a data warehouse/data lake
  3. For analyzing etc.

Sometimes, while extracting the data, it may be desired to add some additional information to it, depending on the business requirements. This additional information can be something like geolocation data, timestamps, etc. This is called data enrichment.
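As a minimal sketch of data enrichment during extraction, the function below copies a record and attaches an extraction timestamp and optional geolocation. The field names (`extracted_at`, `geo`) are illustrative assumptions; a real pipeline would use whatever schema the target store expects.

```python
from datetime import datetime, timezone

def enrich(record, geolocation=None):
    """Return a copy of the record with enrichment fields added."""
    enriched = dict(record)
    # Stamp the record with the UTC time of extraction
    enriched["extracted_at"] = datetime.now(timezone.utc).isoformat()
    # Optionally attach geolocation data supplied by the caller
    if geolocation is not None:
        enriched["geo"] = geolocation
    return enriched

row = {"user_id": 42, "event": "click"}
print(enrich(row, geolocation={"lat": 51.5, "lon": -0.1}))
```

Keeping enrichment as a pure function that returns a copy makes it easy to slot into any stage of the pipeline without mutating the source data.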

Sometimes it may be required to consolidate the data with other data already in the target datastore. These different processes are collectively known as ETL, i.e. Extract-Transform-Load.

In ETL, Extraction is the very first step.

The Big Data tools for data extraction assist in collecting the data from a variety of different data sources. Their typical functionalities are:

  1. Extract the data from various homogeneous/heterogeneous sources.
  2. Transform it to store in a proper format/structure for further processing and querying.
  3. Load the data into the target store, such as a data mart, an operational data store, or a data warehouse.

In ETL tools, these three steps are commonly executed in parallel. Since extracting the data takes a long time, transformation starts as soon as some data has been pulled, processing it and preparing it for loading.

As soon as some data is ready for loading into the target store, loading begins immediately, without waiting for the previous steps to complete.
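The parallel, pipelined execution described above can be sketched with queues connecting three worker threads, so transformation and loading begin as soon as the first extracted rows arrive. This is a toy illustration, not a real ETL tool: the sources, the key-lowercasing "transformation", and the in-memory target list are all assumptions for the example.

```python
import queue
import threading

def extract(source_rows, out_q):
    # Extraction: pull rows from the source and hand them downstream
    for row in source_rows:
        out_q.put(row)
    out_q.put(None)  # sentinel: extraction finished

def transform(in_q, out_q):
    # Transformation starts on rows as soon as they are extracted,
    # without waiting for extraction to finish
    while (row := in_q.get()) is not None:
        out_q.put({k.lower(): v for k, v in row.items()})
    out_q.put(None)

def load(in_q, target):
    # Loading likewise begins as soon as transformed rows appear
    while (row := in_q.get()) is not None:
        target.append(row)

def run_pipeline(source_rows):
    q1, q2, target = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=extract, args=(source_rows, q1)),
        threading.Thread(target=transform, args=(q1, q2)),
        threading.Thread(target=load, args=(q2, target)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return target
```

The `None` sentinel pushed through each queue is a simple way to signal end-of-stream so each downstream stage knows when to stop.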

ETL for Structured Data:

If the data under consideration is structured, then the extraction process is performed generally within the source system itself.

Following extraction strategies may be used:

  1. Full Extraction: In the full extraction method, the data is extracted completely from the source. Tracking changes is not required. The logic here is simpler, but the load on the system is greater.
  2. Incremental extraction: In the incremental extraction method, the changes occurring in the source data are tracked from the last successful extraction, so you are not required to extract all of the data every time a change occurs.

For this, a change table is created to track the changes. Some data warehouses have a special built-in functionality known as CDC (Change Data Capture).

The logic required for incremental data extraction is a little bit more complex but the load on the system is reduced.
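A minimal sketch of the incremental strategy: filter the source rows by a watermark recording the last successful extraction, then advance the watermark. It assumes each row carries an `updated_at` value maintained by the source (simplified here to an integer); real systems typically rely on a change table or built-in CDC instead.

```python
def incremental_extract(source_rows, last_extracted_at):
    """Return rows changed since the previous successful extraction,
    plus the new watermark to persist for the next run."""
    # Keep only rows modified after the stored watermark
    changed = [r for r in source_rows if r["updated_at"] > last_extracted_at]
    # Advance the watermark to the newest change seen (or keep the old one)
    new_watermark = max(
        (r["updated_at"] for r in changed), default=last_extracted_at
    )
    return changed, new_watermark
```

Persisting the returned watermark after a successful load is what makes the next extraction incremental rather than full.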

ETL for Unstructured Data:

When the data under consideration is unstructured, a major part of the work goes into preparing the data so that it can be extracted. In most cases, such data is stored in data lakes until it is required for some kind of processing, analysis, or migration.

The data is cleaned up by removing the so-called 'noise' from it.

It is done in the following ways:

  1. Removing whitespaces/symbols
  2. Removing duplicate results
  3. Handling missing values.
  4. Removing outliers etc.
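The four clean-up steps listed above can be sketched for a list of raw numeric strings. The parsing rules, the default fill value, and the z-score cutoff for outliers are all illustrative assumptions; real clean-up logic depends on the data at hand.

```python
import statistics

def clean_numeric(raw, default=0.0, z_cutoff=3.0):
    # 1. Remove whitespace/symbols, keeping digits, '.' and '-';
    #    anything unparseable becomes a missing value (None)
    parsed = []
    for item in raw:
        text = "".join(ch for ch in str(item) if ch.isdigit() or ch in ".-")
        parsed.append(float(text) if text not in ("", ".", "-") else None)

    # 2. Remove duplicate results (keep first occurrence)
    seen, deduped = set(), []
    for v in parsed:
        if v not in seen:
            seen.add(v)
            deduped.append(v)

    # 3. Handle missing values by filling in a default
    filled = [default if v is None else v for v in deduped]

    # 4. Remove outliers beyond z_cutoff standard deviations from the mean
    if len(filled) < 2:
        return filled
    mean, stdev = statistics.mean(filled), statistics.stdev(filled)
    if stdev == 0:
        return filled
    return [v for v in filled if abs(v - mean) / stdev <= z_cutoff]
```

Each step is kept separate so that individual rules can be swapped out, e.g. imputing the mean for missing values instead of a fixed default.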

There are some challenges in the ETL process. When you consolidate data from one system into another, you have to ensure that the combination is successful, which requires a lot of strategic planning. The complexity of planning increases manyfold when the data under consideration is both structured and unstructured. Other challenges include keeping the data secure and complying with the various regulations.

Thus, performing ETL on Big Data is a very important and sensitive process that must be done with the utmost care and strategic planning.


