1. Explain the ETL process in the context of Big Data.
Answer» ETL stands for Extract-Transform-Load. Big Data is mostly unstructured, comes in very large quantities, and accumulates at a very fast pace. At the time of extraction it is therefore very difficult to transform, because of its sheer volume, velocity, and variety. Since we cannot afford to lose Big Data, it needs to be stored as-is and then, as business requirements emerge, transformed and analysed. The extraction step retrieves data from various data sources, and enterprises extract data for many different business reasons; a rough sketch of the step follows.
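As a minimal illustration of extraction, the Python sketch below pulls raw records from two heterogeneous sources into one staging list and defers all transformation, which is the usual pattern for high-volume data. The endpoint URL and file name are made-up placeholders, not real sources:

```python
import csv
import json
import urllib.request

API_URL = "https://example.com/api/events"  # hypothetical JSON endpoint
CSV_PATH = "orders.csv"                     # hypothetical CSV export

def extract_from_api(url):
    """Pull raw JSON records from a REST endpoint."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def extract_from_csv(path):
    """Pull raw rows from a CSV export as dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_all():
    # Records are staged as-is; cleaning and reshaping happen later,
    # once the business requirements for the data are known.
    return extract_from_api(API_URL) + extract_from_csv(CSV_PATH)
```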
Sometimes, while extracting the data, it may be desirable to add some additional information to it, depending on the business requirements. This additional information can be something like geolocation data or timestamps, and adding it is called data enrichment. Sometimes the data may also need to be consolidated with other data already in the target datastore. These different processes are collectively known as ETL, i.e. Extract-Transform-Load. Extraction is the very first step, and Big Data extraction tools assist in collecting the data from a wide variety of data sources; a simple enrichment step is sketched below.
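A minimal enrichment sketch, assuming the extracted records are Python dicts; the extracted_at and source fields are illustrative choices, not a fixed standard:

```python
from datetime import datetime, timezone

def enrich(record: dict, source_name: str) -> dict:
    """Attach extra metadata to a record during extraction."""
    enriched = dict(record)  # keep the raw record untouched
    enriched["extracted_at"] = datetime.now(timezone.utc).isoformat()
    enriched["source"] = source_name
    return enriched

print(enrich({"order_id": 42}, "orders_csv"))
# {'order_id': 42, 'extracted_at': '2024-...', 'source': 'orders_csv'}
```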
In ETL tools it is common for the three steps to execute in parallel, as a pipeline. Because extraction takes a long time, transformation begins on the data that has already been pulled and prepares it for loading while extraction is still running. As soon as data is ready for loading into the target store, loading starts immediately, without waiting for the earlier steps to finish; a small sketch of such a pipeline follows.
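A minimal sketch of this pipelining, with Python threads and in-memory queues standing in for real ETL stages; the record shapes and the transform are toy placeholders:

```python
import queue
import threading

raw_q, ready_q = queue.Queue(), queue.Queue()
DONE = object()  # sentinel marking the end of a stage's output

def extract(records):
    # Stage 1: push raw records downstream as they arrive.
    for rec in records:
        raw_q.put(rec)
    raw_q.put(DONE)

def transform():
    # Stage 2: starts on the first record, long before extraction ends.
    while (rec := raw_q.get()) is not DONE:
        ready_q.put({k.lower(): v for k, v in rec.items()})  # toy transform
    ready_q.put(DONE)

def load(target):
    # Stage 3: loads each record the moment it is ready.
    while (rec := ready_q.get()) is not DONE:
        target.append(rec)

target_store = []
stages = [
    threading.Thread(target=extract, args=([{"ID": 1}, {"ID": 2}],)),
    threading.Thread(target=transform),
    threading.Thread(target=load, args=(target_store,)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()
print(target_store)  # [{'id': 1}, {'id': 2}]
```

The sentinel lets each stage shut down cleanly once its upstream stage finishes; real ETL tools use the same overlap to hide extraction latency.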
ETL for Structured Data: If the data under consideration is structured, the extraction process is generally performed within the source system itself, using strategies such as full extraction or incremental extraction. For incremental extraction, a change table is created to track the changes; some data warehouses have this functionality, known as CDC (Change Data Capture), built in. The logic required for incremental data extraction is a little more complex, but the load on the source system is reduced.

ETL for Unstructured Data: When the data under consideration is unstructured, a major part of the work goes into preparing the data so that it can be extracted. In most cases, such data is stored in data lakes until it is needed for some kind of processing, analysis, or migration. Before extraction, the data is cleaned up by removing the so-called 'noise' from it, as sketched below.
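As a toy illustration of noise removal, a small cleaner for free-form text, assuming the 'noise' is markup, control characters, and duplicated whitespace (real pipelines tune such rules to their own data):

```python
import re

def clean_text(raw: str) -> str:
    """Strip typical 'noise' from unstructured text before extraction."""
    text = re.sub(r"<[^>]+>", " ", raw)       # drop HTML-like markup
    text = re.sub(r"[\x00-\x1f]", " ", text)  # drop control characters
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_text("<p>Order\x07  #42 \n shipped</p>"))  # -> "Order #42 shipped"
```

And, returning to incremental extraction of structured data, a minimal watermark-based sketch in which the last-seen modification timestamp plays the role of the change table, so each run pulls only rows modified since the previous run; the table and column names are made up for illustration:

```python
import sqlite3

def extract_incremental(conn: sqlite3.Connection, last_watermark: str):
    """Fetch only the rows changed since the previous run."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # The new watermark is the greatest timestamp seen in this batch.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark
```

Full extraction would simply omit the WHERE clause; the incremental variant needs this extra bookkeeping but, as noted above, puts far less load on the source system.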
There are some challenges in the ETL process. When you consolidate data from one system into another, you have to ensure that the combination succeeds, which demands a lot of strategic planning. The complexity of that planning increases manyfold when the data under consideration is both structured and unstructured. Other challenges include keeping the data secure and complying with the various regulations. Thus, performing ETL on Big Data is an important and sensitive process that must be carried out with the utmost care and strategic planning.