1.

Explain data transformation in Big Data.

Answer»

One obvious question is: why do we need data transformation?

Several reasons make it necessary to transform data. In a Big Data environment, we need to use every type of data available, from every possible source, to draw insights that help the business grow. Data is transformed for purposes such as:

  1. Making it compatible with other data
  2. Moving it to other systems
  3. Joining it with other data
  4. Aggregating the information present in the data

A successful data transformation typically follows these steps:

  1. Data Interpretation or Data Discovery
  2. Data Quality Check - Pre-Translation
  3. Data Translation or Data Mapping
  4. Data Quality Check - Post-Translation.
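
The four steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the records, the field names (`id`, `signup`, `amount`) and the helper functions are hypothetical and not part of any particular tool, and a real Big Data pipeline would run the same stages on a distributed engine rather than on in-memory lists.

```python
from datetime import date

# Hypothetical source records; the field names are assumptions for illustration.
raw_rows = [
    {"id": "1", "signup": "2021-03-04", "amount": "19.90"},
    {"id": "2", "signup": "2021-03-05", "amount": "5"},
    {"id": "3", "signup": "", "amount": "7.25"},  # missing signup date
]

def discover(rows):
    """Step 1, data discovery: list the fields and count empty values per field."""
    fields = sorted({key for row in rows for key in row})
    return {f: sum(1 for row in rows if not row.get(f)) for f in fields}

def pre_check(rows, required):
    """Step 2, pre-translation check: separate usable rows from rejected ones."""
    good = [r for r in rows if all(r.get(f) for f in required)]
    bad = [r for r in rows if r not in good]
    return good, bad

def translate(row):
    """Step 3, data mapping: convert source strings to target names and types."""
    return {
        "customer_id": int(row["id"]),
        "signup_date": date.fromisoformat(row["signup"]),
        "amount_cents": round(float(row["amount"]) * 100),
    }

def post_check(rows):
    """Step 4, post-translation check: verify simple invariants on the output."""
    return all(
        r["amount_cents"] >= 0 and isinstance(r["signup_date"], date)
        for r in rows
    )

missing_counts = discover(raw_rows)
good, rejected = pre_check(raw_rows, required=("id", "signup", "amount"))
transformed = [translate(r) for r in good]
ok = post_check(transformed)
```

Here the row with the empty `signup` field is rejected before mapping, so the post-translation check only has to confirm that the converted rows satisfy the target rules.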

There are several ways to perform data transformation:

  • You can use scripting to transform the data, i.e. you write code by hand to perform the required transformation.
  • You can use automation tools on-premises.
  • Or you can opt for cloud-based automation tools.

Data transformation tends to be slow and costly. For it to succeed, you have to design a strategy that accounts for all relevant aspects: business needs, objectives, data governance, regulatory requirements, security, scalability, and so on.

The different methods that can be used for data transformation are:

  • Data binning: Also called data bucketing, it is a data pre-processing technique that reduces the effects of small observational errors. The sample values are divided into intervals (bins), and each value is then replaced by a categorical value representing its bin.
  • Indicator variables: This technique converts categorical data into Boolean values by creating indicator variables. If a variable has more than two values, say 'n', we need to create 'n-1' columns; the remaining value is implied when all indicator columns are false.
  • Centering & Scaling: A feature is centered by subtracting the mean of all its values from each value. To scale the data, the centered feature is then divided by its standard deviation.
  • Other techniques: Other options include grouping outliers together so that they share the same value, or replacing each value with the number of times it appears in the column.

