InterviewSolution
This section includes curated interview questions with detailed answers to sharpen your knowledge and support interview preparation. Choose a topic below to get started.
| 1. |
What do you mean by trigger in SQL? |
|
Answer» In SQL, a trigger is a stored procedure that is invoked automatically when a triggering event occurs in the database. These triggering events are caused by the insertion, deletion or updating of rows in a particular table. For example, a trigger can be invoked when a new row is added to or deleted from a table, or when an existing row is updated. The syntax to create a trigger in SQL is as below.
Syntax:
create trigger [trigger_name] [before | after] {insert | update | delete} on [table_name] [for each row] [trigger_body]
Explanation:
1. The trigger is created with the name [trigger_name], and [before | after] determines whether it executes before or after the triggering statement.
2. {insert | update | delete} are the DML operations that can fire the trigger.
3. [table_name] is the table associated with the trigger.
4. [for each row] specifies that the trigger is executed once for every affected row.
5. [trigger_body] contains the operations to be performed when the trigger is invoked. |
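As a concrete, hedged illustration, here is a minimal runnable sketch using Python's built-in sqlite3 module (which supports CREATE TRIGGER); the table names, columns and trigger name are hypothetical and chosen only for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two illustrative tables: one for orders and one for an audit log.
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
cur.execute("CREATE TABLE order_audit (order_id INTEGER, note TEXT)")

# AFTER INSERT trigger: log every new row added to the orders table.
cur.execute("""
    CREATE TRIGGER log_new_order
    AFTER INSERT ON orders
    FOR EACH ROW
    BEGIN
        INSERT INTO order_audit (order_id, note) VALUES (NEW.id, 'order created');
    END;
""")

cur.execute("INSERT INTO orders (amount) VALUES (99.50)")
print(cur.execute("SELECT * FROM order_audit").fetchall())   # [(1, 'order created')]
conn.close()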
|
| 2. |
What do you mean by SQL injection? |
|
Answer» SQL injection is the process of inserting malicious SQL commands into a database in order to exploit the user data stored in it. By inserting these statements, attackers can take control of the database and destroy or manipulate the sensitive information stored in it. These SQL command insertions, or SQL injections, mostly happen through inputs on web pages and are one of the most common web hacking techniques. In web applications, web servers usually communicate with database servers in order to retrieve or store user data. Attackers supply malicious SQL code which gets executed once the web server connects to the database server, compromising the security of the web application. We can make use of restricted access privileges and user authentication to avoid any security breach which may impact the critical data present in the database. Another way is to avoid using system administrator accounts. |
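A minimal sketch in Python (using the built-in sqlite3 module and a hypothetical users table) that contrasts the vulnerable string-concatenation pattern with the safe parameterized query that prevents injection:
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
cur.execute("INSERT INTO users VALUES ('alice', 1), ('bob', 0)")

user_input = "x' OR '1'='1"   # a classic injection payload

# Vulnerable: the input is concatenated directly into the SQL string.
vulnerable = "SELECT * FROM users WHERE name = '" + user_input + "'"
print(cur.execute(vulnerable).fetchall())    # returns every row in the table

# Safe: the driver binds the value as data, never as executable SQL.
print(cur.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall())  # []
conn.close()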
|
| 3. |
What do you mean by alias in SQL? |
|
Answer» In SQL, we can provide temporary names to columns or tables; these are called aliases, and they apply to a specific query only. When we don't want to use the original name of a table or column, we use an alias to give it a temporary name. The scope of the alias is limited to the query in which it is defined. We use aliases to increase the readability of a column or table name. The change is temporary, and the original names stored in the database never change. Sometimes the names of tables or columns are complex, so it is often preferable to use an alias to give them an easier name temporarily. Below is the syntax to use an alias for both column and table names.
Column Alias:
Syntax: SELECT column AS alias_name FROM table_name;
Explanation: Here alias_name is the temporary name given to the column in the table table_name.
Table Alias:
Syntax: SELECT column FROM table_name AS alias_name;
Explanation: Here alias_name is the temporary name given to the table table_name. |
|
| 4. |
What are the differences between IN and BETWEEN operators? |
|
Answer» BETWEEN operator: In SQL, the BETWEEN operator is used to test whether an expression lies within a defined range of values. The range is inclusive, and the values can be of any type such as dates, numbers or text. We can use the BETWEEN operator with SELECT, INSERT, DELETE, and UPDATE statements. The syntax to apply this operator is as below.
Syntax: SELECT column_name(s) FROM table_name WHERE column_name BETWEEN value1 AND value2;
Output: It will return all the values from column_name which lie between value1 and value2, including these two values.
IN operator: In SQL, the IN operator is used to check whether an expression matches any value in a specified list of values. It can be used to eliminate the need for multiple OR conditions. We can also use the NOT IN operator, which functions exactly opposite to the IN operator, to exclude certain rows from the output. We can use the IN or NOT IN operator with SELECT, INSERT, DELETE, and UPDATE statements. The syntax to apply these operators is as below.
IN:
Syntax: SELECT column_name(s) FROM table_name WHERE column_name IN (list_of_values);
Output: It will return all the values from column_name which match the specified list_of_values.
NOT IN:
Syntax: SELECT column_name(s) FROM table_name WHERE column_name NOT IN (list_of_values);
Output: It will return all the values from column_name excluding the specified list_of_values. |
|
| 5. |
What do you mean by SciPy? |
|
Answer» In Python, SciPy is an open-source library which is used for solving various engineering, mathematical, technical and scientific problems. We can easily manipulate data with the help of SciPy and perform data visualisation with a wide number of high-level commands available in Python. SciPy is pronounced as "Sigh Pi". NumPy acts as the foundation of SciPy, as SciPy is built on top of it, and SciPy's routines are designed to work with NumPy arrays. Optimization and numeric integration are also possible using the numerical routines provided by SciPy. To set up SciPy on your system, the commands for different operating systems are given below.
Windows:
Syntax: python3 -m pip install --user numpy scipy
Linux:
Syntax: sudo apt-get install python-scipy python-numpy
Mac:
Syntax: sudo port install py35-scipy py35-numpy |
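A minimal sketch (assuming NumPy and SciPy are already installed) showing numeric integration and optimization, two of the routines mentioned above:
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi (the exact answer is 2).
area, abs_error = integrate.quad(np.sin, 0, np.pi)
print(area)

# Find the minimum of (x - 3)^2 (the exact answer is x = 3).
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)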
|
| 6. |
Explain pass, continue and break statements in Python? |
|
Answer» In Python, loop statements are used in order to perform repetitive tasks efficiently. But in some scenarios we need to come out of a loop early or skip some of its iterations, and for these scenarios Python provides loop control statements. These statements are as below (see the example after this list).
break: terminates the innermost enclosing loop immediately and transfers control to the statement following the loop.
continue: skips the rest of the current iteration and moves on to the next iteration of the loop.
pass: a null statement that does nothing; it is used as a placeholder where a statement is syntactically required but no action is needed.
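A small runnable illustration of the three statements:
for number in range(1, 8):
    if number == 3:
        continue        # skip 3 and move on to the next iteration
    if number == 6:
        break           # stop the loop entirely once 6 is reached
    print(number)       # prints 1, 2, 4, 5

def not_implemented_yet():
    pass                # placeholder body; does nothing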
|
|
| 7. |
How can you differentiate between append() and extend() in Python? |
|
Answer» append(): In Python, when we pass an argument to append(), it is added to the list as a single entity. In other words, when we append a list to another list, the whole list is added as a single object at the end of the other list, and hence the length of the list is incremented by 1 only. append() has a fixed time complexity of O(1).
Example: Let's take an example of two lists as shown below.
list1 = ["Alpha", "Beta", "Gamma"]
list2 = ["Delta", "Eta", "Theta"]
list1.append(list2)
list1 will now become: ["Alpha", "Beta", "Gamma", ["Delta", "Eta", "Theta"]]
The length of list1 becomes 4 after the second list is added as a single entity.
extend(): In Python, when we pass an argument to extend(), all the elements contained in that argument are added to the list; in other words, the argument is iterated over. So, the length of the list is incremented by the number of elements added from the other list. extend() has a time complexity of O(n), where n is the number of elements in the argument passed to extend().
Example: Let's take an example of two lists as shown below.
list1 = ["Alpha", "Beta", "Gamma"]
list2 = ["Delta", "Eta", "Theta"]
list1.extend(list2)
list1 will now become: ["Alpha", "Beta", "Gamma", "Delta", "Eta", "Theta"]
The length of list1 becomes 6 in this scenario. |
|
| 8. |
Explain Decorator in Python? |
|
Answer» Decorators can be considered one of the most important and powerful tools present in Python. We can modify the behaviour of a function or a class with the help of this tool. A decorator wraps a function or a class with another function in order to modify the behaviour of the wrapped function or class without making any permanent changes to its source code. In Python we can easily pass functions as arguments because they are first-class objects. In a decorator, a function, being a first-class object, is passed as an argument to another function and is then called inside the wrapper function. |
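A minimal sketch of a decorator; the function names are hypothetical and chosen only for illustration:
import functools

def log_call(func):
    @functools.wraps(func)          # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with {args} {kwargs}")
        result = func(*args, **kwargs)
        print(f"{func.__name__} returned {result}")
        return result
    return wrapper

@log_call
def add(a, b):
    return a + b

add(2, 3)   # prints the call details and the result 5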
|
| 9. |
How is memory managed in Python? |
|
Answer» The Python memory manager performs the task of managing memory in Python. All the data structures and objects in Python are stored in a private heap, and it is the duty of the Python memory manager alone to manage this private heap. Developers cannot access this private heap space directly; the memory manager allocates space from it to objects. The Python memory manager contains object-specific allocators to allocate space to particular kinds of objects, along with raw memory allocators that make sure space is reserved in the private heap. Python also provides a garbage collector so that developers do not need to perform garbage collection manually. The main job of this collector is to clear out unused objects and make that space in the private heap available for new objects. |
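A small sketch of interacting with Python's garbage collector through the built-in gc module:
import gc

print(gc.isenabled())        # automatic collection is enabled by default
print(gc.get_count())        # objects currently tracked in each generation
unreachable = gc.collect()   # force a full collection cycle
print(unreachable)           # number of unreachable objects found and freed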
|
| 10. |
How can you differentiate between “is” and “==” operators in Python? |
|
Answer» The "is" operator is used for reference equality, i.e., to check whether two references or variables point to the same object or not; accordingly, it returns true or false. The "==" operator is used for value equality, i.e., to check whether two variables hold the same value or not; accordingly, it returns true or false. We can take an example with the help of two lists X and Y.
X = [1,2,3,4,5]
Y = [1,2,3,4,5]
Z = Y
Here X == Y is True because both lists hold the same values, but X is Y is False because they are two different objects in memory. Y is Z is True because Z refers to the same object as Y.
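A runnable check of this behaviour (the list contents are only illustrative):
X = [1, 2, 3, 4, 5]
Y = [1, 2, 3, 4, 5]
Z = Y

print(X == Y)          # True: the two lists hold the same values
print(X is Y)          # False: they are two different objects in memory
print(Y is Z)          # True: Z refers to the very same object as Y
print(id(Y) == id(Z))  # True: identical object identities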
|
|
| 11. |
Explain collaborative filtering? |
|
Answer» Collaborative filtering is a technique which makes use of various algorithms in order to provide personalized recommendations to users. It is also known as social filtering. Some of the popular websites which make use of this kind of filtering are iTunes, Amazon, Flipkart, Netflix, etc. In collaborative filtering, a user is provided with personal recommendations based upon a compilation of the common interests or preferences of other users, with the help of prediction algorithms. We can take an example of two users A and B. Suppose user A visits Amazon and buys items 1 and 2; when user B later buys that same item 1, item 2 will be recommended to user B based upon this predictive analysis. |
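A toy sketch of user-based collaborative filtering in Python; the ratings dictionary is invented purely for illustration. It computes a cosine similarity between users and recommends items liked by the most similar user:
from math import sqrt

ratings = {
    "A": {"item1": 5, "item2": 4},
    "B": {"item1": 5},
    "C": {"item3": 2, "item4": 5},
}

def cosine_similarity(u, v):
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(target, ratings):
    # Find the most similar other user and suggest items the target has not rated yet.
    others = [user for user in ratings if user != target]
    best = max(others, key=lambda user: cosine_similarity(ratings[target], ratings[user]))
    return [item for item in ratings[best] if item not in ratings[target]]

print(recommend("B", ratings))   # ['item2'] -- user A is the most similar to user B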
|
| 12. |
Explain the purpose of A/B testing and list all its benefits? |
|
Answer» A/B testing, also known as split testing, is a randomized statistical experiment performed on two different variants (A and B) of a webpage or application by showing these variants to sets of end users and analysing which of the two variants creates a larger impact, i.e., which variant proves to be more effective and beneficial to the end users. A/B testing has a number of benefits, which are as follows (a small worked example follows the list).
1. It supports data-driven decisions instead of guesswork.
2. It helps improve conversion rates and user engagement.
3. It reduces the risk of rolling out a change, since a variant is validated on a subset of users first.
4. It helps identify and fix pain points in the user experience, for example by reducing bounce rates.
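A minimal sketch (assuming SciPy is installed, with invented conversion counts) of testing whether the difference between the two variants is statistically significant:
from scipy.stats import chi2_contingency

# Hypothetical results: [conversions, non-conversions] for each variant.
variant_a = [120, 880]   # 12.0% conversion rate
variant_b = [150, 850]   # 15.0% conversion rate

chi2, p_value, dof, expected = chi2_contingency([variant_a, variant_b])
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("The difference between the variants is statistically significant.")
else:
    print("No significant difference detected; keep collecting data.")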
|
|
| 13. |
What do you mean by logistic regression? |
|
Answer» Logistic regression is a predictive model which is used to analyse large datasets and determine a binary output given an input variable. The binary output can take only a limited number of values, such as 0/1, true/false or yes/no. Logistic regression makes use of a sigmoid function in order to determine the possible outcomes and their corresponding probabilities of occurrence. There is an acceptance threshold which is set to determine whether a particular instance belongs to a class or not: if the probability of an outcome is more than the threshold, the instance belongs to that class, otherwise it does not. There are three types of logistic regression, as listed below (a small example follows).
1. Binary logistic regression: the target variable has only two possible categories, e.g., pass/fail.
2. Multinomial logistic regression: the target variable has three or more categories without any natural ordering.
3. Ordinal logistic regression: the target variable has three or more categories with a natural ordering, e.g., ratings from 1 to 5.
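A minimal sketch (assuming scikit-learn is installed, with a tiny invented dataset) of fitting a binary logistic regression and reading the predicted probabilities produced by the sigmoid:
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. whether the exam was passed (1) or not (0).
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[4.5]]))         # predicted class for 4.5 hours of study
print(model.predict_proba([[4.5]]))   # probabilities for class 0 and class 1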
|
|
| 14. |
Differentiate between the KNN and K-means methods? |
|
Answer» The differences between the K-Nearest Neighbour (KNN) and K-Means methods are as below (a small illustration follows the list).
1. KNN is a supervised learning algorithm used for classification and regression, whereas K-Means is an unsupervised learning algorithm used for clustering.
2. KNN requires labelled training data; K-Means works on unlabelled data.
3. In KNN, 'K' refers to the number of nearest neighbours considered when classifying a new data point, whereas in K-Means, 'K' refers to the number of clusters the data is divided into.
4. KNN makes a prediction for each individual query point; K-Means iteratively recomputes cluster centroids until they converge.
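A short sketch (assuming scikit-learn is installed, with invented data) that contrasts the two methods:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [8, 8], [9, 10]]
labels = [0, 0, 1, 1]                      # labels exist, so KNN can be supervised

# KNN: learns from labelled data, then classifies a new point by its neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(points, labels)
print(knn.predict([[8, 9]]))               # -> [1]

# K-Means: no labels, it simply groups the points into K clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(points)
print(kmeans.labels_)                      # cluster assignment for each point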
|
| 15. |
What do you mean by outliers? |
|
Answer» Outliers are data records which differ from the normal records in some of their characteristics. It is very important to first decide the characteristics of normal records in order to detect outliers. When outliers are fed into algorithms or analytical systems, they can produce abnormal results which may distort the analysis, so it is very important to detect them. We can detect outliers by directly inspecting tables or graphs. As an example, suppose there is a table containing the Name and Age of a few people and one of the rows contains an Age of 500. We can easily see that this is an invalid value, since an age can be 40, 50 or 55 but not 500; we can suspect the value is wrong even if we cannot be sure of the correct one. This kind of manual detection is easy when we are dealing with a table with a limited number of records, but if the table contains thousands of records it becomes impractical, and statistical techniques such as the z-score or the interquartile range (IQR) are used instead (see the sketch below). |
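A minimal sketch of IQR-based outlier detection using only the Python standard library (the ages are invented):
import statistics

ages = [38, 40, 42, 45, 47, 50, 52, 55, 500]      # 500 is clearly abnormal

q1, q2, q3 = statistics.quantiles(ages, n=4)      # quartiles of the data
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [age for age in ages if age < lower or age > upper]
print(outliers)   # [500]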
|
| 16. |
What are the ways to handle missing values in Big Data? |
|
Answer» There are a few ways in which we can handle missing values in Big Data. These are as follows.
1. Deleting the rows (or columns) that contain missing values, when the proportion of missing data is small.
2. Imputing the missing values with a statistic such as the mean, median or mode of the column.
3. Predicting the missing values using a model trained on the complete records.
4. Using algorithms that can work with missing values directly.
Other than the above-mentioned techniques, we can also use the K-NN algorithm, the Random Forest algorithm, the Naive Bayes algorithm, and Last Observation Carried Forward (LOCF) methods in order to handle missing values in Big Data. A short example follows. |
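A minimal sketch (assuming pandas is installed, with an invented DataFrame) of the two most common approaches, dropping and imputing:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 32, 40, np.nan],
    "salary": [50000, 60000, np.nan, 80000, 75000],
})

dropped = df.dropna()                             # remove rows containing any missing value
imputed = df.fillna(df.mean(numeric_only=True))   # replace missing values with column means
print(dropped)
print(imputed)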
|
| 17. |
What do you mean by feature selection? |
|
Answer» Feature selection is the process of identifying and selecting the most relevant features to be used as input to machine learning algorithms for model creation. Feature selection techniques are used to discard redundant or unrelated features from the input to a machine learning model, decreasing the number of input variables and narrowing the inputs down to only the relevant features. There are a few advantages of using these feature selection techniques, which are mentioned below (a short example follows the list).
1. It reduces overfitting, since there is less redundant data from which to learn noise.
2. It can improve model accuracy by removing misleading features.
3. It reduces training time, because the algorithm works with fewer input variables.
4. It makes the resulting model simpler and easier to interpret.
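A minimal sketch (assuming scikit-learn is installed) of univariate feature selection with SelectKBest on a bundled sample dataset:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 most informative features
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)              # (150, 4) -> (150, 2)
print(selector.get_support())                       # boolean mask of the selected features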
|
|
| 18. |
How can you differentiate NFS from HDFS? |
|
Answer» NFS is the Network File System and HDFS is the Hadoop Distributed File System. The main differences between the two are as follows.
1. NFS stores data on a single dedicated machine, whereas HDFS distributes data in blocks across the machines of a cluster.
2. NFS provides no built-in replication, so a machine failure can make the data unavailable; HDFS replicates each block on multiple DataNodes, which makes it fault tolerant.
3. NFS is suitable for small to moderate amounts of data, whereas HDFS is designed to store and process very large datasets.
4. With NFS, data is processed wherever the file system is mounted; with HDFS, the computation is moved close to the data (data locality).
|
| 19. |
How can you differentiate a Data Engineer and a Data Scientist? |
|
Answer» In the modern world, data has become the new currency. Both the roles of data engineer and data scientist revolve around data, but there are some differences in their duties, which are mentioned below.
1. A data engineer designs, builds and maintains the systems and pipelines that collect, store and transform raw data, whereas a data scientist analyses the prepared data to extract insights and build predictive models.
2. Data engineers work mainly with tools such as SQL, Hadoop, Spark and ETL frameworks, while data scientists work mainly with statistics, machine learning libraries and visualisation tools.
3. Data engineers optimise for data availability, quality and pipeline performance; data scientists optimise for the accuracy and usefulness of the analysis built on top of that data.
|
| 20. |
Can you explain some common problems faced by a data engineer? |
|
Answer» This question mainly focuses on knowing what problems you have faced while working as a data engineer in your prior experience. Some of the most common problems that can be mentioned here are, for example:
1. Continuous, real-time integration of data arriving from many heterogeneous sources.
2. Keeping up with ever-growing data volumes while storage and compute resources are limited.
3. Maintaining data quality and consistency when source systems change their schemas or send corrupt records.
4. Choosing the right combination of tools so that pipelines remain reliable and easy to maintain.
|
|
| 21. |
Can you depict various advantages and disadvantages of cloud computing? |
|
Answer» The advantages and disadvantages of cloud computing are as follows.
Advantages:
1. Lower upfront cost, since there is no need to buy and maintain physical infrastructure and you pay only for what you use.
2. Scalability, as resources can be increased or decreased on demand.
3. Accessibility, because data and services are available from anywhere over the internet.
4. Built-in backup and disaster recovery options offered by most providers.
Disadvantages:
1. Dependence on a stable internet connection.
2. Security and privacy concerns, since data is stored on third-party infrastructure.
3. Limited control over the underlying hardware and possible vendor lock-in.
4. Possible downtime when the provider has an outage. |
|
| 22. |
What can you do in case of any unexpected problem with data maintenance, according to your past experience? |
|
Answer» This question mainly focuses on knowing how you deal with unexpected problems in high-pressure situations. Unexpected problems are inevitable, and many of them arise while doing daily routine jobs or tasks; the same is the case with data maintenance. Data maintenance can be considered one of the daily tasks which needs to be monitored properly to make sure all the in-built tasks and corresponding scripts are executed as expected. As an example, in order to prevent the addition of corrupt indexes into the database, we can create maintenance tasks that detect and block such corrupt indexes before they cause any serious damage. |
|
| 23. |
What do you mean by COSHH? |
|
Answer» COSHH stands for Classification and Optimization based Scheduling for Heterogeneous Hadoop systems. In a Hadoop system, a large number of tasks are multiplexed and executed in a common datacentre. This leads to the sharing of the Hadoop cluster among many users, which increases system heterogeneity, an issue to which the default Hadoop schedulers do not give much importance. To address this, COSHH was designed and implemented to provide scheduling at both the cluster and application levels, which leads to an improvement in job completion time. |
|
| 24. |
How can you handle duplicate data points in SQL? |
|
Answer» We can come across a situation in which a table contains multiple duplicate data entries, and while fetching records from that table it makes no sense to fetch all of those entries; we need only the unique entries, which also avoids redundancy. For achieving this, SQL provides the DISTINCT keyword, which we can use with the SELECT statement to eliminate the duplicate entries and fetch only unique ones. The syntax to use this keyword is as below:
SELECT DISTINCT column1, column2, column3...columnM FROM table_name1 WHERE [conditions]
We can also use the UNIQUE constraint to handle duplicate data: the UNIQUE constraint ensures that all the values present in a specific column are different. |
|
| 25. |
What are the differences between list and tuple? |
|
Answer» In Python, both list and tuple are classes of data structures. The differences between a list and a tuple are as follows (a small illustration follows the list).
1. A list is mutable, i.e., its elements can be changed after creation, whereas a tuple is immutable and cannot be modified once created.
2. A list is written with square brackets, e.g., [1, 2, 3], while a tuple is written with parentheses, e.g., (1, 2, 3).
3. A list provides many built-in methods such as append(), extend() and remove(); a tuple provides only a few, such as count() and index().
4. A list generally consumes more memory and is slightly slower to iterate over than a tuple.
5. A tuple, being immutable (and hashable when its elements are hashable), can be used as a dictionary key, whereas a list cannot.
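A small runnable illustration of the mutability difference:
my_list = [1, 2, 3]
my_tuple = (1, 2, 3)

my_list[0] = 10          # allowed: lists are mutable
print(my_list)           # [10, 2, 3]

try:
    my_tuple[0] = 10     # not allowed: tuples are immutable
except TypeError as err:
    print(err)           # 'tuple' object does not support item assignment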
|
| 26. |
Which database is better to use: NoSQL or a relational database? |
|
Answer» For modern applications that have complex and constantly changing data sets, NoSQL is usually the better option compared to a traditional relational database, because such applications need a flexible data model that does not have to be fully defined up front. NoSQL provides various agile features which help companies go to market faster and ship updates faster, and it also helps to store real-time data. When dealing with an increasing data-processing load, it is generally a better approach to scale out rather than scale up, and NoSQL is a good fit here as it is cost effective and can deal with huge volumes of data. Although a relational database provides better connectivity with analytical tools, NoSQL is often still preferable for such workloads because of the flexibility and scalability it offers compared to a traditional database. |
|
| 27. |
What are the differences between NoSQL database and SQL database? |
|
Answer» The differences between NoSQL and SQL databases are as below.
1. SQL databases are relational and store data in tables with rows and columns, whereas NoSQL databases are non-relational and store data as documents, key-value pairs, wide columns or graphs.
2. SQL databases have a predefined, fixed schema, while NoSQL databases have dynamic schemas suited to unstructured or semi-structured data.
3. SQL databases typically scale vertically (bigger servers), whereas NoSQL databases scale horizontally (more servers).
4. SQL databases use structured query language and are well suited to complex queries and ACID transactions; NoSQL databases use database-specific query APIs and generally favour availability and scalability.
5. Examples of SQL databases are MySQL, PostgreSQL and Oracle; examples of NoSQL databases are MongoDB, Cassandra, HBase and Redis.
|
| 28. |
What are the differences between OLAP and OLTP? |
|
Answer» The differences between OLAP and OLTP are given below.
1. OLAP (Online Analytical Processing) is used for analysing historical data and generating reports, whereas OLTP (Online Transaction Processing) handles the day-to-day transactional operations of an application.
2. OLAP queries are complex, long-running and read large volumes of data; OLTP queries are short, simple and touch only a few records at a time.
3. OLAP systems are usually built on a data warehouse with denormalised schemas, whereas OLTP systems use highly normalised databases.
4. In OLAP, data is loaded periodically and rarely updated; in OLTP, data is inserted and updated continuously, and fast response time and data integrity are critical.
|
| 29. |
What are the differences between Data warehouse and Database? |
|
Answer» The differences between a data warehouse and a database are given below.
1. A database is designed for recording day-to-day transactions (OLTP), whereas a data warehouse is designed for analysing data and supporting decision making (OLAP).
2. A database usually holds the current data of a single application, whereas a data warehouse integrates current and historical data from many different sources.
3. Database tables are normalised to avoid redundancy and keep transactions fast; data warehouse schemas are often denormalised (for example star or snowflake schemas) to make analytical queries fast.
4. A database is optimised for frequent reads and writes of individual records, while a data warehouse is optimised for complex queries that scan large volumes of data.
|
| 30. |
What is the usage of *args and **kwargs? |
|
Answer» In Python, we can pass a variable number of arguments to a function when we are unsure about how many arguments need to be passed. These arguments can be passed using the special symbols depicted below.
*args: collects any number of extra positional arguments into a tuple inside the function.
**kwargs: collects any number of extra keyword arguments into a dictionary inside the function.
Function flexibility can be achieved by using these two special symbols; a short example follows. |
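A minimal runnable sketch (the function name and arguments are hypothetical):
def describe(*args, **kwargs):
    # args is a tuple of the extra positional arguments,
    # kwargs is a dict of the extra keyword arguments.
    print("positional:", args)
    print("keyword:", kwargs)

describe(1, 2, 3, name="pipeline", retries=5)
# positional: (1, 2, 3)
# keyword: {'name': 'pipeline', 'retries': 5}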
|
| 31. |
What are various SerDe implementations available in Hive? |
|
Answer» In Hive, there are various types of SerDe implementations available, and there is also a provision to create your own custom SerDe implementation. A few of the popular built-in implementations are listed below.
1. LazySimpleSerDe (the default SerDe for delimited text)
2. OpenCSVSerde (for CSV files)
3. JsonSerDe (for JSON records)
4. RegexSerDe (parses rows using a regular expression)
5. AvroSerDe (for Avro files)
6. ORC and Parquet SerDes (for the corresponding columnar file formats)
7. ThriftSerDe (for Thrift-encoded records)
|
|
| 32. |
Explain the importance of Distributed cache in Hadoop? |
|
Answer» In Hadoop, Distributed Cache is a utility provided by the MapReduce framework. Briefly, we can say that it caches files such as jar files, archives and text files when they are needed by an application. When a MapReduce job is running, this utility caches the read-only files and makes them available to all the DataNodes; each DataNode gets a local copy of the file, so the files are accessible wherever the tasks run. These files remain on the DataNodes while the job is running and are deleted once the job is completed. The default size of the Distributed Cache is 10 GB, which can be adjusted according to requirements using the local.cache.size property. |
|
| 33. |
What is the use of balancer in HDFS? |
|
Answer» The balancer is a utility provided by HDFS. As we know, DataNodes store the actual data related to any job or process: datasets are divided into blocks and these blocks are stored across the DataNodes of a Hadoop cluster. Over time some of these nodes become underutilised and some overutilised, so a balance needs to be maintained. This is where the balancer comes in: it analyses the block placement across the nodes and moves blocks from overutilised to underutilised nodes until the cluster is deemed to be balanced. |
|
| 34. |
Explain the concept of Data Locality in Hadoop? |
|
Answer» In Hadoop, when we are dealing with big data systems, the size of the data is huge, so it is not good practice to move this large amount of data across the network; doing so would hurt system throughput and cause network congestion. To get rid of these problems, Hadoop uses the concept of data locality. Briefly, it is the practice of moving the computation towards the data rather than moving huge amounts of data towards the computation, so the data always remains local to its storage location. When a user runs a MapReduce job, the NameNode sends the MapReduce code to the DataNodes that contain the data related to that job. |
|
| 35. |
Describe the use of Combiner in Hadoop? |
|
Answer» The Combiner, also known as a mini-reducer, acts as an optional step between Map and Reduce. Briefly, we can say that it takes the output from the Map function, summarizes the records that share the same key, and then passes the summarized records as input to the Reducer. When we run a MapReduce job on a large dataset, the Mapper generates large chunks of intermediate data, and passing all of it to the Reducer for further processing can cause congestion in the network. In order to deal with this congestion, the Hadoop framework uses the Combiner as an intermediate step between the Mapper and the Reducer to reduce the amount of data transferred over the network. |
|
| 36. |
Explain the functions of Secondary NameNode? |
|
Answer» The various functions of the Secondary NameNode are as follows.
1. It periodically reads the edit logs and the FsImage of the NameNode and merges them into a new checkpointed FsImage, which prevents the edit logs from growing indefinitely.
2. It stores a copy of the merged FsImage, which can be used to restore the NameNode's namespace if required.
3. By performing this checkpointing regularly, it reduces the restart time of the NameNode.
Note that the Secondary NameNode only assists with checkpointing; it is not a standby or backup NameNode that can take over automatically.
|
|
| 37. |
Why is commodity hardware used in Hadoop? |
|
Answer» In Hadoop, HDFS, abbreviated from Hadoop Distributed File System, is the standard storage layer, and it is built on top of commodity hardware. Hadoop does not require costly servers with high processing power and large storage; we can use inexpensive systems with average processors and RAM, and such systems are called commodity hardware. They are affordable, easy to obtain and compatible with operating systems such as Linux, Windows and MS-DOS without requiring any special devices or equipment. Another benefit of using commodity hardware is scalability, since more such machines can simply be added to the cluster. |
|
| 38. |
Explain YARN in Hadoop? |
|
Answer» YARN is an abbreviation of Yet Another Resource Negotiator and is considered one of the main components of Hadoop. Within Hadoop, YARN helps in processing and running the data stored in HDFS for stream processing, graph processing, batch processing and interactive processing; so, briefly, we can say that YARN allows various types of distributed applications to run on the cluster. Using YARN, the efficiency of the system is increased, because the data stored in HDFS can be processed by several types of processing engines, as described above. YARN is also known for optimum utilisation of the available resources, which makes processing a high volume of data easier. |
|
| 39. |
What do you mean by FSCK? |
|
Answer» FSCK stands for File System Consistency Check. Briefly, we can define FSCK as a command that is used to check for inconsistencies or problems in the HDFS file system, i.e., at the HDFS level. The syntax for using the FSCK command is as below.
hadoop fsck [GENERIC OPTIONS] <path> [-delete | -move | -openforwrite] [-files [-blocks [-locations | -racks]]] |
|
| 40. |
Explain the steps to deploy a big data solution? |
|
Answer» Below are the steps that need to be followed in order to deploy a big data solution.
1. Data ingestion: extract data from the various sources (databases, logs, ERP/CRM systems, files, streams) using tools such as Sqoop, Flume or Kafka, either in batch or in real time.
2. Data storage: store the ingested data in a distributed storage layer such as HDFS or a NoSQL database like HBase, depending on the access pattern.
3. Data processing: process and analyse the stored data using frameworks such as MapReduce, Spark, Hive or Pig, and expose the results to downstream consumers.
|
|
| 41. |
How can big data and data analytics help to increase a company's revenue? |
|
Answer» Following are some of the ways in which big data and data analytics can positively impact a company's business.
1. They support data-driven decision making, which reduces the cost of wrong decisions.
2. They enable targeted marketing and personalised recommendations, which improve conversion and customer retention.
3. They help optimise operations and resource usage, lowering operating costs.
4. They make it easier to detect fraud and forecast demand, protecting and growing revenue.
|
|
| 42. |
How can you search for a specific string in a table column in MySQL? |
|
Answer» We can perform various operations on strings as well as on the substrings present in a table column. In order to search for a specific string or pattern in a table column in MySQL, we can use the REGEXP operator.
Syntax: SELECT * FROM table_name WHERE column_name REGEXP 'pattern';
For a simple substring match, the LIKE operator with the % wildcard can also be used. |
|
| 43. |
How can you see the database structure and the list of tables in MySQL? |
|
Answer» In MySQL, we can see the structure of a table with the help of the DESCRIBE command. The syntax to use this command is as follows.
DESCRIBE table_name;
We can see the list of all tables in the current database using the SHOW command. The syntax to use this command is as follows.
SHOW TABLES; |
|
| 44. |
What do you mean by Skewed tables in Hive? |
|
Answer» In Hive, there are some special types of tables in which the values of a column appear in a repeating manner (skew); these tables are called skewed tables. In Hive, while creating a particular table, we can specify that table as SKEWED. All the skewed values in the table are written into separate files, and the rest of the values are stored in another file. Skewed tables help to provide better performance when writing queries. The syntax to define a particular table as skewed during its creation is shown below using an example.
CREATE TABLE TableName (column1 STRING, column2 STRING) SKEWED BY (column1) ON ('value') |
|
| 45. |
Can you create more than one table in Hive for a single data file? |
|
Answer» In Hive, multiple tables can be created for a single data file that lives in one HDFS directory. As we already know, the metastore acts as the central repository for Hive metadata and stores metadata such as schemas and locations, while the data itself remains in the same file. So, it becomes very easy to retrieve different results for the same underlying data based upon the schema of each table. |
|
| 46. |
Briefly explain the use of Metastore in Hive? |
|
Answer» The metastore acts as the central repository for Hive metadata. It is used for storing the metadata of Hive tables, i.e., their schemas and locations, and the metadata itself is kept in a relational database (RDBMS) behind the metastore service. The metastore can be deployed in three modes, which are given below.
1. Embedded metastore: both the metastore service and the embedded Derby database run inside the HiveServer JVM; only one session can connect at a time.
2. Local metastore: the metastore service still runs in the HiveServer JVM, but the metadata is stored in a separate external RDBMS such as MySQL.
3. Remote metastore: the metastore runs in its own JVM (possibly on a separate machine), and Hive and other clients connect to it over Thrift.
|
|
| 47. |
Briefly explain the role of the .hiverc file in Hive? |
|
Answer» In Hive, .hiverc acts as an initialization file. Whenever you open the CLI (Command Line Interface) to write Hive code, .hiverc is the first file that gets loaded. It contains the parameters that you have set initially; for example, you can set the column headers to be visible in query results, add jar files, and so on. This file is loaded from the Hive conf directory. |
|
| 48. |
What are all the objects created by create statement in MySQL? |
|
Answer» The objects that can be created by the CREATE statement in MySQL are listed below:
1. DATABASE
2. TABLE
3. INDEX
4. VIEW
5. USER
6. FUNCTION
7. PROCEDURE
8. TRIGGER
9. EVENT
|
|
| 49. |
What are the functions present in Hive for table creation? |
|
Answer» Hive provides built-in table-generating functions that produce the rows of a table from a single input value; commonly cited examples are explode(array), explode(map), posexplode(array), json_tuple() and stack(). |
|
| 50. |
What does SerDe mean in Hive? |
|
Answer» In Hive, SerDe stands for Serialization and Deserialization. SerDe is a built-in library available through the Hadoop API, and it instructs Hive on how a record (row) should be processed. The deserializer takes the binary representation of a record and translates it into a Java object that Hive can understand, while the serializer takes the Java object that Hive has been working with and converts it into a format that can be processed and stored by HDFS. |
|