1.

How do LDAP, Active Directory, and Kerberos help secure a Hadoop environment?

Answer:

Apache Sentry provides role-based authorization for both the data and the metadata stored on a Hadoop cluster. Before going deeper into Sentry, here are the components on which it is built:

  1. Sentry server
  2. Data engine
  3. Sentry plugin
  • Sentry server: The Sentry server is an RPC (Remote Procedure Call) server that stores all authorization metadata in an underlying relational database and exposes an RPC interface to retrieve and manipulate privileges.
  • Data engine: The data engine is the component that provides access to the data, such as Hive, Impala, or Hadoop HDFS.
  • Sentry plug-in: The Sentry plug-in runs inside each data engine. Its interfaces are used to manipulate the authorization metadata stored in the Apache Sentry server, and its policy engine validates every access request coming from the data engine (Hive, Impala, HDFS) against that metadata.

The Sentry server only serves the authorization metadata; the actual authorization decision is made by the policy engine running inside each data-processing application such as Hive or Impala. Each component loads its own Sentry plug-in: for each service (Hive, HDFS, Impala, Solr), a Sentry plug-in must be installed to talk to the Sentry service and to validate authorization requests through the policy engine.

Below are a few of the capabilities Sentry provides.

1. Fine-Grained Authorization:

Permissions can be granted at each level of the object hierarchy, for example server level, database level, table level, and view level (row/column-level authorization via views), as well as on URIs, with privilege levels such as SELECT, INSERT, or ALL. This is what fine-grained authorization means; a sketch of how these scopes look as GRANT statements follows.
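In SQL-speaking engines such as Hive or Impala, these scopes map directly onto Sentry GRANT statements. The sketch below uses hypothetical role, database, table, and path names; server1 is merely a common default Sentry server name:

  • GRANT ALL ON SERVER server1 TO ROLE cluster_admin;                 -- server scope
  • GRANT SELECT ON DATABASE sales_db TO ROLE analyst;                 -- database scope
  • GRANT INSERT ON TABLE sales_db.orders TO ROLE etl;                 -- table scope
  • GRANT ALL ON URI 'hdfs://namenode:8020/data/staging' TO ROLE etl;  -- URI scope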

2. Role-Based Authorization (RBAC):

Sentry provides role-based authorization: a role is a template that bundles a set of privileges, which makes it easy to manage access rules for large sets of users and data objects (databases, tables, and so on).

For example, you can create a role called Analyst and grant SELECT on the Customer and Sales tables to this role:

  • CREATE ROLE Analyst;
  • GRANT SELECT ON TABLE Customer TO ROLE Analyst;
  • GRANT SELECT ON TABLE Sales TO ROLE Analyst;

Then grant the role to the finance-department group:

  • GRANT ROLE Analyst TO GROUP `finance-department`;

Now, if Bibhu joins the finance department, all you need to do is add him to the finance-department group in Active Directory. As a member of that group, he automatically gets the SELECT privilege on the Customer and Sales tables.
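To check how roles and privileges line up, Hive with Sentry also supports introspection statements along these lines (a sketch; availability and exact output vary by version):

  • SHOW ROLES;                                   -- all roles defined in Sentry
  • SHOW ROLE GRANT GROUP `finance-department`;   -- roles granted to the group
  • SHOW GRANT ROLE Analyst;                      -- privileges held by the role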

3. Multi-Tenant Administration, or Delegated Admin Responsibilities:

Sentry can delegate admin responsibilities for a subset of resources. Delegated administration means that a user who already holds the delegated-admin privilege on a specific set of resources can assign that privilege, on those resources, to another set of users or groups (see the sketch below).
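In SQL terms, this delegation is typically expressed with the WITH GRANT OPTION clause, which Sentry supports. The database and role names below are hypothetical:

  • GRANT ALL ON DATABASE finance_db TO ROLE finance_admin WITH GRANT OPTION;  -- members of finance_admin can now grant privileges on finance_db to other roles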

4. User Identity and Group Mapping: Sentry relies on Kerberos or LDAP to identify the user. It also uses the group mapping mechanism configured in Hadoop, so that Sentry sees the same group mapping as the other components of the Hadoop ecosystem (for example, Hadoop can be pointed at LDAP/AD for group lookups, as sketched below).
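A minimal core-site.xml sketch for LDAP/AD-backed group mapping; the server URL and search base are hypothetical placeholders, and org.apache.hadoop.security.LdapGroupsMapping is one of several mapping providers Hadoop ships with:

<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ad.example.com:389</value> <!-- hypothetical AD server -->
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value> <!-- hypothetical search base -->
</property>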

For example, suppose users Bibhu and Sibb belong to an Active Directory (AD) group called finance-department, and Sibb also belongs to a group called finance-managers. In Sentry, you create the roles first and then grant the required privileges to those roles. For example, you can create a role called Analyst and grant SELECT on the Customer and Sales tables to this role:

  • CREATE ROLE Analyst;
  • GRANT SELECT ON TABLE Customer TO ROLE Analyst;
  • GRANT SELECT ON TABLE Sales TO ROLE Analyst;

The next step is to join these authentication entities (users and groups) to the authorization entities (roles). This is done by granting the Analyst role to the finance-department group. Now Bibhu and Sibb, who are members of the finance-department group, get the SELECT privilege on the Customer and Sales tables.

  • GRANT ROLE Analyst TO GROUP `finance-department`;

Below are some scenarios showing how Hive, Impala, HDFS, and Search work with Sentry. A few examples will help illustrate how it works.

1. Hive and Sentry:

  • If ID "Bibhu" submits the following Hive query:
  • select * from production.status

For the above query, Hive identifies that user Bibhu is requesting SELECT access to the status table and asks its Sentry plug-in to validate the request. The plug-in retrieves Bibhu's privileges related to that table, and the policy engine determines whether the request is valid.

2. Impala and Sentry:

Authorization processing in Impala is more or less the same as in Hive. The main difference is privilege caching: Impala's catalog server caches roles and privileges along with table metadata and propagates them to all Impala daemon nodes. As a result, an Impala daemon can authorize queries much faster by consulting the cached metadata. The trade-off in performance terms is that privilege changes take some time, typically a few seconds, to propagate and take effect (a way to inspect the cached state is sketched below).
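When debugging why a query is or is not authorized, the cached roles and privileges can be inspected from impala-shell with statements like the following (a sketch; exact syntax and availability vary across Impala versions, and the role name is hypothetical):

  • SHOW CURRENT ROLES;       -- roles active for the connected user
  • SHOW GRANT ROLE Analyst;  -- privileges the catalog has cached for the role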

3. Sentry-HDFS Synchronization:

Sentry-HDFS authorization is focused on Hive warehouse data, that is, any data that belongs to a Hive or Impala table. The objective is that when other components such as Pig, MapReduce, or Spark access a Hive table's files directly, the same authorization checks occur. This feature does not replace HDFS ACLs; tables that are not associated with Sentry retain their normal HDFS ACLs.

The mapping of Sentry privileges to HDFS ACL permissions is as follows:

  • SELECT privilege -> Read access on the file
  • INSERT privilege -> Write access on the file
  • ALL privilege -> Read and Write access on the file.

The NameNode loads a Sentry plug-in that caches Sentry privileges as well as Hive metadata, which lets HDFS keep file permissions and Hive table privileges in sync. The plug-in periodically communicates with the Sentry server and the Hive Metastore to keep up with metadata changes.

For example, if Bibhu runs a Pig job that reads from the Sales table's data files, those files are stored in HDFS. The Sentry plug-in on the NameNode recognizes that the files are part of the Hive warehouse and overlays the Sentry privileges on top of the file ACLs. HDFS therefore enforces the same privileges for this Pig client that Hive would have applied to a SQL query.

Note that for HDFS-Sentry synchronization to work, you must use the Sentry service, not policy-file authorization.

4. Search and Sentry:

Sentry can apply restrictions to search tasks coming from a browser, the command line, or the admin console.

With Search, Sentry stores its privilege policies in a policy file (for example, sentry-provider.ini) kept in an HDFS location such as hdfs://ha-nn-uri/user/solr/sentry/sentry-provider.ini.
Sentry with Search does not support multiple policy files for multiple databases; however, you must use a separate policy file for each Sentry-enabled service. A hedged sketch of such a policy file follows.
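The sketch below shows the general shape of such a file; the group, role, and collection names are hypothetical, and the exact syntax may differ between Sentry versions:

# sentry-provider.ini (sketch)
[groups]
# map Hadoop/LDAP groups to Sentry roles
finance-department = analyst_role

[roles]
# allow analyst_role to query the sales_logs collection
analyst_role = collection = sales_logs->action=Query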

5. Disabling Hive CLI:

The Hive CLI is not supported with Sentry and must be disabled; Hive queries should be executed through Beeline instead, which connects via HiveServer2. Disabling the Hive CLI also blocks direct access to the Hive Metastore, which is especially important if the metastore holds sensitive metadata.

To do this, modify the hadoop.proxyuser.hive.groups property in core-site.xml on the Hive Metastore host.

For example, to allow only members of the hive and hue groups to access the Hive Metastore through the hive user, set the property to:

<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>hive,hue</value>
</property>

More user groups that require access to the Hive Metastore can be added to the comma-separated list as needed.


