InterviewSolution
| 1. |
How can you inner join two DataFrames? |
|
Answer» We can make use of the join() method present in PySpark SQL. The syntax for the function is: join(self, other, on=None, how=None)where, The join expression can be appended with where() and filter() methods for filtering ROWS. We can have multiple join too by means of the chaining join() method. Consider we have two dataframes - employee and department as shown below: -- Employee DataFrame --+------+--------+-----------+|emp_id|emp_name|empdept_id |+------+--------+-----------+| 1| Harry| 5|| 2| Ron | 5|| 3| Neville| 10|| 4| Malfoy| 20|+------+--------+-----------+-- Department DataFrame --+-------+--------------------------+|dept_id| dept_name |+-------+--------------------------+| 5 | Information TECHNOLOGY | | 10| Engineering || 20| Marketting | +-------+--------------------------+We can inner join the Employee DataFrame with Department DataFrame to get the department information along with employee information as: emp_dept_df = empDF.join(deptDF,empDF.empdept_id == deptDF.dept_id,"inner").show(truncate=False)The result of this becomes: +------+--------+-----------+-------+--------------------------+|emp_id|emp_name|empdept_id |dept_id| dept_name |+------+--------+-----------+-------+--------------------------+| 1| Harry| 5| 5 | Information Technology || 2| Ron | 5| 5 | Information Technology || 3| Neville| 10| 10 | Engineering || 4| Malfoy| 20| 20 | Marketting | +------+--------+-----------+-------+--------------------------+We can also perform joins by chaining join() method by following the syntax: DF1.join(df2,["column_name"]).join(df3,df1["column_name"] == df3["column_name"]).show()Consider we have a third dataframe called Address DataFrame having columns emp_id, city and state where emp_id acts as the foreign key equivalent of SQL to the Employee DataFrame as shown below: -- Address DataFrame --+------+--------------+------+|emp_id| city |state |+------+--------------+------+|1 | Bangalore | KA ||2 | Pune | MH ||3 | MUMBAI | MH ||4 | Chennai | TN |+------+--------------+------+If we want to get address details of the address along with the Employee and the Department Dataframe, then we can run, resultDf = empDF.join(addressDF,["emp_id"]) .join(deptDF,empDF["empdept_id"] == deptDF["dept_id"]) .show()The resultDf would be: +------+--------+-----------+--------------+------+-------+--------------------------+|emp_id|emp_name|empdept_id | city |state |dept_id| dept_name |+------+--------+-----------+--------------+------+-------+--------------------------+| 1| Harry| 5| Bangalore | KA | 5 | Information Technology || 2| Ron | 5| Pune | MH | 5 | Information Technology || 3| Neville| 10| Mumbai | MH | 10 | Engineering || 4| Malfoy| 20| Chennai | TN | 20 | Marketting |+------+--------+-----------+--------------+------+-------+--------------------------+ |
|