
How will you create a PySpark UDF?

Answer»

Consider an example where we want to capitalize the first letter of every word in a string column. PySpark does offer a built-in for this (pyspark.sql.functions.initcap), but the task is a good illustration of the UDF mechanism: we create a UDF capitalizeWord(str) and apply it to a DataFrame. The following steps demonstrate this:

  • Create a Python function capitalizeWord that takes a string as input and capitalizes the first character of every word.
    def capitalizeWord(s):
        result = ""
        words = s.split(" ")
        for word in words:
            result = result + word[0:1].upper() + word[1:] + " "
        return result
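Before wiring the helper into Spark, it can be sanity-checked as plain Python, since it has no Spark dependency. A minimal sketch (note that the simple concatenation approach leaves a trailing space):

```python
def capitalizeWord(s):
    # Capitalize the first character of every space-separated word.
    result = ""
    words = s.split(" ")
    for word in words:
        result = result + word[0:1].upper() + word[1:] + " "
    return result

print(capitalizeWord("harry potter"))  # -> "Harry Potter " (trailing space)
```

If the trailing space matters for downstream comparisons, the result can be wrapped in .strip(), or the function rewritten with " ".join().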
  • Register the function as a PySpark UDF by using the udf() method from the pyspark.sql.functions module, which needs to be imported. This method returns a UserDefinedFunction object that can be applied to columns like a built-in function.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Converting function to UDF
    capitalizeWordUDF = udf(lambda z: capitalizeWord(z), StringType())
  • Use the UDF with a DataFrame: the registered UDF can be applied to DataFrame columns just like a built-in DataFrame function.
    Suppose we have a DataFrame stored in the variable df as below:
    +----------+-----------------+
    |ID_COLUMN |NAME_COLUMN      |
    +----------+-----------------+
    |1         |harry potter     |
    |2         |ronald weasley   |
    |3         |hermoine granger |
    +----------+-----------------+

To capitalize the first character of every word, we can use:

df.select(col("ID_COLUMN"), capitalizeWordUDF(col("NAME_COLUMN")).alias("NAME_COLUMN")).show(truncate=False)

The output of the above code would be:

+----------+-----------------+
|ID_COLUMN |NAME_COLUMN      |
+----------+-----------------+
|1         |Harry Potter     |
|2         |Ronald Weasley   |
|3         |Hermoine Granger |
+----------+-----------------+

UDFs have to be designed so that the underlying algorithms are efficient in both time and space. Python UDFs also incur serialization overhead, since each row is shipped between the JVM and the Python worker. If care is not taken, the performance of the DataFrame operations that invoke them will suffer.


