I would like to create a new column in a pyspark dataframe using a function from an external python module I've installed.
For example, I want to use a function get_tld
from a module publicsuffix2
, which extracts the public suffix from a domain using the Public Suffix List.
My current solution uses a udf:
import pyspark.sql.functions as F
from publicsuffix2 import get_tld
df = spark.createDataFrame([{"domain": "stackoverflow.com"},
{"domain": "wikipedia.org"},
{"domain": "google.invalid"}])
get_public_suffix = F.udf(lambda x: get_tld(x))
df.withColumn("public suffix", get_public_suffix(F.col("domain"))).show()
| domain | public suffix |
| ----------------- | ------------- |
| stackoverflow.com | com |
| wikipedia.org | org |
| google.invalid | null |
My questions are:
get_tld
source code here. if you have relatively simple ICANN TLD
(like all of your domain names are .com
,.org
, etc), i guess you could probably just use a simple string splitting operation. however for more complex urls where you might want to return something like .co.uk
you'll need to use a udf
in order to use get_tld
UDFs are Blackbox — Don’t Use Them Unless You’ve Got No Choice
A thump of rule that I always follow when working with spark data frames and transformations, is to avoid using UDFs as much as possible. Using UDFs will kill the performance and the parallelism of spark as it needs to serialize each row and send it to the python runtime.
In the above question, the same logic can be applied using native spark functions like F.split, below is the code on how to implement that using native spark methods.
df_suffix = df.withColumn("pub suffix",F.split("domain","\.")[1])
Usually, when I face issues where I need to use UDFs I ask myself the below questions:
df = spark.createDataFrame([{"domain": "stackoverflow.com"},{"domain": "wikipedia.org"},{"domain": "google.invalid"}])
import pyspark.sql.functions as f
suf=df.withColumn("public suffix",f.split("domain","\.")[1])
display(suf)
Output:
| domain | public suffix|
| ----------------- | ------------ |
| stackoverflow.com | com |
| wikipedia.org | org |
| google.invalid | invalid |