
Questions & Answers

Use a Python module function with a Spark DataFrame

I would like to create a new column in a PySpark DataFrame using a function from an external Python module I've installed.

For example, I want to use the function get_tld from the module publicsuffix2, which extracts the public suffix from a domain using the Public Suffix List.

My current solution uses a udf:

import pyspark.sql.functions as F
from publicsuffix2 import get_tld

df = spark.createDataFrame([{"domain": "stackoverflow.com"},
                            {"domain": "wikipedia.org"},
                            {"domain": "google.invalid"}])

get_public_suffix = F.udf(get_tld)  # wraps get_tld as a UDF (string return type by default)
df.withColumn("public suffix", get_public_suffix(F.col("domain"))).show()

| domain            | public suffix |
| ----------------- | ------------- |
| stackoverflow.com | com           |
| wikipedia.org     | org           |
| google.invalid    | null          |

My questions are:

  1. Is there a way to accomplish this without a udf?
  2. If not, is there anything I can do to improve the efficiency of this operation, and what are some best practices to follow when using external modules/libraries such as this?
Comments:
2023-01-19 00:30:07
I am not particularly knowledgeable about public suffixes, but you can look into the get_tld source code here. If all of your domain names have a simple ICANN TLD (like .com, .org, etc.), I guess you could probably just use a simple string-splitting operation. However, for more complex domains where you might want to return something like .co.uk, you'll need to use a udf in order to use get_tld.
Answers (2):

UDFs Are a Black Box: Don't Use Them Unless You Have No Choice

A rule of thumb I always follow when working with Spark DataFrames and transformations is to avoid UDFs as much as possible. A Python UDF hurts performance and parallelism because Spark must serialize each row, ship it to a Python worker process, and deserialize the result, and the function itself is opaque to the Catalyst optimizer.
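If a UDF truly is unavoidable, a vectorized pandas UDF usually cuts the serialization overhead, since rows are transferred in Arrow batches rather than one at a time. Here is a minimal sketch for the get_tld case, assuming Spark 3.x with pandas and pyarrow available on the cluster:

import pandas as pd
import pyspark.sql.functions as F
from publicsuffix2 import get_tld

# Vectorized UDF: receives a pandas Series per Arrow batch
# instead of making one Python call per row.
@F.pandas_udf("string")
def get_public_suffix(domains: pd.Series) -> pd.Series:
    return domains.map(get_tld)

df.withColumn("public suffix", get_public_suffix(F.col("domain"))).show()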

In the above question, though, the domains all have a single-label suffix, so the same result can be obtained with native Spark functions such as F.split. Below is one way to implement that with native Spark methods:

df_suffix = df.withColumn("public suffix", F.element_at(F.split("domain", r"\."), -1))  # last dot-separated label
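As the comment on the question points out, this only ever returns the last label, so it breaks on multi-label public suffixes. A quick illustration (bbc.co.uk is just an example domain):

# The last label is "uk", but the Public Suffix List answer is "co.uk",
# which only get_tld can know.
spark.createDataFrame([{"domain": "bbc.co.uk"}]) \
    .withColumn("public suffix", F.element_at(F.split("domain", r"\."), -1)) \
    .show()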

Usually, when I face a problem where I think I need a UDF, I ask myself the following questions:

  • Is there a PySpark function, or a combination of functions, that will solve my problem?
  • Is there a SQL function for this purpose?
    import pyspark.sql.functions as f

    df = spark.createDataFrame([{"domain": "stackoverflow.com"},
                                {"domain": "wikipedia.org"},
                                {"domain": "google.invalid"}])

    suf = df.withColumn("public suffix", f.element_at(f.split("domain", r"\."), -1))
    display(suf)  # display() is Databricks-specific; use suf.show() elsewhere
Output:
| domain            | public suffix |
| ----------------- | ------------- |
| stackoverflow.com | com           |
| wikipedia.org     | org           |
| google.invalid    | invalid       |
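On the second part of the question, best practices for external modules: the package has to be importable on every executor, not just the driver, or the UDF will fail at runtime with an ImportError. A rough sketch of the common options (the zip path below is hypothetical):

# Option 1: install the package on every node, e.g. via a cluster
# image, init script, or pip install on each machine.
# Option 2: ship it with the job from the driver:
spark.sparkContext.addPyFile("/path/to/publicsuffix2.zip")  # hypothetical path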