
Lemmatization taking forever with spaCy

I'm trying to lemmatize chat records in a dataframe using spaCy. My code is:

import spacy

nlp = spacy.load("es_core_news_sm")
df["text_lemma"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))

I have approx. 600,000 rows and the apply takes more than two hours to execute. Is there a faster package/way to lemmatize? (I need a solution that works for Spanish.)

So far I have only tried the spaCy package.

Comments:
2023-01-24 00:30:08
Use nlp.pipe, which is the API for bulk annotation.
Answers (2):

The apply method can be slow when working with large data sets like yours, because it applies the function to each row of the dataframe sequentially.

You could try using the concurrent.futures module to parallelize the lemmatization process, which could speed up the execution time. Here's an example of how you might use it:

import spacy
from concurrent.futures import ProcessPoolExecutor

nlp = spacy.load("es_core_news_sm")

def lemmatize_text(text):
    doc = nlp(text)
    return " ".join(w.lemma_ for w in doc)

# df is assumed to already exist with a "text" column
if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        # executor.map preserves input order, so the results line up
        # with the rows of df["text"] without any re-matching
        df["text_lemma"] = list(executor.map(lemmatize_text, df["text"], chunksize=100))

This will use multiple processes to lemmatize the text in parallel, which can significantly speed things up. Bear in mind that each worker process loads its own copy of the model, so the start-up and memory cost only pays off on large datasets like yours.
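
Note that spaCy also ships its own multiprocessing support, so a possible alternative to managing the pool yourself is to pass n_process to nlp.pipe(). A minimal sketch (the worker count of 4 is just an example; tune it to your CPU):

import spacy

nlp = spacy.load("es_core_news_sm")

# spaCy manages the worker pool internally; each worker runs the full pipeline
df["text_lemma"] = [
    " ".join(w.lemma_ for w in doc)
    for doc in nlp.pipe(df["text"], n_process=4)
]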

Another alternative is to use a different package such as NLTK or Pattern, which can be faster than spaCy for this task, especially if you only need lemmatization.
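
For example, a minimal sketch with NLTK. One caveat: NLTK does not ship a Spanish lemmatizer (its WordNetLemmatizer is English-only), so the closest cheap substitute is its Spanish Snowball stemmer, which produces stems rather than true lemmas; the "text_stem" column name is just for illustration:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("spanish")

# Stemming is a rougher normalisation than lemmatization, but it is very fast;
# str.split() is a naive tokenizer and could be swapped for something better
df["text_stem"] = df["text"].apply(
    lambda row: " ".join(stemmer.stem(w) for w in row.split())
)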

Finally, you could consider using a pre-trained model such as Flair or polyglot-neural for lemmatization; these models are fast and accurate for Spanish text.

The slow-down in processing speed comes from the repeated calls to the spaCy pipeline via nlp(), one per row. The faster way to process large texts is to process them as a stream using nlp.pipe(), which batches documents internally and avoids the per-call overhead. When I tested this on 5000 rows of dummy text it gave a dramatic speed-up over the original method, which took ~9.67 seconds. There are ways to improve this further if required; see this checklist for spaCy optimisation I made.

Solution

import spacy

# Assume dataframe (df) already contains column "text" with text

# Load spaCy pipeline
nlp = spacy.load("es_core_news_sm")

# Process large text as a stream via `nlp.pipe()`
docs = list(nlp.pipe(df["text"]))

# Iterate over the results and extract lemmas
lemma_text_list = []
for doc in docs:
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))
df["text_lemma"] = lemma_text_list

Full code for testing timings

import spacy
import pandas as pd
import time

# Random Spanish sentences
rand_es_sentences = [
    "Tus drafts influirán en la puntuación de las cartas según tu número de puntos DCI.",
    "Información facilitada por la División de Conferencias de la OMI en los cuestionarios enviados por la DCI.",
    "Oleg me ha dicho que tenías que decirme algo.",
    "Era como tú, muy buena con los ordenadores.",
    "Mas David tomó la fortaleza de Sion, que es la ciudad de David."]

# Repeat the sentences to build 5000 rows
es_text = [sent for i in range(1000) for sent in rand_es_sentences]
# Create data-frame
df = pd.DataFrame({"text": es_text})
# Load spaCy pipeline
nlp = spacy.load("es_core_news_sm")


# Original method (very slow due to multiple calls to `nlp()`)
t0 = time.time()
df["text_lemma_1"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))
t1 = time.time()
print("Total time: {}".format(t1-t0))  # ~9.6746 seconds on 5000 rows


# Faster method, processing rows as a stream via `nlp.pipe()`
t0 = time.time()
docs = list(nlp.pipe(df["text"]))
lemma_text_list = []
for doc in docs:
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))
df["text_lemma_2"] = lemma_text_list
t1 = time.time()
print("Total time: {}".format(t1-t0))  # substantially faster than the per-row method above
Comments:
2023-01-24 00:30:08
The call to list is not great because you're materializing the entire pipe. Just use the generator object.
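
For reference, a minimal sketch of the generator form suggested above, reusing the nlp and df from the answer; iterating nlp.pipe() directly streams the Docs instead of holding them all in memory:

# Iterating the generator one Doc at a time keeps memory usage flat
df["text_lemma"] = [
    " ".join(token.lemma_ for token in doc)
    for doc in nlp.pipe(df["text"])
]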