How I cleaned the text for the quiz game:

Aug 22

3 min read

The `clean_lyrics` function is a specialized text-cleaning function for song lyrics stored in a Pandas DataFrame. It removes unwanted phrases, standardizes formatting, and prepares the text for analysis. Here I'll break down the function’s workflow, noting where it reflects the specific needs of my quiz game, so that other data scientists can understand it and adapt it to their own text preprocessing pipelines.

Function Overview

def clean_lyrics(df, column):

The function takes two arguments:

df: The DataFrame containing the lyrics.
column: The specific column within the DataFrame that needs cleaning.

Key Features and Steps in the Cleaning Process

Identifying Target Words and Phrases

The function defines multiple phrases and words that frequently appear in song lyrics but are generally unwanted in a clean dataset. Examples include phrases like `"see taylor swift live get tickets as low as"` and tags like `"[chorus]"`, `"[verse]"`, and others that do not contribute to meaningful analysis.

These are stored as individual variables (e.g., `phrase_to_replace`, `phrase_to_replace2`, etc.), which are later used for pattern matching and replacement.
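As a rough sketch of how such phrase variables might be applied, each phrase can be escaped with `re.escape` and joined into a single pattern. The second phrase, the sample row, and the combined `pattern` variable are assumptions for illustration; the real function keeps its own, longer list and may apply the phrases one at a time.

```python
import re
import pandas as pd

# Illustrative phrases; the real function has its own, longer list.
phrase_to_replace = "see taylor swift live get tickets as low as"
phrase_to_replace2 = "you might also like"  # hypothetical example phrase

# Escape each phrase so regex metacharacters are matched literally,
# then join the phrases into one alternation pattern.
phrases = [phrase_to_replace, phrase_to_replace2]
pattern = "|".join(re.escape(p) for p in phrases)

# Made-up sample row containing one of the unwanted phrases.
df = pd.DataFrame(
    {"lyrics": ["first line see taylor swift live get tickets as low as second line"]}
)

# Replace every occurrence of any phrase with a single space.
df["lyrics"] = df["lyrics"].str.replace(pattern, " ", regex=True)
```

Replacing with a space rather than an empty string keeps the neighbouring words from being glued together.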

Regex and String Replacement Operations

The function relies heavily on regex and string replacement methods to clean the lyrics:

Removing Digits: Regex is used to eliminate any numeric characters:

df[column] = df[column].apply(lambda x: re.sub(r'\d+', '', x))

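Lowercasing and Removing Specific Words: The text is lowercased, and words such as "instrumental," "intro," "guitar," and "solo" are replaced with spaces. Passing `regex=True` makes sure the `|` alternation is treated as a pattern, since pandas 2.0 and later match the string literally by default:

df[column] = df[column].str.lower().str.replace(r"instrumental|intro|guitar|solo", ' ', regex=True)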

Cleaning Special Characters: The function replaces newlines with spaces, then removes punctuation and any other character that is not a letter, digit, underscore, apostrophe, or whitespace:

df[column] = df[column].str.replace("\n", " ").str.replace(r"[^\w'\s]+", "", regex=True)
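To see what the three steps above do together, here is a small worked sketch on a made-up lyric line. The sample text is invented for illustration, and `regex=True` is passed explicitly so the patterns behave the same way across pandas versions.

```python
import re
import pandas as pd

# Made-up sample row; the column name "lyrics" is an assumption.
df = pd.DataFrame({"lyrics": ["Intro\nIt's 2 AM, (oh-oh) don't go!"]})
column = "lyrics"

# Step 1: remove digits.
df[column] = df[column].apply(lambda x: re.sub(r"\d+", "", x))

# Step 2: lowercase, then replace the marker words with a space.
df[column] = df[column].str.lower().str.replace(
    r"instrumental|intro|guitar|solo", " ", regex=True
)

# Step 3: newlines become spaces; anything that is not a word character,
# an apostrophe, or whitespace is dropped.
df[column] = df[column].str.replace("\n", " ").str.replace(
    r"[^\w'\s]+", "", regex=True
)

# Split on whitespace to inspect the surviving words.
words = df[column].iloc[0].split()
```

One thing to watch: the alternation matches substrings, so a word like "introduce" would lose its "intro". Wrapping the pattern in word boundaries, as in `r"\b(?:instrumental|intro|guitar|solo)\b"`, is one way to restrict it to whole words if that matters for your data.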

Handling Repetitive Phrases and Tags

Common repetitive phrases like `"embed"` and `[chorus]` are identified and replaced using regex. The function contains a long list of such replacements to ensure the cleaned lyrics are as concise as possible.
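As an alternative to listing every tag separately, a single generic pattern can cover any square-bracketed tag. This is a sketch under assumptions, not the function's actual replacement list: the two patterns and the sample row below are made up for illustration.

```python
import pandas as pd

# Made-up sample row with section tags and a trailing "embed" marker.
df = pd.DataFrame(
    {"lyrics": ["[verse 1] hello there [chorus] sing along 12embed"]}
)

# Drop any square-bracketed section tag, e.g. [chorus] or [verse 1].
df["lyrics"] = df["lyrics"].str.replace(r"\[[^\]]*\]", " ", regex=True)

# Drop a trailing "embed" marker, with an optional count in front of it.
df["lyrics"] = df["lyrics"].str.replace(r"\d*embed\s*$", "", regex=True)

# Split on whitespace to inspect the surviving words.
words = df["lyrics"].iloc[0].split()
```

A generic pattern is shorter, but an explicit list gives you tighter control over exactly what is removed.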

Trimming Unwanted Content Before the Word “Lyrics”

A key feature of this function is how it drops everything up to and including the first occurrence of the word "lyrics" (with `word` set to "lyrics"), since whatever comes before it is usually unwanted content rather than the song itself:

df[column] = df[column].apply(lambda x: x[x.find(word) + len(word):] if x.find(word) != -1 else x)
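The same trim can be written with `str.partition`, which avoids calling `find` twice. This is only a sketch: the helper name `trim_through` and the sample rows are made up for illustration, and `word` is assumed to hold "lyrics".

```python
import pandas as pd

word = "lyrics"  # assumed value of the marker variable


def trim_through(text, marker):
    """Return the text after the first occurrence of marker.

    If the marker is absent, the text is returned unchanged.
    """
    _, found, rest = text.partition(marker)
    return rest if found else text


# Made-up sample rows: one with the marker, one without.
df = pd.DataFrame(
    {
        "lyrics": [
            "contributors translations song title lyrics first real line",
            "no marker here",
        ]
    }
)

df["lyrics"] = df["lyrics"].apply(lambda x: trim_through(x, word))
result = df["lyrics"].tolist()
```

Because only the first occurrence is used, a song whose title contains the marker word would be cut at the title rather than at the page header.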

Final Text Cleanup

The function performs a final strip operation to remove leading and trailing spaces after all replacements are complete:

df[column] = df[column].str.strip()
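Because many of the earlier steps substitute a space for the removed text, the result can contain runs of several spaces that `strip` alone does not touch. If that matters for your use, one optional addition (not part of the function as described here) is to collapse whitespace before stripping. The sample row is made up.

```python
import pandas as pd

# Made-up sample row with leftover runs of spaces.
df = pd.DataFrame({"lyrics": ["  it's   am  ohoh don't go  "]})

# Collapse any run of whitespace to a single space, then trim the ends.
df["lyrics"] = (
    df["lyrics"].str.replace(r"\s+", " ", regex=True).str.strip()
)

cleaned = df["lyrics"].iloc[0]
```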

Use Cases

This function is particularly useful in preprocessing large collections of song lyrics for tasks like:

Sentiment Analysis: Cleaned lyrics provide more reliable sentiment scores. (This is what I used for the quiz game.)
Topic Modeling: Removing noise enhances topic detection accuracy.
NLP Tasks: Standardizing text format ensures better tokenization and word embedding results.

Customization Tips for Data Scientists

The function is highly customizable. You can:

Add or Remove Specific Phrases: Depending on your dataset, you may want to adjust the list of phrases being removed.
Modify Regex Patterns: Tailor the patterns based on your specific text-cleaning needs.
Mind the Order of Replacements: I found that some phrases, words, or character patterns can only be removed after others have been removed first. Certain combinations of words and characters simply would not match unless the replacements ran in a specific order.
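A small sketch of why ordering matters, using two hypothetical helpers (`strip_punct` and `strip_tags`, neither taken from the original function) on a made-up string. If punctuation is stripped first, the brackets disappear and the tag pattern has nothing left to match.

```python
import re

raw = "[chorus] shake it off"  # made-up sample text


def strip_punct(text):
    """Remove everything except word characters, apostrophes, and whitespace."""
    return re.sub(r"[^\w'\s]+", "", text)


def strip_tags(text):
    """Replace any square-bracketed tag with a space."""
    return re.sub(r"\[[^\]]*\]", " ", text)


# Punctuation first: the brackets vanish, so the tag pattern no longer
# matches and the word "chorus" survives.
wrong_order = strip_tags(strip_punct(raw)).strip()

# Tags first: the whole "[chorus]" token is removed before punctuation
# is stripped.
right_order = strip_punct(strip_tags(raw)).strip()
```

Here `wrong_order` still contains "chorus" while `right_order` does not.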

Conclusion

The `clean_lyrics` function is a versatile tool for data scientists working on music data. It combines regex, string operations, and custom logic to transform raw lyrics into clean, analysis-ready text. By understanding each step, you can easily adapt this function to suit your project’s requirements.

Topics

- Text Cleaning

- NLP Preprocessing

- Pandas DataFrame

- Song Lyrics Analysis

- Regex for Data Cleaning

- Natural Language Processing (NLP)

John Zyski