Mastering Topic Modeling with the BERTopic Library in Python
Introduction to Topic Modeling
Text contains a wealth of information, both explicit and implicit. Just like other data forms, text can harbor hidden insights that aren't immediately visible. To uncover these insights, we rely on data science techniques. In this article, I will guide you through the process of implementing topic modeling with the BERTopic library. Let’s dive in!
Understanding the Mechanism
Before we embark on the modeling journey, it’s essential to grasp the mechanism underpinning the library. In essence, the algorithm operates in three main stages: embedding the documents, clustering them, and extracting topic representations. In more detail (see the sketch after this list):
- The BERT model creates a representation vector for each document.
- The UMAP algorithm reduces the dimensionality of these vectors.
- The HDBSCAN algorithm is employed for clustering, grouping texts with similar meanings.
- The c-TF-IDF algorithm extracts the most relevant words for each identified topic.
- Finally, the Maximal Marginal Relevance (MMR) algorithm is applied to improve the diversity of the words that represent each topic.
Source: The BERTopic Documentation.
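To make these stages concrete, here is a minimal sketch of how they map onto BERTopic’s components. The specific models and parameter values below are illustrative assumptions, not settings taken from this article:

from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP

# Each pipeline stage is a swappable component passed to BERTopic.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # BERT-style embeddings
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")  # dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean")  # clustering

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)  # c-TF-IDF and MMR run internally on the resulting clusters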
Implementation Steps
Getting and Loading the Data
To download the data, we can streamline the process by using the Kaggle API. You will need a JSON file containing your Kaggle API key: navigate to the Account tab on your Kaggle profile page and click "Create New API Token," which downloads a kaggle.json file.
Once you have the kaggle.json file, place it in the .kaggle folder in your home directory.
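A minimal sketch of how to do this in Python, assuming kaggle.json sits in your current working directory (the destination path is the Kaggle CLI default):

import shutil
from pathlib import Path

# Create ~/.kaggle if it doesn't exist and copy the API token into it.
kaggle_dir = Path.home() / ".kaggle"
kaggle_dir.mkdir(exist_ok=True)
shutil.copy("kaggle.json", kaggle_dir / "kaggle.json")

# The Kaggle CLI rejects tokens that other users can read.
(kaggle_dir / "kaggle.json").chmod(0o600)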
Assuming there are no issues, you can download the dataset with the Kaggle CLI.
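A sketch of the command; the dataset slug is a placeholder, so substitute the <owner>/<dataset-name> slug shown on the dataset's Kaggle page:

kaggle datasets download -d <owner>/<dataset-name>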
Next, let’s unpack the downloaded archive.
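A sketch using Python's built-in zipfile module; the archive name is a placeholder that should match the file the CLI downloaded:

import zipfile

# Extract the downloaded archive into the working directory.
with zipfile.ZipFile("<dataset-name>.zip") as archive:
    archive.extractall(".")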
Now we can load the data with pandas.
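A sketch; the CSV name is an assumption, and the 20,000-row sample matches the note below:

import pandas as pd

# Load the tweets; the file name is a placeholder for the extracted CSV.
df = pd.read_csv("tweets.csv")

# Sample 20,000 tweets to stay within memory limits (see the note below).
df = df.sample(n=20000, random_state=42).reset_index(drop=True)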
Note: The original dataset contains roughly 100,000 tweets. To optimize our process, we'll sample it down to 20,000. I previously attempted topic modeling with the complete dataset but encountered memory constraints. If your resources allow, feel free to work with the full dataset.
Cleaning the Data
Text data often contains noise. Before diving into modeling, it’s crucial to clean the data to ensure high-quality topics. We can use libraries like NLTK and re to filter out unnecessary elements such as mentions, hashtags, and links.
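A sketch of a regex-based cleaner with optional NLTK stopword removal; the column names text and clean_text are assumptions that should match your CSV:

import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    # Strip mentions, hashtags, and links in one pass.
    text = re.sub(r"@\w+|#\w+|http\S+|www\.\S+", "", text)
    # Drop stopwords and collapse the remaining whitespace.
    words = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(words)

df["clean_text"] = df["text"].astype(str).apply(clean_tweet)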
Modeling the Data with BERT
With clean data in hand, we can now proceed to topic modeling using the BERTopic library. First, install the library via pip with the following command:
pip install bertopic
Now, let’s import the library.
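The main entry point is the BERTopic class:

from bertopic import BERTopic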
Next, we need to prepare two pieces of data: the tweet text and the corresponding timestamps.
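A sketch that pulls both out of the DataFrame; the column names are assumptions that should match your CSV:

# Parallel lists: one document per tweet, plus its timestamp.
tweets = df["clean_text"].tolist()
timestamps = df["date"].tolist()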
Now, we can initiate the topic modeling process. To do this, simply create an instance of the BERTopic object, which will fit and transform the tweets to generate relevant topics.
For those familiar with scikit-learn, the fit/transform API will feel intuitive.
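A minimal sketch (language="english" is the library default):

# Fit the model on the tweets and assign each one a topic.
topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(tweets)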
From this, we can extract the topics present in the dataset and display a table with each topic and the number of associated tweets.
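The get_topic_info() method returns exactly this table:

# One row per topic: its ID, size, and most representative words.
freq = topic_model.get_topic_info()
freq.head(10)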
You’ll find that 357 topics are identified. Note that topic -1 is excluded from analysis: it is the outlier bucket and carries no coherent meaning. Among the tweets, 304 discuss viewers' experiences watching the Olympics, alongside topics celebrating medal-winning athletes.
Visualizing the Results
To better understand the results, let’s visualize them. The first visualization is an intertopic distance map, which shows how the topics relate to one another.
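The built-in visualize_topics() method produces this interactive map:

# Interactive intertopic distance map (2-D projection of the topics).
topic_model.visualize_topics()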
Next, we can generate bar charts displaying the most frequent words for each topic.
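The visualize_barchart() method draws these charts:

# Bar charts of the highest-scoring words for the largest topics.
topic_model.visualize_barchart()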
This chart is visually appealing, but by default it covers only the top eight topics; pass a larger top_n_topics to visualize_barchart to include more.
Another useful visualization is a heat map revealing the similarities between topics.
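The visualize_heatmap() method draws the topic similarity matrix:

# Heat map of pairwise topic similarities.
topic_model.visualize_heatmap()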
Tracking Topic Occurrences Over Time
Since we included timestamps, we can visualize trends over time for specific topics. First, compute each topic's frequency per timestamp.
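A sketch using topics_over_time(); nr_bins=20 is an illustrative choice that groups the timestamps into bins, and note that older BERTopic versions also expect the topics list as an argument:

# Frequency of every topic within each time bin.
topics_over_time = topic_model.topics_over_time(tweets, timestamps, nr_bins=20)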
Now, visualize the resulting time series showing the trend of each topic.
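The visualize_topics_over_time() method plots them; top_n_topics=10 is an illustrative limit that keeps the chart readable:

# Line chart of topic frequencies over time for the ten largest topics.
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)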
Conclusion
Congratulations! You've successfully learned how to perform topic modeling using BERT with the BERTopic library. I hope this guide has provided you with valuable insights and practical skills for extracting meaningful information from text. If you have any questions, feel free to reach out via email or connect with me on LinkedIn.
Thank you for reading!