Mastering Topic Modeling with the BERTopic Library in Python
Introduction to Topic Modeling
Text contains a wealth of information, both explicit and implicit. Just like other data forms, text can harbor hidden insights that aren't immediately visible. To uncover these insights, we rely on data science techniques. In this article, I will guide you through the process of implementing topic modeling with the BERTopic library. Let’s dive in!
Understanding the Mechanism
Before we embark on the modeling journey, it’s essential to grasp the mechanism underpinning the library. In essence, the algorithm operates in three main stages: embedding the documents, clustering them, and extracting topic representations. In more detail (see the sketch after this list):
- The BERT model creates a representation vector for each document.
- The UMAP algorithm reduces the dimensionality of these vectors.
- The HDBSCAN algorithm is employed for clustering, grouping texts with similar meanings.
- The c-TF-IDF algorithm extracts the most relevant words for each identified topic.
- Finally, the Maximal Marginal Relevance (MMR) algorithm is applied to improve the diversity of the words that represent each topic.
Source: The BERTopic Documentation.
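To make these stages concrete, here is a minimal sketch of how they map onto BERTopic’s components. The specific models and parameter values below are illustrative assumptions, not settings taken from this article:

from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP

# Each pipeline stage is a swappable component passed to BERTopic.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # BERT-style embeddings
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")  # dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean")  # clustering

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)  # c-TF-IDF and MMR run internally on the resulting clusters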
Implementation Steps
Getting and Loading the Data
To download the data, we can streamline the process by using the Kaggle API. You will need a JSON file containing your Kaggle API key: navigate to the Account tab on your Kaggle profile page and click "Create New API Token," which downloads a kaggle.json file.
Once you have the kaggle.json file, place it in the .kaggle folder in your home directory.
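A minimal sketch of how to do this in Python, assuming kaggle.json sits in your current working directory (the destination path is the Kaggle CLI default):

import shutil
from pathlib import Path

# Create ~/.kaggle if it doesn't exist and copy the API token into it.
kaggle_dir = Path.home() / ".kaggle"
kaggle_dir.mkdir(exist_ok=True)
shutil.copy("kaggle.json", kaggle_dir / "kaggle.json")

# The Kaggle CLI rejects tokens that other users can read.
(kaggle_dir / "kaggle.json").chmod(0o600)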
Assuming there are no issues, you can download the dataset with the Kaggle CLI.
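A sketch of the command; the dataset slug is a placeholder, so substitute the <owner>/<dataset-name> slug shown on the dataset's Kaggle page:

kaggle datasets download -d <owner>/<dataset-name>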
Next, let’s unpack the downloaded archive.
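A sketch using Python's built-in zipfile module; the archive name is a placeholder that should match the file the CLI downloaded:

import zipfile

# Extract the downloaded archive into the working directory.
with zipfile.ZipFile("<dataset-name>.zip") as archive:
    archive.extractall(".")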
Now we can load the data with pandas.
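A sketch; the CSV name is an assumption, and the 20,000-row sample matches the note below:

import pandas as pd

# Load the tweets; the file name is a placeholder for the extracted CSV.
df = pd.read_csv("tweets.csv")

# Sample 20,000 tweets to stay within memory limits (see the note below).
df = df.sample(n=20000, random_state=42).reset_index(drop=True)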
Note: The original dataset contains roughly 100,000 tweets. To optimize our process, we'll sample it down to 20,000. I previously attempted topic modeling with the complete dataset but encountered memory constraints. If your resources allow, feel free to work with the full dataset.
Cleaning the Data
Text data often contains noise. Before diving into modeling, it’s crucial to clean the data to ensure high-quality topics. We can use libraries like NLTK and re to filter out unnecessary elements such as mentions, hashtags, and links.
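A sketch of a regex-based cleaner with optional NLTK stopword removal; the column names text and clean_text are assumptions that should match your CSV:

import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    # Strip mentions, hashtags, and links in one pass.
    text = re.sub(r"@\w+|#\w+|http\S+|www\.\S+", "", text)
    # Drop stopwords and collapse the remaining whitespace.
    words = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(words)

df["clean_text"] = df["text"].astype(str).apply(clean_tweet)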
Modeling the Data with BERT
With clean data in hand, we can now proceed to topic modeling using the BERTopic library. First, install the library via pip with the following command:
pip install bertopic
Now, let’s import the library.
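The main entry point is the BERTopic class:

from bertopic import BERTopic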
Next, we need to prepare two pieces of data: the tweet text and the corresponding timestamps.
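A sketch that pulls both out of the DataFrame; the column names are assumptions that should match your CSV:

# Parallel lists: one document per tweet, plus its timestamp.
tweets = df["clean_text"].tolist()
timestamps = df["date"].tolist()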
Now, we can initiate the topic modeling process. To do this, simply create an instance of the BERTopic object, which will fit and transform the tweets to generate relevant topics.
For those familiar with scikit-learn, the fit/transform API will feel intuitive.
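A minimal sketch (language="english" is the library default):

# Fit the model on the tweets and assign each one a topic.
topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(tweets)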
From this, we can extract the topics present in the dataset and display a table with each topic and the number of associated tweets.
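The get_topic_info() method returns exactly this table:

# One row per topic: its ID, size, and most representative words.
freq = topic_model.get_topic_info()
freq.head(10)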
You’ll find that 357 topics are identified. Note that topic -1 is excluded from analysis: it is the outlier bucket and carries no coherent meaning. Among the tweets, 304 discuss viewers' experiences watching the Olympics, alongside topics celebrating medal-winning athletes.
Visualizing the Results
To better understand the results, let’s visualize them. The first visualization is an intertopic distance map, which shows how the topics relate to one another.
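The built-in visualize_topics() method produces this interactive map:

# Interactive intertopic distance map (2-D projection of the topics).
topic_model.visualize_topics()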
Next, we can generate bar charts displaying the most frequent words for each topic.
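The visualize_barchart() method draws these charts:

# Bar charts of the highest-scoring words for the largest topics.
topic_model.visualize_barchart()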
This chart is visually appealing, but by default it covers only the top eight topics; pass a larger top_n_topics to visualize_barchart to include more.
Another useful visualization is a heat map revealing the similarities between topics.
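The visualize_heatmap() method draws the topic similarity matrix:

# Heat map of pairwise topic similarities.
topic_model.visualize_heatmap()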
Tracking Topic Occurrences Over Time
Since we included timestamps, we can visualize trends over time for specific topics. First, compute each topic's frequency per timestamp.
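A sketch using topics_over_time(); nr_bins=20 is an illustrative choice that groups the timestamps into bins, and note that older BERTopic versions also expect the topics list as an argument:

# Frequency of every topic within each time bin.
topics_over_time = topic_model.topics_over_time(tweets, timestamps, nr_bins=20)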
Now, visualize the resulting time series showing the trend of each topic.
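The visualize_topics_over_time() method plots them; top_n_topics=10 is an illustrative limit that keeps the chart readable:

# Line chart of topic frequencies over time for the ten largest topics.
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)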
Conclusion
Congratulations! You've successfully learned how to perform topic modeling using BERT with the BERTopic library. I hope this guide has provided you with valuable insights and practical skills for extracting meaningful information from text. If you have any questions, feel free to reach out via email or connect with me on LinkedIn.
Thank you for reading!