Topic Model Comparison Module
Overview
The Topic Model Comparison module enables you to run and evaluate multiple topic modeling approaches side by side, helping identify the optimal method and parameters for your specific data. This experimental approach takes the guesswork out of topic modeling by quantitatively measuring how different algorithms and settings perform on the same dataset.
This module allows you to:

- Configure and run multiple topic modeling methods with different parameters
- Visualize similarities and differences between topic models
- Compare topic coherence, distinctiveness, and distribution
- Analyze how parameter changes affect topic quality
- Generate detailed reports with findings and recommendations
- Export comparison results for documentation and sharing
By systematically comparing approaches, you can make data-driven decisions about which topic modeling method will best reveal the underlying themes in your documents.
Main interface of the Comparison Tab
Core Components
ComparisonTab Class
The ComparisonTab class is the main UI component that manages the comparison process, visualizations, and reporting.
Key Methods
Method | Description |
---|---|
start_comparison() | Launches the comparison configuration dialog and initiates the process |
execute_runs() | Manages the execution of multiple topic model runs in sequence |
generate_visualizations() | Creates all comparison visualizations when runs are complete |
generate_summary_report() | Creates a detailed analysis of the comparison with recommendations |
export_report() | Exports the comparison results in HTML or Markdown format |
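For orientation, here is a minimal, hypothetical skeleton of how these methods might fit together. The bodies are placeholders and the `runs` attribute is an assumption, not the module's actual implementation.

```python
class ComparisonTab:
    def __init__(self):
        self.runs = []  # completed TopicModelRun objects (assumed attribute)

    def start_comparison(self):
        """Open the comparison configuration dialog and collect run configs."""
        ...

    def execute_runs(self, configs):
        """Run each configured topic model in sequence, storing results in self.runs."""
        ...

    def generate_visualizations(self):
        """Rebuild the similarity, word-weight, distribution, and parameter charts."""
        ...

    def generate_summary_report(self):
        """Aggregate metrics across runs into findings and recommendations."""
        ...

    def export_report(self, path, fmt="html"):
        """Write the summary report to HTML or Markdown."""
        ...
```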
TopicModelRun Class
The TopicModelRun class stores all information about a single topic modeling run, including its configuration, results, and metrics.
Properties
Property | Description |
---|---|
name | User-defined name for the run |
method | Topic modeling method used (BERTopic, NMF, LDA, etc.) |
parameters | Configuration parameters used for the run |
documents | Documents processed in this run |
topics | Topic assignments for each document |
topic_words | Words and their weights for each topic |
metrics | Quality metrics calculated for this run |
timestamp | When the run was executed |
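As a rough sketch, the run data could be modeled as a dataclass like the one below. The field names follow the property table; the types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class TopicModelRun:
    name: str                                         # user-defined run name
    method: str                                       # e.g. "BERTopic (UMAP)", "NMF", "LDA"
    parameters: dict[str, Any]                        # configuration used for the run
    documents: list[str]                              # documents processed in this run
    topics: list[int]                                 # topic assignment per document
    topic_words: dict[int, list[tuple[str, float]]]   # topic id -> (word, weight) pairs
    metrics: dict[str, float] = field(default_factory=dict)   # quality metrics
    timestamp: datetime = field(default_factory=datetime.now)  # when the run was executed
```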
Helper Dialogs
- RunConfigDialog: UI for configuring a single topic model run
- ComparisonRunDialog: UI for setting up multiple runs to compare
- ElbowMethodDialog: Interactive tool for finding optimal LDA topic counts
User Interface
The Comparison Tab provides a comprehensive interface for setting up, visualizing, and analyzing multiple topic model runs.
UI Components
- Control Buttons:
  - "Start New Comparison" button to configure and run a new comparison
  - "Export Report" button to export comparison results
- Run Selection Table:
  - Displays all completed runs with their methods, parameters, and metrics
  - Allows selection of specific runs for detailed comparison
- Visualization Tabs:
  - Topic Similarity: Heatmap showing relationships between topics across runs
  - Word Weights: Comparison of word importance in similar topics
  - Topic Distribution: Chart showing document distribution across topics
  - Parameter Impact: Analysis of how parameters affect topic quality
  - Summary Report: Detailed analysis with findings and recommendations
Usage Guide
Starting a Comparison
- Ensure you have documents loaded in the application
- Click the "Start New Comparison" button
- In the dialog that appears, click "Add Run" to configure your first run
- For each run:
  - Provide a descriptive name
  - Select a topic modeling method
  - Configure parameters (number of topics, min topic size, etc.)
  - Click OK to add the run to the comparison
- Add 2-5 runs with different methods or parameters
- Click "Start Comparison" to begin processing
Run Configuration Options
When configuring each run, you can adjust these key parameters:
- Method:
  - BERTopic (UMAP): High-quality semantic topics using BERT embeddings
  - BERTopic (PCA): More stable BERTopic variant using PCA
  - NMF: Non-negative Matrix Factorization for traditional topic modeling
  - LDA: Latent Dirichlet Allocation, a probabilistic approach
- Language: Select the primary language of your documents
- Number of Topics: Set a specific number or choose "Auto"
- Min Topic Size: Minimum documents required to form a topic
- N-gram Range: Whether to include phrases (2+ words) in topics
For LDA models, additional options are available:

- Elbow Method: Automatically find the optimal number of topics
- Min/Max Topics: Range to search for the optimal topic count
- Step Size: Granularity of the topic count search
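To make the options concrete, the sketch below shows three hypothetical run configurations that vary one setting at a time. The dictionary keys are illustrative and may not match the module's internal parameter names.

```python
# Illustrative run configurations: method, granularity, and n-grams varied one at a time.
runs_to_compare = [
    {
        "name": "BERTopic-auto",
        "method": "BERTopic (UMAP)",
        "language": "english",
        "num_topics": "auto",       # let the method choose the number of topics
        "min_topic_size": 10,
        "ngram_range": (1, 2),      # include two-word phrases
    },
    {
        "name": "NMF-10-topics",
        "method": "NMF",
        "language": "english",
        "num_topics": 10,           # fixed topic count
        "min_topic_size": 5,
        "ngram_range": (1, 1),      # single words only
    },
    {
        "name": "LDA-elbow",
        "method": "LDA",
        "language": "english",
        "elbow_method": True,       # search for the optimal topic count
        "min_topics": 5,
        "max_topics": 25,
        "step_size": 5,
    },
]
```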
Understanding Visualizations
Topic Similarity Heatmap
This visualization shows how topics from different runs relate to each other:

- Each cell represents similarity between two topics (from the same or different runs)
- Darker colors indicate higher similarity
- Helps identify consistent topics that appear across multiple methods
- Shows which topics are method-specific versus universal in your data
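A heatmap of this kind can be approximated by treating each topic as a word-weight vector over a shared vocabulary and taking cosine similarities between topic pairs. The sketch below is a simplified stand-in for the module's visualization; function names and data shapes are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

def topic_vectors(topic_words, vocab):
    """Turn {topic_id: [(word, weight), ...]} into a topics x vocabulary matrix."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(topic_words), len(vocab)))
    for row, (_, words) in enumerate(sorted(topic_words.items())):
        for word, weight in words:
            matrix[row, index[word]] = weight
    return matrix

def plot_topic_similarity(run_a_words, run_b_words):
    """Heatmap of cosine similarity between topics of two runs (darker = more similar)."""
    vocab = sorted({w for words in (*run_a_words.values(), *run_b_words.values()) for w, _ in words})
    sim = cosine_similarity(topic_vectors(run_a_words, vocab),
                            topic_vectors(run_b_words, vocab))
    plt.imshow(sim, cmap="Blues")
    plt.xlabel("Run B topics")
    plt.ylabel("Run A topics")
    plt.colorbar(label="cosine similarity")
    plt.show()
```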
Word Weights Comparison
Compares word importance between similar topics across different runs:

- Select a topic to see how its top words compare to similar topics in other runs
- Bar height shows word importance (weight) in each topic
- Helps evaluate topic coherence and specificity across methods
- Reveals semantic differences in how methods interpret similar concepts
Topic Distribution Chart
Shows how documents are distributed across topics in different runs:

- Compares distribution balance between methods
- Identifies methods that produce more evenly distributed topics
- Shows if some methods create "catch-all" topics or many small topics
- Helps evaluate coverage and granularity of different approaches
Parameter Impact Chart
Visualizes how changing parameters affects topic quality:

- Scatter plot showing the relationship between number of topics and topic quality
- Points colored by method to compare different approaches
- Helps identify optimal parameter settings for your data
- Reveals how methods scale with different parameter values
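A chart of this kind can be reproduced with a simple scatter plot, as in the sketch below. The `coherence` metric key and attribute names are assumptions following the TopicModelRun sketch above.

```python
import matplotlib.pyplot as plt

def plot_parameter_impact(runs):
    """Scatter of topics found vs. quality, one color per method."""
    for method in sorted({r.method for r in runs}):
        selected = [r for r in runs if r.method == method]
        xs = [len(set(r.topics) - {-1}) for r in selected]         # topics actually found (ignore outlier topic -1)
        ys = [r.metrics.get("coherence", 0.0) for r in selected]   # assumed metric key
        plt.scatter(xs, ys, label=method)
    plt.xlabel("Number of topics")
    plt.ylabel("Topic coherence")
    plt.legend()
    plt.show()
```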
Interpreting the Summary Report
The Summary Report provides a comprehensive analysis of your comparison, including:
- Run Summary: Overview of all runs with their key metrics
- Top Topic Words: The most important words for each topic across all runs
- Recommendations: Data-driven suggestions for optimal methods and parameters
- Method Analysis: Comparative strengths and weaknesses of different approaches
- Parameter Recommendations: Guidance on ideal parameter values for your data
Recommendations Methodology
The report uses several metrics to identify the best approach:
- Optimal Number of Topics: Based on the average number of meaningful topics found across all runs
- Best Method: Typically the run with highest topic coherence and balanced distribution
- N-gram Setting: Based on which n-gram range produces most distinctive topics
- Topic Size Threshold: Determined by analyzing distribution balance
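The exact formulas are internal to the module, but the sketch below shows one plausible way to compute two of these signals: distribution balance as the normalized entropy of topic sizes, and the share of outlier documents (topic -1 in BERTopic-style output). Both are assumptions, not the module's definitions.

```python
import numpy as np
from collections import Counter

def distribution_balance(topics):
    """Normalized entropy of topic sizes: 1.0 = perfectly even, 0.0 = a single topic."""
    counts = np.array([c for t, c in Counter(topics).items() if t != -1], dtype=float)
    if len(counts) < 2:
        return 0.0
    probs = counts / counts.sum()
    return float(-(probs * np.log(probs)).sum() / np.log(len(counts)))

def outlier_share(topics):
    """Fraction of documents left without a clear topic assignment (topic -1)."""
    return sum(1 for t in topics if t == -1) / len(topics)
```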
Exporting Comparison Results
- Click the "Export Report" button
- Choose a format:
  - HTML: Rich formatting with tables and styling, ideal for sharing
  - Markdown: Plain text with formatting, good for documentation
- Select a save location and filename
- The exported report contains:
  - All run configurations and their results
  - Top words for each topic across all runs
  - Analysis and recommendations
  - Summary metrics and charts
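As a simplified illustration of what a Markdown export might contain (not the module's actual template), the sketch below writes a run summary table and the top words per topic, using the field names from the TopicModelRun sketch above.

```python
def export_markdown(runs, path):
    """Write a minimal Markdown comparison report for the given runs."""
    lines = ["# Topic Model Comparison Report", "", "## Run Summary", ""]
    lines += ["Run | Method | Topics | Coherence |", "---|---|---|---|"]
    for run in runs:
        n_topics = len(set(run.topics) - {-1})
        coherence = run.metrics.get("coherence", float("nan"))  # assumed metric key
        lines.append(f"{run.name} | {run.method} | {n_topics} | {coherence:.3f} |")
    lines += ["", "## Top Topic Words", ""]
    for run in runs:
        lines.append(f"### {run.name}")
        for topic_id, words in sorted(run.topic_words.items()):
            top_words = ", ".join(word for word, _ in words[:10])
            lines.append(f"- Topic {topic_id}: {top_words}")
        lines.append("")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```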
Best Practices
Effective Comparison Strategy
- Vary one parameter at a time to isolate its impact
- Include diverse methods to get different perspectives on your data
- Use descriptive run names to easily identify them in visualizations
- Test a range of topic numbers to find the optimal granularity
- Compare results with domain knowledge to validate topic quality
When to Use Each Method
- BERTopic (UMAP): Best for semantic understanding, works well with medium-sized datasets
- BERTopic (PCA): Good alternative when UMAP is unstable with small datasets
- NMF: Fast, deterministic results; good for clearly distinct topics
- LDA: Works well with longer documents; provides probabilistic topic assignments
Identifying the Best Model
The "best" topic model depends on your specific goals:
- For exploration: Prioritize coherence and interpretability
- For document organization: Look for balanced distribution and clear separation
- For content analysis: Focus on specific, meaningful topics
- For classification: Emphasize predictive power and coverage
Common Pitfalls to Avoid
- Relying on a single metric: Consider multiple aspects of topic quality
- Overlooking preprocessing impact: Document cleaning affects all models
- Assuming more topics means better results: Quality often peaks at a moderate number
- Neglecting outlier topics: Check how many documents lack clear topic assignments
- Focusing only on top words: Consider full topic coherence and document assignments
Advanced Usage
Customizing the Comparison Process
- Run a focused sub-comparison on the most promising methods
- Try different preprocessing approaches to see their impact on topic quality
- Combine insights from multiple models for more robust topic identification
- Compare stability by running the same configuration on different document subsets
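A lightweight way to set up such a stability check (a workflow suggestion, not a built-in feature) is to split the corpus into random halves, run the same configuration on each half, and compare the resulting topic words with the similarity approach sketched earlier.

```python
import random

def split_corpus(documents, seed=0):
    """Shuffle the documents deterministically and return two halves."""
    docs = documents[:]
    random.Random(seed).shuffle(docs)
    half = len(docs) // 2
    return docs[:half], docs[half:]

# Usage sketch: run the same configuration on each half, then compare the two
# sets of topic words with plot_topic_similarity(); topics that reappear with
# high similarity across halves are likely to be stable.
```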
Handling Special Cases
Very Small Document Sets
- Focus on NMF and LDA which are more stable with few documents
- Use topic_size=1 to allow single-document topics
- Consider document chunking to increase document count
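A simple chunking helper along these lines (a preprocessing suggestion, not a module feature) might look like this:

```python
def chunk_documents(documents, chunk_size=200, overlap=50):
    """Split each document into overlapping word windows to raise the document count."""
    chunks = []
    step = max(chunk_size - overlap, 1)
    for doc in documents:
        words = doc.split()
        for start in range(0, max(len(words) - overlap, 1), step):
            chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks
```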
Multi-Language Corpora
- Use "multilingual" setting for mixed-language document sets
- Compare language-specific versus multilingual models
- Check topic distribution to ensure fair representation across languages
Domain-Specific Content
- Pay special attention to technical terminology in topic words
- Adjust min_df and max_df in NMF/LDA to handle specialized vocabulary
- Compare with domain expert assessment for topic quality validation
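For NMF/LDA pipelines built on scikit-learn, min_df and max_df are typically set on the vectorizer that produces the document-term matrix. The values below are illustrative; the right thresholds depend on your corpus size and vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    min_df=2,              # drop terms appearing in fewer than 2 documents (likely noise or typos)
    max_df=0.9,            # drop terms appearing in more than 90% of documents (too generic)
    ngram_range=(1, 2),    # keep single words and two-word phrases
)
# doc_term_matrix = vectorizer.fit_transform(documents)  # feed this matrix to NMF or LDA
```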