Topic Validation Module

Overview

The Topic Validation module enables you to evaluate the quality of extracted topics and determine the optimal number of topics for your document collection. It provides quantitative metrics to assess whether your topic model is effectively capturing the underlying themes in your text data.

This module allows you to: - Measure the quality of extracted topics using multiple metrics - Find the optimal number of topics for your document collection - Visualize topic distribution and quality metrics - Make data-driven decisions to improve your topic model

Unlike the extraction process which focuses on finding topics, validation helps you assess if those topics are good, coherent, and representative of your content.

png Main interface of the Validator Tab showing metrics and distribution

Core Components

TopicValidator Class

The TopicValidator class serves as the main interface for validation functionality. It manages the validation and optimization processes.

Methods

Method	Description	Parameters	Returns
`validate_model()`	Starts validation of a topic model	`documents`: List of document strings `topic_model`: Trained topic model	Returns the worker thread that can be connected to signals
`find_optimal_topics()`	Starts optimization to find the optimal number of topics	`documents`: List of document strings `max_topics`: Maximum number of topics to consider `min_topic_size`: Minimum topic size `**bertopic_kwargs`: Additional arguments	Returns the worker thread that can be connected to signals

ValidationWorker Class

The ValidationWorker class extends QThread to handle topic validation in a background thread, keeping the UI responsive during processing.

Signals

Signal	Description	Parameters
`progress_updated`	Emitted to update progress	`int`: progress percentage, `str`: status message
`validation_completed`	Emitted when validation is done	`dict`: metrics dictionary, `object`: summary DataFrame
`error_occurred`	Emitted when an error occurs	`str`: error message

OptimizationWorker Class

The OptimizationWorker class extends QThread to handle finding the optimal number of topics in a background thread.

Signals

Signal	Description	Parameters
`progress_updated`	Emitted to update progress	`int`: progress percentage, `str`: status message
`optimization_completed`	Emitted when optimization is done	`dict`: recommendations dictionary
`error_occurred`	Emitted when an error occurs	`str`: error message

Validation Metrics

The module calculates several key metrics to evaluate topic quality:

Topic Diversity (0-1)

Measures how distinct topics are from each other based on the uniqueness of words across topics.

High diversity (>0.7): Topics have minimal overlap in keywords, indicating well-separated themes
Low diversity (<0.3): Topics share many keywords, suggesting redundancy or poorly defined topics

Topic Coherence (0-1)

Measures how semantically related the words within each topic are.

High coherence (>0.7): Words in topics are strongly related, forming coherent themes
Low coherence (<0.3): Words in topics appear random or unrelated

Topic Coverage (0-1)

The percentage of documents assigned to non-outlier topics (not Topic -1).

High coverage (>0.8): Most documents fit into meaningful topics
Low coverage (<0.5): Many documents don't fit into any clear topic

Other Metrics

Number of Topics: The total number of distinct topics (excluding the outlier topic)
Outliers: Number of documents assigned to Topic -1 (the outlier topic)
Topic Distribution: Count of documents assigned to each topic

User Interface

The ValidatorTab class provides a comprehensive interface for validating topics and finding the optimal topic count.

UI Components

Control Buttons:
"Validate Topics" button to analyze the current topic model
"Find Optimal Topics" button to determine the ideal number of topics
Results Tabs:
Metrics: Table showing validation metrics with color coding
Topic Distribution: Table showing how documents are distributed across topics
Visualization: Interactive visualization of topic distribution
Optimization Dialog:
Settings for maximum topics to test
Minimum topic size control
Progress tracking during optimization
Results visualization showing quality scores across different topic counts

Usage Guide

Validating a Topic Model

Extract topics using the Topic Tab
Navigate to the Validator Tab
Click "Validate Topics" to begin the validation process
Review the metrics in the Results tab
Examine the topic distribution to see how documents are assigned
Use the visualization to get a graphical view of the topic distribution

Finding the Optimal Number of Topics

Navigate to the Validator Tab
Click "Find Optimal Topics" to open the optimization dialog
Set the maximum number of topics to test (default: 15)
Set the minimum topic size (default: 2)
Click "Start Optimization" to begin the process
Review the results, which include:
Recommended number of topics
Combined quality score
Stability score
Diversity score
Use the visualization to see how scores vary across different topic counts

Understanding Optimization

The topic optimization process uses a sophisticated approach to find the ideal number of topics:

Cross-Validation: For each potential topic count, it performs cross-validation by creating multiple topic models with different data splits
Stability Measurement: It measures how consistently similar topics emerge across different data splits
Diversity Assessment: It evaluates how distinct the topics are from each other
Combined Scoring: It combines stability (weighted 70%) and diversity (weighted 30%) into a single quality score
Recommendation: It recommends the number of topics with the highest combined score

This approach balances the ability to consistently find the same topics (stability) with the need for topics to be distinct from each other (diversity).

Optimization Parameters

Maximum Topics to Test: The upper limit for how many topics to consider
Minimum Topic Size: The minimum number of documents required for a topic

Interpreting Validation Results

What Makes a Good Topic Model?

A high-quality topic model typically shows: - High diversity (topics are distinct from each other) - High coherence (words within topics make sense together) - High coverage (most documents are assigned to non-outlier topics) - Balanced distribution of documents across topics

Common Issues and Solutions

Too Many Outliers (Low Coverage)

Problem: Many documents assigned to Topic -1 (outlier topic)
Possible solutions:
Decrease the minimum topic size
Improve text preprocessing
Try a different number of topics (use the optimization feature)
Use a different topic modeling method

Low Diversity

Problem: Topics share many of the same keywords
Possible solutions:
Reduce the number of topics
Improve stopword removal
Try the BERTopic model instead of LDA or NMF

Imbalanced Distribution

Problem: Most documents fall into just a few topics
Possible solutions:
Adjust the minimum topic size
Increase the number of topics
Consider if the imbalance reflects your actual data

Tips for Effective Validation

Validation Workflow

Run the "Find Optimal Topics" feature first to get a recommended topic count
Extract topics using the recommended count in the Topic Tab
Validate the resulting topics
If validation metrics show issues, adjust parameters and try again
Consider refining topics with an LLM for better interpretability

Balancing Metrics Based on Your Goals

For exploratory analysis: Prioritize diversity and coverage
For content summarization: Prioritize coherence
For document classification: Prioritize balanced topic distribution

Validation with Small Document Collections

Use smaller minimum topic sizes (1 or 2)
Extract fewer topics (3-5 is often sufficient)
Pay more attention to coherence than diversity
Be aware that cross-validation results might be less reliable

Combining with Other Modules

Use the Visualizer to see topic relationships and keyword importance
Try Topic Highlighting to see how topics appear in your actual documents
Consider LLM refinement to improve topic interpretability

Technical Details

Stability Calculation

Topic stability is calculated by: 1. Computing topic embeddings or word distributions for topics from different data splits 2. Calculating cosine similarity between topics from different models 3. Finding the maximum similarity for each topic 4. Averaging these maximum similarities to get the overall stability score

Diversity Calculation

Topic diversity is calculated by: 1. Extracting the top N words from each topic 2. Counting the number of unique words across all topics 3. Dividing by the total number of words (accounting for repetition) 4. This produces a score between 0 and 1, where higher values indicate more diverse topics

Coherence Calculation

The module uses the coherence score from the BERTopic model, which is based on: 1. How often topic words appear together in the same documents 2. The semantic similarity of words within a topic 3. Higher coherence scores indicate more semantically related words in topics