Handling Outliers in Text Corpus Cluster Analysis
Published: Dec 15, 2025 16:03
•r/LanguageTechnology
Analysis
The post describes a challenge in text analysis: a large number of infrequent word pairs (outliers) distorts cluster analysis. The author wants to identify statistically significant word pairs and extract contextual knowledge from them. The process pairs words (PREC and LAST) within each sentence, calculates the distance between them, and counts how often each pair occurs. The core problem is that many word pairs appear only rarely, which degrades the K-Means clustering. The author notes that filtering these outliers before clustering doesn't significantly improve the results. The open question is how to handle these outliers so that the clustering improves and yields meaningful contextual information.
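The pairing step described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: it assumes PREC is the earlier word and LAST the later word of a pair within one sentence, that distance is the difference in token positions, and that whitespace splitting is an acceptable tokenizer. The function name `count_word_pairs` and the sample sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_word_pairs(sentences):
    """Count (PREC, LAST, distance) triples per sentence.

    Assumption: PREC precedes LAST in the same sentence and
    distance is the gap between their token positions.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # naive whitespace tokenizer (assumption)
        # combinations over enumerate() yields position pairs with i < j
        for (i, prec), (j, last) in combinations(enumerate(tokens), 2):
            counts[(prec, last, j - i)] += 1
    return counts


# Hypothetical two-sentence corpus
sentences = [
    "the old house burned",
    "the old house stands",
]
DATA = count_word_pairs(sentences)

for (prec, last, distance), n in DATA.most_common(3):
    print(prec, last, distance, n)
```

Even in this tiny corpus most triples occur only once, which is the long tail of infrequent pairs the post is concerned with.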
Key Takeaways
- The core problem is the presence of numerous infrequent word pairs (outliers) in the dataset.
- Filtering outliers before clustering doesn't significantly improve the results.
- The author is seeking methods to effectively handle these outliers to improve cluster analysis.
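For concreteness, filtering before clustering might amount to a simple minimum-count cut like the one below. The post does not say how the filtering was done, so the threshold `min_count`, the helper name `drop_rare_pairs`, and the sample counts are all assumptions made for illustration.

```python
from collections import Counter


def drop_rare_pairs(pair_counts, min_count=2):
    """Keep only pairs seen at least min_count times (threshold is an assumption)."""
    return Counter(
        {pair: n for pair, n in pair_counts.items() if n >= min_count}
    )


# Hypothetical counts keyed by (PREC, LAST, distance)
pair_counts = Counter({
    ("old", "house", 1): 5,
    ("the", "house", 2): 3,
    ("house", "burned", 1): 1,
    ("purple", "house", 1): 1,
})

kept = drop_rare_pairs(pair_counts, min_count=2)
# Share of distinct pairs removed by the filter
rare_share = 1 - len(kept) / len(pair_counts)
print(len(kept), rare_share)
```

Reporting the share of removed pairs alongside the clustering result makes it easier to judge how much of the dataset a given threshold discards.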
Reference
“Now it's easy enough to e.g. search DATA for LAST="House" and order the result by distance/count to derive some primary information.”
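The lookup mentioned in the quote could be sketched like this. It assumes DATA maps (PREC, LAST, distance) triples to counts, and it reads "distance/count" as ordering by distance first and by count second; the post does not spell out either detail. The function name `pairs_ending_with` and the sample rows are hypothetical.

```python
def pairs_ending_with(data, last):
    """Return (PREC, distance, count) rows for one LAST word.

    Rows are ordered by ascending distance, then descending count
    (one possible reading of "distance/count").
    """
    rows = [
        (prec, distance, n)
        for (prec, lst, distance), n in data.items()
        if lst == last
    ]
    return sorted(rows, key=lambda row: (row[1], -row[2]))


# Hypothetical DATA: (PREC, LAST, distance) -> count
DATA = {
    ("old", "House", 1): 5,
    ("the", "House", 2): 3,
    ("white", "House", 1): 8,
    ("old", "barn", 1): 2,
}

result = pairs_ending_with(DATA, "House")
for row in result:
    print(row)
```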