Handling Outliers in Text Corpus Cluster Analysis

Research | #Natural Language Processing | Community | Analyzed: Dec 28, 2025 21:56
Published: Dec 15, 2025 16:03
r/LanguageTechnology

Analysis

The post describes a problem in text-corpus analysis: cluster analysis of word pairs is dominated by a large number of pairs that occur only rarely (outliers). The author wants to identify statistically significant word pairs and extract contextual knowledge from them. The method pairs words within each sentence (labelled PREC and LAST), records the distance between the two words, and counts how often each pair occurs. Because so many pairs appear infrequently, K-Means clustering gives poor results, and the author reports that filtering these outliers out before clustering does not significantly improve them. The open question is how to handle the outliers so that the clustering improves and meaningful contextual information can be extracted.
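The pairing step summarized above can be sketched roughly as follows. This is a minimal illustration, not the author's code: it assumes that PREC is the earlier word and LAST the later word of a pair, that distance is the difference in token positions, and that sentences are split naively on punctuation. The function names `extract_pairs` and `split_rare` and the `min_count` threshold are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def extract_pairs(text):
    """Count (PREC, LAST, distance) triples within each sentence."""
    counts = Counter()
    # Naive sentence split on '.', '!' and '?' (an assumption; the post
    # does not say how sentences are detected).
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[A-Za-z']+", sentence.lower())
        # combinations() yields index pairs with i < j, so the first word
        # (PREC) always precedes the second word (LAST) in the sentence.
        for i, j in combinations(range(len(tokens)), 2):
            counts[(tokens[i], tokens[j], j - i)] += 1
    return counts


def split_rare(counts, min_count=2):
    """Separate triples seen at least min_count times from the rare ones."""
    frequent = {k: v for k, v in counts.items() if v >= min_count}
    rare = {k: v for k, v in counts.items() if v < min_count}
    return frequent, rare


# Toy corpus, made up for illustration only.
text = "The red house is old. The red house is big. A dog barked."
counts = extract_pairs(text)
frequent, rare = split_rare(counts)
print(len(frequent), "frequent triples,", len(rare), "rare triples")
```

Even on this toy input the rare triples outnumber the frequent ones, which is the imbalance the post is concerned with. Splitting by a count threshold only makes the outliers visible; the post reports that removing them before K-Means did not help much, so this should not be read as a fix.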
Reference / Citation
"Now it's easy enough to e.g. search DATA for LAST="House" and order the result by distance/count to derive some primary information."
r/LanguageTechnology, Dec 15, 2025 16:03
* Cited for critical analysis under Article 32.
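
The lookup mentioned in the quoted passage could look something like the sketch below. The layout of DATA is an assumption (one row per pair, holding PREC, LAST, distance and count), the rows are invented for illustration, and "order by distance/count" is read here as ascending distance, then descending count. The helper name `lookup_last` is hypothetical.

```python
# Hypothetical rows of DATA: (PREC, LAST, distance, count).
# The values are invented for illustration and do not come from the post.
DATA = [
    ("red", "House", 1, 12),
    ("the", "House", 2, 30),
    ("old", "House", 1, 7),
    ("the", "Garden", 1, 9),
]


def lookup_last(data, last):
    """Return rows whose LAST field equals `last`, nearest and most frequent first."""
    rows = [row for row in data if row[1] == last]
    # Sort by ascending distance, then by descending count.
    return sorted(rows, key=lambda row: (row[2], -row[3]))


result = lookup_last(DATA, "House")
for prec, last_word, distance, count in result:
    print(prec, last_word, distance, count)
```

If the author instead meant a ratio of distance to count, only the sort key would change.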