Handling Outliers in Text Corpus Cluster Analysis

Published: Dec 15, 2025 16:03
1 min read
r/LanguageTechnology

Analysis

The post describes a challenge in text analysis: handling the large number of infrequent word pairs (outliers) that arise during cluster analysis of a text corpus. The author wants to identify statistically significant word pairs and extract contextual knowledge from them. The process pairs words (PREC and LAST) within each sentence, calculates the distance between them, and counts how often each pair occurs. The core problem is that many word pairs appear only rarely, which degrades the K-Means clustering. The author notes that filtering these outliers before clustering does not significantly improve results. The open question is how to handle these outliers so that the clustering improves and meaningful contextual information can be extracted.
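
The pairing step described above might look roughly like the following. This is only a sketch under stated assumptions, not the author's code: it assumes whitespace tokenization, that PREC is any word appearing before LAST in the same sentence, that distance is the difference in token positions, and that occurrences are counted per (PREC, LAST, distance) triple. The names `build_pairs`, `drop_rare`, and `min_count` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_pairs(sentences):
    """Count (PREC, LAST, distance) triples across sentences.

    Assumption: tokens are split on whitespace and distance is the
    difference between the token positions of LAST and PREC.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # combinations() keeps positional order, so i < j always holds.
        for (i, prec), (j, last) in combinations(enumerate(tokens), 2):
            counts[(prec, last, j - i)] += 1
    return counts


def drop_rare(counts, min_count=2):
    """Keep only triples seen at least min_count times (hypothetical threshold)."""
    return {key: n for key, n in counts.items() if n >= min_count}
```

A frequency threshold like `drop_rare` is the kind of pre-filtering the post reports as not helping much, so it is shown here only to make the setup concrete.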
Reference

Now it's easy enough to e.g. search DATA for LAST="House" and order the result by distance/count to derive some primary information.
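
The lookup mentioned in the quote could be sketched as below. The layout of DATA is not given in the post, so this assumes a list of records with `prec`, `last`, `distance`, and `count` fields, and it reads "order by distance/count" as "nearest first, then most frequent first", which is one possible interpretation. The function name `lookup_last` and the sample rows are made up for illustration.

```python
def lookup_last(data, last):
    """Return rows whose LAST word matches, nearest and most frequent first.

    Assumption: each row is a dict with 'prec', 'last', 'distance', 'count'.
    """
    rows = [row for row in data if row["last"] == last]
    # Sort by ascending distance, breaking ties by descending count.
    return sorted(rows, key=lambda row: (row["distance"], -row["count"]))


# Illustrative rows only, not taken from the post.
sample = [
    {"prec": "the", "last": "House", "distance": 2, "count": 40},
    {"prec": "White", "last": "House", "distance": 1, "count": 12},
    {"prec": "old", "last": "House", "distance": 1, "count": 30},
    {"prec": "red", "last": "Barn", "distance": 1, "count": 5},
]
for row in lookup_last(sample, "House"):
    print(row["prec"], row["distance"], row["count"])
```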