Search: The author is seeking methods to effectively handle these outliers to improve cluster analysis. - ai.jp.net

Research #Natural Language Processing 👥 Community
Analyzed: Dec 28, 2025 21:56

Handling Outliers in Text Corpus Cluster Analysis

Published: Dec 15, 2025 16:03

r/LanguageTechnology

Analysis

The post describes a challenge in text-corpus analysis: a large number of infrequent word pairs (outliers) distorts the cluster analysis. The author wants to identify statistically significant word pairs and extract contextual knowledge from them. To do so, words are paired within each sentence (PREC and LAST), the distance between the two words is calculated, and the occurrences of each pair are counted. Many of the resulting pairs appear only rarely, and these degrade the K-Means clustering. According to the author, filtering these outliers before clustering does not significantly improve the results. The open question is how to handle them so that the clustering improves and meaningful contextual information can be extracted.
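The pairing step described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function name `build_pairs`, the whitespace tokenization, the use of every ordered word pair in a sentence, and storing the mean distance per pair are all assumptions, since the post does not give those details.

```python
from collections import defaultdict
from itertools import combinations

def build_pairs(sentences):
    """Collect (PREC, LAST) word pairs with their occurrence count and mean distance."""
    totals = defaultdict(lambda: [0, 0])  # (prec, last) -> [count, summed distance]
    for sentence in sentences:
        words = sentence.split()  # assumed tokenization: plain whitespace split
        # combinations over enumerate yields positions i < j, so PREC always precedes LAST
        for (i, prec), (j, last) in combinations(enumerate(words), 2):
            entry = totals[(prec, last)]
            entry[0] += 1
            entry[1] += j - i  # distance measured in word positions
    return {
        pair: {"count": count, "mean_distance": dist_sum / count}
        for pair, (count, dist_sum) in totals.items()
    }

# Toy sentences, made up for illustration only.
sentences = [
    "the old house stands",
    "the house stands empty",
]
data = build_pairs(sentences)
```

Even in this toy input, most pairs occur exactly once, which hints at how quickly the long tail of rare pairs grows on a real corpus.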

Key Takeaways

• The core problem is the presence of numerous infrequent word pairs (outliers) in the dataset.
• Filtering outliers before clustering doesn't significantly improve the results.
• The author is seeking methods to effectively handle these outliers to improve cluster analysis.
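For concreteness, the filtering step mentioned in the takeaways might look like the sketch below. The records, the `MIN_COUNT` cutoff, and both helper names are hypothetical. The log scaling of counts is an added assumption, shown only as one commonly tried way to compress a heavy-tailed feature before clustering; the post does not say it was used or that it helps.

```python
import math

# Hypothetical (PREC, LAST) -> (count, mean distance) records; values are made up.
pairs = {
    ("the", "house"): (120, 1.4),
    ("old", "house"): (45, 1.1),
    ("house", "stands"): (30, 1.8),
    ("house", "empty"): (2, 3.0),
    ("stands", "empty"): (1, 1.0),
    ("old", "empty"): (1, 4.0),
}

MIN_COUNT = 3  # assumed cutoff; the source does not state which threshold was used

def filter_rare(pairs, min_count):
    """Keep only pairs seen at least min_count times."""
    return {pair: value for pair, value in pairs.items() if value[0] >= min_count}

def to_features(pairs):
    """Turn each pair into a [log1p(count), mean_distance] feature vector."""
    return {pair: [math.log1p(count), dist] for pair, (count, dist) in pairs.items()}

kept = filter_rare(pairs, MIN_COUNT)
dropped_share = 1 - len(kept) / len(pairs)  # fraction of distinct pairs removed
features = to_features(kept)  # vectors that could be passed to a clustering step
```

Tracking `dropped_share` makes it visible how much of the vocabulary of pairs a given cutoff discards, which is useful when comparing clustering runs with and without the filter.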

Reference

“Now it's easy enough to e.g. search DATA for LAST="House" and order the result by distance/count to derive some primary information.”
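The lookup in the quote could be expressed roughly as below. The row layout of `DATA`, the field names, and the sample values are hypothetical, and "order by distance/count" is read here as "nearest first, then most frequent first", which is only one possible interpretation of the quote.

```python
# Hypothetical DATA rows: one record per (PREC, LAST) pair; values are illustrative.
DATA = [
    {"prec": "old", "last": "House", "distance": 1.0, "count": 45},
    {"prec": "the", "last": "House", "distance": 1.5, "count": 120},
    {"prec": "white", "last": "House", "distance": 1.0, "count": 80},
    {"prec": "the", "last": "Garden", "distance": 1.2, "count": 60},
]

def search_last(data, last):
    """Return rows whose LAST equals `last`, nearest first, then most frequent first."""
    rows = [row for row in data if row["last"] == last]
    return sorted(rows, key=lambda row: (row["distance"], -row["count"]))

result = search_last(DATA, "House")
precs = [row["prec"] for row in result]
```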

Permalink r/LanguageTechnology
