PolyAnalyst adds support for automatically identifying and removing duplicate documents

November 1, 2008

With the release of PolyAnalyst 6.0.920, PolyAnalyst provides a new node, Distinct Texts. The Distinct Texts node is a data cleansing operation that examines a set of documents and looks for duplicated content. If any two documents are sufficiently similar, one of them can be filtered out of the set. The node produces a new set of documents with the duplicates removed, along with a report for reviewing which documents were removed.
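To make the idea concrete, here is a minimal Python sketch of near-duplicate filtering. It does not show how the Distinct Texts node is implemented, which this announcement does not describe. The similarity measure (Jaccard overlap of three-word shingles), the 0.8 threshold, and the function names `shingles`, `jaccard`, and `distinct_texts` are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word sequences in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def distinct_texts(docs, threshold=0.8):
    """Keep the first document of each group of similar documents.

    Returns (kept_docs, removed), where removed lists
    (removed_index, kept_index, similarity) for later review.
    The threshold is an illustrative assumption, not a PolyAnalyst setting.
    """
    kept = []      # (original index, shingle set) of documents kept so far
    removed = []   # review report of filtered documents
    for i, doc in enumerate(docs):
        s = shingles(doc)
        match = None
        for j, kept_shingles in kept:
            sim = jaccard(s, kept_shingles)
            if sim >= threshold:
                match = (i, j, sim)
                break
        if match:
            removed.append(match)
        else:
            kept.append((i, s))
    return [docs[i] for i, _ in kept], removed


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog",
    "An entirely different sentence about text mining.",
]
kept_docs, report = distinct_texts(docs)
```

In this sketch the second document differs from the first only in capitalization and punctuation, so it is filtered out and recorded in `report`. Comparing every document against every kept document is quadratic in the worst case, which is acceptable for a small illustration but not for large collections.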

This is an important preparation step in many statistical analyses that incorporate unstructured data. Duplicated content can substantially skew statistics such as how frequently various words occur in a set of documents: the frequency metric becomes misleading because words from the same document are counted more than once. Analyzing and removing the duplicated content first makes your statistics more reliable.
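The skew can be seen in a small Python sketch. The sample documents and the helper name `word_counts` are invented for illustration, and only exact duplicates are removed here, which is simpler than the similarity-based filtering described above.

```python
import re
from collections import Counter


def word_counts(docs):
    """Count how often each lowercased word occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"\w+", doc.lower()))
    return counts


# Hypothetical sample data: the first document appears twice.
docs = [
    "battery failure reported",
    "battery failure reported",
    "screen cracked",
]

with_duplicates = word_counts(docs)
# dict.fromkeys drops exact duplicates while preserving order.
deduplicated = word_counts(list(dict.fromkeys(docs)))
```

With the duplicate present, "battery" is counted twice even though only one distinct document mentions it. After deduplication it is counted once.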

For more information, or to request a demonstration of the new functionality, please contact Megaputer sales.
