Morphology – Megaputer Intelligence

Query languages—the Swiss army knife of information extraction

Echo Lu — Tue, 06 Feb 2024 05:19:41 +0000

Text mining, the art of extracting information from text, requires the formulation of efficient queries that retrieve information based on user input. To do this, the user requires a language for writing queries. For the most basic use cases, the language operators could be regex or string search. But while regex and string search are indispensable for text mining, their utility hits a hard ceiling when semantic or meaningful search is required. They cannot, for example, capture complex entities such as human names, corporations, and drugs. For handling tasks like these, we need a more powerful query language that has semantic understanding, such as Megaputer’s PDL.

So, what is PDL, and how does it achieve semantic understanding while regex and string search do not? Let’s take a look at an example to find out.

First of all, PDL does not just search for the literal form of the word in the query; instead, it automatically extends its search to all morphological forms of the word. For example, when searching for the word “company” in financial news articles, the PDL query will find not only “company” but also its plural form, “companies.” This feature is especially handy when the search involves a verb. Suppose that you are interested in extracting what CEOs said. With regex or other substring search, you need to list all possible verb forms, such as “say,” “saying,” “says,” and “said.” With PDL, simply entering “say” in the query automatically fetches all of these forms. This behavior can be turned off by enclosing the word in the form function, which restricts the search to the literal form of the word.
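To make the contrast concrete, here is a minimal Python sketch of the regex side of that comparison, followed by a toy lookup that mimics searching by base form. The LEMMAS table and the find_lemma helper are hypothetical stand-ins written for this illustration; they are not part of PDL or PolyAnalyst.

```python
import re

# Plain regex: every inflected form of "say" has to be listed by hand.
SAY_FORMS = re.compile(r"\b(?:say|says|saying|said)\b", re.IGNORECASE)

text = "The CEO said profits rose. Analysts say otherwise, and she says more is coming."

regex_matches = [m.group(0).lower() for m in SAY_FORMS.finditer(text)]
print(regex_matches)  # ['said', 'say', 'says']

# Toy stand-in for a morphological dictionary (hypothetical): surface form -> base form.
LEMMAS = {"say": "say", "says": "say", "saying": "say", "said": "say"}


def find_lemma(lemma, text):
    """Return every token in text whose base form equals the given lemma."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if LEMMAS.get(t.lower()) == lemma]


# The query names only the base form; the dictionary supplies the other forms.
print(find_lemma("say", text))  # ['said', 'say', 'says']
```

The regex has to be edited whenever a new form matters, while the dictionary-backed lookup keeps the query itself down to a single word.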

Another notable feature of the PDL language is its capability for users to tailor the scope of their searches using a range of built-in functions. Returning to the previous example, you may not wish to confine your search exclusively to the specific verb “say,” but rather include other synonymous verbs like “tell” or “mention.” Achieving this is straightforward with PDL – users can invoke the synonym function with the verb “say,” as demonstrated in (a) below. As the subsequent results table (b) illustrates, the captured text now includes various speech verbs such as “tell,” “emphasize,” and “claim,” in addition to the word “say,” capturing them in all possible verb forms. For additional flexibility, the user can also create and modify synonym dictionaries.

The PDL language offers various modes of information extraction, including proximity search (e.g., finding words A and B within the same sentence, or within three words of each other), syntactic relations (e.g., finding a word A that is the subject or object of B), semantic relations (e.g., finding words that are synonyms or antonyms of word A), access to dictionaries and ontologies, and more. The language is expressive enough to capture complex patterns, yet relatively easy to use, with a syntax that closely resembles English. Access to this versatile query language significantly enhances the power and quality of text mining operations.
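As a rough illustration of what a proximity search does, the following Python sketch finds pairs of words that occur within a given number of token positions of each other. The near_pairs function and its distance convention are assumptions made for this example; PDL's own proximity semantics may differ.

```python
import re


def near_pairs(tokens, a, b, max_gap):
    """Return (i, j) index pairs where token a (at i) and token b (at j)
    are at most max_gap positions apart."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [j for j, t in enumerate(tokens) if t == b]
    return [(i, j) for i in pos_a for j in pos_b if 0 < abs(i - j) <= max_gap]


sentence = "The company said the new company policy helps revenue"
tokens = re.findall(r"\w+", sentence.lower())

print(near_pairs(tokens, "company", "policy", 3))   # [(5, 6)]
print(near_pairs(tokens, "company", "revenue", 3))  # [(5, 8)]
print(near_pairs(tokens, "said", "revenue", 3))     # []
```

This sketch compares literal tokens only; it performs none of the morphological expansion described earlier.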

In conclusion, PDL is a powerful and versatile query language that enables users to extract meaningful information from text with greater efficiency and accuracy than competing methods like regex or string search. Its ability to understand and capture morphological forms, synonyms, and other complex patterns makes it an indispensable tool for solving text mining tasks that require semantic understanding. By leveraging the capabilities of PDL, users can enhance their information extraction processes and gain valuable insights from their data, making it a true Swiss army knife of information extraction.

The post Query languages—the Swiss army knife of information extraction appeared first on Megaputer Intelligence.

Morphology: Investigating Word Structures

Echo Lu — Tue, 05 Sep 2023 20:38:43 +0000

Morphology stands as a vital subfield of linguistics, delving into the organization of words. While words are often perceived as the fundamental units in textual analysis, they can be deconstructed further into smaller constituents known as morphemes. These morphemes, which are the elemental units of grammatical form in language, fall into distinct classifications including roots, suffixes, prefixes, and more. Through the application of word formation rules, these morphemes combine to create coherent words. In essence, just as sentences have internal structures, words also exhibit intricate internal structures consisting of morphemes, and those morphemes are combined through morphological rules, similar to how words are composed through syntactic rules.

There are two important categories of morphemes: those that can stand alone (e.g., “car,” “freedom,” “sing”), and those that must be attached to other morphemes (e.g., “-ish,” “bi-,” “-er,” “-est”). The former are termed “free morphemes,” while the latter are referred to as “bound morphemes.” Bound morphemes can attach at various positions around another morpheme. A bound morpheme attached to the beginning of another morpheme (e.g., “bi-” in “bi-” + “weekly”) is called a prefix, and one attached to the end (e.g., “-er” in “teach” + “-er”) is called a suffix. Affixes that are inserted within a morpheme (“infixes”) or that surround a morpheme (“circumfixes”) also exist, but they are rare or non-existent in English.

Morphologically complex words are decomposed into various parts such as roots and affixes. A root is a core lexical unit of a word to which affixes are attached. While a root is often a free morpheme (e.g., “teach” in “teacher”), some words have a bound morpheme as their root. For example, a word like “receive” is decomposed into “re,” a prefix, and “ceive,” a root, where the root cannot stand alone. Another important aspect of multimorphemic words is that they can further combine with other affixes to derive more complex words. Consider the word “global” as an illustration: it consists of the root “globe” and the suffix “-al.” It can further merge with additional affixes to form words like “globalize,” “globalization,” and more. These instances exemplify what is known as derivational morphology, in which added morphemes derive words with new meanings. Importantly, however, the addition of a morpheme does not always give rise to a new meaning. The so-called inflectional morpheme, primarily serving grammatical functions, does not change a word’s meaning but only alters the grammatical attributes of a word. For instance, adding the “-ed” morpheme to a verb (e.g., “walk” + “ed” -> “walked”) only modifies its grammatical form, turning the verb into its past tense form, while its meaning remains unaffected.
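The distinctions above can be captured in a small data structure. The Python sketch below is purely illustrative: the Morpheme class, the hand-written ANALYSES table, and the describe helper are assumptions made for this example, not a real morphological analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str
    kind: str                   # "root", "prefix", or "suffix"
    inflectional: bool = False  # True for purely grammatical affixes


# Hand-written analyses of the words discussed in the text.
ANALYSES = {
    "teacher": [Morpheme("teach", "root"), Morpheme("-er", "suffix")],
    "receive": [Morpheme("re-", "prefix"), Morpheme("ceive", "root")],
    "globalize": [
        Morpheme("globe", "root"),
        Morpheme("-al", "suffix"),
        Morpheme("-ize", "suffix"),
    ],
    "walked": [Morpheme("walk", "root"), Morpheme("-ed", "suffix", inflectional=True)],
}


def describe(word):
    """Return (root, process), where process is 'inflection' if every affix
    is inflectional and 'derivation' otherwise."""
    parts = ANALYSES[word]
    root = next(m.form for m in parts if m.kind == "root")
    affixes = [m for m in parts if m.kind != "root"]
    process = "inflection" if all(m.inflectional for m in affixes) else "derivation"
    return root, process


print(describe("walked"))     # ('walk', 'inflection')
print(describe("globalize"))  # ('globe', 'derivation')
print(describe("receive"))    # ('ceive', 'derivation')
```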

In PolyAnalyst, a built-in algorithm equipped with a morphological dictionary automatically tags every token in the text with morpheme-level information such as grammatical categories or tenses. For example, verbs that end with the “ed” morpheme are automatically tagged as the past tense or the past participial form of the verb. Furthermore, PolyAnalyst offers a variety of morphological functions in the PDL query language, allowing end users to craft precise queries tailored to their specific needs. The video clip below illustrates some of these functions. As shown here, users can employ the lemma function to retrieve all word forms of a given lemma. For example, lemma(face) finds all inflectional forms such as “faces,” “faced,” “facing,” and so on. Since the word “face” can function as both a noun and a verb, a user can further refine the search to locate only the forms corresponding to the noun sense; e.g., lemma(noun, face).

lemma(face) and lemma(noun, face)

Notably, PolyAnalyst defaults to a comprehensive search of all word forms of a given word. That is, if a word “facing” is included in a query, PolyAnalyst will retrieve all inflectional forms of its root “face,” capturing “face,” “faced,” “faces,” and “facing.” If a user wishes to narrow down the search to a specific word form, they can utilize the form function, which then exclusively identifies instances of the same word form as the argument; for instance, form(facing) exclusively identifies instances of “facing,” and nothing else.

Among this series of functions, the one that covers the broadest scope is singleroot. This function searches not only for all possible inflectional forms of a word but also for all words that share the same root. For instance, singleroot(face) would match a word like “facial,” along with “face,” “faced,” “facing,” and so on.

form(facing) and singleroot(facing)
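The relationship between these functions can be mimicked in a few lines of Python. This is only a toy model and not PolyAnalyst's implementation: the ENTRIES table is hand-written, and the Python functions form, lemma, and singleroot merely borrow the PDL names. Note that this Python lemma takes the part of speech as an optional second argument, unlike the PDL query lemma(noun, face).

```python
# Toy morphological dictionary (hypothetical): (surface form, lemma, part of speech, root).
ENTRIES = [
    ("face", "face", "noun", "face"),
    ("face", "face", "verb", "face"),
    ("faces", "face", "noun", "face"),
    ("faces", "face", "verb", "face"),
    ("faced", "face", "verb", "face"),
    ("facing", "face", "verb", "face"),
    ("facial", "facial", "adjective", "face"),
]


def form(word):
    """Narrowest search: only the literal word form."""
    return {word}


def lemma(word, pos=None):
    """All inflectional forms of a lemma, optionally limited to one part of speech."""
    return {f for f, l, p, _ in ENTRIES if l == word and (pos is None or p == pos)}


def default_search(word):
    """Bare-word behavior: expand any form to all inflectional forms of its lemma."""
    lemmas = {l for f, l, _, _ in ENTRIES if f == word}
    return {f for f, l, _, _ in ENTRIES if l in lemmas}


def singleroot(word):
    """Broadest search: every word that shares a root with the given word."""
    roots = {r for f, _, _, r in ENTRIES if f == word}
    return {f for f, _, _, r in ENTRIES if r in roots}


print(sorted(form("facing")))          # ['facing']
print(sorted(lemma("face", "noun")))   # ['face', 'faces']
print(sorted(default_search("facing")))  # ['face', 'faced', 'faces', 'facing']
print(sorted(singleroot("facing")))    # ['face', 'faced', 'faces', 'facial', 'facing']
```

Each function returns a superset of the one before it, which matches the narrow-to-broad ordering described above.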


PolyAnalyst introduces support for using stop list dictionaries when analyzing spelling errors

Jeff Palan — Tue, 15 May 2018 18:44:31 +0000

In previous builds of PolyAnalyst, the only way to stop a word from showing up in the spell checker was to add it to the morphology dictionary. This sometimes meant adding entries that did not really belong there, such as product codes (e.g., Model ABC-XYZ) and the occasional single foreign word (e.g., yukata).

To prevent this dictionary mismatch, PolyAnalyst now includes stop list functionality for spell checking. In other words, you can define a list of words for the spell checker to ignore without registering them as actual English words.

Consider the following example of dealing with the problematic word “yukata”.

Without a stop list:

The word is identified as a spelling error.

With a stop list: no more “yukata”!

After adding the word to a stop list, the word is no longer identified as a spelling error.
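The idea can be sketched in Python as follows. This is a simplified illustration rather than PolyAnalyst's spell checker: the DICTIONARY and STOP_LIST sets and the spelling_errors function are hypothetical, and a real checker would use a full morphological dictionary.

```python
import re

# Tiny stand-in for the morphology dictionary (hypothetical contents).
DICTIONARY = {"she", "wore", "a", "to", "the", "festival"}

# Words the spell checker should ignore without treating them as English words.
STOP_LIST = {"yukata"}


def spelling_errors(text, dictionary, stop_list=frozenset()):
    """Return tokens that are neither in the dictionary nor on the stop list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in dictionary and t not in stop_list]


text = "She wore a yukata to the festval"

print(spelling_errors(text, DICTIONARY))             # ['yukata', 'festval']
print(spelling_errors(text, DICTIONARY, STOP_LIST))  # ['festval']
```

Keeping the stop list separate means the dictionary itself stays free of product codes and one-off foreign words, while genuine misspellings such as “festval” are still reported.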


Introducing the partofspeech function for PolyAnalyst

Jeff Palan — Thu, 15 Jun 2017 14:08:39 +0000

Build 2010 – Released June 2017

As part of PolyAnalyst’s crusade for accuracy and ease of use, a new Pattern Definition Language (PDL) function partofspeech has been added to the function library. This function allows you to create search queries that reference very specific grammatical attributes.

For example:

partofspeech(noun_singular)

will match “desk” and “table” but not “desks” or “tables”

partofspeech(noun, building)

will match “building” and “buildings”, but only when they are used as nouns (as opposed to a verb, as in “I’m building a desk”)

Similar results were possible previously, but they required more complex and esoteric queries.

This function can be especially useful in entity extraction. As an example, let’s run the following query on some public data from the National Highway Traffic Safety Administration.

near(1, partofspeech(adjective), tire or tread)

The result is all the instances where an adjective appeared next to the words “tire” or “tread”.
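For readers unfamiliar with this kind of query, the Python sketch below imitates it on a short, hand-tagged token list. Everything here is an assumption made for illustration: the tags are written by hand, adjectives_next_to is a hypothetical helper, “next to” is taken to mean immediately before or after, and the target words are matched literally.

```python
# Hand-tagged tokens (word, part of speech); a real system would tag these automatically.
tagged = [
    ("the", "determiner"),
    ("left", "adjective"),
    ("tire", "noun"),
    ("lost", "verb"),
    ("its", "pronoun"),
    ("tread", "noun"),
    ("after", "preposition"),
    ("a", "determiner"),
    ("flat", "adjective"),
    ("tire", "noun"),
    ("warning", "noun"),
]


def adjectives_next_to(tagged, targets):
    """Return (adjective, target) pairs where the adjective is directly
    adjacent to one of the target words."""
    hits = []
    for i, (word, pos) in enumerate(tagged):
        if pos != "adjective":
            continue
        for j in (i - 1, i + 1):
            if 0 <= j < len(tagged) and tagged[j][0] in targets:
                hits.append((word, tagged[j][0]))
    return hits


print(adjectives_next_to(tagged, {"tire", "tread"}))
# [('left', 'tire'), ('flat', 'tire')]
```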
