Pattern Definition Language – Megaputer Intelligence

Query languages—the Swiss army knife of information extraction

Echo Lu — Tue, 06 Feb 2024 05:19:41 +0000

Text mining, the art of extracting information from text, requires the formulation of efficient queries that retrieve information based on user input. To do this, the user requires a language for writing queries. For the most basic use cases, the language operators could be regex or string search. But while regex and string search are indispensable for text mining, their utility hits a hard ceiling when semantic or meaningful search is required. They cannot, for example, capture complex entities such as human names, corporations, and drugs. For handling tasks like these, we need a more powerful query language that has semantic understanding, such as Megaputer’s PDL.

So, what is PDL, and how does it achieve semantic understanding while regex and string search do not? Let’s take a look at an example to find out.

First of all, PDL does not just search for the literal form of the word in the query: instead, it automatically extends its search to all morphological forms of the word. For example, when searching for the word “company” in financial news articles, the PDL query will not only find “company”, but also “companies,” the plural form. This feature often comes in handy, especially when the search involves a verb. Suppose that you are interested in extracting what the CEOs said. With regex or other substring search, you will need to list all possible verb forms such as “say,” “saying,” “says,” and “said.” With PDL, simply entering “say” in the query will automatically fetch all possible verb forms. This behavior can also be turned off by enclosing the word in the form function, which will then restrict the search to the literal form of the word, such as in the example below.
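Separately from the PDL example, the regex side of this comparison can be sketched in plain Python. This is not PDL, and the function name `find_say_forms` is only illustrative; the point is that every inflected form has to be spelled out by hand.

```python
import re

# Every inflected form must be listed explicitly; a forgotten form silently loses matches.
SAY_FORMS = re.compile(r"\b(?:say|says|saying|said)\b", re.IGNORECASE)


def find_say_forms(text: str) -> list[str]:
    """Return every listed form of "say" found in text, in order of appearance."""
    return SAY_FORMS.findall(text)


sample = "The CEO said growth will continue. Analysts say otherwise."
print(find_say_forms(sample))  # ['said', 'say']
```

The same maintenance burden repeats for every verb you care about, which is the gap that automatic morphological expansion is meant to close.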

Another notable feature of PDL is that it lets users tailor the scope of their searches using a range of built-in functions. Returning to the previous example, you may not wish to confine your search exclusively to the specific verb “say,” but rather include other synonymous verbs like “tell” or “mention.” Achieving this is straightforward with PDL: users can invoke the synonym function with the verb “say,” as demonstrated in (a) below. As the subsequent results table (b) illustrates, the captured text now includes various speech verbs such as “tell,” “emphasize,” and “claim,” in addition to the word “say,” capturing them in all possible verb forms. For additional flexibility, the user can also create and modify synonym dictionaries.

The PDL language offers various modes of information extraction, including proximity search (e.g., finding words A and B within a sentence, or within a 3-word range), syntactic relations (e.g., finding word A that is the subject or object of B), semantic relations (e.g., finding words that are synonyms or antonyms of word A), access to dictionaries and ontologies, and more. The language is expressive enough to capture complex patterns, yet relatively easy to use, with a syntax that closely resembles English. Access to this versatile query language significantly enhances the power and quality of text mining operations.
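To give a feel for what a proximity search does, here is a minimal Python sketch. It is not how PDL is implemented: the function `within_n_words` is hypothetical, it matches literal tokens only (no morphology), and counting distance as the difference in token positions is an assumption.

```python
import re


def within_n_words(text: str, word_a: str, word_b: str, max_distance: int) -> bool:
    """True if word_a and word_b occur at most max_distance token positions apart."""
    tokens = re.findall(r"\w+", text.lower())
    positions_a = [i for i, tok in enumerate(tokens) if tok == word_a.lower()]
    positions_b = [i for i, tok in enumerate(tokens) if tok == word_b.lower()]
    # Compare every occurrence of A with every occurrence of B.
    return any(abs(i - j) <= max_distance for i in positions_a for j in positions_b)


text = "The company said profits rose"
print(within_n_words(text, "company", "profits", 3))  # True
print(within_n_words(text, "company", "profits", 1))  # False
```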

In conclusion, PDL is a powerful and versatile query language that enables users to extract meaningful information from text with greater efficiency and accuracy than competing methods like regex or string search. Its ability to understand and capture morphological forms, synonyms, and other complex patterns makes it an indispensable tool for solving text mining tasks that require semantic understanding. By leveraging the capabilities of PDL, users can enhance their information extraction processes and gain valuable insights from their data, making it a true Swiss army knife of information extraction.

The post Query languages—the Swiss army knife of information extraction appeared first on Megaputer Intelligence.

Combining the Best of Both Worlds: Machine Learning & Rule-Based Approach

Elli Bourlai — Wed, 12 Feb 2020 15:15:51 +0000

In the past few years, more and more organizations have started relying solely on Machine Learning (often loosely referred to as Artificial Intelligence or AI) for addressing Text Analytics tasks. The reasons are that this approach is faster, suitable for large volumes of data, and more adaptable to new contexts, as it does not require hard-coded rules. However, AI / Machine Learning algorithms are only as good as the data they are trained on.

A good training dataset usually requires expert knowledge for pre-categorizing or “tagging” the input features for the model, which is often performed manually. These training datasets are also known as “gold standard data” and are not easy to obtain: producing an accurately tagged dataset takes time and many trained people, which can be very expensive. This is why many machine learning solutions use a statistical approach for identifying and categorizing input features for training, often resulting in solutions with high recall but low precision.

What about the Rule-Based approach?

Even though the rule-based approach has been abandoned lately in favor of machine learning solutions, it offers certain advantages over the latter. Since the rules are created by humans, the results are consistent and more accurate. It is also easier to understand the logic behind the machine’s decisions and improve a model than to comprehend a “black box”, which is what most machine learning algorithms amount to. The main issue with this approach is that the rules are hard-coded and limited to the existing context, making it hard to adapt to new contexts; this results in higher precision, but lower recall.

Ideally, we would like to get the best of both worlds: the aspects of a Rule-Based approach that result in high precision and the aspects from Machine Learning that result in high recall. So how can we combine them efficiently?
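For readers who want the two metrics pinned down, here is a small Python sketch of precision and recall. The document IDs and the two result sets are made up purely for illustration; they are not measurements of any real system.

```python
def precision_recall(predicted: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = correct hits / all hits; recall = correct hits / all relevant items."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}  # what a human expert would mark as correct

# Hypothetical rule-based output: few hits, all of them correct.
rule_based = {"d1", "d2"}
# Hypothetical statistical output: finds everything, but with extra noise.
statistical = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

print(precision_recall(rule_based, relevant))   # (1.0, 0.5)
print(precision_recall(statistical, relevant))  # (0.5, 1.0)
```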

Making Gold from Rules

The solution proposed by Megaputer involves using a rule-based approach to generate training datasets. This approach works by utilizing a highly accurate query language that removes the need for manual tagging, while still achieving similar results. PolyAnalyst’s Pattern Definition Language (PDL) navigates and leverages the linguistic properties of text; it allows an expert to teach a machine how to identify and categorize important features that are necessary for generating training datasets with “gold standard” quality.

Once the machine is given expert guidance with rules and has generated the pre-categorized training data, it can expand its knowledge to additional or new contexts through Machine Learning.
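The idea of letting rules produce the training labels can be sketched in a few lines of Python. Everything here is an assumption for illustration: the two categories, the regex patterns, and the function `label_with_rules` are hypothetical stand-ins, and real PDL rules would draw on far richer linguistic information than plain regex.

```python
import re

# Hypothetical expert-written rules: category name -> pattern.
RULES = {
    "billing": re.compile(r"\b(?:invoice|refund|charge[sd]?)\b", re.IGNORECASE),
    "shipping": re.compile(r"\b(?:deliver(?:y|ed)?|shipping|package)\b", re.IGNORECASE),
}


def label_with_rules(texts: list[str]) -> list[tuple[str, str]]:
    """Return (text, label) pairs for texts matched by exactly one rule."""
    labeled = []
    for text in texts:
        hits = [label for label, pattern in RULES.items() if pattern.search(text)]
        if len(hits) == 1:  # skip unmatched or ambiguous texts to keep labels clean
            labeled.append((text, hits[0]))
    return labeled


texts = [
    "I was charged twice on my invoice",
    "The package never arrived",
    "Great service",
    "Refund my shipping fee",
]
training_data = label_with_rules(texts)
print(training_data)
```

Only the first two texts receive a label: the third matches no rule and the fourth matches both. The resulting pairs could then be handed to whatever machine learning model is trained next.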

A Step Forward

Thus, with a combined approach, we achieve both the high precision that Rule-Based approaches bestow on the generated training datasets and the high recall that Machine Learning achieves in different contexts. If you are interested in automating the generation of accurate training data to improve the results of your Machine Learning models, contact us and learn about the tools we have available.


Comparing Machine Translation to Native Language Analysis

Jeff Palan — Fri, 10 May 2019 20:20:59 +0000

As our world becomes increasingly global, so does our data. Working as an analyst at a company of almost any size today often means facing the challenge of receiving text data that contains multiple languages. So what do you do?

Essentially, there are two options we may consider: machine translation or native language analysis.

With machine translation, we actually create a new dataset where the text has all been translated into a single language before we do the analysis. This makes the subsequent analysis much easier, as we only need to use a single language grammar module for the analysis.
Native language analysis means that we keep documents in their original languages and perform a separate analysis for each language with the corresponding grammar module.

To demonstrate how these options work, let’s imagine we are working with a dataset that has records (or rows) of textual responses that are either in English or Spanish. Regardless of whether we want to use machine translation or to process individual documents in their native languages, we first need to split the dataset up by language. In this example, all the English records go into one dataset, and all the Spanish records into another. We then use the software PolyAnalyst for Text™ to perform native language text analysis by implementing the corresponding language modules. Currently, this software can process text in 16 different languages, and it can integrate with third-party translation services such as Microsoft, SDL, and Google. Splitting the data can be easily accomplished by using a combination of the Language Detection and Filter Rows nodes in PolyAnalyst. The Language Detection node automatically samples the text of each record and determines what language is being used (as seen below). Once each record is tagged with its language, we can use these tags to filter the data by language.

PolyAnalyst workflow for detecting and filtering text data by language.

Translation services like Google and Microsoft are fully capable of identifying the language of texts on a record-by-record basis, but they will charge you a fee even if you ask them to translate English to English. Therefore, it is best to do the split beforehand and send them only the texts that really need translation. Some of you may even go so far as to split individual texts into multiple records when that text contains multiple languages. This will minimize translation costs.
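Outside of PolyAnalyst, the same split-then-translate idea can be sketched in Python. The sketch assumes each record already carries a language tag (for example, from a language detection step); the sample records and both function names are hypothetical.

```python
from collections import defaultdict


def split_by_language(records: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (language, text) records into one list of texts per language tag."""
    groups = defaultdict(list)
    for language, text in records:
        groups[language].append(text)
    return dict(groups)


def texts_needing_translation(groups: dict[str, list[str]], target: str = "English") -> list[str]:
    """Collect only the texts that are not already in the target language."""
    return [text for language, texts in groups.items() if language != target for text in texts]


records = [
    ("English", "The product arrived late."),
    ("Spanish", "El producto llegó tarde."),
    ("English", "Great support team."),
]
groups = split_by_language(records)
to_translate = texts_needing_translation(groups)
print(to_translate)  # ['El producto llegó tarde.']
```

Only the Spanish record is sent out, so nothing is paid for translating English to English.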

Once we’ve separated our data based on the language, it’s time to decide what approach to use: Machine Translation (MT) or Native Language Analysis (NLA). Let’s review some pros and cons of each of these approaches.

Machine Translation Pros & Cons

PROS

Relatively Cheap
Machine translation is relatively cheap, even with a fairly large dataset. These services tend to charge by the character, so the cost will vary based on how many documents you have and how long (a.k.a. wordy) they are. For reference, at Microsoft’s lowest (and least cost-effective) pricing tier, the cost is $10 per 1 million characters. A page in Microsoft Word is about 3,000 characters, so you can translate about 333 pages (1,000,000 / 3,000) for $10. If you have a huge dataset, then this might sound like a lot. However, compared to hiring multiple analysts with fluency in different languages, it may still be the more affordable option.
Simple and Accessible Results
The data is more widely accessible and ends up consolidated into a single language. This means that even if the data was originally in 10 different languages, end users such as the company or team managers, who may only speak English, can review the text of each supporting record for a proposed insight and see what is being said and why that record was processed that way.
Low Maintenance
Typically for ongoing analysis, you will want to update your analysis workflow periodically to account for new issues and trends. Because MT lets you build a single analysis scenario for a single language, maintenance becomes simpler. If you want to use NLA and have 10 languages present in the data, you will have to build and maintain a separate analysis scenario for each language.
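The per-character pricing described under “Relatively Cheap” is easy to turn into a quick estimate. The sketch below uses only the two figures quoted there ($10 per 1 million characters and about 3,000 characters per page); the function names are illustrative, and actual prices will differ by provider and tier.

```python
def translation_cost_usd(total_characters: int, usd_per_million_chars: float = 10.0) -> float:
    """Estimated cost of translating total_characters at a flat per-character rate."""
    return total_characters / 1_000_000 * usd_per_million_chars


def pages_per_budget(budget_usd: float, chars_per_page: int = 3000,
                     usd_per_million_chars: float = 10.0) -> int:
    """Whole pages that fit in a budget, given an average page length in characters."""
    affordable_chars = budget_usd / usd_per_million_chars * 1_000_000
    return int(affordable_chars // chars_per_page)


print(pages_per_budget(10))             # 333
print(translation_cost_usd(3_000_000))  # 30.0
```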

CONS

Low Accuracy
The accuracy of machine translation is still relatively low compared to manual translation. And even with manual translation, some things like sarcasm and figures of speech may not translate well. For example, if you translated “break a leg” into another language, the meaning of “I wish you good luck” is likely to be lost. Additionally, different languages may be more or less difficult to translate to or from. Going from Spanish to English will likely result in a reasonable translation, but most translation services that have been tested by our analysts at Megaputer performed relatively poorly when working with languages like Japanese, Chinese, and Korean. Anyone looking to use machine translation will need to run some tests to see if the accuracy is at a level that can meet the output goals.

Here is an example of poor translation from Google Translate. As you may know, Japanese is a highly contextual language, which makes machine translation difficult.

Original Text: 生懸命指でまぶたを広げて目薬を差しました。
Google Translate: I spread my eyelids with my fingers and put on my eyes.
Manual Translation: With great effort I held his eyelids open with my fingers and dropped in the eye medicine.

Garbage In, Garbage Out
The accuracy of the analysis will be partially dependent on the accuracy of the translation. A low accuracy translation will cause the results of the analysis to be less trustworthy.

Native Language Analysis Pros and Cons

PROS

Better Accuracy
Native language analysis generally yields much more accurate results. This is, of course, dependent on the analyst.
Traceability and Transparency
There is a 1-to-1 account of what parts of the original text match the search query, and this will be visible in the final results.

CONS

More Expensive
Native language analysis tends to be more expensive. There will be costs for hiring additional analysts to cover different languages. Those analysts will need to not only have the skills of an analyst, but also the skills of a polyglot linguist.
More Maintenance
In the future when models and algorithms need to be adjusted, the work will be multiplied by the number of languages being worked with.
Less Accessibility
When consumers of the results review the analysis, they may not be able to independently read all the records to understand the supporting information for suggested insights.

Prepping the XPDL seminar series

Rebecca Hale — Wed, 09 May 2018 16:09:35 +0000

XPDL is short for Extended Pattern Definition Language. It is a feature of Megaputer’s PolyAnalyst software that helps in analyzing textual data.

Megaputer periodically provides educational events to help others learn about our technologies.

Registration for the XPDL webinar series opens June 1, 2018.

Following numerous requests from our customers for more XPDL training material, I’ve been developing an XPDL seminar series. We receive a lot of questions about how best to use XPDL to extract information from text, because it is such an important task for text analysis. Data analysts, survey conductors, customer experience managers… anyone working with text data needs a way to extract the information they need. After testing these presentations on some of our analysts, we’ll begin preparing brand-new training modules for our customers.

This week, we had a lot of great questions from our analysts, making sure they understand the more complex rule types. Practicing the series at our weekly meetings has given us a lot of great tips and improvements. For example, we learned that users want lots of practical text examples, showing exactly the effect of the rule we’re learning about. We’ve also had some really positive feedback; after our Hierarchical Rules seminar, where we learned about how to make our XPDL rules as efficient as possible, one analyst team returned to their project and concentrated on applying the tips we covered. When they reported back, one of their Entity Extraction nodes had gone from 90-minute execution times to completing extractions in only 10 minutes. Now that’s progress!

There is a lot to learn—XPDL can be intimidating for new users, because there is so much customizability. You can use it to extract anything you want, based on word proximity or morphological, syntactic, and semantic context. This is a really exciting project because we will finally have discrete training modules on virtually any topic in XPDL, from syntax basics to the most advanced filtering and efficiency-improving techniques.

For those of you eager for training in XPDL right now, I plan to run a webinar series to make this content available to our customers. We will cover such topics as advanced PDL querying, XPDL syntax, improving efficiency with child rules, controlling your output, and tips and tricks for automating entity discovery.

Our presenters have also been developing quizzes for each of the topics, to accompany our PolyAnalyst certification program. Users can learn about specific topics in PolyAnalyst, prove their knowledge, and become experts in the myriad applications of this software. Additionally, there will be hands-on workshop sessions available, as well as training videos and tutorials, to meet the needs of every user hoping to learn more about XPDL. The more our users and analysts know about XPDL, the more places they find to creatively solve interesting problems. I can’t wait to see what our network of analysts and data scientists comes up with next!

First stop, XPDL. Up next, everything else! XPDL is one of our most in-demand training topics, but there are many others, like Dictionaries, Document Classification, and Machine Learning. This is the start of an initiative that will set our trainers’ pockets to bursting with material to answer any question in a way that is easy to understand and straightforward to practice. We in Bloomington are already preparing hands-on workshops for a number of these topics—check out our conference agenda here.


Introducing the partofspeech function for PolyAnalyst

Jeff Palan — Thu, 15 Jun 2017 14:08:39 +0000

Build 2010 – Released June 2017

As part of PolyAnalyst’s crusade for accuracy and ease of use, a new Pattern Definition Language (PDL) function partofspeech has been added to the function library. This function allows you to create search queries that reference very specific grammatical attributes.

For example

partofspeech(noun_singular)

will match “desk” and “table” but not “desks” or “tables”

partofspeech(noun, building)

will match “building” and “buildings”, but only when they are used as nouns (as opposed to a verb, as in “I’m building a desk”)

It was possible to achieve similar results previously, but doing so required more complex and esoteric queries.

This function can be especially useful in entity extraction. As an example, let’s run the following query on some public data from the National Highway Traffic Safety Administration.

near(1, partofspeech(adjective), tire or tread)

The result is all the instances where an adjective appeared next to the words “tire” or “tread”.
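To illustrate the kind of match this query describes, here is a rough Python analogue. It is a sketch under several assumptions: the tokens are hand-tagged rather than produced by a real part-of-speech tagger, the target word forms are listed manually, and treating "next to" as adjacency in either order is a guess, since the post does not say whether near is order-sensitive.

```python
# Hand-tagged tokens stand in for the output of a real part-of-speech tagger.
TAGGED = [
    ("the", "determiner"), ("rear", "adjective"), ("tire", "noun"),
    ("had", "verb"), ("uneven", "adjective"), ("tread", "noun"), ("wear", "noun"),
]

# Word forms listed by hand; PDL would expand morphological forms automatically.
TARGETS = {"tire", "tires", "tread", "treads"}


def adjectives_next_to_targets(tagged_tokens):
    """Return (adjective, target) pairs where the two tokens are adjacent."""
    pairs = []
    for (word_a, pos_a), (word_b, pos_b) in zip(tagged_tokens, tagged_tokens[1:]):
        if pos_a == "adjective" and word_b in TARGETS:
            pairs.append((word_a, word_b))
        elif pos_b == "adjective" and word_a in TARGETS:
            pairs.append((word_b, word_a))
    return pairs


print(adjectives_next_to_targets(TAGGED))  # [('rear', 'tire'), ('uneven', 'tread')]
```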

