<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Pattern Definition Language &#8211; Megaputer Intelligence</title>
	<atom:link href="https://www.megaputer.com/tag/pattern-definition-language/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.megaputer.com</link>
	<description>Your Knowledge Partner</description>
	<lastBuildDate>Tue, 24 Mar 2026 00:02:52 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.0.22</generator>

<image>
	<url>https://www.megaputer.com/wp-content/uploads/favicon.png</url>
	<title>Pattern Definition Language &#8211; Megaputer Intelligence</title>
	<link>https://www.megaputer.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Query languages—the Swiss army knife of information extraction</title>
		<link>https://www.megaputer.com/query-languages-the-swiss-army-knife-of-information-extraction/</link>
		<pubDate>Tue, 06 Feb 2024 05:19:41 +0000</pubDate>
		<dc:creator><![CDATA[Echo Lu]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Business Intelligence]]></category>
		<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[Data Analytics]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Entity Extraction]]></category>
		<category><![CDATA[Fuzzy Matching]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Morphology]]></category>
		<category><![CDATA[Pattern Definition Language]]></category>
		<category><![CDATA[Text Analytics]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=35231</guid>
		<description><![CDATA[<p>Text mining, the art of extracting information from text, requires the formulation of efficient queries that retrieve information based on user input. To do this, the user requires a language for writing queries. For the most basic use cases, the language operators could be regex or string search. But while regex and string search are...</p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/query-languages-the-swiss-army-knife-of-information-extraction/">Query languages—the Swiss army knife of information extraction</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p><span style="font-weight: 400;">Text mining, the art of extracting information from text, requires the formulation of efficient queries that retrieve information based on user input. To do this, the user requires a language for writing queries. For the most basic use cases, the language operators could be regex or string search. But while regex and string search are indispensable for text mining, their utility hits a hard ceiling when semantic or meaningful search is required. They cannot, for example, capture complex entities such as human names, corporations, and drugs. For handling tasks like these, we need a more powerful query language that has semantic understanding, such as Megaputer’s PDL.</span></p>
<p><span style="font-weight: 400;">So, what is PDL, and how does it achieve semantic understanding while regex and string search do not? Let’s take a look at an example to find out.</span></p>
<p><span style="font-weight: 400;">First, PDL does not search only for the literal form of a word in the query; it automatically extends the search to all morphological forms of that word. For example, when searching for the word “company” in financial news articles, a PDL query will find not only “company” but also its plural form, “companies.” This feature is especially handy when the search involves a verb. Suppose that you are interested in extracting </span><i><span style="font-weight: 400;">what the CEOs said</span></i><span style="font-weight: 400;">. With regex or plain substring search, you need to list every verb form yourself: “say,” “saying,” “says,” and “said.” With PDL, simply entering “say” in the query fetches all of these forms automatically. This behavior can be turned off by enclosing the word in the </span><i><span style="font-weight: 400;">form</span></i><span style="font-weight: 400;"> function, which restricts the search to the literal form of the word, as in the example below.</span></p>
<p><img class="wp-image-35249 aligncenter" src="https://www.megaputer.com/wp-content/uploads/comparison_pdl-1.png" alt="" width="800" height="497" /></p>
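<p>To make the contrast concrete, here is a minimal Python sketch of the regex side only. It is not PDL; the pattern, the function name <code>find_say_forms</code>, and the sample sentence are made up for illustration. It shows that every inflected form has to be written out by hand.</p>

```python
import re

# Every inflected form is listed by hand; regex knows nothing about morphology.
SAY_FORMS = re.compile(r"\b(?:say|says|saying|said)\b", re.IGNORECASE)


def find_say_forms(text):
    """Return every listed form of the verb found in text, in order of appearance."""
    return [match.group(0).lower() for match in SAY_FORMS.finditer(text)]


sample = "The CEO said profits rose. Analysts say otherwise, saying costs grew."
print(find_say_forms(sample))  # ['said', 'say', 'saying']
```

<p>Any form left out of the alternation is silently missed, which is the maintenance burden that automatic morphological expansion removes.</p>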
<p><span style="font-weight: 400;">Another notable feature of PDL is that users can tailor the scope of their searches with a range of built-in functions. Returning to the previous example, you may want to go beyond the specific verb “say” and include synonymous verbs like “tell” or “mention.” This is straightforward with PDL: users can invoke the </span><i><span style="font-weight: 400;">synonym</span></i> <span style="font-weight: 400;">function with the verb &#8220;say,&#8221; as demonstrated in (a) below. As the results table (b) illustrates, the captured text now includes speech verbs such as “tell,” “emphasize,” and “claim” in addition to “say,” each in all of its verb forms. For additional flexibility, the user can also create and modify synonym dictionaries.</span></p>
<p><img class="wp-image-35250 aligncenter" src="https://www.megaputer.com/wp-content/uploads/comparison_pdl-2.png" alt="" width="800" height="497" /></p>
<p><span style="font-weight: 400;">PDL offers various modes of information extraction, including proximity search (e.g., finding words A and B within the same sentence or within a three-word range), syntactic relations (e.g., finding a word A that is the subject or object of B), semantic relations (e.g., finding words that are synonyms or antonyms of word A), access to dictionaries and ontologies, and more. The language is expressive enough to capture complex patterns yet relatively easy to use, with a syntax that closely resembles English. Access to such a versatile query language significantly enhances the power and quality of text mining operations.</span></p>
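<p>As a rough illustration of what a proximity search has to do, here is a minimal Python sketch. It is not PDL and says nothing about how PDL is implemented; the function name <code>within_window</code> and its parameters are assumptions made for this example, and matching is on literal tokens only, with no morphology.</p>

```python
import re


def within_window(text, word_a, word_b, max_gap=3):
    """Return True if word_a and word_b occur at most max_gap token positions apart."""
    tokens = re.findall(r"\w+", text.lower())
    positions_a = [i for i, token in enumerate(tokens) if token == word_a.lower()]
    positions_b = [i for i, token in enumerate(tokens) if token == word_b.lower()]
    # Compare every occurrence of A with every occurrence of B.
    return any(abs(i - j) <= max_gap for i in positions_a for j in positions_b)


print(within_window("The company said profits rose", "company", "profits"))  # True
```

<p>A sentence-level variant would split the text into sentences first and check whether both words appear in the same one.</p>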
<p><span style="font-weight: 400;">In conclusion, PDL is a powerful and versatile query language that enables users to extract meaningful information from text with greater efficiency and accuracy than competing methods like regex or string search. Its ability to understand and capture morphological forms, synonyms, and other complex patterns makes it an indispensable tool for text mining tasks that require semantic understanding. By leveraging these capabilities, users can enhance their information extraction processes and gain valuable insights from their data. That versatility is what makes PDL a true Swiss army knife of information extraction.</span></p>
]]></content:encoded>
			</item>
		<item>
		<title>Combining the Best of Both Worlds: Machine Learning &#038; Rule-Based Approach</title>
		<link>https://www.megaputer.com/best-of-both-worlds-rule-based-ml/</link>
		<pubDate>Wed, 12 Feb 2020 15:15:51 +0000</pubDate>
		<dc:creator><![CDATA[Elli Bourlai]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Pattern Definition Language]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=32587</guid>
		<description><![CDATA[<p>Many companies have abandoned rule-based approaches in pursuit of higher recall, but are suffering from low precision in their results. A good training dataset usually requires expert knowledge for pre-categorizing or “tagging” the input features for a model, which is often performed manually. How can we combine the best aspects of Machine Learning and a Rule-Based approach for high recall and precision?</p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/best-of-both-worlds-rule-based-ml/">Combining the Best of Both Worlds: Machine Learning &#038; Rule-Based Approach</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">In the past few years, more and more organizations have started relying solely on Machine Learning (often loosely referred to as Artificial Intelligence or AI) for addressing Text Analytics tasks. The reasons are that this approach is faster, suitable for large volumes of data, and more adaptable to new contexts, as it does not require hard-coded rules. However, AI / Machine Learning algorithms are only as good as the data they are trained on. </span></p>

		</div>
	</div>
</div></div></div></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-8 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column  vc_custom_1580327119221">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">A good training dataset usually requires expert knowledge for pre-categorizing or “tagging” the input features for the model, which is often performed manually. These training datasets are also known as “gold standard data” and are not easy to obtain: it requires time and many trained people to produce an accurately tagged dataset, which can be very expensive. This is why many machine learning solutions use a statistical approach for identifying and categorizing input features for training, often resulting in solutions with high recall but low precision.</span></p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-4 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-image align_center style_circle vc_custom_1580327126299"><div class="w-image-h"><img width="405" height="390" src="https://www.megaputer.com/wp-content/uploads/solution.jpg" class="attachment-large size-large" alt="Thought to Idea" srcset="https://www.megaputer.com/wp-content/uploads/solution.jpg 405w, https://www.megaputer.com/wp-content/uploads/solution-300x289.jpg 300w" sizes="(max-width: 405px) 100vw, 405px" /></div></div></div></div></div></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2><span style="font-weight: 400;">What about the Rule-Based approach? </span></h2>
<p><span style="font-weight: 400;">Even though the Rule-Based approach has lately been abandoned in favor of machine learning solutions, it offers certain advantages. Since the rules are created by humans, the results are consistent and more accurate. It is also easier to understand the logic behind the machine’s decisions and improve a model than to try to comprehend a “black box,” which is what most machine learning algorithms amount to. The main issue with this approach is that the rules are hard-coded and limited to the existing context, making it hard to adapt to new contexts; the result is higher precision but lower recall.</span></p>
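<p>For readers who want the two metrics spelled out: precision is the share of retrieved items that are correct, and recall is the share of correct items that were retrieved. The Python sketch below computes both from raw counts. The counts are invented purely to illustrate the trade-off described above and are not measurements from any real system.</p>

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from raw match counts."""
    retrieved = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Illustrative counts only: 100 relevant items exist in each scenario.
rule_based = precision_recall(true_positives=45, false_positives=5, false_negatives=55)
statistical = precision_recall(true_positives=90, false_positives=60, false_negatives=10)
print(rule_based)   # (0.9, 0.45)
print(statistical)  # (0.6, 0.9)
```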

		</div>
	</div>
<div class="w-image align_center"><a class="w-image-h" href="https://www.megaputer.com/wp-content/uploads/rulevsml.jpg" ref="magnificPopup"><img width="768" height="450" src="https://www.megaputer.com/wp-content/uploads/rulevsml-768x450.jpg" class="attachment-us_768_0 size-us_768_0" alt="RuleBasedVS_ML" srcset="https://www.megaputer.com/wp-content/uploads/rulevsml-768x450.jpg 768w, https://www.megaputer.com/wp-content/uploads/rulevsml-300x176.jpg 300w, https://www.megaputer.com/wp-content/uploads/rulevsml-600x352.jpg 600w, https://www.megaputer.com/wp-content/uploads/rulevsml.jpg 892w" sizes="(max-width: 768px) 100vw, 768px" /></a></div></div></div></div></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Ideally, we would like to get the best of both worlds: the aspects of a Rule-Based approach that result in high precision and the aspects from Machine Learning that result in high recall. So how can we combine them efficiently?</span></p>
<h2>Making Gold from Rules</h2>
<p><span style="font-weight: 400;">The solution proposed by Megaputer involves using a rule-based approach to generate training datasets. This approach works by utilizing a highly accurate query language that removes the need for manual tagging, while still achieving similar results. </span><span style="font-weight: 400;">PolyAnalyst&#8217;s Pattern Definition Language (PDL) navigates and leverages the linguistic properties of text; it allows an expert to teach a machine how to identify and categorize important features that are necessary for generating training datasets with “gold standard” quality.</span></p>
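<p>The labeling step can be sketched in a few lines of Python. This is only an illustration of the general idea of rule-generated training data: plain regular expressions stand in for expert-written queries, and the labels, patterns, and the function name <code>label_with_rules</code> are hypothetical. It does not show PDL syntax or PolyAnalyst itself.</p>

```python
import re

# Hypothetical rules: each label maps to a regex standing in for an expert-written query.
RULES = {
    "billing": re.compile(r"\b(?:invoice|refund|charged?)\b", re.IGNORECASE),
    "delivery": re.compile(r"\b(?:shipping|delivery|delivered|package)\b", re.IGNORECASE),
}


def label_with_rules(texts):
    """Return (text, label) pairs for texts matched by exactly one rule."""
    labeled = []
    for text in texts:
        hits = [label for label, pattern in RULES.items() if pattern.search(text)]
        if len(hits) == 1:  # skip unmatched or ambiguous texts to keep labels clean
            labeled.append((text, hits[0]))
    return labeled


docs = [
    "I was charged twice for one invoice.",
    "The package was delivered late.",
    "Great service overall!",
]
print(label_with_rules(docs))
```

<p>The resulting pairs could then serve as training rows for whichever classifier is used downstream; texts that no rule matches are exactly the ones the trained model is expected to generalize to.</p>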

		</div>
	</div>
<div class="w-image align_center"><div class="w-image-h"><img width="1280" height="720" src="https://www.megaputer.com/wp-content/uploads/rlebasedling.png" class="attachment-full size-full" alt="RleBasedLing_full" srcset="https://www.megaputer.com/wp-content/uploads/rlebasedling.png 1280w, https://www.megaputer.com/wp-content/uploads/rlebasedling-300x169.png 300w, https://www.megaputer.com/wp-content/uploads/rlebasedling-1024x576.png 1024w, https://www.megaputer.com/wp-content/uploads/rlebasedling-768x432.png 768w, https://www.megaputer.com/wp-content/uploads/rlebasedling-600x338.png 600w" sizes="(max-width: 1280px) 100vw, 1280px" /></div></div><div class="ult-spacer spacer-6a391ea16aede" data-id="6a391ea16aede" data-height="60" data-height-mobile="" data-height-tab="60" data-height-tab-portrait="" data-height-mobile-landscape="" style="clear:both;display:block;"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Once the machine is given expert guidance with rules and has generated the pre-categorized training data, it can expand its knowledge to additional or new contexts through Machine Learning. </span></p>

		</div>
	</div>
<div class="w-image align_center"><a class="w-image-h" href="https://www.megaputer.com/wp-content/uploads/mlplusrulesolution.jpg" ref="magnificPopup"><img width="975" height="419" src="https://www.megaputer.com/wp-content/uploads/mlplusrulesolution.jpg" class="attachment-large size-large" alt="MLRB-Process" srcset="https://www.megaputer.com/wp-content/uploads/mlplusrulesolution.jpg 975w, https://www.megaputer.com/wp-content/uploads/mlplusrulesolution-300x129.jpg 300w, https://www.megaputer.com/wp-content/uploads/mlplusrulesolution-768x330.jpg 768w, https://www.megaputer.com/wp-content/uploads/mlplusrulesolution-600x258.jpg 600w" sizes="(max-width: 975px) 100vw, 975px" /></a></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2>A Step Forward</h2>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-9 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Thus, a combined approach delivers both the high precision that Rule-Based approaches bring to the generated training datasets and the high recall that Machine Learning achieves across different contexts. If you are interested in automating the generation of accurate training data to improve the results of your Machine Learning models, <a href="https://www.megaputer.com/contact/">contact us</a> and learn about the tools we have available.</span></p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-3 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-image align_center style_circle"><div class="w-image-h"><img width="350" height="332" src="https://www.megaputer.com/wp-content/uploads/humanandmachinehand.jpg" class="attachment-shop_single size-shop_single" alt="humanandmachinehand" srcset="https://www.megaputer.com/wp-content/uploads/humanandmachinehand.jpg 350w, https://www.megaputer.com/wp-content/uploads/humanandmachinehand-300x285.jpg 300w" sizes="(max-width: 350px) 100vw, 350px" /></div></div></div></div></div></div></div></div></div></div></div></section>
]]></content:encoded>
			</item>
		<item>
		<title>Comparing Machine Translation to Native Language Analysis</title>
		<link>https://www.megaputer.com/comparing-machine-translation-to-native-language-analysis/</link>
		<pubDate>Fri, 10 May 2019 20:20:59 +0000</pubDate>
		<dc:creator><![CDATA[Jeff Palan]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Pattern Definition Language]]></category>
		<category><![CDATA[Text Analytics]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=31761</guid>
		<description><![CDATA[<p>As our world becomes increasingly global, so does our data. Being an analyst working with almost any size company today often means facing the challenge of receiving text data that contains multiple languages. So what do you do? Essentially, there are two options we may consider: machine translation or native language analysis.</p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/comparing-machine-translation-to-native-language-analysis/">Comparing Machine Translation to Native Language Analysis</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><img class=" wp-image-32171 alignright" src="https://www.megaputer.com/wp-content/uploads/feature-imageml.jpg" alt="ml-translate-interpret" width="268" height="268" srcset="https://www.megaputer.com/wp-content/uploads/feature-imageml.jpg 600w, https://www.megaputer.com/wp-content/uploads/feature-imageml-150x150.jpg 150w, https://www.megaputer.com/wp-content/uploads/feature-imageml-300x300.jpg 300w, https://www.megaputer.com/wp-content/uploads/feature-imageml-350x350.jpg 350w" sizes="(max-width: 268px) 100vw, 268px" />As our world becomes increasingly global, so does our data. Being an analyst working with almost any size company today often means facing the challenge of receiving text data that contains multiple languages. So what do you do?</p>
<p>Essentially, there are two options we may consider: machine translation or native language analysis.</p>
<ol>
<li>With machine translation, we actually create a new dataset where the text has all been translated into a single language before we do the analysis. This makes the subsequent analysis much easier, as we only need to use a single language grammar module for the analysis.</li>
<li>Native language analysis means that we keep documents in their original languages and perform a separate analysis for each language with the corresponding grammar module.</li>
</ol>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-8 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>To demonstrate how these options work, let’s imagine we are working with a dataset that has records (or rows) of textual responses that are either in English or Spanish. Regardless of whether we want to use machine translation or to process individual documents in their native languages, we first need to split the dataset up by language, so that all the English records are together in one dataset and all the Spanish records are in another. We then use the software <a href="https://www.megaputer.com/polyanalyst/text/">PolyAnalyst for Text™</a> to perform native language text analysis by implementing the corresponding <a href="https://www.megaputer.com/polyanalyst/language-pack/">language modules</a>. Currently, this software can process text in 16 different languages, and it can integrate with third-party translation services such as Microsoft, SDL, and Google. Splitting the data can be easily accomplished by using a combination of the Language Detection and Filter Rows nodes in PolyAnalyst. The Language Detection node automatically samples the text of each record and determines what language is being used (as seen below). Once each record is tagged with its language, we can use these tags to filter the data by language.</p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-4 wpb_column vc_column_container has-fill"><div class="vc_column-inner  vc_custom_1556745174546"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<blockquote><p>
Translation services like Google and Microsoft are fully capable of identifying the language of texts on a record-by-record basis, but they will charge you even if you ask them to translate English to English.
</p></blockquote>

		</div>
	</div>
</div></div></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<div id="attachment_32076" style="width: 768px" class="wp-caption aligncenter"><a href="https://www.megaputer.com/mt-vs-nla/langugage-detection-workflow/" rel="attachment wp-att-32076"><img class=" wp-image-32076" src="https://www.megaputer.com/wp-content/uploads/langugage-detection-workflow-300x150.jpg" alt="Language Detection in PolyAnalyst" width="758" height="379" srcset="https://www.megaputer.com/wp-content/uploads/langugage-detection-workflow-300x150.jpg 300w, https://www.megaputer.com/wp-content/uploads/langugage-detection-workflow-600x300.jpg 600w, https://www.megaputer.com/wp-content/uploads/langugage-detection-workflow.jpg 765w" sizes="(max-width: 758px) 100vw, 758px" /></a><p class="wp-caption-text"><em>PolyAnalyst workflow for detecting and filtering text data by language.</em></p></div>
<p>Translation services like Google and Microsoft are fully capable of identifying the language of texts on a record-by-record basis, but they will charge you a fee even if you ask them to translate English to English. Therefore, it is best to do the split beforehand and send them only the texts that really need translation. You may even go so far as to split an individual text into multiple records when it contains multiple languages. This will minimize translation costs.</p>
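<p>Outside of PolyAnalyst, the same split-then-translate idea can be sketched in plain Python. The sketch assumes each record already carries a language tag (the detection step itself is not shown), and the record layout and the function names <code>split_by_language</code> and <code>needs_translation</code> are made up for this example.</p>

```python
from collections import defaultdict


def split_by_language(records):
    """Group record texts by their language tag."""
    groups = defaultdict(list)
    for record in records:
        groups[record["language"]].append(record["text"])
    return dict(groups)


def needs_translation(groups, target="en"):
    """Return only the texts whose language differs from the target language."""
    return [text for lang, texts in groups.items() if lang != target for text in texts]


records = [
    {"text": "The product arrived on time.", "language": "en"},
    {"text": "El producto llegó tarde.", "language": "es"},
    {"text": "Muy buen servicio.", "language": "es"},
]
groups = split_by_language(records)
print(needs_translation(groups))  # only the two Spanish texts
```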
<p>Once we’ve separated our data based on the language, it&#8217;s time to decide what approach to use: Machine Translation (MT) or Native Language Analysis (NLA). Let’s review some pros and cons of each of these approaches.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h3 style="text-align: center;">Machine Translation Pros &#038; Cons</h3>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-6 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column  vc_custom_1557155814734">
		<div class="wpb_wrapper">
			<h4 style="text-align: center;"><span style="color: #ffffff;"><strong>PROS</strong></span></h4>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<ul>
<li><span style="color: #95ad31;"><strong>Relatively Cheap</strong></span><br />
Machine translation is relatively cheap, even with a fairly large dataset. These services tend to charge by the character, so the cost will vary based on how many documents you have and how long (a.k.a. wordy) they are. For reference, at Microsoft’s lowest (and least cost-effective) pricing tier, the cost is $10 per 1 million characters. A page in Microsoft Word is about 3,000 characters, so you can translate about 333 pages for $10. If you have a huge dataset, this can still add up to a lot. However, compared to hiring multiple analysts with fluency in different languages, it may still be the more affordable option.</li>
<li><span style="color: #95ad31;"><strong>Simple and Accessible Results</strong></span><br />
The data is more widely accessible and ends up consolidated into a single language. This means that even if the data was originally in 10 different languages, end users such as the company or team managers, who may only speak English, can review the text of each supporting record for a proposed insight and see what is being said and why that record was processed that way.</li>
<li><span style="color: #95ad31;"><strong>Low Maintenance</strong></span><br />
Typically, for ongoing analysis, you will want to update your analysis workflow periodically to account for new issues and trends. Because MT lets you build a single analysis scenario for a single language, maintenance becomes simpler. If you want to use NLA and have 10 languages present in the data, you will have to build and maintain a separate analysis scenario for each language.</li>
</ul>
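<p>The per-character arithmetic above is easy to rerun for your own data. The Python sketch below uses the figures quoted in this post ($10 per 1 million characters, roughly 3,000 characters per page) as defaults; current pricing may differ, and the function names are chosen only for this example.</p>

```python
def translation_cost(characters, usd_per_million=10.0):
    """Estimate the cost in USD of translating the given number of characters."""
    return characters / 1_000_000 * usd_per_million


def pages_for_budget(budget_usd, chars_per_page=3000, usd_per_million=10.0):
    """Return how many whole pages fit within the budget."""
    chars = budget_usd / usd_per_million * 1_000_000
    return int(chars // chars_per_page)


print(pages_for_budget(10))         # 333
print(translation_cost(3_000_000))  # 30.0
```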

		</div>
	</div>
</div></div></div><div class="vc_col-sm-6 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column  vc_custom_1557155839804">
		<div class="wpb_wrapper">
			<h4 style="text-align: center;"><span style="color: #ffffff;"><strong>CONS</strong></span></h4>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<ul>
<li style="text-align: left;"><span style="color: #d1652f;"><strong>Low Accuracy</strong></span><br />
The accuracy of machine translation is still relatively low compared to manual translation. And even with manual translation, some things like sarcasm and figures of speech may not translate well. For example, if you translated “break a leg” into another language, the meaning of “I wish you good luck” is likely to be lost. Additionally, some languages are more difficult to translate to or from than others. Going from Spanish to English will likely result in a reasonable translation, but most translation services that have been tested by our analysts at Megaputer performed relatively poorly when working with languages like Japanese, Chinese, and Korean. Anyone looking to use machine translation will need to run some tests to see if the accuracy is at a level that can meet the output goals.</li>
</ul>
<p style="padding-left: 30px; text-align: left;"><em>Here is an example of poor translation from Google Translate. As you may know, Japanese is a highly contextual language, which makes machine translation difficult.</em></p>
<p style="text-align: left; padding-left: 30px;"><strong>Original Text:</strong> 生懸命指でまぶたを広げて目薬を差しました。<br />
<strong>Google Translate:</strong> I spread my eyelids with my fingers and put on my eyes.<br />
<strong>Manual Translation:</strong> With great effort I held his eyelids open with my fingers and dropped in the eye medicine.</p>
<ul>
<li><span style="color: #d1652f;"><strong>Garbage In, Garbage Out</strong></span><br />
The accuracy of the analysis will be partially dependent on the accuracy of the translation. A low accuracy translation will cause the results of the analysis to be less trustworthy.</li>
</ul>

		</div>
	</div>
</div></div></div></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-separator size_medium"></div></div></div></div></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h3 style="text-align: center;">Native Language Analysis Pros and Cons</h3>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-6 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column  vc_custom_1557155814734">
		<div class="wpb_wrapper">
			<h4 style="text-align: center;"><span style="color: #ffffff;"><strong>PROS</strong></span></h4>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<ul>
<li><span style="color: #95ad31;"><strong>Better Accuracy</strong></span><br />
Native language analysis generally produces much more accurate results. This is, of course, dependent on the analyst.</li>
<li><span style="color: #95ad31;"><strong>Traceability and Transparency</strong></span><br />
There is a 1-to-1 account of what parts of the original text match the search query, and this will be visible in the final results.</li>
</ul>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-6 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column  vc_custom_1557155839804">
		<div class="wpb_wrapper">
			<h4 style="text-align: center;"><span style="color: #ffffff;"><strong>CONS</strong></span></h4>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<ul>
<li><span style="color: #d1652f;"><strong>More Expensive</strong></span><br />
Native language analysis tends to cost more, because additional analysts must be hired to cover different languages. Those analysts need not only the skills of an analyst, but also the skills of a polyglot linguist.</li>
<li><span style="color: #d1652f;"><strong>More Maintenance</strong></span><br />
In the future when models and algorithms need to be adjusted, the work will be multiplied by the number of languages being worked with.</li>
<li><span style="color: #d1652f;"><strong>Less Accessibility</strong></span><br />
When consumers of the results review the analysis, they may not be able to independently read all the records to understand the supporting information for suggested insights.</li>
</ul>

		</div>
	</div>
</div></div></div></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-separator size_small"></div></div></div></div></div><div class="g-cols wpb_row type_default valign_middle vc_inner "><div class="vc_col-sm-8 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h3>Other Options and Common Questions</h3>
<p>There is also a third option, which is the most expensive. You can use native language analysis for the model building and analysis to ensure high accuracy of the results, and then also use machine translation so that end users reading the report can get a general idea of what each record says. However, it may not be entirely clear to them why a record was processed as it was, since the analysis was done on the untranslated text.</p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-4 wpb_column vc_column_container has-fill"><div class="vc_column-inner  vc_custom_1557156026586"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<blockquote><p>
As for which option you should choose&#8230; How much accuracy are you willing to sacrifice for cost?
</p></blockquote>

		</div>
	</div>
</div></div></div></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>As for which option you should choose, there is no way to know except to do a small-scale test analysis. Try some MT analysis, and if the accuracy is acceptable, then that might be the better choice for you. How much accuracy are you willing to sacrifice for cost? The answer to that will vary from company to company and task to task.</p>
<p>Another common question is, “Which translation service should I use?” Here is a suggestion on how to decide when your goals involve a categorization task. Suppose you have a dataset of customer complaints and you want to categorize each text based on what complaints were made, thus allowing you to get the count for each complaint type. It is recommended that you do a small-scale analysis for each MT service you are considering, and compare this to the results you achieve using NLA. Make sure that the analyst-driven portion of the NLA is rock solid, then compare the results of each MT service to your NLA results. To make this comparison, treat the NLA as 100% correct, then calculate the precision, recall, and F score of the post-MT analysis. This means you will count how many categorizations were made incorrectly, and how many categorizations the post-MT analysis failed to make that it should have made.</p>

		</div>
	</div>
<div class="w-separator size_medium with_line width_default thick_1 style_solid color_border align_center"><div class="w-separator-h"></div></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h4>For Example</h4>
<ul>
<li>Your NLA made 100 categorizations.</li>
<li>You use translation service “A”.</li>
<li>When you run your algorithm on the machine translated data from &#8220;A&#8221;, it makes 70 categorizations, 10 of which were not made by the NLA (and therefore are incorrect), and 60 of which are identical to the NLA results.</li>
</ul>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_auto"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-4 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column  vc_custom_1557543711020">
		<div class="wpb_wrapper">
			<p><strong>Precision</strong> is the number of correct results divided by the number of all returned results.<br />
In this case, we had 60 correct results out of 70 returned categorizations.<br />
P = 60/70<br />
P = .857</p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-4 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column  vc_custom_1557543831258">
		<div class="wpb_wrapper">
			<p><strong>Recall</strong> is the number of correct results divided by the number of results that should have been returned.<br />
In this case, we had 60 correct results found out of the possible 100 correct results.<br />
R = 60/100<br />
R = .6</p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-4 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column  vc_custom_1557543986686">
		<div class="wpb_wrapper">
			<p><strong>F Score</strong> is the harmonic mean of recall and precision. The harmonic mean is the preferred method for averaging ratios. The F score is a good measure of how “correct” the categorization results are.<br />
F = ((2)(Precision)(Recall))/(Precision + Recall)<br />
F = ((2)( .857)( .6))/( .857 + .6)<br />
F = .706</p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<ul>
<li>Then you try translation service &#8220;B&#8221; and perform a similar calculation of precision, recall, and F score for the corresponding categorization results. Let’s assume the F score on the machine translated data from &#8220;B&#8221; is .75. Since service &#8220;B&#8221; facilitated a higher F score (.75 > .706), we conclude that service &#8220;B&#8221; provides a more accurate machine translation for this task than service &#8220;A&#8221;. Ceteris paribus, you should go with service &#8220;B&#8221;.</li>
</ul>
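<p>The comparison above can be sketched in a few lines of Python. This is only an illustration, not part of PolyAnalyst: the function name <code>precision_recall_f</code>, the placeholder category labels, and the representation of each categorization as a (record, category) pair are all assumptions made for the example. The counts are the ones from the worked example, and the .75 for service &#8220;B&#8221; is the assumed value from the text.</p>

```python
def precision_recall_f(reference, predicted):
    """Score predicted categorizations against a reference treated as 100% correct.

    Both arguments are collections of (record_id, category) pairs (an assumed format).
    """
    reference, predicted = set(reference), set(predicted)
    correct = len(reference.intersection(predicted))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Rebuild the worked example: the NLA made 100 categorizations.
nla = {(record, "complaint") for record in range(100)}

# Service "A": 60 categorizations identical to the NLA, plus 10 the NLA never made.
service_a = {(record, "complaint") for record in range(60)}
service_a.update((record, "other") for record in range(10))

p, r, f = precision_recall_f(nla, service_a)
print(f"A: P={p:.3f} R={r:.3f} F={f:.3f}")  # A: P=0.857 R=0.600 F=0.706

# Compare against the F score assumed in the text for service "B".
f_scores = {"A": f, "B": 0.75}
best = max(f_scores, key=f_scores.get)
print(best)  # B
```

<p>Using sets of pairs keeps the bookkeeping simple: a categorization counts as correct only when the same record received the same category in both analyses.</p>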
<p>&nbsp;</p>
<p>All that being said, there is at least one other factor to consider. If your data is highly sensitive, remember that services like Microsoft and Google have rules about keeping a sample of your data, which they can use for improving their algorithms. SDL, on the other hand, does not keep any of your data. Unfortunately for those with highly sensitive data, it appears that generally Microsoft and Google have put to good use the additional data they are receiving. In the tests Megaputer staff have run, these services tend to outperform SDL in terms of accuracy.</p>

		</div>
	</div>
</div></div></div></div></div></section>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/comparing-machine-translation-to-native-language-analysis/">Comparing Machine Translation to Native Language Analysis</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
		<item>
		<title>Prepping the XPDL seminar series</title>
		<link>https://www.megaputer.com/xpdl-seminar-series/</link>
		<pubDate>Wed, 09 May 2018 16:09:35 +0000</pubDate>
		<dc:creator><![CDATA[Rebecca Hale]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Pattern Definition Language]]></category>
		<category><![CDATA[Text Analytics]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=19343</guid>
		<description><![CDATA[<p>We are developing an XPDL seminar series. XPDL is short for Extended Pattern Definition Language. It is a feature of Megaputer's PolyAnalyst.</p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/xpdl-seminar-series/">Prepping the XPDL seminar series</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p><img class="size-full wp-image-19363 alignleft" src="https://www.megaputer.com/wp-content/uploads/xpdl-logo.png" alt="" width="143" height="142" /></p>
<p>XPDL is short for Extended Pattern Definition Language. It is a feature of Megaputer&#8217;s <a href="https://www.megaputer.com/polyanalyst/">PolyAnalyst software</a> that helps in analyzing textual data.</p>
<p><span style="font-weight: 400;">Megaputer periodically provides educational events to help others learn about our technologies.</span></p>
<p><span style="font-weight: 400;">Registration for the <a href="https://www.megaputer.com/events/webinars/">XPDL webinar series</a> opens June 1, 2018.</span></p>
<p><span style="font-weight: 400;">Following numerous requests from our customers for more XPDL training material, I’ve been developing an XPDL seminar series.  We receive a lot of questions about how best to use XPDL to extract information from text, because it is such an important task for text analysis. Data analysts, survey conductors, customer experience managers… anyone working with text data needs a way to extract the information they need. After testing these presentations on some of our analysts, we’ll begin preparing brand-new training modules for our customers.</span></p>
<p><span style="font-weight: 400;">This week, we had a lot of great questions from our analysts, making sure they understand the more complex rule types. Practicing the series at our weekly meetings has given us a lot of great tips and improvements</span><span style="font-weight: 400;">—</span><span style="font-weight: 400;"> for example, we learned that users want lots of practical text examples, showing exactly the effect of the rule we’re learning about. We’ve also had some really positive feedback; after our Hierarchical Rules seminar, where we learned about how to make our XPDL rules as efficient as possible, one analyst team returned to their project and concentrated on applying the tips we covered. When they reported back, one of their Entity Extraction nodes had gone from 90-minute execution times to completing extractions in only 10 minutes. Now that’s progress!</span></p>
<p><span style="font-weight: 400;">There is a lot to learn—XPDL can be intimidating for new users, because there is so much customizability. You can use it to extract anything you want, based on word proximity or morphological, syntactic, and semantic context. This is a really exciting project because we will finally have discrete training modules on virtually any topic in XPDL, from syntax basics to the most advanced filtering and efficiency-improving techniques. </span></p>
<p><span style="font-weight: 400;">For those of you eager for training in XPDL right now, I plan to run a webinar series to make this content available to our customers. We will cover such topics as advanced PDL querying, XPDL syntax, improving efficiency with child rules, controlling your output, and tips and tricks for automating entity discovery.</span></p>
<p><span style="font-weight: 400;">Our presenters have also been developing quizzes for each of the topics, to accompany our <a href="https://www.megaputer.com/polyanalyst/">PolyAnalyst</a> certification program. Users can learn about specific topics in PolyAnalyst, prove their knowledge, and become experts in the myriad applications of this software. Additionally, there will be hands-on workshop sessions available, as well as training videos and tutorials, to meet the needs of every user hoping to learn more about XPDL. The more our users and analysts know about XPDL, the more places they find to creatively solve interesting problems. I can’t wait to see what our network of analysts and data scientists comes up with next!</span></p>
<p><span style="font-weight: 400;">First stop, XPDL. Up next, everything else! XPDL is one of our most in-demand training topics, but there are many others, like Dictionaries, Document Classification, and <a href="https://www.megaputer.com/machine-learning-introduction-ten-minute-video/">Machine Learning</a>. This is the start of an initiative that will set our trainers’ pockets to bursting with material to answer any question in a way that is easy to understand and straightforward to practice. We in Bloomington are already preparing hands-on workshops for a number of these topics</span><span style="font-weight: 400;">—</span><a href="https://www.megaputer.com/event/analytics-conference-2018/analytics-conference-2018-agenda/"><span style="font-weight: 400;">check out our conference agenda here</span></a><span style="font-weight: 400;">.</span></p>
]]></content:encoded>
			</item>
		<item>
		<title>Introducing the partofspeech function for PolyAnalyst</title>
		<link>https://www.megaputer.com/introducing-partofspeech-function-polyanalyst/</link>
		<pubDate>Thu, 15 Jun 2017 14:08:39 +0000</pubDate>
		<dc:creator><![CDATA[Jeff Palan]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Morphology]]></category>
		<category><![CDATA[Pattern Definition Language]]></category>
		<category><![CDATA[Text Analytics]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=22297</guid>
		<description><![CDATA[<p>This function allows you to create search queries that reference very specific grammatical attributes.</p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/introducing-partofspeech-function-polyanalyst/">Introducing the partofspeech function for PolyAnalyst</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Build 2010 &#8211; Released June 2017</p>
<p>As part of <a href="https://www.megaputer.com/polyanalyst/">PolyAnalyst’s</a> crusade for accuracy and ease of use, a new Pattern Definition Language (PDL) function <strong>partofspeech</strong> has been added to the function library.  This function allows you to create search queries that reference very specific grammatical attributes.</p>
<p>For example</p>
<pre>partofspeech(noun_singular)</pre>
<p>will match “desk” and “table” but not “desks” or “tables”.</p>
<pre>partofspeech(noun, building)</pre>
<p>will match “building” and “buildings”, but only when they are used as nouns (as opposed to a verb, as in “I’m building a desk”).</p>
<p>It was possible to achieve similar results previously, but doing so required more complex and esoteric queries.</p>
<p>This function can be especially useful in entity extraction. As an example, let&#8217;s run the following query on some public data from the National Highway Traffic Safety Administration.</p>
<pre>near(1, partofspeech(adjective), tire or tread)</pre>
<p>The result is all the instances where an adjective appeared next to the words “tire” or “tread”.</p>
<p><img class="size-medium wp-image-22298 aligncenter" src="https://www.megaputer.com/wp-content/uploads/part-of-speech-blog-img-1-tire-tread-300x224.png" alt="" width="300" height="224" srcset="https://www.megaputer.com/wp-content/uploads/part-of-speech-blog-img-1-tire-tread-300x224.png 300w, https://www.megaputer.com/wp-content/uploads/part-of-speech-blog-img-1-tire-tread-350x263.png 350w, https://www.megaputer.com/wp-content/uploads/part-of-speech-blog-img-1-tire-tread.png 353w" sizes="(max-width: 300px) 100vw, 300px" /></p>
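<p>For readers who think in general-purpose code, here is a rough Python analogy for what the query above asks for. It is not how PolyAnalyst evaluates PDL. The small <code>ADJECTIVES</code> set is a hypothetical stand-in for real part-of-speech analysis, so unlike <strong>partofspeech</strong> it cannot tell from context how a word is being used, and matching adjacent words in either order is an assumption about what &#8220;next to&#8221; means here.</p>

```python
import re

# Hypothetical toy lexicon standing in for real part-of-speech analysis.
ADJECTIVES = {"flat", "worn", "bald", "new", "rear", "front", "defective"}
TARGETS = {"tire", "tread"}


def adjectives_next_to_targets(text):
    """Return (adjective, target) pairs for adjacent words, in either order."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = []
    # Walk over every pair of neighbouring words in the text.
    for left, right in zip(words, words[1:]):
        if left in ADJECTIVES and right in TARGETS:
            pairs.append((left, right))
        elif left in TARGETS and right in ADJECTIVES:
            pairs.append((right, left))
    return pairs


print(adjectives_next_to_targets("The rear tire went flat and the tread was worn."))
# [('rear', 'tire')]
```

<p>In the sample sentence, &#8220;flat&#8221; and &#8220;worn&#8221; are not reported because neither sits directly beside &#8220;tire&#8221; or &#8220;tread&#8221;.</p>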
]]></content:encoded>
			</item>
	</channel>
</rss>
