Big Data – Megaputer Intelligence

Query languages—the Swiss army knife of information extraction

Echo Lu — Tue, 06 Feb 2024 05:19:41 +0000

Text mining, the art of extracting information from text, requires the formulation of efficient queries that retrieve information based on user input. To do this, the user requires a language for writing queries. For the most basic use cases, the language operators could be regex or string search. But while regex and string search are indispensable for text mining, their utility hits a hard ceiling when semantic or meaningful search is required. They cannot, for example, capture complex entities such as human names, corporations, and drugs. For handling tasks like these, we need a more powerful query language that has semantic understanding, such as Megaputer’s PDL.

So, what is PDL, and how does it achieve semantic understanding while regex and string search do not? Let’s take a look at an example to find out.

First of all, PDL does not just search for the literal form of the word in the query: instead, it automatically extends its search to all morphological forms of the word. For example, when searching for the word “company” in financial news articles, the PDL query will not only find “company”, but also “companies,” the plural form. This feature often comes in handy, especially when the search involves a verb. Suppose that you are interested in extracting what the CEOs said. With regex or other substring search, you will need to list all possible verb forms such as “say,” “saying,” “says,” and “said.” With PDL, simply entering “say” in the query will automatically fetch all possible verb forms. This behavior can also be turned off by enclosing the word in the form function, which will then restrict the search to the literal form of the word, such as in the example below.
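The regex side of this comparison can be sketched in Python. This is not PDL syntax, which the text above does not show; it is a minimal illustration, with a made-up sample sentence, of how every verb form has to be enumerated by hand when only literal pattern matching is available.

```python
import re

# Every inflected form must be listed explicitly; the pattern has no notion of a lemma.
SAY_FORMS = re.compile(r"\b(?:say|says|said|saying)\b", re.IGNORECASE)

def find_say_forms(text: str) -> list[str]:
    """Return every literal match of the listed forms of 'say', in order of appearance."""
    return SAY_FORMS.findall(text)

sample = "The CEO said margins improved. She says growth will continue, saying little else."
print(find_say_forms(sample))
```

Any form left out of the alternation is silently missed, which is the maintenance burden that a morphology-aware query avoids.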

Another notable feature of the PDL language is its capability for users to tailor the scope of their searches using a range of built-in functions. Returning to the previous example, you may not wish to confine your search exclusively to the specific verb “say,” but rather include other synonymous verbs like “tell” or “mention.” Achieving this is straightforward with PDL – users can invoke the synonym function with the verb “say,” as demonstrated in (a) below. As the subsequent results table (b) illustrates, the captured text now includes various speech verbs such as “tell,” “emphasize,” and “claim,” in addition to the word “say,” capturing them in all possible verb forms. For additional flexibility, the user can also create and modify synonym dictionaries.

The PDL language offers various modes of information extraction, including proximity search (e.g., finding words A and B within a sentence, or within a 3-words range), syntactic relation (e.g., finding word A that is the subject or object of B), semantic relation (e.g., finding words that are synonyms/antonyms to word A), access to dictionaries and ontologies, and more. This language is expressive enough to capture complex patterns, and yet relatively easy to use, having a syntax that closely resembles English. Having access to this versatile query language significantly enhances the power and quality of text mining operations.
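To make the idea of proximity search concrete, here is a rough Python sketch. It is not PDL and does not reflect how PDL is implemented; the function name `within_n_words` and the simple tokenizer are assumptions made for illustration, and the matching is literal (no morphology or synonyms).

```python
import re

def within_n_words(text: str, word_a: str, word_b: str, max_gap: int) -> bool:
    """True if word_a and word_b occur at most max_gap token positions apart."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

text = "The CEO of the company said profits rose"
print(within_n_words(text, "company", "said", 3))
print(within_n_words(text, "CEO", "said", 3))
```

Even this small version needs tokenization and position bookkeeping, which hints at why a dedicated query language is convenient for such patterns.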

In conclusion, PDL is a powerful and versatile query language that enables users to extract meaningful information from text with greater efficiency and accuracy than competing methods like regex or string search. Its ability to understand and capture morphological forms, synonyms, and other complex patterns makes it an indispensable tool for solving text mining tasks that require semantic understanding. By leveraging the capabilities of PDL, users can enhance their information extraction processes and gain valuable insights from their data, making it a true Swiss army knife of information extraction.

The post Query languages—the Swiss army knife of information extraction appeared first on Megaputer Intelligence.

Wide, Bulky Data and Dimensionality Reduction

Chris Farris — Mon, 03 Feb 2020 17:46:31 +0000

As the saying goes, “mo’ data, mo’ problems.” Well, that’s not quite the saying and that’s not quite what we mean, but it certainly is true that large data comes with some problems that need to be solved. There are two ways that data can be “big.” Traditionally, Big Data is considered large in the literal sense: its physical size in terms of the memory it consumes. In this article, we’re considering another type of “big” data, which we like to call Wide Data. Wide Data is unstacked tabular data that consists of a large number of columns or variables, usually in the hundreds or more. Wide Data and Big Data often overlap, but they are not necessarily the same: for example, Wide Data can be shallow, with a relatively small number of rows but a large number of columns. Wide Data is common and comes with its own set of problems that need to be addressed in any analytical setting. Next, let’s discuss those problems.

The Problems of Wide Data

One issue with Wide Data is that it is bulky, and this presents challenges that aren’t present in more compact datasets. For example, a person can quickly scan a table of a few columns, maybe even a dozen or more, to summarize what it contains. That quick scan will tell a person what sort of information is contained there, determine if the data is sparse, and provide a rough estimate of the range and variance of the data. This sort of quick summarization is impossible with Wide Data. Even scrolling through column names (if they are informative at all) takes some time to do, and in the end, it is usually impossible to retain all that information.

Not only is Wide Data confusing for us humans, but it can also be confusing for machines if we have an especially large number of columns. If we want to create a machine learning model with Wide Data, our model might suffer from a complexity problem. A large number of inputs to a machine learning model can make training either unstable or time-consuming.

So, what to do? Large numbers of columns aren’t ideal for either humans or machines, but we don’t want to just throw out columns randomly to reduce the size of the data, as we could risk losing important data. The solution is dimensionality reduction, a common technique in data science. Dimensionality reduction will reduce the number of columns that we have in our data while minimizing any loss of information as a result. These techniques are extremely useful in cutting out the fat from data while retaining the value.

Dimensionality Reduction in PolyAnalyst™

PolyAnalyst™ contains several built-in methods for performing dimensionality reduction. We’ll highlight a couple of these below.

The first method is Data Simplification. This process scans data to identify columns that are nearly uniform. A column with almost no variation is very uninformative and could be dropped entirely. Additionally, Data Simplification compares the columns with one another to see if there are any that are nearly identical. These columns can be triaged so that only one remains in the dataset to avoid unnecessary duplication. In addition to Data Simplification, Correlation Analysis can be performed to discover correlated variables. While not identical, strongly correlated variables serve a common purpose and we could consider cutting down to one in this case as well.

The second method is Factor Analysis, which includes techniques such as principal component analysis. The goal of this method is to find the vectors along which most of the variance occurs in the data, reorient the data along those vectors, and discard the vectors in which little variance occurs. This is best understood with an animation.

In this example, we have measured and plotted only two variables: height and weight. We can see the direction along which most of the variance occurs. This is the principal component. Now we rotate our data such that this direction is our horizontal axis and the orthogonal direction is the vertical axis. This destroys the semantic meaning of these axes, of course, but this is not a big deal. Where they were once semantically height and weight, they now hold no such meaning. We will be able to regain the meaning after performing the analysis in the smaller number of dimensions and rotating the results back to the original axes. Now, if we would like to shrink the size of our data, we can throw out the vertical axis, which exhibits little variance. Thus, we have eliminated an entire variable but preserved the majority of the useful information, which simplifies further analysis.
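The rotate, drop, and rotate-back steps can be sketched for the two-variable case in plain Python. The height and weight values below are made up, and the closed-form angle used here only works for two dimensions; a real tool would use a general eigendecomposition.

```python
import math

def pca_2d(points):
    """Rotate 2-D points onto their principal axes; return (angle, mean, rotated points)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Angle of the first principal component of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    rotated = [((x - mx) * c + (y - my) * s, -(x - mx) * s + (y - my) * c)
               for x, y in points]
    return theta, (mx, my), rotated

def reconstruct(pc1_values, theta, mean):
    """Map first-component scores back to the original axes (second component dropped)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(mean[0] + t * c, mean[1] + t * s) for t in pc1_values]

# Hypothetical (height in cm, weight in kg) measurements.
data = [(150, 50), (160, 56), (165, 61), (170, 64), (175, 70), (180, 74), (190, 82)]
theta, mean, rotated = pca_2d(data)
pc1 = [p for p, _ in rotated]           # keep only the horizontal axis
approx = reconstruct(pc1, theta, mean)  # rotate back to height/weight
print(approx[0], data[0])
```

Each point is now stored as a single number, and the reconstructed height and weight pairs stay close to the originals because the discarded axis carried very little variance.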

Stay tuned…

Wide, bulky data is fairly common in the world, and although it comes with some problems, these can be managed with the techniques described here, such as Factor Analysis and Data Simplification. Once our data is manageable, PolyAnalyst has a wide range of modeling solutions. Stay tuned for more discussions about data analysis!

The post Wide, Bulky Data and Dimensionality Reduction appeared first on Megaputer Intelligence.

Choosing Machine Learning Models

Chris Farris — Mon, 21 Oct 2019 21:27:31 +0000

As we have discussed previously, machine learning approaches to modeling are just that – approaches. There are many forms of models we could use with machine learning, each with different design philosophies and quirks. When setting out to use machine learning to create your own models, a question you may be asking yourself is: Which model framework do I choose? From Neural Networks, to Support Vector Machines, to Decision Trees, to Linear Regression, there are many options. While there is rarely a definitive or clear answer to this question, let’s discuss some things to consider when making this choice.

Model Properties

Model frameworks have a set of qualitative measures attached to them which, while difficult to measure directly, can be useful to think about. We will consider two qualities: capacity and complexity. Capacity is the measure of a model framework’s ability to describe data distributions; it can be thought of as potential accuracy. For example, imagine we scatter pebbles on the ground and would like to model the shape they form with two different models. One model is a rigid stick (depicted by the blue line) and the other is an elastic band (depicted by the red line).

The stick has very low capacity because it can only model one shape while changing the angle that it lies on the ground. A few pebbles may fit into this model, but it is unlikely to be a good choice for many situations. The elastic band, while probably not being perfect, can twist and bend to create a host of complex curves to fit our data. This model can fit more potential data distributions and has a higher capacity.

Capacity is a very important property of models. If our model’s capacity is too low, we may never be able to adequately make predictions no matter how much training data we use or how fancy our hardware is. Models with high capacity include Neural Networks and Support Vector Machines (SVMs), and this is why they have been so popular.

However, capacity is not the only property to consider. Another quality is complexity, which can also be thought of as interpretability or how easy it is for a human to understand. In our previous example, the stick had low complexity (or high interpretability). It can be described and understood well by humans, and it is highly generalizable. Low complexity models include Linear Regression and Decision Trees. Conversely, the many curves created by the elastic band are very complex and difficult for humans to describe. SVMs and Neural Networks would, therefore, be considered high complexity models.

This relationship between complexity and capacity that we saw in the above example is generally true for all of our models. By increasing our capacity, we often must incur the cost of increased complexity. This is part of the analysis you must consider when choosing a model. Do you need to be able to describe or personally understand what the model is doing? If so, going all in for the most complex but accurate model may not be desirable. This is why many hospitals and healthcare-related institutions often rely on less complex model structures like Random Forests. When taking the health of patients into account, doctors want to be able to understand what a model is doing before taking action on those results.

Resources

Thus far we have discussed the theoretical constraints of a trained model. Now we should consider more practical properties. A machine learning model requires training, which in turn requires data, processing power, and time. When selecting a machine learning model to use, we need to weigh how much of these resources we have available to invest in the model. Additionally, the size of the trained model itself is important. For example, if we plan to deploy the model on mobile technology, we should be cautious about how much storage the model itself requires. As expected, if we desire models with high capacity, we usually need to invest more resources into the model during training and storage. Let’s consider a sample of models and discuss the resources they require.

On the low-cost end of the spectrum are Naïve Bayes models. These models are generally lower in Capacity but are extremely easy to interpret and train. One of the main advantages of Naïve Bayes models is that they are exceedingly fast to train and store. Unfortunately, the model relies on the Naïve Bayes assumption which is rarely true for our data. However, if your data does not stray too far from that assumption, you can find that the model provides decent results in a cheap package.

Also relatively low-cost are Decision Trees and Random Forests. These models are interpretable, and the tree structure of the models makes them easy to store efficiently if you don’t have an enormous number of features. Unfortunately, we start to see an increase in the time required to train these models.

SVMs, a popular technique, come with a large jump in resources required. Training these usually requires more data than other methods, and the resulting models are bulky, almost impossible to interpret in plain English, and tricky to train in the first place because of the choice of kernel. However, SVMs are powerful models when trained correctly, and thus they are widely used.

Possibly the most costly of the models are Neural Networks. Famed for their high capacity, these models require extensive processing and time to train and can take up a large chunk of storage. But it isn’t all doom and gloom about the resources required for Neural Networks. Because of how they are trained, Neural Networks can take additional data to fine-tune the training at any point, meaning that we don’t need to retrain our models from scratch every time we get new data.

Which Model to Pick?

Ultimately, the answer is not obvious in every case. When choosing which machine learning model to use, we need to consider many different factors: What tradeoff do we want to make between capacity and complexity? What resources do we have available for our training? Do we want to be able to store the model in a compact manner? Do we want to be able to continually use new data without having to retrain the model entirely each time? Finally, we don’t always make the right choice. Sometimes we have to experiment with multiple structures to find one that fits our needs. If you can, train multiple models to help decide which one to devote your attention to.

The post Choosing Machine Learning Models appeared first on Megaputer Intelligence.

What’s the News with Neural Networks?

Chris Farris — Tue, 28 May 2019 18:10:03 +0000

There are many reasons for the explosion of machine learning advancements over the past decade. We now have vastly improved hardware for fast computation, and memory is cheaper than ever. Data is now “Big Data,” and it is both jealously hoarded and publicly available in repositories such as ImageNet. Individually, these advancements are already a blessing for the technology-space. But for artificial intelligence (AI), they have opened the gates for something truly powerful—Neural Networks.

Neural Whatnow?

Neural Networks. You’ve probably heard of them. They are at the forefront of the machine learning craze and are the driver of many of the most impressive advancements. The technology that led machines to beat the best humans in Go and the popular video game StarCraft? Neural Networks. The backbone of algorithms that can recognize images and faces, which are igniting a surveillance and privacy panic? Neural Networks.

And yet, Neural Networks aren’t some new idea spawned from the incubator of a giant tech company. They aren’t some stroke of genius from a college student turned dropout who went on to found a revolutionary tech firm. Neural Networks are in fact… old hat. Or they were.

The concept of a Neural Network (something we will get to later) has been around for years. They date back to the 1970s, and simpler versions of them existed even in the 1940s! So, if they have existed for decades, why are they only popular now?

The answer is related to the hardware and data advancements mentioned earlier. Neural Networks crunch a lot of numbers. They also need a lot of data to help them learn. Up until the past decade, this made training anything but the simplest networks highly time-consuming and expensive.

With all the great improvements to hardware over the years, using more advanced Neural Networks became possible. Aided by hardware demands from the entertainment industry and now cryptominers, GPUs (graphics processing units) have been developed which can calculate specific mathematical operations at lightning speed. Luckily, these same kinds of operations occur in training neural networks. Using the technology originally developed for beautiful visuals in film and video games helps us train networks in a fraction of the time it takes a traditional CPU.

What’s in a Name?

The power of Neural Networks may be evident, but at this point, one may also be wondering what they are in the first place. The name gives some clues. A Neural Network isn’t a singular object that makes decisions by itself. It is, as implied, a network of smaller objects all connected. A network of what, then? We call them Neurons. Neurons as in the cells inside our brains? Yes! Well, no. But sort of!

Neurons in Neural Networks can be thought of as being like the neurons in brains. They are tiny, individual units that are connected to other neurons in a large, structured network. These connections allow tiny pieces of data to flow between them. In our brains, these are electrical pulses. In the Neural Network, we send numbers between Neurons. The Neurons then take all the numbers fed to them by the Neurons they are connected to and process them. The process isn’t complicated—in fact, it is painfully trivial. After all, it is just a tiny unit—a single cell in our brain. But it then sends the processed information out to other Neurons it is connected to. Another number. Another electrical pulse. And so on. Tiny Neurons are fed tiny pieces of information, perform tiny pieces of computation on that snippet, and feed it forward to other Neurons, which do the same over and over until we eventually reach a final set of Neurons—the output of which is our final result.
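The flow described above can be written down in a few lines of Python. The post does not say what computation each Neuron performs, so this sketch assumes the common choice of a weighted sum followed by a sigmoid; the weights and the tiny two-layer `network` are arbitrary values picked for illustration.

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the incoming numbers, squashed by a sigmoid."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def forward(inputs, layers):
    """Feed numbers through each layer in turn; the last layer's outputs are the result."""
    values = inputs
    for weight_rows, biases in layers:
        values = layer(values, weight_rows, biases)
    return values

network = [
    ([[0.5, -0.2], [0.1, 0.8]], [0.0, -0.1]),   # hidden layer: 2 neurons
    ([[1.0, -1.0]], [0.2]),                     # output layer: 1 neuron
]
print(forward([1.0, 2.0], network))
```

Each neuron does almost nothing on its own; whatever the network can do comes from how the weights of many such units are set.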

From the collective effort of many small individual units networked together in a perfectly calibrated balance, we can achieve enormous computational power. The Whole is greater than the Sum of its parts.

Balancing Act

If that all sounded magical and farfetched, then don’t worry—it is. How exactly are we supposed to arrange these Neurons in such a perfect balance that their combined minuscule computations lead to a machine recognizing human faces? We can’t. So how do we get this to work? Well, we are talking about Machine Learning after all. And Machine Learning is how we are going to solve this. We aren’t going to calibrate the Neurons to be in balance. They are going to calibrate themselves.

In order to achieve this, we need training data. We need data that is labeled with the desired output we want from the machine. We can provide this data to the unconfigured Neural Network. It will process the data and likely output something nonsensical and useless. But this is fine. We can use a mathematical function called Error. This Error is just a measurement of how different our output was from our desired targets. Then, using Calculus, we can discover how much of that Error is caused by the calibration of each one of our Neurons! Using this information, we can then tune the Neurons slightly and repeat the process: feed data into the Network, observe the output, calculate the Error, and use Calculus to know how to tune the Neurons to make them more accurate. This process continues over time until we have converged to a calibrated Network. This process of using Error functions, Calculus, and tuning is what Machine Learning is.
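The loop of output, Error, Calculus, and tuning can be sketched for the smallest possible case: a single uncalibrated unit with two parameters. Everything specific here is an assumption for illustration: the target rule y = 2x + 1, the mean squared Error, the learning rate, and the number of steps.

```python
# Labeled training data for a made-up target rule: y = 2x + 1
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]

w, b = 0.0, 0.0          # uncalibrated unit: prediction = w * x + b
learning_rate = 0.05

def mean_squared_error(w, b):
    """Error: average squared difference between outputs and desired targets."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

initial_error = mean_squared_error(w, b)

for step in range(2000):
    # Calculus step: partial derivatives of the Error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Tuning step: nudge each parameter against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3), mean_squared_error(w, b) < initial_error)
```

A real Neural Network repeats the same idea over many more parameters, with the derivatives propagated backwards through the layers.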

Power at a Price

Although it is conceptually simple (at least in this explanation that ignores the details), it is very resource intensive. At the moment I am fine-tuning a Neural Network on my own desktop to recognize Western Art styles. Although the dataset is only a few GB, it takes almost an hour to run a single iteration of the training. It will take nearly two days to complete the 50-iteration learning schedule I planned. And even then, I may need to schedule another one if the Network still needs to learn more! And I’m not running this on some dusty machine I dragged out of the aughts. This is an almost brand-new desktop powered by a 3.7 GHz AMD Ryzen 8-core processor with 32GB of RAM available and virtually no other load on the machine.

Neural Networks are expensive to train. If you want to increase performance, you could pay for an expensive GPU, but that might set you back nearly a thousand dollars. Companies and research institutions may have the funds to throw at this problem, but individuals, small companies, and small research groups may not. Luckily, they don’t need to anymore.

Cloud services have opened access to remote processing to provide other computing options to those desiring to train a Neural Network. Don’t have an expensive rig? No worries, just rent one remotely from Google at a fraction of the price.

Neurally Networked World

Neural Networks are here and they aren’t leaving. New advancements and computing architectures are constantly being published. Convolutional Networks are good at processing images and Recurrent Networks can handle variable-sized data, streaming data, or sequential data. Neural Networks are powerful indeed, far more so than other solutions we have. But they don’t mimic what is really occurring in our brains. There are some deep flaws even in our most advanced networks. For instance, it is extremely easy to confuse a Neural Network. A Network designed to recognize stop signs can be fooled by a few well-placed stickers on a sign. There are deep safety concerns if these are going to be used in self-driving cars, for example.

Neural Networks are deeply dependent on the data used to train them (just as we discussed in a previous article). They are also dependent on how we configure their output. Most networks are designed to give a decision. For object detection, the network must return what it thinks the input image is. But what happens when we feed the network “nothing,” such as a completely blank image or fuzzy static noise? The network is forced to return something so it “sees,” say, a dog in the empty space. This is nonsense. Why would it choose one object over another in these cases where a human would just refuse to rigidly define nonsense? As another example, if we slightly alter an image by inserting some imperceptibly small random noise, we can completely trick a Neural Network. Where it before correctly thought that the image was a butterfly, it now thinks the image is a truck, while a human sees no difference in the images. This is bad, and it reflects deep issues with Neural Networks as we construct them today.
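The “forced to return something” behavior follows from how a typical classification output is read. The sketch below assumes a softmax output layer, which is a common design but is not named in the text above; the labels and the nearly identical scores for a blank image are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that always sum to 1."""
    shifted = [s - max(scores) for s in scores]      # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["dog", "butterfly", "truck"]
scores_for_blank_image = [0.03, 0.01, 0.02]   # hypothetical, nearly indistinguishable scores

probs = softmax(scores_for_blank_image)
best = max(range(len(labels)), key=lambda i: probs[i])
print(labels[best], round(probs[best], 3))
```

Because the probabilities must sum to one and the highest one is reported, the network names a class even when it has essentially no preference; there is no built-in way to answer “none of the above.”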

Neural Networks are occupying a liminal space. They are simultaneously scarily powerful and laughably simple and ignorant. They can best the human masters and yet be duped by the smallest of changes. We won’t be seeing Neural Networks achieve human-like sentience any time soon, and they aren’t ready for deployment in many other types of systems. But they are already being used in ways that should cause alarm.

China is already using facial recognition technology to tag members of the Uighur ethnic minority group. Accurate voice recognition ensures individuals could potentially be tracked even if they are not near a camera by turning the phones in our pockets into monitoring devices. “Deep Fakes” are a growing type of video that can modify existing videos to map one person’s face and voice over another’s to create a fake video that could be used for blackmail or disinformation. While Terminator remains science-fantasy, armed military drones using neural networks for stabilization, navigation, targeting, and tactics would revolutionize armed conflicts in ways impossible to predict.

Though neural networks are being used to oppress in some places, they are also being used to save in others. Medical facilities increasingly deploy network systems to detect ailments such as cancer or infectious diseases. Laboratories can use similar networks for modeling complex biomolecules and developing treatments. Neural networks are even being used in traffic light control systems to increase vehicle flow and reduce accidents.

Where will Neural Networks take us next? It’s hard to say. But it seems evident that the world is caught and will remain in a web of Neurons.

The post What’s the News with Neural Networks? appeared first on Megaputer Intelligence.

An introduction to machine learning

Chris Farris — Fri, 15 Feb 2019 18:58:52 +0000

Machine Learning is hot right now. Really hot. And it’s come a long way from where it was over two decades ago. In 1997, IBM’s Deep Blue defeated world chess champion Garry Kasparov. This was a great achievement to mark the progress of the field, but chess is a relatively simple task and computers still struggled to master more difficult tasks for another decade. Then, in the late aughts, machine learning started to boom like never before. In 2011, IBM’s Watson utilized live simultaneous natural language processing with information retrieval to defeat two Jeopardy! champions. In 2016, Google’s AlphaGo defeated Lee Sedol, who is claimed by many to be the strongest Go player in history. Despite the simplicity of Go’s rules, the game is extraordinarily complex. For perspective, there are more possible configurations of a Go board than there are atoms in the universe, and by this measure, Go is a googol times more complicated than chess. To really appreciate the magnitude of this, consider that a googol (10¹⁰⁰) is written as one followed by one hundred zeros, and the ratio of an electron (a sub-atomic particle) to the entire known universe is only 0.00000001% of a googol. It is a mind-bogglingly large number beyond human comprehension. And in a handful of years, machine learning has advanced by that scale.
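The board-versus-atoms comparison can be checked with a little arithmetic. The sketch below uses a simple upper bound (each of the 361 points is empty, black, or white), which also counts arrangements that are not legal positions, and it assumes the commonly quoted rough figure of 10⁸⁰ atoms in the observable universe.

```python
board_points = 19 * 19                 # 361 intersections on a standard Go board
upper_bound = 3 ** board_points        # each point is empty, black, or white
atoms_estimate = 10 ** 80              # rough, commonly quoted order of magnitude (assumption)

print(len(str(upper_bound)))           # number of decimal digits in the bound
print(upper_bound > atoms_estimate)
```

Python integers have arbitrary precision, so the full number can be computed exactly rather than approximated.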

Recently, a program called AlphaStar learned virtually on its own how to play and master the wildly popular and challenging competitive real-time strategy video game StarCraft II. StarCraft II is so competitive that human players actually perform physical exercises with their fingers so that their reflexes are honed in order to strike the right keys as fast as possible. While Go merely involves placing pebbles on a grid, StarCraft II involves economic resource management, strategic combined arms combat, exploration, extensive future planning, processing streams of real-time information, and making rapid decisions.

Applications of Machine Learning

Machine Learning often gets flashy coverage when it reaches milestones in games, but it has slowly crept into our daily lives in ways we may not think about. When we post pictures on social media platforms like Facebook, our faces are instantly recognized and the names of people to tag are suggested. When we watch movies or shows on platforms like Netflix, we get an endless stream of recommendations based on our past behavior. This same structure exists on online shopping sites such as Amazon. And it’s common for Apple’s Siri, Amazon’s Alexa, and Google’s Assistant to live in our homes and understand our requests.

Machine Learning is here to stay. And with Machine Learning technology sprinting into the future, everyone wants in. Companies are rapidly creating or bolstering existing analytics departments to capitalize on the craze. Unfortunately, like many hyped technologies, “Machine Learning” as a phrase can devolve into jargon, buzzwords, oversimplifications, misnomers, and a lot of confusion. To many, Machine Learning is a mystery that seems indistinguishable from sorcery.

What is Machine Learning?

Machine Learning, like Artificial Intelligence, suffers from a name that tends to get bogged down in philosophy and pedantry. What does it mean to “learn?” What does it mean to hold “intelligence?” There is a good deal of discussion amongst computer scientists and others about this, but for our purpose it is just a rabbit hole. In fact, it is best to throw out our human conceptions of what “learn” and “intelligence” mean because they only add a biased expectation.

So, then, what is Machine Learning? In its simplest form, Machine Learning is essentially making a computer learn a complicated task by having the computer teach itself. We merely provide the computer some examples of how to do something, and the computer learns from those examples to help us fill in the holes and solve complex problems.

Now let’s break this process down. First, it is useful to identify our goals. Almost always, we would like to classify (assign discrete labels), perform regression (predict numerical values) on, or cluster (group similar things together) our data. For example, we could classify faces by whether they are smiling or not. We could perform regression on weather data to predict tomorrow’s temperature. We could cluster stars together by grouping them based on how hot and bright they are.

Now that we have a goal in mind, the question is by what method can we achieve this task? This is the “model.” Think of a model as a machine that you feed data into and which spits out classes, a regression, or clusters of that data. How do we build the model? Before Machine Learning, you would have to manually figure this out. This would involve having to draw specific blueprints for the machine, figure out the appropriate size of gears and their exact positions, assemble the machine, and test it.

In this analogy, Machine Learning is like building a special type of machine. This machine has gears that can change size and position. Also, if you provide the machine with the desired output, it can run data through itself and check the resulting output against this desired output. If they don’t match, the machine knows how to change the size and position of the gears to make it more likely to be correct in the future. And it can repeat this over and over again until it has found the best arrangement of gears. This is essentially Machine Learning, and all it took was having a special device and labeled data it could look at. No human was needed to tell it where to place the gears.

This is just an analogy, but it works very similarly in our digital space. We construct a digital model that is parameterized by certain variables. We feed training data labeled by the desired output into the model and see the actual output. We then compare the actual output to the labels, which is our desired output. We note the error and use math to compute in what direction and by how much we should tweak the parameters of the model. And repeat. The machine doesn’t make decisions or figure out concepts on its own. It is programmed to use Calculus, Information Theory, Probability, and Statistics to calculate numbers to fudge the parameters by.

In the end, Machine Learning is not what most humans consider “learning.” It is driven by math, algorithms, and data. For those who aren’t fascinated by the math and theory of Machine Learning, this can destroy the mystique somewhat. Personally, I find that although the Man Behind the Curtain is not what we might expect or desire, he is intriguing all the same. In fact, as computer scientists and cognitive scientists study Machine Learning more, we start to suspect that there isn’t as much of a disconnect as we might think. We may believe that Machine Learning is utterly unlike humanity, but it may be the case that humanity is more like Machine Learning than we realize. Our own brains may be just such a cold math-algorithm-data amalgam, but on a massive and complex scale.

The Importance of Data

Data is a pillar of Machine Learning. Without data we cannot build anything. The math doesn’t change, and while we can build clever model structures to help with the learning process, without good data our models will fail. Luckily, we live in an age of data. Large amounts of data allow models to be extremely fine-tuned. They also allow for advanced model structures like Deep Neural Networks. Publicly available data sets such as MNIST and ImageNet for image processing have allowed data scientists to easily compare models and share knowledge. Data is revolutionizing entire fields, and Machine Learning is no exception.

But with great power comes great responsibility. If our data is not labeled well or if it is biased from how it is collected, that error or bias is directly injected into the model. After all, the model is simply trying to mimic the data used to create it. Bad data in, bad model out.

Biased data can lead to very poor model performance. For example, last year MIT Media Lab showed that facial recognition systems from Microsoft, IBM, and Face++ weren’t very good at identifying women or people with darker skin, most likely because they didn’t have enough examples of those types of faces in their training data.

So big data is great for our ability to model the world. But the purity and completeness of the data we use in Machine Learning should always be considered when building a model.

The Future

Well, Machine Learning is the future. Tasks that we do effortlessly as humans, such as processing sound and sight, are extraordinarily complicated. Without Machine Learning it would take monumental effort to create machines to do them effectively, and for the tasks that even we as humans find hard, machines would be hopeless. We live in the age of data and increasingly cheap computational power. Machine Learning is thriving and adapting. If you are interested in Machine Learning, check out this video for more insights.

And stay tuned for future blog posts on Machine Learning as we examine different applications and different models used.

The post An introduction to machine learning appeared first on Megaputer Intelligence.

Challenges in human name detection

Margaret Glide — Tue, 22 May 2018 20:42:39 +0000

When working with big data, it is often possible to automate the slow, mind-numbing information extraction tasks, for precisely the reason that they require little creative energy. One such automatable process is detecting, extracting, and resolving human names from text data. For example, an insurance company may want to automatically extract from call center notes the names of all parties involved in an accident. However, they process hundreds of thousands of records each week. While a person may look at a single text and easily recognize names, it can be time-consuming or nearly impossible to examine each record individually and compile them into a suitable output.

Using algorithms that people apply unthinkingly to detect names, we can teach software to do this time-consuming work for us.

Some approaches and challenges – Building name dictionaries

The simplest approach to detecting names in big data is to build dictionaries, or lists, of common first names and surnames. As humans, we store prototypes of common names based on their frequency. That is, more frequent names will be more easily recognized with better accuracy, both cognitively and in a name detection algorithm. However, using the dictionary approach alone can limit how many names a name detection algorithm can identify. It does not account for the many innovative and creative first names being coined each day (e.g., Blue Ivy, Jennyfyr, etc.), and the numerous possibilities for words and non-words that can represent last names (e.g., Glide, Nowak, etc.). Additionally, human bias may lead to far less coverage for common foreign names if they are not accounted for in our name dictionary (e.g., Kostas, Eleni).

Rule-based approaches

Often it is more efficient to define abstract rules for common name patterns, rather than creating an infinite list of possible names. In this approach, we can look for other context clues to identify an item as a name. Let’s consider the simplest pattern for a human name in the example below. We can quickly and easily interpret the example as a name and even know which portion is the first or last name based on a few abstract rules. Human readers will notice that there are two items following one another and that they are both capitalized. We also know that the gold item is a very common first name, John, and that it is in most common first name dictionaries. When we see the following blue item, Doe, we know that it does not refer to a female deer because it is also capitalized.

Utilizing this heuristic, we can build a very simple name detection algorithm by searching for two to three capitalized words in a row. Accuracy can be improved by adding known information, such as a list of common first names or surnames. By defining just one simple pattern, we can automate the discovery of previously unknown last names, as in the following example.

Here we can recognize the unknown word, Brouwer, as a last name because it is capitalized and it follows a common first name.

Of course, this simple rule will need to be improved to account for other patterns and to filter out erroneous results. We should limit our results by banning some words with a stop list, that is, a list of frequent words that are not typically names (e.g., verbs like is and was). Additionally, we must account for less simple name patterns and other challenges in detecting human names, which require more rules and/or more advanced approaches to automation.
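To make the heuristic concrete, here is a minimal Python sketch. It is a stricter variant of the rule described above: it only accepts a run of two to three capitalized words when the run starts with a known first name. The dictionaries, the function names, and the sample sentences are hypothetical and exist only for illustration.

```python
import re

# Hypothetical mini-dictionaries, for illustration only.
FIRST_NAMES = {"John", "Mary", "Anna"}
STOP_WORDS = {"is", "was", "the", "a", "an", "and"}

# Split text into alphabetic words and single punctuation marks.
# ASCII letters only, to keep the sketch short; accented names need a wider pattern.
TOKEN = re.compile(r"[A-Za-z]+|[^\sA-Za-z]")

def is_name_part(token):
    """True for a capitalized word that is not on the stop list."""
    return token[0].isupper() and token.lower() not in STOP_WORDS

def find_names(text):
    """Return runs of two to three capitalized words that start with a known first name."""
    tokens = TOKEN.findall(text)
    names = []
    i = 0
    while i < len(tokens):
        if tokens[i] in FIRST_NAMES:
            j = i + 1
            # Extend the run by up to two more capitalized, non-stop-list words.
            while j < len(tokens) and j - i < 3 and is_name_part(tokens[j]):
                j += 1
            if j - i >= 2:
                names.append(" ".join(tokens[i:j]))
                i = j
                continue
        i += 1
    return names

print(find_names("Yesterday Anna Brouwer called the office."))  # ['Anna Brouwer']
print(find_names("John Was late, but John Doe was not."))       # ['John Doe']
```

In the first sentence the sketch picks up Brouwer as a surname even though it appears in neither dictionary, simply because it is capitalized and follows a known first name. In the second, the stop list keeps “John Was” from being reported as a name.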

One such challenge is defining contexts for first and last names that are real words in English. Say, for example, we had “Hope” in our first name dictionary. It would capture this word as a first name in the following two records:

However, as human readers we know that Hope, as a person’s name, will never be a verb, as it is in the first sentence. Similarly, hope, in the sense of expectation or desire, does not typically drive a blue Chevy. So we can define the expected context of the name Hope as a noun that is capable of actions typical of people (e.g., driving cars or thinking). Conversely, we can limit the name Hope to a non-verb.

Adjusting name patterns according to culture

There are many cultural differences in naming conventions that can affect the interpretation of a person’s name. It cannot be assumed that names from all cultures will follow the same pattern. While names across cultures can be composed of a few simple components, they can have vastly different interpretations according to culture of origin.

Consider the contrast in the example below. A common pattern in the United States consists of three components in the following order: a first name, an optional middle name or middle initial, and a last name. However, in Hispanic cultures in the United States and abroad, the same number of components can yield a different underlying structure: a first name and two last names. It would be erroneous to assume that González in this example is a middle name.

In the example below, we can observe that the order of these components is not guaranteed across cultures, either. In Korean names, as in names from many other Asian cultures, it is common for the surname to precede the first name.

To avoid misinterpreting someone’s surname or first name, it is best to customize an appropriate algorithm for that culture’s naming conventions.
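As a rough sketch of what such customization could look like, the Python function below splits the same kind of string differently depending on a declared naming convention. The convention labels, the function name, and the sample names are hypothetical, and real naming conventions are far richer than these three cases.

```python
def parse_name(full_name, convention):
    """Split a name into first, middle, and last parts under a given convention."""
    parts = full_name.split()
    if len(parts) < 2:
        raise ValueError("expected at least two name components")

    if convention == "given_middle_family":
        # First name, optional middle name(s) or initial, last name.
        return {"first": parts[0], "middle": " ".join(parts[1:-1]), "last": parts[-1]}

    if convention == "given_two_family":
        # First name(s) followed by two surnames (one surname if only two parts).
        split = max(1, len(parts) - 2)
        return {"first": " ".join(parts[:split]), "middle": "", "last": " ".join(parts[split:])}

    if convention == "family_given":
        # Surname comes first; everything after it is the given name.
        return {"first": " ".join(parts[1:]), "middle": "", "last": parts[0]}

    raise ValueError("unknown convention: " + convention)

# The same three components yield two different structures.
print(parse_name("Ana González Pérez", "given_middle_family"))
print(parse_name("Ana González Pérez", "given_two_family"))
```

Under the first convention González is read as a middle name, while under the second it is part of the surname. The sketch does not try to guess the convention from the name itself, which is the harder problem.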

Detecting names automatically in PolyAnalyst

The challenges of name detection illustrated here represent only a small fraction of possible issues one may encounter when designing their own name detection algorithm. Even more challenges can arise with messy data or when there are further conflicts with other proper name entities (e.g., company names). Stay tuned for further blog posts on more advanced approaches to human name detection!

Learn more about PolyAnalyst.

The post Challenges in human name detection appeared first on Megaputer Intelligence.

What is sentiment analysis?

Elli Bourlai — Fri, 18 May 2018 20:07:46 +0000

Sentiment Analysis is another buzzword that we hear more and more often but may not entirely understand, either what it does or how to use it. Applying Sentiment Analysis in business benefits both companies and customers: it helps improve products and services, identify the strengths and weaknesses of the competition, and target advertising.

Sentiment Analysis, also known as Sentiment Detection or Opinion Mining, is the classification of the sentiment polarity in a given text that may express people’s opinions, appraisals, attitudes, and emotions toward entities, individuals, issues, events, topics and their attributes¹. Put simply, it answers the question “how does the writer feel about this topic?” with positive or negative, and sometimes neutral.

In other words, Sentiment Analysis identifies words or patterns in language that carry positive, negative, or neutral sentiment. As humans, we interpret sentiment based on not only our knowledge of the language, but also social context. Computers are excellent at identifying linguistic patterns quickly, but they face accuracy challenges when the interpretation of sentiment is heavily dependent on context. For example, humans understand easily that the word cheap is positive when referring to price, but negative when referring to quality. There may also be cases where a word may simply state a fact in one domain, but carry sentiment in another. The word “red” in a hotel review (“the wall had a picture with red flowers”) simply states a fact, whereas in a technology review the word “red”, as in “my picture printed red”, would be negative. Consequently, computers need to be taught how to take into account this kind of context when classifying sentiment polarity. This is why many Sentiment Analysis tools have recently been taking the form of domain-specific or topic-specific Sentiment Analysis, to improve accuracy by limiting the scope to a known context.
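One simple way to picture domain-specific Sentiment Analysis is a general word list with per-domain overrides. The Python sketch below is only an illustration under that assumption: the lexicon entries, the domain labels, and the function name are hypothetical, and real tools rely on much more than word lookup.

```python
import re

# Hypothetical lexicons: +1 is positive, -1 is negative, 0 carries no sentiment.
GENERAL_LEXICON = {"great": 1, "terrible": -1}
DOMAIN_OVERRIDES = {
    "hotel": {"red": 0},     # a color in a room description is just a fact
    "printer": {"red": -1},  # a print that comes out red signals a defect
}

def polarity(text, domain):
    """Classify text as positive, negative, or neutral for a given domain."""
    # Domain entries take precedence over the general lexicon.
    lexicon = {**GENERAL_LEXICON, **DOMAIN_OVERRIDES.get(domain, {})}
    words = re.findall(r"[a-z']+", text.lower())
    total = sum(lexicon.get(word, 0) for word in words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(polarity("The wall had a picture with red flowers", "hotel"))  # neutral
print(polarity("My picture printed red", "printer"))                 # negative
```

Limiting the scope to a known domain is what lets the same word be scored differently without any deeper understanding of the sentence.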

Degrees of sentiment

In addition to polarity, we may also want to know how strongly negative or positive the opinion is by using degrees of sentiment. Some sentiment words or phrases carry stronger feelings. For example, “the movie was not that good” is less negative than “the movie was terrible”. Including the degrees of sentiment polarity in our analysis may offer better insights for decision making.
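Degrees of sentiment can be sketched by giving words graded scores instead of a plain positive or negative label. In the Python example below, the scores, the negation rule (flip the sign and halve the strength), and the two-word look-back window are all assumptions made for illustration, not an established scoring scheme.

```python
# Hypothetical graded scores: larger magnitude means stronger feeling.
SCORES = {"good": 2.0, "great": 3.0, "bad": -2.0, "terrible": -3.0}
NEGATORS = {"not", "never"}

def graded_sentiment(text, window=2):
    """Sum graded word scores, softening and flipping words that follow a negator."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in SCORES:
            continue
        score = SCORES[token]
        # Look back a few words for a negator such as "not".
        if any(t in NEGATORS for t in tokens[max(0, i - window):i]):
            score = -0.5 * score  # assumed rule: flip the sign, halve the strength
        total += score
    return total

print(graded_sentiment("the movie was not that good"))  # -1.0
print(graded_sentiment("the movie was terrible"))       # -3.0
```

Both sentences come out negative, but the first scores closer to zero, which matches the intuition that it is the milder complaint.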

Types of sentiment

Another aspect we may be interested in is the type of sentiment: joy, sadness, disgust, disappointment, surprise, fear, anticipation, trust, etc. For example, the sentence “I expected faster download speeds”, is not only negative, but also carries the sentiment of disappointment. Another example is the sentence “I’m concerned that the next Avengers film will not include my favorite Marvel superhero”, where the sentiment is negative and we can identify the emotion of fear. Similarly, in the positive sentence “I can’t wait to try out the new iPhone!”, we also have a sense of anticipation.

Where is Sentiment Analysis used?

With the appearance of social media, Sentiment Analysis has become a very popular way of mining people’s public opinions. Sentiment Analysis has long been used to analyze the answers to open-ended survey questions; however, this data is usually limited and requires considerable effort to collect. By contrast, millions of people share their opinions freely on social media platforms, and we can harness this public data to inform intelligent business decisions with Sentiment Analysis.

Company Product or Service Evaluation

A lot of companies use Sentiment Analysis for extracting and understanding their customers’ opinions on their products or services. Reviews on websites, forums, and social media provide helpful insights on the strengths and limitations of a company’s products and/or services. Those insights may be used to discover possible issues that need improvement, as domain-specific Sentiment Analysis may automatically categorize positive and negative reviews into relevant topics and sub-topics. Imagine the benefit to be gained from the quick summarization of millions of customer experiences, where customer feelings about topics like “product quality” can be tracked in real time. Let’s take a look at the following review:

A domain-specific Sentiment Analysis tool would identify that the customer has positive feelings toward certain aspects of the product, but not about its size.

Competitor Intelligence

While knowing the strengths and limitations of a company’s own products and services is helpful, it is also important to understand the status of its competitors. The company may wish to incorporate features that customers appreciate in a competitor’s products or services, or avoid those that the competitor’s customers do not enjoy. Below, a company’s printer is compared to a similar printer from a competitor (Printer X). The company’s printer is preferred for its easy setup and interface, but one of the competitor’s printers has the added feature of a rear feed.

Market Intelligence

If an organization hosts products or services from other companies, it is useful to know what people enjoy and what needs improvement. This information can help a company decide whether to keep offering a product or service, as well as which products or services to promote. Similarly, if a company sells its products through other companies, it may analyze customer reviews to identify issues with a seller or to find new vendors for its products. The negative feelings associated with the brand in the following tweet indicate a possible decrease in sales; a company selling this brand may need to take this into account.

Another example is the tweet below:

The customer is dissatisfied with the selection of the products from a brand in a specific store; unless the brand addresses this issue with the store, it could have implications for their sales.

Advertising

Many current advertising strategies are based on the inclusion of people’s reviews of products or services, also known as customer testimonials. Sentiment Analysis allows companies to identify positive reviews they wish to showcase among the millions available, in order to gain customers’ trust. Similarly, they may wish to showcase their superiority over their competitors by identifying negative trends in the competitors’ products and services. For example, Gusto retweets positive customer reviews on its Twitter account and also showcases some of them on its website as an advertising strategy.

The post What is sentiment analysis? appeared first on Megaputer Intelligence.