Data Analytics – Megaputer Intelligence

Query languages—the Swiss army knife of information extraction

Echo Lu — Tue, 06 Feb 2024 05:19:41 +0000

Text mining, the art of extracting information from text, requires the formulation of efficient queries that retrieve information based on user input. To do this, the user requires a language for writing queries. For the most basic use cases, the language operators could be regex or string search. But while regex and string search are indispensable for text mining, their utility hits a hard ceiling when semantic or meaningful search is required. They cannot, for example, capture complex entities such as human names, corporations, and drugs. For handling tasks like these, we need a more powerful query language that has semantic understanding, such as Megaputer’s PDL.

So, what is PDL, and how does it achieve semantic understanding while regex and string search do not? Let’s take a look at an example to find out.

First of all, PDL does not just search for the literal form of the word in the query: instead, it automatically extends its search to all morphological forms of the word. For example, when searching for the word “company” in financial news articles, the PDL query will not only find “company”, but also “companies,” the plural form. This feature often comes in handy, especially when the search involves a verb. Suppose that you are interested in extracting what the CEOs said. With regex or other substring search, you will need to list all possible verb forms such as “say,” “saying,” “says,” and “said.” With PDL, simply entering “say” in the query will automatically fetch all possible verb forms. This behavior can also be turned off by enclosing the word in the form function, which will then restrict the search to the literal form of the word, such as in the example below.
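To make the regex side of this comparison concrete, here is a minimal Python sketch (it is not PDL, and it is separate from the PDL example referenced above). The pattern and the sample sentence are illustrative assumptions. Note that every verb form has to be spelled out by hand.

```python
import re

# Every inflected form must be listed explicitly; nothing is inferred from "say".
SAY_FORMS = re.compile(r"\b(say|says|said|saying)\b", re.IGNORECASE)


def find_say_forms(text):
    """Return each matched form of "say", lowercased, in order of appearance."""
    return [match.group(1).lower() for match in SAY_FORMS.finditer(text)]


# Hypothetical sample text, made up for illustration.
sample = "The CEO said profits rose. Analysts say otherwise, saying costs grew."
print(find_say_forms(sample))
```

Any form missing from the alternation (for example an irregular past tense of another verb) is silently skipped, which is the maintenance burden the paragraph above describes.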

Another notable feature of the PDL language is its capability for users to tailor the scope of their searches using a range of built-in functions. Returning to the previous example, you may not wish to confine your search exclusively to the specific verb “say,” but rather include other synonymous verbs like “tell” or “mention.” Achieving this is straightforward with PDL – users can invoke the synonym function with the verb “say,” as demonstrated in (a) below. As the subsequent results table (b) illustrates, the captured text now includes various speech verbs such as “tell,” “emphasize,” and “claim,” in addition to the word “say,” capturing them in all possible verb forms. For additional flexibility, the user can also create and modify synonym dictionaries.

The PDL language offers various modes of information extraction, including proximity search (e.g., finding words A and B within a sentence, or within a 3-word range), syntactic relation (e.g., finding word A that is the subject or object of B), semantic relation (e.g., finding words that are synonyms/antonyms to word A), access to dictionaries and ontologies, and more. This language is expressive enough to capture complex patterns, and yet relatively easy to use, having a syntax that closely resembles English. Having access to this versatile query language significantly enhances the power and quality of text mining operations.
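As a rough illustration of what a proximity search does, the following Python sketch checks whether two words occur within a given number of word positions of each other. This is not PDL syntax; the function name, the tokenizer, and the sample sentence are assumptions, and the matching is purely literal (no morphology or synonyms).

```python
import re


def within_n_words(text, word_a, word_b, max_gap):
    """Return True if word_a and word_b occur at most max_gap word positions apart."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positions_a = [i for i, token in enumerate(tokens) if token == word_a]
    positions_b = [i for i, token in enumerate(tokens) if token == word_b]
    return any(abs(a - b) <= max_gap for a in positions_a for b in positions_b)


# Hypothetical sentence: "company" and "revenue" are four positions apart.
sentence = "The company said its quarterly revenue rose sharply."
print(within_n_words(sentence, "company", "revenue", 3))
print(within_n_words(sentence, "company", "revenue", 5))
```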

In conclusion, PDL is a powerful and versatile query language that enables users to extract meaningful information from text with greater efficiency and accuracy than competing methods like regex or string search. Its ability to understand and capture morphological forms, synonyms, and other complex patterns makes it an indispensable tool for solving text mining tasks that require semantic understanding. By leveraging the capabilities of PDL, users can enhance their information extraction processes and gain valuable insights from their data, making it a true Swiss army knife of information extraction.

The post Query languages—the Swiss army knife of information extraction appeared first on Megaputer Intelligence.

What is Time Series Analysis?

Chris Farris — Wed, 15 Jul 2020 19:46:19 +0000

Many types of models are created from individual samples of data. We may have conducted a study, collected vital signs from different patients, and are now fitting some predictive model. These patients are unique and independent cases, which collectively make up a trend. However, many kinds of systems are not collections of individual data points; they are a sequence of observations of one concept over time. The fluctuations in the stock market, global temperatures, oil sales, activity of solar flares, and so on are not baskets of different observations, but rather they are sequences of the same object through time. Thus, we would like to model how this object changes over time. When we consider time and past observations of a variable to model how it behaves and make forecasts, this is called a time series analysis.

As the name implies, time series models are used to predict what will happen in the future. The goal is to learn patterns from the past and use those for forecasting. These models are used in many applications such as predicting the weather and predicting economic activity for investors. There are many kinds of time series analysis, some more complicated than others, but today we will examine a relatively straightforward yet powerful one: ARIMA. ARIMA, or Autoregressive Integrated Moving Average, is built out of several components. Let’s tackle the Autoregressive portion first.

Autoregression

The concept of autoregression is simple: What happens today is strongly correlated with what happened yesterday, mildly correlated with what happened last week, weakly correlated with what happened last month, and so on. Essentially, we would like to use past values of our variable to predict future ones. That is, at a given time, t, our value at that time is modeled as a constant, c, plus a sum of weighted past values going back p timestamps, plus some Gaussian noise (error):

$$X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t$$

Here the $\varphi_i$ are the weights on the past values and $\varepsilon_t$ is the noise term.

This equation may look familiar. It is just a linear regression using past values of our variable as the input variables. This is where the “auto” part of autoregression comes from. We are regressing a variable on itself. And we can easily fit this model using Ordinary Least Squares; we just need to select the parameter p, which is how far back we want to look.
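As a minimal sketch of that fitting step, the pure-Python code below builds the lagged design matrix and solves the least-squares normal equations. The function names, the simulated series, and the chosen coefficients are all assumptions made for illustration; a real project would normally rely on a statistics package instead.

```python
import random


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is not modified.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_ar(series, p):
    """Return [c, phi_1, ..., phi_p] estimated by ordinary least squares."""
    # Each design row is [1, x_{t-1}, ..., x_{t-p}]; its target is x_t.
    rows = [[1.0] + [series[t - i] for i in range(1, p + 1)]
            for t in range(p, len(series))]
    targets = series[p:]
    size = p + 1
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(size)]
    return solve_linear_system(xtx, xty)


# Simulated series with assumed values c=1.0, phi_1=0.5, phi_2=-0.3.
rng = random.Random(0)
series = [0.0, 0.0]
for _ in range(500):
    series.append(1.0 + 0.5 * series[-1] - 0.3 * series[-2] + rng.gauss(0, 1))

print(fit_ar(series, p=2))  # estimates should land near the assumed values
```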

Moving Averages

Autoregression is fairly intuitive. Moving averages, however, are a bit trickier to understand. In a moving average model, we don’t use the actual values of our past observations, but rather the errors they accrued. More formally, a moving average is another type of linear regression but instead of using past observations of data, we use their errors.

The idea itself is not complicated, but it can be hard to find helpful analogies for it. A good example is to consider a human trying to throw a stone as far as possible. They first throw the stone and have some deviation or “noise” or “error” from their typical throw. Then they throw again. This second throw may have a deviation from the typical throw related to the deviation in their previous throw. Maybe a good first throw gives them confidence and they are more likely to continue that success or maybe a poor throw dejects them and makes them more likely to have another poor throw. With moving averages, there is momentum in these deviations.
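To see that momentum in numbers, here is a small sketch of a first-order moving average process, commonly written as $x_t = \mu + \varepsilon_t + \theta \varepsilon_{t-1}$. The symbols, the function name, and the input values are assumptions chosen for illustration.

```python
def simulate_ma1(errors, mu, theta):
    """Build x_t = mu + e_t + theta * e_{t-1}, treating the error before the first step as 0."""
    values = []
    previous_error = 0.0
    for error in errors:
        values.append(mu + error + theta * previous_error)
        previous_error = error
    return values


# A single unit "good throw" at the first step, then no further noise.
print(simulate_ma1([1.0, 0.0, 0.0, 0.0], mu=10.0, theta=0.5))
# prints [11.0, 10.5, 10.0, 10.0]
```

The deviation at the first step carries into the second value at half strength and then disappears, which is the short-lived momentum described in the stone-throwing analogy.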

Integrated

We can combine the two models described above to form a model called an ARMA (Autoregressive Moving Average) or even an ARIMA. Unfortunately, the “integrated” portion of ARIMA is a bit too technical in detail to discuss here in depth. Basically, it refers to the fact that an ARIMA system is something that is an ARMA system when we take the difference between successive times. Or in the case of a continuous time series, we take the derivative. It basically means that we can consider if we should be modeling the data directly or if we should model the rate of change of the data and then accumulate that change.
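The differencing idea can be sketched in a few lines of Python: we take the change between successive observations, and we can recover the original levels by accumulating those changes from the starting value. The function names and the numbers are made up for illustration.

```python
def difference(series):
    """Return the change between each pair of successive observations."""
    return [current - previous for previous, current in zip(series, series[1:])]


def integrate(start, changes):
    """Rebuild the original levels by accumulating changes from a starting value."""
    values = [start]
    for change in changes:
        values.append(values[-1] + change)
    return values


levels = [100, 103, 107, 112, 118]    # hypothetical observations
changes = difference(levels)
print(changes)                        # prints [3, 4, 5, 6]
print(integrate(levels[0], changes))  # prints [100, 103, 107, 112, 118]
```

In this picture, an ARMA model would be fitted to `changes`, and its forecasts would be accumulated back into levels.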

Seasonality

ARIMA models are good but they often fall short when data exhibits seasonal behavior. For example, consider temperatures over a period of years.

There is a strong cyclical pattern as summer is hotter than winter. Thus, the data may be strongly correlated with the direct past, but it is also strongly correlated with some fixed periodic interval. The temperatures today depend on what they were last week, very little on what they were six months ago, and moderately on what they were a year ago. We can include autoregressive and moving average terms that, instead of looking back several timestamps, look back several periods of the seasonal structure. ARIMA combined with this seasonal model is called Seasonal ARIMA or SARIMA.
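One simple way to picture "looking back one period" is seasonal differencing: subtracting the value from the same point in the previous cycle. The sketch below uses invented quarterly numbers and an assumed function name.

```python
def seasonal_difference(series, period):
    """Subtract from each value the value observed one full period earlier."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Hypothetical quarterly readings: a repeating seasonal shape plus a rise of 1 per year.
quarterly = [5, 15, 25, 10, 6, 16, 26, 11, 7, 17, 27, 12]
print(seasonal_difference(quarterly, period=4))
# prints [1, 1, 1, 1, 1, 1, 1, 1]
```

The large seasonal swings cancel out, leaving only the steady year-over-year change for the rest of the model to explain.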

Using ARIMA Models

Training an ARIMA model is fast and easy. The hard part is choosing the hyperparameters: how far back to look and whether there is seasonality. However, as with traditional linear regression, there are several statistical tools to help make that decision. We can use autocorrelation and partial autocorrelation correlograms to help visualize where to make these cutoffs. When fitting a model, we can view diagnostics like the distribution of our errors and measures like AIC and BIC to compare performance. Once we have selected a model, we can use it to make forecasts and, through either statistical calculations or simulation, create a band of likelihood around them. Our predictions, as in any regression, will almost never be exactly right. It is unlikely that we get exactly the right price of a stock or the exact temperature. However, we can create confidence intervals, which are regions in which we are 95% (or some other level) sure the future values will fall. All of this is achieved simply by looking at the past values of our desired variable. Plus, time series models can be made more powerful by using independent input variables in the regression as well.
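The autocorrelation values behind a correlogram are straightforward to compute. The sketch below uses the common sample estimator (lagged covariance divided by total variance); the function name and the repeating series are assumptions for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance_sum / variance_sum


# Hypothetical series with a repeating four-step seasonal pattern.
seasonal = [5, 15, 25, 10] * 6
for lag in range(1, 5):
    print(lag, round(autocorrelation(seasonal, lag), 3))
```

A spike at the seasonal lag (here, lag 4) is the kind of visual cue that suggests adding seasonal terms to the model.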

ARIMA models are another example of an excellent application of statistics. They are:

Formalized
Easy to understand
Predictable
Interpretable
Successful in practice

When conducting any sort of temporal analysis, ARIMA is likely the first thing to try. Request a free trial of PolyAnalyst to try it out yourself.

5 Things to Know about Linear Regression

Chris Farris — Wed, 29 Apr 2020 18:32:43 +0000

We have recently discussed advanced machine learning techniques such as neural networks and support vector machines, but these are not always the most appropriate tools for modeling data. Machine learning models are often big, complicated, and almost impossible to interpret. While they have great capacity and are sometimes the only solution to difficult problems, their downsides can be substantial depending on what our goals are. Additionally, it is often the case that we want to understand our models or have some measures of how valid they are.

For example, the goal of a physicist may be to understand their model of a particle interaction. If they create a neural network to do this, we may have low error in the system but our human understanding of the laws of nature is no better. Likewise, an economist may be more concerned with understanding the general relationship between political uncertainty in elections, uncertainty in public policy, and uncertainty in financial markets, than creating a complex interactive system that prevents generalizations that could be applied to governance, electioneering, or market trading. So, for the time being, let us return from the vast jungles of machine learning to the tamed and comfortable community of traditional statistical models, the first of which will be linear regression.

What is Linear Regression?

Linear regression is one of the simplest statistical models, but don’t let that fool you. It is a powerful tool because of that simplicity. Understanding linear regression starts with the name. A regression analysis is simply a method of estimating the relationship between a dependent variable and a set of independent variables. For example, given a set of data about childhood measurements like height, weight, gender, and so on, can we estimate the relationship between those variables and the dependent variable, which is the child’s height as an adult?

Why is it called “regression?” Mostly from tradition. The term originated in the late nineteenth century with Francis Galton’s studies of inherited height and has stuck. You may have heard of the concept of “regression to the mean.” The concept is that data may deviate from an expected “average” value, but over multiple observations, it will revert or “regress” to this expected value. For example, a tall person is likely to have children that are also tall, but probably less so. They will likely “regress” towards the average height rather than being taller than their parent.

Note that regression is modeling a numeric relationship. The output of the regression is a number. This is different from something like categorization, which outputs a class or label for the input data. The second part of the name is linear, referring to the fact that in linear regression, we only care about linear relationships; our model is just going to be a weighted sum of the inputs.

Let’s consider a simple example. Using the traditional Iris set of data, examine the relationship between the length of a flower’s sepal and the length of its petal for the Virginica variety. A linear regression would attempt to model the relationship between petal length and sepal length as a line of best fit.

[Image: Virginica Iris. Source: Wildflowerfarm.com]

From the chart below, we can see there is a general linear trend in the data. As sepal length increases, petal length increases in a similar manner. Linear regression produces a linear equation which is the model of the system.

We can see the result of that regression plotted along with the data. The line is our model’s prediction of petal length for each sepal length. Of course, there will be some error because there is some variability in the data, but the fit generally describes the relationship well. The equation for this relationship is:

This is a simple, interpretable form that is very useful for analysis and understanding. From the equation, we can say that for each additional centimeter of sepal length, we generally observe around 0.75 centimeters of additional petal length in a Virginica Iris.

How is a Linear Model Calculated?

By now you have probably guessed the general form of a linear regression. Our linear model is:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$

Our estimate for the target variable is expressed as a bias term or intercept plus a weighted sum of our input variables. This is a nice linear form. But given some data, how can we create such a model? There are an infinite number of lines to choose from, so which is the best? We need a way to measure how good our model is. This is where the concept of error or “residuals” comes into play. For each data point, we want to measure how far from our prediction the actual data was. There are several methods for this, the most common of which is called Residual Sum of Squares (RSS). RSS is the sum of the squared residuals. A residual is the difference between the real value for our target variable and the predicted value. It is often asked why we don’t call these terms “errors” instead of “residuals”, particularly since the term “residual” seems more confusing and an excuse to use a fancy math term to sound important. Unfortunately, a full answer and explanation is beyond the scope of this article, but in short, the reason is that this difference is not actually an “error.” When creating a model, we are fully aware that there is some variance in the distribution of the target variable. The full linear model equation includes a noise term, which is a Gaussian of mean 0, with some standard deviation that is estimated by the model. Thus, “residuals” are measuring this noise, which is what remains after removing the deterministic aspect of our model. So, writing the real value of the i-th data point as $y_i$ and its predicted value as $\hat{y}_i$, our RSS is:

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Now that we have a measure of how our model performs, we can compare lines to choose the best one. RSS is a measure of how much variability there is in distance from our line. The best line would have the least RSS measure. Thus, to choose the line of best fit we must find this minimizing line. This sounds like a difficult task. How would we be able to find that line? We can’t test each line one by one because there are infinite lines to consider! The answer is we use a method called Ordinary Least Squares. Now at this point, we will simply wave our hands and assure you that calculus and some linear algebra will solve this problem for us algorithmically and we don’t need to worry about it.
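For the curious, in the single-input case the calculus works out to a short closed form: the slope is the covariance of the input and target divided by the variance of the input, and the intercept follows from the means. The sketch below implements that and the RSS measure; the function names and the data points are invented for illustration (they are not the Iris measurements).

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(xs, ys, intercept, slope):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements, made up for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
best_rss = residual_sum_of_squares(xs, ys, intercept, slope)
other_rss = residual_sum_of_squares(xs, ys, intercept, slope + 0.2)
print(intercept, slope)
print(best_rss, other_rss)  # the fitted line has the smaller RSS
```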

How to Assess a Linear Model

We’ve seen how we can calculate the terms of a linear model using minimization of a residual equation, but this is not the end of the model building process. Unlike machine learning models, which have millions of parameters obscured in a network, each parameter of a linear regression model can be interrogated and analyzed. The regression model produces several metrics for use. First, there is the R² measure or the Coefficient of Determination. R² measures how much of the variance in the data is explained by the model. A value of 1 means that the model perfectly describes the data while a value of .5 suggests that only 50% of the variance across the data is explained in the model. The rest is unaccounted for. Obviously, the closer to 1 the R² value, the better.
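As a quick sketch of how R² is computed, the function below compares the residual sum of squares against the total variation of the target around its mean. The function name and the numbers are assumptions for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual variation / total variation)."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


actual = [1.0, 2.0, 3.0, 4.0]            # hypothetical target values
rough_guess = [1.5, 1.5, 3.5, 3.5]       # hypothetical model predictions
print(r_squared(actual, actual))         # a perfect model
print(r_squared(actual, [2.5] * 4))      # always predicting the mean
print(r_squared(actual, rough_guess))    # somewhere in between
```

Predicting the mean for every point explains none of the variance, which is why it serves as the baseline in this measure.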

Additionally, for each coefficient term in our model, we can calculate a p-value to determine if that estimate is statistically significant. If we encounter large p-values, we may question whether the inclusion of that variable in the model is a good idea. In those cases, the variable may not be useful and we can retry the model after discarding it. One technique is to gradually build a model by adding variables one at a time and testing to see if there is a statistically significant increase in model performance from the addition of the next variable. This approach is known as forward selection, and the comparison between the smaller and larger models is typically made with an analysis of variance (ANOVA) test. Another method for testing our model is by analyzing the distribution of the residuals. Ideally, they should have a normal distribution and be homoscedastic; that is, they maintain roughly the same variance across the fitted values of the model and across the inputs to the model. We want the residuals to behave the same across our data because that indicates regular performance. We don’t want the model to suddenly be worse for some areas or be unpredictable.

While all this analysis is again beyond the scope presented here, it is important to understand that because of the simplicity of the linear regression model, these types of robust, interpretable, and well understood statistical analyses are possible. We can be intimately familiar with our linear models in ways that are impossible for more complex structures.

What Flexibility Exists in Linear Models?

The biggest weakness of linear regression is that it is only possible to model direct linear relationships. But there are many more types of relationships in the world that we would like to model. Luckily, there are very easy ways to inject non-linearity into linear regression. The first of these is transforming our data. Instead of modeling our variables directly, it is very common to transform them first, such as by taking a logarithm of the values or applying some other power transform. There may not be a linear relationship between our values directly, but there could be one between their logarithms. A model of this kind can be interpreted as saying an increase in the independent variable by 1% implies a change in the target variable of roughly X%.
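Here is a small sketch of the logarithm trick. The data follow an assumed power law, y = 3x², which is not linear in x, yet a straight line fits the logarithms exactly, and its slope recovers the exponent. The function name and the data are invented for illustration.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data following y = 3 * x^2.
xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x ** 2 for x in xs]

# Fit a straight line to the logarithms instead of the raw values.
log_intercept, log_slope = fit_line([math.log(x) for x in xs],
                                    [math.log(y) for y in ys])
print(log_slope)                # close to the exponent, 2
print(math.exp(log_intercept))  # close to the multiplier, 3
```

A slope of 2 on the log scale reads as "a 1% increase in x goes with roughly a 2% increase in y."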

Another technique is to include squared terms of our variables into the model as well. We simply square our data and treat that square as another separate variable. Yet another technique is to use what are called interactions. These are treated as combinations of existing variables. For example, if we include Sex and Height into our model, we can include a term, Sex:Height, which is just the combination of the two. Essentially, this creates two new variables called Male:Height, which is the same as Height for males and 0 otherwise, and likewise for Female:Height. This interaction term can model the separate effects of Height for Males and Height for Females. Perhaps the Height term has more dramatic effects for Females than Males?
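Both techniques amount to adding derived columns before fitting. The sketch below shows one way those columns might be built; the record layout, column names, and values are assumptions for illustration.

```python
def add_derived_columns(rows):
    """Add a squared term and Sex:Height interaction columns to each record."""
    expanded = []
    for row in rows:
        is_male = row["sex"] == "M"
        expanded.append({
            **row,
            "height_squared": row["height"] ** 2,
            # Each interaction column equals height for one group and 0 for the other.
            "male_height": row["height"] if is_male else 0.0,
            "female_height": 0.0 if is_male else row["height"],
        })
    return expanded


# Hypothetical records.
people = [
    {"sex": "M", "height": 180.0},
    {"sex": "F", "height": 165.0},
]
for record in add_derived_columns(people):
    print(record)
```

The regression then estimates a separate weight for `male_height` and `female_height`, which is how the model can express a different Height effect for each group.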

The Power of Linear Regression Models

Linear regression models are extremely popular across domains because of their robustness and simplicity. Because of transformation and other techniques, linear regression models can model a wide range of dependencies in our data. Because their form is well defined, unlike Neural Networks, they have statistical properties that we can analyze to compare models, make interpretations, and derive important information. Linear regression is not only useful for predictions, it can do what most machine learning models cannot – describe the system. If you have a numerical value you want to model, a relatively short list of independent variables to use, and a desire to understand the model you create, linear regression should usually be your first choice. Luckily, if you use PolyAnalyst, you have all of those options available to test and find out what works best for your data.

Wide, Bulky Data and Dimensionality Reduction

Chris Farris — Mon, 03 Feb 2020 17:46:31 +0000

As the saying goes, “mo’ data, mo’ problems.” Well, that’s not quite the saying and that’s not quite what we mean, but it certainly is true that large data comes with some problems that need to be solved. There are two ways that data can be “big.” Traditionally, Big Data is considered large literally by its physical size in terms of the memory space it consumes. In this article, we’re considering another type of “big” data, which we like to call Wide Data. Wide Data is unstacked tabular data that consists of a large number of columns or variables, usually in the hundreds and above. Wide Data and Big Data often overlap, but they are not necessarily the same: for example, Wide Data can be shallow, with a relatively small number of rows but a large number of columns. Wide Data is common and comes with its own set of problems that need to be addressed in any analytical setting. Next, let’s discuss those problems.

The Problems of Wide Data

One issue with Wide Data is that it is bulky, and this presents challenges that aren’t present in more compact datasets. For example, a person can quickly scan a table of a few columns, maybe even a dozen or more, to summarize what it contains. That quick scan will tell a person what sort of information is contained there, determine if the data is sparse, and provide a rough estimate of the range and variance of the data. This sort of quick summarization is impossible with Wide Data. Even scrolling through column names (if they are informative at all) takes some time to do, and in the end, it is usually impossible to retain all that information.

Not only is Wide Data confusing for us humans, but it can also be confusing for machines if we have an especially large number of columns. If we want to create a machine learning model with Wide Data, our model might suffer from a complexity problem. A large number of inputs to a machine learning model can make training that model either unstable or time-consuming.

So, what to do? Large numbers of columns aren’t ideal for either humans or machines, but we don’t want to just throw out columns randomly to reduce the size of the data, as we could risk losing important data. The solution is dimensionality reduction, a common technique in data science. Dimensionality reduction will reduce the number of columns that we have in our data while minimizing any loss of information as a result. These techniques are extremely useful in cutting out the fat from data while retaining the value.

Dimensionality Reduction in PolyAnalyst™

PolyAnalyst™ contains several built-in methods for performing dimensionality reduction. We’ll highlight a couple of these below.

The first method is Data Simplification. This process scans data to identify columns that are nearly uniform. A column with almost no variation is very uninformative and could be dropped entirely. Additionally, Data Simplification compares the columns with one another to see if there are any that are nearly identical. These columns can be triaged so that only one remains in the dataset to avoid unnecessary duplication. In addition to Data Simplification, Correlation Analysis can be performed to discover correlated variables. While not identical, strongly correlated variables serve a common purpose and we could consider cutting down to one in this case as well.
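To give a feel for the checks involved, here is a rough Python sketch of two of them: flagging nearly uniform columns and finding duplicated columns. This is not how PolyAnalyst implements Data Simplification; the function names, the 95% threshold, and the table are assumptions, and the duplicate check only catches exact copies rather than nearly identical columns.

```python
from collections import Counter


def near_uniform_columns(table, threshold=0.95):
    """Names of columns whose most common value fills at least `threshold` of the rows."""
    flagged = []
    for name, values in table.items():
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count / len(values) >= threshold:
            flagged.append(name)
    return flagged


def duplicate_columns(table):
    """Pairs (kept, duplicate) for columns whose values are exactly identical."""
    first_seen = {}
    pairs = []
    for name, values in table.items():
        key = tuple(values)
        if key in first_seen:
            pairs.append((first_seen[key], name))
        else:
            first_seen[key] = name
    return pairs


# Hypothetical table stored as column name -> list of values.
table = {
    "country": ["US"] * 20,
    "height_cm": [150 + i for i in range(20)],
    "height_copy": [150 + i for i in range(20)],
    "weight_kg": [50 + 2 * i for i in range(20)],
}
print(near_uniform_columns(table))  # prints ['country']
print(duplicate_columns(table))     # prints [('height_cm', 'height_copy')]
```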

The second method is Factor Analysis, which includes techniques such as principal component analysis. The goal of this method is to find the vectors along which most of the variance occurs in the data, reorient the data along those vectors, and discard the vectors in which little variance occurs. This is best understood with an animation.

In this example, we have measured and plotted only two variables: height and weight. We can see the direction along which most of the variance occurs. This is the principal component. Now we rotate our data such that this direction is our horizontal axis and the orthogonal direction is the vertical axis. This destroys the semantic meaning of these axes of course, but this is not a big deal. Where they were once semantically height and weight, they now hold no such meaning. We will be able to regain the meaning after performing the analysis in the smaller number of dimensions and rotating the results back to the original axes. Now, if we would like to shrink the size of our data, we can throw out the vertical axis, which exhibits little variance. Thus, we have eliminated an entire variable but preserved the majority of the useful information, which simplifies further analysis.
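For two variables, the rotation can be worked out by hand: the angle of the principal direction follows from the variances and covariance of the centered data. The sketch below is a minimal two-dimensional illustration of the idea, not PolyAnalyst's Factor Analysis; the function names and the height and weight values are assumptions.

```python
import math


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def principal_axis_rotation(xs, ys):
    """Center the data and rotate it so the first axis follows the principal component."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cx = [x - mean_x for x in xs]
    cy = [y - mean_y for y in ys]
    sxx = sum(a * a for a in cx) / n
    syy = sum(b * b for b in cy) / n
    sxy = sum(a * b for a, b in zip(cx, cy)) / n
    # Angle of the direction of greatest variance for 2-D data.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    first = [a * cos_a + b * sin_a for a, b in zip(cx, cy)]
    second = [-a * sin_a + b * cos_a for a, b in zip(cx, cy)]
    return first, second


# Hypothetical height (cm) and weight (kg) measurements.
heights = [150.0, 160.0, 165.0, 170.0, 180.0, 185.0]
weights = [50.0, 58.0, 63.0, 66.0, 77.0, 80.0]

first, second = principal_axis_rotation(heights, weights)
retained = variance(first) / (variance(first) + variance(second))
print(f"Share of variance kept by the first axis: {retained:.3f}")
```

Dropping `second` leaves a single column that carries most of the original variation.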

Stay tuned…

Wide, bulky data is fairly common in the world, and although it comes with some problems, these can be managed with the techniques described here, such as Factor Analysis and Data Simplification. Once our data is manageable, PolyAnalyst has a wide range of modeling solutions. Stay tuned for more discussions about data analysis!

How to Audit and Explore Datasets in PolyAnalyst

Kathryn Verhoeven — Wed, 22 Jan 2020 14:00:43 +0000

Suppose you just obtained a fresh new set of data. You might know what it’s generally about, but the only way you can know what hypotheses to test (and what analysis challenges you’re up against) is to explore the data and audit its overall quality. Unless you can see the basic overview of what you’re dealing with, you won’t be able to determine the most appropriate models to build or develop a strategy for handling challenges such as outliers and missing values.

For these reasons, exploratory data analysis is regarded as an important step in data mining methodologies.

What is exploratory data analysis?

The overarching concept of exploratory data analysis (EDA) was first advocated by John Tukey. His foundational work on EDA, published in the late 1970s, championed the idea of assessing the data before rushing into testing hypotheses and building models. Since that time, EDA has become an integral part of data mining methodologies, including CRISP-DM, KDD, and SEMMA. The goal of EDA is to get an overall feel for the data you’re about to spend time analyzing so that you can better address the key analysis questions and objectives.

Specifically, some of the key purposes of exploratory analysis include:

Obtaining an overview of the data
- Discover patterns
- Frame hypotheses
- Check assumptions
Gathering general statistical information about the data
- Distribution
- Key statistics (mean, median, range, standard deviation, etc.)
Detecting data anomalies
- Outliers
- Missing Data
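Outside of any particular tool, the basic statistics in this list take only a few lines to compute. The sketch below uses Python's standard `statistics` module; the function name and the values (including the use of `None` to mark missing entries) are assumptions for illustration.

```python
import statistics


def summarize(values):
    """Key statistics for one numeric column, counting None entries as missing."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "range": max(present) - min(present),
        "stdev": statistics.stdev(present),
    }


# Hypothetical column with two missing entries.
power = [130, 165, None, 150, 140, None, 198]
for name, value in summarize(power).items():
    print(name, value)
```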

Graphical visualization methods comprise the majority of EDA techniques, but non-graphical quantitative measures are just as important to incorporate. More information on specific EDA techniques is available from sources such as this handbook from the NIST Information Technology Laboratory.

Exploratory analysis using PolyAnalyst

Users of PolyAnalyst know that the system comes with a wide variety of analysis features for handling structured data as well as text. And for basic EDA, the system’s data loading nodes automatically provide a convenient view of the overall data patterns from a univariate perspective.

For example, if we load an example dataset containing car data, the analyst can obtain a basic overview of the dataset variables and the data types by viewing the data in tabular form.

In the Statistics tab of a dataset node, the various statistics for each variable are presented for quick assessment of data anomalies, such as outliers and missing data. For example, in the car data, we can see the mean, median, range, and standard deviation of the Power variable, and we observe that there are 6 missing values.

Finally, the Distinct tab lists all the unique values a selected variable has within the dataset along with the relative percentage distribution among those values. In the case of the example car data, we can view all the Year values for the cars in the dataset as well as their relative distribution.
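The same kind of distinct-value summary can be sketched in plain Python with a counter. This is only an illustration of the idea, not the Distinct tab's implementation; the function name and the year values are made up.

```python
from collections import Counter


def distinct_value_shares(values):
    """Map each distinct value to the percentage of records that hold it."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * count / total for value, count in sorted(counts.items())}


# Hypothetical model years for eight cars.
years = [70, 70, 71, 72, 72, 72, 73, 73]
print(distinct_value_shares(years))
# prints {70: 25.0, 71: 12.5, 72: 37.5, 73: 25.0}
```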

PolyAnalyst’s built-in EDA tools

From the basic overview provided in any PolyAnalyst data node, we can begin to formulate a plan for how to proceed with data cleansing and analysis. And to compare sets of variables against each other for a bivariate (or multivariate) analysis, the various chart nodes (such as bar charts and scatter plots) are quick to set up and make it easy to view relationships among variables. But there are a few additional built-in tools in PolyAnalyst for EDA that are worth highlighting, such as the Data Audit node and the Distribution Analysis node.

Let’s now go over these tools briefly and see how they work.

Data Audit Node

In PolyAnalyst, the Data Audit node is a built-in tool for exploring your data. It is useful when first examining a new dataset and provides lots of information you can quickly scan to gain preliminary insights into your dataset. This node serves as a summarization node, so it does not connect up to additional downstream analysis nodes in the project script. Its main purpose is to guide your decision making on how to cleanse, prepare, and analyze your data.

The node settings allow you to specify which columns (i.e., variables) to analyze and whether or not to perform both summary statistics and anomaly detection. The resulting report shows a synopsis of the statistics and anomaly metrics, which are useful to consider prior to applying a predictive model or classification model.

Here’s an example of the node output when analyzing a crime data example dataset, which lists the date, district, category, and description of various crime reports:

The data audit output report consists of multiple tabs that assess the data for anomalies, arranged according to the data type (i.e., categorical, numerical, text). Summary statistics for each variable are also listed in the report under different tabs. In the example above, the system has identified and presented in the lower pane three potentially anomalous records with dates outside of the expected range covering the vast majority of records (years 1997-2005).

Distribution Analysis Node

The Distribution Analysis node in PolyAnalyst is a helpful tool for analyzing trends and distributions of numerical data values in a dataset. Along with calculating statistical characteristics, the node recognizes distributions (normal, exponential, double exponential, log-normal, and uniform), performs general hypothesis testing, and groups variables based on those tests. When setting up this node, the user can specify which variables to analyze, as well as the significance levels, tail properties, and other parameters of the statistical tests to be performed.

The resulting report contains a list of the statistical tests that have been conducted as well as their results. The output report will also show information like the fitted distributions and labeled tails, which can later be appended as a separate column to the original dataset. The Distribution Analysis node also conveniently connects to other analysis nodes to pass its results to predictive modeling and machine learning.

Let’s see how this works using the car data from our earlier example:

The distribution analysis performed here was set up to analyze the patterns for the MPG variable in data records representing individual cars, grouped by a car’s Origin (i.e., Europe, Japan, USA). The resulting report displays the MPG distribution type for each Origin group along with other statistics in the top table. By highlighting a specific Origin group (e.g., Europe), the user can view the test parameter information that resulted in assigning this sample a log-normal distribution.

On the summary tab, a fitted distribution chart is displayed for the selected subset. For example, if we select the Europe subset, we can see the log-normal curve fitting of the data.

Conveniently, the information obtained from the distribution analysis can be scored against the dataset to create an additional column. For example, in the image below we can see the scored values for MPG significance (blue box) and status (red box) from our distribution analysis appended to the car dataset, which can be useful for analyzing anomalies (marked by the system as belonging to either the upper or lower Tails of the identified distribution).
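As a rough illustration of what tail labeling means, the sketch below fits a log-normal distribution to a handful of values and marks each one as belonging to the lower tail, the upper tail, or the body. This is an assumption-laden stand-in, not the node's real procedure: the function `label_tails`, the `alpha` cutoff, and the MPG numbers are invented for the example.

```python
import math
from statistics import NormalDist, mean, stdev

def label_tails(values, alpha=0.05):
    """Label each value relative to a fitted log-normal distribution.

    A log-normal fit is a normal fit on the logarithms, so the cutoffs are
    computed in log space and converted back. `alpha` is the probability
    mass assigned to each tail (an assumed parameter).
    """
    logs = [math.log(v) for v in values]
    fitted = NormalDist(mean(logs), stdev(logs))
    lower_cut = math.exp(fitted.inv_cdf(alpha))
    upper_cut = math.exp(fitted.inv_cdf(1 - alpha))
    labels = []
    for v in values:
        if v < lower_cut:
            labels.append("lower tail")
        elif v > upper_cut:
            labels.append("upper tail")
        else:
            labels.append("body")
    return labels

# Invented MPG values for one group of cars, with one unusually efficient car.
mpg = [24, 25, 26, 27, 28, 29, 30, 31, 26, 27, 28, 29, 44]
status = label_tails(mpg)
```

The `status` list plays the role of the appended column: one label per record, ready to be filtered or charted.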

Keep in mind, however, that PolyAnalyst’s built-in EDA tools described here are only a couple of the techniques that can be employed when assessing data quality. There are, of course, more analysis features in PolyAnalyst that analysts can use to conduct EDA. You can check out this page for more information on the types of analysis nodes and capabilities in PolyAnalyst. Feel free to contact us for a live demo of these features if you’re interested in learning more.

The post How to Audit and Explore Datasets in PolyAnalyst appeared first on Megaputer Intelligence.

What’s the News with Neural Networks?

Chris Farris — Tue, 28 May 2019 18:10:03 +0000

There are many reasons for the explosion of machine learning advancements over the past decade. We now have vastly improved hardware for fast computation, and memory is cheaper than ever. Data is now “Big Data,” and it is both jealously hoarded and publicly available in repositories such as ImageNet. Individually, these advancements are already a blessing for the technology-space. But for artificial intelligence (AI), they have opened the gates for something truly powerful—Neural Networks.

Neural Whatnow?

Neural Networks. You’ve probably heard of them. They are at the forefront of the machine learning craze and are the driver of many of the most impressive advancements. The technology that led machines to beat the best humans in Go and the popular video game StarCraft? Neural Networks. The backbone of algorithms that can recognize images and faces, which are igniting a surveillance and privacy panic? Neural Networks.

And yet, Neural Networks aren’t some new idea spawned from the incubator of a giant tech company. They aren’t some stroke of genius from a college student turned dropout who went on to found a revolutionary tech firm. Neural Networks are in fact… old hat. Or they were.

The concept of a Neural Network (something we will get to later) has been around for years. They date back to the 1970s, and simpler versions of them existed even in the 1940s! So, if they have existed for decades, why are they only popular now?

The answer is related to the hardware and data advancements mentioned earlier. Neural Networks crunch a lot of numbers. They also need a lot of data to help them learn. Up until the past decade, this made training anything but the simplest networks highly time-consuming and expensive.

With all the great improvements to hardware over the years, using more advanced Neural Networks became possible. Aided by hardware demands from the entertainment industry and now cryptominers, GPUs (graphics processing units) have been developed which can calculate specific mathematical operations at lightning speed. Luckily, these same kinds of operations occur in training neural networks. Using the technology originally developed for beautiful visuals in film and video games helps us train networks in a fraction of the time it takes a traditional CPU.

What’s in a Name?

The power of Neural Networks may be evident, but at this point, one may also be wondering what they are in the first place. The name gives some clues. A Neural Network isn’t a singular object that makes decisions by itself. It is, as implied, a network of smaller objects all connected. A network of what, then? We call them Neurons. Neurons as in the cells inside our brains? Yes! Well, no. But sort of!

Neurons in Neural Networks can be thought of as being like the neurons in brains. They are tiny, individual units that are connected to other neurons in a large, structured network. These connections allow tiny pieces of data to flow between them. In our brains, these are electrical pulses. In the Neural Network, we send numbers between Neurons. The Neurons then take all the numbers fed to them by the Neurons they are connected to and process them. The process isn’t complicated—in fact, it is painfully trivial. After all, it is just a tiny unit—a single cell in our brain. But it then sends the processed information out to other Neurons it is connected to. Another number. Another electrical pulse. And so on. Tiny Neurons are fed tiny pieces of information, perform tiny pieces of computation on that snippet, and feed it forward to other Neurons, which do the same over and over until we eventually reach a final set of Neurons—the output of which is our final result.
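The flow described above can be sketched in a few lines of Python. This is a toy illustration, not a production network: the sigmoid "squashing" function and every weight and bias below are arbitrary assumptions chosen only to show numbers moving from one layer of Neurons to the next.

```python
import math

def neuron(inputs, weights, bias):
    """One Neuron: weigh the incoming numbers, add them up, squash the result."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid keeps the output in (0, 1)

def layer(inputs, weight_rows, biases):
    """A layer is just several Neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def forward(inputs, layers):
    """Feed the numbers forward, layer by layer, until the final Neurons."""
    signal = inputs
    for weight_rows, biases in layers:
        signal = layer(signal, weight_rows, biases)
    return signal

# Arbitrary example network: 2 inputs -> 3 hidden Neurons -> 1 output Neuron.
network = [
    ([[0.5, -0.6], [0.1, 0.8], [-0.3, 0.2]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]
output = forward([1.0, 2.0], network)
```

Each Neuron does almost nothing on its own. Everything interesting lives in the weights, which is exactly what training has to find.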

From the collective effort of many small individual units networked together in a perfectly calibrated balance, we can achieve enormous computational power. The Whole is greater than the Sum of its parts.

Balancing Act

If that all sounded magical and farfetched, then don’t worry—it is. How exactly are we supposed to arrange these Neurons in such a perfect balance that their combined minuscule computations lead to a machine recognizing human faces? We can’t. So how do we get this to work? Well, we are talking about Machine Learning after all. And Machine Learning is how we are going to solve this. We aren’t going to calibrate the Neurons to be in balance. They are going to calibrate themselves.

In order to achieve this, we need training data. We need data that is labeled with the desired output we want from the machine. We can provide this data to the unconfigured Neural Network. It will process the data and likely output something nonsensical and useless. But this is fine. We can use a mathematical function called Error. This Error is just a measurement of how different our output was from our desired targets. Then, using Calculus, we can discover how much of that Error is caused by the calibration of each one of our Neurons! Using this information, we can then tune the Neurons slightly and repeat the process: feed data into the Network, observe the output, calculate the Error, and use Calculus to know how to tune the Neurons to make them more accurate. This process continues over time until we have converged to a calibrated Network. This process of using Error functions, Calculus, and tuning is what Machine Learning is.
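Here is a minimal sketch of that loop for a single Neuron learning the logical OR of two inputs. The choices are assumptions made for illustration: squared difference as the Error, a sigmoid Neuron, a learning rate of 0.5, and 2000 passes over the data. Real networks repeat the same idea across many Neurons at once.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def total_error(weights, bias, data):
    """Error: summed squared difference between output and desired target."""
    return sum((predict(weights, bias, x) - t) ** 2 for x, t in data)

def train(data, rate=0.5, epochs=2000):
    weights, bias = [0.0, 0.0], 0.0  # the unconfigured Neuron
    for _ in range(epochs):
        for inputs, target in data:
            out = predict(weights, bias, inputs)
            # Calculus (chain rule): how the squared Error changes as the
            # Neuron's weighted sum changes, passing through the sigmoid.
            delta = 2 * (out - target) * out * (1 - out)
            # Tune each weight slightly in the direction that lowers the Error.
            weights = [w - rate * delta * x for w, x in zip(weights, inputs)]
            bias -= rate * delta
    return weights, bias

# Labeled training data: inputs paired with the desired output (logical OR).
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(data)
```

Comparing `total_error` before and after training shows the Error shrinking as the Neuron calibrates itself.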

Power at a Price

Although it is conceptually simple (at least in this explanation that ignores the details), it is very resource intensive. At the moment I am fine-tuning a Neural Network on my own desktop to recognize Western Art styles. Although the dataset is only a few GB, it takes almost an hour to run a single iteration of the training. It will take nearly two days to complete the 50-iteration learning schedule I planned. And even then, I may need to schedule another one if the Network still needs to learn more! And I’m not running this on some dusty machine I dragged out of the aughts. This is an almost brand-new desktop powered with a 3.7 GHz AMD Ryzen 8-core processor with 32GB of RAM available and virtually no other load on the machine.

Neural Networks are expensive to train. If you want to increase performance, you could pay for an expensive GPU, but that might set you back nearly a thousand dollars. Companies and research institutions may have the funds to throw at this problem, but individuals, small companies, and small research groups may not. Luckily, they don’t need to anymore.

Cloud services have opened access to remote processing to provide other computing options to those desiring to train a Neural Network. Don’t have an expensive rig? No worries, just rent one remotely from Google at a fraction of the price.

Neurally Networked World

Neural Networks are here and they aren’t leaving. New advancements and computing architectures are constantly being published. Convolutional Networks are good at processing images and Recurrent Networks can handle variable-sized data, streaming data, or sequential data. Neural Networks are powerful indeed, far more so than other solutions we have. But they don’t mimic what is really occurring in our brains. There are some deep flaws even in our most advanced networks. For instance, it is extremely easy to confuse a Neural Network. A Network designed to recognize stop signs can be fooled by a few well-placed stickers on a sign. There are deep safety concerns if these are going to be used in self-driving cars, for example.

Neural Networks are deeply dependent on the data used to train them (just as we discussed in a previous article). They are also dependent on how we configure their output. Most networks are designed to give a decision. For object detection, the network must return what it thinks the input image is. But what happens when we feed the network “nothing,” such as a completely blank image or fuzzy static noise? The network is forced to return something so it “sees,” say, a dog in the empty space. This is nonsense. Why would it choose one object over another in these cases where a human would just refuse to rigidly define nonsense? As another example, if we slightly alter an image by inserting some imperceptibly small random noise, we can completely trick a Neural Network. Where it before correctly thought that the image was a butterfly, it now thinks the image is a truck, while a human sees no difference in the images. This is bad, and it reflects deep issues with Neural Networks as we construct them today.
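The forced decision is easy to see in code. Many classifiers end with a softmax step that turns raw scores into probabilities summing to one, and then simply report the largest. The sketch below uses invented labels and invented scores for a blank image. The abstain threshold at the end is one simple illustration of "refusing to answer", not something current networks do by default and not a fix for the noise attacks described above.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that always sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forced_choice(probs, labels):
    """What a typical network does: always return the top label."""
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best]

def classify_or_abstain(probs, labels, threshold=0.6):
    """Hypothetical alternative: return None when no label is convincing."""
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None

labels = ["dog", "cat", "truck", "butterfly"]
blank_image_scores = [0.21, 0.20, 0.19, 0.20]  # invented, nearly identical scores
probabilities = softmax(blank_image_scores)
decision = forced_choice(probabilities, labels)
cautious = classify_or_abstain(probabilities, labels)
```

With scores this close, the top probability is only slightly above one in four, yet the forced choice still names a winner.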

Neural Networks are occupying a liminal space. They are simultaneously scarily powerful and laughably simple and ignorant. They can best the human masters and yet be duped by the smallest of changes. We won’t be seeing Neural Networks achieve human-like sentience any time soon, and they aren’t ready for deployment in many other types of systems. But they are already being used in ways that should cause alarm.

China is already using facial recognition technology to tag members of the Uighur ethnic minority group. Accurate voice recognition means individuals could potentially be tracked even if they are not near a camera, by turning the phones in our pockets into monitoring devices. “Deep Fakes” are a growing type of video that can modify existing videos to map one person’s face and voice over another’s to create a fake video that could be used for blackmail or disinformation. While Terminator remains science-fantasy, armed military drones using neural networks for stabilization, navigation, targeting, and tactics would revolutionize armed conflicts in ways impossible to predict.

Though neural networks are being used to oppress in some places, they are also being used to save in others. Medical facilities increasingly deploy network systems to detect ailments such as cancer or infectious diseases. Laboratories can use similar networks for modeling complex biomolecules and developing treatments. Neural networks are even being used in traffic light control systems to increase vehicle flow and reduce accidents.

Where will Neural Networks take us next? It’s hard to say. But it seems evident that the world is caught and will remain in a web of Neurons.

The post What’s the News with Neural Networks? appeared first on Megaputer Intelligence.

An introduction to machine learning

Chris Farris — Fri, 15 Feb 2019 18:58:52 +0000

Machine Learning is hot right now. Really hot. And it’s come a long way from where it was over two decades ago. In 1997, IBM’s Deep Blue defeated world chess champion Garry Kasparov in a six-game match. This was a great achievement to mark the progress of the field, but chess is a relatively simple task, and computers still struggled to master more difficult tasks for another decade. Then, in the late aughts, machine learning started to boom like never before. In 2011, IBM’s Watson utilized live simultaneous natural language processing with information retrieval to defeat two Jeopardy! champions. In 2016, Google’s AlphaGo defeated Lee Sedol, who is claimed by many to be the strongest Go player in history. Despite the simplicity of Go’s rules, the game is extraordinarily complex. For perspective, there are more possible configurations of a Go board than there are atoms in the universe, and by this measure, Go is a googol times more complicated than chess. To really appreciate the magnitude of this, consider that a googol (10¹⁰⁰) is written as one followed by one hundred zeros, and the ratio of an electron (a sub-atomic particle) to the entire known universe is only 0.00000001% of a googol. It is a mind-bogglingly large number beyond human comprehension. And in a handful of years, machine learning has advanced by that scale.

Recently, a program called AlphaStar learned virtually on its own how to play and master the wildly popular and challenging competitive real-time strategy video game StarCraft II. StarCraft II is so competitive that human players actually perform physical exercises with their fingers so that their reflexes are honed in order to strike the right keys as fast as possible. While Go merely involves placing pebbles on a grid, StarCraft II involves economic resource management, strategic combined arms combat, exploration, extensive future planning, processing streams of real-time information, and making rapid decisions.

Applications of Machine Learning

Machine Learning often gets flashy coverage when it reaches milestones in games, but it has slowly crept into our daily lives in ways we may not think about. When we post pictures on social media platforms like Facebook, our faces are instantly recognized and names are suggested of who to tag. When we watch movies or shows on platforms like Netflix, we get an endless stream of recommendations based on our past behavior. This same structure exists on the online shopping market such as Amazon. And it’s common for Apple’s Siri, Amazon’s Alexa, and Google’s Assistant to live in our homes and understand our requests.

Machine Learning is here to stay. And with Machine Learning technology sprinting into the future, everyone wants in. Companies are rapidly creating or bolstering existing analytics departments to utilize the craze. Unfortunately, like many hyped technologies, “Machine Learning” as a phrase can devolve into jargon, buzzwords, oversimplifications, misnomers, and a lot of confusion. To many, Machine Learning is a mystery that seems indistinguishable from sorcery.

What is Machine Learning?

Machine Learning, like Artificial Intelligence, suffers from a name that tends to get bogged down in philosophy and pedantry. What does it mean to “learn?” What does it mean to hold “intelligence?” There is a good deal of discussion amongst computer scientists and others about this, but for our purpose it is just a rabbit hole. In fact, it is best to throw out our human conceptions of what “learn” and “intelligence” mean because they only add a biased expectation.

So, then, what is Machine Learning? In its simplest form, Machine Learning is essentially making a computer learn a complicated task by having the computer teach itself. We merely provide the computer some examples of how to do something, and the computer learns from those examples to help us fill in the holes and solve complex problems.

Now let’s break this process down. First, it is useful to identify our goals. Almost always, we would like to classify (assign discrete labels), perform regression (predict numerical values) on, or cluster (group similar things together) our data. For example, we could classify faces by whether they are smiling or not. We could perform regression on weather data to predict tomorrow’s temperature. We could cluster stars together by grouping them based on how hot and bright they are.

Now that we have a goal in mind, the question is by what method can we achieve this task? This is the “model.” Think of a model as a machine that you feed data into and which spits out classes, a regression, or clusters of that data. How do we build the model? Before Machine Learning, you would have to manually figure this out. This would involve having to draw specific blueprints for the machine, figure out the appropriate size of gears and their exact positions, assemble the machine, and test it.

In this analogy, Machine Learning is like building a special type of machine. This machine has gears that can change size and position. Also, if you provide the machine with the desired output, it can run data through itself and check the resulting output against this desired output. If they don’t match, the machine knows how to change the size and position of the gears to make it more likely to be correct in the future. And it can repeat this over and over again until it has found the best arrangement of gears. This is essentially Machine Learning, and all it took was having a special device and labeled data it could look at. No human was needed to tell it where to place the gears.

This is just an analogy, but it works very similarly in our digital space. We construct a digital model that is parameterized by certain variables. We feed training data labeled by the desired output into the model and see the actual output. We then compare the actual output to the labels, which is our desired output. We note the error and use math to compute in what direction and by how much we should tweak the parameters of the model. And repeat. The machine doesn’t make decisions or figure out concepts on its own. It is programmed to use Calculus, Information Theory, Probability, and Statistics to calculate numbers to fudge the parameters by.
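As a small worked sketch of this cycle, the code below fits a straight line to a few labeled points. The model is parameterized by two numbers, `w` and `b` (the "gears"), the error is the mean squared difference from the labels, and the tweak direction and size come from the derivative of that error. The data, learning rate, and step count are assumptions picked for the example.

```python
def fit_line(points, rate=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly nudging w and b to reduce the error."""
    w, b = 0.0, 0.0  # start with arbitrary parameters
    n = len(points)
    for _ in range(steps):
        # Derivatives of the mean squared error with respect to w and b:
        # they say in what direction, and by how much, to tweak each one.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= rate * grad_w
        b -= rate * grad_b
    return w, b

# Labeled training data generated from the line y = 2x + 1.
points = [(x, 2 * x + 1) for x in range(5)]
w, b = fit_line(points)
```

Nothing in the loop "understands" lines. It only measures the error and adjusts two numbers, and the values of `w` and `b` drift toward the slope and intercept that generated the data.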

In the end, Machine Learning is not what most humans consider “learning.” It is driven by math, algorithms, and data. For those who aren’t fascinated by the math and theory of Machine Learning, this can destroy the mystique somewhat. Personally, although the Man Behind the Curtain is not what we might expect or desire, he is intriguing all the same. In fact, as computer scientists and cognitive scientists study Machine Learning more, we start to suspect that there isn’t as much of a disconnect as we might think. We may think that Machine Learning is so unlike humanity, but it may be the case that humanity is more like Machine Learning than we realize. Our own brains may be just like a cold math-algorithm-data amalgam, but on a massive and complex scale.

The Importance of Data

Data is a pillar of Machine Learning. Without data we cannot build anything. The math doesn’t change, and while we can build clever model structures to help with the learning process, without good data our models will fail. We luckily live in an age of data. Large amounts of data allow models to be extremely fine-tuned. It allows for advanced model structures like Deep Neural Networks. Publicly available data sets such as MNIST and ImageNet for image processing have allowed data scientists to easily compare models and share knowledge. Data is revolutionizing entire fields and Machine Learning is no exception.

But with great power comes great responsibility. If our data is not labeled well or if it is biased from how it is collected, that error or bias is directly injected into the model. After all, the model is simply trying to mimic the data used to create it. Bad data in, bad model out.

Biased data can lead to very poor model performance. For example, last year MIT Media Lab showed that facial recognition systems from Microsoft, IBM, and Face++ weren’t very good at identifying women or people with darker skin, most likely because they didn’t have enough examples of those types of faces in their training data.

So big data is great for our ability to model the world. But the purity and completeness of the data we use in Machine Learning should always be considered when building a model.

The Future

Well, Machine Learning is the future. Tasks that we do effortlessly as humans, such as processing sound and sight, are extraordinarily complicated. Without Machine Learning it would take monumental effort to create machines to do them effectively, and for the tasks that even we as humans find hard, machines would be hopeless. We live in the age of data and increasingly cheap computational power. Machine Learning is thriving and adapting. If you are interested in Machine Learning, check out this video for more insights.

And stay tuned for future blog posts on Machine Learning as we examine different applications and different models used.

The post An introduction to machine learning appeared first on Megaputer Intelligence.

Thanks for attending the 2018 Megaputer Analytics Conference

Brian Howard — Mon, 03 Dec 2018 22:32:38 +0000

We had a great experience at the Megaputer Analytics Conference this year. Of course, this event would have been nothing special if it were not for the attending professionals whose engagement and contributions enhanced the content and discussion. A special thanks to the guest speakers for sharing the challenges they faced and the applications they have developed using data and text analysis.

We appreciate all the feedback about how everyone enjoyed the interactive workshops, the food, and the gathering together each night for additional networking and fun.

We look forward to seeing each of you again next year!

The post Thanks for attending the 2018 Megaputer Analytics Conference appeared first on Megaputer Intelligence.

Flagging documents with timestamps after an event

Rebecca Hale — Mon, 19 Nov 2018 19:06:54 +0000

We have electronic medical records containing free text descriptions as well as structured information like patient IDs, hospital admission IDs, and date/time. One patient may have multiple additions to his or her medical record over the course of one hospital stay, so there may be several records with the same patient ID and hospital admission ID, but different notes and timestamps.

The Problem

We want to use free text in medical records to predict when a patient will develop sepsis. We know which patients were eventually billed for a sepsis diagnosis, but we don’t want to include notes that were written after the patient developed sepsis, or we might discover useless predictors like “began course of antibiotics for treatment of sepsis”. Once sepsis is suspected, the data is no longer useful for early prediction. How can we remove the records that mention sepsis, as well as all the patient’s other records with later timestamps, even if the later notes do not directly mention sepsis?

The Plan

We are going to solve this problem using PolyAnalyst, Megaputer’s data processing software. The following is a screenshot of a flowchart built using the software. The icons represent individual processing steps. Each icon is described below.

Begin with a dataset of electronic medical records. I’ll use a small sample for testing my workflow first. Node: Sample.
Search for mentions of sepsis in the text, and get a subset containing just those records. These will serve as the cutoff times for the patients they represent. Remember, we want to keep only the notes that the patient received before this time. Node: Search Query and Search Query Subset. The latter can be created by right-clicking on the relevant query in the Search Query node and selecting “Create subset”.
Sort the records by time: earliest dates first. Node: Sort Rows.
Create a new column for timestamp information in each row, called “Earliest Sepsis Mention”. In our case, some timestamp entries were empty and needed to be filled with a more general date. Node: Derive.
For any patient ID / hospital admission ID pair, keep only the first record. You now have a list of patients, each with a value in a column called “Earliest Sepsis Mention,” containing the timestamp for the first record that mentions their sepsis. Node: Aggregate.
Join this dataset (right) with your original dataset (left). Use a left outer join so that for every record in your original dataset, it checks your new list to see if there is a matching patient and hospital admission ID. If there is, it will add in the new information: the timestamp of that patient’s Earliest Sepsis Mention. If that patient and hospital admission are not in the new list (they never had sepsis mentioned in their notes), the Earliest Sepsis Mention will remain blank. Node: Join.
Now for the easy part! Filter rows so that you only keep the records with an empty value in Earliest Sepsis Mention, or where the record’s timestamp is an earlier time than that patient’s Earliest Sepsis Mention. Node: Filter Rows.
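For readers who think in code, the same logic can be sketched in plain Python. This is only an illustration of the idea, not PolyAnalyst code: the record fields (`patient_id`, `admission_id`, `timestamp`, `note`), the function name, and the sample notes are hypothetical, and the sketch assumes every record already has a timestamp.

```python
from datetime import datetime

def drop_records_after_first_mention(records, keyword="sepsis"):
    """Keep only records written before the first note that mentions `keyword`."""
    # Search + Aggregate: earliest mention per (patient, admission) pair.
    earliest_mention = {}
    for rec in records:
        if keyword in rec["note"].lower():
            key = (rec["patient_id"], rec["admission_id"])
            if key not in earliest_mention or rec["timestamp"] < earliest_mention[key]:
                earliest_mention[key] = rec["timestamp"]
    # Join + Filter Rows: look up each record's cutoff (None if sepsis was
    # never mentioned) and keep the record only if it precedes the cutoff.
    kept = []
    for rec in records:
        cutoff = earliest_mention.get((rec["patient_id"], rec["admission_id"]))
        if cutoff is None or rec["timestamp"] < cutoff:
            kept.append(rec)
    return kept

# Invented sample notes for two patients.
records = [
    {"patient_id": 1, "admission_id": "A", "timestamp": datetime(2018, 1, 1, 8), "note": "Fever and elevated heart rate."},
    {"patient_id": 1, "admission_id": "A", "timestamp": datetime(2018, 1, 1, 14), "note": "Suspected sepsis, cultures sent."},
    {"patient_id": 1, "admission_id": "A", "timestamp": datetime(2018, 1, 2, 9), "note": "Began course of antibiotics."},
    {"patient_id": 2, "admission_id": "B", "timestamp": datetime(2018, 1, 3, 10), "note": "Routine post-op check."},
]
training_records = drop_records_after_first_mention(records)
```

Using a dictionary lookup that returns `None` for unmatched patients mirrors the left outer join: patients with no sepsis mention keep all of their records.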

The Result

We searched our data for mentions of sepsis, selected the earliest record for each patient in that list, then applied that answer to all patient records in our original dataset. From there, it was easy to compare the medical note timestamp with the timestamp of the first time sepsis was ever mentioned for that patient, meaning we have a new, smaller dataset where we have eliminated all of the records mentioning sepsis, as well as those patients’ later follow-up records.

In short, we flagged all patient records with timestamps after a particular event in their medical record, so they could be removed from our training data.

The post Flagging documents with timestamps after an event appeared first on Megaputer Intelligence.

What is self-documenting analysis?

Josh Froelich — Wed, 20 Jun 2018 17:00:10 +0000

A classical annoyance of data analysis is the extra work required to properly record what you have done. Despite benefits of enabling others to pick up where you left off, or to help you revisit your work and still understand it later, the tedium of proper documentation tends to result in little to no documentation whatsoever. Subsequently, failure to document forms technical debt. The value of your acquired insight is reduced when you have trouble explaining your results and have trouble recreating findings.

What is technical debt?

Technical debt is a vague term and thus has multiple definitions depending on the context in which it is used. Essentially, it refers to starting a project and not truly finishing it.

The debt that arises from lack of documentation represents a hidden cost to analyzing data without taking the time to explain the process. This debt only materializes later, once the analytical objectives shift or the analysis must be performed again. Encountering a “What was I thinking?” moment requires you to retrace your steps.

Separately, yet similarly ominous, is the widespread problem of not explaining the work. Perhaps it is an artifact of academia, or intellectual hubris. Our goal at Megaputer is to re-envision the cost-benefit analysis that an analyst undertakes when deciding whether to document, and make the costs smaller and the benefits larger.

Avoiding technical debt

One of the first steps is recognizing that data analysis is a process that not only must be created, but also maintained over time. We think the answer lies in making analysis self-documenting¹. An analysis becomes self-documenting when there is no additional effort required to generate the documentation of that analysis. Merely by performing the work, the documentation materializes. Personified, this is similar to living a life that is autobiographical. There is no need to hire a biographer towards the end, as the act of living itself memorializes life.

We designed PolyAnalyst with this principle in mind. Merely by using PolyAnalyst to generate analytical results, a journal of your work is recorded. Not simply a textual log, however. Analysis is done by performing individual analytical steps that you compose together to form an analytical step sequence. A step might be something as simple as moving data from one database to another, or changing around the columns and rows of a table, or generating a machine learning model. This overall sequence (or set of sequences) comprises your analysis. The act of adding a step to the sequence is concurrently the act of curating your record of the steps you have done. In naming the step appropriately, you describe what the step does. Moreover, this overhead of naming steps and sequencing steps is simple. There is very little if any extra burden imposed on you. Indirectly, you are making the decisions of how to explain your work to others upfront, immediately, when you decide to undertake a particular analytical step.
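The principle can be illustrated with a few lines of Python. This is a hypothetical sketch, not PolyAnalyst's API: the `Pipeline` class and its method names are invented. The point is that each step is added together with a name, so the record of what was done exists as soon as the analysis does.

```python
class Pipeline:
    """A sequence of named steps that doubles as its own documentation."""

    def __init__(self):
        self.steps = []

    def step(self, name, func):
        """Add a step; naming it is the only documentation effort required."""
        self.steps.append((name, func))
        return self  # allow chaining

    def run(self, data):
        """Apply every step in order and return the final result."""
        for _, func in self.steps:
            data = func(data)
        return data

    def describe(self):
        """The journal of the analysis, generated from the steps themselves."""
        return [f"{i}. {name}" for i, (name, _) in enumerate(self.steps, 1)]

analysis = (
    Pipeline()
    .step("Drop missing values", lambda rows: [r for r in rows if r is not None])
    .step("Sort ascending", sorted)
    .step("Keep the three largest", lambda rows: rows[-3:])
)
result = analysis.run([5, None, 2, 9, 7, None, 1])
journal = analysis.describe()
```

Changing the analysis means changing the steps, so the journal cannot drift out of date the way a separately written document can.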

By baking the requirements of documentation into the approach, your work becomes naturally self-documenting. We think this is a vital feature in the design of an analytical product. Analytical results do not speak for themselves; they must be accompanied by explanation, and we have designed PolyAnalyst to simplify the task of generating such explanation. When the insights you have generated are easily explainable and readily modifiable, then you have designed with the future in mind, leaving you better equipped to tackle your future analytical tasks, or revisit an old one and pick up right where you left off.

The post What is self-documenting analysis? appeared first on Megaputer Intelligence.