Thursday, May 5, 2016

A brief overview of machine learning

This article is a brief introduction to machine learning, for those wishing to have some idea of what the term entails.

Arthur Samuel, in his classic 1959 paper "Some Studies in Machine Learning Using the Game of Checkers", stated the goal of machine learning in the following terms: "...it is necessary to specify methods of problem solution in minute and exact detail, a time-consuming and costly procedure. Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort."
Many decades later, two commonly employed categories of machine learning algorithm are supervised and unsupervised learning:

  • Supervised learning
    • The algorithm is presented with a body of training input. 
      • Each input has a human-designated label. 
    • After the training process is complete, when presented with an unlabeled input, the data structure produces a label. 
      • Ideally, that label will be what the human labeler "would have picked" for that input.
  • Unsupervised learning
    • The algorithm is presented with a body of training input.
      • The inputs do not have human-designated labels.
    • A data structure is built that associates inputs that "look similar".
      • This similarity is determined by a user-submitted distance metric.
    • Each group of "similar" inputs is referred to as a cluster.
      • Each cluster is defined in terms of a reference input. This reference input is typically calculated as a weighted average of the inputs belonging to the cluster.
    • After the training process is complete, when presented with a new input, the data structure indicates the cluster to which it belongs.
      • This is typically calculated by finding the distance between the new input and the reference input for each cluster.
      • The reference input with the shortest distance to the new input determines the winning cluster.
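The winning-cluster rule described above can be sketched as follows. This is a minimal illustration with hypothetical names; `math.dist` (Euclidean distance) stands in for the user-submitted distance metric:

```python
import math

def assign_cluster(new_input, reference_inputs, distance=math.dist):
    """Return the index of the cluster whose reference input is
    closest to new_input under the given distance metric."""
    # Measure the distance from the new input to each cluster's
    # reference input; the cluster with the shortest distance wins.
    return min(range(len(reference_inputs)),
               key=lambda i: distance(new_input, reference_inputs[i]))
```

For example, given reference inputs `(1, 1)` and `(5, 5)`, the input `(0, 0)` would be assigned to the first cluster.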
Which type you use depends on the problem to be solved and the data that is available. 

Revisiting our quotation from Arthur Samuel, the goal of machine learning is to "eventually eliminate the need" for "specify[ing] methods of problem solution in minute and exact detail, a time-consuming and costly procedure." 

Seen in this light, if labeled data is available, supervised learning is often preferable. Since the learned data structure yields a symbolic label for each input, the software in which the learner is embedded can be directly programmed to respond in specific ways that reflect the known semantics of each label.

For example, if one wishes to create a spam filter, supervised learning is a great approach. The labels are easily obtained, as the email user assigns labels to spam emails by putting them in a junk folder. In time, the system should improve its ability to predict whether a given email is spam or not.
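To make the spam-filter example concrete, here is a from-scratch sketch of a naive Bayesian classifier (an algorithm listed later in this post) trained on labeled emails. All names are hypothetical, "ham" is the conventional label for non-spam, and the word-splitting is deliberately simplistic:

```python
import math
from collections import Counter

def train(labeled_emails):
    """labeled_emails: list of (text, label) pairs, where each label
    is 'spam' or 'ham'. Returns per-label word counts and label counts."""
    word_counts = {'spam': Counter(), 'ham': Counter()}
    label_counts = Counter()
    for text, label in labeled_emails:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label with the higher log-posterior under a naive
    Bayes model with Laplace (add-one) smoothing."""
    vocab = set(word_counts['spam']) | set(word_counts['ham'])
    best_label, best_score = None, float('-inf')
    for label in label_counts:
        total = sum(word_counts[label].values())
        # Log-prior for this label, plus a log-likelihood per word.
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) /
                              (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Each email the user moves to the junk folder becomes another labeled training example, which is how the system improves over time.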

Now, if the available data has no labels, unsupervised learning is the only choice. It often proves useful to examine the reference inputs for the clusters that are learned as a first step in getting a deeper understanding of the data. This is, in practice, probably the most common use case for unsupervised learning.

If, however, the unsupervised learner is embedded in a larger software system, some additional human input will be needed to assign workable semantics to the clusters that were learned. This concept intrigues me and has been a focus of my work in recent years. One example is a clustering system I created to facilitate building robot behaviors. I presented the work at MAICS-2014. The robot drives around and builds clusters based on the images it sees. The user then assigns a move to each reference input. By examining the reference inputs, the human is able to determine an appropriate semantics for that cluster.

Of course, it would be really cool to have the robot automatically assign the actions to the clusters based on the input from a human pilot. I have made some tentative steps in that direction, but it remains a work in progress.

Some commonly-used supervised learning algorithms include:
  • Decision trees
  • Multi-layer perceptrons 
    • Also called "feed-forward neural networks"
    • "Deep learning" systems fall into this category
  • Support vector machines
  • Naive Bayesian classifiers
  • k-nearest-neighbor
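As a taste of how simple some of these supervised learners are, here is a minimal from-scratch sketch of k-nearest-neighbor classification, assuming numeric feature vectors and Euclidean distance:

```python
import math
from collections import Counter

def knn_predict(query, training_data, k=3):
    """Predict a label for query by majority vote among the k
    training examples closest to it (Euclidean distance).
    training_data: list of (point, label) pairs."""
    # Sort the training examples by distance to the query and keep k.
    neighbors = sorted(training_data,
                       key=lambda pair: math.dist(query, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Note that there is no training step at all: the "data structure" is simply the labeled examples themselves.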
Well-known unsupervised learners include:
  • Kohonen Self-Organizing Map (SOM)
  • k-means clustering
  • Growing Neural Gas (GNG)
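On the unsupervised side, k-means clustering is a minimal example of the reference-input idea described earlier: each cluster's reference input is the mean of its members. Below is a from-scratch sketch of Lloyd's algorithm, initialized with the first k points for determinism (real implementations usually choose random or more careful initial centers):

```python
import math

def k_means(points, k, iterations=10):
    """Alternate between assigning each point to its nearest center
    and recomputing each center as the mean of its assigned points.
    The final centers serve as the clusters' reference inputs."""
    centers = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old center if the cluster is empty
                centers[i] = tuple(sum(coord) / len(cluster)
                                   for coord in zip(*cluster))
    return centers
```

Examining the returned centers, as suggested above, is often the first step toward understanding an unlabeled data set.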


  1. So, here's a stupid question that's probably not answerable. When a student comes up to me and says "I want to study machine learning" (as an independent study or something, presumably), is this kind of high level approach what they're talking about?

    I think I've been scared to go into it because there's so much in the field that I don't know, but yeah, knowing the big ideas is pretty important.

    1. Yes, I think this would be an excellent high-level structure for an independent study or the like. The sheer diversity of algorithms and approaches is intimidating, but I think a good student experience could be devised by having the student explore several examples of machine learning algorithms. That exploration could involve either outright implementation or perhaps experimentation with an already-implemented system. Aside from Support Vector Machines and "deep" neural nets, everything I listed in this post could easily be implemented from scratch by a capable student.