(These are excerpts from my book "Intelligence is not Artificial")
Machine Learning before Artificial Intelligence
The mathematics of pattern recognition and classification existed before the invention of the digital computer, but it was the digital computer that made it practical.
A system of pattern recognition operates on a dataset. If the dataset has been manually labeled by humans, the system's learning is called "supervised". If the dataset consists of unlabeled data, the system's learning is "unsupervised".
In supervised learning the system has to learn a model (basically, a generalization) so that it can correctly categorize future instances of each category. In this case "learning" means learning to act based on patterns in the data: for example, to recognize apples or to forecast the effects of chess moves. In unsupervised learning the system has to discover the patterns in the data, i.e. the categories, by itself. For example, after watching millions of videos of cats and dogs, it may divide them into two groups (without being told what a cat and a dog are).
Supervised learning is about recognition, classification, prediction. Unsupervised learning is about clustering. What they have in common is that they are both methods of generalization.
An instance is described, mathematically speaking, by a vector of features. For example, the vector of an image contains features such as edges, shape, color, and texture.
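The difference between the two kinds of learning can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the two-number feature vectors and the "apple"/"banana" labels are made up, the supervised learner is a simple nearest-centroid classifier, and the unsupervised one is k-means with two clusters and a naive initialization.

```python
import math

def centroid(points):
    """Average of a list of equal-length feature vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def fit_centroids(labeled):
    """Supervised: build one centroid per human-provided label."""
    groups = {}
    for vector, label in labeled:
        groups.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in groups.items()}

def predict(model, vector):
    """Assign a new instance to the label with the closest centroid."""
    return min(model, key=lambda label: math.dist(model[label], vector))

def two_means(vectors, iterations=10):
    """Unsupervised: split unlabeled vectors into two groups (k-means, k=2)."""
    centers = [vectors[0], vectors[-1]]  # naive, deterministic initialization
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in vectors:
            nearest = min((0, 1), key=lambda i: math.dist(centers[i], v))
            clusters[nearest].append(v)
        # Keep the old center if a cluster happens to be empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

# Hypothetical dataset: each instance is a vector of two features.
labeled = [
    ((1.0, 1.2), "apple"), ((0.8, 1.0), "apple"), ((1.1, 0.9), "apple"),
    ((4.0, 4.2), "banana"), ((4.1, 3.9), "banana"), ((3.8, 4.0), "banana"),
]

model = fit_centroids(labeled)             # uses the labels
print(predict(model, (0.9, 1.1)))          # a new, unseen instance

unlabeled = [vector for vector, _ in labeled]
clusters = two_means(unlabeled)            # never sees the labels
print(clusters)
```

The supervised learner can name the category of a new instance because humans named the categories first. The unsupervised learner finds the same two groups, but it has no idea what to call them.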
The two fields that studied machine learning before it was called "machine learning" are statistics and optimization.
Some statistical methods for pattern recognition are more than a century old. The British statistician Karl Pearson invented "principal components analysis" in 1901 (unsupervised), later popularized in the USA by Harold Hotelling ("Analysis of a Complex of Statistical Variables into Principal Components", 1933), and then "linear regression" in 1903 (supervised). He was a fascinating character who anticipated Einstein's Relativity by more than a decade in "The Grammar of Science" (1892) and, alas, also anticipated Hitler's ideas on the genetic inferiority of Jews. Contributions to pattern recognition came from different quarters: Ronald Fisher (one of the founders of population genetics) in Britain invented "linear discriminant analysis" in 1936; Joseph Berkson invented the popular "logistic regression" method in 1944 while working in a Minnesota clinic; the "k-nearest-neighbors" (KNN) classifier (aka "minimum distance classifier", aka "proximity algorithm") was invented in 1951 by Evelyn Fix and Joseph Hodges at the US Air Force School of Aviation Medicine in Texas; the Bell Labs physicist Stuart Lloyd invented "k-means clustering" in 1957 for signal processing; etc.
Linear classifiers were particularly popular, such as the "naive Bayes" algorithm, first employed in 1961 for text classification by Melvin Maron at the RAND Corporation and (the same year) by Marvin Minsky for computer vision (in "Steps Toward Artificial Intelligence"); and such as the Rocchio algorithm invented by Joseph Rocchio at Harvard University in 1965.
Naive Bayes uses an approximation of Bayes' theorem that may sound wildly arbitrary (that the effects of a cause do not influence each other, which is like saying that the history of the world is a simple hierarchy of causes and effects); but it turns out that Naive Bayes works incredibly well in most cases, even defying its own limitations, as shown by the Portuguese-born Pedro Domingos at UC Irvine ("Beyond Independence", 1996). Nearest-neighbor methods are the simplest of these algorithms, and became very popular after a theorem by Thomas Cover of Stanford University and his former student Peter Hart showed that they were also more reliable than was apparent ("Nearest Neighbor Pattern Classification", 1967), although the world had to wait almost 50 years for a theorem by Kamalika Chaudhuri and Sanjoy Dasgupta of UC San Diego in order to fully grasp the mathematical properties of nearest-neighbor methods that make them so efficient ("Rates of Convergence for Nearest Neighbor Classification", 2014).
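To make the "naive" assumption concrete, here is a minimal sketch of a Naive Bayes text classifier in Python. It is not a reconstruction of Maron's or Minsky's programs: the tiny "spam"/"ham" corpus, the function names and the add-one smoothing are all assumptions of mine. The independence assumption is the line that simply adds up the log-probability of each word given the label, as if the words had nothing to do with each other.

```python
import math
from collections import Counter, defaultdict

def train(documents):
    """documents: list of (list_of_words, label) pairs labeled by a human."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in documents:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(model, words):
    """Return the label with the highest (log) posterior probability."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_docs)  # log P(label)
        total_words = sum(word_counts[label].values())
        for word in words:
            # Naive step: each word contributes independently of the others.
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training corpus.
corpus = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]

nb_model = train(corpus)
print(classify(nb_model, "buy cheap pills".split()))
print(classify(nb_model, "monday meeting".split()))
```

Words in a real message obviously do influence each other ("cheap" makes "pills" more likely), and yet the classifier only needs the right label to come out on top, not accurate probabilities.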
A linear classifier uses a linear function to map the possible features of the data to a set of possible labels. Frank Rosenblatt's Perceptron of 1957 was also a linear classifier, whereas multilayer neural networks are nonlinear classifiers. Another kind of nonlinear classifier is "decision tree analysis", notably Iterative Dichotomiser 3 (ID3), invented by Ross Quinlan at the University of Sydney in Australia ("Induction of Decision Trees", 1986) and expanded in 1993 into C4.5, which became the benchmark for supervised learning.
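A linear classifier can be sketched as a weighted sum of the features followed by a threshold. The Python code below is a minimal, modern-style illustration of the perceptron learning rule, not Rosenblatt's original formulation; the training set (the logical AND of two inputs, with labels +1 and -1), the learning rate and the number of epochs are assumptions chosen to keep the example small.

```python
def predict(weights, bias, x):
    """Linear decision: sign of the weighted sum of the features."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else -1

def train_perceptron(samples, epochs=20, rate=1.0):
    """Perceptron rule: nudge the weights toward every misclassified sample."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in samples:
            if predict(weights, bias, x) != target:
                weights = [w + rate * target * xi for w, xi in zip(weights, x)]
                bias += rate * target
                errors += 1
        if errors == 0:  # every sample is classified correctly: stop
            break
    return weights, bias

# Hypothetical, linearly separable dataset: logical AND.
and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]

weights, bias = train_perceptron(and_samples)
for x, target in and_samples:
    print(x, predict(weights, bias, x), target)
```

The same procedure never settles on a dataset that no straight line can separate (the exclusive OR is the classic example), and the loop would simply run out of epochs. That is the kind of problem for which nonlinear classifiers are needed.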
These statistical methods were widely employed in computer science. Popular textbooks such as "Introduction to Statistical Pattern Recognition" (1972) by Keinosuke Fukunaga of Purdue University and "Pattern Classification and Scene Analysis" (1973) by Richard Duda and Peter Hart of SRI International provided comprehensive summaries. Linear classifiers remained popular for text classification and, during the dot-com boom, became the method of choice for writing "recommender systems". Paul Resnick built an early one, called GroupLens, at MIT in 1994 based on a KNN algorithm (it was used to recommend articles on Usenet).
None of this was marketed as Artificial Intelligence. Its roots were in statistics.
In the age of "big data" there is a tendency to focus on the data. However, it is important to remember that we frequently learn something from just one example. Show a banana to a child and the child will probably be able to recognize any future banana. Learning is not about "data". Learning is about the ability to form and use concepts. It just so happens that computers are good at processing data, so we invented a mathematics that allows them to overcome the limitations of not having concepts; i.e. the mathematics that allows them to do with information what we do with knowledge.
"Where is the Life we have lost in living? Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?" ("The Rock", Thomas Stearns Eliot).
