Machine Learning


Abstract— The purpose of this chapter is to provide the reader with an overview of the vast range of applications which have a machine learning problem at their heart, and to bring some degree of order to the zoo of problems.

I. INTRODUCTION

Over the past two decades, Machine Learning has become one of the mainstays of information technology and, with that, a rather central, albeit usually hidden, part of our lives. With ever-increasing amounts of data becoming available, there is good reason to believe that smart data analysis will become even more pervasive as a necessary ingredient of technological progress. Machine learning, a branch of Artificial Intelligence, is about the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders. The core of machine learning deals with representation and generalization. Representation of data instances, and of functions evaluated on these instances, is part of all machine learning systems. Generalization is the property that the system will perform well on unseen data instances; the conditions under which this can be guaranteed are a key object of study in the subfield of computational learning theory.
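As a concrete illustration of this train-then-classify workflow, here is a minimal sketch of a naive Bayes spam filter. It is our illustration rather than anything from the original text: the training messages are made up, and class priors are ignored for brevity.

```python
# Minimal sketch: train a toy naive Bayes spam classifier on labeled
# emails, then classify a new message. Messages are made up; class
# priors are ignored for brevity.
import math
from collections import Counter

training = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch tomorrow at noon", "ham"),
]

counts = {"spam": Counter(), "ham": Counter()}   # per-class word counts
totals = {"spam": 0, "ham": 0}                   # per-class word totals
for text, label in training:
    words = text.lower().split()
    counts[label].update(words)
    totals[label] += len(words)
vocab_size = len({w for text, _ in training for w in text.lower().split()})

def classify(text):
    """Return the label with the higher smoothed log-likelihood."""
    scores = {}
    for label in ("spam", "ham"):
        scores[label] = sum(
            math.log((counts[label][w] + 1) / (totals[label] + vocab_size))
            for w in text.lower().split()
        )
    return max(scores, key=scores.get)

print(classify("buy cheap pills"))  # -> spam
print(classify("notes for lunch"))  # -> ham
```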

II. A TASTE OF MACHINE LEARNING

Machine learning can appear in many guises. We now discuss a number of applications, the types of data they deal with, and finally we formalize the problems in a somewhat more stylized fashion. The latter is key if we want to avoid reinventing the wheel for every new application. Instead, much of the art of machine learning is to reduce a range of fairly disparate problems to a set of fairly narrow prototypes. Much of the science of machine learning is then to solve those problems and provide good guarantees for the solutions.

III. THE SCOPE OF MACHINE LEARNING

Can computers learn? According to a bon mot of Dijkstra, this question is like asking whether submarines can swim. In Machine Learning we do not attempt to mimic human learning. Rather, we say that a computer program "learns" if it automatically improves its performance while doing a task. Many well-defined mathematical problems can be solved by algorithms that explicitly say how to solve them. But other problems are far too complicated for such explicit rules, or are not even completely defined formally. In Machine Learning we apply algorithms as well, but they are not instructions that solve the problems directly. Instead, they consist only of meta-rules that specify how rules shall be derived from data so that (hopefully) correct conclusions are drawn. The course aims at laying general foundations for various Machine Learning algorithms. The main goal is to understand how and why such algorithms work, and to assess them critically. People using Machine Learning methods in applications should not just press buttons without having an idea of what is behind these magic programs, as this can lead to serious misinterpretations. ("A fool with a tool is still a fool.")

Some Representative Problem Examples

Relevant or Influential Attributes. Various medical parameters are known for a patient, and the doctor wants to predict whether the patient will develop a certain disease, or at least estimate the risk. Can a prediction "formula" be learned from the data of many patients? Some parameters may be more influential than others, and some may be totally irrelevant. A prediction algorithm would have to identify, in particular, these risk factors and their relationships. Similarly, if a bank customer asks for a loan, the bank has to predict whether the customer will be able to pay it back, based on knowledge of the customer's situation and on previous experience. After these serious examples we introduce some fun examples of this type of learning problem. One is: weather conditions may be roughly described by the following vector of attributes: (sky, air, humidity, wind, water, forecast). For simplicity, let each attribute be binary, i.e., capable of only two values: sky can be sunny or rainy, air (temperature) can be warm or cold, humidity can be normal or high, wind can be weak or strong, water (temperature) can be warm or cool, and forecast can be "same" or "change" (compared to today's weather). Suppose we want to learn when a certain person enjoys outdoor sport. On several days we observe the following behaviour of the person:

(sunny, warm, normal, strong, warm, same): yes
(sunny, warm, high, strong, warm, same): yes
(rainy, cold, high, strong, warm, change): no
(sunny, warm, high, strong, cool, change): yes

You may already try to guess, without any algorithm, the pattern behind these yes/no decisions. Give a general explanation of how the sportsman behaves, and then reflect on what your explanation is based on. A classical algorithm for exactly this kind of task is sketched below.
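The following is a minimal sketch of the classical Find-S idea applied to the weather data above (our illustration; the chapter itself presents no algorithm here): start from the first positive example as the most specific hypothesis and generalize an attribute to a wildcard whenever a later positive example disagrees with it.

```python
# Sketch of Find-S on the weather data above: maintain the most
# specific conjunctive hypothesis consistent with all positive
# examples; "?" is a wildcard matching any value.
examples = [
    (("sunny", "warm", "normal", "strong", "warm", "same"), True),
    (("sunny", "warm", "high", "strong", "warm", "same"), True),
    (("rainy", "cold", "high", "strong", "warm", "change"), False),
    (("sunny", "warm", "high", "strong", "cool", "change"), True),
]

hypothesis = None
for attributes, enjoys_sport in examples:
    if not enjoys_sport:
        continue  # Find-S ignores negative examples
    if hypothesis is None:
        hypothesis = list(attributes)  # start with the first positive example
    else:
        for i, value in enumerate(attributes):
            if hypothesis[i] != value:
                hypothesis[i] = "?"  # generalize where positives disagree

print(hypothesis)  # -> ['sunny', 'warm', '?', 'strong', '?', '?']
```

Note the built-in assumption: Find-S can only express conjunctions of attribute constraints. This is exactly the kind of inductive bias discussed in Section IV below.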

Some more examples below have been contributed by students. A young child learns: I burn my fingers whenever I touch a hotplate that is on, no matter what mood I have been in or mum has been in, which color my shirt has, etc. All these attributes are irrelevant. (Of course, the child would not express it this way.) Continuing with family matters, consider a married person's task of learning what makes his or her spouse angry. A list of possible offences includes: you did not wash the dishes, you broke the expensive vase, you forgot the kids at day care, you forgot the partner's birthday. Obviously some of these offences are more serious, while others are just venial sins. Can we somehow learn which combinations of mistakes are still acceptable, without really trying all combinations (which is both slow and painful)? Other, more serious learning problems include (hints to potentially relevant attributes are given in parentheses):

- Will this person, or this group of persons, like this film? (genre, country, year, length, etc.)
- In a game of chess, is this situation good for the white player? (positions of all pieces)

Categorizing Text. Is a given email spam or not? Is a given text document potentially interesting for us or not? We would like to decide this automatically by an algorithm. Can we train a computer algorithm to tell apart different categories of text with high accuracy? What could such a classification be based on? One should also notice that the criteria may change over time and need to be adjusted. Here, an instance of the learning problem is a text (e.g., an email), and the desired result is a text category, often just a yes/no answer.

Shapes in Images. Here the challenge is to decide whether a given digital image shows a certain shape or object. Important applications include handwritten character recognition and the understanding of complex images. It would be rather hopeless to give explicit classification rules "manually". Note that the same object may appear in different sizes, positions, and illuminations, or may even be distorted, partially occluded, etc. Another standard task is to partition an image into objects and background. How can we automatically recognize which parts of an image are objects and which are background? (Once this segmentation is done, we may try to classify these objects, etc.) There is a nice story about a robot that enters a pub, recognizes a free seat, orders a beer, starts talking to its neighbour, who happens to be a theoretical physicist, and finally discusses a new theoretical concept with him. The most incredible point in this story is that the robot has recognized a free seat.

Geometric Learning Problems. Often the concept to be learned is a subset of points in a geometric space. (This should not be confused with recognizing shapes in images!) Points are given by real-valued coordinates that may represent numerical attributes of any objects. In the simplest case there is only one numerical parameter, i.e., the space is the real line. For example, suppose that we want to learn which temperatures are suitable for a certain purpose (e.g., temperatures at which a certain plant thrives, or at which some perishable goods can be stored for a long time). When we have data saying that certain temperatures are suitable or not, how do we predict, for other, yet unobserved temperature values, whether they are suitable? What could be a rationale for such predictions? Similar questions can be raised for combinations of several parameters: Which medical parameters indicate a certain disease? Which combinations of geometrical features (sizes, angles, etc.) characterize a certain subspecies of a plant? Instead of an abstract parameter space, a problem can also "live" in real physical space. For example: you want to connect a laptop or another device to a wireless local network. You know that it works at some points, and does not work at some other points. How do you infer the set of points where you will get a signal?

What Next? Given a sequence of numbers, can we predict the next members of this sequence? And on what basis can we make successful predictions? Examples of this type of problem are: Can we predict the price of a stock based on its previous development? What will the weather be on the next day? How will a player behave in the next step?

Laws of Physics. At least in classical physics, a law is typically a formula that describes how different quantities are related to each other. The detection of a new law of physics means learning the formula from measurements. Things are far more complicated in modern physics, but still, making discoveries in physics is in principle learning from data.

Learning Probabilities. If we flip a coin, heads or tails should ideally come up with the same probability 1/2. But the coin may be biased, e.g., heavier on one side, and give heads or tails with probabilities p and 1 − p distinct from 1/2. We can figure out p by repeated coin flips, as in the sketch below. Learning an unknown probability is also a learning problem. In contrast to the previous examples this is a simply structured problem, and the learning aspect comes in only because we do not have direct access to the quantity p we are interested in; we can only infer it indirectly from observations.
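As a quick illustration (ours, not from the original text), the natural estimator for p is the relative frequency of heads; by the law of large numbers it converges to p as the number of flips grows.

```python
# Sketch: estimate an unknown coin bias p from repeated flips.
# The true bias is known only to the simulation, not to the learner.
import random

true_p = 0.7  # hidden from the learner; used only to generate the data
flips = [random.random() < true_p for _ in range(10_000)]

p_hat = sum(flips) / len(flips)  # relative frequency of heads
print(f"estimated p = {p_hat:.3f}")  # close to 0.7 for large samples
```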

One may argue that this is rather a statistical estimation problem. But we should not put up artificial borders between Mathematical Statistics and Machine Learning. Both fields have similar goals, and Machine Learning uses, among others, methods from Statistics. A more sophisticated type of problem is to learn a whole unknown function that is genuinely probabilistic. For example, the genetic disposition of an individual does not determine whether this person will develop a certain disease; rather, the gene variants only influence the probability of getting the disease. Can we learn this probability for every possible combination of gene variants, although only a very limited set of patient data is available?

Formal Definitions of Concept Learning

Concept learning is about learning an unknown function (also called a concept) c : X → Y from some known function values c(x). Here, X is a set of items we are interested in. We call the elements of X instances and the elements of Y labels. The only information we have is a set D of training examples: a training example is a pair (x, c(x)), that is, an instance x ∈ X along with its label c(x). In the frequent case of binary labels, say Y = {0, 1}, we say that x is a positive instance if c(x) = 1, whereas x is a negative instance if c(x) = 0. Moreover, we can equivalently view the concept c as a subset of X, namely the set of all positive instances. Intuitively, we want to generalize what we observed in D. More precisely, we wish to come up with a hypothesis h : X → Y, which is just another concept, and we hope that h = c. A necessary condition for h = c is that h equals c at least on the training data: h(x) = c(x) for all x in D. We call such a hypothesis h consistent with D. The unknown concept c to be learned is also called the target. Mathematically, learning a consistent hypothesis is nothing other than function interpolation. For genuinely probabilistic concepts the above definition is too narrow. We define a probabilistic concept as a function that assigns to each (x, y) ∈ X × Y a conditional probability Pr(y|x) that the label of x will be y. Of course, for any fixed x the values Pr(y|x) must sum to 1. In the case of binary labels, Y = {0, 1}, we can also view a probabilistic concept as a function p : X → [0, 1] that assigns to each x ∈ X the probability p(x) that the label of x is 1. Clearly, x then has the label 0 with probability 1 − p(x). Note that the training data are still pairs (x, y) of instances and their labels. But now we may repeatedly observe the same instance in D, with different labels according to their probabilities. (We assume that instances are sampled independently.) The goal of probabilistic concept learning is to learn these probabilities.
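To make these definitions concrete, here is a small sketch (ours, with made-up instances): checking whether a hypothesis h is consistent with D, and estimating a probabilistic concept by per-instance relative frequencies.

```python
# Sketch: consistency of a hypothesis with training data D, and
# frequency estimates for a probabilistic concept. The instances
# and labels are made up for illustration.
from collections import defaultdict

D = [(1, 1), (2, 0), (3, 1), (1, 1), (2, 0)]  # pairs (x, c(x))

def consistent(h, D):
    """True iff h(x) = c(x) for every training example (x, c(x))."""
    return all(h(x) == y for x, y in D)

def h(x):
    return 1 if x % 2 == 1 else 0  # hypothesis: odd instances are positive

print(consistent(h, D))  # -> True

# Probabilistic concept: estimate p(x) = Pr(label 1 | x) by the
# relative frequency of label 1 among the observations of x.
D_prob = [("a", 1), ("a", 0), ("a", 1), ("b", 0)]  # instance "a" repeats
ones, total = defaultdict(int), defaultdict(int)
for x, y in D_prob:
    ones[x] += y
    total[x] += 1
p_hat = {x: ones[x] / total[x] for x in total}
print(p_hat)  # -> {'a': 0.666..., 'b': 0.0}
```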

In any application of concept learning we have some properties (labels) of the instances in mind, but it is not known, and perhaps hard to characterize, which instances have which label. Of course, concept learning makes sense in practice only if the true labels are not easily accessible or computable, but decisions must be made that depend on predicted values, such as in diagnosis, economy, security, etc. (As opposed to this, it would be silly to build an algorithm that learns to compute a known simple mathematical function from training examples. This has at most didactic value, e.g., for demonstrating a learning algorithm.)

Data Types

We discuss how our introductory examples of learning problems fit into the framework of concept learning. Instances (objects, situations, etc.) are often described by their properties, formally, by the values of several attributes. In this case, the set X of instances is the Cartesian product of the value domains of all these attributes. Attributes can take a finite or infinite set of values. Binary attributes have only two values (yes/no, true/false, present/absent, etc.). Numerical attributes are typically real-valued. The same cases exist for the set Y of labels. In principle, any information about things can be dissolved into binary pieces of information (yes/no decisions). Therefore, in the case of finite X and Y it is not a real restriction to consider only concepts where labels and attributes are binary. Such concepts are known as Boolean functions. The focus on Boolean functions makes some theoretical considerations easier. It is not hard to specify X (and Y) in our examples. Attributes may already be explicitly given (see the sections "Relevant or Influential Attributes" and "Geometric Learning Problems"), and then it is clear what X is. Note that sometimes an instance is described by only one attribute, e.g., one real number. If our instances are texts, we may want to define suitable attributes that specify a text. The first idea is to have an attribute for each position i, where the attribute value is simply the ith character in the text (including spaces). However, most applications are based on words, and dividing a text into words is rather trivial. Thus it is usually better to have an attribute for each word position i, where the attribute value is the ith word. Since texts have different lengths but we want a value for each attribute, we may choose X as the set of all texts of length at most n (for some suitable n) and fill up shorter texts with dummy words, as in the sketch below. If our instances are images (of a certain size), the attributes could simply be the pixels, and their values the grey or color values.
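A minimal sketch of this fill-up scheme (ours; the example text and the choice n = 6 are made up):

```python
# Sketch: represent texts of varying length as fixed-length attribute
# vectors, padding shorter texts with a dummy word.
DUMMY = "<pad>"

def to_attribute_vector(text, n=6):
    """Attribute i holds the i-th word; shorter texts are padded."""
    words = text.lower().split()[:n]  # truncate texts longer than n words
    return tuple(words + [DUMMY] * (n - len(words)))

print(to_attribute_vector("Buy cheap pills now"))
# -> ('buy', 'cheap', 'pills', 'now', '<pad>', '<pad>')
```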

Nothing special needs to be said for the case of probabilistic concepts. Just note that here even the case of a set X with only one instance makes sense. The task of predicting the next member of a sequence can be expressed in the framework of concept learning in different ways. We recommend thinking about that. (What should the instances and labels be, etc.?)

IV. HOW IS MACHINE LEARNING POSSIBLE AT ALL?

Inductive Bias

Remember that we attempt to learn a function c : X → Y from a set D of training examples, where D does not cover all cases. On what basis can we do that? What do the labels of some instances say about the labels of other instances? The depressing answer is: nothing! In principle they could be totally arbitrary. Learning and generalization cannot be based on data only. But in practice we generalize beyond observed data all the time; it is even a vital ability. Obviously, something is missing in our formalization. Let us review some of our examples again. How would we draw conclusions from data? In the weather example we may guess from the data that the label is 1 (yes) if and only if sky is sunny or air is warm. This explanation also seems plausible when we use our real-world knowledge. But do we need both of these conditions, or would one of them be enough (and which?), or either of them? Already these questions cannot be answered from the available data alone, as the sketch below illustrates. Even worse: note that the above explanation is somehow the simplest possible. How can we know that the sportsman's decision is based on such a simple condition? Maybe the unknown concept is much more complicated, but the small data set does not display the more complicated cases, and our hypotheses are only artifacts, incidentally consistent with our data. Apparently we need (and we implicitly make!) additional assumptions on top of the data. Also note that a computer has no real-world knowledge and understanding like humans have. A computer program can only learn formal rules.
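A small sketch (our illustration) makes the ambiguity concrete: even when restricted to single-attribute rules, several different hypotheses are consistent with the four weather observations, and the data alone cannot decide among them.

```python
# Sketch: several distinct hypotheses are consistent with the same
# data. We enumerate rules of the form "yes iff attribute i = v" over
# the weather data and print every rule that agrees with all examples.
examples = [
    (("sunny", "warm", "normal", "strong", "warm", "same"), True),
    (("sunny", "warm", "high", "strong", "warm", "same"), True),
    (("rainy", "cold", "high", "strong", "warm", "change"), False),
    (("sunny", "warm", "high", "strong", "cool", "change"), True),
]
attributes = ["sky", "air", "humidity", "wind", "water", "forecast"]

for i, name in enumerate(attributes):
    for value in {x[i] for x, _ in examples}:
        if all((x[i] == value) == label for x, label in examples):
            print(f"consistent: yes iff {name} = {value}")
# Prints both "yes iff sky = sunny" and "yes iff air = warm": the data
# cannot distinguish between these (or more complex) explanations.
```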

When we categorize text, we need some clue as to what determines the category a text belongs to. For a computer, a text is in the first place only a sequence of symbols. One obvious and frequently used assumption is that the categories have something to do with the words occurring in the text. Note that this is really an extra assumption which is not already "given" in the data. For the automatic recognition of shapes in images it helps to remember that simple geometric curves are described by simple formulae in the coordinates of the points. For segmenting images into objects and background, there is a nice method that assumes that (1) objects have typical colors and (2) objects are relatively large connected regions in a picture. Based on these assumptions, the problem can be reformulated as a standard optimization problem and solved with some success. As this is not an optimization course, we cannot present the method here, but we emphasize once more that learning is based on extra assumptions about the optical appearance of objects. A particularly instructive example is our suitable-temperature scenario (see the section "Geometric Learning Problems"). One would not expect that two different temperatures are suitable but some temperature in between is not. In other words, it is natural to assume that the suitable temperatures form an interval: every temperature between some minimum and maximum value is suitable, and it suffices to learn the two extrema. This is a very strong assumption and will enable us to infer suitable and non-suitable temperatures from a few data points (see the sketch at the end of this section). However, we cannot take this assumption for granted: imagine that some complex chemical reaction works well in some temperature interval, but in some subinterval another reaction happens that disturbs it. Then the suitable temperatures form a set of two intervals, etc. Attempts to predict the next element of a sequence make sense only if we can assume that the sequence follows some rule at all; moreover, this rule must be simple enough in relation to the sequence length. Similarly, physical laws can be discovered only because they are relatively simple mathematical formulae. When we learn probabilities from repeated trials, we have to assume that the data really come from a random source, and that no hidden deterministic mechanism is behind the scenes. From an apple tree we expect that it bears apples next year, and not suddenly pears or plums. Why are we sure about that? Obviously we know something about the biology of fruit trees that strictly excludes such odd possibilities. Pure logic alone cannot rule out these crazy things. From all these examples we get the insight that any learning of concepts from data requires additional assumptions, i.e., some a priori knowledge about the unknown concept. These assumptions are often implicit, but they must be there. This is often neglected. Users of machine learning algorithms should be aware that certain assumptions are made, and that the outcome of the algorithms crucially depends on these assumptions. The set of all assumptions a learning algorithm is based on is called the inductive bias of this algorithm.
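Here is a minimal sketch of the interval bias at work (ours, with made-up measurements): assuming the suitable temperatures form a single interval, we learn just the two extrema from the positive examples and predict every temperature in between to be suitable.

```python
# Sketch: learning suitable temperatures under the inductive bias that
# they form a single interval [lo, hi]. Observations are made-up
# (temperature, suitable?) pairs.
data = [(12.0, False), (17.5, True), (19.0, True), (23.5, True), (30.0, False)]

positives = [t for t, ok in data if ok]
lo, hi = min(positives), max(positives)  # it suffices to learn the extrema

def predict(t):
    """Predict suitability for a possibly unseen temperature t."""
    return lo <= t <= hi

print(predict(21.0))  # -> True, although 21.0 was never observed
print(predict(10.0))  # -> False
```

Note that the negative examples play no role in fitting the interval; they would, however, expose a failure of the bias if a negative temperature ever fell between lo and hi.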

V. ALGORITHM TYPES

Machine learning algorithms can be organized into a taxonomy based on the desired outcome of the algorithm or on the type of input available during training.

- Supervised learning generates a function that maps inputs to desired outputs (also called labels, because they are often provided by human experts labeling the training examples). For example, in a classification problem the learner approximates a function mapping a vector into classes by looking at input-output examples of the function.
- Unsupervised learning models a set of inputs, as in clustering. Here, labels are not known during training.
- Semi-supervised learning combines both labeled and unlabeled examples to generate an appropriate function or classifier. Transduction, or transductive inference, tries to predict new outputs on specific and fixed (test) cases from observed, specific (training) cases.
- Reinforcement learning learns how to act given an observation of the world. Every action has some impact on the environment, and the environment provides feedback in the form of rewards that guides the learning algorithm.

VI. APPLICATIONS

Most readers will be familiar with the concept of web page ranking, that is, the process of submitting a query to a search engine, which then finds webpages relevant to the query and returns them in their order of relevance. To achieve this goal, a search engine needs to 'know' which pages are relevant and which pages match the query. Such knowledge can be gained from several sources: the link structure of webpages, their content, the frequency with which users follow the suggested links of a query, or examples of queries in combination with manually ranked webpages. Increasingly, machine learning rather than guesswork and clever engineering is used to automate the process of designing a good search engine.

A rather related application is collaborative filtering. Internet bookstores such as Amazon, or video rental sites such as Netflix, use this information extensively to entice users to purchase additional goods (or rent more movies). The problem is quite similar to that of web page ranking: as before, we want to obtain a sorted list (in this case, of articles). The key difference is that an explicit query is missing; instead we can only use the past purchase and viewing decisions of the user to predict future viewing and purchase habits. The key side information here is the decisions made by similar users, hence the collaborative nature of the process. It is clearly desirable to have an automatic system to solve this problem, thereby avoiding guesswork and saving time.

An equally ill-defined problem is that of automatic translation of documents. At one extreme, we could aim at fully understanding a text before translating it, using a curated set of rules crafted by a computational linguist well versed in the two languages we would like to translate between. This is a rather arduous task, in particular given that text is not always grammatically correct, nor is the document-understanding part itself a trivial one. Instead, we could simply use examples of translated documents, such as the proceedings of the Canadian parliament or of other multilingual entities (United Nations, European Union, Switzerland), to learn how to translate between the two languages. In other words, we could use examples of translations to learn how to translate. This machine learning approach has proved quite successful.

Many security applications, e.g. for access control, use face recognition as one of their components. That is, given the photo (or video recording) of a person, recognize who this person is. In other words, the system needs to classify the face into one of many categories (Alice, Bob, Charlie, ...) or decide that it is an unknown face. A similar, yet conceptually quite different, problem is that of verification. Here the goal is to verify whether the person in question is who he claims to be. Note that, differently from before, this is now a yes/no question. To deal with different lighting conditions, facial expressions, whether a person is wearing glasses, hairstyle, etc., it is desirable to have a system which learns which features are relevant for identifying a person.

Another application where learning helps is the problem of named entity recognition, that is, the problem of identifying entities, such as places, titles, names, actions, etc., in documents. Such steps are crucial in the automatic digestion and understanding of documents. Some modern e-mail clients, such as Apple's Mail.app, nowadays ship with the ability to identify addresses in mails and file them automatically in an address book. While systems using hand-crafted rules can lead to satisfactory results, it is far more efficient to use examples of marked-up documents to learn such dependencies automatically, in particular if we want to deploy our system in many languages. For instance, while 'bush' and 'rice' are clearly terms from agriculture, it is equally clear that in the context of contemporary politics they refer to members of the Republican Party.

Other applications which take advantage of learning are speech recognition (annotating an audio sequence with text, such as the system shipping with Microsoft Vista), the recognition of handwriting (annotating a sequence of strokes with text, a feature common to many PDAs), trackpads of computers (e.g. Synaptics, a major manufacturer of such pads, derives its name from the synapses of a neural network), the detection of failure in jet engines, avatar behavior in computer games (e.g. Black and White), direct marketing (companies use past purchase behavior to guesstimate whether you might be willing to purchase even more) and our cleaning robots (such as iRobot's Roomba).

The overarching theme of learning problems is that there exists a nontrivial dependence between some observations, which we will commonly refer to as x, and a desired response, which we refer to as y, for which a simple set of deterministic rules is not known. By using learning we can infer such a dependency between x and y in a systematic fashion.

We conclude this section by discussing the problem of classification, since it will serve as a prototypical problem for a significant part of this book. It occurs frequently in practice: for instance, when performing spam filtering, we are interested in a yes/no answer as to whether an e-mail contains relevant information or not. Note that this issue is quite user dependent: for a frequent traveller, e-mails from an airline informing him about recent discounts might be valuable information, whereas for many other recipients they might be more of a nuisance (e.g. when the e-mail relates to products available only overseas). Moreover, the nature of annoying e-mails might change over time, e.g. through the availability of new products (Viagra, Cialis, Levitra, ...), different opportunities for fraud (the Nigerian 419 scam, which took a new twist after the Iraq war), or different data types (e.g. spam which consists mainly of images). To combat these problems we want to build a system which is able to learn how to classify new e-mails.

[Figure: Binary classification — separating stars from diamonds. In this example we are able to do so by drawing a straight line which separates the two sets; this is an important example of what is called a linear classifier.]

A seemingly unrelated problem, that of cancer diagnosis, shares a common structure: given histological data (e.g. from a microarray analysis of a patient's tissue), infer whether a patient is healthy or not. Again, we are asked to generate a yes/no answer given a set of observations.
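The chapter gives no code, but the figure above can be made concrete with a classic linear classifier. The sketch below (ours, with made-up points; not the book's implementation) uses the perceptron learning rule, which finds a separating line w·x + b = 0 whenever the two classes are linearly separable.

```python
# Sketch: the perceptron learning rule for binary classification in
# two dimensions, as in the stars-vs-diamonds figure. Points made up.
points = [((1.0, 2.0), 1), ((2.0, 3.0), 1),    # "stars", label +1
          ((4.0, 0.5), -1), ((5.0, 1.0), -1)]  # "diamonds", label -1

w, b = [0.0, 0.0], 0.0
for _ in range(100):  # a few passes over the data
    errors = 0
    for (x1, x2), y in points:
        if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified point
            w[0] += y * x1  # nudge the separating line toward the point
            w[1] += y * x2
            b += y
            errors += 1
    if errors == 0:  # converged: every point is on its correct side
        break

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1

print(predict(1.5, 2.5))  # -> 1  (star side)
print(predict(5.0, 0.0))  # -> -1 (diamond side)
```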

