Hands On Machine Learning Chapter 1
Why Create This
I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow because I don't feel that I got a good grasp of machine learning the first time I read it, and I skipped the neural network chapters entirely. Since that first read, I have learned more about Python, the Python libraries used in this book, machine learning and artificial intelligence, and programming in general.
This book assumes you know close to nothing about Machine Learning. Its goal is to give you the concepts, the intuitions, and the tools you need to actually implement programs capable of learning from data.
Libraries That Will be Used
- Scikit-Learn
- Very easy to use, yet it implements many Machine Learning algorithms efficiently, so it makes for a great entry point to learn Machine Learning
- TensorFlow
- More complex library for distributed numerical computation. It makes it possible to train and run very large neural networks efficiently by distributing the computations across potentially hundreds of multi-GPU servers. TensorFlow was created at Google and supports many of their large-scale Machine Learning applications. It was open sourced in November 2015.
- Keras
- High-level Deep Learning API that makes it simple to train and run neural networks. It can run on top of TensorFlow, Theano, or Microsoft Cognitive Toolkit (formerly known as CNTK). TensorFlow comes with its own implementation of this API, called tf.keras, which provides support for some advanced TensorFlow features (a minimal sketch of what the API looks like follows).
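As a rough illustration (not from this chapter; the architecture and random data below are arbitrary placeholders), training a tiny network with tf.keras looks like this:
# Minimal tf.keras sketch: the architecture and data are placeholders, not from the book
import numpy as np
import tensorflow as tf

X_train = np.random.rand(100, 4)   # 100 instances with 4 features each
y_train = np.random.rand(100, 1)   # regression targets

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=5, verbose=0)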
This book assumes that you have some Python programming experience and that you are familiar with Python's main scientific libraries, in particular NumPy, Pandas, and Matplotlib.
Chapter 1: The Machine Learning Landscape
This chapter introduces a lot of fundamental concepts (and jargon) that every data scientist should know by heart.
What is Machine Learning?
Machine Learning is the science (and art) of programming computers so they can learn from data.
Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed. - Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. - Tom Mitchell, 1997
Why Use Machine Learning
- For problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better
- For complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution
- For fluctuating environments: a Machine Learning system can adapt to new data
- For getting insights about complex problems and large amounts of data
Types of Machine Learning Systems
- There are so many types of Machine Learning systems that it is useful to classify them based on:
- Whether or not they are trained with human supervision (supervised vs unsupervised vs semi-supervised vs Reinforcement Learning)
- In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels
- A typical supervised task is classification (e.g., a spam filter trained on many example emails along with their class).
- Another typical task is to predict a target numeric value, such as the price of a car given a set of features called predictors. This sort of task is called regression.
- Some regression algorithms can be used for classification as well. Logistic Regression is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class (see the classification sketch after the algorithm list below).
- Important Supervised Learning Algorithms:
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural Networks
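A quick sketch of what supervised classification looks like in Scikit-Learn (the toy dataset below is made up purely for illustration):
# Supervised learning sketch: labeled examples in, a classifier out (synthetic data)
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])  # one feature per instance
y_train = np.array([0, 0, 0, 1, 1, 1])                           # labels (the desired solutions)

clf = LogisticRegression()
clf.fit(X_train, y_train)                  # learn from the labeled examples
print(clf.predict([[2.5], [8.5]]))         # predicted classes for new instances
print(clf.predict_proba([[8.5]]))          # probability of belonging to each class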
- In unsupervised learning, the training data is unlabeled. The system tries to learn without a teacher.
- Important unsupervised learning algorithms:
- Clustering:
- K-Means
- DBSCAN
- Hierarchical Cluster Analysis (HCA)
- Anomaly Detection and Novelty Detection:
- One-class SVM
- Isolation Forest
- Visualization and Dimensionality Reduction
- Principal Component Analysis (PCA)
- Kernel PCA
- Locally-Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association Rule Learning
- Apriori
- Eclat
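To make the clustering idea above concrete, here is a minimal K-Means sketch on synthetic, unlabeled data (the two blobs are made up):
# Unsupervised learning sketch: K-Means discovers groups without any labels
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # one blob of points around (0, 0)
               rng.normal(5, 1, (50, 2))])   # another blob around (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)               # cluster assignments, learned without a teacher
print(kmeans.cluster_centers_)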
- Visualization algorithms try to preserve as much structure as they can, so you can understand how the data is organized and identify unsuspected patterns
- Dimensionality reduction's goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. Merging correlated features together is called feature extraction (see the PCA sketch after this block).
- Dimensionality reduction can help the efficiency of the ML program (memory, speed) and can sometimes improve performance.
- Anomaly detection and novelty detection are used to detect instances that deviate strongly from the normal instances seen during training.
- The goal of association rule learning is to dig into large amounts of data and discover interesting relations between attributes.
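A minimal sketch of feature extraction via PCA, using the book's car example of two correlated attributes (mileage and age) merged into one "wear and tear" feature; the numbers below are made up:
# Dimensionality reduction sketch: PCA merges two correlated features into one
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
mileage = rng.uniform(0, 200_000, 500)
age = mileage / 20_000 + rng.normal(0, 1.0, 500)   # age is strongly correlated with mileage
X = StandardScaler().fit_transform(np.column_stack([mileage, age]))

pca = PCA(n_components=1)                          # keep a single combined component
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)               # most of the variance survives in one feature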
- Semi-supervised learning deals with a lot of unlabeled data and a little labeled data. Most semi-supervised learning algorithms are combinations of unsupervised and supervised algorithms.
- Deep Belief Networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.
- Reinforcement Learning: the learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
In Machine Learning, an attribute is a data type (e.g., "Mileage"), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., "Mileage = 15,000")
- Whether or not they can learn incrementally on the fly (online vs batch processing)
- In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This generally takes a lot of time and computing resources, so it is typically done offline. This is called offline learning.
- You can retrain a batch learning system on a schedule (e.g., daily), but retraining on the full dataset every time takes a lot of computing resources, so it typically costs a lot of money.
- In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly.
- Online learning is great for systems that receive data as a continuous flow and need to adapt to change rapidly or autonomously.
- Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is called out-of-core learning).
- One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate.
- High Learning Rate = more sensitive to new data, forgets old quickly
- Low Learning Rate = system has more inertia
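A minimal online-learning sketch with Scikit-Learn's SGDRegressor, where each loop iteration stands in for a newly arriving mini-batch (the data stream is simulated with random numbers):
# Online learning sketch: the model is updated incrementally, one mini-batch at a time
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01)   # eta0 plays the role of the learning rate
rng = np.random.default_rng(1)
for _ in range(100):                        # pretend each iteration is a new mini-batch arriving
    X_batch = rng.normal(size=(32, 3))
    y_batch = X_batch @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 32)
    model.partial_fit(X_batch, y_batch)     # fast, cheap incremental update

print(model.coef_)                          # should approach [2, -1, 0.5]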
- Whether they work by comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model (instance-based versus model-based learning)
- One more way to categorize Machine Learning systems is how they generalize.
- Instance-Based Learning: the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples using a similarity measure (a k-NN sketch appears after the code example below).
- Model-Based Learning: build a model from a set of examples, then use that model to make predictions.
Some Definitions
- Model Selection - choosing the type of model (e.g., a linear function of one input) to represent the data
- Model Parameters - can be tweaked, affect the model
- Utility Function - measures how good the model is.
- Cost Function - measures how bad the model is
- Training the Model - feeding the training examples to the learning algorithm so that it finds the parameter values that make the model best fit the data.
# Training and running a linear model using Scikit-Learn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli.csv", thousands=",")
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=",",
                             encoding="latin1", na_values="n/a")

# Prepare the data
GDP_COL = "GDP per capita, PPP (constant 2017 international $)"

def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the totals and reshape so each indicator becomes a column
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"] == "TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    # Index the GDP table by country name so the two tables can be joined
    gdp_per_capita = gdp_per_capita.set_index("Entity")
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by=GDP_COL, inplace=True)
    # Set a few countries aside (the book uses them later to illustrate generalization)
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    return full_country_stats[[GDP_COL, "Life satisfaction"]].iloc[keep_indices]

country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats[GDP_COL]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind="scatter", x=GDP_COL, y="Life satisfaction")
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))  # Cyprus' predicted life satisfaction
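For contrast with the model-based example above, the book points out that swapping in k-Nearest Neighbors regression turns this into instance-based learning. Reusing X, y, and X_new from the code above, only the model selection changes:
# Instance-based alternative: predictions come from the most similar training examples
import sklearn.neighbors

model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)
print(model.predict(X_new))   # roughly the average life satisfaction of the 3 most similar countries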
- In Summary:
- You studied the data
- You selected a model
- You trained it on the training data (i.e., the learning algorithm searched for the model parameters that minimize a cost function)
- You applied the model to make predictions on new cases (this is called inference, hoping that this model will generalize well)
Main Challenges of Machine learning
- The two things that can go wrong are "bad algorithm" and "bad data"
In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different Machine Learning algorithms performed almost identically well on a complex problem of natural language disambiguation once they were given enough data. [...] The idea that data matters more than algorithms for complex problems was further popularized by Peter Norvig et al. in a paper titled "The Unreasonable Effectiveness of Data" published in 2009.
Bad Data
Nonrepresentative Training Data
In order to generalize well, it is crucial that your training data be representative of the new cases that you want to generalize to. This is true whether you use instance-based or model-based learning.
If the sample is too small, you will have sampling noise (nonrepresentative data as a result of chance). Even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.
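A tiny illustration of sampling noise (the population values here are made up): small samples of the same population give noticeably different estimates, while large samples agree closely.
# Sampling noise sketch: small samples vary a lot by chance alone
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)

for n in (10, 10_000):
    sample_means = [rng.choice(population, size=n).mean() for _ in range(5)]
    print(n, np.round(sample_means, 2))   # the n=10 means scatter far more than the n=10,000 means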
Poor-Quality Data
It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. For example:
- If some instances are clearly outliers, it may help to simply discard them or to try to fix the errors manually.
- If some instances are missing a few features, you must decide whether to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median), or train one model with the feature and one model without it.
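A minimal sketch of those options with pandas (the DataFrame and column names are hypothetical):
# Handling missing values: drop the attribute, drop the instances, or fill the gaps
import numpy as np
import pandas as pd

df = pd.DataFrame({"rooms": [3, 4, np.nan, 5], "price": [200, 250, 230, 300]})

without_feature = df.drop(columns=["rooms"])           # ignore the attribute altogether
without_instances = df.dropna(subset=["rooms"])        # ignore the affected instances
filled = df.fillna({"rooms": df["rooms"].median()})    # fill in the missing values
print(filled)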
Irrelevant features
A critical part of the success of a Machine learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:
- Feature Selection: selecting the most useful features to train on among existing features
- Feature Extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help)
- Creating new features by gathering new data
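A small sketch of feature extraction and feature selection (the housing-style columns below are hypothetical, not the book's dataset):
# Feature engineering sketch: create a combined feature, then keep only the most useful ones
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

housing = pd.DataFrame({
    "total_rooms":   [800, 1200, 600, 1500, 900],
    "households":    [200, 300, 150, 400, 220],
    "median_income": [3.2, 5.1, 2.4, 6.0, 3.8],
    "price":         [150, 320, 120, 400, 210]})

# Feature extraction: combine two raw attributes into a more informative one
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]

# Feature selection: keep the k features most related to the target
X = housing.drop(columns=["price"])
selector = SelectKBest(score_func=f_regression, k=2).fit(X, housing["price"])
print(list(X.columns[selector.get_support()]))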
Overfitting the Training Data
Overfitting means that the model performs well on the training data, but it does not generalize well.
Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:
- To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
- To gather more training data
- To reduce the noise of the training data (fix data errors and remove outliers)
Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. A model's degrees of freedom correspond to the parameters it is free to adjust, and regularization effectively reduces them. The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of the learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training. Tuning hyperparameters is an important part of building a Machine Learning system.
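A quick sketch of a regularization hyperparameter in action, using Ridge regression on synthetic data (alpha is set before training and stays fixed while the model parameters are learned):
# Regularization sketch: larger alpha constrains the model more, shrinking its weights
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X[:, 0] + rng.normal(0, 0.5, 50)

for alpha in (0.01, 1.0, 100.0):                 # alpha is the hyperparameter
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))       # weights shrink toward zero as alpha grows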
Underfitting the Training Data
Underfitting occurs when your model is too simple to learn the underlying structure of the data.
The main options to fix this problem are:
- Selecting a more powerful model, with more parameters
- Feeding better features to the learning algorithm (feature engineering)
- Reducing the constraints on the model (reducing the regularization hyperparameter)
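A minimal underfitting sketch on synthetic data: a plain linear model cannot capture a curved pattern, but giving it polynomial features (a more powerful model) fixes that.
# Underfitting sketch: more parameters let the model capture the underlying structure
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (100, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.2, 100)   # quadratic relationship

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(linear.score(X, y), poly.score(X, y))        # the more powerful model fits far better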
Testing and Validating
The only way to know how well a model will generalize to new cases is to actually try it out on new cases. Splitting your data into a training set and a test set lets you test the generalization ability of your model. The error rate on new cases is called the generalization error or out-of-sample error.
If your training error is low but the generalization error is high, it means that your model is overfitting the training data. It is common to use an 80/20 train/test split. However, if you have millions of training samples, you can hold out a much smaller fraction (e.g., 99/1), since even 1% of the data is still a large test set.
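A minimal sketch of estimating the generalization error with a held-out test set (synthetic data):
# Train/test split sketch: the test-set error estimates the generalization error
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.3, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_train, model.predict(X_train)))   # training error
print(mean_squared_error(y_test, model.predict(X_test)))     # estimate of the generalization error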
Hyperparameter Tuning and Model Selection
How to Select the Best Model and the Hyperparameters?
Holdout validation - holding out part of the training set to evaluate several candidate models and select the best one. The held-out set is called the validation set (or development set, or dev set). More specifically, you train multiple models with various hyperparameters on the reduced training set and select the model that performs best on the validation set. After this holdout validation process, you train the best model on the full training set, and this gives you the final model.
Cross-validation - each model is evaluated once per validation set, after it is trained on the rest of the data. By averaging out all the evaluations of a model, you get a much more accurate measure of its performance. The drawback is that training time is multiplied by the number of validation sets.
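A minimal cross-validation sketch for comparing candidate hyperparameter values (synthetic data; Ridge's alpha stands in for any hyperparameter):
# Cross-validation sketch: the average score across folds picks between candidates
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)

for alpha in (0.1, 1.0, 10.0):                         # candidate hyperparameter values
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)
    print(alpha, scores.mean())                        # pick the value with the best average score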
A model is a simplified version of the observations. The simplifications are meant to discard superfluous details that are unlikely to generalize to new instances. However, to decide what data to discard and what data to keep, you must make assumptions. [...] In a famous 1996 paper, David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch theorem. [...] There is no model that is a priori guaranteed to work better. The only way to know for sure which model is best is to evaluate them all.