Hands On Machine Learning Chapter 1

Why Create This

I am re-reading Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel that I got a good grasp of machine learning the first time through, and I skipped the neural network chapters. Since that first reading, I have learned more about Python, the Python libraries used in this book, machine learning and artificial intelligence, and programming in general.

This book assumes you know close to nothing about Machine Learning. Its goal is to give you the concepts, the intuitions, and the tools you need to actually implement programs capable of learning from data.

Libraries That Will be Used

  • Scikit-Learn
    • Very easy to use, yet it implements many Machine Learning algorithms efficiently, so it makes for a great entry point to learn Machine Learning
  • TensorFlow
    • More complex library for distributed numerical computation. It makes it possible to train and run very large neural networks efficiently by distributing the computations across potentially hundreds of multi-GPU servers. TensorFlow was created at Google and supports many of their large-scale Machine Learning applications. It was open sourced in November 2015.
  • Keras
    • High-level Deep Learning API that makes it simple to train and run neural networks. It can run on top of either TensorFlow, Theano or Microsoft Cognitive Toolkit (formerly known as CNTK). TensorFlow comes with its own implementation of this API, called tf.keras, which provides support for some advanced TensorFlow features.

This book assumes that you have some Python programming experience and that you are familiar with Python's main scientific libraries, in particular NumPy, pandas, and Matplotlib.

Chapter 1: The Machine Learning Landscape

This chapter introduces a lot of fundamental concepts (and jargon) that every data scientist should know by heart.

What is Machine Learning?

Machine Learning is the science (and art) of programming computers so they can learn from data.

Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed. - Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. - Tom Mitchell, 1997

Why Use Machine Learning

  1. For problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better
  2. For complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution
  3. For fluctuating environments: a Machine Learning system can adapt to new data
  4. For getting insights about complex problems and large amounts of data

Types of Machine Learning Systems

  • There are so many types of Machine Learning systems that it is useful to classify them based on:
  1. Whether or not they are trained with human supervision (supervised vs unsupervised vs semi-supervised vs Reinforcement Learning)
  • In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels
    • A typical supervised task is classification.
    • Another typical task is to predict a target numeric value, such as the price of a car given a set of features called predictors. This sort of task is called regression.
      • Some regression algorithms can be used for classification as well. Logistic Regression is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class.
    • Important Supervised Learning Algorithms:
      • k-Nearest Neighbors
      • Linear Regression
      • Logistic Regression
      • Support Vector Machines (SVMs)
      • Decision Trees and Random Forests
      • Neural Networks
  • In unsupervised learning, the training data is unlabeled. The system tries to learn without a teacher.
    • Important unsupervised learning algorithms:
      • Clustering:
        • K-Means
        • DBSCAN
        • Hierarchical Cluster Analysis (HCA)
      • Anomaly Detection and Novelty Detection:
        • One-class SVM
        • Isolation Forest
      • Visualization and Dimensionality Reduction
        • Principal Component Analysis (PCA)
        • Kernel PCA
        • Locally-Linear Embedding (LLE)
        • t-distributed Stochastic Neighbor Embedding (t-SNE)
      • Association Rule Learning
        • Apriori
        • Eclat
    • Visualization algorithms try to preserve as much structure as they can, so you can understand how the data is organized and identify unsuspected patterns
    • Dimensionality reduction's goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. Merging correlated features together is called feature extraction.
      • Dimensionality reduction can help the efficiency of the ML program (memory, speed) and can sometimes improve performance.
    • Anomaly detection and novelty detection are used to detect data points that look very different from the normal data points that were used in training.
    • The goal of association rule learning is to dig into large amounts of data and discover interesting relations between attributes.
  • Semi-supervised Learning deals with a lot of unlabeled data and some labeled data. Most semi-supervised algorithms are combinations of unsupervised and supervised learning algorithms.
    • Deep Belief Networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.
  • Reinforcement Learning: the learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what the best strategy is, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
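
To make the supervised/unsupervised distinction above concrete, here is a minimal sketch (not from the book) that uses the same synthetic points twice: once with labels for Logistic Regression, and once without labels for K-Means. The data from make_blobs and the query point are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): 200 points in two well-separated groups.
X, y = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=42)

# Supervised: the labels y are part of the training data.
clf = LogisticRegression()
clf.fit(X, y)
# Logistic Regression outputs a probability for each class.
proba = clf.predict_proba([[2.5, 3.0]])
print("Class probabilities:", proba)

# Unsupervised: only X is given; K-Means has to find the groups by itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)
print("Cluster sizes:", np.bincount(cluster_ids))
```

Note that the cluster ids found by K-Means are arbitrary: cluster 0 does not have to correspond to class 0.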

    In Machine Learning, an attribute is a data type (e.g., "Mileage"), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., "Mileage = 15,000").

  2. Whether or not they can learn incrementally on the fly (online vs batch learning)
  • In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will typically take a lot of time and computing resources, so it is typically done offline. This is called offline learning.
    • You can train a batch learning system daily, but this typically costs a lot of money.
  • In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly.
    • Online learning is great for systems that receive data as a continuous flow and need to adapt to change rapidly or autonomously.
    • Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory.
    • One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate.
      • High Learning Rate = more sensitive to new data, forgets old quickly
      • Low Learning Rate = system has more inertia
  3. Whether they work by comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model (instance-based versus model-based learning)
  • One more way to categorize Machine Learning systems is how they generalize.
  • Instance-Based Learning: the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples, using a similarity measure.
  • Model-Based Learning: build a model of the examples, then use that model to make predictions
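
The online learning idea described above (mini-batches, one cheap update per step, a learning rate) can be sketched with scikit-learn's SGDRegressor and its partial_fit method. This is my own illustration, not code from the book: the data stream next_mini_batch is hypothetical, and the learning rate eta0=0.05 is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

def next_mini_batch(size=32):
    """Hypothetical data stream: y = 3*x + 2 plus a little noise."""
    X = rng.uniform(-1, 1, size=(size, 1))
    y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=size)
    return X, y

# eta0 is the learning rate; "constant" keeps it fixed during training.
model = SGDRegressor(learning_rate="constant", eta0=0.05, random_state=0)

for step in range(500):
    X_batch, y_batch = next_mini_batch()
    # One incremental update; earlier batches are not needed again.
    model.partial_fit(X_batch, y_batch)

slope = model.coef_[0]
intercept = model.intercept_[0]
print("slope:", slope, "intercept:", intercept)
```

Because each mini-batch can be discarded after the update, the same loop would work for a dataset that does not fit in memory, as long as it can be read in chunks.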

Some Definitions

  • Model Selection - choosing the type of function used to represent the data (e.g., a linear model)
  • Model Parameters - values that can be tweaked and that determine the model's predictions
  • Utility Function - measures how good the model is.
  • Cost Function - measures how bad the model is
  • Training Model - feeding examples to the model to find the parameters that best fit the data.
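
To illustrate the cost function and training definitions above, here is a small sketch with made-up data. It assumes a linear model with two parameters (theta0 and theta1) and uses the mean squared error as the cost; np.polyfit stands in for the training step, since it finds the least-squares line.

```python
import numpy as np

# Toy data (made up for illustration): roughly y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

def predict(theta0, theta1, x):
    """Linear model: theta0 is the intercept, theta1 is the slope."""
    return theta0 + theta1 * x

def mse_cost(theta0, theta1, x, y):
    """Cost function: mean squared error (lower is better)."""
    return np.mean((predict(theta0, theta1, x) - y) ** 2)

# Two candidate parameter settings: the second fits the data much better.
cost_bad = mse_cost(0.0, 1.0, x, y)
cost_good = mse_cost(1.0, 2.0, x, y)
print(cost_bad, cost_good)

# "Training": find the parameters that minimize the cost (least squares).
theta1_best, theta0_best = np.polyfit(x, y, deg=1)
cost_best = mse_cost(theta0_best, theta1_best, x, y)
print(theta0_best, theta1_best, cost_best)
```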
# Training and running a linear model using Scikit-Learn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model
# Load the data
oecd_bli = pd.read_csv("oecd_bli.csv", thousands=",")
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=",", encoding="latin1", na_values="n/a")
# Prepare the data
def prepare_country_stats(oecd_bli, gdp_per_capita):
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    gdp_per_capita.set_index("Entity", inplace=True)
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by="GDP per capita, PPP (constant 2017 international $)", inplace=True)
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita, PPP (constant 2017 international $)", 'Life satisfaction']].iloc[keep_indices]


country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita, PPP (constant 2017 international $)"]]
y = np.c_[country_stats["Life satisfaction"]]
# Visualize the data
country_stats.plot(kind="scatter", x="GDP per capita, PPP (constant 2017 international $)", y="Life satisfaction")
plt.show()
# Select a linear model
model = sklearn.linear_model.LinearRegression()
# Train the model
model.fit(X, y)
# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))
Output:

[Figure: scatter plot of Life satisfaction against GDP per capita, PPP (constant 2017 international $)]

[[-1.38704323]]

  • In Summary:
    • You studied the data
    • You selected a model
    • You trained it on the training data (i.e., the learning algorithm searched for the model parameters that minimize a cost function)
    • You applied the model to make predictions on new cases (this is called inference), hoping that this model will generalize well
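
The steps in the summary are the same for an instance-based algorithm. As a sketch, the snippet below fits both a linear model and k-Nearest Neighbors regression (with k = 3) and asks each for a prediction. The data here is a synthetic stand-in that I generated, not the OECD and GDP files used above, so the numbers it prints are only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for (GDP per capita, life satisfaction); not real data.
rng = np.random.default_rng(1)
X = rng.uniform(10_000, 60_000, size=(30, 1))
y = 4.8 + 5e-5 * X[:, 0] + rng.normal(0, 0.2, size=30)

X_new = [[22_587]]

# Model-based: learn the parameters of a line, then use the line.
linear = LinearRegression().fit(X, y)
linear_pred = linear.predict(X_new)[0]

# Instance-based: average the targets of the 3 most similar training examples.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
knn_pred = knn.predict(X_new)[0]

print("Linear model:", linear_pred)
print("3 nearest neighbors:", knn_pred)
```

The only line that really changes between the two approaches is the one that creates the model; fit and predict are used the same way.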

Main Challenges of Machine learning

  • The two things that can go wrong are "bad algorithm" and "bad data"

    In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different Machine Learning algorithms performed almost identically well on a complex problem of natural language disambiguation once they were given enough data. [...] The idea that data matters more than algorithms for complex problems was further popularized by Peter Norvig et al. in a paper titled "The Unreasonable Effectiveness of Data" published in 2009.

Bad Data

Nonrepresentative Training Data

In order to generalize well, it is crucial that your training data be representative of the new cases that you want to generalize to. This is true whether you use instance based or model based learning.
If the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance). Even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.

Poor-Quality Data

It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. For example:

  • If some instances are clearly outliers, it may help to simply discard them or to try to fix the errors manually.
  • If some instances are missing a few features, you must decide whether to ignore this attribute altogether, to fill in the missing values, or train one model with the feature and one model without it.
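
Two of the options for missing features listed above can be sketched with pandas. The tiny car table below is made up for illustration, and filling with the median is just one possible choice of fill value.

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing values (NaN).
df = pd.DataFrame({
    "mileage": [15_000, 42_000, np.nan, 88_000, 30_000],
    "age_years": [1, 3, 2, 7, np.nan],
    "price": [21_000, 16_500, 18_000, 9_000, 17_000],
})

# Option 1: ignore the attribute altogether (drop the whole column).
without_mileage = df.drop(columns=["mileage"])

# Option 2: fill in the missing values, here with each column's median.
filled = df.fillna(df.median())

print(without_mileage)
print(filled)
```

If you fill with the median, the median should be computed on the training set only and then reused for new data.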

Irrelevant features

A critical part of the success of a Machine learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:

  • Feature Selection: Selecting the most useful features to train among existing features
  • Feature Extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help)
  • Creating new features by gathering new data
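
Here is a rough sketch of the first two feature engineering steps above, using scikit-learn tools that I picked for illustration: SelectKBest for feature selection and PCA for feature extraction. The data is synthetic, and the feature names (mileage, age, "wear") are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Feature selection: one useful feature hidden among two irrelevant ones.
useful = rng.normal(size=n)
X = np.column_stack([useful, rng.normal(size=n), rng.normal(size=n)])
y = 3 * useful + rng.normal(scale=0.1, size=n)

selector = SelectKBest(score_func=f_regression, k=1)
X_selected = selector.fit_transform(X, y)
selected_columns = selector.get_support(indices=True)
print("Selected column indices:", selected_columns)

# Feature extraction: merge two correlated features into a single one.
mileage = rng.uniform(0, 100_000, size=n)
age = mileage / 12_000 + rng.normal(scale=0.3, size=n)
car_features = StandardScaler().fit_transform(np.column_stack([mileage, age]))

pca = PCA(n_components=1)
wear = pca.fit_transform(car_features)  # one combined "wear" feature
explained = pca.explained_variance_ratio_[0]
print("Variance kept by the single feature:", explained)
```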

Overfitting the Training Data

Overfitting means that the model performs well on the training data, but it does not generalize well.
Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:

  1. To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
  2. To gather more training data
  3. To reduce the noise of the training data (fix data errors and remove outliers)
    Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. Degree of Freedom of model = number of parameters? The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training. Tuning hyperparameters is an important part of building a Machine Learning system.
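
As a sketch of regularization, the snippet below fits the same high-degree polynomial model twice on a small noisy dataset: once unconstrained and once with Ridge regression. Ridge is my choice of regularized model here, and its alpha argument plays the role of the regularization hyperparameter; the data and the degree are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Small noisy dataset whose true structure is just a straight line.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(20, 1))
y = 0.5 * X[:, 0] + 1 + rng.normal(scale=0.5, size=20)

def fit_poly(regressor):
    """Fit a degree-10 polynomial model using the given regressor."""
    model = make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        StandardScaler(),
        regressor,
    )
    return model.fit(X, y)

unconstrained = fit_poly(LinearRegression())
# alpha is a hyperparameter: set before training, constant during training.
regularized = fit_poly(Ridge(alpha=1.0))

# Regularization shrinks the learned parameters (the model is constrained).
unconstrained_norm = np.linalg.norm(unconstrained[-1].coef_)
regularized_norm = np.linalg.norm(regularized[-1].coef_)
print(unconstrained_norm, regularized_norm)
```

A larger alpha constrains the model more; alpha=0 would bring back the unconstrained fit.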

Underfitting the Training Data

Underfitting occurs when your model is too simple to learn the underlying structure of the data.
The main options to fix this problem are:

  • Selecting a more powerful model, with more parameters
  • Feeding better features to the learning algorithm (feature engineering)
  • Reducing the constraints on the model (reducing the regularization hyperparameter)
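
A quick sketch of underfitting and the first fix above, assuming synthetic data with a quadratic shape: a plain linear model is too simple for it, while a model with one extra parameter (a squared term) fits much better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic structure (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=0.3, size=100)

# Too simple: a straight line cannot follow the curve.
too_simple = LinearRegression().fit(X, y)

# More powerful: same algorithm, but with an added x^2 feature.
more_powerful = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)

# R^2 on the training data itself: underfitting shows up even here.
r2_simple = too_simple.score(X, y)
r2_powerful = more_powerful.score(X, y)
print(r2_simple, r2_powerful)
```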

Testing and Validating

The only way to know how well a model will generalize to new cases is to actually try it out on new cases. Splitting your data into a training set and a test set allows you to test the generalization ability of your model. The error rate on new cases is called the generalization error or out-of-sample error.
If your training error is low, but the generalization error is high, it means that your model is overfitting the training data. It is common to use an 80/20 train/test split. However, if you have millions of training samples, you can use a higher train/test ratio, since even a small percentage held out for testing is still a large test set.
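
The 80/20 split described above can be sketched with scikit-learn's train_test_split. The dataset is synthetic, and mean squared error is my choice of error measure for this regression example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 2 * X[:, 0] + 1 + rng.normal(scale=1.0, size=500)

# Hold out 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
# The error on the test set is an estimate of the generalization error.
generalization_error = mean_squared_error(y_test, model.predict(X_test))
print(train_error, generalization_error)
```

A training error much lower than the test error would be the sign of overfitting mentioned above.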

Hyperparameter Tuning and Model Selection

How to Select the Best Model and the Hyperparameters?
Holdout validation - holding out part of the training set to evaluate several candidate models and select the best one. The new held-out set is called the validation set (or development set or dev set). More specifically, you train multiple models with various hyperparameters on the reduced training set, and you select the model that performs the best on the validation set. After this holdout validation process, you train the best model on the full training set and this gives you the final model.
Cross-validation - uses many small validation sets. Each model is evaluated once per validation set, after it is trained on the rest of the data. Averaging all the evaluations of a model gives a much more accurate measure of its performance, but the training time is multiplied by the number of validation sets.
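
Both procedures above can be sketched in a few lines. This is my own illustration on synthetic data: the candidate models are Ridge regressions that differ only in the hyperparameter alpha, and the candidate values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: 5 features, two of them irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# Test set: set aside first, used only once at the very end.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Validation set: held out from the training set to compare candidates.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)

# Holdout validation: train each candidate on the reduced training set.
candidate_alphas = [0.01, 1.0, 100.0]
val_scores = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_scores[alpha] = model.score(X_val, y_val)  # R^2 on validation set

best_alpha = max(val_scores, key=val_scores.get)

# Retrain the best candidate on the full training set: the final model.
final_model = Ridge(alpha=best_alpha).fit(X_train_full, y_train_full)
test_score = final_model.score(X_test, y_test)
print("best alpha:", best_alpha, "test R^2:", test_score)

# Cross-validation: 5 validation sets, one evaluation per set, then average.
cv_scores = cross_val_score(Ridge(alpha=best_alpha),
                            X_train_full, y_train_full, cv=5)
print("mean CV R^2:", cv_scores.mean())
```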

A model is a simplified version of the observations. The simplifications are meant to discard superfluous details that are unlikely to generalize to new instances. However, to decide what data to discard and what data to keep, you must make assumptions. [...] In a famous 1996 paper, David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch theorem. [...] There is no model that is a priori guaranteed to work better. The only way to know for sure which model is best is to evaluate them all.
