The Machine Learning Landscape Exercises Answers
Going through the exercises in Hands-On Machine Learning to improve machine learning knowledge.
Question 1
How would you define Machine Learning?
Machine Learning is the science (and art) of programming computers so they can learn from data. It is the field of study that gives computers the ability to learn without being explicitly programmed. More formally, a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.
Question 2
Can you name the four types of problems where it shines?
- For problems for which existing solutions require a lot of hand-tuning or long lists of rules: a single Machine Learning algorithm can often simplify the code and perform better than the traditional approach.
- For complex problems for which a traditional approach yields no good solution at all: the best Machine Learning techniques can perhaps find one.
- For fluctuating environments: a Machine Learning system can adapt to new data.
- For getting insights about complex problems and large amounts of data.
Question 3
What is a labeled Training Set?
A labeled training set is a training set that includes the desired solution, called a label, for each instance.
Question 4
What are the two most common supervised tasks?
- Classification - predict the class (category) of an instance given its features
- Regression - predict a target numeric value given a set of features
Question 5
Can you name four common unsupervised tasks?
- Clustering - grouping similar instances together
- Anomaly Detection and Novelty Detection - anomaly detection flags unusual instances (outliers); novelty detection flags new instances that look different from all instances in the training set
- Visualization and Dimensionality Reduction - visualization algorithms take complex data and output a 2D or 3D representation that can be plotted. Dimensionality reduction's goal is to simplify the data without losing too much information.
- Association Rule Learning - goal is to dig into large amounts of data and discover interesting relations between attributes
Question 6
What types of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?
Reinforcement Learning - Reinforcement Learning is a learning system in which an agent can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what the best strategy, called a policy, is to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
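As a rough illustration of agent, reward, and policy, here is a minimal sketch that assumes a toy corridor instead of a real robot. It uses tabular Q-learning, which is only one possible Reinforcement Learning algorithm (the exercise does not name one). The `step` environment, the constants, and the values of `alpha` and `gamma` are all made up for the example.

```python
import random

N_STATES = 5          # positions 0..4 on a corridor; position 4 is the goal
ACTIONS = (-1, +1)    # index 0 = step left, index 1 = step right

def step(state, action):
    """Hypothetical environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values q[state][action_index] from random exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = rng.randrange(2)  # the agent explores with random actions
            next_state, reward, done = step(state, ACTIONS[a])
            # Target: immediate reward plus discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q = q_learning()
# The policy picks, in each non-goal state, the action with the highest value.
policy = ["left" if row[0] > row[1] else "right" for row in q[:-1]]
print(policy)
```

The learned policy is just a lookup from situation (state) to action, which is the sense in which the answer above uses the word.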
Question 7
What type of algorithm would you use to segment your customers into multiple groups?
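Clustering - an unsupervised task that detects groups of similar instances. If you already know which groups you want and have labeled examples of each, you can use a classification algorithm instead.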
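To make the idea concrete, here is a minimal sketch of k-means, one common clustering algorithm, restricted to a single feature. The `kmeans_1d` helper, the customer spend figures, and the starting centroids are all assumptions made for illustration.

```python
def kmeans_1d(values, centroids, n_iter=10):
    """Tiny 1-D k-means: returns (final centroids, cluster index per value)."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = sum(members) / len(members)
    return centroids, labels

# Hypothetical yearly spend per customer.
spend = [12, 15, 14, 80, 85, 90, 300, 310]
centroids, labels = kmeans_1d(spend, centroids=[0, 100, 400])
print(labels)  # cluster index assigned to each customer
```

No labels are supplied anywhere: the groups come only from how close the instances are to each other.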
Question 8
Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
Spam detection is a supervised learning task, assuming you have labels indicating which emails users marked as spam. Without such labels it becomes an unsupervised learning task, and you must rely on unsupervised techniques, such as clustering, to determine which kinds of emails are spam.
Question 9
What is an online learning system?
An online learning system can be trained incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly. It is great for systems that receive data as a continuous flow and need to adapt to change rapidly or autonomously. It can also be used to train systems on huge datasets that cannot fit in memory.
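The incremental training loop can be sketched as follows. This assumes a one-feature linear model updated by mini-batch gradient descent; the `stream` generator stands in for a real data feed, and the learning rate and batch size are arbitrary choices.

```python
def sgd_step(w, b, batch, lr=0.1):
    """One incremental update of the model y = w*x + b on a mini-batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - lr * grad_w, b - lr * grad_b

def stream(n_batches, batch_size=4):
    """Hypothetical data source: yields mini-batches drawn from y = 2x + 1."""
    i = 0
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = (i % 10) / 10.0
            batch.append((x, 2 * x + 1))
            i += 1
        yield batch

w, b = 0.0, 0.0
for batch in stream(5000):
    # Only the current mini-batch is in memory; it can be discarded afterwards.
    w, b = sgd_step(w, b, batch)

print(round(w, 3), round(b, 3))
```

Because each step touches only one mini-batch, the same loop works whether the batches arrive live or are read piece by piece from a dataset too large for memory.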
Question 10
What is out-of-core learning?
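Out-of-core learning is the use of an online learning algorithm to train on a huge dataset that cannot fit in one machine's main memory. The algorithm loads part of the data, runs a training step on that part, and repeats the process until it has run on all of the data.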
Question 11
What type of learning algorithm relies on a similarity measure to make predictions?
Instance-based learning learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them) using a similarity measure.
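A minimal sketch of this idea is a k-nearest-neighbors classifier, assuming Euclidean distance as the (dis)similarity measure. The `knn_predict` helper and the two-feature toy data are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    # The training set is stored as-is ("learned by heart"); all the work
    # happens at prediction time, when distances to the query are compared.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical (features, label) pairs.
train = [((1.0, 1.0), "ham"), ((1.2, 0.8), "ham"), ((0.9, 1.1), "ham"),
         ((5.0, 5.0), "spam"), ((5.2, 4.9), "spam"), ((4.8, 5.1), "spam")]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```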
Question 12
What is the difference between a model parameter and a learning algorithm’s hyperparameter?
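A model's parameters are adjusted during training so that the model best fits the data; they determine what the model predicts for a new instance. A hyperparameter is a parameter of the learning algorithm itself, not of the model: it is set before training and stays constant during training (for example, the amount of regularization to apply).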
Question 13
What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
Model-based learning algorithms search the parameter space for the parameter values that best fit the data, so that the model generalizes well to new instances. The most common strategy is to minimize a cost function that measures how badly the model fits the training data (some algorithms instead maximize a utility function). They make predictions by applying the model, with the parameter values found during training, to the features of a new instance.
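The search, cost function, and prediction steps can be sketched with simple linear regression, assuming a mean squared error cost and its closed-form least-squares solution. The helper names and the noise-free toy data are made up for the example.

```python
def mse(slope, intercept, xs, ys):
    """Cost function: mean squared error of the line on the data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Return the (slope, intercept) that minimize the MSE (least squares)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def predict(slope, intercept, x):
    """Apply the trained model to a new instance."""
    return slope * x + intercept

xs, ys = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]   # toy training set
slope, intercept = fit_line(xs, ys)                    # model parameters
print(slope, intercept, predict(slope, intercept, 5.0))
```

Here `slope` and `intercept` are the model parameters found by training; any other pair gives a cost at least as high on this training set.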
Question 14
Can you name four of the main challenges in Machine Learning?
- Overfitting the data
- Underfitting the data
- Insufficient quantity of training data
- Nonrepresentative training data
Question 15
If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
The model is probably overfitting the training data. Three possible solutions include:
- Simplify the model by selecting one with fewer parameters
- Gather more training data
- Reduce the noise of the training data (fix data errors and remove outliers)
Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. The amount of regularization to apply during learning can be controlled by a hyperparameter.
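As a small illustration of regularization, here is a sketch assuming ridge-style (L2) regularization on a one-parameter model y = w*x with no intercept. The `ridge_slope` helper, the hyperparameter name `alpha`, and the data are assumptions for the example.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w that minimizes sum((w*x - y)**2) + alpha * w**2."""
    # Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + alpha).
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

w_free = ridge_slope(xs, ys, alpha=0.0)          # no regularization
w_constrained = ridge_slope(xs, ys, alpha=14.0)  # strong regularization
print(w_free, w_constrained)
```

A larger `alpha` pulls the parameter toward zero, which makes the model simpler at the price of fitting the training data less closely.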
Question 16
What is a test set and why would you want to use it?
A test set is a portion of the available data that is held out and never used for training. You use it to estimate the generalization error, that is, how well the model will perform on new instances, before the model is put into production.
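A minimal sketch of holding out a test set, assuming a random 80/20 split. The `split_train_test` helper, the ratio, and the seed are illustrative choices, not a prescribed method.

```python
import random

def split_train_test(data, test_ratio=0.2, seed=42):
    """Shuffle the indices once, then hold out the first chunk as the test set."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_ratio)
    test = [data[i] for i in indices[:n_test]]
    train = [data[i] for i in indices[n_test:]]
    return train, test

data = list(range(100))              # stand-in for 100 instances
train, test = split_train_test(data)
print(len(train), len(test))
```

Fixing the seed matters: if the split changed on every run, the model would eventually see the whole dataset across runs, which defeats the purpose of the test set.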
Question 17
What is the purpose of a validation set?
A validation set is used to compare models and to select the hyperparameters that work best. It is held out from the training data: each candidate model is trained on the remaining data and then evaluated on the validation set. Comparing candidates with different hyperparameters on the validation set gives you an idea of which set of hyperparameters will generalize best.
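The selection loop can be sketched as follows, assuming a one-parameter ridge-style model whose regularization strength `alpha` is the hyperparameter being tuned. The data, the candidate values, and the helper names are made up for illustration.

```python
def fit_ridge(xs, ys, alpha):
    """Slope w minimizing sum((w*x - y)**2) + alpha * w**2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def mse(w, xs, ys):
    """Mean squared error of the model y = w*x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data, already split into a training part and a validation part.
train_x, train_y = [1.0, 2.0, 3.0], [2.5, 3.5, 6.5]
valid_x, valid_y = [4.0, 5.0], [8.0, 10.0]

candidates = [0.0, 0.5, 5.0]   # hyperparameter values to compare
val_errors = {}
for alpha in candidates:
    w = fit_ridge(train_x, train_y, alpha)         # train on the training part only
    val_errors[alpha] = mse(w, valid_x, valid_y)   # evaluate on held-out data

best_alpha = min(val_errors, key=val_errors.get)
print(best_alpha, val_errors)
```

The test set plays no part in this loop; it is kept aside for a final estimate once `best_alpha` has been chosen.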
Question 18
What can go wrong if you tune hyperparameters using the test set?
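You risk overfitting the test set: you can end up with hyperparameters that perform well on that particular test set but do not generalize well, and the generalization error you measure on the test set will be optimistic.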
Question 19
What is repeated cross-validation and why would you prefer it to using a single validation set?
Repeated cross-validation is cross-validation run multiple times, with the data split into folds in a different way on each repetition. After each repetition, the model assessment metric (e.g. accuracy or RMSE) is computed. The scores from all repetitions are finally averaged (you can also take the median) to get a final model assessment score. Because the model is evaluated on many different held-out subsets instead of one fixed validation set, the final score depends less on how the data happened to be split, which makes it more robust than a single validation set or a single run of cross-validation.
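The procedure can be sketched in a few lines, assuming a deliberately trivial model (predict the training mean) scored by mean squared error. The helper names, `k`, `repeats`, and the data are illustrative assumptions.

```python
import random
from statistics import mean

def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def repeated_cv(data, fit, score, k=5, repeats=3, seed=0):
    """Run k-fold cross-validation `repeats` times with different splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for held_out in k_fold_indices(len(data), k, rng):
            held = set(held_out)
            train = [d for i, d in enumerate(data) if i not in held]
            valid = [data[i] for i in held_out]
            model = fit(train)                  # train without the held-out fold
            scores.append(score(model, valid))  # evaluate on the held-out fold
    return mean(scores), scores

def fit_mean(train):
    """Toy model: always predict the mean of the training targets."""
    return mean(train)

def mse(model, valid):
    return mean((model - y) ** 2 for y in valid)

data = [float(i) for i in range(20)]
avg_score, scores = repeated_cv(data, fit_mean, mse)
print(len(scores), avg_score)
```

With `k=5` and `repeats=3`, the final score averages 15 separate evaluations rather than relying on a single one.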