Hands On Machine Learning Chapter 2 - End To End Machine Learning Project
I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel that I got a good grasp of machine learning the first time I read it, and I skipped the neural networks chapters on that first pass.
End-to-End Machine Learning Project
In this chapter, you will be going through an example project end to end, pretending to be a recently hired data scientist in a real estate company.
- When you are learning Machine learning, it is best to actually experiment with real-world data, not just artificial datasets. Here are a few places you can look to get data:
- Popular Open Data Repositories
- Meta Portals (List Open Repositories)
- Other Pages Listing Many Popular Open Data Repositories
Looking at the Big Picture
The first task you are asked to perform is to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Your model should learn from this data and be able to predict the median housing price in any district given all the other metrics.
A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply. Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output, and so on. Each component is fairly self-contained: the interface between components is simply the data store.
Framing the Problem
The first step is to frame the problem.
- This is a supervised learning task since you are given labeled data.
- It is a regression task, since you are asked to predict a value
- This is a multiple regression problem since the system will use multiple features to make a prediction
- It is a univariate regression problem since we are only trying to predict a single value for each district.
- If we were trying to predict multiple values per district, it would be a multivariate regression problem.
- Batch Learning problem - no continuous data flow
Select Performance Measure
Next step is to select a performance measure. A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors:
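RMSE(X, h) = sqrt( (1/m) * Σ_{i=1}^{m} ( h(x^(i)) − y^(i) )^2 ) (the standard definition, using the notation listed below).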
Although the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. If there are outlier districts, you may consider using the Mean Absolute Error (also known as the Average Absolute Deviation):
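MAE(X, h) = (1/m) * Σ_{i=1}^{m} | h(x^(i)) − y^(i) | (again the standard definition).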
Both RMSE and MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible:
- Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm. It is also called the ℓ2 norm, noted ∥⋅∥2 or just ∥⋅∥.
- Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm, noted ∥⋅∥1. It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
- More generally, the ℓk norm of a vector v containing n elements is defined as ∥v∥k = ( |v0|^k + |v1|^k + ⋯ + |vn|^k )^(1/k). ℓ0 just gives the number of non-zero elements in the vector, and ℓ∞ gives the maximum absolute value in the vector.
- The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare, the RMSE performs very well and is generally preferred.
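For some quick intuition on how higher norms emphasize large values, here is a small NumPy check (not from the book) on a vector with one large value among small ones:
import numpy as np
v = np.array([1.0, 2.0, 30.0])           # one large value among small ones
print(np.linalg.norm(v, ord=1))          # ℓ1 norm: 33.0
print(np.linalg.norm(v, ord=2))          # ℓ2 norm: ~30.1, dominated by the large value
print(np.linalg.norm(v, ord=np.inf))     # ℓ∞ norm: 30.0, the maximum absolute value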
The Math Notation Used in Textbook:
- m is the number of instances in the datasets
- x^(i) is a vector of all the feature values (excluding the label) of the i-th instance in the dataset, and y^(i) is its label (the desired output value for that instance)
- X is a matrix containing all feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the i-th row is equal to the transpose of x^(i), noted (x^(i))^T.
- h is your system's prediction function, also called a hypothesis. When your system is given an instance's feature vector x^(i), it outputs a predicted value ŷ^(i) = h(x^(i)) for that instance
- RMSE(X,h) is the cost function measured on the set of examples using your hypothesis h.
- We use lowercase italic font for scalar values and function names, lowercase bold font for vectors, and uppercase bold font for matrices.
Get the Data
import pandas as pd
def load_housing_data():
    return pd.read_csv("housing.csv")
housing = load_housing_data()
housing.head()
housing.info()
Using the DataFrame.head() and DataFrame.info() methods above, you can see the first few rows and get a quick description of the data (number of rows, each attribute's type, and the number of non-null values), respectively. You can see that the total_bedrooms column contains some null values. The ocean_proximity column has the type object, which usually means that the column holds text when loaded from a CSV file. Every other column has a numeric type. In the output of head(), you see that the ocean_proximity values are repetitive, which means it is probably a categorical attribute. You can use the value_counts() method to get an idea of what is contained in the categorical column, and the DataFrame.describe() method to get a summary of the numerical attributes. Note that null values are ignored by describe().
The %matplotlib inline command tells Jupyter to set up Matplotlib so it uses Jupyter's own backend. Plots are then rendered within the notebook itself.
housing["ocean_proximity"].value_counts()
housing.describe()
Another quick way to get a feel of data you are dealing with is to plot a histogram for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis). You can use the hist() method on the whole dataset, and it will plot a histogram for each numerical attribute.
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50,figsize=(20,15))
plt.show()
Things to notice from histograms:
- The median income attribute does not look to be expressed in US dollars. The numbers actually represent tens of thousands of dollars and were capped at 15.
- The housing median age and the median house value were also capped.
- The attributes have very different scales.
- Many of the histograms are tail heavy: they extend much farther to the right of the median than to the left.
Generate a Test Set
Your brain is an amazing pattern detection system, which means that it is highly prone to overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of machine learning model. When you estimate the generalization error using the test set, your estimate will be too optimistic and you will launch a system that will not perform as well as expected. This is called data snooping bias.
You want to avoid generating a different test/train split every time you run your code; otherwise, over many runs, you (and your model) would eventually get to see the whole dataset. You can use Scikit-Learn's train_test_split function to split the data properly, and its random_state parameter lets you set the random number generator's seed.
from sklearn.model_selection import train_test_split
train_set,test_set = train_test_split(housing,test_size=0.2,random_state=42)
You want to make sure that your test set is representative of the overall population. Stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of a stratum's importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The code below uses the pd.cut function to create an income category attribute with 5 categories. After creating the categories, you can do stratified sampling based on the income category; for this you can use Scikit-Learn's StratifiedShuffleSplit class.
Test set generation is an often neglected but critical part of a machine learning project.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
housing["income_cat"] = pd.cut(housing["median_income"],bins=[0.,1.5,3.0,4.5,6.,np.inf],labels=[1,2,3,4,5])
housing["income_cat"].hist()
%matplotlib inline
plt.show()
split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
print(strat_test_set["income_cat"].value_counts()/len(strat_test_set))
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
Discover and Visualize the Data to Gain Insights
- Since there is geographical information (latitude and longitude), it is a good idea to create a scatter plot of all districts to visualize the data.
- You can see that the data resembles the shape of California, and by using the alpha parameter, you can see the high-density areas.
- Customizing the plot:
- s is the radius of the circle and represents the district's population
- c is the color and represents the price
- We use a predefined color map jet, which ranges from blue (low values) to red (high values)
- The image tells you that housing prices are very much related to location and population density.
Looking for Correlations
Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the corr() method.
The correlation coefficient ranges from -1 to 1. When it is close to 1, it means that there is a strong positive correlation. When the coefficient is close to -1, it means that there is a strong negative correlation. Finally, coefficients close to 0 mean that there is no linear correlation. The correlation coefficient only measures linear correlations; it may completely miss nonlinear relationships. The correlation coefficient also has nothing to do with the slope of the relationship.
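A quick illustration of the "misses nonlinear relationships" point (not from the book): a perfect quadratic relationship still has a Pearson correlation near zero.
import numpy as np
x = np.linspace(-1, 1, 1000)
y = x ** 2                          # y is fully determined by x, but not linearly
print(np.corrcoef(x, y)[0, 1])      # prints a value very close to 0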
Another way to check for correlation between attributes is to use Pandas' scatter_matrix function, which plots every numerical attribute against every other numerical attribute.
As seen from the plot, the most promising attribute for predicting the median house value is the median income.
from pandas.plotting import scatter_matrix
housing = strat_train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
attributes = ["median_house_value","median_income","total_rooms","housing_median_age"]
a_throwaway = scatter_matrix(housing[attributes], figsize=(12,8))  # assigning scatter_matrix to a variable to suppress output
Experimenting with Attribute Combinations
One last thing you may want to do before actually preparing the data for Machine Learning algorithms is to try out attribute combinations.
As seen from the attribute combinations below, the bedrooms_per_room and rooms_per_household are more informative (more correlated with the housing price) than the total_rooms per district and the total_bedrooms per district.
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
Prepare the Data for Machine Learning Algorithms
You should write functions to prepare the data for your Machine Learning algorithms, for several good reasons:
- This will allow you to reproduce these transformations easily on any dataset (e.g., the next time you get a fresh dataset)
- You will gradually build a library of transformation functions that you can reuse in future projects
- You can use these functions in your live system to transform the new data before feeding it to your algorithms
- This will make it possible for you to try various transformations and see which combination of transformations works best.
Data Cleaning
Most machine learning algorithms cannot work with missing features. You have three options for dealing with them:
- Get rid of rows that contain missing values dropna()
- Get rid of the whole column that contains missing values drop()
- Set the values to some value (zero, the mean, the median, etc.) fillna()
- If you use this option, you need to save the mean/median/... so that you can replace missing values in the test set when you want to evaluate the system and so that you can replace missing values in the live system
- Scikit-Learn provides a handy class to take care of missing values: SimpleImputer
- Only works on numerical data (need to drop non-numerical data)
- It's a good idea to apply the imputer to all numerical attributes, since you can't be sure that there won't be missing data once the system goes live
- The imputer's transform() method returns a NumPy array, but you can easily put it back into DataFrame form.
## Revert to Clean Training Set and Separate Labels
housing = strat_train_set.drop("median_house_value",axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
# Impute or drop columns or rows
from sklearn.impute import SimpleImputer
housing.dropna(subset=["total_bedrooms"]) # option 1
housing.drop("total_bedrooms",axis=1) # option 2
median = housing["total_bedrooms"].median() # option 3
housing["total_bedrooms"].fillna(median)
# Imputing
imputer = SimpleImputer(strategy="median")
# Since the median can only be computed on numerical attributes, you need to create a copy of the data without the categorical column
housing_num = housing.drop("ocean_proximity",axis=1)
imputer.fit(housing_num)
print(imputer.statistics_)
print(housing_num.median().values)
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)  # keep the original row index
Scikit-Learn Design
The main design principles of Scikit-Learn are:
- Consistency: All objects share a consistent and simple interface
- Estimators: Any object that can estimate some parameters based on a dataset is called an estimator. The estimation is performed by the fit() method, and it takes only a dataset as a parameter. Any other parameter needed to guide the estimation process is considered a hyperparameter, and it must be set as an instance variable
- Transformers: Some estimators can also transform a dataset; these are called transformers. Once again, the API is quite simple: the transformation is performed by the transform() method with the dataset to transform as a parameter. It returns the transformed dataset. All transformers also have a convenience method called fit_transform() that is equivalent to calling fit() and then transform()
- Predictors: Some estimators are capable of making predictions given a dataset; they are called predictors. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions given a test set and the corresponding labels in the case of supervised learning (see the short sketch after this list).
- Inspection: All of an estimator's hyperparameters are accessible directly via public instance variables, and all of its learned parameters are also accessible via public instance variables with a trailing underscore
- Nonproliferation of Classes: Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers.
- Composition: Existing building blocks are reused as much as possible.
- Sensible Defaults: Scikit-learn provides reasonable default values for most parameters, making it easy to create a baseline working system quickly.
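A minimal sketch (not from the book) showing these interfaces on a tiny toy dataset:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
imputer = SimpleImputer(strategy="median")    # an estimator that is also a transformer
X_filled = imputer.fit_transform(X)           # fit() learns the medians, transform() applies them
print(imputer.statistics_)                    # learned parameters end with a trailing underscore
lin_reg = LinearRegression()                  # a predictor
lin_reg.fit(X_filled, y)                      # estimation via fit()
print(lin_reg.predict(X_filled))              # predict() returns predictions for a dataset
print(lin_reg.score(X_filled, y))             # score() measures prediction quality (R² here)
print(lin_reg.get_params())                   # hyperparameters are accessible via get_params()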
Handling Text and Categorical Attributes
Most machine learning algorithms prefer to work with numbers. You can convert categories from text to numbers using Scikit-Learn's OrdinalEncoder class. One issue that may arise is that ML algorithms will assume that two nearby values are more similar than two distant values. To fix this issue, a common solution is to create one binary attribute per category. This is called one-hot encoding, because only one attribute will be equal to 1 (hot) while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. You can use Scikit-Learn's OneHotEncoder class for this.
- The output of OneHotEncoder's fit_transform() is a SciPy sparse matrix by default, which saves memory by storing only the locations of the nonzero elements.
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
ordinal_encoder = OrdinalEncoder()
cat_encoder = OneHotEncoder()
housing_cat = housing[["ocean_proximity"]]
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
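Since fit_transform() returned a SciPy sparse matrix for the one-hot encoding, here is a quick way to inspect it (a small usage note, not from the book's listing):
print(housing_cat_1hot.toarray())      # convert the sparse matrix to a dense NumPy array
print(cat_encoder.categories_)         # list of category arrays learned during fit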
Custom Transformers
You will sometimes need to write your own transformers for tasks like custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities. You will need to create a class that implements three methods:
- fit() returning self
- transform()
- fit_transform()
You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kwargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be used for automatic hyperparameter tuning.
In this example the transformer has one hyperparameter, add_bedrooms_per_room set to True by default. This hyperparameter will allow you to easily find out whether adding this attribute helps the Machine Learning algorithms or not. [...] The more you automate data preparation steps, the more combinations you can automatically try out, making it much more likely you will find a great combination.
from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
Feature Scaling
One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales. There are two common ways to get all attributes to have the same scale (a short sketch follows the list below):
- min-max scaling (also called normalization)
- Values are shifted and rescaled so that they end up ranging from 0 to 1: new value = (value − min value) / (max value − min value)
- Scikit-Learn provides the MinMaxScaler for this
- standardization
- Subtracts the mean and then divides by the standard deviation, resulting in a distribution with zero mean and unit variance: new value = (value − mean) / σ, where σ is the standard deviation
- Does not bound values to certain range, which might be a problem for some algorithms, like neural networks.
- Standardization is much less affected by outliers
- Scikit-Learn provides StandardScaler for standardization
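A minimal sketch (not from the book's code) applying both scalers to the imputed numerical DataFrame housing_tr from the Data Cleaning step; the results are not used further here, since the full pipeline below handles scaling:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
min_max_scaler = MinMaxScaler()                                # rescales each attribute to the 0-1 range
housing_num_minmax = min_max_scaler.fit_transform(housing_tr)
std_scaler = StandardScaler()                                  # zero mean and unit variance per attribute
housing_num_std = std_scaler.fit_transform(housing_tr)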
Transformation Pipelines
There are many transformation steps that need to be executed in the right order. Scikit-Learn provides the Pipeline class to help with such sequences of transformations. The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers.
When you call the pipeline's fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method.
It would be more convenient to have a single transformer able to handle all columns, applying the appropriate transformations to each column. (This way, we don't have to handle text and numeric columns separately.) Scikit-Learn has ColumnTransformer for this purpose.
- ColumnTransformer requires a list of tuples, where each tuple contains a name, a transformer and a list of names (or indices) of columns that the transformer should be applied to.
- ColumnTransformer applies each transformer to the appropriate columns and concatenates the outputs along the second axis (the transformers must return the same number of rows)
- ColumnTransformer may return a sparse or dense matrix
Instead of a transformer, you can specify the string "drop" or "passthrough" if you want the columns to be dropped or left untouched, respectively.
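For illustration (a hypothetical transformer, not the full_pipeline built below), here is how dropping some columns, passing one through unchanged, and scaling the remainder might look:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
example_transformer = ColumnTransformer([
    ("dropped", "drop", ["longitude", "latitude"]),      # these columns are removed
    ("untouched", "passthrough", ["median_income"]),     # this column is kept as-is
], remainder=StandardScaler())                           # all remaining columns get scaled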
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
Select and Train a Model
Our predictions with the linear regression model show that it is underfitting the data. This means that the features do not provide enough information to make good predictions, or that the model is not powerful enough.
Our predictions using the Decision Tree Regressor are badly overfitting the data.
We can use Scikit-Learn's K-fold cross-validation feature to randomly split the training set into 10 distinct subsets called folds, then train and evaluate the Decision Tree 10 times, picking a different fold for evaluation each time and training on the other 9 folds. The result is an array containing 10 evaluation scores. Note that Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), which is why the code below uses scoring="neg_mean_squared_error" and negates the scores before taking the square root.
Cross-validation gives you not only an estimate of the performance of the model, but also a measure of how precise this estimate is (via the standard deviation across folds). It comes at the cost of training the model several times.
RandomForestRegressor models work by training many Decision Trees on random subsets of the features, then averaging out their predictions. Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms even further. If the score on the training set is much lower than on the validation sets, it means the model is overfitting the training set.
The goal of this step is to get a few promising models without tuning the hyperparameters yet.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
housing_predictions_lin = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions_lin)
lin_rmse = np.sqrt(lin_mse)
print("The typical prediction error with Linear Regression is: $",lin_rmse)
print("The Linear Regression Model is underfitting the data - and is probaly not powerful enough.")
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions_tree = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions_tree)
tree_rmse = np.sqrt(tree_mse)
print("The typical prediction error with a Decision Tree Regression is: $",tree_rmse)
print("The Decision Tree Regressor is overfitting the data.")
scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
def display_scores(scores):
    print("Scores: ", scores)
    print("Mean: ", scores.mean())
    print("Standard Deviation: ", scores.std())
print("\nDecision Tree Regressor Scores:\n---------------------------------------------")
display_scores(tree_rmse_scores)
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
print("\nLinear Regression Scores:\n---------------------------------------------")
display_scores(lin_rmse_scores)
print("\nThe Decison Tree Regressor model is overfitting so bad that it performs worse than the Linear Regressor.\n")
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42,max_depth=6)
forest_reg.fit(housing_prepared,housing_labels)
housing_predictions_for = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions_for)
forest_rmse = np.sqrt(forest_mse)
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
print("\nRandom Forest Scores:\n---------------------------------------------")
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
You should save every model you experiment with, so you can come back easily to any model you want. Make sure you save both the hyperparameters and the trained parameters, as well as the cross-validation scores and perhaps the actual predictions as well. This will allow you to easily compare scores across model types, and compare the types of errors they make. You can easily save Scikit-Learn models using Python's pickle module or the joblib library, which is more efficient at serializing large NumPy arrays (sklearn.externals.joblib has been removed in recent Scikit-Learn versions, so install and import joblib directly).
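For example, with joblib (a minimal sketch; the filename here is arbitrary):
import joblib
joblib.dump(forest_reg, "forest_reg.pkl")       # save the trained model (hyperparameters + learned parameters)
# ... later, possibly in another process ...
forest_reg_loaded = joblib.load("forest_reg.pkl")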
Fine Tune Your Model
Assuming you have a shortlist of promising models, you now need to fine-tune them.
Grid Search
You can use Scikit-Learn's GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyperparameter values using cross-validation. It may take a long time, but when it is done, you can get the best combination of hyperparameter values by accessing the best_params_ attribute of the grid search object.
Randomized Search
The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead. Instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration. This has two main benefits (a short sketch follows the list):
- You get to test more different values for each hyperparameter, depending on how long the search runs.
- You have more control over the computing budget by setting the number of iterations.
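A minimal sketch of a randomized search over the random forest's hyperparameters (the distributions and n_iter here are illustrative choices, not taken from this notebook):
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
param_distribs = {
    'n_estimators': randint(low=1, high=200),    # sample n_estimators uniformly from [1, 200)
    'max_features': randint(low=1, high=8),
}
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
print(rnd_search.best_params_)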
Ensemble Methods
Another way to fine-tune the system is to try to combine the models that perform the best. The group (or "ensemble") will often perform better than the best individual models.
Analyze the Best Models and Their Errors
You will often find good insights on the problem by inspecting the best models. Using this information, you might be able to see that you should drop some features (for example). You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem.
Evaluate on Test Set
There is nothing special about this process [evaluating the model on the test set]: just get the predictors and the labels from your test set, run your full_pipeline to transform the data (call transform(), not fit_transform(), you do not want to fit the test set!), and evaluate the final model on the test set.
You might want to have an idea of how precise this estimate is. For this, you can compute a 95% confidence interval for the generalization error using scipy.stats.t.interval().
You must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data.
from sklearn.model_selection import GridSearchCV
param_grid = [
    {
        'n_estimators': [3, 10, 30],
        'max_features': [2, 4, 6, 8],
    },
    {
        'bootstrap': [False],
        'n_estimators': [3, 10],
        'max_features': [2, 3, 4],
    },
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)
grid_search.best_estimator_
feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
print("Retrieving Information From Final Model\n----------------------------------")
print(sorted(zip(feature_importances, attributes), reverse=True))
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value",axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print("Final Root Mean Square Error",final_rmse)
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,loc=squared_errors.mean(),scale=stats.sem(squared_errors)))
Launch, Monitor and Maintain System
- You need to write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops. This is important for catching model degradation over time (see the sketch after this list).
- You should evaluate the system's input data quality.
- You generally want to train your models on a regular basis using fresh data. You should automate this process as much as possible.
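A purely illustrative sketch of such a monitoring check (the function, argument names, and threshold are hypothetical, not from the book): recompute the RMSE on recently labeled data and alert if it exceeds a threshold.
import numpy as np
from sklearn.metrics import mean_squared_error
def check_live_performance(model, X_recent, y_recent, rmse_threshold):
    # Hypothetical monitoring helper: recompute the RMSE on freshly labeled data
    preds = model.predict(X_recent)
    rmse = np.sqrt(mean_squared_error(y_recent, preds))
    if rmse > rmse_threshold:
        print("ALERT: live RMSE", round(rmse), "exceeds threshold", rmse_threshold)
    return rmse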