import sys
assert sys.version_info >= (3, 7)
Chapter 2 – End-to-end Machine Learning project
This notebook contains all the sample code and solutions to the exercises in chapter 2.
|
|
End-to-end Machine Learning project
This project requires Python 3.7 or above:
It also requires Scikit-Learn ≥ 1.0.1:
from packaging import version
import sklearn
assert version.parse(sklearn.__version__) >= version.parse("1.0.1")
Get the Data
Welcome to Machine Learning Housing Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts.
Download the Data
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request
def load_housing_data():
= Path("datasets/housing.tgz")
tarball_path if not tarball_path.is_file():
"datasets").mkdir(parents=True, exist_ok=True)
Path(= "https://github.com/ageron/data/raw/main/housing.tgz"
url
urllib.request.urlretrieve(url, tarball_path)with tarfile.open(tarball_path) as housing_tarball:
="datasets")
housing_tarball.extractall(pathreturn pd.read_csv(Path("datasets/housing/housing.csv"))
= load_housing_data() housing
Take a Quick Look at the Data Structure
housing.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
"ocean_proximity"].value_counts() housing[
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
housing.describe()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
The following cell is not shown either in the book. It creates the images/end_to_end_project
folder (if it doesn’t already exist), and it defines the save_fig()
function which is used through this notebook to save the figures in high-res for the book.
# extra code – code to save the figures as high-res PNGs for the book
= Path() / "images" / "end_to_end_project"
IMAGES_PATH =True, exist_ok=True)
IMAGES_PATH.mkdir(parents
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
= IMAGES_PATH / f"{fig_id}.{fig_extension}"
path if tight_layout:
plt.tight_layout()format=fig_extension, dpi=resolution) plt.savefig(path,
import matplotlib.pyplot as plt
# extra code – the next 5 lines define the default font sizes
'font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)
plt.rc(
=50, figsize=(12, 8))
housing.hist(bins"attribute_histogram_plots") # extra code
save_fig( plt.show()
Create a Test Set
import numpy as np
def shuffle_and_split_data(data, test_ratio):
= np.random.permutation(len(data))
shuffled_indices = int(len(data) * test_ratio)
test_set_size = shuffled_indices[:test_set_size]
test_indices = shuffled_indices[test_set_size:]
train_indices return data.iloc[train_indices], data.iloc[test_indices]
= shuffle_and_split_data(housing, 0.2)
train_set, test_set len(train_set)
16512
len(test_set)
4128
To ensure that this notebook’s outputs remain the same every time we run it, we need to set the random seed:
42) np.random.seed(
Sadly, this won’t guarantee that this notebook will output exactly the same results as in the book, since there are other possible sources of variation. The most important is the fact that algorithms get tweaked over time when libraries evolve. So please tolerate some minor differences: hopefully, most of the outputs should be the same, or at least in the right ballpark.
Note: another source of randomness is the order of Python sets: it is based on Python’s hash()
function, which is randomly “salted” when Python starts up (this started in Python 3.3, to prevent some denial-of-service attacks). To remove this randomness, the solution is to set the PYTHONHASHSEED
environment variable to "0"
before Python even starts up. Nothing will happen if you do it after that. Luckily, if you’re running this notebook on Colab, the variable is already set for you.
from zlib import crc32
def is_id_in_test_set(identifier, test_ratio):
return crc32(np.int64(identifier)) < test_ratio * 2**32
def split_data_with_id_hash(data, test_ratio, id_column):
= data[id_column]
ids = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
in_test_set return data.loc[~in_test_set], data.loc[in_test_set]
= housing.reset_index() # adds an `index` column
housing_with_id = split_data_with_id_hash(housing_with_id, 0.2, "index") train_set, test_set
"id"] = housing["longitude"] * 1000 + housing["latitude"]
housing_with_id[= split_data_with_id_hash(housing_with_id, 0.2, "id") train_set, test_set
from sklearn.model_selection import train_test_split
= train_test_split(housing, test_size=0.2, random_state=42) train_set, test_set
"total_bedrooms"].isnull().sum() test_set[
44
To find the probability that a random sample of 1,000 people contains less than 48.5% female or more than 53.5% female when the population’s female ratio is 51.1%, we use the binomial distribution. The cdf()
method of the binomial distribution gives us the probability that the number of females will be equal or less than the given value.
# extra code – shows how to compute the 10.7% proba of getting a bad sample
from scipy.stats import binom
= 1000
sample_size = 0.511
ratio_female = binom(sample_size, ratio_female).cdf(485 - 1)
proba_too_small = 1 - binom(sample_size, ratio_female).cdf(535)
proba_too_large print(proba_too_small + proba_too_large)
0.10736798530929913
If you prefer simulations over maths, here’s how you could get roughly the same result:
# extra code – shows another way to estimate the probability of bad sample
42)
np.random.seed(
= (np.random.rand(100_000, sample_size) < ratio_female).sum(axis=1)
samples < 485) | (samples > 535)).mean() ((samples
0.1071
"income_cat"] = pd.cut(housing["median_income"],
housing[=[0., 1.5, 3.0, 4.5, 6., np.inf],
bins=[1, 2, 3, 4, 5]) labels
"income_cat"].value_counts().sort_index().plot.bar(rot=0, grid=True)
housing["Income category")
plt.xlabel("Number of districts")
plt.ylabel("housing_income_cat_bar_plot") # extra code
save_fig( plt.show()
from sklearn.model_selection import StratifiedShuffleSplit
= StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
splitter = []
strat_splits for train_index, test_index in splitter.split(housing, housing["income_cat"]):
= housing.iloc[train_index]
strat_train_set_n = housing.iloc[test_index]
strat_test_set_n strat_splits.append([strat_train_set_n, strat_test_set_n])
= strat_splits[0] strat_train_set, strat_test_set
It’s much shorter to get a single stratified split:
= train_test_split(
strat_train_set, strat_test_set =0.2, stratify=housing["income_cat"], random_state=42) housing, test_size
"income_cat"].value_counts() / len(strat_test_set) strat_test_set[
3 0.350533
2 0.318798
4 0.176357
5 0.114341
1 0.039971
Name: income_cat, dtype: float64
# extra code – computes the data for Figure 2–10
def income_cat_proportions(data):
return data["income_cat"].value_counts() / len(data)
= train_test_split(housing, test_size=0.2, random_state=42)
train_set, test_set
= pd.DataFrame({
compare_props "Overall %": income_cat_proportions(housing),
"Stratified %": income_cat_proportions(strat_test_set),
"Random %": income_cat_proportions(test_set),
}).sort_index()= "Income Category"
compare_props.index.name "Strat. Error %"] = (compare_props["Stratified %"] /
compare_props["Overall %"] - 1)
compare_props["Rand. Error %"] = (compare_props["Random %"] /
compare_props["Overall %"] - 1)
compare_props[* 100).round(2) (compare_props
Overall % | Stratified % | Random % | Strat. Error % | Rand. Error % | |
---|---|---|---|---|---|
Income Category | |||||
1 | 3.98 | 4.00 | 4.24 | 0.36 | 6.45 |
2 | 31.88 | 31.88 | 30.74 | -0.02 | -3.59 |
3 | 35.06 | 35.05 | 34.52 | -0.01 | -1.53 |
4 | 17.63 | 17.64 | 18.41 | 0.03 | 4.42 |
5 | 11.44 | 11.43 | 12.09 | -0.08 | 5.63 |
for set_ in (strat_train_set, strat_test_set):
"income_cat", axis=1, inplace=True) set_.drop(
Discover and Visualize the Data to Gain Insights
= strat_train_set.copy() housing
Visualizing Geographical Data
="scatter", x="longitude", y="latitude", grid=True)
housing.plot(kind"bad_visualization_plot") # extra code
save_fig( plt.show()
="scatter", x="longitude", y="latitude", grid=True, alpha=0.2)
housing.plot(kind"better_visualization_plot") # extra code
save_fig( plt.show()
="scatter", x="longitude", y="latitude", grid=True,
housing.plot(kind=housing["population"] / 100, label="population",
s="median_house_value", cmap="jet", colorbar=True,
c=True, sharex=False, figsize=(10, 7))
legend"housing_prices_scatterplot") # extra code
save_fig( plt.show()
The argument sharex=False
fixes a display bug: without it, the x-axis values and label are not displayed (see: https://github.com/pandas-dev/pandas/issues/10611).
The next cell generates the first figure in the chapter (this code is not in the book). It’s just a beautified version of the previous figure, with an image of California added in the background, nicer label names and no grid.
# extra code – this cell generates the first figure in the chapter
# Download the California image
= "california.png"
filename if not (IMAGES_PATH / filename).is_file():
= "https://github.com/ageron/handson-ml3/raw/main/"
homl3_root = homl3_root + "images/end_to_end_project/" + filename
url print("Downloading", filename)
/ filename)
urllib.request.urlretrieve(url, IMAGES_PATH
= housing.rename(columns={
housing_renamed "latitude": "Latitude", "longitude": "Longitude",
"population": "Population",
"median_house_value": "Median house value (ᴜsᴅ)"})
housing_renamed.plot(="scatter", x="Longitude", y="Latitude",
kind=housing_renamed["Population"] / 100, label="Population",
s="Median house value (ᴜsᴅ)", cmap="jet", colorbar=True,
c=True, sharex=False, figsize=(10, 7))
legend
= plt.imread(IMAGES_PATH / filename)
california_img = -124.55, -113.95, 32.45, 42.05
axis
plt.axis(axis)=axis)
plt.imshow(california_img, extent
"california_housing_prices_plot")
save_fig( plt.show()
Looking for Correlations
= housing.corr() corr_matrix
/tmp/ipykernel_34715/2466220658.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
corr_matrix = housing.corr()
"median_house_value"].sort_values(ascending=False) corr_matrix[
median_house_value 1.000000
median_income 0.688380
total_rooms 0.137455
housing_median_age 0.102175
households 0.071426
total_bedrooms 0.054635
population -0.020153
longitude -0.050859
latitude -0.139584
Name: median_house_value, dtype: float64
from pandas.plotting import scatter_matrix
= ["median_house_value", "median_income", "total_rooms",
attributes "housing_median_age"]
=(12, 8))
scatter_matrix(housing[attributes], figsize"scatter_matrix_plot") # extra code
save_fig( plt.show()
="scatter", x="median_income", y="median_house_value",
housing.plot(kind=0.1, grid=True)
alpha"income_vs_house_value_scatterplot") # extra code
save_fig( plt.show()
Experimenting with Attribute Combinations
"rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"] housing[
= housing.corr()
corr_matrix "median_house_value"].sort_values(ascending=False) corr_matrix[
/tmp/ipykernel_34715/826279322.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
corr_matrix = housing.corr()
median_house_value 1.000000
median_income 0.688380
rooms_per_house 0.143663
total_rooms 0.137455
housing_median_age 0.102175
households 0.071426
total_bedrooms 0.054635
population -0.020153
people_per_house -0.038224
longitude -0.050859
latitude -0.139584
bedrooms_ratio -0.256397
Name: median_house_value, dtype: float64
Prepare the Data for Machine Learning Algorithms
Let’s revert to the original training set and separate the target (note that strat_train_set.drop()
creates a copy of strat_train_set
without the column, it doesn’t actually modify strat_train_set
itself, unless you pass inplace=True
):
= strat_train_set.drop("median_house_value", axis=1)
housing = strat_train_set["median_house_value"].copy() housing_labels
Data Cleaning
In the book 3 options are listed to handle the NaN values:
=["total_bedrooms"], inplace=True) # option 1
housing.dropna(subset
"total_bedrooms", axis=1) # option 2
housing.drop(
= housing["total_bedrooms"].median() # option 3
median "total_bedrooms"].fillna(median, inplace=True) housing[
For each option, we’ll create a copy of housing
and work on that copy to avoid breaking housing
. We’ll also show the output of each option, but filtering on the rows that originally contained a NaN value.
= housing.isnull().any(axis=1)
null_rows_idx housing.loc[null_rows_idx].head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|
14452 | -120.67 | 40.50 | 15.0 | 5343.0 | NaN | 2503.0 | 902.0 | 3.5962 | INLAND |
18217 | -117.96 | 34.03 | 35.0 | 2093.0 | NaN | 1755.0 | 403.0 | 3.4115 | <1H OCEAN |
11889 | -118.05 | 34.04 | 33.0 | 1348.0 | NaN | 1098.0 | 257.0 | 4.2917 | <1H OCEAN |
20325 | -118.88 | 34.17 | 15.0 | 4260.0 | NaN | 1701.0 | 669.0 | 5.1033 | <1H OCEAN |
14360 | -117.87 | 33.62 | 8.0 | 1266.0 | NaN | 375.0 | 183.0 | 9.8020 | <1H OCEAN |
= housing.copy()
housing_option1
=["total_bedrooms"], inplace=True) # option 1
housing_option1.dropna(subset
housing_option1.loc[null_rows_idx].head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
---|
= housing.copy()
housing_option2
"total_bedrooms", axis=1, inplace=True) # option 2
housing_option2.drop(
housing_option2.loc[null_rows_idx].head()
longitude | latitude | housing_median_age | total_rooms | population | households | median_income | ocean_proximity | |
---|---|---|---|---|---|---|---|---|
14452 | -120.67 | 40.50 | 15.0 | 5343.0 | 2503.0 | 902.0 | 3.5962 | INLAND |
18217 | -117.96 | 34.03 | 35.0 | 2093.0 | 1755.0 | 403.0 | 3.4115 | <1H OCEAN |
11889 | -118.05 | 34.04 | 33.0 | 1348.0 | 1098.0 | 257.0 | 4.2917 | <1H OCEAN |
20325 | -118.88 | 34.17 | 15.0 | 4260.0 | 1701.0 | 669.0 | 5.1033 | <1H OCEAN |
14360 | -117.87 | 33.62 | 8.0 | 1266.0 | 375.0 | 183.0 | 9.8020 | <1H OCEAN |
= housing.copy()
housing_option3
= housing["total_bedrooms"].median()
median "total_bedrooms"].fillna(median, inplace=True) # option 3
housing_option3[
housing_option3.loc[null_rows_idx].head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|
14452 | -120.67 | 40.50 | 15.0 | 5343.0 | 434.0 | 2503.0 | 902.0 | 3.5962 | INLAND |
18217 | -117.96 | 34.03 | 35.0 | 2093.0 | 434.0 | 1755.0 | 403.0 | 3.4115 | <1H OCEAN |
11889 | -118.05 | 34.04 | 33.0 | 1348.0 | 434.0 | 1098.0 | 257.0 | 4.2917 | <1H OCEAN |
20325 | -118.88 | 34.17 | 15.0 | 4260.0 | 434.0 | 1701.0 | 669.0 | 5.1033 | <1H OCEAN |
14360 | -117.87 | 33.62 | 8.0 | 1266.0 | 434.0 | 375.0 | 183.0 | 9.8020 | <1H OCEAN |
from sklearn.impute import SimpleImputer
= SimpleImputer(strategy="median") imputer
Separating out the numerical attributes to use the "median"
strategy (as it cannot be calculated on text attributes like ocean_proximity
):
= housing.select_dtypes(include=[np.number]) housing_num
imputer.fit(housing_num)
SimpleImputer(strategy='median')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SimpleImputer(strategy='median')
imputer.statistics_
array([-118.51 , 34.26 , 29. , 2125. , 434. , 1167. ,
408. , 3.5385])
Check that this is the same as manually computing the median of each attribute:
housing_num.median().values
array([-118.51 , 34.26 , 29. , 2125. , 434. , 1167. ,
408. , 3.5385])
Transform the training set:
= imputer.transform(housing_num) X
imputer.feature_names_in_
array(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income'],
dtype=object)
= pd.DataFrame(X, columns=housing_num.columns,
housing_tr =housing_num.index) index
housing_tr.loc[null_rows_idx].head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | |
---|---|---|---|---|---|---|---|---|
14452 | -120.67 | 40.50 | 15.0 | 5343.0 | 434.0 | 2503.0 | 902.0 | 3.5962 |
18217 | -117.96 | 34.03 | 35.0 | 2093.0 | 434.0 | 1755.0 | 403.0 | 3.4115 |
11889 | -118.05 | 34.04 | 33.0 | 1348.0 | 434.0 | 1098.0 | 257.0 | 4.2917 |
20325 | -118.88 | 34.17 | 15.0 | 4260.0 | 434.0 | 1701.0 | 669.0 | 5.1033 |
14360 | -117.87 | 33.62 | 8.0 | 1266.0 | 434.0 | 375.0 | 183.0 | 9.8020 |
imputer.strategy
'median'
= pd.DataFrame(X, columns=housing_num.columns,
housing_tr =housing_num.index) index
# not shown in the book housing_tr.loc[null_rows_idx].head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | |
---|---|---|---|---|---|---|---|---|
14452 | -120.67 | 40.50 | 15.0 | 5343.0 | 434.0 | 2503.0 | 902.0 | 3.5962 |
18217 | -117.96 | 34.03 | 35.0 | 2093.0 | 434.0 | 1755.0 | 403.0 | 3.4115 |
11889 | -118.05 | 34.04 | 33.0 | 1348.0 | 434.0 | 1098.0 | 257.0 | 4.2917 |
20325 | -118.88 | 34.17 | 15.0 | 4260.0 | 434.0 | 1701.0 | 669.0 | 5.1033 |
14360 | -117.87 | 33.62 | 8.0 | 1266.0 | 434.0 | 375.0 | 183.0 | 9.8020 |
#from sklearn import set_config
#
# set_config(transform_output="pandas") # scikit-learn >= 1.2
Now let’s drop some outliers:
from sklearn.ensemble import IsolationForest
= IsolationForest(random_state=42)
isolation_forest = isolation_forest.fit_predict(X) outlier_pred
outlier_pred
array([-1, 1, 1, ..., 1, 1, 1])
If you wanted to drop outliers, you would run the following code:
#housing = housing.iloc[outlier_pred == 1]
#housing_labels = housing_labels.iloc[outlier_pred == 1]
Handling Text and Categorical Attributes
Now let’s preprocess the categorical input feature, ocean_proximity
:
= housing[["ocean_proximity"]]
housing_cat 8) housing_cat.head(
ocean_proximity | |
---|---|
13096 | NEAR BAY |
14973 | <1H OCEAN |
3785 | INLAND |
14689 | INLAND |
20507 | NEAR OCEAN |
1286 | INLAND |
18078 | <1H OCEAN |
4396 | NEAR BAY |
from sklearn.preprocessing import OrdinalEncoder
= OrdinalEncoder()
ordinal_encoder = ordinal_encoder.fit_transform(housing_cat) housing_cat_encoded
8] housing_cat_encoded[:
array([[3.],
[0.],
[1.],
[1.],
[4.],
[1.],
[0.],
[3.]])
ordinal_encoder.categories_
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype=object)]
from sklearn.preprocessing import OneHotEncoder
= OneHotEncoder()
cat_encoder = cat_encoder.fit_transform(housing_cat) housing_cat_1hot
housing_cat_1hot
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
with 16512 stored elements in Compressed Sparse Row format>
By default, the OneHotEncoder
class returns a sparse array, but we can convert it to a dense array if needed by calling the toarray()
method:
housing_cat_1hot.toarray()
array([[0., 0., 0., 1., 0.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
...,
[0., 0., 0., 0., 1.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.]])
Alternatively, you can set sparse=False
when creating the OneHotEncoder
:
= OneHotEncoder(sparse=False)
cat_encoder = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot housing_cat_1hot
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
warnings.warn(
array([[0., 0., 0., 1., 0.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
...,
[0., 0., 0., 0., 1.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.]])
cat_encoder.categories_
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype=object)]
= pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})
df_test pd.get_dummies(df_test)
ocean_proximity_INLAND | ocean_proximity_NEAR BAY | |
---|---|---|
0 | 1 | 0 |
1 | 0 | 1 |
cat_encoder.transform(df_test)
array([[0., 1., 0., 0., 0.],
[0., 0., 0., 1., 0.]])
= pd.DataFrame({"ocean_proximity": ["<2H OCEAN", "ISLAND"]})
df_test_unknown pd.get_dummies(df_test_unknown)
ocean_proximity_<2H OCEAN | ocean_proximity_ISLAND | |
---|---|---|
0 | 1 | 0 |
1 | 0 | 1 |
= "ignore"
cat_encoder.handle_unknown cat_encoder.transform(df_test_unknown)
array([[0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0.]])
cat_encoder.feature_names_in_
array(['ocean_proximity'], dtype=object)
cat_encoder.get_feature_names_out()
array(['ocean_proximity_<1H OCEAN', 'ocean_proximity_INLAND',
'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
'ocean_proximity_NEAR OCEAN'], dtype=object)
= pd.DataFrame(cat_encoder.transform(df_test_unknown),
df_output =cat_encoder.get_feature_names_out(),
columns=df_test_unknown.index) index
df_output
ocean_proximity_<1H OCEAN | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN | |
---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
Feature Scaling
from sklearn.preprocessing import MinMaxScaler
= MinMaxScaler(feature_range=(-1, 1))
min_max_scaler = min_max_scaler.fit_transform(housing_num) housing_num_min_max_scaled
from sklearn.preprocessing import StandardScaler
= StandardScaler()
std_scaler = std_scaler.fit_transform(housing_num) housing_num_std_scaled
# extra code – this cell generates Figure 2–17
= plt.subplots(1, 2, figsize=(8, 3), sharey=True)
fig, axs "population"].hist(ax=axs[0], bins=50)
housing["population"].apply(np.log).hist(ax=axs[1], bins=50)
housing[0].set_xlabel("Population")
axs[1].set_xlabel("Log of population")
axs[0].set_ylabel("Number of districts")
axs["long_tail_plot")
save_fig( plt.show()
What if we replace each value with its percentile?
# extra code – just shows that we get a uniform distribution
= [np.percentile(housing["median_income"], p)
percentiles for p in range(1, 100)]
= pd.cut(housing["median_income"],
flattened_median_income =[-np.inf] + percentiles + [np.inf],
bins=range(1, 100 + 1))
labels=50)
flattened_median_income.hist(bins"Median income percentile")
plt.xlabel("Number of districts")
plt.ylabel(
plt.show()# Note: incomes below the 1st percentile are labeled 1, and incomes above the
# 99th percentile are labeled 100. This is why the distribution below ranges
# from 1 to 100 (not 0 to 100).
from sklearn.metrics.pairwise import rbf_kernel
= rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1) age_simil_35
# extra code – this cell generates Figure 2–18
= np.linspace(housing["housing_median_age"].min(),
ages "housing_median_age"].max(),
housing[500).reshape(-1, 1)
= 0.1
gamma1 = 0.03
gamma2 = rbf_kernel(ages, [[35]], gamma=gamma1)
rbf1 = rbf_kernel(ages, [[35]], gamma=gamma2)
rbf2
= plt.subplots()
fig, ax1
"Housing median age")
ax1.set_xlabel("Number of districts")
ax1.set_ylabel("housing_median_age"], bins=50)
ax1.hist(housing[
= ax1.twinx() # create a twin axis that shares the same x-axis
ax2 = "blue"
color =color, label="gamma = 0.10")
ax2.plot(ages, rbf1, color=color, label="gamma = 0.03", linestyle="--")
ax2.plot(ages, rbf2, color='y', labelcolor=color)
ax2.tick_params(axis"Age similarity", color=color)
ax2.set_ylabel(
="upper left")
plt.legend(loc"age_similarity_plot")
save_fig( plt.show()
from sklearn.linear_model import LinearRegression
= StandardScaler()
target_scaler = target_scaler.fit_transform(housing_labels.to_frame())
scaled_labels
= LinearRegression()
model "median_income"]], scaled_labels)
model.fit(housing[[= housing[["median_income"]].iloc[:5] # pretend this is new data
some_new_data
= model.predict(some_new_data)
scaled_predictions = target_scaler.inverse_transform(scaled_predictions) predictions
predictions
array([[131997.15275877],
[299359.35844434],
[146023.37185694],
[138840.33653057],
[192016.61557639]])
from sklearn.compose import TransformedTargetRegressor
= TransformedTargetRegressor(LinearRegression(),
model =StandardScaler())
transformer"median_income"]], housing_labels)
model.fit(housing[[= model.predict(some_new_data) predictions
predictions
array([131997.15275877, 299359.35844434, 146023.37185694, 138840.33653057,
192016.61557639])
Custom Transformers
To create simple transformers:
from sklearn.preprocessing import FunctionTransformer
= FunctionTransformer(np.log, inverse_func=np.exp)
log_transformer = log_transformer.transform(housing[["population"]]) log_pop
= FunctionTransformer(rbf_kernel,
rbf_transformer =dict(Y=[[35.]], gamma=0.1))
kw_args= rbf_transformer.transform(housing[["housing_median_age"]]) age_simil_35
age_simil_35
array([[2.81118530e-13],
[8.20849986e-02],
[6.70320046e-01],
...,
[9.55316054e-22],
[6.70320046e-01],
[3.03539138e-04]])
= 37.7749, -122.41
sf_coords = FunctionTransformer(rbf_kernel,
sf_transformer =dict(Y=[sf_coords], gamma=0.1))
kw_args= sf_transformer.transform(housing[["latitude", "longitude"]]) sf_simil
sf_simil
array([[0.999927 ],
[0.05258419],
[0.94864161],
...,
[0.00388525],
[0.05038518],
[0.99868067]])
= FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
ratio_transformer 1., 2.], [3., 4.]])) ratio_transformer.transform(np.array([[
array([[0.5 ],
[0.75]])
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
class StandardScalerClone(BaseEstimator, TransformerMixin):
def __init__(self, with_mean=True): # no *args or **kwargs!
self.with_mean = with_mean
def fit(self, X, y=None): # y is required even though we don't use it
= check_array(X) # checks that X is an array with finite float values
X self.mean_ = X.mean(axis=0)
self.scale_ = X.std(axis=0)
self.n_features_in_ = X.shape[1] # every estimator stores this in fit()
return self # always return self!
def transform(self, X):
self) # looks for learned attributes (with trailing _)
check_is_fitted(= check_array(X)
X assert self.n_features_in_ == X.shape[1]
if self.with_mean:
= X - self.mean_
X return X / self.scale_
from sklearn.cluster import KMeans
class ClusterSimilarity(BaseEstimator, TransformerMixin):
def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
self.n_clusters = n_clusters
self.gamma = gamma
self.random_state = random_state
def fit(self, X, y=None, sample_weight=None):
self.kmeans_ = KMeans(self.n_clusters, random_state=self.random_state)
self.kmeans_.fit(X, sample_weight=sample_weight)
return self # always return self!
def transform(self, X):
return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)
def get_feature_names_out(self, names=None):
return [f"Cluster {i} similarity" for i in range(self.n_clusters)]
= ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
cluster_simil = cluster_simil.fit_transform(housing[["latitude", "longitude"]],
similarities =housing_labels) sample_weight
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
3].round(2) similarities[:
array([[0. , 0.14, 0. , 0. , 0. , 0.08, 0. , 0.99, 0. , 0.6 ],
[0.63, 0. , 0.99, 0. , 0. , 0. , 0.04, 0. , 0.11, 0. ],
[0. , 0.29, 0. , 0. , 0.01, 0.44, 0. , 0.7 , 0. , 0.3 ]])
# extra code – this cell generates Figure 2–19
= housing.rename(columns={
housing_renamed "latitude": "Latitude", "longitude": "Longitude",
"population": "Population",
"median_house_value": "Median house value (ᴜsᴅ)"})
"Max cluster similarity"] = similarities.max(axis=1)
housing_renamed[
="scatter", x="Longitude", y="Latitude", grid=True,
housing_renamed.plot(kind=housing_renamed["Population"] / 100, label="Population",
s="Max cluster similarity",
c="jet", colorbar=True,
cmap=True, sharex=False, figsize=(10, 7))
legend1],
plt.plot(cluster_simil.kmeans_.cluster_centers_[:, 0],
cluster_simil.kmeans_.cluster_centers_[:, ="", color="black", marker="X", markersize=20,
linestyle="Cluster centers")
label="upper right")
plt.legend(loc"district_cluster_plot")
save_fig( plt.show()
Transformation Pipelines
Now let’s build a pipeline to preprocess the numerical attributes:
from sklearn.pipeline import Pipeline
= Pipeline([
num_pipeline "impute", SimpleImputer(strategy="median")),
("standardize", StandardScaler()),
( ])
from sklearn.pipeline import make_pipeline
= make_pipeline(SimpleImputer(strategy="median"), StandardScaler()) num_pipeline
from sklearn import set_config
='diagram')
set_config(display
num_pipeline
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())])
SimpleImputer(strategy='median')
StandardScaler()
= num_pipeline.fit_transform(housing_num)
housing_num_prepared 2].round(2) housing_num_prepared[:
array([[-1.42, 1.01, 1.86, 0.31, 1.37, 0.14, 1.39, -0.94],
[ 0.6 , -0.7 , 0.91, -0.31, -0.44, -0.69, -0.37, 1.17]])
def monkey_patch_get_signature_names_out():
"""Monkey patch some classes which did not handle get_feature_names_out()
correctly in Scikit-Learn 1.0.*."""
from inspect import Signature, signature, Parameter
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
= StandardScaler.get_feature_names_out
default_get_feature_names_out
if not hasattr(SimpleImputer, "get_feature_names_out"):
print("Monkey-patching SimpleImputer.get_feature_names_out()")
= default_get_feature_names_out
SimpleImputer.get_feature_names_out
if not hasattr(FunctionTransformer, "get_feature_names_out"):
print("Monkey-patching FunctionTransformer.get_feature_names_out()")
= FunctionTransformer.__init__
orig_init = signature(orig_init)
orig_sig
def __init__(*args, feature_names_out=None, **kwargs):
*args, **kwargs)
orig_sig.bind(*args, **kwargs)
orig_init(0].feature_names_out = feature_names_out
args[
__init__.__signature__ = Signature(
list(signature(orig_init).parameters.values()) + [
"feature_names_out", Parameter.KEYWORD_ONLY)])
Parameter(
def get_feature_names_out(self, names=None):
if callable(self.feature_names_out):
return self.feature_names_out(self, names)
assert self.feature_names_out == "one-to-one"
return default_get_feature_names_out(self, names)
__init__ = __init__
FunctionTransformer.= get_feature_names_out
FunctionTransformer.get_feature_names_out
monkey_patch_get_signature_names_out()
= pd.DataFrame(
df_housing_num_prepared =num_pipeline.get_feature_names_out(),
housing_num_prepared, columns=housing_num.index) index
2) # extra code df_housing_num_prepared.head(
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | |
---|---|---|---|---|---|---|---|---|
13096 | -1.423037 | 1.013606 | 1.861119 | 0.311912 | 1.368167 | 0.137460 | 1.394812 | -0.936491 |
14973 | 0.596394 | -0.702103 | 0.907630 | -0.308620 | -0.435925 | -0.693771 | -0.373485 | 1.171942 |
num_pipeline.steps
[('simpleimputer', SimpleImputer(strategy='median')),
('standardscaler', StandardScaler())]
1] num_pipeline[
StandardScaler()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
StandardScaler()
-1] num_pipeline[:
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median'))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median'))])
SimpleImputer(strategy='median')
"simpleimputer"] num_pipeline.named_steps[
SimpleImputer(strategy='median')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SimpleImputer(strategy='median')
="median") num_pipeline.set_params(simpleimputer__strategy
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())])
SimpleImputer(strategy='median')
StandardScaler()
from sklearn.compose import ColumnTransformer
= ["longitude", "latitude", "housing_median_age", "total_rooms",
num_attribs "total_bedrooms", "population", "households", "median_income"]
= ["ocean_proximity"]
cat_attribs
= make_pipeline(
cat_pipeline ="most_frequent"),
SimpleImputer(strategy="ignore"))
OneHotEncoder(handle_unknown
= ColumnTransformer([
preprocessing "num", num_pipeline, num_attribs),
("cat", cat_pipeline, cat_attribs),
( ])
from sklearn.compose import make_column_selector, make_column_transformer
= make_column_transformer(
preprocessing =np.number)),
(num_pipeline, make_column_selector(dtype_include=object)),
(cat_pipeline, make_column_selector(dtype_include )
= preprocessing.fit_transform(housing) housing_prepared
# extra code – shows that we can get a DataFrame out if we want
= pd.DataFrame(
housing_prepared_fr
housing_prepared,=preprocessing.get_feature_names_out(),
columns=housing.index)
index2) housing_prepared_fr.head(
pipeline-1__longitude | pipeline-1__latitude | pipeline-1__housing_median_age | pipeline-1__total_rooms | pipeline-1__total_bedrooms | pipeline-1__population | pipeline-1__households | pipeline-1__median_income | pipeline-2__ocean_proximity_<1H OCEAN | pipeline-2__ocean_proximity_INLAND | pipeline-2__ocean_proximity_ISLAND | pipeline-2__ocean_proximity_NEAR BAY | pipeline-2__ocean_proximity_NEAR OCEAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13096 | -1.423037 | 1.013606 | 1.861119 | 0.311912 | 1.368167 | 0.137460 | 1.394812 | -0.936491 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
14973 | 0.596394 | -0.702103 | 0.907630 | -0.308620 | -0.435925 | -0.693771 | -0.373485 | 1.171942 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
def column_ratio(X):
return X[:, [0]] / X[:, [1]]
def ratio_name(function_transformer, feature_names_in):
return ["ratio"] # feature names out
def ratio_pipeline():
return make_pipeline(
="median"),
SimpleImputer(strategy=ratio_name),
FunctionTransformer(column_ratio, feature_names_out
StandardScaler())
= make_pipeline(
log_pipeline ="median"),
SimpleImputer(strategy="one-to-one"),
FunctionTransformer(np.log, feature_names_out
StandardScaler())= ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
cluster_simil = make_pipeline(SimpleImputer(strategy="median"),
default_num_pipeline
StandardScaler())= ColumnTransformer([
preprocessing "bedrooms", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),
("people_per_house", ratio_pipeline(), ["population", "households"]),
("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
("households", "median_income"]),
"geo", cluster_simil, ["latitude", "longitude"]),
("cat", cat_pipeline, make_column_selector(dtype_include=object)),
(
],=default_num_pipeline) # one column remaining: housing_median_age remainder
= preprocessing.fit_transform(housing)
housing_prepared housing_prepared.shape
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
(16512, 24)
preprocessing.get_feature_names_out()
array(['bedrooms__ratio', 'rooms_per_house__ratio',
'people_per_house__ratio', 'log__total_bedrooms',
'log__total_rooms', 'log__population', 'log__households',
'log__median_income', 'geo__Cluster 0 similarity',
'geo__Cluster 1 similarity', 'geo__Cluster 2 similarity',
'geo__Cluster 3 similarity', 'geo__Cluster 4 similarity',
'geo__Cluster 5 similarity', 'geo__Cluster 6 similarity',
'geo__Cluster 7 similarity', 'geo__Cluster 8 similarity',
'geo__Cluster 9 similarity', 'cat__ocean_proximity_<1H OCEAN',
'cat__ocean_proximity_INLAND', 'cat__ocean_proximity_ISLAND',
'cat__ocean_proximity_NEAR BAY', 'cat__ocean_proximity_NEAR OCEAN',
'remainder__housing_median_age'], dtype=object)
Select and Train a Model
Training and Evaluating on the Training Set
from sklearn.linear_model import LinearRegression
= make_pipeline(preprocessing, LinearRegression())
lin_reg lin_reg.fit(housing, housing_labels)
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
Pipeline(steps=[('columntransformer', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6... 'median_income']), ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('linearregression', LinearRegression())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('columntransformer', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6... 'median_income']), ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('linearregression', LinearRegression())])
ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio... ['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']), ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])
['total_bedrooms', 'total_rooms']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_rooms', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['population', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out='one-to-one', func=<ufunc 'log'>)
StandardScaler()
['latitude', 'longitude']
ClusterSimilarity(random_state=42)
<sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>
SimpleImputer(strategy='most_frequent')
OneHotEncoder(handle_unknown='ignore')
['housing_median_age']
SimpleImputer(strategy='median')
StandardScaler()
LinearRegression()
Let’s try the full preprocessing pipeline on a few training instances:
= lin_reg.predict(housing)
housing_predictions 5].round(-2) # -2 = rounded to the nearest hundred housing_predictions[:
array([243700., 372400., 128800., 94400., 328300.])
Compare against the actual values:
5].values housing_labels.iloc[:
array([458300., 483800., 101700., 96100., 361800.])
# extra code – computes the error ratios discussed in the book
= housing_predictions[:5].round(-2) / housing_labels.iloc[:5].values - 1
error_ratios print(", ".join([f"{100 * ratio:.1f}%" for ratio in error_ratios]))
-46.8%, -23.0%, 26.6%, -1.8%, -9.3%
from sklearn.metrics import mean_squared_error
= mean_squared_error(housing_labels, housing_predictions,
lin_rmse =False)
squared lin_rmse
68687.89176589991
from sklearn.tree import DecisionTreeRegressor
= make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg tree_reg.fit(housing, housing_labels)
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
Pipeline(steps=[('columntransformer', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6... ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('decisiontreeregressor', DecisionTreeRegressor(random_state=42))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('columntransformer', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6... ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('decisiontreeregressor', DecisionTreeRegressor(random_state=42))])
ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio... ['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']), ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])
['total_bedrooms', 'total_rooms']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_rooms', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['population', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out='one-to-one', func=<ufunc 'log'>)
StandardScaler()
['latitude', 'longitude']
ClusterSimilarity(random_state=42)
<sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>
SimpleImputer(strategy='most_frequent')
OneHotEncoder(handle_unknown='ignore')
['housing_median_age']
SimpleImputer(strategy='median')
StandardScaler()
DecisionTreeRegressor(random_state=42)
= tree_reg.predict(housing)
housing_predictions = mean_squared_error(housing_labels, housing_predictions,
tree_rmse =False)
squared tree_rmse
0.0
Better Evaluation Using Cross-Validation
from sklearn.model_selection import cross_val_score
= -cross_val_score(tree_reg, housing, housing_labels,
tree_rmses ="neg_root_mean_squared_error", cv=10) scoring
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
pd.Series(tree_rmses).describe()
count 10.000000
mean 66868.027288
std 2060.966425
min 63649.536493
25% 65338.078316
50% 66801.953094
75% 68229.934454
max 70094.778246
dtype: float64
# extra code – computes the error stats for the linear model
= -cross_val_score(lin_reg, housing, housing_labels,
lin_rmses ="neg_root_mean_squared_error", cv=10)
scoring pd.Series(lin_rmses).describe()
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
count 10.000000
mean 69858.018195
std 4182.205077
min 65397.780144
25% 68070.536263
50% 68619.737842
75% 69810.076342
max 80959.348171
dtype: float64
Warning: the following cell may take a few minutes to run:
from sklearn.ensemble import RandomForestRegressor
= make_pipeline(preprocessing,
forest_reg =42))
RandomForestRegressor(random_state= -cross_val_score(forest_reg, housing, housing_labels,
forest_rmses ="neg_root_mean_squared_error", cv=10) scoring
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
pd.Series(forest_rmses).describe()
count 10.000000
mean 47019.561281
std 1033.957120
min 45458.112527
25% 46464.031184
50% 46967.596354
75% 47325.694987
max 49243.765795
dtype: float64
Let’s compare this RMSE measured using cross-validation (the “validation error”) with the RMSE measured on the training set (the “training error”):
forest_reg.fit(housing, housing_labels)= forest_reg.predict(housing)
housing_predictions = mean_squared_error(housing_labels, housing_predictions,
forest_rmse =False)
squared forest_rmse
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
17474.619286483998
The training error is much lower than the validation error, which usually means that the model has overfit the training set. Another possible explanation may be that there’s a mismatch between the training data and the validation data, but it’s not the case here, since both came from the same dataset that we shuffled and split in two parts.
Fine-Tune Your Model
Grid Search
Warning: the following cell may take a few minutes to run:
from sklearn.model_selection import GridSearchCV
= Pipeline([
full_pipeline "preprocessing", preprocessing),
("random_forest", RandomForestRegressor(random_state=42)),
(
])= [
param_grid 'preprocessing__geo__n_clusters': [5, 8, 10],
{'random_forest__max_features': [4, 6, 8]},
'preprocessing__geo__n_clusters': [10, 15],
{'random_forest__max_features': [6, 8, 10]},
]= GridSearchCV(full_pipeline, param_grid, cv=3,
grid_search ='neg_root_mean_squared_error')
scoring grid_search.fit(housing, housing_labels)
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
GridSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<f... <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('random_forest', RandomForestRegressor(random_state=42))]), param_grid=[{'preprocessing__geo__n_clusters': [5, 8, 10], 'random_forest__max_features': [4, 6, 8]}, {'preprocessing__geo__n_clusters': [10, 15], 'random_forest__max_features': [6, 8, 10]}], scoring='neg_root_mean_squared_error')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<f... <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('random_forest', RandomForestRegressor(random_state=42))]), param_grid=[{'preprocessing__geo__n_clusters': [5, 8, 10], 'random_forest__max_features': [4, 6, 8]}, {'preprocessing__geo__n_clusters': [10, 15], 'random_forest__max_features': [6, 8, 10]}], scoring='neg_root_mean_squared_error')
Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe... ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('random_forest', RandomForestRegressor(random_state=42))])
ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio... ['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']), ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])
['total_bedrooms', 'total_rooms']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_rooms', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['population', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out='one-to-one', func=<ufunc 'log'>)
StandardScaler()
['latitude', 'longitude']
ClusterSimilarity(random_state=42)
<sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>
SimpleImputer(strategy='most_frequent')
OneHotEncoder(handle_unknown='ignore')
['housing_median_age']
SimpleImputer(strategy='median')
StandardScaler()
RandomForestRegressor(random_state=42)
You can get the full list of hyperparameters available for tuning by looking at full_pipeline.get_params().keys()
:
# extra code – shows part of the output of get_params().keys()
print(str(full_pipeline.get_params().keys())[:1000] + "...")
dict_keys(['memory', 'steps', 'verbose', 'preprocessing', 'random_forest', 'preprocessing__n_jobs', 'preprocessing__remainder__memory', 'preprocessing__remainder__steps', 'preprocessing__remainder__verbose', 'preprocessing__remainder__simpleimputer', 'preprocessing__remainder__standardscaler', 'preprocessing__remainder__simpleimputer__add_indicator', 'preprocessing__remainder__simpleimputer__copy', 'preprocessing__remainder__simpleimputer__fill_value', 'preprocessing__remainder__simpleimputer__keep_empty_features', 'preprocessing__remainder__simpleimputer__missing_values', 'preprocessing__remainder__simpleimputer__strategy', 'preprocessing__remainder__simpleimputer__verbose', 'preprocessing__remainder__standardscaler__copy', 'preprocessing__remainder__standardscaler__with_mean', 'preprocessing__remainder__standardscaler__with_std', 'preprocessing__remainder', 'preprocessing__sparse_threshold', 'preprocessing__transformer_weights', 'preprocessing__transformers', 'preprocessing__verbose'...
The best hyperparameter combination found:
grid_search.best_params_
{'preprocessing__geo__n_clusters': 15, 'random_forest__max_features': 6}
grid_search.best_estimator_
Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe... ClusterSimilarity(n_clusters=15, random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bdbef4970>)])), ('random_forest', RandomForestRegressor(max_features=6, random_state=42))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe... ClusterSimilarity(n_clusters=15, random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bdbef4970>)])), ('random_forest', RandomForestRegressor(max_features=6, random_state=42))])
ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio... ['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']), ('geo', ClusterSimilarity(n_clusters=15, random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bdbef4970>)])
['total_bedrooms', 'total_rooms']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_rooms', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['population', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out='one-to-one', func=<ufunc 'log'>)
StandardScaler()
['latitude', 'longitude']
ClusterSimilarity(n_clusters=15, random_state=42)
<sklearn.compose._column_transformer.make_column_selector object at 0x7f6bdbef4970>
SimpleImputer(strategy='most_frequent')
OneHotEncoder(handle_unknown='ignore')
['housing_median_age']
SimpleImputer(strategy='median')
StandardScaler()
RandomForestRegressor(max_features=6, random_state=42)
Let’s look at the score of each hyperparameter combination tested during the grid search:
= pd.DataFrame(grid_search.cv_results_)
cv_res ="mean_test_score", ascending=False, inplace=True)
cv_res.sort_values(by
# extra code – these few lines of code just make the DataFrame look nicer
= cv_res[["param_preprocessing__geo__n_clusters",
cv_res "param_random_forest__max_features", "split0_test_score",
"split1_test_score", "split2_test_score", "mean_test_score"]]
= ["split0", "split1", "split2", "mean_test_rmse"]
score_cols = ["n_clusters", "max_features"] + score_cols
cv_res.columns = -cv_res[score_cols].round().astype(np.int64)
cv_res[score_cols]
cv_res.head()
n_clusters | max_features | split0 | split1 | split2 | mean_test_rmse | |
---|---|---|---|---|---|---|
12 | 15 | 6 | 43460 | 43919 | 44748 | 44042 |
13 | 15 | 8 | 44132 | 44075 | 45010 | 44406 |
14 | 15 | 10 | 44374 | 44286 | 45316 | 44659 |
7 | 10 | 6 | 44683 | 44655 | 45657 | 44999 |
9 | 10 | 6 | 44683 | 44655 | 45657 | 44999 |
Randomized Search
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
Try 30 (n_iter
× cv
) random combinations of hyperparameters:
Warning: the following cell may take a few minutes to run:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
= {'preprocessing__geo__n_clusters': randint(low=3, high=50),
param_distribs 'random_forest__max_features': randint(low=2, high=20)}
= RandomizedSearchCV(
rnd_search =param_distribs, n_iter=10, cv=3,
full_pipeline, param_distributions='neg_root_mean_squared_error', random_state=42)
scoring
rnd_search.fit(housing, housing_labels)
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
RandomizedSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_... ('random_forest', RandomForestRegressor(random_state=42))]), param_distributions={'preprocessing__geo__n_clusters': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7f6bd9735c30>, 'random_forest__max_features': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7f6bdbd0a2f0>}, random_state=42, scoring='neg_root_mean_squared_error')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_... ('random_forest', RandomForestRegressor(random_state=42))]), param_distributions={'preprocessing__geo__n_clusters': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7f6bd9735c30>, 'random_forest__max_features': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7f6bdbd0a2f0>}, random_state=42, scoring='neg_root_mean_squared_error')
Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe... ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('random_forest', RandomForestRegressor(random_state=42))])
ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio... ['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']), ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])
['total_bedrooms', 'total_rooms']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_rooms', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['population', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out='one-to-one', func=<ufunc 'log'>)
StandardScaler()
['latitude', 'longitude']
ClusterSimilarity(random_state=42)
<sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>
SimpleImputer(strategy='most_frequent')
OneHotEncoder(handle_unknown='ignore')
['housing_median_age']
SimpleImputer(strategy='median')
StandardScaler()
RandomForestRegressor(random_state=42)
# extra code – displays the random search results
= pd.DataFrame(rnd_search.cv_results_)
cv_res ="mean_test_score", ascending=False, inplace=True)
cv_res.sort_values(by= cv_res[["param_preprocessing__geo__n_clusters",
cv_res "param_random_forest__max_features", "split0_test_score",
"split1_test_score", "split2_test_score", "mean_test_score"]]
= ["n_clusters", "max_features"] + score_cols
cv_res.columns = -cv_res[score_cols].round().astype(np.int64)
cv_res[score_cols] cv_res.head()
n_clusters | max_features | split0 | split1 | split2 | mean_test_rmse | |
---|---|---|---|---|---|---|
1 | 45 | 9 | 41287 | 42071 | 42627 | 41995 |
8 | 32 | 7 | 41690 | 42513 | 43224 | 42475 |
0 | 41 | 16 | 42223 | 42959 | 43321 | 42834 |
5 | 42 | 4 | 41818 | 43094 | 43817 | 42910 |
2 | 23 | 8 | 42264 | 42996 | 43830 | 43030 |
Bonus section: how to choose the sampling distribution for a hyperparameter
scipy.stats.randint(a, b+1)
: for hyperparameters with discrete values that range from a to b, and all values in that range seem equally likely.scipy.stats.uniform(a, b)
: this is very similar, but for continuous hyperparameters.scipy.stats.geom(1 / scale)
: for discrete values, when you want to sample roughly in a given scale. E.g., with scale=1000 most samples will be in this ballpark, but ~10% of all samples will be <100 and ~10% will be >2300.scipy.stats.expon(scale)
: this is the continuous equivalent ofgeom
. Just setscale
to the most likely value.scipy.stats.loguniform(a, b)
: when you have almost no idea what the optimal hyperparameter value’s scale is. If you set a=0.01 and b=100, then you’re just as likely to sample a value between 0.01 and 0.1 as a value between 10 and 100.
Here are plots of the probability mass functions (for discrete variables), and probability density functions (for continuous variables) for randint()
, uniform()
, geom()
and expon()
:
# extra code – plots a few distributions you can use in randomized search
from scipy.stats import randint, uniform, geom, expon
= np.arange(0, 7 + 1)
xs1 = randint(0, 7 + 1).pmf(xs1)
randint_distrib
= np.linspace(0, 7, 500)
xs2 = uniform(0, 7).pdf(xs2)
uniform_distrib
= np.arange(0, 7 + 1)
xs3 = geom(0.5).pmf(xs3)
geom_distrib
= np.linspace(0, 7, 500)
xs4 = expon(scale=1).pdf(xs4)
expon_distrib
=(12, 7))
plt.figure(figsize
2, 2, 1)
plt.subplot(="scipy.randint(0, 7 + 1)")
plt.bar(xs1, randint_distrib, label"Probability")
plt.ylabel(
plt.legend()-1, 8, 0, 0.2])
plt.axis([
2, 2, 2)
plt.subplot(="scipy.uniform(0, 7)")
plt.fill_between(xs2, uniform_distrib, label"PDF")
plt.ylabel(
plt.legend()-1, 8, 0, 0.2])
plt.axis([
2, 2, 3)
plt.subplot(="scipy.geom(0.5)")
plt.bar(xs3, geom_distrib, label"Hyperparameter value")
plt.xlabel("Probability")
plt.ylabel(
plt.legend()0, 7, 0, 1])
plt.axis([
2, 2, 4)
plt.subplot(="scipy.expon(scale=1)")
plt.fill_between(xs4, expon_distrib, label"Hyperparameter value")
plt.xlabel("PDF")
plt.ylabel(
plt.legend()0, 7, 0, 1])
plt.axis([
plt.show()
Here are the PDF for expon()
and loguniform()
(left column), as well as the PDF of log(X) (right column). The right column shows the distribution of hyperparameter scales. You can see that expon()
favors hyperparameters with roughly the desired scale, with a longer tail towards the smaller scales. But loguniform()
does not favor any scale, they are all equally likely:
# extra code – shows the difference between expon and loguniform
from scipy.stats import loguniform
= np.linspace(0, 7, 500)
xs1 = expon(scale=1).pdf(xs1)
expon_distrib
= np.linspace(-5, 3, 500)
log_xs2 = np.exp(log_xs2 - np.exp(log_xs2))
log_expon_distrib
= np.linspace(0.001, 1000, 500)
xs3 = loguniform(0.001, 1000).pdf(xs3)
loguniform_distrib
= np.linspace(np.log(0.001), np.log(1000), 500)
log_xs4 = uniform(np.log(0.001), np.log(1000)).pdf(log_xs4)
log_loguniform_distrib
=(12, 7))
plt.figure(figsize
2, 2, 1)
plt.subplot(
plt.fill_between(xs1, expon_distrib,="scipy.expon(scale=1)")
label"PDF")
plt.ylabel(
plt.legend()0, 7, 0, 1])
plt.axis([
2, 2, 2)
plt.subplot(
plt.fill_between(log_xs2, log_expon_distrib,="log(X) with X ~ expon")
label
plt.legend()-5, 3, 0, 1])
plt.axis([
2, 2, 3)
plt.subplot(
plt.fill_between(xs3, loguniform_distrib,="scipy.loguniform(0.001, 1000)")
label"Hyperparameter value")
plt.xlabel("PDF")
plt.ylabel(
plt.legend()0.001, 1000, 0, 0.005])
plt.axis([
2, 2, 4)
plt.subplot(
plt.fill_between(log_xs4, log_loguniform_distrib,="log(X) with X ~ loguniform")
label"Log of hyperparameter value")
plt.xlabel(
plt.legend()-8, 1, 0, 0.2])
plt.axis([
plt.show()
Analyze the Best Models and Their Errors
= rnd_search.best_estimator_ # includes preprocessing
final_model = final_model["random_forest"].feature_importances_
feature_importances round(2) feature_importances.
array([0.07, 0.05, 0.05, 0.01, 0.01, 0.01, 0.01, 0.19, 0.04, 0.01, 0. ,
0.01, 0.01, 0.01, 0.01, 0.01, 0. , 0.01, 0.01, 0.01, 0. , 0.01,
0.01, 0.01, 0.01, 0.01, 0. , 0. , 0.02, 0.01, 0.01, 0.01, 0.02,
0.01, 0. , 0.02, 0.03, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.01,
0.01, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.01, 0. , 0.07,
0. , 0. , 0. , 0.01])
sorted(zip(feature_importances,
"preprocessing"].get_feature_names_out()),
final_model[=True) reverse
[(0.18694559869103852, 'log__median_income'),
(0.0748194905715524, 'cat__ocean_proximity_INLAND'),
(0.06926417748515576, 'bedrooms__ratio'),
(0.05446998753775219, 'rooms_per_house__ratio'),
(0.05262301809680712, 'people_per_house__ratio'),
(0.03819415873915732, 'geo__Cluster 0 similarity'),
(0.02879263999929514, 'geo__Cluster 28 similarity'),
(0.023530192521380392, 'geo__Cluster 24 similarity'),
(0.020544786346378206, 'geo__Cluster 27 similarity'),
(0.019873052631077512, 'geo__Cluster 43 similarity'),
(0.018597511022930273, 'geo__Cluster 34 similarity'),
(0.017409085415656868, 'geo__Cluster 37 similarity'),
(0.015546519677632162, 'geo__Cluster 20 similarity'),
(0.014230331127504292, 'geo__Cluster 17 similarity'),
(0.0141032216204026, 'geo__Cluster 39 similarity'),
(0.014065768027447325, 'geo__Cluster 9 similarity'),
(0.01354220782825315, 'geo__Cluster 4 similarity'),
(0.01348963625822907, 'geo__Cluster 3 similarity'),
(0.01338319626383868, 'geo__Cluster 38 similarity'),
(0.012240533790212824, 'geo__Cluster 31 similarity'),
(0.012089046542256785, 'geo__Cluster 7 similarity'),
(0.01152326329703204, 'geo__Cluster 23 similarity'),
(0.011397459905603558, 'geo__Cluster 40 similarity'),
(0.011282340924816439, 'geo__Cluster 36 similarity'),
(0.01104139770781063, 'remainder__housing_median_age'),
(0.010671123191312802, 'geo__Cluster 44 similarity'),
(0.010296376177202627, 'geo__Cluster 5 similarity'),
(0.010184798445004483, 'geo__Cluster 42 similarity'),
(0.010121853542225083, 'geo__Cluster 11 similarity'),
(0.009795219101117579, 'geo__Cluster 35 similarity'),
(0.00952581084310724, 'geo__Cluster 10 similarity'),
(0.009433209165984823, 'geo__Cluster 13 similarity'),
(0.00915075361116215, 'geo__Cluster 1 similarity'),
(0.009021485619463173, 'geo__Cluster 30 similarity'),
(0.00894936224917583, 'geo__Cluster 41 similarity'),
(0.008901832702357514, 'geo__Cluster 25 similarity'),
(0.008897504713401587, 'geo__Cluster 29 similarity'),
(0.0086846298524955, 'geo__Cluster 21 similarity'),
(0.008061104590483955, 'geo__Cluster 15 similarity'),
(0.00786048176566994, 'geo__Cluster 16 similarity'),
(0.007793633130749198, 'geo__Cluster 22 similarity'),
(0.007501766442066527, 'log__total_rooms'),
(0.0072024111938241275, 'geo__Cluster 32 similarity'),
(0.006947156598995616, 'log__population'),
(0.006800076770899128, 'log__households'),
(0.006736105364684462, 'log__total_bedrooms'),
(0.006315268213499131, 'geo__Cluster 33 similarity'),
(0.005796398579893261, 'geo__Cluster 14 similarity'),
(0.005234954623294958, 'geo__Cluster 6 similarity'),
(0.0045514083468621595, 'geo__Cluster 12 similarity'),
(0.004546042080216035, 'geo__Cluster 18 similarity'),
(0.004314514641115755, 'geo__Cluster 2 similarity'),
(0.003953528110719969, 'geo__Cluster 19 similarity'),
(0.003297404747742136, 'geo__Cluster 26 similarity'),
(0.00289453474290887, 'cat__ocean_proximity_<1H OCEAN'),
(0.0016978863168109126, 'cat__ocean_proximity_NEAR OCEAN'),
(0.0016391131530559377, 'geo__Cluster 8 similarity'),
(0.00015061247730531558, 'cat__ocean_proximity_NEAR BAY'),
(7.301686597099842e-05, 'cat__ocean_proximity_ISLAND')]
Evaluate Your System on the Test Set
= strat_test_set.drop("median_house_value", axis=1)
X_test = strat_test_set["median_house_value"].copy()
y_test
= final_model.predict(X_test)
final_predictions
= mean_squared_error(y_test, final_predictions, squared=False)
final_rmse print(final_rmse)
41424.40026462184
We can compute a 95% confidence interval for the test RMSE:
from scipy import stats
= 0.95
confidence = (final_predictions - y_test) ** 2
squared_errors len(squared_errors) - 1,
np.sqrt(stats.t.interval(confidence, =squared_errors.mean(),
loc=stats.sem(squared_errors))) scale
array([39275.40861216, 43467.27680583])
We could compute the interval manually like this:
# extra code – shows how to compute a confidence interval for the RMSE
= len(squared_errors)
m = squared_errors.mean()
mean = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tscore = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
tmargin - tmargin), np.sqrt(mean + tmargin) np.sqrt(mean
(39275.40861216077, 43467.2768058342)
Alternatively, we could use a z-score rather than a t-score. Since the test set is not too small, it won’t make a big difference:
# extra code – computes a confidence interval again using a z-score
= stats.norm.ppf((1 + confidence) / 2)
zscore = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
zmargin - zmargin), np.sqrt(mean + zmargin) np.sqrt(mean
(39276.05610140007, 43466.691749969636)
Model persistence using joblib
Save the final model:
import joblib
"my_california_housing_model.pkl") joblib.dump(final_model,
['my_california_housing_model.pkl']
Now you can deploy this model to production. For example, the following code could be a script that would run in production:
import joblib
# extra code – excluded for conciseness
from sklearn.cluster import KMeans
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics.pairwise import rbf_kernel
def column_ratio(X):
return X[:, [0]] / X[:, [1]]
#class ClusterSimilarity(BaseEstimator, TransformerMixin):
# [...]
= joblib.load("my_california_housing_model.pkl")
final_model_reloaded
= housing.iloc[:5] # pretend these are new districts
new_data = final_model_reloaded.predict(new_data) predictions
predictions
array([442737.15, 457566.06, 105965. , 98462. , 332992.01])
You could use pickle instead, but joblib is more efficient.
Exercise solutions
1.
Exercise: Try a Support Vector Machine regressor (sklearn.svm.SVR
) with various hyperparameters, such as kernel="linear"
(with various values for the C
hyperparameter) or kernel="rbf"
(with various values for the C
and gamma
hyperparameters). Note that SVMs don’t scale well to large datasets, so you should probably train your model on just the first 5,000 instances of the training set and use only 3-fold cross-validation, or else it will take hours. Don’t worry about what the hyperparameters mean for now (see the SVM notebook if you’re interested). How does the best SVR
predictor perform?
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
= [
param_grid 'svr__kernel': ['linear'], 'svr__C': [10., 30., 100., 300., 1000.,
{3000., 10000., 30000.0]},
'svr__kernel': ['rbf'], 'svr__C': [1.0, 3.0, 10., 30., 100., 300.,
{1000.0],
'svr__gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]
= Pipeline([("preprocessing", preprocessing), ("svr", SVR())])
svr_pipeline = GridSearchCV(svr_pipeline, param_grid, cv=3,
grid_search ='neg_root_mean_squared_error')
scoring5000], housing_labels.iloc[:5000]) grid_search.fit(housing.iloc[:
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
GridSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<f... <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('svr', SVR())]), param_grid=[{'svr__C': [10.0, 30.0, 100.0, 300.0, 1000.0, 3000.0, 10000.0, 30000.0], 'svr__kernel': ['linear']}, {'svr__C': [1.0, 3.0, 10.0, 30.0, 100.0, 300.0, 1000.0], 'svr__gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0], 'svr__kernel': ['rbf']}], scoring='neg_root_mean_squared_error')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<f... <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('svr', SVR())]), param_grid=[{'svr__C': [10.0, 30.0, 100.0, 300.0, 1000.0, 3000.0, 10000.0, 30000.0], 'svr__kernel': ['linear']}, {'svr__C': [1.0, 3.0, 10.0, 30.0, 100.0, 300.0, 1000.0], 'svr__gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0], 'svr__kernel': ['rbf']}], scoring='neg_root_mean_squared_error')
Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe... 'total_rooms', 'population', 'households', 'median_income']), ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('svr', SVR())])
ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio... ['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']), ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])
['total_bedrooms', 'total_rooms']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_rooms', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['population', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out='one-to-one', func=<ufunc 'log'>)
StandardScaler()
['latitude', 'longitude']
ClusterSimilarity(random_state=42)
<sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>
SimpleImputer(strategy='most_frequent')
OneHotEncoder(handle_unknown='ignore')
['housing_median_age']
SimpleImputer(strategy='median')
StandardScaler()
SVR()
The best model achieves the following score (evaluated using 3-fold cross validation):
= -grid_search.best_score_
svr_grid_search_rmse svr_grid_search_rmse
69814.13888570886
That’s much worse than the RandomForestRegressor
(but to be fair, we trained the model on much less data). Let’s check the best hyperparameters found:
grid_search.best_params_
{'svr__C': 10000.0, 'svr__kernel': 'linear'}
The linear kernel seems better than the RBF kernel. Notice that the value of C
is the maximum tested value. When this happens you definitely want to launch the grid search again with higher values for C
(removing the smallest values), because it is likely that higher values of C
will be better.
2.
Exercise: Try replacing the GridSearchCV
with a RandomizedSearchCV
.
Warning: the following cell will take several minutes to run. You can specify verbose=2
when creating the RandomizedSearchCV
if you want to see the training details.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon, loguniform
# see https://docs.scipy.org/doc/scipy/reference/stats.html
# for `expon()` and `loguniform()` documentation and more probability distribution functions.
# Note: gamma is ignored when kernel is "linear"
= {
param_distribs 'svr__kernel': ['linear', 'rbf'],
'svr__C': loguniform(20, 200_000),
'svr__gamma': expon(scale=1.0),
}
= RandomizedSearchCV(svr_pipeline,
rnd_search =param_distribs,
param_distributions=50, cv=3,
n_iter='neg_root_mean_squared_error',
scoring=42)
random_state5000], housing_labels.iloc[:5000]) rnd_search.fit(housing.iloc[:
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
RandomizedSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_... <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('svr', SVR())]), n_iter=50, param_distributions={'svr__C': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f6bd730b1c0>, 'svr__gamma': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f6bd73278b0>, 'svr__kernel': ['linear', 'rbf']}, random_state=42, scoring='neg_root_mean_squared_error')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_... <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('svr', SVR())]), n_iter=50, param_distributions={'svr__C': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f6bd730b1c0>, 'svr__gamma': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f6bd73278b0>, 'svr__kernel': ['linear', 'rbf']}, random_state=42, scoring='neg_root_mean_squared_error')
Pipeline(steps=[('preprocessing', ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe... 'total_rooms', 'population', 'households', 'median_income']), ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('svr', SVR())])
ColumnTransformer(remainder=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio... ['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']), ('geo', ClusterSimilarity(random_state=42), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])
['total_bedrooms', 'total_rooms']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_rooms', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['population', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out='one-to-one', func=<ufunc 'log'>)
StandardScaler()
['latitude', 'longitude']
ClusterSimilarity(random_state=42)
<sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>
SimpleImputer(strategy='most_frequent')
OneHotEncoder(handle_unknown='ignore')
['housing_median_age']
SimpleImputer(strategy='median')
StandardScaler()
SVR()
The best model achieves the following score (evaluated using 3-fold cross validation):
= -rnd_search.best_score_
svr_rnd_search_rmse svr_rnd_search_rmse
55853.88100158656
Now that’s really much better, but still far from the RandomForestRegressor
’s performance. Let’s check the best hyperparameters found:
rnd_search.best_params_
{'svr__C': 157055.10989448498,
'svr__gamma': 0.26497040005002437,
'svr__kernel': 'rbf'}
This time the search found a good set of hyperparameters for the RBF kernel. Randomized search tends to find better hyperparameters than grid search in the same amount of time.
Note that we used the expon()
distribution for gamma
, with a scale of 1, so RandomSearch
mostly searched for values roughly of that scale: about 80% of the samples were between 0.1 and 2.3 (roughly 10% were smaller and 10% were larger):
42)
np.random.seed(
= expon(scale=1).rvs(100_000) # get 100,000 samples
s > 0.105) & (s < 2.29)).sum() / 100_000 ((s
0.80066
We used the loguniform()
distribution for C
, meaning we did not have a clue what the optimal scale of C
was before running the random search. It explored the range from 20 to 200 just as much as the range from 2,000 to 20,000 or from 20,000 to 200,000.
3.
Exercise: Try adding a SelectFromModel
transformer in the preparation pipeline to select only the most important attributes.
Let’s create a new pipeline that runs the previously defined preparation pipeline, and adds a SelectFromModel
transformer based on a RandomForestRegressor
before the final regressor:
from sklearn.feature_selection import SelectFromModel
= Pipeline([
selector_pipeline 'preprocessing', preprocessing),
('selector', SelectFromModel(RandomForestRegressor(random_state=42),
(=0.005)), # min feature importance
threshold'svr', SVR(C=rnd_search.best_params_["svr__C"],
(=rnd_search.best_params_["svr__gamma"],
gamma=rnd_search.best_params_["svr__kernel"])),
kernel ])
= -cross_val_score(selector_pipeline,
selector_rmses 5000],
housing.iloc[:5000],
housing_labels.iloc[:="neg_root_mean_squared_error",
scoring=3)
cv pd.Series(selector_rmses).describe()
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/workspaces/data_mining/.venv/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
count 3.000000
mean 56211.362086
std 1922.002801
min 54150.008630
25% 55339.929908
50% 56529.851186
75% 57242.038814
max 57954.226441
dtype: float64
Oh well, feature selection does not seem to help. But maybe that’s just because the threshold we used was not optimal. Perhaps try tuning it using random search or grid search?
4.
Exercise: Try creating a custom transformer that trains a k-Nearest Neighbors regressor (sklearn.neighbors.KNeighborsRegressor
) in its fit()
method, and outputs the model’s predictions in its transform()
method. Then add this feature to the preprocessing pipeline, using latitude and longitude as the inputs to this transformer. This will add a feature in the model that corresponds to the housing median price of the nearest districts.
Rather than restrict ourselves to k-Nearest Neighbors regressors, let’s create a transformer that accepts any regressor. For this, we can extend the MetaEstimatorMixin
and have a required estimator
argument in the constructor. The fit()
method must work on a clone of this estimator, and it must also save feature_names_in_
. The MetaEstimatorMixin
will ensure that estimator
is listed as a required parameters, and it will update get_params()
and set_params()
to make the estimator’s hyperparameters available for tuning. Lastly, we create a get_feature_names_out()
method: the output column name is the …
from sklearn.neighbors import KNeighborsRegressor
from sklearn.base import MetaEstimatorMixin, clone
class FeatureFromRegressor(MetaEstimatorMixin, BaseEstimator, TransformerMixin):
def __init__(self, estimator):
self.estimator = estimator
def fit(self, X, y=None):
= clone(self.estimator)
estimator_
estimator_.fit(X, y)self.estimator_ = estimator_
self.n_features_in_ = self.estimator_.n_features_in_
if hasattr(self.estimator, "feature_names_in_"):
self.feature_names_in_ = self.estimator.feature_names_in_
return self # always return self!
def transform(self, X):
self)
check_is_fitted(= self.estimator_.predict(X)
predictions if predictions.ndim == 1:
= predictions.reshape(-1, 1)
predictions return predictions
def get_feature_names_out(self, names=None):
self)
check_is_fitted(= getattr(self.estimator_, "n_outputs_", 1)
n_outputs = self.estimator_.__class__.__name__
estimator_class_name = estimator_class_name.lower().replace("_", "")
estimator_short_name return [f"{estimator_short_name}_prediction_{i}"
for i in range(n_outputs)]
Let’s ensure it complies to Scikit-Learn’s API:
from sklearn.utils.estimator_checks import check_estimator
check_estimator(FeatureFromRegressor(KNeighborsRegressor()))
Good! Now let’s test it:
= KNeighborsRegressor(n_neighbors=3, weights="distance")
knn_reg = FeatureFromRegressor(knn_reg)
knn_transformer = housing[["latitude", "longitude"]]
geo_features knn_transformer.fit_transform(geo_features, housing_labels)
array([[412500.33333333],
[435250. ],
[105100. ],
...,
[148800. ],
[500001. ],
[234333.33333333]])
And what does its output feature name look like?
knn_transformer.get_feature_names_out()
['kneighborsregressor_prediction_0']
Okay, now let’s include this transformer in our preprocessing pipeline:
from sklearn.base import clone
= [(name, clone(transformer), columns)
transformers for name, transformer, columns in preprocessing.transformers]
= [name for name, _, _ in transformers].index("geo")
geo_index = ("geo", knn_transformer, ["latitude", "longitude"])
transformers[geo_index]
= ColumnTransformer(transformers) new_geo_preprocessing
= Pipeline([
new_geo_pipeline 'preprocessing', new_geo_preprocessing),
('svr', SVR(C=rnd_search.best_params_["svr__C"],
(=rnd_search.best_params_["svr__gamma"],
gamma=rnd_search.best_params_["svr__kernel"])),
kernel ])
= -cross_val_score(new_geo_pipeline,
new_pipe_rmses 5000],
housing.iloc[:5000],
housing_labels.iloc[:="neg_root_mean_squared_error",
scoring=3)
cv pd.Series(new_pipe_rmses).describe()
count 3.000000
mean 104866.322819
std 2966.688335
min 101535.315061
25% 103687.330297
50% 105839.345534
75% 106531.826698
max 107224.307862
dtype: float64
Yikes, that’s terrible! Apparently the cluster similarity features were much better. But perhaps we should tune the KNeighborsRegressor
’s hyperparameters? That’s what the next exercise is about.
5.
Exercise: Automatically explore some preparation options using RandomSearchCV
.
= {
param_distribs "preprocessing__geo__estimator__n_neighbors": range(1, 30),
"preprocessing__geo__estimator__weights": ["distance", "uniform"],
"svr__C": loguniform(20, 200_000),
"svr__gamma": expon(scale=1.0),
}
= RandomizedSearchCV(new_geo_pipeline,
new_geo_rnd_search =param_distribs,
param_distributions=50,
n_iter=3,
cv='neg_root_mean_squared_error',
scoring=42)
random_state5000], housing_labels.iloc[:5000]) new_geo_rnd_search.fit(housing.iloc[:
RandomizedSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessing', ColumnTransformer(transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)), ('standardscaler', StandardSc... param_distributions={'preprocessing__geo__estimator__n_neighbors': range(1, 30), 'preprocessing__geo__estimator__weights': ['distance', 'uniform'], 'svr__C': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f6bc984c8b0>, 'svr__gamma': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f6bc984ca90>}, random_state=42, scoring='neg_root_mean_squared_error')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessing', ColumnTransformer(transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)), ('standardscaler', StandardSc... param_distributions={'preprocessing__geo__estimator__n_neighbors': range(1, 30), 'preprocessing__geo__estimator__weights': ['distance', 'uniform'], 'svr__C': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f6bc984c8b0>, 'svr__gamma': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f6bc984ca90>}, random_state=42, scoring='neg_root_mean_squared_error')
Pipeline(steps=[('preprocessing', ColumnTransformer(transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)), ('standardscaler', StandardScaler())]), ['total_bedrooms', 'total... FeatureFromRegressor(estimator=KNeighborsRegressor(n_neighbors=3, weights='distance')), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])), ('svr', SVR(C=157055.10989448498, gamma=0.26497040005002437))])
ColumnTransformer(transformers=[('bedrooms', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('functiontransformer', FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)), ('standardscaler', StandardScaler())]), ['total_bedrooms', 'total_rooms']), ('rooms_per_house', Pipe... 'households', 'median_income']), ('geo', FeatureFromRegressor(estimator=KNeighborsRegressor(n_neighbors=3, weights='distance')), ['latitude', 'longitude']), ('cat', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>)])
['total_bedrooms', 'total_rooms']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_rooms', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['population', 'households']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out=<function ratio_name at 0x7f6bdbe6cee0>, func=<function column_ratio at 0x7f6bdbe6cf70>)
StandardScaler()
['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']
SimpleImputer(strategy='median')
FunctionTransformer(feature_names_out='one-to-one', func=<ufunc 'log'>)
StandardScaler()
['latitude', 'longitude']
KNeighborsRegressor(n_neighbors=3, weights='distance')
KNeighborsRegressor(n_neighbors=3, weights='distance')
<sklearn.compose._column_transformer.make_column_selector object at 0x7f6bd716ed70>
SimpleImputer(strategy='most_frequent')
OneHotEncoder(handle_unknown='ignore')
SVR(C=157055.10989448498, gamma=0.26497040005002437)
= -new_geo_rnd_search.best_score_
new_geo_rnd_search_rmse new_geo_rnd_search_rmse
106768.04614723712
Oh well… at least we tried! It looks like the cluster similarity features are definitely better than the KNN feature. But perhaps you could try having both? And maybe training on the full training set would help as well.
6.
Exercise: Try to implement the StandardScalerClone
class again from scratch, then add support for the inverse_transform()
method: executing scaler.inverse_transform(scaler.fit_transform(X))
should return an array very close to X
. Then add support for feature names: set feature_names_in_
in the fit()
method if the input is a DataFrame. This attribute should be a NumPy array of column names. Lastly, implement the get_feature_names_out()
method: it should have one optional input_features=None
argument. If passed, the method should check that its length matches n_features_in_
, and it should match feature_names_in_
if it is defined, then input_features
should be returned. If input_features
is None
, then the method should return feature_names_in_
if it is defined or np.array(["x0", "x1", ...])
with length n_features_in_
otherwise.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
class StandardScalerClone(BaseEstimator, TransformerMixin):
def __init__(self, with_mean=True): # no *args or **kwargs!
self.with_mean = with_mean
def fit(self, X, y=None): # y is required even though we don't use it
= X
X_orig = check_array(X) # checks that X is an array with finite float values
X self.mean_ = X.mean(axis=0)
self.scale_ = X.std(axis=0)
self.n_features_in_ = X.shape[1] # every estimator stores this in fit()
if hasattr(X_orig, "columns"):
self.feature_names_in_ = np.array(X_orig.columns, dtype=object)
return self # always return self!
def transform(self, X):
self) # looks for learned attributes (with trailing _)
check_is_fitted(= check_array(X)
X if self.n_features_in_ != X.shape[1]:
raise ValueError("Unexpected number of features")
if self.with_mean:
= X - self.mean_
X return X / self.scale_
def inverse_transform(self, X):
self)
check_is_fitted(= check_array(X)
X if self.n_features_in_ != X.shape[1]:
raise ValueError("Unexpected number of features")
= X * self.scale_
X return X + self.mean_ if self.with_mean else X
def get_feature_names_out(self, input_features=None):
if input_features is None:
return getattr(self, "feature_names_in_",
f"x{i}" for i in range(self.n_features_in_)])
[else:
if len(input_features) != self.n_features_in_:
raise ValueError("Invalid number of features")
if hasattr(self, "feature_names_in_") and not np.all(
self.feature_names_in_ == input_features
):raise ValueError("input_features ≠ feature_names_in_")
return input_features
Let’s test our custom transformer:
from sklearn.utils.estimator_checks import check_estimator
check_estimator(StandardScalerClone())
No errors, that’s a great start, we respect the Scikit-Learn API.
Now let’s ensure the transformation works as expected:
42)
np.random.seed(= np.random.rand(1000, 3)
X
= StandardScalerClone()
scaler = scaler.fit_transform(X)
X_scaled
assert np.allclose(X_scaled, (X - X.mean(axis=0)) / X.std(axis=0))
How about setting with_mean=False
?
= StandardScalerClone(with_mean=False)
scaler = scaler.fit_transform(X)
X_scaled_uncentered
assert np.allclose(X_scaled_uncentered, X / X.std(axis=0))
And does the inverse work?
= StandardScalerClone()
scaler = scaler.inverse_transform(scaler.fit_transform(X))
X_back
assert np.allclose(X, X_back)
How about the feature names out?
assert np.all(scaler.get_feature_names_out() == ["x0", "x1", "x2"])
assert np.all(scaler.get_feature_names_out(["a", "b", "c"]) == ["a", "b", "c"])
And if we fit a DataFrame, are the feature in and out ok?
= pd.DataFrame({"a": np.random.rand(100), "b": np.random.rand(100)})
df = StandardScalerClone()
scaler = scaler.fit_transform(df)
X_scaled
assert np.all(scaler.feature_names_in_ == ["a", "b"])
assert np.all(scaler.get_feature_names_out() == ["a", "b"])
All good! That’s all for today! 😀
Congratulations! You already know quite a lot about Machine Learning. :)