Содержание
- A Regression Project in Python; Predict Diamonds Prices Based on Cut, Color, Clarity and Other Attributes
- What Is The CRISP-DM Process?
- Step 1 — Business Understanding
- Step 2 — Data Understanding
- Exploratory Analysis
- Data Analysis & Visualization
- Step 3 — Data Preparation
- Removing Outliers
- Feature Engineering
- Create dummy variables
- Splitting to Train and Test
- Step 4 — Modeling
- Train and Build a Linear Regression Model
- Train and Build a Decision Tree Regressor Model
- Train and Build a Random Forest Regressor Model
- Step 5 — Evaluation
- Evaluating the Linear Regression Model
- Evaluating the Decision Tree Regressor Model
- Evaluating the Random Forest Regressor Model
- Selected Model
- Step 6 — Deployment
- In Conclusion
- About the Author
A Regression Project in Python; Predict Diamonds Prices Based on Cut, Color, Clarity and Other Attributes
In order to build an end-to-end regression (i.e., evaluation) project in Python, based on the CRISP-DM Methodology, prices, I chose the diamonds price dataset that was sourced from seaborn (a Python data visualization library). This dataset contains the prices and other attributes of almost 54,000 diamonds. It is a great dataset to work with Python programming, data analysis and visualizations, data science and machine learning.
CRISP-DM stands for CRoss-Industry Standard Process for Data Mining and was developed in 1996 under the ESPRIT initiative. It has been a favourite for business analysts and data scientists alike owing to its easily adaptable model.
What Is The CRISP-DM Process?
CRISP-DM is one of the more structured approaches to solving a problem that requires data science. More precisely, CRISP-DM focuses on the data science part of the operation and features a 6-step process.
Step 1 — Business Understanding
The first step of the CRISP-DM process is business understanding. This is one of the big reasons it is popular among business intelligence practitioners; a BI-first approach. This step includes the basic groundwork for the rest of the project, such as determining goals and objectives, producing a plan and planning out business success criteria.
It is also important to gain an understanding of the workings of the situation, requiring a deep assessment of the situation. As the process requires data mining, it is also important to determine which features to explore and which to eliminate. The goals of the data mining procedure must also be established.
This will enable the project to have a much more focused view of things, leading to less time mining data which will not be used. Along with determining where the business needs improvement, this step also shows the pain points of the organization. Knowing the company inside out is important for deriving actionable insights.
- Business Needs- De Beers is a is the world’s largest diamond company. De Beers needs to know the updated market price (in US dollars) of any diamond it sells. This is a classic regression (i.e., evaluation) problem, in which I need to collect the relevant data, build a useful model and estimate the expected error.
- Data Science Objective- I need to build a model which predicts, with a high-level accuracy, the market price in US dollars of a diamond by relating the prices of De Beers diamonds which were sold to their features. Since, I want my model to be as accurate as possible, I will optimize the mean absolute error on the testing set (which is a metric for accuracy) instead of the R² (i.e., the coefficient of determination) regression score on the testing set (which is a metric of precision).
Step 2 — Data Understanding
One of the biggest parts of data science is, of course, handling data. A well-managed set of data sources and collection of data marks the difference between a successful project and a confusing mess.
The second step of CRISP-DM involves acquiring the data listed in the project. All data relevant to the project goals must be collected, with reports being made at every stage. After collection, efforts must be made to explore the data using methods such as querying, data visualization and more.
It is also important to keep track of the quality of the data in order to ensure that unclean data doesn’t hamper the results. Moreover, there should be a back-and-forth with the business understanding step for a truly flexible approach.
Let’s collect the relevant data from the internet
- Dataset kind- The dataset has 53,940 records of diamonds and contains 10 fields (9 of them are features and 1 is the target variable).
- Dataset size- Since the dataset contains of dozens of thousands of observations, I can classify it as a large dataset.
- Main features:
- carat (carat weight of the diamond)
- cut (cut quality of the cut)
- color (color diamond color)
- clarity (clarity a measurement of how clear the diamond is)
- x (length in mm)
- y (width in mm)
- z (depth in mm)
- depth (total depth percentage = z / mean(x, y))
- table (width of top of diamond relative to widest point)
Exploratory Analysis
I now use the describe() method to show the summary statistics of the numeric variables.
The count, mean, min and max rows are self-explanatory. The std shows the standard deviation, and the 25%, 50% and 75% rows show the corresponding percentiles.
Data Analysis & Visualization
Step 3 — Data Preparation
Data preparation is the step where data to be used is determined. This makes the difference between looking in the wrong place and finding a solution that works. Data mining goals must be solidified, along with data cleaning and integration processes.
Records must be kept at every step in order to operate within the constraints of the project. The technical constraints and other factors determining the data must also be pinned down to eliminate bias and derive insights more easily.
- Missing values or outliers- The dataset doesn’t include any missing values but diamonds with z equal to 0 or greater than 10, y equal to 0 or greater than 10 and x equal to 0 are outliers.
- Dummy variables for categorical variables- The dataset includes 3 categorical variables (cut, color, and clarity). I chose to create dummy variables for those categorical variables using a “replace” function.
- Other techniques- First, I created a new feature called “vol” (for volume) which is a multiplication of x, y, and z and then I replaced x, y, and z with this new feature. Second, I split my data into train and test sub-sets with a ratio of train to test sets of 67:33 and a random state of 42.
Removing Outliers
Feature Engineering
Create dummy variables
Splitting to Train and Test
Step 4 — Modeling
This is where most of the work is done, with the modeling method being integral to the kind of problem to be solved. If the wrong method is used, the results obtained will not be comparable to results gained when the method is right.
Narrow down the technique and set the stage for it to be used effectively. This includes taking care of the assumptions and preparing the data for use with the model.
A test model must also be designed for proof-of-concept and suitability tests. The model should also be fitted for the problem, with testing and backpropagation being important parts if the model used is a neural network.
The approach must also be tailored with respect to the goals and the business and data understanding in order to create a good fit for the problem. In this manner, the model should be assessed.
- I have trained 3 Machine Learning models (Linear Regression, Decision Tree Regressor and Random Forest Regressor) “out of the box”, meaning without changing the hyperparameters of each model.
- For each model, I checked for Overfitting by comparing the R-squared of each model on the test set to the R-squared of that model on the train test.
- For each model, I created a scatter plot of the true prices from the market versus the predicted price from the model.
Train and Build a Linear Regression Model
Liner Regression is one of the most common regression algorithms.
R squared of the Linear Regression on training set: 88.40%
R squared of the Linear Regression on test set: 88.54%
The R squared on the training set is almost equal to the R squared on the test set. This is an indicative that our linear regression model is not overfitting and therefore generalizing well to new data.
In addition, in our linear regression model, 88.54% of the variability in the diamond prices can be explained using the 7 feature we chose (i.e., carat, cut, color, clarity, table, depth, and vol). This is very good.
Train and Build a Decision Tree Regressor Model
R squared of the Decision Tree Regressor on training set: 99.99%
R squared of the Decision Tree Regressor on test set: 96.73%
The R squared on the training set is a bit higher than the R squared on the test set, but that doesn’t mean that our decision tree regressor model is overfitting. On the contrary, our decision tree regressor model is generalizing well to new data.
In addition, in our decision tree regressor model, 96.73% of the variability in the diamond prices can be explained using the 7 feature we chose (i.e., carat, cut, color, clarity, table, depth, and vol). This is excellent.
Train and Build a Random Forest Regressor Model
Let’s apply a random forest consisting of 100 trees on the diamonds data set:
R squared of the Random Forest Regressor on training set: 99.72%
R squared of the Random Forest Regressor on test set: 98.14%
The R squared on the training set is a bit higher than the R squared on the test set, but that doesn’t mean that our random forest regressor model is overfitting. On the contrary, our random forest regressor model is generalizing well to new data.
In addition, in our random forest regressor model, 98.15% of the variability in the diamond prices can be explained using the 7 feature we chose (i.e., carat, cut, color, clarity, table, depth, and vol). This is excellent.
Step 5 — Evaluation
This step will be for evaluating factors such as the accuracy and generality of the model. In addition to this, the process must also be put through a fine-combed inspection to ensure that there are no errors.
A revision sub-step is also present in this, as a way to fine-tune the solution offered by this process. This includes going back to the business understanding roots and seeing if the process makes sense in a sustainable and scalable fashion.
A report must also be compiled for documentation. In addition to this, any possible issues must be ironed out before the next step.
I checked the models MAE and MSLE scores on the test set.
Evaluating the Linear Regression Model
Calculate the model’s expected error in dollars using the MAE (Mean Absolute Error) metric:
Mean Absolute Error of the Linear Regression on test set is 869.38
Our linear regression model was able to predict the price of every diamond in the test set with an error of ± $869.38 of the real price.
Calculate the model’s expected error in percentage using the MSLE (Mean Squared Log Error) metric:
ValueError : Mean Squared Logarithmic Error cannot be used when targets contain negative values.
It turns out that our liner regression model gives negative prices.
This model in terms of product is a bad product because there is no meaning to a negative price.
Evaluating the Decision Tree Regressor Model
Calculate the model’s expected error in dollars using the MAE (Mean Absolute Error) metric:
Mean Absolute Error of the Decision Tree Regressor on test set is 354.01
Our decision tree regressor model was able to predict the price of every diamond in the test set with an error of ± $354.01 of the real price.
Calculate the model’s expected error in percentage using the MSLE (Mean Squared Log Error) metric:
Mean Squared Log Error of the Decision Tree Regressor on test set is 2.07%
Our decision tree regressor model was able to predict the price of every diamond in the test set with an error of ± 2.07% of the real price.
Evaluating the Random Forest Regressor Model
Calculate the model’s expected error in dollars using the MAE (Mean Absolute Error) metric:
Mean Absolute Error of the Random Forest Regressor on test set is 277.00
Our random forest regressor model was able to predict the price of every diamond in the test set with an error of ± $277 of the real price.
Calculate the model’s expected error in percentage using the MSLE (Mean Squared Log Error) metric:
Mean Squared Log Error of the Random Forest Regressor on test set is 1.25%
Our random forest regressor model was able to predict the price of every diamond in the test set with an error of ± 1.25% of the real price.
Selected Model
I chose the Random Forest Regressor model as the best model among the three, based on its MAE and MSLE scores on the test set.
Step 6 — Deployment
This step will differ depending on the kind of problem that the organization is facing. However, the basics remain mostly the same. The first thing to do is to summarize how the solution will be deployed in an organized manner.
The solution also needs to future-proofed to ensure that it can be used easily for an extended period of time. Factors such as monitoring and maintenance should also be taken care of, along with a final report and review of the solution.
So, our Random Forest model is a pretty good model for predicting the market price of a diamond. Now how do we predict the market price of a new diamond new diamond?
Suppose there is a new diamond which has: carat=0.23, cut=5 (Ideal), color=2 (E), clarity=1 (SI2), depth=61.5, table=55, vol=38.20 (x=3.95, y=3.98 and z=2.43).
We can take these new data and use it to predict the market price of the new diamond.
The market price of this new diamond is $382.34
Saving the finalized model to pickle saves us a lot of time as we don’t have to train our model every time, we run the application. Once we save our model as pickle, you can load it later while making the prediction.
First, let’s open a new file for our finalized model and call it “fw_model1”
Then, let’s save into this file our Random Forest model
And let’s close this file.
Now, let’s open a new Python notebook and write
Now, let’s make a new prediction for the above transaction
In Conclusion
The CRISP-DM, even today, remains as a dependable method to develop data science solutions for enterprise problems. Its BI-first approach also enables better sourcing of insights and other such data knowledge.
The flexible and iterative approach of the CRISP-DM also makes it a future-proof alternative for anyone looking to solve data science problems. Even as it is important to develop a unique method, it should also be kept in mind that using methods such as CRISP-DM bring an element of professionalism and uniformity to operational procedures.
Roi Polanitzer, PDS, ADL, MLS, PDA, CPD, F.IL.A.V.F.A., FRM, is a data scientist with an extensive experience in solving machine learning problems, such as: regression, classification, clustering, recommender systems, anomaly detection, text analytics & NLP, and image processing. Mr. Polanitzer is is the Owner and Chief Data Scientist of Prediction Consultants — Advanced Analysis and Model Development, a data science firm headquartered in Rishon LeZion, Israel. He is also the Owner and Chief Appraiser of Intrinsic Value — Independent Business Appraisers, a business valuation firm that specializes in corporates, intangible assets and complex financial instruments valuation.
Over more than 16 years, he has performed data science projects such as: regression (e.g., house prices, CLV- customer lifetime value, and time-to-failure), classification (e.g., market targeting, customer churn), probability (e.g., spam filters, employee churn, fraud detection, loan default, and disease diagnostics), clustering (e.g., customer segmentation, and topic modeling), dimensionality reduction (e.g., p-values, itertools Combinations, principal components analysis, and autoencoders), recommender systems (e.g., products for a customer, and advertisements for a surfer), anomaly detection (e.g., supermarkets’ revenue and profits), text analytics (e.g., identifying market trends, web searches), NLP (e.g., sentiment analysis, cosine similarity, and text classification), image processing (e.g., image binary classification of dogs vs. cats, , and image multiclass classification of digits in sign language), and signal processing (e.g., audio binary classification of males vs. females, and audio multiclass classification of urban sounds).
Mr. Polanitzer holds various professional designations, such as a global designation called “Financial Risk Manager” (FRM, which indicates that its holder is proficient in developing, implementing and validating statistical models and mathematical algorithms such as K-Means, SVM and KNN for credit risk measurement and management) from the Global Association of Risk Professionals (GARP), a designation called “Fellow Actuary” (F.IL.A.V.F.A., which indicates that its holder is proficient in developing, implementing and validating statistical models and mathematical algorithms such as GLM, RF and NN for determining premiums in general insurance) from the Israel Association of Valuators and Financial Actuaries (IAVFA), and a designation called “Certified Risk Manager” (CRM, which indicates that its holder is proficient in developing, implementing and validating statistical models and mathematical algorithms such as DT, NB and PCA for operational risk management) from the Israeli Association of Risk Managers (IARM).
Mr. Polanitzer had studied actuarial science (i.e., implementation of statistical and data mining techniques for solving time-series analysis, dimensionality reduction, optimization and simulation problems) at the prestigious 250-hours training program of the University of Haifa, financial risk management (i.e., building statistical predictive and probabilistic models for solving regression, classification, clustering and anomaly detection) at the prestigious 250-hours training program of the program of the Ariel University, and machine learning and deep learning (i.e., building recommender systems and training neural networks for image processing and NLP) at the prestigious 500-hours training program of the John Bryce College.
He had graduated various professional trainings at the John Bryce College, such as: “Introduction to Machine Learning, AI & Data Visualization for Managers and Architects”, “Professional training in Practical Machine Learning, AI & Deep Learning with Python for Algorithm Developers & Data Scientists”, “Azure Data Fundamentals: Relational Data, Non-Relational Data and Modern Data Warehouse Analytics in Azure”, and “Azure AI Fundamentals: Azure Tools for ML, Automated ML & Visual Tools for ML and Deep Learning”.
Mr. Polanitzer had also graduated various professional trainings at the Professional Data Scientists’ Israel Association, such as: “Neural Networks and Deep Learning”, “Big Data and Cloud Services”, “Natural Language Processing and Text Mining”.
Источник
I’m trying to run cross validation with mean squared log error with sklearn and getting the following error message:
ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.
This would suggest that I have negative values in my 1d array y. However, I have tried about 10 different ways of checking, including importing into excel and I can see no negative values in there.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_log_error
import pandas as pd
import numpy as np
train_csv = 'train.csv'
df_train = pd.read_csv(trainData)
# define variables
target = 'SalePrice'
indep_variable = 'OverallQual'
# scoring
scoring_cross_val = 'neg_mean_squared_log_error'
scoring = mean_squared_log_error
# initate model
lin_reg = LinearRegression()
# example data
X = df_train.drop(target, axis=1)
X = X[indep_variable].to_numpy().reshape(-1, 1)
y = df_train[target].to_numpy().reshape(-1, 1)
# fit model
lin_reg.fit(X, y)
# cross validated model error
cv = cross_val_score(lin_reg, X, y, cv=2, scoring=scoring_cross_val)
I created a version of the code above with some simple inputs to check it isn’t a bug in my version of sklearn. The code runs without a problem.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_log_error
import pandas as pd
import numpy as np
# scoring
scoring_cross_val = 'neg_mean_squared_log_error'
scoring = mean_squared_log_error
# initate model
lin_reg = LinearRegression()
# example data
X = np.array([1.,2.,3.]).reshape(-1, 1)
y = np.array([4.,5.,6.]).reshape(-1, 1)
# fit model
lin_reg.fit(X, y)
# cross validated model error
cv = cross_val_score(lin_reg, X, y, cv=2, scoring=scoring_cross_val)
If anyone gets the chance to help, the csv can be downloaded from Kaggle:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
Iain
from abc import ABCMeta, abstractmethod from typing import Any, Callable, Dict, List, Optional, Sequence, Union import collections from functools import partial from itertools import product import numpy as np import sklearn.metrics from sklearn.utils.multiclass import type_of_target from smac.utils.constants import MAXINT from autosklearn.constants import ( BINARY_CLASSIFICATION, MULTICLASS_CLASSIFICATION, MULTILABEL_CLASSIFICATION, MULTIOUTPUT_REGRESSION, REGRESSION, REGRESSION_TASKS, TASK_TYPES, ) from autosklearn.data.target_validator import SUPPORTED_XDATA_TYPES from .util import sanitize_array class Scorer(object, metaclass=ABCMeta): def __init__( self, name: str, score_func: Callable, optimum: float, worst_possible_result: float, sign: float, kwargs: Any, needs_X: bool = False, ) -> None: self.name = name self._kwargs = kwargs self._score_func = score_func self._optimum = optimum self._needs_X = needs_X self._worst_possible_result = worst_possible_result self._sign = sign @abstractmethod def __call__( self, y_true: np.ndarray, y_pred: np.ndarray, *, X_data: Optional[SUPPORTED_XDATA_TYPES] = None, sample_weight: Optional[List[float]] = None, ) -> float: pass def __repr__(self) -> str: return self.name class _PredictScorer(Scorer): def __call__( self, y_true: np.ndarray, y_pred: np.ndarray, *, X_data: Optional[SUPPORTED_XDATA_TYPES] = None, sample_weight: Optional[List[float]] = None, ) -> float: """Evaluate predicted target values for X relative to y_true. Parameters ---------- y_true : array-like Gold standard target values for X. y_pred : array-like, [n_samples x n_classes] Model predictions X_data : array-like [n_samples x n_features] X data used to obtain the predictions: each row x_j corresponds to the input used to obtain predictions y_j sample_weight : array-like, optional (default=None) Sample weights. Returns ------- score : float Score function applied to prediction of estimator on X. """ type_true = type_of_target(y_true) if ( type_true == "binary" and type_of_target(y_pred) == "continuous" and len(y_pred.shape) == 1 ): # For a pred scorer, no threshold, nor probability is required # If y_true is binary, and y_pred is continuous # it means that a rounding is necessary to obtain the binary class y_pred = np.around(y_pred, decimals=0) elif ( len(y_pred.shape) == 1 or y_pred.shape[1] == 1 or type_true == "continuous" ): # must be regression, all other task types would return at least # two probabilities pass elif type_true in ["binary", "multiclass"]: y_pred = np.argmax(y_pred, axis=1) elif type_true == "multilabel-indicator": y_pred[y_pred > 0.5] = 1.0 y_pred[y_pred <= 0.5] = 0.0 elif type_true == "continuous-multioutput": pass else: raise ValueError(type_true) scorer_kwargs = {} # type: Dict[str, Union[List[float], np.ndarray]] if sample_weight is not None: scorer_kwargs["sample_weight"] = sample_weight if self._needs_X is True: scorer_kwargs["X_data"] = X_data return self._sign * self._score_func( y_true, y_pred, **scorer_kwargs, **self._kwargs ) class _ProbaScorer(Scorer): def __call__( self, y_true: np.ndarray, y_pred: np.ndarray, *, X_data: Optional[SUPPORTED_XDATA_TYPES] = None, sample_weight: Optional[List[float]] = None, ) -> float: """Evaluate predicted probabilities for X relative to y_true. Parameters ---------- y_true : array-like Gold standard target values for X. These must be class labels, not probabilities. y_pred : array-like, [n_samples x n_classes] Model predictions X_data : array-like [n_samples x n_features] X data used to obtain the predictions: each row x_j corresponds to the input used to obtain predictions y_j sample_weight : array-like, optional (default=None) Sample weights. Returns ------- score : float Score function applied to prediction of estimator on X. """ if self._score_func is sklearn.metrics.log_loss: n_labels_pred = np.array(y_pred).reshape((len(y_pred), -1)).shape[1] n_labels_test = len(np.unique(y_true)) if n_labels_pred != n_labels_test: labels = list(range(n_labels_pred)) if sample_weight is not None: return self._sign * self._score_func( y_true, y_pred, sample_weight=sample_weight, labels=labels, **self._kwargs, ) else: return self._sign * self._score_func( y_true, y_pred, labels=labels, **self._kwargs ) scorer_kwargs = {} # type: Dict[str, Union[List[float], np.ndarray]] if sample_weight is not None: scorer_kwargs["sample_weight"] = sample_weight if self._needs_X is True: scorer_kwargs["X_data"] = X_data return self._sign * self._score_func( y_true, y_pred, **scorer_kwargs, **self._kwargs ) class _ThresholdScorer(Scorer): def __call__( self, y_true: np.ndarray, y_pred: np.ndarray, *, X_data: Optional[SUPPORTED_XDATA_TYPES] = None, sample_weight: Optional[List[float]] = None, ) -> float: """Evaluate decision function output for X relative to y_true. Parameters ---------- y_true : array-like Gold standard target values for X. These must be class labels, not probabilities. y_pred : array-like, [n_samples x n_classes] Model predictions X_data : array-like [n_samples x n_features] X data used to obtain the predictions: each row x_j corresponds to the input used to obtain predictions y_j sample_weight : array-like, optional (default=None) Sample weights. Returns ------- score : float Score function applied to prediction of estimator on X. """ y_type = type_of_target(y_true) if y_type not in ("binary", "multilabel-indicator"): raise ValueError("{0} format is not supported".format(y_type)) if y_type == "binary": if y_pred.ndim > 1: y_pred = y_pred[:, 1] elif isinstance(y_pred, list): y_pred = np.vstack([p[:, -1] for p in y_pred]).T scorer_kwargs = {} # type: Dict[str, Union[List[float], np.ndarray]] if sample_weight is not None: scorer_kwargs["sample_weight"] = sample_weight if self._needs_X is True: scorer_kwargs["X_data"] = X_data return self._sign * self._score_func( y_true, y_pred, **scorer_kwargs, **self._kwargs )[docs]def make_scorer( name: str, score_func: Callable, *, optimum: float = 1.0, worst_possible_result: float = 0.0, greater_is_better: bool = True, needs_proba: bool = False, needs_threshold: bool = False, needs_X: bool = False, **kwargs: Any, ) -> Scorer: """Make a scorer from a performance metric or loss function. Factory inspired by scikit-learn which wraps scikit-learn scoring functions to be used in auto-sklearn. Parameters ---------- name: str Descriptive name of the metric score_func : callable Score function (or loss function) with signature ``score_func(y, y_pred, **kwargs)``. optimum : int or float, default=1 The best score achievable by the score function, i.e. maximum in case of scorer function and minimum in case of loss function. worst_possible_result : int of float, default=0 The worst score achievable by the score function, i.e. minimum in case of scorer function and maximum in case of loss function. greater_is_better : boolean, default=True Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func. needs_proba : boolean, default=False Whether score_func requires predict_proba to get probability estimates out of a classifier. needs_threshold : boolean, default=False Whether score_func takes a continuous decision certainty. This only works for binary classification. needs_X : boolean, default=False Whether score_func requires X in __call__ to compute a metric. **kwargs : additional arguments Additional parameters to be passed to score_func. Returns ------- scorer : callable Callable object that returns a scalar score; greater is better or set greater_is_better to False. """ sign = 1 if greater_is_better else -1 if needs_proba and needs_threshold: raise ValueError( "Set either needs_proba or needs_threshold to True, but not both." ) cls = None # type: Any if needs_proba: cls = _ProbaScorer elif needs_threshold: cls = _ThresholdScorer else: cls = _PredictScorer return cls( name, score_func, optimum, worst_possible_result, sign, kwargs, needs_X=needs_X )
# Standard regression scores mean_absolute_error = make_scorer( "mean_absolute_error", sklearn.metrics.mean_absolute_error, optimum=0, worst_possible_result=MAXINT, greater_is_better=False, ) mean_squared_error = make_scorer( "mean_squared_error", sklearn.metrics.mean_squared_error, optimum=0, worst_possible_result=MAXINT, greater_is_better=False, squared=True, ) root_mean_squared_error = make_scorer( "root_mean_squared_error", sklearn.metrics.mean_squared_error, optimum=0, worst_possible_result=MAXINT, greater_is_better=False, squared=False, ) mean_squared_log_error = make_scorer( "mean_squared_log_error", sklearn.metrics.mean_squared_log_error, optimum=0, worst_possible_result=MAXINT, greater_is_better=False, ) median_absolute_error = make_scorer( "median_absolute_error", sklearn.metrics.median_absolute_error, optimum=0, worst_possible_result=MAXINT, greater_is_better=False, ) r2 = make_scorer("r2", sklearn.metrics.r2_score) # Standard Classification Scores accuracy = make_scorer("accuracy", sklearn.metrics.accuracy_score) balanced_accuracy = make_scorer( "balanced_accuracy", sklearn.metrics.balanced_accuracy_score ) # Score functions that need decision values roc_auc = make_scorer( "roc_auc", sklearn.metrics.roc_auc_score, greater_is_better=True, needs_threshold=True, ) average_precision = make_scorer( "average_precision", sklearn.metrics.average_precision_score, needs_threshold=True ) # NOTE: zero_division # # Specified as the explicit default, see sklearn docs: # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn-metrics-precision-score precision = make_scorer( "precision", partial(sklearn.metrics.precision_score, zero_division=0) ) recall = make_scorer("recall", partial(sklearn.metrics.recall_score, zero_division=0)) f1 = make_scorer("f1", partial(sklearn.metrics.f1_score, zero_division=0)) # Score function for probabilistic classification log_loss = make_scorer( "log_loss", sklearn.metrics.log_loss, optimum=0, worst_possible_result=MAXINT, greater_is_better=False, needs_proba=True, ) # TODO what about mathews correlation coefficient etc? REGRESSION_METRICS = { scorer.name: scorer for scorer in [ mean_absolute_error, mean_squared_error, root_mean_squared_error, mean_squared_log_error, median_absolute_error, r2, ] } CLASSIFICATION_METRICS = { scorer.name: scorer for scorer in [accuracy, balanced_accuracy, roc_auc, average_precision, log_loss] } # NOTE: zero_division # # Specified as the explicit default, see sklearn docs: # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn-metrics-precision-score for (base_name, sklearn_metric), average in product( [ ("precision", sklearn.metrics.precision_score), ("recall", sklearn.metrics.recall_score), ("f1", sklearn.metrics.f1_score), ], ["macro", "micro", "samples", "weighted"], ): name = f"{base_name}_{average}" scorer = make_scorer( name, partial(sklearn_metric, pos_label=None, average=average, zero_division=0) ) globals()[name] = scorer # Adds scorer to the module scope CLASSIFICATION_METRICS[name] = scorer def _validate_metrics( metrics: Sequence[Scorer], scoring_functions: Optional[List[Scorer]] = None, ) -> None: """ Validate metrics given to Auto-sklearn. Raises an Exception in case of a problem. metrics: Sequence[Scorer] A list of objects that hosts a function to calculate how good the prediction is according to the solution. scoring_functions: Optional[List[Scorer]] A list of metrics to calculate multiple losses """ to_score = list(metrics) if scoring_functions: to_score.extend(scoring_functions) if len(metrics) == 0: raise ValueError("Number of metrics to compute must be greater than zero.") metric_counter = collections.Counter(to_score) metric_names_counter = collections.Counter(metric.name for metric in to_score) if len(metric_counter) != len(metric_names_counter): raise ValueError( "Error in metrics passed to Auto-sklearn. A metric name was used " "multiple times for different metrics!" ) def calculate_scores( solution: np.ndarray, prediction: np.ndarray, task_type: int, metrics: Sequence[Scorer], *, X_data: Optional[SUPPORTED_XDATA_TYPES] = None, scoring_functions: Optional[List[Scorer]] = None, ) -> Dict[str, float]: """ Returns the scores (a magnitude that allows casting the optimization problem as a maximization one) for the given Auto-Sklearn Scorer objects. Parameters ---------- solution: np.ndarray The ground truth of the targets prediction: np.ndarray The best estimate from the model, of the given targets task_type: int To understand if the problem task is classification or regression metrics: Sequence[Scorer] A list of objects that hosts a function to calculate how good the prediction is according to the solution. X_data : array-like [n_samples x n_features] X data used to obtain the predictions scoring_functions: List[Scorer] A list of metrics to calculate multiple losses Returns ------- Dict[str, float] """ if task_type not in TASK_TYPES: raise NotImplementedError(task_type) _validate_metrics(metrics=metrics, scoring_functions=scoring_functions) to_score = list(metrics) if scoring_functions: to_score.extend(scoring_functions) score_dict = dict() if task_type in REGRESSION_TASKS: for metric_ in to_score: try: score_dict[metric_.name] = _compute_single_scorer( metric=metric_, prediction=prediction, solution=solution, task_type=task_type, X_data=X_data, ) except ValueError as e: print(e, e.args[0]) if ( e.args[0] == "Mean Squared Logarithmic Error cannot be used when " "targets contain negative values." ): continue else: raise e else: for metric_ in to_score: # TODO maybe annotate metrics to define which cases they can # handle? try: score_dict[metric_.name] = _compute_single_scorer( metric=metric_, prediction=prediction, solution=solution, task_type=task_type, X_data=X_data, ) except ValueError as e: if e.args[0] == "multiclass format is not supported": continue elif ( e.args[0] == "Samplewise metrics are not available " "outside of multilabel classification." ): continue elif ( e.args[0] == "Target is multiclass but " "average='binary'. Please choose another average " "setting, one of [None, 'micro', 'macro', 'weighted']." ): continue else: raise e return score_dict def calculate_loss( solution: np.ndarray, prediction: np.ndarray, task_type: int, metric: Scorer, X_data: Optional[SUPPORTED_XDATA_TYPES] = None, ) -> float: """Calculate the loss with a given metric Parameters ---------- solution: np.ndarray The solutions prediction: np.ndarray The predictions generated task_type: int The task type of the problem metric: Scorer The metric to use X_data: Optional[SUPPORTED_XDATA_TYPES] X data used to obtain the predictions """ losses = calculate_losses( solution=solution, prediction=prediction, task_type=task_type, metrics=[metric], X_data=X_data, ) return losses[metric.name] def calculate_losses( solution: np.ndarray, prediction: np.ndarray, task_type: int, metrics: Sequence[Scorer], *, X_data: Optional[SUPPORTED_XDATA_TYPES] = None, scoring_functions: Optional[List[Scorer]] = None, ) -> Dict[str, float]: """ Returns the losses (a magnitude that allows casting the optimization problem as a minimization one) for the given Auto-Sklearn Scorer objects. Parameters ---------- solution: np.ndarray The ground truth of the targets prediction: np.ndarray The best estimate from the model, of the given targets task_type: int To understand if the problem task is classification or regression metrics: Sequence[Scorer] A list of objects that hosts a function to calculate how good the prediction is according to the solution. X_data: Optional[SUPPORTED_XDATA_TYPES] X data used to obtain the predictions scoring_functions: List[Scorer] A list of metrics to calculate multiple losses Returns ------- Dict[str, float] A loss function for each of the provided scorer objects """ score = calculate_scores( solution=solution, prediction=prediction, X_data=X_data, task_type=task_type, metrics=metrics, scoring_functions=scoring_functions, ) scoring_functions = scoring_functions if scoring_functions else [] # we expect a dict() object for which we should calculate the loss loss_dict = dict() for metric_ in scoring_functions + list(metrics): # maybe metric argument is not in scoring_functions # TODO: When metrics are annotated with type_of_target support # we can remove this check if metric_.name not in score: continue loss_dict[metric_.name] = metric_._optimum - score[metric_.name] return loss_dict def compute_single_metric( metric: Scorer, prediction: np.ndarray, solution: np.ndarray, task_type: int, X_data: Optional[SUPPORTED_XDATA_TYPES] = None, ) -> float: """ Returns a metric for the given Auto-Sklearn Scorer object. It's direction is determined by the metric itself. Parameters ---------- solution: np.ndarray The ground truth of the targets prediction: np.ndarray The best estimate from the model, of the given targets task_type: int To understand if the problem task is classification or regression metric: Scorer Object that host a function to calculate how good the prediction is according to the solution. X_data : array-like [n_samples x n_features] X data used to obtain the predictions Returns ------- float """ score = _compute_single_scorer( solution=solution, prediction=prediction, metric=metric, X_data=X_data, task_type=task_type, ) return metric._sign * score def _compute_single_scorer( metric: Scorer, prediction: np.ndarray, solution: np.ndarray, task_type: int, X_data: Optional[SUPPORTED_XDATA_TYPES] = None, ) -> float: """ Returns a score (a magnitude that allows casting the optimization problem as a maximization one) for the given Auto-Sklearn Scorer object Parameters ---------- solution: np.ndarray The ground truth of the targets prediction: np.ndarray The best estimate from the model, of the given targets task_type: int To understand if the problem task is classification or regression metric: Scorer Object that host a function to calculate how good the prediction is according to the solution. X_data : array-like [n_samples x n_features] X data used to obtain the predictions Returns ------- float """ if metric._needs_X: if X_data is None: raise ValueError( f"Metric {metric.name} needs X_data, but X_data is {X_data}" ) elif X_data.shape[0] != solution.shape[0]: raise ValueError( f"X_data has wrong length. " f"Should be {solution.shape[0]}, but is {X_data.shape[0]}" ) if task_type in REGRESSION_TASKS: # TODO put this into the regression metric itself cprediction = sanitize_array(prediction) score = metric(solution, cprediction, X_data=X_data) else: score = metric(solution, prediction, X_data=X_data) return score if task_type in REGRESSION_TASKS: # TODO put this into the regression metric itself cprediction = sanitize_array(prediction) score = metric(solution, cprediction) else: score = metric(solution, prediction) return score # Must be at bottom so all metrics are defined default_metric_for_task: Dict[int, Scorer] = { BINARY_CLASSIFICATION: CLASSIFICATION_METRICS["accuracy"], MULTICLASS_CLASSIFICATION: CLASSIFICATION_METRICS["accuracy"], MULTILABEL_CLASSIFICATION: CLASSIFICATION_METRICS["f1_macro"], REGRESSION: REGRESSION_METRICS["r2"], MULTIOUTPUT_REGRESSION: REGRESSION_METRICS["r2"], }