StandardScaler rescales each feature so that it has mean 0 and variance 1. It suits data that has no clear bounds and may contain extreme values.
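Formula: x_scaled = (x - x_mean) / S
Here x_mean is the mean of the feature and S is its standard deviation, computed the same way as numpy's std method. See the StandardScaler documentation: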
StandardScaler??
Init signature: StandardScaler(copy=True, with_mean=True, with_std=True)
Source:
class StandardScaler(BaseEstimator, TransformerMixin):
"""Standardize features by removing the mean and scaling to unit variance
The standard score of a sample `x` is calculated as:
z = (x - u) / s
where `u` is the mean of the training samples or zero if `with_mean=False`,
and `s` is the standard deviation of the training samples or one if
`with_std=False`.
Centering and scaling happen independently on each feature by computing
the relevant statistics on the samples in the training set. Mean and
standard deviation are then stored to be used on later data using the
`transform` method.
Standardization of a dataset is a common requirement for many
machine learning estimators: they might behave badly if the
individual features do not more or less look like standard normally
distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance many elements used in the objective function of
a learning algorithm (such as the RBF kernel of Support Vector
Machines or the L1 and L2 regularizers of linear models) assume that
all features are centered around 0 and have variance in the same
order. If a feature has a variance that is orders of magnitude larger
that others, it might dominate the objective function and make the
estimator unable to learn from other features correctly as expected.
This scaler can also be applied to sparse CSR or CSC matrices by passing
`with_mean=False` to avoid breaking the sparsity structure of the data.
Read more in the :ref:`User Guide <preprocessing_scaler>`.
Parameters
----------
copy : boolean, optional, default True
If False, try to avoid a copy and do inplace scaling instead.
This is not guaranteed to always work inplace; e.g. if the data is
not a NumPy array or scipy.sparse CSR matrix, a copy may still be
returned.
with_mean : boolean, True by default
If True, center the data before scaling.
This does not work (and will raise an exception) when attempted on
sparse matrices, because centering them entails building a dense
matrix which in common use cases is likely to be too large to fit in
memory.
with_std : boolean, True by default
If True, scale the data to unit variance (or equivalently,
unit standard deviation).
Attributes
----------
scale_ : ndarray or None, shape (n_features,)
Per feature relative scaling of the data. This is calculated using
`np.sqrt(var_)`. Equal to ``None`` when ``with_std=False``.
.. versionadded:: 0.17
*scale_*
mean_ : ndarray or None, shape (n_features,)
The mean value for each feature in the training set.
Equal to ``None`` when ``with_mean=False``.
var_ : ndarray or None, shape (n_features,)
The variance for each feature in the training set. Used to compute
`scale_`. Equal to ``None`` when ``with_std=False``.
n_samples_seen_ : int or array, shape (n_features,)
The number of samples processed by the estimator for each feature.
If there are not missing samples, the ``n_samples_seen`` will be an
integer, otherwise it will be an array.
Will be reset on new calls to fit, but increments across
``partial_fit`` calls.
Examples
--------
from sklearn.preprocessing import StandardScaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()
print(scaler.fit(data))
StandardScaler(copy=True, with_mean=True, with_std=True)
print(scaler.mean_)
[0.5 0.5]
print(scaler.transform(data))
[[-1. -1.]
[-1. -1.]
[ 1. 1.]
[ 1. 1.]]
print(scaler.transform([[2, 2]]))
[[3. 3.]]
See also
--------
scale: Equivalent function without the estimator API.
:class:`sklearn.decomposition.PCA`
Further removes the linear correlation across features with 'whiten=True'.
Notes
-----
NaNs are treated as missing values: disregarded in fit, and maintained in
transform.
We use a biased estimator for the standard deviation, equivalent to
`numpy.std(x, ddof=0)`. Note that the choice of `ddof` is unlikely to
affect model performance.
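The formula in the docstring, z = (x - u) / s, can be checked by hand. The sketch below reuses the sample data from the docstring's Examples section and compares the scaler's output with a manual computation that uses numpy's population standard deviation (ddof=0). The variable names `manual` and `same` are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same toy data as in the docstring example.
data = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])

scaler = StandardScaler().fit(data)

# Manual z-score per column, dividing by the ddof=0 standard deviation.
manual = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)

# True if the scaler and the manual computation agree.
same = np.allclose(scaler.transform(data), manual)
print(same)
```

If `ddof=1` were used in the manual computation instead, the two results would no longer agree.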
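The Notes section shows that the standard deviation is computed without dividing by N-1, i.e. it is equivalent to numpy.std(x, ddof=0). Now look at the numpy.std documentation: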
np.std??
Signature: np.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)
Source:
@array_function_dispatch(_std_dispatcher)
def std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=np._NoValue):
"""
Compute the standard deviation along the specified axis.
Returns the standard deviation, a measure of the spread of a distribution,
of the array elements. The standard deviation is computed for the
flattened array by default, otherwise over the specified axis.
Parameters
----------
a : array_like
Calculate the standard deviation of these values.
axis : None or int or tuple of ints, optional
Axis or axes along which the standard deviation is computed. The
default is to compute the standard deviation of the flattened array.
.. versionadded:: 1.7.0
If this is a tuple of ints, a standard deviation is performed over
multiple axes, instead of a single axis or all the axes as before.
dtype : dtype, optional
Type to use in computing the standard deviation. For arrays of
integer type the default is float64, for arrays of float types it is
the same as the array type.
out : ndarray, optional
Alternative output array in which to place the result. It must have
the same shape as the expected output but the type (of the calculated
values) will be cast if necessary.
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations
is ``N - ddof``, where ``N`` represents the number of elements.
By default `ddof` is zero.
keepdims : bool, optional
If this is set to True, the axes which are reduced are left
in the result as dimensions with size one. With this option,
the result will broadcast correctly against the input array.
If the default value is passed, then `keepdims` will not be
passed through to the `std` method of sub-classes of
`ndarray`, however any non-default value will be. If the
sub-class' method does not implement `keepdims` any
exceptions will be raised.
Returns
-------
standard_deviation : ndarray, see dtype parameter above.
If `out` is None, return a new array containing the standard deviation,
otherwise return a reference to the output array.
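A small sketch of what the ddof parameter changes, using a made-up four-element array. With ddof=0 the divisor is N; with ddof=1 it is N - 1, so the two results differ by a factor of sqrt(N / (N - 1)).

```python
import numpy as np

x = np.array([0.0, 0.0, 1.0, 1.0])
n = x.size

population_std = np.std(x)        # default ddof=0, divisor is N
sample_std = np.std(x, ddof=1)    # divisor is N - 1

# The ratio between the two depends only on the sample size.
ratio = sample_std / population_std
print(population_std, sample_std, ratio)
```

The gap shrinks as N grows, which is consistent with the docstring's remark that the choice of ddof is unlikely to affect model performance.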
As the description of the ddof parameter shows, ddof defaults to 0, so the divisor is N rather than N-1. Compare this with std in pandas:
pd.DataFrame.std?
Signature:
pd.DataFrame.std(
self,
axis=None,
skipna=None,
level=None,
ddof=1,
numeric_only=None,
**kwargs,
)
Docstring:
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters
----------
axis : {index (0), columns (1)}
skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a
particular level, collapsing into a Series
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
numeric_only : bool, default None
Include only float, int, boolean columns. If None, will attempt to use
everything, then use only numeric data. Not implemented for Series.
Returns
-------
Series or DataFrame (if level specified)
File: c:\anaconda3\lib\site-packages\pandas\core\generic.py
Type: function
Here ddof defaults to 1.
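The different defaults can be seen side by side on the same column. This is a minimal sketch with a made-up column named `a`; passing ddof=0 to pandas reproduces the numpy default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.0, 0.0, 1.0, 1.0]})

pandas_default = df["a"].std()                # ddof=1, divisor is N - 1
pandas_biased = df["a"].std(ddof=0)           # divisor is N
numpy_default = np.std(df["a"].to_numpy())    # ddof=0, divisor is N

print(pandas_default, pandas_biased, numpy_default)
```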
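So the main difference between how StandardScaler and pandas compute the standard deviation is the default value of ddof. If standardization was used during model training, then once the model is deployed you need to make sure the mean and standard deviation fed to the model at test time are computed the same way as in training; otherwise the final test results may be affected.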
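One way to avoid the mismatch, sketched here under the assumption that the serving code can read the statistics of the fitted scaler, is to export `mean_` and `scale_` and reuse them unchanged instead of recomputing them with pandas. The training frame, the column names `f1` and `f2`, and the new row are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training data.
train = pd.DataFrame({"f1": [0.0, 0.0, 1.0, 1.0],
                      "f2": [1.0, 2.0, 3.0, 4.0]})
scaler = StandardScaler().fit(train.to_numpy())

# Export the exact statistics used during training.
mean = scaler.mean_.copy()
scale = scaler.scale_.copy()   # sqrt(var_), i.e. the ddof=0 standard deviation

# Serving side: standardize a new row with the exported statistics.
new_row = np.array([[2.0, 5.0]])
served = (new_row - mean) / scale

# Recomputing with pandas defaults (ddof=1) gives different inputs to the model.
mismatched = (new_row - train.mean().to_numpy()) / train.std().to_numpy()

print(served)
print(mismatched)
```

Persisting the fitted scaler object itself is another option; the point is only that training and serving use the same mean and the same standard deviation.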