auswahl.IntervalRandomFrog
- class auswahl.IntervalRandomFrog(n_intervals_to_select: int = 1, interval_width: Optional[Union[int, float]] = None, n_iterations: int = 10000, n_initial_intervals: Union[int, float] = 0.1, variance_factor: float = 0.3, subset_expansion_factor: float = 3, acceptance_factor: float = 0.1, n_cv_folds: int = 5, n_jobs: int = 1, pls: Optional[PLSRegression] = None, model_hyperparams: Optional[Union[Dict, List[Dict]]] = None, random_state: Optional[Union[int, RandomState]] = None)
Feature selection with the Interval Random Frog (iRF) method.
The selection frequencies are computed according to Yun et al. [1].
Read more in the User Guide.
- Parameters
- n_intervals_to_select: int, default=1
Number of intervals to select.
- interval_width: int or float, default=None
Size of the selected intervals. If None, each interval spans n_features/2 features. If an integer, the parameter directly defines the number of consecutive features that form an interval. If a float between 0 and 1, each interval spans n_features*interval_width features.
- n_iterations: int, default=10000
Number of variable selection iterations. This variable is called N in the original publication.
- n_initial_intervals: int or float, default=0.1
Number of intervals in the initial interval subset. If None, 10 % of the intervals are used. If integer, the parameter is the size of the initial subset. If float between 0 and 1, it is the fraction of intervals to use. This variable is called Q in the original publication.
- variance_factor: float, default=0.3
Variance of the normal distribution whose samples determine how many intervals are added to or removed from the candidate set in each iteration. This variable is called θ in the original publication.
- subset_expansion_factor: float, default=3
Multiple of the number of new intervals that are explored when the candidate subset is expanded. If the current interval subset has size n and m new intervals have to be added, m*subset_expansion_factor intervals are added to a candidate set. After fitting a PLS model, only the n+m intervals with the highest coefficients are kept. This variable is called ω in the original publication.
- acceptance_factor: float, default=0.1
Factor used to calculate the probability that an interval subset is accepted even though it leads to a worse cross-validation performance of the fitted PLS model. The probability is computed by multiplying acceptance_factor by the relative decrease of the cross-validated performance score. This variable is called η in the original publication.
- n_cv_folds: int, default=5
Number of cross-validation folds used to evaluate the features.
- n_jobs: int, default=1
Number of parallel processes used to fit the PLS models on the cross-validation splits.
- pls: PLSRegression, default=None
Estimator instance of the PLSRegression class. Use this to adjust the hyperparameters of the PLS method.
- random_state: int or numpy.random.RandomState, default=None
Seed for the random subset sampling. Pass an int for reproducible output across function calls.
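The roles of subset_expansion_factor and acceptance_factor can be illustrated with a small, library-independent sketch. Both functions below are hypothetical helpers and not part of auswahl. The per-interval scores stand in for the PLS coefficients, and the acceptance rule assumes an error-like score where lower is better and uses the ratio of current to candidate error, which is only one possible reading of the "relative decrease" described above.

```python
import random


def expand_subset(current, coefficients, n_new, subset_expansion_factor, rng):
    """Grow the interval subset `current` by `n_new` intervals (hypothetical helper)."""
    unused = [i for i in range(len(coefficients)) if i not in current]
    # Explore n_new * subset_expansion_factor extra intervals, capped by availability.
    n_explore = min(len(unused), int(n_new * subset_expansion_factor))
    candidates = sorted(current) + rng.sample(unused, n_explore)
    # Assumption: `coefficients` stands in for the PLS coefficients that the real
    # method would obtain by fitting a model on the candidate set.
    ranked = sorted(candidates, key=lambda i: abs(coefficients[i]), reverse=True)
    # Keep only the n + m intervals with the largest coefficients.
    return set(ranked[: len(current) + n_new])


def acceptance_probability(current_error, candidate_error, acceptance_factor):
    """Probability of accepting a candidate subset (assumed error-ratio form)."""
    if candidate_error <= current_error:
        return 1.0  # a subset that is at least as good is always accepted
    return acceptance_factor * current_error / candidate_error


rng = random.Random(0)
coefficients = [0.9, 0.1, 0.5, 0.05, 0.7]  # made-up per-interval scores
new_subset = expand_subset({0, 1}, coefficients, n_new=1,
                           subset_expansion_factor=3, rng=rng)
print(sorted(new_subset))  # [0, 2, 4]
print(acceptance_probability(1.0, 2.0, acceptance_factor=0.1))  # 0.05
```

Note that in this sketch an interval already in the subset (interval 1) can be dropped during expansion if newly explored intervals have larger coefficients.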
- Attributes
- frequencies_: ndarray of shape (n_features,)
Number of times each interval has been selected after all iterations.
- support_: ndarray of shape (n_features,)
Mask of selected intervals.
References
- [1] Yong-Huan Yun, Hong-Dong Li, Leslie R. E. Wood, Wei Fan, Jia-Jun Wang, Dong-Sheng Cao, Qing-Song Xu, and Yi-Zeng Liang, ‘An efficient method of wavelength interval selection based on random frog for multivariate spectral calibration’, Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 111, 31-36, 2013.
Examples
>>> import numpy as np
>>> from auswahl import IntervalRandomFrog
>>> np.random.seed(1337)
>>> X = np.random.randn(100, 10)
>>> y = 5 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 5] - 3 * X[:, 6]  # y only depends on two intervals
>>> selector = IntervalRandomFrog(n_intervals_to_select=2, interval_width=2, n_iterations=1000, random_state=7331)
>>> selector.fit(X, y).get_support()
array([ True, True, False, False, False, True, True, False, False, False])
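The support in the example above corresponds to two intervals of width 2 that start at features 0 and 5. The following sketch shows how such interval positions translate into a per-feature mask. The helper intervals_to_mask is hypothetical and not part of auswahl; it only illustrates the relationship between selected intervals and the feature-level support.

```python
def intervals_to_mask(n_features, interval_starts, interval_width):
    """Expand interval start indices into a per-feature boolean mask."""
    mask = [False] * n_features
    for start in interval_starts:
        # Mark every feature covered by the interval, clipped at the last feature.
        for i in range(start, min(start + interval_width, n_features)):
            mask[i] = True
    return mask


mask = intervals_to_mask(n_features=10, interval_starts=[0, 5], interval_width=2)
print(mask)
# [True, True, False, False, False, True, True, False, False, False]
```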
- __init__(n_intervals_to_select: int = 1, interval_width: Optional[Union[int, float]] = None, n_iterations: int = 10000, n_initial_intervals: Union[int, float] = 0.1, variance_factor: float = 0.3, subset_expansion_factor: float = 3, acceptance_factor: float = 0.1, n_cv_folds: int = 5, n_jobs: int = 1, pls: Optional[PLSRegression] = None, model_hyperparams: Optional[Union[Dict, List[Dict]]] = None, random_state: Optional[Union[int, RandomState]] = None)
- evaluate(X, y, model, do_cv=True, *args)
Conduct a cross-validation and hyperparameter optimization of the underlying estimator model.
- Parameters
- X: array-like, shape (n_samples, n_features)
Spectral data to be fitted
- y: array-like, shape (n_samples,)
Regression targets
- model: BaseEstimator
Regression model
- do_cv: bool, default=True
If True, the model is fitted to the data and a cross-validation score is provided.
- *args: arbitrary payload
Arbitrary payload returned with the evaluation result. Used, for instance, to identify threads when multiple models are evaluated in parallel.
- Returns
- tuple: float, BaseEstimator
Cross-validation score if requested (otherwise None) and the fitted estimator.
- fit(X, y, mask=None)
Run the feature selection process.
- Parameters
- X: array-like of shape (n_samples, n_features)
The input samples.
- y: array-like of shape (n_samples,)
The target values.
- mask: array-like of shape (n_features,)
Mask whose zero entries (values == 0) mark the features to be excluded from the feature selection.
- Returns
- self: SpectralSelector
Returns the instance itself.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_best_estimator() → BaseEstimator
Retrieve the best estimator model fitted on the selected features.
- Returns
- sklearn.base.BaseEstimator
Best model fitted on the selected features.
- get_feature_names_out(input_features=None)
Mask feature names according to selected features.
- Parameters
- input_features: array-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
- feature_names_out: ndarray of str objects
Transformed feature names.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- get_support(indices=False)
Get a mask, or integer index, of the features selected.
- Parameters
- indices: bool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- support: array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
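The two return forms carry the same information. A minimal, library-independent sketch of the relationship follows; support_indices is a hypothetical helper, and plain Python lists are used in place of the arrays that get_support returns.

```python
def support_indices(support):
    """Convert a boolean support mask into the indices of the selected features."""
    return [i for i, keep in enumerate(support) if keep]


support = [True, True, False, False, False, True, True, False, False, False]
indices = support_indices(support)
print(indices)  # [0, 1, 5, 6]
# The mask has one entry per input feature, the index list one per selected feature.
print(len(support), len(indices))  # 10 4
```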
- inverse_transform(X)
Reverse the transformation operation.
- Parameters
- X: array of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_r: array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
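The zero-filling behaviour described above can be sketched without the library. The helper zero_fill below is hypothetical and works on nested lists rather than arrays; it places each retained column back at its original position and writes zeros everywhere else.

```python
def zero_fill(X_reduced, support):
    """Re-insert zero columns at the positions of features that were removed."""
    restored = []
    for row in X_reduced:
        values = iter(row)
        # Take the next retained value where the mask is True, otherwise write 0.0.
        restored.append([next(values) if keep else 0.0 for keep in support])
    return restored


support = [True, False, True, False]
X_reduced = [[1.0, 2.0], [3.0, 4.0]]
print(zero_fill(X_reduced, support))
# [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 4.0, 0.0]]
```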
- reseed(seed: Union[int, RandomState])
Random state updating interface for benchmarking. Selector methods with more complex internal structure (such as methods wrapping other methods) are required to override this function accordingly.
- rethread(n_jobs: int)
n_jobs updating interface for benchmarking. Selector methods with more complex internal structure (such as methods wrapping other methods) are required to override this function accordingly.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
- transform(X)
Reduce X to the selected features.
- Parameters
- X: array of shape [n_samples, n_features]
The input samples.
- Returns
- X_r: array of shape [n_samples, n_selected_features]
The input samples with only the selected features.