auswahl.benchmarking.DataHandler
- class auswahl.benchmarking.DataHandler(datasets: List[str], methods: List[str], features: List[FeatureDescriptor], reg_metrics: List[str], stab_metrics: List[str], n_runs: int)
Data handling class corralling data generated by the benchmarking of different wavelength selection methods.
- Parameters
- datasets: List[str]
list of dataset identifiers to be allocated in the DataHandler
- methods: List[str]
list of selector identifiers to be allocated in the DataHandler
- features: List[FeatureDescriptor]
list of FeatureDescriptors to be allocated in the DataHandler
- reg_metrics: List[str]
list of regression metrics to be allocated in the DataHandler
- stab_metrics: List[str]
list of stability metrics to be allocated in the DataHandler
- n_runs: int
number of evaluation runs for all selectors (for every dataset and feature configuration) to be allocated in the DataHandler
- Attributes
- datasets: List[str]
sorted list of dataset identifiers contained in the DataHandler
- methods: List[str]
sorted list of selector identifiers contained in the DataHandler
- feature_descriptors: List[FeatureDescriptor]
sorted list of FeatureDescriptors contained in the DataHandler
- reg_metrics: List[str]
sorted list of regression metrics contained in the DataHandler
- stab_metrics: List[str]
sorted list of stability metrics contained in the DataHandler
- n_runs: int
number of evaluation runs for all selectors (for every dataset and feature configuration)
- n_datasets: int
number of datasets contained in the DataHandler
- reg_data: pandas.DataFrame
data frame holding regression data
- stab_data: pandas.DataFrame
data frame holding the stability data
- measurement_data: pandas.DataFrame
data frame holding the execution time measurement data
- selection_data: pandas.DataFrame
data frame holding the feature selection data
- __init__(datasets: List[str], methods: List[str], features: List[FeatureDescriptor], reg_metrics: List[str], stab_metrics: List[str], n_runs: int)
- get_measurement_data(dataset: Union[str, List[str]] = None, method: Union[str, List[str]] = None, n_features: Union[int, List[int], Tuple[int], List[Tuple[int]]] = None, sample: Union[int, List[int]] = None)
Retrieve data related to the execution time of feature selection methods.
- Parameters
- method: str or list of str, default=None
Method(s) to be retrieved. If None, all methods are retrieved.
- n_features: int, tuple of int, list of int or list of tuple of int, default=None
Feature configuration for which to retrieve results. A configuration for a single number of features, a single interval defined as tuple (#intervals, interval_width) or lists of such configurations can be passed. If None, the runs for all numbers of selected features are retrieved.
- sample: int or list of int, default=None
The run(s) for which the execution time measurements are to be retrieved. If None, the measurements of all runs are retrieved.
- Returns
- pandas.DataFrame with MultiIndex columns
The frame holds the methods in its index and a MultiIndex with levels {‘dataset’, ‘n_features’, ‘sample’} as columns, where ‘sample’ refers to the individual runs for the statistical evaluation. The keys for level ‘n_features’ are of type FeatureDescriptor.
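The column layout described above can be pictured with a small stand-alone pandas sketch. It does not call auswahl: the method names, the dataset name and the use of plain integers in place of FeatureDescriptor keys are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical selector and dataset names; plain ints stand in for FeatureDescriptor keys.
methods = ["CARS", "VIP"]
columns = pd.MultiIndex.from_product(
    [["corn"], [5, 10], [0, 1, 2]],
    names=["dataset", "n_features", "sample"],
)

# One row per method, one column per (dataset, n_features, sample) combination.
frame = pd.DataFrame(0.0, index=methods, columns=columns)
frame.loc["CARS", ("corn", 5, 1)] = 0.42  # e.g. an execution time in seconds

# Pick all runs for a single feature configuration.
per_config = frame.xs(5, axis=1, level="n_features")

# Average over the 'sample' level to get one value per (dataset, n_features).
mean_time = frame.T.groupby(level=["dataset", "n_features"]).mean().T
```

Averaging over the ‘sample’ level, as in the last line, is one way such a frame might be summarised across runs.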
- get_meta(dataset)
Provides meta information for each dataset.
- Parameters
- dataset: str
Name of the dataset whose meta information is requested.
- Returns
- dict containing information about the dataset
x: The spectral data of the dataset, an np.ndarray of shape (n_samples, n_wavelengths)
y: The target quantity of the dataset, an np.ndarray of shape (n_samples,)
n_samples: Direct access to the number of samples in the dataset
n_features: Direct access to the number of wavelengths, that is, features, in the dataset
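As a rough picture of the returned dictionary, the sketch below builds one by hand with the four documented keys. The array sizes and random contents are assumptions; the real values come from whatever was registered for the dataset.

```python
import numpy as np

# Hypothetical dataset: 80 samples measured at 700 wavelengths.
rng = np.random.default_rng(0)
x = rng.random((80, 700))
y = rng.random(80)

# Same keys as documented for the dict returned by get_meta.
meta = {
    "x": x,                      # spectral data, shape (n_samples, n_wavelengths)
    "y": y,                      # target quantity, shape (n_samples,)
    "n_samples": x.shape[0],     # shortcut for the number of samples
    "n_features": x.shape[1],    # shortcut for the number of wavelengths
}
```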
- get_regression_data(dataset: Union[str, List[str]] = None, method: Union[str, List[str]] = None, n_features: Union[int, List[int], Tuple[int], List[Tuple[int]]] = None, reg_metric: Union[str, List[str]] = None, sample: Union[int, List[int]] = None) -> DataFrame
Retrieve data related to the regression performance of feature selection methods.
- Parameters
- dataset: str or list of str, default=None
Dataset identifier or list of dataset identifiers.
- method: str or list of str, default=None
Method(s) to be retrieved. If None, all methods are retrieved.
- n_features: int, tuple of int, list of int or list of tuple of int, default=None
Feature configuration for which to retrieve results. A configuration for a single number of features, a single interval defined as tuple (#intervals, interval_width) or lists of such configurations can be passed. If None, the runs for all feature configurations are retrieved.
- reg_metric: str or list of str, default=None
Regression metric(s) to be retrieved. If None, all available metrics are retrieved.
- item: Literal of [‘mean’, ‘std’, ‘median’, ‘max’, ‘min’, ‘samples’], default=None
Specifies which indicator(s) for the selected regression metrics are to be retrieved. If None, all indicators are retrieved.
- Returns
- pandas.DataFrame with MultiIndex columns
The frame holds the selection methods in its index and a MultiIndex with levels {‘dataset’, ‘n_features’, ‘reg_metric’, ‘sample’} as columns, where ‘sample’ refers to the individual runs for the statistical evaluation. The keys for level ‘n_features’ are of type FeatureDescriptor.
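The n_features argument accepts several shapes: a single count, a single (#intervals, interval_width) tuple, or a list mixing both. The sketch below shows one way such input could be brought into a uniform list. The helper normalize_feature_configs is hypothetical and not part of auswahl, whose internal handling may differ.

```python
from typing import List, Optional, Tuple, Union

# A single configuration: a feature count, or (n_intervals, interval_width).
FeatureConfig = Union[int, Tuple[int, int]]


def normalize_feature_configs(
    n_features: Union[None, FeatureConfig, List[FeatureConfig]],
) -> Optional[List[FeatureConfig]]:
    """Hypothetical helper (not part of auswahl): return the configurations as a list."""
    if n_features is None:
        return None  # None means: no filtering, all configurations are retrieved
    if isinstance(n_features, (int, tuple)):
        return [n_features]  # a single count or a single interval tuple
    return list(n_features)  # already a list of counts and/or interval tuples


print(normalize_feature_configs(10))
print(normalize_feature_configs((3, 5)))
print(normalize_feature_configs([5, (2, 10)]))
```

Under this reading, passing n_features=[5, (2, 10)] would request results for 5 individually selected wavelengths and for 2 intervals of width 10.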
- get_selection_data(dataset: Union[str, List[str]] = None, method: Union[str, List[str]] = None, n_features: Union[int, Tuple[int], List[int], List[Tuple[int]]] = None, sample: Union[int, List[int]] = None) -> DataFrame
Retrieve the features selected by the feature selection methods.
- Parameters
- method: str or list of str, default=None
Method(s) to be retrieved. If None, all methods are retrieved.
- n_features: int, tuple of int, list of int or list of tuple of int, default=None
Feature configuration for which to retrieve results. A configuration for a single number of features, a single interval defined as tuple (#intervals, interval_width) or lists of such configurations can be passed. If None, the runs for all numbers of selected features are retrieved.
- sample: int or list of int, default=None
The run(s) for which the selected features are to be retrieved. If None, the selected features of all runs are retrieved.
- Returns
- pandas.DataFrame with MultiIndex columns
The frame holds the methods in its index and a MultiIndex with levels {‘dataset’, ‘n_features’, ‘sample’} as columns, where ‘sample’ refers to the individual runs for the statistical evaluation. The keys for level ‘n_features’ are of type FeatureDescriptor. The type of the data in the frame is Selection.
- get_stability_data(dataset: Union[str, List[str]] = None, method: Union[str, List[str]] = None, n_features: Union[int, Tuple[int], List[int], List[Tuple[int]]] = None, stab_metric: Union[str, List[str]] = None) -> DataFrame
Retrieve data related to the stability of feature selection methods.
- Parameters
- method: str or list of str, default=None
Method(s) to be retrieved. If None, all methods are retrieved.
- n_features: int, tuple of int, list of int or list of tuple of int, default=None
Feature configuration for which to retrieve results. A configuration for a single number of features, a single interval defined as tuple (#intervals, interval_width) or lists of such configurations can be passed. If None, the runs for all numbers of selected features are retrieved.
- stab_metric: str or list of str, default=None
Stability metric(s) to be retrieved. If None, all available metrics are retrieved.
- Returns
- pandas.DataFrame with MultiIndex columns
The frame holds the selection methods in its index and a MultiIndex with levels {‘dataset’, ‘n_features’, ‘stab_metric’} as columns. The keys for level ‘n_features’ are of type FeatureDescriptor.
- register_meta(dataset_meta: List[Tuple[array, array, str, float]])
Register dataset information in the DataHandler.
- Parameters
- dataset_meta: List[Tuple[np.array, np.array, str, float]]
List of tuples, each specifying the spectral data of a dataset, its target values, its name and its training data ratio
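A list in the documented tuple order (spectral data, target values, name, training data ratio) could be assembled as sketched below. The dataset names, array sizes and the reading of the ratio as a fraction between 0 and 1 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each entry: (spectral data, target values, dataset name, training data ratio).
dataset_meta = [
    (rng.random((80, 700)), rng.random(80), "corn", 0.8),   # hypothetical dataset
    (rng.random((60, 400)), rng.random(60), "soil", 0.75),  # hypothetical dataset
]

# Basic sanity checks before handing the list over.
for x, y, name, train_ratio in dataset_meta:
    assert x.shape[0] == y.shape[0], f"{name}: x and y disagree on n_samples"
    assert 0.0 < train_ratio < 1.0, f"{name}: ratio assumed to be a fraction"
```

Such a list would then be passed as the dataset_meta argument of register_meta.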
- register_stability(dataset: str, method: str, n_features: FeatureDescriptor, metric_name: str, value: float)
Register a stability score in the DataHandler.
- Parameters
- dataset: str
identifier of the dataset for which the stability is registered
- method: str
identifier of the selector for which the stability is registered
- n_features: Union[int, Tuple[int, int], FeatureDescriptor]
feature configuration for which the stability is registered
- metric_name: str
name of the stability metric registered
- value: float
the calculated stability score
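Conceptually, registering a stability score amounts to writing one cell of a frame indexed by method, with (dataset, n_features, stab_metric) as columns. The sketch below mimics this with plain pandas. The function register_stability_sketch, the metric name and the integer feature keys are hypothetical stand-ins; the actual implementation in auswahl may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical names; plain ints stand in for FeatureDescriptor keys.
columns = pd.MultiIndex.from_product(
    [["corn"], [5, 10], ["stability_score"]],
    names=["dataset", "n_features", "stab_metric"],
)
stab_data = pd.DataFrame(np.nan, index=["CARS", "VIP"], columns=columns)


def register_stability_sketch(frame, dataset, method, n_features, metric_name, value):
    """Write a single stability score into the matching cell of the frame."""
    frame.loc[method, (dataset, n_features, metric_name)] = value


register_stability_sketch(stab_data, "corn", "VIP", 10, "stability_score", 0.73)
```

Cells that have not been registered yet stay NaN, which makes missing results easy to spot.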