auswahl.benchmarking.DataHandler

class auswahl.benchmarking.DataHandler(datasets: List[str], methods: List[str], features: List[FeatureDescriptor], reg_metrics: List[str], stab_metrics: List[str], n_runs: int)

Data handling class corralling data generated by the benchmarking of different wavelength selection methods.

Parameters
datasets: List[str]

list of dataset identifiers to be allocated in the DataHandler

methods: List[str]

list of selector identifiers to be allocated in the DataHandler

features: List[FeatureDescriptor]

list of FeatureDescriptors to be allocated in the DataHandler

reg_metrics: List[str]

list of regression metrics to be allocated in the DataHandler

stab_metrics: List[str]

list of stability metrics to be allocated in the DataHandler

n_runs: int

number of evaluation runs for all selectors (for every dataset and feature configuration) to be allocated in the DataHandler

Attributes
datasets: List[str]

sorted list of dataset identifiers contained in the DataHandler

methods: List[str]

sorted list of selector identifiers contained in the DataHandler

feature_descriptors: List[FeatureDescriptor]

sorted list of FeatureDescriptors contained in the DataHandler

reg_metrics: List[str]

sorted list of regression metrics contained in the DataHandler

stab_metrics: List[str]

sorted list of stability metrics contained in the DataHandler

n_runs: int

number of evaluation runs for all selectors (for every dataset and feature configuration)

n_datasets: int

number of datasets contained in the DataHandler

reg_data: pandas.DataFrame

data frame holding regression data

stab_data: pandas.DataFrame

data frame holding the stability data

measurement_data: pandas.DataFrame

data frame holding the execution time measurement data

selection_data: pandas.DataFrame

data frame holding the feature selection data

__init__(datasets: List[str], methods: List[str], features: List[FeatureDescriptor], reg_metrics: List[str], stab_metrics: List[str], n_runs: int)
get_measurement_data(dataset: Union[str, List[str]] = None, method: Union[str, List[str]] = None, n_features: Union[int, List[int], Tuple[int], List[Tuple[int]]] = None, sample: Union[int, List[int]] = None)

Retrieve data related to the execution time of feature selection methods.

Parameters
method: str or list of str, default=None

Method(s) to be retrieved. If None, all methods are retrieved.

n_features: int, tuple of int, list of int or list of tuple of int, default=None

Feature configuration for which to retrieve results. A configuration for a single number of features, a single interval defined as tuple (#intervals, interval_width) or lists of such configurations can be passed. If None, the runs for all numbers of selected features are retrieved.

sample: int or list of int, default=None

The run(s) for which the measurements are to be retrieved. If None, the measurements of all runs are retrieved.

Returns
pandas DataFrame with MultiIndex columns.

The frame holds the methods in its index and a MultiIndex with levels {‘dataset’, ‘n_features’, ‘sample’} as columns, where ‘sample’ refers to the individual runs for the statistical evaluation. The keys for level ‘n_features’ are of type FeatureDescriptor.
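
The column layout described above can be explored with plain pandas. The following is a minimal sketch that builds a stand-in frame rather than calling the DataHandler: the method and dataset names are hypothetical, the timings are random, and plain integers take the place of the FeatureDescriptor keys on level ‘n_features’.

```python
import numpy as np
import pandas as pd

# Hypothetical identifiers; integers stand in for FeatureDescriptor keys.
methods = ["method_a", "method_b"]
columns = pd.MultiIndex.from_product(
    [["dataset_1"], [5, 10], [0, 1, 2]],
    names=["dataset", "n_features", "sample"],
)

# Random stand-in timings, one row per method and one column per run.
rng = np.random.default_rng(0)
timings = pd.DataFrame(
    rng.random((len(methods), len(columns))), index=methods, columns=columns
)

# Select all runs of one feature configuration across the column MultiIndex.
subset = timings.xs(5, axis=1, level="n_features")
print(subset)
```

The cross-section drops the selected level, so the remaining columns are indexed by ‘dataset’ and ‘sample’ only.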

get_meta(dataset)

Provides meta information for each dataset.

Parameters
dataset: str

Name of the dataset whose meta information is requested.

Returns
dict containing information about the dataset
x

The spectral data of the dataset: np.ndarray of shape (n_samples, n_wavelengths)

y

The target quantity of the dataset: np.ndarray of shape (n_samples, )

n_samples

Direct access to the number of samples in the dataset

n_features

Direct access to the number of wavelengths, that is features, in the dataset
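
As an illustration of the documented keys, the sketch below assembles a dictionary of the same shape by hand. It is not the output of get_meta itself: the array contents and sizes are made up, and only the four key names are taken from the description above.

```python
import numpy as np

# Made-up spectral data: 20 samples measured at 100 wavelengths.
x = np.zeros((20, 100))
y = np.zeros(20)

# Dictionary mirroring the keys listed in the Returns section.
meta = {
    "x": x,                      # shape (n_samples, n_wavelengths)
    "y": y,                      # shape (n_samples,)
    "n_samples": x.shape[0],
    "n_features": x.shape[1],
}
print(meta["n_samples"], meta["n_features"])
```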

get_regression_data(dataset: Union[str, List[str]] = None, method: Union[str, List[str]] = None, n_features: Union[int, List[int], Tuple[int], List[Tuple[int]]] = None, reg_metric: Union[str, List[str]] = None, sample: Union[int, List[int]] = None) -> DataFrame

Retrieve data related to the regression performance of feature selection methods.

Parameters
dataset: str or list of str, default=None

Dataset identifier or list of dataset identifiers.

method: str or list of str, default=None

Method(s) to be retrieved. If None, all methods are retrieved.

n_features: int, tuple of int, list of int or list of tuple of int, default=None

Feature configuration for which to retrieve results. A configuration for a single number of features, a single interval defined as tuple (#intervals, interval_width) or lists of such configurations can be passed. If None, the runs for all feature configurations are retrieved.

reg_metric: str or list of str, default=None

Regression metric(s) to be retrieved. If None, all available metrics are retrieved.

item: Literal of [‘mean’, ‘std’, ‘median’, ‘max’, ‘min’, ‘samples’], default=None

Specifies which indicator(s) of the selected regression metrics are to be retrieved. If None, all indicators are retrieved.

Returns
pandas DataFrame with MultiIndex columns.

The frame holds the selection methods in its index and a MultiIndex with levels {‘dataset’, ‘n_features’, ‘reg_metric’, ‘sample’} as columns, where ‘sample’ refers to the individual runs for the statistical evaluation. The keys for level ‘n_features’ are of type FeatureDescriptor.
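
To show how the ‘sample’ level relates to a summary indicator such as ‘mean’, the sketch below averages a stand-in frame over its runs with plain pandas. This is an assumption about how one might post-process such a frame, not a description of how the DataHandler computes its indicators. The identifiers and scores are hypothetical, and integers replace the FeatureDescriptor keys.

```python
import numpy as np
import pandas as pd

# Hypothetical identifiers; integers stand in for FeatureDescriptor keys.
methods = ["method_a", "method_b"]
columns = pd.MultiIndex.from_product(
    [["dataset_1"], [5, 10], ["metric_a"], [0, 1, 2]],
    names=["dataset", "n_features", "reg_metric", "sample"],
)

# Deterministic stand-in scores: 0..5 for method_a, 6..11 for method_b.
scores = pd.DataFrame(
    np.arange(12, dtype=float).reshape(2, 6), index=methods, columns=columns
)

# Average over the 'sample' level by grouping on all remaining column levels.
mean_scores = (
    scores.T.groupby(level=["dataset", "n_features", "reg_metric"]).mean().T
)
print(mean_scores)
```

Grouping on the transposed frame avoids the column-wise groupby that newer pandas versions deprecate.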

get_selection_data(dataset: Union[str, List[str]] = None, method: Union[str, List[str]] = None, n_features: Union[int, Tuple[int], List[int], List[Tuple[int]]] = None, sample: Union[int, List[int]] = None) -> DataFrame

Retrieve the features selected by the feature selection methods.

Parameters
method: str or list of str, default=None

Method(s) to be retrieved. If None, all methods are retrieved.

n_features: int, tuple of int, list of int or list of tuple of int, default=None

Feature configuration for which to retrieve results. A configuration for a single number of features, a single interval defined as tuple (#intervals, interval_width) or lists of such configurations can be passed. If None, the runs for all numbers of selected features are retrieved.

sample: int or list of int, default=None

The run(s) for which the selected features are to be retrieved. If None, the selected features of all runs are retrieved.

Returns
pandas DataFrame with MultiIndex columns.

The frame holds the methods in its index and a MultiIndex with levels {‘dataset’, ‘n_features’, ‘sample’} as columns, where ‘sample’ refers to the individual runs for the statistical evaluation. The keys for level ‘n_features’ are of type FeatureDescriptor. The type of the data in the frame is Selection.

get_stability_data(dataset: Union[str, List[str]] = None, method: Union[str, List[str]] = None, n_features: Union[int, Tuple[int], List[int], List[Tuple[int]]] = None, stab_metric: Union[str, List[str]] = None) -> DataFrame

Retrieve data related to the stability of feature selection methods.

Parameters
method: str or list of str, default=None

Method(s) to be retrieved. If None, all methods are retrieved.

n_features: int, tuple of int, list of int or list of tuple of int, default=None

Feature configuration for which to retrieve results. A configuration for a single number of features, a single interval defined as tuple (#intervals, interval_width) or lists of such configurations can be passed. If None, the runs for all numbers of selected features are retrieved.

stab_metric: str or list of str, default=None

Stability metric(s) to be retrieved. If None, all available metrics are retrieved.

Returns
pandas DataFrame with MultiIndex columns.

The frame holds the selection methods in its index and a MultiIndex with levels {‘dataset’, ‘n_features’, ‘stab_metric’} as columns. The keys for level ‘n_features’ are of type FeatureDescriptor.

register_meta(dataset_meta: List[Tuple[array, array, str, float]])

Register dataset information into the DataHandler

Parameters
dataset_meta: List[Tuple[np.array, np.array, str, float]]

List of tuples, each specifying the spectral data of a dataset, its target values, its name and its training data ratio
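
A minimal sketch of how such a list could be assembled is shown below. The data is random, the dataset name and the ratio of 0.8 are hypothetical, and the call on a DataHandler instance is left as a comment because constructing one requires FeatureDescriptor objects that are outside the scope of this sketch.

```python
import numpy as np

# Random stand-in data: 30 samples measured at 50 wavelengths.
rng = np.random.default_rng(0)
x = rng.random((30, 50))
y = rng.random(30)

# One tuple per dataset: (spectral data, target values, name, training ratio).
dataset_meta = [(x, y, "dataset_1", 0.8)]

# With an existing DataHandler instance `handler` (not constructed here):
# handler.register_meta(dataset_meta)
```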

register_stability(dataset: str, method: str, n_features: FeatureDescriptor, metric_name: str, value: float)

Register a stability score in the DataHandler

Parameters
dataset: str

identifier of the dataset for which a stability is registered

method: str

identifier of the selector for which the stability is registered

n_features: Union[int, Tuple[int, int], FeatureDescriptor]

feature configuration for which the stability is registered

metric_name: str

name of the stability metric registered

value: float

the calculated stability

store(file_path: str, file_name: str)

Stores the DataHandler object as pickle file.

Parameters
file_path: str

Path to the file.

file_name: str

Name of the file without extension.
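
Since the object is stored as a pickle file, a round trip with the standard pickle module illustrates the general idea. The sketch below does not call store: a plain dictionary stands in for the DataHandler, and both the file name and the ".pickle" extension are assumptions, as the extension appended by store is not stated here.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; a real DataHandler would be pickled in its place.
payload = {"methods": ["method_a"], "n_runs": 3}

with tempfile.TemporaryDirectory() as file_path:
    # Hypothetical file name; the extension is an assumption.
    target = Path(file_path) / "benchmark_results.pickle"

    # Write the object to disk.
    with open(target, "wb") as fh:
        pickle.dump(payload, fh)

    # Read it back.
    with open(target, "rb") as fh:
        restored = pickle.load(fh)

print(restored)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.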

Examples using auswahl.benchmarking.DataHandler

Benchmarking - Example
