geochemistrypi.data_mining.data package

Submodules

geochemistrypi.data_mining.data.data_readiness module

basic_info(data: DataFrame) → None

Show the basic information of the data set.

Parameters:

data (pd.DataFrame) – The data set to be shown.

bool_input(prefix: str | None = None) → bool

Prompt the user and return the answer as a boolean value.

Parameters:

prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

Returns:

A boolean value.

Return type:

bool

create_sub_data_set(data: DataFrame, allow_empty_columns: bool = False) → DataFrame

Create a sub data set.

Parameters:
  • data (pd.DataFrame) – The data set to be processed.

  • allow_empty_columns (bool, optional) – Whether to include empty columns in the sub data set. The default is False.

Returns:

The sub data set.

Return type:

pd.DataFrame

data_split(X: DataFrame, y: DataFrame | Series, names: DataFrame, test_size: float = 0.2) → Dict

Split arrays or matrices into random train and test subsets.

Parameters:
  • X (pd.DataFrame) – The data to be split.

  • y (pd.DataFrame or pd.Series) – The target variable to be split.

  • names (pd.DataFrame) – The names associated with the data.

  • test_size (float, default=0.2) – Represents the proportion of the dataset to include in the test split.

Returns:

A dictionary containing the split data.

Return type:

dict
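The following is a minimal sketch of this kind of split, assuming it behaves like scikit-learn's train_test_split applied to X, y and names together. The toy data, the random_state value and the dictionary keys are assumptions made for illustration; the keys actually returned by data_split may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; column names and sample names are invented for illustration.
X = pd.DataFrame({"SiO2": [45.1, 50.2, 61.3, 48.7, 55.0],
                  "MgO": [9.8, 7.1, 2.4, 8.3, 4.9]})
y = pd.Series([0, 0, 1, 0, 1], name="label")
names = pd.DataFrame({"sample": ["s1", "s2", "s3", "s4", "s5"]})

# train_test_split splits every array passed to it with the same row selection,
# so features, targets and names stay aligned.
X_train, X_test, y_train, y_test, names_train, names_test = train_test_split(
    X, y, names, test_size=0.2, random_state=42
)

# Hypothetical dictionary layout, not the documented return value of data_split.
split = {
    "X train": X_train, "X test": X_test,
    "y train": y_train, "y test": y_test,
    "names train": names_train, "names test": names_test,
}
```

With five rows and test_size=0.2, one row goes to the test subset and four to the training subset.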

float_input(default: float, prefix: str | None = None, slogan: str | None = '@Number: ') → float

Get the number of the desired option.

Parameters:
  • default (float) – If the user does not enter anything, it is assigned to option.

  • prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

  • slogan (str, default="@Number: ") – Acts like the prompt argument of Python's built-in input function: it is the hint printed to the user.

Returns:

option – An option number.

Return type:

float or int
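The default-on-empty-input behaviour described above can be sketched as follows. This is an illustration only, not the library's implementation: read_float and its reader argument are hypothetical names, and the reader is injected so that the example runs without a terminal.

```python
from typing import Callable

def read_float(default: float, reader: Callable[[], str]) -> float:
    """Return the number supplied by `reader`, or `default` for empty input."""
    while True:
        raw = reader().strip()
        if raw == "":
            return default          # nothing entered: fall back to the default
        try:
            return float(raw)
        except ValueError:
            print("Please enter a number.")  # invalid text: ask again

# Simulated user input; interactively one would pass lambda: input("@Number: ").
answers = iter(["", "abc", "1.25"])
first = read_float(0.5, lambda: next(answers))   # empty input, default is used
second = read_float(0.5, lambda: next(answers))  # "abc" is rejected, then 1.25 is read
```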

int_input(column: int, prefix: str | None = None, slogan: str | None = '@Number: ') → int

Get the number of the desired option.

Parameters:
  • column (int) – The value assigned to option if the user does not enter anything.

  • prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

  • slogan (str, default="@Number: ") – Acts like the prompt argument of Python's built-in input function: it is the hint printed to the user.

Returns:

option – An option number.

Return type:

int

limit_num_input(option_list: List[str], prefix: str, input_func: num_input) → int

Limit the scope of the option.

Parameters:
  • option_list (List[str]) – All the options provided are stored in a list.

  • prefix (str) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

  • input_func (function) – The function used to read the number from the user, such as num_input.

Returns:

option – An option number. Note that option = real index + 1, because options are numbered from 1.

Return type:

int
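The relation between the returned option and the list index can be illustrated with a small sketch. It assumes that out-of-range numbers are rejected and asked for again; choose_option, read_number and the option names are hypothetical and only stand in for the documented functions.

```python
from typing import Callable, List

def choose_option(option_list: List[str], read_number: Callable[[], int]) -> int:
    """Keep asking until the number lies within 1..len(option_list)."""
    while True:
        option = read_number()
        if 1 <= option <= len(option_list):
            return option
        print(f"Please enter a number between 1 and {len(option_list)}.")

methods = ["Min-max Scaling", "Standardization", "Mean Normalization"]
answers = iter([7, 2])  # 7 is out of range and rejected, 2 is accepted
option = choose_option(methods, lambda: next(answers))

# Options are numbered from 1, list indices from 0.
chosen = methods[option - 1]
```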

np2pd(array: ndarray, columns_name: List[str]) → DataFrame

Convert numpy array to pandas dataframe.

Parameters:
  • array (np.ndarray) – The numpy array to be converted.

  • columns_name (List[str]) – The column names of the dataframe.

Returns:

The converted dataframe.

Return type:

pd.DataFrame
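An equivalent conversion using only the public pandas API might look like the sketch below. The array values and column names are invented, and the actual body of np2pd is not shown in this reference.

```python
import numpy as np
import pandas as pd

array = np.array([[1.0, 2.0], [3.0, 4.0]])
columns_name = ["SiO2", "MgO"]  # one name per array column

# The DataFrame constructor attaches the column labels to the array.
df = pd.DataFrame(array, columns=columns_name)
```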

num2option(items: List[str]) → None

List all the options serially.

Parameters:

items (list) – The items to be enumerated.

num_input(prefix: str | None = None, slogan: str | None = '@Number: ') → int

Get the number of the desired option.

Parameters:
  • prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

  • slogan (str, default="@Number: ") – Acts like the prompt argument of Python's built-in input function: it is the hint printed to the user.

Returns:

option – An option number. Note that option = real index + 1, because options are numbered from 1.

Return type:

int

read_data(file_path: str | None = None, is_own_data: int = 2, prefix: str | None = None, slogan: str | None = '@File: ')

Read the data set.

Parameters:
  • file_path (str, optional) – The path of the data set, by default None

  • is_own_data (int, default=2) – 1: own data set; 2: built-in data set

  • prefix (str, optional) – The prefix of the data set, by default None

  • slogan (str, optional) – The prompt shown on the command-line console, by default "@File: ".

Returns:

The data set read

Return type:

pd.DataFrame

select_column_name(data: DataFrame) → str

Select a single column from the dataframe and return its name.

Parameters:

data (pd.DataFrame) – The data set from which a column name is selected.

select_columns(columns_range: str | None = None) → List[int]

Select the columns of the data set.

Parameters:

columns_range (str, default=None) – The columns range of the data set.

Returns:

The columns selected.

Return type:

list

show_data_columns(columns_name: Index, columns_index: List | None = None) → None

Show the column names of the data set.

Parameters:
  • columns_name (pd.Index) – The column names of the data set.

  • columns_index (list, default=None) – The column index of the data set.

show_excel_columns(excel_list: List | None = None) → None

Displays the index and name of each column in the provided Excel list.

Parameters:

excel_list (Optional[List], optional) – A list containing the names of Excel columns. Defaults to None.

Return type:

None

str_input(option_list: List[str], prefix: str | None = None) → str

Get the string of the desired option.

Parameters:
  • option_list (list) – All the options provided are stored in a list.

  • prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

Returns:

option – A string of the desired option.

Return type:

str

tuple_input(default: Tuple[int], prefix: str | None = None, slogan: str | None = None) → Tuple[int]

Get the tuple of the desired option.

Parameters:
  • default (Tuple[int]) – If the user does not enter anything, it is assigned to option.

  • prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

  • slogan (str, default=None) – Acts like the prompt argument of Python's built-in input function: it is the hint printed to the user.

Returns:

option – A numeric tuple.

Return type:

tuple

geochemistrypi.data_mining.data.feature_engineering module

class FeatureConstructor(data: DataFrame, name_all: str)

Bases: object

Construct a new feature based on the existing data set.

append_feature(new_feature_column: Series) → None

Append the new feature to the original data.

batch_build(feature_engineering_config: Dict) → None

build() → None

Build the new feature.

cal_words = ['pow', 'sin', 'cos', 'tan', 'pi', 'mean', 'std', 'var', 'log']

index2name() → None

Show the index of columns in the data set. The display pattern is [letter : column name], e.g. a : 1st column name; b : 2nd column name.

input_expression() → None

Input the expression of the constructed feature.

input_feature_name() → None

Name the constructed feature (column name), like ‘NEW-COMPOUND’.

letter_map() → None

Map the letter to the column name.
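The [letter : column name] pattern can be sketched with the standard library alone. This is an assumption about the mapping's shape rather than the class's actual code, and the column names are invented.

```python
import string

columns = ["SiO2", "TiO2", "Al2O3"]  # hypothetical column names

# Pair 'a', 'b', 'c', ... with the columns in order.
# zip stops at the shorter input, so this sketch covers at most 26 columns.
letter_to_column = dict(zip(string.ascii_lowercase, columns))

for letter, name in letter_to_column.items():
    print(f"{letter} : {name}")
```

An expression such as a + b could then be read as the sum of the SiO2 and TiO2 columns.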

oper = '+-*/^(),.'

geochemistrypi.data_mining.data.imputation module

imputer(data: DataFrame, method: str) → tuple[dict, ndarray]

Apply imputation on missing values.

Parameters:
  • data (pd.DataFrame) – The dataset with missing values.

  • method (str) – The imputation method.

Returns:

  • imputation_config (dict) – The imputation configuration.

  • data_imputed (np.ndarray) – The dataset after imputing.
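As an illustration of the two return values, here is a minimal sketch that assumes mean imputation with scikit-learn's SimpleImputer. The accepted values of method are not listed in this reference, and the configuration dictionary layout shown here is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column.
data = pd.DataFrame({"SiO2": [45.0, np.nan, 55.0],
                     "MgO": [8.0, 6.0, np.nan]})

# Replace each missing value with the mean of its column.
imp = SimpleImputer(strategy="mean")
data_imputed = imp.fit_transform(data)  # returned as a NumPy array

# Hypothetical configuration record: transformer name mapped to its parameters.
imputation_config = {"SimpleImputer": {"strategy": "mean"}}
```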

geochemistrypi.data_mining.data.inference module

class PipelineConstrutor

Bases: object

Construct a sklearn pipeline from a dictionary of transformers.

chain(transformer_config: Dict) → object

Chain transformers together into a sklearn pipeline.

Parameters:

transformer_config (Dict) – A dictionary of transformers and their parameters.

Returns:

A sklearn pipeline.

Return type:

object

property transformer_dict: Dict

A dictionary of transformers. It needs to be updated when new transformers are added to the customized automated ML pipeline.
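One plausible way to chain transformers from a configuration dictionary is sketched below. It assumes that the configuration maps a transformer name to its keyword arguments and that the registry maps names to classes; both layouts, and the two registry entries, are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical registry: transformer name -> transformer class.
transformer_dict = {"SimpleImputer": SimpleImputer, "StandardScaler": StandardScaler}

# Hypothetical configuration: transformer name -> constructor parameters.
transformer_config = {"SimpleImputer": {"strategy": "mean"}, "StandardScaler": {}}

# Instantiate each transformer and chain them in configuration order.
steps = [(name, transformer_dict[name](**params))
         for name, params in transformer_config.items()]
pipeline = Pipeline(steps)

X = np.array([[1.0], [np.nan], [3.0]])
X_out = pipeline.fit_transform(X)  # impute the gap, then standardize
```

Because Python dictionaries preserve insertion order, the steps run in the order in which they appear in the configuration.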

build_transform_pipeline(imputation_config: Dict, feature_scaling_config: Dict, feature_selection_config: Dict, run: object, X_train: DataFrame, y_train: DataFrame) → Tuple[Dict, object]

Build the transform pipeline.

Parameters:
  • imputation_config (Dict) – The imputation configuration.

  • feature_scaling_config (Dict) – The feature scaling configuration.

  • feature_selection_config (Dict) – The feature selection configuration.

  • run (object) – The model selection object.

  • X_train (pd.DataFrame) – The training data.

  • y_train (pd.DataFrame) – The training target values.

Returns:

The transform pipeline configuration and the transform pipeline object.

Return type:

Tuple[Dict, object]

model_inference(inference_data: DataFrame, inference_name_column: str, is_inference: bool, run: object, transformer_config: Dict, transform_pipeline: object | None = None)

Run the model inference.

Parameters:
  • inference_data (pd.DataFrame) – The inference data.

  • inference_name_column (str) – The name column of the inference data.

  • is_inference (bool) – Whether to run the model inference.

  • run (object) – The model selection object.

  • transformer_config (Dict) – The transformer configuration.

  • transform_pipeline (Optional[object], optional) – The transform pipeline object. The default is None.

geochemistrypi.data_mining.data.preprocessing module

class MeanNormalScaler(copy: bool = True)

Bases: BaseEstimator, TransformerMixin

Custom Scikit-learn transformer for mean normalization.

Mean normalization subtracts the mean of each feature from the feature values and then divides by the range (maximum value minus minimum value) of that feature.

The transformation is given by:

X_scaled = (X - X.mean()) / (X.max() - X.min())
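The formula can be checked on a small example. The sketch below applies the arithmetic directly with pandas, including the inverse step; it is not the MeanNormalScaler class itself, and the data are invented.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"SiO2": [40.0, 50.0, 60.0],
                  "MgO": [2.0, 4.0, 12.0]})

# Per-column statistics, which a fit step would compute and store.
mean = X.mean()
value_range = X.max() - X.min()

# Forward transformation: (X - mean) / (max - min), column by column.
X_scaled = (X - mean) / value_range

# Inverse transformation: multiply by the range, then add the mean back.
X_restored = X_scaled * value_range + mean
```

Here SiO2 has mean 50 and range 20, so its scaled values are -0.5, 0 and 0.5. A constant column has a range of zero and would cause a division by zero, so it needs separate handling.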

fit(X: DataFrame, y: DataFrame | None = None) → object

Compute the mean and range (max - min) for each feature.

Parameters:
  • X (pd.DataFrame) – The input dataframe where each column represents a feature.

  • y (pd.DataFrame, optional (default: None)) – Ignored.

Returns:

self – Fitted transformer.

Return type:

object

inverse_transform(X: DataFrame) → ndarray

Reverse the mean normalization transformation.

Parameters:

X (pd.DataFrame) – The input dataframe where each column represents a feature.

Returns:

X_tr – The original data.

Return type:

np.ndarray

transform(X: DataFrame, y: DataFrame | None = None, copy: bool | None = None) → ndarray

Apply mean normalization to the data.

Parameters:
  • X (pd.DataFrame) – The input dataframe where each column represents a feature.

  • y (pd.DataFrame, optional (default: None)) – Ignored.

  • copy (bool, optional (default: None)) – Copy the input X or not.

Returns:

X_tr – The normalized data.

Return type:

np.ndarray

feature_scaler(X: DataFrame, method: List[str], method_idx: int) → tuple[dict, ndarray]

Apply feature scaling methods.

Parameters:
  • X (pd.DataFrame) – The dataset.

  • method (List[str]) – The feature scaling methods.

  • method_idx (int) – The index of methods.

Returns:

  • feature_scaling_config (dict) – The feature scaling configuration.

  • X_scaled (np.ndarray) – The dataset after scaling.

feature_selector(X: DataFrame, y: DataFrame, feature_selection_task: int, method: List[str], method_idx: int) → tuple[dict, DataFrame]

Apply feature selection methods.

Parameters:
  • X (pd.DataFrame) – The feature dataset.

  • y (pd.DataFrame) – The label dataset.

  • feature_selection_task (int) – Feature selection for regression or classification tasks.

  • method (List[str]) – The feature selection methods.

  • method_idx (int) – The index of methods.

Returns:

  • feature_selection_config (dict) – The feature selection configuration.

  • X_selected (pd.DataFrame) – The feature dataset after selecting.

geochemistrypi.data_mining.data.statistic module

monte_carlo_simulator(df_orig: DataFrame, df_impute: DataFrame, sample_size: int, iteration: int, test: str, confidence: float = 0.05) → None

Check, for each column, whether the hypothesis test rejects the null hypothesis (p-value < significance level), to find out whether the imputation changes the distribution of the original data set.

Parameters:
  • df_orig (pd.DataFrame (n_samples, n_components)) – The original dataset with missing values.

  • df_impute (pd.DataFrame (n_samples, n_components)) – The dataset after imputation.

  • test (str) – The statistics test method used.

  • sample_size (int) – The size of the sample for each iteration.

  • iteration (int) – The number of iterations of Monte Carlo simulation.

  • confidence (float, default=0.05) – The significance level against which the p-values are compared.

test_once(df_orig: DataFrame, df_impute: DataFrame, test: str) → ndarray

Run a non-parametric hypothesis test once on each pair of corresponding columns. Null hypothesis: the distribution of the data set is the same before and after imputation.

Parameters:
  • df_orig (pd.DataFrame (n_samples, n_components)) – The original dataset with missing values.

  • df_impute (pd.DataFrame (n_samples, n_components)) – The dataset after imputation.

  • test (str) – The statistics test method used.

Returns:

pvals – A numpy array containing the p-values of the tests on each column, in column order.

Return type:

np.ndarray
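The per-column test and the Monte Carlo loop around it can be sketched as follows. This is an illustration under several assumptions, not the library's code: the Kruskal-Wallis test from SciPy stands in for whatever the test argument selects, the data are synthetic, and column_pvalues and rejection_rate are hypothetical helper names.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: MgO has missing values, which are then mean-imputed.
rng = np.random.default_rng(0)
df_orig = pd.DataFrame({"SiO2": rng.normal(50, 5, 200),
                        "MgO": rng.normal(6, 1, 200)})
df_orig.loc[::10, "MgO"] = np.nan
df_impute = df_orig.fillna(df_orig.mean())

def column_pvalues(a: pd.DataFrame, b: pd.DataFrame) -> np.ndarray:
    """One non-parametric test per column pair; missing values are dropped."""
    pvals = []
    for col in a.columns:
        _, p = stats.kruskal(a[col].dropna(), b[col].dropna())
        pvals.append(p)
    return np.array(pvals)

def rejection_rate(a, b, sample_size, iteration, alpha=0.05, seed=0):
    """Fraction of random subsamples in which each column's test rejects."""
    sampler = np.random.default_rng(seed)
    rejections = np.zeros(a.shape[1])
    for _ in range(iteration):
        # Draw the same rows from both data sets so columns stay paired.
        rows = sampler.choice(len(a), size=sample_size, replace=False)
        rejections += column_pvalues(a.iloc[rows], b.iloc[rows]) < alpha
    return rejections / iteration

pvals = column_pvalues(df_orig, df_impute)
rates = rejection_rate(df_orig, df_impute, sample_size=50, iteration=20)
```

A column that rejects the null hypothesis in a large share of the iterations is one whose distribution the imputation has probably changed. The SiO2 column is identical before and after imputation here, so it is never rejected.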

Module contents