geochemistrypi.data_mining.data package¶
Submodules¶
geochemistrypi.data_mining.data.data_readiness module¶
- basic_info(data: DataFrame) None[source]¶
Show the basic information of the data set.
- Parameters:
data (pd.DataFrame) – The data set to be shown.
- bool_input(prefix: str | None = None) bool[source]¶
Get a boolean answer from the user.
- Parameters:
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
- Returns:
A boolean value.
- Return type:
bool
- create_sub_data_set(data: DataFrame, allow_empty_columns: bool = False) DataFrame[source]¶
Create a sub data set.
- Parameters:
data (pd.DataFrame) – The data set to be processed.
allow_empty_columns (bool, optional) – Whether to include empty columns in the sub data set. The default is False.
- Returns:
The sub data set.
- Return type:
pd.DataFrame
- data_split(X: DataFrame, y: DataFrame | Series, names: DataFrame, test_size: float = 0.2) Dict[source]¶
Split arrays or matrices into random train and test subsets.
- Parameters:
X (pd.DataFrame) – The data to be split.
y (pd.DataFrame or pd.Series) – The target variable to be split.
names (pd.DataFrame) – The names of the data.
test_size (float, default=0.2) – Represents the proportion of the dataset to include in the test split.
- Returns:
A dictionary containing the split data.
- Return type:
dict
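The split described above can be sketched with NumPy alone. This is a minimal illustration and not the library's implementation: `split_rows`, the `seed` argument, and the dictionary key names are assumptions, and the `names` argument is left out for brevity.

```python
import math

import numpy as np


def split_rows(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle the row indices, then cut off the test share."""
    n_samples = len(X)
    n_test = math.ceil(test_size * n_samples)  # number of rows in the test subset
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)         # random row order
    test_idx, train_idx = order[:n_test], order[n_test:]
    # The key names below are illustrative, not the library's.
    return {
        "X_train": X[train_idx],
        "X_test": X[test_idx],
        "y_train": y[train_idx],
        "y_test": y[test_idx],
    }


X = np.arange(20).reshape(10, 2)
y = np.arange(10)
parts = split_rows(X, y, test_size=0.2)
print(parts["X_train"].shape, parts["X_test"].shape)
```

With 10 rows and test_size=0.2, two rows go to the test subset and eight stay in the training subset.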
- float_input(default: float, prefix: str | None = None, slogan: str | None = '@Number: ') float[source]¶
Get a floating-point number from the user.
- Parameters:
default (float) – The value assigned to option if the user enters nothing.
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
slogan (str, default="@Number: ") – Acts like the first argument of Python's input function: the prompt printed before the user types.
- Returns:
option – An option number.
- Return type:
float or int
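The default handling described above can be sketched as a pure function. `parse_float` is a hypothetical name; the real float_input reads from the console and may validate its input differently.

```python
def parse_float(raw: str, default: float) -> float:
    """Return `default` for empty input, otherwise the parsed number."""
    raw = raw.strip()
    return default if raw == "" else float(raw)


# An empty answer falls back to the default; anything else is parsed.
fallback = parse_float("", default=0.2)
typed = parse_float(" 0.35 ", default=0.2)
print(fallback, typed)
```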
- int_input(column: int, prefix: str | None = None, slogan: str | None = '@Number: ') int[source]¶
Get an integer from the user.
- Parameters:
column (int) – The value assigned to option if the user enters nothing.
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
slogan (str, default="@Number: ") – Acts like the first argument of Python's input function: the prompt printed before the user types.
- Returns:
option – An option number.
- Return type:
int
- limit_num_input(option_list: List[str], prefix: str, input_func: num_input) int[source]¶
Limit the scope of the option.
- Parameters:
option_list (List[str]) – All the options provided are stored in a list.
prefix (str) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
input_func (function) – The input function used to read the option number, e.g. num_input.
- Returns:
option – An option number. Note that option = real index + 1.
- Return type:
int
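The "option = real index + 1" convention can be sketched as follows. Both helpers and the menu layout are hypothetical and only illustrate the 1-based numbering; the sample option strings are made up.

```python
from typing import List


def render_options(items: List[str]) -> List[str]:
    """Number the options from 1, as a console menu would show them."""
    return [f"{number} - {item}" for number, item in enumerate(items, start=1)]


def option_to_index(option: int, option_list: List[str]) -> int:
    """Validate a 1-based option number and convert it to a 0-based list index."""
    if not 1 <= option <= len(option_list):
        raise ValueError(f"option must be between 1 and {len(option_list)}")
    return option - 1


methods = ["Mean Value", "Median Value", "Most Frequent Value"]
menu = render_options(methods)
chosen = methods[option_to_index(2, methods)]
print("\n".join(menu))
print(chosen)
```

Keeping the range check separate from the code that reads input mirrors the split between limit_num_input and the input_func it receives.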
- np2pd(array: ndarray, columns_name: List[str]) DataFrame[source]¶
Convert numpy array to pandas dataframe.
- Parameters:
array (np.ndarray) – The numpy array to be converted.
columns_name (List[str]) – The column names of the dataframe.
- Returns:
The converted dataframe.
- Return type:
pd.DataFrame
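In plain pandas the conversion looks like the sketch below. The column names are illustrative, and the real np2pd may do more than this single call.

```python
import numpy as np
import pandas as pd

array = np.array([[1.0, 2.0], [3.0, 4.0]])
columns_name = ["SiO2", "Al2O3"]  # illustrative column names

# Wrap the array in a DataFrame and label its columns.
df = pd.DataFrame(array, columns=columns_name)
print(df)
```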
- num2option(items: List[str]) None[source]¶
List all the options serially.
- Parameters:
items (list) – A series of items to be enumerated.
- num_input(prefix: str | None = None, slogan: str | None = '@Number: ') int[source]¶
Get the number of the desired option.
- Parameters:
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
slogan (str, default="@Number: ") – Acts like the first argument of Python's input function: the prompt printed before the user types.
- Returns:
option – An option number. Note that option = real index + 1.
- Return type:
int
- read_data(file_path: str | None = None, is_own_data: int = 2, prefix: str | None = None, slogan: str | None = '@File: ')[source]¶
Read the data set.
- Parameters:
file_path (str, optional) – The path of the data set, by default None
is_own_data (int, default=2) – 1: own data set; 2: built-in data set
prefix (str, optional) – The prefix of the data set, by default None
slogan (str, optional) – The slogan of the data set, by default "@File: "
- Returns:
The data set read
- Return type:
pd.DataFrame
- select_column_name(data: DataFrame) str[source]¶
Select a single column from the dataframe and return its name.
- Parameters:
data (pd.DataFrame) – The data set from which a column name is selected.
- select_columns(columns_range: str | None = None) List[int][source]¶
Select the columns of the data set.
- Parameters:
columns_range (str, default=None) – The columns range of the data set.
- Returns:
The columns selected.
- Return type:
list
- show_data_columns(columns_name: Index, columns_index: List | None = None) None[source]¶
Show the column names of the data set.
- Parameters:
columns_name (pd.Index) – The column names of the data set.
columns_index (list, default=None) – The column index of the data set.
- show_excel_columns(excel_list: List | None = None) None[source]¶
Displays the index and name of each column in the provided Excel list.
- Parameters:
excel_list (Optional[List], optional) – A list containing the names of Excel columns. Defaults to None.
- Return type:
None
- str_input(option_list: List[str], prefix: str | None = None) str[source]¶
Get the string of the desired option.
- Parameters:
option_list (list) – All the options provided are stored in a list.
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
- Returns:
option – A string of the desired option.
- Return type:
str
- tuple_input(default: Tuple[int], prefix: str | None = None, slogan: str | None = None) Tuple[int][source]¶
Get the tuple of the desired option.
- Parameters:
default (Tuple[int]) – If the user does not enter anything, it is assigned to option.
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
slogan (str, default=None) – Acts like the first argument of Python's input function: the prompt printed before the user types.
- Returns:
option – A numeric tuple.
- Return type:
tuple
geochemistrypi.data_mining.data.feature_engineering module¶
- class FeatureConstructor(data: DataFrame, name_all: str)[source]¶
Bases: object
Construct new features based on the existing data set.
- append_feature(new_feature_column: Series) None[source]¶
Append the new feature to the original data.
- cal_words = ['pow', 'sin', 'cos', 'tan', 'pi', 'mean', 'std', 'var', 'log']¶
- index2name() None[source]¶
Show the index of columns in the data set. The display pattern is [letter : column name], e.g. a : 1st column name; b : 2nd column name.
- oper = '+-*/^(),.'¶
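The [letter : column name] display pattern of index2name can be sketched as below. `letters_for_columns` is a hypothetical helper and the column names are illustrative. Presumably the letters let a new feature be written as an expression over columns using the characters in oper and the names in cal_words, but the source does not spell that out.

```python
from string import ascii_lowercase
from typing import Dict, List


def letters_for_columns(columns: List[str]) -> Dict[str, str]:
    """Pair each column with a letter: a -> 1st column, b -> 2nd column, ..."""
    if len(columns) > len(ascii_lowercase):
        raise ValueError("this sketch only covers up to 26 columns")
    return dict(zip(ascii_lowercase, columns))


mapping = letters_for_columns(["SiO2", "TiO2", "Al2O3"])
for letter, name in mapping.items():
    print(f"{letter} : {name}")
```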
geochemistrypi.data_mining.data.imputation module¶
- imputer(data: DataFrame, method: str) tuple[dict, ndarray][source]¶
Apply imputation on missing values.
- Parameters:
data (pd.DataFrame) – The dataset with missing values.
method (str) – The imputation method.
- Returns:
imputation_config (dict) – The imputation configuration.
data_imputed (np.ndarray) – The dataset after imputing.
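The source does not list the accepted method values. As one plausible case, a mean imputation that returns a (config, array) pair in the same shape as the documented return value might look like this sketch. `impute_mean` and the config layout are assumptions.

```python
import numpy as np


def impute_mean(data: np.ndarray):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(data, axis=0)      # per-column mean, ignoring NaN
    data_imputed = np.where(np.isnan(data), col_means, data)
    imputation_config = {"method": "mean"}    # illustrative config layout
    return imputation_config, data_imputed


data = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
config, filled = impute_mean(data)
print(filled)
```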
geochemistrypi.data_mining.data.inference module¶
- class PipelineConstrutor[source]¶
Bases: object
Construct a sklearn pipeline from a dictionary of transformers.
- chain(transformer_config: Dict) object[source]¶
Chain transformers together into a sklearn pipeline.
- Parameters:
transformer_config (Dict) – A dictionary of transformers and their parameters.
- Returns:
A sklearn pipeline.
- Return type:
object
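One way to chain transformers from a dictionary is sketched below with scikit-learn. The layout of transformer_config (class name mapped to keyword arguments) and the `TRANSFORMERS` registry are assumptions, since the source does not show them.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical registry; the real transformer_dict may hold different entries.
TRANSFORMERS = {"SimpleImputer": SimpleImputer, "StandardScaler": StandardScaler}


def chain(transformer_config: dict):
    """Instantiate each named transformer and join them in dictionary order."""
    steps = [TRANSFORMERS[name](**params) for name, params in transformer_config.items()]
    return make_pipeline(*steps)


pipeline = chain({"SimpleImputer": {"strategy": "mean"}, "StandardScaler": {}})
X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])
X_out = pipeline.fit_transform(X)
print(X_out)
```

A registry of this kind has to grow whenever a new transformer becomes available, which matches the note on transformer_dict.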
- property transformer_dict: Dict¶
A dictionary of transformers. It needs to be updated whenever a new transformer is added to the customized automated ML pipeline.
- build_transform_pipeline(imputation_config: Dict, feature_scaling_config: Dict, feature_selection_config: Dict, run: object, X_train: DataFrame, y_train: DataFrame) Tuple[Dict, object][source]¶
Build the transform pipeline.
- Parameters:
imputation_config (Dict) – The imputation configuration.
feature_scaling_config (Dict) – The feature scaling configuration.
feature_selection_config (Dict) – The feature selection configuration.
run (object) – The model selection object.
X_train (pd.DataFrame) – The training data.
y_train (pd.DataFrame) – The training target data.
- Returns:
The transform pipeline configuration and the transform pipeline object.
- Return type:
Tuple[Dict, object]
- model_inference(inference_data: DataFrame, inference_name_column: str, is_inference: bool, run: object, transformer_config: Dict, transform_pipeline: object | None = None)[source]¶
Run the model inference.
- Parameters:
inference_data (pd.DataFrame) – The inference data.
inference_name_column (str) – The name of inference_data
is_inference (bool) – Whether to run the model inference.
run (object) – The model selection object.
transformer_config (Dict) – The transformer configuration.
transform_pipeline (Optional[object], optional) – The transform pipeline object. The default is None.
geochemistrypi.data_mining.data.preprocessing module¶
- class MeanNormalScaler(copy: bool = True)[source]¶
Bases: BaseEstimator, TransformerMixin
Custom Scikit-learn transformer for mean normalization.
MeanNormalization involves subtracting the mean of each feature from the feature values and then dividing by the range (maximum value minus minimum value) of that feature.
The transformation is given by:
X_scaled = (X - X.mean()) / (X.max() - X.min())
- fit(X: DataFrame, y: DataFrame | None = None) object[source]¶
Compute the mean and range (max - min) for each feature.
- Parameters:
X (pd.DataFrame) – The input dataframe where each column represents a feature.
y (pd.DataFrame, optional (default: None)) – Ignored.
- Returns:
self – Fitted transformer.
- Return type:
object
- inverse_transform(X: DataFrame) ndarray[source]¶
Reverse the mean normalization transformation.
- Parameters:
X (pd.DataFrame) – The input dataframe where each column represents a feature.
- Returns:
X_tr – The original data.
- Return type:
np.ndarray
- transform(X: DataFrame, y: DataFrame | None = None, copy: bool | None = None) ndarray[source]¶
Apply mean normalization to the data.
- Parameters:
X (pd.DataFrame) – The input dataframe where each column represents a feature.
y (pd.DataFrame, optional (default: None)) – Ignored.
copy (bool, optional (default: None)) – Copy the input X or not.
- Returns:
X_tr – The normalized data.
- Return type:
np.ndarray
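The formula X_scaled = (X - X.mean()) / (X.max() - X.min()) and its inverse can be checked with a few lines of NumPy. This is a sketch of the arithmetic only, not of the class itself, and it does not guard against a constant feature, whose range is zero.

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

mean = X.mean(axis=0)                        # per-feature mean
data_range = X.max(axis=0) - X.min(axis=0)   # per-feature range (max - min)

X_scaled = (X - mean) / data_range           # forward transform
X_back = X_scaled * data_range + mean        # inverse transform

print(X_scaled)
```

After scaling, every feature has zero mean and a range of exactly 1, and the inverse recovers the original values.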
- feature_scaler(X: DataFrame, method: List[str], method_idx: int) tuple[dict, ndarray][source]¶
Apply feature scaling methods.
- Parameters:
X (pd.DataFrame) – The dataset.
method (List[str]) – The feature scaling methods.
method_idx (int) – The index of methods.
- Returns:
feature_scaling_config (dict) – The feature scaling configuration.
X_scaled (np.ndarray) – The dataset after scaling.
- feature_selector(X: DataFrame, y: DataFrame, feature_selection_task: int, method: List[str], method_idx: int) tuple[dict, DataFrame][source]¶
Apply feature selection methods.
- Parameters:
X (pd.DataFrame) – The feature dataset.
y (pd.DataFrame) – The label dataset.
feature_selection_task (int) – Feature selection for regression or classification tasks.
method (List[str]) – The feature selection methods.
method_idx (int) – The index of methods.
- Returns:
feature_selection_config (dict) – The feature selection configuration.
X_selected (pd.DataFrame) – The feature dataset after selecting.
geochemistrypi.data_mining.data.statistic module¶
- monte_carlo_simulator(df_orig: DataFrame, df_impute: DataFrame, sample_size: int, iteration: int, test: str, confidence: float = 0.05) None[source]¶
Check which columns reject the null hypothesis (p-value < significance level), to find whether the imputation changes the distribution of the original data set.
- Parameters:
df_orig (pd.DataFrame (n_samples, n_components)) – The original dataset with missing values.
df_impute (pd.DataFrame (n_samples, n_components)) – The dataset after imputation.
test (str) – The statistics test method used.
sample_size (int) – The size of the sample for each iteration.
iteration (int) – The number of iterations of Monte Carlo simulation.
confidence (float) – The significance level that the p-values are compared against; the default is 0.05.
- test_once(df_orig: DataFrame, df_impute: DataFrame, test: str) ndarray[source]¶
Run a non-parametric hypothesis test once on each pair of corresponding columns. Null hypothesis: the distributions of the data set before and after imputing remain the same.
- Parameters:
df_orig (pd.DataFrame (n_samples, n_components)) – The original dataset with missing values.
df_impute (pd.DataFrame (n_samples, n_components)) – The dataset after imputation.
test (str) – The statistics test method used.
- Returns:
pvals – A numpy array containing the p-values of the tests on each column, in column order.
- Return type:
np.ndarray
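The idea behind the two functions above can be sketched as a small Monte Carlo loop: draw samples before and after imputation, test each column, and count how often p < significance level. The source does not list the accepted test values, so scipy.stats.ks_2samp is used here only as a stand-in non-parametric test. `rejection_rates`, its sampling scheme, and the synthetic data are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp


def rejection_rates(orig, imputed, sample_size, iteration, alpha=0.05, seed=0):
    """Per column, the share of iterations in which the test rejects H0."""
    rng = np.random.default_rng(seed)
    n_cols = orig.shape[1]
    rejections = np.zeros(n_cols)
    for _ in range(iteration):
        for j in range(n_cols):
            observed = orig[:, j][~np.isnan(orig[:, j])]  # drop missing values
            a = rng.choice(observed, size=sample_size, replace=True)
            b = rng.choice(imputed[:, j], size=sample_size, replace=True)
            if ks_2samp(a, b).pvalue < alpha:
                rejections[j] += 1
    return rejections / iteration


rng = np.random.default_rng(1)
orig = rng.normal(size=(200, 2))
orig[::2, 1] = np.nan          # half of column 1 is missing
imputed = orig.copy()
imputed[::2, 1] = 1000.0       # deliberately bad fill value
rates = rejection_rates(orig, imputed, sample_size=50, iteration=20)
print(rates)
```

A column whose rejection rate is high is one where the imputation has visibly changed the distribution, as with the deliberately bad fill value in the second column here.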