Add New Model To Framework¶
1. Framework - Design Pattern and Hierarchical Pipeline Architecture¶
Geochemistry π refers to the software design pattern “Abstract Factory”, serving as the foundational framework upon which our advanced automated ML capabilities are built.
The framework is a four-layer hierarchical pipeline architecture that promotes the creation of workflow objects through a set of model selection interfaces. The critical layers of this architecture are as follows:
Layer 1: the realization of ML model-associated functionalities with specific dependencies or libraries.
Layer 2: the abstract components of the ML model workflow class include regression, classification, clustering, and decomposition.
Layer 3: the scikit-learn API-style model selection interface implements the creation of ML model workflow objects.
Layer 4: the customized automated ML pipeline operated at the command line or through a web interface with a complete data-mining process.
This pattern-driven architecture offers developers a standardized and intuitive way to create an ML model workflow class in Layer 2 by using a unified and consistent approach to object creation in Layer 3. Furthermore, it ensures the interchangeability of different model applications, allowing for seamless transitions between methodologies in Layer 1.
The code of each layer lies as shown above.
Notice: in our framework, a model workflow class refers to an algorithm workflow class and a mode includes multiple model workflow classes.
Now, we will take the KMeans algorithm as an example to illustrate the connection between each layer. Don’t get too hung up on this part. Once you finish reading the whole article, you can come back here again.
After reading this article, we also recommend this publication for more details on the whole scope of our framework:
https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2023GC011324
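As a rough orientation, the four layers can be mimicked in a few lines of plain Python. Everything below is a hypothetical stand-in: the function mean_center, the classes ToyWorkflowBase, ToyCentering and ToyModelSelection, and the model name "Toy Centering" are not part of Geochemistry π. The sketch only illustrates how a selection interface (Layer 3) creates a workflow object (Layer 2), which in turn delegates to a concrete function (Layer 1).

```python
from typing import List


# Layer 1 (stand-in): a concrete functionality implemented with a specific library or plain code.
def mean_center(values: List[float]) -> List[float]:
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Layer 2 (stand-in): an abstract workflow component and one concrete model workflow class.
class ToyWorkflowBase:
    name = None

    def fit(self, values: List[float]) -> List[float]:
        raise NotImplementedError


class ToyCentering(ToyWorkflowBase):
    name = "Toy Centering"

    def fit(self, values: List[float]) -> List[float]:
        return mean_center(values)


# Layer 3 (stand-in): a model selection interface that creates the workflow object by name.
class ToyModelSelection:
    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def activate(self, values: List[float]) -> List[float]:
        if self.model_name == ToyCentering.name:
            workflow = ToyCentering()
        else:
            raise ValueError(f"Unknown model name: {self.model_name}")
        return workflow.fit(values)


# Layer 4 (stand-in): the pipeline only talks to the selection interface.
result = ToyModelSelection("Toy Centering").activate([1.0, 2.0, 3.0])
print(result)
```

Because the pipeline only knows the model name, swapping the implementation in Layer 1 does not require any change in Layer 4.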
2. Understand Machine Learning Algorithm¶
You need to understand the general meaning of the machine learning algorithm you are responsible for. Then you encapsulate it as an algorithm workflow in our framework and put it under the directory geochemistrypi/data_mining/model. You also need to determine which mode this algorithm belongs to and the role of each parameter. For example, the linear regression algorithm belongs to the regression mode in our framework.
When learning the ML algorithm, you can refer to the relevant knowledge on the scikit-learn official website.
3. Construct Model Workflow Class¶
Note: You can refer to any existing model workflow class in our framework to implement your own model workflow class.
3.1 Add Basic Elements¶
3.1.1 Find File¶
First, you need to construct the algorithm workflow class in the corresponding model file. The corresponding model file is located under the path geochemistrypi/data_mining/model.
E.g., if you want to add a model for the regression mode, you need to add it in the regression.py file.
3.1.2 Define Class Attributes and Constructor¶
Define the algorithm workflow class and its base class
class ModelWorkflowClassName(BaseModelWorkflowClassName):
You can refer to the class names of other models. The format (capitalized words with the corresponding mode as the suffix) needs to be consistent, e.g., XGBoostRegression.
The base class needs to be chosen according to the mode the model belongs to.
    """The automation workflow of using "ModelWorkflowClassName" algorithm to make insightful products."""
For the class docstring, you can refer to other classes. The template is shown above.
Define the class attribute name
name = "algorithm terminology"
The class attribute name is different from ModelWorkflowClassName. E.g., the name XGBoost in the XGBoostRegression model workflow class.
This name needs to be added to the corresponding constant variable in the geochemistrypi/data_mining/constants.py file and to the corresponding mode processing file under the geochemistrypi/data_mining/process folder. Note that these name values should be identical. This is explained further in a later section.
For example, the name value XGBoost should be included in the constant variable REGRESSION_MODELS in the geochemistrypi/data_mining/constants.py file, and it will be used in geochemistrypi/data_mining/process/regress.py.
Define the class attribute common_function or special_function
If this model workflow class is a base class, you need to define the class attribute common_function. For example, the class attribute common_function in the base workflow class RegressionWorkflowBase.
The values of common_function are the descriptions of the functionalities of the models belonging to the same mode. This means the child classes (all regression models) share the same common functionalities as well.
common_function = []
If this model workflow class is a specific model workflow class, you need to define the class attribute special_function. For example, the class attribute special_function in the model workflow class XGBoostRegression.
The values of special_function are the descriptions of the functionalities owned by that specific model. Those special functions cannot be reused by other models.
special_function = []
More details will be explained in a later section.
Define the signature of the constructor
def __init__(
    self,
    parameter: type = Default parameter value,
) -> None:
The parameters in the constructor come from the algorithm library you depend on. For example, if you use the Lasso algorithm from the Scikit-learn library, you can refer to its introduction (Lasso) on the Scikit-learn website.
The default parameter values need to follow the official scikit-learn documentation as well.
    """
    Parameters
    ----------
    parameter : type, default=Default
    References
    ----------
    Scikit-learn API: sklearn.model.name
    https://scikit-learn.org/......
    """
The parameter docstrings can be found in the source code of the corresponding algorithm on the official sklearn website.
Call the constructor of the base class
super().__init__()
Initialize the instance’s state by assigning the parameter values passed to the constructor to the instance’s attributes.
self.parameter = parameter
Instantiate the algorithm class you depend on and assign it to self.model. For example, Lasso from the library sklearn.linear_model.
self.model = modelname(
    parameter=self.parameter
)
Note: Don’t forget to import the class from the scikit-learn library.
Define the instance attribute naming
self.naming = Class.name
This one will be used to print the name of the class and to activate the AutoML functionality. E.g., self.naming = LassoRegression.name. Further explanation is in section 3.2.
Define the instance attributes customized and customized_name
self.customized = True
self.customized_name = "Algorithm Name"
These will be used to leverage the customization of the AutoML functionality. E.g., self.customized_name = "Lasso". Further explanation is in section 3.3.
Define other instance attributes
self.attributes=...
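Putting the pieces of this section together, a skeleton could look like the sketch below. It is self-contained for illustration only: FakeWorkflowBase and FakeLasso are hypothetical stand-ins for the real base class (e.g., RegressionWorkflowBase) and for the estimator you would import from scikit-learn, and the name value "Lasso Regression" is an assumption. The defaults alpha=1.0 and max_iter=1000 follow scikit-learn's Lasso, but you should check them against the documentation of the version you use.

```python
from typing import List


class FakeWorkflowBase:
    """Hypothetical stand-in for a base workflow class of one mode."""

    common_function: List[str] = []

    def __init__(self) -> None:
        self.model = None
        self.naming = None


class FakeLasso:
    """Hypothetical stand-in for the estimator class of the library you depend on."""

    def __init__(self, alpha: float = 1.0, max_iter: int = 1000) -> None:
        self.alpha = alpha
        self.max_iter = max_iter


class LassoRegression(FakeWorkflowBase):
    """The automation workflow of using "Lasso" algorithm to make insightful products."""

    name = "Lasso Regression"  # assumed value, must match the constants file
    special_function: List[str] = []

    def __init__(self, alpha: float = 1.0, max_iter: int = 1000) -> None:
        super().__init__()
        # Keep the constructor arguments as instance attributes.
        self.alpha = alpha
        self.max_iter = max_iter
        # Instantiate the underlying algorithm class with those values.
        self.model = FakeLasso(alpha=self.alpha, max_iter=self.max_iter)
        # Attributes used by the framework for naming and AutoML customization.
        self.naming = LassoRegression.name
        self.customized = True
        self.customized_name = "Lasso"


workflow = LassoRegression(alpha=0.5)
print(workflow.naming, workflow.model.alpha, workflow.model.max_iter)
```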
3.2 Add Manual Hyperparameter Tuning Functionality¶
Our framework lets the user set the algorithm hyperparameters manually or automatically. In this part, we implement the manual functionality.
Sometimes users want to input the hyperparameter values for model training manually, so you need to establish an interactive way to get the user’s input.
3.2.1 Define manual_hyper_parameters Method¶
The manual operation is controlled by the manual_hyper_parameters method. Inside this method, we encapsulate a lower level application function called algorithm_manual_hyper_parameters().
@classmethod
def manual_hyper_parameters(cls) -> Dict:
    """Manual hyper-parameters specification."""
    print(f"-*-*- {cls.name} - Hyper-parameters Specification -*-*-")
    hyper_parameters = algorithm_manual_hyper_parameters()
    clear_output()
    return hyper_parameters
The manual_hyper_parameters method is called in the corresponding mode operation file under the geochemistrypi/data_mining/process folder.
This lower level application function is located in the geochemistrypi/data_mining/model/func/specific_mode folder and limits the hyperparameters the user can set manually. E.g., if the model workflow class LassoRegression belongs to the regression mode, you need to add the _lasso_regression.py file under the folder geochemistrypi/data_mining/model/func/algo_regression. Here, _lasso_regression.py contains all encapsulated application functions specific to the lasso algorithm.
3.2.2 Create _algorithm.py File¶
Create an _algorithm.py file
Note: Keep the name format consistent.
Import module
from typing import Dict
from rich import print
from ....constants import SECTION
In general, these modules need to be imported
from ....data.data_readiness import bool_input, float_input, num_input
You need to choose the appropriate common utility functions according to the input type of the hyperparameter.
Define the application function
def algorithm_manual_hyper_parameters() -> Dict:
Interactive format
    print("Hyperparameters: Explanation")
    print("A good starting value ...")
    Hyperparameters = type_input(Default Value, SECTION[2], "@Hyperparameters: ")
Note: You can query ChatGPT for a recommended good starting value. The default value can come from the one in the imported library. For example, check the default value of the specific parameter for the Lasso algorithm on the Scikit-learn website.
Integrate all hyperparameters into a dictionary and return it.
    hyper_parameters = {
        "Hyperparameters1": Hyperparameters1,
        "Hyperparameters2": Hyperparameters2,
    }
    return hyper_parameters
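To make the template above more concrete, here is a minimal, self-contained sketch for a Lasso-like model. The SECTION list and the float_input helper are hypothetical stand-ins written for this example (the real ones live in constants.py and data_readiness, and their exact behaviour may differ); the extra reader argument exists only so the sketch can run without a keyboard. The defaults 1.0 and 0.0001 are those of scikit-learn's Lasso for alpha and tol.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the SECTION constant imported from constants.py.
SECTION = ["User", "Data", "Model", "Plot"]


def float_input(default: float, prefix: str, slogan: str, reader: Callable[[str], str] = input) -> float:
    """Hypothetical stand-in for the framework's input helper: empty input falls back to the default."""
    raw = reader(f"({prefix}) {slogan}").strip()
    return float(raw) if raw else default


def lasso_manual_hyper_parameters(reader: Callable[[str], str] = input) -> Dict:
    """Ask the user for the hyper-parameters of a Lasso-like model."""
    print("Alpha: Constant that multiplies the L1 penalty term.")
    print("A good starting value could be 1.0, the library default.")
    alpha = float_input(1.0, SECTION[2], "@Alpha: ", reader)
    print("Tolerance: The tolerance used to stop the optimization.")
    print("A good starting value could be 0.0001, the library default.")
    tol = float_input(0.0001, SECTION[2], "@Tol: ", reader)
    hyper_parameters = {
        "alpha": alpha,
        "tol": tol,
    }
    return hyper_parameters


# Simulate a user who types 0.5 for alpha and presses Enter for tol.
answers = iter(["0.5", ""])
print(lasso_manual_hyper_parameters(reader=lambda prompt: next(answers)))
```

The keys of the returned dictionary are the ones the model workflow class constructor will later read, so they should match its parameter names.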
3.2.3 Import in The Model Workflow Class File¶
from .func.algo_mode._algorithm import algorithm_manual_hyper_parameters
3.3 Add Automated Hyperparameter Tuning (AutoML) Functionality¶
3.3.1 Add AutoML Code to Model Workflow Class¶
Currently, only supervised learning modes (regression and classification) support AutoML. Hence, only the algorithms belonging to these two modes need to implement the AutoML functionality.
Our framework leverages FLAML + Ray to build the AutoML functionality. FLAML has already encapsulated some algorithms, so it is easy to operate with those built-in algorithms. However, algorithms without encapsulation need to be customized on our own.
There are three cases in total:
C1: Encapsulated -> FLAML (Good example: XGBoostRegression in regression.py)
C2: Unencapsulated -> FLAML (Good example: SVMRegression in regression.py)
C3: Unencapsulated -> FLAML + RAY (Good example: MLPRegression in regression.py)
Here, we only talk about two cases, C1 and C2. C3 is a special case and is only implemented in the MLP algorithm.
Note: The calling method fit is defined in the base class, hence there is no need to define it again in the specific model workflow class. You can refer to the fit method of RegressionWorkflowBase in regression.py.
The following two steps are needed to implement the AutoML functionality in the model workflow class. C1 only requires the first step, while C2 needs both steps.
(1) Create settings method
@property
def settings(self) -> Dict:
    """The configuration of your model to implement AutoML by FLAML framework."""
    configuration = {
        "time_budget": '...',
        "metric": '...',
        "estimator_list": '...',
        "task": '...',
    }
    return configuration
“time_budget” represents the total running time in seconds
“metric” represents the running metric
“estimator_list” represents the list of ML learners
“task” represents the task type
For C1, the value of “estimator_list” should come from the specified name in the FLAML library. For example, the specified name xgboost in the model workflow class XGBoostRegression. We also need to put this specified value inside a list.
For C2, the value of “estimator_list” should be the instance attribute self.customized_name. For example, self.customized_name = "SVR" in the model workflow class SVMRegression. We also need to put this specified value inside a list.
Note: You can keep the other key-value pairs consistent with other existing model workflow classes.
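For case C1, a filled-in settings property might look like the sketch below. The class ToyXGBoostRegression is a hypothetical stand-in that contains nothing but the property, and the time budget and metric are illustrative values, not the ones used in the codebase. For case C2 you would put self.customized_name inside the list instead of the built-in learner name.

```python
from typing import Dict


class ToyXGBoostRegression:
    """Hypothetical stand-in showing only the settings property of a C1 workflow class."""

    @property
    def settings(self) -> Dict:
        """The configuration handed to FLAML for AutoML."""
        configuration = {
            "time_budget": 10,  # total running time in seconds (illustrative value)
            "metric": "r2",  # running metric (illustrative value)
            "estimator_list": ["xgboost"],  # FLAML's built-in learner name, wrapped in a list
            "task": "regression",  # task type
        }
        return configuration


print(ToyXGBoostRegression().settings)
```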
(2) Create customization method
You can add the parameter tuning code according to the following code:
@property
def customization(self) -> object:
    """The customized 'Your model' of FLAML framework."""
    from flaml import tune
    from flaml.data import 'TYPE'
    from flaml.model import SKLearnEstimator
    from 'sklearn' import 'model_name'
    class 'Model_Name'(SKLearnEstimator):
        def __init__(self, task=type, n_jobs=None, **config):
            super().__init__(task, **config)
            if task in 'TYPE':
                self.estimator_class = 'model_name'
        @classmethod
        def search_space(cls, data_size, task):
            space = {
                "'parameters1'": {"domain": tune.uniform(lower='...', upper='...'), "init_value": '...'},
                "'parameters2'": {"domain": tune.choice([True, False])},
                "'parameters3'": {"domain": tune.randint(lower='...', upper='...'), "init_value": '...'},
            }
            return space
    return 'Model_Name'
Note1: The content in ‘ ‘ needs to be modified according to your specific code. You can refer to the one in the model workflow class SVMRegression.
Note2:
space = {
    "'parameters1'": {"domain": tune.uniform(lower='...', upper='...'), "init_value": '...'},
    "'parameters2'": {"domain": tune.choice([True, False])},
    "'parameters3'": {"domain": tune.randint(lower='...', upper='...'), "init_value": '...'},
}
tune.uniform samples a float from a range
tune.choice samples from a list of options (here, a bool)
tune.randint samples an int from a range
lower represents the minimum value of the range, upper represents the maximum value of the range, and init_value represents the initial value.
Note: You need to select parameters based on the actual situation of the model.
3.4 Add Application Function to Model Workflow Class¶
We treat the insightful outputs (indices, scores) or diagrams that help analyze and understand the algorithm as useful applications. For example, the XGBoost algorithm can produce feature importance scores; hence, drawing a feature importance diagram is an application function we can add to the model workflow class XGBoostRegression.
Conduct research on the corresponding model and look for its useful application functions that need to be added.
You can confirm the functions that need to be added on the official website of the model (such as scikit-learn), search engines (such as Google), ChatGPT, etc.
In our framework, we define two types of application functions: common application functions and special application functions.
A common application function can be shared among the model workflow classes which belong to the same mode. It will be placed inside the base model workflow class. For example, classification_report is a common application function placed inside the base class ClassificationWorkflowBase. Notice that it is encapsulated in the private instance method _classification_report.
Likewise, a special application function is a functionality owned by the algorithm itself; hence it is placed inside a specific model workflow class. For example, for the KMeans algorithm, we can get the inertia scores from it. Hence, inside the model workflow class KMeansClustering, we have a private instance method _get_inertia_scores.
Now, the next question is how to invoke these application functions in our framework.
In fact, we put the invocation of the application function in the component method. Accordingly, we have two types of components:
common_components is a public method in the base class, and all common application functions will be invoked inside it.
special_components is unique to the algorithm, so it needs to be added in a specific model workflow class. All special application functions related to this algorithm will be invoked inside it.
For more details, you can refer to the brief illustration of the framework in section 1.
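The split between the two component methods can be sketched in plain Python as follows. The classes ToyClusteringBase and ToyKMeans and the log attribute are hypothetical and exist only to show which class owns which method; the real methods compute, print and save results instead of appending strings.

```python
from typing import List


class ToyClusteringBase:
    """Hypothetical base class: holds the common application functions of one mode."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def _score(self) -> None:
        # A common application function, shared by every child class.
        self.log.append("common: model score")

    def common_components(self) -> None:
        # Invoke all common application functions one by one.
        self._score()


class ToyKMeans(ToyClusteringBase):
    """Hypothetical child class: adds the functions unique to one algorithm."""

    def _get_inertia_scores(self) -> None:
        # A special application function, owned by this algorithm only.
        self.log.append("special: inertia score")

    def special_components(self) -> None:
        # Invoke all special application functions one by one.
        self._get_inertia_scores()


workflow = ToyKMeans()
workflow.common_components()
workflow.special_components()
print(workflow.log)
```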
3.4.1 Add Common Application Functions and common_components Method¶
common_components will invoke the common application functions used by all its child model workflow classes, so it is necessary to consider the situation of each child model workflow class when adding an application function to it. If you are not sure whether an application function qualifies as a common one, the better way is to put it inside a specific child model workflow class first.
1. Add common application function to the base class
Once you’ve identified the functionality you want to add, you can define the corresponding functions in the base class.
The steps to implement are:
Define the private function name and add the required parameters.
Use annotations to decorate the function.
Add the docstring to explain the use of this functionality.
Reference specific libraries (e.g., Scikit-learn) to implement the functionality.
Change the format of data acquisition and save the produced data or images, etc.
2. Encapsulate the concrete code in Layer 1
Please refer to our framework’s definition of Layer 1 in section 1.
Some functions may require a lot of code due to their complexity. To ensure the style and readability of the codebase, you need to put the specific function implementation into the corresponding geochemistrypi/data_mining/model/func/mode/_common files and call it.
The steps to implement are:
Define the public function name, add the required parameters and a proper decorator.
Add the docstring to explain the use of this functionality, the significance of each parameter and the related reference.
Implement the functionality.
Return the value used in Layer 2.
3. Define common_components Method
The steps to implement are:
Define the path to store the data and images, etc.
Invoke the common application functions one by one.
4. Append the Name of Functionality in Class Attribute common_function
The steps to implement are:
Create a class attribute common_function list in ClusteringWorkflowBase
Create an enum class to include the name of the functionality
Append the value of the enum class into the common_function list
Example
The following is the example of adding model evaluation score to the clustering base class.
First, you need to find the base class of clustering.
1. Add _score function in base class ClusteringWorkflowBase(WorkflowBase)
@staticmethod
def _score(data: pd.DataFrame, labels: pd.DataFrame, func_name: str, algorithm_name: str, store_path: str) -> None:
    """Calculate the score of the model."""
    print(f"-----* {func_name} *-----")
    scores = score(data, labels)
    scores_str = json.dumps(scores, indent=4)
    save_text(scores_str, f"{func_name}- {algorithm_name}", store_path)
    mlflow.log_metrics(scores)
2. Encapsulate the concrete code of score in Layer 1
You need to add the specific function implementation score to the corresponding geochemistrypi/data_mining/model/func/algo_clustering/_common file.
def score(data: pd.DataFrame, labels: pd.DataFrame) -> Dict:
    """Calculate the scores of the clustering model.
    Parameters
    ----------
    data : pd.DataFrame (n_samples, n_components)
        The true values.
    labels : pd.DataFrame (n_samples, n_components)
        Labels of each point.
    Returns
    -------
    scores : dict
        The scores of the clustering model.
    """
    silhouette = silhouette_score(data, labels)
    calinski_harabaz = calinski_harabasz_score(data, labels)
    print("silhouette_score: ", silhouette)
    print("calinski_harabasz_score:", calinski_harabaz)
    scores = {
        "silhouette_score": silhouette,
        "calinski_harabasz_score": calinski_harabaz,
    }
    return scores
3. Define common_components Method in class ClusteringWorkflowBase(WorkflowBase)
def common_components(self) -> None:
    """Invoke all common application functions for clustering algorithms."""
    GEOPI_OUTPUT_METRICS_PATH = os.getenv("GEOPI_OUTPUT_METRICS_PATH")
    GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH")
    self._score(
        data=self.X,
        labels=self.clustering_result["clustering result"],
        func_name=ClusteringCommonFunction.MODEL_SCORE.value,
        algorithm_name=self.naming,
        store_path=GEOPI_OUTPUT_METRICS_PATH,
    )
4. Append the Name of Functionality in Class Attribute common_function
Create a class attribute common_function in ClusteringWorkflowBase.
class ClusteringWorkflowBase(WorkflowBase):
    """The base workflow class of clustering algorithms."""
    common_function = [func.value for func in ClusteringCommonFunction]
The enum class should be put in the corresponding path geochemistrypi/data_mining/model/func/algo_clustering/_enum.py
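A minimal sketch of such an enum class and of how it feeds the class attribute is shown here. The member name MODEL_SCORE appears in the example above, but its value "Model Score", the second member CLUSTER_CENTERS, and the class ToyClusteringWorkflowBase are assumptions made for illustration.

```python
from enum import Enum


class ClusteringCommonFunction(Enum):
    # The member values below are assumed for illustration.
    MODEL_SCORE = "Model Score"
    CLUSTER_CENTERS = "Cluster Centers"


class ToyClusteringWorkflowBase:
    """Hypothetical stand-in for ClusteringWorkflowBase."""

    # Every enum value becomes one entry of the class attribute.
    common_function = [func.value for func in ClusteringCommonFunction]


print(ToyClusteringWorkflowBase.common_function)
```

Keeping the names in one enum means the string passed as func_name and the string listed in common_function cannot drift apart.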
3.4.2 Add Special Application Functions and special_components Method¶
A special application function is a feature that is unique to a specific model. The whole process is similar to that of the previous section for common functionalities.
The process is as follows:
Add the special application function with a proper decorator to the child model workflow class
Encapsulate the concrete code in Layer 1
Define the special_components method
Append the name of the functionality in the class attribute special_function
Example
Each algorithm has its own characteristics. Hence, they have different special functionalities as well. For example, for the KMeans algorithm, we can get the inertia scores from it. Hence, inside the model workflow class KMeansClustering, we have a private instance method _get_inertia_scores.
First, you need to find the child model workflow class for KMeans algorithm.
1. Add _get_inertia_scores function in child model workflow class KMeansClustering(ClusteringWorkflowBase)
@staticmethod
def _get_inertia_scores(func_name: str, algorithm_name: str, trained_model: object, store_path: str) -> None:
    """Get the scores of the clustering result."""
    print(f"-----* {func_name} *-----")
    print(f"{func_name}: ", trained_model.inertia_)
    inertia_scores = {f"{func_name}": trained_model.inertia_}
    mlflow.log_metrics(inertia_scores)
    inertia_scores_str = json.dumps(inertia_scores, indent=4)
    save_text(inertia_scores_str, f"{func_name} - {algorithm_name}", store_path)
2. Encapsulate the concrete code in Layer 1
Getting the inertia score is only one line of code, hence there is no need to encapsulate it further.
3. Define special_components Method in class KMeansClustering(ClusteringWorkflowBase)
def special_components(self, **kwargs: Union[Dict, np.ndarray, int]) -> None:
    """Invoke all special application functions for this algorithm by Scikit-learn framework."""
    GEOPI_OUTPUT_METRICS_PATH = os.getenv("GEOPI_OUTPUT_METRICS_PATH")
    self._get_inertia_scores(
        func_name=KMeansSpecialFunction.INERTIA_SCORE.value,
        algorithm_name=self.naming,
        trained_model=self.model,
        store_path=GEOPI_OUTPUT_METRICS_PATH,
    )
Also, if only some of the models share a functionality, for example, feature importance in tree-based algorithms including XGBoost, Decision Tree, etc., you can create a Mixin class to include that application function and let the tree-based model workflow classes inherit it, such as
ExtraTreesRegression(TreeWorkflowMixin, RegressionWorkflowBase)
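The mixin idea can be sketched without any ML library. All class names below (ToyTreeWorkflowMixin, ToyRegressionBase, ToyExtraTrees, ToyLinear) and the helper _rank_feature_importance are hypothetical; the point is only that the shared function becomes available to the classes that inherit the mixin and to no others.

```python
from typing import Dict, List


class ToyTreeWorkflowMixin:
    """Hypothetical mixin: one application function shared by tree-based workflows only."""

    @staticmethod
    def _rank_feature_importance(columns: List[str], importances: List[float]) -> Dict[str, float]:
        # Pair each feature with its importance and sort from most to least important.
        pairs = sorted(zip(columns, importances), key=lambda item: item[1], reverse=True)
        return dict(pairs)


class ToyRegressionBase:
    """Hypothetical stand-in for the regression base workflow class."""

    name = None


class ToyExtraTrees(ToyTreeWorkflowMixin, ToyRegressionBase):
    name = "Toy Extra-Trees"


class ToyLinear(ToyRegressionBase):
    name = "Toy Linear"


ranking = ToyExtraTrees._rank_feature_importance(["SiO2", "MgO"], [0.3, 0.7])
print(ranking)
print(hasattr(ToyLinear, "_rank_feature_importance"))
```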
4. Append the Name of Functionality in Class Attribute special_function
Create a class attribute special_function list in KMeansClustering.
class KMeansClustering(ClusteringWorkflowBase):
    """The automation workflow of using KMeans algorithm to make insightful products."""
    name = "KMeans"
    special_function = [func.value for func in KMeansSpecialFunction]
The enum class should be put in the corresponding path geochemistrypi/data_mining/model/func/algo_clustering/_enum.py
3.4.3 Add @dispatch() to Component Method¶
However, in regression mode and classification mode, there are two different scenarios (AutoML and manual ML) when defining either the common_components or the special_components method, and we need to differentiate them. For example, inside the base model workflow class RegressionWorkflowBase, there are two common_components methods with different decorators. Likewise, in its child model workflow class ExtraTreesRegression, there are two special_components methods with different decorators.
Inside our framework, we leverage the idea of method overloading, which Python does not support natively but which we can achieve through the library multipledispatch. The invocation of the common_components and special_components methods is located in Layer 3, which will be explained in a later section.
The differences between AutoML and manual ML are as follows:
1. The decorator
For manual ML: add @dispatch() to decorate the component method
For AutoML: add @dispatch(bool) to decorate the component method
2. The signature of the component method
For the common_components method:
Manual ML:
@dispatch()
def common_components(self) -> None:
AutoML:
@dispatch(bool)
def common_components(self, is_automl: bool = False) -> None:
For the special_components method:
Manual ML:
@dispatch()
def special_components(self, **kwargs) -> None:
AutoML:
@dispatch(bool)
def special_components(self, is_automl: bool = False, **kwargs) -> None:
3. The trained model instance variable
Usually, inside the component method, we pass the trained model instance variable to the application function. For example, for common_components in RegressionWorkflowBase(WorkflowBase), be careful about the value passed to the parameter trained_model.
Manual ML:
@dispatch()
def common_components(self) -> None:
    self._cross_validation(
        trained_model=self.model,
        X_train=RegressionWorkflowBase.X_train,
        y_train=RegressionWorkflowBase.y_train,
        cv_num=10,
        algorithm_name=self.naming,
        store_path=GEOPI_OUTPUT_METRICS_PATH,
    )
AutoML:
@dispatch(bool)
def common_components(self, is_automl: bool = False) -> None:
    self._cross_validation(
        trained_model=self.auto_model,
        X_train=RegressionWorkflowBase.X_train,
        y_train=RegressionWorkflowBase.y_train,
        cv_num=10,
        algorithm_name=self.naming,
        store_path=GEOPI_OUTPUT_METRICS_PATH,
    )
Note: The content of this part needs to be selected according to the actual situation of your own model. You can refer to similar classes.
3.5 Storage Mechanism¶
In Geochemistry π, the storage mechanism consists of two components: the geopi_tracking folder and the geopi_output folder. MLflow uses the geopi_tracking folder as the store for visualized operation in the web interface, which researchers cannot modify directly. The geopi_output folder is a regular folder aligning with MLflow’s storage structure, which researchers can operate. Overall, this unique storage mechanism is purpose-built to track each experiment and its corresponding runs in order to create an organized and coherent record of researchers’ scientific explorations.
In the codebase, we use Python’s open() function to store data into the geopi_output folder and MLflow’s methods to store data into the geopi_tracking folder.
The common MLflow methods include:
mlflow.log_param(): Log a parameter (e.g. model hyperparameter) under the current run.
mlflow.log_params(): Log a batch of params for the current run.
mlflow.log_metric(): Log a metric under the current run.
mlflow.log_metrics(): Log multiple metrics for the current run.
mlflow.log_artifact(): Log a local file or directory as an artifact of the currently active run. In our software, we use it to store the images, data and text.
You can refer to the API documentation of MLflow for more details.
Actually, we have encapsulated a set of saving functions in geochemistrypi/data_mining/utils/base.py, which can be used to store the data into the geopi_output folder and the geopi_tracking folder at the same time. It includes the functions save_fig, save_data, save_text and save_model.
Usually, when you want to use the saving functions, you only need to pass them the storage path and the data to store.
For example, consider the case of adding a common application function to the base clustering model workflow class.
GEOPI_OUTPUT_METRICS_PATH = os.getenv("GEOPI_OUTPUT_METRICS_PATH")
This line of code gets the metrics output path from the environment variable.
GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH")
This line of code gets the image model output path from the environment variable.
GEOPI_OUTPUT_ARTIFACTS_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_PATH")
This line of code takes the general output artifact path from the environment variable.
Note: You need to choose the corresponding path according to the usage in the following functions. You can look up the pre-defined paths created inside the function create_geopi_output_dir in geochemistrypi/data_mining/utils/base.py.
Note: You can refer to other similar model workflow classes to complete your implementation.
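The path-from-environment pattern can be tried in isolation with the sketch below. The helper save_text_locally is a hypothetical, simplified stand-in: unlike the framework's saving functions it only writes a local file and does not log anything to MLflow. The temporary folder, the file name and the score value are made up for the example.

```python
import json
import os
import tempfile


def save_text_locally(data_str: str, name: str, store_path: str) -> str:
    """Hypothetical, simplified stand-in for a saving helper: write text to <store_path>/<name>.txt."""
    file_path = os.path.join(store_path, f"{name}.txt")
    with open(file_path, "w") as f:
        f.write(data_str)
    return file_path


# For this sketch only, point the environment variable at a temporary folder.
os.environ["GEOPI_OUTPUT_METRICS_PATH"] = tempfile.mkdtemp()
GEOPI_OUTPUT_METRICS_PATH = os.getenv("GEOPI_OUTPUT_METRICS_PATH")

scores = {"silhouette_score": 0.5}  # made-up value for illustration
saved_path = save_text_locally(json.dumps(scores, indent=4), "Model Score - KMeans", GEOPI_OUTPUT_METRICS_PATH)
with open(saved_path) as f:
    print(f.read())
```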
4. Instantiate Model Workflow Class¶
4.1 Find File¶
Instantiating a model workflow class is the responsibility of Layer 3. Layer 3 is represented by the scikit-learn API-style model selection interface in the corresponding mode file under the geochemistrypi/data_mining/process folder.
E.g., if your model workflow class belongs to the regression mode, you need to implement the creation of ML model workflow objects in the regress.py file.
4.2 Import Module¶
For example, for a model workflow class belonging to regression, you need to add your model inside the regress.py file by using from ..model.regression import ().
from ..model.regression import (
    ...
    ModelWorkflowClass,
)
4.3 Define activate Method¶
The activate method defined in Layer 3 will be invoked in Layer 4.
For supervised learning (regression and classification), the signature of activate method is:
def activate(
    self,
    X: pd.DataFrame,
    y: pd.DataFrame,
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.DataFrame,
    y_test: pd.DataFrame,
) -> None:
    """Train by Scikit-learn framework."""
For unsupervised learning (clustering, decomposition and anomaly detection), the signature of activate method is:
def activate(
    self,
    X: pd.DataFrame,
    y: Optional[pd.DataFrame] = None,
    X_train: Optional[pd.DataFrame] = None,
    X_test: Optional[pd.DataFrame] = None,
    y_train: Optional[pd.DataFrame] = None,
    y_test: Optional[pd.DataFrame] = None,
) -> None:
    """Train by Scikit-learn framework."""
The difference is that for unsupervised learning, there is no need to separate y or to split the data into training and testing sets. But for consistency, we keep the parameters there.
In regression mode and classification mode, there are two different scenarios (AutoML and manual ML) when defining the activate method, and we need to differentiate them. Hence, we still use @dispatch to decorate it. For example, in the RegressionModelSelection class, we need to define two activate methods with different decorators.
Manual ML:
@dispatch(object, object, object, object, object, object)
def activate(
    self,
    X: pd.DataFrame,
    y: pd.DataFrame,
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.DataFrame,
    y_test: pd.DataFrame,
) -> None:
AutoML:
@dispatch(object, object, object, object, object, object, bool)
def activate(
    self,
    X: pd.DataFrame,
    y: pd.DataFrame,
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.DataFrame,
    y_test: pd.DataFrame,
    is_automl: bool,
) -> None:
The differences above include the signature of @dispatch and the signature of the activate method.
4.4 Create Model Workflow Object¶
There are two activate methods defined in the regression and classification modes: the first method uses the Scikit-learn framework, and the second method uses the FLAML and Ray frameworks. Decomposition and clustering algorithms only use the Scikit-learn framework. The instantiation of the model workflow class inside the activate method builds the connection between Layer 3 and Layer 2.
(1) The invocation of the model workflow class in the first activate method (used in classification, regression, decomposition, clustering, anomaly detection) needs to pass the hyperparameters for manual ML:
elif self.model_name == "ModelName":
    hyper_parameters = ModelWorkflowClass.manual_hyper_parameters()
    self.dcp_workflow = ModelWorkflowClass(
        Hyperparameters1=hyper_parameters["Hyperparameters1"],
        Hyperparameters2=hyper_parameters["Hyperparameters2"],
        ...
    )
This “ModelName” needs to be added to the corresponding constant variable in the geochemistrypi/data_mining/constants.py file. This is explained further in a later section.
(2) The invocation of the model workflow class in the second activate method (used in classification, regression) for AutoML:
elif self.model_name == "ModelName":
    self.reg_workflow = ModelWorkflowClass()
This “ModelName” needs to be added to the corresponding constant variable in the geochemistrypi/data_mining/constants.py file. This is explained further in a later section.
4.5 Invoke Other Methods in Scikit-learn API Style¶
It should contain at least the functions below:
data_upload(): Load the required data into the base class’s attributes.
show_info(): Display what application functions the algorithm will provide.
fit(): Fit the model.
save_hyper_parameters(): Save the model hyper-parameters into the storage.
common_components(): Invoke all common application functions.
special_components(): Invoke all special application functions.
model_save(): Save the trained model.
You can refer to other existing modes inside geochemistrypi/data_mining/process/mode.py to see what else you need.
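The call sequence listed above can be pictured with the following sketch. RecordingWorkflow is a hypothetical object that accepts any method call and records its name, and the activate function here is a simplified illustration: in the real code the methods take arguments (data, hyperparameters, paths) and the exact order may differ between modes.

```python
from typing import List


class RecordingWorkflow:
    """Hypothetical workflow object that only records which methods were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def __getattr__(self, method_name: str):
        # Any unknown method call is accepted and remembered by name.
        def method(*args, **kwargs) -> None:
            self.calls.append(method_name)

        return method


def activate(workflow: RecordingWorkflow) -> None:
    """Hypothetical body of an activate method, following the order listed above."""
    workflow.data_upload()
    workflow.show_info()
    workflow.fit()
    workflow.save_hyper_parameters()
    workflow.common_components()
    workflow.special_components()
    workflow.model_save()


workflow = RecordingWorkflow()
activate(workflow)
print(workflow.calls)
```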
4.6 Add model_name to MODE_MODELS or NON_AUTOML_MODELS¶
Find the constants.py file under the geochemistrypi/data_mining folder and add the model name, which should be identical to that in geochemistrypi/data_mining/process/mode.py and in geochemistrypi/data_mining/model/mode.py.
(1) Add model_name to MODE_MODELS
Append model_name to the MODE_MODELS list corresponding to the mode in the constants file.
E.g., add the name of the Lasso regression algorithm to the REGRESSION_MODELS list.
(2) Add model_name to NON_AUTOML_MODELS
This only applies to algorithms that belong to either regression or classification and don’t need to provide AutoML functionality. For them, append model_name to the NON_AUTOML_MODELS list.
E.g., add the name of the Linear Regression algorithm to the NON_AUTOML_MODELS list.
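The role of the two lists can be sketched as below. The list contents and the helper supports_automl are assumptions made for illustration; the real constants.py contains many more names, and the framework may check the lists in a different way.

```python
# Hypothetical excerpt of constants.py; the real lists contain many more names.
REGRESSION_MODELS = ["Linear Regression", "XGBoost"]
NON_AUTOML_MODELS = ["Linear Regression"]

# Register a new model: its name must match the name attribute of the workflow class.
REGRESSION_MODELS.append("Lasso Regression")


def supports_automl(model_name: str) -> bool:
    """Illustrative check: a model offers AutoML unless it is listed in NON_AUTOML_MODELS."""
    return model_name not in NON_AUTOML_MODELS


for model_name in REGRESSION_MODELS:
    print(model_name, supports_automl(model_name))
```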
5. Test Model Workflow Class¶
After the model workflow class is added, you can test it by running the command python start_cli_pipeline.py in the terminal.
If you can successfully run the pipeline, there are three aspects to verify the correctness of your modification:
Check whether the output info in the console is what you expect.
Check whether the artifacts (e.g., dataset, images) produced are saved properly in the geopi_output folder and whether the content of the artifacts is what you expect. You can find where the geopi_output folder is via the path in the console.
Check whether the same artifacts (e.g., dataset, images) produced are saved properly in MLflow. You can use the command mlflow ui --backend-store-uri file:/path/to/geopi_tracking --port PORT_NUMBER to launch the web interface supported by MLflow. Copy the link http://127.0.0.1:PORT_NUMBER to the browser. Click the corresponding experiment and run you created and check the artifacts accordingly.
For more details on how to use MLflow, you can watch the video as below:
MLflow UI user guide - Geochemistry π v0.5.0 [Bilibili] | [YouTube]
If you fail to run the pipeline, you need to debug and fix it. A recommended way is breakpoint debugging. In VSCode, you need to open the file start_cli_pipeline.py and click the button VSCode provides.
You can search for the benefits of breakpoint debugging. There are two major benefits:
Look up the value of a variable in the stack frame in memory directly.
Create a temporary watch (code to debug) to evaluate in the current stack frame.
After fixing the problem, don’t forget to verify the produced artifacts in the three aspects above.
6. Completed Pull Request¶
After the test passes, you can complete the pull request according to the instructions in the document Geochemistry π - Completed Pull Request.
7. Precautions¶
Note1: This tutorial only discusses the general process of adding a model. The specific additions need to be combined with the actual situation of the model to add the relevant code accurately.
Note2: If anything is unclear or problems arise during the adding process, communicate with other people in time to solve them.