Add New Model To Framework

1. Understand the model

You need to understand the general purpose of the model, determine which algorithm family the model belongs to, and understand the role of each of its parameters.

2. Add Model

2.1 Add The Model Class

2.1.1 Find file

First, you need to define the model class in the corresponding algorithm file. The corresponding algorithm file is in the model folder in the data_mining folder in the geochemistrypi folder.

image1

e.g., If you want to add a model to the regression algorithm, you need to add it in the regression.py file.

2.1.2 Define class properties, the constructor, etc.

  1. Define the class and the required Base class

class NAME(BaseClass):
  • NAME is the class name of the algorithm; refer to the NAME of other models and keep the format consistent.

  • Base class needs to be selected according to the actual algorithm requirements.

"""The automation workflow of using "Name" to make insightful products."""
  • This is the class docstring; you can refer to other classes for the wording.

  2. Define the name and the special_function

name = "name"
  • Define name; note that this is different from NAME (the class name).

  • This name needs to be added to the *constants.py* file and to the corresponding algorithm file in the process folder. Note that the names must be consistent.

    special_function = []
    
  • special_function is filled in according to the specific situation of the model; you can refer to other similar models.

  3. Define the constructor

def __init__(
    self,
    parameter: type = default_value,
) -> None:
  • All parameters in the corresponding model function need to be written out.

  • Default parameter values need to be set according to the official documentation.

 """
Parameters
----------
parameter:type,default = Dedault

References
----------
Scikit-learn API: sklearn.model.name
https://scikit-learn.org/......
  • The Parameters section comes from the source of the corresponding model on the official scikit-learn website.

e.g., Take the Lasso algorithm as an example.

image2 image3
  • The References section is your model’s official documentation link.

  4. Call the constructor of the base class

super().__init__()
  5. Initialize the instance’s state by assigning the parameter values passed to the constructor to the instance’s attributes.

self.parameter = parameter
  6. Create the model and assign it

self.model = modelname(
    parameter=self.parameter,
)

Note: Don’t forget to import the model from scikit-learn.

image4
  7. Define other class properties

self.properties = ...
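Putting the steps above together, here is a minimal sketch of what such a class might look like, using Lasso as the running example. The base class name RegressionWorkflowBase is an assumption for illustration; use the base class that actually exists in your algorithm file, and adapt the parameters to your own model.

from sklearn.linear_model import Lasso

class LassoRegression(RegressionWorkflowBase):  # assumed base class name
    """The automation workflow of using Lasso to make insightful products."""

    name = "Lasso Regression"  # must match constants.py and the process file
    special_function = []  # fill in according to your model

    def __init__(
        self,
        alpha: float = 1.0,
        fit_intercept: bool = True,
    ) -> None:
        """
        Parameters
        ----------
        alpha : float, default=1.0
            Constant that multiplies the L1 term.

        fit_intercept : bool, default=True
            Whether to calculate the intercept for this model.

        References
        ----------
        Scikit-learn API: sklearn.linear_model.Lasso
        https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
        """
        super().__init__()
        self.alpha = alpha
        self.fit_intercept = fit_intercept
        self.model = Lasso(alpha=self.alpha, fit_intercept=self.fit_intercept)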

2.1.3 Define manual_hyper_parameters

manual_hyper_parameters gets the hyperparameter value by calling the manual hyperparameter function, and returns hyper_parameters.

hyper_parameters = name_manual_hyper_parameters()
  • This function calls the corresponding function in the func folder (which needs to be written; see 2.3.2) to get the hyperparameter values.

  • This function is called in the corresponding file of the process folder (which needs to be written; see 2.4).

  • It can be written with reference to similar classes; a sketch follows below.
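As a hedged sketch modeled on similar classes (the exact print banner is illustrative), the method could look like:

@classmethod
def manual_hyper_parameters(cls) -> Dict:
    """Manual hyper-parameters specification."""
    print(f"-*-*- {cls.name} - Hyper-parameters Specification -*-*-")
    # name_manual_hyper_parameters is the interactive function written in 2.3.2.
    hyper_parameters = name_manual_hyper_parameters()
    return hyper_parameters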

2.1.4 Define special_components

Its purpose is to invoke all special application functions for this algorithm via the Scikit-learn framework. Note: The content of this part needs to be selected according to the actual situation of your own model; you can refer to similar classes.

GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH")
  • This line of code gets the image model output path from the environment variable.

    GEOPI_OUTPUT_ARTIFACTS_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_PATH")
    
  • This line of code takes the general output artifact path from the environment variable. Note: Choose which paths to fetch according to how they are used in the functions below; a minimal sketch follows.
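A minimal sketch, assuming one hypothetical special function _plot_formula that saves an image to the model output path (the @dispatch decorator is the multi-dispatch decorator discussed in 2.2 below):

@dispatch()
def special_components(self, **kwargs) -> None:
    """Invoke all special application functions for this algorithm by Scikit-learn framework."""
    GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH")
    # _plot_formula is hypothetical; replace it with your model's own special functions.
    self._plot_formula(
        algorithm_name=self.naming,
        store_path=GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH,
    )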

2.2 Add AutoML

2.2.1 Add AutoML code to class

(1) Set AutoML-related parameters

@property
def settings(self) -> Dict:
    """The configuration of your model to implement AutoML by FLAML framework."""
    configuration = {
        "time_budget": '...',
        "metric": '...',
        "estimator_list": '...',
        "task": '...',
    }
    return configuration
  • “time_budget” represents the total running time in seconds.

  • “metric” represents the running metric.

  • “estimator_list” represents the list of ML learners.

  • “task” represents the task type.

Note: You can keep these parameters consistent with similar models, or modify them to make the AutoML results better. An illustrative filled-in example follows.
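For instance, a filled-in version for a regression model might look like the following; the concrete values are illustrative, and the estimator name must match the custom learner registered in customization (see (2) below):

@property
def settings(self) -> Dict:
    """The configuration of Lasso to implement AutoML by FLAML framework."""
    configuration = {
        "time_budget": 10,  # total running time in seconds (illustrative)
        "metric": "r2",  # running metric for a regression task
        "estimator_list": ["lasso_regression"],  # assumed learner name; must match customization
        "task": "regression",  # task type
    }
    return configuration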

(2) Add the parameters that need AutoML tuning. You can add the parameter tuning code according to the following template:

@property
def customization(self) -> object:
    """The customized 'Your model' of FLAML framework."""
    from flaml import tune
    from flaml.data import 'TYPE'
    from flaml.model import SKLearnEstimator
    from sklearn.ensemble import 'model_name'

    class 'Model_Name'(SKLearnEstimator):
        def __init__(self, task='type', n_jobs=None, **config):
            super().__init__(task, **config)
            if task in 'TYPE':
                self.estimator_class = 'model_name'

        @classmethod
        def search_space(cls, data_size, task):
            space = {
                "'parameters1'": {"domain": tune.uniform(lower='...', upper='...'), "init_value": '...'},
                "'parameters2'": {"domain": tune.choice([True, False])},
                "'parameters3'": {"domain": tune.randint(lower='...', upper='...'), "init_value": '...'},
            }
            return space

    return 'Model_Name'  # return the class itself, not a string

Note 1: The content in ‘ ’ needs to be modified according to your specific code. Note 2:

space = {
    "'parameters1'": {"domain": tune.uniform(lower='...', upper='...'), "init_value": '...'},
    "'parameters2'": {"domain": tune.choice([True, False])},
    "'parameters3'": {"domain": tune.randint(lower='...', upper='...'), "init_value": '...'},
}
  • tune.uniform samples a float parameter.

  • tune.choice samples from a discrete set (e.g., True/False for a bool parameter).

  • tune.randint samples an int parameter.

  • lower represents the minimum value of the range, upper represents the maximum value of the range, and init_value represents the initial value.

Note: You need to select the parameters based on the actual situation of the model. A filled-in example follows.
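Following the template above, a filled-in sketch for Lasso might read as below. The REGRESSION constant stands in for the 'TYPE' placeholder and should be checked against the FLAML version used by the project; the search ranges are illustrative, not tuned recommendations.

@property
def customization(self) -> object:
    """The customized Lasso of FLAML framework (illustrative sketch)."""
    from flaml import tune
    from flaml.data import REGRESSION  # assumed task-type constant; see the 'TYPE' placeholder above
    from flaml.model import SKLearnEstimator
    from sklearn.linear_model import Lasso

    class MyLassoRegression(SKLearnEstimator):
        def __init__(self, task="regression", n_jobs=None, **config):
            super().__init__(task, **config)
            if task in REGRESSION:
                self.estimator_class = Lasso

        @classmethod
        def search_space(cls, data_size, task):
            # Illustrative ranges; take them from the model's official documentation.
            space = {
                "alpha": {"domain": tune.uniform(lower=0.001, upper=10.0), "init_value": 1.0},
                "fit_intercept": {"domain": tune.choice([True, False])},
                "max_iter": {"domain": tune.randint(lower=100, upper=10000), "init_value": 1000},
            }
            return space

    return MyLassoRegression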

(3) Define special_components (FLAML). This part is the same as 2.1.4 as a whole and can be written with reference to it, but the following three points need to be noted:

a. The multi-dispatch decorator is different:

Scikit-learn framework: @dispatch()
FLAML framework: @dispatch(bool)

b. ‘is_automl: bool’ is added to the function signature, e.g.:

Scikit-learn framework:
def special_components(self, **kwargs) -> None:

FLAML framework:
def special_components(self, is_automl: bool, **kwargs) -> None:

c. The fitted model attribute has a different name, e.g.:

Scikit-learn framework:
coefficient=self.model.coefficient

FLAML framework:
coefficient=self.auto_model.coefficient

Note: You can refer to other similar codes to complete your code.
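Putting points a–c together, a hedged sketch of the FLAML variant (reusing the hypothetical _plot_formula function from the 2.1.4 sketch):

@dispatch(bool)
def special_components(self, is_automl: bool, **kwargs) -> None:
    """Invoke all special application functions for this algorithm by FLAML framework."""
    GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH")
    # Read attributes from self.auto_model (the tuned model) instead of self.model.
    self._plot_formula(
        coefficient=self.auto_model.coef_,  # hypothetical usage mirroring the example above
        algorithm_name=self.naming,
        store_path=GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH,
    )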

2.3 Get the hyperparameter value through interactive methods

Sometimes the user wants to modify the hyperparameter values used for model training, so you need to build an interaction to capture the user’s changes.

2.3.1 Find file

You need to find the corresponding folder for the model. The corresponding algorithm files are in the func folder in the model folder in the data_mining folder in the geochemistrypi folder.

image5

e.g., If your model belongs to regression, you need to add the corresponding .py file in the algo_regression folder.

2.3.2 Create the .py file and add content

(1) Create a .py file. Note: Keep the name format consistent with the existing files.

(2) Import modules

from typing import Dict
from rich import print
from ....constants import SECTION
  • In general, these modules need to be imported

    from ....data.data_readiness import bool_input, float_input, num_input
    
  • Choose the appropriate input helpers to import according to the types of the hyperparameters used in the interaction.

(3) Define the function

def name_manual_hyper_parameters() -> Dict:

Note: The name needs to be consistent with that in 2.1.3.

(4) Interactive format

print("Hyperparameters: Role")
print("Recommended value")
Hyperparameters = type_input(Recommended value, SECTION[2], "@Hyperparameters: ")

Note: The recommended value needs to be the default value of the corresponding package.

(5) Integrate all hyperparameters into a dictionary and return it (a consolidated sketch follows below).

hyper_parameters = {
    "Hyperparameters1": Hyperparameters1,
    "Hyperparameters2": Hyperparameters2,
}
return hyper_parameters
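Putting steps (2)–(5) together, here is a hedged sketch of a complete interaction file for Lasso. The exact signatures of float_input and bool_input are assumptions mirroring the template above and should be checked against data_readiness.py:

from typing import Dict

from rich import print

from ....constants import SECTION
from ....data.data_readiness import bool_input, float_input


def lasso_manual_hyper_parameters() -> Dict:
    """Manually set the hyperparameters of Lasso (illustrative sketch)."""
    print("Alpha: the constant that multiplies the L1 regularization term.")
    print("A good starting value could be 1.0 (default).")
    alpha = float_input(1.0, SECTION[2], "@Alpha: ")
    print("Fit Intercept: whether to calculate the intercept for this model.")
    print("It is generally recommended to keep it as True (default).")
    fit_intercept = bool_input(True, SECTION[2], "@Fit Intercept: ")  # assumed signature
    hyper_parameters = {
        "alpha": alpha,
        "fit_intercept": fit_intercept,
    }
    return hyper_parameters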

2.3.3 Import in the file that defines the model class

from .func.algo_regression.Name import name_manual_hyper_parameters

e.g.:

image6

2.4 Call Model

2.4.1 Find file

Call the model in the corresponding file in the process folder. The corresponding algorithm file is in the process folder in the data_mining folder in the geochemistrypi folder.

image7

e.g., If your model belongs to regression, you need to call it in the regress.py file.

2.4.2 Import module

You need to add your model to the from ..model.regression import (...) statement.

from ..model.regression import (
    ...
    NAME,
)

Note: NAME needs to be the same as the NAME used when defining the class in step 2.1.2. e.g.:

image8

2.4.3 Call model

There are two activate methods defined for the Regression and Classification algorithms: the first uses the Scikit-learn framework, and the second uses the FLAML and RAY frameworks. The Decomposition and Clustering algorithms only use the Scikit-learn framework. Therefore, Regression and Classification need related code in both methods, while Clustering and Decomposition only need it once.

(1) Call the model in the first activate method (Classification, Regression, Decomposition, Clustering)

elif self.model_name == "name":
    hyper_parameters = NAME.manual_hyper_parameters()
    self.dcp_workflow = NAME(
        Hyperparameters1=hyper_parameters["Hyperparameters1"],
        Hyperparameters2=hyper_parameters["Hyperparameters2"],
        ...
    )
  • The "name" string needs to be the same as the model name added to the algorithm list in 2.5 (i.e., the name defined in 2.1.2).

  • The hyperparameters passed to NAME() are the hyperparameters obtained interactively in 2.3. e.g.:

    image9

(2) Call the model in the second activate method (Classification, Regression)

elif self.model_name == "name":
    self.reg_workflow = NAME()
  • The "name" string needs to be the same as the model name added to the algorithm list in 2.5. e.g.:

    image10

2.5 Add to the algorithm list and set NON_AUTOML_MODELS

2.5.1 Find file

Find the constants file to add the model name. The constants file is in the data_mining folder in the geochemistrypi folder.

image11

(1) Add the model name. Add the model name to the algorithm list corresponding to the model in the constants file. e.g., Add the name of the Lasso regression algorithm.

image12

(2) Set NON_AUTOML_MODELS. If your model does not support automatic parameter tuning (as in this tutorial), you need to add the model name to NON_AUTOML_MODELS. e.g.:

image13
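An illustrative sketch of the two edits in constants.py (the actual list names must match those already present in the file):

# constants.py (illustrative; list names must match the existing file)
REGRESSION_MODELS = [
    # ... existing model names ...
    "Lasso Regression",  # must be identical to the `name` defined in 2.1.2
]

NON_AUTOML_MODELS = [
    # ... existing model names ...
    "Lasso Regression",  # only add here if the model does not support AutoML
]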

2.6 Add Functionality

2.6.1 Model Research

Conduct research on the corresponding model and confirm the functions that need to be added.

  • You can confirm the functions that need to be added via the official documentation of the model (such as scikit-learn), search engines (such as Google), ChatGPT, etc.

  1. A Common_component is a shared function: all models of an algorithm family can use it, so it needs to be added to the parent class, which every submodel inherits.

  2. A Special_component is unique to one model, so it needs to be added to that specific model class; only that model can use it.

Image1

2.6.2 Add Common_component

Common_component functions can be used by all internal submodels, so it is necessary to consider the situation of each submodel when adding them.

**1. Add corresponding functionality to the parent class**

Once you’ve identified the features you want to add, you can define the corresponding functions in the parent class.

The code format is:

  1. Define the function name and add the required parameters.

  2. Use docstrings and comments to describe what the function does.

  3. Reference concrete helper functions to implement the functionality.

  4. Convert the acquired data into the proper format and save the data or images.

**2. Define Common_component**

  1. Define common_components in the parent class; its role is to set where the output is saved.

  2. Set the parameter source for the added function.

**3. Implement the concrete functionality**

Some functions may require a large amount of code due to their complexity. To ensure the style and readability of the code, put the specific function implementation into the corresponding _common file and call it from the class.

It includes:

  1. Explain the significance of each parameter.

  2. Implement functionality.

  3. Return the required parameters.

**e.g.:** You want to add model evaluation to your clustering.

First, you need to find the parent class for clustering.

Image2 Image3

**1. Add the clustering score function in the class ClusteringWorkflowBase(WorkflowBase).**

@staticmethod
def _score(data: pd.DataFrame, labels: pd.DataFrame, algorithm_name: str, store_path: str) -> None:
    """Calculate the score of the model."""
    print("-----* Model Score *-----")
    scores = score(data, labels)
    scores_str = json.dumps(scores, indent=4)
    save_text(scores_str, f"Model Score - {algorithm_name}", store_path)
    mlflow.log_metrics(scores)
  1. Define the function name and add the required parameters.

  2. Use docstrings and comments to describe what the function does.

  3. Reference concrete helper functions to implement the functionality (see step 3 below).

  4. Convert the acquired data into the proper format and save the data or images.

**Note:** Make sure that the code style of the added function is consistent.

**2. Define common_components below the added function to define the output position and parameter source for the added function.**

def common_components(self) -> None:
    """Invoke all common application functions for clustering algorithms."""
    GEOPI_OUTPUT_METRICS_PATH = os.getenv("GEOPI_OUTPUT_METRICS_PATH")
    GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH")
    self._score(
        data=self.X,
        labels=self.clustering_result["clustering result"],
        algorithm_name=self.naming,
        store_path=GEOPI_OUTPUT_METRICS_PATH,
    )

The positional relationship is shown in Figure 4.

Image4

**3. You need to add the specific function implementation to the corresponding ``_common`` file.**

Image5
from typing import Dict

import pandas as pd
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def score(data: pd.DataFrame, labels: pd.DataFrame) -> Dict:
    """Calculate the scores of the clustering model.

    Parameters
    ----------
    data : pd.DataFrame (n_samples, n_components)
        The true values.

    labels : pd.DataFrame (n_samples, n_components)
        Labels of each point.

    Returns
    -------
    scores : dict
        The scores of the clustering model.
    """
    silhouette = silhouette_score(data, labels)
    calinski_harabaz = calinski_harabasz_score(data, labels)
    print("silhouette_score: ", silhouette)
    print("calinski_harabasz_score:", calinski_harabaz)
    scores = {
        "silhouette_score": silhouette,
        "calinski_harabasz_score": calinski_harabaz,
    }
    return scores
  1. Explain the significance of each parameter.

  2. Implement functionality.

  3. Return the required parameters.

2.6.3 Add Special_component

Special_components are features unique to each specific model.

The process of adding a Special_component is similar to that of a Common_component.

The process is as follows:

  1. Find the location where the function needs to be added.

  2. Define the function.

  3. Define special_components and add the function call with its parameters to it.

  4. Add the concrete function implementation to the corresponding manual parameter tuning file.

**e.g.:** An example is to add a score evaluation function to KMeans clustering.

**1. Find the location that needs to be added.**

We add KMeans’s own unique score, the inertia, to the KMeans class.

Image2 Image6

**2. Define the function.**

def _get_inertia_scores(self, algorithm_name: str, store_path: str) -> None:
    """Get the scores of the clustering result."""
    print("-----* KMeans Inertia Scores *-----")
    print("Inertia Score: ", self.model.inertia_)
    inertia_scores = {"Inertia Score": self.model.inertia_}
    mlflow.log_metrics(inertia_scores)
    inertia_scores_str = json.dumps(inertia_scores, indent=4)
    save_text(inertia_scores_str, f"KMeans Inertia Scores - {algorithm_name}", store_path)
  1. Define the function name and add the required parameters.

  2. Use docstrings and comments to describe what the function does.

  3. Reference concrete helper functions to implement the functionality.

  4. Convert the acquired data into the proper format and save the data or images.

**3. Define special_components and add the function call with its parameters to it.**

def special_components(self, **kwargs: Union[Dict, np.ndarray, int]) -> None:
    """Invoke all special application functions for this algorithm by Scikit-learn framework."""
    GEOPI_OUTPUT_METRICS_PATH = os.getenv("GEOPI_OUTPUT_METRICS_PATH")
    self._get_inertia_scores(
        algorithm_name=self.naming,
        store_path=GEOPI_OUTPUT_METRICS_PATH,
    )

The positional relationship is shown in Figure 7.

Image7

**4. Add the concrete function implementation to the corresponding ``manual parameter tuning`` file.**

If the defined function is complex, its full content needs to be implemented in the manual parameter file, and the code format should refer to Common_component.

Image

3. Test model

After the model is added, it can be tested. If the test reports an error, the code needs to be checked and fixed. If there is no error, it can be submitted.

4. Complete the Pull Request

After the model passes testing, you can complete the pull request according to the pull request instructions in the Geochemistry π documentation.

image

5. Precautions

Note 1: This tutorial only discusses the general process of adding a model; the specific addition needs to be combined with the actual situation of the model to add the relevant code accurately.

Note 2: If anything is unclear or problems arise during the adding process, communicate with others in time to solve them.