Decomposition¶

T-distributed Stochastic Neighbor Embedding （T-SNE）¶

Table of Contents¶

Table of Contents
1. t-distributed Stochastic Neighbor Embedding (T-SNE)
2. Preparation
3. NAN value process
4. Feature engineering
5. Model Selection
6. T-SNE

1. T-distributed Stochastic Neighbor Embedding (T-SNE)¶

t-distributed Stochastic Neighbor Embedding is usually known as T-SNE. T-SNE is an unsupervised learning method, in which the training data we feed to the algorithm does not need the desired labels. T-SNE is a machine learning algorithm used for dimensionality reduction and visualization of high-dimensional data. It represents the similarity between data points in the high-dimensional space using a Gaussian distribution, creating a probability distribution by measuring this similarity. In the low-dimensional space, T-SNE reconstructs this similarity distribution using the t-distribution. T-SNE aims to preserve the local relationships between data points, ensuring that similar points in the high-dimensional space remain similar in the low-dimensional space.

Note: This part would show the whole process of T-SNE, including data-processing and model-running.

2. Preparation¶

First, after ensuring the Geochemistry Pi framework has been installed successfully (if not, please see docs ), we run the python framework in command line interface to process our program: If you do not input own data, you can run:

geochemistrypi data-mining

If you prepare to input own data, you can run:

geochemistrypi data-mining --data your_own_data_set.xlsx

The command line interface would show:

-*-*- Built-in Training Data Option-*-*-
1 - Data For Regression
2 - Data For Classification
3 - Data For Clustering
4 - Data For Dimensional Reduction
(User) ➜ @Number: 4

You have to choose Data For Dimensional Reduction and press 4 . The command line interface would show:

Successfully loading the built-in training data set 'Data_Decomposition.xlsx'.
--------------------
Index - Column Name
1 - CITATION
2 - SAMPLE NAME
3 - Label
4 - Notes
5 - LATITUDE
6 - LONGITUDE
    ...
45 - PB(PPM)
46 - TH(PPM)
47 - U(PPM)
--------------------
(Press Enter key to move forward.)

Here, we just need to press any keyboard to continue.

-*-*- World Map Projection -*-*-
World Map Projection for A Specific Element Option:
1 - Yes
2 - No
(Plot) ➜ @Number:

We can choose map projection if we need a world map projection for a specific element option. Choose yes, we can choose an element to map. Choose no, skip to the next mode. More information of the map projection can be seen in map projection. In this tutorial, we skip it and gain output as:

-*-*- Data Selection -*-*-
--------------------
Index - Column Name
1 - CITATION
2 - SAMPLE NAME
3 - Label
4 - Notes
5 - LATITUDE
6 - LONGITUDE
    ...
45 - PB(PPM)
46 - TH(PPM)
47 - U(PPM)
--------------------
Select the data range you want to process.
Input format:
Format 1: "[**, **]; **; [**, **]", such as "[1, 3]; 7; [10, 13]" --> you want to deal with the columns 1, 2, 3, 7, 10, 11, 12, 13
Format 2: "xx", such as "7" --> you want to deal with the columns 7

Two options are offered. For T-SNE, the Format 1 method is more useful in multiple dimensional reduction. As a tutorial, we input [10, 15] as an example. Note: [start_col_num, end_col_num]

The selected feature information would be given:

--------------------
Index - Column Name
10 - AL2O3(WT%)
11 - CR2O3(WT%)
12 - FEOT(WT%)
13 - CAO(WT%)
14 - MGO(WT%)
15 - MNO(WT%)
--------------------

The Selected Data Set:
     AL2O3(WT%)  CR2O3(WT%)  FEOT(WT%)   CAO(WT%)   MGO(WT%)  MNO(WT%)
    3.936000       1.440   3.097000  18.546000  18.478000  0.083000
    3.040000       0.578   3.200000  20.235000  17.277000  0.150000
    7.016561         NaN   3.172049  20.092611  15.261175  0.102185
    3.110977         NaN   2.413834  22.083843  17.349203  0.078300
    6.971044         NaN   2.995074  20.530008  15.562149  0.096700
..          ...         ...        ...        ...        ...       ...
  2.740000       0.060   4.520000  23.530000  14.960000  0.060000
  5.700000       0.690   2.750000  20.120000  16.470000  0.120000
  0.230000       2.910   2.520000  19.700000  18.000000  0.130000
  2.580000       0.750   2.300000  22.100000  16.690000  0.050000
  6.490000       0.800   2.620000  20.560000  14.600000  0.070000

[109 rows x 6 columns]

After continuing with any key, basic information of selected data would be shown:

Basic Statistical Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   AL2O3(WT%)  109 non-null    float64
 1   CR2O3(WT%)  98 non-null     float64
 2   FEOT(WT%)   109 non-null    float64
 3   CAO(WT%)    109 non-null    float64
 4   MGO(WT%)    109 non-null    float64
 5   MNO(WT%)    109 non-null    float64
dtypes: float64(6)
memory usage: 5.2 KB
None
Some basic statistic information of the designated data set:
       AL2O3(WT%)  CR2O3(WT%)   FEOT(WT%)    CAO(WT%)    MGO(WT%)    MNO(WT%)
count  109.000000   98.000000  109.000000  109.000000  109.000000  109.000000
mean     4.554212    0.956426    2.962310   21.115756   16.178044    0.092087
std      1.969756    0.553647    1.133967    1.964380    1.432886    0.054002
min      0.230000    0.000000    1.371100   13.170000   12.170000    0.000000
25%      3.110977    0.662500    2.350000   20.310000   15.300000    0.063075
50%      4.720000    0.925000    2.690000   21.223500   15.920000    0.090000
75%      6.233341    1.243656    3.330000   22.185450   16.816000    0.110000
max      8.110000    3.869550    8.145000   25.362000   23.528382    0.400000
Successfully calculate the pair-wise correlation coefficient among the selected columns.
Save figure 'Correlation Plot' in dir.
Successfully store 'Correlation Plot' in 'Correlation Plot.xlsx' in dir.
Successfully draw the distribution plot of the selected columns.
Save figure 'Distribution Histogram' in  dir.
Successfully store 'Distribution Histogram' in 'Distribution Histogram.xlsx' in  dir.
Successfully draw the distribution plot after log transformation of the selected columns.
Save figure 'Distribution Histogram After Log Transformation' in  dir.
Successfully store 'Distribution Histogram After Log Transformation' in 'Distribution Histogram After Log Transformation.xlsx' in dir.
Successfully store 'Data Original' in 'Data Original.xlsx' in  dir.
Successfully store 'Data Selected' in 'Data Selected.xlsx' in  dir.

3. NAN value process¶

Check the NAN values would be helpful for later analysis. In Geochemistry π frame, this option is finished automatically.

-*-*- Imputation -*-*-
Check which column has null values:
--------------------
AL2O3(WT%)    False
CR2O3(WT%)     True
FEOT(WT%)     False
CAO(WT%)      False
MGO(WT%)      False
MNO(WT%)      False
dtype: bool
--------------------
The ratio of the null values in each column:
--------------------
CR2O3(WT%)    0.100917
AL2O3(WT%)    0.000000
FEOT(WT%)     0.000000
CAO(WT%)      0.000000
MGO(WT%)      0.000000
MNO(WT%)      0.000000
dtype: float64
--------------------
Note: you'd better use imputation techniques to deal with the missing values.

Several strategies are offered for processing the missing values, including:

-*-*- Strategy for Missing Values -*-*-
1 - Mean Value
2 - Median Value
3 - Most Frequent Value
4 - Constant(Specified Value)
Which strategy do you want to apply?
(Data) ➜ @Number:1

We choose the mean Value in this example and the input data be processed automatically as:

Successfully fill the missing values with the mean value of each feature column respectively.
(Press Enter key to move forward.)

-*-*- Hypothesis Testing on Imputation Method -*-*-
Null Hypothesis: The distributions of the data set before and after imputing remain the same.
Thoughts: Check which column rejects null hypothesis.
Statistics Test Method: Kruskal Test
Significance Level:  0.05
The number of iterations of Monte Carlo simulation:  100
The size of the sample for each iteration (half of the whole data set):  54
Average p-value:
AL2O3(WT%) 1.0
CR2O3(WT%) 0.9327453056346102
FEOT(WT%) 1.0
CAO(WT%) 1.0
MGO(WT%) 1.0
MNO(WT%) 1.0
Note: 'p-value < 0.05' means imputation method doesn't apply to that column.
The columns which rejects null hypothesis: None
Successfully draw the respective probability plot (origin vs. impute) of the selected columns
Save figure 'Probability Plot' in  dir.
Successfully store 'Probability Plot' in 'Probability Plot.xlsx' in dir.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   AL2O3(WT%)  109 non-null    float64
 1   CR2O3(WT%)  109 non-null    float64
 2   FEOT(WT%)   109 non-null    float64
 3   CAO(WT%)    109 non-null    float64
 4   MGO(WT%)    109 non-null    float64
 5   MNO(WT%)    109 non-null    float64
dtypes: float64(6)
memory usage: 5.2 KB
None
Some basic statistic information of the designated data set:
       AL2O3(WT%)  CR2O3(WT%)   FEOT(WT%)    CAO(WT%)    MGO(WT%)    MNO(WT%)
count  109.000000  109.000000  109.000000  109.000000  109.000000  109.000000
mean     4.554212    0.956426    2.962310   21.115756   16.178044    0.092087
std      1.969756    0.524695    1.133967    1.964380    1.432886    0.054002
min      0.230000    0.000000    1.371100   13.170000   12.170000    0.000000
25%      3.110977    0.680000    2.350000   20.310000   15.300000    0.063075
50%      4.720000    0.956426    2.690000   21.223500   15.920000    0.090000
75%      6.233341    1.170000    3.330000   22.185450   16.816000    0.110000
max      8.110000    3.869550    8.145000   25.362000   23.528382    0.400000
Successfully store 'Data Selected Imputed' in 'Data Selected Imputed.xlsx' in dir.

4. Feature engineering¶

The next step is the feature engineering options.

-*-*- Feature Engineering -*-*-
The Selected Data Set:
--------------------
Index - Column Name
1 - AL2O3(WT%)
2 - CR2O3(WT%)
3 - FEOT(WT%)
4 - CAO(WT%)
5 - MGO(WT%)
6 - MNO(WT%)
--------------------
Feature Engineering Option:
1 - Yes
2 - No
(Data) ➜ @Number: 1

Feature engineering options are essential for data analysis. We choose Yes and naming new features:

Selected data set:
a - AL2O3(WT%)
b - CR2O3(WT%)
c - FEOT(WT%)
d - CAO(WT%)
e - MGO(WT%)
f - MNO(WT%)
Name the constructed feature (column name), like 'NEW-COMPOUND':
@input: new

Considering actual need for constructing several new geochemical indexes. We can set up some new indexes. Here, we would set up a new index by AL2O3/CAO via keyboard options with a/d.

-*-*- Feature Engineering -*-*-
The Selected Data Set:
--------------------
Index - Column Name
1 - AL2O3(WT%)
2 - CR2O3(WT%)
3 - FEOT(WT%)
4 - CAO(WT%)
5 - MGO(WT%)
6 - MNO(WT%)
--------------------
Feature Engineering Option:
1 - Yes
2 - No
(Data) ➜ @Number: 1
Selected data set:
a - AL2O3(WT%)
b - CR2O3(WT%)
c - FEOT(WT%)
d - CAO(WT%)
e - MGO(WT%)
f - MNO(WT%)
Name the constructed feature (column name), like 'NEW-COMPOUND':
@input: new
Build up new feature with the combination of basic arithmatic operators, including '+', '-', '*', '/', '()'.
Input example 1: a * b - c
--> Step 1: Multiply a column with b column;
--> Step 2: Subtract c from the result of Step 1;
Input example 2: (d + 5 * f) / g
--> Step 1: Multiply 5 with f;
--> Step 2: Plus d column with the result of Step 1;
--> Step 3: Divide the result of Step 1 by g;
Input example 3: pow(a, b) + c * d
--> Step 1: Raise the base a to the power of the exponent b;
--> Step 2: Multiply the value of c by the value of d;
--> Step 3: Add the result of Step 1 to the result of Step 2;
Input example 4: log(a)/b - c
--> Step 1: Take the logarithm of the value a;
--> Step 2: Divide the result of Step 1 by the value of b;
--> Step 3: Subtract the value of c from the result of Step 2;
You can use mean(x) to calculate the average value.
@input: a/d

Successfully construct a new feature new.
    0.212229
    0.150235
    0.349211
    0.140871
    0.339554
         ...
  0.116447
  0.283300
  0.011675
  0.116742
  0.315661
Name: new, Length: 109, dtype: float64

Basic information of selected data would be shown:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   AL2O3(WT%)  109 non-null    float64
 1   CR2O3(WT%)  109 non-null    float64
 2   FEOT(WT%)   109 non-null    float64
 3   CAO(WT%)    109 non-null    float64
 4   MGO(WT%)    109 non-null    float64
 5   MNO(WT%)    109 non-null    float64
 6   new         109 non-null    float64
dtypes: float64(7)
memory usage: 6.1 KB
None
Some basic statistic information of the designated data set:
       AL2O3(WT%)  CR2O3(WT%)   FEOT(WT%)    CAO(WT%)    MGO(WT%)    MNO(WT%)         new
count  109.000000  109.000000  109.000000  109.000000  109.000000  109.000000  109.000000
mean     4.554212    0.956426    2.962310   21.115756   16.178044    0.092087    0.219990
std      1.969756    0.524695    1.133967    1.964380    1.432886    0.054002    0.101476
min      0.230000    0.000000    1.371100   13.170000   12.170000    0.000000    0.011675
25%      3.110977    0.680000    2.350000   20.310000   15.300000    0.063075    0.148707
50%      4.720000    0.956426    2.690000   21.223500   15.920000    0.090000    0.218100
75%      6.233341    1.170000    3.330000   22.185450   16.816000    0.110000    0.306383
max      8.110000    3.869550    8.145000   25.362000   23.528382    0.400000    0.407216

Do not continue to establish new features:

Do you want to continue to build a new feature?
1 - Yes
2 - No
(Data) ➜ @Number: 2
Successfully store 'Data Selected Imputed Feature-Engineering' in 'Data Selected Imputed Feature-Engineering.xlsx' in dir.
Exit Feature Engineering Mode.

5. Model Selection¶

Select dimensionality reduction

-*-*- Mode Selection -*-*-
1 - Regression
2 - Classification
3 - Clustering
4 - Dimensional Reduction
(Model) ➜ @Number: 4

Scaling features on set X.In this tutorial, we skip it

-*-*- Feature Scaling on X Set -*-*-
1 - Yes
2 - No
(Data) ➜ @Number: 1

-*-*- Which strategy do you want to apply?-*-*-
1 - Min-max Scaling
2 - Standardization
(Data) ➜ @Number: 2

Data Set After Scaling:
     AL2O3(WT%)  CR2O3(WT%)  FEOT(WT%)  CAO(WT%)  MGO(WT%)  MNO(WT%)
0     -0.315302    0.925885   0.119326 -1.314219  1.612536 -0.169053
1     -0.772282   -0.724562   0.210577 -0.450434  0.770496  1.077372
2      1.255852    0.000000   0.185815 -0.523255 -0.642832  0.187847
3     -0.736082    0.000000  -0.485913  0.495097  0.821118 -0.256489
4      1.232638    0.000000   0.029026 -0.299562 -0.431813  0.085813
..          ...         ...        ...       ...       ...       ...
104   -0.925288   -1.716362   1.380010  1.234688 -0.853990 -0.596931
105    0.584377   -0.510119  -0.188093 -0.509247  0.204695  0.519271
106   -2.205444    3.740453  -0.391857 -0.724043  1.277403  0.705305
107   -1.006892   -0.395238  -0.586763  0.503360  0.358941 -0.782964
108    0.987295   -0.299505  -0.303264 -0.284223 -1.106392 -0.410897

[109 rows x 6 columns]
Basic Statistical Information:
Some basic statistic information of the designated data set:
         AL2O3(WT%)    CR2O3(WT%)     FEOT(WT%)      CAO(WT%)      MGO(WT%)      MNO(WT%)
count  1.090000e+02  1.090000e+02  1.090000e+02  1.090000e+02  1.090000e+02  1.090000e+02
mean   1.415789e-16 -8.912341e-17  4.216810e-16  4.685345e-17 -6.722451e-17 -1.874138e-16
std    1.004619e+00  1.004619e+00  1.004619e+00  1.004619e+00  1.004619e+00  1.004619e+00
min   -2.205444e+00 -1.831243e+00 -1.409706e+00 -4.063601e+00 -2.810103e+00 -1.713133e+00
25%   -7.360817e-01 -5.292655e-01 -5.424660e-01 -4.120778e-01 -6.156105e-01 -5.397252e-01
50%    8.455546e-02  0.000000e+00 -2.412486e-01  5.510241e-02 -1.809186e-01 -3.882964e-02
75%    8.563927e-01  4.089238e-01  3.257487e-01  5.470608e-01  4.472813e-01  3.332377e-01
max    1.813530e+00  5.577677e+00  4.591518e+00  2.171605e+00  5.153439e+00  5.728214e+00
Successfully store 'X With Scaling' in 'X With Scaling.xlsx' in dir.

6. T-SNE¶

Select T-SNE.

-*-*- Model Selection -*-*-:
1 - PCA
2 - T-SNE
3 - MDS
4 - All models above to be trained
Which model do you want to apply?(Enter the Corresponding Number)
(Model) ➜ @Number: 2

Input the Hyper-parameters.

-*-*- T-SNE - Hyper-parameters Specification -*-*-
N Components: This parameter specifies the number of components to retain after dimensionality reduction.
Please specify the number of components to retain. A good starting range could be between 2 and 10, such as 4.
(Model) ➜ N Components: 4
Perplexity: This parameter is related to the number of nearest neighbors that each point considers when computing the probabilities.
Please specify the perplexity. A good starting range could be between 5 and 50, such as 30.
(Model) ➜ Perplexity: 30
Learning Rate: This parameter controls the step size during the optimization process.
Please specify the learning rate. A good starting range could be between 10 and 1000, such as 200.
(Model) ➜ Learning Rate: 200
Number of Iterations: This parameter controls how many iterations the optimization will run for.
Please specify the number of iterations. A good starting range could be between 250 and 1000, such as 500.
(Model) ➜ Number of Iterations: 500
Early Exaggeration: This parameter controls how tight natural clusters in the original space are in the embedded space and how much space will be between them.
Please specify the early exaggeration. A good starting range could be between 5 and 50, such as 12.
(Model) ➜ Early Exaggeration: 12

Running this model.

*-**-* T-SNE is running ... *-**-*
Expected Functionality:
+  Model Persistence
Successfully store 'Hyper Parameters - T-SNE' in 'Hyper Parameters - T-SNE.txt' in dir.
-----* Reduced Data *-----
     Dimension 1  Dimension 2  Dimension 3  Dimension 4
0     -10.623264   -17.686243    51.568302     1.087756
1      21.165867    50.837616    -2.456909    69.025017
2      35.604347   -36.303295   -11.843322   -21.651896
3     -25.802412    28.442354    44.452190    18.172804
4     -23.921820   -48.037205   -13.831066    14.313259
..           ...          ...          ...          ...
104    43.789333    -8.022134    26.877687    -7.914544
105   -12.640723    14.591939   -43.875713     4.952276
106   359.025940  -895.016479  -461.668243   491.801666
107    -7.346601    35.262451    25.139845     1.618280
108   188.163788    66.346474    -9.461174   190.716721

[109 rows x 4 columns]
Successfully store 'X Reduced' in 'X Reduced.xlsx' in dir.
-----* Model Persistence *-----
Successfully store 'T-SNE' in 'T-SNE.pkl' in dir.
Successfully store 'T-SNE' in 'T-SNE.joblib' in dir.

-*-*- Transform Pipeline -*-*-
Build the transform pipeline according to the previous operations.
Successfully store 'Transform Pipeline Configuration' in dir.
Successfully store 'Transform Pipeline' in 'Transform Pipeline.pkl' in dir.
Successfully store 'Transform Pipeline' in 'Transform Pipeline.joblib' in dir.