⭐ openGauss database source code analysis series articles - DeepSQL ⭐

❤️ Hello, I'm the Gauss squirrel club. Welcome to learn with us~ ❤️

The previous article introduced "8.6 AI Query Time Prediction". In this article, we introduce the highlights of "8.7 DeepSQL" and "8.8 Summary".

8.7 DeepSQL

The functions described so far all belong to the field of AI4DB. The combination of AI and databases has another broad direction, namely DB4AI. In this section, we introduce the DB4AI capability of openGauss and explore a new way to drive AI tasks efficiently through the database.

Usage scenario

The goal of the database's DB4AI function is to implement AI algorithms inside the database, so as to better support fast analysis and computation over big data. At present, the DB4AI capability of openGauss is delivered through the DeepSQL feature, which provides a complete set of SQL-based machine learning, data mining, and statistical algorithms; users can perform machine learning directly with SQL statements.

DeepSQL abstracts the end-to-end development process from data to model. Combined with the automatic optimization of the underlying compute engine and database, it enables users with basic SQL knowledge to complete most machine learning training and prediction tasks. The entire analysis pipeline runs inside the database engine: users analyze and process data where it is stored, without transferring it between the database and other platforms. This avoids unnecessary data movement across environments and consolidates a fragmented data-development technology stack.
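As a minimal sketch of what SQL-driven machine learning looks like: the sales table and its columns below are hypothetical, while madlib.linregr_train and madlib.linregr_predict are standard MADlib interfaces.

-- Train a linear regression model entirely in SQL, then apply it.
SELECT madlib.linregr_train(
    'sales',                 -- source table (hypothetical)
    'sales_model',           -- output model table
    'revenue',               -- dependent variable
    'ARRAY[1, ads, visits]'  -- independent variables (1 adds an intercept)
);

-- Score the rows with the fitted coefficients stored in the model table.
SELECT s.id,
       madlib.linregr_predict(m.coef, ARRAY[1, s.ads, s.visits]) AS predicted
FROM sales s, sales_model m;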

Existing technology

Today, academia and industry have produced many achievements in the DB4AI direction. Many traditional commercial relational databases already support DB4AI: through built-in AI components adapted to the data and environment inside the database, the stored data can be processed in place, minimizing the cost of data movement. Many cloud databases and cloud data-analysis platforms also offer DB4AI capabilities, often together with Python and R interfaces that make it easy for data analysts to get started.
The DB4AI field also has excellent open-source software, such as MADlib, a top-level Apache open-source project. It is compatible with PostgreSQL, and many databases developed from the PostgreSQL code base can be adapted to it easily. MADlib provides statistical and machine learning methods for structured and unstructured data, and uses aggregate functions to achieve parallel computation on distributed databases. It supports a wide range of machine learning and data mining algorithms, such as regression, classification, clustering, statistics, and graph algorithms, more than 70 in total; the currently released version 1.17 also supports deep learning. MADlib uses SQL-like syntax as its external interface and integrates AI tasks into the database by creating UDFs (user-defined functions).
At present, the DB4AI module of openGauss is compatible with open-source MADlib: it adapts and enhances the original MADlib software, and its performance is better than that of MADlib running on PostgreSQL. On top of the MADlib framework, openGauss also implements other common industrial algorithms, such as XGBoost, Prophet, GBDT, and a recommendation system. In addition, openGauss has native AI execution plans and operators, which will be open-sourced in subsequent versions. This section therefore focuses on how openGauss is made compatible with MADlib.

Key source code analysis

1. Project structure of MADlib

The file structure of MADlib and the description of each part are shown in Table 8-16. The MADlib source code can be obtained from its official website: https://madlib.apache.org/ .

Table 8-16 Main file structure of MADlib

File structure                    | Description
----------------------------------+-----------------------------------------------------------
cmake                             | CMake-related files
  /array_ops                      | Array operations module
  /kmeans                         | k-means related modules
  /sketch                         | Word-frequency statistics modules
  /stemmer                        | Stemming related modules
  /svec                           | Sparse matrix related module
  /svec_util                      | Sparse matrix dependency module
  /utils                          | Other common modules
src/bin                           | Tool modules for installation, uninstallation, deployment, etc.
src/bin/madpack                   | Database interaction module
src/dbal                          | Database abstraction layer
src/libstemmer                    | Stemming related modules
src/madpack                       | Tool-dependent files
src/modules                       | Public algorithm modules, including:
  /assoc_rules                    | Association rule algorithm
  /convex                         | Convex algorithm implementations
  /crf                            | Conditional random field algorithm
  /elastic_net                    | Elastic net algorithm
  /glm                            | Generalized linear models
  /lda                            | Latent Dirichlet allocation
  /linalg                         | Linear algebra operations
  /linear_systems                 | Linear systems module
  /prob                           | Probability module
  /recursive_partitioning         | Decision tree and random forest
  /regress                        | Regression algorithms
  /sample                         | Sampling module
  /stats                          | Mathematical statistics module
  /utilities                      | Time series and other utilities
src/ports                         | Platform-related interfaces (PostgreSQL, GaussDB)
src/ports/postgres                | Algorithms and interfaces for PostgreSQL-family systems
  /dbconnector                    | Database connection interface
  /modules                        | Algorithm modules of the port, including:
  /modules/assoc_rules            | Association rule algorithm
  /modules/bayes                  | Bayesian algorithm
  /modules/conjugate_gradient     | Conjugate gradient method
  /modules/convex                 | Convex optimization, including the multilayer perceptron
  /modules/crf                    | Conditional random field
  /modules/elastic_net            | Elastic net
  /modules/facebook_prophet_gs    | Prophet time series prediction
  /modules/gbdt                   | GBDT algorithm
  /modules/glm                    | Generalized linear models
  /modules/graph                  | Graph models
  /modules/kmeans                 | k-means algorithm
  /modules/knn                    | k-nearest neighbors algorithm
  /modules/lda                    | Latent Dirichlet allocation
  /modules/linalg                 | Linear algebra operations
  /modules/linear_systems         | Linear systems module
  /modules/pca                    | PCA dimensionality reduction
  /modules/prob                   | Probability module
  /modules/recursive_partitioning | Decision tree and random forest
  /modules/regress                | Regression algorithms
  /modules/sample                 | Sampling module
  /modules/stats                  | Mathematical statistics module
  /modules/summary                | Descriptive statistics summary functions
  /modules/svm                    | SVM algorithm
  /modules/tsa                    | Time series analysis
  /modules/validation             | Cross-validation
  /modules/xgboost_gs             | XGBoost algorithm
src/utils                         | Common utility modules

2. Execution process of MADlib on openGauss

Users train models and make predictions by calling UDFs, and the results are saved in tables inside the database. Taking the training process as an example, the overall flow of MADlib on openGauss is shown in Figure 8-22.

Figure 8-22 Flowchart of MADlib training a model on openGauss

Extension based on MADlib framework

The previous section showed the structure and the role of each MADlib module. Based on this structure, users can extend MADlib with their own algorithms. The three algorithms mentioned above (XGBoost, GBDT, and Prophet) are our extensions on top of the original framework. This section takes the self-developed GBDT module as an example to introduce how to extend the MADlib framework.

The GBDT file structure is shown in Table 8-17.

Table 8-17 Main file structure of the GBDT algorithm

File structure     | Description
-------------------+------------------------
gbdt/gbdt.py_in    | Python code
gbdt/gbdt.sql_in   | Stored procedure code
gbdt/test/gbdt.sql | Test code

The sql_in file defines the upper-layer SQL-like interface, implemented with PL/pgSQL or PL/python.
UDFs are defined in the SQL layer. The following code implements something similar to function overloading:

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gbdt_train(
    training_table_name         TEXT,
    output_table_name           TEXT,
    id_col_name                 TEXT,
    dependent_variable          TEXT,
    list_of_features            TEXT,
    list_of_features_to_exclude TEXT,
    weights                     TEXT
)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.gbdt_train($1, $2, $3, $4, $5, $6, $7, 30::INTEGER);
$$ LANGUAGE sql VOLATILE;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gbdt_train(
    training_table_name         TEXT,
    output_table_name           TEXT,
    id_col_name                 TEXT,
    dependent_variable          TEXT,
    list_of_features            TEXT,
    list_of_features_to_exclude TEXT
)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.gbdt_train($1, $2, $3, $4, $5, $6, NULL::TEXT);
$$ LANGUAGE sql VOLATILE;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gbdt_train(
    training_table_name         TEXT,
    output_table_name           TEXT,
    id_col_name                 TEXT,
    dependent_variable          TEXT,
    list_of_features            TEXT
)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.gbdt_train($1, $2, $3, $4, $5, NULL::TEXT);
$$ LANGUAGE sql VOLATILE;

Among them, the necessary information, such as the input table, output table, and features, must be specified by the user. The other parameters have default values, such as weights: if the user does not specify them, the program runs with the defaults.
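For illustration, the following hypothetical calls (the table and column names are assumptions; at installation, MADLIB_SCHEMA is replaced by the actual schema name, typically madlib) show how the overloads resolve:

-- 6-argument form: weights defaults to NULL, num_trees to 30.
SELECT madlib.gbdt_train('public.train_data', 'public.gbdt_model',
                         'id', 'label', '*', NULL);

-- 7-argument form: a user-specified weights column 'w'; num_trees is still 30.
SELECT madlib.gbdt_train('public.train_data', 'public.gbdt_model2',
                         'id', 'label', '*', NULL, 'w');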
Next, define the PL/python interface in the SQL layer. The code is as follows:

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gbdt_train(
    training_table_name         TEXT,
    output_table_name           TEXT,
    id_col_name                 TEXT,
    dependent_variable          TEXT,
    list_of_features            TEXT,
    list_of_features_to_exclude TEXT,
    weights                     TEXT,
    num_trees                   INTEGER,
    num_random_features         INTEGER,
    max_tree_depth              INTEGER,
    min_split                   INTEGER,
    min_bucket                  INTEGER,
    num_bins                    INTEGER,
    null_handling_params        TEXT,
    is_classification           BOOLEAN,
    predict_dt_prob             TEXT,
    learning_rate               DOUBLE PRECISION,
    verbose                     BOOLEAN,
    sample_ratio                DOUBLE PRECISION
)
RETURNS VOID AS $$
PythonFunction(gbdt, gbdt, gbdt_fit)
$$ LANGUAGE plpythonu VOLATILE;

The PL/pgSQL or SQL functions eventually call a PL/python function.
"PythonFunction(gbdt, gbdt, gbdt_fit)" is a fixed usage. It is an encapsulated m4 macro, which is replaced during compilation and installation.
In PythonFunction, the first parameter is the folder name, the second is the file name, and the third is the function name. The macro is replaced with a statement of the form "from gbdt.gbdt import gbdt_fit", so make sure the file path and function name are correct.
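As a rough sketch (a simplified illustration, not the literal m4 output), the installed function behaves as if its body were written like this; gbdt_train_sketch and the reduced argument list are placeholders:

CREATE OR REPLACE FUNCTION madlib.gbdt_train_sketch(
    training_table_name TEXT,
    output_table_name   TEXT
)
RETURNS VOID AS $$
    # Import <function> from <folder>.<file> as named in the macro, then
    # forward the PL/Python global namespace, which carries the UDF's
    # arguments (this is also why gbdt_fit accepts **kwargs).
    from gbdt.gbdt import gbdt_fit
    return gbdt_fit(**globals())
$$ LANGUAGE plpythonu VOLATILE;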
The training function is implemented in the Python layer. The code is as follows:

def gbdt_fit(schema_madlib, training_table_name, output_table_name,
        id_col_name, dependent_variable, list_of_features,
        list_of_features_to_exclude, weights,
        num_trees, num_random_features,
        max_tree_depth, min_split, min_bucket, num_bins,
        null_handling_params, is_classification,
        predict_dt_prob=None, learning_rate=None,
        verbose=False, **kwargs):
    ...
    # drop any leftover gradient column from a previous run
    plpy.execute("""ALTER TABLE {training_table_name} DROP COLUMN IF EXISTS gradient CASCADE
                """.format(training_table_name=training_table_name))

    create_summary_table(output_table_name, null_proxy, bins['cat_features'],
                        bins['con_features'], learning_rate, is_classification, predict_dt_prob,
                        num_trees, training_table_name)

The prediction function is implemented in the Python layer. The code is as follows:

def gbdt_predict(schema_madlib, test_table_name, model_table_name, output_table_name, id_col_name, **kwargs):
    # each row of the model table stores one fitted tree
    num_tree = plpy.execute("""SELECT COUNT(*) AS count FROM {model_table_name}""".format(**locals()))[0]['count']
    if num_tree == 0:
        plpy.error("The GBDT-method has no trees")

    elements = plpy.execute("""SELECT * FROM {model_table_name}_summary""".format(**locals()))[0]
    ...

The py_in file defines the corresponding business code, that is, the processing logic implemented in Python.
During the installation phase, the sql_in and py_in files are parsed by GNU m4 into plain SQL and Python files. Note that the current MADlib framework only supports Python 2, so the code above is also written in Python 2.

Usage example of MADlib on openGauss

This section takes house-price classification with the support vector machine (SVM) algorithm as an example to demonstrate concrete usage.
(1) Prepare the data set. The code is as follows:

DROP TABLE IF EXISTS houses;
CREATE TABLE houses (id INT, tax INT, bedroom INT, bath FLOAT, price INT,  size INT, lot INT);
INSERT INTO houses VALUES
(1 ,  590 ,       2 ,    1 ,  50000 ,  770 , 22100),
(2 , 1050 ,       3 ,    2 ,  85000 , 1410 , 12000),
(3 ,   20 ,       3 ,    1 ,  22500 , 1060 ,  3500),
 ...
(12 , 1620 ,       3 ,    2 , 118600 , 1250 , 20000),
(13 , 3100 ,       3 ,    2 , 140000 , 1760 , 38000),
(14 , 2070 ,       2 ,    3 , 148000 , 1550 , 14000),
(15 ,  650 ,       3 ,  1.5 ,  65000 , 1450 , 12000);

(2) Model training
① Configure the corresponding schema and compatibility parameters before training. The code is as follows:

SET search_path="$user",public,madlib;
SET behavior_compat_options = 'bind_procedure_searchpath';

② Train with the default parameters. The classification condition is 'price < 100000'. The SQL statement is as follows:

DROP TABLE IF EXISTS houses_svm, houses_svm_summary; 
SELECT madlib.svm_classification('public.houses','public.houses_svm','price < 100000','ARRAY[1, tax, bath, size]');

(3) View the model with the following code:

\x on
SELECT * FROM houses_svm;
\x off

The results are as follows:

-[ RECORD 1 ]------+-----------------------------------------------------------------
coef               | {.113989576847,-.00226133300602,-.0676303607996,.00179440841072}
loss               | .614496714256667
norm_of_gradient   | 108.171180769224
num_iterations     | 100
num_rows_processed | 15
num_rows_skipped   | 0
dep_var_mapping    | {f,t}

(4) Make predictions with the trained model. The code is as follows:

DROP TABLE IF EXISTS houses_pred; 
SELECT madlib.svm_predict('public.houses_svm','public.houses','id','public.houses_pred');

(5) View the forecast results with the following code:

SELECT *, price < 100000 AS actual FROM houses JOIN houses_pred USING (id) ORDER BY id;

The results are as follows:

id | tax  | bedroom | bath | price  | size |  lot  | prediction | decision_function | actual
----+------+---------+------+--------+------+-------+------------+-------------------+--------
  1 |  590 |       2 |    1 |  50000 |  770 | 22100 | t          |      .09386721875 | t
  2 | 1050 |       3 |    2 |  85000 | 1410 | 12000 | t          |     .134445058042 | t
 ...
 14 | 2070 |       2 |    3 | 148000 | 1550 | 14000 | f          |  -1.9885277913972 | f
 15 |  650 |       3 |  1.5 |  65000 | 1450 | 12000 | t          |   1.1445697772786 | t
(15 rows)

Check the number of misclassified rows. The code is as follows:

SELECT COUNT(*) FROM houses_pred JOIN houses USING (id) WHERE houses_pred.prediction != (houses.price < 100000);

The results are as follows:

count
-------
     3
(1 row)

(6) Train with a different SVM kernel. The code is as follows:

DROP TABLE IF EXISTS houses_svm_gaussian, houses_svm_gaussian_summary, houses_svm_gaussian_random; 
SELECT madlib.svm_classification( 'public.houses','public.houses_svm_gaussian','price < 100000','ARRAY[1, tax, bath, size]','gaussian','n_components=10', '', 'init_stepsize=1, max_iter=200' );

Make predictions with the new model and check the misclassification count:

DROP TABLE IF EXISTS houses_pred_gaussian; 
SELECT madlib.svm_predict('public.houses_svm_gaussian','public.houses','id', 'public.houses_pred_gaussian');
SELECT COUNT(*) FROM houses_pred_gaussian JOIN houses USING (id) WHERE houses_pred_gaussian.prediction != (houses.price < 100000);

The results are as follows:

 count
-------
     0
(1 row)

(7) Other parameters
In addition to specifying different kernel functions, you can also specify the number of iterations and initial parameters, such as init_stepsize, max_iter, and class_weight.
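For example, the following hedged sketch reuses the call pattern above; the parameter values are illustrative, not recommendations:

DROP TABLE IF EXISTS houses_svm_tuned, houses_svm_tuned_summary, houses_svm_tuned_random;
SELECT madlib.svm_classification(
    'public.houses', 'public.houses_svm_tuned',
    'price < 100000', 'ARRAY[1, tax, bath, size]',
    'gaussian',            -- kernel function
    'n_components=10',     -- kernel parameters
    '',                    -- grouping columns (none)
    'init_stepsize=0.05, max_iter=300, class_weight=balanced');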

Evolution route

openGauss currently provides machine learning capabilities through the open-source Apache MADlib machine learning library. By adapting the original MADlib framework, openGauss implements a variety of additional, engineering-oriented algorithm extensions.
Besides being compatible with Apache MADlib, the industry-standard companion of PostgreSQL, so as to inherit its ecosystem, openGauss is also developing a native DB4AI engine that supports end-to-end AI capabilities, including model management, hyperparameter optimization, native SQL-like syntax, and database-native AI operators and execution plans. Its performance is more than 5 times that of MADlib. This functionality will be open-sourced gradually in future versions.
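As a preview, the native SQL-like syntax can be expected to look roughly like the following. This is a sketch based on the direction described above, not a released interface; the exact keywords and options may differ:

-- train a model directly with SQL-like DDL
CREATE MODEL price_model USING logistic_regression
    FEATURES tax, bath, size TARGET price < 100000
    FROM houses WITH max_iterations=100;

-- apply the model inside a query
SELECT id, PREDICT BY price_model (FEATURES tax, bath, size) AS prediction
FROM houses;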

8.8 summary

This chapter introduced the openGauss team's exploration of combining AI with databases. It focused on AI4DB features such as parameter self-tuning, index recommendation, anomaly detection, query time prediction, and slow SQL discovery, as well as the DB4AI capability of openGauss. In any case, the combination of AI and databases goes far beyond this; the functions introduced here are only a beginning, and there is still much to do for the AI capabilities of openGauss. The directions of evolution include deeper integration of AI with the optimizer; full-process AI autonomy that realizes fault discovery and automatic repair in all scenarios; and using AI to transform the algorithms and logic inside the database.
Although the combination of AI and databases has made considerable progress, it still faces the following challenges.
(1) Computing power: how to absorb the computational cost introduced by additional AI computation? Will it cause performance degradation?
(2) Algorithms: will combining AI algorithms with the database bring significant benefits? Is the extra overhead of the algorithm large? Can the algorithm generalize to pervasive scenarios? What kind of algorithm can better solve practical problems?
(3) Data: how to safely extract and store the data required for AI model training, and how to handle hot/cold data classification, loading, and startup?
To a large extent, the above problems are trade-offs. We should make full use of the inspiration AI brings, while fully inheriting and carrying forward the existing theory and practice of databases. This is the direction the openGauss team keeps exploring.

Thank you for reading "8.7 DeepSQL" and "8.8 Summary" in Chapter 8, AI technology. Next, we will begin "Chapter 9: Security management source code analysis".
Coming soon.
