Modeling Essays Series 88: Project Metadata Practice 6 - PM2 Implementation Process Review


This implementation is based on the design in Modeling Essays Series 84: Project Metadata Practice 5.

If the time needed to put the application together the quick way, without any design, is taken as 1, this round took about 1.5. Time consumption ratio: 1.5.

On the whole, the results are quite satisfactory. A few strong impressions:

  • 1. It took more time than expected: about 4 days of development. Part of that went into improving the design along the way, but it was still longer than planned, and some parts in particular felt cumbersome.
  • 2. Almost no debugging was needed; it worked the first time. This is what the design is supposed to achieve, but the outcome was uncertain when detailed development began, and there was no spare time for rework.
  • 3. There is great potential for improvement. In other words, there is still plenty to do, including the following:
    • 1. Use Redis to manage process data. Process data spans multiple processes and procedures (procs), and manually declaring which process data to monitor is tedious. Instead, each proc can store its data in Redis at any time.
    • 2. Report process status actively. Store process-related details in Redis (acting as an in-memory runtime), mirror them against the existing runtime, and persist them in the runtime folder.
    • 3. Use a graph database to manage relationships. One kind is the dependency between the procs defined in a process configuration file; the other is the inheritance relationship that arises when processes and procedures change.

By the way, I put this service on the "cat fan-4650G" machine, which stays stable at (1200 rpm, 46.5 degrees) and is very quiet. I crouched next to it for a while and the fan made no sound. I will look at the machine's behavior again once the service is running at full load.


The contents of this practice are sorted out below in reverse order.

1 operation

After each part is configured, three Docker containers need to be started.

1.1 Foundation: input and distribution

This container fetches data from the message queue and distributes it to the functional processes. The message queue lives on a public-network server, while the container runs on a local host.

docker run -d --name='m4_pm2_ner_base'\
               -v /etc/localtime:/etc/localtime  \
               -v /etc/timezone:/etc/timezone\
               -e "LANG=C.UTF-8"\
               -v /opt/m4_pm2 test:/workspace\
               sh -c "python"

1.2 processing and return

This container is responsible for running each functional process and sending the results to the target database

docker run -d --name='m4_pm2_ner_process'\
               -v /etc/localtime:/etc/localtime  \
               -v /etc/timezone:/etc/timezone\
               -e "LANG=C.UTF-8"\
               -v /opt/m4_pm2 test:/workspace\
               sh -c "python"

1.3 patrol inspection

docker run -d --name='m4_pm2_ner_patrol'\
               -v /etc/localtime:/etc/localtime  \
               -v /etc/timezone:/etc/timezone\
               -e "LANG=C.UTF-8"\
               -v /opt/m4_pm2 test:/workspace\
               sh -c "python"

forever_run is a chain of functions that runs in an endless loop under the control of APScheduler. The template is as follows:

import pandas as pd 
import funcs as fs
import os 
from apscheduler.schedulers.blocking import BlockingScheduler  
from apscheduler.schedulers.background import  BackgroundScheduler
from datetime import datetime, timedelta
import time 

# Given a datetime string (default: now) and an offset in weeks, days, hours, minutes, seconds, etc. (default: 1 second), return the shifted time as a datetime (return_type='dt') or as a formatted string.
def make_dt_bias(cur_dt = None, str_format = '%Y-%m-%d %H:%M:%S', 
                 b_weeks = 0, b_days=0, b_hours =0 , b_minutes=0, 
                 b_seconds=1, b_milliseconds=0, b_microseconds = 0,
                 return_type = 'dt'):
    if cur_dt is None:
        cur_dt = datetime.now()
    else:
        cur_dt = datetime.strptime(cur_dt, str_format)
    cur_dt = cur_dt + timedelta(weeks=b_weeks, days=b_days, 
                                hours=b_hours, minutes=b_minutes,  
                                seconds=b_seconds, milliseconds=b_milliseconds,
                                microseconds=b_microseconds)
    if return_type =='dt':
        return cur_dt
    else:
        return datetime.strftime(cur_dt, str_format)

from jinja2 import Template

# Generate keyword variables in shell commands according to the dictionary
def gen_jinja_shell_kw_from_dict(kw_dict = None):
    the_tmp = '{%for k,v in obj.items()%} {%if v%} --{{k}}={{v}} {%endif%} {%endfor%}'
    the_tmp1 = Template(the_tmp)
    return the_tmp1.render(obj=kw_dict)

# Chain function: run the current step, then schedule the next one
def chain_func(cur_func_name = None, func_order_dict = None , sche = None ):
    if cur_func_name is not None:
        print('[I] >>> Chain Started At: ', datetime.now())
        cur_func = func_order_dict[cur_func_name]['cur_func']
        cur_func_kw = func_order_dict[cur_func_name]['cur_func_kw']

        if cur_func_kw is None:
            res = cur_func()
        else:
            res = cur_func(**cur_func_kw)

        print('go next step')
        next_func =  func_order_dict[cur_func_name].get('next_func')
        next_date_bias_kw = func_order_dict[cur_func_name].get('next_date_bias_kw')
        if next_func is None:
            print('[I] Done & Quit')
            # Nothing left to chain, so do not schedule another job
            return

        # Work out when the next step should run
        if next_date_bias_kw is None:
            next_dt = make_dt_bias()
        else:
            next_dt = make_dt_bias(**next_date_bias_kw)

        sche.add_job(chain_func, 'date', run_date = next_dt, kwargs = {'cur_func_name':next_func , 'func_order_dict':func_order_dict, 'sche':sche})

    else:
        print('[I] Chain is Empty')

# 1. Query the latest data status
def test_hello(para1=None):
    print('>> hello', para1)
#     return os.system('python ./some_folder/ -p %s --prefix=%s' % (pname,prefix) )

# Dictionary of functions to be executed
func_order_dict = {}
func_order_dict['step1'] = {'cur_func':test_hello, 
                            'cur_func_kw':{'para1':'YoYo, Check it Now'},
                            'next_func':'step1' , 'next_date_bias_kw':None }

# 'next_date_bias_kw':{'b_seconds':60} runs the next step 60 seconds later. None means the default offset of 1 second

if __name__ =='__main__':

    # Run the chain
    sche = BlockingScheduler()
    next_dt = make_dt_bias()
    sche.add_job(chain_func,'date',run_date =next_dt,
                 kwargs = {'cur_func_name':'step1', 'func_order_dict':func_order_dict, 'sche':sche})
    print('[S] starting chain')
    # start() blocks and executes the scheduled jobs
    sche.start()
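
To check the wiring of a func_order_dict without starting APScheduler, the chain can be walked synchronously. The sketch below is only an illustration: walk_chain and max_steps are names introduced here, not part of the original template. It ignores next_date_bias_kw and simply calls each step in order, with a step limit so that a self-referencing chain such as step1 -> step1 terminates.

```python
def walk_chain(start_name, func_order_dict, max_steps=10):
    """Call each step of the chain in order, without any scheduler.

    Hypothetical test helper: time offsets (next_date_bias_kw) are ignored,
    and max_steps bounds chains that loop back on themselves.
    """
    visited = []
    cur_name = start_name
    while cur_name is not None and len(visited) < max_steps:
        step = func_order_dict[cur_name]
        kw = step.get('cur_func_kw') or {}
        step['cur_func'](**kw)
        visited.append(cur_name)
        cur_name = step.get('next_func')
    return visited


if __name__ == '__main__':
    calls = []
    demo = {
        'step1': {'cur_func': lambda para1=None: calls.append(para1),
                  'cur_func_kw': {'para1': 'a'}, 'next_func': 'step2'},
        'step2': {'cur_func': lambda: calls.append('b'),
                  'cur_func_kw': None, 'next_func': None},
    }
    print(walk_chain('step1', demo))
```

Because the dictionary format is the same one chain_func consumes, a chain that passes this dry run can be handed to the scheduler unchanged.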

Taking the core processing container as an example, its core execution function is:

def processing_data(para1 = None):
    print('>>> running ', para1)

    res0 = os.system('python')
    print('>>> handling (p=process)f(f=function)_000', res0)

    res1 = os.system('python')
    print('>>> handling (p=process)f(f=function)_001', res1)

    res2 = os.system('python')
    print('>>> handling (p=process)f(f=function)_002', res2)
    return True

This is achieved mainly by executing py files. Each py file can be regarded as an integration. Together, the integrations must cover every part of a process, but they need not correspond to the parts one to one. Bottleneck processes can be run in multiple containers.

Note: I found that in this design I forgot to add shard control to each proc, so it cannot be parallelized well. The application in this practice does not actually need parallelization, but for the sake of the generality of procs this gap will be filled next time. In addition, each parallel worker is in fact a new container, and starting them by hand wastes a lot of time. I plan to build this part into the Agent, which will issue the docker commands as planned. The Agent itself is built in PM2 style and does not need to be parallel.
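
One possible shape for the missing shard control is sketched below. It assumes that every task has a string id and that each container is started with its own shard number and a shared max_shard (the two parameter names come from the proc template; belongs_to_shard and select_tasks are hypothetical helpers introduced here). CRC32 is used instead of the built-in hash(), because hash() of a string differs between Python processes and all containers must agree on the assignment.

```python
import zlib


def belongs_to_shard(task_id, shard=0, max_shard=1):
    """Return True if task_id should be handled by this shard."""
    if not 0 <= shard < max_shard:
        raise ValueError('shard must satisfy 0 <= shard < max_shard')
    # CRC32 is stable across processes and machines
    return zlib.crc32(str(task_id).encode('utf-8')) % max_shard == shard


def select_tasks(task_ids, shard=0, max_shard=1):
    """Keep only the tasks assigned to this shard, preserving order."""
    return [t for t in task_ids if belongs_to_shard(t, shard, max_shard)]


if __name__ == '__main__':
    tasks = ['task_%03d' % i for i in range(10)]
    print(select_tasks(tasks, shard=0, max_shard=2))
    print(select_tasks(tasks, shard=1, max_shard=2))
```

With the defaults (shard=0, max_shard=1) every task is selected, so a proc that is not parallelized behaves exactly as before.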

2 function (process)

2.1 entry_runtime

An entry_runtime corresponds to the set of procs that are actually executed. It may cover only part of a process.

This time, one entry_runtime corresponds to exactly one process.

import os 

# 1. Preprocess company names
res = os.system('python ./sche_001/pys/ --sname=sche_001 --prname=002 --pr_version=000')
# Every proc always represents a step A -> B
print('Fetch generated subtasks, preprocess (company name) and save', 'opr status:', res)

# 2. Person names
res = os.system('python ./sche_001/pys/ --sname=sche_001 --prname=002 --pr_version=000')
print('Fetch generated subtasks (person name)', 'opr status:', res)

# 3. Predict company names
res = os.system('python ./sche_001/pys/ --sname=sche_001 --prname=002 --pr_version=000')
print('Predict company names', 'opr status:', res)

# 4. Predict person names
res = os.system('python ./sche_001/pys/ --sname=sche_001 --prname=002 --pr_version=000')
print('Predict person names', 'opr status:', res)

# 5. Merge tasks
res = os.system('python ./sche_001/pys/ --sname=sche_001 --prname=002 --pr_version=000')
print('Merge company names and person names', 'opr status:', res)

# 6. Check whether the task shards are complete
res = os.system('python ./sche_001/pys/ --sname=sche_001 --prname=002 --pr_version=000')
print('Check whether the task shards are complete', 'opr status:', res)

# 7. Submit tasks
res = os.system('python ./sche_001/pys/ --sname=sche_001 --prname=002 --pr_version=000')
print('Submit tasks to the database', 'opr status:', res)

# 8. Delete files
res = os.system('python ./sche_001/pys/ --sname=sche_001 --prname=002 --pr_version=000')
print('Delete data', 'opr status:', res)

First of all, the whole thing looks very clear: each function has its own py file.
Another advantage is that each py file is executed as a shell command (run by the OS). The file occupies resources only while it is running, and they are released as soon as it finishes.
One more property, which may be an advantage or a disadvantage: if a py file is changed while the service is running, the change takes effect at the next execution.
In addition, this mode of operation means that the failure of one piece of logic does not completely interrupt the overall process.
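
The isolation described above can be illustrated with the standard subprocess module. This is a sketch rather than the entry_runtime actually used: build_flags and run_steps are hypothetical names, and short "-c" snippets stand in for the real proc scripts. Each step runs in its own Python process, its exit status is recorded, and a failing step does not prevent the later ones from running.

```python
import subprocess
import sys


def build_flags(kw_dict):
    """Turn {'sname': 'sche_001'} into ['--sname=sche_001'], skipping empty values."""
    return ['--%s=%s' % (k, v) for k, v in kw_dict.items() if v]


def run_steps(steps):
    """Run each (label, argv) step in its own Python process.

    A non-zero exit status is recorded but does not stop later steps.
    """
    statuses = {}
    for label, argv in steps:
        completed = subprocess.run([sys.executable] + list(argv))
        statuses[label] = completed.returncode
    return statuses


if __name__ == '__main__':
    flags = build_flags({'sname': 'sche_001', 'prname': '002', 'pr_version': '000'})
    # The '-c' snippets are placeholders for real proc scripts
    demo_steps = [
        ('ok step', ['-c', 'import sys; sys.exit(0)'] + flags),
        ('failing step', ['-c', 'import sys; sys.exit(3)']),
        ('later step', ['-c', 'print("still runs")']),
    ]
    print(run_steps(demo_steps))
```

Passing the arguments as a list avoids shell quoting issues, and the collected statuses give a later patrol step something concrete to inspect.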

Secondly, this approach is easy to reuse: function 002 is a combination of 000 and 001. During development I only modified the configuration file, added one proc, and slightly modified two existing ones.

Most of the thinking went into the structure, not the function. I think neo4j could greatly improve the efficiency of PM2.

2.2 process configuration file

Each distinct process corresponds to one function: p_002 = f_002.

The configuration file for function 002 is located at ./sche_001/process_config/p_002_v_000/main.conf. Since PM2 handles the management, there is no real need to remember such a path. To some extent, PM2 is also an agent.

# Pretreatment of company name identification
last_proc =  proc_001_v_000
ner_task = company
rule_set = NerPreProcessing_company
# Maximum length of each short-sentence list, chosen with memory / GPU memory in mind
max_ss_list_len = 2000
ss_len_min = 4
ss_len_max = 100
intever_type = left_close
data_type = dict_list
output_mode = persistent

# Each difference set flows 100 subtasks

# Preprocessing of person name recognition
last_proc =  proc_001_v_000
ner_task = person
rule_set = NerPreProcessing_person
# Maximum length of each short-sentence list, chosen with memory / GPU memory in mind
max_ss_list_len = 2000
ss_len_min = 2
ss_len_max = 100
intever_type = left_close
data_type = dict_list
output_mode = persistent

# Each difference set flows 100 subtasks

# Pattern recognition of company name
model_path = model_v0
allow_fetch_num = 10

# What to do has been specified in the preprocessed file
# task_type = company

# Pattern recognition of names
model_path = model_v0
allow_fetch_num = 10

# What to do has been specified in the preprocessed file
# task_type = company

# Merge the tasks of the two parts

# Input in this process step
# cur_proc_to=schema_output

# Collect all subtasks according to the parts of the task, merge them after completion, and send them to output



output_folder = schema_output

run_mode = auto
result_db_name = result_db
target_table = t_biz_entity_parse
max_id_query = select max(id) from %s
insert_batch_num = 1000

# This step has no upstream dependency; it lists the folders to be cleaned up

# auto: allows a tolerance; if the delivered count is within the tolerance of the data count, the task is treated as complete and deleted
# force: delete as soon as the data has been fetched
tolerance = 0.1
output_folder = schema_output

# Folders to check for deletion
del_input = schema_input
del_proc1 = proc_001_v_000
del_proc2 = proc_002_v_000
del_proc2_amend =  predict
del_proc3 = proc_003_v_000
del_proc3_amend =  predict
del_proc5 = proc_005_v_000

del_output = schema_output

This configuration file is essentially the union of the two processes p_000 and p_001, plus one additional proc.
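
For illustration, a file in this key = value style could be read with the standard configparser module. The sketch assumes that each proc has its own [section] header, which the listing above does not show; the section names, SAMPLE_CONF and load_process_conf are hypothetical, and fs.PM2 may well read the file differently. Interpolation is switched off because a value such as max_id_query contains a literal %s.

```python
import configparser

# Hypothetical excerpt: the section headers are an assumption
SAMPLE_CONF = """
[proc_002_v_000]
# Preprocessing for company-name recognition
last_proc = proc_001_v_000
ner_task = company
max_ss_list_len = 2000

[proc_007_v_000]
run_mode = auto
max_id_query = select max(id) from %s
insert_batch_num = 1000
"""


def load_process_conf(text):
    """Parse the config text into {section_name: {key: value}} (all values are str)."""
    # interpolation=None keeps '%s' in values literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == '__main__':
    conf = load_process_conf(SAMPLE_CONF)
    print(conf['proc_002_v_000']['last_proc'])
    print(int(conf['proc_002_v_000']['max_ss_list_len']))
```

The resulting dictionary of dictionaries has the same shape as the lookup pm2.get_process_conf()[cur_proc1] used in the proc template.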

Adjustment of procedure

Some procs can be reused but need slight modification. In that case we copy the proc, change the last digit of its version number, and then fine-tune the copy.

The three digits of the version number roughly mean:

  • First: a major version change; call parameters and mode of operation may change.
  • Second: the parameters may have changed.
  • Third: a small change; it can continue to be used as before.
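
The naming convention can be handled mechanically. The sketch below assumes that proc names always follow the pattern proc_NNN_v_XYZ seen in the configuration (for example proc_002_v_000) and that the three version digits are read one by one; parse_proc_name and fork_minor are hypothetical helpers.

```python
import re

# proc_<3-digit id>_v_<first><second><third>
_PROC_RE = re.compile(r'^proc_(\d{3})_v_(\d)(\d)(\d)$')


def parse_proc_name(name):
    """Split 'proc_002_v_000' into ('002', (0, 0, 0))."""
    m = _PROC_RE.match(name)
    if m is None:
        raise ValueError('unexpected proc name: %r' % name)
    proc_id, first, second, third = m.groups()
    return proc_id, (int(first), int(second), int(third))


def fork_minor(name):
    """Name for a lightly modified copy: bump the last version digit."""
    proc_id, (first, second, third) = parse_proc_name(name)
    if third >= 9:
        raise ValueError('last digit exhausted; bump the second digit instead')
    return 'proc_%s_v_%d%d%d' % (proc_id, first, second, third + 1)


if __name__ == '__main__':
    print(fork_minor('proc_002_v_000'))
```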

This introduces another problem. Procedures and processes are very flexible and can be fine-tuned to the actual situation. Such fine-tuning usually starts from something like a fork, as in git. I think it would be better to manage these relationships through neo4j.

A configuration file can then be regarded as a filter statement that selects a subgraph.
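
Before a graph database is in place, the same idea can be tried with plain dictionaries. The sketch below treats each last_proc entry as a dependency edge and uses the standard-library graphlib (Python 3.9+) to order the procs; execution_order, upstream_subgraph and the sample data are assumptions made for illustration, not part of PM2.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(proc_conf):
    """Order procs so that every proc comes after its last_proc."""
    graph = {name: ({conf['last_proc']} if conf.get('last_proc') else set())
             for name, conf in proc_conf.items()}
    return list(TopologicalSorter(graph).static_order())


def upstream_subgraph(proc_conf, target):
    """Collect target and everything it depends on: a 'filter' over the graph."""
    selected = []
    cur = target
    # The membership test also guards against accidental cycles
    while cur is not None and cur not in selected:
        selected.append(cur)
        cur = proc_conf.get(cur, {}).get('last_proc')
    return selected


if __name__ == '__main__':
    sample = {
        'proc_001_v_000': {},
        'proc_002_v_000': {'last_proc': 'proc_001_v_000'},
        'proc_005_v_000': {'last_proc': 'proc_002_v_000'},
    }
    print(execution_order(sample))
    print(upstream_subgraph(sample, 'proc_005_v_000'))
```

In neo4j the second function would become a path query, which is the sense in which a configuration file acts as a subgraph filter.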

Addendum: PM (Project Meta) and RS (RuleSet)

Both PM and RS are designed to be reliable and flexible.

The differences between PM and RS are:

  • 1. PM has a relatively coarse granularity. An RS may be only part of a proc.
  • 2. Persistence. PM is persistent, whereas RS generally runs in memory.
  • 3. PM is looser and RS more centralized. The failure of one proc in PM does not paralyze the whole process, whereas an RS that cannot read a dependency fails outright.

It is necessary for the two to exist independently, for much the same reason that each layer of the five-layer network protocol model exists independently.

The two can also learn from each other. The structure of PM is clearer than that of RS, probably because it has many persistent parts and is therefore more intuitive. RS calls are more integrated.

3 process template

The individual procs will not be described one by one. Roughly, they follow a template like this:

# ==================A fixed part of import==================
import os
import sys

# Add the project root to sys.path so that funcs can be imported
basedir = os.path.abspath(os.path.dirname(__file__))
# Go up two levels (to /workspace/)
basedir = basedir[:basedir.rfind('/')]
basedir = basedir[:basedir.rfind('/')]
if basedir not in sys.path:
    sys.path.append(basedir)

# ==================B General Package and Temporary Function==================
import os
import pandas as pd 
import numpy as np 
import time
import funcs as fs
import requests as req
from functools import partial

# =====b temporary helper functions can be defined here

def temp():
    pass

# ==================C Procedure Keyword Args==================
import argparse
def get_arg():
    parser = argparse.ArgumentParser(description='Customized Arguments')
#     parser.add_argument('-p','--pkl', default='Meta')

    # Process name
    # Process name
    # Process version
    # Force operation
    # Current shard
    # Maximum fragmentation

    # Prepare parsing parameters
    args = parser.parse_args()

    sname = args.sname
    prname = args.prname
    pr_version = args.pr_version
    run_mode = args.run_mode
    shard = args.shard
    max_shard = args.max_shard
    return sname,prname,pr_version,run_mode,shard,max_shard

# ==================D Run Command Example==================
# Run: python ./sche_001/pys/proc_ --sname=sche_001 --prname=001 --pr_version=000

if __name__ == '__main__':
    # ==================E get Runtime Parameter==================
    sname,prname,pr_version,run_mode, shard,max_shard = get_arg()
    # ======e sname and prname must be given; the others may be omitted (default values are used)

    # The version number may be omitted
    pr_version = pr_version or '000'
    # The current shard may be omitted; if it is given, max_shard must be given too and be consistent with it. For example, when max_shard is 2, shards 0 and 1 must both exist
    shard = shard or 0 
    # The maximum number of shards may be omitted
    max_shard = max_shard or 1

    # Get current process name
    cur_proc =  os.path.basename(__file__)
    cur_proc1 = cur_proc[:cur_proc.find('.')]
    print('>>> Current Process: %s ' % cur_proc1)

    # ======e for debugging, the values can be set by hand instead (start ipython in the project root directory)
    # import funcs as fs
    # sname = 'sche_001'
    # prname = '000'
    # pr_version = '000'
    # cur_proc1 = 'proc_002_v_000'

    # ==================F Get Process Configs==================
    # Initialize the current pm2
    pm2 = fs.PM2(sname,prname, pr_version)

    # ======f base configuration and the configuration of the current proc
    base_conf_dict = pm2.get_base_conf()
    proc_conf_dict = pm2.get_process_conf()[cur_proc1]

    # ======f the following are some common configuration items
    # f1 number of tasks released per difference-set (gap) check
    task_per_time = proc_conf_dict['task_per_time']
    # f2 source (from): the upstream proc
    last_proc = proc_conf_dict['last_proc']
    # f3 result storage location (to); the default is the current process, under the folder of the current proc
    result_to_process = proc_conf_dict['result_to_process']
    result_to_proc = proc_conf_dict['result_to_proc']

    # ==================G Variables Fetch==================
    # ======g from the configuration obtained in step F, derive the variables needed by the core logic in the next steps
    # (omitted)

    # ==================H Gap Check (difference set)==================
    # ======h compute the difference set per main task / subtask; in auto mode, whatever is in the difference set gets executed
    # ======h this step amounts to "if exists, then execute"
    # ======h it is a first round of screening for the candidate set that may be executed

    # >>>>>>>>>>>>>>>>>> For every candidate task >>>>>>>>>>>>>>>>>>

    # ==================I Metadata Interaction==================
    # ======i check the metadata of each main task / subtask [execute only when metadata exists]
    # ======i sometimes the metadata needs further inspection before deciding whether to execute [judge metadata before execution]
    # ======i mark whether the data of this step should be deleted after delivery
    # ======i tag other task-centric metadata

    # ==================J Core Logics==================
    # ======j this step can be understood as an RS (RuleSet): given the input, it produces a fixed output (in memory)
    # some ruleset

    # ==================K Metadata Updating==================
    # ======k mark the task metadata according to the execution status of the core logic
    # ======k temporarily store some runtime state in redis (the persistence location is runtime; persistence here is not provided by redis but handled by the program itself)
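
Step H of the template (the difference-set check) can be sketched with folder listings. The layout is an assumption: one file per subtask, with the same file name in the upstream (last_proc) folder and in the current proc's folder. pending_tasks is a hypothetical helper; task_per_time matches the configuration key read in step F.

```python
import os


def pending_tasks(last_proc_dir, cur_proc_dir, task_per_time=100):
    """Return the upstream task files that the current proc has not produced yet.

    Assumes one file per subtask and identical file names on both sides.
    At most task_per_time tasks are released per call.
    """
    done = set(os.listdir(cur_proc_dir)) if os.path.isdir(cur_proc_dir) else set()
    available = sorted(os.listdir(last_proc_dir))
    gap = [name for name in available if name not in done]
    return gap[:task_per_time]
```

Calling this at the start of every run gives the "if exists, then execute" behavior: an empty list means there is nothing to do, and a repeated run naturally skips what has already been delivered.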

A proc still looks cumbersome. Later, with PM combined with web interaction, this can be improved further: while keeping the logic complete and rigorous, minimize the long chains that people have to remember and manage.

The above is the main logic of this implementation. On the whole it is good. After one more round of optimization, the time consumption ratio should be close to 1. The next optimization stays within a single PM and uses standardized templates to reduce the effort spent on sorting out relationships. Later, joint management of multiple projects, with cooperation across processes and procedures, could bring the ratio below 0.7.

The greatest value of PM is that it makes linear scaling more efficient while preserving quality.

At present, PM needs almost no joint debugging. By continuously refining and standardizing the proc template, functions can be implemented by crowdsourcing. Suppose the original project takes one person one month. If one person defines each process and its procs in 10 days and hands the procs to 30 people, who finish them in 2 days, the pieces can then be merged. Even allowing 3 days for training and corrections, the project is finished in only 15 days (10 + 2 + 3). The result is almost the same as if one person had done it all.

More interestingly, because the design is carried out by a few people, overall planning is possible. The variable parameters can be abstracted out and handed to a genetic algorithm for large-scale search. In other words, the function itself is greatly strengthened.

Moreover, these processes and procedures can be reused in the future, so the cost drops sharply over time.

4 others

4.1 some ideas

At present we focus mainly on the step A -> B, although A and B themselves are the results we care about.

From a graph point of view, there are points and edges. At present we mainly consider the edges, while the points are ignored. This is because in most cases the process is fixed, and so are the corresponding points. This is simple, but not flexible enough.

The description of a point, that is, of data (imagine a pkl file), can be split into two parts: structure and content.

For example, suppose we have ten pieces of text to process. They can be stored in a list (the structure), and each text (the content) is a string (the data type).

If we describe this kind of data abstractly, say as ListOfStr, then cleaning these ten texts can be written as:

A(ListOfStr) -[Clean Process]->B(ListOfStr)

This is very clear. In the future we can of course use a finer classification for these types and manage them as objects. When we cannot remember a type, we can even show a sample as an example.
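
The notation A(ListOfStr) -[Clean Process]-> B(ListOfStr) maps naturally onto a small typed edge. In the sketch below, Edge, is_list_of_str and clean are illustrative names only: the edge checks the structure of its input and output around the function it wraps.

```python
from dataclasses import dataclass
from typing import Callable, List

ListOfStr = List[str]


def is_list_of_str(data):
    """Structure: list; content: str."""
    return isinstance(data, list) and all(isinstance(x, str) for x in data)


@dataclass
class Edge:
    name: str
    check_in: Callable[[object], bool]
    check_out: Callable[[object], bool]
    func: Callable

    def __call__(self, data):
        if not self.check_in(data):
            raise TypeError('%s: unexpected input structure' % self.name)
        result = self.func(data)
        if not self.check_out(result):
            raise TypeError('%s: unexpected output structure' % self.name)
        return result


def clean(texts: ListOfStr) -> ListOfStr:
    """Example cleaning rule: collapse runs of whitespace."""
    return [' '.join(t.split()) for t in texts]


clean_process = Edge('Clean Process', is_list_of_str, is_list_of_str, clean)

if __name__ == '__main__':
    print(clean_process(['  hello   world ', 'PM2']))
```

Describing the points this way makes a mismatch between two procs visible at the boundary instead of deep inside the core logic.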

4.2 what to do next

4.2.1 database PM project

The plan is to set up a "database interaction" project (DBM) that uses PM2 to standardize the way various databases are managed. neo4j comes first, so that the graph database can quickly be applied to strengthen PM2.

Each database is a separate schema, and its combined operations are accommodated as processes. This project is special because these functions should really be done with RS (pure in-memory operations). However, in some earlier objects, littlemongo and littleneo4j, certain patterns never felt clear.

The hope, therefore, is to express these operations clearly through a PM project first and to encapsulate them into RS once they mature.

4.2.2 using Redis to strengthen process management

Some statistics are needed in real time, and declaring them one by one by hand is cumbersome. If the procs use PM to write such content to Redis as they run, it becomes very convenient:

  • 1. Because the data is in memory, it is simple and fast for different processes to use.
  • 2. Logically, you only need to fetch the data from the database, with no further thought.
  • 3. Every proc that needs to be counted can write its numbers in directly.

One caveat when adding Redis: it increases the complexity of the system. PM originally depends on no other system (the file system is the default), so Redis should be an optional enhancement. Things are better with Redis, but PM must remain usable without it.
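
A minimal sketch of "better with Redis, usable without it" follows. RuntimeStore is a hypothetical class, not part of PM2. It always persists to a JSON file in a runtime folder and additionally mirrors values to Redis when a client is supplied. The client is assumed to offer set(key, value) and get(key) in the style of redis-py, and keys are assumed to be safe to use as file names.

```python
import json
import os


class RuntimeStore:
    """File-backed runtime state with an optional Redis mirror."""

    def __init__(self, folder, redis_client=None):
        self.folder = folder
        self.redis = redis_client  # None means: file system only
        os.makedirs(folder, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.folder, key + '.json')

    def set(self, key, value):
        payload = json.dumps(value)
        if self.redis is not None:
            self.redis.set(key, payload)
        # Persistence is handled here, not by Redis
        with open(self._path(key), 'w', encoding='utf-8') as f:
            f.write(payload)

    def get(self, key, default=None):
        if self.redis is not None:
            raw = self.redis.get(key)
            if raw is not None:
                return json.loads(raw)
        if os.path.exists(self._path(key)):
            with open(self._path(key), encoding='utf-8') as f:
                return json.load(f)
        return default
```

A proc would call something like store.set('proc_002_v_000', {'done': 3}) after each batch; whether a Redis client was passed in changes only the speed of reads, not the behavior.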

Tags: Python Back-end

Posted on Mon, 29 Nov 2021 08:52:04 -0500 by secret007