Starting from a real software fault: techniques for handling faults that are difficult to reproduce reliably in enterprise management software

Contents

Some manifestations of intractable faults in the field of enterprise management software

1. Complex processes are required to reproduce

2. Faults span multiple modules of enterprise management software

3. The fault can only reappear in the customer's production system

4. The fault can only reappear in the background operation mode, and everything is normal when running in the online mode

5. The fault can only reappear in the normal operation mode of the software. During single-step debugging, the software works normally

Case sharing of an actual troubleshooting process

1. Try to find a way to stably reproduce the fault

2. Narrow down the troubleshooting scope of codes that may cause faults

3. Use the debugger to lock the problem

Summary of troubleshooting process

The author has been engaged in the development of enterprise management software since he graduated from university in 2007 and joined SAP Chengdu Research Institute.

Enterprise management software is aimed at enterprise-level users. If the software has a fault (a bug), in extreme cases the enterprise may suffer huge economic losses. This places higher demands on software developers in terms of programming standards, software testing and verification before delivery. At the same time, because enterprise management software is highly complex, some faults are difficult to reproduce, or can be reproduced only on a production system running the customer's specific business process. All of this makes fault analysis and handling in enterprise management software very challenging.

Starting from an actual software fault handled by the author, this paper talks about his experience in dealing with some thorny faults in enterprise management software.

 

In my opinion, these thorny faults can be divided into the following categories.

Some manifestations of intractable faults in the field of enterprise management software

At SAP Chengdu Research Institute, I have dealt with many software faults that gave me a headache. They had one or more of the following characteristics.

1. Complex processes are required to reproduce

For example, I once handled a fault related to customer invoices. The fault could be reproduced only when an invoice was released. To release an invoice, we first had to create a sales order, create a Customer Demand based on the order, then create a pick task, generate a delivery note, and finally generate a new customer invoice.

Such complex processes usually run smoothly only if the corresponding master data and transaction data have been maintained in the system in advance. Complex business processes therefore make a fault harder to reproduce.

2. Faults span multiple modules of enterprise management software

Due to the complexity of enterprise management software, a seemingly simple fault seen by end users may span multiple modules of software implementation.

Take the fault described in item 1 above as an example. Suppose the software's help documentation describes the following supported function: the customer adds a new user-defined field on the sales order screen and maintains a value for it, and this value is passed from the sales order through the picking task and the delivery order to the customer invoice. We call this transfer of field values across multiple documents the data flow.


If the customer sees that the value of this field is empty on the invoice page, the customer may think that the invoice module has failed. However, the module processing corresponding to each node of the data flow may be the culprit of the fault. Sales orders and customer invoices belong to the CRM module, while picking tasks and delivery orders belong to SCM.

In the actual development work, this means that analyzing the fault often requires cross team cooperation, because CRM and SCM modules are often in the charge of different development teams.

3. The fault can only reappear in the customer's production system

Before enterprise management software is delivered, it is tested at different levels in the supplier's internal development, test and validation systems. Even so, some faults are exposed only when the application runs on the customer's production system, for example because they depend on configurations of a specific business process that only this customer uses and that the supplier's internal system tests do not cover.

Because this kind of fault can be reproduced only in the customer's production system, it is harder to analyze and locate. This is especially true when the reproduction steps write data to the production system. Usually the only option is to contact the customer's staff and work through remote desktop plus a teleconference: the customer's staff perform the operations while the software supplier's support staff debug online.

 

4. The fault can only reappear in the background operation mode, and everything is normal when running in the online mode

In enterprise management software, especially in ERP, background jobs are often used for time-consuming batch processing, such as batch processing of orders or the analysis and aggregation of report data. Background mode differs from online mode, in which a user interface is attached, and this makes single-step debugging more difficult.

5. The fault can only reappear in the normal operation mode of the software. During single-step debugging, the software works normally

A fault with this characteristic signals to the support staff that it may be related to the specific execution timing of the program, because the timing of a normal run clearly differs from the timing under single-step debugging. For example, single-stepping in the debugger can disturb the normal execution timing of a multithreaded program.

Without the debugger, a powerful weapon, support staff need stronger theoretical analysis and problem abstraction skills to analyze this kind of fault.

Due to space constraints, this article uses one practical example to share the analysis and handling of a fault of the fifth type above.

Case sharing of an actual troubleshooting process

I was once responsible for the SAP CRM IBASE (Installed Base) module. IBASE is an abstract model that describes resource objects, such as devices, machines, services or software, that have been installed at a customer location. The IBASE model describes these objects and their components as a hierarchical tree structure, which the service modules use as a reference.

 

One day, I received a fault report. Colleagues from another team used the IBASE API, which my team is responsible for, to create IBASE components and then modify, delete and save them in the same session. When they did so, they encountered a runtime error.

 

The screenshot of the runtime error mentioned in the fault description is shown in the figure above.

The colleagues found that this error could be reproduced only in background mode, and even there not every time. The fault also could not be reproduced in single-step debugging mode.

 

A fault that cannot be reproduced reliably is as good as a fault that cannot be reproduced at all.

1. Try to find a way to stably reproduce the fault

To analyze this problem, I first had to find a way to reproduce it reliably. Because the fault did not appear under single-step debugging, I had to think of another method.

Reading the fault report word by word, the sequence of operations before the failure was as follows:

(1) Create IBASE
(2) Modify IBASE
(3) Delete IBASE
(4) Save the transaction.

A runtime error occurred.

Because I am responsible for the IBASE module, I quickly wrote a program of fewer than 200 lines. The program calls the IBASE creation, modification and deletion APIs in turn and then saves the transaction.

The program source code is as follows:

REPORT zibase_create_delete.

PARAMETERS: txt TYPE char40 OBLIGATORY DEFAULT 'description test',
            eid TYPE char30 OBLIGATORY DEFAULT 'PROGRAM',
            oid TYPE comm_product-product_id OBLIGATORY DEFAULT 'CHILDOBJ8',
            fam TYPE comm_product-object_family OBLIGATORY DEFAULT '0401',
            cat TYPE COMT_CATEGORY_ID OBLIGATORY DEFAULT 'OBJ_0401'.

DATA: lt_param  TYPE crmt_name_value_pair_tab,
      ls_param  TYPE crmt_name_value_pair,
      lr_core   TYPE REF TO cl_crm_bol_core,
      ls_object TYPE comm_product,
      lr_root   TYPE REF TO if_bol_entity_col,
      entity    TYPE REF TO cl_crm_bol_entity.

CHECK zcl_object_generator=>create_object( iv_id = oid iv_family = fam iv_catid = cat ) = abap_true.

ls_param-name  = cl_crm_ibase_il_constant=>createparam.
ls_param-value = '01'.
APPEND ls_param TO lt_param.

lr_core = cl_crm_bol_core=>get_instance( ).
lr_core->load_component_set('IBASE_ONLY').

CALL METHOD lr_core->root_create
  EXPORTING
    iv_object_name  = cl_crm_ibase_il_constant=>root_object
    iv_create_param = lt_param
    iv_number       = 1
  RECEIVING
    rv_result       = lr_root.

CHECK lr_root IS BOUND.
entity ?= lr_root->get_current( ).

CHECK entity IS BOUND.
IF entity->lock( ) = abap_true.
  entity->switch_to_change_mode( ).
ENDIF.

entity->set_property_as_string( iv_attr_name = 'DESCR' iv_value = CONV #( txt ) ).
entity->set_property_as_string( iv_attr_name = 'EXTID' iv_value = CONV #( eid ) ).
"entity->set_property_as_string( iv_attr_name = 'IBTYP' iv_value = '01' ).
lr_core->modify( ).
DATA(lv_ibase_id) = entity->get_property_as_string( 'IBASE' ).

DATA(component) = entity->create_related_entity( 'FirstLevelComponent' ).

CHECK component IS NOT INITIAL.

DATA(obj_comp) = component->create_related_entity( 'IBCompObj' ).

CHECK obj_comp IS NOT INITIAL.

obj_comp->set_property_as_string( iv_attr_name = 'OBJECT_ID' iv_value = CONV #( oid ) ).

SELECT SINGLE * INTO ls_object FROM comm_product WHERE product_id = oid.
ASSERT sy-subrc = 0.

obj_comp->set_property_as_string( iv_attr_name = 'OBJECT_GUID' iv_value = CONV #( ls_object-product_guid ) ).
obj_comp->set_property_as_string( iv_attr_name = 'OBJECT_FAMILY' iv_value = CONV #( ls_object-object_family ) ).
lr_core->modify( ).

DATA(lo_message_container) = entity->get_message_container( ).
CALL METHOD lo_message_container->get_messages
  EXPORTING
    iv_message_type = if_genil_message_container=>mt_all
  IMPORTING
    et_messages     = DATA(lt_msg1).
LOOP AT lt_msg1 ASSIGNING FIELD-SYMBOL(<msg1>).
  WRITE:/ <msg1>-message COLOR COL_NEGATIVE.
ENDLOOP.

CHECK lt_msg1 IS INITIAL.

DATA: ls_header      TYPE ibap_head1,
      lt_struc_tab   TYPE ibap_struc1_tab,
      ls_comp TYPE IBAP_DAT1.
"delete component"

ls_header-ibase = lv_ibase_id.
CALL FUNCTION 'CRM_IBASE_GET_DETAIL'
  EXPORTING
    i_ibase_head      = ls_header
  IMPORTING
    e_struc_ibase_tab = lt_struc_tab
  EXCEPTIONS
    not_specified     = 1
    doesnt_exist      = 2
    no_authority      = 3.

CHECK sy-subrc = 0.

READ TABLE lt_struc_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.
" Stop here if no component was returned; <line> would be unassigned.
CHECK sy-subrc = 0.
ls_comp-instance = <line>-instance.

CALL FUNCTION 'CRM_IBASE_COMP_DELETE'
  EXPORTING
    i_comp              = ls_comp
  EXCEPTIONS
    data_not_consistent = 1
    ibase_locked        = 2
    not_succesful       = 3
    no_authority        = 4.

CASE sy-subrc.
  WHEN 1.
    WRITE: / 'data not consistent' COLOR COL_NEGATIVE.
  WHEN 2.
    WRITE: / 'cannot delete locked component' COLOR COL_NEGATIVE.
  WHEN 3.
    WRITE: / 'deletion not successful' COLOR COL_NEGATIVE.
  WHEN 4.
    WRITE: / 'no deletion authorization' COLOR COL_NEGATIVE.
ENDCASE.

DATA(lo_transaction) = lr_core->get_transaction( ).
DATA(lv_changed) = lo_transaction->check_save_needed( ).

CHECK lv_changed EQ abap_true.

DATA(lv_success) = lo_transaction->save( ).

DATA(lo_glb_msg_cont) = lr_core->get_global_message_cont( ).
CALL METHOD lo_glb_msg_cont->if_genil_message_container~get_messages
  EXPORTING
    iv_message_type = if_genil_message_container=>mt_all
  IMPORTING
    et_messages     = DATA(lt_msg).
LOOP AT lt_msg ASSIGNING FIELD-SYMBOL(<msg>).
  WRITE:/ <msg>-message.
ENDLOOP.

IF lv_success = abap_true.
  lo_transaction->commit( ).
  WRITE:/ 'IBASE Created Successfully: ', lv_ibase_id COLOR COL_NEGATIVE.
ELSE.
  lo_transaction->rollback( ).
ENDIF.

As expected, a runtime error occurred when I executed this report. This is a good sign, because I had now found a way to reproduce the problem reliably. Next, I needed to narrow down the scope of the problem and find out which of my 200 lines of code caused the runtime error.

I like to call a program that I develop specifically to reproduce and analyze a fault a "scaffold program" or "fault trigger".

2. Narrow down the troubleshooting scope of codes that may cause faults

Because I wrote these 200 lines of code myself, I can modify them at will. First, I commented out all the code except the call to the IBASE creation API. I executed the program and everything was normal.

Then I uncommented the call to the IBASE modification API so that it took part in the execution again. Everything was still normal.

Then I uncommented the call to the IBASE deletion API and executed the program, and the runtime error occurred!

This shows that the runtime error is related to the IBASE deletion scenario.
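The same narrowing-down can also be done without editing the program between runs. The following is a minimal sketch and not the scaffold program used in this case: the report name, the local class lcl_steps and its methods are hypothetical placeholders, and each method only writes a line where the real program would call the corresponding IBASE API.

```abap
REPORT zscaffold_toggle_demo.

" Hypothetical sketch: each checkbox switches one step of the scaffold
" on or off, so a step can be excluded without commenting out code.
PARAMETERS: p_modify AS CHECKBOX DEFAULT 'X',
            p_delete AS CHECKBOX DEFAULT 'X',
            p_save   AS CHECKBOX DEFAULT 'X'.

CLASS lcl_steps DEFINITION.
  PUBLIC SECTION.
    CLASS-METHODS: step_create,
                   step_modify,
                   step_delete,
                   step_save.
ENDCLASS.

CLASS lcl_steps IMPLEMENTATION.
  METHOD step_create.
    WRITE: / 'create step executed'.   " placeholder for the creation API call
  ENDMETHOD.

  METHOD step_modify.
    WRITE: / 'modify step executed'.   " placeholder for the modification API call
  ENDMETHOD.

  METHOD step_delete.
    WRITE: / 'delete step executed'.   " placeholder for the deletion API call
  ENDMETHOD.

  METHOD step_save.
    WRITE: / 'save step executed'.     " placeholder for saving the transaction
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  " Creation always runs, because every later step depends on it.
  lcl_steps=>step_create( ).

  IF p_modify = abap_true.
    lcl_steps=>step_modify( ).
  ENDIF.

  IF p_delete = abap_true.
    lcl_steps=>step_delete( ).
  ENDIF.

  IF p_save = abap_true.
    lcl_steps=>step_save( ).
  ENDIF.
```

With such switches, the combination of steps that triggers the fault can be recorded as a selection-screen variant, which may be convenient when the program has to be scheduled as a background job.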

Return to the runtime error screenshot in the fault report: an error of type X is raised on line 103 because the function CRM_IBASE_COMP_GET_DETAIL did not read any IBASE data for the timestamp specified by the input parameters i_date and i_time, so the program terminates by raising an error.

 

Using the call stack of the runtime error, I found the reason why the CRM_IBASE_COMP_GET_DETAIL API did not return any IBASE data. The highlighted CHECK statement on line 53 in the following figure tests whether the incoming timestamp (by default, the timestamp at which the IBASE was created) is less than the valto field of the IBASE header to be read (valto means "valid to", the timestamp at which the validity of the IBASE ends). If it is less, execution continues with the next CHECK on line 54. If it is greater than or equal, the loop body that contains the data-reading logic is exited.

 
Both in background job mode and during the execution of my scaffold program, the timestamp condition on line 53 was not met, so the loop body was exited, CRM_IBASE_COMP_GET_DETAIL read no data, and the fault occurred.

The condition on line 53 can fail in only two ways:

  • Current timestamp > IBASE valto field value
  • Current timestamp = IBASE valto field value

Note that the ABAP timestamp fields involved here have a resolution of one second. For example, 20211024102424 represents 10:24:24 on October 24, 2021.
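To make the effect of such a comparison concrete, here is a small self-contained sketch. It does not reproduce the IBASE source code: the structure ty_head, the table lt_head and the literal values are assumptions chosen for illustration, and only the form of the comparison (incoming timestamp < valto) follows the description above.

```abap
REPORT zcheck_valto_demo.

" Hypothetical demo data: the structure and values below are illustrative
" and are not taken from the IBASE implementation.
TYPES: BEGIN OF ty_head,
         ibase TYPE i,
         valto TYPE timestamp,   " short form: YYYYMMDDhhmmss
       END OF ty_head.
TYPES ty_head_tab TYPE STANDARD TABLE OF ty_head WITH EMPTY KEY.

" Stands in for the incoming timestamp (the creation time).
DATA lv_timestamp TYPE timestamp VALUE '20211024102424'.

DATA(lt_head) = VALUE ty_head_tab(
  ( ibase = 1 valto = '20211024102425' )     " valid until one second later
  ( ibase = 2 valto = '20211024102424' ) ).  " valid until the same second

LOOP AT lt_head ASSIGNING FIELD-SYMBOL(<head>).
  " Inside a loop, CHECK skips the rest of the current loop pass
  " when the condition is false.
  CHECK lv_timestamp < <head>-valto.
  WRITE: / 'Data would be read for entry', <head>-ibase.
ENDLOOP.
```

Only the first entry produces an output line. For the second entry the two timestamps are equal, so the strict comparison is false and the reading logic is skipped, which corresponds to the "Current timestamp = IBASE valto field value" case.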

3. Use the debugger to lock the problem

Although my scaffold program cannot reproduce the fault in single-step debugging mode, it does reproduce it when executed directly. So I executed the scaffold program and, on the runtime error page, clicked the Debugger button on the toolbar to open the debugger and inspect the state of the program at the moment the error was raised:

 
This time, the debugger solved the mystery: the current timestamp is equal to the IBASE valto field value, which causes the read in API CRM_IBASE_COMP_GET_DETAIL to fail and a runtime error to be raised.

 

  • When calling the IBASE creation API, the valfr field of the IBASE header to be created will be assigned with the current time stamp of the system.
  • When calling the IBASE delete API, the valto field of the IBASE header to be deleted will be assigned with the current time stamp of the system.

Why can't this error be reproduced in single step debugging mode? Let's look at a simple sequence diagram.

The horizontal axis represents time. t3 is the value of the <ibinadm>-valto field in the CHECK statement on line 53, and t1 is the value of lv_timestamp in the same statement.

In single-step debugging mode, if we start at the IBASE creation API and step through the code, t3 is always greater than t1, simply because of the time it takes to press the keys.

In background mode, or when the scaffold program runs normally, the IBASE creation, modification and deletion APIs can execute fast enough to complete within the same second. Because the timestamps have a resolution of one second, t3 is then equal to t1, so the CHECK condition on line 53 is not met and no data is read.
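The difference between the two execution modes can be imitated with a short experiment. This is only a sketch under stated assumptions: the report name and the parameter p_pause are hypothetical, lv_t1 and lv_t3 merely play the roles of t1 and t3, and the artificial wait stands in for the time a person needs to step through the code in the debugger.

```abap
REPORT ztimestamp_pause_demo.

" Hypothetical demo: p_pause imitates the delay that a human introduces
" by stepping through the code in the debugger.
PARAMETERS p_pause AS CHECKBOX.

DATA: lv_t1 TYPE timestamp,   " plays the role of t1 (creation timestamp)
      lv_t3 TYPE timestamp.   " plays the role of t3 (valto set on deletion)

START-OF-SELECTION.
  GET TIME STAMP FIELD lv_t1.

  IF p_pause = abap_true.
    " Wait long enough that t3 falls into a later second than t1.
    WAIT UP TO 2 SECONDS.
  ENDIF.

  GET TIME STAMP FIELD lv_t3.

  WRITE: / 't1 =', lv_t1,
         / 't3 =', lv_t3.

  IF lv_t1 < lv_t3.
    WRITE: / 't1 < t3: a check of the form t1 < t3 passes'.
  ELSE.
    WRITE: / 't1 >= t3: a check of the form t1 < t3 fails'.
  ENDIF.
```

With the checkbox selected, the first branch is taken. Without it, both timestamps usually fall into the same second and the second branch is taken, although a run that happens to cross a second boundary still takes the first branch. That occasional difference would be consistent with the observation that the original fault did not occur on every run.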

In other words, the developers of the CRM IBASE API had not considered that an IBASE could be created and deleted within the same second. After all, under normal circumstances a customer cannot create and delete an IBASE in the UI within one second. This scenario can occur only in custom development that calls the IBASE API directly.

Of course, the final solution was not simply to change the < operator in the CHECK statement on line 53 to <=. We carefully evaluated the side effects that such a change might have, discussed them with the developers of the team that reported the fault, and in the end chose a different way to avoid the fault.

Summary of troubleshooting process

Back to the fault analysis process itself: when I first received the fault, I was at a loss for a while, because it could not be reproduced under single-step debugging. Later, I thought of writing a scaffold program to reproduce the fault reliably. This step was the breakthrough in the analysis.

With the scaffold program in place, I first commented out all API calls except IBASE creation, then re-enabled the code for IBASE modification and deletion step by step, and finally narrowed the problem down to the IBASE deletion process.

By triggering the runtime error through direct execution of the scaffold program, I could use the debugger to inspect the variable values at the moment the error was raised, pin the problem down to the timestamp handling logic, and then find the root cause.

This analysis approach is a bit like the troubleshooting method that computer DIY enthusiasts used at the end of the last century and the beginning of this one when a self-built PC failed to start. They kept only the power supply, motherboard and CPU and tried to start the machine. If that succeeded, they added the graphics card, hard disks and other devices one by one. When a newly added device made the system unable to start again, that device was the one causing trouble. At that time, enthusiasts called this the "minimum system method".

The most important step in the whole analysis was to abstract what the background job did, which according to the fault report could not reproduce the fault reliably, into a scaffold program of fewer than 200 lines.

Chapter 5 of Programming Pearls shares an interesting debugging story. A programmer at an IBM research center installed a new workstation and found a strange fault: he could log in to the system only while sitting down, and as soon as he stood up, he could not log in. Do you know how the fault was finally located? Go and read the original book!

 

I hope this article can give you some inspiration for troubleshooting in the field of enterprise management software. Thank you for reading.
 

Tags: Programming software testing abap Software development

Posted on Sat, 23 Oct 2021 18:05:27 -0400 by rainerpl