Post-condition Violation; Element 'A' not found; Failed Expression: anum > -1; RDKit reports an error when parsing a PDB file; PeriodicTab

I ran into a problem when parsing PDB files with RDKit. I was using the decoy sets generated by ZDOCK, and the decoys for one of the proteins would not parse. After a long search I finally found the cause.
The complete error is as follows:
Post-condition Violation
Element 'A' not found
Violation occurred on line 91 in file /tmp/pip-req-build-tzcdahwp/build/temp.linux-x86_64-3.7/rdkit/rdkit/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1

When I hit this error, my first thought was to change the parameters passed to RDKit's PDB reader. That did not help: those parameters only control how the parsed data is cleaned up, whereas this error occurs because the PDB file itself does not conform to the standard format.

from rdkit import Chem
mol = Chem.MolFromPDBFile(path, sanitize=True, removeHs=True, flavor=0, proximityBonding=False)

Since adjusting the parameters cannot fix it, how should it be solved?
Go back to the error message. It says the violation occurred in PeriodicTable.h under the path shown. I went looking for that file on my machine and could not find it: the path is a temporary pip build directory, which is gone once the installation has finished.
However, RDKit is open source, so we can search for PeriodicTable.h directly and read the source online. I put the link below:
https://www.rdkit.org/docs/cppapi/PeriodicTable_8h_source.html
Searching that file for the error message leads to the function that failed:

//! overload
   int getAtomicNumber(const std::string &elementSymbol) const {
     // this little optimization actually makes a measurable difference
     // in molecule-construction time
     int anum = -1;
     if (elementSymbol == "C")
       anum = 6;
     else if (elementSymbol == "N")
       anum = 7;
     else if (elementSymbol == "O")
       anum = 8;
     else {
       STR_UINT_MAP::const_iterator iter = byname.find(elementSymbol);
       if (iter != byname.end()) anum = iter->second;
     }
     POSTCONDITION(anum > -1, "Element '" + elementSymbol + "' not found");
     return anum;
   }
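For readers more comfortable with Python, the logic of this function can be approximated as follows. This is only a simplified analog and not RDKit's API: the names BY_NAME and get_atomic_number are made up for illustration, BY_NAME holds just a handful of elements instead of the full table, and a plain exception stands in for the POSTCONDITION macro.

```python
# Simplified Python analog of the C++ lookup shown above (illustrative only).
# BY_NAME is a tiny, hypothetical stand-in for RDKit's full symbol table.
BY_NAME = {"H": 1, "C": 6, "N": 7, "O": 8, "S": 16}


def get_atomic_number(element_symbol: str) -> int:
    """Return the atomic number for a symbol, or fail like the post-condition."""
    # Unknown symbols leave anum at -1, exactly as in the C++ code.
    anum = BY_NAME.get(element_symbol, -1)
    # Stand-in for POSTCONDITION(anum > -1, "Element '...' not found").
    if anum <= -1:
        raise RuntimeError(f"Element '{element_symbol}' not found")
    return anum
```

Calling get_atomic_number("N") returns 7, while get_atomic_number("A") fails with the same "Element 'A' not found" text that appears in the error above.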

It is C++, which I don't really know, and the iterator code even less!
Still, find the failing check first: it is the POSTCONDITION line, just before the return.
The rest doesn't matter much. The variable name elementSymbol tells us that this function looks up a chemical element symbol, not a residue number. In the standard PDB format the element symbol sits in columns 77-78, so look at those columns in your file and check whether they contain anything that is not an element symbol. In my PDB file, the ASN 272 records end in 1CCPA, and that is the problem: on the neighbouring lines columns 77-78 hold digits (for example the 22 in 1CCP22), but on these lines they hold 'A', and 'A' is not an element. Here is the relevant part of the PDB file, followed by the key line of the function:

ATOM   2206  OE2 GLU A 271       8.458  38.630   1.305  1.00 21.63      1CCP22
ATOM   2207  N   ASN A 272       3.442  40.934   3.546  1.00 24.98      1CCPA 
ATOM   2208  CA  ASN A 272       2.188  41.396   2.919  1.00 27.43      1CCPA 
ATOM   2209  C   ASN A 272       2.123  42.942   2.899  1.00 27.54      1CCPA 
ATOM   2210  O   ASN A 272       2.396  43.575   3.938  1.00 26.12      1CCPA 
ATOM   2211  CB  ASN A 272       0.879  40.878   3.518  1.00 29.24      1CCPA 
ATOM   2212  CG  ASN A 272       0.454  39.484   3.108  1.00 28.89      1CCPA 
ATOM   2213  OD1 ASN A 272       0.765  38.974   2.014  1.00 25.14      1CCPA 
ATOM   2214  ND2 ASN A 272      -0.175  38.874   4.020  1.00 30.27      1CCPA 
ATOM   2215  N   GLY A 273       1.645  43.437   1.757  1.00 26.20      1CCP23
ATOM   2216  CA  GLY A 273       1.424  44.871   1.596  1.00 28.34      1CCP23
ATOM   2217  C   GLY A 273       2.677  45.627   1.146  1.00 30.86      1CCP23
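Spotting such lines by eye is tedious in a large decoy set, so here is a minimal sketch of a scanner. It assumes fixed-width PDB records and reports ATOM/HETATM lines whose columns 77-78 contain text that is not an element symbol. The function name find_bad_element_fields is my own, and skipping blank or purely numeric fields is an assumption based on what I observed (the 1CCP22 lines did not trigger the error).

```python
# All element symbols, used to validate columns 77-78 of ATOM/HETATM records.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split())


def find_bad_element_fields(lines):
    """Return (line_number, field) pairs for suspicious element columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # Columns 77-78 (1-based) are the slice [76:78] in Python.
        field = line[76:78].strip()
        # Assumption: blank or purely numeric fields do not cause the error.
        if not field or field.isdigit():
            continue
        if field.capitalize() not in ELEMENTS:
            bad.append((number, field))
    return bad


# Usage (the file name is hypothetical):
# with open("decoy.pdb") as handle:
#     for number, field in find_bad_element_fields(handle):
#         print(number, repr(field))
```

On the excerpt above, this would flag the eight ASN 272 records with the field 'A' and leave the 1CCP22 and 1CCP23 lines alone.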

STR_UINT_MAP::const_iterator iter = byname.find(elementSymbol);
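This line looks up the symbol in the table of known elements. 'A' is not in that table, so anum stays at -1 and the post-condition fails.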
So the solution is to change 1CCPA to 1CCP, or to rewrite columns 77-78 directly so that they are blank or hold the correct element symbol.
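Editing the file by hand works for one decoy, but not for a whole set. A minimal sketch of the same fix in Python is below. It blanks columns 77-78 on every ATOM/HETATM record, which is only appropriate for files like mine where those columns hold leftover identifier text rather than real element symbols. The function names and the file paths are hypothetical, and it assumes that the parser falls back to the atom name when the element columns are blank, which matches what I saw after changing 1CCPA to 1CCP.

```python
def blank_element_field(line: str) -> str:
    """Blank columns 77-78 of an ATOM/HETATM record; return other lines as-is."""
    if not line.startswith(("ATOM", "HETATM")):
        return line
    body = line.rstrip("\n")
    if len(body) < 77:
        # The record is too short to have an element field at all.
        return line
    newline = line[len(body):]
    # Columns 77-78 (1-based) are the slice [76:78] in Python.
    return body[:76] + "  " + body[78:] + newline


def clean_pdb(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path with the element columns blanked."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(blank_element_field(line))


# Usage (paths are hypothetical):
# clean_pdb("decoy.pdb", "decoy_fixed.pdb")
```

The cleaned file can then be passed to MolFromPDBFile as before. If your files do carry valid element symbols in columns 77-78, do not blank them wholesale; fix only the offending lines instead.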

Tags: Python

Posted on Wed, 06 Oct 2021 18:39:47 -0400 by zkoneffko