I ran into a problem while using RDKit to parse PDB files. I was working with the decoy sets generated by ZDOCK, and the decoys for one particular protein failed to load. After a long search, I finally found the cause.
The complete error is as follows:
Post-condition Violation
Element 'A' not found
Violation occurred on line 91 in file /tmp/pip-req-build-tzcdahwp/build/temp.linux-x86_64-3.7/rdkit/rdkit/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
When I first hit this error, my first thought was to change the parameters passed to RDKit's PDB reader. That did not help: those parameters control how the data is cleaned up after reading, whereas this error is raised because the PDB file itself does not conform to the standard format.
MolFromPDBFile(path,sanitize=True,removeHs=True,flavor=0,proximityBonding=False)
So adjusting parameters cannot fix it. How should it be solved?
Go back and read the error. It says the violation occurred in PeriodicTable.h under the path shown. I went looking for that file at that path, and it does not exist!
However, RDKit is open source, so we can look up the source of PeriodicTable.h directly. Here is the link:
https://www.rdkit.org/docs/cppapi/PeriodicTable_8h_source.html
By searching that file for the error message, we can find the function that failed:
//! overload
int getAtomicNumber(const std::string &elementSymbol) const {
  // this little optimization actually makes a measurable difference
  // in molecule-construction time
  int anum = -1;
  if (elementSymbol == "C")
    anum = 6;
  else if (elementSymbol == "N")
    anum = 7;
  else if (elementSymbol == "O")
    anum = 8;
  else {
    STR_UINT_MAP::const_iterator iter = byname.find(elementSymbol);
    if (iter != byname.end()) anum = iter->second;
  }
  POSTCONDITION(anum > -1, "Element '" + elementSymbol + "' not found");
  return anum;
}
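Look: C++, which I don't know, and pointer-like iterators, which I know even less!
First locate the check that fails: it is the POSTCONDITION call just before the return at the end of the function.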
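For readers who are more at home in Python, here is a rough sketch of what that function does. It is not RDKit code: `BY_NAME` is a deliberately tiny, hypothetical stand-in for RDKit's full symbol table, and `get_atomic_number` is a name I made up for illustration. The C++ fast path for C, N and O is folded into the dictionary lookup.

```python
# Illustrative re-creation of the lookup logic, not RDKit's implementation.
# BY_NAME is a small hypothetical stand-in for the real symbol table.
BY_NAME = {"H": 1, "C": 6, "N": 7, "O": 8, "S": 16}


def get_atomic_number(element_symbol: str) -> int:
    """Return the atomic number for a symbol, or raise if it is unknown."""
    # Start from -1, exactly like `int anum = -1;` in the C++ version.
    anum = BY_NAME.get(element_symbol, -1)
    # Python analogue of POSTCONDITION(anum > -1, ...).
    if anum <= -1:
        raise ValueError(f"Element '{element_symbol}' not found")
    return anum
```

Calling `get_atomic_number("A")` raises an error with the same wording as the message above, because no entry named "A" exists in the table.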
Even without knowing C++, the variable name elementSymbol tells us what this function deals with: the element symbol of an atom, not anything to do with amino acids. In the standard PDB format the element symbol occupies columns 77-78, so look at those columns in your file and check whether they contain anything that is not a valid element symbol. In my PDB file, for example, the records from the second line to the second-to-last line of this residue all end in 1CCPA. That is wrong: the other records end in digits (1CCP22, 1CCP23), while here an 'A' lands in the element column. Now look at the key line of the function (shown below the PDB excerpt).
ATOM   2206  OE2 GLU A 271       8.458  38.630   1.305  1.00 21.63      1CCP22
ATOM   2207  N   ASN A 272       3.442  40.934   3.546  1.00 24.98      1CCPA
ATOM   2208  CA  ASN A 272       2.188  41.396   2.919  1.00 27.43      1CCPA
ATOM   2209  C   ASN A 272       2.123  42.942   2.899  1.00 27.54      1CCPA
ATOM   2210  O   ASN A 272       2.396  43.575   3.938  1.00 26.12      1CCPA
ATOM   2211  CB  ASN A 272       0.879  40.878   3.518  1.00 29.24      1CCPA
ATOM   2212  CG  ASN A 272       0.454  39.484   3.108  1.00 28.89      1CCPA
ATOM   2213  OD1 ASN A 272       0.765  38.974   2.014  1.00 25.14      1CCPA
ATOM   2214  ND2 ASN A 272      -0.175  38.874   4.020  1.00 30.27      1CCPA
ATOM   2215  N   GLY A 273       1.645  43.437   1.757  1.00 26.20      1CCP23
ATOM   2216  CA  GLY A 273       1.424  44.871   1.596  1.00 28.34      1CCP23
ATOM   2217  C   GLY A 273       2.677  45.627   1.146  1.00 30.86      1CCP23
STR_UINT_MAP::const_iterator iter = byname.find(elementSymbol);
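This line looks the symbol up in the table of element names. Is there an element called 'A' in that table? No. So anum stays at -1 and the POSTCONDITION fails.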
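To avoid hunting for such lines by eye, a small script can scan columns 77-78 of every ATOM/HETATM record. This is only a sketch under my own assumptions: `find_bad_element_fields` is a name I made up, and it skips blank or purely numeric fields because the digit-tagged records in my file did not trigger the error.

```python
# All 118 element symbols, used to validate the element field.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni
Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I
Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt
Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr
Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())


def find_bad_element_fields(lines):
    """Yield (line_number, field) for records whose columns 77-78 hold
    letters that do not form a known element symbol."""
    for lineno, line in enumerate(lines, start=1):
        if not line.startswith(("ATOM", "HETATM")):
            continue
        field = line[76:78].strip()  # columns 77-78 (1-based)
        # Assumption: blank or numeric fields are harmless, so skip them.
        if not field or not field.isalpha():
            continue
        if field.capitalize() not in ELEMENTS:
            yield lineno, field


# Usage (the file name is hypothetical):
# with open("decoy.pdb") as fh:
#     for lineno, field in find_bad_element_fields(fh):
#         print(f"line {lineno}: bad element field {field!r}")
```

On the excerpt above, this would flag the ASN 272 records, where the field reads 'A'.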
So the solution is to change 1CCPA to 1CCP, or to rewrite columns 77-78 directly so that they no longer contain an invalid symbol.
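If many decoy files are affected, the repair can be scripted. The sketch below keeps columns 73-76 (the 1CCP part) and overwrites columns 77-78 with an element guessed from the atom name. `fix_element_field` is a hypothetical helper of my own, and the guess (first letter of the atom name) is an assumption that only holds for atoms of standard protein residues; anything after column 78 is dropped.

```python
def fix_element_field(line: str) -> str:
    """Rewrite columns 77-78 of an ATOM record with a guessed element."""
    if not line.startswith("ATOM"):
        return line  # leave every other record untouched
    body = line.rstrip("\n")
    newline = line[len(body):]  # preserve the original line ending
    name = body[12:16]  # atom name, columns 13-16
    # Assumption: standard protein atoms only, so the first letter of the
    # atom name is the element (e.g. " CA " is carbon, " OE2" is oxygen).
    element = next((ch for ch in name if ch.isalpha()), "")
    if not element:
        return line
    # Keep columns 1-76, then write the element right-justified in 77-78.
    return body[:76].ljust(76) + element.rjust(2) + newline


# Usage (file names are hypothetical):
# with open("decoy.pdb") as src, open("decoy_fixed.pdb", "w") as dst:
#     dst.writelines(fix_element_field(line) for line in src)
```

The cleaned file can then be passed to MolFromPDBFile as before.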