2. Parsing XML documents with Python

1. What is an XML document?

Extensible Markup Language, or XML for short, is a subset of Standard Generalized Markup Language (SGML). It is a markup language used to mark up electronic documents so that they have structure. [1]

In computing, markup refers to information symbols that a computer can understand; through markup, computers can process documents that contain many kinds of information, such as articles. XML can be used to tag data and define data types, and it is a source language that lets users define their own markup languages. It is well suited to transmission over the Web, providing a unified way to describe and exchange structured data independently of applications or vendors. It is a cross-platform, content-dependent technology in the Internet environment and an effective tool for handling distributed structured information. In 1998, the W3C released the XML 1.0 specification, which was intended to simplify the transmission of document information on the Internet. [2] A sample movies.xml document is shown below.

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

2. What methods does Python offer for parsing XML?

Python offers three ways to parse XML: SAX, DOM, and ElementTree.

  • 1. SAX (Simple API for XML)
    The Python standard library includes a SAX parser. SAX is typically very fast and does not use much memory when parsing XML.
    However, it is based on a callback mechanism: as it reads the data, it calls handler methods and passes the data to them. This means a handler must be supplied for the data,
    and keeping track of state yourself is difficult.
  • 2. DOM (Document Object Model)
    Compared with SAX, the typical disadvantage of DOM is that it is slower and consumes more memory, because DOM reads the whole XML document into memory and creates an object
    for every node in the tree. The advantage of DOM is that you do not need to track state, because every node knows which node is
    its parent and which nodes are its children. But the DOM API is somewhat cumbersome to use.
  • 3. ElementTree
    ElementTree is like a lightweight DOM with a convenient, friendly API. The code is easy to work with, parsing is fast, and memory consumption is low.

For XML parsing in Python, the ElementTree (ET) module is the first choice: according to an assessment by the author of lxml, DOM is neither easy to use nor efficient, and it is prone to problems.
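
To make the comparison concrete, here is a minimal sketch that extracts the movie titles with each of the three APIs. The shortened SAMPLE string and the names TitleHandler, titles_with_sax, titles_with_dom, and titles_with_elementtree are assumptions made for illustration; they are not part of movies.xml or of any library.

```python
import xml.sax
import xml.dom.minidom
import xml.etree.ElementTree as ET

# A shortened, hypothetical sample in the same shape as movies.xml.
SAMPLE = (
    '<collection shelf="New Arrivals">'
    '<movie title="Enemy Behind"><year>2003</year></movie>'
    '<movie title="Ishtar"><format>VHS</format></movie>'
    '</collection>'
)


class TitleHandler(xml.sax.ContentHandler):
    """SAX: titles arrive through callbacks, so we keep the state ourselves."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, tag, attributes):
        if tag == 'movie':
            self.titles.append(attributes['title'])


def titles_with_sax(text):
    handler = TitleHandler()
    xml.sax.parseString(text.encode('utf-8'), handler)
    return handler.titles


def titles_with_dom(text):
    # DOM: the whole document is loaded, then queried as a tree of nodes.
    document = xml.dom.minidom.parseString(text)
    movies = document.getElementsByTagName('movie')
    return [movie.getAttribute('title') for movie in movies]


def titles_with_elementtree(text):
    # ElementTree: also a tree, but with a shorter, more Pythonic API.
    root = ET.fromstring(text)
    return [movie.get('title') for movie in root.findall('movie')]


sax_titles = titles_with_sax(SAMPLE)
dom_titles = titles_with_dom(SAMPLE)
et_titles = titles_with_elementtree(SAMPLE)
print(sax_titles, dom_titles, et_titles)
```

All three functions return the same list; the difference lies in how much bookkeeping each API leaves to the caller.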

3. Simple code implementation

SAX is an event-driven API that consists of a parser and an event handler. The parser is responsible for reading the XML document. When it encounters the start and end tags of an XML element, it calls the corresponding methods of the event handler, which respond to the event and process the XML data.

The SAX code is as follows:

import xml.sax

# Handler class that parses the movie XML file and reads out its contents
class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Called at the start of every element; print a banner when a "movie" tag begins
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == 'movie':
            print('**********movie***********')

    # Called with the text inside an element; store it in the attribute that matches the current tag
    def characters(self, content):
        if self.CurrentData == 'type':
            self.type = content
        elif self.CurrentData == 'format':
            self.format = content
        elif self.CurrentData == 'year':
            self.year = content
        elif self.CurrentData == 'rating':
            self.rating = content
        elif self.CurrentData == 'stars':
            self.stars = content
        elif self.CurrentData == 'description':
            self.description = content

    # Called at the end of every element; the text has been read by now, so print the stored content
    def endElement(self, tag):
        if self.CurrentData == 'type':
            print('Type:', self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

if __name__ == '__main__':
    # Create a parser object
    parser = xml.sax.make_parser()
    # Turn off namespace processing
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    # Create an instance of the MovieHandler class and register it as the content handler
    handler = MovieHandler()
    parser.setContentHandler(handler)
    # The parser starts parsing the movies.xml file
    parser.parse('movies.xml')
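
SAX is allowed to deliver the text of one element in several characters() calls, so a handler that keeps only the latest chunk can lose data. The following is a minimal sketch of an alternative handler that buffers the chunks and collects each movie into a dictionary instead of printing it. The class name MovieCollector and the shortened SAMPLE string are assumptions made for illustration.

```python
import xml.sax

# A shortened, hypothetical sample in the same shape as movies.xml.
SAMPLE = (
    '<collection shelf="New Arrivals">'
    '<movie title="Enemy Behind">'
    '<type>War, Thriller</type><year>2003</year>'
    '</movie>'
    '<movie title="Ishtar">'
    '<type>Comedy</type><format>VHS</format>'
    '</movie>'
    '</collection>'
)


class MovieCollector(xml.sax.ContentHandler):
    """Collect every movie as a dict of its title and child element texts."""

    def __init__(self):
        super().__init__()
        self.movies = []     # finished movies
        self.current = None  # the movie being read, or None outside <movie>
        self.buffer = []     # text chunks of the element being read

    def startElement(self, tag, attributes):
        if tag == 'movie':
            self.current = {'title': attributes.get('title')}
        self.buffer = []

    def characters(self, content):
        # May be called more than once per element, so append the chunk.
        self.buffer.append(content)

    def endElement(self, tag):
        if tag == 'movie':
            self.movies.append(self.current)
            self.current = None
        elif self.current is not None:
            self.current[tag] = ''.join(self.buffer).strip()
        self.buffer = []


collector = MovieCollector()
xml.sax.parseString(SAMPLE.encode('utf-8'), collector)
for movie in collector.movies:
    print(movie)
```

Returning data rather than printing inside the callbacks keeps the handler reusable, at the cost of holding the collected movies in memory.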

DOM parses the XML data into a tree held in memory and processes the data by operating on that tree.

The DOM code is as follows:

# Parsing XML files with DOM is relatively slow
import xml.dom.minidom

# Open XML document with minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
    print("Root element : %s" % collection.getAttribute("shelf"))

# Get all movies in the collection
movies = collection.getElementsByTagName("movie")

# Print details for each movie
for movie in movies:
    print("*****Movie*****")
    if movie.hasAttribute("title"):
        print("Title: %s" % movie.getAttribute("title"))
    # Use names that do not shadow the built-ins type() and format()
    movie_type = movie.getElementsByTagName('type')[0]
    print("Type: %s" % movie_type.childNodes[0].data)
    movie_format = movie.getElementsByTagName('format')[0]
    print("Format: %s" % movie_format.childNodes[0].data)
    rating = movie.getElementsByTagName('rating')[0]
    print("Rating: %s" % rating.childNodes[0].data)
    description = movie.getElementsByTagName('description')[0]
    print("Description: %s" % description.childNodes[0].data)
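
The loop above indexes [0] on every lookup, which raises IndexError when a movie lacks the requested element; in movies.xml, for example, Trigun and Ishtar have no year. A minimal sketch of a guard for that case follows. The helper name child_text, its default parameter, and the shortened SAMPLE string are assumptions made for illustration, not part of xml.dom.minidom.

```python
import xml.dom.minidom

# A shortened, hypothetical sample in the same shape as movies.xml.
SAMPLE = (
    '<collection shelf="New Arrivals">'
    '<movie title="Enemy Behind"><year>2003</year></movie>'
    '<movie title="Ishtar"><format>VHS</format></movie>'
    '</collection>'
)


def child_text(parent, tag, default=None):
    """Return the text of the first <tag> under parent, or default if absent."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.data


document = xml.dom.minidom.parseString(SAMPLE)
years = {}
for movie in document.getElementsByTagName('movie'):
    title = movie.getAttribute('title')
    years[title] = child_text(movie, 'year', default='unknown')
print(years)
```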

ElementTree also parses XML into a tree structure. Each tag is a node, so the whole XML document is made up of basic nodes of the form <tag>text</tag>, and we parse the text content of the nodes layer by layer. ET has two classes for this purpose: ElementTree represents the entire XML document as a tree, and Element represents a single node in that tree. Interaction with the whole document (reading and writing files) is usually done at the ElementTree level, while interaction with a single XML element and its children is done at the Element level.
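
The split between the two classes can be sketched as follows: queries and edits happen on Element objects, while writing and re-reading the document goes through ElementTree. The shortened SAMPLE string, the variable names, and the "checked" attribute are assumptions made for illustration, and an in-memory buffer stands in for a real file.

```python
import io
import xml.etree.ElementTree as ET

# A shortened, hypothetical sample in the same shape as movies.xml.
SAMPLE = (
    '<collection shelf="New Arrivals">'
    '<movie title="Enemy Behind"><year>2003</year><rating>PG</rating></movie>'
    '<movie title="Transformers"><year>1989</year><rating>R</rating></movie>'
    '</collection>'
)

# Element level: fromstring() returns the root Element.
root = ET.fromstring(SAMPLE)
pg_titles = [
    movie.get('title')
    for movie in root.findall('movie')
    if movie.findtext('rating') == 'PG'
]
first_year = root.find('movie').findtext('year')

# Element level: change a single node (hypothetical "checked" attribute).
for movie in root.iter('movie'):
    movie.set('checked', 'yes')

# ElementTree level: wrap the root to write and re-read the whole document.
tree = ET.ElementTree(root)
buffer = io.BytesIO()
tree.write(buffer, encoding='utf-8', xml_declaration=True)
reloaded = ET.parse(io.BytesIO(buffer.getvalue())).getroot()

print(pg_titles, first_year, reloaded.find('movie').attrib)
```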

The ElementTree code is as follows:

# Parse the XML file with ElementTree, which uses less memory
import xml.etree.ElementTree as ET

# Parse the file and get the root node
tree = ET.parse('movies.xml')
root = tree.getroot()
print('*********movie*********')
# Find all first-level movie nodes
movie_nodes = root.findall('movie')
for node in movie_nodes:
    # Print the tag and title of each movie node
    print('{}: {!r}'.format(node.tag, node.get('title')))
    # Iterate over every child node; to visit only one kind of child,
    # use node.findall('year') (or 'type', 'rating', etc.) instead
    for n in node:
        # The node's value is available as n.text
        print('{}:{}'.format(n.tag, n.text))
    print('\r\n')

The output is as follows:

*********movie*********
movie: 'Enemy Behind'
type:War, Thriller
format:DVD
year:2003
rating:PG
stars:10
description:Talk about a US-Japan war


movie: 'Transformers'
type:Anime, Science Fiction
format:DVD
year:1989
rating:R
stars:8
description:A schientific fiction


movie: 'Trigun'
type:Anime, Action
format:DVD
episodes:4
rating:PG
stars:10
description:Vash the Stampede!


movie: 'Ishtar'
type:Comedy
format:VHS
rating:PG
stars:2
description:Viewable boredom



//Process ended, exit code 0

Tags: xml Python less

Posted on Fri, 21 Feb 2020 05:35:17 -0500 by Spekta