XML Extensible Markup Language

XML concepts

XML stands for Extensible Markup Language.
Extensible: tag names are user-defined rather than predefined.

Function:

Store data
1. As a configuration file
2. As a format for transmitting data over a network

The differences between XML and HTML:

1. XML tags are user-defined, while HTML tags are predefined
2. XML syntax is strict, while HTML syntax is loose
3. XML is used to store data, while HTML is used to display data

Quick start

XML documents use the .xml file extension
The document declaration, if present, must be on the first line of the document
An XML document has exactly one root tag
Attribute values must be enclosed in quotation marks (either single or double)
Tags must be closed correctly
XML tag names are case sensitive

<?xml version = '1.0'?>

<users>
	<user id='1'>
		<name>zhangsan</name>
		<age>23</age>
		<gender>male</gender>
	</user>
	
	<user id='2'>
		<name>lisi</name>
		<age>24</age>
		<gender>female</gender>
	</user>
</users>
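
The rules above can be checked mechanically: a parser rejects any document that breaks them. The following is a minimal sketch, assuming the JDK's built-in JAXP parser (javax.xml.parsers); the class name WellFormedCheck and the sample strings are made up for illustration and are not part of the original tutorial.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    /** Returns true if the given text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler stays silent instead of printing parse errors to stderr;
            // fatal (well-formedness) errors are still thrown as SAXException.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // the document breaks one of the syntax rules
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<users><user id='1'><name>zhangsan</name></user></users>", // follows the rules
            "<users></users><users></users>",                           // two root tags
            "<users><user id=1></user></users>",                        // unquoted attribute value
            "<users><user></users>",                                    // tag not closed
            "<users><Name>zhangsan</name></users>"                      // case mismatch
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```

Only the first sample should be accepted; each of the others violates exactly one rule from the list.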
Components

1. Document declaration
Format: <?xml attribute-list ?>
Attribute list:
version: the version number; this attribute is required
encoding: the character encoding. It tells the parsing engine which character set the current document uses. Without it, an XML parser assumes UTF-8 (or UTF-16 when a byte order mark is present)
standalone: whether the document is independent. yes: does not depend on other files; no: depends on other files

2. Processing instruction (for awareness only): can attach a CSS stylesheet to the document

<?xml version="1.0" encoding="utf-8" standalone='no'?>
<?xml-stylesheet type="text/css" href="a.css"?>

<users>
<user id='1'>
    <name>zhangsan</name>
    <age>23</age>
    <gender>male</gender>
</user>

<user id='2'>
    <name>lisi</name>
    <age>24</age>
    <gender>female</gender>
</user>
</users>
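
To see what a parser actually takes from the document declaration, the values can be read back after parsing. This is a small sketch, assuming the JDK's built-in DOM parser; the class name DeclarationInfo and its method are hypothetical helpers, not something from the original tutorial.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclarationInfo {

    /** Returns {version, encoding, standalone} as declared in the document. */
    static String[] readDeclaration(String xml) {
        try {
            // Parse from bytes so that the parser itself has to honor the declared encoding.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return new String[] {
                doc.getXmlVersion(),
                doc.getXmlEncoding(),
                String.valueOf(doc.getXmlStandalone())
            };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone='no'?><users/>";
        String[] info = readDeclaration(xml);
        System.out.println("version    = " + info[0]);
        System.out.println("encoding   = " + info[1]);
        System.out.println("standalone = " + info[2]);
    }
}
```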

3. Tags: tag names are user-defined
Rules: a name cannot start with a digit or a punctuation mark (an underscore is allowed)
A name cannot start with the letters xml (or XML, Xml, etc.)
A name cannot contain spaces
4. Attributes: the value of an id attribute must be unique
5. Text: CDATA section: the data in this section is treated as plain text and kept as is

<code>
    <![CDATA[
        if(a < b && a > c){}
    ]]>
</code>
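
A quick way to convince yourself that a CDATA section is just another way of writing text is to parse it and compare it with the escaped form. A minimal sketch, assuming the JDK's built-in DOM parser; CdataDemo and textOf are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataDemo {

    /** Parses the XML and returns the trimmed text content of the root tag. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Inside CDATA the characters < > & need no escaping.
        String withCdata = "<code><![CDATA[ if(a < b && a > c){} ]]></code>";
        // Without CDATA the same text has to be written with entities.
        String escaped = "<code>if(a &lt; b &amp;&amp; a &gt; c){}</code>";

        System.out.println(textOf(withCdata));
        System.out.println(textOf(withCdata).equals(textOf(escaped)));
    }
}
```

Both documents carry the same text; the CDATA form is simply easier to read and write when the text contains many special characters.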

Constraints

Constraints specify the writing rules for XML documents. What you need to be able to do:
1. Import constraint documents into XML
2. Read simple constraint documents

Classification

1. DTD: a simple constraint technology
2. Schema: a more complex and more expressive constraint technology

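DTD (.dtd)

Importing a DTD into an XML document

Internal DTD: the constraint rules are defined inside the XML document itself

External DTD: the constraint rules are defined in a separate .dtd file
Local: <!DOCTYPE root-tag-name SYSTEM "location of the dtd file">
Network: <!DOCTYPE root-tag-name PUBLIC "name of the dtd file" "URL of the dtd file">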
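
To get a feel for what a DTD does and does not catch, here is a minimal sketch that validates a document against an internal DTD with the JDK's built-in parser. The DTD rules, the class name DtdValidationDemo and the sample data are assumptions made for this example only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidationDemo {

    // Internal DTD: the rules sit in the DOCTYPE at the top of the document.
    static final String DTD =
            "<?xml version=\"1.0\"?>\n"
          + "<!DOCTYPE users [\n"
          + "  <!ELEMENT users (user*)>\n"
          + "  <!ELEMENT user (name, age)>\n"
          + "  <!ATTLIST user id ID #REQUIRED>\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "  <!ELEMENT age (#PCDATA)>\n"
          + "]>\n";

    /** Returns the validation problems found; an empty list means the document is valid. */
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    problems.add(e.getMessage()); // validity errors are reported here
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            problems.add(e.getMessage()); // not even well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        String valid = DTD + "<users><user id=\"u1\"><name>zhangsan</name><age>23</age></user></users>";
        String missingAge = DTD + "<users><user id=\"u1\"><name>zhangsan</name></user></users>";
        String oddAge = DTD + "<users><user id=\"u1\"><name>zhangsan</name><age>1000</age></user></users>";

        System.out.println(validate(valid));      // no problems expected
        System.out.println(validate(missingAge)); // structure error expected
        System.out.println(validate(oddAge));     // no problems: a DTD cannot limit the value
    }
}
```

The DTD catches the missing age tag, but it accepts age 1000 because #PCDATA places no limit on the text.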

Schema (.xsd)

Disadvantage of DTD: it cannot constrain the content itself, so an unreasonable value such as age=1000 still passes validation. Schema can restrict data types and value ranges.
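
As a rough illustration of the age=1000 case, the sketch below validates a tiny document against an XSD using the JDK's javax.xml.validation package. The schema, including the 0 to 150 range, and the class name SchemaValidationDemo are assumptions chosen for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SchemaValidationDemo {

    // Hypothetical schema: age must be an integer between 0 and 150.
    static final String AGE_XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xs:element name=\"age\">"
          + "<xs:simpleType><xs:restriction base=\"xs:integer\">"
          + "<xs:minInclusive value=\"0\"/><xs:maxInclusive value=\"150\"/>"
          + "</xs:restriction></xs:simpleType>"
          + "</xs:element>"
          + "</xs:schema>";

    /** Returns true if the document satisfies AGE_XSD. */
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(AGE_XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            // With no custom error handler, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document violates the schema or is not well-formed
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<age>23</age>"));   // within the range
        System.out.println(isValid("<age>1000</age>")); // outside the range
        System.out.println(isValid("<age>abc</age>"));  // not an integer
    }
}
```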

Parsing

Parsing means operating on an XML document by reading its data into memory

Working with XML documents

Parse (read): read the data in the document into memory
Write: save the data in memory to an XML document for persistent storage
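
The read and write directions can be sketched together: parse a document into memory, change it, and serialize it back. This is a minimal sketch using the JDK's built-in DOM and Transformer APIs; the class name ParseAndWrite, the addUser helper and the sample user are illustrative assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ParseAndWrite {

    /** Parses the XML, appends a new user tag to the root, and returns the new XML text. */
    static String addUser(String xml, String id, String name) {
        try {
            // Parse (read): document text -> DOM tree in memory
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Modify the tree in memory
            Element user = doc.createElement("user");
            user.setAttribute("id", id);
            Element nameTag = doc.createElement("name");
            nameTag.setTextContent(name);
            user.appendChild(nameTag);
            doc.getDocumentElement().appendChild(user);

            // Write: DOM tree in memory -> document text
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<users><user id=\"1\"><name>zhangsan</name></user></users>";
        System.out.println(addUser(xml, "3", "wangwu"));
    }
}
```

For persistent storage, the StreamResult can wrap a java.io.File instead of a StringWriter.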

How to parse XML:

DOM: loads the whole markup document into memory at once and builds a DOM tree

Advantages: easy to work with; supports all CRUD operations on the document
Disadvantages: uses a lot of memory for large documents

SAX: reads the document sequentially and is event driven, discarding each part once it has been handled

Advantages: uses very little memory
Disadvantages: read-only; nodes cannot be added, deleted or modified
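
The difference between the two approaches is easier to see side by side. Below is a minimal sketch that collects the text of every name tag, once with DOM and once with SAX, using the JDK's built-in JAXP parsers. The class name DomVsSax and the sample document are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DomVsSax {

    static final String USERS =
            "<users><user id='1'><name>zhangsan</name></user>"
          + "<user id='2'><name>lisi</name></user></users>";

    /** DOM: the whole tree is built first, then queried. */
    static List<String> namesWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("name");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** SAX: the parser fires events while reading; nothing is kept unless the handler keeps it. */
    static List<String> namesWithSax(String xml) {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a name tag

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("name".equals(qName)) text = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("name".equals(qName)) {
                    names.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesWithDom(USERS));
        System.out.println(namesWithSax(USERS));
    }
}
```

Both methods return the same names; the SAX version never holds more than the current piece of text in memory, which is why it suits large documents.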

Common XML parsers:

JAXP: the parser provided by Sun; supports DOM and SAX
DOM4J: an excellent, widely used parser
Jsoup: a Java HTML parser that can parse a URL, a file or HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors and jQuery-like methods
PULL: the built-in parser of the Android operating system; it reads in a streaming fashion, like SAX

Jsoup

1. Import the jar package
2. Get the Document object
3. Get the corresponding tag as an Element object
4. Get the data

student.xml

<?xml version="1.0" encoding="UTF-8" ?>

<students>
    <student number="heima_001">
        <name>Bob</name>
        <age>18</age>
        <sex>male</sex>
    </student>

    <student number="heima_002">
        <name>Alice</name>
        <age>18</age>
        <sex>male</sex>
    </student>
</students>

package zg.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

/*
 * Jsoup quick start
 * 1. Import the jar package
 */
public class jsoupDemo1 {
    public static void main(String[] args) throws IOException {
        // 2. Get the Document object from the XML file
        // 2.1 Get the path of student.xml: class object -> class loader -> resource URL -> path string
        String path = jsoupDemo1.class.getClassLoader().getResource("zg/jsoup/student.xml").getPath();
        // 2.2 Parse the XML file: load it into memory and build the DOM tree (Document)
        Document document = Jsoup.parse(new File(path), "UTF-8");
        // 3. Get the Element objects for the name tag
        Elements elements = document.getElementsByTag("name");
        System.out.println(elements.size());
        // 3.1 Get the first name Element
        Element element = elements.get(0);
        // 3.2 Get its text data
        String name = element.text();
        System.out.println(name);

    }
}

Jsoup objects

1. Jsoup: a utility class that parses HTML or XML documents and returns a Document
parse: parses an HTML or XML document and returns a Document (these overloads use Jsoup's HTML parser)
parse(File in, String charsetName): parses an XML or HTML file
parse(String html): parses an XML or HTML string
parse(URL url, int timeoutMillis): fetches the HTML or XML document at the given network path and parses it

package zg.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;
import java.net.URL;

/*
 * Jsoup class: the parse overloads
 */
public class jsoupDemo2 {
    public static void main(String[] args) throws IOException {
        // 2. Get the Document object from the XML file
        // 2.1 Get the path of student.xml: class object -> class loader -> resource URL -> path string
        String path = jsoupDemo2.class.getClassLoader().getResource("zg/jsoup/student.xml").getPath();
        // 2.2 parse(File in, String charsetName): load the file into memory and build the DOM tree (Document)
        Document document = Jsoup.parse(new File(path), "UTF-8");
        System.out.println(document);// printing a Document outputs the document as a string

        //parse (String html): parses xml or html strings
        String str = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n" +
                "\n" +
                "<students>\n" +
                "    <student number=\"heima_001\">\n" +
                "        <name>Bob</name>\n" +
                "        <age>18</age>\n" +
                "        <sex>male</sex>\n" +
                "    </student>\n" +
                "\n" +
                "    <student number=\"heima_002\">\n" +
                "        <name>Alice</name>\n" +
                "        <age>18</age>\n" +
                "        <sex>male</sex>\n" +
                "    </student>\n" +
                "</students>\n";
        Document document1 = Jsoup.parse(str);// also returns a Document object
        System.out.println(document1);// a string is parsed in the same way

        // parse(URL url, int timeoutMillis): fetches the HTML or XML document at the given network path
        URL url = new URL("https://editor.csdn.net/md?articleId=120723799"); // a resource path on the network
        Document document2 = Jsoup.parse(url, 1000);
        System.out.println(document2);// the parsed result is an HTML document

    }
}

2. Document: the document object. It represents the DOM tree in memory
Get Element objects
getElementById(String id): gets the unique Element with the given id attribute value
getElementsByTag(String tagName): gets the collection of Elements with the given tag name
getElementsByAttribute(String key): gets the collection of Elements that have the given attribute
getElementsByAttributeValue(String key, String value): gets the collection of Elements whose given attribute has the given value

package zg.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

/*
* Document/Element object*/
public class jsoupDemo3 {
    public static void main(String[] args) throws IOException {
        // 2. Get the Document object from the XML file
        // 2.1 Get the path of student.xml: class object -> class loader -> resource URL -> path string
        String path = jsoupDemo3.class.getClassLoader().getResource("zg/jsoup/student.xml").getPath();
        Document document = Jsoup.parse(new File(path), "UTF-8");
        // 3. Get Element objects
        // 3.1 Get all student elements
        Elements elements = document.getElementsByTag("student");
        System.out.println(elements);
        System.out.println("-----------");


        // 3.2 Get the elements that have an id attribute
        Elements id = document.getElementsByAttribute("id");
        System.out.println(id);
        System.out.println("-----------");
        // getElementById(String id): gets the unique element with this id value (null if there is none)
        Element zgdaren = document.getElementById("zgdaren");
        System.out.println(zgdaren);
        System.out.println("-----------");

        // 3.3 Get the elements whose number attribute value is heima_001
        Elements elements1 = document.getElementsByAttributeValue("number", "heima_001");
        System.out.println(elements1);


    }
}

3. Elements: a collection of Element objects; it can be used like an ArrayList<Element>
4. Element: the element object
Get child Element objects
getElementById(String id): gets the unique Element with the given id attribute value
getElementsByTag(String tagName): gets the collection of Elements with the given tag name
getElementsByAttribute(String key): gets the collection of Elements that have the given attribute
getElementsByAttributeValue(String key, String value): gets the collection of Elements whose given attribute has the given value
Get attribute values
String attr(String key): gets the attribute value for the given attribute name
Get text content
String text(): gets the plain text of the element and all of its child tags
String html(): gets the entire content of the tag body as a string (including the markup and text of child tags)

package zg.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

/*
 * Element object
 */
public class jsoupDemo4 {
    public static void main(String[] args) throws IOException {
        // 2. Get the Document object from the XML file
        // 2.1 Get the path of student.xml: class object -> class loader -> resource URL -> path string
        String path = jsoupDemo4.class.getClassLoader().getResource("zg/jsoup/student.xml").getPath();
        // 2.2 Parse the XML file: load it into memory and build the DOM tree (Document)
        Document document = Jsoup.parse(new File(path), "UTF-8");
        // Get child Element objects
        // Through the Document object: gets every name tag in the document
        Elements name = document.getElementsByTag("name");
        System.out.println(name.size());//2
        System.out.println("--------");
        // Through an Element object: gets only the name tags inside that element
        Element element_student = document.getElementsByTag("student").get(0);
        Elements name1 = element_student.getElementsByTag("name");
        System.out.println(name1.size());//1
        System.out.println("----------");

        // String attr(String key): gets the attribute value for the given attribute name
        // Get the number attribute of the student element
        String number = element_student.attr("number");
        System.out.println(number);//heima_001
        System.out.println("----------");

        // String text(): gets the text content
        String text = name1.text();
        System.out.println(text);
        System.out.println("----------");
        // If the name tag contained child tags instead of plain text, html() would return those child tags as a markup string,
        // while text() would return only the plain text of all child tags
        // String html(): gets the entire content of the tag body (including the markup of child tags)
        String html = name1.html();
        System.out.println(html);



    }
}

5. Node: the node object is the parent class of Document and Element

Jsoup shortcut queries

selector: CSS-style selectors

Elements select(String cssQuery): returns the elements that match a CSS-style selector

package zg.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

/*
 * Selector queries
 */
public class jsoupDemo5 {
    public static void main(String[] args) throws IOException {
        // 2. Get the Document object from the XML file
        // 2.1 Get the path of student.xml: class object -> class loader -> resource URL -> path string
        String path = jsoupDemo5.class.getClassLoader().getResource("zg/jsoup/student.xml").getPath();
        // 2.2 Parse the XML file: load it into memory and build the DOM tree (Document)
        Document document = Jsoup.parse(new File(path), "UTF-8");
        // 3. Query the name tags

        Elements name = document.select("name");
        System.out.println(name);
        System.out.println("-----------");
        // 4. Query the element whose id value is zgdaren
        Elements select = document.select("#zgdaren");
        System.out.println(select);
        System.out.println("------------");
        // 5. Get the age child tag of the student tag whose number attribute value is heima_001
        // 5.1 Get the student tag whose number attribute value is heima_001
        Elements select1 = document.select("student[number='heima_001']");
        System.out.println(select1);

        // 5.2 Get the age child tag of that student tag
        Elements select2 = document.select("student[number='heima_001'] > age");
        System.out.println(select2);


    }
}

XPath: path expressions

XPath is the XML Path Language, a language for locating parts of an XML document (XML is a subset of the Standard Generalized Markup Language)

Using XPath with Jsoup requires importing an additional jar package (JsoupXpath, the cn.wanghaomiao.xpath package imported in the demo)
Refer to the XPath syntax reference on W3School to write the queries

package zg.jsoup;

import cn.wanghaomiao.xpath.exception.XpathSyntaxErrorException;
import cn.wanghaomiao.xpath.model.JXDocument;
import cn.wanghaomiao.xpath.model.JXNode;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;
import java.util.List;

/*
 * XPath queries
 */
public class jsoupDemo6 {
    public static void main(String[] args) throws IOException, XpathSyntaxErrorException {
        // 2. Get the Document object from the XML file
        // 2.1 Get the path of student.xml: class object -> class loader -> resource URL -> path string
        String path = jsoupDemo6.class.getClassLoader().getResource("zg/jsoup/student.xml").getPath();
        // 2.2 Parse the XML file: load it into memory and build the DOM tree (Document)
        Document document = Jsoup.parse(new File(path), "UTF-8");
        // Jsoup's own Document does not support XPath syntax
        // 3. Create a JXDocument object from the Document object
        JXDocument jxDocument = new JXDocument(document);
        //4. Query with xpath syntax
        //4.1 query all student Tags
        List<JXNode> jxNodes = jxDocument.selN("//student");
        for (JXNode jxNode : jxNodes) {
            System.out.println(jxNode);
        }
        System.out.println("---------");
        //4.2 query the name tag under all student tags
        List<JXNode> jxNodes1 = jxDocument.selN("//student/name");
        for (JXNode jxNode : jxNodes1) {
            System.out.println(jxNode);
        }
        System.out.println("---------");
        // 4.3 Query the name tags under student that have an id attribute
        List<JXNode> jxNodes2 = jxDocument.selN("//student/name[@id]");
        for (JXNode jxNode : jxNodes2) {
            System.out.println(jxNode);
        }
        System.out.println("---------");
        // 4.4 Query the name tags under student whose id attribute value is zgdaren
        List<JXNode> jxNodes3 = jxDocument.selN("//student/name[@id='zgdaren']");
        for (JXNode jxNode : jxNodes3) {
            System.out.println(jxNode);
        }


    }
}
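
The same kind of XPath expressions can also be tried without any extra jar, since the JDK ships an XPath engine in javax.xml.xpath. The sketch below is an alternative to the Jsoup-based demo rather than a replacement for it; the class name XPathDemo and the select helper are hypothetical, and the document is the student.xml shown earlier, inlined as a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {

    static final String STUDENTS =
            "<students>"
          + "<student number=\"heima_001\"><name>Bob</name><age>18</age><sex>male</sex></student>"
          + "<student number=\"heima_002\"><name>Alice</name><age>18</age><sex>male</sex></student>"
          + "</students>";

    /** Evaluates the XPath expression and returns the text of every matching node. */
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // All name tags under student tags
        System.out.println(select(STUDENTS, "//student/name"));
        // The name tag of the student whose number attribute is heima_002
        System.out.println(select(STUDENTS, "//student[@number='heima_002']/name"));
        // name tags that carry an id attribute: none in this sample document
        System.out.println(select(STUDENTS, "//student/name[@id]"));
    }
}
```

Note that the sample student.xml has no id attributes, so queries that filter on @id return nothing for it.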

Tags: xml html crawler

Posted on Wed, 13 Oct 2021 17:19:41 -0400 by FourthChapter