Summary of Scrapy Selector usage and xpath syntax

Preparation

html example:

<?xml version="1.0" encoding="UTF-8"?>
<html>
<head>
    <title>text</title>
</head>

<body>

<div class="cdiv">
    <p class="cp1 section">test p1 <span>next p1</span></p>
    <ul>
        <li id="0">1</li>
        <li id="1">2</li>
        <li id="2">3</li>
    </ul>
</div>

<div class="cdiv1">
     <p class="cp2">test p2 <span>next p2</span></p>
     <ul>
        <li id="3">4</li>
        <li id="4">5</li>
        <li id="5">6</li>
    </ul>
</div>

<p class="item">test in p </p>

<li id="6" class="item cli-1">7</li>
<li id="7" class="item cli-2">8</li>


</body>

</html>

Save the example as test.html.

Create a Python file containing the following code:

from scrapy.selector import Selector

doc = ''
with open('./test.html', 'r') as f:
    doc = f.read()

sel = Selector(text=doc)

All of the sample code below is appended to this file.

Main methods of Selector

Get the selected nodes as strings

get(): take the first node in the selected node list and return it as a string; returns None if the list is empty.

getall(): take all nodes in the selected node list and return them as a list of strings.

Example:

#Get the selected node string
res = sel.css('.cdiv').xpath(".//li").get()
print(res)
res = sel.css('.cdiv').xpath(".//li").getall()
print(res)

Result:

<li id="0">1</li>
['<li id="0">1</li>', '<li id="1">2</li>', '<li id="2">3</li>']

Match with regular expression

re(regex): apply a regular expression to the string form of each selected node and return a list of the matching strings.

Example:

res = sel.css('body .item').re(r'cli\-\d+')
print(res)

Result:

['cli-1', 'cli-2']

Selecting nodes using xpath expressions

xpath(query): select nodes with an xpath expression and return a SelectorList containing the selected nodes

Selecting nodes using css selectors

css(query): select nodes with a css selector expression and return a SelectorList containing the selected nodes

xpath syntax

Select all descendant nodes

//{node}: select all nodes with the tag {node} anywhere in the document

//{node}//{node1}: select all nodes with the tag {node1} that have an ancestor with the tag {node}

Example 1:

res = sel.xpath("//li")
print(res)

Result:

[<Selector xpath='//li' data='<li id="0">1</li>'>, <Selector xpath='//li' data='<li id="1">2</li>'>, <Selector xpath='//li' data='<li id="2">3</li>'>, <Selector xpath='//li' data='<li id="3">4</li>'>, <Selector xpath='//li' data='<li id="4">5</li>'>, <Selector xpath='//li' data='<li id="5">6</li>'>, <Selector xpath='//li' data='<li id="6" class="item cli-1">7</li>'>, <Selector xpath='//li' data='<li id="7" class="item cli-2">8</li>'>]

Example 2

res = sel.xpath("//ul//li")
print(res)

Result

[<Selector xpath='//ul//li' data='<li id="0">1</li>'>, <Selector xpath='//ul//li' data='<li id="1">2</li>'>, <Selector xpath='//ul//li' data='<li id="2">3</li>'>, <Selector xpath='//ul//li' data='<li id="3">4</li>'>, <Selector xpath='//ul//li' data='<li id="4">5</li>'>, <Selector xpath='//ul//li' data='<li id="5">6</li>'>]

Select direct child

{node}/{node1}: select the node labeled {node1} from the direct children of {node}

Example 1:

res = sel.xpath("//body/li")
print(res)

Result:

[<Selector xpath='//body/li' data='<li id="6" class="item cli-1">7</li>'>, <Selector xpath='//body/li' data='<li id="7" class="item cli-2">8</li>'>]

Select the nth child from the list of selected children

//{node}[n]: group the matching nodes into one list per parent ([list1, list2, ...]), then select the nth node from each list; a list with fewer than n nodes is skipped. Counting starts at 1.

(//{node})[n]: put all the matching nodes into a single list, then select the nth node from that list

Example 1:

res = sel.xpath("//li[1]")
print(res)
res = sel.xpath("//li[3]")
print(res)

Result:

[<Selector xpath='//li[1]' data='<li id="0">1</li>'>, <Selector xpath='//li[1]' data='<li id="3">4</li>'>, <Selector xpath='//li[1]' data='<li id="6" class="item cli-1">7</li>'>]
[<Selector xpath='//li[3]' data='<li id="2">3</li>'>, <Selector xpath='//li[3]' data='<li id="5">6</li>'>]

Example 2

res = sel.xpath("(//li)[1]")
print(res)
res = sel.xpath("(//li)[3]")
print(res)

Result:

[<Selector xpath='(//li)[1]' data='<li id="0">1</li>'>]
[<Selector xpath='(//li)[3]' data='<li id="2">3</li>'>]

Use node attributes as selection criteria

{node}[@{attr}='{val}']: the selected node must have an attribute named {attr} whose value is exactly {val}

{node}[contains(@{attr}, '{val}')]: the selected node must have an attribute named {attr} whose value contains {val} as a substring

Example 1

res = sel.xpath("//p[@class='cp1']")
print(res)
res = sel.xpath("//p[@class='cp2']")
print(res)
res = sel.xpath("//p[contains(@class, 'cp1')]")
print(res)

Result

[]
[<Selector xpath="//p[@class='cp2']" data='<p class="cp2">test p2 <span>next p2<...'>]
[<Selector xpath="//p[contains(@class, 'cp1')]" data='<p class="cp1 section">test p1 <span>...'>]

When the condition tests whether the class attribute includes a given class name, a css selector is a simpler way to write it:

Example 2:

res = sel.css("p.cp1")
print(res)

Result:

[<Selector xpath="descendant-or-self::p[@class and contains(concat(' ', normalize-space(@class), ' '), ' cp1 ')]" data='<p class="cp1 section">test p1 <span>...'>]

Extract the value of a node attribute

{node}/@{attr}: extract the value of the attribute named {attr} from each selected node

Example 1:

res = sel.xpath("//p/@class")
print(res)
print("\n")

Result

[<Selector xpath='//p/@class' data='cp1 section'>, <Selector xpath='//p/@class' data='cp2'>, <Selector xpath='//p/@class' data='item'>]

Extract text content from nodes

{node}/text(): extract the text directly inside the selected node, excluding the text of its child nodes

{node}//text(): extract all text inside the selected node, including the text of its descendant nodes

Example 1:

res = sel.xpath("//p//text()")
print(res)
res = sel.xpath("//p/text()")
print(res)

Result:

[<Selector xpath='//p//text()' data='test p1 '>, <Selector xpath='//p//text()' data='next p1'>, <Selector xpath='//p//text()' data='test p2 '>, <Selector xpath='//p//text()' data='next p2'>, <Selector xpath='//p//text()' data='test in p '>]
[<Selector xpath='//p/text()' data='test p1 '>, <Selector xpath='//p/text()' data='test p2 '>, <Selector xpath='//p/text()' data='test in p '>]

Using variables in xpath expressions

In an xpath expression, $varname refers to a variable, similar to bash; its value is passed to xpath() as a keyword argument with the same name

Example 1:

#Use variable $val
res = sel.xpath("//li[@id=$val]", val='1')
print(res)
res = sel.xpath("//li[@id=$val]", val='3')
print(res)
res = sel.xpath("//li[@id=$val]", val='6')
print(res)

Result:

[<Selector xpath='//li[@id=$val]' data='<li id="1">2</li>'>]
[<Selector xpath='//li[@id=$val]' data='<li id="3">4</li>'>]
[<Selector xpath='//li[@id=$val]' data='<li id="6" class="item cli-1">7</li>'>]

Posted on Tue, 07 Apr 2020 06:14:04 -0400 by Maeltar