.. _topics-selectors:
=====
Usage
=====
Create a :class:`~parsel.selector.Selector` object for your input text.
For HTML or XML, use `CSS`_ or `XPath`_ expressions to select data::
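
    >>> from parsel import Selector
    >>> html_text = "<html><body><h1>Hello, Parsel!</h1></body></html>"
    >>> html_selector = Selector(text=html_text)
    >>> html_selector.css('h1')
    [<Selector query='descendant-or-self::h1' data='<h1>Hello, Parsel!</h1>'>]
    >>> html_selector.xpath('//h1')  # the same, but now with XPath
    [<Selector query='//h1' data='<h1>Hello, Parsel!</h1>'>]
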
For JSON, use `JMESPath`_ expressions to select data::

    >>> json_text = '{"title":"Hello, Parsel!"}'
    >>> json_selector = Selector(text=json_text)
    >>> json_selector.jmespath('title')
    [<Selector query='title' data='Hello, Parsel!'>]

And extract data from those elements::
>>> html_selector.xpath('//h1/text()').get()
'Hello, Parsel!'
>>> json_selector.jmespath('title').getall()
['Hello, Parsel!']
.. _CSS: https://www.w3.org/TR/selectors
.. _XPath: https://www.w3.org/TR/xpath
.. _JMESPath: https://jmespath.org/
Learning expression languages
=============================
`CSS`_ is a language for applying styles to HTML documents. It defines
selectors to associate those styles with specific HTML elements. Resources to
learn CSS_ selectors include:
- `CSS selectors in the MDN`_
- `XPath/CSS Equivalents in Wikibooks`_
Parsel support for CSS selectors comes from cssselect, so read about `CSS
selectors supported by cssselect`_.
.. _CSS selectors supported by cssselect: https://cssselect.readthedocs.io/en/latest/#supported-selectors
`XPath`_ is a language for selecting nodes in XML documents, which can also be
used with HTML. Resources to learn XPath_ include:
- `XPath Tutorial in W3Schools`_
- `XPath cheatsheet`_
For HTML and XML input, you can use either CSS_ or XPath_. CSS_ is usually
more readable, but some things can only be done with XPath_.
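
As one illustration of the kind of query that is usually written in XPath,
the sketch below finds a link by its visible text and then walks up to the
enclosing list with the ``ancestor`` axis. The markup and the ``pager`` class
name are invented for this example and are not part of the Parsel sample page.

.. code-block:: python

    from parsel import Selector

    # Hypothetical markup, invented for this illustration.
    html = """
    <ul class="menu"><li><a href="/home">Home</a></li></ul>
    <ul class="pager"><li><a href="/page/2">Next page</a></li></ul>
    """
    selector = Selector(text=html)

    # Find the link by its text, then walk up to the enclosing <ul>.
    list_class = selector.xpath(
        '//a[normalize-space(text())="Next page"]/ancestor::ul/@class'
    ).get()
    print(list_class)
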
JMESPath_ allows you to declaratively specify how to extract elements from
a JSON document. Resources to learn JMESPath_ include:
- `JMESPath Tutorial`_
- `JMESPath Specification`_
.. _CSS selectors in the MDN: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
.. _XPath cheatsheet: https://devhints.io/xpath
.. _XPath Tutorial in W3Schools: https://www.w3schools.com/xml/xpath_intro.asp
.. _XPath/CSS Equivalents in Wikibooks: https://en.wikibooks.org/wiki/XPath/CSS_Equivalents
.. _JMESPath Tutorial: https://jmespath.org/tutorial.html
.. _JMESPath Specification: https://jmespath.org/specification.html
Using selectors
===============
To explain how to use selectors, we'll use the :mod:`requests` library
to download an example page hosted in the Parsel documentation:
https://parsel.readthedocs.org/en/latest/_static/selectors-sample1.html
.. _topics-selectors-htmlcode:
For the sake of completeness, here's its full HTML code:
.. literalinclude:: _static/selectors-sample1.html
:language: html
.. highlight:: python
So, let's download that page and create a selector for it:
.. skip: start
>>> import requests
>>> from parsel import Selector
>>> url = 'https://parsel.readthedocs.org/en/latest/_static/selectors-sample1.html'
>>> text = requests.get(url).text
>>> selector = Selector(text=text)
.. skip: end
.. invisible-code-block: python
selector = load_selector('selectors-sample1.html')
Since we're dealing with HTML, the default type for Selector, we don't need
to specify the `type` argument.
So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that
page, let's construct an XPath for selecting the text inside the title tag::

    >>> selector.xpath('//title/text()')
    [<Selector query='//title/text()' data='Example website'>]

You can also ask the same thing using CSS instead::

    >>> selector.css('title::text')
    [<Selector query='descendant-or-self::title/text()' data='Example website'>]

To actually extract the textual data, you must call the selector's ``.get()``
or ``.getall()`` method, as follows::
>>> selector.xpath('//title/text()').getall()
['Example website']
>>> selector.xpath('//title/text()').get()
'Example website'
``.get()`` always returns a single result: if there are several matches, the
content of the first match is returned; if there are no matches, ``None``
is returned. ``.getall()`` returns a list with all results.
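
The difference between the two methods can be sketched with a small
self-contained example. The two-item list below is a made-up document, and
the ``default`` keyword of ``.get()`` is assumed to be available on the
returned selector list, as it is in current Parsel releases.

.. code-block:: python

    from parsel import Selector

    # Hypothetical two-item document, used only for this illustration.
    html = "<ul><li>first</li><li>second</li></ul>"
    selector = Selector(text=html)

    items = selector.css("li::text")
    all_items = items.getall()   # every match, as a list of strings
    first_item = items.get()     # only the first match

    no_match = selector.css("p::text")
    nothing = no_match.get()               # no match: None
    fallback = no_match.get(default="")    # no match: the given default
    empty_list = no_match.getall()         # no match: an empty list

    print(all_items, first_item, nothing, repr(fallback), empty_list)
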
Notice that CSS selectors can select text or attribute nodes using CSS3
pseudo-elements::
>>> selector.css('title::text').get()
'Example website'
As you can see, ``.xpath()`` and ``.css()`` methods return a
:class:`~parsel.selector.SelectorList` instance, which is a list of new
selectors. This API can be used for quickly selecting nested data::
>>> selector.css('img').xpath('@src').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
If you want to extract only the first matched element, you can call
``.get()`` (or its alias ``.extract_first()``, commonly used in
previous Parsel versions)::
>>> selector.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '
It returns ``None`` if no element was found::
>>> selector.xpath('//div[@id="not-exists"]/text()').get() is None
True
Instead of using an XPath such as ``'@src'``, you can query for attributes
using the ``.attrib`` property of a :class:`~parsel.selector.Selector`::
>>> [img.attrib['src'] for img in selector.css('img')]
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
As a shortcut, ``.attrib`` is also available on SelectorList directly;
it returns attributes for the first matching element::
>>> selector.css('img').attrib['src']
'image1_thumb.jpg'
This is most useful when only a single result is expected, e.g. when selecting
by id, or selecting unique elements on a web page::
>>> selector.css('base').attrib['href']
'http://example.com/'
Now we're going to get the base URL and some image links::
>>> selector.xpath('//base/@href').get()
'http://example.com/'
>>> selector.css('base::attr(href)').get()
'http://example.com/'
>>> selector.css('base').attrib['href']
'http://example.com/'
>>> selector.xpath('//a[contains(@href, "image")]/@href').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']
>>> selector.css('a[href*=image]::attr(href)').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']
>>> selector.xpath('//a[contains(@href, "image")]/img/@src').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
>>> selector.css('a[href*=image] img::attr(src)').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
.. _topics-selectors-css-extensions:
Extensions to CSS Selectors
---------------------------
Per W3C standards, `CSS selectors`_ do not support selecting text nodes
or attribute values.
But selecting these is so essential in a web scraping context
that Parsel implements a couple of **non-standard pseudo-elements**:
* to select text nodes, use ``::text``
* to select attribute values, use ``::attr(name)`` where *name* is the
name of the attribute that you want the value of
.. warning::
These pseudo-elements are Scrapy-/Parsel-specific.
They will most probably not work with other libraries like `lxml`_ or `PyQuery`_.
Examples:
* ``title::text`` selects children text nodes of a descendant ``<title>`` element::
>>> selector.css('title::text').get()
'Example website'
* ``*::text`` selects all descendant text nodes of the current selector context::
>>> selector.css('#images *::text').getall()
['\n ',
'Name: My image 1 ',
'\n ',
'Name: My image 2 ',
'\n ',
'Name: My image 3 ',
'\n ',
'Name: My image 4 ',
'\n ',
'Name: My image 5 ',
'\n ']
* ``a::attr(href)`` selects the *href* attribute value of descendant links::
>>> selector.css('a::attr(href)').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']
.. note::
You cannot chain these pseudo-elements. But in practice it would not
make much sense: text nodes do not have attributes, and attribute values
are string values already and do not have children nodes.
.. note::
See also: :ref:`selecting-attributes`.
.. _CSS Selectors: https://www.w3.org/TR/css3-selectors/#selectors
.. _topics-selectors-nesting-selectors:
Nesting selectors
-----------------
The selection methods (``.xpath()`` or ``.css()``) return a list of selectors
of the same type, so you can call the selection methods for those selectors
too. Here's an example::

    >>> links = selector.xpath('//a[contains(@href, "image")]')
    >>> links.getall()
    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
     '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
     '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
     '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
     '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

    >>> for index, link in enumerate(links):
    ...     args = (index, link.xpath('@href').get(), link.xpath('img/@src').get())
    ...     print('Link number %d points to url %r and image %r' % args)
    Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
    Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
    Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
    Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
    Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

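
Nesting also works when the selection styles are mixed. The sketch below
selects the links with CSS and then queries each link on its own, collecting
one record per link. The HTML is a shortened, hand-written stand-in for the
sample page, and the ``url`` and ``thumb`` keys are arbitrary names chosen
for this example.

.. code-block:: python

    from parsel import Selector

    # Shortened stand-in for the sample page (an assumption for this sketch).
    html = """
    <div id="images">
      <a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>
      <a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>
    </div>
    """
    selector = Selector(text=html)

    # Each item of the outer selection is itself a Selector, so it can be
    # queried again; the inner query only sees that link's subtree.
    records = [
        {"url": link.attrib["href"], "thumb": link.css("img::attr(src)").get()}
        for link in selector.css("#images a")
    ]
    print(records)
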
.. _selecting-attributes:
Selecting element attributes
----------------------------
There are several ways to get the value of an attribute. First, you can use
XPath syntax::
>>> selector.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
XPath syntax has a few advantages: it is a standard XPath feature, and
``@attributes`` can be used in other parts of an XPath expression - e.g.
it is possible to filter by attribute value.
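
As a sketch of filtering by attribute value, the example below keeps only
the links whose ``href`` starts with ``image`` and then looks up a single
link by its exact ``href``. The markup, including the ``about.html`` link,
is invented for this illustration.

.. code-block:: python

    from parsel import Selector

    # Hypothetical markup with one link that should be filtered out.
    html = """
    <div id="images">
      <a href="image1.html">Name: My image 1</a>
      <a href="image2.html">Name: My image 2</a>
      <a href="about.html">About</a>
    </div>
    """
    selector = Selector(text=html)

    # Use the attribute inside a predicate to filter the matched elements.
    image_links = selector.xpath(
        '//a[starts-with(@href, "image")]/@href'
    ).getall()

    # Use the attribute to pick one element, then select something else.
    label = selector.xpath('//a[@href="image2.html"]/text()').get()

    print(image_links, label)
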
Parsel also provides an extension to CSS selectors (``::attr(...)``)
that lets you get attribute values::
>>> selector.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
In addition, there is the ``.attrib`` property of Selector.
You can use it if you prefer to look up attributes in Python
code, without using XPath or CSS extensions::
>>> [a.attrib['href'] for a in selector.css('a')]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
This property is also available on SelectorList; it returns a dictionary
with the attributes of the first matching element. It is convenient when
a selector is expected to give a single result (e.g. when selecting by element
ID, or when selecting a unique element on a page)::
>>> selector.css('base').attrib
{'href': 'http://example.com/'}
>>> selector.css('base').attrib['href']
'http://example.com/'
The ``.attrib`` property of an empty SelectorList is an empty dictionary::
>>> selector.css('foo').attrib
{}
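
One practical consequence, assuming ``.attrib`` behaves like a regular
``dict`` as the output above suggests, is that ``.attrib.get()`` can be used
to avoid a ``KeyError`` when nothing matched. The minimal document below is
made up for this sketch.

.. code-block:: python

    from parsel import Selector

    # Minimal hypothetical document with a single <base> element.
    html = (
        "<html><head><base href='http://example.com/'></head>"
        "<body></body></html>"
    )
    selector = Selector(text=html)

    base_href = selector.css("base").attrib.get("href")
    # No <foo> element exists, so .attrib is empty and .get() returns None
    # (or the fallback passed as the second argument).
    missing_href = selector.css("foo").attrib.get("href")
    fallback_href = selector.css("foo").attrib.get("href", "")

    print(base_href, missing_href, repr(fallback_href))
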
Using selectors with regular expressions
----------------------------------------
:class:`~parsel.selector.Selector` also has a ``.re()`` method for extracting
data using regular expressions. However, unlike using ``.xpath()`` or
``.css()`` methods, ``.re()`` returns a list of strings. So you
can't construct nested ``.re()`` calls.
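
A short sketch of what that means in practice: the values returned by
``.re()`` are plain strings, so any further processing uses ordinary string
tools rather than more selector calls. The two-link snippet is a simplified
stand-in for the sample page.

.. code-block:: python

    from parsel import Selector

    # Simplified stand-in for the sample page (an assumption for this sketch).
    html = """
    <a href="image1.html">Name: My image 1</a>
    <a href="image2.html">Name: My image 2</a>
    """
    selector = Selector(text=html)

    names = selector.css("a::text").re(r"Name:\s*(.*)")
    only_strings = all(isinstance(name, str) for name in names)

    # Plain strings have no .xpath(), .css() or .re() methods, so continue
    # with regular Python string handling instead.
    numbers = [int(name.rsplit(" ", 1)[-1]) for name in names]

    print(names, only_strings, numbers)
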
Here's an example used to extract image names from the :ref:`HTML code
<topics-selectors-htmlcode>` above::
>>> selector.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1 ',
'My image 2 ',
'My image 3 ',
'My image 4 ',
'My image 5 ']
There's an additional helper for ``.re()``, named ``.re_first()``, which
mirrors ``.get()`` (and its alias ``.extract_first()``).
Use it to extract just the first matching string::
>>> selector.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1 '
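
When the pattern might not match at all, ``.re_first()`` returns ``None``.
The sketch below also passes a ``default`` keyword argument, which is assumed
here to be supported as in current Parsel releases. The markup and the
``Price:`` pattern are invented for this example.

.. code-block:: python

    from parsel import Selector

    # Simplified stand-in for the sample page (an assumption for this sketch).
    html = """
    <a href="image1.html">Name: My image 1</a>
    <a href="image2.html">Name: My image 2</a>
    """
    link_texts = Selector(text=html).css("a::text")

    first_name = link_texts.re_first(r"Name:\s*(.*)")
    # No text node contains "Price:", so there is nothing to return.
    price = link_texts.re_first(r"Price:\s*(.*)")
    price_or_default = link_texts.re_first(r"Price:\s*(.*)", default="n/a")

    print(first_name, price, price_or_default)
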
.. _topics-selectors-relative-xpaths:
Working with relative XPaths
----------------------------
Keep in mind that if you are nesting selectors and use an XPath that starts
with ``/``, that XPath will be absolute to the document and not relative to the
selector you're calling it from.
For example, suppose you want to extract all ``