parsel package

Submodules

parsel.csstranslator module

class parsel.csstranslator.GenericTranslator[source]

Bases: parsel.csstranslator.TranslatorMixin, cssselect.xpath.GenericTranslator

class parsel.csstranslator.HTMLTranslator(xhtml=False)[source]

Bases: parsel.csstranslator.TranslatorMixin, cssselect.xpath.HTMLTranslator

class parsel.csstranslator.TranslatorMixin[source]

Bases: object

This mixin adds support for CSS pseudo-elements via dynamic dispatch.

Currently supported pseudo-elements are ::text and ::attr(ATTR_NAME).

xpath_attr_functional_pseudo_element(xpath, function)[source]

Support selecting attribute values using the ::attr() pseudo-element.

xpath_element(selector)[source]
xpath_pseudo_element(xpath, pseudo_element)[source]

Dispatch method that transforms the XPath expression to support a pseudo-element.

xpath_text_simple_pseudo_element(xpath)[source]

Support selecting text nodes using the ::text pseudo-element.

class parsel.csstranslator.XPathExpr(path='', element='*', condition='', star_prefix=False)[source]

Bases: cssselect.xpath.XPathExpr

attribute = None
classmethod from_xpath(xpath, textnode=False, attribute=None)[source]
join(combiner, other)[source]
textnode = False

parsel.selector module

XPath selectors based on lxml

class parsel.selector.SafeXMLParser(*args, **kwargs)[source]

Bases: lxml.etree.XMLParser

class parsel.selector.Selector(text=None, type=None, namespaces=None, root=None, base_url=None, _expr=None)[source]

Bases: object

Selector allows you to select parts of an XML or HTML text using CSS or XPath expressions and extract data from it.

text is a unicode object in Python 2 or a str object in Python 3.

type defines the selector type; it can be "html", "xml" or None (default). If type is None, the selector defaults to "html".

css(query)[source]

Apply the given CSS selector and return a SelectorList instance.

query is a string containing the CSS selector to apply.

In the background, CSS queries are translated into XPath queries using the cssselect library and then run through the .xpath() method.

extract()[source]

Serialize and return the matched nodes as a single unicode string. Percent encoded content is unquoted.

namespaces
re(regex)[source]

Apply the given regex and return a list of unicode strings with the matches.

regex can be either a compiled regular expression or a string, which is compiled to a regular expression using re.compile(regex).

register_namespace(prefix, uri)[source]

Register the given namespace to be used in this Selector. Without registering namespaces you can’t select or extract data from non-standard namespaces. See Selector examples on XML text.

remove_namespaces()[source]

Remove all namespaces, allowing the document to be traversed using namespace-less XPath expressions. See Removing namespaces.

root
selectorlist_cls

alias of SelectorList

text
type
xpath(query)[source]

Find nodes matching the XPath query and return the result as a SelectorList instance with all elements flattened. The list elements implement the Selector interface too.

query is a string containing the XPath query to apply.

class parsel.selector.SelectorList[source]

Bases: list

The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.

css(xpath)[source]

Call the .css() method for each element in this list and return their results flattened as another SelectorList.

The xpath argument is the same as the query argument of Selector.css().

extract()[source]

Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.

extract_first(default=None)[source]

Return the result of .extract() for the first element in this list. If the list is empty, return the default value.

re(regex)[source]

Call the .re() method for each element in this list and return their results flattened, as a list of unicode strings.

re_first(regex)[source]

Call the .re() method for the first element in this list and return the result as a unicode string.

xpath(xpath)[source]

Call the .xpath() method for each element in this list and return their results flattened as another SelectorList.

The xpath argument is the same as the query argument of Selector.xpath().

parsel.selector.create_root_node(text, parser_cls, base_url=None)[source]

Create root node for text using given parser class.

parsel.utils module

parsel.utils.extract_regex(regex, text)[source]

Extract a list of unicode strings from the given text/encoding using the following policies:

* if the regex contains a named group called "extract", that group will be returned
* if the regex contains multiple numbered groups, all of those will be returned (flattened)
* if the regex doesn't contain any group, the entire regex match is returned

parsel.utils.flatten(sequence) → list[source]

Returns a single, flat list which contains all elements retrieved from the sequence and all recursively contained sub-sequences (iterables).

Examples:

>>> [1, 2, [3,4], (5,6)]
[1, 2, [3, 4], (5, 6)]
>>> flatten([[[1,2,3], (42,None)], [4,5], [6], 7, (8,9,10)])
[1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
>>> flatten(["foo", "bar"])
['foo', 'bar']
>>> flatten(["foo", ["baz", 42], "bar"])
['foo', 'baz', 42, 'bar']

parsel.utils.iflatten(sequence) → iterator[source]

Similar to flatten(), but returns an iterator instead.

Module contents

Parsel lets you extract text from XML/HTML documents using XPath or CSS selectors.