parsel package

Submodules

parsel.csstranslator module

class parsel.csstranslator.GenericTranslator

Bases: parsel.csstranslator.TranslatorMixin, cssselect.xpath.GenericTranslator

class parsel.csstranslator.HTMLTranslator(xhtml=False)

Bases: parsel.csstranslator.TranslatorMixin, cssselect.xpath.HTMLTranslator

class parsel.csstranslator.TranslatorMixin

Bases: object

xpath_attr_functional_pseudo_element(xpath, function)

xpath_element(selector)

xpath_pseudo_element(xpath, pseudo_element)

xpath_text_simple_pseudo_element(xpath): Support selecting text nodes using the ::text pseudo-element.

class parsel.csstranslator.XPathExpr(path='', element='*', condition='', star_prefix=False)

Bases: cssselect.xpath.XPathExpr

attribute = None

classmethod from_xpath(xpath, textnode=False, attribute=None)

join(combiner, other)

textnode = False

parsel.selector module

XPath selectors based on lxml

class parsel.selector.SafeXMLParser(*args, **kwargs)

Bases: lxml.etree.XMLParser

class parsel.selector.Selector(text=None, type=None, namespaces=None, root=None, base_url=None, _expr=None)

Bases: object

Selector allows you to select parts of an XML or HTML text using CSS or XPath expressions and extract data from it.

text is a unicode object in Python 2 or a str object in Python 3.

type defines the selector type; it can be "html", "xml" or None (default). If type is None, the selector defaults to "html".

css(query)

Apply the given CSS selector and return a SelectorList instance.

query is a string containing the CSS selector to apply.

In the background, CSS queries are translated into XPath queries using the cssselect library and then run through the .xpath() method.

extract(): Serialize and return the matched nodes as a single unicode string. Percent encoded content is unquoted.

namespaces

re(regex)

Apply the given regex and return a list of unicode strings with the matches.

regex can be either a compiled regular expression or a string, which will be compiled to a regular expression using re.compile(regex).

register_namespace(prefix, uri): Register the given namespace to be used in this Selector. Without registering namespaces you can’t select or extract data from non-standard namespaces. See Selector examples on XML text.

remove_namespaces(): Remove all namespaces, making it possible to traverse the document using namespace-less XPath queries. See Removing namespaces.

root

selectorlist_cls: alias of SelectorList

text

type

xpath(query)

Find nodes matching the XPath query and return the result as a SelectorList instance with all elements flattened. List elements implement the Selector interface too.

query is a string containing the XPath query to apply.

class parsel.selector.SelectorList

Bases: list

The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.

css(xpath)

Call the .css() method for each element in this list and return their results flattened as another SelectorList.

The argument (named xpath in the signature) is the same as the query argument of Selector.css().

extract(): Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.

extract_first(default=None)

re(regex): Call the .re() method for each element in this list and return their results flattened, as a list of unicode strings.

re_first(regex)

xpath(xpath)

Call the .xpath() method for each element in this list and return their results flattened as another SelectorList.

The argument (named xpath in the signature) is the same as the query argument of Selector.xpath().

parsel.selector.create_root_node(text, parser_cls, base_url=None): Create root node for text using given parser class.

parsel.utils module

parsel.utils.extract_regex(regex, text): Extract a list of unicode strings from the given text/encoding using the following policies:

* if the regex contains a named group called “extract”, that group will be returned
* if the regex contains multiple numbered groups, all of them will be returned (flattened)
* if the regex doesn’t contain any group, the entire regex match is returned

parsel.utils.flatten(sequence) → list: Returns a single, flat list which contains all elements retrieved from the sequence and all recursively contained sub-sequences (iterables).

Examples:

>>> [1, 2, [3,4], (5,6)]
[1, 2, [3, 4], (5, 6)]
>>> flatten([[[1,2,3], (42,None)], [4,5], [6], 7, (8,9,10)])
[1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
>>> flatten(["foo", "bar"])
['foo', 'bar']
>>> flatten(["foo", ["baz", 42], "bar"])
['foo', 'baz', 42, 'bar']

parsel.utils.iflatten(sequence) → iterator: Similar to flatten(), but returns an iterator instead of a list.

Module contents

Parsel lets you extract text from XML/HTML documents using XPath or CSS selectors.