While it's possible to use `Prefix` to narrow the scope of what you're parsing, that really only works if the shape of the data is predictable ahead of time.
It doesn't appear to be possible to avoid causing lxml to reparse the entire document when handling children.
An example:

    #!/usr/bin/env python3
    from lxml.etree import tostring
    from xextract import Element, String
    # import path assumed; the original snippet omitted this import
    from xextract.extractors import HtmlXPathExtractor

    with open('data.html') as f:
        html = f.read()

    category_elements = Element(css='.category').parse(html)
    for category_element in category_elements:
        # this does not work because XPathExtractor.get_root()
        # assumes body is a string
        #extractor = HtmlXPathExtractor(category_element)

        # this works because `BaseParser.parse()` has a special case for
        # `body` values that are already an `XPathExtractor`
        # but it is inefficient: we convert an lxml element into a string
        # and then re-parse it
        extractor = HtmlXPathExtractor(tostring(category_element))
        category_name = String(css='h1', quant=1).parse(extractor)
        product_count = int(String(css='.count', quant=1).parse(extractor))
        # note that `product_count` here was dynamically extracted from the html
        product_names = String(css='.product', quant=product_count).parse(extractor)
        print(f"Category '{category_name}' contains {product_count} products:")
        for product_name in product_names:
            print(f"  {product_name}")
The important part here is that we want to validate that the number of products matches the product count string. For a file this small it obviously doesn't make much of a difference, but for a very large file, serializing and re-parsing every category element adds up.
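To make the validation concrete without depending on xextract, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree` in place of lxml. The markup in `SAMPLE` is hypothetical (the issue's `data.html` is not reproduced here), and `parse_categories` is an illustrative helper, not part of any library. Each category element is queried in place, so the document is parsed exactly once.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for data.html (the real file is not shown).
SAMPLE = """
<div>
  <div class="category">
    <h1>Fruit</h1>
    <span class="count">2</span>
    <p class="product">Apple</p>
    <p class="product">Pear</p>
  </div>
  <div class="category">
    <h1>Tools</h1>
    <span class="count">1</span>
    <p class="product">Hammer</p>
  </div>
</div>
"""


def parse_categories(markup):
    """Parse once, then query each category element directly."""
    root = ET.fromstring(markup)  # the only parse of the document
    results = []
    for category in root.findall(".//*[@class='category']"):
        name = category.findtext("h1")
        declared = int(category.findtext("*[@class='count']"))
        products = [p.text for p in category.findall("*[@class='product']")]
        # Validate the dynamically extracted count against what was found.
        if len(products) != declared:
            raise ValueError(
                f"Category {name!r} declares {declared} products, "
                f"found {len(products)}"
            )
        results.append((name, products))
    return results
```

Calling `parse_categories(SAMPLE)` returns each category name with its products; a mismatched count raises `ValueError`, roughly playing the role that `quant=product_count` plays in the xextract version.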
One workaround is to create a custom parser:

    import lxml.etree

    class ElementHtmlXPathExtractor(HtmlXPathExtractor):
        def _get_root(self, body):
            # reuse an already-parsed lxml element instead of re-parsing it
            if isinstance(body, lxml.etree._Element):
                return body
            return super()._get_root(body)

    # and then in our loop we can do:
    for category_element in category_elements:
        ...
        extractor = ElementHtmlXPathExtractor(category_element)
        ...

It would be nice if the `_get_root()` special-case check for `_Element` was integrated into `XPathExtractor`.