Reusing parsed lxml trees #10

Open

levic opened this issue Feb 6, 2023 · 0 comments

levic commented Feb 6, 2023

While it's possible to use Prefix to narrow the scope of what you're parsing, that really only works if the shape of the data is predictable ahead of time.
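
For instance, the static approach works with the data.html below, but only because every selector and quantity is declared up front (a rough sketch, assuming xextract's Prefix parser and its css/name/quant/children arguments):

from xextract import Prefix, String

html = open('data.html').read()

# every selector and quant is fixed here; none of them can depend on a value
# (such as the product count) extracted from the document itself
parsed = Prefix(css='.category', children=[
    String(name='names', css='h1'),
    String(name='counts', css='.count'),
    String(name='products', css='.product'),
]).parse(html)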

It doesn't appear to be possible to avoid causing lxml to reparse the entire document when handling children.

An example:

data.html:

<html>
  <body>

    <div class="category">
      <h1>Fruit</h1>
      <p><div class="count">3</div>results</p>
      <div class="product">Apple</div>
      <div class="product">Pear</div>
      <div class="product">Orange</div>
    </div>

    <div class="category">
      <h1>Vegetables</h1>
      <p><div class="count">2</div>result</p>
      <div class="product">Potato</div>
      <div class="product">Pumpkin</div>
    </div>

  </body>
</html>

and the script that parses it:

#!/usr/bin/env python3
from lxml.etree import tostring

from xextract import Element, String
# assuming HtmlXPathExtractor can be imported from xextract's extractors module
# (the exact import path may differ)
from xextract.extractors import HtmlXPathExtractor

with open('data.html') as f:
    html = f.read()

category_elements = Element(css='.category').parse(html)
for category_element in category_elements:
    # this does not work because XPathExtractor._get_root()
    # assumes `body` is a string
    # extractor = HtmlXPathExtractor(category_element)

    # this works because `BaseParser.parse()` has a special case for
    # `body` values that are already an `XPathExtractor`
    # but it is inefficient: we convert an lxml element into a string
    # and then re-parse it
    extractor = HtmlXPathExtractor(tostring(category_element))

    category_name = String(css='h1', quant=1).parse(extractor)
    product_count = int(String(css='.count', quant=1).parse(extractor))

    # note that `product_count` here was dynamically extracted from the html
    product_names = String(css='.product', quant=product_count).parse(extractor)

    print(f"Category '{category_name}' contains {product_count} products:")
    for product_name in product_names:
        print(f"  {product_name}")

The important part here is that we want to validate that the number of products matches the product count string. For a file this small it obviously doesn't make much of a difference, but consider a very large file.

One workaround is to create a custom extractor:

import lxml.etree

class ElementHtmlXPathExtractor(HtmlXPathExtractor):
    def _get_root(self, body):
        # reuse an already-parsed lxml element instead of re-parsing a string
        if isinstance(body, lxml.etree._Element):
            return body
        return super()._get_root(body)


# and then in our loop we can do:
for category_element in category_elements:
    ...
    extractor = ElementHtmlXPathExtractor(category_element)
    ...

It would be nice if this _Element special case in _get_root() were integrated into XPathExtractor itself.
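
Purely as a sketch of the idea, the check could live directly in XPathExtractor._get_root() (assuming that method currently parses a string body; lxml.html.fromstring below just stands in for whatever string-parsing path the real implementation uses):

import lxml.etree
import lxml.html

class XPathExtractor:
    def _get_root(self, body):
        # reuse an already-parsed lxml element as the root directly
        if isinstance(body, lxml.etree._Element):
            return body
        # otherwise fall back to the existing behaviour for string bodies
        return lxml.html.fromstring(body)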
