
RDF terms with object interning support #2972

Open
edmondchuc opened this issue Nov 5, 2024 · 3 comments

@edmondchuc
Contributor

I'm interested in this. I've been playing around with the idea of implementing RDF terms with object interning to save memory and avoid copying. This issue is a continuation of #2866.

In the embarrassingly parallel, distributed ETLs where I've used RDFLib, I've always seen memory usage grow over time. By implementing object interning, we may be able to fix this and stop the memory growth once objects are no longer referenced. I think this is also related to the issue described in #740.

The key is to implement RDF terms as immutable data structures. That way, we can safely reuse a reference to the same object whenever a term's value is the same Unicode code point sequence.

Below is an example of a blank node implementation with object interning that is thread-safe when accessing the weakrefs. Memory should be freed once the objects are no longer in use, even though the cache holds weakrefs to them.

import threading
from dataclasses import dataclass
from typing import Any, Self, final
from uuid import uuid4
from weakref import WeakValueDictionary


class InternedBlankNode:
    _intern_cache: WeakValueDictionary[str, "Self"] = WeakValueDictionary()
    _lock = threading.Lock()

    __slots__ = ("__weakref__",)

    def __new__(cls, value: str | None = None) -> Self:
        if value is None:
            value = str(uuid4()).replace("-", "0")

        with cls._lock:
            if value in cls._intern_cache:
                return cls._intern_cache[value]

            instance = super().__new__(cls)
            # The subclass is a frozen dataclass, so bypass its
            # __setattr__ guard to populate the value slot.
            object.__setattr__(instance, "value", value)
            cls._intern_cache[value] = instance
            return instance


@final
@dataclass(frozen=True, slots=True, init=False)
class BlankNode(InternedBlankNode):
    """
    An RDF blank node representing an anonymous resource.

    Specification: https://www.w3.org/TR/rdf12-concepts/#section-blank-nodes

    This implementation uses object interning to ensure that blank nodes
    with the same identifier reference the same object instance, optimizing
    memory usage. The class is marked final so that the
    :py:meth:`InternedBlankNode.__new__` implementation cannot be overridden.

    :param value:
        A blank node identifier. If :py:obj:`None` is provided, an identifier
        will be generated.
    """

    # init=False: ``value`` is assigned (and generated when omitted) in
    # ``InternedBlankNode.__new__``. A dataclass-generated ``__init__`` with a
    # default factory would overwrite the generated identifier with a fresh
    # UUID and desynchronise it from the intern-cache key.
    value: str

    def __str__(self) -> str:
        return f"_:{self.value}"

    def __reduce__(self) -> str | tuple[Any, ...]:
        return self.__class__, (self.value,)


__all__ = ["BlankNode"]

And tests:

import pickle

import pytest

from rdf_core.terms import BlankNode


def test_blank_node():
    bnode1 = BlankNode("123")
    bnode2 = BlankNode("123")
    bnode3 = BlankNode("222")

    assert bnode1.value == bnode2.value
    assert bnode1.value != bnode3.value
    assert bnode1 == bnode2
    assert bnode1 != bnode3
    assert bnode1 is bnode2
    assert bnode1 is not bnode3
    assert hash(bnode1) == hash(bnode2)

    bnode4 = BlankNode()
    assert len(bnode4.value) > 0


def test_blank_node_repr_str():
    bnode1 = BlankNode("123")
    assert repr(bnode1) == "BlankNode(value='123')"
    assert str(bnode1) == "_:123"


def test_blank_node_immutability():
    bnode1 = BlankNode("123")
    with pytest.raises(AttributeError):
        bnode1.value = "222"


def test_blank_node_pickling():
    bnode1 = BlankNode("123")
    pickled = pickle.dumps(bnode1)
    unpickled = pickle.loads(pickled)
    assert bnode1 is unpickled
    assert bnode1 == unpickled

@ashleysommer
Contributor

@edmondchuc
I got a chance this weekend to do some in-depth experimentation with this object-interning pattern.

First, I needed to make this approach compatible with the existing implementation of Identifier, which is a subclass of the native str. We require all Identifier and IdentifiedNode instances to be subclasses of str so that they inherit the very fast native CPython implementations of __hash__, __contains__, etc., as well as the str methods like .startswith(), .split(), and so on.

That means we can't use the dataclass-style wrapper you've shown in the original post, because in that pattern the str is a property ("value") of the dataclass, not the base type.

I replaced the original URIRef and BNode implementations in RDFLib with these (they rely on the existing imports and helpers in rdflib/term.py, such as _is_valid_uri and logger):

class InternedURIRef(IdentifiedNode):
    # _intern_cache: WeakValueDictionary[str, "Self"] = WeakValueDictionary()
    _intern_cache: dict[str, "weakref.ref[Self]"] = dict()
    _lock = threading.Lock()

    __slots__ = ("__weakref__",)

    def __new__(cls, value: str, base: str | None = None) -> Self:
        if base is not None:
            ends_in_hash = value.endswith("#")
            value = urljoin(base, value, allow_fragments=1)  # type: ignore[arg-type]
            if ends_in_hash:
                if not value.endswith("#"):
                    value += "#"

        with cls._lock:
            if value in cls._intern_cache:
                old_ref_val = cls._intern_cache[value]()
                if old_ref_val is not None:
                    return old_ref_val
                else:
                    del cls._intern_cache[value]

            if not _is_valid_uri(value):
                logger.warning(
                    f"{value} does not look like a valid URI, trying to serialize this will break."
                )
            instance = str.__new__(cls, value)
            cls._intern_cache[value] = weakref.ref(instance)
        return instance

    def __reduce__(self) -> tuple[type[InternedURIRef], tuple[str]]:
        return InternedURIRef, (str(self),)

    def __repr__(self) -> str:
        if self.__class__ is InternedURIRef:
            clsName = "rdflib.term.InternedURIRef"  # noqa: N806
        else:
            clsName = self.__class__.__name__  # noqa: N806

        return f"{clsName}({str.__repr__(self)})"
class InternedBlankNode(IdentifiedNode):
    #_intern_cache: WeakValueDictionary[str, "Self"] = WeakValueDictionary()
    _intern_cache: dict[str, "Self"] = dict()
    _lock = threading.Lock()

    __slots__ = ("__weakref__",)

    def __new__(cls, value: str | None = None) -> Self:
        if value is None:
            # This new UUID is statistically certain to not exist
            # in the cache already, so simply add it without checking.
            value = str(uuid4()).replace("-", "0")

        with cls._lock:
            if value in cls._intern_cache:
                old_ref_val = cls._intern_cache[value]()
                if old_ref_val is not None:
                    return old_ref_val
                else:
                    del cls._intern_cache[value]

            instance = str.__new__(cls, value)
            cls._intern_cache[value] = weakref.ref(instance)
        return instance

Notice they look like a combination of the existing RDFLib implementation and the proposed new interning pattern.
I did not modify Literal, because it is a little more involved; the benchmark results might have been different had I included Literal in this change.

While running my benchmarks I noticed that the use of WeakValueDictionary was causing some increased memory usage and slower-than-expected performance. That is because WeakValueDictionary needs to allocate its own memory to hold its weakrefs, and it is implemented as a pure-Python module in the stdlib rather than in C, so it is slower than a normal dict(). I tried replacing it with a regular dict[str, Self] where the values are the stored instances themselves. This worked well, but we lost the ability to garbage-collect the unused instances.
Then I tried a regular dict() storing weakref.ref(instance) values and dereferencing the weakref on use (as shown in my code above). I think this is a happy middle ground between the two approaches.
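
One trade-off with the plain dict of weakrefs is that a dead entry lingers in the cache until the same key happens to be requested again (WeakValueDictionary evicts these automatically). A possible refinement, written here as a minimal standalone sketch rather than a patch against RDFLib, is to pass an eviction callback to weakref.ref. Note the RLock: a weakref callback can fire via the garbage collector while the same thread already holds the lock.

import threading
import weakref


class InternedStr(str):
    """Sketch: interning via a plain dict of weakrefs with self-evicting entries."""

    _intern_cache: "dict[str, weakref.ref[InternedStr]]" = {}
    # RLock rather than Lock: the eviction callback may run, via the GC,
    # in a thread that is already inside __new__ and holding this lock.
    _lock = threading.RLock()

    __slots__ = ("__weakref__",)

    def __new__(cls, value: str) -> "InternedStr":
        with cls._lock:
            ref = cls._intern_cache.get(value)
            if ref is not None:
                cached = ref()
                if cached is not None:
                    return cached

            instance = super().__new__(cls, value)

            def evict(dead_ref: "weakref.ref[InternedStr]", key: str = value) -> None:
                # Runs when the instance is collected; drop the stale entry
                # unless the key has since been re-interned.
                with cls._lock:
                    if cls._intern_cache.get(key) is dead_ref:
                        del cls._intern_cache[key]

            cls._intern_cache[value] = weakref.ref(instance, evict)
        return instance

Whether the per-instance callback overhead beats lazy eviction on real workloads would need benchmarking.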

I had two benchmarks running:

  1. The first (called "Benchmark1") is my usual go-to script that generates a pool of 10,000 short URIRefs, 10,000 medium-length URIRefs, 10,000 short string Literals, and 10,000 medium-length string Literals. It then generates 100,000 records, where each record has a random subject from the medium-length URIRef pool, plus 10 property-object pairs composed of a random URIRef from the pools for the predicate and a random URIRef or Literal from the pools for the object.
    These records are inserted into a regular RDFLib memory graph.
    Then 1,000 known triples are queried back from the graph, using a mixture of missing-subject, missing-object, and missing-predicate triples() lookups.

  2. The second (called "Parsing large NQuads") is loading a 150MB real-world NQuads file, containing 578,073 quads across 3 named graphs, into an RDFLib Dataset backed by the RDFLib in-memory store.

I ran each combination of changes multiple times and have included the fastest and slowest results (or sometimes three results) for each change; that's why some entries have multiple results.

I ran the benchmarks with memory tracing on, but noticed that the tracemalloc recording was significantly affecting the benchmarks' run time and skewing their real performance numbers, so I also ran them without memory tracing.
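
For reference, the shape of such a harness can be as simple as the following sketch (run_workload is a hypothetical placeholder; this is not the actual benchmark script):

import time
import tracemalloc


def measure(run_workload, trace_memory: bool = False) -> None:
    """Time a workload, optionally recording peak allocations via tracemalloc
    (which itself slows the run down considerably)."""
    if trace_memory:
        tracemalloc.start()
    start = time.perf_counter()
    run_workload()
    elapsed_ms = (time.perf_counter() - start) * 1000
    if trace_memory:
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"Memory usage: {peak / 2**20:.2f} MB")
    print(f"Done in {elapsed_ms:,.2f} milliseconds")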

These are the notes from my benchmark runs:

Benchmark1
===========
Before Changes: Benchmark1
---- Run1
Graph created in 4,945.54 milliseconds
Lookup done in 9.01 milliseconds
---- Run2
Graph created in 5,305.00 milliseconds
Lookup done in 8.49 milliseconds

After Changes (with native dict+deref): Benchmark1
---- Run1
Graph created in 5,234.04 milliseconds
Lookup done in 9.05 milliseconds
---- Run2
Graph created in 5,188.26 milliseconds
Lookup done in 8.54 milliseconds

Before Changes: Benchmark1 - With Memory Tracing
---- Run1
Graph created in 23,727.40 milliseconds
Lookup done in 28.92 milliseconds
Memory usage: 883.95 MB

After Changes (with WeakRef dict): Benchmark1 - With Memory Tracing
---- Run1
Graph created in 22,301.57 milliseconds
Lookup done in 28.51 milliseconds
Memory usage: 884.15 MB
---- Run2
Graph created in 23,235.82 milliseconds
Lookup done in 28.72 milliseconds
Memory usage: 884.07 MB

After Changes (with native dict): Benchmark1 - With Memory Tracing
---- Run1
Graph created in 22,139.52 milliseconds
Lookup done in 27.62 milliseconds
Memory usage: 884.02 MB
---- Run2
Graph created in 22,065.47 milliseconds
Lookup done in 27.51 milliseconds
Memory usage: 884.09 MB

After Changes (with native dict+deref): Benchmark1 - With Memory Tracing
---- Run1
Graph created in 21,656.74 milliseconds
Lookup done in 27.70 milliseconds
Memory usage: 883.96 MB



Benchmark2
===========
Before Changes: Parsing Large NQuads
---- Run1
NQuads parsed in 8,951.31 milliseconds
---- Run2
NQuads parsed in 9,300.71 milliseconds
---- Run3
NQuads parsed in 9,092.11 milliseconds

After Changes (with native dict+deref): Parsing Large NQuads
---- Run1
NQuads parsed in 9,280.79 milliseconds
---- Run2
NQuads parsed in 9,385.95 milliseconds

Before Changes: Parsing Large NQuads - With Memory Tracing
---- Run1
NQuads parsed in 33,611.08 milliseconds
Memory usage: 987.65 MB
---- Run2
NQuads parsed in 33,598.72 milliseconds
Memory usage: 987.65 MB

After Changes (with WeakRef dict): Parsing Large NQuads - With Memory Tracing
---- Run1
NQuads parsed in 31,668.49 milliseconds
Memory usage: 798.65 MB
---- Run2
NQuads parsed in 30,816.48 milliseconds
Memory usage: 798.65 MB

After Changes (with native dict): Parsing Large NQuads - With Memory Tracing
---- Run1
NQuads parsed in 29,839.81 milliseconds
Memory usage: 785.44 MB
---- Run2
NQuads parsed in 29,852.44 milliseconds
Memory usage: 785.44 MB

After Changes (with native dict+deref): Parsing Large NQuads - With Memory Tracing
---- Run1
NQuads parsed in 30,995.94 milliseconds
Memory usage: 797.45 MB
---- Run2
NQuads parsed in 30,012.63 milliseconds
Memory usage: 797.45 MB

RESULTS
The results are certainly open to interpretation, and the two benchmarks show different outcomes.

Benchmark1 is not a good test for this change because it never needs to re-create URIRefs or BNodes after they are first created. The result is that memory usage is actually slightly higher after the changes (due to needing to store the weakrefs), and performance is essentially the same as before the change.

Benchmark2 is much more representative of how this change can help in real-world usage.
The most notable difference is roughly 190 MB less memory usage (a reduction of about 20%) after the changes.
Performance after the change is roughly the same (sometimes equal, sometimes slightly slower). That is surprising, because a speedup is supposed to come from not needing to run _is_valid_uri() in the URIRef constructor every time the same URIRef is constructed. The expected gain shows up when memory tracing is turned on, but disappears when memory tracing is turned off.
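
The constructor-level effect can be isolated with a micro-benchmark along these lines (hypothetical, assuming the InternedURIRef class above is in scope; a strong reference must be kept alive, otherwise the weakref cache misses on every iteration):

import timeit

from rdflib.term import URIRef  # current class: validates on every construction

uri = "http://example.com/some/medium/length/path"

# Keep one strong reference alive so the interned class's weakref cache
# actually hits; without it, each instance dies between timeit iterations
# and every call is a cache miss plus re-validation.
keep = InternedURIRef(uri)

print("URIRef:        ", timeit.timeit(lambda: URIRef(uri), number=1_000_000))
print("InternedURIRef:", timeit.timeit(lambda: InternedURIRef(uri), number=1_000_000))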

@edmondchuc
Contributor Author

Thanks @ashleysommer. I've only had a brief read of your results, so there's not much to comment on yet besides the interesting tidbit that having memory tracing on affects the execution speed. I'll have to test that with my changes too.

Some other initial thoughts: I think you're right that in most cases the execution speed is largely the same, and interning may even use a bit more memory due to the additional weakref data structure. In the ETLs I've seen, though, it's quite common to have some kind of loop where the same terms are created as new objects over and over again. With memoisation, a lot of memory and execution time can be saved. These optimisations can be implemented in application code to get the same benefits, but I like the idea that having them in RDFLib itself simplifies the application code we have to write.
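
For illustration, the application-level workaround looks roughly like this hypothetical ETL loop (EX, make_uri, and rows are made-up names); interning inside RDFLib would make the lru_cache wrapper unnecessary:

from functools import lru_cache

from rdflib import Graph, Namespace, URIRef

EX = Namespace("http://example.com/")  # hypothetical namespace


@lru_cache(maxsize=None)
def make_uri(value: str) -> URIRef:
    # Application-level memoisation: the same string maps to the same URIRef
    # object instead of being re-created (and re-validated) per row. Unlike
    # weakref interning, lru_cache holds strong references forever.
    return URIRef(value)


rows = [("http://example.com/a", "http://example.com/b")] * 1_000_000  # stand-in input

graph = Graph()
for subject, obj in rows:
    graph.add((make_uri(subject), EX.related, make_uri(obj)))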

I planned to open a PR with the changes to URIRef along with these test results, but I'll post the results here for now in case you want to try them in your test environment as well.

In each test, the first result is with the memoisation optimisation and the second is the current RDFLib release (7.1.1).

Test 1 - Adding to Graph

File: perf_graph_add.py

Adding 5 million URIRefs of the same value to a graph uses exactly the same memory in both implementations, but execution is much faster with weakrefs. A hypothetical reconstruction of the loop is sketched after the numbers.

Memory usage: 0.06 MB
Total execution time: 41.88 seconds

Memory usage: 0.06 MB
Total execution time: 93.21 seconds
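
A hypothetical reconstruction of the core loop of perf_graph_add.py (the actual script was not posted): the same URIRef values are re-created on every one of the 5 million iterations, which is exactly the case interning helps.

from rdflib import Graph, URIRef


def workload() -> None:
    graph = Graph()
    for _ in range(5_000_000):
        # Without interning, three new str-subclass objects are allocated
        # and validated per iteration; with interning they are cache hits,
        # since the graph's indexes keep the first instances alive.
        graph.add((
            URIRef("http://example.com/subject"),
            URIRef("http://example.com/predicate"),
            URIRef("http://example.com/object"),
        ))


# Timed and memory-traced with a harness like the one sketched earlier.
workload()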

Test 2 - Adding to Graph - no URIRef recreate

File: perf_graph_add_no_create.py

Creating a URIRef instance once outside of the loop, then adding the same instance to the graph 5 million times.

Memory usage: 0.06 MB
Total execution time: 28.84 seconds

Memory usage: 0.06 MB
Total execution time: 32.42 seconds

Test 3 - Check existence of URIRef statement

File: perf_graph_check_existence.py

Checking whether a triple exists in a graph.

Memory usage: 0.06 MB
Total execution time: 30.45 seconds

Memory usage: 0.06 MB
Total execution time: 35.70 seconds

Test 4 - Adding to List

File: perf_list.py

Each element in the list is a new instance of URIRef (different memory address), but with weakrefs, each element is the same object instance.

Memory usage: 41.91 MB
Total execution time: 6.39 seconds

Memory usage: 590.28 MB
Total execution time: 25.63 seconds

Test 5 - Adding to Dict - just the key

File: perf_dict.py

Slight overhead due to object creation in the second test?

Memory usage: 0.00 MB
Total execution time: 6.30 seconds

Memory usage: 0.00 MB
Total execution time: 13.08 seconds

Test 6 - Adding to Set

File: perf_set.py

Slight overhead due to object creation in the second test?

Memory usage: 0.00 MB
Total execution time: 6.42 seconds

Memory usage: 0.00 MB
Total execution time: 12.79 seconds

Test 7 - Creating URIRef and adding to graph for a csv dataset

File: perf_words.py

Memory usage: 1137.69 MB
Total execution time: 44.42 seconds

Memory usage: 1078.54 MB
Total execution time: 43.35 seconds

Test 8 - Creating same URIRef and adding to graph for a csv dataset

File: perf_words.py

Memory usage: 517.81 MB
Total execution time: 26.90 seconds

Memory usage: 570.76 MB
Total execution time: 35.44 seconds

Test 9 - Equality test with == vs is

File: perf_equality.py

Now that weakrefs let us reuse the same object, we can simply compare the memory address, which is much faster than comparing the str values character by character. This also prevents the memory-leak issue from occurring in applications using multiprocessing, where objects are passed around. (One way to fold the identity check into == itself is sketched after the numbers.)

Memory usage: 0.00 MB
Total execution time: 1.20 seconds

Memory usage: 0.00 MB
Total execution time: 4.49 seconds
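
For completeness: the terms are str subclasses, so == still compares characters unless equality short-circuits on identity first. A sketch of that, as an addition to the interned class (ignoring RDFLib's type-aware equality semantics for brevity):

class InternedURIRef(IdentifiedNode):
    # ... __new__ as above ...

    def __eq__(self, other):
        # Interning guarantees at most one live instance per value, so an
        # identity hit answers equal-value comparisons without touching
        # the characters; otherwise fall back to plain str comparison.
        if self is other:
            return True
        return str.__eq__(self, other)

    # Defining __eq__ implicitly sets __hash__ to None; re-pin the fast
    # native str hash.
    __hash__ = str.__hash__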

Namespace IRI performance

File: perf_namespace.py

Defined namespaces benefit from object interning with a slight lead in execution speed, but not much, and I'm not sure where this benefit comes from. Defined namespaces already offset the creation of URIRefs by having an LRU cache decorator on the metaclass's __getitem__ method.

However, namespaces created via the Namespace class are significantly faster with the object-interning implementation, because each __getitem__ call on a namespace instance creates a new URIRef.
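
The mechanism being exercised is roughly the following simplified model of rdflib.Namespace (illustration only, not the actual implementation):

from rdflib import URIRef


class SimpleNamespace(str):
    """Simplified model of rdflib.Namespace for illustration."""

    def term(self, name: str) -> URIRef:
        # Every access concatenates and builds a brand-new URIRef; in a hot
        # loop this is exactly the repeated re-creation interning collapses.
        return URIRef(self + name)

    def __getitem__(self, key: str) -> URIRef:
        return self.term(key)

    def __getattr__(self, name: str) -> URIRef:
        return self.term(name)

The benchmark loop for this test: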

RDF = Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#")  
RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#")  
OWL = Namespace("http://www.w3.org/2002/07/owl#")  
graph = Graph()  
for i in range(5_000_000):  
    graph.add((RDFS.label, RDF.type, OWL.AnnotationProperty))

Memory usage: 0.06 MB
Total execution time: 64.64 seconds

Memory usage: 0.06 MB
Total execution time: 113.74 seconds

@ashleysommer
Copy link
Contributor

Thanks @edmondchuc for sharing your results. I just realized I think I'm tracking memory usage (and possibly the performance difference) incorrectly for Benchmark1: it's not including the setup of the URIRef pools in the calculation, and that's the main part this change would affect! So I will re-do my numbers for that one.
