-
Notifications
You must be signed in to change notification settings - Fork 563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RDF terms with object interning support #2972
Comments
@edmondchuc First, I needed to make this method compatible with the existing implementation of That means we can't use the Dataclass style wrapper you've show in the original post, because in that pattern the I replace the original URIRef and BNode implementations in RDFLib with these: class InternedURIRef(IdentifiedNode):
#_intern_cache: WeakValueDictionary[str, "Self"] = WeakValueDictionary()
_intern_cache: dict[str, "Self"] = dict()
_lock = threading.Lock()
__slots__ = ("__weakref__",)
def __new__(cls, value: str, base: str | None = None) -> Self:
if base is not None:
ends_in_hash = value.endswith("#")
value = urljoin(base, value, allow_fragments=1) # type: ignore[arg-type]
if ends_in_hash:
if not value.endswith("#"):
value += "#"
with cls._lock:
if value in cls._intern_cache:
old_ref_val = cls._intern_cache[value]()
if old_ref_val is not None:
return old_ref_val
else:
del cls._intern_cache[value]
if not _is_valid_uri(value):
logger.warning(
f"{value} does not look like a valid URI, trying to serialize this will break."
)
instance = str.__new__(cls, value)
cls._intern_cache[value] = weakref.ref(instance)
return instance
def __reduce__(self) -> tuple[type[InternedURIRef], tuple[str]]:
return InternedURIRef, (str(self),)
def __repr__(self) -> str:
if self.__class__ is InternedURIRef:
clsName = "rdflib.term.InternedURIRef" # noqa: N806
else:
clsName = self.__class__.__name__ # noqa: N806
return f"{clsName}({str.__repr__(self)})" class InternedBlankNode(IdentifiedNode):
#_intern_cache: WeakValueDictionary[str, "Self"] = WeakValueDictionary()
_intern_cache: dict[str, "Self"] = dict()
_lock = threading.Lock()
__slots__ = ("__weakref__",)
def __new__(cls, value: str | None = None) -> Self:
if value is None:
# This new UUID is statistically certain to not exist
# in the cache already, so simply add it without checking.
value = str(uuid4()).replace("-", "0")
with cls._lock:
if value in cls._intern_cache:
old_ref_val = cls._intern_cache[value]()
if old_ref_val is not None:
return old_ref_val
else:
del cls._intern_cache[value]
instance = str.__new__(cls, value)
cls._intern_cache[value] = weakref.ref(instance)
return instance Notice they look like a combination of the existing RDFLib implementation, and the proposed new interning pattern. While running my benchmarks I noticed the use of I had two Benchmarks I was running.
I ran each combination of changes multiple times, and included the fastest and slowest results (or sometimes 3 results) for each change. That's why some have multple results. I ran the benchmarks with memory tracing on, but noticed the These are the notes from my benchmark runs:
RESULTS Benchmark1 is not a good test for this change because it never needs to re-create URIRefs or BNodes after they are first created. The result is that the memory usage is actually slightly higher after the changes (due to needing to store the weakrefs) and performance is exactly the same as before the change. Benchmark2 is much more representative of how this change can help in real-world usage. |
Thanks @ashleysommer. I've only had a brief read of your results so not much to comment on besides the interesting tidbit that having memory tracing on affected the execution speed. I'll have to test that with my changes too. Some other initial thoughts. I think you're right that in most cases, the execution speed is largely the same, and may even use a bit more memory due to an additional weakref data structure. From a few ETL's I've seen though, it's quite common to have some kind of loop where potentially the same terms are created as new objects over and over again. By performing memoisation, a lot of memory and execution speed can be reduced. These optimisations can be implemented in application code to get the same benefits, but I do like the idea that by having these optimisations in RDFLib itself, it simplifies the kind of application code we have to write. I planned to add a PR with the changes to URIRef along with these test results, but I'll post them here for now in case you want to try them in your test environment as well. In each test, the first result is the one with the memoisation optimisation, and the second is the current RDFLib release (7.1.1). Test 1 - Adding to GraphFile: Adding 5 million URIRef of the same value to a graph uses exactly the same memory in both implementations, but the execution speed is much faster with weakrefs. Memory usage: 0.06 MB Memory usage: 0.06 MB Test 2 - Adding to Graph - no URIRef recreateFile: Creating a URIRef instance once outside of the loop, then adding the same instance to the graph 5 million times. Memory usage: 0.06 MB Memory usage: 0.06 MB Test 3 - Check existence of URIRef statementFile: Checking whether a triple exists in a graph. Memory usage: 0.06 MB Memory usage: 0.06 MB Test 4 - Adding to ListFile: Each element in the list is a new instance of URIRef (different memory address), but with weakrefs, each element is the same object instance. Memory usage: 41.91 MB Memory usage: 590.28 MB Test 5 - Adding to Dict - just the keyFile: Slight overhead due to object creation in the second test? Memory usage: 0.00 MB Memory usage: 0.00 MB Test 6 - Adding to SetFile: Slight overhead due to object creation in the second test? Memory usage: 0.00 MB Memory usage: 0.00 MB Test 7 - Creating URIRef and adding to graph for a csv datasetFile: Memory usage: 1137.69 MB Memory usage: 1078.54 MB Test 8 - Creating same URIRef and adding to graph for a csv datasetFile: Memory usage: 517.81 MB Memory usage: 570.76 MB Test 9 - Equality test with
|
Thanks @edmondchuc for sharing your results. I just realized I think I'm tracking memory usage (and possibly performance difference) incorrectly for Benchmark1, its not including the setup of the URIRef pools in the calculation (that's the main part it would affect!) so I will re-do my numbers for that one. |
I'm interested in this. I've been playing around with the idea of implementing RDF terms with object interning to save memory and avoid copying. This issue is a continuation from #2866.
In any embarassingly parallel, distributed ETLs where I've used RDFLib, I've always seen the memory usage grow over time. By implementing object interning, we may be able to fix this issue and potentially stop the memory growth when objects are no longer referenced. I think this particular issue is also related to this other issue described here #740.
The key is to implement RDF terms as immutable data structures. This way, we can safely reuse references to the same object if the unicode code point sequence in the term's value is the same.
An example of a Blank Node implementation with object interning and is thread-safe when accessing the weakrefs. Memory should be freed once the objects are no longer in use even though we have a weakref pointing to it.
And tests:
The text was updated successfully, but these errors were encountered: