Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement an Ad parser class #1181

Open
keileg opened this issue Jun 7, 2024 · 13 comments
Open

Implement an Ad parser class #1181

keileg opened this issue Jun 7, 2024 · 13 comments

Comments

@keileg
Copy link
Contributor

keileg commented Jun 7, 2024

Background

In the current code, the evaluation of a pp.ad.Operator is implemented in the Operator class itself. The parsing of a tree of an operator (which is really a computational graph) is implemented as a depth-first traversal which identifies leaves in the tree, parses these to numerical values, and then propagates the values through the graph in a bottom-up fashion. While this works, there are shortcomings with the approach:

  1. The graph traversal method is tailored for evaluation of numerical values. While other relevant graph operations, such as compatibility checks between operator sizes, or consistency checks of units, would require a similar traversal, the current code would necessitate the writing of a new, but very similar, traversal function for each new operation.
  2. There is no way in which we can identify identical nodes, or subgraphs, and cache evaluated values to save computational time.
  3. The top-level Operator class is now responsible both for representing combinations of operations and the evaluation of computational graphs, thus violating the single responsibility principle. While this is not a major problem by itself, it is a signal that the current design is not ideal.

Suggested changes

The following large-scale changes are suggested - specifications are needed before implementation:

  1. Implement a function for traversal of a computational graph. To EK, it seems natural to convert the tree into a priority queue, but the choice of data structure may not be a primary concern. A rough signature for the function wolud be def graph_traversal(list[pp.ad.Operator]) -> Iterator[pp.ad.Operator]. The motivation for accepting a list of operators is given in the outline of the Ad parser class below.
  2. Implement a class, say AdParser, which takes over responsibility for parsing from the Operator class. This will use the graph traversal function to iterate over the computational graph and convert it to numerical values. Moreover, the parser should contain functionality for chaching of operator values to reduce evaluation time.

Thoughts on the AdParser class

Below is an outline of the class:

class AdParser:

    mdg: MixedDimensionalGrid
   # Parsing of Operator leaves will require access to the grid. This is currently obtained
   # by passing the EquationSystem, but direct access is preferable.

    _cache_table_for_current_value: dict[pp.ad.Operator, np.ndarray]
    _cache_table_for_current_value_and_jacobian: dict[pp.ad.Operator, pp.ad.AdArray]
    # Temporarily store computed values here during evaluation. Not sure if we need separate
    # tables for numpy and Ad arrays

     _longer_term_caching_of_values: dict[pp.ad.Operator, np.ndarray]
    ...
    # Operators that will not (are not likely) to change between evaluation steps, and which are
    # costly to compute, can be stored here inbetween evaluations. This should not be part of
    # the initial implementation.

    def value(self, list[pp.ad.Operator]) -> list[np.ndarray]:
        # Fetch a joint iterator for the operators. Evaluation has two steps: First check if this operator
        # has previously been evaluated (check if it is found in the cache tables, which will use the new
        # hash functions), compute if not available.
        # The signature is non-trivial: We need to accept a list to utilize caching between evaluation of
        # multiple equations, but when iterating, we also need to keep track of which operator tree the
        # individual operators are associated with. Achiving this may require tweaks to the outlined implementation.
        # The need to cache between equations also show that parsing must be moved out of the
        # individual operators.

    def value_and_jacobian():
        # Similar to above

Access to the AdParser

While it is tempting to make the AdParser a singleton accessible as pp.AdParser or similar, there is no clear reason to do so. Instead, it is suggested that all models are equipped with an AdParser, just as they have an EquationSystem. It should also be possible to parse outside a model, by a call pp.ad.AdParser(mdg).value_and_jacobian([operator]), though such operations will not be very common.

The role of the EquationSystem

While the functionality of the AdParser could have been integrated into the EquationSystem, it is preferable not to add more functionality into this (arguably already overloaded) class. Instead, the EquationSystem can be given the model-wide AdParser as an attribute and call this for evaluation of equations.

Dependencies

This should not be started until completion of #1179 and #1180

@IvarStefansson
Copy link
Contributor

Will there be functionality reuse between value and value_and_jacobian? If yes, how do you envision this? Also, will the AdParser be the natural candidate to do consistency checks on dimension compatibility?

@keileg
Copy link
Contributor Author

keileg commented Aug 8, 2024

Note to self: Operator evaluation will also be needed as part of the solution process (test case: the line search methods being introduced in #1208). We need to make sure that the AdParser class can be invoked also in these cases.

@keileg keileg mentioned this issue Aug 8, 2024
13 tasks
@keileg
Copy link
Contributor Author

keileg commented Aug 16, 2024

NOTE: For debugging purposes, it should always be possible to switch off all caching.

@keileg
Copy link
Contributor Author

keileg commented Jan 13, 2025

The current operator tree is generated by Python's processing system for arithmetic operations, and therefore automatically follows the expected rules of precedence for operators. There is however some leeway and room for improvement in ordering operations with equal priority, where Python will operate from left to right whereas we typically want to operate from right to left (reasons: 1. This is needed for the slicing approach to projections hinted at in #1182, 2., this may be faster for residual evaluations, where an expression of the type matrix @ matrix @ vector is best processed from the right).

To exploit this flexibility, it is necessary to rearrange the computational graph to prioritize multiplication from right, though without violating the rules for operator precedence. Exactly how to do this is unclear, my instinct is to maximize the depth of nodes involving AdArrays and DenseArrays, but I am not sure about this.

@keileg
Copy link
Contributor Author

keileg commented Jan 13, 2025

The introduction of an AdParser class will change the responsibility for ad operator evaluation from the operator itself to the parser. This has two main benefits:

  1. It enables caching of computations between evaluations of different operators.
  2. It simplifies the Ad operator class, bringing it closer to having a single responsibility.

The AdParser will need to be available of the EquationSystem since the latter is responsible for orchestrating the evaluation and composition of the full Jacobian matrix. In practice, AdParser should be an attribute of EquationSystem. In addition, the AdParser should also be an attribute of the main model class, for two reasons:

  1. At various places in the models (ex: constitutive laws) we evaluate an AdOperator before wrapping it as a new operator (doing some operation inbetween).
  2. For debugging, it is invaluable to be able to evaluate an Ad operator.

This calls for the AdParser to be created, say, in prepare_simulation and be passed to the EquationSystem, perhaps as part of initialization of the latter.

While it is formally okay make several instances of the AdParser, caching makes these objects potential drains of memory, and of course caching is more efficient if there is data to share. I don't think it is necessary to make the AdParser a singleton, but will still strongly advocate for having one central parser for the model and the EquationSystem.

@IvarStefansson could we have a discussion on this at some point?

@IvarStefansson
Copy link
Contributor

Sure!

@keileg
Copy link
Contributor Author

keileg commented Jan 13, 2025

Implementation steps:

  1. Introduce a new graph structure, based on networkx, in the operator class, let it live in parallel with the existing homebrewed graph.
  2. Introduce the AdParser class that operates on the new graph. Simple caching only.
  3. Make the AdParser an attribute of EquationSystem and the models.
  4. Verify that the new and old parsing systems give equal results.
  5. Purge the old system from EquationSystem and the models (in particular constitutive laws)
  6. Purge the old graph structure from pp.ad.Operator.

While it is tempting to split this into several PRs, there will be a lot of experimental coding here; moreover, the new system will not really be meaningful until step 4 is completed. Hence, it seems this will be a single effort.

I hope to start working on this in earnest over the next few days.

@keileg
Copy link
Contributor Author

keileg commented Jan 15, 2025

Update on relation between classes: Parsing of variables, thus general operator trees, requires access to the variable state, which is accessed through the EquationSystem. If the AdParser is to exist on the same level as EquationSystem (e.g., both are attributes of a model), there are conceptually two ways of doing this:

  1. The EquationSystem is an attribute of AdParser. To avoid a cyclic dependency where the parser is also an attribute of EquationSystem, we will need to move evaluation of the full Jacobian matrix out of EquationSystem. However, the reason Jacobian evaluation is located in EquationSystem is the evaluation (almost) requires access to DOF numbering etc. Also, such a change would amount to a larger refactoring of this part of the code.

  2. The parser is considered a helper class of EquationSystem, and all evaluation of operators goes through the latter class. This is in many ways more logical, since the front-end of all evaluation of ad operators (current functions value and value_and_jacobian in pp.ad.Operator and current functions assemble, assemble_schur_complement_system in EquationSystem) become colocated in the EquationSystem.

With option 2 (which is my strong preference), evaluation inside a model class (during equtaion construction or debugging) will go from

some_operator.value(self.equation_system)

to

self.equation_system.operator_value(some_operator)

Disclaimer: There may be caveats that I don't see yet, but hopefully such a solution can be implemented.

@IvarStefansson If you have opinions on this, please let me know. We can surely negotiate on the method name EquationSystem.operator_value.

@IvarStefansson
Copy link
Contributor

Is the cyclic dependency really a problem?

@keileg
Copy link
Contributor Author

keileg commented Jan 20, 2025

Is the cyclic dependency really a problem?

Yes, it will lead to import errors (and it generally signifies a poor code design).

@vlipovac
Copy link
Contributor

Is the cyclic dependency really a problem?

Yes, it will lead to import errors (and it generally signifies a poor code design).

If I may attempt to be helpful, the descriptor protocol will help you with that.
If the parser is to be an attribute of the equation system, providing an overload of __get__ will help you inform the parser about the equation system instance which owns it.
A working example of this approach is already present in the core code in ad.operator_functions

def __get__(self, binding_instance: object, binding_type: type) -> Callable:

@keileg
Copy link
Contributor Author

keileg commented Jan 22, 2025

Status after latest changes:

  1. The AdParser gives identical results to the old parsing system for all tests in the model class.
  2. The performance of AdParser is, on a few selected benchmarks, very similar to the old system. The currently very simple caching in AdParser gives some savings, while the need to process the networkx graph into a priority queue has some cost, which for the currently tested grid sizes roughly level out. For larger grids, where the cost really matters, the caching should win out since the size of the graph scales with the operator tree, the physics, not the grid.
  3. The current AdParser implements only a very simple caching mechanism, with the cache being wiped between all evaluations. We can easily improve this somewhat, more advanced policies will have to come later.

Remaining tasks before PR:

  • Implement a check that only caches auxiliary unknowns that are needed later. Should save memory.
  • Make the AdParser the default option.
  • Tweak the EquationSystem.value_and_jacobian() (and perhaps .value()) so that the cache is cleaned only when a full set of equations have been parsed.
  • Documentation.
  • Purge old code from the Operator class.
  • Check if we can introduce the networkx-format graph only when invoking the parser. This would make for much cleaner code.

@keileg
Copy link
Contributor Author

keileg commented Jan 24, 2025

Summary of observations made during more thorough benchmarking

This was much more complex than anticipated. Essentially, there are two ways of parsing an operator tree:

  1. The old approach uses top-down graph traversal, where each node recursively makes its children evaluate, before self-evaluating. Caching was not easy to realize in the old implementation, but I have made a prototype version that can do this.
  2. Bottom-up traversal, where, starting with the leaves, each node is evaluated and cached (this is necessary) for evaluation in parent nodes. The graph can be defined as breadth-first or depth-first, but this makes little impact on performance.

Caching, when implemented, can be done either individually among each operator tree, or jointly for a collection of trees (say, when the EquationSystem needs the Jacobian for multiple equations). I have not yet implemented functionality for keeping parts of the cache between Jacobian evaluations (this is possible, but not an immediate priority).

Timings below are based on measurements using line_profiler. In case the detailed profiling disturbs the measurements, I have also run timing on the parent function to confirm that this coarser measurement corresponds well with the finer one obtained by line-by-line profiling. There is of course a slight risk that the line_profiler fundamentally alters the timings, but that is considered highly unlikely.

I have mainly considered a THM simulation with a varying number of cells, roughly estimated (guessed) to range from low hundreds to several thousands.

Observations:

  1. The old recursive top-down strategy performs well in terms of time spent on individual operations (say, accumulated time spent on mat-multing two operators). This is consistent for all relevant arithmetic operations, and on all grid levels.
  2. The new prototype, recursive top-down with the possibility for caching, performs on par with the old strategy when no caching is applied. Caching of leaves (my guess is this is mainly construction of divergence operators) is an improvement compared to the old benchmark. However caching of the result of operations (say, /@ or *) will often increase the total time spent on these operations, even if some of the operand pairs are omitted. That is, the set of operand pairs considered when caching is a strict subset of that considered without caching, and still, it can take more wall-clock time to do the operations. Depending on various circumstances, the change can be from 5-10%, up to 2-300%. Generally, the more caching is done, the worse the performance, and the worst timings are obtained when evaluation of a set of operator trees share the cache.
  3. The bottom-up strategies, since they rely on caching more or less all operations (though saved values are evicted from the cache when they are no longer needed), and thus perform poorly. Again, the worst performance is when sharing the cache between several trees.

The best option right now seems to be the one mentioned in observation no 2 above, though with minimal caching, but I will need to experiment more before concluding. If anyone can explain what is going on, I would be most interested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants