Unsafe error handling in Orchestrator #351

andrey-kuprianov · 2021-10-11T11:43:28Z

Surfaced from @informalsystems audit of Althea Gravity Bridge at commit 19a4cfe

severity: High
type: Implementation bug
difficulty: Intermediate

Involved artifacts

Orchestrator

Description

The codebase has generous use (hundreds) of panic-causing statements: unwrap(), panic!(), assert, assert_eq, assert_ne and expect.
In some cases they are safe, but we have identified many cases for which it is unclear whether the asserted condition is always true, or can be violated due to e.g. network errors or misbehavior/bugs in other Gravity Bridge components.

Below we provide some examples of those statements, characterized into three categories:

Assertions that depend on other Gravity Bridge components

These assertions may be fine as soon as the other Gravity components behave correctly. Checking whether this is indeed true is an extremely complex and time-consuming process. Also, even if those components do behave correctly, there is always a possibility of some bug, or a malicious action.

orchestrator/gravity_utils/src/types/valsets.rs#L151
```
for member in self.members.iter() {
    if let Some(eth_address) = member.eth_address {
        if let Some(sig) = signatures_hashmap.get(&eth_address) {
            assert_eq!(sig.get_eth_address(), eth_address);
            assert!(sig.get_signature().is_valid());
```
This is probably safe as the valset updates are coming from trusted validators. Now the question is:
- what if due to some bug an invalid signature is included?
- what if a malicious validator intentionally includes a wrong signature? This will lead to all other Orchestrators stopping, and all other validators being slashed.

orchestrator/gravity_utils/src/types/valsets.rs#L167

// the go code verifies signatures, if we ever see this it means
// that something has gone horribly wrong with our parsing or ordering
// in the orchestrator, therefore we panic.
panic!(
    "Found invalid signature for {} how did this get here?",
    sig.get_eth_address()
)

Same as above.

orchestrator/relayer/src/find_latest_valset.rs#L93

if cosmos_valset != *ethereum_valset {
    // if this is not true then we have a logic error on the Cosmos chain
    // or with our Ethereum search
    assert_eq!(cosmos_valset.nonce, ethereum_valset.nonce);

If, as the comment says, there is indeed some logic error, is this the valid reason for all Orchestrators stopping and validators being slashed?

orchestrator/relayer/src/valset_relaying.rs#L49-53
```
let valset = get_valset(grpc_client, latest_nonce).await;
if let Ok(Some(valset)) = valset {
    assert_eq!(valset.nonce, latest_nonce);
```
This is probably safe, because the rpc calls ask for the specific nonce; but what if there is a bug in the RPC call?
orchestrator/ethereum_gravity/src/utils.rs#L144
orchestrator/ethereum_gravity/src/utils.rs#L168
orchestrator/ethereum_gravity/src/utils.rs#L119

orchestrator/ethereum_gravity/src/utils.rs#L94

// the go represents all nonces as u64, there's no
// reason they should ever overflow without a user
// submitting millions or tens of millions of dollars
// worth of transactions. But we properly check and
// handle that case here.
let real_num = Uint256::from_bytes_be(&val);
Ok(downcast_uint256(real_num).expect("Valset nonce overflow! Bridge Halt!"))

This depends on the correct behavior of the Solidity contract, and on the nonces never overflowing 64 bits. While unlikely, both conditions still could be violated.

Assertions in the initialization phases or in the client code

ignores CosmosGRPCError which can happen because of reasons not related to balance or requesting batch.
- orchestrator/gbt/src/client/cosmos_to_eth.rs#L59
- orchestrator/gbt/src/client/cosmos_to_eth.rs#L109
ignores Web3Error which can happen without incorrect gravity address or ERC20 queries.
There are other expects in the code. But we do not mention them as they are at the initialization phase or due to user specific errors.
- Failed to create directory or write to file.
- Failed to load cosmos or eth keys.

These errors probably won't lead to severe consequences, such as validator slashing. But they will definitely occur many times, and the error will be communicated to the user via panicing and printing a stack trace. This is not the best way to communicate errors to the user, and should be improved.

Vacuously true assertions

These assertions are guaranteed to be safe due to the code structure, and can thus be removed. Keeping them in the code makes code inspection exceptionally difficult, as we spend time on false positives. Minimizing the usage of them would be helpful for future code inspections. Here are a couple of examples:

orchestrator/gravity_utils/src/types/batches.rs#L59-L60
- safe because of the Vecs are populated in a same length loop.
orchestrator/ethereum_gravity/src/valset_update.rs#L26
- safe as the check is done before reaching this call.

Problem Scenarios

As there is a very large number (hundreds) of panic-resulting statements, it is close to impossible to enumerate all possible problem scenarios. Here are a few example ones:

Due to a bug in other Gravity Bridge component, an input comes to the Orchestrator that triggers one of the panic-statements;
The input triggering a panic is maliciously crafted (this may happen even from a validator side);
A network error occurs, which propagates to one of the panic statements.

The effect depends on in the part of the Orchestrator in which the panic occurs. A panic in the initialization code will probably result only in the user inconvenience and annoyance. If the panic occurs a long time after the Orchestrator has been started, it will stop, and this may lead to validators being slashed without malicious actions from their part.

It should be noted that Orchestrator is a single binary responsible for all kinds of operations: a panic triggered during any of those (even seemingly unimportant) will cause the whole processing to stop, with the aforementioned severe consequences.

Recommendation

Short term

We recommend different course of action depending of the panic category:

Assertions that depend on other Gravity Bridge components:
- Carefully inspect the code in the other components on which the assertion depends;
- Move error-handling code in the Orchestrator to the input-receiving boundary;
- For the cases when this is not possible, propagate the errors using Results to the top level;
- At the top level, implement a uniform error handling with alerts issued to validators;
- Do not stop processing of other legitimate, non-erroneous requests.
Assertions in the initialization phases or in the client code:
- Move error-handling code to the input-receiving boundary or propagate errors to the top level;
- At the top level, implement a uniform error handling with meaningful error messages issued to the user.
Vacuously true assertions:
- Inspect the code, and if the assertion proves to be vacuously true, remove it.

Long term

Refactor the Orchestrator into independent binaries each responsible for different operation kind. Ensure that all binaries responsible for possible validator slashing adhere to the highest possible security standards; they should not stop under any circumstances, and should employ defensive programming techniques against possible failures in other components.

The text was updated successfully, but these errors were encountered:

andrey-kuprianov mentioned this issue Oct 13, 2021

Consider making Gravity Bridge validation voluntary #358

Open

hannydevelop mentioned this issue Oct 21, 2021

Unsafe error handling in Orchestrator PeggyJV/gravity-bridge#262

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unsafe error handling in Orchestrator #351

Unsafe error handling in Orchestrator #351

andrey-kuprianov commented Oct 11, 2021

Unsafe error handling in Orchestrator #351

Unsafe error handling in Orchestrator #351

Comments

andrey-kuprianov commented Oct 11, 2021

Involved artifacts

Description

Assertions that depend on other Gravity Bridge components

Assertions in the initialization phases or in the client code

Vacuously true assertions

Problem Scenarios

Recommendation

Short term

Long term