Implement ARM64 support and RiscyROP chaining algorithm #124

Draft
wants to merge 62 commits into
base: master


@bkrl bkrl commented Jan 4, 2025

For my intern project at Trail of Bits, I modified angrop to add support for ARM64 and implement algorithms from this paper: RiscyROP: Automated Return-Oriented Programming Attacks on RISC-V and ARM64. The new algorithm is able to generate more complex register setting chains like this one which controls all eight argument registers and the return address register on ARM64 glibc:

Example chain

# gadget 1
0x429b68:  ldp  x19, x20, [sp, #0x10]  # load value for x5 into x20
0x429b6c:  ldp  x21, x22, [sp, #0x20]  # load value for x4 into x22,
0x429b70:  ldp  x29, x30, [sp], #0x30  # and set x21 to address of gadget 3
0x429b74:  ret

# gadget 2
0x4d3b54:  mov  x5, x20                # set x5
0x4d3b58:  mov  x4, x22                # set x4
0x4d3b5c:  mov  x0, x19
0x4d3b60:  mov  x6, #0
0x4d3b64:  blr  x21                    # x21 set in gadget 1

# gadget 3
0x46ae00:  ldr  x0, [sp, #0x18]        # set x0 for conditional branch in gadget 4
0x46ae04:  ldp  x29, x30, [sp], #0x20
0x46ae08:  ret

# gadget 4
0x4502dc:  cmn  w0, #1                 # x0 set in gadget 3
0x4502e0:  ldr  x3, [sp, #0x98]        # set x3
0x4502e4:  mov  x8, #1
0x4502e8:  b.eq  #0x44f5f0             # conditional branch
0x44f5f0:  mov  w19, #-1
0x44f5f4:  adrp  x0, #0x59f000
0x44f5f8:  ldr  x0, [x0, #0xe68]
0x44f5fc:  ldr  x2, [sp, #0x198]       # set x2 for conditional branch
0x44f600:  ldr  x1, [x0]
0x44f604:  subs  x2, x2, x1            # x1 is a constant
0x44f608:  mov  x1, #0
0x44f60c:  b.ne  #0x450acc             # conditional branch
0x44f610:  ldp  x21, x22, [sp, #0x20]
0x44f614:  mov  w0, w19
0x44f618:  ldp  x19, x20, [sp, #0x10]
0x44f61c:  ldp  x23, x24, [sp, #0x30]
0x44f620:  ldp  x25, x26, [sp, #0x40]
0x44f624:  ldp  x27, x28, [sp, #0x50]
0x44f628:  ldp  x29, x30, [sp], #0x1a0
0x44f62c:  ret

# gadget 5
0x4d4174:  ldp  x21, x30, [sp, #0x10]
0x4d4178:  ldp  x19, x20, [sp], #0x20  # set x20 to address of gadget 7
0x4d417c:  ret

# gadget 6
0x43ec80:  mov  x2, x21                # set x2
0x43ec84:  blr  x20                    # x20 set in gadget 5

# gadget 7
0x46ae00:  ldr  x0, [sp, #0x18]        # set x0 for conditional branch in gadget 8
0x46ae04:  ldp  x29, x30, [sp], #0x20
0x46ae08:  ret

# gadget 8
0x4c0e78:  cmn  w0, #1                 # x0 set in gadget 7
0x4c0e7c:  ldr  x7, [sp, #0x78]        # set x7
0x4c0e80:  ldp  w6, w11, [sp, #0x80]
0x4c0e84:  ldp  w10, w9, [sp, #0x88]
0x4c0e88:  b.eq  #0x4c0e18             # conditional branch
0x4c0e18:  ldp  x19, x20, [sp, #0x10]
0x4c0e1c:  mov  w0, #-1
0x4c0e20:  ldp  x21, x22, [sp, #0x20]
0x4c0e24:  ldp  x23, x24, [sp, #0x30]
0x4c0e28:  ldp  x25, x26, [sp, #0x40]
0x4c0e2c:  ldp  x27, x28, [sp, #0x50]
0x4c0e30:  ldp  x29, x30, [sp], #0x90
0x4c0e34:  ret

# gadget 9
0x434af4:  ldr  x1, [sp, #0x18]        # set x1 for conditional branch in gadget 12
0x434af8:  ldp  x29, x30, [sp], #0x20
0x434afc:  mov  x0, x1
0x434b00:  ret

# gadget 10
0x44a398:  ldp  x21, x22, [sp, #0x20]  # load address of gadget 12 into x21
0x44a39c:  ldp  x25, x26, [sp, #0x40]
0x44a3a0:  ldr  x0, [sp, #0x78]        # load value for x6 into x0
0x44a3a4:  ldp  x29, x30, [sp], #0xc0
0x44a3a8:  ret

# gadget 11
0x49c270:  mov  x16, x21               # move address of gadget 12 into x16
0x49c274:  ldr  x21, [sp, #0x20]
0x49c278:  ldp  x29, x30, [sp], #0x90  # set x30 to address of gadget 13
0x49c27c:  br  x16

# gadget 12
0x44a9c0:  mov  x6, x0                 # set x6
0x44a9c4:  mov  x0, #0
0x44a9c8:  cbz  x1, #0x44aad8          # conditional branch, x1 set in gadget 9
0x44aad8:  ret                         # x30 set in gadget 11

# gadget 13
0x434af4:  ldr  x1, [sp, #0x18]        # set x1
0x434af8:  ldp  x29, x30, [sp], #0x20
0x434afc:  mov  x0, x1
0x434b00:  ret

# gadget 14
0x44a398:  ldp  x21, x22, [sp, #0x20]  # load final jump address into x21
0x44a39c:  ldp  x25, x26, [sp, #0x40]
0x44a3a0:  ldr  x0, [sp, #0x78]        # set x0
0x44a3a4:  ldp  x29, x30, [sp], #0xc0
0x44a3a8:  ret

# gadget 15
0x49c270:  mov  x16, x21               # move final jump address to x16
0x49c274:  ldr  x21, [sp, #0x20]
0x49c278:  ldp  x29, x30, [sp], #0x90  # set x30
0x49c27c:  br  x16

I'm making this a draft PR for now since there's probably more work to be done before this can be merged and I would appreciate any feedback. I'm happy to discuss things and continue working on this.

Changes

Chaining algorithm

The new chaining algorithm is a recursive DFS that generates the chain backwards instead of forwards. It keeps track of the set of registers that it needs to control, and it tries to prepend gadgets to the chain that control some of those registers. The target register set is updated after each gadget is prepended, and this repeats until the set becomes empty. A heuristic is used to estimate how easy it is to set each register by counting the number of gadgets that pop the register, and gadgets are selected based on the estimated difficulty of the resulting set of registers.
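A minimal sketch of this backward search, using a toy gadget model rather than angrop's real classes. The `Gadget` fields, `targets_before()`, `build_chain()`, and the cost formula are all assumptions made for illustration; control flow between gadgets, branch conditions, and stack layout are ignored here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Gadget:
    name: str
    sets: frozenset = frozenset()              # registers popped from the stack
    moves: dict = field(default_factory=dict)  # dst -> src, src is read on gadget entry
    clobbers: frozenset = frozenset()          # registers left with values we cannot choose


def targets_before(gadget, targets):
    """Registers that must be controlled before `gadget` runs, or None if it cannot be used."""
    controlled = targets & (gadget.sets | frozenset(gadget.moves))
    if not controlled:
        return None
    remaining = targets - controlled
    if remaining & gadget.clobbers:
        return None
    sources = [gadget.moves[r] for r in controlled if r not in gadget.sets]
    # A move makes dst equal to src, so src cannot be shared or needed for itself.
    if len(set(sources)) != len(sources) or set(sources) & targets:
        return None
    return remaining | frozenset(sources)


def build_chain(targets, gadgets, max_depth=8):
    """Backward DFS: pick the last gadget first, then recurse on what is still needed."""
    pops = Counter(r for g in gadgets for r in g.sets)

    def cost(regs):
        # Registers popped by few gadgets are assumed to be harder to set.
        return sum(1.0 / (1 + pops[r]) for r in regs)

    def dfs(needed, depth):
        if not needed:
            return []
        if depth == 0:
            return None
        candidates = []
        for g in gadgets:
            before = targets_before(g, needed)
            if before is not None:
                candidates.append((cost(before), g, before))
        for _, g, before in sorted(candidates, key=lambda c: (c[0], c[1].name)):
            prefix = dfs(before, depth - 1)
            if prefix is not None:
                return prefix + [g]  # g runs after everything found for `before`
        return None

    return dfs(frozenset(targets), max_depth)


# Toy versions of gadgets 1 and 2 from the example chain.
g1 = Gadget("gadget1", sets=frozenset({"x19", "x20", "x21", "x22", "x29", "x30"}))
g2 = Gadget("gadget2", moves={"x5": "x20", "x4": "x22", "x0": "x19"},
            clobbers=frozenset({"x6"}))
chain = build_chain({"x4", "x5"}, [g1, g2])
print([g.name for g in chain])
```

The depth limit stands in for proper backtracking control; the PR's implementation additionally tracks equivalent chains it has already seen.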

Building the chain backwards instead of forwards is advantageous in situations where gadgets can be used to control multiple registers, but not at the same time. This commonly happens with gadgets like gadget 9 in the example above that copy a value from one register to another. They allow us to control either the source register or the destination register, but not both since the registers will always have the same value. When we build the chain backwards, we will know which of the registers need to be controlled, and if we need both registers then we know that the gadget can't be used.
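The "either register, but not both" rule can be sketched as a check on which stack variable each register ends up holding. This is a hypothetical helper, not angrop's API; the dictionary below is a hand-written model of gadget 9, where `x0` and `x1` both receive the value at `[sp, #0x18]`.

```python
def can_control_together(popped_from, wanted):
    """Return True if every register in `wanted` is fed by its own stack variable.

    `popped_from` maps a register name to the stack variable it ends up holding.
    """
    used = set()
    for reg in wanted:
        var = popped_from.get(reg)
        if var is None or var in used:
            return False  # not stack-controlled, or shares a slot with another target
        used.add(var)
    return True


# Hand-written model of gadget 9: ldr x1, [sp, #0x18]; ldp x29, x30, [sp], #0x20; mov x0, x1
gadget9 = {
    "x0": "stack_0x18",
    "x1": "stack_0x18",
    "x29": "stack_0x00",
    "x30": "stack_0x08",
}

print(can_control_together(gadget9, {"x0"}))
print(can_control_together(gadget9, {"x0", "x1"}))
```

Because the chain is built backwards, the set passed as `wanted` is already known when the gadget is considered, so a gadget like this can be rejected immediately when both registers are required.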

A disadvantage with the DFS algorithm is that if the first few gadgets that it chooses are bad then it can take a long time to backtrack. This hasn't happened much in my testing but I think there are further improvements that can be made to the algorithm.

Conditional branches

Gadgets with conditional branches are now supported. For each gadget, the set of registers that affect branch conditions is stored, and the chaining algorithm adds those registers to the set of target registers. A list of basic block addresses is stored for each gadget so that we know which branch to execute when the chain is built.

For compatibility, the gadget analyzer doesn't return gadgets with conditional branches unless the new allow_conditional_branches option is enabled. With conditional branches, there can be multiple different gadgets at the same address. I updated some tests to account for this, and the gadget analyzer returns lists when the option is enabled. Chain builders other than RegSetter discard gadgets with conditional branches for now.
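As a rough illustration of the bookkeeping described in this section, here is a sketch with a hypothetical `BranchGadget` record. The field names are assumptions and do not match `RopGadget`; the "not taken" variant of gadget 12 is invented purely to show two gadgets sharing one start address.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchGadget:
    addr: int                # address of the first instruction
    block_addrs: tuple       # basic blocks in the order they must execute
    branch_regs: frozenset   # registers that decide which way the branches go


def group_by_address(gadgets):
    """Several gadgets can start at one address, one per path through the branches."""
    by_addr = defaultdict(list)
    for g in gadgets:
        by_addr[g.addr].append(g)
    return dict(by_addr)


def add_branch_targets(gadget, targets):
    """The chain must also control the registers that steer the gadget's branches."""
    return frozenset(targets) | gadget.branch_regs


# Gadget 12 from the example chain: cbz x1 is taken, execution continues at 0x44aad8.
taken = BranchGadget(0x44A9C0, (0x44A9C0, 0x44AAD8), frozenset({"x1"}))
# Hypothetical fall-through variant starting at the same address.
not_taken = BranchGadget(0x44A9C0, (0x44A9C0, 0x44A9CC), frozenset({"x1"}))

grouped = group_by_address([taken, not_taken])
print(len(grouped[0x44A9C0]))
print(sorted(add_branch_targets(taken, {"x0"})))
```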

Chain construction

I rewrote most of the Builder._build_reg_setting_chain() function. It supports conditional branches now, and it determines which stack location corresponds to which gadget address or register value by checking which symbolic stack variable gets loaded into the instruction pointer or target register. It creates a dictionary mapping symbolic stack variables to gadgets or RopValues, which is used when generating the payload. I think this is simpler and more robust than comparing concrete values or inspecting the solver constraints. It also doesn't assume that gadget addresses appear in order of execution, which is often not the case. For example, in the chain above, gadget 5 loads the address of gadget 7 into a register before jumping to gadget 6, and the address of gadget 7 has to be placed before the address of gadget 6.
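The mapping idea can be sketched without angr or claripy by using plain strings as stand-ins for symbolic stack variables. The helper names and the value `0x1234` are assumptions for illustration; the layout models gadget 5, where the address of gadget 7 (loaded into `x20`) sits at a lower stack offset than the address of gadget 6 (loaded into `x30` and returned to).

```python
import struct


def assign_stack_slots(loaded_from, wanted):
    """Map each stack variable to the value that has to be written there.

    `loaded_from` says which stack variable ends up in each register (or in pc);
    `wanted` says which value each register (or pc) should receive.
    """
    return {loaded_from[target]: value for target, value in wanted.items()}


def build_payload(slot_order, assignment, filler=0x4141414141414141):
    """Emit 64-bit little-endian words, using a filler for unconstrained slots."""
    words = [assignment.get(var, filler) for var in slot_order]
    return struct.pack("<%dQ" % len(words), *words)


GADGET_6 = 0x43EC80
GADGET_7 = 0x46AE00

# Gadget 5: ldp x21, x30, [sp, #0x10]; ldp x19, x20, [sp], #0x20; ret
slot_order = ["sp+0x00", "sp+0x08", "sp+0x10", "sp+0x18"]
loaded_from = {"x19": "sp+0x00", "x20": "sp+0x08", "x21": "sp+0x10", "pc": "sp+0x18"}
wanted = {"x20": GADGET_7, "x21": 0x1234, "pc": GADGET_6}

assignment = assign_stack_slots(loaded_from, wanted)
payload = build_payload(slot_order, assignment)
print(payload.hex())
```

Keying the dictionary by stack variable rather than by execution order is what lets the address of gadget 7 land before the address of gadget 6.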

For compatibility, the address of the first gadget is added as the first value in the payload because the existing code assumes that this is the case, but on ARM64 the initial address would usually have to be placed somewhere else since GCC puts the return address near the start of the stack frame instead of at the end. I think it makes more sense to separate the initial address from the rest of the payload, since where it has to be placed depends on how the chain is entered.

Gadget filtering

I disabled the code in the chain builders that compares every gadget with every other gadget, since that doesn't scale well and can be slow when there are a lot of gadgets. It doesn't seem to significantly affect the chain building.

Minor changes

  • Fixed a bug in the calculation of the first address to check, which caused the minimum address to be skipped if it was aligned.
  • Fixed GadgetAnalyzer.is_in_kernel() throwing an exception on some addresses because obj is None. The code was mostly a duplicate of rop_utils.is_in_kernel() so I made it call that function instead.
  • Removed the broad except clause in the gadget analyzer since it would catch things that we don't want to catch like KeyboardInterrupt.
  • The pc_reg and jump_reg attributes of gadgets seem to always be the same, and the jump_reg computation doesn't work with conditional branches, so I just set jump_reg to pc_reg.
  • Gadgets with pc_offset >= stack_change are rejected since we assume that gadgets don't read past where they leave the stack pointer. These gadgets can cause conflicts since other gadgets might load a value from the same location.
  • Fixed (act.data.ast == ip).symbolic causing an exception in the gadget analyzer when act.data.ast is a floating-point expression instead of a bitvector, in which case the == operator returns False instead of a claripy expression.
  • For each popped register, the symbolic stack variable that gets popped into the register is stored so that the chain builder can check if two registers are popped from the same location on the stack and can't be independently controlled. For example, in gadget 9 of the chain above, both x0 and x1 are popped registers but they can't both be controlled.
  • In some function calling tests, max_steps is set when executing the chain since the chain doesn't know how many steps the fake function gadget needs.
  • test_roptest_x86_64() in test_rop.py is rewritten so that it executes the chain and checks the resulting register values instead of checking for a fixed sequence of gadget addresses.
  • execute_chain() in test_rop.py checks chain.next_pc_idx(). The chain concretization code tries to append a no-op gadget if next_pc_idx is not None, but this doesn't always work on architectures like ARM64 that don't have a stack return instruction.
  • The gadget analyzer catches timeout exceptions.
  • Periodically restart worker processes to work around the memory leak issue described below.
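To illustrate the first item in this list, here is a sketch of the kind of rounding mistake that skips an already-aligned minimum address. The actual angrop code is not shown in this description, so the buggy form below is only a guess at this class of off-by-one, and both function names are hypothetical.

```python
def first_aligned_buggy(min_addr, alignment):
    # Always moves forward, so an aligned min_addr is skipped by one full step.
    return min_addr + alignment - min_addr % alignment


def first_aligned(min_addr, alignment):
    # -min_addr % alignment is 0 when min_addr is already aligned.
    return min_addr + (-min_addr % alignment)


print(hex(first_aligned_buggy(0x1000, 4)))
print(hex(first_aligned(0x1000, 4)))
print(hex(first_aligned(0x1001, 4)))
```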

Limitations

Gadget finding speed and memory usage

The new gadget finding code is a lot slower than the existing code. With fast mode disabled (which is necessary to do anything useful on ARM64), finding gadgets on glibc takes one to two hours with 16 processes. I think there are probably ways to improve this, though the RiscyROP paper's implementation is even slower.

There appears to be some sort of memory leak issue involving z3 that causes the memory usage to keep going up when finding gadgets. The z3 version has a big effect on the memory usage, and with the latest version it quickly grows to several GBs per process. It's not as bad with older versions like 4.12.6.0, and I added a workaround that periodically restarts the worker processes. This seems to mostly keep the memory usage below 2 GB per process.

I've also encountered an issue a few times where things would hang at the end of the gadget finding process, though this might be caused by one of the processes getting killed due to running out of memory.

Compatibility and integration

I've integrated the new algorithms with the existing code enough that I think all of the tests pass. The use_partial_controller option is not supported in the new RegSetter at this point and the generated chains might have longer payloads since the new algorithm doesn't necessarily return the chain with the smallest stack change. The new algorithm doesn't yet support using gadgets that set registers to concrete values, so I also used the existing _find_all_candidate_chains() function in combination with the new algorithm.
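The combination of the two algorithms can be pictured as a simple fallback, following the commit message that says the existing BFS search is tried first because it is fast. The sketch below uses stand-in callables; `first_successful()` and both lambdas are hypothetical and are not angrop functions.

```python
def first_successful(builders, targets):
    """Try each chain builder in order and return the first chain found."""
    for build in builders:
        chain = build(targets)
        if chain is not None:
            return chain
    return None


# Stand-ins: the fast search only handles a single register, the slow one handles anything.
fast_bfs = lambda targets: ["bfs_chain"] if len(targets) == 1 else None
slow_dfs = lambda targets: ["dfs_chain"]

print(first_successful([fast_bfs, slow_dfs], {"x0"}))
print(first_successful([fast_bfs, slow_dfs], {"x0", "x1"}))
```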

Architecture support

Some of the chain builders other than RegSetter might not work on ARM64 because they still have some architecture-specific assumptions. For example, shift gadgets that shift the stack by one value are sometimes used as no-op gadgets. This works on architectures like x86 with stack return instructions, but on other architectures like ARM64 there might not be any gadgets that shift by exactly one value, especially due to stack alignment.

bkrl added 30 commits December 17, 2024 11:20
This makes angrop run on aarch64 binaries, but it's not able to do
anything useful yet.
This fixes the handling of gadgets where the address of the next gadget
has to be inserted in the middle instead of just concatenated to the
end. angrop is now able to set registers if there is a gadget that loads
into the register from the stack, and it can also chain gadgets together
as long as each gadget ends with a jump to an address loaded from the
stack.
Prioritize gadgets that result in fewer register dependencies and
gadgets with fewer instructions.
There appears to be some kind of memory leak in angr that causes the
memory usage to keep going up during gadget finding. This periodically
restarts the worker processes to work around that.
We don't want to go into SimProcedures when finding gadgets since those
are probably not useful.
Some timeouts are normal such as when finding gadgets, so we shouldn't
always print a message. It's also not that useful to print a timeout
message without more information about where it came from so it makes
more sense to do this in exception handlers instead.
SimSolverModeError gets raised when there's a timeout in claripy.
If gadgets access memory outside of the stack, make sure the addresses
are valid so that it doesn't crash.
Turns out that project.factory.blank_state() initializes the ip to the
entry point instead of making it unconstrained.
gadget.block_length is only the size of the first block.
If a gadget pops a value into a register but the value is used in a
branch condition or memory access address, we can't fully control it so
we remove it from the set of popped registers. This also makes
gadget.constraint_regs include registers that affect memory access
addresses.
Backtrack if we've already seen an equivalent chain that is not longer
than our current chain.
Estimate how hard it is to set a register by counting the number of
gadgets that pop the register, and prioritize gadgets that result in
target registers which are easier to set.
If two registers are popped from the same location on the stack, we can
control either one of them but not both.
The gadget filtering code in RegSetter and RegMover tries to compare
every gadget with every other gadget, so it doesn't scale well when
there are a large number of gadgets. Things seem to work fine without
the filtering so I'm removing it for now.
bkrl added 24 commits January 1, 2025 19:06
The stack needs to have one extra value initialized to account for the
address of the first gadget being popped off.
When choosing gadgets, break ties by choosing the gadget with the
smaller stack change to minimize the size of the payload.
See reasoning in commit f47cedd.
Long chains can take a couple of seconds to build.
The other chain builders can't handle conditional branches.
The previous code incorrectly skips the first address if it is aligned.
The chain doesn't know how many steps the fake function gadget needs so
we have to override the max steps calculation.
I accidentally broke some chain builders when trying to remove the slow
filtering loops that compare every gadget with every other gadget.
See explanation in commit 3da7f48.
Setting the maximum number of steps to the total number of blocks
doesn't work with the fake gadgets used in function call chains, so use
the old default of twice the number of gadgets if it's higher.
This makes the behavior closer to the previous algorithm which tries to
find the chain with the smallest payload size.
The gadget comparison code has to handle multiple gadgets at the same
address due to conditional branches.
This preserves the previous behavior where if the chain ends with a jump
to some address in the middle of the chain, the concretization function
would attempt to append a no-op gadget so that the chain jumps to the
address immediately after it instead.
If chain.next_pc_idx() is not None, the address to jump to after the
chain should be placed at that index instead of after the chain. In
cases like this the chain concretization function will attempt to append
a no-op gadget so that the chain jumps to the address immediately after
it, but this is not always possible on architectures without a stack
return instruction.
Tests should check if the chains do what we want, and they should not
rely on the chain matching a fixed sequence of gadgets since there can
be multiple chains that do the same thing.
The BFS algorithm is able to use gadgets that set registers to concrete
values which the new algorithm doesn't support yet. This tries the BFS
algorithm first since it's fast while the new algorithm can take a long
time if it can't find a chain.
The previous code can throw exceptions if the address is not in any
object and there is already a function in rop_utils for this.
act.data.ast can be a floating-point expression instead of a bitvector,
which would cause the == operator to return False instead of a claripy
expression.
I rewrote the existing Builder._build_reg_setting_chain() function
instead.
@Kyle-Kyle
Collaborator

Hi, thank you so much for the amazing work. This PR seems to have addressed a lot of features missing from angrop for years. I'd definitely love to merge it after review.
Currently, this PR seems to be failing CI. Would you mind fixing the failures?

@bkrl
Author

bkrl commented Jan 6, 2025

I'll fix the lints but it looks like the rest of the tests are failing because they're using the old gadget caches. I added some fields to the RopGadget class for things like conditional branch support so the gadget caches will have to be regenerated.

@Kyle-Kyle Kyle-Kyle self-assigned this Jan 15, 2025
@Kyle-Kyle
Collaborator

Btw. We are currently working on https://github.com/angr/angrop/commits/feat/aarch64/ trying to merge this amazing PR to angrop :D

@Kyle-Kyle
Collaborator

btw, I'm going to re-enable gadget filtering and improve its implementation.
I just did an experiment. Without it, indeed, it takes no time to initialize the chainbuilder, but it takes 20s to generate a chain (the 20s here is for finishing the whole script, so includes initialization and chain building).
With gadget filtering, even though it takes 8s to initialize the chainbuilder, it only takes 17s in total to generate a chain.
So I think it is worth it to do gadget filtering.

Also, the algorithm I wrote was never meant to be fast, I'm now going to write a faster version for it. Thanks for pointing it out though!

@Kyle-Kyle
Collaborator

Now gadget filtering is pretty fast. The whole process of finding a chain in libc is reduced to 8s. (This is for the slow arm_func_call test case; it is even faster on x86_64.)
