refactor: add documentation to the layered architecture (#56)
This is a follow-up to the ongoing architectural refactor after the last
merged layer (#54).

Here we add documentation for the layered approach.

Additionally, we change the `OnWorkerDone()` call so that the worker's
name is passed in the call and logging responsibility for the done event
belongs to the `workers.Manager`.

Reference issue: #47

---------

Co-authored-by: Simone Basso <[email protected]>
ainghazal and bassosimone authored Jan 24, 2024
1 parent cddc436 commit c4c9b5a
Showing 11 changed files with 396 additions and 106 deletions.
9 changes: 9 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,9 @@
# Checklist

* [ ] I have read the contribution guidelines
* [ ] If you changed code related to services or inter-service communication, make sure you update the diagrams in `ARCHITECTURE.md`.
* [ ] Reference issue for this pull request:

# Description

Please insert a more detailed description here.
216 changes: 216 additions & 0 deletions ARCHITECTURE.md
@@ -0,0 +1,216 @@
# minivpn design

`minivpn` (after the re-design that landed in January 2024) follows a layered design that tries to closely capture the OpenVPN protocol.

* The bottom layer is [networkio](https://github.com/ooni/minivpn/tree/main/internal/networkio), which deals with network reads and writes. The module implements a [FramingConn](https://github.com/ainghazal/minivpn/blob/main/internal/networkio/framing.go#L10).
* The [packetmuxer](https://github.com/ainghazal/minivpn/blob/main/internal/packetmuxer/service.go) routes both data and control packets over the underlying connection. Multiplexing is needed so that the TLS session sees a [reliable transport](https://community.openvpn.net/openvpn/wiki/SecurityOverview).
* [reliabletransport](https://github.com/ainghazal/minivpn/blob/main/internal/reliabletransport/reliabletransport.go) implements reordering and acknowledgement for incoming packets, and retransmission for outgoing packets.
* [controlchannel](https://github.com/ainghazal/minivpn/blob/main/internal/controlchannel/controlchannel.go) serializes and deserializes data according to the control channel format, and reacts to `SOFT_RESET_V1` packets.
* [tlssession](https://github.com/ainghazal/minivpn/blob/main/internal/tlssession/tlssession.go) performs a TLS handshake and negotiates a key exchange over the established session. It moves TLS records up and down, from and towards the `controlchannel`.
* The [datachannel](https://github.com/ooni/minivpn/tree/main/internal/datachannel) performs encryption and decryption of IP tunnel packets.
* [TUN](https://github.com/ainghazal/minivpn/blob/main/internal/tun/tun.go) is the user-facing interface. It can read and write `[]byte`.
* Finally, the [workers.Manager](https://github.com/ainghazal/minivpn/blob/main/internal/workers/workers.go) component deals with coordination among all the components.
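
To make the layering concrete, here is a minimal sketch of the shape that the layer services share. The type and channel names below are illustrative assumptions (the real services live under `internal/` and move `*model.Packet` over their own channels), not the actual API:

```go
// Sketch only: Packet is a stand-in for the real *model.Packet.
package sketch

// Packet stands in for the packets exchanged between layers.
type Packet struct {
	Payload []byte
}

// Service captures the rough shape shared by the layer services
// (packetmuxer, reliabletransport, controlchannel, ...).
type Service struct {
	// moveUp carries packets travelling up the stack from the layer below.
	moveUp chan *Packet

	// moveDown carries packets travelling down the stack from the layer above.
	moveDown chan *Packet
}

// StartWorkers spawns the goroutines that move data across this layer.
func (s *Service) StartWorkers() {
	go s.moveUpWorker()
	go s.moveDownWorker()
}

func (s *Service) moveUpWorker() {
	for pkt := range s.moveUp {
		_ = pkt // process and forward to the layer above
	}
}

func (s *Service) moveDownWorker() {
	for pkt := range s.moveDown {
		_ = pkt // process and forward to the layer below
	}
}
```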


## Services

* Each layer is implemented as a service, which lives in its own package under the [internal](https://github.com/ainghazal/minivpn/blob/main/internal) path.
* Each service initializes and starts a number of workers (typically two: one for moving data up the stack, and another for moving data down). Some services implement only one worker, others three.
* The communication among the different components happens via channels.
* Some channels are used for event notification; others move sequences of `[]byte` or `*model.Packet`.
* The channels leaving and arriving at each module can be seen in the diagram below:


```
startShtdwn
┌───────────────────────────────────────────┬────────────►┌──────────────┐
│ │ shtdwn! │ │
│ TUN │◄────────────┤ workers │
│ │ Ready │ Manager │
└────▲───────┬──────────────────────────────┘◄────────────┤ │
│ │ │ │
│tunUp │tunDown │ │
┌────┴───────▼──────────────────────────────┐ │ │
│ │ shtdwn! │ │
│ datachannel │◄────────────┤ │
│ │ │ │
└───▲────────┬────────────────────────▲─────┘ │ │
│ │ keyUp │ shtdwn! │ │
│ │ ┌────────────────┴─────┐◄────────────┤ │
│ │ │ │ │ │
│ │ │ tlssession ◄──┐ │ │
│ │ └───────▲──────────▲───┘ │ │ │
│ │ tlsRec │ │ notifyTLS │ │
muxerTo │ Down│ tlsRecUp notifyTLS │ │ │
Data │ │ │ │ │ │ │
│ │ ┌─▼─────┴──────────┴───┐ │ │ │
│ │ │ │ │ │ │
│ │ │ controlchannel │ │ │ │
│ │ └─┬─────▲──────────────┘ │ ◄────────┤ │
│ │ ctrl │ │ │ shtdwn!│ │
│ │ 2Rel │ rel2Ctrl │ │ │
│ │ ┌─▼────────────────────┐ │ │ │
│ │ │ │ │ ◄────────┤ │
│ │ │ reliabletransport │ │ shtdwn!│ │
│ │ └───────▲──────────────┘ │ │ │
│ dataOrCtrlToMuxer │ muxerToReliable │ │ │
│ │ │ │ │ │ │
┌───┴────────▼─────────▼─────┴──────────────┐ │ │ │
hardReset│ │ │ │ │
────► packetkmuxer & HRESET ├──┘ │ │
│ │ │ │
└───────────────────┬────────▲──────────────┘◄────────────┤ │
muxerToNetwork│ │networkToMuxer shtdwn! │ │
┌───────────────────▼────────┴──────────────┐ │ │
│ │ │ │
│ network I/O │◄────────────┤ │
│ │ shtdwn! │ │
└───────────────────────────────────────────┘ └──────────────┘
```
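
The `workers.Manager` box above coordinates startup and shutdown: each worker defers a call to `OnWorkerDone` with its own name (so logging the done event is the manager's responsibility, as in the `controlchannel` diff further below) and calls `StartShutdown` when it exits. The following is a hedged sketch of that contract; the real `workers.Manager` internals may differ.

```go
// Hedged sketch of the coordination contract; fields and method set are
// assumptions, only OnWorkerDone(name) and StartShutdown come from the diff.
package sketch

import (
	"log"
	"sync"
)

// Manager tracks running workers and fans out a single shutdown signal.
type Manager struct {
	wg       sync.WaitGroup
	shutdown chan struct{} // closed exactly once to stop every worker
	once     sync.Once
}

func NewManager() *Manager {
	return &Manager{shutdown: make(chan struct{})}
}

// StartWorker runs fn in its own goroutine and tracks its lifetime.
func (m *Manager) StartWorker(fn func()) {
	m.wg.Add(1)
	go fn()
}

// OnWorkerDone is deferred by each worker with its own name, so that
// logging the done event is the manager's responsibility.
func (m *Manager) OnWorkerDone(name string) {
	log.Printf("%s: done", name)
	m.wg.Done()
}

// StartShutdown asks every worker to terminate (idempotent).
func (m *Manager) StartShutdown() {
	m.once.Do(func() { close(m.shutdown) })
}

// ShouldShutdown returns the channel workers select on to learn about shutdown.
func (m *Manager) ShouldShutdown() <-chan struct{} { return m.shutdown }

// WaitWorkersShutdown blocks until every worker has called OnWorkerDone.
func (m *Manager) WaitWorkersShutdown() { m.wg.Wait() }
```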

# Implementation and liveness analysis

In the layered architecture detailed above, there are 12 different goroutines
tasked with moving data across the stack, in 6 services:

1. **networkio**: 2 workers (up/down).
2. **packetmuxer**: 2 workers (up/down).
3. **reliabletransport**: 2 workers (up/down).
4. **controlchannel**: 2 workers (up/down).
5. **tlssession**: 1 worker.
6. **datachannel**: 3 workers (up/down/key).

The `TUN` abstraction reads from and writes to the `tunUp` and `tunDown` channels; the TUN user is responsible for dialing the connection and passing a `networkio.FramingConn` to the `tun.StartTUN()` constructor. The TUN constructor then owns the conn, and also starts an internal `session.Manager` and `workers.Manager` to deal with service coordination.

By design, it is the responsibility of the upper, public layer (TUN) to account for timeouts (like the TLS handshake timeout), and to close the underlying connection and signal the teardown of all the workers.
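
A hedged usage sketch follows. The exact `StartTUN` signature, the config type, the `Close` method, and the helper names are assumptions based on the description above; only the division of responsibilities (the user dials the connection, keeps reading from TUN, and drives teardown) comes from the text.

```go
// Hedged sketch. Assumed imports: "context",
// "github.com/ooni/minivpn/internal/networkio", "github.com/ooni/minivpn/internal/tun"
// (internal packages are only importable from within the minivpn module).
func runTunnel(ctx context.Context, framed networkio.FramingConn, cfg *Config) error {
	// StartTUN takes ownership of framed and internally starts a
	// session.Manager and a workers.Manager (per the paragraph above).
	// The exact signature is an assumption.
	tunnel, err := tun.StartTUN(ctx, framed, cfg)
	if err != nil {
		return err
	}
	defer tunnel.Close() // assumed to exist; tears down workers and the conn

	// The user must keep reading from TUN so that incoming data packets
	// keep being processed (see the liveness assumptions below).
	buf := make([]byte, 4096)
	for {
		n, err := tunnel.Read(buf)
		if err != nil {
			return err
		}
		handleIncoming(buf[:n]) // hypothetical application callback
	}
}
```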

The channel communication between services is designed to be blocking, with unbuffered channels.

To reason about **liveness** in the system, we make the following...

## Assumptions

* We call a "chain" a series of workers, connected via shared channels, between a source and a sink.
* Any `Read()` or `Write()` from/to the underlying `conn` can block, and since all the workers in a single chain are connected via blocking reads/writes, this block can freely propagate across the stack. The chain moving up pauses while blocked on receiving from the network, and the chain going down pauses if the system is blocked on delivering to the network.
* We assume this is fine because (a) what happens after the syscall boundary is beyond our control, and (b) we want to make these blocks visible to the system (i.e., we want to avoid hiding these events by using queues).
* It is the responsibility of `minivpn`'s user to keep reading from the `TUN` interface so that incoming data packets can be processed.
* Any channel that connects the up and down processing chains (like, for instance, the internal channel that connects `packetmuxer.moveUpWorker` with `packetmuxer.moveDownWorker` to process ACKs) needs to be buffered and written to with non-blocking writes (see the sketch after this list).
* The goroutine responsible for the `TLS` service handshake (meaning, the TLS handshake + the control message push and reply to exchange keys) is sequential, and therefore reads and writes cannot happen concurrently.
* Guarding `notifyTLS` notifications to the TLS layer needs special care (to avoid concurrent notifications while processing the handshake).
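
A minimal sketch of the buffered, non-blocking write pattern for channels that bridge the two chains (the function name, capacity, and drop policy are illustrative assumptions; `*model.Packet` is the packet type mentioned earlier):

```go
// sendNonBlocking queues pkt on a buffered bridge channel (for example the
// ACK channel between packetmuxer.moveUpWorker and moveDownWorker) without
// ever blocking the caller: if the buffer is full, the packet is dropped
// rather than stalling the chain.
func sendNonBlocking(bridge chan<- *model.Packet, pkt *model.Packet) bool {
	select {
	case bridge <- pkt:
		return true // queued for the other chain
	default:
		return false // buffer full: drop instead of blocking
	}
}

// The bridge channel itself must be buffered, e.g.:
//
//	ackCh := make(chan *model.Packet, 64) // capacity is an illustrative choice
```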



```mermaid
stateDiagram-v2
# you can edit this diagram in https://mermaid.live
classDef tunrw font-style:italic,font-weight:bold,fill:yellow
# workers
state "TUN.Write()" as tundown
state "TUN.Read()" as tunup
state "datach.moveDownWorker" as datadown
state "datach.moveUpWorker" as dataup
state "datach.keyWorker" as datakey
state "muxer.moveDownWorker" as muxerdown
state "muxer.moveUpWorker" as muxerup
state "reliable.moveDownWorker" as reliabledown
state "reliable.moveUpWorker" as reliableup
state "ctrlch.moveDownWorker" as controldown
state "ctrlch.moveUpWorker" as controlup
state "tlssession.worker" as tls
state "networkio.moveDownWorker" as networkdown
state "networkio.moveUpWorker" as networkup
# channels
state tunDownCh <<join>>
state tunUpCh <<join>>
state muxerToNetwork <<join>>
state networkToMuxer <<join>>
state dataOrCtrlToMuxer <<join>>
state tlsRecordUp <<join>>
state tlsRecordDown <<join>>
state newkey <<join>>
state reliableToControl <<join>>
state packetInfo <<join>>
state notifytls <<join>>
state internetout <<join>>
state internetin <<join>>
#
# we begin our journey with writes to the TUN interface
# (going down)
#
[*] --> tundown
tundown:::tunrw --> tunDownCh : []byte
tunDownCh --> datadown
datadown --> dataOrCtrlToMuxer : *Packet
reliabledown --> dataOrCtrlToMuxer : *Packet
dataOrCtrlToMuxer --> muxerdown: pkt <- dataOrCtrlToMuxer
muxerdown --> muxerToNetwork : []byte
muxerToNetwork --> networkdown
controldown --> reliabledown
networkdown --> internetout: conn.Write()
internetin --> networkup: conn.Read()
tls --> tlsRecordDown: tlsDown
tlsRecordDown --> controldown: tlsDown
tls --> newkey: key
newkey --> datakey: key
muxerup --> muxer.checkPacket()
# an extra boundary to force a nicer layout
state muxer.checkPacket() {
direction LR
state if_data <<choice>>
if_data --> reliableup: isControl?
if_data --> dataup: isData?
}
# second part, we read from the network
# (and go up)
networkup --> networkToMuxer : []byte
networkToMuxer --> muxerup
muxerup --> notifytls: notifyTLS
controlup --> notifytls: notifyTLS
notifytls --> tls: notifyTLS
reliableup --> reliableToControl : *Packet
reliableToControl --> controlup
reliableup --> packetInfo: packetInfo
packetInfo --> reliabledown: packetInfo
controlup --> tlsRecordUp: tlsUp
tlsRecordUp --> tls: tlsUp
# data to tun
dataup --> tunUpCh
tunUpCh --> tunup
tunup:::tunrw --> [*]
```
38 changes: 24 additions & 14 deletions internal/controlchannel/controlchannel.go
@@ -1,11 +1,17 @@
package controlchannel

import (
"fmt"

"github.com/ooni/minivpn/internal/model"
"github.com/ooni/minivpn/internal/session"
"github.com/ooni/minivpn/internal/workers"
)

var (
serviceName = "controlchannel"
)

// Service is the controlchannel service. Make sure you initialize
// the channels before invoking [Service.StartWorkers].
type Service struct {
@@ -61,13 +67,14 @@ type workersState struct {
}

func (ws *workersState) moveUpWorker() {
workerName := fmt.Sprintf("%s: moveUpWorker", serviceName)

defer func() {
ws.workersManager.OnWorkerDone()
ws.workersManager.OnWorkerDone(workerName)
ws.workersManager.StartShutdown()
ws.logger.Debug("controlchannel: moveUpWorker: done")
}()

ws.logger.Debug("controlchannel: moveUpWorker: started")
ws.logger.Debugf("%s: started", workerName)

for {
// POSSIBLY BLOCK on reading the packet moving up the stack
@@ -79,18 +86,20 @@ func (ws *workersState) moveUpWorker() {
case model.P_CONTROL_SOFT_RESET_V1:
// We cannot blindly accept SOFT_RESET requests. They only make sense
// when we have generated keys. Note that a SOFT_RESET returns us to
// the INITIAL state, therefore, we cannot have concurrent resets in place.

// TODO(ainghazal): revisit this assumption
// when we implement key rotation. OpenVPN has
// the concept of a "lame duck", i.e., the
// retiring key that needs to be expired a fixed time after the new
// one starts its lifetime.
// the INITIAL state, therefore, we will not have concurrent resets in place,
// even if after the first key generation we receive two SOFT_RESET requests
// back to back.

if ws.sessionManager.NegotiationState() < session.S_GENERATED_KEYS {
continue
}
ws.sessionManager.SetNegotiationState(session.S_INITIAL)
// TODO(ainghazal): revisit this step.
// when we implement key rotation. OpenVPN has
// the concept of a "lame duck", i.e., the
// retiring key that needs to be expired a fixed time after the new
// one starts its lifetime, and this might be a good place to try
// to retire the old key.

// notify the TLS layer that it should initiate
// a TLS handshake and, if successful, generate
@@ -121,13 +130,14 @@ func (ws *workersState) moveUpWorker() {
}

func (ws *workersState) moveDownWorker() {
workerName := fmt.Sprintf("%s: moveDownWorker", serviceName)

defer func() {
ws.workersManager.OnWorkerDone()
ws.workersManager.OnWorkerDone(workerName)
ws.workersManager.StartShutdown()
ws.logger.Debug("controlchannel: moveUpWorker: done")
}()

ws.logger.Debug("controlchannel: moveUpWorker: started")
ws.logger.Debugf("%s: started", workerName)

for {
// POSSIBLY BLOCK on reading the TLS record moving down the stack
@@ -136,7 +146,7 @@
// transform the record into a control message
packet, err := ws.sessionManager.NewPacket(model.P_CONTROL_V1, record)
if err != nil {
ws.logger.Warnf("controlchannel: NewPacket: %s", err.Error())
ws.logger.Warnf("%s: NewPacket: %s", workerName, err.Error())
return
}
