Update docs for Artemis v0.25.0
LeonGungadinMogensen committed Oct 4, 2024
1 parent df82654 commit d94abfa
Showing 3 changed files with 143 additions and 4 deletions.
9 changes: 7 additions & 2 deletions docs/getstarted/fleet/index.mdx
@@ -10,6 +10,11 @@ in the cloud — and the developer tooling to help orchestrate the devices.
possible to host your own broker, so all your data and code remains under your
control.

Artemis is designed to be [reliable](./reliability) and robust. Configuration is
treated as code, and none of the servers are critical. As long as
the configuration is stored in a reliable system (typically Git), Artemis
can recover from a full loss of all servers.

---------------------

# Installation
@@ -136,8 +141,8 @@ possible specification by putting the following content into the `my-pod.yaml` f

$schema: https://toit.io/schemas/artemis/pod-specification/v1.json
name: my-pod
-sdk-version: v2.0.0-alpha.160
-artemis-version: v0.24.0
+sdk-version: v2.0.0-alpha.163
+artemis-version: v0.25.0
max-offline: 0s
connections:
- type: wifi
4 changes: 2 additions & 2 deletions docs/getstarted/fleet/pods.mdx
@@ -103,8 +103,8 @@ It is in YAML format and looks similar to this:

$schema: https://toit.io/schemas/artemis/pod-specification/v1.json
name: example
-sdk-version: v2.0.0-alpha.160
-artemis-version: v0.24.0
+sdk-version: v2.0.0-alpha.163
+artemis-version: v0.25.0
max-offline: 0s
connections:
- type: wifi
134 changes: 134 additions & 0 deletions docs/getstarted/fleet/reliability.mdx
@@ -0,0 +1,134 @@
# Reliability

The most important job of Artemis is to keep devices updatable. Since updates
are delivered through the broker, it is critical that the device can reach the broker.

## Strategy

The Artemis service that is installed on the devices periodically downloads its
configuration from the broker.

If the device can't reach the broker, it will retry more and more aggressively.
First it will reduce the interval between
check-ins. At some point, it will turn off non-critical containers during connection
attempts. Eventually, it will even disable critical containers for a short period of time.

None of these measures help if the broker is no longer available. In that case,
the device attempts to reach a recovery URL. This URL can provide a new broker
configuration, allowing the device to connect again.

Reaching the broker means that the device can fetch new configurations and,
ideally, new firmware. However, there is no guarantee that the new firmware can
itself reach the broker. Artemis therefore only commits to new firmware after
that firmware has successfully connected to the broker.
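
The commit rule can be sketched in a few lines of Python. This is an illustration only, not Artemis code: the function name `validate_new_firmware`, the `max_attempts` limit, and the `"rollback"` outcome when every attempt fails are assumptions, since the text above only states that the commit happens after a successful connection.

```python
from typing import Callable


def validate_new_firmware(connect_to_broker: Callable[[], bool],
                          max_attempts: int = 3) -> str:
    """Decide whether freshly installed firmware should be kept.

    Returns "commit" as soon as one broker connection succeeds, or
    "rollback" if every attempt fails. The attempt limit and the rollback
    outcome are assumptions of this sketch.
    """
    for _ in range(max_attempts):
        if connect_to_broker():
            return "commit"
    return "rollback"


# Firmware that can reach the broker is committed.
print(validate_new_firmware(lambda: True))   # commit
# Firmware that never connects is not.
print(validate_new_firmware(lambda: False))  # rollback
```

The connection attempt is passed in as a callable so that the decision logic can be exercised without real networking.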

As a last resort, Artemis also uses a watchdog. If the device can't connect to the
broker for a long time, the watchdog will trigger and reboot the device. This also
guards against the Artemis service itself getting into a bad state.

## Max offline

The frequency at which Artemis contacts the broker can be configured using the
`max-offline` setting in the pod specification. A value of `0s` means that the
device stays connected to the broker and polls it continuously (though at most
once every 20 seconds). A value of `1h` means that the device can be offline for
up to an hour before it tries to reconnect.

Here is a pod specification with a `max-offline` value of `1h`:

```yaml
# yaml-language-server: $schema=https://toit.io/schemas/artemis/pod-specification/v1.json

$schema: https://toit.io/schemas/artemis/pod-specification/v1.json
name: my-pod
sdk-version: v2.0.0-alpha.163
artemis-version: v0.25.0
max-offline: 1h
firmware-envelope: x64-linux
containers: {}
```
The max-offline value can also be set using the `toit` CLI:

```bash
toit device -d DEVICE-ID set-max-offline 1h
```

This configuration change is not permanent and will be lost with the next firmware update.

## Status states

The Artemis service on the device keeps track of its connection status. It is either
green, yellow, orange or red. As of September 2024, the following heuristics are used
to determine the status:
- green: the device is within the `max-offline` window.
- yellow: the device has been offline for more than `max-offline`, but less than twice
that. For example, if `max-offline` is set to 1h, then the device is in yellow state if
it hasn't connected for more than 1h, but less than 2h.
- orange: the device has been offline for more than 2 times, but less than 3 times, the
`max-offline` value.
- red: the device has been offline for more than 3 times the `max-offline` value.
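
The thresholds in this list can be written as a small classifier. The Python below is a minimal sketch, not the Artemis implementation: the function name `connection_status`, the use of seconds, and the handling of exact boundaries (a device at exactly 2 times `max-offline` counts as yellow) are assumptions. It only covers a positive `max-offline`, because the list does not say how the factors apply to `0s`.

```python
def connection_status(seconds_since_contact: float,
                      max_offline_seconds: float) -> str:
    """Classify the connection status from the time since the last contact."""
    if max_offline_seconds <= 0:
        raise ValueError("this sketch only handles a positive max-offline")
    ratio = seconds_since_contact / max_offline_seconds
    if ratio <= 1:
        return "green"
    if ratio <= 2:
        return "yellow"
    if ratio <= 3:
        return "orange"
    return "red"


HOUR = 3600
print(connection_status(0.5 * HOUR, HOUR))  # green
print(connection_status(1.5 * HOUR, HOUR))  # yellow
print(connection_status(2.5 * HOUR, HOUR))  # orange
print(connection_status(4.0 * HOUR, HOUR))  # red
```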

The following actions (again, as of September 2024) are taken based on the status.

### Orange

When a device is in orange state, Artemis reduces the interval between reconnection
attempts by 50%. For example, if `max-offline` is set to 1h, then the device will
try to reconnect every 30 minutes.

Artemis also disables all non-critical containers during reconnection attempts.

In this state, Artemis always tries a random entry from the recovery-URLs list to
get a new broker configuration.

### Red

Devices that can't connect to the broker for a long time are put into red state.

The red state is a more extreme version of the orange state. Artemis will reduce the
reconnect interval by a factor of 4. For example, if `max-offline` is
set to 1h, then the device will try to reconnect every 15 minutes.

As in the orange state, Artemis disables all non-critical containers during
reconnection attempts. In addition, it now also disables critical containers
15% of the time.

Unlike in the orange state, Artemis does not always contact a recovery URL.
In case the recovery-URL connection itself makes things worse, Artemis only fetches
recovery URLs 20% of the time.
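
The orange and red behavior can be summarized as one decision function. The Python below is a sketch under stated assumptions, not Artemis code: the names `ReconnectPlan` and `plan_reconnect` are hypothetical, the two random draws in the red state are assumed to be independent, and statuses other than orange and red are rejected because no actions are described for them here.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReconnectPlan:
    interval_seconds: float       # time between reconnection attempts
    stop_non_critical: bool       # stop non-critical containers while connecting
    stop_critical: bool           # also stop critical containers for this attempt
    recovery_url: Optional[str]   # recovery URL to query, if any


def plan_reconnect(status: str, max_offline_seconds: float,
                   recovery_urls: List[str],
                   rng: Optional[random.Random] = None) -> ReconnectPlan:
    """Return the actions for one reconnection attempt in orange or red state."""
    rng = rng or random.Random()
    if status == "orange":
        # Interval reduced by 50%; a random recovery URL is always tried.
        url = rng.choice(recovery_urls) if recovery_urls else None
        return ReconnectPlan(max_offline_seconds / 2, True, False, url)
    if status == "red":
        # Interval reduced by a factor of 4; critical containers are stopped
        # 15% of the time and a recovery URL is tried 20% of the time.
        stop_critical = rng.random() < 0.15
        try_recovery = rng.random() < 0.20
        url = rng.choice(recovery_urls) if (try_recovery and recovery_urls) else None
        return ReconnectPlan(max_offline_seconds / 4, True, stop_critical, url)
    raise ValueError(f"no actions described for status {status!r}")


plan = plan_reconnect("red", 3600, ["https://recovery.example.com/recovery.json"])
print(plan.interval_seconds)  # 900.0
```

Passing the random number generator in as a parameter keeps the probabilistic parts reproducible in tests. The recovery URL in the usage line is a placeholder.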

## Recovery URLs

Even though recovery URLs are baked into pods, they are stored as properties of
the fleet. We don't expect recovery URLs to change often, and having them in the
fleet configuration avoids duplication.

A recovery URL can be added to an existing fleet with the following command:

```bash
artemis fleet recovery add RECOVERY-URL
```

If a device's broker no longer exists, the device can use the recovery URL to
get a new broker configuration. The new broker configuration can be generated
by running `artemis fleet recovery export -o recovery.json` and serving the
resulting `recovery.json` file at the recovery URL.

We recommend hosting the recovery URL on a different domain than the broker. This
way, the recovery URL is not affected by the same issues as the main broker.

Note that recovery URLs don't need to be online until they are needed.

## Watchdogs

As mentioned earlier, Artemis uses a watchdog to monitor its own health. The
watchdog triggers if the device hasn't connected to the broker for more than
5 times the `max-offline` value, with a minimum of 2 hours when `max-offline`
is small.
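
As a worked example, the watchdog timeout can be computed as follows. This Python sketch reads the 2-hour figure as a lower bound on the timeout, which is an interpretation of the paragraph above; the function name `watchdog_timeout_seconds` is hypothetical.

```python
HOUR = 3600


def watchdog_timeout_seconds(max_offline_seconds: float) -> float:
    """Time without broker contact after which the watchdog reboots the device.

    Assumes the timeout is 5 times max-offline, but never below 2 hours.
    """
    return max(5 * max_offline_seconds, 2 * HOUR)


print(watchdog_timeout_seconds(1 * HOUR) / HOUR)  # 5.0
print(watchdog_timeout_seconds(600) / HOUR)       # 2.0
print(watchdog_timeout_seconds(0) / HOUR)         # 2.0
```

With `max-offline` set to 1h the watchdog fires after 5 hours. With 10 minutes or `0s`, the 2-hour floor applies.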

This watchdog is a powerful last resort because it handles many different failures. If
Artemis itself gets into a bad state, the watchdog will eventually help it recover.
Similarly, user programs can communicate with the Artemis service through the [Artemis
package](https://pkg.toit.io/package/github.com%2Ftoitware%2Ftoit-artemis), and a bug
in a user program could accidentally prevent Artemis from connecting to the broker.
The watchdog covers that case as well.
