generated from toitware/web-gatsby-template
-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
df82654
commit d94abfa
Showing
3 changed files
with
143 additions
and
4 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,134 @@ | ||
# Reliability | ||
|
||
The most important job of Artemis is to keep devices updatable. Since updates | ||
are delivered through the broker, it is critical that the device can reach the broker. | ||
|
||
## Strategy | ||
|
||
The Artemis service that is installed on the devices periodically downloads its | ||
configuration from the broker. | ||
|
||
If the device can't reach the broker, it will retry more and more aggressively. | ||
First it will reduce the interval between | ||
check-ins. At some point, it will turn off non-critical containers during connection | ||
attempts. Eventually, it will even disable critical containers for a short period of time. | ||
|
||
All of these measures don't help if the broker is not available anymore. In that case, | ||
the device attempts to reach a recovery URL. This URL can provide a new broker | ||
configuration, allowing the device to connect again. | ||
|
||
Being able to contact the broker means that the device can fetch new configurations, | ||
and, hopefully, also firmware. However, any new firmware isn't guaranteed to be able | ||
to do the same. As such, Artemis only commits to a new firmware after it has | ||
successfully connected to the broker with the new firmware. | ||
|
||
As a last resort, Artemis also uses a watchdog. If the device can't connect to the | ||
broker for a long time, the watchdog will trigger and reboot the device. This also | ||
guards against the Artemis service itself getting into a bad state. | ||
|
||
## Max offline | ||
|
||
The frequency at which Artemis contacts the broker can be configured using the | ||
`max-offline` setting in the pod specification. A value of `0s` means that the | ||
device stays connected to the broker and polls it continuously (but at most | ||
every 20 seconds). A value of `1h` means that the device can be offline for | ||
up to an hour before it tries to reconnect. | ||
|
||
Here is a pod specification with a `max-offline` value of `1h`: | ||
|
||
```yaml | ||
# yaml-language-server: $schema=https://toit.io/schemas/artemis/pod-specification/v1.json | ||
|
||
$schema: https://toit.io/schemas/artemis/pod-specification/v1.json | ||
name: my-pod | ||
sdk-version: v2.0.0-alpha.163 | ||
artemis-version: v0.25.0 | ||
max-offline: 1h | ||
firmware-envelope: x64-linux | ||
containers: {} | ||
``` | ||
The max-offline value can also be set using the `toit` CLI: | ||
|
||
```bash | ||
toit device -d DEVICE-ID set-max-offline 1h | ||
``` | ||
|
||
This configuration change is not permanent and will be lost with the next firmware update. | ||
|
||
## Status states | ||
|
||
The Artemis service on the device keeps track of its connection status. It is either | ||
green, yellow, orange or red. As of September 2024, the following heuristics are used | ||
to determine the status: | ||
- green: the device is within the `max-offline` window. | ||
- yellow: the device is outside the `max-offline` window, but within a factor of 2. For | ||
example, if `max-offline` is set to 1h, then the device is in yellow state if it hasn't | ||
connected for more than 1h, but less than 2h. | ||
- orange: the device is outside the `max-offline` window, but within a factor of 3. | ||
- red: the device is outside the `max-offline` window, and has been for more than 3 times | ||
the `max-offline` value. | ||
|
||
The following actions (again, as of September 2024) are taken based on the status. | ||
|
||
### Orange | ||
|
||
When a device is in orange state, Artemis reduces the interval between reconnection | ||
attempts by 50%. For example, if `max-offline` is set to 1h, then the device will | ||
try to reconnect every 30 minutes. | ||
|
||
Artemis also disables all non-critical containers during reconnection attempts. | ||
|
||
In this state, Artemis always tries a random entry from the recovery-URLs list to | ||
get a new broker configuration. | ||
|
||
### Red | ||
|
||
Devices that can't connect to the broker for a long time are put into red state. | ||
|
||
The red state is a more extreme version of the orange state. Artemis will reduce the | ||
reconnect interval by a factor of 4. For example, if `max-offline` is | ||
set to 1h, then the device will try to reconnect every 15 minutes. | ||
|
||
As for the orange state, Artemis disables all non-critical containers during | ||
reconnection attempts. However, now Artemis also disables critical containers | ||
15% of the time. | ||
|
||
Contrary to the orange state, Artemis does not always contact the recovery URL. | ||
In case the recovery-URL connection makes things worse Artemis only fetches | ||
recovery URLs 20% of the time. | ||
|
||
## Recovery URLs | ||
|
||
Even though recovery URLs are baked into pods, they are stored as properties of | ||
the fleet. We don't expect recovery URLs to change often, and having them in the | ||
fleet configuration avoids duplication. | ||
|
||
A recovery URL can be added to an existing fleet with the following command: | ||
|
||
```bash | ||
artemis fleet recovery add RECOVERY-URL | ||
``` | ||
|
||
If the existing broker for a device doesn't exist anymore, then the device can | ||
use the recovery URL to get a new broker configuration. The new broker configuration | ||
can be generated by running `artemis fleet recovery export -o recovery.json` and | ||
serving the resulting `recovery.json` file at the recovery URL. | ||
|
||
We recommend to use a different domain as recovery URL. This way, the recovery URL | ||
is not affected by the same issues as the main broker. | ||
|
||
Note that recovery URLs don't need to be online until they are needed. | ||
|
||
## Watchdogs | ||
|
||
As mentioned earlier, Artemis uses watchdogs to monitor its own health. Given a | ||
certain `max-offline` value, the watchdog will trigger if the device hasn't connected | ||
to the broker for more than 5 times the `max-offline` value (or at least 2 hours, if | ||
`max-offline` is small). | ||
|
||
This watchdog is a powerful last resort in that it handles many different failures. If | ||
Artemis itself gets into a bad state the watchdog will eventually help to recover. | ||
Similarly, user programs can communicate with the Artemis service through the [Artemis | ||
package](https://pkg.toit.io/package/github.com%2Ftoitware%2Ftoit-artemis) and a bug | ||
in the user program could accidentally disallow Artemis to connect to the broker. |