Delease
/
Blocktree


			
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740741742743744745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774775776777778779780781782783784785786787788789790791792793794795
							\documentclass{article}
\usepackage[scale=0.8]{geometry}
\usepackage{hyperref}
\usepackage{graphicx}

\title{The Blocktree Cloud Orchestration Platform}
\author{Matthew Carr}

\begin{document}
\maketitle
\begin{abstract}
This document is a proposal for a novel cloud platform called Blocktree.
The system is described in terms of the actor model,
where tasks and services are implemented as actors.
The platform is responsible for orchestrating these actors on a set of native operating system processes.
A service is provdied to actors which allows them access to a highly available distributed file system,
which serves as the only source of persistent state for the system.
High availability is achieved using the Raft consensus protocol to synchronize the state of files between processes.
All data stored in the filesystem is secured with strong integrity and optional confidentiality protections.
A network block device like interface allows for fast low-level read and write access to the encrypted data,
with full support for client-side encryption.
Well-known cryptographic primitives and constructions are employed to provide this protection,
the system does not attempt to innovate in terms of cryptography.
The system's trust model allows for mutual TLS authentication between all processes in the system,
even those which are controlled by different owners.
By integrating these ideas into a single platform,
the system aims to advance the status quo in the security and reliability of software systems.
\end{abstract}

\section{Introduction}
% The "Big" Picture.
Blocktree is an attempt to extend the Unix philosophy that everything is a file
to the entire distributed system that comprises modern IT infrastructure.
The system is organized around a global distributed filesystem which defines security
principals, resources, and their authorization attributes.
This filesystem provides a language for access control that can be used to securely grant principals
access to resources from different organizations, without the need to setup federation.
The system provides an actor runtime for orchestrating services.
Resources are represented by actors, and actors are grouped into operating system processes.
Each process has its own credentials which authenticate it as a unique security principal,
and which specify the filesystem path where the process is located.
A process has authorization attributes which determine the set of processes that may communicate with it.
Every connection between processes is established using mutual TLS authentication,
which is accomplished without the need to trust any third-party certificate authorities.
The cryptographic mechanisms which make this possible are described in detail in section 3.
Messages addressed to actors in a different process are forwarded over these connections,
while messages delivered to actors in the same process are delivered with zero-copying.

% Self-certifying paths and the chain of trust.
The single global Blocktree filesystem is partitioned into disjoint domains of authority.
Each domain is controlled by a root principal.
As is the case for all principals,
a root principal is authenticated by a public-private key pair,
and is identified by a hash of its public key.
The domain of authority for a given absolute path is determined by its first component,
which is the identifier of the root principal who controls the domain.
Because there is no meaning to the directory "/",
a directory consisting of only a single component equal to a root principal's identifier is
referred to as the root directory of that root principal.
The root principal delegates its authority to write files to subordinate principals by issuing
them certificates which specify the path that the authority of the subordinate is limited to.
File data is signed for authenticity and a certificate chain is contained in its metadata.
This certificate chain must lead back to the root principal
and consist of certificates with correctly scoped authority in order for the file to be authentic.
Given the path of a file and the file's contents,
this system allows the file to be validated by anyone without the need to trust a third-party.
Blocktree paths are referred to as self-certifying for this reason.

% Persistent state provided by the filesystem.
One of the major challenges in distributed systems is managing persistent state.
Blocktree solves this issue using its distributed filesystem.
Files are broken into segments called sectors.
The sector size of a file can be configured when it is created,
but cannot be changed after the fact.
Reads and writes of individual sectors are guaranteed to be atomic.
The sectors which comprise a file and its metadata are replicated by a set of processes running
the sector service.
This service is responsible for storing the sectors of files which are contained in the directory
containing the process in which it is running.
The actors providing the sector service in a given directory coordinate with one another using
the Raft protocol to synchronize the state of the sectors they store.
This method of partitioning the data in the filesystem based on directory
allows the system to scale beyond the capabilities of a single consensus cluster.
Sectors are secured with strong integrity protection,
which allows anyone to verify that their contents were written by an authorized principal.
Encryption can be optionally applied to sectors,
with the system handling key management.
The cryptographic mechanisms used to implement these protections are described in section 3.
To reduce load on the sector service, and to allow the system to scale to a larger number of users,
a peer-to-peer distribution system is implemented in the filesystem service.
This system allows filesystem actors to download sectors from other filesystem actors
that have the sectors in their local cache.
The threat of malicious actors serving bad sector data is mitigated by the strong integrity
protections applied to sectors.
By using peer-to-peer distribution, the system can serve as a content delivery network.

% Protocol contracts.
One of the design goals of Blocktree is to facilitate the creation of composable distributed
systems.
A major challenge to building such systems is the difficulty in pinning down bugs when they
inevitably occur.
Research into session types (a.k.a. Behavioral Types) promises to bring the safety benefits
of type checking to actor communication.
Blocktree integrates a session typing system that allows protocol contracts to be defined that
specify the communication patterns of a set of actors.
This model allows the state space of the set of actors participating in a computation to be defined,
and the state transitions which occur to be specified based on the types of received messages.
These contracts are used to verify protocol adherence statically and dynamically.
This system is implemented using compile time code generation,
making it a zero-cost abstraction.
This frees the developer from dealing with the numerous failure modes that can occur in a
communication protocol.

% Implementation language and project links.
Blocktree is implemented in the Rust programming language.
Its source code is licensed under the Affero GNU Public License Version 3.
It can be downloaded at the project homepage at \url{https://blocktree.systems}.
Anyone interested in contributing to development is welcome to submit a pull request
to \url{https://gogs.delease.com/Delease/Blocktree}.
If you have larger changes or architectural suggestions,
please submit an issue for discussion prior to spending time implementing your idea.

% Outline of the rest of the paper.
The remainder of this paper is structured as follows:
\begin{itemize}
  \item Section 2 describes the actor runtime, service and task orchestration, and service
    discovery.
  \item Section 3 discusses the filesystem, its concurrency semantics and implementation.
  \item Section 4 details the cryptographic mechanisms used to secure communication between
    actor runtimes and to protect sector data.
  \item Section 5 is a set of examples describing ways that Blocktree can be used to build systems.
  \item Section 6 provides some concluding remarks.
\end{itemize}


\section{Actor Runtime}
% Motivation for using the actor model. 
Building scalable fault tolerant systems requires us to distribute computation over
multiple computers.
Rather than switching to a different programming model when an application scales beyond the
capacity of a single computer,
it is beneficial in terms of programmer time and program simplicity to begin with a model that 
enables multi-computer scalability.
Fundamentally, all communication over an IP network involves the exchange of messages,
namely IP packets.
So if we wish to build scalable fault-tolerant systems,
it makes sense to choose a programming model built on message passing,
as this will ensure low impedance with the underlying networking technology.

% Overview of message passing interface.
That is why Blocktree is built on the actor model
and why its actor runtime is at the core of its architecture.
The runtime can be used to spawn actors, register services, and dispatch messages.
Messages can be dispatched in two different ways: with \texttt{send} and \texttt{call}.
A message is dispatched with the \texttt{send} method when no reply is required,
and with \texttt{call} when exactly one is.
The \texttt{Future} returned by \texttt{call} can be awaited to obtain the reply.
If a timeout occurs while waiting for the reply,
the \texttt{Future} completes with an error.
The name \texttt{call} was chosen to bring to mind a remote procedure call,
which is the primary use case this method was intended for.
Awaiting replies to messages serves as a simple way to synchronize a distributed computation.

% Description of virtual actor system.
One of the challenges when building actor systems is supervising and managing actor's lifecycles.
This is handled in Erlang through the use of supervision trees,
but Blocktree takes a different approach inspired by Microsoft's Orleans framework.
Orleans introduced the concept of virtual actors,
which are purely logical entities that exist perpetually.
In Orleans, one does not need to spawn actors nor worry about respawing them should they crash,
the framework takes care of spawning an actor when a message is dispatched to it.
This model also gives the framework the flexibility to deactivate actors when they are idle
and to load balance actors across different computers.
In Blocktree a similar system is used when messages are dispatched to services.
The Blocktree runtime takes care of routing these messages to the appropriate actors,
spawning them if needed.

% Message addressing modes.
Messages can be addressed to services or specific actors.
When addressing a specific actor,
the message contains an \emph{actor name},
which is a pair consisting of the path of the runtime hosting the actor and the \texttt{Uuid}
identifying the specific actor in that runtime.
When addressing a service,
the message is dispatched using a \emph{service name},
which contains the following fields:
\begin{enumerate}
  \item \texttt{service}: The path identifying the receiving service.
  \item \texttt{scope}: A filesystem path used to specify the intended recipient.
  \item \texttt{rootwards}: An boolean describing whether message delivery is attempted towards or
    away from the root of the filesystem tree. A value of
    \texttt{false} indicates that the message is intended for a runtime directly contained in the
    scope. A value of \texttt{true} indicates that the message is intended for a runtime contained
    in a parent directory of the scope and should be delivered to a runtime which has the requested
    service registered and is closest to the scope.
  \item \texttt{id}: An identifier for a specific service provider.
\end{enumerate}
The ID can be a \texttt{Uuid} or a \texttt{String}.
It is treated as an opaque identifier by the runtime,
but a service is free to associate additional meaning to it.
Every message has a header containing the name of the sender and receiver.
The receiver name can be an actor or service name,
but the receiver name is always an actor name.
For example, to open a file in the filesystem,
a message is dispatched with \texttt{call} using the service name of the filesystem service.
The reply contains the name of the file actor spawned by the filesystem service which owns the opened
file.
Messages are then dispatched to the file actor using its actor name to read and write to the file.

% The runtime is implemented using tokio.
The actor runtime is currently implemented using the Rust asynchronous runtime tokio.
Actors are spawned as tasks in the tokio runtime,
and multi-producer single consumer channels are used for message delivery.
Because actors are just tasks,
they can do anything a task can do,
including awaiting other \texttt{Future}s.
Because of this, there is no need for the actor runtime to support short-lived worker tasks,
as any such use-case can be accomplished by awaiting a set of \texttt{Future}s.
This allows the runtime to focus on providing support for services.
Using tokio also means that we have access to a high performance multi-threaded runtime with
evented IO.
This asynchronous programming model ensures that resources are efficiently utilized,
and is ideal for a system focused on orchestrating services which may be used by many clients.

% Delivering messages over the network.
Messages can be forwarded between actor runtimes using a secure transport layer called
\texttt{bttp}.
The transport is implemented using the QUIC protocol, which integrates TLS for security.
A \texttt{bttp} client may connect anonymously or using credentials.
If an anonymous connection is attempted,
the client has no authorization attributes associated with it.
Only runtimes which grant others the execute permission allow connections from such clients.
If these permissions are not granted in the runtime's file,
anonymous connections are rejected.
When a client connects with credentials,
mutual TLS authentication is performed as part of the connection handshake,
which cryptographically verifies the credentials of each runtime.
These credentials contain the filesystem paths where each runtime is located,
which ensures that messages addressed to a specific path will only be delivered to that path.
The \texttt{bttp} server is always authenticated during the handshake,
even when the client is connecting anonymously.
Because QUIC supports the concurrent use of many different streams,
it serves as an ideal transport for a message oriented system.
\texttt{bttp} uses different streams for independent messages,
ensuring that head of line blocking does not occur.
Note that although data from separate streams can arrive in any order,
the protocol does provide reliable in-order delivery of data in a given stream.
The same stream is used for sending the reply to a message dispatched with \texttt{call}.
Once a connection is established,
message may flow both directions (provided both runtimes have execute permissions for the other),
regardless of which runtime is acting as the client or the server.

% Delivering messages locally.
When a message is sent between actors in the same runtime it is delivered into the queue of the recipient without any copying,
while ensuring immutability (i.e. move semantics).
This is possible thanks to the Rust ownership system,
because the message sender gives ownership to the runtime when it dispatches the message,
and the runtime gives ownership to the recipient when it delivers the message.

% Security model based on filesystem permissions.
A runtime is represented in the filesystem as a file.
This file contains the authorization attributes which are associated with the runtime's security
principal.
The credentials used by the runtime specify the file, so other runtimes are able to locate it.
The metadata of the file contains authorization attributes just like any other file
(e.g. UID, GID, and mode bits).
In order for a principal to be able to send a message to an actor in the runtime,
it must have execute permissions for this file.
Thus communication between runtimes can be controlled using simple filesystem permissions.
Permissions checking is done during the \texttt{bttp} handshake.
Note that it is possible for messages to be sent in one direction in a \texttt{bttp} connection
but not in the other.
In this situation replies are permitted but unsolicited messages are not.
An important trade-off which was made when designing this model was that messages which are
sent between actors in the same runtime are not subject to any authorization checks.
This was done for two reasons: performance and security.
By eliminating authorization checks messages can be more efficiently delivered between actors in the
same process,
which helps to reduce the performance penalty of the actor runtime over directly using threads.
Security is enhanced by this decision because it forces the user to separate actors with different
security requirements into different operating system processes,
which ensures all of the process isolation machinery in the operating system will be used to
isolate them.

% Representing resources as actors.
As in other actor systems, it is convenient to represent resources in Blocktree using actors.
This allows the same security model used to control communication between actors to be used for
controlling access to resources,
and for resources to be shared by many actors.
For instance, a Point-to-Point Protocol connection could be owned by an actor.
This actor could forward traffic delivered to it in messages over this connection.
The set of actors which are able to access the connection is controlled by setting the filesystem
permissions on the file for the runtime executing the actor owning the connection.

% Actor ownership.
The concept of ownership in programming languages is very useful for ensuring that resources are
properly freed when the type using them dies.
Because actors are used for encapsulating resources in Blocktree,
a similar system of ownership is employed for this reason.
An actor is initially owned by the actor that spawned it.
An actor can only have a single owner,
but the owner can grant ownership to another actor.
An actor is not allowed to own itself,
though it may be owned by the runtime.
When the owner of an actor returns,
the actor is sent a message instructing it to return.
If it does not return after a timeout,
it is interrupted.
This is the opposite of how supervision trees work in Erlang.
Instead of the parent receiving a message when the child returns,
the child receives a message when the parent returns.
Service providers spawned by the runtime are owned by it.
They continue running until the runtime chooses to reclaim their resources,
which can happen because they are idle or the runtime is overloaded.

% Message routing to services.
A service is identified by a Blocktree path.
Only one service implementation can be registered in a particular runtime,
though this implementation may be used to spawn many actors as providers for the service,
each associated with a different ID.
The runtime spawns a new actor when it finds no service provider associated with the ID in the
message it is delivering.
Some services may only have one service provider in a given runtime,
as is the case for the sector and filesystem services.
Services are reactive,
they don't do anything until they receive a message to process.
The \texttt{scope} and \texttt{rootward} field in an actor name specify the set of runtimes to
which a message may be delivered.
They allow the sender to express their intended recipient,
while still affording enough flexibility to the runtime to route messages as needed.
If \texttt{rootward} is \texttt{false},
the message is delivered to a service provider in a runtime that is directly contained in
\texttt{scope}.
If \texttt{rootward} is \texttt{true},
the parent directories of scope are searched,
working towards the root of the filesystem tree,
and the message is delivered to the first provider of \texttt{service} which is found.
When there are multiple service providers to which a given message could be delivered,
the one to which it is actually delivered is unspecified,
which allows the runtime to balance load.
Delivery will occur for at most one recipient,
even in the case that there are multiple potential recipients.
In order to contact other runtimes and deliver messages to them,
their IP addresses need to be known.
This is achieved by maintaining a file with a runtime's IP address in the same directory as the
runtime.
The runtime is granted write permissions on the file,
and it is updated by \texttt{bttp} when it begins listening on a new endpoint.
The services which are allowed to be registered in a given runtime are specified in the runtime's
file.
The runtime reads this list and uses it to deny service registrations for unauthorized services.
The list is also read by other runtime's when they are searching a directory for service providers.

% The sector and filesystem service.
The filesystem is itself implemented as a service.
A filesystem service provider can be passed messages to delete files, list directory contents,
open files, or perform several other standard filesystem operations.
When a file is opened,
a new actor is spawned which owns the newly created file handle and its name is returned to the
caller in a reply.
Subsequent read and write messages are sent to this actor.
The filesystem service does not persist any data itself,
its job is to function as an integration layer,
conglomerating sector data from many different sources into a single unified interface.
The sector service is what is ultimately responsible for storing data,
and thus maintaining the persistent state of the system.
It stores sector data in the local filesystem of each computer on which it is registered.
The details of how this is accomplished are deferred to the next section.

% Runtime network discovery.
While it is possible to resolve runtime paths to IP addresses when the filesystem is available,
a different mechanism is needed to allow the filesystem and sector services to discover service
providers.
To facilitate this,
runtimes are able to query one another to learn about other runtimes.
Because queries are intended to facilitate message delivery,
the query fields and their meanings mirror those used for addressing messages:
\begin{enumerate}
  \item \texttt{service} The path of the service whose providers are sought.
    Only runtimes with this service registered will be returned.
  \item \texttt{scope} The filesystem path relative to which the query will be processed.
  \item \texttt{rootward} Indicates if the query should search for runtimes from \texttt{scope}
    toward the root.
\end{enumerate}
The semantics of \texttt{scope} and \texttt{rootward} in a query are identical to their use in an
actor name.
As long as at least one other runtime is known,
a query can be issued to learn of more runtimes.
A runtime which receives a query may not be able to answer it directly.
If it cannot,
it returns the IP address of the next runtime to which the query should be sent.
In order to bootstrap the discovery processes,
another mechanism is needed to find the first peer to query.
There were several possibilities explored for doing this.
One way is to use a blockchain to store the IP addresses of the runtimes hosting the sector service
in the root directory.
As long as these runtimes could be located,
then all others could be found using the filesystem.
This idea may be worth revisiting in the future,
but the author wanted to avoid the complexity of implementing a new proof of work blockchain.
Another idea was to use multicast link-local addressing to discover other runtimes,
similar to how mDNS operates.
This approach has several advantages.
It avoids any dependency on centralized internet infrastructure
and keeps network load local to the segment on which the runtimes are connected.
But, it will not work over a wide area network,
making it unsuitable for the general case.
Instead, the design which was decided on was to use DNS to resolve a fully qualified domain name
(FQDN) derived from the root principal's identifier.
This FQDN is expected to resolve to the public IP addresses of the runtimes hosting the
sector service in the root directory of the root principal.
Each process is configured with a search domain which is used as a suffix of the FQDN.
The leading labels in the FQDN are computed by base32 encoding a hash of the root
principal's public key.
If the encoded string is longer than 63 bytes (the limit for each label in a hostname),
it is separated into the fewest number of labels possible,
working from left to right along the string.
A dot followed by the search domain is concatenated onto the end of this string to form the FQDN.
This method has the advantages of being simple to implement
and allowing runtimes to discover each other over the internet.
Implementing this system would be facilitated by hosting DNS servers in actors in the same
runtimes as the root sector service providers.
Then, A or AAAA records could be served which point to these runtimes.
These runtimes would also need to be configured with static IP addresses,
and the NS records for the search domain would need to point to them.
Of course it is also possible to build such a system without hosting DNS inside of Blocktree.
The downside of using DNS is that it couples Blocktree with a centralized,
albeit distributed, system.

% Security model for queries.
To allow runtimes which are not permitted to execute the root directory to query for other runtimes,
authorization logic which is specific to queries is needed.
If a process is connected with credentials
and the path in the credentials contains the scope of the query,
the query is permitted.
If a process is connected anonymously,
its query will only be answered if the query scope
and all of its parent directories,
grant others the execute permission.
Queries from authenticated processes can be authorized using only the information in the query,
but anonymous queries require knowledge of filesystem permissions,
some of which may not be known to the answering runtime.
When authorizing an anonymous query,
an answering runtime should check that that the execute permission is granted on all directories
that it is responsible for storing.
If all these checks pass, it should forward the querier to the next runtime as usual.

% Overview of protocol contracts and runtime checking of protocol adherence.
To facilitate the creation of composable systems,
a protocol contract checking system based on session types has been designed.
This system models a communication protocol as a directed graph representing state transitions
based on types of received messages.
The protocol author defines the states that the actors participating in the protocol can be in using 
Rust traits.
These traits define handler methods for each message type the actor is expected to handle in that
state.
A top-level trait which represents the entire protocol is defined that contains the types of the
initial state of every actor in the protocol.
A macro is used to generate the message handling loop for the each of the parties to the protocol,
as well as enums to represent all possible states that the parties can be in and the messages that
they exchange.
The generated code is responsible for ensuring that errors are generated when a message of an
unexpected type is received,
eliminating the need for ad-hoc error handling code to be written by application developers.

% Example of a protocol contract.
% TODO: I don't find this example very compelling. It would be more impressive to show a pub-sub
% protocol, that would look cool.
Let us explore the use of this system through a simple example using the HTTP/1.1 protocol.
It is a state-less client-server protocol,
essentially just an RPC from client to server.
We can model this in for the contract checker by defining a trait representing the protocol:
\begin{verbatim}
  pub trait Http {
    type Server: ServerInit;
  }
\end{verbatim}
The purpose of this top-level trait is to specify the initial state of every party to the
communications protocol.
In this case we're only modeling the state of the server,
as the client will just \texttt{call} a method on the server.
The initial state for the server is defined as follows:
\begin{verbatim}
  pub trait ServerInit {
    type AfterActivate: Listening;
    type Fut: Future<Output = Result<Self::AfterActivate>>;
    fn handle_activate(self, msg: Activate) -> Self::Fut;
  }
\end{verbatim}
\texttt{Activate} is a message sent by the generated code to allow the actor access to the
runtime and the actor's ID.
It is defined as follows:
\begin{verbatim}
  pub struct Activate {
    rt: &'static Runtime,
    act_id: Uuid,
  }
\end{verbatim}
We represent the statelessness of HTTP by having the requests to the \texttt{Listening} state
return another \texttt{Listening} state.
\begin{verbatim}
  pub trait Listening {
    type AfterRequest: Listening;
    type Fut: Future<Output = Result<Self::AfterRequest>>;
    fn handle_request(self, msg: Envelope<Request>) -> Self::Fut;
  }
\end{verbatim}
The \texttt{Envelope} type is a wrapper around a message which contains information about who sent
it and a method which can be used to send a reply.
In general a new type could be returned after each message received,
with the returned type being dependent on the type of the message.
The state graph of this protocol can be visualized as follows:
\begin{center}
  \includegraphics[height=1.5in]{HttpStateGraph.pdf}
\end{center}

% Implementing actors in languages other than Rust.
Today the actor runtime only supports executing actors implemented in Rust.
A WebAssembly (Wasm) plugin system is planned to allow any language which can compile to Wasm to be
used to implement an actor.
This work is blocked pending the standardization of the WebAssembly Component Model,
which promises to provide an interface definition language which will allow type safe actors to be
defined in many different languages.

% Running containers using actors.
Blocktree allows containers to be run by encapsulating them using a supervising actor.
This actor is responsible for starting the container and managing the container's kernel namespace.
Logically, it owns any kernel resources created by the container, including all spawned operating
system processes.
When the actor halts,
all of these resources are destroyed.
All network communication to the container is controlled by the supervising actor.
The supervisor can be configured to bind container ports to host ports,
as is commonly done today,
but it can also be used to encapsulate traffic to and from the container in Blocktree messages.
These messages are routed to other actors based on the configuration of the supervisor.
This essentially creates a VPN for containers,
ensuring that regardless of well secured their communication is,
they will be safe to communicate over any network.
This network encapsulation system could be used in other actors as well,
allowing a lightweight and secure VPN system to built.


\section{Filesystem}
% The division of responsibilities between the sector and filesystem services.
The responsibility for serving data in the system is shared between the filesystem and sector
services.
Most actors will access the filesystem through the filesystem service,
which provides a high-level interface that takes care of the cryptographic operations necessary to
read and write files.
The filesystem service relies on the sector service for actually persisting data.
The individual sectors which make up a file are read from and written to the sector service,
which stores them in the local filesystem of the computer on which it is running.
A sector is the atomic unit of data storage
and the sector service only supports reading and writing entire sectors at once.
File actors spawned  by the filesystem service buffer reads and writes until there is enough
data to fill a sector.
Because cryptographic operations are only performed on full sectors,
the cost of providing these protections is amortized over the size of the sector.
Thus there is tradeoff between latency and throughput when selecting the sector size of a file:
a smaller sector size means less latency while a larger one enables more throughput.

% Types of sectors: metadata, integrity, and data.
A file has a single metadata sector, a Merkle sector, and zero or more data sectors.
The sector size of a file can be specified when it is created,
but cannot be changed later.
Every data sector contains the ciphertext of the number of bytes equal to the sector size,
but the metadata and Merkle sectors contain a variable amount of data.
The metadata sector contains all of the filesystem metadata associated with the file.
In addition to the usual metadata present in any Unix filesystem (the contents of the \texttt{stat} struct),
cryptographic information necessary to verify and decrypt the contents of the file are also stored.
The Merkle sector of a file contains a Merkle tree over the data sectors of a file.
The hash function used by this tree can be configured at file creation,
but cannot be changed after the fact.

% How sectors are identified.
When sector service providers are contained in the same directory they connect to each other to form
a consensus cluster.
This cluster is identified by a \texttt{u64} called the cluster's \emph{generation}.
Every file is identified by a pair of \texttt{u64}, its generation and its inode.
The sectors within a file are identified by an enum which specifies which type they are,
and in the case of data sectors, their 0-based index.
\begin{verbatim}
  pub enum SectorKind {
    Meta,
    Merkle,
    Data(u64),
  }
\end{verbatim}
The byte offset in the plaintext of the file at which each data sector begins can be calculated by
multiplying the sector's index by the sector size of the file.
The \texttt{SectorId} type is used to identify a sector.
\begin{verbatim}
  pub enum SectorId {
    generation: u64,
    inode: u64,
    sector: SectorKind,
  }
\end{verbatim}

% Types of messages handled by the sector service.
Communication with the sector service is done by passing it messages of type \texttt{SectorMsg}.
\begin{verbatim}
  pub struct SectorMsg {
    id: SectorId,
    op: SectorOperation,
  }

  pub enum SectorOperation {
    Read,
    Write(WriteOperation),
  }

  pub enum WriteOperation {
    Meta(Box<FileMeta>),
    Data {
      meta: Box<FileMeta>,
      contents: Vec<u8>,
    }
  }
\end{verbatim}
Here \texttt{FileMeta} is the type used to store metadata for files.
Note that updated metadata is required to be sent when a sector's contents are modified.

% Scaling horizontally: using Raft to create consensus cluster. Additional replication methods.
A generation of sector service providers uses the Raft protocol to synchronize the state of the
sectors it stores.
The message passing interface of the runtime enables this implementation
and the sector service's requirements were important considerations in designing this interface.
The system currently replicates all data to each of the service providers in the cluster.
Additional replication methods are planned for future implementation
(e.g. erasure encoding and distribution via consistent hashing),
which allow for different tradeoffs between data durability and storage utilization.

% Scaling vertically: how different generations are stitched together.
The creation of a new generation of the sector service is accomplished with several steps.
First, a new directory is created in which the generation will be located.
Next, one or more processes are credentialed for this directory,
using a procedure which is described in the next section.
The credentialing process produces files for each of the processes stored in the new directory.
The sector service provider in each of the processes uses the filesystem service
(which connects to the parent generation of the sector service)
to find the other runtimes hosting the sector service in the directory and messages them to
establish a fully-connected cluster.
Finally, the service provider which is elected leader contacts the generation in the root directory
and requests a new generation number.
Once this number is known it is stored in the superblock for the generation,
which is the file identified by the new generation number and inode 2.
The superblock is not contained in any directory and cannot be accessed outside the sector service.
The superblock also keeps track of the next inode to assign to a new file.

% Authorization logic of the sector service.
To prevent malicious actors from writing invalid data,
the sector service must cryptographically verify all write messages.
The process it uses to do this involves several steps:
\begin{enumerate}
  \item The certificate chain in the metadata that was sent in the write message is validated.
    It is considered valid if it ends with a certificate signed by the root principal
    and the paths in the certificates are correctly nested,
    indicating valid delegation of write authority at every step.
  \item Using the last public key in the certificate chain,
    the signature in the metadata is validated.
    This signature covers all of the fields in the metadata.
  \item The new sector contents in the write message are hashed using the digest function configured
    for the file and the resulting hash is used to update the file's Merkle tree in its Merkle
    sector.
  \item The root of the Merkle tree is compared with the integrity value in the file's metadata.
    The write message is considered valid if and only if there is a match.
\end{enumerate}
This same logic is used by file actors to verify the data they read from the sector service.
Only once a write message is validated is it shared with the sector service provider's peers in
its generation.
Although the data in a file is encrypted,
it is still beneficial for security to prevent unauthorized principal's from gaining access to a
file's ciphertext.
To prevent this, a sector service provider checks a file's metadata to verify that the requesting
principal actually has a readcap (to be defined in the next section) for the file.
This ensures that only principals that are authorized to read a file can gain access to the file's
ciphertext, metadata, and Merkle tree.

% File actors are responsible for cryptographic operations. Client-side encryption.
The sector service is relied upon by the filesystem service to read and write sectors.
Filesystem service providers communicate with the sector service to open files and perform
filesystem operations.
These providers spawn file actors that are responsible for verifying and decrypting the information
contained in sectors and providing it to other actors.
They use the credentials of the runtime they are hosted in to decrypt sector data using
information contained in file metadata.
File actors are also responsible for encrypting and integrity protecting data written to files.
In order for a file actor to produce a signature over the root of the file's Merkle tree,
it maintains a copy of the tree in memory.
This copy is read from the sector service when the file is opened.
While this does mean duplicating data between the sector and filesystem services,
this design was chosen to reduce the network traffic between the two services,
as the entire Merkle tree does not need to be transmitted on every write.
Encapsulating all cryptographic operations in the filesystem service and file actors allows the
computer storing data to be different from the computer encrypting it.
This approach allows client-side encryption to be done on more capable computers
and low powered devices to delegate this task to a storage server.

% Prevention of resource leaks through ownership.
A major advantage of using file actors to access file data is that they can be accessed over the
network from a different runtime as easily as they can be from the same runtime.
One complication arising from this approach is that file actors must not outlive the actor which
caused them to be spawned.
This is handled in the filesystem service by making the actor who opened the file the owner of the
file actor.
When a file actor receives notification that its owner returned,
it flushes any buffered data in its cache and returns,
ensuring that a resource leak does not occur.

% Authorization logic of the filesystem service.
The filesystem service uses an \texttt{Authorizer} type to make authorization decisions.
It passes this type the authorization attributes of the principal accessing the file, the
attributes of the file, and the type of access (read, write, or execute).
The \texttt{Authorizer} returns a boolean indicating if access is permitted or denied.
These access control checks are performed for every message processed by the filesystem service,
including opening a file.
A file actor only responds to messages sent from its owner,
which ensures that it can avoid the overhead of performing access control checks as these were
carried out by the filesystem service when it was created.
The file actor is configured when it is spawned to allow read only, write only, or read write
access to a file,
depending on what type of access was requested by the actor opening the file.

% Streaming replication.
Often when building distributed systems it is convenient to alert any interested party that an event
has occurred.
To facilitate this pattern,
the sector service allows actors to subscribe for notification of writes to a file.
The sector service maintains a list of actors which are currently subscribed
and when it commits a write to its local storage,
it sends each of them a notification message identifying the sector written
(but not the written data).
By using different files to represent different events,
a simple notification system can be built.
Because the contents of a directory may be distributed over many different generations,
this system does not support the recursive monitoring of directories.
Although this system lacks the power of \texttt{inotify} in the Linux kernel,
it does provides some of its benefits without incurring much or a performance overhead
or implementation complexity.
For example, this system can be used to implement streaming replication.
This is done by subscribing to writes on all the files that are to be replicated,
then reading new sectors as soon as notifications are received.
These sectors can then be written into replica files in a different directory.
This ensures that the contents of the replicas will be updated in near real-time.

\section{Cryptography}
% The underlying trust model: self-certifying paths.

% Verifying sector contents on read and certifying on write.

% Confidentiality protecting files with readcaps. Single pubkey operation to read a dir tree.

% Give example of how these mechanisms allow data to be shared without any prior federation.

% Description of bttp handshake and the authentication data which is provided by both parties.

% Requesting and issuing credentials. Multicast link-local network discovery.


\section{Examples}
This section contains examples of systems built using Blocktree. The hope is to illustrate how this
platform can be used to implement existing applications more easily and to make it possible to
implement systems which are currently out of reach.

\subsection{A personal cloud for a home user.}
% Describe my idealized home Blocktree setup.

\subsection{An ecommerce website.}
% Describe a blocktree which runs a cluster of webservers, a manufacturing process, a warehouse
% inventory management system, and an order fulfillment system.

\subsection{A smart home.}

\subsection{A realtime geo-spacial environment.}
% Explain my vision of the metaverse.


\section{Conclusion}
% Blocktree serves as the basis for building a cloud-level distributed operating system.

% The system enables individuals to self-host the services they rely on.

% It also gives business a freeer choice of whether to own or lease computing resources.

% The system advances the status quo in secure computing.

% Composability leads to emergent benefits.

\end{document}