|
@@ -1,7 +1,10 @@
|
|
|
\documentclass{article}
|
|
|
+\usepackage{amsfonts,amssymb,amsmath}
|
|
|
\usepackage[scale=0.8]{geometry}
|
|
|
\usepackage{hyperref}
|
|
|
\usepackage{graphicx}
|
|
|
+\usepackage{biblatex}
|
|
|
+\addbibresource{../citations.bib}
|
|
|
|
|
|
\title{Blocktree: A Distributed Computing Environment}
|
|
|
\author{Matthew Carr}
|
|
@@ -17,7 +20,7 @@ The persistent state for the system is stored in a global distributed filesystem
|
|
|
this actor runtime.
|
|
|
High availability is achieved using the Raft consensus protocol to synchronize the state of files between processes.
|
|
|
All data stored in the filesystem is secured with strong integrity and optional confidentiality protections.
|
|
|
-Well-known cryptographic constructions are used to provide this protection,
|
|
|
+Well-known cryptographic constructions are used to provide these protections;
|
|
|
the system does not attempt to innovate in terms of cryptography.
|
|
|
A network block device interface allows for fast low-level read and write access to file sectors,
|
|
|
with full support for client-side encryption.
|
|
@@ -34,12 +37,13 @@ to the entire distributed system that comprises modern IT infrastructure.
|
|
|
The system is organized around a global distributed filesystem which defines security
|
|
|
principals, resources, and their authorization attributes.
|
|
|
This filesystem provides a language for access control that can be used to securely grant
|
|
|
-access to resources, even those owned by different organizations.
|
|
|
+access to resources,
|
|
|
+even those owned by different organizations.
|
|
|
The system provides an actor runtime for orchestrating services.
|
|
|
Resources are represented as actors
|
|
|
and actors are executed by runtimes in different operating system processes.
|
|
|
Each process has its own credentials which authenticate it as a unique security principal,
|
|
|
-and which specify the filesystem path where it is located.
|
|
|
+and which specify the filesystem path where it's located.
|
|
|
A process has authorization attributes which determine the set of processes that it may communicate
|
|
|
with.
|
|
|
TLS authentication is used to secure connections between processes.
|
|
@@ -65,6 +69,8 @@ and consist of certificates with correctly scoped authority in order for the fil
|
|
|
Given the path of a file and the file's contents,
|
|
|
this allows the file to be validated by anyone without the need to trust a third party.
|
|
|
Blocktree paths are called self-certifying for this reason.
|
|
|
+This construction was independently discovered by the author,
|
|
|
+but a similar system was previously used in the Self-certifying File System (SFS) \cite{sfs}.
|
|
|
|
|
|
% Persistent state provided by the filesystem.
|
|
|
One of the major challenges in distributed systems is managing persistent state.
|
|
@@ -78,7 +84,7 @@ the sector service.
|
|
|
These service providers are responsible for storing the sectors of files that are contained in the
|
|
|
directory containing the runtime in which it's running.
|
|
|
The actors providing the sector service in a given directory coordinate with one another using
|
|
|
-the Raft protocol to synchronize the state of the sectors they store.
|
|
|
+the Raft protocol \cite{raft} to synchronize the state of the sectors they store.
|
|
|
By partitioning the data in the filesystem based on directory,
|
|
|
the system can scale beyond the capabilities of a single consensus cluster.
|
|
|
Associated with every file is a Merkle tree of sector hashes,
|
|
@@ -92,7 +98,7 @@ systems.
|
|
|
A major challenge to building such systems is the difficulty of locating the cause of bugs when they
|
|
|
inevitably occur.
|
|
|
Research into session types (a.k.a. Behavioral Types) promises to bring the safety benefits
|
|
|
-of type checking to actor communication.
|
|
|
+of type checking to actor communication \cite[chapter 9]{armstrong}.
|
|
|
Blocktree integrates a session typing system that allows protocol contracts to be defined that
|
|
|
specify the communication protocol of a set of actors.
|
|
|
This model allows the state space of the actors participating in a computation to be defined,
|
|
@@ -106,16 +112,16 @@ communication protocol.
|
|
|
Blocktree is implemented in the Rust programming language.
|
|
|
It is currently tested on Linux,
|
|
|
but running it on other Unix-like operating systems should be straightforward.
|
|
|
-FUSE support is required to mount the filesystem.
|
|
|
+FUSE support from the host kernel is required to mount the filesystem.
|
|
|
The system's source code is licensed under the GNU Affero General Public License Version 3.
|
|
|
The project's homepage is \url{https://blocktree.systems}.
|
|
|
Anyone interested in contributing to development is welcome to submit a pull request
|
|
|
to \url{https://gogs.delease.com/Delease/Blocktree}.
|
|
|
If you have larger changes or architectural suggestions,
|
|
|
-please submit an issue for discussion prior to spending time implementing your idea.
|
|
|
+please submit an issue for discussion prior to investing your time in an implementation.
|
|
|
|
|
|
% Outline of the rest of the paper.
|
|
|
-The remainder of this paper is structured as follows:
|
|
|
+The remainder of this document is structured as follows:
|
|
|
\begin{itemize}
|
|
|
\item Section 2 describes the actor runtime, services, and runtime discovery.
|
|
|
\item Section 3 discusses the filesystem, its concurrency semantics and implementation.
|
|
@@ -132,8 +138,8 @@ Building scalable fault tolerant systems requires us to distribute computation o
|
|
|
multiple computers.
|
|
|
Rather than switching to a different programming model when an application scales beyond the
|
|
|
capacity of a single computer,
|
|
|
-it's beneficial in terms of programmer time and program simplicity,
|
|
|
-to begin with a model that enables multi-computer scalability.
|
|
|
+it's beneficial in terms of programmer time and program simplicity to begin with a model that
|
|
|
+enables multi-computer scalability.
|
|
|
Fundamentally, all communication over a network involves the exchange of messages.
|
|
|
So if we wish to build scalable fault-tolerant systems,
|
|
|
it makes sense to choose a programming model built on message passing,
|
|
@@ -145,9 +151,11 @@ and why its actor runtime is at the core of its architecture.
|
|
|
The runtime can be used to spawn actors, register services, dispatch messages immediately,
|
|
|
and schedule messages to be delivered in the future.
|
|
|
Messages can be dispatched in two ways: with \texttt{send} and \texttt{call}.
|
|
|
-A message is dispatched with the \texttt{send} method when no reply is required,
|
|
|
+A message is dispatched with \texttt{send} when no reply is required,
|
|
|
and with \texttt{call} when exactly one is.
|
|
|
-The \texttt{Future} returned by \texttt{call} can be awaited to obtain the reply.
|
|
|
+The Rust
|
|
|
+\href{https://doc.rust-lang.org/std/future/trait.Future.html}{\texttt{Future}}
|
|
|
+returned by \texttt{call} can be awaited to obtain the reply.
|
|
|
If a timeout occurs while waiting for the reply,
|
|
|
the \texttt{Future} completes with an error.
|
|
|
The name \texttt{call} was chosen to bring to mind a remote procedure call,
|
|
@@ -157,7 +165,7 @@ Awaiting replies to messages serves as a simple way to synchronize a distributed
|
|
|
% Scheduling messages for future delivery.
|
|
|
Executing actions at some point in the future or at regular intervals are common tasks in computer
|
|
|
systems.
|
|
|
-Blocktree facilitates this by allows messages to be scheduled for future delivery.
|
|
|
+Blocktree facilitates this by allowing messages to be scheduled for future delivery.
|
|
|
The schedule may specify a one-time delivery at a specific instant in time,
|
|
|
or a repeating delivery with a given period.
|
|
|
These scheduling modes can be combined so that you can specify an anchoring instant
|
|
@@ -175,13 +183,14 @@ But, if a message is periodic,
|
|
|
any messages which were missed due to a runtime not being active will never be sent.
|
|
|
This is because the runtime only persists the message's schedule,
|
|
|
not every delivery.
|
|
|
-This mechanism is intended for periodic tasks or delaying work to a later time.
|
|
|
-It is not for building hard realtime systems.
|
|
|
+This mechanism is intended for periodic tasks or delaying work to a later time,
|
|
|
+not for building hard real-time systems.
|
|
|
|
|
|
% Description of virtual actor system.
|
|
|
One of the challenges in building actor systems is supervising and managing actors' lifecycles.
|
|
|
-This is handled in Erlang through the use of supervision trees,
|
|
|
-but Blocktree takes a different approach, one inspired by Microsoft's Orleans framework.
|
|
|
+This is handled in Erlang \cite{armstrong} through the use of supervision trees,
|
|
|
+but Blocktree takes a different approach, one inspired by Microsoft's Orleans framework
|
|
|
+\cite{orleans}.
|
|
|
Orleans introduced the concept of virtual actors,
|
|
|
which are purely logical entities that exist perpetually.
|
|
|
In Orleans, one does not need to spawn actors nor worry about respawning them should they crash,
|
|
@@ -231,7 +240,8 @@ file.
|
|
|
Messages are then dispatched to the file actor using its actor name to read and write to the file.
|
|
|
|
|
|
% The runtime is implemented using tokio.
|
|
|
-The actor runtime is implemented using the Rust asynchronous runtime tokio.
|
|
|
+The actor runtime is implemented using the Rust asynchronous runtime tokio
|
|
|
+[\url{https://tokio.rs}].
|
|
|
Actors are spawned as tasks in the tokio runtime,
|
|
|
and multi-producer, single-consumer channels are used for message delivery.
|
|
|
Because actors are just tasks,
|
|
@@ -247,7 +257,8 @@ and is ideal for a system focused on orchestrating services which may be used by
|
|
|
|
|
|
% Delivering messages over the network.
|
|
|
Messages can be forwarded between actor runtimes using a secure transport called \texttt{bttp}.
|
|
|
-This transport is implemented using the QUIC protocol, which integrates TLS for security.
|
|
|
+This transport is implemented using the QUIC protocol \cite{quic}, which integrates TLS for
|
|
|
+security.
|
|
|
A \texttt{bttp} client may connect anonymously or using credentials.
|
|
|
If an anonymous connection is attempted,
|
|
|
the client has no authorization attributes associated with it.
|
|
@@ -284,8 +295,8 @@ A runtime is represented in the filesystem as a file.
|
|
|
Among other things,
|
|
|
this file contains the authorization attributes associated with the runtime's security
|
|
|
principal.
|
|
|
-The certificate used by the runtime to authenticate contain the to this file,
|
|
|
-so other runtimes are able to locate it.
|
|
|
+The certificate used by the runtime to authenticate is also contained in this file,
|
|
|
+so other runtimes are able to locate it and the public key contained within it.
|
|
|
The metadata of the file contains authorization attributes just like any other file
|
|
|
(e.g. UID, GID, and mode bits).
|
|
|
In order for a principal to be able to send a message to an actor in the runtime,
|
|
@@ -300,8 +311,8 @@ sent between actors in the same runtime are not subject to any authorization che
|
|
|
This was done for two reasons: performance and security.
|
|
|
By eliminating authorization checks, messages can be delivered more efficiently between actors in the
|
|
|
same process,
|
|
|
-which helps to reduce the performance penalty of the actor runtime over directly using
|
|
|
-\texttt{tokio::Task}s.
|
|
|
+which helps to reduce the performance penalty of the actor runtime over directly using a
|
|
|
+\href{https://docs.rs/tokio/latest/tokio/task/index.html}{\texttt{tokio} task}.
|
|
|
Security is enhanced by this decision because it forces the user to separate actors with different
|
|
|
security requirements into different operating system processes,
|
|
|
which ensures all of the process isolation machinery in the operating system will be used to
|
|
@@ -319,7 +330,7 @@ permissions on the file for the runtime executing the actor owning the connectio
|
|
|
|
|
|
% Actor ownership.
|
|
|
The concept of ownership in programming languages is very useful for ensuring that resources are
|
|
|
-properly freed when the type using them dies.
|
|
|
+properly released when the object using them dies.
|
|
|
Because actors are used for encapsulating resources in Blocktree,
|
|
|
a similar system of ownership is employed.
|
|
|
An actor is initially owned by the actor that spawned it.
|
|
@@ -387,7 +398,7 @@ The list is also read by other runtime's when they're searching for service prov
|
|
|
% The sector and filesystem service.
|
|
|
The filesystem is itself implemented as a service.
|
|
|
A filesystem service provider can be passed messages to delete files, list directory contents,
|
|
|
-open files, or perform several other standard filesystem operations.
|
|
|
+open files, or perform other standard filesystem operations.
|
|
|
When a file is opened,
|
|
|
a new actor is spawned which owns the newly created file handle and its name is returned to the
|
|
|
caller in a reply.
|
|
@@ -405,7 +416,7 @@ While it's possible to resolve runtime paths to network endpoints when the files
|
|
|
another mechanism is needed to allow the filesystem service providers to be discovered.
|
|
|
This is accomplished by allowing runtimes to query one another to learn of other runtimes.
|
|
|
Because queries are intended to facilitate message delivery,
|
|
|
-the query fields and their meanings mirror those used for addressing messages:
|
|
|
+the query fields and their semantics mirror those used for addressing messages:
|
|
|
\begin{enumerate}
|
|
|
\item \texttt{service} The path of the service whose providers are sought.
|
|
|
Only runtimes with this service registered will be returned.
|
|
@@ -456,13 +467,13 @@ These runtimes would also need to be configured with static IP addresses,
|
|
|
and the NS records for the search domain would need to point to them.
|
|
|
It is also possible to build such a system without hosting DNS inside of Blocktree,
|
|
|
by using a dynamic DNS service.
|
|
|
-The downside of using DNS is that it couples Blocktree with a centralized,
|
|
|
+The downside of using DNS is that it couples Blocktree with a centrally administered,
|
|
|
albeit distributed, system.
|
|
|
|
|
|
% Using link-local multicast datagrams to find runtimes.
|
|
|
Because this mechanism requires knowledge of the root principal of a domain to perform
|
|
|
discovery,
|
|
|
-it will not work if a runtime does not know its own root principal because it's starting up for the
|
|
|
+it will not work if a runtime doesn't know its own root principal because it's starting up for the
|
|
|
first time and has no credentials.
|
|
|
This runtime needs a way to discover other runtimes so it can connect to the filesystem and sector
|
|
|
services.
|
|
@@ -592,11 +603,15 @@ The definition of \texttt{Activate} is as follows:
|
|
|
act_id: Uuid,
|
|
|
}
|
|
|
\end{verbatim}
|
|
|
+A static reference can be given to a runtime because a runtime is required to live for the
|
|
|
+entire lifetime of a process.
|
|
|
+This allows simple references to be passed around,
|
|
|
+avoiding the complexity of lifetimes and the overhead of reference counting.
|
|
|
The \texttt{Envelope} type is a wrapper around a message which contains information about who sent
|
|
|
it and a method that can be used to send a reply.
|
|
|
In general, a new actor state, represented by a new type, can be returned by a message-handling
|
|
|
method.
|
|
|
-The protocol itself is also represented by a trait:
|
|
|
+The protocol itself is represented by the trait:
|
|
|
\begin{verbatim}
|
|
|
pub trait PubSubProtocol {
|
|
|
type Server: ServerInit;
|
|
@@ -615,6 +630,9 @@ Wasm.
|
|
|
This work is blocked pending the standardization of the WebAssembly Component Model,
|
|
|
which promises to provide an interface definition language that will allow type-safe actors to be
|
|
|
defined in many different languages.
|
|
|
+Once Wasm support is added,
|
|
|
+it will make sense to use the filesystem to distribute compiled actor modules,
|
|
|
+as the strong integrity protection it provides makes it an ideal way to securely distribute software.
|
|
|
|
|
|
% Running containers using actors.
|
|
|
While the actor runtime can be a convenient way of implementing new systems,
|
|
@@ -913,11 +931,11 @@ increasing the performance of the system.
|
|
|
|
|
|
\section{Cryptography}
|
|
|
This section describes the cryptographic mechanisms used to protect the integrity and confidentiality of
|
|
|
-files.
|
|
|
+files, as well as the procedures for obtaining credentials.
|
|
|
These mechanisms are based on well-established cryptographic constructions.
|
|
|
|
|
|
% Integrity protection.
|
|
|
-File integrity is protected by a digital signature over its metadata.
|
|
|
+A file is integrity protected by a digital signature over its metadata.
|
|
|
The metadata contains an integrity field which contains the root node of the Merkle tree over
|
|
|
the file's contents.
|
|
|
This allows any sector in the file to be verified with a number of hash function invocations that
|
|
@@ -930,6 +948,7 @@ A file's metadata also contains a certificate chain,
|
|
|
and this chain is used to authenticate the signature over the metadata.
|
|
|
In Blocktree, the certificate chain is referred to as a \emph{writecap}
|
|
|
because it grants the capability to write to files.
|
|
|
+This term comes from the Tahoe Least-Authority Filesystem \cite{tahoe}.
|
|
|
The certificates in a valid writecap are ordered by their paths:
|
|
|
the initial certificate contains the longest path,
|
|
|
the path in each subsequent certificate must be a prefix of the one preceding it,
|
|
@@ -952,6 +971,7 @@ A file's key and IV are encrypted using the public keys of the principals to who
|
|
|
allowed.
|
|
|
The resulting ciphertext is referred to as a \emph{readcap}, as it grants the capability to read the
|
|
|
file.
|
|
|
+This term is also from Tahoe \cite{tahoe}.
|
|
|
These readcaps are stored in a table in the file's metadata.
|
|
|
Each entry in the table is identified by a byte string that is derived from the public key of the
|
|
|
principal who owns the entry's readcap.
|
|
@@ -1042,7 +1062,7 @@ A symmetric cipher is used to protect the root credentials, if they are stored,
|
|
|
but it relies on the security of the underlying filesystem to protect the process credentials.
|
|
|
For this reason it is not recommended for production use.
|
|
|
The other credential store is called \texttt{TpmCredStore},
|
|
|
-and it uses a Trusted Platform Module (TPM) 2.0 to store credentials.
|
|
|
+and it uses a Trusted Platform Module (TPM) 2.0 \cite{tpm} to store credentials.
|
|
|
The TPM is used to generate the process's credentials in such a way that they can never be
|
|
|
exported from the TPM (this is a feature of TPM 2.0).
|
|
|
A randomly generated cookie is needed to use these credentials.
|
|
@@ -1119,7 +1139,8 @@ Up till now the focus has been on authentication and authorization of processes,
|
|
|
but it bears discussing how user based access control can be accomplished with Blocktree.
|
|
|
Because credentials are locked to the device on which they're created,
|
|
|
a user will be associated with at least as many principals as they have devices.
|
|
|
-But, all of these principals can be configured to have the same authorization attributes (UID, GID),
|
|
|
+But, all of these principals can be configured to have the same authorization attributes
|
|
|
+(UID, GID, SELinux context, etc.),
|
|
|
giving them the same permissions.
|
|
|
It makes sense to provision all of the runtimes associated with a user in one place
|
|
|
and the natural place is the user's home directory.
|
|
@@ -1560,7 +1581,7 @@ which will make it more resistent to disruption and censorship.
|
|
|
Cloud computing has also driven changes in the way businesses acquire computing resources.
|
|
|
It's common for startups to rent all of their computing resources from one large cloud provider
|
|
|
and there are compelling economic and technical reasons to do this.
|
|
|
-But, as a firm grows they often experience growing pains as their cloud bills also grow.
|
|
|
+But, as a firm grows they often experience growing pains as their cloud bills grow with them.
|
|
|
If the firm has not developed their software with a multi-cloud or hybrid approach in mind,
|
|
|
they may face the prospect of major changes in order to bring their application on-prem or to a
|
|
|
rival cloud.
|
|
@@ -1580,7 +1601,7 @@ There are many reasons for this,
|
|
|
from the reliance on passwords for authentication, to the complexity of the software supply chain,
|
|
|
but it's clear that as IT professionals we need to do more to keep the systems under our
|
|
|
protection safe.
|
|
|
-Blocktree helps us to do this by solving many of the difficult problems involved with securing
|
|
|
+Blocktree helps us do this by solving many of the difficult problems involved with securing
|
|
|
communication on a hostile network.
|
|
|
It takes a true zero-trust approach,
|
|
|
ensuring that all communication between processes is authenticated using public key cryptography.
|
|
@@ -1602,8 +1623,10 @@ it is hoped that low overhead communication between distributed components can b
|
|
|
By using this system to provide a global distributed filesystem,
|
|
|
it is hoped that the interoperable sharing of data can be achieved.
|
|
|
And by using protocol contracts to constrain actor communication,
|
|
|
-it is hoped that the structure and safety can bring order to distributed computation.
|
|
|
+it is hoped that structure and safety can bring order to distributed computation.
|
|
|
While it's possible to see some of the applications that can be built from these abstractions,
|
|
|
their composability and the creativity of developers will lead to systems that cannot be foreseen.
|
|
|
|
|
|
+\printbibliography
|
|
|
+
|
|
|
\end{document}
|