|
@@ -0,0 +1,1621 @@
|
|
|
+\documentclass{article}
|
|
|
+\usepackage[scale=0.8]{geometry}
|
|
|
+\usepackage{hyperref}
|
|
|
+\usepackage{graphicx}
|
|
|
+
|
|
|
+\title{Blocktree: A Distributed Computing Environment}
|
|
|
+\author{Matthew Carr}
|
|
|
+
|
|
|
+\begin{document}
|
|
|
+\maketitle
|
|
|
+\begin{abstract}
|
|
|
+This document is a proposal for a distributed computing environment called Blocktree.
|
|
|
+The system is designed around the actor model,
|
|
|
+and it uses actors to encapsulate resources and provide services.
|
|
|
+The platform is responsible for orchestrating these actors on a set of native operating system processes.
|
|
|
+The persistent state for the system is stored in a global distributed filesystem implemented using
|
|
|
+this actor runtime.
|
|
|
+High availability is achieved using the Raft consensus protocol to synchronize the state of files between processes.
|
|
|
+All data stored in the filesystem is secured with strong integrity and optional confidentiality protections.
|
|
|
+Well-known cryptographic constructions are used to provide this protection;
|
|
|
+the system does not attempt to innovate in terms of cryptography.
|
|
|
+A network block device interface allows for fast low-level read and write access to file sectors,
|
|
|
+with full support for client-side encryption.
|
|
|
+The system's trust model allows for mutual TLS authentication between all processes,
|
|
|
+without the need to trust a third-party certificate authority.
|
|
|
+By integrating these ideas into a single platform,
|
|
|
+the system aims to advance the status quo in the security and reliability of software systems.
|
|
|
+\end{abstract}
|
|
|
+
|
|
|
+\section{Introduction}
|
|
|
+% The "Big" Picture.
|
|
|
+Blocktree is an attempt to extend the Unix philosophy that everything is a file
|
|
|
+to the entire distributed system that comprises modern IT infrastructure.
|
|
|
+The system is organized around a global distributed filesystem which defines security
|
|
|
+principals, resources, and their authorization attributes.
|
|
|
+This filesystem provides a language for access control that can be used to securely grant
|
|
|
+access to resources, even those owned by different organizations.
|
|
|
+The system provides an actor runtime for orchestrating services.
|
|
|
+Resources are represented as actors
|
|
|
+and actors are executed by runtimes in different operating system processes.
|
|
|
+Each process has its own credentials which authenticate it as a unique security principal,
|
|
|
+and which specify the filesystem path where it is located.
|
|
|
+A process has authorization attributes which determine the set of processes that it may communicate
|
|
|
+with.
|
|
|
+TLS authentication is used to secure connections between processes.
|
|
|
+Messages addressed to actors in a different process are forwarded over these connections,
|
|
|
+while messages addressed to actors in the same process are delivered without copying.
|
|
|
+
|
|
|
+% Self-certifying paths and the chain of trust.
|
|
|
+The single global Blocktree filesystem is partitioned into disjoint domains of authority.
|
|
|
+Each domain is controlled by a root principal.
|
|
|
+As is the case for all principals,
|
|
|
+a root principal is authenticated by a public-private signing key pair
|
|
|
+and is identified by the base64url encoded hash of its public signing key.
|
|
|
+The domain of authority for a given absolute path is determined by its first component,
|
|
|
+which is the identifier of the root principal that controls the domain.
|
|
|
+Because the directory \texttt{/} has no meaning,
|
|
|
+a directory consisting of only a single component equal to a root principal's identifier is
|
|
|
+referred to as the root directory of the domain.
|
|
|
+The root principal delegates its authority to write files to subordinate principals by issuing
|
|
|
+them certificates which specify the path that the authority of the subordinate is limited to.
|
|
|
+File data is signed for authenticity and a certificate chain is contained in its metadata.
|
|
|
+This certificate chain must lead back to the root principal
|
|
|
+and consist of certificates with correctly scoped authority in order for the file to be valid.
|
|
|
+Given the path of a file and the file's contents,
|
|
|
+this allows the file to be validated by anyone without the need to trust a third party.
|
|
|
+Blocktree paths are called self-certifying for this reason.
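The path-scoping rule above can be sketched in Rust as follows.
This is a minimal illustration, not the project's implementation: the \texttt{Cert} type and the function names are hypothetical,
principals are represented by plain strings, and signature verification is omitted entirely.
It only checks that the chain starts at the root principal named by the path's first component
and that every certificate's scope covers the file's path.

```rust
/// Hypothetical certificate: `issuer` grants `subject` authority over `scope`.
pub struct Cert {
    pub issuer: String,
    pub subject: String,
    pub scope: String,
}

/// Splits a path into its non-empty components.
fn components(path: &str) -> impl Iterator<Item = &str> {
    path.split('/').filter(|c| !c.is_empty())
}

/// The first component of an absolute path names the root principal of its domain.
pub fn root_principal(path: &str) -> Option<&str> {
    path.strip_prefix('/')?.split('/').find(|c| !c.is_empty())
}

/// True when `path` is `scope` itself or lies beneath it (compared component-wise,
/// so "/r/app" does not cover "/r/apps").
pub fn scope_covers(scope: &str, path: &str) -> bool {
    let scope: Vec<&str> = components(scope).collect();
    let path: Vec<&str> = components(path).collect();
    path.len() >= scope.len() && scope.iter().zip(&path).all(|(s, p)| s == p)
}

/// Checks only the scoping rules of a chain; real validation must also verify signatures.
pub fn chain_is_scoped(chain: &[Cert], path: &str) -> bool {
    let Some(root) = root_principal(path) else { return false };
    let Some(first) = chain.first() else { return false };
    if first.issuer != root {
        return false;
    }
    // Each certificate must be issued by the previous subject and may only narrow the scope.
    let linked = chain
        .windows(2)
        .all(|w| w[1].issuer == w[0].subject && scope_covers(&w[0].scope, &w[1].scope));
    linked && chain.iter().all(|c| scope_covers(&c.scope, path))
}
```

Because the root principal's identifier is read from the path itself, the check needs no input other than the path and the file's metadata.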
|
|
|
+
|
|
|
+% Persistent state provided by the filesystem.
|
|
|
+One of the major challenges in distributed systems is managing persistent state.
|
|
|
+Blocktree solves this issue with its distributed filesystem.
|
|
|
+Files are broken into segments called sectors.
|
|
|
+The sector size of a file can be configured when it's created,
|
|
|
+but can't be changed later.
|
|
|
+Reads and writes of individual sectors are guaranteed to be atomic.
|
|
|
+The sectors which comprise a file and its metadata are replicated by a set of processes running
|
|
|
+the sector service.
|
|
|
+Each of these service providers is responsible for storing the sectors of files that are contained in the
|
|
|
+directory containing the runtime in which it's running.
|
|
|
+The actors providing the sector service in a given directory coordinate with one another using
|
|
|
+the Raft protocol to synchronize the state of the sectors they store.
|
|
|
+By partitioning the data in the filesystem based on directory,
|
|
|
+the system can scale beyond the capabilities of a single consensus cluster.
|
|
|
+Sectors can be integrity protected and verified without reading the entire file,
|
|
|
+because each file has a Merkle tree of sector hashes associated with it.
|
|
|
+Encryption can be optionally applied to sectors,
|
|
|
+and when it is, the key is managed by the system.
|
|
|
+The cryptographic mechanisms used to implement these protections are described in Section 4.
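To illustrate why a Merkle tree lets a single sector be verified without reading the whole file,
the following sketch builds a tree over sector hashes and checks one sector against the root using only its sibling hashes.
All names are hypothetical, and the standard library's \texttt{DefaultHasher} is used purely as a stand-in:
it is not a cryptographic hash and must not be used for real integrity protection.
Pairing an unpaired node with itself is also an assumption of this sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hash functions (NOT cryptographic; assumption for illustration only).
fn hash_leaf(sector: &[u8]) -> u64 {
    let mut state = DefaultHasher::new();
    sector.hash(&mut state);
    state.finish()
}

fn hash_pair(left: u64, right: u64) -> u64 {
    let mut state = DefaultHasher::new();
    (left, right).hash(&mut state);
    state.finish()
}

/// Hashes each fixed-size sector of `data`. Panics if `sector_size` is zero.
pub fn sector_hashes(data: &[u8], sector_size: usize) -> Vec<u64> {
    data.chunks(sector_size).map(hash_leaf).collect()
}

/// Combines adjacent nodes; an unpaired last node is paired with itself.
fn next_level(level: &[u64]) -> Vec<u64> {
    level
        .chunks(2)
        .map(|pair| match pair {
            [left, right] => hash_pair(*left, *right),
            [only] => hash_pair(*only, *only),
            _ => unreachable!("chunks(2) yields one or two items"),
        })
        .collect()
}

pub fn merkle_root(leaves: &[u64]) -> Option<u64> {
    let mut level = leaves.to_vec();
    while level.len() > 1 {
        level = next_level(&level);
    }
    level.pop()
}

/// Sibling hashes from leaf to root. `index` must be less than `leaves.len()`.
pub fn merkle_proof(leaves: &[u64], mut index: usize) -> Vec<u64> {
    let mut level = leaves.to_vec();
    let mut proof = Vec::new();
    while level.len() > 1 {
        let sibling = if index % 2 == 0 { index + 1 } else { index - 1 };
        proof.push(*level.get(sibling).unwrap_or(&level[index]));
        level = next_level(&level);
        index /= 2;
    }
    proof
}

/// Recomputes the root from one sector and its proof, without touching other sectors.
pub fn verify_sector(sector: &[u8], mut index: usize, proof: &[u64], root: u64) -> bool {
    let mut acc = hash_leaf(sector);
    for sibling in proof {
        acc = if index % 2 == 0 { hash_pair(acc, *sibling) } else { hash_pair(*sibling, acc) };
        index /= 2;
    }
    acc == root
}
```

The proof for one sector contains one hash per tree level, so its size grows logarithmically with the number of sectors in the file.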
|
|
|
+
|
|
|
+% Protocol contracts.
|
|
|
+One of the design goals of Blocktree is to facilitate the creation of composable distributed
|
|
|
+systems.
|
|
|
+A major challenge to building such systems is the difficulty in pinning down bugs when they
|
|
|
+inevitably occur.
|
|
|
+Research into session types (also known as behavioral types) promises to bring the safety benefits
|
|
|
+of type checking to actor communication.
|
|
|
+Blocktree integrates a session typing system that allows protocol contracts to be defined that
|
|
|
+specify the communication patterns of a set of actors.
|
|
|
+This model allows the state space of the set of actors participating in a computation to be defined,
|
|
|
+and the state transitions which occur to be specified based on the types of received messages.
|
|
|
+These contracts are used to verify protocol adherence statically and dynamically.
|
|
|
+This system is implemented using compile time code generation,
|
|
|
+making it a zero-cost abstraction.
|
|
|
+This frees the developer from dealing with the numerous failure modes that can occur in a
|
|
|
+communication protocol.
|
|
|
+
|
|
|
+% Implementation language and project links.
|
|
|
+Blocktree is implemented in the Rust programming language.
|
|
|
+It is currently only tested on Linux.
|
|
|
+Running it on other Unix-like operating systems should be straightforward,
|
|
|
+though FUSE support is required to mount the filesystem.
|
|
|
+Its source code is licensed under the GNU Affero General Public License version 3.
|
|
|
+It can be downloaded at the project homepage at \url{https://blocktree.systems}.
|
|
|
+Anyone interested in contributing to development is welcome to submit a pull request
|
|
|
+to \url{https://gogs.delease.com/Delease/Blocktree}.
|
|
|
+If you have larger changes or architectural suggestions,
|
|
|
+please submit an issue for discussion prior to spending time implementing your idea.
|
|
|
+
|
|
|
+% Outline of the rest of the paper.
|
|
|
+The remainder of this paper is structured as follows:
|
|
|
+\begin{itemize}
|
|
|
+ \item Section 2 describes the actor runtime, service and task orchestration, and service
|
|
|
+ discovery.
|
|
|
+ \item Section 3 discusses the filesystem, its concurrency semantics and implementation.
|
|
|
+ \item Section 4 details the cryptographic mechanisms used to secure communication between
|
|
|
+ actor runtimes and to protect sector data.
|
|
|
+ \item Section 5 is a set of examples describing ways that Blocktree can be used to build systems.
|
|
|
+ \item Section 6 provides some concluding remarks.
|
|
|
+\end{itemize}
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+\section{Actor Runtime}
|
|
|
+% Motivation for using the actor model.
|
|
|
+Building scalable fault tolerant systems requires us to distribute computation over
|
|
|
+multiple computers.
|
|
|
+Rather than switching to a different programming model when an application scales beyond the
|
|
|
+capacity of a single computer,
|
|
|
+it is beneficial in terms of programmer time and program simplicity to begin with a model that
|
|
|
+enables multi-computer scalability.
|
|
|
+Fundamentally, all communication over an IP network involves the exchange of messages,
|
|
|
+namely IP packets.
|
|
|
+So if we wish to build scalable fault-tolerant systems,
|
|
|
+it makes sense to choose a programming model built on message passing,
|
|
|
+as this will ensure low impedance with the underlying networking technology.
|
|
|
+
|
|
|
+% Overview of message passing interface.
|
|
|
+That is why Blocktree is built on the actor model
|
|
|
+and why its actor runtime is at the core of its architecture.
|
|
|
+The runtime can be used to spawn actors, register services, dispatch messages immediately,
|
|
|
+and schedule messages to be delivered in the future.
|
|
|
+Messages can be dispatched in two different ways: with \texttt{send} and \texttt{call}.
|
|
|
+A message is dispatched with the \texttt{send} method when no reply is required,
|
|
|
+and with \texttt{call} when exactly one is.
|
|
|
+The \texttt{Future} returned by \texttt{call} can be awaited to obtain the reply.
|
|
|
+If a timeout occurs while waiting for the reply,
|
|
|
+the \texttt{Future} completes with an error.
|
|
|
+The name \texttt{call} was chosen to bring to mind a remote procedure call,
|
|
|
+which is the primary use case this method was intended for.
|
|
|
+Awaiting replies to messages serves as a simple way to synchronize a distributed computation.
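The difference between \texttt{send} and \texttt{call} can be sketched with standard library channels.
This is only an analogy: the text describes \texttt{call} as returning a \texttt{Future},
whereas this sketch blocks the calling thread, and the \texttt{Envelope}, \texttt{ActorHandle}, and \texttt{spawn\_echo\_actor} names are hypothetical.
It shows that \texttt{send} carries no reply channel, while \texttt{call} carries exactly one and fails with an error when the timeout elapses.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// A message either expects no reply or carries a one-shot reply channel.
pub enum Envelope {
    Send(String),
    Call(String, Sender<String>),
}

pub struct ActorHandle {
    mailbox: Sender<Envelope>,
}

impl ActorHandle {
    /// Fire and forget: no reply is expected.
    pub fn send(&self, body: &str) {
        let _ = self.mailbox.send(Envelope::Send(body.to_owned()));
    }

    /// Request and reply: waits for exactly one reply or fails after `timeout`.
    pub fn call(&self, body: &str, timeout: Duration) -> Result<String, RecvTimeoutError> {
        let (reply_tx, reply_rx) = mpsc::channel();
        let _ = self.mailbox.send(Envelope::Call(body.to_owned(), reply_tx));
        reply_rx.recv_timeout(timeout)
    }
}

/// Spawns a toy actor that echoes calls; the body "slow" delays the reply by 500 ms.
pub fn spawn_echo_actor() -> ActorHandle {
    let (mailbox, inbox) = mpsc::channel::<Envelope>();
    thread::spawn(move || {
        for envelope in inbox {
            match envelope {
                Envelope::Send(_) => {}
                Envelope::Call(body, reply) => {
                    if body == "slow" {
                        thread::sleep(Duration::from_millis(500));
                    }
                    // The caller may have timed out already, so a failed reply is ignored.
                    let _ = reply.send(format!("echo: {body}"));
                }
            }
        }
    });
    ActorHandle { mailbox }
}
```

A caller that needs to synchronize with a remote computation simply waits on the result of \texttt{call}; a caller that does not care uses \texttt{send}.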
|
|
|
+
|
|
|
+% Scheduling messages for future delivery.
|
|
|
+Executing actions at some point in the future or at regular intervals is a common task in computer
|
|
|
+systems.
|
|
|
+Blocktree facilitates this by allowing messages to be scheduled for future delivery.
|
|
|
+The schedule may specify a one time delivery at a specific instant in time,
|
|
|
+or a repeating delivery with a given period.
|
|
|
+These scheduling modes can be combined so that you can specify an anchoring instant
|
|
|
+and a period whose multiples will be added to this instant to calculate each delivery time.
|
|
|
+For example, a message could be scheduled for delivery every morning at 3 AM.
|
|
|
+Messages scheduled in a runtime are persisted in the runtime's file.
|
|
|
+This ensures scheduled messages will be delivered even if the runtime is restarted.
|
|
|
+If a message has been delivered
|
|
|
+and the schedule is such that it will never be delivered again,
|
|
|
+it is removed from the runtime's file.
|
|
|
+If a message is scheduled for delivery at a single instant in time,
|
|
|
+and that delivery is missed,
|
|
|
+the message will be delivered as soon as possible.
|
|
|
+But, if a message is periodic,
|
|
|
+any messages which were missed due to a runtime not being active will never be sent.
|
|
|
+This is because the runtime only persists the message's schedule,
|
|
|
+not every delivery.
|
|
|
+This mechanism is intended for periodic tasks or delaying work to a later time.
|
|
|
+It is not for building hard real-time systems.
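The delivery rules described above can be expressed as a small function.
This is a sketch under stated assumptions: the \texttt{Schedule} type and its method are hypothetical,
times are whole seconds since an arbitrary epoch, and a period of zero is treated as never firing.
A missed one-time delivery is made as soon as possible, while missed periodic deliveries are skipped.

```rust
/// Hypothetical schedule type; all times are seconds since an arbitrary epoch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Schedule {
    /// Deliver once at the instant `at`.
    Once { at: u64 },
    /// Deliver at `anchor + k * period` for k = 0, 1, 2, ...
    Every { anchor: u64, period: u64 },
}

impl Schedule {
    /// Returns the next delivery time given the current time, or `None` if the
    /// message will never be delivered again and can be removed from the runtime's file.
    pub fn next_delivery(&self, now: u64, already_delivered: bool) -> Option<u64> {
        match *self {
            Schedule::Once { at } => {
                if already_delivered {
                    None
                } else {
                    // A missed one-time delivery happens as soon as possible.
                    Some(at.max(now))
                }
            }
            Schedule::Every { anchor, period } => {
                if period == 0 {
                    return None;
                }
                if now <= anchor {
                    return Some(anchor);
                }
                // Round up to the next multiple of the period; missed multiples are skipped.
                let elapsed = now - anchor;
                let k = elapsed / period + u64::from(elapsed % period != 0);
                Some(anchor + k * period)
            }
        }
    }
}
```

For the "every morning at 3 AM" example, the anchor is any 3 AM instant and the period is 86400 seconds.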
|
|
|
+
|
|
|
+% Description of virtual actor system.
|
|
|
+One of the challenges when building actor systems is supervising and managing actors' lifecycles.
|
|
|
+This is handled in Erlang through the use of supervision trees,
|
|
|
+but Blocktree takes a different approach inspired by Microsoft's Orleans framework.
|
|
|
+Orleans introduced the concept of virtual actors,
|
|
|
+which are purely logical entities that exist perpetually.
|
|
|
+In Orleans, one does not need to spawn actors or worry about respawning them should they crash;
|
|
|
+the framework takes care of spawning an actor when a message is dispatched to it.
|
|
|
+This model also gives the framework the flexibility to deactivate actors when they are idle
|
|
|
+and to load balance actors across different computers.
|
|
|
+In Blocktree a similar system is used when messages are dispatched to services.
|
|
|
+The Blocktree runtime takes care of routing these messages to the appropriate actors,
|
|
|
+spawning them if needed.
|
|
|
+A service must be registered in a runtime before messages can be routed to it.
|
|
|
+The actors which are spawned based on this registration are called \emph{service providers} of the
|
|
|
+service.
|
|
|
+Services which directly use operating system resources,
|
|
|
+such as those that listen on network sockets,
|
|
|
+are often started immediately after registration so that they are available to external clients.
|
|
|
+
|
|
|
+% Message addressing modes.
|
|
|
+Messages can be addressed to services or specific actors.
|
|
|
+When addressing a specific actor,
|
|
|
+the message contains an \emph{actor name},
|
|
|
+which is a pair consisting of the path of the runtime hosting the actor and the \texttt{Uuid}
|
|
|
+identifying the specific actor in that runtime.
|
|
|
+When addressing a service,
|
|
|
+the message is dispatched using a \emph{service name},
|
|
|
+which contains the following fields:
|
|
|
+\begin{enumerate}
|
|
|
+ \item \texttt{service}: The path identifying the receiving service.
|
|
|
+ \item \texttt{scope}: A filesystem path used to specify the intended recipient.
|
|
|
+ \item \texttt{rootward}: A boolean describing whether message delivery is attempted towards or
|
|
|
+ away from the root of the filesystem tree. A value of
|
|
|
+ \texttt{false} indicates that the message is intended for a runtime directly contained in the
|
|
|
+ scope. A value of \texttt{true} indicates that the message is intended for a runtime contained
|
|
|
+ in a parent directory of the scope and should be delivered to a runtime which has the requested
|
|
|
+ service registered and is closest to the scope.
|
|
|
+ \item \texttt{id}: An identifier for a specific service provider.
|
|
|
+\end{enumerate}
|
|
|
+The ID can be a \texttt{Uuid} or a \texttt{String}.
|
|
|
+It is treated as an opaque identifier by the runtime,
|
|
|
+but a service is free to associate additional meaning to it.
|
|
|
+Every message has a header containing the name of the sender and receiver.
|
|
|
+The receiver name can be an actor or service name,
|
|
|
+but the sender name is always an actor name.
|
|
|
+For example, to open a file in the filesystem,
|
|
|
+a message is dispatched with \texttt{call} using the service name of the filesystem service.
|
|
|
+The reply contains the name of the file actor spawned by the filesystem service which owns the opened
|
|
|
+file.
|
|
|
+Messages are then dispatched to the file actor using its actor name to read and write to the file.
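The two addressing modes can be modeled with a few Rust types.
The type and field names below are hypothetical and chosen to mirror the fields listed above;
a \texttt{u128} stands in for a \texttt{Uuid} so that the sketch depends only on the standard library.
The \texttt{reply\_from} helper illustrates why the sender is always named by an actor name: a reply can then always be addressed to a specific actor.

```rust
/// Identifier of a specific service provider (opaque to the runtime).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProviderId {
    Uuid(u128), // stand-in for a Uuid
    Text(String),
}

/// Names one specific actor: the hosting runtime's path plus the actor's identifier.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ActorName {
    pub runtime_path: String,
    pub actor_id: u128, // stand-in for a Uuid
}

/// Names a service and constrains which runtimes may receive the message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ServiceName {
    pub service: String,
    pub scope: String,
    pub rootward: bool,
    pub id: ProviderId,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Recipient {
    Actor(ActorName),
    Service(ServiceName),
}

/// Message header: the sender is always an actor, the receiver may be either kind.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Header {
    pub from: ActorName,
    pub to: Recipient,
}

impl Header {
    /// Builds the header of a reply sent by `replier` back to the original sender.
    pub fn reply_from(&self, replier: ActorName) -> Header {
        Header { from: replier, to: Recipient::Actor(self.from.clone()) }
    }
}
```

In the file-opening example, the request header would carry a \texttt{Recipient::Service} value, and the reply would come from the newly spawned file actor, whose \texttt{ActorName} is then used for subsequent reads and writes.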
|
|
|
+
|
|
|
+% The runtime is implemented using tokio.
|
|
|
+The actor runtime is currently implemented using the Rust asynchronous runtime tokio.
|
|
|
+Actors are spawned as tasks in the tokio runtime,
|
|
|
+and multi-producer single consumer channels are used for message delivery.
|
|
|
+Because actors are just tasks,
|
|
|
+they can do anything a task can do,
|
|
|
+including awaiting other \texttt{Future}s.
|
|
|
+Because of this, there is no need for the actor runtime to support short-lived worker tasks,
|
|
|
+as any such use-case can be accomplished by awaiting a set of \texttt{Future}s.
|
|
|
+This allows the runtime to focus on providing support for services.
|
|
|
+Using tokio also means that we have access to a high performance multi-threaded runtime with
|
|
|
+evented IO.
|
|
|
+This asynchronous programming model ensures that resources are efficiently utilized,
|
|
|
+and is ideal for a system focused on orchestrating services which may be used by many clients.
|
|
|
+
|
|
|
+% Delivering messages over the network.
|
|
|
+Messages can be forwarded between actor runtimes using a secure transport layer called
|
|
|
+\texttt{bttp}.
|
|
|
+The transport is implemented using the QUIC protocol, which integrates TLS for security.
|
|
|
+A \texttt{bttp} client may connect anonymously or using credentials.
|
|
|
+If an anonymous connection is attempted,
|
|
|
+the client has no authorization attributes associated with it.
|
|
|
+Only runtimes which grant others the execute permission allow connections from such clients.
|
|
|
+If these permissions are not granted in the runtime's file,
|
|
|
+anonymous connections are rejected.
|
|
|
+When a client connects with credentials,
|
|
|
+mutual TLS authentication is performed as part of the connection handshake,
|
|
|
+which cryptographically verifies the credentials of each runtime.
|
|
|
+These credentials contain the filesystem paths where each runtime is located.
|
|
|
+This information is used to securely route messages between runtimes.
|
|
|
+The \texttt{bttp} server is always authenticated during the handshake,
|
|
|
+even when the client is connecting anonymously.
|
|
|
+Because QUIC supports the concurrent use of many different streams,
|
|
|
+it serves as an ideal transport for a message oriented system.
|
|
|
+\texttt{bttp} uses different streams for independent messages,
|
|
|
+ensuring that head of line blocking does not occur.
|
|
|
+Note that although data from separate streams can arrive in any order,
|
|
|
+the protocol does provide reliable in-order delivery of data in any given stream.
|
|
|
+The same stream is used for sending the reply to a message dispatched with \texttt{call}.
|
|
|
+Once a connection is established,
|
|
|
+messages may flow in both directions (provided both runtimes have execute permissions for the other),
|
|
|
+regardless of which runtime is acting as the client or the server.
|
|
|
+
|
|
|
+% Delivering messages locally.
|
|
|
+When a message is sent between actors in the same runtime it is delivered into the queue of the recipient without any copying,
|
|
|
+while ensuring immutability (i.e. move semantics).
|
|
|
+This is possible thanks to the Rust ownership system,
|
|
|
+because the message sender gives ownership to the runtime when it dispatches the message,
|
|
|
+and the runtime gives ownership to the recipient when it delivers the message.
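The following sketch demonstrates the ownership transfer using a standard library channel as a stand-in for an actor's queue
(the function name is hypothetical, and the real runtime uses tokio channels).
Moving a \texttt{Vec} into the channel transfers only its pointer, length, and capacity,
so the payload that arrives occupies the same heap allocation that the sender created.

```rust
use std::sync::mpsc;

/// Delivers `message` through a queue and returns the address of its buffer
/// before delivery together with the delivered message.
pub fn deliver_locally(message: Vec<u8>) -> (usize, Vec<u8>) {
    let address_before = message.as_ptr() as usize;
    let (queue_tx, queue_rx) = mpsc::channel::<Vec<u8>>();

    // Ownership moves into the queue here; the sender can no longer touch `message`,
    // which is what guarantees it cannot be mutated after dispatch.
    queue_tx.send(message).expect("receiver is alive");

    // Ownership moves out of the queue to the recipient.
    let delivered = queue_rx.recv().expect("sender is alive");
    (address_before, delivered)
}
```

Any attempt to use \texttt{message} after the call to \texttt{send} is rejected at compile time, so no runtime check or defensive copy is needed.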
|
|
|
+
|
|
|
+% Security model based on filesystem permissions.
|
|
|
+A runtime is represented in the filesystem as a file.
|
|
|
+This file contains the authorization attributes which are associated with the runtime's security
|
|
|
+principal.
|
|
|
+The credentials used by the runtime specify the file, so other runtimes are able to locate it.
|
|
|
+The metadata of the file contains authorization attributes just like any other file
|
|
|
+(e.g. UID, GID, and mode bits).
|
|
|
+In order for a principal to be able to send a message to an actor in the runtime,
|
|
|
+it must have execute permissions for this file.
|
|
|
+Thus communication between runtimes can be controlled using simple filesystem permissions.
|
|
|
+Permissions checking is done during the \texttt{bttp} handshake.
|
|
|
+Note that it is possible for messages to be sent in one direction in a \texttt{bttp} connection
|
|
|
+but not in the other.
|
|
|
+In this situation replies are permitted but unsolicited messages are not.
|
|
|
+An important trade-off which was made when designing this model was that messages which are
|
|
|
+sent between actors in the same runtime are not subject to any authorization checks.
|
|
|
+This was done for two reasons: performance and security.
|
|
|
+By eliminating authorization checks messages can be more efficiently delivered between actors in the
|
|
|
+same process,
|
|
|
+which helps to reduce the performance penalty of the actor runtime over directly using threads.
|
|
|
+Security is enhanced by this decision because it forces the user to separate actors with different
|
|
|
+security requirements into different operating system processes,
|
|
|
+which ensures all of the process isolation machinery in the operating system will be used to
|
|
|
+isolate them.
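The permission check performed for a connecting principal can be sketched as follows.
The types and the function name are hypothetical, and the sketch assumes the conventional Unix rule for choosing which class of mode bits applies
(owner first, then group, then others); the text above only states that the runtime's file carries a UID, a GID, and mode bits.
Anonymous clients are treated as belonging to the others class.

```rust
/// Authorization attributes stored in the metadata of a runtime's file.
pub struct FileMeta {
    pub uid: u32,
    pub gid: u32,
    pub mode: u32,
}

/// The party on the other end of a connection.
pub enum Peer {
    Anonymous,
    Principal { uid: u32, gids: Vec<u32> },
}

const OWNER_EXECUTE: u32 = 0o100;
const GROUP_EXECUTE: u32 = 0o010;
const OTHERS_EXECUTE: u32 = 0o001;

/// Returns true if `peer` may send messages to actors in the runtime
/// represented by `runtime_file`.
pub fn may_send(peer: &Peer, runtime_file: &FileMeta) -> bool {
    // Assumption: exactly one permission class applies, chosen as on Unix.
    let required_bit = match peer {
        Peer::Anonymous => OTHERS_EXECUTE,
        Peer::Principal { uid, .. } if *uid == runtime_file.uid => OWNER_EXECUTE,
        Peer::Principal { gids, .. } if gids.contains(&runtime_file.gid) => GROUP_EXECUTE,
        Peer::Principal { .. } => OTHERS_EXECUTE,
    };
    (runtime_file.mode & required_bit) != 0
}
```

With a mode of \texttt{0o750}, for example, the owner and members of the file's group may send messages, while all other principals and anonymous clients are rejected.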
|
|
|
+
|
|
|
+% Representing resources as actors.
|
|
|
+As in other actor systems, it is convenient to represent resources in Blocktree using actors.
|
|
|
+This allows the same security model used to control communication between actors to be used for
|
|
|
+controlling access to resources,
|
|
|
+and for resources to be shared by many actors.
|
|
|
+For instance, a Point-to-Point Protocol connection could be owned by an actor.
|
|
|
+This actor could forward traffic delivered to it in messages over this connection.
|
|
|
+The set of actors which are able to access the connection is controlled by setting the filesystem
|
|
|
+permissions on the file for the runtime executing the actor owning the connection.
|
|
|
+
|
|
|
+% Actor ownership.
|
|
|
+The concept of ownership in programming languages is very useful for ensuring that resources are
|
|
|
+properly freed when the type using them dies.
|
|
|
+Because actors are used for encapsulating resources in Blocktree,
|
|
|
+a similar system of ownership is employed for this reason.
|
|
|
+An actor is initially owned by the actor that spawned it.
|
|
|
+An actor can only have a single owner,
|
|
|
+but the owner can grant ownership to another actor.
|
|
|
+An actor is not allowed to own itself,
|
|
|
+though it may be owned by the runtime.
|
|
|
+When the owner of an actor returns,
|
|
|
+the actor is sent a message instructing it to return.
|
|
|
+If it does not return after a timeout,
|
|
|
+it is interrupted.
|
|
|
+This is the opposite of how supervision trees work in Erlang.
|
|
|
+Instead of the parent receiving a message when the child returns,
|
|
|
+the child receives a message when the parent returns.
|
|
|
+Service providers spawned by the runtime are owned by it.
|
|
|
+They continue running until the runtime chooses to reclaim their resources,
|
|
|
+which can happen because they are idle or the runtime is overloaded.
|
|
|
+Note that ownership is not limited to a single runtime,
|
|
|
+so distributed resources can be managed by owning actors in many different runtimes.
|
|
|
+
|
|
|
+% Message routing to services.
|
|
|
+A service is identified by a Blocktree path.
|
|
|
+Only one implementation of a given service can be registered in a particular runtime,
|
|
|
+though this implementation may be used to spawn many actors as providers for the service,
|
|
|
+each associated with a different ID.
|
|
|
+The runtime spawns a new actor when it finds no service provider associated with the ID in the
|
|
|
+message it is delivering.
|
|
|
+Some services may only have one service provider in a given runtime,
|
|
|
+as is the case for the sector and filesystem services.
|
|
|
+The \texttt{scope} and \texttt{rootward} fields in a service name specify the set of runtimes to
|
|
|
+which a message may be delivered.
|
|
|
+They allow the sender to express their intended recipient,
|
|
|
+while still affording enough flexibility to the runtime to route messages as needed.
|
|
|
+If \texttt{rootward} is \texttt{false},
|
|
|
+the message is delivered to a service provider in a runtime that is directly contained in
|
|
|
+\texttt{scope}.
|
|
|
+If \texttt{rootward} is \texttt{true},
|
|
|
+the parent directories of scope are searched,
|
|
|
+working towards the root of the filesystem tree,
|
|
|
+and the message is delivered to the first provider of \texttt{service} which is found.
|
|
|
+When there are multiple service providers to which a given message could be delivered,
|
|
|
+the one to which it is actually delivered is unspecified,
|
|
|
+which allows the runtime to balance load.
|
|
|
+Delivery will occur to at most one recipient,
|
|
|
+even in the case that there are multiple potential recipients.
|
|
|
+In order to contact other runtimes and deliver messages to them,
|
|
|
+their network endpoints (IP address and UDP port) need to be known.
|
|
|
+This is achieved by maintaining a file with a runtime's endpoint address in the same directory as
|
|
|
+the runtime.
|
|
|
+The runtime is granted write permissions on the file,
|
|
|
+and it is updated by \texttt{bttp} when it begins listening on a new endpoint.
|
|
|
+The port a \texttt{bttp} server uses to listen for unicast connections is uniformly
|
|
|
+randomly selected from the set of ports in the dynamic range (49152--65535) which are unused on the
|
|
|
+server's host.
|
|
|
+Use of a random port allows many different \texttt{bttp} servers to share a single IP address
|
|
|
+and makes Blocktree more resistant to censorship.
|
|
|
+The services which are allowed to be registered in a given runtime are specified in the runtime's
|
|
|
+file.
|
|
|
+The runtime reads this list and uses it to deny service registrations for unauthorized services.
|
|
|
+The list is also read by other runtimes when they're searching for service providers.
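The search implied by \texttt{scope} and \texttt{rootward} can be sketched as a function over a list of known runtimes.
The \texttt{RuntimeInfo} type and the function names are hypothetical.
One detail is an assumption of this sketch: a rootward search begins with the runtimes directly contained in \texttt{scope} before moving to its parent directories.
The function returns every candidate found at the nearest directory, leaving the choice among them to the caller,
since the text leaves that choice unspecified.

```rust
/// What one runtime knows about another: its path and its registered services.
pub struct RuntimeInfo {
    pub path: String,
    pub services: Vec<String>,
}

/// Returns the directory containing `path`, or `None` for a domain's root directory.
fn parent(path: &str) -> Option<&str> {
    let trimmed = path.trim_end_matches('/');
    let index = trimmed.rfind('/')?;
    if index == 0 { None } else { Some(&trimmed[..index]) }
}

/// Finds the runtimes to which a message addressed to `service` may be delivered.
pub fn candidates<'a>(
    runtimes: &'a [RuntimeInfo],
    service: &str,
    scope: &str,
    rootward: bool,
) -> Vec<&'a RuntimeInfo> {
    let mut directory = Some(scope.trim_end_matches('/'));
    while let Some(dir) = directory {
        let found: Vec<&RuntimeInfo> = runtimes
            .iter()
            .filter(|rt| parent(&rt.path) == Some(dir))
            .filter(|rt| rt.services.iter().any(|s| s == service))
            .collect();
        // Without `rootward`, only runtimes directly contained in the scope qualify.
        if !found.is_empty() || !rootward {
            return found;
        }
        // Otherwise step one directory closer to the root and try again.
        directory = parent(dir);
    }
    Vec::new()
}
```

Because at most one recipient receives a given message, a caller would pick a single element of the returned list, for instance the least loaded runtime.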
|
|
|
+
|
|
|
+% The sector and filesystem service.
|
|
|
+The filesystem is itself implemented as a service.
|
|
|
+A filesystem service provider can be passed messages to delete files, list directory contents,
|
|
|
+open files, or perform several other standard filesystem operations.
|
|
|
+When a file is opened,
|
|
|
+a new actor is spawned which owns the newly created file handle and its name is returned to the
|
|
|
+caller in a reply.
|
|
|
+Subsequent read and write messages are sent to this actor.
|
|
|
+The filesystem service does not persist any data itself;
|
|
|
+its job is to function as an integration layer,
|
|
|
+conglomerating sector data from many different sources into a single unified interface.
|
|
|
+The sector service is what is ultimately responsible for storing data,
|
|
|
+and thus maintaining the persistent state of the system.
|
|
|
+It stores sector data in the local filesystem of each computer on which it is registered.
|
|
|
+The details of how this is accomplished are deferred to the next section.
|
|
|
+
|
|
|
+% Runtime queries.
|
|
|
+While it is possible to resolve runtime paths to network endpoints when the filesystem is available,
|
|
|
+another mechanism is needed to allow the filesystem service providers to be discovered.
|
|
|
+This is accomplished by allowing runtimes to query one another to learn of other runtimes.
|
|
|
+Because queries are intended to facilitate message delivery,
|
|
|
+the query fields and their meanings mirror those used for addressing messages:
|
|
|
+\begin{enumerate}
|
|
|
+  \item \texttt{service}: The path of the service whose providers are sought.
|
|
|
+ Only runtimes with this service registered will be returned.
|
|
|
+  \item \texttt{scope}: The filesystem path relative to which the query will be processed.
|
|
|
+  \item \texttt{rootward}: Indicates if the query should search for runtimes from \texttt{scope}
|
|
|
+ toward the root.
|
|
|
+\end{enumerate}
|
|
|
+The semantics of \texttt{scope} and \texttt{rootward} in a query are identical to their use in an
|
|
|
+service name.
|
|
|
+As long as at least one other runtime is known,
|
|
|
+a query can be issued to learn of more runtimes.
|
|
|
+A runtime which receives a query may not be able to answer it directly.
|
|
|
+If it cannot,
|
|
|
+it returns the endpoint of the next runtime to which the query should be sent.
|
|
|
+
|
|
|
+% Bootstrap discovery methods.
|
|
|
+In order to bootstrap the discovery process,
|
|
|
+another mechanism is needed to find the first peer to query.
|
|
|
+There were several possibilities explored for doing this.
|
|
|
+One way is to use a blockchain to store the endpoints of the runtimes hosting the filesystem service
|
|
|
+in the root directory.
|
|
|
+As long as these runtimes can be located,
|
|
|
+then all others can be found using the filesystem.
|
|
|
+This idea may be worth revisiting in the future,
|
|
|
+but the author wanted to avoid the complexity of implementing a new proof of work blockchain.
|
|
|
+Instead, two independent mechanisms are used,
|
|
|
+one that can discover runtimes over the internet as long as their path is known,
|
|
|
+and another that can discover runtimes on the local network even when the discoverer does not know
|
|
|
+their paths.
|
|
|
+
|
|
|
+% Searching DNS for root principals.
|
|
|
+When the path to a runtime is known,
|
|
|
+DNS is used to resolve SRV records using a fully qualified domain name
|
|
|
+(FQDN) derived from the path's root principal identifier.
|
|
|
+The SRV records are resolved using the name \texttt{\_bttp.\_udp.<FQDN>},
|
|
|
+where \texttt{<FQDN>} is the FQDN derived from the root principal's identifier.
|
|
|
+One SRV record may be created for each of the filesystem service providers in the root
|
|
|
+directory.
|
|
|
+Each record contains the UDP port and hostname where a runtime is listening.
|
|
|
+Every runtime is configured with a search domain that is used as a suffix in the FQDN.
|
|
|
+The leading labels in the FQDN are computed by base32 encoding the binary representation of the
|
|
|
+root principal's identifier.
|
|
|
+If the encoded string is longer than 63 bytes (the limit for each label in a hostname),
|
|
|
+it is separated into the fewest number of labels possible,
|
|
|
+working from left to right along the string.
|
|
|
+A dot followed by the search domain is concatenated onto the end of this string to form the FQDN.
|
|
|
+This method has the advantages of being simple to implement
|
|
|
+and allowing runtimes to discover each other over the internet.
|
|
|
+Implementing this system would be facilitated by hosting DNS servers in actors in the same
|
|
|
+runtimes as the root sector service providers.
|
|
|
+Then, records could be dynamically created which point to these runtimes.
|
|
|
+These runtimes would also need to be configured with static IP addresses,
|
|
|
+and the NS records for the search domain would need to point to them.
|
|
|
+Of course it is also possible to build such a system without hosting DNS inside of Blocktree.
|
|
|
+The downside of using DNS is that it couples Blocktree with a centralized,
|
|
|
+albeit distributed, system.
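The derivation of the SRV query name can be sketched as follows.
The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the text:
the base32 alphabet is the RFC 4648 one written in lowercase, and padding characters are omitted because they are not valid in hostnames.
The encoded identifier is cut into labels of at most 63 bytes from left to right, and the search domain is appended.

```rust
// RFC 4648 base32 alphabet in lowercase (assumption; DNS names are case-insensitive).
const ALPHABET: &[u8; 32] = b"abcdefghijklmnopqrstuvwxyz234567";

/// Base32 encodes `bytes` without padding.
pub fn base32_nopad(bytes: &[u8]) -> String {
    let mut out = String::new();
    let mut buffer: u32 = 0; // holds the bits not yet emitted
    let mut bits: u32 = 0; // number of valid bits in `buffer`
    for &byte in bytes {
        buffer = (buffer << 8) | u32::from(byte);
        bits += 8;
        while bits >= 5 {
            bits -= 5;
            out.push(ALPHABET[((buffer >> bits) & 0x1f) as usize] as char);
            buffer &= (1 << bits) - 1; // keep only the unemitted bits
        }
    }
    if bits > 0 {
        // Left-align the final partial group, filling with zero bits.
        out.push(ALPHABET[((buffer << (5 - bits)) & 0x1f) as usize] as char);
    }
    out
}

/// Builds the name used to resolve SRV records for a root principal.
/// `root_id` is the binary form of the identifier and is expected to be non-empty.
pub fn srv_query_name(root_id: &[u8], search_domain: &str) -> String {
    let encoded = base32_nopad(root_id);
    // Fewest labels possible: fill each label to 63 bytes, working left to right.
    let labels: Vec<&str> = encoded
        .as_bytes()
        .chunks(63)
        .map(|chunk| std::str::from_utf8(chunk).expect("base32 output is ASCII"))
        .collect();
    format!("_bttp._udp.{}.{}", labels.join("."), search_domain)
}
```

A 32-byte identifier encodes to 52 characters and fits in one label, whereas a 64-byte identifier encodes to 103 characters and is split into labels of 63 and 40 characters.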
|
|
|
+
|
|
|
+% Using link-local multicast datagrams to find runtimes.
|
|
|
+Because the previous mechanism requires knowledge of the root principal of a domain to perform
|
|
|
+discovery,
|
|
|
+it will not work if a runtime is first starting up with no credentials and so does not know its
|
|
|
+own root principal.
|
|
|
+This runtime needs a way to discover other runtimes so it can connect to the filesystem and sector
|
|
|
+services.
|
|
|
+This issue is solved by using link-local multicast addressing to discover the runtimes on the same
|
|
|
+network as the discoverer.
|
|
|
+When a \texttt{bttp} server starts listening for unicast traffic,
|
|
|
+it also listens for UDP datagrams on port 50142 at addresses 224.0.0.142 and FF02::142,
|
|
|
+if the IPv4 or IPv6 networking stack is available, respectively.
|
|
|
+If the host is attached to a dual-stack network,
|
|
|
+the server listens on both addresses.
|
|
|
+When a runtime is attempting to discover other runtimes,
|
|
|
+it sends out datagrams to these endpoints.
|
|
|
+Each \texttt{bttp} server replies with its unicast address and filesystem path
|
|
|
+(as specified in its credentials).
|
|
|
+If the server is available at both IPv4 and IPv6 unicast addresses,
|
|
|
+it is at the server's discretion which address to respond with;
|
|
|
+it may even respond with an IPv4 address to an IPv4 datagram,
|
|
|
+and an IPv6 address to an IPv6 datagram.
|
|
|
+Once a client has discovered the \texttt{bttp} servers on its network,
|
|
|
+it can route messages to them,
|
|
|
+such as the provisioning requests which are used to obtain new credentials.
|
|
|
+Provisioning is described in the Cryptography section.
|
|
|
+Note that port 50142 is in the dynamic range,
|
|
|
+so it does not need to be registered with the Internet Assigned Numbers Authority (IANA).
|
|
|
+Both addresses 224.0.0.142 and FF02::142 are currently unassigned,
|
|
|
+but they will need to be registered with IANA if Blocktree is widely adopted.
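The listening side of this discovery mechanism can be sketched with the Rust standard library alone.
The constant and function names below are hypothetical and are not taken from the Blocktree source,
and the IPv6 group is written as \texttt{ff02::142} on the assumption that it lies in the
link-local multicast range \texttt{ff02::/16}.

```rust
use std::io;
use std::net::{Ipv4Addr, Ipv6Addr, UdpSocket};

/// Discovery port from the text (inside the dynamic range 49152-65535).
pub const DISCOVERY_PORT: u16 = 50142;
/// IPv4 link-local multicast group from the text.
pub const DISCOVERY_V4: Ipv4Addr = Ipv4Addr::new(224, 0, 0, 142);
/// IPv6 group; assumed to be in the link-local multicast range ff02::/16.
pub const DISCOVERY_V6: Ipv6Addr = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x142);

/// Binds a UDP socket and joins the IPv4 discovery group,
/// letting the operating system choose the interface.
pub fn listen_v4() -> io::Result<UdpSocket> {
    let socket = UdpSocket::bind((Ipv4Addr::UNSPECIFIED, DISCOVERY_PORT))?;
    socket.join_multicast_v4(&DISCOVERY_V4, &Ipv4Addr::UNSPECIFIED)?;
    Ok(socket)
}

/// Binds a UDP socket and joins the IPv6 discovery group.
/// Interface index 0 lets the operating system choose the interface.
pub fn listen_v6() -> io::Result<UdpSocket> {
    let socket = UdpSocket::bind((Ipv6Addr::UNSPECIFIED, DISCOVERY_PORT))?;
    socket.join_multicast_v6(&DISCOVERY_V6, 0)?;
    Ok(socket)
}
```

On a dual-stack host a server would call both functions and answer queries arriving on either socket.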
|
|
|
+
|
|
|
+% Security model for queries.
|
|
|
+To allow runtimes which are not permitted to execute the root directory to query for other runtimes,
|
|
|
+authorization logic which is specific to queries is needed.
|
|
|
+If a process is connected with credentials
|
|
|
+and the path in the credentials contains the scope of the query,
|
|
|
+the query is permitted.
|
|
|
+If a process is connected anonymously,
|
|
|
+its query will only be answered if the query scope
|
|
|
+and all of its parent directories
|
|
|
+grant others the execute permission.
|
|
|
+Queries from authenticated processes can be authorized using only the information in the query,
|
|
|
+but anonymous queries require knowledge of filesystem permissions,
|
|
|
+some of which may not be known to the answering runtime.
|
|
|
+When authorizing an anonymous query,
|
|
|
+an answering runtime should check that the execute permission is granted on all directories
|
|
|
+that it is responsible for storing.
|
|
|
+If all these checks pass, it should forward the querier to the next runtime as usual.
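The query authorization rules above can be sketched as a single function.
This is an illustration under simplifying assumptions rather than the actual implementation:
the \texttt{Querier} type, the \texttt{authorize\_query} function, and the map of locally stored
directories are all hypothetical names introduced for the example.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// How the querying process is connected (hypothetical type).
pub enum Querier<'a> {
    /// Connected with credentials that were issued for `path`.
    Authenticated { path: &'a Path },
    /// Connected without credentials.
    Anonymous,
}

/// Decides whether this runtime may answer a query for `scope`.
/// `local_dirs` maps each directory this runtime stores to whether
/// "others" are granted the execute permission on it.
pub fn authorize_query(
    querier: &Querier<'_>,
    scope: &Path,
    local_dirs: &HashMap<PathBuf, bool>,
) -> bool {
    match querier {
        // The credential path must contain (be a prefix of) the scope.
        Querier::Authenticated { path } => scope.starts_with(path),
        // Check the scope and each of its parents, but only those
        // directories whose permissions this runtime knows about.
        Querier::Anonymous => scope
            .ancestors()
            .filter_map(|dir| local_dirs.get(dir))
            .all(|others_execute| *others_execute),
    }
}
```

Directories that the answering runtime does not store are skipped here,
on the expectation that the runtime the querier is forwarded to performs its own checks.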
|
|
|
+
|
|
|
+% Overview of protocol contracts and runtime checking of protocol adherence.
|
|
|
+To facilitate the creation of composable systems,
|
|
|
+a protocol contract checking system based on session types has been designed.
|
|
|
+This system models a communication protocol as a directed graph representing state transitions
|
|
|
+based on types of received messages.
|
|
|
+The protocol author defines the states that the actors participating in the protocol can be in using
|
|
|
+Rust traits.
|
|
|
+These traits define handler methods for each message type the actor is expected to handle in that
|
|
|
+state.
|
|
|
+A top-level trait which represents the entire protocol is defined that contains the types of the
|
|
|
+initial state of every actor in the protocol.
|
|
|
+A macro is used to generate the message handling loop for each of the parties to the protocol,
|
|
|
+as well as enums to represent all possible states that the parties can be in and the messages that
|
|
|
+they exchange.
|
|
|
+The generated code is responsible for ensuring that errors are generated when a message of an
|
|
|
+unexpected type is received,
|
|
|
+eliminating the need for ad-hoc error handling code to be written by application developers.
|
|
|
+
|
|
|
+% Example of a protocol contract.
|
|
|
+Let's explore how this system can be used to build a simple pub-sub communications protocol.
|
|
|
+In this protocol,
|
|
|
+there will be a server which handles \texttt{Sub} messages by remembering the names of the actors
|
|
|
+who sent them.
|
|
|
+It will handle \texttt{Pub} messages by forwarding them to all of the subscribed actors.
|
|
|
+The state-transition graph for the system is shown in figure \ref{fig:pubsub}.
|
|
|
+\begin{figure}
|
|
|
+ \begin{center}
|
|
|
+ \includegraphics[scale=0.6]{PubSubStateGraph.pdf}
|
|
|
+ \end{center}
|
|
|
+ \caption{The state-transition graph for a simple pub-sub protocol.}
|
|
|
+ \label{fig:pubsub}
|
|
|
+\end{figure}
|
|
|
+The solid edges in the graph indicate state transitions and are labeled with the message type
|
|
|
+which triggered the transition.
|
|
|
+The dashed edges indicate message delivery and are labeled with the type of the message delivered.
|
|
|
+Although \texttt{Runtime} is not the state of any actor in the system,
|
|
|
+it is included in the graph as the sender of the \texttt{Activate} and \texttt{Pub} messages.
|
|
|
+\texttt{Activate} is delivered by the runtime to pass a reference to the runtime and provide the
|
|
|
+actor's \texttt{Uuid}.
|
|
|
+\texttt{Pub} messages are dispatched by actors outside the graph and are routed to actors in the
|
|
|
+\texttt{Listening} state by the runtime.
|
|
|
+Note that the runtime itself doesn't have any notion of the state of any actor;
|
|
|
+it just delivers messages using the rules described previously.
|
|
|
+Only an actor can tell whether a message is expected or not given its current state.
|
|
|
+Each of the actor states is modeled by a Rust trait.
|
|
|
+\begin{verbatim}
|
|
|
+ pub trait ClientInit {
|
|
|
+ type AfterActivate: Subed;
|
|
|
+ type Fut: Future<Output = Result<Self::AfterActivate>>;
|
|
|
+ fn handle_activate(self, msg: Activate) -> Self::Fut;
|
|
|
+ }
|
|
|
+
|
|
|
+ pub trait Subed {
|
|
|
+ type AfterPub: Subed;
|
|
|
+ type Fut: Future<Output = Result<Self::AfterPub>>;
|
|
|
+ fn handle_pub(self, msg: Envelope<Pub>) -> Self::Fut;
|
|
|
+ }
|
|
|
+
|
|
|
+ pub trait ServerInit {
|
|
|
+ type AfterActivate: Listening;
|
|
|
+ type Fut: Future<Output = Result<Self::AfterActivate>>;
|
|
|
+ fn handle_activate(self, msg: Activate) -> Self::Fut;
|
|
|
+ }
|
|
|
+
|
|
|
+ pub trait Listening {
|
|
|
+ type AfterSub: Listening;
|
|
|
+ type SubFut: Future<Output = Result<Self::AfterSub>>;
|
|
|
+ fn handle_sub(self, msg: Envelope<Sub>) -> Self::SubFut;
|
|
|
+
|
|
|
+ type AfterPub: Listening;
|
|
|
+ type PubFut: Future<Output = Result<Self::AfterPub>>;
|
|
|
+ fn handle_pub(self, msg: Envelope<Pub>) -> Self::PubFut;
|
|
|
+ }
|
|
|
+\end{verbatim}
|
|
|
+The definition of \texttt{Activate} is as follows:
|
|
|
+\begin{verbatim}
|
|
|
+ pub struct Activate {
|
|
|
+ rt: &'static Runtime,
|
|
|
+ act_id: Uuid,
|
|
|
+ }
|
|
|
+\end{verbatim}
|
|
|
+The \texttt{Envelope} type is a wrapper around a message which contains information about who sent
|
|
|
+it and a method that can be used to send a reply.
|
|
|
+In general a new actor state, represented by a new type, can be returned by a message handling
|
|
|
+method.
|
|
|
+The protocol itself is also represented by a trait:
|
|
|
+\begin{verbatim}
|
|
|
+ pub trait PubSubProtocol {
|
|
|
+ type Server: ServerInit;
|
|
|
+ type Client: ClientInit;
|
|
|
+ }
|
|
|
+\end{verbatim}
|
|
|
+By modeling this protocol independently of any implementation of it,
|
|
|
+we allow for many different interoperable implementations to be created.
|
|
|
+We can also isolate bugs in these implementations because unexpected or malformed messages are
|
|
|
+checked for by the generated code.
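To make the role of the generated code more concrete,
the following is a hand-written, synchronous stand-in for the kind of dispatch logic the macro
could produce for the server.
It is only a sketch: the enum names, the \texttt{deliver} function, and the use of plain strings
for actor names and payloads are assumptions made for the example,
and the real generated code is asynchronous and works with the traits defined above.

```rust
/// Messages the server may receive (payloads simplified to strings).
#[derive(Debug, Clone, PartialEq)]
pub enum ServerMsg {
    Activate,
    Sub { from: String },
    Pub { payload: String },
}

/// States of the server taken from the state-transition graph.
#[derive(Debug, Clone, PartialEq)]
pub enum ServerState {
    Init,
    Listening { subscribers: Vec<String> },
}

/// Error produced when a message is not expected in the current state.
#[derive(Debug, Clone, PartialEq)]
pub struct UnexpectedMsg {
    pub state: ServerState,
    pub msg: ServerMsg,
}

/// A message to send, as (recipient name, payload).
pub type Outgoing = (String, String);

/// Applies one message to the current state, returning the next state
/// and any messages to forward, or an error for an unexpected message.
pub fn deliver(
    state: ServerState,
    msg: ServerMsg,
) -> Result<(ServerState, Vec<Outgoing>), UnexpectedMsg> {
    match (state, msg) {
        (ServerState::Init, ServerMsg::Activate) => {
            Ok((ServerState::Listening { subscribers: Vec::new() }, Vec::new()))
        }
        (ServerState::Listening { mut subscribers }, ServerMsg::Sub { from }) => {
            subscribers.push(from);
            Ok((ServerState::Listening { subscribers }, Vec::new()))
        }
        (ServerState::Listening { subscribers }, ServerMsg::Pub { payload }) => {
            let outgoing: Vec<Outgoing> = subscribers
                .iter()
                .map(|name| (name.clone(), payload.clone()))
                .collect();
            Ok((ServerState::Listening { subscribers }, outgoing))
        }
        // Every other combination violates the protocol.
        (state, msg) => Err(UnexpectedMsg { state, msg }),
    }
}
```

The final match arm is the part application developers no longer have to write by hand:
any message that has no edge in the state-transition graph is turned into an error.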
|
|
|
+
|
|
|
+% Implementing actors in languages other than Rust.
|
|
|
+Today the actor runtime only supports executing actors implemented in Rust.
|
|
|
+A WebAssembly (Wasm) plugin system is planned to allow any language which can compile to Wasm to be
|
|
|
+used to implement an actor.
|
|
|
+This work is blocked pending the standardization of the WebAssembly Component Model,
|
|
|
+which promises to provide an interface definition language which will allow type-safe actors to be
|
|
|
+defined in many different languages.
|
|
|
+
|
|
|
+% Running containers using actors.
|
|
|
+Blocktree allows containers to be run by encapsulating them using a supervising actor.
|
|
|
+This actor is responsible for starting the container and managing the container's kernel namespace.
|
|
|
+Logically, it owns any kernel resources created by the container, including all spawned operating
|
|
|
+system processes.
|
|
|
+When the actor halts,
|
|
|
+all of these resources are destroyed.
|
|
|
+All network communication to the container is controlled by the supervising actor.
|
|
|
+The supervisor can be configured to bind container ports to host ports,
|
|
|
+as is commonly done today,
|
|
|
+but it can also be used to encapsulate traffic to and from the container in Blocktree messages.
|
|
|
+These messages are routed to other actors based on the configuration of the supervisor.
|
|
|
+This essentially creates a VPN for containers,
|
|
|
+ensuring that regardless of how well secured their communication is,
|
|
|
+they will be safe to communicate over any network.
|
|
|
+This network encapsulation system could be used in other actors as well,
|
|
|
+allowing a lightweight and secure VPN system to be built.
|
|
|
+
|
|
|
+% Web GUI used for managing the system.
|
|
|
+Any modern computer system must include a GUI because
|
|
|
+it is required by users.
|
|
|
+For this reason Blocktree includes a web-based GUI called \texttt{btconsole} that can
|
|
|
+monitor the system, provision runtimes, and configure access control.
|
|
|
+\texttt{btconsole} is itself implemented as an actor in the runtime,
|
|
|
+and so has access to the same facilities as any other actor.
|
|
|
+
|
|
|
+
|
|
|
+\section{Filesystem}
|
|
|
+% The division of responsibilities between the sector and filesystem services.
|
|
|
+The responsibility for serving data in Blocktree is shared between the filesystem and sector
|
|
|
+services.
|
|
|
+Most actors will access the filesystem through the filesystem service,
|
|
|
+which provides a high-level interface that takes care of the cryptographic operations necessary to
|
|
|
+read and write files.
|
|
|
+The filesystem service relies on the sector service for actually persisting data.
|
|
|
+The individual sectors which make up a file are read from and written to the sector service,
|
|
|
+which stores them in the local filesystem of the computer on which it is running.
|
|
|
+A sector is the atomic unit of data storage
|
|
|
+and the sector service only supports reading and writing entire sectors at once.
|
|
|
+File actors spawned by the filesystem service buffer reads and writes until there is enough
|
|
|
+data to fill a sector.
|
|
|
+Because cryptographic operations are only performed on full sectors,
|
|
|
+the cost of providing these protections is amortized over the size of the sector.
|
|
|
+Thus there is a tradeoff between latency and throughput when selecting the sector size of a file:
|
|
|
+a smaller sector size means less latency while a larger one enables more throughput.
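The buffering behavior described above can be illustrated with a small write buffer that only
releases data in whole sectors.
The \texttt{SectorBuffer} type and its methods are hypothetical names used for this sketch;
the file actors in Blocktree may organize their buffering differently.

```rust
/// Accumulates written bytes and releases them one full sector at a time.
pub struct SectorBuffer {
    sector_size: usize,
    pending: Vec<u8>,
}

impl SectorBuffer {
    /// Creates a buffer for a file with the given (non-zero) sector size.
    pub fn new(sector_size: usize) -> Self {
        assert!(sector_size > 0, "sector size must be non-zero");
        SectorBuffer { sector_size, pending: Vec::new() }
    }

    /// Appends `data` and returns every sector that became full.
    /// Bytes that do not yet fill a sector stay buffered.
    pub fn write(&mut self, data: &[u8]) -> Vec<Vec<u8>> {
        self.pending.extend_from_slice(data);
        let mut full = Vec::new();
        while self.pending.len() >= self.sector_size {
            // Keep the first sector in `pending`, move the remainder out,
            // then swap so that the full sector is the one returned.
            let rest = self.pending.split_off(self.sector_size);
            full.push(std::mem::replace(&mut self.pending, rest));
        }
        full
    }

    /// Number of bytes waiting for the current sector to fill.
    pub fn buffered(&self) -> usize {
        self.pending.len()
    }
}
```

With a small sector size a write becomes visible to the sector service sooner,
while a large sector size means fewer, larger cryptographic operations per byte written.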
|
|
|
+
|
|
|
+% Types of sectors: metadata, integrity, and data.
|
|
|
+A file has a single metadata sector, a Merkle sector, and zero or more data sectors.
|
|
|
+The sector size of a file can be specified when it is created,
|
|
|
+but cannot be changed later.
|
|
|
+Every data sector contains the ciphertext of a number of plaintext bytes equal to the sector size,
|
|
|
+but the metadata and Merkle sectors contain a variable amount of data.
|
|
|
+The metadata sector contains all of the filesystem metadata associated with the file.
|
|
|
+In addition to the usual metadata present in any Unix filesystem (the contents of the \texttt{stat} struct),
|
|
|
+cryptographic information necessary to verify and decrypt the contents of the file is also stored.
|
|
|
+The Merkle sector of a file contains a Merkle tree over the data sectors of a file.
|
|
|
+The hash function used by this tree can be configured at file creation,
|
|
|
+but cannot be changed after the fact.
|
|
|
+
|
|
|
+% How sectors are identified.
|
|
|
+When sector service providers are contained in the same directory, they connect to each other to form
|
|
|
+a consensus cluster.
|
|
|
+This cluster is identified by a \texttt{u64} called the cluster's \emph{generation}.
|
|
|
+Every file is identified by a pair of \texttt{u64} values: its generation and its inode.
|
|
|
+The sectors within a file are identified by an enum which specifies which type they are,
|
|
|
+and in the case of data sectors, their 0-based index.
|
|
|
+\begin{verbatim}
|
|
|
+ pub enum SectorKind {
|
|
|
+ Meta,
|
|
|
+ Merkle,
|
|
|
+ Data(u64),
|
|
|
+ }
|
|
|
+\end{verbatim}
|
|
|
+The byte offset in the plaintext of the file at which each data sector begins can be calculated by
|
|
|
+multiplying the sector's index by the sector size of the file.
|
|
|
+The \texttt{SectorId} type is used to identify a sector.
|
|
|
+\begin{verbatim}
|
|
|
+ pub struct SectorId {
|
|
|
+ generation: u64,
|
|
|
+ inode: u64,
|
|
|
+ sector: SectorKind,
|
|
|
+ }
|
|
|
+\end{verbatim}
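The relationship between a data sector's index and its byte offset can be written out directly.
The two helper functions below are hypothetical additions for illustration;
only the \texttt{SectorKind} enum itself comes from the definitions above.

```rust
/// The kind of a sector within a file, as defined in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum SectorKind {
    Meta,
    Merkle,
    Data(u64),
}

impl SectorKind {
    /// Byte offset in the file's plaintext at which this sector begins.
    /// Returns `None` for non-data sectors or if the product overflows.
    pub fn plaintext_offset(self, sector_size: u64) -> Option<u64> {
        match self {
            SectorKind::Data(index) => index.checked_mul(sector_size),
            SectorKind::Meta | SectorKind::Merkle => None,
        }
    }
}

/// The data sector containing the given plaintext byte offset.
/// Returns `None` if `sector_size` is zero.
pub fn data_sector_containing(offset: u64, sector_size: u64) -> Option<SectorKind> {
    offset.checked_div(sector_size).map(SectorKind::Data)
}
```

For example, with a sector size of 4096 bytes, data sector 3 begins at byte 12288,
and byte 12289 falls within that same sector.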
|
|
|
+
|
|
|
+% How the sector service stores data.
|
|
|
+The sector service persists sectors in a directory in its local filesystem,
|
|
|
+with each sector stored in a different file.
|
|
|
+The scheme used to name these files involves security considerations,
|
|
|
+and is described in the next section.
|
|
|
+When a sector is updated,
|
|
|
+a new local file is created with a different name containing the new contents.
|
|
|
+Rather than deleting the old sector file,
|
|
|
+it is overwritten by the creation of a hardlink to the new file,
|
|
|
+and the name that was used to create the new file is unlinked.
|
|
|
+This method ensures that the sector file is updated in one atomic operation
|
|
|
+and is used by other Unix programs.
|
|
|
+The sector service also uses the local filesystem to persist the replicated log it uses for Raft.
|
|
|
+This file serves as a journal of sector operations.
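The write-then-replace pattern for updating a sector file can be sketched as follows.
Note that this sketch swaps in \texttt{std::fs::rename},
which atomically replaces the destination on Unix,
as a stand-in for the link-and-unlink sequence described above;
it is not a transcription of the sector service's code.
The function name and the \texttt{.new} suffix for the temporary file are assumptions.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Replaces the contents of a sector file without ever exposing a
/// partially written file under the sector's own name.
pub fn replace_sector_file(sector_path: &Path, contents: &[u8]) -> io::Result<()> {
    // Hypothetical naming scheme for the temporary file.
    let tmp_path = sector_path.with_extension("new");
    {
        let mut tmp = fs::File::create(&tmp_path)?;
        tmp.write_all(contents)?;
        // Flush to stable storage before the new contents become visible.
        tmp.sync_all()?;
    }
    // Stand-in for the hard link and unlink steps: the temporary name
    // disappears and the sector name now refers to the new contents.
    fs::rename(&tmp_path, sector_path)
}
```

A reader that opens the sector file at any point sees either the complete old contents or the
complete new contents.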
|
|
|
+
|
|
|
+% Types of messages handled by the sector service.
|
|
|
+Communication with the sector service is done by passing it messages of type \texttt{SectorMsg}.
|
|
|
+\begin{verbatim}
|
|
|
+ pub struct SectorMsg {
|
|
|
+ id: SectorId,
|
|
|
+ op: SectorOperation,
|
|
|
+ }
|
|
|
+
|
|
|
+ pub enum SectorOperation {
|
|
|
+ Read,
|
|
|
+ Write(WriteOperation),
|
|
|
+ }
|
|
|
+
|
|
|
+ pub enum WriteOperation {
|
|
|
+ Meta(Box<FileMeta>),
|
|
|
+ Data {
|
|
|
+ meta: Box<FileMeta>,
|
|
|
+ contents: Vec<u8>,
|
|
|
+ }
|
|
|
+ }
|
|
|
+\end{verbatim}
|
|
|
+Here \texttt{FileMeta} is the type used to store metadata for files.
|
|
|
+Note that updated metadata is required to be sent when a sector's contents are modified.
|
|
|
+
|
|
|
+% Scaling horizontally: using Raft to create consensus cluster. Additional replication methods.
|
|
|
+A generation of sector service providers uses the Raft protocol to synchronize the state of the
|
|
|
+sectors it stores.
|
|
|
+The message passing interface of the runtime enables this implementation
|
|
|
+and the sector service's requirements were important considerations in designing this interface.
|
|
|
+The system currently replicates all data to each of the service providers in the cluster.
|
|
|
+Additional replication methods are planned for future implementation
|
|
|
+(e.g. erasure coding and distribution via consistent hashing),
|
|
|
+which allow for different tradeoffs between data durability and storage utilization.
|
|
|
+
|
|
|
+% Scaling vertically: how different generations are stitched together.
|
|
|
+The creation of a new generation of the sector service is accomplished with several steps.
|
|
|
+First, a new directory is created in which the generation will be located.
|
|
|
+Next, one or more processes are credentialed for this directory,
|
|
|
+using a procedure which is described in the next section.
|
|
|
+The credentialing process produces files for each of the processes stored in the new directory.
|
|
|
+The sector service provider in each of the processes uses the filesystem service
|
|
|
+(which connects to the parent generation of the sector service)
|
|
|
+to find the other runtimes hosting the sector service in the directory and messages them to
|
|
|
+establish a fully-connected cluster.
|
|
|
+Finally, the service provider which is elected leader contacts the generation in the root directory
|
|
|
+and requests a new generation number.
|
|
|
+Once this number is known it is stored in the superblock for the generation,
|
|
|
+which is the file identified by the new generation number and inode 2.
|
|
|
+The superblock is not contained in any directory and cannot be accessed outside the sector service.
|
|
|
+The superblock also keeps track of the next inode to assign to a new file.
|
|
|
+
|
|
|
+% Authorization logic of the sector service.
|
|
|
+To prevent malicious actors from writing invalid data,
|
|
|
+the sector service must cryptographically verify all write messages.
|
|
|
+The process it uses to do this involves several steps:
|
|
|
+\begin{enumerate}
|
|
|
+ \item The certificate chain in the metadata that was sent in the write message is validated.
|
|
|
+ It is considered valid if it ends with a certificate signed by the root principal
|
|
|
+ and the paths in the certificates are correctly nested,
|
|
|
+ indicating valid delegation of write authority at every step.
|
|
|
+ \item Using the last public key in the certificate chain,
|
|
|
+ the signature in the metadata is validated.
|
|
|
+ This signature covers all of the fields in the metadata.
|
|
|
+ \item The new sector contents in the write message are hashed using the digest function configured
|
|
|
+ for the file and the resulting hash is used to update the file's Merkle tree in its Merkle
|
|
|
+ sector.
|
|
|
+ \item The root of the Merkle tree is compared with the integrity value in the file's metadata.
|
|
|
+ The write message is considered valid if and only if there is a match.
|
|
|
+\end{enumerate}
|
|
|
+This same logic is used by file actors to verify the data they read from the sector service.
|
|
|
+Only once a write message is validated is it shared with the sector service provider's peers in
|
|
|
+its generation.
|
|
|
+Although the data in a file is encrypted,
|
|
|
+it is still beneficial for security to prevent unauthorized principals from gaining access to a
|
|
|
+file's ciphertext.
|
|
|
+To prevent this, a sector service provider checks a file's metadata to verify that the requesting
|
|
|
+principal actually has a readcap (to be defined in the next section) for the file.
|
|
|
+This ensures that only principals that are authorized to read a file can gain access to the file's
|
|
|
+ciphertext, metadata, and Merkle tree.
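Steps 3 and 4 of the write validation procedure listed above can be sketched as follows.
To keep the example free of external dependencies it uses the standard library's
\texttt{DefaultHasher}, which is \emph{not} a cryptographic hash function,
in place of the digest function configured for the file.
The tree layout (an odd node is promoted unchanged to the next level) and all function names
are likewise assumptions made for the sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in digest type; a real file would use SHA-256 or SHA-512 output.
pub type Digest = u64;

/// Hashes the contents of one sector (NOT cryptographic, stand-in only).
pub fn hash_leaf(sector: &[u8]) -> Digest {
    let mut hasher = DefaultHasher::new();
    hasher.write_u8(0); // domain separation: leaf
    hasher.write(sector);
    hasher.finish()
}

/// Hashes two child digests into their parent (NOT cryptographic).
fn hash_node(left: Digest, right: Digest) -> Digest {
    let mut hasher = DefaultHasher::new();
    hasher.write_u8(1); // domain separation: interior node
    hasher.write_u64(left);
    hasher.write_u64(right);
    hasher.finish()
}

/// Computes the root of the tree over the given leaf digests.
pub fn merkle_root(leaves: &[Digest]) -> Option<Digest> {
    if leaves.is_empty() {
        return None;
    }
    let mut level = leaves.to_vec();
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| match pair {
                [left, right] => hash_node(*left, *right),
                [single] => *single, // odd node is promoted unchanged
                _ => unreachable!("chunks(2) yields one or two items"),
            })
            .collect();
    }
    Some(level[0])
}

/// Step 3: hash the new sector and update its leaf.
/// Step 4: accept only if the new root equals the integrity value
/// from the metadata. Returns the updated leaves on success.
pub fn validate_write(
    leaves: &[Digest],
    index: usize,
    new_contents: &[u8],
    integrity: Digest,
) -> Option<Vec<Digest>> {
    let mut updated = leaves.to_vec();
    *updated.get_mut(index)? = hash_leaf(new_contents);
    (merkle_root(&updated)? == integrity).then_some(updated)
}
```

This version recomputes the whole tree for clarity;
an implementation that stores the interior nodes only needs to recompute the hashes on the path
from the modified leaf to the root.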
|
|
|
+
|
|
|
+% File actors are responsible for cryptographic operations. Client-side encryption.
|
|
|
+The sector service is relied upon by the filesystem service to read and write sectors.
|
|
|
+Filesystem service providers communicate with the sector service to open files and perform
|
|
|
+filesystem operations.
|
|
|
+These providers spawn file actors that are responsible for verifying and decrypting the information
|
|
|
+contained in sectors and providing it to other actors.
|
|
|
+They use the credentials of the runtime they are hosted in to decrypt sector data using
|
|
|
+information contained in file metadata.
|
|
|
+File actors are also responsible for encrypting and integrity protecting data written to files.
|
|
|
+In order for a file actor to produce a signature over the root of the file's Merkle tree,
|
|
|
+it maintains a copy of the tree in memory.
|
|
|
+This copy is read from the sector service when the file is opened.
|
|
|
+While this does mean duplicating data between the sector and filesystem services,
|
|
|
+this design was chosen to reduce the network traffic between the two services,
|
|
|
+as the entire Merkle tree does not need to be transmitted on every write.
|
|
|
+Encapsulating all cryptographic operations in the filesystem service and file actors allows the
|
|
|
+computer storing data to be different from the computer encrypting it.
|
|
|
+This approach allows client-side encryption to be done on more capable computers
|
|
|
+and low-powered devices to delegate this task to a storage server.
|
|
|
+
|
|
|
+% Prevention of resource leaks through ownership.
|
|
|
+A major advantage of using file actors to access file data is that they can be accessed over the
|
|
|
+network from a different runtime as easily as they can be from the same runtime.
|
|
|
+One complication arising from this approach is that file actors must not outlive the actor which
|
|
|
+caused them to be spawned.
|
|
|
+This is handled in the filesystem service by making the actor who opened the file the owner of the
|
|
|
+file actor.
|
|
|
+When a file actor receives notification that its owner returned,
|
|
|
+it flushes any buffered data in its cache and returns,
|
|
|
+ensuring that a resource leak does not occur.
|
|
|
+
|
|
|
+% Encrypted metadata. Extended attributes in metadata. Cache control.
|
|
|
+Some of the information stored in metadata needs to be kept in plaintext to allow the sector
|
|
|
+service to verify and decrypt the file,
|
|
|
+but most of it is encrypted using the same key as the file's contents.
|
|
|
+The file's authorization attributes, its size, and its access times are all encrypted.
|
|
|
+The table storing the file's extended attributes (EAs) is also encrypted.
|
|
|
+Cache control information is included in this area as well.
|
|
|
+It specifies the number of seconds, as a \texttt{u32}, that a file may be cached.
|
|
|
+The filesystem service uses this information to evict sectors from its cache when they have been
|
|
|
+cached for longer than this threshold,
|
|
|
+causing them to be reloaded from the sector service.
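The eviction rule can be expressed as a simple comparison against the cache control value.
The \texttt{CachedSector} type and the \texttt{is\_expired} method are hypothetical names for
this sketch, which assumes the cache records when each sector was loaded.

```rust
use std::time::{Duration, Instant};

/// A sector held in the filesystem service's cache (hypothetical type).
pub struct CachedSector {
    pub contents: Vec<u8>,
    pub loaded_at: Instant,
}

impl CachedSector {
    /// True if the sector has been cached for longer than the number of
    /// seconds given by the file's cache control value.
    pub fn is_expired(&self, now: Instant, max_age_secs: u32) -> bool {
        let age = now.saturating_duration_since(self.loaded_at);
        age > Duration::from_secs(u64::from(max_age_secs))
    }
}
```

Passing \texttt{now} as a parameter rather than reading the clock inside the method keeps the
check deterministic, which makes it easier to test.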
|
|
|
+
|
|
|
+% Authorization logic of the filesystem service.
|
|
|
+The filesystem service uses an \texttt{Authorizer} type to make authorization decisions.
|
|
|
+It passes this type the authorization attributes of the principal accessing the file, the
|
|
|
+attributes of the file, and the type of access (read, write, or execute).
|
|
|
+The \texttt{Authorizer} returns a boolean indicating if access is permitted or denied.
|
|
|
+These access control checks are performed for every message processed by the filesystem service,
|
|
|
+including opening a file.
|
|
|
+A file actor only responds to messages sent from its owner,
|
|
|
+which ensures that it can avoid the overhead of performing access control checks as these were
|
|
|
+carried out by the filesystem service when it was created.
|
|
|
+The file actor is configured when it is spawned to allow read-only, write-only, or read-write
|
|
|
+access to a file,
|
|
|
+depending on what type of access was requested by the actor opening the file.
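As an illustration of the kind of decision an authorizer makes,
the sketch below checks Unix-style mode bits.
It assumes that the authorization attributes consist of a user ID, a group ID, and a mode,
which this document does not specify in detail;
the type and function names are hypothetical and do not describe the actual \texttt{Authorizer}.

```rust
/// The type of access being requested.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Access {
    Read,
    Write,
    Execute,
}

/// Assumed authorization attributes of the accessing principal.
pub struct PrincipalAttrs {
    pub uid: u32,
    pub gid: u32,
}

/// Assumed authorization attributes of the file.
pub struct FileAttrs {
    pub uid: u32,
    pub gid: u32,
    /// Unix-style permission bits, e.g. 0o640.
    pub mode: u16,
}

/// Returns true if access is permitted and false if it is denied.
pub fn is_permitted(principal: &PrincipalAttrs, file: &FileAttrs, access: Access) -> bool {
    let bit: u16 = match access {
        Access::Read => 0o4,
        Access::Write => 0o2,
        Access::Execute => 0o1,
    };
    // Choose the owner, group, or others class, in that order.
    let shift = if principal.uid == file.uid {
        6
    } else if principal.gid == file.gid {
        3
    } else {
        0
    };
    ((file.mode >> shift) & bit) != 0
}
```

Under these assumptions, a file with mode \texttt{0o640} can be read and written by its owner,
read by members of its group, and not accessed at all by others.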
|
|
|
+
|
|
|
+% Streaming replication.
|
|
|
+Often when building distributed systems it is convenient to alert any interested party that an event
|
|
|
+has occurred.
|
|
|
+To facilitate this pattern,
|
|
|
+the sector service allows actors to subscribe for notification of writes to a file.
|
|
|
+The sector service maintains a list of actors which are currently subscribed
|
|
|
+and when it commits a write to its local storage,
|
|
|
+it sends each of them a notification message identifying the sector written
|
|
|
+(but not the written data).
|
|
|
+By using different files to represent different events,
|
|
|
+a simple notification system can be built.
|
|
|
+Because the contents of a directory may be distributed over many different generations,
|
|
|
+this system does not support the recursive monitoring of directories.
|
|
|
+Although this system lacks the power of \texttt{inotify} in the Linux kernel,
|
|
|
+it does provide some of its benefits without incurring much of a performance overhead
|
|
|
+or implementation complexity.
|
|
|
+For example, this system can be used to implement streaming replication.
|
|
|
+This is done by subscribing to writes on all the files that are to be replicated,
|
|
|
+then reading new sectors as soon as notifications are received.
|
|
|
+These sectors can then be written into replica files in a different directory.
|
|
|
+This ensures that the contents of the replicas will be updated in near real-time.
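The subscription mechanism can be modeled in a few lines using standard library channels in
place of actor messages.
This is a sketch only:
the \texttt{Subscriptions}, \texttt{FileId}, and \texttt{WriteNotice} types are hypothetical,
and in Blocktree the notifications would be delivered by the runtime rather than over an
in-process channel.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Identifies a file by its generation and inode.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct FileId {
    pub generation: u64,
    pub inode: u64,
}

/// Identifies the sector that was written; carries no sector data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WriteNotice {
    pub file: FileId,
    pub sector_index: u64,
}

/// The list of current subscribers, grouped by file.
#[derive(Default)]
pub struct Subscriptions {
    by_file: HashMap<FileId, Vec<Sender<WriteNotice>>>,
}

impl Subscriptions {
    /// Registers interest in writes to `file`.
    pub fn subscribe(&mut self, file: FileId) -> Receiver<WriteNotice> {
        let (tx, rx) = channel();
        self.by_file.entry(file).or_default().push(tx);
        rx
    }

    /// Called after a write has been committed to local storage.
    /// Subscribers whose receiving end is gone are dropped.
    pub fn notify_committed(&mut self, file: FileId, sector_index: u64) {
        if let Some(subscribers) = self.by_file.get_mut(&file) {
            let notice = WriteNotice { file, sector_index };
            subscribers.retain(|tx| tx.send(notice).is_ok());
        }
    }
}
```

A replication actor would subscribe to each file it mirrors and,
for every notice received, read the identified sector and write it to the replica.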
|
|
|
+
|
|
|
+% Peer-to-peer distribution of sector data.
|
|
|
+Because of the strong integrity protection afforded to sectors,
|
|
|
+it is possible for peer-to-peer distribution of sector data to be done securely.
|
|
|
+Implementing this mechanism is planned as a future enhancement to the system.
|
|
|
+The idea is to base the design on BitTorrent,
|
|
|
+where the sector service responsible for a file acts as a tracker for that file,
|
|
|
+and the file actors accessing the file communicate with one another directly using the information
|
|
|
+provided by the sector service.
|
|
|
+This could allow the system to scale to a much larger number of concurrent reads by reducing
|
|
|
+the load on the sector service.
|
|
|
+
|
|
|
+% The FUSE daemon.
|
|
|
+Being able to access the filesystem from actors allows a programmer to implement new applications
|
|
|
+using Blocktree,
|
|
|
+but there is an entire world of existing applications which only know how to access the local
|
|
|
+filesystem.
|
|
|
+To allow these applications access to Blocktree,
|
|
|
+a FUSE daemon called \texttt{btfuse} is included which allows a Blocktree directory to be mounted
|
|
|
+to a directory in the local filesystem.
|
|
|
+This daemon can directly access the sector files in a local directory,
|
|
|
+or it can connect over the network to a filesystem or sector service provider.
|
|
|
+This FUSE daemon could be included in a system's initrd to allow it to mount its root filesystem
|
|
|
+from Blocktree,
|
|
|
+opening up many interesting possibilities for hosting machine images in Blocktree.
|
|
|
+A planned future enhancement is to develop a Blocktree filesystem driver which actually runs in
|
|
|
+kernel space.
|
|
|
+This would reduce the overhead associated with context switching from user space, to kernel space,
|
|
|
+and back to user space, for every filesystem interaction,
|
|
|
+making the system more practical to use for a root filesystem.
|
|
|
+
|
|
|
+
|
|
|
+\section{Cryptography}
|
|
|
+This section describes the cryptographic mechanisms used to protect the integrity and confidentiality of
|
|
|
+files.
|
|
|
+These mechanisms are based on well-established cryptographic constructions.
|
|
|
+
|
|
|
+% Integrity protection.
|
|
|
+File integrity is protected by a digital signature over its metadata.
|
|
|
+The metadata contains an integrity field, which holds the root node of a Merkle tree over
|
|
|
+the file's contents.
|
|
|
+This allows any sector in the file to be verified with a number of hash function invocations that
|
|
|
+is logarithmic in the size of the file.
|
|
|
+It also allows the sectors of a file to be verified in any order,
|
|
|
+enabling random access.
|
|
|
+The hash function used in the Merkle tree can be configured when the file is created.
|
|
|
+Currently, SHA-256 is the default, and SHA-512 is supported.
|
|
|
+A file's metadata also contains a certificate chain,
|
|
|
+and this chain is used to authenticate the signature over the metadata.
|
|
|
+In Blocktree, the certificate chain is referred to as a \emph{writecap}
|
|
|
+because it grants the capability to write to files.
|
|
|
+The certificates in a valid writecap are ordered by their paths:
|
|
|
+the initial certificate contains the longest path,
|
|
|
+the path in each subsequent certificate must be a prefix of the one preceding it,
|
|
|
+and the final certificate must be signed by the root principal.
|
|
|
+These rules ensure that there is a valid delegation of write authority at every
|
|
|
+link in the chain,
|
|
|
+and that the authority is ultimately derived from the root principal specified by the absolute path
|
|
|
+of the file.
|
|
|
+By including all the information necessary to verify the integrity of a file in its metadata,
|
|
|
+it is possible for a requestor who only knows the path of a file to verify that the contents of the
|
|
|
+file are authentic.
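The ordering rules for a writecap can be checked with a short function.
The sketch below covers only the path nesting;
signature verification is omitted and represented by a boolean flag,
and the requirement that the file's path lies under the first certificate's path is an
assumption that follows from the delegation model rather than a rule stated above.
The \texttt{Cert} type and function name are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// A simplified certificate: only the path it covers and a flag
/// standing in for "the signature by the root principal verified".
pub struct Cert {
    pub path: PathBuf,
    pub signed_by_root: bool,
}

/// Checks the path ordering rules of a writecap for the file at `file_path`.
/// `chain[0]` is the initial certificate, which holds the longest path.
pub fn writecap_paths_valid(chain: &[Cert], file_path: &Path) -> bool {
    let (first, last) = match (chain.first(), chain.last()) {
        (Some(first), Some(last)) => (first, last),
        _ => return false, // an empty chain grants nothing
    };
    // Assumption: the file must lie under the initial certificate's path.
    file_path.starts_with(&first.path)
        // Each subsequent path must be a prefix of the one preceding it.
        && chain
            .windows(2)
            .all(|pair| pair[0].path.starts_with(&pair[1].path))
        // The final certificate must be signed by the root principal.
        && last.signed_by_root
}
```

Because \texttt{Path::starts\_with} compares whole path components,
a certificate for \texttt{/org/team} does not cover \texttt{/org/teammate}.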
|
|
|
+
|
|
|
+% Confidentiality protecting files with readcaps. Single pubkey operation to read a dir tree.
|
|
|
+Confidentiality protection of files is optional but when it is enabled,
|
|
|
+a file's sectors are individually encrypted using a symmetric cipher.
|
|
|
+The key to this cipher is randomly generated when a file is created.
|
|
|
+A different IV is generated for each sector by hashing the index of the sector with a
|
|
|
+randomly generated IV for the entire file.
|
|
|
+A file's key and IV are encrypted using the public keys of the principals to whom read access is
|
|
|
+to be allowed.
|
|
|
+The resulting ciphertext is referred to as a \emph{readcap}, as it grants the capability to read the
|
|
|
+file.
|
|
|
+These readcaps are stored in a table in the file's metadata.
|
|
|
+Each entry in the table is identified by a byte string that is derived from the public key of the
|
|
|
+principal who owns the entry's readcap.
|
|
|
+The byte string is computed by calculating an HMAC of the principal's public key.
|
|
|
+The HMAC is keyed with a randomly generated salt that is stored in the file's metadata.
|
|
|
+An identifier for the hash function that was used in the HMAC is included in the byte string so
|
|
|
+that the HMAC can be recomputed later.
|
|
|
+When the filesystem service accesses the file,
|
|
|
+it recomputes the HMAC using the salt, its public key, and the hash function specified in each entry
|
|
|
+of the table.
|
|
|
+It can then identify the entry which contains its readcap,
|
|
|
+or that such an entry does not exist.
|
|
|
+This mechanism was designed to prevent offline correlation attacks on file metadata,
|
|
|
+as metadata is stored in plaintext in local filesystems.
|
|
|
+The file key and IV are also encrypted using the keys of the file's parents.
|
|
|
+Note that there may be multiple parents of a file because it may be hard linked to several
|
|
|
+directories.
|
|
|
+Each of the resulting ciphertexts is stored in another table in the file's metadata.
|
|
|
+The entries in this table are identified by an HMAC of the parent's generation and inode numbers,
|
|
|
+where the HMAC is keyed using the file's salt.
|
|
|
+By encrypting a file's key and IV using the key and IV of its parents,
|
|
|
+it is possible to traverse a directory tree using only a single public key decryption.
|
|
|
+The file where this traversal begins must contain a readcap owned by the accessing principal,
|
|
|
+but all subsequent accesses can be performed by decrypting the key and IV of a child using the
|
|
|
+key and IV of a parent.
|
|
|
+Not only does this allow traversals to use more efficient symmetric key cryptography,
|
|
|
+but it also means that it suffices to grant a readcap on a single directory in order to grant
|
|
|
+access to the entire tree rooted at that directory.
|
|
|
+
|
|
|
+% File key rotation and readcap revocation.
|
|
|
+Because it is not possible to change the key used by a file after it is created,
|
|
|
+a file must be copied in order to rotate the key used to encrypt it.
|
|
|
+Similarly, revoking a readcap is accomplished by creating a copy of the file
|
|
|
+and adding all the readcaps from the original's metadata except for the one being revoked.
|
|
|
+While it is certainly possible to remove a readcap from the metadata table,
|
|
|
+this is not supported because the readcap holder may have used custom software to save the file's
|
|
|
+key and IV while it had access to them,
|
|
|
+so data written to the same file after revocation could potentially be decrypted by it.
|
|
|
+By forcing the user to create a new file,
|
|
|
+they are forced to re-encrypt the data using a fresh key and IV.
|
|
|
+
|
|
|
+% Obfuscating sector files stored in the local filesystem.
|
|
|
+From an attacker's perspective,
|
|
|
+not every file in your domain is equally interesting.
|
|
|
+They may be particularly interested in reading your root directory,
|
|
|
+or they may have identified the inode of a file containing kompromat.
|
|
|
+To make offline identification of which files the sectors in the local filesystem belong to more difficult,
|
|
|
+an obfuscation mechanism is used.
|
|
|
+This works by generating a random salt for each generation of the sector service,
|
|
|
+and storing it in the generation's superblock.
|
|
|
+It is hashed along with the inode and the sector ID to produce the file name of the sector file
|
|
|
+in the local filesystem.
|
|
|
+These files are arranged into different subdirectories according to the value of the first two
|
|
|
+digits in the hex encoding of the resulting hash,
|
|
|
+the same way git organizes object files.
|
|
|
+This simple method makes it more difficult for an attacker to identify the file each sector belongs
|
|
|
+to
|
|
|
+while still allowing the sector service efficient access.
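+
+The naming scheme above can be sketched in Rust as follows.
+This is only an illustration: the function name \texttt{sector\_path}, the argument types,
+and the use of the full hex digest as the file name are assumptions,
+and \texttt{DefaultHasher} from the standard library stands in for the cryptographic hash a real
+implementation would need.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Computes the relative path of a sector file from the generation's salt,
/// the inode, and the sector ID.
///
/// NOTE: `DefaultHasher` is NOT a cryptographic hash. It is used here only so
/// the sketch has no external dependencies; a real implementation would use
/// a cryptographic hash function instead.
fn sector_path(salt: &[u8], inode: u64, sector_id: u64) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    salt.hash(&mut hasher);
    inode.hash(&mut hasher);
    sector_id.hash(&mut hasher);
    // Hex-encode the digest with zero padding so it is always 16 digits long.
    let name = format!("{:016x}", hasher.finish());
    // The first two hex digits select the subdirectory, as git does for objects.
    PathBuf::from(&name[..2]).join(&name)
}
```

+Because the salt is an input to the hash, the same inode and sector ID map to unrelated paths in different generations.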
|
|
|
+
|
|
|
+% Credential stores.
|
|
|
+Processes need a way to securely store their credentials.
|
|
|
+They accomplish this by using a credential store,
|
|
|
+which is a type that implements the \texttt{CredStore} trait.
|
|
|
+A credential store provides methods for using a process's credentials to encrypt, decrypt,
|
|
|
+sign, and verify data,
|
|
|
+but it does not allow them to be exported.
|
|
|
+A credential store also provides a method for generating new root credentials.
|
|
|
+Because root credentials represent the root of trust for an entire domain,
|
|
|
+it must be possible to securely back them up from one credential store to another.
|
|
|
+Root credentials can also be used to perform cryptographic operations without exporting them.
|
|
|
+A password is set when the root credentials are generated,
|
|
|
+and this same password must be provided to use, export, and import them.
|
|
|
+When root credentials are exported from a credential store they are confidentiality protected
|
|
|
+using multiple layers of encryption.
|
|
|
+The outermost layer is encryption by a symmetric key cipher whose key is derived from the
|
|
|
+password.
|
|
|
+A public key of the receiving credential store must also be provided when root credentials are
|
|
|
+exported.
|
|
|
+This public key is used to perform the inner encryption of the root credentials,
|
|
|
+ensuring that only the intended credential store is able to import them.
|
|
|
+Currently there are two \texttt{CredStore} implementors in Blocktree,
|
|
|
+one which is used for testing and one which is more secure.
|
|
|
+The first is called \texttt{FileCredStore},
|
|
|
+and it uses a file in the local filesystem to store credentials.
|
|
|
+A symmetric cipher is used to protect the root credentials, if they are stored,
|
|
|
+but it relies on the security of the underlying filesystem to protect the process credentials.
|
|
|
+For this reason it is not recommended for production use.
|
|
|
+The other credential store is called \texttt{TpmCredStore},
|
|
|
+and it uses a Trusted Platform Module (TPM) 2.0 on the local machine to store credentials.
|
|
|
+The TPM is used to generate the process's credentials in such a way that they can never be
|
|
|
+exported from the TPM (this is a feature of TPM 2.0).
|
|
|
+A randomly generated cookie is needed to use these credentials.
|
|
|
+The cookie is stored in a file in the local filesystem with its permissions set to prevent
|
|
|
+others from accessing it.
|
|
|
+Thus this type also relies on the security of the local filesystem.
|
|
|
+However, an attacker would need to steal both the TPM and this cookie in order to steal a process's
|
|
|
+credentials.
|
|
|
+
|
|
|
+% Manual provisioning via the command line.
|
|
|
+The term provisioning is used in Blocktree to refer to the process of acquiring credentials.
|
|
|
+A command line tool called \texttt{btprovision} is provided for provisioning credential stores.
|
|
|
+This tool can be used to generate new process or root credentials, create a certificate request
|
|
|
+using them, issue a new certificate, and finally to import the new certificate chain.
|
|
|
+When setting up a new domain,
|
|
|
+\texttt{btprovision} can create a new sector storage directory in the local filesystem
|
|
|
+and write the new process's files to it.
|
|
|
+It is also capable of connecting to the filesystem service if it is already running.
|
|
|
+
|
|
|
+% Automatic provisioning.
|
|
|
+While manual provisioning is necessary to bootstrap a domain,
|
|
|
+an automatic method is needed to make this process more ergonomic.
|
|
|
+When a runtime starts it checks its configured credential store to find the certificate chain to
|
|
|
+use for authenticating to other runtimes.
|
|
|
+If no such chain is stored,
|
|
|
+the runtime can choose to request a certificate from the filesystem service.
|
|
|
+This is done by dispatching a message with \texttt{call} to the filesystem service without
|
|
|
+specifying a scope.
|
|
|
+Because the message specifies no path, there is no root directory to begin discovery at.
|
|
|
+So, the runtime resorts to using link-local discovery to find other runtimes.
|
|
|
+Once one is discovered,
|
|
|
+the runtime connects to it anonymously
|
|
|
+and sends it a certificate request.
|
|
|
+This request includes a copy of the runtime's public key and, optionally, a path where the
|
|
|
+runtime would like to be located.
|
|
|
+This path is purely advisory;
|
|
|
+the filesystem service is free to place the runtime in any directory it sees fit.
|
|
|
+The filesystem service creates a new process file containing the public key and marks it as
|
|
|
+pending.
|
|
|
+The reply to the runtime contains the path of the file created for it.
|
|
|
+The operators of the domain can then use the web GUI or \texttt{btprovision} to view the request
|
|
|
+and approve it at their discretion.
|
|
|
+Assuming an operator approves the request,
|
|
|
+it uses its credentials and the public key in the new process's file to issue a certificate
|
|
|
+and then stores it in the file.
|
|
|
+Authorization attributes (e.g. UID and GID) are also assigned to the process and written into its
|
|
|
+file.
|
|
|
+Note that a process's file is normally not writeable by the process itself,
|
|
|
+so as to prevent it from setting its own authorization attributes.
|
|
|
+Once these data have been written to the process file,
|
|
|
+the runtime can read them to retrieve its new certificate chain.
|
|
|
+It stores this chain in its credential store for later use.
|
|
|
+The runtime can avoid polling its file for changes if it subscribes to write notifications.
|
|
|
+The runtime must close the anonymous connections it made
|
|
|
+and reconnect using the new certificate chain.
|
|
|
+Once new connections are established,
|
|
|
+it can read and write files using the authorization attributes specified in its file.
|
|
|
+Note that this procedure only works when the runtime is on the same LAN as another runtime.
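+
+The life cycle of a process file during automatic provisioning can be sketched as a small state machine.
+All of the type and function names below are hypothetical and are not taken from the Blocktree code base,
+and the \texttt{issue} closure is a placeholder for real certificate issuance.

```rust
/// Authorization attributes assigned by an operator (UID and GID are the
/// examples given in the text).
#[derive(Debug, Clone, PartialEq)]
struct AuthzAttrs {
    uid: u32,
    gid: u32,
}

/// State of a process file during automatic provisioning (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
enum ProcessFile {
    /// Created by the filesystem service when the anonymous request arrives.
    Pending { public_key: Vec<u8> },
    /// Written by an operator once the request has been approved.
    Approved {
        public_key: Vec<u8>,
        cert_chain: Vec<u8>,
        attrs: AuthzAttrs,
    },
}

/// Filesystem service side: record the request as pending.
fn handle_cert_request(public_key: Vec<u8>) -> ProcessFile {
    ProcessFile::Pending { public_key }
}

/// Operator side: issue a certificate for the pending key and assign
/// attributes. `issue` is a placeholder for real certificate issuance.
fn approve(
    file: ProcessFile,
    attrs: AuthzAttrs,
    issue: impl Fn(&[u8]) -> Vec<u8>,
) -> ProcessFile {
    match file {
        ProcessFile::Pending { public_key } => {
            let cert_chain = issue(public_key.as_slice());
            ProcessFile::Approved { public_key, cert_chain, attrs }
        }
        // Approving an already approved file leaves it unchanged.
        approved => approved,
    }
}

/// Runtime side: the chain is available only after approval.
fn cert_chain(file: &ProcessFile) -> Option<&[u8]> {
    match file {
        ProcessFile::Pending { .. } => None,
        ProcessFile::Approved { cert_chain, .. } => Some(cert_chain.as_slice()),
    }
}
```

+Only the operator-side function takes authorization attributes, which mirrors the rule that a process cannot write its own file.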
|
|
|
+
|
|
|
+% The generation of new root credentials and the creation of a new domain.
|
|
|
+The procedure for creating a new domain is straightforward,
|
|
|
+and all the steps can be performed using \texttt{btprovision}.
|
|
|
+\begin{enumerate}
|
|
|
+ \item Generate the root credentials for the new domain.
|
|
|
+ \item Generate the credentials for the first runtime.
|
|
|
+ \item Create a certificate request using the runtime credentials.
|
|
|
+ \item Approve the request using the root credentials.
|
|
|
+ \item Import the new certificate into the credential store of the first runtime.
|
|
|
+\end{enumerate}
|
|
|
+The first runtime is configured to host the sector and filesystem services,
|
|
|
+so that subsequent runtimes will have access to the filesystem.
|
|
|
+After that, additional runtimes on the same LAN can be provisioned using the automatic process.
|
|
|
+
|
|
|
+% Setting up user based access control.
|
|
|
+Up until now the focus has been on authentication and authorization of processes,
|
|
|
+but it bears discussing how user-based access control can be accomplished with Blocktree.
|
|
|
+Because credentials are locked to the device on which they're created,
|
|
|
+a user will have at least as many principals as they have devices.
|
|
|
+However, all of these principals can be configured to have the same authorization attributes (UID, GID),
|
|
|
+giving them the same permissions.
|
|
|
+It makes sense to keep the files for all of the provisioned runtimes associated with a user in one
|
|
|
+place
|
|
|
+and the natural place is in the user's home directory.
|
|
|
+Although every one of the user's processes needs to be provisioned,
|
|
|
+this is not a huge limitation because a single runtime can host many different actors,
|
|
|
+implementing many different applications.
|
|
|
+Managing the users in a domain is facilitated by putting their home directories in a single user
|
|
|
+directory for the domain.
|
|
|
+Runtimes hosting the sector service on storage servers could then be provisioned in this directory
|
|
|
+to provide the sector and filesystem services for the users' home directories.
|
|
|
+It would be at the administrator's discretion whether or not to enable client-side encryption.
|
|
|
+If they wanted to,
|
|
|
+the principal of at least one of a user's runtimes would need to be issued a readcap for the
|
|
|
+user's home directory.
|
|
|
+This runtime could then directly access the sector service in the domain's user directory.
|
|
|
+By moving encryption onto the user's computer,
|
|
|
+load can be shed from the storage servers.
|
|
|
+Note that this setup does require all of the user's runtimes to be able to communicate with the
|
|
|
+runtime whose principal was issued the readcap.
|
|
|
+
|
|
|
+% Example of how these mechanisms allow data to be shared.
|
|
|
+To illustrate how these mechanisms can be used to facilitate collaboration between enterprises,
|
|
|
+consider a situation where two companies wish to partner on the development of a product.
|
|
|
+To facilitate their collaboration,
|
|
|
+they wish to have a way to securely exchange data with each other.
|
|
|
+One of the companies is selected to host the data
|
|
|
+and accepts the cost and responsibility of serving it.
|
|
|
+The host company creates a directory which will be used to store all of the data created during
|
|
|
+development.
|
|
|
+The other company will connect to the filesystem service in the host company's domain to access
|
|
|
+data in the shared directory.
|
|
|
+Each principal in the other company that wishes to connect requests to be credentialed in the
|
|
|
+shared directory.
|
|
|
+The hosting company manually reviews these requests and approves them,
|
|
|
+assigning each of the principals authorization attributes appropriate for its domain.
|
|
|
+This may involve issuing UID and GID values to each of the principals, or perhaps SELinux contexts.
|
|
|
+The actual set of attributes supported is determined by the \texttt{Authorization} type used by
|
|
|
+the filesystem service in the host company's domain.
|
|
|
+Once the principals have their credentials,
|
|
|
+they can dispatch messages to the filesystem service using the shared directory as the scope and
|
|
|
+setting the rootward field to true.
|
|
|
+This allows actors authenticating with the credentials of these principals to perform all filesystem
|
|
|
+operations authorized by the hosting company.
|
|
|
+This situation gives the hosting company a lot of control over the data.
|
|
|
+If the other company wishes to protect its investment in the R\&D effort,
|
|
|
+it should subscribe to write events on the shared directory and the files in it so that it can
|
|
|
+copy new sectors out of the host company's domain as soon as they are written.
|
|
|
+Note that although it is not possible to directly subscribe to writes on the contents of a
|
|
|
+directory, by monitoring a directory for changes,
|
|
|
+one can begin monitoring files as soon as they are created.
|
|
|
+
|
|
|
+
|
|
|
+\section{Examples}
|
|
|
+This section contains examples of systems that could be built using Blocktree.
|
|
|
+The hope is to illustrate how this platform can be used to implement existing applications more
|
|
|
+easily and to make it possible to implement systems which are currently out of reach.
|
|
|
+
|
|
|
+\subsection{A distributed AI execution environment.}
|
|
|
+Neural networks are just vector-valued functions with vector inputs,
|
|
|
+albeit very complicated ones with potentially billions of parameters.
|
|
|
+But, just like any other computation,
|
|
|
+these functions can be conceptualized as computational graphs.
|
|
|
+Imagine that you have a set of computers equipped with AI accelerator hardware
|
|
|
+and you have a neural network that is too large to be processed by any one of them.
|
|
|
+By partitioning the graph into small enough subgraphs,
|
|
|
+we can break the network down into pieces which can be processed by each of the accelerators.
|
|
|
+The full network can be stitched together by passing messages between each of these pieces.
|
|
|
+
|
|
|
+Let us consider how this could be accomplished with Blocktree.
|
|
|
+We begin by provisioning a runtime on each of the accelerator machines,
|
|
|
+each of which will have a new accelerator service registered.
|
|
|
+Messages will be sent to the accelerator service describing the computational graph to execute,
|
|
|
+as well as the name of the actor to which the output is to be sent.
|
|
|
+When such a message is received by an accelerator service provider,
|
|
|
+it spawns an actor which compiles its subgraph to a kernel for its accelerator
|
|
|
+and remembers the name of the actor to send its output to.
|
|
|
+An orchestrator service will be responsible for partitioning the graph and sending these messages.
|
|
|
+Ownership of the actors spawned by the accelerator service is given to the orchestrator service,
|
|
|
+ensuring that they will all be stopped when the orchestrator returns.
|
|
|
+When one of the spawned actors stops,
|
|
|
+it unloads the kernel from the accelerator's memory and returns it to its initial state.
|
|
|
+Note that the orchestrator actor must have execute permissions on each of the accelerator runtimes
|
|
|
+in order to send messages to them.
|
|
|
+The orchestrator dispatches messages to the accelerator service in reverse order of the flow of data
|
|
|
+in the computational graph,
|
|
|
+so that it can tell each service provider where its output should be sent.
|
|
|
+The actors responsible for the last layer in the computational graph send their output to the
|
|
|
+orchestrator.
|
|
|
+To begin the computation,
|
|
|
+the actors which are responsible for input are given the filesystem path of the input data.
|
|
|
+The orchestrator learns of the completion of the computation once it receives the output from
|
|
|
+the final layer.
|
|
|
+It can then save these results to the filesystem and return.
|
|
|
+Because inference and training can both be modeled by computational graphs,
|
|
|
+this same procedure can be used for both.
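+
+The reverse-order dispatch described above can be sketched for the simplest case, a linear chain of subgraphs.
+The \texttt{Dispatch} type, the \texttt{plan\_dispatches} function, and the actor naming scheme are all assumptions made for illustration;
+a general computational graph would need a reverse topological order instead of a reversed list.

```rust
/// One message from the orchestrator to the accelerator service
/// (hypothetical shape, for illustration only).
#[derive(Debug, PartialEq)]
struct Dispatch {
    /// Subgraph the spawned actor should compile and run.
    subgraph: String,
    /// Name of the actor that receives this subgraph's output.
    output_to: String,
}

/// Plans dispatches for a linear chain of subgraphs given in data-flow order.
/// Messages are produced last-stage-first, so the destination of every stage
/// is already known when its message is sent.
fn plan_dispatches(subgraphs: &[&str], orchestrator: &str) -> Vec<Dispatch> {
    let mut plan = Vec::new();
    // The final stage reports back to the orchestrator.
    let mut next = orchestrator.to_string();
    for name in subgraphs.iter().rev() {
        plan.push(Dispatch {
            subgraph: name.to_string(),
            output_to: next.clone(),
        });
        // Assumption: the spawned actor is simply named after its subgraph.
        next = format!("actor-{}", name);
    }
    plan
}
```

+In a real system the actor name would come from the reply to each dispatch rather than from a naming convention.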
|
|
|
+
|
|
|
+\subsection{A decentralized social media network.}
|
|
|
+One of the original motivations for designing Blocktree was to create a platform for a social
|
|
|
+network that puts users fully in control of their data.
|
|
|
+In the opinion of the author,
|
|
|
+the only way to actually accomplish this is for users to host the data themselves.
|
|
|
+One might think it is possible to use client-side encryption to solve the privacy issue,
|
|
|
+but this does not solve the full problem.
|
|
|
+While it is true that good client-side encryption will prevent the service provider from reading
|
|
|
+the user's data,
|
|
|
+the user could still lose everything if the service provider goes out of business or simply
|
|
|
+decides to stop offering its service.
|
|
|
+Similarly, putting data in a federated system, as has been proposed by the Mastodon developers,
|
|
|
+also puts the user at risk of losing their data if the operator of the server they use decides to
|
|
|
+shut it down.
|
|
|
+To have real control the user must host the data themselves.
|
|
|
+Then they decide how it's encrypted, how it's served, and to whom.
|
|
|
+
|
|
|
+Let us explore how Blocktree can be used to build a social media platform which provides this
|
|
|
+control.
|
|
|
+To participate in this network each user will need to set up their own domain by generating new root
|
|
|
+credentials
|
|
|
+and provisioning at least one runtime to host the social media service.
|
|
|
+A technical user could do this on their own hardware by reading the Blocktree documentation,
|
|
|
+but a non-technical user might choose to purchase a new router with Blocktree pre-installed.
|
|
|
+By connecting this router directly to their WAN,
|
|
|
+the user ensures that the services running on it will always have direct internet access.
|
|
|
+The user can access the \texttt{btconsole} web GUI via the router's WiFi interface to generate their
|
|
|
+root credentials and provision new runtimes on their network.
|
|
|
+
|
|
|
+A basic function of any social network is keeping track of a user's contacts.
|
|
|
+This would be handled by maintaining the contacts as files in a well-known directory in the user's
|
|
|
+domain.
|
|
|
+Each file in the directory would be named using the user defined nickname for the contact
|
|
|
+and its contents would include the root principal of the contact as well as any additional user
|
|
|
+defined attributes,
|
|
|
+such as address or telephone number.
|
|
|
+The root principal would be used to discover runtimes controlled by the contact
|
|
|
+so that messages can be sent to the social media service running in them.
|
|
|
+When a user adds a new contact,
|
|
|
+a connection message would be sent to it,
|
|
|
+which the contact could choose to accept or reject.
|
|
|
+If accepted,
|
|
|
+the contact would create an entry in its contacts directory for the user.
|
|
|
+The contact's social media service would then accept future direct messages from the user.
|
|
|
+When the user sends a direct message to the contact,
|
|
|
+its runtime discovers runtimes controlled by the contact and delivers the message.
|
|
|
+Once delivered, the contact's social media service stores the message in a directory for the user's
|
|
|
+correspondence,
|
|
|
+sort of like an mbox directory but where messages are sorted into directories based on sender
|
|
|
+instead of receiver.
|
|
|
+
|
|
|
+Note that this procedure only works if a contact's root principal can be resolved using the
|
|
|
+search domain configured in the user's runtime.
|
|
|
+We can ensure this is the case by configuring the runtime to use a search domain that operates
|
|
|
+a Dynamic DNS (DDNS) service
|
|
|
+and by arranging with this service to create the correct records to resolve the root principal.
|
|
|
+The author intends to operate such a service to facilitate the use of Blocktree by home users,
|
|
|
+but a more long-term solution is to implement a blockchain for resolving root principals.
|
|
|
+Only then would the system be fully decentralized.
|
|
|
+
|
|
|
+Making public posts is accomplished by creating files in a directory with the HTML contents of the
|
|
|
+post.
|
|
|
+This file, the directory containing it, and all parents of it,
|
|
|
+would be configured to allow others to read, and in the case of directories, execute them.
|
|
|
+At least one runtime with the filesystem service registered would need to have the execute
|
|
|
+permission granted to others to allow anyone to access these files.
|
|
|
+When someone wanted to view the posts of another user,
|
|
|
+they would use the filesystem service to read these files from the well-known posts directory.
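+
+The permission requirements for a public post can be checked with a sketch like the one below.
+It assumes Unix-style mode bits, which is one possible authorization scheme rather than the only one,
+and the function name \texttt{publicly\_readable} and the map of paths to modes are hypothetical.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Unix-style permission bits for principals that match neither the owner
/// nor the group of a file (assumed authorization model).
const OTHER_READ: u32 = 0o004;
const OTHER_EXEC: u32 = 0o001;

/// Returns true if an arbitrary principal could read `file`.
/// `modes` maps absolute paths to mode bits; missing entries deny access.
fn publicly_readable(modes: &BTreeMap<&str, u32>, file: &str) -> bool {
    // Every ancestor directory must grant execute (traversal) to others.
    // `ancestors()` yields the path itself first, so skip that entry.
    let dirs_ok = Path::new(file).ancestors().skip(1).all(|dir| {
        dir.to_str()
            .and_then(|d| modes.get(d))
            .map_or(false, |mode| (mode & OTHER_EXEC) != 0)
    });
    // The file itself must grant read to others.
    let file_ok = modes
        .get(file)
        .map_or(false, |mode| (mode & OTHER_READ) != 0);
    dirs_ok && file_ok
}
```

+A single private directory anywhere on the path is enough to hide everything beneath it.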
|
|
|
+
|
|
|
+Of course, users would not be using a file manager to interact with this social network;
|
|
|
+they would use their browsers as they do now.
|
|
|
+This web interface would be served by the social media service in their domain.
|
|
|
+A normal user who has a Blocktree-enabled router would just type a special hostname into their
|
|
|
+browser to open this interface.
|
|
|
+Because the router provides DNS services to their network,
|
|
|
+it can generate the appropriate records to ensure this name resolves to the address where the social
|
|
|
+media service is listening.
|
|
|
+The social media service would be responsible for sending messages to other users' domains to
|
|
|
+get their posts,
|
|
|
+and for reading the filesystem to display the user's direct messages.
|
|
|
+All this file data would be used to populate the web interface.
|
|
|
+It is not hard to see how the same system could be used to serve any type of media: text, images,
|
|
|
+video, immersive 3D worlds.
|
|
|
+All of these can be stored in files in the filesystem,
|
|
|
+and so all of them are accessible to Blocktree actors.
|
|
|
+
|
|
|
+One issue that must be addressed with this design is how it will scale to a large number of users
|
|
|
+accessing data at once.
|
|
|
+In other words,
|
|
|
+what happens if the user goes viral?
|
|
|
+Currently, the way to solve this would be to add more computers to the user's network which run
|
|
|
+the sector and filesystem services.
|
|
|
+This is not ideal as it means the user would need to buy more hardware to serve their dank memes.
|
|
|
+A better solution would be to implement peer-to-peer distribution of sector data in the filesystem
|
|
|
+service.
|
|
|
+This would reduce the load on the user's computers and allow their followers to share the posted
|
|
|
+data with each other.
|
|
|
+This work is planned as a future enhancement.
|
|
|
+
|
|
|
+\subsection{A smart lock.}
|
|
|
+The access control language provided by Blocktree's filesystem can be used for more than just
|
|
|
+authorizing access to data.
|
|
|
+To illustrate this point,
|
|
|
+consider a smart lock installed on the front door of a company's office building.
|
|
|
+When the company first got the lock they used NFC to configure the lock
|
|
|
+and connect it to their WiFi network.
|
|
|
+The lock then used link-local runtime discovery to perform automatic provisioning.
|
|
|
+An IT administrator accessed \texttt{btconsole} to approve the provisioning request
|
|
|
+and position the lock in a specific directory in the company's domain.
|
|
|
+Permission to actuate the lock is granted if a principal has execute permission on the lock's file.
|
|
|
+To verify the physical presence of an employee,
|
|
|
+NFC is used for the authentication handshake.
|
|
|
+When an employee presses their NFC device, for instance their phone, to the lock,
|
|
|
+it generates a nonce and transmits it to the device.
|
|
|
+The device then signs the nonce using the credentials it used during provisioning in the company's
|
|
|
+domain.
|
|
|
+It transmits this signature to the lock along with the path to the principal's file in the domain.
|
|
|
+The lock then reads this file to obtain the principal's authorization attributes and its public key.
|
|
|
+It uses the public key to validate the signature presented by the device.
|
|
|
+If this is successful,
|
|
|
+it then checks the authorization attributes of the principal against the authorization attributes on
|
|
|
+its own file.
|
|
|
+If execute permissions are granted, the lock actuates, allowing the employee access.
|
|
|
+The administrators of the company's domain create a group specifically for controlling physical
|
|
|
+access to the building.
|
|
|
+All employees with physical access permission are added to this group,
|
|
|
+and the group is granted execute permission on the lock,
|
|
|
+rather than individual users.
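+
+The lock's decision can be sketched as follows, again assuming Unix-style owner, group, and other permission classes.
+The struct and function names are hypothetical, the list of group IDs per principal is an assumption,
+and signature verification is reduced to a boolean because it depends on the credential store in use.

```rust
/// Authorization attributes read from the principal's file (assumed model).
struct Principal {
    uid: u32,
    /// Groups the principal belongs to, e.g. the physical access group.
    gids: Vec<u32>,
}

/// Ownership and mode bits on the lock's own file (assumed model).
struct LockFile {
    owner: u32,
    group: u32,
    mode: u32,
}

/// Returns true if `who` holds execute permission on the lock's file,
/// selecting the owner, group, or other class the way Unix does.
fn may_execute(who: &Principal, lock: &LockFile) -> bool {
    let bit = if who.uid == lock.owner {
        0o100
    } else if who.gids.contains(&lock.group) {
        0o010
    } else {
        0o001
    };
    (lock.mode & bit) != 0
}

/// The lock's decision: the signature over the nonce is checked first, then
/// authorization. `signature_valid` is the result of that verification.
fn should_actuate(signature_valid: bool, who: &Principal, lock: &LockFile) -> bool {
    signature_valid && may_execute(who, lock)
}
```

+Granting access to a new employee then amounts to adding the access group to that employee's principals, with no change to the lock's file.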
|
|
|
+
|
|
|
+\subsection{A traditional three-tier web application.}
|
|
|
+While it is hoped that Blocktree will enable interesting and novel applications,
|
|
|
+it can also be used to build the kind of web applications that are common today.
|
|
|
+Suppose that we wish to build a three-tier web application.
|
|
|
+Let us explore how Blocktree could help.
|
|
|
+
|
|
|
+First, let us consider which database to use.
|
|
|
+It would be desirable to use a traditional SQL database,
|
|
|
+preferably one which is open source and not owned by a large corporation with dubious motivations.
|
|
|
+These constraints lead us to choose Postgres,
|
|
|
+but Postgres was not designed to run on Blocktree.
|
|
|
+However, Postgres does have a container image available on Docker Hub,
|
|
|
+so we can create a service to run this container image in our domain.
|
|
|
+But Postgres stores all of its data in the local filesystem of the machine it runs on.
|
|
|
+How can we ensure this does not become a single point of failure?
|
|
|
+First, we should create a directory in our domain to hold the Postgres cluster directory.
|
|
|
+Then we should procure at least three servers for our storage cluster
|
|
|
+and provision runtimes hosted on each of them in this directory.
|
|
|
+The sector service is registered on each of the runtimes,
|
|
|
+so all the data stored in the directory will be replicated on each of the servers.
|
|
|
+Now, the Postgres service should be registered in one and only one of these runtimes,
|
|
|
+as Postgres requires exclusive access to its database cluster.
|
|
|
+\texttt{btfuse} will be used to mount the Postgres directory to a path in the local filesystem
|
|
|
+and the Postgres container will be configured to access it.
|
|
|
+We now have to decide how other parts of the system are going to communicate with Postgres.
|
|
|
+We could have the Postgres service set up port forwarding for the container,
|
|
|
+so that ordinary network connections can be used to talk to it.
|
|
|
+But we will have to set up TLS if we want this to be secure.
|
|
|
+The alternative is to use Blocktree as a VPN and proxy network communications in messages.
|
|
|
+This is accomplished by registering a proxy service in the same runtime as the Postgres service
|
|
|
+and configuring it to allow traffic it receives to pass to the Postgres container on TCP port 5432.
|
|
|
+
|
|
|
+In a separate directory,
|
|
|
+a collection of runtimes is provisioned which will host the webapp service.
|
|
|
+This service will use axum to serve the static assets for our site,
|
|
|
+including the Wasm modules which make up our frontend,
|
|
|
+as well as our site's backend.
|
|
|
+In order to do this,
|
|
|
+it will need to connect to the Postgres database.
|
|
|
+This is accomplished by registering the proxy service in each of the runtimes hosting the
|
|
|
+webapp service.
|
|
|
+The proxy service is configured to listen on TCP 127.0.0.1:5432 and forwards all traffic
|
|
|
+to the proxy service in the Postgres directory.
|
|
|
+The webapp can then use the \texttt{tokio-postgres} crate to establish a TCP connection to
|
|
|
+127.0.0.1:5432
|
|
|
+and it will end up talking to the containerized Postgres instance.
|
|
|
+
|
|
|
+Although the data in our database is stored redundantly,
|
|
|
+we do still have a single point of failure in our system,
|
|
|
+namely the Postgres container.
|
|
|
+To handle this we can implement a failover service.
|
|
|
+It will work by calling the Postgres service with heartbeat messages.
|
|
|
+If too many of these time out,
|
|
|
+we assume the service is dead and start a new instance in one of the other runtimes in the Postgres
|
|
|
+directory.
|
|
|
+This new instance will have access to all the same data as the old one,
|
|
|
+including its journal file.
|
|
|
+Assuming it can complete any in-progress transactions,
|
|
|
+the new service will come up after a brief delay
|
|
|
+and the system will recover.
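+
+The decision logic of such a failover service can be sketched as a small counter.
+The text only says that too many timeouts trigger a failover,
+so counting consecutive timeouts against a configurable limit is an assumption,
+as are the names \texttt{HeartbeatMonitor} and \texttt{record}.

```rust
/// Tracks heartbeat results and decides when to fail over (hypothetical type).
struct HeartbeatMonitor {
    /// Consecutive timeouts tolerated before the service is presumed dead.
    max_missed: u32,
    /// Consecutive timeouts seen so far.
    missed: u32,
}

impl HeartbeatMonitor {
    fn new(max_missed: u32) -> Self {
        HeartbeatMonitor { max_missed, missed: 0 }
    }

    /// Records the outcome of one heartbeat call. Returns true when a new
    /// Postgres instance should be started in another runtime.
    fn record(&mut self, replied: bool) -> bool {
        if replied {
            // Any reply resets the count, so isolated timeouts are tolerated.
            self.missed = 0;
        } else {
            self.missed += 1;
        }
        self.missed > self.max_missed
    }
}
```

+Resetting the count on every reply keeps a brief network hiccup from starting a second instance while the first still holds the database cluster.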
|
|
|
+
|
|
|
+\subsection{A realtime geospatial environment.}
|
|
|
+% Motivation
|
|
|
+If we are to believe science fiction,
|
|
|
+then the natural evolution of computer interaction is the development
|
|
|
+of a persistent virtual world that we use to communicate, conduct business, and
|
|
|
+enjoy our leisure.
|
|
|
+This kind of system has been a dream for a long time,
|
|
|
+but as it has grown closer to becoming a reality,
|
|
|
+the popular consciousness has shifted against it.
|
|
|
+People are rightly horrified by the idea of giving control over their virtual worlds to the same
|
|
|
+social media company which has an established track record for causing societal harm.
|
|
|
+But this technology does not need to be dystopian.
|
|
|
+If an open system can be built, which actually works,
|
|
|
+it can prevent the market from accepting a closed system designed to lock in user attention
|
|
|
+and monetize them relentlessly.
|
|
|
+This is the future,
|
|
|
+it is only a question of who will own it.
|
|
|
+
|
|
|
+% Coordinates
|
|
|
+Let us explore how Blocktree could be used to build such a system.
|
|
|
+The world we are going to render will be a planet with a roughly spherical surface and a
|
|
|
+configurable radius $\rho$.
|
|
|
+$\rho$ is a \texttt{u32} value whose units are meters.
|
|
|
+We will use latitude ($\phi$) and longitude ($\lambda$) in radians to describe the locations of
|
|
|
+points on the surface.
|
|
|
+Both $\phi$ and $\lambda$ will take \texttt{f64} values.
|
|
|
+The elevation of a point will be given by $h$,
|
|
|
+which is the deviation from $\rho$.
|
|
|
+$h$ is measured in meters and takes values in \texttt{i32}.
|
|
|
+So, the distance from the center of the planet to the point ($\phi$, $\lambda$, $h$) is
|
|
|
+$\rho + h$.
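+
+These definitions map directly onto Rust types.
+The sketch below uses the integer and floating point types named above;
+the struct and function names are hypothetical,
+and the axis convention used for the Cartesian conversion is an assumption.

```rust
/// A point on or near the planet's surface.
#[derive(Debug, Clone, Copy)]
struct Point {
    /// Latitude in radians.
    phi: f64,
    /// Longitude in radians.
    lambda: f64,
    /// Elevation in meters, measured as the deviation from the radius rho.
    h: i32,
}

/// Distance in meters from the planet's center to the point: rho + h.
/// Widening both operands to i64 avoids overflow when mixing u32 and i32.
fn radial_distance(rho: u32, p: Point) -> i64 {
    i64::from(rho) + i64::from(p.h)
}

/// Cartesian coordinates with the origin at the planet's center, the z axis
/// through the north pole, and the x axis through phi = 0, lambda = 0.
/// (This axis convention is an assumption, not part of the proposal.)
fn to_cartesian(rho: u32, p: Point) -> [f64; 3] {
    let r = radial_distance(rho, p) as f64;
    [
        r * p.phi.cos() * p.lambda.cos(),
        r * p.phi.cos() * p.lambda.sin(),
        r * p.phi.sin(),
    ]
}
```

+A renderer would typically perform a conversion like this once per vertex when building meshes from stored coordinates.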
|
|
|
+
|
|
|
+% Directory organization. Quadtrees.
|
|
|
+The data describing how to render a planet consists of its terrain mesh, terrain textures, and
|
|
|
+the objects on its surface.
|
|
|
+This could represent a very large amount of data for a planet with realistic terrain populated by
|
|
|
+many structures.
|
|
|
+To facilitate sharding the information in a planet over many different servers,
|
|
|
+the planet is broken into disjoint regions,
|
|
|
+each of which is stored in its own directory.
|
|
|
+A single top-level directory represents the entire planet,
|
|
|
+and contains a manifest describing it.
|
|
|
+This manifest specifies the planet's name, its radius, its rotational period,
|
|
|
+the size of its regions in MB, as well as any
|
|
|
+other global attributes.
|
|
|
+This top-level directory also contains the texture for the sky box to render the view of
|
|
|
+space from the planet.
|
|
|
+In the future it may be interesting to explore the creation of more dynamic environments surrounding
|
|
|
+the planet,
|
|
|
+but a simple sky box has the advantage of being efficient.
|
|
|
+The data in a planet is recursively broken into the fewest number of regions such that the
|
|
|
+amount of data in each region is less than a configured threshold.
|
|
|
+When a region grows too large it is broken into four new regions by cutting it along the
|
|
|
+centerline parallel to the $\phi$ axis, and the one parallel to the $\lambda$ axis.
|
|
|
+In other words, it is divided in half north to south and east to west.
|
|
|
+The four new regions are stored in four subdirectories of the original region's directory
|
|
|
+named 0, 1, 2, and 3.
|
|
|
+The data in the old region is then moved into the appropriate directory based on its location.
|
|
|
+Thus the directory tree of a planet essentially forms a quadtree,
|
|
|
+albeit one which is built up progressively.
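+
+The mapping from a surface point to a region directory can be sketched as follows.
+The text does not say which quadrant receives which of the names 0 to 3,
+so the numbering below is an assumption, as are the type and function names.
+For simplicity the sketch descends to a fixed depth,
+whereas the real tree is only as deep as the data in each area requires.

```rust
/// A region bounded in latitude (phi) and longitude (lambda), in radians.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Region {
    phi_min: f64,
    phi_max: f64,
    lambda_min: f64,
    lambda_max: f64,
}

impl Region {
    /// The region covering the whole planet.
    fn planet() -> Self {
        use std::f64::consts::{FRAC_PI_2, PI};
        Region { phi_min: -FRAC_PI_2, phi_max: FRAC_PI_2, lambda_min: -PI, lambda_max: PI }
    }

    /// Returns the index (0 to 3) of the child containing the point, together
    /// with that child's bounds. Assumed numbering: bit 0 is set for the east
    /// half and bit 1 is set for the north half.
    fn child_for(&self, phi: f64, lambda: f64) -> (u8, Region) {
        let phi_mid = (self.phi_min + self.phi_max) / 2.0;
        let lambda_mid = (self.lambda_min + self.lambda_max) / 2.0;
        let north = phi >= phi_mid;
        let east = lambda >= lambda_mid;
        let child = Region {
            phi_min: if north { phi_mid } else { self.phi_min },
            phi_max: if north { self.phi_max } else { phi_mid },
            lambda_min: if east { lambda_mid } else { self.lambda_min },
            lambda_max: if east { self.lambda_max } else { lambda_mid },
        };
        (((north as u8) << 1) | (east as u8), child)
    }
}

/// Relative directory path of the region at `depth` that contains the point.
fn region_dir(phi: f64, lambda: f64, depth: usize) -> String {
    let mut region = Region::planet();
    let mut parts = Vec::with_capacity(depth);
    for _ in 0..depth {
        let (index, child) = region.child_for(phi, lambda);
        parts.push(index.to_string());
        region = child;
    }
    parts.join("/")
}
```

+With this numbering, a point slightly north of the equator and slightly west of the prime meridian falls in the directory \texttt{2/1} at depth two.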
|
|
|
+
|
|
|
+% Region data files.
|
|
|
+In the leaf directories of this tree the actual data for a region are stored in two files,
|
|
|
+one which describes the terrain and the other which describes objects.
|
|
|
+It is expected that the terrain will rarely be modified,
|
|
|
+but that the objects may change regularly.
|
|
|
+The terrain file contains the mesh vertices in the region as well as its textures.
|
|
|
+It is organized as an R-tree to allow for efficient spatial queries based on player location.
|
|
|
+The region's objects file is also organized as an R-tree.
|
|
|
+It contains all of the graphical data for the objects to be rendered in the region,
|
|
|
+such as meshes, textures, and shaders.
|
|
|
+
|
|
|
+% Plots.
|
|
|
+The creation of a shared virtual world must involve players collaboratively building persistent
|
|
|
+structures.
|
|
|
+This is allowed in a controlled way by defining plot objects.
|
|
|
+A plot is like a symbolic link:
|
|
|
+it points to a file whose contents contain the data used to render the plot.
|
|
|
+This mechanism allows the owner of the planet to delegate a specific area on the surface
|
|
|
+to another player by creating a plot defining that area and pointing it to a file owned by the
|
|
|
+player.
|
|
|
+The other player can then write meshes, textures, and shaders into this file to describe the
|
|
|
+contents of the plot.
|
|
|
+If the other player wishes to collaborate with others on the construction,
|
|
|
+they can grant write access on the file to a third party.
|
|
|
+This is not unlike the ownership of land in the real world.
|
|
|
+
|
|
|
+% LOD files in interior directories.
|
|
|
+To facilitate the viewing of the planet from many distances,
|
|
|
+each interior node in the planet's directory tree contains a reduced level of detail (LOD) version
|
|
|
+of the terrain contained in it.
|
|
|
+For example, the top-level directory contains the lowest LOD mesh and textures for the terrain.
|
|
|
+This LOD would be suitable for rendering the planet as a globe on a shelf,
|
|
|
+or as it would appear from a high orbit.
|
|
|
+By traversing the directory tree,
|
|
|
+the LOD can be increased as the player travels closer to the surface.
|
|
|
+This system assists with rendering an animation where the player appears to approach and land upon
|
|
|
+the planet's surface.
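+
+% Region size by depth (illustrative).
+To make the relationship between tree depth and detail concrete,
+assume the top-level directory covers the whole planet,
+with $\phi$ spanning an interval of width $\pi$ and $\lambda$ spanning an interval of width $2\pi$
+(this coordinate convention is an assumption made here for illustration).
+Because every split halves a region along both axes,
+a directory at depth $d$ below the top level covers
+\[
+  \Delta\phi_d = \frac{\pi}{2^d}, \qquad
+  \Delta\lambda_d = \frac{2\pi}{2^d},
+\]
+and at most $4^d$ directories exist at that depth.
+For example, at $d = 10$ a region spans $\pi / 1024 \approx 0.0031$ radians of $\phi$,
+or roughly $0.18$ degrees.
+Since the tree is built up progressively,
+sparsely populated areas may stop at a shallow depth while dense areas continue to deeper levels.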
|
|
|
+
|
|
|
+% Sharding planet data.
|
|
|
+By dividing the planet's data into different leaf directories,
|
|
|
+it becomes possible to provision computers running the sector service in each of them.
|
|
|
+This divides the storage and bandwidth requirements for serving the planet over this set of
|
|
|
+servers.
|
|
|
+In addition to serving these data,
|
|
|
+another service is needed to keep track of player positions and execute game logic.
|
|
|
+Game clients address their messages using the directory of the region their player is located
|
|
|
+in, and set \texttt{rootward} to true.
|
|
|
+These messages are delivered to the closest game server to the region the player is in,
|
|
|
+which may be located in the region's directory or higher up the tree.
|
|
|
+When a player transitions from one region to the next,
|
|
|
+its game client begins addressing messages using the path of the next region as the scope.
|
|
|
+
|
|
|
+
|
|
|
+\section{Conclusion}
|
|
|
+% Blocktree serves as the basis for building distributed Unix.
|
|
|
+There have been many attempts to create a distributed Unix over the years.
|
|
|
+Time has shown that this is a very hard problem,
|
|
|
+but time has not diminished its importance.
|
|
|
+IT systems are more complex now than ever,
|
|
|
+with many layers of abstraction which have built up over time.
|
|
|
+We have suffered greatly from systems which were never designed to be secure on the hostile internet
|
|
|
+that exists today.
|
|
|
+Security has been bolted onto these systems (HTTPS, STARTTLS, DNSSEC)
|
|
|
+in a backwards compatible way,
|
|
|
+which results in weakened protections for these systems.
|
|
|
+What's worse,
|
|
|
+the entire trust model of the web relies on the ludicrous idea that there is a distinguished group
|
|
|
+of certificate authorities who have the power to secure our communications.
|
|
|
+We need to take a different approach.
|
|
|
+Data should be certified by its path,
|
|
|
+it must always be transported between processes in an authenticated manner,
|
|
|
+and user code should never have to care how this is accomplished!
|
|
|
+Time will tell whether the programming model of Blocktree is comprehensible and useful for
|
|
|
+developers,
|
|
|
+but the goal is to create the kind of easy-to-extend computing environment which allowed Unix to
|
|
|
+be successful.
|
|
|
+
|
|
|
+% The system enables individuals to self-host the services they rely on.
|
|
|
+These days, the typical internet user stores all of their important data in the cloud with
|
|
|
+third-party service providers.
|
|
|
+They do this because of the convenience of being able to access this information from anywhere,
|
|
|
+and because of the perceived safety in having a large internet company look after it for them.
|
|
|
+This convenience comes at the price of putting users at the mercy of these companies.
|
|
|
+Take email for example,
|
|
|
+a service which is universally used for account recovery and password reset.
|
|
|
+If a service provider decided to stop providing a user access to their email,
|
|
|
+the user would be effectively cut off from any website which sends login verification emails.
|
|
|
+This is not a hypothetical situation;
|
|
|
+such an incident has occurred (TODO: INSERT CITATION FROM LVL1).
|
|
|
+There is no technical reason for things to be this way.
|
|
|
+Blocktree allows users to host their own services in their own domain.
|
|
|
+If we can make setting up an email or VOIP server as simple as clicking a button in a web GUI,
|
|
|
+there will be no convenience advantage to cloud services.
|
|
|
+One challenge for self-hosting data is ensuring it is protected from loss when hardware inevitably
|
|
|
+fails.
|
|
|
+The data redundancy in Blocktree's sector layer ensures that the loss of any one storage
|
|
|
+device will not cause data loss.
|
|
|
+Streaming replication can also be used to maintain additional redundant copies.
|
|
|
+If more users begin hosting their own services,
|
|
|
+the internet will become more distributed,
|
|
|
+which will make it more resistant to disruption and centralized control.
|
|
|
+
|
|
|
+% Benefits to businesses.
|
|
|
+Cloud computing has also driven changes to the way businesses acquire computing resources.
|
|
|
+It is common for startups to rent all of their computing resources from one large cloud
|
|
|
+provider.
|
|
|
+There are compelling economic and technical reasons to do this,
|
|
|
+but as a firm grows they often experience growing pains as their cloud bills also grow.
|
|
|
+If the firm has not developed their software with a multi-cloud or hybrid approach in mind,
|
|
|
+they may face the prospect of major changes in order to bring their application on-prem or to a
|
|
|
+rival cloud.
|
|
|
+By developing their application on Blocktree,
|
|
|
+businesses have a single platform to target which can run on rented computers in the cloud just as
|
|
|
+easily as servers in their own data center.
|
|
|
+This ensures the choice to rent or buy can be made on a purely economic basis.
|
|
|
+Blocktree is not the only system that provides this flexibility.
|
|
|
+The portability of containers is one of the reasons they have become so popular.
|
|
|
+Containers have their place and will most likely be used for years to come,
|
|
|
+but they are a lower-level abstraction which requires the developer to solve the problems that Blocktree
|
|
|
+handles.
|
|
|
+
|
|
|
+% Blocktree advances the status quo in secure computing.
|
|
|
+Ransomware attacks and data breaches are embarrassingly common these days.
|
|
|
+There are many reasons for this,
|
|
|
+from the reliance on passwords for authentication, to the complexity of the software supply chain,
|
|
|
+but it is clear that as IT professionals we need to do more to keep the systems under our
|
|
|
+protection safe.
|
|
|
+Blocktree helps to do this by solving many of the difficult problems involved with securing
|
|
|
+communication on a hostile network.
|
|
|
+It takes a true zero-trust approach,
|
|
|
+ensuring that all communication between processes is authenticated using public key cryptography.
|
|
|
+Data at rest is also secured with encryption and integrity protection.
|
|
|
+No security system can prevent all attacks,
|
|
|
+but by putting these mechanisms together in an easy-to-use platform,
|
|
|
+we can advance the status quo of secure computing.
|
|
|
+
|
|
|
+% Composability leads to emergent benefits.
|
|
|
+When Unix was first developed in the 1970s, its authors could not have foreseen the applications
|
|
|
+that would be enabled by their system.
|
|
|
+Although there have been many different kinds of Unices over the years,
|
|
|
+the core programming model, built around the filesystem, has remained since the beginning.
|
|
|
+It is a testament to the importance of this abstraction that it has persisted for so long.
|
|
|
+No designer can foresee all the ways that their abstractions will be used,
|
|
|
+but they can try to build them in such a way that as much choice is left to the user as possible.
|
|
|
+By making the actor model, and message passing, the core of Blocktree,
|
|
|
+it is hoped that low overhead communication between distributed components can be achieved.
|
|
|
+By using this system to provide a global distributed filesystem,
|
|
|
+it is hoped that the interoperable sharing of data can be achieved.
|
|
|
+And by using protocol contracts to constrain actor communication,
|
|
|
+it is hoped that the structure and safety of type theory can bring order to distributed
|
|
|
+computation.
|
|
|
+While it is possible to see some of the applications that can be built from these abstractions,
|
|
|
+it seems likely that their composability and the creativity of developers will enable systems that
|
|
|
+cannot be foreseen.
|
|
|
+
|
|
|
+\end{document}
|