1 year ago · 75bc0024cd
--- a/doc/BlocktreeCloudPaper/BlocktreeCloudPaper.tex
+++ b/doc/BlocktreeCloudPaper/BlocktreeCloudPaper.tex
@@ -1,1621 +0,0 @@
 
				-\documentclass{article}

			
 
				-\usepackage[scale=0.8]{geometry}

			
 
				-\usepackage{hyperref}

			
 
				-\usepackage{graphicx}

			
 
				-

			
 
				-\title{The Blocktree Cloud Orchestration Platform}

			
 
				-\author{Matthew Carr}

			
 
				-

			
 
				-\begin{document}

			
 
				-\maketitle

			
 
				-\begin{abstract}

			
 
				-This document is a proposal for a novel cloud platform called Blocktree.

			
 
				-The system is described in terms of the actor model,

			
 
				-where tasks and services are implemented as actors.

			
 
				-The platform is responsible for orchestrating these actors on a set of native operating system processes.

			
 
				-A service is provided to actors which gives them access to a highly available distributed file system,

			
 
				-which serves as the only source of persistent state for the system.

			
 
				-High availability is achieved using the Raft consensus protocol to synchronize the state of files between processes.

			
 
				-All data stored in the filesystem is secured with strong integrity and optional confidentiality protections.

			
 
				-An interface similar to a network block device allows for fast low-level read and write access to the encrypted data,

			
 
				-with full support for client-side encryption.

			
 
				-Well-known cryptographic primitives and constructions are employed to provide this protection;

			
 
				-the system does not attempt to innovate in terms of cryptography.

			
 
				-The system's trust model allows for mutual TLS authentication between all processes in the system,

			
 
				-even those which are controlled by different owners.

			
 
				-By integrating these ideas into a single platform,

			
 
				-the system aims to advance the status quo in the security and reliability of software systems.

			
 
				-\end{abstract}

			
 
				-

			
 
				-\section{Introduction}

			
 
				-% The "Big" Picture.

			
 
				-Blocktree is an attempt to extend the Unix philosophy that everything is a file

			
 
				-to the entire distributed system that comprises modern IT infrastructure.

			
 
				-The system is organized around a global distributed filesystem which defines security

			
 
				-principals, resources, and their authorization attributes.

			
 
				-This filesystem provides a language for access control that can be used to securely grant

			
 
				-access to resources from different organizations, without the need to set up federation.

			
 
				-The system provides an actor runtime for orchestrating services.

			
 
				-Resources are represented by actors, and actors are grouped into operating system processes.

			
 
				-Each process has its own credentials which authenticate it as a unique security principal,

			
 
				-and which specify the filesystem path where the process is located.

			
 
				-A process has authorization attributes which determine the set of processes that may communicate with it.

			
 
				-Every connection between processes is established using mutual TLS authentication,

			
 
				-which is accomplished without the need to trust any third-party certificate authorities.

			
 
				-The cryptographic mechanisms which make this possible are described in detail in section 4.

			
 
				-Messages addressed to actors in a different process are forwarded over these connections,

			
 
				-while messages addressed to actors in the same process are delivered without copying.

			
 
				-

			
 
				-% Self-certifying paths and the chain of trust.

			
 
				-The single global Blocktree filesystem is partitioned into disjoint domains of authority.

			
 
				-Each domain is controlled by a root principal.

			
 
				-As is the case for all principals,

			
 
				-a root principal is authenticated by a public-private key pair,

			
 
				-and is identified by a hash of its public key.

			
 
				-The domain of authority for a given absolute path is determined by its first component,

			
 
				-which is the identifier of the root principal that controls the domain.

			
 
				-Because there is no meaning to the directory \texttt{/},

			
 
				-a directory consisting of only a single component equal to a root principal's identifier is

			
 
				-referred to as the root principal's root directory.

			
 
				-The root principal delegates its authority to write files to subordinate principals by issuing

			
 
				-them certificates which specify the path that the authority of the subordinate is limited to.

			
 
				-File data is signed for authenticity and a certificate chain is contained in its metadata.

			
 
				-This certificate chain must lead back to the root principal

			
 
				-and consist of certificates with correctly scoped authority in order for the file to be validated.

			
 
				-Given the path of a file and the file's contents,

			
 
				-this allows the file to be validated by anyone without the need to trust a third party.

			
 
				-Blocktree paths are called self-certifying for this reason.
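The scoping rule described above can be sketched as a structural check over a certificate chain. This is only an illustration under stated assumptions: the `Cert` type, its field names, and the chain ordering (root first) are hypothetical and are not taken from the Blocktree source, and signature verification is omitted entirely.

```rust
/// Hypothetical, simplified certificate. The real certificate format is not
/// given in this section of the paper.
struct Cert {
    issuer: String,  // identifier of the principal that signed this certificate
    subject: String, // identifier of the principal receiving authority
    scope: String,   // path the subject's write authority is limited to
}

/// Returns true if `path` equals `scope` or lies beneath it.
fn within(scope: &str, path: &str) -> bool {
    let scope = scope.trim_end_matches('/');
    path == scope
        || path
            .strip_prefix(scope)
            .map_or(false, |rest| rest.starts_with('/'))
}

/// Checks only the structural rules: the chain starts at the root principal
/// named by the first path component, each certificate narrows (or keeps) the
/// scope of the one before it, and the file lies inside the final scope.
/// Cryptographic signature checks are deliberately left out of this sketch.
fn chain_is_valid(file_path: &str, signer: &str, chain: &[Cert]) -> bool {
    let root = match file_path.trim_start_matches('/').split('/').next() {
        Some(r) if !r.is_empty() => r,
        _ => return false,
    };
    let mut expected_issuer = root.to_string();
    let mut current_scope = format!("/{}", root);
    for cert in chain {
        if cert.issuer != expected_issuer || !within(&current_scope, &cert.scope) {
            return false;
        }
        expected_issuer = cert.subject.clone();
        current_scope = cert.scope.clone();
    }
    expected_issuer == signer && within(&current_scope, file_path)
}
```

With an empty chain the check only passes when the signer is the root principal itself, which matches the idea that all authority originates there.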

			
 
				-

			
 
				-% Persistent state provided by the filesystem.

			
 
				-One of the major challenges in distributed systems is managing persistent state.

			
 
				-Blocktree solves this issue with its distributed filesystem.

			
 
				-Files are broken into segments called sectors.

			
 
				-The sector size of a file can be configured when it is created,

			
 
				-but cannot be changed later.

			
 
				-Reads and writes of individual sectors are guaranteed to be atomic.

			
 
				-The sectors which comprise a file and its metadata are replicated by a set of processes running

			
 
				-the sector service.

			
 
				-This service is responsible for storing the sectors of files which are contained in the directory

			
 
				-containing the process in which it is running.

			
 
				-The actors providing the sector service in a given directory coordinate with one another using

			
 
				-the Raft protocol to synchronize the state of the sectors they store.

			
 
				-By partitioning the data in the filesystem based on directory,

			
 
				-the system can scale beyond the capabilities of a single consensus cluster.

			
 
				-Sectors can be integrity protected and verified without reading the entire file,

			
 
				-because each file has a Merkle tree of sector hashes associated with it.

			
 
				-Encryption can be optionally applied to sectors,

			
 
				-and when it is, key management is handled by the system.

			
 
				-The cryptographic mechanisms used to implement these protections are described in section 4.

			
 
				-

			
 
				-% Protocol contracts.

			
 
				-One of the design goals of Blocktree is to facilitate the creation of composable distributed

			
 
				-systems.

			
 
				-A major challenge to building such systems is the difficulty in pinning down bugs when they

			
 
				-inevitably occur.

			
 
				-Research into session types (a.k.a. behavioral types) promises to bring the safety benefits

			
 
				-of type checking to actor communication.

			
 
				-Blocktree integrates a session typing system that allows protocol contracts to be defined that

			
 
				-specify the communication patterns of a set of actors.

			
 
				-This model allows the state space of the set of actors participating in a computation to be defined,

			
 
				-and the state transitions which occur to be specified based on the types of received messages.

			
 
				-These contracts are used to verify protocol adherence statically and dynamically.

			
 
				-This system is implemented using compile time code generation,

			
 
				-making it a zero-cost abstraction.

			
 
				-This frees the developer from dealing with the numerous failure modes that can occur in a

			
 
				-communication protocol.

			
 
				-

			
 
				-% Implementation language and project links.

			
 
				-Blocktree is implemented in the Rust programming language.

			
 
				-It is currently only tested on Linux.

			
 
				-Running it on other Unix-like operating systems should be straightforward,

			
 
				-though FUSE support is required to mount the filesystem.

			
 
				-Its source code is licensed under the GNU Affero General Public License Version 3.

			
 
				-It can be downloaded at the project homepage at \url{https://blocktree.systems}.

			
 
				-Anyone interested in contributing to development is welcome to submit a pull request

			
 
				-to \url{https://gogs.delease.com/Delease/Blocktree}.

			
 
				-If you have larger changes or architectural suggestions,

			
 
				-please submit an issue for discussion prior to spending time implementing your idea.

			
 
				-

			
 
				-% Outline of the rest of the paper.

			
 
				-The remainder of this paper is structured as follows:

			
 
				-\begin{itemize}

			
 
				-  \item Section 2 describes the actor runtime, service and task orchestration, and service

			
 
				-    discovery.

			
 
				-  \item Section 3 discusses the filesystem, its concurrency semantics and implementation.

			
 
				-  \item Section 4 details the cryptographic mechanisms used to secure communication between

			
 
				-    actor runtimes and to protect sector data.

			
 
				-  \item Section 5 is a set of examples describing ways that Blocktree can be used to build systems.

			
 
				-  \item Section 6 provides some concluding remarks.

			
 
				-\end{itemize}

			
 
				-

			
 
				-

			
 
				-

			
 
				-\section{Actor Runtime}

			
 
				-% Motivation for using the actor model. 

			
 
				-Building scalable, fault-tolerant systems requires us to distribute computation over

			
 
				-multiple computers.

			
 
				-Rather than switching to a different programming model when an application scales beyond the

			
 
				-capacity of a single computer,

			
 
				-it is beneficial in terms of programmer time and program simplicity to begin with a model that 

			
 
				-enables multi-computer scalability.

			
 
				-Fundamentally, all communication over an IP network involves the exchange of messages,

			
 
				-namely IP packets.

			
 
				-So if we wish to build scalable fault-tolerant systems,

			
 
				-it makes sense to choose a programming model built on message passing,

			
 
				-as this will minimize the impedance mismatch with the underlying networking technology.

			
 
				-

			
 
				-% Overview of message passing interface.

			
 
				-That is why Blocktree is built on the actor model

			
 
				-and why its actor runtime is at the core of its architecture.

			
 
				-The runtime can be used to spawn actors, register services, dispatch messages immediately,

			
 
				-and schedule messages to be delivered in the future.

			
 
				-Messages can be dispatched in two different ways: with \texttt{send} and \texttt{call}.

			
 
				-A message is dispatched with the \texttt{send} method when no reply is required,

			
 
				-and with \texttt{call} when exactly one is.

			
 
				-The \texttt{Future} returned by \texttt{call} can be awaited to obtain the reply.

			
 
				-If a timeout occurs while waiting for the reply,

			
 
				-the \texttt{Future} completes with an error.

			
 
				-The name \texttt{call} was chosen to bring to mind a remote procedure call,

			
 
				-which is the primary use case this method was intended for.

			
 
				-Awaiting replies to messages serves as a simple way to synchronize a distributed computation.
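The difference between the two dispatch methods can be sketched with standard library channels. This is not the Blocktree API: the real runtime returns a \texttt{Future} from \texttt{call}, whereas this sketch blocks a thread, and the `Envelope` type and the `spawn_echo` helper are hypothetical names introduced for illustration.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical envelope: a payload plus an optional channel for the single reply.
struct Envelope {
    body: String,
    reply_to: Option<Sender<String>>,
}

/// Fire-and-forget dispatch: no reply is expected.
fn send(mailbox: &Sender<Envelope>, body: &str) {
    let _ = mailbox.send(Envelope { body: body.to_string(), reply_to: None });
}

/// Request-reply dispatch: waits for exactly one reply, or fails on timeout.
fn call(
    mailbox: &Sender<Envelope>,
    body: &str,
    timeout: Duration,
) -> Result<String, RecvTimeoutError> {
    let (tx, rx) = channel();
    let _ = mailbox.send(Envelope { body: body.to_string(), reply_to: Some(tx) });
    rx.recv_timeout(timeout)
}

/// Spawns a toy actor that replies in upper case whenever a reply is requested.
fn spawn_echo() -> Sender<Envelope> {
    let (tx, rx) = channel::<Envelope>();
    thread::spawn(move || {
        for msg in rx {
            if let Some(reply_to) = msg.reply_to {
                let _ = reply_to.send(msg.body.to_uppercase());
            }
        }
    });
    tx
}
```

A caller that never receives a reply gets an error after the timeout instead of waiting forever, which mirrors the behavior described for the \texttt{Future} returned by \texttt{call}.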

			
 
				-

			
 
				-% Scheduling messages for future delivery.

			
 
				-Executing actions at some point in the future or at regular intervals are common tasks in computer

			
 
				-systems.

			
 
				-Blocktree facilitates this by allowing messages to be scheduled for future delivery.

			
 
				-The schedule may specify a one time delivery at a specific instant in time,

			
 
				-or a repeating delivery with a given period.

			
 
				-These scheduling modes can be combined so that you can specify an anchoring instant

			
 
				-and a period whose multiples will be added to this instant to calculate each delivery time.

			
 
				-For example, a message could be scheduled for delivery every morning at 3 AM.

			
 
				-Messages scheduled in a runtime are persisted in the runtime's file.

			
 
				-This ensures scheduled messages will be delivered even if the runtime is restarted.

			
 
				-If a message has been delivered

			
 
				-and the schedule is such that it will never be delivered again,

			
 
				-it is removed from the runtime's file.

			
 
				-If a message is scheduled for delivery at a single instant in time,

			
 
				-and that delivery is missed,

			
 
				-the message will be delivered as soon as possible.

			
 
				-But, if a message is periodic,

			
 
				-any messages which were missed due to a runtime not being active will never be sent.

			
 
				-This is because the runtime only persists the message's schedule,

			
 
				-not every delivery.

			
 
				-This mechanism is intended for periodic tasks or delaying work to a later time.

			
 
				-It is not for building hard realtime systems.
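The delivery rules above reduce to a small calculation. The sketch below assumes a hypothetical `Schedule` type and represents instants as whole seconds since an arbitrary epoch; neither is taken from the Blocktree source.

```rust
/// Hypothetical schedule type. Times are seconds since an arbitrary epoch.
enum Schedule {
    /// Deliver once at a specific instant.
    Once { at: u64 },
    /// Deliver at `anchor`, `anchor + period`, `anchor + 2 * period`, and so on.
    Periodic { anchor: u64, period: u64 },
}

/// Returns the next delivery time at or after `now`, or `None` when the
/// schedule will never fire again. `delivered` records whether a one-time
/// message has already been delivered.
fn next_delivery(schedule: &Schedule, now: u64, delivered: bool) -> Option<u64> {
    match *schedule {
        // A missed one-time delivery is still due: deliver as soon as possible.
        Schedule::Once { at } => {
            if delivered {
                None
            } else {
                Some(at.max(now))
            }
        }
        Schedule::Periodic { anchor, period } => {
            if period == 0 {
                return None;
            }
            if now <= anchor {
                return Some(anchor);
            }
            // Smallest k with anchor + k * period >= now. Deliveries that were
            // missed while the runtime was down are skipped, not replayed.
            let k = (now - anchor + period - 1) / period;
            Some(anchor + k * period)
        }
    }
}
```

For the 3 AM example, the anchor would be some past 3 AM instant and the period would be 86400 seconds. Only the schedule needs to be persisted, because each delivery time can be recomputed from it.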

			
 
				-

			
 
				-% Description of virtual actor system.

			
 
				-One of the challenges when building actor systems is supervising and managing actors' lifecycles.

			
 
				-This is handled in Erlang through the use of supervision trees,

			
 
				-but Blocktree takes a different approach inspired by Microsoft's Orleans framework.

			
 
				-Orleans introduced the concept of virtual actors,

			
 
				-which are purely logical entities that exist perpetually.

			
 
				-In Orleans, one does not need to spawn actors nor worry about respawning them should they crash;

			
 
				-the framework takes care of spawning an actor when a message is dispatched to it.

			
 
				-This model also gives the framework the flexibility to deactivate actors when they are idle

			
 
				-and to load balance actors across different computers.

			
 
				-In Blocktree a similar system is used when messages are dispatched to services.

			
 
				-The Blocktree runtime takes care of routing these messages to the appropriate actors,

			
 
				-spawning them if needed.

			
 
				-A service must be registered in a runtime before messages can be routed to it.

			
 
				-The actors which are spawned based on this registration are called \emph{service providers} of the

			
 
				-service.

			
 
				-Services which directly use operating system resources,

			
 
				-such as those that listen on network sockets,

			
 
				-are often started immediately after registration so that they are available to external clients.

			
 
				-

			
 
				-% Message addressing modes.

			
 
				-Messages can be addressed to services or specific actors.

			
 
				-When addressing a specific actor,

			
 
				-the message contains an \emph{actor name},

			
 
				-which is a pair consisting of the path of the runtime hosting the actor and the \texttt{Uuid}

			
 
				-identifying the specific actor in that runtime.

			
 
				-When addressing a service,

			
 
				-the message is dispatched using a \emph{service name},

			
 
				-which contains the following fields:

			
 
				-\begin{enumerate}

			
 
				-  \item \texttt{service}: The path identifying the receiving service.

			
 
				-  \item \texttt{scope}: A filesystem path used to specify the intended recipient.

			
 
				-  \item \texttt{rootward}: A boolean describing whether message delivery is attempted towards or

			
 
				-    away from the root of the filesystem tree. A value of

			
 
				-    \texttt{false} indicates that the message is intended for a runtime directly contained in the

			
 
				-    scope. A value of \texttt{true} indicates that the message is intended for a runtime contained

			
 
				-    in a parent directory of the scope and should be delivered to a runtime which has the requested

			
 
				-    service registered and is closest to the scope.

			
 
				-  \item \texttt{id}: An identifier for a specific service provider.

			
 
				-\end{enumerate}

			
 
				-The ID can be a \texttt{Uuid} or a \texttt{String}.

			
 
				-It is treated as an opaque identifier by the runtime,

			
 
				-but a service is free to associate additional meaning to it.

			
 
				-Every message has a header containing the name of the sender and receiver.

			
 
				-The receiver name can be an actor or service name,

			
 
				-but the sender name is always an actor name.

			
 
				-For example, to open a file in the filesystem,

			
 
				-a message is dispatched with \texttt{call} using the service name of the filesystem service.

			
 
				-The reply contains the name of the file actor spawned by the filesystem service which owns the opened

			
 
				-file.

			
 
				-Messages are then dispatched to the file actor using its actor name to read and write to the file.
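The two addressing modes and the message header can be modeled with a few plain types. All type and field layouts below are assumptions made for illustration; in particular, a 16-byte array stands in for a real \texttt{Uuid} type, and `reply_header` is a hypothetical helper.

```rust
/// Identifier of a service provider. The runtime treats it as opaque.
#[derive(Debug, Clone, PartialEq)]
enum ProviderId {
    Uuid([u8; 16]), // stand-in for a real Uuid type
    Text(String),
}

/// Names one specific actor: the path of its runtime plus its ID there.
#[derive(Debug, Clone, PartialEq)]
struct ActorName {
    runtime: String,
    actor: [u8; 16],
}

/// Names a service provider using the four fields listed above.
#[derive(Debug, Clone, PartialEq)]
struct ServiceName {
    service: String,
    scope: String,
    rootward: bool,
    id: ProviderId,
}

/// A receiver may be either kind of name.
#[derive(Debug, Clone, PartialEq)]
enum Receiver {
    Actor(ActorName),
    Service(ServiceName),
}

/// The sender is always an actor name, so replies can be addressed directly.
#[derive(Debug, Clone, PartialEq)]
struct Header {
    from: ActorName,
    to: Receiver,
}

/// Builds the header of a reply: it goes back to the original sender.
fn reply_header(request: &Header, replier: ActorName) -> Header {
    Header { from: replier, to: Receiver::Actor(request.from.clone()) }
}
```

In the file-opening example, the request would carry a `Receiver::Service`, and the reply would reveal the `ActorName` of the file actor for use in later messages.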

			
 
				-

			
 
				-% The runtime is implemented using tokio.

			
 
				-The actor runtime is currently implemented using the Rust asynchronous runtime tokio.

			
 
				-Actors are spawned as tasks in the tokio runtime,

			
 
				-and multi-producer single consumer channels are used for message delivery.

			
 
				-Because actors are just tasks,

			
 
				-they can do anything a task can do,

			
 
				-including awaiting other \texttt{Future}s.

			
 
				-Because of this, there is no need for the actor runtime to support short-lived worker tasks,

			
 
				-as any such use-case can be accomplished by awaiting a set of \texttt{Future}s.

			
 
				-This allows the runtime to focus on providing support for services.

			
 
				-Using tokio also means that we have access to a high performance multi-threaded runtime with

			
 
				-evented IO.

			
 
				-This asynchronous programming model ensures that resources are efficiently utilized,

			
 
				-and is ideal for a system focused on orchestrating services which may be used by many clients.

			
 
				-

			
 
				-% Delivering messages over the network.

			
 
				-Messages can be forwarded between actor runtimes using a secure transport layer called

			
 
				-\texttt{bttp}.

			
 
				-The transport is implemented using the QUIC protocol, which integrates TLS for security.

			
 
				-A \texttt{bttp} client may connect anonymously or using credentials.

			
 
				-If an anonymous connection is attempted,

			
 
				-the client has no authorization attributes associated with it.

			
 
				-Only runtimes which grant others the execute permission allow connections from such clients.

			
 
				-If these permissions are not granted in the runtime's file,

			
 
				-anonymous connections are rejected.

			
 
				-When a client connects with credentials,

			
 
				-mutual TLS authentication is performed as part of the connection handshake,

			
 
				-which cryptographically verifies the credentials of each runtime.

			
 
				-These credentials contain the filesystem paths where each runtime is located.

			
 
				-This information is used to securely route messages between runtimes.

			
 
				-The \texttt{bttp} server is always authenticated during the handshake,

			
 
				-even when the client is connecting anonymously.

			
 
				-Because QUIC supports the concurrent use of many different streams,

			
 
				-it serves as an ideal transport for a message-oriented system.

			
 
				-\texttt{bttp} uses different streams for independent messages,

			
 
				-ensuring that head-of-line blocking does not occur.

			
 
				-Note that although data from separate streams can arrive in any order,

			
 
				-the protocol does provide reliable in-order delivery of data in any given stream.

			
 
				-The same stream is used for sending the reply to a message dispatched with \texttt{call}.

			
 
				-Once a connection is established,

			
 
				-messages may flow in both directions (provided both runtimes have execute permissions for the other),

			
 
				-regardless of which runtime is acting as the client or the server.

			
 
				-

			
 
				-% Delivering messages locally.

			
 
				-When a message is sent between actors in the same runtime it is delivered into the queue of the recipient without any copying,

			
 
				-while ensuring immutability (i.e. move semantics).

			
 
				-This is possible thanks to the Rust ownership system,

			
 
				-because the message sender gives ownership to the runtime when it dispatches the message,

			
 
				-and the runtime gives ownership to the recipient when it delivers the message.
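The effect of move semantics on local delivery can be observed with a standard library channel, used here as a stand-in for an actor's message queue. The function name is hypothetical. The sketch compares the address of the message's heap buffer before dispatch and after delivery.

```rust
use std::sync::mpsc::channel;

/// Moves a message through a channel and reports whether the heap buffer that
/// holds its bytes is the same allocation on both sides.
fn delivered_without_copy() -> bool {
    let (tx, rx) = channel::<Vec<u8>>();
    let message = vec![0u8; 1024];
    let before = message.as_ptr();
    // Ownership moves into the channel. `message` cannot be used after this
    // line, so the sender is unable to mutate what the receiver will see.
    tx.send(message).expect("receiver is alive");
    let received = rx.recv().expect("sender is alive");
    before == received.as_ptr()
}
```

Only the small `Vec` header (pointer, length, capacity) changes hands; the 1024 payload bytes stay where they were allocated.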

			
 
				-

			
 
				-% Security model based on filesystem permissions.

			
 
				-A runtime is represented in the filesystem as a file.

			
 
				-This file contains the authorization attributes which are associated with the runtime's security

			
 
				-principal.

			
 
				-The credentials used by the runtime specify the file, so other runtimes are able to locate it.

			
 
				-The metadata of the file contains authorization attributes just like any other file

			
 
				-(e.g. UID, GID, and mode bits).

			
 
				-In order for a principal to be able to send a message to an actor in the runtime,

			
 
				-it must have execute permissions for this file.

			
 
				-Thus communication between runtimes can be controlled using simple filesystem permissions.

			
 
				-Permissions checking is done during the \texttt{bttp} handshake.

			
 
				-Note that it is possible for messages to be sent in one direction in a \texttt{bttp} connection

			
 
				-but not in the other.

			
 
				-In this situation replies are permitted but unsolicited messages are not.

			
 
				-An important trade-off which was made when designing this model was that messages which are

			
 
				-sent between actors in the same runtime are not subject to any authorization checks.

			
 
				-This was done for two reasons: performance and security.

			
 
				-By eliminating authorization checks, messages can be more efficiently delivered between actors in the

			
 
				-same process,

			
 
				-which helps to reduce the performance penalty of the actor runtime over directly using threads.

			
 
				-Security is enhanced by this decision because it forces the user to separate actors with different

			
 
				-security requirements into different operating system processes,

			
 
				-which ensures all of the process isolation machinery in the operating system will be used to

			
 
				-isolate them.
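The execute-permission check on a runtime's file can be sketched as follows. This assumes Unix-style evaluation of owner, group, and other mode bits and represents principals with numeric IDs; the `Meta` and `Principal` types are hypothetical, and the actual Blocktree metadata layout may differ.

```rust
/// Hypothetical subset of the metadata of a runtime's file.
struct Meta {
    uid: u32,
    gid: u32,
    mode: u32,
}

/// Hypothetical view of an authenticated client principal.
struct Principal {
    uid: u32,
    gids: Vec<u32>,
}

/// Decides whether a client may send messages to actors in the runtime.
/// `None` models an anonymous client, which only ever matches the "other" bit.
fn may_send(meta: &Meta, client: Option<&Principal>) -> bool {
    const OWNER_X: u32 = 0o100;
    const GROUP_X: u32 = 0o010;
    const OTHER_X: u32 = 0o001;
    match client {
        None => meta.mode & OTHER_X != 0,
        Some(p) if p.uid == meta.uid => meta.mode & OWNER_X != 0,
        Some(p) if p.gids.contains(&meta.gid) => meta.mode & GROUP_X != 0,
        Some(_) => meta.mode & OTHER_X != 0,
    }
}
```

A mode of `0o750` would admit the owner and group members while rejecting anonymous clients, whereas `0o751` would also accept anonymous connections.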

			
 
				-

			
 
				-% Representing resources as actors.

			
 
				-As in other actor systems, it is convenient to represent resources in Blocktree using actors.

			
 
				-This allows the same security model used to control communication between actors to be used for

			
 
				-controlling access to resources,

			
 
				-and for resources to be shared by many actors.

			
 
				-For instance, a Point-to-Point Protocol connection could be owned by an actor.

			
 
				-This actor could forward traffic delivered to it in messages over this connection.

			
 
				-The set of actors which are able to access the connection is controlled by setting the filesystem

			
 
				-permissions on the file for the runtime executing the actor owning the connection.

			
 
				-

			
 
				-% Actor ownership.

			
 
				-The concept of ownership in programming languages is very useful for ensuring that resources are

			
 
				-properly freed when the type using them dies.

			
 
				-Because actors are used for encapsulating resources in Blocktree,

			
 
				-a similar system of ownership is employed for this reason.

			
 
				-An actor is initially owned by the actor that spawned it.

			
 
				-An actor can only have a single owner,

			
 
				-but the owner can grant ownership to another actor.

			
 
				-An actor is not allowed to own itself,

			
 
				-though it may be owned by the runtime.

			
 
				-When the owner of an actor returns,

			
 
				-the actor is sent a message instructing it to return.

			
 
				-If it does not return after a timeout,

			
 
				-it is interrupted.

			
 
				-This is the opposite of how supervision trees work in Erlang.

			
 
				-Instead of the parent receiving a message when the child returns,

			
 
				-the child receives a message when the parent returns.

			
 
				-Service providers spawned by the runtime are owned by it.

			
 
				-They continue running until the runtime chooses to reclaim their resources,

			
 
				-which can happen because they are idle or the runtime is overloaded.

			
 
				-Note that ownership is not limited to a single runtime,

			
 
				-so distributed resources can be managed by owning actors in many different runtimes.

			
 
				-

			
 
				-% Message routing to services.

			
 
				-A service is identified by a Blocktree path.

			
 
				-Only one service implementation can be registered in a particular runtime,

			
 
				-though this implementation may be used to spawn many actors as providers for the service,

			
 
				-each associated with a different ID.

			
 
				-The runtime spawns a new actor when it finds no service provider associated with the ID in the

			
 
				-message it is delivering.

			
 
				-Some services may only have one service provider in a given runtime,

			
 
				-as is the case for the sector and filesystem services.

			
 
				-The \texttt{scope} and \texttt{rootward} fields in a service name specify the set of runtimes to

			
 
				-which a message may be delivered.

			
 
				-They allow the sender to express their intended recipient,

			
 
				-while still affording enough flexibility to the runtime to route messages as needed.

			
 
				-If \texttt{rootward} is \texttt{false},

			
 
				-the message is delivered to a service provider in a runtime that is directly contained in

			
 
				-\texttt{scope}.

			
 
				-If \texttt{rootward} is \texttt{true},

			
 
				-the parent directories of scope are searched,

			
 
				-working towards the root of the filesystem tree,

			
 
				-and the message is delivered to the first provider of \texttt{service} which is found.

			
 
				-When there are multiple service providers to which a given message could be delivered,

			
 
				-the one to which it is actually delivered is unspecified,

			
 
				-which allows the runtime to balance load.

			
 
				-Delivery will occur to at most one recipient,

			
 
				-even in the case that there are multiple potential recipients.
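The routing rule for \texttt{scope} and \texttt{rootward} can be sketched as a search over a list of registrations. The registry representation (pairs of runtime directory and service path) and the function names are hypothetical. The sketch also assumes absolute paths without trailing slashes, and it reads "parent directories of scope" as starting at the immediate parent rather than at the scope itself.

```rust
/// Returns the parent of an absolute path, or `None` for a root directory
/// (a path with a single component).
fn parent(path: &str) -> Option<&str> {
    let idx = path.rfind('/')?;
    if idx == 0 {
        None
    } else {
        Some(&path[..idx])
    }
}

/// Finds a runtime directly contained in `dir` that has `service` registered.
/// Each registry entry is (directory containing the runtime, service path).
fn lookup<'a>(registry: &'a [(String, String)], dir: &str, service: &str) -> Option<&'a str> {
    registry
        .iter()
        .find(|(d, s)| d.as_str() == dir && s.as_str() == service)
        .map(|(d, _)| d.as_str())
}

/// Resolves a service name to the directory of the runtime that should
/// receive the message, or `None` if no provider is reachable.
fn resolve<'a>(
    registry: &'a [(String, String)],
    service: &str,
    scope: &str,
    rootward: bool,
) -> Option<&'a str> {
    if !rootward {
        return lookup(registry, scope, service);
    }
    // Walk from the parent of the scope toward the root and stop at the
    // first directory containing a provider: the one closest to the scope.
    let mut dir = parent(scope);
    while let Some(d) = dir {
        if let Some(found) = lookup(registry, d, service) {
            return Some(found);
        }
        dir = parent(d);
    }
    None
}
```

This sketch always picks the first matching entry; a real runtime is free to choose among equally valid providers in order to balance load.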

			
 
				-In order to contact other runtimes and deliver messages to them,

			
 
				-their network endpoint (IP address and UDP port) needs to be known.

			
 
				-This is achieved by maintaining a file with a runtime's endpoint address in the same directory as

			
 
				-the runtime.

			
 
				-The runtime is granted write permissions on the file,

			
 
				-and it is updated by \texttt{bttp} when it begins listening on a new endpoint.

			
 
				-The port a \texttt{bttp} server uses to listen for unicast connections is uniformly

			
 
				-randomly selected from the set of ports in the dynamic range (49152-65535) which are unused on the

			
 
				-server's host.

			
 
				-Use of a random port allows many different \texttt{bttp} servers to share a single IP address

			
 
				-and makes Blocktree more resistant to censorship.

			
 
				-The services which are allowed to be registered in a given runtime are specified in the runtime's

			
 
				-file.

			
 
				-The runtime reads this list and uses it to deny service registrations for unauthorized services.

			
 
				-The list is also read by other runtimes when they are searching for service providers.

			
 
				-

			
 
				-% The sector and filesystem service.

			
 
				-The filesystem is itself implemented as a service.

			
 
				-A filesystem service provider can be passed messages to delete files, list directory contents,

			
 
				-open files, or perform several other standard filesystem operations.

			
 
				-When a file is opened,

			
 
				-a new actor is spawned which owns the newly created file handle and its name is returned to the

			
 
				-caller in a reply.

			
 
				-Subsequent read and write messages are sent to this actor.

			
 
				-The filesystem service does not persist any data itself;

			
 
				-its job is to function as an integration layer,

			
 
				-conglomerating sector data from many different sources into a single unified interface.

			
 
				-The sector service is what is ultimately responsible for storing data,

			
 
				-and thus maintaining the persistent state of the system.

			
 
				-It stores sector data in the local filesystem of each computer on which it is registered.

			
 
				-The details of how this is accomplished are deferred to the next section.

			
 
				-

			
 
				-% Runtime queries.

			
 
				-While it is possible to resolve runtime paths to network endpoints when the filesystem is available,

			
 
				-another mechanism is needed to allow the filesystem service providers to be discovered.

			
 
				-This is accomplished by allowing runtimes to query one another to learn of other runtimes.

			
 
				-Because queries are intended to facilitate message delivery,

			
 
				-the query fields and their meanings mirror those used for addressing messages:

			
 
				-\begin{enumerate}

			
 
				-  \item \texttt{service} The path of the service whose providers are sought.

			
 
				-    Only runtimes with this service registered will be returned.

			
 
				-  \item \texttt{scope} The filesystem path relative to which the query will be processed.

			
 
				-  \item \texttt{rootward} Indicates if the query should search for runtimes from \texttt{scope}

			
 
				-    toward the root.

			
 
				-\end{enumerate}

			
 
				-The semantics of \texttt{scope} and \texttt{rootward} in a query are identical to their use in a

			
 
				-service name.

			
 
				-As long as at least one other runtime is known,

			
 
				-a query can be issued to learn of more runtimes.

			
 
				-A runtime which receives a query may not be able to answer it directly.

			
 
				-If it cannot,

			
 
				-it returns the endpoint of the next runtime to which the query should be sent.

			
 
				-

			
 
				-% Bootstrap discovery methods.

			
 
				-In order to bootstrap the discovery process,

			
 
				-another mechanism is needed to find the first peer to query.

			
 
				-There were several possibilities explored for doing this.

			
 
				-One way is to use a blockchain to store the endpoints of the runtimes hosting the filesystem service

			
 
				-in the root directory.

			
 
				-As long as these runtimes can be located,

			
 
				-then all others can be found using the filesystem.

			
 
				-This idea may be worth revisiting in the future,

			
 
				-but the author wanted to avoid the complexity of implementing a new proof-of-work blockchain.

			
 
				-Instead, two independent mechanisms are used,

			
 
				-one that can discover runtimes over the internet as long as their path is known,

			
 
				-and another that can discover runtimes on the local network even when the discoverer does not know

			
 
				-their paths.

			
 
				-

			
 
				-% Searching DNS for root principals.

			
 
				-When the path to a runtime is known,

			
 
				-DNS is used to resolve SRV records using a fully qualified domain name

			
 
				-(FQDN) derived from the path's root principal identifier.

			
 
				-The SRV records are resolved using the name \texttt{\_bttp.\_udp.<FQDN>},

			
 
				-where \texttt{<FQDN>} is the FQDN derived from the root principal's identifier.

			
 
				-One SRV record may be created for each of the filesystem service providers in the root

			
 
				-directory.

			
 
				-Each record contains the UDP port and hostname where a runtime is listening.

			
 
				-Every runtime is configured with a search domain that is used as a suffix in the FQDN.

			
 
				-The leading labels in the FQDN are computed by base32 encoding the binary representation of the

			
 
				-root principal's identifier.

			
 
				-If the encoded string is longer than 63 bytes (the limit for each label in a hostname),

			
 
				-it is separated into the fewest number of labels possible,

			
 
				-working from left to right along the string.

			
 
				-A dot followed by the search domain is concatenated onto the end of this string to form the FQDN.

			
 
				-This method has the advantages of being simple to implement

			
 
				-and allowing runtimes to discover each other over the internet.

			
 
				-Implementing this system would be facilitated by hosting DNS servers in actors in the same

			
 
				-runtimes as the root sector service providers.

			
 
				-Then, records could be dynamically created which point to these runtimes.

			
 
				-These runtimes would also need to be configured with static IP addresses,

			
 
				-and the NS records for the search domain would need to point to them.

			
 
				-Of course it is also possible to build such a system without hosting DNS inside of Blocktree.

			
 
				-The downside of using DNS is that it couples Blocktree with a centralized,

			
 
				-albeit distributed, system.
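The FQDN derivation described above can be sketched as follows. The sketch assumes the RFC 4648 base32 alphabet in lowercase without padding; the paper does not specify the alphabet, case, or padding, so these are assumptions, as are the function names. An empty identifier is not handled.

```rust
/// Assumed alphabet: RFC 4648 base32, lowercased, with no padding characters.
const ALPHABET: &[u8; 32] = b"abcdefghijklmnopqrstuvwxyz234567";
/// Maximum length of a single DNS label.
const MAX_LABEL_LEN: usize = 63;

/// Encodes `data` five bits at a time, most significant bits first.
pub fn base32_encode(data: &[u8]) -> String {
    let mut out = String::new();
    let mut buffer: u32 = 0;
    let mut bits: u32 = 0;
    for &byte in data {
        buffer = (buffer << 8) | byte as u32;
        bits += 8;
        while bits >= 5 {
            bits -= 5;
            out.push(ALPHABET[((buffer >> bits) & 0x1f) as usize] as char);
        }
        // Keep only the bits that have not been emitted yet.
        buffer &= (1 << bits) - 1;
    }
    if bits > 0 {
        // Left-align the remaining bits in a final five-bit group.
        out.push(ALPHABET[((buffer << (5 - bits)) & 0x1f) as usize] as char);
    }
    out
}

/// Splits the encoded identifier into the fewest labels possible, working
/// left to right, then appends a dot and the search domain.
pub fn principal_fqdn(principal_id: &[u8], search_domain: &str) -> String {
    let encoded = base32_encode(principal_id);
    let labels: Vec<&str> = encoded
        .as_bytes()
        .chunks(MAX_LABEL_LEN)
        .map(|chunk| std::str::from_utf8(chunk).expect("base32 output is ASCII"))
        .collect();
    format!("{}.{}", labels.join("."), search_domain)
}

/// Builds the name used for the SRV lookup.
pub fn srv_query_name(fqdn: &str) -> String {
    format!("_bttp._udp.{}", fqdn)
}
```

A 64-byte identifier encodes to 103 characters under these assumptions, so it occupies two labels of 63 and 40 characters before the search domain.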

			
 
				-

			
 
				-% Using link-local multicast datagrams to find runtimes.

			
 
				-Because the previous mechanism requires knowledge of the root principal of a domain to perform

			
 
				-discovery,

			
 
				-it will not work if a runtime is first starting up with no credentials and so does not know its

			
 
				-own root principal.

			
 
				-This runtime needs a way to discover other runtimes so it can connect to the filesystem and sector

			
 
				-services.

			
 
				-This issue is solved by using link-local multicast addressing to discover the runtimes on the same

			
 
				-network as the discoverer.

			
 
				-When a \texttt{bttp} server starts listening for unicast traffic,

			
 
				-it also listens for UDP datagrams on port 50142 at addresses 224.0.0.142 and FF02::142,

			
 
				-if the IPv4 or IPv6 networking stack is available, respectively.

			
 
				-If the host is attached to a dual-stack network,

			
 
				-the server listens on both addresses.

			
 
				-When a runtime is attempting to discover other runtimes,

			
 
				-it sends out datagrams to these endpoints.

			
 
				-Each \texttt{bttp} server replies with its unicast address and filesystem path

			
 
				-(as specified in its credentials).

			
 
				-If the server is available at both IPv4 and IPv6 unicast addresses,

			
 
				-it is at the server's discretion which address to respond with;

			
 
				-it may even respond with an IPv4 address to an IPv4 datagram

			
 
				-and an IPv6 address to an IPv6 datagram.

			
 
				-Once a client has discovered the \texttt{bttp} servers on its network,

			
 
				-it can route messages to them,

			
 
				-such as the provisioning requests which are used to obtain new credentials.

			
 
				-Provisioning is described in the Cryptography section.

			
 
				-Note that port 50142 is in the dynamic range,

			
 
				-so it does not need to be registered with the Internet Assigned Numbers Authority (IANA).

			
 
				-Both addresses 224.0.0.142 and FF02::142 are currently unassigned,

			
 
				-but they will need to be registered with IANA if Blocktree is widely adopted.
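The choice of discovery endpoints described above can be sketched with the standard library's address types. The IPv6 group is written here as \texttt{ff02::142}, on the assumption that a link-local multicast address is intended (IPv6 multicast addresses begin with \texttt{ff}, and \texttt{ff02} denotes link-local scope). The constant and function names are hypothetical.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr};

/// UDP port used for discovery datagrams.
pub const DISCOVERY_PORT: u16 = 50142;
/// IPv4 multicast group in the local network control block.
pub const GROUP_V4: Ipv4Addr = Ipv4Addr::new(224, 0, 0, 142);
/// IPv6 link-local multicast group (assumed to be the intended address).
pub const GROUP_V6: Ipv6Addr = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x142);

/// Returns the multicast endpoints to listen on, or to send discovery
/// datagrams to, given which networking stacks are available on the host.
pub fn discovery_endpoints(has_ipv4: bool, has_ipv6: bool) -> Vec<SocketAddr> {
    let mut endpoints = Vec::new();
    if has_ipv4 {
        endpoints.push(SocketAddr::new(IpAddr::V4(GROUP_V4), DISCOVERY_PORT));
    }
    if has_ipv6 {
        endpoints.push(SocketAddr::new(IpAddr::V6(GROUP_V6), DISCOVERY_PORT));
    }
    endpoints
}
```

A listener built on this would bind a \texttt{std::net::UdpSocket} to the port and join each group using the socket's multicast join methods; that part is omitted because it depends on the host's network interfaces.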

			
 
				-

			
 
				-% Security model for queries.

			
 
				-To allow runtimes which are not permitted to execute the root directory to query for other runtimes,

			
 
				-authorization logic which is specific to queries is needed.

			
 
				-If a process is connected with credentials

			
 
				-and the path in the credentials contains the scope of the query,

			
 
				-the query is permitted.

			
 
				-If a process is connected anonymously,

			
 
				-its query will only be answered if the query scope

			
 
				-and all of its parent directories,

			
 
				-grant others the execute permission.

			
 
				-Queries from authenticated processes can be authorized using only the information in the query,

			
 
				-but anonymous queries require knowledge of filesystem permissions,

			
 
				-some of which may not be known to the answering runtime.

			
 
				-When authorizing an anonymous query,

			
 
				-an answering runtime should check that the execute permission is granted on all directories

			
 
				-that it is responsible for storing.

			
 
				-If all these checks pass, it should forward the querier to the next runtime as usual.
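A minimal sketch of this authorization logic follows. It interprets ``the path in the credentials contains the scope'' as the scope being a prefix of the credential path, which is an assumption. The \texttt{Querier} type and the \texttt{others\_execute} callback are hypothetical; the callback returns \texttt{None} for directories the answering runtime does not store, and those directories are skipped.

```rust
/// Hypothetical description of who is asking.
pub enum Querier<'a> {
    /// Connected with credentials whose path has these components.
    Authenticated { cred_path: &'a [&'a str] },
    /// Connected without credentials.
    Anonymous,
}

/// Decides whether this runtime should answer (or forward) a query.
///
/// `others_execute(dir)` reports whether `dir` grants others the execute
/// permission, or `None` if this runtime is not responsible for storing it.
pub fn authorize_query<F>(querier: &Querier, scope: &[&str], others_execute: F) -> bool
where
    F: Fn(&[&str]) -> Option<bool>,
{
    match querier {
        // Assumed meaning of "contains": the scope is a prefix of the credential path.
        Querier::Authenticated { cred_path } => cred_path.starts_with(scope),
        // Check the scope and every parent directory that this runtime knows about.
        Querier::Anonymous => (0..=scope.len())
            .filter_map(|depth| others_execute(&scope[..depth]))
            .all(|granted| granted),
    }
}
```

Because unknown directories are skipped rather than rejected, an anonymous query is only fully authorized once every runtime along the referral chain has checked the directories it stores.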

			
 
				-

			
 
				-% Overview of protocol contracts and runtime checking of protocol adherence.

			
 
				-To facilitate the creation of composable systems,

			
 
				-a protocol contract checking system based on session types has been designed.

			
 
				-This system models a communication protocol as a directed graph representing state transitions

			
 
				-based on types of received messages.

			
 
				-The protocol author defines the states that the actors participating in the protocol can be in using 

			
 
				-Rust traits.

			
 
				-These traits define handler methods for each message type the actor is expected to handle in that

			
 
				-state.

			
 
				-A top-level trait which represents the entire protocol is defined that contains the types of the

			
 
				-initial state of every actor in the protocol.

			
 
				-A macro is used to generate the message handling loop for each of the parties to the protocol,

			
 
				-as well as enums to represent all possible states that the parties can be in and the messages that

			
 
				-they exchange.

			
 
				-The generated code is responsible for ensuring that errors are generated when a message of an

			
 
				-unexpected type is received,

			
 
				-eliminating the need for ad-hoc error handling code to be written by application developers.

			
 
				-

			
 
				-% Example of a protocol contract.

			
 
				-Let's explore how this system can be used to build a simple pub-sub communications protocol.

			
 
				-In this protocol,

			
 
				-there will be a server which handles \texttt{Sub} messages by remembering the names of the actors

			
 
				-who sent them.

			
 
				-It will handle \texttt{Pub} messages by forwarding them to all of the subscribed actors.

			
 
				-The state-transition graph for the system is shown in figure \ref{fig:pubsub}.

			
 
				-\begin{figure}

			
 
				-  \begin{center}

			
 
				-    \includegraphics[scale=0.6]{PubSubStateGraph.pdf}

			
 
				-  \end{center}

			
 
				-  \caption{The state-transition graph for a simple pub-sub protocol.}

			
 
				-  \label{fig:pubsub}

			
 
				-\end{figure}

			
 
				-The solid edges in the graph indicate state transitions and are labeled with the message type

			
 
				-which triggered the transition.

			
 
				-The dashed edges indicate message delivery and are labeled with the type of the message delivered.

			
 
				-Although \texttt{Runtime} is not the state of any actor in the system,

			
 
				-it is included in the graph as the sender of the \texttt{Activate} and \texttt{Pub} messages.

			
 
				-\texttt{Activate} is delivered by the runtime to pass a reference to the runtime and provide the

			
 
				-actor's \texttt{Uuid}.

			
 
				-\texttt{Pub} messages are dispatched by actors outside the graph and are routed to actors in the

			
 
				-\texttt{Listening} state by the runtime.

			
 
				-Note that the runtime itself doesn't have any notion of the state of any actor,

			
 
				-it just delivers messages using the rules described previously.

			
 
				-Only an actor can tell whether a message is expected or not given its current state.

			
 
				-Each of the actor states is modeled by a Rust trait.

			
 
				-\begin{verbatim}

			
 
				-  pub trait ClientInit {

			
 
				-    type AfterActivate: Subed;

			
 
				-    type Fut: Future<Output = Result<Self::AfterActivate>>;

			
 
				-    fn handle_activate(self, msg: Activate) -> Self::Fut;

			
 
				-  }

			
 
				-

			
 
				-  pub trait Subed {

			
 
				-    type AfterPub: Subed;

			
 
				-    type Fut: Future<Output = Result<Self::AfterPub>>;

			
 
				-    fn handle_pub(self, msg: Envelope<Pub>) -> Self::Fut;

			
 
				-  }

			
 
				-

			
 
				-  pub trait ServerInit {

			
 
				-    type AfterActivate: Listening;

			
 
				-    type Fut: Future<Output = Result<Self::AfterActivate>>;

			
 
				-    fn handle_activate(self, msg: Activate) -> Self::Fut;

			
 
				-  }

			
 
				-

			
 
				-  pub trait Listening {

			
 
				-    type AfterSub: Listening;

			
 
				-    type SubFut: Future<Output = Result<Self::AfterSub>>;

			
 
				-    fn handle_sub(self, msg: Envelope<Sub>) -> Self::SubFut;

			
 
				-

			
 
				-    type AfterPub: Listening;

			
 
				-    type PubFut: Future<Output = Result<Self::AfterPub>>;

			
 
				-    fn handle_pub(self, msg: Envelope<Pub>) -> Self::PubFut;

			
 
				-  }

			
 
				-\end{verbatim}

			
 
				-The definition of \texttt{Activate} is as follows:

			
 
				-\begin{verbatim}

			
 
				-  pub struct Activate {

			
 
				-    rt: &'static Runtime,

			
 
				-    act_id: Uuid,

			
 
				-  }

			
 
				-\end{verbatim}

			
 
				-The \texttt{Envelope} type is a wrapper around a message which contains information about who sent

			
 
				-it and a method that can be used to send a reply.

			
 
				-In general a new actor state, represented by a new type, can be returned by a message handling

			
 
				-method.

			
 
				-The protocol itself is also represented by a trait:

			
 
				-\begin{verbatim}

			
 
				-  pub trait PubSubProtocol {

			
 
				-    type Server: ServerInit;

			
 
				-    type Client: ClientInit;

			
 
				-  }

			
 
				-\end{verbatim}

			
 
				-By modeling this protocol independently of any implementation of it,

			
 
				-we allow for many different interoperable implementations to be created.

			
 
				-We can also isolate bugs in these implementations because unexpected or malformed messages are

			
 
				-checked for by the generated code.
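To give a feel for the kind of checking the generated code performs, here is a hand-written, synchronous simplification of the server side of the pub-sub protocol. It is not the output of the macro: the enums, the \texttt{step} function, and the \texttt{ProtocolError} type are hypothetical, and futures and envelopes are left out.

```rust
/// Simplified messages (the real protocol wraps these in envelopes).
#[derive(Debug, PartialEq)]
pub enum Msg {
    Activate,
    Sub { from: String },
    Pub { body: String },
}

/// The states the server can be in.
#[derive(Debug, PartialEq)]
pub enum ServerState {
    Init,
    Listening { subscribers: Vec<String> },
}

/// Raised when a message is not expected in the current state.
#[derive(Debug, PartialEq)]
pub struct ProtocolError {
    pub state: ServerState,
    pub msg: Msg,
}

/// Applies one message. Returns the next state together with the
/// (recipient, body) pairs that should be delivered as a result.
pub fn step(
    state: ServerState,
    msg: Msg,
) -> Result<(ServerState, Vec<(String, String)>), ProtocolError> {
    match (state, msg) {
        (ServerState::Init, Msg::Activate) => {
            Ok((ServerState::Listening { subscribers: Vec::new() }, Vec::new()))
        }
        (ServerState::Listening { mut subscribers }, Msg::Sub { from }) => {
            if !subscribers.contains(&from) {
                subscribers.push(from);
            }
            Ok((ServerState::Listening { subscribers }, Vec::new()))
        }
        (ServerState::Listening { subscribers }, Msg::Pub { body }) => {
            let out = subscribers.iter().map(|s| (s.clone(), body.clone())).collect();
            Ok((ServerState::Listening { subscribers }, out))
        }
        // Every other combination is a protocol violation.
        (state, msg) => Err(ProtocolError { state, msg }),
    }
}
```

The final match arm plays the role that the text assigns to the generated code: application handlers never see a message that is not valid in the current state.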

			
 
				-

			
 
				-% Implementing actors in languages other than Rust.

			
 
				-Today the actor runtime only supports executing actors implemented in Rust.

			
 
				-A WebAssembly (Wasm) plugin system is planned to allow any language which can compile to Wasm to be

			
 
				-used to implement an actor.

			
 
				-This work is blocked pending the standardization of the WebAssembly Component Model,

			
 
				-which promises to provide an interface definition language which will allow type safe actors to be

			
 
				-defined in many different languages.

			
 
				-

			
 
				-% Running containers using actors.

			
 
				-Blocktree allows containers to be run by encapsulating them using a supervising actor.

			
 
				-This actor is responsible for starting the container and managing the container's kernel namespace.

			
 
				-Logically, it owns any kernel resources created by the container, including all spawned operating

			
 
				-system processes.

			
 
				-When the actor halts,

			
 
				-all of these resources are destroyed.

			
 
				-All network communication to the container is controlled by the supervising actor.

			
 
				-The supervisor can be configured to bind container ports to host ports,

			
 
				-as is commonly done today,

			
 
				-but it can also be used to encapsulate traffic to and from the container in Blocktree messages.

			
 
				-These messages are routed to other actors based on the configuration of the supervisor.

			
 
				-This essentially creates a VPN for containers,

			
 
				-ensuring that regardless of how well secured their communication is,

			
 
				-they will be safe to communicate over any network.

			
 
				-This network encapsulation system could be used in other actors as well,

			
 
				-allowing a lightweight and secure VPN system to be built.

			
 
				-

			
 
				-% Web GUI used for managing the system.

			
 
				-Any modern computer system must include a GUI,

			
 
				-because users expect one.

			
 
				-For this reason Blocktree includes a web-based GUI called \texttt{btconsole} that can

			
 
				-monitor the system, provision runtimes, and configure access control.

			
 
				-\texttt{btconsole} is itself implemented as an actor in the runtime,

			
 
				-and so has access to the same facilities as any other actor.

			
 
				-

			
 
				-

			
 
				-\section{Filesystem}

			
 
				-% The division of responsibilities between the sector and filesystem services.

			
 
				-The responsibility for serving data in Blocktree is shared between the filesystem and sector

			
 
				-services.

			
 
				-Most actors will access the filesystem through the filesystem service,

			
 
				-which provides a high-level interface that takes care of the cryptographic operations necessary to

			
 
				-read and write files.

			
 
				-The filesystem service relies on the sector service for actually persisting data.

			
 
				-The individual sectors which make up a file are read from and written to the sector service,

			
 
				-which stores them in the local filesystem of the computer on which it is running.

			
 
				-A sector is the atomic unit of data storage

			
 
				-and the sector service only supports reading and writing entire sectors at once.

			
 
				-File actors spawned by the filesystem service buffer reads and writes until there is enough

			
 
				-data to fill a sector.

			
 
				-Because cryptographic operations are only performed on full sectors,

			
 
				-the cost of providing these protections is amortized over the size of the sector.

			
 
				-Thus there is a tradeoff between latency and throughput when selecting the sector size of a file:

			
 
				-a smaller sector size means less latency while a larger one enables more throughput.
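The write-side buffering described above can be sketched as follows. The \texttt{SectorBuffer} type is a hypothetical name; a real file actor would also encrypt and integrity protect each full sector before handing it to the sector service, which is omitted here.

```rust
/// Accumulates written bytes until whole sectors can be emitted.
pub struct SectorBuffer {
    sector_size: usize,
    pending: Vec<u8>,
    next_index: u64,
}

impl SectorBuffer {
    pub fn new(sector_size: usize) -> Self {
        assert!(sector_size > 0, "sector size must be non-zero");
        SectorBuffer { sector_size, pending: Vec::new(), next_index: 0 }
    }

    /// Buffers `data` and returns every sector that became full,
    /// as (0-based sector index, sector contents) pairs.
    pub fn write(&mut self, data: &[u8]) -> Vec<(u64, Vec<u8>)> {
        self.pending.extend_from_slice(data);
        let mut full = Vec::new();
        while self.pending.len() >= self.sector_size {
            // Split off everything after the first full sector and keep it pending.
            let rest = self.pending.split_off(self.sector_size);
            let sector = std::mem::replace(&mut self.pending, rest);
            full.push((self.next_index, sector));
            self.next_index += 1;
        }
        full
    }

    /// Number of bytes waiting for the current sector to fill.
    pub fn buffered(&self) -> usize {
        self.pending.len()
    }
}
```

With a larger sector size more bytes wait in \texttt{pending} before anything is emitted, which is the latency cost that the text weighs against throughput.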

			
 
				-

			
 
				-% Types of sectors: metadata, integrity, and data.

			
 
				-A file has a single metadata sector, a Merkle sector, and zero or more data sectors.

			
 
				-The sector size of a file can be specified when it is created,

			
 
				-but cannot be changed later.

			
 
				-Every data sector contains the ciphertext of a number of plaintext bytes equal to the sector size,

			
 
				-but the metadata and Merkle sectors contain a variable amount of data.

			
 
				-The metadata sector contains all of the filesystem metadata associated with the file.

			
 
				-In addition to the usual metadata present in any Unix filesystem (the contents of the \texttt{stat} struct),

			
 
				-cryptographic information necessary to verify and decrypt the contents of the file is also stored.

			
 
				-The Merkle sector of a file contains a Merkle tree over the data sectors of a file.

			
 
				-The hash function used by this tree can be configured at file creation,

			
 
				-but cannot be changed after the fact.

			
 
				-

			
 
				-% How sectors are identified.

			
 
				-When sector service providers are contained in the same directory they connect to each other to form

			
 
				-a consensus cluster.

			
 
				-This cluster is identified by a \texttt{u64} called the cluster's \emph{generation}.

			
 
				-Every file is identified by a pair of \texttt{u64} values: its generation and its inode.

			
 
				-The sectors within a file are identified by an enum which specifies which type they are,

			
 
				-and in the case of data sectors, their 0-based index.

			
 
				-\begin{verbatim}

			
 
				-  pub enum SectorKind {

			
 
				-    Meta,

			
 
				-    Merkle,

			
 
				-    Data(u64),

			
 
				-  }

			
 
				-\end{verbatim}

			
 
				-The byte offset in the plaintext of the file at which each data sector begins can be calculated by

			
 
				-multiplying the sector's index by the sector size of the file.

			
 
				-The \texttt{SectorId} type is used to identify a sector.

			
 
				-\begin{verbatim}

			
 
				-  pub struct SectorId {

			
 
				-    generation: u64,

			
 
				-    inode: u64,

			
 
				-    sector: SectorKind,

			
 
				-  }

			
 
				-\end{verbatim}
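The offset arithmetic mentioned above is small enough to show directly. The helper functions are hypothetical names, and \texttt{SectorKind} is repeated only so that the sketch is self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SectorKind {
    Meta,
    Merkle,
    Data(u64),
}

/// Byte offset in the file's plaintext at which a data sector begins.
/// Returns `None` for non-data sectors or if the multiplication overflows.
pub fn data_sector_offset(kind: SectorKind, sector_size: u64) -> Option<u64> {
    match kind {
        SectorKind::Data(index) => index.checked_mul(sector_size),
        _ => None,
    }
}

/// Inverse mapping: the data sector that holds plaintext byte `offset`,
/// and the position of that byte within the sector.
/// `sector_size` must be non-zero.
pub fn locate(offset: u64, sector_size: u64) -> (SectorKind, u64) {
    (SectorKind::Data(offset / sector_size), offset % sector_size)
}
```

For example, with a sector size of 4096 bytes, sector 3 begins at byte 12288, and byte 12300 is the thirteenth byte of that sector.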

			
 
				-

			
 
				-% How the sector service stores data.

			
 
				-The sector service persists sectors in a directory in its local filesystem,

			
 
				-with each sector stored in a different file.

			
 
				-The scheme used to name these files involves security considerations,

			
 
				-and is described in the next section.

			
 
				-When a sector is updated,

			
 
				-a new local file is created with a different name containing the new contents.

			
 
				-Rather than deleting the old sector file,

			
 
				-it is overwritten by the creation of a hardlink to the new file,

			
 
				-and the name that was used to create the new file is unlinked.

			
 
				-This method ensures that the sector file is updated in one atomic operation

			
 
				-and is used by other Unix programs.

			
 
				-The sector service also uses the local filesystem to persist the replicated log it uses for Raft.

			
 
				-This file serves as a journal of sector operations.
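The general pattern of writing new sector contents under a different name and then switching names in a single step can be illustrated with the standard library. Note that this sketch substitutes \texttt{std::fs::rename} for the hard-link sequence described above; it is a different technique with a similar goal, since on Unix a rename replaces the destination atomically. The function name and the \texttt{.tmp} suffix are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replaces the file at `sector_path` with `new_contents`, so that readers
/// see either the old contents or the new contents, never a mix.
pub fn replace_sector_file(sector_path: &Path, new_contents: &[u8]) -> io::Result<()> {
    // Write the new contents under a different name first (assumed suffix).
    let tmp_path = sector_path.with_extension("tmp");
    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(new_contents)?;
    // Flush to stable storage before the name is switched.
    tmp.sync_all()?;
    // Swapped-in technique: rename instead of link followed by unlink.
    fs::rename(&tmp_path, sector_path)
}
```

After the rename the temporary name no longer exists, which corresponds to the unlinking step in the description above.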

			
 
				-

			
 
				-% Types of messages handled by the sector service.

			
 
				-Communication with the sector service is done by passing it messages of type \texttt{SectorMsg}.

			
 
				-\begin{verbatim}

			
 
				-  pub struct SectorMsg {

			
 
				-    id: SectorId,

			
 
				-    op: SectorOperation,

			
 
				-  }

			
 
				-

			
 
				-  pub enum SectorOperation {

			
 
				-    Read,

			
 
				-    Write(WriteOperation),

			
 
				-  }

			
 
				-

			
 
				-  pub enum WriteOperation {

			
 
				-    Meta(Box<FileMeta>),

			
 
				-    Data {

			
 
				-      meta: Box<FileMeta>,

			
 
				-      contents: Vec<u8>,

			
 
				-    }

			
 
				-  }

			
 
				-\end{verbatim}

			
 
				-Here \texttt{FileMeta} is the type used to store metadata for files.

			
 
				-Note that updated metadata is required to be sent when a sector's contents are modified.

			
 
				-

			
 
				-% Scaling horizontally: using Raft to create consensus cluster. Additional replication methods.

			
 
				-A generation of sector service providers uses the Raft protocol to synchronize the state of the

			
 
				-sectors it stores.

			
 
				-The message passing interface of the runtime enables this implementation

			
 
				-and the sector service's requirements were important considerations in designing this interface.

			
 
				-The system currently replicates all data to each of the service providers in the cluster.

			
 
				-Additional replication methods are planned for future implementation

			
 
				-(e.g. erasure coding and distribution via consistent hashing),

			
 
				-which allow for different tradeoffs between data durability and storage utilization.

			
 
				-

			
 
				-% Scaling vertically: how different generations are stitched together.

			
 
				-The creation of a new generation of the sector service is accomplished with several steps.

			
 
				-First, a new directory is created in which the generation will be located.

			
 
				-Next, one or more processes are credentialed for this directory,

			
 
				-using a procedure which is described in the next section.

			
 
				-The credentialing process produces files for each of the processes stored in the new directory.

			
 
				-The sector service provider in each of the processes uses the filesystem service

			
 
				-(which connects to the parent generation of the sector service)

			
 
				-to find the other runtimes hosting the sector service in the directory and messages them to

			
 
				-establish a fully-connected cluster.

			
 
				-Finally, the service provider which is elected leader contacts the generation in the root directory

			
 
				-and requests a new generation number.

			
 
				-Once this number is known it is stored in the superblock for the generation,

			
 
				-which is the file identified by the new generation number and inode 2.

			
 
				-The superblock is not contained in any directory and cannot be accessed outside the sector service.

			
 
				-The superblock also keeps track of the next inode to assign to a new file.

			
 
				-

			
 
				-% Authorization logic of the sector service.

			
 
				-To prevent malicious actors from writing invalid data,

			
 
				-the sector service must cryptographically verify all write messages.

			
 
				-The process it uses to do this involves several steps:

			
 
				-\begin{enumerate}

			
 
				-  \item The certificate chain in the metadata that was sent in the write message is validated.

			
 
				-    It is considered valid if it ends with a certificate signed by the root principal

			
 
				-    and the paths in the certificates are correctly nested,

			
 
				-    indicating valid delegation of write authority at every step.

			
 
				-  \item Using the last public key in the certificate chain,

			
 
				-    the signature in the metadata is validated.

			
 
				-    This signature covers all of the fields in the metadata.

			
 
				-  \item The new sector contents in the write message are hashed using the digest function configured

			
 
				-    for the file and the resulting hash is used to update the file's Merkle tree in its Merkle

			
 
				-    sector.

			
 
				-  \item The root of the Merkle tree is compared with the integrity value in the file's metadata.

			
 
				-    The write message is considered valid if and only if there is a match.

			
 
				-\end{enumerate}
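The last two steps of the list above (updating the Merkle tree and comparing its root with the integrity value) can be sketched as follows. Certificate chain and signature validation are not shown. The digest used here is the standard library's \texttt{DefaultHasher}, which is not cryptographic and is only a stand-in so the sketch has no dependencies; the tree layout and function names are likewise assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in digest. NOT cryptographic: a real file would use the digest
/// function configured for it, such as SHA-256.
fn digest(parts: &[&[u8]]) -> [u8; 8] {
    let mut hasher = DefaultHasher::new();
    for part in parts {
        hasher.write(part);
    }
    hasher.finish().to_be_bytes()
}

/// Hash of one data sector's contents (a leaf of the tree).
pub fn leaf_digest(contents: &[u8]) -> [u8; 8] {
    digest(&[contents])
}

/// Root of a binary Merkle tree over `leaves`; an unpaired node is promoted.
pub fn merkle_root(leaves: &[[u8; 8]]) -> [u8; 8] {
    if leaves.is_empty() {
        return digest(&[]);
    }
    let mut level = leaves.to_vec();
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| match pair {
                [left, right] => digest(&[&left[..], &right[..]]),
                [only] => *only,
                _ => unreachable!("chunks(2) yields one or two items"),
            })
            .collect();
    }
    level[0]
}

/// Steps 3 and 4: hash the new contents, update the tree, and accept the
/// write only if the new root equals the integrity value from the metadata.
pub fn validate_write(
    leaves: &mut Vec<[u8; 8]>,
    index: usize,
    new_contents: &[u8],
    integrity: [u8; 8],
) -> bool {
    let mut updated = leaves.clone();
    if index >= updated.len() {
        updated.resize(index + 1, digest(&[]));
    }
    updated[index] = leaf_digest(new_contents);
    if merkle_root(&updated) != integrity {
        return false; // reject without touching the stored tree
    }
    *leaves = updated;
    true
}
```

The stored tree is only replaced after the comparison succeeds, so a rejected write leaves the Merkle sector unchanged.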

			
 
				-This same logic is used by file actors to verify the data they read from the sector service.

			
 
				-Only once a write message is validated is it shared with the sector service provider's peers in

			
 
				-its generation.

			
 
				-Although the data in a file is encrypted,

			
 
				-it is still beneficial for security to prevent unauthorized principals from gaining access to a

			
 
				-file's ciphertext.

			
 
				-To prevent this, a sector service provider checks a file's metadata to verify that the requesting

			
 
				-principal actually has a readcap (to be defined in the next section) for the file.

			
 
				-This ensures that only principals that are authorized to read a file can gain access to the file's

			
 
				-ciphertext, metadata, and Merkle tree.

			
 
				-

			
 
				-% File actors are responsible for cryptographic operations. Client-side encryption.

			
 
				-The sector service is relied upon by the filesystem service to read and write sectors.

			
 
				-Filesystem service providers communicate with the sector service to open files and perform

			
 
				-filesystem operations.

			
 
				-These providers spawn file actors that are responsible for verifying and decrypting the information

			
 
				-contained in sectors and providing it to other actors.

			
 
				-They use the credentials of the runtime they are hosted in to decrypt sector data using

			
 
				-information contained in file metadata.

			
 
				-File actors are also responsible for encrypting and integrity protecting data written to files.

			
 
				-In order for a file actor to produce a signature over the root of the file's Merkle tree,

			
 
				-it maintains a copy of the tree in memory.

			
 
				-This copy is read from the sector service when the file is opened.

			
 
				-While this does mean duplicating data between the sector and filesystem services,

			
 
				-this design was chosen to reduce the network traffic between the two services,

			
 
				-as the entire Merkle tree does not need to be transmitted on every write.

			
 
				-Encapsulating all cryptographic operations in the filesystem service and file actors allows the

			
 
				-computer storing data to be different from the computer encrypting it.

			
 
				-This approach allows client-side encryption to be done on more capable computers

			
 
				-and low powered devices to delegate this task to a storage server.

			
 
				-

			
 
				-% Prevention of resource leaks through ownership.

			
 
				-A major advantage of using file actors to access file data is that they can be accessed over the

			
 
				-network from a different runtime as easily as they can be from the same runtime.

			
 
				-One complication arising from this approach is that file actors must not outlive the actor which

			
 
				-caused them to be spawned.

			
 
				-This is handled in the filesystem service by making the actor who opened the file the owner of the

			
 
				-file actor.

			
 
				-When a file actor receives notification that its owner returned,

			
 
				-it flushes any buffered data in its cache and returns,

			
 
				-ensuring that a resource leak does not occur.

			
 
				-

			
 
				-% Encrypted metadata. Extended attributes in metadata. Cache control.

			
 
				-Some of the information stored in metadata needs to be kept in plaintext to allow the sector

			
 
				-service to verify and decrypt the file

			
 
				-but most of it is encrypted using the same key as the file's contents.

			
 
				-The file's authorization attributes, its size, and its access times are all encrypted.

			
 
				-The table storing the file's extended attributes (EAs) is also encrypted.

			
 
				-Cache control information is included in this area as well.

			
 
				-It specifies the number of seconds, as a \texttt{u32}, that a file may be cached.

			
 
				-The filesystem service uses this information to evict sectors from its cache when they have been

			
 
				-cached for longer than this threshold,

			
 
				-causing them to be reloaded from the sector service.
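The eviction rule described above can be sketched as a small cache keyed by sector index. The \texttt{SectorCache} type is hypothetical, and the current time is passed in explicitly so that the behavior is easy to follow.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Caches sector contents for at most the number of seconds given in
/// the file's cache control information.
pub struct SectorCache {
    max_age: Duration,
    entries: HashMap<u64, (Instant, Vec<u8>)>,
}

impl SectorCache {
    pub fn new(max_age_secs: u32) -> Self {
        SectorCache {
            max_age: Duration::from_secs(u64::from(max_age_secs)),
            entries: HashMap::new(),
        }
    }

    /// Stores a sector along with the time at which it was cached.
    pub fn insert(&mut self, index: u64, data: Vec<u8>, now: Instant) {
        self.entries.insert(index, (now, data));
    }

    /// Returns the cached sector, or evicts it and returns `None` if it has
    /// been cached for longer than the threshold. A `None` result means the
    /// caller should reload the sector from the sector service.
    pub fn get(&mut self, index: u64, now: Instant) -> Option<&[u8]> {
        let expired = match self.entries.get(&index) {
            Some((cached_at, _)) => now.duration_since(*cached_at) > self.max_age,
            None => return None,
        };
        if expired {
            self.entries.remove(&index);
            return None;
        }
        self.entries.get(&index).map(|(_, data)| data.as_slice())
    }
}
```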

			
 
				-

			
 
				-% Authorization logic of the filesystem service.

			
 
				-The filesystem service uses an \texttt{Authorizer} type to make authorization decisions.

			
 
				-It passes this type the authorization attributes of the principal accessing the file, the

			
 
				-attributes of the file, and the type of access (read, write, or execute).

			
 
				-The \texttt{Authorizer} returns a boolean indicating if access is permitted or denied.

			
 
				-These access control checks are performed for every message processed by the filesystem service,

			
 
				-including opening a file.

			
 
				-A file actor only responds to messages sent from its owner,

			
 
				-which ensures that it can avoid the overhead of performing access control checks as these were

			
 
				-carried out by the filesystem service when it was created.

			
 
				-The file actor is configured when it is spawned to allow read-only, write-only, or read-write

			
 
				-access to a file,

			
 
				-depending on what type of access was requested by the actor opening the file.
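One possible shape for such an authorizer is sketched below using Unix-style mode bits. The paper does not list the authorization attributes, so the \texttt{uid}, \texttt{gid}, and \texttt{mode} fields, along with all type names other than \texttt{Authorizer}, are assumptions made for illustration.

```rust
/// The type of access being requested.
#[derive(Clone, Copy)]
pub enum Access {
    Read,
    Write,
    Execute,
}

/// Assumed attributes of the principal accessing the file.
pub struct PrincipalAttrs {
    pub uid: u32,
    pub gid: u32,
}

/// Assumed attributes of the file being accessed.
pub struct FileAttrs {
    pub uid: u32,
    pub gid: u32,
    /// Unix-style permission bits, e.g. 0o640.
    pub mode: u16,
}

/// Returns true if access is permitted and false if it is denied.
pub trait Authorizer {
    fn authorize(&self, who: &PrincipalAttrs, file: &FileAttrs, access: Access) -> bool;
}

/// An authorizer that consults owner, group, and other permission bits.
pub struct ModeBitsAuthorizer;

impl Authorizer for ModeBitsAuthorizer {
    fn authorize(&self, who: &PrincipalAttrs, file: &FileAttrs, access: Access) -> bool {
        let bit: u16 = match access {
            Access::Read => 0o4,
            Access::Write => 0o2,
            Access::Execute => 0o1,
        };
        // Pick the owner, group, or other triplet that applies to the principal.
        let shift = if who.uid == file.uid {
            6
        } else if who.gid == file.gid {
            3
        } else {
            0
        };
        ((file.mode >> shift) & bit) != 0
    }
}
```

Expressing the decision behind a trait keeps the filesystem service independent of any one permission model.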

			
 
				-

			
 
				-% Streaming replication.

			
 
				-Often when building distributed systems it is convenient to alert any interested party that an event

			
 
				-has occurred.

			
 
				-To facilitate this pattern,

			
 
				-the sector service allows actors to subscribe for notification of writes to a file.

			
 
				-The sector service maintains a list of actors which are currently subscribed

			
 
				-and when it commits a write to its local storage,

			
 
				-it sends each of them a notification message identifying the sector written

			
 
				-(but not the written data).

			
 
				-By using different files to represent different events,

			
 
				-a simple notification system can be built.

			
 
				-Because the contents of a directory may be distributed over many different generations,

			
 
				-this system does not support the recursive monitoring of directories.

			
 
				-Although this system lacks the power of \texttt{inotify} in the Linux kernel,

			
 
				-it does provide some of its benefits without incurring much of a performance overhead

			
 
				-or implementation complexity.

			
 
				-For example, this system can be used to implement streaming replication.

			
 
				-This is done by subscribing to writes on all the files that are to be replicated,

			
 
				-then reading new sectors as soon as notifications are received.

			
 
				-These sectors can then be written into replica files in a different directory.

			
 
				-This ensures that the contents of the replicas will be updated in near real-time.
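The subscription mechanism described above can be sketched with standard library channels standing in for actor messages. The \texttt{WriteNotifier} and \texttt{SectorRef} types are hypothetical, and files are identified by inode alone for brevity.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Identifies the sector that was written (but carries none of its data).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SectorRef {
    pub inode: u64,
    pub sector_index: u64,
}

/// Tracks the actors currently subscribed to writes on each file.
#[derive(Default)]
pub struct WriteNotifier {
    subscribers: Vec<(u64, Sender<SectorRef>)>,
}

impl WriteNotifier {
    /// Subscribes to write notifications for the file with this inode.
    pub fn subscribe(&mut self, inode: u64) -> Receiver<SectorRef> {
        let (tx, rx) = channel();
        self.subscribers.push((inode, tx));
        rx
    }

    /// Called after a write has been committed to local storage.
    /// Subscribers whose receiving end has gone away are dropped.
    pub fn committed(&mut self, written: SectorRef) {
        self.subscribers
            .retain(|(inode, tx)| *inode != written.inode || tx.send(written).is_ok());
    }
}
```

A streaming replicator built on this would read each notified sector from the source file and write it into the corresponding replica file.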

			
 
				-

			
 
				-% Peer-to-peer distribution of sector data.

			
 
				-Because of the strong integrity protection afforded to sectors,

			
 
				-it is possible for peer-to-peer distribution of sector data to be done securely.

			
 
				-Implementing this mechanism is planned as a future enhancement to the system.

			
 
				-The idea is to base the design on BitTorrent,

			
 
				-where the sector service responsible for a file acts as a tracker for that file,

			
 
				-and the file actors accessing the file communicate with one another directly using the information

			
 
				-provided by the sector service.

			
 
				-This could allow the system to scale to a much larger number of concurrent reads by reducing

			
 
				-the load on the sector service.

			
 
				-

			
 
				-% The FUSE daemon.

			
 
				-Being able to access the filesystem from actors allows a programmer to implement new applications

			
 
				-using Blocktree,

			
 
				-but there is an entire world of existing applications which only know how to access the local

			
 
				-filesystem.

			
 
				-To allow these applications access to Blocktree,

			
 
				-a FUSE daemon called \texttt{btfuse} is included which allows a Blocktree directory to be mounted

			
 
				-to a directory in the local filesystem.

			
 
				-This daemon can directly access the sector files in a local directory,

			
 
				-or it can connect over the network to a filesystem or sector service provider.

			
 
				-This FUSE daemon could be included in a system's initrd to allow it to mount its root filesystem

			
 
				-from Blocktree,

			
 
				-opening up many interesting possibilities for hosting machine images in Blocktree.

			
 
				-A planned future enhancement is to develop a Blocktree filesystem driver which actually runs in

			
 
				-kernel space.

			
 
				-This would reduce the overhead associated with context switching from user space, to kernel space,

			
 
				-and back to user space, for every filesystem interaction,

			
 
				-making the system more practical to use for a root filesystem.

			
 
				-

			
 
				-

			
 
				-\section{Cryptography}

			
 
				-This section describes the cryptographic mechanisms used to protect the integrity and confidentiality of

			
 
				-files.

			
 
				-These mechanisms are based on well-established cryptographic constructions.

			
 
				-

			
 
				-% Integrity protection.

			
 
				-File integrity is protected by a digital signature over its metadata.

			
 
				-The metadata contains the integrity field which contains the root node of a Merkle tree over

			
 
				-the file's contents.

			
 
				-This allows any sector in the file to be verified with a number of hash function invocations that

			
 
				-is logarithmic in the size of the file.

			
 
				-It also allows the sectors of a file to be verified in any order,

			
 
				-enabling random access.

			
 
				-The hash function used in the Merkle tree can be configured when the file is created.

			
 
				-Currently, SHA-256 is the default, and SHA-512 is supported.

			
 
				-A file's metadata also contains a certificate chain,

			
 
				-and this chain is used to authenticate the signature over the metadata.

			
 
				-In Blocktree, the certificate chain is referred to as a \emph{writecap}

			
 
				-because it grants the capability to write to files.

			
 
				-The certificates in a valid writecap are ordered by their paths,

			
 
				-the initial certificate contains the longest path,

			
 
				-the path in each subsequent certificate must be a prefix of the one preceding it,

			
 
				-and the final certificate must be signed by the root principal.

			
 
				-These rules ensure that there is a valid delegation of write authority at every

			
 
				-link in the chain,

			
 
				-and that the authority is ultimately derived from the root principal specified by the absolute path

			
 
				-of the file.

			
 
				-By including all the information necessary to verify the integrity of a file in its metadata,

			
 
				-it is possible for a requestor who only knows the path of a file to verify that the contents of the

			
 
				-file are authentic.
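The ordering rules for a writecap can be sketched as a structural check. The `Cert` type and its fields are hypothetical stand-ins for whatever the real certificate format carries, and signature verification at each link is deliberately omitted; only the path-prefix and root-signer rules stated above are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    path: tuple   # hypothetical: path components the certificate was issued for
    issuer: str   # hypothetical: identifier of the principal that signed it


def is_prefix(shorter: tuple, longer: tuple) -> bool:
    return len(shorter) <= len(longer) and longer[:len(shorter)] == shorter


def chain_is_well_formed(chain, root_principal: str) -> bool:
    """Check the ordering rules only; real validation must also verify signatures."""
    if not chain:
        return False
    # Each subsequent path must be a prefix of the one preceding it.
    for prev, cur in zip(chain, chain[1:]):
        if not is_prefix(cur.path, prev.path):
            return False
    # The final certificate must be signed by the root principal.
    return chain[-1].issuer == root_principal
```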

			
 
				-

			
 
				-% Confidentiality protecting files with readcaps. Single pubkey operation to read a dir tree.

			
 
				-Confidentiality protection of files is optional, but when it is enabled,

			
 
				-a file's sectors are individually encrypted using a symmetric cipher.

			
 
				-The key to this cipher is randomly generated when a file is created.

			
 
				-A different IV is generated for each sector by hashing the index of the sector with a

			
 
				-randomly generated IV for the entire file.
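A per-sector IV derivation of this kind might look like the sketch below. The text does not specify the hash function, how the index is encoded, or the IV length, so SHA-256, an 8-byte little-endian index, and truncation to 16 bytes are assumptions chosen for illustration.

```python
import hashlib


def sector_iv(file_iv: bytes, sector_index: int, iv_len: int = 16) -> bytes:
    """Derive a sector's IV by hashing the file-wide IV with the sector index."""
    # Assumed encoding: file IV followed by the index as 8 little-endian bytes.
    digest = hashlib.sha256(file_iv + sector_index.to_bytes(8, "little")).digest()
    # Assumed truncation to the cipher's IV length.
    return digest[:iv_len]
```

Because the derivation is deterministic, only the file-wide IV needs to be stored; the IV of any sector can be recomputed on demand.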

			
 
				-A file's key and IV are encrypted using the public keys of the principals to whom read access is

			
 
				-to be allowed.

			
 
				-The resulting ciphertext is referred to as a \emph{readcap}, as it grants the capability to read the

			
 
				-file.

			
 
				-These readcaps are stored in a table in the file's metadata.

			
 
				-Each entry in the table is identified by a byte string that is derived from the public key of the

			
 
				-principal who owns the entry's readcap.

			
 
				-The byte string is computed by calculating an HMAC of the principal's public key.

			
 
				-The HMAC is keyed with a randomly generated salt that is stored in the file's metadata.

			
 
				-An identifier for the hash function that was used in the HMAC is included in the byte string so

			
 
				-that the HMAC can be recomputed later.

			
 
				-When the filesystem service accesses the file,

			
 
				-it recomputes the HMAC using the salt, its public key, and the hash function specified in each entry

			
 
				-of the table.

			
 
				-It can then identify the entry which contains its readcap,

			
 
				-or that such an entry does not exist.

			
 
				-This mechanism was designed to prevent offline correlation attacks on file metadata,

			
 
				-as metadata is stored in plaintext in local filesystems.
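The readcap lookup can be illustrated with a short sketch. The layout of the entry identifier (hash name, a colon, then the HMAC tag) and the function names are assumptions; the text only states that the identifier includes the hash function and is an HMAC of the public key keyed with the file's salt.

```python
import hashlib
import hmac


def readcap_entry_id(salt: bytes, public_key: bytes, hash_name: str = "sha256") -> bytes:
    """Identifier for a readcap entry: hash name plus HMAC(salt, public key)."""
    tag = hmac.new(salt, public_key, hash_name).digest()
    # Assumed layout: "<hash name>:<tag>" so the HMAC can be recomputed later.
    return hash_name.encode() + b":" + tag


def find_readcap(table: dict, salt: bytes, public_key: bytes):
    """Return the readcap for public_key, or None if no entry matches."""
    for entry_id, readcap in table.items():
        hash_name = entry_id.split(b":", 1)[0].decode()
        candidate = readcap_entry_id(salt, public_key, hash_name)
        if hmac.compare_digest(entry_id, candidate):
            return readcap
    return None
```

Someone who sees only the metadata cannot match entries across files, because each file's salt produces unrelated identifiers for the same public key.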

			
 
				-The file key and IV are also encrypted using the keys of the file's parents.

			
 
				-Note that there may be multiple parents of a file because it may be hard linked to several

			
 
				-directories.

			
 
				-Each of the resulting ciphertexts is stored in another table in the file's metadata.

			
 
				-The entries in this table are identified by an HMAC of the parent's generation and inode numbers,

			
 
				-where the HMAC is keyed using the file's salt.

			
 
				-By encrypting a file's key and IV using the key and IV of its parents,

			
 
				-it is possible to traverse a directory tree using only a single public key decryption.

			
 
				-The file where this traversal begins must contain a readcap owned by the accessing principal,

			
 
				-but all subsequent accesses can be performed by decrypting the key and IV of a child using the

			
 
				-key and IV of a parent.

			
 
				-Not only does this allow traversals to use more efficient symmetric key cryptography,

			
 
				-but it also means that it suffices to grant a readcap on a single directory in order to grant

			
 
				-access to the entire tree rooted at that directory.

			
 
				-

			
 
				-% File key rotation and readcap revocation.

			
 
				-Because it is not possible to change the key used by a file after it is created,

			
 
				-a file must be copied in order to rotate the key used to encrypt it.

			
 
				-Similarly, revoking a readcap is accomplished by creating a copy of the file

			
 
				-and adding all the readcaps from the original's metadata except for the one being revoked.

			
 
				-While it is certainly possible to remove a readcap from the metadata table,

			
 
				-this is not supported because the readcap holder may have used custom software to save the file's

			
 
				-key and IV while it had access to them,

			
 
				-so data written to the same file after revocation could potentially be decrypted by it.

			
 
				-By forcing the user to create a new file,

			
 
				-they are forced to re-encrypt the data using a fresh key and IV.

			
 
				-

			
 
				-% Obfuscating sector files stored in the local filesystem.

			
 
				-From an attacker's perspective,

			
 
				-not every file in your domain is equally interesting.

			
 
				-They may be particularly interested in reading your root directory,

			
 
				-or they may have identified the inode of a file containing kompromat.

			
 
				-To hinder offline identification of which files the sectors in the local filesystem belong to,

			
 
				-an obfuscation mechanism is used.

			
 
				-This works by generating a random salt for each generation of the sector service,

			
 
				-and storing it in the generation's superblock.

			
 
				-It is hashed along with the inode and the sector ID to produce the file name of the sector file

			
 
				-in the local filesystem.

			
 
				-These files are arranged into different subdirectories according to the value of the first two

			
 
				-digits in the hex encoding of the resulting hash,

			
 
				-the same way git organizes object files.

			
 
				-This simple method makes it more difficult for an attacker to identify the files each sector belongs

			
 
				-to

			
 
				-while still allowing the sector service efficient access.
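The naming scheme can be sketched as follows. SHA-256, the 8-byte little-endian integer encoding, and using the remaining hex digits as the file name (as git does) are assumptions; the text specifies only that the salt, inode, and sector ID are hashed and that the first two hex digits select the subdirectory.

```python
import hashlib
from pathlib import PurePosixPath


def sector_file_path(salt: bytes, inode: int, sector_id: int) -> PurePosixPath:
    """Map (salt, inode, sector ID) to an obfuscated relative path."""
    h = hashlib.sha256()
    h.update(salt)                              # per-generation random salt
    h.update(inode.to_bytes(8, "little"))       # assumed integer encoding
    h.update(sector_id.to_bytes(8, "little"))   # assumed integer encoding
    name = h.hexdigest()
    # First two hex digits pick the subdirectory; the rest is the file name.
    return PurePosixPath(name[:2]) / name[2:]
```

The sector service, which knows the salt, recomputes the path directly, while an observer without the salt cannot tell which inode a given sector file belongs to.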

			
 
				-

			
 
				-% Credential stores.

			
 
				-Processes need a way to securely store their credentials.

			
 
				-They accomplish this by using a credential store,

			
 
				-which is a type that implements the \texttt{CredStore} trait.

			
 
				-A credential store provides methods for using a process's credentials to encrypt, decrypt,

			
 
				-sign, and verify data,

			
 
				-but it does not allow them to be exported.

			
 
				-A credential store also provides a method for generating new root credentials.

			
 
				-Because root credentials represent the root of trust for an entire domain,

			
 
				-it must be possible to securely back them up from one credential store to another.

			
 
				-Root credentials can also be used to perform cryptographic operations without exporting them.

			
 
				-A password is set when the root credentials are generated,

			
 
				-and this same password must be provided to use, export, and import them.

			
 
				-When root credentials are exported from a credential store they are confidentiality protected

			
 
				-using multiple layers of encryption.

			
 
				-The outermost layer is encryption by a symmetric key cipher whose key is derived from the

			
 
				-password.

			
 
				-A public key of the receiving credential store must also be provided when root credentials are

			
 
				-exported.

			
 
				-This public key is used to perform the inner encryption of the root credentials,

			
 
				-ensuring that only the intended credential store is able to import them.

			
 
				-Currently there are two \texttt{CredStore} implementors in Blocktree,

			
 
				-one which is used for testing and one which is more secure.

			
 
				-The first is called \texttt{FileCredStore},

			
 
				-and it uses a file in the local filesystem to store credentials.

			
 
				-A symmetric cipher is used to protect the root credentials, if they are stored,

			
 
				-but it relies on the security of the underlying filesystem to protect the process credentials.

			
 
				-For this reason it is not recommended for production use.

			
 
				-The other credential store is called \texttt{TpmCredStore},

			
 
				-and it uses a Trusted Platform Module (TPM) 2.0 on the local machine to store credentials.

			
 
				-The TPM is used to generate the process's credentials in such a way that they can never be

			
 
				-exported from the TPM (this is a feature of TPM 2.0).

			
 
				-A randomly generated cookie is needed to use these credentials.

			
 
				-The cookie is stored in a file in the local filesystem with its permissions set to prevent

			
 
				-others from accessing it.

			
 
				-Thus this type also relies on the security of the local filesystem.

			
 
				-But, an attacker would need to steal the TPM and this cookie in order to steal a process's

			
 
				-credentials.

			
 
				-

			
 
				-% Manual provisioning via the command line.

			
 
				-The term provisioning is used in Blocktree to refer to the process of acquiring credentials.

			
 
				-A command line tool called \texttt{btprovision} is provided for provisioning credential stores.

			
 
				-This tool can be used to generate new process or root credentials, create a certificate request

			
 
				-using them, issue a new certificate, and finally to import the new certificate chain.

			
 
				-When setting up a new domain,

			
 
				-\texttt{btprovision} can create a new sector storage directory in the local filesystem

			
 
				-and write the new process's files to it.

			
 
				-It is also capable of connecting to the filesystem service if it is already running.

			
 
				-

			
 
				-% Automatic provisioning.

			
 
				-While manual provisioning is necessary to bootstrap a domain,

			
 
				-an automatic method is needed to make this process more ergonomic.

			
 
				-When a runtime starts it checks its configured credential store to find the certificate chain to

			
 
				-use for authenticating to other runtimes.

			
 
				-If no such chain is stored,

			
 
				-the runtime can choose to request a certificate from the filesystem service.

			
 
				-This is done by dispatching a message with \texttt{call} to the filesystem service without

			
 
				-specifying a scope.

			
 
				-Because the message specifies no path, there is no root directory to begin discovery at.

			
 
				-So, the runtime resorts to using link-local discovery to find other runtimes.

			
 
				-Once one is discovered,

			
 
				-the runtime connects to it anonymously

			
 
				-and sends it a certificate request.

			
 
				-This request includes a copy of the runtime's public key and, optionally, a path where the

			
 
				-runtime would like to be located.

			
 
				-This path is purely advisory;

			
 
				-the filesystem service is free to place the runtime in any directory it sees fit.

			
 
				-The filesystem service creates a new process file containing the public key and marks it as

			
 
				-pending.

			
 
				-The reply to the runtime contains the path of the file created for it.

			
 
				-The operators of the domain can then use the web GUI or \texttt{btprovision} to view the request

			
 
				-and approve it at their discretion.

			
 
				-Assuming an operator approves the request,

			
 
				-it uses its credentials and the public key in the new process's file to issue a certificate

			
 
				-and then stores it in the file.

			
 
				-Authorization attributes (e.g. UID and GID) are also assigned to the process and written into its

			
 
				-file.

			
 
				-Note that a process's file is normally not writeable by the process itself,

			
 
				-so as to prevent it from setting its own authorization attributes.

			
 
				-Once these data have been written to the process file,

			
 
				-the runtime can read them to retrieve its new certificate chain.

			
 
				-It stores this chain in its credential store for later use.

			
 
				-The runtime can avoid polling its file for changes if it subscribes to write notifications.

			
 
				-The runtime must close the anonymous connections it made

			
 
				-and reconnect using the new certificate chain.

			
 
				-Once new connections are established,

			
 
				-it can read and write files using the authorization attributes specified in its file.

			
 
				-Note that this procedure only works when the runtime is on the same LAN as another runtime.

			
 
				-

			
 
				-% The generation of new root credentials and the creation of a new domain.

			
 
				-The procedure for creating a new domain is straightforward,

			
 
				-and all the steps can be performed using \texttt{btprovision}.

			
 
				-\begin{enumerate}

			
 
				-  \item Generate the root credentials for the new domain.

			
 
				-  \item Generate the credentials for the first runtime.

			
 
				-  \item Create a certificate request using the runtime credentials.

			
 
				-  \item Approve the request using the root credentials.

			
 
				-  \item Import the new certificate into the credential store of the first runtime.

			
 
				-\end{enumerate}

			
 
				-The first runtime is configured to host the sector and filesystem services,

			
 
				-so that subsequent runtimes will have access to the filesystem.

			
 
				-After that, additional runtimes on the same LAN can be provisioned using the automatic process.

			
 
				-

			
 
				-% Setting up user based access control.

			
 
				-Up till now the focus has been on authentication and authorization of processes,

			
 
				-but it bears discussing how user-based access control can be accomplished with Blocktree.

			
 
				-Because credentials are locked to the device on which they're created,

			
 
				-a user will have at least as many principals as they have devices.

			
 
				-But, all of these principals can be configured to have the same authorization attributes (UID, GID),

			
 
				-giving them the same permissions.

			
 
				-It makes sense to keep the files for all of the provisioned runtimes associated with a user in one

			
 
				-place

			
 
				-and the natural place is in the user's home directory.

			
 
				-Although every one of the user's processes needs to be provisioned,

			
 
				-this is not a huge limitation because a single runtime can host many different actors,

			
 
				-implementing many different applications.

			
 
				-Managing the users in a domain is facilitated by putting their home directories in a single user

			
 
				-directory for the domain.

			
 
				-Runtimes hosting the sector service on storage servers could then be provisioned in this directory

			
 
				-to provide the sector and filesystem services for the users' home directories.

			
 
				-It would be at the administrators' discretion whether or not to enable client-side encryption.

			
 
				-If they wanted to,

			
 
				-the principal of at least one of a user's runtimes would need to be issued a readcap for the

			
 
				-user's home directory.

			
 
				-This runtime could then directly access the sector service in the domain's user directory.

			
 
				-By moving encryption onto the user's computer,

			
 
				-load can be shed from the storage servers.

			
 
				-Note that this setup does require all of the user's runtimes to be able to communicate with the

			
 
				-runtime whose principal was issued the readcap.

			
 
				-

			
 
				-% Example of how these mechanisms allow data to be shared.

			
 
				-To illustrate how these mechanisms can be used to facilitate collaboration between enterprises,

			
 
				-consider a situation where two companies wish to partner on the development of a product.

			
 
				-To facilitate their collaboration,

			
 
				-they wish to have a way to securely exchange data with each other.

			
 
				-One of the companies is selected to host the data

			
 
				-and accepts the cost and responsibility of serving it.

			
 
				-The host company creates a directory which will be used to store all of the data created during

			
 
				-development.

			
 
				-The other company will connect to the filesystem service in the host company's domain to access

			
 
				-data in the shared directory.

			
 
				-Each principal in the other company that wishes to connect requests to be credentialed in the

			
 
				-shared directory.

			
 
				-The hosting company manually reviews these requests and approves them,

			
 
				-assigning each of the principals authorization attributes appropriate for its domain.

			
 
				-This may involve issuing UID and GID values to each of the principals, or perhaps SELinux contexts.

			
 
				-The actual set of attributes supported is determined by the \texttt{Authorization} type used

			
 
				-by the filesystem service in the host company's domain.

			
 
				-Once the principals have their credentials,

			
 
				-they can dispatch messages to the filesystem service using the shared directory as the scope and

			
 
				-setting the rootward field to true.

			
 
				-This allows actors authenticating with the credentials of these principals to perform all filesystem

			
 
				-operations authorized by the hosting company.

			
 
				-This situation gives the hosting company a lot of control over the data.

			
 
				-If the other company wishes to protect its investment in the R\&D effort,

			
 
				-it should subscribe to write events on the shared directory and the files in it so that it can

			
 
				-copy new sectors out of the host company's domain as soon as they are written.

			
 
				-Note that although it is not possible to directly subscribe to writes on the contents of a

			
 
				-directory, by monitoring a directory for changes,

			
 
				-one can begin monitoring files as soon as they are created.

			
 
				-

			
 
				-

			
 
				-\section{Examples}

			
 
				-This section contains examples of systems that could be built using Blocktree.

			
 
				-The hope is to illustrate how this platform can be used to implement existing applications more

			
 
				-easily and to make it possible to implement systems which are currently out of reach.

			
 
				-

			
 
				-\subsection{A distributed AI execution environment.}

			
 
				-Neural networks are just vector-valued functions with vector inputs,

			
 
				-albeit very complicated ones with potentially billions of parameters.

			
 
				-But, just like any other computation,

			
 
				-these functions can be conceptualized as computational graphs.

			
 
				-Imagine that you have a set of computers equipped with AI accelerator hardware

			
 
				-and you have a neural network that is too large to be processed by any one of them.

			
 
				-By partitioning the graph into small enough subgraphs,

			
 
				-we can break the network down into pieces which can be processed by each of the accelerators.

			
 
				-The full network can be stitched together by passing messages between each of these pieces.

			
 
				-

			
 
				-Let us consider how this could be accomplished with Blocktree.

			
 
				-We begin by provisioning a runtime on each of the accelerator machines,

			
 
				-each of which will have a new accelerator service registered.

			
 
				-Messages will be sent to the accelerator service describing the computational graph to execute,

			
 
				-as well as the name of the actor to which the output is to be sent.

			
 
				-When such a message is received by an accelerator service provider,

			
 
				-it spawns an actor which compiles its subgraph to a kernel for its accelerator

			
 
				-and remembers the name of the actor to send its output to.

			
 
				-An orchestrator service will be responsible for partitioning the graph and sending these messages.

			
 
				-Ownership of the actors spawned by the accelerator service is given to the orchestrator service,

			
 
				-ensuring that they will all be stopped when the orchestrator returns.

			
 
				-When one of the spawned actors stops,

			
 
				-it unloads the kernel from the accelerator's memory and returns it to its initial state.

			
 
				-Note that the orchestrator actor must have execute permissions on each of the accelerator runtimes

			
 
				-in order to send messages to them.

			
 
				-The orchestrator dispatches messages to the accelerator service in reverse order of the flow of data

			
 
				-in the computational graph,

			
 
				-so that it can tell each service provider where its output should be sent.

			
 
				-The actors responsible for the last layer in the computational graph send their output to the

			
 
				-orchestrator.

			
 
				-To begin the computation,

			
 
				-the actors which are responsible for input are given the filesystem path of the input data.

			
 
				-The orchestrator learns of the completion of the computation once it receives the output from

			
 
				-the final layer.

			
 
				-It can then save these results to the filesystem and return.

			
 
				-Because inference and training can both be modeled by computational graphs,

			
 
				-this same procedure can be used for both.
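The reverse-order dispatch can be illustrated for the simplest case, where the partitions form a linear pipeline. This is only a sketch: the function name, the `-actor` naming of spawned actors, and the restriction to a linear chain are assumptions made for the example, and a real graph partition may fan in or out.

```python
def plan_dispatch(stages, orchestrator="orchestrator"):
    """Return (stage, output_target) pairs in the order messages are dispatched.

    stages lists the partitions in data-flow order. Dispatching in reverse means
    the target of each stage's output is already known when its message is sent.
    """
    plan = []
    downstream = orchestrator  # the last layer sends its output to the orchestrator
    for stage in reversed(stages):
        plan.append((stage, downstream))
        # Hypothetical naming: the actor spawned for this stage receives the
        # output of the stage that precedes it in the data flow.
        downstream = f"{stage}-actor"
    return plan
```

For example, `plan_dispatch(["input", "hidden", "output"])` dispatches to the output stage first, telling it to reply to the orchestrator, and to the input stage last.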

			
 
				-

			
 
				-\subsection{A decentralized social media network.}

			
 
				-One of the original motivations for designing Blocktree was to create a platform for a social

			
 
				-network that puts users fully in control of their data.

			
 
				-In the opinion of the author,

			
 
				-the only way to actually accomplish this is for users to host the data themselves.

			
 
				-One might think it is possible to use client-side encryption to solve the privacy issue,

			
 
				-but this does not solve the full problem.

			
 
				-While it is true that good client-side encryption will prevent the service provider from reading

			
 
				-the user's data,

			
 
				-the user could still lose everything if the service provider goes out of business or simply

			
 
				-decides to stop offering its service.

			
 
				-Similarly, putting data in a federated system, as has been proposed by the Mastodon developers,

			
 
				-also puts the user at risk of losing their data if the operator of the server they use decides to

			
 
				-shut it down.

			
 
				-To have real control the user must host the data themselves.

			
 
				-Then they decide how it is encrypted, how it is served, and to whom.

			
 
				-

			
 
				-Let us explore how Blocktree can be used to build a social media platform which provides this

			
 
				-control.

			
 
				-To participate in this network each user will need to set up their own domain by generating new root

			
 
				-credentials

			
 
				-and provisioning at least one runtime to host the social media service.

			
 
				-A technical user could do this on their own hardware by reading the Blocktree documentation,

			
 
				-but a non-technical user might choose to purchase a new router with Blocktree pre-installed.

			
 
				-By connecting this router directly to their WAN,

			
 
				-the user ensures that the services running on it will always have direct internet access.

			
 
				-The user can access the \texttt{btconsole} web GUI via the router's WiFi interface to generate their

			
 
				-root credentials and provision new runtimes on their network.

			
 
				-

			
 
				-A basic function of any social network is keeping track of a user's contacts.

			
 
				-This would be handled by maintaining the contacts as files in a well-known directory in the user's

			
 
				-domain.

			
 
				-Each file in the directory would be named using the user-defined nickname for the contact

			
 
				-and its contents would include the root principal of the contact as well as any additional user

			
 
				-defined attributes,

			
 
				-such as address or telephone number.

			
 
				-The root principal would be used to discover runtimes controlled by the contact

			
 
				-so that messages can be sent to the social media service running in them.

			
 
				-When a user adds a new contact,

			
 
				-a connection message would be sent to it,

			
 
				-which the contact could choose to accept or reject.

			
 
				-If accepted,

			
 
				-the contact would create an entry in its contacts directory for the user.

			
 
				-The contact's social media service would then accept future direct messages from the user.

			
 
				-When the user sends a direct message to the contact,

			
 
				-its runtime discovers runtimes controlled by the contact and delivers the message.

			
 
				-Once delivered the contact's social media service stores the message in a directory for the user's

			
 
				-correspondence,

			
 
				-sort of like an mbox directory but where messages are sorted into directories based on sender

			
 
				-instead of receiver.

			
 
				-

			
 
				-Note that this procedure only works if a contact's root principal can be resolved using the

			
 
				-search domain configured in the user's runtime.

			
 
				-We can ensure this is the case by configuring the runtime to use a search domain that operates

			
 
				-a Dynamic DNS (DDNS) service

			
 
				-and by arranging with this service to create the correct records to resolve the root principal.

			
 
				-The author intends to operate such a service to facilitate the use of Blocktree by home users,

			
 
				-but a more long-term solution is to implement a blockchain for resolving root principals.

			
 
				-Only then would the system be fully decentralized.

			
 
				-

			
 
				-Making public posts is accomplished by creating files in a directory with the HTML contents of the

			
 
				-post.

			
 
				-This file, the directory containing it, and all parents of it,

			
 
				-would be configured to allow others to read, and in the case of directories, execute them.

			
 
				-At least one runtime with the filesystem service registered would need to have the execute

			
 
				-permission granted to others to allow anyone to access these files.

			
 
				-When someone wanted to view the posts of another user,

			
 
				-they would use the filesystem service to read these files from the well-known posts directory.

			
 
				-

			
 
				-Of course, users would not be using a file manager to interact with this social network;

			
 
				-they would use their browsers as they do now.

			
 
				-This web interface would be served by the social media service in their domain.

			
 
				-A normal user who has a Blocktree-enabled router would just type a special hostname into their

			
 
				-browser to open this interface.

			
 
				-Because the router provides DNS services to their network,

			
 
				-it can generate the appropriate records to ensure this name resolves to the address where the social

			
 
				-media service is listening.

			
 
				-The social media service would be responsible for sending messages to other users' domains to

			
 
				-get their posts,

			
 
				-and to read the filesystem to display the user's direct messages.

			
 
				-All this file data would be used to populate the web interface.

			
 
				-It is not hard to see how the same system could be used to serve any type of media: text, images,

			
 
				-video, immersive 3D worlds.

			
 
				-All of these can be stored in files in the filesystem,

			
 
				-and so all of them are accessible to Blocktree actors.

			
 
				-

			
 
				-One issue that must be addressed with this design is how it will scale to a large number of users

			
 
				-accessing data at once.

			
 
				-In other words,

			
 
				-what happens if the user goes viral?

			
 
				-Currently, the way to solve this would be to add more computers to the user's network which run

			
 
				-the sector and filesystem services.

			
 
				-This is not ideal as it means the user would need to buy more hardware to serve their dank memes.

			
 
				-A better solution would be to implement peer-to-peer distribution of sector data in the filesystem

			
 
				-service.

			
 
				-This would reduce the load on the user's computers and allow their followers to share the posted

			
 
				-data with each other.

			
 
				-This work is planned as a future enhancement.

			
 
				-

			
 
				-\subsection{A smart lock.}

			
 
				-The access control language provided by Blocktree's filesystem can be used for more than just

			
 
				-authorizing access to data.

			
 
				-To illustrate this point,

			
 
				-consider a smart lock installed on the front door of a company's office building.

			
 
				-When the company first got the lock they used NFC to configure the lock

			
 
				-and connect it to their WiFi network.

			
 
				-The lock then used link-local runtime discovery to perform automatic provisioning.

			
 
				-An IT administrator accessed \texttt{btconsole} to approve the provisioning request

			
 
				-and position the lock in a specific directory in the company's domain.

			
 
				-Permission to actuate the lock is granted if a principal has execute permission on the lock's file.

			
 
				-To verify the physical presence of an employee,

			
 
				-NFC is used for the authentication handshake.

			
 
				-When an employee presses their NFC device, for instance their phone, to the lock,

			
 
				-it generates a nonce and transmits it to the device.

			
 
				-The device then signs the nonce using the credentials it used during provisioning in the company's

			
 
				-domain.

			
 
				-It transmits this signature to the lock along with the path to the principal's file in the domain.

			
 
				-The lock then reads this file to obtain the principal's authorization attributes and its public key.

			
 
				-It uses the public key to validate the signature presented by the device.

			
 
				-If this is successful,

			
 
				-it then checks the authorization attributes of the principal against the authorization attributes on

			
 
				-its own file.

			
 
				-If execute permissions are granted, the lock actuates, allowing the employee access.

			
 
				-The administrators of the company's domain create a group specifically for controlling physical

			
 
				-access to the building.

			
 
				-All employees with physical access permission are added to this group,

			
 
				-and the group is granted execute permission on the lock,

			
 
				-rather than individual users.

			
 
				-

			
 
				-\subsection{A traditional three-tier web application.}

			
 
				-While it is hoped that Blocktree will enable interesting and novel applications,

			
 
				-it can also be used to build the kind of web applications that are common today.

			
 
				-Suppose that we wish to build a three-tier web application.

			
 
				-Let us explore how Blocktree could help.

			
 
				-

			
 
				-First, let us consider which database to use.

			
 
				-It would be desirable to use a traditional SQL database,

			
 
				-preferably one which is open source and not owned by a large corporation with dubious motivations.

			
 
				-These constraints lead us to choose Postgres,

			
 
				-but Postgres was not designed to run on Blocktree.

			
 
				-However, Postgres does have a container image available on Docker Hub,

			
 
				-so we can create a service to run this container image in our domain.

			
 
				-But Postgres stores all of its data in the local filesystem of the machine it runs on.

			
 
				-How can we ensure this does not become a single point of failure?

			
 
				-First, we should create a directory in our domain to hold the Postgres cluster directory.

			
 
				-Then we should procure at least three servers for our storage cluster

			
 
				-and provision runtimes hosted on each of them in this directory.

			
 
				-The sector service is registered on each of the runtimes,

			
 
				-so all the data stored in the directory will be replicated on each of the servers.

			
 
				-Now, the Postgres service should be registered in one and only one of these runtimes,

			
 
				-as Postgres requires exclusive access to its database cluster.

			
 
				-\texttt{btfuse} will be used to mount the Postgres directory to a path in the local filesystem

			
 
				-and the Postgres container will be configured to access it.

			
 
				-We now have to decide how other parts of the system are going to communicate with Postgres.

			
 
				-We could have the Postgres service set up port forwarding for the container,

			
 
				-so that an ordinary network connection can be used to talk to it.

			
 
				-But we will have to set up TLS if we want this to be secure.

			
 
				-The alternative is to use Blocktree as a VPN and proxy network communications in messages.

			
 
				-This is accomplished by registering a proxy service in the same runtime as the Postgres service

			
 
				-and configuring it to allow traffic it receives to pass to the Postgres container on TCP port 5432.

			
 
				-

			
 
				-In a separate directory,

			
 
				-a collection of runtimes is provisioned which will host the webapp service.

			
 
				-This service will use axum to serve the static assets of our site,

			
 
				-including the Wasm modules which make up our frontend,

			
 
				-as well as our site's backend.

			
 
				-In order to do this,

			
 
				-it will need to connect to the Postgres database.

			
 
				-This is accomplished by registering the proxy service in each of the runtimes hosting the

			
 
				-webapp service.

			
 
				-The proxy service is configured to listen on TCP 127.0.0.1:5432 and forwards all traffic

			
 
				-to the proxy service in the Postgres directory.

			
 
				-The webapp can then use the \texttt{tokio-postgres} crate to establish a TCP connection to

			
 
				-127.0.0.1:5432

			
 
				-and it will end up talking to the containerized Postgres instance.

			
 
				-

			
 
				-Although the data in our database is stored redundantly,

			
 
				-we do still have a single point of failure in our system,

			
 
				-namely the Postgres container.

			
 
				-To handle this we can implement a failover service.

			
 
				-It will work by calling the Postgres service with heartbeat messages.

			
 
				-If too many of these time out,

			
 
				-we assume the service is dead and start a new instance in one of the other runtimes in the Postgres

			
 
				-directory.

			
 
				-This new instance will have access to all the same data as the old one,

			
 
				-including its write-ahead log.

			
 
				-Assuming it can complete any in-progress transactions,

			
 
				-the new service will come up after a brief delay

			
 
				-and the system will recover.
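The failover decision described above can be sketched as a counter of missed heartbeats. This is only a minimal sketch under stated assumptions: `FailoverMonitor`, `Action`, and `max_missed` are hypothetical names that do not come from Blocktree, the choice to count \emph{consecutive} timeouts is an assumption, and the sending of heartbeat messages is left out entirely.

```rust
/// What the failover service should do after observing a heartbeat result.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// The Postgres service is healthy, or has not yet been declared dead.
    Wait,
    /// Too many heartbeats timed out; start a new instance in another runtime.
    StartReplacement,
}

/// Tracks consecutive heartbeat timeouts (hypothetical helper, not a Blocktree API).
pub struct FailoverMonitor {
    max_missed: u32,
    missed: u32,
}

impl FailoverMonitor {
    pub fn new(max_missed: u32) -> Self {
        Self { max_missed, missed: 0 }
    }

    /// Records one heartbeat outcome; `replied` is true if a reply arrived in time.
    pub fn observe(&mut self, replied: bool) -> Action {
        if replied {
            // Any successful reply clears the count of missed heartbeats.
            self.missed = 0;
            return Action::Wait;
        }
        self.missed += 1;
        if self.missed >= self.max_missed {
            // Reset so that the replacement instance starts with a fresh count.
            self.missed = 0;
            Action::StartReplacement
        } else {
            Action::Wait
        }
    }
}
```

A caller would construct the monitor once, feed it the result of every heartbeat, and start the new Postgres instance when `Action::StartReplacement` is returned.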

			
 
				-

			
 
				-\subsection{A real-time geospatial environment.}

			
 
				-% Motivation

			
 
				-If we are to believe science fiction,

			
 
				-then the natural evolution of computer interaction is the development

			
 
				-of a persistent virtual world that we use to communicate, conduct business, and

			
 
				-enjoy our leisure.

			
 
				-This kind of system has been a dream for a long time,

			
 
				-but as it has grown closer to becoming a reality,

			
 
				-the popular consciousness has shifted against it.

			
 
				-People are rightly horrified by the idea of giving control over their virtual worlds to the same

			
 
				-social media company which has an established track record for causing societal harm.

			
 
				-But this technology does not need to be dystopian.

			
 
				-If an open system can be built, which actually works,

			
 
				-it can prevent the market from accepting a closed system designed to lock in user attention

			
 
				-and monetize them relentlessly.

			
 
				-This is the future,

			
 
				-it is only a question of who will own it.

			
 
				-

			
 
				-% Coordinates

			
 
				-Let us explore how Blocktree could be used to build such a system.

			
 
				-The world we are going to render will be a planet with a roughly spherical surface and a

			
 
				-configurable radius $\rho$.

			
 
				-$\rho$ is a \texttt{u32} value whose units are meters.

			
 
				-We will use latitude ($\phi$) and longitude ($\lambda$) in radians to describe the locations of

			
 
				-points on the surface.

			
 
				-Both $\phi$ and $\lambda$ will take \texttt{f64} values.

			
 
				-The elevation of a point will be given by $h$,

			
 
				-which is the deviation from $\rho$.

			
 
				-$h$ is measured in meters and takes values in \texttt{i32}.

			
 
				-So, the distance from the center of the planet to the point ($\phi$, $\lambda$, $h$) is

			
 
				-$\rho + h$.
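The coordinate system above might be represented as follows. This is a sketch, not code from the Blocktree codebase: the `SurfacePoint` type and its methods are hypothetical, and the Cartesian axis convention (z through the poles, x through $\phi = \lambda = 0$) is an assumption made for illustration.

```rust
/// A point on or near the planet's surface (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct SurfacePoint {
    /// Latitude in radians.
    pub phi: f64,
    /// Longitude in radians.
    pub lambda: f64,
    /// Elevation in meters, measured as the deviation from the radius rho.
    pub h: i32,
}

impl SurfacePoint {
    /// Distance in meters from the center of the planet: rho + h.
    pub fn distance_from_center(&self, rho: u32) -> f64 {
        f64::from(rho) + f64::from(self.h)
    }

    /// Cartesian coordinates in meters. Assumes the z axis passes through
    /// the poles and the x axis through (phi, lambda) = (0, 0).
    pub fn to_cartesian(&self, rho: u32) -> [f64; 3] {
        let r = self.distance_from_center(rho);
        [
            r * self.phi.cos() * self.lambda.cos(),
            r * self.phi.cos() * self.lambda.sin(),
            r * self.phi.sin(),
        ]
    }
}
```

Converting to `f64` before adding avoids overflow when a large `u32` radius is combined with an `i32` elevation.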

			
 
				-

			
 
				-% Directory organization. Quadtrees.

			
 
				-The data describing how to render a planet consists of its terrain mesh, terrain textures, and

			
 
				-the objects on its surface.

			
 
				-This could represent a very large amount of data for a planet with realistic terrain populated by

			
 
				-many structures.

			
 
				-To facilitate sharding the information in a planet over many different servers,

			
 
				-the planet is broken into disjoint regions,

			
 
				-each of which is stored in its own directory.

			
 
				-A single top-level directory represents the entire planet,

			
 
				-and contains a manifest describing it.

			
 
				-This manifest specifies the planet's name, its radius, its rotational period,

			
 
				-the size of its regions in MB, as well as any

			
 
				-other global attributes.

			
 
				-This top-level directory also contains the texture for the sky box to render the view of

			
 
				-space from the planet.

			
 
				-In the future it may be interesting to explore the creation of more dynamic environments surrounding

			
 
				-the planet,

			
 
				-but a simple sky box has the advantage of being efficient.

			
 
				-The data in a planet is recursively broken into the fewest number of regions such that the

			
 
				-amount of data in each region is less than a configured threshold.

			
 
				-When a region grows too large it is broken into four new regions by cutting it along the

			
 
				-centerline parallel to the $\phi$ axis, and the one parallel to the $\lambda$ axis.

			
 
				-In other words, it is divided in half north to south and east to west.

			
 
				-The four new regions are stored in four subdirectories of the original region's directory

			
 
				-named 0, 1, 2, and 3.

			
 
				-The data in the old region is then moved into the appropriate directory based on its location.

			
 
				-Thus the directory tree of a planet essentially forms a quadtree,

			
 
				-albeit one which is built up progressively.
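To illustrate how a location maps to a directory in this quadtree, the sketch below computes the sequence of subdirectory names leading to the region which contains a point. Several things here are assumptions rather than part of the design: the function name `region_path`, the numbering of the quadrants (0 = south-west, 1 = south-east, 2 = north-west, 3 = north-east), and the use of a fixed depth. In the scheme described above the tree is built progressively, so a real lookup would stop at whichever leaf directory exists.

```rust
use std::f64::consts::{FRAC_PI_2, PI};

/// Returns the subdirectory names (0 to 3) from the planet's top-level
/// directory down to the region at `depth` which contains the point.
/// Quadrant numbering is an assumption: bit 1 is set for the northern
/// half and bit 0 is set for the eastern half.
pub fn region_path(phi: f64, lambda: f64, depth: usize) -> Vec<u8> {
    let (mut south, mut north) = (-FRAC_PI_2, FRAC_PI_2);
    let (mut west, mut east) = (-PI, PI);
    let mut path = Vec::with_capacity(depth);
    for _ in 0..depth {
        // Cut the current region along both centerlines.
        let mid_phi = (south + north) / 2.0;
        let mid_lambda = (west + east) / 2.0;
        let is_north = phi >= mid_phi;
        let is_east = lambda >= mid_lambda;
        // Shrink the bounds to the quadrant containing the point.
        if is_north { south = mid_phi; } else { north = mid_phi; }
        if is_east { west = mid_lambda; } else { east = mid_lambda; }
        path.push((u8::from(is_north) << 1) | u8::from(is_east));
    }
    path
}
```

Joining the returned numbers with path separators gives the location of the region's directory relative to the planet's top-level directory.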

			
 
				-

			
 
				-% Region data files.

			
 
				-In the leaf directories of this tree the actual data for a region are stored in two files,

			
 
				-one which describes the terrain and the other which describes objects.

			
 
				-It is expected that the terrain will rarely be modified,

			
 
				-but that the objects may change regularly.

			
 
				-The terrain file contains the mesh vertices in the region as well as its textures.

			
 
				-It is organized as an R-tree to allow for efficient spatial queries based on player location.

			
 
				-The region's objects file is also organized as an R-tree.

			
 
				-It contains all of the graphical data for the objects to be rendered in the region,

			
 
				-such as meshes, textures, and shaders.

			
 
				-

			
 
				-% Plots.

			
 
				-The creation of a shared virtual world must involve players collaboratively building persistent

			
 
				-structures.

			
 
				-This is allowed in a controlled way by defining plot objects.

			
 
				-A plot is like a symbolic link:

			
 
				-it points to a file whose contents contain the data used to render the plot.

			
 
				-This mechanism allows the owner of the planet to delegate a specific area on the surface

			
 
				-to another player by creating a plot defining that area and pointing it to a file owned by the

			
 
				-player.

			
 
				-The other player can then write meshes, textures, and shaders into this file to describe the

			
 
				-contents of the plot.

			
 
				-If the other player wishes to collaborate with others on the construction,

			
 
				-they can grant write access on the file to a third party.

			
 
				-This is not unlike the ownership of land in the real world.

			
 
				-

			
 
				-% LOD files in interior directories.

			
 
				-To facilitate the viewing of the planet from many distances,

			
 
				-each interior node in the planet's directory tree contains a reduced level of detail (LOD) version

			
 
				-of the terrain contained in it.

			
 
				-For example, the top-level directory contains the lowest LOD mesh and textures for the terrain.

			
 
				-This LOD would be suitable for rendering the planet as a globe on a shelf,

			
 
				-or as it would appear from a high orbit.

			
 
				-By traversing the directory tree,

			
 
				-the LOD can be increased as the player travels closer to the surface.

			
 
				-This system assists with rendering an animation where the player appears to approach and land upon

			
 
				-the planet's surface.

			
 
				-

			
 
				-% Sharding planet data.

			
 
				-By dividing the planet's data into different leaf directories,

			
 
				-it becomes possible to provision computers running the sector service in each of them.

			
 
				-This divides the storage and bandwidth requirements for serving the planet over this set of

			
 
				-servers.

			
 
				-In addition to serving these data,

			
 
				-another service is needed to keep track of player positions and execute game logic.

			
 
				-Game clients address their messages using the directory of the region their player is located

			
 
				-in, and set \texttt{rootward} to true.

			
 
				-These messages are delivered to the closest game server to the region the player is in,

			
 
				-which may be located in the region's directory or higher up the tree.

			
 
				-When a player transitions from one region to the next,

			
 
				-its game client begins addressing messages using the path of the next region as the scope.

			
 
				-

			
 
				-

			
 
				-\section{Conclusion}

			
 
				-% Blocktree serves as the basis for building distributed Unix.

			
 
				-There have been many attempts to create a distributed Unix over the years.

			
 
				-Time has shown that this is a very hard problem,

			
 
				-but time has not diminished its importance.

			
 
				-IT systems are more complex now than ever,

			
 
				-with many layers of abstraction which have built up over time.

			
 
				-We have suffered greatly from systems which were never designed to be secure on the hostile internet

			
 
				-that exists today.

			
 
				-Security has been bolted onto these systems (HTTPS, STARTTLS, DNSSEC)

			
 
				-in a backwards compatible way,

			
 
				-which results in weakened protections for these systems.

			
 
				-What's worse,

			
 
				-the entire trust model of the web relies on the ludicrous idea that there is a distinguished group

			
 
				-of certificate authorities who have the power to secure our communications.

			
 
				-We need to take a different approach.

			
 
				-Data should be certified by its path,

			
 
				-it must always be transported between processes in an authenticated manner,

			
 
				-and user code should never have to care how this is accomplished!

			
 
				-Time will tell whether the programming model of Blocktree is comprehensible and useful for

			
 
				-developers,

			
 
				-but the goal is to create the kind of easy to extend computing environment which allowed Unix to

			
 
				-be successful.

			
 
				-

			
 
				-% The system enables individuals to self-host the services they rely on.

			
 
				-These days, the typical internet user stores all of their important data in the cloud with

			
 
				-third-party service providers.

			
 
				-They do this because of the convenience of being able to access this information from anywhere,

			
 
				-and because of the perceived safety in having a large internet company look after it for them.

			
 
				-This convenience comes at the price of putting users at the mercy of these companies.

			
 
				-Take email for example,

			
 
				-a service which is universally used for account recovery and password reset.

			
 
				-If a service provider decided to stop providing a user access to their email,

			
 
				-the user would be effectively cut off from any website which sends login verification emails.

			
 
				-This is not a hypothetical situation,

			
 
				-such an incident has occurred (TODO: INSERT CITATION FROM LVL1).

			
 
				-There is no technical reason for things to be this way.

			
 
				-Blocktree allows users to host their own services in their own domain.

			
 
				-If we can make setting up an email or VOIP server as simple as clicking a button in a web GUI,

			
 
				-there will be no convenience advantage to cloud services.

			
 
				-One challenge for self-hosting data is ensuring it is protected from loss when hardware inevitably

			
 
				-fails.

			
 
				-The data redundancy in Blocktree's sector layer ensures that the loss of any one storage

			
 
				-device will not cause data loss.

			
 
				-Streaming replication can also be used to maintain additional redundant copies.

			
 
				-If more users begin hosting their own services,

			
 
				-the internet will become more distributed,

			
 
				-which will make it more resistant to disruption and centralized control.

			
 
				-

			
 
				-% Benefits to businesses.

			
 
				-Cloud computing has also driven changes to the way businesses acquire computing resources.

			
 
				-It is common for startups to rent all of their computing resources from one large cloud

			
 
				-provider.

			
 
				-There are compelling economic and technical reasons to do this,

			
 
				-but as a firm grows they often experience growing pains as their cloud bills also grow.

			
 
				-If the firm has not developed their software with a multi-cloud or hybrid approach in mind,

			
 
				-they may face the prospect of major changes in order to bring their application on-prem or to a

			
 
				-rival cloud.

			
 
				-By developing their application on Blocktree,

			
 
				-businesses have a single platform to target which can run on rented computers in the cloud just as

			
 
				-easily as on servers in their own data center.

			
 
				-This ensures the choice to rent or buy can be made on a purely economic basis.

			
 
				-Blocktree is not the only system that provides this flexibility.

			
 
				-The portability of containers is one of the reasons they have become so popular.

			
 
				-Containers have their place and will most likely be used for years to come,

			
 
				-but they are a lower level abstraction which requires the developer to solve the problems that Blocktree

			
 
				-handles.

			
 
				-

			
 
				-% Blocktree advances the status quo in secure computing.

			
 
				-Ransomware attacks and data breaches are embarrassingly common these days.

			
 
				-There are many reasons for this,

			
 
				-from the reliance on passwords for authentication, to the complexity of the software supply chain,

			
 
				-but it is clear that as IT professionals we need to do more to keep the systems under our

			
 
				-protection safe.

			
 
				-Blocktree helps to do this by solving many of the difficult problems involved with securing

			
 
				-communication on a hostile network.

			
 
				-It takes a true zero-trust approach,

			
 
				-ensuring that all communication between processes is authenticated using public key cryptography.

			
 
				-Data at rest is also secured with encryption and integrity protection.

			
 
				-No security system can prevent all attacks,

			
 
				-but by putting these mechanisms together in an easy to use platform,

			
 
				-we can advance the status quo of secure computing.

			
 
				-

			
 
				-% Composability leads to emergent benefits.

			
 
				-When Unix was first developed in the 1970s, its authors could not have foreseen the applications

			
 
				-that would be enabled by their system.

			
 
				-Although there have been many different kinds of Unices over the years,

			
 
				-the core programming model, built around the filesystem, has remained since the beginning.

			
 
				-It is a testament to the importance of this abstraction that it has persisted for so long.

			
 
				-No designer can foresee all the ways that their abstractions will be used,

			
 
				-but they can try to build them in such a way that as much choice is left to the user as possible.

			
 
				-By making the actor model, and message passing, the core of Blocktree,

			
 
				-it is hoped that low overhead communication between distributed components can be achieved.

			
 
				-By using this system to provide a global distributed filesystem,

			
 
				-it is hoped that the interoperable sharing of data can be achieved.

			
 
				-And by using protocol contracts to constrain actor communication,

			
 
				-it is hoped that the structure and safety of type theory can bring order to distributed

			
 
				-computation.

			
 
				-While it is possible to see some of the applications that can be built from these abstractions,

			
 
				-it seems likely that their composability and the creativity of developers will enable systems that

			
 
				-cannot be foreseen.

			
 
				-

			
 
				-\end{document}
			
--- a/doc/BlocktreeCloudPaper/PubSubStateGraph.gv
+++ b/doc/BlocktreeCloudPaper/PubSubStateGraph.gv
@@ -1,19 +0,0 @@
 
				-
			
 
				-// This can be regenerated with the following command:
			
 
				-// dot -Tpdf -o PubSubStateGraph.pdf PubSubStateGraph.gv
			
 
				-digraph {
			
 
				-    Runtime
			
 
				-    ClientInit
			
 
				-    ServerInit
			
 
				-    Subed
			
 
				-    Listening
			
 
				-    Runtime -> ClientInit [label = "Activate", style = "dashed"]
			
 
				-    Runtime -> ServerInit [label = "Activate", style = "dashed"]
			
 
				-    ClientInit -> Subed [label = "Activate"]
			
 
				-    Subed -> Subed [label = " Pub"]
			
 
				-    ClientInit -> Listening [label = "Sub", style = "dashed"]
			
 
				-    ServerInit -> Listening [label = "Activate"]
			
 
				-    Listening -> Listening [label = " Pub|Sub"]
			
 
				-    Runtime -> Listening [label = "Pub", style = "dashed"]
			
 
				-    Listening -> Subed [label = "Pub", style = "dashed"]
			
 
				-}
			
--- a/doc/BlocktreeCloudPaper/notes.md
+++ b/doc/BlocktreeCloudPaper/notes.md
@@ -1,165 +0,0 @@
 
				-## TODO

			
 
				-1. Replace references to "process" with "runtime". Because the runtime is required to route

			
 
				-messages, it will be present in all practical Blocktree processes.

			
 
				-2. Apply the new terminology I've used in this paper to the codebase.

			
 
				-

			
 
				-

			
 
				-- Actor runtime

			
 
				-* Messages securely forwarded over the network.

			
 
				-* 

			
 
				-

			
 
				-- Distributed network storage system.

			
 
				-* Sector-level access to data.

			
 
				-* File-level access to data.

			
 
				-

			
 
				-

			
 
				-## Process of delegating storage in a directory.

			
 
				-1. A new directory is created. This directory has the generation number of the original sector

			
 
				-   cluster.

			
 
				-2. A process credential file is created in the directory. It is marked to indicate that the process

			
 
				-   will host the sector service. This mark means that the process will be responsible (jointly,

			
 
				-   along with all other such processes in the directory) for storing the sectors in the directory.

			
 
				-3. The new process starts and initializes a new directory in its local filesystem to store sector

			
 
				-   data. It knows to create this directory because it is configured to run the sector service,

			
 
				-   which creates a new storage directory if one does not already exist. As part of the creation

			
 
				-   process a new super block is created, which is the file with inode 1 and which is not contained

			
 
				-   in any directory. This new superblock contains the generation number which identifies the sector

			
 
				-   service in this directory. The generation number is determined by contacting the sector service

			
 
				-   in the root directory, which has knowledge and authority to assign unique numbers to every

			
 
				-   sector service.

			
 
				-4. The filesystem service in the directory will discover the sector service actor running inside the

			
 
				-   new process. When it creates new files in the directory it will store their sectors using the

			
 
				-   sector service in the process. These new files will use the generation number defined in the

			
 
				-   superblock stored in the sector service in the directory, which is different from the generation

			
 
				-   number of the directory itself.

			
 
				-5. When new processes configured to run the sector service are added to the directory, they

			
 
				-   automatically replicate sectors marked with their generation number, and use Raft to ensure the

			
 
				-   consistency of sector data.

			
 
				-6. Note that the sectors of the directory itself are actually stored by the parent sector service.

			
 
				-   Only the files created within it after the sector service in the directory becomes active

			
 
				-   are stored by the child sector service.

			
 
				-

			
 
				-## Filesystem discovery

			
 
				-There are four cases to consider, depending on what permissions the discovering runtime has for the

			
 
				-file being accessed:

			
 
				-1. The discoverer hosts the sector service responsible for the file.

			
 
				-2. The discoverer hosts the filesystem service because it has a readcap for the file.

			
 
				-3. The discoverer does not host the filesystem service for the file but has read permissions for the

			
 
				-   file.

			
 
				-4. The discoverer is attempting to read the file anonymously.

			
 
				-

			
 
				-In the first case, the sector service needs to discover all of the other sector service providers

			
 
				-in its directory. Once it has connected to all of them, sectors can be reconstructed and written

			
 
				-to the cluster.

			
 
				-It makes sense to have the filesystem service registered in such a runtime,

			
 
				-because this would allow all filesystem operations to happen locally (at least it would access the

			
 
				-local sector service, the sector service may need to communicate with its peers in the directory

			
 
				-when data is written).

			
 
				-In this case the runtime needs to be able to find all of the runtimes hosting the sector service

			
 
				-in its directory.

			
 
				-

			
 
				-In the second case the runtime needs to be able to discover the correct sector service provider to

			
 
				-connect to.

			
 
				-It seems that it needs to find a runtime hosting the sector service contained in one of its parent

			
 
				-directories.

			
 
				-Once such a runtime is found, messages can be delivered to it to access the sectors of the file,

			
 
				-and their contents will be decrypted locally.

			
 
				-

			
 
				-In the third case,

			
 
				-the runtime must locate the closest runtime hosting the filesystem service which is contained in one

			
 
				-of the runtime's parent directories.

			
 
				-This should be the same query as in case 2, just used for the filesystem service instead of the

			
 
				-sector service.

			
 
				-

			
 
				-In case four, the process must discover a filesystem service hosting the file. This case

			
 
				-actually doesn't seem any different from case 3, it's just performed with no authorization

			
 
				-attributes.

			
 
				-So in terms of FS permissions, only files which allow others to read them could be accessed in this

			
 
				-way, and only if all of their parent directories can also be read by others.

			
 
				-This requirement that all parent directories can also be read by others,

			
 
				-would be too strict for non-anonymous access.

			
 
				-It's important to allow credentialed access to a file when a process has permission to that

			
 
				-specific file, even if the process can't access one or more of the file's parents.

			
 
				-This helps to keep the system flexible.

			
 
				-

			
 
				-There seem to be two queries which are needed to locate the appropriate runtimes. A query is

			
 
				-executed with respect to a scope and only considers runtimes with a given service registration.

			
 
				-1. Find all runtimes directly contained in the scope.

			
 
				-2. Find a runtime which is contained in a parent directory which is closest to the scope.

			
 
				-   Closest means that there are no relevant runtimes contained in any of the subdirectories

			
 
				-   of the directory containing the query result.

			
 
				-

			
 
				-These queries correspond to the two ways that messages can be dispatched by an actor.
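
The two queries above could be sketched as follows. This is a rough illustration under stated assumptions, not the actual implementation: the `Runtime` struct, the `Directory` map, and both function names are hypothetical, and the filesystem is stood in for by an in-memory map from directory paths to the runtimes they contain. The second query is assumed to consider the scope itself before its parents.

```rust
use std::collections::HashMap;

/// A runtime and the services registered in it (hypothetical representation).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Runtime {
    pub name: String,
    pub services: Vec<String>,
}

/// Maps a directory path (as components) to the runtimes directly contained in it.
pub type Directory = HashMap<Vec<String>, Vec<Runtime>>;

/// Query 1: all runtimes directly contained in `scope` which host `service`.
pub fn runtimes_in_scope<'a>(
    dir: &'a Directory,
    scope: &[String],
    service: &str,
) -> Vec<&'a Runtime> {
    dir.get(scope)
        .into_iter()
        .flatten()
        .filter(|rt| rt.services.iter().any(|s| s == service))
        .collect()
}

/// Query 2: a runtime hosting `service` in the closest directory at or above `scope`.
pub fn closest_rootward<'a>(
    dir: &'a Directory,
    scope: &[String],
    service: &str,
) -> Option<&'a Runtime> {
    // Try the scope itself, then each ancestor, ending with the root (the empty path).
    (0..=scope.len())
        .rev()
        .find_map(|len| runtimes_in_scope(dir, &scope[..len], service).into_iter().next())
}
```

Walking from the scope toward the root means the first match is automatically the closest one, so no relevant runtime can exist in a directory between the result and the scope.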

			
 
				-

			
 
				-There are three cases to consider when defining the security model for runtime queries:

			
 
				-1. The process has a readcap for the scope of the query.

			
 
				-2. The process has read permission for the scope of the query.

			
 
				-3. The process is issuing the query anonymously.

			
 
				-

			
 
				-In the first two cases the query should be allowed.

			
 
				-In the third case it should only be allowed if every file on the path from the scope to the root

			
 
				-permits others to read.
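
The rule above is small enough to write down directly. The following is a sketch only: the `Querier` enum and `query_allowed` function are hypothetical names, and the permission bits of the files on the path are assumed to have been collected into a slice beforehand.

```rust
/// How the process issuing a query relates to the query's scope (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Querier {
    /// The process has a readcap for the scope.
    HasReadcap,
    /// The process has read permission for the scope.
    HasReadPermission,
    /// The process is issuing the query anonymously.
    Anonymous,
}

/// Decides whether a runtime query is allowed.
/// `others_can_read` holds one flag for every file on the path from the
/// scope to the root, set when that file permits others to read it.
pub fn query_allowed(querier: Querier, others_can_read: &[bool]) -> bool {
    match querier {
        Querier::HasReadcap | Querier::HasReadPermission => true,
        Querier::Anonymous => others_can_read.iter().all(|&readable| readable),
    }
}
```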

			
 
				-

			
 
				-When a runtime receives a query it should use the filesystem to answer it.

			
 
				-If, as it navigates to the scope, it encounters a directory which it is not responsible for

			
 
				-storing,

			
 
				-it will return a redirection to the querier with the IP address of a runtime where the

			
 
				-query should be retried.

			
 
				-This process repeats until the query is answered,

			
 
				-either successfully with one or more runtimes or with an error and no runtimes.

			
 
				-

			
 
				-Queries are issued automatically by processes as part of the message routing procedure.

			
 
				-Each process maintains a trie keyed using message scope.

			
 
				-It uses this trie to find the longest prefix match with the scope.

			
 
				-The value contained in the trie is a hash table of service registrations.

			
 
				-This allows a process to quickly determine if it already knows the correct runtime to deliver the

			
 
				-message to.

			
 
				-If the process does not know the correct recipient,

			
 
				-it performs discovery using one of the queries above,

			
 
				-with the query being determined by how the message was dispatched.

			
 
				-If no other runtimes are known,

			
 
				-the process uses DNS to find a runtime in the root directory,

			
 
				-remembers the runtime in its trie,

			
 
				-and issues the query to it.

			
 
				-There will need to be a cache control mechanism for determining how long entries in the trie can

			
 
				-be kept.
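
A minimal sketch of such a routing trie is shown below. The names `RouteTrie`, `insert`, and `lookup` are hypothetical, runtime addresses are represented as plain strings, and cache expiry is not modelled. One design choice is an assumption: the lookup returns the deepest prefix of the scope which has a registration for the requested service, rather than the deepest prefix with any entry at all.

```rust
use std::collections::HashMap;

/// A trie keyed by the components of a message scope (hypothetical type).
#[derive(Default)]
pub struct RouteTrie {
    children: HashMap<String, RouteTrie>,
    /// Service name -> address of a runtime known to host that service.
    services: HashMap<String, String>,
}

impl RouteTrie {
    /// Remembers that `service` at `scope` is hosted by the runtime at `addr`.
    pub fn insert(&mut self, scope: &[&str], service: &str, addr: &str) {
        let mut node = self;
        for part in scope {
            node = node.children.entry((*part).to_string()).or_default();
        }
        node.services.insert(service.to_string(), addr.to_string());
    }

    /// Returns the address registered for `service` at the longest known
    /// prefix of `scope`, or `None` if discovery has to be performed.
    pub fn lookup(&self, scope: &[&str], service: &str) -> Option<&str> {
        let mut node = self;
        let mut best = node.services.get(service);
        for part in scope {
            match node.children.get(*part) {
                Some(child) => node = child,
                None => break,
            }
            // A deeper registration replaces any found closer to the root.
            if let Some(addr) = node.services.get(service) {
                best = Some(addr);
            }
        }
        best.map(String::as_str)
    }
}
```

When `lookup` returns `None` the process would fall back to one of the discovery queries, and then call `insert` with the result.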

			
 
				-

			
 
				-## Firewall traversal

			
 
				-Blocktree requires a mechanism which allows runtimes to connect to each other even if one or both

			
 
				-of them is behind a firewall.

			
 
				-I don't yet know how to do this in the case where both are behind a firewall,

			
 
				-but in the case where only a single one is,

			
 
				-we can handle it by having a runtime contained in a parent directory send a control plane message

			
 
				-to the runtime which can't be reached telling it to initiate a connection to the runtime attempting

			
 
				-to reach it.

			
 
				-If the runtime that initiated the connection has a public IP address,

			
 
				-this will allow the two to connect,

			
 
				-after which messages can be sent in either direction.

			
 
				-This requires that at least one runtime in the root directory has a public IP address,

			
 
				-and that a connection is maintained between a child runtime and one of its parents.

			
 
				-

			
 
				-Because the sector clusters are fully connected we only need to send a connection request message to

			
 
				-one of them if we have the runtime forward these connection requests.

			
 
				-Then, if at least one of the sector hosts in the root has a public IP,

			
 
				-and one runtime in each cluster is connected to one runtime in each of its child clusters,

			
 
				-the message should eventually be delivered to the correct runtime.

			
 
				-

			
 
				-This means that the sector hosts will form a single connected component of the connection graph.

			
 
				-

			
 
				-## Representation of files by the filesystem service.

			
 
				-My idea of using actors to own file handles has a significant drawback.

			
 
				-If an actor which opened a file crashes,

			
 
				-the file will remain open forever,

			
 
				-resulting in a resource leak.

			
 
				-An alternative would be to issue file handle structs to actors in local messages,

			
 
				-but this will not work when the filesystem service is being accessed by a remote runtime.

			
 
				-I could keep a table of file handles (integers) in the filesystem service,

			
 
				-and access it similarly to how the filesystem struct is used today.

			
 
				-This approach brings the overhead of an RwLock on the table and searching it for a specific

			
 
				-file on every read or write.

			
 
				-Perhaps I could have the file actor poll its owner periodically to see if it's still alive?

			
 
				-Then it would be able to halt if the owning actor has crashed.

			
 
				-To get this to work I'll need to reintroduce the ability to send messages to a specific actor,

			
 
				-and solve the issue of handling undeliverable messages.

			
 
				-This approach has the advantage of working over the network,

			
 
				-and it does not introduce any overhead from maintaining a table.

			
--- a/doc/BlocktreeDce/BlocktreeDce.tex
+++ b/doc/BlocktreeDce/BlocktreeDce.tex