\documentclass{article}
\usepackage[scale=0.8]{geometry}
\usepackage{hyperref}
\usepackage{graphicx}

\title{The Blocktree Cloud Orchestration Platform}
\author{Matthew Carr}

\begin{document}
\maketitle
\begin{abstract}
This document is a proposal for a novel cloud platform called Blocktree.
The system is described in terms of the actor model,
where tasks and services are implemented as actors.
The platform is responsible for orchestrating these actors on a set of native operating system processes.
A service is provided to actors which allows them access to a highly available distributed file system,
which serves as the only source of persistent state for the system.
High availability is achieved using the Raft consensus protocol to synchronize the state of files between processes.
All data stored in the filesystem is secured with strong integrity and optional confidentiality protections.
A network block device-like interface allows for fast low-level read and write access to the encrypted data,
with full support for client-side encryption.
Well-known cryptographic primitives and constructions are employed to provide this protection;
the system does not attempt to innovate in terms of cryptography.
The system's trust model allows for mutual TLS authentication between all processes in the system,
even those which are controlled by different owners.
By integrating these ideas into a single platform,
the system aims to advance the status quo in the security and reliability of software systems.
\end{abstract}

\section{Introduction}
% The "Big" Picture.
Blocktree is an attempt to extend the Unix philosophy that everything is a file
to the entire distributed system that comprises modern IT infrastructure.
The system is organized around a global distributed filesystem which defines security
principals, resources, and their authorization attributes.
This filesystem provides a language for access control that can be used to securely grant
access to resources from different organizations, without the need to set up federation.
The system provides an actor runtime for orchestrating services.
Resources are represented by actors, and actors are grouped into operating system processes.
Each process has its own credentials which authenticate it as a unique security principal,
and which specify the filesystem path where the process is located.
A process has authorization attributes which determine the set of processes that may communicate with it.
Every connection between processes is established using mutual TLS authentication,
which is accomplished without the need to trust any third-party certificate authorities.
The cryptographic mechanisms which make this possible are described in detail in section 4.
Messages addressed to actors in a different process are forwarded over these connections,
while messages addressed to actors in the same process are delivered without copying.

% Self-certifying paths and the chain of trust.
The single global Blocktree filesystem is partitioned into disjoint domains of authority.
Each domain is controlled by a root principal.
As is the case for all principals,
a root principal is authenticated by a public-private key pair,
and is identified by a hash of its public key.
The domain of authority for a given absolute path is determined by its first component,
which is the identifier of the root principal that controls the domain.
Because there is no meaning to the directory ``/'',
a directory consisting of only a single component equal to a root principal's identifier is
referred to as the root principal's root directory.
The root principal delegates its authority to write files to subordinate principals by issuing
them certificates which specify the path that the authority of the subordinate is limited to.
File data is signed for authenticity and a certificate chain is contained in its metadata.
This certificate chain must lead back to the root principal
and consist of certificates with correctly scoped authority in order for the file to be validated.
Given the path of a file and the file's contents,
this allows the file to be validated by anyone without the need to trust a third party.
Blocktree paths are called self-certifying for this reason.
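
The scoping rule can be illustrated with a minimal sketch.
It is not the Blocktree implementation: the \texttt{Cert} type and its fields are hypothetical,
signature verification is omitted entirely,
and it assumes that ``correctly scoped'' means each certificate's scope contains the file
and is no wider than the scope of its issuer.

```rust
/// Hypothetical certificate. Signature checks are omitted in this sketch.
#[derive(Debug, Clone)]
struct Cert {
    issuer: String,  // identifier of the principal that issued the certificate
    subject: String, // identifier of the principal receiving authority
    scope: String,   // absolute path the subject's authority is limited to
}

/// Splits an absolute path into its components, ignoring empty ones.
fn components(path: &str) -> Vec<&str> {
    path.split('/').filter(|c| !c.is_empty()).collect()
}

/// Returns true if `path` equals `scope` or is contained in it.
fn in_scope(scope: &str, path: &str) -> bool {
    let (s, p) = (components(scope), components(path));
    s.len() <= p.len() && s.iter().zip(p.iter()).all(|(a, b)| a == b)
}

/// Checks a chain ordered from the root principal down to the file's signer.
/// The root principal is taken from the first component of `file_path`.
fn chain_is_valid(file_path: &str, signer: &str, chain: &[Cert]) -> bool {
    let root = match components(file_path).first() {
        Some(root) => root.to_string(),
        None => return false,
    };
    let mut issuer = root.clone();
    let mut issuer_scope = format!("/{root}");
    for cert in chain {
        if cert.issuer != issuer
            || !in_scope(&issuer_scope, &cert.scope) // assumed: authority never widens
            || !in_scope(&cert.scope, file_path)
        {
            return false;
        }
        issuer = cert.subject.clone();
        issuer_scope = cert.scope.clone();
    }
    // The last subject in the chain must be the principal that signed the file.
    issuer == signer
}
```

Because the root principal is read from the path itself,
a chain issued under one root can never validate a file located under another.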

% Persistent state provided by the filesystem.
One of the major challenges in distributed systems is managing persistent state.
Blocktree solves this issue with its distributed filesystem.
Files are broken into segments called sectors.
The sector size of a file can be configured when it is created,
but cannot be changed later.
Reads and writes of individual sectors are guaranteed to be atomic.
The sectors which comprise a file and its metadata are replicated by a set of processes running
the sector service.
This service is responsible for storing the sectors of files which are contained in the directory
containing the process in which it is running.
The actors providing the sector service in a given directory coordinate with one another using
the Raft protocol to synchronize the state of the sectors they store.
By partitioning the data in the filesystem based on directory,
the system can scale beyond the capabilities of a single consensus cluster.
Sectors can be integrity protected and verified without reading the entire file,
because each file has a Merkle tree of sector hashes associated with it.
Encryption can be optionally applied to sectors,
and when it is, the key is managed by the system.
The cryptographic mechanisms used to implement these protections are described in section 4.
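
To show why a Merkle tree lets a single sector be verified without reading the whole file,
here is a small sketch.
It makes several assumptions that are not taken from Blocktree:
the standard library's non-cryptographic \texttt{DefaultHasher} stands in for a real cryptographic hash,
the tree is binary, and an unpaired node is hashed with itself.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for a cryptographic hash. NOT secure; it only keeps the sketch dependency-free.
fn hash_bytes(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// Hashes two child nodes into their parent.
fn hash_pair(left: u64, right: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (left, right).hash(&mut hasher);
    hasher.finish()
}

/// Builds every level of the tree, leaves first. An unpaired node is paired with itself.
fn build_levels(sectors: &[Vec<u8>]) -> Vec<Vec<u64>> {
    let mut levels = vec![sectors.iter().map(|s| hash_bytes(s)).collect::<Vec<u64>>()];
    while levels.last().unwrap().len() > 1 {
        let next: Vec<u64> = levels
            .last()
            .unwrap()
            .chunks(2)
            .map(|pair| hash_pair(pair[0], *pair.get(1).unwrap_or(&pair[0])))
            .collect();
        levels.push(next);
    }
    levels
}

/// Collects the sibling hashes needed to recompute the root from sector `index`.
fn proof(levels: &[Vec<u64>], mut index: usize) -> Vec<u64> {
    let mut path = Vec::new();
    for level in &levels[..levels.len() - 1] {
        path.push(*level.get(index ^ 1).unwrap_or(&level[index]));
        index /= 2;
    }
    path
}

/// Verifies one sector against the root using only the sector and its proof.
fn verify(root: u64, sector: &[u8], mut index: usize, path: &[u64]) -> bool {
    let mut hash = hash_bytes(sector);
    for &sibling in path {
        hash = if index % 2 == 0 {
            hash_pair(hash, sibling)
        } else {
            hash_pair(sibling, hash)
        };
        index /= 2;
    }
    hash == root
}
```

A reader holding a trusted root hash needs only one sector and a logarithmic number of sibling hashes to check it.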

% Protocol contracts.
One of the design goals of Blocktree is to facilitate the creation of composable distributed
systems.
A major challenge to building such systems is the difficulty in pinning down bugs when they
inevitably occur.
Research into session types (a.k.a. Behavioral Types) promises to bring the safety benefits
of type checking to actor communication.
Blocktree integrates a session typing system that allows protocol contracts to be defined that
specify the communication patterns of a set of actors.
This model allows the state space of the set of actors participating in a computation to be defined,
and the state transitions which occur to be specified based on the types of received messages.
These contracts are used to verify protocol adherence statically and dynamically.
This system is implemented using compile time code generation,
making it a zero-cost abstraction.
This frees the developer from dealing with the numerous failure modes that can occur in a
communication protocol.

% Implementation language and project links.
Blocktree is implemented in the Rust programming language.
It is currently only tested on Linux.
Running it on other Unix-like operating systems should be straightforward,
though FUSE support is required to mount the filesystem.
Its source code is licensed under the GNU Affero General Public License Version 3.
It can be downloaded at the project homepage at \url{https://blocktree.systems}.
Anyone interested in contributing to development is welcome to submit a pull request
to \url{https://gogs.delease.com/Delease/Blocktree}.
If you have larger changes or architectural suggestions,
please submit an issue for discussion prior to spending time implementing your idea.

% Outline of the rest of the paper.
The remainder of this paper is structured as follows:
\begin{itemize}
  \item Section 2 describes the actor runtime, service and task orchestration, and service
    discovery.
  \item Section 3 discusses the filesystem, its concurrency semantics and implementation.
  \item Section 4 details the cryptographic mechanisms used to secure communication between
    actor runtimes and to protect sector data.
  \item Section 5 is a set of examples describing ways that Blocktree can be used to build systems.
  \item Section 6 provides some concluding remarks.
\end{itemize}


\section{Actor Runtime}
% Motivation for using the actor model. 
Building scalable, fault-tolerant systems requires us to distribute computation over
multiple computers.
Rather than switching to a different programming model when an application scales beyond the
capacity of a single computer,
it is beneficial in terms of programmer time and program simplicity to begin with a model that 
enables multi-computer scalability.
Fundamentally, all communication over an IP network involves the exchange of messages,
namely IP packets.
So if we wish to build scalable fault-tolerant systems,
it makes sense to choose a programming model built on message passing,
as this will ensure low impedance with the underlying networking technology.

% Overview of message passing interface.
That is why Blocktree is built on the actor model
and why its actor runtime is at the core of its architecture.
The runtime can be used to spawn actors, register services, dispatch messages immediately,
and schedule messages to be delivered in the future.
Messages can be dispatched in two different ways: with \texttt{send} and \texttt{call}.
A message is dispatched with the \texttt{send} method when no reply is required,
and with \texttt{call} when exactly one is.
The \texttt{Future} returned by \texttt{call} can be awaited to obtain the reply.
If a timeout occurs while waiting for the reply,
the \texttt{Future} completes with an error.
The name \texttt{call} was chosen to bring to mind a remote procedure call,
which is the primary use case this method was intended for.
Awaiting replies to messages serves as a simple way to synchronize a distributed computation.
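
The difference between the two dispatch modes can be sketched with standard library channels.
This is a blocking, thread-based analogue and not the Blocktree API:
the real \texttt{call} returns a \texttt{Future},
whereas the hypothetical \texttt{call} below blocks until a reply arrives or the timeout expires.
The \texttt{Envelope} and \texttt{CallError} types are likewise invented for illustration.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// A message body plus an optional channel on which exactly one reply is expected.
struct Envelope {
    body: String,
    reply_to: Option<Sender<String>>,
}

#[derive(Debug, PartialEq)]
enum CallError {
    Timeout,
    ActorGone,
}

/// Spawns a toy actor that replies with the upper-cased body when a reply is requested.
/// Requests with the body "slow" are never answered, which exercises the timeout path.
fn spawn_actor() -> Sender<Envelope> {
    let (tx, rx) = mpsc::channel::<Envelope>();
    thread::spawn(move || {
        // Unanswered requests are kept alive so the caller observes a timeout
        // rather than a dropped reply channel.
        let mut parked = Vec::new();
        for envelope in rx {
            if envelope.body == "slow" {
                parked.push(envelope);
                continue;
            }
            if let Some(reply_to) = envelope.reply_to {
                let _ = reply_to.send(envelope.body.to_uppercase());
            }
        }
    });
    tx
}

/// Dispatches a message for which no reply is required.
fn send(actor: &Sender<Envelope>, body: &str) -> Result<(), CallError> {
    actor
        .send(Envelope { body: body.to_string(), reply_to: None })
        .map_err(|_| CallError::ActorGone)
}

/// Dispatches a message and waits for exactly one reply, failing on timeout.
fn call(actor: &Sender<Envelope>, body: &str, timeout: Duration) -> Result<String, CallError> {
    let (reply_tx, reply_rx) = mpsc::channel();
    actor
        .send(Envelope { body: body.to_string(), reply_to: Some(reply_tx) })
        .map_err(|_| CallError::ActorGone)?;
    reply_rx.recv_timeout(timeout).map_err(|e| match e {
        RecvTimeoutError::Timeout => CallError::Timeout,
        RecvTimeoutError::Disconnected => CallError::ActorGone,
    })
}
```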

% Scheduling messages for future delivery.
Executing an action at some point in the future or at regular intervals is a common task in computer
systems.
Blocktree facilitates this by allowing messages to be scheduled for future delivery.
The schedule may specify a one time delivery at a specific instant in time,
or a repeating delivery with a given period.
These scheduling modes can be combined so that you can specify an anchoring instant
and a period whose multiples will be added to this instant to calculate each delivery time.
For example, a message could be scheduled for delivery every morning at 3 AM.
Messages scheduled in a runtime are persisted in the runtime's file.
This ensures scheduled messages will be delivered even if the runtime is restarted.
If a message has been delivered
and the schedule is such that it will never be delivered again,
it is removed from the runtime's file.
If a message is scheduled for delivery at a single instant in time,
and that delivery is missed,
the message will be delivered as soon as possible.
But, if a message is periodic,
any messages which were missed due to a runtime not being active will never be sent.
This is because the runtime only persists the message's schedule,
not every delivery.
This mechanism is intended for periodic tasks or delaying work to a later time.
It is not for building hard realtime systems.
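
The delivery rules above reduce to a small calculation, sketched here.
The \texttt{Schedule} type and the \texttt{next\_delivery} function are hypothetical names,
and times are expressed as whole seconds since an arbitrary epoch to keep the arithmetic visible.

```rust
/// A delivery schedule. Times are seconds since an arbitrary epoch.
#[derive(Debug, Clone, Copy)]
enum Schedule {
    /// A single delivery at a specific instant.
    Once { at: u64 },
    /// Deliveries at `anchor + k * period` for k = 0, 1, 2, ...
    Periodic { anchor: u64, period: u64 },
}

/// Returns the next delivery time at or after `now`, or `None` if the
/// schedule is exhausted and can be removed from the runtime's file.
fn next_delivery(schedule: Schedule, now: u64, already_delivered: bool) -> Option<u64> {
    match schedule {
        Schedule::Once { .. } if already_delivered => None,
        // A missed one-time delivery is performed as soon as possible.
        Schedule::Once { at } => Some(at.max(now)),
        Schedule::Periodic { anchor, .. } if now <= anchor => Some(anchor),
        // Missed periodic deliveries are skipped: only the next multiple counts.
        Schedule::Periodic { anchor, period } => {
            assert!(period > 0, "period must be non-zero");
            let elapsed = now - anchor;
            let k = (elapsed + period - 1) / period; // ceiling division
            Some(anchor + k * period)
        }
    }
}
```

For the 3 AM example, the anchor is 10800 seconds after midnight and the period is 86400 seconds.
A runtime that restarts 100000 seconds after that midnight computes 183600 as the next delivery,
which is 3 AM two days after the anchor day's midnight; the delivery it missed at 97200 is not replayed.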

% Description of virtual actor system.
One of the challenges when building actor systems is supervising and managing actors' lifecycles.
This is handled in Erlang through the use of supervision trees,
but Blocktree takes a different approach inspired by Microsoft's Orleans framework.
Orleans introduced the concept of virtual actors,
which are purely logical entities that exist perpetually.
In Orleans, one does not need to spawn actors nor worry about respawning them should they crash;
the framework takes care of spawning an actor when a message is dispatched to it.
This model also gives the framework the flexibility to deactivate actors when they are idle
and to load balance actors across different computers.
In Blocktree a similar system is used when messages are dispatched to services.
The Blocktree runtime takes care of routing these messages to the appropriate actors,
spawning them if needed.
A service must be registered in a runtime before messages can be routed to it.
The actors which are spawned based on this registration are called \emph{service providers} of the
service.
Services which directly use operating system resources,
such as those that listen on network sockets,
are often started immediately after registration so that they are available to external clients.

% Message addressing modes.
Messages can be addressed to services or specific actors.
When addressing a specific actor,
the message contains an \emph{actor name},
which is a pair consisting of the path of the runtime hosting the actor and the \texttt{Uuid}
identifying the specific actor in that runtime.
When addressing a service,
the message is dispatched using a \emph{service name},
which contains the following fields:
\begin{enumerate}
  \item \texttt{service}: The path identifying the receiving service.
  \item \texttt{scope}: A filesystem path used to specify the intended recipient.
  \item \texttt{rootward}: A boolean describing whether message delivery is attempted towards or
    away from the root of the filesystem tree. A value of
    \texttt{false} indicates that the message is intended for a runtime directly contained in the
    scope. A value of \texttt{true} indicates that the message is intended for a runtime contained
    in a parent directory of the scope and should be delivered to a runtime which has the requested
    service registered and is closest to the scope.
  \item \texttt{id}: An identifier for a specific service provider.
\end{enumerate}
The ID can be a \texttt{Uuid} or a \texttt{String}.
It is treated as an opaque identifier by the runtime,
but a service is free to associate additional meaning to it.
Every message has a header containing the name of the sender and receiver.
The receiver name can be an actor or service name,
but the sender name is always an actor name.
For example, to open a file in the filesystem,
a message is dispatched with \texttt{call} using the service name of the filesystem service.
The reply contains the name of the file actor spawned by the filesystem service which owns the opened
file.
Messages are then dispatched to the file actor using its actor name to read and write to the file.
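
The two addressing modes can be written down as plain data types.
The sketch below is illustrative only: all type and field names are hypothetical,
a \texttt{u128} stands in for a real \texttt{Uuid},
and it assumes the sender of every message is identified by an actor name,
which is what makes it possible to address a reply.

```rust
/// Stand-in for a real UUID type.
type Uuid = u128;

/// Opaque identifier for a specific service provider.
#[derive(Debug, Clone, PartialEq)]
enum ProviderId {
    Uuid(Uuid),
    String(String),
}

/// Names one specific actor: the hosting runtime's path plus the actor's ID.
#[derive(Debug, Clone, PartialEq)]
struct ActorName {
    runtime_path: String,
    actor_id: Uuid,
}

/// Names a service provider without fixing which runtime hosts it.
#[derive(Debug, Clone, PartialEq)]
struct ServiceName {
    service: String,
    scope: String,
    rootward: bool,
    id: ProviderId,
}

/// The receiver of a message is either a specific actor or a service.
#[derive(Debug, Clone, PartialEq)]
enum Receiver {
    Actor(ActorName),
    Service(ServiceName),
}

/// Header carried by every message.
#[derive(Debug, Clone, PartialEq)]
struct Header {
    sender: ActorName,
    receiver: Receiver,
}

/// Builds the header of a reply: it is addressed to the actor that sent the request.
fn reply_header(request: &Header, replier: ActorName) -> Header {
    Header {
        sender: replier,
        receiver: Receiver::Actor(request.sender.clone()),
    }
}
```

In the file-opening example, the request header carries a \texttt{Receiver::Service} value,
while the reply and all later read and write messages carry \texttt{Receiver::Actor} values.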

% The runtime is implemented using tokio.
The actor runtime is currently implemented using the Rust asynchronous runtime tokio.
Actors are spawned as tasks in the tokio runtime,
and multi-producer, single-consumer channels are used for message delivery.
Because actors are just tasks,
they can do anything a task can do,
including awaiting other \texttt{Future}s.
Because of this, there is no need for the actor runtime to support short-lived worker tasks,
as any such use-case can be accomplished by awaiting a set of \texttt{Future}s.
This allows the runtime to focus on providing support for services.
Using tokio also means that we have access to a high performance multi-threaded runtime with
evented IO.
This asynchronous programming model ensures that resources are efficiently utilized,
and is ideal for a system focused on orchestrating services which may be used by many clients.

% Delivering messages over the network.
Messages can be forwarded between actor runtimes using a secure transport layer called
\texttt{bttp}.
The transport is implemented using the QUIC protocol, which integrates TLS for security.
A \texttt{bttp} client may connect anonymously or using credentials.
If an anonymous connection is attempted,
the client has no authorization attributes associated with it.
Only runtimes which grant others the execute permission allow connections from such clients.
If these permissions are not granted in the runtime's file,
anonymous connections are rejected.
When a client connects with credentials,
mutual TLS authentication is performed as part of the connection handshake,
which cryptographically verifies the credentials of each runtime.
These credentials contain the filesystem paths where each runtime is located.
This information is used to securely route messages between runtimes.
The \texttt{bttp} server is always authenticated during the handshake,
even when the client is connecting anonymously.
Because QUIC supports the concurrent use of many different streams,
it serves as an ideal transport for a message oriented system.
\texttt{bttp} uses different streams for independent messages,
ensuring that head of line blocking does not occur.
Note that although data from separate streams can arrive in any order,
the protocol does provide reliable in-order delivery of data in any given stream.
The same stream is used for sending the reply to a message dispatched with \texttt{call}.
Once a connection is established,
messages may flow in both directions (provided both runtimes have execute permissions for the other),
regardless of which runtime is acting as the client or the server.

% Delivering messages locally.
When a message is sent between actors in the same runtime it is delivered into the queue of the recipient without any copying,
while ensuring immutability (i.e. move semantics).
This is possible thanks to the Rust ownership system,
because the message sender gives ownership to the runtime when it dispatches the message,
and the runtime gives ownership to the recipient when it delivers the message.
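
The effect of move semantics on local delivery can be observed directly.
The following sketch uses a standard library channel as a stand-in for the runtime's message queue
(the function name is hypothetical)
and compares the address of the payload's heap buffer before dispatch and after delivery.

```rust
use std::sync::mpsc;

/// Sends a buffer through a channel and returns the address of its heap
/// allocation before sending and after receiving.
fn deliver_without_copy(payload: Vec<u8>) -> (usize, usize) {
    let before = payload.as_ptr() as usize;
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    // Ownership of `payload` moves into the channel here; the sender can no
    // longer read or modify it, so the recipient sees an immutable snapshot.
    tx.send(payload).expect("receiver is alive");
    let received = rx.recv().expect("sender is alive");
    (before, received.as_ptr() as usize)
}
```

Both addresses are equal: only the small pointer, length, and capacity triple of the \texttt{Vec} is moved,
never the bytes it points to.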

% Security model based on filesystem permissions.
A runtime is represented in the filesystem as a file.
This file contains the authorization attributes which are associated with the runtime's security
principal.
The credentials used by the runtime specify the file, so other runtimes are able to locate it.
The metadata of the file contains authorization attributes just like any other file
(e.g. UID, GID, and mode bits).
In order for a principal to be able to send a message to an actor in the runtime,
it must have execute permissions for this file.
Thus communication between runtimes can be controlled using simple filesystem permissions.
Permissions checking is done during the \texttt{bttp} handshake.
Note that it is possible for messages to be sent in one direction in a \texttt{bttp} connection
but not in the other.
In this situation replies are permitted but unsolicited messages are not.
An important trade-off which was made when designing this model was that messages which are
sent between actors in the same runtime are not subject to any authorization checks.
This was done for two reasons: performance and security.
By eliminating authorization checks messages can be more efficiently delivered between actors in the
same process,
which helps to reduce the performance penalty of the actor runtime over directly using threads.
Security is enhanced by this decision because it forces the user to separate actors with different
security requirements into different operating system processes,
which ensures all of the process isolation machinery in the operating system will be used to
isolate them.
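
A minimal sketch of such a permission check follows.
It assumes conventional Unix semantics, where exactly one of the owner, group, or other
execute bits applies to a given client, and that anonymous clients fall into the other class.
The types are hypothetical, and how a principal maps to a UID and GID is not modeled.

```rust
/// Authorization attributes stored in the metadata of a runtime's file.
#[derive(Debug, Clone, Copy)]
struct FileMeta {
    uid: u32,
    gid: u32,
    mode: u32,
}

/// The party attempting to open a connection to the runtime.
#[derive(Debug, Clone, Copy)]
enum Client {
    Anonymous,
    Principal { uid: u32, gid: u32 },
}

const EXEC_OWNER: u32 = 0o100;
const EXEC_GROUP: u32 = 0o010;
const EXEC_OTHER: u32 = 0o001;

/// Returns true if `client` may send messages to actors in the runtime
/// described by `runtime_file`.
fn may_send(client: Client, runtime_file: FileMeta) -> bool {
    // Select the single permission class which applies to this client.
    let bit = match client {
        Client::Principal { uid, .. } if uid == runtime_file.uid => EXEC_OWNER,
        Client::Principal { gid, .. } if gid == runtime_file.gid => EXEC_GROUP,
        _ => EXEC_OTHER,
    };
    runtime_file.mode & bit != 0
}
```

With a mode of \texttt{0o750}, the owner and members of the group may connect,
while all other principals and anonymous clients are rejected.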

% Representing resources as actors.
As in other actor systems, it is convenient to represent resources in Blocktree using actors.
This allows the same security model used to control communication between actors to be used for
controlling access to resources,
and for resources to be shared by many actors.
For instance, a Point-to-Point Protocol connection could be owned by an actor.
This actor could forward traffic delivered to it in messages over this connection.
The set of actors which are able to access the connection is controlled by setting the filesystem
permissions on the file for the runtime executing the actor owning the connection.

% Actor ownership.
The concept of ownership in programming languages is very useful for ensuring that resources are
properly freed when the type using them dies.
Because actors are used for encapsulating resources in Blocktree,
a similar system of ownership is employed for this reason.
An actor is initially owned by the actor that spawned it.
An actor can only have a single owner,
but the owner can grant ownership to another actor.
An actor is not allowed to own itself,
though it may be owned by the runtime.
When the owner of an actor returns,
the actor is sent a message instructing it to return.
If it does not return after a timeout,
it is interrupted.
This is the opposite of how supervision trees work in Erlang.
Instead of the parent receiving a message when the child returns,
the child receives a message when the parent returns.
Service providers spawned by the runtime are owned by it.
They continue running until the runtime chooses to reclaim their resources,
which can happen because they are idle or the runtime is overloaded.
Note that ownership is not limited to a single runtime,
so distributed resources can be managed by owning actors in many different runtimes.

% Message routing to services.
A service is identified by a Blocktree path.
Only one service implementation can be registered in a particular runtime,
though this implementation may be used to spawn many actors as providers for the service,
each associated with a different ID.
The runtime spawns a new actor when it finds no service provider associated with the ID in the
message it is delivering.
Some services may only have one service provider in a given runtime,
as is the case for the sector and filesystem services.
The \texttt{scope} and \texttt{rootward} fields in a service name specify the set of runtimes to
which a message may be delivered.
They allow the sender to express their intended recipient,
while still affording enough flexibility to the runtime to route messages as needed.
If \texttt{rootward} is \texttt{false},
the message is delivered to a service provider in a runtime that is directly contained in
\texttt{scope}.
If \texttt{rootward} is \texttt{true},
the parent directories of scope are searched,
working towards the root of the filesystem tree,
and the message is delivered to the first provider of \texttt{service} which is found.
When there are multiple service providers to which a given message could be delivered,
the one to which it is actually delivered is unspecified,
which allows the runtime to balance load.
Delivery will occur to at most one recipient,
even in the case that there are multiple potential recipients.
In order to contact other runtimes and deliver messages to them,
their network endpoint (IP address and UDP port) needs to be known.
This is achieved by maintaining a file with a runtime's endpoint address in the same directory as
the runtime.
The runtime is granted write permissions on the file,
and it is updated by \texttt{bttp} when it begins listening on a new endpoint.
The port a \texttt{bttp} server uses to listen for unicast connections is uniformly
randomly selected from the set of ports in the dynamic range (49152-65535) which are unused on the
server's host.
Use of a random port allows many different \texttt{bttp} servers to share a single IP address
and makes Blocktree more resistant to censorship.
The services which are allowed to be registered in a given runtime are specified in the runtime's
file.
The runtime reads this list and uses it to deny service registrations for unauthorized services.
The list is also read by other runtimes when they are searching for service providers.
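
The way \texttt{scope} and \texttt{rootward} select candidate runtimes can be sketched as follows.
This is an in-memory model with hypothetical names, not the runtime's routing code.
It takes the description literally in one respect that may differ from the implementation:
a rootward search starts at the parent of \texttt{scope} rather than at \texttt{scope} itself.

```rust
/// A runtime, identified by its path, with the services registered in it.
#[derive(Debug)]
struct Runtime {
    path: String,
    services: Vec<String>,
}

/// Returns the directory containing `path`, or `None` for a root directory.
fn parent(path: &str) -> Option<&str> {
    let trimmed = path.trim_end_matches('/');
    let idx = trimmed.rfind('/')?;
    if idx == 0 { None } else { Some(&trimmed[..idx]) }
}

/// Runtimes directly contained in `dir` which have `service` registered.
fn providers_in<'a>(runtimes: &'a [Runtime], dir: &str, service: &str) -> Vec<&'a Runtime> {
    runtimes
        .iter()
        .filter(|rt| parent(&rt.path) == Some(dir) && rt.services.iter().any(|s| s == service))
        .collect()
}

/// Returns the candidate recipients. Which candidate actually receives the
/// message is left unspecified so that load can be balanced.
fn resolve<'a>(
    runtimes: &'a [Runtime],
    service: &str,
    scope: &str,
    rootward: bool,
) -> Vec<&'a Runtime> {
    if !rootward {
        return providers_in(runtimes, scope, service);
    }
    // Walk toward the root and stop at the first directory with a provider.
    let mut dir = parent(scope);
    while let Some(current) = dir {
        let found = providers_in(runtimes, current, service);
        if !found.is_empty() {
            return found;
        }
        dir = parent(current);
    }
    Vec::new()
}
```

The search stops at the root principal's root directory, since there is no directory above it.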

% The sector and filesystem service.
The filesystem is itself implemented as a service.
A filesystem service provider can be passed messages to delete files, list directory contents,
open files, or perform several other standard filesystem operations.
When a file is opened,
a new actor is spawned which owns the newly created file handle and its name is returned to the
caller in a reply.
Subsequent read and write messages are sent to this actor.
The filesystem service does not persist any data itself;
its job is to function as an integration layer,
conglomerating sector data from many different sources into a single unified interface.
The sector service is what is ultimately responsible for storing data,
and thus maintaining the persistent state of the system.
It stores sector data in the local filesystem of each computer on which it is registered.
The details of how this is accomplished are deferred to the next section.

% Runtime queries.
While it is possible to resolve runtime paths to network endpoints when the filesystem is available,
another mechanism is needed to allow the filesystem service providers to be discovered.
This is accomplished by allowing runtimes to query one another to learn of other runtimes.
Because queries are intended to facilitate message delivery,
the query fields and their meanings mirror those used for addressing messages:
\begin{enumerate}
  \item \texttt{service}: The path of the service whose providers are sought.
    Only runtimes with this service registered will be returned.
  \item \texttt{scope}: The filesystem path relative to which the query will be processed.
  \item \texttt{rootward}: Indicates if the query should search for runtimes from \texttt{scope}
    toward the root.
\end{enumerate}
The semantics of \texttt{scope} and \texttt{rootward} in a query are identical to their use in a
service name.
As long as at least one other runtime is known,
a query can be issued to learn of more runtimes.
A runtime which receives a query may not be able to answer it directly.
If it cannot,
it returns the endpoint of the next runtime to which the query should be sent.

% Bootstrap discovery methods.
In order to bootstrap the discovery process,
another mechanism is needed to find the first peer to query.
There were several possibilities explored for doing this.
One way is to use a blockchain to store the endpoints of the runtimes hosting the filesystem service
in the root directory.
As long as these runtimes can be located,
then all others can be found using the filesystem.
This idea may be worth revisiting in the future,
but the author wanted to avoid the complexity of implementing a new proof-of-work blockchain.
Instead, two independent mechanisms are used,
one that can discover runtimes over the internet as long as their path is known,
and another that can discover runtimes on the local network even when the discoverer does not know
their paths.

% Searching DNS for root principals.
When the path to a runtime is known,
DNS is used to resolve SRV records using a fully qualified domain name
(FQDN) derived from the path's root principal identifier.
The SRV records are resolved using the name \texttt{\_bttp.\_udp.<FQDN>},
where \texttt{<FQDN>} is the FQDN derived from the root principal's identifier.
One SRV record may be created for each of the filesystem service providers in the root
directory.
Each record contains the UDP port and hostname where a runtime is listening.
Every runtime is configured with a search domain that is used as a suffix in the FQDN.
The leading labels in the FQDN are computed by base32 encoding the binary representation of the
root principal's identifier.
If the encoded string is longer than 63 bytes (the limit for each label in a hostname),
it is separated into the fewest number of labels possible,
working from left to right along the string.
A dot followed by the search domain is concatenated onto the end of this string to form the FQDN.
This method has the advantages of being simple to implement
and allowing runtimes to discover each other over the internet.
Implementing this system would be facilitated by hosting DNS servers in actors in the same
runtimes as the root sector service providers.
Then, records could be dynamically created which point to these runtimes.
These runtimes would also need to be configured with static IP addresses,
and the NS records for the search domain would need to point to them.
Of course it is also possible to build such a system without hosting DNS inside of Blocktree.
The downside of using DNS is that it couples Blocktree with a centralized,
albeit distributed, system.
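As a rough illustration of the name derivation described above,
the following sketch computes the SRV query name from a root principal identifier and a search
domain.
The function names, the lowercase RFC 4648 alphabet, and the omission of padding are assumptions
made for this example;
the text above only specifies base32 encoding and the 63 byte label limit.
\begin{verbatim}
  // Lowercase RFC 4648 base32 alphabet (an assumption, see the text above).
  const ALPHABET: &[u8; 32] = b"abcdefghijklmnopqrstuvwxyz234567";
  // The maximum length of a single label in a hostname.
  const MAX_LABEL_LEN: usize = 63;

  // Encodes `data` as unpadded base32.
  pub fn base32_encode(data: &[u8]) -> String {
    let mut out = String::new();
    let mut buffer: u32 = 0;
    let mut bits: u32 = 0;
    for &byte in data {
      buffer = (buffer << 8) | u32::from(byte);
      bits += 8;
      while bits >= 5 {
        bits -= 5;
        out.push(ALPHABET[((buffer >> bits) & 0x1f) as usize] as char);
      }
    }
    if bits > 0 {
      // Pad the final group with zero bits on the right.
      out.push(ALPHABET[((buffer << (5 - bits)) & 0x1f) as usize] as char);
    }
    out
  }

  // Builds the name used to resolve SRV records for a root principal.
  pub fn srv_name(root_principal: &[u8], search_domain: &str) -> String {
    let encoded = base32_encode(root_principal);
    // Split left to right into the fewest labels of at most 63 bytes each.
    let labels: Vec<&str> = encoded
      .as_bytes()
      .chunks(MAX_LABEL_LEN)
      .map(|chunk| std::str::from_utf8(chunk).expect("base32 output is ASCII"))
      .collect();
    format!("_bttp._udp.{}.{}", labels.join("."), search_domain)
  }
\end{verbatim}
For instance, a 40 byte identifier encodes to 64 characters,
so it would be split into one label of 63 characters followed by one label of a single character.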

% Using link-local multicast datagrams to find runtimes.
Because the previous mechanism requires knowledge of the root principal of a domain to perform
discovery,
it will not work if a runtime is first starting up with no credentials and so does not know its
own root principal.
This runtime needs a way to discover other runtimes so it can connect to the filesystem and sector
services.
This issue is solved by using link-local multicast addressing to discover the runtimes on the same
network as the discoverer.
When a \texttt{bttp} server starts listening for unicast traffic,
it also listens for UDP datagrams on port 50142 at addresses 224.0.0.142 and FF02::142,
if the IPv4 or IPv6 networking stack is available, respectively.
If the host is attached to a dual-stack network,
the server listens on both addresses.
When a runtime is attempting to discover other runtimes,
it sends out datagrams to these endpoints.
Each \texttt{bttp} server replies with its unicast address and filesystem path
(as specified in its credentials).
If the server is available at both IPv4 and IPv6 unicast addresses,
it is at the server's discretion which address to respond with.
For example, it may respond with an IPv4 address to an IPv4 datagram
and an IPv6 address to an IPv6 datagram.
Once a client has discovered the \texttt{bttp} servers on its network,
it can route messages to them,
such as the provisioning requests which are used to obtain new credentials.
Provisioning is described in the Cryptography section.
Note that port 50142 is in the dynamic range,
so it does not need to be registered with the Internet Assigned Numbers Authority (IANA).
Both addresses 224.0.0.142 and FF02::142 are currently unassigned,
but they will need to be registered with IANA if Blocktree is widely adopted.
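The discovery endpoints can be written down directly using the Rust standard library.
In the sketch below the constant and function names are hypothetical,
and only the IPv4 listener is shown.
IPv6 multicast addresses lie in FF00::/8, with FF02::/16 being the link-local scope,
so the IPv6 group is written as \texttt{ff02::142}.
\begin{verbatim}
  use std::io;
  use std::net::{Ipv4Addr, Ipv6Addr, SocketAddr, UdpSocket};

  pub const DISCOVERY_PORT: u16 = 50142;
  pub const DISCOVERY_V4: Ipv4Addr = Ipv4Addr::new(224, 0, 0, 142);
  pub const DISCOVERY_V6: Ipv6Addr = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x142);

  // The endpoints a discovering runtime sends its datagrams to.
  pub fn discovery_endpoints() -> [SocketAddr; 2] {
    [
      SocketAddr::from((DISCOVERY_V4, DISCOVERY_PORT)),
      SocketAddr::from((DISCOVERY_V6, DISCOVERY_PORT)),
    ]
  }

  // Binds a UDP socket and joins the IPv4 discovery group,
  // letting the operating system choose the interface.
  pub fn listen_v4() -> io::Result<UdpSocket> {
    let socket = UdpSocket::bind((Ipv4Addr::UNSPECIFIED, DISCOVERY_PORT))?;
    socket.join_multicast_v4(&DISCOVERY_V4, &Ipv4Addr::UNSPECIFIED)?;
    Ok(socket)
  }
\end{verbatim}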

% Security model for queries.
To allow runtimes which are not permitted to execute the root directory to query for other runtimes,
authorization logic which is specific to queries is needed.
If a process is connected with credentials
and the path in the credentials contains the scope of the query,
the query is permitted.
If a process is connected anonymously,
its query will only be answered if the query scope
and all of its parent directories
grant others the execute permission.
Queries from authenticated processes can be authorized using only the information in the query,
but anonymous queries require knowledge of filesystem permissions,
some of which may not be known to the answering runtime.
When authorizing an anonymous query,
an answering runtime should check that the execute permission is granted on all directories
that it is responsible for storing.
If all these checks pass, it should forward the querier to the next runtime as usual.
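The authorization rules above can be sketched as a single function.
This is only an illustration under stated assumptions:
\texttt{LocalPerms} is a hypothetical stand-in for the permissions of the directories the
answering runtime stores,
and a credential path is assumed to contain a scope when it is a prefix of that scope.
Directories the runtime does not store are skipped,
because they are checked by the runtimes to which the querier is forwarded.
\begin{verbatim}
  use std::collections::HashMap;
  use std::path::{Path, PathBuf};

  // Hypothetical record of whether "others" may execute each locally stored directory.
  #[derive(Default)]
  pub struct LocalPerms {
    others_execute: HashMap<PathBuf, bool>,
  }

  // Returns true if the query should be answered (or forwarded).
  // `creds_path` is `None` for an anonymous connection.
  pub fn authorize_query(creds_path: Option<&Path>, scope: &Path, perms: &LocalPerms) -> bool {
    match creds_path {
      // Authenticated: decided using only the information in the query.
      Some(path) => scope.starts_with(path),
      // Anonymous: the scope and every known parent must grant execute to others.
      None => scope
        .ancestors()
        .filter_map(|dir| perms.others_execute.get(dir))
        .all(|&granted| granted),
    }
  }
\end{verbatim}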

% Overview of protocol contracts and runtime checking of protocol adherence.
To facilitate the creation of composable systems,
a protocol contract checking system based on session types has been designed.
This system models a communication protocol as a directed graph representing state transitions
based on types of received messages.
The protocol author defines the states that the actors participating in the protocol can be in using 
Rust traits.
These traits define handler methods for each message type the actor is expected to handle in that
state.
A top-level trait which represents the entire protocol is defined that contains the types of the
initial state of every actor in the protocol.
A macro is used to generate the message handling loop for each of the parties to the protocol,
as well as enums to represent all possible states that the parties can be in and the messages that
they exchange.
The generated code is responsible for ensuring that errors are generated when a message of an
unexpected type is received,
eliminating the need for ad-hoc error handling code to be written by application developers.

% Example of a protocol contract.
Let's explore how this system can be used to build a simple pub-sub communications protocol.
In this protocol,
there will be a server which handles \texttt{Sub} messages by remembering the names of the actors
who sent them.
It will handle \texttt{Pub} messages by forwarding them to all of the subscribed actors.
The state-transition graph for the system is shown in figure \ref{fig:pubsub}.
\begin{figure}
  \begin{center}
    \includegraphics[scale=0.6]{PubSubStateGraph.pdf}
  \end{center}
  \caption{The state-transition graph for a simple pub-sub protocol.}
  \label{fig:pubsub}
\end{figure}
The solid edges in the graph indicate state transitions and are labeled with the message type
which triggered the transition.
The dashed edges indicate message delivery and are labeled with the type of the message delivered.
Although \texttt{Runtime} is not the state of any actor in the system,
it is included in the graph as the sender of the \texttt{Activate} and \texttt{Pub} messages.
\texttt{Activate} is delivered by the runtime to pass a reference to the runtime and provide the
actor's \texttt{Uuid}.
\texttt{Pub} messages are dispatched by actors outside the graph and are routed to actors in the
\texttt{Listening} state by the runtime.
Note that the runtime itself doesn't have any notion of the state of any actor;
it just delivers messages using the rules described previously.
Only an actor can tell whether a message is expected or not given its current state.
Each of the actor states is modeled by a Rust trait.
\begin{verbatim}
  // Initial state of a client; handles the Activate message.
  pub trait ClientInit {
    type AfterActivate: Subed;
    type Fut: Future<Output = Result<Self::AfterActivate>>;
    fn handle_activate(self, msg: Activate) -> Self::Fut;
  }

  // Client state after activation; handles Pub messages.
  pub trait Subed {
    type AfterPub: Subed;
    type Fut: Future<Output = Result<Self::AfterPub>>;
    fn handle_pub(self, msg: Envelope<Pub>) -> Self::Fut;
  }

  // Initial state of the server; handles the Activate message.
  pub trait ServerInit {
    type AfterActivate: Listening;
    type Fut: Future<Output = Result<Self::AfterActivate>>;
    fn handle_activate(self, msg: Activate) -> Self::Fut;
  }

  // Server state after activation; handles both Sub and Pub messages.
  pub trait Listening {
    type AfterSub: Listening;
    type SubFut: Future<Output = Result<Self::AfterSub>>;
    fn handle_sub(self, msg: Envelope<Sub>) -> Self::SubFut;

    type AfterPub: Listening;
    type PubFut: Future<Output = Result<Self::AfterPub>>;
    fn handle_pub(self, msg: Envelope<Pub>) -> Self::PubFut;
  }
\end{verbatim}
The definition of \texttt{Activate} is as follows:
\begin{verbatim}
  pub struct Activate {
    rt: &'static Runtime,
    act_id: Uuid,
  }
\end{verbatim}
The \texttt{Envelope} type is a wrapper around a message which contains information about who sent
it and a method that can be used to send a reply.
In general a new actor state, represented by a new type, can be returned by a message handling
method.
The protocol itself is also represented by a trait:
\begin{verbatim}
  pub trait PubSubProtocol {
    type Server: ServerInit;
    type Client: ClientInit;
  }
\end{verbatim}
By modeling this protocol independently of any implementation of it,
we allow for many different interoperable implementations to be created.
We can also isolate bugs in these implementations because unexpected or malformed messages are
checked for by the generated code.
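The macro-generated message loop is not reproduced here.
As a simplified, synchronous illustration of the kind of check it performs for the server,
the sketch below models the server's states and messages as enums
and rejects any message that is not expected in the current state.
The type and function names are hypothetical and message forwarding is omitted.
\begin{verbatim}
  #[derive(Debug, PartialEq)]
  pub enum ServerState {
    Init,
    Listening { subscribers: Vec<String> },
  }

  #[derive(Debug, PartialEq)]
  pub enum Msg {
    Activate,
    Sub { from: String },
    Pub { body: String },
  }

  #[derive(Debug, PartialEq)]
  pub struct UnexpectedMsg;

  // Advances the server by one message, or fails if the message is unexpected.
  pub fn step(state: ServerState, msg: Msg) -> Result<ServerState, UnexpectedMsg> {
    match (state, msg) {
      (ServerState::Init, Msg::Activate) => {
        Ok(ServerState::Listening { subscribers: Vec::new() })
      }
      (ServerState::Listening { mut subscribers }, Msg::Sub { from }) => {
        // Remember the name of the subscribing actor.
        subscribers.push(from);
        Ok(ServerState::Listening { subscribers })
      }
      (ServerState::Listening { subscribers }, Msg::Pub { .. }) => {
        // A real server would forward the message to every subscriber here.
        Ok(ServerState::Listening { subscribers })
      }
      _ => Err(UnexpectedMsg),
    }
  }
\end{verbatim}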

% Implementing actors in languages other than Rust.
Today the actor runtime only supports executing actors implemented in Rust.
A WebAssembly (Wasm) plugin system is planned to allow any language which can compile to Wasm to be
used to implement an actor.
This work is blocked pending the standardization of the WebAssembly Component Model,
which promises to provide an interface definition language which will allow type safe actors to be
defined in many different languages.

% Running containers using actors.
Blocktree allows containers to be run by encapsulating them using a supervising actor.
This actor is responsible for starting the container and managing the container's kernel namespace.
Logically, it owns any kernel resources created by the container, including all spawned operating
system processes.
When the actor halts,
all of these resources are destroyed.
All network communication to the container is controlled by the supervising actor.
The supervisor can be configured to bind container ports to host ports,
as is commonly done today,
but it can also be used to encapsulate traffic to and from the container in Blocktree messages.
These messages are routed to other actors based on the configuration of the supervisor.
This essentially creates a VPN for containers,
ensuring that regardless of how well secured their communication is,
they will be safe to communicate over any network.
This network encapsulation system could be used in other actors as well,
allowing a lightweight and secure VPN system to be built.

% Web GUI used for managing the system.
Any modern computer system must include a GUI
because users require one.
For this reason Blocktree includes a web-based GUI called \texttt{btconsole} that can
monitor the system, provision runtimes, and configure access control.
\texttt{btconsole} is itself implemented as an actor in the runtime,
and so has access to the same facilities as any other actor.


\section{Filesystem}
% The division of responsibilities between the sector and filesystem services.
The responsibility for serving data in Blocktree is shared between the filesystem and sector
services.
Most actors will access the filesystem through the filesystem service,
which provides a high-level interface that takes care of the cryptographic operations necessary to
read and write files.
The filesystem service relies on the sector service for actually persisting data.
The individual sectors which make up a file are read from and written to the sector service,
which stores them in the local filesystem of the computer on which it is running.
A sector is the atomic unit of data storage
and the sector service only supports reading and writing entire sectors at once.
File actors spawned by the filesystem service buffer reads and writes until there is enough
data to fill a sector.
Because cryptographic operations are only performed on full sectors,
the cost of providing these protections is amortized over the size of the sector.
Thus there is a tradeoff between latency and throughput when selecting the sector size of a file:
a smaller sector size means less latency while a larger one enables more throughput.

% Types of sectors: metadata, integrity, and data.
A file has a single metadata sector, a Merkle sector, and zero or more data sectors.
The sector size of a file can be specified when it is created,
but cannot be changed later.
Every data sector contains the ciphertext of the number of bytes equal to the sector size,
but the metadata and Merkle sectors contain a variable amount of data.
The metadata sector contains all of the filesystem metadata associated with the file.
In addition to the usual metadata present in any Unix filesystem (the contents of the \texttt{stat} struct),
cryptographic information necessary to verify and decrypt the contents of the file is also stored.
The Merkle sector of a file contains a Merkle tree over the data sectors of a file.
The hash function used by this tree can be configured at file creation,
but cannot be changed after the fact.

% How sectors are identified.
When sector service providers are contained in the same directory they connect to each other to form
a consensus cluster.
This cluster is identified by a \texttt{u64} called the cluster's \emph{generation}.
Every file is identified by a pair of \texttt{u64} values, its generation and its inode.
The sectors within a file are identified by an enum which specifies which type they are,
and in the case of data sectors, their 0-based index.
\begin{verbatim}
  pub enum SectorKind {
    Meta,
    Merkle,
    Data(u64),
  }
\end{verbatim}
The byte offset in the plaintext of the file at which each data sector begins can be calculated by
multiplying the sector's index by the sector size of the file.
The \texttt{SectorId} type is used to identify a sector.
\begin{verbatim}
  pub struct SectorId {
    generation: u64,
    inode: u64,
    sector: SectorKind,
  }
\end{verbatim}
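The relationship between data sector indices and plaintext byte offsets can be sketched as follows.
The function names are hypothetical,
and the sector size is passed in explicitly rather than read from a file's metadata.
\begin{verbatim}
  // Byte offset in the plaintext at which data sector `index` begins.
  // Returns `None` if the product does not fit in a u64.
  pub fn data_sector_offset(index: u64, sector_size: u64) -> Option<u64> {
    index.checked_mul(sector_size)
  }

  // The inverse mapping: the index of the data sector containing `offset`,
  // and the position of `offset` within that sector.
  pub fn locate(offset: u64, sector_size: u64) -> Option<(u64, u64)> {
    if sector_size == 0 {
      return None;
    }
    Some((offset / sector_size, offset % sector_size))
  }
\end{verbatim}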

% How the sector service stores data.
The sector service persists sectors in a directory in its local filesystem,
with each sector stored in a different file.
The scheme used to name these files involves security considerations,
and is described in the next section.
When a sector is updated,
a new local file is created with a different name containing the new contents.
Rather than deleting the old sector file,
it is overwritten by the creation of a hardlink to the new file,
and the name that was used to create the new file is unlinked.
This method ensures that the sector file is updated in one atomic operation
and is used by other Unix programs.
The sector service also uses the local filesystem to persist the replicated log it uses for Raft.
This file serves as a journal of sector operations.
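A closely related way to get the same atomic replacement is sketched below.
Note that this is not the hardlink and unlink sequence described above:
it writes the new contents under a temporary name and then calls \texttt{std::fs::rename},
which replaces the destination in a single atomic step on Unix platforms.
The function name and the \texttt{.tmp} suffix are assumptions made for this example.
\begin{verbatim}
  use std::fs;
  use std::io::{self, Write};
  use std::path::Path;

  // Replaces the sector file at `path` with `contents` in one atomic step.
  pub fn update_sector_file(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Write the new contents under a different name first.
    let tmp = path.with_extension("tmp");
    {
      let mut file = fs::File::create(&tmp)?;
      file.write_all(contents)?;
      // Make sure the data is on disk before it becomes visible.
      file.sync_all()?;
    }
    // Readers see either the old contents or the new, never a mixture.
    fs::rename(&tmp, path)
  }
\end{verbatim}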

% Types of messages handled by the sector service.
Communication with the sector service is done by passing it messages of type \texttt{SectorMsg}.
\begin{verbatim}
  pub struct SectorMsg {
    id: SectorId,
    op: SectorOperation,
  }

  pub enum SectorOperation {
    Read,
    Write(WriteOperation),
  }

  pub enum WriteOperation {
    Meta(Box<FileMeta>),
    Data {
      meta: Box<FileMeta>,
      contents: Vec<u8>,
    }
  }
\end{verbatim}
Here \texttt{FileMeta} is the type used to store metadata for files.
Note that updated metadata is required to be sent when a sector's contents are modified.

% Scaling horizontally: using Raft to create consensus cluster. Additional replication methods.
A generation of sector service providers uses the Raft protocol to synchronize the state of the
sectors it stores.
The message passing interface of the runtime enables this implementation
and the sector service's requirements were important considerations in designing this interface.
The system currently replicates all data to each of the service providers in the cluster.
Additional replication methods are planned for future implementation
(e.g. erasure encoding and distribution via consistent hashing),
which allow for different tradeoffs between data durability and storage utilization.

% Scaling vertically: how different generations are stitched together.
The creation of a new generation of the sector service is accomplished with several steps.
First, a new directory is created in which the generation will be located.
Next, one or more processes are credentialed for this directory,
using a procedure which is described in the next section.
The credentialing process produces files for each of the processes stored in the new directory.
The sector service provider in each of the processes uses the filesystem service
(which connects to the parent generation of the sector service)
to find the other runtimes hosting the sector service in the directory and messages them to
establish a fully-connected cluster.
Finally, the service provider which is elected leader contacts the generation in the root directory
and requests a new generation number.
Once this number is known it is stored in the superblock for the generation,
which is the file identified by the new generation number and inode 2.
The superblock is not contained in any directory and cannot be accessed outside the sector service.
The superblock also keeps track of the next inode to assign to a new file.

% Authorization logic of the sector service.
To prevent malicious actors from writing invalid data,
the sector service must cryptographically verify all write messages.
The process it uses to do this involves several steps:
\begin{enumerate}
  \item The certificate chain in the metadata that was sent in the write message is validated.
    It is considered valid if it ends with a certificate signed by the root principal
    and the paths in the certificates are correctly nested,
    indicating valid delegation of write authority at every step.
  \item Using the last public key in the certificate chain,
    the signature in the metadata is validated.
    This signature covers all of the fields in the metadata.
  \item The new sector contents in the write message are hashed using the digest function configured
    for the file and the resulting hash is used to update the file's Merkle tree in its Merkle
    sector.
  \item The root of the Merkle tree is compared with the integrity value in the file's metadata.
    The write message is considered valid if and only if there is a match.
\end{enumerate}
This same logic is used by file actors to verify the data they read from the sector service.
Only once a write message is validated is it shared with the sector service provider's peers in
its generation.
Although the data in a file is encrypted,
it is still beneficial for security to prevent unauthorized principals from gaining access to a
file's ciphertext.
To prevent this, a sector service provider checks a file's metadata to verify that the requesting
principal actually has a readcap (to be defined in the next section) for the file.
This ensures that only principals that are authorized to read a file can gain access to the file's
ciphertext, metadata, and Merkle tree.
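The last two validation steps can be illustrated with a small sketch.
It is deliberately simplified:
the standard library's \texttt{DefaultHasher} stands in for the file's configured digest and is
not cryptographically secure,
the tree is represented only by its leaf hashes,
and the rule of promoting an unpaired node unchanged is an assumption about the tree layout.
All names are hypothetical.
\begin{verbatim}
  use std::collections::hash_map::DefaultHasher;
  use std::hash::{Hash, Hasher};

  // Stand-in for the file's configured digest (e.g. SHA-256). NOT cryptographic.
  pub fn leaf_hash(contents: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    contents.hash(&mut hasher);
    hasher.finish()
  }

  fn node_hash(left: u64, right: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (left, right).hash(&mut hasher);
    hasher.finish()
  }

  // Computes the root over the leaf hashes, or `None` if there are no leaves.
  pub fn merkle_root(leaves: &[u64]) -> Option<u64> {
    let mut level = leaves.to_vec();
    while level.len() > 1 {
      level = level
        .chunks(2)
        .map(|pair| match pair {
          [left, right] => node_hash(*left, *right),
          // An unpaired node is promoted unchanged (an assumption).
          _ => pair[0],
        })
        .collect();
    }
    level.first().copied()
  }

  // Hashes the new sector contents, updates the tree,
  // and compares the resulting root with the integrity value from the metadata.
  pub fn write_is_valid(leaves: &[u64], index: usize, contents: &[u8], integrity: u64) -> bool {
    let mut updated = leaves.to_vec();
    if index < updated.len() {
      updated[index] = leaf_hash(contents);
    } else if index == updated.len() {
      updated.push(leaf_hash(contents));
    } else {
      return false;
    }
    merkle_root(&updated) == Some(integrity)
  }
\end{verbatim}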

% File actors are responsible for cryptographic operations. Client-side encryption.
The sector service is relied upon by the filesystem service to read and write sectors.
Filesystem service providers communicate with the sector service to open files and perform
filesystem operations.
These providers spawn file actors that are responsible for verifying and decrypting the information
contained in sectors and providing it to other actors.
They use the credentials of the runtime they are hosted in to decrypt sector data using
information contained in file metadata.
File actors are also responsible for encrypting and integrity protecting data written to files.
In order for a file actor to produce a signature over the root of the file's Merkle tree,
it maintains a copy of the tree in memory.
This copy is read from the sector service when the file is opened.
While this does mean duplicating data between the sector and filesystem services,
this design was chosen to reduce the network traffic between the two services,
as the entire Merkle tree does not need to be transmitted on every write.
Encapsulating all cryptographic operations in the filesystem service and file actors allows the
computer storing data to be different from the computer encrypting it.
This approach allows client-side encryption to be done on more capable computers
and low powered devices to delegate this task to a storage server.

% Prevention of resource leaks through ownership.
A major advantage of using file actors to access file data is that they can be accessed over the
network from a different runtime as easily as they can be from the same runtime.
One complication arising from this approach is that file actors must not outlive the actor which
caused them to be spawned.
This is handled in the filesystem service by making the actor who opened the file the owner of the
file actor.
When a file actor receives notification that its owner returned,
it flushes any buffered data in its cache and returns,
ensuring that a resource leak does not occur.

% Encrypted metadata. Extended attributes in metadata. Cache control.
Some of the information stored in metadata needs to be kept in plaintext to allow the sector
service to verify and decrypt the file,
but most of it is encrypted using the same key as the file's contents.
The file's authorization attributes, its size, and its access times are all encrypted.
The table storing the file's extended attributes (EAs) is also encrypted.
Cache control information is included in this area as well.
It specifies the number of seconds, as a \texttt{u32}, that a file may be cached.
The filesystem service uses this information to evict sectors from its cache when they have been
cached for longer than this threshold,
causing them to be reloaded from the sector service.
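A minimal sketch of this eviction check follows.
The \texttt{CachedSector} type and the function name are hypothetical,
and the current time is passed in as an argument so that the check is easy to test.
\begin{verbatim}
  use std::time::{Duration, Instant};

  // A sector held in the filesystem service's cache.
  pub struct CachedSector {
    pub contents: Vec<u8>,
    pub cached_at: Instant,
  }

  // Returns true once the sector has been cached for longer than `max_age_secs`,
  // the cache control value taken from the file's metadata.
  pub fn should_evict(entry: &CachedSector, max_age_secs: u32, now: Instant) -> bool {
    let age = now.saturating_duration_since(entry.cached_at);
    age > Duration::from_secs(u64::from(max_age_secs))
  }
\end{verbatim}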

% Authorization logic of the filesystem service.
The filesystem service uses an \texttt{Authorizer} type to make authorization decisions.
It passes this type the authorization attributes of the principal accessing the file, the
attributes of the file, and the type of access (read, write, or execute).
The \texttt{Authorizer} returns a boolean indicating if access is permitted or denied.
These access control checks are performed for every message processed by the filesystem service,
including opening a file.
A file actor only responds to messages sent from its owner,
which ensures that it can avoid the overhead of performing access control checks as these were
carried out by the filesystem service when it was created.
The file actor is configured when it is spawned to allow read-only, write-only, or read-write
access to a file,
depending on what type of access was requested by the actor opening the file.

% Streaming replication.
Often when building distributed systems it is convenient to alert any interested party that an event
has occurred.
To facilitate this pattern,
the sector service allows actors to subscribe for notification of writes to a file.
The sector service maintains a list of actors which are currently subscribed
and when it commits a write to its local storage,
it sends each of them a notification message identifying the sector written
(but not the written data).
By using different files to represent different events,
a simple notification system can be built.
Because the contents of a directory may be distributed over many different generations,
this system does not support the recursive monitoring of directories.
Although this system lacks the power of \texttt{inotify} in the Linux kernel,
it does provide some of its benefits without incurring much performance overhead
or implementation complexity.
For example, this system can be used to implement streaming replication.
This is done by subscribing to writes on all the files that are to be replicated,
then reading new sectors as soon as notifications are received.
These sectors can then be written into replica files in a different directory.
This ensures that the contents of the replicas will be updated in near real-time.
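The subscription bookkeeping can be sketched as follows.
Standard library channels stand in for Blocktree messages,
files are identified only by inode,
and all of the names are hypothetical.
As in the text, a notification identifies the sector written but does not carry the written data.
\begin{verbatim}
  use std::sync::mpsc::{channel, Receiver, Sender};

  // Identifies the data sector that was written.
  #[derive(Clone, Debug, PartialEq)]
  pub struct WriteNotice {
    pub inode: u64,
    pub sector_index: u64,
  }

  // The list of actors currently subscribed, keyed by the inode they watch.
  #[derive(Default)]
  pub struct Subscriptions {
    subscribers: Vec<(u64, Sender<WriteNotice>)>,
  }

  impl Subscriptions {
    // Subscribes to writes on the file with the given inode.
    pub fn subscribe(&mut self, inode: u64) -> Receiver<WriteNotice> {
      let (tx, rx) = channel();
      self.subscribers.push((inode, tx));
      rx
    }

    // Called after a write has been committed to local storage.
    // Subscribers whose receiving end has gone away are dropped.
    pub fn notify_committed(&mut self, notice: &WriteNotice) {
      self.subscribers.retain(|(inode, tx)| {
        *inode != notice.inode || tx.send(notice.clone()).is_ok()
      });
    }
  }
\end{verbatim}
A replicator built on this would call \texttt{subscribe} once for each file to be replicated,
then read each sector named in a notice and write it into the corresponding replica file.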

% Peer-to-peer distribution of sector data.
Because of the strong integrity protection afforded to sectors,
it is possible for peer-to-peer distribution of sector data to be done securely.
Implementing this mechanism is planned as a future enhancement to the system.
The idea is to base the design on BitTorrent,
where the sector service responsible for a file acts as a tracker for that file,
and the file actors accessing the file communicate with one another directly using the information
provided by the sector service.
This could allow the system to scale to a much larger number of concurrent reads by reducing
the load on the sector service.

% The FUSE daemon.
Being able to access the filesystem from actors allows a programmer to implement new applications
using Blocktree,
but there is an entire world of existing applications which only know how to access the local
filesystem.
To allow these applications access to Blocktree,
a FUSE daemon called \texttt{btfuse} is included which allows a Blocktree directory to be mounted
to a directory in the local filesystem.
This daemon can directly access the sector files in a local directory,
or it can connect over the network to a filesystem or sector service provider.
This FUSE daemon could be included in a system's initrd to allow it to mount its root filesystem
from Blocktree,
opening up many interesting possibilities for hosting machine images in Blocktree.
A planned future enhancement is to develop a Blocktree filesystem driver which actually runs in
kernel space.
This would reduce the overhead associated with context switching from user space, to kernel space,
and back to user space, for every filesystem interaction,
making the system more practical to use for a root filesystem.


\section{Cryptography}
This section describes the cryptographic mechanisms used to protect the integrity and
confidentiality of files.
These mechanisms are based on well-established cryptographic constructions.

% Integrity protection.
File integrity is protected by a digital signature over its metadata.
The metadata contains the integrity field which contains the root node of a Merkle tree over
the file's contents.
This allows any sector in the file to be verified with a number of hash function invocations that
is logarithmic in the size of the file.
It also allows the sectors of a file to be verified in any order,
enabling random access.
The hash function used in the Merkle tree can be configured when the file is created.
Currently, SHA-256 is the default, and SHA-512 is supported.
A file's metadata also contains a certificate chain,
and this chain is used to authenticate the signature over the metadata.
In Blocktree, the certificate chain is referred to as a \emph{writecap}
because it grants the capability to write to files.
The certificates in a valid writecap are ordered by their paths:
the initial certificate contains the longest path,
the path in each subsequent certificate must be a prefix of the one preceding it,
and the final certificate must be signed by the root principal.
These rules ensure that there is a valid delegation of write authority at every
link in the chain,
and that the authority is ultimately derived from the root principal specified by the absolute path
of the file.
By including all the information necessary to verify the integrity of a file in its metadata,
it is possible for a requestor who only knows the path of a file to verify that the contents of the
file are authentic.
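The path ordering rule for writecaps can be sketched as a simple check.
Signature verification is omitted,
the function name is hypothetical,
and the requirement that the first certificate's path be a prefix of the file's path is an
assumption made for this example.
Note that \texttt{Path::starts\_with} compares whole path components,
so \texttt{/root/ab} is not considered to start with \texttt{/root/a}.
\begin{verbatim}
  use std::path::{Path, PathBuf};

  // `chain` holds the path from each certificate, starting with the initial certificate.
  pub fn paths_correctly_nested(chain: &[PathBuf], file_path: &Path) -> bool {
    let first = match chain.first() {
      Some(first) => first,
      None => return false,
    };
    // Assumption: the initial certificate must cover the file being written.
    file_path.starts_with(first)
      // Each subsequent path must be a prefix of the one preceding it.
      && chain.windows(2).all(|pair| pair[0].starts_with(&pair[1]))
  }
\end{verbatim}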

% Confidentiality protecting files with readcaps. Single pubkey operation to read a dir tree.
Confidentiality protection of files is optional but when it is enabled,
a file's sectors are individually encrypted using a symmetric cipher.
The key to this cipher is randomly generated when a file is created.
A different IV is generated for each sector by hashing the index of the sector with a
randomly generated IV for the entire file.
A file's key and IV are encrypted using the public keys of the principals to whom read access is
to be allowed.
The resulting ciphertext is referred to as a \emph{readcap}, as it grants the capability to read the
file.
These readcaps are stored in a table in the file's metadata.
Each entry in the table is identified by a byte string that is derived from the public key of the
principal who owns the entry's readcap.
The byte string is computed by calculating an HMAC of the principal's public key.
The HMAC is keyed with a randomly generated salt that is stored in the file's metadata.
An identifier for the hash function that was used in the HMAC is included in the byte string so
that the HMAC can be recomputed later.
When the filesystem service accesses the file,
it recomputes the HMAC using the salt, its public key, and the hash function specified in each entry
of the table.
It can then identify the entry which contains its readcap,
or that such an entry does not exist.
This mechanism was designed to prevent offline correlation attacks on file metadata,
as metadata is stored in plaintext in local filesystems.
The file key and IV are also encrypted using the keys of the file's parents.
Note that there may be multiple parents of a file because it may be hard linked to several
directories.
Each of the resulting ciphertexts is stored in another table in the file's metadata.
The entries in this table are identified by an HMAC of the parent's generation and inode numbers,
where the HMAC is keyed using the file's salt.
By encrypting a file's key and IV using the key and IV of its parents,
it is possible to traverse a directory tree using only a single public key decryption.
The file where this traversal begins must contain a readcap owned by the accessing principal,
but all subsequent accesses can be performed by decrypting the key and IV of a child using the
key and IV of a parent.
Not only does this allow traversals to use more efficient symmetric key cryptography,
but it also means that it suffices to grant a readcap on a single directory in order to grant
access to the entire tree rooted at that directory.

% File key rotation and readcap revocation.
Because it is not possible to change the key used by a file after it is created,
a file must be copied in order to rotate the key used to encrypt it.
Similarly, revoking a readcap is accomplished by creating a copy of the file
and adding all the readcaps from the original's metadata except for the one being revoked.
While it is certainly possible to remove a readcap from the metadata table,
this is not supported because the readcap holder may have used custom software to save the file's
key and IV while it had access to them,
so data written to the same file after revocation could potentially be decrypted by it.
By forcing the user to create a new file,
they are forced to re-encrypt the data using a fresh key and IV.

% Obfuscating sector files stored in the local filesystem.
From an attacker's perspective,
not every file in your domain is equally interesting.
They may be particularly interested in reading your root directory,
or they may have identified the inode of a file containing kompromat.
To make it more difficult to identify offline which files the sectors in the local filesystem
belong to,
an obfuscation mechanism is used.
This works by generating a random salt for each generation of the sector service,
and storing it in the generation's superblock.
It is hashed along with the inode and the sector ID to produce the file name of the sector file
in the local filesystem.
These files are arranged into different subdirectories according to the value of the first two
digits in the hex encoding of the resulting hash,
the same way git organizes object files.
This simple method makes it more difficult for an attacker to identify the files each sector belongs
to
while still allowing the sector service efficient access.
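The directory layout can be sketched as follows.
The hash over the salt, inode, and sector ID is taken as an input here because the digest function
is not specified above.
The function name is hypothetical,
and using the remaining hex digits as the file name is an assumption based on the comparison to git.
\begin{verbatim}
  use std::path::{Path, PathBuf};

  // Maps the digest computed for a sector to the location of its sector file.
  // Returns `None` if the digest is too short to split.
  pub fn sector_file_path(storage_dir: &Path, digest: &[u8]) -> Option<PathBuf> {
    if digest.len() < 2 {
      return None;
    }
    let hex: String = digest.iter().map(|byte| format!("{:02x}", byte)).collect();
    // The first two hex digits name the subdirectory, the rest name the file.
    let (subdir, file_name) = hex.split_at(2);
    Some(storage_dir.join(subdir).join(file_name))
  }
\end{verbatim}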

% Credential stores.
Processes need a way to securely store their credentials.
They accomplish this by using a credential store,
which is a type that implements the trait \texttt{CredStore}.
A credential store provides methods for using a process's credentials to encrypt, decrypt,
sign, and verify data,
but it does not allow them to be exported.
A credential store also provides a method for generating new root credentials.
Because root credentials represent the root of trust for an entire domain,
it must be possible to securely back them up from one credential store to another.
Root credentials can also be used to perform cryptographic operations without exporting them.
A password is set when the root credentials are generated,
and this same password must be provided to use, export, and import them.
When root credentials are exported from a credential store they are confidentiality protected
using multiple layers of encryption.
The outermost layer is encryption by a symmetric key cipher whose key is derived from the
password.
A public key of the receiving credential store must also be provided when root credentials are
exported.
This public key is used to perform the inner encryption of the root credentials,
ensuring that only the intended credential store is able to import them.
Currently there are two \texttt{CredStore} implementors in Blocktree,
one which is used for testing and one which is more secure.
The first is called \texttt{FileCredStore},
and it uses a file in the local filesystem to store credentials.
A symmetric cipher is used to protect the root credentials, if they are stored,
but it relies on the security of the underlying filesystem to protect the process credentials.
For this reason it is not recommended for production use.
The other credential store is called \texttt{TpmCredStore},
and it uses a Trusted Platform Module (TPM) 2.0 on the local machine to store credentials.
The TPM is used to generate the process's credentials in such a way that they can never be
exported from the TPM (this is a feature of TPM 2.0).
A randomly generated cookie is needed to use these credentials.
The cookie is stored in a file in the local filesystem with its permissions set to prevent
others from accessing it.
Thus this type also relies on the security of the local filesystem.
But, an attacker would need to steal the TPM and this cookie in order to steal a process's
credentials.

% Manual provisioning via the command line.
The term provisioning is used in Blocktree to refer to the process of acquiring credentials.
A command line tool called \texttt{btprovision} is provided for provisioning credential stores.
This tool can be used to generate new process or root credentials, create a certificate request
using them, issue a new certificate, and finally to import the new certificate chain.
When setting up a new domain,
\texttt{btprovision} can create a new sector storage directory in the local filesystem
and write the new process's files to it.
It is also capable of connecting to the filesystem service if it is already running.

% Automatic provisioning.
While manual provisioning is necessary to bootstrap a domain,
an automatic method is needed to make this process more ergonomic.
When a runtime starts it checks its configured credential store to find the certificate chain to
use for authenticating to other runtimes.
If no such chain is stored,
the runtime can choose to request a certificate from the filesystem service.
This is done by dispatching a message with \texttt{call} to the filesystem service without
specifying a scope.
Because the message specifies no path, there is no root directory to begin discovery at.
So, the runtime resorts to using link-local discovery to find other runtimes.
Once one is discovered,
the runtime connects to it anonymously
and sends it a certificate request.
This request includes a copy of the runtime's public key and, optionally, a path where the
runtime would like to be located.
This path is purely advisory;
the filesystem service is free to place the runtime in any directory it sees fit.
The filesystem service creates a new process file containing the public key and marks it as
pending.
The reply to the runtime contains the path of the file created for it.
The operators of the domain can then use the web GUI or \texttt{btprovision} to view the request
and approve it at their discretion.
Assuming an operator approves the request,
the operator's credentials and the public key in the new process's file are used to issue a
certificate,
which is then stored in the file.
Authorization attributes (e.g. UID and GID) are also assigned to the process and written into its
file.
Note that a process's file is normally not writeable by the process itself,
so as to prevent it from setting its own authorization attributes.
Once these data have been written to the process file,
the runtime can read them to retrieve its new certificate chain.
It stores this chain in its credential store for later use.
The runtime can avoid polling its file for changes if it subscribes to write notifications.
The runtime must close the anonymous connections it made
and reconnect using the new certificate chain.
Once new connections are established,
it can read and write files using the authorization attributes specified in its file.
Note that this procedure only works when the runtime is on the same LAN as another runtime.

% The generation of new root credentials and the creation of a new domain.
The procedure for creating a new domain is straightforward,
and all the steps can be performed using \texttt{btprovision}.
\begin{enumerate}
  \item Generate the root credentials for the new domain.
  \item Generate the credentials for the first runtime.
  \item Create a certificate request using the runtime credentials.
  \item Approve the request using the root credentials.
  \item Import the new certificate into the credential store of the first runtime.
\end{enumerate}
The first runtime is configured to host the sector and filesystem services,
so that subsequent runtimes will have access to the filesystem.
After that, additional runtimes on the same LAN can be provisioned using the automatic process.

% Setting up user based access control.
Up till now the focus has been on authentication and authorization of processes,
but it bears discussing how user-based access control can be accomplished with Blocktree.
Because credentials are locked to the device on which they're created,
a user will have at least as many principals as they have devices.
But, all of these principals can be configured to have the same authorization attributes (UID, GID),
giving them the same permissions.
It makes sense to keep the files for all of the provisioned runtimes associated with a user in one
place
and the natural place is in the user's home directory.
Although every one of the user's processes needs to be provisioned,
this is not a huge limitation because a single runtime can host many different actors,
implementing many different applications.
Managing the users in a domain is facilitated by putting their home directories in a single user
directory for the domain.
Runtimes hosting the sector service on storage servers could then be provisioned in this directory
to provide the sector and filesystem services for the users' home directories.
It would be at the administrator's discretion whether or not to enable client-side encryption.
If they wanted to,
the principal of at least one of a user's runtimes would need to be issued a readcap for the
user's home directory.
This runtime could then directly access the sector service in the domain's user directory.
By moving encryption onto the user's computer,
load can be shed from the storage servers.
Note that this setup does require all of the user's runtimes to be able to communicate with the
runtime whose principal was issued the readcap.

% Example of how these mechanisms allow data to be shared.
To illustrate how these mechanisms can be used to facilitate collaboration between enterprises,
consider a situation where two companies wish to partner in the development of a product.
To facilitate their collaboration,
they wish to have a way to securely exchange data with each other.
One of the companies is selected to host the data
and accepts the cost and responsibility of serving it.
The host company creates a directory which will be used to store all of the data created during
development.
The other company will connect to the filesystem service in the host company's domain to access
data in the shared directory.
Each of the principals in the other company which wishes to connect requests to be credentialed in
the shared directory.
The hosting company manually reviews these requests and approves them,
assigning each of the principals authorization attributes appropriate for its domain.
This may involve issuing UID and GID values to each of the principals, or perhaps SELinux contexts.
The actual set of attributes supported is determined by the \texttt{Authorization} type used by
the filesystem service in the host company's domain.
Once the principals have their credentials,
they can dispatch messages to the filesystem service using the shared directory as the scope and
setting the rootward field to true.
This allows actors authenticating with the credentials of these principals to perform all filesystem
operations authorized by the hosting company.
This situation gives the hosting company a lot of control over the data.
If the other company wishes to protect its investment in the R\&D effort,
it should subscribe to write events on the shared directory and the files in it so that it can
copy new sectors out of the host company's domain as soon as they are written.
Note that although it is not possible to directly subscribe to writes on the contents of a
directory, by monitoring a directory for changes,
one can begin monitoring files as soon as they are created.


\section{Examples}
This section contains examples of systems that could be built using Blocktree.
The hope is to illustrate how this platform can be used to implement existing applications more
easily and to make it possible to implement systems which are currently out of reach.

\subsection{A distributed AI execution environment.}
Neural networks are just vector-valued functions with vector inputs,
albeit very complicated ones with potentially billions of parameters.
But, just like any other computation,
these functions can be conceptualized as computational graphs.
Imagine that you have a set of computers equipped with AI accelerator hardware
and you have a neural network that is too large to be processed by any one of them.
By partitioning the graph into small enough subgraphs,
we can break the network down into pieces which can be processed by each of the accelerators.
The full network can be stitched together by passing messages between each of these pieces.

Let us consider how this could be accomplished with Blocktree.
We begin by provisioning a runtime on each of the accelerator machines,
each of which will have a new accelerator service registered.
Messages will be sent to the accelerator service describing the computational graph to execute,
as well as the name of the actor to which the output is to be sent.
When such a message is received by an accelerator service provider,
it spawns an actor which compiles its subgraph to a kernel for its accelerator
and remembers the name of the actor to send its output to.
An orchestrator service will be responsible for partitioning the graph and sending these messages.
Ownership of the actors spawned by the accelerator service is given to the orchestrator service,
ensuring that they will all be stopped when the orchestrator returns.
When one of the spawned actors stops,
it unloads the kernel from the accelerator's memory and returns it to its initial state.
Note that the orchestrator actor must have execute permissions on each of the accelerator runtimes
in order to send messages to them.
The orchestrator dispatches messages to the accelerator service in reverse order of the flow of data
in the computational graph,
so that it can tell each service provider where its output should be sent.
The actors responsible for the last layer in the computational graph send their output to the
orchestrator.
To begin the computation,
the actors which are responsible for input are given the filesystem path of the input data.
The orchestrator learns of the completion of the computation once it receives the output from
the final layer.
It can then save these results to the filesystem and return.
Because inference and training can both be modeled by computational graphs,
this same procedure can be used for both.
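
The reverse-order dispatch can be illustrated with a small sketch.
It assumes the simplest case, where the partitioned graph is a linear chain of subgraphs,
and it uses a hypothetical \texttt{Dispatch} message type and synthesized actor names;
none of these names come from the Blocktree API.

```rust
/// A hypothetical message telling one accelerator which subgraph to run
/// and where to send its output.
#[derive(Debug, PartialEq)]
struct Dispatch {
    subgraph: usize,
    output_to: String,
}

/// Plans dispatch messages for a linear chain of subgraphs, numbered in
/// data-flow order. Messages are produced in reverse data-flow order, so the
/// name of each stage's output actor is known before that stage is dispatched.
fn plan_dispatch(stages: usize, orchestrator: &str) -> Vec<Dispatch> {
    let mut plan = Vec::with_capacity(stages);
    // The last stage sends its output back to the orchestrator.
    let mut output_to = orchestrator.to_string();
    for subgraph in (0..stages).rev() {
        plan.push(Dispatch { subgraph, output_to: output_to.clone() });
        // ASSUMPTION: in a real system the actor's name would come from the
        // reply to the dispatch; here it is synthesized for illustration.
        output_to = format!("actor-{}", subgraph);
    }
    plan
}
```

For three stages the plan starts with the final subgraph, whose output goes to the orchestrator,
and ends with the input subgraph, whose output goes to the actor spawned for the second stage.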

\subsection{A decentralized social media network.}
One of the original motivations for designing Blocktree was to create a platform for a social
network that puts users fully in control of their data.
In the opinion of the author,
the only way to actually accomplish this is for users to host the data themselves.
One might think it is possible to use client-side encryption to solve the privacy issue,
but this does not solve the full problem.
While it is true that good client-side encryption will prevent the service provider from reading
the user's data,
the user could still lose everything if the service provider goes out of business or simply
decides to stop offering its service.
Similarly, putting data in a federated system, as has been proposed by the Mastodon developers,
also puts the user at risk of losing their data if the operator of the server they use decides to
shut it down.
To have real control the user must host the data themselves.
Then they decide how it's encrypted, how it's served, and to whom.

Let us explore how Blocktree can be used to build a social media platform which provides this
control.
To participate in this network each user will need to set up their own domain by generating new root
credentials
and provisioning at least one runtime to host the social media service.
A technical user could do this on their own hardware by reading the Blocktree documentation,
but a non-technical user might choose to purchase a new router with Blocktree pre-installed.
By connecting this router directly to their WAN,
the user ensures that the services running on it will always have direct internet access.
The user can access the \texttt{btconsole} web GUI via the router's WiFi interface to generate their
root credentials and provision new runtimes on their network.

A basic function of any social network is keeping track of a user's contacts.
This would be handled by maintaining the contacts as files in a well-known directory in the user's
domain.
Each file in the directory would be named using the user defined nickname for the contact
and its contents would include the root principal of the contact as well as any additional user
defined attributes,
such as address or telephone number.
The root principal would be used to discover runtimes controlled by the contact
so that messages can be sent to the social media service running in them.
When a user adds a new contact,
a connection message would be sent to it,
which the contact could choose to accept or reject.
If accepted,
the contact would create an entry in its contacts directory for the user.
The contact's social media service would then accept future direct messages from the user.
When the user sends a direct message to the contact,
its runtime discovers runtimes controlled by the contact and delivers the message.
Once delivered the contact's social media service stores the message in a directory for the user's
correspondence,
sort of like an mbox directory but where messages are sorted into directories based on sender
instead of receiver.

Note that this procedure only works if a contact's root principal can be resolved using the
search domain configured in the user's runtime.
We can ensure this is the case by configuring the runtime to use a search domain that operates
a Dynamic DNS (DDNS) service
and by arranging with this service to create the correct records to resolve the root principal.
The author intends to operate such a service to facilitate the use of Blocktree by home users,
but a more long-term solution is to implement a blockchain for resolving root principals.
Only then would the system be fully decentralized.

Making public posts is accomplished by creating files in a directory with the HTML contents of the
post.
This file, the directory containing it, and all parents of it,
would be configured to allow others to read, and in the case of directories, execute them.
At least one runtime with the filesystem service registered would need to have the execute
permission granted to others to allow anyone to access these files.
When someone wanted to view the posts of another user,
they would use the filesystem service to read these files from the well-known posts directory.

Of course users would not be using a file manager to interact with this social network;
they would use their browsers as they do now.
This web interface would be served by the social media service in their domain.
A normal user who has a Blocktree enabled router would just type in a special hostname into their
browser to open this interface.
Because the router provides DNS services to their network,
it can generate the appropriate records to ensure this name resolves to the address where the social
media service is listening.
The social media service would be responsible for sending messages to other users' domains to
get their posts,
and to read the filesystem to display the user's direct messages.
All this file data would be used to populate the web interface.
It is not hard to see how the same system could be used to serve any type of media: text, images,
video, immersive 3D worlds.
All of these can be stored in files in the filesystem,
and so all of them are accessible to Blocktree actors.

One issue that must be addressed with this design is how it will scale to a large number of users
accessing data at once.
In other words,
what happens if the user goes viral?
Currently, the way to solve this would be to add more computers to the user's network which run
the sector and filesystem services.
This is not ideal as it means the user would need to buy more hardware to serve their dank memes.
A better solution would be to implement peer-to-peer distribution of sector data in the filesystem
service.
This would reduce the load on the user's computers and allow their followers to share the posted
data with each other.
This work is planned as a future enhancement.

\subsection{A smart lock.}
The access control language provided by Blocktree's filesystem can be used for more than just
authorizing access to data.
To illustrate this point,
consider a smart lock installed on the front door of a company's office building.
When the company first got the lock they used NFC to configure the lock
and connect it to their WiFi network.
The lock then used link-local runtime discovery to perform automatic provisioning.
An IT administrator accessed \texttt{btconsole} to approve the provisioning request
and position the lock in a specific directory in the company's domain.
Permission to actuate the lock is granted if a principal has execute permission on the lock's file.
To verify the physical presence of an employee,
NFC is used for the authentication handshake.
When an employee presses their NFC device, for instance their phone, to the lock,
it generates a nonce and transmits it to the device.
The device then signs the nonce using the credentials it used during provisioning in the company's
domain.
It transmits this signature to the lock along with the path to the principal's file in the domain.
The lock then reads this file to obtain the principal's authorization attributes and its public key.
It uses the public key to validate the signature presented by the device.
If this is successful,
it then checks the authorization attributes of the principal against the authorization attributes on
its own file.
If execute permissions are granted, the lock actuates, allowing the employee access.
The administrators of the company's domain create a group specifically for controlling physical
access to the building.
All employees with physical access permission are added to this group,
and the group is granted execute permission on the lock,
rather than individual users.
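
The final authorization step can be sketched as below.
The sketch assumes Unix-style authorization attributes (a UID, a set of GIDs, and mode bits),
which is only one possibility for the attributes a domain may use.
The types \texttt{Principal} and \texttt{FileAttrs} and the function \texttt{may\_actuate} are
hypothetical,
and signature verification is omitted because it requires a cryptography library.

```rust
/// Hypothetical authorization attributes of a principal.
struct Principal {
    uid: u32,
    gids: Vec<u32>,
}

/// Hypothetical authorization attributes on the lock's file.
struct FileAttrs {
    uid: u32,
    gid: u32,
    /// Unix-style permission bits, e.g. 0o750.
    mode: u32,
}

/// Returns true if the principal has execute permission on the lock's file,
/// using the usual owner, group, other precedence.
/// ASSUMPTION: the nonce signature has already been verified by the caller.
fn may_actuate(principal: &Principal, lock: &FileAttrs) -> bool {
    let bit = if principal.uid == lock.uid {
        0o100 // owner execute
    } else if principal.gids.contains(&lock.gid) {
        0o010 // group execute
    } else {
        0o001 // other execute
    };
    (lock.mode & bit) != 0
}
```

With the lock's file assigned to the physical access group and a mode of \texttt{0o750},
members of the group can actuate the lock while all other principals are refused.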

\subsection{A traditional three-tier web application.}
While it is hoped that Blocktree will enable interesting and novel applications,
it can also be used to build the kind of web applications that are common today.
Suppose that we wish to build a three-tier web application.
Let us explore how Blocktree could help.

First, let us consider which database to use.
It would be desirable to use a traditional SQL database,
preferably one which is open source and not owned by a large corporation with dubious motivations.
These constraints lead us to choose Postgres,
but Postgres was not designed to run on Blocktree.
However, Postgres does have a container image available on Docker Hub,
so we can create a service to run this container image in our domain.
But Postgres stores all of its data in the local filesystem of the machine it runs on.
How can we ensure this does not become a single point of failure?
First, we should create a directory in our domain to hold the Postgres cluster directory.
Then we should procure at least three servers for our storage cluster
and provision runtimes hosted on each of them in this directory.
The sector service is registered on each of the runtimes,
so all the data stored in the directory will be replicated on each of the servers.
Now, the Postgres service should be registered in one and only one of these runtimes,
as Postgres requires exclusive access to its database cluster.
\texttt{btfuse} will be used to mount the Postgres directory to a path in the local filesystem
and the Postgres container will be configured to access it.
We now have to decide how other parts of the system are going to communicate with Postgres.
We could have the Postgres service set up port forwarding for the container,
so that ordinary network connections can be used to talk to it.
But we will have to set up TLS if we want this to be secure.
The alternative is to use Blocktree as a VPN and proxy network communications in messages.
This is accomplished by registering a proxy service in the same runtime as the Postgres service
and configuring it to allow traffic it receives to pass to the Postgres container on TCP port 5432.

In a separate directory,
a collection of runtimes is provisioned which will host the webapp service.
This service will use axum to serve the static assets to our site,
including the Wasm modules which make up our frontend,
as well as our site's backend.
In order to do this,
it will need to connect to the Postgres database.
This is accomplished by registering the proxy service in each of the runtimes hosting the
webapp service.
The proxy service is configured to listen on TCP 127.0.0.1:5432 and forwards all traffic
to the proxy service in the Postgres directory.
The webapp can then use the \texttt{tokio-postgres} crate to establish a TCP connection to
127.0.0.1:5432
and it will end up talking to the containerized Postgres instance.

Although the data in our database is stored redundantly,
we do still have a single point of failure in our system,
namely the Postgres container.
To handle this we can implement a failover service.
It will work by calling the Postgres service with heartbeat messages.
If too many of these time out,
we assume the service is dead and start a new instance in one of the other runtimes in the Postgres
directory.
This new instance will have access to all the same data as the old one,
including its write-ahead log.
Assuming it can complete crash recovery from this log,
the new service will come up after a brief delay
and the system will recover.
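
The decision logic of such a failover service might look like the following sketch.
It assumes that ``too many'' means a run of consecutive timeouts exceeding a configured limit;
the type \texttt{FailoverMonitor} and its methods are hypothetical names used only for
illustration.

```rust
/// Tracks heartbeat results and decides when to fail over.
struct FailoverMonitor {
    /// Number of consecutive timeouts tolerated before failing over.
    max_timeouts: u32,
    consecutive_timeouts: u32,
}

impl FailoverMonitor {
    fn new(max_timeouts: u32) -> Self {
        Self { max_timeouts, consecutive_timeouts: 0 }
    }

    /// Records the outcome of one heartbeat call. Returns true when the
    /// service should be presumed dead and a new instance started.
    fn record(&mut self, replied: bool) -> bool {
        if replied {
            // Any reply resets the count of consecutive timeouts.
            self.consecutive_timeouts = 0;
        } else {
            self.consecutive_timeouts += 1;
        }
        self.consecutive_timeouts > self.max_timeouts
    }
}
```

Resetting the count on every reply keeps a single dropped heartbeat from triggering a failover,
which matters because starting a second Postgres instance against the same cluster directory
while the first is still alive would violate its requirement for exclusive access.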

\subsection{A realtime geospatial environment.}
% Motivation
If we are to believe science fiction,
then the natural evolution of computer interaction is the development
of a persistent virtual world that we use to communicate, conduct business, and
enjoy our leisure.
This kind of system has been a dream for a long time,
but as it has grown closer to becoming a reality,
the popular consciousness has shifted against it.
People are rightly horrified by the idea of giving control over their virtual worlds to the same
social media company which has an established track record for causing societal harm.
But this technology does not need to be dystopian.
If an open system can be built, which actually works,
it can prevent the market from accepting a closed system designed to lock in user attention
and monetize them relentlessly.
This is the future,
it is only a question of who will own it.

% Coordinates
Let us explore how Blocktree could be used to build such a system.
The world we are going to render will be a planet with a roughly spherical surface and a
configurable radius $\rho$.
$\rho$ is a \texttt{u32} value whose units are meters.
We will use latitude ($\phi$) and longitude ($\lambda$) in radians to describe the locations of
points on the surface.
Both $\phi$ and $\lambda$ will take \texttt{f64} values.
The elevation of a point will be given by $h$,
which is the deviation from $\rho$.
$h$ is measured in meters and takes values in \texttt{i32}.
So, the distance from the center of the planet to the point ($\phi$, $\lambda$, $h$) is
$\rho + h$.
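
These definitions translate directly into code.
The sketch below uses the types given above for $\rho$, $\phi$, $\lambda$, and $h$.
The struct \texttt{Point} and both functions are hypothetical,
and the orientation of the Cartesian axes is an assumption made for illustration.

```rust
/// A point on or near the planet's surface.
struct Point {
    /// Latitude in radians.
    phi: f64,
    /// Longitude in radians.
    lambda: f64,
    /// Elevation in meters, as a deviation from the planet's radius.
    h: i32,
}

/// Distance in meters from the planet's center to the point: rho + h.
fn radial_distance(rho: u32, p: &Point) -> f64 {
    f64::from(rho) + f64::from(p.h)
}

/// Cartesian coordinates of the point.
/// ASSUMPTION: the z axis passes through the poles and the x axis passes
/// through the point where phi and lambda are both zero.
fn to_cartesian(rho: u32, p: &Point) -> [f64; 3] {
    let r = radial_distance(rho, p);
    [
        r * p.phi.cos() * p.lambda.cos(),
        r * p.phi.cos() * p.lambda.sin(),
        r * p.phi.sin(),
    ]
}
```

Converting to \texttt{f64} before adding avoids overflow when a negative $h$ is combined with the
unsigned radius.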

% Directory organization. Quadtrees.
The data describing how to render a planet consists of its terrain mesh, terrain textures, and
the objects on its surface.
This could represent a very large amount of data for a planet with realistic terrain populated by
many structures.
To facilitate sharding the information in a planet over many different servers,
the planet is broken into disjoint regions,
each of which is stored in its own directory.
A single top-level directory represents the entire planet,
and contains a manifest describing it.
This manifest specifies the planet's name, its radius, its rotational period,
the size of its regions in MB, as well as any
other global attributes.
This top-level directory also contains the texture for the sky box to render the view of
space from the planet.
In the future it may be interesting to explore the creation of more dynamic environments surrounding
the planet,
but a simple sky box has the advantage of being efficient.
The data in a planet is recursively broken into the fewest number of regions such that the
amount of data in each region is less than a configured threshold.
When a region grows too large it is broken into four new regions by cutting it along the
centerline parallel to the $\phi$ axis, and the one parallel to the $\lambda$ axis.
In other words, it is divided in half north to south and east to west.
The four new regions are stored in four subdirectories of the original region's directory
named 0, 1, 2, and 3.
The data in the old region is then moved into the appropriate directory based on its location.
Thus the directory tree of a planet essentially forms a quadtree,
albeit one which is built up progressively.
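
The mapping from a location to a region directory can be sketched as follows.
The text does not say which quadrant receives which of the names 0 through 3,
so the numbering here is an assumption,
as are the names \texttt{Region}, \texttt{child}, and \texttt{region\_path}.
The sketch also descends to a fixed depth,
whereas the real tree is only as deep as the amount of data in each area requires.

```rust
/// The bounds of a region in radians.
#[derive(Clone, Copy)]
struct Region {
    phi_min: f64,
    phi_max: f64,
    lambda_min: f64,
    lambda_max: f64,
}

impl Region {
    /// Returns the index (0 to 3) of the child containing the point, along
    /// with that child's bounds.
    /// ASSUMPTION: bit 0 of the index is set for the northern half and
    /// bit 1 is set for the eastern half.
    fn child(&self, phi: f64, lambda: f64) -> (u8, Region) {
        let phi_mid = (self.phi_min + self.phi_max) / 2.0;
        let lambda_mid = (self.lambda_min + self.lambda_max) / 2.0;
        let north = phi >= phi_mid;
        let east = lambda >= lambda_mid;
        let region = Region {
            phi_min: if north { phi_mid } else { self.phi_min },
            phi_max: if north { self.phi_max } else { phi_mid },
            lambda_min: if east { lambda_mid } else { self.lambda_min },
            lambda_max: if east { self.lambda_max } else { lambda_mid },
        };
        (u8::from(north) | (u8::from(east) << 1), region)
    }
}

/// Builds the relative directory path of the region at the given depth
/// which contains the point, for example "3/0".
fn region_path(root: Region, phi: f64, lambda: f64, depth: usize) -> String {
    let mut region = root;
    let mut parts = Vec::with_capacity(depth);
    for _ in 0..depth {
        let (index, child) = region.child(phi, lambda);
        parts.push(index.to_string());
        region = child;
    }
    parts.join("/")
}
```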

% Region data files.
In the leaf directories of this tree the actual data for a region are stored in two files,
one which describes the terrain and the other which describes objects.
It is expected that the terrain will rarely be modified,
but that the objects may change regularly.
The terrain file contains the mesh vertices in the region as well as its textures.
It is organized as an R-tree to allow for efficient spatial queries based on player location.
The region's objects file is also organized as an R-tree.
It contains all of the graphical data for the objects to be rendered in the region,
such as meshes, textures, and shaders.

% Plots.
The creation of a shared virtual world must involve players collaboratively building persistent
structures.
This is allowed in a controlled way by defining plot objects.
A plot is like a symbolic link,
it points to a file whose contents contain the data used to render the plot.
This mechanism allows the owner of the planet to delegate a specific area on the surface
to another player by creating a plot defining that area and pointing it to a file owned by the
player.
The other player can then write meshes, textures, and shaders into this file to describe the
contents of the plot.
If the other player wishes to collaborate with others on the construction,
they can grant write access on the file to a third party.
This is not unlike the ownership of land in the real world.

% LOD files in interior directories.
To facilitate the viewing of the planet from many distances,
each interior node in the planet's directory tree contains a reduced level of detail (LOD) version
of the terrain contained in it.
For example, the top-level directory contains the lowest LOD mesh and textures for the terrain.
This LOD would be suitable for rendering the planet as a globe on a shelf,
or as it would appear from a high orbit.
By traversing the directory tree,
the LOD can be increased as the player travels closer to the surface.
This system assists with rendering an animation where the player appears to approach and land upon
the planet's surface.

% Sharding planet data.
By dividing the planet's data into different leaf directories,
it becomes possible to provision computers running the sector service in each of them.
This divides the storage and bandwidth requirements for serving the planet over this set of
servers.
In addition to serving these data,
another service is needed to keep track of player positions and execute game logic.
Game clients address their messages using the directory of the region their player is located
in, and set \texttt{rootward} to true.
These messages are delivered to the closest game server to the region the player is in,
which may be located in the region's directory or higher up the tree.
When a player transitions from one region to the next,
its game client begins addressing messages using the path of the next region as the scope.
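
The effect of rootward delivery on these messages can be illustrated with a sketch.
It models only the outcome described above,
namely that a message reaches a server in the region's directory or, failing that,
the nearest ancestor directory which has one.
The function \texttt{resolve\_rootward}, the representation of directories as strings,
and the example paths are all assumptions and do not reflect how Blocktree resolves scopes
internally.

```rust
use std::collections::BTreeSet;

/// Finds the directory of the closest game server for a message scoped to
/// `scope` with rootward delivery: the scope itself if a server is
/// registered there, otherwise the nearest ancestor which has one.
fn resolve_rootward<'a>(servers: &BTreeSet<String>, scope: &'a str) -> Option<&'a str> {
    let mut dir = scope;
    loop {
        if servers.contains(dir) {
            return Some(dir);
        }
        // Step to the parent directory, giving up after checking the root.
        match dir.rfind('/') {
            Some(0) if dir.len() > 1 => dir = "/",
            Some(i) if i > 0 => dir = &dir[..i],
            _ => return None,
        }
    }
}
```

With servers registered in \texttt{/planet} and \texttt{/planet/3},
a player in region \texttt{/planet/3/0} is served from \texttt{/planet/3},
while a player in \texttt{/planet/1/2} falls back to the server in \texttt{/planet}.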


\section{Conclusion}
% Blocktree serves as the basis for building distributed Unix.
There have been many attempts to create a distributed Unix over the years.
Time has shown that this is a very hard problem,
but time has not diminished its importance.
IT systems are more complex now than ever,
with many layers of abstraction which have built up over time.
We have suffered greatly from systems which were never designed to be secure on the hostile internet
that exists today.
Security has been bolted onto these systems (HTTPS, STARTTLS, DNSSEC)
in a backwards compatible way,
which results in weakened protections for these systems.
What's worse,
the entire trust model of the web relies on the ludicrous idea that there is a distinguished group
of certificate authorities who have the power to secure our communications.
We need to take a different approach.
Data should be certified by its path,
it must always be transported between processes in an authenticated manner,
and user code should never have to care how this is accomplished!
Time will tell whether the programming model of Blocktree is comprehensible and useful for
developers,
but the goal is to create the kind of easy to extend computing environment which allowed Unix to
be successful.

% The system enables individuals to self-host the services they rely on.
These days, the typical internet user stores all of their important data in the cloud with
third-party service providers.
They do this because of the convenience of being able to access this information from anywhere,
and because of the perceived safety in having a large internet company look after it for them.
This convenience comes at the price of putting users at the mercy of these companies.
Take email for example,
a service which is universally used for account recovery and password reset.
If a service provider decided to stop providing a user access to their email,
the user would be effectively cut off from any website which sends login verification emails.
This is not a hypothetical situation,
such an incident has occurred (TODO: INSERT CITATION FROM LVL1).
There is no technical reason for things to be this way.
Blocktree allows users to host their own services in their own domain.
If we can make setting up an email or VOIP server as simple as clicking a button in a web GUI,
there will be no convenience advantage to cloud services.
One challenge for self-hosting data is ensuring it is protected from loss when hardware inevitably
fails.
The data redundancy in Blocktree's sector layer ensures that the loss of any one storage
device will not cause data loss.
Streaming replication can also be used to maintain additional redundant copies.
If more users begin hosting their own services,
the internet will become more distributed,
which will make it more resistant to disruption and centralized control.

% Benefits to businesses.
Cloud computing has also driven changes to the way businesses acquire computing resources.
It is common for startups to rent all of their computing resources from one large cloud
provider.
There are compelling economic and technical reasons to do this,
but as firms grow they often experience growing pains as their cloud bills also grow.
If a firm has not developed its software with a multi-cloud or hybrid approach in mind,
it may face the prospect of major changes in order to bring its application on-prem or to a
rival cloud.
By developing their application on Blocktree,
businesses have a single platform to target which can run on rented computers in the cloud just as
easily as on servers in their own data center.
This ensures the choice to rent or buy can be made on a purely economic basis.
Blocktree is not the only system that provides this flexibility.
The portability of containers is one of the reasons they have become so popular.
Containers have their place and will most likely be used for years to come,
but they are a lower-level abstraction which requires the developer to solve the problems that
Blocktree handles.

% Blocktree advances the status quo in secure computing.
Ransomware attacks and data breaches are embarrassingly common these days.
There are many reasons for this,
from the reliance on passwords for authentication, to the complexity of the software supply chain,
but it is clear that as IT professionals we need to do more to keep the systems under our
protection safe.
Blocktree helps to do this by solving many of the difficult problems involved with securing
communication on a hostile network.
It takes a true zero-trust approach,
ensuring that all communication between processes is authenticated using public key cryptography.
Data at rest is also secured with encryption and integrity protection.
No security system can prevent all attacks,
but by putting these mechanisms together in an easy-to-use platform,
we can advance the status quo of secure computing.

% Composability leads to emergent benefits.
When Unix was first developed in the 1970s, its authors could not have foreseen the applications
that would be enabled by their system.
Although there have been many different kinds of Unices over the years,
the core programming model, built around the filesystem, has remained since the beginning.
It is a testament to the importance of this abstraction that it has persisted for so long.
No designer can foresee all the ways that their abstractions will be used,
but they can try to build them in such a way that as much choice is left to the user as possible.
By making the actor model and message passing the core of Blocktree,
it is hoped that low overhead communication between distributed components can be achieved.
By using this system to provide a global distributed filesystem,
it is hoped that the interoperable sharing of data can be achieved.
And by using protocol contracts to constrain actor communication,
it is hoped that the structure and safety of type theory can bring order to distributed
computation.
While it is possible to see some of the applications that can be built from these abstractions,
it seems likely that their composability and the creativity of developers will enable systems that
cannot be foreseen.

\end{document}