2 years ago · c29019c935
--- a/doc/BlocktreeCloudPaper/BlocktreeCloudPaper.tex
+++ b/doc/BlocktreeCloudPaper/BlocktreeCloudPaper.tex
@@ -33,7 +33,7 @@ Blocktree is an attempt to extend the Unix philosophy that everything is a file
 
															 to the entire distributed system that comprises modern IT infrastructure.

														
 
															 The system is organized around a global distributed filesystem which defines security

														
 
															 principals, resources, and their authorization attributes.

														
 
															-This filesystem provides a language for access control that can be used to securely grant principals

														
 
															+This filesystem provides a language for access control that can be used to securely grant

														
 
															 access to resources from different organizations, without the need to set up federation.

														
 
															 The system provides an actor runtime for orchestrating services.

														
 
															 Resources are represented by actors, and actors are grouped into operating system processes.

														
@@ -53,25 +53,25 @@ As is the case for all principals,
 
															 a root principal is authenticated by a public-private key pair,

														
 
															 and is identified by a hash of its public key.

														
 
															 The domain of authority for a given absolute path is determined by its first component,

														
 
															-which is the identifier of the root principal who controls the domain.

														
 
															+which is the identifier of the root principal that controls the domain.

														
 
															 Because there is no meaning to the directory "/",

														
 
															 a directory consisting of only a single component equal to a root principal's identifier is

														
 
															-referred to as the root directory of that root principal.

														
 
															+referred to as the root principal's root directory.

														
 
															 The root principal delegates its authority to write files to subordinate principals by issuing

														
 
															 them certificates which specify the path that the authority of the subordinate is limited to.

														
 
															 File data is signed for authenticity and a certificate chain is contained in its metadata.

														
 
															 This certificate chain must lead back to the root principal

														
 
															-and consist of certificates with correctly scoped authority in order for the file to be authentic.

														
 
															+and consist of certificates with correctly scoped authority in order for the file to be validated.

														
 
															 Given the path of a file and the file's contents,

														
 
															-this system allows the file to be validated by anyone without the need to trust a third-party.

														
 
															-Blocktree paths are referred to as self-certifying for this reason.

														
 
															+this allows the file to be validated by anyone without the need to trust a third party.

														
 
															+Blocktree paths are called self-certifying for this reason.
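The scoping rule described above can be sketched in a few lines. This is a minimal illustration, not Blocktree's actual API: the function names are hypothetical, paths are treated as plain strings, and signature verification is omitted entirely. It only shows how the first path component names the root principal and how a chain of certificate scopes can be checked against a file path.

```rust
/// Returns the first component of an absolute path, which identifies the
/// root principal controlling the domain. The bare path "/" has no meaning.
pub fn root_principal_id(path: &str) -> Option<&str> {
    let rest = path.strip_prefix('/')?;
    let first = rest.split('/').next()?;
    if first.is_empty() { None } else { Some(first) }
}

/// Returns true if `scope` (the path in a certificate) covers `path`.
/// The comparison is component-wise, so "/a/home" does not cover "/a/homework".
pub fn scope_covers(scope: &str, path: &str) -> bool {
    if root_principal_id(scope).is_none() {
        return false;
    }
    let mut scope_parts = scope.split('/').filter(|c| !c.is_empty());
    let mut path_parts = path.split('/').filter(|c| !c.is_empty());
    scope_parts.all(|s| path_parts.next() == Some(s))
}

/// Checks only the scoping part of chain validation (an assumption of this
/// sketch): each certificate's scope must lie within its issuer's scope, and
/// the last scope must cover the file. `scopes` is ordered from the
/// certificate issued by the root principal down to the signer's certificate.
pub fn chain_scopes_valid(scopes: &[&str], file_path: &str) -> bool {
    match scopes.last() {
        None => false,
        Some(last) => {
            scopes.windows(2).all(|w| scope_covers(w[0], w[1]))
                && scope_covers(*last, file_path)
        }
    }
}
```

A real validator would also verify each signature in the chain and confirm that the chain terminates at the root principal whose identifier is the path's first component.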

														
 
															 % Persistent state provided by the filesystem.

														
 
															 One of the major challenges in distributed systems is managing persistent state.

														
 
															-Blocktree solves this issue using its distributed filesystem.

														
 
															+Blocktree solves this issue with its distributed filesystem.

														
 
															 Files are broken into segments called sectors.

														
 
															 The sector size of a file can be configured when it is created,

														
 
															-but cannot be changed after the fact.

														
 
															+but cannot be changed later.

														
 
															 Reads and writes of individual sectors are guaranteed to be atomic.

														
 
															 The sectors which comprise a file and its metadata are replicated by a set of processes running

														
 
															 the sector service.

														
@@ -79,12 +79,12 @@ This service is responsible for storing the sectors of files which are contained
 
															 containing the process in which it is running.

														
 
															 The actors providing the sector service in a given directory coordinate with one another using

														
 
															 the Raft protocol to synchronize the state of the sectors they store.

														
 
															-This method of partitioning the data in the filesystem based on directory

														
 
															-allows the system to scale beyond the capabilities of a single consensus cluster.

														
 
															-Sectors are secured with strong integrity protection,

														
 
															-which allows anyone to verify that their contents were written by an authorized principal.

														
 
															+By partitioning the data in the filesystem based on directory,

														
 
															+the system can scale beyond the capabilities of a single consensus cluster.

														
 
															+Sectors can be integrity protected and verified without reading the entire file,

														
 
															+because each file has a Merkle tree of sector hashes associated with it.

														
 
															 Encryption can be optionally applied to sectors,

														
 
															-with the system handling key management.

														
 
															+and when it is, key management is handled by the system.

														
 
															 The cryptographic mechanisms used to implement these protections are described in section 3.
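To show why a Merkle tree lets a single sector be verified without reading the whole file, here is a small sketch. It is not Blocktree's implementation: the function names are hypothetical, and `DefaultHasher` is used only as a stand-in because the Rust standard library has no cryptographic hash. A real system would use a cryptographic hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in sector hash (NOT cryptographically secure; illustration only).
pub fn hash_sector(data: &[u8]) -> u64 {
    let mut s = DefaultHasher::new();
    data.hash(&mut s);
    s.finish()
}

fn hash_pair(left: u64, right: u64) -> u64 {
    let mut s = DefaultHasher::new();
    (left, right).hash(&mut s);
    s.finish()
}

/// Builds the next level up; an unpaired last node is paired with itself.
fn next_level(level: &[u64]) -> Vec<u64> {
    level.chunks(2).map(|p| hash_pair(p[0], *p.last().unwrap())).collect()
}

/// Root of the tree over the sector hashes, or None for an empty file.
pub fn merkle_root(leaves: &[u64]) -> Option<u64> {
    let mut level = leaves.to_vec();
    while level.len() > 1 {
        level = next_level(&level);
    }
    level.first().copied()
}

/// Sibling hashes needed to recompute the root from the leaf at `index`.
pub fn merkle_proof(leaves: &[u64], index: usize) -> Vec<u64> {
    let (mut level, mut i, mut proof) = (leaves.to_vec(), index, Vec::new());
    while level.len() > 1 {
        let sibling = if i % 2 == 0 { (i + 1).min(level.len() - 1) } else { i - 1 };
        proof.push(level[sibling]);
        level = next_level(&level);
        i /= 2;
    }
    proof
}

/// Verifies one sector hash against the root using only its proof.
pub fn verify(leaf: u64, index: usize, proof: &[u64], root: u64) -> bool {
    let mut acc = leaf;
    let mut i = index;
    for &sibling in proof {
        acc = if i % 2 == 0 { hash_pair(acc, sibling) } else { hash_pair(sibling, acc) };
        i /= 2;
    }
    acc == root
}
```

The proof for one sector contains one hash per tree level, so verification cost grows with the logarithm of the number of sectors rather than with the file size.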

														
 
															 % Protocol contracts.

														
@@ -106,8 +106,9 @@ communication protocol.
 
															 % Implementation language and project links.

														
 
															 Blocktree is implemented in the Rust programming language.

														
 
															-It currently only supports running on Linux,

														
 
															-though porting it to other Unix-like operating systems should be straight-forward.

														
 
															+It is currently only tested on Linux.

														
 
															+Running it on other Unix-like operating systems should be straightforward,

														
 
															+though FUSE support is required to mount the filesystem.

														
 
															 Its source code is licensed under the GNU Affero General Public License Version 3.

														
 
															 It can be downloaded at the project homepage at \url{https://blocktree.systems}.

														
 
															 Anyone interested in contributing to development is welcome to submit a pull request

														
@@ -146,7 +147,8 @@ as this will ensure low impedance with the underlying networking technology.
 
															 % Overview of message passing interface.

														
 
															 That is why Blocktree is built on the actor model

														
 
															 and why its actor runtime is at the core of its architecture.

														
 
															-The runtime can be used to spawn actors, register services, and dispatch messages.

														
 
															+The runtime can be used to spawn actors, register services, dispatch messages immediately,

														
 
															+and schedule messages to be delivered in the future.

														
 
															 Messages can be dispatched in two different ways: with \texttt{send} and \texttt{call}.

														
 
															 A message is dispatched with the \texttt{send} method when no reply is required,

														
 
															 and with \texttt{call} when exactly one is.

														
@@ -157,6 +159,30 @@ The name \texttt{call} was chosen to bring to mind a remote procedure call,
 
															 which is the primary use case this method was intended for.

														
 
															 Awaiting replies to messages serves as a simple way to synchronize a distributed computation.

														
 
															+% Scheduling messages for future delivery.

														
 
															+Executing actions at some point in the future or at regular intervals are common tasks in computer

														
 
															+systems.

														
 
															+Blocktree facilitates this by allowing messages to be scheduled for future delivery.

														
 
															+The schedule may specify a one-time delivery at a specific instant in time,

														
 
															+or a repeating delivery with a given period.

														
 
															+These scheduling modes can be combined so that you can specify an anchoring instant

														
 
															+and a period whose multiples will be added to this instant to calculate each delivery time.

														
 
															+For example, a message could be scheduled for delivery every morning at 3 AM.

														
 
															+Messages scheduled in a runtime are persisted in the runtime's file.

														
 
															+This ensures scheduled messages will be delivered even if the runtime is restarted.

														
 
															+If a message has been delivered

														
 
															+and the schedule is such that it will never be delivered again,

														
 
															+it is removed from the runtime's file.

														
 
															+If a message is scheduled for delivery at a single instant in time,

														
 
															+and that delivery is missed,

														
 
															+the message will be delivered as soon as possible.

														
 
															+But, if a message is periodic,

														
 
															+any messages which were missed due to a runtime not being active will never be sent.

														
 
															+This is because the runtime only persists the message's schedule,

														
 
															+not every delivery.

														
 
															+This mechanism is intended for periodic tasks or delaying work to a later time.

														
 
															+It is not for building hard realtime systems.
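The delivery rules above (an anchoring instant plus multiples of a period, late delivery of a missed one-time message, and skipping of missed periodic deliveries) can be sketched as follows. The `Schedule` type and `next_delivery` method are hypothetical names chosen for this illustration and are not taken from the Blocktree runtime.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical schedule type; not Blocktree's actual API.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Schedule {
    /// Deliver once at `at`.
    Once { at: SystemTime },
    /// Deliver at `anchor + k * period` for k = 0, 1, 2, ...
    Periodic { anchor: SystemTime, period: Duration },
}

impl Schedule {
    /// Returns the next delivery time as seen at `now`, or `None` if the
    /// message will never be delivered again and can be removed.
    pub fn next_delivery(&self, now: SystemTime, already_delivered: bool) -> Option<SystemTime> {
        match *self {
            // A missed one-time delivery is made as soon as possible (at `now`).
            Schedule::Once { at } => {
                if already_delivered { None } else { Some(at.max(now)) }
            }
            // Missed periodic deliveries are skipped: only the schedule is
            // persisted, so the next multiple at or after `now` is chosen.
            Schedule::Periodic { anchor, period } => {
                let period_ns = period.as_nanos();
                if period_ns == 0 {
                    return None; // a zero period is treated as invalid here
                }
                let elapsed_ns = match now.duration_since(anchor) {
                    Ok(elapsed) => elapsed.as_nanos(),
                    Err(_) => return Some(anchor), // anchor is still in the future
                };
                // Round up to the next whole multiple of the period.
                let k = (elapsed_ns + period_ns - 1) / period_ns;
                let offset_ns = k * period_ns;
                let offset = Duration::new(
                    (offset_ns / 1_000_000_000) as u64,
                    (offset_ns % 1_000_000_000) as u32,
                );
                Some(anchor + offset)
            }
        }
    }
}
```

For the "every morning at 3 AM" example, the anchor would be any instant that falls at 3 AM and the period would be 24 hours. This sketch ignores daylight saving time and other calendar effects.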

														
 
															+

														
 
															 % Description of virtual actor system.

														
 
															 One of the challenges when building actor systems is supervising and managing actors' lifecycles.

														
 
															 This is handled in Erlang through the use of supervision trees,

														
@@ -369,7 +395,7 @@ and thus maintaining the persistent state of the system.
 
															 It stores sector data in the local filesystem of each computer on which it is registered.

														
 
															 The details of how this is accomplished are deferred to the next section.

														
 
															-% Runtime network discovery.

														
 
															+% Runtime queries.

														
 
															 While it is possible to resolve runtime paths to IP addresses when the filesystem is available,

														
 
															 a different mechanism is needed to allow the filesystem and sector services to discover service

														
 
															 providers.

														
@@ -391,6 +417,8 @@ a query can be issued to learn of more runtimes.
 
															 A runtime which receives a query may not be able to answer it directly.

														
 
															 If it cannot,

														
 
															 it returns the IP address of the next runtime to which the query should be sent.

														
 
															+

														
 
															+% Bootstrap discovery methods.

														
 
															 In order to bootstrap the discovery processes,

														
 
															 another mechanism is needed to find the first peer to query.

														
 
															 There were several possibilities explored for doing this.

														
@@ -400,15 +428,15 @@ As long as these runtimes could be located,
 
															 then all others could be found using the filesystem.

														
 
															 This idea may be worth revisiting in the future,

														
 
															 but the author wanted to avoid the complexity of implementing a new proof of work blockchain.

														
 
															-Another idea was to use multicast link-local addressing to discover other runtimes,

														
 
															-similar to how mDNS operates.

														
 
															-This approach has several advantages.

														
 
															-It avoids any dependency on centralized internet infrastructure

														
 
															-and keeps network load local to the segment on which the runtimes are connected.

														
 
															-But, it will not work over a wide area network,

														
 
															-making it unsuitable for the general case.

														
 
															-Instead, the design which was decided on was to use DNS to resolve a fully qualified domain name

														
 
															-(FQDN) derived from the root principal's identifier.

														
 
															+Instead, two independent mechanisms are used,

														
 
															+one that can discover runtimes over the internet as long as their path is known,

														
 
															+and another that can discover runtimes on the local network even when the discoverer does not know

														
 
															+their paths.

														
 
															+

														
 
															+% Searching DNS for root principals.

														
 
															+When the path to a runtime is known,

														
 
															+DNS is used to resolve a fully qualified domain name

														
 
															+(FQDN) derived from the root principal's identifier in this path.

														
 
															 This FQDN is expected to resolve to the public IP addresses of the runtimes hosting the

														
 
															 sector service in the root directory of the root principal.

														
 
															 Each process is configured with a search domain which is used as a suffix of the FQDN.

														
@@ -429,6 +457,36 @@ Of course it is also possible to build such a system without hosting DNS inside
 
															 The downside of using DNS is that it couples Blocktree with a centralized,

														
 
															 albeit distributed, system.

														
 
															+% Using link-local multicast datagrams to find runtimes.

														
 
															+Because this mechanism requires knowledge of the root principal of a domain to perform discovery,

														
 
															+it will not work if a runtime is first starting up with no credentials and so does not know its

														
 
															+own root principal.

														
 
															+This runtime needs a way to discover other runtimes so it can connect to the filesystem and sector

														
 
															+services.

														
 
															+This issue is solved by using link-local multicast addressing to discover the runtimes on the same

														
 
															+network as the discoverer.

														
 
															+When a \texttt{bttp} server starts listening for unicast traffic,

														
 
															+it also listens for UDP datagrams on port 50142 at addresses 224.0.0.142 and FF02::142,

														
 
															+if the IPv4 or IPv6 networking stack is available, respectively.

														
 
															+If the host is attached to a dual-stack network,

														
 
															+the server listens on both addresses.

														
 
															+When a runtime is attempting to discover other runtimes,

														
 
															+it sends out datagrams to these IP addresses.

														
 
															+Each \texttt{bttp} server replies with its unicast address and filesystem path

														
 
															+(as specified in its credentials).

														
 
															+If the server is available at both IPv4 and IPv6 unicast addresses,

														
 
															+it is at the server's discretion which address to respond with;

														
 
															+it may even respond with an IPv4 address to an IPv4 datagram,

														
 
															+and an IPv6 address to an IPv6 datagram.

														
 
															+Once a client has discovered the \texttt{bttp} servers on its network,

														
 
															+it can route messages to them,

														
 
															+such as the provisioning requests which are used to obtain new credentials.

														
 
															+Provisioning is described in the Cryptography section.

														
 
															+Note that port 50142 is in the dynamic range, as specified by RFC 6335,

														
 
															+so it does not need to be registered with the Internet Assigned Numbers Authority (IANA).

														
 
															+Both addresses 224.0.0.142 and FF02::142 are currently unassigned,

														
 
															+but they will need to be registered with IANA if Blocktree is widely adopted.
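The discovery endpoints described above can be expressed with the Rust standard library as shown below. This is a sketch under stated assumptions: the constant and function names are hypothetical, the IPv6 group is written in the ff02::/16 link-local multicast range, and the wire format of the discovery datagrams is not shown because it is not described here.

```rust
use std::io;
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr, UdpSocket};

/// UDP port used for discovery datagrams.
pub const DISCOVERY_PORT: u16 = 50142;
/// IPv4 link-local multicast group.
pub const DISCOVERY_GROUP_V4: Ipv4Addr = Ipv4Addr::new(224, 0, 0, 142);
/// IPv6 link-local multicast group (assumed to be in ff02::/16).
pub const DISCOVERY_GROUP_V6: Ipv6Addr = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x142);

/// RFC 6335 defines the dynamic (private) port range as 49152 through 65535.
pub fn in_dynamic_port_range(port: u16) -> bool {
    port >= 49152
}

/// Addresses a discovering runtime would send datagrams to, depending on
/// which networking stacks are available on the host.
pub fn discovery_targets(has_v4: bool, has_v6: bool) -> Vec<SocketAddr> {
    let mut targets = Vec::new();
    if has_v4 {
        targets.push(SocketAddr::new(IpAddr::V4(DISCOVERY_GROUP_V4), DISCOVERY_PORT));
    }
    if has_v6 {
        targets.push(SocketAddr::new(IpAddr::V6(DISCOVERY_GROUP_V6), DISCOVERY_PORT));
    }
    targets
}

/// Server side for IPv4: bind the discovery port and join the group on the
/// default interface. The IPv6 case would use `join_multicast_v6` instead.
pub fn listen_v4() -> io::Result<UdpSocket> {
    let socket = UdpSocket::bind((Ipv4Addr::UNSPECIFIED, DISCOVERY_PORT))?;
    socket.join_multicast_v4(&DISCOVERY_GROUP_V4, &Ipv4Addr::UNSPECIFIED)?;
    Ok(socket)
}
```

On a dual-stack host, `discovery_targets(true, true)` yields both group addresses, matching the behavior where the server listens on both.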

														
 
															+

														
 
															 % Security model for queries.

														
 
															 To allow runtimes which are not permitted to execute the root directory to query for other runtimes,

														
 
															 authorization logic which is specific to queries is needed.

														
@@ -734,6 +792,18 @@ When a file actor receives notification that its owner returned,
 
															 it flushes any buffered data in its cache and returns,

														
 
															 ensuring that a resource leak does not occur.

														
 
															+% Encrypted metadata. Extended attributes in metadata. Cache control.

														
 
															+Some of the information stored in metadata needs to be kept in plaintext to allow the sector

														
 
															+service to verify and decrypt the file

														
 
															+but most of it is encrypted using the same key as the file's contents.

														
 
															+The file's authorization attributes, its size, and its access times are all encrypted.

														
 
															+The table storing the file's extended attributes (EAs) is also encrypted.

														
 
															+Cache control information is included in this area as well.

														
 
															+It specifies the number of seconds, as a u32, that a file may be cached.

														
 
															+The filesystem service uses this information to evict sectors from its cache when they have been

														
 
															+cached for longer than this threshold,

														
 
															+causing them to be reloaded from the sector service.
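A minimal sketch of the eviction rule follows, assuming a cache keyed by sector index. The `CachedSector` type and both function names are hypothetical; only the `u32` number of seconds comes from the text above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache entry: the sector bytes and when they were loaded.
pub struct CachedSector {
    pub loaded_at: Instant,
    pub data: Vec<u8>,
}

/// A sector is stale once it has been cached for longer than the threshold.
pub fn is_stale(loaded_at: Instant, max_age_secs: u32, now: Instant) -> bool {
    now.saturating_duration_since(loaded_at) > Duration::from_secs(u64::from(max_age_secs))
}

/// Evicts stale sectors and returns how many were removed. Evicted sectors
/// would be reloaded from the sector service on their next access.
pub fn evict_stale(
    cache: &mut HashMap<u64, CachedSector>,
    max_age_secs: u32,
    now: Instant,
) -> usize {
    let before = cache.len();
    cache.retain(|_, sector| !is_stale(sector.loaded_at, max_age_secs, now));
    before - cache.len()
}
```

Passing `now` as a parameter rather than calling `Instant::now()` inside keeps the rule easy to test.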

														
 
															+

														
 
															 % Authorization logic of the filesystem service.

														
 
															 The filesystem service uses an \texttt{Authorizer} type to make authorization decisions.

														
 
															 It passes this type the authorization attributes of the principal accessing the file, the

														
@@ -1004,8 +1074,35 @@ The first runtime is configured to host the sector and filesystem services,
 
															 so that subsequent runtimes will have access to the filesystem.

														
 
															 After that, additional runtimes on the same LAN can be provisioned using the automatic process.

														
 
															+% Setting up user based access control.

														
 
															+Up until now the focus has been on authentication and authorization of processes,

														
 
															+but it bears discussing how user-based access control can be accomplished with Blocktree.

														
 
															+Because credentials are locked to the device on which they're created,

														
 
															+a user will have at least as many principals as they have devices.

														
 
															+But, all of these principals can be configured to have the same authorization attributes (UID, GID),

														
 
															+giving them the same permissions.

														
 
															+It makes sense to keep the files for all of the provisioned runtimes associated with a user in one

														
 
															+place

														
 
															+and the natural place is in the user's home directory.

														
 
															+Although every one of the user's processes needs to be provisioned,

														
 
															+this is not a huge limitation because a single runtime can host many different actors,

														
 
															+implementing many different applications.

														
 
															+Managing the users in a domain is facilitated by putting their home directories in a single user

														
 
															+directory for the domain.

														
 
															+Runtimes hosting the sector service on storage servers could then be provisioned in this directory

														
 
															+to provide the sector and filesystem services for the users' home directories.

														
 
															+It would be at the administrator's discretion whether or not to enable client-side encryption.

														
 
															+If they wanted to,

														
 
															+the principal of at least one of a user's runtimes would need to be issued a readcap for the

														
 
															+user's home directory.

														
 
															+This runtime could then directly access the sector service in the domain's user directory.

														
 
															+By moving encryption onto the user's computer,

														
 
															+load can be shed from the storage servers.

														
 
															+Note that this setup does require all of the user's runtimes to be able to communicate with the

														
 
															+runtime whose principal was issued the readcap.

														
 
															+

														
 
															 % Example of how these mechanisms allow data to be shared.

														
 
															-To illustrate how these mechanisms can be used,

														
 
															+To illustrate how these mechanisms can be used to facilitate collaboration between enterprises,

														
 
															 consider a situation where two companies wish to partner in the development of a product.

														
 
															 To facilitate their collaboration,

														
 
															 they wish to have a way to securely exchange data with each other.

														
@@ -1302,7 +1399,7 @@ $h$ is measured in meters and takes values in \texttt{i32}.
 
															 So, the distance from the center of the planet to the point ($\phi$, $\lambda$, $h$) is

														
 
															 $\rho + h$.

														
 
															-% Directory organization. Quad-trees.

														
 
															+% Directory organization. Quadtrees.

														
 
															 The data describing how to render a planet consists of its terrain mesh, terrain textures, and

														
 
															 the objects on its surface.

														
 
															 This could represent a very large amount of data for a planet with realistic terrain populated by

														
@@ -1328,7 +1425,7 @@ In other words, it is divided in half north to south and east to west.
 
															 The four new regions are stored in four subdirectories of the original region's directory

														
 
															 named 0, 1, 2, and 3.

														
 
															 The data in the old region is then moved into the appropriate directory based on its location.

														
 
															-Thus the directory tree of a planet essentially forms a quad-tree,

														
 
															+Thus the directory tree of a planet essentially forms a quadtree,

														
 
															 albeit one which is built up progressively.
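The mapping from a surface location to a region directory can be sketched as below. Several details are assumptions of this sketch rather than facts from the paper: the function name is hypothetical, coordinates are taken as `f64` degrees, and the numbering of the four subdirectories (bit 0 for the east half, bit 1 for the north half) is chosen arbitrarily. A fixed `depth` is used here, whereas the real tree is subdivided progressively.

```rust
/// Returns the relative directory path (for example "2/0/1") of the region
/// containing the given point after `depth` subdivisions of the whole planet.
/// Quadrant numbering is hypothetical: bit 0 = east half, bit 1 = north half.
pub fn region_path(lat_deg: f64, lon_deg: f64, depth: usize) -> String {
    let (mut south, mut north) = (-90.0_f64, 90.0_f64);
    let (mut west, mut east) = (-180.0_f64, 180.0_f64);
    let mut components = Vec::with_capacity(depth);
    for _ in 0..depth {
        // Divide the current region in half north to south and east to west.
        let mid_lat = (south + north) / 2.0;
        let mid_lon = (west + east) / 2.0;
        let mut quadrant = 0u8;
        if lon_deg >= mid_lon {
            quadrant |= 1;
            west = mid_lon;
        } else {
            east = mid_lon;
        }
        if lat_deg >= mid_lat {
            quadrant |= 2;
            south = mid_lat;
        } else {
            north = mid_lat;
        }
        components.push(quadrant.to_string());
    }
    components.join("/")
}
```

Each additional path component halves the region in both directions, so nearby points share long directory prefixes.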

														
 
															 % Region data files.