|
@@ -33,7 +33,7 @@ Blocktree is an attempt to extend the Unix philosophy that everything is a file
|
|
to the entire distributed system that comprises modern IT infrastructure.
|
|
to the entire distributed system that comprises modern IT infrastructure.
|
|
The system is organized around a global distributed filesystem which defines security
|
|
The system is organized around a global distributed filesystem which defines security
|
|
principals, resources, and their authorization attributes.
|
|
principals, resources, and their authorization attributes.
|
|
-This filesystem provides a language for access control that can be used to securely grant principals
|
|
|
|
|
|
+This filesystem provides a language for access control that can be used to securely grant
|
|
access to resources from different organizations, without the need to set up federation.
|
|
access to resources from different organizations, without the need to set up federation.
|
|
The system provides an actor runtime for orchestrating services.
|
|
The system provides an actor runtime for orchestrating services.
|
|
Resources are represented by actors, and actors are grouped into operating system processes.
|
|
Resources are represented by actors, and actors are grouped into operating system processes.
|
|
@@ -53,25 +53,25 @@ As is the case for all principals,
|
|
a root principal is authenticated by a public-private key pair,
|
|
a root principal is authenticated by a public-private key pair,
|
|
and is identified by a hash of its public key.
|
|
and is identified by a hash of its public key.
|
|
The domain of authority for a given absolute path is determined by its first component,
|
|
The domain of authority for a given absolute path is determined by its first component,
|
|
-which is the identifier of the root principal who controls the domain.
|
|
|
|
|
|
+which is the identifier of the root principal that controls the domain.
|
|
Because there is no meaning to the directory "/",
|
|
Because there is no meaning to the directory "/",
|
|
a directory consisting of only a single component equal to a root principal's identifier is
|
|
a directory consisting of only a single component equal to a root principal's identifier is
|
|
-referred to as the root directory of that root principal.
|
|
|
|
|
|
+referred to as the root principal's root directory.
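The rule that the first path component names the domain of authority can be sketched in a few lines of Rust. This is only an illustration: the function names \texttt{root\_principal\_id} and \texttt{is\_root\_directory} are hypothetical and are not taken from the Blocktree source, and paths are treated as plain strings.

```rust
/// Returns the first component of an absolute path, which identifies the
/// root principal controlling the domain. Hypothetical helper for
/// illustration only.
pub fn root_principal_id(path: &str) -> Option<&str> {
    // Only absolute paths have a domain of authority.
    let rest = path.strip_prefix('/')?;
    let first = rest.split('/').next()?;
    // The bare directory "/" has no meaning, so it yields no identifier.
    if first.is_empty() {
        None
    } else {
        Some(first)
    }
}

/// Returns true if the path consists of exactly one component, i.e. it is
/// the root directory of some root principal.
pub fn is_root_directory(path: &str) -> bool {
    match path.strip_prefix('/') {
        Some(rest) => {
            let trimmed = rest.trim_end_matches('/');
            !trimmed.is_empty() && !trimmed.contains('/')
        }
        None => false,
    }
}
```

Under these assumptions, \texttt{root\_principal\_id("/abc123/home/user")} yields \texttt{abc123}, while \texttt{"/"} yields nothing because that directory has no meaning.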
|
|
The root principal delegates its authority to write files to subordinate principals by issuing
|
|
The root principal delegates its authority to write files to subordinate principals by issuing
|
|
them certificates which specify the path that the authority of the subordinate is limited to.
|
|
them certificates which specify the path that the authority of the subordinate is limited to.
|
|
File data is signed for authenticity and a certificate chain is contained in its metadata.
|
|
File data is signed for authenticity and a certificate chain is contained in its metadata.
|
|
This certificate chain must lead back to the root principal
|
|
This certificate chain must lead back to the root principal
|
|
-and consist of certificates with correctly scoped authority in order for the file to be authentic.
|
|
|
|
|
|
+and consist of certificates with correctly scoped authority in order for the file to be validated.
|
|
Given the path of a file and the file's contents,
|
|
Given the path of a file and the file's contents,
|
|
-this system allows the file to be validated by anyone without the need to trust a third-party.
|
|
|
|
-Blocktree paths are referred to as self-certifying for this reason.
|
|
|
|
|
|
+this allows the file to be validated by anyone without the need to trust a third party.
|
|
|
|
+Blocktree paths are called self-certifying for this reason.
|
|
|
|
|
|
% Persistent state provided by the filesystem.
|
|
% Persistent state provided by the filesystem.
|
|
One of the major challenges in distributed systems is managing persistent state.
|
|
One of the major challenges in distributed systems is managing persistent state.
|
|
-Blocktree solves this issue using its distributed filesystem.
|
|
|
|
|
|
+Blocktree solves this issue with its distributed filesystem.
|
|
Files are broken into segments called sectors.
|
|
Files are broken into segments called sectors.
|
|
The sector size of a file can be configured when it is created,
|
|
The sector size of a file can be configured when it is created,
|
|
-but cannot be changed after the fact.
|
|
|
|
|
|
+but cannot be changed later.
|
|
Reads and writes of individual sectors are guaranteed to be atomic.
|
|
Reads and writes of individual sectors are guaranteed to be atomic.
|
|
The sectors which comprise a file and its metadata are replicated by a set of processes running
|
|
The sectors which comprise a file and its metadata are replicated by a set of processes running
|
|
the sector service.
|
|
the sector service.
|
|
@@ -79,12 +79,12 @@ This service is responsible for storing the sectors of files which are contained
|
|
containing the process in which it is running.
|
|
containing the process in which it is running.
|
|
The actors providing the sector service in a given directory coordinate with one another using
|
|
The actors providing the sector service in a given directory coordinate with one another using
|
|
the Raft protocol to synchronize the state of the sectors they store.
|
|
the Raft protocol to synchronize the state of the sectors they store.
|
|
-This method of partitioning the data in the filesystem based on directory
|
|
|
|
-allows the system to scale beyond the capabilities of a single consensus cluster.
|
|
|
|
-Sectors are secured with strong integrity protection,
|
|
|
|
-which allows anyone to verify that their contents were written by an authorized principal.
|
|
|
|
|
|
+By partitioning the data in the filesystem based on directory,
|
|
|
|
+the system can scale beyond the capabilities of a single consensus cluster.
|
|
|
|
+Sectors can be integrity protected and verified without reading the entire file,
|
|
|
|
+because each file has a Merkle tree of sector hashes associated with it.
|
|
Encryption can be optionally applied to sectors,
|
|
Encryption can be optionally applied to sectors,
|
|
-with the system handling key management.
|
|
|
|
|
|
+and when it is, key management is handled by the system.
|
|
The cryptographic mechanisms used to implement these protections are described in section 3.
|
|
The cryptographic mechanisms used to implement these protections are described in section 3.
|
|
|
|
|
|
% Protocol contracts.
|
|
% Protocol contracts.
|
|
@@ -106,8 +106,9 @@ communication protocol.
|
|
|
|
|
|
% Implementation language and project links.
|
|
% Implementation language and project links.
|
|
Blocktree is implemented in the Rust programming language.
|
|
Blocktree is implemented in the Rust programming language.
|
|
-It currently only supports running on Linux,
|
|
|
|
-though porting it to other Unix-like operating systems should be straight-forward.
|
|
|
|
|
|
+It is currently only tested on Linux.
|
|
|
|
+Running it on other Unix-like operating systems should be straightforward,
|
|
|
|
+though FUSE support is required to mount the filesystem.
|
|
Its source code is licensed under the GNU Affero General Public License Version 3.
|
|
Its source code is licensed under the GNU Affero General Public License Version 3.
|
|
It can be downloaded at the project homepage at \url{https://blocktree.systems}.
|
|
It can be downloaded at the project homepage at \url{https://blocktree.systems}.
|
|
Anyone interested in contributing to development is welcome to submit a pull request
|
|
Anyone interested in contributing to development is welcome to submit a pull request
|
|
@@ -146,7 +147,8 @@ as this will ensure low impedance with the underlying networking technology.
|
|
% Overview of message passing interface.
|
|
% Overview of message passing interface.
|
|
That is why Blocktree is built on the actor model
|
|
That is why Blocktree is built on the actor model
|
|
and why its actor runtime is at the core of its architecture.
|
|
and why its actor runtime is at the core of its architecture.
|
|
-The runtime can be used to spawn actors, register services, and dispatch messages.
|
|
|
|
|
|
+The runtime can be used to spawn actors, register services, dispatch messages immediately,
|
|
|
|
+and schedule messages to be delivered in the future.
|
|
Messages can be dispatched in two different ways: with \texttt{send} and \texttt{call}.
|
|
Messages can be dispatched in two different ways: with \texttt{send} and \texttt{call}.
|
|
A message is dispatched with the \texttt{send} method when no reply is required,
|
|
A message is dispatched with the \texttt{send} method when no reply is required,
|
|
and with \texttt{call} when exactly one is.
|
|
and with \texttt{call} when exactly one is.
|
|
@@ -157,6 +159,30 @@ The name \texttt{call} was chosen to bring to mind a remote procedure call,
|
|
which is the primary use case this method was intended for.
|
|
which is the primary use case this method was intended for.
|
|
Awaiting replies to messages serves as a simple way to synchronize a distributed computation.
|
|
Awaiting replies to messages serves as a simple way to synchronize a distributed computation.
|
|
|
|
|
|
|
|
+% Scheduling messages for future delivery.
|
|
|
|
+Executing actions at some point in the future or at regular intervals are common tasks in computer
|
|
|
|
+systems.
|
|
|
|
+Blocktree facilitates this by allowing messages to be scheduled for future delivery.
|
|
|
|
+The schedule may specify a one time delivery at a specific instant in time,
|
|
|
|
+or a repeating delivery with a given period.
|
|
|
|
+These scheduling modes can be combined so that you can specify an anchoring instant
|
|
|
|
+and a period whose multiples will be added to this instant to calculate each delivery time.
|
|
|
|
+For example, a message could be scheduled for delivery every morning at 3 AM.
|
|
|
|
+Messages scheduled in a runtime are persisted in the runtime's file.
|
|
|
|
+This ensures scheduled messages will be delivered even if the runtime is restarted.
|
|
|
|
+If a message has been delivered
|
|
|
|
+and the schedule is such that it will never be delivered again,
|
|
|
|
+it is removed from the runtime's file.
|
|
|
|
+If a message is scheduled for delivery at a single instant in time,
|
|
|
|
+and that delivery is missed,
|
|
|
|
+the message will be delivered as soon as possible.
|
|
|
|
+But, if a message is periodic,
|
|
|
|
+any messages which were missed due to a runtime not being active will never be sent.
|
|
|
|
+This is because the runtime only persists the message's schedule,
|
|
|
|
+not every delivery.
|
|
|
|
+This mechanism is intended for periodic tasks or delaying work to a later time.
|
|
|
|
+It is not for building hard realtime systems.
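The delivery rules described above can be illustrated with a small Rust sketch. The \texttt{Schedule} type and the \texttt{next\_delivery} function are hypothetical names chosen for this example, and times are plain seconds since an arbitrary epoch; the actual runtime may represent schedules differently.

```rust
/// Hypothetical schedule representation; not Blocktree's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Schedule {
    /// Deliver once at the instant `at`.
    Once { at: u64 },
    /// Deliver at `anchor + k * period` for k = 0, 1, 2, ...
    Periodic { anchor: u64, period: u64 },
}

/// Computes the next delivery time at or after `now`.
/// `delivered` records whether a one-time message was already delivered.
pub fn next_delivery(schedule: Schedule, now: u64, delivered: bool) -> Option<u64> {
    match schedule {
        Schedule::Once { at } => {
            if delivered {
                // Never delivered again, so the entry can be removed.
                None
            } else {
                // A missed one-time delivery happens as soon as possible.
                Some(at.max(now))
            }
        }
        Schedule::Periodic { anchor, period } => {
            if period == 0 {
                return None; // A zero period is treated as invalid here.
            }
            if now <= anchor {
                return Some(anchor);
            }
            // Missed periods are skipped: round up to the next multiple.
            let elapsed = now - anchor;
            let k = (elapsed + period - 1) / period;
            Some(anchor + k * period)
        }
    }
}
```

Because only the schedule is stored, a periodic message whose runtime was down for two days simply resumes at the next multiple of the period after the anchor, while a missed one-time message is delivered immediately.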
|
|
|
|
+
|
|
% Description of virtual actor system.
|
|
% Description of virtual actor system.
|
|
One of the challenges when building actor systems is supervising and managing actors' lifecycles.
|
|
One of the challenges when building actor systems is supervising and managing actors' lifecycles.
|
|
This is handled in Erlang through the use of supervision trees,
|
|
This is handled in Erlang through the use of supervision trees,
|
|
@@ -369,7 +395,7 @@ and thus maintaining the persistent state of the system.
|
|
It stores sector data in the local filesystem of each computer on which it is registered.
|
|
It stores sector data in the local filesystem of each computer on which it is registered.
|
|
The details of how this is accomplished are deferred to the next section.
|
|
The details of how this is accomplished are deferred to the next section.
|
|
|
|
|
|
-% Runtime network discovery.
|
|
|
|
|
|
+% Runtime queries.
|
|
While it is possible to resolve runtime paths to IP addresses when the filesystem is available,
|
|
While it is possible to resolve runtime paths to IP addresses when the filesystem is available,
|
|
a different mechanism is needed to allow the filesystem and sector services to discover service
|
|
a different mechanism is needed to allow the filesystem and sector services to discover service
|
|
providers.
|
|
providers.
|
|
@@ -391,6 +417,8 @@ a query can be issued to learn of more runtimes.
|
|
A runtime which receives a query may not be able to answer it directly.
|
|
A runtime which receives a query may not be able to answer it directly.
|
|
If it cannot,
|
|
If it cannot,
|
|
it returns the IP address of the next runtime to which the query should be sent.
|
|
it returns the IP address of the next runtime to which the query should be sent.
|
|
|
|
+
|
|
|
|
+% Bootstrap discovery methods.
|
|
In order to bootstrap the discovery processes,
|
|
In order to bootstrap the discovery processes,
|
|
another mechanism is needed to find the first peer to query.
|
|
another mechanism is needed to find the first peer to query.
|
|
There were several possibilities explored for doing this.
|
|
There were several possibilities explored for doing this.
|
|
@@ -400,15 +428,15 @@ As long as these runtimes could be located,
|
|
then all others could be found using the filesystem.
|
|
then all others could be found using the filesystem.
|
|
This idea may be worth revisiting in the future,
|
|
This idea may be worth revisiting in the future,
|
|
but the author wanted to avoid the complexity of implementing a new proof of work blockchain.
|
|
but the author wanted to avoid the complexity of implementing a new proof of work blockchain.
|
|
-Another idea was to use multicast link-local addressing to discover other runtimes,
|
|
|
|
-similar to how mDNS operates.
|
|
|
|
-This approach has several advantages.
|
|
|
|
-It avoids any dependency on centralized internet infrastructure
|
|
|
|
-and keeps network load local to the segment on which the runtimes are connected.
|
|
|
|
-But, it will not work over a wide area network,
|
|
|
|
-making it unsuitable for the general case.
|
|
|
|
-Instead, the design which was decided on was to use DNS to resolve a fully qualified domain name
|
|
|
|
-(FQDN) derived from the root principal's identifier.
|
|
|
|
|
|
+Instead, two independent mechanisms are used,
|
|
|
|
+one that can discover runtimes over the internet as long as their path is known,
|
|
|
|
+and another that can discover runtimes on the local network even when the discoverer does not know
|
|
|
|
+their paths.
|
|
|
|
+
|
|
|
|
+% Searching DNS for root principals.
|
|
|
|
+When the path to a runtime is known,
|
|
|
|
+DNS is used to resolve a fully qualified domain name
|
|
|
|
+(FQDN) derived from the root principal's identifier in this path.
|
|
This FQDN is expected to resolve to the public IP addresses of the runtimes hosting the
|
|
This FQDN is expected to resolve to the public IP addresses of the runtimes hosting the
|
|
sector service in the root directory of the root principal.
|
|
sector service in the root directory of the root principal.
|
|
Each process is configured with a search domain which is used as a suffix of the FQDN.
|
|
Each process is configured with a search domain which is used as a suffix of the FQDN.
|
|
@@ -429,6 +457,36 @@ Of course it is also possible to build such a system without hosting DNS inside
|
|
The downside of using DNS is that it couples Blocktree with a centralized,
|
|
The downside of using DNS is that it couples Blocktree with a centralized,
|
|
albeit distributed, system.
|
|
albeit distributed, system.
|
|
|
|
|
|
|
|
+% Using link-local multicast datagrams to find runtimes.
|
|
|
|
+Because this mechanism requires knowledge of the root principal of a domain to perform discovery,
|
|
|
|
+it will not work if a runtime is first starting up with no credentials and so does not know its
|
|
|
|
+own root principal.
|
|
|
|
+This runtime needs a way to discover other runtimes so it can connect to the filesystem and sector
|
|
|
|
+services.
|
|
|
|
+This issue is solved by using link-local multicast addressing to discover the runtimes on the same
|
|
|
|
+network as the discoverer.
|
|
|
|
+When a \texttt{bttp} server starts listening for unicast traffic,
|
|
|
|
+it also listens for UDP datagrams on port 50142 at addresses 224.0.0.142 and FF02::142,
|
|
|
|
+if the IPv4 or IPv6 networking stack is available, respectively.
|
|
|
|
+If the host is attached to a dual-stack network,
|
|
|
|
+the server listens on both addresses.
|
|
|
|
+When a runtime is attempting to discover other runtimes,
|
|
|
|
+it sends out datagrams to these IP addresses.
|
|
|
|
+Each \texttt{bttp} server replies with its unicast address and filesystem path
|
|
|
|
+(as specified in its credentials).
|
|
|
|
+If the server is available at both IPv4 and IPv6 unicast addresses,
|
|
|
|
+it is at the server's discretion which address to respond with;
|
|
|
|
+it may even respond with an IPv4 address to an IPv4 datagram,
|
|
|
|
+and an IPv6 address to an IPv6 datagram.
|
|
|
|
+Once a client has discovered the \texttt{bttp} servers on its network,
|
|
|
|
+it can route messages to them,
|
|
|
|
+such as the provisioning requests which are used to obtain new credentials.
|
|
|
|
+Provisioning is described in the Cryptography section.
|
|
|
|
+Note that port 50142 is in the dynamic range, as specified by RFC 6335,
|
|
|
|
+so it does not need to be registered with the Internet Assigned Numbers Authority (IANA).
|
|
|
|
+Both addresses 224.0.0.142 and FF02::142 are currently unassigned,
|
|
|
|
+but they will need to be registered with IANA if Blocktree is widely adopted.
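As a rough illustration of the discovery addressing, the following Rust sketch builds the multicast targets a discoverer would send datagrams to. The constant and function names are hypothetical and not taken from the \texttt{bttp} implementation. The IPv6 group is written with the \texttt{ff02} prefix, since IPv6 multicast addresses begin with \texttt{ff} and \texttt{ff02} denotes link-local scope.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr};

/// UDP port used for discovery, taken from the text.
pub const DISCOVERY_PORT: u16 = 50142;
/// IPv4 multicast group from the text.
pub const DISCOVERY_V4: Ipv4Addr = Ipv4Addr::new(224, 0, 0, 142);
/// IPv6 link-local multicast group (ff02::142).
pub const DISCOVERY_V6: Ipv6Addr = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x142);

/// RFC 6335 reserves ports 49152 through 65535 as the dynamic range.
pub fn is_dynamic_port(port: u16) -> bool {
    port >= 49152
}

/// Returns the socket addresses to send discovery datagrams to,
/// depending on which networking stacks are available on the host.
pub fn discovery_targets(has_v4: bool, has_v6: bool) -> Vec<SocketAddr> {
    let mut targets = Vec::new();
    if has_v4 {
        targets.push(SocketAddr::new(IpAddr::V4(DISCOVERY_V4), DISCOVERY_PORT));
    }
    if has_v6 {
        targets.push(SocketAddr::new(IpAddr::V6(DISCOVERY_V6), DISCOVERY_PORT));
    }
    targets
}
```

On a dual-stack host both targets are returned, matching the behavior of a server that listens on both addresses.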
|
|
|
|
+
|
|
% Security model for queries.
|
|
% Security model for queries.
|
|
To allow runtimes which are not permitted to execute the root directory to query for other runtimes,
|
|
To allow runtimes which are not permitted to execute the root directory to query for other runtimes,
|
|
authorization logic which is specific to queries is needed.
|
|
authorization logic which is specific to queries is needed.
|
|
@@ -734,6 +792,18 @@ When a file actor receives notification that its owner returned,
|
|
it flushes any buffered data in its cache and returns,
|
|
it flushes any buffered data in its cache and returns,
|
|
ensuring that a resource leak does not occur.
|
|
ensuring that a resource leak does not occur.
|
|
|
|
|
|
|
|
+% Encrypted metadata. Extended attributes in metadata. Cache control.
|
|
|
|
+Some of the information stored in metadata needs to be kept in plaintext to allow the sector
|
|
|
|
+service to verify and decrypt the file,
|
|
|
|
+but most of it is encrypted using the same key as the file's contents.
|
|
|
|
+The file's authorization attributes, its size, and its access times are all encrypted.
|
|
|
|
+The table storing the file's extended attributes (EAs) is also encrypted.
|
|
|
|
+Cache control information is included in this area as well.
|
|
|
|
+It specifies the number of seconds, as a u32, that a file may be cached.
|
|
|
|
+The filesystem service uses this information to evict sectors from its cache when they have been
|
|
|
|
+cached for longer than this threshold,
|
|
|
|
+causing them to be reloaded from the sector service.
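The eviction rule can be sketched as follows. The \texttt{CachedSector} type and \texttt{is\_stale} function are hypothetical and exist only for illustration; the sketch assumes the cache records when each sector was loaded and compares its age against the \texttt{u32} number of seconds stored in the metadata.

```rust
use std::time::{Duration, Instant};

/// Hypothetical cache entry; not the filesystem service's actual type.
pub struct CachedSector {
    /// When the sector was loaded from the sector service.
    pub loaded_at: Instant,
    /// Cache control value from the file's metadata, in seconds.
    pub max_age_secs: u32,
}

/// Returns true if the sector has been cached for longer than the
/// threshold and should be evicted and reloaded.
pub fn is_stale(entry: &CachedSector, now: Instant) -> bool {
    let age = now.saturating_duration_since(entry.loaded_at);
    age > Duration::from_secs(u64::from(entry.max_age_secs))
}
```

A sector whose age exactly equals the threshold is kept in this sketch, because the text specifies eviction only when it has been cached for longer than the threshold.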
|
|
|
|
+
|
|
% Authorization logic of the filesystem service.
|
|
% Authorization logic of the filesystem service.
|
|
The filesystem service uses an \texttt{Authorizer} type to make authorization decisions.
|
|
The filesystem service uses an \texttt{Authorizer} type to make authorization decisions.
|
|
It passes this type the authorization attributes of the principal accessing the file, the
|
|
It passes this type the authorization attributes of the principal accessing the file, the
|
|
@@ -1004,8 +1074,35 @@ The first runtime is configured to host the sector and filesystem services,
|
|
so that subsequent runtimes will have access to the filesystem.
|
|
so that subsequent runtimes will have access to the filesystem.
|
|
After that, additional runtimes on the same LAN can be provisioned using the automatic process.
|
|
After that, additional runtimes on the same LAN can be provisioned using the automatic process.
|
|
|
|
|
|
|
|
+% Setting up user based access control.
|
|
|
|
+Up until now, the focus has been on authentication and authorization of processes,
|
|
|
|
+but it bears discussing how user based access control can be accomplished with Blocktree.
|
|
|
|
+Because credentials are locked to the device on which they're created,
|
|
|
|
+a user will have at least as many principals as they have devices.
|
|
|
|
+But, all of these principals can be configured to have the same authorization attributes (UID, GID),
|
|
|
|
+giving them the same permissions.
|
|
|
|
+It makes sense to keep the files for all of the provisioned runtimes associated with a user in one
|
|
|
|
+place,
|
|
|
|
+and the natural place is in the user's home directory.
|
|
|
|
+Although every one of the user's processes needs to be provisioned,
|
|
|
|
+this is not a huge limitation because a single runtime can host many different actors,
|
|
|
|
+implementing many different applications.
|
|
|
|
+Managing the users in a domain is facilitated by putting their home directories in a single user
|
|
|
|
+directory for the domain.
|
|
|
|
+Runtimes hosting the sector service on storage servers could then be provisioned in this directory
|
|
|
|
+to provide the sector and filesystem services for the users' home directories.
|
|
|
|
+It would be at the administrator's discretion whether or not to enable client-side encryption.
|
|
|
|
+If they wanted to,
|
|
|
|
+the principal of at least one of a user's runtimes would need to be issued a readcap for the
|
|
|
|
+user's home directory.
|
|
|
|
+This runtime could then directly access the sector service in the domain's user directory.
|
|
|
|
+By moving encryption onto the user's computer,
|
|
|
|
+load can be shed from the storage servers.
|
|
|
|
+Note that this setup does require all of the user's runtimes to be able to communicate with the
|
|
|
|
+runtime whose principal was issued the readcap.
|
|
|
|
+
|
|
% Example of how these mechanisms allow data to be shared.
|
|
% Example of how these mechanisms allow data to be shared.
|
|
-To illustrate how these mechanisms can be used,
|
|
|
|
|
|
+To illustrate how these mechanisms can be used to facilitate collaboration between enterprises,
|
|
consider a situation where two companies wish to partner on the development of a product.
|
|
consider a situation where two companies wish to partner on the development of a product.
|
|
To facilitate their collaboration,
|
|
To facilitate their collaboration,
|
|
they wish to have a way to securely exchange data with each other.
|
|
they wish to have a way to securely exchange data with each other.
|
|
@@ -1302,7 +1399,7 @@ $h$ is measured in meters and takes values in \texttt{i32}.
|
|
So, the distance from the center of the planet to the point ($\phi$, $\lambda$, $h$) is
|
|
So, the distance from the center of the planet to the point ($\phi$, $\lambda$, $h$) is
|
|
$\rho + h$.
|
|
$\rho + h$.
|
|
|
|
|
|
-% Directory organization. Quad-trees.
|
|
|
|
|
|
+% Directory organization. Quadtrees.
|
|
The data describing how to render a planet consists of its terrain mesh, terrain textures, and
|
|
The data describing how to render a planet consists of its terrain mesh, terrain textures, and
|
|
the objects on its surface.
|
|
the objects on its surface.
|
|
This could represent a very large amount of data for a planet with realistic terrain populated by
|
|
This could represent a very large amount of data for a planet with realistic terrain populated by
|
|
@@ -1328,7 +1425,7 @@ In other words, it is divided in half north to south and east to west.
|
|
The four new regions are stored in four subdirectories of the original region's directory
|
|
The four new regions are stored in four subdirectories of the original region's directory
|
|
named 0, 1, 2, and 3.
|
|
named 0, 1, 2, and 3.
|
|
The data in the old region is then moved into the appropriate directory based on its location.
|
|
The data in the old region is then moved into the appropriate directory based on its location.
|
|
-Thus the directory tree of a planet essentially forms a quad-tree,
|
|
|
|
|
|
+Thus the directory tree of a planet essentially forms a quadtree,
|
|
albeit one which is built up progressively.
|
|
albeit one which is built up progressively.
|
|
|
|
|
|
% Region data files.
|
|
% Region data files.
|