123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667 |
- \documentclass{article}
- \usepackage[scale=0.8]{geometry}
- \usepackage{hyperref}
- \usepackage{graphicx}
- \title{The Blocktree Cloud Orchestration Platform}
- \author{Matthew Carr}
- \begin{document}
- \maketitle
- \begin{abstract}
- This document is a proposal for a novel cloud platform called Blocktree.
- The system is described in terms of the actor model,
- where tasks and services are implemented as actors.
- The platform is responsible for orchestrating these actors on a set of native operating system processes.
- A service is provdied to actors which allows them access to a highly available distributed file system,
- which serves as the only source of persistent state for the system.
- High availability is achieved using the Raft consensus protocol to synchronize the state of files between processes.
- All data stored in the filesystem is secured with strong integrity and optional confidentiality protections.
- A network block device like interface allows for fast low-level read and write access to the encrypted data,
- with full support for client-side encryption.
- Well-known cryptographic primitives and constructions are employed to provide this protection,
- the system does not attempt to innovate in terms of cryptography.
- The system's trust model allows for mutual TLS authentication between all processes in the system,
- even those which are controlled by different owners.
- By integrating these ideas into a single platform,
- the system aims to advance the status quo in the security and reliability of software systems.
- \end{abstract}
- \section{Introduction}
- % The "Big" Picture.
- Blocktree is an attempt to extend the Unix philosophy that everything is a file
- to the entire distributed system that comprises modern IT infrastructure.
- The system is organized around a global distributed filesystem which defines security
- principals, resources, and their authorization attributes.
- This filesystem provides a language for access control that can be used to securely grant principals
- access to resources from different organizations, without the need to setup federation.
- The system provides an actor runtime for orchestrating services.
- Resources are represented by actors, and actors are grouped into operating system processes.
- Each process has its own credentials which authenticate it as a unique security principal,
- and which specify the filesystem path where the process is located.
- A process has authorization attributes which determine the set of processes that may communicate with it.
- Every connection between processes is established using mutual TLS authentication,
- which is accomplished without the need to trust any third-party certificate authorities.
- The cryptographic mechanisms which make this possible are described in detail in section 3.
- Messages addressed to actors in a different process are forwarded over these connections,
- while messages delivered to actors in the same process are delivered with zero-copying.
- % Self-certifying paths and the chain of trust.
- The single global Blocktree filesystem is partitioned into disjoint domains of authority.
- Each domain is controlled by a root principal.
- As is the case for all principals,
- a root principal is authenticated by a public-private key pair,
- and is identified by a hash of its public key.
- The domain of authority for a given absolute path is determined by its first component,
- which is the identifier of the root principal who controls the domain.
- Because there is no meaning to the directory "/",
- a directory consisting of only a single component equal to a root principal's identifier is
- referred to as the root directory of that root principal.
- The root principal delegates its authority to write files to subordinate principals by issuing
- them certificates which specify the path that the authority of the subordinate is limited to.
- File data is signed for authenticity and a certificate chain is contained in its metadata.
- This certificate chain must lead back to the root principal
- and consist of certificates with correctly scoped authority in order for the file to be authentic.
- Given the path of a file and the file's contents,
- this system allows the file to be validated by anyone without the need to trust a third-party.
- Blocktree paths are referred to as self-certifying for this reason.
- % Persistent state provided by the filesystem.
- One of the major challenges in distributed systems is managing persistent state.
- Blocktree solves this issue using its distributed filesystem.
- Files are broken into segments called sectors.
- The sector size of a file can be configured when it is created,
- but cannot be changed after the fact.
- Reads and writes of individual sectors are guaranteed to be atomic.
- The sectors which comprise a file and its metadata are replicated by a set of processes running
- the sector service.
- This service is responsible for storing the sectors of files which are contained in the directory
- containing the process in which it is running.
- The actors providing the sector service in a given directory coordinate with one another using
- the Raft protocol to synchronize the state of the sectors they store.
- This method of partitioning the data in the filesystem based on directory
- allows the system to scale beyond the capabilities of a single consensus cluster.
- Sectors are secured with strong integrity protection,
- which allows anyone to verify that their contents were written by an authorized principal.
- Encryption can be optionally applied to sectors,
- with the system handling key management.
- The cryptographic mechanisms used to implement these protections are described in section 3.
- To reduce load on the sector service, and to allow the system to scale to a larger number of users,
- a peer-to-peer distribution system is implemented in the filesystem service.
- This system allows filesystem actors to download sectors from other filesystem actors
- that have the sectors in their local cache.
- The threat of malicious actors serving bad sector data is mitigated by the strong integrity
- protections applied to sectors.
- By using peer-to-peer distribution, the system can serve as a content delivery network.
- % Protocol contracts.
- One of the design goals of Blocktree is to facilitate the creation of composable distributed
- systems.
- A major challenge to building such systems is the difficulty in pinning down bugs when they
- inevitably occur.
- Research into session types (a.k.a. Behavioral Types) promises to bring the safety benefits
- of type checking to actor communication.
- Blocktree integrates a session typing system that allows protocol contracts to be defined that
- specify the communication patterns of a set of actors.
- This model allows the state space of the set of actors participating in a computation to be defined,
- and the state transitions which occur to be specified based on the types of received messages.
- These contracts are used to verify protocol adherence statically and dynamically.
- This system is implemented using compile time code generation,
- making it a zero-cost abstraction.
- This frees the developer from dealing with the numerous failure modes that can occur in a
- communication protocol.
- % Implementation language and project links.
- Blocktree is implemented in the Rust programming language.
- Its source code is licensed under the Affero GNU Public License Version 3.
- It can be downloaded at the project homepage at \url{https://blocktree.systems}.
- Anyone interested in contributing to development is welcome to submit a pull request
- to \url{https://gogs.delease.com/Delease/Blocktree}.
- If you have larger changes or architectural suggestions,
- please submit an issue for discussion prior to spending time implementing your idea.
- % Outline of the rest of the paper.
- The remainder of this paper is structured as follows:
- \begin{itemize}
- \item Section 2 describes the actor runtime, service and task orchestration, and service
- discovery.
- \item Section 3 discusses the filesystem, its concurrency semantics and implementation.
- \item Section 4 details the cryptographic mechanisms used to secure communication between
- actor runtimes and to protect sector data.
- \item Section 5 is a set of examples describing ways that Blocktree can be used to build systems.
- \item Section 6 provides some concluding remarks.
- \end{itemize}
- \section{Actor Runtime}
- % Motivation for using the actor model.
- Building scalable fault tolerant systems requires us to distribute computation over
- multiple computers.
- Rather than switching to a different programming model when an application scales beyond the
- capacity of a single computer,
- it is beneficial in terms of programmer time and program simplicity to begin with a model that
- enables multi-computer scalability.
- Fundamentally, all communication over an IP network involves the exchange of messages,
- namely IP packets.
- So if we wish to build scalable fault-tolerant systems,
- it makes sense to choose a programming model built on message passing,
- as this will ensure low impedance with the underlying networking technology.
- % Overview of message passing interface.
- That is why Blocktree is built on the actor model
- and why its actor runtime is at the core of its architecture.
- The runtime can be used to register services and dispatch messages.
- Messages can be dispatched in two different ways: with \texttt{send} and \texttt{call}.
- A message is dispatched with the \texttt{send} method when no reply is required,
- and with \texttt{call} when exactly one is.
- The \texttt{Future} returned by \texttt{call} can be awaited to obtain the reply.
- If a timeout occurs while waiting for the reply,
- the \texttt{Future} completes with an error.
- The name \texttt{call} was chosen to bring to mind a remote procedure call,
- which is the primary use case this method was intended for.
- Awaiting replies to messages serves as a simple way to synchronize a distributed computation.
- % Description of virtual actor system.
- One of the challenges when building actor systems is supervising and managing actor's lifecycles.
- This is handled in Erlang through the use of supervision trees,
- but Blocktree takes a different approach inspired by Microsoft's Orleans framework.
- Orleans introduced the concept of virtual actors,
- which are purely logical entities that exist perpetually.
- In Orleans, one does not need to spawn actors nor worry about respawing them should they crash,
- the framework takes care of spawning an actor when a message is dispatched to it.
- This model also gives the framework the flexibility to deactivate actors when they are idle
- and to load balance actors across different computers.
- In Blocktree a similar system is used,
- which is possible because messages are only addressed to services.
- The Blocktree runtime takes care of routing these messages to the appropriate actors,
- spawning them if needed.
- % The runtime is implemented using tokio.
- The actor runtime is currently implemented using the Rust asynchronous runtime tokio.
- Actors are spawned as tasks in the tokio runtime,
- and multi-producer single consumer channels are used for message delivery.
- Because actors are just tasks,
- they can do anything a task can do,
- including awaiting other futures.
- Because of this, there is no need for the actor runtime to support short-lived worker tasks,
- as any such use-case can be accomplished by awaiting a set of \texttt{Future}s.
- This allows the runtime to focus on providing support for services.
- Using tokio also means that we have access to a high performance multi-threaded runtime with
- evented IO.
- This asynchronous programming model ensures that resources are efficiently utilized,
- and is ideal for a system focused on orchestrating services which may be used by many clients.
- % Delivering messages over the network.
- Messages can be forwarded between actor runtimes using a secure transport layer called
- \texttt{bttp}.
- Messages are addressed using \emph{actor names}.
- An actor name consists of the following fields:
- \begin{enumerate}
- \item \texttt{service}: The path identifying the receiving service.
- \item \texttt{scope}: A filesystem path used to specify the intended recipient.
- \item \texttt{rootwards}: An enum describing whether message delivery is attempted towards or
- away from the root of the filesystem tree. A value of
- \texttt{false} indicates that the message is intended for a runtime directly contained in the
- scope. A value of \texttt{true} indicates that the message is intended for a runtime contained
- in a parent directory of the scope and should be delivered to a runtime which has the requested
- service registered and is closest to the scope.
- \item \texttt{id}: An identifier for a specific service provider.
- \end{enumerate}
- The ID can be a \texttt{Uuid} or a \texttt{String}.
- It is treated as an opaque identifier by the runtime,
- but a service is free to associate additional meaning to it.
- Every message has a header containing the name of the sender and receiver.
- The transport is implemented using the QUIC protocol, which integrates TLS for security.
- A \texttt{bttp} client may connect anonymously or using credentials.
- If an anonymous connection is attempted,
- the client has no authorization attributes associated with it.
- Only runtimes which grant others the execute permission allow connections from such clients.
- If these permissions are not granted in the runtime's file,
- anonymous connections are rejected.
- When a client connects with credentials,
- mutual TLS authentication is performed as part of the connection handshake,
- which cryptographically verifies the credentials of each runtime.
- These credentials contain the filesystem paths where each runtime is located,
- which ensures that messages addressed to a specific path will only be delivered to that path.
- The \texttt{bttp} server is always authenticated during the handshake,
- even when the client is connecting anonymously.
- Because QUIC supports the concurrent use of many different streams,
- it serves as an ideal transport for a message oriented system.
- \texttt{bttp} uses different streams for independent messages,
- ensuring that head of line blocking does not occur.
- The same stream is used for sending the reply to a message dispatched with \texttt{call}.
- Once a connection is established,
- message may flow both directions (provided both runtimes have execute permissions for the other),
- regardless of which runtime is acting as the client or the server.
- % Delivering messages locally.
- When a message is sent between actors in the same runtime it is delivered into the queue of the recipient without any copying,
- while ensuring immutability (i.e. move semantics).
- This is possible thanks to the Rust ownership system,
- because the message sender gives ownership to the runtime when it dispatches the message,
- and the runtime gives ownership to the recipient when it delivers the message.
- % Security model based on filesystem permissions.
- A runtime is represented in the filesystem as a file.
- This file contains the authorization attributes which are associated with the runtime's security
- principal.
- The credentials used by the runtime specify the file, so other runtimes are able to locate it.
- The metadata of the file contains authorization attributes just like any other file
- (e.g. UID, GID, and mode bits).
- In order for a principal to be able to send a message to an actor in the runtime,
- it must have execute permissions for this file.
- Thus communication between runtimes can be controlled using simple filesystem permissions.
- Permissions checking is done during the \texttt{bttp} handshake.
- Note that it is possible for messages to be sent in one direction in a \texttt{bttp} connection
- but not in the other.
- In this situation replies are permitted but unsolicited messages are not.
- An important trade-off which was made when designing this model was that messages which are
- sent between actors in the same runtime are not subject to any authorization checks.
- This was done for two reasons: performance and security.
- By eliminating authorization checks messages can be more efficiently delivered between actors in the
- same process,
- which helps to reduce the performance penalty of the actor runtime over directly using threads.
- Security is enhanced by this decision because it forces the user to separate actors with different
- security requirements into different operating system processes,
- which ensures all of the process isolation machinery in the operating system will be used to
- isolate them.
- % Representing resources as actors.
- As in other actor systems, it is convenient to represent resources in Blocktree using actors.
- This allows the same security model used to control communication between actors to be used for
- controlling access to resources,
- and for resources to be shared by many actors.
- For instance, a Point-to-Point Protocol connection could be owned by an actor.
- This actor could forward traffic delivered to it in messages over this connection.
- The set of actors which are able to access the connection is controlled by setting the filesystem
- permissions on the file for the runtime executing the actor owning the connection.
- % Message routing to services.
- A service is identified by a Blocktree path.
- Only one service implementation can be registered in a particular runtime,
- though this implementation may be used to spawn many actors as providers for the service,
- each associated with a different ID.
- The runtime spawns a new actor when it finds no service provider associated with the ID in the
- message it is delivering.
- Some services may only have one service provider in a given runtime,
- as is the case for the sector and filesystem services.
- Services are reactive,
- they don't do anything until they receive a message to process.
- The \texttt{scope} and \texttt{rootward} field in an actor name specify the set of runtimes to
- which a message may be delivered.
- They allow the sender to express their intended recipient,
- while still affording enough flexibility to the runtime to route messages as needed.
- If \texttt{rootward} is \texttt{false},
- the message is delivered to a service provider in a runtime that is directly contained in
- \texttt{scope}.
- If \texttt{rootward} is \texttt{true},
- the parent directories of scope are searched,
- working towards the root of the filesystem tree,
- and the message is delivered to the first provider of \texttt{service} which is found.
- When there are multiple service providers to which a given message could be delivered,
- the one to which it is actually delivered is unspecified,
- which allows the runtime to balance load.
- Delivery will occur for at most one recipient,
- even in the case that there are multiple potential recipients.
- In order to contact other runtimes and deliver messages to them,
- their IP addresses need to be known.
- This is achieved by maintaining a file with a runtime's IP address in the same directory as the
- runtime.
- The runtime is granted write permissions on the file,
- and it is updated by \texttt{bttp} when it begins listening on a new endpoint.
- The services which are allowed to be registered in a given runtime are specified in the runtime's
- file.
- The runtime reads this list and uses it to deny service registrations for unauthorized services.
- The list is also read by other runtime's when they are searching a directory for service providers.
- % The sector and filesystem service.
- The filesystem is itself implemented as a service.
- A filesystem service provider can be passed messages to delete files, list directory contents,
- open files, or perform several other standard filesystem operations.
- When a file is opened,
- a new actor is spawned which owns the newly created file handle and its name is returned to the
- caller in a reply.
- Subsequent read and write messages are sent to this actor.
- The filesystem service does not persist any data itself,
- its job is to function as an integration layer,
- conglomerating sector data from many different sources into a single unified interface.
- The sector service is what is ultimately responsible for storing data,
- and thus maintaining the persistent state of the system.
- It stores sector data in the local filesystem of each computer on which it is registered.
- The details of how this is accomplished are deferred to the next section.
- % Runtime network discovery.
- While it is possible to resolve runtime paths to IP addresses when the filesystem is available,
- a different mechanism is needed to allow the filesystem and sector services to discover service
- providers.
- To facilitate this,
- runtimes are able to query one another to learn about other runtimes.
- Because queries are intended to facilitate message delivery,
- the query fields and their meanings mirror those used for addressing messages:
- \begin{enumerate}
- \item \texttt{service} The path of the service whose providers are sought.
- Only runtimes with this service registered will be returned.
- \item \texttt{scope} The filesystem path relative to which the query will be processed.
- \item \texttt{rootward} Indicates if the query should search for runtimes from \texttt{scope}
- toward the root.
- \end{enumerate}
- The semantics of \texttt{scope} and \texttt{rootward} in a query are identical to their use in an
- actor name.
- As long as at least one other runtime is known,
- a query can be issued to learn of more runtimes.
- A runtime which receives a query may not be able to answer it directly.
- If it cannot,
- it returns the IP address of the next runtime to which the query should be sent.
- In order to bootstrap the discovery processes,
- another mechanism is needed to find the first peer to query.
- There were several possibilities explored for doing this.
- One way is to use a blockchain to store the IP addresses of the runtimes hosting the sector service
- in the root directory.
- As long as these runtimes could be located,
- then all others could be found using the filesystem.
- This idea may be worth revisiting in the future,
- but the author wanted to avoid the complexity of implementing a new proof of work blockchain.
- Another idea was to use multicast link-local addressing to discover other runtimes,
- similar to how mDNS operates.
- This approach has several advantages.
- It avoids any dependency on centralized internet infrastructure
- and keeps network load local to the segment on which the runtimes are connected.
- But, it will not work over a wide area network,
- making it unsuitable for the general case.
- Instead, the design which was decided on was to use DNS to resolve a fully qualified domain name
- (FQDN) derived from the root principal's identifier.
- This FQDN is expected to resolve to the public IP addresses of the runtimes hosting the
- sector service in the root directory of the root principal.
- Each process is configured with a search domain which is used as a suffix of the FQDN.
- The leading labels in the FQDN are computed by base32 encoding a hash of the root
- principal's public key.
- If the encoded string is longer than 63 bytes (the limit for each label in a hostname),
- it is separated into the fewest number of labels possible,
- working from left to right along the string.
- A dot followed by the search domain is concatenated onto the end of this string to form the FQDN.
- This method has the advantages of being simple to implement
- and allowing runtimes to discover each other over the internet.
- Implementing this system would be facilitated by hosting DNS servers in actors in the same
- runtimes as the root sector service providers.
- Then, A or AAAA records could be served which point to these runtimes.
- These runtimes would also need to be configured with static IP addresses,
- and the NS records for the search domain would need to point to them.
- Of course it is also possible to build such a system without hosting DNS inside of Blocktree.
- The downside of using DNS is that it couples Blocktree with a centralized,
- albeit distributed, system.
- % Security model for queries.
- To allow runtimes which are not permitted to execute the root directory to query for other runtimes,
- authorization logic which is specific to queries is needed.
- If a process is connected with credentials
- and the path in the credentials contains the scope of the query,
- the query is permitted.
- If a process is connected anonymously,
- its query will only be answered if the query scope
- and all of its parent directories,
- grant others the execute permission.
- Queries from authenticated processes can be authorized using only the information in the query,
- but anonymous queries require knowledge of filesystem permissions,
- some of which may not be known to the answering runtime.
- When authorizing an anonymous query,
- an answering runtime should check that that the execute permission is granted on all directories
- that it is responsible for storing.
- If all these checks pass, it should forward the querier to the next runtime as usual.
- % Overview of protocol contracts and runtime checking of protocol adherence.
- To facilitate the creation of composable systems,
- a protocol contract checking system based on session types has been designed.
- This system models a communication protocol as a directed graph representing state transitions
- based on types of received messages.
- The protocol author defines the states that the actors participating in the protocol can be in using
- Rust traits.
- These traits define handler methods for each message type the actor is expected to handle in that
- state.
- A top-level trait which represents the entire protocol is defined that contains the types of the
- initial state of every actor in the protocol.
- A macro is used to generate the message handling loop for the each of the parties to the protocol,
- as well as enums to represent all possible states that the parties can be in and the messages that
- they exchange.
- The generated code is responsible for ensuring that errors are generated when a message of an
- unexpected type is received,
- eliminating the need for ad-hoc error handling code to be written by application developers.
- % Example of a protocol contract.
- % TODO: I don't find this example very compelling. It would be more impressive to show a pub-sub
- % protocol, that would look cool.
- Let us explore the use of this system through a simple example using the HTTP/1.1 protocol.
- It is a state-less client-server protocol,
- essentially just an RPC from client to server.
- We can model this in for the contract checker by defining a trait representing the protocol:
- \begin{verbatim}
- pub trait Http {
- type Server: ServerInit;
- }
- \end{verbatim}
- The purpose of this top-level trait is to specify the initial state of every party to the
- communications protocol.
- In this case we're only modeling the state of the server,
- as the client will just \texttt{call} a method on the server.
- The initial state for the server is defined as follows:
- \begin{verbatim}
- pub trait ServerInit {
- type AfterActivate: Listening;
- type Fut: Future<Output = Result<Self::AfterActivate>>;
- fn handle_activate(self, msg: Activate) -> Self::Fut;
- }
- \end{verbatim}
- \texttt{Activate} is a message sent by the generated code to allow the actor access to the
- runtime and the actor's ID.
- It is defined as follows:
- \begin{verbatim}
- pub struct Activate {
- rt: &'static Runtime,
- act_id: Uuid,
- }
- \end{verbatim}
- We represent the statelessness of HTTP by having the requests to the \texttt{Listening} state
- return another \texttt{Listening} state.
- \begin{verbatim}
- pub trait Listening {
- type AfterRequest: Listening;
- type Fut: Future<Output = Result<Self::AfterRequest>>;
- fn handle_request(self, msg: Envelope<Request>) -> Self::Fut;
- }
- \end{verbatim}
- The \texttt{Envelope} type is a wrapper around a message which contains information about who sent
- it and a method which can be used to send a reply.
- In general a new type could be returned after each message received,
- with the returned type being dependent on the type of the message.
- The state graph of this protocol can be visualized as follows:
- \begin{center}
- \includegraphics[height=1.5in]{HttpStateGraph.pdf}
- \end{center}
- % Implementing actors in languages other than Rust.
- Today the actor runtime only supports executing actors implemented in Rust.
- A WebAssembly (Wasm) plugin system is planned to allow any language which can compile to Wasm to be
- used to implement an actor.
- This work is blocked pending the standardization of the WebAssembly Component Model,
- which promises to provide an interface definition language which will allow type safe actors to be
- defined in many different languages.
- % Running containers using actors.
- Blocktree allows containers to be run by encapsulating them using a supervising actor.
- This actor is responsible for starting the container and managing the container's kernel namespace.
- Logically, it owns any kernel resources created by the container, including all spawned operating
- system processes.
- When the actor halts,
- all of these resources are destroyed.
- All network communication to the container is controlled by the supervising actor.
- The supervisor can be configured to bind container ports to host ports,
- as is commonly done today,
- but it can also be used to encapsulate traffic to and from the container in Blocktree messages.
- These messages are routed to other actors based on the configuration of the supervisor.
- This essentially creates a VPN for containers,
- ensuring that regardless of well secured their communication is,
- they will be safe to communicate over any network.
- This network encapsulation system could be used in other actors as well,
- allowing a lightweight and secure VPN system to built.
- \section{Filesystem}
- % The division of responsibilities between the sector and filesystem services.
- The responsibility for storing data in the system is shared between the filesystem and sector
- services.
- Most actors will access the filesystem through the filesystem service,
- which provides a high-level interface that takes care of the cryptographic operations necessary to
- read and write files.
- The filesystem service relies on the sector service for actually persisting data.
- The individual sectors which make up a file are read from and written to the sector service,
- which stores them in the local filesystem of the computer on which it is running.
- A sector is the atomic unit of data storage.
- The sector service only supports reading and writing entire sectors at once.
- File actors spawned by the filesystem service buffer reads and writes so until there is enough
- data to fill a sector.
- Because cryptographic operations are only performed on full sectors,
- the cost of providing these protections is amortized over the size of the sector.
- Thus there is tradeoff between latency and throughput when selecting the sector size of a file.
- A smaller sector size means less latency while a larger one enables more throughput.
- % Types of sectors: metadata, integrity, and data.
- A file has a single metadata sector, a Merkle sector, and zero or more data sectors.
- The sector size of a file can be specified when it is created,
- but cannot be changed later.
- Every data sector contains the ciphertext of the number of bytes equal to the sector size,
- but the metadata and Merkle sectors contain a variable amount of data.
- The metadata sector contains all of the filesystem metadata associated with the file.
- In addition to the usual metadata present in any Unix filesystem (the contents of the \texttt{stat} struct),
- cryptographic information necessary to verify and decrypt the contents of the file are also stored.
- The Merkle sector of a file contains a Merkle tree over the data sectors of a file.
- The hash function used by this tree can be configured at file creation,
- but cannot be changed after the fact.
- % How sectors are identified.
- When sector service providers are contained in the same directory they connect to each other to form
- a consensus cluster.
- This cluster is identified by a \texttt{u64} called the cluster's \emph{generation}.
- Every file is identified by a pair of \texttt{u64}, its generation and its inode.
- The sectors within a file are identified by an enum which specifies which type they are,
- and in the case of data sectors, their index.
- \begin{verbatim}
- pub enum SectorKind {
- Meta,
- Merkle,
- Data(u64),
- }
- \end{verbatim}
- The offset in the plaintext of the file at which each data sector begins can be calculated by
- multiplying the sectors offset by the sector size of the file.
- % Scaling horizontally: using Raft to create consensus cluster. Additional replication methods.
- When multiple multiple sector service providers are contained in the same directory,
- the sector service providers connect to each other to form a consensus cluster.
- This cluster uses the Raft protocol to synchronize the state of the sectors it stores.
- The system is currently designed to replicate all data to each of the service providers in the
- cluster.
- Additional replication methods are planned for implementation,
- such as consisting hashing and erasure encoding,
- which allow for different tradeoffs between data durability and storage utilization.
- % Scaling vertically: how different generations are stitched together.
- The creation of a new generation of the sector service is accomplished with several steps.
- First, a new directory is created in which the generation will be located.
- Next, one or more processes are credentialed for this directory,
- using a procedure which is described in the next section.
- The credentialing process produces files for each of the processes stored in the new directory.
- The sector service provider in each of the new processes uses service discovery to establish
- communication with its peers in the other processes.
- Finally, the service provider which is elected leader contacts the cluster in the root directory
- and requests a new generation number.
- Once this number is known it is stored in the superblock for the generation,
- which is the file identified by the new generation number and inode 2.
- Note that the superblock is not contained in any directory and cannot be accessed by actors
- outside of the sector service.
- The superblock also contains information used to assign a inodes when a files are created.
- % Sector service discovery. Paths.
- % The filesystem service is responsible for cryptographic operations. Client-side encryption.
- The sector service is relied upon by the filesystem service to read and write sectors.
- Filesystem service providers communicate with the sector service to open files, read and write
- their contents, and update their metadata.
- These providers are responsible for verifying and decrypting the information contained in sectors
- and providing it to downstream actors.
- They are also responsible for encrypting and integrity protecting data written by downstream actors.
- Most of the complexity of implementing a filesystem is handled in the filesystem service.
- Most messages sent to the sector service only specify the operation (read or write), the identifier
- for the sector, and the sector contents.
- Every time a data sector is written an updated metadata sector is required to be sent in the same
- message.
- This requirement exists because a signature over the root of the file's Merkle tree is contained in
- the metadata,
- and since this root changes with every modification, it must be updated during every write.
- When the sector service commits a write it hashes the sector contents,
- updates the Merkle sector of the file, and updates the metadata sector.
- In order for the filesystem service to produce a signature over the root of the file's Merkle tree,
- it maintains a copy of the tree in memory.
- This copy is loaded from the sector service when the file is opened.
- While this does mean duplicating data between the sector and filesystem services,
- this design was chosen to reduce the network traffic between the two services,
- as the entire Merkle tree does not need to be transmitted on every write.
- Encapsulating all cryptographic operations in the filesystem service allows the computer storing
- data to be different from the computer encrypting it.
- This approach allows client-side encryption to be done on more capable computers
- and for this task to be delegated to a storage server on low powered devices.
- % Description of how the filesystem layer: opens a file, reads, and writes.
- % Peer-to-peer data distribution in the filesystem service.
- % Streaming replication.
- \section{Cryptography}
- % The underlying trust model: self-certifying paths.
- % Verifying sector contents on read and certifying on write.
- % Confidentiality protecting files with readcaps. Single pubkey operation to read a dir tree.
- % Give example of how these mechanisms allow data to be shared without any prior federation.
- % Description of bttp handshake and the authentication data which is provided by both parties.
- % Requesting and issuing credentials. Multicast link-local network discovery.
- \section{Examples}
- This section contains examples of systems built using Blocktree. The hope is to illustrate how this
- platform can be used to implement existing applications more easily and to make it possible to
- implement systems which are currently out of reach.
- \subsection{A personal cloud for a home user.}
- % Describe my idealized home Blocktree setup.
- \subsection{An ecommerce website.}
- % Describe a blocktree which runs a cluster of webservers, a manufacturing process, a warehouse
- % inventory management system, and an order fulfillment system.
- \subsection{A smart home.}
- \subsection{A realtime geo-spacial environment.}
- % Explain my vision of the metaverse.
- \section{Conclusion}
- % Blocktree serves as the basis for building a cloud-level distributed operating system.
- % The system enables individuals to self-host the services they rely on.
- % It also gives business a freeer choice of whether to own or lease computing resources.
- % The system advances the status quo in secure computing.
- % Composability leads to emergent benefits.
- \end{document}
|