
The filesystem section in the paper is now topic-complete.

Matthew Carr 1 year ago
parent
commit
c789571c8b

+ 200 - 72
doc/BlocktreeCloudPaper/BlocktreeCloudPaper.tex

@@ -151,7 +151,7 @@ as this will ensure low impedance with the underlying networking technology.
 % Overview of message passing interface.
 That is why Blocktree is built on the actor model
 and why its actor runtime is at the core of its architecture.
-The runtime can be used to register services and dispatch messages.
+The runtime can be used to spawn actors, register services, and dispatch messages.
 Messages can be dispatched in two different ways: with \texttt{send} and \texttt{call}.
 A message is dispatched with the \texttt{send} method when no reply is required,
 and with \texttt{call} when exactly one is.
@@ -172,18 +172,49 @@ In Orleans, one does not need to spawn actors nor worry about respawing them sho
 the framework takes care of spawning an actor when a message is dispatched to it.
 This model also gives the framework the flexibility to deactivate actors when they are idle
 and to load balance actors across different computers.
-In Blocktree a similar system is used,
-which is possible because messages are only addressed to services.
+In Blocktree a similar system is used when messages are dispatched to services.
 The Blocktree runtime takes care of routing these messages to the appropriate actors,
 spawning them if needed.
 
+% Message addressing modes.
+Messages can be addressed to services or specific actors.
+When addressing a specific actor,
+the message contains an \emph{actor name},
+which is a pair consisting of the path of the runtime hosting the actor and the \texttt{Uuid}
+identifying the specific actor in that runtime.
+When addressing a service,
+the message is dispatched using a \emph{service name},
+which contains the following fields:
+\begin{enumerate}
+  \item \texttt{service}: The path identifying the receiving service.
+  \item \texttt{scope}: A filesystem path used to specify the intended recipient.
+  \item \texttt{rootwards}: A boolean describing whether message delivery is attempted towards or
+    away from the root of the filesystem tree. A value of
+    \texttt{false} indicates that the message is intended for a runtime directly contained in the
+    scope. A value of \texttt{true} indicates that the message is intended for a runtime contained
+    in a parent directory of the scope and should be delivered to a runtime which has the requested
+    service registered and is closest to the scope.
+  \item \texttt{id}: An identifier for a specific service provider.
+\end{enumerate}
+The ID can be a \texttt{Uuid} or a \texttt{String}.
+It is treated as an opaque identifier by the runtime,
+but a service is free to associate additional meaning to it.
+Every message has a header containing the name of the sender and receiver.
+The receiver name can be an actor or service name,
+but the sender name is always an actor name.
+For example, to open a file in the filesystem,
+a message is dispatched with \texttt{call} using the service name of the filesystem service.
+The reply contains the name of the file actor spawned by the filesystem service which owns the opened
+file.
+Messages are then dispatched to the file actor using its actor name to read and write to the file.
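The two addressing modes described above can be sketched in Rust; all concrete type and field names beyond those in the text (`ActorName`, `ServiceId`, the use of `String` for paths and `u128` for the `Uuid`) are assumptions for illustration.

```rust
// Hypothetical sketch of the two message addressing modes.

/// Identifies a specific actor: the path of its hosting runtime
/// plus the Uuid of the actor in that runtime.
#[derive(Debug, Clone, PartialEq)]
pub struct ActorName {
    pub runtime_path: String, // filesystem path of the hosting runtime
    pub actor_id: u128,       // stand-in for a Uuid
}

/// The opaque service-provider identifier: a Uuid or a String.
#[derive(Debug, Clone, PartialEq)]
pub enum ServiceId {
    Uuid(u128),
    Name(String),
}

/// Identifies a service, per the four fields listed above.
#[derive(Debug, Clone, PartialEq)]
pub struct ServiceName {
    pub service: String, // path identifying the receiving service
    pub scope: String,   // filesystem path of the intended recipient
    pub rootwards: bool, // deliver toward (true) or away from (false) the root
    pub id: Option<ServiceId>,
}

fn main() {
    // Address the filesystem service on behalf of a scope, rootwards.
    let fs_service = ServiceName {
        service: "/system/fs".into(),
        scope: "/apps/example".into(),
        rootwards: true,
        id: None,
    };
    assert!(fs_service.rootwards);
}
```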
+
 % The runtime is implemented using tokio.
 The actor runtime is currently implemented using the Rust asynchronous runtime tokio.
 Actors are spawned as tasks in the tokio runtime,
 and multi-producer single consumer channels are used for message delivery.
 Because actors are just tasks,
 they can do anything a task can do,
-including awaiting other futures.
+including awaiting other \texttt{Future}s.
 Because of this, there is no need for the actor runtime to support short-lived worker tasks,
 as any such use-case can be accomplished by awaiting a set of \texttt{Future}s.
 This allows the runtime to focus on providing support for services.
@@ -195,23 +226,6 @@ and is ideal for a system focused on orchestrating services which may be used by
 % Delivering messages over the network.
 Messages can be forwarded between actor runtimes using a secure transport layer called
 \texttt{bttp}.
-Messages are addressed using \emph{actor names}.
-An actor name consists of the following fields:
-\begin{enumerate}
-  \item \texttt{service}: The path identifying the receiving service.
-  \item \texttt{scope}: A filesystem path used to specify the intended recipient.
-  \item \texttt{rootwards}: An enum describing whether message delivery is attempted towards or
-    away from the root of the filesystem tree. A value of
-    \texttt{false} indicates that the message is intended for a runtime directly contained in the
-    scope. A value of \texttt{true} indicates that the message is intended for a runtime contained
-    in a parent directory of the scope and should be delivered to a runtime which has the requested
-    service registered and is closest to the scope.
-  \item \texttt{id}: An identifier for a specific service provider.
-\end{enumerate}
-The ID can be a \texttt{Uuid} or a \texttt{String}.
-It is treated as an opaque identifier by the runtime,
-but a service is free to associate additional meaning to it.
-Every message has a header containing the name of the sender and receiver.
 The transport is implemented using the QUIC protocol, which integrates TLS for security.
 A \texttt{bttp} client may connect anonymously or using credentials.
 If an anonymous connection is attempted,
@@ -230,6 +244,8 @@ Because QUIC supports the concurrent use of many different streams,
 it serves as an ideal transport for a message oriented system.
 \texttt{bttp} uses different streams for independent messages,
 ensuring that head of line blocking does not occur.
+Note that although data from separate streams can arrive in any order,
+the protocol does provide reliable in-order delivery of data in a given stream.
 The same stream is used for sending the reply to a message dispatched with \texttt{call}.
 Once a connection is established,
 messages may flow in both directions (provided both runtimes have execute permissions for the other),
@@ -277,6 +293,27 @@ This actor could forward traffic delivered to it in messages over this connectio
 The set of actors which are able to access the connection is controlled by setting the filesystem
 permissions on the file for the runtime executing the actor owning the connection.
 
+% Actor ownership.
+The concept of ownership in programming languages is very useful for ensuring that resources are
+properly freed when the type using them dies.
+Because actors are used for encapsulating resources in Blocktree,
+a similar system of ownership is employed for this reason.
+An actor is initially owned by the actor that spawned it.
+An actor can only have a single owner,
+but the owner can grant ownership to another actor.
+An actor is not allowed to own itself,
+though it may be owned by the runtime.
+When the owner of an actor returns,
+the actor is sent a message instructing it to return.
+If it does not return after a timeout,
+it is interrupted.
+This is the opposite of how supervision trees work in Erlang.
+Instead of the parent receiving a message when the child returns,
+the child receives a message when the parent returns.
+Service providers spawned by the runtime are owned by it.
+They continue running until the runtime chooses to reclaim their resources,
+which can happen because they are idle or the runtime is overloaded.
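The ownership rule above can be illustrated with a toy sketch, using plain threads and channels rather than the actual runtime: when the owner returns, the owned actor is sent a message instructing it to return, and a timeout bounds how long the runtime waits before interrupting it. The `Ctl` type and channel wiring here are assumptions.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Control message sent by the runtime when an actor's owner returns.
enum Ctl {
    OwnerReturned,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Ctl>();
    let child = thread::spawn(move || {
        // The owned actor waits for a control message; on OwnerReturned
        // (or if the runtime's timeout elapses) it cleans up and returns.
        match rx.recv_timeout(Duration::from_secs(5)) {
            Ok(Ctl::OwnerReturned) | Err(_) => "returned",
        }
    });
    // The owner returns, so the runtime notifies the owned actor.
    tx.send(Ctl::OwnerReturned).unwrap();
    assert_eq!(child.join().unwrap(), "returned");
}
```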
+
 % Message routing to services.
 A service is identified by a Blocktree path.
 Only one service implementation can be registered in a particular runtime,
@@ -508,7 +545,7 @@ allowing a lightweight and secure VPN system to be built.
 
 \section{Filesystem}
 % The division of responsibilities between the sector and filesystem services.
-The responsibility for storing data in the system is shared between the filesystem and sector
+The responsibility for serving data in the system is shared between the filesystem and sector
 services.
 Most actors will access the filesystem through the filesystem service,
 which provides a high-level interface that takes care of the cryptographic operations necessary to
@@ -516,14 +553,14 @@ read and write files.
 The filesystem service relies on the sector service for actually persisting data.
 The individual sectors which make up a file are read from and written to the sector service,
 which stores them in the local filesystem of the computer on which it is running.
-A sector is the atomic unit of data storage.
-The sector service only supports reading and writing entire sectors at once.
-File actors spawned  by the filesystem service buffer reads and writes so until there is enough
+A sector is the atomic unit of data storage
+and the sector service only supports reading and writing entire sectors at once.
+File actors spawned by the filesystem service buffer reads and writes until there is enough
 data to fill a sector.
 Because cryptographic operations are only performed on full sectors,
 the cost of providing these protections is amortized over the size of the sector.
-Thus there is tradeoff between latency and throughput when selecting the sector size of a file.
-A smaller sector size means less latency while a larger one enables more throughput.
+Thus there is a tradeoff between latency and throughput when selecting the sector size of a file:
+a smaller sector size means less latency while a larger one enables more throughput.
 
 % Types of sectors: metadata, integrity, and data.
 A file has a single metadata sector, a Merkle sector, and zero or more data sectors.
@@ -544,7 +581,7 @@ a consensus cluster.
 This cluster is identified by a \texttt{u64} called the cluster's \emph{generation}.
 Every file is identified by a pair of \texttt{u64}, its generation and its inode.
 The sectors within a file are identified by an enum which specifies which type they are,
-and in the case of data sectors, their index.
+and in the case of data sectors, their 0-based index.
 \begin{verbatim}
   pub enum SectorKind {
     Meta,
@@ -552,17 +589,49 @@ and in the case of data sectors, their index.
     Data(u64),
   }
 \end{verbatim}
-The offset in the plaintext of the file at which each data sector begins can be calculated by
-multiplying the sectors offset by the sector size of the file.
+The byte offset in the plaintext of the file at which each data sector begins can be calculated by
+multiplying the sector's index by the sector size of the file.
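This offset rule is trivial to state in code; a minimal sketch, assuming only the function name:

```rust
// A data sector's starting byte offset in the file's plaintext is its
// 0-based index multiplied by the file's sector size.
fn data_sector_offset(index: u64, sector_size: u64) -> u64 {
    index * sector_size
}

fn main() {
    // With 4 KiB sectors, data sector 3 starts at byte 12288.
    assert_eq!(data_sector_offset(3, 4096), 12288);
}
```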
+The \texttt{SectorId} type is used to identify a sector.
+\begin{verbatim}
+  pub struct SectorId {
+    generation: u64,
+    inode: u64,
+    sector: SectorKind,
+  }
+\end{verbatim}
+
+% Types of messages handled by the sector service.
+Communication with the sector service is done by passing it messages of type \texttt{SectorMsg}.
+\begin{verbatim}
+  pub struct SectorMsg {
+    id: SectorId,
+    op: SectorOperation,
+  }
+
+  pub enum SectorOperation {
+    Read,
+    Write(WriteOperation),
+  }
+
+  pub enum WriteOperation {
+    Meta(Box<FileMeta>),
+    Data {
+      meta: Box<FileMeta>,
+      contents: Vec<u8>,
+    }
+  }
+\end{verbatim}
+Here \texttt{FileMeta} is the type used to store metadata for files.
+Note that updated metadata is required to be sent when a sector's contents are modified.
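Putting these types together, a hedged sketch of constructing a write message for a data sector; `FileMeta` is a placeholder unit struct here, since its fields are not given in the text.

```rust
// Placeholder for the file metadata type (fields not specified here).
#[derive(Debug)]
pub struct FileMeta;

#[derive(Debug, PartialEq)]
pub enum SectorKind {
    Meta,
    Merkle,
    Data(u64),
}

#[derive(Debug)]
pub struct SectorId {
    pub generation: u64,
    pub inode: u64,
    pub sector: SectorKind,
}

pub enum SectorOperation {
    Read,
    Write(WriteOperation),
}

pub enum WriteOperation {
    Meta(Box<FileMeta>),
    Data { meta: Box<FileMeta>, contents: Vec<u8> },
}

pub struct SectorMsg {
    pub id: SectorId,
    pub op: SectorOperation,
}

fn main() {
    // Write 4 bytes to data sector 0 of inode 7 in generation 1,
    // sending the updated metadata alongside the contents as required.
    let msg = SectorMsg {
        id: SectorId { generation: 1, inode: 7, sector: SectorKind::Data(0) },
        op: SectorOperation::Write(WriteOperation::Data {
            meta: Box::new(FileMeta),
            contents: vec![0xde, 0xad, 0xbe, 0xef],
        }),
    };
    assert_eq!(msg.id.sector, SectorKind::Data(0));
}
```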
 
 % Scaling horizontally: using Raft to create consensus cluster. Additional replication methods.
-When multiple multiple sector service providers are contained in the same directory,
-the sector service providers connect to each other to form a consensus cluster.
-This cluster uses the Raft protocol to synchronize the state of the sectors it stores.
-The system is currently designed to replicate all data to each of the service providers in the
-cluster.
-Additional replication methods are planned for implementation,
-such as consisting hashing and erasure encoding,
+A generation of sector service providers uses the Raft protocol to synchronize the state of the
+sectors it stores.
+The message passing interface of the runtime enables this implementation
+and the sector service's requirements were important considerations in designing this interface.
+The system currently replicates all data to each of the service providers in the cluster.
+Additional replication methods are planned for future implementation
+(e.g. erasure encoding and distribution via consistent hashing),
 which allow for different tradeoffs between data durability and storage utilization.
 
 % Scaling vertically: how different generations are stitched together.
@@ -571,53 +640,112 @@ First, a new directory is created in which the generation will be located.
 Next, one or more processes are credentialed for this directory,
 using a procedure which is described in the next section.
 The credentialing process produces files for each of the processes stored in the new directory.
-The sector service provider in each of the new processes uses service discovery to establish
-communication with its peers in the other processes.
-Finally, the service provider which is elected leader contacts the cluster in the root directory
+The sector service provider in each of the processes uses the filesystem service
+(which connects to the parent generation of the sector service)
+to find the other runtimes hosting the sector service in the directory and messages them to
+establish a fully-connected cluster.
+Finally, the service provider which is elected leader contacts the generation in the root directory
 and requests a new generation number.
 Once this number is known it is stored in the superblock for the generation,
 which is the file identified by the new generation number and inode 2.
-Note that the superblock is not contained in any directory and cannot be accessed by actors
-outside of the sector service.
-The superblock also contains information used to assign a inodes when a files are created.
-
-% Sector service discovery. Paths.
+The superblock is not contained in any directory and cannot be accessed outside the sector service.
+The superblock also keeps track of the next inode to assign to a new file.
 
-% The filesystem service is responsible for cryptographic operations. Client-side encryption.
+% Authorization logic of the sector service.
+To prevent malicious actors from writing invalid data,
+the sector service must cryptographically verify all write messages.
+The process it uses to do this involves several steps:
+\begin{enumerate}
+  \item The certificate chain in the metadata that was sent in the write message is validated.
+    It is considered valid if it ends with a certificate signed by the root principal
+    and the paths in the certificates are correctly nested,
+    indicating valid delegation of write authority at every step.
+  \item Using the last public key in the certificate chain,
+    the signature in the metadata is validated.
+    This signature covers all of the fields in the metadata.
+  \item The new sector contents in the write message are hashed using the digest function configured
+    for the file and the resulting hash is used to update the file's Merkle tree in its Merkle
+    sector.
+  \item The root of the Merkle tree is compared with the integrity value in the file's metadata.
+    The write message is considered valid if and only if there is a match.
+\end{enumerate}
+This same logic is used by file actors to verify the data they read from the sector service.
+Only once a write message is validated is it shared with the sector service provider's peers in
+its generation.
+Although the data in a file is encrypted,
+it is still beneficial for security to prevent unauthorized principals from gaining access to a
+file's ciphertext.
+To prevent this, a sector service provider checks a file's metadata to verify that the requesting
+principal actually has a readcap (to be defined in the next section) for the file.
+This ensures that only principals that are authorized to read a file can gain access to the file's
+ciphertext, metadata, and Merkle tree.
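Steps 3 and 4 of the verification process can be sketched as follows, with a toy 64-bit hash standing in for the file's configured digest function; the real service would use a cryptographic hash, and the integrity value would come from the already-verified signed metadata. All names here are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for the file's configured digest function.
fn h(bytes: &[u8]) -> u64 {
    let mut s = DefaultHasher::new();
    bytes.hash(&mut s);
    s.finish()
}

// Fold the leaf hashes pairwise up to a single root.
fn merkle_root(mut leaves: Vec<u64>) -> u64 {
    while leaves.len() > 1 {
        leaves = leaves
            .chunks(2)
            .map(|pair| {
                let mut s = DefaultHasher::new();
                pair.hash(&mut s);
                s.finish()
            })
            .collect();
    }
    leaves[0]
}

// Step 3: hash the new contents and update the file's Merkle tree.
// Step 4: accept the write only if the recomputed root matches the
// integrity value from the (already verified) metadata.
fn validate_write(leaf_hashes: &mut Vec<u64>, index: usize, contents: &[u8], signed_root: u64) -> bool {
    leaf_hashes[index] = h(contents);
    merkle_root(leaf_hashes.clone()) == signed_root
}

fn main() {
    let mut leaves = vec![h(b"sector 0"), h(b"sector 1")];
    // The metadata's integrity value reflects the new contents of sector 1.
    let mut expected = leaves.clone();
    expected[1] = h(b"new contents");
    let signed_root = merkle_root(expected);
    assert!(validate_write(&mut leaves, 1, b"new contents", signed_root));
    assert!(!validate_write(&mut leaves, 0, b"tampered", signed_root));
}
```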
+
+% File actors are responsible for cryptographic operations. Client-side encryption.
 The sector service is relied upon by the filesystem service to read and write sectors.
-Filesystem service providers communicate with the sector service to open files, read and write
-their contents, and update their metadata.
-These providers are responsible for verifying and decrypting the information contained in sectors
-and providing it to downstream actors.
-They are also responsible for encrypting and integrity protecting data written by downstream actors.
-Most of the complexity of implementing a filesystem is handled in the filesystem service.
-Most messages sent to the sector service only specify the operation (read or write), the identifier
-for the sector, and the sector contents.
-Every time a data sector is written an updated metadata sector is required to be sent in the same
-message.
-This requirement exists because a signature over the root of the file's Merkle tree is contained in
-the metadata,
-and since this root changes with every modification, it must be updated during every write.
-When the sector service commits a write it hashes the sector contents,
-updates the Merkle sector of the file, and updates the metadata sector.
-In order for the filesystem service to produce a signature over the root of the file's Merkle tree,
+Filesystem service providers communicate with the sector service to open files and perform
+filesystem operations.
+These providers spawn file actors that are responsible for verifying and decrypting the information
+contained in sectors and providing it to other actors.
+They use the credentials of the runtime they are hosted in to decrypt sector data using
+information contained in file metadata.
+File actors are also responsible for encrypting and integrity protecting data written to files.
+In order for a file actor to produce a signature over the root of the file's Merkle tree,
 it maintains a copy of the tree in memory.
-This copy is loaded from the sector service when the file is opened.
+This copy is read from the sector service when the file is opened.
 While this does mean duplicating data between the sector and filesystem services,
 this design was chosen to reduce the network traffic between the two services,
 as the entire Merkle tree does not need to be transmitted on every write.
-Encapsulating all cryptographic operations in the filesystem service allows the computer storing
-data to be different from the computer encrypting it.
+Encapsulating all cryptographic operations in the filesystem service and file actors allows the
+computer storing data to be different from the computer encrypting it.
 This approach allows client-side encryption to be done on more capable computers
-and for this task to be delegated to a storage server on low powered devices.
-
-% Description of how the filesystem layer: opens a file, reads, and writes.
-
-% Peer-to-peer data distribution in the filesystem service.
+and low powered devices to delegate this task to a storage server.
+
+% Prevention of resource leaks through ownership.
+A major advantage of using file actors to access file data is that they can be accessed over the
+network from a different runtime as easily as they can be from the same runtime.
+One complication arising from this approach is that file actors must not outlive the actor which
+caused them to be spawned.
+This is handled in the filesystem service by making the actor who opened the file the owner of the
+file actor.
+When a file actor receives notification that its owner returned,
+it flushes any buffered data in its cache and returns,
+ensuring that a resource leak does not occur.
+
+% Authorization logic of the filesystem service.
+The filesystem service uses an \texttt{Authorizer} type to make authorization decisions.
+It passes this type the authorization attributes of the principal accessing the file, the
+attributes of the file, and the type of access (read, write, or execute).
+The \texttt{Authorizer} returns a boolean indicating if access is permitted or denied.
+These access control checks are performed for every message processed by the filesystem service,
+including opening a file.
+A file actor only responds to messages sent from its owner,
+which allows it to avoid the overhead of repeating access control checks, as these were
+already carried out by the filesystem service when the actor was spawned.
+The file actor is configured when it is spawned to allow read only, write only, or read write
+access to a file,
+depending on what type of access was requested by the actor opening the file.
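A minimal sketch of such an Authorizer follows; the attribute types and the Unix-style mode-bit policy are illustrative assumptions, not the actual Blocktree policy, which only says that the check takes the principal's attributes, the file's attributes, and the access type, and returns a boolean.

```rust
#[derive(Clone, Copy, PartialEq)]
pub enum Access {
    Read,
    Write,
    Execute,
}

// Assumed attribute shapes for the sketch.
pub struct PrincipalAttrs { pub uid: u64, pub gid: u64 }
pub struct FileAttrs { pub owner_uid: u64, pub owner_gid: u64, pub mode: u16 }

pub struct Authorizer;

impl Authorizer {
    /// Returns true if access is permitted. This sketch applies
    /// Unix-style owner/group/other mode bits as a stand-in policy.
    pub fn authorized(&self, p: &PrincipalAttrs, f: &FileAttrs, access: Access) -> bool {
        let shift = if p.uid == f.owner_uid { 6 } else if p.gid == f.owner_gid { 3 } else { 0 };
        let bit: u16 = match access {
            Access::Read => 4,
            Access::Write => 2,
            Access::Execute => 1,
        };
        (f.mode >> shift) & bit != 0
    }
}

fn main() {
    let authz = Authorizer;
    let alice = PrincipalAttrs { uid: 1000, gid: 1000 };
    let file = FileAttrs { owner_uid: 1000, owner_gid: 1000, mode: 0o640 };
    assert!(authz.authorized(&alice, &file, Access::Read));
    assert!(!authz.authorized(&alice, &file, Access::Execute));
}
```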
 
 % Streaming replication.
-
-
+Often when building distributed systems it is convenient to alert any interested party that an event
+has occurred.
+To facilitate this pattern,
+the sector service allows actors to subscribe for notification of writes to a file.
+The sector service maintains a list of actors which are currently subscribed
+and when it commits a write to its local storage,
+it sends each of them a notification message identifying the sector written
+(but not the written data).
+By using different files to represent different events,
+a simple notification system can be built.
+Because the contents of a directory may be distributed over many different generations,
+this system does not support the recursive monitoring of directories.
+Although this system lacks the power of \texttt{inotify} in the Linux kernel,
+it does provide some of its benefits without incurring much of a performance overhead
+or implementation complexity.
+For example, this system can be used to implement streaming replication.
+This is done by subscribing to writes on all the files that are to be replicated,
+then reading new sectors as soon as notifications are received.
+These sectors can then be written into replica files in a different directory.
+This ensures that the contents of the replicas will be updated in near real-time.
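The subscription mechanism described above might look like the following sketch: the sector service keeps a subscriber list per file and, after committing a write, sends each subscriber the identifier of the sector written, but not the written data. All type and method names here are assumptions.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};

// A file is identified by its (generation, inode) pair.
type FileId = (u64, u64);

/// Notification identifying the sector written (not the written data).
#[derive(Debug, Clone, PartialEq)]
struct WriteNotice {
    file: FileId,
    sector_index: u64,
}

struct SectorService {
    subscribers: HashMap<FileId, Vec<Sender<WriteNotice>>>,
}

impl SectorService {
    fn new() -> Self {
        Self { subscribers: HashMap::new() }
    }

    fn subscribe(&mut self, file: FileId, tx: Sender<WriteNotice>) {
        self.subscribers.entry(file).or_default().push(tx);
    }

    fn commit_write(&self, file: FileId, sector_index: u64) {
        // ... commit the sector to local storage, then notify subscribers.
        if let Some(subs) = self.subscribers.get(&file) {
            for tx in subs {
                let _ = tx.send(WriteNotice { file, sector_index });
            }
        }
    }
}

fn main() {
    let mut svc = SectorService::new();
    let (tx, rx) = channel();
    // A replicator subscribes to writes on file (generation 1, inode 7).
    svc.subscribe((1, 7), tx);
    svc.commit_write((1, 7), 3);
    // The replicator learns which sector changed and can read it back
    // to copy into a replica file in another directory.
    assert_eq!(rx.recv().unwrap(), WriteNotice { file: (1, 7), sector_index: 3 });
}
```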
 
 \section{Cryptography}
 % The underlying trust model: self-certifying paths.

+ 18 - 0
doc/BlocktreeCloudPaper/notes.md

@@ -139,3 +139,21 @@ one runtime in each cluster is connected to one runtime in each of its child clu
 the message should eventually be delivered to the correct runtime.
 
 This means that the sector hosts will form a single connected component of the connection graph.
+
+## Representation of files by the filesystem service.
+My idea of using actors to own file handles has a significant drawback.
+If an actor which opened a file crashes,
+the file will remain open forever,
+resulting in a resource leak.
+An alternative would be to issue file handle structs to actors in local messages,
+but this will not work when the filesystem service is being accessed by a remote runtime.
+I could keep a table of file handles (integers) in the filesystem service,
+and access it similarly to how the filesystem struct is used today.
+This approach brings the overhead of an RwLock on the table and searching it for a specific
+file on every read or write.
+Perhaps I could have the file actor poll its owner periodically to see if it's still alive?
+Then it would be able to halt if the owning actor has crashed.
+To get this to work I'll need to reintroduce the ability to send messages to a specific actor,
+and solve the issue of handling undeliverable messages.
+This approach has the advantage of working over the network,
+and it does not introduce any overhead from maintaining a table.