123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445 |
- \documentclass{book}
- \usepackage{amsfonts,amssymb,amsmath,amsthm}
- \usepackage[scale=0.75]{geometry}
- \usepackage{hyperref}
- \begin{document}
- \tableofcontents
- \chapter{System Overview}
- % I should replace "consumer" with "user".
- The development of the internet was undoudedtly one of the greatest achievements in the 20th
- century, and the internet's killer app, the web, has reshaped our lives and the way we do
- business. But, for all the benefits we have received from these technologies there have
- corresponding costs. It's now possible to cheaply surveil entire popluations and discern their
- preferences and responses so that propaganda can be effectively created and distributed. The
- surveilence it not surepticious, it's quite overt. Consumers hand over this data willingly to
- tech companies because of the benefits they receive in return. But why should people be forced to
- choose between privacy and convenience?
- The cost of computing power, storage, and bandwidth are very cheap. A single board computer
- can provide more than enough computing power for the average web-browsing consumer. A classic rotating magnetic hard drive can hold terrabytes of data, more than enough to hold an individual's
- social media output. Umetered broadband internet access measured in hudreds of megabits per
- second is common, and becoming more so all the time. So with all these resources available,
- why is it that consumers do not more control over the computing infrastructure upon which
- they rely?
- The issue is that consumers' don't want to manage servers, networking, routing, and
- backup strategies. Each of these can be a full time job by itself. So in order for a consumer to
- be in control of their own infrastructure, the infrastructure needs to be able to take care of
- itself.
- Perhaps as important as consumer agency over their computing infrastructure, is the security of
- that infrastructure. Time and time again consumer trust has been violated by data breaches and
- service disruptions. These incidents show the need for a more secure and standardized system for
- storing application data and building web services. Such a system must provide both
- confidentiality of consumer data, and the ability to authenticate consumers without the
- need for insecure techniques, such as passwords.
- This document proposes a potential solution. It describes a system for organizing information into
- trees of blocks, the distribution of those blocks over a network of nodes, and a programming interface
- to access this information. Because no one piece of hardware
- is infallible, the system also includes mechanisms for nodes to contract with one another to store
- data. This allows data to be backed up and later restored in the case a node is lost. In order to
- ensure the free exchange of data amongst nodes, a digital currency is used to account for the
- available storage capacity of the network.
- The remainder of this chapter will give an overview of the system, with the remainder of the
- document going into specific details of each of the system's components.
- \section{Blocks}
- User data stored in this system is organized into structures called \emph{block trees}. Every block tree
- is identified with a public key. The private key that corresponds to a block tree's public key is
- required to control that tree. Any person who has the private key for a block tree is called that
- tree's owner.
- Computers participating in this system are called \emph{nodes}. Nodes are also identified by public keys, but
- these keys are not directly tied to block trees. Nodes that have access to a block trees data are said
- to be \emph{attached} to that block tree. Nodes can be attached to multiple block trees at once, or none
- at all.
- Block trees are of course trees of \emph{blocks}. Every block is identified by a string called a \emph{path},
- which describes its location in the tree. The root of this path is a hash (hex encoded)
- of the tree's public key, allowing blocks from any tree to be referred to. A block consists of three segments:
- a header, a payload and a signature.
- The payload is encrypted by a symmetric cipher using a randomly generated key. This randomly
- generated key is called the \emph{block's key}. To allow access to the payload, the block's key is encapsulated
- using other keys and the resulting cipher texts are stored in the block's header. These encapculated keys
- are referred to as read capabilities, or \emph{read caps} for short.
- The root node of every block tree contains a read cap for the block tree's public key.
- Every non-root block contains a read cap
- for the block's parent, which is to say the block's key is encapsulated using the its parent's block key.
- So when one has a read cap for a block, they can read the data in all blocks descended from that
- block. Because the owner of a block tree has a read cap for the root block, they can read all data
- stored in the tree. Other people (or nodes) can be given access to a subtree by granting them a read
- cap for the subtree's root. A block which contains public data is stored as cleartext with no read caps.
- While read caps provide for confidentiality, write caps provide for integrity. A \emph{write cap}
- for a block is a certificate chain which terminates at a certificate signed by the block tree's
- owner. Thus a self-signed certificate made using the tree's private key is a valid write cap
- for any block in the tree. By allowing a chain of certificates to be used, it's possible for
- the owner to give other people or nodes the ability to write data into their tree. The scope of
- this access is controlled by specifying the path under which writing is allowed to the certificiate.
- A write cap for a block is only valid if the path of the block is a contained in the path
- specified in every certificate in the chain.
- Both the header and the payload of a Block are protected using a private key signature. The writer
- of the block computes this signature using the private key which corresponds to the write cap
- for the block they're trying to write. In order to validate a block, this signature is validated, then
- the Write Cap is validated, and finally the hash of the public key of
- the last signer in the Write Cap chain is compared to the root of the Block's path. If these match,
- then the block is valid, as this means that an owner has given permission for the writer to write
- into their tree at this path.
- Accessing the data in a block requires several cryptographic operations, both for vaidation and
- for decryption. Because of this its important that blocks are relatively large, on the order of
- 4 MB, to amortize the cost of these operations.
- \section{Fragments}
- By itself this block structure would be useful for building a secure filesystem, but in order to
- be a durable storage system we need an efficient way of distributing data for redundancy and
- availability. This is the purpose of fragments.
- Blocks are distributed amongst nodes in the network using a fountain code. The output symbols
- of this code are referred to as \emph{fragments}. A code with a high performance implementation and good
- coding efficiency is an important design consideration for the system. For these reasons the
- RaptorQ code was chosen.
- In order to preserve the data in a newly created block, a node will need to distribute
- fragments to other nodes. It does this by advertising its desire to trade [currency]
- in its block tree for the storage of these fragments. \emph{[currency]} is a fungible
- token for the exchange of computing resources between nodes. Every block tree has
- some non-negative value for the amount of [currency] it controls. Nodes that are attached
- to a tree spend the tree's [currency] when paying other nodes for the storage of fragments.
- If another node is interested in making the exchange, it contacts the advertising node
- and both sign a contract. A \emph{contract} is a data structure signed by both nodes which
- states that hash of the fragment being stored and the amount of [currency] being exchanged
- for its storage. The contract is then stored in the public block tree (to be discussed below),
- so that [currency] can be transerfed between nodes and to create an accountability mechanism
- to prevent the storing node from acting in bad faith and deleting the fragment.
- When a node needs to retreive a block that was previously distributed in fragments, it connects to a
- subset of nodes containing the fragments and downloads enough to reconstruct
- the block. These downloads can be performed concurrently for greater speed. This same mechanism
- can be used to distribute public blocks to unaffiliated nodes. This mechanism facilitates load balancing
- and performance, as concurrent downloads
- spread the load over multiple nodes and are not limited by the bandwidth between any pair of nodes.
- The list of nodes containing the fragments of a block is called the block's \emph{node list}.
- A block's node list is stored in it's parent. This allows for any non-root block to be retreived.
- To allow the root block to be retrieved its node list is stored in the public block tree.
- \section{The Public Blocktree}
- \emph{The Public Block Tree} is a block tree which is known to all nodes. This is accomplished by
- providing all nodes with a hardcoded list of nodes that are attached to the public block tree.
- This is similar to the list of root DNS servers distributed with any networked operating system.
- Because the public block tree is only used for storing information that should be known to all
- nodes in the network, the payload of every block in it is cleartext. The public block tree serves
- only to facilitate the communication and exchange of data between nodes.
- One way that it does this is by containing a database of nodes and their IP addresses. A node
- which has a write cap to this database will only store an entry for a node if that node can provide
- a valid signed request. This signed request is stored in the database verbatim, so that other nodes can
- independently verify its validity. Thus the nodes in the network can use this database to securely resolve
- the IDs of other node's to their IP addresses.
- The other function of the public block tree is to contain a list of transactions and disputes.
- This list is referred to as the \emph{public log}.
- When a node is created, an event is logged detailing the amount of [currency] the node is worth.
- When a node is first attached to a block tree, this [currency] is then removed from the node
- and added to the block tree. When a node signs a contract with another node, it is stored in the
- log and [currency] is removed by the sending node's block tree and added to the receiving node's.
- In order to discourage nodes from receive payment for the storage of a fragment, then deleting
- the fragment to reclaim disk space, a reporting mechanism exists. If a node is unable to retrieve
- a fragment that it previously stored with another node, then it sends an event to the log
- indicating this. The other node can then respond by sending an event which contains the actual
- fragment which was requested. This allows all the nodes in the network to view the log and
- see if a node that they are considering signing a contract with is trustworthy. If they
- are not the defendant in any disputes, then they should be safe. If they are in one, but responded
- quickly with the fragment, then it could have been a transient network issue. If they never
- responded, then they are risky and should perhaps receive a lower payment for the storage
- of the fragment.
- Finally, the public block tree stores node lists for the root blocks of every block tree.
- This ensures that even if every node that participates in a block tree fails, the block
- tree can still be recovered from its fragments, provided its private key is known.
- \section{Nodes and the Network}
- Each node in the network has a public-private keypair. The string formed by hex encoding the
- hash of a node's public key is referred to as the \emph{node ID} of the node. When nodes
- are manufactured they are issued a certificate trusted by the
- public block tree. New nodes are claimed by issuing them a certificate and then writing
- that certificate into the public log. When a new node is claimed, currency is credited to
- the block tree which claimed it. This currency is to account for the storage capacity
- that the new node brings to that block tree. This mechanism is the reason why the node must have a
- certificate trusted by the public block tree, otherwise there would be no way to control the
- creation of currency.
- Nodes are identified by their node ID in the public block tree and in node lists. Nodes
- are responsible for updating their IP address in the public block tree whenever it changes.
- When a node is attached to a block tree it is issued a certificate containing a path
- under which its data will be stored. We say the the node is attached to the block tree at that path.
- The node which issues the node its certificate creates a read cap for it and stores it in the
- block where the node is attached.
- The data created by a node may optionally be replicated in its parent node. This would be suitable
- for a lightweight or mobile device which needs to ensure its data is replicated immediately and
- doesn't have time to negotiate contracts for the storage of fragments. For larger
- block trees, having non-replicating nodes is essential for scalability.
- More than one node can be attached at the same path, and when this happens a \emph{cluster} is formed.
- Each node in the cluster stores
- copies of the same data and they coordinate with each other to ensure the consitency of this
- data. This is accomplished by electing a leader. All writes to blocks under the attachment point are
- sent to the leader. The leader then serializes these writes and sends them to the rest of the
- nodes. By default writes to blocks use optimistic concurrency, with the last write known to the
- leader being the winner. But if a node requires exclusive access to a block it can
- make a request to the leader to lock it. Writes from nodes other than the locking node are rejected until
- the lock is released. The leader will release the lock on its own if no messages are received from
- the locking node after a timeout.
- If the attachment point is configured to be replicated to its parent, then the leader will maintain
- a connection to the the leader in the parent cluster. Note that the parent cluster need not be
- housed in the parent block, just at some ancestor block. Then, writes will be propagated through
- this connection to the parent cluster, where this process may continue if that cluster is also
- configured for replication. Distributed locking is similarly comunicated to the parent cluster,
- where the lock is only aquired with the parent's approval.
- \section{Programmatic Access to Data}
- No designer can hope to envsion all the potential applications that a person would want to have
- access to their data. That's why an important component of the system is the ability to run
- programs that can access data and provide services to other internet hosts, whether they are
- blocktree nodes or not. This is accomplished by providing a WebAssembly based execution
- environment with a system interface based on WASI. Information in the blocktree is available
- to programs using standard filesystem system calls that specify paths in a special directory.
- While some programs may wish to access blocks directly in this manner, others may wish to use
- an API at a higher level of abstraction. Thus there is also an API for creating arbitrarily sized
- files that will get mapped to fixed sized blocks, freeing the programmer from having to implment
- this themselves.
- Data provided by these filesystem APIs will be the most up-to-date versions known to the node.
- There's the possiblity that conflicts with other nodes may cause writes made by programs on the
- node to be rolled back or overwritten. Of course locks can be taken on files and blocks if a
- program must ensure exclusive access to data. Finally, an inotify-like API is provided for programs to be notified when changes to blocks occur.
- An important consideration for the design of this system was to facilitate the creation of web
- servers and other types of internet hosts which can serve data stored in a blocktree. For this
- reason there is a high level callback based API for declaring HTTP handlers, as well as handlers
- for blocktree specific messages.
- In order to provide the consumer with control over how their data is used a permissions system
- exists to control which blocks and APIs programs have access to. For instance a consumer would
- have to grant special permission for a program to be able to access the Berkeley sockets API.
- Programs are installed by specifing a blocktree path. This is a secure and convenient method of
- distribution as these programs can be downloaded from the nodes associated with the root of their
- path and the downloaded blocks can be cryptographically verified to be trusted by the root
- key. Authors wishing to distribute their programs in this manner will of course need to make the
- blocks containing them public (unencrypted), or else provide some mechanism for selective access.
- \chapter{Concepts}
- \section{Blocks}
- A block is a sequence of bytes and a sequence of events. The sequence of events define how the
- sequence of bytes came to be in its current state. At any time the sequence of bytes can
- be recreated by replaying the events starting with an empty sequence of bytes. Thus the sequence of
- events is considered the canonical form of of the block, with the sequence of bytes simply enabling
- efficient reads.
- Blocks are hierarchical, with every block having at most one parent and zero or more children.
- If a block has children then it is called a directory, and its children are called its directory
- entries. This hierarchical structure allows us to identify blocks using their position in the
- hierarchy.
- \section{Paths}
- A path is a globally unique identifier assigned to a block. The path of the block defines its
- position in the hierarchy of blocks. The syntax of a path is as follows:
- \begin{verbatim}
- COMP ::= '[\w-_\.]'+
- RelPath ::= COMP ('/' COMP)*
- AbsPath ::= '/' RelPath*
- \end{verbatim}
- In other words, a path is a sequence of components, represented texturally as `/' separated
- fields. The empty sequence of components is called the root path.
- The root path is a directory, and its entries are the blocks of the global blocktree and links
- to every private blocktree.
- A path to a private blocktree has only
- one component, consisting of the Hash of the private blocktree's root public signing key. Any path
- with only one components which consists of a fingerprint
- of the blocktree's root public key is a valid path to the blocktree. These paths are
- called the root paths of the private blocktree, and any path that begins with one of them
- is simply called a private path. Conversely, paths that do not begin with them are called public
- paths.
- If one path is a prefix of a second, then we say the first path contains the second and that the
- second is nested in the first.
- Thus every path is contained in the root path, and every private path is contained in the root path
- of a private blocktree.
- Note that if path one is equal to path two, then path one also contains path two, and vice-versa.
- In addition to identifying blocks, paths are used to scope capabilities and to address messages.
- \section{Principals}
- A principal is any entity which can be authenticated.
- All authentication in the blocktree system is performed using digital signatures and as such
- principals are identified by a cryptographic hash of their public signing key. Support
- for the following hash algorithms is required:
- \begin{enumerate}
- \item SHA2 256
- \item SHA2 512
- \end{enumerate}
- These are referred to as the Standard Hash Algorithms and a digest computed using one of them is
- referred to as a Standard Hash.
- When a principal is identified in a textural representation, such as in a path,
- the following syntax is used:
- \begin{verbatim}
- PrincipTxt ::= <hash algo index> '!' <Base64Url(hash)>
- \end{verbatim}
- where ``hash algo index'' is the base 10 string representation of the index of the hash algorithm in
- the above list, and ``Base64Url(hash)'' is the Base64Url encoding of the hash data identifying the
- principal. Such a textural representation is referred to as a fingerprint of the public key from
- which the hash was computed. In a slight abuse of language, we sometimes refer to the fingerprint or
- hash data as a principal, even though these data merely identify a principal.
- Principals can be issued capabilities which allow them to read and write to blocks.
- These capabilities are scoped by paths, which limit the set of blocks the capability can be used
- on. A capability can only be used on a block whose path is contained in the path of the
- capability. This access control mechanism is enforced cryptographically.
- A principal can grant a capability to another principal so long as the granted capability has a
- path which is contained in the capability possessed by the granting node. The specific mechanisms
- for how this is done differs depending on whether the capability is for reading or writing to a
- block.
- Every private blocktree is associated with a root principal. The path containing only one
- component which consists of the fingerprint of the principal's public key is a root path to the
- private block tree. The root principal has read and write capabilities for the private blocktree's
- root, and so can grant subordinate capabilities scoped to any block in the private blocktree.
- \section{Readcaps}
- In order to protect the confidentiality of the data stored in a block, a symmetric cypher is used.
- The key for this cipher is called block key, and by controlling access to it we can control which
- principals can read the block.
- The metadata of every block contains a dictionary of zero or more entries called read capabilities,
- or readcaps for short. A key in this dictionary is a hash of the public signing key of the principal
- for which the readcap was issued and the value is the encryption of the block key using the
- principal's public encryption key.
- The block key is also encrypted using the block key of the parent block, and the resulting cipher
- text is also stored in the block's metadata. This is referred to as the inherited readcap.
- Hence, a principal which has a read capability for a
- given path can read all paths contained in it as well. Further, a principal can use it's
- private encryption key to decrypt the block key, and then re-encrypt it using the public encryption
- key of another principal, thus allowing the principal to grant a subordinate readcap.
- A block that does not require confidentiality protection need not be encrypted. In this case, the
- table of readcaps is empty and the inherited readcap is set to a flag value to indicate the block
- is stored as plaintext. Blocks in the global blocktree are never encrypted.
- \section{Writecaps}
- A write capability, or writecap for short, is a certificate chain which extends from a node's
- credentials back to the root signing key of a private blocktree.
- Each certificate in this chain contains
- the public key of the principal who issued it and the path that it grants write capabilities to.
- The path of a certificate must be contained in the path of the previous certificate in the chain,
- and the chain must terminate in a certificate which is self-signed by the root signing credentials.
- Each certificate also contains an expiration time, represented by an Epoch value, and a writecap
- shall be considered invalid by a node if any certificate has an expiration time which it judges
- to be in the past.
- A block's metadata contains a writecap, which allows it to be verified by checking the digital
- signature on the metadata using the first certificate in the writecap, and then checking that the
- writecap is valid. This ensures that the chain of trust from the root signing credentials extends
- to the block's metadata. To extend this chain to the block's data, a Merkle Tree is used, as
- described below. A separate mechanism is used to ensure the integrity of blocks in the global
- block tree.
- \section{Sectors}
- A block's data is logically broken up into fixed sized sectors, and these are the units of
- confidentiality protection, integrity protection and consensus. When a process performs writes on
- an open block, those writes are buffered until a sector is filled (or until the process calls
- flush). Once this happens, the contents of the buffer are encrypted using the block key and then
- hashed. This hash, along with the offset at which the write occurred, is then used to update the
- Merkle Tree for the block. Once the new root of the Merkle Tree is computed, it is copied into the
- block's metadata. After ensuring it's writecap is copied into the metadata, the node then signs
- the metadata using its private signing key.
- When the block is later opened its metadata is verified using the process described in the previous
- section. Then the value of the root Merkle Tree node in the metadata is compared to the value in the
- Merkle Tree for the block. If it matches, then reads to any offset into the block can be verified
- using the Merkle Tree. Otherwise, the block is rejected as invalid.
- The sector size can be configured for each block individually at creation time,
- but cannot be changed afterwards.
- However, the same effect can be achieved by creating a new block with the desired sector
- size, copying the data from the old block into it, deleting the old block and renaming the new one.
- The choice of sector size for a given block represents a tradeoff between the amount of space the
- Merkle Tree occupies occupies and the latency experienced by the initial read to
- a sector. With a larger sector size, the Merkle Tree size is reduced, but more data has to be
- hashed and decrypted when each sector is read. Depending on the size of the block and its intended
- usage, the optimal choice will vary.
- It's important to note that the entire Merkle Tree for a block is held in memory for as long as the
- block is open. So, assuming the 32 byte SHA2 256 hash and the default 4 KiB sector size are used,
- if a 3 GiB file is opened then its MerkleTree will occupy approximately 64 MiB of memory, but will
- enable fast random access. Conversely, if 64 KiB sectors are used the Merkle Tree for the same
- 3 GiB block will occupy approximately 4 MiB, but it's random access performance may suffer from
- increased latency.
- \section{Nodes}
- The independent computing systems participating in the Blocktree system are called nodes. A node
- need not run on hardware distinct from other nodes, but it is logically distinct from all other
- nodes. In all contemporary operating systems a node can be implemented as a process.
- A node posses it own unique credentials, which consist of two public and private key pairs,
- one used in an encryption scheme and another in a signing scheme. A node is a Principal
- and a hash of its public signing key is used to identify it. In a slight abuse of language, this
- hash is referred to as the node's principal.
- Nodes are also identified by paths.
- A node is responsible for storing the blocks contained in the directory where it is
- attached, unless there is a second node whose parent directory is contained in the parent directory
- of the first and which contains the block's path.
- In other words, the node whose parent directory is closest to the block is responsible for storing
- the block.
- This allows data storage to scale as more nodes are added to a blocktree.
- When a directory contains more than one node a cluster is formed.
- The nodes in the cluster run the Raft consensus protocol in order to agree on the sequence
- of events which constitutes each of the blocks contained in the directory where they're attached.
- This allows for redundancy, load balancing, and increased performance, as different sectors
- of a block can be read from multiple nodes in the cluster concurrently.
- \section{Processes}
- Nodes run code and that running code is called a process. Processes are spawned by the node on which
- their running based on configuration stored in the node's parent directory.
- The code which they're spawned
- from is also stored in a block, though that block need not be contained in the directory of the
- node running it. So, for example, a software developer can publish code in their
- blocktree, and a user can run it by updating the configuring of a node with the path it was
- published to.
- Code which is stored in it's own directory with a manifest describing it is called an app.
- There are two types of apps, portable and native. A portable app is a collection of WebAssembly
- (Wasm) modules which are executed by the node in a special Wasm runtime which exposes a
- messaging API which can be used to perform block IO and communicate with other processes.
- Native apps on the other hand are container images containing binaries compiled to a specific
- target architecture. These containers can access blocktree messaging services via a native library
- using the C ABI, and they have access to blocks via a POSIX-compatible
- filesystem API. The software which runs the node itself is distributed using this mechanism.
- Regardless of their type, all apps require permissions to communicate with the outside world, with
- the default only allowing them to communicate with their parent.
- \chapter{Nodes}
- \chapter{The Network}
- \chapter{Application Programming Interface}
- \chapter{Example Applications}
- \end{document}
|