123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208 |
- \documentclass{book}
- \usepackage{amsfonts,amssymb,amsmath,amsthm}
- \usepackage[scale=0.80]{geometry}
- \usepackage{hyperref}
- \begin{document}
- \tableofcontents
- \chapter{System Overview}
- The development of the internet was undoudedtly one of the greatest achievements in the 20th
- century, and the internet's killer app, the web, has reshaped our lives and the way we do
- business. But, for all the benefits we have received from these technologies there have
- corresponding costs. It's now possible to cheaply surveil entire popluations and discern their
- preferences and responses so that propaganda can be effectively created and distributed. The
- surveilence it not surepticious, it's quite overt. Consumers hand over this data willingly to
- tech companies because of the benefits they receive in return. But why should people be forced to
- choose between privacy and convenience?
- The cost of computing power, storage, and bandwidth are very cheap. A single board computer
- can provide more than enough computing power for the average web-browsing consumer. A classic rotating magnetic hard drive can hold terrabytes of data, more than enough to hold an individual's
- social media output. Umetered broadband internet access measured in hudreds of megabits per
- second is common, and becoming more so all the time. So with all these resources available,
- why is it that consumers do not more control over the computing infrastructure upon which
- they rely?
- The issue is that consumers' don't want to manage servers, networking, routing, and
- backup strategies. Each of these can be a full time job by itself. So in order for a consumer to
- be in control of their own infrastructure, the infrastructure needs to be able to take care of
- itself.
- Perhaps as important as consumer agency over their computing infrastructure, is the security of
- that infrastructure. Time and time again consumer trust has been violated by data breaches and
- service disruptions. These incidents show the need for a more secure and standardized system for
- storing application data and building web services. Such a system must provide both
- confidentiality of consumer data, and the ability to authenticate consumers without the
- need for insecure techniques, such as passwords.
- This document proposes a potential solution. It describes a system for organizing information into
- trees of blocks, distributing those blocks over a network of nodes, and a programming interface
- to access this information in a convenient way. Because no one piece of hardware
- is infallible, the system also includes mechanisms for nodes to contract with one another to store
- data. This allows data to be backed up and later restored in the case a node is lost. In order to
- ensure the free exchange of data amongst nodes, a digital currency is used to account for the
- available storage capacity of the network.
- The remainder of this chapter will give an overview of the system, with the remainder of the
- document going into specific details of each of the system's components.
- \section{Blocks}
- The basis of all trust in the system is a public-private keypair. The secrecy of the private key
- is the linchpin of all security in the system. If this key is compromised then all confidentiality
- and authenticity assurances are void. Further, if this key is lost, then control of the
- resources over which it has agency is also lost.
- Because of this, it's very important to protect this key by storing it in a secure location.
- It is envisioned that a TPM will be used for this purpose. A TPM will be important for other
- system features, as we'll see later.
- All data stored in the system is put into data structures called blocks. Each block has a path
- which describes it's location in the tree. The root of this path is always a hash (hex encoded)
- of the public key corresponding to the private key which is the root of trust for the tree. Each block is encrypted by a symmetric cipher using a randomly generated key, which is referred
- to as the block's key. The block key for the root block is
- encrypted using the root public key and the resulting cipher text is stored in the block itself,
- along with a hash of the public key for later identification. For all non-root blocks in the tree,
- their block key is encrypted using the block key of
- their parent, and the resulting ciphertext is stored in the block.
- This ensures that we can allow a party access to all
- blocks in a subtree by simply encrypting the block key at the root of that subtree using the
- party's public key. This mechanism is used to give node's that are controlled by the
- root selective access to data stored in the tree, by encrypting the block key of the root of the
- subtree using the node's public key.
- Integrity assurance of the block's content is achieved by a digital signature which covers all of
- the block's contents (except the signature itself of course). A certificate chain
- starting with the root key and ending with the key used by make this signature is included with
- the block. This ensures that any node which has been issued a certificate using the root key is
- able to write data into the tree. In particular the path of the block is covered, and since the
- path includes a hash of the root key, this means that anyone is able to independently verify
- the authenticity of the block by checking that the certificate chain was indeed signed by
- the root, that all other signatures in the chain are valid, and finally that the signature on
- the block itself is valid.
- The size of each block has yet to be determined. It's envisioned that they will be fairly large,
- on the order of 4 MB, so as to amortize the overhead of storing a certificate chain, and
- encapsulated keys as well as the cost of cryptographic operations.
- \section{Fragments}
- By itself this block structure would be useful for building a secure filesystem, but in order to
- be a durable storage system it needs an efficient way of backing up, or rather distributing data.
- This is the purpose of fragments.
- Blocks are distributed amongst nodes in the network by using a fountain code. The output symbols
- of this code are referred to as fragments. A code with a high performance implementation and good
- coding efficiency is an important design consideration for the system. For these reasons it's
- envisioned that the RaptorQ code will be used.
- After a block is created the creating node will need to distribute the data in the block to other
- nodes to ensure its persistence in case the node fails. It will create fragments as needed
- and advertise to other node's its desire to store them. Currency controlled by the root key is
- exchanged with these other nodes in exchange for contracts to store the fragments.
- When a node needs to rebuild data that was previously distributed in fragments, it connects to a
- subset of nodes containing fragments and, in parallel, downloads enough fragments to reconstruct
- the block. This same mechanism can be used to distribute block data to unaffiliated nodes in the
- network. It is a convenient load balancing and performance improvement, as the parallel downloads
- spread the load over multiple nodes and are not limited by the bandwidth between any pair.
- We keep track of which nodes hold fragments of a block by storing the IDs of these nodes in the
- block's parent. This list of node IDs can then be resolved to a list of IP addresses by looking
- up data in a shared data structure called the Public Blocktree.
- \section{The Public Blocktree}
- The Public blocktree is just another block tree, but one which is controlled by a distinguished
- private key, whose public key is hard-coded into the other node's in the network. This blocktree
- \section{Nodes and the Network}
- Each node in the network is identified by a public-private keypair and is issued a certificate
- trusted by the public root key. Nodes can be claimed by issuing them a certificate and
- then writing
- that certificate into the public blocktree. When a new node is claimed, currency is deposited into
- the account of the root key which claimed it. This currency is to account for the storage capacity
- that the new node brings to the network. This mechanism is the reason why the node must have a
- certificate trusted by the public root key, otherwise there would be no way to control the
- creation of currency.
- Nodes are identified by the hex encouding of the hash of their public key. This string is written
- into directory blocks to keep track of which nodes contain fragments of blocks in that directory.
- In order for this information to be useful, a mechanism is needed to resolve node IDs to IP
- addresses. This is accomplished by writing a block into the public blocktree with the node's
- IP address which is signed by the node's private key. Anytime the node receives a new IP address,
- it updates this block to inform the network of this change. Because the node's ID is derived from
- its private key, and the block containing its IP address is signed with this key, it's possible
- for a third party to independently verify that this IP address is authentic.
- When a node is given access to a blocktree, by issuing it a certificate, it is assigned a path
- under which its data will be stored. This path is referred to as the node's home. This path is
- written into the node's certificate. Thus a node's home is cryptographically verifiable and must
- be chosen when the node joins the blocktree. The node which issues the new node it's certificate
- grants the new node access to the block at its home path by encapsulating the block's key using
- the node's public key. Thus the new node can recover the block key and read the contents of its
- home block. If a node has its home at a block then its ID is written into the block.
- The data created by a node may optionally replicated to its parent node. This would be suitable
- for a lightweight or mobile device which needs to ensure its data is replicated immediately and
- doesn't have time to negotiate contracts for the storage fragments. However, for larger
- blocktrees, having non-replicating nodes is essential for scalability.
- More than one node can be housed at the same path, such nodes are called a cluster.
- Each node in the cluster stores
- copies of the same data and they coordinate with each other to ensure the consitency of this
- data. This is accomplished by electing a leader. All writes to blocks under the nodes home are
- sent to the leader. The leader then serializes these writes and sends them to the rest of the
- nodes. By default writes to blocks use optimistic concurrency, with the last write known to the
- leader being the winner. But if a node requires exclusive access to the data in a block it can
- request to the leader to lock it. Writes from nodes other than the locking node are rejected until
- the lock is released. The leader will release the lock on its own if no messages are received from
- the locking node after a timeout.
- If a path is configured to be replicated to its parent, then the leader at that path will maintain
- a connection to the the leader in a parent cluster. Note that the parent cluster need not be
- housed in the parent block, just at some ancestor block. Then, writes will be propagated through
- this connection to the parent cluster, where this process may continue if that cluster is also
- configured for replication. Distributed locking is similarly comunicated to the parent cluster,
- where the lock is only aquired with the parent's approval.
- \section{Programmatic Access to Data}
- No designer can hope to envsion all the potential applications that a consumer would want to have
- access to their data. That's why an important component of the system is the ability to run
- programs that can access data and provide services to other internet hosts, whether they are
- blocktree nodes or not. This is accomplished by providing a WebAssembly based execution
- environment with a system interface based on WASI. Information in the blocktree is available
- to programs using standard filesystem system calls that specify paths in a special directory.
- While some programs may wish to access blocks directly in this manner, others may wish to use
- an API at a higher level of abstraction. Thus there is also an API for creating arbitrarily sized
- files that will get mapped to fixed sized blocks, freeing the programmer from having to implment
- this themselves.
- Data provided by these filesystem APIs will be the most up-to-date versions known to the node.
- There's the possiblity that conflicts with other nodes may cause writes made by programs on the
- node to be rolled back or overwritten. Of course locks can be taken on files and blocks if a
- program must ensure exclusive access to data. Finally, an inotify-like API is provided for programs to be notified when changes to blocks occur.
- An important consideration for the design of this system was to facilitate the creation of web
- servers and other types of internet hosts which can serve data stored in a blocktree. For this
- reason there is a high level callback based API for declaring HTTP handlers, as well as handlers
- for blocktree specific messages.
- In order to provide the consumer with control over how their data is used a permissions system
- exists to control which blocks and APIs programs have access to. For instance a consumer would
- have to grant special permission for a program to be able to access the Berkeley sockets API.
- Programs are installed by specifing a blocktree path. This is a secure and convenient method of
- distribution as these programs can be downloaded from the nodes associated with the root of their
- path and the downloaded blocks can be cryptographically verified to be trusted by the root
- key. Authors wishing to distribute their programs in this manner will of course need to make the
- blocks containing them public (unencrypted), or else provide some mechanism for selective access.
- \chapter{Data Structures}
- \chapter{Nodes}
- \chapter{The Network}
- \chapter{Application Programming Interface}
- \chapter{Example Applications}
- \end{document}
|