\documentclass{book}
\usepackage{amsfonts,amssymb,amsmath,amsthm}
\usepackage[scale=0.75]{geometry}
\usepackage{hyperref}

\begin{document}
\tableofcontents
\chapter{System Overview}
% I should replace "consumer" with "user".
The development of the internet was undoudedtly one of the greatest achievements in the 20th 
century, and the internet's killer app, the web, has reshaped our lives and the way we do
business. But, for all the benefits we have received from these technologies there have
corresponding costs. It's now possible to cheaply surveil entire popluations and discern their
preferences and responses so that propaganda can be effectively created and distributed. The
surveilence it not surepticious, it's quite overt. Consumers hand over this data willingly to
tech companies because of the benefits they receive in return. But why should people be forced to
choose between privacy and convenience?

The cost of computing power, storage, and bandwidth are very cheap. A single board computer
can provide more than enough computing power for the average web-browsing consumer. A classic rotating magnetic hard drive can hold terrabytes of data, more than enough to hold an individual's
social media output. Umetered broadband internet access measured in hudreds of megabits per
second is common, and becoming more so all the time. So with all these resources available,
why is it that consumers do not more control over the computing infrastructure upon which
they rely?

The issue is that consumers' don't want to manage servers, networking, routing, and
backup strategies. Each of these can be a full time job by itself. So in order for a consumer to
be in control of their own infrastructure, the infrastructure needs to be able to take care of
itself.

Perhaps as important as consumer agency over their computing infrastructure, is the security of
that infrastructure. Time and time again consumer trust has been violated by data breaches and
service disruptions. These incidents show the need for a more secure and standardized system for
storing application data and building web services. Such a system must provide both
confidentiality of consumer data, and the ability to authenticate consumers without the
need for insecure techniques, such as passwords.

This document proposes a potential solution. It describes a system for organizing information into
trees of blocks, the distribution of those blocks over a network of nodes, and a programming interface
to access this information. Because no one piece of hardware
is infallible, the system also includes mechanisms for nodes to contract with one another to store
data. This allows data to be backed up and later restored in the case a node is lost. In order to
ensure the free exchange of data amongst nodes, a digital currency is used to account for the
available storage capacity of the network.

The remainder of this chapter will give an overview of the system, with the remainder of the
document going into specific details of each of the system's components.

\section{Blocks}
User data stored in this system is organized into structures called \emph{block trees}. Every block tree
is identified with a public key. The private key that corresponds to a block tree's public key is
required to control that tree. Any person who has the private key for a block tree is called that
tree's owner.

Computers participating in this system are called \emph{nodes}. Nodes are also identified by public keys, but
these keys are not directly tied to block trees. Nodes that have access to a block trees data are said
to be \emph{attached} to that block tree. Nodes can be attached to multiple block trees at once, or none
at all.

Block trees are of course trees of \emph{blocks}. Every block is identified by a string called a \emph{path},
which describes its location in the tree. The root of this path is a hash (hex encoded)
of the tree's public key, allowing blocks from any tree to be referred to. A block consists of three segments:
a header, a payload and a signature.

The payload is encrypted by a symmetric cipher using a randomly generated key. This randomly
generated key is called the \emph{block's key}. To allow access to the payload, the block's key is encapsulated
using other keys and the resulting cipher texts are stored in the block's header. These encapculated keys
are referred to as read capabilities, or \emph{read caps} for short.
The root node of every block tree contains a read cap for the block tree's public key.
Every non-root block contains a read cap
for the block's parent, which is to say the block's key is encapsulated using the its parent's block key.
So when one has a read cap for a block, they can read the data in all blocks descended from that
block. Because the owner of a block tree has a read cap for the root block, they can read all data
stored in the tree. Other people (or nodes) can be given access to a subtree by granting them a read
cap for the subtree's root. A block which contains public data is stored as cleartext with no read caps.

While read caps provide for confidentiality, write caps provide for integrity. A \emph{write cap}
for a block is a certificate chain which terminates at a certificate signed by the block tree's
owner. Thus a self-signed certificate made using the tree's private key is a valid write cap
for any block in the tree. By allowing a chain of certificates to be used, it's possible for
the owner to give other people or nodes the ability to write data into their tree. The scope of
this access is controlled by specifying the path under which writing is allowed to the certificiate.
A write cap for a block is only valid if the path of the block is a contained in the path
specified in every certificate in the chain.

Both the header and the payload of a Block are protected using a private key signature. The writer
of the block computes this signature using the private key which corresponds to the write cap 
for the block they're trying to write. In order to validate a block, this signature is validated, then
the Write Cap is validated, and finally the hash of the public key of
the last signer in the Write Cap chain is compared to the root of the Block's path. If these match,
then the block is valid, as this means that an owner has given permission for the writer to write
into their tree at this path.

Accessing the data in a block requires several cryptographic operations, both for vaidation and
for decryption. Because of this its important that blocks are relatively large, on the order of
4 MB, to amortize the cost of these operations.

\section{Fragments}
By itself this block structure would be useful for building a secure filesystem, but in order to
be a durable storage system we need an efficient way of distributing data for redundancy and
availability. This is the purpose of fragments.

Blocks are distributed amongst nodes in the network using a fountain code. The output symbols
of this code are referred to as \emph{fragments}. A code with a high performance implementation and good
coding efficiency is an important design consideration for the system. For these reasons the
RaptorQ code was chosen.

In order to preserve the data in a newly created block, a node will need to distribute
fragments to other nodes. It does this by advertising its desire to trade [currency]
in its block tree for the storage of these fragments. \emph{[currency]} is a fungible
token for the exchange of computing resources between nodes. Every block tree has
some non-negative value for the amount of [currency] it controls. Nodes that are attached
to a tree spend the tree's [currency] when paying other nodes for the storage of fragments.

If another node is interested in making the exchange, it contacts the advertising node
and both sign a contract. A \emph{contract} is a data structure signed by both nodes which
states that hash of the fragment being stored and the amount of [currency] being exchanged
for its storage. The contract is then stored in the public block tree (to be discussed below),
so that [currency] can be transerfed between nodes and to create an accountability mechanism
to prevent the storing node from acting in bad faith and deleting the fragment.

When a node needs to retreive a block that was previously distributed in fragments, it connects to a
subset of nodes containing the fragments and downloads enough to reconstruct 
the block. These downloads can be performed concurrently for greater speed. This same mechanism
can be used to distribute public blocks to unaffiliated nodes. This mechanism facilitates load balancing
and performance, as concurrent downloads
spread the load over multiple nodes and are not limited by the bandwidth between any pair of nodes.

The list of nodes containing the fragments of a block is called the block's \emph{node list}.
A block's node list is stored in it's parent. This allows for any non-root block to be retreived.
To allow the root block to be retrieved its node list is stored in the public block tree.

\section{The Public Blocktree}
\emph{The Public Block Tree} is a block tree which is known to all nodes. This is accomplished by
providing all nodes with a hardcoded list of nodes that are attached to the public block tree.
This is similar to the list of root DNS servers distributed with any networked operating system.
Because the public block tree is only used for storing information that should be known to all
nodes in the network, the payload of every block in it is cleartext. The public block tree serves
only to facilitate the communication and exchange of data between nodes.

One way that it does this is by containing a database of nodes and their IP addresses. A node
which has a write cap to this database will only store an entry for a node if that node can provide
a valid signed request. This signed request is stored in the database verbatim, so that other nodes can
independently verify its validity. Thus the nodes in the network can use this database to securely resolve
the IDs of other node's to their IP addresses.

The other function of the public block tree is to contain a list of transactions and disputes.
This list is referred to as the \emph{public log}.
When a node is created, an event is logged detailing the amount of [currency] the node is worth.
When a node is first attached to a block tree, this [currency] is then removed from the node
and added to the block tree. When a node signs a contract with another node, it is stored in the
log and [currency] is removed by the sending node's block tree and added to the receiving node's.

In order to discourage nodes from receive payment for the storage of a fragment, then deleting
the fragment to reclaim disk space, a reporting mechanism exists. If a node is unable to retrieve
a fragment that it previously stored with another node, then it sends an event to the log
indicating this. The other node can then respond by sending an event which contains the actual
fragment which was requested. This allows all the nodes in the network to view the log and
see if a node that they are considering signing a contract with is trustworthy. If they
are not the defendant in any disputes, then they should be safe. If they are in one, but responded
quickly with the fragment, then it could have been a transient network issue. If they never
responded, then they are risky and should perhaps receive a lower payment for the storage
of the fragment.

Finally, the public block tree stores node lists for the root blocks of every block tree.
This ensures that even if every node that participates in a block tree fails, the block
tree can still be recovered from its fragments, provided its private key is known.

\section{Nodes and the Network}
Each node in the network has a public-private keypair. The string formed by hex encoding the
hash of a node's public key is referred to as the \emph{node ID} of the node. When nodes
are manufactured they are issued a certificate trusted by the
public block tree. New nodes are claimed by issuing them a certificate and then writing
that certificate into the public log. When a new node is claimed, currency is credited to
the block tree which claimed it. This currency is to account for the storage capacity
that the new node brings to that block tree. This mechanism is the reason why the node must have a
certificate trusted by the public block tree, otherwise there would be no way to control the
creation of currency.

Nodes are identified by their node ID in the public block tree and in node lists. Nodes
are responsible for updating their IP address in the public block tree whenever it changes.

When a node is attached to a block tree it is issued a certificate containing a path
under which its data will be stored. We say the the node is attached to the block tree at that path.
The node which issues the node its certificate creates a read cap for it and stores it in the
block where the node is attached.

The data created by a node may optionally be replicated in its parent node. This would be suitable
for a lightweight or mobile device which needs to ensure its data is replicated immediately and
doesn't have time to negotiate contracts for the storage of fragments. For larger 
block trees, having non-replicating nodes is essential for scalability.

More than one node can be attached at the same path, and when this happens a \emph{cluster} is formed.
Each node in the cluster stores
copies of the same data and they coordinate with each other to ensure the consitency of this
data. This is accomplished by electing a leader. All writes to blocks under the attachment point are
sent to the leader. The leader then serializes these writes and sends them to the rest of the
nodes. By default writes to blocks use optimistic concurrency, with the last write known to the
leader being the winner. But if a node requires exclusive access to a block it can
make a request to the leader to lock it. Writes from nodes other than the locking node are rejected until
the lock is released. The leader will release the lock on its own if no messages are received from
the locking node after a timeout.

If the attachment point is configured to be replicated to its parent, then the leader will maintain
a connection to the the leader in the parent cluster. Note that the parent cluster need not be
housed in the parent block, just at some ancestor block. Then, writes will be propagated through
this connection to the parent cluster, where this process may continue if that cluster is also
configured for replication. Distributed locking is similarly comunicated to the parent cluster,
where the lock is only aquired with the parent's approval.

\section{Programmatic Access to Data}
No designer can hope to envsion all the potential applications that a person would want to have
access to their data. That's why an important component of the system is the ability to run
programs that can access data and provide services to other internet hosts, whether they are
blocktree nodes or not. This is accomplished by providing a WebAssembly based execution
environment with a system interface based on WASI. Information in the blocktree is available
to programs using standard filesystem system calls that specify paths in a special directory.
While some programs may wish to access blocks directly in this manner, others may wish to use
an API at a higher level of abstraction. Thus there is also an API for creating arbitrarily sized
files that will get mapped to fixed sized blocks, freeing the programmer from having to implment
this themselves.

Data provided by these filesystem APIs will be the most up-to-date versions known to the node.
There's the possiblity that conflicts with other nodes may cause writes made by programs on the
node to be rolled back or overwritten. Of course locks can be taken on files and blocks if a
program must ensure exclusive access to data. Finally, an inotify-like API is provided for programs to be notified when changes to blocks occur.

An important consideration for the design of this system was to facilitate the creation of web
servers and other types of internet hosts which can serve data stored in a blocktree. For this
reason there is a high level callback based API for declaring HTTP handlers, as well as handlers
for blocktree specific messages.

In order to provide the consumer with control over how their data is used a permissions system
exists to control which blocks and APIs programs have access to. For instance a consumer would
have to grant special permission for a program to be able to access the Berkeley sockets API.

Programs are installed by specifing a blocktree path. This is a secure and convenient method of
distribution as these programs can be downloaded from the nodes associated with the root of their
path and the downloaded blocks can be cryptographically verified to be trusted by the root 
key. Authors wishing to distribute their programs in this manner will of course need to make the
blocks containing them public (unencrypted), or else provide some mechanism for selective access.

\chapter{Concepts}

\section{Blocks}
A block is a sequence of bytes and a sequence of events. The sequence of events define how the
sequence of bytes came to be in its current state. At any time the sequence of bytes can
be recreated by replaying the events starting with an empty sequence of bytes. Thus the sequence of
events is considered the canonical form of of the block, with the sequence of bytes simply enabling
efficient reads.

Blocks are hierarchical, with every block having at most one parent and zero or more children.
If a block has children then it is called a directory, and its children are called its directory
entries. This hierarchical structure allows us to identify blocks using their position in the
hierarchy.

\section{Paths}
A path is a globally unique identifier assigned to a block. The path of the block defines its
position in the hierarchy of blocks. The syntax of a path is as follows:
\begin{verbatim}
    COMP ::= '[\w-_\.]'+
    RelPath ::= COMP ('/' COMP)* 
    AbsPath ::= '/' RelPath*
\end{verbatim}
In other words, a path is a sequence of components, represented texturally as `/' separated
fields. The empty sequence of components is called the root path.
The root path is a directory, and its entries are the blocks of the global blocktree and links
to every private blocktree.
A path to a private blocktree has only
one component, consisting of the Hash of the private blocktree's root public signing key. Any path
with only one components which consists of a fingerprint 
of the blocktree's root public key is a valid path to the blocktree. These paths are
called the root paths of the private blocktree, and any path that begins with one of them
is simply called a private path. Conversely, paths that do not begin with them are called public
paths.

If one path is a prefix of a second, then we say the first path contains the second and that the
second is nested in the first.
Thus every path is contained in the root path, and every private path is contained in the root path
of a private blocktree.
Note that if path one is equal to path two, then path one also contains path two, and vice-versa.

In addition to identifying blocks, paths are used to scope capabilities and to address messages.

\section{Principals}
A principal is any entity which can be authenticated.
All authentication in the blocktree system is performed using digital signatures and as such
principals are identified by a cryptographic hash of their public signing key. Support
for the following hash algorithms is required:
\begin{enumerate}
    \item SHA2 256
    \item SHA2 512
\end{enumerate}
These are referred to as the Standard Hash Algorithms and a digest computed using one of them is
referred to as a Standard Hash.

When a principal is identified in a textural representation, such as in a path,
the following syntax is used:
\begin{verbatim}
    PrincipTxt ::= <hash algo index> '!' <Base64Url(hash)>
\end{verbatim}
where ``hash algo index'' is the base 10 string representation of the index of the hash algorithm in
the  above list, and ``Base64Url(hash)'' is the Base64Url encoding of the hash data identifying the
principal. Such a textural representation is referred to as a fingerprint of the public key from
which the hash was computed. In a slight abuse of language, we sometimes refer to the fingerprint or
hash data as a principal, even though these data merely identify a principal.

Principals can be issued capabilities which allow them to read and write to blocks.
These capabilities are scoped by paths, which limit the set of blocks the capability can be used
on. A capability can only be used on a block whose path is contained in the path of the
capability. This access control mechanism is enforced cryptographically.

A principal can grant a capability to another principal so long as the granted capability has a
path which is contained in the capability possessed by the granting node. The specific mechanisms
for how this is done differs depending on whether the capability is for reading or writing to a
block.

Every private blocktree is associated with a root principal. The path containing only one
component which consists of the fingerprint of the principal's public key is a root path to the
private block tree. The root principal has read and write capabilities for the private blocktree's
root, and so can grant subordinate capabilities scoped to any block in the private blocktree.

\section{Readcaps}
In order to protect the confidentiality of the data stored in a block, a symmetric cypher is used.
The key for this cipher is called block key, and by controlling access to it we can control which
principals can read the block.

The metadata of every block contains a dictionary of zero or more entries called read capabilities,
or readcaps for short. A key in this dictionary is a hash of the public signing key of the principal
for which the readcap was issued and the value is the encryption of the block key using the
principal's public encryption key.

The block key is also encrypted using the block key of the parent block, and the resulting cipher
text is also stored in the block's metadata. This is referred to as the inherited readcap.
Hence, a principal which has a read capability for a
given path can read all paths contained in it as well. Further, a principal can use it's
private encryption key to decrypt the block key, and then re-encrypt it using the public encryption
key of another principal, thus allowing the principal to grant a subordinate readcap.

A block that does not require confidentiality protection need not be encrypted. In this case, the
table of readcaps is empty and the inherited readcap is set to a flag value to indicate the block
is stored as plaintext. Blocks in the global blocktree are never encrypted. 

\section{Writecaps}
A write capability, or writecap for short, is a certificate chain which extends from a node's
credentials back to the root signing key of a private blocktree.
Each certificate in this chain contains
the public key of the principal who issued it and the path that it grants write capabilities to.
The path of a certificate must be contained in the path of the previous certificate in the chain,
and the chain must terminate in a certificate which is self-signed by the root signing credentials.
Each certificate also contains an expiration time, represented by an Epoch value, and a writecap
shall be considered invalid by a node if any certificate has an expiration time which it judges 
to be in the past. 

A block's metadata contains a writecap, which allows it to be verified by checking the digital
signature on the metadata using the first certificate in the writecap, and then checking that the
writecap is valid. This ensures that the chain of trust from the root signing credentials extends
to the block's metadata. To extend this chain to the block's data, a Merkle Tree is used, as
described below. A separate mechanism is used to ensure the integrity of blocks in the global
block tree.

\section{Sectors}
A block's data is logically broken up into fixed sized sectors, and these are the units of
confidentiality protection, integrity protection and consensus. When a process performs writes on
an open block, those writes are buffered until a sector is filled (or until the process calls
flush). Once this happens, the contents of the buffer are encrypted using the block key and then
hashed. This hash, along with the offset at which the write occurred, is then used to update the
Merkle Tree for the block. Once the new root of the Merkle Tree is computed, it is copied into the
block's metadata. After ensuring it's writecap is copied into the metadata, the node then signs
the metadata using its private signing key.

When the block is later opened its metadata is verified using the process described in the previous
section. Then the value of the root Merkle Tree node in the metadata is compared to the value in the
Merkle Tree for the block. If it matches, then reads to any offset into the block can be verified
using the Merkle Tree. Otherwise, the block is rejected as invalid.

The sector size can be configured for each block individually at creation time,
but cannot be changed afterwards.
However, the same effect can be achieved by creating a new block with the desired sector
size, copying the data from the old block into it, deleting the old block and renaming the new one.
The choice of sector size for a given block represents a tradeoff between the amount of space the
Merkle Tree occupies occupies and the latency experienced by the initial read to
a sector. With a larger sector size, the Merkle Tree size is reduced, but more data has to be
hashed and decrypted when each sector is read. Depending on the size of the block and its intended
usage, the optimal choice will vary.

It's important to note that the entire Merkle Tree for a block is held in memory for as long as the
block is open. So, assuming the 32 byte SHA2 256 hash and the default 4 KiB sector size are used,
if a 3 GiB file is opened then its MerkleTree will occupy approximately 64 MiB of memory, but will
enable fast random access. Conversely, if 64 KiB sectors are used the Merkle Tree for the same
3 GiB block will occupy approximately 4 MiB, but it's random access performance may suffer from
increased latency.

\section{Nodes}
The independent computing systems participating in the Blocktree system are called nodes. A node
need not run on hardware distinct from other nodes, but it is logically distinct from all other
nodes. In all contemporary operating systems a node can be implemented as a process.
A node posses it own unique credentials, which consist of two public and private key pairs,
one used in an encryption scheme and another in a signing scheme. A node is a Principal
and a hash of its public signing key is used to identify it. In a slight abuse of language, this
hash is referred to as the node's principal.

Nodes are also identified by paths. 
A node is responsible for storing the blocks contained in the directory where it is
attached, unless there is a second node whose parent directory is contained in the parent directory
of the first and which contains the block's path.
In other words, the node whose parent directory is closest to the block is responsible for storing
the block.
This allows data storage to scale as more nodes are added to a blocktree.

When a directory contains more than one node a cluster is formed.
The nodes in the cluster run the Raft consensus protocol in order to agree on the sequence
of events which constitutes each of the blocks contained in the directory where they're attached. 
This allows for redundancy, load balancing, and increased performance, as different sectors
of a block can be read from multiple nodes in the cluster concurrently.

\section{Processes}
Nodes run code and that running code is called a process. Processes are spawned by the node on which
their running based on configuration stored in the node's parent directory.
The code which they're spawned
from is also stored in a block, though that block need not be contained in the directory of the
node running it. So, for example, a software developer can publish code in their
blocktree, and a user can run it by updating the configuring of a node with the path it was
published to.
Code which is stored in it's own directory with a manifest describing it is called an app.

There are two types of apps, portable and native. A portable app is a collection of WebAssembly
(Wasm) modules which are executed by the node in a special Wasm runtime which exposes a
messaging API which can be used to perform block IO and communicate with other processes.
Native apps on the other hand are container images containing binaries compiled to a specific
target architecture. These containers can access blocktree messaging services via a native library
using the C ABI, and they have access to blocks via a POSIX-compatible
filesystem API. The software which runs the node itself is distributed using this mechanism. 

Regardless of their type, all apps require permissions to communicate with the outside world, with
the default only allowing them to communicate with their parent.

\chapter{Nodes}

\chapter{The Network}

\chapter{Application Programming Interface}

\chapter{Example Applications}
\end{document}