Paper.tex 17 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259
  1. \documentclass{article}
  2. \usepackage{amsfonts,amssymb,amsmath}
  3. \usepackage[scale=0.75]{geometry}
  4. \usepackage{hyperref}
  5. \title{Blocktree \\
  6. \large A platform for distributed computing.}
  7. \author{Matthew Carr}
  8. \date{May 28, 2022}
  9. \begin{document}
  10. \maketitle
  11. \begin{abstract}
  12. Blocktree is a platform which aims to make developing reliable distributed systems easy and safe.
  13. It does this by defining a format for data stored in the system, a mechanism for replicating
  14. that data, and a programming interface for accessing it. It aims to solve the difficult problems
  15. of cryptographic key management, consensus, and global addressing so that user code can focus on
  16. solving problems germane to their application.
  17. \end{abstract}
  18. \section{Introduction}
  19. (Do I need this?)
  20. \section{An Individual Blocktree}
  21. The atomic unit of data storage and privacy and authenticity guarantees is called a block. A
  22. block contains a payload of data. Confidentiality of this data is achieved by encrypting it using
  23. a symmetric cipher using a random key. This random key is known as the block key.
  24. The block key is encapsualated using
  25. a public key cipher and the resulting cipher text is stored in the header of the block. Thus
  26. only the person possessing the corresponding private key will be able to access the contents of
  27. the block. Blocks are arranged into trees, and the parent of the block also has a block key.
  28. The child's block key is always encapsulated using the parent's key and stored in the block
  29. header. This ensures that if a principal is given access to a block, they automatically have
  30. access to every child of that block. The encapsulated block key is known as a read capability,
  31. or readcap, as it grants the holder the ability to the block.
  32. Authenticity guarantees are provided using a digital signature scheme. In order to change the
  33. contents of a block a data structure called a write capability, or writecap, is needed. A
  34. writecap is approximately an x509 certificate chain. A writecap contains the following data:
  35. \begin{itemize}
  36. \item The path the writecap can be used under.
  37. \item The principal that the writecap was issued to.
  38. \item The timestamp when the writecap expires.
  39. \item The public key of the principal who issued the writecap.
  40. \item A digital signature produced by the private key corresponding to the above public key.
  41. \item Optionally, the next writecap in the chain.
  42. \end{itemize}
  43. The last item is only excluded in the case of a self-signed writecap, i.e. one that was signed by
  44. the same principal it was issued to. A writecap is considered valid for use on a block if all
  45. of the following conditions are true:
  46. \begin{itemize}
  47. \item The signature on every writecap in the chain is valid.
  48. \item The signing principal matches the principal the next writecap was issued to for every
  49. write cap in the chain.
  50. \item The path of the block is contained in the path of every writecap in the chain.
  51. \item The current timestamp is strictly less than the expiration of all the writecaps in the
  52. chain.
  53. \item The principal corresponding to public key used to sign the last writecap in the chain,
  54. is the owner of the blocktree.
  55. \end{itemize}
  56. The intiution behind these rules is that a writecap is only valid if there is a chain of trust
  57. that leads back to the owner of the block tree. The owner may delegate their trust to any number
  58. of intermediaries by issuing them writecaps. These writecaps are scoped based on the path
  59. specified when they are issued. These intermediaries can then delegate this trust as well.
  60. A block is considered valid if it contains a valid writecap, it was signed using the key
  61. corresponding to the first writecap's public key, and this signature is valid.
  62. Blocks are used for more than just orgnaizing data, they also organize computation. A program
  63. participating in the blocktree network is referred to as a node. Multiple nodes may be run on
  64. a single computer. Every node is attached to the blocktree at a specific path. This information
  65. is recorded in the block where the node is attached. A node is responsible for the storage of
  66. the block where it is attached and the blocks that are descended from this block, unless there
  67. is another node attached to a descendent block.
  68. In this way data storage can be delegated, allowing
  69. the system to scale. When more than one node is attached to the same block they form a cluster.
  70. Each node in the cluster contains a copy of the data that the cluster is reponsible for. They
  71. maintain consistency of this data by running the Raft consensus protocol.
  72. TODO: Nodes have keys. These keys are signed by the root key, creating a tree of trust.
  73. TODO: Symbolic links and readonly models (views?).
  74. While the consistency of an individual block can be maintained using Raft, in order to enable
  75. transactions which span multiple blocks a distributed locking mechanism is employed. This is
  76. accomplished by exploiting hierarchical arrangement of nodes in the tree. In order to describe
  77. this, its first helpful to define a new term. The \emph{nodetree} of a blocktree is tree obtained
  78. from the blocktree by collapsing all the blocks that a node (or cluster of nodes) is responsible
  79. for into a single logical block representing the node itself. Thus we can talk about the node
  80. which is the parent of another node, and by these we mean it is the parent of the node in the
  81. nodetree. What this means in terms of the blocktree is that it is the first node encountered
  82. when one traverses the path from the current block back to the root. Now, distributed locking
  83. works as follows:
  84. \begin{itemize}
  85. \item A node sends a request to lock a block to the current concensus leader in its cluster.
  86. If the node is not part of a cluster, then it is the leader. This request contains a timestamp
  87. for when the lock expires, preventing the situation where a lock is never released.
  88. \item If the leader is responsible for the block then it moves on to the next step. Otherwise
  89. it contacts its parent and forwards the lock request and this step is repeated for the parent.
  90. \item The responsible node checks to see if there is already a lock for this block. If there is
  91. then the request fails. Otherwise the request succeeds and a lock is placed on the block. A
  92. message indicating the result is then pass back up the tree ending at the orignal node. This
  93. message includes the principal of the node enforcing the lock.
  94. \item Once the locking node is done making its updates it sends a message directly to the node
  95. enforcing the lock, causing it to remove the lock.
  96. \end{itemize}
  97. Locking a block locks the subtree rooted at that block. Thus no writes to any path contained in
  98. the path of the locked block will be allowed, unless they come from locking node. If the locking
  99. node does not send the message
  100. unlocking the block before the lock expires, then the modifications which had been performed by
  101. it are dropped and the block reverts to its prior state. Since the locking node is the leader
  102. of the consensus cluster that is responsible for the block's, this guarantees that
  103. writes from other nodes will not be accepted.
  104. \section{Connecting Blocktrees}
  105. In order to allow nodes to access blocks in other blocktrees, a global ledger of events is used.
  106. This ledger is implemented using a proof of work (PoW) blockchain and a corresponding
  107. cryptocurrency known as blockcoin. Nodes mine chain blocks (not to be confused with the tree
  108. blocks we've been discussing up till now) in the same way they do in other PoW blockchain
  109. systems such as BitCoin. The node which manages to mine the next chain block receives a reward,
  110. which is the sum of the fees for each event in the chain and a variable amount of newly minted
  111. blockcoin. The amount of new blockcoin created by a chain block is directly proportional to the
  112. amount of data storage events contained in the chain block. Thus the total amount of blockcoin
  113. in circulation has a direct relationship to the amount of data stored in the system, reflecting
  114. the fact that blockcoin exists to provide and accounting mechanism for data stored in the system.
  115. When a node writes data to a tree block, and it wishes this block to be globally accessible, then
  116. it produces what are called fragments. Fragments are the output symbols from an Erasure Coding
  117. algorithm (such as the RaptorQ code). These algorithms are a class of fountain codes which have
  118. the property that only $m$ out of $n$ (where $m < n$) symbols are needed to reconstruct the
  119. original data. Such a code ensures that even if some of the fragments are lost, as long as $m$
  120. remain, the original data can be recovered.
  121. Once these fragments have been computed an event is created for each one and published to the
  122. blockchain. This event indicates to other nodes that this node wishes to store a fragment and
  123. states the amount of blockcoin the node will pay to the first node that accepts the offer. When
  124. another nodes wishes to accept the offer, it directly contacts the first node, who then sends
  125. it the fragment an publishes and event stating that the fragment is stored with the second
  126. node. This event includes the path of the block the fragment was computed from, the fragment's
  127. ID (the sequence number from the erasure code), and the principal of the node which stored it.
  128. Thus any other node in the network can use the information contained in these events to
  129. determine the set of nodes which contain the fragments of any given path.
  130. In order for nodes to be able to conntact other nodes, a mechanism is required for associating
  131. an internet protocol (IP) address with a principal. This is done by having nodes publish events
  132. to the blockchain when their IP address changes. This event includes their new IP address,
  133. their public key, and a digital signature computed using their private key. Other nodes can
  134. then verify this signature to ensure that an attacker cannot bind the wrong
  135. IP address to a principal in order to receive messages it was not meant to have.
  136. While this event ledger is useful for appending new events, and ensuring that previous events
  137. cannot be changed, another data structure is required to ensure that queries on this data can
  138. be performed efficiently. In particular, it's important to be able to quickly perform the
  139. following queries:
  140. \begin{itemize}
  141. \item Find the set of nodes storing the fragments for a given path.
  142. \item Find the IP address of a node or owner given a principal.
  143. \item Find the public key associated with a principal.
  144. \end{itemize}
  145. To enable these queries a special blocktree is maintained by each node in the network: the global blocktree.
  146. This tree does not support the usual writing and locking sematics of local blocktrees. It can be thought of as
  147. a left fold of all of the events in the blockchain, where each event is processed by updating blocks in the
  148. global tree appropriately. The above queries are facilitate by the following blocks:
  149. \begin{itemize}
  150. \item \emph{/global/fragments}: this block contains a hashtable where the key is a path and the value is the list
  151. of nodes storing the fragments for the block at that path.
  152. \item \emph{/global/principals}: contains a hashtable where the key is the a principal and the value is the tuple
  153. contining the public key of that principal, its current IP address, and its current blockcoin balance.
  154. \end{itemize}
  155. To compute the entries in these tree blocks, the nodes in the network iterate over all the chain blocks, updating
  156. their local copy of each tree block approriately. The experienced reader will recognize that this is an event
  157. sourced architecture. At this time only the two tree blocks are known to be needed, but if new events need to be
  158. added to the system it's easy to see that this system can be used for creating other data structures enabling
  159. queries that we have yet to envision. One such extension is the registration of globally unique names, which will
  160. be the focus of future work.
  161. \section{Programming Environment}
  162. Enabling an excellent developer experience is one of the primary goals of this system (the others being security
  163. and scaleability). Nodes execute user code that has been compiled into WebAssembly modules. Such code
  164. running on a blocktree node is referred to as an "app". An app
  165. executes in a sandbox that isolates it from other code, as well as the security critical operations of the node
  166. itself. The sandbox provides the code with an extension of the WebAssembly System Interface (WASI), with extra
  167. system calls to interact with the particulars of the blocktree system. The stadard WASI filesystem APIs are used
  168. to interact with the contents of blocktrees. For instance a file descriptor for a remote block can be obtained
  169. by calling path\_open. Writes and reads of blocks are performed using the privledges of the node on which
  170. the app is running. The extra system calls fall into three categories:
  171. \begin{itemize}
  172. \item Distributed Locking
  173. \item Messaging
  174. \item Supervision Trees
  175. \item Protocol Contracts (Session Types)
  176. \end{itemize}
  177. When an app is installed it is given a block under which it can store data that is shared between all nodes
  178. in the blocktree. The path of this block is formed by prefixing the path the app was published at
  179. with the string "/apps". When an app is installed on a particular node, it is run in a block contained
  180. in the node's block. It can read and write blocks in this block, and to allow it to access shared data,
  181. a symbolic link is created to the app's block in "/apps".
  182. % App Publishing
  183. These apps are, of course, distributed in a blocktree. The path of the block used to publish the app is used to
  184. identify the app, and must be unique. The block containing the app contains the WebAssembly module itself as
  185. well as JSON file containing the app's manifest. This manifest defines the app's name as well as the list
  186. of permissions it requires. This list of persmissions is used to determine which APIs the app will have access to.
  187. % Privacy Safe vs Unsafe
  188. Apps are broken into two large categories: those that are privacy safe and those that are not.
  189. An app is privacy unsafe if it requests any permissions which allow it to send data
  190. to any node that is not part of the blocktree in which it's running. Thus request the ability to
  191. open a TCP socket would cause an app to be privacy unsafe. Similarly the creation of a protocol handler for
  192. HTTP would also be privacy unsafe. Privacy unsafe apps can limit the scope of their unsafety by
  193. imposing limits on the unsafe APIs that they request. For instance an app which needs to send
  194. blocktree message back to the blocktree it was published in can request the messaging permission
  195. for a path in this tree. Similarly, an app which only wants to open a TCP socket listening on the
  196. local network, can limit the scope of its requested permission.
  197. % Protocol Contracts
  198. In order to make it easy to write apps which interface with existing systems, most notably
  199. those using HTTP, apps are able to define protocols using a contract language and then register
  200. call backs for these protocols. This works by providing a system call which the app can
  201. supply a protocol contract to. This contract is then compiled into a state machine by the node
  202. and a handle is returned to the app. This handle is then used to register callbacks for different
  203. parts of the protocol. This ensures that the protocol is handled by the node itself and that
  204. protocols can be shared between many different apps as a library. For instance, the HTTP protocol
  205. would be compiled to a particularly simply state maching, with only one state: the listening state.
  206. This state would expose a hook that where a callback can be registered to handle the request.
  207. This state also defines the record format used to pass the request information to the callback
  208. and the record format of the return value that is expected in order to produce the response.
  209. More complicated (stateful) protocols would more states, each defining their own request and
  210. response records, as well as hooks. One nice thing about this setup is that it will enable
  211. optimizations where the state machine and the user callbacks can be compiled into a program
  212. which can be safely run in the node itself, or even in a SmartNIC. This would require that
  213. the callbacks only use an approved set of APIs, but it can enable much higher performance
  214. network services.
  215. % Supervision Trees
  216. Apps can also arrange themselves into supervision trees, in the same way that Erlang
  217. processes are arranged. In this scheme, when a child app crashes, or the node its
  218. running on dies (which is detected by the node), then the app receives a message. In the
  219. simplest case this can be used to implement a logging system, where crashes and node
  220. deaths are recorded. More interestingly, this could be used to integrate with a control
  221. plan. For instance, if a blocktree were running in AWS, when a message is recevied indicating
  222. that a node has died a new EC2 instance could be started to replace it. Of course these
  223. are just two of the potential applications of this mechanism. The reliability of Erlang
  224. and other system employing the Actor Model have shown the robustness of this approach.
  225. \section{A Brave New Web}
  226. In order to explore how this system can be used, the design of several hypothetical systems
  227. is discussed. It's important to note that blocktree does not try to force all compuatation
  228. to be local to a user's device, it merely trys to enable this for applications where it
  229. is possible.
  230. \subsection{A contacts application.}
  231. \subsection{A distributed social network.}
  232. \subsection{An ecomerce website.}
  233. \subsection{The Open Metaverse}
  234. \end{document}