\documentclass{article}
\usepackage[scale=0.8]{geometry}
\usepackage{hyperref}
\usepackage{graphicx}
\title{The Blocktree Cloud Orchestration Platform}
\author{Matthew Carr}
\begin{document}
\maketitle
\begin{abstract}
This document is a proposal for a novel cloud platform called Blocktree.
The system is described in terms of the actor model,
where tasks and services are implemented as actors.
The platform is responsible for orchestrating these actors on a set of native operating system processes.
A service is provided to actors which allows them access to a highly available distributed file system,
which serves as the only source of persistent state for the system.
High availability is achieved using the Raft consensus protocol to synchronize the state of files between processes.
All data stored in the filesystem is secured with strong integrity and optional confidentiality protections.
A network-block-device-like interface allows for fast, low-level read and write access to the encrypted data,
with full support for client-side encryption.
Well-known cryptographic primitives and constructions are employed to provide this protection;
the system does not attempt to innovate in terms of cryptography.
The system's trust model allows for mutual TLS authentication between all processes in the system,
even those which are controlled by different owners.
By integrating these ideas into a single platform,
the system aims to advance the status quo in the security and reliability of software systems.
\end{abstract}
\section{Introduction}
% The "Big" Picture.
Blocktree is an attempt to extend the Unix philosophy that everything is a file
to the entire distributed system that comprises modern IT infrastructure.
The system is organized around a global distributed filesystem which defines security
principals, resources, and their authorization attributes.
This filesystem provides a language for access control that can be used to securely grant principals
access to resources from different organizations, without the need to set up federation.
The system provides an actor runtime for orchestrating services.
Resources are represented by actors, and actors are grouped into operating system processes.
Each process has its own credentials which authenticate it as a unique security principal,
and which specify the filesystem path where the process is located.
A process has authorization attributes which determine the set of processes that may communicate with it.
Every connection between processes is established using mutual TLS authentication,
which is accomplished without the need to trust any third-party certificate authorities.
The cryptographic mechanisms which make this possible are described in detail in section 4.
Messages addressed to actors in a different process are forwarded over these connections,
while messages delivered to actors in the same process are delivered without any copying.
% Self-certifying paths and the chain of trust.
The single global Blocktree filesystem is partitioned into disjoint domains of authority.
Each domain is controlled by a root principal.
As is the case for all principals,
a root principal is authenticated by a public-private key pair,
and is identified by a hash of its public key.
The domain of authority for a given absolute path is determined by its first component,
which is the identifier of the root principal who controls the domain.
Because the directory "/" has no meaning in this scheme,
a path consisting of only a single component equal to a root principal's identifier is
referred to as the root directory of that root principal.
The root principal delegates its authority to write files to subordinate principals by issuing
them certificates which specify the path that the authority of the subordinate is limited to.
File data is signed for authenticity, and a certificate chain is contained in its metadata.
This certificate chain must lead back to the root principal
and consist of certificates with correctly scoped authority in order for the file to be authentic.
Given the path of a file and the file's contents,
this system allows the file to be validated by anyone, without the need to trust a third party.
Blocktree paths are referred to as self-certifying for this reason.
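To make the chain-of-trust rules concrete, the following sketch checks that a certificate chain leads back to the root principal named by the path's first component and that every certificate is scoped to a path containing the file. The types and field names are illustrative, not taken from the Blocktree codebase, and signature verification is elided.

```rust
/// Illustrative certificate: `issuer` and `subject` are principal
/// identifiers (hashes of public keys), and `scope` is the path prefix
/// the subject is authorized to write under. (Hypothetical types.)
struct Cert {
    issuer: String,
    subject: String,
    scope: String,
}

/// True if `path` equals `prefix` or lies underneath it.
fn path_contains(prefix: &str, path: &str) -> bool {
    path == prefix || path.starts_with(&format!("{}/", prefix))
}

/// Validate that a chain of certificates leads from the root principal
/// (the first component of `file_path`) down to `writer`, with every
/// certificate's scope containing `file_path`.
fn chain_authorizes(file_path: &str, writer: &str, chain: &[Cert]) -> bool {
    let root = match file_path.trim_start_matches('/').split('/').next() {
        Some(r) if !r.is_empty() => r,
        _ => return false,
    };
    let mut expected_issuer = root.to_string();
    for cert in chain {
        if cert.issuer != expected_issuer || !path_contains(&cert.scope, file_path) {
            return false;
        }
        expected_issuer = cert.subject.clone();
    }
    // The final subject must be the principal that signed the file data.
    expected_issuer == writer
}
```

Because the root principal's identifier is recoverable from the path itself, this check requires no trusted third party, which is what makes the paths self-certifying.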
% Persistent state provided by the filesystem.
One of the major challenges in distributed systems is managing persistent state.
Blocktree solves this issue using its distributed filesystem.
Files are broken into segments called sectors.
The sector size of a file can be configured when it is created,
but cannot be changed after the fact.
Reads and writes of individual sectors are guaranteed to be atomic.
The sectors which comprise a file and its metadata are replicated by a set of processes running
the sector service.
A sector service instance is responsible for storing the sectors of files which are contained in
the directory containing the process in which it is running.
The actors providing the sector service in a given directory coordinate with one another using
the Raft protocol to synchronize the state of the sectors they store.
This method of partitioning the data in the filesystem based on directory
allows the system to scale beyond the capabilities of a single consensus cluster.
Sectors are secured with strong integrity protection,
which allows anyone to verify that their contents were written by an authorized principal.
Encryption can be optionally applied to sectors,
with the system handling key management.
The cryptographic mechanisms used to implement these protections are described in section 4.
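As a rough illustration of the sector abstraction (the helper below is hypothetical, not the platform's API): a file body is divided into fixed-size sectors, with the size chosen at creation time and the final sector possibly short.

```rust
/// Split a file's contents into fixed-size sectors. The final sector may
/// be short; the sector size itself is chosen at file creation and never
/// changes afterwards. (Hypothetical helper, not the Blocktree API.)
fn split_into_sectors(data: &[u8], sector_size: usize) -> Vec<Vec<u8>> {
    assert!(sector_size > 0, "sector size must be positive");
    data.chunks(sector_size).map(|c| c.to_vec()).collect()
}
```

Atomicity of individual sector reads and writes means each element of the returned vector is the unit of consistency that the Raft-replicated sector service guarantees.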
% Protocol contracts.
One of the design goals of Blocktree is to facilitate the creation of composable distributed
systems.
A major challenge to building such systems is the difficulty in pinning down bugs when they
inevitably occur.
Research into session types (a.k.a. behavioral types) promises to bring the safety benefits
of type checking to actor communication.
Blocktree integrates a session typing system that allows protocol contracts to be defined which
specify the communication patterns of a set of actors.
This model allows the state space of the set of actors participating in a computation to be defined,
and the state transitions which occur to be specified based on the types of received messages.
These contracts are used to verify protocol adherence statically and dynamically.
This system is implemented using compile-time code generation,
making it a zero-cost abstraction.
This frees the developer from dealing with the numerous failure modes that can occur in a
communication protocol.
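The flavor of this approach can be sketched with Rust's typestate pattern, a simplified stand-in for the actual contract system (which generates its message loops with macros): each protocol state is a distinct type, and a handler consumes the current state and returns the next, so an illegal transition fails to compile.

```rust
// Typestate sketch of a two-state request/response protocol.
// These types are illustrative; the real contract system is macro-generated.
struct AwaitingRequest;
struct AwaitingReply {
    request: String,
}

impl AwaitingRequest {
    /// Handling a request consumes this state and yields the next one.
    fn on_request(self, request: String) -> AwaitingReply {
        AwaitingReply { request }
    }
}

impl AwaitingReply {
    /// Handling the reply completes the round trip and returns to the start.
    fn on_reply(self, reply: &str) -> (AwaitingRequest, String) {
        (AwaitingRequest, format!("{} -> {}", self.request, reply))
    }
}
```

Because each handler takes `self` by value, processing a reply before a request is a type error rather than a runtime protocol violation.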
% Implementation language and project links.
Blocktree is implemented in the Rust programming language.
It currently only supports running on Linux,
though porting it to other Unix-like operating systems should be straightforward.
Its source code is licensed under the GNU Affero General Public License, Version 3.
It can be downloaded at the project homepage at \url{https://blocktree.systems}.
Anyone interested in contributing to development is welcome to submit a pull request
to \url{https://gogs.delease.com/Delease/Blocktree}.
If you have larger changes or architectural suggestions,
please submit an issue for discussion prior to spending time implementing your idea.
% Outline of the rest of the paper.
The remainder of this paper is structured as follows:
\begin{itemize}
\item Section 2 describes the actor runtime, service and task orchestration, and service
discovery.
\item Section 3 discusses the filesystem, its concurrency semantics, and its implementation.
\item Section 4 details the cryptographic mechanisms used to secure communication between
actor runtimes and to protect sector data.
\item Section 5 is a set of examples describing ways that Blocktree can be used to build systems.
\item Section 6 provides some concluding remarks.
\end{itemize}
\section{Actor Runtime}
% Motivation for using the actor model.
Building scalable, fault-tolerant systems requires us to distribute computation over
multiple computers.
Rather than switching to a different programming model when an application scales beyond the
capacity of a single computer,
it is beneficial in terms of programmer time and program simplicity to begin with a model that
enables multi-computer scalability.
Fundamentally, all communication over an IP network involves the exchange of messages,
namely IP packets.
So if we wish to build scalable, fault-tolerant systems,
it makes sense to choose a programming model built on message passing,
as this minimizes the impedance mismatch with the underlying networking technology.
% Overview of message passing interface.
That is why Blocktree is built on the actor model
and why its actor runtime is at the core of its architecture.
The runtime can be used to spawn actors, register services, and dispatch messages.
Messages can be dispatched in two different ways: with \texttt{send} and \texttt{call}.
A message is dispatched with the \texttt{send} method when no reply is required,
and with \texttt{call} when exactly one is.
The \texttt{Future} returned by \texttt{call} can be awaited to obtain the reply.
If a timeout occurs while waiting for the reply,
the \texttt{Future} completes with an error.
The name \texttt{call} was chosen to bring to mind a remote procedure call,
which is the primary use case this method was intended for.
Awaiting replies to messages serves as a simple way to synchronize a distributed computation.
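A minimal, synchronous sketch of the two dispatch modes follows, using std channels in place of the runtime's async machinery; the names and message layout are illustrative, and a blocking timeout stands in for the \texttt{Future} the real \texttt{call} returns.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// A message that optionally carries a channel for exactly one reply,
/// mirroring the send (no reply) / call (one reply) distinction.
struct Msg {
    body: String,
    reply_to: Option<mpsc::Sender<String>>,
}

/// Fire-and-forget dispatch.
fn send(tx: &mpsc::Sender<Msg>, body: &str) {
    let _ = tx.send(Msg { body: body.to_string(), reply_to: None });
}

/// Dispatch and wait for exactly one reply, failing on timeout.
fn call(tx: &mpsc::Sender<Msg>, body: &str, timeout: Duration) -> Result<String, String> {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Msg { body: body.to_string(), reply_to: Some(reply_tx) })
        .map_err(|e| e.to_string())?;
    reply_rx.recv_timeout(timeout).map_err(|e| e.to_string())
}
```

A receiving actor simply drains its channel and, for `call`-style messages, sends its answer back on the embedded reply channel.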
% Description of virtual actor system.
One of the challenges when building actor systems is supervising and managing actors' lifecycles.
This is handled in Erlang through the use of supervision trees,
but Blocktree takes a different approach inspired by Microsoft's Orleans framework.
Orleans introduced the concept of virtual actors,
which are purely logical entities that exist perpetually.
In Orleans, one does not need to spawn actors nor worry about respawning them should they crash;
the framework takes care of spawning an actor when a message is dispatched to it.
This model also gives the framework the flexibility to deactivate actors when they are idle
and to load balance actors across different computers.
In Blocktree a similar system is used when messages are dispatched to services.
The Blocktree runtime takes care of routing these messages to the appropriate actors,
spawning them if needed.
% Message addressing modes.
Messages can be addressed to services or to specific actors.
When addressing a specific actor,
the message contains an \emph{actor name},
which is a pair consisting of the path of the runtime hosting the actor and the \texttt{Uuid}
identifying the specific actor in that runtime.
When addressing a service,
the message is dispatched using a \emph{service name},
which contains the following fields:
\begin{enumerate}
\item \texttt{service}: The path identifying the receiving service.
\item \texttt{scope}: A filesystem path used to specify the intended recipient.
\item \texttt{rootward}: A boolean describing whether message delivery is attempted towards or
away from the root of the filesystem tree. A value of
\texttt{false} indicates that the message is intended for a runtime directly contained in the
scope. A value of \texttt{true} indicates that the message is intended for a runtime contained
in a parent directory of the scope and should be delivered to a runtime which has the requested
service registered and is closest to the scope.
\item \texttt{id}: An identifier for a specific service provider.
\end{enumerate}
The ID can be a \texttt{Uuid} or a \texttt{String}.
It is treated as an opaque identifier by the runtime,
but a service is free to associate additional meaning with it.
Every message has a header containing the names of the sender and receiver.
The receiver name can be an actor or service name,
but the sender name is always an actor name.
For example, to open a file in the filesystem,
a message is dispatched with \texttt{call} using the service name of the filesystem service.
The reply contains the name of the file actor spawned by the filesystem service which owns the opened
file.
Messages are then dispatched to the file actor using its actor name to read and write to the file.
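The two addressing modes can be summarized as plain data types. The rendering below is hypothetical (strings stand in for paths and \texttt{Uuid}s), but the fields follow the description above.

```rust
/// Address of one specific actor: the hosting runtime's path plus the
/// actor's unique ID within that runtime (a Uuid in the real system;
/// a string stands in for it here).
#[derive(Clone, Debug, PartialEq)]
struct ActorName {
    runtime_path: String,
    actor_id: String,
}

/// Opaque provider identifier: a Uuid or a String.
#[derive(Clone, Debug, PartialEq)]
enum ProviderId {
    Uuid(String),
    Str(String),
}

/// Address of a service provider, resolved by the runtime at delivery time.
#[derive(Clone, Debug, PartialEq)]
struct ServiceName {
    service: String,   // path identifying the service
    scope: String,     // path the delivery search is relative to
    rootward: bool,    // search toward the root (true) or within scope (false)
    id: ProviderId,    // opaque provider identifier
}

/// A message header names an actor or a service as the receiver,
/// but the sender is always a concrete actor.
#[derive(Clone, Debug, PartialEq)]
enum Receiver {
    Actor(ActorName),
    Service(ServiceName),
}

#[derive(Clone, Debug, PartialEq)]
struct Header {
    sender: ActorName,
    receiver: Receiver,
}
```

The asymmetry in `Header` captures the rule that replies always have somewhere concrete to go, even when the original message was routed to a service.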
% The runtime is implemented using tokio.
The actor runtime is currently implemented using the Rust asynchronous runtime tokio.
Actors are spawned as tasks in the tokio runtime,
and multi-producer, single-consumer channels are used for message delivery.
Because actors are just tasks,
they can do anything a task can do,
including awaiting other \texttt{Future}s.
Because of this, there is no need for the actor runtime to support short-lived worker tasks,
as any such use case can be accomplished by awaiting a set of \texttt{Future}s.
This allows the runtime to focus on providing support for services.
Using tokio also means that we have access to a high-performance multi-threaded runtime with
evented IO.
This asynchronous programming model ensures that resources are efficiently utilized,
and is ideal for a system focused on orchestrating services which may be used by many clients.
% Delivering messages over the network.
Messages can be forwarded between actor runtimes using a secure transport layer called
\texttt{bttp}.
The transport is implemented using the QUIC protocol, which integrates TLS for security.
A \texttt{bttp} client may connect anonymously or using credentials.
If an anonymous connection is attempted,
the client has no authorization attributes associated with it.
Only runtimes which grant others the execute permission allow connections from such clients.
If these permissions are not granted in the runtime's file,
anonymous connections are rejected.
When a client connects with credentials,
mutual TLS authentication is performed as part of the connection handshake,
which cryptographically verifies the credentials of each runtime.
These credentials contain the filesystem paths where each runtime is located,
which ensures that messages addressed to a specific path will only be delivered to that path.
The \texttt{bttp} server is always authenticated during the handshake,
even when the client is connecting anonymously.
Because QUIC supports the concurrent use of many different streams,
it serves as an ideal transport for a message-oriented system.
\texttt{bttp} uses different streams for independent messages,
ensuring that head-of-line blocking does not occur.
Note that although data from separate streams can arrive in any order,
the protocol does provide reliable in-order delivery of data in a given stream.
The same stream is used for sending the reply to a message dispatched with \texttt{call}.
Once a connection is established,
messages may flow in both directions (provided both runtimes have execute permissions for the other),
regardless of which runtime is acting as the client or the server.
% Delivering messages locally.
When a message is sent between actors in the same runtime, it is delivered into the queue of the recipient without any copying,
while ensuring immutability (i.e. move semantics).
This is possible thanks to the Rust ownership system,
because the message sender gives ownership to the runtime when it dispatches the message,
and the runtime gives ownership to the recipient when it delivers the message.
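This handoff can be seen in miniature with a std channel (a simplification of the runtime's delivery path): the sender moves the message into the channel, so only a pointer-sized value changes hands, and the sender can no longer touch the payload.

```rust
use std::sync::mpsc;

/// A large message body allocated on the heap. Sending it moves the
/// Box (a pointer), not the payload bytes.
struct Message {
    payload: Box<[u8]>,
}

fn zero_copy_delivery() -> usize {
    let (tx, rx) = mpsc::channel::<Message>();
    let msg = Message { payload: vec![7u8; 1024 * 1024].into_boxed_slice() };
    let addr_before = msg.payload.as_ptr() as usize;
    tx.send(msg).unwrap(); // ownership moves; `msg` is unusable from here on
    let received = rx.recv().unwrap();
    let addr_after = received.payload.as_ptr() as usize;
    // The payload was never copied: the recipient sees the same allocation.
    assert_eq!(addr_before, addr_after);
    addr_after
}
```

Any attempt to use `msg` after the `send` is a compile error, which is exactly the immutability guarantee the runtime relies on.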
% Security model based on filesystem permissions.
A runtime is represented in the filesystem as a file.
This file contains the authorization attributes which are associated with the runtime's security
principal.
The credentials used by the runtime specify the file, so other runtimes are able to locate it.
The metadata of the file contains authorization attributes just like any other file
(e.g. UID, GID, and mode bits).
In order for a principal to be able to send a message to an actor in the runtime,
it must have execute permissions for this file.
Thus communication between runtimes can be controlled using simple filesystem permissions.
Permissions checking is done during the \texttt{bttp} handshake.
Note that it is possible for messages to be sent in one direction in a \texttt{bttp} connection
but not in the other.
In this situation replies are permitted but unsolicited messages are not.
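Since authorization reduces to Unix-style mode bits on the runtime's file, the check is the familiar owner/group/other execute test. The sketch below uses hypothetical types; the real attribute layout may differ.

```rust
/// Unix-style authorization attributes on a runtime's file.
struct FileAttrs {
    uid: u32,
    gid: u32,
    mode: u32, // e.g. 0o750
}

/// May a principal (with the given uid and group memberships) send
/// messages to the runtime this file represents? Mirrors the classic
/// execute-bit check performed during the bttp handshake.
fn may_send(attrs: &FileAttrs, uid: u32, gids: &[u32]) -> bool {
    if uid == attrs.uid {
        attrs.mode & 0o100 != 0 // owner execute
    } else if gids.contains(&attrs.gid) {
        attrs.mode & 0o010 != 0 // group execute
    } else {
        attrs.mode & 0o001 != 0 // other execute
    }
}
```

A mode of `0o750`, for example, admits the owner and group members but rejects everyone else, which is how anonymous connections end up refused.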
An important trade-off which was made when designing this model was that messages which are
sent between actors in the same runtime are not subject to any authorization checks.
This was done for two reasons: performance and security.
By eliminating authorization checks, messages can be delivered more efficiently between actors in the
same process,
which helps to reduce the performance penalty of the actor runtime over directly using threads.
Security is enhanced by this decision because it forces the user to separate actors with different
security requirements into different operating system processes,
which ensures that all of the process isolation machinery in the operating system will be used to
isolate them.
% Representing resources as actors.
As in other actor systems, it is convenient to represent resources in Blocktree using actors.
This allows the same security model used to control communication between actors to be used for
controlling access to resources,
and for resources to be shared by many actors.
For instance, a Point-to-Point Protocol connection could be owned by an actor.
This actor could forward traffic delivered to it in messages over this connection.
The set of actors which are able to access the connection is controlled by setting the filesystem
permissions on the file for the runtime executing the actor owning the connection.
% Actor ownership.
The concept of ownership in programming languages is very useful for ensuring that resources are
properly freed when the type using them dies.
Because actors are used for encapsulating resources in Blocktree,
a similar system of ownership is employed for this reason.
An actor is initially owned by the actor that spawned it.
An actor can only have a single owner,
but the owner can grant ownership to another actor.
An actor is not allowed to own itself,
though it may be owned by the runtime.
When the owner of an actor returns,
the actor is sent a message instructing it to return.
If it does not return after a timeout,
it is interrupted.
This is the opposite of how supervision trees work in Erlang.
Instead of the parent receiving a message when the child returns,
the child receives a message when the parent returns.
Service providers spawned by the runtime are owned by it.
They continue running until the runtime chooses to reclaim their resources,
which can happen because they are idle or because the runtime is overloaded.
% Message routing to services.
A service is identified by a Blocktree path.
Only one service implementation can be registered in a particular runtime,
though this implementation may be used to spawn many actors as providers for the service,
each associated with a different ID.
The runtime spawns a new actor when it finds no service provider associated with the ID in the
message it is delivering.
Some services may only have one service provider in a given runtime,
as is the case for the sector and filesystem services.
Services are reactive:
they don't do anything until they receive a message to process.
The \texttt{scope} and \texttt{rootward} fields in a service name specify the set of runtimes to
which a message may be delivered.
They allow the sender to express their intended recipient,
while still affording enough flexibility to the runtime to route messages as needed.
If \texttt{rootward} is \texttt{false},
the message is delivered to a service provider in a runtime that is directly contained in
\texttt{scope}.
If \texttt{rootward} is \texttt{true},
the parent directories of \texttt{scope} are searched,
working towards the root of the filesystem tree,
and the message is delivered to the first provider of \texttt{service} which is found.
When there are multiple service providers to which a given message could be delivered,
the one to which it is actually delivered is unspecified,
which allows the runtime to balance load.
Delivery will occur to at most one recipient,
even in the case that there are multiple potential recipients.
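The directory search implied by \texttt{rootward} can be sketched as follows (an illustrative helper, not the runtime's router): with \texttt{rootward = false} the only candidate is \texttt{scope} itself, while with \texttt{rootward = true} the candidates are the parents of \texttt{scope}, nearest first, stopping at the root directory.

```rust
/// Candidate directories for delivering a service-addressed message,
/// in the order they should be searched. (Hypothetical helper.)
fn candidate_dirs(scope: &str, rootward: bool) -> Vec<String> {
    if !rootward {
        return vec![scope.to_string()];
    }
    let mut dirs = Vec::new();
    let mut components: Vec<&str> =
        scope.trim_matches('/').split('/').filter(|c| !c.is_empty()).collect();
    // Walk from the parent of `scope` up to the root directory
    // (the single component naming the root principal).
    while components.len() > 1 {
        components.pop();
        dirs.push(format!("/{}", components.join("/")));
    }
    dirs
}
```

The runtime is free to pick any provider within the first candidate directory that has one, which is what leaves room for load balancing.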
In order to contact other runtimes and deliver messages to them,
their IP addresses need to be known.
This is achieved by maintaining a file with a runtime's IP address in the same directory as the
runtime.
The runtime is granted write permissions on the file,
and the file is updated by \texttt{bttp} when the runtime begins listening on a new endpoint.
The services which are allowed to be registered in a given runtime are specified in the runtime's
file.
The runtime reads this list and uses it to deny registrations for unauthorized services.
The list is also read by other runtimes when they are searching a directory for service providers.
% The sector and filesystem service.
The filesystem is itself implemented as a service.
A filesystem service provider can be passed messages to delete files, list directory contents,
open files, or perform several other standard filesystem operations.
When a file is opened,
a new actor is spawned which owns the newly created file handle, and its name is returned to the
caller in a reply.
Subsequent read and write messages are sent to this actor.
The filesystem service does not persist any data itself;
its job is to function as an integration layer,
conglomerating sector data from many different sources into a single unified interface.
The sector service is what is ultimately responsible for storing data,
and thus maintaining the persistent state of the system.
It stores sector data in the local filesystem of each computer on which it is registered.
The details of how this is accomplished are deferred to the next section.
% Runtime network discovery.
While it is possible to resolve runtime paths to IP addresses when the filesystem is available,
a different mechanism is needed to allow the filesystem and sector services to discover service
providers.
To facilitate this,
runtimes are able to query one another to learn about other runtimes.
Because queries are intended to facilitate message delivery,
the query fields and their meanings mirror those used for addressing messages:
\begin{enumerate}
\item \texttt{service}: The path of the service whose providers are sought.
Only runtimes with this service registered will be returned.
\item \texttt{scope}: The filesystem path relative to which the query will be processed.
\item \texttt{rootward}: Indicates whether the query should search for runtimes from \texttt{scope}
toward the root.
\end{enumerate}
The semantics of \texttt{scope} and \texttt{rootward} in a query are identical to their use in a
service name.
As long as at least one other runtime is known,
a query can be issued to learn of more runtimes.
A runtime which receives a query may not be able to answer it directly.
If it cannot,
it returns the IP address of the next runtime to which the query should be sent.
In order to bootstrap the discovery process,
another mechanism is needed to find the first peer to query.
Several possibilities were explored for doing this.
One way is to use a blockchain to store the IP addresses of the runtimes hosting the sector service
in the root directory.
As long as these runtimes could be located,
all others could be found using the filesystem.
This idea may be worth revisiting in the future,
but the author wanted to avoid the complexity of implementing a new proof-of-work blockchain.
Another idea was to use multicast link-local addressing to discover other runtimes,
similar to how mDNS operates.
This approach has several advantages.
It avoids any dependency on centralized internet infrastructure
and keeps network load local to the segment on which the runtimes are connected.
However, it will not work over a wide area network,
making it unsuitable for the general case.
Instead, the design which was decided on was to use DNS to resolve a fully qualified domain name
(FQDN) derived from the root principal's identifier.
This FQDN is expected to resolve to the public IP addresses of the runtimes hosting the
sector service in the root directory of the root principal.
Each process is configured with a search domain which is used as a suffix of the FQDN.
The leading labels in the FQDN are computed by base32 encoding a hash of the root
principal's public key.
If the encoded string is longer than 63 bytes (the limit for each label in a hostname),
it is separated into the fewest number of labels possible,
working from left to right along the string.
A dot followed by the search domain is concatenated onto the end of this string to form the FQDN.
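The label-splitting step can be sketched as follows; the helper name is hypothetical, the hashing and base32 encoding are assumed to have already happened, and only the 63-byte left-to-right splitting rule is taken from the text.

```rust
/// Split an encoded identifier into the fewest possible DNS labels
/// (each at most 63 bytes), working left to right, then append the
/// configured search domain to form the FQDN. (Illustrative helper.)
fn fqdn_for_encoded_id(encoded: &str, search_domain: &str) -> String {
    let labels: Vec<&str> = encoded
        .as_bytes()
        .chunks(63)
        .map(|c| std::str::from_utf8(c).expect("ASCII base32 input"))
        .collect();
    format!("{}.{}", labels.join("."), search_domain)
}
```

Greedily filling each label to 63 bytes from the left is what produces the fewest labels, keeping the FQDN within hostname limits.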
This method has the advantages of being simple to implement
and allowing runtimes to discover each other over the internet.
Implementing this system would be facilitated by hosting DNS servers in actors in the same
runtimes as the root sector service providers.
Then, A or AAAA records could be served which point to these runtimes.
These runtimes would also need to be configured with static IP addresses,
and the NS records for the search domain would need to point to them.
Of course, it is also possible to build such a system without hosting DNS inside of Blocktree.
The downside of using DNS is that it couples Blocktree with a centralized,
albeit distributed, system.
% Security model for queries.
To allow runtimes which are not permitted to execute the root directory to query for other runtimes,
authorization logic which is specific to queries is needed.
If a process is connected with credentials
and the path in the credentials contains the scope of the query,
the query is permitted.
If a process is connected anonymously,
its query will only be answered if the query scope,
and all of its parent directories,
grant others the execute permission.
Queries from authenticated processes can be authorized using only the information in the query,
but anonymous queries require knowledge of filesystem permissions,
some of which may not be known to the answering runtime.
When authorizing an anonymous query,
an answering runtime should check that the execute permission is granted on all directories
that it is responsible for storing.
If all these checks pass, it should forward the querier to the next runtime as usual.
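The anonymous-query check can be sketched as follows. The map of locally stored directories to their others-execute bits, and all names here, are illustrative assumptions; directories not stored locally are skipped, since checking them is deferred to the next runtime as described above.

```rust
use std::collections::HashMap;

/// Returns true if the execute permission for "others" is granted on the
/// query scope and every parent directory that this runtime stores.
/// A minimal sketch; types and names are illustrative.
fn authorize_anonymous(scope: &str, others_execute: &HashMap<String, bool>) -> bool {
    let mut path = String::new();
    for part in scope.split('/').filter(|s| !s.is_empty()) {
        path.push('/');
        path.push_str(part);
        // Only directories this runtime is responsible for appear in the map.
        if let Some(&granted) = others_execute.get(&path) {
            if !granted {
                return false;
            }
        }
    }
    true
}
```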
% Overview of protocol contracts and runtime checking of protocol adherence.
To facilitate the creation of composable systems,
a protocol contract checking system based on session types has been designed.
This system models a communication protocol as a directed graph representing state transitions
based on the types of received messages.
The protocol author defines the states that the actors participating in the protocol can be in using
Rust traits.
These traits define handler methods for each message type the actor is expected to handle in that
state.
A top-level trait which represents the entire protocol is defined that contains the types of the
initial state of every actor in the protocol.
A macro is used to generate the message handling loop for each of the parties to the protocol,
as well as enums to represent all possible states that the parties can be in and the messages that
they exchange.
The generated code is responsible for ensuring that errors are generated when a message of an
unexpected type is received,
eliminating the need for ad-hoc error handling code to be written by application developers.
% Example of a protocol contract.
% TODO: I don't find this example very compelling. It would be more impressive to show a pub-sub
% protocol, that would look cool.
Let us explore the use of this system through a simple example using the HTTP/1.1 protocol.
It is a stateless client-server protocol,
essentially just an RPC from client to server.
We can model this for the contract checker by defining a trait representing the protocol:
\begin{verbatim}
pub trait Http {
    type Server: ServerInit;
}
\end{verbatim}
The purpose of this top-level trait is to specify the initial state of every party to the
communications protocol.
In this case we're only modeling the state of the server,
as the client will just \texttt{call} a method on the server.
The initial state for the server is defined as follows:
\begin{verbatim}
pub trait ServerInit {
    type AfterActivate: Listening;
    type Fut: Future<Output = Result<Self::AfterActivate>>;
    fn handle_activate(self, msg: Activate) -> Self::Fut;
}
\end{verbatim}
\texttt{Activate} is a message sent by the generated code to allow the actor access to the
runtime and the actor's ID.
It is defined as follows:
\begin{verbatim}
pub struct Activate {
    rt: &'static Runtime,
    act_id: Uuid,
}
\end{verbatim}
We represent the statelessness of HTTP by having the requests to the \texttt{Listening} state
return another \texttt{Listening} state.
\begin{verbatim}
pub trait Listening {
    type AfterRequest: Listening;
    type Fut: Future<Output = Result<Self::AfterRequest>>;
    fn handle_request(self, msg: Envelope<Request>) -> Self::Fut;
}
\end{verbatim}
The \texttt{Envelope} type is a wrapper around a message which contains information about who sent
it and a method which can be used to send a reply.
In general, a new type could be returned after each message received,
with the returned type being dependent on the type of the message.
The state graph of this protocol can be visualized as follows:
\begin{center}
\includegraphics[height=1.5in]{HttpStateGraph.pdf}
\end{center}
% Implementing actors in languages other than Rust.
Today the actor runtime only supports executing actors implemented in Rust.
A WebAssembly (Wasm) plugin system is planned to allow any language which can compile to Wasm to be
used to implement an actor.
This work is blocked pending the standardization of the WebAssembly Component Model,
which promises to provide an interface definition language which will allow type-safe actors to be
defined in many different languages.
% Running containers using actors.
Blocktree allows containers to be run by encapsulating them using a supervising actor.
This actor is responsible for starting the container and managing the container's kernel namespace.
Logically, it owns any kernel resources created by the container, including all spawned operating
system processes.
When the actor halts,
all of these resources are destroyed.
All network communication to the container is controlled by the supervising actor.
The supervisor can be configured to bind container ports to host ports,
as is commonly done today,
but it can also be used to encapsulate traffic to and from the container in Blocktree messages.
These messages are routed to other actors based on the configuration of the supervisor.
This essentially creates a VPN for containers,
ensuring that regardless of how well secured their communication is,
they will be safe to communicate over any network.
This network encapsulation system could be used in other actors as well,
allowing a lightweight and secure VPN system to be built.
% Web GUI used for managing the system.
Any modern computer system must include a GUI,
as users require one.
For this reason Blocktree includes a web-based GUI called \texttt{btconsole} that can
monitor the system, provision runtimes, and configure access control.
\texttt{btconsole} is itself implemented as an actor in the runtime,
and so has access to the same facilities as any other actor.
\section{Filesystem}
% The division of responsibilities between the sector and filesystem services.
The responsibility for serving data in Blocktree is shared between the filesystem and sector
services.
Most actors will access the filesystem through the filesystem service,
which provides a high-level interface that takes care of the cryptographic operations necessary to
read and write files.
The filesystem service relies on the sector service for actually persisting data.
The individual sectors which make up a file are read from and written to the sector service,
which stores them in the local filesystem of the computer on which it is running.
A sector is the atomic unit of data storage,
and the sector service only supports reading and writing entire sectors at once.
File actors spawned by the filesystem service buffer reads and writes until there is enough
data to fill a sector.
Because cryptographic operations are only performed on full sectors,
the cost of providing these protections is amortized over the size of the sector.
Thus there is a tradeoff between latency and throughput when selecting the sector size of a file:
a smaller sector size means less latency while a larger one enables more throughput.
% Types of sectors: metadata, integrity, and data.
A file has a single metadata sector, a Merkle sector, and zero or more data sectors.
The sector size of a file can be specified when it is created,
but cannot be changed later.
Every data sector contains ciphertext for a number of bytes equal to the sector size,
but the metadata and Merkle sectors contain a variable amount of data.
The metadata sector contains all of the filesystem metadata associated with the file.
In addition to the usual metadata present in any Unix filesystem (the contents of the \texttt{stat} struct),
the cryptographic information necessary to verify and decrypt the contents of the file is also stored.
The Merkle sector of a file contains a Merkle tree over the data sectors of the file.
The hash function used by this tree can be configured at file creation,
but cannot be changed after the fact.
% How sectors are identified.
When sector service providers are contained in the same directory, they connect to each other to form
a consensus cluster.
This cluster is identified by a \texttt{u64} called the cluster's \emph{generation}.
Every file is identified by a pair of \texttt{u64}s: its generation and its inode.
The sectors within a file are identified by an enum which specifies which type they are,
and in the case of data sectors, their 0-based index.
\begin{verbatim}
pub enum SectorKind {
    Meta,
    Merkle,
    Data(u64),
}
\end{verbatim}
The byte offset in the plaintext of the file at which each data sector begins can be calculated by
multiplying the sector's index by the sector size of the file.
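This calculation can be expressed directly in terms of \texttt{SectorKind}; only data sectors occupy a position in the file's plaintext. A small illustrative sketch:

```rust
/// Illustrative redefinition of the sector identifier variants.
enum SectorKind {
    Meta,
    Merkle,
    Data(u64),
}

/// The plaintext byte offset at which a sector's contents begin:
/// the sector index times the file's sector size. Metadata and
/// Merkle sectors have no position in the plaintext.
fn plaintext_offset(kind: &SectorKind, sector_size: u64) -> Option<u64> {
    match kind {
        SectorKind::Data(index) => Some(index * sector_size),
        _ => None,
    }
}
```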
The \texttt{SectorId} type is used to identify a sector.
\begin{verbatim}
pub struct SectorId {
    generation: u64,
    inode: u64,
    sector: SectorKind,
}
\end{verbatim}
% How the sector service stores data.
The sector service persists sectors in a directory in its local filesystem,
with each sector stored in a separate file.
The scheme used to name these files involves security considerations,
and is described in the next section.
When a sector is updated,
a new local file is created with a different name containing the new contents.
Rather than deleting the old sector file,
it is overwritten by the creation of a hardlink to the new file,
and the name that was used to create the new file is unlinked.
This method, which is used by other Unix programs, ensures that the sector file is updated in one
atomic operation.
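The atomic-replacement step can be sketched as follows. This minimal version uses \texttt{rename}, which on POSIX filesystems provides the same one-step replacement guarantee as the link-and-unlink scheme described above; all names are illustrative.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Atomically replace a sector file's contents: write the new contents to
/// a temporary name, then rename it over the old name. On POSIX systems
/// rename(2) replaces the target in a single atomic step. A sketch only;
/// the real service's naming scheme is described in the Cryptography section.
fn write_sector_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("new");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents)?;
    file.sync_all()?; // ensure the data reaches disk before the swap
    fs::rename(&tmp, path) // atomic replacement of the old sector file
}
```

A reader that opens the sector file at any point sees either the old contents or the new contents, never a mixture.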
The sector service also uses the local filesystem to persist the replicated log it uses for Raft.
This file serves as a journal of sector operations.
% Types of messages handled by the sector service.
Communication with the sector service is done by passing it messages of type \texttt{SectorMsg}.
\begin{verbatim}
pub struct SectorMsg {
    id: SectorId,
    op: SectorOperation,
}

pub enum SectorOperation {
    Read,
    Write(WriteOperation),
}

pub enum WriteOperation {
    Meta(Box<FileMeta>),
    Data {
        meta: Box<FileMeta>,
        contents: Vec<u8>,
    },
}
\end{verbatim}
Here \texttt{FileMeta} is the type used to store metadata for files.
Note that updated metadata is required to be sent when a sector's contents are modified.
% Scaling horizontally: using Raft to create consensus cluster. Additional replication methods.
A generation of sector service providers uses the Raft protocol to synchronize the state of the
sectors it stores.
The message passing interface of the runtime enables this implementation,
and the sector service's requirements were important considerations in designing this interface.
The system currently replicates all data to each of the service providers in the cluster.
Additional replication methods are planned for future implementation
(e.g. erasure encoding and distribution via consistent hashing),
which allow for different tradeoffs between data durability and storage utilization.
% Scaling vertically: how different generations are stitched together.
The creation of a new generation of the sector service is accomplished in several steps.
First, a new directory is created in which the generation will be located.
Next, one or more processes are credentialed for this directory,
using a procedure which is described in the next section.
The credentialing process produces files for each of the processes stored in the new directory.
The sector service provider in each of the processes uses the filesystem service
(which connects to the parent generation of the sector service)
to find the other runtimes hosting the sector service in the directory and messages them to
establish a fully-connected cluster.
Finally, the service provider which is elected leader contacts the generation in the root directory
and requests a new generation number.
Once this number is known, it is stored in the superblock for the generation,
which is the file identified by the new generation number and inode 2.
The superblock is not contained in any directory and cannot be accessed outside the sector service.
The superblock also keeps track of the next inode to assign to a new file.
% Authorization logic of the sector service.
To prevent malicious actors from writing invalid data,
the sector service must cryptographically verify all write messages.
The process it uses to do this involves several steps:
\begin{enumerate}
  \item The certificate chain in the metadata that was sent in the write message is validated.
    It is considered valid if it ends with a certificate signed by the root principal
    and the paths in the certificates are correctly nested,
    indicating valid delegation of write authority at every step.
  \item Using the last public key in the certificate chain,
    the signature in the metadata is validated.
    This signature covers all of the fields in the metadata.
  \item The new sector contents in the write message are hashed using the digest function configured
    for the file, and the resulting hash is used to update the file's Merkle tree in its Merkle
    sector.
  \item The root of the Merkle tree is compared with the integrity value in the file's metadata.
    The write message is considered valid if and only if there is a match.
\end{enumerate}
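The hashing and comparison steps (3 and 4) can be condensed into the following sketch. Rust's \texttt{DefaultHasher} stands in for the file's configured digest function (SHA-256 by default in the real system), certificate and signature validation (steps 1 and 2) are omitted, and all names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one sector's contents (stand-in for the configured digest).
fn leaf_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Fold a level of leaf hashes pairwise up to a single root.
fn merkle_root(mut level: Vec<u64>) -> u64 {
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| {
                let mut h = DefaultHasher::new();
                pair.hash(&mut h);
                h.finish()
            })
            .collect();
    }
    level[0]
}

/// Steps 3 and 4: update the leaf for the written sector, recompute the
/// root, and compare it with the integrity value from the metadata.
fn verify_write(leaves: &mut Vec<u64>, index: usize, contents: &[u8], integrity: u64) -> bool {
    leaves[index] = leaf_hash(contents);
    merkle_root(leaves.clone()) == integrity
}
```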
This same logic is used by file actors to verify the data they read from the sector service.
Only once a write message is validated is it shared with the sector service provider's peers in
its generation.
Although the data in a file is encrypted,
it is still beneficial for security to prevent unauthorized principals from gaining access to a
file's ciphertext.
To prevent this, a sector service provider checks a file's metadata to verify that the requesting
principal actually has a readcap (to be defined in the next section) for the file.
This ensures that only principals that are authorized to read a file can gain access to the file's
ciphertext, metadata, and Merkle tree.
% File actors are responsible for cryptographic operations. Client-side encryption.
The sector service is relied upon by the filesystem service to read and write sectors.
Filesystem service providers communicate with the sector service to open files and perform
filesystem operations.
These providers spawn file actors that are responsible for verifying and decrypting the information
contained in sectors and providing it to other actors.
They use the credentials of the runtime they are hosted in to decrypt sector data using
information contained in file metadata.
File actors are also responsible for encrypting and integrity protecting data written to files.
In order for a file actor to produce a signature over the root of the file's Merkle tree,
it maintains a copy of the tree in memory.
This copy is read from the sector service when the file is opened.
While this does mean duplicating data between the sector and filesystem services,
this design was chosen to reduce the network traffic between the two services,
as the entire Merkle tree does not need to be transmitted on every write.
Encapsulating all cryptographic operations in the filesystem service and file actors allows the
computer storing data to be different from the computer encrypting it.
This approach allows client-side encryption to be done on more capable computers
and low-powered devices to delegate this task to a storage server.
% Prevention of resource leaks through ownership.
A major advantage of using file actors to access file data is that they can be accessed over the
network from a different runtime as easily as they can be from the same runtime.
One complication arising from this approach is that file actors must not outlive the actor which
caused them to be spawned.
This is handled in the filesystem service by making the actor which opened the file the owner of the
file actor.
When a file actor receives notification that its owner has returned,
it flushes any buffered data in its cache and returns,
ensuring that a resource leak does not occur.
% Authorization logic of the filesystem service.
The filesystem service uses an \texttt{Authorizer} type to make authorization decisions.
It passes this type the authorization attributes of the principal accessing the file, the
attributes of the file, and the type of access (read, write, or execute).
The \texttt{Authorizer} returns a boolean indicating whether access is permitted or denied.
These access control checks are performed for every message processed by the filesystem service,
including opening a file.
A file actor only responds to messages sent from its owner,
which ensures that it can avoid the overhead of performing access control checks, as these were
carried out by the filesystem service when it was created.
The file actor is configured when it is spawned to allow read-only, write-only, or read-write
access to a file,
depending on what type of access was requested by the actor opening the file.
% Streaming replication.
Often when building distributed systems it is convenient to alert any interested party that an event
has occurred.
To facilitate this pattern,
the sector service allows actors to subscribe for notification of writes to a file.
The sector service maintains a list of actors which are currently subscribed,
and when it commits a write to its local storage,
it sends each of them a notification message identifying the sector written
(but not the written data).
By using different files to represent different events,
a simple notification system can be built.
Because the contents of a directory may be distributed over many different generations,
this system does not support the recursive monitoring of directories.
Although this system lacks the power of \texttt{inotify} in the Linux kernel,
it does provide some of its benefits without incurring much of a performance overhead
or implementation complexity.
For example, this system can be used to implement streaming replication.
This is done by subscribing to writes on all the files that are to be replicated,
then reading new sectors as soon as notifications are received.
These sectors can then be written into replica files in a different directory.
This ensures that the contents of the replicas will be updated in near real-time.
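The subscribe-and-notify pattern above can be sketched with channels standing in for the runtime's message passing; the service type, the \texttt{(inode, sector)} notification payload, and all names are illustrative assumptions.

```rust
use std::sync::mpsc;

/// Sketch of the write-notification pattern: the sector service keeps a
/// list of subscriber channels and, after committing a write, sends each
/// one the identifier of the written sector (never the written data).
struct SectorService {
    subscribers: Vec<mpsc::Sender<(u64, u64)>>, // (inode, data sector index)
}

impl SectorService {
    /// Register a new subscriber and hand back its receiving end.
    fn subscribe(&mut self) -> mpsc::Receiver<(u64, u64)> {
        let (tx, rx) = mpsc::channel();
        self.subscribers.push(tx);
        rx
    }

    /// Called after a write is committed to local storage: notify every
    /// live subscriber, dropping any whose receiver has gone away.
    fn commit_write(&mut self, inode: u64, sector: u64) {
        self.subscribers
            .retain(|tx| tx.send((inode, sector)).is_ok());
    }
}
```

A replicator would hold the receiver, read each newly written sector on notification, and write it into a replica file in another directory.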
% Peer-to-peer distribution of sector data.
Because of the strong integrity protection afforded to sectors,
it is possible for peer-to-peer distribution of sector data to be done securely.
Implementing this mechanism is planned as a future enhancement to the system.
The idea is to base the design on BitTorrent,
where the sector service responsible for a file acts as a tracker for that file,
and the file actors accessing the file communicate with one another directly using the information
provided by the sector service.
This could allow the system to scale to a much larger number of concurrent reads by reducing
the load on the sector service.
% The FUSE daemon.
Being able to access the filesystem from actors allows a programmer to implement new applications
using Blocktree,
but there is an entire world of existing applications which only know how to access the local
filesystem.
To allow these applications access to Blocktree,
a FUSE daemon is included which allows a Blocktree directory to be mounted to a directory in the
local filesystem.
This daemon can directly access the sector files in a local directory,
or it can connect over the network to a filesystem or sector service provider.
This FUSE daemon could be included in a system's initrd to allow it to mount its root filesystem
from Blocktree,
opening up many interesting possibilities for hosting machine images in Blocktree.
A planned future enhancement is to develop a Blocktree filesystem driver which runs in
kernel space.
This would reduce the overhead associated with context switching from user space to kernel space
and back for every filesystem interaction,
making the system more practical to use for a root filesystem.
\section{Cryptography}
This section describes the cryptographic mechanisms used to protect the integrity and
confidentiality of files.
These mechanisms are based on well-established cryptographic constructions.
% Integrity protection.
File integrity is protected by a digital signature over its metadata.
The metadata contains the integrity field, which contains the root node of a Merkle tree over
the file's contents.
This allows any sector in the file to be verified with a number of hash function invocations that
is logarithmic in the size of the file.
It also allows the sectors of a file to be verified in any order,
enabling random access.
The hash function used in the Merkle tree can be configured when the file is created.
Currently, SHA-256 is the default, and SHA-512 is supported.
A file's metadata also contains a certificate chain,
and this chain is used to authenticate the signature over the metadata.
In Blocktree, the certificate chain is referred to as a \emph{writecap}
because it grants the capability to write to files.
The certificates in a valid writecap are ordered by their paths:
the initial certificate contains the longest path,
the path in each subsequent certificate must be a prefix of the one preceding it,
and the final certificate must be signed by the root principal.
These rules ensure that there is a valid delegation of write authority at every
link in the chain,
and that the authority is ultimately derived from the root principal specified by the absolute path
of the file.
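The path-nesting rule can be checked with a simple component-wise prefix test. A minimal sketch, with signature checks omitted and all names illustrative:

```rust
/// Check the ordering rule for a writecap: the first certificate carries
/// the longest path, and each subsequent certificate's path must be a
/// prefix (along path components) of the one preceding it.
fn paths_correctly_nested(cert_paths: &[&str]) -> bool {
    cert_paths.windows(2).all(|pair| {
        let longer: Vec<&str> = pair[0].split('/').filter(|s| !s.is_empty()).collect();
        let shorter: Vec<&str> = pair[1].split('/').filter(|s| !s.is_empty()).collect();
        // The later certificate's path must be an initial segment of the
        // earlier one, indicating delegation of write authority.
        shorter.len() <= longer.len() && longer.starts_with(&shorter)
    })
}
```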
By including all the information necessary to verify the integrity of a file in its metadata,
it is possible for a requestor who only knows the path of a file to verify that the contents of the
file are authentic.
% Confidentiality protecting files with readcaps. Single pubkey operation to read a dir tree.
Confidentiality protection of files is optional, but when it is enabled,
a file's sectors are individually encrypted using a symmetric cipher.
The key for this cipher is randomly generated when a file is created.
A different IV is generated for each sector by hashing the index of the sector with a
randomly generated IV for the entire file.
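The per-sector IV derivation can be sketched as follows. \texttt{DefaultHasher} stands in for the real digest, and in practice the derived value would be truncated or expanded to the cipher's IV width; all names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a per-sector IV by hashing the sector index together with the
/// file's randomly generated IV, so no two sectors of the same file share
/// an IV under the same key. A sketch with a stand-in hash function.
fn sector_iv(file_iv: &[u8], sector_index: u64) -> u64 {
    let mut h = DefaultHasher::new();
    file_iv.hash(&mut h);
    sector_index.hash(&mut h);
    h.finish()
}
```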
A file's key and IV are encrypted using the public keys of the principals to whom read access is
to be granted.
The resulting ciphertext is referred to as a \emph{readcap}, as it grants the capability to read the
file.
These readcaps are stored in a table in the file's metadata.
Each entry in the table is identified by a byte string that is derived from the public key of the
principal who owns the entry's readcap.
The byte string is computed by calculating an HMAC of the principal's public key.
The HMAC is keyed with a randomly generated salt that is stored in the file's metadata.
An identifier for the hash function that was used in the HMAC is included in the byte string so
that the HMAC can be recomputed later.
When the filesystem service accesses the file,
it recomputes the HMAC using the salt, its public key, and the hash function specified in each entry
of the table.
It can then identify the entry which contains its readcap,
or determine that no such entry exists.
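The lookup can be sketched as follows. \texttt{DefaultHasher} keyed with the salt stands in for the HMAC construction, the hash-function identifier carried in each entry is omitted, and all names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Compute a table entry identifier: a salted keyed hash over the owning
/// principal's public key (stand-in for the HMAC described above).
fn entry_id(salt: &[u8], public_key: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    salt.hash(&mut h); // the HMAC is keyed with the file's random salt
    public_key.hash(&mut h);
    h.finish()
}

/// Recompute the identifier for our own public key and look up the
/// matching readcap entry, if one exists.
fn find_readcap<'a>(
    table: &'a HashMap<u64, Vec<u8>>,
    salt: &[u8],
    public_key: &[u8],
) -> Option<&'a Vec<u8>> {
    table.get(&entry_id(salt, public_key))
}
```

Because only salted hashes of public keys appear in the table, an attacker reading metadata offline cannot correlate entries across files without the salt.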
This mechanism was designed to prevent offline correlation attacks on file metadata,
as metadata is stored in plaintext in local filesystems.
The file key and IV are also encrypted using the keys of the file's parents.
Note that there may be multiple parents of a file because it may be hard linked to several
directories.
Each of the resulting ciphertexts is stored in another table in the file's metadata.
The entries in this table are identified by an HMAC of the parent's generation and inode numbers,
where the HMAC is keyed using the file's salt.
By encrypting a file's key and IV using the key and IV of its parents,
it is possible to traverse a directory tree using only a single public key decryption.
The file where this traversal begins must contain a readcap owned by the accessing principal,
but all subsequent accesses can be performed by decrypting the key and IV of a child using the
key and IV of a parent.
Not only does this allow traversals to use more efficient symmetric key cryptography,
but it also means that it suffices to grant a readcap on a single directory in order to grant
access to the entire tree rooted at that directory.
% File key rotation and readcap revocation.
Because it is not possible to change the key used by a file after it is created,
a file must be copied in order to rotate the key used to encrypt it.
Similarly, revoking a readcap is accomplished by creating a copy of the file
and adding all the readcaps from the original's metadata except for the one being revoked.
While it is certainly possible to remove a readcap from the metadata table,
this is not supported because the readcap holder may have used custom software to save the file's
key and IV while it had access to them,
so data written to the same file after revocation could potentially be decrypted by it.
By forcing the user to create a new file,
they are forced to re-encrypt the data using a fresh key and IV.
% Obfuscating sector files stored in the local filesystem.
From an attacker's perspective,
not every file in your domain is equally interesting.
They may be particularly interested in reading your root directory,
or they may have identified the inode of a file containing kompromat.
To make it more difficult to identify offline which files the sectors in the local filesystem
belong to, an obfuscation mechanism is used.
This works by generating a random salt for each generation of the sector service
and storing it in the generation's superblock.
The salt is hashed along with the inode and the sector ID to produce the file name of the sector
file in the local filesystem.
These files are arranged into different subdirectories according to the value of the first two
digits in the hex encoding of the resulting hash,
the same way git organizes object files.
This simple method makes it more difficult for an attacker to identify the files each sector
belongs to, while still allowing the sector service efficient access.
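The naming and sharding scheme can be sketched as follows. \texttt{DefaultHasher} again stands in for the real digest, and all names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Compute the obfuscated local path of a sector file: hash the
/// generation's random salt together with the inode and sector identifier,
/// hex-encode the result, and shard by the first two hex digits the way
/// git lays out its object store (e.g. "ab/cdef0123456789").
fn sector_path(salt: &[u8], inode: u64, sector: u64) -> PathBuf {
    let mut h = DefaultHasher::new();
    salt.hash(&mut h);
    inode.hash(&mut h);
    sector.hash(&mut h);
    let hex = format!("{:016x}", h.finish());
    PathBuf::from(&hex[..2]).join(&hex[2..])
}
```

Without the salt from the superblock, an attacker inspecting the local filesystem cannot recompute which sector file corresponds to a given inode.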
% Credential stores.
Processes need a way to securely store their credentials.
They accomplish this by using a credential store,
which is a type that implements the trait \texttt{CredStore}.
A credential store provides methods for using a process's credentials to encrypt, decrypt,
sign, and verify data,
but it does not allow them to be exported.
A credential store also provides a method for generating new root credentials.
Because root credentials represent the root of trust for an entire domain,
it must be possible to securely back them up from one credential store to another.
Root credentials can also be used to perform cryptographic operations without exporting them.
A password is set when the root credentials are generated,
and this same password must be provided to use, export, and import them.
When root credentials are exported from a credential store, they are confidentiality protected
using multiple layers of encryption.
The outermost layer is encryption by a symmetric key cipher whose key is derived from the
password.
A public key of the receiving credential store must also be provided when root credentials are
exported.
This public key is used to perform the inner encryption of the root credentials,
ensuring that only the intended credential store is able to import them.
Currently there are two \texttt{CredStore} implementors in Blocktree,
one which is used for testing and one which is more secure.
The first is called \texttt{FileCredStore},
and it uses a file in the local filesystem to store credentials.
A symmetric cipher is used to protect the root credentials, if they are stored,
but it relies on the security of the underlying filesystem to protect the process credentials.
For this reason it is not recommended for production use.
The other credential store is called \texttt{TpmCredStore},
and it uses a Trusted Platform Module (TPM) 2.0 on the local machine to store credentials.
The TPM is used to generate the process's credentials in such a way that they can never be
exported from the TPM (this is a feature of TPM 2.0).
A randomly generated cookie is needed to use these credentials.
The cookie is stored in a file in the local filesystem with its permissions set to prevent
others from accessing it.
Thus this type also relies on the security of the local filesystem.
However, an attacker would need to steal both the TPM and this cookie in order to steal a process's
credentials.
% Manual provisioning via the command line.
The term provisioning is used in Blocktree to refer to the process of acquiring credentials.
A command line tool called \texttt{btprovision} is provided for provisioning credential stores.
This tool can be used to generate new process or root credentials, create a certificate request
using them, issue a new certificate, and finally import the new certificate chain.
When setting up a new domain,
\texttt{btprovision} can create a new sector storage directory in the local filesystem
and write the new process's files to it.
It is also capable of connecting to the filesystem service if it is already running.
% Automatic provisioning.
While manual provisioning is necessary to bootstrap a domain,
an automatic method is needed to make this process more ergonomic.
When a runtime starts it checks its configured credential store to find the certificate chain to
use for authenticating to other runtimes.
If no such chain is stored,
the runtime can choose to request a certificate from the filesystem service.
This is done by dispatching a message with \texttt{call} to the filesystem service without
specifying a scope.
Because the message specifies no path, there is no root directory at which to begin discovery.
So, the runtime resorts to using link-local discovery to find other runtimes.
Once one is discovered,
the runtime connects to it anonymously
and sends it a certificate request.
This request includes a copy of the runtime's public key and, optionally, a path where the
runtime would like to be located.
This path is purely advisory;
the filesystem service is free to place the runtime in any directory it sees fit.
The filesystem service creates a new process file containing the public key and marks it as
pending.
The reply to the runtime contains the path of the file created for it.
The operators of the domain can then use the web GUI or \texttt{btprovision} to view the request
and approve it at their discretion.
Assuming an operator approves the request,
they use their credentials and the public key in the new process's file to issue a certificate
and then store it in the file.
Authorization attributes (e.g. UID and GID) are also assigned to the process and written into its
file.
Note that a process's file is normally not writeable by the process itself,
so as to prevent it from setting its own authorization attributes.
Once these data have been written to the process file,
the runtime can read them to retrieve its new certificate chain.
It stores this chain in its credential store for later use.
The runtime can avoid polling its file for changes by subscribing to write notifications.
The runtime must close the anonymous connections it made
and reconnect using the new certificate chain.
Once new connections are established,
it can read and write files using the authorization attributes specified in its file.
Note that this procedure only works when the runtime is on the same LAN as another runtime.
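The request, approval, and read-back steps above can be modeled as a small state machine. All type and field names below are hypothetical; only the transitions mirror the protocol described in the text.

```rust
// Toy model of the automatic provisioning handshake:
// anonymous request -> pending process file -> operator approval ->
// runtime reads back its certificate chain and attributes.
#[derive(Debug)]
struct ProcessFile {
    public_key: String,
    pending: bool,
    cert_chain: Option<Vec<String>>,
    uid: Option<u32>,
    gid: Option<u32>,
}

// Step 1: the filesystem service records an anonymous request as a
// pending process file (the reply carrying its path is elided here).
fn record_request(public_key: &str) -> ProcessFile {
    ProcessFile {
        public_key: public_key.into(),
        pending: true,
        cert_chain: None,
        uid: None,
        gid: None,
    }
}

// Step 2: an operator approves the request, issuing a certificate for
// the stored public key and assigning authorization attributes.
fn approve(file: &mut ProcessFile, issuer: &str, uid: u32, gid: u32) {
    file.cert_chain = Some(vec![
        format!("cert({} <- {})", file.public_key, issuer),
        issuer.to_string(),
    ]);
    file.uid = Some(uid);
    file.gid = Some(gid);
    file.pending = false;
}

// Step 3: the runtime reads the file (or is notified of the write);
// once a chain is present it can store it and reconnect authenticated.
fn poll_for_chain(file: &ProcessFile) -> Option<&Vec<String>> {
    if file.pending { None } else { file.cert_chain.as_ref() }
}
```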
% The generation of new root credentials and the creation of a new domain.
The procedure for creating a new domain is straightforward,
and all the steps can be performed using \texttt{btprovision}.
\begin{enumerate}
\item Generate the root credentials for the new domain.
\item Generate the credentials for the first runtime.
\item Create a certificate request using the runtime credentials.
\item Approve the request using the root credentials.
\item Import the new certificate into the credential store of the first runtime.
\end{enumerate}
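The five steps above can be compressed into a toy sketch. Key generation, certificate requests, and signing are all stand-ins (plain strings); only the order of operations is meaningful.

```rust
// Toy walk-through of domain bootstrap; real key and certificate
// handling is elided, only the sequencing matters.
struct CredStore { chain: Vec<String> }

fn generate_credentials(name: &str) -> String { format!("{}-keypair", name) }
fn certificate_request(creds: &str) -> String { format!("csr({})", creds) }
fn approve_request(csr: &str, root: &str) -> String {
    format!("cert({} signed-by {})", csr, root)
}

fn bootstrap_domain() -> CredStore {
    let root = generate_credentials("root");         // step 1
    let runtime = generate_credentials("runtime-1"); // step 2
    let csr = certificate_request(&runtime);         // step 3
    let cert = approve_request(&csr, &root);         // step 4
    CredStore { chain: vec![cert, root] }            // step 5: import
}
```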
The first runtime is configured to host the sector and filesystem services,
so that subsequent runtimes will have access to the filesystem.
After that, additional runtimes on the same LAN can be provisioned using the automatic process.
% Example of how these mechanisms allow data to be shared.
To illustrate how these mechanisms can be used,
consider a situation where two companies wish to partner on the development of a product.
To facilitate their collaboration,
they wish to have a way to securely exchange data with each other.
One of the companies is selected to host the data
and accepts the cost and responsibility of serving it.
The host company creates a directory which will be used to store all of the data created during
development.
The other company will connect to the filesystem service in the host company's domain to access
data in the shared directory.
Each principal in the other company that wishes to connect requests to be credentialed in the
shared directory.
The hosting company manually reviews these requests and approves them,
assigning each of the principals authorization attributes appropriate for its domain.
This may involve issuing UID and GID values to each of the principals, or perhaps SELinux contexts.
The actual set of attributes supported is determined by the \texttt{Authorization} type used
by the filesystem service in the host company's domain.
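One possible shape for such a pluggable check is sketched below. The trait and its method are hypothetical (the paper does not show the actual \texttt{Authorization} interface); the point is only that a domain could swap a Unix-flavored UID/GID implementation for, say, one based on SELinux contexts.

```rust
// Hypothetical shape of a pluggable authorization check.
trait Authorization {
    fn may_write(&self, uid: u32, gids: &[u32]) -> bool;
}

// A Unix-flavored implementation using owner/group IDs and mode bits.
struct UnixAuthz {
    owner: u32,
    group: u32,
    mode: u16, // e.g. 0o640
}

impl Authorization for UnixAuthz {
    fn may_write(&self, uid: u32, gids: &[u32]) -> bool {
        if uid == self.owner {
            self.mode & 0o200 != 0 // owner write bit
        } else if gids.contains(&self.group) {
            self.mode & 0o020 != 0 // group write bit
        } else {
            self.mode & 0o002 != 0 // others write bit
        }
    }
}
```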
Once the principals have their credentials,
they can dispatch messages to the filesystem service using the shared directory as the scope and
setting the rootward field to true.
This allows actors authenticating with the credentials of these principals to perform all filesystem
operations authorized by the hosting company.
This arrangement gives the hosting company a great deal of control over the data.
If the other company wishes to protect its investment in the R\&D effort,
it should subscribe to write events on the shared directory and the files in it so that it can
copy new sectors out of the host company's domain as soon as they are written.
Note that although it is not possible to directly subscribe to writes on the contents of a
directory, by monitoring a directory for changes,
one can begin monitoring files as soon as they are created.
\section{Examples}
This section contains examples of systems that could be built using Blocktree.
The hope is to illustrate how this platform can be used to implement existing applications more
easily and to make it possible to implement systems which are currently out of reach.
\subsection{A distributed AI execution environment.}
Neural networks are just vector-valued functions with vector inputs,
albeit very complicated ones with potentially billions of parameters.
But, just like any other computation,
these functions can be conceptualized as computational graphs.
Imagine that you have a set of computers equipped with AI accelerator hardware
and you have a neural network that is too large to be processed by any one of them.
By partitioning the graph into small enough subgraphs,
we can break the network down into pieces which can be processed by each of the accelerators.
The full network can be stitched together by passing messages between each of these pieces.
Let us consider how this could be accomplished with Blocktree.
We begin by provisioning a runtime on each of the accelerator machines,
each of which will have a new accelerator service registered.
Messages will be sent to the accelerator service describing the computational graph to execute,
as well as the name of the actor to which the output is to be sent.
When such a message is received by an accelerator service provider,
it spawns an actor which compiles its subgraph to a kernel for its accelerator
and remembers the name of the actor to send its output to.
An orchestrator service will be responsible for partitioning the graph and sending these messages.
Ownership of the actors spawned by the accelerator service is given to the orchestrator service,
ensuring that they will all be stopped when the orchestrator returns.
When one of the spawned actors stops,
it unloads the kernel from the accelerator's memory and returns it to its initial state.
Note that the orchestrator actor must have execute permissions on each of the accelerator runtimes
in order to send messages to them.
The orchestrator dispatches messages to the accelerator service in reverse order of the flow of data
in the computational graph,
so that it can tell each service provider where its output should be sent.
The actors responsible for the last layer in the computational graph send their output to the
orchestrator.
To begin the computation,
the actors which are responsible for input are given the filesystem path of the input data.
The orchestrator learns of the completion of the computation once it receives the output from the
final layer.
It can then save these results to the file system and return.
Because inference and training can both be modeled by computational graphs,
this same procedure can be used for both.
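The reverse-order wiring can be sketched with threads and channels standing in for accelerator actors and Blocktree messages. The "graph" here is a simple chain of stages rather than a general DAG, and the per-stage functions are arbitrary placeholders; the point is that each stage is created already knowing where to send its output.

```rust
// Each stage thread stands in for an accelerator actor: it "runs its
// kernel" on each input and forwards the result downstream. The
// orchestrator wires stages back-to-front, handing each stage the
// sender of the stage that consumes its output.
use std::sync::mpsc::{channel, Sender};
use std::thread;

fn spawn_stage(kernel: fn(i64) -> i64, downstream: Sender<i64>) -> Sender<i64> {
    let (tx, rx) = channel::<i64>();
    thread::spawn(move || {
        for x in rx {
            let _ = downstream.send(kernel(x));
        }
    });
    tx
}

fn run_chain(input: i64) -> i64 {
    // The orchestrator receives the final layer's output directly.
    let (to_orchestrator, results) = channel::<i64>();
    // Wire in reverse order of data flow: last layer first.
    let layer3 = spawn_stage(|x| x + 1, to_orchestrator);
    let layer2 = spawn_stage(|x| x * x, layer3);
    let layer1 = spawn_stage(|x| x * 2, layer2);
    // Kick off the computation by feeding the input layer.
    layer1.send(input).unwrap();
    results.recv().unwrap()
}
```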
\subsection{A decentralized social media network.}
One of the original motivations for designing Blocktree was to create a platform for a social
network that puts users fully in control of their data.
In the opinion of the author,
the only way to actually accomplish this is for users to host the data themselves.
One might think it is possible to use client-side encryption to solve the privacy issue,
but this does not solve the full problem.
While it is true that good client-side encryption will prevent the service provider from reading
the user's data,
the user could still lose everything if the service provider goes out of business or simply
decides to stop offering its service.
Similarly, putting data in a federated system, as has been proposed by the Mastodon developers,
also puts the user at risk of losing their data if the operator of the server they use decides to
shut it down.
To have real control the user must host the data themselves.
Then they decide how it is encrypted, how it is served, and to whom.
Let us explore how Blocktree can be used to build a social media platform which provides this
control.
To participate in this network each user will need to set up their own domain by generating new root
credentials
and provisioning at least one runtime to host the social media service.
A technical user could do this on their own hardware by reading the Blocktree documentation,
but a non-technical user might choose to purchase a new router with Blocktree pre-installed.
By connecting this router directly to their WAN,
the user ensures that the services running on it will always have direct internet access.
The user can access the \texttt{btconsole} web GUI via the router's WiFi interface to generate their
root credentials and provision new runtimes on their network.
A basic function of any social network is keeping track of a user's contacts.
This would be handled by maintaining the contacts as files in a well-known directory in the user's
domain.
Each file in the directory would be named using the user-defined nickname for the contact,
and its contents would include the root principal of the contact as well as any additional
user-defined attributes,
such as an address or telephone number.
The root principal would be used to discover runtimes controlled by the contact
so that messages can be sent to the social media service running in them.
When a user adds a new contact,
a connection message would be sent to it,
which the contact could choose to accept or reject.
If accepted,
the contact would create an entry in its contacts directory for the user.
The contact's social media service would then accept future direct messages from the user.
When the user sends a direct message to the contact,
its runtime discovers runtimes controlled by the contact and delivers the message.
Once delivered, the contact's social media service stores the message in a directory for the user's
correspondence,
sort of like an mbox directory, but where messages are sorted into directories based on sender
instead of receiver.
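The sender-keyed layout might look like the sketch below. The path shape and helper names are hypothetical, invented for illustration.

```rust
// Hypothetical sender-keyed correspondence layout: each incoming
// direct message is filed under a per-sender directory.
use std::collections::BTreeMap;

fn message_path(domain: &str, sender: &str, msg_id: u64) -> String {
    format!("{}/correspondence/{}/{:08}.msg", domain, sender, msg_id)
}

// In-memory model of the same idea: group message bodies by sender.
fn file_message(inbox: &mut BTreeMap<String, Vec<String>>, sender: &str, body: &str) {
    inbox.entry(sender.to_string()).or_default().push(body.to_string());
}
```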
Note that this procedure only works if a contact's root principal can be resolved using the
search domain configured in the user's runtime.
We can ensure this is the case by configuring the runtime to use a search domain that operates
a Dynamic DNS (DDNS) service
and by arranging with this service to create the correct records to resolve the root principal.
The author intends to operate such a service to facilitate the use of Blocktree by home users,
but a more long-term solution is to implement a blockchain for resolving root principals.
Only then would the system be fully decentralized.
Making public posts is accomplished by creating files in a directory with the HTML contents of the
posts.
Each such file, the directory containing it, and all parents of that directory
would be configured to allow others to read them and, in the case of the directories, execute them.
At least one runtime with the filesystem service registered would need to have the execute
permission granted to others to allow anyone to access these files.
When someone wanted to view the posts of another user,
they would use the filesystem service to read these files from the well-known posts directory.
Of course, users would not be using a file manager to interact with this social network;
they would use their browsers as they do now.
This web interface would be served by the social media service in their domain.
A normal user who has a Blocktree-enabled router would just type a special hostname into their
browser to open this interface.
Because the router provides DNS services to their network,
it can generate the appropriate records to ensure this name resolves to the address where the social
media service is listening.
The social media service would be responsible for sending messages to other users' domains to
get their posts,
and for reading the filesystem to display the user's direct messages.
All this file data would be used to populate the web interface.
It is not hard to see how the same system could be used to serve any type of media: text, images,
video, immersive 3D worlds.
All of these can be stored in files in the filesystem,
and so all of them are accessible to Blocktree actors.
One issue that must be addressed with this design is how it will scale to a large number of users
accessing data at once.
In other words,
what happens if the user goes viral?
Currently, the way to solve this would be to add more computers to the user's network which run
the sector and filesystem services.
This is not ideal, as it means the user would need to buy more hardware to serve their dank memes.
A better solution would be to implement peer-to-peer distribution of sector data in the filesystem
service.
This would reduce the load on the user's computers and allow their followers to share the posted
data with each other.
This work is planned as a future enhancement.
\subsection{A smart lock.}
The access control language provided by Blocktree's filesystem can be used for more than just
authorizing access to data.
To illustrate this point,
consider a smart lock installed on the front door of a company's office building.
When the company first got the lock they used NFC to configure the lock
and connect it to their WiFi network.
The lock then used link-local runtime discovery to perform automatic provisioning.
An IT administrator accessed \texttt{btconsole} to approve the provisioning request
and position the lock in a specific directory in the company's domain.
Permission to actuate the lock is granted if a principal has execute permission on the lock's file.
To verify the physical presence of an employee,
NFC is used for the authentication handshake.
When an employee presses their NFC device, for instance their phone, to the lock,
the lock generates a nonce and transmits it to the device.
The device then signs the nonce using the credentials it used during provisioning in the company's
domain.
It transmits this signature to the lock along with the path to the principal's file in the domain.
The lock then reads this file to obtain the principal's authorization attributes and its public key.
It uses the public key to validate the signature presented by the device.
If this is successful,
it then checks the authorization attributes of the principal against the authorization attributes on
its own file.
If execute permissions are granted, the lock actuates, allowing the employee access.
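The lock's decision combines the two checks above. In this toy version a hash stands in for the real signature scheme (so the "public key" stored in the principal's file is the same secret the device signs with, which no real scheme would allow), and the execute check uses Unix-style mode bits.

```rust
// Toy version of the lock's checks: signature verification over the
// nonce, then an execute-bit check against the lock's own file.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for a real digital signature.
fn toy_sign(nonce: u64, secret: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (nonce, secret).hash(&mut h);
    h.finish()
}

// In this toy, verification just recomputes the "signature".
fn verify(nonce: u64, sig: u64, public_key: &str) -> bool {
    toy_sign(nonce, public_key) == sig
}

// The lock actuates only if the signature checks out and the
// principal's group has the execute bit on the lock's file.
fn may_actuate(
    nonce: u64, sig: u64, public_key: &str,
    principal_gid: u32, lock_group: u32, lock_mode: u16,
) -> bool {
    verify(nonce, sig, public_key)
        && principal_gid == lock_group
        && lock_mode & 0o010 != 0 // group execute bit
}
```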
The administrators of the company's domain create a group specifically for controlling physical
access to the building.
All employees with physical access permission are added to this group,
and the group, rather than individual users, is granted execute permission on the lock.
\subsection{A traditional three-tier web application.}
While it is hoped that Blocktree will enable interesting and novel applications,
it can also be used to build the kind of web applications that are common today.
Suppose that we wish to build a three-tier web application.
Let us explore how Blocktree could help.
First, let us consider which database to use.
It would be desirable to use a traditional SQL database,
preferably one which is open source and not owned by a large corporation with dubious motivations.
These constraints lead us to choose Postgres,
but Postgres was not designed to run on Blocktree.
However, Postgres does have a container image available on Docker Hub,
so we can create a service to run this container image in our domain.
But Postgres stores all of its data in the local filesystem of the machine it runs on.
How can we ensure this does not become a single point of failure?
First, we should create a directory in our domain to hold the Postgres cluster.
Then we should procure at least three servers for our storage cluster
and provision runtimes hosted on each of them in this directory.
The sector service is registered on each of the runtimes,
so all the data stored in the directory will be replicated on each of the servers.
Now, the Postgres service should be registered in one and only one of these runtimes,
as Postgres requires exclusive access to its database cluster.
We now have to decide how other parts of the system are going to communicate with Postgres.
We could have the Postgres service set up port forwarding for the container,
so that ordinary network connections can be used to talk to it.
But we would have to set up TLS if we want this to be secure.
The alternative is to use Blocktree as a VPN and proxy network communications in messages.
This is accomplished by registering a proxy service in the same runtime as the Postgres service
and configuring it to allow traffic it receives to pass to the Postgres container on TCP port 5432.
In a separate directory,
a collection of runtimes is provisioned which will host the webapp service.
This service will use axum to serve the static assets of our site,
including the Wasm modules which make up our frontend,
as well as our site's backend.
In order to do this,
it will need to connect to the Postgres database.
This is accomplished by registering the proxy service in each of the runtimes hosting the
webapp service.
The proxy service is configured to listen on TCP 127.0.0.1:5432 and forward all traffic
to the proxy service in the Postgres directory.
The webapp can then use the \texttt{tokio-postgres} crate to establish a TCP connection to
127.0.0.1:5432,
and it will end up talking to the containerized Postgres instance.
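The listener side of such a proxy can be sketched with the standard library. In Blocktree the upstream hop would be carried in proxy-service messages rather than a plain TCP connection, so this only shows the local \texttt{127.0.0.1} listener forwarding bytes to a fixed upstream address.

```rust
// Minimal TCP forwarder: accept local connections and copy bytes in
// both directions to an upstream address until either side closes.
use std::io;
use std::net::{TcpListener, TcpStream};
use std::thread;

fn proxy(listen: &str, upstream: String) -> io::Result<()> {
    let listener = TcpListener::bind(listen)?;
    for client in listener.incoming() {
        let client = client?;
        let server = TcpStream::connect(upstream.as_str())?;
        let (mut cr, mut cw) = (client.try_clone()?, client);
        let (mut sr, mut sw) = (server.try_clone()?, server);
        // One thread per direction, copying until EOF.
        thread::spawn(move || { let _ = io::copy(&mut cr, &mut sw); });
        thread::spawn(move || { let _ = io::copy(&mut sr, &mut cw); });
    }
    Ok(())
}
```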
Although the data in our database is stored redundantly,
we do still have a single point of failure in our system,
namely the Postgres container.
To handle this we can implement a failover service.
It will work by calling the Postgres service with heartbeat messages.
If too many of these time out,
we assume the service is dead and start a new instance on one of the other runtimes in the Postgres
directory.
This new instance will have access to all the same data as the old one,
including its journal file.
Assuming it can complete any in-progress transactions,
the new service will come up after a brief delay
and the system will recover.
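The failover policy itself is simple enough to sketch: declare the service dead after a run of consecutive missed heartbeats. The heartbeat call is abstracted as a closure returning whether a reply arrived in time; everything else here is illustrative.

```rust
// Count consecutive missed heartbeats; after `max_missed` in a row,
// declare failover. Returns the number of heartbeats sent before the
// failover decision, for observability in this sketch.
fn monitor<F: FnMut() -> bool>(mut heartbeat: F, max_missed: u32) -> u32 {
    let mut missed = 0;
    let mut beats = 0;
    loop {
        beats += 1;
        if heartbeat() {
            missed = 0; // any reply resets the counter
        } else {
            missed += 1;
        }
        if missed >= max_missed {
            // Here the real service would start a replacement Postgres
            // instance on another runtime in the Postgres directory.
            return beats;
        }
    }
}
```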
\subsection{A realtime geospatial environment.}
% Explain my vision of the metaverse.
\section{Conclusion}
% Blocktree serves as the basis for building a cloud-level distributed operating system.
% The system enables individuals to self-host the services they rely on.
% It also gives businesses a freer choice of whether to own or lease computing resources.
% The system advances the status quo in secure computing.
% Composability leads to emergent benefits.
\end{document}