Unifying Principles for Next Generation Computing - Draft 4

Chris Gebhardt (chris@infocentral.org)

September 5, 2019

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The proposed infocentric architecture is a complete computing paradigm, similar to Unix and the web. It has its own set of characteristics and interwoven design principles, which must be understood in relation to each other. The adjective infocentric simply means “centered around information.” As a property, the term ‘infocentricity’ is sometimes used. Neither is an official or project name. The author’s research project is currently named InfoCentral. It is focused on exploring infocentric data, network, and software architectures.

Use only hash-based identity and referencing for persistent data

This is the central pillar of the infocentric architecture. Hash-based data identity and referencing uniquely yields:

  1. permanent, secure, reliable linking (quite unlike HTTP URIs)
  2. guaranteed support for high-resolution addressing, relative to the hash address, via external metadata
  3. permissionless third-party annotation and composition (This encourages discourse and collaboration.)
  4. reliably-layerable, multi-sourced content (selected and filtered at the full discretion of users)
  5. default versioning semantics (changes must reference prior data and/or versioning root nodes to be discovered)
  6. the ability to route content intelligently among networks without authoritative nodes
  7. the option to operate without any permanent network infrastructure (offline, sneakernet, P2P, etc.)
  8. graph data models with strong network effects (Each hash ID is a permanent nexus of interaction, via reference collection.)
  9. a universal replacement for countless fragile APIs (Instead, use append-only shared data graphs among software components.)
  10. software architecture with high interoperability and re-usability (user-space code operates on neutral shared data)
  11. a feasible route away from apps and data silos toward seamlessly-integrated information environments

A reasonable generalization is that once the switch to exclusive hash-based data addressing is made, everything else practically designs itself – from hardware and networks to operating systems to information and software architectures. (This includes all of the other principles in this article!) Put differently, there is an obvious correct way to do most things. Where flexibility exists, it involves quality-of-service differentiation that does not interfere with the data model or harm baseline interoperability. (ex. data persistence and replication policies, which are orthogonal to the data itself)

Longer Explanation

For the majority of computing history, we have assigned meaningful names to data (including code). We’ve then had to maintain secure mechanisms to retrieve valid data by name. This has been an enormous source of complexity and fragility. Names can change arbitrarily. Mappings can be corrupted, both accidentally and maliciously. Names themselves can conflict at local to global levels, requiring authoritative systems to resolve ownership. Data behind names is mutable, making third-party references, annotations, and compositions unreliable over time.

Before modern cryptography, named data was the only viable solution. Unfortunately, necessity became tradition and tradition became dogma. The vast majority of both user- and developer-facing tools continue to use assigned names, from filesystems and internet paths to component and class names in code. Most decentralized internet projects have attempted to bring assigned-name data and references to a new generation of tools. This is a severe mistake.

The only permanent, independent way for one piece of information to reference another is using an identifier calculated directly from the referenced information itself. This is sometimes known as content-based or intrinsic identity. All other schemes (those using assigned identities) require a trusted component and/or party to maintain the validity of the reference identity. Should it be compromised, fail, or shut down, all dependent references become invalid, deceptive, or unavailable. Vulnerable schemes include authoritative institutions (ex. ISBN, DOI, LOC), internet domains (DNS), names registered on cryptocurrency blockchains, and even named URI paths rooted in a public signing key. (ex. Dat, IPNS)

Cryptographic hashes are the efficient solution for generating content-based addresses. Data of any size goes into a hash function and a small value (typically 256 to 1024 bits) is returned. This hash value is then used to externally identify or reference the original data. In the same way that filenames are indexed to files, systems using hashes maintain indexes from hash values to data items. Unlike filenames, hashes are calculated rather than chosen. They are globally unique, as different data will always yield different hashes, with incomprehensibly high probability. When used properly, modern cryptographic hash functions may be considered effectively impossible to counterfeit. (ie. It is not feasible to find two different meaningful and syntactically correct pieces of data that hash to the same value.)
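
As a minimal sketch of the mechanics (the entity encoding, field names, and store are illustrative assumptions, not a specification), a hash address can be derived and indexed like this:

    import hashlib
    import json

    def hash_address(entity: dict) -> str:
        # Canonical JSON is an illustrative encoding choice; any stable byte
        # encoding of the entity would serve.
        encoded = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")
        # The SHA-256 digest of those bytes is the entity's permanent identifier.
        return hashlib.sha256(encoded).hexdigest()

    note = {"type": "note", "text": "Hello, graph."}
    note_id = hash_address(note)     # the same content always yields the same ID
    store = {note_id: note}          # systems index data items by their hash values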

Hash-based addressing allows any data to reference any data, without fear that the reference will become stale in the future. This allows independent annotations and compositions to be created, discovered, propagated, and layered on a global scale, using any available communication method. (including physically off-line) The significance of this cannot be overstated. It represents a revolution in how data is managed and how people interact digitally. The classic internet has no comparable mechanism. Composition and annotation on the web is unreliable because content can arbitrarily change or disappear. Likewise, higher-resolution addressing than the base URI requires insertion and maintenance of useful anchors in mutable content. Because the original author controls these, the annotator is constrained in expression. There may also be many addresses for the same data, making aggregation of multi-sourced third-party references difficult.

Hash-based addressing eliminates the need for countless application-specific network protocols. For example, a message need only reference an intended recipient hash ID (such as that of an inbox or discussion root). Content-addressed networks can then direct propagation accordingly, with metadata aiding in priority and routing. In the same way, any interaction can be orchestrated using agreed patterns of appending new data to a shared graph representing an application space. These “interaction patterns” among signed graph data can replace fragile APIs and authenticated network protocols.
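
For instance, a message entity might carry nothing more than a hash reference to an inbox root (the field names below are hypothetical; propagation itself is left to the content-addressed network):

    import hashlib
    import json

    def hash_address(entity: dict) -> str:
        encoded = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    inbox_root = {"type": "inbox-root", "owner": "<recipient public key>"}
    inbox_id = hash_address(inbox_root)

    message = {
        "type": "message",
        "refs": [inbox_id],          # the only addressing the sender needs
        "body": "See you at the meeting.",
    }
    # A content-addressed network propagates the message toward any party that
    # has registered interest in references to inbox_id; no host is ever named.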

Ultimately, hash-based addressing is the foundation of a whole new internet and software architecture – one that is fundamentally more fluid, collaborative, interoperable, decentralized, and context-rich. However, this requires that hash-based data identity and referencing be the only allowed option. Unlike all other design decisions, this one must be absolute. Hybrid support for named data offers zero benefits and adds devastating complexity. The incompatible mutability semantics compromise all of the benefits of hash addressing. Unlike the web, there is no need or place for multiple URI schemes within the data itself. This will be a challenging but necessary transition. URIs may be used as bibliographic records or as disposable network metadata, but never as first-class references within the data model. For example, immutable graph data may be wrapped in mutable, disposable “shells” that contain URIs of hosts for the contained hash addresses, known external references, and other metadata for management among existing internet systems. Someday these shells may be thrown away, while the contained native data remains intact, legacy-free, and archival by default.

Encrypt on write

The time to encrypt private data is when it is initially persisted, using one or more symmetric keys for payload content blocks in each data entity. Such data entities are thus safe to exchange among untrusted systems or over open networks. They are also trivial to back up.

A standard security architecture is desperately needed. Current systems repeatedly wrap and unwrap private data in myriad encryption layers for storage and transit. This is fragile and inefficient, and it often adds interoperability challenges. It requires many trusted components, such as servers that maintain access control and policies for plaintext. Default private data encryption dramatically simplifies security engineering, which in practice increases real-world security. It also encourages end-to-end client-side cryptography.

Hash-based references are always to data entities in their final encoded form, including ciphertexts. This way, public networks can still route requests, verify hashes, and collect references among entities with encrypted content. Transport encryption and/or routing obfuscation is still required if traffic analysis is part of the threat model.
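
A minimal sketch of encrypt-on-write, assuming the Python cryptography package’s Fernet recipe for the symmetric payload encryption (any authenticated cipher would do); note that the hash is taken over the final encoded entity, ciphertext included:

    import hashlib
    import json
    from cryptography.fernet import Fernet

    payload_key = Fernet.generate_key()          # held by the author, shared out of band
    ciphertext = Fernet(payload_key).encrypt(b"private journal entry")

    entity = {
        "type": "private-note",
        "payload": ciphertext.decode("ascii"),   # Fernet tokens are URL-safe base64 text
    }
    encoded = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")
    entity_id = hashlib.sha256(encoded).hexdigest()
    # The entity is now safe to replicate over untrusted systems and open
    # networks: hosts can verify the hash without ever seeing the plaintext.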

Sign everything

Almost all persistent data should be cryptographically signed. This makes trust in data portable, based directly on trust of its author(s). It eliminates reliance on fragile third-party provenance claims, as with traditional servers that must protect mutable named data.

Data entities can be covered by both internal and external signatures. Internal signatures are part of the hashed data and thus used for entity authorship claims. External signatures are claims that come from referencing entities, which are themselves internally signed. Thus the internal signature of a referencing entity implicitly covers the referenced entity, thanks to immutability.
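
A sketch of an internal signature using Ed25519 from the Python cryptography package (the entity layout is an assumption): the signature is placed inside the entity before encoding, so the resulting hash ID covers the authorship claim.

    import hashlib
    import json
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    author_key = Ed25519PrivateKey.generate()

    content = {"type": "note", "text": "Signed at the source."}
    to_sign = json.dumps(content, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = author_key.sign(to_sign)

    entity = dict(content, signature=signature.hex())    # internal signature
    encoded = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")
    entity_id = hashlib.sha256(encoded).hexdigest()       # hash covers content and signature
    # Any entity that later references entity_id signs it externally by implication.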

Merkle trees allow large collections of (typically related) data entities to be efficiently covered by a single signature of the Merkle root. This is similar to distributed version control systems like Git, though not all use cases can use this optimization.
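
A minimal Merkle-root sketch over a list of entity hashes (unbalanced-tree handling, domain separation, and other production details are omitted):

    import hashlib

    def merkle_root(leaf_hashes: list) -> bytes:
        # Pairwise-hash each level until a single root remains.
        level = list(leaf_hashes)
        while len(level) > 1:
            if len(level) % 2:                   # duplicate the last node on odd-sized levels
                level.append(level[-1])
            level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0]

    leaves = [hashlib.sha256(f"entity-{i}".encode()).digest() for i in range(5)]
    root = merkle_root(leaves)
    # A single signature over `root` now covers all five entities.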

Layer information

Layering of information adds context and higher dimensionality. This includes annotations, links to related information, refinements, and all forms of media overlays. It can also be treated as a method of flexibly composing both textual and structured information. In the case of a semantic graph, layering provides another means for building n-ary relations, even increasing effective arity over time.

Hash-referenced information is trivially easy to layer because references are stable and conflict free. For a given collection of data entities, another collection having references to them can be considered a layer upon the first collection. Yet another collection may layer upon both of these and others. None of this requires coordination among authors. To be clear, there is no explicit data artifact called a “layer,” as it is only a usage pattern.

In practice, layering requires reference discovery, collection, and propagation. This is fully independent of authoring and is a major role of repositories and networks thereof.
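
As a sketch of the usage pattern (the placeholder IDs and in-memory index stand in for what repositories and their networks actually provide), a “layer” over some base entities is simply the set of entities discovered to reference them:

    from collections import defaultdict

    # entity ID -> entity; each entity may carry a "refs" list of hash IDs
    store = {
        "h-article":    {"type": "article", "text": "…", "refs": []},
        "h-annotation": {"type": "comment", "text": "Needs a source.", "refs": ["h-article"]},
        "h-rating":     {"type": "rating", "stars": 4, "refs": ["h-article"]},
    }

    # Reverse index from referenced ID to referencing IDs: the reference
    # discovery and collection role that repositories perform.
    referenced_by = defaultdict(set)
    for eid, entity in store.items():
        for ref in entity["refs"]:
            referenced_by[ref].add(eid)

    layer_over_article = referenced_by["h-article"]   # {"h-annotation", "h-rating"}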

Always reference existing content; don’t quote or copy

Hash referencing provides robust context and provenance to users and software agents traversing the graph. This information is lost when content is instead copied or quoted, breaking its connection with the original containing entities. To be clear, existing entities may be copied, intact, among repositories. In fact, referencing them encourages this to happen automatically. We only wish to avoid unnecessarily lifting content from its original sources and including it in a new entity, which thus has a new ID.

Because every hash address serves as a global collection point to layer upon, referencing builds context over time. New versions should reference the versions they are derived from. Annotations should reference the precise content they are annotating.

Favor extensive decomposition

Structured data should generally be decomposed into atomic components, each contained as its own entity. This makes data re-use far easier because many different compositions may use the resulting components without having to extract them first. It often brings efficiency benefits as well. Smaller data entities are also less likely to conflict among multiple authors.

Table and column structure in traditional database systems must be meticulously planned in advance. This early optimization is a relic of the mainframe computing era, when memory was limited. Of course, it is also a factor in massive centralized systems that rely on data homogeneity to serve billions of users with minimized costs. This compromise of expressiveness is unnecessary under a model of more localized content and client-side processing. In contrast, data entities in the proposed model may have unlimited heterogeneous fields / properties within. For performance, higher-level software may always ingest and filter content from relevant data entities into local indexes and data structures. These may have normalized properties for a particular local application.
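
A sketch of decomposition (the field names are illustrative): instead of one monolithic record, each atomic statement becomes its own entity referencing the subject, so different compositions can reuse any of them directly.

    import hashlib
    import json

    def put(entity: dict) -> str:
        encoded = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    # Decomposed form: a subject entity plus one entity per property.
    person = put({"type": "person", "nonce": "f3a1"})   # nonce keeps the subject unique
    name   = put({"type": "name",  "of": person, "value": "Ada Lovelace"})
    born   = put({"type": "birth", "of": person, "value": "1815-12-10"})
    # A biography composition can reference `name` and `born` directly, while a
    # genealogy composition references `born` alone, with nothing to extract.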

Treat naming as a metadata layer

Human-meaningful name and label metadata should be treated as a layer. This helps enormously with internationalization. Different communities can have their own naming layers for the same shared data.

Naming layers may be collections of entities that label other entities via direct reference. They may also take the form of maps between generic shared name-holding entities and unlimited public and private data entities. To create a generic name entity, a text string may be placed in a minimal entity without a nonce, signature, or other uniqueness-creating artifacts. Therefore, it is possible to start with a human-meaningful name / label / tag, independently calculate an entity with a known hash, and then search for references to it. Communities may create less generic name entities by using agreed prefixes. For example, the name string for use with a wiki page may be “Wiki + Title”. This may in turn be mapped to a more generic “Title” name entity. It should be noted that none of these naming schemes require any form of ownership or authority.
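
A sketch of a generic name entity (the encoding is an illustrative assumption): because it contains only the name string and no nonce, signature, or other uniqueness-creating artifact, anyone can recompute the same hash independently and then search for references to it.

    import hashlib
    import json

    def name_entity_id(name: str) -> str:
        # Minimal, deterministic entity: nothing but the name string, so every
        # party independently derives the identical hash.
        entity = {"type": "name", "value": name}
        encoded = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    generic_id = name_entity_id("Relativity")
    wiki_id    = name_entity_id("Wiki Relativity")        # community-prefixed variant
    # A labeling entity can now map either name ID to any target entity:
    label = {"type": "label", "name": wiki_id, "target": "<hash of a wiki page root>"}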

Current popular models of computing assume extensive manual human intervention in managing raw data. This includes all aspects of files, internet addresses, plaintext code, and APIs. In these models, human-meaningful naming is critical for manageability. This has led to innumerable assumptions around UI metaphors. User-visible data elements should still be labeled in UIs, but names themselves must not be the low-level glue between data and among code units. This role must be reserved for hashes. In almost all cases, hashes should be hidden from the user since they have no value for memorability or spatial cues. It is implied that future code will use hash references. Thus, code editors that operate at a level slightly above plain text will be required, with semantic editors of even greater benefit.

Don’t embed textual markup

Stylistic markup should be separated as an annotation layer on top of plain text. This makes text easier to re-use because undesired markup does not need to be stripped. It also promotes private and multi-party markup efforts.

Unlike with traditional mutable text files, external markup is trivial with hash addressing. Annotations can use simple byte, word, or paragraph positions within the referenced text, without the worry that these will change in the future. Likewise, the plain text no longer needs anchor markup for annotators to use.
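
A sketch of external markup (the offsets and field names are illustrative): the annotation references the text entity’s hash plus stable byte positions within it.

    import hashlib
    import json

    def put(entity: dict) -> str:
        encoded = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    text = "Hash addressing makes external markup reliable."
    text_id = put({"type": "text", "body": text})

    emphasis = put({
        "type": "markup",
        "target": text_id,
        "range": [0, 15],        # half-open byte range covering "Hash addressing"
        "style": "emphasis",
    })
    # The referenced text is immutable, so the byte range can never go stale.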

Of course, these markup principles also apply to other forms of media besides text.

Promote non-linear text composition and navigation

Traditional linear texts make innumerable compromises that constrain expression to a fixed stream of phrases and illustrations. Order is all-important, useful detail is sacrificed for brevity and readability, interactivity is non-existent, and publication is largely static.

Non-linear texts are multi-layered. They allow the reader to “drill down” to the desired level of detail, select and order content before engaging, play with the data sets behind any figures or visualizations, discover and join ongoing interactions, and layer multiple authors’ interwoven comments and contributions into the same view.

Interact over shared append-only data graphs

With hash-based data addressing, many parties may independently append new data into a shared immutable data graph. Users simply create new data entities that hash-reference existing data entities. Networks then propagate these to interested parties. This provides a means to facilitate interactions without shared APIs or servers. It is sometimes called the “blackboard model” or “tuple space model.” Users and code at the edges interpret new data added to the graph and respond accordingly. Patterns of interaction are codified such that parties may agree on how to use the shared graph. Interaction Patterns, as we suggest calling them, specify data schemas and requirements, data and processing flows, and other declarative code. Information that does not follow a pattern may be ignored as invalid. It may instead belong to a different pattern that is active among adjacent users of the graph.
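
A minimal blackboard-style sketch (the pattern name, field names, and in-memory graph are assumptions): parties interact purely by appending entities that reference earlier ones, and each participant filters the graph for data matching the agreed pattern.

    import hashlib
    import json

    graph = {}                           # shared append-only store: entity ID -> entity

    def append(entity: dict) -> str:
        encoded = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")
        eid = hashlib.sha256(encoded).hexdigest()
        graph[eid] = entity              # append-only: entities are never mutated or removed
        return eid

    # A hypothetical "task" Interaction Pattern: requests and responses are
    # just new entities referencing prior ones.
    root = append({"type": "task-board"})
    req  = append({"type": "task-request", "refs": [root], "text": "Translate document"})

    # Another party, possibly offline at the time, later appends a response.
    resp = append({"type": "task-claim", "refs": [req], "worker": "<signer key ID>"})

    # Each participant interprets only the entities that fit the pattern.
    claims = [e for e in graph.values()
              if e.get("type") == "task-claim" and req in e.get("refs", [])]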

Don’t use network-interactive interfaces to compose software

Using network APIs to compose disparate systems is incredibly fragile. It requires continuous network connectivity, trustworthy servers, and coordination of arbitrary method interfaces and behavior. It has no default logging or traceability of activity. It doesn’t promote independent data re-use. It provides little or no security isolation of code. And it promotes black-box software designs. The shared graph interaction model has none of these problems and can be used as a replacement for all remote APIs. It also supports disconnected operation by virtue of not requiring network authorities, servers, or graph operation synchrony. (Hash references guarantee causal ordering among collections of inter-referenced entities.)

All persistent data must be self-describing

Use strong typing and embedded semantics for all persisted data. The sorts of ad-hoc data structures often used in application-specific code are unacceptable. It should be assumed that other, unknown software will want to re-use the same data in the future. Therefore, data must be independent and self-describing. This cannot be an afterthought. Information must be designed before any code is written. Of course, this should be a globally collaborative effort, resulting in high interoperability and less work over time. Unlike current approaches, there is little risk of a stifling “design by committee” dynamic. Semantic Web ontologies are stitched together with authoritative HTTP URIs and typically published at coarse granularity. In contrast, with well-decomposed hash-connected graph equivalents, it is easy for anyone to make fine-grained branches of existing ontology nodes. Existing data is unaffected because it hash-references prior revisions. At the same time, there is practical incentive not to fork unnecessarily and instead converge on the most widely useful semantics and supporting code.
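
A sketch of self-describing data (the schema fields are hypothetical): an entity declares its meaning by hash-referencing the schema entity that defines it, rather than relying on a name lookup or out-of-band documentation.

    import hashlib
    import json

    def put(entity: dict) -> str:
        encoded = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    # The schema itself is ordinary graph data, addressed by hash.
    temperature_schema = put({
        "type": "property-schema",
        "label": "air temperature",
        "datatype": "decimal",
        "unit": "degree Celsius",
    })

    reading = {
        "schema": temperature_schema,    # hash reference, not a name lookup
        "value": "21.5",
        "time": "2019-09-05T14:00:00Z",
    }
    # Any future tool can dereference "schema" to learn exactly what "value" means.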

User-space code must not control persistent data

The infocentric architecture inverts the relationship between software and persistent data. Instead of being a “backend” implementation detail, persistent data is primary, neutral, and independent. Software is ultimately secondary, in that it comes alongside to make shared data more useful. This is the opposite of the code-data encapsulation philosophy of object-oriented and other contemporary software paradigms based on protecting mutable data from harmful modification. It can also be considered a “post-application” architectural pattern.

Use public key infrastructure, not logins

APIs use authentication before granting access to methods that can read or harmfully modify data. This requires authoritative servers or decentralized equivalents. In infocentric architectures, private read access is controlled using encryption keys for otherwise public data entities. Write (append) access to a given repository or network may be controlled by filtering data by trusted signatures. Likewise, Interaction Patterns rely upon trusted signatures to validate involved data.
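
A sketch of signature-filtered write access using Ed25519 from the Python cryptography package (the acceptance policy is an assumption): a repository appends an entity only if its signature verifies against a key the repository already trusts, so there is no login or session.

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    author = Ed25519PrivateKey.generate()
    trusted_keys = [author.public_key()]            # the repository's acceptance policy

    payload = b"entity bytes to append to the shared graph"
    signature = author.sign(payload)

    def accept(payload: bytes, signature: bytes) -> bool:
        # The signature itself is the credential; there is nothing to log into.
        for key in trusted_keys:
            try:
                key.verify(signature, payload)
                return True
            except InvalidSignature:
                continue
        return False

    assert accept(payload, signature)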

Write infocentric software, not apps

Infocentric software architecture describes wholly different methodologies for creating software around semantic graph data. It promotes small re-usable components integrated on-demand by the user’s private software environment to enable desired Interaction Patterns. Unlike apps, this design philosophy does not bundle numerous concerns into static, pre-designed user experiences. The final UI is always rendered for current needs, environment, and user preferences. This is not to say that popular designs or templates cannot emerge, but they will not be exclusively designed by programmers and then foisted upon users.

Infocentric software needs new UI paradigms that make ad-hoc software discovery, integration and customization easy for average users. This implies forms of declarative, visual programming for the masses. Flow-based programming is a promising approach in this direction. The lowest-level user-space code will tend to be functional, due to its affinity with the data model. However, there are no absolute requirements.

Data should have zero implied dependencies

In the classic programming book The Pragmatic Programmer, Hunt and Thomas argue for “The Power of Plain Text.” This is really an argument that self-describing data is more robust against obsolescence and incompatibilities. It has little to do with literal “plain text” – which, after all, is an arbitrary binary format for encoding human-meaningful symbols. (hopefully UTF-8) However, the old thinking also assumes that humans will sometimes act as interpreters of context and fuzzy semantics within text-based encodings. This part is unnecessary. Binary data that is truly self-describing and context-rich can be read, interpreted, edited, and re-used by any standard tool. It never requires munging. What remains unacceptable are data encodings that rely on ambiguous external code that may not be accessible or executable in the future. This does not preclude data from hash-referencing the code it relies on, provided that code is itself self-describing. Compressed media formats are a common, necessary case of explicit code dependencies.

Network information, not machines

Infocentric architecture is focused on weaving together information, not networks, hosts, and services. These aspects are orthogonal concerns that merely support the information. This approach is commonly called information-centric networking or content-based networking.

The network is invisible

The mechanics of networking should not be a concern of user-space software. It is the job of lower architectural layers to make the global shared data graph appear local. In other words, when local software appends to a locally known subset of the graph, it is not concerned with how that data will be propagated out to other interested parties. Quality-of-service parameters will still exist, but things like hosts and routing are fully abstracted. In some schemes, QoS metadata is the digital equivalent of postage. Lack of network specificity means that “sneaker nets”, local ad-hoc meshes, and other unusual schemes are more feasible.

Collect and propagate reference metadata

In public hash-referenced data models, any plaintext references among entities are available to index and propagate. Propagation is driven by registered interest. This is how new information flows to those who need it.
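
A sketch of interest-driven propagation (the subscription mechanism is an assumption): parties register interest in a hash ID, and each newly indexed entity that references it is forwarded to them.

    from collections import defaultdict

    interests = defaultdict(list)        # hash ID -> callbacks of interested parties

    def register_interest(target_id, callback):
        interests[target_id].append(callback)

    def ingest(entity_id, entity):
        # Index the entity's plaintext references and notify interested parties.
        for ref in entity.get("refs", []):
            for callback in interests[ref]:
                callback(entity_id, entity)

    register_interest("h-discussion-root",
                      lambda eid, e: print("new reply:", eid))
    ingest("h-reply-1", {"type": "reply", "refs": ["h-discussion-root"], "text": "Agreed."})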

Guarantee social features

All social aspects of computing must be baked into our shared information architectures, not reliant on corruptible third-party services. The immutable graph data model is a solid foundation for many social network and community-building designs. It ensures that these can be easily interwoven and layered, allowing unlimited customization while maintaining interoperability. This is something that closed social platforms, by their very nature, cannot provide.

Promote genuine dialogue

Dialogue is greater than conversation, discussion, or debate. It involves creative action, whereby participants suspend their assumptions for the sake of revealing incoherence and discovering, if not new knowledge or perspectives, at least a greater understanding of others. Dialogue requires mutual respect and is based on the hope that good will result from the process. It does not require participants to relinquish strongly-held values, but rather encourages an openness that will lead to clearer understanding of those values, whether or not any change of mind follows.

Focus on enriching context

It is impossible to get everyone to agree. However, context and layering can at least allow everyone to see the field of competing ideas clearly. Different users and communities will be responsible for their own repositories and information layers, which may then be fetched and organized locally by the user. While nothing prevents users from crafting a biased view, this will be a conscious act rather than something determined by a secret algorithm.

Make collaboration effortless

Meaningful collaboration needs to become so accessible that participation increasingly becomes default behavior, in contrast to the top-down consumption-oriented internet.

The prescribed graph data model is intrinsically collaborative in that everyone may append, layer, and discover new information surrounding what is already known. While this does not guarantee that an author’s contributions will be valued and well-propagated, it provides the substrate to make this feasible as neutrally as possible.

Localized consensus is usually sufficient

Most online interactions don’t require costly global consensus. PKI among trusted and semi-trusted participants is normally sufficient. There are relatively few scenarios where a cryptocurrency blockchain or smart contracts are genuinely beneficial. On the other hand, the exclusive use of hash addressing makes it trivial to do periodic checkpointing or timestamping of “off-chain” graph interactions.
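
A sketch of periodic checkpointing (the anchoring step is deliberately left abstract): a single checkpoint entity referencing the recent interaction entities is the only thing that ever needs external timestamping.

    import hashlib
    import json

    def put(entity: dict) -> str:
        encoded = json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    recent_ids = [put({"type": "interaction", "n": i}) for i in range(3)]

    checkpoint_id = put({
        "type": "checkpoint",
        "refs": sorted(recent_ids),      # immutably covers every referenced entity
        "time": "2019-09-05T14:00:00Z",
    })
    # Only checkpoint_id needs to be timestamped or anchored externally; the
    # interactions themselves remain off-chain and locally verified.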