This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The proposed infocentric model is a complete computing paradigm, similar to Unix and the web. It has its own set of characteristics and interwoven design principles, which must be understood in relation to each other. The adjective infocentric simply means "centered around information." As a property, the term 'infocentricity' is sometimes used. Neither is an official term or a project name. The author's research project is currently named InfoCentral. It is focused on exploring infocentric data, network, and software architectures.

Use only hash-based identity and referencing for persistent data

This is the central pillar of the infocentric model. Hash-based data identity and referencing uniquely yields the properties and benefits described throughout the rest of this document.

A reasonable generalization is that once the switch to exclusive hash-based data identity is made, everything else practically designs itself -- from hardware and networks to operating systems to information and software architectures. Put another way, there is an obvious correct way to do most things. Where flexibility exists, it involves quality-of-service differentiation that does not interfere with the data model or harm baseline interoperability. (ex. data persistence and replication policies, which are orthogonal to the data itself)

Longer Explanation

For the majority of computing history, we have assigned meaningful names to data (including code). We've then had to maintain secure mechanisms to retrieve valid data by name. This has been an enormous source of complexity and fragility. Names can change arbitrarily. Mappings can be corrupted, both accidentally and maliciously. Names themselves can conflict at local to global levels, requiring authoritative systems to resolve ownership. Data behind names is unstable, making 3rd party references, annotations, and compositions unreliable.

Before cryptography was mature, named data was the only viable solution. Unfortunately, necessity became tradition and tradition became dogma. The vast majority of both user- and developer-facing tools continue to use assigned names, from filesystems and internet paths to component and class names in code. Most decentralized internet projects have attempted to bring assigned-name data and references to a new generation of tools. This is a severe mistake.

The only permanent, independent way for one piece of information to reference another is using an identifier calculated directly from the referenced information itself. This is sometimes known as content-based or intrinsic identity. All other schemes (those using assigned identities) require a trusted component and/or party to maintain the validity of the reference identity. Should it be compromised, fail, or shut down, all dependent references become invalid, deceptive, or unavailable. Vulnerable schemes include authoritative institutions (ex. ISBN, DOI, LOC), internet domains (DNS), names registered on cryptocurrency blockchains, and even named URI paths rooted in a public signing key.

Cryptographic hashes are the efficient solution for generating content-based identities. Data of any size goes into a hash function, and a small value (typically 256 to 1024 bits) is returned. This hash value is then used to externally identify or reference the original data. In the same way that filenames are indexed to files, systems using hashes maintain indexes from hash values to data items. Unlike filenames, hashes are calculated rather than chosen. They are globally unique, and different data will always yield different hashes. (with insanely high probability) When used properly, modern cryptographic hash functions may be considered effectively impossible to counterfeit. (ie. purposely find two different meaningful pieces of data that collide at the same identifying hash value)
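
As a rough sketch of the mechanism (Python, standard library only, with an ordinary dictionary standing in for a real repository), data is stored under its own SHA-256 hash and any other entity can reference it by that hash:

    import hashlib

    store = {}  # hash (hex string) -> raw bytes; stands in for a repository

    def put(data: bytes) -> str:
        """Store data under its own SHA-256 hash and return that hash as its identity."""
        h = hashlib.sha256(data).hexdigest()
        store[h] = data
        return h

    def get(h: str) -> bytes:
        """Retrieve data by hash and verify it before returning it."""
        data = store[h]
        assert hashlib.sha256(data).hexdigest() == h, "store returned counterfeit data"
        return data

    doc_id = put(b"An immutable piece of information.")
    note_id = put(b"A note that references " + doc_id.encode())  # stable reference by hash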

Hash-based identity allows any data to reference any data, without fear that the reference will go stale. This allows third party annotations and compositions to be created, propagated, and layered on a global scale, using any available communication method. (including physically off-line) The significance of this cannot be overstated. It represents a revolution in how data is managed and how people interact digitally. The classic internet has no comparable mechanism. Composition and annotation on the web is unreliable because content can arbitrarily change or disappear. It is also reliant upon continuous network connectivity for DNS lookup and data retrieval. Likewise, there may be many addresses to the same data, making aggregation of multi-sourced third party content difficult.

Hash-based identity doesn't just make information easier to manage and re-use. It also eliminates the need for countless specialized interactive network protocols. For example, a message need only reference an intended recipient's hash ID (such as an inbox or discussion root). Networks will then direct propagation accordingly. (perhaps with metadata aiding in priority and routing) Complex interactions can be orchestrated using agreed patterns of composition among shared graph data representing an application space. This can replace fragile interactive APIs.

Hash-based identity is also the foundation of a new internet and software architecture -- one that is fundamentally more fluid, collaborative, interoperable, decentralizable, and context-rich. However, this will require that hash-based identity and reference be the only allowed option. Unlike other design decisions, this must be absolute and unyielding. Named references have completely different semantics and cannot be mixed with hashes without compromising all of their benefits. Unlike the web, there is no place for multiple URI schemes. This will be a challenging but necessary transition. Legacy URIs may be used as disposable network metadata but never for first-class references in the base data model.

Encrypt on write

The time to encrypt private data is when it is initially stored, using a new unique symmetric key for each immutable, hash-identified data entity. This way data can be securely moved among systems and exchanged over open networks.

Having to wrap and unwrap persistent data in additional encryption layers is fragile and inefficient. It requires many trusted components, including servers that maintain access control and policies. Default private data encryption dramatically simplifies security engineering, which in practice increases real-world security.

Hash-based references are always to data entities in their final encoded form, including any encryption. This way, public networks can route requests, verify requested data, and collect references among entities. Transport encryption and/or routing obfuscation is still required if traffic analysis is part of the threat model.
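
A minimal sketch of encrypt-on-write, assuming the third-party cryptography package's Fernet recipe for authenticated symmetric encryption (any comparable cipher would do): each entity gets a fresh key, and the hash identity is taken over the ciphertext, so public repositories can verify and route the entity without being able to read it.

    import hashlib
    from cryptography.fernet import Fernet  # pip install cryptography

    def encrypt_on_write(plaintext: bytes):
        key = Fernet.generate_key()             # new unique symmetric key for this entity
        ciphertext = Fernet(key).encrypt(plaintext)
        entity_id = hashlib.sha256(ciphertext).hexdigest()  # identity covers the encrypted form
        return entity_id, ciphertext, key       # the key is shared only with authorized readers

    entity_id, ciphertext, key = encrypt_on_write(b"private notes")
    assert Fernet(key).decrypt(ciphertext) == b"private notes"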

Sign everything

Almost all persistent data should be cryptographically signed. This allows data trust to be directly based on trust of the author(s). It eliminates reliance on fragile 3rd-party provenance claims, as with traditional servers that protect mutable named data.

Immutable data entities can be covered by both internal and external signatures. Internal signatures are part of the hashed data. External signatures come from referencing entities.

Merkle trees allow large collections of (typically related) data entities to be efficiently covered by a single signature of the Merkle root. This is similar to distributed version control systems like Git, though not all use cases can use this optimization.
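A sketch of both ideas, assuming Ed25519 signatures from the cryptography package; the Merkle root here is a simple pairwise SHA-256 tree over entity hashes, so one external signature can cover a whole collection:

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def merkle_root(leaf_hashes: list[bytes]) -> bytes:
        """Reduce a list of entity hashes to a single root by pairwise hashing."""
        level = leaf_hashes
        while len(level) > 1:
            if len(level) % 2:                  # duplicate the last hash on odd-sized levels
                level = level + [level[-1]]
            level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0]

    entities = [b"entity one", b"entity two", b"entity three"]
    leaves = [hashlib.sha256(e).digest() for e in entities]

    author_key = Ed25519PrivateKey.generate()
    signature = author_key.sign(merkle_root(leaves))  # one signature covers the whole collection
    author_key.public_key().verify(signature, merkle_root(leaves))  # raises if forged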

Layer information

Layering of information adds context and higher dimensionality, whether annotations to a book or overlays on a map.

Hash-referenced information is easy to layer because references are stable and conflict free. For a given collection of data entities, another collection having references to them can be considered a layer upon the first collection. Yet another collection may layer upon both of these and others. None of this requires coordination among authors.

In practice, layering does require reference collection. This is a role of repositories and networks thereof.
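
A small sketch of layering (Python dictionaries standing in for real entities, with illustrative field names): a layer is just another entity whose content is hash references to the entities it layers upon, and further layers can reference both without any coordination among authors.

    import hashlib, json

    store = {}

    def put(obj) -> str:
        data = json.dumps(obj, sort_keys=True).encode()
        h = hashlib.sha256(data).hexdigest()
        store[h] = data
        return h

    # Base collection: two map features published by one author.
    road = put({"type": "road", "name": "Route 9"})
    river = put({"type": "river", "name": "Willow Creek"})

    # A layer from an unrelated author: annotations referencing the base entities by hash.
    hazards = put({"type": "hazard-layer", "targets": [road], "note": "flooding at mile 3"})

    # A further layer may reference both the base data and the first layer.
    review = put({"type": "review-layer", "targets": [road, river, hazards], "status": "verified"})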

Reference, don't copy or quote

Referencing provides original context and provenance, which are lost in copying. History and context need to be visible as users and software agents traverse the graph. New versions reference old versions. Annotations reference the precise text they are annotating.
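
For example (with a hypothetical entity schema, sketched as Python dictionaries), a new version references the old version's hash, and an annotation references the exact text span it concerns rather than quoting it:

    import hashlib, json

    def hash_of(obj) -> str:
        return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

    draft_v1 = {"type": "document", "text": "The quick brown fox jumps over the lazy dog."}
    draft_v2 = {"type": "document", "text": "The quick red fox jumps over the lazy dog.",
                "previous-version": hash_of(draft_v1)}        # new version references old

    annotation = {"type": "annotation",
                  "target": hash_of(draft_v1),
                  "range": {"start": 10, "end": 15},          # the exact span "brown"
                  "comment": "Color disputed; see v2."}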

Favor extensive decomposition and normalization

Decompose information into atomic components, each contained as its own hash-referenceable entity. This makes data re-use far easier because many different higher-level compositions may use the resulting components without having to extract them from a larger document. Unlike tables and columns in traditional database systems, which must be meticulously planned in advance, it costs almost nothing to create extra data entities and properties therein. Smaller data is also less likely to conflict among multiple authors.

For performance, higher-level software must often collect and filter involved data entities into local indexes and data structures.
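
A sketch of both points (the schema is illustrative, not a proposed standard): a recipe decomposed into atomic, individually hash-identified entities, plus the kind of local index that higher-level software might build for performance.

    import hashlib, json
    from collections import defaultdict

    store = {}

    def put(obj) -> str:
        data = json.dumps(obj, sort_keys=True).encode()
        h = hashlib.sha256(data).hexdigest()
        store[h] = obj
        return h

    # Atomic components: each ingredient is its own reusable entity.
    flour = put({"type": "ingredient", "name": "flour", "grams": 500})
    water = put({"type": "ingredient", "name": "water", "grams": 350})
    salt  = put({"type": "ingredient", "name": "salt",  "grams": 10})

    # Higher-level compositions reference components instead of embedding them.
    bread  = put({"type": "recipe", "name": "bread",  "uses": [flour, water, salt]})
    batter = put({"type": "recipe", "name": "batter", "uses": [flour, water]})

    # A local index for fast lookup: which recipes use a given component?
    used_by = defaultdict(list)
    for entity_id, entity in store.items():
        for ref in entity.get("uses", []):
            used_by[ref].append(entity_id)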

Don't embed textual markup

Treat stylistic markup as a separate annotation layer on top of plain text. This makes the text far easier to re-use.
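
As a tiny illustration (hypothetical field names, with a character-range convention used purely for example), the plain text and its styling live in separate entities, with the markup layer referencing the text by hash:

    import hashlib

    text = "Call me Ishmael."
    text_id = hashlib.sha256(text.encode()).hexdigest()

    # Stylistic markup lives in a separate layer that references the plain text by hash.
    markup_layer = {"type": "markup-layer",
                    "target": text_id,
                    "spans": [{"start": 8, "end": 15, "style": "emphasis"}]}  # "Ishmael"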

Names as a metadata layer

Human-meaningful name and label metadata can be treated as a layer. This helps enormously with internationalization. Different communities can even have their own naming layers for the same shared data.
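
A small sketch of naming as a layer (again with plain dictionaries and illustrative field names): different language communities label the same hash-identified entity independently, without touching the entity itself.

    import hashlib

    entity_id = hashlib.sha256(b"geographic feature: highest peak").hexdigest()

    naming_layer_en = {"type": "naming-layer", "lang": "en",
                       "labels": {entity_id: "Mount Everest"}}
    naming_layer_ne = {"type": "naming-layer", "lang": "ne",
                       "labels": {entity_id: "Sagarmatha"}}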

Interact over immutable shared graph data

With hash-based data referencing, many parties may append new data into the shared immutable data graph. This occurs by creating new data entities that hash-reference existing data entities. Networks propagate these updates to those interested. This provides a means to coordinate interactions without shared APIs or servers. This is sometimes called the "blackboard model" or "tuple space model." Users and code at the edges interpret new data added to the graph and respond accordingly. Patterns of interaction are codified in some way, such that parties may agree on how to use the shared graph. Interaction Patterns, as we may call them, specify data schemas and requirements, workflow-like sequencing, and other declarative code. Information that does not follow a pattern may be ignored; it may simply belong to another pattern in use by other parties working nearby.
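
A sketch of the blackboard idea: parties interact only by appending entities that hash-reference a shared root, and each participant filters the graph for data that matches the pattern it understands. (the schema and pattern name here are hypothetical)

    import hashlib, json

    graph = {}  # the locally known subset of the shared immutable graph

    def append(obj) -> str:
        data = json.dumps(obj, sort_keys=True).encode()
        h = hashlib.sha256(data).hexdigest()
        graph[h] = obj
        return h

    # A shared root entity stands in for "this auction" / "this discussion".
    root = append({"type": "auction", "item": "antique clock", "pattern": "simple-auction/1"})

    # Other parties interact purely by appending entities that reference the root.
    append({"type": "bid", "auction": root, "amount": 100, "bidder": "alice"})
    append({"type": "bid", "auction": root, "amount": 120, "bidder": "bob"})
    append({"type": "unrelated", "note": "ignored by this pattern"})

    # Each participant interprets only the data that follows the agreed pattern.
    bids = [e for e in graph.values() if e.get("type") == "bid" and e.get("auction") == root]
    highest = max(bids, key=lambda b: b["amount"])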

Don't use network-interactive interfaces to compose software

The network API model is incredibly fragile. It requires continuous network connectivity, centralized trustworthy servers, and coordination of arbitrary method interfaces and behavior. It has no default logging or traceability of activity. It doesn't promote independent data re-use. It provides little or no security isolation of code. And it promotes black-box software designs. The shared graph interaction model has none of these problems and can be used as a replacement for all APIs.

All persistent data must be self-describing

Use strong typing and embedded, extensible semantics. The sorts of ad-hoc data structures used in application-specific coding are unacceptable. In infocentric designs, when software persists data, the assumption is that other unknown software will also be using it. Therefore, all data must be independent and self-describing.
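
A contrast in miniature (the field names and schema reference below are illustrative only): an ad-hoc structure is meaningless outside the one program that wrote it, while a self-describing entity carries its type, semantics, and units explicitly.

    # An ad-hoc structure: meaningless outside the one program that wrote it.
    opaque = {"a": 3, "b": [12.5, 14.0], "f": True}

    # A self-describing entity: explicit type, schema reference, and units,
    # so unknown software can interpret it later.
    self_describing = {
        "entity-type": "sensor-reading",
        "schema": "sha256:<hash of the schema entity describing this type>",
        "quantity": "temperature",
        "unit": "celsius",
        "values": [12.5, 14.0],
        "calibrated": True,
    }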

Code does not control persistent data

The infocentric model inverts the relationship between software and persistent data. Instead of being a "backend" implementation detail, persistent data is primary, neutral, and independent. Software is secondary, in that it comes alongside to make shared data more useful. This is effectively a repudiation of the code-data encapsulation philosophy of object-oriented and other contemporary software paradigms. It can also be considered a "post-application" architectural pattern.

Write infocentric software, not apps

Infocentric software architecture is a wholly different methodology of creating software around universal shared semantic graph data. It tends to involve small re-usable components integrated at will by the user's private software environment. Unlike apps, it does not bundle dozens of concerns into pre-designed user experiences. This is not to say that popular designs cannot emerge, but they are not exclusively designed by programmers and then foisted upon users.

Infocentric software needs new UI paradigms that make ad-hoc software discovery, integration and customization easy for average users. This implies declarative, visual programming for the masses.

Data should have zero implied dependencies

In the classic programming book The Pragmatic Programmer, Hunt and Thomas argue for "The Power of Plain Text." This really boils down to the idea that self-describing data is more robust against obsolescence and incompatibilities. It has nothing to do with "plain text" -- which, after all, is an arbitrary binary format for printable characters. (hopefully UTF-8) However, the old thinking also assumes that humans will sometimes act as interpreters of context and fuzzy semantics within text-based encodings. This part is unnecessary. Binary data that is truly self-describing and context-rich can be read, interpreted, edited, and re-used by any standard tool. It never requires munging. What remains unacceptable are data encodings that rely upon ambiguous external code which may not be accessible or executable in the future. This does not preclude data from hash-referencing code it relies upon, provided that code is itself self-describing. A simple example is a compressed media format.
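
A sketch of the distinction (identifiers and placeholder bytes are purely illustrative): the entity below carries its semantics explicitly and hash-references the codec it depends on, rather than assuming some external program will remember what the bytes mean.

    import hashlib

    codec_spec = b"placeholder for a full, self-describing specification of the codec"
    codec_id = hashlib.sha256(codec_spec).hexdigest()

    media_entity = {
        "entity-type": "audio-clip",
        "encoding": codec_id,          # explicit hash reference to the codec it relies upon
        "duration-seconds": 4.2,
        "payload": b"<compressed audio bytes>",
    }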

Use public key infrastructure, not logins

APIs use authentication before granting access to methods that can read or modify data in harmful ways. This requires traditional authoritative servers. In infocentric architectures, private read access is controlled using encryption keys to otherwise public data. Write (append) access in a given repository or network is controllable by filtering on trusted signatures. Likewise, any given interaction among parties relies upon trusted signatures and agreed contracts for expected data schemas.
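
A sketch of signature-filtered write access, assuming Ed25519 from the cryptography package: a repository accepts an append only if it verifies against a public key it already trusts, with no login session involved.

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    alice = Ed25519PrivateKey.generate()
    mallory = Ed25519PrivateKey.generate()

    trusted_keys = [alice.public_key()]      # the repository's write-access policy

    def accept_append(data: bytes, signature: bytes) -> bool:
        """Accept an appended entity only if a trusted key signed it."""
        for key in trusted_keys:
            try:
                key.verify(signature, data)
                return True
            except InvalidSignature:
                continue
        return False

    entity = b"new data entity"
    assert accept_append(entity, alice.sign(entity)) is True
    assert accept_append(entity, mallory.sign(entity)) is False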

Network information, not machines

Infocentric architecture is focused on weaving together information, not networks, hosts, and services. Network aspects are orthogonal design concerns that merely support the information model and notified interest of users. This is commonly called information-centric networking or content-based networking. It is similar to Named Data Networking, but without the names!

The network is invisible

The mechanics of networking should not be a concern of user-space software. It is the job of lower architectural layers to make the global shared data graph appear local. In other words, when local software appends to a local known subset of the graph, it is not concerned with how that data will be propagated out to other interested parties. Quality of service parameters may still exist, but things like hosts and routing are fully abstracted. In some schemes, QoS metadata is perhaps the digital equivalent of postage stamps.

Collect and propagate reference metadata

In public hash-referenced data models, any plaintext references among entities are fair game to index and propagate. Propagation is driven by registered interest. This is how new information flows to those who need it.
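
A sketch of reference collection and interest-driven propagation, simplified to in-memory structures with illustrative field names: the repository scans ingested entities for outgoing hash references, maintains a back-reference index, and notifies parties who have registered interest in a given hash.

    import hashlib, json
    from collections import defaultdict

    referenced_by = defaultdict(set)   # hash -> set of hashes that reference it
    interest = defaultdict(list)       # hash -> callbacks registered by interested parties

    def register_interest(target_hash: str, callback):
        interest[target_hash].append(callback)

    def ingest(obj) -> str:
        """Store an entity, index its outgoing references, and notify interested parties."""
        data = json.dumps(obj, sort_keys=True).encode()
        h = hashlib.sha256(data).hexdigest()
        for ref in obj.get("references", []):
            referenced_by[ref].add(h)
            for callback in interest[ref]:
                callback(h, obj)
        return h

    root = hashlib.sha256(b"discussion root").hexdigest()
    register_interest(root, lambda new_hash, new_obj: print("new reply:", new_hash))
    ingest({"type": "reply", "references": [root], "text": "First!"})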

Social by design

Social features should be baked into the information architecture, not reliant on corruptible third party services.

Default collaboration

The prescribed data model is naturally collaborative in that everyone may append and layer new information. While this does not guarantee that an author's information will be accepted and well-propagated, it provides the substrate to make this possible as neutrally as possible.

Context over consensus

It is impossible to get everyone to agree. But context and layering at least let everyone see the field of competing ideas clearly. Different users and communities are responsible for their own layers, which may then be fetched and organized locally by the user.

Localized consensus is usually sufficient

Most online interactions don't require costly global consensus. PKI among trusted and semi-trusted participants is normally sufficient. There are relatively few scenarios where a cryptocurrency blockchain or smart contracts are beneficial.