This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Introduction

We are now in the early stage of a second computer revolution, with several disruptive innovation frontiers taking shape at once. Artificial intelligence, robotics, augmented reality, and ubiquitous computing (which includes the "internet of things") have been widely identified as drivers of both the next era of computing and massive socioeconomic changes. These research fields and others would benefit enormously from a shared foundation of decentralized information. Unfortunately, efforts have thus far been inadequate and disjointed. Decentralized information encompasses a fundamentally new data and software architecture, just as the World Wide Web vision did in 1989. It is similarly far reaching and disruptive. Antiquated information architecture and design patterns are holding new research back from its full potential. Likewise, critical practical improvements in software engineering, security, information management, and social computing are blocked by an architecture that was designed for the hardware and networks available decades ago. It is imperative that we re-examine these foundations and explore newly-feasible alternatives.

What is decentralized information?

Most concisely, decentralized information is not dependent on any external, authoritative service or software. Nothing besides other decentralized information is required to interpret, validate, and use it. It is self-describing and semantically rich. As long as there is a means to share it, users may interact over decentralized information alone -- no third-party trusted network or service is required. Freedom from external control and segregation also allows decentralized information from unlimited sources to be seamlessly integrated at the point it is used.

Software comes alongside decentralized information in neutral support, rather than controlling or encapsulating it. In the full realization of the concept, users are able to gather raw information and independent software components from diverse sources and dynamically compose them to suit varying needs. This mode of operation will eventually replace prepackaged software applications with fully-integrated information environments having no artificial boundaries. Ultimately, decentralized information unlocks a path toward what has long been considered the holy grail of software engineering: seamless, large-scale composition and reuse of software and data.

Web information is not decentralized, based on the dependency-focused definition. HTTP URIs require external, authoritative services to dereference, retrieve, and authenticate linked information. If any of these services fail, linked information becomes inaccessible, stale, or invalid. Web links cannot exist offline, apart from the services that comprise the web. Two previously-retrieved HTML documents cannot independently reference each other because their identities are bound to authoritative network services. Likewise, users cannot interact directly (i.e. peer-to-peer) using current web technologies, a model that carries over to nearly all mobile apps. Centrally-controlled web services are still mandatory coordinators of interactivity. This is a massive barrier to integration of information and functionality, with the practical result that each app and website is largely its own closed world. Attempts to connect them are labor intensive and highly unstable, often with market forces also in opposition.

Contrast: Centralized public information management (isolated sources)

Traditional centralized approaches to managing public information result in many isolated sources about a subject. Each source is ultimately controlled by one party. Users typically consume information from one source at a time because there is no way to automatically combine them. Books, channels, videos, and websites are common examples, as are most current forms of social media. Each is largely self-contained and competitive with other sources. There is no reliable way for third-party users to make external public annotations like commentary and cross-references. If interactive features are provided, such as a forum or comments section, content can still be manipulated at will by source owners. There is likewise no neutral and transparent way to rank the quality of information. Without explicit cooperation and ongoing maintenance, there is no reliable way to even form connections between information from different sources. For example, Web links become stale if referenced information changes, is relocated to a different URL, changes APIs, or disappears. As a result, there is no efficient way to automatically aggregate, de-duplicate, and browse the best information from all sources. The user is normally left to engage in a tedious process of curation from an endless supply of unorganized, often-contradictory information. Obviously, this also frustrates AI research.

The best sources of centralized public information have strong oversight rules that attempt to limit bias, prevent abuse, and ensure some degree of content and reference stability (such as mandating public revision history). Wikipedia and the Internet Archive are exemplary but still rely on their own community contracts and those of the supporting internet infrastructure. Even when crafted with the best of intentions, rules constrain innovation and diminish usefulness. For example, Wikipedia guidelines dictate that it is not to be used for original research or publication, may not include instructional material, and may only reference certain trusted primary sources. This is expected, given the goal of producing encyclopedic content for the web. However, there is also no reliable manner to comprehensively extend and integrate Wikipedia content with external information outside of its purview, let alone orchestrate direct interactions.

Contrast: Centralized private information management (software interfaces)

The traditional centralized approach to managing private information is to divide its control amongst software components and then integrate these using programming interfaces, whether locally or across the internet. Because the underlying data is encapsulated by the software, it is thus bound to an external authoritative entity, rendering it centralized by our definition. The interface ultimately defines the meaning of the data behind it, usually removing the incentive to add rich semantics to the data itself. The most common example is a relational database accessed via data and business objects that abstract the database while adding access rules, constraints, and various logic. In order to combine data from multiple systems, appropriate interfaces to these objects must be used. Directly accessing the underlying databases could bypass rules relied upon for data integrity in each system.

Graph-structured databases that embed sufficient semantics may not depend on external software components to give the stored data meaning. However, they may use a wide variety of access control methods. A Semantic Web data store, for example, may control reads and writes of the data at each URI it manages. The means of access and the use of authoritative, name-based data identity make the stored information centralized as a whole, just like all other web content. Said otherwise, there is only one source from which the latest version of a URI's data may be retrieved.

Decentralized information management: Layering

Decentralized information is structured in such a way that authors and creators do not need to cooperate, provide a service, or give authorization to make their information layerable. Once decentralized information is published, other information can always be reliably and independently layered upon it. Users choose which information from unlimited sources to layer into a personalized view. This freedom allows any degree of diversity, from a single trusted source to a wide spectrum chosen for both quality and maximum contrast. Layerable information sources can be both public and private, including content from personal, commercial, government, and community sources.

Google Maps, Waze, and other navigation tools are popular graphical representations of layered information. (i.e. you can toggle layers for traffic, weather, terrain, points of interest, etc.) A decentralized information equivalent would allow anyone to contribute any type of map information and no central party would have control over who sees what data. This may sound chaotic, but the same technologies that enable distribution also enable efficient community processes to build trust in the best quality information. When we say that users choose information from many sources, this does not refer to a manual evaluation process. It typically means that software will use available trust data and personal preferences to automatically select information most likely to be useful. Information that is irrelevant, untrusted, or widely considered inaccurate can easily be filtered out. This is different than traditional publishing selectivity or censorship, because users may always dig down to see the content excluded from their default view and refine it as desired.

Simple textual information can also be layered. Consider various forms of highlighting and emphasis that can be layered upon raw digital text -- like the infamous fluorescent markers upon the pages of a book but without the permanence. With decentralized information architecture, highlighting is a form of external, possibly third-party annotation. Other common forms of text annotations include cross-references, comments, fact checks, formatting cues, and endorsements. Layering of annotations implies sharing and aggregation among many users. Calculated layers such as statistical highlighting are possible -- perhaps visualized with varying color or intensity depicting how many people have highlighted or commented upon a certain region of text. The user may then drill down into the original annotations.
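
A minimal sketch of how such layering could be modeled (the Highlight structure and its field names are hypothetical, not a proposed format): each annotation targets an immutable text entity by content hash plus a character range, and annotations from independent sources are aggregated into a calculated statistical layer:

import hashlib
from collections import Counter
from dataclasses import dataclass

def entity_id(content: bytes) -> str:
    # Identity is the secure hash of the raw content -- no name, no location.
    return "SHA256:" + hashlib.sha256(content).hexdigest()

@dataclass(frozen=True)
class Highlight:
    target: str   # hash reference to the immutable text entity
    start: int    # character offset where the highlight begins
    end: int      # character offset where it ends (exclusive)
    author: str   # identifier of the annotating party (placeholder)

text = b"Decentralized information is not dependent on any external service."
tid = entity_id(text)

# Annotations gathered from many independent sources, layered after the fact.
highlights = [
    Highlight(tid, 0, 27, "author-a"),
    Highlight(tid, 29, 42, "author-b"),
    Highlight(tid, 0, 42, "author-c"),
]

# A calculated layer: how many sources highlighted each character position.
heat = Counter()
for h in highlights:
    if h.target == tid:                      # only annotations targeting this entity
        for position in range(h.start, h.end):
            heat[position] += 1

print(max(heat.values()))                    # peak overlap could drive display intensity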

As a practical matter, decentralized information does not imply a requirement of information source decentralization. A centralized source with robust community processes may approve information that meets certain standards. Readers may then choose to blanket trust its signed data, avoiding the need to fetch and analyze vast quantities of metadata used internally within a development community. (Although such data will always exist for those interested in drilling down.) For example, a decentralized version of OpenStreetMap or Wikipedia may develop popular default map data layers, rather than relying on an ad-hoc P2P approach. Organizations may run centrally-controlled repositories to help handle the demand for popular data. However, there is no reliance on such measures. The published information always remains separable from sources and transport methods without losing meaning or identity.

Software is the most complex form of information to layer, but there are already emerging strategies to do so. Most fall under the paradigm of declarative programming, which can be generalized as describing things rather than giving explicit instructions. Such description includes rich and extensible data semantics, rule sets, local parameterizations, component wirings, hints, environmental variables, and, of course, units of functional code. The final form of software is only materialized at the moment it is actually needed. Rather than being limited by static, pre-designed applications, the ideal local system takes everything it knows and generates the best available custom interactions and interfaces. This allows maximum flexibility and customization for the task at hand. Of course, this is a conceptual direction to evolve future software. Earlier renditions will have more software components that are manually designed and configured, but the concept provides a design principle to continually refactor toward.
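
As a conceptual sketch only (the component descriptions, resolver, and placeholder hash IDs below are invented for illustration), a declarative system lets components state what they provide and what they require, and the local environment wires them together only when a capability is actually requested:

# Each component is described declaratively: what it provides and what it requires.
# The "id" values are placeholder hash references, not real entities.
COMPONENTS = [
    {"id": "SHA256:table-source", "provides": "table", "requires": [],
     "impl": lambda deps: [["a", 1], ["b", 2]]},
    {"id": "SHA256:bar-chart", "provides": "chart", "requires": ["table"],
     "impl": lambda deps: "chart of " + str(deps["table"])},
]

def materialize(capability):
    # Resolve a capability on demand by wiring declared components together.
    component = next(c for c in COMPONENTS if c["provides"] == capability)
    dependencies = {need: materialize(need) for need in component["requires"]}
    return component["impl"](dependencies)

# The "application" never existed as a package; it is composed at the moment of need.
print(materialize("chart"))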

Practical Benefits

Decentralized information provides immediate benefits for all scenarios where independent, multi-party information needs to be aggregated, integrated, collaboratively evaluated, and then filtered and layered into useful views. Obvious arenas include research, journalism, education, government, healthcare, and most everyday personal and business communication. These are ripe with low-hanging-fruit use cases that do not require strict semantics or AI.

The Three C's of Decentralized Information for Social Computing

The nature of decentralized information promotes three major arenas of improvement in social computing:

Core challenges of decentralized systems

MIT's Digital Currency Initiative and Center for Civic Media have released an excellent report on the motivations and progress of efforts to "re-decentralize" the web. Four challenges to decentralized systems are discussed. The decentralized information model is a key component to solving all of these.

User and developer adoption

Decentralized information is not dependent on system adoption or network effects. It stands on its own, without any infrastructure, and is usable within any system that is able to understand it. It is inherently future-proof and easily made forward-compatible. This is attractive for driving developer adoption both within and outside existing software ecosystems because it eliminates concerns about new platform stability and longevity. The information itself is the platform. Software development around decentralized information is always subordinate to the information, using it as both the storage and communication medium. By decoupling software from information, re-use of both is promoted. Software bears the responsibility of working with neutral information, taking advantage of embedded semantics. Exclusive software ownership of data is simply not accepted, though particular software may constrain what data is valid for its own purposes.

As many have noted, the massive network effects of existing social networks are a challenge to new entries in the market. All efforts to provide direct alternatives have failed to gain popular traction. In contrast, it is far easier to build on-ramps and adaptors from the existing web to the realm of decentralized information. Embrace-and-extend opportunities are everywhere. (Compare sharing external web content on proprietary social networks.) As more developers build unique solutions around decentralized information, there will be greater incentives to network natively. Early adopters will include organizations and communities for whom the homogeneous, ad-driven, one-size-fits-all social networks are suboptimal. A distinct advantage of social networking built on decentralized information is that it does not depend on an interactive service. (Though some FOAF aspects can benefit from a trusted third-party intermediary.) Unlike Semantic Web / Linked Data based social network designs, fragile centralized web publishing is also not required. Decentralized information can exist before any software or networks are created to maximize its usefulness. Large-scale aggregators and search engines will still be useful, but these will be orthogonal to how the information is independently used at local system scale and among smaller community networks.

Security

Decentralized information is cryptography-oriented, but we don't need everyone to use privately-managed PKI from day one. A service with a traditional login can provide surrogate identities, keys, and signatures to bootstrap use of decentralized information. A data entity signed by a trusted service as having come from an authenticated user can be relatively trusted for many use cases. It is not tied to the notarizing service once created, because validation is purely cryptographic, not interactive. As such, the choice of signer becomes merely a security QoS concern. Users may reasonably put less trust in data signed by such a service than in data signed by a fully trusted user's key. However, this compromise is better than continuously depending on a centralized service for data identity, access, and authentication! In the meantime, those who wish to manage their own keys will be able to do so seamlessly, knowing that such support is the default.
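
A minimal sketch of why validation is non-interactive, assuming the Python cryptography package and an invented entity payload: once a service (or the user) signs an entity, anyone holding the corresponding public key can verify it offline, with no call back to the signer:

# Sketch using the Python "cryptography" package (Ed25519 signatures).
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature
import hashlib

# A login service acting as a surrogate signer for an authenticated user.
service_key = Ed25519PrivateKey.generate()
service_pub = service_key.public_key()

entity = b'{"type": "comment", "author": "user@example", "text": "hello"}'
signature = service_key.sign(entity)
entity_ref = "SHA256:" + hashlib.sha256(entity).hexdigest()

# Later, anywhere, with no call back to the service: verification is purely math.
try:
    service_pub.verify(signature, entity)
    print("entity", entity_ref, "was signed by the service's key")
except InvalidSignature:
    print("signature invalid")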

Monetization and incentives

Decentralized information has radically different economics than the classic internet. Because there are no fixed points of network interaction and because all data entities are immutable, many services can compete on QoS parameters over the same data entities, metadata collections, and compute services. In practice, repositories and networks thereof will take many forms and be as layered as decentralized information itself. Networks will come and go, but the hosted information will remain unchanged. This stability enables rapid experimentation and market differentiation.

As we transition from host-based to information-based networking, a new landscape of services will follow. Repository networks will focus on particular QoS needs. Some may run over public IP networks, especially smaller private, P2P, and community repositories. Others may use dedicated trunks. The economics of each will be negotiated accordingly. ISPs will have increasing incentive to run generic local caches of popular public content, to alleviate trunk bandwidth expenses. This will be possible without custom service-specific peering, but prioritization can always be negotiated. In general, we may expect a transition to direct utility models where users pay for what services they actually use instead of relying on nebulous freemium and advertising models. While traditional cloud hosting will be the obvious early onramp, personal and business servers will be able to integrate transparently, providing off-grid operation, high local QoS, and private compute services for mobile devices. Unlike application-based server appliances, decentralized information will allow any device to provide standardized utility computing service for any use case. This will also make consumer personal servers practical, though certainly not required.

Incentivizing the production of content is a separate matter. The inherent user control over decentralized information means that advertising cannot easily be forced, let alone tracked. As with networking, more direct incentive models will be needed. Community-driven patronage should become a far more viable option. Greater interactivity and connectedness will drive social incentives, and micro-donations will reduce barriers to financial contribution. We can expect a wide range of competing payment services to facilitate this. For example, a federated cryptocurrency service could create pseudonymous donation tokens that can be annotated (by hash reference) to signed entities in the graph of shared data. Content authors would then be able to cash these in on a periodic basis, reducing the number of traditional bank transactions.
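
One possible shape for such a token, sketched with invented field names: the token is just another immutable entity that points at the supported content by hash, so it travels independently of any particular payment network:

import hashlib, json, time

def ref(content: bytes) -> str:
    return "SHA256:" + hashlib.sha256(content).hexdigest()

article = b"...some signed content entity..."

# A donation token is itself an immutable entity that references the supported
# content by hash; the issuing payment service would sign it out of band.
token = {
    "type": "donation-token",          # hypothetical field names throughout
    "target": ref(article),            # hash reference to the content being supported
    "amount": "0.50",                  # denomination is up to the issuing service
    "issued": int(time.time()),
}
print(ref(json.dumps(token, sort_keys=True).encode()))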

Resisting market consolidation

Decentralized information promotes data portability and user ownership without mandating a particular network architecture. This permits economies of scale without the traditional risk of lock-in. With structured, semantically-rich information at the center of computing, everything else becomes orthogonal and switching costs are dramatically lowered. This promotes maximum competition for services and software solutions, with a greater leaning toward openness due to improved cost-sharing economics.

Open source development models have always worked best for shared infrastructure, such as operating systems, development tools and frameworks, common code libraries, and commodity UI software like browsers. With current software economics, more-specialized code tends to be proprietary, even though it often makes use of open infrastructure. This applies to both local software and web services. The strongest factors are the location of end-user value creation and the ability for application functionality to be efficiently modularized and shared. Decentralized information and declarative programming paradigms promote a much higher degree of software decomposition and modular composability than prepackaged applications. This follows from making information itself the nexus of integration rather than unstable APIs or code libraries in myriad competing languages. As more reusable modules are created, the effort required to take the next step decreases -- and with it the incentives to create proprietary code decrease. Such modules effectively become open shared infrastructure, in parallel with community-driven data schemas and ontologies. Any equivalent proprietary services must compete with the ability of users to provide their own with relative ease and near-zero switching costs.

Technical Properties of Decentralized Information

Decentralized information can be defined by several absolute technical properties:

Location neutral

This is the most obvious and well-understood property. Decentralized information is founded on decentralized data. It is never bound to a location or host. It exists apart from these physical artifacts. Many existing distributed systems provide data location neutrality.

No mutable references

This is perhaps the least obvious property. A mutable reference is one that may point to different data values over time, even though the reference itself is unchanged. Mutable references have no place in decentralized information because they prevent reliable, independent composition and layering of multi-source information. For example, a third party cannot reliably annotate a particular piece of data using a mutable reference because the data it points to may arbitrarily change, even deceptively. In contrast, a reference using a secure hash value is reliable because it is mathematically bound to a unique piece of data. All references within decentralized information must be to immutable data entities. "Entity" here just means a particular self-contained parcel of data. The terms "file," "document," and "record" are traditionally used to refer to mutable data units, so distinct terminology seems wise.
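
A small illustration of the distinction (using an invented in-memory registry as the mutable reference): the named reference can be silently repointed, while the hash reference is mathematically bound to one exact value:

import hashlib

def hash_ref(content: bytes) -> str:
    return "SHA256:" + hashlib.sha256(content).hexdigest()

v1 = b"The quick brown fox."
v2 = b"The quick brown fox!"   # a one-byte edit

# A mutable reference: the name stays the same while the data behind it changes.
registry = {"fox-note": v1}
registry["fox-note"] = v2      # silently repointed; prior annotations now mislead

# An immutable reference: bound to exactly one value, forever.
assert hash_ref(v1) != hash_ref(v2)
print(hash_ref(v1))            # annotating this reference can never be subverted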

Mutable references vs. mutable reference collections

In any real-world system, there must be a way to model and communicate data that is updated over time. In the decentralized information model, all data entities are immutable. By extension, so are all references contained within them. However, discovered knowledge of what entities reference each other is mutable. Data repositories may keep collections of metadata about references, typically indexed per known entity. This enables update tracking. A revision should reference any older version(s) it is derived from and possibly also a versioning root entity that serves as a master collection point. When a new entity is discovered by a repository, any revision references may be indexed. Software watching the collections may then be notified accordingly. Reference metadata may also be collected for annotations, links, property graph statements, etc. Mutable reference collections themselves are managed by repositories in whatever fashion suits local users. Networked repositories may also propagate reference metadata.
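
A minimal sketch of such a mutable reference collection, with invented entity fields such as "revises": the repository's indexes and watch callbacks are mutable local state, while every indexed entity remains immutable:

import hashlib, json
from collections import defaultdict

def ref(content: bytes) -> str:
    return "SHA256:" + hashlib.sha256(content).hexdigest()

class Repository:
    """Keeps mutable collections of metadata about immutable entities."""
    def __init__(self):
        self.entities = {}                  # hash -> raw bytes
        self.revisions = defaultdict(list)  # referenced hash -> revision hashes
        self.watchers = defaultdict(list)   # referenced hash -> callbacks

    def ingest(self, content: bytes):
        rid = ref(content)
        self.entities[rid] = content
        meta = json.loads(content)
        # Index any revision references the new entity declares, then notify watchers.
        for older in meta.get("revises", []):
            self.revisions[older].append(rid)
            for notify in self.watchers[older]:
                notify(rid)
        return rid

    def watch(self, target: str, callback):
        self.watchers[target].append(callback)

repo = Repository()
v1 = repo.ingest(b'{"text": "draft", "revises": []}')
repo.watch(v1, lambda new: print("revision discovered:", new))
repo.ingest(json.dumps({"text": "final", "revises": [v1]}).encode())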

Mutable reference collections are not controlled by the author of an entity and are not in any way authoritative regarding the global existence of references to an entity. They merely facilitate discovery and layering of discovered information. Any repository of decentralized information has its own metadata collections, just as any Git repository can have its own collection of branches for a project.

Among decentralized information, there is no absolute notion of a latest version; this is relative to the currently-known signed and timestamped revisions. Any notion of an official publication is merely based upon the signatures of a trusted party. Locating desired revisions requires knowing where to look, an aspect that is technically separate from decentralized information but is typically orchestrated using it. This is a networking concern that is expected to evolve over time.

Versioning root entities contrast sharply with centralized channels. While a root entity may be signed by its creator and revisions may be signed by the same party, this does not forbid third-party annotations or revisions from referencing the same versioning root. Whether these are considered valid is a policy decision on the data consumer's side.

Referencing a versioning root is not the same as using a mutable pointer. It is literally a reference to the knowledge that revisions may exist somewhere out there. Versioning roots should not be annotated with metadata that relates to particular revisions, as this would be ambiguous. The only exception is if the versioning root itself contains the initial version internally.

Contrast: IPFS Project

A major pillar of the IPFS design, the InterPlanetary Name System (IPNS, section 3.7 of the IPFS whitepaper), uses cryptographically-signed named pointers to Merkle DAG nodes. Data references may thus use assigned identities and hierarchical paths. This violates the mutable reference principle, even though the namespace records themselves are decentralized and self-certifying public-key records are used in place of a centralized registrar authority.

To illustrate, consider the following mutable IPNS path:

/ipns/[hash-value-1]/documents/recipes/pizza.html

The path component "hash-value-1" would be a hash of the public key that signs records of paths under this root. It never changes. Whoever has the corresponding private key may publish new paths at this location in the IPNS namespace. What these paths point to is arbitrary, however. A third party that wishes to reference this pizza.html document has the option to use this path. Unfortunately, it is a mutable reference and may go stale in the future if pizza.html is updated or pointed elsewhere. Fortunately, it is possible to use the underlying immutable IPFS (Merkle DAG) namespace instead.

Consider the following immutable IPFS path:

/ipfs/[hash-value-2]/documents/recipes/pizza.html

In this example, the component "hash-value-2" would be a Merkle DAG root hash. If the pizza.html document or any other part of the path were to change, the hash tree calculations would result in a new root hash value. Therefore, a reference using the IPFS namespace is permanently reliable. However, the use of meaningful naming is itself problematic for other reasons.

While the IPFS project has been used as a specific example here, the same principles apply to similar named-path Merkle-tree-based systems, such as Git and the Dat project.

Contrast: blockchain-based name registration

Suppose we use a scheme where the names of our data entities are registered in a blockchain. In particular, arbitrary text strings are mapped to secure hash values for data entities. Such names are guaranteed to be both globally unique and permanent. This enables lookup of entities by names convenient for human users. However, it also means that references to data using these names are bound to the operation of a particular supporting network that provides consensus via the distributed ledger. This is a logical form of registrar authority dependency. Should the blockchain fall into popular disuse in the future, an easily imaginable scenario, all references using it immediately become unreliable. As a result, this solution is unacceptable for decentralized information.

No human-meaningful reference identities

Human-meaningful names in references are harmful to information architecture because they encode implicit information in the names or paths themselves. In the best case, this information is merely redundant with internal metadata of data entities. (Although choosing these reference names and/or hierarchy positions is still wasteful manual data manipulation.) In the worst case, naming in references represents data with weak semantics that must be gleaned from context. Dependency on an original context can be seen as a form of logical centralization. It may be overcome by annotating explicit metadata after the fact, but it is easier to simply disallow this hazard.

In the IPFS example above, a Merkle hash tree allows human-meaningful naming, but it is subordinate to a secure hash identity that would change if any data name changed. This passes the immutable reference requirement. However, real-world usage could still go awry. Consider the "pizza.html" document referenced by the IPFS path above. Suppose that it does not have any internal metadata indicating that it is a food recipe. Certainly, a human user can look at the path and conclude this is the case, but there is no way to be sure. (Perhaps this document describes the classic object-oriented programming example of a pizza shop and "recipes" is simply a metaphor.) Suppose we instead identify the document solely using a flat hash value and ensure that the document itself contains machine-meaningful metadata indicating it is a food recipe. By making this explicit metadata instead of naming, we avoid boxing ourselves into a rigid, manual, path-based hierarchy. We can later construct human-convenient name views automatically.
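
To illustrate the difference, here is a sketch (borrowing schema.org terms purely as an example vocabulary): the recipe's category lives in explicit metadata inside the entity, and any human-friendly path is merely a view generated from that metadata, never part of the identity:

import hashlib, json

recipe = {
    "type": "https://schema.org/Recipe",     # explicit, machine-meaningful semantics
    "recipeCategory": "food",
    "name": "Pizza",
    "text": "...",
}
content = json.dumps(recipe, sort_keys=True).encode()
rid = "SHA256:" + hashlib.sha256(content).hexdigest()

# A human-convenient label is a generated view over metadata, not part of identity.
label = "/".join([recipe["recipeCategory"], "recipes", recipe["name"].lower()])
print(rid, "->", label)                      # e.g. SHA256:... -> food/recipes/pizza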

Strict use of flat, hash-based references also eliminates the complication of unlimited multiple naming. Suppose our "pizza" document entity can individually be accessed by a hash value. If arbitrary path naming is permitted, it can also be included in multiple name-supporting hash trees.

For example, the following hash-rooted paths could all refer to the same binary document entity:

SHA256:[hash-value-1]/documents/recipes/pizza.html  
SHA256:[hash-value-2]/food/recipes/pizza.html  
SHA256:[hash-value-3]/data/docs/food/recipes/pizza  
SHA256:[hash-value-4]/whatever/who/really/cares  

This is problematic if we wish to collect widespread metadata for an entity, for the purposes of annotation and networked collaboration. While nothing in the flat-hash ID scheme stops someone from attempting to fork data by changing even a single bit, thereby resulting in a new hash value, this demonstrates obvious malicious intention and can be more readily detected. Furthermore, most entities should have internal signatures, making such attacks less feasible in many cases. With arbitrary path naming, it is not clear whether a new path has been created for malicious intent or as an artifact of local organizational preferences. Cryptographic signatures do not help here, because the original signed entity remains unchanged, with its original hash value, in the leaf of a new Merkle tree.

Data management architectures often benefit from hierarchical aggregation, for the sake of scalability. However, this can be generated behind the scenes as an optimization. Consider a dependency graph of data or software components that are frequently used together. Smart repositories can discover such correlations and bundle up entities for efficient batch transfer. In contrast, hierarchical naming schemes encourage premature compositions that may not be useful.

Human user interfaces should generally not expose obscure hash values as labels. Indeed, metadata should always be used to provide meaningful name labels for human users. A related issue affects future programming languages and associated tooling. By referencing classes or modules using cryptographic hash values, we can make code more secure and unambiguous. For example, instead of:

import My.Favorite.Functions

we may prefer to use something like:

import SHA256:[hash-value-of-my-favorite-functions]

The latter is a globally-safe reference to a data entity that perhaps is internally named "My.Favorite.Functions". Once this module is imported, we can use whatever named artifacts it contains. Internal naming does not violate the meaningful-name reference principle, which applies to the external scope of data entity management. However, future language designs may result in every separable code artifact being contained in its own entity. The code would be entirely human-language neutral, with labels applied only within an editor view.

Able to be dereferenced via multiple channels

Decentralized information can be considered self-existent. Unlike content on the web, it is not dependent upon supporting networks and services. Networks and repositories merely host and facilitate the exchange of decentralized information. Competing schemes can differentiate neutrally on Quality of Service (QoS) features because these do not impact the information itself.

Unlike in systems that rely on centralized network authorities such as DNS, the dereferencing of identifiers among decentralized information is a major QoS differentiator. With the use of secure hash values for identity and referencing, there is no singular, straightforward lookup procedure. Within a particular subnetwork, a distributed hash table (DHT) can be used to efficiently locate copies of the data behind a hash value. However, reliance on a single global DHT or hierarchy of DHTs is not considered a requirement or perhaps even a valid goal. Just like information itself, the network services supporting decentralized information can be layered.
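
A sketch of the dereferencing idea, with Python dictionaries standing in for channels such as a local cache, a LAN peer, a DHT, or a cloud repository: because identity is the hash, any channel's copy can be verified locally, and no channel is authoritative:

import hashlib

def fetch(entity_ref: str, channels) -> bytes:
    """Try any available channel; the hash makes every verified copy equally good."""
    expected = entity_ref.split(":", 1)[1]
    for channel in channels:                 # local cache, LAN peer, DHT, cloud repo...
        content = channel.get(entity_ref)
        if content is not None and hashlib.sha256(content).hexdigest() == expected:
            return content                   # verified locally, no matter who served it
    raise LookupError("no channel could supply " + entity_ref)

data = b"some immutable entity"
entity_ref = "SHA256:" + hashlib.sha256(data).hexdigest()
local_cache, community_repo = {}, {entity_ref: data}
print(fetch(entity_ref, [local_cache, community_repo]))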

As a bootstrapping tool, permanent domains (self-certifying mutable pointer records) can be used to point to recommended repositories for organizations or authors. These repositories may hold mutable reference collections that are most likely to be current with the latest available revision data. However, permanent domains are not used for any data naming or reference purposes. They merely advertise network services.

Embedded and extensible semantics

Decentralized information must be self-descriptive in order to avoid the possibility of dependency on external centralized software systems for interpretation. Semantics must also be extensible by the annotation of existing published data.
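
A final sketch, again borrowing schema.org terms only as an example vocabulary (the "annotates" and "alsoType" fields are invented): the published entity carries its own semantics, and anyone may later extend those semantics by annotating it through a hash reference:

import hashlib, json

def ref(content: bytes) -> str:
    return "SHA256:" + hashlib.sha256(content).hexdigest()

# The entity describes itself; no external system is required to interpret it.
note = json.dumps({"type": "https://schema.org/Comment",
                   "text": "Try more basil."}, sort_keys=True).encode()

# Anyone may later extend those semantics by annotating the published entity.
extension = json.dumps({"annotates": ref(note),
                        "alsoType": "https://schema.org/Review"}, sort_keys=True).encode()
print(ref(note), ref(extension))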