This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Introduction

A second computer revolution has begun, with several disruptive innovation frontiers taking shape at once. Artificial intelligence, robotics, augmented reality, and ubiquitous computing (which includes the Internet of Things) have been widely identified as drivers of both the next era of computing and massive socioeconomic changes. These research fields and others would benefit enormously from a unified semantic data model that is not just human-friendly but also machine-friendly. Unfortunately, efforts have thus far been inadequate, impractical, and disjointed.

As the internet has matured, its architecture has resulted in control becoming naturally concentrated in large centralized service platforms. This has led to concerns about media manipulation, privacy invasion, and unfair competition. A movement has emerged to develop open source decentralized internet platforms that are immune to these problems. Thus far there is no universal solution, as each design has unavoidable engineering trade-offs. Interoperability among these platforms is also poor, as neutral standards have yet to materialize.

Decentralized information is a design philosophy that offers both a unified semantic data model and a common ground for decentralized internet projects. It lends itself to fundamentally new data, network, and software architectures, eschewing legacy design assumptions based on the hardware and networks available decades ago.

What is decentralized information?

Most concisely, decentralized information is not dependent on any external, authoritative service or software. Only other decentralized information is needed to identify, validate, and interpret it. It is collectively self-describing, rich in semantics, and referenced only by mathematical identifiers derived from the content itself (i.e. secure hashes). This makes any given parcel of data immutable relative to its reference. Given any method of exchange, users may interact over decentralized information alone. No third-party trusted network or service is required, and local processing is sufficient. Freedom from external control and segregation allows decentralized information from unlimited sources to be integrated at the point of use.
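
As a minimal sketch of this idea (in Python, with hypothetical field names), a parcel of data can carry its own semantics and be referenced purely by a hash derived from its content:

import hashlib, json

# A self-contained data entity: the payload plus embedded semantic metadata.
entity = json.dumps({
    "type": "note",                      # hypothetical embedded semantics
    "text": "Decentralized information is self-describing."
}, sort_keys=True).encode("utf-8")

# The reference identity is a secure hash of the content itself, so any
# change to the content yields a different reference.
reference = "SHA256:" + hashlib.sha256(entity).hexdigest()
print(reference)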

Software comes alongside decentralized information in neutral support, rather than controlling or encapsulating it. Users are able to gather raw information and software components from diverse sources and dynamically compose them. Prepackaged software applications are replaced with information environments having no artificial boundaries between functionality.

Contrast: Centralized public information management (isolated sources)

Traditional centralized approaches to managing public information result in many isolated sources. Each is ultimately controlled by one party. Users typically consume information from one source at a time because there is no general purpose way to automatically combine them. Books, broadcasts, streams, and websites are common examples, as are most current forms of social media. Each is largely self-contained and competitive with other sources. There is no reliable way for third parties to make external public annotations like commentary and cross-references. If interactive features are provided, such as a comments section on a website, content can still be manipulated at will by source owners. There is likewise no neutral and transparent way to rank the quality of information. Without explicit cooperation and ongoing maintenance, there is no reliable way to even provide links between information from different sources. As a result, there is no efficient way to continuously aggregate, de-duplicate, and navigate the best information from all sources. The user is left to engage in a tedious process of curation from an endless supply of unorganized, often-contradictory information. This also frustrates AI research and makes any results highly susceptible to commercial bias.

Based on the external-dependency definition, information on the web is not decentralized and never has been. HTTP URIs require external, authoritative services to look up, retrieve, and authenticate linked information. If any of these services fail or if content is arbitrarily changed or removed, linked information becomes inaccessible, stale, invalid, or corruptible. Web links cannot exist offline, apart from the services that comprise the web. Two previously-retrieved documents cannot independently reference each other because their identities are bound to authoritative network services. Likewise, users cannot interact directly (i.e. peer-to-peer) using current web technologies, a model that carries over to nearly all mobile apps. Centrally-controlled web services are still mandatory coordinators of interactivity. This is a massive barrier to integration of information and functionality, with the practical result that each app and website is largely its own closed world. Attempts to connect them are labor intensive and highly unstable, often with market forces in opposition.

The best sources of centralized public information have strong oversight rules that attempt to limit bias, prevent abuse, and ensure some degree of content and reference stability (such as mandating public revision history). Wikipedia and the Internet Archive are exemplary but still rely on their own community contracts and those of the supporting internet infrastructure. Even when crafted with the best of intentions, rules constrain innovation and diminish usefulness. For example, Wikipedia guidelines dictate that it is not to be used for original research or publication, may not include instructional material, and may only reference certain trusted primary sources. This is expected, given the goal of producing encyclopedic content for the web. However, there is also no reliable manner to comprehensively extend and integrate Wikipedia content with external information outside of its purview -- let alone orchestrate complex interactions among complementary efforts.

Contrast: Centralized private information management (software interfaces)

The traditional centralized approach to managing private digital information is to divide its control amongst software components and then integrate these using programming interfaces, whether local or remote. Because the underlying data is encapsulated by the software, it is considered bound by an external authoritative dependency. Programming interfaces ultimately define the meaning of the data behind them, removing incentives to add rich semantics to the data itself. The most common example is a relational database wrapped by data and business objects. An interface abstracts the database while adding additional access rules, constraints, and various logic. In order to combine data from multiple systems, appropriate interfaces must be used. Directly accessing the underlying databases could bypass rules and violate system integrity. Nevertheless, integrations usually involve tedious custom logic to munge incompatible semantics.

Graph-structured databases with sufficient embedded semantics do not depend on external software to give their stored data meaning. However, they may still use a wide variety of centralized access methods. A Semantic Web data store may govern reads and writes of data at each URI it controls. The means of access and the use of authoritative name-based data identity make the information centralized as a whole, just like all other web content.

Decentralized information management: Layering

Decentralized information is structured such that authors and creators do not need to cooperate, provide an API, or give authorization to make their information layerable. Once decentralized information is published, other information can always be reliably and independently layered upon it, thanks to immutable reference identity. Users choose which information from unlimited sources to layer into a custom view. This freedom allows any degree of diversity, from a single trusted source to a wide spectrum chosen for both quality and contrast. Layerable information sources can be both public and private, including content from personal, commercial, government, and community sources.

Map and navigation apps commonly contain graphical representations of layered information. Users may toggle layers for traffic, weather, terrain, points of interest, etc. A decentralized information equivalent would allow anyone to contribute any type of map information, and no central party would have control over who sees what data. This may sound chaotic, but the same technologies that enable distribution also enable efficient community processes to build trust in the best quality information. The notion that users choose information from many sources does not refer to a manual evaluation process. It means that software will use available trust data and personal preferences to automatically select information most likely to be useful. Information that is irrelevant, untrusted, or widely considered inaccurate can easily be filtered out. This is different than traditional publishing selectivity or censorship. Users may always examine the content excluded from their default view and refine it as desired.
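
A rough sketch of this automatic selection, assuming hypothetical trust scores, layer descriptors, and preference settings:

# Hypothetical layered map annotations from many sources, each attributed to an author.
layers = [
    {"author": "alice", "kind": "traffic", "data": "..."},
    {"author": "bob",   "kind": "traffic", "data": "..."},
    {"author": "carol", "kind": "terrain", "data": "..."},
]

# Trust data and personal preferences gathered from the user's own graph.
trust = {"alice": 0.9, "bob": 0.2, "carol": 0.7}          # assumed trust scores
preferences = {"kinds": {"traffic", "terrain"}, "min_trust": 0.5}

# Software selects the layers most likely to be useful; nothing is deleted,
# and the excluded layers remain available for inspection and refinement.
visible = [layer for layer in layers
           if layer["kind"] in preferences["kinds"]
           and trust.get(layer["author"], 0.0) >= preferences["min_trust"]]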

Simple textual information can also be layered. Consider various forms of highlighting and emphasis that can be layered upon raw digital text. With decentralized information architecture, highlighting is a form of external, likely third-party annotation. Other common forms of text annotations include cross-references, comments, fact checks, formatting cues, and endorsements. Layering of annotations implies sharing and aggregation among many users. Calculated layers such as statistical highlighting are possible -- perhaps visualized with varying color or intensity depicting how many people have highlighted or commented upon a certain region of text. The user may then drill down into the original annotations.
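
A sketch of such a calculated layer, assuming a hypothetical annotation format that targets character ranges of a single immutable text entity:

from collections import Counter

# Hypothetical highlight annotations, each referencing a character range of
# the same immutable text entity (identified here by a placeholder hash).
highlights = [
    {"target": "SHA256:...", "start": 10, "end": 25},
    {"target": "SHA256:...", "start": 12, "end": 30},
    {"target": "SHA256:...", "start": 100, "end": 120},
]

# Count how many annotations cover each character position; the counts can
# drive a calculated layer such as color intensity in a reading view.
intensity = Counter()
for h in highlights:
    for position in range(h["start"], h["end"]):
        intensity[position] += 1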

Decentralized information does not imply that information sources must always be decentralized. A centralized source with robust community processes may approve information that meets certain standards. Readers may then choose to blanket trust its signed data, avoiding the need to fetch and analyze vast quantities of metadata used internally. (Although such data will always exist for those interested.) For example, a decentralized version of OpenStreetMap or Wikipedia may develop popular default map data layers, rather than relying on an ad-hoc P2P approach. Organizations may run centrally-controlled repositories to help handle the demand for popular data. However, there is no reliance on such measures. All published data remains separable from sources and networks without losing meaning, identity, or authentication. If a source fails, the data can be moved elsewhere seamlessly.

Software is a complex form of information to layer, but declarative programming paradigms offer solutions. As a generalization, declarative code describes things rather than giving explicit linear instructions. That which is not strictly ordered through reference chains may be freely layered as long as it is logically sound. Descriptions may include extensible data semantics, rule sets, local parameterizations, component wirings, hints, environmental variables, and, of course, units of functional code. The final form of software is only materialized at the moment it is actually needed. Rather than being limited by static, pre-designed applications, the ideal user environment takes everything currently known and generates custom interactions and interfaces. This allows maximum specialization for the task at hand with minimal human design effort. Of course, this is an ideal conceptual direction to evolve future software. Earlier renditions will have more manually-designed components and configurations, but the concept provides a design principle to continually refactor toward.

Practical Benefits

Decentralized information provides immediate benefits for all scenarios where multi-party information needs to be aggregated, integrated, collaboratively evaluated, and then filtered and layered into useful views. Obvious arenas include research, journalism, education, government, healthcare, and most everyday personal and business communication. These are ripe with low-hanging-fruit use cases that do not immediately require complex semantics or AI.

Social Computing

Decentralized information can particularly benefit three major areas of social computing:

Challenges to Decentralized Systems

Decentralized information must be supported by practical real world systems. While adaptors to legacy platforms are possible, they cannot take full advantage of its native properties. On the other hand, decentralized systems have yet to be proven at global scale.

In August 2017, MIT's Digital Currency Initiative and Center for Civic Media released a report on the motivations and progress of various decentralized internet projects. The report identifies four outstanding challenges to decentralized systems. We believe that a decentralized information model is a critical component of solving these.

User and developer adoption

Decentralized information is not dependent on system adoption, federation, or network effects. It stands on its own, without any infrastructure, and is usable within any system that is able to understand it. It is inherently future-proof and easily made forward-compatible by layering additional metadata. This is attractive for driving developer adoption both within and outside existing software ecosystems because it eliminates concerns about new platform stability and longevity. The information itself is the platform. Software development around decentralized information is always subordinate to the information, using it as both a storage and communication medium. (The latter must be combined with a use-case-appropriate means to propagate new data.) By decoupling software from information, re-use of both is promoted. Software bears the responsibility of working with neutral information, taking advantage of embedded semantics. Exclusive software ownership of data is simply not allowed, though particular software may define what data is considered valid for its own purposes.

The Semantic Web effort shares the interoperability and decoupling goals, but it is bound to the benefits and burdens of web architecture. Unfortunately, the dominant business models of the current web are based on explicit data and functionality silos, used to drive market lock-in. Semantic Web technologies have limited benefit to closed applications but carry substantial computational and development overhead costs. As with decentralized information, its main benefits come from the network effects of publicly shared data and schemas. Without a means to initiate these, Semantic Web has been largely a non-starter.

Semantic Web retains the premise of arbitrarily-named documents with public URLs at centrally-controlled web hosts. While local and content-based URI schemes exist, there is no consistent story for how to integrate them. Various "semantic desktop" projects have failed to prove meaningful benefit, with excessive complexity, clumsy interfaces, and no convenient mechanisms for data portability or bridging to existing systems. To the average developer, Semantic Web is also inaccessibly complex. It is especially difficult to justify its adoption for an internal or personal project. Perhaps telling is that the open source community has not embraced Semantic Web, even though it dovetails with the benefits of open code.

The philosophy of decentralized information promotes simple, elegant tools that are independently useful for small-scale development. Strict separation of concerns prevents forced complexity. Decentralized identity and referencing schemes ensure trivial portability among disparate public and private systems, with little or no infrastructure investment. The learning curve is shallow, with most semantic graph complexity offloaded to optional metadata that can be annotated later. It is likewise not dependent upon design-by-committee ontologies.

Decentralized information can be used by traditional software applications, but it is most powerful when paired with a software and UI environment native to its paradigms. It prescribes small, re-usable components wired together into interaction patterns for processing and generating software-neutral graph data. This design applies to both local processing and interactions among multiple remote users sharing a graph. The level of modularization and ease of contribution allows much greater cost sharing.

Small development groups, IT departments, and independent contractors have every reason to want interoperability standards. They reduce the amount of variation between jobs, increase skill transference, and simplify development toolsets. Enterprise integration opportunities are likewise attractive to large developers, where overhead costs are absorbed by cumulative business value. In both cases, modularity-promoting, open-source development models are the intended target. Industry sectors should find new avenues to collaborate on efficiently meeting shared IT needs.

As many have noted, the massive network effects of existing social networks are a challenge to new entries. All efforts to provide direct alternatives have failed to gain popular traction. However, as more developers build interactions around decentralized information, there will be greater incentive to network natively. Historical precedent can be found in the consumer transition from pre-internet online services to the open web. Once users are comfortable with the new platform, it is only a matter of reaching a tipping point of popularity. Early adopters could include organizations and communities for whom the homogeneous, ad-driven, one-size-fits-all social networks are suboptimal.

Social networking itself is trivial over decentralized information. A distinct advantage is that it does not depend on an interactive service. (Though some FOAF query processing can benefit from a trusted third party intermediary.) Unlike Semantic Web / Linked Data based social network designs, fragile centralized web publishing is also not required. Decentralized information can exist before any software or networks are created to maximize its usefulness. Large-scale aggregators and search engines will still be useful, but these will be orthogonal to how the information is independently used at local system scale and among smaller community networks. Ultimately, decentralized information can succeed for more specialized purposes even if the economics of large-scale social networks remain insurmountable. For this reason, competing with them is not a primary goal. If the larger paradigm succeeds, this will follow effortlessly.

Security

Decentralized information is cryptography-oriented, but it is unnecessary for everyone to use locally-managed public-key infrastructure from day one. A service with a traditional login can provide surrogate identities, keys, and signatures to bootstrap use of decentralized information. Data signed by a trusted service as having come from an authenticated user can be relatively trusted for many use cases. It is not tied to the notarizing service once created, because validation is purely cryptographic, not interactive. As such, cryptography becomes merely a security QoS concern. Users may reasonably put less trust in data signed by such a service than in data signed by a fully trusted user's secret key. However, this compromise is better than continuously depending on a centralized service for data identity, access, and authentication on every access! In the meantime, those who wish to manage their own keys will be able to do so seamlessly, knowing that such support is the default.
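
A minimal sketch of this bootstrapping pattern, assuming the third-party Python cryptography package and a hypothetical surrogate service key; verification needs only the data, the signature, and the service's public key, with no call back to the service:

# Sketch only: the "cryptography" package and the surrogate-service setup are
# assumptions for illustration, not prescribed dependencies.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

service_key = Ed25519PrivateKey.generate()       # held by the surrogate service
entity = b'{"author": "user@example", "text": "hello"}'
signature = service_key.sign(entity)             # signed on behalf of the user

# Validation is purely cryptographic and works offline, long after creation.
service_public_key = service_key.public_key()
service_public_key.verify(signature, entity)     # raises InvalidSignature if altered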

A unique property of decentralized information is the ability of an untrusted third-party service to blindly aggregate new data- and metadata-carrying entities without being able to decrypt their payloads. This feature depends on a container model that allows plaintext hash references between data entities. It separates the network concern of propagating data updates from the software concern of processing user data. The latter can be done locally for common use cases.
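
A sketch of such a container, assuming the Python cryptography package for the encryption and hypothetical field names; the aggregating repository sees only the plaintext references and an opaque payload:

import hashlib, json
from cryptography.fernet import Fernet           # assumed dependency for the sketch

key = Fernet.generate_key()                      # shared only among authorized readers
payload = Fernet(key).encrypt(b"private message body")

# The container keeps hash references in plaintext so an untrusted repository
# can index and propagate the entity without reading its payload.
container = {
    "references": ["SHA256:" + hashlib.sha256(b"parent entity").hexdigest()],
    "payload": payload.decode("ascii"),
}
print(json.dumps(container)[:80])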

A final critical aspect of decentralized information is the ability to move all user processing client-side. The buzzword "serverless" today usually refers to traditional application components that can run dynamically on pooled cloud server resources instead of being tied to a particular hardware/OS server instance. However, clients operating over decentralized graph data can truly be serverless, in that no centralized application logic exists. Instead, they interact via contracts and interaction patterns that describe intended data schemas, logic, and protocols. Each client then independently validates data it is involved with. For many use cases, this eliminates the need for trusted third-party servers, which normally must have access to the data involved. This is the application equivalent of end-to-end encrypted messaging.
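
A sketch of client-side validation against a shared contract, with a made-up rock-paper-scissors interaction and hypothetical field names:

# A contract declares the data schema the interacting clients agree on; each
# client validates incoming entities locally, with no central enforcement point.
contract = {"required": {"author", "timestamp", "move"},
            "allowed_moves": {"rock", "paper", "scissors"}}

def validate(entity: dict) -> bool:
    # Every participant checks the same rules against data it receives.
    return (contract["required"] <= entity.keys()
            and entity["move"] in contract["allowed_moves"])

print(validate({"author": "alice", "timestamp": 1700000000, "move": "rock"}))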

Monetization and incentives

Decentralized information has radically different economics than content on the classic internet. Because there are no fixed points of network interaction and because all data entities are immutable and hash-identified, many systems can compete on QoS parameters over the same data. In practice, repositories and networks thereof will take many forms and will be as layered as decentralized information itself. Networks will come and go, but the hosted information will nominally remain unchanged (so long as replicas exist somewhere accessible). This stability enables rapid experimentation and market differentiation.

As we transition from host-based to information-based networking, a new landscape of services will follow. Repository networks will focus on particular QoS needs. Some may run over public IP networks, especially smaller private, P2P, and community repositories. Others may use dedicated trunks. ISPs will have increasing incentive to operate generic local caches of popular public content, to alleviate upstream bandwidth expenses. This will be possible without custom, service-specific peering, but prioritization can always be negotiated. In general, we may expect a transition to direct utility models where users pay for what services they actually use instead of relying on freemium and advertising models. While traditional cloud hosting will be the obvious early onramp, personal and business servers will be able to integrate transparently, providing off-grid operation, high local QoS, and private compute services for mobile devices. Unlike complex application-based server appliances, decentralized information will allow any device to provide standardized utility computing service for any use case. This commoditization and transparency should also make consumer personal servers practical, though certainly not required.

Incentivizing the production of content is a separate matter. Unlike on the web, the inherent end-user control over decentralized information and adjacent software environments means that advertising cannot easily be forced, let alone tracked. (The only option would be DRM-laden side-channel apps for specific media, comparable to current streaming music providers.) As with networking, more direct incentive models will be needed. Community-driven patronage should become more viable. Greater interactivity and connectedness will drive social incentives. Micro-payments/donations will reduce barriers to financial contribution. We can expect a wide range of competing payment services to facilitate this. For example, a federated cryptocurrency service could create pseudonymous donation tokens that can be annotated (by hash reference) to signed entities in the graph of shared data. Content authors would then be able to cash these in on a periodic basis, reducing the number of transactions.

Resisting market consolidation

Decentralized information promotes default data portability and user ownership without mandating a particular network architecture. This permits the economies of scale of federated or centralized services without the traditional risk of lock-in. (For example, IPFS and MaidSafe could be good candidates as P2P hosts but could also compete with Amazon S3 for bulk public data.) With structured, semantically-rich information at the center of computing, everything else becomes orthogonal and switching costs are dramatically lowered. This should promote competition and differentiation among services and software solutions, with a greater leaning toward openness due to improved cost-sharing economics.

Open source development models have always worked best for shared infrastructure, such as operating systems, development tools and frameworks, common code libraries, and commodity UI software like browsers. With current software economics, more-specialized code tends to be proprietary, even though it often makes use of open infrastructure. This applies to both local software and web services. The strongest determining factors are the location of value creation and the ability for code functionality to be efficiently modularized and shared. Decentralized information and declarative programming paradigms promote a much higher degree of software decomposition and modular composability than prepackaged applications. This follows from making information itself the nexus of integration rather than unstable APIs or code libraries in myriad languages. As more reusable modules are created, the effort required to take the next step decreases -- and with it the incentives to create proprietary code decrease. Such modules effectively become open shared infrastructure, in parallel with community-driven data schemas and ontologies. Any equivalent proprietary code or services must compete with the ability of users to provide their own with relative ease and near-zero switching costs. This should push value creation down the chain and tend to promote smaller market participants and greater in-house capability.

Technical Properties of Decentralized Information

Decentralized information can be defined by several absolute technical properties:

Location neutral

This is the most obvious and well-understood property. Decentralized information is founded on decentralized data. It is never bound to a location or host. It exists apart from these physical artifacts. Many existing distributed systems provide data location neutrality.

No mutable references

This is perhaps the least obvious property. A mutable reference is one that may point to different data values over time, even though the reference itself is unchanged. Mutable references have no place in decentralized information because they prevent reliable, independent composition and layering of multi-source information. For example, a third party cannot reliably annotate a particular piece of data using a mutable reference because the data it points to may arbitrarily change, even deceptively. In contrast, a reference using a secure hash value is reliable because it is mathematically bound to a unique piece of data. All references within decentralized information must be to immutable data entities. "Entity" here just means a particular self-contained parcel of data. The terms "file," "document," and "record" are traditionally used to refer to mutable data units.
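
A sketch of the contrast, in Python with hypothetical names: the dictionary plays the role of a mutable namespace, while the hash reference is bound to exactly one value:

import hashlib

recipe_v1 = b"Pizza dough: flour, water, yeast, salt."
recipe_v2 = b"Pizza dough: flour, water, yeast, salt, olive oil."

# A mutable reference: the name stays the same while the data behind it changes,
# so a third-party annotation of "pizza.html" silently becomes unreliable.
site = {"pizza.html": recipe_v1}
site["pizza.html"] = recipe_v2

# An immutable reference: the hash is bound to exactly one piece of data, so an
# annotation targeting it can never be redirected or silently invalidated.
annotation = {"target": "SHA256:" + hashlib.sha256(recipe_v1).hexdigest(),
              "comment": "Try a longer cold fermentation."}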

Mutable reference metadata collections

In any real-world system, there must be a way to model and communicate data that is updated over time. In the decentralized information model, all data entities are immutable. By extension, so are all references contained within them. However, discovered knowledge of what entities reference each other is mutable. Data repositories may keep collections of metadata about references, typically indexed per known entity. This enables update tracking. A revision should reference any older version(s) it is derived from and usually a versioning root entity that serves as a master collection point. When a new entity is discovered by a repository, any revision references may be indexed. Software watching the collections may then be notified accordingly. Reference metadata may also be collected for annotations, links, property graph statements, etc. Mutable reference collections themselves are managed by repositories in whatever fashion suits local users. Networked repositories may also propagate reference metadata, with or without copies of the referenced entities.
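
A sketch of such a collection, assuming a hypothetical "revises" field in revision entities and treating the repository index as a simple in-memory structure:

from collections import defaultdict

# A repository's mutable, local index of discovered references between
# immutable entities; it is not authoritative and not part of the data itself.
revisions_of = defaultdict(set)

def ingest(entity_hash: str, entity: dict) -> None:
    # If a newly discovered entity declares what it revises, index that fact
    # so software watching the versioning root can be notified.
    for older in entity.get("revises", []):
        revisions_of[older].add(entity_hash)

ingest("SHA256:bbb...", {"revises": ["SHA256:aaa..."], "text": "edited draft"})
print(revisions_of["SHA256:aaa..."])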

Mutable reference metadata collections are not controlled by the author of an entity and are not in any way authoritative regarding the global existence of references to an entity. They merely facilitate discovery of new information. Any repository of decentralized information has its own metadata collections, just as any Git repository can have its own collection of branches for a project.

Among decentralized information, there is no absolute notion of a latest version, as this is relative to the currently-known signed and timestamped revisions. Any notion of an official publication is merely based upon the signatures of a trusted party. Retrieving desired revisions requires knowing where to look, an aspect that is technically separate from decentralized information but is typically orchestrated using it. For example, disposable repository metadata may indicate which repositories or networks are believed to have a copy of an entity. This is a networking aspect that is intended to evolve over time.

Versioning root entities are nothing like controlled channels. While a root entity may be signed by its creator and revisions may be signed by the same party, this does not forbid third-party annotations or revisions from referencing the same versioning root. Whether these are considered valid is policy-based, on the data consumer's side.

Referencing a versioning root is not equivalent to referencing a pointer to mutable data. The root is literally a reference to an instance of an abstract concept. Versioning roots should not be annotated with metadata that relates to particular revisions, as this would be ambiguous. The only exception is if the versioning root itself contains an initial version internally. For example, a root may represent a particular book but contain none of its text. It would therefore be inappropriate to reference the root when annotating a quotation found in the text of a chapter revision.

Design Contrast: IPFS Project

IPFS (InterPlanetary File System) is one of a number of recent decentralized P2P storage network projects with incentivized sharing of storage space. MaidSafe and Dat are similar projects. While these have interesting network structures in their own right, none use the decentralized information model described in this discussion. This is in large part due to their efforts to replicate familiar properties of centralized information systems.

A major pillar of the IPFS design, called the InterPlanetary Name System (IPNS, section 3.7 of the IPFS whitepaper), uses cryptographically-signed named pointers to Merkle DAG nodes. Data references may thus use assigned identities and hierarchical paths. This violates the mutable reference principle, even though the namespace records themselves are decentralized and self-certifying public-key records are used in place of a centralized registrar authority.

To illustrate, consider the following mutable IPNS path:

/ipns/[hash-value-1]/documents/recipes/pizza.html

The path component "hash-value-1" would be a hash of the public key that signs records of paths under this root. It never changes. Whoever has the corresponding private key may publish new paths at this location in the IPNS namespace. What these paths point to is arbitrary, however. A third party that wishes to reference this pizza.html document has the option to use this path. Unfortunately, it is a mutable reference and may go stale in the future if pizza.html is updated or pointed elsewhere. Fortunately, it is possible to use the underlying IPFS immutable (Merkle DAG) namespace instead.

Consider the following immutable IPFS path:

/ipfs/[hash-value-2]/documents/recipes/pizza.html

In this example, the component "hash-value-2" would be a Merkle DAG root hash. If the pizza.html document or any other part of the path were to change, the hash tree calculations would result in a new root hash value. Therefore, a reference using the IPFS namespace is permanently reliable. However, the use of meaningful naming is itself problematic for other reasons.

Note that while the IPFS project has been used as a specific example here, the same principles apply to similar named-path Merkle-tree-based systems. This includes the widely used Git distributed version control system.

Design Contrast: blockchain-based name registration

Suppose we use a scheme where the names of our data entities are registered in a blockchain. In particular, arbitrary text strings are mapped to secure hash values for data entities. Such names are guaranteed to be both globally unique and permanent. This enables lookup of entities by names convenient for human users. However, it also means that references to data using these names are bound to the operation of a particular supporting network that provides consensus via the distributed ledger. This is a logical form of registrar authority dependency. Should the blockchain fall into popular disuse in the future, an easily imaginable scenario, all references using it immediately become unreliable. As a result, this solution is unacceptable for decentralized information.

No human-meaningful reference identities

Human-meaningful names in references are harmful to information architecture because they encode implicit information in the names or paths themselves. In the best case, this information is merely redundant with internal metadata of data entities. (Although choosing these reference names and/or hierarchy positions is still wasteful manual data manipulation.) In the worst case, naming in references represents data with weak semantics that must be gleaned from context. Dependency on an original context can be seen as a form of logical centralization. It may be overcome by annotating explicit metadata after the fact, but it is easier to simply disallow this hazard.

In the example above, a Merkle hash tree allows human-meaningful naming, but it is subordinate to a secure hash identity that would change if any data name changed. This satisfies the immutable reference requirement. However, real-world usage could still go awry. Consider the "pizza.html" document referenced by the IPFS path above. Suppose that it does not have any internal metadata indicating that it is a food recipe. Certainly, a human user can look at the path and conclude this is the case, but there is no way to be sure. Perhaps this document describes the classic object-oriented programming example of a pizza shop and "recipes" is simply a metaphor. Suppose we instead identify the document solely using a flat hash value and ensure that the document itself contains machine-meaningful metadata indicating it is indeed a food recipe. By using explicit metadata instead of naming, we avoid boxing ourselves into a rigid, manual, path-based hierarchy.

Strict use of flat, hash-based references also eliminates the complication of unlimited multiple naming. Suppose our "pizza" document entity can individually be accessed by a hash value. If arbitrary path naming is permitted, it can also be included in multiple name-supporting hash trees.

For example, the following hash-rooted paths could all refer to the same binary document entity:

SHA256:[hash-value-1]/documents/recipes/pizza.html  
SHA256:[hash-value-2]/food/recipes/pizza.html  
SHA256:[hash-value-3]/data/docs/food/recipes/pizza  
SHA256:[hash-value-4]/whatever/who/really/cares  

This is problematic if we wish to collect widespread metadata for an entity, for the purposes of annotation and networked collaboration. While nothing in the flat-hash ID scheme stops someone from attempting to fork data by changing even a single bit, thereby resulting in a new hash value, this demonstrates obvious malicious intention and can be more readily detected. Furthermore, most entities should have cryptographic signatures, making such attacks less feasible. With arbitrary path naming, it is not clear whether a new path has been created for malicious intent or as an artifact of local organizational preferences. Cryptographic signatures do not help here, because the original signed entity remains unchanged, with its original hash value, in the leaf of a new Merkle tree.

Data management architectures often benefit from hierarchical aggregation, for the sake of scalability. However, this can be generated behind the scenes as an optimization. Consider a dependency graph of data or software components that are frequently used together. Smart repositories can discover such correlations and bundle up entities for efficient batch transfer. In contrast, hierarchical naming schemes encourage premature compositions that may not be useful.

Human user interfaces should generally not expose raw hash references. Instead, metadata should be used to provide meaningful name labels to widgets. A related issue affects future programming languages and associated tooling. By referencing classes or modules using cryptographic hash values, we can make code more secure and disambiguated. For example, instead of:

import My.Favorite.Functions

we may prefer to use something like:

import SHA256:[hash-value-of-my-favorite-functions]

The latter is a globally-safe reference to a data entity that perhaps is internally named "My.Favorite.Functions". Once this module is imported, we can use whatever named artifacts it contains. Internal naming does not violate the no-meaningful-names principle, which applies only to references in the external scope of data entity management. However, future language designs may result in every separable code artifact being contained in its own entity. The code would be entirely human-language neutral, with labels applied only within an editor view.

Multiple dereferencing methods

Unlike in systems that use centralized network authorities such as DNS, dereferencing is a major QoS differentiator among decentralized information systems. With the use of secure hash values for referencing, there is no singular, straightforward lookup procedure. Within a particular network, a distributed hash table (DHT) can be used to efficiently locate copies of data at a hash address. However, reliance on a single global DHT or hierarchy of DHTs is not considered a requirement or even a valid goal. Just like information itself, the network services supporting decentralized information can be layered.

As a bootstrapping tool, permanent domains (self-certifying mutable pointer records) can be used to point to recommended repositories for organizations or authors. These repositories may hold reference collections that are most likely to be current with the latest available revisions and other metadata. However, permanent domains cannot be used for any data naming or reference purposes. They merely advertise network services and are themselves unnamed. As permanent domains are tiny records, a global DHT may be a good distribution method.
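
A sketch of dereferencing with verification, where the repositories are stand-ins (plain dictionaries) for a local cache, a community repository, or a DHT-backed network; the lookup path is purely a QoS concern because the result is always checked against the hash in the reference:

import hashlib

def dereference(reference: str, repositories) -> bytes:
    # Try each repository in preference order and verify the retrieved bytes
    # against the hash carried in the reference itself.
    expected = reference.split(":", 1)[1]
    for repo in repositories:
        data = repo.get(reference)          # repositories are plain dicts here
        if data is not None and hashlib.sha256(data).hexdigest() == expected:
            return data
    raise LookupError("no reachable replica for " + reference)

content = b"hello"
ref = "SHA256:" + hashlib.sha256(content).hexdigest()
local_cache, community_repo = {}, {ref: content}
print(dereference(ref, [local_cache, community_repo]))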

Embedded and extensible semantics

Decentralized information must be self-descriptive in order to avoid the possibility of dependency on external centralized software systems for interpretation. However, semantics are extensible, via annotation of existing published data. This introduces some novel complexity around layering, as the same information may be externally annotated with different and perhaps incompatible semantics. This is actually a powerful feature, as it allows for independent exploration of varied dimensionality and frames of reference in the absence of consensus. For example, a geopolitical unit can simultaneously be a sovereign state and a disputed territory, even though it is grounded by a singular permanent concept entity. References should point to the most generic semantic layer possible. Biological classification provides another common application. Effectively, we can agree that something exists before we agree on what it is and what its properties are. Root entities can bind the ambiguous instance as these things are being debated.
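
A sketch of conflicting semantic layers over a shared root, with hypothetical field names; which assertions a consumer accepts is a matter of local policy:

import hashlib, json

# A minimal concept root: it asserts only that something exists.
root = json.dumps({"concept": "geopolitical-unit"}).encode("utf-8")
root_ref = "SHA256:" + hashlib.sha256(root).hexdigest()

# Independent third parties layer incompatible semantics onto the same root;
# consumers choose which layers to trust while sharing the same grounding entity.
annotations = [
    {"target": root_ref, "assert": {"type": "sovereign state"}, "author": "alice"},
    {"target": root_ref, "assert": {"type": "disputed territory"}, "author": "bob"},
]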