August 26, 2021
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
A second computer revolution is imminent, with several disruptive innovation frontiers taking shape at once. Artificial intelligence, robotics, augmented reality, and ubiquitous computing (which includes the Internet of Things) have been widely identified as drivers of both the next era of computing and massive socioeconomic changes. These research fields and others would benefit enormously from a unified data model that is not just human-friendly but also machine-friendly. Unfortunately, efforts have thus far been inadequate, impractical, and disjointed.
As the internet has matured, its architecture has resulted in control becoming naturally concentrated in large centralized service platforms. This has led to concerns about media manipulation, privacy invasion, and monopolistic practices. A movement has emerged to develop open source decentralized internet platforms that are immune to these problems. Thus far there is no universal solution, as each design has unavoidable engineering trade-offs. Interoperability among these platforms is also poor, as neutral standards have yet to materialize. Furthermore, efforts using a P2P-network-based approach face severe economic challenges in competing with large, centralized cloud storage and compute services.
Decentralized Information is a design philosophy that offers a unified semantics, security, and data container model. This serves as a common ground for decentralized internet projects. It lends itself to fundamentally new information, network, and software architectures, eschewing legacy design assumptions based on the hardware and networks available decades ago.
Most concisely, decentralized information is not dependent on any external, authoritative service or software. Only other decentralized information is needed to identify, validate, and interpret it. It is collectively self-describing, rich in semantics, and referenced only by mathematical identifiers derived from the content itself (i.e. secure hashes). This makes any given parcel of data immutable relative to its reference. Given any method of exchange, users may interact over decentralized information alone. No third-party trusted network or service is required, and local processing is sufficient. Freedom from external control and segregation allows decentralized information from unlimited sources to be integrated at the point of use.
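For illustration, here is a minimal sketch (in Python, with an invented JSON note standing in for a data parcel) of how such a mathematical identifier can be derived from the content alone, so that any holder of the bytes can verify the reference locally:

    import hashlib

    def hash_reference(entity_bytes: bytes) -> str:
        """Derive an immutable reference from the content alone."""
        return "SHA256:" + hashlib.sha256(entity_bytes).hexdigest()

    # Anyone holding the bytes can re-derive and verify the reference locally;
    # no lookup service or registrar is involved.
    note = b'{"type": "note", "text": "Preheat the oven to 250 C."}'
    ref = hash_reference(note)
    assert hash_reference(note) == ref            # same bytes, same identity
    assert hash_reference(note + b" ") != ref     # any change yields a new identity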
Software comes alongside decentralized information in neutral support, rather than controlling or encapsulating it. Users are able to gather raw information and software components from diverse sources and dynamically compose them. Prepackaged software applications are replaced with information environments that have no artificial boundaries between units of functionality.
Traditional centralized approaches to managing public information result in many isolated sources. Each is ultimately controlled by one party. Users typically consume information from one source at a time because there is no general-purpose way to automatically combine them. Books, broadcasts, streams, and websites are common examples, as are most current forms of social media. Each is largely self-contained and competitive with other sources. There is no reliable way for third parties to make external public annotations like commentary and cross-references. If interactive features are provided, such as a comments section on a website, content can still be manipulated at will by source owners. There is likewise no neutral and transparent way to rank the quality of information. Without explicit cooperation and ongoing maintenance, there is no reliable way to even provide links between information from different sources. As a result, there is no efficient way to continuously aggregate, de-duplicate, and navigate the best information from all sources. The user is left to engage in a tedious process of curation from an endless supply of unorganized, often-contradictory information. This also frustrates AI research and makes any results highly susceptible to commercial bias.
Based on the external-dependency definition, information on the web is not decentralized and never has been. HTTP URIs require external, authoritative services to look up, retrieve, and authenticate linked information. If any of these services fail or if content is arbitrarily changed or removed, linked information becomes inaccessible, stale, invalid, or corruptible. Web links cannot exist offline, apart from the services that comprise the web. Two previously-retrieved documents cannot independently reference each other because their identities are bound to authoritative network services. Likewise, users cannot interact directly (i.e. peer-to-peer) using current web technologies, a model that carries over to nearly all mobile apps. Centrally-controlled web services are still mandatory coordinators of interactivity. This is a massive barrier to integration of information and functionality, with the practical result that each app and website is largely its own closed world. Attempts to connect them are labor intensive and highly unstable, often with market forces in opposition.
The best sources of centralized public information have strong oversight rules that attempt to limit bias, prevent abuse, and ensure some degree of content and reference stability, such as mandating public revision history. Wikipedia and the Internet Archive are exemplary but still rely on their own community contracts and those of the supporting internet infrastructure. Even when crafted with the best of intentions, rules constrain innovation and diminish usefulness. For example, Wikipedia guidelines dictate that it is not to be used for original research or publication, may not include instructional material, and may only reference certain trusted primary sources. This is expected, given the goal of producing encyclopedic content for the web. However, there is also no reliable way to comprehensively extend and integrate Wikipedia content with external information outside of its purview – let alone orchestrate complex interactions among complementary efforts.
The traditional centralized approach to managing private digital information is to divide its control amongst software components and then integrate these using programming interfaces, whether local or remote. Because the underlying data is encapsulated by the software, it is considered bound by an external authoritative dependency. Programming interfaces ultimately define the meaning of the data behind them, removing incentives to add rich semantics to the data itself. The most common example is a relational database wrapped by data and business objects. An interface abstracts the database while adding additional access rules, constraints, and various logic. In order to combine data from multiple systems, appropriate interfaces must be used. Directly accessing the underlying databases could bypass rules and violate system integrity. Nevertheless, integrations usually involve tedious custom logic to munge incompatible semantics.
Graph-structured databases with sufficient embedded semantics do not depend on external software to give their stored data meaning. However, they may still use a wide variety of centralized access methods. A Semantic Web data store may control reads and writes of data at each URI it manages. The means of access and the use of authoritative name-based data identity make the information centralized as a whole, just like all other web content.
Decentralized information is structured such that authors and creators do not need to cooperate, provide an API, or give authorization to make their information layerable. Once decentralized information is published, other information can always be reliably and independently layered upon it, thanks to immutable reference identity. Users choose which information from unlimited sources to layer into a custom view. This freedom allows any degree of diversity, from a single trusted source to a wide spectrum chosen for both quality and contrast. Layerable information sources can be both public and private, including content from personal, commercial, government, and community sources.
Map and navigation apps commonly contain graphical representations of layered information. Users may toggle layers for traffic, weather, terrain, points of interest, etc. A decentralized information equivalent would allow anyone to contribute any type of map information, and no central party would have control over who sees what data. This may sound chaotic, but the same technologies that enable distribution also enable efficient community processes to build trust in the best quality information. The notion that users choose information from many sources does not refer to a manual evaluation process. It means that software will use available trust data and personal preferences to automatically select information most likely to be useful. Information that is irrelevant, untrusted, or widely considered inaccurate can easily be filtered out. This is different than traditional publishing selectivity or censorship. Users may always examine the content excluded from their default view and refine it as desired.
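As a rough sketch of that automatic selection, the following Python fragment filters candidate map layers by locally held trust scores and preferences; the scores, thresholds, and field names are all invented for illustration:

    # Hypothetical layer selection: trust scores, preference thresholds, and
    # field names below are all invented for illustration.
    trust_scores = {"SHA256:key-aaa": 0.9, "SHA256:key-bbb": 0.2}
    preferences = {"min_trust": 0.5, "topics": {"traffic", "terrain"}}

    def select_layers(candidate_layers):
        """Return the map layers a user's environment would surface by default."""
        return [
            layer for layer in candidate_layers
            if trust_scores.get(layer["author_key"], 0.0) >= preferences["min_trust"]
            and layer["topic"] in preferences["topics"]
        ]

    layers = [
        {"topic": "traffic", "author_key": "SHA256:key-aaa", "data": "..."},
        {"topic": "traffic", "author_key": "SHA256:key-bbb", "data": "..."},
    ]
    print(select_layers(layers))   # only the trusted author's traffic layer remains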
Simple textual information can also be layered. Consider various forms of highlighting and emphasis that can be layered upon raw digital text. With decentralized information architecture, highlighting is a form of external, likely third-party annotation. Other common forms of text annotations include cross-references, comments, fact checks, formatting cues, and endorsements. Layering of annotations implies sharing and aggregation among many users. Calculated layers such as statistical highlighting are possible – perhaps visualized with varying color or intensity depicting how many people have highlighted or commented upon a certain region of text. The user may then drill down into the original annotations.
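A highlight annotation might, for example, be its own small entity that targets the exact text by hash reference plus character offsets. The Python sketch below is only illustrative; the field names are not a proposed schema:

    import hashlib, json

    text = b"Decentralized information is collectively self-describing."
    text_ref = "SHA256:" + hashlib.sha256(text).hexdigest()

    # A third-party highlight layered onto the text above.
    annotation = {
        "type": "highlight",
        "target": text_ref,                 # immutable reference to the annotated entity
        "range": {"start": 0, "end": 27},   # character offsets within that exact content
        "comment": "Key definition.",
    }
    annotation_bytes = json.dumps(annotation, sort_keys=True).encode()
    annotation_ref = "SHA256:" + hashlib.sha256(annotation_bytes).hexdigest()
    # The annotation is itself an immutable entity, so statistical layers can count
    # and aggregate such references without touching the original text.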
Decentralized information does not imply that information sources must always be decentralized. A centralized source with robust community processes may approve information that meets certain standards. Readers may then choose to blanket trust its signed data, avoiding the need to fetch and analyze vast quantities of metadata used internally. (Although such data will always exist for those interested.) For example, a decentralized version of OpenStreetMap or Wikipedia may develop popular default map data layers, rather than relying on an ad-hoc P2P approach. Organizations may run centrally-controlled repositories to help handle the demand for popular data. However, there is no reliance on such measures. All published data remains separable from sources and networks without losing meaning, identity, or authentication. If a source fails, the data can be moved elsewhere seamlessly.
Software is a complex form of information to layer, but declarative programming paradigms offer solutions. As a generalization, declarative code describes things rather than giving explicit linear instructions. That which is not strictly ordered through hash reference chains in the graph may be freely layered as long as it is logically sound. Descriptions may include extensible data semantics, rule sets, local parameterizations, component wirings, hints, environmental variables, and, of course, units of functional code. The final form of software is only materialized at the moment it is actually needed. Rather than being limited by static, pre-designed applications, the ideal user environment takes everything currently known and generates custom interactions and interfaces. This allows maximum specialization for the task at hand with minimal human design effort. Of course, this is an ideal conceptual direction in which to evolve future software. Earlier renditions will have more manually-designed components and configurations, but the concept provides a design principle to continually refactor toward.
Decentralized information provides immediate benefits for all scenarios where multi-party information needs to be aggregated, integrated, collaboratively evaluated, and then filtered and layered into useful views. Obvious arenas include research, journalism, education, government, healthcare, and most everyday personal and business communication. These are ripe with low-hanging-fruit use cases that do not immediately require complex semantics or AI.
Decentralized information can particularly benefit three major areas of social computing:
Collaboration - Support for multi-party revision, annotation, and aggregation is baked into the data architecture, making it available by default. There is no need for special software or services to participate in collaborative work. Equally important, collaborative information workspaces have a reliable historical record, cryptographic attribution, and full visibility of interactions.
Community - Default collaborative hypermedia promotes natural group formation and specialization. Communities of all types and sizes can engage information of value to them. Cryptographic trust networking makes it feasible to validate members and to build portable reputation that can span multiple communities. This makes it easier and safer to get involved, find a niche, and network with those who might otherwise be strangers. We hope this will start a revolution of greater civility, connectedness, and civic engagement.
Contextualization - As all information becomes networked and community processes ramp up, no public information will stand alone. It will be progressively woven into ever broader context, by which it can be clearly understood, evaluated, and refined. The essence of contextualization is perspective gained through robust peer review. It is the ultimate antidote for urban legends, false mythologies, fake news, alternative facts, conspiracy theories, and all forms of harmful extremism.
Decentralized information must be supported by practical real world systems. While adaptors to legacy platforms are possible, they cannot take full advantage of its native properties. On the other hand, decentralized systems have yet to be proven at global scale.
In August 2017, MIT’s Digital Currency Initiative and Center for Civic Media released a report on the motivations and progress of various decentralized internet projects. Four outstanding challenges to decentralized systems are identified. We believe that a decentralized information model is a critical component in solving these.
Decentralized information is not dependent on system adoption, federation, or network effects. It stands on its own, without any infrastructure, and is usable within any system that is able to understand it. It is inherently future-proof and easily made forward-compatible via layering additional metadata. This is attractive for driving developer adoption both within and outside existing software ecosystems because it eliminates concerns about new platform stability and longevity. The information itself is the platform. Software development around decentralized information is always subordinate to the information, using it as both a storage and communication medium. (The latter must be combined with a use-case-appropriate means to propagate new data.) By decoupling software from information, re-use of both is promoted. Software bears the responsibility of working with neutral information, taking advantage of embedded semantics. Exclusive software ownership of data is simply not allowed, though particular software may define what data is considered valid for its own purposes.
The Semantic Web effort shares the interoperability and decoupling goals, but it is bound to the limitations of web architecture. Unfortunately, the dominant business models of the current web are based on explicit data and functionality silos, used to drive market lock-in. Semantic Web technologies have limited benefit to closed applications but carry substantial computational and development overhead costs. As with decentralized information, its main benefits come from the network effects of publicly shared data and schemas. Without a means to initiate these, Semantic Web has been largely a non-starter.
Semantic Web retains the premise of arbitrarily-named documents with public URLs at centrally-controlled web hosts. While local and content-based URI schemes exist, there is no consistent story for how to integrate them. Various “semantic desktop” projects have failed to prove meaningful benefit, with excessive complexity, clumsy interfaces, and no convenient mechanisms for data portability or bridging to existing systems. To the average developer, Semantic Web is also inaccessibly complex. It is especially difficult to justify its adoption for an internal or personal project. Perhaps telling is that the open source community has not embraced Semantic Web, even though it dovetails with the benefits of open code.
The philosophy of decentralized information promotes simple, elegant tools that are independently useful for small-scale development. Strict separation of concerns prevents forced complexity. Decentralized identity and referencing schemes ensure trivial portability among disparate public and private systems, with little or no infrastructure investment. The learning curve is shallow, with most semantic graph complexity offloaded to optional metadata that can be annotated later, even by third parties. It is likewise not dependent upon design-by-committee ontologies.
Decentralized information can be used by traditional software applications, but it is most powerful when paired with a software and UI environment native to its paradigms. It prescribes small, re-usable components wired together into declarative Interaction Patterns for processing and generating software-neutral graph data. Interaction Patterns are contracts among users of particular shared graph data, describing expected schemas, logic, and protocols. This design applies to both local processing and interactions among multiple remote users sharing a graph. The level of modularization and ease of contribution allows much greater open-source cost sharing.
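The precise format of Interaction Patterns is an open design question, but a hypothetical declarative descriptor might look something like the following Python sketch, where every identifier is illustrative:

    # Hypothetical shape of an Interaction Pattern: a declarative contract over shared
    # graph data, naming the expected schema, the validation rules, and the components
    # each participant wires in locally. All identifiers below are invented.
    interaction_pattern = {
        "pattern": "SHA256:hash-of-this-pattern-definition",
        "schema": "SHA256:hash-of-shared-task-list-schema",
        "roles": ["proposer", "reviewer"],
        "rules": [
            "every revision must reference the versioning root",
            "a reviewer approval must be signed by a reviewer key",
        ],
        "components": {
            "render": "SHA256:hash-of-task-list-widget",
            "validate": "SHA256:hash-of-revision-validator",
        },
    }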
Small development groups, IT departments, and independent contractors have every reason to want interoperability standards. They reduce the amount of variation between jobs, increase skill transference, and simplify development toolsets. Enterprise integration opportunities are likewise attractive to large developers, where overhead costs are justified by cumulative business value. In both cases, modularity-promoting, open-source development models are the intended target. Industry sectors should find new avenues to collaborate on efficiently meeting shared IT needs.
As many have noted, the massive network effects of existing social networks are a challenge to new entries. All efforts to provide direct alternatives have failed to gain popular traction. However, as more developers build interactions around decentralized information, there will be greater incentive to network natively. Historical precedent can be found in the consumer transition from pre-internet online services to the open web. Once users are comfortable with a new platform, it is only a matter of reaching a tipping point of popularity. Early adopters could include organizations and communities for whom the homogeneous, ad-driven, one-size-fits-all social networks are suboptimal.
Social networking over decentralized information does not depend on an interactive service. (Though some FOAF query processing can benefit from a trusted third party intermediary.) Unlike Semantic Web / Linked Data based social network designs, fragile centralized web publishing is also not required. Decentralized information can exist before any software or networks are created to maximize its usefulness. Large-scale aggregators and search engines will still be useful, but these will be orthogonal to how the information is independently used at local system scale and among smaller community networks. Ultimately, decentralized information can succeed for more specialized purposes even if the economics of large-scale social networks remain insurmountable. For this reason, competing with them is not a primary goal. However, if the larger architectural paradigm succeeds, this will follow effortlessly.
Security of decentralized information is cryptographic, rather than relying upon traditional access controls among trusted systems. Yet it is unnecessary for everyone to use locally-managed public-key infrastructure from day one, an adoption hurdle in past designs. A service with a traditional login can provide surrogate identities, key escrow, and signatures to bootstrap use of decentralized information. Data signed by a trusted cryptography service on behalf of an authenticated user can be relatively trusted for many use cases. It is not tied to the service, once created, because validation is purely cryptographic, not interactive. Cryptography is thus a variable security Quality of Service concern. Users may put less trust in data signed by a cryptography provider than data signed by a trusted user’s private key. However, this compromise is better than continuously depending on a centralized service for data retrieval, identity, and authentication upon every request! In the meantime, those who wish to manage their own keys will be able to do so seamlessly, knowing that support is default. A middle approach is to use password-based key derivation functions to protect the identity and passphrase of a user’s secret key, using a familiar username and password scheme. This precludes third party recovery, of course, but risk may be mitigated by partial-key-sharing strategies.
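A minimal sketch of that middle approach, using Python's standard PBKDF2 function (the parameters are illustrative, not a vetted policy), might look like this:

    import hashlib, os

    def derive_wrapping_key(username: str, password: str, salt: bytes,
                            iterations: int = 600_000) -> bytes:
        """Derive a key-protection secret from familiar credentials."""
        return hashlib.pbkdf2_hmac("sha256",
                                   f"{username}:{password}".encode(),
                                   salt, iterations)

    salt = os.urandom(16)   # stored alongside the encrypted private key
    wrap_key = derive_wrapping_key("alice", "correct horse battery staple", salt)
    # wrap_key would encrypt the user's signing key at rest; without the password there
    # is no third-party recovery unless partial-key-sharing strategies are layered on.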
A unique property of decentralized information, stemming from immutable data identity, is the ability for a third party repository service to blindly aggregate adjacent data and metadata, based on any visible references. This feature depends on a data container model that allows selective encryption of hash references, while private payloads remain invisible. References may be publicly visible or encrypted to a key that a certain service possesses. This separates the network concern of propagating new data from the software concern of processing it. The latter can be done locally for common use cases, on a trusted device that holds the keys to decrypt private payloads.
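A sketch of such a container follows, assuming a Fernet symmetric key merely stands in for whatever encryption scheme a real container model specifies, and with the references left publicly visible (per the text, they could instead be encrypted to a particular service's key):

    import hashlib, json
    from cryptography.fernet import Fernet   # assumed available; any symmetric scheme would do

    payload_key = Fernet.generate_key()      # held only by the trusted local device
    container = {
        # Plaintext hash references let a blind repository index and propagate the entity...
        "refs": {
            "in_reply_to": "SHA256:" + hashlib.sha256(b"an earlier entity").hexdigest(),
            "thread_root": "SHA256:" + hashlib.sha256(b"the thread root entity").hexdigest(),
        },
        # ...while the private payload stays opaque to it.
        "payload": Fernet(payload_key).encrypt(b"draft notes for key holders only").decode(),
    }
    container_ref = "SHA256:" + hashlib.sha256(
        json.dumps(container, sort_keys=True).encode()).hexdigest()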
A final critical aspect of decentralized information security is the ability to easily move most user data processing to private / local devices. The buzzword “serverless” today usually refers to traditional application components that can run dynamically on pooled cloud server resources instead of being tied to a particular hardware/OS server instance. Software operating over decentralized graph data can truly be serverless, in that no server or centralized application logic even exists. Instead, users’ private software components interact via chosen Interaction Patterns. Each independently validates the data it is involved with, according to the pattern. For many use cases, this eliminates the need for trusted third-party servers, which normally must have access to the data involved.
Decentralized information has radically different economics than content on the classic internet. Because there are no fixed points of network interaction and because all data entities are immutable and hash-identified, many systems can compete on Quality of Service parameters over the same data. In practice, repositories and networks thereof will take many forms and will be as layered as decentralized information itself. Networks will come and go, but the hosted information can nominally remain unchanged, so long as replicas exist somewhere accessible. This stability enables rapid experimentation and market differentiation.
As we transition from host-based to information-based networking, a new landscape of services will emerge. Repository networks will focus on particular QoS needs. Some may run over public IP networks, especially smaller private, P2P, and community repositories. Others may use dedicated trunks. ISPs will have increasing incentive to operate generic local caches of popular public content, to alleviate upstream bandwidth expenses. This will be possible without custom, service-specific peering, but prioritization can always be negotiated. In general, we may expect a transition to direct utility models, where users pay for what services they actually use instead of relying on freemium and advertising models. While traditional cloud hosting will be the obvious early onramp, personal and business systems will be able to integrate transparently, providing off-grid operation, high local QoS, and private compute services for mobile devices. Unlike complex application-based server appliances, decentralized information will allow any device to provide standardized utility computing service for any use case. This commoditization and transparency, along with shifting hardware economics, should also make consumer personal servers finally practical, though certainly not required.
Incentivizing the production of content is a separate matter. Unlike on the web, the inherent end-user control over decentralized information and adjacent software environments means that advertising cannot easily be forced, let alone tracked. (The only option would be DRM-laden side apps for specific media, comparable to current subscription streaming media services.) As with networking, more direct incentive models will be needed. Community-driven patronage should become more viable as greater interactivity and connectedness drive social incentives. New forms of micro-payments/donations, enabled by immutable graph data and reference collection, may reduce barriers to financial contribution. We can expect a wide range of competing payment services to facilitate this. For example, a federated cryptocurrency service could create pseudonymous donation tokens that can be linked (by hash reference) to signed entities in the graph of shared data. Content authors would then be able to cash these in on a periodic basis, reducing the number of transactions.
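Purely as a hypothetical illustration of linking a donation to content by hash reference (every field name and value below is invented), such a token might be just another small graph entity:

    import hashlib, json

    article_ref = "SHA256:" + hashlib.sha256(b"a signed article entity").hexdigest()

    # Hypothetical donation token issued by a payment service and linked into the
    # graph by hash reference.
    donation_token = {
        "type": "donation-token",
        "for_entity": article_ref,
        "amount": "0.25",
        "issuer": "SHA256:hash-of-the-issuing-service-key",
    }
    token_ref = "SHA256:" + hashlib.sha256(
        json.dumps(donation_token, sort_keys=True).encode()).hexdigest()
    # The author could later present accumulated tokens to the issuer in one batch.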
Decentralized information promotes default data portability and user ownership without mandating a particular network architecture. This permits the economies of scale of federated or centralized services without the traditional risk of lock-in. (For example, P2P storage networks must compete with centralized cloud services over various QoS parameters.) With structured, versioned, semantically-rich information at the center of computing, everything else becomes orthogonal and switching costs are dramatically lowered. This should promote competition and differentiation among services and software solutions, with a greater leaning toward openness due to improved cost-sharing economics.
Open source development models have always worked best for shared infrastructure, such as operating systems, development tools and frameworks, common code libraries, and commodity UI software like browsers. With current software economics, more-specialized code tends to be proprietary, even though it often makes use of open infrastructure. This applies to both local software and web services. The strongest determining factors are the location of value creation and the ability for code functionality to be efficiently modularized and shared. Decentralized information and declarative programming paradigms promote a much higher degree of software decomposition and modular composability than prepackaged applications. This follows from making information itself the nexus of integration rather than unstable APIs or code libraries in myriad languages. As more reusable modules are created, the effort required to take the next step decreases – and with it the incentives to create proprietary code decrease. Such modules effectively become open shared infrastructure, in parallel with community-driven data schemas and ontologies. Any equivalent proprietary code or services must compete with the ability of users to provide their own with relative ease and near-zero switching costs. This should push value creation down the chain and tend to promote smaller market participants and greater in-house capability.
To be clear, there is no reason to believe that an open source cottage industry can replace specialized scientific research and development. On the other hand, everyday personal and business computing needs to be standardized, commoditized, and made seamlessly inter-operable, so that it is no longer a major source of waste and frustration in the world!
Decentralized information can be defined by several absolute technical properties:
This is the most obvious and well-understood property. Decentralized information is founded on decentralized data. It is never bound to a location or host. It exists apart from these physical artifacts. Many existing distributed systems provide data location neutrality.
This is perhaps the least obvious property. A mutable reference is one that may point to different data values over time, even though the reference itself is unchanged. Mutable references have no place in decentralized information because they prevent reliable, independent composition and layering of multi-source information. For example, a third party cannot reliably annotate a particular piece of data using a mutable reference because the data it points to may arbitrarily change, even deceptively. In contrast, a reference using a secure hash value is reliable because it is mathematically bound to a unique piece of data. All references within decentralized information must be to immutable data entities. “Entity” here just means a particular self-contained parcel of data. The terms “file,” “document,” and “record” are traditionally used to refer to mutable data units.
In any real-world system, there must be a way to model and communicate data that is updated over time. In the decentralized information model, all data entities are immutable. By extension, so are all references contained within them. However, discovered knowledge of what entities reference each other is mutable. Data repositories may keep collections of metadata about references, typically indexed per known entity. This enables update tracking. A revision should reference any older version(s) it is derived from and usually a versioning root entity that serves as a master collection point. When a new entity is discovered by a repository, any revision references may be indexed. Software watching the collections may then be notified accordingly. Reference metadata may also be collected for annotations, links, property graph statements, etc. Mutable reference collections themselves are managed by repositories in whatever fashion suits their users. Networked repositories may also propagate reference metadata, with or without copies of the referenced entities.
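The sketch below models a repository's reference-metadata collections in Python; the class, method, and reference-kind names are illustrative only:

    from collections import defaultdict

    class Repository:
        """Sketch of a repository's mutable reference-metadata collections."""
        def __init__(self):
            self.entities = {}                       # hash ref -> immutable entity
            self.references_to = defaultdict(list)   # target ref -> [(kind, source ref), ...]

        def ingest(self, ref, entity):
            self.entities[ref] = entity
            # Index any revision/annotation references the newly discovered entity declares.
            for kind in ("revises", "versioning_root", "annotates"):
                if kind in entity:
                    self.references_to[entity[kind]].append((kind, ref))

        def known_revisions(self, root_ref):
            """What this repository has discovered so far; never globally authoritative."""
            return [src for kind, src in self.references_to[root_ref]
                    if kind == "versioning_root"]

    repo = Repository()
    repo.ingest("SHA256:root", {"type": "concept", "title": "My Article"})
    repo.ingest("SHA256:rev1", {"versioning_root": "SHA256:root", "body": "v1"})
    repo.ingest("SHA256:rev2", {"versioning_root": "SHA256:root",
                                "revises": "SHA256:rev1", "body": "v2"})
    print(repo.known_revisions("SHA256:root"))   # ['SHA256:rev1', 'SHA256:rev2']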
Mutable reference metadata collections are not controlled by the author of an entity and are not in any way authoritative regarding the global existence of references to an entity. They merely facilitate discovery of information. Any repository of decentralized information has its own metadata collections, just as any Git repository can have its own collection of branches for a project.
Among decentralized information, there is no absolute notion of latest version, as this is relative to currently-known revisions. Any notion of an official publication is merely based upon the signatures of a trusted party. Retrieving desired revisions requires knowing where to look, an aspect that is technically separate from decentralized information but is typically orchestrated using it. For example, disposable repository metadata may indicate what repositories or networks are believed to have a copy of an entity. This is a networking aspect that is intended to evolve over time.
Versioning root entities are nothing like controlled channels or Merkle roots. While a root entity may be signed by its creator and revisions may be signed by the same party, this does not forbid third-party annotations or revisions from referencing the same versioning root. Whether these are considered valid is policy-based, on the data consumer’s side.
Referencing a versioning root is not equivalent to referencing a pointer to mutable data. The root is literally an instance of a concept. For example, a root may represent a particular book but contain none of its text. It would therefore be inappropriate to reference the root with an annotation relating to text found in a particular revision. This would be imprecise and ambiguous. The only exception is if the versioning root itself contains an initial version internally.
IPFS (InterPlanetary FileSystem) is one of a number of recent decentralized P2P storage network projects with incentivized sharing of storage space. MaidSafe and Dat are similar projects. While these have interesting network and data structures in their own right, none use the decentralized information model described in this discussion. This is in large part due to their efforts to replicate familiar properties of centralized information systems, particularly hierarchical filesystems.
A major pillar of the IPFS design, called InterPlanetary NameSpace (IPNS, section 3.7 of IPFS whitepaper), uses cryptographically-signed named pointers to Merkle DAG nodes. Data references may thus use assigned identities and hierarchical paths. This violates the mutable reference principle, even though the namespace records themselves are decentralized and self-certifying public-key records are used in place of a centralized registrar authority.
To illustrate, consider the following mutable IPNS path:
/ipns/[hash-value-1]/documents/recipes/pizza.html
The path component “hash-value-1” would be a hash of the public key that signs records of paths under this root. It never changes. Whoever has the corresponding private key may publish new paths at this location in the IPNS namespace. What these paths point to is arbitrary, however. A third party that wishes to reference this pizza.html document has the option to use this path. Unfortunately, it is a mutable reference and the reference may go stale in the future, if pizza.html is updated or pointed elsewhere. Fortunately, it is possible to use the underlying IPFS immutable (Merkle DAG) namespace instead.
Consider the following immutable IPFS path:
/ipfs/[hash-value-2]/documents/recipes/pizza.html
In this example, the component “hash-value-2” would be a Merkle DAG root hash. If the pizza.html document or any other part of the path were to change, the hash tree calculations would result in a new root hash value. Therefore, a reference using the IPFS namespace is permanently reliable. However, the use of meaningful naming is itself problematic for other reasons.
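The general Merkle property can be illustrated without IPFS's actual object encoding; in the Python sketch below, any edit to the leaf document propagates to a new root hash:

    import hashlib

    def node_hash(children: dict) -> str:
        """Hash of a directory-like node over its child names and child hashes."""
        h = hashlib.sha256()
        for name in sorted(children):
            h.update(name.encode() + b"\x00" + children[name].encode() + b"\x00")
        return h.hexdigest()

    pizza_v1 = hashlib.sha256(b"<html>pizza recipe, version 1</html>").hexdigest()
    pizza_v2 = hashlib.sha256(b"<html>pizza recipe, version 2</html>").hexdigest()

    root_v1 = node_hash({"documents": node_hash({"recipes": node_hash({"pizza.html": pizza_v1})})})
    root_v2 = node_hash({"documents": node_hash({"recipes": node_hash({"pizza.html": pizza_v2})})})
    assert root_v1 != root_v2   # editing the leaf changes every hash up to the root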
Note that while the IPFS project has been used as a specific example here, the same principles apply to similar named-path Merkle-tree-based systems. This includes the widely popular Git distributed version control system.
Suppose we use a scheme where the names of our data entities are registered in a blockchain. In particular, arbitrary text strings are mapped to secure hash values for data entities. Such names are guaranteed to be both globally unique and permanent. This enables lookup of entities by names convenient for human users. However, it also means that references to data using these names are bound to the operation of a particular supporting network that provides consensus via the distributed ledger. This is a logical form of registrar authority dependency. Should the blockchain fall into popular disuse in the future, an easily imaginable scenario, all references using it immediately become unreliable. As a result, this solution is unacceptable for decentralized information.
Human-meaningful names in references are harmful primarily because they encode implicit information in the names or paths themselves. In the best case, this information is merely redundant with internal metadata of data entities. (Although choosing these reference names and/or hierarchy positions is still wasteful manual data manipulation.) In the worst case, naming in references represents data with weak semantics that must be gleaned from context. Dependency on an original context can be seen as a form of logical centralization. It may be overcome by annotating explicit metadata after the fact, but it is easier to simply disallow this hazard.
In the IPFS example above, a Merkle hash tree allows human-meaningful naming subordinate to a secure hash identity, via internal labeling of the tree’s child node references. Because the hash root would change if any data or name component changed, this passes the immutable reference requirement. However, real-world usage could still go awry. Consider the “pizza.html” document referenced by the IPFS path above. Suppose that it does not have any internal metadata indicating that it is a food recipe. Certainly, a human user can look at the path and infer that this is the case, but there is no way to be sure. Perhaps this document describes the classic object-oriented programming example of a pizza shop and “recipes” is simply a metaphor! Suppose we instead identify and reference the document solely using a flat hash value and we ensure that the document itself contains metadata indicating it is indeed a food recipe. By using explicit metadata instead of naming, we avoid boxing ourselves into a rigid, manual, path-based hierarchy.
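Under that approach, the entity itself might carry its semantics, as in this illustrative sketch (the property names are not a proposed schema):

    import hashlib, json

    # Instead of encoding "documents/recipes" into a path, the entity carries its
    # own semantics.
    entity = {
        "type": "recipe",
        "category": "food",
        "title": "Pizza",
        "body": "<html>...</html>",
    }
    flat_ref = "SHA256:" + hashlib.sha256(
        json.dumps(entity, sort_keys=True).encode()).hexdigest()
    # External annotations target flat_ref directly: no path to dereference, and no
    # risk that the same document hides behind many differently named trees.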
Strict use of flat, hash-based references also eliminates the complication of unlimited multiple naming. Suppose our “pizza” document entity can individually be accessed by a hash value. If arbitrary path naming is permitted, however, it can also be included in multiple hash trees.
For example, the following hash-rooted named paths could all refer to the same binary document entity:
SHA256:[hash-value-1]/documents/recipes/pizza.html
SHA256:[hash-value-2]/food/recipes/pizza.html
SHA256:[hash-value-3]/data/docs/food/recipes/pizza
SHA256:[hash-value-4]/whatever/who/really/cares
This is problematic if we wish to collect widespread metadata for an entity, for the purposes of annotation and networked collaboration. While nothing in the flat-hash ID scheme stops someone from attempting to fork data by changing even a single bit, thereby resulting in a new hash value, this demonstrates obvious malicious intention and can be more readily detected. Furthermore, most entities should have cryptographic signatures, making such attacks less feasible. With arbitrary path naming, it is not clear whether a new path has been created for malicious intent or as an artifact of local organizational preferences. Cryptographic signatures do not help here, because the original signed entity remains unchanged, with its original hash value, in the leaf of a new Merkle tree.
While it is possible to fully traverse every name-supporting Merkle tree by default, to index the individual hash values of leaves, this does not help if an external reference is made using a novel hash-rooted named path. Without first dereferencing this path to the hash, the reference cannot be collected for the appropriate entity. This may also involve another network lookup, if the tree used for this path has never been fetched. With strict flat hash referencing, reference collection can happen immediately, without additional steps.
Data management architectures often benefit from hierarchical aggregation for the sake of scalability. However, this can be transparently orchestrated behind the scenes as an optimization. Consider a dependency graph of data or software components that are frequently used together. Smart repositories can discover such correlations and bundle related entities for efficient batch transfer. In contrast, hierarchical naming schemes encourage premature manual compositions that may not be useful.
Human user interfaces should generally not expose raw hash references. Instead, metadata should be used to provide meaningful name labels to widgets. A related issue affects future programming languages and associated tooling. By referencing classes or modules using cryptographic hash values, we can make code more secure and unambiguous. For example, instead of:
import My.Favorite.Functions.examples.org
we may prefer to use something like:
import SHA256:[hash-value-of-my-favorite-functions]
The latter is a globally-safe immutable reference to a data entity that perhaps is internally named “My.Favorite.Functions”. Once this module is imported, we can use whatever named artifacts it contains. Internal naming does not violate the meaningful-name reference principle, in the external scope of data and network management. However, future language designs may result in every separable code artifact being contained in its own entity. The code would be entirely human-language neutral, with labels applied only within an editor view.
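A toy resolution mechanism, with a hypothetical in-memory content store standing in for a real repository, might look like the following Python sketch:

    import hashlib, types

    content_store = {}   # hypothetical local store: hash reference -> module source text

    def publish(source: str) -> str:
        ref = "SHA256:" + hashlib.sha256(source.encode()).hexdigest()
        content_store[ref] = source
        return ref

    def import_by_hash(ref: str) -> types.ModuleType:
        """Load a module from an immutable reference; human-readable names stay local."""
        module = types.ModuleType(ref)
        exec(content_store[ref], module.__dict__)
        return module

    ref = publish("def greet(name):\n    return 'hello, ' + name\n")
    favorite_functions = import_by_hash(ref)   # the local label is the user's choice
    print(favorite_functions.greet("world"))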
Unlike in systems that rely on centralized network authorities like DNS, dereferencing among decentralized information is a major QoS differentiator. With the use of secure hash values for referencing, there is no singular, straightforward lookup procedure. Within a particular network, a distributed hash table (DHT) can be used to efficiently locate copies of data at a hash address. However, reliance on a single global DHT or hierarchy of DHTs is not considered a requirement or even a valid goal. Just like information itself, the network services supporting decentralized information can be layered.
As a bootstrapping tool, permanent domains (self-certifying mutable pointer records) can be used to point to recommended repositories for organizations or authors. These repositories may hold reference collections that are most likely to be current with the latest available revisions and other metadata. However, permanent domains cannot be used for any data naming or reference purposes. They merely advertise network services and are themselves unnamed. As permanent domains are tiny records, a global DHT may be a good distribution method.
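As a sketch of such a self-certifying record, assuming an Ed25519 key pair via the PyNaCl library and invented record fields:

    import hashlib, json
    from nacl.signing import SigningKey   # PyNaCl, assumed available for this sketch

    signing_key = SigningKey.generate()
    public_key_bytes = signing_key.verify_key.encode()

    # The domain's identity is derived from its public key; it never names any data.
    permanent_domain = "SHA256:" + hashlib.sha256(public_key_bytes).hexdigest()

    # A tiny, replaceable record advertising recommended repositories.
    record = json.dumps({
        "domain": permanent_domain,
        "repositories": ["repo.example.org:4040"],
        "sequence": 7,
    }, sort_keys=True).encode()
    signed_record = signing_key.sign(record)

    # Anyone holding the public key can check the record without any registrar:
    signing_key.verify_key.verify(signed_record)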
Decentralized information must be self-descriptive in order to avoid the possibility of dependency on external centralized software systems for interpretation. However, semantics are extensible, via annotation of existing published data. This introduces some novel complexity around layering, as the same information may be externally annotated with different and perhaps incompatible semantics. This is actually a powerful feature, as it allows for independent exploration of varied dimensionality and frames of reference in the absence of consensus. For example, a geopolitical unit can simultaneously be a sovereign state and a disputed territory, even though it is grounded by a singular permanent concept entity. References should point to the most generic semantic layer possible. Biological classification provides another common application. Effectively, we can agree that something exists before we agree on what it is and what its properties are. Root entities can bind the ambiguous instance as these things are being debated.
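To make the closing point concrete, here is a small sketch with invented field names: two semantic layers can disagree about the same permanent concept root without either invalidating the other.

    import hashlib, json

    def ref(entity: dict) -> str:
        return "SHA256:" + hashlib.sha256(
            json.dumps(entity, sort_keys=True).encode()).hexdigest()

    # A bare concept root: agreement that the thing exists, and nothing more.
    region_root = {"type": "concept", "label_hint": "the region under discussion"}
    region_ref = ref(region_root)

    # Two independently published semantic layers over the same root, free to disagree.
    layer_a = {"about": region_ref, "assert": {"political_status": "sovereign state"}}
    layer_b = {"about": region_ref, "assert": {"political_status": "disputed territory"}}
    # A reader's view may include either, both, or neither; the shared root keeps every
    # annotation and revision anchored to the same underlying instance.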