This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Introduction

Technical Abstract

A new graph data model is proposed as the single central invariant of future networks and software systems. Content-based (secure hash) identity is used for all public data entities, making them immutable by reference. Entities contain structured data and metadata, with chosen restrictions that promote design patterns suited for collaborative information and decentralized internet architecture. Managed collections of known references between entities are then used to support composition and versioning. To best facilitate this data model, new software and network architectures are required, and these will evolve independently over time. As an initial exploration and proof-of-concept, an idealized unified model for decentralized, distributable information systems is proposed. Networked graph data repositories collect, filter, and propagate references surrounding known or hosted immutable data entities. Public and private compositions, revisions, and annotations can thereby be independently layered from unlimited sources. This supports global-scale distributed collaborative media and is foundational for a new general-purpose software architecture that prioritizes maximum composability and re-use of both code and data. The repository serves as a universal interface for managing persistent data, remotely and locally. Likewise, shared graph data itself serves as the universal interface for user-space software components. Interaction patterns upon the graph, described by declarative code and contracts, replace traditional APIs as composition and communication mechanisms. In support, repositories propagate data entities in response to provided directives. Software components themselves exist within a graph-native programming environment that is suitable to both human and AI users. Finally, human user interfaces are dynamically rendered based on context, environment, interactive needs, preferences, history, and customizable hints.

The InfoCentral project draws inspiration from academic research in many computer science subfields, including distributed systems and databases, knowledge representation, programming languages, human-computer interaction, and networking. The Semantic Web effort toward universal standards for graph-structured information systems provides much technical inspiration and background theory for this work. The research area of Information-Centric Networking (ICN) has also been highly influential, and related aspects of InfoCentral represent a competing entry.

In contrast to similar academic and open source Internet architecture projects, InfoCentral has a wider scope that allows more assumptions to be challenged. For instance, most ICN projects do not consider alternative software architectures and thus make design assumptions based upon supporting current application needs and market demands. Likewise, the Semantic Web / Linked Data effort has largely left existing web data and network architecture unchallenged.

Author’s Preface

The software and internet technology landscape is primed for another major revision. As an industry, we have achieved great things, but we can do far better. With so many revolutionary advances on the practicality horizon, this is an ideal time to revisit foundations and first principles, taking into account the lessons learned over the past five decades. We have too long been stuck in a rut of incrementalism, building upon foundations that no longer properly support our ambitions. The next phase of the information revolution awaits, concurrent with an accelerated transition from frenzied exploration to mature engineering. The resulting quality uniformity will not only make our jobs more enjoyable, but tear down digital divides and improve societies globally. These are ideals shared by academic and startup enterprises alike, following a coalescence of related ideas and research efforts. Many architecture-level projects and proposals have surfaced in the last decade, bearing strong similarity in objectives and principles. They herald an era of computing dominated by distributed and decentralized systems, ubiquitous cryptography, practical forms of artificial intelligence, declarative and functional programming, increasingly verifiable code, fully-dynamic and multi-modal human interfaces, socially and semantically rich data, and universal interoperability. Unfortunately, existing projects seem to lack an overarching vision to bring together their best contributions. The primary aspiration of the InfoCentral project is to discover a truly unifying yet evolvable foundation that everyone can build upon. I hope that aspects of this design proposal will inspire fresh thinking and new collaborative explorations. Everything presented in this early publication is a work-in-progress. However, I believe that enough pieces of the puzzle are in place to begin prototype implementations that demonstrate the power of the chosen architectural principles.

Overview

InfoCentral is an open engineering effort toward a new, unifying software and communications architecture. Besides immediate practical benefits, it seeks to help close the massive sophistication gap between current personal and business IT and the needs of tomorrow’s intelligent machines and pervasive computing environments. InfoCentral is also motivated by social objectives such as improving collaborative processes, community interaction, perspective building, productive debate, rational economics, trust management, and lifelong education. These features follow naturally from good general-purpose information and software architecture.

InfoCentral has a long-range futuristic vision, with some ideals that will admittedly be hard to realize quickly. Recognizing this, the project aims for layered, modular research and development, starting with new data and network models, building toward stronger semantic and logic models, and culminating in new user interface and social computing paradigms. This approach allows the long-range goals to inform lower-level architecture, while not imposing unrealistic expectations on pragmatic development toward earlier real-world applications.

Infocentric Design

The core philosophy of InfoCentral—and its namesake—is that the design of information should be the absolute central focus and priority when creating software systems. However, information design should proceed independently of software design, with no assumptions about how information will be used. Software may then come alongside the neutral information space, bringing functionality appropriate to different contexts. This contrasts with the common software-as-product philosophy, which views application-specific functionality and interfaces as the focus of design and the data model as implementation details. This top-down approach encourages production of self-contained, limited-purpose systems that define in advance what information they work with and what interactions are permitted. Composition using APIs within or between such systems is inherently fragile, due to the difficulty of precisely specifying expected behavior and side effects. This complexity is then compounded by continuous revision and the emergence of interwoven dependencies. In the end, high maintenance costs push toward greater siloing of data and functionality rather than integration.

Semantically-rich, highly-normalized information, coupled with intuitively-programmable user environments, will someday yield a degree of integration and fluidity that obsoletes self-contained software. Writing applications will be replaced by creating and connecting small, functional modules that operate within a vast sea of standardized shared information. Most interaction patterns among users and software components can be captured without specialized coding. As new functions are required, a focus on elegant, re-usable modules will ensure longevity of solutions. Meanwhile, information itself will become disambiguated, machine-friendly, and future-proof.

Infocentric design allows information to survive all changes to networks and software around it. The networking and programming models of the InfoCentral proposal are fully abstracted from the data model. In the past, we have generally designed information, networks and software with human users, developers, and maintainers in mind. This has affected everything from naming schemes and data structures to trust and authority mechanisms. To allow for unhindered evolution of AI, we must instead abstract as much as possible from the core data model, such that it will be independently useful to machines.

Graph-Structured Data

The InfoCentral proposal is an alternate vision for the widely-desired “web of data,” in which links are between structured information rather than hypertext. The design properties and economic structures that have worked well for the hypertext and applications web conflict with the needs of a pure data web. For example, there are fewer places to insert advertisements into raw data that is consumed by local software under the user’s active control. Service-based business models must typically be used instead. Likewise, there is little business incentive to give users access to the raw data behind current proprietary web and mobile applications, as doing so would greatly reduce switching costs and enable direct competition.

While sharing the same ultimate goals and theoretical underpinning as the Semantic Web effort, InfoCentral diverges from certain entrenched architectural tenets. The proposed changes aim to improve integration of other research areas and make the resulting platform more economically feasible and accessible for developers and content creators. Because the resulting architecture will differ substantially from the current web, a new name should be considered. The casual term “Graph” seems fitting, with a possible formal title of the “Global Information Graph.” (“Giant Global Graph” was proposed in the past, but it sounds redundant and has an unpronounceable acronym.)

In the public information space, InfoCentral proposes a minimalistic, fully-distributable data / network model in which everyone effectively has write access to publish new data, nobody has write access to modify existing data, cryptography is used to control visibility, and layered social/trust networks (both public and private) are used to shape retention, prioritization, and propagation. Such a model tends to be democratic, self-regulating, and freedom preserving. Though many nodes will not allow global write access, user-facing systems will perpetually source data from many independent repositories, creating a customizable information layering effect. Write access to any repository participating in a popular overlay is sufficient. There will be many, and any person or group will be able to easily start their own. This promotes both competition and censorship-resistance.

Standards powering the Global Information Graph will eventually subsume all standalone database technology, even for fully-private instances. InfoCentral designs comprise the minimal subset of primitive features needed to support all database architectures via layering higher-level structures and protocols. Meanwhile, compliant implementations are universally cross-compatible at a base level. When faced with unavoidable engineering trade-offs, InfoCentral designs prioritize independence and flexibility first, scalability second, and efficiency third. The selection of certain mandatory primitives guarantees that InfoCentral-style repositories will never be as efficient as a traditional relational database or column store. Keys of at least 256 bits (hash values) are required, and highly-normalized, strongly-typed data is the default. With time, increasingly smart engines will close the gap. However, the InfoCentral philosophy is about making information more fluid and machine friendly. Employee hours are expensive. Personal time is valuable. Machines are cheap – and becoming ever cheaper.

The User Experience

InfoCentral does not propose a specific, standardized user experience, but rather common metaphors that should eventually form the basis of all user experiences around graph-structured information. Some of these include:

The ideal concept of a unified Information Environment replaces standalone applications and all forms of web pages and services. All data and surrounding software functionality is fluid, with no hard boundaries, usage limitations, or mandatory presentations. Everything is interwoven on-the-fly to meet the users’ present information and interaction needs. There are no applications or pages to switch between, though users will typically assemble workspaces related to the scopes of current tasks. The user brings their private IE across any devices they interact with. It is their singular digital nexus – unique, all-inclusive, and personally optimized.

The everyday user experience of an Information Environment is whatever it needs to be, in the moment, to interact with whatever information is relevant to a task or situation. It is not defined by certain UI paradigms or modalities. InfoCentral envisions a practical replacement for application functionality in the form of captured interaction patterns around information, rather than pre-designed user experiences. Interaction Patterns are declarative rulesets that define multi-party, multi-component data management and computational orchestrations. (A simple, high-level example is a business process workflow.) They are rendered by a local framework, based upon the current mode of interaction, available UI devices, and user preferences. Because patterns do not encapsulate data or custom business logic, they do not limit information re-use, as contemporary software applications usually do. Neither do they hide the flow of data behind the scenes. Likewise, patterns do not even assume that the user is a human, thus serving as integration points for automation and AI.

The ultimate expression of the Information Environment concept is the yet-unrealized vision of Ubiquitous Computing. This term may be defined, most simply, as the turning point at which most computing technology has become fully and effortlessly interoperable, such that complete integration is the default. All of the futuristic academic goals for UC follow from this basic property – from a safe and effective Internet of Things to advanced AI applications. The web has brought us to a certain level of interoperability, in terms of providing a standard human-facing UI framework through browsers, but it has failed to create interoperability at the data model, semantic, and logic levels. The fact that proprietary, self-contained applications have returned so strongly with the rise of mobile computing is a sobering demonstration of this. We must find a different solution that is secure, consumer-friendly, non-proprietary, and yet still market-driven. For the sake of privacy, it is imperative that users are in full control of the technology that will soon deeply pervade their lives.

Decentralized Social Computing

InfoCentral’s proposed architecture provides a substrate for new mediums of communication and commerce designed to encourage rationalism, civility, and creativity. All aspects of social computing will become first-class features, woven into the core software architecture, rather than depending on myriad incompatible third-party internet services. This will increase the default level of expressiveness, as all public information becomes interactive and socially networkable by default. Decentralization will guarantee that it is not even possible to strictly limit the interaction around published information, leaving filtering and prioritization up to end-users and communities.

Improved collaborative information spaces will revolutionize how we manage and interact in our increasingly globalized and hyper-connected world. We desperately need better integration and contextualization – the ability for all assertions and ideas to be understood and engaged in a holistic context. Greater civility and novel expression will result if all parties are given a voice, not only to share and collaboratively refine their ideas but to engage other sides formally, in a manner of point and counterpoint, statement and retraction. Layered annotations and continuous, automated, socially-aware discovery can be used to keep information fresh with the latest discourse and relevant facts. Even in the absence of consensus, the best information and arguments can rise to the top, with all sides presenting refined positions and well-examined claims. This contrasts with the traditional broadcast mode of competing channels, controlled by a single person or group and inevitably biased accordingly, with no reliable mechanisms of feedback or review. Likewise, it contrasts with the chaotic multi-cast mode of microblogging, where interaction is like an unstructured shouting match and has limited mechanisms for refinement or consolidation. By making contextualization default, echo chambers of isolated thinking and ideology can be virtually eliminated. As in science, refined consensus is more powerful than any individual authority claim. Faulty ideas can be more quickly eradicated by exposure to engaged communities and open public discourse. This includes encouraging greater internal debate that harnesses diversity within groups. Meanwhile, among commercial applications, traditional advertising can be replaced by customer-driven propagation of reliable information.

Decentralized social networks have inherently different properties than those of current-generation centralized solutions, all of which depend upon a single, large, trusted third-party and its economic realities. Decentralized designs can support features that are either technically or practically impossible to realize or guarantee in centralized designs. These include:

These distinctions do not, however, imply that centralized services cannot be provided as optional private layers on top of decentralized public networks. Such services may include all manner of indexing and analysis derived from public information, typically at scales that currently favor dense infrastructure. In addition, not all decentralized designs have all of the listed features, as tradeoffs sometimes exist with convenience and QoS objectives. InfoCentral promotes high flexibility here, allowing users and communities to discover which are most useful in different contexts.

The Developer Perspective

To understand how the proposed InfoCentral designs affect developers, it is necessary to make a high-level first pass at the software architecture. Within the Information Environment model, users and public software components interact solely by appending new information to a shared graph of immutable data entities. Software components watching the graph are notified and respond accordingly, typically based on formally-defined interaction patterns, which codify expected behaviors. This entirely replaces use of traditional APIs and application-level protocols in user-space and across networks. For example, to send a message, a user will publish a “message” data entity that references a known “mailbox” data entity by a hash value. (Replication toward the recipient(s) happens automatically behind the scenes.) To update a document or database entity, users or software components push revised entities that link to the previous revisions and/or a “root” entity. To implement a business workflow, users chain annotations of tasks, statuses, requests, approvals, fulfillments, etc. onto relevant business data entities. This form of information and interaction modeling naturally lends itself to declarative programming paradigms and clean, modular designs.
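
As a minimal sketch of this append-only style (Python, with JSON standing in for the canonical binary serialization and a toy in-memory stand-in for a real repository; all names are illustrative), sending a message reduces to publishing one new immutable entity that references the recipient's mailbox root by hash:

  import hashlib, json

  def hid(entity):
      # Hash Identity: a secure hash over the entity's canonical bytes,
      # with the hash function named in the reference.
      data = json.dumps(entity, sort_keys=True).encode("utf-8")
      return "sha-256:" + hashlib.sha256(data).hexdigest()

  class ToyRepository:
      # In-memory stand-in for a real repository implementation.
      def __init__(self):
          self.entities = {}
      def put(self, entity):
          self.entities[hid(entity)] = entity

  mailbox_root = {"type": "mailbox-root", "label": "alice-inbox"}

  message = {
      "type": "message",
      "payload": {"text": "Meeting moved to 3pm"},
      "anchors": [{"ref": hid(mailbox_root), "relation": "delivers-to"}],
  }

  ToyRepository().put(message)   # in a real deployment, replication toward the
                                 # recipient then happens behind the scenes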

The proposed level of integration is made possible by not giving exclusive control of any information to particular application code – or even a standardized object or event model. An Information Environment hosts software components that are infinitely composable but not coupled to each other or the information they work with. This compares closely to the essence of the original Unix philosophy. As the universal interface of Unix is the text stream, the universal interface of the Information Environment model is the graph of immutable data entities. Compositions between public components have patterns or contracts that describe allowed behaviors. Interactions must also take into account the possibility of conflicts, given the default concurrent nature of operation. This is made tractable by the immutability and multi-versioning features of the data model and is typically handled by lower-level patterns that deal with behavior under conflict conditions. Most of these can be shared across interactions and even be standardized close above the ontology level.
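
How conflicts surface under this model can be pictured with a short sketch (field names are illustrative): concurrent revisions become sibling entities referencing the same parent, so a conflict appears as multiple known heads rather than as lost data:

  import hashlib, json

  def hid(e):  # hash identity over a JSON stand-in for the canonical form
      return "sha-256:" + hashlib.sha256(json.dumps(e, sort_keys=True).encode()).hexdigest()

  rev_0 = {"type": "revision", "payload": {"text": "draft"}}
  # Two parties revise concurrently; both reference rev_0 by HID, so neither
  # overwrites the other.
  rev_1 = {"type": "revision", "parent": hid(rev_0), "payload": {"text": "draft A"}}
  rev_2 = {"type": "revision", "parent": hid(rev_0), "payload": {"text": "draft B"}}

  def heads(revisions):
      # A revision is a head if no other known revision names it as a parent.
      parents = {r.get("parent") for r in revisions}
      return [r for r in revisions if hid(r) not in parents]

  conflict = len(heads([rev_0, rev_1, rev_2])) > 1   # True: resolved by a merge
                                                     # entity or a conflict pattern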

It’s important to note that public software components are distinguished from the private code that implements them. Most simply, public components must communicate over standardized graph data, while private code may use direct local interfacing. The IE model intentionally places few requirements on private implementations, to allow for adequate flexibility and re-use of existing codebases. Again, a parallel can be found with the old Unix model. Public IE software components are roughly analogous to small composable Unix programs that have a design imperative to be simple, well-documented, and do one thing well. Implementation of public components strongly favors functional paradigms and should itself focus on maximum re-use. The appropriate level of public vs. private granularity is a software engineering exercise. Regardless, all constituent code must be contained in graph data entities and all code references must be grounded in hash identity.

Developing software within the InfoCentral paradigm is entirely unlike contemporary application development. There is a stark absence of boilerplate coding like data wrangling, networking, and lifecycle management. Likewise, data schema and ontology design is lifted out of individual projects and handled as a global collaborative effort. Any code deliverables are usually small and independent of specific local project objectives. Integration work, the high-level declarative weaving of modular functionality into specific interaction patterns, typically outweighs writing new component code. Interaction patterns themselves should often be open collaborative works, as many users will need similar patterns. Unlike contemporary software, there is little room or motivation for redundant implementations of everyday schemas and interactions. The resulting consolidation will bring a long overdue simplification and standardization to users and developers alike.

Because the separable work units are so small and well-defined, the natural economic model for most development is purely labor-based, involving contract work to build upon global, open source codebases. Typically, a project contract would not be considered fulfilled until any new public schemas, interaction patterns, and components have been approved by the oversight communities involved. This provides default quality-assurance and ongoing maintenance, regardless of the nature of the local project.

InfoCentral-style architecture has many other economic ramifications. In contrast to the web, it will result in a shift in revenue streams from advertising to direct service-oriented schemes. While the public data graph is fully open and nominally free, the opportunity for market-driven value-added services is enormous – network QoS, indexing and search, views and filtering, subscription, large-scale analytics, mediated interactions, compute services, etc. Private services built upon the public data graph are fully orthogonal and competitive because no party controls the graph itself. Consumers may easily change service providers because all underlying data is portable and uniformly accessible by default. Even if some services use proprietary code behind the scenes, the services are consumed over open data and the user is still in full control of how the results are used.

Methodology

InfoCentral is intended to be an open and unifying collaborative project. The project is primarily focused on building architectural foundations, not a complete software stack or “platform.” Any implementation work seeks only to help establish a new software ecosystem that will rapidly outgrow the project. Likewise, there is no intention to name any design or specification using “InfoCentral” as if it were a brand. To do so would undermine the nature of the project’s work toward open industry standards.

The first priority of the InfoCentral project is to design and promote a universal persistent data model that can be shared among the dozens of projects working in the space of distributed systems. This is highly strategic in that it provides a neutral, minimal-infrastructure point of unification. It is also a practical initial goal with immediate benefits and early applications not dependent on the more challenging software layers. Most other projects have started with new languages and/or development environments, forcing the most unstable and research-heavy aspects of their designs from day one. The InfoCentral approach creates a wide bridge between all new and existing systems, separating what needs to evolve from what can be agreed upon today. This allows both cross-pollination between projects and a pragmatic migration path for legacy systems. It creates something that will outlast the fun tech demos and toy implementations, without hindering their progress in any way.

The new software ecosystem envisioned will require a bootstrapping phase, but a critical advantage over other “grand-unification” proposals is that InfoCentral seeks not to over-specify designs for the sake of expediency. InfoCentral is not a pedantically clean-slate project. While some absolute architectural guidelines are drawn, any existing technology that fits is permissible to reuse – even if temporarily, through an interface that serves as an adaptor to the new world of fluid, graph-structured information. Conversion may proceed gradually, as dependencies are pulled into the new model. It will be possible to start with low-hanging fruit. There are many simple use cases that derive from the basic architectural features but do not require immediate adoption of a complete stack. (For example, the data and network model are useful independent of the more nascent Information Environment research area.) However, as developers collaborate globally to rebuild common functionality under the new model, an avalanche of conversion should begin. As the benefits become obvious and dependency trees are filled out, it will quickly become cheaper and easier to convert legacy systems than to maintain them.

The InfoCentral philosophy toward technological and downstream social change is to provide tools rather than prescription and pursue undeniable excellence rather than advocacy. The only reliable and ethical method of convincing is to demonstrate elegance and effectiveness. When adequately expressed, good ideas sell themselves. In practice, most people expect new technology to provide immediate benefits, have a reasonable learning curve, and not get in the way of expression or productivity. Any violation of these expectations is a deal-breaker for mainstream acceptance and a severe annoyance even for enthusiasts. Likewise, social technology must not require significant life adjustments to suit the technology itself. In most cases, technology must be nearly transparent to be acceptable.

In the long term, the InfoCentral model contains enough paradigm-shifting design choices that substantial relearning will be required. However, this need not occur at once. The key is to make any new system intuitive enough that self-guided discovery is sufficient to learn it. With InfoCentral, the most challenging new concepts are graph-structured immutable data and the fluidity of software functionality. These are shifts toward more intuitive computing, yet they may temporarily be harder for seasoned users and developers to grasp. It will be critical to build many bridges to help with the transition.

InfoCentral will require a strong and diverse base of early adopters, and early contribution must be rewarding. As soon as possible, implementations should feel like a place for fun exploration, like the early days of the web. InfoCentral offers a haven for academics, visionaries, and tinkerers to experiment within. It likewise offers cutting edge technology for consultants who want to differentiate themselves from the masses of web software developers. It offers businesses a competitive advantage if they can find ways to use the technology to improve their efficiency before competitors. For some developers, InfoCentral will represent a departure from the disposable-startup culture and an opportunity to build things that will last and permanently improve the world.

Just for Fun

Most of us chose the field of computing as a career because we somehow discovered the joys of dissecting complex machines, creating new ones, and solving hard puzzles. After extended time in industry and/or pursuing academic directions, this lighthearted enthusiasm can be lost. I hope that projects like this can help many of us regain our original fervor. I believe that a new software revolution is just around the corner, with a creative modus that will satiate the fiercest nostalgia for the long-past golden age. Very little of what now exists will remain untouched, and fresh ideas will be overwhelmingly welcome once again. The means of novel expression will again feel powerful and rewarding, unhindered by decades of legacy baggage and boilerplate.

The InfoCentral Design Proposal

This initial design proposal is intended to be relatively informal and not fully exhaustive. It is a work-in-progress and is intended to inspire collaboration. Many of the topics contained within will be treated elsewhere with greater depth. Likewise, formal research will need to be conducted to refine specifics of the design, particularly those left to implementation flexibility.

The proposal is presented here in a static document form. It represents merely a seed and a historic artifact, as the InfoCentral project is launched publicly. It is the work of one author and by no means the final authority on design matters. Henceforth, this content should be transitioned to a collaborative medium.

What follows is a dense summary outline of the architectural features, characteristics, and rationales of core InfoCentral designs. It has largely functioned as an organizational artifact, but now serves as a way to introduce key concepts and provide a project overview. The outline structure was chosen to avoid continual maintenance of paragraph flow and extraneous verbiage while developing raw ideas. It also makes the relation of points, subpoints, and comments explicit. For new readers, it will be helpful to make multiple passes on this material, as certain subtleties will become apparent only after broad perspective is gained. The design and writing process itself has consisted of countless refining passes. Each section has some overlap, sometimes using generalized concepts that are elaborated elsewhere.

Unlike a formal academic paper, the varying scope and long timeframe of this exploration have made references difficult to manage, so no complete list will be attempted. This publication should be seen as a practical engineering proposal and vision-casting effort, with academic rigor to be added later.

Architecture for Adaptation

General

Scalability

Data store and network abstraction

High-level programming abstractions

Persistent Data Model

Summary

The Persistent Data Model defines how units of first-class persistent data are contained and referenced. It is the only invariant component of the InfoCentral architecture, around which all other components are freely designed and may evolve with time. To avoid contention and the need for future revision, applicable standards will be as minimal as possible.

Standardized mutability and reference semantics are the foundation of the InfoCentral data architecture. The Persistent Data Model necessarily defines the only model of persistent data allowed for user-space software and the only form of publicly shareable and referenceable data allowed among networked data repositories. In promotion of Information-Centric Networking, it serves as the “thin neck” in the hourglass network model.

Discussion

Standard Data Entities

The Standard Data Entity is the data container for all storage and transport. It specifies the data structure but does not mandate the final encoding schemes used by data stores and transport mechanisms. However, any encoding of an entity must yield the same binary data when resolved to the canonical format.

References

All references to other entities within Standard Data Entities must use secure hash values of the canonical serialized data of referenced entities. This enforces the immutability property among Persistent Data and ensures that references cannot go stale.

A secure hash value used in a reference may conventionally be called a Hash Identity (HID) of the referenced entity because it is based solely upon the totality of the entity’s intrinsic properties (i.e., the entity’s content self-identity). However, as there are unlimited calculable hash values for a piece of data, there is no particular HID that constitutes the canonical identity for an entity. References always specify which hash function was used.
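
For illustration, a minimal sketch (Python, with JSON standing in for the canonical binary serialization; field names are illustrative) of two equally valid HIDs for the same entity, each reference naming the hash function it used:

  import hashlib, json

  def canonical(entity):
      # Stand-in for the mandatory canonical serialization of an entity.
      return json.dumps(entity, sort_keys=True).encode("utf-8")

  entity = {"type": "note", "payload": {"text": "hello"}}

  ref_a = {"alg": "sha-256", "digest": hashlib.sha256(canonical(entity)).hexdigest()}
  ref_b = {"alg": "sha-512", "digest": hashlib.sha512(canonical(entity)).hexdigest()}

  # Both references resolve to the same immutable entity; neither digest is
  # "the" canonical identity, which is why a reference always names its hash function.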

Intrinsic Entity Metadata

A reference may contain metadata for the entity being referenced. This metadata must always pertain to publicly-visible intrinsic properties of an entity. (i.e. It must be derivable solely from the plaintext (or undecrypted ciphertext) information contained in the entity referenced.) Inclusion of metadata in a reference is always optional and does not change the equivalence of references to the same entity.

Standard Data Entity Specification Development

Because it is the absolute foundation, revision of the Standard Data Entity specification should be avoided at all costs. Once a public standard is ratified, it must be supported forever. Standard Data Entity public specifications should mirror the historical stability of simple, clean standards like HTTP. Extensions must be preferred over any changes that break compatibility.

Entity standardization is largely defined by enforcement at the Repository interface. Any new public Standard Data Entity specification would coincide with a new Repository interface version, which would support all previous versions. Existing hash-based references would therefore be unaffected. A request by-hash over the network is generally content neutral, in that it can proceed without awareness of the entity specification it applies to and what Repository interface will respond to the request. However, if useful, reference metadata could contain a field for the SDE revision of the referenced entity.

For the sake of prototyping, we can build a temporary, unstable, non-public entity specification and associated Repository interface. As a precaution, it would probably be wise to include a tiny header field that indicates the prototype nature and perhaps a pre-ratification version number. Once a public standard is ratified, prototype data can be converted, but re-establishing references among widespread multi-party data will be difficult, especially where cryptographic signatures were involved. Thus, prototype implementations should strictly be used for either non-valuable data or closed-group private applications, wherein coordination of wholesale conversion is feasible.

Large data handling

Large data may be broken into chunks, each held by a first-class entity. Merkle hash trees can be used to efficiently aggregate and validate these data chunk entities.
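
A sketch of the idea (Python; the chunk size and field names are illustrative): each chunk becomes a first-class entity, and interior Merkle nodes reference their children by HID up to a single root that identifies the whole blob:

  import hashlib, json

  def hid(e):  # hash identity over a JSON stand-in for the canonical form
      return "sha-256:" + hashlib.sha256(json.dumps(e, sort_keys=True).encode()).hexdigest()

  CHUNK = 256 * 1024  # illustrative chunk size

  def chunk_entities(blob):
      return [{"type": "chunk", "payload": blob[i:i + CHUNK].hex()}
              for i in range(0, len(blob), CHUNK)]

  def merkle_tree(chunks):
      # Leaves are the HIDs of chunk entities; interior nodes reference pairs
      # of children by HID, up to a single root whose HID identifies the blob.
      level, nodes = [hid(c) for c in chunks], []
      while len(level) > 1:
          parents = [{"type": "chunk-node", "children": level[i:i + 2]}
                     for i in range(0, len(level), 2)]
          nodes.extend(parents)
          level = [hid(p) for p in parents]
      return level[0], nodes

  root_hid, interior_nodes = merkle_tree(chunk_entities(b"..." * 500000))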

Design Rationales for the Persistent Data Model

Temporal Data Model

Summary

The Temporal Data Model defines a general model for arbitrary, non-first-class data that is not required to be standardized but may be useful behind the scenes for managing the internal state of data management facilities. It is not related to memory models for user-space software environments.

Discussion

Continuous data sources

Temporal Data may be used to hold temporary continuous data from sensors, human input devices, and other stream data sources.

Design Rationales for Temporal Data

Collectible Reference Metadata

Summary

Although Standard Data Entities are immutable, repositories collect and propagate metadata about which entities reference each other and, if visible, for what purpose. These per-entity reference collections are the sole point of mutability in the data architecture and allow for practical management of data for revisions, graph relationships, external signatures / keys / access controls, and annotations of all kinds. The collection-based model supports arbitrary layering of public and private third-party information across networks. Synchronization of collections may occur across repositories and networks thereof, to propagate revisions and new metadata, manage transactions, etc.
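
A repository-side sketch of such a collection (Python; the structure and names are illustrative): a mutable index keyed by referenced entity, which can always be rebuilt by rescanning the immutable entities themselves:

  from collections import defaultdict

  class ReferenceCollections:
      """Disposable, per-entity index of known references between immutable entities."""

      def __init__(self):
          self.by_target = defaultdict(list)

      def record(self, source_hid, target_hid, ref_type, meta_type=None):
          # Collected whenever an entity referencing target_hid is stored or received.
          self.by_target[target_hid].append(
              {"source": source_hid, "ref_type": ref_type, "meta_type": meta_type})

      def references_to(self, target_hid, ref_type=None):
          refs = self.by_target.get(target_hid, [])
          return [r for r in refs if ref_type is None or r["ref_type"] == ref_type]

  # Synchronizing with a peer repository amounts to exchanging entries from
  # these collections for entities of shared interest.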

Discussion

Collection Management

Reference metadata may be arbitrarily collected and disposed of for any persistent entity. Relative to the abstract global data graph, no information is lost if reference metadata is destroyed, because it could always be regenerated from Persistent Entities. It is local knowledge and may use the Temporal Data Model. Schemes for managing reference metadata are repository implementation specific.

Entity Reference Types

Collectible Reference Metadata always includes the type of a reference and often includes the type of metadata provided by the referencing entity, if visible. Knowledge of entity reference and metadata types can be useful for repository management purposes, such as allowing subscription to events for desired types. Formally, Entity Reference Type refers to the manner in which a reference is made between entities.

Design Rationales for the Collectible Reference Metadata model

Persistent Entity Metadata

Summary

The Standard Data Entity design provides all entities with the ability to hold both “self subject” metadata about their own payloads and external metadata about other entities.

As a point of terminology clarification, Intrinsic Entity Metadata refers to the raw data properties of a Standard Data Entity in its canonical encoded form, absent of any external decoding, decryption, context, and interpretation. Persistent Entity Metadata refers to metadata stored within an entity that relates to the realm of Persistent Data at large and is subject to external context and interpretation. However, the term “metadata entity” is inappropriate, because there is no such distinction. All entities may contain data and/or metadata.

Metadata Types

There are five standardized categories of entity metadata: General, Repository, Anchors, Signatures, and Permissions – with the convenient mnemonic “GRASP”. Within each category, all common types of metadata shall be standardized. These should cover nearly all possible applications of metadata. Non-standard custom types are allowed but discouraged for public data. These must be explicitly designated as such, to avoid collision with future extensions of standardized types.

General Metadata

General Metadata is basic data about an entity, such as types or timestamps. It does not include any intrinsic information.

Repository Metadata

Repository Metadata is used to provide history related to an entity and its context. It primarily involves revision and retraction data, but may also include other related state that needs to be persistent, like transaction contexts.

Anchor Metadata

Anchor Metadata connects an entity to specific data within other entities. It is used to link comments, markup, links, entity and property graph composition relationships, discussions and other interactions, and all other forms of annotation.

Signature Metadata

Signature Metadata is used to make various attribution, approval, and authentication assertions about external entities. It may also be used to specify a set of key IDs that are required to internally sign the containing entity’s header and payload, with these signatures included in the final entity data.

Signature Aggregates

Signature Aggregates are lists (or Merkle DAGs) of entity references that are signed at once, within one entity, rather than creating multiple entities that each hard-reference a single entity to be signed. In appropriate situations, this is an alternative and potentially more efficient means of external signature.

Permissions Metadata

Permissions Metadata provides access and authorization security artifacts such as cryptographic keys, ACLs, roles, rules, etc.
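
To make the grouping concrete, a sketch of an entity carrying metadata under each GRASP category (the field names, layout, and placeholder HIDs are illustrative, not the canonical encoding):

  entity = {
      "payload": {"text": "Quarterly report draft"},
      "metadata": {
          "general":     [{"type": "document/report", "created": "2016-03-01T12:00:00Z"}],
          "repository":  [{"revision-of": "<HID of previous revision>",
                           "root": "<HID of document root>"}],
          "anchors":     [{"ref": "<HID of reviewed entity>", "relation": "comments-on",
                           "range": {"start": 120, "end": 188}}],
          "signatures":  [{"required-key": "<key ID>", "scope": "header+payload"}],
          "permissions": [{"acl": "<HID of ACL entity>", "role": "reviewer"}],
      },
  }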

Notes

Root Entity Anchoring

Summary

Root entities serve as anchoring points within the global graph. They represent particular persons, organizations, objects, ideas, topics, revisioned documents, ontology classes, etc. and are widely referenced by statements, metadata, etc. Anchoring root entities can be difficult, however. Anyone can create a root intended to anchor some concept or instance, and this can easily lead to unintentional duplicates and divergent data. Consensus must evolve over time, usually through organized community processes. For instance, expert groups may form around the management of different ontologies and eventually become de facto authorities for generating these, ideally aided by data-driven NLP / ML techniques since the work is so vast. In other cases, root entities anchor artifacts with a limited scope. These roots must often be created by individuals, who do not have the weight of a trusted community’s signature behind them. On the other hand, limited scope reduces potential for ambiguity and incentive for abuse.

Discussion

Anchoring Methods

There are currently five known methods of unambiguously anchoring an entity, listed below in decreasing order of preference. These may not be individually sufficient in all situations, but may be combined. It should be strongly noted that only the first two currently reside entirely within the Persistent Data model. The third and fourth could be implemented within Persistent Data in the future. The fifth is the only approach that definitively requires out-of-band validation.

  1. Include a hard or soft reference to another entity that is already well anchored and that provides sufficient context by association. (Multiple references increase context.)
  2. Include one or more cryptographic signatures that establish authorship context. If the signing key is ever retracted, the anchoring is weakened or lost. Key expiration is not a problem, so long as the signature was made before the expiration date.
  3. Register a hash or signature within a distributed public ledger or similar network. (ex. those using cryptographic block-chains for distributed consensus) An address within this network or ledger would then need to be provided within the root entity. Ideally, such a system could be implemented on top of Persistent Data, making it a first class identity anchoring scheme that needs no out-of-band validation.
  4. Register with a centralized identity authority or digital notary service. Then, include a token or signature with the root entity that can be externally verified accordingly. Such a service could also be built upon Persistent Data.
  5. Reference an existing, authoritative, permanent identity created by a traditional centralized naming authority. (ex. government-issued official person or place names / numbers, ISBN, DOI, etc.) This should primarily be done as a transitional measure, as part of importing existing information or migrating legacy systems.

Property Graph Data Model

Preface

The Semantic Web vision remains deeply inspirational, despite persistent industry disinterest. Many hurdles, both technological and economic, have prevented the Semantic Web’s popular realization. This section will propose ways that the data model may be practically improved to help overcome these.

This section could also be called, “Retooling the Semantic Web for Information-Centric Networking.” Existing Semantic Web standards are designed around the classic web model of centralized publishing authorities and mutable resources with human-meaningful names. While not strictly tied to these assumptions (e.g. alternate URI schemes), their allowance is a major detriment to the evolution of fully-distributed, machine-friendly data architectures. Some of the design patterns that follow from these assumptions are incompatible with the InfoCentral architectural guidelines. The greatest friction is seen with immutable data via hash identity and, in turn, the networkable reference metadata collections used to aid aggregation, propagation, and graph traversal. As part of the remedy, certain features of RDF (and some standards built on top of it) are factored into the Persistent Data Model, which is aligned to the supporting storage and network management layers. Besides simplifying some optimizations, other data models can also benefit from the low-level (non-semantic) graph-oriented features of the Persistent Data and Collectible Metadata models. Likewise, a single mandatory binary serialization eliminates needless wrangling of competing syntaxes and serialization formats. (A relic of interest only to humans using archaic tools.) Ultimately, the InfoCentral proposal mostly differs from the Semantic Web regarding syntactic and data management concerns.

The InfoCentral Property Graph Data Model also aims to be simpler to learn and understand than the Semantic Web standards it is derived from. While technically a subset, it supports all of the expressive power of RDF and its existing semantic extensions. (While hopefully also proposing an enhanced n-ary relation extension.) Several of RDF’s special cases are also discarded. (such as all explicit blank nodes) In cases where these carry over into standards built upon RDF, we will likewise propose amendments. (ex. It will be necessary to create a derivative of the OWL RDF-Based Semantics.) However, we believe it will be possible to directly translate everything built upon current RDF+OWL to the InfoCentral model.

Introduction

The InfoCentral Property Graph Data Model is a recommended set of specifications for persisting graph-structured data in the Persistent Data Model. It is largely based on the Semantic Web effort toward universal knowledge representation standards, while departing from existing designs as needed. The initial specifications are primarily a subset of RDF, RDFS, and OWL and thus can be easily mapped to existing Semantic Web tools. The feature set excluded is deemed to be architecturally incompatible or undesirable, under the project’s first-principles mandate.

Due to the immense complexity of knowledge representation, this aspect of the initial InfoCentral design proposal is expected to be the most subject to change.

Discussion

InfoCentral promotes extensive decomposition of information into graphs of small, immutable, uniquely-identified entities. Entity payloads may hold statement triples (subject, predicate, object/value), strictly using HID-based references and strongly-typed values. This promotes a number of critical properties:

Root Entity Usage

Root entities often serve as the logical identities of graphs of versioned and composited information. For example, a “text document” root is an entity around which text components and their revisions are collected. Likewise, a “human person” root entity would establish a permanent identity around which a vast amount of information is collected about an individual person. A HID reference to the root entity is the subject of many statements, as with URIs in the Semantic Web. At initial retrieval time, the entire known graph adjacent to a root entity is typically transferred.
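
For instance (a sketch with placeholder HIDs; in the full model, predicates are themselves HID references to ontology property entities), a “person” root and separately published statement entities whose subject is the root’s HID:

  import hashlib, json

  def hid(e):  # hash identity over a JSON stand-in for the canonical form
      return "sha-256:" + hashlib.sha256(json.dumps(e, sort_keys=True).encode()).hexdigest()

  person_root = {"type": "root", "kind": "person"}   # permanent identity anchor
  alice = hid(person_root)

  statements = [
      {"subject": alice, "predicate": "<HID of the 'name' ontology property>",
       "object": "Alice Example"},
      {"subject": alice, "predicate": "<HID of the 'knows' ontology property>",
       "object": "<HID of another person root>"},
  ]
  # Each statement lives in its own immutable entity. A repository serving the
  # root can return the adjacent known graph by following collected references.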

Identity and Reference Semantics

Blank Nodes

Blank node subjects (a.k.a. anonymous resources) are disallowed in the InfoCentral property graph data model. Every statement must have a reliable, concrete identity as its subject. A HID reference to a stub root entity will suffice.

Basic justifications for disallowing blank-node subjects

Wider perspective

The Semantic Web envisions a world of predominantly independent publishers of incomplete and often duplicate graph data. This is intended to be aggregated ad hoc, often using logical inference to fill in the gaps. In contrast, InfoCentral envisions a world of predominantly socially-networked collaborative publishers, who actively improve and consolidate information in a manner similar to a Wiki. Collections of information grow around well-known concept roots. Combined with the significant cost to create and maintain a stable HTTP URI, the appeal of blank nodes in Semantic Web architecture is understandable, even if highly undesired by data consumers. However, by using only hash self-identities, there is no such motivation to compromise.

Alternatives to uses of blank nodes

Collections and Containers

The concrete RDF syntax for Persistent Data relies upon root entities instead of blank nodes for collection and container vocabulary. This supports third-party contribution. The collection or container root becomes the subject for membership statements, contained in entities that reference it. A key advantage to the (network-visible) reference collection approach is that even simple, passive repositories can aggregate data for such collections.

There are generally three collection patterns to choose from:

  1. A single versioned “collection” entity that contains a list of references or values (membership statements)
  2. A tree of collection entities that contains a set of references or values in its leaf nodes
  3. Independent references to a collection stub entity
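
A sketch of the first and third patterns (field names and placeholder HIDs are illustrative):

  # Pattern 1: a single versioned collection entity listing members directly.
  reading_list_v2 = {
      "type": "collection",
      "revision-of": "<HID of reading_list_v1>",
      "members": ["<HID of article A>", "<HID of article B>"],
  }

  # Pattern 3: a bare stub root; membership statements live in separate entities
  # that reference it, so any third party can contribute members independently.
  reading_list_root = {"type": "collection-root", "label": "shared reading list"}
  membership = {
      "subject": "<HID of reading_list_root>",
      "predicate": "<HID of the 'has-member' ontology property>",
      "object": "<HID of article C>",
  }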

Ontologies

Data schemas are provided by globally-collaborative ontologies, themselves composed of immutable entities.

Encoding

A compact, typed, language-native-friendly, binary serialization will be the only standard and will be treated as a module of the Persistent Data Entity encoding scheme. Unlike the HTML/XML/JS web, there is no preference for immediate human readability of data at storage and interchange levels. This is a Unix legacy that is no longer valid with modern tooling.

Embedded language-native serializations allow for richer abstract data types (ex. sum types) and programmatic encodings. However, code and data should always reside in separate entities. For instance, a value that was generated using some particular run-length encoder may reference the applicable code module by HID.

Performance concerns

The InfoCentral graph data model is expected to have greater total overhead compared to many existing Semantic Web systems. Most of this is due to promoting a higher degree of decomposition of data, along with the mandatory versioning scheme of the Persistent Data Model. However, the overriding design concerns of InfoCentral are interoperability, decentralization, and future-proofing. Persisted information should be archival-quality by default, even for personal sharing and interaction.

Practical performance can be regained through local indexing, caching, denormalization, and views. In the proposed model, any relational engine or graph store is ultimately a view, backed by the Persistent Data layer. There is local work to continuously map any relevant new data to views currently in use. However, this overhead is well worth the advantages of a universalized data / identity model and the network models it enables.

Data Store Standardization

Summary

A Standard Repository is a data store or network thereof that supports the Persistent Data Model and at least the base Repository public interface. Repositories may participate in any number of networks, and these networks are themselves logical repositories if they expose a public Repository interface. Data entities and reference metadata are propagated within repository networks, as well as between networks and standalone repositories. The Repository interface serves as the sole public data exchange interface in the InfoCentral architecture. However, repository networks may internally use their own private interfaces, such as designs optimized for local clusters, cloud services, or meeting QoS criteria in a large P2P system. This allows wide room for innovation in Information-Centric Networking research while ensuring baseline compatibility.
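
As a rough illustration of the surface involved (a sketch, not a proposed specification; the operation names are illustrative), the base interface might reduce to a handful of operations:

  from abc import ABC, abstractmethod
  from typing import List, Optional

  class Repository(ABC):
      """Sketch of a minimal base Repository interface."""

      @abstractmethod
      def get(self, hid: str) -> bytes:
          """Return the canonical bytes of the entity with this hash identity."""

      @abstractmethod
      def put(self, entity: bytes) -> str:
          """Store an entity, if accepted, and return its HID."""

      @abstractmethod
      def references_to(self, hid: str, ref_type: Optional[str] = None) -> List[dict]:
          """Return collected reference metadata pointing at the given entity."""

  # A repository network exposing these same operations is itself, logically,
  # a Repository; internal coordination protocols remain private to the network.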

Discussion

General

Interfaces

Subscription

Subscriptions to reference metadata collections or published views use interaction over Persistent Data rather than an additional repository interface. Updates are pushed to requested locations under parameters agreed upon during subscription request and acceptance.

Application to the Persistent Data Model

Application to Software Architecture

The software layers directly above the Repository interface hide all transport, storage, encoding, access, and security concerns from user-space software components. This package of capabilities is known as the Data Management Foundation (DMF). It ensures that user-space Information Environment (IE) software components can interact purely within the abstraction of global graph data, ignoring physical concerns. DMF and IE represent the basic role dividing line within InfoCentral software architecture.

Permanent Domains

Summary

Domains are an optional facility for repository or network identity, never data identity. They are managed via entities holding domain records, which are signed by the keys associated with the domain. The root entity for a domain’s records provides the permanent Domain ID, via its hash values. The root entity must contain a list of public-key IDs that may sign its records. Domain record entities reference the root and describe variable aspects of the domain, such as associated repositories or network addresses. Any system that implements a Standard Repository interface may be referenced by a domain record as a member data source. Domain metadata may be used amidst the global data graph to annotate recommended data sources by domain IDs.
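
A sketch of the two kinds of entities involved (fields and the address form are illustrative):

  # The domain root is immutable; its HID serves as the permanent Domain ID, and
  # it fixes the set of keys allowed to sign records for this domain.
  domain_root = {
      "type": "domain-root",
      "signing-keys": ["<HID of key entity 1>", "<HID of key entity 2>"],
  }

  # Records reference the root and describe the variable aspects of the domain,
  # such as member repositories or current network addresses.
  domain_record = {
      "type": "domain-record",
      "domain": "<HID of domain_root>",
      "repositories": ["repo.example.org:4443"],
      "signature": {"key": "<HID of key entity 1>", "value": "<signature bytes>"},
  }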

Discussion

Adaptable, Multi-Paradigm Security Model

Summary

The primary focus of the InfoCentral security model is to enable secure interactions over the global-scale public data graph, while making the details as invisible as possible for users working at the information-space level of abstraction. The lack of traditional application and service boundaries in the software architecture requires that user end-points be largely responsible for their own security needs. Public distributed systems must almost entirely rely on cryptography. Almost all content is signed; all private content is encrypted. Private networks and traditional access controls are still supported, while providing the same security abstractions. Likewise, distributed network designs may use their own access and propagation schemes to guide data visibility.

Discussion

Entity Encryption

Because the InfoCentral architecture lends itself to a shift toward client-side computing, the costs of strong cryptography are largely offloaded to end-users. What could be economically prohibitive for narrow-margin cloud services supporting billions of homogeneous users is trivial for local hardware operating upon a much smaller yet more diverse data set. Hardware cryptography acceleration is also likely to assist here, especially for energy-constrained mobile devices.

Access Controls

Other access control methods are supported via optional data repository features.

Public-Key Infrastructure

Open distributed systems lend themselves to extensive use of public-key infrastructure (PKI) cryptosystems because they allow for the convenient establishment of trust chains among otherwise untrusted users and systems.

Layered Data Security

Public Interfaces

Systems fully exposed to the public internet (outer layer) have a very minimalistic interface / protocol, similar to HTTP. Ideally, the standard repository interface for public graph data will become the dominant public protocol used on the internet. In combination with formally-defined interaction patterns over graph data, this will eliminate the need for application-specific network protocols.

While the basic Repository interface is a public interoperability requirement, repository networks are free to privately extend it in any way that does not impact the data model. In contrast to public network operation, such intra-repository communication among distributed-topology systems usually happens over a separate, private secure channel.

Social Technology Philosophy and Security Implications

Software Architecture

Overview

InfoCentral’s proposed software architecture can be divided into two primary roles: the Data Management Foundation (DMF) and the Information Environment (IE). Most generally, the DMF is responsible for managing raw data spaces (storage, transport, encoding, synchronization, and security concerns), while the IE is responsible for managing information spaces (semantics, language, and knowledge) and the composable, user-facing code modules that operate within them. All aspects of DMF and IE are distributable.

DMF components typically reside on both dedicated servers / network hardware and local user devices. IE components predominantly reside on local user devices, but may also live in trusted systems, such as personal, community, and business systems that serve as private information and automation hubs. Large public DMF instances may also have an adjacent IE to support auxiliary services like indexing.

The reason for distinguishing DMF and IE is to strictly separate certain data management and processing concerns. This will be particularly critical with the shift toward Information Centric Networking. Whether using ICN or host-based networks behind the scenes, IE components should never have to worry about data persistence, networking, or security concerns. The IE should be able to interact with the global graph as if it were a secure, local data space.

Development considerations

DMF components could be developed using any of today’s popular languages and tools, including system-level implementations. The IE requires new high-level languages and tools to fit the paradigms of fully-integrated information spaces without application boundaries. Provided that adequate role separation is maintained, the architecture does not specify where or how DMF and IE components are implemented. Eventually, a hardened, minimal OS / VM is envisioned as the ideal host for both IE and DMF local instances.

While it would theoretically be possible to write traditional applications that directly use the DMF, this would largely defeat the purpose of the new architecture while still incurring its overhead relative to traditional databases. Instead, there are plans for special IE implementations that serve as adaptors for legacy systems during the transition.

Major component categories

Data Management Foundation

Summary

The Data Management Foundation provides all necessary functionality and interfaces for persisting, exchanging, and managing Persistent Data, while hiding network, storage, and security details that could result in inappropriate dependencies and assumptions. The central abstraction of the DMF is to make the global data space appear local to components in the Information Environment, with the exception of generalized Quality-of-Service attributes. Real-world DMF capabilities are often layered across trusted and untrusted systems, with entity and transport cryptography, smart routing, and access controls used to safeguard content traversing mixed environments.
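
As one way to picture that exception, the Python sketch below shows generalized Quality-of-Service attributes that might accompany data access, letting IE components reason about cost and availability without ever seeing hosts, protocols, or storage details. The attribute names are illustrative assumptions.

    from dataclasses import dataclass
    from enum import Enum

    class Locality(Enum):
        LOCAL = "local"    # already cached on this device
        NEARBY = "nearby"  # reachable within a trusted local network
        REMOTE = "remote"  # must be fetched from the wider network

    @dataclass
    class QoS:
        # Generalized attributes exposed to the IE in place of raw network
        # and storage details.
        locality: Locality
        expected_latency_ms: int
        durability_copies: int

    def should_prefetch(qos: QoS, size_bytes: int) -> bool:
        # An IE component can make policy decisions from QoS alone, e.g.
        # prefetch large remote entities ahead of interactive use.
        return qos.locality is Locality.REMOTE and size_bytes > 1_000_000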

Repository and Networking

Encoding and Cryptography

Low-Level Database Features

DMF Capabilities and Specializations

DMF capabilities should be limited to networking and data management, providing clients with a coarse view of entity data. Features such as statement-graph or relational semantics are supported only within the IE.

Some features, such as indexing and query, are ultimately split between the layers. For example, the DMF supports coarse indexing by payload and metadata types, provenance, and the like, whereas the IE supports fine-grained indexing over data types, attribute values, entity relationships, and other aspects of higher-level data structures.
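
The distinction might look like the following Python sketch, where the DMF-level filter sees only entity metadata while the IE-level filter interprets payload content. The entity field names are illustrative assumptions.

    # DMF-level (coarse): selection by metadata alone -- payload type,
    # provenance, and the like. The payload itself is opaque at this layer.
    def coarse_select(entities, payload_type, signer=None):
        for e in entities:
            meta = e["metadata"]
            if meta.get("payload-type") != payload_type:
                continue
            if signer is not None and meta.get("signer") != signer:
                continue
            yield e["id"]

    # IE-level (fine): indexing over interpreted payload content, such as
    # attribute values or relationships between entities.
    def fine_select(entities, attribute, value):
        for e in entities:
            if e["payload"].get(attribute) == value:
                yield e["id"]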

While DMF capabilities are standardized, there is wide room for specialization to suit varied needs.

Low-latency operations

Some multi-player games and multimedia interactions require sharing real-time interactive data, perhaps involving hundreds of small entities per second. Admittedly, the exclusion of application-specific, low-latency network protocols makes this more challenging. However, a wide variety of DMF specialization techniques can make it feasible in practice.
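
One plausible specialization, sketched below in Python, is to coalesce many small real-time entities into periodic bulk pushes over a persistent connection. The `put_many` call is a hypothetical bulk operation, not part of any defined interface.

    import time

    def batch_and_push(repository, entity_stream, interval_s=0.05):
        # Coalesce small real-time entities into periodic bulk pushes rather
        # than paying a negotiation round trip per entity.
        batch = []
        deadline = time.monotonic() + interval_s
        for payload in entity_stream:
            batch.append(payload)
            if time.monotonic() >= deadline:
                repository.put_many(batch)  # hypothetical bulk operation
                batch = []
                deadline = time.monotonic() + interval_s
        if batch:
            repository.put_many(batch)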

DMF to IE Scope and Pairing

Treatment of Persistent Entity IDs in IE and DMF

Implementing DMF

Database Functionality

Summary

In the InfoCentral model, database functionality is a layer on top of the global, persistent data space. It can be used by the local software environment as a basis of what is currently known, trusted, and deemed useful from the global space. Database features are provided exclusively as views, signifying query-oriented derivation from the base entity data. Hierarchies of views can be used to divide and share database management responsibilities among software components. Low-level views, such as those provided by the DMF, offer simple selection and filtering of entity data. High-level views, the province of the IE, provide convenient data structures built from entity data and allow selection and aggregation based on complex attributes and semantics. They may implement any database semantics needed by high-level software components, whether document, relational, or graph-based. Views may also meet additional local / federated-system requirements, such as particular consistency models, that cannot be imposed upon the global data space.
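
The following Python sketch illustrates the view idea: each view is a pure derivation, so a high-level IE view can be stacked on a low-level DMF-style selection and recomputed at any time without touching base data. Entity field names are illustrative assumptions.

    from collections import defaultdict

    class View:
        # A view is a pure derivation over base entity data; it never mutates
        # the underlying entities and can always be recomputed.
        def __init__(self, source, derive):
            self.source = source  # an iterable of entities or another View
            self.derive = derive  # query function producing the view's result

        def materialize(self):
            base = (self.source.materialize()
                    if isinstance(self.source, View) else self.source)
            return self.derive(base)

    all_entities = []  # stand-in for entity data supplied by the DMF

    # Low-level view: simple selection of entities carrying a payload type.
    messages = View(all_entities,
                    lambda es: [e for e in es
                                if e["metadata"].get("payload-type") == "message"])

    # High-level view: a threaded structure grouped by the entity each
    # message replies to.
    def group_by_reply(ms):
        threads = defaultdict(list)
        for m in ms:
            threads[m["payload"].get("in-reply-to")].append(m)
        return dict(threads)

    threads = View(messages, group_by_reply)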

Discussion

Implementation

Provision of ACID properties

Traditional ACID transaction properties can be provided by layering optional capabilities upon the base public DMF standards, in conjunction with appropriate logic within IE.
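
One plausible layering is sketched below in Python: an optional compare-and-swap capability on a mutable series head, offered as a repository extension (`read_head` and `compare_and_set_head` are hypothetical), combined with an optimistic retry loop in the IE to obtain atomic, isolated commits.

    def commit_revision(repo, series_id, build_revision, max_retries=5):
        # Optimistic transaction loop: read the current head of a revision
        # series, derive a new immutable revision entity, and advance the head
        # only if it has not moved in the meantime.
        for _ in range(max_retries):
            head = repo.read_head(series_id)         # hypothetical extension
            new_id = repo.put(build_revision(head))  # immutable new revision
            if repo.compare_and_set_head(series_id,  # hypothetical extension
                                         expected=head, new=new_id):
                return new_id
        raise RuntimeError("commit contention: retries exhausted")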

Information Environments

Introduction

Working with distributed, collaborative, graph-structured information spaces will require new development tools, paradigms, and methodologies. To benefit from the dynamic layering and composability of information, user-space software functionality must itself be fluid and composable rather than contained in traditional static applications. To ensure safe, reliable composition of arbitrary components, we need a reasonable long-term path toward formally verifiable code. In the InfoCentral architectural vision, the framework to support these goals is known as the Information Environment (IE).

Definition and Scope

Information Environment Components

Base Data Types, Semantics, and Ontologies
Code Module Management
Interaction Patterns
Human User Interfaces
Software Agents and Artificial Intelligence

Software Architecture Principles

General

Simplicity should be an absolute guiding preference, with a goal of minimal dependencies and maximal re-use of code and data. The Information Environment concept is intended as the modern successor to the Unix philosophy.

Data Storage and Sharing

All persistent and shared data must use the Data Management Foundation. There is no provision for the IE to directly access data storage or networking facilities.

Software Component Interoperation

Information Environment software components may only interoperate publicly using Persistent Data. There is no provision for user-space remote APIs, which are an artifact of the application model of software. Private interoperation, such as function composition, is allowed within an IE instance. This influences the scope and granularity of private versus public software compositions and interactions. The general rule is that Persistent Data should be used for any data that may later be useful in other contexts or to other parties.

Computation may be distributed through an implementation-specific IE protocol. The most obvious applications are local, real-time processing tasks and large-scale processing across a distributed IE instance that spans a cloud service or P2P network. In both cases, intermediate data does not leave a single logical IE instance. To be accessible outside that IE, it would need to be published as Persistent Data.

Services

Services are simply formalized, automated remote interactions over shared graphs of Persistent Data entities. Interaction Patterns are used to specify service behaviors and contracts, and metadata is used to advertise patterns, end-points, and network QoS factors.
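
Building on the repository sketch earlier, a service interaction might look like the following Python sketch: requests and responses are ordinary Persistent Data entities that reference an advertised interaction pattern. Polling stands in here for the notification directives a real deployment would use, and all field names are illustrative assumptions.

    import json
    import time

    def publish_request(repo, pattern_id, params):
        # A request is just a persistent entity referencing the pattern.
        payload = json.dumps({
            "type": "service-request",
            "pattern": pattern_id,
            "parameters": params,
        }, sort_keys=True).encode()
        return repo.put(payload)

    def serve(repo, pattern_id, handler, poll_s=1.0):
        # The provider watches for new requests referencing the pattern and
        # answers each with a response entity referencing the request.
        seen = set()
        while True:
            for req_id in repo.references_to(pattern_id):
                if req_id in seen:
                    continue
                seen.add(req_id)
                request = json.loads(repo.get(req_id))
                response = json.dumps({
                    "type": "service-response",
                    "in-response-to": req_id,
                    "result": handler(request["parameters"]),
                }, sort_keys=True).encode()
                repo.put(response)
            time.sleep(poll_s)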

Examples
Other Boundaries

Programming Paradigms

General Language Requirements

Although declarative coding paradigms generally make more sense given the full architectural perspective, there are no hard requirements. With the exception of a few generalized constraints, language design will be driven by popularity.

Code containers
Naming

Code Management

Interaction with Data Management Foundation