Arches Modeling Documentation
The following documentation presents information, compiled by the Arches Resource Model Working Group (ARM WG), on using and creating Arches Resource Models and Branches for use in Arches implementations. Each guide will help Arches users understand the basic concepts behind modeling in Arches, the ARM WG methodology for Arches Resource Models, and the benefits of adopting that methodology, and will point to other resources for more information.
This documentation supplements the information on Resource Models in the official Arches documentation.
To access the Arches Package Library, click here.
This documentation is a WORK-IN-PROGRESS, and content will be continually added. Last update: January 2021.
BASIC CONCEPTS
By clicking on the SLIDESHOW link, you will be able to access information on the following topics. You can also go to the topic section by clicking each topic below:
What is a Data Model?
Types of Data Models
Structured data
Key elements of structured data
Structured data in a data model
What is a Semantic Data Model?
Semantic Standards
Semantic Formats
Semantic Ontologies
Semantic Web
What is an Arches Resource Model?
Resource Model and Branches
Anatomy of an Arches Branch
Data Types
Concept Collections
Relationships between Resource Models
Arches Designer
Encoding the ontology
Reference Data Manager
The Modeling Process
Data requirements
Data content standards
Creating a conceptual model
Building Arches Resource Models
Introduction
The information in this section introduces some concepts that will be helpful to understand before engaging with the ARM WG methodology for Arches modeling.
Generally, these concepts relate directly to the modeling process and help explain why it is important.
What is a Data Model?
A data model establishes the overall organization of data in any given database or system.
Data models can work on a conceptual level as well as a functional level. A conceptual data model provides the overall vision and framework for data organization but might not be adequately interpreted, structured and encoded for a particular software application. A data model that is also functional, such as an Arches Resource Model, is ready to be loaded into and expressed through a software application, in this case, Arches.
This documentation focuses on modeling functional data models, Arches Resource Models, that rely on conceptual frameworks.
Types of Data Models
In addition, there are different types of data models, often corresponding to the type of database or system. For example, a relational data model, which corresponds to relational databases, is based on a table structure. A graph data model, which often corresponds to graph databases, focuses on the data itself and the relationships between data. As a note, Arches Resource Models are graph data models.
Because Arches Resource Models are graph data models, it is important to understand structured data and how it forms the basis for graph data models.
Structured data
Structured data is data that is organized and formatted in a way that is machine-readable, or in other words, usable by computer applications for processing, analysis and other functions.
In relational databases, the structure of the data is created by tables. However, in order to increase portability, interoperability and longevity, data should be structured to be self-describing and independent of any particular software application.
To understand how structured data can achieve this, it is important to understand some key elements of structured data.
Key Elements of Structured Data
The following elements help to form the foundation for data that is meaningful and useful:
Structured data consists of clean and consistent data, with terminology that is defined by established controlled vocabularies. All database types can benefit from data that have these two elements.
Structured data also has elements that are essential in graph databases: the ability for each instance of data to be uniquely identified, preferably through a universal unique identifier, and the ability to create meaningful relationships between data instances through the use of triples.
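To make the idea of triples concrete, the following is a minimal sketch, not Arches code. It assumes a triple is a plain (subject, predicate, object) tuple; the predicate names `has_name` and `designed_by`, the sample values, and the helper `objects_of` are all hypothetical and exist only for illustration.

```python
import uuid

# Hypothetical resource identifiers, generated here purely for illustration.
building_id = uuid.uuid4()
architect_id = uuid.uuid4()

# Each triple is (subject, predicate, object). The predicate names are
# made up for this sketch and do not come from any real ontology.
triples = [
    (building_id, "has_name", "Old Town Hall"),
    (architect_id, "has_name", "Jane Doe"),
    (building_id, "designed_by", architect_id),
]


def objects_of(subject, predicate, triples):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Follow the relationship from the building to its designer, then look up
# the designer's name through a second triple.
designers = objects_of(building_id, "designed_by", triples)
designer_names = [objects_of(d, "has_name", triples)[0] for d in designers]
print(designer_names)
```

Because the relationship is stored as data rather than implied by a table layout, the same lookup works for any predicate added later.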
Clean Data
Clean data is consistent in terminology, formatting, and structure through the entirety of the table or database.
For example: consistent date formatting, e.g. 2019-10-02 instead of 10/2/19
Preferred terminology, e.g. United States of America instead of USA
Organizational standards help determine the preferred format and structure.
OpenRefine is a useful open-source tool for cleaning data.
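As a small illustration of the date example above, the sketch below normalizes a few date spellings to the YYYY-MM-DD form. It is only a sketch under stated assumptions: the list of accepted input formats is hypothetical, and slash-separated dates are assumed to be month-first, as in the 10/2/19 example. A real dataset would need its own list of formats.

```python
from datetime import datetime

# Assumed input formats (hypothetical). Slash dates are read as month/day/year.
KNOWN_DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%y", "%m/%d/%Y"]


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None


for value in ["10/2/19", "2019-10-02", "10/02/2019", "sometime in 2019"]:
    print(repr(value), "->", normalize_date(value))
```

Returning `None` for unrecognized values, rather than guessing, leaves those records visible for manual review.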
Controlled Vocabularies
A controlled vocabulary is the set of preferred terms chosen for use within a database.
They help create consistency when data can be incredibly messy, with misspellings, homonyms, and cultural/national differences.
Preferred vocabularies are established and stored in thesauri that should be shared for enhanced data interoperability.
Here is a primer on controlled vocabularies posted in the Implementation Considerations for the Arches Project.
Some examples:
The Getty Art and Architecture Thesaurus (AAT)
Library of Congress Subject Headings (LCSH)
The Arches Project manages its Controlled Vocabularies and Thesauri through the Reference Data Manager (RDM). For a more detailed guide on the RDM, click here.
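The role of a controlled vocabulary can be sketched with a simple lookup that maps variant spellings to one preferred term. This is not how the RDM stores concepts; the tiny `THESAURUS` dictionary and the function names are hypothetical and stand in for a shared thesaurus.

```python
# Hypothetical mini-thesaurus: preferred term -> known variant labels.
THESAURUS = {
    "United States of America": ["USA", "U.S.A.", "United States"],
    "United Kingdom": ["UK", "U.K."],
}


def build_lookup(thesaurus):
    """Index every label (preferred or variant) by its case-folded form."""
    lookup = {}
    for preferred, variants in thesaurus.items():
        for label in [preferred, *variants]:
            lookup[label.casefold()] = preferred
    return lookup


LOOKUP = build_lookup(THESAURUS)


def preferred_term(raw):
    """Return the preferred term for `raw`, or None if it is not in the vocabulary."""
    return LOOKUP.get(raw.strip().casefold())


for value in ["usa", " U.K. ", "United States", "Atlantis"]:
    print(repr(value), "->", preferred_term(value))
```

Values that return `None` are candidates for either correction or a new entry in the vocabulary.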
Universal Unique Identifiers
A Universal Unique Identifier (UUID) can be assigned to any entity within a database structure, so that each entity carries an identifier that distinguishes it from every other entity. A UUID is a 128-bit number, usually written as 32 hexadecimal digits. Two independently generated UUIDs are extremely unlikely to coincide, because there are 2^128, or roughly 3.4 x 10^38, possible values.
Arches utilizes UUIDs, e.g. 662b53c0-2e26-4b87-a6d0-109b7f611e05
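The properties described above can be checked with Python's standard `uuid` module. This is a minimal sketch that only parses the example identifier shown above and generates a fresh random one; it makes no claim about how Arches itself creates its identifiers.

```python
import uuid

# Parse the example identifier quoted in the text.
example = uuid.UUID("662b53c0-2e26-4b87-a6d0-109b7f611e05")

# A UUID is stored as 16 bytes, i.e. 128 bits.
bits = len(example.bytes) * 8
print("bits:", bits)

# Size of the 128-bit identifier space, in scientific notation.
space = f"{2**128:.1e}"
print("possible values:", space)

# Generate a new random (version 4) UUID for a hypothetical new record.
new_id = uuid.uuid4()
print("new identifier:", new_id)
```

Identifiers generated this way can be created independently on different systems without coordinating through a central registry.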
By comparison, a UID (Unique Identifier) or URI (Uniform Resource Identifier) may be unique within an organization, but not necessarily unique universally. For example, a university student identification number is a UID that will not be repeated within the context of that university. A Social Security number is also a UID, because it is assigned to a single person within a single context and is not repeated.
A UID can also link an entity to a controlled vocabulary. The AAT record number can replace the name of any entity because it links back to the original preferred authority record. For example, the AAT record for ‘database’ is http://vocab.getty.edu/aat/300028543, with the UID being 300028543.
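The link between a record number and its authority record can be sketched as a simple conversion between the two forms. The only assumption taken from the text is the URI pattern of the example, http://vocab.getty.edu/aat/ followed by the numeric record number; the function names are hypothetical, and no request is made to the Getty service.

```python
from urllib.parse import urlparse

# URI prefix taken from the example AAT record in the text.
AAT_PREFIX = "http://vocab.getty.edu/aat/"


def aat_id_from_uri(uri):
    """Return the numeric AAT record number from a URI, or None if it does not match."""
    parsed = urlparse(uri)
    if parsed.netloc != "vocab.getty.edu" or not parsed.path.startswith("/aat/"):
        return None
    record_id = parsed.path[len("/aat/"):].strip("/")
    return record_id if record_id.isdigit() else None


def aat_uri_from_id(record_id):
    """Build the URI for a given AAT record number."""
    return f"{AAT_PREFIX}{record_id}"


record = aat_id_from_uri("http://vocab.getty.edu/aat/300028543")
print(record)
print(aat_uri_from_id(record))
```

Storing the record number alongside a label keeps the label free to change while the link to the authority record stays stable.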
Using a UID or UUID within a database or spreadsheet is important because it facilitates sorting and filtering information, as well as linking back to a specific entity within the database system.
METHODOLOGY (coming soon!)
Click here for SLIDESHOW (Coming soon!)
Click here for PDF (Coming soon!)
(Coming soon!) Documentation on the following topics:
Introduction
Assumptions
ARM WG Ontology
Linked Open Data
What makes a good data model?
Modeling Patterns
BENEFITS
Introduction
The benefits of modeling your data in Arches using the principles described in this documentation include the following:
Domain expertise
The CIDOC-CRM has been developed by cultural heritage domain experts for those working in the cultural heritage field. The CRM provides the framework to build conceptual models for archives, libraries, and museums. Working with the CRM will allow cultural heritage institutions to partner with other similar communities and to build information systems that support specialized research questions.
More information about the CRM can be found here.
Information Search and Retrieval
Structuring your data into a data model, or schema, will enhance the searchability and findability of information stored within it. The schema provides the outline for where and how the information is stored. The more detailed and precise the data models you build with the Arches Designer, the better the search results returned to those using your Arches instance. Linked Data is a graph structure that can enhance searchability by establishing semantic connections between resource elements.
Shareability and Interoperability
Linked Data and semantic standardization allow organizations to share their data with other institutions that structure their data following the same guidelines. Sharing data in this way enhances collaboration between organizations and lets multiple entities share resources and work.
Structured data sets will enhance interoperability and data usability for a wider variety of computer and software systems.
GLOSSARY
General Data Management Terms
- Structured data: data that is organized and formatted within a database or other such repository to enhance data searchability and to allow for more effective processing and analysis.
- Data cleaning: the process of ensuring consistency and accuracy of records stored within a table or database, such as spelling, date order, or consistent identification.
- Standards: the rules or documented agreements on the format, structure, representation, usage, etc. of the ways in which data are described or recorded. They are the best practices for how data and metadata should be described, formatted, or included in a data set.
- Controlled vocabulary: a set of standardized terms, thesauri, or subject headings that are preferred for a data set. It ensures consistency of vocabulary to control for spelling differences, homonyms, or name variations for a single defined entity.
- Data Model: An abstract, visual representation of data objects for a group or organization in order to structure a database. A good data model follows formalized standards, or best practices, in format and structure.
- Conceptual Model: A type of data model that establishes entities, their properties, and the relationship between each entity. The model is used to formalize entity relationships to represent the semantics of an organization.
- Ontology: An ontology defines the elements — entities and their properties — within a data structure. It formalizes the conceptual data model with a common controlled vocabulary, identifiers, and structure that allows for system interoperability of the database.
- Universal Unique Identifier (UUID): a 128-bit number used to reliably identify an object or entity within a database.
Arches Specific Terms
For a larger collection of Arches terminology, see the Arches Glossary here.
- Arches Designer: A user interface for facilitating database design, i.e. the creation of Resource Models. The Arches Designer consists of many different tools, such as the Graph Designer and Card Manager, each of which helps build a different facet of Resource Model creation.
- Reference Data Manager: The Reference Data Manager (RDM) is a core Arches module to create and maintain concept schemes (controlled vocabularies) or thesauri. It enables the creation and maintenance of controlled vocabularies for use in dropdowns and controlled fields within the various Arches Resource forms. For more information: [Reference Data Manager (RDM)](https://arches.readthedocs.io/en/stable/rdm/)
- ARM WG: The group established to provide consensus-based guidance on Arches Resource Models and constituent Branches, specifically on how to build and apply them. More information on the ARM WG [can be found here](https://www.archesproject.org/arm-wg/).
ADDITIONAL LINKS
Arches
CIDOC CRM
- CIDOC-CRM Primer
- CIDOC-CRM Tutorial
- CIDOC-CRM Video Overview with Stephen Stead
- Learning Ontology & CIDOC CRM
- CIDOC-CRM Functional Units
Other Examples
Contribution
We invite contributions to the ARM WG Documentation from any of our Arches Community Members. The documentation provided is a work-in-progress and would benefit from the experience of those who are also developing resource graphs for their own implementations. We hope to expand this documentation to cover the variety of use cases of an Arches implementation.
Connect with us at our GitHub Repository or email us at contact@archesproject.org.