Introduction

MD-Models is a markdown-based specification language for research data management.

It is designed to be easy to read and write, and to be converted to various programming languages and schema languages.

# Hello MD-Models

This is a simple markdown file that defines a model.

### Object

Enrich your objects with documentation and communicate intent to domain experts.

This is a simple object definition:

- string_attribute
    - type: string
    - description: A string attribute
- integer_attribute
    - type: integer
    - description: An integer attribute

Core Philosophy

The primary motivation behind MD-Models is to reduce cognitive overhead and maintenance burden by unifying documentation and structural definition into a single source of truth. Traditional approaches often require maintaining separate artifacts:

- Technical schemas (JSON Schema, XSD, ShEx, SHACL)
- Programming language implementations
- Documentation for domain experts
- API documentation

This separation frequently leads to documentation drift and increases the cognitive load on both developers and domain experts.

A Little Anecdote

When I began my journey in research data management, I was frequently overwhelmed by the intricate tools and standards in use. As a researcher suddenly thrown into a blend of software engineering, format creation, and data management, it felt like I was plunged into deep water without a safety net.

Data management, by its very nature, spans multiple disciplines and demands a thorough understanding of the domain, the data itself, and the available tools. Yet, even the most impressive tools lose their value if they don’t cater to the needs of domain experts. I came to realize that those experts are best positioned to define the structure and purpose of the data, but the overwhelming complexity of existing tools and standards often prevents their active participation.

MD-Models is my response to this challenge. It makes building structured data models easier by enabling domain experts to document the data’s intent and structure in a clear and manageable way. Markdown is an ideal choice for this task. It is simple to read and write, and it effectively communicates the necessary intent. Moreover, its semi-structured format allows for effortless conversion into various schema languages and programming languages, eliminating the need for excessive boilerplate code.

Quickstart

To get started with MD-Models, follow the steps below.

Installation

To install the command line tool, run the following command:

cargo install mdmodels

Writing your first MD-Models file

MD-Models files can be written in any editor that supports markdown. The following is a list of recommended editors:

We also provide a web editor at mdmodels.vercel.app that can be used to write and validate MD-Models files. Beyond syntax highlighting, this editor also offers:

- Live preview of the rendered MD-Models file
- Graph editor to visualize the relationships between objects
- Automatic validation of the MD-Models file
- Export to various schema languages and programming languages

Packages

The main Rust crate is compiled to Python and WebAssembly, so it can be used beyond the command line tool. These are the main packages:

Core Python Package: Install via pip:

# Mainly used to access the core functionality of the library
pip install mdmodels-core

Python Package: Install via pip:

# Provides in-memory data models, database support, LLM support, etc.
pip install mdmodels

NPM Package: Install via npm:

# Mainly used to access the core functionality of the library
npm install mdmodels-core

Examples

The following projects are examples of how to use MD-Models in practice:

Syntax

This section describes the syntax of MD-Models. It is intended to be used as a reference for the syntax and semantics of MD-Models.

Objects

Objects are the building blocks of your data structure. Think of them as containers for related information, similar to how a form organizes different fields of information about a single topic.

What is an Object?

An object is simply a named collection of properties. For example, a Person object might have properties like name, age, and address. In our system, objects are defined using a straightforward format that's easy to read and write, even if you're not a programmer.

How to Define an Object

You start an object by declaring its name using a level 3 heading (###) followed by the name of the object. In the example below, we define an object called Person.

### Person

This is an object definition.

Great! Now we have a named object. But what's next?

Object Properties

Objects can have properties, which define the specific data fields that belong to the object. Properties are defined using a structured list format with the following components:

The property name - starts with a dash (-) followed by the name
The property type - indicates what kind of data the property holds
Optional metadata - additional specifications like descriptions, constraints, or validation rules

Here's the basic structure:

### Person (schema:object)

- name
  - type: string
  - description: The name of the person

Let's break this down:

- name - The name of the property
- type: string - The type of the property, because we expect a name to be a string (e.g. "John Doe")
- description: The name of the person - A description of the property

The name of the property and its type are required. The description is optional, but it is good practice to add one. Later on we will see that a thorough description can be used to guide a large language model in extracting the information from a text.

By default, properties are optional. If you want to make a property required, you need to bold the property name using either __name__ or **name**. Replace name with the name of the property.
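To make the structure above concrete, here is a minimal sketch of how such a definition could be read programmatically. It is not the MD-Models parser: `parse_object` and the layout of the returned dictionary are hypothetical, and the sketch only covers the heading, property names, bold markers for required properties, and `key: value` options.

```python
import re

# Hypothetical helper: NOT the MD-Models parser, only an illustration.
def parse_object(markdown: str) -> dict:
    """Read one object definition into a plain dictionary."""
    obj = {"name": None, "term": None, "properties": []}
    for line in markdown.splitlines():
        # Level 3 heading: object name plus an optional "(annotation)".
        heading = re.match(r"^###\s+(\w+)(?:\s+\(([^)]+)\))?\s*$", line)
        if heading:
            obj["name"], obj["term"] = heading.group(1), heading.group(2)
            continue
        # Top-level list item: a property name, bold if required.
        prop = re.match(r"^-\s+(\S+)\s*$", line)
        if prop:
            bold = re.fullmatch(r"(__|\*\*)(.+)\1", prop.group(1))
            obj["properties"].append({
                "name": bold.group(2) if bold else prop.group(1),
                "required": bold is not None,
                "options": {},
            })
            continue
        # Indented list item: a "key: value" option of the last property.
        option = re.match(r"^\s+-\s+([^:]+):\s*(.*)$", line)
        if option and obj["properties"]:
            key = option.group(1).strip().lower()
            obj["properties"][-1]["options"][key] = option.group(2).strip()
    return obj


SAMPLE = """\
### Person (schema:object)

- __name__
  - type: string
  - description: The name of the person
- age
  - type: integer
"""

person = parse_object(SAMPLE)
required = [p["name"] for p in person["properties"] if p["required"]]
print(person["name"], required)
```

In this sketch, `name` is reported as required because it is written in bold, while `age` stays optional.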

Property Types

The data type of a property is very important and generally communicates what kind of data the property holds. Here is a list of the supported base types:

- string - A string of characters
- integer - A whole number
- float - A floating point number
- number - A numeric value (integer or float)
- boolean - A true or false value

Arrays

While these types are the building blocks, they fail to capture the full range of data types that can be used in a data model. For example, we need to be able to express that a property is an array/list of strings, or an array/list of numbers. This is where the array notation comes in.

We define an array of a given type by placing empty square brackets after the type. For example, an array of strings is written as string[], a notation inspired by TypeScript.

### Person (schema:object)

- an_array_of_strings
  - type: string[]
  - description: An array of strings
- an_array_of_numbers
  - type: number[]
  - description: An array of numbers
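The array notation is simple enough to interpret with a few lines of code. The following sketch splits a type string into its base type and an array flag. `parse_type` and the `BASE_TYPES` mapping to Python types are assumptions made for illustration (in particular, `number` is mapped to `float` for simplicity) and are not taken from the MD-Models code base.

```python
from typing import Tuple

# Assumed mapping from the base types listed above to Python types.
BASE_TYPES = {
    "string": str,
    "integer": int,
    "float": float,
    "number": float,  # simplification: integer or float
    "boolean": bool,
}


def parse_type(type_str: str) -> Tuple[str, bool]:
    """Split a type such as 'string[]' into (base type, is_array)."""
    type_str = type_str.strip()
    if type_str.endswith("[]"):
        return type_str[:-2], True
    return type_str, False


base, is_array = parse_type("string[]")
print(base, is_array, BASE_TYPES[base])
```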

Connecting Objects

Now we know how to define singular and array properties, but we often need to create relationships between objects in our data models. For example, a Person object might have an address property that references an Address object. This relationship is easily established by using another object's name as a property's type.

### Person

- name
  - type: string
- address
  - type: Address

### Address

- street
  - type: string
- city
  - type: string
- zip
  - type: string

This approach allows you to build complex, interconnected data models that accurately represent real-world relationships between entities. You can create both one-to-one relationships (like a person having one address) and one-to-many relationships (by using array notation).
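As a rough picture of what these relationships mean in code, the model above could be mirrored by the following Python dataclasses. This is a hand-written sketch, not the output of an MD-Models template. The `previous_addresses` property is hypothetical and was added only to show a one-to-many relationship via `Address[]`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Address:
    # Properties are optional by default, hence Optional with a None default.
    street: Optional[str] = None
    city: Optional[str] = None
    zip: Optional[str] = None


@dataclass
class Person:
    name: Optional[str] = None
    # One-to-one: "type: Address"
    address: Optional[Address] = None
    # One-to-many (hypothetical property): "type: Address[]"
    previous_addresses: List[Address] = field(default_factory=list)


person = Person(name="John Doe", address=Address(city="Berlin"))
person.previous_addresses.append(Address(city="Hamburg"))
print(person.address.city, len(person.previous_addresses))
```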

Property Options

When defining properties in your data model, you can apply various options to control their behavior, validation, and representation. These options are defined using the - option: value syntax. In the following sections, we will look at the different options that are available.

General Options

Option	Description	Example
`description`	Provides a description for the property	`- description: "The name of the person"`
`example`	Provides an example value for the property	`- example: "John Doe"`

JSON Schema Validation Options

These options map to standard JSON Schema validation constraints, allowing you to enforce data integrity and validation rules in your models. When you use these options, they will be translated into corresponding JSON Schema properties during schema generation, ensuring that your data adheres to the specified constraints. This provides a standardized way to validate data across different systems and implementations that support JSON Schema.

Option	Description	Example
`minimum`	Specifies the minimum value for a numeric property	`- minimum: 0`
`maximum`	Specifies the maximum value for a numeric property	`- maximum: 100`
`minitems`	Specifies the minimum number of items for an array property	`- minitems: 1`
`maxitems`	Specifies the maximum number of items for an array property	`- maxitems: 10`
`minlength`	Specifies the minimum length for a string property	`- minlength: 3`
`maxlength`	Specifies the maximum length for a string property	`- maxlength: 50`
`pattern` or `regex`	Specifies a regular expression pattern that a string property must match	`- pattern: "^[a-zA-Z0-9]+$"`
`unique`	Specifies whether array items must be unique	`- unique: true`
`multipleof`	Specifies that a numeric value must be a multiple of this number	`- multipleof: 5`
`exclusiveminimum`	Specifies an exclusive minimum value for a numeric property	`- exclusiveminimum: 0`
`exclusivemaximum`	Specifies an exclusive maximum value for a numeric property	`- exclusivemaximum: 100`
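To illustrate the translation step, the sketch below maps the option names from the table to the corresponding standard JSON Schema keywords. The keyword names on the right are part of JSON Schema. The lookup table and the `to_json_schema_property` helper are assumptions about how such a translation could work, not the actual MD-Models implementation.

```python
# Assumed mapping from MD-Models option names to JSON Schema keywords.
JSON_SCHEMA_KEYWORDS = {
    "minimum": "minimum",
    "maximum": "maximum",
    "minitems": "minItems",
    "maxitems": "maxItems",
    "minlength": "minLength",
    "maxlength": "maxLength",
    "pattern": "pattern",
    "regex": "pattern",
    "unique": "uniqueItems",
    "multipleof": "multipleOf",
    "exclusiveminimum": "exclusiveMinimum",
    "exclusivemaximum": "exclusiveMaximum",
}


def to_json_schema_property(json_type: str, options: dict) -> dict:
    """Build a JSON Schema fragment for one property (hypothetical helper)."""
    schema = {"type": json_type}
    for key, value in options.items():
        keyword = JSON_SCHEMA_KEYWORDS.get(key.lower())
        if keyword is not None:  # options without a JSON Schema keyword are skipped
            schema[keyword] = value
    return schema


age_schema = to_json_schema_property("integer", {"minimum": 0, "maximum": 100})
print(age_schema)
```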

Format Options

The following options are used to define how the property should be represented in different formats.

Option	Description	Example
`xml`	Specifies that the property should be represented in XML format	`- xml: someName`

A note on the `xml` option

The xml option has multiple effects:

Element will be set as an element in the XML Schema.
@Name will be set as an attribute in the XML Schema.
someWrapper/Element will wrap the element in a parent element called someWrapper.
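The three forms can be told apart mechanically. The following sketch classifies an `xml` option value according to the rules above. `describe_xml_option` and its return format are hypothetical and only serve to illustrate the notation.

```python
def describe_xml_option(value: str) -> dict:
    """Classify an xml option value (hypothetical helper, illustration only)."""
    value = value.strip()
    wrappers = []
    if "/" in value:
        # "someWrapper/Element": everything before the last "/" is a wrapper.
        *wrappers, value = value.split("/")
    if value.startswith("@"):
        # "@Name": the property becomes an attribute.
        return {"kind": "attribute", "name": value[1:], "wrapped_in": wrappers}
    # "Element": the property becomes an element.
    return {"kind": "element", "name": value, "wrapped_in": wrappers}


for example in ("Element", "@Name", "someWrapper/Element"):
    print(example, "->", describe_xml_option(example))
```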

Semantic Options

The following options are used to define semantic annotations. Read more about semantic annotations in the Semantics section.

Option	Description	Example
`term`	Specifies the term for the property in the ontology	`- term: schema:name`

SQL Database Options

Database options allow you to specify how properties should be represented in relational database systems. MD-Models supports the following options:

Option	Description	Example
`pk`	Indicates whether the property is a primary key in a database	`- primary key: true`

LinkML Specific Options

Options specific to the LinkML specification:

Option	Description	Example
`readonly`	Indicates whether the property is read-only	`- readonly: true`
`recommended`	Indicates whether the property is recommended	`- recommended: true`

Custom Options

You can also define custom options that aren't covered by the predefined ones:

- name
  - MyKey: my value

Example Usage

Here's how you might use these options in a data model:

### Person (schema:object)

- id
  - type: string
  - primary key: true
  - description: The unique identifier for the person
- name
  - type: string
  - description: The name of the person
  - example: "John Doe"
- age
  - type: integer
  - description: The age of the person
  - minimum: 0

These options help to define constraints, provide validation rules, and give hints to code generators about how properties should be treated in the resulting applications and schemas.

Enumerations

Sometimes you want to restrict the values that can be assigned to a property. For example, you might want to restrict the categories of a product to a set of predefined values. A product might be of category book, movie, music, or other. This is where enumerations come in.

Defining an enumeration

To define an enumeration, we start the same as we do for any other type, by using a level 3 heading (###) and then the name of the type.

### ProductCategory

BOOK = "book"
MOVIE = "movie"
MUSIC = "music"
OTHER = "other"

We are defining a key and a value here, where the value is the actual value of the enumeration and the key is an identifier. This is required because, when we want to reuse the enumeration in a programming language, we need to be able to refer to it by a key. For instance, in Python we can pass an enumeration via the following code:

from model import ProductCategory, Product

product = Product(
    name="Inception",
    category=ProductCategory.MOVIE
)

print(product)

{
    "name": "Inception",
    "category": "movie"
}
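The `model` module imported above stands for generated code. As a self-contained approximation, the key/value pairs of the enumeration could map onto a standard Python `Enum` as sketched below. The class layout and the `to_dict` method are assumptions for illustration, not the exact code MD-Models generates.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProductCategory(Enum):
    # Key (identifier used in code) = value (what ends up in the data)
    BOOK = "book"
    MOVIE = "movie"
    MUSIC = "music"
    OTHER = "other"


@dataclass
class Product:
    name: Optional[str] = None
    category: Optional[ProductCategory] = None

    def to_dict(self) -> dict:
        """Serialize the enumeration member by its value (hypothetical helper)."""
        return {
            "name": self.name,
            "category": self.category.value if self.category else None,
        }


product = Product(name="Inception", category=ProductCategory.MOVIE)
serialized = json.dumps(product.to_dict(), indent=4)
print(serialized)
```

Referring to `ProductCategory.MOVIE` in code while storing `"movie"` in the data is the reason both a key and a value are defined.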

Similar to how we can use an object as a type for a property, we can also use an enumeration as a type for a property:

### Product

- name
  - type: string
- category
  - type: ProductCategory

Descriptions

This section further highlights the usage of descriptions in MD-Models. Since we are using markdown, we can enrich our data model with any additional information that we want to add. This not only includes text, but also links and images.

Text

To add a text description to an object, we can use the following syntax:

### Product

A product is a physical or digital item that can be bought or sold.

- name
  - type: string
  - description: The name of the product

Links

To add a link to an object, we can use the following syntax:

### Product

[Additional information](https://www.google.com)

- name
  - type: string
  - description: The name of the product

Images

To add an image to an object, we can use the following syntax:

### Product

![Product image](https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png)

- name
  - type: string
  - description: The name of the product

Please note that tables can be used within object definitions, but they can lead to parsing errors in some circumstances. It is therefore recommended to only use tables in sections.

Sections

Since objects and enumerations can get quite complex, we can use sections to group related information together. The level 2 heading (##) can be used to create a new section:

## Store-related information

This section contains information about the store.

### Product

[...]

### Customer

[...]

## Sales-related information

This section contains information about the sales.

### Order

[...]

### Invoice

[...]

Within these sections, you can add any of the previously mentioned elements, including tables. This is very useful to breathe life into your data model and communicate intent and additional information. Treat this as the non-technical part you would usually add in an additional document. It should be noted that the parsers will ignore these sections, so they will not be included in the generated code.
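If you want to recover the grouping yourself, for example to build a table of contents, the heading levels are enough. The following is a minimal sketch that is independent of MD-Models. `objects_by_section` is a hypothetical helper that lists the level 3 headings found under each level 2 heading.

```python
def objects_by_section(markdown: str) -> dict:
    """Group object names (### headings) by their section (## headings)."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("### ") and current is not None:
            # Drop an optional "(annotation)" after the object name.
            sections[current].append(line[4:].split("(")[0].strip())
    return sections


DOCUMENT = """\
## Store-related information

### Product

### Customer

## Sales-related information

### Order

### Invoice
"""

grouped = objects_by_section(DOCUMENT)
print(grouped)
```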

Best Practices

- Use sections to group related information together.
- Use links to reference external sources.
- Use images to visually represent complex concepts.
- Use tables to represent concepts that are better understood in a table format.

Semantics

MD-Models supports a variety of semantic annotations to help you add meaning to your data model. Most commonly, you want to annotate objects and properties with a semantic type to allow for better interoperability and discoverability. For this, ontologies are used:

Ontologies

Ontologies are a way to add semantic meaning to your data model. They are a collection of concepts and relationships between them and are specific to the domain of your data model. For instance, the schema.org ontology is a collection of concepts and relationships that span across many domains. This is very useful when you want to connect to other data models that employ similar concepts, but use different names for them.

Typically these relations are defined as triples, consisting of a subject, predicate and object. For instance, the statement "John is a person" can be represented as the triple (John, is a, person). The first element of the triple is the subject, the second is the predicate and the third is the object.

With MD-Models, the "is a" predicate is expressed as an annotation on an object definition, whereas other predicates are expressed as annotations on property definitions.

How to annotate objects

Objects are annotated at the level 3 heading of the object definition. The annotation follows the object name, separated by a whitespace, and is enclosed in parentheses. Typically, these annotations are expressed as a URI that points to the definition of the concept in the ontology. Since full URIs are verbose, they can be shortened by using a prefix. We will be using the schema prefix in the following examples. More on how to use prefixes can be found in the Preamble section.

We want to express - "A Product is a schema:Product".

### Product (schema:Product)

- name
  - type: string

How to annotate properties

Properties are annotated using an option, as defined in the Property Options section. We utilize the keyword term to add a semantic type to the property. Properties can function in one of two ways:

- If the type of the property is a primitive type, the term option describes an "is a" relationship and thus plays the role of the object in the triple.
- If the type of the property is an object or an array of objects, the term option describes the relationship (predicate) between the subject (the object that owns the property) and the object of the triple (the property's type).

Object-valued properties

We want to express - "A Product is ordered by a Person".

### Product

- orders
  - type: Person[]
  - term: schema:orderedBy

The annotation effectively describes the relationship between the orders property and the Person type. Given that a Person is also annotated with a term, one can then build a Knowledge Graph that connects the orders property to the Person type in a semantically rich way, which can be used for a variety of purposes, such as semantic search and discovery.

Primitive-valued properties

We want to express - "The name of a Product is a schema:name".

### Product

- name
  - type: string
  - term: schema:name

Naturally, since the name property is part of the Product object, it builds the relationship "A Product has a name". In terms of triples, this is represented as (Product, has, name).

Once these annotations are defined, they are automatically added to the generated code and schemas, if supported. Semantic annotations are currently supported in the following language templates:

python-dataclass (JSON-LD)
python-pydantic (JSON-LD)
typescript (JSON-LD)
shacl (Shapes Constraint Language)
shex (Shape Expressions)
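For the JSON-LD based templates, the annotations end up as `@context` and `@type` entries. The sketch below assembles such a document by hand for the Product example. `to_json_ld` is a hypothetical helper, and the exact layout produced by the MD-Models templates may differ.

```python
import json

# Prefix map as it would be declared for the model (assumed for this example).
PREFIXES = {"schema": "http://schema.org/"}


def to_json_ld(object_term: str, property_terms: dict, values: dict) -> dict:
    """Combine semantic annotations and data into a JSON-LD document."""
    context = dict(PREFIXES)
    context.update(property_terms)  # e.g. "name" -> "schema:name"
    return {"@context": context, "@type": object_term, **values}


doc = to_json_ld(
    object_term="schema:Product",            # from "### Product (schema:Product)"
    property_terms={"name": "schema:name"},  # from "- term: schema:name"
    values={"name": "Inception"},
)
print(json.dumps(doc, indent=2))
```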

Preamble

The preamble is the first section of your data model. It is used to provide metadata about the data model, such as the name, version, and author.

---
id: my-data-model
prefix: md
repo: http://mdmodel.net/
prefixes:
  schema: http://schema.org/
nsmap:
  tst: http://example.com/test/
imports:
  common.md: common.md
---

Frontmatter Keys

The frontmatter section of your MD-Models document supports several configuration keys that control how your data model is processed and interpreted. Here's a detailed explanation of each available key:

`id`

Type: String (Optional)
Description: A unique identifier for your data model. This can be used to reference your model from other models or systems.
Example: id: my-data-model

`prefixes`

Type: Map of String to String (Optional)
Description: Defines namespace prefixes that can be used throughout your model to reference external vocabularies or schemas. This is particularly useful for semantic annotations.

Example:

prefixes:
  schema: http://schema.org/
  foaf: http://xmlns.com/foaf/0.1/
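The effect of a prefix map is that a short term such as schema:name stands for a full URI. A minimal sketch of that expansion follows. `expand_term` is a hypothetical helper written for this explanation, not a function of the MD-Models packages.

```python
def expand_term(term: str, prefixes: dict) -> str:
    """Expand 'prefix:local' into a full URI if the prefix is known."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return term  # unknown prefix or already a full URI: leave untouched


prefixes = {
    "schema": "http://schema.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

print(expand_term("schema:name", prefixes))
print(expand_term("foaf:Person", prefixes))
```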

`nsmap`

Type: Map of String to String (Optional)
Description: Similar to prefixes, defines namespace mappings that can be used in your model. This is often used for XML-based formats or when integrating with systems that use namespaces.

Example:

nsmap:
  tst: http://example.com/test/
  ex: http://example.org/

`repo`

Type: String
Default: http://mdmodel.net/
Description: Specifies the base repository URL for your model. This can be used to generate absolute URIs for your model elements.
Example: repo: https://github.com/myorg/myrepo/

`prefix`

Type: String
Default: md
Description: Defines the default prefix to use for your model elements when generating URIs or qualified names.
Example: prefix: mymodel

`imports`

Type: Map of String to String
Default: Empty map
Description: Specifies other models to import into your current model. The key is the alias or name to use for the import, and the value is the location of the model to import. The location can be either a local file path or a remote URL.

Example:

imports:
  common: common.md
  external: https://example.com/models/external.md

Import Types

The imports key supports two types of imports:

Local Imports: References to local files on your filesystem
```
imports:
  common: ./common/base.md
```
Remote Imports: References to models hosted on remote servers (URLs)
```
imports:
  external: https://example.com/models/external.md
```

When importing models, the definitions from the imported models become available in your current model, allowing you to reference and extend them. This is useful for creating modular and reusable data models.
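The distinction between the two import types can be made from the location string alone. The following sketch uses the URL scheme to decide. `import_kind` is a hypothetical helper, and treating only http and https as remote is an assumption of this example.

```python
from urllib.parse import urlparse


def import_kind(location: str) -> str:
    """Return 'remote' for http(s) URLs and 'local' for everything else."""
    scheme = urlparse(location).scheme
    return "remote" if scheme in ("http", "https") else "local"


imports = {
    "common": "./common/base.md",
    "external": "https://example.com/models/external.md",
}

for alias, location in imports.items():
    print(alias, "->", import_kind(location))
```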

Full example

The following is a full example of an MD-Models file that defines a data model for a research publication.

---
id: research-publication
prefix: rpub
prefixes:
  schema: https://schema.org/
---

### ResearchPublication (schema:Publication)

This model represents a scientific publication with its core metadata, authors, 
and citations.

- __doi__
  - Type: Identifier
  - Term: schema:identifier
  - Description: Digital Object Identifier for the publication
  - XML: @doi
- title
  - Type: string
  - Term: schema:name
  - Description: The main title of the publication
- authors
  - Type: [Author](#author)[]
  - Term: schema:authored
  - Description: List of authors who contributed to the publication
- publication_year
  - Type: integer
  - Term: schema:datePublished
  - Description: Year when the publication was published
  - Minimum: 1900
  - Maximum: 2100
- citations
  - Type: integer
  - Term: schema:citation
  - Description: Number of times this publication has been cited
  - Default: 0


### Author (schema:Person)

The `Author` object is a simple object that has a name and an email address.

- __name__
  - Type: string
  - Term: schema:name
  - Description: The name of the author
- __email__
  - Type: string
  - Term: schema:email
  - Description: The email address of the author

Best practices

Use Descriptive Names
- Object names should be PascalCase (e.g., ResearchPublication)
- Attribute names should be in snake_case (e.g., publication_year)
- Use clear, domain-specific terminology
Identifiers
- Mark required identifier fields with double underscores (e.g., __doi__)
- Choose meaningful identifier fields
Documentation
- Always include object descriptions
- Document complex attributes
- Explain any constraints or business rules
Semantic Mapping
- Use standard vocabularies when possible
- Define custom terms in your prefix map
- Maintain consistent terminology
Validation Rules
- Include range constraints for numbers
- Specify default values when appropriate
- Document any special validation requirements
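The naming conventions above are easy to check automatically. The sketch below is one possible linting helper. `check_names` and the two regular expressions are assumptions of this example and are not part of the MD-Models validation.

```python
import re

PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_names(object_name: str, attribute_names: list) -> list:
    """Return a list of naming problems (empty if everything is fine)."""
    problems = []
    if not PASCAL_CASE.match(object_name):
        problems.append(f"object '{object_name}' should be PascalCase")
    for attr in attribute_names:
        if not SNAKE_CASE.match(attr):
            problems.append(f"attribute '{attr}' should be snake_case")
    return problems


print(check_names("ResearchPublication", ["doi", "publication_year"]))
print(check_names("research_publication", ["publicationYear"]))
```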

Common Patterns

Array Types

- tags
  - Type: string[]
  - Description: List of keywords describing the publication

Object References

- main_author
  - Type: Author
  - Description: The primary author of the publication

Required Fields

- __id__
  - Type: Identifier
  - Description: Unique identifier for the object

Remember that MD-Models aims to balance human readability with technical precision. Your object definitions should be clear enough for domain experts to understand while maintaining the structure needed for technical implementation.