Introduction

MD-Models is a markdown-based specification language for research data management.

It is designed to be easy to read and write, and to be converted to various programming languages and schema languages.

    # Hello MD-Models

    This is a simple markdown file that defines a model.

    ### Object

    Enrich your objects with documentation and communicate intent to domain experts. This is a simple object definition:

    - string_attribute
      - type: string
      - description: A string attribute
    - integer_attribute
      - type: integer
      - description: An integer attribute

Core Philosophy

The primary motivation behind MD-Models is to reduce cognitive overhead and maintenance burden by unifying documentation and structural definition into a single source of truth. Traditional approaches often require maintaining separate artifacts:

  1. Technical schemas (JSON Schema, XSD, ShEx, SHACL)
  2. Programming language implementations
  3. Documentation for domain experts
  4. API documentation

This separation frequently leads to documentation drift and increases the cognitive load on both developers and domain experts.

A Little Anecdote

When I began my journey in research data management, I was frequently overwhelmed by the intricate tools and standards in use. As a researcher suddenly thrown into a blend of software engineering, format creation, and data management, it felt like I was plunged into deep water without a safety net.

Data management, by its very nature, spans multiple disciplines and demands a thorough understanding of the domain, the data itself, and the available tools. Yet, even the most impressive tools lose their value if they don’t cater to the needs of domain experts. I came to realize that those experts are best positioned to define the structure and purpose of the data, but the overwhelming complexity of existing tools and standards often prevents their active participation.

MD-Models is my response to this challenge. It makes building structured data models easier by enabling domain experts to document the data’s intent and structure in a clear and manageable way. Markdown is an ideal choice for this task. It is simple to read and write, and it effectively communicates the necessary intent. Moreover, its semi-structured format allows for effortless conversion into various schema languages and programming languages, eliminating the need for excessive boilerplate code.

Quickstart

To get started with MD-Models, follow the steps below.

Installation

To install the command line tool, use the following command:

cargo install mdmodels

Writing your first MD-Models file

MD-Models files can be written in any editor that supports markdown. The following is a list of recommended editors:

We also provide a web editor at mdmodels.vercel.app that can be used to write and validate MD-Models files. In addition to syntax highlighting, this editor offers:

  • Live preview of the rendered MD-Models file
  • Graph editor to visualize the relationships between objects
  • Automatic validation of the MD-Models file
  • Export to various schema languages and programming languages

Packages

The main Rust crate is also compiled to Python and WebAssembly, so it can be used beyond the command line tool. These are the main packages:

  • Core Python Package: Install via pip:

    # Mainly used to access the core functionality of the library
    pip install mdmodels-core
  • Python Package: Install via pip:

    # Provides in-memory data models, database support, LLM support, etc.
    pip install mdmodels
  • NPM Package: Install via npm:

    # Mainly used to access the core functionality of the library
    npm install mdmodels-core

Examples

The following projects are examples of how to use MD-Models in practice:

Syntax

This section describes the syntax of MD-Models. It is intended to be used as a reference for the syntax and semantics of MD-Models.

Objects

Objects are the building blocks of your data structure. Think of them as containers for related information, similar to how a form organizes different fields of information about a single topic.

What is an Object?

An object is simply a named collection of properties. For example, a Person object might have properties like name, age, and address. In our system, objects are defined using a straightforward format that's easy to read and write, even if you're not a programmer.

How to Define an Object

You start an object by declaring its name with a level 3 heading (###) followed by the name of the object. In the example below, we define an object called Person.

    ### Person

    This is an object definition.

Great! Now we have a named object. But what's next?

Object Properties

Objects can have properties, which define the specific data fields that belong to the object. Properties are defined using a structured list format with the following components:

  1. The property name - starts with a dash (-) followed by the name
  2. The property type - indicates what kind of data the property holds
  3. Optional metadata - additional specifications like descriptions, constraints, or validation rules

Here's the basic structure:

    ### Person (schema:object)

    - name
      - type: string
      - description: The name of the person

Let's break this down:

  • - name - The name of the property
  • - type: string - The type of the property, because we expect a name to be a string (e.g. "John Doe")
  • - description: The name of the person - A description of the property

The name of the property and its type are required. The description is optional, but it is good practice to add one. Later on we will see that a thorough description can be used to guide a large language model to extract the information from a text.

By default, properties are optional. If you want to make a property required, you need to bold the property name using either __name__ or **name**. Replace name with the name of the property.
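To make the rule concrete, here is a minimal Python sketch of how a reader of an MD-Models file could detect a required property from the way its name is written. The function parse_property_name is hypothetical and exists only for illustration; it is not the MD-Models parser or part of its API.

```python
def parse_property_name(raw: str) -> tuple[str, bool]:
    """Return (name, required) for a property name as written in the list item.

    Hypothetical helper: a name wrapped in __...__ or **...** counts as required.
    """
    name = raw.strip()
    for marker in ("__", "**"):
        # Only treat the name as bold if the same marker wraps both sides.
        wrapped = name.startswith(marker) and name.endswith(marker)
        if wrapped and len(name) > 2 * len(marker):
            return name[len(marker):-len(marker)], True
    return name, False


print(parse_property_name("__name__"))  # ('name', True)
print(parse_property_name("**age**"))   # ('age', True)
print(parse_property_name("address"))   # ('address', False)
```

Both bold notations lead to the same result, so you can pick whichever reads better in your editor.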

Property Types

The data type of a property is very important and generally communicates what kind of data the property holds. Here is a list of the supported base types:

  • string - A string of characters
  • integer - A whole number
  • float - A floating point number
  • number - A numeric value (integer or float)
  • boolean - A true or false value

Arrays

While these types are the building blocks, they fail to capture the full range of data types that can be used in a data model. For example, we need to be able to express that a property is an array/list of strings, or an array/list of numbers. This is where the array notation comes in.

We define an array of a given type by placing empty square brackets after the type. For example, an array of strings is written as string[] (a notation inspired by TypeScript).

    ### Person (schema:object)

    - an_array_of_strings
      - type: string[]
      - description: An array of strings
    - an_array_of_numbers
      - type: number[]
      - description: An array of numbers
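As a rough illustration of the notation, the following Python sketch splits a type string into its element type and an array flag, and checks whether the element type is one of the base types listed above. The names BASE_TYPES and parse_type are assumptions made for this example and do not come from the MD-Models library.

```python
# Base types as listed in this section.
BASE_TYPES = {"string", "integer", "float", "number", "boolean"}


def parse_type(raw: str) -> tuple[str, bool, bool]:
    """Return (element_type, is_array, is_base_type) for a type string.

    Hypothetical helper: a trailing "[]" marks an array of the given type.
    """
    dtype = raw.strip()
    is_array = dtype.endswith("[]")
    if is_array:
        dtype = dtype[:-2]
    return dtype, is_array, dtype in BASE_TYPES


print(parse_type("string[]"))  # ('string', True, True)
print(parse_type("number"))    # ('number', False, True)
print(parse_type("Address"))   # ('Address', False, False)
```

A type that is not a base type, such as Address, is treated here as the name of another object.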

Connecting Objects

Now we know how to define singular and array properties, but we often need to create relationships between objects in our data models. For example, a Person object might have an address property that references an Address object. This relationship is easily established by using another object's name as a property's type.

    ### Person

    - name
      - type: string
    - address
      - type: Address

    ### Address

    - street
      - type: string
    - city
      - type: string
    - zip
      - type: string

This approach allows you to build complex, interconnected data models that accurately represent real-world relationships between entities. You can create both one-to-one relationships (like a person having one address) and one-to-many relationships (by using array notation).
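If you think in code, the Person and Address objects above roughly correspond to the hand-written Python dataclasses below. This is only a sketch under assumptions: it is not the output of an MD-Models code generator, optional properties are modelled as None by default, and previous_addresses is a hypothetical extra property added to show a one-to-many relationship.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Address:
    street: Optional[str] = None
    city: Optional[str] = None
    zip: Optional[str] = None


@dataclass
class Person:
    name: Optional[str] = None
    # One-to-one: "type: Address"
    address: Optional[Address] = None
    # One-to-many: "type: Address[]" (hypothetical property, not in the model above)
    previous_addresses: List[Address] = field(default_factory=list)


person = Person(name="John Doe", address=Address(city="Berlin"))
print(person.address.city)        # Berlin
print(person.previous_addresses)  # []
```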

Property Options

When defining properties in your data model, you can apply various options to control their behavior, validation, and representation. These options are defined using the - option: value syntax. In the following sections, we will look at the different options that are available.

General Options

| Option | Description | Example |
|---|---|---|
| description | Provides a description for the property | - description: The name of the person |
| example | Provides an example value for the property | - example: "John Doe" |

JSON Schema Validation Options

These options map to standard JSON Schema validation constraints, allowing you to enforce data integrity and validation rules in your models. When you use these options, they will be translated into corresponding JSON Schema properties during schema generation, ensuring that your data adheres to the specified constraints. This provides a standardized way to validate data across different systems and implementations that support JSON Schema.

| Option | Description | Example |
|---|---|---|
| minimum | Specifies the minimum value for a numeric property | - minimum: 0 |
| maximum | Specifies the maximum value for a numeric property | - maximum: 100 |
| minitems | Specifies the minimum number of items for an array property | - minitems: 1 |
| maxitems | Specifies the maximum number of items for an array property | - maxitems: 10 |
| minlength | Specifies the minimum length for a string property | - minlength: 3 |
| maxlength | Specifies the maximum length for a string property | - maxlength: 50 |
| pattern or regex | Specifies a regular expression pattern that a string property must match | - pattern: "^[a-zA-Z0-9]+$" |
| unique | Specifies whether array items must be unique | - unique: true |
| multipleof | Specifies that a numeric value must be a multiple of this number | - multipleof: 5 |
| exclusiveminimum | Specifies an exclusive minimum value for a numeric property | - exclusiveminimum: 0 |
| exclusivemaximum | Specifies an exclusive maximum value for a numeric property | - exclusivemaximum: 100 |
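To give a feeling for the translation, the sketch below maps the option names from the table to the corresponding standard JSON Schema keywords and builds a schema fragment for a single property. The keyword names are those of JSON Schema itself; the mapping table and the to_json_schema function are assumptions written for this example and may differ from what MD-Models actually emits.

```python
# MD-Models option name -> standard JSON Schema keyword (assumed mapping).
OPTION_TO_KEYWORD = {
    "minimum": "minimum",
    "maximum": "maximum",
    "minitems": "minItems",
    "maxitems": "maxItems",
    "minlength": "minLength",
    "maxlength": "maxLength",
    "pattern": "pattern",
    "regex": "pattern",
    "unique": "uniqueItems",
    "multipleof": "multipleOf",
    "exclusiveminimum": "exclusiveMinimum",
    "exclusivemaximum": "exclusiveMaximum",
}


def to_json_schema(json_type: str, options: dict) -> dict:
    """Build a JSON Schema fragment for one property (hypothetical helper)."""
    schema = {"type": json_type}
    for option, value in options.items():
        keyword = OPTION_TO_KEYWORD.get(option.lower())
        if keyword is not None:  # options without a JSON Schema keyword are skipped
            schema[keyword] = value
    return schema


age_schema = to_json_schema("integer", {"minimum": 0, "maximum": 100})
print(age_schema)  # {'type': 'integer', 'minimum': 0, 'maximum': 100}
```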

Format Options

The following options are used to define how the property should be represented in different formats.

| Option | Description | Example |
|---|---|---|
| xml | Specifies that the property should be represented in XML format | - xml: someName |

A note on the xml option

The xml option has multiple effects:

  • Element will be set as an element in the XML Schema.
  • @Name will be set as an attribute in the XML Schema.
  • someWrapper/Element will wrap the element in a parent element called someWrapper.
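The three cases can be summarised in a few lines of Python. The function interpret_xml_option below is a hypothetical helper that only classifies the option value according to the rules listed above; it does not generate XML Schema and is not part of MD-Models.

```python
def interpret_xml_option(value: str) -> dict:
    """Classify the value of an xml option (illustrative sketch only)."""
    value = value.strip()
    if value.startswith("@"):
        # "@Name" -> attribute called "Name"
        return {"kind": "attribute", "name": value[1:], "wrapper": None}
    if "/" in value:
        # "someWrapper/Element" -> element "Element" inside parent "someWrapper"
        wrapper, name = value.split("/", 1)
        return {"kind": "element", "name": name, "wrapper": wrapper}
    # "Element" -> plain element
    return {"kind": "element", "name": value, "wrapper": None}


print(interpret_xml_option("@doi"))
# {'kind': 'attribute', 'name': 'doi', 'wrapper': None}
print(interpret_xml_option("someWrapper/Element"))
# {'kind': 'element', 'name': 'Element', 'wrapper': 'someWrapper'}
```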

Semantic Options

The following options are used to define semantic annotations. Read more about semantic annotations in the Semantics section.

| Option | Description | Example |
|---|---|---|
| term | Specifies the term for the property in the ontology | - term: schema:name |

SQL Database Options

Database options allow you to specify how properties should be represented in relational database systems. MD-Models supports the following options:

| Option | Description | Example |
|---|---|---|
| pk | Indicates whether the property is a primary key in a database | - primary key: true |

LinkML Specific Options

Options specific to the LinkML specification:

| Option | Description | Example |
|---|---|---|
| readonly | Indicates whether the property is read-only | - readonly: true |
| recommended | Indicates whether the property is recommended | - recommended: true |

Custom Options

You can also define custom options that aren't covered by the predefined ones:

    - name
      - MyKey: my value

Example Usage

Here's how you might use these options in a data model:

    ### Person (schema:object)

    - id
      - type: string
      - primary key: true
      - description: The unique identifier for the person
    - name
      - type: string
      - description: The name of the person
      - example: "John Doe"
    - age
      - type: integer
      - description: The age of the person
      - minimum: 0

These options help to define constraints, provide validation rules, and give hints to code generators about how properties should be treated in the resulting applications and schemas.

Enumerations

Sometimes you want to restrict the values that can be assigned to a property. For example, you might want to restrict the categories of a product to a set of predefined values. A product might be of category book, movie, music, or other. This is where enumerations come in.

Defining an enumeration

To define an enumeration, we start the same as we do for any other type, by using a level 3 heading (###) and then the name of the type.

    ### ProductCategory

    BOOK = "book"
    MOVIE = "movie"
    MUSIC = "music"
    OTHER = "other"

We are defining a key and value here, where the value is the actual value of the enumeration and the key is an identifier. This is required because, when we want to re-use the enumeration in a programming language, we need to be able to refer to it by a key. For instance, in Python we can pass an enumeration via the following code:

    from model import ProductCategory, Product

    product = Product(
        name="Inception",
        category=ProductCategory.MOVIE
    )

    print(product)

    {
      "name": "Inception",
      "category": "movie"
    }
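The model module imported in the snippet above is not shown in this guide. As a self-contained stand-in, the sketch below defines ProductCategory and Product by hand with the Python standard library. It is an assumption about what such a module could look like, not the code MD-Models generates; the to_dict helper is hypothetical as well.

```python
import json
from dataclasses import dataclass
from enum import Enum


class ProductCategory(Enum):
    # key = value, exactly as in the enumeration definition
    BOOK = "book"
    MOVIE = "movie"
    MUSIC = "music"
    OTHER = "other"


@dataclass
class Product:
    name: str
    category: ProductCategory


def to_dict(product: Product) -> dict:
    """Serialise a product, replacing the enum key by its value."""
    return {"name": product.name, "category": product.category.value}


product = Product(name="Inception", category=ProductCategory.MOVIE)
print(json.dumps(to_dict(product)))  # {"name": "Inception", "category": "movie"}
```

The code refers to the key (MOVIE), while the serialised data carries the value ("movie").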

Similar to how we can use an object as a type for a property, we can also use an enumeration as a type for a property:

### Product - name - type: string - category - type: ProductCategory

Descriptions

This section further highlights the usage of descriptions in MD-Models. Since we are using markdown, we can enrich our data model with any additional information that we want to add. This not only includes text, but also links and images.

Text

To add a text description to an object, we can use the following syntax:

    ### Product

    A product is a physical or digital item that can be bought or sold.

    - name
      - type: string
      - description: The name of the product

To add a link to an object, we can use the following syntax:

    ### Product

    [Additional information](https://www.google.com)

    - name
      - type: string
      - description: The name of the product

Images

To add an image to an object, we can use the following syntax:

    ### Product

    ![Product image](https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png)

    - name
      - type: string
      - description: The name of the product

Please note that tables can be used within object definitions, but under some circumstances they can lead to parsing errors. It is therefore recommended to only use tables in sections.

Sections

Since objects and enumerations can get quite complex, we can use sections to group related information together. The level 2 heading (##) can be used to create a new section:

    ## Store-related information

    This section contains information about the store.

    ### Product

    [...]

    ### Customer

    [...]

    ## Sales-related information

    This section contains information about the sales.

    ### Order

    [...]

    ### Invoice

    [...]

Within these sections, you can add any of the previously mentioned elements, including tables. This is very useful to breathe life into your data model and communicate intent and additional information. Treat this as the non-technical part you would usually add in an additional document. Note that the parsers ignore these sections, so they are not included in the generated code.

Best Practices

  • Use sections to group related information together.
  • Use links to reference external sources.
  • Use images to visually represent complex concepts.
  • Use tables to represent concepts that are better understood in a table format.

Semantics

MD-Models supports a variety of semantic annotations to help you add meaning to your data model. Most commonly, you want to annotate objects and properties with a semantic type to allow for better interoperability and discoverability. For this, ontologies are used:

Ontologies

Ontologies are a way to add semantic meaning to your data model. They are a collection of concepts and relationships between them and are specific to the domain of your data model. For instance, the schema.org ontology is a collection of concepts and relationships that span across many domains. This is very useful when you want to connect to other data models that employ similar concepts, but use different names for them.

Typically these relations are defined as triples, consisting of a subject, predicate and object. For instance, the statement "John is a person" can be represented as the triple (John, is a, person). The first element of the triple is the subject, the second is the predicate and the third is the object.

With MD-Models, you can define the is a predicate as an object annotation for an object definition. On the other hand, you can define the predicate as a property annotation for a property definition.

How to annotate objects

Objects are annotated at the level 3 heading of the object definition. The annotation follows the object name, separated by a space, and is enclosed in parentheses. Typically, these annotations are expressed as a URI that points to the definition of the concept in the ontology. Because full URIs are verbose, they can be shortened by using a prefix. We will be using the schema prefix in the following examples. More on how to use prefixes can be found in the Preamble section.

We want to express: "A Product is a schema:Product".

    ### Product (schema:Product)

    - name
      - type: string
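For readers who want to see the heading convention spelled out, the following Python sketch extracts the object name and the optional annotation from a level 3 heading with a regular expression. The pattern and the parse_object_heading function are assumptions for illustration and are not taken from the MD-Models parser.

```python
import re

# "### Name" optionally followed by a space and "(annotation)".
HEADING = re.compile(r"^###\s+(?P<name>\S+)(?:\s+\((?P<term>[^)]+)\))?\s*$")


def parse_object_heading(line: str):
    """Return (name, term) for a level 3 heading, or None for any other line."""
    match = HEADING.match(line)
    if match is None:
        return None
    return match.group("name"), match.group("term")


print(parse_object_heading("### Product (schema:Product)"))  # ('Product', 'schema:Product')
print(parse_object_heading("### Product"))                   # ('Product', None)
print(parse_object_heading("## Store-related information"))  # None
```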

How to annotate properties

Properties are annotated using an option, as defined in the Property Options section. We utilize the keyword term to add a semantic type to the property. Properties can function in one of two ways:

  1. If the type of the property is a primitive type, the term option describes an is a relationship and thus the object in the sense of the triple.
  2. If the type of the property is an object or an array of objects, the term option describes the relationship (predicate) between the subject (object) and the object (type).

Object-valued properties

We want to express: "A Product is ordered by a Person".

    ### Product

    - orders
      - type: Person[]
      - term: schema:orderedBy

The annotation effectively describes the relationship between the orders property and the Person type. Given that a Person is also annotated with a term, one can then build a Knowledge Graph that connects the orders property to the Person type in a semantically rich way, which can be used for a variety of purposes, such as semantic search and discovery.

Primitive-valued properties

We want to express: "The name of a Product is a schema:name".

    ### Product

    - name
      - type: string
      - term: schema:name

Naturally, since the name property is part of the Product object, it builds the relationship "A Product has a name". In terms of triples, this is represented as (Product, has, name).

Once these annotations are defined, they are automatically added to the generated code and schemas, if supported. Semantic annotations are currently supported in the following language templates:

  • python-dataclass (JSON-LD)
  • python-pydantic (JSON-LD)
  • typescript (JSON-LD)
  • shacl (Shapes Constraint Language)
  • shex (Shape Expressions)
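To illustrate what a JSON-LD representation of an annotated object can look like, the sketch below assembles a small JSON-LD document from the annotations of the Product examples. The to_jsonld function is hypothetical, and the exact shape produced by the MD-Models templates may differ; only the @context and @type keywords are standard JSON-LD.

```python
import json


def to_jsonld(object_term: str, property_terms: dict, prefixes: dict, values: dict) -> dict:
    """Combine prefixes and property terms into a JSON-LD document (sketch)."""
    context = dict(prefixes)        # e.g. schema -> http://schema.org/
    context.update(property_terms)  # e.g. name -> schema:name
    return {"@context": context, "@type": object_term, **values}


doc = to_jsonld(
    object_term="schema:Product",
    property_terms={"name": "schema:name"},
    prefixes={"schema": "http://schema.org/"},
    values={"name": "Inception"},
)
print(json.dumps(doc, indent=2))
```

The object annotation ends up in @type, while the property annotations are collected in @context.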

Preamble

The preamble is the first section of your data model. It is used to provide metadata about the data model, such as the name, version, and author.

    ---
    id: my-data-model
    prefix: md
    repo: http://mdmodel.net/
    prefixes:
      schema: http://schema.org/
    nsmap:
      tst: http://example.com/test/
    imports:
      common.md: common.md
    ---

Frontmatter Keys

The frontmatter section of your MD-Models document supports several configuration keys that control how your data model is processed and interpreted. Here's a detailed explanation of each available key:

id

  • Type: String (Optional)
  • Description: A unique identifier for your data model. This can be used to reference your model from other models or systems.
  • Example: id: my-data-model

prefixes

  • Type: Map of String to String (Optional)
  • Description: Defines namespace prefixes that can be used throughout your model to reference external vocabularies or schemas. This is particularly useful for semantic annotations.
  • Example:
    prefixes:
      schema: http://schema.org/
      foaf: http://xmlns.com/foaf/0.1/
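A prefix is simply a shorthand for the beginning of a URI. The Python sketch below shows how a prefixed term such as schema:name can be expanded with a prefix map like the one in the example. The expand_term function is a hypothetical helper for illustration, not an MD-Models API.

```python
def expand_term(term: str, prefixes: dict) -> str:
    """Expand "prefix:local" to a full URI if the prefix is known (sketch)."""
    prefix, separator, local = term.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + local
    # Unknown prefix or already a full URI: return unchanged.
    return term


prefixes = {
    "schema": "http://schema.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}
print(expand_term("schema:name", prefixes))             # http://schema.org/name
print(expand_term("foaf:Person", prefixes))             # http://xmlns.com/foaf/0.1/Person
print(expand_term("http://schema.org/name", prefixes))  # http://schema.org/name
```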

nsmap

  • Type: Map of String to String (Optional)
  • Description: Similar to prefixes, defines namespace mappings that can be used in your model. This is often used for XML-based formats or when integrating with systems that use namespaces.
  • Example:
    nsmap:
      tst: http://example.com/test/
      ex: http://example.org/

repo

  • Type: String
  • Default: http://mdmodel.net/
  • Description: Specifies the base repository URL for your model. This can be used to generate absolute URIs for your model elements.
  • Example: repo: https://github.com/myorg/myrepo/

prefix

  • Type: String
  • Default: md
  • Description: Defines the default prefix to use for your model elements when generating URIs or qualified names.
  • Example: prefix: mymodel

imports

  • Type: Map of String to String
  • Default: Empty map
  • Description: Specifies other models to import into your current model. The key is the alias or name to use for the import, and the value is the location of the model to import. The location can be either a local file path or a remote URL.
  • Example:
    imports:
      common: common.md
      external: https://example.com/models/external.md
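The defaults documented for the frontmatter keys can be summarised in a short Python sketch. The with_defaults function is hypothetical, and representing the optional keys (id, prefixes, nsmap) as None when they are absent is an assumption made for this example.

```python
# Defaults as documented above; optional keys are assumed to be None when absent.
FRONTMATTER_DEFAULTS = {
    "id": None,
    "prefixes": None,
    "nsmap": None,
    "repo": "http://mdmodel.net/",
    "prefix": "md",
    "imports": {},
}


def with_defaults(frontmatter: dict) -> dict:
    """Fill in documented defaults for keys missing from the frontmatter."""
    resolved = {
        key: (dict(value) if isinstance(value, dict) else value)
        for key, value in FRONTMATTER_DEFAULTS.items()
    }
    resolved.update(frontmatter)
    return resolved


config = with_defaults({"id": "my-data-model", "prefix": "mymodel"})
print(config["prefix"])   # mymodel
print(config["repo"])     # http://mdmodel.net/
print(config["imports"])  # {}
```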

Import Types

The imports key supports two types of imports:

  1. Local Imports: References to local files on your filesystem

    imports:
      common: ./common/base.md
  2. Remote Imports: References to models hosted on remote servers (URLs)

    imports:
      external: https://example.com/models/external.md

When importing models, the definitions from the imported models become available in your current model, allowing you to reference and extend them. This is useful for creating modular and reusable data models.
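The difference between the two import types comes down to whether the location is a URL. As a minimal sketch, the hypothetical import_kind function below uses the Python standard library to tell them apart; treating only http and https locations as remote is an assumption of this example.

```python
from urllib.parse import urlparse


def import_kind(location: str) -> str:
    """Classify an import location as "remote" (URL) or "local" (file path)."""
    scheme = urlparse(location).scheme
    return "remote" if scheme in ("http", "https") else "local"


imports = {
    "common": "./common/base.md",
    "external": "https://example.com/models/external.md",
}
for alias, location in imports.items():
    print(alias, "->", import_kind(location))
# common -> local
# external -> remote
```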

Full example

The following is a full example of an MD-Models file that defines a data model for a research publication.

    ---
    id: research-publication
    prefix: rpub
    prefixes:
      schema: https://schema.org/
    ---

    ### ResearchPublication (schema:Publication)

    This model represents a scientific publication with its core metadata, authors, and citations.

    - __doi__
      - Type: Identifier
      - Term: schema:identifier
      - Description: Digital Object Identifier for the publication
      - XML: @doi
    - title
      - Type: string
      - Term: schema:name
      - Description: The main title of the publication
    - authors
      - Type: [Author](#author)[]
      - Term: schema:authored
      - Description: List of authors who contributed to the publication
    - publication_year
      - Type: integer
      - Term: schema:datePublished
      - Description: Year when the publication was published
      - Minimum: 1900
      - Maximum: 2100
    - citations
      - Type: integer
      - Term: schema:citation
      - Description: Number of times this publication has been cited
      - Default: 0

    ### Author (schema:Person)

    The `Author` object is a simple object that has a name and an email address.

    - __name__
      - Type: string
      - Term: schema:name
      - Description: The name of the author
    - __email__
      - Type: string
      - Term: schema:email
      - Description: The email address of the author

Best practices

  1. Use Descriptive Names

    • Object names should be PascalCase (e.g., ResearchPublication)
    • Attribute names should be in snake_case (e.g., publication_year)
    • Use clear, domain-specific terminology
  2. Identifiers

    • Mark required identifier fields with double underscores (e.g., __doi__)
    • Choose meaningful identifier fields
  3. Documentation

    • Always include object descriptions
    • Document complex attributes
    • Explain any constraints or business rules
  4. Semantic Mapping

    • Use standard vocabularies when possible
    • Define custom terms in your prefix map
    • Maintain consistent terminology
  5. Validation Rules

    • Include range constraints for numbers
    • Specify default values when appropriate
    • Document any special validation requirements

Common Patterns

Array Types

    - tags
      - Type: string[]
      - Description: List of keywords describing the publication

Object References

    - main_author
      - Type: Author
      - Description: The primary author of the publication

Required Fields

    - __id__
      - Type: Identifier
      - Description: Unique identifier for the object

Remember that MD-Models aims to balance human readability with technical precision. Your object definitions should be clear enough for domain experts to understand while maintaining the structure needed for technical implementation.

Command Line Interface

To be added

Code generation

To be added

Pipelines

To be added

Schema validation

To be added

Large Language Models

To be added

Exporters

To be added

Programming languages

To be added

Schema languages

To be added

API specifications

To be added

Documentation

To be added

Examples

To be added

Hello MD-Models

To be added

Union types

To be added

Database models

To be added

FAQ

To be added