Copyright © Cryville 2021-2022. This specification is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Implementation is not limited.
This document specifies Cryville Meta Database Framework (CMDF) Version 1. CMDF allows different types of data to be loaded, stored, and organized efficiently.
This document is currently a working draft, thus the stability of the definitions described in this document is not guaranteed.
Draft archive timestamp:
Overtime, databases have been using tables to store data. The structures of the records, or the columns of a table, are predefined, which leads to good performance, but sometimes bad flexibility. CMDF is a database framework designed to focus more on flexibility, meanwhile without losing too much performance.
The main difference between CMDF and traditional database frameworks is the way how they store data. While traditional databases store data in tables, CMDF stores data in nodes and connect them with each other, forming a relational network. Compared to traditional database frameworks, CMDF should provide the following features:
The main database file contains all the records in the database. It is made up of a header and a data body. Although the database structure represents a network, it is still stored in a file or a number of files in linear form.
There are a number of data structures commonly used across the database file.
A string is a sequence of characters encoded in UTF-8. Strings are serialized in three different forms: padded, zero-terminated, and sized. Padded strings are padded to the specified length with null control characters \0. Zero-terminated strings are terminated with a null control character \0, preventing applications from skipping the string. Sized strings are serialized with its length in bytes before it, allowing applications to skip over this string.
The length of a sized string is limited by the size of the length record.
string padded_str[7] = "CMDF"; // Bytes: 43 4D 44 46 00 00 00 string zero_terminated_str = "CMDF"; // Bytes: 43 4D 44 46 00 string sized_str<0..2^8-1> = "CMDF"; // Bytes: 04 43 4D 44 46
A pointer is an unsigned 64-bit integer pointing to an address in the file. A zero value indicates that the pointer is null.
When the main database is stored in a number of files, the significant bits of a pointer MAY be used to point to different files.
A timestamp is the time elapsed since the Unix epoch (00:00:00 UTC on 1970-01-01) excluding leap seconds in milliseconds. It is serialized as a 64-bit unsigned integer.
struct {
pointer ptr_content_root;
pointer ptr_revision_root;
} TreePair; /* 16 bytes */
The database file contains a number of AVL nodes making up AVL trees. AVL tree is a self-balancing binary search tree, enabling fast node searching, insertion, and deletion. Trees are paired as tree pairs, with one being the content tree and the other being the revision tree, both null pointers when initialized. The revision tree records the timestamps when the content nodes are inserted, moved, or deleted.
struct {
string magic[4] = "CMDF"; /* 4 bytes */
uint32 version = 0; /* 4 bytes */
TreePair tp_obj; /* 16 bytes */
pointer ptr_fc; /* 8 bytes */
byte reserved[214]; /* 214 bytes */
pointer ptr_ext_header; /* 8 bytes */
byte reserved; /* 1 byte */
byte flag = 0xFF; /* 1 byte */
} HeaderBase; /* 256 bytes */
enum {
is_free_chunk (0x01),
(0xFF)
} NodeFlag;
struct {
pointer ptr_l;
pointer ptr_r;
uint64 lh_weight;
byte height;
NodeFlag flag;
timestamp rev;
opaque data;
byte padding_size;
byte padding[padding_size];
NodeFlag flag;
} Node; /* 36 bytes + sizeof(data) + padding_size */
The preferred size of a node is the smallest multiple of 16 not less than the size of data plus 36 (in bytes.)
When writing a Node, it is RECOMMENDED to skip over the padding field without writing any data.
struct {
pointer ptr;
} RevisionData; /* 8 bytes */
struct {
byte id[16];
TreePair tp_comps;
TreePair tp_ents;
TreePair tp_props;
TreePair tp_vals;
MetaName name;
} ObjectData; /* 80 bytes + sizeof(name) */
00 00 00 00 00 00 00 00. Object identifiers for other uncomparable objects are the value of a counter which is initially 0 and increases every time before a new uncomparable object is added.Meta name is a list of name parts describing the name of the object. A name part contains a name string, with the information of its language.
Meta name is designed to eliminate the problem of multiple languages in a single name. Meanwhile it gives the database potential to store or automatically generate different aliases and transcriptions of an object, thus later the object can be looked up faster and more conveniently.
struct {
MetaNamePart name_part[1..2^8-1];
} MetaName;
struct {
string language;
string name;
} MetaNamePart;
-. The first subtag is an ISO 639-3 code indicating language. The second one is an ISO 15924 code indicating script. The third and the fourth one, both optional, are an ISO 3166-1 alpha-2 code and an ISO 3166-2 code respectively indicating region.All non-ASCII characters in the code below are escaped in case of inconsistent rendering.
meta_name: [
("jpn-jpan","\x8679\x8272")
("eng-latn","Passions")
("jpn-jpan","\xFF01")
]
meta_name: [("eng-latn-gb","colour")]
meta_name: [("eng-latn-us","color")]
meta_name: [("jpn-jpan","\x7A7A")]
meta_name: [("zho-hans","\x7A7A")]
meta_name: [("zho-hant","\x7A7A")]
The example above may be rendered in HTML like this:
Property data store the property of the object in key value pairs.
struct {
pointer ptr_key;
pointer ptr_value;
} PropertyData; /* 16 bytes */
struct {
pointer ptr;
} LinkData; /* 8 bytes */
struct {
uint32 size;
byte discard[size];
uint32 size;
} FreeChunkData; /* 8 bytes + sizeof(discard) */
The AVL sub key is the start position of this node. The flag of the node is marked is_free_chunk.
There MUST be at least one node in the free chunk tree, which is the never used chunk located after all the existing nodes. Its size MUST be the maximum value of uint32. This node MUST only have the first size field with the other two fields absent.
The size of a node with FreeChunkData MUST span over the whole free chunk. It is only padded with the discard field and the size of the padding field MUST be 0.
When writing FreeChunkData, it is RECOMMENDED to skip over the discard field without writing any data.
In CMDF, an operation refers to an action or a series of actions that write the database file. An operation functions as a whole. If an operation can be broken down into sub-actions, the sub-actions do not function individually and MUST function with the operation.
A recovery file is used to protect write operations. A flag in the recovery file indicates if the database is writing data to the database file. While preforming write operations, the original bytes that would be overwritten later as well as their spans are first copied to the recovery file. The flag is then set to true. Next, the database starts to copy new data to the database file. After it is finished, the flag is reset to false.
Upon startup, the database checks if the flag in the recovery file is true, which indicates an interruption during a write operation. The database can then recover the original data from the recovery file. After recovery is done, the flag is reset to false.
All the data MUST be flushed or have been flushed into the file immediately after the flag is toggled.
The free chunk tree is created with the only essential node upon the creation of a database. The object tree pair stays null.
Nodes with FreeChunkData cannot be inserted manually with this subaction.
When inserting a new node into a tree, the preferred size of the node is computed and the smallest free chunk whose size is not less than the preferred size is found. Then the node is inserted in this chunk and the free chunk is removed from the free chunk tree. If the difference between the preferred size and the size of the free chunk is not less than 44 bytes, the leftover free chunk is reinserted into the free chunk tree. If the difference is less than 44 bytes, the discard field of the node is used to pad the node over the whole free chunk.
If the free chunk found is the last free chunk, and the database file has no more sufficient space for the preferred size, the file is resized or a new database file is created to extend the storage.
Nodes with RevisionData cannot be inserted manually with this operation.
When inserting a new node into a tree pair, a Node with specific data is created and inserted into the content tree of the tree pair. Meanwhile a Node with RevisionData is created with the current timestamp and inserted into the revision tree of the tree pair. If the tree pair is in the data of another node, the rev field of that node is set to the same timestamp as well.
If a node with the same AVL key has already existed, the operation fails.
The id and the name is determined when the object is inserted and MUST NOT be modified. The four TreePairs are null pointers.
Nodes with RevisionData or FreeChunkData cannot be deleted manually with this operation.
When deleting a node, the node is not actually removed from the database file, but its flag is set to deleted.
The usage of ISO 3166-1 alpha-2 codes does not imply sovereignty recognition by any contributors of this specification. It is specified in order to include more details of the region.
| ISO 3166-2 | ISO 3166-1 alpha-2 |
|---|---|
| CN-HK | HK |
| CN-MO | MO |
| CN-TW | TW |
| FI-01 | AX |
| FR-BL | BL |
| FR-GF | GF |
| FR-GP | GP |
| FR-MF | MF |
| FR-MQ | MQ |
| FR-NC | NC |
| FR-PF | PF |
| FR-PM | PM |
| FR-RE | RE |
| FR-TF | TF |
| FR-WF | WF |
| FR-YT | YT |
| NL-AW | AW |
| NL-BQ1 | BQ |
| NL-BQ2 | |
| NL-BQ3 | |
| NL-CW | CW |
| NL-SX | SX |
| US-AS | AS |
| US-GU | GU |
| US-MP | MP |
| US-PR | PR |
| US-UM | UM |
| US-VI | VI |