In Cyyrus, the schema is the heart of the dataset generation process. It defines the structure, properties, and types of the dataset. The schema is written in YAML and is used to generate synthetic data that adheres to the specified structure.

1. Define Your Tasks

The first step is to define the tasks used in the dataset generation process. Tasks can include parsing, extraction, or other operations.

2. Define Your Types

Once the tasks are defined, the next step is to define the types that will be used in the dataset. These types can include objects, arrays, or any other data type.

3. Define Your Columns

Columns are the attributes of the dataset. They can be linked to tasks to define the structure of the dataset.

Think of task_inputs as a way to define the order in which columns become available. If a column requires the output of another column, that other column can be listed as a task_input. Internally, task_inputs are parsed as a directed acyclic graph to determine the order of execution.
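To make the ordering idea concrete, here is a minimal Python sketch of how dependencies between columns can be resolved into an execution order with a topological sort. It is not Cyyrus's internal implementation: the column names and the `execution_order` helper are hypothetical, and the only library used is Python's standard `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical columns: each key maps to the columns it needs as inputs.
column_inputs = {
    "parsed_invoice": [],
    "invoice_items": ["parsed_invoice"],
    "invoice_summary": ["parsed_invoice", "invoice_items"],
}


def execution_order(column_inputs):
    """Return the columns ordered so that each one comes after its inputs."""
    # TopologicalSorter expects {node: predecessors} and raises CycleError
    # if the dependencies do not form a directed acyclic graph.
    return list(TopologicalSorter(column_inputs).static_order())


order = execution_order(column_inputs)
print(order)  # ['parsed_invoice', 'invoice_items', 'invoice_summary']

# Two columns that depend on each other cannot be ordered.
try:
    execution_order({"a": ["b"], "b": ["a"]})
except CycleError:
    print("cycle detected: columns cannot depend on each other")
```

In this sketch, a column with no inputs is generated first, and a circular dependency is rejected before any generation would start.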

4. Define Your Dataset

Datasets are defined by the metadata, splits, attributes, and shuffle properties.
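As a rough illustration of what shuffle and splits properties typically control, the Python sketch below shuffles rows with a fixed seed and cuts them into named splits by proportion. This is an assumption about the general technique, not a description of how Cyyrus applies these properties: the `shuffle_and_split` helper, the split names, and the seed value are all hypothetical.

```python
import random


def shuffle_and_split(rows, splits, seed):
    """Shuffle rows reproducibly, then cut them into named splits by proportion."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)

    result = {}
    names = list(splits)
    start = 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last split takes the remainder so no row is lost to rounding.
            end = len(shuffled)
        else:
            end = start + round(splits[name] * len(shuffled))
        result[name] = shuffled[start:end]
        start = end
    return result


# Hypothetical values: an 80/20 train/test split with seed 42.
parts = shuffle_and_split(range(10), {"train": 0.8, "test": 0.2}, seed=42)
print(len(parts["train"]), len(parts["test"]))  # 8 2
```

Fixing the seed means the same rows land in the same split on every run, which keeps generated datasets comparable across runs.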