Schema
The schema outlines the entire workflow. Think of it as a poor man’s HCL.
In Cyyrus, the schema is the heart of the dataset generation process. It defines the structure, properties, and types of the dataset. The schema is written in YAML and is used to generate synthetic data that adheres to the specified structure.
Define Your Tasks
The first step is to define the tasks used in the dataset generation process. These tasks can include parsing, extraction, or other operations.
Define Your Types
Once the tasks are defined, the next step is to define the types that will be used in the dataset. These types can include objects, arrays, or any other data type.
Define Your Columns
Columns are the attributes of the dataset. They can be linked to tasks to define the structure of the dataset.
Think of task_inputs as a way to define the order in which columns become available. If a column requires the output of another column, that other column can be listed as a task_input.
Internally, task_inputs are parsed as a directed acyclic graph to ensure the correct order of execution.
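To make the dependency ordering concrete, here is a minimal Python sketch of how a set of task_inputs could be resolved into an execution order. It is not Cyyrus's actual implementation: the column names and the execution_order helper are hypothetical, and the standard library's graphlib module stands in for whatever graph handling Cyyrus uses internally.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical columns: each name maps to the columns listed in its task_inputs.
# These names are illustrative and not taken from a real Cyyrus schema.
columns = {
    "parsed_document": [],
    "extracted_invoice": ["parsed_document"],
    "invoice_summary": ["extracted_invoice", "parsed_document"],
}


def execution_order(task_inputs):
    """Return column names so that every column comes after all of its inputs."""
    sorter = TopologicalSorter(task_inputs)
    try:
        return list(sorter.static_order())
    except CycleError as err:
        # A cycle means the inputs do not form a directed acyclic graph.
        raise ValueError(f"task_inputs contain a cycle: {err.args[1]}") from err


order = execution_order(columns)
print(order)
```

A column with no task_inputs can be generated first, and a circular dependency is rejected because no valid order exists for it.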
Define Your Dataset
Datasets are defined by the metadata, splits, attributes, and shuffle properties.
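As a rough illustration of how the splits and shuffle properties might interact, the Python sketch below shuffles rows with a fixed seed and then divides them by ratio. The split_rows function, its parameters, and the ratio format are assumptions made for this example; Cyyrus's actual option names and semantics may differ.

```python
import random


def split_rows(rows, ratios, shuffle=True, seed=0):
    """Split rows into named splits by ratio, optionally shuffling first."""
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1")
    rows = list(rows)
    if shuffle:
        # A seeded generator keeps the shuffle reproducible across runs.
        random.Random(seed).shuffle(rows)
    splits, start = {}, 0
    names = list(ratios)
    for i, name in enumerate(names):
        # The last split takes the remainder so no row is lost to rounding.
        is_last = i == len(names) - 1
        end = len(rows) if is_last else start + round(len(rows) * ratios[name])
        splits[name] = rows[start:end]
        start = end
    return splits


# Hypothetical split names and ratios.
splits = split_rows(range(10), {"train": 0.8, "test": 0.2}, shuffle=True, seed=42)
print({name: len(part) for name, part in splits.items()})
```

Shuffling before splitting means each split draws from the whole dataset rather than from a contiguous block of rows.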