Cyyrus doesn’t need a manual, but here’s one anyway.
With Cyyrus, you configure your schema using a simple YAML file. It might look intimidating at first, but don't worry: we'll break it down for you, and you'll be a pro in no time.
Here you go.
Parse Invoices
```yaml
# schema.yaml
spec: v0  # Version of the schema

# Define the properties of the dataset
dataset:
  # Define the metadata of the dataset
  metadata:
    name: Invoice Dataset
    description: Dataset containing the invoice data
    tags: [invoice, financial, document]  # Keywords to categorize the dataset
    license: CC-BY-NC-SA  # Creative Commons Attribution-NonCommercial-ShareAlike license
    languages: [en]  # Dataset language (English)

  # Define how to shuffle the dataset
  shuffle:
    seed: 42  # Random seed for reproducibility

  # Define the splits of the dataset
  splits:
    train: 0.8  # 80% of data for training
    test: 0.2  # 20% of data for testing
    seed: 42  # Random seed for reproducible splitting

  # Define the attributes of the dataset
  attributes:
    required_columns: [parsed_invoice]  # Columns that must be present
    unique_columns: []  # Columns that should contain unique values (none specified)
    nulls: include  # Allow null values in the dataset

# Define the tasks that will be used in dataset generation, and their properties
tasks:
  # Define the invoice parsing task
  invoice_parsing:
    task_type: parsing
    task_properties:
      directory: experimental/sample  # Directory containing invoice files
      file_type: pdf  # Type of files to parse
      max_depth: 5  # Maximum depth for directory traversal
      parsed_format: markdown  # Output format for parsed invoices
      api_key: $OPENAI_API_KEY  # API key for OpenAI (using environment variable)
      model: gpt-4o-mini  # AI model to use for parsing

# Define the columns that will be used in the dataset
columns:
  # Define the parsed invoice column
  parsed_invoice:
    task_id: invoice_parsing  # Links this column to the invoice_parsing task
```
We start with the schema version. Currently set to “v0”, this allows for future updates and backwards compatibility. As our dataset generation process evolves, we can increment this version number to reflect significant changes. It’s like putting a sick flame decal on your dataset - it looks cool AND it’s practical!
```yaml
metadata:
  name: Invoice Dataset
  description: Dataset containing the invoice data
  tags: [invoice, financial, document]
  license: CC-BY-NC-SA
  languages: [en]
```
This subsection provides essential information about our dataset:
- We've named it "Invoice Dataset" and provided a brief description.
- Tags help categorize the dataset, making it easier to find and to understand its content.
- The license (Creative Commons Attribution-NonCommercial-ShareAlike) says "Share the love, but don't you dare make money off my hard work!" Quite fair.
- We've specified that the dataset is in English. Because that's all we know :D
Next is data shuffling. Think of it as shuffling your data like a deck of cards in Vegas. By setting a seed (in this case, 42), we ensure that the shuffle is reproducible: anyone using this schema will get the same randomized order, which is essential for replicability in research and development.
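Seeded shuffling isn't Cyyrus magic; it's how pseudo-randomness works everywhere. Here's a quick illustration in plain Python (this is not Cyyrus internals, just the general idea):

```python
import random

data = list(range(100))

# Shuffle two independent copies with the same seed
first = data[:]
second = data[:]
random.Random(42).shuffle(first)
random.Random(42).shuffle(second)

# Same seed, same "random" order -- that's reproducibility
assert first == second
assert first != data  # but it really did shuffle
```

Swap the seed and you get a different (but equally reproducible) order.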
And there you have it, folks! This YAML schema hands you a blueprint for generating an invoice dataset. It's got everything you need to generate, process, and structure your data, from high-level metadata down to specific parsing tasks and output formats.
Feeling brave? We can put a few more spins on the current schema. We've got the blueprint; now let's crunch data.
```yaml
# Let's extract customer_info from PDF invoices
spec: v0  # Version of the schema

# Define the properties of the dataset
dataset:
  # Define the metadata of the dataset
  metadata:
    name: Customer Info Dataset
    description: Dataset containing the customer info from invoices
    tags: [invoice, financial, document]
    license: CC-BY-NC-SA
    languages: [en]

  # Define how to shuffle the dataset
  shuffle:
    seed: 42

  # Define the splits of the dataset
  splits:
    train: 0.8
    test: 0.2
    seed: 42

  # Define the attributes of the dataset
  attributes:
    required_columns: [customer_info]
    unique_columns: []
    flatten_columns: []
    exclude_columns: [parsed_invoice]
    nulls: include

# Define the types that will be used in the dataset
types:
  # Define the customer info type
  customer_info:
    type: object
    properties:
      customer_name:
        type: string
      customer_address:
        type: string
      invoice_id:
        type: string
      total_amount:
        type: float

# Define the tasks that will be used in dataset generation, and their properties
tasks:
  # Define the invoice parsing task
  invoice_parsing:
    task_type: parsing
    task_properties:
      directory: experimental/sample
      file_type: pdf
      max_depth: 5
      parsed_format: base64

  # Define the customer info extraction task
  extract_customer_info:
    task_type: generation
    task_properties:
      model: gpt-4o-mini
      prompt: Extract customer info from the given invoice
      response_format: customer_info
      api_key: $OPENAI_API_KEY

# Define the columns that will be used in the dataset
columns:
  # Define the parsed invoice column
  parsed_invoice:
    task_id: invoice_parsing

  # Define the customer info column
  customer_info:
    task_id: extract_customer_info
    task_input: [parsed_invoice]  # Reference the column names for the extraction
```
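The `types` section deserves a second look: `customer_info` declares the shape every extracted row must match, and `response_format: customer_info` tells the generation task to return exactly that shape. As a rough mental model (plain Python, not Cyyrus internals, and the values below are made up), each extracted row behaves like a typed record:

```python
from dataclasses import dataclass


# Mirrors the customer_info type declared in the schema above
@dataclass
class CustomerInfo:
    customer_name: str
    customer_address: str
    invoice_id: str
    total_amount: float


# A hypothetical row the extraction task might produce
row = CustomerInfo(
    customer_name="Acme Corp",       # made-up example values
    customer_address="123 Main St",
    invoice_id="INV-001",
    total_amount=199.99,
)
```

Strings stay strings, the total stays a float, and anything that doesn't fit the declared shape is a schema violation rather than a silent surprise.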
Alright, enough with the scrolling. The data's ready to generate itself, but someone's gotta hit the start button. Let's get started.