Configuration¶
The document parsing configuration is the only interface to document parsing:
- It consists of pydantic models that you create and register in the document parsing registry.
- You can find an example for the FR retiree eligibility request document type here ⧉.
The configuration for a given document type is declared using DocumentParsingConfiguration.
DocumentAutoParsingFlowConfiguration ¶
Bases: BaseModel
Configuration for the automatic parsing flow
Manual parsing configuration¶
For each category, we declare the manual parsing configuration using the DocumentCategoryConfiguration.
DocumentCategoryConfiguration ¶
Bases: BaseModel
Parsing configuration for a document category
extraction_content_model
instance-attribute
¶
Extraction content structured output model for the document category. Can be None if the document category is not supported for extraction. It can be used to generate the JSON schema for the manual parsing tool or to validate an automatic extraction. Add the json_schema_extra 'order' to specify the order of fields in the manual parsing tool.
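To illustrate the 'order' hint, here is a minimal sketch that sorts the properties of a JSON schema by an `order` key. The schema dict and the `ordered_field_names` helper are hypothetical, and how the manual parsing tool actually reads `order` is an assumption.

```python
from typing import Any

# Hypothetical JSON schema, shaped like what a pydantic model could emit when
# each field carries json_schema_extra={"order": <int>}.
schema: dict[str, Any] = {
    "type": "object",
    "properties": {
        "amount": {"type": "number", "order": 2},
        "issuer": {"type": "string", "order": 1},
        "comment": {"type": "string"},  # no explicit order
    },
}


def ordered_field_names(json_schema: dict[str, Any]) -> list[str]:
    """Return property names sorted by 'order'; fields without it go last."""
    properties = json_schema.get("properties", {})
    return sorted(
        properties,
        key=lambda name: properties[name].get("order", float("inf")),
    )


print(ordered_field_names(schema))  # ['issuer', 'amount', 'comment']
```

Placing fields without an `order` last is a design choice of this sketch, not documented behaviour.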
icon
instance-attribute
¶
The icon to display for the category in the manual parsing tool, taken from https://tabler.io/icons ⧉ (e.g. IconFileDollar)
internal_control_ratio
class-attribute
instance-attribute
¶
The ratio of internal control to apply to this category (between 0 and 1)
unsupported
class-attribute
instance-attribute
¶
If true, the document category is unsupported and cannot be extracted. It can only be rejected.
validate_json_schema
classmethod
¶
Check that the JSON schema is valid
Source code in components/documents/public/entities/parsing/parsing_configuration.py
validate_unsupported_extraction_content_consistency ¶
Validate the extraction content model
Source code in components/documents/public/entities/parsing/parsing_configuration.py
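The consistency check between `unsupported` and `extraction_content_model` could look roughly like the sketch below. `CategoryConfigSketch` is a hypothetical, dependency-free stand-in for the real pydantic model, and the exact rule it enforces (a model is required exactly when the category is supported) is an assumption drawn from the attribute descriptions above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CategoryConfigSketch:
    """Hypothetical stand-in for DocumentCategoryConfiguration."""

    icon: str
    extraction_content_model: Optional[type] = None
    unsupported: bool = False

    def __post_init__(self) -> None:
        # Assumed rule: a model is required exactly when the category is supported.
        has_model = self.extraction_content_model is not None
        if self.unsupported and has_model:
            raise ValueError("an unsupported category must not define an extraction model")
        if not self.unsupported and not has_model:
            raise ValueError("a supported category requires an extraction model")


class InvoiceContent:
    """Placeholder for a pydantic extraction content model."""


supported = CategoryConfigSketch(icon="IconFileDollar", extraction_content_model=InvoiceContent)
reject_only = CategoryConfigSketch(icon="IconFileDollar", unsupported=True)
```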
Automatic Parsing configuration¶
The automatic parsing is configured using the DocumentAutoParsingFlowConfiguration. Each step of the flow has its own configuration:
- Transcription: DocumentTranscriptionConfiguration
- Classification: DocumentClassificationConfiguration
- Extraction: DocumentExtractionConfiguration
DocumentAutoParsingFlowConfiguration ¶
Bases: BaseModel
Configuration for the automatic parsing flow
Transcription¶
DocumentTranscriptionConfiguration ¶
Bases: BaseModel
Configuration to transcribe a document
min_confidence_score
class-attribute
instance-attribute
¶
The document is sent to review if the confidence score is below this value (between 0 and 1).
min_text_length
class-attribute
instance-attribute
¶
The document is sent to review if the text length is below this value.
transcriber
class-attribute
instance-attribute
¶
Transcriber provider. Only 'textract' and 'gemini' are supported for now.
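A minimal sketch of how the two thresholds might drive the review decision follows. `TranscriptionConfigSketch` and `needs_review` are hypothetical names, the default values are purely illustrative, and combining the two checks with "or" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptionConfigSketch:
    """Hypothetical stand-in for DocumentTranscriptionConfiguration."""

    transcriber: str = "textract"
    min_confidence_score: float = 0.8  # illustrative default, not the real one
    min_text_length: int = 50  # illustrative default, not the real one

    def __post_init__(self) -> None:
        if self.transcriber not in ("textract", "gemini"):
            raise ValueError(f"unknown transcriber: {self.transcriber}")
        if not 0.0 <= self.min_confidence_score <= 1.0:
            raise ValueError("min_confidence_score must be between 0 and 1")


def needs_review(config: TranscriptionConfigSketch, text: str, confidence: float) -> bool:
    """Assumed rule: review when either threshold is not met."""
    return confidence < config.min_confidence_score or len(text) < config.min_text_length


config = TranscriptionConfigSketch(transcriber="gemini")
print(needs_review(config, "too short", confidence=0.95))  # True: text below min_text_length
```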
Classification¶
DocumentClassificationConfiguration ¶
Bases: BaseModel
The configuration used to classify a document. It can be used to classify the document along multiple classes.
classifiers
instance-attribute
¶
The classifiers to use to classify the document. Each classifier can have multiple predicted classes as output. At least one classifier must be configured to predict the 'category' class.
model_post_init ¶
Register the SageMaker predictor in the registry if not already done
Source code in components/documents/public/entities/parsing/flow/classification.py
validate_category_in_classifiers ¶
Validate that the 'category' classifier is present
Source code in components/documents/public/entities/parsing/flow/classification.py
validate_no_duplicate_classes ¶
Check that no class is predicted more than once.
Source code in components/documents/public/entities/parsing/flow/classification.py
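The two validators above can be sketched as one stdlib function. Here each classifier is represented only by its list of possible labels; `predicted_classes` and `check_classifiers` are hypothetical helpers, not the real validator implementations.

```python
from collections import Counter


def predicted_classes(possible_labels: list[dict[str, str]]) -> set[str]:
    """Collect the class names a classifier can predict from its labels."""
    return {name for label in possible_labels for name in label}


def check_classifiers(classifiers: list[list[dict[str, str]]]) -> None:
    """Raise ValueError if 'category' is missing or a class is predicted twice."""
    counts = Counter(
        name for labels in classifiers for name in predicted_classes(labels)
    )
    if counts["category"] == 0:
        raise ValueError("at least one classifier must predict 'category'")
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"classes predicted more than once: {duplicates}")


# Valid: one classifier predicts 'category', another predicts 'language'.
check_classifiers(
    [
        [{"category": "invoice"}, {"category": "quote"}],
        [{"language": "fr"}, {"language": "en"}],
    ]
)
```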
DocumentClassifierConfiguration ¶
Bases: BaseModel
The configuration to use a predictor to classify a document. A predictor can output multiple classes.
fallback_label
property
¶
If the returned label is not declared in the possible labels, we return this fallback label. By default, the fallback label is unclassifiable which means that the document is unsupported and cannot be parsed.
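The fallback behaviour described here amounts to a membership check. In the sketch below, `resolve_label` is a hypothetical helper and labels are simplified to plain strings.

```python
UNCLASSIFIABLE = "unclassifiable"  # default fallback label named in the description above


def resolve_label(
    returned: str,
    possible_labels: list[str],
    fallback_label: str = UNCLASSIFIABLE,
) -> str:
    """Return the predictor's label if it is declared, else the fallback."""
    return returned if returned in possible_labels else fallback_label


print(resolve_label("invoice", ["invoice", "quote"]))  # invoice
print(resolve_label("receipt", ["invoice", "quote"]))  # unclassifiable
```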
min_confidence
class-attribute
instance-attribute
¶
The minimum confidence required. Below it, the document is sent to review.
possible_labels
instance-attribute
¶
The possible labels that the predictor can return. Each label should be a dict where keys represent the class names. For a single-class classifier, the list can simply contain the possible values for the class.
predictor_type
instance-attribute
¶
The predictor type to use: 'sagemaker' for a SageMaker predictor, 'llm' for an LLM predictor, 'fixed' for a fixed classification to the unique possible label, and 'pythonic' for a Python predictor.
coerce_possible_labels
classmethod
¶
Coerce possible labels from a list of strings to a list of dicts. This is a convenient pre-processor to simplify the configuration.
Source code in components/documents/public/entities/parsing/flow/classification.py
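As a rough sketch of this coercion, the function below wraps bare strings into single-key dicts. The `class_name` parameter and its 'category' default are assumptions made for illustration; the real pre-processor may derive the class name differently.

```python
from typing import Union

Label = dict[str, str]


def coerce_possible_labels(
    labels: list[Union[str, Label]],
    class_name: str = "category",  # assumed default class name
) -> list[Label]:
    """Wrap bare strings as {class_name: value}; copy dict labels as they are."""
    return [
        {class_name: label} if isinstance(label, str) else dict(label)
        for label in labels
    ]


print(coerce_possible_labels(["invoice", "quote"]))
# [{'category': 'invoice'}, {'category': 'quote'}]
```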
validate_llm_sagemaker_predictor_single_class ¶
Validate that, if the predictor type is 'sagemaker' or 'llm', the possible labels cover exactly one class. These two predictors do not handle multiple class predictions for now.
Source code in components/documents/public/entities/parsing/flow/classification.py
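The single-class restriction can be sketched as follows. `check_single_class` is a hypothetical helper, and treating an empty label list as invalid is an assumption of this sketch.

```python
SINGLE_CLASS_PREDICTORS = {"sagemaker", "llm"}


def check_single_class(predictor_type: str, possible_labels: list[dict[str, str]]) -> None:
    """Raise ValueError if a sagemaker/llm predictor declares several classes."""
    class_names = {name for label in possible_labels for name in label}
    if predictor_type in SINGLE_CLASS_PREDICTORS and len(class_names) != 1:
        raise ValueError(
            f"'{predictor_type}' predictors handle exactly one class, got {sorted(class_names)}"
        )


# Accepted: an LLM predictor with a single 'category' class.
check_single_class("llm", [{"category": "invoice"}, {"category": "quote"}])
# Accepted: a Python predictor may output several classes.
check_single_class("pythonic", [{"category": "invoice", "language": "fr"}])
```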
validate_possible_classes ¶
Validate that the possible classes are dicts whose keys represent class names.
Source code in components/documents/public/entities/parsing/flow/classification.py
validate_predictor_configuration_type ¶
Validate that the predictor configuration is consistent with the predictor type
Source code in components/documents/public/entities/parsing/flow/classification.py
Extraction¶
DocumentCategoryExtractionConfiguration ¶
Bases: BaseModel
Configuration to extract data from a given document category
auto_validate_flag
class-attribute
instance-attribute
¶
Whether to auto-validate the extracted data. If set to False, the extraction result will be sent to review (auto-populated). It can also be set to a feature flag name to enable auto-validation only if that feature flag is enabled.
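Since `auto_validate_flag` accepts either a boolean or a feature flag name, resolving it might look like the sketch below. `should_auto_validate`, the `is_feature_enabled` callback, and the flag name are all hypothetical; the real feature flag lookup is not shown in this page.

```python
from typing import Callable, Union


def should_auto_validate(
    auto_validate_flag: Union[bool, str],
    is_feature_enabled: Callable[[str], bool],
) -> bool:
    """Resolve a bool directly, or look up a feature flag by name."""
    if isinstance(auto_validate_flag, bool):
        return auto_validate_flag
    return is_feature_enabled(auto_validate_flag)


# Hypothetical in-memory feature flag store used only for this example.
enabled_flags = {"auto-validate-invoices"}
print(should_auto_validate("auto-validate-invoices", enabled_flags.__contains__))  # True
print(should_auto_validate(False, enabled_flags.__contains__))  # False
```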
extractor_configuration
instance-attribute
¶
The configuration of the dynamic LLM extractor for each possible category
DocumentExtractionConfiguration ¶
Bases: BaseModel
Configuration to extract data from a document
category_extraction_configurations
instance-attribute
¶
The extraction configuration for each category (classification)
document_handler
instance-attribute
¶
The document handler used by the extractors to fetch relevant example inputs and expected outputs