Configuration

The document parsing configuration is the sole interface for using document parsing:

  • It consists of pydantic models that you create and register in the document parsing registry.
  • You can find an example for the FR retiree eligibility request document type here.

Document parsing is configured for a given document type using the DocumentParsingConfiguration.

DocumentAutoParsingFlowConfiguration

Bases: BaseModel

Configuration for the automatic parsing flow

classification_configuration instance-attribute

classification_configuration

The classification configuration

extraction_configuration instance-attribute

extraction_configuration

The extraction configuration

should_create_parsing_task_on_failure class-attribute instance-attribute

should_create_parsing_task_on_failure = True

transcription_configuration instance-attribute

transcription_configuration

The transcription configuration

Manual parsing configuration

For each category, we declare the manual parsing configuration using the DocumentCategoryConfiguration.

DocumentCategoryConfiguration

Bases: BaseModel

Parsing configuration for a document category

category instance-attribute

category

The document category

extraction_content_model instance-attribute

extraction_content_model

Extraction content structured output model for the document category. Can be None if the document category is not supported for extraction. It can be used to generate the JSON schema for the manual parsing tool or to validate an automatic extraction. Add the json_schema_extra 'order' key to specify the order of fields in the manual parsing tool.

icon instance-attribute

icon

The icon to display for the category in the manual parsing tool, taken from https://tabler.io/icons (e.g. IconFileDollar)

internal_control_ratio class-attribute instance-attribute

internal_control_ratio = Field(default=0.0, ge=0, le=1)

The ratio of internal control to apply to this category (between 0 and 1)

unsupported class-attribute instance-attribute

unsupported = False

If true, the document category is unsupported and cannot be extracted. It can only be rejected.

validate_json_schema classmethod

validate_json_schema(extraction_content_model)

Check that the JSON schema is valid

Source code in components/documents/public/entities/parsing/parsing_configuration.py
@field_validator("extraction_content_model")
@classmethod
def validate_json_schema(
    cls, extraction_content_model: type[BaseModel] | None
) -> type[BaseModel] | None:
    """
    Check that the JSON schema is valid
    """
    if extraction_content_model:
        json_schema = extraction_content_model.model_json_schema()
        if "order" in json_schema:
            field_order = json_schema["order"]
            if not isinstance(field_order, list):
                raise ValueError(
                    f"Extraction content model {extraction_content_model.__name__} order key is not a list"
                )
            if len(json_schema["order"]) != len(json_schema["properties"]):
                raise ValueError(
                    f"Order is defined for {extraction_content_model.__name__} but not all fields are in the order"
                )
    return extraction_content_model
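
The check above can be tried out without pydantic. The following is a minimal sketch, assuming the schema returned by model_json_schema() is a plain dict with a "properties" key and an optional top-level "order" key (which is where json_schema_extra entries end up). The function name check_field_order, the model name HypotheticalContent and the hand-written schemas are illustrative only and are not part of the component.

```python
from typing import Any


def check_field_order(schema: dict[str, Any], model_name: str) -> None:
    """Raise ValueError if 'order' is present but malformed or incomplete."""
    if "order" not in schema:
        return  # no explicit order declared: nothing to check
    field_order = schema["order"]
    if not isinstance(field_order, list):
        raise ValueError(
            f"Extraction content model {model_name} order key is not a list"
        )
    if len(field_order) != len(schema["properties"]):
        raise ValueError(
            f"Order is defined for {model_name} but not all fields are in the order"
        )


# Hand-written schemas standing in for the output of model_json_schema().
complete_schema = {
    "properties": {"first_name": {"type": "string"}, "last_name": {"type": "string"}},
    "order": ["last_name", "first_name"],
}
incomplete_schema = {**complete_schema, "order": ["last_name"]}

check_field_order(complete_schema, "HypotheticalContent")  # passes silently

try:
    check_field_order(incomplete_schema, "HypotheticalContent")
    order_error = None
except ValueError as error:
    order_error = str(error)
```

Note that the validator compares lengths only: an order list of the right length that contains a misspelled field name would still pass.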

validate_unsupported_extraction_content_consistency

validate_unsupported_extraction_content_consistency()

Validate the extraction content model

Source code in components/documents/public/entities/parsing/parsing_configuration.py
@model_validator(mode="after")
def validate_unsupported_extraction_content_consistency(self) -> Self:
    """
    Validate the extraction content model
    """
    if self.extraction_content_model is None and not self.unsupported:
        raise ValueError(
            f"Document category {self.category} has no extraction content model and is not unsupported. Please check the configuration."
        )
    if self.unsupported and self.extraction_content_model:
        raise ValueError(
            f"Document category {self.category} is unsupported but has an extraction content model. Please check the configuration."
        )
    return self
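
The consistency rule enforced above allows exactly two combinations: a supported category with a model, or an unsupported category without one. Here is a minimal sketch of that rule using a dataclass stand-in; CategorySketch, InvoiceContent and is_valid are hypothetical names introduced for illustration, not the real classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CategorySketch:
    """Hypothetical stand-in for DocumentCategoryConfiguration (not the real class)."""

    category: str
    extraction_content_model: Optional[type] = None
    unsupported: bool = False

    def __post_init__(self) -> None:
        has_model = self.extraction_content_model is not None
        if not has_model and not self.unsupported:
            raise ValueError(
                f"{self.category}: a supported category needs an extraction content model"
            )
        if has_model and self.unsupported:
            raise ValueError(
                f"{self.category}: an unsupported category must not declare a model"
            )


class InvoiceContent:
    """Placeholder for a pydantic extraction content model."""


def is_valid(**kwargs) -> bool:
    """Return True when the combination of arguments passes the check."""
    try:
        CategorySketch(**kwargs)
        return True
    except ValueError:
        return False


# The two accepted combinations.
supported_ok = is_valid(category="invoice", extraction_content_model=InvoiceContent)
unsupported_ok = is_valid(category="other", unsupported=True)
# The two rejected combinations.
missing_model = is_valid(category="invoice")
contradictory = is_valid(
    category="other", extraction_content_model=InvoiceContent, unsupported=True
)
```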

Automatic parsing configuration

Automatic parsing is configured using the DocumentAutoParsingFlowConfiguration. Each step of the flow has its own configuration:

  • Transcription: DocumentTranscriptionConfiguration
  • Classification: DocumentClassificationConfiguration
  • Extraction: DocumentExtractionConfiguration

Transcription

DocumentTranscriptionConfiguration

Bases: BaseModel

Configuration to transcribe a document

min_confidence_score class-attribute instance-attribute
min_confidence_score = 0.6

The document is sent to review if the confidence score is below this value (between 0 and 1).

min_text_length class-attribute instance-attribute
min_text_length = 10

The document is sent to review if the text length is below this value.

transcriber class-attribute instance-attribute
transcriber = 'textract'

Transcriber provider. Only 'textract' and 'gemini' are supported for now.
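
The two thresholds can be read as a single review rule. The sketch below is an assumption about how they combine (either condition alone triggers a review, and "below" is taken as strictly less than); the function needs_review is hypothetical and the real flow may measure the text length or confidence differently.

```python
def needs_review(
    text: str,
    confidence_score: float,
    min_confidence_score: float = 0.6,  # documented default
    min_text_length: int = 10,  # documented default
) -> bool:
    """Return True if the transcription should be sent to review."""
    too_uncertain = confidence_score < min_confidence_score
    too_short = len(text) < min_text_length
    return too_uncertain or too_short


confident_and_long = needs_review("Eligibility request form", 0.92)
low_confidence = needs_review("Eligibility request form", 0.41)
short_text = needs_review("n/a", 0.99)
```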

Classification

DocumentClassificationConfiguration

Bases: BaseModel

The configuration used to classify a document. It can be used to classify the document along multiple classes.

classifiers instance-attribute
classifiers

The classifiers to use to classify the document. Each classifier can have multiple predicted classes as output. At least one classifier must be configured to predict the 'category' class.

model_post_init
model_post_init(__context)

Register sagemaker predictor in the registry if not already done

Source code in components/documents/public/entities/parsing/flow/classification.py
def model_post_init(self, __context: Any) -> None:
    """
    Register sagemaker predictor in the registry if not already done
    """
    for classifier_configuration in self.classifiers:
        if classifier_configuration.predictor_type == "sagemaker":
            SageMakerPredictorRegistry.register_if_not_exists(
                cast(
                    "SageMakerPredictorConfiguration",
                    classifier_configuration.predictor_configuration,
                )
            )

validate_category_in_classifiers
validate_category_in_classifiers()

Validate that the 'category' classifier is present

Source code in components/documents/public/entities/parsing/flow/classification.py
@model_validator(mode="after")
def validate_category_in_classifiers(self) -> Self:
    """
    Validate that the 'category' classifier is present
    """
    for classifier in self.classifiers:
        if "category" in classifier.classes:
            return self
    raise ValueError("Missing 'category' in the list of classes to classify")

validate_no_duplicate_classes
validate_no_duplicate_classes()

Check that no class is predicted more than once.

Source code in components/documents/public/entities/parsing/flow/classification.py
@model_validator(mode="after")
def validate_no_duplicate_classes(self) -> Self:
    """Check that no class is predicted more than once."""
    classified_classes: set[str] = set()
    for classifier in self.classifiers:
        for class_name in classifier.classes:
            if class_name in classified_classes:
                raise ValueError(
                    f"The class {class_name} is predicted by multiple predictors"
                )
            classified_classes.add(class_name)
    return self
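
Taken together, the two validators above require that 'category' is predicted by some classifier and that no class is predicted twice. A minimal combined sketch on plain lists of class names follows; check_classifier_classes is a hypothetical helper, and 'language' is an illustrative class name that does not come from the component.

```python
def check_classifier_classes(classes_per_classifier: list[list[str]]) -> set[str]:
    """Return all predicted class names, or raise ValueError on an invalid setup."""
    seen: set[str] = set()
    for classes in classes_per_classifier:
        for class_name in classes:
            if class_name in seen:
                raise ValueError(
                    f"The class {class_name} is predicted by multiple predictors"
                )
            seen.add(class_name)
    if "category" not in seen:
        raise ValueError("Missing 'category' in the list of classes to classify")
    return seen


def error_for(classes_per_classifier: list[list[str]]) -> str:
    """Return the error message for a setup, or an empty string if it is valid."""
    try:
        check_classifier_classes(classes_per_classifier)
        return ""
    except ValueError as error:
        return str(error)


valid_setup = check_classifier_classes([["category"], ["language"]])
missing_category = error_for([["language"]])
duplicated_class = error_for([["category"], ["category", "language"]])
```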

DocumentClassifierConfiguration

Bases: BaseModel

The configuration to use a predictor to classify a document. A predictor can output multiple classes.

classes instance-attribute
classes

The list of class names that will be output by the classifier.

fallback_label property
fallback_label

If the returned label is not declared in the possible labels, we return this fallback label. By default, the fallback label is 'unclassifiable', which means that the document is unsupported and cannot be parsed.

min_confidence class-attribute instance-attribute
min_confidence = None

The minimum confidence required. Below it, the document is sent to review.
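
The fallback label and the minimum confidence can be pictured as a small post-processing step on each prediction. This is a sketch under assumptions: labels are represented as dicts keyed by class name, a min_confidence of None means no threshold is applied, and resolve_prediction is a hypothetical helper rather than a function of the component.

```python
from typing import Optional

Label = dict[str, str]


def resolve_prediction(
    predicted: Label,
    confidence: float,
    possible_labels: list[Label],
    fallback_label: Label,
    min_confidence: Optional[float] = None,
) -> tuple[Label, bool]:
    """Return the label to keep and whether the document goes to review."""
    # An undeclared label is replaced by the fallback label.
    label = predicted if predicted in possible_labels else fallback_label
    # Assumption: no threshold means the confidence is never a reason for review.
    needs_review = min_confidence is not None and confidence < min_confidence
    return label, needs_review


possible = [{"category": "invoice"}, {"category": "quote"}]
fallback = {"category": "unclassifiable"}

known = resolve_prediction({"category": "invoice"}, 0.9, possible, fallback, 0.8)
unknown = resolve_prediction({"category": "receipt"}, 0.9, possible, fallback)
uncertain = resolve_prediction({"category": "quote"}, 0.5, possible, fallback, 0.8)
```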

name property
name

The name of the classifier

possible_labels instance-attribute
possible_labels

The possible labels that the predictor can return. Each label should be a dict whose keys are class names. For a single-class classifier, the list can simply contain the possible values for the class.

predictor_configuration instance-attribute
predictor_configuration

The predictor configuration

predictor_type instance-attribute
predictor_type

The predictor type to use: 'sagemaker' for a SageMaker predictor, 'llm' for an LLM predictor, 'fixed' for a fixed classification to the unique possible label, and 'pythonic' for a Python predictor.

coerce_possible_labels classmethod
coerce_possible_labels(data)

Coerce possible labels from a list of strings to a list of dicts. This is a convenience pre-processor that simplifies the configuration.

Source code in components/documents/public/entities/parsing/flow/classification.py
@model_validator(mode="before")
@classmethod
def coerce_possible_labels(cls, data: Any) -> Any:
    """
    Coerce possible labels from a list of string to dict. This is a convenient pre-processor to simplify the configuration.
    """
    possible_labels = data.get("possible_labels")
    classes = data.get("classes")
    if classes and isinstance(classes, list) and len(classes) == 1:
        if isinstance(possible_labels, list):
            processed_possible_labels = []
            for label in possible_labels:
                if isinstance(label, str):
                    processed_possible_labels.append({classes[0]: label})
                else:
                    processed_possible_labels.append(label)
            data["possible_labels"] = processed_possible_labels
    return data
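
To see the effect of the coercion, the same logic can be run on a plain dict. The function below mirrors the validator above outside of pydantic; the label values 'invoice' and 'quote' and the class name 'language' are illustrative only.

```python
from typing import Any


def coerce_possible_labels(data: dict[str, Any]) -> dict[str, Any]:
    """Turn string labels into {class_name: label} dicts for single-class classifiers."""
    possible_labels = data.get("possible_labels")
    classes = data.get("classes")
    single_class = isinstance(classes, list) and len(classes) == 1
    if single_class and isinstance(possible_labels, list):
        data["possible_labels"] = [
            {classes[0]: label} if isinstance(label, str) else label
            for label in possible_labels
        ]
    return data


# Short form: allowed because there is exactly one class.
short_form = coerce_possible_labels(
    {"classes": ["category"], "possible_labels": ["invoice", "quote"]}
)

# With several classes the input is left untouched.
multi_class = coerce_possible_labels(
    {
        "classes": ["category", "language"],
        "possible_labels": [{"category": "invoice", "language": "fr"}],
    }
)
```

With a single class, the short string form and the explicit dict form are therefore equivalent once the model is built.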

validate_llm_sagemaker_predictor_single_class
validate_llm_sagemaker_predictor_single_class()

Validate that if the predictor type is a sagemaker or LLM one, the list of classes is composed of only one class. These two predictors do not handle multiple class predictions for now.

Source code in components/documents/public/entities/parsing/flow/classification.py
@model_validator(mode="after")
def validate_llm_sagemaker_predictor_single_class(self) -> Self:
    """
    Validate that if the predictor type is a sagemaker or LLM one, the list of classes is composed of only one class.
    These two predictors do not handle multiple class predictions for now.
    """
    if self.predictor_type in ["sagemaker", "llm"] and len(self.classes) > 1:
        raise ValueError(
            f"Predictor {self.predictor_type} can only handle one class prediction."
        )
    return self

validate_possible_classes
validate_possible_classes()

Validate that each possible label is a dict whose keys are declared class names.

Source code in components/documents/public/entities/parsing/flow/classification.py
@model_validator(mode="after")
def validate_possible_classes(self) -> Self:
    """
    Validate that possible classes are dict where keys represent class names.
    """
    for possible_class in self.possible_labels:
        for key in possible_class.keys():
            if key not in self.classes:
                raise ValueError(f"{key} is not defined as a class name.")
    return self

validate_predictor_configuration_type
validate_predictor_configuration_type()

Validate that the predictor configuration is consistent with the predictor type

Source code in components/documents/public/entities/parsing/flow/classification.py
@model_validator(mode="after")
def validate_predictor_configuration_type(self) -> Self:
    """
    Validate that the predictor configuration is consistent with the predictor type
    """
    if self.predictor_type == "sagemaker":
        if not isinstance(
            self.predictor_configuration, SageMakerPredictorConfiguration
        ):
            raise ValueError(
                "The predictor configuration must be a SageMakerPredictorConfiguration"
            )
    elif self.predictor_type == "llm":
        if not isinstance(self.predictor_configuration, LLMPredictorConfiguration):
            raise ValueError(
                "The predictor configuration must be a LLMPredictorConfiguration"
            )
    elif self.predictor_type == "pythonic":
        if not isinstance(
            self.predictor_configuration, PythonicPredictorConfiguration
        ):
            raise ValueError(
                "The predictor configuration must be a PythonicPredictorConfiguration"
            )
    elif self.predictor_type == "fixed":
        if self.predictor_configuration is not None:
            raise ValueError(
                "The predictor configuration must be null if the predictor type is 'fixed'"
            )
        if len(self.possible_labels) > 1:
            raise ValueError(
                "The predictor configuration must contain only one possible label for 'fixed' predictor."
            )
    return self

Extraction

DocumentCategoryExtractionConfiguration

Bases: BaseModel

Configuration to extract data from a given document category

auto_validate property
auto_validate

Returns whether the auto-validation is enabled

auto_validate_flag class-attribute instance-attribute
auto_validate_flag = True

Whether to auto-validate the extracted data. If set to False, the extraction result is sent to review (auto-populated). It can also be set to a feature flag name, in which case auto-validation is enabled only if that feature flag is enabled.
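
Since the flag accepts either a boolean or a feature flag name, resolving it needs a lookup for the string case. A minimal sketch, assuming a caller-supplied is_feature_enabled callback; both resolve_auto_validate and the flag name 'auto_validate_invoices' are hypothetical and do not come from the component.

```python
from typing import Callable, Union


def resolve_auto_validate(
    auto_validate_flag: Union[bool, str],
    is_feature_enabled: Callable[[str], bool],
) -> bool:
    """Return True when the extraction result may be validated automatically."""
    if isinstance(auto_validate_flag, bool):
        return auto_validate_flag
    # Otherwise the value is treated as a feature flag name.
    return is_feature_enabled(auto_validate_flag)


# Stand-in for a real feature flag service.
enabled_flags = {"auto_validate_invoices"}


def lookup(flag_name: str) -> bool:
    return flag_name in enabled_flags


always_on = resolve_auto_validate(True, lookup)
always_off = resolve_auto_validate(False, lookup)
flag_on = resolve_auto_validate("auto_validate_invoices", lookup)
flag_off = resolve_auto_validate("auto_validate_quotes", lookup)
```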

extractor_configuration instance-attribute
extractor_configuration

The configuration of the dynamic LLM extractor for each possible category

extractor_type instance-attribute
extractor_type

The extractor type to use for this category

DocumentExtractionConfiguration

Bases: BaseModel

Configuration to extract data from a document

category_extraction_configurations instance-attribute
category_extraction_configurations

The extraction configuration for each category (classification)

document_handler instance-attribute
document_handler

The document handler used by the extractors to fetch relevant example inputs and expected outputs

model_config class-attribute instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)