Configuration¶

The document parsing configuration is the sole interface to the document parsing component.

It consists of pydantic models that you create and register in the document parsing registry.
You can find an example for the FR retiree eligibility request document type here ⧉.

Each document type is configured with a DocumentParsingConfiguration.

DocumentAutoParsingFlowConfiguration ¶

Bases: BaseModel

Configuration for the automatic parsing flow

classification_configuration `instance-attribute` ¶

classification_configuration

The classification configuration

extraction_configuration `instance-attribute` ¶

extraction_configuration

The extraction configuration

should_create_parsing_task_on_failure `class-attribute` `instance-attribute` ¶

should_create_parsing_task_on_failure = True

transcription_configuration `instance-attribute` ¶

transcription_configuration

The transcription configuration

Manual parsing configuration¶

For each category, we declare the manual parsing configuration using the DocumentCategoryConfiguration.

Bases: BaseModel

Parsing configuration for a document category

category `instance-attribute` ¶

category

The document category

extraction_content_model `instance-attribute` ¶

extraction_content_model

Structured output model for the extraction content of the document category. It can be None if the document category is not supported for extraction. It can be used to generate the JSON schema for the manual parsing tool or to validate an automatic extraction. Add the json_schema_extra 'order' key to specify the order of the fields in the manual parsing tool.

icon `instance-attribute` ¶

icon

The icon to display for the category in the manual parsing tool, taken from https://tabler.io/icons (e.g. IconFileDollar)

internal_control_ratio `class-attribute` `instance-attribute` ¶

internal_control_ratio = Field(default=0.0, ge=0, le=1)

The ratio of internal control to apply to this category (between 0 and 1)
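The source does not say how the ratio is applied. One plausible reading, shown here purely as an assumption, is that each parsed document of the category is independently sampled for internal control with that probability. `selected_for_internal_control` is a hypothetical helper, not part of the component.

```python
import random


def selected_for_internal_control(ratio: float, rng: random.Random) -> bool:
    """Return True when a document should also go through internal control."""
    if not 0 <= ratio <= 1:
        # Same bounds as Field(ge=0, le=1) on internal_control_ratio.
        raise ValueError("internal_control_ratio must be between 0 and 1")
    # random() returns a float in [0.0, 1.0), so 0.0 never selects and 1.0 always does.
    return rng.random() < ratio


rng = random.Random(0)  # seeded only to make the sketch reproducible
sampled = sum(selected_for_internal_control(0.25, rng) for _ in range(10_000))
```

Under this reading, a ratio of 0.25 sends roughly a quarter of the category's documents to internal control, and the default of 0.0 disables it.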

unsupported `class-attribute` `instance-attribute` ¶

unsupported = False

If true, the document category is unsupported and cannot be extracted. It can only be rejected.

validate_json_schema `classmethod` ¶

validate_json_schema(extraction_content_model)

Check that the JSON schema is valid

Source code in components/documents/public/entities/parsing/parsing_configuration.py

@field_validator("extraction_content_model")
@classmethod
def validate_json_schema(
    cls, extraction_content_model: type[BaseModel] | None
) -> type[BaseModel] | None:
    """
    Check that the JSON schema is valid
    """
    if extraction_content_model:
        json_schema = extraction_content_model.model_json_schema()
        if "order" in json_schema:
            field_order = json_schema["order"]
            if not isinstance(field_order, list):
                raise ValueError(
                    f"Extraction content model {extraction_content_model.__name__} order key is not a list"
                )
            if len(json_schema["order"]) != len(json_schema["properties"]):
                raise ValueError(
                    f"Order is defined for {extraction_content_model.__name__} but not all fields are in the order"
                )
    return extraction_content_model
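To see what the validator accepts, the same 'order' check can be reproduced on a plain dictionary. This is a minimal sketch: `check_field_order` is a hypothetical stand-alone function, and `invoice_schema` is a hand-written stand-in for the dictionary that `model_json_schema()` would return for an assumed `InvoiceContent` model.

```python
def check_field_order(schema: dict, model_name: str = "Model") -> None:
    """Raise ValueError if 'order' is present but malformed."""
    if "order" not in schema:
        return  # ordering is optional
    field_order = schema["order"]
    if not isinstance(field_order, list):
        raise ValueError(f"{model_name} order key is not a list")
    if len(field_order) != len(schema["properties"]):
        raise ValueError(f"{model_name}: not all fields are in the order")


# Hand-written stand-in for a generated JSON schema (field names are illustrative).
invoice_schema = {
    "properties": {"issuer": {}, "issued_on": {}, "total_amount": {}},
    "order": ["issuer", "issued_on", "total_amount"],
}
check_field_order(invoice_schema, "InvoiceContent")  # passes silently

# Dropping fields from 'order' makes the check fail.
incomplete_schema = {**invoice_schema, "order": ["issuer"]}
try:
    check_field_order(incomplete_schema, "InvoiceContent")
    order_error = None
except ValueError as exc:
    order_error = str(exc)
```

Note that the check only compares lengths, so an 'order' list with the right number of entries but a misspelled field name would still pass.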

validate_unsupported_extraction_content_consistency ¶

validate_unsupported_extraction_content_consistency()

Validate that the extraction content model is consistent with the unsupported flag

Source code in components/documents/public/entities/parsing/parsing_configuration.py

@model_validator(mode="after")
def validate_unsupported_extraction_content_consistency(self) -> Self:
    """
    Validate the extraction content model
    """
    if self.extraction_content_model is None and not self.unsupported:
        raise ValueError(
            f"Document category {self.category} has no extraction content model and is not unsupported. Please check the configuration."
        )
    if self.unsupported and self.extraction_content_model:
        raise ValueError(
            f"Document category {self.category} is unsupported but has an extraction content model. Please check the configuration."
        )
    return self
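Taken together, the two checks above mean a category is valid only when exactly one of "has an extraction content model" and "is unsupported" holds. The sketch below enumerates the four combinations. `is_consistent` is a hypothetical helper written for illustration.

```python
def is_consistent(has_content_model: bool, unsupported: bool) -> bool:
    """A category is valid when exactly one of the two conditions is true."""
    return has_content_model != unsupported


# Truth table over (has_content_model, unsupported).
combinations = {
    (has_model, unsupported): is_consistent(has_model, unsupported)
    for has_model in (True, False)
    for unsupported in (True, False)
}
```

A supported category must therefore always ship a content model, and an unsupported one must leave it as None.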

Automatic Parsing configuration¶

The automatic parsing is configured using the DocumentAutoParsingFlowConfiguration. Each step of the flow has its own configuration:

Transcription: TranscriptionConfiguration
Classification: ClassificationConfiguration
Extraction: ExtractionConfiguration

Transcription¶

DocumentTranscriptionConfiguration ¶

Bases: BaseModel

Configuration to transcribe a document

min_confidence_score `class-attribute` `instance-attribute` ¶

min_confidence_score = 0.6

The document is sent to review if the confidence score is below this value (between 0 and 1)

min_text_length `class-attribute` `instance-attribute` ¶

min_text_length = 10

The document is sent to review if the text length is below this value

transcriber `class-attribute` `instance-attribute` ¶

transcriber = 'textract'

Transcriber provider. Only 'textract' and 'gemini' are supported for now
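The two thresholds can be read as a single review decision. The sketch below assumes strict "below" comparisons, that text length is counted in characters, and that either condition alone is enough to trigger a review. `TranscriptionThresholds` and `needs_review` are hypothetical stand-ins, not the real pydantic model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptionThresholds:
    """Stand-in holding the documented default values."""

    min_confidence_score: float = 0.6
    min_text_length: int = 10


def needs_review(
    text: str,
    confidence: float,
    thresholds: TranscriptionThresholds = TranscriptionThresholds(),
) -> bool:
    """Return True when the transcription should be sent to review."""
    too_uncertain = confidence < thresholds.min_confidence_score
    too_short = len(text) < thresholds.min_text_length
    return too_uncertain or too_short


ok = needs_review("Invoice no. 42, total 120 EUR", confidence=0.9)
short = needs_review("short", confidence=0.9)
uncertain = needs_review("Invoice no. 42, total 120 EUR", confidence=0.4)
```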

Classification¶

DocumentClassificationConfiguration ¶

Bases: BaseModel

The configuration used to classify a document. It can be used to classify the document on multiple classes

classifiers `instance-attribute` ¶

classifiers

The classifiers to use to classify the document. Each classifier can have multiple predicted classes as output. At least one classifier must be configured to predict the 'category' class.

model_post_init ¶

model_post_init(__context)

Register sagemaker predictor in the registry if not already done

Source code in components/documents/public/entities/parsing/flow/classification.py

def model_post_init(self, __context: Any) -> None:
    """
    Register sagemaker predictor in the registry if not already done
    """
    for classifier_configuration in self.classifiers:
        if classifier_configuration.predictor_type == "sagemaker":
            SageMakerPredictorRegistry.register_if_not_exists(
                cast(
                    "SageMakerPredictorConfiguration",
                    classifier_configuration.predictor_configuration,
                )
            )

validate_category_in_classifiers ¶

validate_category_in_classifiers()

Validate that the 'category' classifier is present

Source code in components/documents/public/entities/parsing/flow/classification.py

@model_validator(mode="after")
def validate_category_in_classifiers(self) -> Self:
    """
    Validate that the 'category' classifier is present
    """
    for classifier in self.classifiers:
        if "category" in classifier.classes:
            return self
    raise ValueError("Missing 'category' in the list of classes to classify")

validate_no_duplicate_classes ¶

validate_no_duplicate_classes()

Check that no class is predicted more than once.

Source code in components/documents/public/entities/parsing/flow/classification.py

@model_validator(mode="after")
def validate_no_duplicate_classes(self) -> Self:
    """Check that no class is predicted more than once."""
    classified_classes: set[str] = set()
    for classifier in self.classifiers:
        for class_name in classifier.classes:
            if class_name in classified_classes:
                raise ValueError(
                    f"The class {class_name} is predicted by multiple predictors"
                )
            classified_classes.add(class_name)
    return self

DocumentClassifierConfiguration ¶

Bases: BaseModel

The configuration to use a predictor to classify a document. A predictor can output multiple classes.

classes `instance-attribute` ¶

classes

The list of class names that will be output by the classifier.

fallback_label `property` ¶

fallback_label

If the returned label is not declared in the possible labels, we return this fallback label. By default, the fallback label is 'unclassifiable', which means that the document is unsupported and cannot be parsed.
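The fallback rule can be sketched for a single-class classifier, with labels simplified to plain strings instead of the dicts the real configuration uses. `resolve_label` is a hypothetical helper and the label values are illustrative.

```python
UNCLASSIFIABLE = "unclassifiable"  # documented default fallback label


def resolve_label(
    predicted: str, possible_labels: list[str], fallback: str = UNCLASSIFIABLE
) -> str:
    """Keep the predicted label only if it was declared as possible."""
    return predicted if predicted in possible_labels else fallback


declared = ["invoice", "quote"]
known = resolve_label("invoice", declared)
unknown = resolve_label("payslip", declared)
```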

min_confidence `class-attribute` `instance-attribute` ¶

min_confidence = None

The minimum confidence required. Below it, the document is sent to review

name `property` ¶

name

The name of the classifier

possible_labels `instance-attribute` ¶

possible_labels

The possible labels that the predictor can return. Each label should be a dict whose keys are class names. For a single-class classifier, the list can simply contain the possible values for the class.

predictor_configuration `instance-attribute` ¶

predictor_configuration

The predictor configuration

predictor_type `instance-attribute` ¶

predictor_type

The predictor type to use: 'sagemaker' for a SageMaker predictor, 'llm' for an LLM predictor, 'fixed' for a fixed classification to the unique possible label, and 'pythonic' for a Python predictor.

coerce_possible_labels `classmethod` ¶

coerce_possible_labels(data)

Coerce possible labels from a list of strings to a list of dicts. This is a convenience pre-processor that simplifies the configuration.

Source code in components/documents/public/entities/parsing/flow/classification.py

@model_validator(mode="before")
@classmethod
def coerce_possible_labels(cls, data: Any) -> Any:
    """
    Coerce possible labels from a list of string to dict. This is a convenient pre-processor to simplify the configuration.
    """
    possible_labels = data.get("possible_labels")
    classes = data.get("classes")
    if classes and isinstance(classes, list) and len(classes) == 1:
        if isinstance(possible_labels, list):
            processed_possible_labels = []
            for label in possible_labels:
                if isinstance(label, str):
                    processed_possible_labels.append({classes[0]: label})
                else:
                    processed_possible_labels.append(label)
            data["possible_labels"] = processed_possible_labels
    return data
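The effect of this pre-processor is easiest to see on raw input. The sketch below restates the same logic as a plain function over a dictionary, outside pydantic, and runs it on two hypothetical inputs. The label values are illustrative.

```python
from typing import Any


def coerce_possible_labels(data: dict[str, Any]) -> dict[str, Any]:
    """Plain-dict version of the 'before' validator shown above."""
    possible_labels = data.get("possible_labels")
    classes = data.get("classes")
    single_class = isinstance(classes, list) and len(classes) == 1
    if single_class and isinstance(possible_labels, list):
        # Wrap bare strings as {class_name: label}; leave dicts untouched.
        data["possible_labels"] = [
            {classes[0]: label} if isinstance(label, str) else label
            for label in possible_labels
        ]
    return data


single = coerce_possible_labels(
    {"classes": ["category"], "possible_labels": ["invoice", "quote"]}
)
multi = coerce_possible_labels(
    {
        "classes": ["category", "language"],
        "possible_labels": [{"category": "invoice", "language": "fr"}],
    }
)
```

With a single class, `single["possible_labels"]` becomes `[{"category": "invoice"}, {"category": "quote"}]`. With several classes, the labels must already be dicts and are left unchanged.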

validate_llm_sagemaker_predictor_single_class ¶

validate_llm_sagemaker_predictor_single_class()

Validate that if the predictor type is a sagemaker or LLM one, the list of classes is composed of only one class. These two predictors do not handle multiple class predictions for now.

Source code in components/documents/public/entities/parsing/flow/classification.py

@model_validator(mode="after")
def validate_llm_sagemaker_predictor_single_class(self) -> Self:
    """
    Validate that if the predictor type is a sagemaker or LLM one, the list of classes is composed of only one class.
    These two predictors do not handle multiple class predictions for now.
    """
    if self.predictor_type in ["sagemaker", "llm"] and len(self.classes) > 1:
        raise ValueError(
            f"Predictor {self.predictor_type} can only handle one class prediction."
        )
    return self

validate_possible_classes ¶

validate_possible_classes()

Validate that possible classes are dict where keys represent class names.

Source code in components/documents/public/entities/parsing/flow/classification.py

@model_validator(mode="after")
def validate_possible_classes(self) -> Self:
    """
    Validate that possible classes are dict where keys represent class names.
    """
    for possible_class in self.possible_labels:
        for key in possible_class.keys():
            if key not in self.classes:
                raise ValueError(f"{key} is not defined as a class name.")
    return self

validate_predictor_configuration_type ¶

validate_predictor_configuration_type()

Validate that the predictor configuration is consistent with the predictor type

Source code in components/documents/public/entities/parsing/flow/classification.py

@model_validator(mode="after")
def validate_predictor_configuration_type(self) -> Self:
    """
    Validate that the predictor configuration is consistent with the predictor type
    """
    if self.predictor_type == "sagemaker":
        if not isinstance(
            self.predictor_configuration, SageMakerPredictorConfiguration
        ):
            raise ValueError(
                "The predictor configuration must be a SageMakerPredictorConfiguration"
            )
    elif self.predictor_type == "llm":
        if not isinstance(self.predictor_configuration, LLMPredictorConfiguration):
            raise ValueError(
                "The predictor configuration must be a LLMPredictorConfiguration"
            )
    elif self.predictor_type == "pythonic":
        if not isinstance(
            self.predictor_configuration, PythonicPredictorConfiguration
        ):
            raise ValueError(
                "The predictor configuration must be a PythonicPredictorConfiguration"
            )
    elif self.predictor_type == "fixed":
        if self.predictor_configuration is not None:
            raise ValueError(
                "The predictor configuration must be null if the predictor type is 'fixed'"
            )
        if len(self.possible_labels) > 1:
            raise ValueError(
                "The predictor configuration must contain only one possible label for 'fixed' predictor."
            )
    return self
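The branches above amount to a lookup from predictor type to expected configuration class, plus two extra rules for 'fixed'. The sketch below restates that with empty stand-in classes. `check_predictor` and `EXPECTED_CONFIGURATION` are hypothetical names, and the real configuration classes carry fields that are not shown here.

```python
class SageMakerPredictorConfiguration: ...
class LLMPredictorConfiguration: ...
class PythonicPredictorConfiguration: ...


# Predictor type -> configuration class the validator expects.
EXPECTED_CONFIGURATION = {
    "sagemaker": SageMakerPredictorConfiguration,
    "llm": LLMPredictorConfiguration,
    "pythonic": PythonicPredictorConfiguration,
}


def check_predictor(
    predictor_type: str, configuration: object, possible_labels: list
) -> None:
    """Raise ValueError when type, configuration and labels disagree."""
    if predictor_type == "fixed":
        if configuration is not None:
            raise ValueError("The predictor configuration must be null for 'fixed'")
        if len(possible_labels) > 1:
            raise ValueError("A 'fixed' predictor allows only one possible label")
        return
    expected = EXPECTED_CONFIGURATION.get(predictor_type)
    if expected is not None and not isinstance(configuration, expected):
        raise ValueError(
            f"The predictor configuration must be a {expected.__name__}"
        )


check_predictor("llm", LLMPredictorConfiguration(), [{"category": "invoice"}])
check_predictor("fixed", None, [{"category": "invoice"}])
try:
    check_predictor("sagemaker", LLMPredictorConfiguration(), [])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```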

Extraction¶

DocumentCategoryExtractionConfiguration ¶

Bases: BaseModel

Configuration to extract data from a given document category

auto_validate `property` ¶

auto_validate

Returns whether the auto-validation is enabled

auto_validate_flag `class-attribute` `instance-attribute` ¶

auto_validate_flag = True

Whether to auto-validate the extracted data. If set to False, the extraction result is sent to review (auto-populated). It can also be set to a feature flag name, in which case auto-validation is enabled only if that feature flag is enabled.
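How a flag that is either a boolean or a feature flag name might resolve to the `auto_validate` property can be sketched as follows. The feature flag lookup is passed in as a callable because the source does not show the real one. `resolve_auto_validate` and the flag name are hypothetical.

```python
from typing import Callable, Union


def resolve_auto_validate(
    flag: Union[bool, str], is_feature_enabled: Callable[[str], bool]
) -> bool:
    """Return the effective auto-validation setting."""
    if isinstance(flag, bool):
        return flag  # explicit on/off
    return is_feature_enabled(flag)  # treat the string as a feature flag name


enabled_flags = {"auto_validate_invoices"}  # hypothetical feature flag name
lookup = enabled_flags.__contains__

always_on = resolve_auto_validate(True, lookup)
behind_enabled_flag = resolve_auto_validate("auto_validate_invoices", lookup)
behind_disabled_flag = resolve_auto_validate("auto_validate_quotes", lookup)
```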

extractor_configuration `instance-attribute` ¶

extractor_configuration

The configuration of the dynamic LLM extractor for each possible category

extractor_type `instance-attribute` ¶

extractor_type

The extractor type to use for this category

DocumentExtractionConfiguration ¶

Bases: BaseModel

Configuration to extract data from a document

category_extraction_configurations `instance-attribute` ¶

category_extraction_configurations

The extraction configuration for each category (classification)

document_handler `instance-attribute` ¶

document_handler

The document handler used by the extractors to fetch relevant example inputs and expected outputs

model_config `class-attribute` `instance-attribute` ¶

model_config = ConfigDict(arbitrary_types_allowed=True)
