API reference
components.documents.public.blueprint ¶
components.documents.public.business_logic ¶
batches ¶
actions ¶
add_document_to_batch ¶
Adds a document to an existing performance batch by inserting a record into the DOCUMENT_PARSING.BATCH_DOCUMENTS table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_id` | `int` | ID of the batch to add the document to | *required* |
| `document_id` | `str` | ID of the document to add to the batch | *required* |
| `external_id` | `str \| None` | External ID of the document | `None` |
| `document_type` | `str \| None` | Type of the document | `None` |
| `stack` | `str \| None` | Stack the document belongs to | `None` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `batch_id` is invalid or the document doesn't exist |
Source code in components/documents/public/business_logic/batches/actions.py
create_new_batch ¶
Creates a new batch by adding a row in the Turing table DOCUMENT_PARSING.PERFORMANCE_BATCH.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_name` | `str` | Name of the batch to create | *required* |

Returns:

| Type | Description |
|---|---|
| `str` | The batch name |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `batch_name` is empty or `None` |
Source code in components/documents/public/business_logic/batches/actions.py
delete_batch ¶
Deletes a batch from Turing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_id` | `int` | ID of the batch to delete | *required* |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `batch_id` is empty |
Source code in components/documents/public/business_logic/batches/actions.py
flag_batch_document_as_validated ¶
Updates the is_validated flag for a document in all its batches.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `document_id` | `str` | ID of the document to update | *required* |
| `validated` | `bool` | Whether to mark the document as validated or not | *required* |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the document is not found |
Source code in components/documents/public/business_logic/batches/actions.py
remove_document_from_batch ¶
Removes a document from a performance batch by deleting the record from the DOCUMENT_PARSING.BATCH_DOCUMENTS table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_id` | `int` | ID of the batch to remove the document from | *required* |
| `document_id` | `str` | ID of the document to remove from the batch | *required* |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `batch_id` is invalid |
Source code in components/documents/public/business_logic/batches/actions.py
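Taken together, the batch actions above implement a simple membership contract. The following is a minimal in-memory sketch of that contract (the class, storage, and error messages here are illustrative assumptions; the real functions read and write the `DOCUMENT_PARSING.PERFORMANCE_BATCH` and `DOCUMENT_PARSING.BATCH_DOCUMENTS` tables in Turing):

```python
class InMemoryBatchStore:
    """Toy stand-in for the PERFORMANCE_BATCH / BATCH_DOCUMENTS tables."""

    def __init__(self):
        self.batches = {}             # batch_id -> batch_name
        self.batch_documents = set()  # (batch_id, document_id) pairs

    def create_new_batch(self, batch_name):
        if not batch_name:
            raise ValueError("batch_name must not be empty")
        batch_id = len(self.batches) + 1
        self.batches[batch_id] = batch_name
        return batch_name  # the real function also returns the batch name

    def add_document_to_batch(self, batch_id, document_id):
        if batch_id not in self.batches:
            raise ValueError(f"invalid batch_id: {batch_id}")
        self.batch_documents.add((batch_id, document_id))

    def remove_document_from_batch(self, batch_id, document_id):
        if batch_id not in self.batches:
            raise ValueError(f"invalid batch_id: {batch_id}")
        self.batch_documents.discard((batch_id, document_id))

store = InMemoryBatchStore()
store.create_new_batch("2024-q3-regression")
store.add_document_to_batch(1, "doc-123")
```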
queries ¶
get_all_batches ¶
Retrieves all performance batches from Turing.
Returns:

| Type | Description |
|---|---|
| `list[Batch]` | List of dictionaries containing batch information |
Source code in components/documents/public/business_logic/batches/queries.py
get_batch_by_id ¶
Retrieves a specific performance batch from Turing by its ID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_id` | `int` | ID of the batch to retrieve | *required* |

Returns:

| Type | Description |
|---|---|
| `Batch` | Dictionary containing batch information |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If batch with the specified ID is not found |
Source code in components/documents/public/business_logic/batches/queries.py
get_documents_by_batch ¶
Retrieves all documents belonging to a specific batch.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_id` | `int` | The batch identifier to filter documents by | *required* |

Returns:

| Type | Description |
|---|---|
| `list[BatchDocumentInfo]` | List of dictionaries containing document information |
Source code in components/documents/public/business_logic/batches/queries.py
comparison ¶
abstract_performance_runner ¶
AbstractDocProcessingPerformanceTestRunner ¶
AbstractDocProcessingPerformanceTestRunner(
run_id=None, run_name=None, save=False, job_timeout=1200
)
Bases: ABC, Generic[Entry]
Abstract class for document processing performance test runners.
Source code in components/documents/public/business_logic/comparison/abstract_performance_runner.py
dashboard_url ¶
Generate a URL to the Metabase dashboard where the performance test results can be viewed.
Source code in components/documents/public/business_logic/comparison/abstract_performance_runner.py
fetch_entry
abstractmethod
staticmethod
¶
Fetch a single entry from the dataset based on the document ID.
Source code in components/documents/public/business_logic/comparison/abstract_performance_runner.py
run_and_report
abstractmethod
¶
Run the processing and generate a report for the given entry.
Source code in components/documents/public/business_logic/comparison/abstract_performance_runner.py
run_async ¶
Enqueue the performance tests for the given entry to be run asynchronously. This method will create a claim engine job that can be run in the background, allowing for parallel processing of multiple entries.
Source code in components/documents/public/business_logic/comparison/abstract_performance_runner.py
run_enqueueable
staticmethod
¶
Enqueueable function to run the performance tests synchronously because we need a static method to be compatible with our queuing system.
We use the runner_cls_qualified_name to dynamically import the right runner class.
Source code in components/documents/public/business_logic/comparison/abstract_performance_runner.py
run_sync ¶
Run the performance tests synchronously on the given entry and return a report.
This is just a wrapper around run_and_report that takes care of logging and saving the report.
Source code in components/documents/public/business_logic/comparison/abstract_performance_runner.py
stable_new_run_id ¶
Generate a stable run_id based on the current run_id. For an example of usage, see the FrPostprocessingPerformanceTestRunner, where we run two performance runs at the same time: one for extraction and one for the postprocessing based on extraction.
Source code in components/documents/public/business_logic/comparison/abstract_performance_runner.py
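A stable derivation like the one described above can be sketched as follows (the `suffix` parameter and the hashing scheme are assumptions for illustration; the real method may derive the id differently):

```python
import hashlib

def stable_new_run_id(run_id: str, suffix: str) -> str:
    # Hypothetical derivation: hash the current run_id plus a suffix so the
    # derived id is deterministic and the two linked runs stay paired.
    digest = hashlib.sha256(f"{run_id}:{suffix}".encode()).hexdigest()[:8]
    return f"{run_id}-{suffix}-{digest}"

extraction_run = stable_new_run_id("run-42", "extraction")
postprocessing_run = stable_new_run_id("run-42", "postprocessing")
```

Because the derivation is deterministic, re-running the same performance run reproduces the same pair of ids.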
BasePerformanceRunDatasetEntry ¶
Bases: BaseModel
Base class for a single entry in a performance run dataset. You might want to extend this class to add more fields according to the context of your stack.
Note that the document_id is expected to be a string to be compatible with different ID formats (int for France and uuid in the global stack).
keys_to_lowercase
classmethod
¶
Convert all keys in the input data to lowercase in order to make the model case-insensitive. This is useful when the input data will come from a Turing query.
Source code in components/documents/public/business_logic/comparison/abstract_performance_runner.py
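The transformation itself is a one-liner; here it is as a standalone function (in the real code it is a classmethod pre-validator on the Pydantic model, not a free function):

```python
def keys_to_lowercase(data: dict) -> dict:
    # Lowercase every key so the model accepts rows regardless of the
    # column-name casing returned by the Turing query.
    return {k.lower(): v for k, v in data.items()}

row = {"DOCUMENT_ID": "42", "Category": "invoice"}
entry_kwargs = keys_to_lowercase(row)
```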
extraction_performance_runner ¶
DocExtractionPerformanceTestRunner ¶
Bases: AbstractDocProcessingPerformanceTestRunner[BasePerformanceRunDatasetEntry]
Performance test runner for document extraction step.
Source code in components/documents/public/business_logic/comparison/abstract_performance_runner.py
fetch_dataset
staticmethod
¶
Fetch the dataset of entries for the extraction performance test.
- If document_ids is provided, only those documents will be fetched.
- If batch_label is provided, only documents with that batch label will be fetched.
- For now the label is taken from the upload metadata of the document, but later we'll use Alexandre's work
Source code in components/documents/public/business_logic/comparison/extraction_performance_runner.py
fetch_entry
staticmethod
¶
Fetch a single entry from the dataset based on the document ID.
Source code in components/documents/public/business_logic/comparison/extraction_performance_runner.py
run_and_report ¶
Run the document extraction and generate a report against the latest validated extraction result.
Source code in components/documents/public/business_logic/comparison/extraction_performance_runner.py
report ¶
PerformanceRunReport
dataclass
¶
PerformanceRunReport(
env,
commit,
run_id,
run_name,
run_at,
document_type,
document_id,
category,
subcategory,
side_by_side_classification,
side_by_side_reasons_for_review,
side_by_side_result,
result_mismatches,
)
Represents a performance run report for document processing.
This report includes the results of a performance run against a document:
- Some context about the run (e.g. environment, commit, run ID, run name)
- Some fact about the document (e.g. type, id, category, subcategory)
- Side-by-side comparison of the classification, reasons for review, and extraction results
- Mismatches found in the extraction results, meant to be the result of the Turing function pdc_mismatch_report
The two main use cases are:
- To generate a nice textual report of the performance run over a document
- To save the report to Turing for further analysis and comparison with other runs
To build an instance, we provide two factory method helpers:
- from_document_extraction_result: to build the report from a document extraction result
- from_performance_run_diff: to build the report from a performance run diff (backward compatibility with the previous implementation)
Note that it doesn't support the rejection content for now.
Mismatch
dataclass
¶
Bases: DataClassJsonMixin
Used to represent a mismatch in the performance run report.
It has the same data structure as the one used by the Turing function pdc_mismatch_report.
SideBySideValue ¶
Bases: Generic[T]
A generic wrapper for side-by-side values in the report.
Source code in components/documents/public/business_logic/comparison/report.py
staticmethod
¶
Factory helper to create a SideBySideValue from a list of FieldDiffs. It's meant to ease backward compatibility with the previous PerformanceRunDiff implementation.
Source code in components/documents/public/business_logic/comparison/report.py
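A generic side-by-side wrapper like this one can be sketched as a small dataclass (the field names `expected`/`actual` and the `matches` property are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")

@dataclass
class SideBySideValue(Generic[T]):
    # One slot for the expected (validated) value and one for the value
    # produced by the current performance run.
    expected: T
    actual: T

    @property
    def matches(self) -> bool:
        return self.expected == self.actual
```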
as_textual_report ¶
Generate a textual report of the performance run with nice table formatting.
Source code in components/documents/public/business_logic/comparison/report.py
from_document_extraction_result
classmethod
¶
Create a PerformanceRunReport from the expected and actual document extraction results.
Source code in components/documents/public/business_logic/comparison/report.py
from_performance_run_diff
classmethod
¶
Create a PerformanceRunReport from a PerformanceRunDiff.
Source code in components/documents/public/business_logic/comparison/report.py
save_to_turing
classmethod
¶
Save the performance run report to Turing database.
Source code in components/documents/public/business_logic/comparison/report.py
document ¶
actions ¶
delete_document ¶
Hard deletes a global document including:
- The Document DB record (cascades to all related records via FK CASCADE)
- The physical S3 file
This is typically called when a corresponding FR insurance document is tombstoned.
Source code in components/documents/public/business_logic/document/actions.py
upload_document ¶
upload_document(
uploader_ref,
document_type,
file,
upload_metadata=None,
trigger_parsing_flow=True,
commit=True,
)
Uploads a document and stores it in S3. This method triggers the document parsing flow if a configuration is registered.
Source code in components/documents/public/business_logic/document/actions.py
queries ¶
get_document_content ¶
Get the content of the document
Source code in components/documents/public/business_logic/document/queries.py
get_document_info ¶
Get the general information of the document
Source code in components/documents/public/business_logic/document/queries.py
get_temporary_download_url ¶
Get a temporary URL to download the document
Source code in components/documents/public/business_logic/document/queries.py
document_handler ¶
base_document_handler ¶
BaseDocumentHandler ¶
Bases: ABC
Class for handling document-related retrieval operations. This class is used to fetch documents and their related entities from the database while keeping the documents module agnostic of the database models.
get_document
abstractmethod
¶
Fetches the document from the appropriate table.
get_document_expected_output
abstractmethod
¶
Fetches the structured output of a document from the appropriate table. Return None if the document is not yet parsed.
Source code in components/documents/public/business_logic/document_handler/base_document_handler.py
get_document_markdown_transcription
abstractmethod
¶
Fetches the document markdown transcription of the document from the appropriate table.
Source code in components/documents/public/business_logic/document_handler/base_document_handler.py
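The abstract interface above can be implemented against any storage backend. Below is a sketch of the three abstract methods with a toy in-memory implementation (method signatures are assumptions; the real interface works against the database models):

```python
from abc import ABC, abstractmethod

class BaseDocumentHandler(ABC):
    # Signatures are assumptions; the real interface may take extra context.
    @abstractmethod
    def get_document(self, document_id: str): ...

    @abstractmethod
    def get_document_expected_output(self, document_id: str): ...

    @abstractmethod
    def get_document_markdown_transcription(self, document_id: str): ...

class InMemoryDocumentHandler(BaseDocumentHandler):
    """Toy implementation backed by a dict instead of database models."""

    def __init__(self, documents):
        self.documents = documents

    def get_document(self, document_id):
        return self.documents[document_id]

    def get_document_expected_output(self, document_id):
        # None when the document is not yet parsed, as the contract requires.
        return self.documents[document_id].get("expected_output")

    def get_document_markdown_transcription(self, document_id):
        return self.documents[document_id].get("markdown")

handler = InMemoryDocumentHandler({"doc-1": {"markdown": "# Invoice 42"}})
```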
default_document_handler ¶
DocumentHandler ¶
Bases: BaseDocumentHandler
Document handler implementation for the Document modular monolith data model
Source code in components/documents/public/business_logic/document_handler/default_document_handler.py
get_document ¶
Source code in components/documents/public/business_logic/document_handler/default_document_handler.py
get_document_expected_output ¶
Source code in components/documents/public/business_logic/document_handler/default_document_handler.py
get_document_markdown_transcription ¶
Source code in components/documents/public/business_logic/document_handler/default_document_handler.py
embedding ¶
actions ¶
delete_document_embedding ¶
Delete a document embedding.
Source code in components/documents/public/business_logic/embedding/actions.py
delete_document_embeddings ¶
Delete all document embeddings for a given type.
Source code in components/documents/public/business_logic/embedding/actions.py
index_document ¶
Index a single document. If the document already exists (uniqueness constraint on type/id), it will be updated.
Note: document embeddings are segregated by the type column to differentiate them (fr insurance docs vs github issues vs guarantees, etc)
Source code in components/documents/public/business_logic/embedding/actions.py
index_documents ¶
Index multiple documents. If a document already exists (uniqueness constraint on type/id), it will be updated.
Source code in components/documents/public/business_logic/embedding/actions.py
queries ¶
MetadataFilterBuilder
module-attribute
¶
SimilarDocument
dataclass
¶
count_indexed_documents ¶
Source code in components/documents/public/business_logic/embedding/queries.py
fetch_text_embedding ¶
Get the embedding for the given text using the specified embedding algorithm.
Source code in components/documents/public/business_logic/embedding/queries.py
find_documents_by_metadata_filter ¶
Find documents by metadata filter. <!> The metadata column is not indexed for now, so this query can be expensive.
:param document_type: The type of the document to find.
:param metadata_filter: The metadata filter to apply.
:return: The list of documents matching the metadata filter.
Source code in components/documents/public/business_logic/embedding/queries.py
find_similar_documents ¶
find_similar_documents(
type,
text,
embedding_algorithm,
exclude_document_id=None,
restrict_to_document_ids=None,
metadata_filter=None,
n_results=10,
doc_id_to_try_to_fetch_text_embeddings_from_db=None,
use_approximate_search=False,
)
Find similar documents to the given text. <!> The search is only performed on documents that have been indexed with the SAME embedding algorithm.
:param type: The type of the document to find similar documents to.
:param text: The text to find similar documents to.
:param embedding_algorithm: The embedding algorithm to use.
:param exclude_document_id: The document id to exclude from the results (useful to not return the same document).
:param restrict_to_document_ids: If set, will restrict the search to these document ids. Useful when you have a fixed set of reference documents.
:param metadata_filter: A function that takes the document metadata and returns a list of SQLAlchemy filters to apply.
:param n_results: The number of results to return.
:param doc_id_to_try_to_fetch_text_embeddings_from_db: If None, we will compute the embedding of the text we search for. If set to a doc id, we will first attempt to fetch the text embedding from the DB.
:param use_approximate_search: If True, will use the HNSW index for faster, approximate matching (cf. https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw).
Source code in components/documents/public/business_logic/embedding/queries.py
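Conceptually, the exact search ranks indexed documents by embedding similarity to the query text. A minimal in-memory sketch of that ranking (function names and the use of cosine similarity are assumptions; the real query runs in Postgres via pgvector):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_similar(query_embedding, indexed_documents, exclude_document_id=None, n_results=10):
    # indexed_documents: (document_id, embedding) pairs that all share ONE
    # embedding algorithm, mirroring the same-algorithm constraint above.
    candidates = [
        (doc_id, emb) for doc_id, emb in indexed_documents
        if doc_id != exclude_document_id
    ]
    ranked = sorted(
        candidates,
        key=lambda d: cosine_similarity(query_embedding, d[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:n_results]]

ids = find_similar([1.0, 0.1], [("a", [1.0, 0.0]), ("b", [0.0, 1.0])], n_results=1)
```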
extraction ¶
extraction_logic ¶
ExtractionInstructionPresenter ¶
Bases: BaseModel
A wrapper to present the instruction to extract fields from a document. It's purely a presentation layer to make the jinja template easier to write/read.
example_raw_content ¶
Build an example of extraction result content.
Source code in components/documents/public/business_logic/extraction/extraction_logic.py
from_extraction_result_model
classmethod
¶
Recursively build ExtractionInstructionPresenter objects from a Pydantic model.
Source code in components/documents/public/business_logic/extraction/extraction_logic.py
FieldExtractionInstructionPresenter ¶
Bases: BaseModel
A wrapper to present the instruction to extract a single field. It's purely a presentation layer to make the jinja template easier to write/read.
example_extracted_value ¶
Build an example of value extracted for this field.
Source code in components/documents/public/business_logic/extraction/extraction_logic.py
factory ¶
ExtractorFactory ¶
Factory to build an extractor based on the extractor type.
build_extractor
classmethod
¶
Build an extractor based on the extractor type.
Source code in components/documents/public/business_logic/extraction/factory.py
helpers ¶
call_gpt_chat ¶
call_gpt_chat(
instructions,
user_input,
prompt_examples,
llm_model,
use_json_mode,
call_context=None,
)
Call an OpenAI chat model with the given instructions, transcription text and prompt examples. It automatically retries if we reach the rate limit
:param call_context: Existing context to continue the conversation
Source code in components/documents/public/business_logic/extraction/helpers.py
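The automatic retry on rate limits can be sketched as a generic wrapper (the exception type and retry policy here are illustrative assumptions, not the real implementation):

```python
import time

class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit exception (assumption)."""

def call_with_retry(fn, max_retries=3, delay=0.0):
    # Illustrative retry loop: re-invoke the chat call when the provider
    # signals a rate limit, as call_gpt_chat does automatically.
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)

calls = {"n": 0}

def flaky_chat_call():
    # Fails once with a rate limit, then succeeds.
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimitError("rate limited")
    return "ok"

result = call_with_retry(flaky_chat_call)
```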
prompt_builder ¶
build_prompt_from_file ¶
Build an LLM prompt from a file.
build_prompt_from_jinja ¶
build_prompt_from_jinja(
prompt_jinja_env,
prompt_dir,
prompt_filename,
parameters=None,
extraction_result_model=None,
)
Build an LLM prompt from a jinja template or a file using the given parameters. When extraction_result_model is provided, it will be used to generate instructions for the extraction.
Source code in components/documents/public/business_logic/extraction/prompt_builder.py
get_jinja_env_for_dir ¶
Get jinja environment for a directory for rendering templates.
Source code in components/documents/public/business_logic/extraction/prompt_builder.py
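Parameter substitution into a prompt template works roughly as below; this sketch uses the stdlib `string.Template` as a stand-in, whereas the real helpers render a Jinja template loaded from `prompt_dir` with a configured environment:

```python
from string import Template

def build_prompt(template_text: str, parameters: dict) -> str:
    # Substitute parameters into the template text (illustrative stand-in
    # for the Jinja rendering done by build_prompt_from_jinja).
    return Template(template_text).substitute(parameters)

prompt = build_prompt(
    "Extract the following fields from the $document_type: $fields",
    {"document_type": "invoice", "fields": "total_amount, due_date"},
)
```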
parsing ¶
actions ¶
ask_for_review ¶
This function is called when we need a new parsing to correct the previous one. A new task will be created to ask for a new parsing. The document should have no open task. Nothing is committed in the session.
:param document_id: The document id
:param actor_ref: The identifier of the actor asking for review (e.g. operator id)
:param operator_comment: The comment to add to the parsing task
Source code in components/documents/public/business_logic/parsing/actions.py
parsing_configuration_registry ¶
DocumentParsingConfigurationRegistry ¶
Registry of document parsing configurations.
get_configuration
classmethod
¶
Get a document parsing configuration by document type.
Source code in components/documents/public/business_logic/parsing/parsing_configuration_registry.py
get_document_types
classmethod
¶
Get the list of document types for which parsing configurations are registered.
:param has_auto_parsing_configuration: If True, only return document types with auto-parsing configuration
Source code in components/documents/public/business_logic/parsing/parsing_configuration_registry.py
register
classmethod
¶
Register a document parsing configuration.
Source code in components/documents/public/business_logic/parsing/parsing_configuration_registry.py
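The registry pattern described above can be sketched as a class-level mapping keyed by document type (configuration shape and the `auto_parsing` key are assumptions; the real registry stores richer configuration objects):

```python
class DocumentParsingConfigurationRegistry:
    _configurations: dict = {}

    @classmethod
    def register(cls, document_type, configuration):
        cls._configurations[document_type] = configuration

    @classmethod
    def get_configuration(cls, document_type):
        return cls._configurations[document_type]

    @classmethod
    def get_document_types(cls, has_auto_parsing_configuration=False):
        # Optionally keep only types whose configuration enables auto-parsing.
        return [
            doc_type
            for doc_type, config in cls._configurations.items()
            if not has_auto_parsing_configuration or config.get("auto_parsing")
        ]

DocumentParsingConfigurationRegistry.register("invoice", {"auto_parsing": True})
DocumentParsingConfigurationRegistry.register("id_card", {"auto_parsing": False})
```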
queries ¶
get_latest_parsing_data ¶
Get the last parsing step results
Source code in components/documents/public/business_logic/parsing/queries.py
get_parsing_data ¶
Get the parsing step results for a specific version
Source code in components/documents/public/business_logic/parsing/queries.py
components.documents.public.commands ¶
app_group ¶
documents_commands
module-attribute
¶
documents_commands = AppGroup(
"documents",
help="Main command group for the documents component",
monitor_command_on_slack=False,
)
controls ¶
create_internal_controls ¶
Process operation tasks to open with external task provider
Source code in components/documents/public/commands/controls.py
document_embedding ¶
recompute_documents_embedding ¶
Recompute documents embedding for a given type
Source code in components/documents/public/commands/document_embedding.py
document_operation_task ¶
process_created_operation_tasks ¶
Process operation tasks to open with external task provider
Source code in components/documents/public/commands/document_operation_task.py
embedding ¶
index_documents ¶
Asynchronously compute document embeddings for a given list of document types and a time range. It only computes embeddings for documents that are not already embedded, and only handles document types that can be automatically parsed. If a document doesn't have a transcription or a valid classification result (i.e. it is not yet parsed and validated), it is skipped.
Source code in components/documents/public/commands/embedding.py
performance ¶
import_latest_extraction_result_from_turing ¶
Import the latest extraction result from Turing for a given batch.
Source code in components/documents/public/commands/performance.py
transcribe ¶
Run transcription on the given document
Source code in components/documents/public/commands/performance.py
transcription ¶
transcribe_document ¶
Run the transcription logic for a given document and save it.
Source code in components/documents/public/commands/transcription.py
transcribe_documents ¶
Transcribe all documents of a given type that have a valid classification result and no transcription result. This command is useful when an auto-parsing configuration is added for a document type: it will massively transcribe all past documents. This is needed to index the document embeddings so they can be used as similar examples for automatic extraction with the dynamic LLM extractor.
Source code in components/documents/public/commands/transcription.py
components.documents.public.constants ¶
components.documents.public.controllers ¶
batches ¶
DocumentBatchController ¶
Bases: BaseController
Controller to manage documents in a batch
delete ¶
Remove a document from its batch
Source code in components/documents/public/controllers/batches.py
get ¶
Returns the documents of a performance batch
Source code in components/documents/public/controllers/batches.py
post ¶
Add documents to a batch
Source code in components/documents/public/controllers/batches.py
DocumentReferenceController ¶
Bases: BaseController
Controller to flag a document as validated or not
post ¶
Flag a document as validated or not
Source code in components/documents/public/controllers/batches.py
DocumentsBatchesController ¶
Bases: BaseController
Controller to manage performance batches
delete ¶
Deletes a performance batch
Source code in components/documents/public/controllers/batches.py
get ¶
Returns all available performance batches or a specific batch by ID
Source code in components/documents/public/controllers/batches.py
post ¶
Creates a new performance batch
Source code in components/documents/public/controllers/batches.py
document ¶
DocumentContentController ¶
Bases: BaseController
Controller to get the content type of the document
get ¶
Used by generic document viewer to get the content of the document
Source code in components/documents/public/controllers/document.py
DocumentController ¶
Bases: BaseController
Controller to get the information of the document
get ¶
Returns the general information of the document (not the content)
Source code in components/documents/public/controllers/document.py
DocumentGetLatestParsingController ¶
Bases: BaseController
Get the last parsing result of a document
get ¶
Get the last Classification and Extraction
Source code in components/documents/public/controllers/document.py
DocumentInternalControlReviewController ¶
Bases: BaseController
Controller to get an internal control review by its ID
get ¶
Get an internal control review by its ID
Source code in components/documents/public/controllers/document.py
DocumentSubmitParsingController ¶
Bases: BaseController
Manually submit a parsing for a document
InternalControlReviewParam
dataclass
¶
post ¶
Create a new classification and extraction result
Source code in components/documents/public/controllers/document.py
DocumentTemporaryUrlController ¶
Bases: BaseController
Controller to get the temporary url of the document
get ¶
Used by generic document viewer to get the temporary url of the document
Source code in components/documents/public/controllers/document.py
document_type ¶
DocumentCategorySchemaController ¶
Bases: BaseController
Controller to get the content type of the document
get ¶
Get the schemas for the document category
Source code in components/documents/public/controllers/document_type.py
DocumentTypeCategoriesController ¶
Bases: BaseController
Controller to get the categories of the document type
get ¶
Returns the categories of the document type
Source code in components/documents/public/controllers/document_type.py
upload ¶
UploadController ¶
Bases: BaseController
Documents are uploaded through this endpoint.
post ¶
Controller to upload a document, to be processed by the global document processing stack.
Source code in components/documents/public/controllers/upload.py
components.documents.public.entities ¶
classification ¶
configuration ¶
BasePredictorConfiguration ¶
Bases: BaseModel
Base class for a predictor configuration
prepare_document_for_predictor ¶
Prepare the document to be sent to the predictor
Source code in components/documents/public/entities/classification/configuration.py
LLMPredictorConfiguration ¶
Bases: BasePredictorConfiguration
The configuration to use an LLM predictor
PythonicPredictorConfiguration ¶
SageMakerPredictorConfiguration ¶
Bases: BasePredictorConfiguration
The configuration to use a SageMaker predictor
context ¶
ClassificationContext
dataclass
¶
Bases: DataClassJsonMixin
The context of the classification for each classifier.
class_contexts
instance-attribute
¶
Context of the classification indexed by classifier name
ClassificationReviewContext
dataclass
¶
Bases: DataClassJsonMixin
Review Context of a classification
reasons_for_review
instance-attribute
¶
Reasons for review of the classification indexed by classifier name
SageMakerPredictionContext
dataclass
¶
Bases: DataClassJsonMixin
The context of a prediction made by a SageMaker predictor.
SingleClassificationContext
dataclass
¶
SingleClassificationContext(
predictor_type,
prediction_proba,
fallback_label_used,
classification_confidence,
sagemaker_call_context,
llm_call_context,
)
Bases: DataClassJsonMixin
The context of a single classification.
classification_confidence
instance-attribute
¶
Classification confidence score (None if we use the fallback label)
fallback_label_used
instance-attribute
¶
True if the returned label from the classifier was unknown and a fallback label was used
llm_call_context
instance-attribute
¶
LLM call context for the classification, if the classifier uses an LLM model
prediction_proba
instance-attribute
¶
The prediction probability for each possible label. None if the prediction failed and the fallback label was used
predictor_type
instance-attribute
¶
The type of the predictor used for the classification
sagemaker_call_context
instance-attribute
¶
Sagemaker call context for the classification, if the classifier uses a Sagemaker inference endpoint
document ¶
document_content ¶
extraction ¶
configuration ¶
BaseLLMExtractorConfiguration
dataclass
¶
Bases: ABC
Common configuration for dynamic prompting LLM extractor
__post_init__ ¶
hds_only
class-attribute
instance-attribute
¶
If true, only HDS LLM models can be used
llm_model
instance-attribute
¶
the LLM model to use. Make sure to use HDS models for HDS data
n_similar_examples
class-attribute
instance-attribute
¶
the number of similar examples to inject in the prompt. If set to 0, make sure you provide the desired structured output in the prompt.
prepare_expected_output_for_llm ¶
Prepare the LLM example_assistant message for a given expected output
:param expected_output_content:
:return:
Source code in components/documents/public/entities/extraction/configuration.py
prepare_instructions
abstractmethod
¶
prepare_transcription_for_llm ¶
Prepare the LLM example_user or user message for the transcription of a similar example document or the document to parse
Source code in components/documents/public/entities/extraction/configuration.py
DynamicLLMExtractorConfiguration
dataclass
¶
DynamicLLMExtractorConfiguration(
*,
hds_only=True,
llm_model,
document_type,
n_similar_examples=5,
instructions,
content_type,
use_approximate_search=False,
use_similar_examples_from_global_stack=False,
reference_batches=None
)
Bases: BaseLLMExtractorConfiguration
Configuration for the global dynamic prompting LLM extractor
build_similar_document_metadata_filter ¶
Build the metadata filter to apply when looking for similar examples. The default implementation filters on:
- the document category, if a category is defined in the classification result
- stack=global, if the configuration is set to use similar examples from the global stack
Source code in components/documents/public/entities/extraction/configuration.py
content_type
instance-attribute
¶
Content type of the structured output (used to validate the LLM response)
prepare_instructions ¶
reference_batches
class-attribute
instance-attribute
¶
If provided, we'll only look for similar examples across documents from these batches, as defined in their metadata.
use_approximate_search
class-attribute
instance-attribute
¶
If true, use approximate search when looking for similar documents. This can speed up the process but may return less relevant results.
use_similar_examples_from_global_stack
class-attribute
instance-attribute
¶
If true, apply a metadata filter so that similar documents are looked up in the right stack.
context ¶
ExtractionContext
dataclass
¶
Bases: DataClassJsonMixin
Extraction context to be filled during the extraction process
ExtractionReviewContext
dataclass
¶
Bases: DataClassJsonMixin
Extraction review context to be filled during the extraction review process
extraction_field ¶
ExtractionFieldConfig ¶
Bases: BaseModel
Configuration class for storing field-specific extraction instructions.
This class is used to store metadata that guides how a field should be parsed.
LlmGuidance ¶
Bases: BaseModel
Contains guidance information for LLMs when extracting a field.
This information is used to generate prompts that help LLMs correctly identify and extract the field's value from the document.
as_json_schema_extra_dict ¶
Converts the extraction configuration to a format that can be stored in a Pydantic field's json_schema_extra attribute.
Pydantic fields have a json_schema_extra field where we can store additional metadata that isn't part of the standard JSON schema. This method prepares our extraction configuration for storage in that field.
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
A dictionary with the extraction configuration nested under the 'extraction_config' key. |
Source code in components/documents/public/entities/extraction/extraction_field.py
from_field_info
classmethod
¶
Retrieves extraction configuration from a Pydantic field's metadata.
This method extracts and validates the extraction configuration that was previously stored in a field's json_schema_extra attribute.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
field_info
|
FieldInfo
|
A Pydantic FieldInfo object containing field metadata |
required |
Returns:
| Type | Description |
|---|---|
Optional[ExtractionFieldConfig]
|
An ExtractionFieldConfig object if extraction configuration exists in the field, or None if no extraction configuration is found. |
Source code in components/documents/public/entities/extraction/extraction_field.py
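The round-trip described above can be sketched with a simplified stand-in. Only the nesting under the 'extraction_config' key is taken from the docs; the class body and the from_json_schema_extra helper are hypothetical (the real classmethod takes a Pydantic FieldInfo).

```python
# Hedged sketch of storing and retrieving extraction configuration.
from typing import Any, Optional

class ExtractionFieldConfig:  # simplified stand-in for the real Pydantic model
    def __init__(self, llm_guidance: dict[str, Any]):
        self.llm_guidance = llm_guidance

    def as_json_schema_extra_dict(self) -> dict[str, Any]:
        # Nest the configuration under the 'extraction_config' key so it can
        # be stored in a Pydantic field's json_schema_extra attribute.
        return {"extraction_config": {"llm_guidance": self.llm_guidance}}

    @classmethod
    def from_json_schema_extra(
        cls, extra: Optional[dict[str, Any]]
    ) -> Optional["ExtractionFieldConfig"]:
        # Return None when no extraction configuration was stored
        if not extra or "extraction_config" not in extra:
            return None
        return cls(llm_guidance=extra["extraction_config"]["llm_guidance"])

config = ExtractionFieldConfig(llm_guidance={"hint": "value is near the header"})
extra = config.as_json_schema_extra_dict()
restored = ExtractionFieldConfig.from_json_schema_extra(extra)
print(restored.llm_guidance["hint"])  # value is near the header
```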
prompt ¶
i18n ¶
I18nKeys ¶
Bases: BaseModel
A class to store i18n keys for a document parsing configuration. Each key should be translated in all languages.
model_post_init ¶
Source code in components/documents/public/entities/i18n.py
internal_control ¶
DocumentInternalControlReviewInfo
dataclass
¶
DocumentInternalControlReviewInfo(
id,
document_id,
document_extraction_result_id,
operation_task_id,
validation_status,
created_at,
updated_at,
)
Bases: DataClassJsonMixin
Information about a document internal control review
llm_context ¶
ChatGptCallContext
dataclass
¶
ChatGptCallContext(
messages,
*,
llm_model,
example_ids,
nb_calls,
usage_total_tokens,
error_type=None
)
Bases: LLMCallContext
The context of a call to an LLM for document extraction with the OpenAI conversation message format (role, content, name)
messages
instance-attribute
¶
The list of messages in OpenAI format to be reused for a future call
parsing ¶
document_parser_result ¶
DocumentExpectedOutput
dataclass
¶
InMemoryDocumentExtractionResult
dataclass
¶
The result of a document extraction in memory
content
instance-attribute
¶
The structured output of a document parsing. Empty dictionary if the extraction failed
context
instance-attribute
¶
The context of the extraction (e.g. the LLM call context if the extraction was done with an LLM)
review_context
instance-attribute
¶
The context of the review of the extraction. None if the extraction is valid
flow ¶
classification ¶
DocumentClassificationConfiguration ¶
Bases: BaseModel
The configuration used to classify a document. It can be used to classify the document into multiple classes
instance-attribute
¶The classifiers to use to classify the document. Each classifier can have multiple predicted classes as output. At least one classifier must be configured to predict the 'category' class.
Register sagemaker predictor in the registry if not already done
Source code in components/documents/public/entities/parsing/flow/classification.py
Validate that the 'category' classifier is present
Source code in components/documents/public/entities/parsing/flow/classification.py
Check that no class is predicted more than once.
Source code in components/documents/public/entities/parsing/flow/classification.py
DocumentClassifierConfiguration ¶
Bases: BaseModel
The configuration to use a predictor to classify a document. A predictor can output multiple classes.
classmethod
¶Coerce possible labels from a list of strings to a dict. This is a convenient pre-processor to simplify the configuration.
Source code in components/documents/public/entities/parsing/flow/classification.py
property
¶If the returned label is not declared in the possible labels, we return this fallback label. By default, the fallback label is 'unclassifiable', which means that the document is unsupported and cannot be parsed.
class-attribute
instance-attribute
¶The minimum confidence required; below this threshold, the document is sent to review
instance-attribute
¶The possible labels that the predictor can return. Each label should be a dict where keys represent the class names. For a single-class classifier, the list can simply be the possible values for the class.
instance-attribute
¶The predictor type to use: 'sagemaker' for a SageMaker predictor, 'llm' for an LLM predictor, 'fixed' for a fixed classification to the unique possible label, and 'pythonic' for a Python predictor.
Validate that if the predictor type is SageMaker or LLM, the list of classes contains only one class. These two predictors do not handle multi-class predictions for now.
Source code in components/documents/public/entities/parsing/flow/classification.py
Validate that possible classes are dicts whose keys represent class names.
Source code in components/documents/public/entities/parsing/flow/classification.py
Validate that the predictor configuration is consistent with the predictor type
Source code in components/documents/public/entities/parsing/flow/classification.py
configuration ¶
DocumentAutoParsingFlowConfiguration ¶
Bases: BaseModel
Configuration for the automatic parsing flow
extraction ¶
DocumentCategoryExtractionConfiguration ¶
Bases: BaseModel
Configuration to extract data from a given document category
class-attribute
instance-attribute
¶Whether to auto-validate the extracted data. If set to False, the extraction result is sent to review (auto-populated). It can also be set to a feature flag name to enable auto-validation only when the flag is enabled.
instance-attribute
¶The configuration of the dynamic LLM extractor for each possible category
DocumentExtractionConfiguration ¶
Bases: BaseModel
Configuration to extract data from a document
instance-attribute
¶The extraction configuration for each category (classification)
instance-attribute
¶The document handler used by the extractors to fetch relevant example inputs and expected outputs
class-attribute
instance-attribute
¶parsing_flow ¶
DocumentAutoParsingFlowOutput
dataclass
¶
DocumentAutoParsingFlowOutput(
*,
document_id,
document_transcription_result=None,
document_classification_result=None,
document_extraction_result=None,
document_operation_task=None
)
transcription ¶
DocumentTranscriptionConfiguration ¶
Bases: BaseModel
Configuration to transcribe a document
class-attribute
instance-attribute
¶Document is sent to review if the confidence score is below this threshold. Value between 0 and 1
class-attribute
instance-attribute
¶Document is sent to review if the text length is below this threshold
class-attribute
instance-attribute
¶Transcriber provider. Only 'textract' and 'gemini' are supported for now
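The review decision implied by the two thresholds above can be sketched as follows. The default values here are hypothetical, chosen only for illustration; the real thresholds come from the configuration.

```python
# Sketch of the transcription review decision (assumed default thresholds).
def needs_transcription_review(
    confidence: float,
    text_length: int,
    min_confidence: float = 0.8,  # hypothetical default, between 0 and 1
    min_text_length: int = 20,    # hypothetical default
) -> bool:
    # Send to review when either threshold is not met
    return confidence < min_confidence or text_length < min_text_length

print(needs_transcription_review(confidence=0.95, text_length=500))  # False
print(needs_transcription_review(confidence=0.50, text_length=500))  # True
```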
parsing_configuration ¶
DocumentCategoryConfiguration ¶
Bases: BaseModel
Parsing configuration for a document category
extraction_content_model
instance-attribute
¶
Extraction content structured output model for the document category. Can be None if the document category is not supported for extraction. It can be used to generate the JSON schema for the manual parsing tool or to validate an automatic extraction. Add the json_schema_extra 'order' to specify the order of fields in the manual parsing tool.
icon
instance-attribute
¶
The icon to display for the category in the manual parsing tool, from https://tabler.io/icons (e.g. IconFileDollar)
internal_control_ratio
class-attribute
instance-attribute
¶
The ratio of internal control to apply to this category (between 0 and 1)
unsupported
class-attribute
instance-attribute
¶
If true, the document category is unsupported and cannot be extracted. It can only be rejected.
validate_json_schema
classmethod
¶
Check that the JSON schema is valid
Source code in components/documents/public/entities/parsing/parsing_configuration.py
validate_unsupported_extraction_content_consistency ¶
Validate the extraction content model
Source code in components/documents/public/entities/parsing/parsing_configuration.py
DocumentParsingConfiguration ¶
Bases: BaseModel
Class that configures the document parsing process for a given document type.
document_auto_parsing_flow_configuration
class-attribute
instance-attribute
¶
Configuration for the auto-parsing flow. If None, documents will be manually parsed.
document_categories
instance-attribute
¶
The document categories and their related information (icon, extraction content model, etc.)
document_type
instance-attribute
¶
The document type for which the parsing configuration is defined
get_document_categories ¶
Get the list of document categories with display info for a given document type
Source code in components/documents/public/entities/parsing/parsing_configuration.py
get_document_category_json_schema ¶
Get the JSON schema for a given document category for the manual parsing tool.
:param document_category: the document category
:param lang: the language in which to translate the schema. If None, the schema is returned as is
:return: the JSON schema for the document category
Source code in components/documents/public/entities/parsing/parsing_configuration.py
i18n_keys
class-attribute
instance-attribute
¶
i18n keys for the document parsing tool to translate categories and extraction fields
validate_category_existence_in_extraction ¶
Validate that the auto-parsing flow extraction configuration has an extraction configuration for each configured category
Source code in components/documents/public/entities/parsing/parsing_configuration.py
validate_document_categories ¶
Check that there is at least one unsupported category (generally the fallback category used in the DocumentClassificationConfiguration). Otherwise, the unclassifiable category is added to the document categories.
Source code in components/documents/public/entities/parsing/parsing_configuration.py
validate_extraction_model_consistency ¶
Validate the consistency of the extraction content model between the document category configuration and the existing dynamic LLM extraction configuration in the auto-parsing flow configuration.
Source code in components/documents/public/entities/parsing/parsing_configuration.py
parsing_result ¶
DocumentParsingData
dataclass
¶
Bases: DataClassJsonMixin
Data class to store the last parsing step results of a document
category
property
¶
Gets the category from the classification results.
This property retrieves the value associated with the "category" key
from the classification object's result attribute, if available.
If the classification object is not present or does not contain
the necessary information, it returns None.
Returns:
| Type | Description |
|---|---|
str | None
|
The category extracted from the classification result if available, otherwise None. |
subcategory
property
¶
Returns the subcategory from the classification result.
The subcategory is retrieved from the subcategory key of the classification
result dictionary. If the classification object is not present, or if the
subcategory key does not exist, the method will return None.
:return: Subcategory value from the classification result or None.
:rtype: str | None
ExtractionResultData
dataclass
¶
ExtractionResultData(
id,
version,
validation_status,
source,
creator_ref,
created_at,
result,
review_context,
)
StepResultData
dataclass
¶
Bases: DataClassJsonMixin
Data class to store the result of a step
transcription ¶
TranscriptionContext
dataclass
¶
TranscriptionContext(
transcription_source=None,
transcription_confidence=None,
transcription_pct_handwritten=None,
)
Bases: DataClassJsonMixin
Context of a transcription
TranscriptionReviewContext
dataclass
¶
Bases: DataClassJsonMixin
Review Context of a transcription
validation ¶
ExtractionValidationErrors
dataclass
¶
Bases: DataClassJsonMixin
Structure for holding validation errors for an extraction
field_validation_errors
instance-attribute
¶
Field validation errors, where keys are JSON paths to the fields with errors and values are lists of error messages.
from_pydantic_validation_error
classmethod
¶
Converts a Pydantic ValidationError into a ValidationErrors object.
Source code in components/documents/public/entities/validation.py
overall_validation_errors
instance-attribute
¶
Overall validation errors, not related to a specific field.
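The two-part structure above (field errors keyed by JSON path, plus overall errors) can be sketched as follows. The from_error_pairs helper is hypothetical; the real class builds this structure from a Pydantic ValidationError via from_pydantic_validation_error.

```python
# Hedged sketch of ExtractionValidationErrors with a hypothetical builder.
from dataclasses import dataclass, field

@dataclass
class ExtractionValidationErrors:
    field_validation_errors: dict[str, list[str]] = field(default_factory=dict)
    overall_validation_errors: list[str] = field(default_factory=list)

    @classmethod
    def from_error_pairs(
        cls, pairs: list[tuple[str, str]]
    ) -> "ExtractionValidationErrors":
        errors = cls()
        for json_path, message in pairs:
            if json_path:
                # Group messages under the JSON path of the offending field
                errors.field_validation_errors.setdefault(json_path, []).append(message)
            else:
                # No path: the error applies to the extraction as a whole
                errors.overall_validation_errors.append(message)
        return errors

errors = ExtractionValidationErrors.from_error_pairs(
    [("amounts.0.value", "must be positive"), ("", "document is empty")]
)
print(errors.field_validation_errors)    # {'amounts.0.value': ['must be positive']}
print(errors.overall_validation_errors)  # ['document is empty']
```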
components.documents.public.enums ¶
classification ¶
ClassificationReasonForReview ¶
Bases: AlanBaseEnum
Reasons for reviewing a classification
document_type ¶
DocumentType ¶
Bases: AlanBaseEnum
List all possible document types using the component service (document storage, embedding and/or parsing).
Note: prefix with the country code when the document is specific to a country (fr_, be_, es_, etc.).
BeInsuranceDocument
class-attribute
instance-attribute
¶
CaInsuranceDocument
class-attribute
instance-attribute
¶
FrAlsaceMoselleEligibilityRequest
class-attribute
instance-attribute
¶
FrIncomeEligibilityRequest
class-attribute
instance-attribute
¶
FrInsuranceDocument
class-attribute
instance-attribute
¶
FrPrevoyanceCompetitorContract
class-attribute
instance-attribute
¶
FrRetireeEligibilityRequest
class-attribute
instance-attribute
¶
FrSocialFundsEligibilityRequest
class-attribute
instance-attribute
¶
ResolutionPlatformMacro
class-attribute
instance-attribute
¶
embedding_algorithm ¶
extraction ¶
parser_type ¶
reason_for_review ¶
ExtractionReasonForReview ¶
Bases: AlanBaseEnum
Global reasons for reviewing an extraction. These are reasons for reviewing the extraction as a whole, not specific to a field.
parsing_rejection_reason ¶
ParsingRejectionReason ¶
Bases: AlanBaseEnum
General reasons for rejecting a document during parsing
cropped_document
class-attribute
instance-attribute
¶
The document is cropped and some text is cut off
invalid_content
class-attribute
instance-attribute
¶
Missing required information, or content invalid with respect to the schema
unsupported
class-attribute
instance-attribute
¶
Unsupported document not associated with any document category
step_source ¶
step_validation_status ¶
components.documents.public.events ¶
document ¶
components.documents.public.helpers ¶
parsing_data ¶
results_to_document_parsing_data ¶
Convert classification and extraction results into a DocumentParsingData object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
classification
|
DocumentClassificationResult | None
|
The classification result object, if available |
required |
extraction
|
DocumentExtractionResult | None
|
The extraction result object, if available |
required |
Returns:
| Name | Type | Description |
|---|---|---|
DocumentParsingData |
DocumentParsingData
|
A data object containing the formatted classification and extraction results |
Source code in components/documents/public/helpers/parsing_data.py
validation_helpers ¶
ValidationHelpers ¶
A collection of static helper methods for validation across different document types.
are_floats_equal
staticmethod
¶
Compares two floats for equality within a given tolerance. Returns True if both floats are None, False if either (but not both) is None.
Source code in components/documents/public/helpers/validation_helpers.py
format_currency_cad
staticmethod
¶
Formats a float into a CAD currency string e.g., 1234.56 -> "$1,234.56". Returns "N/A" if value is None.
Source code in components/documents/public/helpers/validation_helpers.py
format_currency_eur
staticmethod
¶
Formats a float into a EUR currency string e.g., 1234.56 -> "1234,56€". Returns "N/A" if value is None.
Source code in components/documents/public/helpers/validation_helpers.py
parse_percentage_string
staticmethod
¶
Converts a percentage string (e.g., "2.10%") to a float (e.g., 0.021).