Transcription¶

The transcription step is the first step of the document parsing flow.

It converts the document content into transcribed text in Markdown format, a structured form that the classification and extraction steps can consume.
Markdown retains information about the structure of the document (headings, tables, etc.), which the extraction step relies on.

How it works¶

The transcription is done using AWS Textract.
The AWS Textract response consists of blocks of text, each with its position (bounding box) and its relationships to other blocks (e.g. blocks belonging to the same table header).
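
To make the block structure concrete, here is a minimal sketch of how a Textract-style response could be turned into Markdown. It is not the actual implementation of this step. The field names (`Blocks`, `BlockType`, `Relationships`, `Geometry.BoundingBox`, `RowIndex`, `ColumnIndex`) are those of the Textract response format, while the helper names (`index_blocks`, `block_text`, `table_to_markdown`, `lines_in_reading_order`) and the sample payload are hypothetical. The sketch assumes the first table row is the header and ignores merged cells.

```python
from collections import defaultdict

# Hypothetical, heavily trimmed payload shaped like a Textract response.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"Id": "l2", "BlockType": "LINE", "Text": "Thank you",
         "Geometry": {"BoundingBox": {"Top": 0.90, "Left": 0.10, "Width": 0.2, "Height": 0.02}}},
        {"Id": "l1", "BlockType": "LINE", "Text": "Invoice 2024",
         "Geometry": {"BoundingBox": {"Top": 0.05, "Left": 0.10, "Width": 0.3, "Height": 0.03}}},
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Amount"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Consultation"},
        {"Id": "w4", "BlockType": "WORD", "Text": "25.00"},
    ]
}


def index_blocks(response):
    """Map block Id -> block so relationships can be followed by Id."""
    return {block["Id"]: block for block in response["Blocks"]}


def block_text(block, blocks_by_id):
    """Join the text of a block's WORD children, in the order they are listed."""
    words = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def table_to_markdown(table, blocks_by_id):
    """Render a TABLE block as a Markdown table (first row treated as header)."""
    rows = defaultdict(dict)
    for relationship in table.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for cell_id in relationship["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] == "CELL":
                rows[cell["RowIndex"]][cell["ColumnIndex"]] = block_text(cell, blocks_by_id)
    if not rows:
        return ""
    column_count = max(column for row in rows.values() for column in row)
    lines = []
    for position, row_index in enumerate(sorted(rows)):
        cells = [rows[row_index].get(column, "") for column in range(1, column_count + 1)]
        lines.append("| " + " | ".join(cells) + " |")
        if position == 0:
            # Markdown needs a separator row after the header row.
            lines.append("| " + " | ".join(["---"] * column_count) + " |")
    return "\n".join(lines)


def lines_in_reading_order(response):
    """Return LINE texts sorted top-to-bottom, then left-to-right, using bounding boxes."""
    line_blocks = [b for b in response["Blocks"] if b["BlockType"] == "LINE"]
    line_blocks.sort(key=lambda b: (b["Geometry"]["BoundingBox"]["Top"],
                                    b["Geometry"]["BoundingBox"]["Left"]))
    return [b["Text"] for b in line_blocks]


blocks_by_id = index_blocks(SAMPLE_RESPONSE)
print(lines_in_reading_order(SAMPLE_RESPONSE))
print(table_to_markdown(blocks_by_id["t1"], blocks_by_id))
```

The relationships give the table its shape (which cells belong to which table, which words to which cell), while the bounding boxes are only needed to order free-standing lines. A real response also repeats table text in LINE blocks, so a full transcription would need to deduplicate the two.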

Storage¶

The transcription is versioned and stored in DocumentTranscriptionResult.
The raw AWS Textract response is stored in S3 and referenced from the document transcription result.
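
As an illustration only, the storage layout described above could be modelled roughly as follows. This is a sketch under assumptions, not the real DocumentTranscriptionResult model: the class name `TranscriptionRecord`, every field name, the S3 key layout, and the bucket name are hypothetical, and it assumes that "versioned" means each new transcription of a document gets a higher version number.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TranscriptionRecord:
    """Hypothetical stand-in for a stored transcription result."""
    document_id: str
    version: int                # assumed: incremented on each re-transcription
    markdown: str               # the transcribed text
    raw_response_s3_uri: str    # pointer to the raw Textract response in S3
    created_at: datetime


def raw_response_uri(bucket: str, document_id: str, version: int) -> str:
    """Build an S3 URI for the raw response (the key layout is an assumption)."""
    return f"s3://{bucket}/transcriptions/{document_id}/v{version}/textract.json"


def next_record(document_id: str, markdown: str, bucket: str,
                previous: Optional[TranscriptionRecord] = None) -> TranscriptionRecord:
    """Create the record for a new transcription, one version above the previous one."""
    version = previous.version + 1 if previous is not None else 1
    return TranscriptionRecord(
        document_id=document_id,
        version=version,
        markdown=markdown,
        raw_response_s3_uri=raw_response_uri(bucket, document_id, version),
        created_at=datetime.now(timezone.utc),
    )


first = next_record("doc-123", "# Invoice 2024", bucket="example-bucket")
second = next_record("doc-123", "# Invoice 2024\n\nThank you", bucket="example-bucket",
                     previous=first)
print(second.version, second.raw_response_s3_uri)
```

Keeping only a reference to the raw response in the record keeps the stored result small, while the full Textract payload remains available in S3 if the Markdown ever needs to be regenerated.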