Transcription
The transcription step is the first step of the document parsing flow.
- It is responsible for converting the document content into a structured format that can be used by the classification and extraction steps.
- It transforms a document into transcribed text in Markdown format.
- The Markdown format allows us to retain information on the structure of the document (headings, tables, etc.), which we will use during extraction.
How it works
- The transcription is done using AWS Textract.
- The AWS Textract response consists of blocks of text, each with its position (bounding box) and its relationships to other blocks (e.g. blocks that belong to the same table header).
Storage
- The transcription is versioned and stored in DocumentTranscriptionResult.
- The raw AWS Textract response is stored in S3 and referenced in the document transcription result.
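As a rough illustration of this storage layout, the sketch below models a versioned transcription record that points at the raw response in S3. Only the name `DocumentTranscriptionResult` comes from this page. Every field name, the `new_version` helper, and the S3 URIs are assumptions made for the example and may not match the real model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentTranscriptionResult:
    # All field names below are hypothetical.
    document_id: str
    version: int               # incremented each time the document is re-transcribed
    markdown: str              # transcribed text in Markdown format
    raw_response_s3_uri: str   # location of the raw AWS Textract response


def new_version(
    previous: Optional[DocumentTranscriptionResult],
    document_id: str,
    markdown: str,
    raw_response_s3_uri: str,
) -> DocumentTranscriptionResult:
    """Create the next versioned result without overwriting the previous one."""
    version = 1 if previous is None else previous.version + 1
    return DocumentTranscriptionResult(
        document_id=document_id,
        version=version,
        markdown=markdown,
        raw_response_s3_uri=raw_response_s3_uri,
    )


first = new_version(None, "doc-1", "# Invoice", "s3://example-bucket/doc-1/v1.json")
second = new_version(first, "doc-1", "# Invoice\n\nTotal: 10", "s3://example-bucket/doc-1/v2.json")
```

Keeping the large raw response in S3 and storing only a reference to it keeps the record small, while the raw blocks stay available for debugging or re-processing.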