Architectural Overview: A robust ETL utility designed to convert proprietary, semi-structured, and legacy file formats into machine-readable plain text. This engine serves as the Data Ingestion Layer for training Large Language Models (LLMs), populating Vector Databases (RAG), and archiving corporate memory.
This tool bridges the gap between "Human-Readable Documents" and "Machine-Actionable Data" by abstracting file system complexities.
| Feature | Description | Architectural Value |
|---|---|---|
| Universal Parsing | Supports 20+ extensions (Office, PDF, Web, Model Artifacts) | Unlocks data trapped in binary formats without proprietary software. |
| Encoding Intelligence | `chardet`-based Auto-Detection | Prevents data corruption (mojibake) in legacy systems (UTF-8, ISO-8859, etc.). |
| Noise Reduction | Binary & System File Filtering | Optimizes token usage for LLM Context Windows by ignoring non-text binaries. |
| MIME Type Analysis | Content-Based Detection | Identifies file types by signature, not just extension, so renamed or mislabeled files are not routed to the wrong parser. |
| Code Serialization | Flattens Codebases | Converts complex folder structures (src/) into a single text stream for LLM code analysis. |
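As an illustration of the content-based detection and binary filtering rows, the sketch below checks a file's leading bytes against a few well-known signatures. The function name `sniff_type`, the `SIGNATURES` table, and the MIME labels are assumptions made for this example, not the engine's actual code.

```python
# Illustrative only: names and MIME labels are hypothetical, not the project's API.
# The byte prefixes themselves are the published magic numbers of each format.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # DOCX, XLSX, PPTX, ODT, ODS, ODP are ZIP containers
    (b"\x89HDF\r\n\x1a\n", "application/x-hdf5"),  # H5 files
    (b"\x93NUMPY", "application/x-npy"),  # NPY arrays
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/x-ole-storage"),  # legacy XLS
]


def sniff_type(head: bytes) -> str:
    """Classify a file from its first bytes instead of trusting the extension."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    # Heuristic (an assumption of this sketch): NUL bytes rarely occur in text.
    if b"\x00" in head:
        return "application/octet-stream"
    return "text/plain"


if __name__ == "__main__":
    print(sniff_type(b"%PDF-1.7\n"))      # a PDF header
    print(sniff_type(b"plain log line"))  # ordinary text
```

The NUL-byte heuristic is deliberately crude: it would also flag UTF-16 text as binary, so a fuller implementation would check for a byte-order mark first.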
The engine implements specific parsing strategies for a wide range of MIME types:
- 📄 Document Archives:
  - Microsoft Office: `DOCX`, `XLSX`, `PPTX`, `XLS`
  - OpenDocument: `ODT`, `ODS`, `ODP` (Linux/Government Standards)
  - Portable: `PDF` (via `pdfminer.six`)
- 💾 Engineering & Logs:
  - Data: `CSV`, `TXT`, `LOG`, `MD`
  - Config: `INI`, `CONF`, `CFG`, `XML`, `JSON`
- 🧠 AI Model Artifacts:
  - Metadata Extraction: `H5`, `KERAS`, `NPY` (Extracts architecture/weights info)
- 💻 Source Code:
  - Languages: `Python (.py)`, `Java`, `C++`, `JavaScript`, `HTML`, `CSS` and more.
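One common way to organize per-format strategies like those listed above is a registry that maps file extensions to extractor functions. The sketch below shows that pattern with standard-library handlers only. Every name in it (`EXTRACTORS`, `extract`, the `extract_*` helpers) is hypothetical, and the engine's real dispatch may look different.

```python
import csv
import io
import json
from pathlib import Path
from typing import Callable, Dict


def extract_text(raw: str) -> str:
    """Plain-text formats pass through unchanged."""
    return raw


def extract_csv(raw: str) -> str:
    """Render each CSV row as a pipe-separated line."""
    rows = csv.reader(io.StringIO(raw))
    return "\n".join(" | ".join(cell.strip() for cell in row) for row in rows)


def extract_json(raw: str) -> str:
    """Re-serialize JSON with stable key order and indentation."""
    return json.dumps(json.loads(raw), indent=2, sort_keys=True)


# Hypothetical registry: extension -> extractor.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    ".txt": extract_text,
    ".log": extract_text,
    ".md": extract_text,
    ".csv": extract_csv,
    ".json": extract_json,
}


def extract(filename: str, raw: str) -> str:
    """Dispatch on the lower-cased extension; unknown types are rejected."""
    suffix = Path(filename).suffix.lower()
    handler = EXTRACTORS.get(suffix)
    if handler is None:
        raise ValueError(f"no extractor registered for {suffix!r}")
    return handler(raw)


if __name__ == "__main__":
    print(extract("report.csv", "name, value\nalpha, 1\n"))
```

Handlers for binary formats such as DOCX or PDF would take a path or bytes rather than decoded text; the string-based signature here is a simplification for the example.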
- Ingest: Recursively scans directories, validating file sizes (`max_file_size_mb` limit).
- Detect: Identifies MIME types and resolves character encoding issues.
- Extract: Applies format-specific parsers (e.g., `python-docx` for Word, `h5py` for Models).
- Load: Aggregates clean text into a unified stream for downstream processing.
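The four stages above can be sketched end to end for plain-text files. This is a minimal illustration under stated assumptions, not the engine's implementation: `ingest`, `decode`, and `build_corpus` are hypothetical names, trying UTF-8 before consulting `chardet` is a choice made for this example, and `chardet` is treated as optional so the sketch also runs without it.

```python
from pathlib import Path
from typing import Iterator

try:
    import chardet  # optional here; the README lists it as a dependency
except ImportError:
    chardet = None


def ingest(root: str, max_file_size_mb: float) -> Iterator[Path]:
    """Ingest: walk the tree recursively and skip files over the size limit."""
    limit = max_file_size_mb * 1024 * 1024
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.stat().st_size <= limit:
            yield path


def decode(data: bytes) -> str:
    """Detect: try UTF-8 first, then a chardet guess, then Latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    guess = chardet.detect(data)["encoding"] if chardet else None
    try:
        return data.decode(guess or "latin-1", errors="replace")
    except LookupError:  # chardet named a codec Python does not know
        return data.decode("latin-1")


def build_corpus(root: str, max_file_size_mb: float = 10.0) -> str:
    """Extract and Load: decode each text file and join into one stream."""
    parts = []
    for path in ingest(root, max_file_size_mb):
        data = path.read_bytes()
        if b"\x00" in data[:1024]:  # crude binary filter (assumption)
            continue
        name = path.relative_to(root).as_posix()
        parts.append(f"--- {name} ---\n{decode(data)}")
    return "\n".join(parts)
```

The per-file header line (`--- name ---`) is an arbitrary delimiter chosen for the sketch; any marker that lets downstream consumers split the stream back into documents would serve.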
Designed for easy integration into Docker containers or CI/CD pipelines.
1. Installation
# Install required dependencies
pip install -r requirements.txt
# Dependencies: python-docx, openpyxl, python-pptx, odfpy, pdfminer.six, chardet, h5py
2. Python Implementation
from processor.file_processor import FileProcessor
# Initialize the ETL processor with safety limits
processor = FileProcessor(
max_file_size_mb=10.0, # Skip huge binaries to prevent memory overflow
output_file="corpus_for_training.txt"
)
# Execute ingestion on raw data folder
processor.process("/mnt/data/legacy_archive/")
3. CLI Usage
python src/main.py
Developed to accelerate the digitization of industrial technical archives and facilitate "Chat with your Data" applications.
Apache-2.0