Unstructured Data Extraction Engine

Automated Data Ingestion Pipeline for AI & Analytics

Architectural Overview: An ETL utility that converts proprietary, semi-structured, and legacy file formats into machine-readable plain text. The engine serves as the data ingestion layer for training Large Language Models (LLMs), populating vector databases for retrieval-augmented generation (RAG), and archiving corporate memory.

🏗️ Core Capabilities

This tool bridges the gap between "Human-Readable Documents" and "Machine-Actionable Data" by abstracting file system complexities.

| Feature | Description | Architectural Value |
| --- | --- | --- |
| Universal Parsing | Supports 20+ extensions (Office, PDF, web, model artifacts) | Unlocks data trapped in binary formats without proprietary software. |
| Encoding Intelligence | `chardet`-based auto-detection | Prevents data corruption (mojibake) when reading files from legacy systems (UTF-8, ISO-8859, etc.). |
| Noise Reduction | Binary and system file filtering | Saves tokens in LLM context windows by ignoring non-text binaries. |
| MIME Type Analysis | Content-based detection | Identifies file types by signature, not just extension, so files with a misleading extension are handled safely. |
| Code Serialization | Flattens codebases | Converts nested folder structures (src/) into a single text stream for LLM code analysis. |
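As a rough illustration of content-based detection and binary filtering, the sketch below matches a file's leading bytes against a few well-known signatures and treats a NUL byte as a sign of non-text content. The names `SIGNATURES`, `sniff_mime`, and `looks_binary` are hypothetical, the MIME labels for OLE, HDF5, and NPY are informal, and the engine's actual detection logic may differ.

```python
from typing import Optional

# Leading "magic" bytes for a few formats the README lists.
# Illustrative only; this table is far from complete.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    # DOCX, XLSX, PPTX, ODT, ODS and ODP are all ZIP containers.
    (b"PK\x03\x04", "application/zip"),
    # OLE2 compound file, used by legacy XLS.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/x-ole-storage"),
    # HDF5 container, used by H5 model files.
    (b"\x89HDF\r\n\x1a\n", "application/x-hdf5"),
    (b"\x93NUMPY", "application/x-npy"),
]


def sniff_mime(head: bytes) -> Optional[str]:
    """Return a MIME label if the leading bytes match a known signature."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None


def looks_binary(head: bytes) -> bool:
    """Heuristic: a NUL byte in the first block usually means non-text data."""
    return b"\x00" in head
```

ZIP-based formats share one signature, so a real detector would still need the extension or the archive contents to tell DOCX from XLSX. UTF-16 text also contains NUL bytes, so the NUL heuristic would misclassify it.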

⚡ Supported Data Sources (Comprehensive List)

The engine implements specific parsing strategies for a wide range of MIME types:

📄 Document Archives:
- Microsoft Office: DOCX, XLSX, PPTX, XLS
- OpenDocument: ODT, ODS, ODP (Linux/Government Standards)
- Portable: PDF (via pdfminer.six)
💾 Engineering & Logs:
- Data: CSV, TXT, LOG, MD
- Config: INI, CONF, CFG, XML, JSON
🧠 AI Model Artifacts:
- Metadata Extraction: H5, KERAS, NPY (Extracts architecture/weights info)
💻 Source Code:
- Languages: Python (.py), Java, C++, JavaScript, HTML, CSS and more.
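For the model artifacts listed above, metadata can often be read without loading the array data itself. The sketch below parses only the header of an NPY file, following the published NPY format, to recover dtype, memory order, and shape. The function name `read_npy_metadata` is hypothetical, and the README does not say how the engine actually reads NPY files, so treat this as one possible approach.

```python
import ast
import struct
from typing import Any, BinaryIO, Dict

NPY_MAGIC = b"\x93NUMPY"


def read_npy_metadata(stream: BinaryIO) -> Dict[str, Any]:
    """Parse only the NPY header: dtype string, memory order, and shape."""
    if stream.read(6) != NPY_MAGIC:
        raise ValueError("not an NPY file")
    major, _minor = stream.read(2)
    # Version 1.x stores the header length in 2 bytes, later versions in 4.
    if major == 1:
        (header_len,) = struct.unpack("<H", stream.read(2))
    else:
        (header_len,) = struct.unpack("<I", stream.read(4))
    encoding = "latin-1" if major < 3 else "utf-8"
    header = stream.read(header_len).decode(encoding)
    # The header is a Python dict literal; literal_eval parses it
    # without executing arbitrary code.
    return ast.literal_eval(header)
```

With a real file this would be called as `read_npy_metadata(open(path, "rb"))`, ideally inside a `with` block so the handle is closed.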

🚀 Operational Workflow

1. Ingest: Recursively scans directories, validating file sizes (max_file_size_mb limit).
2. Detect: Identifies MIME types and resolves character encoding issues.
3. Extract: Applies format-specific parsers (e.g., python-docx for Word, h5py for Models).
4. Load: Aggregates clean text into a unified stream for downstream processing.
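The four steps above can be sketched for plain-text inputs as follows. This is a minimal illustration under stated assumptions, not the engine's code: the helper names `iter_candidates`, `decode_text`, and `build_corpus` and the `=== path ===` separator are hypothetical, only `max_file_size_mb` comes from the README, and format-specific parsers (DOCX, PDF, and so on) are left out.

```python
import os
from pathlib import Path
from typing import Iterator, Optional


def iter_candidates(root: Path, max_file_size_mb: float) -> Iterator[Path]:
    """Ingest: walk the tree and skip files above the size limit."""
    limit = max_file_size_mb * 1024 * 1024
    for dirpath, dirs, files in os.walk(root):
        dirs.sort()  # deterministic traversal order
        for name in sorted(files):
            path = Path(dirpath) / name
            if path.stat().st_size <= limit:
                yield path


def decode_text(raw: bytes) -> Optional[str]:
    """Detect: reject likely binaries, then resolve the encoding."""
    if b"\x00" in raw[:4096]:
        return None
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    try:
        import chardet  # optional here; the README lists it as a dependency
        guess = chardet.detect(raw)["encoding"]
    except ImportError:
        guess = None
    try:
        return raw.decode(guess or "latin-1", errors="replace")
    except LookupError:
        # Guessed codec is unknown to Python; latin-1 never fails.
        return raw.decode("latin-1")


def build_corpus(root: Path, output_file: Path, max_file_size_mb: float = 10.0) -> int:
    """Extract and Load: append each readable file to one text stream."""
    written = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for path in iter_candidates(root, max_file_size_mb):
            if path.resolve() == output_file.resolve():
                continue  # never ingest the corpus being written
            text = decode_text(path.read_bytes())
            if text is None:
                continue
            out.write(f"=== {path.relative_to(root)} ===\n{text}\n")
            written += 1
    return written
```

Trying strict UTF-8 before any statistical guess is a deliberate choice in this sketch: it is cheap, and a successful strict decode is rarely a false positive.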

🛠️ Installation & Usage

Designed for easy integration into Docker containers or CI/CD pipelines.

1. Installation

```shell
# Install required dependencies
pip install -r requirements.txt
# Dependencies: python-docx, openpyxl, python-pptx, odfpy, pdfminer.six, chardet, h5py
```

2. Python Implementation

```python
from processor.file_processor import FileProcessor

# Initialize the ETL processor with safety limits
processor = FileProcessor(
    max_file_size_mb=10.0,   # Skip huge binaries to prevent memory overflow
    output_file="corpus_for_training.txt"
)

# Execute ingestion on raw data folder
processor.process("/mnt/data/legacy_archive/")
```

3. CLI Usage

```shell
python src/main.py
```

⚖️ Context

Developed to accelerate the digitization of industrial technical archives and facilitate "Chat with your Data" applications.

📜 License

Apache-2.0
