isikmuhamm/unstructured-data-extraction-engine

Unstructured Data Extraction Engine

Automated Data Ingestion Pipeline for AI & Analytics

Architectural Overview: A robust ETL utility designed to convert proprietary, semi-structured, and legacy file formats into machine-readable plain text. This engine serves as the Data Ingestion Layer for training Large Language Models (LLMs), populating Vector Databases (RAG), and archiving corporate memory.


🏗️ Core Capabilities

This tool bridges the gap between "Human-Readable Documents" and "Machine-Actionable Data" by abstracting file system complexities.

| Feature | Description | Architectural Value |
| --- | --- | --- |
| Universal Parsing | Supports 20+ extensions (Office, PDF, Web, Model Artifacts) | Unlocks data trapped in binary formats without proprietary software. |
| Encoding Intelligence | chardet-based auto-detection | Prevents data corruption (mojibake) in legacy systems (UTF-8, ISO-8859, etc.). |
| Noise Reduction | Binary & system file filtering | Optimizes token usage for LLM context windows by ignoring non-text binaries. |
| MIME Type Analysis | Content-based detection | Identifies file types by signature, not just extension, ensuring secure processing. |
| Code Serialization | Flattens codebases | Converts complex folder structures (src/) into a single text stream for LLM code analysis. |
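
To make "content-based detection" and "binary filtering" concrete, here is a minimal, stdlib-only sketch of signature sniffing. It is an illustration, not the engine's actual implementation: the `SIGNATURES` table, the `sniff_type` function, and the returned type labels are all hypothetical names chosen for this example.

```python
# Illustrative magic-byte table (hypothetical; the engine's real logic may differ).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),            # DOCX, XLSX, PPTX, ODT are ZIP containers
    (b"\x89HDF\r\n\x1a\n", "application/x-hdf5"),  # H5 / Keras model files
    (b"\x93NUMPY", "application/x-npy"),           # informal label for NumPy arrays
]


def sniff_type(head: bytes) -> str:
    """Guess a coarse content type from the first bytes of a file."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    # Heuristic: a NUL byte is very unlikely in UTF-8 or ISO-8859 text,
    # so treat its presence as a sign of binary content.
    if b"\x00" in head:
        return "application/octet-stream"
    return "text/plain"


print(sniff_type(b"%PDF-1.7\n..."))   # recognised by signature, whatever the extension
print(sniff_type(b"plain notes\n"))   # falls through to text
```

Because the check reads the file's leading bytes, a PDF renamed to `.txt` is still recognised as a PDF. The NUL-byte heuristic would misclassify UTF-16 text, so a production filter would need a more careful check.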

⚡ Supported Data Sources (Comprehensive List)

The engine implements specific parsing strategies for a wide range of MIME types:

  • 📄 Document Archives:
    • Microsoft Office: DOCX, XLSX, PPTX, XLS
    • OpenDocument: ODT, ODS, ODP (Linux/Government Standards)
    • Portable: PDF (via pdfminer.six)
  • 💾 Engineering & Logs:
    • Data: CSV, TXT, LOG, MD
    • Config: INI, CONF, CFG, XML, JSON
  • 🧠 AI Model Artifacts:
    • Metadata Extraction: H5, KERAS, NPY (Extracts architecture/weights info)
  • 💻 Source Code:
    • Languages: Python (.py), Java, C++, JavaScript, HTML, CSS and more.
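
One common way to organise format-specific strategies like these is a registry that maps file extensions to parser functions. The sketch below uses only the standard library and covers a few of the text formats listed above; `PARSERS`, `register`, and `extract_text` are hypothetical names for this example and are not taken from the repository.

```python
import csv
import io
import json
from pathlib import Path
from typing import Callable, Dict, Optional

# Hypothetical registry: lower-case extension -> parser function.
PARSERS: Dict[str, Callable[[bytes], str]] = {}


def register(*extensions: str):
    """Decorator that registers a parser for one or more extensions."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return decorator


@register(".txt", ".log", ".md", ".ini", ".conf", ".cfg")
def parse_plain(data: bytes) -> str:
    # Assumes UTF-8 here; undecodable bytes become U+FFFD.
    return data.decode("utf-8", errors="replace")


@register(".csv")
def parse_csv(data: bytes) -> str:
    rows = csv.reader(io.StringIO(data.decode("utf-8", errors="replace")))
    return "\n".join("\t".join(row) for row in rows)


@register(".json")
def parse_json(data: bytes) -> str:
    # Re-serialise with indentation so nested structure stays readable.
    return json.dumps(json.loads(data), indent=2, ensure_ascii=False)


def extract_text(filename: str, data: bytes) -> Optional[str]:
    """Return extracted text, or None when no parser is registered."""
    parser = PARSERS.get(Path(filename).suffix.lower())
    return parser(data) if parser else None
```

Binary formats such as DOCX or PDF would plug into the same registry with parsers built on the libraries named in this README (python-docx, pdfminer.six, and so on).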

🚀 Operational Workflow

  1. Ingest: Recursively scans directories and skips files larger than the max_file_size_mb limit.
  2. Detect: Identifies MIME types and resolves character encoding issues.
  3. Extract: Applies format-specific parsers (e.g., python-docx for Word, h5py for Models).
  4. Load: Aggregates clean text into a unified stream for downstream processing.
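
The four steps above can be sketched end to end with the standard library alone. This is a simplified illustration under stated assumptions, not the engine's code: `ingest`, `decode`, and `build_corpus` are hypothetical names, the UTF-8 then Latin-1 fallback stands in for chardet-based detection, and a NUL-byte check stands in for MIME analysis.

```python
from pathlib import Path
from typing import Iterator


def ingest(root: Path, max_file_size_mb: float) -> Iterator[Path]:
    """Step 1: walk the tree and skip files above the size limit."""
    limit = max_file_size_mb * 1024 * 1024
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.stat().st_size <= limit:
            yield path


def decode(data: bytes) -> str:
    """Step 2 (simplified): try UTF-8, then fall back to Latin-1.

    Stand-in for chardet; Latin-1 can decode any byte sequence.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


def build_corpus(root: Path, max_file_size_mb: float = 10.0) -> str:
    """Steps 3 and 4: extract text per file and join it into one stream."""
    parts = []
    for path in ingest(root, max_file_size_mb):
        data = path.read_bytes()
        if b"\x00" in data[:1024]:  # crude binary filter (assumption)
            continue
        header = f"=== {path.relative_to(root).as_posix()} ==="
        parts.append(f"{header}\n{decode(data)}")
    return "\n\n".join(parts)
```

Prefixing each chunk with its relative path keeps provenance visible in the aggregated stream, which helps when the output feeds an LLM prompt or a retrieval index.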

🛠️ Installation & Usage

Designed for easy integration into Docker containers or CI/CD pipelines.

1. Installation

# Install required dependencies
pip install -r requirements.txt
# Dependencies: python-docx, openpyxl, python-pptx, odfpy, pdfminer.six, chardet, h5py

2. Python Implementation

from processor.file_processor import FileProcessor

# Initialize the ETL processor with safety limits
processor = FileProcessor(
    max_file_size_mb=10.0,   # Skip huge binaries to prevent memory overflow
    output_file="corpus_for_training.txt"
)

# Execute ingestion on raw data folder
processor.process("/mnt/data/legacy_archive/")

3. CLI Usage

python src/main.py

⚖️ Context

Developed to accelerate the digitization of industrial technical archives and facilitate "Chat with your Data" applications.

📜 License

Apache-2.0
