jugmt: JUst Give Me Tables
jugmt is a minimalistic spex (SPecification EXtractor) implementation
with a codebase of fewer than 200 lines of Python. The tool extracts figure
information and tables from .docx files, generates HTML and JSON,
and validates the JSON against a JSON schema.
┏━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━┓
┃ Language ┃ Files ┃ % ┃ Code ┃ % ┃ Comment ┃ % ┃
┡━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━┩
│ Python │ 5 │ 100.0 │ 156 │ 57.1 │ 23 │ 8.4 │
├──────────┼───────┼───────┼──────┼──────┼─────────┼─────┤
│ Sum │ 5 │ 100.0 │ 156 │ 57.1 │ 23 │ 8.4 │
└──────────┴───────┴───────┴──────┴──────┴─────────┴─────┘
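To make the extraction step concrete, here is a minimal sketch of pulling tables out of a .docx file and emitting JSON, using only the Python standard library. It is not jugmt's actual code: the function names (`extract_tables`, `is_valid`, `tables_to_json`) and the `{"tables": ...}` output shape are assumptions made for illustration, HTML generation and figure information are omitted, and `is_valid` is a hand-rolled shape check standing in for real JSON schema validation (for which a library such as `jsonschema` would normally be used).

```python
import json
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def extract_tables(docx_file):
    """Return every table in a .docx as a list of rows of cell strings.

    `docx_file` is a path or a binary file-like object. Nested tables are
    reported as separate entries in the returned list.
    """
    # A .docx file is a ZIP archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(f"{W}tbl"):
        rows = []
        for tr in tbl.findall(f"{W}tr"):
            cells = []
            for tc in tr.findall(f"{W}tc"):
                # A cell holds paragraphs (w:p) whose text lives in w:t runs.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(f"{W}t"))
                    for p in tc.findall(f"{W}p")
                ]
                cells.append("\n".join(paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables


def is_valid(document):
    """Hypothetical shape check: {"tables": [[[str, ...], ...], ...]}."""
    tables = document.get("tables") if isinstance(document, dict) else None
    return isinstance(tables, list) and all(
        isinstance(table, list)
        and all(
            isinstance(row, list) and all(isinstance(cell, str) for cell in row)
            for row in table
        )
        for table in tables
    )


def tables_to_json(docx_file):
    """Extract tables, check their shape, and serialize them as JSON text."""
    document = {"tables": extract_tables(docx_file)}
    if not is_valid(document):
        raise ValueError("extracted data does not match the expected shape")
    return json.dumps(document, indent=2)
```

Calling `tables_to_json("spec.docx")`, where `spec.docx` is a placeholder path, would return the JSON text for all tables in that file. Reading the XML directly keeps the sketch dependency-free; it handles only plain rows and cells and ignores merged cells, captions, and other formatting.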
When run on a collection of NVMe specification documents, including Base, Boot, MI, NVM, ZNS, KV, PCI, RDMA, and TCP, the tool takes a total of 5 seconds of wall-clock time and about 500 MB of memory for all documents combined, using a single thread on an i7-1360P.