Info-Extractor#
Extract information from academic papers in a structured, transparent, and reproducible way for systematic analysis and data synthesis.
Overview#
Info-Extractor is a paper-level information extraction tool designed to convert unstructured academic documents (PDFs) into structured, machine-readable data. It supports systematic reviews, meta-analyses, and large-scale evidence synthesis by enabling consistent extraction of predefined variables from scientific papers.
The tool allows users to upload PDFs or provide PDF URLs, define custom extraction schemas, and obtain structured outputs in formats suitable for downstream statistical analysis.
Purpose#
Info-Extractor is designed to:
Extract structured metadata and results from individual research papers
Reduce manual data-entry effort in systematic reviews
Improve transparency and reproducibility of data extraction
Support both rule-based and AI-assisted extraction strategies
You provide:
A PDF (local upload or URL)
A schema describing what information to extract
The system returns structured output in JSON and CSV formats.
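As a sketch of how these two inputs relate, the snippet below parses a one-field-per-line schema into a field list and builds an empty JSON record from it. The field names are illustrative placeholders, not the tool's own schema; the actual record shape returned by Info-Extractor may include additional metadata.

```python
import json

# Hypothetical schema text: one field name per line, as described above.
SCHEMA_TEXT = """\
sample_size
publication_year
scanner_field_strength
primary_outcome
"""

def parse_schema(text: str) -> list[str]:
    """Split a one-field-per-line schema into a list of field names."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def empty_record(fields: list[str]) -> dict:
    """Build a JSON-ready record with every field initialised to None."""
    return {field: None for field in fields}

fields = parse_schema(SCHEMA_TEXT)
record = empty_record(fields)
print(json.dumps(record, indent=2))
```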
Key Features#
🚧 Coming Soon#
This section is under development. Check back soon for detailed documentation on Info-Extractor's advanced features and usage instructions.
Main Capabilities#
Structured Data Extraction
Structured Data Extraction: Extract key information into predefined, schema-driven fields.
Custom Templates: Define your own extraction schemas with one field per line.
Batch Processing: Process multiple papers sequentially using consistent schemas.
Quality Control: Built-in schema validation and evidence-based consistency checks.
Export Formats: JSON and CSV outputs designed for Excel, R, Python, and database ingestion.
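To illustrate the export path, here is a minimal standard-library sketch of flattening JSON records into a CSV that Excel, R, or Python can ingest. The record contents are made-up examples, not actual Info-Extractor output.

```python
import csv
import io
import json

# Hypothetical extraction output for two papers.
records_json = '[{"paper_id": "p1", "sample_size": 24}, {"paper_id": "p2", "sample_size": 31}]'
records = json.loads(records_json)

# Write the records to CSV with a fixed column order.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["paper_id", "sample_size"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```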
Extraction Modes#
Heuristics-Based Extraction (Default)#
Regex-based and rule-driven
Evidence-backed value extraction
Fast, deterministic, and transparent
Well suited for technical metadata (e.g., sample size, TR, TE, scanner parameters)
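A heuristic in this spirit can be sketched as a regex that pulls a sample size from methods text and returns both the value and the matched evidence span. The pattern below is a simplified illustration, not the tool's actual rule set.

```python
import re

# Match "n = 24", "N=102", etc., capturing the digits.
SAMPLE_SIZE_RE = re.compile(r"\b[nN]\s*=\s*(\d+)\b")

def extract_sample_size(text: str):
    """Return (value, evidence) for the first 'n = <digits>' match, else (None, None)."""
    m = SAMPLE_SIZE_RE.search(text)
    if not m:
        return None, None
    return int(m.group(1)), m.group(0)

value, evidence = extract_sample_size(
    "Twenty-four participants (n = 24) were scanned with TR = 2000 ms."
)
print(value, evidence)  # 24 n = 24
```

Returning the evidence span alongside the value is what makes the extraction transparent: a reviewer can verify each number against the exact text it came from.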
LLM-Assisted Extraction (Optional)#
Uses a local Ollama model
Constrained by an automatically generated JSON Schema
Heuristics fill missing or null values
No external API calls (fully local execution)
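The schema-constrained call could look roughly like the sketch below, which builds a request body for Ollama's `/api/generate` endpoint. The model name, prompt, and schema are placeholders; note that recent Ollama releases accept a JSON Schema object in the `format` field, while older ones only accept the string `"json"`.

```python
import json

def build_ollama_payload(model: str, paper_text: str, schema: dict) -> dict:
    """Assemble a request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": f"Extract the requested fields from this paper:\n{paper_text}",
        "format": schema,   # constrains the model's output to the schema
        "stream": False,    # return one complete response, not a token stream
    }

# Hypothetical auto-generated schema for a single field.
schema = {
    "type": "object",
    "properties": {"sample_size": {"type": ["integer", "null"]}},
    "required": ["sample_size"],
}
payload = build_ollama_payload("llama3.1", "Methods: n = 24 ...", schema)
print(sorted(payload))  # ['format', 'model', 'prompt', 'stream']
```

Because the request goes to a locally running Ollama server, no paper text ever leaves the machine.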
Typical Workflow#
1. Complete paper set
2. Define schema
3. Extract data
4. Validate results
5. Export data
6. Analysis / meta-analysis
Common Extraction Fields#
Info-Extractor can be configured to extract a wide range of variables, including:
Study Characteristics#
Publication year, authors, journal
Study design and methodology
Sample size and participant demographics
Geographic location
Research Findings#
Primary outcomes and measures
Statistical results (p-values, effect sizes, confidence intervals)
Key findings and conclusions
Reported limitations
Quality Assessment#
Risk-of-bias indicators
Study quality or reporting scores
Funding sources
Conflicts of interest
How to Run Info-Extractor (Local / Windows)#
Terminal 1: Ollama (Optional)#
ollama serve
Terminal 2: Backend (FastAPI)#
cd Automating-the-Information-Extraction\info-extractor\backend
conda activate paperextract
uvicorn server:app --reload --port 8000
API root:
http://127.0.0.1:8000
API documentation:
http://127.0.0.1:8000/docs
Terminal 3: Frontend (HTML UI)#
cd Automating-the-Information-Extraction\info-extractor\frontend
python -m http.server 5173
Open in browser:
http://127.0.0.1:5173/extractor_ui.html
Extraction Logic#
Heuristics provide evidence-backed values whenever possible.
If LLM extraction is enabled, LLM output is applied first.
Missing or null fields are filled using heuristics.
Final output always conforms to the generated JSON Schema.
Integration with Other Tools#
| Tool | Purpose | Info-Extractor Role |
|---|---|---|
| Review Buddy | Paper collection | Receives papers to extract from |
| Elicit | AI screening | Complements with structured extraction |
| Excel / R / Python | Statistical analysis | Consumes structured outputs |
Use Cases#
Info-Extractor is ideal for:
Systematic Reviews: PRISMA-compliant data extraction
Meta-Analysis: Collecting statistical inputs for quantitative synthesis
Scoping Reviews: Mapping study characteristics across domains
Evidence Synthesis: Compiling structured findings from many papers
Quality Assessment: Consistent evaluation of study quality
Best Practices#
Define your extraction schema before extraction begins to ensure consistency across papers.
Avoid changing schemas mid-way through a review unless absolutely necessary.
Keep schemas version-controlled for traceability.
Quality Control#
Always validate extracted data through:
Manual double-checking of critical fields
Inter-rater reliability checks for meta-analysis inputs
Schema validation logs
Data and Reproducibility Notes#
Do not commit large PDFs or private datasets to GitHub.
Store PDFs locally in non-tracked directories.
Commit only code, schemas, and small example outputs to version control.