πŸ“„ PaperScraper#

Overview#

PaperScraper is a Python library for efficiently harvesting papers from preprint servers and open-access repositories. Instead of relying on subscription database APIs, it pulls metadata (and PDFs) straight from the major preprint archives and PubMed, making it ideal for accessing cutting-edge research before peer review.

GitHub: PhosphorylatedRabbits/paperscraper

Key Features#

  • πŸ”¬ Preprint Focus: arXiv, medRxiv, bioRxiv, chemRxiv

  • πŸ₯ PubMed Integration: Search PubMed with full metadata

  • πŸ“₯ PDF Downloads: Automatic full-text retrieval

  • πŸ” Keyword Search: Flexible query syntax

  • πŸ“Š JSON Export: Structured data output

Installation#

pip install paperscraper

Or for development:

git clone https://github.com/PhosphorylatedRabbits/paperscraper
cd paperscraper
pip install -e .

Quick Start#
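
A minimal first search, following the query convention from the project README: a query is a list of keyword lists, where terms inside an inner list are OR-ed and the inner lists are AND-ed together.

from paperscraper.arxiv import get_and_dump_arxiv_papers

# Inner lists are OR-ed; the outer list AND-s them together
covid19 = ['COVID-19', 'SARS-CoV-2']
ai = ['artificial intelligence', 'deep learning', 'machine learning']
query = [covid19, ai]

# Search arXiv and write one JSON record per line
get_and_dump_arxiv_papers(query, output_filepath='covid19_ai_arxiv.jsonl')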

Advanced Usage#

Date Filtering#

Date filtering is built into the X-rxiv dump functions, which take ISO-formatted begin_date/end_date strings:

from paperscraper.get_dumps import biorxiv

# Get bioRxiv metadata for 2024 only
biorxiv(
    begin_date='2024-01-01',
    end_date='2024-12-31',
    save_path='biorxiv_2024.jsonl'
)
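
The dump contains every record in the date range; keyword filtering then happens locally, e.g. with XRXivQuery as in the project README:

from paperscraper.xrxiv.xrxiv_query import XRXivQuery

# Query the local dump offline; same nested-list query convention
querier = XRXivQuery('biorxiv_2024.jsonl')
querier.search_keywords([['deep learning']], output_filepath='dl_biorxiv_2024.jsonl')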

Combining Multiple Sources#

import json
from paperscraper.arxiv import get_and_dump_arxiv_papers
from paperscraper.load_dumps import QUERY_FN_DICT

query = [['protein folding']]

# arXiv is searched live; bioRxiv is searched against a previously
# downloaded local dump (QUERY_FN_DICT lists the dumps paperscraper finds)
get_and_dump_arxiv_papers(query, output_filepath='arxiv_folding.jsonl')
QUERY_FN_DICT['biorxiv'](query, output_filepath='biorxiv_folding.jsonl')

# Combine and deduplicate, keyed on DOI (falling back to title)
all_papers = {}
for path in ('arxiv_folding.jsonl', 'biorxiv_folding.jsonl'):
    with open(path) as f:
        for line in f:
            paper = json.loads(line)
            key = paper.get('doi') or paper['title']
            all_papers.setdefault(key, paper)

print(f"Total unique papers: {len(all_papers)}")

# Save combined results
with open('combined_results.json', 'w') as f:
    json.dump(list(all_papers.values()), f, indent=2)

Download PDFs#

import json
from paperscraper.pdf import save_pdf
from paperscraper.arxiv import get_and_dump_arxiv_papers

# Search first, then fetch PDFs by DOI (save_pdf expects a dict with a 'doi' key)
get_and_dump_arxiv_papers([['transformer models']], output_filepath='transformers.jsonl')

with open('transformers.jsonl') as f:
    papers = [json.loads(line) for line in f]

for i, paper in enumerate(papers[:10]):  # download the first 10
    if not paper.get('doi'):
        continue
    try:
        save_pdf({'doi': paper['doi']}, filepath=f"paper_{i+1}.pdf")
        print(f"βœ“ Downloaded: {paper['title'][:50]}...")
    except Exception as e:
        print(f"βœ— Failed: {e}")

Common Use Cases#

1. Tracking Latest Research#

from datetime import datetime, timedelta
from paperscraper.get_dumps import biorxiv
from paperscraper.xrxiv.xrxiv_query import XRXivQuery

# Dump metadata for papers posted in the last 7 days
one_week_ago = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')
today = datetime.now().strftime('%Y-%m-%d')
biorxiv(begin_date=one_week_ago, end_date=today, save_path='biorxiv_week.jsonl')

# Keyword-filter the fresh dump locally
querier = XRXivQuery('biorxiv_week.jsonl')
querier.search_keywords([['artificial intelligence']],
                        output_filepath='weekly_ai_papers.jsonl')

with open('weekly_ai_papers.jsonl') as f:
    print(f"Found {sum(1 for _ in f)} papers in the last week")

2. Building a Custom Dataset#

import pandas as pd
from paperscraper.arxiv import get_and_dump_arxiv_papers
from paperscraper.load_dumps import QUERY_FN_DICT

# Collect papers on a specific topic from two sources
query = [['artificial intelligence'], ['healthcare', 'medical']]
get_and_dump_arxiv_papers(query, output_filepath='ai_health_arxiv.jsonl')
QUERY_FN_DICT['biorxiv'](query, output_filepath='ai_health_biorxiv.jsonl')

# Load the JSONL results into one DataFrame for analysis
frames = []
for source, path in [('arXiv', 'ai_health_arxiv.jsonl'),
                     ('bioRxiv', 'ai_health_biorxiv.jsonl')]:
    frame = pd.read_json(path, lines=True)
    frame['source'] = source
    frames.append(frame)

df = pd.concat(frames, ignore_index=True)
df.to_csv('ai_healthcare_dataset.csv', index=False)
print(f"Created dataset with {len(df)} papers")

3. Citation Analysis#

A lightweight proxy for citation analysis is to scan abstracts for arXiv identifiers:

import json
import re
from paperscraper.arxiv import get_and_dump_arxiv_papers

get_and_dump_arxiv_papers([['graph neural networks']], output_filepath='gnn_papers.jsonl')

# Find arXiv IDs mentioned in abstracts (simple pattern, e.g. arXiv:2101.01234)
cited_arxiv_ids = set()
with open('gnn_papers.jsonl') as f:
    for line in f:
        paper = json.loads(line)
        cited_arxiv_ids.update(re.findall(r'arXiv:(\d{4}\.\d{4,5})', paper.get('abstract') or ''))

print(f"Found {len(cited_arxiv_ids)} unique arXiv IDs mentioned in abstracts")

Tips & Best Practices#

Rate Limiting#

import time
from paperscraper.arxiv import get_and_dump_arxiv_papers

# Add delays between successive searches to avoid hammering the server
keywords = ['graph neural networks', 'diffusion models', 'protein design']

for kw in keywords:
    get_and_dump_arxiv_papers([[kw]], output_filepath=f"{kw.replace(' ', '_')}.jsonl")
    time.sleep(3)  # respectful delay

Error Handling#

from paperscraper.arxiv import get_and_dump_arxiv_papers

try:
    get_and_dump_arxiv_papers(
        [['machine learning']],
        output_filepath='ml_papers.jsonl'
    )
except Exception as e:
    print(f"Search failed: {e}")
    # Fall back to a cached dump, or retry (see the sketch below)
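
For transient failures (network hiccups, throttling), a small retry loop with exponential backoff usually suffices; this is a plain-Python sketch, not a paperscraper feature:

import time
from paperscraper.arxiv import get_and_dump_arxiv_papers

def search_with_retry(query, output_filepath, attempts=3):
    """Retry a search with exponential backoff: 1 s, 2 s, 4 s, ..."""
    for attempt in range(attempts):
        try:
            get_and_dump_arxiv_papers(query, output_filepath=output_filepath)
            return True
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(2 ** attempt)
    return False

search_with_retry([['machine learning']], 'ml_papers.jsonl')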

Metadata Quality Check#

import json
from paperscraper.arxiv import get_and_dump_arxiv_papers

get_and_dump_arxiv_papers([['deep learning']], output_filepath='dl_papers.jsonl')

with open('dl_papers.jsonl') as f:
    papers = [json.loads(line) for line in f]

# Keep only records with a substantive abstract and at least one author
quality_papers = [
    p for p in papers
    if len(p.get('abstract') or '') > 100 and p.get('authors')
]

print(f"High-quality papers: {len(quality_papers)}/{len(papers)}")

Advantages & Limitations#

βœ… Advantages#

  • Fast: Direct scraping is often faster than API calls

  • No API Keys: Works without authentication for most sources

  • Latest Research: Access to preprints before peer review

  • Flexible: Easy to customize and extend

  • Open Access: Focuses on freely available content

⚠️ Limitations#

  • Preprint Quality: Papers are not peer-reviewed

  • Limited Filtering: Fewer search operators than database APIs

  • Website Changes: Scraping may break if sites update

  • Rate Limits: Must be respectful of server loads

  • No Scopus/IEEE: Doesn’t cover major subscription databases

Integration with Review Buddy#

PaperScraper works great alongside Review Buddy:

# Use Review Buddy for peer-reviewed papers
from paper_searcher import PaperSearcher
from config import Config

config = Config(scopus_api_key="key", pubmed_email="email")
searcher = PaperSearcher(config)
peer_reviewed = searcher.search_all(query="machine learning", year_from=2023)

# Use PaperScraper for the latest preprints
from paperscraper.arxiv import get_and_dump_arxiv_papers
get_and_dump_arxiv_papers([['machine learning']], output_filepath='ml_preprints.jsonl')

with open('ml_preprints.jsonl') as f:
    print(f"Peer-reviewed: {len(peer_reviewed)}, Preprints: {sum(1 for _ in f)}")

Additional Resources#

  β€’ GitHub repository: https://github.com/PhosphorylatedRabbits/paperscraper

  β€’ PyPI package: https://pypi.org/project/paperscraper/


Next Tool

Continue to LitMaps to learn about visual citation mapping!