
Statistics Seminar

Minting Data from Documents by Karl Rohe

Event Details

Date
Thursday, January 29, 2026
Time
1-2 p.m.
Location
7560 Morgridge Hall
Description

Abstract: You have 200 PDFs. You need a spreadsheet with 200 rows, one per document, with columns capturing the specific variables you care about. Traditionally, that means hiring research assistants (RAs) for weeks of reading and manual data entry. Now, large language models make this fast and scalable. Use cases include extracting study characteristics for systematic reviews, coding interview transcripts, and building structured datasets from any document corpus (e.g., journal articles, clinical notes, policy briefs, historical archives).

I'll demonstrate https://datamint.ing, a platform I built for systematic and automatic data extraction.

Core principle: if you can describe a variable in words, datamint.ing can extract it. Three components make it work:

1. Templates (or "codebooks") specify your variables and decision rules. You describe your documents and data needs in a conversation with datamint.ing's AI assistant. It will then draft your initial codebook.

2. Automated extraction runs multiple independent AI models that each extract values from your documents. A separate model identifies disagreements, giving you (i) an automated inter-rater reliability measure, (ii) a way of averaging over noisy responses (since individual extractions often contain errors), and (iii) an indication of where your codebook lacks clarity and needs refinement.

3. Transparent results: click any cell to inspect source quotes, reasoning, and disagreements. Download clean CSVs for analysis, or share a link so collaborators can inspect your extractions and reuse your codebook/template.
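
The disagreement step in component 2 can be illustrated with a minimal sketch. This is not datamint.ing's implementation: the data, the function names (`consensus`, `summarize`, `agreement_by_variable`), and the use of a simple majority vote are all assumptions made for illustration. The abstract describes a separate model that identifies disagreements, whereas this sketch swaps in plain exact-match counting.

```python
from collections import Counter

# Hypothetical extractions: three independent "raters" (models) each
# report a value for every (document, variable) cell.
extractions = {
    ("doc_001.pdf", "sample_size"): ["120", "120", "120"],
    ("doc_001.pdf", "study_design"): ["RCT", "RCT", "cohort"],
    ("doc_002.pdf", "sample_size"): ["45", "54", "45"],
    ("doc_002.pdf", "study_design"): ["cohort", "cohort", "cohort"],
}


def consensus(values):
    """Return (majority value, share of raters agreeing with it)."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


def summarize(extractions):
    """Majority-vote each cell and flag cells where raters disagree."""
    table, flagged = {}, []
    for cell, values in extractions.items():
        value, agreement = consensus(values)
        table[cell] = value
        if agreement < 1.0:
            flagged.append(cell)
    return table, flagged


def agreement_by_variable(extractions):
    """Mean agreement share per variable (assumed proxy for codebook clarity)."""
    shares = {}
    for (_, variable), values in extractions.items():
        shares.setdefault(variable, []).append(consensus(values)[1])
    return {v: sum(s) / len(s) for v, s in shares.items()}


table, flagged = summarize(extractions)
print(table[("doc_002.pdf", "sample_size")])
print(flagged)
print(agreement_by_variable(extractions))
```

In this toy setup, the flagged cells are the ones a reviewer would inspect first, and a variable with a low mean agreement share is a candidate for a clearer codebook definition. A chance-corrected statistic such as Cohen's or Fleiss' kappa would be a more standard inter-rater reliability measure than the raw agreement share used here.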

The workflow is iterative. Draft a template, extract, review disagreements to find edge cases, refine your codebook, and re-extract, all in minutes. The bottleneck shifts from labor to measurement clarity: can you specify what you want to measure?

I'll give an overview of the platform and do a live demo.

Cost
Free
