Digitizing 51 years of annual census tables on Native American tribes from the Reports of the Commissioner of Indian Affairs, 1864–1920.
The Annual Reports of the Commissioner of Indian Affairs were published by the U.S. Government Printing Office from the 1820s through the 1930s. Each report contains narrative text plus statistical tables on tribal population, education, agriculture, dress, housing, religion, and more. The tables are a rare source of systematic, disaggregated data on Native American communities across the entire federal reservation system.
We digitized the "Culture" tables—tables of statistics relating to population, civilization, dress, houses, allotments, reading, marriages, and religious participation. These appear under varying titles across eras but track similar variables throughout.
Source PDFs come from two collections: 62 full ARCIA reports (1864–1932, 21–187 MB each) and 45 pre-extracted culture table PDFs (1874–1920, much smaller). Each PDF page in the full reports is a two-page scan of facing printed pages.
| Era | Years | Source | Status |
|---|---|---|---|
| Pre-standardization | 1864–1874 | Full ARCIA PDFs | Extracted |
| Standard format | 1875–1904 | Full ARCIA PDFs | Extracted |
| Format gap | 1905–1910 | — | No table |
| Late-era format | 1911–1920 | Full ARCIA PDFs | Extracted |
| Discontinued | 1921–1932 | — | No table |
Claude Code orchestrates the workflow, but historical OCR is handled by Gemini 3 Flash—among the top performers on scanned document benchmarks (socOCRbench). Each table goes through five phases.
Gemini 3 Flash: OCR and table extraction from scanned PDFs. Among the top performers on historical document benchmarks (socOCRbench).
Claude: pipeline orchestration, cleaning logic, code generation, column mapping, and entity harmonization.
Every extraction is scored by arithmetic identity pass rate—do the column components sum to their published totals?
GREEN: ≥ 90% · YELLOW: 70–89% · RED: < 70%
Final tally: 41 GREEN, 10 YELLOW, 0 RED across 51 years.
Each cell below is one year. Green = extracted and validated. Orange = extracted with minor quality flags (all are legitimate data absences, not extraction errors). Gray = no culture table published that year.
| Panel Statistics | |
|---|---|
| Years covered | 51 (1864–1920) |
| Observations (rows) | 10,123 |
| Columns | 814 |
| Unique culture IDs | 3,588 |
| Canonical entity names | 3,319 |
| Entity crosswalk entries | 5,922 |
| Footnote definitions | 1,164 |
Each source column expands into `_raw`, `_grp`, `_ind`, and `_foot` variants, explaining the high column count.
The standardized panel (`arcia_culture_panel_wide.csv`, 10,123 rows, 3,588 culture IDs) applies entity standardization: cleaning OCR artifacts, normalizing tribe/agency spellings across years (e.g., “Menomonee” → “Menominee”), and removing ~350 non-entity rows (leaked variable names, numeric values in tribe fields). A raw version (`arcia_culture_panel_wide_raw.csv`, 10,512 rows, 5,212 culture IDs) preserves the original source-document spellings and is used for the human comparison below, since the human digitizers also kept original ARCIA spellings and matching works best when both sides use the same naming conventions.
Each year is processed independently through phases 1–4. Panel construction (phase 5) runs only after all years complete. The pipeline is idempotent—re-running any phase produces the same output—and fail-safe: if a year fails at any phase, it is logged and skipped while the batch continues.
| Phase | Input | Output | Script | Notes |
|---|---|---|---|---|
| 1. Scout | PDF + target table description | `page_map.json` | Manual / Claude agent | Read TOC (pp. 3–8) for approximate range, then render boundary pages as PNGs and visually verify exact start/end. Off-by-one errors cause 10–30 pp quality drops. Early PDFs (1864–1872) have tables near the end. |
| 2. Extract | PDF + page range | `_read_literal.csv` + `_read_context.csv` | `gemini_extract.py` | Dual reads via Gemini API; 3 concurrent workers; era-specific prompts |
| 3. Validate | Cleaned CSV | GREEN / YELLOW / RED status | `audit_extractions.py` | Arithmetic identity checks; column keyword checks; entity coverage |
| 4. Clean | Raw context CSV (with literal fallback) | `{year}_{type}.csv` | `clean_arcia_csv.py` | Parse pipe-delimited blocks; normalize headers; add provenance |
| 5. Panel | All cleaned CSVs + column map JSON | `arcia_{type}_panel_wide.csv` | `harmonize_culture.py` | Position-based column renaming; bracket expansion; entity harmonization |
The pipeline is parameterized by a `--table-type` flag (default: `culture`), so the same infrastructure handles new table types without code changes. Page ranges are stored in `scripts/configs/page_registry.json`; prompts are loaded from `scripts/prompts/{table_type}_{era}.txt`.
Every table is extracted twice by Gemini with different prompts. The two reads complement each other—literal reads preserve exact characters while context reads resolve structural ambiguities—and disagreements pinpoint exactly which cells need attention.
"Extract exactly what you see on the page, character by character." Preserves OCR artifacts but never hallucinates values. Best for numeric accuracy.
Uses a single shared prompt (scripts/prompts/literal.txt) across all
table types and eras.
"Use your understanding of the table structure to resolve ambiguities." Better at entity names and column alignment, but sometimes creates phantom entries from footnotes or section headers.
Uses era-specific prompts that describe the table title, expected columns, entity hierarchy, and missing-data conventions for that period.
Each PDF page range is converted to PNG images via PyMuPDF at 2× zoom. Images are
sent to the Gemini 3 Flash API one page at a time, with up to 3 concurrent workers and
a rate limiter (14 RPM). Each page returns a pipe-delimited text block; all pages are
concatenated with # --- Page N --- markers into a single CSV file.
The --prompt-type both flag runs literal and context reads in parallel,
halving wall-clock time compared to sequential runs.
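The per-page assembly described above can be sketched as follows (the function name is illustrative; only the `# --- Page N ---` marker format comes from the pipeline description):

```python
def concat_pages(page_blocks):
    """Join per-page pipe-delimited text blocks into one CSV-like file,
    separated by '# --- Page N ---' markers (assumed assembly logic)."""
    parts = []
    for n, block in enumerate(page_blocks, start=1):
        parts.append(f"# --- Page {n} ---")
        parts.append(block.rstrip("\n"))
    return "\n".join(parts) + "\n"
```

Downstream, the cleaner can split on the marker lines to recover per-page provenance for each extracted row.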
Different eras require different context prompts because the table titles, column structures, and entity hierarchies all change:
| Prompt File | Years | Key Differences |
|---|---|---|
| `culture_1875_1904.txt` | 1875–1904 | 3-column entity hierarchy (State/Agency/Tribe); 30+ data columns; bracket encoding |
| `culture_1911.txt` | 1911 | TABLE 2 format; different column set; must skip page 1 (TABLE 1 = admin stats) |
| `culture_1912_1920.txt` | 1912–1920 | TABLE 4/5; 2-column entities (State/Superintendency); marriage/church/dress columns |
read_merger.py aligns both reads cell-by-cell and produces a merged output
plus a disagreements file. For each cell where the reads disagree, an
arithmetic-aware refinement step checks: does swapping to the other
read’s value fix a failing column sum? If so, pick that value. This is free
quality—typically improves pass rates by 1–3 percentage points with zero human effort.
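The arithmetic-aware refinement can be illustrated with a minimal sketch (function name and row representation are assumptions; the real logic lives in `read_merger.py`):

```python
def refine_cell(components, total, idx, alt_value):
    """If the column components fail to sum to the published total,
    try swapping cell `idx` to the other read's value; keep the swap
    only when it makes the identity hold (sketch of the merge rule)."""
    if sum(components) == total:
        return components          # identity already holds; keep as-is
    swapped = list(components)
    swapped[idx] = alt_value
    if sum(swapped) == total:
        return swapped             # the other read's value fixes the sum
    return components              # neither read satisfies the identity
```

Because the swap is accepted only when it restores a checksum, the step cannot make a passing row worse, which is why it is "free quality."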
In 2 of 51 years (1878, 1886), the context read failed entirely (0 or near-0 tokens from Gemini). The cleaner automatically detects this and falls back to the literal read. Both years validated clean with full row counts.
Having two independent reads of the same PDF page provides a powerful diagnostic that requires no ground truth data. When the context and literal reads disagree on a large fraction of cells, this flags systematic extraction errors that arithmetic checks alone cannot catch.
Case study (1917): The context read for 1917 appeared structurally correct (right entity names, 22 aligned columns), but had silently shifted all numeric values down by one row within each state section—state-level totals were assigned to the first superintendency. Arithmetic checks would have caught this via failed subtotal validation, but the fastest signal was the cross-read comparison: 91 of 186 comparable cells differed by >5× between reads. A year-specific re-extraction with an improved prompt fixed the issue ($0.04).
Recommended threshold: Flag any year where >30% of comparable cells differ by >5× between reads. For future table types (Population, Crime, etc.) where no human digitization exists for comparison, this cross-read check is the primary automated defense against silent extraction errors.
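A minimal version of this cross-read check, using the thresholds recommended above (function name and cell representation are assumptions):

```python
def flag_divergent_year(literal, context, ratio=5.0, frac=0.30):
    """Flag a year when more than `frac` of comparable numeric cells
    differ by more than `ratio`x between the two reads (sketch)."""
    comparable = [(a, b) for a, b in zip(literal, context)
                  if a is not None and b is not None and min(a, b) > 0]
    if not comparable:
        return False
    big = sum(1 for a, b in comparable if max(a, b) / min(a, b) > ratio)
    return big / len(comparable) > frac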
Every extraction is automatically scored by testing whether column components sum to their published totals. These arithmetic identities are free validation—if the math works, those cells are correct.
ARCIA tables contain built-in checksums. For example, the culture table reports population broken down as males + females = total, and dress as citizen’s dress + partial + Indian dress = total population. If these sums hold in the extracted data, the values are provably correct regardless of OCR quality.
This typically validates 50–80% of numeric cells with zero human effort. Failures point directly at column alignment errors or OCR misreads—far more actionable than random spot-checking.
| Status | Pass Rate | Action |
|---|---|---|
| GREEN | ≥ 90% | Proceed automatically |
| YELLOW | 70–89% | Proceed, flag for batch review |
| RED | < 70% | Stop & run auto-diagnosis |
When a table scores RED, the pipeline runs five diagnostic strategies in order:
| # | Check | What It Detects |
|---|---|---|
| 1 | Column shift detection | Field count mismatch between header and data rows (most common cause) |
| 2 | Phantom entity detection | Context read created >15% more entities than literal (footnotes parsed as data) |
| 3 | Low entity count | Wrong page range scouted (table starts or ends on a different page) |
| 4 | Read preference swap | Try literal-only or context-only instead of the merged read |
| 5 | Per-position pass rate | Which specific column positions are failing (pinpoints the misaligned column) |
Gemini faithfully reproduces each year’s original ARCIA spellings, producing 5,212 unique culture_ids across 51 years. Because the same tribe appears under slightly different spellings year to year (e.g., “Menomonee” vs. “Menominee,” “Sioux of Devil’s Lake” vs. “Sioux of Devils Lake”), a post-extraction standardization pipeline normalizes these into 3,588 canonical culture_ids.
A 7-phase script (aggressive_dedup.py) processes the raw panel:
| Phase | What It Does | Rows Affected |
|---|---|---|
| 1–4 | Remove junk rows (footnote text, page headers, blank rows parsed as data) | −390 |
| 5 | Normalize spelling variants (OCR artifacts, abbreviations, singular/plural, punctuation) | ∼1,378 tribe-name pairs standardized |
| 6 | Rebuild culture_ids from cleaned entity fields | 5,212 → 3,588 unique IDs |
| 7 | Resolve actual duplicates (same culture_id + year) | 0 found |
When Phase 6 rebuilds culture_ids, different original entities can sometimes collapse to the same canonical ID (e.g., two distinct tribes whose cleaned names match). A collision guard detects these false merges: if multiple original culture_ids map to the same new ID within a year, all affected rows are reverted to their original IDs.
In the culture panel, this reverted 797 rows that would otherwise have been incorrectly merged.
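The collision guard can be sketched as follows (row representation and function name are assumptions; only the revert rule comes from the text):

```python
from collections import defaultdict

def collision_guard(rows):
    """Revert false merges: if multiple original culture_ids map to the
    same canonical ID within a year, restore the original IDs (sketch).
    Each row is a dict with 'year', 'orig_id', and 'new_id'."""
    groups = defaultdict(set)
    for r in rows:
        groups[(r["year"], r["new_id"])].add(r["orig_id"])
    for r in rows:
        key = (r["year"], r["new_id"])
        if len(groups[key]) > 1:       # collision within a year
            r["new_id"] = r["orig_id"] # revert to the original ID
    return rows
```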
The project ships two versions of the panel to serve different needs:
| File | IDs | Rows | Used For |
|---|---|---|---|
| `arcia_culture_panel_wide.csv` | 3,588 | 10,123 | Entity Explorer, Crosswalk, Human Inspection (standardized names) |
| `arcia_culture_panel_wide_raw.csv` | 5,212 | 10,512 | Gemini vs. Human comparison (preserves original ARCIA spellings for matching) |
In the original printed tables, curly brackets { group multiple tribes that share a single aggregate statistic. Gemini captures these inline with a `value_group_rowcount` encoding: when a value applies to a group of N tribes, Gemini writes it as `1250_group_3`, meaning "1,250 is the aggregate for a group of 3 rows." Footnote suffixes use `+(code)` format: `3600+(a)` means "3,600 with footnote marker ‘a’."
The harmonization pipeline expands this into four columns per variable:
| Suffix | Meaning |
|---|---|
_raw | The numeric value (1250) |
_grp | Group size (3) or 0 if individual |
_ind | 1 if individual value, 0 if group |
_foot | Footnote marker, if any |
Column names, positions, and even which columns exist change substantially across the
57-year span. A per-year JSON config (culture_column_map.json) maps
column positions to canonical variable names. This position-based approach is critical
because Gemini’s header parsing is unreliable—but column positions
are stable within an era.
| Era | Years | Entity Columns | Offset | Key Differences |
|---|---|---|---|---|
| Pre-standardization | 1864–1870 | Varies (tribe-only, supt+tribe) | Varies | No "State" column in many years; farming and culture on alternating pages (1864) |
| Early standard | 1871–1874 | Varies (merged entity cols) | Varies | 1873 has 24-col hard-coded header; 1874 shares pages with agricultural table |
| Core format | 1875–1904 | State / Agency / Tribe | 3 | Stable 3-prefix format; brackets common |
| Merged entity | 1878, 1886 | Merged (agency+tribe) | 3 | Gemini merges entity columns; requires offset adjustment |
| Late era | 1911–1920 | States & Superintendencies | 2 | Different table title (Table 4/5); fewer columns; 1911 = TABLE 2 |
Tribal names appear in dozens of spelling variants across decades of reports. The crosswalk maps 4,149 raw names to 1,380 canonical forms.
~600 cleaning rules (ported from Stata) apply systematic normalization:
unicode/accent removal, suffix stripping, OCR fixes, plural standardization,
compound tribe names, and NY reserve format standardization. Rules are defined in
entity_rules.py and applied by normalize_entities.py.
The deep crosswalk adds temporal group links and modern name mappings.
The culture_id is constructed differently by era:
| Era | ID Construction |
|---|---|
| 1874–1911 | state_agency_tribe |
| 1912–1920 | state_superintendency |
| 1873, 1878, 1886 | entity (merged col) |
| 1866 | agency_tribe |
| Entity Crosswalk | |
|---|---|
| Total entries | 4,724 |
| Unique canonical names | 1,380 |
| Harmonized (changed) | 866 |
| Temporal group links | 217 |
| Need review | 89 |
The pipeline uses a multi-model architecture: Gemini 3 Flash handles OCR, Claude Code handles everything else. Extraction cost is dominated by image input tokens sent to the Gemini API.
Gemini 3 Flash: OCR and table extraction from scanned PDFs. Configured with `thinking_level=low`—the socOCRbench benchmark showed that low reasoning effort outperforms high for document OCR tasks (0.698 vs. 0.680).
Claude Code: pipeline orchestration, cleaning logic, code generation, column mapping, entity harmonization, and quality diagnosis.
Claude Code launches parallel extraction agents for batch runs. Agents write summary JSONs on completion; the orchestrator reads summaries to track progress. Agent transcripts are 100K+ tokens each—reading them directly would exhaust the orchestrator’s context window.
Pattern: agents write summaries → orchestrator reads summaries → never read full transcripts.
| Metric | Value |
|---|---|
| Per PDF page (dual read) | ~$0.019 |
| Typical 10-page table | ~$0.19 |
| Full culture extraction (51 years, ~475 pages) | ~$9.50 |
| Projected per additional table type (~50 years) | ~$8–10 |
Cost is dominated by image input tokens (each page image is ~2,000–4,000 tokens), not output. Dual reads double the per-page cost but provide critical quality gains via arithmetic-aware merge refinement. Gemini 3 Flash pricing: $0.50/M input tokens, $3.00/M output tokens.
Extracting 51 years of historical tables involves many failure modes. The pipeline handles each automatically rather than requiring manual intervention.
If the context read returns fewer than 50 rows and the literal read has 3× more rows, the cleaner automatically falls back to the literal read. This saved 2 years (1878, 1886) where Gemini’s context read returned 0 or near-0 tokens—both validated clean using literal data alone.
The batch runner checks cumulative Gemini API spend before each year. If the hard cap is reached ($38 default, leaving a $2 buffer under the $40 target), remaining years are logged and queued—not lost. A per-year cost estimate (~$0.10/page, intentionally conservative) provides pre-flight warnings.
Each year gets a 25-minute timeout. If exceeded, the batch runner kills the entire
process group (killpg + SIGKILL) and continues to the next year.
macOS sleep can kill long-running extractions, so batch runs are wrapped with
caffeinate -i to prevent the system from sleeping.
A thread-safe token-bucket rate limiter enforces Gemini’s RPM quota (default: 14 requests/minute). Individual API calls retry with exponential backoff on 503/429 errors (5 retries, base delay 5 seconds). This prevents quota exhaustion during high-concurrency batch runs.
For 1878 and 1886, the context read returns 0 or near-0 tokens. The cleaner detects this automatically and falls back to the literal read. Both years validated successfully with full row counts and all columns present.
The state column has 215 unique values across the panel. Approximately 40 are legitimate multi-state groupings or historical territories. The remainder are OCR or format artifacts (e.g., reservation names misidentified as states in 1913/1919, tribe names leaking into the state field in 1871). These are acceptable for downstream crosswalk resolution.
Accurate page boundaries are critical for extraction quality. We validated our scouting process against the pre-extracted Culture Tables PDFs, which serve as ground truth for 41 years (1874–1920).
The Culture Tables PDFs contain byte-for-byte identical images clipped from the Full ARCIA PDFs—not different scans. PyMuPDF renders identical PNGs from either source at the same zoom factor. This means page range accuracy is the only factor affecting extraction quality; no intermediate clipping step is needed.
Image-hash matching between Culture Tables page 1 and Full ARCIA pages revealed that 20 of 41 years had incorrect page ranges in the initial TOC-based scouting. Most errors were off-by-one (included a blank/agriculture page or missed the first/last data page), but some were larger (1875 missed 4 pages on each end; 1883 missed 3 at the start).
Root cause: ARCIA PDFs use two-page spreads where the culture table can end on the left half while a different table (agriculture, land statistics) begins on the right. TOC page numbers are approximate and don’t account for this layout.
To fix this, the scouting process now includes a visual boundary verification step:
after the TOC gives an approximate range, boundary pages are rendered as PNGs (PyMuPDF at
150 DPI) and visually inspected to find the exact first and last pages of the target table.
Pages `start−2` through `start` and `end` through `end+2` are checked.
Result: 40/41 years (97.6%) matched Culture Tables ground truth exactly. The single disagreement (1874) is a standalone recapitulation page that the Culture Tables PDF includes but is correctly excluded for extraction purposes (it contains only aggregate summaries, not per-tribe data rows).
Output files and construction scripts for the culture table extraction pipeline (51 years, 1864–1920).
| File | Description |
|---|---|
| `extracted/cleaned/YEAR_Culture.csv` | Cleaned extraction for each year (51 files) |
| `extracted/cleaned/arcia_culture_panel_wide.csv` | Standardized panel with entity cleaning (10,123 × 814, 3,588 culture IDs) |
| `extracted/cleaned/arcia_culture_panel_wide_raw.csv` | Raw panel preserving source-document spellings (10,512 × 814, 5,212 culture IDs) |
| `extracted/cleaned/arcia_culture_panel_ids.csv` | Panel row identifiers |
| `extracted/cleaned/arcia_culture_footnote_definitions.csv` | Footnote definitions lookup table (1,164 definitions across 51 years) |
| `extracted/cleaned/entity_crosswalk_v2.csv` | Entity crosswalk with rule types and confidence (4,724 entries) |
| `extracted/cleaned/deep_crosswalk.csv` | Enriched crosswalk with temporal groups and modern names |
| `extracted/logs/harmonize_decisions.md` | Log of all harmonization choices |
| `extracted/logs/entity_harmonization_decisions.md` | Entity harmonization decisions and temporal group documentation |
| `scripts/configs/entity_rules.py` | ~600 entity cleaning rules (ported from Stata) |
| `scripts/configs/temporal_entities.json` | Historical agency transitions, consolidations, state splits |
| `scripts/configs/culture_column_map.json` | Per-year position-to-variable mapping |
| `extracted/raw/YEAR_*_read_literal.csv` | Raw Gemini literal reads (archival) |
| `extracted/raw/YEAR_*_read_context.csv` | Raw Gemini context reads (archival) |
Scripts are organized by pipeline phase. Run the full post-extraction pipeline with:
`python scripts/run_pipeline.py --table-type culture`
| Script | Phase | Description |
|---|---|---|
| Extraction | | |
| `scripts/utils/gemini_extract.py` | Extract | Gemini API extraction engine. Dual reads (literal + context) + footnotes. Handles era-specific prompt selection via `_CULTURE_ERA_MAP`. Cost: ~$0.019/page. |
| `scripts/batch_extract.py` | Extract | Batch runner for multi-year extraction. Reads page ranges from `page_registry.json`, runs parallel Gemini calls with rate limiting. |
| `scripts/prompts/` | Extract | Externalized Gemini prompt templates per table type and era: `literal.txt`, `culture_1875_1904.txt`, `culture_1911.txt`, `culture_1912_1920.txt`, `culture_1917.txt`, `culture_1919.txt`. |
| Cleaning & Validation | | |
| `scripts/clean_arcia_csv.py` | Clean | Raw CSV → cleaned CSV. Modal header detection, dropped-pipe repair, entity column splitting (Case A: 1912–1920, Case B: 1874–1911), literal fallback logic. Core of the pipeline. |
| `scripts/audit_extractions.py` | Validate | Arithmetic identity checks (component sums = totals), RED/YELLOW/GREEN triage. Primary quality gate. |
| `scripts/arithmetic_corrections.py` | Validate | Post-harmonization OCR correction using arithmetic identities. `--dry-run` supported. |
| `scripts/fix_early_years.py` | Clean | Special-case fixes for pre-standardization years (1864, 1873) with non-standard table formats. |
| Harmonization & Panel Construction | | |
| `scripts/harmonize_culture.py` | Harmonize | Stack 51 cleaned CSVs into a single panel. Position-based column mapping via `culture_column_map.json`, bracket encoding expansion (`_raw`, `_grp`, `_ind`, `_foot`). |
| `scripts/aggressive_dedup.py` | Standardize | Entity standardization: ~600 cleaning rules (OCR fixes, abbreviation expansion, spelling normalization), collision guard (797 reverts). |
| `scripts/build_entity_crosswalk.py` | Crosswalk | Build raw → canonical entity name mapping with rule type and confidence. |
| `scripts/generate_deep_crosswalk.py` | Crosswalk | Enrich crosswalk with temporal groups (agency transitions, consolidations, state boundary changes). |
| `scripts/build_footnote_lookup.py` | Footnotes | Parse footnote CSVs into a definitions lookup table (1,164 definitions across 51 years). |
| `scripts/utils/normalize_entities.py` | Utility | Entity name normalization helpers (regex, OCR fixes, abbreviation expansion). |
| Comparison & Quality | | |
| `scripts/compare_gemini_vs_human.py` | Validate | Cell-level comparison of Gemini panel vs. human digitization (Shyrley). Two-pass entity matching (exact ≥ 0.99, then constrained fuzzy ≥ 0.85). |
| `scripts/compare_human_vs_human.py` | Validate | Shyrley vs. Marlon human–human benchmark (99.8% match rate). |
| Website & Visualization | | |
| `scripts/generate_comparison_json.py` | Website | Generate `comparison_data.js` for the Validation tab. |
| `scripts/generate_crosswalk_json.py` | Website | Generate `crosswalk_data.js` for the Entity Crosswalk tab. |
| `scripts/generate_explorer_json.py` | Website | Generate `explorer_data.js` for the Entity Explorer tab. |
| `scripts/generate_inspection_sample.py` | Website | Generate `inspection_data.js` with PDF page images and raw/cleaned data samples. |
| Orchestration | | |
| `scripts/run_pipeline.py` | Pipeline | Post-extraction pipeline orchestrator. Chains: clean → audit → harmonize → footnotes → crosswalk → dedup → website JSON. Supports `--start-from` for partial reruns. |
| Configuration | | |
| `scripts/configs/page_registry.json` | Config | Page ranges per year per table type (source PDF, start/end page). |
| `scripts/configs/culture_column_map.json` | Config | Per-year position → variable name mapping (handles column count changes across eras). |
| `scripts/configs/entity_rules.py` | Config | ~600 entity cleaning rules: OCR fixes, abbreviation expansion, spelling normalization. |
| `scripts/configs/temporal_entities.json` | Config | Historical agency transitions, consolidations, state boundary changes. |
Explore how raw entity names from ARCIA reports (1864–1920) were harmonized
into canonical forms in the standardized panel (arcia_culture_panel_wide.csv).
Filter by entity type, rule type, or search for specific names.
| Raw Name | Canonical | Type | Rule | First | Last | Years | Rows |
|---|---|---|---|---|---|---|---|
BIA agencies were renamed, consolidated, and split between 1864–1920. These groups link entities that are historically the same administrative unit. In 1908, the "Indian Agent" position was abolished and replaced by school superintendents.
Browse all canonical entities in the standardized culture panel
(arcia_culture_panel_wide.csv).
Search by name, filter by type, and see variant history, year coverage, and related entities.
Track how key culture indicators changed over time for individual tribes. Select a variable, search for tribes, and click names to toggle them on/off. 183 tribes with 15+ data points are available.
Note: Values are raw counts reported in the ARCIA tables, not shares. Variable definitions change across eras: "citizen's dress" (1874–1889) becomes "wholly in citizen's dress" (1890–1904) and "citizen's clothing" (1912–1920). English usage similarly combines "use English" (1874–1904) and "speak English" (1912–1920). Gaps in the lines indicate years where data was not reported for that tribe.
How accurate is automated extraction of historical government tables? We answer this by comparing Gemini’s output against an independent human digitization of the same ARCIA Culture tables, and by testing whether an alternative extraction strategy—multiple reads with majority voting—improves results.
Shyrley Peraza manually digitized the ARCIA Culture tables for 41 years (1874–1904, 1911–1920) as part of an earlier research effort. Her dataset provides a cell-level benchmark: each entity-year-variable triple has a “human” value we can compare against Gemini’s extraction.
Matching method: Entities are linked between datasets using a two-pass approach:
exact matches first (score ≥ 0.99), then constrained fuzzy matching
(SequenceMatcher ≥ 0.85, same agency prefix) on composite state_agency_tribe identifiers.
This produces 7,028 matched entity-year pairs across 41 overlapping years,
covering 88,037 individual cell comparisons across 164 variables.
This comparison uses the raw (pre-standardization) panel to maximize entity matching,
since both Gemini and the human digitizers preserve original ARCIA spellings.
Not all entities match: the raw panel has ~5,212 unique culture IDs versus the human
data’s ~2,976, because Gemini reproduces each year’s exact spellings
(e.g., “Menomonee” in 1879, “Menominee” in 1880) while human
digitizers implicitly standardize as they type. Unmatched entities are disproportionately
single-year spelling variants rather than genuinely missing tribes.
1. Internal (arithmetic identities): ARCIA tables contain built-in checksums— e.g., males + females = total population, dress components = total. If these hold, the extracted values are provably correct. This validates 50–80% of numeric cells with zero human effort. Final tally: 41 GREEN (≥90%), 10 YELLOW (70–89%), 0 RED. See the Technical Detail tab.
2. External (human comparison): Cell-by-cell comparison against Shyrley’s manual digitization. This catches errors that arithmetic checks miss—column offsets, entity misidentification, and OCR errors in non-checksum variables. Results below.
Before evaluating AI accuracy, we need to know: how reproducible is human digitization? Two independent research assistants (Shyrley Peraza and Marlon) each digitized all 41 years of ARCIA Culture tables following the same instructions. Their inter-rater agreement establishes the ceiling against which we evaluate Gemini.
Across 41 shared years, 93.9% of 88,037 compared cells match exactly. The remaining 6.1% are dominated by OCR-level errors (minor digit confusions) and column offset artifacts from Gemini’s inconsistent pipe-separator handling.
Percentage of compared cells where Gemini and human values match exactly. The faint line shows the human-human agreement rate for the same year as a reference ceiling.
Can we improve extraction accuracy by running multiple independent OCR reads and taking a majority vote? We tested this on 6 sample years spanning different eras and difficulty levels.
For each of 6 sample years, we ran 5 independent extractions: 4× Gemini 3 Flash (same model as the production pipeline, different random seeds) and 1× Gemini 2.5 Pro (a larger, reasoning-capable model). Each read was cleaned and harmonized independently using the same pipeline code. A majority-vote algorithm then selected the most common value for each cell across the 5 reads.
Sample years: 1877 (88.6% baseline), 1881 (97.7%), 1889 (85.7%), 1895 (89.2%), 1911 (86.3%), 1919 (81.1%). Total cost: $3.63 for 30 reads.
| Year | Pipeline Baseline | Best Single Read | Voted (5 reads) | Delta |
|---|---|---|---|---|
| 1881 | 93.2% | 86.1% (R1) | 85.3% | −7.9 pp |
| 1877 | 73.1% | 76.4% (R3) | 64.2% | −8.9 pp |
| 1889 | 68.4% | 75.0% (R2) | 64.6% | −3.8 pp |
| 1895 | 71.4% | 85.1% (R3) | 85.9% | +14.5 pp |
| 1911 | 58.4% | 70.7% (R1) | 68.4% | +10.0 pp |
| 1919 | 61.4% | 71.4% (R1) | 69.3% | +7.9 pp |
| Average | 71.0% | — | 72.9% | +2.0 pp |
The Pro model (read 5) averaged 46.1% match rate vs. 70.8% for Flash (reads 1–4). It was also 5–6× slower (~750–960s vs. ~150–235s per year) and 3–5× more expensive. Worst case: 1919 at just 19.2%.
Year-specific cleaning fixes (dropped-pipe detection, blank-run insertion, entity merge conditionals, phantom column removal) add far more value than multiple reads. The 1881 pipeline baseline is 93.2% with one read; the 5-read vote achieves only 85.3%.
Same model, same pages: match rates swing 10–17 pp between runs (e.g., 1889: 29.1% to 75.0%). Gemini’s extraction is non-deterministic—entity counts range from 322 to 409 rows for the same year.
When entity matching across reads is good, voting improves results: +14.5 pp for 1895, +10.0 pp for 1911, +7.9 pp for 1919. But when entity alignment fails across reads (1877, 1881), voting hurts because the voter discards unmatched rows.
Multi-read majority voting provides a modest +2.0 pp average improvement at 5× the extraction cost ($3.63 vs ~$0.73 for the same 6 years). The more cost-effective strategy is single dual-read extraction + targeted year-specific cleaning, which achieves 93.9% match rate across 41 years at a total cost of ~$12 for the entire corpus. Investment in cleaning logic yields higher returns than investment in additional OCR reads.
Since the original experiment used Gemini 2.5 Pro (which performed poorly), we re-ran with Gemini 3.1 Pro Preview—the current leader on the socOCRbench document understanding benchmark—using low reasoning mode. Same 6 years, same pipeline.
| Year | Pipeline | Flash (Read 1) | 2.5 Pro | 3.1 Pro (low) |
|---|---|---|---|---|
| 1881 | 93.2% | 86.1% | 61.6% | 61.9% |
| 1877 | 73.1% | 60.5% | 73.5% | 48.7% |
| 1889 | 68.4% | 63.8% | 29.1% | 52.0% |
| 1895 | 71.4% | 72.1% | 43.2% | 51.0% |
| 1911 | 58.4% | 70.7% | 50.2% | 59.6% |
| 1919 | 61.4% | 71.4% | 19.2% | 59.3% |
| Average | 71.0% | 70.8% | 46.1% | 55.4% |
Each cell-level discrepancy is classified into a likely root cause category based on the relationship between Gemini and human values.
Match rates by variable across all compared entity-year pairs. Variables at the top have the most discrepancies; those at the bottom are most reliable.
The 30 largest cell-level disagreements by absolute difference. Many are concatenation artifacts where two values got stuck together during extraction.
Documented sources of disagreement between Gemini and human extraction, ordered by severity.
Random sample of 10 years for visual comparison between the original PDF
and the per-year cleaned CSVs (YEAR_Culture.csv). Each card shows the
first data page of the table alongside the first 5 rows that Gemini extracted
from that exact page. Use this to spot-check whether entity names, column alignment,
and numeric values match the source document.
Sample: 10 years stratified across early (1864–1873), standard (1874–1904), and late (1912–1920) eras. Seed: 42 (fixed for reproducibility). Showing raw extraction data (pipe-separated values as returned by Gemini).
Loading inspection data…