Digitizing 51 years of annual census tables on Native American tribes from the Reports of the Commissioner of Indian Affairs, 1864–1920.
The Annual Reports of the Commissioner of Indian Affairs were published by the U.S. Government Printing Office from the 1820s through the 1930s. Each report contains narrative text plus statistical tables on tribal population, education, agriculture, dress, housing, religion, and more. The tables are a rare source of systematic, disaggregated data on Native American communities across the entire federal reservation system.
We digitized the "Culture" tables—tables of statistics relating to population, civilization, dress, houses, allotments, reading, marriages, and religious participation. These appear under varying titles across eras but track similar variables throughout.
Source PDFs come from two collections: 62 full ARCIA reports (1864–1932, 21–187 MB each) and 45 pre-extracted culture table PDFs (1874–1920, much smaller). Each PDF page in the full reports is a two-page scan of facing printed pages.
| Era | Years | Source | Status |
|---|---|---|---|
| Pre-standardization | 1864–1874 | Full ARCIA PDFs | Extracted |
| Standard format | 1875–1904 | Full ARCIA PDFs | Extracted |
| Format gap | 1905–1910 | — | No table |
| Late-era format | 1911–1920 | Full ARCIA PDFs | Extracted |
| Discontinued | 1921–1932 | — | No table |
Claude Code orchestrates the workflow, but historical OCR is handled by Gemini 3 Flash—among the top performers on scanned document benchmarks (socOCRbench). Each table goes through five phases.
Gemini 3 Flash: OCR and table extraction from scanned PDFs. Among the top performers on historical document benchmarks (socOCRbench).
Claude: pipeline orchestration, cleaning logic, code generation, column mapping, and entity harmonization.
Every extraction is scored by arithmetic identity pass rate—do the column components sum to their published totals?
GREEN: ≥ 90% · YELLOW: 70–89% · RED: < 70%
Final tally: 41 GREEN, 10 YELLOW, 0 RED across 51 years.
Each cell below is one year. Green = extracted and validated. Orange = extracted with minor quality flags (all are legitimate data absences, not extraction errors). Gray = no culture table published that year.
| Panel Statistics | |
|---|---|
| Years covered | 51 (1864–1920) |
| Observations (rows) | 10,123 |
| Columns | 814 |
| Unique culture IDs | 3,588 |
| Canonical entity names | 3,319 |
| Entity crosswalk entries | 5,922 |
| Footnote definitions | 1,164 |
Each source column expands into `_raw`, `_grp`, `_ind`, and `_foot` variants, explaining the high column count.
The standardized panel (`arcia_culture_panel_wide.csv`, 10,123 rows, 3,588 culture IDs) applies entity standardization: cleaning OCR artifacts, normalizing tribe/agency spellings across years (e.g., “Menomonee” → “Menominee”), and removing ~350 non-entity rows (leaked variable names, numeric values in tribe fields). A raw version (`arcia_culture_panel_wide_raw.csv`, 10,512 rows, 5,212 culture IDs) preserves the original source-document spellings and is used for the human comparison below, since the human digitizers also kept original ARCIA spellings and matching works best when both sides use the same naming conventions.
Each year is processed independently through phases 1–4. Panel construction (phase 5) runs only after all years complete. The pipeline is idempotent—re-running any phase produces the same output—and fail-safe: if a year fails at any phase, it is logged and skipped while the batch continues.
| Phase | Input | Output | Script | Notes |
|---|---|---|---|---|
| 1. Scout | PDF + target table description | `page_map.json` | Manual / Claude agent | Read TOC (pp. 3–8) for approximate range, then render boundary pages as PNGs and visually verify exact start/end. Off-by-one errors cause 10–30 pp quality drops. Early PDFs (1864–1872) have tables near the end. |
| 2. Extract | PDF + page range | `_read_literal.csv` + `_read_context.csv` | `gemini_extract.py` | Dual reads via Gemini API; 3 concurrent workers; era-specific prompts |
| 3. Validate | Cleaned CSV | GREEN / YELLOW / RED status | `audit_extractions.py` | Arithmetic identity checks; column keyword checks; entity coverage |
| 4. Clean | Raw context CSV (with literal fallback) | `{year}_{type}.csv` | `clean_arcia_csv.py` | Parse pipe-delimited blocks; normalize headers; add provenance |
| 5. Panel | All cleaned CSVs + column map JSON | `arcia_{type}_panel_wide.csv` | `harmonize_culture.py` | Position-based column renaming; bracket expansion; entity harmonization |
The pipeline is parameterized by a `--table-type` flag (default: `culture`), so the same infrastructure handles new table types without code changes. Page ranges are stored in `scripts/configs/page_registry.json`; prompts are loaded from `scripts/prompts/{table_type}_{era}.txt`.
Every table is extracted twice by Gemini with different prompts. The two reads complement each other—literal reads preserve exact characters while context reads resolve structural ambiguities—and disagreements pinpoint exactly which cells need attention.
"Extract exactly what you see on the page, character by character." Preserves OCR artifacts but never hallucinates values. Best for numeric accuracy.
Uses a single shared prompt (scripts/prompts/literal.txt) across all
table types and eras.
"Use your understanding of the table structure to resolve ambiguities." Better at entity names and column alignment, but sometimes creates phantom entries from footnotes or section headers.
Uses era-specific prompts that describe the table title, expected columns, entity hierarchy, and missing-data conventions for that period.
Each PDF page range is converted to PNG images via PyMuPDF at 2× zoom. Images are
sent to the Gemini 3 Flash API one page at a time, with up to 3 concurrent workers and
a rate limiter (14 RPM). Each page returns a pipe-delimited text block; all pages are
concatenated with # --- Page N --- markers into a single CSV file.
The --prompt-type both flag runs literal and context reads in parallel,
halving wall-clock time compared to sequential runs.
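The per-page assembly described above can be sketched as follows (the function name is illustrative; only the `# --- Page N ---` marker format comes from the pipeline description):

```python
def concat_pages(page_blocks):
    """Join per-page pipe-delimited text blocks into one CSV-like file,
    separated by '# --- Page N ---' markers (assumed assembly logic)."""
    parts = []
    for n, block in enumerate(page_blocks, start=1):
        parts.append(f"# --- Page {n} ---")
        parts.append(block.rstrip("\n"))
    return "\n".join(parts) + "\n"
```

Downstream, the cleaner can split on the marker lines to recover per-page provenance for each extracted row.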
Different eras require different context prompts because the table titles, column structures, and entity hierarchies all change:
| Prompt File | Years | Key Differences |
|---|---|---|
| `culture_1875_1904.txt` | 1875–1904 | 3-column entity hierarchy (State/Agency/Tribe); 30+ data columns; bracket encoding |
| `culture_1911.txt` | 1911 | TABLE 2 format; different column set; must skip page 1 (TABLE 1 = admin stats) |
| `culture_1912_1920.txt` | 1912–1920 | TABLE 4/5; 2-column entities (State/Superintendency); marriage/church/dress columns |
read_merger.py aligns both reads cell-by-cell and produces a merged output
plus a disagreements file. For each cell where the reads disagree, an
arithmetic-aware refinement step checks: does swapping to the other
read’s value fix a failing column sum? If so, pick that value. This is free
quality—typically improves pass rates by 1–3 percentage points with zero human effort.
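The arithmetic-aware refinement can be illustrated with a minimal sketch (function name and row representation are assumptions; the real logic lives in `read_merger.py`):

```python
def refine_cell(components, total, idx, alt_value):
    """If the column components fail to sum to the published total,
    try swapping cell `idx` to the other read's value; keep the swap
    only when it makes the identity hold (sketch of the merge rule)."""
    if sum(components) == total:
        return components          # identity already holds; keep as-is
    swapped = list(components)
    swapped[idx] = alt_value
    if sum(swapped) == total:
        return swapped             # the other read's value fixes the sum
    return components              # neither read satisfies the identity
```

Because the swap is accepted only when it restores a checksum, the step cannot make a passing row worse, which is why it is "free quality."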
In 2 of 51 years (1878, 1886), the context read failed entirely (0 or near-0 tokens from Gemini). The cleaner automatically detects this and falls back to the literal read. Both years validated clean with full row counts.
Having two independent reads of the same PDF page provides a powerful diagnostic that requires no ground truth data. When the context and literal reads disagree on a large fraction of cells, this flags systematic extraction errors that arithmetic checks alone cannot catch.
Case study (1917): The context read for 1917 appeared structurally correct (right entity names, 22 aligned columns), but had silently shifted all numeric values down by one row within each state section—state-level totals were assigned to the first superintendency. Arithmetic checks would have caught this via failed subtotal validation, but the fastest signal was the cross-read comparison: 91 of 186 comparable cells differed by >5× between reads. A year-specific re-extraction with an improved prompt fixed the issue ($0.04).
Recommended threshold: Flag any year where >30% of comparable cells differ by >5× between reads. For future table types (Population, Crime, etc.) where no human digitization exists for comparison, this cross-read check is the primary automated defense against silent extraction errors.
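A minimal version of this cross-read check, using the thresholds recommended above (function name and cell representation are assumptions):

```python
def flag_divergent_year(literal, context, ratio=5.0, frac=0.30):
    """Flag a year when more than `frac` of comparable numeric cells
    differ by more than `ratio`x between the two reads (sketch)."""
    comparable = [(a, b) for a, b in zip(literal, context)
                  if a is not None and b is not None and min(a, b) > 0]
    if not comparable:
        return False
    big = sum(1 for a, b in comparable if max(a, b) / min(a, b) > ratio)
    return big / len(comparable) > frac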
Every extraction is automatically scored by testing whether column components sum to their published totals. These arithmetic identities are free validation—if the math works, those cells are correct.
ARCIA tables contain built-in checksums. For example, the culture table reports population broken down as males + females = total, and dress as citizen’s dress + partial + Indian dress = total population. If these sums hold in the extracted data, the values are provably correct regardless of OCR quality.
This typically validates 50–80% of numeric cells with zero human effort. Failures point directly at column alignment errors or OCR misreads—far more actionable than random spot-checking.
| Status | Pass Rate | Action |
|---|---|---|
| GREEN | ≥ 90% | Proceed automatically |
| YELLOW | 70–89% | Proceed, flag for batch review |
| RED | < 70% | Stop & run auto-diagnosis |
When a table scores RED, the pipeline runs five diagnostic strategies in order:
| # | Check | What It Detects |
|---|---|---|
| 1 | Column shift detection | Field count mismatch between header and data rows (most common cause) |
| 2 | Phantom entity detection | Context read created >15% more entities than literal (footnotes parsed as data) |
| 3 | Low entity count | Wrong page range scouted (table starts or ends on a different page) |
| 4 | Read preference swap | Try literal-only or context-only instead of the merged read |
| 5 | Per-position pass rate | Which specific column positions are failing (pinpoints the misaligned column) |
Gemini faithfully reproduces each year’s original ARCIA spellings, producing 5,212 unique culture_ids across 51 years. Because the same tribe appears under slightly different spellings year to year (e.g., “Menomonee” vs. “Menominee,” “Sioux of Devil’s Lake” vs. “Sioux of Devils Lake”), a post-extraction standardization pipeline normalizes these into 3,588 canonical culture_ids.
A 7-phase script (aggressive_dedup.py) processes the raw panel:
| Phase | What It Does | Rows Affected |
|---|---|---|
| 1–4 | Remove junk rows (footnote text, page headers, blank rows parsed as data) | −390 |
| 5 | Normalize spelling variants (OCR artifacts, abbreviations, singular/plural, punctuation) | ∼1,378 tribe-name pairs standardized |
| 6 | Rebuild culture_ids from cleaned entity fields | 5,212 → 3,588 unique IDs |
| 7 | Resolve actual duplicates (same culture_id + year) | 0 found |
When Phase 6 rebuilds culture_ids, different original entities can sometimes collapse to the same canonical ID (e.g., two distinct tribes whose cleaned names match). A collision guard detects these false merges: if multiple original culture_ids map to the same new ID within a year, all affected rows are reverted to their original IDs.
In the culture panel, this reverted 797 rows that would otherwise have been incorrectly merged.
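The collision guard can be sketched as follows (row representation and function name are assumptions; only the revert rule comes from the text):

```python
from collections import defaultdict

def collision_guard(rows):
    """Revert false merges: if multiple original culture_ids map to the
    same canonical ID within a year, restore the original IDs (sketch).
    Each row is a dict with 'year', 'orig_id', and 'new_id'."""
    groups = defaultdict(set)
    for r in rows:
        groups[(r["year"], r["new_id"])].add(r["orig_id"])
    for r in rows:
        key = (r["year"], r["new_id"])
        if len(groups[key]) > 1:       # collision within a year
            r["new_id"] = r["orig_id"] # revert to the original ID
    return rows
```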
The project ships two versions of the panel to serve different needs:
| File | IDs | Rows | Used For |
|---|---|---|---|
| `arcia_culture_panel_wide.csv` | 3,588 | 10,123 | Entity Explorer, Crosswalk, Human Inspection (standardized names) |
| `arcia_culture_panel_wide_raw.csv` | 5,212 | 10,512 | Gemini vs. Human comparison (preserves original ARCIA spellings for matching) |
In the original printed tables, curly brackets { group multiple tribes that share a single aggregate statistic. Gemini captures these inline with a `value_group_rowcount` encoding: when a value applies to a group of N tribes, Gemini writes it as `1250_group_3`, meaning "1,250 is the aggregate for a group of 3 rows." Footnote suffixes use `+(code)` format: `3600+(a)` means "3,600 with footnote marker ‘a’."
The harmonization pipeline expands this into four columns per variable:
| Suffix | Meaning |
|---|---|
_raw | The numeric value (1250) |
_grp | Group size (3) or 0 if individual |
_ind | 1 if individual value, 0 if group |
_foot | Footnote marker, if any |
Column names, positions, and even which columns exist change substantially across the
57-year span. A per-year JSON config (culture_column_map.json) maps
column positions to canonical variable names. This position-based approach is critical
because Gemini’s header parsing is unreliable—but column positions
are stable within an era.
| Era | Years | Entity Columns | Offset | Key Differences |
|---|---|---|---|---|
| Pre-standardization | 1864–1870 | Varies (tribe-only, supt+tribe) | Varies | No "State" column in many years; farming and culture on alternating pages (1864) |
| Early standard | 1871–1874 | Varies (merged entity cols) | Varies | 1873 has 24-col hard-coded header; 1874 shares pages with agricultural table |
| Core format | 1875–1904 | State / Agency / Tribe | 3 | Stable 3-prefix format; brackets common |
| Merged entity | 1878, 1886 | Merged (agency+tribe) | 3 | Gemini merges entity columns; requires offset adjustment |
| Late era | 1911–1920 | States & Superintendencies | 2 | Different table title (Table 4/5); fewer columns; 1911 = TABLE 2 |
Tribal names appear in dozens of spelling variants across decades of reports. The crosswalk maps 4,149 raw names to 1,380 canonical forms.
~600 cleaning rules (ported from Stata) apply systematic normalization:
unicode/accent removal, suffix stripping, OCR fixes, plural standardization,
compound tribe names, and NY reserve format standardization. Rules are defined in
entity_rules.py and applied by normalize_entities.py.
The deep crosswalk adds temporal group links and modern name mappings.
The culture_id is constructed differently by era:
| Era | ID Construction |
|---|---|
| 1874–1911 | state_agency_tribe |
| 1912–1920 | state_superintendency |
| 1873, 1878, 1886 | entity (merged col) |
| 1866 | agency_tribe |
| Entity Crosswalk | |
|---|---|
| Total entries | 4,724 |
| Unique canonical names | 1,380 |
| Harmonized (changed) | 866 |
| Temporal group links | 217 |
| Need review | 89 |
The pipeline uses a multi-model architecture: Gemini 3 Flash handles OCR, Claude Code handles everything else. Extraction cost is dominated by image input tokens sent to the Gemini API.
Gemini 3 Flash: OCR and table extraction from scanned PDFs. Configured with `thinking_level=low`—the socOCRbench benchmark showed that low reasoning effort outperforms high for document OCR tasks (0.698 vs. 0.680).
Claude Code: pipeline orchestration, cleaning logic, code generation, column mapping, entity harmonization, and quality diagnosis.
Claude Code launches parallel extraction agents for batch runs. Agents write summary JSONs on completion; the orchestrator reads summaries to track progress. Agent transcripts are 100K+ tokens each—reading them directly would exhaust the orchestrator’s context window.
Pattern: agents write summaries → orchestrator reads summaries → never read full transcripts.
| Metric | Value |
|---|---|
| Per PDF page (dual read) | ~$0.019 |
| Typical 10-page table | ~$0.19 |
| Full culture extraction (51 years, ~475 pages) | ~$9.50 |
| Projected per additional table type (~50 years) | ~$8–10 |
Cost is dominated by image input tokens (each page image is ~2,000–4,000 tokens), not output. Dual reads double the per-page cost but provide critical quality gains via arithmetic-aware merge refinement. Gemini 3 Flash pricing: $0.50/M input tokens, $3.00/M output tokens.
Extracting 51 years of historical tables involves many failure modes. The pipeline handles each automatically rather than requiring manual intervention.
If the context read returns fewer than 50 rows and the literal read has 3× more rows, the cleaner automatically falls back to the literal read. This saved 2 years (1878, 1886) where Gemini’s context read returned 0 or near-0 tokens—both validated clean using literal data alone.
The batch runner checks cumulative Gemini API spend before each year. If the hard cap is reached ($38 default, leaving a $2 buffer under the $40 target), remaining years are logged and queued—not lost. A per-year cost estimate (~$0.10/page, intentionally conservative) provides pre-flight warnings.
Each year gets a 25-minute timeout. If exceeded, the batch runner kills the entire
process group (killpg + SIGKILL) and continues to the next year.
macOS sleep can kill long-running extractions, so batch runs are wrapped with
caffeinate -i to prevent the system from sleeping.
A thread-safe token-bucket rate limiter enforces Gemini’s RPM quota (default: 14 requests/minute). Individual API calls retry with exponential backoff on 503/429 errors (5 retries, base delay 5 seconds). This prevents quota exhaustion during high-concurrency batch runs.
For 1878 and 1886, the context read returns 0 or near-0 tokens. The cleaner detects this automatically and falls back to the literal read. Both years validated successfully with full row counts and all columns present.
The state column has 215 unique values across the panel. Approximately 40 are legitimate multi-state groupings or historical territories. The remainder are OCR or format artifacts (e.g., reservation names misidentified as states in 1913/1919, tribe names leaking into the state field in 1871). These are acceptable for downstream crosswalk resolution.
Accurate page boundaries are critical for extraction quality. We validated our scouting process against the pre-extracted Culture Tables PDFs, which serve as ground truth for 41 years (1874–1920).
The Culture Tables PDFs contain byte-for-byte identical images clipped from the Full ARCIA PDFs—not different scans. PyMuPDF renders identical PNGs from either source at the same zoom factor. This means page range accuracy is the only factor affecting extraction quality; no intermediate clipping step is needed.
Image-hash matching between Culture Tables page 1 and Full ARCIA pages revealed that 20 of 41 years had incorrect page ranges in the initial TOC-based scouting. Most errors were off-by-one (included a blank/agriculture page or missed the first/last data page), but some were larger (1875 missed 4 pages on each end; 1883 missed 3 at the start).
Root cause: ARCIA PDFs use two-page spreads where the culture table can end on the left half while a different table (agriculture, land statistics) begins on the right. TOC page numbers are approximate and don’t account for this layout.
To fix this, the scouting process now includes a visual boundary verification step:
after the TOC gives an approximate range, boundary pages are rendered as PNGs (PyMuPDF at
150 DPI) and visually inspected to find the exact first and last pages of the target table.
Pages `start−2` through `start` and `end` through `end+2` are checked.
Result: 40/41 years (97.6%) matched Culture Tables ground truth exactly. The single disagreement (1874) is a standalone recapitulation page that the Culture Tables PDF includes but is correctly excluded for extraction purposes (it contains only aggregate summaries, not per-tribe data rows).
Output files and construction scripts for the culture table extraction pipeline (51 years, 1864–1920).
| File | Description |
|---|---|
| `extracted/cleaned/YEAR_Culture.csv` | Cleaned extraction for each year (51 files) |
| `extracted/cleaned/arcia_culture_panel_wide.csv` | Standardized panel with entity cleaning (10,123 × 814, 3,588 culture IDs) |
| `extracted/cleaned/arcia_culture_panel_wide_raw.csv` | Raw panel preserving source-document spellings (10,512 × 814, 5,212 culture IDs) |
| `extracted/cleaned/arcia_culture_panel_ids.csv` | Panel row identifiers |
| `extracted/cleaned/arcia_culture_footnote_definitions.csv` | Footnote definitions lookup table (1,164 definitions across 51 years) |
| `extracted/cleaned/entity_crosswalk_v2.csv` | Entity crosswalk with rule types and confidence (4,724 entries) |
| `extracted/cleaned/deep_crosswalk.csv` | Enriched crosswalk with temporal groups and modern names |
| `extracted/logs/harmonize_decisions.md` | Log of all harmonization choices |
| `extracted/logs/entity_harmonization_decisions.md` | Entity harmonization decisions and temporal group documentation |
| `scripts/configs/entity_rules.py` | ~600 entity cleaning rules (ported from Stata) |
| `scripts/configs/temporal_entities.json` | Historical agency transitions, consolidations, state splits |
| `scripts/configs/culture_column_map.json` | Per-year position-to-variable mapping |
| `extracted/raw/YEAR_*_read_literal.csv` | Raw Gemini literal reads (archival) |
| `extracted/raw/YEAR_*_read_context.csv` | Raw Gemini context reads (archival) |
Scripts are organized by pipeline phase. Run the full post-extraction pipeline with:
`python scripts/run_pipeline.py --table-type culture`
| Script | Phase | Description |
|---|---|---|
| Extraction | | |
| `scripts/utils/gemini_extract.py` | Extract | Gemini API extraction engine. Dual reads (literal + context) + footnotes. Handles era-specific prompt selection via `_CULTURE_ERA_MAP`. Cost: ~$0.019/page. |
| `scripts/batch_extract.py` | Extract | Batch runner for multi-year extraction. Reads page ranges from `page_registry.json`, runs parallel Gemini calls with rate limiting. |
| `scripts/prompts/` | Extract | Externalized Gemini prompt templates per table type and era: `literal.txt`, `culture_1875_1904.txt`, `culture_1911.txt`, `culture_1912_1920.txt`, `culture_1917.txt`, `culture_1919.txt`. |
| Cleaning & Validation | | |
| `scripts/clean_arcia_csv.py` | Clean | Raw CSV → cleaned CSV. Modal header detection, dropped-pipe repair, entity column splitting (Case A: 1912–1920, Case B: 1874–1911), literal fallback logic. Core of the pipeline. |
| `scripts/audit_extractions.py` | Validate | Arithmetic identity checks (component sums = totals), RED/YELLOW/GREEN triage. Primary quality gate. |
| `scripts/arithmetic_corrections.py` | Validate | Post-harmonization OCR correction using arithmetic identities. `--dry-run` supported. |
| `scripts/fix_early_years.py` | Clean | Special-case fixes for pre-standardization years (1864, 1873) with non-standard table formats. |
| Harmonization & Panel Construction | | |
| `scripts/harmonize_culture.py` | Harmonize | Stack 51 cleaned CSVs into a single panel. Position-based column mapping via `culture_column_map.json`, bracket encoding expansion (`_raw`, `_grp`, `_ind`, `_foot`). |
| `scripts/aggressive_dedup.py` | Standardize | Entity standardization: ~600 cleaning rules (OCR fixes, abbreviation expansion, spelling normalization), collision guard (797 reverts). |
| `scripts/build_entity_crosswalk.py` | Crosswalk | Build raw → canonical entity name mapping with rule type and confidence. |
| `scripts/generate_deep_crosswalk.py` | Crosswalk | Enrich crosswalk with temporal groups (agency transitions, consolidations, state boundary changes). |
| `scripts/build_footnote_lookup.py` | Footnotes | Parse footnote CSVs into a definitions lookup table (1,164 definitions across 51 years). |
| `scripts/utils/normalize_entities.py` | Utility | Entity name normalization helpers (regex, OCR fixes, abbreviation expansion). |
| Comparison & Quality | | |
| `scripts/compare_gemini_vs_human.py` | Validate | Cell-level comparison of Gemini panel vs. human digitization (Shyrley). Two-pass entity matching (exact ≥ 0.99, then constrained fuzzy ≥ 0.85). |
| `scripts/compare_human_vs_human.py` | Validate | Shyrley vs. Marlon human–human benchmark (99.8% match rate). |
| Website & Visualization | | |
| `scripts/generate_comparison_json.py` | Website | Generate `comparison_data.js` for the Validation tab. |
| `scripts/generate_crosswalk_json.py` | Website | Generate `crosswalk_data.js` for the Entity Crosswalk tab. |
| `scripts/generate_explorer_json.py` | Website | Generate `explorer_data.js` for the Entity Explorer tab. |
| `scripts/generate_inspection_sample.py` | Website | Generate `inspection_data.js` with PDF page images and raw/cleaned data samples. |
| Orchestration | | |
| `scripts/run_pipeline.py` | Pipeline | Post-extraction pipeline orchestrator. Chains: clean → audit → harmonize → footnotes → crosswalk → dedup → website JSON. Supports `--start-from` for partial reruns. |
| Configuration | | |
| `scripts/configs/page_registry.json` | Config | Page ranges per year per table type (source PDF, start/end page). |
| `scripts/configs/culture_column_map.json` | Config | Per-year position → variable name mapping (handles column count changes across eras). |
| `scripts/configs/entity_rules.py` | Config | ~600 entity cleaning rules: OCR fixes, abbreviation expansion, spelling normalization. |
| `scripts/configs/temporal_entities.json` | Config | Historical agency transitions, consolidations, state boundary changes. |
Explore how raw entity names from ARCIA reports (1864–1920) were harmonized
into canonical forms in the standardized panel (arcia_culture_panel_wide.csv).
Filter by entity type, rule type, or search for specific names.
| Raw Name | Canonical | Type | Rule | First | Last | Years | Rows |
|---|---|---|---|---|---|---|---|
BIA agencies were renamed, consolidated, and split between 1864–1920. These groups link entities that are historically the same administrative unit. In 1908, the "Indian Agent" position was abolished and replaced by school superintendents.
Browse all canonical entities in the standardized culture panel
(arcia_culture_panel_wide.csv).
Search by name, filter by type, and see variant history, year coverage, and related entities.
Track how key culture indicators changed over time for individual tribes. Select a variable, search for tribes, and click names to toggle them on/off. 183 tribes with 15+ data points are available.
Note: Values are raw counts reported in the ARCIA tables, not shares. Variable definitions change across eras: "citizen's dress" (1874–1889) becomes "wholly in citizen's dress" (1890–1904) and "citizen's clothing" (1912–1920). English usage similarly combines "use English" (1874–1904) and "speak English" (1912–1920). Gaps in the lines indicate years where data was not reported for that tribe.
How accurate is automated extraction of historical government tables? We answer this by comparing Gemini’s output against an independent human digitization of the same ARCIA Culture tables, and by testing whether an alternative extraction strategy—multiple reads with majority voting—improves results.
Shyrley Peraza manually digitized the ARCIA Culture tables for 41 years (1874–1904, 1911–1920) as part of an earlier research effort. Her dataset provides a cell-level benchmark: each entity-year-variable triple has a “human” value we can compare against Gemini’s extraction.
Matching method: Entities are linked between datasets using a two-pass approach:
exact matches first (score ≥ 0.99), then constrained fuzzy matching
(SequenceMatcher ≥ 0.85, same agency prefix) on composite state_agency_tribe identifiers.
This produces 7,028 matched entity-year pairs across 41 overlapping years,
covering 88,037 individual cell comparisons across 164 variables.
This comparison uses the raw (pre-standardization) panel to maximize entity matching,
since both Gemini and the human digitizers preserve original ARCIA spellings.
Not all entities match: the raw panel has ~5,212 unique culture IDs versus the human
data’s ~2,976, because Gemini reproduces each year’s exact spellings
(e.g., “Menomonee” in 1879, “Menominee” in 1880) while human
digitizers implicitly standardize as they type. Unmatched entities are disproportionately
single-year spelling variants rather than genuinely missing tribes.
1. Internal (arithmetic identities): ARCIA tables contain built-in checksums— e.g., males + females = total population, dress components = total. If these hold, the extracted values are provably correct. This validates 50–80% of numeric cells with zero human effort. Final tally: 41 GREEN (≥90%), 10 YELLOW (70–89%), 0 RED. See the Technical Detail tab.
2. External (human comparison): Cell-by-cell comparison against Shyrley’s manual digitization. This catches errors that arithmetic checks miss—column offsets, entity misidentification, and OCR errors in non-checksum variables. Results below.
Before evaluating AI accuracy, we need to know: how reproducible is human digitization? Two independent research assistants (Shyrley Peraza and Marlon) each digitized all 41 years of ARCIA Culture tables following the same instructions. Their inter-rater agreement establishes the ceiling against which we evaluate Gemini.
Across 41 shared years, 93.9% of 88,037 compared cells match exactly. The remaining 6.1% are dominated by OCR-level errors (minor digit confusions) and column offset artifacts from Gemini’s inconsistent pipe-separator handling.
Percentage of compared cells where Gemini and human values match exactly. The faint line shows the human-human agreement rate for the same year as a reference ceiling.
Can we improve extraction accuracy by running multiple independent OCR reads and taking a majority vote? We tested this on 6 sample years spanning different eras and difficulty levels.
For each of 6 sample years, we ran 5 independent extractions: 4× Gemini 3 Flash (same model as the production pipeline, different random seeds) and 1× Gemini 2.5 Pro (a larger, reasoning-capable model). Each read was cleaned and harmonized independently using the same pipeline code. A majority-vote algorithm then selected the most common value for each cell across the 5 reads.
Sample years: 1877 (88.6% baseline), 1881 (97.7%), 1889 (85.7%), 1895 (89.2%), 1911 (86.3%), 1919 (81.1%). Total cost: $3.63 for 30 reads.
| Year | Pipeline Baseline | Best Single Read | Voted (5 reads) | Delta |
|---|---|---|---|---|
| 1881 | 93.2% | 86.1% (R1) | 85.3% | −7.9 pp |
| 1877 | 73.1% | 76.4% (R3) | 64.2% | −8.9 pp |
| 1889 | 68.4% | 75.0% (R2) | 64.6% | −3.8 pp |
| 1895 | 71.4% | 85.1% (R3) | 85.9% | +14.5 pp |
| 1911 | 58.4% | 70.7% (R1) | 68.4% | +10.0 pp |
| 1919 | 61.4% | 71.4% (R1) | 69.3% | +7.9 pp |
| Average | 71.0% | — | 72.9% | +2.0 pp |
The Pro model (read 5) averaged 46.1% match rate vs. 70.8% for Flash (reads 1–4). It was also 5–6× slower (~750–960s vs. ~150–235s per year) and 3–5× more expensive. Worst case: 1919 at just 19.2%.
Year-specific cleaning fixes (dropped-pipe detection, blank-run insertion, entity merge conditionals, phantom column removal) add far more value than multiple reads. The 1881 pipeline baseline is 93.2% with one read; the 5-read vote achieves only 85.3%.
Same model, same pages: match rates swing 10–17 pp between runs (e.g., 1889: 29.1% to 75.0%). Gemini’s extraction is non-deterministic—entity counts range from 322 to 409 rows for the same year.
When entity matching across reads is good, voting improves results: +14.5 pp for 1895, +10.0 pp for 1911, +7.9 pp for 1919. But when entity alignment fails across reads (1877, 1881), voting hurts because the voter discards unmatched rows.
Multi-read majority voting provides a modest +2.0 pp average improvement at 5× the extraction cost ($3.63 vs ~$0.73 for the same 6 years). The more cost-effective strategy is single dual-read extraction + targeted year-specific cleaning, which achieves 93.9% match rate across 41 years at a total cost of ~$12 for the entire corpus. Investment in cleaning logic yields higher returns than investment in additional OCR reads.
Since the original experiment used Gemini 2.5 Pro (which performed poorly), we re-ran with Gemini 3.1 Pro Preview—the current leader on the socOCRbench document understanding benchmark—using low reasoning mode. Same 6 years, same pipeline.
| Year | Pipeline | Flash (Read 1) | 2.5 Pro | 3.1 Pro (low) |
|---|---|---|---|---|
| 1881 | 93.2% | 86.1% | 61.6% | 61.9% |
| 1877 | 73.1% | 60.5% | 73.5% | 48.7% |
| 1889 | 68.4% | 63.8% | 29.1% | 52.0% |
| 1895 | 71.4% | 72.1% | 43.2% | 51.0% |
| 1911 | 58.4% | 70.7% | 50.2% | 59.6% |
| 1919 | 61.4% | 71.4% | 19.2% | 59.3% |
| Average | 71.0% | 70.8% | 46.1% | 55.4% |
Each cell-level discrepancy is classified into a likely root cause category based on the relationship between Gemini and human values.
Match rates by variable across all compared entity-year pairs. Variables at the top have the most discrepancies; those at the bottom are most reliable.
The 30 largest cell-level disagreements by absolute difference. Many are concatenation artifacts where two values got stuck together during extraction.
Documented sources of disagreement between Gemini and human extraction, ordered by severity.
Random sample of 10 years for visual comparison between the original PDF
and the per-year cleaned CSVs (YEAR_Culture.csv). Each card shows the
first data page of the table alongside the first 5 rows that Gemini extracted
from that exact page. Use this to spot-check whether entity names, column alignment,
and numeric values match the source document.
Sample: 10 years stratified across early (1864–1873), standard (1874–1904), and late (1912–1920) eras. Seed: 42 (fixed for reproducibility). Showing raw extraction data (pipe-separated values as returned by Gemini).
Loading inspection data…