ARCIA Culture Tables

Digitizing 51 years of annual census tables on Native American tribes from the Reports of the Commissioner of Indian Affairs, 1864–1920.

51 Years
10,123 Rows
3,588 Culture IDs
~$12.53 API Cost

What Is ARCIA?

The Annual Reports of the Commissioner of Indian Affairs were published by the U.S. Government Printing Office from the 1820s through the 1930s. Each report contains narrative text plus statistical tables on tribal population, education, agriculture, dress, housing, religion, and more. The tables are a rare source of systematic, disaggregated data on Native American communities across the entire federal reservation system.

Source Material

We digitized the "Culture" tables—tables of statistics relating to population, civilization, dress, houses, allotments, reading, marriages, and religious participation. These appear under varying titles across eras but track similar variables throughout.

Source PDFs come from two collections: 62 full ARCIA reports (1864–1932, 21–187 MB each) and 45 pre-extracted culture table PDFs (1874–1920, much smaller). Each PDF page in the full reports is a two-page scan of facing printed pages.

Coverage

Era | Years | Source | Status
Pre-standardization | 1864–1874 | Full ARCIA PDFs | Extracted
Standard format | 1875–1904 | Full ARCIA PDFs | Extracted
Format gap | 1905–1910 | n/a | No table
Late-era format | 1911–1920 | Full ARCIA PDFs | Extracted
Discontinued | 1921–1932 | n/a | No table

Extraction Pipeline

Claude Code orchestrates the workflow, but historical OCR is handled by Gemini 3 Flash—among the top performers on scanned document benchmarks (socOCRbench). Each table goes through five phases.

1. Scout: TOC lookup + visual boundary check
2. Extract: Dual Gemini reads (literal + context)
3. Validate: Arithmetic identity checks
4. Clean: OCR artifacts, entity names
5. Panel: Harmonize & stack

Model Division of Labor

Gemini 3 Flash: OCR and table extraction from scanned PDFs. Among the top performers on historical document benchmarks (socOCRbench).

Claude: Pipeline orchestration, cleaning logic, code generation, column mapping, and entity harmonization.

Quality Gating

Every extraction is scored by arithmetic identity pass rate—do the column components sum to their published totals?

GREEN ≥ 90% · YELLOW 70–89% · RED < 70%

Final tally: 41 GREEN, 10 YELLOW, 0 RED across 51 years.

Year Coverage

Each cell below is one year. Green = extracted and validated. Orange = extracted with minor quality flags (all are legitimate data absences, not extraction errors). Gray = no culture table published that year.

YELLOW years explained: 1864 (Arizona & Dakota absent—early era), 1865/1866/1867 (no State column—pre-standardization format), 1871 (Dakota absent), 1874/1875/1878/1890/1892 (Arizona absent from table those years). These reflect legitimate historical gaps, not extraction failures.

Output at a Glance

Panel Statistics
Years covered: 51 (1864–1920)
Observations (rows): 10,123
Columns: 814
Unique culture IDs: 3,588
Canonical entity names: 3,319
Entity crosswalk entries: 5,922
Footnote definitions: 1,164
Each numeric variable is expanded into _raw, _grp, _ind, and _foot variants, explaining the high column count.
Two panel versions. The main panel (arcia_culture_panel_wide.csv, 10,123 rows, 3,588 culture IDs) applies entity standardization: cleaning OCR artifacts, normalizing tribe/agency spellings across years (e.g., “Menomonee” → “Menominee”), and removing ~390 non-entity rows (leaked variable names, numeric values in tribe fields). A raw version (arcia_culture_panel_wide_raw.csv, 10,512 rows, 5,212 culture IDs) preserves the original source-document spellings and is used for the human comparison below, since the human digitizers also kept original ARCIA spellings and matching works best when both sides use the same naming conventions.

Top Tribes by Panel Coverage

Pawnee: 35 yr
Nez Perce: 35 yr
Santee Sioux: 33 yr
Omaha: 32 yr
Kickapoo: 32 yr
Winnebago: 32 yr
Seminole: 31 yr
Ponca: 31 yr

Rows by Era

1864–1873: 1,586
1874–1904: ~8,400
1911–1920: ~1,600

Pipeline Architecture

Each year is processed independently through phases 1–4. Panel construction (phase 5) runs only after all years complete. The pipeline is idempotent—re-running any phase produces the same output—and fail-safe: if a year fails at any phase, it is logged and skipped while the batch continues.
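The fail-safe batch loop can be sketched as follows. This is a minimal illustration, not the repo's actual runner: `run_year` stands in for phases 1–4, and all names here are hypothetical.

```python
import logging

def run_batch(years, run_year):
    """Run phases 1-4 for each year independently; log and skip failures
    so one bad year never aborts the batch (panel phase runs afterwards)."""
    done, failed = [], []
    for year in years:
        try:
            run_year(year)           # phases 1-4 for this year
            done.append(year)
        except Exception as exc:     # fail-safe: record and continue
            logging.warning("year %s failed: %s", year, exc)
            failed.append(year)
    return done, failed

# Toy run: 1878 "fails", yet the batch continues to 1879.
def fake_run(year):
    if year == 1878:
        raise RuntimeError("context read returned 0 tokens")

done, failed = run_batch([1877, 1878, 1879], fake_run)
```

Because each year writes its own outputs and failures are only logged, re-running the batch with the failed years re-queued is safe, which is what makes the pipeline idempotent in practice.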

Phase | Input | Output | Script | Notes
1. Scout | PDF + target table description | page_map.json | Manual / Claude agent | Read TOC (pp. 3–8) for approximate range, then render boundary pages as PNGs and visually verify exact start/end. Off-by-one errors cause 10–30pp quality drops. Early PDFs (1864–1872) have tables near end.
2. Extract | PDF + page range | _read_literal.csv + _read_context.csv | gemini_extract.py | Dual reads via Gemini API; 3 concurrent workers; era-specific prompts
3. Validate | Cleaned CSV | GREEN / YELLOW / RED status | audit_extractions.py | Arithmetic identity checks; column keyword checks; entity coverage
4. Clean | Raw context CSV (with literal fallback) | {year}_{type}.csv | clean_arcia_csv.py | Parse pipe-delimited blocks; normalize headers; add provenance
5. Panel | All cleaned CSVs + column map JSON | arcia_{type}_panel_wide.csv | harmonize_culture.py | Position-based column renaming; bracket expansion; entity harmonization
All pipeline scripts accept a --table-type flag (default: culture), so the same infrastructure handles new table types without code changes. Page ranges are stored in scripts/configs/page_registry.json; prompts are loaded from scripts/prompts/{table_type}_{era}.txt.

Dual-Read Extraction Strategy

Every table is extracted twice by Gemini with different prompts. The two reads complement each other—literal reads preserve exact characters while context reads resolve structural ambiguities—and disagreements pinpoint exactly which cells need attention.

Literal Read

"Extract exactly what you see on the page, character by character." Preserves OCR artifacts but never hallucinates values. Best for numeric accuracy.

Uses a single shared prompt (scripts/prompts/literal.txt) across all table types and eras.

Context Read

"Use your understanding of the table structure to resolve ambiguities." Better at entity names and column alignment, but sometimes creates phantom entries from footnotes or section headers.

Uses era-specific prompts that describe the table title, expected columns, entity hierarchy, and missing-data conventions for that period.

How Extraction Works

Each PDF page range is converted to PNG images via PyMuPDF at 2× zoom. Images are sent to the Gemini 3 Flash API one page at a time, with up to 3 concurrent workers and a rate limiter (14 RPM). Each page returns a pipe-delimited text block; all pages are concatenated with # --- Page N --- markers into a single CSV file.

The --prompt-type both flag runs literal and context reads in parallel, halving wall-clock time compared to sequential runs.

Era-Specific Prompts

Different eras require different context prompts because the table titles, column structures, and entity hierarchies all change:

Prompt File | Years | Key Differences
culture_1875_1904.txt | 1875–1904 | 3-column entity hierarchy (State/Agency/Tribe); 30+ data columns; bracket encoding
culture_1911.txt | 1911 | TABLE 2 format; different column set; must skip page 1 (TABLE 1 = admin stats)
culture_1912_1920.txt | 1912–1920 | TABLE 4/5; 2-column entities (State/Superintendency); marriage/church/dress columns

Merge & Arithmetic-Aware Refinement

read_merger.py aligns both reads cell-by-cell and produces a merged output plus a disagreements file. For each cell where the reads disagree, an arithmetic-aware refinement step checks: does swapping to the other read’s value fix a failing column sum? If so, pick that value. This is free quality—typically improves pass rates by 1–3 percentage points with zero human effort.
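A minimal sketch of the arithmetic-aware swap on a single row. The real read_merger.py aligns whole tables; here the column names and function name are illustrative, and only one identity (male + female = total) is checked.

```python
def refine_cell(merged, other, identity=("male", "female", "total")):
    """If the column-sum identity fails in the merged row, try swapping each
    disagreeing cell to the other read's value; keep the first swap that
    makes the sum hold."""
    a, b, t = identity

    def holds(row):
        return row[a] + row[b] == row[t]

    if holds(merged):
        return merged
    for col in identity:
        if merged[col] != other[col]:
            candidate = dict(merged, **{col: other[col]})
            if holds(candidate):
                return candidate
    return merged            # no single swap fixes it: leave for audit

# The merged row kept a misread total; the other read's value
# satisfies the identity, so it wins.
merged = {"male": 70, "female": 65, "total": 130}
literal = {"male": 70, "female": 65, "total": 135}
fixed = refine_cell(merged, literal)   # {"male": 70, "female": 65, "total": 135}
```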

In 2 of 51 years (1878, 1886), the context read failed entirely (0 or near-0 tokens from Gemini). The cleaner automatically detects this and falls back to the literal read. Both years validated clean with full row counts.

Cross-Read Disagreement as a Quality Signal

Having two independent reads of the same PDF page provides a powerful diagnostic that requires no ground truth data. When the context and literal reads disagree on a large fraction of cells, this flags systematic extraction errors that arithmetic checks alone cannot catch.

Case study (1917): The context read for 1917 appeared structurally correct (right entity names, 22 aligned columns), but had silently shifted all numeric values down by one row within each state section—state-level totals were assigned to the first superintendency. Arithmetic checks would have caught this via failed subtotal validation, but the fastest signal was the cross-read comparison: 91 of 186 comparable cells differed by >5× between reads. A year-specific re-extraction with an improved prompt fixed the issue ($0.04).

Recommended threshold: Flag any year where >30% of comparable cells differ by >5× between reads. For future table types (Population, Crime, etc.) where no human digitization exists for comparison, this cross-read check is the primary automated defense against silent extraction errors.
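The recommended threshold can be expressed as a small check. This is an illustrative function under the stated thresholds, not the production comparison code:

```python
def flag_year(read_a, read_b, ratio=5.0, share=0.30):
    """Flag a year for re-extraction when more than `share` of comparable
    numeric cells differ by more than `ratio`x between the two reads."""
    comparable = diffs = 0
    for a, b in zip(read_a, read_b):
        if not a or not b:
            continue             # skip missing or zero cells
        comparable += 1
        if max(a, b) / min(a, b) > ratio:
            diffs += 1
    return comparable > 0 and diffs / comparable > share

# 1917-style failure: values shifted by one row, so aligned cells differ wildly.
literal = [1200, 45, 300, 18, 950]
context = [45, 300, 18, 950, 1200]
flagged = flag_year(literal, context)   # True: 4 of 5 cells differ by >5x
```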

Quality Gating & Auto-Diagnosis

Every extraction is automatically scored by testing whether column components sum to their published totals. These arithmetic identities are free validation—if the math works, those cells are correct.

How Arithmetic Identities Work

ARCIA tables contain built-in checksums. For example, the culture table reports population broken down as males + females = total, and dress as citizen’s dress + partial + Indian dress = total population. If these sums hold in the extracted data, the values are provably correct regardless of OCR quality.

This typically validates 50–80% of numeric cells with zero human effort. Failures point directly at column alignment errors or OCR misreads—far more actionable than random spot-checking.
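In code, the scoring idea looks roughly like this. A sketch, not audit_extractions.py itself; the column names are hypothetical:

```python
def identity_pass_rate(rows, identities):
    """Share of checkable identities that hold, where `identities` maps a
    total column to its component columns."""
    checked = passed = 0
    for row in rows:
        for total, parts in identities.items():
            if any(row.get(c) is None for c in parts + [total]):
                continue                      # missing data: not checkable
            checked += 1
            if sum(row[c] for c in parts) == row[total]:
                passed += 1
    return passed / checked if checked else 0.0

identities = {"population": ["male", "female"]}
rows = [
    {"male": 70, "female": 65, "population": 135},  # identity holds
    {"male": 40, "female": 52, "population": 93},   # OCR misread: fails
]
rate = identity_pass_rate(rows, identities)          # 0.5 -> RED under the gate
```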

Scoring Thresholds

Status | Pass Rate | Action
GREEN | ≥ 90% | Proceed automatically
YELLOW | 70–89% | Proceed, flag for batch review
RED | < 70% | Stop & run auto-diagnosis
Final tally: 41 GREEN, 10 YELLOW, 0 RED across 51 years.

Auto-Diagnosis for RED Tables

When a table scores RED, the pipeline runs five diagnostic strategies in order:

# | Check | What It Detects
1 | Column shift detection | Field count mismatch between header and data rows (most common cause)
2 | Phantom entity detection | Context read created >15% more entities than literal (footnotes parsed as data)
3 | Low entity count | Wrong page range scouted (table starts or ends on a different page)
4 | Read preference swap | Try literal-only or context-only instead of the merged read
5 | Per-position pass rate | Which specific column positions are failing (pinpoints the misaligned column)

Entity Standardization

Gemini faithfully reproduces each year’s original ARCIA spellings, producing 5,212 unique culture_ids across 51 years. Because the same tribe appears under slightly different spellings year to year (e.g., “Menomonee” vs. “Menominee,” “Sioux of Devil’s Lake” vs. “Sioux of Devils Lake”), a post-extraction standardization pipeline normalizes these into 3,588 canonical culture_ids.

Standardization Pipeline

A 7-phase script (aggressive_dedup.py) processes the raw panel:

Phase | What It Does | Rows Affected
1–4 | Remove junk rows (footnote text, page headers, blank rows parsed as data) | −390
5 | Normalize spelling variants (OCR artifacts, abbreviations, singular/plural, punctuation) | ~1,378 tribe-name pairs standardized
6 | Rebuild culture_ids from cleaned entity fields | 5,212 → 3,588 unique IDs
7 | Resolve actual duplicates (same culture_id + year) | 0 found
Phase 7 finding zero duplicates confirms the pipeline is not over-merging—the ID reduction comes entirely from spelling normalization, not from collapsing distinct entities.

Collision Guard

When Phase 6 rebuilds culture_ids, different original entities can sometimes collapse to the same canonical ID (e.g., two distinct tribes whose cleaned names match). A collision guard detects these false merges: if multiple original culture_ids map to the same new ID within a year, all affected rows are reverted to their original IDs.

In the culture panel, this reverted 797 rows that would otherwise have been incorrectly merged.
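The guard logic amounts to detecting many-to-one ID mappings within a year. A sketch with made-up IDs, not the aggressive_dedup.py implementation:

```python
from collections import defaultdict

def guard_collisions(rows):
    """Detect false merges: if several original IDs collapse to the same
    rebuilt ID within one year, revert those rows to their original IDs."""
    originals = defaultdict(set)
    for r in rows:
        originals[(r["year"], r["new_id"])].add(r["old_id"])
    reverted = 0
    for r in rows:
        if len(originals[(r["year"], r["new_id"])]) > 1:
            r["new_id"] = r["old_id"]     # revert the suspect merge
            reverted += 1
    return reverted

rows = [
    {"year": 1880, "old_id": "dak_a_sioux", "new_id": "dak_sioux"},
    {"year": 1880, "old_id": "dak_b_sioux", "new_id": "dak_sioux"},  # collision
    {"year": 1880, "old_id": "nebr_omaha", "new_id": "nebr_omaha"},
]
n = guard_collisions(rows)   # 2 rows reverted to their original IDs
```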

Two-Panel Architecture

The project ships two versions of the panel to serve different needs:

File | IDs | Rows | Used For
arcia_culture_panel_wide.csv | 3,588 | 10,123 | Entity Explorer, Crosswalk, Human Inspection (standardized names)
arcia_culture_panel_wide_raw.csv | 5,212 | 10,512 | Gemini vs. Human comparison (preserves original ARCIA spellings for matching)
The raw panel preserves source-document spellings so that entity matching against the human digitization (which also uses original ARCIA spellings) is not distorted by name standardization. The human digitization has 2,976 unique culture_ids—the remaining gap of ~610 IDs between the standardized panel (3,588) and human (2,976) reflects implicit standardization that human digitizers performed but that our automated pipeline does not attempt (e.g., resolving historical tribe name changes across decades).

Bracket Encoding

In the original printed tables, curly brackets { group multiple tribes that share a single aggregate statistic. Gemini captures these inline.

Encoding: value_group_rowcount

When a value applies to a group of N tribes, Gemini writes it as 1250_group_3 meaning "1,250 is the aggregate for a group of 3 rows." Footnote suffixes use +(code) format: 3600+(a) means "3,600 with footnote marker ‘a’."

The harmonization pipeline expands this into four columns per variable:

Suffix | Meaning
_raw | The numeric value (1250)
_grp | Group size (3) or 0 if individual
_ind | 1 if individual value, 0 if group
_foot | Footnote marker, if any
This expansion is why the panel has 814 columns despite the raw tables having ~30–40 data columns: each numeric variable produces 4 output columns.
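A minimal parser for this encoding might look like the following. Illustrative only; the production expansion lives in harmonize_culture.py and its exact output field names may differ:

```python
import re

# value, optional "_group_N", optional "+(code)" footnote suffix
_PAT = re.compile(r"^(\d+)(?:_group_(\d+))?(?:\+\((\w+)\))?$")

def expand(cell):
    """Expand one encoded cell into the _raw/_grp/_ind/_foot quadruple."""
    m = _PAT.match(cell.strip())
    if not m:
        return {"raw": None, "grp": 0, "ind": 0, "foot": ""}
    value, grp, foot = m.groups()
    return {
        "raw": int(value),
        "grp": int(grp) if grp else 0,   # number of rows sharing the value
        "ind": 0 if grp else 1,          # 1 = individual value
        "foot": foot or "",
    }

expand("1250_group_3")   # aggregate shared by a 3-row bracket group
expand("3600+(a)")       # individual value carrying footnote "a"
```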

Column Mapping Across Eras

Column names, positions, and even which columns exist change substantially across the 57-year span. A per-year JSON config (culture_column_map.json) maps column positions to canonical variable names. This position-based approach is critical because Gemini’s header parsing is unreliable—but column positions are stable within an era.

Era | Years | Entity Columns | Offset | Key Differences
Pre-standardization | 1864–1870 | Varies (tribe-only, supt+tribe) | Varies | No "State" column in many years; farming and culture on alternating pages (1864)
Early standard | 1871–1874 | Varies (merged entity cols) | Varies | 1873 has 24-col hard-coded header; 1874 shares pages with agricultural table
Core format | 1875–1904 | State / Agency / Tribe | 3 | Stable 3-prefix format; brackets common
Merged entity | 1878, 1886 | Merged (agency+tribe) | 3 | Gemini merges entity columns; requires offset adjustment
Late era | 1911–1920 | States & Superintendencies | 2 | Different table title (Table 4/5); fewer columns; 1911 = TABLE 2
Entity offset is the number of prefix columns (state, agency, tribe) before data columns begin. Most years use offset=3. Late-era years (1912–1920) use offset=2. Merged-entity years (1878, 1886) nominally have offset=3 but Gemini collapses entity columns, requiring runtime adjustment.
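A sketch of position-based renaming under an assumed config shape (the real culture_column_map.json schema may differ; the field names here are hypothetical):

```python
def rename_by_position(header_row, year_cfg):
    """Replace Gemini's unreliable headers with canonical names: entity
    prefix columns first, then data columns by position."""
    offset = year_cfg["offset"]                 # e.g. 3 for state/agency/tribe
    entities = year_cfg["entity_cols"][:offset]
    data = year_cfg["data_cols"]
    return entities + [
        data[i] if i < len(data) else f"extra_{i}"
        for i in range(len(header_row) - offset)
    ]

cfg = {
    "offset": 3,
    "entity_cols": ["state", "agency", "tribe"],
    "data_cols": ["pop_total", "pop_male", "pop_female"],
}
# Whatever Gemini read as headers is discarded; only positions matter.
raw_header = ["STATE.", "Agency", "Tribe", "Popu- lation", "Males", "Fem."]
canonical = rename_by_position(raw_header, cfg)
```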

Entity Normalization & Crosswalk

Tribal names appear in dozens of spelling variants across decades of reports. The crosswalk maps 4,149 raw names to 1,380 canonical forms.

How It Works

~600 cleaning rules (ported from Stata) apply systematic normalization: unicode/accent removal, suffix stripping, OCR fixes, plural standardization, compound tribe names, and NY reserve format standardization. Rules are defined in entity_rules.py and applied by normalize_entities.py. The deep crosswalk adds temporal group links and modern name mappings.

The culture_id is constructed differently by era:

Era | ID Construction
1874–1911 | state_agency_tribe
1912–1920 | state_superintendency
1873, 1878, 1886 | entity (merged col)
1866 | agency_tribe
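The construction rules above can be sketched as follows. The field names, separator, and normalization are illustrative assumptions; the production IDs may be built differently:

```python
def build_culture_id(row, year):
    """Join the entity columns that exist for the year's format,
    lowercased and underscore-joined."""
    if year in (1873, 1878, 1886):
        parts = [row["entity"]]                       # merged entity column
    elif year == 1866:
        parts = [row["agency"], row["tribe"]]
    elif 1912 <= year <= 1920:
        parts = [row["state"], row["superintendency"]]
    else:                                             # 1874-1911 core format
        parts = [row["state"], row["agency"], row["tribe"]]
    return "_".join(p.strip().lower().replace(" ", "_") for p in parts)

build_culture_id({"state": "Nebraska", "agency": "Omaha", "tribe": "Omaha"}, 1880)
```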

Crosswalk Stats

Entity Crosswalk
Total entries: 4,724
Unique canonical names: 1,380
Harmonized (changed): 866
Temporal group links: 217
Need review: 89
866 entries were harmonized via spelling fixes, plural standardization, OCR corrections, and compound name normalization. 217 entries link to 11 temporal groups tracking historical agency transitions.

Orchestration & Cost

The pipeline uses a multi-model architecture: Gemini 3 Flash handles OCR, Claude Code handles everything else. Extraction cost is dominated by image input tokens sent to the Gemini API.

Model Division of Labor

Gemini 3 Flash: OCR and table extraction from scanned PDFs. Configured with thinking_level=low, since the socOCRbench benchmark showed that low reasoning effort outperforms high for document OCR tasks (0.698 vs 0.680).

Claude Code: Pipeline orchestration, cleaning logic, code generation, column mapping, entity harmonization, and quality diagnosis.

Multi-Agent Orchestration

Claude Code launches parallel extraction agents for batch runs. Agents write summary JSONs on completion; the orchestrator reads summaries to track progress. Agent transcripts are 100K+ tokens each—reading them directly would exhaust the orchestrator’s context window.

Pattern: agents write summaries → orchestrator reads summaries → never read full transcripts.

Cost Breakdown

Metric | Value
Per PDF page (dual read) | ~$0.019
Typical 10-page table | ~$0.19
Full culture extraction (51 years, ~475 pages) | ~$9.50
Projected per additional table type (~50 years) | ~$8–10

Cost is dominated by image input tokens (each page image is ~2,000–4,000 tokens), not output. Dual reads double the per-page cost but provide critical quality gains via arithmetic-aware merge refinement. Gemini 3 Flash pricing: $0.50/M input tokens, $3.00/M output tokens.

Resilience & Fallback Logic

Extracting 51 years of historical tables involves many failure modes. The pipeline handles each automatically rather than requiring manual intervention.

Context Read Failures

If the context read returns fewer than 50 rows and the literal read has 3× more rows, the cleaner automatically falls back to the literal read. This saved 2 years (1878, 1886) where Gemini’s context read returned 0 or near-0 tokens—both validated clean using literal data alone.
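The fallback condition is essentially a one-line check. The thresholds come from the text above; the function name is illustrative:

```python
def choose_read(context_rows, literal_rows):
    """Return the read to clean: fall back to literal when the context read
    collapsed (fewer than 50 rows while literal has at least 3x as many)."""
    if len(context_rows) < 50 and len(literal_rows) >= 3 * len(context_rows):
        return "literal", literal_rows
    return "context", context_rows

# 1878-style silent failure: the context read came back essentially empty.
source, rows = choose_read([], [{"tribe": "Pawnee"}] * 320)   # -> "literal"
```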

Budget Gating

The batch runner checks cumulative Gemini API spend before each year. If the hard cap is reached ($38 default, leaving a $2 buffer under the $40 target), remaining years are logged and queued—not lost. A per-year cost estimate (~$0.10/page, intentionally conservative) provides pre-flight warnings.

Timeout & Process Management

Each year gets a 25-minute timeout. If exceeded, the batch runner kills the entire process group (killpg + SIGKILL) and continues to the next year. macOS sleep can kill long-running extractions, so batch runs are wrapped with caffeinate -i to prevent the system from sleeping.

Rate Limiting

A thread-safe token-bucket rate limiter enforces Gemini’s RPM quota (default: 14 requests/minute). Individual API calls retry with exponential backoff on 503/429 errors (5 retries, base delay 5 seconds). This prevents quota exhaustion during high-concurrency batch runs.
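A minimal thread-safe token bucket in the spirit of the one described. This is a sketch, not the pipeline's actual limiter; the retry/backoff wrapper is omitted:

```python
import threading
import time

class RateLimiter:
    """Token bucket: at most `rpm` acquisitions per minute, safe to share
    across the concurrent extraction workers."""
    def __init__(self, rpm=14):
        self.capacity = rpm
        self.tokens = float(rpm)
        self.rate = rpm / 60.0            # tokens refilled per second
        self.lock = threading.Lock()
        self.last = time.monotonic()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)              # sleep outside the lock

limiter = RateLimiter(rpm=14)
limiter.acquire()                          # first call passes immediately
```

Each worker calls `acquire()` before hitting the API; when the bucket empties, callers block just long enough for a token to refill, which smooths bursts under the RPM quota.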

Known Data Quality Issues

Gemini Silent Failures (1878 & 1886)

The context read returns 0 or near-0 tokens for these years. The cleaner detects this automatically and falls back to the literal read. Both years validated successfully with full row counts and all columns present.

Minor OCR Artifacts in State Column

The state column has 215 unique values across the panel. Approximately 40 are legitimate multi-state groupings or historical territories. The remainder are OCR or format artifacts (e.g., reservation names misidentified as states in 1913/1919, tribe names leaking into the state field in 1871). These are acceptable for downstream crosswalk resolution.

Page Boundary Verification

Accurate page boundaries are critical for extraction quality. We validated our scouting process against the pre-extracted Culture Tables PDFs, which serve as ground truth for 41 years (1874–1920).

Key Finding: Images Are Identical

The Culture Tables PDFs contain byte-for-byte identical images clipped from the Full ARCIA PDFs—not different scans. PyMuPDF renders identical PNGs from either source at the same zoom factor. This means page range accuracy is the only factor affecting extraction quality; no intermediate clipping step is needed.

Initial Scouting vs. Ground Truth

Image-hash matching between Culture Tables page 1 and Full ARCIA pages revealed that 20 of 41 years had incorrect page ranges in the initial TOC-based scouting. Most errors were off-by-one (included a blank/agriculture page or missed the first/last data page), but some were larger (1875 missed 4 pages on each end; 1883 missed 3 at the start).

Root cause: ARCIA PDFs use two-page spreads where the culture table can end on the left half while a different table (agriculture, land statistics) begins on the right. TOC page numbers are approximate and don’t account for this layout.

Visual Boundary Verification

To fix this, the scouting process now includes a visual boundary verification step: after the TOC gives an approximate range, boundary pages are rendered as PNGs (PyMuPDF at 150 DPI) and visually inspected to find the exact first and last pages of the target table. Pages start−2 through start and end through end+2 are checked.

Result: 40/41 years (97.6%) matched Culture Tables ground truth exactly. The single disagreement (1874) is a standalone recapitulation page that the Culture Tables PDF includes but is correctly excluded for extraction purposes (it contains only aggregate summaries, not per-tribe data rows).

File Dictionary

Output files and construction scripts for the culture table extraction pipeline (51 years, 1864–1920).

Output Files

File | Description
extracted/cleaned/YEAR_Culture.csv | Cleaned extraction for each year (51 files)
extracted/cleaned/arcia_culture_panel_wide.csv | Standardized panel with entity cleaning (10,123 × 814, 3,588 culture IDs)
extracted/cleaned/arcia_culture_panel_wide_raw.csv | Raw panel preserving source-document spellings (10,512 × 814, 5,212 culture IDs)
extracted/cleaned/arcia_culture_panel_ids.csv | Panel row identifiers
extracted/cleaned/arcia_culture_footnote_definitions.csv | Footnote definitions lookup table (1,164 definitions across 51 years)
extracted/cleaned/entity_crosswalk_v2.csv | Entity crosswalk with rule types and confidence (4,724 entries)
extracted/cleaned/deep_crosswalk.csv | Enriched crosswalk with temporal groups and modern names
extracted/logs/harmonize_decisions.md | Log of all harmonization choices
extracted/logs/entity_harmonization_decisions.md | Entity harmonization decisions and temporal group documentation
scripts/configs/entity_rules.py | ~600 entity cleaning rules (ported from Stata)
scripts/configs/temporal_entities.json | Historical agency transitions, consolidations, state splits
scripts/configs/culture_column_map.json | Per-year position-to-variable mapping
extracted/raw/YEAR_*_read_literal.csv | Raw Gemini literal reads (archival)
extracted/raw/YEAR_*_read_context.csv | Raw Gemini context reads (archival)

Construction Scripts

Scripts are organized by pipeline phase. Run the full post-extraction pipeline with: python scripts/run_pipeline.py --table-type culture

Script | Phase | Description

Extraction
scripts/utils/gemini_extract.py | Extract | Gemini API extraction engine. Dual reads (literal + context) + footnotes. Handles era-specific prompt selection via _CULTURE_ERA_MAP. Cost: ~$0.019/page.
scripts/batch_extract.py | Extract | Batch runner for multi-year extraction. Reads page ranges from page_registry.json, runs parallel Gemini calls with rate limiting.
scripts/prompts/ | Extract | Externalized Gemini prompt templates per table type and era: literal.txt, culture_1875_1904.txt, culture_1911.txt, culture_1912_1920.txt, culture_1917.txt, culture_1919.txt.

Cleaning & Validation
scripts/clean_arcia_csv.py | Clean | Raw CSV → cleaned CSV. Modal header detection, dropped-pipe repair, entity column splitting (Case A: 1912–1920, Case B: 1874–1911), literal fallback logic. Core of the pipeline.
scripts/audit_extractions.py | Validate | Arithmetic identity checks (component sums = totals), RED/YELLOW/GREEN triage. Primary quality gate.
scripts/arithmetic_corrections.py | Validate | Post-harmonization OCR correction using arithmetic identities. --dry-run supported.
scripts/fix_early_years.py | Clean | Special-case fixes for pre-standardization years (1864, 1873) with non-standard table formats.

Harmonization & Panel Construction
scripts/harmonize_culture.py | Harmonize | Stack 51 cleaned CSVs into a single panel. Position-based column mapping via culture_column_map.json, bracket encoding expansion (_raw, _grp, _ind, _foot).
scripts/aggressive_dedup.py | Standardize | Entity standardization: ~600 cleaning rules (OCR fixes, abbreviation expansion, spelling normalization), collision guard (797 reverts).
scripts/build_entity_crosswalk.py | Crosswalk | Build raw → canonical entity name mapping with rule type and confidence.
scripts/generate_deep_crosswalk.py | Crosswalk | Enrich crosswalk with temporal groups (agency transitions, consolidations, state boundary changes).
scripts/build_footnote_lookup.py | Footnotes | Parse footnote CSVs into a definitions lookup table (1,164 definitions across 51 years).
scripts/utils/normalize_entities.py | Utility | Entity name normalization helpers (regex, OCR fixes, abbreviation expansion).

Comparison & Quality
scripts/compare_gemini_vs_human.py | Validate | Cell-level comparison of Gemini panel vs human digitization (Shyrley). Two-pass entity matching (exact ≥0.99, then constrained fuzzy ≥0.85).
scripts/compare_human_vs_human.py | Validate | Shyrley vs Marlon human-human benchmark (99.8% match rate).

Website & Visualization
scripts/generate_comparison_json.py | Website | Generate comparison_data.js for the Validation tab.
scripts/generate_crosswalk_json.py | Website | Generate crosswalk_data.js for the Entity Crosswalk tab.
scripts/generate_explorer_json.py | Website | Generate explorer_data.js for the Entity Explorer tab.
scripts/generate_inspection_sample.py | Website | Generate inspection_data.js with PDF page images and raw/cleaned data samples.

Orchestration
scripts/run_pipeline.py | Pipeline | Post-extraction pipeline orchestrator. Chains: clean → audit → harmonize → footnotes → crosswalk → dedup → website JSON. Supports --start-from for partial reruns.

Configuration
scripts/configs/page_registry.json | Config | Page ranges per year per table type (source PDF, start/end page).
scripts/configs/culture_column_map.json | Config | Per-year position → variable name mapping (handles column count changes across eras).
scripts/configs/entity_rules.py | Config | ~600 entity cleaning rules: OCR fixes, abbreviation expansion, spelling normalization.
scripts/configs/temporal_entities.json | Config | Historical agency transitions, consolidations, state boundary changes.

Entity Crosswalk Browser

Explore how raw entity names from ARCIA reports (1864–1920) were harmonized into canonical forms in the standardized panel (arcia_culture_panel_wide.csv). Filter by entity type, rule type, or search for specific names.


Historical Agency Transitions

BIA agencies were renamed, consolidated, and split between 1864 and 1920. These groups link entities that are historically the same administrative unit. In 1908, the "Indian Agent" position was abolished and replaced by school superintendents.

Entity Explorer

Browse all canonical entities in the standardized culture panel (arcia_culture_panel_wide.csv). Search by name, filter by type, and see variant history, year coverage, and related entities.


Time Series: Culture Variables by Tribe

Track how key culture indicators changed over time for individual tribes. Select a variable, search for tribes, and click names to toggle them on/off. 183 tribes with 15+ data points are available.

Note: Values are raw counts reported in the ARCIA tables, not shares. Variable definitions change across eras: "citizen's dress" (1874–1889) becomes "wholly in citizen's dress" (1890–1904) and "citizen's clothing" (1912–1920). English usage similarly combines "use English" (1874–1904) and "speak English" (1912–1920). Gaps in the lines indicate years where data was not reported for that tribe.

Validation Overview

How accurate is automated extraction of historical government tables? We answer this by comparing Gemini's output against an independent human digitization of the same ARCIA Culture tables, and by testing whether an alternative extraction strategy, multiple reads with majority voting, improves results.

Ground Truth

Shyrley Peraza manually digitized the ARCIA Culture tables for 41 years (1874–1904, 1911–1920) as part of an earlier research effort. Her dataset provides a cell-level benchmark: each entity-year-variable triple has a "human" value we can compare against Gemini's extraction.

Matching method: Entities are linked between datasets using a two-pass approach: exact matches first (score ≥ 0.99), then constrained fuzzy matching (SequenceMatcher ≥ 0.85, same agency prefix) on composite state_agency_tribe identifiers. This produces 7,028 matched entity-year pairs across 41 overlapping years, covering 88,037 individual cell comparisons across 164 variables.

This comparison uses the raw (pre-standardization) panel to maximize entity matching, since both Gemini and the human digitizers preserve original ARCIA spellings. Not all entities match: the raw panel has ~5,212 unique culture IDs versus the human data's ~2,976, because Gemini reproduces each year's exact spellings (e.g., "Menomonee" in 1879, "Menominee" in 1880) while human digitizers implicitly standardize as they type. Unmatched entities are disproportionately single-year spelling variants rather than genuinely missing tribes.
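The two-pass matching can be sketched with the standard library. Illustrative IDs; the production script also scores exact matches (≥ 0.99) and handles ties differently than this simplified version:

```python
from difflib import SequenceMatcher

def match_entities(gemini_ids, human_ids):
    """Pass 1: exact ID matches. Pass 2: fuzzy matches (ratio >= 0.85)
    restricted to candidates sharing the same state_agency prefix of the
    composite state_agency_tribe identifier."""
    matches = {g: g for g in gemini_ids if g in set(human_ids)}
    used = set(matches.values())
    for g in gemini_ids:
        if g in matches:
            continue
        prefix = g.rsplit("_", 1)[0]          # drop the tribe segment
        best, best_score = None, 0.85
        for h in human_ids:
            if h in used or not h.startswith(prefix):
                continue
            score = SequenceMatcher(None, g, h).ratio()
            if score >= best_score:
                best, best_score = h, score
        if best is not None:
            matches[g] = best
            used.add(best)
    return matches

# Year-specific spelling variants are linked by the fuzzy pass.
gem = ["nebr_omaha_omaha", "wis_green_bay_menomonee"]
hum = ["nebr_omaha_omaha", "wis_green_bay_menominee"]
m = match_entities(gem, hum)
```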

Two Layers of Validation

1. Internal (arithmetic identities): ARCIA tables contain built-in checksums, e.g., males + females = total population, dress components = total. If these hold, the extracted values are provably correct. This validates 50–80% of numeric cells with zero human effort. Final tally: 41 GREEN (≥90%), 10 YELLOW (70–89%), 0 RED. See the Technical Detail tab.

2. External (human comparison): Cell-by-cell comparison against Shyrley's manual digitization. This catches errors that arithmetic checks miss: column offsets, entity misidentification, and OCR errors in non-checksum variables. Results below.

Benchmarking: How Much Do Two Humans Agree?

Before evaluating AI accuracy, we need to know: how reproducible is human digitization? Two independent research assistants (Shyrley Peraza and Marlon) each digitized all 41 years of ARCIA Culture tables following the same instructions. Their inter-rater agreement establishes the ceiling against which we evaluate Gemini.

Gemini vs. Human: Cell-Level Comparison

Across 41 shared years, 93.9% of 88,037 compared cells match exactly. The remaining 6.1% are dominated by OCR-level errors (minor digit confusions) and column offset artifacts from Gemini's inconsistent pipe-separator handling.

Match Rate by Year

Percentage of compared cells where Gemini and human values match exactly. The faint line shows the human-human agreement rate for the same year as a reference ceiling.

Multi-Read OCR Experiment

Can we improve extraction accuracy by running multiple independent OCR reads and taking a majority vote? We tested this on 6 sample years spanning different eras and difficulty levels.

Experiment Design

For each of 6 sample years, we ran 5 independent extractions: 4× Gemini 3 Flash (same model as the production pipeline, different random seeds) and 1× Gemini 2.5 Pro (a larger, reasoning-capable model). Each read was cleaned and harmonized independently using the same pipeline code. A majority-vote algorithm then selected the most common value for each cell across the 5 reads.

Sample years: 1877 (88.6% baseline), 1881 (97.7%), 1889 (85.7%), 1895 (89.2%), 1911 (86.3%), 1919 (81.1%). Total cost: $3.63 for 30 reads.

    Results

    | Year | Pipeline Baseline | Best Single Read | Voted (5 reads) | Delta |
    | --- | --- | --- | --- | --- |
    | 1881 | 93.2% | 86.1% (R1) | 85.3% | −7.9 pp |
    | 1877 | 73.1% | 76.4% (R3) | 64.2% | −8.9 pp |
    | 1889 | 68.4% | 75.0% (R2) | 64.6% | −3.8 pp |
    | 1895 | 71.4% | 85.1% (R3) | 85.9% | +14.5 pp |
    | 1911 | 58.4% | 70.7% (R1) | 68.4% | +10.0 pp |
    | 1919 | 61.4% | 71.4% (R1) | 69.3% | +7.9 pp |
    | Average | 71.0% | | 72.9% | +2.0 pp |
    Note: “Pipeline Baseline” is the match rate from the production pipeline (single dual-read extraction + full custom cleaning). “Voted” uses the same harmonizer but without the year-specific cleaning fixes (dropped-pipe, blank-run, phantom removal). The baseline for 1911 and 1919 reflects the raw harmonized output before custom fixes, which is why these years score lower than their match rates in the chart above.

    Key Findings

    Gemini 2.5 Pro Performed Worse

    The Pro model (read 5) averaged 46.1% match rate vs. 70.8% for Flash (reads 1–4). It was also 5–6× slower (~750–960s vs. ~150–235s per year) and 3–5× more expensive. Worst case: 1919 at just 19.2%.

    Custom Pipeline Fixes Outperform Multi-Read

    Year-specific cleaning fixes (dropped-pipe detection, blank-run insertion, entity merge conditionals, phantom column removal) add far more value than multiple reads. The 1881 pipeline baseline is 93.2% with one read; the 5-read vote achieves only 85.3%.
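    A sketch of what dropped-pipe detection might look like on the raw pipe-separated rows; the expected field count and the digit-length threshold are illustrative, not the pipeline's actual parameters:

```python
EXPECTED_FIELDS = 15  # illustrative; the real count varies by table era

def find_dropped_pipes(row: str, expected: int = EXPECTED_FIELDS):
    """Flag fields in a raw pipe-separated row that look like two cells fused.

    A row with fewer fields than expected is a symptom of a dropped
    separator; suspiciously long all-digit fields within it are the
    likely fusion points. Returns the indices of those fields.
    """
    fields = row.split("|")
    if len(fields) >= expected:
        return []
    return [i for i, f in enumerate(fields)
            if f.strip().isdigit() and len(f.strip()) > 5]
```

    Flagged fields can then be split at a plausible boundary or routed to manual review, which is where the year-specific fixes earn their keep.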

    High Read-to-Read Variance

    Same model, same pages: Flash match rates swing 10–17 pp between runs, and across all five reads the spread is wider still (1889: 29.1% to 75.0%). Gemini's extraction is non-deterministic: entity counts range from 322 to 409 rows for the same year.

    Voting Helps When Entities Align

    When entity matching across reads is good, voting improves results: +14.5 pp for 1895, +10.0 pp for 1911, +7.9 pp for 1919. But when entity alignment fails across reads (1877, 1881), voting hurts because the voter discards unmatched rows.
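    One way to sketch the entity-alignment prerequisite: only entities recoverable under a normalized name in a majority of reads can be voted on, and the rest are discarded. The normalization rule here is an assumption:

```python
import re
from collections import Counter

def normalize(name: str) -> str:
    """Crude entity-name key: lowercase, letters only."""
    return re.sub(r"[^a-z]", "", name.lower())

def alignable_entities(reads, min_reads=3):
    """Entities whose normalized name appears in at least min_reads reads.

    `reads` is a list of entity-name lists, one per extraction. Rows for
    entities below the threshold cannot be voted on and are dropped,
    which is why years with poor alignment lose ground under voting.
    """
    counts = Counter(normalize(n) for read in reads for n in read)
    return {name for name, c in counts.items() if c >= min_reads}
```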

    Conclusion

    Multi-read majority voting provides a modest +2.0 pp average improvement at 5× the extraction cost ($3.63 vs ~$0.73 for the same 6 years). The more cost-effective strategy is single dual-read extraction + targeted year-specific cleaning, which achieves 93.9% match rate across 41 years at a total cost of ~$12 for the entire corpus. Investment in cleaning logic yields higher returns than investment in additional OCR reads.

    Follow-Up: Gemini 3.1 Pro (Low Reasoning)

    Since the original experiment used Gemini 2.5 Pro (which performed poorly), we re-ran with Gemini 3.1 Pro Preview—the current leader on the socOCRbench document understanding benchmark—using low reasoning mode. Same 6 years, same pipeline.

    | Year | Pipeline | Flash (Read 1) | 2.5 Pro | 3.1 Pro (low) |
    | --- | --- | --- | --- | --- |
    | 1881 | 93.2% | 86.1% | 61.6% | 61.9% |
    | 1877 | 73.1% | 60.5% | 73.5% | 48.7% |
    | 1889 | 68.4% | 63.8% | 29.1% | 52.0% |
    | 1895 | 71.4% | 72.1% | 43.2% | 51.0% |
    | 1911 | 58.4% | 70.7% | 50.2% | 59.6% |
    | 1919 | 61.4% | 71.4% | 19.2% | 59.3% |
    | Average | 71.0% | 70.8% | 46.1% | 55.4% |
    Cost: 3.1 Pro = $2.71 for 6 years (vs $0.73 for Flash, $1.40 for 2.5 Pro). At 3.7× the cost and 3–4× slower per page, Gemini 3.1 Pro achieves 55.4% average match rate—better than 2.5 Pro (46.1%) but still 15 pp below Flash (70.8%). Benchmark leadership on standard OCR tasks does not transfer to structured table extraction from 19th-century government documents.

    Discrepancy Classification

    Each cell-level discrepancy is classified into a likely root cause category based on the relationship between Gemini and human values.
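    A hedged sketch of such a classifier; the categories mirror those discussed above (digit confusions, concatenation artifacts), but the exact matching rules are assumptions:

```python
def classify_discrepancy(gemini, human):
    """Heuristic root-cause label for one mismatched numeric cell.

    Rules are illustrative, not the report's exact classification logic:
    the human value embedded in a longer Gemini value suggests two values
    stuck together; equal-length strings differing in one position suggest
    a single OCR digit error.
    """
    g, h = str(gemini), str(human)
    if g == h:
        return "match"
    if h in g and len(g) > len(h):
        return "concatenation"      # two values stuck together
    if len(g) == len(h) and sum(a != b for a, b in zip(g, h)) == 1:
        return "digit_confusion"    # single OCR digit error
    return "other"                  # offsets, entity mix-ups, etc.
```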

    Variable Match Rates

    Match rates by variable across all compared entity-year pairs. Variables at the top have the most discrepancies; those at the bottom are most reliable.

    Largest Discrepancies

    The 30 largest cell-level disagreements by absolute difference. Many are concatenation artifacts where two values got stuck together during extraction.

    Known Issues & Recommended Fixes

    Documented sources of disagreement between Gemini and human extraction, ordered by severity.

    Human Inspection Random Sample

    Random sample of 10 years for visual comparison between the original PDF and the per-year cleaned CSVs (YEAR_Culture.csv). Each card shows the first data page of the table alongside the first 5 rows that Gemini extracted from that exact page. Use this to spot-check whether entity names, column alignment, and numeric values match the source document.

    Sample: 10 years stratified across early (1864–1873), standard (1874–1904), and late (1912–1920) eras. Seed: 42 (fixed for reproducibility). Showing raw extraction data (pipe-separated values as returned by Gemini).
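    The sampling step can be sketched as follows; the eras, the total of 10 years, and seed 42 come from the description above, while the per-era allocation (3/4/3) is an assumption:

```python
import random

ERAS = {  # era boundaries from the sampling description
    "early": range(1864, 1874),
    "standard": range(1874, 1905),
    "late": range(1912, 1921),
}

def stratified_sample(years_per_era=(3, 4, 3), seed=42):
    """Fixed-seed stratified sample of years across the three eras.

    The 3/4/3 allocation is an assumption; a private Random instance
    keeps the draw reproducible regardless of other random usage.
    """
    rng = random.Random(seed)
    sample = []
    for (era, years), k in zip(ERAS.items(), years_per_era):
        sample.extend(sorted(rng.sample(list(years), k)))
    return sample
```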
