Abstract
This document presents a comprehensive statistical methodology for generating synthetic persona panels that accurately represent the United States population. The approach combines authoritative microdata from the American Community Survey (ACS) with advanced calibration techniques to produce statistically valid synthetic populations for research, market analysis, and strategic planning applications.
Executive Summary
Core Methodology: We construct finite synthetic populations (typically 300,000 personas) by processing individual-level Census Bureau microdata through iterative proportional fitting algorithms. This approach preserves the complex multivariate structure of real populations—including realistic correlations between occupation, income, education, and geographic location—while ensuring aggregate distributions match official population benchmarks.
Key Differentiators:
Source Fidelity: Built exclusively from U.S. Census Bureau public-use microdata, not parametric models or assumptions
Structural Realism: Preserves within-person coherence (e.g., software engineers have appropriate education, income, and industry codes) by starting from actual survey records
Methodological Transparency: Every transformation, calibration target, and algorithmic parameter is documented and auditable
Geographic Precision: Provides state and Public Use Microdata Area (PUMA) attribution, with defensible city-level identification for major metropolitan areas
Statistical Controls: The calibration process ensures synthetic personas reproduce official population distributions across 15+ dimensions including age-by-sex, race/ethnicity, education-by-age, employment status, detailed occupation, industry sector, and household income brackets.
Quality Assurance: Each panel undergoes comprehensive validation including marginal distribution checks (target: <1% deviation on controlled dimensions), cross-tabulation audits, logical consistency verification, and sensitivity analysis.
Applications: The methodology supports consumer research, market segmentation, policy simulation, demographic forecasting, and AI training datasets requiring representative population samples with rich attribute sets.
Reproducibility: The entire pipeline is version-controlled, parameterized, and deterministic (with explicit random seeds for stochastic components), enabling exact replication of any released panel.
Table of Contents
1. Introduction and Research Objectives
2. Data Infrastructure and Sampling Framework
3. Data Preprocessing, Harmonization, and Feature Engineering
4. Geographic Attribution: State-Level and Sub-State Geographic Units
5. Calibration Targets and Statistical Control Specification
6. Weight Calibration Methodology: Iterative Proportional Fitting with Stability Protocols
7. Integer Panel Generation through Probabilistic Sampling
8. Methodologically Constrained Modeling of Non-Survey Attributes
9. Validation Protocols and Quality Assurance Framework
10. Privacy Safeguards, Equity Considerations, and Known Limitations
11. Reproducibility Standards and Version Control
References
Appendix A: Data Sources Registry
Appendix B: Variable Dictionary and Transformation Protocols
1. Introduction and Research Objectives
The construction of synthetic populations represents a critical methodological challenge in social science research, policy analysis, and market intelligence. This document presents a comprehensive framework for generating a finite set of synthetic micro-units—designated as "personas"—whose joint distributional properties exhibit statistical consistency with authoritative United States microdata sources.
1.1 Methodological Principles
The design architecture is structured around three fundamental desiderata:
Representativeness: When appropriately weighted, the persona ensemble reproduces both marginal and cross-marginal distributions of pivotal demographic and economic characteristics at national and, where specified, subnational scales. This property ensures that aggregate statistics derived from the synthetic panel converge to population parameters within acceptable tolerance bounds (Lohr, 2021).
Structural Realism: The methodology eschews abstract parametric modeling in favor of direct utilization of empirical microdata records. This approach preserves within-unit coherence of multivariate relationships (e.g., occupation-industry-income linkages, education-age associations, household-level attribute consistency) to the maximum extent feasible, thereby maintaining the complex dependency structure observed in actual populations (Alfons et al., 2011; Müller & Axhausen, 2011).
Methodological Transparency: Every procedural step is governed by documented data sources, deterministic transformation protocols, and auditable calibration specifications. This commitment to transparency enables independent verification and facilitates adaptation to alternative contexts or requirements, consistent with best practices in computational social science (National Academies, 2017).
1.2 Analytical Pipeline
The synthesis pipeline consists of the following sequential stages:
Stage I – Data Ingestion: Acquisition of person- and household-level microdata from the American Community Survey Public Use Microdata Sample (ACS PUMS), using the most recent available vintage so that the panel reflects the current population.
Stage II – Harmonization and Feature Engineering: Systematic harmonization of variable encodings, consolidation of categorical classifications, construction of derived features required for downstream calibration, and linkage of person- and household-level records following established data management protocols (U.S. Census Bureau, 2023).
Stage III – Geographic Attribution: Assignment of geographic descriptors consistent with the confidentiality architecture of PUMS data, specifically Public Use Microdata Areas (PUMAs), supplemented by methodologically defensible city-level identifications where applicable.
Stage IV – Statistical Calibration: Implementation of iterative proportional fitting (IPF) algorithms to adjust microdata weights such that weighted personas conform to target distributions across multiple control dimensions including age-by-sex, race and ethnicity, education-by-age, employment status, occupational classification, industrial sector, household income stratification, and additional specifications as required (Deming & Stephan, 1940; Little & Wu, 1991).
Stage V – Integerization: Transformation of calibrated real-valued weights into an integer panel of precisely N personas (e.g., 300,000 units) through probabilistic sampling procedures that preserve proportional representation (Lovelace & Ballas, 2013).
Stage VI – Augmentation and Quality Assurance: Optional assignment of attributes not present in the source survey through methodologically constrained modeling procedures, followed by comprehensive diagnostic protocols and validation against known population benchmarks.
2. Data Infrastructure and Sampling Framework
2.1 Primary Data Source: American Community Survey Public Use Microdata Sample
The American Community Survey (ACS) Public Use Microdata Sample constitutes the foundational data infrastructure for this synthesis. The ACS represents the largest household survey in the United States, providing detailed demographic, social, economic, and housing information for the resident population. The PUMS release provides anonymized, record-level data for both individuals (person files) and households (household files), accompanied by survey-based probability weights that project the sample to population totals (U.S. Census Bureau, 2023).
Data Vintage Selection: The analysis utilizes the 1-year PUMS product to maximize temporal currency and capture the most recent population characteristics. The 1-year sample provides sufficient statistical power for national-level estimation while maintaining recency in rapidly evolving demographic and economic indicators.
Record Linkage: Person-level and household-level records are linked through the unique household identifier (SERIALNO), enabling the attribution of household-level characteristics (e.g., HINCP household income, TEN housing tenure, household size) to individual person records. This linkage preserves the hierarchical structure of household membership (U.S. Census Bureau, 2022).
Survey Weight Treatment: The initial analysis respects ACS-provided person weights (PWGTP), which incorporate complex sample design features including stratification, clustering, and nonresponse adjustment. Subsequent calibration procedures modify these base weights to achieve distributional targets while maintaining statistical efficiency (Valliant et al., 2013).
Variable Semantics: All variable interpretations adhere to official ACS data dictionaries published by the U.S. Census Bureau. Occupation codes (OCCP) are based on the Standard Occupational Classification (SOC) system maintained by the Bureau of Labor Statistics; industry codes (INDP) are based on the North American Industry Classification System (NAICS); educational attainment (SCHL) follows Census Bureau categorizations.
2.2 Geographic Framework: Confidentiality and Spatial Resolution
Public Use Microdata Areas (PUMAs): To protect respondent confidentiality, the finest geographic identifier in PUMS data is the PUMA, a geographic unit defined to contain at least 100,000 persons. PUMAs are nested within states and do not cross state boundaries. The analysis retains state Federal Information Processing Standards (FIPS) codes and PUMA identifiers as the geographic foundation (U.S. Census Bureau, 2020).
Spatial Limitations: Users should note that PUMAs represent an aggregated geography that may not align with jurisdictional or market boundaries of interest. Census tracts, block groups, ZIP codes, and precise municipal boundaries are not available in public-use microdata due to disclosure risk (Federal Committee on Statistical Methodology, 2005).
2.3 Auxiliary Data Sources
IPUMS Large Place–PUMA Crosswalk (2020 Geography): This crosswalk, maintained by the Integrated Public Use Microdata Series (IPUMS), provides population allocation estimates between PUMAs and incorporated places with populations of 75,000 or more. The crosswalk enables probabilistic assignment of city names to personas and supports urban-suburban-rural classification through population share thresholds (Ruggles et al., 2024).
Optional Occupational Benchmarks: When specified, state-level occupation distribution estimates from the Bureau of Labor Statistics Occupational Employment and Wage Statistics (OEWS) may be incorporated as supplementary calibration targets to enhance the precision of occupational margins.
Optional Income Distribution Refinement: For applications requiring precise representation of upper-tail income distributions, Internal Revenue Service Statistics of Income (SOI) quantile estimates or Pareto tail parameterizations may be employed (Piketty & Saez, 2003). The baseline configuration relies exclusively on ACS-derived income measures with inflation adjustment.
3. Data Preprocessing, Harmonization, and Feature Engineering
3.1 Variable Selection and Record Integration
The preprocessing stage begins with the selection of core variables spanning demographic, economic, social, and geographic domains. Person-level variables extracted include age (AGEP), sex (SEX), Hispanic origin (HISP), race (RAC1P), educational attainment (SCHL), employment status recode (ESR), detailed occupation (OCCP), detailed industry (INDP), personal income (PINCP), person weight (PWGTP), language spoken at home (LANP), English-speaking ability (ENG), nativity (NATIVITY), citizenship status (CIT), military service (MIL or VETSTAT depending on data year), and disability difficulty indicators (DEAR, DEYE, DREM, DPHY, DOUT, DDRS), in addition to state and PUMA identifiers (U.S. Census Bureau, 2023).
Household-level variables extracted include household income (HINCP), adjustment factor for income inflation (ADJINC), housing tenure (TEN), household size (NP), number of own children (NOC), and internet access indicators (ACCESSINET, HISPEED, DIALUP, SATELLITE, OTHSVCEX, BROADBND).
Following individual file ingestion, person and household files are merged on the household identifier (SERIALNO), resulting in a unified analytical file where each person record contains both individual-level and household-level attributes.
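The merge described above can be sketched as follows. This is a minimal illustration that assumes both files are already loaded as pandas DataFrames; the frame names (persons, households) and all values are hypothetical toy data, not a real PUMS extract.

```python
import pandas as pd

# Toy stand-ins for the PUMS person and household files (illustrative values only).
persons = pd.DataFrame({
    "SERIALNO": ["H1", "H1", "H2"],
    "SPORDER": [1, 2, 1],          # person number within household
    "AGEP": [41, 9, 67],
    "PWGTP": [85, 92, 40],
})
households = pd.DataFrame({
    "SERIALNO": ["H1", "H2"],
    "HINCP": [72000, 31000],
    "TEN": [1, 3],
    "NP": [2, 1],
})

# Many persons map to one household. validate= raises an error if
# SERIALNO is not unique on the household side.
merged = persons.merge(households, on="SERIALNO", how="left", validate="many_to_one")

# In this toy data every person should find a household record.
unmatched = int(merged["HINCP"].isna().sum())
```

A left join keeps every person record, so any unmatched households surface as missing values rather than silently dropping people from the file.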
3.2 Categorical Harmonization and Standardization
Raw ACS variables frequently employ numerical encoding schemes that require translation to substantively meaningful categories. The following standardization protocols are implemented:
Educational Attainment: The SCHL variable, which provides granular educational credential categories, is consolidated into five analytically tractable levels: (1) Less than high school diploma, (2) High school diploma or GED, (3) Some college or Associate's degree, (4) Bachelor's degree, (5) Graduate or professional degree. This consolidation follows standard educational attainment classifications used in labor economics research (Autor, 2014).
Employment Status: The ESR employment status recode is mapped to four principal categories: (1) Employed full-time, (2) Unemployed, (3) Not in labor force, with (4) Military service identified separately when appropriate based on context and variable availability, consistent with Bureau of Labor Statistics labor force definitions.
Occupational Classification: Detailed ACS occupation codes (OCCP) are preserved in their native form to maintain maximum informational content. Additionally, a major occupational group classification is derived through systematic keyword-based mapping aligned with SOC major groups (e.g., Computer and Mathematical Occupations, Construction and Extraction Occupations, Healthcare Practitioners and Technical Occupations). Optional configurations may employ formal SOC crosswalks for enhanced precision.
Industrial Sector: Detailed ACS industry codes (INDP) are retained alongside a sector-level classification that collapses industries into NAICS-consistent categories (e.g., Manufacturing, Information, Healthcare and Social Assistance, Retail Trade).
Household Income Stratification: Household income values (HINCP) are inflation-adjusted using the ADJINC factor to constant-dollar terms and subsequently binned into defined brackets spanning the income distribution (e.g., $0; $1,000–$9,999; $10,000–$24,999; ...; $500,000–$999,999; $1,000,000 or more).
Primary Language: Language spoken at home is derived from LANP in conjunction with ENG (English-speaking ability), yielding classifications of English, Spanish, or Other. When language data are missing, a conservative fallback protocol applies domain knowledge (e.g., elevated Spanish probability among persons of Hispanic origin) while avoiding deterministic imputation.
Veteran Status: Military service history is extracted from the MIL variable in recent ACS vintages (2023 onward) or VETSTAT in earlier vintages, yielding classifications of Active Duty, Veteran, or Not a Veteran.
Disability Status: A binary disability indicator is constructed such that an individual is classified as "Has disability" if any of the six functional difficulty variables (hearing, vision, cognitive, ambulatory, self-care, independent living) indicates difficulty, following the Census Bureau's definition aligned with the Americans with Disabilities Act (U.S. Census Bureau, 2021).
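Three of the protocols above (education consolidation, income adjustment and binning, and the disability flag) can be sketched in pandas. The SCHL code ranges, the six-implied-decimal ADJINC convention, and the 1 = yes coding of the difficulty variables are assumptions about the PUMS layout that should be checked against the data dictionary for the vintage in use. The income brackets are deliberately coarse illustrations rather than the production brackets, and all data values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SCHL": [12, 16, 19, 21, 23],
    "HINCP": [18000, 52000, 95000, 240000, 1200000],
    "ADJINC": [1019518] * 5,  # hypothetical factor, six implied decimals
    "DEAR": [2, 2, 1, 2, 2], "DEYE": [2, 2, 2, 2, 2], "DREM": [2, 2, 2, 2, 2],
    "DPHY": [2, 1, 2, 2, 2], "DOUT": [2, 2, 2, 2, 2], "DDRS": [2, 2, 2, 2, 2],
})

# Assumed SCHL layout: 1-15 below high school, 16-17 diploma/GED,
# 18-20 some college or associate's, 21 bachelor's, 22-24 graduate.
edu_labels = ["Less than HS", "HS or GED", "Some college or Associate's",
              "Bachelor's", "Graduate or professional"]
df["education"] = pd.cut(df["SCHL"], bins=[0, 15, 17, 20, 21, 24], labels=edu_labels)

# Inflation-adjust household income, then bin (left-closed intervals).
df["hh_income_adj"] = df["HINCP"] * df["ADJINC"] / 1_000_000
income_edges = [-np.inf, 25_000, 50_000, 100_000, 500_000, np.inf]
income_labels = ["<25k", "25k-49k", "50k-99k", "100k-499k", "500k+"]
df["income_bracket"] = pd.cut(df["hh_income_adj"], bins=income_edges,
                              labels=income_labels, right=False)

# Disability: any of the six difficulty items coded 1 (assumed "yes").
difficulty_cols = ["DEAR", "DEYE", "DREM", "DPHY", "DOUT", "DDRS"]
df["has_disability"] = df[difficulty_cols].eq(1).any(axis=1)
```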
3.3 Household-Level Constructs
Certain phenomena are inherently household-level rather than person-level attributes. The following household-level constructs are derived:
Presence of Children: The count of children in the household preferentially uses the NOC variable (number of own children). When NOC is unavailable or suspect, a fallback procedure counts persons under 18 years of age within the same household identifier (SERIALNO).
Housing Tenure: The TEN variable is translated into substantively meaningful categories: Owned with mortgage or loan, Owned free and clear, Rented, No cash rent (occupied without payment).
Internet Access Type: Internet connectivity is classified through analysis of ACCESSINET (presence indicator) in conjunction with service-type flags (HISPEED, SATELLITE, BROADBND, DIALUP, OTHSVCEX), yielding categories such as Broadband, Satellite, Cellular data plan only, Internet access present (type unspecified), or No internet access (Perrin & Atske, 2021).
Health Insurance Coverage: Health insurance status is constructed by integrating HICOV (coverage indicator) with specific coverage type variables (HINS1 through HINS7), resulting in classifications including Private only, Public only, Both private and public, Indian Health Service only, Uninsured, or generic Has coverage when detailed source is suppressed (Cohen et al., 2023).
Commuting Mode: For employed individuals, the journey-to-work transportation mode (JWTRNS) is categorized into modes including Drove alone, Carpooled, Public transportation, Walked, Bicycle, Worked from home. Individuals not currently employed are assigned "Not working" to maintain logical consistency.
3.4 Missing Data Protocols
The methodology adopts a conservative stance toward missing data. Rather than employing statistical imputation procedures that introduce unquantified uncertainty, missing values are generally preserved as an explicit "Unknown" category unless a defensible deterministic rule exists (e.g., deriving child count from age distributions within the household when the direct variable is missing). This approach maintains transparency regarding data completeness and avoids the propagation of imputation model assumptions into the synthetic panel (Little & Rubin, 2019).
4. Geographic Attribution: State-Level and Sub-State Geographic Units
The geographic disclosure limitation framework of ACS PUMS restricts direct identification below the PUMA level. However, methodologically sound procedures enable probabilistic city assignment and urbanicity classification within these constraints.
4.1 PUMA-to-Place Allocation Methodology
The IPUMS Large Place–PUMA crosswalk provides, for each state-PUMA combination, the population shares attributable to incorporated places with populations of 75,000 or greater. This allocation is based on the geographic overlay of 2020 PUMA boundaries with 2020 Census place boundaries (Ruggles et al., 2024).
City Assignment Protocol: For a given state-PUMA pair, if a single incorporated place accounts for a population share at or above a specified threshold (default: 50%), that place name is assigned as the "City" attribute for personas within that geographic unit. This threshold ensures that city assignment reflects genuine geographic concentration rather than marginal overlap. When no single place meets the threshold criterion, the city attribute is designated as "Unknown" to avoid misleading specificity.
Urbanicity Classification: Urban-suburban-rural classification is derived from the same population share analysis:
Urban: The dominant place accounts for ≥50% of PUMA population
Suburban: The dominant place accounts for ≥10% but <50% of PUMA population
Rural: No single place accounts for ≥10% of PUMA population
This classification serves as a proxy for settlement density and metropolitan status, though users should recognize that it represents an approximation constrained by the PUMA geography (Ratcliffe et al., 2016).
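The city and urbanicity rules can be expressed as a single function. The sketch below is illustrative: the function name classify_puma and the place names are hypothetical, and the 50% threshold is treated as inclusive, matching the ≥50% rule in the urbanicity list.

```python
def classify_puma(place_shares, city_threshold=0.50, suburban_threshold=0.10):
    """Return (city, urbanicity) for one state-PUMA pair.

    place_shares maps each large place to its share (0..1) of the PUMA
    population, as would be read from the crosswalk.
    """
    if not place_shares:
        return "Unknown", "Rural"
    # The dominant place is the one with the largest population share.
    place, share = max(place_shares.items(), key=lambda kv: kv[1])
    if share >= city_threshold:
        return place, "Urban"
    if share >= suburban_threshold:
        return "Unknown", "Suburban"
    return "Unknown", "Rural"


# Hypothetical shares for three PUMAs.
print(classify_puma({"Springfield": 0.62, "Shelbyville": 0.05}))
print(classify_puma({"Springfield": 0.30}))
print(classify_puma({"Springfield": 0.04}))
```

Only the Urban case receives a city name, so a PUMA that merely touches a large place is never labeled with it.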
4.2 Geographic Precision Limitations
Users must recognize that "City" assignments represent geographic approximation rather than precise residential location. A persona assigned to "Chicago" may reside anywhere within a PUMA substantially overlapping Chicago's municipal boundaries. Fine-grained address-level precision is incompatible with the confidentiality protections embedded in public-use microdata and is not claimed by this methodology (Federal Committee on Statistical Methodology, 2005).
5. Calibration Targets and Statistical Control Specification
Statistical calibration adjusts the distribution of persona weights to conform to target marginal distributions derived from the population. The specification of calibration controls represents a critical methodological decision that balances fidelity on key dimensions against preservation of unconstrained covariance structure (Deville & Särndal, 1992).
5.1 Control Margin Philosophy
The methodology derives all calibration targets from the identical ACS PUMS file used for persona generation. This internal consistency approach ensures that targets and microdata employ identical variable definitions, reference periods, and population concepts, thereby avoiding definitional mismatches that plague calibration procedures based on heterogeneous data sources (Burgard et al., 2017).
5.2 Baseline Control Specification
The baseline calibration regime includes the following control margins:
Age-by-Sex Cross-Classification: Joint distribution of coarse age categories (e.g., 18–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75+) by binary sex classification. This control ensures fundamental demographic structure.
Race and Ethnicity: Comprehensive classification combining Hispanic origin (HISP) and race (RAC1P) following Census Bureau guidelines and resembling Statistical Policy Directive 15 (SPD-15) categories (Office of Management and Budget, 1997). Categories typically include White non-Hispanic, Black or African American non-Hispanic, Hispanic or Latino (any race), Asian non-Hispanic, with additional granularity as sample size permits.
Education-by-Age Cross-Classification: Joint distribution of the five-level educational attainment classification by age categories. This control captures the secular trend of rising educational attainment across cohorts and maintains education-age dependency (Ryan & Bauman, 2016).
Employment Status: Marginal distribution across Employed full-time, Unemployed, Not in labor force, and Military categories.
Occupational Distribution: Distribution across major occupational groups for employed individuals, following SOC major group structure.
Industrial Sector Distribution: Distribution across industry sectors for employed individuals, following NAICS sector definitions.
Household Income Distribution: Distribution across inflation-adjusted household income brackets.
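Because targets are derived from the same PUMS file, each control margin reduces to a weighted tabulation. The following sketch computes an age-by-sex target, encoding the cross-classification as a single key. The column name age_group and the toy values are hypothetical.

```python
import pandas as pd

pums = pd.DataFrame({
    "SEX": ["Male", "Female", "Male", "Female", "Male"],
    "age_group": ["25-34", "25-34", "35-44", "35-44", "25-34"],
    "PWGTP": [100, 120, 80, 90, 60],
})

# Encode the two-way cross-classification as one categorical key.
pums["age_sex"] = pums["SEX"] + "|" + pums["age_group"]

# Target total per category: the sum of survey weights in that category.
targets = pums.groupby("age_sex")["PWGTP"].sum()

# Shares are convenient when the panel size differs from the population total.
shares = targets / targets.sum()
```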
5.3 Optional Enhanced Controls
Users may activate additional control margins subject to sample size and convergence considerations:
State-Level Occupational Margins: When state-level occupational distribution precision is paramount, auxiliary Bureau of Labor Statistics data may supplement calibration targets.
Urbanicity Proportions: Geographic settlement pattern controls may be specified to ensure appropriate urban-suburban-rural balance.
Household Structure: Controls on household size distribution or household type may be incorporated when household-level analysis is the primary application.
5.4 Control Set Optimization Principles
Excessive or overly granular controls risk numerical instability, extreme weight variance, and distortion of relationships not directly controlled. The baseline control set targets axes of heterogeneity known to have first-order importance for social and economic outcomes while preserving sufficient degrees of freedom for realistic residual covariance structure inherited from the microdata (Deville et al., 1993; Kolenikov, 2016).
6. Weight Calibration Methodology: Iterative Proportional Fitting with Stability Protocols
6.1 Theoretical Foundation
Weight calibration employs the method of iterative proportional fitting (IPF), also known as raking or the RAS algorithm. IPF belongs to the class of minimum-distance methods that seek the weight vector closest (in a specified metric) to the initial weights while satisfying linear constraints defined by calibration margins (Deming & Stephan, 1940; Fienberg, 1970).
Mathematical Formulation: Let i = 1, ..., n index microdata records, and m = 1, ..., M index distinct control margins. Each margin m partitions records into categories k = 1, ..., K_m. Denote the current weight of record i as w_i, the set of records in category k of margin m as G_{mk}, and the target total for category (m,k) as T_{mk}.
The IPF update for margin m computes adjustment factors:
$$ r_m(k) = \frac{T_{mk}}{\sum_{i \in G_{mk}} w_i} $$
and applies multiplicative rescaling:
$$ w_i \leftarrow w_i \, r_m(g_i) $$
where g_i identifies the category of record i within margin m. The algorithm cycles through all M margins repeatedly until convergence, defined as the maximum relative deviation falling below a specified tolerance threshold.
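The update cycle can be sketched in NumPy as follows. This is a bare illustration of the formulas above, without the stability safeguards, and it assumes every category holds at least one record with positive weight. The function name ipf, the integer category coding, and the toy targets are assumptions made for the example.

```python
import numpy as np

def ipf(weights, margins, targets, tol=1e-3, max_iter=100):
    """margins: one integer category-code array per margin (codes 0..K_m-1).
    targets: one array of target totals T_mk per margin, indexed by code."""
    w = np.asarray(weights, dtype=float).copy()
    margins = [np.asarray(c) for c in margins]
    targets = [np.asarray(T, dtype=float) for T in targets]
    delta = np.inf
    for iteration in range(1, max_iter + 1):
        for codes, T in zip(margins, targets):
            current = np.bincount(codes, weights=w, minlength=len(T))
            w *= (T / current)[codes]  # r_m(k), looked up per record via g_i
        # Maximum relative deviation across all margins after a full cycle.
        delta = max(
            np.max(np.abs(T / np.bincount(codes, weights=w, minlength=len(T)) - 1))
            for codes, T in zip(margins, targets)
        )
        if delta < tol:
            break
    return w, iteration, delta

# Four records, two margins (sex and age group), equal starting weights.
sex = [0, 0, 1, 1]
age = [0, 1, 0, 1]
w, n_iter, delta = ipf(np.ones(4), [sex, age], [[60, 40], [50, 50]])
print(w, n_iter, delta)  # weights become 30, 30, 20, 20 after one cycle
```

The records themselves are never altered; only their weights change, which is what allows the within-record attribute combinations to carry through unchanged.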
6.2 Algorithmic Implementation and Stability Enhancements
Scale Harmonization: Prior to iteration, target totals are rescaled such that the sum over k of T_{mk} equals the current grand total of weights for each margin. This distributional equivalence transformation eliminates numerical ill-conditioning when population totals and sample totals differ by multiple orders of magnitude (Little & Wu, 1991).
Ratio Bounding: Adjustment factors r_m(k) are bounded within an interval [r_min, r_max] (e.g., [0.1, 10]) to prevent extreme multiplicative shocks when temporary zero denominators occur or when the sample contains few records in a given category (Izrael et al., 2000).
Weight Flooring and Renormalization: Following each margin update, weights are subjected to a minimum threshold (e.g., 0.000001) to prevent exact zeros, which cause algorithmic breakdown. After flooring, weights are renormalized to preserve the grand total, maintaining the scale invariance of the target specification.
Composite Key Encoding: Multi-dimensional controls (e.g., age-by-sex) are encoded as single categorical variables through concatenation of dimension values (e.g., "Male|25–34"). This encoding guarantees unambiguous per-record indexing and eliminates array dimensionality complications.
Convergence Monitoring: At each iteration, the algorithm computes the maximum absolute ratio deviation:
$$ \delta = \max_{m,k} \left| \frac{T_{mk}}{\sum_{i \in G_{mk}} w_i} - 1 \right| $$
Convergence is declared when δ < ε (e.g., ε = 0.001) or when a maximum iteration limit is reached. Both the convergence criterion and iteration count are logged for quality assurance.
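A single margin update with the safeguards described in this section might look like the sketch below. The function name stabilized_update and the default bounds are illustrative. Leaving the ratio at 1 when a category has a zero denominator is an assumption of this sketch, not a documented rule.

```python
import numpy as np

def stabilized_update(w, codes, T, r_min=0.1, r_max=10.0, floor=1e-6):
    w = np.asarray(w, dtype=float)
    codes = np.asarray(codes)
    T = np.asarray(T, dtype=float)
    grand_total = w.sum()

    # Scale harmonization: express targets on the scale of the current weights.
    T = T * grand_total / T.sum()

    current = np.bincount(codes, weights=w, minlength=len(T))
    with np.errstate(divide="ignore", invalid="ignore"):
        # Assumed rule: a zero denominator leaves the category unadjusted.
        r = np.where(current > 0, T / current, 1.0)

    # Ratio bounding limits the size of any one multiplicative step.
    r = np.clip(r, r_min, r_max)

    # Weight flooring prevents exact zeros, then renormalize to the grand total.
    w_new = np.maximum(w * r[codes], floor)
    return w_new * grand_total / w_new.sum()

# Targets given on a population scale; weights stay on the sample scale.
w = stabilized_update(np.ones(4), [0, 0, 1, 1], [6000, 4000])
```

Note that renormalization after bounding or flooring rescales all weights by a common factor, so the effective per-category ratio can differ slightly from the clipped value.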
6.3 Properties and Implications
Preservation of Microdata Structure: Because IPF operates exclusively through weight modification without altering record attributes, the within-record coherence of complex multivariate relationships (e.g., the empirically observed occupation-industry-income joint distribution) is preserved. This property distinguishes the approach from synthetic data generation methods that model attributes independently (Müller & Axhausen, 2011).
Impact on Uncontrolled Relationships: While controlled margins are brought into agreement with targets by construction, uncontrolled bivariate and multivariate relationships may be subtly reshaped by the weight adjustments. The degree of impact depends on the correlation structure between controlled and uncontrolled attributes. Extensive controls may constrain uncontrolled relationships more tightly; parsimonious controls preserve greater flexibility (Deville & Särndal, 1992).
Asymptotic Properties: Under mild regularity conditions, IPF converges to the unique solution that minimizes Kullback-Leibler divergence from the initial weight distribution subject to the linear calibration constraints (Ireland & Kullback, 1968). This interpretation provides information-theoretic justification for the procedure.
7. Integer Panel Generation through Probabilistic Sampling
Calibrated weights are real-valued and, after rescaling by a common factor, sum to the target panel size N. To construct a concrete synthetic panel consisting of N integer observations, a probabilistic integerization procedure is employed.
7.1 Truncate-Replicate-Sample (TRS) Algorithm
The TRS procedure operates in three sequential steps:
Step 1 – Truncation: For each record i with calibrated weight w_i, compute the integer floor c_i = ⌊w_i⌋ (greatest integer less than or equal to w_i) and the residual probability p_i = w_i - c_i.
Step 2 – Deterministic Replication: Include each record i exactly c_i times in the synthetic panel.
Step 3 – Probabilistic Sampling: The deterministic replication produces n_det = sum over i of c_i observations. To reach the target panel size N, an additional n_stoch = N - n_det observations are drawn through weighted random sampling without replacement, where each record's selection probability is proportional to its residual p_i.
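The three steps can be sketched with NumPy as follows. The function name trs_integerize and the seed value are illustrative, and the sketch assumes the weights have already been scaled to sum to N.

```python
import numpy as np

def trs_integerize(weights, seed=12345):
    """Return an integer replication count for each record."""
    w = np.asarray(weights, dtype=float)
    N = int(round(w.sum()))

    # Steps 1 and 2: integer floor gives the deterministic replication count.
    counts = np.floor(w).astype(int)
    residual = w - counts

    # Step 3: fill the remaining slots by sampling without replacement,
    # with selection probability proportional to each record's residual.
    n_stoch = N - counts.sum()
    if n_stoch > 0:
        rng = np.random.default_rng(seed)
        picks = rng.choice(len(w), size=n_stoch, replace=False,
                           p=residual / residual.sum())
        counts[picks] += 1
    return counts

weights = [2.5, 1.25, 0.75, 3.5]       # sums to N = 8
counts = trs_integerize(weights, seed=42)
panel_index = np.repeat(np.arange(len(weights)), counts)  # one row per persona
```

Sampling without replacement means no record can gain more than one extra copy, so each count is either the floor or the ceiling of its calibrated weight.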
7.2 Statistical Properties
The TRS algorithm produces an integer panel whose empirical distribution converges to the calibrated weight distribution as N increases. The procedure is unbiased in expectation and widely employed in synthetic population generation across demographic and transportation modeling applications (Lovelace & Ballas, 2013; Pritchard & Miller, 2012).
7.3 Stochastic Replicability
The probabilistic sampling step (Step 3) introduces stochasticity into the panel generation. To ensure exact reproducibility, the random number generator is initialized with an explicitly documented seed value. Identical seed values produce identical panels given identical inputs.
8. Methodologically Constrained Modeling of Non-Survey Attributes
Certain attributes frequently requested for persona analysis are not collected in the American Community Survey. Where demand justifies inclusion and methodologically sound modeling approaches exist, these attributes may be synthesized as optional fields with explicit documentation of priors and uncertainty.
8.1 Religious Affiliation (Optional Field)
Modeling Rationale: Religious affiliation exhibits complex geographic, ethnic, and generational patterns that are not reducible to ACS variables. Nonetheless, approximate national and state-level distributions are available from religious demography surveys (e.g., Pew Research Center Religious Landscape Studies, Association of Religion Data Archives) (Pew Research Center, 2014, 2021).
Assignment Methodology: National (or state-level, when specified) marginal distributions over broad religious tradition categories (e.g., Evangelical Protestant, Mainline Protestant, Catholic, Orthodox Christian, Jewish, Muslim, Hindu, Buddhist, Other faiths, Unaffiliated) serve as base prior probabilities. To introduce realism without overfitting, these priors undergo bounded, monotone adjustments conditioned on race/ethnicity (e.g., modestly elevated Catholic probability among Hispanic/Latino populations; slightly elevated Hindu and Buddhist probabilities among Asian populations). Adjustment magnitudes are constrained to prevent stereotyping and are normalized to preserve aggregate marginal targets.
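One way to combine bounded group-level adjustments with preservation of the aggregate margin is to rake a small group-by-category table. The sketch below is an illustration under stated assumptions, not the documented procedure. It uses abstract categories and groups, and the priors, group shares, multipliers, and bounds are all hypothetical placeholders.

```python
import numpy as np

def adjust_priors(prior, group_shares, multipliers, lo=0.8, hi=1.25, n_iter=200):
    """Return P(category | group) with rows summing to 1 and the
    share-weighted average across groups equal to the prior."""
    prior = np.asarray(prior, dtype=float)
    s = np.asarray(group_shares, dtype=float)
    # Bound the adjustments before they are applied.
    m = np.clip(np.asarray(multipliers, dtype=float), lo, hi)
    joint = s[:, None] * prior[None, :] * m   # groups x categories
    for _ in range(n_iter):
        joint *= (s / joint.sum(axis=1))[:, None]      # match group shares
        joint *= (prior / joint.sum(axis=0))[None, :]  # match aggregate prior
    return joint / joint.sum(axis=1, keepdims=True)

prior = [0.5, 0.3, 0.2]            # hypothetical aggregate distribution
group_shares = [0.6, 0.4]          # hypothetical population shares of two groups
multipliers = [[1.0, 1.0, 1.0],
               [0.9, 1.5, 1.0]]    # 1.5 is clipped to the upper bound 1.25
cond = adjust_priors(prior, group_shares, multipliers)
```

Because the raking step rescales the table after clipping, the final conditional probabilities can sit slightly outside the nominal bounds relative to the prior; the bounds constrain the starting adjustment, not the end result.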
Limitations: Religious affiliation exhibits substantial within-group heterogeneity and is influenced by factors (family background, geographic religious composition, conversion) not observable in ACS. Modeled religious affiliations should be interpreted as statistically plausible assignments rather than empirical measurements.
8.2 Voter Registration Status (Optional Field)
Modeling Rationale: Voter registration status is a consequential political participation variable. While not surveyed in ACS, registration rates exhibit well-documented monotone relationships with age (positive) and educational attainment (positive) (File, 2020; McDonald, 2021).
Assignment Methodology: Persons under age 18 or non-citizens are deterministically classified as "Not eligible." For eligible persons, registration probability is modeled as a logistic function of age and education, parameterized to reproduce aggregate national or state-level registration rates when available (e.g., from the Current Population Survey Voting and Registration Supplement). Probabilities are bounded within [0.05, 0.95] to avoid degenerate assignments while maintaining realistic heterogeneity.
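A minimal sketch of such a model for eligible persons follows. The slope coefficients, centering values, and target rate are hypothetical placeholders rather than estimated parameters, and the intercept is found by bisection so that the mean bounded probability matches the target. An unweighted mean is used for simplicity.

```python
import numpy as np

def registration_probs(age, edu_level, target_rate,
                       b_age=0.02, b_edu=0.35, lo=0.05, hi=0.95):
    """age in years; edu_level on the five-level scale (1..5).
    Apply only to eligible persons (adult citizens)."""
    age = np.asarray(age, dtype=float)
    edu = np.asarray(edu_level, dtype=float)

    def probs(intercept):
        z = intercept + b_age * (age - 45) + b_edu * (edu - 3)
        return np.clip(1.0 / (1.0 + np.exp(-z)), lo, hi)

    # The mean probability is non-decreasing in the intercept, so bisect.
    a, b = -10.0, 10.0
    for _ in range(60):
        mid = (a + b) / 2
        if probs(mid).mean() < target_rate:
            a = mid
        else:
            b = mid
    return probs((a + b) / 2)

ages = [18, 30, 45, 60, 75]
edu = [2, 3, 4, 5, 1]
p = registration_probs(ages, edu, target_rate=0.70)  # hypothetical benchmark

# Draw the modeled status with a documented seed for reproducibility.
rng = np.random.default_rng(2024)
registered = rng.random(len(p)) < p
```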
Limitations: Registration status is also influenced by state registration laws, residential mobility, political mobilization, and individual civic engagement—factors not captured in the model. Modeled registration should be interpreted as statistically consistent with population benchmarks rather than individual predictions.
8.3 Sub-PUMA Geographic Specificity
City Name Assignment in Rural Contexts: Where the IPUMS crosswalk does not support a defensible city assignment (i.e., no incorporated place dominates the PUMA), the city field is set to "Unknown." If display labels are required for visualization purposes, the PUMA name may be provided with explicit notation (e.g., "PUMA 00100, State") to avoid misinterpretation as municipal boundaries.
9. Validation Protocols and Quality Assurance Framework
9.1 Marginal Distribution Validation
For every calibration control margin, the validation protocol computes and reports:
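Absolute Deviation: For each category k in margin m, the absolute difference between the weighted panel frequency and the target frequency: Δ_{mk} = | Σ_{i ∈ S_{mk}} w_i − T_{mk} |, where S_{mk} is the set of synthetic personas falling in category k of margin m, w_i is the weight of persona i, and T_{mk} is the target.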
Relative Deviation: The relative error: ε_{mk} = Δ_{mk} / T_{mk}, defined for categories with T_{mk} > 0.
Information-Theoretic Measures: Where applicable, Kullback-Leibler divergence between the synthetic panel distribution and target distribution quantifies overall distributional fidelity (Kullback & Leibler, 1951).
Deviations are assessed against pre-specified tolerance thresholds (e.g., maximum relative deviation <1% for controlled margins, <5% for monitored but uncontrolled margins).
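The three diagnostics above can be computed as in the following sketch. The direction of the Kullback-Leibler divergence (synthetic panel as P, target as Q) is an assumption, since the text does not fix it, and the age categories and counts are made-up illustration values.

```python
import math


def margin_diagnostics(weighted_counts, targets):
    """Return {category: (absolute deviation, relative deviation)} for one margin."""
    rows = {}
    for k, t in targets.items():
        abs_dev = abs(weighted_counts.get(k, 0.0) - t)
        rel_dev = abs_dev / t if t > 0 else float("nan")
        rows[k] = (abs_dev, rel_dev)
    return rows


def kl_divergence(weighted_counts, targets):
    """D(P || Q) with P = panel distribution, Q = target distribution (assumed direction)."""
    p_total = sum(weighted_counts.get(k, 0.0) for k in targets)
    q_total = sum(targets.values())
    d = 0.0
    for k, t in targets.items():
        p = weighted_counts.get(k, 0.0) / p_total
        q = t / q_total
        if p == 0:
            continue          # zero-probability terms contribute nothing
        if q == 0:
            return math.inf   # panel mass where the target has none
        d += p * math.log(p / q)
    return d


# Made-up example margin: age group totals for a 300,000-persona panel.
targets = {"18-34": 90_000, "35-64": 150_000, "65+": 60_000}
panel = {"18-34": 90_450, "35-64": 149_700, "65+": 59_850}

rows = margin_diagnostics(panel, targets)
max_rel = max(rel for _, rel in rows.values())
passes_controlled_tolerance = max_rel < 0.01   # the <1% threshold for controlled margins
kl = kl_divergence(panel, targets)
```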
9.2 Cross-Tabulation Validation
To assess the preservation of realistic joint distributions beyond directly controlled interactions, the validation protocol generates selected cross-tabulations including:
Language spoken at home by race and ethnicity
Household income bracket by industry sector
Health insurance coverage by age group
Educational attainment by occupation
Commute mode by urbanicity classification
These cross-tabulations are compared to analogous tabulations from the weighted ACS PUMS to verify face validity, monotone trends (where theoretically expected), and absence of pathological artifacts (Templ et al., 2017).
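One simple way to make such a comparison concrete is to compute row-normalized shares for the same cross-tabulation in both data sets and report the largest absolute gap. The sketch below assumes records are plain dictionaries carrying a weight field named `w`; the field names and the toy counts are hypothetical.

```python
from collections import defaultdict


def weighted_crosstab(records, row, col, weight="w"):
    """Weighted counts indexed as table[row_value][col_value]."""
    table = defaultdict(lambda: defaultdict(float))
    for r in records:
        table[r[row]][r[col]] += r[weight]
    return table


def row_shares(table):
    """Convert each row of a cross-tabulation to shares summing to 1."""
    shares = {}
    for row_key, cols in table.items():
        total = sum(cols.values())
        shares[row_key] = {col_key: v / total for col_key, v in cols.items()}
    return shares


def max_share_gap(panel, reference, row, col):
    """Largest absolute difference in row shares between panel and reference."""
    a = row_shares(weighted_crosstab(panel, row, col))
    b = row_shares(weighted_crosstab(reference, row, col))
    gap = 0.0
    for row_key in set(a) | set(b):
        col_keys = set(a.get(row_key, {})) | set(b.get(row_key, {}))
        for col_key in col_keys:
            diff = a.get(row_key, {}).get(col_key, 0.0) - b.get(row_key, {}).get(col_key, 0.0)
            gap = max(gap, abs(diff))
    return gap


# Toy data: health insurance coverage by age group (made-up weights).
reference = [
    {"age": "18-64", "ins": "Covered", "w": 88.0},
    {"age": "18-64", "ins": "Uninsured", "w": 12.0},
    {"age": "65+", "ins": "Covered", "w": 99.0},
    {"age": "65+", "ins": "Uninsured", "w": 1.0},
]
panel = [
    {"age": "18-64", "ins": "Covered", "w": 87.0},
    {"age": "18-64", "ins": "Uninsured", "w": 13.0},
    {"age": "65+", "ins": "Covered", "w": 99.0},
    {"age": "65+", "ins": "Uninsured", "w": 1.0},
]
gap = max_share_gap(panel, reference, "age", "ins")
```

A single summary number cannot replace inspection of the full tables, but it gives an automated first screen against the 5% tolerance for monitored margins.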
9.3 Logical Consistency Audits
Automated validation rules enforce logical constraints:
Commute mode consistency: Persons with employment status "Unemployed" or "Not in labor force" must have commute mode "Not working."
Registration eligibility: Voter registration status "Not eligible" is mandatory for persons age <18 or non-citizens; conversely, "Not eligible" may not appear for adult citizens.
Health coverage coherence: Detailed coverage categories (e.g., "Private only") must be consistent with the binary coverage indicator (HICOV).
Household attribute constancy: All persons sharing a household identifier must have identical household-level attributes (income, tenure, internet access, number of children).
Violations of logical constraints trigger diagnostic flags and are systematically resolved prior to panel release.
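The four rules above might be encoded as in this sketch. The persona field names are hypothetical, and the HICOV coding (1 = covered, 2 = not covered) is assumed to follow the PUMS convention.

```python
# Household-level fields that must be constant within a household (hypothetical names).
HOUSEHOLD_FIELDS = ("household_income", "tenure", "internet_access", "children")


def audit_person(p):
    """Return the list of person-level rules that persona p violates."""
    flags = []
    # Commute mode consistency.
    not_working = p["employment_status"] in {"Unemployed", "Not in labor force"}
    if not_working and p["commute_mode"] != "Not working":
        flags.append("commute_mode")
    # Registration eligibility: "Not eligible" if and only if minor or non-citizen.
    ineligible = p["age"] < 18 or not p["is_citizen"]
    if ineligible != (p["voter_registration"] == "Not eligible"):
        flags.append("registration_eligibility")
    # Health coverage coherence: "Uninsured" if and only if HICOV says no coverage.
    has_coverage = p["hicov"] == 1
    if has_coverage == (p["coverage_detail"] == "Uninsured"):
        flags.append("health_coverage")
    return flags


def inconsistent_households(persons):
    """Return household ids whose members disagree on any household-level field."""
    first_seen, bad = {}, set()
    for p in persons:
        attrs = tuple(p[f] for f in HOUSEHOLD_FIELDS)
        if first_seen.setdefault(p["household_id"], attrs) != attrs:
            bad.add(p["household_id"])
    return bad


# Toy personas: one consistent record and one that breaks every rule.
ok = dict(household_id="H1", age=44, is_citizen=True,
          employment_status="Employed full-time", commute_mode="Walked",
          voter_registration="Registered", hicov=1, coverage_detail="Private only",
          household_income=82000, tenure="Rented", internet_access="Broadband",
          children=1)
bad = dict(ok, age=16, employment_status="Not in labor force", hicov=2,
           household_income=90000)

person_flags = audit_person(bad)
household_flags = inconsistent_households([ok, bad])
```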
9.4 Sensitivity Analysis
The robustness of the synthetic panel is evaluated through sensitivity analyses that perturb:
Control set composition: Adding or removing calibration margins to assess impact on uncontrolled distributions.
Algorithmic parameters: Varying ratio bounds, convergence tolerance, and maximum iterations to verify stability.
Sample size: Generating panels at multiple target sizes to assess scaling properties.
Baseline configurations are selected to exhibit stable behavior under reasonable parameter perturbations (Saltelli et al., 2008).
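A toy version of such a sensitivity run is sketched below. It uses a bare-bones raking loop as a stand-in for the calibration step (no ratio bounds, no integerization), varies the convergence tolerance and iteration cap, and tracks an uncontrolled attribute. The records, targets, and the `renter` flag are invented for illustration.

```python
def rake(records, targets, tol=1e-6, max_iter=100):
    """Iteratively scale weights so each controlled margin matches its targets."""
    weights = [r["w"] for r in records]
    for _ in range(max_iter):
        worst = 0.0
        for field, target in targets.items():
            totals = {}
            for r, w in zip(records, weights):
                totals[r[field]] = totals.get(r[field], 0.0) + w
            # Largest relative miss on this margin before adjusting it.
            worst = max(worst, max(abs(totals[c] - target[c]) / target[c] for c in target))
            for i, r in enumerate(records):
                weights[i] *= target[r[field]] / totals[r[field]]
        if worst < tol:
            break
    return weights


def weighted_share(records, weights, flag):
    """Weighted share of records where the uncontrolled flag is truthy."""
    return sum(w for r, w in zip(records, weights) if r[flag]) / sum(weights)


records = [
    {"w": 1.0, "sex": "F", "age": "18-44", "renter": 1},
    {"w": 1.0, "sex": "F", "age": "45+", "renter": 0},
    {"w": 1.0, "sex": "M", "age": "18-44", "renter": 1},
    {"w": 1.0, "sex": "M", "age": "45+", "renter": 0},
    {"w": 1.0, "sex": "M", "age": "45+", "renter": 1},
]
targets = {"sex": {"F": 52.0, "M": 48.0}, "age": {"18-44": 45.0, "45+": 55.0}}

# Algorithmic parameters: perturb tolerance and iteration cap.
grid = [(tol, it) for tol in (1e-4, 1e-6, 1e-8) for it in (50, 500)]
shares = [weighted_share(records, rake(records, targets, tol, it), "renter")
          for tol, it in grid]
spread = max(shares) - min(shares)

# Control set composition: drop the age margin and compare the uncontrolled share.
baseline = weighted_share(records, rake(records, targets), "renter")
sex_only = weighted_share(records, rake(records, {"sex": targets["sex"]}), "renter")
```

A small `spread` suggests the uncontrolled share is stable under the parameter perturbations, while the gap between `baseline` and `sex_only` shows how much a dropped control margin shifts it.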
10. Privacy Safeguards, Equity Considerations, and Known Limitations
10.1 Privacy Protection
Synthetic personas do not correspond to actual survey respondents. The methodology operates exclusively on public-use microdata that have undergone Census Bureau confidentiality protocols including geographic aggregation to PUMAs of approximately 100,000 persons, top-coding of extreme values, and limited noise infusion in selected variables. The synthesis pipeline does not attempt re-identification, does not introduce external individual-level data linkages, and does not reverse Census Bureau disclosure limitation procedures (U.S. Census Bureau, 2020; Federal Committee on Statistical Methodology, 2005).
10.2 Equity and Algorithmic Fairness
The methodology acknowledges two sources of potential bias:
Source Data Limitations: ACS estimates are subject to sampling variance, nonresponse bias, and measurement error. Differential nonresponse or misreporting across demographic groups, if present in ACS, will be reflected in the synthetic panel. Calibration corrects distributional misalignment but cannot remedy systematic reporting errors in the source survey (Groves, 2006; National Research Council, 2013).
Modeled Attribute Priors: Optional attributes (religious affiliation, voter registration) require prior distributional assumptions. The methodology adopts conservative, bounded adjustments informed by social science research and avoids deterministic stereotyping. All modeled components are optional, fully documented, and auditable. Where aggregate benchmarks exist (e.g., state-level registration rates), they are employed as calibration targets.
10.3 Statistical and Methodological Limitations
Sampling Variance: The ACS PUMS is a complex probability sample with associated sampling error. The synthetic panel aims at distributional fidelity to the sample; it does not reproduce the replicate-weight variance estimation framework used to compute ACS standard errors (U.S. Census Bureau, 2022).
Temporal Alignment: The 1-year ACS PUMS pools responses collected continuously over a 12-month calendar year rather than at a single point in time. Users must ensure that the panel vintage aligns with the temporal reference period of the intended application. Demographic and economic conditions may shift between panel generation and application deployment.
Geographic Precision: City-level identification is available only for large incorporated places through the IPUMS crosswalk and is subject to the population share threshold criterion. For many PUMAs, particularly in non-metropolitan areas, city-level attribution is not methodologically defensible. PUMAs represent the finest consistently available geography.
Income Distribution Upper Tail: Standard household income brackets may underrepresent the Pareto tail of the income distribution. Applications requiring precise upper-tail representation should enable optional top-tail calibration using administrative tax data or parametric tail fitting (Piketty & Saez, 2003; Burkhauser et al., 2012).
Causal Inference: While the synthetic panel preserves observational correlations from microdata, it does not encode causal relationships. Users conducting policy simulation or counterfactual analysis must supply external causal models; the panel provides a realistic population structure on which such models may operate.
11. Reproducibility Standards and Version Control
The complete synthesis pipeline is implemented as a scripted, version-controlled codebase with comprehensive parameterization. Reproducibility is ensured through systematic tracking of:
Data Provenance: ACS PUMS vintage (year and release date), Census Bureau data dictionary version, IPUMS crosswalk version, and any auxiliary data sources with explicit version identifiers and retrieval dates.
Transformation Protocols: All variable recodes, derived features, and consolidation rules are codified in version-controlled scripts with inline documentation. Changes to transformation logic are tracked through version control commit history.
Calibration Configuration: The control margin specification, target computation procedures, IPF hyperparameters (ratio bounds, convergence tolerance, maximum iterations), and random seed values are stored in machine-readable configuration files that accompany each panel release.
Outputs and Metadata: Each synthetic panel release includes a machine-readable schema documenting all variables and their permissible values, a manifest file containing cryptographic hashes of all input files, and a comprehensive quality assurance report with summary diagnostic statistics.
Stochastic procedures (integerization residual sampling, modeled attribute assignment) are controlled by explicit random seeds, enabling bit-exact replication of any released panel given identical inputs and configuration (Stodden et al., 2014; Peng, 2011).
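As an illustration of the manifest and seeding conventions described in this section, the sketch below hashes input files with SHA-256 and draws from an explicitly seeded generator. The manifest layout, file name, and configuration keys are hypothetical; the text does not name the hash algorithm or the random number generator the pipeline uses.

```python
import hashlib
import json
import os
import random
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_paths, config):
    """Record a hash for every input file alongside the run configuration."""
    return {
        "inputs": {os.path.basename(p): sha256_of_file(p) for p in sorted(input_paths)},
        "config": config,
    }


def seeded_draws(seed, n):
    """Draw n uniform variates from a generator isolated from global state."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Demonstration with a throwaway input file.
content = b"SERIALNO,AGEP\n1,34\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_input.csv")
    with open(path, "wb") as fh:
        fh.write(content)
    manifest = build_manifest([path], {"seed": 20240101, "ipf_tolerance": 1e-6})

manifest_json = json.dumps(manifest, sort_keys=True, indent=2)
expected_digest = hashlib.sha256(content).hexdigest()
```

Passing a dedicated `random.Random(seed)` instance to each stochastic step, rather than seeding the global generator, keeps the draw sequence independent of the order in which other code consumes random numbers.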
12. References
Alfons, A., Kraft, S., Templ, M., & Filzmoser, P. (2011). Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20(3), 383-407.
Autor, D. H. (2014). Skills, education, and the rise of earnings inequality among the "other 99 percent". Science, 344(6186), 843-851.
Burgard, J. P., Münnich, R., & Zimmermann, T. (2017). The impact of sampling designs on small area estimates for business data. Journal of Official Statistics, 33(4), 1031-1053.
Burkhauser, R. V., Feng, S., Jenkins, S. P., & Larrimore, J. (2012). Recent trends in top income shares in the United States: Reconciling estimates from March CPS and IRS tax return data. Review of Economics and Statistics, 94(2), 371-388.
Cohen, R. A., Boersma, P., & Gindi, R. M. (2023). Health insurance coverage: Early release of estimates from the National Health Interview Survey, 2022. National Center for Health Statistics.
Deming, W. E., & Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Annals of Mathematical Statistics, 11(4), 427-444.
Deville, J. C., & Särndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418), 376-382.
Deville, J. C., Särndal, C. E., & Sautory, O. (1993). Generalized raking procedures in survey sampling. Journal of the American Statistical Association, 88(423), 1013-1020.
Federal Committee on Statistical Methodology. (2005). Statistical Policy Working Paper 22: Report on Statistical Disclosure Limitation Methodology. U.S. Office of Management and Budget.
Fienberg, S. E. (1970). An iterative procedure for estimation in contingency tables. Annals of Mathematical Statistics, 41(3), 907-917.
File, T. (2020). Voting in America: A look at the 2016 presidential election. U.S. Census Bureau.
Groves, R. M. (2006). Nonresponse rates and nonresponse bias in household surveys. Public Opinion Quarterly, 70(5), 646-675.
Ireland, C. T., & Kullback, S. (1968). Contingency tables with given marginals. Biometrika, 55(1), 179-188.
Izrael, D., Hoaglin, D. C., & Battaglia, M. P. (2000). A SAS macro for balancing a weighted sample. Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, 25, 1350-1355.
Kolenikov, S. (2016). Post-stratification or non-response adjustment? Survey Practice, 9(3), 1-12.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79-86.
Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (3rd ed.). John Wiley & Sons.
Little, R. J., & Wu, M. M. (1991). Models for contingency tables with known margins when target and sampled populations differ. Journal of the American Statistical Association, 86(413), 87-95.
Lohr, S. L. (2021). Sampling: Design and analysis (3rd ed.). Chapman and Hall/CRC.
Lovelace, R., & Ballas, D. (2013). 'Truncate, replicate, sample': A method for creating integer weights for spatial microsimulation. Computers, Environment and Urban Systems, 41, 1-11.
McDonald, M. P. (2021). Voter turnout. United States Elections Project. http://www.electproject.org/
Müller, K., & Axhausen, K. W. (2011). Hierarchical IPF: Generating a synthetic population for Switzerland. ERSA Conference Papers, European Regional Science Association.
National Academies of Sciences, Engineering, and Medicine. (2017). Reproducibility and replicability in science. National Academies Press.
National Research Council. (2013). Nonresponse in social science surveys: A research agenda. National Academies Press.
Office of Management and Budget. (1997). Revisions to the standards for the classification of federal data on race and ethnicity. Federal Register, 62(210), 58782-58790.
Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226-1227.
Perrin, A., & Atske, S. (2021). About three-in-ten U.S. adults say they are 'almost constantly' online. Pew Research Center.
Pew Research Center. (2014). Religious landscape study. https://www.pewresearch.org/religion/religious-landscape-study/
Pew Research Center. (2021). Faith among Black Americans. https://www.pewresearch.org/religion/2021/02/16/faith-among-black-americans/
Piketty, T., & Saez, E. (2003). Income inequality in the United States, 1913–1998. Quarterly Journal of Economics, 118(1), 1-41.
Pritchard, D. R., & Miller, E. J. (2012). Advances in population synthesis: Fitting many attributes per agent and fitting to household and person margins simultaneously. Transportation, 39(3), 685-704.
Ratcliffe, M., Burd, C., Holder, K., & Fields, A. (2016). Defining rural at the U.S. Census Bureau. U.S. Census Bureau, American Community Survey and Geography Brief ACSGEO-1.
Ruggles, S., Flood, S., Sobek, M., Backman, D., Chen, A., Cooper, G., Richards, S., Rodgers, R., & Schouweiler, M. (2024). IPUMS USA: Version 15.0 [dataset]. IPUMS. https://doi.org/10.18128/D010.V15.0
Ryan, C. L., & Bauman, K. (2016). Educational attainment in the United States: 2015. U.S. Census Bureau, Current Population Reports, P20-578.
Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli, D., Saisana, M., & Tarantola, S. (2008). Global sensitivity analysis: The primer. John Wiley & Sons.
Stodden, V., Leisch, F., & Peng, R. D. (Eds.). (2014). Implementing reproducible research. CRC Press.
Templ, M., Meindl, B., Kowarik, A., & Dupriez, O. (2017). Simulation of synthetic complex data: The R package simPop. Journal of Statistical Software, 79(10), 1-38.
U.S. Census Bureau. (2020). Public Use Microdata Areas (PUMAs). https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html
U.S. Census Bureau. (2021). Disability characteristics. https://www.census.gov/topics/health/disability/guidance/data-collection-acs.html
U.S. Census Bureau. (2022). Understanding and using American Community Survey data: What all data users need to know. U.S. Government Publishing Office.
U.S. Census Bureau. (2023). 2023 ACS PUMS data dictionary. https://www.census.gov/programs-surveys/acs/microdata/documentation.html
Valliant, R., Dever, J. A., & Kreuter, F. (2013). Practical tools for designing and weighting survey samples. Springer.
13. Appendix A: Data Sources Registry
This registry provides persistent identifiers, versions, and access information for all data sources utilized in the synthetic persona panel construction.
ACS PUMS Person File | Version/Vintage: 2023 1-Year | URL: https://www.census.gov/programs-surveys/acs/microdata/access.html | Access Date: Updated annually | Notes: Primary person-level microdata
ACS PUMS Household File | Version/Vintage: 2023 1-Year | URL: https://www.census.gov/programs-surveys/acs/microdata/access.html | Access Date: Updated annually | Notes: Primary household-level microdata
ACS PUMS Data Dictionary | Version/Vintage: 2023 | URL: https://www.census.gov/programs-surveys/acs/microdata/documentation.html | Access Date: Updated annually | Notes: Variable definitions and codes
IPUMS USA | Version/Vintage: Version 15.0 (2024) | URL: https://usa.ipums.org/usa/ | Access Date: Ongoing updates | Notes: Harmonized census/ACS data with enhanced documentation
IPUMS Large Place–PUMA Crosswalk | Version/Vintage: 2020 Geography | URL: https://usa.ipums.org/usa/volii/pumas10.shtml | Access Date: 2020 Census | Notes: Population allocation between PUMAs and places ≥75,000
Census Bureau PUMA Documentation | Version/Vintage: 2020 | URL: https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html | Access Date: Static (decennial) | Notes: Geographic unit definitions
BLS Standard Occupational Classification (SOC) | Version/Vintage: 2018 (current) | URL: https://www.bls.gov/soc/ | Access Date: Periodic updates | Notes: Occupation coding system
Census Bureau NAICS | Version/Vintage: 2022 | URL: https://www.census.gov/naics/ | Access Date: Periodic updates | Notes: Industry classification system
BLS Occupational Employment & Wage Statistics | Version/Vintage: Annual (state-level) | URL: https://www.bls.gov/oes/ | Access Date: Annual updates | Notes: Optional occupation distribution benchmarks
IRS Statistics of Income | Version/Vintage: Annual | URL: https://www.irs.gov/statistics | Access Date: Annual updates | Notes: Optional income tail calibration
OMB Statistical Policy Directive 15 | Version/Vintage: 1997 (current) | URL: https://www.federalregister.gov/d/97-28653 | Access Date: Static | Notes: Federal standards for race/ethnicity data
Pew Religious Landscape Study | Version/Vintage: 2014, 2021 updates | URL: https://www.pewresearch.org/religion/religious-landscape-study/ | Access Date: Periodic updates | Notes: Religious affiliation benchmarks (optional)
CPS Voting and Registration Supplement | Version/Vintage: Biennial (election years) | URL: https://www.census.gov/topics/public-sector/voting.html | Access Date: Biennial | Notes: Voter registration benchmarks (optional)
Census Bureau FIPS State Codes | Version/Vintage: Current | URL: https://www.census.gov/library/reference/code-lists/ansi.html | Access Date: Static | Notes: State identifier standards
Version Control Note: The pipeline is designed to accommodate multiple ACS vintages. Configuration files specify the exact vintage used for each panel release. All URLs point to pages maintained by the authoritative source agencies.
Data Access: ACS PUMS files are publicly available from the Census Bureau via FTP download or API access. IPUMS provides enhanced microdata with additional documentation and harmonization across years. All data sources listed are publicly accessible without licensing restrictions, though data users must comply with Census Bureau terms of use regarding confidentiality and appropriate use.
14. Appendix B: Variable Dictionary and Transformation Protocols
The following enumeration documents the mapping from ACS PUMS source variables to synthetic panel attributes. Variable names and encodings may vary across ACS vintages; the ingestion layer is explicitly version-aware and handles vintage-specific naming conventions.
Age (Demographic Core) Source: AGEP Transformation: Direct numeric value preserved; categorical bins constructed for calibration controls. Range: 18 to 95+ (top-coded at 95 in many vintages)
Sex (Demographic Core) Source: SEX Transformation: Direct binary classification. Categories: Male, Female
Race and Ethnicity (Demographic Core) Source: HISP, RAC1P Transformation: Combined classification following Census Bureau guidelines, yielding categories analogous to OMB Statistical Policy Directive 15. Categories: White alone non-Hispanic, Black or African American alone non-Hispanic, American Indian and Alaska Native alone non-Hispanic, Asian alone non-Hispanic, Native Hawaiian and Other Pacific Islander alone non-Hispanic, Some other race alone non-Hispanic, Two or more races non-Hispanic, Hispanic or Latino (any race) Fallback: When detailed race coding is ambiguous, conservative fallback to "Some other race" category
Educational Attainment (Human Capital) Source: SCHL Transformation: Consolidation from detailed credentials to five-level classification. Categories: Less than high school diploma, High school diploma or GED, Some college or Associate's degree, Bachelor's degree, Graduate or professional degree Note: Also exposed as education_attainment in some output schemas
Employment Status (Labor Force) Source: ESR (Employment Status Recode) Transformation: Mapping to analytically tractable categories. Categories: Employed full-time, Employed part-time, Unemployed, Not in labor force, Military (when applicable)
Occupation – Detailed (Labor Force) Source: OCCP Transformation: Detailed occupation labels preserved using ACS coding scheme aligned with Standard Occupational Classification (SOC). Format: Numeric SOC-style code with textual descriptor Reference: https://www.bls.gov/soc/
Occupation – Major Group (Labor Force) Source: Derived from OCCP Transformation: Major occupational group via keyword-based mapping or SOC crosswalk (configuration-dependent). Categories: Management, Business and Financial Operations, Computer and Mathematical, Architecture and Engineering, Life/Physical/Social Science, Community and Social Service, Legal, Education/Training/Library, Arts/Design/Entertainment/Sports/Media, Healthcare Practitioners and Technical, Healthcare Support, Protective Service, Food Preparation and Serving, Building/Grounds Cleaning/Maintenance, Personal Care and Service, Sales and Related, Office and Administrative Support, Farming/Fishing/Forestry, Construction and Extraction, Installation/Maintenance/Repair, Production, Transportation and Material Moving, Military Specific
Industry – Detailed (Labor Force) Source: INDP Transformation: Detailed industry labels preserved using ACS coding scheme aligned with North American Industry Classification System (NAICS). Format: Numeric NAICS-style code with textual descriptor Reference: https://www.census.gov/naics/
Industry – Sector (Labor Force) Source: Derived from INDP Transformation: Sector-level classification via NAICS mapping. Categories: Agriculture/Forestry/Fishing/Hunting, Mining/Quarrying/Oil and Gas Extraction, Utilities, Construction, Manufacturing, Wholesale Trade, Retail Trade, Transportation and Warehousing, Information, Finance and Insurance, Real Estate/Rental/Leasing, Professional/Scientific/Technical Services, Management of Companies and Enterprises, Administrative/Support/Waste Management Services, Educational Services, Health Care and Social Assistance, Arts/Entertainment/Recreation, Accommodation and Food Services, Other Services, Public Administration, Military
Personal Income (Economic Resources) Source: PINCP, ADJINC Transformation: Inflation adjustment to constant dollars using ADJINC factor; rounding to whole dollars. Format: Continuous numeric (dollars)
Household Income – Continuous (Economic Resources) Source: HINCP, ADJINC Transformation: Inflation adjustment to constant dollars; preserved as continuous. Format: Continuous numeric (dollars)
Household Income – Bracket (Economic Resources) Source: HINCP, ADJINC Transformation: Inflation adjustment followed by categorical binning. Categories: $0, $1–$9,999, $10,000–$24,999, $25,000–$49,999, $50,000–$74,999, $75,000–$99,999, $100,000–$149,999, $150,000–$199,999, $200,000–$299,999, $300,000–$499,999, $500,000–$999,999, $1,000,000 or more Optional: Top-tail refinement using IRS SOI data or Pareto parameterization when enabled
Household Size (Household Structure) Source: NP Transformation: Direct count with lower bound of 1. Format: Integer count
Marital Status (Household Structure) Source: MAR Transformation: Direct categorical mapping. Categories: Married, Widowed, Divorced, Separated, Never married
Children in Household (Household Structure) Source: NOC (preferred); fallback via age aggregation within SERIALNO Transformation: Count of own children under 18; fallback procedure counts persons <18 in household when NOC unavailable. Format: Integer count (0, 1, 2, 3, 4+)
Housing Tenure (Household Economics) Source: TEN Transformation: Direct categorical mapping. Categories: Owned with mortgage or loan, Owned free and clear, Rented, Occupied without payment of rent
Language Spoken at Home (Cultural) Source: LANP, ENG Transformation: Primary language identification with English proficiency integration. Categories: English, Spanish, Other Indo-European, Asian and Pacific Islander languages, Other languages Fallback: When missing, conservative imputation consistent with Hispanic ethnicity; otherwise "English" default
English-Speaking Ability (Cultural) Source: ENG Transformation: Direct categorical mapping for non-English speakers. Categories: Very well, Well, Not well, Not at all, [Not applicable – English speaker]
Veteran Status (Military Service) Source: MIL (vintages 2023+); VETSTAT (earlier vintages) Transformation: Direct categorical mapping. Categories: Currently active duty, Veteran, Never served
Disability Status (Health) Source: DEAR, DEYE, DREM, DPHY, DOUT, DDRS Transformation: Binary aggregation – coded "Has disability" if any functional difficulty indicator is affirmative. Categories: Has disability, No disability Reference: https://www.census.gov/topics/health/disability/guidance/data-collection-acs.html
Health Insurance Coverage (Health) Source: HICOV, HINS1, HINS2, HINS3, HINS4, HINS5, HINS6, HINS7 Transformation: Integrated classification from coverage indicator and source types. Categories: Private only, Public only, Both private and public, Indian Health Service only, Uninsured, Has coverage (detail suppressed) Validation: Ensures consistency between detailed categories and HICOV binary indicator
Internet Access Type (Technology) Source: ACCESSINET, HISPEED, SATELLITE, BROADBND, DIALUP, OTHSVCEX Transformation: Hierarchical classification prioritizing high-bandwidth services. Categories: Broadband (cable/fiber/DSL), Satellite, Cellular data plan only, Internet access present (type unspecified), No internet access
Journey to Work – Mode (Transportation) Source: JWTRNS Transformation: Direct categorical mapping for employed persons. Categories: Car/truck/van – drove alone, Car/truck/van – carpooled, Public transportation, Walked, Bicycle, Motorcycle, Taxicab, Other means, Worked from home, Not working [non-employed]
Nativity and Citizenship (Immigration) Source: NATIVITY, CIT Transformation: Combined classification capturing immigration status. Categories: Native-born U.S. citizen, Foreign-born naturalized citizen, Foreign-born non-citizen
State (Geography) Source: ST or STATE (vintage-dependent) Transformation: Federal Information Processing Standards (FIPS) code and two-letter postal abbreviation. Format: Numeric FIPS code (01–56) and character code (AL–WY) Reference: https://www.census.gov/library/reference/code-lists/ansi.html
Public Use Microdata Area (Geography) Source: PUMA Transformation: Direct five-digit numeric code, nested within state. Format: Five-digit integer identifier Reference: https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html
City (Derived Geography) Source: Derived via IPUMS Large Place–PUMA crosswalk Transformation: Assigned when incorporated place ≥75,000 population accounts for ≥50% (default threshold) of PUMA population. Categories: Named incorporated place or "Unknown" Caveat: Represents geographic approximation, not precise residential address Reference: https://usa.ipums.org/usa/volii/pumas10.shtml
Urbanicity (Derived Geography) Source: Derived via IPUMS Large Place–PUMA crosswalk population shares Transformation: Classified based on dominant place population share. Categories: Urban (≥50%), Suburban (10–49%), Rural (<10%)
Religious Affiliation (Optional Modeled Attribute) Source: Not present in ACS; modeled via bounded probability adjustments from national/state priors Categories: Evangelical Protestant, Mainline Protestant, Historically Black Protestant, Catholic, Orthodox Christian, Mormon, Jehovah's Witness, Other Christian, Jewish, Muslim, Buddhist, Hindu, Other World Religions, Other Faiths, Unaffiliated (Atheist), Unaffiliated (Agnostic), Unaffiliated (Nothing in particular) Documentation: Modeled attribute flag indicates non-survey origin; priors and adjustment procedures documented in configuration Reference: https://www.pewresearch.org/religion/religious-landscape-study/
Voter Registration Status (Optional Modeled Attribute) Source: Not present in ACS; modeled via age- and education-dependent probabilities calibrated to CPS benchmarks Categories: Not eligible (age <18 or non-citizen), Registered, Not registered Documentation: Modeled attribute flag indicates non-survey origin; probability model parameters documented in configuration Reference: https://www.census.gov/topics/public-sector/voting.html
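A few of the transformations listed above can be sketched in code. The sketch assumes that ADJINC is stored as an integer with six implied decimal places and that the disability indicators are coded 1 = yes and 2 = no, as in recent PUMS releases. Treating zero or negative adjusted income as the "$0" bracket is an assumption the dictionary does not state, and the function names and the example place name are hypothetical.

```python
from bisect import bisect_right

# Lower edges of every bracket above "$0", in dollars.
LOWER_EDGES = [1, 10_000, 25_000, 50_000, 75_000, 100_000, 150_000,
               200_000, 300_000, 500_000, 1_000_000]
LABELS = ["$0", "$1–$9,999", "$10,000–$24,999", "$25,000–$49,999",
          "$50,000–$74,999", "$75,000–$99,999", "$100,000–$149,999",
          "$150,000–$199,999", "$200,000–$299,999", "$300,000–$499,999",
          "$500,000–$999,999", "$1,000,000 or more"]

DISABILITY_VARS = ("DEAR", "DEYE", "DREM", "DPHY", "DOUT", "DDRS")


def household_income_bracket(hincp, adjinc):
    """Apply the ADJINC factor, round to whole dollars, and bin."""
    dollars = round(hincp * adjinc / 1_000_000)
    # Assumption: zero and negative incomes fall into the "$0" bracket.
    return LABELS[bisect_right(LOWER_EDGES, dollars)]


def disability_status(record):
    """Binary aggregation: any affirmative functional difficulty indicator."""
    if any(record.get(v) == 1 for v in DISABILITY_VARS):
        return "Has disability"
    return "No disability"


def city_and_urbanicity(place_name, place_share, threshold=0.50):
    """Assign city and urbanicity from the dominant place's share of PUMA population."""
    city = place_name if place_share >= threshold else "Unknown"
    if place_share >= 0.50:
        urbanicity = "Urban"
    elif place_share >= 0.10:
        urbanicity = "Suburban"
    else:
        urbanicity = "Rural"
    return city, urbanicity


# Hypothetical inputs: an ADJINC of 1019518 represents a factor of 1.019518.
example_bracket = household_income_bracket(49_500, 1_019_518)
example_geo = city_and_urbanicity("Springfield", 0.62)
```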
Disclaimer: The synthetic personas generated through this methodology do not correspond to identifiable individuals and are constructed exclusively from public-use data sources that incorporate confidentiality protections. While the methodology prioritizes statistical fidelity, reproducibility, and transparency, all microdata are subject to sampling variance, nonresponse bias, and measurement error inherent in survey data collection. Modeled attributes not present in the source survey are governed by explicitly stated prior distributions and should be interpreted accordingly. Users bear responsibility for appropriate application of the synthetic panel within the documented limitations and for independent validation against external benchmarks when required for specific use cases.