May 4, 2011

Imputation: Adding People to the Census

When census-takers can’t reach anyone at a particular address or obtain information about occupants in other ways, they sometimes use a last-resort statistical technique called “imputation” to fill in missing data. One marker of the quality of a census is how much it relies on imputation to add people to the count.

In the most extreme cases, census-takers have only an address taken from a master list drawn up in cooperation with local officials. They may not even know that a housing unit exists at that address, much less who lives there. If the address is indeed found to be an apparent dwelling place, the census-taker may not be able to get anyone to come to the door, and neither neighbors nor building managers may be willing and able to supply information. Yet the Census Bureau’s orders are to count everyone living in the U.S. on April 1, Census Day.

So to meet the goal of having a complete and accurate census, the Census Bureau imputes the existence and number of people living at the address in question, a procedure known as “count imputation.” (The other kind of imputation, called “characteristic imputation,” is when the Census Bureau has a head count for an address but is missing race, age or other personal information.) The number of imputed people tends to be higher among hard-to-count groups such as ethnic and racial minorities.

In 2010, according to figures supplied by the Census Bureau, 1,163,462 people were added to the household population (the total excluding group quarters) via count imputation, or .39% (less than one half of one percent) of the total. This served to boost the household population to slightly more than 300 million; pre-imputation, it stood at 299,594,753.

By comparison, in the 2000 Census, 1,172,144 people were added to the household population via count imputation, or .43% of the total.
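
The 2010 share can be checked from the two figures quoted above. The short calculation below is only an arithmetic illustration; the variable names are chosen here, and the 2000 share is not recomputed because the article does not give the 2000 household total.

```python
# Figures quoted in the article (supplied by the Census Bureau).
imputed_2010 = 1_163_462            # people added via count imputation
pre_imputation_2010 = 299_594_753   # household population before imputation

# Household population after imputation is the sum of the two.
household_pop_2010 = pre_imputation_2010 + imputed_2010

# Share of the final household population that was imputed, in percent.
share_2010 = imputed_2010 / household_pop_2010 * 100

print(f"{household_pop_2010:,}")   # 300,758,215
print(f"{share_2010:.2f}%")        # 0.39%
```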

Measured by addresses rather than people, a slightly higher share in 2010 had some usable information available, so a slightly lower share required count imputation (.38% compared with .55% in 2000). Count imputation was also performed on fewer addresses in 2010 than in 2000: 521,947 compared with 666,848, according to Census Bureau figures.

Types of Count Imputation

There are three kinds of count imputation. “Status imputation,” the most extreme form, is when census-takers do not even know whether a particular address is a real livable residence (maybe it’s a business or in such disrepair that no one could live there), and, if so, whether the unit is occupied or vacant. “Occupancy imputation” is when the Census Bureau knows that an address is a real housing unit, but not whether anyone actually lives there. “Household-size imputation” is when an address is known to be a real, occupied home, but census-takers don’t know how many people live there.
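
The three categories amount to a short decision sequence: what is the first thing the bureau does not know about an address? The sketch below is only an illustration of that logic. The function name, its parameters, and the use of None to mean "unknown" are assumptions made here, not Census Bureau terminology or code.

```python
from typing import Optional


def count_imputation_type(
    is_housing_unit: Optional[bool],
    is_occupied: Optional[bool],
    household_size: Optional[int],
) -> Optional[str]:
    """Return the kind of count imputation an address would need, or None.

    None as an argument means "unknown"; None as a result means no count
    imputation is needed (hypothetical convention for this sketch).
    """
    if is_housing_unit is None:
        return "status"          # not even known to be a livable residence
    if not is_housing_unit:
        return None              # e.g. a business: nobody to count
    if is_occupied is None:
        return "occupancy"       # real housing unit, occupancy unknown
    if not is_occupied:
        return None              # vacant unit: count is zero
    if household_size is None:
        return "household-size"  # occupied, but number of residents unknown
    return None                  # everything needed for the count is known


print(count_imputation_type(None, None, None))  # status
print(count_imputation_type(True, None, None))  # occupancy
print(count_imputation_type(True, True, None))  # household-size
```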

The Census Bureau has released data comparing rates for the three different types of count imputation in 2010 and 2000, using home addresses as the denominator.

Household-size imputation, the largest of the three categories in 2010, accounted for .24% of addresses, followed by status imputation (.12%) and occupancy imputation (.03%). In 2000, status imputation was the largest category (.23% of addresses), with occupancy and household-size imputation accounting for .16% each.

In 2010, of the 521,947 addresses without usable information, about 325,000 required household-size imputation, 38,000 required occupancy imputation and 159,000 required status imputation, according to figures supplied by the Census Bureau.

State Patterns

There was less variation among states in 2010 than in 2000 in the rate of count imputation, according to the Census Bureau. “We have lower variability in the data,” Census Bureau Director Robert M. Groves told a recent press briefing. “We like that result.”

In all but seven states and the District of Columbia, count imputation rates declined from 2000 to 2010, according to figures supplied by the Census Bureau; the states where rates did not decline were Arkansas, Colorado, Georgia, Louisiana, Mississippi, New Jersey and North Carolina. The highest count imputation rate in 2010 (District of Columbia, at .93% of addresses) was notably lower than the highest rate in 2000 (Arizona, 1.37%).

The state that had the highest number (as opposed to the highest proportion) of people added to its total via count imputation in 2010, according to Census Bureau figures, was Texas (143,813), followed by Florida (100,575), New York (92,600), Georgia (82,026) and then California (68,204). As a share of all count imputations, status imputation, the most extreme kind, made up the majority in a dozen states, generally smaller ones.

Imputation Technique and History

In carrying out imputation, the bureau applies what it knows about the size and type of neighboring households to fill in the number of people, or their characteristics, at the addresses with missing data. Imputation procedures have grown more sophisticated over the decades.

In the 1940s, imputation was based on a random-ordered set of values on a set of punched cards—a “cold-deck imputation” as it came to be known. Now, the bureau uses a “hot-deck imputation” technique that employs continually updated census data from similar people or households generally within the same census tract as the basis for assigning a value to a missing record.
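
The hot-deck idea can be illustrated with a deliberately simplified sketch. It assumes a sequential hot deck in which a missing household size is copied from the most recently processed household in the same census tract that reported one. The record layout (`tract`, `size`), the function name, and the single-donor rule are assumptions made here for illustration; the bureau's actual procedure matches on more characteristics and is not described in detail in this article.

```python
def hot_deck_impute(records: list[dict]) -> list[dict]:
    """Fill missing 'size' values from the latest donor in the same tract.

    Each record is a dict with 'tract' (str) and 'size' (int or None).
    Returns new records with an added 'imputed' flag; inputs are not modified.
    """
    last_donor: dict[str, int] = {}  # tract -> most recently reported size
    result = []
    for rec in records:
        filled = dict(rec)
        filled["imputed"] = False
        if rec["size"] is None:
            # Borrow from the current donor for this tract, if one exists yet.
            if rec["tract"] in last_donor:
                filled["size"] = last_donor[rec["tract"]]
                filled["imputed"] = True
        else:
            # The "deck" stays hot: every reported value replaces the donor.
            last_donor[rec["tract"]] = rec["size"]
        result.append(filled)
    return result


records = [
    {"tract": "A", "size": 3},
    {"tract": "A", "size": None},  # takes 3 from the household above
    {"tract": "B", "size": None},  # no donor in tract B yet: stays missing
    {"tract": "A", "size": 2},
    {"tract": "A", "size": None},  # takes 2, the updated donor value
]
sizes = [r["size"] for r in hot_deck_impute(records)]
print(sizes)  # [3, 3, None, 2, 2]
```

A cold deck, by contrast, would draw the fill-in value from a fixed set prepared in advance rather than from records in the census being processed.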

The Census Bureau’s use of imputation to add people to the total count dipped in the 1990 Census, in part because the agency put on a large and expensive operation to reach non-responding households. Only 54,000 people were added to the census count in 1990, compared with 1.2 million in 2000. In 2000, the agency relied on imputation in part as a way to restrain the rising cost of contacting non-responding households.

A generally positive evaluation of the 2000 Census by the National Research Council described these 1.2 million imputations as “a problematic group with regard to accuracy,” but acknowledged that if they had not been included, the census “would undoubtedly have underestimated the true number of household residents (particularly when a unit was known to be occupied).”

The National Research Council evaluation noted that the number of imputed people was low compared with the total size of the 2000 Census count, but that the share of imputed people was higher among some hard-to-count groups such as renters and ethnic and racial minorities. The panel also noted that imputations at addresses that were not known even to be housing units tended to cluster in rural areas, such as the Adirondacks region in New York and parts of Arizona and New Mexico. Some of these addresses may have been fishing camps or other temporary recreational lodging, it noted.

Imputation and Statistical Sampling

Although imputation is a statistical technique, it differs from the politically controversial statistical sampling procedure that the bureau has considered using in the past. For 2000, the Census Bureau proposed to use statistical sampling as part of its non-response follow-up operation to estimate the size and characteristics of the population that didn’t return census forms. That plan was abandoned after the Supreme Court ruled that statistical sampling cannot be employed to produce census data used to apportion congressional seats among the states.

The court ruling did not bar the use of statistical sampling to adjust census numbers for the purpose of producing annual population estimates and distributing federal funds. But the Secretary of Commerce, who oversees the bureau, chose not to do so, on the recommendation of a study conducted by a committee of bureau officials.

The Census Bureau’s use of imputation was challenged in court after the 2000 Census by the state of Utah, which had failed to gain a congressional seat that instead went to North Carolina. Utah argued that it, not North Carolina, would have gained the seat had it not been for imputations, in particular imputations of people in housing units for which household size was not known. Those imputations, Utah argued, amounted to illegal use of statistical sampling. In 2002, the Supreme Court rejected Utah’s lawsuit (Utah et al. v. Evans).

The Supreme Court ruled that sampling and imputation differed in three key respects—the “nature of the enterprise,” “methodology,” and “immediate objective.” It described imputation as “inference,” not statistical sampling. The court also stated that imputation does not violate the Constitution’s requirement for an “actual enumeration,” which some have contended prohibits any methods other than a door-to-door count. Without imputation, the court stated, the result would be “a far less accurate assessment of the population.”