Intro to Place Data

This page gives an overview of the model-ready data and features that Iggy provides. This is meant to accompany the Iggy Data Dictionary.

How Iggy thinks about location features

At Iggy, we think about location-related features in terms of boundaries, data sources, and aggregations. These three components form the core of our data model. Put most simply, each Iggy feature is the result of an aggregation applied to an underlying data source within a boundary.

Boundaries

Many data sets have location fields that link a row of data to a real place on Earth. Depending on the particular location field, that may be a relatively general place (e.g. a metro area or county) or a very specific place (e.g. a quadkey or address). Traditionally, part of the challenge in dealing with location data is converting from specific to general places. For example, a dataset may have an address field while the available economic data is reported only at the county level. To add features from the economic dataset, each address must first be linked to its county.

We use the term boundary to describe the geographic area over which some data is aggregated. Iggy pre-aggregates features to boundary levels ranging from general (metro area) to specific (quadkey) so that users can pull data at exactly the level they need. For example, if your data set includes a zip code field, Iggy provides features that have been pre-aggregated at the zip code level, such as the count of restaurants per capita within each zip code.
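As a rough illustration of pulling pre-aggregated features at the level your data already has, the sketch below left-joins zip-code-level features onto rows that carry a zip code field. The feature values, the customer rows, and the `enrich` helper are all hypothetical and exist only for this example; they are not part of any Iggy API.

```python
# Hypothetical pre-aggregated features keyed by the zipcode boundary identifier.
# The feature name follows the naming pattern used on this page; values are invented.
zip_features = {
    "94110": {"poi_is_restaurant_count_per_capita": 0.0042},
    "10001": {"poi_is_restaurant_count_per_capita": 0.0091},
}

# Hypothetical user rows that already include a zip code field.
customers = [
    {"customer_id": 1, "zipcode": "94110"},
    {"customer_id": 2, "zipcode": "10001"},
    {"customer_id": 3, "zipcode": "99999"},  # no features available for this zip
]


def enrich(rows, features_by_boundary, key="zipcode"):
    """Left-join boundary-level features onto rows via the boundary identifier."""
    enriched = []
    for row in rows:
        # Rows whose boundary has no features are kept unchanged.
        features = features_by_boundary.get(row[key], {})
        enriched.append({**row, **features})
    return enriched


enriched_customers = enrich(customers, zip_features)
```

Because the features are already aggregated per zip code, no geometry handling is needed at join time; the boundary identifier acts as an ordinary join key.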

Currently Iggy provides features pertaining to the following boundaries, from general to specific:

  • metro – Census Core Based Statistical Area, identified by CBSA FIPS
  • county – County, identified by 5-digit FIPS
  • locality – City, identified by ID from the Who's on First gazetteer
  • zipcode – Zip Code, identified by 5-digit zip code
  • census_tract – Census Tract, identified by 11-digit census tract GEOID
  • cbg – Census Block Group, identified by 12-digit census block group GEOID
  • qk_isochrone_walk_10m – 10-min Walk Isochrone, identified by zoom-19 quadkey identifier

The most fine-grained boundary type we currently offer is the 10-min walk isochrone, which is the boundary that encompasses the area reachable within a 10-minute walk of a zoom 19 quadkey (a map tile with side length ~75m). By providing features aggregated at this fine-grained level, users with addresses or geographic coordinates can add hyper-local features to their models.
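To look up features at this level, a coordinate first has to be mapped to its zoom-19 quadkey. The sketch below assumes the quadkeys follow the widely used Bing Maps tile system (Web Mercator tiles, one base-4 digit per zoom level); that assumption, along with the function names, is ours and is not confirmed by this page.

```python
import math

# Equatorial circumference of the WGS84 ellipsoid, in meters.
EARTH_CIRCUMFERENCE_M = 2 * math.pi * 6378137
MAX_MERCATOR_LAT = 85.05112878  # latitude limit of the Web Mercator projection


def tile_to_quadkey(tile_x, tile_y, zoom):
    """Interleave the bits of tile X/Y into a base-4 quadkey string."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def latlon_to_quadkey(lat, lon, zoom=19):
    """Map a WGS84 coordinate to the quadkey of the tile that contains it."""
    lat = max(min(lat, MAX_MERCATOR_LAT), -MAX_MERCATOR_LAT)
    n = 1 << zoom  # number of tiles along each axis at this zoom
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    tile_x = min(max(int(x * n), 0), n - 1)
    tile_y = min(max(int(y * n), 0), n - 1)
    return tile_to_quadkey(tile_x, tile_y, zoom)


def tile_side_m(zoom, lat=0.0):
    """Approximate east-west tile width in meters at a given latitude."""
    return EARTH_CIRCUMFERENCE_M * math.cos(math.radians(lat)) / (1 << zoom)


# Example coordinate (San Francisco); yields a 19-character quadkey.
quadkey = latlon_to_quadkey(37.7749, -122.4194)
```

Under this scheme a zoom-19 tile is roughly 76 m wide at the equator and narrower at higher latitudes, which is consistent with the ~75m figure above.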

Data Sources

A data source describes the underlying geographic data that is aggregated within a boundary. Each data source has rows that represent points, lines, or polygons with geographic coordinates.

Many different types of data can be construed as geographic, such as local businesses, demographics, and topography. Our demo dataset incorporates features computed from the following data sources:

Points of Interest (poi)

  • Points of interest are businesses and services with a physical presence including restaurants, manufacturing sites, and community centers.
  • Our poi features are aggregated from an underlying dataset of points, each representing a distinct point of interest and categorized based on the Iggy Feature Catalog.

American Community Survey (acs)

  • The U.S. Census ACS data includes information about demographics, household composition, employment, commute patterns, and housing. Iggy currently relies on ACS data collected over the 5-year period 2014-2019. The primary advantage of using multi-year estimates is the increased statistical reliability for less populated areas and small population subgroups.
  • Only census-designated boundaries (county, census_tract, and cbg) incorporate features from acs, as these are the levels at which ACS data is reported and provided.

Water (water)

  • Iggy produces features that summarize the coastline, rivers, and lakes within a boundary.
  • Our water features are aggregated from an underlying dataset that represents coastline as lines, and rivers and lakes as polygons.

Parks (park)

  • We also provide features calculated based on national, state, and local parks within a boundary.
  • Our underlying park data represents each park as a polygon.

Data Attributes

Each data source also has one or more attributes describing each row that can be used to filter aggregations and derive more interesting features:

poi

poi data attributes indicate the POI category, and whether it is a brand/chain:

  • Ontology Top-level Category Attributes (see Iggy Feature Catalog)
    • is_{top_level_category}
  • Ontology Sub-level Category Attributes (see Iggy Feature Catalog)
    • is_{sub_level_category}
  • Chain
    • is_brandname indicates whether POI is a brand or chain (e.g. McDonald’s, Dollar Store, Pep Boys)

acs

acs data attributes indicate a particular Census summary statistic about the relevant boundary (county, census_tract, or cbg). They cover a variety of types of information:

Demographics

Includes attributes related to age (e.g. median_age), gender (e.g. pop_sex_male, pop_sex_female_age_5_to_9), race/ethnicity (e.g. pop_race_asian), and birthplace/citizenship (e.g. pop_citizenship_us_naturalized).

Social

Includes attributes covering household composition (e.g. households_female_head_with_children, households_cohabiting_couple), education (e.g. pop_adult_education_less_than_high_school), and veteran status (e.g. pop_veterans).

Economic

Includes attributes indicating income (e.g. households_with_annual_income_200000_or_more, pop_below_100_pct_poverty_level), employment status (e.g. pct_in_labor_force_status_civilian_employed), and employment industry (e.g. pop_works_industry_manufacturing).

Commute

Includes attributes describing (pre-2020) commute habits: method (e.g. pop_commutes_by_public_transport_rail), departure time (e.g. pop_commute_departure_0630_to_0659), and duration (e.g. pop_commute_travel_time_20_to_24_min).

Housing

Includes attributes dealing with housing unit type (e.g. housing_units_boat_rv_van), age (e.g. housing_units_built_1939_or_earlier), ownership status (e.g. housing_units_renter_occupied), size (e.g. housing_units_10_to_19_in_structure), and value (e.g. housing_units_value_150000_to_199999).

water

water data attributes indicate the type of water body.

  • Type of water body
    • is_coastline
    • is_river
    • is_lake

park

Our parks data includes terrestrial and marine protected areas inventoried by the United States Geological Survey in the Protected Areas Database of the United States (PAD-US). Areas in PAD-US are dedicated to preserving biological diversity and to other natural, recreational, and cultural uses.

Our breakdown of PAD-US covers ownership and management (for instance, federal, state, local, and private land), intended use (recreation vs. agriculture or ranching), and other access-related attributes.

  • Park-related attributes:
    • is_federal_land
    • is_state_land
    • is_local_land
    • is_native_american_land
    • is_private_land
    • is_special_district_land
    • is_easement
    • is_historic_or_cultural_area
    • is_agricultural_or_ranching_area
    • is_conservation_area
    • is_open_or_limited_access_area
    • is_open_access_area
    • is_parks_and_recreation
    • is_protected_area

Note that a park may have a value of 1 for more than one attribute. For example, a state park might have is_conservation_area=1, is_open_access_area=1, and is_parks_and_recreation=1.

The full set of underlying data sources and attributes is detailed in the Iggy Data Dictionary.

Iggy Feature Catalog

The Iggy Feature Catalog is used to organize places in our poi data source. You can find more information, including definitions and examples, in the Iggy Feature Catalog reference page.

Aggregations and Normalizations

Given a boundary (like a zip code) and a data source (like POIs), Iggy produces features by running an aggregation over the data intersecting the boundary. Aggregations range from simple (e.g. counts of items intersecting a boundary) to more complex spatial functions (e.g. the area in square km of the intersection between a boundary and a data source like lakes).

In addition to aggregations, Iggy also provides features that have additional normalization calculated on top of the aggregation, like dividing by the boundary population or area.

The following is a list of the various aggregations and normalizations that are used to produce Iggy features.

Aggregations

  • [none] – Features with no aggregation are generated by taking the raw value from the boundary itself, or from a boundary-linked data source like acs.
  • count – Count of distinct rows from the underlying data source that intersect a boundary. If the count feature is associated with a data attribute, then the count indicates the number of distinct rows having that particular attribute. For example, the feature poi_is_education_count indicates the number of distinct rows from the poi dataset having the attribute is_education=True.
  • intersects – A boolean feature indicating whether the boundary intersects any row within the underlying data source.
  • intersecting_area_in_sqkm – A float feature indicating the total area (in sq km) of the intersection between a boundary and any row in the underlying polygon data source. This can only be computed for data sources whose rows are polygons, like park and water.
  • intersecting_length_in_km – A float feature indicating the total length (in km) of the intersection between a boundary and any row in the underlying line data source. This can only be computed for data sources whose rows are lines, like water where is_coastline=True.

Normalizations

  • per_sqkm – Divides the aggregated feature value by the boundary area, in sq km.
  • per_capita – Divides the aggregated feature value by the boundary population.
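The aggregations and normalizations listed above can be sketched in a few lines. This is a simplified illustration rather than Iggy's implementation: the rows, the boundary population and area, and the helper functions are hypothetical, and the spatial test that decides which rows intersect the boundary is assumed to have already been done.

```python
# Hypothetical POI rows already known to intersect a single boundary.
poi_rows_in_boundary = [
    {"id": "a", "is_education": True, "is_restaurant": False},
    {"id": "b", "is_education": False, "is_restaurant": True},
    {"id": "c", "is_education": True, "is_restaurant": False},
]

# Hypothetical boundary metadata used by the normalizations.
boundary = {"population": 2000, "area_sqkm": 4.0}


def count(rows, attribute=None):
    """Count distinct rows, optionally keeping only rows where the attribute is true."""
    if attribute is None:
        return len({row["id"] for row in rows})
    return len({row["id"] for row in rows if row.get(attribute)})


def intersects(rows, attribute=None):
    """True if at least one (optionally filtered) row intersects the boundary."""
    return count(rows, attribute) > 0


def per_capita(value, boundary):
    """Normalize an aggregated value by the boundary population."""
    return value / boundary["population"]


def per_sqkm(value, boundary):
    """Normalize an aggregated value by the boundary area in sq km."""
    return value / boundary["area_sqkm"]


education_count = count(poi_rows_in_boundary, "is_education")
features = {
    "poi_count": count(poi_rows_in_boundary),
    "poi_is_education_count": education_count,
    "poi_is_education_count_per_capita": per_capita(education_count, boundary),
    "poi_is_education_count_per_sqkm": per_sqkm(education_count, boundary),
    "poi_is_restaurant_intersects": intersects(poi_rows_in_boundary, "is_restaurant"),
}
```

Keeping each normalization as a separate step applied to an aggregated value mirrors how the two are layered in the feature definitions.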

Interpreting Features

The Data Dictionary provides a complete listing of the available Iggy features at each boundary level.

In general, features are named using the following convention:

{data_source}[_{data attribute}]_{aggregation}[_{normalization}]

For example, the feature poi_is_museum_count_per_capita is calculated for a particular boundary by taking the data source poi, filtering for rows where is_museum=True, applying the aggregation count within the boundary, and finally applying the per_capita normalization to divide the count by the boundary population.
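The naming convention can be expressed as a small helper that assembles a feature name from its components. The function `build_feature_name` is a hypothetical illustration, not something Iggy provides.

```python
def build_feature_name(data_source, aggregation, attribute=None, normalization=None):
    """Assemble {data_source}[_{attribute}]_{aggregation}[_{normalization}]."""
    parts = [data_source]
    if attribute:
        parts.append(attribute)
    parts.append(aggregation)
    if normalization:
        parts.append(normalization)
    return "_".join(parts)


museum_feature = build_feature_name(
    "poi", "count", attribute="is_museum", normalization="per_capita"
)
education_feature = build_feature_name("poi", "count", attribute="is_education")
```

Here `museum_feature` is `poi_is_museum_count_per_capita` and `education_feature` is `poi_is_education_count`, matching the names used on this page.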

Some feature names deviate slightly from this convention in order to make them more interpretable. For example, the feature lake_pct_area_intersecting_boundary is an easier way of expressing the feature generated from the water data source by filtering for rows where is_lake=True, applying the intersecting_area_in_sqkm aggregation, and then applying the per_sqkm normalization. The Data Dictionary is searchable by data source, attribute, aggregation, and normalization as well as feature name.