Intro to Place Data
This page gives an overview of the model-ready data and features that Iggy provides. It is meant to accompany the Iggy Data Dictionary.
How Iggy thinks about location features
At Iggy, we think about location-related features in terms of boundaries, data sources, and aggregations. These three components form the core of our data model. Put most simply, each Iggy feature is the result of an aggregation applied to an underlying data source within a boundary.
Many data sets have location fields that link a row of data to a real place on Earth. Depending on the particular location field, that may be a relatively general place (e.g. a metro area or county) or a very specific place (e.g. a quadkey or address). Traditionally, part of the challenge in dealing with location data is converting from specific to general places. For example, a dataset may have an address field, while the available economic data only comes at the county level. How do you link each address to the relevant county in order to add features from the economic dataset?
We use the term boundary to describe the geographic area over which some data is aggregated. Iggy pre-aggregates features to boundary levels ranging from general (metro area) to specific (quadkey) so that users can pull data at exactly the level they need. For example, if your data set includes a zip code field, Iggy provides features that have been pre-aggregated at the zip code level, such as the count of restaurants per capita within each zip code.
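As a rough illustration of pulling pre-aggregated features at the zip code level, the sketch below left-joins a small table of rows to a feature table keyed by zip code. The feature name, the values, and the `enrich` helper are all made up for illustration; they are not taken from the Iggy Data Dictionary.

```python
# Hypothetical pre-aggregated feature table keyed by 5-digit zip code.
# The feature name and values are invented for this example.
zipcode_features = {
    "94110": {"poi_is_restaurant_count_per_capita": 0.0042},
    "10001": {"poi_is_restaurant_count_per_capita": 0.0118},
}

# Your own rows, each carrying a zip code field.
listings = [
    {"listing_id": 1, "zipcode": "94110"},
    {"listing_id": 2, "zipcode": "10001"},
    {"listing_id": 3, "zipcode": "99999"},  # no features for this zip code
]


def enrich(rows, features_by_boundary, key):
    """Left-join rows to boundary-level features on the boundary ID."""
    enriched = []
    for row in rows:
        # Rows whose boundary ID is missing from the table stay unchanged.
        features = features_by_boundary.get(row[key], {})
        enriched.append({**row, **features})
    return enriched


enriched_listings = enrich(listings, zipcode_features, "zipcode")
```

The same join would work at any other boundary level, provided your rows carry (or can be mapped to) that boundary's identifier.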
Currently Iggy provides features pertaining to the following boundaries, from general to specific:
metro – Census Core Based Statistical Area, identified by CBSA FIPS
county – County, identified by 5-digit FIPS
locality – City, identified by ID from the Who's on First gazetteer
zipcode – Zip Code, identified by 5-digit zip code
census_tract – Census Tract, identified by 11-digit census tract GEOID
cbg – Census Block Group, identified by 12-digit census block group GEOID
qk_isochrone_walk_10m – 10-min Walk Isochrone, identified by zoom-19 quadkey identifier
The most fine-grained boundary type we currently offer is the 10-min walk isochrone, which is the boundary that encompasses the walkable area within 10 min of a zoom-19 quadkey (a map tile with a side length of ~75 m). By providing features aggregated at this fine-grained level, we let users with addresses or geographic coordinates add hyper-local features to their models.
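For readers starting from coordinates, the sketch below shows one way to derive a zoom-19 quadkey from a latitude and longitude. It assumes the widely used Web Mercator tile scheme with Bing-style quadkeys; whether Iggy computes its identifiers in exactly this way is an assumption, and the function names are illustrative.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius used by Web Mercator
MAX_LAT = 85.05112878       # latitude limit of the Web Mercator projection


def latlon_to_tile(lat, lon, zoom):
    """Return the (x, y) tile containing a point at the given zoom level."""
    lat = max(-MAX_LAT, min(MAX_LAT, lat))
    n = 1 << zoom  # number of tiles along each axis
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    return tile_x, tile_y


def tile_to_quadkey(tile_x, tile_y, zoom):
    """Interleave the bits of x and y into a base-4 string, one digit per zoom level."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def latlon_to_quadkey(lat, lon, zoom=19):
    tile_x, tile_y = latlon_to_tile(lat, lon, zoom)
    return tile_to_quadkey(tile_x, tile_y, zoom)


def tile_side_m(lat, zoom=19):
    """Approximate east-west ground length of one tile at a given latitude."""
    circumference = 2 * math.pi * EARTH_RADIUS_M
    return circumference * math.cos(math.radians(lat)) / (1 << zoom)


quadkey = latlon_to_quadkey(37.7749, -122.4194)  # a point in San Francisco
```

Under this scheme a zoom-19 tile is about 76 m wide at the equator and narrower at higher latitudes, which is consistent with the ~75 m figure quoted above.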
A data source describes the underlying geographic data that is aggregated within a boundary. Each data source has rows that represent points, lines, or polygons with geographic coordinates.
Many different types of data can be construed as geographic, such as local businesses, demographics, and topography. Our demo dataset incorporates features computed from the following data sources:
Points of Interest (poi)
- Points of interest are businesses and services with a physical presence including restaurants, manufacturing sites, and community centers.
- poi features are aggregated from an underlying dataset of points, each representing a distinct point of interest and categorized based on the Iggy Feature Catalog.
American Community Survey (acs)
- The U.S. Census ACS data includes information about demographics, household composition, employment, commute patterns, and housing. Iggy currently relies on ACS data collected over the 5-year period 2014-2019. The primary advantage of using multi-year estimates is the increased statistical reliability for less populated areas and small population subgroups.
- Only census-designated boundaries (cbg) incorporate features from acs, as these are the levels at which ACS data is reported and provided.
- Iggy produces features that summarize the coastline, rivers, and lakes within a boundary.
- water features are aggregated from an underlying dataset that represents coastline as lines, and rivers and lakes as polygons.
- We also provide features calculated based on national, state, and local parks within a boundary.
- Our underlying park data represents each park as a polygon.
Each data source also has one or more attributes describing each row that can be used to filter aggregations and derive more interesting features:
poi data attributes indicate the POI category, and whether it is a brand/chain:
- Ontology Top-level Category Attributes (see Iggy Feature Catalog)
- Ontology Sub-level Category Attributes (see Iggy Feature Catalog)
- is_brandname indicates whether the POI is a brand or chain (e.g. McDonald’s, Dollar Store, Pep Boys)
acs data attributes indicate a particular Census summary statistic about the relevant boundary (cbg). They cover a variety of types of information:
- Includes attributes related to age (e.g. median_age), gender (e.g. pop_sex_female_age_5_to_9), race/ethnicity (e.g. pop_race_asian), and birthplace/citizenship.
- Includes attributes surrounding household composition (e.g. households_cohabiting_couple), education (e.g. pop_adult_education_less_than_high_school), and veteran status.
- Includes attributes indicating income (e.g. pop_below_100_pct_poverty_level), employment status (e.g. pct_in_labor_force_status_civilian_employed), and employment industry.
- Includes attributes indicating (pre-2020) commute habits, including method (e.g. pop_commutes_by_public_transport_rail), time (e.g. pop_commute_departure_0630_to_0659), and duration.
- Includes attributes dealing with housing unit type (e.g. housing_units_boat_rv_van), age (e.g. housing_units_built_1939_or_earlier), ownership status (e.g. housing_units_renter_occupied), size (e.g. housing_units_10_to_19_in_structure), and value.
water data attributes indicate the type of water body.
- Type of water body
parks data includes terrestrial and marine protected areas inventoried by the United States Geological Survey. Parks in "PAD-US" are dedicated to preserving biological diversity, and to other natural, recreation, and cultural uses.
Our breakdown of the PAD-US includes aspects of ownership / management (federal, state, local, and private land, for instance), the intended use (recreation vs. agriculture or ranching), and other access-related attributes.
- Park-related attributes:
Note that a park may have a value of 1 for more than one attribute. For example, a state park might have
The full set of underlying data sources and attributes is detailed in the Iggy Data Dictionary.
Iggy Feature Catalog
The Iggy Feature Catalog is used to organize places in our poi data source. You can find more information, including definitions and examples, in the Iggy Feature Catalog reference page.
Aggregations and Normalizations
Given a boundary (like a zip code) and a data source (like POIs), Iggy produces features by running an aggregation over the data intersecting the boundary. Aggregations range from simple (e.g. counts of items intersecting a boundary) to more complex spatial functions (e.g. square km in the intersection between a boundary and a data source like lakes).
In addition to aggregations, Iggy also provides features that have additional normalization calculated on top of the aggregation, like dividing by the boundary population or area.
The following is a list of the various aggregations and normalizations that are used to produce Iggy features.
[none] – Features with no aggregation are generated by taking the raw value from the boundary itself, or from a boundary-linked data source like acs.
count – Count of distinct rows from the underlying data source that intersect a boundary. If the count feature is associated with a data attribute, then the count indicates the number of distinct rows having that particular attribute. For example, the feature poi_is_education_count indicates the number of distinct rows from the poi dataset having the attribute is_education=True.
intersects – A boolean feature indicating whether the boundary intersects any row within the underlying data source.
intersecting_area_in_sqkm – A float feature indicating the total area (in sq km) of the intersection between a boundary and any row in the underlying polygon data source. This can only be computed for data sources whose rows are polygons, like park and water.
intersecting_length_in_sqkm – A float feature indicating the total length (in km) of the intersection between a boundary and any row in the underlying line data source. This can only be computed for data sources whose rows are lines, like water where is_coastline=True.
per_sqkm – Divides the aggregated feature value by the boundary area, in sq km.
per_capita – Divides the aggregated feature value by the boundary population.
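To make the aggregations and normalizations listed above concrete, here is a minimal sketch on toy data. It assumes a simplified planar world where the boundary and the parks are axis-aligned rectangles measured in km; real boundaries are arbitrary polygons on the globe, and the `Rect` class, the sample rows, and the function bodies are illustrative assumptions rather than Iggy's implementation.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned rectangle in a planar coordinate system, in km."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def area(self):
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)

    def contains(self, x, y):
        return self.min_x <= x <= self.max_x and self.min_y <= y <= self.max_y

    def overlap_area(self, other):
        width = min(self.max_x, other.max_x) - max(self.min_x, other.min_x)
        height = min(self.max_y, other.max_y) - max(self.min_y, other.min_y)
        return width * height if width > 0 and height > 0 else 0.0


# Toy boundary (20 sq km) with a made-up population.
boundary = Rect(0, 0, 4, 5)
population = 10_000

# Toy point data source with one attribute.
pois = [
    {"x": 1.0, "y": 1.0, "is_education": True},
    {"x": 3.0, "y": 4.0, "is_education": False},
    {"x": 2.0, "y": 2.5, "is_education": True},
    {"x": 9.0, "y": 9.0, "is_education": True},  # outside the boundary
]

# Toy polygon data source.
parks = [Rect(3, 4, 6, 7), Rect(10, 10, 11, 11)]


def count(rows, boundary, attribute=None):
    """Count rows inside the boundary, optionally filtered by an attribute."""
    return sum(
        1
        for row in rows
        if boundary.contains(row["x"], row["y"])
        and (attribute is None or row[attribute])
    )


def intersects(polygons, boundary):
    return any(boundary.overlap_area(p) > 0 for p in polygons)


def intersecting_area_in_sqkm(polygons, boundary):
    # Sums per-polygon overlaps; overlapping polygons would be double counted.
    return sum(boundary.overlap_area(p) for p in polygons)


def per_sqkm(value, boundary):
    return value / boundary.area()


def per_capita(value, population):
    return value / population


poi_count = count(pois, boundary)
education_count = count(pois, boundary, "is_education")
park_area = intersecting_area_in_sqkm(parks, boundary)
```

Normalizations are then applied on top of an aggregated value, for example `per_capita(education_count, population)` or `per_sqkm(park_area, boundary)`.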
The Data Dictionary provides a complete listing of the available Iggy features at each boundary level.
In general, features are named by joining the data source, attribute, aggregation, and normalization, in that order, with underscores.
For example, the feature poi_is_museum_count_per_capita is calculated for a particular boundary by taking the data source poi, filtering for rows where is_museum=True, applying the aggregation count within the boundary, and finally applying the per_capita normalization to divide the count by the boundary population.
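The naming pattern in this example can be sketched as a small helper that joins the components with underscores. The function below is hypothetical and only mirrors the pattern visible in the feature names on this page; it is not part of any Iggy API.

```python
def build_feature_name(source, attribute=None, aggregation=None, normalization=None):
    """Join the non-empty name components with underscores (illustrative only)."""
    parts = [source, attribute, aggregation, normalization]
    return "_".join(part for part in parts if part)


museum_feature = build_feature_name("poi", "is_museum", "count", "per_capita")
education_feature = build_feature_name("poi", "is_education", "count")
```

Because some published feature names deviate from the pattern for readability, a helper like this should not be expected to reproduce every name in the Data Dictionary.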
Some feature names deviate slightly from this convention in order to make them more interpretable. For example, the feature lake_pct_area_intersecting_boundary is an easier way of expressing the feature generated from the lake data source where the attribute is_lake=True, applying the intersecting_area_in_sqkm aggregation and the per_sqkm normalization. The Data Dictionary is searchable by data source, attribute, aggregation, and normalization as well as feature name.