Will's 'Academic' Website: So You Want to Merge FRS Data?

Will Wheeler

In the last week, I have received two similar questions about merging data from the EPA ECHO Data Downloads. In this post, I try to answer one of these questions, posed by Ashley Langer:

CEMS is the Continuous Emissions Monitoring System data.¹ These continuous monitors are generally used by facilities in air market trading programs, which depend on quite accurate data.

The other data set is ICIS-Air, housed by ECHO, contains which “emissions, compliance, and enforcement data on stationary sources of air pollution.” This data set focuses on the plant level.

Both of these data sets have their own facility identifiers, but the CEMS data does not contain a linking identifier (or at least, the data set under discussion does not). Common identifiers are provided by EPA’s Facility Registry Service. It’s important to note here that the common identifier may not be one-to-one, based on differing regulatory definitions and what FRS considers to be a facility.

The ECHO data downloads have an FRS file, so what’s the problem? This is not a complete copy of FRS. As noted in the download summary:

records included in the FRS download file are those with a valid FRS and which are cross-referenced by ID number to data contained in ICIS-Air, TRI, GHG (E-GGRT), RCRAInfo, SDWIS, ICIS-NPDES, or Superfund Enterprise Management System (SEMS), or which are linked to a formal enforcement action in ICIS-FE&C.

CEMS uses the identifier from the EIA form 860 which is not in the FRS file at the ECHO data downloads (my understanding is that the ECHO FRS file only includes regulatory programs that are in other ECHO download files). The solution is to use the FRS flat file downloads. If you did not know these files existed, don’t feel bad, I’ve had to point people to them numerous times!

For convenience and speed, I’m going to look just at one state. The national combined files are at
https://ofmext.epa.gov/FLA/www3/state_files/national_combined.zip


temp <- tempfile()
download.file("https://ofmext.epa.gov/FLA/www3/state_files/state_combined_tx.zip",temp)

unzip(temp, list = TRUE)


                                                 Name    Length
1                        TX_ALTERNATIVE_NAME_FILE.CSV  22091458
2                                 TX_CONTACT_FILE.CSV  28940865
3                  TX_ENVIRONMENTAL_INTEREST_FILE.CSV  56539833
4                                TX_FACILITY_FILE.CSV 129689442
5                         TX_MAILING_ADDRESS_FILE.CSV  21212504
6                                   TX_NAICS_FILE.CSV  10462281
7                            TX_ORGANIZATION_FILE.CSV  66343323
8                                 TX_PROGRAM_FILE.CSV 161161367
9                             TX_PROGRAM_GIS_FILE.CSV  31614995
10                                    TX_SIC_FILE.CSV  14935318
11                          TX_SUPP_INTEREST_FILE.CSV  59122565
12 Facility State File Documentation 11132012_new.pdf   1020424
                  Date
1  2021-02-03 03:49:00
2  2021-02-03 04:05:00
3  2021-02-03 03:37:00
4  2021-02-03 03:30:00
5  2021-02-03 04:13:00
6  2021-02-03 03:42:00
7  2021-02-03 03:58:00
8  2021-02-03 03:33:00
9  2021-02-03 04:22:00
10 2021-02-03 03:45:00
11 2021-02-03 04:17:00
12 2020-05-18 12:41:00

What we actually need is the Environmental Interest file (there is a PDF if you need to look at the documentation).


env_interest <- read_csv(unzip(temp, "TX_ENVIRONMENTAL_INTEREST_FILE.CSV"))


spec(env_interest)


cols(
  REGISTRY_ID = col_double(),
  PGM_SYS_ACRNM = col_character(),
  PGM_SYS_ID = col_character(),
  INTEREST_TYPE = col_character(),
  FED_STATE_CODE = col_character(),
  START_DATE = col_character(),
  START_DATE_QUALIFIER = col_character(),
  END_DATE = col_character(),
  END_DATE_QUALIFIER = col_character(),
  SOURCE_OF_DATA = col_character(),
  LAST_REPORTED_DATE = col_character(),
  CREATE_DATE = col_character(),
  UPDATE_DATE = col_character(),
  ACTIVE_STATUS = col_character()
)

I’m going to throw in a tip here. readr::read_csv shows how it decided to parse each column. Notice how REGISTRY_ID was read as double? That’s bad! REGISTRY_ID is an integer identifier and you want to make sure you don’t lose accuracy in digits. When using ECHO or other, similar data sets I STRONGLY recommend assigning column types when reading the data. With read_csv, this is accomplished by cutting and pasting the output from spec() and creating a list, then setting col_types to this list in read_csv.


coltypes <- cols(
  REGISTRY_ID = col_character(),
  PGM_SYS_ACRNM = col_character(),
  PGM_SYS_ID = col_character(),
  INTEREST_TYPE = col_character(),
  FED_STATE_CODE = col_character(),
  START_DATE = col_character(),
  START_DATE_QUALIFIER = col_character(),
  END_DATE = col_character(),
  END_DATE_QUALIFIER = col_character(),
  SOURCE_OF_DATA = col_character(),
  LAST_REPORTED_DATE = col_character(),
  CREATE_DATE = col_character(),
  UPDATE_DATE = col_character(),
  ACTIVE_STATUS = col_character()
)
  
env_interest <- read_csv(unzip(temp, "TX_ENVIRONMENTAL_INTEREST_FILE.CSV"),
                         col_types = coltypes)

Let’s take a quick look at the data


head(env_interest)


# A tibble: 6 x 14
  REGISTRY_ID PGM_SYS_ACRNM PGM_SYS_ID INTEREST_TYPE FED_STATE_CODE
  <chr>       <chr>         <chr>      <chr>         <chr>         
1 1100004559~ TRIS          75006CRGL~ TRI REPORTER  <NA>          
2 1100004559~ NCDB          I06#19891~ COMPLIANCE A~ FEDERAL       
3 1100004559~ TX-TCEQ ACR   RN1019959~ STATE MASTER  STATE         
4 1100004559~ RCRAINFO      TXD099927~ CESQG         FEDERAL       
5 1100004559~ NPDES         TXR05DZ60  ICIS-NPDES N~ FEDERAL       
6 1100004559~ RCRAINFO      TXD099927~ UNSPECIFIED ~ FEDERAL       
# ... with 9 more variables: START_DATE <chr>,
#   START_DATE_QUALIFIER <chr>, END_DATE <chr>,
#   END_DATE_QUALIFIER <chr>, SOURCE_OF_DATA <chr>,
#   LAST_REPORTED_DATE <chr>, CREATE_DATE <chr>, UPDATE_DATE <chr>,
#   ACTIVE_STATUS <chr>

Relevant for our purposes here, REGISTRY_ID is the FRS identifier, PGM_SYS_ACRNM is the acronym for a regulatory program, PGM_SYS_ID is the facility identifier for that program.

Doing a quick count shows that there are 41 different regulatory programs in the file!


env_interest %>%
  count(PGM_SYS_ACRNM) %>%
  arrange(desc(n))


# A tibble: 41 x 2
   PGM_SYS_ACRNM      n
   <chr>          <int>
 1 TX-TCEQ ACR   165635
 2 NPDES         120956
 3 RCRAINFO       43410
 4 SFDW           33372
 5 ICIS           14859
 6 EIS             7892
 7 OSHA-OIS        6902
 8 AIR             6582
 9 AIRS/AFS        5501
10 TRIS            4661
# ... with 31 more rows

Next step is to create two dataframes, one for each of the two system identifers we want, and then join them by the REGISTRY_ID.


air_df <- env_interest %>%
  filter(PGM_SYS_ACRNM == "AIR") %>%
  select(REGISTRY_ID, PGM_SYS_ACRNM, PGM_SYS_ID) %>%
  rename(air_id = PGM_SYS_ID)
  
cems_df <- env_interest %>%
  filter(PGM_SYS_ACRNM == "EIA-860") %>%
  select(REGISTRY_ID, PGM_SYS_ACRNM, PGM_SYS_ID) %>%
  rename(cems_id = PGM_SYS_ID)

joined_df <- full_join(air_df, cems_df, by = "REGISTRY_ID")

head(joined_df)


# A tibble: 6 x 5
  REGISTRY_ID  PGM_SYS_ACRNM.x air_id          PGM_SYS_ACRNM.y cems_id
  <chr>        <chr>           <chr>           <chr>           <chr>  
1 110000456042 AIR             TX000000481130~ <NA>            <NA>   
2 110000456042 AIR             TX000000481210~ <NA>            <NA>   
3 110000456079 AIR             TX000000481130~ <NA>            <NA>   
4 110000456079 AIR             TX000000481130~ <NA>            <NA>   
5 110000456088 AIR             TX000000480850~ <NA>            <NA>   
6 110000456140 AIR             TX000000481130~ <NA>            <NA>

Although there are 660 CEMS identifiers in the joined data, only 188 of these have an associated ICIS-Air identifier. These are attached to 221 ICIS-Air identifiers.

I’m afraid that I can’t explain why these numbers differ. One potential explanation is that ICIS-Air is a relatively new dataset (it’s predecessor, AFS, is frozen as of late 2014), so if a CEMS reporter stopped emitting prior to that date, it would not be in ICIS-Air. I nevertheless hope this is helpful.

Please let me know if you have any questions!

This work is not a product of the United States Government or the United States Environmental Protection Agency, and the author is not doing this work in any governmental capacity. The views expressed are those of the author only and do not necessarily represent those of the United States or the US EPA.

Learn more about using Distill for R Markdown at https://rstudio.github.io/distill.

Data downloads are provided by EPA’s Clean Air Markets Division at https://ampd.epa.gov/ampd/.↩︎

So You Want to Merge FRS Data?

Corrections

Reuse

Citation