A quick blog post to describe merging ECHO data with FRS data.
In the last week, I have received two similar questions about merging data from the EPA ECHO Data Downloads. In this post, I try to answer one of these questions, posed by Ashley Langer:
CEMS is the Continuous Emissions Monitoring System data.1 These continuous monitors are generally used by facilities in air market trading programs, which depend on quite accurate data.
The other data set is ICIS-Air, housed by ECHO, contains which “emissions, compliance, and enforcement data on stationary sources of air pollution.” This data set focuses on the plant level.
Both of these data sets have their own facility identifiers, but the CEMS data does not contain a linking identifier (or at least, the data set under discussion does not). Common identifiers are provided by EPA’s Facility Registry Service. It’s important to note here that the common identifier may not be one-to-one, based on differing regulatory definitions and what FRS considers to be a facility.
The ECHO data downloads have an FRS file, so what’s the problem? This is not a complete copy of FRS. As noted in the download summary:
records included in the FRS download file are those with a valid FRS and which are cross-referenced by ID number to data contained in ICIS-Air, TRI, GHG (E-GGRT), RCRAInfo, SDWIS, ICIS-NPDES, or Superfund Enterprise Management System (SEMS), or which are linked to a formal enforcement action in ICIS-FE&C.
CEMS uses the identifier from the EIA form 860 which is not in the FRS file at the ECHO data downloads (my understanding is that the ECHO FRS file only includes regulatory programs that are in other ECHO download files). The solution is to use the FRS flat file downloads. If you did not know these files existed, don’t feel bad, I’ve had to point people to them numerous times!
For convenience and speed, I’m going to look just at one state. The national combined files are at
https://ofmext.epa.gov/FLA/www3/state_files/national_combined.zip
temp <- tempfile()
download.file("https://ofmext.epa.gov/FLA/www3/state_files/state_combined_tx.zip",temp)
unzip(temp, list = TRUE)
Name Length
1 TX_ALTERNATIVE_NAME_FILE.CSV 22091458
2 TX_CONTACT_FILE.CSV 28940865
3 TX_ENVIRONMENTAL_INTEREST_FILE.CSV 56539833
4 TX_FACILITY_FILE.CSV 129689442
5 TX_MAILING_ADDRESS_FILE.CSV 21212504
6 TX_NAICS_FILE.CSV 10462281
7 TX_ORGANIZATION_FILE.CSV 66343323
8 TX_PROGRAM_FILE.CSV 161161367
9 TX_PROGRAM_GIS_FILE.CSV 31614995
10 TX_SIC_FILE.CSV 14935318
11 TX_SUPP_INTEREST_FILE.CSV 59122565
12 Facility State File Documentation 11132012_new.pdf 1020424
Date
1 2021-02-03 03:49:00
2 2021-02-03 04:05:00
3 2021-02-03 03:37:00
4 2021-02-03 03:30:00
5 2021-02-03 04:13:00
6 2021-02-03 03:42:00
7 2021-02-03 03:58:00
8 2021-02-03 03:33:00
9 2021-02-03 04:22:00
10 2021-02-03 03:45:00
11 2021-02-03 04:17:00
12 2020-05-18 12:41:00
What we actually need is the Environmental Interest file (there is a PDF if you need to look at the documentation).
env_interest <- read_csv(unzip(temp, "TX_ENVIRONMENTAL_INTEREST_FILE.CSV"))
spec(env_interest)
cols(
REGISTRY_ID = col_double(),
PGM_SYS_ACRNM = col_character(),
PGM_SYS_ID = col_character(),
INTEREST_TYPE = col_character(),
FED_STATE_CODE = col_character(),
START_DATE = col_character(),
START_DATE_QUALIFIER = col_character(),
END_DATE = col_character(),
END_DATE_QUALIFIER = col_character(),
SOURCE_OF_DATA = col_character(),
LAST_REPORTED_DATE = col_character(),
CREATE_DATE = col_character(),
UPDATE_DATE = col_character(),
ACTIVE_STATUS = col_character()
)
I’m going to throw in a tip here. readr::read_csv shows how it decided to parse each column. Notice how REGISTRY_ID was read as double? That’s bad! REGISTRY_ID is an integer identifier and you want to make sure you don’t lose accuracy in digits. When using ECHO or other, similar data sets I STRONGLY recommend assigning column types when reading the data. With read_csv, this is accomplished by cutting and pasting the output from spec() and creating a list, then setting col_types to this list in read_csv.
coltypes <- cols(
REGISTRY_ID = col_character(),
PGM_SYS_ACRNM = col_character(),
PGM_SYS_ID = col_character(),
INTEREST_TYPE = col_character(),
FED_STATE_CODE = col_character(),
START_DATE = col_character(),
START_DATE_QUALIFIER = col_character(),
END_DATE = col_character(),
END_DATE_QUALIFIER = col_character(),
SOURCE_OF_DATA = col_character(),
LAST_REPORTED_DATE = col_character(),
CREATE_DATE = col_character(),
UPDATE_DATE = col_character(),
ACTIVE_STATUS = col_character()
)
env_interest <- read_csv(unzip(temp, "TX_ENVIRONMENTAL_INTEREST_FILE.CSV"),
col_types = coltypes)
Let’s take a quick look at the data
head(env_interest)
# A tibble: 6 x 14
REGISTRY_ID PGM_SYS_ACRNM PGM_SYS_ID INTEREST_TYPE FED_STATE_CODE
<chr> <chr> <chr> <chr> <chr>
1 1100004559~ TRIS 75006CRGL~ TRI REPORTER <NA>
2 1100004559~ NCDB I06#19891~ COMPLIANCE A~ FEDERAL
3 1100004559~ TX-TCEQ ACR RN1019959~ STATE MASTER STATE
4 1100004559~ RCRAINFO TXD099927~ CESQG FEDERAL
5 1100004559~ NPDES TXR05DZ60 ICIS-NPDES N~ FEDERAL
6 1100004559~ RCRAINFO TXD099927~ UNSPECIFIED ~ FEDERAL
# ... with 9 more variables: START_DATE <chr>,
# START_DATE_QUALIFIER <chr>, END_DATE <chr>,
# END_DATE_QUALIFIER <chr>, SOURCE_OF_DATA <chr>,
# LAST_REPORTED_DATE <chr>, CREATE_DATE <chr>, UPDATE_DATE <chr>,
# ACTIVE_STATUS <chr>
Relevant for our purposes here, REGISTRY_ID is the FRS identifier, PGM_SYS_ACRNM is the acronym for a regulatory program, PGM_SYS_ID is the facility identifier for that program.
Doing a quick count shows that there are 41 different regulatory programs in the file!
env_interest %>%
count(PGM_SYS_ACRNM) %>%
arrange(desc(n))
# A tibble: 41 x 2
PGM_SYS_ACRNM n
<chr> <int>
1 TX-TCEQ ACR 165635
2 NPDES 120956
3 RCRAINFO 43410
4 SFDW 33372
5 ICIS 14859
6 EIS 7892
7 OSHA-OIS 6902
8 AIR 6582
9 AIRS/AFS 5501
10 TRIS 4661
# ... with 31 more rows
Next step is to create two dataframes, one for each of the two system identifers we want, and then join them by the REGISTRY_ID.
air_df <- env_interest %>%
filter(PGM_SYS_ACRNM == "AIR") %>%
select(REGISTRY_ID, PGM_SYS_ACRNM, PGM_SYS_ID) %>%
rename(air_id = PGM_SYS_ID)
cems_df <- env_interest %>%
filter(PGM_SYS_ACRNM == "EIA-860") %>%
select(REGISTRY_ID, PGM_SYS_ACRNM, PGM_SYS_ID) %>%
rename(cems_id = PGM_SYS_ID)
joined_df <- full_join(air_df, cems_df, by = "REGISTRY_ID")
head(joined_df)
# A tibble: 6 x 5
REGISTRY_ID PGM_SYS_ACRNM.x air_id PGM_SYS_ACRNM.y cems_id
<chr> <chr> <chr> <chr> <chr>
1 110000456042 AIR TX000000481130~ <NA> <NA>
2 110000456042 AIR TX000000481210~ <NA> <NA>
3 110000456079 AIR TX000000481130~ <NA> <NA>
4 110000456079 AIR TX000000481130~ <NA> <NA>
5 110000456088 AIR TX000000480850~ <NA> <NA>
6 110000456140 AIR TX000000481130~ <NA> <NA>
Although there are 660 CEMS identifiers in the joined data, only 188 of these have an associated ICIS-Air identifier. These are attached to 221 ICIS-Air identifiers.
I’m afraid that I can’t explain why these numbers differ. One potential explanation is that ICIS-Air is a relatively new dataset (it’s predecessor, AFS, is frozen as of late 2014), so if a CEMS reporter stopped emitting prior to that date, it would not be in ICIS-Air. I nevertheless hope this is helpful.
Please let me know if you have any questions!
This work is not a product of the United States Government or the United States Environmental Protection Agency, and the author is not doing this work in any governmental capacity. The views expressed are those of the author only and do not necessarily represent those of the United States or the US EPA.
Learn more about using Distill for R Markdown at https://rstudio.github.io/distill.
Data downloads are provided by EPA’s Clean Air Markets Division at https://ampd.epa.gov/ampd/.↩︎
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/willwheels/ww, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wheeler (2021, March 2). Will's 'Academic' Website: So You Want to Merge FRS Data?. Retrieved from http://willwheels.rbind.io/posts/
BibTeX citation
@misc{wheeler2021so, author = {Wheeler, Will}, title = {Will's 'Academic' Website: So You Want to Merge FRS Data?}, url = {http://willwheels.rbind.io/posts/}, year = {2021} }