This page describes the data used to determine how much nitrogen and phosphorus is discharged by industries to sewers networks.
Code
knitr::opts_chunk$set(warning=F, message=F, results=F, dev='svg')library(tidyverse) #loads multiple packages (see https://tidyverse.tidyverse.org/)#core tidyverse packages loaded:# ggplot2, for data visualisation. https://ggplot2.tidyverse.org/# dplyr, for data manipulation. https://dplyr.tidyverse.org/# tidyr, for data tidying. https://tidyr.tidyverse.org/# readr, for data import. https://readr.tidyverse.org/# purrr, for functional programming. https://purrr.tidyverse.org/# tibble, for tibbles, a modern re-imagining of data frames. https://tibble.tidyverse.org/# stringr, for strings. https://stringr.tidyverse.org/# forcats, for factors. https://forcats.tidyverse.org/# lubridate, for date/times. https://lubridate.tidyverse.org/#also loads the following packages (less frequently used):# Working with specific types of vectors:# hms, for times. https://hms.tidyverse.org/# Importing other types of data:# feather, for sharing with Python and other languages. https://github.com/wesm/feather# haven, for SPSS, SAS and Stata files. https://haven.tidyverse.org/# httr, for web apis. https://httr.r-lib.org/# jsonlite for JSON. https://arxiv.org/abs/1403.2805# readxl, for .xls and .xlsx files. https://readxl.tidyverse.org/# rvest, for web scraping. https://rvest.tidyverse.org/# xml2, for XML. https://xml2.r-lib.org/# Modelling# modelr, for modelling within a pipeline. https://modelr.tidyverse.org/# broom, for turning models into tidy data. https://broom.tidymodels.org/# Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors#loading additional relevant packageslibrary(cowplot) #for plot_grid(), multiple plotslibrary(patchwork) #also for multiple plotslibrary(viridis)#setting graphs themetheme_set(theme_minimal() +theme(plot.title =element_text(face="bold")) )#setting viridis theme for colorsscale_colour_continuous <- scale_colour_viridis_cscale_colour_discrete <- scale_colour_viridis_dscale_colour_binned <- scale_colour_viridis_b#setting viridis theme for fillscale_fill_continuous <- scale_fill_viridis_cscale_fill_discrete <- scale_fill_viridis_dscale_fill_binned <- scale_fill_viridis_bSource <-"Source: géorisque database\ncomputation by Thomas Starck"# Load the functions filesource("functions.R")
Source, description and loading data
The géorisque database is publicly available and reports industrial discharge, when they are made compulsory by the regulatory thresholds. It reports industrial facilities name and IDs. The GEREP database is not open access but was provided by the French administration ; it contains more data than simply the mandatory reported ones, but the industrial facilities are anonymized. In the following we will use the GEREP database. We nonetheless study the géorisques database to test the consistency of the GEREP database.
The data were downloaded on the georisque website, reporting industrial pollution discharges. Read the description of the data on this page.
All pollution flows are reported in kg/year in the database.
We load the emission file emissions.csv, for each year. They report the industry facility ID and name, pollutant type (145 different), discharge environment (air, water direct, water indirect, ground), and quantity in kg/an. We focus only on NGL, Pt, DBO5, DCO, MES.
Code
#path for géorsiques data sourcepath_source <-"source_data/0_industry_discharge/georisque_data/"#loading pollutant datafile_georisques_emission <-list.files( #read and merge csv of all yearspath = path_source,pattern ="emissions.csv", full.names = T, recursive = T) %>%lapply(read_csv2) %>%bind_rows()# see the 145 different pollutantsunique(file_georisques_emission$polluant)# focus only on NGL, Pt, DBO5, DCO, MESfile_georisques_emission <- file_georisques_emission %>%filter( polluant %in%c("Phosphore total","Azote total","Demande chimique en oxygène (DCO)","Demande biologique en oxygène (DBO5)","Matières en suspension (MES)" ) ) %>%#transform numbers into numericmutate(quantite =as.numeric(quantite) )
We load the files describing the facilities (etablissements.csv). They report the site name, address, coordinates, the European Pollutant Release and Transfer Register code (EPRTR code and wording), and the APE code and wording. EPRTR categories are listed here, and APE codes are listed here.
Code
year_min <-2003year_max <-2020#function to load industrial sites files, from 2003 to 2020f_load_id_sites <-function(year_min, year_max){#read the file site for the first yer temp <-read_csv2(paste0(path_source, year_min, "/etablissements.csv")) %>%mutate(coordonnees_x =as.numeric(coordonnees_x),coordonnees_y =as.numeric(coordonnees_y),annee_emission = year_min )#read the sites files for subsequent years, merge with previous years filefor (i in (year_min+1):year_max){ temp2 <-read_csv2(paste0(path_source, i, "/etablissements.csv")) %>%mutate(coordonnees_x =as.numeric(coordonnees_x),coordonnees_y =as.numeric(coordonnees_y),annee_emission = i ) temp <-bind_rows(temp, temp2) }#return file for all yearsreturn(temp)}#loading industrial sites files from 2003 to 2020file_georisques_id_sites <-f_load_id_sites(2003, 2020)
We merge these 2 emissions files through their unique ID. There are 200 APE codes and 50 EPRTR codes related to our pollution of interest (NGL, Pt, DBO5, DCO, MES).
Code
#merge the industrial sites file with the file on pollutant emissionsfile_georisques_emissions_id_site <-left_join( file_georisques_emission, file_georisques_id_sites, by=c("annee_emission","identifiant"))# see the 300 APE codesunique(file_georisques_emissions_id_site$libelle_ape)# see the 60 EPRTR codesunique(file_georisques_emissions_id_site$libelle_eprtr)# we create the main gérosique file, one column for each pollutantgeorisque <- file_georisques_emissions_id_site %>%spread(polluant, quantite)# renaming columns (we could also get commune, department and region IDs)georisque <- georisque %>%select(Year = annee_emission,Id = identifiant,NGL =`Azote total`,Pt =`Phosphore total`,DCO =`Demande chimique en oxygène (DCO)`,DBO5 =`Demande biologique en oxygène (DBO5)`,MES =`Matières en suspension (MES)`,site_name = nom_etablissement.x,SIRET_code = numero_siret,discharge_environment = milieu,lat_site = coordonnees_x,long_site = coordonnees_y, code_epsg, code_ape, libelle_ape, code_eprtr, libelle_eprtr )# we change the french wording of discharge environment ("mileu" in original file) to Englishgeorisque <- georisque %>%mutate(discharge_environment =case_when( discharge_environment =="Eau (direct)"~"water (direct)", discharge_environment =="Eau (indirect)"~"water (indirect)", discharge_environment =="Sol"~"ground", ) )#see a summary of emissions according to the environmental dischargetable( file_georisques_emissions_id_site %>%filter( polluant %in%c("Phosphore total","Azote total","Demande chimique en oxygène (DCO)","Demande biologique en oxygène (DBO5)","Matières en suspension (MES)" ) ) %>%select(polluant, milieu) )
Only for the years 2019 and 2020, for each site (ID and name), there are files (rejets.csv) specifically specifying if discharges are in the sewers network or in the natural environment, in m3/y. We will use it to test our main dataset on this particular point.
Code
#read and merge csv of all yearsfile_georisques_discharge <-list.files( path = path_source,pattern ="rejets.csv", full.names = T, recursive = T) %>%lapply( read_csv2 ) %>%bind_rows()#renaming French columns to Englishfile_georisques_discharge <- file_georisques_discharge %>%rename(Year = annee_rejet,Id = identifiant,site_name = nom_etablissement,discharge_network_m3_y = rejet_raccorde_m3_par_an,discharge_environment_m3_y = rejet_isole_m3_par_an )
We load the data. Contrary to géorisques data, there is only 1 main file, so no need to merge several files together.
Code
#path for GEREP data sourcepath_source <-"source_data/0_industry_discharge/GEREP_data/"#read GEREP filefile_GEREP <- readxl::read_excel(paste0(path_source, "2010_2021_Rejets raccordés eau par agence.xlsx"))#rename French column names to EnglishGEREP <- file_GEREP %>%select(Year = Année,basin =`Agence de l'eau`,nutrient = Substance,incoming =`Masse émise retenue (kg/an)`,discharged =`Rejet final après épuration (kg/an)`,INSEE_REG =`Code INSEE Région`,name_REG = Région,INSEE_DEP =`Code INSEE département`,name_DEP = Département,INSEE_COM =`Code INSEE Commune`,name_COM = Commune )# Simplify the name of the basin rhone-med-corse basinGEREP <- GEREP %>%mutate(basin =case_when( basin =="Rhône-Méditerranée-Corse"~"Rhône-Méditerranée", T ~ basin ) )
We summarize the flows at the national scale.
Code
GEREP <- GEREP %>%#We only keep metropolitan France sites (exclude overseas territories)filter( basin %in%c("Adour-Garonne", "Artois-Picardie", "Loire-Bretagne", "Rhin-Meuse", "Rhône-Méditerranée", "Seine-Normandie") )#temporary file, incoming and discharged nutrient flows and number of facilities, at the national scaletemp <- GEREP %>%group_by(Year, nutrient) %>%#transform from kg/year (for each facility) to kt/year (national) for nutrient flowssummarise(incoming =signif(sum(incoming, na.rm=T)/10^6, 3),discharged =signif(sum(discharged, na.rm=T)/10^6, 3),nb_facilities =sum(is.na(nutrient)==F) )#function to spread nutrients flows and number of facilities in columnsf_spread_nutrients <-function(dataset, nutrient_type, suffix){ dataset <- dataset %>%#spread in columnsspread(nutrient, {{ nutrient_type }}) %>%#change name of the columns with suffixrename(!!paste0("NGL_", suffix) :=`Azote total (N)`,!!paste0("Pt_", suffix) :=`Phosphore total (P)`,!!paste0("DBO5_", suffix) :=`Demande biologique en oxygène (DBO5)`,!!paste0("DCO_", suffix) :=`Demande chimique en oxygène (DCO)`,!!paste0("MES_", suffix) :=`Matières en suspension (MES)` )return(dataset)}#temporary file with incoming flow for each nutrient in columns, at the national scaletemp_in <-f_spread_nutrients(temp %>%select(-c(discharged, nb_facilities)), incoming, "in")#temporary file with discharge flows for each nutrient in columns, at the national scaletemp_out <-f_spread_nutrients(temp %>%select(-c(incoming, nb_facilities)), discharged, "out")#temporary file with number of industrial facilities for each nutrient, at the national scaletemp_nb <-f_spread_nutrients(temp %>%select(-c(incoming, discharged)), nb_facilities, "nb_facilities")#merge together flows in, out, and number of facilitiesGEREP_national <-left_join( temp_in, temp_out, by="Year")GEREP_national <-left_join( GEREP_national, temp_nb, by="Year")
We summarize the flows at each water agency basin scale.
Code
#temporary file, incoming and discharges nutrient flows and number of facilities, for each water agency basintemp <- GEREP %>%group_by(Year, basin, nutrient) %>%summarise(#transform from kg/year (for each facility) to kt/year (for each basin) for nutrient flowsincoming =signif(sum(incoming, na.rm=T)/10^6, 3),discharged =signif(sum(discharged, na.rm=T)/10^6, 3),nb_facilities =sum(is.na(nutrient)==F) )#temporary file with incoming flow for each nutrient in columns, for each water agency basintemp_in <-f_spread_nutrients(temp %>%select(-c(discharged, nb_facilities)), incoming, "in")#temporary file with discharge flows for each nutrient in columns, for each water agency basintemp_out <-f_spread_nutrients(temp %>%select(-c(incoming, nb_facilities)), discharged, "out")#temporary file with number of industrial facilities for each nutrient, for each water agency basintemp_nb <-f_spread_nutrients(temp %>%select(-c(incoming, discharged)), nb_facilities, "nb_facilities")#merge together flows in, out, and number of facilitiesGEREP_basins <-left_join( temp_in, temp_out, by=c("Year", "basin") )GEREP_basins <-left_join( GEREP_basins, temp_nb, by=c("Year", "basin") )
We also create a complete file with the flows of all the industrial facilities, not summarized at the basin nor national scale.
Code
# add dummy ID for each industrial facility to merge temporary files togethertemp <- GEREP %>%mutate(ID =1:n())# temporary file of incoming nutrient flows in columns, for each industrial facilitytemp_in <-f_spread_nutrients(temp %>%select(-c(discharged)), incoming, "in")# temporary file of discharged nutrient flows in columns, for each industrial facilitytemp_out <-f_spread_nutrients(temp %>%select(-c(incoming)), discharged, "out")# merge incoming and discharged nutrient flowsGEREP_all <-left_join( temp_in, temp_out %>%select(ID, NGL_out, Pt_out, DBO5_out, DCO_out, MES_out), by="ID")#remove temporary filesrm(temp_in, temp_out, temp_nb)
About INERIS, responsible for IREP (Installation Registre Emissions Polluantes, installation register pollutant emissions).
About Aida, INERIS website on risk prevention and environmental protection regulations, and the related decree concerning the reporting of industrial pollution discharges.
In the graphs below we see that there is a distinction in the gérosiques database between water (direct) and water (indirect) discharge. On this page we can see that what is called water indirect (“eau indirect”) in the database corresponds to pollutants sent from industrial facilities to wastewater treatment plants (“eau vers station d’épuration”). However it is unclear whether the reported pollutant flows correspond to the quantity first discharged to the sewage network or to the quantity discharged to the environment after WWTP end of pipe pollution treatment.
For 2003-2020, we plot at the national scale the number of industrial facilities reporting discharges for the concerned nutrient (left) and the quantity discharged (right). Spikes in the right chart indicate some potential incoherent data, that we investigate.
in 2018, 2019 and 2020, Id 0006402259 SERAMM, Usine des boues: values about 10 times higher than the previous years. We divide the values of these years by 10.
For 2019 and 2020 discharged flows in the sanitation network and in the environment are reported (in m3/year) in the emission.csv file. We compare these data with the ones presented in the previous graphs, to check for coherence. Normally the nutrient flows discharged in the network should correspond to the “water (indirect)” flows of the previous graphs.
Code
#merge our main georisque data with file on network dischargesdata_sewers_discharge_merged <-inner_join( file_georisques_discharge %>%#We only keep non null discharge flows into networksfilter(is.na(discharge_network_m3_y)==F) %>%filter(discharge_network_m3_y !=0), georisque, by =c("Year", "Id"))# check that in the 2 databases the site names are the same. if returns 0 lines => OKnrow(data_sewers_discharge_merged %>%filter(site_name.x != site_name.y))# we only keep 1 site namedata_sewers_discharge_merged <- data_sewers_discharge_merged %>%select(-site_name.y) %>%rename(site_name = site_name.x)
Some of what is reported as “connected disharged” in the emission file is actually not a discharge into the sanitation network, but a discharge in the environment from waste water treatment plant. We identify these waste water treatment plants by their APE code (“Collecte et traitement des eaux usées”) and EPRTR code (“Installations de traitement des eaux urbaines résiduaires d’une capacité de 100 000 équivalents habitants”).
Code
#see all the APE labels (about 90)unique(data_sewers_discharge_merged$libelle_ape)#see al the EPRTR labels (about 30)unique(data_sewers_discharge_merged$libelle_eprtr)# classify each row in WWTP / not WWTP, based on APE and EPRTR labelsdata_sewers_discharge_merged <- data_sewers_discharge_merged %>%mutate(WWTP =case_when( libelle_ape =="Collecte et traitement des eaux usées"| libelle_eprtr =="Installations de traitement des eaux urbaines résiduaires d'une capacité de 100 000 équivalents habitants"~"WWTP", T ~"not WWTP (=industrial facilities)" ) )
In the graphs below, we can see that besides what looks like some little incoherences, the “indirect water discharge” of our previous data (black line) corresponds well to the discharge into networks of facilities that are not WWTP (columns for 2019 and 2020, left).
So in the following, we will use the “indirect water discharge” as a proxy for what is discharged into the sanitation networks.
Code
f_graph_comparison <-function(nutrient_select){#in the main file keep only indirect water discharge of nutrient, normally corresponding to discharge in sanitation networks.#agregate at the national scale temp <- georisque %>%filter(is.na(!!as.symbol(nutrient_select))==F & discharge_environment =="water (indirect)" ) %>%group_by(Year) %>%summarise(nutrient=sum(!!as.symbol(nutrient_select)))#keep only non null NGL emissions temp2 <- data_sewers_discharge_merged %>%filter(is.na(!!as.symbol(nutrient_select))==F) %>%rename(nutrient =!!as.symbol(nutrient_select))ggplot(temp2) +geom_col(aes( Year, nutrient/10^6, fill= discharge_environment ),alpha=.8 ) +geom_line(data=temp %>%mutate(WWTP ="not WWTP (=industrial facilities)"),aes(Year, nutrient/10^6) ) +facet_wrap(vars(WWTP)) +labs(x="", y="kt/year",title =paste("Comparison of", nutrient_select, "'network discharge' to 'indirect water discharge'"),subtitle ="network discharge of WWTP (right) and other facilities (left)\nwater discharged in sewage networks (columns) and indirect water discharge (line)",caption=Source,fill ="discharge to\n sewers networks" )}
We save the sewage network discharge for all the industrial sites.
Code
#remove temporary file used only to check consistency of sewers dischargerm(data_sewers_discharge_merged)#in the main georisques databse, select only "water (indirect)", i.e. discharge to sewers networkstemp <- georisque %>%filter( discharge_environment=="water (indirect)" )#output pathoutput_data <-"output_data/industry_sewers_network_discharge/"#save the géorisques data on industrial discharge to sewersf_save_csv_files( temp, output_data, "industry_sewers_network_discharge_georisque_ALL.csv" )
We also save the data aggregated at the national scale. Below are the graphs for each nutrient.
Code
#same as before, but flows summarized at the national scalenational_network_discharge <- georisque %>%filter( discharge_environment=="water (indirect)" ) %>%group_by(Year) %>%summarise(across(c(Pt, NGL, DBO5, DCO, MES), ~signif(sum(.x, na.rm=T)/10^6, 3) #converted in kt per year ) )#temporary file summarizing the number of industrial facilities at the national scale for each nutrienttemp <- georisque %>%filter( discharge_environment=="water (indirect)" ) %>%group_by(Year) %>%summarise(across(c(Pt, NGL, DBO5, DCO, MES), ~sum(is.na(.x)==F), .names ="{.col}_nb_facilities"#converted in kt per year ) )#merge the 2 files (flows and number of facilities)national_network_discharge <-left_join(national_network_discharge, temp, by="Year")#output patchoutput_data <-"output_data/industry_sewers_network_discharge/"#save data at the national scalef_save_csv_files( national_network_discharge, output_data, "industry_sewers_network_discharge_georisque_national.csv" )
Code
#function for graph of industries discharge to sewers networksf_graph_final <-function(dataset, nutrient_flow, nutrient_nb_facilities){ g <-plot_grid(ggplot(dataset) +geom_area(aes(Year, !!as.symbol(nutrient_flow), fill=""), alpha=.8) +theme(legend.position ="none") +labs(x="", y=paste("kt of", nutrient_flow, "per year"), caption ="\n",title="Industrial discharge to sewage network in France",subtitle =paste(nutrient_flow, "flow") ),ggplot(dataset) +geom_area(aes(Year, !!as.symbol(nutrient_nb_facilities), fill=""), alpha=.8) +theme(legend.position ="none") +labs(x="", y="", caption = Source,title="",subtitle =paste("nb of facilities reporting", nutrient_flow, "discharge") ) )return(g)}
Below, we plot the GEREP nutrient flows of discharges in sewers networks, at the national and water agency basins scales. Spikes indicate potential outliers, that we investigate. At the national scale, we also compare it to the géorisques data. There are many more facilities reported in the GEREP database than in géorisques, but the difference in actual nutrient flows is much smaller. This indicates that the géorisques database captures the majority of large discharges. However the temporal dynamics are very similar in the 2 database.
The database also reports the discharge downstream of the sewer network, after the WWTP treatment (in yellow on th egraphs below), using actual or estimated WWTP removal efficiencies. We show it as an indication, but will not use it.
Here are the identified outliers and corrections we make, for each nutrient.
Pt
in 2017 for CHAGNY, in and out, factor 100 (select incoming==55879 because there are several Chagny)
in 2010 and 2011 for PORT-SUR-SAONE, does not exist afterwards. We do not correct it
in 2010 and 2011 for VITRY-SUR-SEINE, correct factor 100 in, factor 10 out
NGL
in 2017 for CHAGNY, correct factor 100 (select incoming == 7711302 because there are several Chagny)
in 2019 for AUBIGNY, correct factor 1000 in, factor 10 000 out
in 2013 for ILLZACH, correct factor 10 in and out
DCO
in 2018 for MARSEILLE, correct factor 10 in and out 10 (select incoming==94603000 because there are several Marseille)
MES
in 2018, MARSEILLE, correct factor 10 in and out 10 (select incoming==24927000 because there are several Marseille)
in 2019 for BISCHWILLER, correct factor 10 000 in and out
Code
#PHOSPHORUS#CHAGNY in, 2017GEREP$incoming[ GEREP$Year==2017& GEREP$name_COM =="CHAGNY"& GEREP$nutrient=="Phosphore total (P)"& GEREP$incoming==55879 ] <- GEREP$incoming[ GEREP$Year==2017& GEREP$name_COM =="CHAGNY"& GEREP$nutrient=="Phosphore total (P)"& GEREP$incoming==55879 ]/100#CHAGNY out, 2017GEREP$discharged[ GEREP$Year==2017& GEREP$name_COM =="CHAGNY"& GEREP$nutrient=="Phosphore total (P)"& GEREP$incoming==55879 ] <- GEREP$discharged[ GEREP$Year==2017& GEREP$name_COM =="CHAGNY"& GEREP$nutrient=="Phosphore total (P)"& GEREP$incoming==55879 ]/100#VITRY-SUR-SEIN in, 2010 and 2011GEREP$incoming[ GEREP$Year%in%c(2010, 2011) & GEREP$name_COM =="VITRY-SUR-SEINE"& GEREP$nutrient=="Phosphore total (P)" ] <- GEREP$incoming[ GEREP$Year%in%c(2010, 2011) & GEREP$name_COM =="VITRY-SUR-SEINE"& GEREP$nutrient=="Phosphore total (P)" ]/100#VITRY-SUR-SEIN out, 2010 and 2011GEREP$discharged[ GEREP$Year%in%c(2010, 2011) & GEREP$name_COM =="VITRY-SUR-SEINE"& GEREP$nutrient=="Phosphore total (P)" ] <- GEREP$discharged[ GEREP$Year%in%c(2010, 2011) & GEREP$name_COM =="VITRY-SUR-SEINE"& GEREP$nutrient=="Phosphore total (P)" ]/10#NITROGEN#CHAGNY in, 2017GEREP$incoming[ GEREP$Year==2017& GEREP$name_COM =="CHAGNY"& GEREP$nutrient=="Azote total (N)"& GEREP$incoming==7711302 ] <- GEREP$incoming[ GEREP$Year==2017& GEREP$name_COM =="CHAGNY"& GEREP$nutrient=="Azote total (N)"& GEREP$incoming==7711302 ]/100#AUBIGNY in, 2019GEREP$incoming[ GEREP$Year==2019& GEREP$name_COM =="AUBIGNY"& GEREP$nutrient=="Azote total (N)" ] <- GEREP$incoming[ GEREP$Year==2019& GEREP$name_COM =="AUBIGNY"& GEREP$nutrient=="Azote total (N)" ]/1000#AUBIGNY out, 2019GEREP$discharged[ GEREP$Year==2019& GEREP$name_COM =="AUBIGNY"& GEREP$nutrient=="Azote total (N)" ] <- GEREP$discharged[ GEREP$Year==2019& GEREP$name_COM =="AUBIGNY"& GEREP$nutrient=="Azote total (N)" ]/10000#ILLZACH in, 2013GEREP$incoming[ GEREP$Year==2013& GEREP$name_COM =="ILLZACH"& GEREP$nutrient=="Azote total (N)" ] <- GEREP$incoming[ GEREP$Year==2013& GEREP$name_COM =="ILLZACH"& GEREP$nutrient=="Azote total (N)" ]/10#ILLZACH out, 2013GEREP$discharged[ GEREP$Year==2013& GEREP$name_COM =="ILLZACH"& GEREP$nutrient=="Azote total (N)" ] <- GEREP$discharged[ GEREP$Year==2013& GEREP$name_COM =="ILLZACH"& GEREP$nutrient=="Azote total (N)" ]/10#DCO#MARSEILLE in, 2018 GEREP$incoming[ GEREP$Year==2018& GEREP$name_COM =="MARSEILLE"& GEREP$nutrient=="Demande chimique en oxygène (DCO)"& GEREP$incoming==94603000 ] <- GEREP$incoming[ GEREP$Year==2018& GEREP$name_COM =="MARSEILLE"& GEREP$nutrient=="Demande chimique en oxygène (DCO)"& GEREP$incoming==94603000 ]/10#MARSEILLE out, 2018 GEREP$discharged[ GEREP$Year==2018& GEREP$name_COM =="MARSEILLE"& GEREP$nutrient=="Demande chimique en oxygène (DCO)"& GEREP$incoming==94603000 ] <- GEREP$discharged[ GEREP$Year==2018& GEREP$name_COM =="MARSEILLE"& GEREP$nutrient=="Demande chimique en oxygène (DCO)"& GEREP$incoming==94603000 ]/10#MES#MARSEILLE in, 2018 GEREP$incoming[ GEREP$Year==2018& GEREP$name_COM =="MARSEILLE"& GEREP$nutrient=="Matières en suspension (MES)"& GEREP$incoming==24927000 ] <- GEREP$incoming[ GEREP$Year==2018& GEREP$name_COM =="MARSEILLE"& GEREP$nutrient=="Matières en suspension (MES)"& GEREP$incoming==24927000 ]/10#MARSEILLE out, 2018 GEREP$discharged[ GEREP$Year==2018& GEREP$name_COM =="MARSEILLE"& GEREP$nutrient=="Matières en suspension (MES)"& GEREP$incoming==24927000 ] <- GEREP$discharged[ GEREP$Year==2018& GEREP$name_COM =="MARSEILLE"& GEREP$nutrient=="Matières en suspension (MES)"& GEREP$incoming==24927000 ]/10#BISCHWILLER in, 2019GEREP$incoming[ GEREP$Year==2019& GEREP$name_COM =="BISCHWILLER"& GEREP$nutrient=="Matières en suspension (MES)" ] <- GEREP$incoming[ GEREP$Year==2019& GEREP$name_COM =="BISCHWILLER"& GEREP$nutrient=="Matières en suspension (MES)" ]/10000#BISCHWILLER out, 2019GEREP$discharged[ GEREP$Year==2019& GEREP$name_COM =="BISCHWILLER"& GEREP$nutrient=="Matières en suspension (MES)" ] <- GEREP$discharged[ GEREP$Year==2019& GEREP$name_COM =="BISCHWILLER"& GEREP$nutrient=="Matières en suspension (MES)" ]/10000
We recompute the aggregated (national and basin scale) flows after our corrections on the GEREP database.
Code
#We recompute the flows at the national scale#temporary file, incoming and discharged nutrient flows and number of facilities, at the national scaletemp <- GEREP %>%group_by(Year, nutrient) %>%#transform from kg/year (for each facility) to kt/year (national) for nutrient flowssummarise(incoming =signif(sum(incoming, na.rm=T)/10^6, 3),discharged =signif(sum(discharged, na.rm=T)/10^6, 3),nb_facilities =sum(is.na(nutrient)==F) )#temporary file with incoming flow for each nutrient in columns, at the national scaletemp_in <-f_spread_nutrients(temp %>%select(-c(discharged, nb_facilities)), incoming, "in")#temporary file with discharge flows for each nutrient in columns, at the national scaletemp_out <-f_spread_nutrients(temp %>%select(-c(incoming, nb_facilities)), discharged, "out")#temporary file with number of industrial facilities for each nutrient, at the national scaletemp_nb <-f_spread_nutrients(temp %>%select(-c(incoming, discharged)), nb_facilities, "nb_facilities")#merge together flows in, out, and number of facilitiesGEREP_national <-left_join( temp_in, temp_out, by="Year")GEREP_national <-left_join( GEREP_national, temp_nb, by="Year")# We recompute the flows for each basin#temporary file, incoming and discharges nutrient flows and number of facilities, for each water agency basintemp <- GEREP %>%group_by(Year, basin, nutrient) %>%summarise(#transform from kg/year (for each facility) to kt/year (for each basin) for nutrient flowsincoming =signif(sum(incoming, na.rm=T)/10^6, 3),discharged =signif(sum(discharged, na.rm=T)/10^6, 3),nb_facilities =sum(is.na(nutrient)==F) )#temporary file with incoming flow for each nutrient in columns, for each water agency basintemp_in <-f_spread_nutrients(temp %>%select(-c(discharged, nb_facilities)), incoming, "in")#temporary file with discharge flows for each nutrient in columns, for each water agency basintemp_out <-f_spread_nutrients(temp %>%select(-c(incoming, nb_facilities)), discharged, "out")#temporary file with number of industrial facilities for each nutrient, for each water agency basintemp_nb <-f_spread_nutrients(temp %>%select(-c(incoming, discharged)), nb_facilities, "nb_facilities")#merge together flows in, out, and number of facilitiesGEREP_basins <-left_join( temp_in, temp_out, by=c("Year", "basin"))GEREP_basins <-left_join( GEREP_basins, temp_nb, by=c("Year", "basin"))#remove temporary filesrm(temp_in, temp_out, temp_nb)
In the géorisques data paragraph, we already saved the cleaned géorisques database.
We also save the GEREP data, which is more complete than the géorisques database. Note: with a little more work, it would also be possible to save industrial pollution at the departmental and regional scale.
Code
#output pathoutput_data <-"output_data/industry_sewers_network_discharge/"#aggregate GEREP water agencies basins and national datatemp <-bind_rows( GEREP_national %>%mutate(basin ="Metropolitan France"), GEREP_basins)#savef_save_csv_files( temp, output_data,"industry_sewers_network_discharge_GEREP_basins.csv")# save all facilities of GEREP databasef_save_csv_files( GEREP_all, output_data,"industry_sewers_network_discharge_GEREP_ALL.csv")
We add a metadata file to describe our saved files.
Code
writeLines("The files in this folder report industries discharge to sewers networks. Géorisque database is publicly available, but less complete than GEREP database. GEREP is more complete, but anonymized (industry names and IDs are not reported). We report both databases. All files report pollution for each year for the following nutrients: NGL, Pt, DBO5, DCO, MES.GEORISQUES DATABASE**industry_sewers_network_discharge_georisque_ALL.csv** reports individual industrial facilities discharges to sewers networks, with their name and ID. Pollutant flows are in **kg/year**. **industry_sewers_network_discharge_georisque_national.csv** reports industrial facilities discharges to sewers networks at the national scale. Pollutant flows are in **kt/year**. We also report the number of industrial facilities for each nutrient.GEREP DATABASEWe report both nutrient _in and _out. _in refers to what enters the sewers after industry discharge. _out refers to release downstream, after wastewater treatment plant, based on actual measure or estimation of wastewater treatment plant removal efficiency. We report it as an indication but do not use it. **industry_sewers_network_discharge_GEREP_ALL.csv** reports individual industrial facilities discharges to sewers networks, with their city (COM), department (DEP) and region (REG) names (name_) and INSEE code (INSEE_). There is also an industry facility dummy ID (the data is anonymized). Pollutant flows are in **kg/year**.**industry_sewers_network_discharge_GEREP_basins.csv** reports industrial facilities discharges to sewers networks at the national scale and for each water agency basin. Pollutant flows are in **kt/year**. We also report the number of industrial facilities for each nutrient. ", paste(output_data, "metadata.md") )