How SimpleLab Built City Water Project
Bringing together water quality data from tests in homes and public water utility infrastructure across the United States.
What is City Water Project?
Over the last year, SimpleLab built the City Water Project (CWP). CWP is an online data tool bringing together tens of millions of water quality results across the United States. Two things make CWP stand out. First, it’s the only tool of its kind to combine point-of-use water quality tests at the tap with publicly available water quality data from utilities. Second, it updates nightly to reflect new tap water test results wherever they’re taken.
In short, CWP is comprehensive and current.
Its purpose is to be your number one search tool for understanding what the water quality is like near you, your customers, or your constituents.
CWP has three main features:
SimpleLab is launching CWP state-by-state, starting with Oklahoma. Releasing states one-by-one will allow us to collect feedback and improve CWP with each release. Are you a system leader or town manager looking to innovate in your water quality transparency and communication? Do you have feedback to give? Reach out to SimpleLab Support (cwp@gosimplelab.com).
This post summarizes why CWP matters and then lays out the nitty gritty details about how we collect, clean, and prepare the data to support every town and city across the country.
Why Does City Water Project Matter?
92% of the US population is served by a public water system. Today, water systems are under more pressure than ever to meet strict regulatory requirements that cover more than 90 different contaminants present in drinking water. While most systems successfully keep your water safe, many are struggling due to a lack of adequate funding and increasingly polluted source waters. So there’s pressure on water systems from all sides: as environmental quality goes down, water becomes harder to clean, regulations pile up, and costs climb.
Beyond our collective wallet, this scenario squeezes our progress toward transparent and open data communication. There are nearly 50,000 active, community water systems nationwide across a range of town and city sizes, contaminant challenges, and financial capacities. It’s no wonder it’s so hard to get a clear view of water quality.
Utilities are required to post Consumer Confidence Reports (CCRs) yearly, and many states maintain dashboards (often called Drinking Water Watch or DWW). Unfortunately, CCRs and DWW are either static PDFs or web portals that require many clicks to reach the underlying information. CWP builds on these efforts, using the same underlying utility data, while improving usability, public accessibility, and engagement. CWP is clickable, dynamic, and easy to understand.
CWP also addresses a big gap that states and other water quality dashboards have not addressed yet. And that’s the question about tap water quality. Most contaminants tested by your water system are sampled in the distribution lines or at the treatment plant. While this should be a pretty good proxy for your tap in some cases, in others, it’s not. Just take the nationwide challenge to get lead out of the tap. Or the growing issue of mitigating disinfection byproducts, which change in concentration throughout the system, depending on distance to the plant, temperature, and the presence of organic matter.
Water chemistry is complex, and as a result, water quality at your tap might actually differ from water quality in the water main. But it’s hard to know how big these differences are—and there’s nowhere to turn to look at tap water quality nationwide. Until CWP.
How Did We Do It?
SimpleLab built CWP using three primary data sources: state databases of water quality by system, federal databases on water systems and their compliance history, and SimpleLab’s own tap water quality database.
Our data scientists and engineers ingest this data into an organized pipeline that transforms and standardizes water quality sample results into a unified dataset that powers CWP.
For the Data Nerds
We know, we know: get to the methods! For those of you interested in the details—read on. Otherwise, skip to the end.
Tap Water Data
SimpleLab powers consumer testing company Tap Score™, which gives everyone access to certified, high quality lab testing at their tap. Results are nationwide, with varying levels of sampling and availability. We’re careful in how we aggregate household level results to protect people who test tap water with Tap Score, so that no data is traceable to any address. Indeed, data is aggregated to the water system or city scale and summarized in simple statistics.
Water System Data
Data collection
SimpleLab compiles data from U.S. federal and state agencies that manage water system information as well as water quality sample results as part of the Safe Drinking Water Act. To get information about compliance status, population served, and other characteristics, SimpleLab uses federal data sources like Enforcement and Compliance History Online (ECHO) and the federal Safe Drinking Water Information System (SDWIS) databases. To get water quality sample data, SimpleLab compiled water quality results from the previous 10-year period through early-to-mid 2021 from 45 states for the CWP launch. Though this data is publicly available, a majority of the data were collected through FOIA requests, because most states did not provide readily available, machine-readable sample data. Several states denied our requests (including New Hampshire and North Carolina) or simply ignored our FOIAs and emails (Louisiana, New Mexico, and Tennessee). We were able to extract the most recent two years of data from Tennessee’s Drinking Water Watch. Other states, including South Dakota (which we had to request from EPA Region 8 after the state ignored our requests) and Indiana, had highly incomplete data.
We learned a lot in this process. For example, some states responded to our requests within a day to a week, while others took five months. We even became the inspiration for one state to set up an online server for data-sharing, because we didn’t have a machine to process the CD-ROM they wanted to send us! Overall, most state administrators were incredibly friendly, helpful, and interested in our use of the data as a publicly-available online tool.
SimpleLab first collected data for the prior ten year period (2011-2021) for chemical and radiological values, and for the prior one year period (2020) for bacteriological parameters. Actual date ranges vary by state and system depending on sampling schedules and the accuracy of state-provided results.
States with readily available data online are updated quarterly.
Water systems sample chemical and bacteriological parameters based on factors that vary depending on, for example, (1) the size of the water system; (2) regulatory changes in requirements over time; and (3) results from previous sampling campaigns that may require systems to sample more (or less). Bacteriological parameters are sampled with very high frequency.
While state regulators care primarily about routine compliance samples, states occasionally provide SimpleLab with non-compliance sample results. If these results meet tap water criteria, they are included in CWP. This might explain some differences between CWP’s water quality presentation and the results reported on a state website. CWP is a screening tool, and not a tool for compliance reporting.
When SimpleLab receives state data, we take the following steps. There’s a lot of detail to pack in, so we keep it high-level. We took significant inspiration from how the federal EPA treats the exact same data sources in their Six-Year Review of Contaminant Occurrence Data nationwide.
Data Wrangling
SimpleLab applies standard data wrangling techniques to state sample data, including data structuring and cleaning. Below we detail some of the key steps.
System Types
Sample data from any water system that is not an active community water system (CWS) is excluded, as these systems are less representative of the tap water typically consumed in an area. CWSs are defined as those systems serving at least 25 people or 15 connections continuously throughout the year.
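As a rough illustration, the system-type filter might look like the sketch below. The field names (`pwsid`, `activity_status`, `system_type`) and their values are hypothetical stand-ins, not the actual columns in SimpleLab's pipeline or in any state file.

```python
# Minimal sketch of the system-type filter.
# Field names and values are assumptions made for illustration only.

def is_active_cws(system):
    """Return True only for active community water systems."""
    return (
        system.get("activity_status") == "Active"
        and system.get("system_type") == "CWS"
    )


systems = [
    {"pwsid": "SYS001", "activity_status": "Active", "system_type": "CWS"},
    {"pwsid": "SYS002", "activity_status": "Inactive", "system_type": "CWS"},
    {"pwsid": "SYS003", "activity_status": "Active", "system_type": "NTNCWS"},
]

# Only sample results from systems in this set would move on in the pipeline.
kept_ids = {s["pwsid"] for s in systems if is_active_cws(s)}
```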
Basic Cleaning
Data cleaning steps happen throughout the entire pipeline, but SimpleLab carefully evaluates incoming raw data for any obvious mistakes, e.g., a sample result value of “..”, and removes erroneous or entirely blank rows.
Deduplication
A sample result is considered unique when its combination of sample identifier, date, time, and contaminant name is unique. Where a clear unique identifier is provided, SimpleLab drops rows that duplicate an entire row of raw data. Some data are also de-duplicated if the raw data is clearly de-normalized. Where sample identifiers were not provided, SimpleLab did not drop perceived duplicate data, because some systems will sample for the same contaminant twice in the same day, producing what looks like a duplicate reading even though the value is real (this is especially true with bacteriological contaminants like E. coli).
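A minimal sketch of this rule, assuming each raw row is a dictionary of strings with a hypothetical `sample_id` field: rows with an identifier are de-duplicated on the whole row, and rows without one are always kept.

```python
# Sketch only: "sample_id" and the other field names are illustrative assumptions.

def drop_exact_duplicates(rows):
    """Drop whole-row duplicates, but only for rows that carry a sample identifier."""
    seen = set()
    kept = []
    for row in rows:
        if not row.get("sample_id"):
            # No identifier: a repeated reading may be real, so keep it.
            kept.append(row)
            continue
        key = tuple(sorted(row.items()))  # the entire row acts as the key
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


arsenic = {"sample_id": "A1", "date": "2020-01-01", "time": "08:00",
           "contaminant": "Arsenic", "value": "0.002"}
ecoli = {"sample_id": "", "date": "2020-01-01", "time": "",
         "contaminant": "E. coli", "value": "0"}

cleaned = drop_exact_duplicates([arsenic, dict(arsenic), ecoli, dict(ecoli)])
```

Here the repeated arsenic row is dropped, while both E. coli rows survive because they carry no identifier.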
Column Selection
SimpleLab transformed all incoming water quality results to a standard schema.
Data Standardization
Contaminant Matching and Unit Conversions
Each state provides sample data using different contaminant names, identifiers, and units. SimpleLab matches incoming results with a master list of contaminant names and performs unit conversions on observations so that results across all states are standardized. Observations with nonsensical units—for example pH with reported units of mg/L—are discarded.
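To make the matching and conversion step concrete, here is a small sketch. The alias table, the choice of mg/L as the standard unit, and the accepted pH unit labels are all assumptions for illustration; the real master list is much larger.

```python
# Sketch only: the lookup tables below are illustrative, not SimpleLab's master list.

ALIASES = {
    "arsenic": "Arsenic",
    "arsenic, total": "Arsenic",
    "ph": "pH",
}
STANDARD_UNIT = {"Arsenic": "mg/L", "pH": "pH units"}
TO_MG_PER_L = {"mg/l": 1.0, "ug/l": 1e-3, "ng/l": 1e-6}
PH_UNIT_LABELS = {"", "ph", "ph units", "su"}


def standardize(name, value, unit):
    """Return (canonical_name, converted_value, standard_unit), or None to discard."""
    canonical = ALIASES.get(name.strip().lower())
    if canonical is None:
        return None  # name not found in the master list
    target = STANDARD_UNIT[canonical]
    unit_key = unit.strip().lower()
    if target == "mg/L":
        factor = TO_MG_PER_L.get(unit_key)
        if factor is None:
            return None  # unit cannot be converted to a concentration
        return canonical, value * factor, target
    # pH is not a concentration, so a concentration unit such as mg/L is nonsensical.
    if unit_key in PH_UNIT_LABELS:
        return canonical, value, target
    return None
```

For example, `standardize("ARSENIC, TOTAL", 5.0, "ug/L")` is converted to mg/L, while `standardize("pH", 7.2, "mg/L")` returns `None` and the observation is discarded.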
Non-detects
Many states report the method detection limit or a character value to indicate that a sample result is a non-detect. Other times, states report method detection limits in a separate column and do not specify if the value column is representative of a true value or not. SimpleLab has validated reported values against reported detection limits and qualifier codes to ensure that values are properly labeled as a detection or a non-detect. Values reported below their detection or reporting limits are coded as non-detects.
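A simplified version of this check could look like the following. The qualifier codes are hypothetical examples, since each state uses its own conventions.

```python
# Sketch only: these qualifier codes are assumed examples, not a real state code list.
NON_DETECT_CODES = {"ND", "<", "U", "BDL"}


def is_non_detect(value, detection_limit=None, qualifier=None):
    """Label a result as a non-detect using its qualifier code or detection limit."""
    if qualifier is not None and qualifier.strip().upper() in NON_DETECT_CODES:
        return True
    if detection_limit is not None and value < detection_limit:
        # Reported below the detection or reporting limit.
        return True
    return False
```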
MDLs/MRLs/LRLs
SimpleLab requests method detection levels, minimum reporting levels, and lab reporting levels for all states. These values are standardized to SimpleLab units.
Tap Water
All samples in CWP are representative of “finished” water—i.e., water on its way to, or at the tap for consumption. States specify sample location, type, and source types in varying ways. SimpleLab uses a rule-based algorithm to verify samples relevant for “tap water”. Sample results are determined to be representative of “tap water” when they are labeled “Finished”, which usually indicates that they are collected at the treatment plant, entry point to the distribution system, within the distribution system, and/or at the tap (e.g., in lead sampling). Sample results are discarded as not relevant to tap water when they indicate measure of source water, raw water, or typical lab collection processes (e.g., spike matrix samples or blanks).
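The rule-based check can be pictured as keyword matching on the state-provided labels. The keyword lists below are assumptions for illustration; actual state labels vary widely and the real rules are more detailed.

```python
# Sketch only: keyword lists are illustrative assumptions.
EXCLUDE_KEYWORDS = ("raw", "source", "blank", "spike")
INCLUDE_KEYWORDS = ("finished", "entry point", "distribution", "tap")


def is_tap_water(label):
    """Return True when a sample label indicates finished water."""
    text = label.lower()
    # Exclusions win: source water, raw water, and lab QC samples are never tap water.
    if any(keyword in text for keyword in EXCLUDE_KEYWORDS):
        return False
    return any(keyword in text for keyword in INCLUDE_KEYWORDS)
```

Checking exclusions first means a label such as "Raw Water Tap" is discarded rather than kept, and a label that matches neither list is treated as not relevant.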
Data Analysis Preparation for CWP
In CWP, we present median concentrations for each contaminant tested over the prior 10-year period. Because sampling frequency differs across contaminants and regulations, a contaminant may be tested only a few times over that 10-year period. The median value is the number in the middle of the distribution of results. More technically, half of the results fall at or below the median and half fall at or above it. To prepare the data for this calculation, we conduct two critical analyses: 1) a screen for extreme values; and 2) imputing values for non-detects.
Extreme Values
The median is a statistic that inherently buffers against potential extreme values, or outliers, skewing results. However, not all outliers or extreme values are wrong, and careful validation work is required before “throwing out” data. SimpleLab follows scientific practice by not removing extreme values but by flagging them to caveat results. If a contaminant has a maximum contaminant level (MCL), the result is flagged if it is greater than 10 times the MCL. If a contaminant does not have an MCL, the result is flagged if a) there are at least 100 observations nationwide and b) the result is more than the 97.5th percentile nationwide. If a contaminant does not have an MCL, and there are fewer than 100 tests nationwide, we cannot make a determination about the distribution of the data and label this “low density” data. This means that the contaminant is rarely detected or tested for, so we cannot really evaluate whether or not the underlying results are considered “normal” concentrations.
Each of these flags is aggregated when doing summary statistics. For example, if we estimate the median Arsenic concentration in an area based on 10 samples, and 2 of those samples were greater than 10x the MCL, the “percent extreme value” is 20%. Or, take a rarely tested compound, like Glyoxal. If there were 54 samples, and 45 of these were non-detect, the median value would be non-detect. Because there are fewer than 100 results nationwide, the detected values cannot be evaluated as in the Arsenic case, and are thus labeled “low density” to indicate that any detection is pretty interesting, but hard to evaluate relative to other results, because it’s rarely tested.
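The flagging rules and the “percent extreme value” summary can be sketched as follows. The function names and the linear-interpolation percentile are our own illustrative choices; the source does not specify how the percentile is computed.

```python
# Sketch only: names and the percentile method are assumptions for illustration.

def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def flag_result(value, mcl=None, national_values=()):
    """Return 'extreme', 'low density', or 'ok' for a single result."""
    if mcl is not None:
        return "extreme" if value > 10 * mcl else "ok"
    if len(national_values) < 100:
        return "low density"  # too few tests nationwide to judge the distribution
    return "extreme" if value > percentile(national_values, 97.5) else "ok"


def percent_extreme(flags):
    """Share of results flagged as extreme, in percent."""
    return 100 * sum(flag == "extreme" for flag in flags) / len(flags)


# Arsenic has an MCL of 0.010 mg/L, so results above 0.1 mg/L are flagged.
arsenic_results = [0.2, 0.15] + [0.003] * 8
arsenic_flags = [flag_result(v, mcl=0.010) for v in arsenic_results]
```

With 2 of the 10 results above 10 times the MCL, `percent_extreme(arsenic_flags)` gives 20%, matching the Arsenic example above.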
Non-detects in Summary Statistics
In the last example we gave, it probably became clear that sometimes, a median value can be considered “non-detect”. Why not zero? When the lab tests for a sample, it can only report values greater than its detection limit. Technical and methodological limitations mean that detection limits are always greater than zero. When we say that “Arsenic was not detected”, what we really mean is, “The concentration of arsenic is between zero and the detection limit.” This creates a tricky problem when summing up results or trying to calculate things like the average. How should we represent values that we can’t actually quantify?
This problem is well known in environmental statistics–it’s called censored data. While this is a deep and important topic with many nuances, we’ll summarize our approach briefly here. Censored data–or precisely, left-censored data–is usually handled in summary statistics in three ways:
- Substitution method
- Kaplan-Meier Method or Maximum likelihood estimation
- Distribution sampling methods.
Each of these methods is a form of “imputation”, which effectively creates a numeric value where we don’t have one, i.e., when the test comes back “not detected”. The first method is very common in academic papers and water quality tools. It replaces every non-detect with a single value (typically MDL/2, MDL/√2, or 0). However, this approach introduces error (observed minus actual values) and performs poorly in statistical analysis, especially when a large proportion of the data is below the MDL. For these reasons, the use of substitution methods has been advised against. If severe censoring is present, some have suggested the Kaplan-Meier method, which assumes no parametric distribution of the underlying data (in fact, it may be impossible to reasonably select shape and scale parameters for a distribution with high censorship). In contrast, Maximum Likelihood Estimation (MLE) requires an assumed distributional form, so it is well-suited only to data that plausibly follow a parametric distribution.
Given the uncertainty in the distribution of water quality and microbial data, multiple imputation methods that do not assume a particular distribution are favored over parametric methods (e.g., MLE) and over substitution methods. “Distribution sampling” methods produce an imputed value by assigning a random value between 0 and the detection limit. This approach has performed well across all levels of censoring, does not assume a distribution for the underlying data, and is easily interpretable: for all non-detects, impute values by sampling from a uniform distribution between 0 and the MDL. Importantly, this approach does not require assuming lognormal distribution parameters and is scalable across states. SimpleLab uses this method to impute values where a result is “not detected”.
Yes—it’s in the weeds. But the result is that all non-detects get a value assigned to them known as the “imputed value”. The imputation method allows us to create a value for every result, and calculate summary statistics that don’t over-inflate the resultant median or average.
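The distribution sampling approach described above could be sketched like this. The tuple layout `(value, mdl, is_non_detect)` and the function name are assumptions for illustration, not SimpleLab's actual code.

```python
import random

# Sketch only: the input layout (value, mdl, is_non_detect) is an assumption.

def impute_non_detects(results, rng=None):
    """Replace each non-detect with a draw from Uniform(0, MDL); keep detections as-is."""
    rng = rng or random.Random()
    imputed = []
    for value, mdl, non_detect in results:
        if non_detect:
            imputed.append(rng.uniform(0, mdl))
        else:
            imputed.append(value)
    return imputed


results = [
    (0.004, 0.001, False),  # detected above the MDL
    (None, 0.001, True),    # not detected
    (None, 0.001, True),    # not detected
]
# Passing a seeded generator makes the imputed values reproducible between runs.
imputed = impute_non_detects(results, rng=random.Random(42))
```

Every imputed value lands between 0 and the MDL, so no non-detect is pushed to exactly zero or to the limit itself by construction.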
Calculating the Median
Once all the data has been cleaned, standardized, flagged, and non-detects imputed, the result is a database that powers CWP. Each contaminant in each water system is summarized and we calculate a median concentration for every contaminant tested.
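As a final illustration, the per-system summary amounts to grouping results by water system and contaminant, then taking the median of each group. The row layout and the system identifiers below are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Sketch only: rows are assumed to be (system_id, contaminant, value) tuples,
# with non-detects already replaced by imputed values.

def median_by_system(rows):
    """Median concentration for every (system, contaminant) pair."""
    groups = defaultdict(list)
    for system_id, contaminant, value in rows:
        groups[(system_id, contaminant)].append(value)
    return {key: median(values) for key, values in groups.items()}


rows = [
    ("SYS001", "Arsenic", 1.0),
    ("SYS001", "Arsenic", 3.0),
    ("SYS001", "Arsenic", 2.0),
    ("SYS002", "Nitrate", 4.0),
    ("SYS002", "Nitrate", 6.0),
]
medians = median_by_system(rows)
```

With an even number of results, `statistics.median` averages the two middle values, so the Nitrate group above yields 5.0.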
What If I Don’t See Data in My State?
SimpleLab has sample water quality data for water systems in 46 states. If your state does not have data, it is likely that the state did not respond to initial data requests from SimpleLab and no one has conducted tap water testing there. You can change this! Learn more about how you can test your water and encourage others in your community to test at mytapscore.com.
States not providing public water quality data through CWP:
- New Hampshire
- North Carolina
- New Mexico
- Louisiana
States with incomplete (i.e. very little) data:
- Indiana
- South Dakota
How to Get Access to the Data
SimpleLab offers an API to water quality data across the USA. If you are interested in accessing our API, please reach out: hello@gosimplelab.com with the subject line “Access To SimpleLab Water Quality Data API”.