How Tap Score Built the City Water Project

In 2022, Tap Score built the City Water Project (CWP), an online data tool that brings together tens of millions of water quality results from across the United States. It’s the only tool of its kind to combine point-of-use water quality tests taken at the tap with publicly available water quality data from utilities.

Its purpose is to be your number one search tool for understanding water quality near you, your customers, or your constituents.

CWP has three main features: 

  • Summary page with a high-level overview of the water system
  • Data page with deep-dive water quality data from the selected water system and nearby point-of-use samples
  • Utility page with details about your water system

Tap Score launched CWP state-by-state, starting with Oklahoma. This allowed us to collect feedback and improve CWP with each release. 

This post summarizes why CWP matters and then details how we collect, clean, and prepare the data to support every town and city across the country.

Why Does City Water Project Matter?

92% of the US is served by a public water system. Today, water systems are more pressured than ever to meet strict regulatory requirements that cover more than 90 different contaminants present in drinking water. There are nearly 50,000 active, community water systems nationwide across a range of town and city sizes, contaminant challenges, and financial capacities. While most systems successfully keep your water safe, many are struggling due to a lack of adequate funding and increasingly polluted source waters.

Utilities are required to post Consumer Confidence Reports (CCRs) yearly, and many states maintain dashboards (often called Drinking Water Watch or DWW). Unfortunately, CCRs are static PDFs and DWW dashboards are web portals that can be difficult to navigate. CWP builds on these efforts, using the same underlying utility data while improving usability, public accessibility, and engagement. CWP is clickable, dynamic, and easy to understand.

CWP also addresses a big gap that states and other water quality dashboards have not addressed yet: tap water quality. Most contaminants tested by your water system are sampled in the distribution lines or at the treatment plant. While this should be a pretty good proxy for your tap in some cases, in others, it’s not.

Take the nationwide challenge of getting lead out of tap water, or the growing issue of mitigating disinfection byproducts. Both depend on distance to the plant, water chemistry, and household plumbing.

Water chemistry is complex, and as a result, water quality at your tap might actually differ from water quality in the water main.

How Did We Do It?

Tap Score built CWP using three primary data sources:

  1. State databases of water quality by system
  2. Federal databases on water systems and their compliance history
  3. Tap Score tap water quality database developed by SimpleLab

Our data scientists and engineers ingest this data into an organized pipeline that transforms and standardizes water quality sample results into a unified dataset that powers CWP. 

For the Data Nerds

For those of you interested in the details, read on:

Tap Water Data

Tap Score™ provides everyone with access to certified, high quality lab testing of tap water. Results are nationwide, with varying levels of sampling and availability. We’re careful in how we aggregate household level results to protect people who test tap water with Tap Score so that no data is traceable to any address. Tap data is aggregated to the water system or city scale and summarized in simple statistics.

Water System Data

Data collection

Tap Score compiled data from U.S. federal and state agencies that manage water system information and water quality sample results under the Safe Drinking Water Act. To get information about compliance status, population served, and other characteristics, Tap Score accessed federal data sources such as Enforcement and Compliance History Online (ECHO) and the federal Safe Drinking Water Information System (SDWIS) databases.

To get water quality sample data, Tap Score compiled water quality results from the previous 10-year period through early-to-mid 2021 from 45 states for the CWP launch. Though this data is publicly available, a majority of the data were collected through FOIA requests because most states did not provide readily available, machine-readable sample data.

Several states either denied our requests (New Hampshire and North Carolina) or simply ignored our FOIAs and emails (Louisiana, New Mexico, and Tennessee). We were able to extract the most recent two years of data from Tennessee’s Drinking Water Watch. Other states had highly incomplete data, including Indiana and South Dakota (whose data we had to request from EPA Region 8 after the state ignored our requests).

We learned a lot in this process. For example, some states responded to our requests within a day to a week, while others took five months. We even became the inspiration for one state to set up an online server for data-sharing, because we didn’t have a machine to process the CD-ROM they wanted to send us! Overall, most state administrators were incredibly friendly, helpful, and interested in our use of the data as a publicly-available online tool.

The Tap Score team first collected data for the prior ten-year period (2011-2021) for chemical and radiological values, and for the prior one-year period (2020) for bacteriological parameters. Actual date ranges vary by state and system depending on sampling schedules and the accuracy of state-provided results.

States with readily available data online are updated quarterly. 

How often water systems sample chemical and bacteriological parameters depends on several factors, for example:

  1. The size of the water system
  2. Regulatory changes in requirements over time
  3. Results from previous sampling campaigns that may require systems to sample more (or less)

Bacteriological parameters are sampled with very high frequency.

While state regulators care primarily about routine compliance samples, states occasionally provide Tap Score with non-compliance sample results. If these results meet tap water criteria, they are included in CWP. This might explain some differences between CWP’s water quality presentation and the results reported on a state website. CWP is a screening tool; it is not a tool for compliance reporting.

When Tap Score receives state data, we take the following steps. We took significant inspiration from how the federal EPA treats the same data sources in its nationwide Six-Year Review of Contaminant Occurrence Data.

Data Wrangling 

Tap Score applies standard data wrangling techniques to state sample data, including data structuring and cleaning.

System Types

Sample data from any non-active or non-community water system is excluded, as these systems are less representative of the tap water typically consumed in an area. Community water systems (CWSs) are defined as systems serving at least 25 people or 15 connections continuously throughout the year.
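As a rough illustration, this filter could be sketched as follows. The field names, system identifiers, and type codes here are assumptions made for the example, not Tap Score's actual schema.

```python
from dataclasses import dataclass


@dataclass
class WaterSystem:
    system_id: str     # hypothetical identifier
    system_type: str   # hypothetical code; "CWS" = community water system
    is_active: bool


def keep_system(system: WaterSystem) -> bool:
    """Keep only active community water systems."""
    return system.is_active and system.system_type == "CWS"


systems = [
    WaterSystem("SYS-1", "CWS", True),
    WaterSystem("SYS-2", "CWS", False),    # inactive: excluded
    WaterSystem("SYS-3", "NTNCWS", True),  # non-community: excluded
]
included = [s.system_id for s in systems if keep_system(s)]
```

Only the first system passes both checks in this toy list.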

Basic Cleaning

Data cleaning happens throughout the entire pipeline, but Tap Score first evaluates incoming raw data for obvious mistakes (e.g., a sample result value of “...”) and removes erroneous or entirely blank rows.

Deduplication

Sample results are considered unique if they have a distinct combination of sample identifier, date, time, and contaminant name. Where a clear unique identifier was provided, Tap Score dropped rows that duplicated the entire row of raw data. Some data were also de-duplicated where the raw data was clearly de-normalized. Where sample identifiers were not provided, Tap Score did not drop perceived duplicates, because some systems sample for the same contaminant twice in the same day, producing what looks like a duplicate reading even though both values are real (this is especially true for bacteriological contaminants like E. coli).
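A minimal sketch of this rule, assuming each raw row has already been parsed into a small record. The `SampleResult` fields and the example rows are hypothetical; the point is only that exact repeats are dropped when an identifier exists and kept when it does not.

```python
from typing import NamedTuple, Optional


class SampleResult(NamedTuple):
    sample_id: Optional[str]  # None when the state provided no identifier
    date: str
    time: str
    contaminant: str
    value: str


def drop_duplicates(rows):
    """Drop exact repeats of rows that carry a sample identifier.

    Rows without an identifier are always kept, since two identical
    readings on the same day may both be real.
    """
    seen = set()
    kept = []
    for row in rows:
        if row.sample_id is not None:
            if row in seen:
                continue  # identical to an earlier row: drop it
            seen.add(row)
        kept.append(row)
    return kept


rows = [
    SampleResult("A1", "2020-03-01", "09:00", "Arsenic", "0.002"),
    SampleResult("A1", "2020-03-01", "09:00", "Arsenic", "0.002"),  # exact repeat
    SampleResult(None, "2020-03-01", "09:00", "E. coli", "0"),
    SampleResult(None, "2020-03-01", "09:00", "E. coli", "0"),      # kept: no identifier
]
deduped = drop_duplicates(rows)
```

Here the repeated Arsenic row is removed, while both E. coli rows survive because nothing proves they are the same sample.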

Column Selection

Tap Score transforms all incoming water quality results to a standard schema. 

Contaminant Matching and Unit Conversions

Each state provided sample data using different contaminant names, identifiers, and units. Tap Score matched incoming results against a master list of contaminant names and performed unit conversions so that results across all states are standardized. Observations with nonsensical units (for example, pH with reported units of mg/L) were discarded.
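The matching and conversion step might look something like the sketch below. The master list, the accepted units, and the conversion factors are small hypothetical stand-ins for illustration; the real lists are far larger and are not described in this post.

```python
# Hypothetical master list: state-specific names mapped to one standard name.
MASTER_NAMES = {
    "arsenic": "Arsenic",
    "arsenic, total": "Arsenic",
    "ph": "pH",
}

# Hypothetical standard unit per contaminant, with factors from each
# accepted reported unit to that standard unit.
UNIT_FACTORS = {
    "Arsenic": ("mg/L", {"mg/l": 1.0, "ug/l": 0.001}),
    "pH": ("pH units", {"ph units": 1.0}),
}


def standardize(name, value, unit):
    """Return (standard_name, converted_value, standard_unit), or None to discard."""
    standard_name = MASTER_NAMES.get(name.strip().lower())
    if standard_name is None:
        return None  # no match in the master list
    standard_unit, factors = UNIT_FACTORS[standard_name]
    factor = factors.get(unit.strip().lower())
    if factor is None:
        return None  # nonsensical unit for this contaminant
    return standard_name, value * factor, standard_unit


arsenic = standardize("Arsenic, Total", 5.0, "ug/L")  # converted to mg/L
bad_ph = standardize("pH", 7.2, "mg/L")               # discarded
```

Keeping the accepted units per contaminant is what lets the pH reading in mg/L be rejected rather than silently converted.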

Non-detects

Many states report the method detection limit or a character value to indicate that a sample result is a non-detect. Other times, states report method detection limits in a separate column and do not specify if the value column is representative of a true value or not. Tap Score validated reported values against reported detection limits and qualifier codes to ensure that values are properly labeled as a detection or a non-detect. Values reported below their detection or reporting limits are coded as non-detects.  
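A simplified sketch of this labeling logic is shown below. The qualifier codes are hypothetical examples (states use many different conventions), and the function only covers the two checks named above: a non-detect qualifier, or a value below its detection limit.

```python
# Hypothetical qualifier codes that mark a result as a non-detect.
NON_DETECT_CODES = {"ND", "<", "U", "BDL"}


def is_non_detect(value, detection_limit=None, qualifier=None):
    """Label a result as a non-detect (True) or a detection (False)."""
    if qualifier is not None and qualifier.strip().upper() in NON_DETECT_CODES:
        return True  # the state flagged it explicitly
    if detection_limit is not None and value < detection_limit:
        return True  # reported below the detection or reporting limit
    return False


below_limit = is_non_detect(0.0005, detection_limit=0.001)
above_limit = is_non_detect(0.002, detection_limit=0.001)
flagged = is_non_detect(0.001, detection_limit=0.001, qualifier="nd")
```

The third case is the common pattern where a state reports the detection limit itself as the value and relies on the qualifier to signal a non-detect.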

MDLs/MRLs/LRLs

Tap Score requested method detection levels, minimum reporting levels, and lab reporting levels for all states. These values are standardized to Tap Score units.

Tap Water

All samples in CWP are representative of “finished” water, i.e., water on its way to or at the tap for consumption. States specify sample location, type, and source types in varying ways. Tap Score uses a rule-based algorithm to identify samples relevant to “tap water”. Sample results are determined to be representative of “tap water” when they are labeled “Finished”, which usually indicates that they were collected at the treatment plant, at the entry point to the distribution system, within the distribution system, and/or at the tap (e.g., in lead sampling). Sample results are discarded as not relevant to tap water when they indicate a measurement of source water, raw water, or typical lab quality-control samples (e.g., spike matrix samples or blanks).
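To give a feel for what a rule-based check like this can look like, here is a toy keyword classifier. The keyword lists and labels are assumptions for illustration only; the actual rules depend on each state's fields and are not published in this post.

```python
# Hypothetical keywords; real state labels vary widely.
EXCLUDE_KEYWORDS = ("raw", "source", "blank", "spike")
TAP_KEYWORDS = ("finished", "entry point", "distribution", "tap")


def is_tap_water(label: str) -> bool:
    """Return True when a sample label looks representative of tap water."""
    text = label.lower()
    # Exclusions win: source water, raw water, and lab QC samples are dropped.
    if any(keyword in text for keyword in EXCLUDE_KEYWORDS):
        return False
    return any(keyword in text for keyword in TAP_KEYWORDS)


labels = [
    "Finished - Entry Point",
    "Distribution System",
    "Raw Water Well 3",
    "Matrix Spike",
]
kept = [label for label in labels if is_tap_water(label)]
```

Checking exclusions first means an ambiguous label is treated as not relevant rather than being counted as tap water.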

Data Analysis Preparation for CWP

In CWP, we present median concentrations for each contaminant tested over the prior 10-year period. Because sampling frequency differs across contaminants and regulations, a contaminant may be tested only a few times over that period. The median is the value in the middle of the distribution of results: half of the results fall at or below it and half fall at or above it. To prepare the data for this calculation, we conduct two critical analyses:

  1. A screen for extreme values
  2. Imputing values for non-detects

Extreme Values

The median is a statistic that inherently buffers against extreme values, or outliers, skewing results. However, not all outliers or extreme values are wrong, and careful validation work is required before “throwing out” data. Tap Score follows scientific practice by flagging extreme values to caveat results rather than removing them. If a contaminant has a maximum contaminant level (MCL), a result is flagged if it is greater than 10 times the MCL. If a contaminant does not have an MCL, a result is flagged if a) there are at least 100 observations nationwide and b) the result is above the 97.5th percentile nationwide. If a contaminant does not have an MCL and there are fewer than 100 tests nationwide, we cannot make a determination about the distribution of the data, and we label this “low density” data. This means that the contaminant is rarely detected or tested for, so we cannot really evaluate whether the underlying results are “normal” concentrations.
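The three flagging rules can be sketched in a few lines. This is an illustrative reading of the rules above, not Tap Score's code: the function name, the flag strings, and the choice of percentile estimator (`statistics.quantiles` with its default method) are all assumptions.

```python
import statistics


def flag_result(value, mcl=None, national_values=()):
    """Return "extreme", "low density", or None for a single result."""
    if mcl is not None:
        # Contaminant has an MCL: flag results above 10x the MCL.
        return "extreme" if value > 10 * mcl else None
    if len(national_values) < 100:
        # No MCL and too few nationwide results to judge the distribution.
        return "low density"
    # No MCL but enough data: compare against the 97.5th percentile.
    # With n=40, the last cut point is the 39/40 = 97.5% quantile.
    p975 = statistics.quantiles(national_values, n=40)[-1]
    return "extreme" if value > p975 else None


# Arsenic has an MCL of 0.010 mg/L, so the threshold is 0.1 mg/L.
high_arsenic = flag_result(0.15, mcl=0.010)
typical_arsenic = flag_result(0.005, mcl=0.010)

# Hypothetical contaminant with no MCL and only 54 nationwide results.
rare = flag_result(1.0, national_values=[0.1] * 54)
```

Note that a flag never removes a result; it only travels with the value so that summaries can report how much of the underlying data was unusual.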

Each of these flags is aggregated when doing summary statistics. For example, if we estimate the median Arsenic concentration in an area based on 10 samples, and 2 of those samples were greater than 10x the MCL, the “percent extreme value” is 20%. Or, take a rarely tested compound, like Glyoxal. If there were 54 samples, and 45 of these were non-detect, the median value would be non-detect. Because there are fewer than 100 results nationwide, the detected values cannot be evaluated as in the Arsenic case, and are thus labeled “low density” to indicate that any detection is pretty interesting, but hard to evaluate relative to other results, because it’s rarely tested.
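The arithmetic in these two examples is simple enough to check directly. The sketch below uses hypothetical flag strings, and for the Glyoxal case it assumes that non-detects sort below every detected value, so the median is a non-detect whenever more than half of the results are non-detects.

```python
def percent_flagged(flags, label):
    """Share of results carrying a given flag, as a percentage."""
    return 100 * sum(flag == label for flag in flags) / len(flags)


# Arsenic: 10 samples, 2 of them above 10x the MCL.
arsenic_flags = ["extreme", "extreme"] + [None] * 8
arsenic_pct = percent_flagged(arsenic_flags, "extreme")  # 20.0

# Glyoxal: 54 samples, 45 of them non-detects. More than half are
# censored, so the middle of the sorted results is a non-detect.
glyoxal_non_detects, glyoxal_total = 45, 54
median_is_non_detect = glyoxal_non_detects > glyoxal_total / 2  # True
```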

Non-detects in Summary Statistics

In the last example we gave, it probably became clear that sometimes, a median value can be considered “non-detect”. Why not zero? When the lab tests for a sample, it can only report values greater than its detection limit.

Technical and methodological limitations mean that detection limits are always greater than zero. When we say that “Arsenic was not detected”, what we really mean is, “The concentration of arsenic is between zero and the detection limit.” This creates a tricky problem when summing up results or trying to calculate things like the average. How should we represent values that we can’t actually quantify?

This problem is well known in environmental statistics: it’s called censored data. While this is a deep and important topic with many nuances, we’ll summarize our approach briefly here. Censored data — or precisely, left-censored data — is usually handled in summary statistics in three ways:

  1. Substitution method 
  2. Kaplan-Meier Method or Maximum likelihood estimation
  3. Distribution sampling methods.

Each of these methods is a form of “imputation”, which effectively creates a numeric value where we don’t have one, i.e., when the test comes back “not detected”. The first method is very common in academic papers and water quality tools. It replaces every non-detect with a single value (typically MDL/2, MDL/√2, or 0).

However, this approach introduces error (the difference between observed and actual values) and performs poorly in statistical analysis, especially when a large proportion of the data is below the MDL. For such reasons, the use of substitution methods has been advised against. If severe censoring is present, some have suggested the Kaplan-Meier method, which assumes no parametric distribution of the underlying data (in fact, it may be impossible to reasonably select shape and scale parameters for a distribution with high censorship). In contrast, Maximum Likelihood Estimation (MLE) is well suited to data that follow a known parametric distribution, because it requires an assumed distributional form.
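A small made-up example shows how sensitive a summary statistic can be to the substitution choice. The detection limit and detected values below are invented for illustration; with 6 of 10 results censored, the median simply equals whatever value was substituted.

```python
import math
import statistics

mdl = 1.0                          # hypothetical detection limit
detections = [1.5, 2.0, 3.0, 4.0]  # hypothetical detected values
n_non_detects = 6                  # 6 of 10 results are censored


def median_with_substitution(substitute):
    """Median after replacing every non-detect with one fixed value."""
    values = [substitute] * n_non_detects + detections
    return statistics.median(values)


medians = {
    "0": median_with_substitution(0.0),
    "MDL/2": median_with_substitution(mdl / 2),
    "MDL/sqrt(2)": median_with_substitution(mdl / math.sqrt(2)),
}
```

The three medians come out as 0, 0.5, and roughly 0.707, so the reported result is determined by the analyst's convention rather than by the data.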

Given the uncertainty in distribution of water quality and microbial data, multiple imputation methods that do not assume a particular distribution are favored over parametric methods (e.g., MLE) and over substitution methods. "Distribution sampling" methods—described in detail here—produce an imputed value by assigning a random value between 0 and the detection limit.

This approach has performed well across all levels of censoring, does not assume a distribution, and is easily interpretable: for all non-detects, impute values by sampling from a uniform distribution between 0 and the MDL. Importantly, this approach does not require assuming lognormal distribution parameters and is scalable across states. Tap Score uses this method to impute values where a result is “not detected”.
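In code, the core of this idea is very small. The sketch below is a minimal illustration using Python's standard `random` module; the tuple layout and the `seed` parameter are assumptions made so the example is reproducible, not a description of Tap Score's pipeline.

```python
import random


def impute_non_detects(results, seed=None):
    """Replace each non-detect with a uniform draw between 0 and its MDL.

    `results` is a list of (value, mdl, is_non_detect) tuples.
    Detected values pass through unchanged.
    """
    rng = random.Random(seed)
    imputed = []
    for value, mdl, is_non_detect in results:
        if is_non_detect:
            imputed.append(rng.uniform(0.0, mdl))
        else:
            imputed.append(value)
    return imputed


results = [
    (0.004, 0.001, False),  # detected
    (None, 0.001, True),    # non-detect
    (None, 0.002, True),    # non-detect with a different MDL
]
values = impute_non_detects(results, seed=42)
```

Because each non-detect is drawn against its own detection limit, the same function works when labs or states report different limits for the same contaminant.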

Yes, it’s in the weeds. But the result is that all non-detects get a value assigned to them known as the “imputed value”. The imputation method allows us to create a value for every result, and calculate summary statistics that don’t over-inflate the resultant median or average.  

Calculating the Median

Once all the data has been cleaned, standardized, flagged, and non-detects imputed, the result is a database that powers CWP. Each contaminant in each water system is summarized and we calculate a median concentration for every contaminant tested.
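As a final illustration, the per-system summary can be sketched as a group-by followed by a median. The system identifiers and values below are hypothetical, and the values are assumed to be already cleaned, standardized, and imputed.

```python
import statistics
from collections import defaultdict


def median_by_system(results):
    """Median value for each (system_id, contaminant) pair.

    `results` is an iterable of (system_id, contaminant, value) tuples.
    """
    grouped = defaultdict(list)
    for system_id, contaminant, value in results:
        grouped[(system_id, contaminant)].append(value)
    return {key: statistics.median(vals) for key, vals in grouped.items()}


results = [
    ("SYS-A", "Arsenic", 0.001),
    ("SYS-A", "Arsenic", 0.002),
    ("SYS-A", "Arsenic", 0.003),
    ("SYS-B", "Nitrate", 4.0),
    ("SYS-B", "Nitrate", 6.0),
]
medians = median_by_system(results)
```

With an odd number of results the median is the middle value; with an even number, `statistics.median` averages the two middle values.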

What If I Don’t See Data in My State?

Tap Score has sample water quality data for water systems in 46 states. If your state does not have data, it is likely that the state did not respond to initial data requests from Tap Score and no one has conducted tap water testing there. You can change this! Learn more about how you can test your water and encourage others in your community to test at mytapscore.com.

States not providing public water quality data through CWP:

  • New Hampshire
  • North Carolina
  • New Mexico
  • Louisiana

States with incomplete (i.e. very little) data:

  • Indiana
  • South Dakota

How to Get Access to the Data

Tap Score offers an API for water quality data across the USA. If you are interested in accessing our API, please reach out to support@mytapscore.com with the subject line “Access To Tap Score Water Quality Data API”.


About The Author

CHIEF SCIENCE OFFICER


Serving as the Chief Science Officer at SimpleLab, Jess Goddard spearheads the scientific program at Tap Score, overseeing all analytical products and services. With a Ph.D. in water resources and a Master's in environmental engineering from UC Berkeley, Jess brings a wealth of expertise to the team. Her leadership ensures the highest standards in our scientific endeavors, contributing to the excellence that defines SimpleLab and Tap Score. When away from her desk, Jess enjoys reading and being outside.