Data Types

PII Crawler currently supports the following PII data types. We plan to support all PII data types defined by CPPA.

U.S. Social Security Number (SSN)

A nine-digit number, usually written in the format NNN-NN-NNNN. The prefix used to carry geographic meaning, but that ended with the randomization process introduced on June 25, 2011, which opened previously unassigned area numbers for assignment, excluding area numbers 000, 666, and 900-999. There are three parts:

  • Area Number (NNN) - Initially assigned based on geographical regions, it indicated the state of application. Since 2011, it has been assigned randomly.

  • Group Number (NN) - This two-digit number ranges from 01 to 99 and is not assigned consecutively. It follows a specific issuing pattern for administrative purposes.

  • Serial Number (NNNN) - This four-digit number is assigned sequentially and can be consecutive.

Examples:

  • Valid format (although this specific number is not a valid SSN): 078-05-1120

  • Not valid: 666-12-1234
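
The structural rules above can be sketched as a small check. This is only an illustration, not PII Crawler's actual implementation: the function name `is_plausible_ssn` is hypothetical, and the rejection of serial 0000 is an assumption that goes beyond the rules listed here.

```python
import re

# NNN-NN-NNNN, with each part captured separately.
SSN_PATTERN = re.compile(r"(\d{3})-(\d{2})-(\d{4})")


def is_plausible_ssn(text: str) -> bool:
    """Return True if text is structurally acceptable as an SSN (hypothetical helper)."""
    match = SSN_PATTERN.fullmatch(text)
    if match is None:
        return False
    area, group, serial = (int(part) for part in match.groups())
    # Area numbers 000, 666 and 900-999 are excluded from assignment.
    if area == 0 or area == 666 or area >= 900:
        return False
    # Group numbers range from 01 to 99.
    if group == 0:
        return False
    # Assumption: serial 0000 is never issued.
    if serial == 0:
        return False
    return True
```

A check like this confirms only the format. A number such as 078-05-1120 passes it even though that specific number is not valid.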

U.S. City, State, Zip Cluster (CSZ)

A cluster is a set of distinct pieces of data that by themselves don’t represent much but when found or linked together can produce something meaningful.

90210 by itself doesn’t mean much but when 90210 is found near the words Beverly Hills we know we have a city and zip code. PII Crawler uses this clustering method to find City, State, and Zip codes.
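
A minimal sketch of the clustering idea, assuming a ZIP-to-city lookup table and a fixed character window. Both `ZIP_TO_CITY` and `find_csz_clusters` are hypothetical names used for illustration; how PII Crawler actually builds its clusters is not described here.

```python
import re

# Hypothetical lookup table; a real one would cover every ZIP code.
ZIP_TO_CITY = {"90210": ("Beverly Hills", "CA")}

ZIP_PATTERN = re.compile(r"\b\d{5}\b")


def find_csz_clusters(text: str, window: int = 40):
    """Return (city, state, zip) tuples where a known ZIP appears near its city name."""
    clusters = []
    lowered = text.lower()
    for match in ZIP_PATTERN.finditer(text):
        known = ZIP_TO_CITY.get(match.group())
        if known is None:
            continue
        city, state = known
        # Look for the city name within `window` characters on either side of the ZIP.
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        if city.lower() in lowered[start:end]:
            clusters.append((city, state, match.group()))
    return clusters
```

With this sketch, "Ship to Beverly Hills, CA 90210" yields one cluster, while "Order 90210 shipped" yields none because the city name is absent.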

Street Address

Meaningful street addresses are often found near CSZ clusters.

First Name

PII Crawler uses common name lists and named-entity recognition (NER) techniques to find names.

Last Name

PII Crawler uses common name lists and named-entity recognition (NER) techniques to find names.

Email Address

PII Crawler uses a finite-state machine (FSM) to find email addresses.

US Passport

Begins with a letter followed by eight digits.
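
As a rough illustration, the format can be expressed as a regular expression. This is a sketch only: the name `find_passport_candidates` is hypothetical, and a pattern this short will also match unrelated identifiers, so surrounding context would be needed in practice.

```python
import re

# One letter followed by exactly eight digits, bounded on both sides.
PASSPORT_PATTERN = re.compile(r"\b[A-Za-z]\d{8}\b")


def find_passport_candidates(text: str):
    """Return substrings that match the letter-plus-eight-digits shape."""
    return PASSPORT_PATTERN.findall(text)
```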

Credit Card

PII Crawler uses a custom FSM to find credit card numbers. It checks for valid IIN prefixes, lengths, and a checksum digit.
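
The checksum step is commonly done with the Luhn algorithm. The sketch below assumes Luhn and a 13 to 19 digit length range, and leaves out the IIN prefix check. The function name `luhn_checksum_ok` is hypothetical and this is not taken from PII Crawler's source.

```python
def luhn_checksum_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep ASCII digits only, so spaces and dashes are ignored.
    digits = [int(ch) for ch in number if ch in "0123456789"]
    # Assumption: card numbers are between 13 and 19 digits long.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

For example, the widely used test number 4111 1111 1111 1111 passes, while changing its last digit to 2 makes it fail.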

Driver’s License

PII Crawler uses a combination of FSM and Aho-Corasick multi-pattern matching to find driver’s license numbers. It uses a custom FSM to check for valid DLN formats and then uses an Aho-Corasick multi-pattern matcher to find the numbers near terms like “drivers license” or “driver’s license”.
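
The two-stage approach might look roughly like the sketch below. Everything in it is illustrative: a regular expression stands in for the custom FSM, the candidate format (one letter plus seven digits) is just one assumed DLN shape, and `build_automaton`, `find_keywords`, and `find_license_numbers` are hypothetical names.

```python
import re
from collections import deque

# Assumed candidate format; real DLN formats vary by state.
DLN_CANDIDATE = re.compile(r"\b[A-Za-z]\d{7}\b")
KEYWORDS = ["drivers license", "driver's license", "driver\u2019s license"]


def build_automaton(keywords):
    """Build Aho-Corasick goto, failure, and output tables."""
    goto, fail, output = [{}], [0], [[]]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)
    # Breadth-first pass to fill in failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_keywords(text, automaton):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, output = automaton
    state, hits = 0, []
    for index, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((index - len(word) + 1, word))
    return hits


def find_license_numbers(text, window=40):
    """Keep DLN-shaped candidates that start within `window` characters of a keyword."""
    keyword_hits = find_keywords(text.lower(), build_automaton(KEYWORDS))
    return [
        match.group()
        for match in DLN_CANDIDATE.finditer(text)
        if any(abs(match.start() - pos) <= window for pos, _ in keyword_hits)
    ]
```

Requiring a nearby keyword is what keeps a short pattern like this from flagging every part number or order ID that happens to share the shape.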

AWS Credentials

PII Crawler uses a custom FSM built around AWS unique ID prefixes to find AWS credentials.
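
As an illustration of prefix-based matching, the sketch below looks for access key IDs only. It assumes the `AKIA` (long-term access key) and `ASIA` (temporary access key) prefixes followed by 16 uppercase alphanumeric characters. The prefix list PII Crawler actually uses is not stated here, and `find_aws_access_key_ids` is a hypothetical name.

```python
import re

# Assumed shape: 4-character prefix plus 16 uppercase letters or digits (20 total).
AWS_KEY_ID_PATTERN = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")


def find_aws_access_key_ids(text: str):
    """Return substrings shaped like AWS access key IDs."""
    return AWS_KEY_ID_PATTERN.findall(text)
```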

Custom Regex

You can specify your own custom regex rules to match your specific data types. Add them to the custom_regex.json file in the same directory as the piicrawler binary. Matches are prefixed with regex_ in the report. For example, a rule named myrule1 is reported as regex_myrule1.

Sample custom_regex.json:

{
  "myrule1": {
    "regex": "your regex here",
    "description": "description of the regex"
  },
  "my_other_rule": {
    "regex": "another regex here",
    "description": "description of the other regex"
  }
}
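
To try rules out before adding them to the file, a small script along these lines can help. It is a sketch under assumptions: it uses Python's `re` module, whose syntax may differ from the regex engine piicrawler uses, and `compile_rules` and `scan` are hypothetical helpers, not part of the tool.

```python
import json
import re


def compile_rules(json_text: str):
    """Parse custom_regex.json content into {report_label: compiled_pattern}."""
    raw = json.loads(json_text)
    # Mirror the report naming convention: regex_ + rule name.
    return {f"regex_{name}": re.compile(rule["regex"]) for name, rule in raw.items()}


def scan(rules, text: str):
    """Return {report_label: [matches]} for every rule with at least one match."""
    results = {}
    for label, pattern in rules.items():
        matches = [match.group(0) for match in pattern.finditer(text)]
        if matches:
            results[label] = matches
    return results
```

Note that backslashes must be escaped inside JSON strings, so a pattern such as `EMP-\d{4}` is written as `"EMP-\\d{4}"` in the file.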
