PII Crawler currently supports the following PII data types. We plan to support all PII data types defined by CPPA.
A Social Security number (SSN) is a nine-digit number, usually written in the format NNN-NN-NNNN. The prefix used to carry geographic meaning, but that meaning was removed when the Social Security Administration began randomizing assignments on June 25, 2011. Randomization also opened previously unassigned area numbers for assignment, excluding area numbers 000, 666, and 900-999. An SSN has three parts:
Area Number (NNN) - Initially assigned based on geographical regions, it indicated the state of application. Since 2011, it has been assigned randomly.
Group Number (NN) - This two-digit number ranges from 01 to 99 and is not assigned consecutively. It follows a specific issuing pattern for administrative purposes.
Serial Number (NNNN) - This four-digit number is assigned sequentially and can be consecutive.
Valid format (although this specific number is itself not valid): 078-05-1120
Not valid (area number 666 is excluded): 666-12-1234
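The structural rules above can be sketched as a small format check. This is a minimal illustration, not PII Crawler's implementation: the function name looks_like_ssn is hypothetical, only the NNN-NN-NNNN layout is handled, and the rule that the serial number cannot be 0000 is an added assumption based on SSA numbering.

```python
import re

# Only the common NNN-NN-NNNN layout is handled in this sketch.
SSN_RE = re.compile(r"([0-9]{3})-([0-9]{2})-([0-9]{4})")

def looks_like_ssn(text):
    """Structural check only; it cannot tell whether a number was ever issued."""
    m = SSN_RE.fullmatch(text)
    if m is None:
        return False
    area, group, serial = (int(part) for part in m.groups())
    # Area numbers 000, 666, and 900-999 are excluded from assignment.
    if area in (0, 666) or area >= 900:
        return False
    # Group numbers range from 01 to 99.
    if group == 0:
        return False
    # Assumption: serial numbers range from 0001 to 9999.
    if serial == 0:
        return False
    return True
```

Note that 078-05-1120 passes this check because it is well formed; rejecting specific known-invalid numbers would need a separate deny list.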
A cluster is a set of distinct pieces of data that mean little by themselves but produce something meaningful when found or linked together. For example, 90210 by itself doesn’t mean much, but when 90210 is found near the words Beverly Hills, we know we have a city and zip code. PII Crawler uses this clustering method to find City, State, and Zip codes.
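The clustering idea can be illustrated with a short sketch that looks for a known city name within a fixed window around each five-digit candidate. Everything named here is an assumption for illustration: the KNOWN_CITIES set, the 40-character WINDOW, and the function name are hypothetical and do not describe PII Crawler's internals.

```python
import re

# Hypothetical, tiny city list for illustration; a real gazetteer would be far larger.
KNOWN_CITIES = {"beverly hills", "springfield"}
ZIP_RE = re.compile(r"\b[0-9]{5}\b")
WINDOW = 40  # characters of context on each side; an arbitrary choice

def find_city_zip_clusters(text):
    """Return (city, zip) pairs where a known city appears near a 5-digit number."""
    lowered = text.lower()
    clusters = []
    for m in ZIP_RE.finditer(lowered):
        start = max(0, m.start() - WINDOW)
        context = lowered[start:m.end() + WINDOW]
        for city in sorted(KNOWN_CITIES):
            if city in context:
                clusters.append((city, m.group()))
    return clusters
```

A bare five-digit number with no city nearby produces no cluster, which is what keeps order numbers and similar values out of the results.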
Meaningful street addresses are often found near City, State, and Zip (CSZ) clusters.
PII Crawler uses common name lists and named-entity recognition (NER) techniques to find names.
PII Crawler uses a finite-state machine (FSM) to find email addresses.
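As a rough illustration of the FSM approach, the sketch below walks a candidate token one character at a time through a handful of states. The grammar is deliberately simplified (it is not RFC 5322), and the state names, character sets, and function name are assumptions made for this example rather than PII Crawler's actual machine.

```python
import string

# Simplified character classes; real email syntax allows more than this.
LOCAL_CHARS = set(string.ascii_letters + string.digits + "._%+-")
DOMAIN_CHARS = set(string.ascii_letters + string.digits + "-")

def is_simple_email(token):
    """Accept tokens shaped like local@label.label using a small state machine."""
    state = "start"
    dots = 0  # number of dots seen in the domain part
    for ch in token:
        if state == "start":
            if ch in LOCAL_CHARS and ch != ".":
                state = "local"
            else:
                return False
        elif state == "local":
            if ch == "@":
                state = "at"
            elif ch not in LOCAL_CHARS:
                return False
        elif state in ("at", "dot"):
            # A domain label must start with a letter or digit.
            if ch in DOMAIN_CHARS and ch != "-":
                state = "label"
            else:
                return False
        elif state == "label":
            if ch == ".":
                state = "dot"
                dots += 1
            elif ch not in DOMAIN_CHARS:
                return False
    # Must end inside a label and contain at least one dot in the domain.
    return state == "label" and dots >= 1
```

A hand-written machine like this makes one pass over the input with no backtracking, which is one common reason to prefer it over a general regex engine when scanning large files.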
Begins with a letter followed by eight digits.
PII Crawler uses a custom FSM to find credit card numbers. It checks for valid IIN prefixes, lengths, and a checksum digit.
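The three checks named above (prefix, length, checksum) can be sketched as follows. Payment card numbers conventionally use the Luhn checksum, so that is what this example implements; whether PII Crawler uses exactly this routine is an assumption. The IIN_RULES table is an illustrative subset, not a complete issuer list, and the function names are hypothetical.

```python
# Illustrative subset of issuer prefixes and allowed lengths; not exhaustive.
IIN_RULES = [("4", {13, 16, 19}), ("34", {15}), ("37", {15})]  # Visa, American Express
IIN_RULES += [(str(p), {16}) for p in range(51, 56)]           # Mastercard 51-55

def luhn_ok(digits):
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def looks_like_card_number(text):
    """Check prefix, length, and checksum after dropping spaces and hyphens."""
    digits = text.replace(" ", "").replace("-", "")
    if not digits or any(c not in "0123456789" for c in digits):
        return False
    prefix_ok = any(
        digits.startswith(prefix) and len(digits) in lengths
        for prefix, lengths in IIN_RULES
    )
    return prefix_ok and luhn_ok(digits)
```

Checking the prefix and length before the checksum filters out most random digit runs cheaply, since roughly one in ten arbitrary digit strings passes Luhn by chance.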
PII Crawler combines an FSM with Aho-Corasick multi-pattern matching to find driver’s license numbers (DLNs). A custom FSM checks candidates for valid DLN formats, and an Aho-Corasick multi-pattern matcher then confirms that a candidate appears near terms like “drivers license” or “driver’s license”.
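The two-stage idea (format check, then nearby keyword) can be sketched in a few lines. To stay short, this example swaps in a regular expression for the FSM and a plain substring scan for Aho-Corasick, so it shows the logic rather than the performance characteristics. The candidate format (one letter followed by seven digits), the 50-character window, and the function name are assumptions chosen for illustration.

```python
import re

# Context terms taken from the description above, in lowercase.
KEYWORDS = ("drivers license", "driver's license", "driver’s license")
# Assumed example format: one letter followed by seven digits.
CANDIDATE_RE = re.compile(r"\b[a-z][0-9]{7}\b")
WINDOW = 50  # characters of context on each side; an arbitrary choice

def find_dln_candidates(text):
    """Return format-valid candidates that have a license keyword nearby."""
    lowered = text.lower()
    hits = []
    for m in CANDIDATE_RE.finditer(lowered):
        start = max(0, m.start() - WINDOW)
        context = lowered[start:m.end() + WINDOW]
        # Plain substring scan stands in for an Aho-Corasick matcher here.
        if any(keyword in context for keyword in KEYWORDS):
            hits.append(m.group().upper())
    return hits
```

With many keywords and large inputs, a real Aho-Corasick automaton finds all terms in a single pass, which the loop over KEYWORDS does not.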
PII Crawler uses a custom FSM built around AWS unique ID prefixes to find AWS credentials.
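As a rough sketch of prefix-based detection, the example below looks for access key IDs that start with AKIA (long-term access keys) or ASIA (temporary access keys), two of the unique ID prefixes listed in the AWS IAM documentation. It uses a regular expression in place of a custom FSM, and the 20-character length and uppercase alphanumeric character set are assumptions based on commonly seen key IDs. The function name is hypothetical.

```python
import re

# AKIA and ASIA prefixes followed by 16 more characters (assumed length and charset).
ACCESS_KEY_RE = re.compile(r"(?<![A-Z0-9])(?:AKIA|ASIA)[A-Z0-9]{16}(?![A-Z0-9])")

def find_aws_access_key_ids(text):
    """Return substrings that look like AWS access key IDs."""
    return [m.group(0) for m in ACCESS_KEY_RE.finditer(text)]
```

The lookbehind and lookahead keep the pattern from matching inside a longer run of uppercase letters and digits.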
You can specify your own custom regex rules to match your specific data types. Simply add them to the custom_regex.json file in the same directory as the piicrawler binary. Matches will be prefixed with regex_ in the report, for example regex_myrule1.
Sample custom_regex.json:
{
  "myrule1": {
    "regex": "your regex here",
    "description": "description of the regex"
  },
  "my_other_rule": {
    "regex": "another regex here",
    "description": "description of the other regex"
  }
}
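Before adding a rule to custom_regex.json, it can help to try it against sample text. The sketch below loads a config in the same shape and reports matches under the regex_ prefix. It is only an approximation: it uses Python's re module, and the regex dialect that piicrawler itself uses may differ. The employee_id rule, the SAMPLE_CONFIG string, and the function name are hypothetical.

```python
import json
import re

# Hypothetical rule in the same shape as custom_regex.json.
SAMPLE_CONFIG = """
{
  "employee_id": {
    "regex": "EMP-[0-9]{6}",
    "description": "hypothetical employee ID"
  }
}
"""

def apply_custom_rules(config_text, text):
    """Return {"regex_<rule name>": [matches]} for every rule that matched."""
    rules = json.loads(config_text)
    report = {}
    for name, rule in rules.items():
        # group(0) is the full match even if the pattern contains groups.
        matches = [m.group(0) for m in re.finditer(rule["regex"], text)]
        if matches:
            report["regex_" + name] = matches
    return report
```

For example, apply_custom_rules(SAMPLE_CONFIG, "Badge EMP-004217 issued") reports the match under the key regex_employee_id.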