BLOG · December 2025 · 10 min read

PII in the enterprise AI era — the Indian data type taxonomy

Indian personal data has a structure that global PII detection frameworks were not built for. This is a complete reference for every identifier Pinaakini detects — format, regex pattern, sector relevance, and the risk it carries if it reaches an AI model unredacted.

Why Indian PII needs its own taxonomy

Most global PII detection frameworks — GDPR-aligned tools, US-built compliance scanners — recognise Social Security Numbers, EU national IDs, and generic formats like credit card numbers and email addresses. They were not built for the Indian identifier landscape.

India has a dense, overlapping set of government-issued identifiers. Aadhaar, PAN, ABHA, voter ID, driving licence, and passport each have distinct formats and sector-specific sensitivity profiles. Financial identifiers — UPI VPAs, IFSC codes, GSTIN — are unique to the Indian payments and tax ecosystem. A detection system that misses any of these creates a gap in the compliance posture.

This is the complete taxonomy Pinaakini uses in production. Each entry includes the format specification, a representative pattern, the primary sector where it appears, and the regulatory framework that governs its protection.

Government identity documents

Aadhaar Number

The 12-digit unique identity number issued by UIDAI to every Indian resident. The first digit is always between 2 and 9. Numbers are typically presented with spaces or hyphens as separators: 9876 5432 1098 or 9876-5432-1098.

Sector relevance: All sectors — banking KYC, healthcare ABHA linkage, government services, telecom onboarding.
Governing framework: Aadhaar Act 2016, DPDP Act 2023 (sensitive personal data).
Risk if exposed: Identity theft, fraudulent financial account opening, SIM swapping.
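
The format rules above can be sketched as a single pattern. This is a minimal illustration, not Pinaakini's production pattern; the names `AADHAAR_RE` and `find_aadhaar` are hypothetical.

```python
import re

# 12 digits in three groups of four, first digit 2-9,
# with an optional space or hyphen between groups.
# The lookarounds stop the pattern matching inside a longer digit run.
AADHAAR_RE = re.compile(r"(?<!\d)[2-9]\d{3}[ -]?\d{4}[ -]?\d{4}(?!\d)")

def find_aadhaar(text: str) -> list[str]:
    """Return candidate Aadhaar numbers with separators stripped."""
    return [re.sub(r"[ -]", "", m.group()) for m in AADHAAR_RE.finditer(text)]
```

A format match only makes a string a candidate: any 12-digit number starting with 2 to 9 passes this check, so further validation or context is still needed.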

PAN Card

The 10-character alphanumeric Permanent Account Number issued by the Income Tax Department. Format: 5 letters, 4 digits, 1 letter, e.g. ABCPE1234F. The fourth character indicates the taxpayer type (P = individual); for an individual, the fifth character is the first letter of the surname.

Sector relevance: BFSI (mandatory for transactions above ₹50,000), real estate, high-value retail.
Governing framework: Income Tax Act 1961, DPDP Act 2023.
Risk if exposed: Tax fraud, financial identity theft, unauthorised high-value transaction facilitation.
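
A rough sketch of PAN matching under the format described above. The function names are illustrative assumptions, and only the "P = individual" taxpayer code mentioned in this article is checked.

```python
import re

# Five letters, four digits, one letter. Matching is case-insensitive
# because PANs are sometimes entered in lowercase in form data.
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b", re.IGNORECASE)

def find_pan(text: str) -> list[str]:
    """Return candidate PANs, normalised to uppercase."""
    return [m.group().upper() for m in PAN_RE.finditer(text)]

def is_individual_pan(pan: str) -> bool:
    """True when the fourth character marks an individual taxpayer (P)."""
    return pan[3].upper() == "P"
```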

ABHA Health ID

The 14-digit Ayushman Bharat Health Account identifier, presented in the format 12-3456-7890-1234. It is issued by the National Health Authority and serves as the anchor for a patient's digital health record under ABDM.

Sector relevance: Healthcare exclusively.
Governing framework: DPDP Act 2023 (health data = sensitive personal data), ABDM data governance framework.
Risk if exposed: Unauthorised access to complete medical history, insurance fraud.
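
The 2-4-4-4 grouping lends itself to a simple pattern. The sketch below is an assumption about how such a matcher could look; `ABHA_RE` and `find_abha` are hypothetical names, and treating hyphens as optional is a design choice rather than something the ABHA format requires.

```python
import re

# 14 digits grouped 2-4-4-4; hyphens are treated as optional here.
ABHA_RE = re.compile(r"(?<!\d)\d{2}-?\d{4}-?\d{4}-?\d{4}(?!\d)")

def find_abha(text: str) -> list[str]:
    """Return candidate ABHA numbers in canonical hyphenated form."""
    found = []
    for m in ABHA_RE.finditer(text):
        d = m.group().replace("-", "")
        found.append(f"{d[:2]}-{d[2:6]}-{d[6:10]}-{d[10:]}")
    return found
```

An unhyphenated 14-digit run is a weak signal on its own, so in practice it would likely be paired with healthcare context before being flagged.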

Voter ID (EPIC)

The Electoral Photo Identity Card number: 3 letters followed by 7 digits, e.g. ABC1234567. Issued by the Election Commission of India.

Sector relevance: Government KYC, financial onboarding as an alternate identity proof.
Risk if exposed: Identity fraud, electoral roll manipulation.
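
As an illustration, the EPIC format maps onto a short pattern. `EPIC_RE` and `find_epic` are hypothetical names used only for this sketch.

```python
import re

# Three uppercase letters followed by exactly seven digits.
EPIC_RE = re.compile(r"\b[A-Z]{3}\d{7}\b")

def find_epic(text: str) -> list[str]:
    """Return candidate EPIC numbers found in the text."""
    return EPIC_RE.findall(text)
```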

Driving Licence

Format varies by state: typically 2-letter state code, 2-digit RTO code, 4-digit year, and 7-digit sequence. Example: MH01 2020 1234567. No single national format — state codes and separators vary.

Detection note: The variability in format across states makes driving licence detection probabilistic rather than deterministic. Pinaakini uses a combination of format matching and contextual signals (surrounding text containing "DL", "licence", "driving").
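
One way to picture "format matching plus contextual signals" is a scorer that raises confidence when licence-related words appear near a match. Everything below is an assumption made for illustration: the confidence values, the 40-character window, the restriction of the year to 19xx or 20xx, and the function name are not taken from Pinaakini.

```python
import re

# Layout from the example above: 2-letter state code, 2-digit RTO code,
# 4-digit year, 7-digit sequence, with an optional space or hyphen between parts.
DL_RE = re.compile(r"\b[A-Z]{2}[ -]?\d{2}[ -]?(?:19|20)\d{2}[ -]?\d{7}\b")
DL_CONTEXT_RE = re.compile(r"\b(?:DL|licen[cs]e|driving)\b", re.IGNORECASE)

# Hypothetical confidence values, chosen only for illustration.
WITH_CONTEXT, FORMAT_ONLY = 0.9, 0.5

def score_driving_licences(text: str, window: int = 40) -> list[tuple[str, float]]:
    """Return (match, confidence) pairs; nearby context words raise confidence."""
    results = []
    for m in DL_RE.finditer(text):
        # Look only at the text around the match, not the match itself.
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        has_context = DL_CONTEXT_RE.search(before + " " + after) is not None
        results.append((m.group(), WITH_CONTEXT if has_context else FORMAT_ONLY))
    return results
```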

Passport Number

1 letter followed by 7 digits: A1234567. Issued by the Ministry of External Affairs.

Sector relevance: International KYC, premium banking, high-value insurance.
Risk if exposed: International identity fraud, travel document forgery.
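
A sketch of the passport format as a pattern, with hypothetical names. Because one letter plus seven digits is a short shape that other reference codes can share, matches are best treated as candidates.

```python
import re

# One uppercase letter followed by exactly seven digits.
PASSPORT_RE = re.compile(r"\b[A-Z]\d{7}\b")

def find_passports(text: str) -> list[str]:
    """Return candidate passport numbers found in the text."""
    return PASSPORT_RE.findall(text)
```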

Financial identifiers

UPI ID / Virtual Payment Address (VPA)

The handle used for UPI transactions: username@bankcode. Common suffixes include @okaxis, @oksbi, @okhdfcbank, @paytm, @ybl.

Risk if exposed: Targeted phishing, social engineering attacks using the known payment handle, transaction pattern analysis.
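
A VPA looks like an email address without a dotted domain, so a matcher has to tell the two apart. The sketch below is one possible approach under that assumption; the handle list contains only the suffixes named above, and `VPA_RE` and `find_vpas` are illustrative names.

```python
import re

# Handles named in this article; a real deployment would need a longer list.
KNOWN_HANDLES = {"okaxis", "oksbi", "okhdfcbank", "paytm", "ybl"}

# username@handle, where the handle has no dot. The trailing lookahead
# rejects email-style domains such as user@example.com.
VPA_RE = re.compile(
    r"(?<![\w.-])([A-Za-z0-9][A-Za-z0-9._-]*)@([A-Za-z][A-Za-z0-9]*)(?![\w-]|\.\w)"
)

def find_vpas(text: str) -> list[tuple[str, bool]]:
    """Return (vpa, handle_is_known) pairs for each candidate VPA."""
    return [
        (m.group(0), m.group(2).lower() in KNOWN_HANDLES)
        for m in VPA_RE.finditer(text)
    ]
```

Returning the "known handle" flag separately lets a caller keep unknown handles as lower-confidence candidates instead of dropping them.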

Bank Account Number

Indian bank account numbers range from 9 to 18 digits depending on the bank. No universal format — detection relies on context (co-occurrence with IFSC, "account number", "NEFT") plus length pattern matching.
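
Context-gated length matching could look roughly like this. The keyword list, the 50-character window, and the function name are assumptions for the sake of the example.

```python
import re

# Any standalone run of 9 to 18 digits is a candidate.
ACCOUNT_RE = re.compile(r"(?<!\d)\d{9,18}(?!\d)")
# Context words taken from the description above, plus common abbreviations.
ACCOUNT_CONTEXT_RE = re.compile(
    r"account\s*(?:number|no\b)|a/c|\bNEFT\b|\bIFSC\b", re.IGNORECASE
)

def find_account_numbers(text: str, window: int = 50) -> list[str]:
    """Return digit runs of account length that have banking context nearby."""
    hits = []
    for m in ACCOUNT_RE.finditer(text):
        nearby = text[max(0, m.start() - window):m.end() + window]
        if ACCOUNT_CONTEXT_RE.search(nearby):
            hits.append(m.group())
    return hits
```

The length range also covers 10-digit mobile numbers and 12-digit Aadhaar numbers, which is why the context check carries most of the weight here.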

IFSC Code

The 11-character Indian Financial System Code identifying a bank branch for NEFT/RTGS: 4-letter bank code, 0, and 6-character branch code. Example: HDFC0001234.

Note: IFSC alone is not PII — it identifies a branch, not a person. It becomes PII when combined with an account number (together they uniquely identify an individual's bank account).
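
The "PII only in combination" rule can be sketched as follows. This is an illustrative simplification that checks the whole input for an account-length number; `classify_ifsc` is a hypothetical helper.

```python
import re

# 4-letter bank code, a literal 0, then a 6-character branch code.
IFSC_RE = re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")
# Account-length digit run (9 to 18 digits).
ACCOUNT_RE = re.compile(r"(?<!\d)\d{9,18}(?!\d)")

def classify_ifsc(text: str) -> list[tuple[str, bool]]:
    """Return (ifsc, is_pii) pairs; is_pii is True only when an
    account-length number appears in the same text."""
    has_account = ACCOUNT_RE.search(text) is not None
    return [(code, has_account) for code in IFSC_RE.findall(text)]
```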

Credit / Debit Card Number

Typically a 16-digit number whose final digit is a Luhn check digit, sometimes presented with spaces: 4111 1111 1111 1111. Pinaakini applies Luhn validation to reduce false positives on arbitrary 16-digit sequences.
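
For readers unfamiliar with the Luhn check, here is a minimal sketch of how it filters 16-digit candidates. The function names are illustrative and this is not Pinaakini's validation code.

```python
import re

# 16 digits, optionally grouped in fours by spaces or hyphens.
CARD_RE = re.compile(r"(?<!\d)\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}(?!\d)")

def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0

def find_cards(text: str) -> list[str]:
    """Return 16-digit candidates that also pass the Luhn check."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```

A random 16-digit sequence passes the Luhn check about one time in ten, so the check reduces false positives but does not eliminate them.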

Health identifiers

Medical Record Number (MRN)

Hospital-internal identifiers assigned to patients. No national standard — format varies by hospital information system. Pinaakini detects MRNs through contextual signals (co-occurrence with patient name, admission date, diagnosis codes) rather than format matching alone.

ABHA-linked mobile number

The mobile number registered with an ABHA account is treated as health-adjacent PII in healthcare contexts, as it can be used to authenticate into the patient's digital health record.

Contact and location data

Indian mobile number

10-digit numbers starting with 6, 7, 8, or 9: 9876543210. Variants with the +91 country code (+91 98765 43210) or a leading zero before the 10-digit number (09876543210) are also detected.
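
A sketch of a matcher covering the bare, +91, and leading-zero forms. It is an assumption about one workable pattern; it deliberately does not handle a bare "91" prefix without the plus sign, and the names are hypothetical.

```python
import re

# Optional "+91" or "0" prefix, then 10 digits starting with 6-9,
# optionally split 5+5 by a space or hyphen.
MOBILE_RE = re.compile(
    r"(?<![\d+])(?:\+91[ -]?|0)?([6-9]\d{4})[ -]?(\d{5})(?!\d)"
)

def find_mobiles(text: str) -> list[str]:
    """Return candidate mobile numbers normalised to 10 digits."""
    return [m.group(1) + m.group(2) for m in MOBILE_RE.finditer(text)]
```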

Email address

Standard RFC 5322 format. Detected universally across all sectors.

PIN code

6-digit Indian postal code: 400001. Detected with contextual signals to avoid false positives on arbitrary 6-digit numbers (OTPs, invoice numbers). At area level, a PIN code is not PII — but in combination with a name or address, it is.
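
The contextual gating described above might be sketched like this. The keyword lists and the 30-character look-back window are assumptions for illustration; PIN codes never begin with 0, which the pattern reflects.

```python
import re

# Six digits, first digit 1-9.
PIN_RE = re.compile(r"(?<!\d)[1-9]\d{5}(?!\d)")
PIN_CONTEXT_RE = re.compile(r"\bpin\s?code\b|\bpostal code\b", re.IGNORECASE)
OTP_CONTEXT_RE = re.compile(r"\botp\b|\bverification code\b", re.IGNORECASE)

def find_pin_codes(text: str, window: int = 30) -> list[str]:
    """Return 6-digit numbers preceded by postal context and not by OTP context."""
    hits = []
    for m in PIN_RE.finditer(text):
        before = text[max(0, m.start() - window):m.start()]
        if OTP_CONTEXT_RE.search(before):
            continue  # looks like a one-time password, not a postal code
        if PIN_CONTEXT_RE.search(before):
            hits.append(m.group())
    return hits
```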

Enterprise and business identifiers

GSTIN

The 15-character Goods and Services Tax Identification Number: 2-digit state code, 10-character PAN, 1-character entity number, 1 letter (Z by default), and 1 check character. Example: 27ABCDE1234F1Z5.

Note: GSTIN is technically a business identifier, not a personal one — but it embeds the proprietor's PAN in positions 3–12, making it personally identifying for sole proprietors and individual taxpayers.
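
Because the PAN sits at a fixed offset, it can be pulled out of a GSTIN directly. The sketch below uses a deliberately permissive pattern for the entity and check characters (an assumption) and does not verify the check character; `embedded_pan` is a hypothetical helper.

```python
import re
from typing import Optional

# state code (2 digits) + PAN (10 chars) + entity + letter + check character
GSTIN_RE = re.compile(r"(\d{2})([A-Z]{5}\d{4}[A-Z])([0-9A-Z])([A-Z])([0-9A-Z])")

def embedded_pan(gstin: str) -> Optional[str]:
    """Return the PAN embedded in positions 3-12 of a GSTIN, or None."""
    m = GSTIN_RE.fullmatch(gstin.strip().upper())
    return m.group(2) if m else None
```

Extracting the PAN this way means any redaction policy that applies to PAN can also be applied to the GSTIN that carries it.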

Detection notes and edge cases

Several identifiers create detection challenges that generic PII frameworks miss:

  • Aadhaar in Hindi / regional scripts: Aadhaar numbers are sometimes typed with Devanagari or other regional-script numerals in regional documents. Pinaakini normalises numerals across the scripts of all 22 scheduled languages before pattern matching.
  • PAN in lowercase: PAN is sometimes entered in lowercase in form data. Pinaakini applies case-insensitive matching.
  • Partial masking: Some systems display the last 4 digits of Aadhaar or card numbers. Pinaakini does not treat partially masked identifiers as fully redacted — the partial number still carries identifying information in context.
  • Concatenated identifiers: Some data pipelines concatenate identifiers without separators. Pattern matching is applied with and without standard separators.
  • OTP vs PIN code: 6-digit OTPs overlap with PIN codes in format. Pinaakini uses textual context (presence of "OTP", "verification code") to suppress false positives.
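
Of the edge cases above, numeral normalisation is the easiest to illustrate. The sketch below relies on Python's standard `unicodedata` module, which maps any Unicode decimal digit to its value; that covers more scripts than the list above names, and it is only an assumption about how such a pre-processing step could work.

```python
import unicodedata

def normalise_digits(text: str) -> str:
    """Replace every Unicode decimal digit (e.g. Devanagari numerals)
    with its ASCII equivalent, leaving all other characters untouched."""
    out = []
    for ch in text:
        if ch.isdecimal():
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)
```

Running this step before pattern matching means the ASCII-only patterns for Aadhaar, mobile numbers, and similar identifiers do not need per-script variants.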
The full detection engine is open for inspection

The pattern library Pinaakini uses in production — including all regex patterns, contextual signals, and Luhn validation logic — is available for review as part of our enterprise POC. Contact our technical team.