• Is this looking for PII in my code, or trying to understand the code logic that handles PII?
    • Thanks for your question. I am one of the co-founders. It is the latter. We analyze the names of functions, methods, and variables to detect likely Personally Identifiable Information (PII), Protected Health Information (PHI), Cardholder Data (CHD), and authentication tokens, using well-tuned patterns and language-specific rules. You can see the full list here: https://github.com/hounddogai/hounddog/blob/main/data-elemen...
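
      As a rough illustration (hypothetical code, not taken from any real project), these are the kinds of identifiers the name-based detection keys on:

        def create_patient_record(email_address: str, ssn: str, diagnosis_code: str) -> dict:
            # email_address and ssn would match PII rules; diagnosis_code would match PHI rules,
            # purely from the identifier names and the language-specific patterns.
            return {"email": email_address, "ssn": ssn, "diagnosis": diagnosis_code}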

      When we find a match, we trace that data through the codebase across different paths and transformations, including reassignment, helper functions, and nested calls. We then identify where the data ultimately ends up, such as third-party SDKs (e.g. Stripe, Datadog, OpenAI), exposure through API protocols like REST, GraphQL, or gRPC, and functions that write to logs or local storage. Here's a list of all supported data sinks: https://github.com/hounddogai/hounddog/blob/main/data-sinks....
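
      To make that concrete, here is a minimal, hypothetical Python flow of the kind the scanner is designed to flag (the function and variable names are made up for illustration):

        import logging

        logging.basicConfig(level=logging.INFO)
        logger = logging.getLogger("billing")

        def normalize(value: str) -> str:
            # Pass-through helper; the PII taint follows the return value.
            return value

        def handle_signup(email_address: str, ssn: str) -> None:
            contact = email_address          # reassignment: the taint propagates to `contact`
            last_four = normalize(ssn)[-4:]  # nested call + transformation: still traced
            # Sink: writing identifiable data to logs is reported as a finding.
            logger.info("new signup: %s (ssn ending %s)", contact, last_four)

        handle_signup("jane@example.com", "123-45-6789")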

      Most privacy frameworks, including GDPR and US privacy regulations, require these data flows to be documented, so we use your source code as the source of truth to keep privacy notices accurate and aligned with what the software is actually doing.

  • Interesting. Can I use the output to document sub-processors in our privacy notice, or is it specific to GDPR reporting?
    • Great question. It is not limited to GDPR. The same output can be used to document sub-processors in your privacy notice and vendor disclosures, since it shows which third parties and services personal data flows to in your code.
  • Cool. Why not use LLMs for this kind of analysis? Cost, or something else?
    • LLMs can find issues that traditional SAST misses, but today they are slow, expensive, and nondeterministic. SAST is fast and cheap but requires heavy manual rule maintenance. Our approach combines the strengths of both: the scanning engine is fully rule-based and deterministic, with a rule language expressive enough to model code with compiler-level accuracy. AI is used only to generate broad rule coverage across thousands of patterns, without sacrificing scan performance or reliability.
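
      To give a flavor of what "rule-based and deterministic" means, here is a toy sketch (not our actual rule language): each rule is a declarative pattern tied to a data-element label, so the same input always yields the same findings, and the AI's only job is to help author patterns like these at scale.

        import re

        # Toy, deterministic identifier matcher: plain regex rules mapped to data-element labels.
        RULES = {
            "email_address": re.compile(r"email", re.IGNORECASE),
            "ssn": re.compile(r"ssn|social_security", re.IGNORECASE),
            "card_number": re.compile(r"card_number|pan\b", re.IGNORECASE),
        }

        def classify_identifier(name: str) -> list[str]:
            # Same input, same output on every run: no model inference in the scan path.
            return [label for label, pattern in RULES.items() if pattern.search(name)]

        print(classify_identifier("customer_email"))  # ['email_address']
        print(classify_identifier("ssn_last_four"))   # ['ssn']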