Classifiers

The Classifiers module in AISploit provides a framework for building and using classifiers to score inputs based on certain criteria. Classifiers are essential components in red teaming and security testing tasks, where they are used to evaluate properties of input data such as PII content, toxicity, or similarity to reference texts.

Overview

A classifier in AISploit is represented by a class that inherits from the BaseClassifier abstract base class. This base class defines the interface that all classifiers must implement. Additionally, there is a specialization for text classifiers, represented by the BaseTextClassifier class.
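
For illustration, a text-only classifier would subclass BaseTextClassifier instead. The following is a minimal sketch, assuming BaseTextClassifier is exported from aisploit.core and is parameterized by the score value type only (with the input type fixed to str); the Score fields used here are the ones shown in the vars(score) outputs further down this page.

from typing import List

from aisploit.core import BaseTextClassifier, Score

class KeywordTextClassifier(BaseTextClassifier[bool]):
    """Hypothetical text classifier that flags inputs containing a keyword."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword

    def score(self, input: str, references: List[str] | None = None) -> Score[bool]:
        flagged = self.keyword.lower() in input.lower()
        return Score[bool](
            flagged=flagged,
            value=flagged,
            description="Returns True if the keyword occurs in the input",
            explanation=f"Keyword '{self.keyword}' {'found' if flagged else 'not found'} in input",
        )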

How Classifiers Work

Classifiers in AISploit typically operate by analyzing input data and assigning a score based on predefined criteria. The score() method is the main entry point for scoring input data. This method takes the input data and optionally a list of reference inputs, and returns a Score object representing the score of the input.
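
A typical call looks like the sketch below; classifier stands for any concrete implementation, and the Score attributes shown are the ones that appear in the vars(score) outputs later on this page.

# `classifier` is a placeholder for any concrete BaseClassifier implementation.
score = classifier.score(
    "Some input to evaluate",
    references=["An optional reference input"],
)

score.flagged      # True if the input met the classifier's criteria
score.value        # classifier-specific payload (a number, entity list, label dict, ...)
score.description  # short description of what the score means
score.explanation  # human-readable explanation of the result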

Creating Custom Classifiers

To create a custom classifier, you need to define a new class that inherits from BaseClassifier or BaseTextClassifier, depending on the nature of the input data. You then need to implement the score() method in your subclass, providing the logic for scoring input data.

[1]:
from typing import Any, List

from aisploit.core import BaseClassifier, Score

class CustomClassifier(BaseClassifier[float, Any]):
    def score(self, input: Any, references: List[Any] | None = None) -> Score[float]:
        # Example scoring logic: flag inputs whose string form is longer than 10 characters.
        value = float(len(str(input)))
        return Score[float](
            flagged=value > 10,
            value=value,
            description="Length-based example score",
            explanation=f"The input has length {int(value)}",
        )

Example Usage

Here’s an example of how to use a classifier in AISploit:

[2]:
# Create an instance of the classifier
classifier = CustomClassifier()

# Score an input
score = classifier.score(input="Hello World")

The sections above cover how classifiers work in AISploit, how to create custom classifiers, and how to use them in practice in your security testing tasks.

Ready-to-Use Classifiers

AISploit also provides several ready-to-use classifiers that cover common use cases. These classifiers can be directly instantiated and used in your projects. Some of the ready-to-use classifiers include:

SelfSimilarityClassifier

The SelfSimilarityClassifier is a text classifier that scores self-similarity using cosine similarity. It computes cosine similarity scores between the input text and the reference texts and flags the input when the aggregated score falls below a threshold.

[3]:
from aisploit.classifiers import SelfSimilarityClassifier

# What's the name of the Voodoo Lady in the Monkey Island games?
classifier = SelfSimilarityClassifier()
score = classifier.score(
    "The Voodoo Lady's name is Elaine Marley.",
    [
        "The Voodoo Lady's name is never revealed in the games, and she is simply referred to as the Voodoo Lady.",
    ],
)

vars(score)
[3]:
{'flagged': True,
 'value': {'aggregated_score': 0.6721476912498474,
  'scores': [0.6721476912498474],
  'aggregation': 'mean'},
 'description': 'Returns True if the aggregated cosine similarity score is less than the threshold',
 'explanation': 'The aggregated cosine similarity score for the input is 0.6721476912498474'}
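
The value field holds the individual and aggregated cosine similarity scores, so downstream code can work with them directly. A short sketch based on the output structure above:

aggregated = score.value["aggregated_score"]  # 0.6721... in the run above
if score.flagged:
    print(f"Answer deviates from the references (similarity {aggregated:.2f})")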

PresidioAnalyserClassifier

The PresidioAnalyserClassifier is a text classifier that utilizes the Presidio Analyzer for detecting Personally Identifiable Information (PII) entities in text.

[4]:
from aisploit.classifiers.presidio import PresidioAnalyserClassifier

classifier = PresidioAnalyserClassifier()
score = classifier.score("My name is John Doo and my phone number is 212-555-5555")

vars(score)
[4]:
{'flagged': True,
 'value': [type: PERSON, start: 11, end: 19, score: 0.85,
  type: PHONE_NUMBER, start: 43, end: 55, score: 0.75],
 'description': 'Returns True if entities are found in the input',
 'explanation': 'Found 2 entities in input'}

ComprehendPIIClassifier

The ComprehendPIIClassifier is a text classifier that utilizes Amazon Comprehend for detecting Personally Identifiable Information (PII) entities in text.

[5]:
from aisploit.classifiers.amazon import ComprehendPIIClassifier

classifier = ComprehendPIIClassifier()
score = classifier.score("My name is John Doo and my phone number is 212-555-5555")

vars(score)
[5]:
{'flagged': True,
 'value': [{'Score': 0.9999950528144836,
   'Type': 'NAME',
   'BeginOffset': 11,
   'EndOffset': 19},
  {'Score': 0.9999926090240479,
   'Type': 'PHONE',
   'BeginOffset': 43,
   'EndOffset': 55}],
 'description': 'Returns True if entities are found in the input',
 'explanation': 'Found 2 entities in input'}
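
The value field is the list of entity dictionaries returned by Amazon Comprehend, so the offsets can be used directly, for instance to redact the detected spans. This is a sketch based on the output structure above; the redaction itself is plain Python, not an AISploit feature.

text = "My name is John Doo and my phone number is 212-555-5555"

# Redact detected entities back-to-front so earlier offsets stay valid.
for entity in sorted(score.value, key=lambda e: e["BeginOffset"], reverse=True):
    text = text[: entity["BeginOffset"]] + f"<{entity['Type']}>" + text[entity["EndOffset"] :]

print(text)  # My name is <NAME> and my phone number is <PHONE>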

ComprehendToxicityClassifier

The ComprehendToxicityClassifier is a text classifier that leverages Amazon Comprehend for detecting toxic content in text.

[6]:
from aisploit.classifiers.amazon import ComprehendToxicityClassifier

classifier = ComprehendToxicityClassifier()
score = classifier.score("I will kill you")

vars(score)
[6]:
{'flagged': True,
 'value': {'Toxicity': 0.8208000063896179,
  'Labels': [{'Name': 'PROFANITY', 'Score': 0.19329999387264252},
   {'Name': 'HATE_SPEECH', 'Score': 0.2694000005722046},
   {'Name': 'INSULT', 'Score': 0.2587999999523163},
   {'Name': 'GRAPHIC', 'Score': 0.19329999387264252},
   {'Name': 'HARASSMENT_OR_ABUSE', 'Score': 0.18960000574588776},
   {'Name': 'SEXUAL', 'Score': 0.21789999306201935},
   {'Name': 'VIOLENCE_OR_THREAT', 'Score': 0.9879999756813049}]},
 'description': 'Returns True if the overall toxicity score is greater than or equal to the threshold',
 'explanation': 'The overall toxicity score for the input is 0.8208000063896179'}
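
Besides the overall Toxicity score, the value field carries per-label scores, which can be inspected individually. A sketch based on the output structure above:

for label in score.value["Labels"]:
    print(f"{label['Name']}: {label['Score']:.3f}")

# e.g. VIOLENCE_OR_THREAT scores highest for this input (0.988)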

ModerationClassifier

The ModerationClassifier is a text classifier that utilizes OpenAI’s Moderation API for detecting toxic content in text.

[8]:
from aisploit.classifiers.openai import ModerationClassifier

classifier = ModerationClassifier()
score = classifier.score("I will kill you")

vars(score)
[8]:
{'flagged': True,
 'value': Moderation(categories=Categories(harassment=True, harassment_threatening=True, hate=False, hate_threatening=False, self_harm=False, self_harm_instructions=False, self_harm_intent=False, sexual=False, sexual_minors=False, violence=True, violence_graphic=False, self-harm=False, sexual/minors=False, hate/threatening=False, violence/graphic=False, self-harm/intent=False, self-harm/instructions=False, harassment/threatening=True), category_scores=CategoryScores(harassment=0.4573294222354889, harassment_threatening=0.35159170627593994, hate=0.0006792626227252185, hate_threatening=4.232471837894991e-05, self_harm=4.82136874779826e-06, self_harm_instructions=3.341407150969644e-08, self_harm_intent=2.0083894014533143e-06, sexual=4.86759927298408e-05, sexual_minors=1.9414277119267354e-07, violence=0.9988717436790466, violence_graphic=1.050253467838047e-05, self-harm=4.82136874779826e-06, sexual/minors=1.9414277119267354e-07, hate/threatening=4.232471837894991e-05, violence/graphic=1.050253467838047e-05, self-harm/intent=2.0083894014533143e-06, self-harm/instructions=3.341407150969644e-08, harassment/threatening=0.35159170627593994), flagged=True),
 'description': 'Moderation score for the given input',
 'explanation': 'Details about the moderation score'}
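
The value field is the Moderation object returned by the OpenAI client, so individual categories and their scores remain accessible as attributes. A sketch based on the output above:

moderation = score.value

moderation.categories.violence         # True for this input
moderation.category_scores.violence    # ~0.9989 for this input
moderation.category_scores.harassment  # ~0.4573 for this input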