Microsoft Information Protection - Trainable Classifiers

Shriram Tyagrajan
Apr 24, 2022

Updated: Apr 25, 2022

Leverage user-friendly, pre-trained or trainable Machine Learning classifiers to identify various types of content in your organization.

Entropy - the degree of disorder or uncertainty in a system [Merriam-Webster Dictionary]

Every organization starts with a fairly well-defined way of working to achieve its goals. This includes processes, people, information and technology. Initially everything works fine: everyone follows the process, and things are organized exactly the way the organization intended.

As time passes, the business grows, and along with it the staff, the customers, and most importantly the volume of data. As a result, after a few months or years the organization has accumulated a huge repository of data.

This repository of data raises the following questions for the organization:

Have users followed the processes for organizing this data properly?
Does this data hold information of business value, or can it be purged?
Has the organization been updating its processes to identify and properly organize all this data?
Does the organization know which type of data is stored where?

Why do we see gradual deterioration of a system?

As an organization grows, it aims to achieve its goals faster, which shifts the focus from finely organized information practices to fit-for-purpose information management, because keeping things organized takes extra time and effort
Users may upload large amounts of data so that everything is stored for future reference, albeit in a disorganized manner
The organization lacks dedicated Information Stewards to keep a close watch on whether data is being stored properly
A mindset of gradually accepting leniency in following the processes spreads across the organization

This gradual slide into chaos is called entropy.

Entropy indicates how disorganized and chaotic a system has become since its initial setup.

Traditional Methods of Content Classification

Understanding the type of information available in an organization has been a tedious task. Organizations have mainly depended on two methods to classify information: manual classification and pattern matching.

Manual - the preferred method if the organization knows what type of information is, or can be, stored where. This kind of classification is mostly possible when the organization follows a systematic approach to saving information in the right place.

Pattern-Matching - this method relies on specific keywords in the content of the document, or on document tags (metadata properties) that give more information without going into the content of the document.

Though both methods have their merits, a large portion of the information spread across the organization may fail to get classified by either of them, because over time users may upload content to the wrong locations, or the Pattern-Matching rules may not cover many of the new patterns and keywords.
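To make the Pattern-Matching approach and its blind spot concrete, here is a minimal Python sketch. The categories, keywords and the `classify_by_pattern` helper are made up for illustration and are not part of any Microsoft product. It checks both the document text and its metadata tags against a fixed keyword list.

```python
import re

# Hypothetical keyword patterns per category; a real organization would maintain its own list.
PATTERNS = {
    "Finance": re.compile(r"\b(invoice|purchase order|balance sheet)\b", re.IGNORECASE),
    "HR": re.compile(r"\b(resume|payslip|employment contract)\b", re.IGNORECASE),
}

def classify_by_pattern(text, metadata=None):
    """Return the sorted categories whose keywords appear in the text or in the metadata tags."""
    tags = " ".join((metadata or {}).values())
    haystack = f"{text} {tags}"
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(haystack))

print(classify_by_pattern("Please find the invoice for March attached."))   # ['Finance']
print(classify_by_pattern("Scanned file", {"department": "payslip archive"}))  # ['HR']
# A finance document that uses a keyword nobody added to the list goes unclassified:
print(classify_by_pattern("Quarterly remittance advice for supplier 17"))    # []
```

The last call shows the weakness described above: the document is clearly financial, but because "remittance advice" was never added to the keyword list, it slips through.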

In such situations, which many organizations are already facing today, getting the job done with less time and effort is paramount.

Microsoft Trainable Classifiers

What is a Classifier?

A Classifier is a special tag/label that uses Machine Learning to identify documents containing a specific type of content by following automated logic. The Machine Learning model uses sample (seed) documents to create the initial logic for identifying the documents.

A Classifier is like a Label Sticker that is applied to each identified document based on its content.

Microsoft provides a list of classifiers which are pre-trained (based on sample documents like Legal, Finance, Manufacturing, Supply Chain etc.) and use Machine Learning to identify the classification of the documents in user-configured target locations.

Microsoft's pre-built Trainable Classifiers support multiple content languages: Chinese (Simplified), English, French, German, Italian, Japanese, Portuguese and Spanish (support for additional languages is expected in the future).

These Trainable Classifiers can be of two types: pre-trained or custom.

Pre-trained Classifiers - the simplest to implement, as you don't need to train these classifiers with sample documents. Microsoft has already done this job for you.

Custom Classifiers - require that you supply sample documents so that the custom classifier can be trained. To make the training effective, you need to supply between 50 and 500 valid documents, from which the custom classifier starts to understand the documents using machine learning. You need to verify the classification results that the custom classifier produces and refine the training (re-training the classifier) until you are satisfied with the results. Once the Custom Trainable Classifier is ready, it works similarly to the Pre-trained Classifiers.
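To give a feel for what "learning from seed documents" means, here is a deliberately tiny sketch of a text classifier in plain Python. It is a textbook naive Bayes model, not Microsoft's actual algorithm (which is not described here), and the class name `ToySeedClassifier` and the sample sentences are invented for illustration. It uses four samples where a real custom classifier needs 50 to 500.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

class ToySeedClassifier:
    """Tiny naive Bayes classifier; an illustration only, NOT Microsoft's model."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, is_match):
        """Add one sample document, marked as matching the category or not."""
        self.word_counts[is_match].update(tokenize(text))
        self.doc_counts[is_match] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        vocabulary = set(self.word_counts[True]) | set(self.word_counts[False])
        total_words = sum(counts.values())
        # Log prior plus Laplace-smoothed log likelihood of each word.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        return score

    def predict(self, text):
        """Return True if the text looks more like the matching samples.

        Assumes at least one matching and one non-matching sample were trained.
        """
        words = tokenize(text)
        return self._log_score(words, True) > self._log_score(words, False)

classifier = ToySeedClassifier()
classifier.train("Invoice for consulting services, payment due in 30 days", True)
classifier.train("Purchase order and invoice total amount payable", True)
classifier.train("Team lunch is scheduled for Friday at noon", False)
classifier.train("Reminder to submit your holiday photos", False)

print(classifier.predict("Invoice amount due for payment"))  # True
print(classifier.predict("Lunch on Friday at noon"))         # False
```

Unlike a keyword list, the model weighs every word it has seen, so a document can be recognized even if no single "magic" keyword is present. That is also why the quality and number of seed documents matter so much.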

A Custom Classifier usually goes through three stages: Create, Test, and Re-train.

Create Custom Classifier

Creating Custom Classifier Process © Infotechtion

Test Custom Classifier

Testing Custom Classifier Process © Infotechtion

Re-train Custom Classifier

Retraining Custom Classifier Process © Infotechtion
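The Test and Re-train stages boil down to a feedback loop: a reviewer marks each prediction as right or wrong, and the classifier is re-trained until the results are good enough. The sketch below is one way to put numbers on "good enough". The function names and the 0.9 thresholds are assumptions chosen for illustration, not values prescribed by Microsoft.

```python
def review_summary(reviews):
    """Compute (precision, recall) from reviewer feedback.

    reviews: list of (predicted_match, reviewer_says_match) boolean pairs.
    """
    true_pos = sum(1 for predicted, actual in reviews if predicted and actual)
    false_pos = sum(1 for predicted, actual in reviews if predicted and not actual)
    false_neg = sum(1 for predicted, actual in reviews if not predicted and actual)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

def needs_retraining(reviews, min_precision=0.9, min_recall=0.9):
    """Return True if either metric falls below its (hypothetical) threshold."""
    precision, recall = review_summary(reviews)
    return precision < min_precision or recall < min_recall

# Five reviewed test documents: two correct matches, one false match,
# one missed match and one correct non-match.
reviews = [(True, True), (True, True), (True, False), (False, True), (False, False)]
print(needs_retraining(reviews))  # True: precision and recall are both 2/3
```

Precision answers "of the documents the classifier flagged, how many were right?", while recall answers "of the documents it should have flagged, how many did it find?". Which of the two matters more depends on what the label will be used for.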

How to use Trainable Classifiers?

A Trainable Classifier can be used as a condition for a Sensitivity Label, a Retention Label, or Communication Compliance.

You can create an auto-apply policy (for a Sensitivity Label, a Retention Label or Communication Compliance) and add a condition to the policy stating that any content matching your selected Trainable Classifier should have the specified Label applied.

The ability to use Trainable Classifiers as a condition in auto-labelling policies removes the need for manual intervention, establishes a continuous process of content classification, and helps in understanding the volume of information within the organization.
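Conceptually, an auto-apply policy is a simple rule: "if this classifier matches, apply that label". The sketch below models that rule in plain Python. It does not call any Microsoft 365 API, and the `AutoApplyPolicy` class, the policy names and the label names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AutoApplyPolicy:
    """Hypothetical stand-in for an auto-apply policy: one condition, one label."""
    name: str
    classifier: str  # classifier whose match triggers the policy
    label: str       # label the policy applies when the condition holds

def labels_to_apply(matched_classifiers, policies):
    """Return the labels of every policy whose classifier condition is met."""
    return [p.label for p in policies if p.classifier in matched_classifiers]

policies = [
    AutoApplyPolicy("Retain finance records", "Finance", "Retain 7 years"),
    AutoApplyPolicy("Protect HR documents", "HR", "Confidential - HR"),
]

print(labels_to_apply({"Finance"}, policies))  # ['Retain 7 years']
print(labels_to_apply(set(), policies))        # []
```

Note the division of labour: the classifier only reports which categories a document matches, and it is the policy that decides which label follows from that match.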

Once the information is labelled, you can leverage Compliance Reports to get more insights into the labelled contents spread across Microsoft 365.

Trainable Classifiers only help identify the classification of content; they do not take any action on the content themselves.

You can find Trainable Classifiers under Compliance Administration -> Data Classification

Some of the built-in Trainable Classifiers supporting multiple languages

Built-in Trainable Classifier "HR" with Sample Results

Current Limitations of Trainable Classifiers

The following are some of the limitations of Trainable Classifiers (these are expected to be remediated in the future):

A Trainable Classifier can scan up to 1 week of data (at rest) in SharePoint Online
You cannot move/copy a custom classifier from one tenant (environment) to another. If you want to use a trained custom classifier from your development environment in your test environment, you need to re-create the custom classifier there and re-train it
Trainable Classifiers do not currently support scanning Exchange emails, but at the time of writing Microsoft was expected to release this support within the next 3-4 months
You cannot use a pre-built trainable classifier as the starting point for a custom classifier. For example, if you want to create a custom classifier named "Finance Sales" based on the pre-built Finance classifier, you won't be able to use the Finance classifier for this
You cannot combine Trainable Classifiers with Keyword Query Language (KQL)

Microsoft is working on improving Trainable Classifiers, but you can already achieve a lot with the current capabilities.

Request a Proof-of-Concept (POC)

Infotechtion follows a holistic, systematic, and collaborative approach to demonstrate the possibilities of Trainable Classifiers and other Microsoft Information Protection features.

How to request a POC?

For more information on how you can achieve a simpler user experience with integrated protection of information, book a demo at infotechtion.com to see a unified experience when Microsoft 365 is configured with recommended practices.