
Under the Hood of AI Document Classification


As introduced in our previous post, AI for document processing is a powerful tool for streamlining workflows, minimising delays and reducing the errors caused by manual document classification. Machine learning algorithms provide the accuracy and reliability such systems demand, and give them the ability to handle messy inputs. Several types of algorithm can be used for this purpose, each operating in its own way. This post provides a brief overview for those interested in what happens under the hood of AI document classification.



Data Preparation for Document Classification

Before the main document classification algorithms come into play, a few steps are required to prepare the data. The first step is dimensionality reduction, which involves stripping back the noise and keeping only those elements that help rather than hamper categorisation. As part of this, the document is tokenised: the free text is split into workable units (tokens), usually discarding punctuation. This clean-up matters because these algorithms typically assume a bag-of-words representation, which treats a document as a collection of word counts and ignores word order and document structure.
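To make tokenisation and the bag-of-words idea concrete, here is a minimal sketch in Python. The function names and the sample sentence are our own illustrations, and the regular expression is just one simple way to split text; production systems usually rely on a dedicated tokeniser.

```python
import re
from collections import Counter

def tokenise(text):
    """Lowercase the text and split it into word tokens, discarding punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(tokens):
    """Count how often each token occurs; word order is thrown away."""
    return Counter(tokens)

# Hypothetical sample text, for illustration only.
doc = "Invoice no. 42: payment due. Payment terms: 30 days."
tokens = tokenise(doc)
bow = bag_of_words(tokens)
print(tokens)
print(bow["payment"])  # "payment" appears twice once case is ignored
```

Because only counts are kept, two documents containing the same words in a different order produce exactly the same representation.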

Following this is a process of stop word removal, which strips out filler words (such as “the” and “of”) that add little to identifying document context. Finally, the document undergoes a round of stemming, which reduces words with slightly different forms to a common stem, so that “processing” and “processed”, for example, are counted as the same term. Once the document has undergone this preparation, it is ready to be processed by the actual machine learning algorithm. There are a number of choices, each with its own approach to the problem of AI for documents.
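Stop word removal and stemming can be sketched in a few lines. The stop word list and the suffix-stripping rule below are deliberately crude assumptions made for illustration; real systems tend to use a fuller stop word list and an established stemmer such as Porter's.

```python
# Hypothetical, deliberately tiny stop word list and suffix list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "be"}
SUFFIXES = ("ing", "ed", "es", "s")

def remove_stop_words(tokens):
    """Drop filler words that carry little information about the topic."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = ["the", "payments", "are", "processed", "and", "the", "processing", "is", "audited"]
cleaned = [crude_stem(t) for t in remove_stop_words(tokens)]
print(cleaned)  # ['payment', 'process', 'process', 'audit']
```

Note how “processed” and “processing” collapse to the same stem, which shrinks the vocabulary the classifier has to deal with.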

Machine Learning Options for Document Processing:

Artificial Neural Network 

A method that involves a large number of interconnected computational elements (nodes), arranged in layers, which between them come to represent different features of the data being processed. Based on the data the network is fed during training, the connections between these nodes are given different weightings, allowing the model to adapt itself to a large volume of inputs. When visualised, this mass of connections roughly resembles the structure of neurons in a brain.
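As a rough illustration of how weighted connections turn features into a category, the sketch below passes a feature vector through one hidden layer and one output layer. The weights, labels and features are all made up for the example; a real network would learn its weights from labelled documents, and that training step is omitted here.

```python
import math

def relu(values):
    """Activation: negative signals are silenced, positive ones pass through."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw output scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One fully connected layer: each output node is a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

# Hypothetical weights; in practice these are learned, not written by hand.
HIDDEN_W = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.5]]
HIDDEN_B = [0.0, 0.0]
OUTPUT_W = [[2.0, -2.0], [-2.0, 2.0]]
OUTPUT_B = [0.0, 0.0]
LABELS = ["invoice", "contract"]

def classify(features):
    hidden = relu(dense(features, HIDDEN_W, HIDDEN_B))
    probabilities = softmax(dense(hidden, OUTPUT_W, OUTPUT_B))
    best = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return LABELS[best], probabilities

# Assumed features: counts of "payment", "party" and "clause" in the document.
print(classify([3, 0, 0])[0])  # invoice
print(classify([0, 2, 1])[0])  # contract
```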

K-Nearest Neighbours 

A relatively simple categorisation method that does not build an explicit model during a training phase. Instead, it keeps a store of previously categorised items and, for each new input, measures how similar its features are to theirs. The k most similar items are the input's “nearest neighbours”, and the most common category among those neighbours is assigned to the input.
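A minimal sketch of the idea, assuming each document has already been reduced to a small numeric feature vector. The data, the choice of Euclidean distance and the value of k are all illustrative assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(labelled, query, k=3):
    """Label `query` by majority vote among its k closest labelled examples."""
    nearest = sorted(labelled, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors: (count of "payment", count of "clause").
labelled = [
    ((5, 0), "invoice"),
    ((4, 1), "invoice"),
    ((6, 1), "invoice"),
    ((0, 4), "contract"),
    ((1, 5), "contract"),
    ((0, 6), "contract"),
]
print(knn_classify(labelled, (5, 1)))  # invoice
print(knn_classify(labelled, (1, 4)))  # contract
```

All the work happens at query time, which is why the method is cheap to set up but can become slow as the store of labelled items grows.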

Decision Trees 

A type of categorisation which involves passing an input through a series of hierarchically arranged nodes (shaped much like a tree). At every node, the input is “tested” on a particular feature and, depending on the result, moves down to another node at the next level. Once the input reaches an end node (a leaf), which does not split the data any further, it has been categorised.
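The traversal itself is simple to sketch. The tree below is written by hand purely for illustration, with made-up features and thresholds; in practice the tests at each node are learned from labelled documents.

```python
# Hypothetical hand-built tree; a real one would be learned from training data.
TREE = {
    "feature": "payment",
    "threshold": 2,
    "below": {
        "feature": "clause",
        "threshold": 1,
        "below": {"label": "other"},
        "at_or_above": {"label": "contract"},
    },
    "at_or_above": {"label": "invoice"},
}

def classify(node, counts):
    """Walk from the root to a leaf, testing one feature at each internal node."""
    while "label" not in node:
        value = counts.get(node["feature"], 0)
        branch = "at_or_above" if value >= node["threshold"] else "below"
        node = node[branch]
    return node["label"]

print(classify(TREE, {"payment": 3}))  # invoice
print(classify(TREE, {"clause": 4}))   # contract
print(classify(TREE, {"hello": 1}))    # other
```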

[Image: an abstract representation of a decision tree document classification algorithm]

Naïve Bayes Classifiers 

Another relatively straightforward classification algorithm, based on Bayes' theorem. It estimates, for each category, the probability that the input belongs to it given the features the input contains, and then picks the most probable category. It is “naïve” in that it assumes the features are independent of one another. This method can work with comparatively little training data, but it does depend on reliable estimates of how features are distributed within each category, which are usually derived from previously categorised items.
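The sketch below shows one common variant, a multinomial Naïve Bayes classifier over word counts with add-one smoothing. The tiny training set and the function names are our own assumptions, chosen only to keep the example short.

```python
import math
from collections import Counter, defaultdict

def train(documents):
    """Count class frequencies and per-class word frequencies from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def classify(model, tokens):
    """Return the class with the highest log-probability for the given tokens."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log P(class) + sum of log P(word | class), with add-one smoothing
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical, already-tokenised training documents.
training = [
    (["payment", "due", "invoice"], "invoice"),
    (["invoice", "total", "payment"], "invoice"),
    (["party", "clause", "agreement"], "contract"),
    (["agreement", "clause", "term"], "contract"),
]
model = train(training)
print(classify(model, ["payment", "total"]))  # invoice
print(classify(model, ["clause", "party"]))   # contract
```

Working with logarithms avoids multiplying many small probabilities together, and the smoothing stops a single unseen word from forcing a category's probability to zero.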

Support Vector Machines 

A categorisation method which is quite precise but relies on a labelled training dataset. It functions by plotting data within an abstracted “feature space”, in which items sit close together when they are similar across several features. This space is then divided by a “hyperplane”, positioned according to the training data so that it separates the categories with as wide a margin as possible. Subsequent inputs are then categorised according to which side of the hyperplane they fall on.
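For a feel of how the hyperplane is found, here is a small sketch of a linear SVM trained by sub-gradient descent on the hinge loss. This is one simple training approach chosen for brevity, not the only one, and the data, learning rate and number of epochs are illustrative assumptions.

```python
def train_linear_svm(points, labels, epochs=200, learning_rate=0.01, regularisation=0.01):
    """Fit weights w and bias b on the regularised hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularisation term shrinks the weights slightly on every step.
            w = [wi - learning_rate * regularisation * wi for wi in w]
            if margin < 1:
                # Point is misclassified or inside the margin: move the hyperplane away from it.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b

def predict(w, b, x):
    """Report which side of the hyperplane w.x + b = 0 the point falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical features: (count of "payment", count of "clause"); +1 = invoice, -1 = contract.
points = [(5, 0), (4, 1), (6, 1), (0, 4), (1, 5), (0, 6)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print(predict(w, b, (5, 1)))
print(predict(w, b, (0, 5)))
```

Only the points that end up on or inside the margin influence the final position of the hyperplane; these are the “support vectors” that give the method its name.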

Rocchio Algorithm

Similar to Support Vector Machines in that it relies on plotting items within a “vector space” and dividing it into regions marked by “decision boundaries”. Each category is summarised by the centroid (average) of its training items, and a new input is assigned to the category whose centroid it lies closest to. This method is often used in information retrieval systems, where it improves its accuracy through relevance feedback: multiple results are returned, and user behaviour is used to further refine them.
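Used as a classifier, the Rocchio approach amounts to a nearest-centroid rule, which can be sketched as follows. The data is made up, and Euclidean distance is used for simplicity; text systems often use cosine similarity over weighted term vectors instead. The relevance feedback loop is not shown.

```python
import math

def centroid(vectors):
    """Average the vectors component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def train_centroids(labelled):
    """Compute one centroid per category from (vector, label) pairs."""
    by_label = {}
    for vector, label in labelled:
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in by_label.items()}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(centroids, query):
    """Assign the query to the category with the closest centroid."""
    return min(centroids, key=lambda label: distance(centroids[label], query))

# Hypothetical feature vectors: (count of "payment", count of "clause").
labelled = [
    ((5, 0), "invoice"),
    ((4, 1), "invoice"),
    ((6, 1), "invoice"),
    ((0, 4), "contract"),
    ((1, 5), "contract"),
    ((0, 6), "contract"),
]
centroids = train_centroids(labelled)
print(classify(centroids, (5, 1)))  # invoice
print(classify(centroids, (1, 4)))  # contract
```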

Depending on the specific results sought with AI for document classification, different aspects of each type might be used together to create a hybrid classification solution. It is clear, though, that there is a range of options available for anyone seeking to build a document classification system from scratch. However, with the increasing availability of cloud-based, code-free machine learning services, it's becoming easier to disregard the inner workings of these AI systems.

Watch our webinar to explore an AI for Documents solution that combines the power of Machine Learning and Azure Cognitive Services to remove the burden of manually classifying thousands of documents.