As introduced in our previous post ("Can AI help overcome document bottlenecks?"), AI document classification is a powerful tool for streamlining workflows, minimising delays and reducing the errors caused by manual document processing. Machine learning algorithms provide the accuracy and reliability such systems demand, and give them the ability to handle messy inputs. Several types of algorithm can be used for classification, each operating in its own way. This post provides a brief overview for those interested in what happens under the hood.
Document Data Preparation
Before the main document classification algorithms come into play, a few steps are required to prepare the data. The first is dimensionality reduction, which involves stripping back the noise and keeping only those elements that aid rather than hamper categorisation. As part of this, the document is tokenised: the free text is broken into workable units (typically individual words), usually discarding punctuation. This simplification matters because the algorithms described here typically assume a bag-of-words representation, which treats a document as a collection of word counts and ignores its structure and word order.
Next comes stop word removal, which removes filler words (such as “the”, “and” and “of”) that add little to identifying what a document is about. Finally, the document undergoes stemming, which reduces words that differ only slightly in form (for example, “process”, “processed” and “processing”) to a common stem. Once the document has undergone this preparation, it is ready to be processed by the machine learning algorithm itself. There are a number of choices, each with its own approach to the problem.
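For readers who like to see the steps concretely, here is a minimal Python sketch of the preparation stage. The stop word list, the suffix-stripping rule and the function names are all illustrative assumptions rather than a production pipeline; a real system would normally use an established stemmer and a much longer stop word list.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "were"}

def tokenise(text):
    """Lowercase the text and split it into word tokens, discarding punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop filler words that say little about the document's subject."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        long_enough = len(token) - len(suffix) >= 3
        if token.endswith(suffix) and long_enough and not token.endswith("ss"):
            return token[: -len(suffix)]
    return token

def prepare(text):
    """Run tokenisation, stop word removal and stemming in sequence."""
    return [crude_stem(t) for t in remove_stop_words(tokenise(text))]

def bag_of_words(tokens):
    """Count how often each stem occurs, ignoring word order."""
    return Counter(tokens)

tokens = prepare("The invoices were processed, and processing the invoice is finished.")
print(tokens)  # ['invoice', 'process', 'process', 'invoice', 'finish']
print(bag_of_words(tokens))
```

The resulting word counts are the kind of input the classification algorithms below work with.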
Machine Learning Processing:
An Artificial Neural Network uses a large number of interconnected computational elements (nodes), arranged in multiple layers, which respond to different features of the data being processed. As the network is fed training data, the connections between these nodes are given different weightings, allowing the model to adapt itself across a large volume of inputs. When visualised, this mass of connections roughly resembles neuronal structure.
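To make the idea of adjustable weightings more tangible, the following is a deliberately tiny sketch of a network with one hidden layer, trained by gradient descent. Everything here is an assumption made for illustration: the class name, the layer sizes, the learning rate, and the four-word vocabulary ("invoice", "payment", "contract", "clause") behind the input vectors. Real networks are far larger and are usually built with a dedicated library.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNetwork:
    """One hidden layer feeding a single output node (illustrative sizes)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                         for _ in range(n_hidden)]
        self.b_hidden = [0.0] * n_hidden
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b_out = 0.0

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                  for ws, b in zip(self.w_hidden, self.b_hidden)]
        output = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)
        return hidden, output

    def train_step(self, x, target, rate=0.5):
        """Nudge every weighting to reduce the error on one example."""
        hidden, output = self.forward(x)
        delta_out = output - target  # cross-entropy gradient at the output node
        for j, h in enumerate(hidden):
            # Error signal for hidden node j, using the output weight before it is updated
            delta_h = delta_out * self.w_out[j] * h * (1.0 - h)
            self.w_out[j] -= rate * delta_out * h
            self.b_hidden[j] -= rate * delta_h
            for i, xi in enumerate(x):
                self.w_hidden[j][i] -= rate * delta_h * xi
        self.b_out -= rate * delta_out

    def predict(self, x):
        return self.forward(x)[1]

# Hypothetical word counts for: invoice, payment, contract, clause.
# Target 1.0 means "finance", 0.0 means "legal".
training = [([2, 1, 0, 0], 1.0), ([1, 2, 0, 0], 1.0),
            ([0, 0, 2, 1], 0.0), ([0, 0, 1, 2], 0.0)]

network = TinyNetwork(n_inputs=4, n_hidden=3)
for _ in range(500):
    for vector, target in training:
        network.train_step(vector, target)

print(network.predict([1, 1, 0, 0]))  # close to 1 suggests "finance"
print(network.predict([0, 0, 1, 1]))  # close to 0 suggests "legal"
```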
The k-Nearest Neighbours method is a relatively simple categorisation method that does not build a model in a separate training phase. Instead, it stores previously categorised items and compares each new input against them for feature similarity. The k most similar items are the input's “nearest neighbours”, and the most common category among those neighbours is assigned to the input.
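A minimal sketch of this idea in Python might look as follows. The function name, the choice of straight-line (Euclidean) distance as the similarity measure, and the example word-count vectors are assumptions for illustration; other distance measures, such as cosine similarity, are common for text.

```python
import math
from collections import Counter

def knn_classify(query, labelled_examples, k=3):
    """Return the most common category among the k examples closest to the query."""
    nearest = sorted(labelled_examples,
                     key=lambda example: math.dist(query, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical word counts for: invoice, payment, contract, clause
examples = [
    ([2, 1, 0, 0], "finance"),
    ([1, 2, 0, 0], "finance"),
    ([1, 1, 0, 0], "finance"),
    ([0, 0, 2, 1], "legal"),
    ([0, 0, 1, 2], "legal"),
    ([0, 1, 1, 1], "legal"),
]

print(knn_classify([2, 2, 0, 0], examples))  # finance
print(knn_classify([0, 0, 1, 1], examples))  # legal
```

Note that all the work happens at classification time, which is why the method becomes slower as the store of categorised items grows.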
Decision Trees categorise an input by passing it through a series of hierarchically arranged nodes (shaped much like a tree). At every node, the input is “tested” on a particular feature and, depending on the result, moves down to another node at the next level. Once the input reaches an end node (a leaf), which does not split the data further, it has been categorised.
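The traversal itself is simple enough to sketch in a few lines. In the example below the tree is written by hand purely for illustration; in practice the tests at each node are learned from training data. The `Node` class, the word-presence tests and the category names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[str] = None      # word whose presence is tested at this node
    present: Optional["Node"] = None   # branch taken when the word is in the document
    absent: Optional["Node"] = None    # branch taken when it is not
    label: Optional[str] = None        # category; set only on end (leaf) nodes

def tree_classify(node, words):
    """Walk from the root to a leaf, testing one feature at each node."""
    while node.label is None:
        node = node.present if node.feature in words else node.absent
    return node.label

# A hand-built, hypothetical tree
tree = Node(
    feature="invoice",
    present=Node(label="finance"),
    absent=Node(
        feature="clause",
        present=Node(label="legal"),
        absent=Node(label="other"),
    ),
)

print(tree_classify(tree, {"invoice", "total"}))    # finance
print(tree_classify(tree, {"clause", "party"}))     # legal
print(tree_classify(tree, {"meeting", "agenda"}))   # other
```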
Naïve Bayes Classifiers are another relatively straightforward family of classification algorithms, based on Bayes' theorem. They estimate, for each category, the probability that the input belongs to it given the features it contains, and assign the most probable category. The method is called “naïve” because it assumes the features are independent of one another within each category. The probabilities it depends on are usually estimated from previously categorised items, so a reliable labelled dataset is still needed.
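As a rough illustration, here is a small word-count (multinomial) variant written in plain Python. The function names and the example documents are assumptions, and add-one smoothing is used so that a word never seen in a category does not reduce its probability to zero. Logarithms are summed instead of multiplying probabilities, which avoids very small numbers.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Count documents per category and word occurrences per category."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def nb_classify(tokens, model):
    """Return the category with the highest log-probability for the tokens."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # prior probability of the category
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen word/category pairs above zero
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, already-prepared training documents
docs = [
    (["invoice", "payment", "due"], "finance"),
    (["payment", "received", "invoice"], "finance"),
    (["contract", "clause", "party"], "legal"),
    (["clause", "liability", "contract"], "legal"),
]

model = train_naive_bayes(docs)
print(nb_classify(["invoice", "due"], model))        # finance
print(nb_classify(["contract", "liability"], model)) # legal
```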
Support Vector Machines are a categorisation method which can be quite precise but relies on a labelled training dataset. The method works by plotting data within an abstracted “feature space”, in which items that are similar across several features sit close together. This space is then divided by a “hyperplane”, positioned to separate the categories in the training data with as wide a margin as possible. Subsequent inputs are then categorised according to which side of the hyperplane they fall on.
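The sketch below shows one simple way a separating hyperplane can be found for two categories: a linear Support Vector Machine trained by repeatedly correcting examples that fall on the wrong side of, or too close to, the boundary. The function names, learning rate, regularisation strength and example vectors are illustrative assumptions; production systems generally rely on a well-tested library and more sophisticated solvers.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_linear_svm(points, labels, epochs=200, rate=0.1, reg=0.01):
    """Fit weights w and bias b so that the hyperplane w.x + b = 0 separates the labels.

    Labels must be +1 or -1. Uses sub-gradient descent on the hinge loss.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (dot(w, x) + b)
            # Regularisation gently shrinks the weights, favouring a wide margin
            w = [wi * (1.0 - rate * reg) for wi in w]
            if margin < 1:
                # The example is misclassified or inside the margin: move the plane
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
    return w, b

def svm_side(x, w, b):
    """Report which side of the hyperplane the input falls on."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical word counts for: invoice, payment, contract, clause.
# +1 stands for "finance", -1 for "legal".
points = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
labels = [1, 1, -1, -1]

w, b = train_linear_svm(points, labels)
print(svm_side([1, 1, 0, 0], w, b))  # 1  (finance side)
print(svm_side([0, 0, 1, 1], w, b))  # -1 (legal side)
```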
Finally, there’s the Rocchio Algorithm, which is similar to Support Vector Machines in that it relies on plotting items within a “vector space” and dividing it into regions marked by “decision boundaries”. In its classification form, each category is typically summarised by the centroid (average) of its training items, and a new input is assigned to the category whose centroid is nearest, so the region within which the input falls determines its classification. The method is often used in information retrieval systems, where it improves accuracy through relevance feedback: multiple results are returned, and user behaviour is used to further refine them.
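In its classification form, Rocchio is often implemented as a nearest-centroid classifier, and that is the variant sketched here (the relevance feedback loop used in retrieval systems is not shown). The function names, the use of straight-line distance and the example vectors are assumptions for illustration.

```python
import math
from collections import defaultdict

def class_centroids(labelled_vectors):
    """Average the vectors of each category to get one centroid per category."""
    grouped = defaultdict(list)
    for vector, label in labelled_vectors:
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def rocchio_classify(query, centroids):
    """Assign the query to the category whose centroid is closest."""
    return min(centroids, key=lambda label: math.dist(query, centroids[label]))

# Hypothetical word counts for: invoice, payment, contract, clause
training = [
    ([2, 1, 0, 0], "finance"),
    ([1, 2, 0, 0], "finance"),
    ([0, 0, 2, 1], "legal"),
    ([0, 0, 1, 2], "legal"),
]

centroids = class_centroids(training)
print(centroids["finance"])                       # [1.5, 1.5, 0.0, 0.0]
print(rocchio_classify([1, 0, 0, 0], centroids))  # finance
print(rocchio_classify([0, 0, 0, 1], centroids))  # legal
```

The decision boundary between two categories is, in effect, the set of points equally distant from both centroids.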
These algorithm options are not necessarily mutually exclusive. Depending on the specific results sought, different aspects of each type might be used together to create a hybrid classification solution. It is clear, though, that there is a range of options available for anyone seeking to build a document classification system from scratch. However, with the increasing availability of cloud-based, code-free machine learning services, it's becoming easier to disregard the inner workings of these AI systems.
For a Business Intelligence professional, however, an understanding of the subtleties can always provide an unexpected advantage. If you are obsessed with the fine details, contact us to find out more about our private Machine Learning classes.
Find out More:
Watch our webinar to explore an AI for Documents solution that combines the power of Machine Learning and Azure Cognitive Services to remove the burden of manually classifying thousands of documents.