While AI has advanced by leaps and bounds on many fronts in the last two years, building a classifier is still a challenging (and interesting) machine learning task. For those who are not familiar with the classification problem in machine learning, here is a simple explanation: we want to assign images, videos, texts, or components of such multimodal content to specific categories (or classes, in ML speak).
For example, an autonomous car needs to process data and run inference in real time to determine whether there is a giant hole in the road ahead (a two-class, or binary, classification) so it can decide to drive over or around it. In another example, a robotic arm that picks defective products off an assembly line needs to recognize each object as it moves through the line and identify whether it has a defect.
Classification is a task we perform constantly as humans, and it is a common and important machine learning task as well. It is often one of the first machine learning tasks a team takes on, and it serves as scaffolding for later, more complex work, which makes it an important problem to master.
While we may think classification is easy, it can be tricky for several reasons depending on the context and the specific problem we are trying to solve. Here are some common challenges:
- Complexity of the Problem: Some classification problems are inherently complex and may not have easily discernible patterns. For example, a slight crack (or the beginning of one) in a product's plastic packaging may be misclassified as a shadow. The emergence of a sinkhole in a street may be misclassified as a pothole. It may be challenging for a machine to distinguish a mountain bluebird from an Eastern bluebird. In general, image classification with fine-grained categories is hard.
- Data Quantity and Quality: Classifiers require a sufficient amount of high-quality data for training. If the dataset is small, unrepresentative, noisy, imbalanced, and/or contains biases, it can lead to a poorly performing classifier. For example, if we have limited data on scans of a rare health condition and want to classify whether a scan shows an anomaly, limited or biased data may make that infeasible.
- Data Preprocessing: Data often needs preprocessing, including cleaning, normalization, and handling missing values. Inconsistent or messy data can make it challenging to build an effective classifier.
- Imbalanced Data: In some classification problems, classes are imbalanced: one class has significantly more examples than the others. Imbalanced data can lead to models that are biased toward the majority class.
- Biased Data: Data bias in classification occurs when the training dataset used to build a classifier is not representative of the real-world population, or when it contains systematic errors or prejudices. For example, a company that builds a job-application screening classifier using only historical hiring data will most likely inherit the biases of that data set.
- Feature Extraction and Selection: Extracting and selecting the right features is crucial. Choosing irrelevant or redundant features can degrade classifier performance, while overlooking important ones can lead to suboptimal results. Feature extraction and selection often require domain expertise; they are both a science and an art. We have to translate how humans distinguish complex things into machine-readable inputs.
- Data Drift: In dynamic environments, the data distribution may change over time. This data drift can cause a classifier's performance to degrade, so continuous monitoring and retraining may be required.
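To make the preprocessing point above concrete, here is a minimal, pure-Python sketch of two common steps: imputing missing values with the column mean and min-max scaling each column to [0, 1]. The function name and the toy data are illustrative; real pipelines typically use libraries such as pandas or scikit-learn.

```python
def preprocess(rows):
    """Impute missing values (None) with the column mean, then
    min-max scale each column to [0, 1]. A pure-Python sketch."""
    cols = list(zip(*rows))
    cleaned = []
    for col in cols:
        present = [v for v in col if v is not None]
        mean = sum(present) / len(present)          # column mean over observed values
        filled = [mean if v is None else v for v in col]
        lo, hi = min(filled), max(filled)
        span = (hi - lo) or 1.0                     # avoid division by zero for constant columns
        cleaned.append([(v - lo) / span for v in filled])
    return [list(r) for r in zip(*cleaned)]         # back to row-major layout

data = [[1.0, 10.0], [None, 20.0], [3.0, None]]
clean = preprocess(data)
# → [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

Note that the imputation and scaling statistics should be computed on the training split only and then reused on validation and test data, or the evaluation will leak information.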
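One simple mitigation for the class imbalance described above is random oversampling: duplicating minority-class examples until every class has as many as the majority class. The sketch below is a minimal, seeded version; more sophisticated techniques (e.g., SMOTE, or class weights in the loss function) are common in practice.

```python
import random

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes
    have as many examples as the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())  # majority-class count
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# 3 examples of class 0, 1 example of class 1 → balanced to 3 and 3
X, y = oversample([[0], [1], [2], [3]], [0, 0, 0, 1])
```

Oversampling should be applied only to the training split; duplicating examples before the train/test split would let copies of the same sample appear on both sides.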
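The data drift bullet above can also be sketched in code. Below is a deliberately crude drift check, assuming a single numeric feature: flag drift when the live mean moves more than two reference standard deviations from the training-time mean. The threshold of 2.0 is an arbitrary illustrative choice; production monitoring typically uses tests such as Kolmogorov-Smirnov or the Population Stability Index.

```python
import statistics

def mean_shift(reference, live, threshold=2.0):
    """Return (shift, drifted): how many reference standard deviations
    the live mean has moved, and whether it exceeds the threshold."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference) or 1.0  # guard against zero std
    shift = abs(statistics.mean(live) - ref_mean) / ref_std
    return shift, shift > threshold

# Training-time feature values vs. values seen in production
score, drifted = mean_shift([10, 11, 9, 10, 10], [14, 15, 13, 14, 14])
# drifted → True: the live distribution has moved well away from the reference
```

A check like this would run periodically against recent inference inputs; when it fires, the usual responses are investigating the upstream data source and retraining on fresh data.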
Opportunities remain to build performant classifiers for complex problems.