OpenFoodFacts is an food products database, completely open & crowd-sourced. It makes freely available the knowledge to make more informed choices about food products. Through their website and their barcode scanner mobile app, anyone can find detailed product information.

Scientists use this data to support research in nutrition and biology. Others build upon the data, with projects such as identifying high caloric foods (CaloPrix), sugary foods ( or distance from production (

Machine learning can support this work by recommending data when products are submitted or imputing missing data that can occur in crowd sourced datasets.

The goal is to use machine learning classifiers to figure out how to put products into categories.

Category information in the OpenFoodFacts database is in a hierarchical tree, where large categories that break down into smaller more specific categories. With this tree of categories, we can assume a product only has one correct class/category. And a product’s class is not required to be a leaf node in the class tree (aka. non-mandatory leaf node classification).

The most consistent product information is the product name, label, quantity and ingredients list. It’s this data that will feed the classifiers, training upon products that already have a category, and providing predictions where they don’t.

This is a classic text classification problem, with several design considerations.

Feature extraction

It’s necessary to translate between the written text and a numerical representation.

Normalization is required to reduce the input text to a standard form. Since we’re using labels and ingredient lists, we probably only need the most basic techniques to normalize. We’ll tokenize the lower-case form of the text into words using white-space and punctuation. As our text input is short, its possible binary token counts (words exists or not) can provide a more stable input to our classifier. We’ll remove accents from characters as it’s common for people to leave these out when writing. And n-grams may be worth considering, that is taking adjacent words as a single token. This preserves some word locality information.

Using the tools available in scikit-learn, we can build a vocabulary based on all the text data, and generate count vectors as described.

We now have a vector of numbers that represents our input text. This method used here is known as the Bag-Of-Words representation, a simple common method of vectorization.

Classifier scheme

To deploy component classifiers there are two typical approaches, a flat or hierarchical classification scheme.

Flat classifiers involve training a single multi-class classifier to choose from the target classes. Otherwise (more efficiently) train binary classifiers and combine the results in a One-v-All method.

Flat classification, makes one decision.

Hierarchical approaches assign a classifier to choose the next sub-category in a tree structure of classifiers. In this manner, each classifier only handles part of the problem and part of the data, as a result is more efficient.

Hierarchical classification, makes several smaller decisions.

The flat classification scheme handles all of the input data at once, this may restrict choices on the component classifier. This may be mitigated by using binary classifiers, although will lead to a very unbalanced dataset to work on.

The hierarchical scheme performs a sequence of classifications, this accumulates errors and causes severe accuracy loss for deeper categories. However, one can expect higher performance from the component classifiers in this scheme, as they can specialize at their sub-problem.

There is no golden rule on this choice, we’ll continue with the hierarchical scheme as it probably suits the problem best.


There is more than one strategy on how to choose a classifier for a project. If you have loads of data, it might make more sense to choose from one or two classifiers that suits the application on characteristics other than performance. The large volumes of data will make up the performance difference to a comparable level.

If data is more scarce, a brute-force approach is recommended to choose from applicable classifiers. Extra time should be spent on feature engineering, improving the quality of the data feeding the classifier.

And whichever you choose, optimizing its performance through configuration is a must. Hyper-Parameter Optimization is effectively a search to find the best configuration parameters to train a classifier.

For this project a middle ground is taken, some classifiers are identified as being likely candidates based on typical characteristics, listed below. These candidates are then evaluated against one another.

Logistic Regression has a fast prediction speed, performs well with less training data and with regularization can cope with noisy inputs.


Support Vector Machines with linear kernels, on top on the qualities of Logistic Regression, performs well with high dimensional data such as text.


Multinomial Naive Bayes, on top on the qualities of Logistic Regression, has a non-linear decision boundary.

P(A|B) = \frac{ P(B|A) P(A) }{ P(B) }

K-Nearest Neighbors, has a fast prediction speed, and has a non-linear decision boundary. But is too memory intensive for this work, as each classifier may hold a full copy of the training data set.


Random Forests Ensemble, can cope with noisy inputs, and classes don’t need to be linearly separable.


Evaluation of predictions requires measurements of error. Here we use the Lowest Common Ancestor (LCA) extension of the Hierarchical Classification (HC) versions of Precion (P), Recall (R) and F-score(F), as described by Kosmopoulos et al.

Where Y and \hat{Y} are the true and predicted classes respectively:

P_{LCA} = \frac{|Y \cap \hat{Y}|}{| \hat{Y} |}, R_{LCA} = \frac{|Y \cap \hat{Y}|}{| Y |}, F_{LCA} = |\frac{2 P_{LCA}R_{LCA}}{P_{LCA} + R_{LCA}}

HC versions of Precsion, Recall & F-score are preferable to standard flat classification measures as they consider the ancestor/descendents in the hierarchy Dekel. However the LCA extension Kosmopoulos et al. goes further and penalizes greater over/under specialization, without under-estimating the error where common ancestor chains are longer.


Observing a plot of F-scores for each component classifier…

We can see two classifiers have consistently higher F-scores than the others.
Clearly with this basic implementation accuracy is low for all, but this early evaluation can help us to identify a classifier to move forward with.

data distribution

The distribution of target category lengths tells us how much training data is available at a given level of the overall classifier tree.

The average sample count per classifier gives us an idea just how little training data a single component classifier has, and how it becomes more scarce as the target category get longer.

sample_per_classifierSuch sparsity issues are typically addressed by pruning the tree. This thin distribution would lend itself to the optimization of Koller & Sahami that restricts a nodes classifier to a subset of the feature space. This would improve the sample to feature count ratio and likely improve accuracy. And as mentioned previously hyper-optimization of the classifier is essential.

In Summary

We’ve addressed the challenge of hierarchical classification and identified the challenges that presented themselves. In this first iteration we performed standard techniques to prepare the data, we identified a suitable classifier scheme, and a component classifier. A suitable evaluation metric was used and initial optimization opportunities were found following visualization of the data.

With further optimization, this solution can be trained against the OpenFoodFacts database of products, and categorize products without category within a particular confidence interval. Ideally this would help and support future contributors to provide information to the database too.


Full source code is available on github.

Kosmopoulos et al.
Koller & Sahami