Fuzzy matching on big data

An illustration with scanner and crowd-sourced nutritional datasets

Lino Galiana

Insee

9/7/22

Introduction

Context

Persistent nutritional inequalities have public health effects.

  • Synthetic indices to improve understanding of nutritional quality:
    • Nutriscore
    • Nova

Perspective

Understanding the nature and the nutritional or environmental quality of the food products people consume helps to develop sustainable and healthy consumption.

Applications have emerged to help consumers learn more about the products available:

  • nutritional characteristics;
  • packaging;
  • carbon footprint.

Justification

Crowd-sourced databases open up new perspectives for analysing scanner data at population scale, once the two sources have been matched.

  • Need methods for associating these data sources that scale:
    • reliably;
    • flexibly;
    • efficiently.

Teasing

Research question

Enrichment of scanner data with several sources of information using advanced fuzzy matching methods

  • Combine a search-engine approach (ElasticSearch) with embeddings to associate pairs
  • Procedure to minimize false positives.

Sources

Scanner data (RelevanC data)


  • Casino group (>10% of the French market):
    • \(\approx\) 11,000 supermarkets;
    • \(\approx\) 250,000 different products;
    • Data at both supermarket-revenue and loyalty-card level

Tip

  • Needs to be enriched for analysis:
    • Geocoding supermarkets;
    • Getting product characteristics.

Crowd-sourced nutritional data (Open Food Facts)

  • \(\approx\) 2 million products (continuously updated)

| Type of information | Examples |
|---|---|
| Aggregated quality indices | Nutriscore, NOVA score, Ecoscore |
| Nutritional information | Energy, carbohydrates, fat… |
| Product information | Packaging, volume… |

Note

  • Access to additional datasets:
    • IRI: aggregator of information on products and supermarkets
    • CIQUAL: database produced by the Ministry of Agriculture
    • Wikipedia: brands scraped for categories in which many products go by different names or brands (e.g. alcohol)

Preliminary steps

Preprocessing (Details)

  1. Reduce noise in the datasets;
  2. Harmonize the different sources;
  3. Identify non-food products that remain despite category filtering.
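
The noise-reduction step can be sketched as a small label normalizer. The slides do not spell out the actual rules, so everything below is an assumption: the rules (join hyphenated words, drop tokens containing digits, replace punctuation with spaces, remove a few stop words) were chosen only because they reproduce the tokenized labels listed in the additional content, and `STOPWORDS` and `normalize_label` are hypothetical names.

```python
import re
import unicodedata

# Hypothetical stop-word list: articles plus a packaging abbreviation ("bq")
# that disappears from the tokenized labels shown in the additional content.
STOPWORDS = {"le", "la", "les", "de", "du", "des", "bq"}


def normalize_label(label: str) -> str:
    """Return a lowercase, noise-free version of a raw product label."""
    # Strip accents so that accented and unaccented spellings match.
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and join hyphenated words ("COCA-COLA" becomes "cocacola").
    text = text.lower().replace("-", "")
    # Drop whole tokens that contain a digit (weights, volumes, pack sizes).
    tokens = [tok for tok in text.split() if not any(c.isdigit() for c in tok)]
    # Replace remaining punctuation with spaces ("prov.msc" becomes "prov msc").
    text = re.sub(r"[^a-z ]", " ", " ".join(tokens))
    # Remove stop words and collapse whitespace.
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


if __name__ == "__main__":
    print(normalize_label("COCA-COLA ZERO PET 1.5LX6 CONT MAST"))
    print(normalize_label("SAUMON SAUVAGE PROV.MSC 330G BQ"))
```

Dropping digit-bearing tokens before splitting on punctuation keeps quantities such as "1.5LX6" or "35/45" from leaving stray letters behind.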

[Figure: word clouds before preprocessing. Panels: scanner data (RelevanC); crowd-sourced nutritional data (Open Food Facts)]


[Figure: word clouds after preprocessing. Panels: scanner data (RelevanC); crowd-sourced nutritional data (Open Food Facts)]

Classifying products using FastText

Tip

When performing linkage, a blocking variable is useful:

  • Increases the relevance of results
  • Increases search speed
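
To illustrate why blocking helps, here is a minimal sketch that indexes candidate products by a blocking key and only compares a scanner product with candidates in the same block. The field names (`label`, `coicop`), the shortened category codes and the helper names are assumptions made for the example, not the project's actual data model.

```python
from collections import defaultdict


def build_blocks(candidates, key="coicop"):
    """Group candidate products by the value of their blocking variable."""
    blocks = defaultdict(list)
    for candidate in candidates:
        blocks[candidate[key]].append(candidate)
    return blocks


def candidates_for(product, blocks, key="coicop"):
    """Return only the candidates that share the product's blocking value."""
    return blocks.get(product[key], [])


if __name__ == "__main__":
    # Toy data with made-up category codes.
    open_food_facts = [
        {"label": "pastis de marseille", "coicop": "02.1"},
        {"label": "beurre aux truffes", "coicop": "01.1"},
        {"label": "tartiflette au reblochon", "coicop": "01.1"},
    ]
    blocks = build_blocks(open_food_facts)
    scanner_product = {"label": "ricard fa18", "coicop": "02.1"}
    # One comparison instead of three.
    print(candidates_for(scanner_product, blocks))
```

Each scanner product is compared with one block instead of the whole reference base, which cuts both the search cost and the number of implausible candidates.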

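Recognize COICOP (Classification of Individual Consumption by Purpose) categories from product labels:

  • FastText neural network model
  • Trained on Consumer Price Index (CPI) scanner data
  • Examples: see the FastText classification model slide in the additional content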

Important

We never access the training dataset!

Linkage methodology

General approach

Objective

  1. Barcode linkage if the EAN is found in Open Food Facts
  2. Fuzzy matching among Open Food Facts products sharing the same COICOP
  3. Fuzzy matching in the whole of Open Food Facts
  4. Fuzzy matching in the whole CIQUAL and Wikipedia dictionaries (products with normalized names)
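
The four steps form a cascade: each product stops at the first step that yields an accepted match. A minimal sketch of that control flow follows; the search functions are passed in as plain callables because their real implementation (ElasticSearch queries plus validation) is not shown in the slides, and all names here are hypothetical.

```python
def link_product(product, off_by_ean, fuzzy_steps):
    """Return (step_name, match) for the first linkage step that succeeds.

    product      -- dict with at least an "ean" key
    off_by_ean   -- dict mapping EAN barcodes to Open Food Facts records
    fuzzy_steps  -- ordered list of (step_name, search_function) pairs; each
                    search function returns a validated match or None
    """
    # Step 1: exact barcode linkage.
    ean = product.get("ean")
    if ean in off_by_ean:
        return "ean", off_by_ean[ean]
    # Steps 2 to 4: increasingly permissive fuzzy searches.
    for step_name, search in fuzzy_steps:
        match = search(product)
        if match is not None:
            return step_name, match
    return "not_found", None


if __name__ == "__main__":
    # Placeholder search functions standing in for the real fuzzy searches.
    steps = [
        ("fuzzy_same_coicop", lambda p: None),
        ("fuzzy_all_off", lambda p: {"label": "pastis de marseille"}),
        ("fuzzy_ciqual_wikipedia", lambda p: {"label": "never reached"}),
    ]
    print(link_product({"ean": "unknown", "label": "ricard fa18"}, {}, steps))
```

Ordering the steps from most to least restrictive means a product only reaches a permissive search when every stricter one has failed.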

Important point

  • Conservative validation criteria to exclude false positives using:
    • Levenshtein distance
    • Word embeddings
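
As an illustration of the first criterion, the sketch below computes the Levenshtein distance with the classic dynamic-programming recurrence and turns it into a similarity between 0 and 1 that can be compared with a threshold. The normalization by the longer label and the 0.8 threshold are assumptions for the example; the slides do not give the values actually used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_conservative_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Accept a pair only when the labels are very close (assumed threshold)."""
    return levenshtein_similarity(a, b) >= threshold


if __name__ == "__main__":
    print(levenshtein_similarity("miel romarin haute valles", "miel romarin"))
    print(is_conservative_match("abricot", "abricot"))
```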

Textual distance is not good enough

  • Textual distance measures based on Levenshtein and TF-IDF distances:
    • Tend to choose products that share the same letters or words
    • Give false positives
  • Sometimes we want to find synonyms:
    • e.g. Ricard and Pastis
    • These would be false negatives with a textual distance
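
Both failure modes can be reproduced with a small word-level TF-IDF cosine similarity written in plain Python. This is only a sketch: the smoothed IDF formula is one common variant, and the label "cassoulet william saurin" is invented for the example to show a pair that shares a brand but not a product.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict token -> weight) per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    document_frequency = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant).
    idf = {
        tok: math.log((1 + n_docs) / (1 + df)) + 1
        for tok, df in document_frequency.items()
    }
    return [
        {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    labels = [
        "ricard",
        "pastis de marseille",
        "tartiflette william saurin",
        "cassoulet william saurin",  # invented label, for illustration only
    ]
    ricard, pastis, tartiflette, cassoulet = tfidf_vectors(labels)
    print("synonyms:", cosine(ricard, pastis))
    print("same brand, different dish:", cosine(tartiflette, cassoulet))
```

The synonym pair shares no word, so its similarity is zero, while the two different dishes score above zero simply because they share a brand name.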

Idea

  • Need a complementary measure, based on word embeddings
    • Labels can be close in semantic space…
    • … without sharing any words

Siamese neural network

Tip

We have a way to learn how to link scanner and crowd-sourced labels!

  • Using barcode linkage (step 1) as training set
  • \(\approx\) 150,000 pairs to learn from

| Scanner data | Open Food Facts |
|---|---|
| Beurre aux truffes | Beurre aux truffes |
| Ricard FA18 | Pastis de Marseille |
| Tartiflette William Saurin | Tartiflette au reblochon |

  • Model performs relatively well (see results)
  • Implemented using PyTorch:
    • Transfer learning from our COICOP classifier (FastText embedding)
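
A minimal sketch of how barcode-linked products could be turned into a supervised training set: products sharing an EAN give positive pairs, and labels of other products give negative pairs. The sampling scheme (one random negative per positive, fixed seed) and the placeholder EAN keys are assumptions; only the three label pairs come from the slide.

```python
import random


def build_pairs(scanner_labels, off_labels, negatives_per_positive=1, seed=0):
    """Build (scanner_label, off_label, target) triples from EAN-keyed dicts.

    target is 1 when both labels share a barcode and 0 otherwise.
    """
    rng = random.Random(seed)
    off_eans = sorted(off_labels)
    pairs = []
    for ean, scanner_label in scanner_labels.items():
        if ean not in off_labels:
            continue  # no barcode match: this product cannot supervise anything
        pairs.append((scanner_label, off_labels[ean], 1))
        # Negative examples: labels of products with a different barcode.
        others = [other for other in off_eans if other != ean]
        k = min(negatives_per_positive, len(others))
        for other in rng.sample(others, k):
            pairs.append((scanner_label, off_labels[other], 0))
    return pairs


if __name__ == "__main__":
    # Placeholder barcodes; labels taken from the slide.
    scanner = {
        "ean_a": "Beurre aux truffes",
        "ean_b": "Ricard FA18",
        "ean_c": "Tartiflette William Saurin",
    }
    open_food_facts = {
        "ean_a": "Beurre aux truffes",
        "ean_b": "Pastis de Marseille",
        "ean_c": "Tartiflette au reblochon",
    }
    for pair in build_pairs(scanner, open_food_facts):
        print(pair)
```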

Siamese neural network

  • Siamese neural networks are supervised methods based on pair comparison
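
To make the pair comparison concrete, one common formulation is given below. It is an assumption rather than the loss actually used here, which the slides do not state. A single encoder \(f_\theta\) with shared weights embeds both labels \(a\) and \(b\), their similarity \(s\) is the cosine of the two embeddings, and a margin-based loss pulls admissible pairs (\(y = 1\)) together and pushes inadmissible pairs (\(y = 0\)) below a margin \(m\):

```latex
\[
s(a, b) = \frac{f_\theta(a) \cdot f_\theta(b)}
               {\lVert f_\theta(a) \rVert \, \lVert f_\theta(b) \rVert}
\]
\[
\mathcal{L}(a, b, y) = y \, \bigl(1 - s(a, b)\bigr)
                     + (1 - y) \, \max\bigl(0,\; s(a, b) - m\bigr)
\]
```

Because both labels pass through the same \(f_\theta\), a label gets the same embedding whether it comes from the scanner data or from Open Food Facts.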

Results

Evaluating the word embedding

  • We can compare textual and semantic distances
  • Good match:
    • Both measures are high
    • but sometimes only the embedding helps
  • More general performance measures: see the Siamese neural network performance slide in the additional content

Linkage share (see as Figure)

  • Fuzzy matching helps to complete the linkage
  • \(\approx\) 98% of values imputed
Share of products and revenues covered by each linkage step (energy nutrient)

| Linkage step | Casino: products sold (%) | Casino: revenue (%) | Franprix: products sold (%) | Franprix: revenue (%) | Monoprix: products sold (%) | Monoprix: revenue (%) |
|---|---|---|---|---|---|---|
| Products not found | 0.8 | 1.4 | 0.5 | 0.9 | 0.9 | 1.3 |
| EAN matching (step 1) | 72.1 | 65.0 | 83.5 | 81.1 | 68.6 | 66.0 |
| Fuzzy matching with Open Food Facts, restrictive (step 2) | 24.6 | 29.3 | 12.9 | 14.6 | 26.9 | 27.5 |
| Fuzzy matching with Open Food Facts, less restrictive (step 3) | 1.7 | 2.8 | 2.0 | 2.1 | 2.7 | 3.6 |
| Fuzzy matching with CIQUAL (step 4) | 0.7 | 1.6 | 1.2 | 1.4 | 0.9 | 1.5 |
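
As a quick check of the arithmetic behind the table above, the sketch below recomputes the coverage (100 minus the "not found" share) and the total contribution of the three fuzzy steps for each retailer. The figures are copied from the table; the dictionary layout and function names are just one convenient, hypothetical way to hold them.

```python
# Shares in %, copied from the table: (products sold, revenue) per linkage step.
SHARES = {
    "Casino": {
        "not_found": (0.8, 1.4), "ean": (72.1, 65.0),
        "fuzzy_restrictive": (24.6, 29.3),
        "fuzzy_less_restrictive": (1.7, 2.8), "fuzzy_ciqual": (0.7, 1.6),
    },
    "Franprix": {
        "not_found": (0.5, 0.9), "ean": (83.5, 81.1),
        "fuzzy_restrictive": (12.9, 14.6),
        "fuzzy_less_restrictive": (2.0, 2.1), "fuzzy_ciqual": (1.2, 1.4),
    },
    "Monoprix": {
        "not_found": (0.9, 1.3), "ean": (68.6, 66.0),
        "fuzzy_restrictive": (26.9, 27.5),
        "fuzzy_less_restrictive": (2.7, 3.6), "fuzzy_ciqual": (0.9, 1.5),
    },
}
FUZZY_STEPS = ("fuzzy_restrictive", "fuzzy_less_restrictive", "fuzzy_ciqual")


def coverage(retailer):
    """Share (products, revenue) for which a match was found."""
    products, revenue = SHARES[retailer]["not_found"]
    return round(100 - products, 1), round(100 - revenue, 1)


def fuzzy_contribution(retailer):
    """Share (products, revenue) linked by the fuzzy steps 2 to 4."""
    products = sum(SHARES[retailer][step][0] for step in FUZZY_STEPS)
    revenue = sum(SHARES[retailer][step][1] for step in FUZZY_STEPS)
    return round(products, 1), round(revenue, 1)


if __name__ == "__main__":
    for retailer in SHARES:
        print(retailer, coverage(retailer), fuzzy_contribution(retailer))
```

With barcode linkage alone, between roughly two thirds and four fifths of products are matched; the fuzzy steps supply most of the remainder.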

Linkage quality

Conclusion

Fuzzy matching can be performed on high-dimensional sources

  • Specific feature of our corpus: many variants of closely related labels
    • e.g. “Coca-Cola 6x33CL can” vs “Coca Cola 1.5L bottle”
    • Preprocessing is vital to harmonize product labels!
  • Trade-off between:
    • flexible matching, which lets false positives through;
    • conservative selection criteria, which produce false negatives.
  • Compromise to allow similarities despite different wordings:
    • Syntactic measures (TF-IDF measures, n-grams, Levenshtein distance)
    • Semantic criteria (neural network to build an embedding)
  • Nutritional values imputed for more than 98% of the products.

What are the prospects?

  • We can take advantage of crowd-sourced data to enrich scanner data:
    • At the European level, scanner data are automatically collected for the CPI;
    • Decompose the CPI by family of products.
  • When producing consumer survey data (e.g. HCFS):
    • Automatic linkage of label names with information from auxiliary sources;
    • Reduce the amount of information collected from households.
  • Such linkage makes new research possible:
    • Studies of consumption at fine geographic granularity

Additional content

FastText classification model: examples (back to content)



| Initial label | Tokenized label | COICOP | Label |
|---|---|---|---|
| LE PANIER FAISSELLE BIO 4X100G | panier faisselle bio | 01.1.1.5.1.9999 | Bread and cereals |
| NAVARIN AGNEAU 1,2KE | navarin agneau | 01.1.2.8.3.0010 | Meat |
| SAUMON SAUVAGE PROV.MSC 330G BQ | saumon sauvage prov msc | 01.1.3.6.1.9999 | Fish and shellfish |
| ABRICOT 35/45 BQ 1KG | abricot | 01.1.6.3.1.0005 | Fruits |
| POTE AUVERGNATE 400G | pote auvergnate | 01.1.7.6.1.9999 | Vegetables |
| MIEL ROMARIN HAUTE VALLES 480G | miel romarin haute valles | 01.1.8.4.1.0016 | Confectionery and frozen products |
| ENTREMETS CITRON MERINGUE 6P 500G | entremets citron meringue | 01.1.8.5.1.9999 | Confectionery and frozen products |
| HERBE MENTHE POT | herbe menthe pot | 01.1.9.2.1.0017 | Salt, spices and sauces |
| COCA-COLA ZERO PET 1.5LX6 CONT MAST | cocacola zero pet cont mast | 01.2.2.2.1.0006 | Other soft drinks |
| BISTROT DE FRANCE RS BIB 5L | bistrot france rs bib | 02.1.2.1.1.0004 | Wines, ciders and champagne |

Pipeline implementation (back to content)

Python controls this polyglot pipeline (foodbowl 🍜 package):

  • Handles connection with ElasticSearch and S3 ;
  • Transforms datasets into ElasticSearch requests ;
  • Restructures search results from ElasticSearch ;
  • Checks linked labels and accepts or rejects them (textual distance or PyTorch-trained word embedding)
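
As an illustration of the second bullet, the sketch below turns one scanner product into the body of an ElasticSearch search request, using the standard query DSL (`bool`, `match` with `fuzziness`, `term` filter). The index field names `label` and `coicop` and the function name are assumptions; the actual mapping used by the foodbowl package is not shown in the slides.

```python
import json


def build_search_request(label, coicop=None, size=5):
    """Return a search request body for one preprocessed product label.

    When a COICOP code is given, it is used as a filter (blocking variable);
    otherwise the whole index is searched.
    """
    bool_query = {
        "must": [
            # Full-text match tolerant to small spelling differences.
            {"match": {"label": {"query": label, "fuzziness": "AUTO"}}}
        ]
    }
    if coicop is not None:
        # Exact filter on the category; it does not affect the relevance score.
        bool_query["filter"] = [{"term": {"coicop": coicop}}]
    return {"size": size, "query": {"bool": bool_query}}


if __name__ == "__main__":
    request = build_search_request("cocacola zero pet cont mast", "01.2.2.2.1.0006")
    print(json.dumps(request, indent=2))
```

Calling the same builder with and without a `coicop` value would correspond to the restrictive and less restrictive fuzzy steps of the general approach.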

Perspective

Future work is needed to make our models public (possibly through FastAPI)

Examples of linkage (back to content)

Random examples of the values linked between sources

| RelevanC: original label | RelevanC: preprocessed label | Open Food Facts: preprocessed label | Nutrients: energy (per 100 g) |
|---|---|---|---|
| HERBE MENTHE POT | herbe menthe pot | menthe bio pot | 180 |
| MIEL ROMARIN HAUTE VALLES 480G | miel romarin haute valles | miel romarin | 336 |
| POTE AUVERGNATE 400G | pote auvergnate | truffade auvergnate | 682 |
| LE PANIER FAISSELLE BIO 4X100G | panier faisselle bio | panier | 389 |
| COCA-COLA ZERO PET 1.5LX6 CONT MAST | cocacola zero pet cont mast | cocacola pet | 126 |
| NAVARIN AGNEAU 1,2KE | navarin agneau | navarin petit legumes agneau francais | 197 |
| ABRICOT 35/45 BQ 1KG | abricot | abricot | 84 |
| BISTROT DE FRANCE RS BIB 5L | bistrot france rs bib | rs | 1540 |
| SAUMON SAUVAGE PROV.MSC 330G BQ | saumon sauvage prov msc | msc oeufs saumon sauvage | 866 |
| ENTREMETS CITRON MERINGUE 6P 500G | entremets citron meringue | cone citron meringue | 1142 |

Siamese neural network performance (back to content)

Warning

The same product can appear under slightly different names in Open Food Facts. Such duplicates could be treated as admissible pairs. Here, however, they are counted as inadmissible, like any other non-matching product.