Distance geometry for word embeddings - Département d'informatique
Preprint / Working Paper, Year: 2020

Distance geometry for word embeddings

Abstract

Many machine learning algorithms rely on vector representations as input. In particular, natural language word vector representations that encode semantic information can be constructed using several different methods, all based on solving an unconstrained optimization problem with stochastic gradient descent. Traditionally, these optimization formulations arise either from word co-occurrence-based models (e.g., word2vec, GloVe, fastText) or from encoders combined with a masked language model (e.g., BERT). In this work we propose word embedding methods based on the Distance Geometry Problem (DGP): find object positions given a subset of their pairwise distances. Treating the empirical Pointwise Mutual Information (PMI) as an inner product approximation, we discuss two algorithms to obtain approximate solutions of the underlying Euclidean DGP on large instances. The resulting algorithms are considerably faster than state-of-the-art methods such as GloVe, fastText, or BERT, with similar performance on classification tasks. The main advantage of our approach for practical use is its significantly lower computational complexity, which allows representations to be trained much faster with negligible quality loss, a useful property for domain-specific corpora.
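To make the "PMI as inner product approximation" idea concrete, the following sketch builds an empirical (positive) PMI matrix from toy co-occurrence counts and recovers low-dimensional word vectors via a truncated eigendecomposition, in the spirit of classical multidimensional scaling. This is an illustrative simplification, not the authors' DGP algorithms; the counts and the factorization method are assumptions for the example.

```python
import numpy as np

# Hypothetical symmetric co-occurrence counts for a 4-word vocabulary.
C = np.array([
    [0., 8., 2., 1.],
    [8., 0., 4., 1.],
    [2., 4., 0., 6.],
    [1., 1., 6., 0.],
])

total = C.sum()
p_ij = C / total                 # joint probabilities
p_i = C.sum(axis=1) / total      # marginal probabilities

# Empirical PMI; zero counts give log(0) = -inf, which we clip to 0 (PPMI).
with np.errstate(divide="ignore"):
    pmi = np.log(p_ij / np.outer(p_i, p_i))
pmi = np.maximum(pmi, 0.0)

# Treat the PMI matrix as an approximate Gram matrix of inner products
# <x_i, x_j> and recover d-dimensional positions from its top eigenpairs.
d = 2
eigvals, eigvecs = np.linalg.eigh(pmi)
top = np.argsort(eigvals)[::-1][:d]
X = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
# X has shape (vocabulary size, d); row i is the embedding of word i.
```

A spectral factorization like this is exact only when the PMI matrix is positive semidefinite of rank at most d; in practice it is neither, which is one motivation for the approximate distance-geometry formulations discussed in the paper.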
Main file: DGP_for_word_representations__General_format_.pdf (313.22 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02892020, version 1 (07-07-2020)

Identifiers

  • HAL Id: hal-02892020, version 1

Cite

Sammy Khalife, Douglas S. Gonçalves, Leo Liberti. Distance geometry for word embeddings. 2020. ⟨hal-02892020⟩