Distance geometry for word embeddings - Département d'informatique
Preprint / Working Paper, Year: 2020

Distance geometry for word embeddings

Abstract

Many machine learning algorithms rely on vector representations as input. In particular, natural language word vector representations that encode semantic information can be constructed using several different methods, all based on solving an unconstrained optimization problem with stochastic gradient descent. Traditionally, these optimization formulations arise either from word co-occurrence-based models (e.g., word2vec, GloVe, fastText) or from encoders combined with a masked language model (e.g., BERT). In this work we propose word embedding methods based on the Distance Geometry Problem (DGP): find object positions given a subset of their pairwise distances. Treating the empirical Pointwise Mutual Information (PMI) as an inner product approximation, we discuss two algorithms to obtain approximate solutions of the underlying Euclidean DGP on large instances. The resulting algorithms are considerably faster than state-of-the-art methods such as GloVe, fastText, or BERT, with similar performance on classification tasks. The main advantage of our approach for practical use is its significantly lower computational complexity, which allows representations to be trained much faster with negligible quality loss, a useful property for domain-specific corpora.
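To make the "PMI as inner product approximation" idea concrete, the following sketch builds an empirical (positive) PMI matrix from toy co-occurrence counts and recovers low-dimensional word vectors via a truncated eigendecomposition, in the spirit of classical multidimensional scaling. This is an illustrative simplification, not the authors' DGP algorithms; the counts and the factorization method are assumptions for the example.

```python
import numpy as np

# Hypothetical symmetric co-occurrence counts for a 4-word vocabulary.
C = np.array([
    [0., 8., 2., 1.],
    [8., 0., 4., 1.],
    [2., 4., 0., 6.],
    [1., 1., 6., 0.],
])

total = C.sum()
p_ij = C / total                 # joint probabilities
p_i = C.sum(axis=1) / total      # marginal probabilities

# Empirical PMI; zero counts give log(0) = -inf, which we clip to 0 (PPMI).
with np.errstate(divide="ignore"):
    pmi = np.log(p_ij / np.outer(p_i, p_i))
pmi = np.maximum(pmi, 0.0)

# Treat the PMI matrix as an approximate Gram matrix of inner products
# <x_i, x_j> and recover d-dimensional positions from its top eigenpairs.
d = 2
eigvals, eigvecs = np.linalg.eigh(pmi)
top = np.argsort(eigvals)[::-1][:d]
X = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
# X has shape (vocabulary size, d); row i is the embedding of word i.
```

A spectral factorization like this is exact only when the PMI matrix is positive semidefinite of rank at most d; in practice it is neither, which is one motivation for the approximate distance-geometry formulations discussed in the paper.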
Main file: DGP_for_word_representations__General_format_.pdf (313.22 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02892020, version 1 (07-07-2020)

Identifiers

  • HAL Id: hal-02892020, version 1

Cite

Sammy Khalife, Douglas S. Gonçalves, Leo Liberti. Distance geometry for word embeddings. 2020. ⟨hal-02892020⟩