Publications

2017

Villar Lafuente, Carlos ; Garcés Díaz-Munío, Gonçal

Several approaches for tweet topic classification in COSET – IberEval 2017 (Inproceedings)

Proc. of 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017), pp. 36–42, Murcia (Spain), 2017.

Tags: COSET2017, language models, linear models, neural networks, sentence embeddings, text classification

@inproceedings{Lafuente2017,
title = {Several approaches for tweet topic classification in COSET – IberEval 2017},
author = {Villar Lafuente, Carlos and Garcés Díaz-Munío, Gonçal},
url = {http://hdl.handle.net/10251/166361
http://ceur-ws.org/Vol-1881/COSET_paper_4.pdf},
year = {2017},
date = {2017-01-01},
booktitle = {Proc. of 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)},
pages = {36--42},
address = {Murcia (Spain)},
abstract = {[EN] These working notes summarize the different approaches we have explored in order to classify a corpus of tweets related to the 2015 Spanish General Election (COSET 2017 task from IberEval 2017). Two approaches were tested during the COSET 2017 evaluations: Neural Networks with Sentence Embeddings (based on TensorFlow) and N-gram Language Models (based on SRILM). Our results with these approaches were modest: both ranked above the “Most frequent” baseline, but below the “Bag-of-words + SVM” baseline. A third approach was tried after the COSET 2017 evaluation phase was over: Advanced Linear Models (based on fastText). Results measured over the COSET 2017 Dev and Test sets show that this approach is well above the “TF-IDF+RF” baseline.

[CA] "Alguns mètodes per a la classificació temàtica de tuits en COSET - IberEval 2017": Aquest article resumeix els diferents mètodes que hem explorat per a classificar un corpus de tuits sobre les eleccions generals d'Espanya de 2015 (tasca COSET 2017 del taller IberEval 2017). Analitzàrem dos mètodes durant les avaluacions de COSET 2017: xarxes neuronals amb vectorització ("embedding") a nivell de frase (basat en TensorFlow) i models de llenguatge d'n-grames (basat en SRILM). Els nostres resultats amb aquests mètodes van ser modests: ambdós quedaren per damunt del valor de referència d'"el més freqüent" ("Most frequent"), però per davall del valor de referència de "bossa de paraules+SVM" ("Bag-of-words+SVM"). Analitzàrem un tercer mètode quan ja havia acabat la fase d'avaluacions de COSET 2017: models lineals avançats (basat en fastText). Els resultats mesurats sobre els conjunts de validació i prova de COSET 2017 mostren que aquest mètode supera clarament el valor de referència "TF-IDF+RF".},
keywords = {COSET2017, language models, linear models, neural networks, sentence embeddings, text classification},
pubstate = {published},
tppubtype = {inproceedings}
}