Review web pages collector tool for thematic corpus creation

Abstract: We present a method for automatically extracting and gathering specific text data from web pages in order to create a thematic corpus of reviews for opinion mining and sentiment analysis. The internet is an immense source of machine-readable texts [11] suitable for linguistic corpus studies [3][1]. However, neither the tools from web information extraction research nor those from NLP include an open-source system able to provide a thematic corpus in response to an end-user request [16]. The need for natural texts as a data bank for opinion mining and sentiment analysis has grown with the expansion of digital interaction between users and blogs, forums, and social networks. The RevScrap system is designed to provide an intuitive, easy-to-use interface that extracts specific information from the relevant web pages returned by a search engine query and creates a corpus composed of comments, reviews, and opinions expressing users' experience and feedback. The corpus is structured as XML documents, reflecting Singler's design criteria [4].
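
The pipeline summarized in the abstract (fetched pages in, review text out, XML corpus as the result) can be illustrated with a minimal sketch. This is not the RevScrap implementation: the class name `ReviewExtractor`, the function `build_corpus`, the `review` CSS class used to locate comments, and the XML element names are all assumptions made for illustration, and the pages are supplied as in-memory strings rather than retrieved from a search engine.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ReviewExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class.

    The class name "review" is a hypothetical marker; real sites differ.
    """

    def __init__(self, review_class="review"):
        super().__init__()
        self.review_class = review_class
        self.depth = 0      # > 0 while inside a matching element
        self.tag = None     # tag name of the element being collected
        self.buffer = []
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth == 0 and self.review_class in classes:
            self.tag = tag
            self.depth = 1
        elif self.depth > 0 and tag == self.tag:
            self.depth += 1  # nested element with the same tag name

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # Normalize whitespace before storing the review text.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.reviews.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def build_corpus(pages, theme):
    """Build an XML corpus from a mapping of URL to HTML source.

    The element names (corpus, document, review) are illustrative only.
    """
    corpus = ET.Element("corpus", theme=theme)
    for url, html in pages.items():
        parser = ReviewExtractor()
        parser.feed(html)
        parser.close()
        doc = ET.SubElement(corpus, "document", source=url)
        for text in parser.reviews:
            ET.SubElement(doc, "review").text = text
    return corpus


if __name__ == "__main__":
    sample = {
        "http://example.org/p1": (
            '<div class="review">Great <b>phone</b>!</div>'
            '<p>advertisement</p>'
            '<div class="review user">Too   slow</div>'
        )
    }
    print(ET.tostring(build_corpus(sample, "phones"), encoding="unicode"))
```

Only the Python standard library is used here, so the sketch runs without extra dependencies. A production collector would also need page retrieval, encoding detection, and site-specific extraction rules.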
Document type:
Conference paper
CILC2016. 8th International Conference on Corpus Linguistics, Mar 2016, MALAGA, Spain. EPiC Series in Language and Linguistics, 1, pp.274 - 282, 2016, CILC2016. 8th International Conference on Corpus Linguistics. 〈EasyChair, http://www.easychair.org〉

https://hal-univ-paris8.archives-ouvertes.fr/hal-01489726
Contributor: Anna Pappa
Submitted on: Monday, 24 April 2017 - 17:00:34
Last modified on: Tuesday, 22 May 2018 - 20:40:06
Document(s) archived on: Tuesday, 25 July 2017 - 12:10:58

File

revscrap.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01489726, version 1

Citation

Lisa Medrouk, Anna Pappa, Jugurtha Hallou. Review web pages collector tool for thematic corpus creation. CILC2016. 8th International Conference on Corpus Linguistics, Mar 2016, MALAGA, Spain. EPiC Series in Language and Linguistics, 1, pp.274 - 282, 2016, CILC2016. 8th International Conference on Corpus Linguistics. 〈EasyChair, http://www.easychair.org〉. 〈hal-01489726〉

Metrics

Record views

106

File downloads

91