Review web pages collector tool for thematic corpus creation

Lisa Medrouk; Anna Pappa; Jugurtha Hallou

Conference paper, Year: 2016

Lisa Medrouk

Role: Author
PersonId : 1004054

Laboratoire d'Informatique Avancée de Saint-Denis

Anna Pappa

Role: Author
PersonId : 749911
IdHAL : anna-pappa
ORCID : 0000-0003-2447-4078

Laboratoire d'Informatique Avancée de Saint-Denis

Jugurtha Hallou

Role: Author
PersonId : 1004055

Laboratoire d'Informatique Avancée de Saint-Denis

Abstract

We present a method for automatically extracting and gathering specific text data from web pages to create a thematic corpus of reviews for opinion mining and sentiment analysis. The internet is an immense source of machine-readable texts [11] suitable for linguistic corpus studies [3][1]. However, the existing tools from web information extraction research and from NLP do not include an open-source system able to provide a thematic corpus in response to an end-user request [16]. The need for natural texts as a data bank for opinion mining and sentiment analysis has grown with the expansion of digital interaction between users and blogs, forums, and social networks. The RevScrap system is designed to provide an intuitive, easy-to-use interface that extracts specific information from the relevant web pages returned by a search engine request and creates a corpus composed of comments, reviews, and opinions that express users' experience and feedback. The corpus is structured as XML documents, reflecting Singler's design criteria [4].
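The pipeline outlined in the abstract (take pages for a theme, keep only the review text, store it as XML) can be illustrated with a minimal sketch. This is not the RevScrap implementation, which the abstract does not detail. It assumes that review blocks are marked with a hypothetical CSS class `review`, and the XML element and attribute names (`corpus`, `review`, `theme`, `source`, `id`) are illustrative choices, not the schema used by the authors. Only the Python standard library is used, and the page content is an inline sample rather than a real search engine result.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ReviewExtractor(HTMLParser):
    """Collect the text of every element whose class list contains css_class."""

    def __init__(self, css_class="review"):  # "review" is a hypothetical marker
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # nesting depth inside the current review block
        self.buffer = []    # text chunks of the current review block
        self.reviews = []   # finished review texts

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join chunks and collapse runs of whitespace.
            text = " ".join(" ".join(self.buffer).split())
            if text:
                self.reviews.append(text)

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def build_corpus(theme, pages):
    """Build an XML corpus from (url, html) pairs; the schema is illustrative."""
    corpus = ET.Element("corpus", theme=theme)
    count = 0
    for url, html in pages:
        parser = ReviewExtractor()
        parser.feed(html)
        parser.close()
        for text in parser.reviews:
            count += 1
            review = ET.SubElement(corpus, "review", source=url, id=str(count))
            review.text = text
    return corpus


sample_html = """
<html><body>
  <div class="review"><p>Great battery life,<br>but the screen scratches easily.</p></div>
  <div class="ad">Buy now!</div>
  <div class="review">Arrived late &amp; poorly packaged.</div>
</body></html>
"""
corpus = build_corpus("smartphones", [("https://example.com/reviews", sample_html)])
print(ET.tostring(corpus, encoding="unicode"))
```

In this sketch the advertisement block is skipped because it lacks the assumed class, and each retained review keeps its source URL as an attribute so that the origin of every text in the corpus stays traceable.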

Keywords

corpus design; thematic corpus; opinion mining

Domains

Computer Science [cs]

Main file

revscrap.pdf (743.06 KB)

Origin: Files produced by the author(s)

https://univ-paris8.hal.science/hal-01489726

Submitted on: Monday, 24 April 2017, 17:00:34

Last modified on: Monday, 8 November 2021, 13:38:02

Long-term archiving on: Tuesday, 25 July 2017, 12:10:58

Dates and versions

hal-01489726 , version 1 (24-04-2017)

Identifiers

HAL Id : hal-01489726 , version 1

Cite

Lisa Medrouk, Anna Pappa, Jugurtha Hallou. Review web pages collector tool for thematic corpus creation. CILC2016. 8th International Conference on Corpus Linguistics, Mar 2016, MALAGA, Spain. pp.274 - 282. ⟨hal-01489726⟩

Collections

UNIV-PARIS8 LIASD UNIV-PARIS-LUMIERES UNIV-PARIS8-OA
