Improving Word meaning representations using Wikipedia categories

Svoboda, Lukáš; Brychcín, Tomáš

Title:	Improving Word meaning representations using Wikipedia categories
Other Titles:	Vylepšení reprezentace slovních vektorů s využitím kategorií z Wikipedie
Authors:	Svoboda, Lukáš; Brychcín, Tomáš
Citation:	SVOBODA, L., BRYCHCÍN, T. Improving Word meaning representations using Wikipedia categories. Neural Network World, 2018, vol. 28, no. 6, pp. 523-534. ISSN 1210-0552.
Issue Date:	2018
Publisher:	Institute of Computer Science
Document type:	article
URI:	2-s2.0-85061489302 http://hdl.handle.net/11025/34807
ISSN:	1210-0552
Keywords:	distributional semantics; improving word2vec; word embeddings; global information; Wikipedia; CBOW; Skip-gram; numerical word representation
Keywords in different language:	word2vec; Skip-gram; CBOW; improving distributional word representation; using global information; new approach
Abstract:	In this paper we present the Skip-gram and CBOW methods for extracting word meaning representations, extended with global information. We use our own corpus, which we generate from Wikipedia together with the global information; Wikipedia articles are organized hierarchically by category. These categories provide additional and very useful information (a description) about each article. We introduce four new models for enriching word meaning representations with global information. We experiment with the English Wikipedia and test our models on standard word similarity datasets and a word analogy corpus. The proposed models significantly outperform standard word representation methods, especially when trained on corpora of similar size, and give results comparable to methods trained on much larger datasets. Our new approach shows that increasing the amount of training data need not improve the quality of word meaning representations as much as training with global information does, or, as newer approaches show, as much as using the internal, character-level information of each word (fastText).
Abstract in different language:	In this paper we extend the Skip-gram and Continuous Bag-of-Words distributional word representation models with global context information. We use a corpus extracted from Wikipedia, where articles are organized in a hierarchy of categories. These categories provide useful topical information about each article. We present four new approaches to enriching word meaning representations with such information. We experiment with the English Wikipedia and evaluate our models on standard word similarity and word analogy datasets. The proposed models significantly outperform other word representation methods when training data of similar size is used, and they provide similar performance compared with methods trained on much larger datasets. Our new approach shows that increasing the amount of unlabelled data does not necessarily increase the performance of word embeddings as much as introducing global or sub-word information, especially when training time is taken into consideration.
Rights:	© Institute of Computer Science
Appears in Collections:	Články / Articles (KIV) OBD

Files in This Item:

File	Size	Format
Svoboda NNW.2018.28.029.pdf	387.91 kB	Adobe PDF

Please use this identifier to cite or link to this item: http://hdl.handle.net/11025/34807
