Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks; Future Internet; Vol. 13, iss. 1

Bibliographic Details
Parent link: Future Internet
Vol. 13, iss. 1.— 2021.— [16 p.]
Corporate author: Национальный исследовательский Томский политехнический университет Инженерная школа информационных технологий и робототехники Отделение автоматизации и робототехники
Other authors: Romanov A. S. Aleksandr Sergeevich, Kurtukova A. V. Anna Vladimirovna, Shelupanov A. A. Aleksandr Aleksandrovich, Fedotova A. M. Anastasia Mikhaylovna, Goncharov V. I. Valery Ivanovich
Summary: Title screen
The article explores approaches to determining the author of a natural-language text and the advantages and disadvantages of these approaches. The problem is important because of the active digitalization of society and the shift of most everyday activities online. Authorship identification methods are particularly useful for information security and forensics. For example, such methods can be used to identify the authors of suicide notes and other texts subjected to forensic examination. Another area of application is plagiarism detection, which is relevant both to intellectual property protection in the digital space and to education. The article describes identifying the author of a Russian-language text using a support vector machine (SVM) and deep neural network architectures (long short-term memory (LSTM), convolutional neural networks (CNN) with attention, and Transformer). The results show that all the considered algorithms are suitable for the authorship identification problem, but SVM shows the best accuracy. The average accuracy of SVM reaches 96%. This is due to carefully chosen parameters and a feature space that includes statistical and semantic features (including those extracted by aspect analysis). Deep neural networks are inferior to SVM in accuracy and reach only 93%. The study also evaluates how attacks on the method affect the models’ accuracy. Experiments show that the SVM-based methods are not robust to deliberate text anonymization. In comparison, the loss in accuracy of deep neural networks does not exceed 20%. The Transformer architecture is the most effective for anonymized texts, achieving 81% accuracy.
Language: English
Published: 2021
Subjects: electronic resource, works of TPU scientists, authorship, text mining, machine learning
Online access: http://dx.doi.org/10.3390/fi13010003
Format: Electronic resource, Chapter
KOHA link: https://koha.lib.tpu.ru/cgi-bin/koha/opac-detail.pl?biblionumber=665057

MARC

LEADER 00000naa0a2200000 4500
001 665057
005 20250127141352.0
035 |a (RuTPU)RU\TPU\network\36256 
035 |a RU\TPU\network\36201 
090 |a 665057 
100 |a 20210702d2021 k||y0rusy50 ba 
101 0 |a eng 
102 |a CH 
135 |a drnn ---uucaa 
181 0 |a i  
182 0 |a b 
200 1 |a Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks  |f A. S. Romanov, A. V. Kurtukova, A. A. Shelupanov [et al.] 
203 |a Text  |c electronic 
300 |a Title screen 
320 |a [References: 16 tit.] 
330 |a The article explores approaches to determining the author of a natural language text and the advantages and disadvantages of these approaches. The importance of the considered problem is due to the active digitalization of society and reassignment of most parts of the life activities online. Text authorship methods are particularly useful for information security and forensics. For example, such methods can be used to identify authors of suicide notes, and other texts are subjected to forensic examinations. Another area of application is plagiarism detection. Plagiarism detection is a relevant issue both for the field of intellectual property protection in the digital space and for the educational process. The article describes identifying the author of the Russian-language text using support vector machine (SVM) and deep neural network architectures (long short-term memory (LSTM), convolutional neural networks (CNN) with attention, Transformer). The results show that all the considered algorithms are suitable for solving the authorship identification problem, but SVM shows the best accuracy. The average accuracy of SVM reaches 96%. This is due to thoroughly chosen parameters and feature space, which includes statistical and semantic features (including those extracted as a result of an aspect analysis). Deep neural networks are inferior to SVM in accuracy and reach only 93%. The study also includes an evaluation of the impact of attacks on the method on models’ accuracy. Experiments show that the SVM-based methods are unstable to deliberate text anonymization. In comparison, the loss in accuracy of deep neural networks does not exceed 20%. Transformer architecture is the most effective for anonymized texts and allows 81% accuracy to be achieved. 
461 |t Future Internet 
463 |t Vol. 13, iss. 1  |v [16 p.]  |d 2021 
610 1 |a electronic resource 
610 1 |a works of TPU scientists 
610 1 |a authorship 
610 1 |a text mining 
610 1 |a machine learning 
701 1 |a Romanov  |b A. S.  |g Aleksandr Sergeevich 
701 1 |a Kurtukova  |b A. V.  |g Anna Vladimirovna 
701 1 |a Shelupanov  |b A. A.  |g Aleksandr Aleksandrovich 
701 1 |a Fedotova  |b A. M.  |g Anastasia Mikhaylovna 
701 1 |a Goncharov  |b V. I.  |c radio technician, specialist in the field of informatics and computer technology  |c Professor of Tomsk Polytechnic University, Doctor of technical sciences  |f 1937-  |g Valery Ivanovich  |3 (RuTPU)RU\TPU\pers\31330  |9 15502 
712 0 2 |a Национальный исследовательский Томский политехнический университет  |b Инженерная школа информационных технологий и робототехники  |b Отделение автоматизации и робототехники  |3 (RuTPU)RU\TPU\col\23553 
801 2 |a RU  |b 63413507  |c 20220701  |g RCR 
856 4 |u http://dx.doi.org/10.3390/fi13010003 
942 |c CF