Data irregularities in discretisation of test sets used for evaluation of classification systems: A case study on authorship attribution

Stańczyk, Urszula; Zielosko, Beata

Details

Title

Data irregularities in discretisation of test sets used for evaluation of classification systems: A case study on authorship attribution

Journal title

Bulletin of the Polish Academy of Sciences: Technical Sciences

Yearbook

2021

Volume

69

Issue

4

Authors

Stańczyk, Urszula ; Zielosko, Beata

Affiliation

Stańczyk, Urszula : Silesian University of Technology, ul. Akademicka 2A, 44-100 Gliwice, Poland ; Zielosko, Beata : University of Silesia in Katowice, ul. Będzińska 39, 41-200 Sosnowiec, Poland

Keywords

discretisation ; data irregularities ; evaluation and test sets ; rough sets ; authorship attribution ; stylometry

Divisions of PAS

Technical Sciences

Coverage

e137629

Bibliography

G. Franzini, M. Kestemont, G. Rotari, M. Jander, J. Ochab, E. Franzini, J. Byszuk, and J. Rybicki, “Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm,” Front. Digital Humanit., vol. 5, p. 4, 2018, doi: 10.3389/fdigh.2018.00004.
A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, “Data level preprocessing methods,” in Learning from Imbalanced Data Sets. Cham: Springer International Publishing, 2018, pp. 79–121, doi: 10.1007/978-3-319-98074-4_5.
S. Garcia, J. Luengo, J. Saez, V. Lopez, and F. Herrera, “A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 4, pp. 734–750, 2013, doi: 10.1109/TKDE.2012.35.
S. Das, S. Datta, and B.B. Chaudhuri, “Handling data irregularities in classification: Foundations, trends, and future challenges,” Pattern Recognit., vol. 81, pp. 674–693, 2018, doi: 10.1016/j.patcog.2018.03.008.
U. Stańczyk, “Evaluating importance for numbers of bins in discretised learning and test sets,” in Intelligent Decision Technologies 2017: Proceedings of the 9th KES International Conference on Intelligent Decision Technologies (KES-IDT 2017) – Part II, ser. Smart Innovation, Systems and Technologies, I. Czarnowski, R.J. Howlett, and L.C. Jain, Eds. Springer International Publishing, 2018, vol. 72, pp. 159–169, doi: 10.1007/978-3-319-59421-7_15.
G. Baron, “On approaches to discretization of datasets used for evaluation of decision systems,” in Intelligent Decision Technologies 2016, ser. Smart Innovation, Systems and Technologies, I. Czarnowski, A. Caballero, R. Howlett, and L. Jain, Eds. Springer, 2016, vol. 56, pp. 149–159, doi: 10.1007/978-3-319-39627-9_14.
U. Stańczyk and B. Zielosko, “On approaches to discretisation of stylometric data and conflict resolution in decision making,” in Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 23rd International Conference KES-2019, Budapest, Hungary, 4–6 September 2019, ser. Procedia Computer Science, I. J. Rudas, J. Csirik, C. Toro, J. Botzheim, R.J. Howlett, and L.C. Jain, Eds. Elsevier, 2019, vol. 159, pp. 1811–1820, doi: 10.1016/j.procs.2019.09.353.
J. Bazan, H. Nguyen, S. Nguyen, P. Synak, and J. Wróblewski, “Rough set algorithms in classification problem,” in Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems, L. Polkowski, S. Tsumoto, and T. Lin, Eds. Heidelberg: Physica-Verlag HD, 2000, pp. 49–88, doi: 10.1007/978-3-7908-1840-6_3.
J. Bazan and M. Szczuka, “The rough set exploration system,” in Transactions on Rough Sets III, ser. Lecture Notes in Computer Science, J. F. Peters and A. Skowron, Eds. Berlin, Heidelberg: Springer, 2005, vol. 3400, pp. 37–56, doi: 10.1007/11427834_2.
I. Chikalov, V. Lozin, I. Lozina, M. Moshkov, H. Nguyen, A. Skowron, and B. Zielosko, Three Approaches to Data Analysis – Test Theory, Rough Sets and Logical Analysis of Data, ser. Intelligent Systems Reference Library. Berlin, Heidelberg: Springer, 2013, vol. 41, doi: 10.1007/978-3-642-28667-4.
Z. Pawlak and A. Skowron, “Rudiments of rough sets,” Inf. Sci., vol. 177, no. 1, pp. 3–27, 2007, doi: 10.1016/j.ins.2006.06.003.
J. Rybicki, M. Eder, and D. Hoover, “Computational stylistics and text analysis,” in Doing Digital Humanities: Practice, Training, Research, 1st ed., C. Crompton, R. Lane, and R. Siemens, Eds. Routledge, 2016, pp. 123–144, doi: 10.4324/9781315707860.
M. Eder, “Style-markers in authorship attribution: A cross-language study of the authorial fingerprint,” Stud. Pol. Ling., vol. 6, no. 1, pp. 99–114, 2011.
H. Craig, “Stylistic analysis and authorship studies,” in A companion to digital humanities, S. Schreibman, R. Siemens, and J. Unsworth, Eds. Oxford: Blackwell, 2004, doi: 10.1002/9780470999875.ch20.
G. Baron, “Comparison of cross-validation and test sets approaches to evaluation of classifiers in authorship attribution domain,” in Proceedings of the 31st International Symposium on Computer and Information Sciences, ser. Communications in Computer and Information Science, T. Czachórski, E. Gelenbe, K. Grochla, and R. Lent, Eds. Cracow: Springer, 2016, vol. 659, pp. 81–89, doi: 10.1007/978-3-319-47217-1_9.
S.S. Mullick, S. Datta, S.G. Dhekane, and S. Das, “Appropriateness of performance indices for imbalanced data classification: An analysis,” Pattern Recognit., vol. 102, p. 107197, 2020, doi: 10.1016/j.patcog.2020.107197.
J.M. Johnson and T.M. Khoshgoftaar, “Survey on deep learning with class imbalance,” J. Big Data, vol. 6, no. 27, pp. 1–54, 2019, doi: 10.1186/s40537-019-0192-5.
N. Basurto, C. Cambra, and Á. Herrero, “Improving the detection of robot anomalies by handling data irregularities,” Neurocomputing, 2020, doi: 10.1016/j.neucom.2020.05.101, in press.
G. Shi, C. Feng, W. Xu, L. Liao, and H. Huang, “Penalized multiple distribution selection method for imbalanced data classification,” Knowledge-Based Syst., vol. 196, p. 105833, 2020, doi: 10.1016/j.knosys.2020.105833.
S. Au, R. Duan, S.G. Hesar, and W. Jiang, “A framework of irregularity enlightenment for data pre-processing in data mining,” Ann. Oper. Res., vol. 174, no. 1, pp. 47–66, 2010, doi: 10.1007/s10479-008-0494-z.
M. Koziarski, M. Wozniak, and B. Krawczyk, “Combined cleaning and resampling algorithm for multi-class imbalanced data with label noise,” Knowledge-Based Syst., vol. 204, p. 106223, 2020, doi: 10.1016/j.knosys.2020.106223.
N. Basurto, Á. Arroyo, C. Cambra, and Á. Herrero, “Imputation of missing values affecting the software performance of component-based robots,” Comput. Electr. Eng., vol. 87, p. 106766, 2020, doi: 10.1016/j.compeleceng.2020.106766.
S. Argamon, K. Burns, and S. Dubnov, Eds., The structure of style: Algorithmic approaches to understanding manner and meaning. Berlin: Springer, 2010, doi: 10.1007/978-3-642-12337-5.
S. Sbalchiero and M. Eder, “Topic modeling, long texts and the best number of topics. Some problems and solutions,” Qual. Quant., vol. 54, pp. 1095–1108, 2020, doi: 10.1007/s11135-020-00976-w.
R. Peng and H. Hengartner, “Quantitative analysis of literary styles,” Am. Statistician, vol. 56, no. 3, pp. 15–38, 2002, doi: 10.1198/000313002100.
E. Stamatatos, “A survey of modern authorship attribution methods,” J. Am. Soc. Inf. Sci. Technol., vol. 60, no. 3, pp. 538–556, 2009, doi: 10.1002/asi.21001.
D. Khmelev and F. Tweedie, “Using Markov chains for identification of writers,” Lit. Linguist. Comput., vol. 16, no. 3, pp. 299–307, 2001, doi: 10.1093/llc/16.3.299.
M. Koppel, J. Schler, and S. Argamon, “Computational methods in authorship attribution,” J. Am. Soc. Inf. Sci. Technol., vol. 60, no. 1, pp. 9–26, 2009, doi: 10.1002/asi.20961.
M. Jockers and D. Witten, “A comparative study of machine learning methods for authorship attribution,” Lit. Linguist. Comput., vol. 25, no. 2, pp. 215–223, 2010, doi: 10.1093/llc/fqq001.
M. Eder and J. Rybicki, “Do birds of a feather really flock together, or how to choose training samples for authorship attribution,” Lit. Linguist. Comput., vol. 28, pp. 229–236, 2013, doi: 10.1093/llc/fqs036.
M. Eder, “Does size matter? Authorship attribution, small samples, big problem,” Digital Scholarsh. Humanit., vol. 30, pp. 167–182, 2015, doi: 10.1093/llc/fqt066.
K. Kalaivani and S. Kuppuswami, “Exploring the use of syntactic dependency features for document-level sentiment classification,” Bull. Pol. Acad. Sci. Tech. Sci., vol. 67, no. 2, pp. 339–347, 2019, doi: 10.24425/bpas.2019.128608.
G. Rotari, M. Jander, and J. Rybicki, “The Grimm brothers: A stylometric network analysis,” Digital Scholarsh. Humanit., 2020, doi: 10.1093/llc/fqz088.
C. Jankowski, D. Reda, M. Mańkowski, and G. Borowik, “Discretization of data using Boolean transformations and information theory based evaluation criteria,” Bull. Pol. Acad. Sci. Tech. Sci., vol. 63, no. 4, pp. 923–932, 2015, doi: 10.1515/bpasts-2015-0105.
U. Fayyad and K. Irani, “Multi-interval discretization of continuous valued attributes for classification learning,” in Proceedings of the 13th International Joint Conference on Artificial Intelligence, vol. 2. Morgan Kaufmann Publishers, 1993, pp. 1022–1027.
I. Kononenko, “On biases in estimating multi-valued attributes,” in Proceedings of the 14th International Joint Conference on Artificial Intelligence IJCAI’95, vol. 2. Morgan Kaufmann Publishers Inc., 1995, pp. 1034–1040.
J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, no. 5, pp. 465–471, 1978, doi: 10.1016/0005-1098(78)90005-5.
S. Kotsiantis and D. Kanellopoulos, “Discretization techniques: A recent survey,” GESTS Int. Trans. Comput. Sci. Eng., vol. 32, no. 1, pp. 47–58, 2006.
B. Zielosko, “Application of dynamic programming approach to optimization of association rules relative to coverage and length,” Fundamenta Informaticae, vol. 148, no. 1-2, pp. 87–105, 2016, doi: 10.3233/FI-2016-1424.
S.G. Weidman and J. O’Sullivan, “The limits of distinctive words: Re-evaluating literature’s gender marker debate,” Digital Scholarsh. Humanit., vol. 33, pp. 374–390, 2018, doi: 10.1093/llc/fqx017.

Date

15.06.2021

Type

Article

Identifier

DOI: 10.24425/bpasts.2021.137629

Source

Bulletin of the Polish Academy of Sciences: Technical Sciences; 2021; e137629