Access Journal

Access to Science, Business, Innovation in the Digital Economy

Business demands for processing unstructured textual data – text mining techniques for companies to implement

Published Online: Apr 17, 2022

Abstract:
The rapid development of technology has pervasively changed the way people live and businesses operate. Sound business decisions are unthinkable without processing large amounts of data (both publicly available and collected for specific problems) with high accuracy and quality. The importance of unstructured data acquired from various sources is growing. Of particular value is the continuous flow of textual information generated every minute around the world in various forms (unstructured textual data), which is the subject of this article. The aim of the article is to provide an analytical overview of the main text processing methods applicable to the pragmatic analysis of information flows from companies, such as extraction, summarization, grouping, and categorization of text. Some methodologies are based on NLP (Natural Language Processing), others on Bayesian logic and statistical theory and practice. Based on a review of publications on the topic, conclusions are drawn about the practical applicability of these methods. This allows for an objective choice of appropriate tools for processing unstructured information and for business intelligence. The results of the study can be used to improve managerial decision-making, improve the quality of employees' work, and reduce errors in overall marketing planning.
Pages:
107-120
JEL Classification:
C15, C81, C82
How to cite:
Zhecheva, D., Nenkov, N. (2022). Business demands for processing unstructured textual data – text mining techniques for companies to implement. Access to Science, Business, Innovation in the Digital Economy, ACCESS Press, 3(2): 107-120. https://doi.org/10.46656/access.2022.3.2(2)
