Constantina Nicolaou1, Amal Vaidya1, Fabon Dzogang2, David Wardrope1,2 and Nikos Konstantinidis1, 1Department of Physics and Astronomy, University College London, Gower Street, London WC1E 6BT, UK and 2ASOS AI, Greater London House, Hampstead Road, London NW1 7FB, UK
We study the performance of customer intent classifiers designed to predict the most popular intent received through ASOS customer care, namely “Where is my order?”. We conduct extensive experiments to compare the accuracy of two popular classification models: logistic regression via N-grams that account for sequences in the data, and recurrent neural networks that perform the extraction of sequential patterns automatically. A Mann-Whitney U test indicated that the F1 score on a representative sample of held-out labelled messages was greater for linear N-gram classifiers than for recurrent neural network classifiers (M1=0.828, M2=0.815; U=1,196, P=1.46e-20), unless all neural layers including the word representation layer were trained jointly on the classification task (M1=0.831, M2=0.828, U=4,280, P=8.24e-4). Overall our results indicate that using simple linear models in modern AI production systems is a judicious choice unless the necessity for higher accuracy significantly outweighs the cost of much longer training times.
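To make the linear baseline concrete, the following is a minimal sketch of an N-gram plus logistic regression intent classifier in the spirit described above. The toy messages, labels and the "wismo" class name are illustrative stand-ins, not the paper's actual data or pipeline.

```python
# Minimal N-gram + logistic regression intent classifier sketch.
# Toy data only; the real system is trained on ASOS customer messages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = [
    "where is my order",           # "Where is my order?" intent
    "my parcel has not arrived",   # same intent
    "I want to return this item",  # other intent
    "how do I change my address",  # other intent
]
labels = ["wismo", "wismo", "other", "other"]

# Unigrams and bigrams let the linear model see short word sequences
# such as "my order", which is what "N-grams that account for
# sequences" refers to above.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
clf.fit(messages, labels)

print(clf.predict(["where is my parcel"])[0])
```

Such a pipeline trains in seconds, which is the cost advantage the abstract weighs against the accuracy of jointly trained recurrent networks.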
Natural Language Processing, Intent Classification, Bag-of-words, Recurrent Neural Networks
Marco A. Palomino1 and Adithya Murali2, 1School of Computing, Electronics and Mathematics, University of Plymouth, Drake Circus, Plymouth, PL4 8AA, United Kingdom and 2School of Computing Science and Engineering, Vellore Institute of Technology, Vellore - 632 014, Tamil Nadu, India
Online trends have established themselves as a new method of information propagation that is reshaping journalism in the digital age. Services such as Google Trends and Twitter Trends have recently attracted a great deal of attention. Taking election campaigns as an example, journalists, campaign managers and political analysts have looked into trends to determine candidates’ popularity and predict likely election outcomes. Trend discovery has therefore become a fundamental aid to monitor and summarise information. While previous research on trend discovery has focused on the dynamics of data streams, we argue that sentiment analysis—the classification of human emotion expressed in text—can enhance existing algorithms for trend discovery. By highlighting topics that are strongly polarised, sentiment analysis can offer further insight into the influence of users who are involved in a trend, and how other users adopt such a trend. As a case study, we have investigated a highly topical subject: Brexit, the withdrawal of the United Kingdom from the European Union. We retrieved an experimental corpus of publicly available tweets referring to Brexit and used them to test a proposed algorithm to identify trends. We validate the efficiency of the algorithm and gauge the sentiment expressed in the captured trends to confirm that highly polarised data ensures the emergence of trends.
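The polarisation idea above can be sketched as follows: score each tweet in a trend with a sentiment lexicon and flag the trend when most tweets carry non-neutral sentiment. The tiny lexicon, the 0.6 threshold and the sample tweets are placeholder assumptions, not the paper's method or data.

```python
# Illustrative polarised-trend check with a toy sentiment lexicon.
POSITIVE = {"great", "hope", "win", "strong"}
NEGATIVE = {"disaster", "fear", "chaos", "lose"}

def polarity(tweet: str) -> int:
    """Return +1, -1 or 0 depending on lexicon hits in the tweet."""
    words = set(tweet.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return (score > 0) - (score < 0)

def is_polarised(tweets, threshold=0.6):
    """A trend counts as polarised if most tweets are non-neutral."""
    scores = [polarity(t) for t in tweets]
    opinionated = sum(1 for s in scores if s != 0)
    return opinionated / len(tweets) >= threshold

brexit_sample = [
    "Brexit is a disaster for trade",
    "Great hope for a strong Brexit deal",
    "Brexit chaos continues",
]
print(is_polarised(brexit_sample))  # True: every tweet is opinionated
```

A production system would replace the lexicon with a trained sentiment classifier, but the trend-level aggregation logic stays the same.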
Text mining, Twitter, sentiment analysis, information retrieval.
Nadine Kuhnert1,2 and Andreas Maier1, 1Pattern Recognition, Friedrich-Alexander University, Erlangen-Nuremberg, Germany and 2Siemens Healthcare GmbH, Erlangen, Germany
We aim to model the processing of unknown log files. As the content of log files often evolves over time, we established a dynamic statistical model which learns and adapts processing and parsing rules. First, we limit the amount of unstructured text by focusing only on those frequent patterns which lead to the desired output table, similar to Vaarandi. Second, we transform the found frequent patterns and the output stating the parsed table into a Hidden Markov Model (HMM). We use this HMM as a specific, yet flexible, representation of a pattern for log file processing. As changes in the raw log file distort learned patterns, we aim for the model to adapt automatically in order to maintain high-quality output. After training our model on one system type and applying the model and the resulting parsing rule to a different system with slightly different log file patterns, we achieve an accuracy of over 99%.
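The first step above, limiting the text to frequent patterns, can be sketched with positional token counting in the spirit of Vaarandi's approach. The support threshold, wildcard symbol and sample log lines are illustrative assumptions, not the paper's configuration.

```python
# Sketch of frequent-pattern filtering on log lines: infrequent tokens
# become wildcards, and only patterns seen often enough survive.
from collections import Counter

def frequent_patterns(lines, support=2, wildcard="*"):
    """Return the set of line patterns occurring at least `support` times."""
    counts = Counter()
    for line in lines:
        for pos, tok in enumerate(line.split()):
            counts[(pos, tok)] += 1
    patterns = Counter()
    for line in lines:
        pattern = tuple(
            tok if counts[(pos, tok)] >= support else wildcard
            for pos, tok in enumerate(line.split())
        )
        patterns[pattern] += 1
    return {p for p, c in patterns.items() if c >= support}

log = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Disk check started",
]
print(frequent_patterns(log))  # the Connection pattern with a wildcard IP
```

In the paper's pipeline, surviving patterns like this would then be encoded as states of the HMM that represents the parsing rule.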
Hidden Markov Models, Parameter Extraction, Parsing, Text Mining, Information Retrieval.
Piotr Malak, Institute of Information Science and Book Studies, University of Wroclaw, Poland
In this paper we discuss the results of preliminary, but promising, research on including Natural Language Processing and Machine Learning approaches in Information Retrieval. Classical IR uses indexing and term weighting in order to increase the pertinence of answers given to users' queries. Such an approach allows for meaning matching, i.e. matching all keywords of the same or very similar meaning as expressed in the user query. In most cases this approach is sufficient to fulfil user information needs. However, indexing and retrieving information over professional language texts brings new challenges as well as new possibilities. One of the challenges is different grammar, causing the need to adjust NLP tools to a given professiolect. One of the possibilities is detecting the context in which an indexed term occurs in the text. In our research we made an attempt to answer the question of whether a Natural Language Processing (NLP) approach combined with supervised Machine Learning (ML) is capable of detecting contextual features of professional language texts.
Enhanced Information Retrieval, Contextual IR, NLP, Machine Learning.
Md. Mijanur Rahman*, Hasan Mahmud, Razia Sultana Rupa and Rumana Jahan Rimpy, Department of Computer Science and Engineering, Jatiya Kabi Kazi Nazrul Islam University, Mymensingh, Bangladesh
A spell-checking system detects and corrects spelling errors in word documents automatically. The aim of this project is to implement a Bangla spell checker and to demonstrate how to use it in a general-purpose search engine. From the study, it is seen that Bangla has complex grammatical and orthographical rules for spelling, and we found different challenges in generating suggestions for phonetic errors. For the Bangla language, a string matching algorithm and a direct dictionary lookup method are used for the detection of typographical and cognitive phonetic errors. Both algorithms have been used for detecting a misspelled word and for suggesting a list of candidate words, which is found in the lexicon file using a reverse dictionary lookup technique. Finally, the user can select the desired word from the suggested word list generated by the spell checker and use it for searching on the search engine. Two levels of databases were used to design the system: a lexicon database (containing about 58,319 correctly spelled Bangla words) and a secondary database (containing about 46,881 stemmed key data).
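The string-matching step can be sketched with a classic edit-distance lookup against the lexicon: rank every dictionary word by its distance to the misspelled input and keep the closest candidates. The romanised placeholder words and the distance cut-off of 2 are illustrative assumptions standing in for the Bangla lexicon and the paper's actual matching rules.

```python
# Dictionary-lookup suggestion via Levenshtein distance (sketch).
def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon, max_distance=2):
    """Candidate corrections from the lexicon, nearest first."""
    ranked = sorted(lexicon, key=lambda w: edit_distance(word, w))
    return [w for w in ranked if edit_distance(word, w) <= max_distance]

lexicon = ["amar", "sonar", "bangla", "bhasha"]  # placeholder lexicon
print(suggest("bangl", lexicon))  # the near-miss maps to "bangla"
```

Phonetic (cognitive) errors need an extra step, such as a phonetic encoding of both the input and the lexicon entries, since edit distance alone does not capture sound-alike substitutions.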
Bangla spell checker, Complex orthographic rules, Indexing and query process, Search engine, Typographical and cognitive errors.
Yang An, Abdul Moeed*, Gerhard Hagerer and Georg Groh, Technical University of Munich, Germany
With the explosive growth in textual data, it is becoming increasingly important to summarize text automatically. Recently, generative language models have shown promise in abstractive text summarization tasks. However, existing metrics for text summarization are unable to measure the performance of abstractive summaries generated by these language models in a faithful manner. In this paper, we propose using two evaluation metrics that are well-suited to abstractive summarization: angular embedding similarity and Fréchet embedding distance. To demonstrate the utility of both metrics, we analyse the abstractive text summarization capacity of two state-of-the-art language models: GPT-2 and ULMFiT. Both metrics correlate closely with human judgments in our experiments. For reproducibility, the source code of our experiments is available on GitHub.
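The two metrics named above can be sketched on embedding vectors as follows. The Fréchet computation here follows the common FID-style formulation (Gaussians fitted to each embedding set); the paper's exact estimators may differ, and the random vectors are stand-ins for real summary and reference embeddings.

```python
# Sketch of angular embedding similarity and Frechet embedding distance.
import numpy as np
from scipy.linalg import sqrtm

def angular_similarity(u, v):
    """1 - angle/pi, so identical directions score 1.0."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def frechet_distance(X, Y):
    """Frechet distance between Gaussians fitted to two embedding sets."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y).real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

rng = np.random.default_rng(0)
summaries = rng.normal(size=(100, 8))   # stand-in summary embeddings
references = rng.normal(size=(100, 8))  # stand-in reference embeddings
print(angular_similarity(summaries[0], summaries[0]))  # ~1.0
print(frechet_distance(summaries, summaries))          # ~0.0
```

Angular similarity compares individual summary/reference pairs, while the Fréchet distance compares whole distributions of embeddings, which is what makes it suited to judging generated text collectively.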
Natural language processing, Generative language models, Text summarization, Text generation, Angular similarity, Fréchet distance, GPT-2, Amazon product dataset, Abstractive summarization, ULMFiT
Amit Anand, Sanjay Chatterji and Shaubhik Bhattacharya, IIIT Kalyani, West Bengal 741235, India
Stemming is one of the most fundamental requirements of Natural Language Processing tasks such as Information Retrieval. In simple words, it is the process of finding the stem of a given word. This paper presents an algorithm to find the stem of a word in Hindi. The proposed algorithm uses word2vec, an unsupervised learning algorithm, for finding the 10 most similar words from a corpus. A mathematical function is then proposed to achieve the above-mentioned task of finding the stem. A significant amount of attention needs to be given to Indo-Aryan languages like Hindi, Bengali, Marathi etc. in the domain of Natural Language Processing because of their highly inflectional properties. Moreover, it is very difficult to build a rule-based stemmer for such highly inflected languages. The proposed algorithm does not need any annotated corpus and does not use any hardcoded rules for finding the stem. The results are verified on a set of 1000 Hindi words randomly taken from a corpus, comparing the output of the proposed algorithm with manually created reference stems.
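A simplified version of this idea: take a word's nearest neighbours in an embedding space (which for an inflected language are largely its own inflections) and reduce the word to the prefix all of them share. The hard-coded neighbour list stands in for word2vec's `most_similar` output, the longest-common-prefix rule is a stand-in for the paper's mathematical function, and the romanised Hindi forms are illustrative.

```python
# Embedding-neighbour stemming sketch: stem = shared prefix of the word
# and its nearest neighbours (here faked with a hard-coded dictionary).
def common_prefix(words):
    """Longest prefix shared by every word in the list."""
    prefix = words[0]
    for w in words[1:]:
        while not w.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

# Hypothetical neighbours of "ladka" (boy) and its inflected forms.
neighbours = {
    "ladka": ["ladke", "ladkon", "ladka"],
}

def stem(word):
    return common_prefix([word] + neighbours.get(word, []))

print(stem("ladka"))  # "ladk"
```

No annotated data or hand-written suffix rules appear anywhere in this sketch, which is the property the abstract emphasises.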
Inflection, Stemming, Word2Vec, Unsupervised Machine Learning
Enmei Wang1 and Shunan Wu2, 1School of Aeronautics and Astronautics, Dalian University of Technology, Dalian City, China and 2Key Laboratory of Advanced Technology for Aerospace Vehicles, Dalian University of Technology, Dalian City, China
To deal with issues in vibration suppression of large space structures (LSS), such as design complexity, fault-tolerance limitations and the difficulty of repeated expansion, a distributed vibration control approach is proposed in this paper. According to the structural characteristics, the LSS is first divided into different control units, and the dynamic model of each unit is developed. The distributed LQR vibration controller of each unit is then designed, and the final distributed vibration control system of the whole structure is integrated. Simulations are presented to verify the validity of the proposed controller, and the results demonstrate that repeatable distributed controllers can achieve vibration suppression for LSS and provide good fault-tolerance performance.
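The per-unit design step can be sketched as a standard LQR synthesis: model one control unit as x' = Ax + Bu, solve the continuous algebraic Riccati equation, and form the gain K for u = -Kx. The double-integrator model and identity weights below are toy assumptions, not the paper's structural dynamics.

```python
# LQR design for one toy control unit via the Riccati equation.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])   # toy unit dynamics (double integrator)
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)                # state (vibration) weighting
R = np.array([[1.0]])        # control effort weighting

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.inv(R) @ B.T @ P   # optimal feedback gain, u = -Kx

closed_loop = A - B @ K
print(np.linalg.eigvals(closed_loop))  # all real parts negative
```

Because each unit reuses the same design procedure, the controller is repeatable across units, which is what makes the distributed scheme expandable and tolerant to single-unit faults.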
Large Space Structure, Distributed Control, Linear Quadratic Regulator, Fault Tolerance
Nataly Ilyasova1,2 and Alexander Shirokanev1,2, 1IPSI RAS - branch of the FSRC «Crystallography and Photonics» RAS, Samara, Russia and 2Samara National Research University, Samara, Russia
In this paper, information technology has been developed for automatically highlighting the lungs in X-ray images, based on image pre-processing, calculation of textural properties and k-means classification. In some cases, the highlighted objects can describe not only the current patient’s condition but also specific characteristics regarding age, gender, constitution, etc. While using the k-means method, the relationship between the segmentation error and the fragmentation window size was revealed. Within the study, both a visual criterion for evaluating the quality of the segmentation result and a criterion based on calculating the clustering error on a large set of fragmented images were implemented. The study also included image pre-processing techniques. The study showed that the technology highlighted key objects with an error of 26%; however, the equalising procedure lessened this error to 14%. X-ray image clustering errors for fragmentation windows of 12x12, 24x24 and 36x36 are presented.
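The fragmentation-and-clustering pipeline can be sketched as follows: slice the image into fixed-size windows, compute simple texture features per window, and cluster the windows. The synthetic two-region "image", the mean/std features and the two-cluster setup are illustrative stand-ins for the paper's textural properties and real X-ray data.

```python
# Fragmentation windows + texture features + k-means (sketch).
import numpy as np

def fragment(image, win):
    """Return (mean, std) texture features for each win x win window."""
    h, w = image.shape
    feats = []
    for i in range(0, h - win + 1, win):
        for j in range(0, w - win + 1, win):
            patch = image[i:i + win, j:j + win]
            feats.append([patch.mean(), patch.std()])
    return np.array(feats)

def kmeans(X, k=2, iters=20, seed=0):
    """Tiny k-means so the sketch stays dependency-free."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centres) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centres[c] = X[labels == c].mean(axis=0)
    return labels

# Synthetic "X-ray": uniform dark left half, bright noisy right half.
rng = np.random.default_rng(1)
img = np.hstack([np.full((24, 24), 10.0),
                 200.0 + 20.0 * rng.standard_normal((24, 24))])
labels = kmeans(fragment(img, 12))  # 12x12 fragmentation window
print(labels)
```

Varying `win` (12, 24, 36) changes how coarsely texture is sampled, which is exactly the window-size/error relationship the study examines.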
Lung X-ray Images, Image Processing, Texture Analysis, Selection Technique of Interest Regions
Nataly Ilyasova1,2 and Alexander Shirokanev1,2, 1IPSI RAS - branch of the FSRC «Crystallography and Photonics» RAS, Samara, Russia and 2Samara National Research University, Samara, Russia
The article proposes a new method for analysing eye fundus images based on a convolutional neural network (CNN). The CNN architecture was constructed, followed by training the network on a balanced dataset of four image classes: thick blood vessels, thin blood vessels, healthy areas, and exudate areas. Segmentation of fundus images was performed using the CNN. Considering that exudates are a primary target of laser coagulation surgery, the segmentation error was calculated on the exudate class, amounting to 5%. In the course of this research, the HSL colour system was found to be the most informative; using it, the segmentation error was reduced to 3%.
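The paper's CNN architecture is not reproduced here; the following only illustrates the convolution operation at its core: a small kernel sliding over a fundus-like patch and responding strongly to a vessel-shaped feature. The toy patch and hand-picked edge kernel are assumptions for demonstration (in a trained CNN, kernel weights are learned, not designed).

```python
# Hand-rolled 2D convolution responding to a vessel-like edge (sketch).
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation of image with kernel."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

# Toy patch: a dark vertical "vessel" on a bright background.
patch = np.full((7, 7), 1.0)
patch[:, 3] = 0.0

# Vertical-edge kernel: strong response at the vessel's borders.
kernel = np.array([[1.0, 0.0, -1.0]] * 3)

response = conv2d(patch, kernel)
print(np.abs(response).max())  # peaks where the vessel edge lies
```

Stacking many such learned kernels, with nonlinearities between layers, is what lets the network separate thin vessels, thick vessels, healthy tissue and exudates.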
Convolutional Neural Networks, Fundus Image, Diabetic Retinopathy, Exudates, Laser Coagulation, Image Processing, Image Segmentation
Clark Ren, Yu Sun and Fangyan Zhang, California State Polytechnic University, USA
As more and more students get access to computers to aid them in their studies, they also gain access to machines that can play games, which can negatively affect a student's academic performance. However, it is also argued that playing video games could positively affect a student's academic performance. In order to address both sides of the argument, we create an app that limits the time a student can spend playing games without completely removing the ability to play.
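The core time-limiting logic such an app could use can be sketched as a daily allowance that gaming sessions draw down. This is a hypothetical illustration of the concept, not the app's actual implementation; the class name, the 60-minute budget and the session lengths are all assumptions.

```python
# Hypothetical daily game-time budget (sketch of the limiting logic).
class GameTimeBudget:
    def __init__(self, daily_minutes: int):
        self.daily_minutes = daily_minutes
        self.used = 0

    def remaining(self) -> int:
        """Minutes of play still allowed today."""
        return max(self.daily_minutes - self.used, 0)

    def record_session(self, minutes: int) -> bool:
        """Allow the session only if it fits in today's budget."""
        if minutes > self.remaining():
            return False
        self.used += minutes
        return True

    def reset_day(self):
        """Called at midnight to restore the allowance."""
        self.used = 0

budget = GameTimeBudget(daily_minutes=60)
print(budget.record_session(45))  # True: within the allowance
print(budget.record_session(30))  # False: only 15 minutes remain
```

Keeping the budget positive rather than zero is what "not completely removing the ability to play" amounts to in code.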
Parental Control, Smart System, Digital Games, Web Service
Sergei Zhuk, Quality Department of Shopify, Berlin, Germany
This contribution gives a review of a granulated testing approach for the JSE 2019 committee.
Issues, defects, bugs, test management, test planning, development planning, waterfall, sashimi testing.