DEV Community: gouse

Social responsibility on articles published in newspapers, forecasting

gouse — Fri, 21 Oct 2022 09:21:23 +0000

Abstract:

India is a democratic country with a huge population, governed by four pillars of democracy: the legislature, the executive, the judiciary, and the media. Each pillar must act within its own sphere while keeping the larger picture in mind. The strength of a democracy is determined by the strength of each pillar and by the way the pillars complement each other [1]. Because of its pervasive presence and undeniable importance in molding public opinion, the media has emerged as the fourth pillar. A free and objective media committed to lending a voice to the voiceless is the cornerstone of a healthy democracy. The Indian media has played this role quite creditably, barring a few exceptions at some moments in history, such as during the Emergency. However, with the introduction of the internet, commercialization, changes in ownership patterns, “news coloured with views,” and the proliferation of fake news, paid news, and propaganda, there have been some troubling tendencies in recent times. The media, like all other institutions, should have its own system of checks and balances. There has to be a code of ethics that is voluntarily adhered to, more as an article of faith and as an expression of the media’s commitment to professionalism [1,2,3]. The main aim of this article is to forecast the kinds of articles that are going to be published in the forthcoming days. We can predict this with the help of machine learning and artificial intelligence, using Python and R programming tools. Such analysis can be beneficial to society in assessing the print media’s duty with regard to education, empowerment, healthcare, and political awareness. To do this we used time series and regression models, which can assist in forecasting news for the next day or week.

Key words: Education, Politics, Time series, Regression, Python, R

Objectives.

Nowadays, there has been some speculation in the news media that some political parties favor a few newspapers and news channels for their own benefit.

Some of these outlets give more prominence to those parties' agendas, which affects how the print media fulfils its social responsibility toward society. We forecast the news articles published in newspapers and categorize them as related to education, social awareness, or politics. These are the main goals of this paper.

Methodology

We plan to run time series and regression models to predict what kind of news we can expect tomorrow. This can help the government design public welfare advertisements: for example, if newspapers are publishing more education-related articles, advertisements can be placed based on that coverage. In addition, a government body can keep an eye on the articles to tackle the disturbing news trends seen in recent times, and to find which outlets are taking advantage of internet commercialization and changes in ownership patterns to publish “news coloured with views”, that is, articles that are fake or paid for as negative propaganda. Overall, like all other institutions, the media too should have its own mechanism for social responsibility [4,5,6].

Data collection

After global digitization, communication channels, including news channels and newspapers, have moved to online platforms more than ever. At present, major regional newspapers and news channels also have an online presence. We referred to 3 major newspapers, in which we measured the area of randomly chosen articles. [Fig 01] shows the dimensions measured when collecting the size of the articles; those dimensions are converted into cm (centimetres). We collected data by random sampling on a weekly basis: for example, out of the 7 days in a week we considered 3 to 4 randomly chosen days. We picked the newspapers with the largest English-language circulation across India and chose the top three: Times of India (1,614,105), Deccan Chronicle (1,064,661), and Economic Times (664,352).

Fig 01

Analysis and Forecasting

Correlation:

To understand the data patterns we used R and Python. We ran a newspaper-wise correlation analysis for educational and political coverage. For educational news there is a positive correlation between EDU ET and EDU Deccan Chronicle [52.8%], and between EDU ET and EDU Times of India [76.5%]. This means that when the news is related to education, the coverage is similar across almost all the newspapers. For political news there is no relation between one newspaper and another, which suggests that each newspaper publishes political news only when it is favorable to that newspaper, and otherwise does not publish it.
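The correlation figures above are pairwise coefficients between the article sizes of two newspapers. A minimal sketch of that calculation follows, assuming a Pearson correlation; the article does not state which coefficient was used, and the series values and the names `pearson`, `edu_et` and `edu_toi` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical article areas (cm) for education news on the same sampled days.
edu_et = [120.0, 95.0, 140.0, 80.0, 110.0]
edu_toi = [130.0, 100.0, 150.0, 90.0, 105.0]

r = pearson(edu_et, edu_toi)
print(f"EDU ET vs EDU Times of India: r = {r:.3f}")
```

A value near +1 means the two newspapers raise and lower their coverage together; a value near 0 matches the "no relation" finding reported for political news.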

Regression Model Political articles:

To predict what kind of news these newspapers would publish tomorrow, we chose a regression model [3,7]. Initially we ran a regression model to predict education-related articles for tomorrow. The correlation is 49.9% and the coefficient of determination (R square) is 24%, with a standard error of 13.784, which is not optimal for predicting news. For example, tomorrow's education-related news size would be 430 +/- 13.784 cm. The effectiveness of this model is 24%. The regression is significant, with an F value of 5.423 (significant at 99%). Among the predictors, Politics Deccan Chronicle has a t value of 2.080 (significant at 95%) and Politics Times of India has a t value of 2.590 (significant at 95%), while the remaining variable, Politics ET, is not significant. The final conclusion is that with this model we can predict only Deccan Chronicle news for tomorrow, and with low accuracy. Because the accuracy is low, we should try alternative models.
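To make the reported quantities concrete, here is a minimal sketch of an ordinary least squares fit with a single predictor, returning the slope, intercept, R square and standard error of the estimate. It is an illustration only: the article's model has three predictors and its fitting code is not shown, and the data values and the names `fit_simple_ols`, `politics_dc` and `education` are hypothetical. As a check that needs no data, squaring the reported correlation of 0.499 gives about 0.249, in line with the reported 24%.

```python
import math


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, intercept, r_squared, std_error), where std_error is the
    standard error of the estimate with n - 2 degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of the same length (>= 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * a for a in x]
    ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    r_squared = 1.0 - ss_res / ss_tot
    std_error = math.sqrt(ss_res / (n - 2))
    return slope, intercept, r_squared, std_error


# Hypothetical article areas (cm): one predictor series and one target series.
politics_dc = [50.0, 80.0, 65.0, 90.0, 70.0, 55.0]
education = [400.0, 450.0, 420.0, 470.0, 445.0, 410.0]

slope, intercept, r2, se = fit_simple_ols(politics_dc, education)
print(f"R square = {r2:.3f}, standard error = {se:.3f}")
```

A low R square means the predictor explains only a small share of the variation in the target, which is the reason the article gives for moving on to other models.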

The regression summary tables below show the model summary, accuracy, and predictor weights.

Regression Model Education articles:

Initially we ran a regression model to predict politics-related articles for tomorrow. The correlation is 56.6% and the coefficient of determination (R square) is 32%, with a standard error of 13.115, which is not optimal for predicting news. For example, tomorrow's politics-related news size would be 430 +/- 13.115 cm. The effectiveness of this model is 23%. The regression is significant, with an F value of 7.703 (significant at 99%). The only significant predictor is EDU Deccan Chronicle, with a t value of 4.121 (significant at 99%); the remaining variables, EDU ET and EDU Times of India, are not significant. The final conclusion is that with this model we can predict only Deccan Chronicle news for tomorrow, and with low accuracy. Because the accuracy is low, we should try alternative models.

The regression summary tables below show the model summary, accuracy, and predictor weights.

Regression Model Summary

a. Dependent Variable: Week_number

b. Predictors: (Constant), EDU Times of India, EDU Deccan Chronicle, EDU ET

Summary of Regression:

The regression models are not optimal, since their effectiveness and accuracy are low, so we continued with other models. As the data is a continuous series, we use time series models: ARIMA or exponential smoothing [6,7].

Time series Models Politics:

The regression models above do not fit well enough to predict the kind of news that is going to be published tomorrow, so we chose an alternative method. When the data is available as a time series, we can use ARIMA time series models [22,23]. We are particularly interested in the autoregressive integrated moving average (ARIMA) approach. A model is labelled ARIMA(p, d, q), wherein “p” is the number of autoregressive terms, “d” is the number of differences, and “q” is the number of moving average terms. Autoregressive models assume that Yt is a linear function of the preceding values.
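The roles of "d" and "p" can be sketched in a few lines: difference the series d times, then fit an autoregressive model in which each value is a linear function of the previous one. The sketch below covers only that part, which amounts to ARIMA(1, d, 0) fitted by least squares. It leaves out the moving average terms (the article reports q = 6), for which a statistics package would normally be used. The function names and the sample series are hypothetical.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


def fit_ar1(series):
    """Least-squares fit of y_t = c + phi * y_(t-1); returns (c, phi)."""
    previous, current = series[:-1], series[1:]
    n = len(previous)
    if n < 2:
        raise ValueError("series is too short to fit AR(1)")
    mean_p = sum(previous) / n
    mean_c = sum(current) / n
    spread = sum((p - mean_p) ** 2 for p in previous)
    if spread == 0:
        raise ValueError("cannot fit AR(1) to a constant series")
    phi = sum((p - mean_p) * (c - mean_c)
              for p, c in zip(previous, current)) / spread
    return mean_c - phi * mean_p, phi


def forecast_next(series, d=1):
    """One-step-ahead forecast using AR(1) on the d-times differenced series."""
    if d not in (0, 1):
        raise ValueError("this sketch only undoes d = 0 or d = 1")
    work = difference(series, d)
    c, phi = fit_ar1(work)
    step = c + phi * work[-1]
    # With d = 1 the model predicts the change, so add it to the last level.
    return step if d == 0 else series[-1] + step


# Hypothetical daily article areas (cm) for one news category.
areas = [410.0, 425.0, 418.0, 440.0, 452.0, 447.0, 460.0]
print(f"forecast for tomorrow: {forecast_next(areas, d=1):.1f} cm")
```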

For the ARIMA model below, the R square and stationary R square are low, at 19%. Even so, compared with the regression model, the predicted (fitted) values for the politics-related article variables are close to the actual values. The model uses a moving average order of 6 and an autoregressive order of 1, with RMSE 74.3 and 94%. Across the training and validation data, the maximum absolute error is 20.1, which is moderate relative to the actual values. The time series ARIMA model [20,24] is therefore the best model for predicting daily news: with it we can estimate what kind of news will be published online the coming day.

Time series Models Educational

For the ARIMA model below, the R square and stationary R square are low, at 36%, but this is better than the regression model. The predicted (fitted) values for the education-related article variables are close to the actual values. The model uses a moving average order of 6 and an autoregressive order of 1, with RMSE 96 and 24%. Across the training and validation data, the maximum absolute error is 69, which is moderate relative to the actual values. The time series ARIMA model is therefore the best model for predicting daily news: with it we can estimate what kind of news will be published online the coming day.
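The two error measures quoted for the ARIMA models, RMSE and maximum absolute error, compare fitted values with actual values. A minimal sketch of both follows, using hypothetical numbers and function names; the article's figures come from its own model output, which is not reproduced here.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def max_absolute_error(actual, predicted):
    """Largest absolute gap between an actual and a predicted value."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of the same length")
    return max(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical actual and fitted article areas (cm).
actual_areas = [410.0, 425.0, 418.0, 440.0]
fitted_areas = [405.0, 430.0, 420.0, 431.0]

print(f"RMSE = {rmse(actual_areas, fitted_areas):.2f}")
print(f"max abs error = {max_absolute_error(actual_areas, fitted_areas):.2f}")
```

Both measures are in the same unit as the data (cm here), so they can be read directly against typical article sizes.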

Model Description:

Model Statistics:

Conclusion Summary:

India is a democratic country with many religions, castes, and regional political parties, yet major political parties have their own news channels and newspapers, or are indirectly linked to newspapers that favor local regional parties. In this process, the print media has been swayed by political party leaders and has published items for its own gain, ignoring its social responsibility to society.

From this experiment we could see that the majority of the articles serve individual interests rather than social responsibility.

This article aims to help anyone who wants to predict the news and to restrict or keep a check on the news columns published.

References :

1. Press Information Bureau, Government of India, Vice President’s Secretariat, 08-December-2019 20:06 IST.
2. The Audit Bureau of Circulations (ABC) of India is a non-profit circulation-auditing organisation.
3. Y.-W. Cheung, K.S. Lai. Lag order and critical values of the augmented Dickey–Fuller test. J. Bus. Econ. Stat., 13 (1995), pp. 277–280.
4. Stanny, Monika, and Wojciech Strzelczyk. 2015. Zróżnicowanie przestrzenne sytuacji dochodowej gmin a rozwój społeczno-gospodarczy obszarów wiejskich w Polsce. Roczniki Naukowe Stowarzyszenia Ekonomistów Rolnictwa i Agrobiznesu XVII: 301–7.
5. Stanny, Monika, and Wojciech Strzelczyk. 2018. Kondycja Finansowa Samorządów Lokalnych a Rozwój Społeczno-Gospodarczy Obszarów wiejskich; Ujęcie Przestrzenne. Warszawa: Wyd. IRWiR PAN oraz Wyd. Naukowe Scholar Spółka z o.o., pp. 113–46.
6. Vermeulen, Ben, and Andreas Pyka. 2018. The role of network topology and the spatial distribution and structure of knowledge in regional innovation policy. A calibrated agent-based model study. Computational Economics 52: 773–808.
7. Tang, Lijing, and Dongyan Wang. 2018. Optimization of County-Level Land Resource Allocation through the Improvement of Allocation Efficiency from the Perspective of Sustainable Development. International Journal of Environmental Research and Public Health 15: 2638.
8. Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big data. 2016;3(1):9.
9. Witten IH, Frank E. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann; 2005.
10. Witten IH, Frank E, Trigg LE, Hall MA, Holmes G, Cunningham SJ. Weka: practical machine learning tools and techniques with java implementations. 1999.
11. Wu C-C, Yen-Liang C, Yi-Hung L, Xiang-Yu Y. Decision tree induction with a constrained number of leaf nodes. Appl Intell. 2016;45(3):673–85.
12. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al. Top 10 algorithms in data mining. Knowl Inform Syst. 2008;14(1):1–37.
13. Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.
14. Xu D, Yingjie T. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.
15. Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.
16. Zanella A, Bui N, Castellani A, Vangelista L, Zorzi M. Internet of things for smart cities. IEEE Internet Things J. 2014;1(1):22–32.
17. Zhao Q, Bhowmick SS. Association rule mining: a survey. Singapore: Nanyang Technological University; 2003.
18. Zheng T, Xie W, Xu L, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–7.
19. Zheng Y, Rajasegarar S, Leckie C. Parking availability prediction for sensor-enabled car parks in smart cities. In: Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2015 IEEE Tenth International Conference on. IEEE, 2015; pages 1–6.
20. Zhu H, Cao H, Chen E, Xiong H, Tian J. Exploiting enriched contextual information for mobile app classification. In: Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012; pages 1617–1621.
21. Zhu H, Chen E, Xiong H, Kuifei Y, Cao H, Tian J. Mining mobile user preferences for personalized context-aware recommendation. ACM Trans Intell Syst Technol (TIST). 2014;5(4):58.
22. Prybutok VR, Yi J, and Mitchell D. Comparison of neural network models with ARIMA and regression models for prediction of Houston’s daily maximum ozone concentrations. Eur J Oper Res 2000; 122(1): 31–40.
23. Ho SL, Xie M, and Goh TN. A comparative study of neural network and Box-Jenkins ARIMA modeling in time series prediction. Comput Ind Eng 2002; 42(2–4): 371–375.
24. Kandananond K. A comparison of various forecasting methods for autocorrelated time series. Int J Eng Bus Manage 2012; 4: 4.

Predicting transit time using ML and AI techniques

gouse — Fri, 21 Oct 2022 09:18:16 +0000

Abstract:

Transportation is essential in the contemporary economy. In today’s fast-paced world, where everyone is short of time and always in a hurry, everyone wants to know the transit duration for better planning. As a result, research communities have given intelligent transit systems a lot of attention. Both traffic engineers and users of the highway network depend on accurate transit times [1]. This work explores the transit time for goods vehicles on highways between given start and destination points. Briefly, transit time is the time taken by a goods vehicle to travel from its source station to its destination. It is measured in the number of hours, days, or months taken to reach the destination. Many factors may influence the transit time, depending on the geographical location: mode of transit, season, wind direction, hill or normal roads, etc.

In this experiment we try to predict the transit time based on those influencing factors, so that the importer and the exporter have the most accurate possible timelines and can plan their deliveries and usage accordingly. For this analysis we used the random forest and linear regression ML algorithms.

Keywords: Machine Learning, Linear Regression, Random Forest, Model Validation.

Objectives:

Predicting transit times is crucial for transportation. Accurate trip estimation could lower transportation expenses. Trip time prediction is also crucial for creating advanced transit information systems. Linear regression and random forest regression are two of the various techniques employed. Each is described in the paragraphs below.

A. LINEAR REGRESSION

Linear regression is a linear model that establishes the relationship between a dependent variable y (target) and one or more independent variables X (inputs). Our data set has several variables. We take transit time as y (target) and 13 other variables as X (inputs): Temperature, Humidity, Pressure, Visibility, Vehicle age, Wind speed, Load in tons, Bearing, Weather condition, Service status, Destination, Road type, and Wind direction.
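With several inputs, the model takes the form y = b0 + b1*x1 + ... + bk*xk, and the coefficients can be found by solving the normal equations. The sketch below does this in plain Python for two numeric inputs. It is illustrative only: the article does not show its code, the rows and feature choice (load in tons, wind speed) are hypothetical, and categorical inputs such as Road type would first need to be converted to numbers.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular (collinear inputs?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_regression(rows, targets):
    """Return [intercept, coef_1, ..., coef_k] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Predicted target for one row of inputs."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Hypothetical rows: [load in tons, wind speed]; target: transit time in hours.
rows = [[5.0, 10.0], [8.0, 12.0], [12.0, 9.0],
        [20.0, 15.0], [25.0, 11.0], [30.0, 14.0]]
hours = [20.0, 22.0, 25.0, 40.0, 44.0, 48.0]

coefs = fit_linear_regression(rows, hours)
print(f"predicted hours: {predict(coefs, [15.0, 12.0]):.1f}")
```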

B. RANDOM FOREST

Random Forest (RF) is an ensemble supervised machine learning method that can be applied to categorical or numerical datasets as a classifier or regressor. To develop an RF model, multiple random samples are drawn from the training dataset with replacement over several iterations, and a decision tree is trained on each of them. Each trained decision tree then returns the target variable's value for each new record in the test dataset. The average of the values predicted by all the decision trees is used as the final result. Because it reduces decision tree variance, random forest is resistant to noisy data and overfitting, and it is expected to have higher accuracy than individual decision trees. When a large dataset is available, RF usually works accurately and efficiently. It can also handle a large number of variables as model inputs. These characteristics make the random forest model an excellent choice.
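The sampling-and-averaging idea described above can be sketched in plain Python. This is a deliberately simplified version, not a full random forest: each "tree" is a depth-1 regression tree (a stump), and no random subset of features is drawn at each split. All names and data values are hypothetical.

```python
import random


def fit_stump(rows, targets):
    """Depth-1 regression tree: pick the split with the lowest squared error.

    Returns (feature_index, threshold, left_mean, right_mean).
    """
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((t - left_mean) ** 2 for t in left)
                     + sum((t - right_mean) ** 2 for t in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_mean, right_mean)
    if best is None:  # every row identical: always predict the mean
        mean = sum(targets) / len(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean


def fit_forest(rows, targets, n_trees=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([rows[i] for i in picks],
                                [targets[i] for i in picks]))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_stump(stump, row) for stump in forest) / len(forest)


# Hypothetical rows: [load in tons, wind speed]; target: transit time in hours.
rows = [[5.0, 10.0], [8.0, 12.0], [12.0, 9.0],
        [20.0, 15.0], [25.0, 11.0], [30.0, 14.0]]
hours = [20.0, 22.0, 25.0, 40.0, 44.0, 48.0]

forest = fit_forest(rows, hours)
print(f"estimated hours: {predict_forest(forest, [22.0, 12.0]):.1f}")
```

Averaging over many bootstrap-trained trees is what reduces the variance of a single tree; a production model would use deeper trees and a library implementation.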

Methodology:

Python was used as the coding tool, and historic transit data was used for training the models. We used the linear regression and random forest algorithms.

Dataset:

We used historic transit data and considered the following features:

Transit time: The total time taken to complete the trip.

Service status: Shows whether the trip was completed on time or exceeded the schedule and, if it exceeded, by how many days: 30, 60 or 90.

Weather condition: Weather conditions play an important role because they affect the transit time. Factors include heavy rain, snow, fog, unfavorable climate, cloudy skies, etc.

The Different Variables are shown below:

Method: We run Random Forest and Linear Regression models to predict the likely transit time, which will help in planning a trip given the influencing factors in our dataset. To train the models we used a random forest regressor and a linear regression model.

Analysis and Forecasting:

To start with, we did a correlation analysis on the different variables available in the travel time data. The correlation matrix shows us the variables that have more impact on the dependent variable. We also examined other regression algorithms and took random forest as the best suited regression model.

Correlation:
We did a correlation analysis to find the relation between pairs of attributes, which helped us find redundant data. The table below shows the impact of the available variables on the dependent variable, Travel Time. We found that variables such as Destination, Load in Tons, and Weather Condition have a direct correlation with the dependent variable, Travel Time. Pressure, Visibility, Temperature and Humidity have a high correlation with the Weather Condition variable. The Service status variable has a high correlation with the Road type variable. The Wind direction variable has a high correlation with variables such as Wind Speed and Travel Time.

Exploratory Data Analysis: The table below shows the exploratory data analysis done for the variables available in the dataset.

Model 1, Regression: The regression summary tables below show the model summary, accuracy, and predictor weights. As per the model, the R-squared value is 0.428 and the adjusted R-squared value is 0.427.
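Adjusted R-squared penalizes R-squared for the number of predictors, which is why it is slightly lower than the plain value. A minimal sketch of the standard formula follows. The article gives R-squared (0.428) and the 13 predictors but not the number of rows, so `n_samples = 5000` below is purely a hypothetical placeholder and the printed value is not expected to reproduce 0.427.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    denominator = n_samples - n_predictors - 1
    if denominator <= 0:
        raise ValueError("need more samples than predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / denominator


# R-squared and predictor count are from the text; n_samples is hypothetical.
value = adjusted_r_squared(0.428, n_samples=5000, n_predictors=13)
print(f"adjusted R-squared: {value:.4f}")
```

With many rows and few predictors the two values are almost identical, which is consistent with the small gap the article reports.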

Model 2, Random Forest Regressor: We first trained a linear regressor, which gave an accuracy of 42%, and then compared it with a random forest regressor, which gave more accurate results at 90%. Compared with older techniques like linear regression, the random forest result is more accurate by 48 percentage points. Further, we observed that the root mean squared error (RMSE) decreased rapidly to a healthy level. Mean Absolute Error: 0.42916096051959735; Mean Squared Error: 10.277677122295003; Root Mean Squared Error: 3.205881645085327.

Conclusion:

Compared with the other algorithms, such as linear regression (accuracy: 42 percent) and its variants, random forest (accuracy: 90 percent) gives the best result. Predicted transit-time information lets road users organize their travel schedule pre-trip and en route. It helps to save transport operating costs and reduce environmental impact. Accurate travel time information also helps delivery industries improve their service quality by delivering on time. However, the development of travel time estimation and prediction suffers from a shortage of traffic data sets and too much interference from the transport environment. This paper provides a review of travel-time studies that covers the variables of travel time, the measurement of travel time, methodologies for travel-time prediction and estimation, research difficulties, some relationships between other variables and travel time from field data, and potential solutions for travel-time prediction studies.

References:

Abbott-Jard M, Shah H, Bhaskar A. 2013. Empirical evaluation of Bluetooth and Wifi scanning for road transport. Australasian Transport Research Forum (ATRF), 36th Edition. 14.
Abdollahi M, Khaleghi T, Yang K. 2020. An integrated feature learning approach using deep learning for travel time prediction. Expert Systems with Applications 139(4):112864 DOI 10.1016/j.eswa.2019.112864.
Abduljabbar R, Dia H, Liyanage S, Bagloee SA. 2019. Applications of artificial intelligence in transport: an overview. Sustainability 11(1):189 DOI 10.3390/su11010189.
Achar A, Bharathi D, Kumar BA, Vanajakshi L. 2019. Bus arrival time prediction: a spatial kalman filter approach. IEEE Transactions on Intelligent Transportation Systems 21(3):1298–1307 DOI 10.1109/TITS.2019.2909314.
J.W.C. Van Lint, Online learning solutions for freeway travel time prediction, IEEE Trans. Intell. Transp. Syst. 9 (2008) 38–47.
G. Huisken, E.C. van Berkum, A comparative analysis of short-range travel time prediction methods, 82nd Annual Meeting of the Transportation Research Board, 2003.
U. Mori, A. Mendiburu, M. Álvarez, J.A. Lozano, A review of travel time estimation and forecasting for advanced traveller information systems, Transp. A Transp. Sci. 11 (2015) 119–157.
H.B. Celikoglu, Flow-based freeway travel-time estimation: a comparative evaluation within dynamic path loading, IEEE Trans. Intell. Transp. Syst. 14 (2013) 772–781.
L. Li, X. Chen, Z. Li, L. Zhang, Freeway travel-time estimation based on temporal–spatial queueing model, IEEE Trans. Intell. Transp. Syst. 14 (2013) 1536–1541.
F. Soriguera, F. Robuste, Requiem for freeway travel time estimation methods based on blind speed interpolations between point measurements, IEEE Trans. Intell. Transp. Syst. 12 (2010) 291–297.
J.W.C. Van Lint, Reliable travel time prediction for freeways, Netherlands TRAIL Res. School (2004).
J.W.C. Van Lint, C. Van Hinsbergen, Short-term traffic and travel time prediction models, Artif. Intell. Appl. to Crit. Transp. Issues 22 (2012) 22–41.
E.J. Schmitt, H. Jula, On the limitations of linear models in predicting travel times, in: 2007 IEEE Intelligent Transportation Systems Conference, IEEE, 2007, pp. 830–835.
M. Papageorgiou, I. Papamichail, A. Messmer, Y. Wang, Traffic simulation with METANET, in: Fundamentals of Traffic Simulation, Springer, 2010, pp. 399–430.
P. Edara, R. Rahmani, H. Brown, C. Sun, Traffic Impact Assessment of Moving Work Zone Operations, Smart Work Zone Deployment Initiative (2017).
N.B. Taylor, The CONTRAM dynamic traffic assignment model, Netw. Spat. Econ. 3 (2003) 297–322.
L. Du, S. Peeta, Y.H. Kim, An adaptive information fusion model to predict the short-term link travel time distribution in dynamic traffic networks, Transp. Res. Part B Methodol. 46 (2012) 235–252.
D. Laoide-Kemp, M. O’Mahony, Dealing with latency effects in travel time prediction on motorways, Transp. Eng. (2020) 100009.
E. Castillo, M. Nogal, J.M. Menendez, S. Sanchez-Cambronero, P. Jimenez, Stochastic demand dynamic traffic models using generalized beta-Gaussian Bayesian networks, IEEE Trans. Intell. Transp. Syst. 13 (2011) 565–581.
D. Billings, J.-S. Yang, Application of the ARIMA models to urban roadway travel time prediction-a case study, in: 2006 IEEE International Conference on Systems, Man and Cybernetics, IEEE, 2006, pp. 2529–2534
W. Qiao, A. Haghani, M. Hamedi, A nonparametric model for short-term travel time prediction using bluetooth data, J. Intell. Transp. Syst. 17 (2013) 165–175.
B. Yu, X. Song, F. Guan, Z. Yang, B. Yao, k-Nearest neighbor model for multiple-time-step prediction of short-term traffic condition, J. Transp. Eng. 142 (2016) 4016018.
J. Zhao, Y. Gao, J. Tang, L. Zhu, J. Ma, Highway travel time prediction using sparse tensor completion tactics and-nearest neighbor pattern matching method, J. Adv. Transp. 2018 (2018).
D. Nikovski, N. Nishiuma, Y. Goto, H. Kumazawa, Univariate short-term prediction of road travel times, in: Proceedings. 2005 IEEE Intelligent Transportation Systems, 2005, IEEE, 2005, pp. 1074–1079.
A. Simroth, H. Zahle, Travel time prediction using floating car data applied to logistics planning, IEEE Trans. Intell. Transp. Syst. 12 (2010) 243–253.
C.-H. Wu, J.-M. Ho, D.-T. Lee, Travel-time prediction with support vector regression, IEEE Trans. Intell. Transp. Syst. 5 (2004) 276–281.
G. Leshem, Y. Ritov, Traffic flow prediction using adaboost algorithm with random forests as a weak learner, in: Proceedings of World Academy of Science, Engineering and Technology, Citeseer, 2007, pp. 193–198.
Y. Liu, Y. Wang, X. Yang, L. Zhang, Short-term travel time prediction by deep learning: a comparison of different LSTM-DNN models, in: 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2017, pp. 1–8.
J. Zhao, Y. Gao, Y. Qu, H. Yin, Y. Liu, H. Sun, Travel time prediction: based on gated recurrent unit method and data fusion, IEEE Access 6 (2018) 70463–70472.
X. Zeng, 2011. Dynamically predicting corridor travel time under incident conditions using a neural network approach.
M. Yildirimoglu, N. Geroliminis, Experienced travel time prediction for congested freeways, Transp. Res. Part B Methodol. 53 (2013) 45–63.
H. Chen, H.A. Rakha, Prediction of dynamic freeway travel times based on vehicle trajectory construction, in: 2012 15th International IEEE Conference on Intelligent Transportation Systems, IEEE, 2012, pp. 576–581.
M. Wang, Q. Ma, Dynamic prediction method of route travel time based on interval velocity measurement system, in: Proceedings of 2014 IEEE International Conference on Service Operations and Logistics, and Informatics, IEEE, 2014, pp. 172–176
S.-K.S. Fan, C.-J. Su, H.-T. Nien, P.-F. Tsai, C.-Y. Cheng, Using machine learning and big data approaches to predict travel time based on historical and real-time data from Taiwan electronic toll collection, Soft Comput. 22 (2018) 5707–5718
N.-E. El Faouzi, R. Billot, S. Bouzebda, Motorway travel time prediction based on toll data and weather effect integration, IET Intell. Transp. Syst. 4 (2010) 338–345.
C. Kamga, M.A. Yazıcı, Temporal and weather related variation patterns of urban travel time: considerations and caveats for value of travel time, value of variability, and mode choice studies, Transp. Res. Part C. 45 (2014) 4–16

Great Resignations, employee attrition analysis using Machine Learning Algorithms

gouse — Thu, 20 Oct 2022 06:27:40 +0000

Abstract:

Recently there has been a high level of attrition across all industries around the globe. In this article we analyze some of the most influential reasons and factors using ML algorithms.

Attrition is defined as an employee leaving the organization, for any reason. The attrition rate is the number of employees who leave an organization relative to the average number of employees in the organization over a period of time. If the attrition rate is higher than usual, it becomes a matter of concern: a high attrition rate means a huge loss of talent for the company. So it is always advisable to predict employee attrition beforehand [2,3]. If the company has information on the employees who may leave, it can take preventive steps to contain the attrition. In this analysis we explore the important factors and attributes that influence employee attrition, and how each factor contributes to it. We applied machine learning algorithms for classification and prediction, together with data pre-processing techniques such as data extraction, feature engineering and data sampling. Predictive classification models of this kind are implemented in companies to keep track of attrition risk and, in turn, to avoid or mitigate employee attrition.
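The attrition rate definition above is a simple ratio. A minimal sketch follows, assuming the average headcount is taken as the mean of the headcount at the start and end of the period; that convention, the function name and the numbers are assumptions, not something the article specifies.

```python
def attrition_rate(leavers, headcount_start, headcount_end):
    """Leavers divided by the average headcount over the period."""
    average_headcount = (headcount_start + headcount_end) / 2
    if average_headcount <= 0:
        raise ValueError("average headcount must be positive")
    return leavers / average_headcount


# Hypothetical year: 45 leavers, headcount 480 at the start and 520 at the end.
rate = attrition_rate(45, 480, 520)
print(f"attrition rate: {rate:.1%}")
```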

Keywords: Attrition, Classification, Prediction, Feature Extraction, Feature Engineering

Statement of problem and objective:

Attrition is a big problem in many organizations [4]. In any organization, a small attrition rate is common. But if it is high, it becomes a matter of concern, and the reasons for the high attrition rate need to be investigated so that the company can take the required measures to reduce it in future [5,6,7]. If a large number of employees leave an organization, there will be a huge loss of production, economic loss, loss of clients and damage to the company's image. It affects the organization in many ways. Hence, it is necessary to investigate the reasons behind a high attrition rate and to build a model to predict employee attrition [12–14].

Methodologies for Analysis:

In line with the objective of the research question, we adopted the chi2 test statistic and evaluated the predictions of employee attrition. The analysis was carried out using different algorithms: Logistic Regression, Linear Discriminant Analysis, K Nearest Neighbors, Classification and Regression Tree, Gaussian Naïve Bayes, and Support Vector Machine. We used the chi2 test to find the features that affect the attrition of an employee [15]. At the beginning, we applied data validation techniques and encoding techniques to convert categorical variables to numerical variables. Based on the sample experiment data,

2.1 Data Acquisition: Dataset Description

The HRM dataset used in this research work is distributed by IBM Analytics [32]. This dataset contains 35 features relating to 1500 observations and refers to U.S. data. All features are related to the employees’ working life and personal characteristics (see Table 1).

Table 1. Dataset features: Age, Monthly income, Attrition (predicted), Monthly rate, Business travel, Number of companies worked, Daily rate, Over18, Department, Overtime, Distance from home, Percent salary hike, Education, Performance rating, [33] Education field, Relationship satisfaction [1,7], Employee count, Standard hours, Employee number, Stock option level, Environment satisfaction, Total working years, Gender, Training times last year, Hourly rate, Work-life balance, Job involvement, Years with company, Job level, Years in current role, Job role, Years since last promotion, Job satisfaction, Years with current manager, Marital status [34–38]

Attrition: A high attrition rate triggers high recruitment costs for resourcing new employees. So it is always helpful for companies to know the factors that influence employee attrition. Here, the chi2 test statistic is used to find a strong relation or dependency of the attrition variable [22] on the input features of the given data.

2.2 Feature Engineering Techniques for Character Data:

When there are many predictors or features, the degree of association between each input feature and the target feature (outcome) can be measured with statistics such as chi2. The features with higher chi2 test statistic values can be the best features to consider for modelling. Only chi2 scores with p values less than 0.01 are treated as valid. Nine features have high chi2 values and p values less than 0.01, and these are the nine features affecting attrition: ‘DistanceFromHome’, ‘JobLevel’, ‘MaritalStatus’, ‘OverTime’, ‘StockOptionLevel’, ‘TotalWorkingYears’, ‘YearsAtCompany’, ‘YearsInCurrentRole’, and ‘YearsWithCurrManager’.
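To illustrate what a chi2 score measures, here is a minimal sketch of the Pearson chi-square statistic for the contingency table of one categorical feature against the attrition label. It is an assumption that this matches the article's procedure: the author's code is not shown, and a library routine may compute the score differently (for example, directly on encoded numeric columns). The function name and the data are hypothetical, and the p value is left out because it needs a chi-square distribution function.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Pearson chi-square statistic and degrees of freedom for two
    equal-length sequences of categorical values."""
    if len(feature) != len(target) or not feature:
        raise ValueError("need two non-empty sequences of the same length")
    n = len(feature)
    observed = Counter(zip(feature, target))
    row_totals = Counter(feature)
    col_totals = Counter(target)
    statistic = 0.0
    for f_value, row_total in row_totals.items():
        for t_value, col_total in col_totals.items():
            expected = row_total * col_total / n  # count expected if independent
            statistic += (observed[(f_value, t_value)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical sample: overtime flag against whether the employee left.
overtime = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"]
attrition = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "Yes", "No"]

stat, dof = chi_square_statistic(overtime, attrition)
print(f"chi2 = {stat:.3f} with {dof} degree(s) of freedom")
```

A score of 0 means the feature and the label look independent in the sample; larger scores indicate stronger dependence, which is how the features are ranked.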

i. Distance From Home: This is one of the input features that influences attrition according to the chi2 test result. Its chi2 score against the Attrition variable is 59.49, a high score that indicates strong dependency of the target variable on this feature. The bar plot in [Fig01] shows the effect of Distance From Home on Attrition: employees who live 2 km from the office (near the office) are more likely to leave the company, while those who live much farther away are not inclined to leave. The pie chart in [Fig02] shows that, among the employees who leave the company, the largest group lives 2 km from the office. The data shows that 11.81% of employees who left the company live 2 km from the office and 10.97% live 1 km from it.

ii. Job Level: Job Level is another input feature that influences attrition according to the chi2 test result. Its chi2 score against the Attrition variable is 21.74, a good score that indicates strong dependency of the target variable on this feature. There are five job levels in the data, and only a few of them influence attrition. The bar plot in [Fig03] shows the effect of Job Level on Attrition: employees at Job Level 1 are more likely to leave the company, while those at Job Levels 4 and 5 are not inclined to leave. The pie chart in [Fig04] shows that, among the employees who leave the company, the largest group is at Job Level 1. The data shows that 60.34% of employees who left the company were at Job Level 1, followed by 21.94% at Job Level 2.

iii. Marital Status: Marital Status is another input feature that influences attrition according to the chi2 test result. Its chi2 score against the Attrition variable is 12.93, a good score that indicates good dependency of the target variable on this feature. The bar plot in [Fig05] shows the effect of Marital Status on Attrition: employees who are Single are more likely to leave the company, while those who are Divorced are less inclined to leave. The pie chart in [Fig06] shows that, among the employees who leave the company, the largest group is Single. The data shows that 50.63% of employees who left the company were Single, 35.44% were Married and the remaining 13.92% were Divorced.

iv. Over Time: Over Time is another input feature that influences attrition according to the chi2 test result. Its chi2 score against the Attrition variable is 56.92, a high score that indicates strong dependency of the target variable on this feature. The bar plot in [Fig07] shows the effect of Over Time on Attrition: employees who work overtime are more likely to leave the company than those who do not. The pie chart in [Fig08] shows that, among the employees who leave the company, the larger group works overtime. The data shows that 53.59% of employees who left the company were working overtime.

v. Stock Option Level: Stock Option Level is another input feature that influences attrition according to the chi2 test result. Its chi2 score against the Attrition variable is 17.31, a good score that indicates good dependency of the target variable on this feature. The bar plot in [Fig09] shows the effect of Stock Option Level on Attrition: employees at Stock Option Level 0 are more likely to leave the company than those at Stock Option Levels 1, 2 and 3. The pie chart in [Fig10] shows that, among the employees who leave the company, most are at Stock Option Level 0. The data shows that 64.98% of employees who left the company had Stock Option Level 0, followed by 23.63% with Stock Option Level 1.

vi. Total Working Years: Total Working Years is another input feature that influences attrition according to the chi2 test result. Its chi2 score against the Attrition variable is 219.33, a large score that indicates high dependency of the target variable on this feature. The bar plot in [Fig11] shows the effect of Total Working Years on Attrition: employees with 1 year of total working experience are more likely to leave the company, and those with more than 11 years are less likely to leave. The pie chart in [Fig12] shows that, among the employees who leave the company, the largest group has 1 year of total working experience. The data shows that 16.88% of employees who left the company had 1 year of total working experience, followed by 10.55% with 10 years.

vii. Years At Company: Years At Company is another input feature that influences attrition according to the chi2 test result. Its chi2 score against the Attrition variable is 145.78, a large score that indicates high dependency of the target variable on this feature. The bar plot in [Fig13] shows the effect of Years At Company on Attrition: employees with 1 year at the company are more likely to leave, and those who have been with the company for more than 10 years are less likely to leave. The pie chart in [Fig14] shows that, among the employees who leave the company, the largest group has 1 year of experience at the company. The data shows that 24.89% of employees who left the company had 1 year of experience at the company, followed by 11.39% with 2 years.

viii. Years In Current Role: Years In Current Role is another input feature that influences attrition according to the chi2 test result. Its chi2 score against the Attrition variable is 103.62, a large score that indicates high dependency of the target variable on this feature. The bar plot in [Fig15] shows the effect of Years In Current Role on Attrition: employees with less than 1 year in their current role are more likely to leave the company, and those with more than 10 years in their current role are less likely to leave. The pie chart in [Fig16] shows that, among the employees who leave the company, the largest group has less than 1 year of experience in the current role. The data shows that 30.80% of employees who left the company had less than 1 year of experience in their current role, followed by 28.69% with 2 years.

ix. Years With Current Manager: Years With Current Manager is another input feature that influences attrition according to the chi2 test result. Its chi2 score against the Attrition variable is 120.49, a large score that indicates high dependency of the target variable on this feature. The bar plot in [Fig17] shows the effect of Years With Current Manager on Attrition: employees with less than 1 year with their current manager are more likely to leave the company, and those with more than 10 years are less likely to leave. The pie chart in [Fig18] shows that, among the employees who leave the company, the largest group has less than 1 year with the current manager. The data shows that 35.86% of employees who left the company had less than 1 year with their current manager, followed by 21.10% with 2 years.
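The percentages quoted for each feature are shares among the employees who left. A minimal sketch of that calculation follows; the helper name `share_among_leavers` and the toy records are hypothetical and are not drawn from the IBM data.

```python
from collections import Counter

def share_among_leavers(feature, attrition, positive="Yes"):
    """Percentage of leavers falling in each category of `feature`."""
    leavers = [f for f, a in zip(feature, attrition) if a == positive]
    if not leavers:
        return {}
    counts = Counter(leavers)
    # most_common() orders categories from largest to smallest share
    return {value: 100.0 * c / len(leavers) for value, c in counts.most_common()}

# Hypothetical toy records.
marital   = ["Single", "Single", "Married", "Divorced", "Single", "Married"]
attrition = ["Yes",    "Yes",    "Yes",     "No",       "No",     "No"]
shares = share_among_leavers(marital, attrition)
for status, pct in shares.items():
    print(f"{status}: {pct:.2f}% of leavers")
```

Note that this measures the composition of the leavers, not the attrition rate inside each category; the two can differ when categories have very different sizes.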

2.2 Feature Engineering for Numerical Data :

The numerical variables available for modelling are “Age”, “DailyRate”, “HourlyRate”, “MonthlyIncome” and “MonthlyRate”. It is often necessary to check the correlation among the numerical features in a dataset. If any of them are highly correlated, the redundant ones should be removed, because redundant features frequently reduce the performance of machine learning models. [table01] shows the correlation among all numerical features.
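The redundancy check described here can be sketched with a plain Pearson correlation. The 0.9 cut-off, the `AgeInMonths` column and the toy values are assumptions for illustration only; the article does not state a threshold.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_features(columns, threshold=0.9):
    """Names of columns that are highly correlated with an earlier kept column."""
    names = list(columns)
    dropped = set()
    for i, a in enumerate(names):
        if a in dropped:
            continue
        for b in names[i + 1:]:
            if b not in dropped and abs(pearson(columns[a], columns[b])) > threshold:
                dropped.add(b)
    return sorted(dropped)

# Hypothetical toy columns; AgeInMonths is deliberately a copy of Age in other units.
columns = {
    "Age":         [25, 32, 41, 50, 29],
    "AgeInMonths": [300, 384, 492, 600, 348],
    "HourlyRate":  [66, 40, 95, 52, 71],
}
print(redundant_features(columns))
```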

i. Age: “Age” is one of the numerical variables that is useful in predicting employee attrition. It exhibits an approximately Gaussian distribution, with skewness of 0.413 and kurtosis of -0.404, which are acceptable values. The data also shows that many employees left the organization at the ages of 29 and 31.

ii. Daily Rate: “Daily Rate” is another numerical variable that is useful in predicting employee attrition. It exhibits an approximately Gaussian distribution, with skewness of -0.0035 and kurtosis of -1.203, which are acceptable values.

iii. Hourly Rate: “Hourly Rate” is one more numerical variable that is useful in predicting employee attrition. It exhibits an approximately Gaussian distribution, with skewness of -0.0323 and kurtosis of -1.196, which are acceptable values. The data also shows that the largest number of employees who left the organization had an Hourly Rate of 66.

iv. Monthly Income: “Monthly Income” is one of the numerical variables that is useful in predicting employee attrition. It does not exhibit a Gaussian distribution: its skewness is 1.369, which is not acceptable, and its kurtosis is 1.005. A logarithmic transformation is therefore applied to this variable to bring it closer to a Gaussian distribution. After the transformation, the skewness is 0.286 and the kurtosis is -0.697, which are acceptable values.

v. Monthly Rate: “Monthly Rate” is one of the numerical variables that is useful in predicting employee attrition. It exhibits an approximately Gaussian distribution, with skewness of 0.0185 and kurtosis of -1.214, which are acceptable values.
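The skewness and kurtosis checks, and the effect of the logarithmic transformation on a right-skewed variable, can be sketched as below. This uses the simple moment-based (population) formulas with excess kurtosis; the article does not say which estimator produced its figures, and the income values are hypothetical.

```python
import math

def skewness(values):
    """Moment-based skewness: m3 / m2^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a Gaussian scores about 0."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

# Hypothetical right-skewed incomes (a few large values stretch the tail).
monthly_income = [2000, 2500, 3000, 3500, 4000, 6000, 9000, 19000]
log_income = [math.log(v) for v in monthly_income]

print(f"raw skewness: {skewness(monthly_income):.3f}")
print(f"log skewness: {skewness(log_income):.3f}")
print(f"log kurtosis: {excess_kurtosis(log_income):.3f}")
```

The log compresses the long right tail, which is why the transformed variable shows a lower skewness than the raw one.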

2.3 Machine Learning /AI Algorithms Description:

i. Logistic Regression: Logistic regression models binary classification problems. It does not strictly require Gaussian inputs, but it often behaves better when the numeric input variables are roughly Gaussian and similarly scaled.

ii. Linear Discriminant Analysis: Linear Discriminant Analysis or LDA is a statistical technique for binary and multiclass classification. It assumes that the numerical input variables follow a Gaussian distribution within each class.

iii. K Nearest Neighbors: The k-Nearest Neighbors algorithm (or KNN) uses a distance metric such as Euclidean distance to find the k nearest instances in the training data for a new instance. For classification it takes the most common class among those neighbors as the prediction (for regression, the mean outcome).

iv. Naive Bayes: Naive Bayes calculates the probability of each class and the conditional probability of each input value given each class. For new data these probabilities are multiplied together, assuming that the inputs are all independent (a simple or naive assumption).

v. CART: Classification and Regression Trees construct a binary tree from the training data. Split points are chosen greedily by evaluating each attribute and each value of each attribute in the training data in order to minimize a cost function.

vi. SVM: Support Vector Machines (or SVM) seek the line (a hyperplane when there are more than two features) that best separates two classes. The data instances closest to that separating line are called support vectors.
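To make one of these descriptions concrete, here is a minimal sketch of k-nearest-neighbours classification with Euclidean distance and a majority vote. It is a from-scratch illustration rather than the library implementation used for the reported results, and the points and labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority class among the k training points nearest to `query`."""
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature training points.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["Stay", "Stay", "Stay", "Leave", "Leave", "Leave"]

print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.0, 5.1)))
```

Because the prediction depends directly on distances, features on larger scales dominate unless the data is rescaled first.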

2.5 Evaluating Models:

To avoid data leakage, the whole dataset is first divided into training and validation datasets. A pipeline is used to automate the scaling and evaluation of the algorithms. Accuracy is chosen as the evaluation metric, and KFold cross validation is used for resampling and evaluating the different algorithms. MinMaxScaler is used to rescale the features to a common range. The algorithms used are Logistic Regression, Linear Discriminant Analysis, K Nearest Neighbors, Classification and Regression Tree, Gaussian Naïve Bayes and Support Vector Machine. The scores below are the mean and standard deviation of the accuracy over 10 folds of KFold cross validation.

i. Logistic Regression has given mean accuracy of 0.851 and standard deviation of 0.027 on the training data of this dataset.

ii. Linear Discriminant Analysis has given mean accuracy of 0.845 and standard deviation of 0.024 on the training data of this dataset.

iii. K Nearest Neighbors has given mean accuracy of 0.832 and standard deviation of 0.026 on the training data of this dataset.

iv. Decision Tree Classifier or CART has given mean accuracy of 0.772 and standard deviation of 0.034 on the training data of this dataset.

v. Naive Bayes has given mean accuracy of 0.768 and standard deviation of 0.037 on the training data of this dataset.

vi. Support Vector Machine has given mean accuracy of 0.849 and standard deviation of 0.030 on the training data of this dataset.
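The evaluation procedure in this section (min-max scaling inside each fold, 10-fold cross validation, mean and standard deviation of accuracy) can be sketched from scratch as follows. This is not the authors' pipeline: the 1-nearest-neighbour classifier, the function names, the random seed and the synthetic two-cluster data are all assumptions chosen to keep the example self-contained.

```python
import math
import random
import statistics

def kfold_indices(n, k, seed=7):
    """Shuffle the row indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_min_max(rows):
    """Per-column (min, max) learned from the training rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_min_max(rows, ranges):
    return [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                  for v, (lo, hi) in zip(row, ranges)) for row in rows]

def nearest_neighbour(train_X, train_y, query):
    return min(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))[1]

def cross_val_accuracy(X, y, k=10):
    scores = []
    for test_idx in kfold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Fit the scaler on the training folds only, to avoid leakage.
        ranges = fit_min_max([X[i] for i in train_idx])
        train_X = apply_min_max([X[i] for i in train_idx], ranges)
        test_X = apply_min_max([X[i] for i in test_idx], ranges)
        train_y = [y[i] for i in train_idx]
        hits = sum(nearest_neighbour(train_X, train_y, q) == y[i]
                   for q, i in zip(test_X, test_idx))
        scores.append(hits / len(test_idx))
    return statistics.mean(scores), statistics.pstdev(scores)

# Synthetic, well-separated data: (age, monthly income) for two hypothetical groups.
rng = random.Random(0)
X = ([(rng.gauss(30, 3), rng.gauss(3000, 300)) for _ in range(50)]
     + [(rng.gauss(45, 3), rng.gauss(8000, 300)) for _ in range(50)])
y = [1] * 50 + [0] * 50
mean_acc, std_acc = cross_val_accuracy(X, y, k=10)
print(f"mean accuracy {mean_acc:.3f}, standard deviation {std_acc:.3f}")
```

Fitting the scaler inside each fold is the point of using a pipeline: the held-out fold never influences the scaling parameters.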

2.6 Finalizing Model:

From the above analysis and the statistics produced by the machine learning models, the scorecard shows that the best model is Logistic Regression, which has the highest accuracy among the models compared. It is therefore selected as the final model, and its accuracy is checked on the unseen validation dataset. It gives an accuracy of 0.8639, which is a good score on validation data. The test data of this dataset is used for further predictions.

3 Findings:

From our research on employee attrition in an organization, the significant influencing factors were extracted through feature extraction and feature engineering techniques, and it can be concluded that ‘DistanceFromHome’, ‘JobLevel’, ‘MaritalStatus’, ‘OverTime’, ‘StockOptionLevel’, ‘TotalWorkingYears’, ‘YearsAtCompany’, ‘YearsInCurrentRole’, ‘YearsWithCurrManager’, “Age”, “DailyRate”, “HourlyRate”, “MonthlyIncome” and “MonthlyRate” are the features that affect attrition. The Logistic Regression algorithm works best on this binary classification problem, with an accuracy of about 85%.

4 Conclusion:

A high attrition rate is a problem that must be carefully examined and investigated to find the reasons behind it, in order to avoid major losses for the organization. In our research, we identified the major factors that drive employee attrition and developed models to predict possible employee attrition. This may help an organization to take the steps required to avoid the losses caused by attrition, or to apply preventive measures to retain employees who might otherwise leave. The factors above are the most influential in employee attrition.

Figures referred:

References:

Cockburn, I.; Henderson, R.; Stern, S. The Impact of Artificial Intelligence on Innovation. In The Economics of Artificial Intelligence: An Agenda; University of Chicago Press: Chicago, IL, USA, 2019; pp. 115–146.
Jarrahi, M. Artificial intelligence and the future of work: Human-AI symbiosis in organizational decision making. Bus. Horiz. 2018, 61, 577–586.
Yanqing, D.; Edwards, J.; Dwivedi, Y. Artificial intelligence for decision making in the era of Big Data. Int. J. Inf. Manag. 2019, 48, 63–71.
Paschek, D.; Luminosu, C.; Dra, A. Automated business process management-in times of digital transformation using machine learning or artificial intelligence. In MATEC Web of Conferences; EDP Sciences: Les Ulis, France, 2017; Volume 121.
Varian, H. Artificial Intelligence, Economics, and Industrial Organization; National Bureau of Economic Research: Cambridge, MA, USA, 2018.
Vardarlier, P.; Zafer, C. Use of Artificial Intelligence as Business Strategy in Recruitment Process and Social Perspective. In Digital Business Strategies in Blockchain Ecosystems; Springer: Berlin/Heidelberg, Germany, 2019; pp. 355–373.
Gupta, P.; Fernandes, S.; Manish, J. Automation in Recruitment: A New Frontier. J. Inf. Technol. Teach. Cases 2018, 8, 118–125.
Geetha, R.; Bhanu Sree Reddy, D. Recruitment through artificial intelligence: A conceptual study. Int. J. Mech. Eng. Technol. 2018, 9, 63–70.
Syam, N.; Sharma, A. Waiting for a sales renaissance in the fourth industrial revolution: Machine learning and artificial intelligence in sales research and practice. Ind. Mark. Manag. 2018, 69, 135–146.
Mishra, S.; Lama, D.; Pal, Y. Human Resource Predictive Analytics (HRPA) For HR Management in Organizations. Int. J. Sci. Technol. Res. 2016, 5, 33–35.
Jain, N.; Maitri. Big Data and Predictive Analytics: A Facilitator for Talent Management. In Data Science Landscape; Springer: Singapore, 2018; pp. 199–204.
Boushey, H.; Glynn, S.J. There Are Significant Business Costs to Replacing Employees. Cent. Am. Prog. 2012, 16, 1–9.
Martin, L. How to retain motivated employees in their jobs? Econ. Ind. Democr. 2018, 34, 25–41.
involvement management and organizational performance: The mediating roles of job satisfaction and wellbeing. Hum. Relat. 2012, 65, 419–446.
Zelenski, J.M.; Murphy, S.A.; Jenkins, D.A. The happy-productive worker thesis revisited. J. Happiness Stud. 2008, 9, 521–537.
Clark, A.E. What really matters in a job? Hedonic measurement using quit data. Labour Econ. 2001, 8, 223–242.
Clark, A.E.; Georgellis, Y.; Sanfey, P. Job satisfaction, wage changes, and quits: Evidence from Germany. Res. Labor Econ. 1998, 17, 95–121.
Delfgaauw, J. The effect of job satisfaction on job search: Not just whether, but also where. Labour Econ. 2007, 14, 299–317.
Green, F. Well-being, job satisfaction and labour mobility. Labour Econ. 2010, 17, 897–903.
Kristensen, N.; Westergaard-Nielsen, N. Job satisfaction and quits – which job characteristics matters most? Dan. Econ. J. 2006, 144, 230–249.
Marchington, M.; Wilkinson, A.; Donnelly, R.; Kynighou, A. Human Resource Management at Work; Kogan Page Publishers: London, UK, 2016.
Van Reenen, J. Human resource management and productivity. In Handbook of Labor Economics; Elsevier: Amsterdam, The Netherlands, 2011.
Deepak, K.D.; Guthrie, J.; Wright, P. Human Resource Management and Labor Productivity: Does Industry Matter? Acad. Manag. J. 2005, 48, 135–145.
Gordini, N.; Veglio, V. Customers churn prediction and marketing retention strategies. An application of support vector machines based on the AUC parameter-selection technique in B2B e-commerce industry. Ind. Mark. Manag. 2016, 62, 100–107.
Keramati, A.; Jafari-Marandi, R.; Aliannejadi, M.; Ahmadian, I.; Mozaffari, M.; Abbasi, U. Improved churn prediction in telecommunication industry using data mining techniques. Appl. Soft Comput. 2014, 24, 994–1012.
Alao, D.; Adeyemo, A. Analyzing employee attrition using decision tree algorithms. Comput. Inf. Syst. Dev. Inf. Allied Res. J. 2013, 4, 17–28.
Nagadevara, V. Early Prediction of Employee Attrition in Software Companies-Application of Data Mining Techniques. Res. Pract. Hum. Resour. Manag. 2008, 16,
Rombaut, E.; Guerry, M.A. Predicting voluntary turnover through Human Resources database analysis. Manag. Res. Rev. 2018, 41, 96–112.
Usha, P.; Balaji, N. Analysing Employee attrition using machine learning. Karpagam J. Comput. Sci. 2019, 13, 277–282.
Ponnuru, S.; Merugumala, G.; Padigala, S.; Vanga, R.; Kantapalli, B. Employee Attrition Prediction using Logistic Regression. Int. J. Res. Appl. Sci. Eng. Technol. 2020, 8, 2871–2875.
Microsoft Docs: Team Data Science Process.
IBM HR Analytics Employee.
CrowdFlower. Data Science Report. 2016.
Antecol, H.; Cobb-Clark, D. Racial harassment, job satisfaction, and intentions to remain in the military. J. Popul. Econ. 2009, 22, 713–738.
Böckerman, P.; Ilmakunnas, P. Job disamenities, job satisfaction, quit intentions, and actual separations: Putting the pieces together. Ind. Relations 2009, 48, 73–96.
Theodossiou, I.; Zangelidis, A. Should I stay or should I go? The effect of gender, education and unemployment on labour market transitions. Labour Econ. 2009, 16, 566–577.
Böckerman, P.; Ilmakunnas, P.; Jokisaari, M.; Vuori, J. Who stays unwillingly in a job? A study based on a representative random sample of employees. Econ. Ind. Democr. 2013, 34, 25–41.
Griffeth, R.W.; Hom, P.W.; Gaertner, S. A meta-analysis of antecedents and correlates of employee turnover: Update, moderator tests, and research implications for the next millennium. J. Manag. 2000, 26, 463–488.

Detect Money Laundering activities using Machine Learning/Artificial Intelligence based on Company profile

gouse — Thu, 20 Oct 2022 06:18:57 +0000

Abstract:

Money laundering is among the biggest crimes in the banking and non-banking sectors, and it affects countries’ economies. Most developing Asian countries face this problem because of a lack of technology adoption. Money laundering can be carried out in several ways, one of which is through transactions. In the 21st century, money laundering activities are very difficult to detect using manual measures. In this era of machine learning and artificial intelligence, fraudulent money laundering activity can be detected or estimated from transactions using ‘SAS AI’ and ‘Machine learning’. These two elements are redefining anti-money laundering (AML hereafter) [02]. Since 2015, many companies have been closed for various reasons in actions initiated by the Ministry of Corporate Affairs (MCA). [5] The unavailability of master data does not by itself explain why the MCA closed these companies, but it is plausible that money laundering was a major cause.

The main objective of this research is an ML/AI method for detecting money laundering activities using company profiles. The following factors may indicate money laundering activities in a business: 1) Company profile; 2) Directorships; 3) Frequent change of directors in companies; 4) Same directors holding multiple directorships; 5) Address change; 6) Non-filing of tax; 7) Company status (Active, Amalgamation, Strike Off); 8) Authorised capital; 9) Paid-up capital; and others. Using supervised machine learning and artificial intelligence algorithms (logistic regression, decision tree, Naive Bayes and KNN (K-Nearest Neighbour)) together with the statistical inferential technique MANOVA (multivariate analysis of variance), we can estimate and detect probable fraudulent money laundering activity [44].

Keywords: Money laundering, Company profiles, Machine Learning/AI, Decision Tree, MANOVA

Literature Review:

In our research, we reviewed secondary data and literature. Sharman (2012) says that shell companies whose real authorized owners cannot be identified act as a corporate veil for the proceeds of crime and fraud. Corporate transparency would help law enforcement agents to catch owners who misuse or misrepresent such companies. As per Sharman (2012), information on beneficial owners can be accessed in two ways [31]. First, the corporate registry or Know Your Customer questionnaire (KYC hereafter) information must be collected and held together with proof of the identity of the beneficial owners. The second way is to regulate the company service providers (CSPs hereafter), who can collect information about the beneficial owners of entities and provide it to regulators upon request. CSPs may be individuals, law firms or other firms whose sole purpose is incorporating companies [16].

Technologies and anti-money laundering: Finding the right flavour of machine learning approach for predicting ML/CFT (Combating the Financing of Terrorism) [9] depends on the input available for training the model and on the research objectives. A supervised machine learning model is presented with sample inputs and their associated outputs. The goal is to devise a general rule that maps those inputs to outputs. For example, which attributes were associated with cases that were turned into [11] findings, and which with false positives or false negatives? The model learns from these examples so that it predicts the outcome better when it is applied to new data. Some of the more advanced early adopters of AI are getting their pilot projects over the line, and doing so with great results.

Objective of Research:

The objective of this research article is to describe and validate ML models, based on company profile and company financial activities, for estimating and detecting money laundering, which is very difficult to detect using manual investigation methods. The establishment of many start-up companies in many parts of India increases the need for automated solutions that can detect money laundering activities. Such solutions could adopt ML/AI to control the True Positive and False Positive rates of money laundering detection. Factors such as 1) Company profile; 2) Directorships; 3) Frequent change of directors in companies; 4) Same directors holding multiple directorships; 5) Address change; 6) Non-filing of tax; 7) Company status (Active, Amalgamation, Strike Off); 8) Authorized capital; 9) Paid-up capital; and other criteria may indicate money laundering activities in businesses.

In general, anti-money laundering detection is hypothetical. In this research we tried to test it by applying the theory to the available data and measuring the features of the available sample, presuming that it can be demonstrated using the “statistical inferential experiment” [51] and special observations. This was subjected to extended investigation, which led to a tenable interpretation of the full information.

When a large amount of data is available, we can apply ML algorithms to predict and estimate future outcomes. When the data is insufficient or only a small sample is available, we can use inferential statistics for future predictions. As per the statistical evidence, a minimum of 30 samples is enough [23] for inferential statistics. In our research we have limited data, so under our hypothesis assumptions we can predict or infer the list of defaulting companies.

Section A:

H0: Company profile & KYC may lead to Money laundering significantly 0.05

H1: Company profile & KYC may not lead to money laundering .95

Section B:

H0: Based on company profiles we can estimate Fraudulent Detection of Money laundering 0.05

H1: Based on company profiles we cannot estimate Fraudulent Detection of Money laundering 0.05

Methodologies & Analysis:

As per the above objectives and hypotheses, we intend to apply statistical inferential techniques such as MANOVA, together with supervised machine learning (logistic regression, decision tree, Naive Bayes and KNN), to estimate fraudulent activity. A confusion matrix is used to assess the true positive and false positive rates and the accuracy of the predicted outcomes.
Source of Data: The ED (Enforcement Directorate) and SEBI (Securities and Exchange Board of India) have posted a list of 331 suspected shell companies. The Ministry of Corporate Affairs [5] and the Government of India built this list of shell companies, which was declared by SEBI. Many of these companies are already listed on the stock exchanges. For detecting fraudulent activity, we randomly collected companies that are not involved in money laundering activities alongside shell companies declared by the Government of India [8].
Tools & Techniques: For implementing the ML/AI and MANOVA statistical techniques, we used the Python, R rattle and SPSS software packages [37]. MANOVA analysis finds associations among the dependent variables, taking into account the effect of the sample sizes associated with the independent variables [34]. MANOVA’s power is lowest when the correlation equals the ratio of the smaller to the larger standardized effect size [10]. In our case, the dependent variable is the Target, coded 1 for Fraud and 0 for Good.

Data Analysis:

The MANOVA analysis for money laundering is summarised in the MANOVA inferential tables. The overall coefficient of effect on money laundering is R Squared = .666 (Adjusted R Squared = .647). The significant variables are:

No of Directors: F value 5.124, significance 0.025
Dir1Peer_company: F value 49.743, significance 0.001
Age of Company: F value 17.192, significance 0.0001
Address change: F value 54.547, significance 0.000
Outside branches: F value 5.933, significance 0.016

These variables are significant at the 0.05 level [39]. The remaining variables are not significant. For money laundering detection, these significant variables are the most influential.

H0: Company history profile factors lead to money laundering significantly 0.05,

H1: Company history profile factors may not lead to money laundering. At 95%, the alternate hypothesis is accepted.

ML/AI: For the current generation, ML is a new computing technology [10]. It was born from the theory that computers can learn from training data to perform specific tasks. Computers can learn from historical computations to produce reliable predictions, an idea that has gained fresh momentum. To test the hypothesis that money laundering activities can be estimated from company profiles, we use machine learning algorithms such as logistic regression, Naive Bayes [45], KNN and SVM, and model validation techniques such as the confusion matrix, ROC (receiver operating characteristic) curve, AUC and precision [41]. The models’ true positive and false negative cases are explained below:

H0: Based on company profiles we can estimate Fraudulent Detection of Money laundering 0.05

H1: Based on company profiles we cannot estimate Fraudulent Detection of Money laundering 0.05

The summary table above shows the ML models with their prediction rates. From this table we can identify the ML algorithm that best predicts money laundering companies. Using methodologies such as the confusion matrix, precision and recall, we can derive the binary prediction from the probabilities.

The individual model results are shown below:

Logistic regression: Logistic regression measures the relationship between the categorical target variable and multiple independent explanatory variables by estimating probabilities using a logistic function [23], which is the cumulative distribution function of the logistic distribution [2]. According to the validation table for the logistic regression model, 40 of the 44 companies not involved in money laundering are estimated as True positive (91%). The Fraudulent False negative figure, out of 38 companies, is 74%; this is the recall. Validation covers 82 companies in total, and on this data logistic regression estimates money laundering companies with 87% accuracy [25].
Naive Bayes model: This algorithm works with prior and posterior probabilities to make predictions. The model validation table suggests that, of the 82 companies in total, 40 of the 44 companies not involved in money laundering are estimated as True positive (91%), and the Fraudulent False negative figure out of 38 companies is 74%, which is the recall. The overall accuracy is 83% on the 82 validation companies.
KNN model: KNN predicts an outcome from the nearest observations. The model validation table suggests that, of the 44 companies not involved in money laundering, 41 are marked as True positive (93%) and 3 as Fraudulent False negative. Of the 38 fraudulent companies, 33 are predicted correctly as Fraudulent True negative (87%), which is the recall, and 5 as False Negative. The overall model accuracy is 90% across the 82 validation companies.
Decision Tree CHAID model: This is a classification model driven by rules and goals. The model validation table suggests that, of the 44 companies not involved in money laundering, 41 are predicted as True positive (93%) and 3 as False negative. Of the 38 fraudulent companies, 35 are predicted correctly as Fraudulent True negative (92%), which is the recall, and 3 as False Negative. The overall model accuracy is 93% across the 82 validation companies.
Model comparison: Comparing the models, KNN (the k-nearest neighbours algorithm, k-NN) is the best model for predicting companies involved in money laundering activities from company profiles. The other models mentioned in the research provide similar estimates.
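The validation figures reported above for the KNN and decision tree models can be recomputed from their counts. The sketch below assumes the fraudulent class is treated as the positive class, which differs from the labelling used in the text, and the helper name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix ratios from the four cell counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),        # share of fraudulent companies caught
        "specificity": tn / (tn + fp),   # share of non-fraudulent companies cleared
        "precision": tp / (tp + fp),
    }

# Counts taken from the text: 38 fraudulent and 44 non-fraudulent validation companies.
knn = classification_metrics(tp=33, fn=5, tn=41, fp=3)
tree = classification_metrics(tp=35, fn=3, tn=41, fp=3)

for name, m in (("KNN", knn), ("Decision tree", tree)):
    print(name, {key: round(value, 2) for key, value in m.items()})
```

Reading recall and specificity side by side shows how each model trades missed fraudulent companies against false alarms on legitimate ones.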

Research Findings:

I. No of Directors: If a company has changed its directors many times in a short span of time, that indicates unusual activity. It is also alarming if the same director has ties with other companies. In the historic data analysis, companies that changed their directors frequently, and whose directors had ties with fraudulent companies, tended towards money laundering. In the Indian MCA data, this variable is significant at .025 (reported as 97.75%). Companies that have more than one director, or a director who has worked in more than 3 companies, are struck off and amalgamated. In our data, 75% of the companies involved in money laundering had more than 3 directors, and every director was linked with multiple companies.

II. Dir1Peer_company: This variable is significant at .000 (99.99%) and is highly associated with the target variable. It follows on from the “No of Directors” variable above: if a company has more than three directors who are associated with other companies, the company is struck off or amalgamated. This is one of the major causes of money laundering.

III. Age of Company: The age of a company is one of the major factors in money laundering. Per the result above, this variable is significant at .000 (99.9%): old companies tend to be non-fraudulent and new companies fraudulent. The age of the company since its establishment has an impact when detecting money laundering activities; older companies are more trusted, with proper and constant auditing. New companies may be shell companies with no proper auditing, so their transparency cannot be judged.

IV. Address change: Companies that frequently change their registration address are more likely to be fraudulent, with a significance of .00001 (99.99%). Such companies may lack stability and consistency, which might contribute to money laundering. Our research data in the bar chart above shows the same.

V. Outside branches: Having more branches outside the country can be one of the causes of money laundering, with a significance of .016 at a 95% confidence interval. With many branches it is easier to process bulk money transactions through different channels, because the company KYC shows and proves legality.

VI. Paid-up capital: A company's paid-up capital is one of the contributors when deciding on money laundering activity, with a significance of .028. If paid-up capital is less than the authorized capital, the company has no improvement or development activities; when paid-up capital is higher, it indicates good growth in the company.
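
The significance values reported in findings I to VI can be gathered and ranked in a few lines. This is only a bookkeeping sketch: the values are copied from the findings above, while the 0.05 threshold is an assumption (it matches the 95% level mentioned for outside branches) and the variable names are illustrative.

```python
ALPHA = 0.05  # assumed significance threshold (95% level)

# Significance values as reported in findings I to VI.
reported_significance = {
    "No of Directors": 0.025,
    "Dir1Peer_company": 0.000,
    "Age of Company": 0.000,
    "Address change": 0.00001,
    "Outside branches": 0.016,
    "Paid-up capital": 0.028,
}

# Smallest value first, i.e. strongest reported association first.
ranked = sorted(reported_significance.items(), key=lambda kv: kv[1])
significant = [name for name, p in ranked if p < ALPHA]

for name, p in ranked:
    status = "significant" if p < ALPHA else "not significant"
    print(f"{name}: {p} ({status})")
```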

Conclusion. Based on our research, we conclude that a company’s KYC information and profile are major contributors to the detection of money laundering. The list of significant variables was identified with the inferential statistical method MANOVA, and all other significant variables were analysed in the research. The following attributes are the most influential variables in identifying potential money laundering companies: No of Directors, Dir1Peer_company, Age of Company, Address change, Outside branches, and Paidup_Capital.

The ML/AI algorithms show that possible money laundering companies can be predicted using logistic regression, Decision Tree, KNN, SVM, and Naive Bayes, and that the estimates can be compared across models. KNN is the best model for estimation with all the company KYC variables available from the MCA [5]. Machine learning algorithms are quite capable of predicting money laundering activities.

References:

Vaughan, G. (2018), “Shell companies, the role of company and trust service providers, and alternative banking platforms highlighted in NZ Police money laundering report”.
SAS Corporation Articles
Luna, D. K., Palshikar, G. K., Apte, M. and Bhattacharya, A. (2018), “Finding shell company accounts using anomaly detection”, Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, Goa, India, ACM pp. 167- 174
Lee, A. and Palstra, N. (2018), “The Companies We Keep: What The UK’s Open Data Register Actually Tells Us About Company Ownership”, United Kingdom, Global Witness.
Chen, T. and Guestrin, C. (2016), “Xgboost: a scalable tree boosting system”, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 785–794
DeLong, E.R., DeLong, D.M. and Clarke-Pearson, D.L. (1988), “Comparing the areas under two or more correlated receiver operating
Zauba Corp is India’s leading provider of commercial information and insight on businesses.
Learn more at sas.com/en_us/software/anti-money-laundering.
Grint, R. O’Driscoll, C. and Patton, S. (2017), “New technologies and anti-money laundering compliance report”
Sarker, I. H. (2021), “Machine Learning: Algorithms, Real-World Applications and Research Directions”, Springer Nature Singapore, published online 22 March 2021.
Økokrim (2016), Annual Report, The National Authority for Investigation and Prosecution of Economic and Environmental Crime.
Lopez-Rojas, E.A. and Axelsson, S. (2012), Money Laundering Detection Using Synthetic Data, the 27th Annual Workshop of the Swedish Artificial Intelligence Society (SAIS), 14–15 May 2012, Linköping University Electronic Press, Örebro, pp. 33–40
Whitrow, C., Hand, D.J., Juszczak, P., Weston, D. and Adams, N.M. (2009), “Transaction aggregation as a strategy for credit card fraud detection”, Data Mining and Knowledge Discovery, Vol. 18 №1, pp. 30–55.
Regulatory Reform On The Company Service Providers Regime
The Norwegian Money Laundering Act, Chapter 3 (2009), “The Norwegian money laundering act, chapter 3”, In Norwegian
US Congress (1995), “Office of Technology Assessment, Information Technologies for Control of Money
Singh, K. and Best, P. (2019), “Anti-Money Laundering: Using data visualization to identify suspicious activity”, International Journal of Accounting Information Systems.
Song, X., Hu, Z., Du, J. and Sheng, Z. (2014), “Application of Machine Learning Methods to Risk Assessment of Financial Statement Fraud: Evidence from China”, Journal of Forecasting, Vol. 33 №8, pp. 611–626
Sample size estimation and power analysis for clinical research studies.
UNODC (2004), “United Nations Convention Against Transnational Organized Crime and the Protocols Thereto”, Vienna, United Nations
Walker, J. (1999), “How Big is Global Money Laundering?”, Journal of Money Laundering Control, Vol. 3 №1, pp. 25–37.
Wedge, R., Kanter, J. M., Rubio, S. M., Perez, S. I. and Veeramachaneni, K. (2017), “Solving the ‘false positives’ problem in fraud prediction”, arXiv preprint arXiv:1710.07709.
Zdanowicz, J. S. (2004a), “Detecting money laundering and terrorist financing via data mining”, Communications of the ACM, Vol. 47 №5, pp. 53–55.
Zdanowicz, J. S. (2004b), “U.S. Trade with the World and Al Qaeda Watch List Countries — 2001: An Estimate of Money Moved Out of and Into the U.S. Due to Suspicious Pricing in International Trade”
Yujin O, Park S, Ye JC. Deep learning covid-19 features on cxr using limited training data sets. IEEE Trans Med Imaging. 2020;39(8):2688–700.
Otter DW, Medina JR , Kalita JK. A survey of the usages of deep learning for natural language processing. IEEE Trans Neural Netw Learn Syst. 2020.
Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009;36(2):3336–41.
Pearson K. LIII. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.
Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2018;7:1365–75.
Santi P, Ram D, Rob C, Nathan E. Behavior-based adaptive call predictor. ACM Trans Auton Adapt Syst. 2011;6(3):21:1–21:28.
Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: applications on robotics. J Intell Robot Syst. 2017;86(2):153–73.
Puterman ML. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons; 2014.
Wasserstein RL, Lazar NA. The ASA’s statement on P values: Context, process, and purpose. Am Stat. 2016;70:129–33.
Zhu H, Chen E, Xiong H, Kuifei Y, Cao H, Tian J. Mining mobile user preferences for personalized context-aware recommendation. ACM Trans Intell Syst Technol (TIST). 2014;5(4):58.
Zikang H, Yong Y, Guofeng Y, Xinyu Z. Sentiment analysis of agricultural product ecommerce review data based on deep learning. In: 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), IEEE, 2020; pages 1–7
Zulkernain S, Madiraju P, Ahamed SI. A context aware interruption management system for mobile devices. In: Mobile Wireless Middleware, Operating Systems, and Applications. Springer. 2010; pages 221–234
Zulkernain S, Madiraju P, Ahamed S, Stamm K. A mobile intelligent interruption management system. J UCS. 2010;16(15):2060–80.
Tsagkias M, Tracy HK, Surya K, Vanessa M, de Rijke M. Challenges and research opportunities in ecommerce search and recommendations. In: ACM SIGIR Forum. volume 54. NY, USA: ACM New York; 2021. p. 1–23.
Srinivasan V, Moghaddam S, Mukherji A. Mobileminer: mining your frequent patterns on your phone. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Seattle, WA, USA, 13–17 September, pp. 389–400. ACM, New York, USA. 201
Rasmussen C. The infinite gaussian mixture model. Adv Neural Inform Process Syst. 1999;12:554–60.
Ravi K, Ravi V. A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl Syst. 2015;89:14–46.
Rokach L. A survey of clustering algorithms. In: Data mining and knowledge discovery handbook, pages 269–298. Springer, 2010.
Safdar S, Zafar S, Zafar N, Khan NF. Machine learning based decision support systems (dss) for heart disease diagnosis: a review. Artif Intell Rev. 2018;50(4):597–623.
Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):1–25.
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015; pages 1–9.
Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In. IEEE symposium on computational intelligence for security and defense applications. IEEE. 2009;2009:1–6
Thornton, Wayne (2000). Applied Research Projects, Texas State University.
Jackson, D. D. Hypothesis testing and earthquake prediction (probability/significance/likelihood/simulation/Poisson). Southern California Earthquake Center, University of California, Los Angeles, CA 90095–1567.