Road Accident Analysis Using Data Mining

Abstract: Globalization has affected many countries, leading to increased consumption of resources, including vehicles, which in turn causes more traffic and more road accidents. To reduce accidents, the frequency of road accidents at different locations has to be identified. However, because of the exponential increase in road accident data and its heterogeneous nature, it is difficult to analyze the data and to characterize locations by accident frequency. In this paper we use a road accident dataset for Mumbai covering the years 2016-2017. We apply clustering techniques to characterize locations as high-frequency, moderate-frequency, or low-frequency accident locations. To overcome the heterogeneity of the data, we rely on data segmentation. Our project proposes a framework which uses k-means clustering as its primary technique for data segmentation.
Further, trend analysis will be performed on each cluster and on the entire data set to find trends in road accidents, which will help to avoid road accidents in the future. The results are then mapped in our Google Maps application, using the data overlay feature of Google Maps, to give real-time safety notifications to users.

Keywords: Data Mining, Road accident analysis, clustering, association rule mining

I. Introduction

Road accidents are now a major issue in Mumbai, and many people lose their lives because of them. The reasons for road accidents include ignorance of traffic rules, bad road conditions, and alcohol consumption before or while driving. To avoid accidents, it is necessary to identify accident locations on the basis of accident frequency, and suitable data mining approaches have to be applied to the collected data for that purpose. Data mining comprises many techniques, such as pre-processing, clustering, association, prediction, classification, and so on.
To analyse the data, we will build a framework. Data pre-processing is one of the tasks of data mining. It mainly deals with removing noise, handling missing values, and removing irrelevant attributes in order to make the data ready for analysis. In this step, our aim is to pre-process the data so that it is suitable for the analysis.
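The three pre-processing tasks named above can be illustrated with a minimal sketch. The records and field names (`location`, `time`, `vehicle`, `officer_note`) are hypothetical and do not reflect the actual schema of the Mumbai dataset; the rules for what counts as noise are likewise assumptions made for the example.

```python
# Hypothetical raw records; field names are illustrative, not the real schema.
RAW_RECORDS = [
    {"location": " Andheri ", "time": "20:15", "vehicle": "car", "officer_note": "n/a"},
    {"location": "Dadar", "time": "", "vehicle": "bike", "officer_note": "n/a"},
    {"location": "andheri", "time": "08:40", "vehicle": "", "officer_note": "n/a"},
    {"location": "Dadar", "time": "25:99", "vehicle": "bus", "officer_note": "n/a"},
]

# Attributes assumed to be relevant for the analysis; everything else is dropped.
RELEVANT = ("location", "time", "vehicle")


def is_valid_time(text):
    """Return True for an HH:MM string with a plausible hour and minute."""
    parts = text.split(":")
    if len(parts) != 2 or not all(part.isdigit() for part in parts):
        return False
    hour, minute = int(parts[0]), int(parts[1])
    return 0 <= hour < 24 and 0 <= minute < 60


def preprocess(records):
    """Remove irrelevant attributes, normalise text, and handle missing or noisy values."""
    cleaned = []
    for record in records:
        # Keep only the relevant attributes and trim stray whitespace.
        row = {key: record.get(key, "").strip() for key in RELEVANT}
        row["location"] = row["location"].title()
        # Records without a usable location or time cannot be analysed: drop them.
        if not row["location"] or not is_valid_time(row["time"]):
            continue
        # A missing vehicle type is tolerated and filled with a placeholder.
        row["vehicle"] = row["vehicle"] or "unknown"
        cleaned.append(row)
    return cleaned


CLEANED = preprocess(RAW_RECORDS)
for row in CLEANED:
    print(row)
```

Dropping records with an unusable time while filling a missing vehicle type is one possible policy; which attributes may be imputed and which are mandatory depends on the analysis that follows.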
Our project proposes a framework which uses clustering as its preliminary technique for data segmentation. The objective of a clustering algorithm is to divide the data into clusters or groups such that the objects within a group are similar to each other, whereas objects in different clusters are dissimilar. Clustering does not depend on predefined classes. We will use clustering to partition the data set into distinct groups, and for that we use the k-means clustering technique.

II. Data Mining

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions.
The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

A. Clustering

A cluster can be defined in several ways:
· A cluster is a subset of objects which are “similar.”
· A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it.
· A connected region of a multidimensional space containing a relatively high density of objects.

Clustering is the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities.
In data mining, this methodology partitions the data using a specific joining algorithm, whichever is most suitable for the desired analysis.

Cluster analysis can require that an object either belongs strictly to a cluster or does not belong to it at all; this type of grouping is called hard partitioning. Soft partitioning, on the other hand, states that every object belongs to each cluster to a certain degree. More specific divisions are also possible, such as allowing objects to belong to multiple clusters, forcing an object to participate in only one cluster, or constructing hierarchical trees of group relationships.

There are several different ways to implement this partitioning, based on distinct models. A distinct algorithm is applied to each model, and the models differ in their properties and results.
These models are distinguished by their organization and by the type of relationship between their members. The most important ones are:
· Centralized - each cluster is represented by a single mean vector, and an object's value is compared to these mean values
· Distributed - the cluster is built using statistical distributions
· Connectivity - the connectivity in these models is based on a distance function between elements
· Group - algorithms have only group information
· Graph - cluster organization and the relationship between members are defined by a graph-linked structure
· Density - members of the cluster are grouped by regions where observations are dense and similar

III. Background and Motivation

Analysing a dataset of road traffic accidents to generate important rules and to uncover hidden information about accidents will help reduce the number of traffic accidents, because the necessary precautions can be taken on the basis of the information generated from the dataset. To process such a large and heterogeneous dataset, clustering is used. Clustering groups records of the same type into separate clusters.

IV. Review of Literature

Ela Etrunç et al. [1] used ArcMap and ArcGIS to analyse road accidents, considering factors such as type of intersection, month and year of the accident, tourist seasons, and time of day. Through their application they found that, among intersection accidents, the intersections with the most accidents are “four-way” intersections. 60% of the total accidents occurred at these intersections.
41 hotspots were identified in the Antalya province centre. The main advantage of the analysis system built in that paper is that it detects causes very quickly, while its major disadvantage is that it mainly focuses on road accidents occurring at intersections, so other major causes of road accidents can be missed.

Salvatore Cafiso et al. [2] used a fuzzy pattern recognition algorithm. In their paper, accidents are classified according to their actual conditions and to rules over factors such as vehicle factors, driver factors, road factors, and environmental factors. The paper helped to identify the factors causing road accidents and the effect the environment has on them.

Addi Ait-Mlouk et al. [3] discussed various techniques for road accident analysis and examined association rule mining, which helps to predict accidents in advance and allows drivers to avoid the dangers. The integration of the association rules technique within multi-criteria decision analysis contributes to a better understanding of the dynamics of road accidents and can provide meaningful information to help decision makers and logistics managers improve performance in terms of transport quality and road safety optimization. The major advantages of this paper are:
· Mining and visualization of association rules
· Management of the interest level of association rules
· Reduction of the large number of extracted rules

Sachin Kumar et al. [4] used the k-means clustering algorithm to classify locations as high-frequency, moderate-frequency, or low-frequency accident locations. They used the gap statistic to find the value of k for the k-means algorithm. They then applied association rule mining to the clusters to reveal important relationships and patterns between accidents that occur in the same cluster.

In the paper by J. M. Manasa et al. [5], spatial decision trees are used as the primary method to retrieve important information from real-world accident data; various trends of accidents at different locations were also identified. Using this data, they also identified accident hotspots in order to reduce the frequency of accidents in the future.

In the paper by Ayushi Jain et al. [6], clustering (k-means) is used to group similar objects in heterogeneous data, and classification (decision tree) is used to predict the causes of accidents. Using cluster analysis, they determine the areas with a higher average number of accidents than others.
The study by Eyad Abdullah et al. [7] presents an application tool for storing, integrating, and analyzing traffic accident data using Mahout data mining as part of a big data ecosystem. A very large, real traffic data set, New York's traffic collisions dataset, is used as the data source for the application. The application consists of several functions and web services to analyze and visualize the major traffic accident information. It stores the massive traffic data on Hadoop, with a parallel computing framework for processing and mining based on the MapReduce technique, and then uses a web services interface to support the mining application.

In the paper by Liling Li et al. [8], statistics, association rule mining, and classification show that environmental factors such as roadway surface, weather, and light condition do not strongly affect the fatality rate, while human factors, such as whether the driver was drunk, and the collision type have a stronger effect on it. They used the naive Bayes classification technique, whose advantage is that it requires only a small amount of training data to estimate the parameters necessary for classification. They also used k-means clustering, which tends to find clusters of comparable spatial extent. The Apriori algorithm is used as well; it relies on the large itemset property, is easily parallelized, and is easy to implement.

In the paper by Suwarna Gothane et al. [9], attribute importance is evaluated with the information gain attribute evaluator to learn which factors are accident-oriented, and the Apriori technique is applied, relying on the property that all nonempty subsets of a frequent itemset must also be frequent. Using support and confidence measures in a level-wise approach, they found the best rules for their frequent patterns.
They used the information gain attribute evaluator, which addresses the drawback of information gain and helps to identify which attribute is most relevant to the database. They also used the Weka data mining tool, which has a comprehensive collection of data pre-processing and modelling techniques.

V. Development Methodology

A. Python

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve.
They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

B. K-means Clustering Algorithm

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided.
Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
· The centroids of the K clusters, which can be used to label new data
· Labels for the training data (each data point is assigned to a single cluster)

Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The number of groups, K, has to be chosen before the algorithm is run. Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents.
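The iterative procedure described in this section can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the code used in the project: the sample points are invented, and the initial centroids are passed in explicitly to keep the run reproducible, whereas in practice they are usually picked at random.

```python
import math


def assign(point, centroids):
    """Return the index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [assign(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                # An empty cluster keeps its previous centroid.
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Hypothetical two-dimensional feature vectors forming two visible groups.
POINTS = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
CENTROIDS, LABELS = kmeans(POINTS, initial_centroids=POINTS[:2])

print(LABELS)
# The returned centroids can label data that was not part of the training set.
print(assign((7.5, 8.5), CENTROIDS))
```

The function returns both results listed above: the centroids of the K clusters and one label per training point. K is implied by the number of initial centroids supplied.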
VI. Proposed System

One of the key objectives in accident data analysis is to identify locations on the basis of the frequency of road accidents. However, the heterogeneous nature of road accident data makes this analysis difficult. In the proposed system we use a framework based on cluster analysis with the K-means algorithm. Using cluster analysis as a preliminary task groups the data into homogeneous segments. Trend analysis will also be performed to understand the accident trends at a particular location at a particular time. The results of the analysis will help us provide users with useful information about accidents and with the necessary precautions. Our findings will then be mapped in our application, using the data overlay feature of Google Maps, to give the user real-time notifications about the safety of the current location.
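As a rough illustration of what the trend analysis step could compute, the sketch below counts accidents per hour of day and per month. The timestamps and their format are hypothetical; the paper does not specify how trends are derived, so simple frequency counts are assumed here.

```python
from collections import Counter
from datetime import datetime

# Hypothetical accident timestamps; a real run would read them from the
# preprocessed dataset (or from a single cluster of it).
ACCIDENT_TIMES = [
    "2016-01-14 08:30", "2016-01-20 21:10", "2016-02-03 20:45",
    "2016-02-17 21:55", "2016-07-09 20:05", "2017-01-11 09:15",
]
TIME_FORMAT = "%Y-%m-%d %H:%M"


def hourly_trend(timestamps):
    """Count accidents per hour of day, sorted by hour."""
    counts = Counter(datetime.strptime(t, TIME_FORMAT).hour for t in timestamps)
    return dict(sorted(counts.items()))


def monthly_trend(timestamps):
    """Count accidents per calendar month (YYYY-MM), sorted chronologically."""
    counts = Counter(datetime.strptime(t, TIME_FORMAT).strftime("%Y-%m") for t in timestamps)
    return dict(sorted(counts.items()))


HOURLY = hourly_trend(ACCIDENT_TIMES)
MONTHLY = monthly_trend(ACCIDENT_TIMES)
print(HOURLY)
print(MONTHLY)
```

Running the same two functions on the records of a single cluster, instead of the entire data set, gives the per-cluster trends.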
Advantages of the proposed system:
1. Gives real-time location updates to the user.
2. Gives real-time road safety notifications to the user.

Figure 1. System architecture

VII. IMPLEMENTATION

Clustering is the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities.
In data mining, this methodology partitions the data using a specific joining algorithm, whichever is most suitable for the desired analysis.

A. Data preprocessing

Data preprocessing is one of the important tasks in data mining. It mainly deals with removing noise, handling missing values, and removing irrelevant attributes in order to make the data ready for analysis. In this step, our aim is to preprocess the accident data to make it suitable for the analysis. The data, which we obtained from the road traffic control headquarters of Mumbai, was bilingual, so we translated it into English for further processing. After translation, we encode the dataset on the basis of time and location, which helps us form clusters on the basis of the time and the location of each accident.

Fig. 7.1. Data Preprocessing

Fig. 7.2. Count3 graphical representation

C. Clustering

After the data is preprocessed, a clustering algorithm is applied to it to form clusters. Clustering forms groups of data with similar attributes, which helps to characterize the data. In our system we use the K-means clustering algorithm, implemented in Python, to generate clusters from the dataset. With the help of clustering we can form clusters on the basis of the time and location of accidents. This lets us find not only the frequency of road accidents at a particular location, but also the frequency of road accidents at a particular location at a particular time. For example, location A may be a low-frequency accident location between 10:00 and 13:00 and a high-frequency accident location between 20:00 and 22:00.
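The idea of characterizing a location at a particular time can be sketched as follows. Accidents are encoded as (location, time slot) pairs, counted, and the counts are grouped into three clusters with a one-dimensional k-means. The records, the 4-hour slot width, and the choice of initial centroids (minimum, midpoint, maximum) are assumptions made for this example, not details taken from the project.

```python
from collections import Counter

# Hypothetical (location, hour of day) records.
ACCIDENTS = (
    [("A", 10)] * 1 + [("A", 21)] * 9
    + [("B", 11)] * 5 + [("B", 20)] * 4
    + [("C", 12)] * 2 + [("C", 21)] * 10
)


def time_slot(hour, width=4):
    """Encode an hour (0-23) as a fixed-width time slot label."""
    start = (hour // width) * width
    return f"{start:02d}:00-{start + width:02d}:00"


def count_by_location_and_slot(records):
    """Count accidents for every (location, time slot) pair."""
    return Counter((location, time_slot(hour)) for location, hour in records)


def label_frequencies(counts, max_iterations=50):
    """Cluster the counts into three groups with 1-D k-means and name them."""
    values = list(counts.values())
    low, high = min(values), max(values)
    # Sorted initial centroids keep the clusters ordered low < moderate < high.
    centroids = [low, (low + high) / 2, high]
    names = ["low", "moderate", "high"]
    assignment = {}
    for _ in range(max_iterations):
        assignment = {
            key: min(range(3), key=lambda c: abs(value - centroids[c]))
            for key, value in counts.items()
        }
        updated = []
        for c in range(3):
            members = [counts[key] for key, idx in assignment.items() if idx == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return {key: names[idx] for key, idx in assignment.items()}


COUNTS = count_by_location_and_slot(ACCIDENTS)
FREQUENCY = label_frequencies(COUNTS)
for key in sorted(FREQUENCY):
    print(key, COUNTS[key], FREQUENCY[key])
```

With this sample data, location A comes out as low frequency in its morning slot and high frequency in its evening slot, which mirrors the example given in the text.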
D. Android Application Development

To develop our application we use Android Studio 3.0. Android Studio is the official integrated development environment for Google's Android operating system, built on JetBrains' IntelliJ IDEA software and designed specifically for Android development. The application we are developing is a basic Google Maps application with login and registration facilities for the user. The results obtained by clustering and trend analysis will be mapped in our application, which will help in identifying different categories of accident-prone locations. The application will be able to track the user's movement on the map in real time. The user will also be able to enter trip information, such as the start and end locations of the trip, in order to be notified beforehand about accident-prone locations that they will cross during the trip. This will help the rider take precautions beforehand to avoid accidents.

E.

1. Register
In this module the user will be able to register on our website by entering their details.

2. Login
In this module the user can log in to our website using the information provided during registration.

3. Get real-time location updates
The user will be able to see their current location displayed on the map.

4. Get real-time road safety updates
The user will get real-time updates based on the accident frequency of the location.
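The logic behind the road safety updates could look roughly like the sketch below, written in Python for consistency with the analysis code rather than in the Android application's own language. It checks whether any waypoint of a trip passes close to a location that the clustering step labelled as accident-prone. The coordinates, the location names, the 0.5 km radius, and the rule of skipping low-frequency locations are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Hypothetical output of the clustering step: positions with a frequency label.
LOCATIONS = [
    {"name": "Location A", "position": (19.1197, 72.8464), "frequency": "high"},
    {"name": "Location B", "position": (19.0178, 72.8478), "frequency": "low"},
]


def warnings_for_route(route, locations, radius_km=0.5):
    """List a warning for every accident-prone location near the route."""
    alerts = []
    for spot in locations:
        if spot["frequency"] == "low":
            continue  # Assumption: low-frequency locations do not trigger a notification.
        if any(haversine_km(point, spot["position"]) <= radius_km for point in route):
            alerts.append(f"{spot['name']}: {spot['frequency']} accident frequency ahead")
    return alerts


# Hypothetical trip, given as a list of waypoints from start to end.
ROUTE = [(19.1300, 72.8400), (19.1200, 72.8460), (19.1000, 72.8500)]
ALERTS = warnings_for_route(ROUTE, LOCATIONS)
print(ALERTS)
```

Checking only the waypoints is a simplification; a route between two distant waypoints could pass a location without any waypoint falling inside the radius.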
Figure 2. Displaying the map in our application

VIII. SUMMARY

Road accidents are a major issue, causing thousands of people to lose their lives daily. To determine the causes of road accidents, many authors have implemented different techniques, such as fuzzy models and the k-means algorithm. In our project we use the k-means algorithm, which helps to cluster road accidents on the basis of time and location. We also perform trend analysis on the clusters and on the entire data set in order to find trends in road accidents over time. The results obtained by clustering and trend analysis will be mapped in our application to give users real-time information on the safety of the location they are about to enter.

References

[1] Ela Etrunç, Ömer Mutluoglu, Tayfun Çay, “Intersection road accident analysis using geographical Information Systems: Antalya (Turkey) example,” Baku, Azerbaijan, 2014.
[2] S. Cafiso, G. L. Cava and V. Cutello, “A fuzzy model for road accidents analysis,” in IEEE, New York, NY, USA, 1999.
[3] A. Ait-Mlouk, F. Gharnati and T. Agout, “An improved approach for association rule mining using a multi-criteria decision support system: a case study in road safety,” European Transport Research Review, 2017.
[4] S. Kumar and D. Toshniwal, “A data mining approach to characterize road accident locations,” Journal of Modern Transportation, vol. 24, no. 1, pp. 62-72, 2016.
[5] S. B. S. K. G. a. S. M. J. M. Manasa, “Spatial decision tree for accident data analysis,” in IEEE, Gwalior, India, 2014.
[6] A. Jain, G. Ahuja, Anuranjana and D. Mehrotra, “Data mining approach to analyse the road accidents in India,” in IEEE, Noida, India, 2016.
[7] E. Abdullah and A. Emam, “Traffic Accidents Analyzer Using Big Data,” in International Conference on Computational Science and Computational Intelligence, Las Vegas, NV, USA, 2016.
[8] L. Li, S. Shrestha and G. Hu, “Analysis of road traffic fatal accidents using data mining techniques,” in Software Engineering Research, Management and Applications (SERA), IEEE, London, UK, 2017.
[9] S. Gothane and M. Sarode, “Analyzing Factors, Construction of Dataset, Estimating Importance of Factor, and Generation of Association Rules for Indian Road Accident,” in Advanced Computing (IACC), 2016 IEEE 6th International Conference, Bhimavaram, India, 2016.