
Road Accident Analysis Using Data Mining

 

 

 

Abstract— Globalization has affected many countries leading to increased
consumption of resources including vehicles causing traffic and road accidents.
The frequency of road accidents at different locations has to be identified in
order to reduce them, but the exponential growth and heterogeneous nature of
road accident data make it difficult to analyze the data and to identify
locations on the basis of accident frequency.


In this paper we use a road accident dataset of Mumbai for the years 2016-2017.
We apply clustering techniques to characterize locations as high-frequency,
moderate-frequency or low-frequency accident locations. To overcome the
heterogeneity of the data, data segmentation is used.

Our project proposes a framework which uses k-means clustering as its primary
technique for data segmentation. Trend analysis is then performed on each
cluster and on the entire data set to find trends in road accidents, which will
help to avoid road accidents in future. The results are mapped in our Google
Maps application, using the data overlay feature of Google Maps, to give
real-time safety notifications to users.

 

Keywords—Data Mining, Road
accident analysis, clustering, association rule mining

 

                                                                                                                                                              
I. Introduction

Road accidents are a major issue in Mumbai and cost many people their lives.
The reasons for road accidents include ignorance of traffic rules, bad road
conditions, and alcohol consumption before or while driving. To help avoid
accidents, locations must be identified on the basis of accident frequency,
which requires applying suitable data mining approaches to the collected data.

Data mining comprises many techniques, such as pre-processing, clustering,
association, prediction, and classification. To analyse the data we build a
framework whose first step is data pre-processing. Pre-processing mainly deals
with removing noise, handling missing values, and removing irrelevant
attributes, so that the data is ready for analysis.

Our project proposes a framework which uses clustering as its preliminary
technique for data segmentation. The objective of a clustering algorithm is to
divide the data into clusters, or groups, such that the objects within a
cluster are similar to each other while objects in different clusters are
dissimilar. Clustering does not depend on predefined classes. We use the
k-means clustering technique to partition the data set into separate groups.

 

                                                                                                                                                              
II. Data Mining

Data mining, the extraction of hidden
predictive information from large databases, is a powerful new technology with
great potential to help companies focus on the most important information in
their data warehouses. Data mining tools predict future trends and behaviours,
allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by data mining move beyond the analyses
of past events provided by retrospective tools typical of decision support
systems. Data mining tools can answer business questions that traditionally
were too time consuming to resolve. They scour databases for hidden patterns,
finding predictive information that experts may miss because it lies outside
their expectations.

 

A.    Clustering

·        A cluster is a subset of objects which are “similar.”

·        A subset of objects such that the distance between any two objects in
the cluster is less than the distance between any object in the cluster and any
object not located inside it.

·        A connected region of a multidimensional space containing a relatively
high density of objects.

 

Clustering is the grouping of a set of objects based on their characteristics,
aggregating them according to their similarities. In data mining, this
methodology partitions the data using a specific algorithm chosen to suit the
desired analysis.

In hard partitioning, each object either belongs to a cluster completely or
does not belong to it at all. In soft partitioning, on the other hand, every
object belongs to each cluster to a certain degree. More specific schemes are
also possible, such as allowing objects to belong to multiple clusters, forcing
each object into exactly one cluster, or building hierarchical trees of group
relationships.
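The difference between hard and soft partitioning can be illustrated with a
small sketch. The membership formula used here is the one from fuzzy c-means,
chosen only as one common example of soft partitioning; the function names, the
fuzzifier `m` and the sample coordinates are assumptions made for illustration.

```python
import math


def hard_assignment(point, centres):
    """Hard partitioning: index of the single nearest centre."""
    return min(range(len(centres)), key=lambda j: math.dist(point, centres[j]))


def soft_memberships(point, centres, m=2.0):
    """Soft partitioning: fuzzy c-means style degrees that sum to 1 (needs m > 1)."""
    dists = [math.dist(point, c) for c in centres]
    if 0.0 in dists:
        # The point coincides with one or more centres: share full membership among them
        hits = dists.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


# Hypothetical example: one point and two cluster centres
POINT = (1.0, 0.0)
CENTRES = [(0.0, 0.0), (4.0, 0.0)]
```

With these sample values, `hard_assignment(POINT, CENTRES)` picks the first
centre only, while `soft_memberships(POINT, CENTRES)` gives the point a large
degree of membership in the first cluster and a small one in the second.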

There are several ways to implement this partitioning, based on distinct
models. Different algorithms are applied to each model, which differentiates
its properties and results. The models are distinguished by how clusters are
organized and by the type of relationship between their members. The most
important ones are:

 

·        Centralized: each cluster is represented by a single mean vector, and
an object's value is compared to these mean values

·        Distributed: the cluster is built using statistical distributions

·        Connectivity: the connectivity in these models is based on a distance
function between elements

·        Group: algorithms have only group information

·        Graph: cluster organization and the relationships between members are
defined by a graph-linked structure

·        Density: members of the cluster are grouped by regions where
observations are dense and similar

 

                                                                                                                                   
III. Background and Motivation

Analysing a dataset of road traffic accidents to generate important rules and
to uncover hidden information about accidents will help reduce the number of
traffic accidents, because the necessary precautions can be taken on the basis
of that information. Clustering is used to process such a large and
heterogeneous dataset: it groups records of the same type into clusters.

 

                                                                                                                                               
IV. Review of Literature

Ela Etrunç et al. [1] used ArcMap and ArcGIS to analyse road accidents with
respect to factors such as type of intersection, month and year of the
accident, tourist season and time of day. They found that, among intersection
accidents, “four-way” intersections have the most accidents: 60% of total
accidents occurred at these intersections. 41 hotspots were identified in
Antalya province centre. The main advantage of their analysis system is that it
detects causes very quickly; its major disadvantage is that it focuses on
accidents at intersections, so other major causes of road accidents can be
missed.

Salvatore Cafiso et al. [2] used a fuzzy pattern recognition algorithm.
Accidents are classified according to their actual conditions and to rules on
factors such as vehicle, driver, road and environmental factors. The paper
helped to identify the factors causing road accidents and the effect the
environment has on them.

Addi Ait-Mlouk et al. [3] discussed various techniques for road accident
analysis and examined association rule mining, which helps to predict accidents
in advance and allows drivers to avoid dangers. Integrating association rules
within multi-criteria decision analysis contributes to a better understanding
of the dynamics of road accidents and can provide meaningful information to
help decision makers and logistics managers improve transport quality and road
safety. The major advantages of this paper are the mining and visualization of
association rules, the management of their interest level, and the reduction of
the large number of extracted rules.

Sachin Kumar et al. [4] used the k-means clustering algorithm to classify
locations as high-frequency, moderate-frequency or low-frequency accident
locations. They used the gap statistic to find the value of k for the k-means
algorithm. They then applied association rule mining to the clusters to find
important relationships and patterns among the accidents that occur in the
same cluster.

J. M. Manasa et al. [5] used spatial decision trees as the primary method to
retrieve important information from real-world accident data, and identified
trends of accidents at different locations. Using this data they also
identified accident hotspots in order to reduce the frequency of accidents in
future.

Ayushi Jain et al. [6] used clustering (k-means) to group similar objects in
heterogeneous data and classification (decision tree) to predict the causes of
accidents. Using cluster analysis they determined the areas with a higher
average number of accidents than others.

Eyad Abdullah et al. [7] present an application for storing, integrating and
analyzing traffic accident data using Mahout data mining as part of a big data
ecosystem. A very large, real data set of New York traffic collisions is the
data source for the application, which consists of several functions and web
services to analyze and visualize the major traffic accident information. The
application stores the massive traffic data on Hadoop, uses a parallel
computing framework based on the MapReduce technique for processing and mining,
and exposes the mining functions through a web services interface.

Liling Li et al. [8] applied statistics, association rule mining and
classification, and found that environmental factors such as roadway surface,
weather and light condition do not strongly affect the fatality rate, while
human factors such as drunk driving, and the collision type, have a stronger
effect. They used naive Bayes classification, whose advantage is that it
requires only a small amount of training data to estimate the parameters needed
for classification. They also used k-means clustering, which tends to find
clusters of comparable spatial extent, and the Apriori algorithm, which uses
the large itemset property, is easily parallelized and is easy to implement.

Suwarna Gothane et al. [9] evaluated attribute importance with an information
gain attribute evaluator to find out which factors are related to accidents,
and applied the Apriori technique, which relies on the property that all
nonempty subsets of a frequent itemset must also be frequent. Using support and
confidence measures in a level-wise approach they found the best rules for
their frequent patterns. The information gain attribute evaluator, which is
said to address the drawback of information gain, also helps to identify which
attribute is most relevant to the database. They also used the Weka data mining
tool, which has a comprehensive collection of data pre-processing and modelling
techniques.
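The support and confidence measures mentioned in this review can be computed as
in the sketch below. The transactions and attribute values are invented for
illustration and do not come from any of the reviewed datasets; the function
names are likewise hypothetical.

```python
# Hypothetical accident records, each reduced to a set of attribute values
TRANSACTIONS = [
    {"night", "rain", "fatal"},
    {"night", "fatal"},
    {"night", "rain"},
    {"day"},
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent).

    Assumes the antecedent occurs at least once, otherwise the ratio is undefined.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)
```

The Apriori property is visible in the sample data: the itemset
{night, fatal} cannot be more frequent than its subset {night}, so a
level-wise search can discard any candidate that has an infrequent subset.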

 

V. Development Methodology

A. Python

Python is the language in which the K-means clustering of our system is
implemented.

 

 

 

B. K-means Clustering Algorithm

K-means clustering is a type of unsupervised
learning, which is used when you have unlabeled data (i.e., data without
defined categories or groups). The goal of this algorithm is to find groups in
the data, with the number of groups represented by the variable K. The
algorithm works iteratively to assign each data point to one of K groups
based on the features that are provided. Data points are clustered based on
feature similarity. The results of the K-means clustering algorithm
are:

·        The centroids of the K clusters, which can be used to label new data

·        Labels for the training data (each data point is assigned to a single
cluster)

Rather than defining groups before looking at the data, clustering allows you
to find and analyze the groups that have formed organically. The number of
groups K has to be chosen before the algorithm is run.

Each centroid of a cluster is a
collection of feature values which define the resulting groups. Examining the
centroid feature weights can be used to qualitatively interpret what kind of
group each cluster represents.
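The iterative procedure described in this section can be sketched in plain
Python as follows. This is a minimal illustration of the standard assign and
update loop, not the implementation used in the project; the function names,
the random initialisation from data points and the sample points are
assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_labels == labels and new_centroids == centroids:
            break  # nothing changed, so the algorithm has converged
        labels, centroids = new_labels, new_centroids
    return centroids, labels


# Hypothetical two-feature data with two visibly separate groups
SAMPLE_POINTS = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
```

Calling `kmeans(SAMPLE_POINTS, 2)` returns the two outputs listed above: the
centroids of the clusters and one label per data point. A new point can be
labelled by finding its nearest returned centroid.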

 

VI. Proposed System

One of the key objectives in accident data analysis is to identify locations
on the basis of the frequency of road accidents. However, the heterogeneous
nature of road accident data makes the analysis difficult. The proposed system
uses a framework based on cluster analysis with the K-means algorithm. Using
cluster analysis as a preliminary task groups the data into homogeneous
segments. Trend analysis will also be performed to understand the accident
trends at a particular location at a particular time. The results of the
analysis will help us give users useful information about accidents and the
precautions to take. Our findings will then be mapped in our application, using
the data overlay feature of Google Maps, to give the user real-time
notifications about the safety of the current location.

 

Advantages of the proposed system:

1.       Gives real-time location updates to the user

2.       Gives real-time road safety notifications to the user

 

Figure 1 System architecture

 

VII. IMPLEMENTATION


 

A. Data preprocessing

Data preprocessing is one of the important tasks in data mining. It mainly
deals with removing noise, handling missing values and removing irrelevant
attributes in order to make the data ready for analysis. In this step, our aim
is to preprocess the accident data to make it appropriate for the analysis. The
data, which we obtained from the road traffic control headquarters of Mumbai,
was bilingual, so we translated it into English for further processing. After
translation we encode the dataset on the basis of time and location, which
helps in forming clusters based on the time and the location of each accident.
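The cleaning and encoding steps could look roughly like the sketch below. The
field names (`location`, `time`, `remarks`), the sample rows and the three-hour
time slots are assumptions for illustration; the actual schema of the Mumbai
dataset is not given here.

```python
from typing import Optional

# Hypothetical raw records; the real column names are not known
RAW_RECORDS = [
    {"location": "Andheri", "time": "21:15", "remarks": "n/a"},
    {"location": "andheri ", "time": "10:40", "remarks": ""},
    {"location": "Dadar", "time": "", "remarks": "x"},    # missing time
    {"location": "", "time": "08:05", "remarks": "y"},    # missing location
    {"location": "Dadar", "time": "20:30", "remarks": "z"},
]

IRRELEVANT = {"remarks"}  # assumed irrelevant attribute
SLOT_HOURS = 3            # assumed width of a time slot, in hours


def time_slot(hhmm: str) -> Optional[int]:
    """Map 'HH:MM' to a slot index (0..7 for 3-hour slots), or None if invalid."""
    try:
        hour = int(hhmm.split(":")[0])
    except ValueError:
        return None
    return hour // SLOT_HOURS if 0 <= hour < 24 else None


def preprocess(records):
    """Drop irrelevant attributes and incomplete rows, then encode time and location."""
    cleaned = []
    for rec in records:
        rec = {k: v for k, v in rec.items() if k not in IRRELEVANT}
        location = rec.get("location", "").strip().title()  # normalise spelling noise
        slot = time_slot(rec.get("time", ""))
        if not location or slot is None:
            continue  # handle missing values by discarding the row
        cleaned.append({"location": location, "slot": slot})
    # Encode each distinct location name as an integer code
    names = sorted({r["location"] for r in cleaned})
    codes = {name: i for i, name in enumerate(names)}
    for r in cleaned:
        r["location_code"] = codes[r["location"]]
    return cleaned, codes
```

Discarding incomplete rows is only one way to handle missing values; imputing
them is an alternative when the dataset is small.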

 

Fig. 7.1. Data Preprocessing

Fig. 7.2. Count3 graphical representation

 

 

C. Clustering

After the data is preprocessed, a clustering algorithm is applied to it to
form clusters. Clustering forms groups of data with similar attributes, which
helps in characterizing the data. In our system we use the K-means clustering
algorithm, implemented in Python, to generate clusters of the dataset.

With the help of clustering we can form clusters on the basis of the time and
location of accidents. This helps us find not only the frequency of road
accidents at a particular location, but also the frequency at a particular
location at a particular time. For example, location A may be a low-frequency
accident location between 10:00 and 13:00 and a high-frequency accident
location between 20:00 and 22:00.
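One way to obtain such per-location, per-time frequency levels is sketched
below: accidents are counted for every (location, time slot) pair and the
counts are grouped into three levels with a one-dimensional k-means. The
records, the three-hour slots, the initialisation and the function names are
assumptions for illustration, not the project's actual procedure.

```python
from collections import Counter

# Hypothetical preprocessed records: (location, hour of day)
RECORDS = (
    [("A", 21)] * 9 + [("A", 11)] * 1
    + [("B", 21)] * 5 + [("B", 11)] * 4
    + [("C", 21)] * 10 + [("C", 11)] * 2
)

LEVELS = ["low", "moderate", "high"]


def cluster_1d(values, k=3, max_iter=50):
    """Plain 1-D k-means (k >= 2); returns one label in 0..k-1 per value."""
    ordered = sorted(values)
    # Spread the initial centroids over the sorted values, lowest first,
    # so that label 0 corresponds to the smallest counts in this example
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    return labels


def frequency_levels(records):
    """Label every (location, 3-hour slot) pair as low, moderate or high frequency."""
    counts = Counter((loc, hour // 3) for loc, hour in records)
    keys = list(counts)
    labels = cluster_1d([counts[key] for key in keys], k=3)
    return {key: LEVELS[lab] for key, lab in zip(keys, labels)}
```

On the sample records, `frequency_levels(RECORDS)` marks location A as low
frequency in the 09:00 to 12:00 slot and high frequency in the 21:00 to 24:00
slot, which mirrors the example given in the text.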

 

D. Android Application Development

For developing our application we are using Android Studio 3.0. Android Studio
is the official integrated development environment for Google’s Android
operating system, built on JetBrains’ IntelliJ IDEA software and designed
specifically for Android development.

The application we are developing is a basic Google Maps application with
login and registration facilities for the user. The results obtained by
clustering and trend analysis will be mapped in our application, which will
help in identifying different categories of accident-prone locations.

The application will track the user's movement on the map in real time. The
user will also be able to enter trip information, such as the start and end
locations of the trip, in order to be notified beforehand about accident-prone
locations they will have to cross during the trip. This will help the rider
take precautions beforehand to avoid accidents.
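The check behind such a notification can be sketched as a distance test
between points of the planned route and the known accident-prone locations.
The sketch is written in Python for consistency with the clustering code, not
as Android code; the coordinates, hotspot names, the 0.5 km radius and the
function names are all made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def hotspots_on_route(route, hotspots, radius_km=0.5):
    """Names of hotspots within radius_km of any route point, in order of first encounter."""
    found = []
    for point in route:
        for name, position in hotspots.items():
            if name not in found and haversine_km(point, position) <= radius_km:
                found.append(name)
    return found


# Hypothetical hotspot positions and route points (not real accident locations)
HOTSPOTS = {"H1": (19.0, 72.8), "H2": (19.5, 73.5)}
ROUTE = [(19.001, 72.8), (19.05, 72.85)]
```

Here `hotspots_on_route(ROUTE, HOTSPOTS)` reports only `H1`, because the first
route point lies roughly a hundred metres from it while `H2` is far from every
route point.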

 

E. Trend Analysis

For every cluster and for the Entire Data Set (EDS), we performed a trend
analysis on the monthly road accident counts.
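A simple form of such a trend analysis is sketched below: accidents are counted
per month and a least-squares line is fitted to the monthly counts, so that the
sign of the slope indicates a rising or falling trend. The text does not state
which trend method was used, so the linear fit, the function names and the
sample data are assumptions.

```python
from collections import Counter


def monthly_counts(months):
    """Accident counts for months 1..12, including months without accidents."""
    counts = Counter(months)
    return [counts.get(m, 0) for m in range(1, 13)]


def linear_trend(series):
    """Least-squares slope of the series against its index (change per month)."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Hypothetical month-of-accident values for one cluster
CLUSTER_MONTHS = [1, 1, 3, 6, 6, 6, 12]
```

The same two functions can be applied to each cluster and to the Entire Data
Set, and the resulting slopes compared.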

 

F. Application Modules

We will build a website where the user can register, log in, search and filter
locations, view road accident analysis and view precautions. The website will
be built using HTML, CSS, Material Design for Bootstrap, JavaScript and PHP.

1.      Register

In this module the user will be able to register on our website by entering
his details.

2.      Login

In this module the user can log in to our website using the information he
used for registration.

3.      Get real-time location updates

The user will be able to see his current location displayed on the map.

4.      Get real-time road safety updates

The user will get real-time updates based on the accident frequency of the
location.

Figure 2 Map displayed in our application

 

 

 

 

VIII. SUMMARY

Road accidents are a major issue and cause thousands of people to lose their
lives daily. To determine the reasons for road accidents, many authors have
implemented different techniques such as fuzzy models and the k-means
algorithm. In our project we use the k-means algorithm to cluster road
accidents on the basis of time and location, and we perform trend analysis on
the clusters and on the Entire Data Set to find trends in road accidents over
time. The results obtained by clustering and trend analysis will be mapped in
our application to give users real-time information on the safety of the
location they are about to enter.

 

References

 

[1] E. Etrunç, Ö. Mutluoglu and T. Çay, “Intersection road accident analysis
using geographical Information Systems: Antalya (Turkey) example,” Baku,
Azerbaijan, 2014.

[2] S. Cafiso, G. L. Cava and V. Cutello, “A fuzzy model for road accidents
analysis,” in IEEE, New York, NY, USA, 1999.

[3] A. Ait-Mlouk, F. Gharnati and T. Agout, “An improved approach for
association rule mining using a multi-criteria decision support system: a case
study in road safety,” European Transport Research Review, 2017.

[4] S. Kumar and D. Toshniwal, “A data mining approach to characterize road
accident locations,” Journal of Modern Transportation, vol. 24, no. 1,
pp. 62-72, 2016.

[5] J. M. Manasa et al., “Spatial decision tree for accident data analysis,”
in IEEE, Gwalior, India, 2014.

[6] A. Jain, G. Ahuja, Anuranjana and D. Mehrotra, “Data mining approach to
analyse the road accidents in India,” in IEEE, Noida, India, 2016.

[7] E. Abdullah and A. Emam, “Traffic Accidents Analyzer Using Big Data,” in
International Conference on Computational Science and Computational
Intelligence, Las Vegas, NV, USA, 2016.

[8] L. Li, S. Shrestha and G. Hu, “Analysis of road traffic fatal accidents
using data mining techniques,” in Software Engineering Research, Management
and Applications (SERA), IEEE, London, UK, 2017.

[9] S. Gothane and M. Sarode, “Analyzing Factors, Construction of Dataset,
Estimating Importance of Factor, and Generation of Association Rules for Indian
Road Accident,” in Advanced Computing (IACC), 2016 IEEE 6th International
Conference, Bhimavaram, India, 2016.