Statistical Learning for Driving Behaviour Profiling

By Roberta Siciliano and Giulia Vannucci

The vision of the i4Driving Project is to lay the foundation for a new industry-standard methodology to establish a credible and realistic human road safety baseline for virtual assessment of Connected, Cooperative and Automated Mobility (CCAM) systems. How? By delivering a new library of credible models of heterogeneous human driver behaviours. A new library doesn’t mean new models, but rather a combination of models that are suitable and valid for both scenario-based and traffic-based safety assessments, which bring the complexity of the road traffic system and the heterogeneity of human driving behaviours into simulation.

Sufficient system complexity is needed to conduct robust and meaningful analyses of road safety. Sufficient heterogeneity is necessary to mimic the diversity of human driving behaviours and effectively drive the occurrence of both ‘uncritical’ and safety-critical situations (e.g. accidents) in daily traffic. In this respect, the scientific challenge in i4Driving is to capture the relevant behavioural mechanisms for safety assessments.

i4Driving’s first innovation, and the focus of this blog, is to develop state-of-the-art data mining techniques to unveil patterns and formulate plausible hypotheses from Naturalistic Driving Studies (NDS) and Driving Simulation Experiments (DSE) data related to human (and external) factors and driving behaviours, which will be used to identify model requirements.

What is Driving Behaviour Statistical Analysis?

Driving Behaviour Statistical Analysis provides a means to gather evidence on external and human factors influencing driving behaviour from data. The specific goal is to identify causal relationships among external and human factors and safety-critical driver behaviour at the level of specific driving situations. This involves, for example, analysing whether factors such as gender, cultural and ethnic background, ageing, impairments, driving experience and route familiarity, mental workload or fatigue, weather and lighting conditions are statistically correlated with safety-critical driving behaviours.

Firstly, it is important to note that there are two types of human factors:

  • Global Human Factors – concerning the driver characteristics, such as demographics, visual, perceptual, and cognitive capabilities, physical/health and psychological conditions; and
  • Local Human Factors – concerning the driver’s operating conditions, any secondary task, manoeuvre judgement, and driving behaviours such as turning left or right, signal violations, drowsiness, speeding, etc.

Let us take as an example ‘event severity’ as target variable with two response classes, namely “Crash” and “No-Crash”. When considering which contextual and human factors explain the driving behaviour profiles which result in a “crash” rather than a “no-crash”, questions to ask include:

  • What is the impact of a safety-critical driving behaviour (defined as “a type of action or driving manoeuvre that the subject vehicle driver was engaged in prior to, or at the time of (2 to 6 seconds), the Precipitating Event”, i.e. “crash” or “near crash”)?
  • What is the role played by external (contextual) factors, for example the weather conditions, traffic density, type of road (intersection, curved, etc.), the traffic flow, road obstructions, etc.?
  • What are the driving behaviour profiles (human factors) leading to the highest probability of a “crash” rather than “near crash” event?

Statistical Learning and Data Mining offer a vast set of methods to understand the data. Here, a distinction must be made between unsupervised and supervised learning models:

  • Unsupervised learning (discovery and exploration) techniques help to understand relationships, patterns, associations and correlation in the data where all factors of interest play the same role in the analysis.
  • Supervised learning uses an inductive approach, with one variable serving as the target to be explained by the other variables, referred to as predictors or explanatory variables. Two goals can be achieved: one is modelling, to understand the causal relationship or dependence of the target on the set of predictors; the other is prediction, to assign a response class or value to “new” data based on the predictors’ measurements.

Unsupervised learning techniques

Unsupervised learning for exploratory data analysis helps us explore data patterns. It can be performed with factorial methods (also known as dimensionality reduction methods: machine learning (ML) or statistical techniques that reduce the number of random variables in a problem by obtaining a set of principal variables), such as Multiple Correspondence Analysis, which is ideal for categorical data. It decomposes the overall association, known as Chi-Square inertia, along factorial axes, making it possible to visualise the association among all categories of different variables and the patterns of categories of the same variable. The target variable is used as a supplementary variable, associated ex post with the typologies of driving behaviours in the factorial representation. The outcome is the detection of different patterns of accidents, namely those related to parking manoeuvres, those related to traffic flow with few lanes and interstate roads, those related to intersections and pre-incident manoeuvres, and those related to driver behaviour involving violations or erroneous manoeuvre judgement.
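
To make the idea of Chi-Square inertia concrete, here is a minimal sketch for the simplest case of two categorical variables. It only computes the total inertia (the chi-square statistic divided by the number of events) that a correspondence analysis would then decompose along factorial axes; it does not perform the decomposition itself. The function name and the counts are hypothetical and are not taken from the project data.

```python
# Hypothetical cross-tabulation of driving events (illustrative counts only):
# rows = manoeuvre type (e.g. parking, intersection),
# columns = event severity ("Crash", "No-Crash").
def total_inertia(table):
    """Return (chi_square, inertia) for a two-way table of counts.

    Assumes every row and every column has a non-zero total.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    # Total inertia is the chi-square statistic scaled by the sample size.
    return chi_square, chi_square / n


table = [[30, 10], [10, 30]]
chi_square, inertia = total_inertia(table)
print(f"chi-square = {chi_square:.2f}, total inertia = {inertia:.3f}")
```

An inertia of zero would mean the categories are not associated at all; the larger the inertia, the more structure there is for the factorial axes to display.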

Supervised learning techniques

Supervised learning approaches, on the other hand, can be very useful to build up a scenario analysis. Tree-based methods (models which use a decision tree to represent how different input variables can be used to predict a target value) are among the most popular statistical (machine) learning models, specifically Classification Trees for modelling and Random Forests for prediction. The response variable could be the “Event Severity”, with the two target (label) classes “Crash” and “No-Crash”, and the contextual and local human factors could be used as predictors.

Figure 1 Example of classification tree structure.

Classification Trees are interpretable (white box) machine learning models where the output is the tree structure, which is easy to interpret. The method is non-parametric and distribution-free: it makes no probability assumptions. A classification tree is built by recursive binary segmentation, or partitioning, of the units (the driving events) into two child subgroups based on the best split of the predictors’ modalities (known as intelligent questions), chosen to maximise the reduction in the impurity, or heterogeneity, of the target. Children are always better than their parent! Impurity can be measured with the Gini heterogeneity index, the entropy measure, or the error rate. The set of terminal nodes of the tree is a partition of the starting set of driving events into internally homogeneous groups, each with a target label class. A node is most homogeneous if all its units belong to one target class, and most heterogeneous if its units are divided 50% and 50% between the two target classes. The units of a terminal (leaf) node are labelled with the majority target class, with an error rate equal to the percentage of misclassified units. An example of a classification tree structure is depicted in Figure 1. For instance, if a leaf includes 80% of units of class “Crash” and 20% of class “No-Crash”, then the terminal node is labelled “Crash” with an error rate equal to 20%. The overall error rate averages the leaves’ error rates, each weighted by the proportion of driving events falling into the leaf. The classification tree makes it possible to identify a set of if-then production rules, or paths, leading to the response class “Crash” and a set leading to the response class “No-Crash”.
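
The impurity measures, the leaf labelling and the weighted overall error rate described above can be sketched in a few lines. This is an illustrative sketch rather than the project's implementation: the function names and the leaf counts are hypothetical, chosen so that the first leaf reproduces the 80%/20% example.

```python
import math


def gini(proportions):
    """Gini heterogeneity index: 0 for a pure node, 0.5 for a 50/50 split."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Entropy in bits: 0 for a pure node, 1 for a 50/50 split of two classes."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


def leaf_summary(counts):
    """Label a leaf with its majority class and report its error rate."""
    n = sum(counts.values())
    label = max(counts, key=counts.get)
    return {"label": label, "error_rate": 1.0 - counts[label] / n, "size": n}


def overall_error_rate(leaves):
    """Average the leaves' error rates, weighted by the share of events per leaf."""
    summaries = [leaf_summary(counts) for counts in leaves]
    total = sum(s["size"] for s in summaries)
    return sum(s["size"] / total * s["error_rate"] for s in summaries)


def gini_reduction(parent, children):
    """Impurity reduction obtained by splitting a parent node into its children."""
    def node_gini(counts):
        n = sum(counts.values())
        return gini([c / n for c in counts.values()])

    n_parent = sum(parent.values())
    weighted = sum(sum(c.values()) / n_parent * node_gini(c) for c in children)
    return node_gini(parent) - weighted


# Hypothetical counts of driving events (not project data).
leaf_a = {"Crash": 80, "No-Crash": 20}
leaf_b = {"Crash": 30, "No-Crash": 270}
root = {"Crash": 110, "No-Crash": 290}

summary_a = leaf_summary(leaf_a)
overall = overall_error_rate([leaf_a, leaf_b])
reduction = gini_reduction(root, [leaf_a, leaf_b])
print(summary_a["label"], round(summary_a["error_rate"], 3))
print(round(overall, 3), round(reduction, 5))
```

A split search would evaluate `gini_reduction` for every candidate question at a node and keep the one with the largest value.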

In terms of logical operators, each path is a logical intersection of intelligent questions and their answers leading to one response class, whereas a response class derives from the logical union of all paths leading to the same label class. The final result of the classification tree is therefore the set of the most important predictors in terms of their ability to discriminate between “Crash” and “No-Crash” driving events.
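
The reading of paths as intersections (AND) and of a response class as a union (OR) of paths can be illustrated with a toy tree. Everything below is hypothetical: the tree, the variable names and the example event are made up for illustration and do not come from a fitted model.

```python
# A toy tree: each internal node asks whether a predictor equals a value.
tree = {
    "question": ("road_type", "intersection"),
    "yes": {
        "question": ("signal_violation", True),
        "yes": {"label": "Crash"},
        "no": {"label": "No-Crash"},
    },
    "no": {
        "question": ("weather", "rain"),
        "yes": {"label": "Crash"},
        "no": {"label": "No-Crash"},
    },
}


def extract_paths(node, conditions=()):
    """List every root-to-leaf path as (conditions, label)."""
    if "label" in node:
        return [(conditions, node["label"])]
    variable, value = node["question"]
    yes_paths = extract_paths(node["yes"], conditions + ((variable, value, True),))
    no_paths = extract_paths(node["no"], conditions + ((variable, value, False),))
    return yes_paths + no_paths


def path_holds(conditions, event):
    """Logical intersection: every question on the path gets the required answer."""
    return all((event[var] == val) == answer for var, val, answer in conditions)


def class_holds(label, paths, event):
    """Logical union: at least one path leading to this label is satisfied."""
    return any(path_holds(cond, event) for cond, lab in paths if lab == label)


paths = extract_paths(tree)
crash_rules = [cond for cond, lab in paths if lab == "Crash"]

event = {"road_type": "intersection", "signal_violation": True, "weather": "clear"}
print(class_holds("Crash", paths, event))
for rule in crash_rules:
    print("IF", " AND ".join(f"{v} {'==' if a else '!='} {x!r}" for v, x, a in rule), "THEN Crash")
```

Because the leaves partition the events, exactly one path holds for any given event, so the two class rules never fire together.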

Further analysis can be carried out with the ensemble method called Random Forests. This method combines many classification trees (called weak learners), each built on a resampled version of the data obtained by bootstrapping, that is, by drawing random samples with replacement from the original dataset. Through bootstrapping you are simply taking samples over and over again from the same group of data (your sample data) to estimate how accurate your estimates about the entire population (what really is out there in the real world) are. At each node of each tree, the candidate splitting variables are a randomly selected subset of the predictors. Unlike classification trees, random forests are accurate black-box machine learning models: there is no single tree structure to interpret, but the overall strategy improves accuracy when predicting the response class of a “new” driving event for which only the predictors’ measurements are known. Unbiased predictor importance estimates are computed by permuting the out-of-bag predictor measurements across the weak learners of the forest. The variable importance returned by the random forest can strengthen the analysis made with the classification tree if the two methods yield the same ranking of the explanatory variables.
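
The two sources of randomness in a random forest, bootstrap resampling of the events and random selection of candidate predictors at a node, can be sketched as follows. This is only the sampling step, not a full forest; the predictor names, the sample size and the square-root rule for the subset size (a commonly used choice for classification) are assumptions made for illustration.

```python
import math
import random


def bootstrap_indices(n, rng):
    """Draw n event indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_predictors(predictors, m, rng):
    """Pick m distinct predictors to be considered for the split at one node."""
    return rng.sample(predictors, m)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_events = 1000         # hypothetical number of driving events

in_bag, out_of_bag = bootstrap_indices(n_events, rng)

# Hypothetical predictor names, not the project's variable list.
predictors = ["weather", "traffic_density", "road_type",
              "lighting", "secondary_task", "manoeuvre"]
m = max(1, int(math.sqrt(len(predictors))))
candidates = candidate_predictors(predictors, m, rng)

print(f"out-of-bag share: {len(out_of_bag) / n_events:.2f}")
print("candidate predictors at this node:", candidates)
```

On average a little over a third of the events are left out of each bootstrap sample. These out-of-bag events are the ones whose predictor values are permuted to estimate importance: if shuffling a predictor makes the out-of-bag predictions noticeably worse, that predictor matters.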

It is important not to forget about Global Human Factors. These variables do not enter into the growing of the tree, but they can be used to describe its leaves, thus stratifying the different scenarios. They add ‘flavour’ to our analysis, giving context to each scenario. An in-depth analysis by Mosaic Plots with statistical testing allows one to visualise the most significant associations between the response variable and the Global Human Factors within each leaf of the tree. As an example, considering the factor “Age”, the leaves labelled “Crash” might include a higher proportion of young drivers than the full set of driving events in the root node. A leaf of a tree tells all about a segment of the data!
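
As a rough illustration of the testing behind a shaded mosaic plot, the sketch below computes Pearson residuals for a cross-tabulation of age group by event severity within a single leaf. Cells with large positive or negative residuals are the ones that would stand out in the plot. The counts and names are hypothetical, and the 5% critical value for one degree of freedom applies to the 2×2 case only.

```python
import math


def pearson_residuals(table):
    """Return the Pearson residuals (observed - expected) / sqrt(expected)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            residual_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(residual_row)
    return residuals


# Hypothetical leaf with 100 driving events (not project data):
# rows = age group (young, other), columns = ("Crash", "No-Crash").
leaf_table = [[45, 5], [35, 15]]

residuals = pearson_residuals(leaf_table)
# The chi-square statistic is the sum of the squared residuals.
chi_square = sum(r * r for row in residuals for r in row)
CRITICAL_5_PERCENT_1_DF = 3.841  # chi-square critical value, 1 degree of freedom

print([[round(r, 2) for r in row] for row in residuals])
print(round(chi_square, 2), chi_square > CRITICAL_5_PERCENT_1_DF)
```

In this made-up leaf, young drivers are over-represented among crashes and under-represented among non-crashes, which is the kind of pattern the mosaic plot is meant to reveal.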

Therefore, the strength of the data mining and ML techniques presented above lies, for the i4Driving Project, in their use and development to help detect patterns of driving behaviour, explain the cognitive and perceptual processes of human driving, predict the causes that lead to safety-critical driving behaviours, and discover both the causes exogenous to human behaviour that are related to road accidents and the driving behaviours that are dangerous for safety. In a nutshell, we’re turning data into insights – predicting what influences driving behaviour and whether a crash is likely to occur – thus decoding the secrets of the road for safer journeys.


Author information:

Roberta Siciliano, Full Professor of Statistics, Department of Electric Engineering and Technologies of Information, University of Naples Federico II (DIETI-UNINA)

Giulia Vannucci, Researcher of Statistics, Institute for Cognitive Sciences and Technologies (CNR-ISTC)
