[HAL] BigDataFr recommends: Data Exploration with SQL using Machine Learning Techniques

Abstract

[…] Nowadays data scientists have access to gigantic data, many of them being accessible through SQL. Despite the inherent simplicity of SQL, writing relevant and efficient SQL queries is known to be difficult, especially for databases having a large number of attributes or meaningless attribute names. In this paper, we propose a "rewriting" technique to help data scientists formulate SQL queries, to rapidly and intuitively explore their big data, while keeping user input at a minimum, with no manual tuple specification or labeling. For a user-specified query, we define a negation query, which produces tuples that are not wanted in the initial query’s answer.

Since there is an exponential number of such negation queries, we describe a pseudo-polynomial heuristic to pick the negation closest in size to the initial query, and construct a balanced learning set whose positive examples correspond to the results desired by analysts, and negative examples to those they do not want. The initial query is reformulated using machine learning techniques and a new query, more efficient and diverse, is obtained. We have implemented a prototype and conducted experiments on real-life datasets and synthetic query workloads to assess the scalability and precision of our proposition. A preliminary qualitative experiment conducted with astrophysicists is also described. […]
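To make the idea in the abstract concrete, here is a minimal Python sketch using the standard sqlite3 module. Everything in it is an assumption made for illustration and is not taken from the paper: the toy table stars, its columns, the two predicates, and the reading of a "negation query" as the initial conjunctive query with at least one predicate negated. The negation closest in size is found by brute-force enumeration rather than by the authors' pseudo-polynomial heuristic, and a one-threshold decision stump on an attribute outside the initial query stands in for the "machine learning techniques", since the abstract does not say which learner is used.

```python
import sqlite3
from itertools import product

# Hypothetical toy data; table, columns and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stars (id INTEGER PRIMARY KEY, mag REAL, dist REAL, temp REAL)")
conn.executemany("INSERT INTO stars VALUES (?, ?, ?, ?)", [
    (1, 4.0, 10.0, 5800.0), (2, 5.5, 12.0, 6100.0), (3, 9.0, 15.0, 5900.0),
    (4, 4.5, 80.0, 6000.0), (5, 8.0, 90.0, 3500.0), (6, 5.0, 20.0, 5200.0),
    (7, 10.0, 70.0, 3000.0), (8, 3.0, 11.0, 7000.0), (9, 7.0, 60.0, 3300.0),
    (10, 11.0, 100.0, 2900.0),
])

# The user's initial query is the conjunction of these predicates.
PREDICATES = ["mag < 6", "dist < 50"]


def fetch(where, cols="id"):
    """Run a selection on the toy table and return all rows, ordered by id."""
    return conn.execute(f"SELECT {cols} FROM stars WHERE {where} ORDER BY id").fetchall()


def negation_queries(predicates):
    """Yield every WHERE clause in which at least one predicate is negated.

    There are 2**n - 1 of them for n predicates, hence the need for a heuristic
    on real queries; brute force is only acceptable on a toy example.
    """
    for flags in product([False, True], repeat=len(predicates)):
        if any(flags):
            yield " AND ".join(
                f"NOT ({p})" if negated else f"({p})"
                for p, negated in zip(predicates, flags)
            )


def learn_stump(pos_vals, neg_vals):
    """Find the single threshold test that best separates positives from negatives."""
    values = sorted(set(pos_vals) | set(neg_vals))
    best = (-1, None, None)  # (correctly classified, operator, threshold)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        for op in ("<", ">="):
            keep = (lambda v: v < t) if op == "<" else (lambda v: v >= t)
            correct = sum(keep(v) for v in pos_vals) + sum(not keep(v) for v in neg_vals)
            if correct > best[0]:
                best = (correct, op, t)
    return best


# Positive examples: the answer of the initial query.
initial = " AND ".join(f"({p})" for p in PREDICATES)
positives = fetch(initial, "id, temp")

# Negative examples: the negation query whose answer size is closest to the positives.
negation = min(negation_queries(PREDICATES),
               key=lambda w: abs(len(fetch(w)) - len(positives)))
negatives = fetch(negation, "id, temp")

# Learn a test on temp (an attribute the initial query does not use) and
# turn it back into a SQL selection.
_, op, threshold = learn_stump([r[1] for r in positives], [r[1] for r in negatives])
new_where = f"temp {op} {threshold}"
rewritten_ids = [r[0] for r in fetch(new_where)]

print("initial  :", initial, "->", [r[0] for r in positives])
print("negation :", negation, "->", [r[0] for r in negatives])
print("rewritten:", new_where, "->", rewritten_ids)
```

On this toy data the rewritten query returns every tuple of the initial answer plus two tuples the initial query missed, which is one way to read the "more diverse" result the abstract describes. A real implementation would need a proper learner and a way to translate its model back into SQL.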

Read paper
By Julien Cumin (1), Jean-Marc Petit (2, 3), Vasile-Marian Scuturici (3, 2), Sabina Surdu (4)
Source: hal-archives-ouvertes.fr

(1) Orange Labs [Grenoble]
(2) BD – Base de Données, LIRIS – Laboratoire d’InfoRmatique en Image et Systèmes d’information
(3) INSA Lyon – Institut National des Sciences Appliquées Lyon
(4) Universitatea Babeş-Bolyai [Cluj-Napoca]
