Updated: Jun 27, 2019
Most Machine Learning algorithms require data to be in a single text file in tabular format, with each row representing a full instance of the input dataset and each column one of its features. For example, imagine data in normal form separated into a table for users, another for movies, and another for ratings. You can get it into a machine-learning-ready format in this way (i.e., joining by userid and movieid and removing ids and names):
Denormalizing (or “normalizing” data for Machine Learning) is a more or less complex task depending on where the data is stored and where it is obtained from. Often the data you own or have access to is not available in a single file. It may be distributed across different sources like multiple CSV files, spreadsheets or plain text files, or normalized in database tables. So you need a tool to collect, intersect, filter, transform when necessary, and finally export to a single flat, text CSV file.
If your data is small and the changes are simple, such as adding a derived field or making a few replacements, you can use a spreadsheet, make the necessary changes, and then export it to a CSV file. But when the changes are more complex (e.g., joining several sources, filtering a subset of the data, or handling a large number of rows), you might need a more powerful tool like an RDBMS. MySQL is a great one, and it’s free. If the data you are dealing with is in the terabytes, then (and only then) you should consider Hadoop.
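As a minimal sketch of that join, the snippet below uses Python's built-in sqlite3 module so it runs without a database server. The table layouts, column names, and sample rows are assumptions made up for illustration only.

```python
import sqlite3

# In-memory database with three hypothetical normalized tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (userid INTEGER, name TEXT, gender TEXT, age INTEGER);
    CREATE TABLE movies  (movieid INTEGER, title TEXT, genre TEXT);
    CREATE TABLE ratings (userid INTEGER, movieid INTEGER, rating INTEGER);
    INSERT INTO users   VALUES (1, 'Ann', 'F', 34), (2, 'Bob', 'M', 27);
    INSERT INTO movies  VALUES (10, 'Alien', 'Sci-Fi'), (11, 'Heat', 'Crime');
    INSERT INTO ratings VALUES (1, 10, 5), (1, 11, 3), (2, 10, 4);
""")

# Join by userid and movieid, keeping only descriptive features:
# ids and names are left out of the SELECT list.
flat_rows = conn.execute("""
    SELECT u.gender, u.age, m.genre, r.rating
    FROM ratings r
    JOIN users  u ON u.userid  = r.userid
    JOIN movies m ON m.movieid = r.movieid
    ORDER BY r.userid, r.movieid
""").fetchall()

for row in flat_rows:
    print(row)
```

Each resulting row is one instance (one rating) and each column is one feature, which is the flat shape described above.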
Business inspections in San Francisco
Let’s look at a real example. San Francisco’s Department of Public Health recently published a dataset about restaurants in San Francisco, inspections conducted, violations observed, and a score calculated by a health inspector based on the violations observed.
You can download the data directly from the San Francisco open data website. Recently some statistics using this data were reported in this post; they may be difficult to stomach. Imagine, however, that you want to use the data to predict what violations certain kinds of restaurants commit or, if you’re a restaurant owner, to predict whether you are going to be inspected. The data comes “normalized” in four separate files:
businesses.csv: a list of restaurants or businesses in the city.
inspections.csv: inspections of some of the previous businesses.
violations.csv: law violations detected in some of the previous inspections.
ScoreLegend.csv: a legend to describe score ranges.
You will first need to prepare it to be used as input to a Machine Learning service such as BigML.
Analyzing the data
Let’s first have a quick look at the main entities in the data of each file and their relationships. The four files are in CSV format with the following fields:
There are three main entities: businesses, inspections and violations. The relationships among entities are: a 0..N relationship between businesses and inspections and a 0..N relationship between inspections and violations. There’s also a file with a description for each range in the score.
To build a machine-learning-ready CSV file containing instances about businesses, their inspections and their respective violations, we’ll follow three basic steps: 1) importing data into MySQL, 2) transforming data using MySQL, and 3) joining and exporting data to a CSV file.
Importing data into MySQL
First, we’ll need to create a new SQL table with the corresponding fields to import the data for each entity above. Instead of defining a type for each field that we import (dates, strings, integers, floats, etc.), we simplify the process by using varchar fields. In this way, we just need to be concerned with the number of fields and their length for each entity. We also created a new table to import the legend for each score range.
So now we are ready to import the raw data into each of the new tables. We use the LOAD DATA INFILE command to define the format of the source file, the separator, whether a header is present, and the table in which it will be loaded.
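The create-and-load step can be sketched as follows. This is not the exact statement used for the dataset: it uses Python's built-in sqlite3 and csv modules as a stand-in for MySQL so that it runs anywhere, and the three columns and two sample rows are assumptions rather than the real layout of businesses.csv. The comment shows what a roughly equivalent MySQL statement might look like.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for businesses.csv (made-up columns and rows).
raw_csv = io.StringIO(
    "business_id,name,postal_code\n"
    '10,"Cafe One",94104\n'
    '24,"Hotel Pantry, 2nd Floor",94104\n'
)

conn = sqlite3.connect(":memory:")
# Every field is a varchar, so only the number of fields and their length matter.
conn.execute(
    "CREATE TABLE businesses ("
    "business_id VARCHAR(10), name VARCHAR(100), postal_code VARCHAR(10))"
)

# Roughly equivalent MySQL statement (file name and options are assumptions):
#   LOAD DATA INFILE 'businesses.csv' INTO TABLE businesses
#   FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
#   IGNORE 1 LINES;
reader = csv.reader(raw_csv)
header = next(reader)  # skip the header line
conn.executemany("INSERT INTO businesses VALUES (?, ?, ?)", reader)
conn.commit()

loaded = conn.execute(
    "SELECT business_id, name FROM businesses ORDER BY rowid"
).fetchall()
print(loaded)
```

Note that the quoted name containing a comma is loaded as a single field, which is why the enclosing character matters when describing the source format.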
If the dataset is big (for example, several thousand rows or more), then it’s important to create indexes as follows:
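A minimal sketch of the indexing step, again with sqlite3 standing in for MySQL. The index names and the choice of indexed columns (the fields we later join on) are assumptions for illustration; the CREATE INDEX statements themselves use syntax that MySQL also accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table layouts; the real files have their own fields.
    CREATE TABLE inspections (business_id VARCHAR(10), date VARCHAR(8), score VARCHAR(10));
    CREATE TABLE violations  (business_id VARCHAR(10), date VARCHAR(8), description VARCHAR(200));

    -- Index the columns used as join keys so joins do not scan whole tables.
    CREATE INDEX inx_inspections_business ON inspections (business_id, date);
    CREATE INDEX inx_violations_business  ON violations  (business_id, date);
""")

# List the indexes that now exist in the database.
index_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)
print(index_names)
```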
Transforming data using MySQL
Often, raw data needs to be transformed. For example, numeric codes need to be converted into descriptive labels, different fields need to be joined, or some fields might need a different format. Also, very often missing values, badly formatted data, or incorrectly entered data need to be fixed, and at other times you might need to create and fill new derived fields.
In our example, we’re going to remove the “[ date violation corrected: …]” substring from the violation’s description field:
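One way to sketch this clean-up is to cut each description at the point where the bracketed note starts. The snippet uses sqlite3 with its substr and instr functions as a stand-in; in MySQL a function such as SUBSTRING_INDEX could play the same role. The marker text, table layout, and sample rows are assumptions.

```python
import sqlite3

# Assumed wording of the note that opens the substring to remove.
MARKER = "[ date violation corrected:"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE violations (business_id VARCHAR(10), description VARCHAR(200))")
conn.executemany(
    "INSERT INTO violations VALUES (?, ?)",
    [
        ("10", "Unclean hands [ date violation corrected: 1/1/2013 ]"),
        ("24", "No thermometer"),
    ],
)

# Keep only the text before the marker and drop the trailing space;
# rows without the marker are left untouched by the WHERE clause.
conn.execute(
    """
    UPDATE violations
    SET description = rtrim(substr(description, 1, instr(description, :marker) - 1))
    WHERE instr(description, :marker) > 0
    """,
    {"marker": MARKER},
)

descriptions = [
    row[0]
    for row in conn.execute("SELECT description FROM violations ORDER BY business_id")
]
print(descriptions)
```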
We are also going to fix some misplaced data:
Finally, we are going to add a new derived field “inspection” and fill it with Yes/No values:
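A hedged sketch of adding such a derived field: the new column is set to Yes when a business has at least one inspection and to No otherwise. The column name, table layouts, and sample rows are assumptions, and sqlite3 is used only so the example runs without a MySQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables and rows for illustration.
    CREATE TABLE businesses  (business_id VARCHAR(10), name VARCHAR(100));
    CREATE TABLE inspections (business_id VARCHAR(10), score VARCHAR(10));
    INSERT INTO businesses  VALUES ('10', 'Cafe One'), ('31', 'Noodle Bar');
    INSERT INTO inspections VALUES ('10', '92');

    -- Add the derived field, then fill it from the inspections table.
    ALTER TABLE businesses ADD COLUMN inspection VARCHAR(3);
    UPDATE businesses
    SET inspection = CASE
        WHEN EXISTS (SELECT 1 FROM inspections i
                     WHERE i.business_id = businesses.business_id)
        THEN 'Yes' ELSE 'No' END;
""")

flags = conn.execute(
    "SELECT name, inspection FROM businesses ORDER BY business_id"
).fetchall()
print(flags)
```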
MySQL has plenty of functions to deal with row and field transformations. We do love working at the command line, but the versatility of MySQL certainly brings many other advantages.
Joining and Exporting MySQL tables to a CSV file
Once the data has been sanitized and transformed according to our needs, we are ready to generate a CSV file. Before creating it with the denormalized data, we need to make sure that we join the different tables in the right way. Looking at the restaurant data, we can see that some businesses have inspections and others do not. For those with inspections, not all of them have violations. Therefore, the query that we should use is a “left join” to collect all businesses, inspections and violations. Extra transformations, like reformatting or concatenating fields, can be done in this step too. We also need to make sure that we export the data with a descriptive header. The next query will do the export trick:
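The shape of such a query can be sketched as below. It is an illustration under assumptions, not the query used for the dataset: the table layouts, join keys, header labels, and sample rows are made up, and sqlite3 plus the csv module stand in for MySQL's own export (SELECT ... INTO OUTFILE) so the example runs anywhere.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE businesses  (business_id VARCHAR(10), name VARCHAR(100));
    CREATE TABLE inspections (business_id VARCHAR(10), date VARCHAR(8), score VARCHAR(10));
    CREATE TABLE violations  (business_id VARCHAR(10), date VARCHAR(8), description VARCHAR(200));
    INSERT INTO businesses  VALUES ('10', 'Cafe One'), ('24', 'Hotel Pantry'), ('31', 'Noodle Bar');
    INSERT INTO inspections VALUES ('10', '20130105', '92'), ('24', '20130210', '100');
    INSERT INTO violations  VALUES ('10', '20130105', 'Unclean hands');
""")

# LEFT JOIN keeps businesses without inspections and inspections without violations.
rows = conn.execute("""
    SELECT b.name, i.score, v.description
    FROM businesses b
    LEFT JOIN inspections i ON i.business_id = b.business_id
    LEFT JOIN violations  v ON v.business_id = i.business_id AND v.date = i.date
    ORDER BY b.business_id
""").fetchall()

# Write a descriptive header followed by one row per instance.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Business name", "Score", "Violation"])
writer.writerows(rows)  # missing values (None) are written as empty fields
csv_text = buffer.getvalue()
print(csv_text)
```

With an inner join, the last two businesses would lose information or vanish from the output entirely; the left join keeps them, with empty fields where no inspection or violation exists.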
A file named sf_restaurants.csv will be generated with a row per instance in this format:
You can download the raw CSV or clone the dataset in BigML here.
Once the data has been exported, you might want to move the file from the MySQL default export folder (typically in the database folder), replace the \N markers that MySQL writes for NULL values with empty strings, and compress the file if it’s too large.
Finally, if you want to get a first quick predictive model, you can upload it to BigML with BigMLer as follows:
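The clean-up and compression can be sketched in a few lines of Python. The sample content is made up; in practice the text would be read from the exported file. Replacing only whole fields equal to \N avoids touching a description that happens to contain those two characters.

```python
import csv
import gzip
import io

# Hypothetical export content: \N marks missing values.
exported = (
    "name,score,description\n"
    "Cafe One,92,\\N\n"
    "Hotel Pantry,\\N,\\N\n"
)


def blank_nulls(text):
    """Return the CSV text with every field equal to \\N replaced by an empty string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow(["" if field == r"\N" else field for field in row])
    return out.getvalue()


cleaned = blank_nulls(exported)
# Compress the cleaned text; writing these bytes to disk gives a .gz file.
compressed = gzip.compress(cleaned.encode("utf-8"))
print(cleaned)
```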
A use case like this can be yours.