Hi Guys!
As a data scientist, you have to turn raw data into meaningful insights. Most of the world's raw data exist as organized tables known as relational databases. A relational database is a type of database that stores and provides access to data points that are related to one another. To be an efficient data scientist, you need to know how to wrangle and extract data from these databases using a language known as Structured Query Language(SQL).
To be able to extract data from a database, you need to know how to query the database. I don't mean the queries bosses give their employees, though it may be somehow similar. You can ask for specific data from the database but you are actually requesting for it, not commanding it like a boss even if you are the boss. So a query can be simply defined as a request for data from a database. The SQL syntax is shared by many databases such as Oracle, PostgreSQL, MySQL etc.
This article is the beginning of a series for Introduction to SQL. In this article, we will see the basic way of writing queries mainly how to select columns and filter rows from tables based on criteria of interest.
PRE-REQUISTES: There are little to no prerequisites though basic knowledge of how to navigate in the application will be helpful. The application used in this article is DBEAVER, you can use it to explore different databases and I made use of MySQL here. The syntax used in MySQL is similar to other databases with just a little difference in some. You can download Dbeaver from here : dbeaver.io/download and you can get the documentation so you can know how to use it from here : dbeaver.com/docs/wiki. The dataset used in this article is the IMDB movies dataset and I made use of only the title, language, genre and year columns. The dataset can be gotten from here : kaggle.com/stefanoleone992/imdb-extensive-d..
Let us now dive into the business of the day.
SELECTING COLUMNS: To select columns from tables, the two important keywords are the SELECT and FROM keywords. Example:
SELECT title FROM IMDb_movies_csv;
This query will produce;
You select column names from a table. You have to use the specific column name according to how it is written exactly and minding both the capital and small letters. You also have to mind that when writing the table name. If you don't write it accordingly, you will get a syntax error. The semi-colon is used to mark the ending of your query.
You can also select multiple columns at once but you have to write it in the order you want it to be selected, that is according to the column you want to be selected first and so forth.
SELECT title, language
FROM IMDb_movies_csv;
This will give the following result;
Often times, there may be duplicates in your results. You can use the DISTINCT keyword to select unique values.
SELECT DISTINCT(language)
FROM IMDb_movies_csv;
Notice the difference between the former result and this one;
You will notice that in the table above, the languages are only shown once but in the former one where we selected the title and language column, some languages were repeated twice.
FILTERING ROWS: You can filter tables for rows satisfying criteria of interest. The basic keywords used for filtering are WHERE, WHERE AND, WHERE AND OR, BETWEEN, WHERE IN. Let's assume we want to select only English movies, we can do this.
SELECT title, language
FROM IMDb_movies_csv
WHERE language = 'English';
The above query will produce the following result; Let us assume you want to select English crime movies. By the way, crime movies rock. This can be done using the following query;
SELECT title, language, genre
FROM IMDb_movies_csv
WHERE language = 'English'
AND genre = 'Crime';
See what this gives us; Let us write a query to select crime movies that are either English or French.
SELECT title, language, genre
FROM IMDb_movies_csv
WHERE (language = 'English' OR language = 'French')
AND genre = 'Crime';
The result of the query is shown below;
Let us write a query that will select english crime movies between the year 1900 and 2000. You could do this using two different formats.
SELECT title, genre, year
FROM IMDb_movies_csv
WHERE year >= '1900' AND year <= '2000'
AND language = 'English';
Second format;
SELECT title, genre, year
FROM IMDb_movies_csv
WHERE year BETWEEN '1900' AND '2000'
AND language = 'English';
The second format looks more understandable at first sight when compared to the first one and yet they perform the same function. The BETWEEN keyword is case inclusive, that is, it includes the beginning and end values in the results.
Last but not the least, let us see how we can use the WHERE IN keyword. Let us select English movies within the year 1900 or 2000.
SELECT title, genre, year
FROM IMDb_movies_csv
WHERE year in (1900,2000)
AND language = 'English';
This query will display this;
You can also use the normal OR keyword to achieve the result above too.
In Summary, We have discussed how to select columns, filter rows using different keywords and this is the main basis of extracting data from a database. SQL can do much more than this, you can perform aggregate functions, count, sum, do so many things with SQL. You can also select your values such that null values will not be included. You can create new tables, drop tables and so on. We are going to discuss how to do most of these things in the remaining parts of this series. Watch out for them🤗. This is a great way of relating and understanding data in tables and relational databases.
REFERENCE: learn.datacamp.com/courses/introduction-to-..
If you want to lean and practice more on SQL, you can take the SQL fundamentals skill track on Datacamp and also check out the Introduction to Sql course on Udacity.
If you have any questions or comments, you can write it in the comment section. I hope you enjoy the read.