SQL for Data Science – Part 2: Advanced Queries & Real-World Analysis

In Part 1, we explored the foundational SQL command types—DDL, DML, DQL, DCL, and TCL. Now, in Part 2, it’s time to go beyond the basics and dive into how SQL is actually used in real-world data science workflows.

This part focuses on writing smarter queries, combining data from multiple sources, and extracting meaningful insights from raw datasets.

1. Advanced SELECT Queries

The SELECT statement is much more powerful than just retrieving columns. With additional clauses, you can filter, sort, and refine your data to answer specific questions.

Key Clauses:

WHERE → Filter records
AND, OR, NOT → Combine conditions
BETWEEN → Range filtering
IN → Match multiple values
LIKE → Pattern matching
ORDER BY → Sort results
LIMIT → Restrict output size

Example:

SELECT name, salary
FROM employees
WHERE salary BETWEEN 30000 AND 70000
AND department = 'IT'
ORDER BY salary DESC
LIMIT 5;

👉 This query helps identify top-paid employees in a specific department within a salary range.

2. Joins – Combining Multiple Tables ⭐

In real-world databases, data is distributed across multiple tables. Joins allow you to combine them to get a complete picture.

Types of Joins:

INNER JOIN → Returns matching records from both tables
LEFT JOIN → Returns all records from left table + matches from right
RIGHT JOIN → Opposite of LEFT JOIN
FULL JOIN → Returns all records from both tables

Example:

SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.id;

👉 This query connects employee data with their department names, even if some employees are not assigned to any department.

💡 Joins are essential in data science because datasets are rarely stored in a single table.

3. Aggregation & GROUP BY

Data scientists often need summarized insights rather than raw data. SQL provides aggregation functions to make this easy.

Common Functions:

COUNT() → Number of records
SUM() → Total value
AVG() → Average value
MIN() / MAX() → Minimum/Maximum

GROUP BY:

Used to group rows with similar values.

HAVING:

Filters grouped data (like WHERE but for groups).

Example:

SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;

👉 This helps identify departments with higher-than-average salaries.

4. Subqueries (Nested Queries)

Subqueries are queries inside another query. They allow you to perform more complex operations.

Example:

SELECT name
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

👉 This finds employees earning more than the company average.

Types:

Single-row subqueries
Multi-row subqueries
Correlated subqueries (advanced)

5. Views – Reusable Queries

A view is a virtual table created from a query. It helps simplify complex queries and improves reusability.

Example:

CREATE VIEW high_salary_employees AS
SELECT name, salary
FROM employees
WHERE salary > 50000;

👉 Instead of writing the same query repeatedly, you can simply use:

SELECT * FROM high_salary_employees;

6. Indexes – Improving Performance

As datasets grow, queries can become slow. Indexes help speed up data retrieval.

Example:

CREATE INDEX idx_salary
ON employees(salary);

👉 Indexes work like a book’s index—they allow the database to find data faster without scanning the entire table.

⚠️ Note: Too many indexes can slow down inserts/updates, so use them wisely.

7. Real-World Data Science Use Case 🧠

Let’s apply SQL to a real scenario: E-commerce Data Analysis

Goal:

Find top customers based on total spending.

SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC;

Insights you can derive:

Who are your most valuable customers
Customer purchasing patterns
Revenue distribution

You can extend this further:

Monthly sales trends
Product performance
Customer retention analysis

8. Best Practices for Writing SQL

To become efficient in SQL, follow these best practices:

Avoid SELECT * in production
Use meaningful aliases (e, d, etc.)
Write clean and readable queries
Break complex queries using CTEs or views
Optimize performance using indexes
Test queries on smaller datasets first

🔚 Final Thoughts

In this part, we moved from basic SQL commands to practical, real-world querying techniques. These concepts—especially joins, aggregations, and subqueries—are heavily used in data science and analytics roles.

Mastering them will allow you to:

Work with large datasets efficiently
Extract meaningful insights
Prepare data for machine learning models

What’s Next?

In Part 3, you can explore:

Window functions (ROW_NUMBER, RANK)
Common Table Expressions (CTEs)
SQL for data cleaning
Integration with Python (Pandas + SQL)

Pro Tip: Practice these concepts on platforms like LeetCode, HackerRank, or real datasets from Kaggle to build confidence.

The Data Science Nerds

Search This Blog