Skip to main content

SQL for Data Science – Part 2: Advanced Queries & Real-World Analysis

 In Part 1, we explored the foundational SQL command types—DDL, DML, DQL, DCL, and TCL. Now, in Part 2, it’s time to go beyond the basics and dive into how SQL is actually used in real-world data science workflows.

This part focuses on writing smarter queries, combining data from multiple sources, and extracting meaningful insights from raw datasets.


1. Advanced SELECT Queries

The SELECT statement is much more powerful than just retrieving columns. With additional clauses, you can filter, sort, and refine your data to answer specific questions.

Key Clauses:

  • WHERE → Filter records

  • AND, OR, NOT → Combine conditions

  • BETWEEN → Range filtering

  • IN → Match multiple values

  • LIKE → Pattern matching

  • ORDER BY → Sort results

  • LIMIT → Restrict output size

Example:

SELECT name, salary
FROM employees
WHERE salary BETWEEN 30000 AND 70000
AND department = 'IT'
ORDER BY salary DESC
LIMIT 5;

πŸ‘‰ This query helps identify top-paid employees in a specific department within a salary range.


2. Joins – Combining Multiple Tables ⭐

In real-world databases, data is distributed across multiple tables. Joins allow you to combine them to get a complete picture.

Types of Joins:

  • INNER JOIN → Returns matching records from both tables

  • LEFT JOIN → Returns all records from left table + matches from right

  • RIGHT JOIN → Opposite of LEFT JOIN

  • FULL JOIN → Returns all records from both tables

Example:

SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.id;

πŸ‘‰ This query connects employee data with their department names, even if some employees are not assigned to any department.

πŸ’‘ Joins are essential in data science because datasets are rarely stored in a single table.


3. Aggregation & GROUP BY

Data scientists often need summarized insights rather than raw data. SQL provides aggregation functions to make this easy.

Common Functions:

  • COUNT() → Number of records

  • SUM() → Total value

  • AVG() → Average value

  • MIN() / MAX() → Minimum/Maximum

GROUP BY:

Used to group rows with similar values.

HAVING:

Filters grouped data (like WHERE but for groups).

Example:

SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;

πŸ‘‰ This helps identify departments with higher-than-average salaries.


4. Subqueries (Nested Queries)

Subqueries are queries inside another query. They allow you to perform more complex operations.

Example:

SELECT name
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

πŸ‘‰ This finds employees earning more than the company average.

Types:

  • Single-row subqueries

  • Multi-row subqueries

  • Correlated subqueries (advanced)


5. Views – Reusable Queries

A view is a virtual table created from a query. It helps simplify complex queries and improves reusability.

Example:

CREATE VIEW high_salary_employees AS
SELECT name, salary
FROM employees
WHERE salary > 50000;

πŸ‘‰ Instead of writing the same query repeatedly, you can simply use:

SELECT * FROM high_salary_employees;

6. Indexes – Improving Performance

As datasets grow, queries can become slow. Indexes help speed up data retrieval.

Example:

CREATE INDEX idx_salary
ON employees(salary);

πŸ‘‰ Indexes work like a book’s index—they allow the database to find data faster without scanning the entire table.

⚠️ Note: Too many indexes can slow down inserts/updates, so use them wisely.


7. Real-World Data Science Use Case 🧠

Let’s apply SQL to a real scenario: E-commerce Data Analysis

Goal:

Find top customers based on total spending.

SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC;

Insights you can derive:

  • Who are your most valuable customers

  • Customer purchasing patterns

  • Revenue distribution

You can extend this further:

  • Monthly sales trends

  • Product performance

  • Customer retention analysis


  8. Best Practices for Writing SQL

To become efficient in SQL, follow these best practices:

  • Avoid SELECT * in production

  • Use meaningful aliases (e, d, etc.)

  • Write clean and readable queries

  • Break complex queries using CTEs or views

  • Optimize performance using indexes

  • Test queries on smaller datasets first


πŸ”š Final Thoughts

In this part, we moved from basic SQL commands to practical, real-world querying techniques. These concepts—especially joins, aggregations, and subqueries—are heavily used in data science and analytics roles.

Mastering them will allow you to:

  • Work with large datasets efficiently

  • Extract meaningful insights

  • Prepare data for machine learning models


What’s Next?

In Part 3, you can explore:

  • Window functions (ROW_NUMBER, RANK)

  • Common Table Expressions (CTEs)

  • SQL for data cleaning

  • Integration with Python (Pandas + SQL)


Pro Tip: Practice these concepts on platforms like LeetCode, HackerRank, or real datasets from Kaggle to build confidence.


Comments

Popular posts from this blog

What is a Large Language Model?

  What is a Large Language Model? Explained Simply A beginner-friendly guide to understanding the AI technology behind ChatGPT, Claude, and Gemini Introduction: The AI Everyone Is Talking About You have probably heard terms like ChatGPT, Claude, or Gemini being thrown around everywhere in the news, at work, on social media. These are all powered by something called a Large Language Model, or LLM for short. But what exactly is an LLM? How does it work? And why does it seem almost magical at understanding and generating human language? In this blog post, we will break it all down in plain English no PhD required. By the end, you will have a solid understanding of what LLMs are, how they learn, and why they matter.   1. What Is a Language Model? Before we get to "Large," let us start with the basics: what is a language model? A language model is a type of AI that has been trained to understand and generate text. At its core, it learns to predict:...

Machine Learning Project Life Cycle: A Complete End-to-End Guide

  Machine Learning Project Life Cycle: A Complete End-to-End Guide Machine Learning (ML) projects are more than just training algorithms on data. A successful ML solution requires structured planning, quality data, robust engineering, continuous monitoring, and iterative improvements. The Machine Learning Project Life Cycle defines a systematic approach for building scalable, reliable, and production-ready ML systems. This blog explains each stage of the ML project life cycle in detail, including Statement of Work (SOW), data collection, exploratory data analysis (EDA), feature engineering, model selection, training, fine-tuning, deployment monitoring, and feedback loops. 1. Understanding the ML Project Life Cycle Definition The ML Project Life Cycle is a structured framework that guides the development of machine learning systems from problem identification to deployment and continuous improvement. It ensures that every phase of the project is organized, measurable, and aligned wi...

What is Data Science?

The Multidisciplinary Power of Data Science (It's Not Just a Buzzword) If you've spent any time in the tech world lately, you've heard the term Data Science . Some critics dismiss it as a superfluous label — a buzzword meant to salt resumes and catch the eye of tech recruiters. But if we peel back the hype, what is it actually? Data science, despite its hype-laden veneer, is perhaps the best label we have for a cross-disciplinary set of skills that are becoming increasingly important in both industry and academia. It isn't just a single subject you learn in a vacuum; it is a toolkit — a set of skills that allows you to turn raw, messy data into actionable insights. But to truly appreciate what data science is , we first need to understand where it came from. A Brief History: How Data Science Was Born Data science didn't appear overnight. Its roots stretch back decades. In the 1960s and 70s, statisticians were already wrestling with large datasets, ...